# Extracting Stock Data Using Web Scraping

Not all data is available via an API, so we will practice web scraping to obtain some financial data.
To do this, we will use BeautifulSoup.

In [23]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import html5lib

We will extract historical stock data for Netflix.
Each record needs to include:
- Date
- Open
- High
- Low
- Close
- Volume

These are the steps we will follow:
1. Send an HTTP request to the webpage using the requests library.
2. Parse the HTML content of the webpage using BeautifulSoup.
3. Identify the HTML tags that contain the data we want to extract.
4. Use BeautifulSoup methods to extract the data from the HTML tags.
5. Print the extracted data.


### 1. Let's send a request to the webpage

In [24]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/netflix_data_webpage.html"

In [25]:
data = requests.get(url).text
# will print out entire HTML DOM
# print(data) 
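
Note that `requests.get(url).text` returns the body of an error page just as readily as the page we want. As a minimal sketch (the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original lab), the request could be wrapped so that failures surface immediately:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising if the request fails.

    The function name and default timeout are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, `data = fetch_html(url)` would do the same job as the `requests.get(url).text` line above.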

### 2. Now let's parse our data

In [26]:
soup = BeautifulSoup(data,'html5lib')

### 3. Time to identify our HTML tags

In [27]:
# creating our DataFrame
netflix_data = pd.DataFrame(columns=['Date','Open','High','Low','Close','Volume'])

Since we are aiming for a specific table in the webpage, we will target the following tags:
- <code>table</code>: starts and ends our table
- <code>tr</code>: defines each row
- <code>td</code>: defines a table cell
- <code>th</code>: defines a header cell
- <code>tbody</code>: defines the main content of the table, containing one or more rows
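
To see how these tags fit together, here is a small sketch that parses a made-up two-row table. The `sample_html` string is an assumption for illustration only, not the real Netflix page, and it uses Python's built-in `html.parser` so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# A made-up two-row table, used only to illustrate the tags
sample_html = """
<table>
  <thead>
    <tr><th>Date</th><th>Open</th></tr>
  </thead>
  <tbody>
    <tr><td>Jun 01, 2021</td><td>504.01</td></tr>
    <tr><td>May 01, 2021</td><td>512.65</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# th: header cells
headers = [th.text for th in sample_soup.find_all("th")]
# tbody -> tr: the data rows
body_rows = sample_soup.find("tbody").find_all("tr")
# td: the cells of a single row
first_row = [td.text for td in body_rows[0].find_all("td")]

print(headers)
print(len(body_rows))
print(first_row)
```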

### 4. Use BeautifulSoup to extract data

In [39]:
# First we isolate the body, which contains the table rows
for row in soup.find("tbody").find_all('tr'):
    # Then we loop through each row and find all the column values
    col = row.find_all("td")
    date = col[0].text
    open_price = col[1].text
    high = col[2].text
    low = col[3].text
    close = col[4].text
    adj_close = col[5].text
    volume = col[6].text
    tokens = pd.Series({"Date":date, "Open":open_price, "High":high, "Low":low, "Close":close, "Adj Close":adj_close, "Volume":volume})
    # Finally we append each row to the DataFrame: the Series becomes a one-row
    # frame, and "Adj Close" is added as a new column after "Volume"
    netflix_data = pd.concat([netflix_data, tokens.to_frame().T], ignore_index=True)
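
Calling `pd.concat` once per row copies the DataFrame on every iteration. A common alternative, sketched below, collects the rows in a list and builds the DataFrame once at the end, then converts the text values to numbers. The inline `sample_html` is a stand-in with the same seven columns as the Netflix table, and names such as `records` and `sample_data` are illustrative.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in HTML with the same seven columns as the Netflix table
sample_html = """
<table><tbody>
<tr><td>Jun 01, 2021</td><td>504.01</td><td>536.13</td><td>482.14</td><td>528.21</td><td>528.21</td><td>78560600</td></tr>
<tr><td>May 01, 2021</td><td>512.65</td><td>518.95</td><td>478.54</td><td>502.81</td><td>502.81</td><td>66927600</td></tr>
</tbody></table>
"""

columns = ["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume"]
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Collect one dict per table row
records = []
for row in sample_soup.find("tbody").find_all("tr"):
    cells = [td.text.strip() for td in row.find_all("td")]
    records.append(dict(zip(columns, cells)))

# Build the DataFrame in a single step
sample_data = pd.DataFrame(records, columns=columns)

# Scraped values arrive as strings; convert the numeric columns
numeric_cols = ["Open", "High", "Low", "Close", "Adj Close", "Volume"]
sample_data[numeric_cols] = sample_data[numeric_cols].apply(pd.to_numeric)

print(sample_data.dtypes)
```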

### 5. Lastly let's print our data

In [47]:
# keep only the first seven columns
netflix_data = netflix_data.iloc[:,0:7]

In [46]:
netflix_data.head()

|   | Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|---|
| 0 | Jun 01, 2021 | 504.01 | 536.13 | 482.14 | 528.21 | 78560600 | 528.21 |
| 1 | May 01, 2021 | 512.65 | 518.95 | 478.54 | 502.81 | 66927600 | 502.81 |
| 2 | Apr 01, 2021 | 529.93 | 563.56 | 499.0 | 513.47 | 111573300 | 513.47 |
| 3 | Mar 01, 2021 | 545.57 | 556.99 | 492.85 | 521.66 | 90183900 | 521.66 |
| 4 | Feb 01, 2021 | 536.79 | 566.65 | 518.28 | 538.85 | 61902300 | 538.85 |


### 5.a Another way
We could also have taken a simpler route: pandas provides a function, <code>pd.read_html</code>, that extracts tables directly from HTML.

In [50]:
read_html__pandas_data = pd.read_html(url)

Since there is only one table on the page, we can just take the first table that is returned.

In [52]:
netflix_dataframe = read_html__pandas_data[0]
netflix_dataframe.head()

|   | Date | Open | High | Low | Close* | Adj Close** | Volume |
|---|---|---|---|---|---|---|---|
| 0 | Jun 01, 2021 | 504.01 | 536.13 | 482.14 | 528.21 | 528.21 | 78560600 |
| 1 | May 01, 2021 | 512.65 | 518.95 | 478.54 | 502.81 | 502.81 | 66927600 |
| 2 | Apr 01, 2021 | 529.93 | 563.56 | 499.0 | 513.47 | 513.47 | 111573300 |
| 3 | Mar 01, 2021 | 545.57 | 556.99 | 492.85 | 521.66 | 521.66 | 90183900 |
| 4 | Feb 01, 2021 | 536.79 | 566.65 | 518.28 | 538.85 | 538.85 | 61902300 |
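
The table returned by `read_html` keeps the footnote markers from the page in its headers (`Close*`, `Adj Close**`) and leaves the dates as text. As an optional sketch, a small cleanup step could look like the following. The `sample` frame is a hand-built stand-in holding two rows from the output above, so the snippet runs without fetching the page.

```python
import pandas as pd

# Two rows taken from the output above, including the footnote markers
sample = pd.DataFrame({
    "Date": ["Jun 01, 2021", "May 01, 2021"],
    "Close*": [528.21, 502.81],
    "Adj Close**": [528.21, 502.81],
})

# Strip the trailing asterisks from the column names
sample.columns = sample.columns.str.rstrip("*")

# Parse dates such as "Jun 01, 2021" into datetime values
sample["Date"] = pd.to_datetime(sample["Date"], format="%b %d, %Y")

print(sample.dtypes)
```

The same two lines of cleanup could be applied to `netflix_dataframe` if cleaner column names or date arithmetic are needed later.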
