<center>
    <img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/Logos/organization_logo/organization_logo.png" width="300" alt="cognitiveclass.ai logo"  />
</center>


<h1>Extracting Stock Data Using Web Scraping</h1>


Not all stock data is available via an API. In this assignment, you will use web scraping to obtain financial data. You will be quizzed on your results.\
Using Beautiful Soup, we will extract historical share data from a web page.


<h2>Table of Contents</h2>
<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
        <li>Downloading the Webpage Using Requests Library</li>
        <li>Parsing Webpage HTML Using BeautifulSoup</li>
        <li>Extracting Data and Building DataFrame</li>
    </ul>
<p>
    Estimated Time Needed: <strong>30 min</strong></p>
</div>

<hr>


In [1]:
#!pip install pandas
#!pip install requests
!pip install bs4
#!pip install plotly

Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Collecting beautifulsoup4 (from bs4)
  Downloading https://files.pythonhosted.org/packages/d1/41/e6495bd7d3781cee623ce23ea6ac73282a373088fcd0ddc809a047b18eae/beautifulsoup4-4.9.3-py3-none-any.whl (115kB)
Collecting soupsieve>1.2; python_version >= "3.0" (from beautifulsoup4->bs4)
  Downloading https://files.pythonhosted.org/packages/36/69/d82d04022f02733bf9a72bc3b96332d360c0c5307096d76f6bb7489f7e57/soupsieve-2.2.1-py3-none-any.whl
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Stored in directory: /home/jupyterlab/.cache/pip/wheels/a0/b0/b2/4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: soupsieve, beautifulsoup4, bs4
Successfully installed beau

In [28]:
import pandas as pd
import requests
import yfinance as yf
from bs4 import BeautifulSoup

## Using Webscraping to Extract Stock Data Example


First, we must use the `requests` library to download the webpage and extract the text. We will extract Netflix stock data from [https://finance.yahoo.com/quote/NFLX/history?period1=1439078400\&period2=1623196800\&interval=1mo\&filter=history\&frequency=1mo\&includeAdjustedClose=true](https://finance.yahoo.com/quote/NFLX/history?utm_medium=Exinfluencer\&utm_source=Exinfluencer\&utm_content=000026UJ\&utm_term=10006555\&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0220ENSkillsNetwork23455606-2021-01-01\&period1=1439078400\&period2=1623196800\&interval=1mo\&filter=history\&frequency=1mo\&includeAdjustedClose=true).
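
The query string of that URL selects which history is shown. As a minimal sketch using only the standard library, the `period1` and `period2` values can be read as Unix timestamps (seconds since 1970-01-01 UTC). That reading is an assumption based on the values themselves, and `to_utc_date` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse

url = ("https://finance.yahoo.com/quote/NFLX/history"
       "?period1=1439078400&period2=1623196800&interval=1mo"
       "&filter=history&frequency=1mo&includeAdjustedClose=true")

# Split the URL and turn its query string into a dict of lists.
params = parse_qs(urlparse(url).query)


def to_utc_date(seconds):
    """Interpret a string of epoch seconds as a UTC calendar date (assumed meaning)."""
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc).date()


start = to_utc_date(params["period1"][0])
end = to_utc_date(params["period2"][0])
print(start, end, params["interval"][0])  # 2015-08-09 2021-06-09 1mo
```

Under that assumption, the request covers monthly rows from August 2015 to June 2021.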


In [42]:
url = "https://finance.yahoo.com/quote/NFLX/history?period1=1439078400&period2=1623196800&interval=1mo&filter=history&frequency=1mo&includeAdjustedClose=true"

data  = requests.get(url).text

Next, we must parse the text as HTML using `BeautifulSoup`.


In [27]:
requests.get('https://finance.yahoo.com/quote/NFLX/history?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0220ENSkillsNetwork23455606-2021-01-01&period1=1439078400&period2=1623196800&interval=1mo&filter=history&frequency=1mo&includeAdjustedClose=true')

<Response [404]>
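
The `<Response [404]>` above means the server did not return the history page, so any later parsing works on an error document. One possible guard is sketched below. `fetch_html`, its `get` parameter, and `fake_get` are hypothetical names introduced here so the check can be shown without network access; they are not part of the lab.

```python
import requests


def fetch_html(url, get=requests.get):
    """Return the page text, or raise if the server reported an error status."""
    response = get(url, timeout=10)
    response.raise_for_status()  # raises requests.HTTPError for 4xx and 5xx codes
    return response.text


def fake_get(url, timeout=None):
    """Offline stand-in that always answers with a hand-built 404 response."""
    response = requests.Response()
    response.status_code = 404
    response.url = url
    return response


try:
    fetch_html("https://example.invalid/history", get=fake_get)
    outcome = "ok"
except requests.HTTPError:
    outcome = "failed"

print(outcome)
```

Calling `fetch_html(url)` without the `get` argument would perform a real request with `requests.get`.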

In [29]:
netflix = yf.Ticker('NFLX')

In [8]:
soup = BeautifulSoup(data, 'html5lib')

In [31]:
netflix_data = netflix.history(period='max')

Now we can turn the HTML table into a pandas dataframe.


In [36]:
netflix_data

| Date       | Open       | High       | Low        | Close      | Volume    | Dividends | Stock Splits |
| ---------- | ---------- | ---------- | ---------- | ---------- | --------- | --------- | ------------ |
| 2002-05-23 | 1.156429   | 1.242857   | 1.145714   | 1.196429   | 104790000 | 0         | 0.0          |
| 2002-05-24 | 1.214286   | 1.225000   | 1.197143   | 1.210000   | 11104800  | 0         | 0.0          |
| 2002-05-28 | 1.213571   | 1.232143   | 1.157143   | 1.157143   | 6609400   | 0         | 0.0          |
| 2002-05-29 | 1.164286   | 1.164286   | 1.085714   | 1.103571   | 6757800   | 0         | 0.0          |
| 2002-05-30 | 1.107857   | 1.107857   | 1.071429   | 1.071429   | 10154200  | 0         | 0.0          |
| ...        | ...        | ...        | ...        | ...        | ...       | ...       | ...          |
| 2021-06-28 | 528.119995 | 533.940002 | 524.559998 | 533.030029 | 2820200   | 0         | 0.0          |
| 2021-06-29 | 533.549988 | 536.130005 | 528.570007 | 533.500000 | 2314600   | 0         | 0.0          |
| 2021-06-30 | 534.059998 | 534.380005 | 526.820007 | 528.210022 | 2773400   | 0         | 0.0          |
| 2021-07-01 | 525.719971 | 537.039978 | 525.719971 | 533.539978 | 2805400   | 0         | 0.0          |


In [11]:
# Collect one dict per table row; DataFrame.append was removed in pandas 2.0
rows = []

# First we isolate the body of the table which contains all the information
# Then we loop through each row and find all the column values for each row
for row in soup.find("tbody").find_all('tr'):
    col = row.find_all("td")
    print(col)  # debug output: show the raw cells of the row
    if len(col) < 7:
        continue  # not a price row (for example, the error page printed below)
    date = col[0].text
    Open = col[1].text
    high = col[2].text
    low = col[3].text
    close = col[4].text
    adj_close = col[5].text
    volume = col[6].text

    rows.append({"Date":date, "Open":Open, "High":high, "Low":low, "Close":close, "Adj Close":adj_close, "Volume":volume})

# Finally we build the table once, keeping the column order of the original cell
netflix_data = pd.DataFrame(rows, columns=["Date", "Open", "High", "Low", "Close", "Volume", "Adj Close"])

[<td>
      <img alt="Yahoo Logo" src="https://s.yimg.com/rz/p/yahoo_frontpage_en-US_s_f_p_205x58_frontpage.png"/>
      <h1 style="margin-top:20px;">Will be right back...</h1>
      <p id="message-1">Thank you for your patience.</p>
      <p id="message-2">Our engineers are working quickly to resolve the issue.</p>
      </td>]
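
Because the live page returned the maintenance message printed above, the row loop can be tried offline instead. The sketch below runs the same idea on a small inline table. `sample_html` is a hypothetical stand-in for the downloaded page, its two price rows reuse values from this notebook's Netflix output, and the thousands separators and the footnote row are assumptions about the page layout.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded history page.
sample_html = """
<table>
  <thead><tr><th>Date</th><th>Open</th><th>High</th><th>Low</th>
  <th>Close</th><th>Adj Close</th><th>Volume</th></tr></thead>
  <tbody>
    <tr><td>Jun 01, 2021</td><td>504.01</td><td>505.41</td><td>487.25</td>
        <td>492.39</td><td>492.39</td><td>16,955,200</td></tr>
    <tr><td>May 01, 2021</td><td>512.65</td><td>518.95</td><td>478.54</td>
        <td>502.81</td><td>502.81</td><td>66,925,200</td></tr>
    <tr><td colspan="7">Footnote row without price cells</td></tr>
  </tbody>
</table>
"""

columns = ["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume"]
sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for tr in sample_soup.find("tbody").find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if len(cells) != len(columns):
        continue  # skip rows that are not price rows
    rows.append(dict(zip(columns, cells)))

# Build the dataframe once from the collected rows.
sample_data = pd.DataFrame(rows, columns=columns)
print(sample_data.shape)  # (2, 7)
```

Every value is still a string at this point, exactly as it appeared in the HTML.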


We can now print out the dataframe


In [39]:
netflix_data.drop(['Dividends','Stock Splits'], axis=1,inplace=True)

In [40]:
netflix_data.head()

| Date       | Open     | High     | Low      | Close    | Volume    |
| ---------- | -------- | -------- | -------- | -------- | --------- |
| 2002-05-23 | 1.156429 | 1.242857 | 1.145714 | 1.196429 | 104790000 |
| 2002-05-24 | 1.214286 | 1.225    | 1.197143 | 1.21     | 11104800  |
| 2002-05-28 | 1.213571 | 1.232143 | 1.157143 | 1.157143 | 6609400   |
| 2002-05-29 | 1.164286 | 1.164286 | 1.085714 | 1.103571 | 6757800   |
| 2002-05-30 | 1.107857 | 1.107857 | 1.071429 | 1.071429 | 10154200  |


In [34]:
netflix_data.head()

|     | Date         | Open   | High   | Low    | Close  | Volume    | Adj Close |
| --- | ------------ | ------ | ------ | ------ | ------ | --------- | --------- |
| 0   | Jun 01, 2021 | 504.01 | 505.41 | 487.25 | 492.39 | 16955200  | 492.39    |
| 1   | May 01, 2021 | 512.65 | 518.95 | 478.54 | 502.81 | 66925200  | 502.81    |
| 2   | Apr 01, 2021 | 529.93 | 563.56 | 499.0  | 513.47 | 111568500 | 513.47    |
| 3   | Mar 01, 2021 | 545.57 | 556.99 | 492.85 | 521.66 | 90183900  | 521.66    |
| 4   | Feb 01, 2021 | 536.79 | 566.65 | 518.28 | 538.85 | 61902300  | 538.85    |


We can also use the pandas `read_html` function


Because there is only one table on the page, we just take the first table in the list returned.


In [44]:
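from io import StringIO

# The cell that defined `read_html_pandas_data` is missing from this notebook;
# parsing the already-downloaded page is an assumed reconstruction of it
read_html_pandas_data = pd.read_html(StringIO(str(soup)))
netflix_dataframe = read_html_pandas_data[0]

netflix_dataframe.head()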

## Using Webscraping to Extract Stock Data Exercise


Use the `requests` library to download the webpage [https://finance.yahoo.com/quote/AMZN/history?period1=1451606400\&period2=1612137600\&interval=1mo\&filter=history\&frequency=1mo\&includeAdjustedClose=true](https://finance.yahoo.com/quote/AMZN/history?utm_medium=Exinfluencer\&utm_source=Exinfluencer\&utm_content=000026UJ\&utm_term=10006555\&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0220ENSkillsNetwork23455606-2021-01-01\&period1=1451606400\&period2=1612137600\&interval=1mo\&filter=history\&frequency=1mo\&includeAdjustedClose=true). Save the text of the response as a variable named `html_data`.


In [48]:
url = 'https://finance.yahoo.com/quote/AMZN/history?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0220ENSkillsNetwork23455606-2021-01-01&period1=1451606400&period2=1612137600&interval=1mo&filter=history&frequency=1mo&includeAdjustedClose=true'
amzn = yf.Ticker('AMZN')

In [49]:
amzn_data = amzn.history(period='max')

Parse the HTML data using `BeautifulSoup`.


In [50]:
amzn_data

| Date       | Open        | High        | Low         | Close       | Volume   | Dividends | Stock Splits |
| ---------- | ----------- | ----------- | ----------- | ----------- | -------- | --------- | ------------ |
| 1997-05-15 | 2.437500    | 2.500000    | 1.927083    | 1.958333    | 72156000 | 0         | 0.0          |
| 1997-05-16 | 1.968750    | 1.979167    | 1.708333    | 1.729167    | 14700000 | 0         | 0.0          |
| 1997-05-19 | 1.760417    | 1.770833    | 1.625000    | 1.708333    | 6106800  | 0         | 0.0          |
| 1997-05-20 | 1.729167    | 1.750000    | 1.635417    | 1.635417    | 5467200  | 0         | 0.0          |
| 1997-05-21 | 1.635417    | 1.645833    | 1.375000    | 1.427083    | 18853200 | 0         | 0.0          |
| ...        | ...         | ...         | ...         | ...         | ...      | ...       | ...          |
| 2021-06-28 | 3416.000000 | 3448.000000 | 3413.510010 | 3443.889893 | 2242800  | 0         | 0.0          |
| 2021-06-29 | 3438.820068 | 3456.030029 | 3423.030029 | 3448.139893 | 2098400  | 0         | 0.0          |
| 2021-06-30 | 3441.060059 | 3471.600098 | 3435.000000 | 3440.159912 | 2404000  | 0         | 0.0          |
| 2021-07-01 | 3434.610107 | 3457.000000 | 3409.419922 | 3432.969971 | 2037100  | 0         | 0.0          |


In [52]:
!pip install requests_html
import requests_html


Collecting requests_html
  Downloading https://files.pythonhosted.org/packages/24/bc/a4380f09bab3a776182578ce6b2771e57259d0d4dbce178205779abdc347/requests_html-0.10.0-py3-none-any.whl
Collecting fake-useragent (from requests_html)
  Downloading https://files.pythonhosted.org/packages/d1/79/af647635d6968e2deb57a208d309f6069d31cb138066d7e821e575112a80/fake-useragent-0.1.11.tar.gz
Collecting parse (from requests_html)
  Downloading https://files.pythonhosted.org/packages/89/a1/82ce536be577ba09d4dcee45db58423a180873ad38a2d014d26ab7b7cb8a/parse-1.19.0.tar.gz
Collecting w3lib (from requests_html)
  Downloading https://files.pythonhosted.org/packages/a3/59/b6b14521090e7f42669cafdb84b0ab89301a42f1f1a82fcf5856661ea3a7/w3lib-1.22.0-py2.py3-none-any.whl
Collecting pyquery (from requests_html)
  Downloading https://files.pythonhosted.org/packages/58/0b/85d15e21f660a8ea68b1e0286168938857391f4ec9f6d204d91c9e013826/pyquery-1.4.3-py3-none-any.whl
Collecting pyppeteer>=0.0.14 (from requests_html)
[?25

In [53]:
import nest_asyncio

url = 'https://finance.yahoo.com/quote/AMZN/history?period1=1451606400&period2=1612137600&interval=1mo&filter=history&frequency=1mo&includeAdjustedClose=true'

# Patch asyncio so that nested event loops are allowed inside Jupyter
nest_asyncio.apply()


session = requests_html.HTMLSession()
r = session.get(url)

html_str = r.text

In [59]:
soup = BeautifulSoup(html_str, 'html5lib')
table=soup.find('table')

<b>Question 1</b> What is the content of the title tag?


In [60]:
soup.find('title')

<title>Amazon.com, Inc. (AMZN) Stock Historical Prices &amp; Data - Yahoo Finance</title>
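
`soup.find('title')` returns the whole tag, including the `<title>` markup and the `&amp;` entity. As a small sketch on an inline copy of that tag, the `.text` attribute gives just the readable content. `title_html` and `title_soup` are names introduced here for illustration, and the built-in `html.parser` is used only to keep the example free of extra parser dependencies.

```python
from bs4 import BeautifulSoup

# Inline copy of the tag shown in the output above.
title_html = "<title>Amazon.com, Inc. (AMZN) Stock Historical Prices &amp; Data - Yahoo Finance</title>"
title_soup = BeautifulSoup(title_html, "html.parser")

title_tag = title_soup.find("title")  # the whole <title> element
title_text = title_tag.text           # only its text, with &amp; decoded to &
print(title_text)
# Amazon.com, Inc. (AMZN) Stock Historical Prices & Data - Yahoo Finance
```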

Using BeautifulSoup, extract the table with historical share prices and store it in a dataframe named `amazon_data`. The dataframe should have columns Date, Open, High, Low, Close, Adj Close, and Volume. Fill in each variable with the correct data from the list `col`.


In [62]:
# Collect one dict per table row; DataFrame.append was removed in pandas 2.0
rows = []

for row in soup.find("tbody").find_all("tr"):
    col = row.find_all("td")
    if len(col) < 7:
        continue  # skip rows that do not hold all seven price columns
    date = col[0].text
    Open = col[1].text
    high = col[2].text
    low = col[3].text
    close = col[4].text
    adj_close = col[5].text
    volume = col[6].text

    rows.append({"Date":date, "Open":Open, "High":high, "Low":low, "Close":close, "Adj Close":adj_close, "Volume":volume})

# Build the dataframe once, keeping the column order of the original cell
amazon_data = pd.DataFrame(rows, columns=["Date", "Open", "High", "Low", "Close", "Volume", "Adj Close"])

Print out the first five rows of the `amazon_data` dataframe you created.


In [63]:
amazon_data.head()

|     | Date         | Open    | High    | Low     | Close   | Volume    | Adj Close |
| --- | ------------ | ------- | ------- | ------- | ------- | --------- | --------- |
| 0   | Jan 01, 2021 | 3270.0  | 3363.89 | 3086.0  | 3206.2  | 71528900  | 3206.2    |
| 1   | Dec 01, 2020 | 3188.5  | 3350.65 | 3072.82 | 3256.93 | 77556200  | 3256.93   |
| 2   | Nov 01, 2020 | 3061.74 | 3366.8  | 2950.12 | 3168.04 | 90810500  | 3168.04   |
| 3   | Oct 01, 2020 | 3208.0  | 3496.24 | 3019.0  | 3036.15 | 116226100 | 3036.15   |
| 4   | Sep 01, 2020 | 3489.58 | 3552.25 | 2871.0  | 3148.73 | 115899300 | 3148.73   |


<b>Question 2</b> What are the names of the columns of the dataframe?


In [66]:
amazon_data.columns

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'], dtype='object')

<b>Question 3</b> What is the `Open` of the last row of the amazon_data dataframe?


In [68]:
amazon_data.iloc[-1]

Date         Jan 01, 2016
Open               656.29
High               657.72
Low                547.18
Close              587.00
Volume        130,200,900
Adj Close          587.00
Name: 60, dtype: object
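
The `dtype: object` in the output above shows that every scraped value is still a string, including the comma-separated volume. If numeric work is needed later, one possible conversion is sketched below on a one-row copy of that output. `last_row` and `typed` are hypothetical names, and the date format string is an assumption drawn from the values shown.

```python
import pandas as pd

# One-row copy of the output above, kept as strings exactly as scraped.
last_row = pd.DataFrame([{
    "Date": "Jan 01, 2016", "Open": "656.29", "High": "657.72",
    "Low": "547.18", "Close": "587.00", "Volume": "130,200,900",
    "Adj Close": "587.00",
}])

typed = last_row.copy()
typed["Date"] = pd.to_datetime(typed["Date"], format="%b %d, %Y")
for name in ["Open", "High", "Low", "Close", "Adj Close", "Volume"]:
    # Drop thousands separators, then parse what remains as a number.
    typed[name] = pd.to_numeric(typed[name].str.replace(",", "", regex=False))

print(typed.dtypes)
```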

<h2>About the Authors:</h2> 

<a href="https://www.linkedin.com/in/joseph-s-50398b136/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0220ENSkillsNetwork23455606-2021-01-01">Joseph Santarcangelo</a> has a PhD in Electrical Engineering. His research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.

Azim Hirjani


## Change Log

| Date (YYYY-MM-DD) | Version | Changed By    | Change Description        |
| ----------------- | ------- | ------------- | ------------------------- |
| 2021-06-09        | 1.2     | Lakshmi Holla | Added URL in question 3   |
| 2020-11-10        | 1.1     | Malika Singla | Deleted the Optional part |
| 2020-08-27        | 1.0     | Malika Singla | Added lab to GitLab       |

<hr>

## <h3 align="center"> © IBM Corporation 2020. All rights reserved. </h3>

