<h1>Extracting Stock Data Using Web Scraping</h1>


In [None]:
#!pip install pandas==1.3.3
#!pip install requests==2.26.0
!mamba install bs4==4.10.0 -y
!mamba install html5lib==1.1 -y 
!pip install lxml==4.6.4

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

We will extract Netflix stock data from the following webpage: [https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/netflix_data_webpage.html](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/netflix_data_webpage.html).


# Steps to be followed for extracting data
1. Send an HTTP request to the webpage using the requests library.
2. Parse the HTML content of the webpage using BeautifulSoup.
3. Identify the HTML tags that contain the data you want to extract.
4. Use BeautifulSoup methods to extract the data from the HTML tags.
5. Print the extracted data.
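
The five steps above can be sketched end to end on a small inline HTML string. This is only an illustration: `sample_html` and `extract_rows` are hypothetical names that do not appear in the notebook, and the string stands in for the text that step 1 would download.

```python
from bs4 import BeautifulSoup

# Step 1 stand-in: in the notebook this text comes from an HTTP request.
# The table below is a tiny illustrative sample, not the real page.
sample_html = """
<table>
  <thead><tr><th>Date</th><th>Close</th></tr></thead>
  <tbody>
    <tr><td>Jun 01, 2021</td><td>528.21</td></tr>
    <tr><td>May 01, 2021</td><td>502.81</td></tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Return the text of every <td> cell, grouped by table row."""
    soup = BeautifulSoup(html, "html.parser")  # Step 2: parse the HTML
    body = soup.find("tbody")                  # Step 3: locate the tag holding the data
    rows = []
    for tr in body.find_all("tr"):             # Step 4: extract the cell text row by row
        rows.append([td.text.strip() for td in tr.find_all("td")])
    return rows


rows = extract_rows(sample_html)
for row in rows:                               # Step 5: print the extracted data
    print(row)
```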


We use the `requests` library to send an HTTP request to the webpage.<br>


In [3]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/netflix_data_webpage.html"

In [5]:
data = requests.get(url).text
#print(data)
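
The request above reads `.text` without checking whether the request succeeded. A slightly more defensive variant might look like the sketch below. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration; they are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising an exception on HTTP errors.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


# Usage in the notebook (requires network access):
# data = fetch_html(url)
```

Failing early on a bad status code avoids parsing an error page as if it were the stock table.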

In [6]:
soup = BeautifulSoup(data, 'html5lib')

The webpage contains a table, so we will scrape the HTML content of the page and convert the table into a DataFrame.


In [44]:
netflix_data = pd.DataFrame(columns=["Date", "Open", "High", "Low", "Close", "Volume"])


We will use the <b>find()</b> and <b>find_all()</b> methods of the BeautifulSoup object to locate the table body and the table rows, respectively, in the HTML.
   * The <i>find()</i> method returns the first matching tag, or `None` if nothing matches.
   * The <i>find_all()</i> method returns a list of all matching tags in the HTML.
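
To make the difference concrete, here is a minimal sketch on an inline HTML fragment. The fragment and the variable names are illustrative only and do not come from the Netflix page.

```python
from bs4 import BeautifulSoup

# Illustrative fragment with two table rows.
html = """<table><tbody>
<tr><td>Jun 01, 2021</td><td>504.01</td></tr>
<tr><td>May 01, 2021</td><td>512.65</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")

first_row = soup.find("tr")      # a single Tag: the first <tr> in the document
all_rows = soup.find_all("tr")   # a list-like ResultSet with every <tr>
missing = soup.find("thead")     # None, because the fragment has no <thead>

# Both methods are also available on a Tag, which is how rows are split into cells.
first_cells = [td.text for td in first_row.find_all("td")]
print(first_cells)
print(len(all_rows))
```

Because `find()` can return `None`, chaining a call directly onto its result fails with an `AttributeError` when the tag is missing, so a check is worthwhile on pages whose layout may change.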


In [45]:
# First we isolate the body of the table, which contains all the data rows
# Then we loop through each row and read the column values from its <td> cells
for row in soup.find("tbody").find_all('tr'):
    col = row.find_all("td")
    date = col[0].text
    open_price = col[1].text  # named open_price to avoid shadowing the built-in open()
    high = col[2].text
    low = col[3].text
    close = col[4].text
    adj_close = col[5].text
    volume = col[6].text
    # Finally we append the data of each row to the table. The DataFrame has no
    # "Adj Close" column, so that value does not appear in the output below.
    netflix_data.loc[len(netflix_data)] = {"Date":date, "Open":open_price, "High":high, "Low":low, "Close":close, "Adj Close":adj_close, "Volume":volume}

netflix_data




Unnamed: 0,Date,Open,High,Low,Close,Volume
0,"Jun 01, 2021",504.01,536.13,482.14,528.21,78560600
1,"May 01, 2021",512.65,518.95,478.54,502.81,66927600
2,"Apr 01, 2021",529.93,563.56,499.00,513.47,111573300
3,"Mar 01, 2021",545.57,556.99,492.85,521.66,90183900
4,"Feb 01, 2021",536.79,566.65,518.28,538.85,61902300
...,...,...,...,...,...,...
65,"Jan 01, 2016",109.00,122.18,90.11,91.84,488193200
66,"Dec 01, 2015",124.47,133.27,113.85,114.38,319939200
67,"Nov 01, 2015",109.20,126.60,101.86,123.33,320321800
68,"Oct 01, 2015",102.91,115.83,96.26,108.38,446204400
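
Appending to a DataFrame one row at a time with `.loc` works, but an alternative is to collect the rows in a plain list and build the DataFrame once at the end. The sketch below is not the notebook's method. The inline `sample_html` is a made-up stand-in for the real page: its "Adj Close" values simply repeat "Close", and the single-cell row is invented to show the length check.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Illustrative stand-in for the downloaded page (assumed seven-column layout).
sample_html = """
<table><tbody>
<tr><td>Jun 01, 2021</td><td>504.01</td><td>536.13</td><td>482.14</td><td>528.21</td><td>528.21</td><td>78560600</td></tr>
<tr><td colspan="7">A row that does not hold price data</td></tr>
<tr><td>May 01, 2021</td><td>512.65</td><td>518.95</td><td>478.54</td><td>502.81</td><td>502.81</td><td>66927600</td></tr>
</tbody></table>
"""

page_columns = ["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume"]
soup = BeautifulSoup(sample_html, "html.parser")

records = []
for row in soup.find("tbody").find_all("tr"):
    cells = [td.text for td in row.find_all("td")]
    if len(cells) != len(page_columns):
        continue  # skip rows that do not have one cell per expected column
    records.append(dict(zip(page_columns, cells)))

# Build the DataFrame in one step, keeping only the columns of interest.
df = pd.DataFrame(records, columns=["Date", "Open", "High", "Low", "Close", "Volume"])
print(df)
```

Note that every value is still a string at this point, exactly as in the row-by-row version.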


We can now print the first or last rows of the DataFrame using the `head()` or `tail()` method.


In [46]:
netflix_data.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,"Jun 01, 2021",504.01,536.13,482.14,528.21,78560600
1,"May 01, 2021",512.65,518.95,478.54,502.81,66927600
2,"Apr 01, 2021",529.93,563.56,499.0,513.47,111573300
3,"Mar 01, 2021",545.57,556.99,492.85,521.66,90183900
4,"Feb 01, 2021",536.79,566.65,518.28,538.85,61902300
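
Both methods accept an optional row count and default to five rows. A small sketch follows; the DataFrame `sample` is a hypothetical stand-in built from a few of the values shown above.

```python
import pandas as pd

# Hypothetical stand-in for netflix_data, using values from the output above.
sample = pd.DataFrame({
    "Date": ["Jun 01, 2021", "May 01, 2021", "Apr 01, 2021", "Mar 01, 2021", "Feb 01, 2021"],
    "Close": ["528.21", "502.81", "513.47", "521.66", "538.85"],
})

first_two = sample.head(2)  # the first two rows
last_one = sample.tail(1)   # the last row only

print(first_two)
print(last_one)
```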


# Extracting data using `pandas` library


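We can also extract tables with the pandas `read_html` function by passing it the URL of the page. The cell below first scrapes a GDP table with BeautifulSoup, as before, and then reads the same page with `read_html` for comparison.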


In [78]:
url1 = 'https://www.worldometers.info/gdp/gdp-by-country/'
r = requests.get(url1).text
soup = BeautifulSoup(r, 'html.parser')
# Scrape four columns by hand; col[0] holds the rank and is skipped
gdp_data = pd.DataFrame(columns = ['Country', 'GDP', 'GDP (abbrev.)', 'GDP growth'])
for row in soup.find('tbody').find_all('tr'):
    col = row.find_all('td')
    country = col[1].text
    gdp = col[2].text
    gdp_abbrev = col[3].text
    gdp_growth = col[4].text
    gdp_data.loc[len(gdp_data)] = [country, gdp, gdp_abbrev, gdp_growth]
display(gdp_data)

# read_html fetches the page and returns a list with one DataFrame per table
read_html_pandas_data = pd.read_html(url1)
read_html_pandas_data

Unnamed: 0,Country,GDP,GDP (abbrev.),GDP growth
0,United States,"$25,462,700,000,000",$25.463 trillion,2.06%
1,China,"$17,963,200,000,000",$17.963 trillion,2.99%
2,Japan,"$4,231,140,000,000",$4.231 trillion,1.03%
3,Germany,"$4,072,190,000,000",$4.072 trillion,1.79%
4,India,"$3,385,090,000,000",$3.385 trillion,7.00%
...,...,...,...,...
172,Sao Tome & Principe,"$546,680,342",$547 million,0.93%
173,Micronesia,"$427,094,119",$427 million,-0.62%
174,Marshall Islands,"$279,667,900",$280 million,1.50%
175,Kiribati,"$223,352,943",$223 million,1.56%


[       #              Country GDP  (nominal, 2022)    GDP  (abbrev.)  \
 0      1        United States  $25,462,700,000,000  $25.463 trillion   
 1      2                China  $17,963,200,000,000  $17.963 trillion   
 2      3                Japan   $4,231,140,000,000   $4.231 trillion   
 3      4              Germany   $4,072,190,000,000   $4.072 trillion   
 4      5                India   $3,385,090,000,000   $3.385 trillion   
 ..   ...                  ...                  ...               ...   
 172  173  Sao Tome & Principe         $546,680,342      $547 million   
 173  174           Micronesia         $427,094,119      $427 million   
 174  175     Marshall Islands         $279,667,900      $280 million   
 175  176             Kiribati         $223,352,943      $223 million   
 176  177               Tuvalu          $60,349,391       $60 million   
 
     GDP growth  Population  (2022) GDP per capita Share of  World GDP  
 0        2.06%           338289857        $75,26

Alternatively, we can convert the BeautifulSoup object to a string and pass that string to `read_html`.


In [79]:
read_html_pandas_data = pd.read_html(str(soup))
read_html_pandas_data


[       #              Country GDP  (nominal, 2022)    GDP  (abbrev.)  \
 0      1        United States  $25,462,700,000,000  $25.463 trillion   
 1      2                China  $17,963,200,000,000  $17.963 trillion   
 2      3                Japan   $4,231,140,000,000   $4.231 trillion   
 3      4              Germany   $4,072,190,000,000   $4.072 trillion   
 4      5                India   $3,385,090,000,000   $3.385 trillion   
 ..   ...                  ...                  ...               ...   
 172  173  Sao Tome & Principe         $546,680,342      $547 million   
 173  174           Micronesia         $427,094,119      $427 million   
 174  175     Marshall Islands         $279,667,900      $280 million   
 175  176             Kiribati         $223,352,943      $223 million   
 176  177               Tuvalu          $60,349,391       $60 million   
 
     GDP growth  Population  (2022) GDP per capita Share of  World GDP  
 0        2.06%           338289857        $75,26

Because there is only one table on the page, we take the first table in the returned list.


In [80]:
# The table holds GDP data, so the variable is named accordingly
gdp_dataframe = read_html_pandas_data[0]

gdp_dataframe

Unnamed: 0,#,Country,"GDP (nominal, 2022)",GDP (abbrev.),GDP growth,Population (2022),GDP per capita,Share of World GDP
0,1,United States,"$25,462,700,000,000",$25.463 trillion,2.06%,338289857,"$75,269",25.32%
1,2,China,"$17,963,200,000,000",$17.963 trillion,2.99%,1425887337,"$12,598",17.86%
2,3,Japan,"$4,231,140,000,000",$4.231 trillion,1.03%,123951692,"$34,135",4.21%
3,4,Germany,"$4,072,190,000,000",$4.072 trillion,1.79%,83369843,"$48,845",4.05%
4,5,India,"$3,385,090,000,000",$3.385 trillion,7.00%,1417173173,"$2,389",3.37%
...,...,...,...,...,...,...,...,...
172,173,Sao Tome & Principe,"$546,680,342",$547 million,0.93%,227380,"$2,404",0.00%
173,174,Micronesia,"$427,094,119",$427 million,-0.62%,539013,$792,0.00%
174,175,Marshall Islands,"$279,667,900",$280 million,1.50%,41569,"$6,728",0.00%
175,176,Kiribati,"$223,352,943",$223 million,1.56%,131232,"$1,702",0.00%
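
Taking index `0` is enough here because the page holds a single table. When a page contains several tables, `read_html` returns one DataFrame per table, and its `match` argument can narrow the result to tables containing a given text. The sketch below uses a made-up two-table HTML string rather than a live page, and it assumes an HTML parser backend for `read_html` (such as the `lxml` package installed at the top of this notebook) is available.

```python
from io import StringIO

import pandas as pd

# Illustrative page with two unrelated tables (not taken from a real site).
html = """
<table>
  <thead><tr><th>Country</th><th>GDP growth</th></tr></thead>
  <tbody>
    <tr><td>Japan</td><td>1.03%</td></tr>
    <tr><td>India</td><td>7.00%</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Ticker</th><th>Close</th></tr></thead>
  <tbody>
    <tr><td>NFLX</td><td>528.21</td></tr>
  </tbody>
</table>
"""

# Wrapping the string in StringIO gives read_html a file-like object to read.
tables = pd.read_html(StringIO(html))
print(len(tables))  # one DataFrame per <table> element

# Keep only the tables whose text matches "Country".
gdp_tables = pd.read_html(StringIO(html), match="Country")
gdp_table = gdp_tables[0]
print(gdp_table)
```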
