## Extracting Stock Data Using Web Scraping

In [1]:
# Install the required libraries
!pip install pandas==1.3.3
!pip install requests==2.26.0
!mamba install bs4==4.10.0 -y
!mamba install html5lib==1.1 -y 
!pip install lxml==4.6.4
!pip install plotly==5.3.1

Collecting pandas==1.3.3
  Downloading pandas-1.3.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.3 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 1.3.5
    Uninstalling pandas-1.3.5:
      Successfully uninstalled pandas-1.3.5
Successfully installed pandas-1.3.3
Collecting requests==2.26.0
  Downloading requests-2.26.0-py2.py3-none-any.whl (62 kB)
Collecting charset-normalizer~=2.0.0 (from requests==2.26.0)
  Downloading charset_normalizer-2.0.12-py3-none-any.whl (39 kB)
Installing collected packages: charset-normalizer, requests
  Attempting uninstall: charset-normalizer
    Found existing installation: charset-normalizer 3.1.0
    Uninstalling cha

In [3]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In Python, you can suppress warnings with the `warnings` module. Its `filterwarnings` function filters or ignores specific warning messages or categories.

In [4]:
import warnings
# Ignore only warnings in the FutureWarning category
warnings.filterwarnings("ignore", category=FutureWarning)
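
To see what a category filter does, the following minimal sketch emits two warnings and records which ones get through. The helper name `surviving_categories` is made up for illustration; only the standard-library `warnings` module is used.

```python
import warnings


def surviving_categories(ignored_category):
    """Emit a FutureWarning and a UserWarning; return the names of those not filtered out."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning by default.
        warnings.simplefilter("always")
        # Filters added later take precedence, so this one wins for its category.
        warnings.filterwarnings("ignore", category=ignored_category)
        warnings.warn("this API will change", FutureWarning)
        warnings.warn("something looks odd", UserWarning)
    return [w.category.__name__ for w in caught]


remaining = surviving_categories(FutureWarning)
print(remaining)
```

Only the `UserWarning` is recorded, which shows that a category filter leaves other kinds of warnings visible.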

#### Example: Using Web Scraping to Extract Stock Data
We will extract Netflix stock data from this web page: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/netflix_data_webpage.html

### Steps for extracting the data
1. Send an HTTP request to the web page using the requests library.
2. Parse the HTML content of the web page using BeautifulSoup.
3. Identify the HTML tags that contain the data you want to extract.
4. Use BeautifulSoup methods to extract the data from the HTML tags.
5. Print the extracted data.

#### Step 1: Send an HTTP request to the web page
You will use the requests library to send an HTTP request to the web page.

In [5]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/netflix_data_webpage.html"

The **requests.get()** method takes a URL as its first argument, which specifies the location of the resource to retrieve. Here the web page URL is stored in the `url` variable, so `url` is passed to `requests.get()`.

The ***.text*** attribute of the response returns the HTML content decoded as a string.

In [8]:
data = requests.get(url).text
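
The one-liner above does not check whether the request succeeded. A slightly more defensive variant might look like the sketch below. The function name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the lab.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded body of `url`, raising if the request fails."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return response.text


# Usage (needs network access), with the `url` variable defined above:
# data = fetch_html(url)
```

With this variant, a missing page or server error surfaces as an exception at the request step rather than as a confusing parsing failure later.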

#### Step 2: Parse the HTML content
#### What is parsing?
In simple words, parsing refers to the process of analyzing a string of text or a data structure, usually following a set of rules or grammar, to understand its structure and meaning. Parsing involves breaking down a piece of text or data into its individual components or elements, and then analyzing those components to extract the desired information or to understand their relationships and meanings.

Next, you will take the raw HTML of the web page and parse it into a structured, hierarchical format that is easier to analyze and manipulate in Python. You can do this with a Python library called Beautiful Soup.

#### Parsing the data using the BeautifulSoup library
Generate a new BeautifulSoup object.

*Note:* To create a `BeautifulSoup` object, you pass two arguments to its constructor:
1. The HTML or XML content that you want to parse, as a string.
2. The name of the parser to use. This argument is optional; if you omit it, Beautiful Soup picks the best HTML parser that is installed and emits a warning that no parser was specified. This lab uses the "html5lib" parser.

In [9]:
soup = BeautifulSoup(data, 'html5lib')
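
As a small illustration of the "structured, hierarchical format" that parsing produces, the sketch below parses a made-up HTML string and navigates the resulting tree. The HTML content and the variable names are invented for the example, and it uses Python's built-in `"html.parser"` instead of `"html5lib"` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# A tiny, made-up document used only for this demonstration.
html = (
    "<html><head><title>Stock page</title></head>"
    "<body><h1>Netflix</h1><p>Monthly prices</p></body></html>"
)

demo_soup = BeautifulSoup(html, "html.parser")

page_title = demo_soup.title.string      # text of the <title> tag
heading = demo_soup.h1.get_text()        # text of the first <h1> tag
parent_name = demo_soup.p.parent.name    # name of the tag that encloses <p>

print(page_title, heading, parent_name)
```

Each tag becomes an object that knows its text, its children, and its parent, which is what makes the later `find()` and `find_all()` calls possible.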

#### Step 3: Identify the HTML tags
The web page contains a table, so we will scrape its content and convert the table into a data frame.
Let's create an empty DataFrame using **pd.DataFrame()** with the following columns:
* "Date"
* "Open"
* "High"
* "Low"
* "Close"
* "Volume"

In [11]:
netflix_data = pd.DataFrame(columns=["Date", "Open", "High", "Low", "Close", "Volume"])

#### Working on HTML table
The following tags are used to build HTML tables.

* `<table>`: This is the root tag that defines the start and end of the table. All the content of the table is enclosed within these tags.

* `<tr>`: This tag defines a table row. Each row of the table is defined within this tag.

* `<td>`: This tag defines a table cell. Each cell of the table is defined within this tag. The content of the cell goes between the opening and closing tags.

* `<th>`: This tag defines a header cell in the table. A header cell describes the contents of a column or row. By default, the text inside a `<th>` tag is bold and centered.

* `<tbody>`: This tag encloses the main content of the table. It contains one or more `<tr>` rows.
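
To make the tag roles concrete, here is a minimal sketch that builds a small table as a string and pulls out its header cells and body rows. The table is a hand-written fragment that reuses a few values from the Netflix data; `<thead>` is an additional standard tag not listed above, and `"html.parser"` is used only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

table_html = """
<table>
  <thead>
    <tr><th>Date</th><th>Open</th><th>Close</th></tr>
  </thead>
  <tbody>
    <tr><td>Jun 01, 2021</td><td>504.01</td><td>528.21</td></tr>
    <tr><td>May 01, 2021</td><td>512.65</td><td>502.81</td></tr>
  </tbody>
</table>
"""

table_soup = BeautifulSoup(table_html, "html.parser")

# Header cells (<th>) describe the columns.
headers = [th.get_text() for th in table_soup.find_all("th")]

# Data rows (<tr>) live inside <tbody>; each row holds data cells (<td>).
body_rows = table_soup.find("tbody").find_all("tr")
first_row = [td.get_text() for td in body_rows[0].find_all("td")]

print(headers)
print(first_row)
```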

#### Step 4: Use a BeautifulSoup method for extracting data
We will use the **find()** and **find_all()** methods of the BeautifulSoup object to locate the table body and the table rows, respectively, in the HTML.

* The find() method returns the first matching tag, or `None` if nothing matches.
* The find_all() method returns a list of all matching tags in the HTML.
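
The difference between the two methods can be checked on a small made-up snippet. This is only a sketch; the HTML string and variable names are invented for the example.

```python
from bs4 import BeautifulSoup

snippet = "<ul><li>Open</li><li>High</li><li>Low</li></ul>"
list_soup = BeautifulSoup(snippet, "html.parser")

first_item = list_soup.find("li")          # first matching tag only
all_items = list_soup.find_all("li")       # every matching tag, in document order
missing = list_soup.find("table")          # no match: None
no_matches = list_soup.find_all("table")   # no match: empty list

print(first_item.get_text())
print([li.get_text() for li in all_items])
print(missing, len(no_matches))
```

Because `find()` returns `None` when nothing matches, a chained call such as `soup.find("tbody").find_all("tr")` raises an `AttributeError` on pages without a `<tbody>`, so it can be worth checking the result first.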

In [12]:
# First, isolate the table body, which contains all the data rows.
# Then loop through each row and read the text of each cell (<td>).
for row in soup.find("tbody").find_all("tr"):
    col = row.find_all("td")
    date = col[0].text
    open_price = col[1].text  # "open" would shadow the Python built-in
    high = col[2].text
    low = col[3].text
    close = col[4].text
    adj_close = col[5].text
    volume = col[6].text

    # Finally, append the row to the DataFrame. "Adj Close" is not among the
    # initial columns, so append() adds it as a new last column.
    # Note: DataFrame.append() was removed in pandas 2.0; this cell relies on
    # the pandas 1.3.3 version pinned in the install cell.
    netflix_data = netflix_data.append({"Date":date, "Open":open_price, "High":high, "Low":low, "Close":close, "Adj Close":adj_close, "Volume":volume}, ignore_index=True)
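
The loop above relies on `DataFrame.append()`, which was deprecated in pandas 1.4 and removed in pandas 2.0. One alternative that works across versions is to collect the rows in a plain list and build the DataFrame once at the end. The sketch below shows this on a hand-written two-row table (values taken from the Netflix data); the variable names and the use of `"html.parser"` are choices made for the example.

```python
import pandas as pd
from bs4 import BeautifulSoup

html = """
<table>
  <tbody>
    <tr><td>Jun 01, 2021</td><td>504.01</td><td>536.13</td><td>482.14</td>
        <td>528.21</td><td>528.21</td><td>78560600</td></tr>
    <tr><td>May 01, 2021</td><td>512.65</td><td>518.95</td><td>478.54</td>
        <td>502.81</td><td>502.81</td><td>66927600</td></tr>
  </tbody>
</table>
"""

columns = ["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume"]
demo_soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in demo_soup.find("tbody").find_all("tr"):
    # Read the text of every cell in the row, stripping surrounding whitespace.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Pair each cell with its column name.
    rows.append(dict(zip(columns, cells)))

# Build the DataFrame in one step instead of appending row by row.
stock_data = pd.DataFrame(rows, columns=columns)
print(stock_data)
```

Building the frame once also avoids copying the whole DataFrame on every iteration, which is what repeated appends do.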

#### Step 5: Print the Extracted Data

In [13]:
netflix_data.head()

| | Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|---|
| 0 | Jun 01, 2021 | 504.01 | 536.13 | 482.14 | 528.21 | 78560600 | 528.21 |
| 1 | May 01, 2021 | 512.65 | 518.95 | 478.54 | 502.81 | 66927600 | 502.81 |
| 2 | Apr 01, 2021 | 529.93 | 563.56 | 499.0 | 513.47 | 111573300 | 513.47 |
| 3 | Mar 01, 2021 | 545.57 | 556.99 | 492.85 | 521.66 | 90183900 | 521.66 |
| 4 | Feb 01, 2021 | 536.79 | 566.65 | 518.28 | 538.85 | 61902300 | 538.85 |
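
Values read with `.text` are strings, so the scraped columns usually need converting before any numeric work. The following sketch shows one possible cleanup on a small hand-built frame. The thousands separators in `Volume` are an assumption about how a page might format that column, and the variable names are invented for the example.

```python
import pandas as pd

# Hand-built stand-in for scraped data: every value is a string.
raw = pd.DataFrame({
    "Date": ["Jun 01, 2021", "May 01, 2021"],
    "Open": ["504.01", "512.65"],
    "Volume": ["78,560,600", "66,927,600"],  # commas are an assumption
})

clean = raw.copy()
# Parse dates written like "Jun 01, 2021".
clean["Date"] = pd.to_datetime(clean["Date"], format="%b %d, %Y")
# Convert price strings to floats.
clean["Open"] = pd.to_numeric(clean["Open"])
# Drop thousands separators, then convert to integers.
clean["Volume"] = pd.to_numeric(clean["Volume"].str.replace(",", "", regex=False))

print(clean.dtypes)
```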


## Extracting Data Using `pandas` Library
We can also extract the data directly from the URL with the pandas `read_html` function.

`pd.read_html(url)` extracts tables from HTML web pages. It takes a URL as input and returns a list of DataFrames, one for each table found on the web page.

In [14]:
read_html_pandas_data = pd.read_html(url)

Alternatively, you can convert the `BeautifulSoup` object to a string and pass that string to `pd.read_html`.

In [15]:
read_html_pandas_data = pd.read_html(str(soup))

Because there is only one table on the page, just take the first table in the returned list.

In [16]:
netflix_dataframe = read_html_pandas_data[0]

netflix_dataframe.head()

| | Date | Open | High | Low | Close* | Adj Close** | Volume |
|---|---|---|---|---|---|---|---|
| 0 | Jun 01, 2021 | 504.01 | 536.13 | 482.14 | 528.21 | 528.21 | 78560600 |
| 1 | May 01, 2021 | 512.65 | 518.95 | 478.54 | 502.81 | 502.81 | 66927600 |
| 2 | Apr 01, 2021 | 529.93 | 563.56 | 499.0 | 513.47 | 513.47 | 111573300 |
| 3 | Mar 01, 2021 | 545.57 | 556.99 | 492.85 | 521.66 | 521.66 | 90183900 |
| 4 | Feb 01, 2021 | 536.79 | 566.65 | 518.28 | 538.85 | 538.85 | 61902300 |
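
When a page holds more than one table, `pd.read_html` returns one DataFrame per table, and its `match` parameter can narrow the result to tables containing a given text. The sketch below demonstrates this on a hand-written page with two tables. The HTML is made up for the example (it reuses two Netflix rows), and the string is wrapped in `io.StringIO` because recent pandas versions deprecate passing literal HTML strings directly.

```python
from io import StringIO

import pandas as pd

page = """
<table>
  <thead><tr><th>Date</th><th>Open</th><th>Volume</th></tr></thead>
  <tbody>
    <tr><td>Jun 01, 2021</td><td>504.01</td><td>78560600</td></tr>
    <tr><td>May 01, 2021</td><td>512.65</td><td>66927600</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Ticker</th><th>Company</th></tr></thead>
  <tbody><tr><td>NFLX</td><td>Netflix</td></tr></tbody>
</table>
"""

# Without a filter, every table on the page is returned.
all_tables = pd.read_html(StringIO(page))

# With match, only tables containing the text "Volume" are returned.
price_tables = pd.read_html(StringIO(page), match="Volume")
prices = price_tables[0]

print(len(all_tables), len(price_tables))
print(prices)
```

Unlike the manual loop, `read_html` infers numeric types for columns such as `Open` and `Volume`, while dates stay as strings unless converted.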


## Using Web Scraping to Extract Amazon Stock Data
The URL of the Amazon stock data is: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/amazon_data_webpage.html

In [17]:
url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/amazon_data_webpage.html'

In [18]:
html_data = requests.get(url).text

In [19]:
# Name the parser explicitly, as in the Netflix example, to avoid a parser warning
soup = BeautifulSoup(html_data, 'html5lib')

In [21]:
amazon_data = pd.DataFrame(columns=["Date", "Open", "High", "Low", "Close", "Volume"])

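# Loop through each row of the table body and read the text of each cell
for row in soup.find("tbody").find_all("tr"):
    col = row.find_all("td")
    date = col[0].text
    open_price = col[1].text  # "open" would shadow the Python built-in
    high = col[2].text
    low = col[3].text
    close = col[4].text
    adj_close = col[5].text
    volume = col[6].text

    # Append the row; "Adj Close" is added as a new last column.
    # DataFrame.append() requires pandas < 2.0.
    amazon_data = amazon_data.append({"Date":date, "Open":open_price, "High":high, "Low":low, "Close":close, "Adj Close":adj_close, "Volume":volume}, ignore_index=True)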

In [22]:
amazon_data.head()

| | Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|---|
| 0 | Jan 01, 2021 | 3270.0 | 3363.89 | 3086.0 | 3206.2 | 71528900 | 3206.2 |
| 1 | Dec 01, 2020 | 3188.5 | 3350.65 | 3072.82 | 3256.93 | 77556200 | 3256.93 |
| 2 | Nov 01, 2020 | 3061.74 | 3366.8 | 2950.12 | 3168.04 | 90810500 | 3168.04 |
| 3 | Oct 01, 2020 | 3208.0 | 3496.24 | 3019.0 | 3036.15 | 116226100 | 3036.15 |
| 4 | Sep 01, 2020 | 3489.58 | 3552.25 | 2871.0 | 3148.73 | 115899300 | 3148.73 |


In [23]:
amazon_data.tail()

| | Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|---|
| 56 | May 01, 2016 | 663.92 | 724.23 | 656.0 | 722.79 | 90614500 | 722.79 |
| 57 | Apr 01, 2016 | 590.49 | 669.98 | 585.25 | 659.59 | 78464200 | 659.59 |
| 58 | Mar 01, 2016 | 556.29 | 603.24 | 538.58 | 593.64 | 94009500 | 593.64 |
| 59 | Feb 01, 2016 | 578.15 | 581.8 | 474.0 | 552.52 | 124144800 | 552.52 |
| 60 | Jan 01, 2016 | 656.29 | 657.72 | 547.18 | 587.0 | 130200900 | 587.0 |
