<h1>Extracting Stock Data Using Web Scraping</h1>


<center>
    <br>

  <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/Images/netflix.png"> </center> 


In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
import warnings
# Ignore FutureWarning messages only; other warnings are still shown
warnings.filterwarnings("ignore", category=FutureWarning)

## Using Web Scraping to Extract Stock Data


I will extract Netflix stock data [https://finance.yahoo.com/quote/NFLX/history?p=NFLX](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/netflix_data_webpage.html).


The web page contains a table with the columns Date, Open, High, Low, Close, Adj Close, and Volume, from which we must extract the following columns:

* Date 

* Open  

* High 

* Low 

* Close 

* Volume 



# Steps for extracting the data
1. Send an HTTP request to the web page using the requests library.
2. Parse the HTML content of the web page using BeautifulSoup.
3. Identify the HTML tags that contain the data you want to extract.
4. Use BeautifulSoup methods to extract the data from the HTML tags.
5. Print the extracted data.


### Step 1: Send an HTTP request to the web page


In [4]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/netflix_data_webpage.html"

In [5]:
data  = requests.get(url).text
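
The request above assumes the download succeeds. A minimal sketch of a more defensive version, assuming a hypothetical helper named `fetch_html` and an arbitrary 10-second timeout (neither is part of the original lab):

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and the default timeout are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Calling `data = fetch_html(url)` would then replace the plain `requests.get(url).text` call, and a failed download raises an exception instead of silently returning an error page.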

### Step 2: Parse the HTML content


## Parsing the data using the BeautifulSoup library
* Create a new BeautifulSoup object.
<br>
<br>
<b>Note: </b>To create a BeautifulSoup object in Python, you typically pass two arguments to its constructor:

1. The HTML or XML content that you want to parse, as a string.
2. The name of the parser to use. This argument is optional; if you omit it, BeautifulSoup picks the best parser that is installed and emits a warning.

In this lab we use the "html5lib" parser, which must be installed separately.


In [6]:
soup = BeautifulSoup(data, 'html5lib')
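
To see what the parsed object gives us, here is a small sketch on a made-up HTML string. It uses `html.parser`, which ships with Python, so it runs even where `html5lib` is not installed; `sample_html` and `soup_demo` are illustrative names only.

```python
from bs4 import BeautifulSoup

# A tiny, made-up document standing in for the downloaded page
sample_html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Second argument selects the parser; "html.parser" is Python's built-in one
soup_demo = BeautifulSoup(sample_html, "html.parser")

title_text = soup_demo.title.text          # text inside <title>
paragraph_text = soup_demo.find("p").text  # text inside the first <p>
print(title_text, paragraph_text)
```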

### Step 3: Identify the HTML tags


The web page contains a table, so I will scrape the HTML and convert the table into a data frame. I start with an empty data frame that has the following columns (the code below also keeps "Adj Close"):



* "Date"
* "Open"
* "High" 
* "Low" 
* "Close"
* "Volume"


In [18]:
netflix_data = pd.DataFrame(columns=["Date", "Open", "High", "Low", "Close", "Adj Close","Volume"])

<hr>
<hr>
<center>

### Working on HTML table  </center>
<br>

The following tags are used to create HTML tables.

* &lt;table&gt;: This tag is a root tag used to define the start and end of the table. All the content of the table is enclosed within these tags. 


* &lt;tr&gt;: This tag is used to define a table row. Each row of the table is defined within this tag.

* &lt;td&gt;: This tag is used to define a table cell. Each cell of the table is defined within this tag. You can specify the content of the cell between the opening and closing &lt;td&gt; tags.

* &lt;th&gt;: This tag is used to define a header cell in the table. The header cell is used to describe the contents of a column or row. By default, the text inside a &lt;th&gt; tag is bold and centered.

* &lt;tbody&gt;: This tag groups the main content (the body rows) of the table. It contains one or more &lt;tr&gt; elements.
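
The tags listed above can be seen together in a small sketch. The table below is made up for illustration (it reuses two dates and closing prices from the Netflix data), and `html.parser` is used so the snippet has no extra dependencies.

```python
from bs4 import BeautifulSoup

# A minimal table using <table>, <thead>, <tbody>, <tr>, <th>, and <td>
table_html = """
<table>
  <thead><tr><th>Date</th><th>Close</th></tr></thead>
  <tbody>
    <tr><td>Jun 01, 2021</td><td>528.21</td></tr>
    <tr><td>May 01, 2021</td><td>502.81</td></tr>
  </tbody>
</table>
"""

table_soup = BeautifulSoup(table_html, "html.parser")

# Header cells describe the columns
headers = [th.text for th in table_soup.find_all("th")]

# Body rows live inside <tbody>; each row holds one <td> per column
body_rows = table_soup.find("tbody").find_all("tr")
first_row = [td.text for td in body_rows[0].find_all("td")]
```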

<hr>
<hr>



### Step 4: Use a BeautifulSoup method for extracting data



   * The <i>find()</i> method returns the first tag that matches, or <code>None</code> if there is no match.
   * The <i>find_all()</i> method returns a list of all matching tags in the HTML.
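
The difference between the two methods is easiest to see on a tiny, made-up snippet (the variable names here are illustrative):

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup("<ul><li>a</li><li>b</li></ul>", "html.parser")

first_item = doc.find("li")      # only the first matching tag
all_items = doc.find_all("li")   # list-like collection of every match
missing = doc.find("table")      # None, because no <table> exists

item_texts = [li.text for li in all_items]
```

Because `find()` can return `None`, chaining calls such as `soup.find("tbody").find_all("tr")` raises an `AttributeError` when the page has no matching tag.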


In [19]:
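# Loop over every row in the table body and copy its cell text into the data frame
for row in soup.find("tbody").find_all("tr"):
    col = row.find_all("td")
    date = col[0].text
    open_price = col[1].text  # named open_price to avoid confusion with the built-in open()
    high = col[2].text
    low = col[3].text
    close = col[4].text
    adj_close = col[5].text
    volume = col[6].text
    # Append the values as a new row at the next integer index
    netflix_data.loc[len(netflix_data)] = [date, open_price, high, low, close, adj_close, volume]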
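
Appending one row at a time with `.loc` works for a small table, but it gets slower as the frame grows. A minimal sketch of an alternative, assuming the same seven-cell row layout: collect each row as a dict, then build the data frame once. The `sample` HTML and the names `rows` and `frame` are made up for illustration; the two rows reuse values from the Netflix table.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Two illustrative rows in the same layout as the Netflix table
sample = """
<table><tbody>
<tr><td>Jun 01, 2021</td><td>504.01</td><td>536.13</td><td>482.14</td>
    <td>528.21</td><td>528.21</td><td>78560600</td></tr>
<tr><td>May 01, 2021</td><td>512.65</td><td>518.95</td><td>478.54</td>
    <td>502.81</td><td>502.81</td><td>66927600</td></tr>
</tbody></table>
"""

columns = ["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume"]

rows = []
for tr in BeautifulSoup(sample, "html.parser").find("tbody").find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    # Skip rows that do not have exactly one cell per column
    if len(cells) == len(columns):
        rows.append(dict(zip(columns, cells)))

# Build the data frame in a single step from the collected rows
frame = pd.DataFrame(rows, columns=columns)
```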
    

### Step 5: Print the extracted data


We can now print the data frame using the head() or tail() method.


In [20]:
netflix_data.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,"Jun 01, 2021",504.01,536.13,482.14,528.21,528.21,78560600
1,"May 01, 2021",512.65,518.95,478.54,502.81,502.81,66927600
2,"Apr 01, 2021",529.93,563.56,499.0,513.47,513.47,111573300
3,"Mar 01, 2021",545.57,556.99,492.85,521.66,521.66,90183900
4,"Feb 01, 2021",536.79,566.65,518.28,538.85,538.85,61902300
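
Every value collected through `.text` is a string, even when it looks like a number. A short sketch of converting the columns to proper types, assuming the date format seen above (`Jun 01, 2021`) and numeric strings without thousands separators; the small `scraped` frame is an illustrative stand-in for `netflix_data`.

```python
import pandas as pd

# Stand-in for the scraped frame: every column holds strings
scraped = pd.DataFrame(
    {
        "Date": ["Jun 01, 2021", "May 01, 2021"],
        "Open": ["504.01", "512.65"],
        "Volume": ["78560600", "66927600"],
    }
)

# Parse dates written as abbreviated month, day, year
scraped["Date"] = pd.to_datetime(scraped["Date"], format="%b %d, %Y")

# Convert the numeric-looking strings to numbers
for name in ["Open", "Volume"]:
    scraped[name] = pd.to_numeric(scraped[name])
```

If the page formats large numbers with commas, they would need to be stripped before calling `pd.to_numeric`.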


# Extracting data using the `pandas` library


In [24]:
read_html_pandas_data = pd.read_html(url)

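Alternatively, convert the BeautifulSoup object to a string and pass that to `read_html`. Newer pandas versions expect literal HTML to be wrapped in `StringIO`.


In [27]:
from io import StringIO
read_html_pandas_data = pd.read_html(StringIO(str(soup)))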

In [28]:
netflix_dataframe = read_html_pandas_data[0]

netflix_dataframe.head()

Unnamed: 0,Date,Open,High,Low,Close*,Adj Close**,Volume
0,"Jun 01, 2021",504.01,536.13,482.14,528.21,528.21,78560600
1,"May 01, 2021",512.65,518.95,478.54,502.81,502.81,66927600
2,"Apr 01, 2021",529.93,563.56,499.0,513.47,513.47,111573300
3,"Mar 01, 2021",545.57,556.99,492.85,521.66,521.66,90183900
4,"Feb 01, 2021",536.79,566.65,518.28,538.85,538.85,61902300
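
`read_html` returns a list with one data frame per table it finds, so picking the right one by position can be fragile. A hedged sketch of selecting a table by its text instead, using the `match` parameter of `pd.read_html`. The `page` string is made up for illustration, and the snippet assumes an HTML parser backend for pandas (`lxml`, or `bs4` with `html5lib`) is installed.

```python
from io import StringIO

import pandas as pd

# A made-up page with an unrelated table followed by a price table
page = """
<table><tr><th>Note</th></tr><tr><td>unrelated</td></tr></table>
<table>
  <thead><tr><th>Date</th><th>Close*</th></tr></thead>
  <tbody>
    <tr><td>Jun 01, 2021</td><td>528.21</td></tr>
    <tr><td>May 01, 2021</td><td>502.81</td></tr>
  </tbody>
</table>
"""

# Keep only tables whose text matches "Close"
tables = pd.read_html(StringIO(page), match="Close")
prices = tables[0]
```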


## TESLA

In [4]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/revenue.htm"
data = requests.get(url).text

In [11]:
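from io import StringIO
# Wrap the HTML string in StringIO; newer pandas versions expect this for literal HTML
tables = pd.read_html(StringIO(data))
tesla_revenue = tables[1]  # the second table on the page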
tesla_revenue.columns = ['Date', 'Revenue']
tesla_revenue

Unnamed: 0,Date,Revenue
0,2022-09-30,"$21,454"
1,2022-06-30,"$16,934"
2,2022-03-31,"$18,756"
3,2021-12-31,"$17,719"
4,2021-09-30,"$13,757"
5,2021-06-30,"$11,958"
6,2021-03-31,"$10,389"
7,2020-12-31,"$10,744"
8,2020-09-30,"$8,771"
9,2020-06-30,"$6,036"


In [12]:
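# Strip the literal dollar sign; regex=False keeps "$" from being read as a regex anchor
tesla_revenue["Revenue"] = tesla_revenue["Revenue"].str.replace("$", "", regex=False)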
tesla_revenue

Unnamed: 0,Date,Revenue
0,2022-09-30,"21,454"
1,2022-06-30,"16,934"
2,2022-03-31,"18,756"
3,2021-12-31,"17,719"
4,2021-09-30,"13,757"
5,2021-06-30,"11,958"
6,2021-03-31,"10,389"
7,2020-12-31,"10,744"
8,2020-09-30,"8,771"
9,2020-06-30,"6,036"


In [13]:
tesla_revenue.dropna(inplace=True)

In [14]:
import locale
locale.setlocale(locale.LC_ALL,'en_US.UTF-8')

'en_US.UTF-8'

In [15]:
tesla_revenue["Revenue"]=tesla_revenue["Revenue"].apply(locale.atof)
tesla_revenue

Unnamed: 0,Date,Revenue
0,2022-09-30,21454.0
1,2022-06-30,16934.0
2,2022-03-31,18756.0
3,2021-12-31,17719.0
4,2021-09-30,13757.0
5,2021-06-30,11958.0
6,2021-03-31,10389.0
7,2020-12-31,10744.0
8,2020-09-30,8771.0
9,2020-06-30,6036.0
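
The `locale` approach depends on the `en_US.UTF-8` locale being available on the machine. A sketch of an alternative that does not, assuming the revenue strings contain only digits, `$`, and `,`. The frame `revenue_demo` is an illustrative stand-in that reuses three values from the table above.

```python
import pandas as pd

# Stand-in for the raw revenue column before any cleaning
revenue_demo = pd.DataFrame(
    {
        "Date": ["2022-09-30", "2022-06-30", "2022-03-31"],
        "Revenue": ["$21,454", "$16,934", "$18,756"],
    }
)

# Remove dollar signs and thousands separators in one pass;
# inside a character class, "$" is a literal character
cleaned = revenue_demo["Revenue"].str.replace(r"[$,]", "", regex=True)

# Convert the cleaned strings to numbers
revenue_demo["Revenue"] = pd.to_numeric(cleaned)
```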


In [24]:
tesla_revenue.info()

<class 'pandas.core.frame.DataFrame'>
Index: 53 entries, 0 to 53
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   Date     53 non-null     datetime64[ns]
 1   Revenue  53 non-null     float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 1.2 KB


In [25]:
# Convert the Date column from strings to datetime64[ns]
tesla_revenue["Date"] = pd.to_datetime(tesla_revenue["Date"])
tesla_revenue

Unnamed: 0,Date,Revenue
0,2022-09-30,21454.0
1,2022-06-30,16934.0
2,2022-03-31,18756.0
3,2021-12-31,17719.0
4,2021-09-30,13757.0
5,2021-06-30,11958.0
6,2021-03-31,10389.0
7,2020-12-31,10744.0
8,2020-09-30,8771.0
9,2020-06-30,6036.0
