
# **Step 1: Loading modules**
Before we start scraping the target website, we need to import a few Python libraries.
*   “requests” sends HTTP requests to websites and returns their responses, the core step of web scraping.
*   “bs4/BeautifulSoup” parses the downloaded HTML so that we can search it and extract the data we need.
*   “pandas” provides data analysis tools for storing the results as tables and quickly manipulating and analysing them.
---


In [1]:
import requests 
from bs4 import BeautifulSoup 
import pandas as pd 
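
To make the division of labour between these libraries concrete, here is a minimal sketch that skips the network step and parses a small hand-written HTML string. The `html` snippet is an illustrative stand-in, not Yahoo's real markup; the two values are borrowed from the stock table shown later in this notebook.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written stand-in for a downloaded page (assumption: not Yahoo's real markup)
html = """
<table>
  <tr><td>AMD</td><td>83.75</td></tr>
  <tr><td>NIO</td><td>22.55</td></tr>
</table>
"""

# BeautifulSoup turns the raw HTML text into a searchable tree
soup = BeautifulSoup(html, "html.parser")
rows = [[cell.text for cell in tr.find_all("td")] for tr in soup.find_all("tr")]

# pandas arranges the extracted strings into a table
df = pd.DataFrame(rows, columns=["Symbol", "Price"])
print(df)
```

In the real workflow, `requests` would supply the HTML text that is hard-coded here.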

# **Step 2: Naïve Scraping Method (Scraping Whole Page)**
We will now introduce the simplest way to scrape data from a website.
*   Add the URL of the target website to the code.
*   Observe the stock price table on Yahoo! Finance and identify the columns that will be useful. Then use the "Inspect" feature in Chrome to view the underlying HTML.
*   Define a Python "list" for every column you identified.
*   Use a for-loop to take each cell found by BeautifulSoup and append its text to the matching list.


**Discussions**
1.   Discuss the advantages and disadvantages of the method above.
2.   If the column name of the underlying table in the website changes, does this method still work?
---






In [2]:
# Use requests and BeautifulSoup (BS) to scrape website data
active_stocks_url = "https://finance.yahoo.com/most-active"
r = requests.get(active_stocks_url)
data = r.text
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly to avoid a bs4 warning

# Define python lists for every column
codes=[]
names=[]
prices=[]
changes=[]
percent_changes=[]
total_volumes=[]
market_caps=[]
price_earning_ratios=[]
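
The download step above assumes the request succeeds. `requests.get` does not raise an exception for HTTP error codes on its own, so a failed request can silently hand an error page to BeautifulSoup. A minimal, more defensive sketch follows; the names `fetch_html` and `HEADERS`, the `get` parameter and the User-Agent string are illustrative assumptions, not part of the original notebook.

```python
import requests

# Assumption: some sites respond differently to clients without a browser-like
# User-Agent; the exact string here is only illustrative
HEADERS = {"User-Agent": "Mozilla/5.0"}

def fetch_html(url, timeout=10, get=requests.get):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

With this helper, `soup = BeautifulSoup(fetch_html(active_stocks_url), "html.parser")` would replace the three download lines. The `get` parameter exists only so the function can be tried with a stand-in instead of a live request.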

In [3]:
"""
Using a for-loop, find all the <tr> tags in "stockTable".
Every <tr> tag represents one row of stock data (saved as "listing").
For each "listing", find the relevant <td> tags and append their text to the matching Python list.
"""
# TODO: Fill in the relevant HTML tag in the find_all "brackets"
stockTable = soup.find('tbody')
for listing in stockTable.find_all('tr'):

    code = listing.find('td', attrs={'aria-label':'Symbol'})
    codes.append(code.text)

    name = listing.find('td', attrs={'aria-label':'Name'})
    names.append(name.text)

    price = listing.find('td', attrs={'aria-label':'Price (Intraday)'})
    prices.append(price.text)
    
    # TODO: Use the same method as above to extract the remaining columns
    change = listing.find('td', attrs={'aria-label':'Change'})
    changes.append(change.text)
    
    percent_change = listing.find('td', attrs={'aria-label':'% Change'})
    percent_changes.append(percent_change.text)
    
    total_volume = listing.find('td', attrs={'aria-label':'Volume'})
    total_volumes.append(total_volume.text)
    
    market_cap = listing.find('td', attrs={'aria-label':'Market Cap'})
    market_caps.append(market_cap.text)
    
    price_earning_ratio = listing.find('td', attrs={'aria-label':'PE Ratio (TTM)'})
    price_earning_ratios.append(price_earning_ratio.text)
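
The loop above assumes that every `find` call succeeds. `find` returns `None` when nothing matches, so `.text` raises an `AttributeError` as soon as a label is renamed or a cell is missing. One possible way to soften this, sketched with a hypothetical helper named `cell_text` and a hand-written row:

```python
from bs4 import BeautifulSoup

def cell_text(row, label):
    """Return the text of the <td> with this aria-label, or None if it is absent."""
    cell = row.find("td", attrs={"aria-label": label})
    return cell.text if cell is not None else None

# Hand-written row for illustration; the "PE Ratio (TTM)" cell is deliberately missing
row = BeautifulSoup(
    '<table><tr><td aria-label="Symbol">AMD</td></tr></table>', "html.parser"
).find("tr")

print(cell_text(row, "Symbol"))          # AMD
print(cell_text(row, "PE Ratio (TTM)"))  # None
```

Appending `None` keeps all the lists the same length, which matters when they are combined into a table later.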

In [4]:
"""
Use pandas to create a new data frame that aggregates all the Python lists into a single table.
You will need to know how to use a Python dictionary in this part.
"""
df = pd.DataFrame({ "Symbol":                codes, 
                    "Name":                  names, 
                    "Price":                 prices, 
                    "Change":                changes, 
                    "% Change":              percent_changes, 
                    "Market Cap":            market_caps, 
                    "Volume":                total_volumes, 
                    "PE Ratio (TTM)":        price_earning_ratios })
df

| | Symbol | Name | Price | Change | % Change | Market Cap | Volume | PE Ratio (TTM) |
|---|---|---|---|---|---|---|---|---|
| 0 | AMD | Advanced Micro Devices, Inc. | 83.75 | -0.04 | -0.05% | 135.718B | 86.467M | 31.43 |
| 1 | NIO | NIO Inc. | 22.55 | -0.11 | -0.49% | 37.67B | 74.775M | |
| 2 | AAPL | Apple Inc. | 135.35 | -0.52 | -0.38% | 2.191T | 73.165M | 22.05 |
| 3 | AMZN | Amazon.com, Inc. | 108.95 | 0.27 | +0.25% | 1.109T | 58.016M | 52.25 |
| 4 | LU | Lufax Holding Ltd | 6.46 | -0.36 | -5.28% | 14.766B | 48.012M | 6.32 |
| 5 | T | AT&T Inc. | 20.32 | 0.36 | +1.80% | 145.471B | 45.622M | 8.55 |
| 6 | VALE | Vale S.A. | 14.48 | -0.24 | -1.63% | 74.278B | 45.529M | 3.24 |
| 7 | F | Ford Motor Company | 11.48 | 0.02 | +0.17% | 45.333B | 45.411M | 4.0 |
| 8 | ITUB | Itaú Unibanco Holding S.A. | 4.57 | -0.08 | -1.72% | 44.694B | 42.271M | 8.31 |
| 9 | META | Meta Platforms, Inc. | 155.85 | -1.2 | -0.76% | 421.78B | 46.804M | 11.79 |
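
The dictionary-to-table step can be tried in isolation. In this small sketch, which reuses three values from the table above, each dictionary key becomes a column name and each list becomes that column's values. It also shows what happens when one list ends up shorter than the others, for example because a cell was skipped during scraping.

```python
import pandas as pd

symbols = ["AMD", "NIO", "AAPL"]
prices = ["83.75", "22.55", "135.35"]

# Each key becomes a column name; each list becomes that column's values
df = pd.DataFrame({"Symbol": symbols, "Price": prices})
print(df)

# Lists of different lengths cannot be lined up into rows, so pandas raises an error
try:
    pd.DataFrame({"Symbol": symbols, "Price": prices[:2]})
except ValueError as error:
    print("Mismatched lengths:", error)
```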


# **Step 3: Naïve Scraping Method (Scraping Individual Rows) - Optional**
*   Copy and paste the Yahoo Finance link for currencies.
*   Use Chrome Inspector to inspect the HTML elements.

**Discussions**
1.   How does this method differ from the previous one in execution efficiency?
2.   If the row header changes, does this method still work?
3.   When should we use whole-page scraping, and when should we use individual-row scraping?
---
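
The row-by-row approach relies on the `data-reactid` values following a regular pattern. As a small sketch of that arithmetic, the snippet below uses the `start`, `end` and `jump` values and the cell offsets from the code cell in this step. These numbers are hard-coded for the page as it looked when the notebook was written, so they should be treated as an example rather than as current values.

```python
start, end, jump = 40, 404, 14

# One id per table row: 40, 54, 68, ...
row_ids = list(range(start, end, jump))
print(len(row_ids))              # 26
print(row_ids[:3], row_ids[-1])  # [40, 54, 68] 390

# Offsets of the cells inside a row, relative to the row's own id
offsets = {"Symbol": 1, "Name": 3, "Last Price": 4, "Change": 5, "% Change": 7}
first_row_cells = {name: row_ids[0] + offset for name, offset in offsets.items()}
print(first_row_cells)
```

Because every id is derived from three fixed numbers, inserting a single extra element into the page shifts all of them at once.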

In [None]:
currencies_url = "https://finance.yahoo.com/currencies"
r = requests.get(currencies_url)
data = r.text
soup = BeautifulSoup(data, "html.parser")

codes=[]
names=[]
last_prices=[]
changes=[]
percent_changes=[]

# Find the starting and ending data-reactid, and the step between consecutive rows
start, end, jump = 40, 404, 14
for i in range(start, end, jump):
    listing = soup.find('tr', attrs={'data-reactid':i})
    if listing is None:  # no row carries this id, e.g. after a page redesign
        continue
    print(listing)

    code = listing.find('td', attrs={'data-reactid':i+1})
    codes.append(code.text)

    name = listing.find('td', attrs={'data-reactid':i+3})
    names.append(name.text)
    
    last_price = listing.find('td', attrs={'data-reactid':i+4})
    last_prices.append(last_price.text)

    change = listing.find('td', attrs={'data-reactid':i+5})
    changes.append(change.text)

    percent_change = listing.find('td', attrs={'data-reactid':i+7})
    percent_changes.append(percent_change.text)

pd.DataFrame({"Symbol": codes, 
              "Name": names, 
              "Last Price": last_prices, 
              "Change": changes, 
              "% Change": percent_changes})

# **Step 4: Header Scraping Method**
This is a more advanced scraping method. The code scrapes the table headers automatically, so we don't have to define a list for each column ourselves, which makes the code simpler and cleaner.

*   Copy and paste the Yahoo Finance link for cryptocurrencies
*   Scrape the headers and put them into a Python list
*   Put the relevant data into a Python dictionary
---

In [6]:
crypto_url = "https://finance.yahoo.com/cryptocurrencies"
r = requests.get(crypto_url)
data = r.text
soup = BeautifulSoup(data, "html.parser")

# Scrape all the headers
raw_data = {}
headers = []
for header_row in soup.find_all('thead'):
  for header in header_row.find_all('th'):
    raw_data[header.text] = []
    headers.append(header.text)
  
for rows in soup.find_all('tbody'):
  for row in rows.find_all('tr'):
    for idx, cell in enumerate(row.find_all('td')):
      # print(dir(cell))
      raw_data[headers[idx]].append(cell.text)

pd.DataFrame(raw_data)

| | Symbol | Name | Price (Intraday) | Change | % Change | Market Cap | Volume in Currency (Since 0:00 UTC) | Volume in Currency (24Hr) | Total Volume All Currencies (24Hr) | Circulating Supply | 52 Week Range | Day Chart |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BTC-USD | Bitcoin USD | 20009.86 | -661.2 | -3.20% | 381.675B | 28.553B | 28.553B | 28.553B | 19.074M | | |
| 1 | ETH-USD | Ethereum USD | 1053.92 | -69.24 | -6.16% | 127.808B | 15.006B | 15.006B | 15.006B | 121.269M | | |
| 2 | USDT-USD | Tether USD | 0.998942 | -0.000139 | -0.01% | 66.91B | 46.119B | 46.119B | 46.119B | 66.981B | | |
| 3 | USDC-USD | USD Coin USD | 1.0002 | 0.0001 | +0.01% | 55.873B | 4.59B | 4.59B | 4.59B | 55.864B | | |
| 4 | BNB-USD | Binance Coin USD | 214.64 | -3.72 | -1.70% | 35.046B | 1.049B | 1.049B | 1.049B | 163.277M | | |
| 5 | BUSD-USD | Binance USD USD | 0.999332 | 0.000213 | +0.02% | 17.518B | 4.346B | 4.346B | 4.346B | 17.529B | | |
| 6 | XRP-USD | XRP USD | 0.323428 | -0.004525 | -1.38% | 15.636B | 1.065B | 1.065B | 1.065B | 48.343B | | |
| 7 | ADA-USD | Cardano USD | 0.458245 | -0.019778 | -4.14% | 15.55B | 811.653M | 811.653M | 811.653M | 33.934B | | |
| 8 | SOL-USD | Solana USD | 34.39 | -1.52 | -4.23% | 11.779B | 1.533B | 1.533B | 1.533B | 342.508M | | |
| 9 | DOGE-USD | Dogecoin USD | 0.0618 | -0.003515 | -5.38% | 8.199B | 633.292M | 633.292M | 633.292M | 132.671B | | |
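
The header-driven idea can also be tried offline. The sketch below uses a hand-written table (an assumption: far simpler than Yahoo's real markup, with two rows borrowed from the output above) and builds one dictionary per row by zipping the headers with the cells. This is a variation on the column-dictionary approach in the cell above, not a replacement for it.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written table for illustration only
html = """
<table>
  <thead><tr><th>Symbol</th><th>Name</th><th>% Change</th></tr></thead>
  <tbody>
    <tr><td>BTC-USD</td><td>Bitcoin USD</td><td>-3.20%</td></tr>
    <tr><td>ETH-USD</td><td>Ethereum USD</td><td>-6.16%</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# The column names come from the page itself, not from the code
headers = [th.text for th in soup.find("thead").find_all("th")]

# Pair every cell with the header in the same position
records = [
    dict(zip(headers, (td.text for td in tr.find_all("td"))))
    for tr in soup.find("tbody").find_all("tr")
]
df = pd.DataFrame(records)
print(df)
```

If the site renames or reorders a column, the headers and the data change together, so the code keeps working without edits.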


# **Step 5: Making a generic scraping function**

We are going to turn the header method into a Python function. This function also works for other types of financial products!

*   Define a good name for the function
*   Define the input parameters and input values

---


In [7]:
def scrape_table(url):
    """Scrape the HTML table at `url` into a DataFrame, one column per header."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    headers = [header.text for listing in soup.find_all('thead') for header in listing.find_all('th')]
    raw_data = {header: [] for header in headers}

    for rows in soup.find_all('tbody'):
        for row in rows.find_all('tr'):
            cells = row.find_all('td')
            # Skip rows whose cell count does not match the headers
            if len(cells) != len(headers):
                continue
            for idx, cell in enumerate(cells):
                raw_data[headers[idx]].append(cell.text)

    return pd.DataFrame(raw_data)
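
One subtlety when comparing a row against the headers: calling `len()` on a BeautifulSoup tag counts all of its child nodes, including whitespace text between tags, so counting the `td` cells explicitly tends to be more robust. A small sketch with a hand-written row illustrates the difference:

```python
from bs4 import BeautifulSoup

# Pretty-printed HTML puts newlines between the cells (hand-written example)
html = "<table><tbody><tr>\n<td>BTC-USD</td>\n<td>20009.86</td>\n</tr></tbody></table>"
row = BeautifulSoup(html, "html.parser").find("tr")

children = len(row)              # child nodes: 2 cells plus 3 whitespace strings
cells = len(row.find_all("td"))  # cells only
print(children, cells)
```

On a page whose HTML has no whitespace between cells the two counts agree, which is why the difference is easy to miss.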

# **Concept Challenge: Scrape other products**
Try using the generic function to scrape other products in Yahoo Finance!
*   Gainers
*   Losers
*   Top ETFs
---


In [8]:
cryptocurrencies = scrape_table("https://finance.yahoo.com/cryptocurrencies")
currencies = scrape_table("https://finance.yahoo.com/currencies")
commodities = scrape_table("https://finance.yahoo.com/commodities")
activestocks = scrape_table("https://finance.yahoo.com/most-active")
techstocks = scrape_table("https://finance.yahoo.com/industries/software_services")
gainers = scrape_table("https://finance.yahoo.com/gainers")
losers = scrape_table("https://finance.yahoo.com/losers")
indices = scrape_table("https://finance.yahoo.com/world-indices")
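
When scraping many pages in a row, one failing page would otherwise stop the whole cell. A possible pattern, sketched here with a hypothetical helper named `scrape_many`, is to keep the pages in a dictionary and record failures instead of raising. The scraper function is passed in as a parameter, so `scrape_table` from Step 5 can be used unchanged.

```python
PAGES = {
    "gainers": "https://finance.yahoo.com/gainers",
    "losers": "https://finance.yahoo.com/losers",
}

def scrape_many(pages, scraper):
    """Apply `scraper` to every URL, collecting tables and recording failures."""
    tables, failures = {}, {}
    for name, url in pages.items():
        try:
            tables[name] = scraper(url)
        except Exception as error:  # broad on purpose: one bad page should not stop the rest
            failures[name] = str(error)
    return tables, failures

# Intended usage in this notebook (needs network access):
# tables, failures = scrape_many(PAGES, scrape_table)
```

Afterwards `tables["gainers"]` holds a DataFrame, and `failures` lists any pages that need a closer look.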

# **Step 6: Data Wrangling**
**Datatype conversion**

This part makes use of the stock data we collected with our web scraper. However, every value is stored as a "string". In other words, the data is treated as text even when it represents a number. We need to convert the columns into the right formats for the chart plotting tools.

Steps in data conversion:

*   Remove all the commas from the numeric data, and convert the columns that contain numbers to floating point.
*   Convert all columns that contain dates to datetime.
*   Recover abbreviated numbers; for example, recover "1M" to 1000000.
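
As a small worked illustration of the first step, the sketch below cleans a few strings taken from the scraped tables. The `"N/A"` placeholder for a missing PE ratio is an assumption about how a missing value might appear; `errors="coerce"` turns such text into `NaN` instead of raising an error.

```python
import pandas as pd

# Sample strings from the tables above; "N/A" is an assumed placeholder for a missing value
raw = pd.DataFrame({
    "% Change": ["-0.05%", "+0.25%", "+1.80%"],
    "PE Ratio (TTM)": ["31.43", "N/A", "8.55"],
})

cleaned = raw.copy()
# Strip the symbols that are not part of the number, then convert
cleaned["% Change"] = pd.to_numeric(
    cleaned["% Change"].str.replace(",", "", regex=False).str.replace("%", "", regex=False)
)
# errors="coerce" replaces unparseable text with NaN instead of raising
cleaned["PE Ratio (TTM)"] = pd.to_numeric(cleaned["PE Ratio (TTM)"], errors="coerce")

print(cleaned.dtypes)
print(cleaned)
```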

In [9]:
from datetime import datetime

def convert_column_to_float(df, columns):
    """Strip commas and percent signs, then convert the given columns to numbers."""
    for column in columns:
        df[column] = pd.to_numeric(
            df[column].str.replace(',', '', regex=False).str.replace('%', '', regex=False))
    return df

def convert_column_to_datetime(df, columns):
    """Convert the given columns to datetime."""
    for column in columns:
        df[column] = pd.to_datetime(df[column])
    return df

def revert_scaled_number(number):
    """Expand an abbreviated number string, e.g. "1M" -> 1000000.0."""
    mapping = {'M': 1000000, 'B': 1000000000, 'T': 1000000000000}
    scale = number[-1]
    if scale not in mapping:
        return float(number.replace(',', ''))
    return float(number[0:-1].replace(',', '')) * mapping[scale]
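
`revert_scaled_number` works on one string at a time and expects a non-empty value. To apply the same idea to a whole column that may contain blanks, one option is a variant that returns `NaN` for missing text. The sketch below uses a hypothetical name, `parse_scaled`, and sample values from the tables above.

```python
import math
import pandas as pd

SCALES = {"M": 1e6, "B": 1e9, "T": 1e12}

def parse_scaled(text):
    """Variant of revert_scaled_number that returns NaN for empty or missing text."""
    if not isinstance(text, str) or not text.strip():
        return math.nan
    text = text.strip().replace(",", "")
    if text[-1] in SCALES:
        return float(text[:-1]) * SCALES[text[-1]]
    return float(text)

# Market Cap and Volume strings from the scraped tables, plus one blank
market_caps = pd.Series(["135.718B", "2.191T", "86.467M", ""])
converted = market_caps.map(parse_scaled)
print(converted)
```

`Series.map` calls the function once per value, so the result is a numeric column that can be sorted and plotted.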

**Filtering dataframe**

- We can now scrape all the active stocks easily
- Let's separate them into rising and losing stocks

In [10]:
# first scrape the active stocks table using the web scraper function
activestocks = scrape_table("https://finance.yahoo.com/most-active")
# change the data type of the dataframe columns
activestocks = convert_column_to_float(activestocks, ['% Change'])

# filter the dataframe by % Change (pos/neg)
rising = activestocks[activestocks['% Change'] > 0]
losing = activestocks[activestocks['% Change'] < 0]
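
The filter works through a boolean mask: comparing a column with a number yields one `True`/`False` per row, and indexing the dataframe with that mask keeps only the `True` rows. A small offline sketch follows; the ticker "XYZ" is made up to show that a stock with exactly 0% change lands in neither group.

```python
import pandas as pd

stocks = pd.DataFrame({
    "Symbol": ["AMD", "AMZN", "T", "XYZ"],  # "XYZ" is a made-up ticker for the zero case
    "% Change": [-0.05, 0.25, 1.80, 0.0],
})

is_rising = stocks["% Change"] > 0  # boolean Series, one value per row
rising = stocks[is_rising]
losing = stocks[stocks["% Change"] < 0]
unchanged = stocks[stocks["% Change"] == 0]

print(is_rising.tolist())
print(len(rising), len(losing), len(unchanged))
```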

**Sorting dataframe**

- It's not yet clear which stock is the top gainer or loser
- Sorting the dataframe makes this easy to see

In [11]:
rising = rising.sort_values(by=['% Change'], ascending=False)
losing = losing.sort_values(by=['% Change'], ascending=True)
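
If only the top few movers are needed, pandas also offers `nlargest` and `nsmallest`, which can stand in for a full sort. A short sketch using percentage changes from the tables in this notebook:

```python
import pandas as pd

stocks = pd.DataFrame({
    "Symbol": ["RBLX", "PLTR", "T", "LU", "VALE"],
    "% Change": [4.99, 3.44, 1.80, -5.28, -1.63],
})

top_gainers = stocks.nlargest(2, "% Change")   # two highest values, largest first
top_losers = stocks.nsmallest(2, "% Change")   # two lowest values, smallest first

print(top_gainers["Symbol"].tolist())  # ['RBLX', 'PLTR']
print(top_losers["Symbol"].tolist())   # ['LU', 'VALE']
```

Both approaches require the column to be numeric already; strings would be compared character by character instead.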

Finally, if you prefer, you can convert the values back to strings and add back the "+/-" sign and the percentage symbol.

In [12]:
rising['% Change']='+' + rising['% Change'].astype(str) + '%'
losing['% Change']=losing['% Change'].astype(str) + '%'
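
Converting with `astype(str)` drops trailing zeros, so 1.8 becomes "+1.8%" rather than "+1.80%". As an alternative sketch, a format string can add the sign and fix the number of decimals in one step. The `+` flag in the format specification prints a sign for both positive and negative numbers.

```python
import pandas as pd

changes = pd.Series([4.99, 1.8, -0.05])

# "{:+.2f}%" means: always show the sign, two decimals, then a literal percent sign
labels = changes.map("{:+.2f}%".format)
print(labels.tolist())  # ['+4.99%', '+1.80%', '-0.05%']
```

Because the sign is handled by the format itself, the same line works for both the rising and the losing dataframe.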

In [13]:
rising

| | Symbol | Name | Price (Intraday) | Change | % Change | Volume | Avg Vol (3 month) | Market Cap | PE Ratio (TTM) | 52 Week Range |
|---|---|---|---|---|---|---|---|---|---|---|
| 19 | RBLX | Roblox Corporation | 31.12 | 1.48 | +4.99% | 32.73M | 28.876M | 18.46B | | |
| 10 | PLTR | Palantir Technologies Inc. | 9.01 | 0.3 | +3.44% | 43.072M | 40.255M | 18.44B | | |
| 5 | T | AT&T Inc. | 20.32 | 0.36 | +1.8% | 45.622M | 52.356M | 145.471B | 8.55 | |
| 21 | NLY | Annaly Capital Management, Inc. | 6.01 | 0.06 | +1.01% | 32.091M | 33.157M | 9.382B | 3.4 | |
| 14 | CCL | Carnival Corporation & plc | 9.62 | 0.05 | +0.52% | 38.602M | 39.521M | 11.308B | | |
| 18 | AAL | American Airlines Group Inc. | 13.1 | 0.06 | +0.46% | 34.102M | 36.432M | 8.509B | | |
| 3 | AMZN | Amazon.com, Inc. | 108.95 | 0.27 | +0.25% | 58.016M | 87.079M | 1.109T | 52.25 | |
| 7 | F | Ford Motor Company | 11.48 | 0.02 | +0.17% | 45.411M | 63.517M | 45.333B | 4.0 | |
