The goal of this notebook is to show how to scrape the business description data from Yahoo Finance. 

The resulting `stock_des.csv` file should already be provided to you, so you don't need to rerun the code below to generate it. 

This notebook walks through the following steps: 

1.   Mount your Google Drive and set the working directory. 
2.   Load the CSV file with the ticker data.  
3.   Loop through the list of tickers and get each business description by requesting the profile URL with `requests` and parsing the page with `BeautifulSoup`. 
4.   Clean the description data. 
5.   Save the descriptions into a separate CSV file for further use. 

**Note:** To be able to edit this Colab notebook, save a copy to your Drive via File > Save a copy in Drive. 


Mounting gives the notebook access to the files on your Google Drive. You'll need to allow Google Drive for desktop to access your Google Account and then copy the sign-in code into the authorization code field. 

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


In [None]:
import os

For more on setting the working directory, see the `0. Business Description and Assets/Revenue.ipynb` or `1. Tickers Data.ipynb` notebooks. 

In [None]:
# Set your working directory to a folder in your Google Drive. This way, if your notebook times out,
# your files will be saved in your Google Drive!

# the base Google Drive directory
root_dir = "/content/gdrive/MyDrive/BU/Year1/Summer/"

# choose where you want your project files to be saved
project_folder = "BA870/"

# make the project folder the current working directory
os.chdir(root_dir + project_folder)
os.getcwd()

'/content/gdrive/MyDrive/BU/Year1/Summer/BA870'
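
Note that `os.chdir` raises `FileNotFoundError` if the target folder does not exist yet. A minimal sketch of one way to guard against that, assuming you want the folder created on first run; the helper name `enter_project_dir` is hypothetical and not part of the original notebook:

```python
import os

def enter_project_dir(root_dir, project_folder):
    """Create the project folder if needed, switch into it, and return its path."""
    # os.path.join inserts a separator when root_dir lacks a trailing slash
    path = os.path.join(root_dir, project_folder)
    # exist_ok=True makes this a no-op when the folder already exists
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
    return os.getcwd()
```

Calling `enter_project_dir(root_dir, project_folder)` could stand in for the `os.chdir` and `os.getcwd` calls above. On Colab, the Drive still has to be mounted first.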

Load necessary libraries. 

In [None]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

Load the tickers data. 

In [None]:
tickers = pd.read_csv('tickers.csv') # load tickers.csv 

Loop through the tickers, build the profile URL for each one, and scrape its business description. For demonstration purposes, the `ticker` variable is limited to the first 5 items (`[:5]`); remove the slice to run the loop over the whole sample. The loop over all observations might take 30+ minutes.

In [None]:
# Collect the profile-page URL and the business description for each ticker
URL = [] # empty list for URLs
DES = [] # empty list for descriptions
ticker = tickers['Ticker'][:5] # for example purposes we limit the number of tickers to 5
for i in ticker:
  url = 'https://finance.yahoo.com/quote/' + i + '/profile'
  URL.append(url)
  page = requests.get(url) # visits the URL
  htmldata = BeautifulSoup(page.content, 'html.parser')
  # find the <p> element holding the business description; find() returns None if there is no match
  Business_Description = htmldata.find('p', {'class': 'Mt(15px) Lh(1.6)'})
  DES.append(Business_Description)
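
A run over the full ticker list involves many requests, and a single network error would abort the loop above. The following is a minimal sketch of a more forgiving loop, not the notebook's own method: the function `scrape_descriptions` and its `fetch` and `parse` hooks are hypothetical names introduced here, and the one-second default pause is an arbitrary assumption.

```python
import time

def scrape_descriptions(tickers, fetch, parse, delay=1.0):
    """Return {ticker: description or None}, continuing past failed requests.

    fetch(url) -> HTML and parse(html) -> description are supplied by the
    caller (hypothetical hooks, not part of the original notebook).
    """
    results = {}
    for symbol in tickers:
        url = f"https://finance.yahoo.com/quote/{symbol}/profile"
        try:
            results[symbol] = parse(fetch(url))
        except Exception:
            # A failed download or parse should not abort the whole run
            results[symbol] = None
        # Pause between requests to avoid hammering the server
        time.sleep(delay)
    return results
```

In this notebook, `fetch` could be something like `lambda u: requests.get(u, timeout=10).content`, and `parse` could wrap the `BeautifulSoup(...).find(...)` call from the cell above. Tickers that failed end up as `None`, so they are dropped by the same `dropna` step as tickers without a profile.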

In [None]:
# print(URL)
print(DES) # check the descriptions

Convert the results to a pandas DataFrame. 

In [None]:
# Create a new DataFrame that stores each ticker and its description
company_des = pd.DataFrame({'ticker':ticker,'description':DES})
company_des.head()

   ticker  description
0  AAPL    [Apple Inc. designs, manufactures, and markets...
1  MSFT    [Microsoft Corporation develops, licenses, and...
2  AMZN    [Amazon.com, Inc. engages in the retail sale o...
3  FB      [Facebook, Inc. develops products that enable ...
4  GOOGL   [Alphabet Inc. provides online advertising ser...


Clean the data: drop tickers with no description, convert the `description` column to string, and remove the HTML tags. 

In [None]:
# Drop the stocks that do not have Yahoo Finance company profiles 
company_des.dropna(inplace=True)
company_des['description'] = company_des['description'].astype(str)

# Remove the opening <p ...> tag from each description, trying every data-reactid value from 1 to 299
a = np.arange(1,300)
a = a.astype(str)
for i in a:
  company_des['description']=company_des['description'].str.replace('<p class="Mt(15px) Lh(1.6)" data-reactid="'+i+'">','',regex=False)

company_des['description']=company_des['description'].str.replace('</p>','',regex=False)
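
The loop above tries 299 literal tags one at a time. A single regular expression that accepts any `data-reactid` number can do the same in one pass. This is a sketch using Python's built-in `re` module; the helper name `strip_p_tags` and the sample string (including its `data-reactid` value) are made up for illustration.

```python
import re

# Matches the opening <p ...> tag with any data-reactid number, or the closing </p>
P_TAG = re.compile(r'<p class="Mt\(15px\) Lh\(1\.6\)" data-reactid="\d+">|</p>')

def strip_p_tags(raw):
    """Remove the wrapping paragraph tags from one scraped description string."""
    return P_TAG.sub("", raw)

# Illustrative input only, not real scraped data
sample = '<p class="Mt(15px) Lh(1.6)" data-reactid="141">Apple Inc. designs phones.</p>'
print(strip_p_tags(sample))
```

Applied to the whole column, this would look like `company_des['description'].map(strip_p_tags)`.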

In [None]:
# company_des.head()

The resulting CSV file (`stock_des.csv`) should already be provided to you, so the export below is commented out. 

In [None]:
# Export company_des into a csv file
# company_des.to_csv(r'stock_des.csv', index = False, header=True)