# Data Collection

Before we can analyze stocks, we need data to analyze. We will collect it by scraping price history over a specified timeframe for a specified list of stocks.


## Getting Started
To get started, we must install and import some libraries:

In [1]:
import sys
!{sys.executable} -m pip install yfinance
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install pandas
print("======================")


import yfinance as yf
import pandas as pd

print("Successfully imported libraries")

Successfully imported libraries


We will begin by specifying a list of tickers we want to scrape:

In [2]:
tickers = ["MSFT", "AAPL"]

From here, we have to define some options:

- Interval: This is how often a data point is sampled (for example, `1d` for one row per trading day)
- Time Frame (2 options, only one of which may be set):
    - A)
        - Period: This is a lookback window written as a short string (1d, 5d, 1mo, 1y, ...)
    - B)
        - Start Date: This is the starting date for data scraping
        - End Date: This is the ending date for data scraping
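
To make the relationship between the two options concrete, here is a minimal sketch that turns a period string into an approximate start/end pair. The helper `period_to_dates` and the names `example_start` and `example_end` are hypothetical and are not part of yfinance; the conversion is an approximation and may not match how yfinance resolves a period internally.

```python
import pandas as pd

def period_to_dates(period, today):
    """Approximate a period string such as '10y' as a (start, end) pair of date strings."""
    if period in ("ytd", "max"):
        # These two depend on the calendar or on the listing date, so they are not handled here.
        raise ValueError(f"Unsupported period: {period!r}")
    end = pd.Timestamp(today)
    # Check the two-letter suffix 'mo' before the one-letter suffixes.
    if period.endswith("mo"):
        offset = pd.DateOffset(months=int(period[:-2]))
    elif period.endswith("y"):
        offset = pd.DateOffset(years=int(period[:-1]))
    elif period.endswith("d"):
        offset = pd.DateOffset(days=int(period[:-1]))
    else:
        raise ValueError(f"Unsupported period: {period!r}")
    start = end - offset
    return start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d")

# The reference date is fixed here so the example is reproducible.
example_start, example_end = period_to_dates("10y", "2024-03-15")
print(example_start, example_end)  # 2014-03-15 2024-03-15
```

Option A is usually more convenient for "the last N years" style requests, while option B gives a window that does not move when the notebook is rerun on a later day.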

In [3]:
period = '10y'
start_date = None
end_date = None
interval = "1d"

Just a quick check to ensure variables are instantiated correctly:

In [4]:
valid_periods = ['1d','5d','1mo','3mo','6mo','1y','2y','5y','10y','ytd','max']
valid_intervals = ['1m','2m','5m','15m','30m','60m','90m','1h','1d','5d','1wk','1mo','3mo']

if not tickers: #catches both None and an empty list
    raise ValueError("You must specify a list of tickers to scrape")

if period is None and start_date is None and end_date is None:
    raise ValueError("You must specify one timeframe in order to scrape")

if period is None and (start_date is None or end_date is None):
    raise ValueError("You must specify both ends of the timeframe in order to scrape")

if period is not None and (start_date is not None or end_date is not None):
    raise ValueError("You can only specify one type of timeframe in order to scrape")

if period is not None and period not in valid_periods: #skip this check when start/end dates are used
    raise ValueError("Please input a valid period")

if interval not in valid_intervals:
    raise ValueError("Please input a valid time interval")

print("Passed!")

Passed!


## Scraping Data
Now that we have defined the scraping parameters, we can begin to scrape. We do this by calling yfinance's `download` function once with all of the tickers. However, we have to consider two cases:
- We are using period
- We are using start/end dates


In [5]:
ticker_string = ' '.join(tickers) #convert from list to space-separated string
print(f"Ticker String: \n{ticker_string}")

if period is not None: #if using period
    data = yf.download(
        ticker_string,
        period = period,
        interval = interval,
        group_by = 'ticker',
        threads = True
    )
else: #if using start/end dates
    data = yf.download(
        ticker_string,
        start = start_date,
        end = end_date,
        interval = interval,
        group_by = 'ticker',
        threads = True
    )

Ticker String: 
MSFT AAPL
[*********************100%***********************]  2 of 2 completed


At this point, we have a single DataFrame with two-level columns: the first level is the ticker and the second is the field (Open, High, Low, Close, ...). To put this data in CSV files, we will create one CSV file per ticker containing the relevant information.
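
To illustrate how a ticker is selected out of such a frame, here is a small sketch built on made-up numbers rather than downloaded data. The names `sample` and `msft` and all of the values are hypothetical; only the column layout (ticker first, then field) mirrors what `group_by = 'ticker'` is expected to produce.

```python
import pandas as pd

# Two-level columns: ticker on the outer level, field on the inner level.
columns = pd.MultiIndex.from_product([["MSFT", "AAPL"], ["Open", "Close"]])
dates = pd.to_datetime(["2024-01-02", "2024-01-03"])

# Made-up prices, one row per date and one column per (ticker, field) pair.
sample = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
    index=dates,
    columns=columns,
)

# Indexing by the outer level returns an ordinary single-level frame for that ticker.
msft = sample["MSFT"]
print(list(msft.columns))  # ['Open', 'Close']
print(msft.loc[pd.Timestamp("2024-01-03"), "Close"])  # 6.0
```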

In [6]:
import os

os.makedirs("./data", exist_ok=True) #make sure the output folder exists

for ticker in tickers:
    print(f"Saving data for {ticker}")

    tickerdata = data[ticker] #select the columns for this ticker
    tickerdata = tickerdata.drop(["Close"], axis=1) #only care about adjusted close (assumes an "Adj Close" column is present)
    tickerdata.to_csv(f'./data/timeseries_data_{ticker}.csv') #write to csv file

print("======================")
print("Saved stocks to CSV files. Take a look inside the ./data folder for them.")

Saving data for MSFT
Saving data for AAPL
======================
Saved stocks to CSV files. Take a look inside the ./data folder for them.
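
When these files are loaded again later, the date index has to be restored explicitly. The following is a minimal sketch of that round trip using a made-up frame and a temporary folder; the file name `timeseries_data_DEMO.csv`, the index name `Date`, and the values are assumptions chosen for illustration, not output from the cells above.

```python
import os
import tempfile

import pandas as pd

# Made-up data standing in for one ticker's time series.
frame = pd.DataFrame(
    {"Open": [1.0, 2.0], "Volume": [100, 200]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03"]),
)
frame.index.name = "Date"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "timeseries_data_DEMO.csv")
    frame.to_csv(path)
    # index_col restores the date column as the index; parse_dates converts it back to datetimes.
    restored = pd.read_csv(path, index_col="Date", parse_dates=True)

print(restored)
```

Without `index_col` and `parse_dates`, the dates would come back as a plain string column, which makes later time-based slicing and resampling less convenient.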
