# Data Collection

Before we can analyze stocks, we need data to analyze. We will get it by scraping price data over a specified time frame for a specified list of tickers.


## Getting Started
To get started, we must install and import some libraries:

In [191]:
import sys
!{sys.executable} -m pip install yfinance
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install pandas
print("======================")


import yfinance as yf
import pandas as pd

print("Successfully imported libraries")

Successfully imported libraries


We will begin by specifying a list of tickers we want to scrape:

In [192]:
tickers = ["MSFT", "AAPL"]

From here, we have to define some options:

- Interval: This is the spacing between consecutive data points (1m, 1h, 1d, 1wk, ...)
- Time Frame (2 options):
    - A)
        - Period: This is a time span ending at the present, written as a short code (1d, 5d, 1mo, 1y, ...)
    - B)
        - Start Date: This is the starting date for data scraping
        - End Date: This is the ending date for data scraping

In [193]:
period = '10y'
start_date = None
end_date = None
interval = "1d"

Just a quick check to ensure variables are instantiated correctly:

In [194]:
if not tickers:
    raise Exception("You must specify a list of tickers to scrape")

if period is None and start_date is None and end_date is None:
    raise Exception("You must specify one timeframe in order to scrape")

if period is None and (start_date is None or end_date is None):
    raise Exception("You must specify both ends of the timeframe in order to scrape")

# Only check the period when one is given; it is None when using start/end dates
if period is not None and period not in ['1d','5d','1mo','3mo','6mo','1y','2y','5y','10y','ytd','max']:
    raise Exception("Please input a valid period")

if interval not in ['1m','2m','5m','15m','30m','60m','90m','1h','1d','5d','1wk','1mo','3mo']:
    raise Exception("Please input a valid time interval")

if period is not None and (start_date is not None or end_date is not None):
    raise Exception("You can only specify one type of timeframe in order to scrape")

print("Passed!")

Passed!


## Scraping Data
Now that we have defined the scraping parameters, we can begin to scrape. We do this with a single call to the yfinance `download` function, passing all tickers at once. However, we have to consider two cases:
- We are using period
- We are using start/end dates
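The two cases differ only in which time frame arguments are passed to `download`. As a minimal sketch, a hypothetical helper (`build_download_kwargs` is not part of yfinance or of this notebook) could assemble the keyword arguments for either case:

```python
def build_download_kwargs(period=None, start_date=None, end_date=None, interval="1d"):
    """Return keyword arguments for one of the two time frame styles."""
    # Options shared by both cases
    common = {"interval": interval, "group_by": "ticker", "threads": True}
    if period is not None:
        # Case 1: a period code such as "10y"
        return {"period": period, **common}
    # Case 2: explicit start and end dates
    return {"start": start_date, "end": end_date, **common}


by_period = build_download_kwargs(period="10y")
by_dates = build_download_kwargs(start_date="2015-01-01", end_date="2020-01-01")

# Either dictionary could then be unpacked into the call, for example:
# data = yf.download(ticker_string, **by_period)  # needs yfinance and network access
```

The dates in `by_dates` are placeholders chosen for illustration only.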


In [195]:
ticker_string = ' '.join(tickers)  # convert from list to space-separated string
print(f"Ticker String: \n{ticker_string}")

if period is not None:  # if using period
    data = yf.download(
        ticker_string,
        period=period,
        interval=interval,
        group_by='ticker',
        auto_adjust=False,  # keep the separate 'Adj Close' column used below
        threads=True
    )
else:  # if using start/end dates
    data = yf.download(
        ticker_string,
        start=start_date,
        end=end_date,
        interval=interval,
        group_by='ticker',
        auto_adjust=False,  # keep the separate 'Adj Close' column used below
        threads=True
    )

# Replace the raw close with the adjusted close.
# Open/High/Low stay unadjusted, so Close can fall outside the [Low, High] range.
data = data.drop([(i, 'Close') for i in tickers], axis=1)
data = data.rename({"Adj Close": "Close"}, axis=1)

data.head()

Ticker String: 
MSFT AAPL
[*********************100%***********************]  2 of 2 completed


| Date | MSFT Open | MSFT High | MSFT Low | MSFT Close | MSFT Volume | AAPL Open | AAPL High | AAPL Low | AAPL Close | AAPL Volume |
|---|---|---|---|---|---|---|---|---|---|---|
| 2011-10-26 | 27.030001 | 27.059999 | 26.1 | 21.400375 | 63029900 | 14.348571 | 14.376786 | 14.041071 | 12.284348 | 456304800 |
| 2011-10-27 | 27.129999 | 27.4 | 26.65 | 21.931568 | 74512400 | 14.555714 | 14.607143 | 14.353214 | 12.409767 | 494664800 |
| 2011-10-28 | 27.139999 | 27.190001 | 26.790001 | 21.714264 | 57712100 | 14.392857 | 14.5125 | 14.375357 | 12.417742 | 322842800 |
| 2011-10-31 | 26.76 | 27.0 | 26.620001 | 21.432573 | 46799000 | 14.372143 | 14.618929 | 14.323214 | 12.412527 | 385501200 |
| 2011-11-01 | 26.190001 | 26.32 | 25.860001 | 20.917484 | 61182600 | 14.193214 | 14.267857 | 14.043571 | 12.158929 | 531790000 |
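The downloaded frame has two column levels, ticker on top and field below, which changes how columns are selected. The sketch below uses a small made-up frame (the name `demo` and its values are illustrative, not real prices) to show three common selections:

```python
import pandas as pd

# Tiny stand-in for the downloaded frame; the numbers are made up.
columns = pd.MultiIndex.from_product([["MSFT", "AAPL"], ["Open", "Close"]])
index = pd.to_datetime(["2021-01-04", "2021-01-05"])
demo = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
    index=index,
    columns=columns,
)

msft = demo["MSFT"]                         # all fields for one ticker
closes = demo.xs("Close", axis=1, level=1)  # one field for every ticker
aapl_close = demo[("AAPL", "Close")]        # a single column as a Series
```

Selecting by the top level, as in `demo["MSFT"]`, returns a frame with flat field names, which is the shape that per-ticker calculations usually expect.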


# Adding Technical Indicators

Using the [TA-Lib](https://mrjbq7.github.io/ta-lib/) package, we can compute technical indicators over each time series and store them in their own dataframe columns.

In [196]:
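import talib

def computeRSI(df):
    # 14-period Relative Strength Index of the closing price
    return talib.RSI(df["Close"], timeperiod=14)

def computeUltimateOscillator(df):
    # Ultimate Oscillator over 7-, 14- and 28-period windows
    return talib.ULTOSC(df["High"], df["Low"], df["Close"], timeperiod1=7, timeperiod2=14, timeperiod3=28)

def computeBollingerBands(df):
    # Distance from the middle band (5-period simple moving average) to the upper band
    upperband, middleband, _ = talib.BBANDS(df["Close"], timeperiod=5, nbdevup=2, nbdevdn=2, matype=0)
    return upperband - middleband

def computeChaikinOscillator(df):
    # Chaikin A/D Oscillator with fast and slow periods of 3 and 10
    return talib.ADOSC(df["High"], df["Low"], df["Close"], df["Volume"], fastperiod=3, slowperiod=10)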
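To make the RSI calculation less of a black box, here is a rough pure-Python sketch of a 14-period RSI using Wilder's smoothing. The function name `wilder_rsi` is hypothetical, and the code illustrates the general technique rather than reproducing TA-Lib's implementation; edge-case handling in particular is an assumption.

```python
def wilder_rsi(closes, period=14):
    """RSI with Wilder smoothing; entries are None until `period` changes exist."""
    rsi = [None] * len(closes)
    if len(closes) <= period:
        return rsi

    def rsi_value(avg_gain, avg_loss):
        total = avg_gain + avg_loss
        # A flat window has no gains or losses; 0.0 is the convention chosen here.
        return 100.0 * avg_gain / total if total else 0.0

    changes = [b - a for a, b in zip(closes, closes[1:])]

    # Seed the averages with a simple mean of the first `period` changes
    avg_gain = sum(max(c, 0.0) for c in changes[:period]) / period
    avg_loss = sum(max(-c, 0.0) for c in changes[:period]) / period
    rsi[period] = rsi_value(avg_gain, avg_loss)

    # Afterwards, blend each new change into the running averages
    for i in range(period, len(changes)):
        avg_gain = (avg_gain * (period - 1) + max(changes[i], 0.0)) / period
        avg_loss = (avg_loss * (period - 1) + max(-changes[i], 0.0)) / period
        rsi[i + 1] = rsi_value(avg_gain, avg_loss)
    return rsi


rising = [float(i) for i in range(1, 31)]        # only gains, so RSI is 100
zigzag = [10.0 + (i % 2) for i in range(15)]     # equal gains and losses, so RSI is 50
print(wilder_rsi(rising)[14], wilder_rsi(zigzag)[14])
```

The first `period` entries are empty because the indicator needs that many price changes before it can produce a value.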


Now we can apply these functions to each ticker's time series:

In [197]:
for ticker in tickers:
    df = data[ticker]  # per-ticker view with flat column names (Open, High, ...)
    data[ticker, 'RSI'] = computeRSI(df)
    data[ticker, 'Ultimate'] = computeUltimateOscillator(df)
    data[ticker, 'BandRadius'] = computeBollingerBands(df)
    data[ticker, 'Chaikin'] = computeChaikinOscillator(df)

data['MSFT'].head(100)

| Date | Open | High | Low | Close | Volume | RSI | Ultimate | BandRadius | Chaikin |
|---|---|---|---|---|---|---|---|---|---|
| 2011-10-26 | 27.030001 | 27.059999 | 26.100000 | 21.400375 | 63029900 | NaN | NaN | NaN | NaN |
| 2011-10-27 | 27.129999 | 27.400000 | 26.650000 | 21.931568 | 74512400 | NaN | NaN | NaN | NaN |
| 2011-10-28 | 27.139999 | 27.190001 | 26.790001 | 21.714264 | 57712100 | NaN | NaN | NaN | NaN |
| 2011-10-31 | 26.760000 | 27.000000 | 26.620001 | 21.432573 | 46799000 | NaN | NaN | NaN | NaN |
| 2011-11-01 | 26.190001 | 26.320000 | 25.860001 | 20.917484 | 61182600 | NaN | NaN | 0.683406 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2012-03-14 | 32.529999 | 32.880001 | 32.490002 | 26.747755 | 41986900 | 73.537960 | 1.633287 | 0.568092 | -4.455546e+09 |
| 2012-03-15 | 32.790001 | 32.939999 | 32.580002 | 26.813049 | 49068300 | 74.187935 | 1.948936 | 0.606215 | -4.548880e+09 |
| 2012-03-16 | 32.910000 | 32.950001 | 32.500000 | 26.608995 | 65626400 | 68.523372 | 1.331681 | 0.466842 | -4.741123e+09 |
| 2012-03-19 | 32.540001 | 32.610001 | 32.150002 | 26.282505 | 44789200 | 60.556235 | 0.434833 | 0.368427 | -4.766566e+09 |
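The `BandRadius` column can be checked by hand. With a simple moving average as the middle band, the distance to the upper band is the number of deviations (2 here) times the standard deviation of the window. The sketch below assumes a population standard deviation; the helper name `band_radius` is hypothetical, and the input is the first five MSFT closing prices from the output above.

```python
from statistics import pstdev


def band_radius(closes, period=5, num_dev=2.0):
    """Upper band minus middle band over a rolling window of `period` closes."""
    radius = [None] * len(closes)
    for end in range(period, len(closes) + 1):
        window = closes[end - period:end]
        radius[end - 1] = num_dev * pstdev(window)  # population standard deviation
    return radius


# First five MSFT closes from the output above
msft_closes = [21.400375, 21.931568, 21.714264, 21.432573, 20.917484]
print(band_radius(msft_closes))
```

The last entry comes out at roughly 0.6834, which agrees with the first non-empty `BandRadius` value in the output (2011-11-01). The first four entries are empty because a 5-period window is not yet full.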


Then, drop every row in which any indicator value is undefined (NaN). These are the early rows where an indicator does not yet have enough history:

In [198]:
data.dropna(inplace=True)

data.head()

| Date | MSFT Open | MSFT High | MSFT Low | MSFT Close | MSFT Volume | AAPL Open | AAPL High | AAPL Low | AAPL Close | AAPL Volume | MSFT RSI | MSFT Ultimate | MSFT BandRadius | MSFT Chaikin | AAPL RSI | AAPL Ultimate | AAPL BandRadius | AAPL Chaikin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2011-12-06 | 25.809999 | 25.870001 | 25.610001 | 20.8074 | 46175300 | 14.018214 | 14.093929 | 13.906429 | 11.98843 | 283598000 | 50.246215 | 1.333001 | 0.322795 | -4803864000.0 | 52.633426 | 3.170175 | 0.22528 | -22922170000.0 |
| 2011-12-07 | 25.67 | 25.76 | 25.34 | 20.758745 | 62667000 | 13.926071 | 13.962143 | 13.812857 | 11.931394 | 304746400 | 49.580152 | 0.675705 | 0.326037 | -4905596000.0 | 51.187896 | 1.552887 | 0.106483 | -23380770000.0 |
| 2011-12-08 | 25.48 | 25.719999 | 25.370001 | 20.596569 | 60522200 | 13.980357 | 14.125 | 13.936786 | 11.979538 | 376356400 | 47.328096 | 0.568225 | 0.292567 | -5045780000.0 | 52.37684 | 2.336222 | 0.082239 | -24053170000.0 |
| 2011-12-09 | 25.52 | 25.870001 | 25.5 | 20.83984 | 53788500 | 14.030357 | 14.072857 | 13.965357 | 12.070306 | 296993200 | 50.928788 | 0.229622 | 0.181873 | -5092650000.0 | 54.621033 | 1.824285 | 0.101092 | -25567770000.0 |
| 2011-12-12 | 25.41 | 25.57 | 25.290001 | 20.685772 | 38945900 | 13.988571 | 14.067857 | 13.908929 | 12.015725 | 301067200 | 48.660039 | 0.446438 | 0.175152 | -5068786000.0 | 53.003554 | 1.111867 | 0.091244 | -26241100000.0 |
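Because both tickers share one frame, `dropna` with its default `how="any"` removes a date as soon as any single column is missing on it, even if the other ticker is complete for that date. A minimal illustration with a made-up frame (`demo` and its numbers are hypothetical):

```python
import pandas as pd

nan = float("nan")
columns = pd.MultiIndex.from_tuples([("MSFT", "RSI"), ("AAPL", "RSI")])
demo = pd.DataFrame(
    [[nan, 50.0], [55.0, nan], [60.0, 52.0]],
    columns=columns,
)

any_missing = demo.dropna()           # default how="any": keeps only fully populated rows
all_missing = demo.dropna(how="all")  # drops a row only if every value is missing

print(len(any_missing), len(all_missing))
```

Here only the last row survives the default call, while `how="all"` keeps all three rows. If tickers have different histories, dropping rows per ticker before saving may preserve more data.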


At this point, we have a dataframe with two-level columns, split first by ticker and then by field. To store this data, we will write one CSV file per ticker containing that ticker's columns.

In [199]:


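import os

os.makedirs('./data', exist_ok=True)  # make sure the output folder exists

for ticker in tickers:
    print(f"Saving data for {ticker}")

    tickerdata = data[ticker]  # flat columns for this ticker only
    tickerdata.to_csv(f'./data/timeseries_data_{ticker}.csv')  # write to csv file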

print("======================")
print("Saved stocks to CSV files. Take a look inside /datacollection folder for them.")

Saving data for MSFT
Saving data for AAPL
Saved stocks to CSV files. Take a look inside /datacollection folder for them.
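To use the saved files later, they can be read back with the date column restored as the index. The sketch below round-trips a small made-up frame through a temporary file rather than the real `./data` folder; the `DEMO` ticker name and the values are placeholders.

```python
import os
import tempfile

import pandas as pd

# Stand-in for one ticker's frame; the values are made up.
frame = pd.DataFrame(
    {"Close": [20.5, 20.75], "RSI": [50.25, 49.5]},
    index=pd.DatetimeIndex(["2011-12-06", "2011-12-07"], name="Date"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "timeseries_data_DEMO.csv")
    frame.to_csv(path)  # same call as in the cell above
    # Parse the Date column back into a datetime index
    restored = pd.read_csv(path, index_col="Date", parse_dates=True)

print(restored)
```

Without `index_col` and `parse_dates`, the dates would come back as an ordinary text column instead of a datetime index.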
