# Outline

1. Getting data
2. Data cleaning & featurizing
3. Data export

In [1]:
%matplotlib inline

import quandl as Quandl
Quandl.ApiConfig.api_key = "EGdC1RASF31yDGeBDRt7"
import numpy as np
import pandas as pd
import math
from sklearn import preprocessing, model_selection, svm  # cross_validation was removed in scikit-learn 0.20
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from matplotlib import style
import datetime
import pickle as pkl
style.use('ggplot')

import util as u



## <font color = blue> Getting Data</font>

* Individual Stocks
* Market & Industry Indicators

### Individual Stocks

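The tickers of the top 200 tech companies come from the NASDAQ company screener (https://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology). They are read from `companylist.csv` and prefixed with the Quandl database code `WIKI/`.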

In [2]:
tech_tickers = np.array([a.strip().upper() for a in pd.read_csv('companylist.csv')['Symbol']])
quandl_codes_tech_tickers = ["WIKI/"+ticker for ticker in tech_tickers]

Fetch the price data for each ticker. Tickers missing from Quandl's WIKI database are reported in the output.

In [3]:
tech_data = u.fetch_prices(quandl_codes_tech_tickers, limit=200)

WIKI/VNET does not exist on Quandl's side.
WIKI/JOBS does not exist on Quandl's side.
WIKI/WUBA does not exist on Quandl's side.
WIKI/ACIA does not exist on Quandl's side.
WIKI/ACMR does not exist on Quandl's side.
WIKI/IOTS does not exist on Quandl's side.
WIKI/AER does not exist on Quandl's side.
WIKI/ACY does not exist on Quandl's side.
WIKI/AGMH does not exist on Quandl's side.
WIKI/AIRG does not exist on Quandl's side.
WIKI/AMCN does not exist on Quandl's side.
WIKI/ALRM does not exist on Quandl's side.
WIKI/ALYA does not exist on Quandl's side.
WIKI/ALLT does not exist on Quandl's side.
WIKI/AABA does not exist on Quandl's side.
WIKI/AYX does not exist on Quandl's side.
WIKI/AMRH does not exist on Quandl's side.
WIKI/AMRHW does not exist on Quandl's side.
WIKI/AMN does not exist on Quandl's side.
WIKI/ASYS does not exist on Quandl's side.
WIKI/PLAN does not exist on Quandl's side.
WIKI/APY does not exist on Quandl's side.
WIKI/APPF does not exist on Quandl's side.
WIKI/APPN does 
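
`u.fetch_prices` is not shown in this notebook. The following is a minimal sketch of how such a helper could produce a log like the one above, assuming it loops over the codes, reports the ones the provider cannot find, and treats `limit` as a cap on successful downloads. The `get` and `missing_exc` parameters are hypothetical and exist only so the sketch runs without network access; in the notebook one would presumably pass `Quandl.get` and the Quandl client's not-found error instead.

```python
def fetch_prices(codes, get, limit=None, missing_exc=KeyError):
    """Hypothetical stand-in for u.fetch_prices (the real helper is not shown)."""
    data = {}
    for code in codes:
        if limit is not None and len(data) >= limit:
            break  # assumption: `limit` caps the number of successful downloads
        try:
            data[code] = get(code)
        except missing_exc:
            # same message format as the notebook output
            print("{} does not exist on Quandl's side.".format(code))
    return data


# Usage with an in-memory dictionary standing in for the remote API.
fake_store = {"WIKI/AAPL": [1.0, 2.0], "WIKI/MSFT": [3.0, 4.0]}
prices = fetch_prices(["WIKI/AAPL", "WIKI/VNET", "WIKI/MSFT"], get=fake_store.__getitem__)
```

Catching only the "not found" error, rather than every exception, keeps rate-limit or authentication failures visible instead of silently shrinking the dataset.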

## <font color = blue>Data Cleaning & Featurizing </font>

* Featurizing
* Cleaning

#### Featurizing

In [4]:
tech_data = u.apply_to_all_stocks(u.select_relevant_raw_features, tech_data)
tech_data = u.apply_to_all_stocks(u.add_ft_PCT_change, tech_data)
tech_data = u.apply_to_all_stocks(u.select_indicators, tech_data)



  dip[i] = 100 * (dip_mio[i]/trs[i])
  din[i] = 100 * (din_mio[i]/trs[i])
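
The `u` helpers used in the featurizing cell are not shown. As a rough sketch, `apply_to_all_stocks` could simply map a function over the dictionary of per-ticker frames, and a percent-change feature could be the open-to-close move in percent. Both definitions below are assumptions, as are the `Adj. Open` column and the `PCT_change` feature name; the real helpers may differ.

```python
import pandas as pd


def apply_to_all_stocks(fn, data):
    """Hypothetical stand-in for u.apply_to_all_stocks: apply fn to every frame."""
    return {ticker: fn(df) for ticker, df in data.items()}


def add_ft_pct_change(df):
    """Assumed feature: open-to-close change of the adjusted price, in percent."""
    out = df.copy()  # leave the caller's frame untouched
    out["PCT_change"] = (out["Adj. Close"] - out["Adj. Open"]) / out["Adj. Open"] * 100.0
    return out


# Toy data for illustration only.
toy = {"WIKI/DEMO": pd.DataFrame({"Adj. Open": [10.0, 20.0], "Adj. Close": [11.0, 19.0]})}
featurized = apply_to_all_stocks(add_ft_pct_change, toy)
```

Returning a new dictionary matches how the notebook reassigns `tech_data` after each step.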


# Indices & Market-Level Data


Indices from NASDAQOMX (https://www.quandl.com/data/NASDAQOMX-NASDAQ-OMX-Global-Index-Data?keyword=technology):
    
    --------- SECTOR INDICATORS ----------
    
    NASDAQ-100 Ex-Tech Sector (NDXX)
    NASDAQ-100 Technology Sector (NDXT) 
    NASDAQ-100 Target 25 Index (NDXT25) // Note: dropped, too little data
    NASDAQ-100 Technology Sector Total Return (NTTR)
    
    --------- MARKET INDICATORS -----------
    
    NASDAQ N America Index (NQNA)
    NASDAQ US All Market Index (NQUSA)

    NASDAQ US 1500 Index (NQUSS1500)
    NASDAQ US 450 Index (NQUSM450)
    NASDAQ US 300 Index (NQUSL300)

    NASDAQ US Small Cap Index (NQUSS)
    NASDAQ US Large Cap Index (NQUSL)
    NASDAQ US Mid Cap Index (NQUSM)
    
    NASDAQ US Sustainable Momentum Index (NQSUMO) // Note: dropped, too little data
    
    
    
Indices from URC (https://www.quandl.com/data/URC-Unicorn-Research-Corporation?page=3):

    --------- MARKET INDICATORS -----------

    All NASDAQ series: advancing, declining and unchanged issues and volumes, plus 52-week highs and lows
    
    

In [6]:
sector_idcs = ['NDXX','NDXT','NTTR']
market_idcs = ['NQNA', 'NQUSA', 'NQUSS1500', 'NQUSM450', 'NQUSL300', 'NQUSS', 'NQUSL', 'NQUSM']
sector_idcs, market_idcs = ["NASDAQOMX/"+t for t in sector_idcs], ["NASDAQOMX/"+t for t in market_idcs]

In [7]:
urc_names = ["NASDAQ_ADV", "NASDAQ_UNCH", "NASDAQ_52W_LO", "NASDAQ_52W_HI",
             "NASDAQ_UNCH_VOL", "NASDAQ_DEC_VOL", "NASDAQ_ADV_VOL",
             "NASDAQ_UNC", "NASDAQ_DEC"]
market_urc_idcs = ["URC/"+name for name in urc_names]


In [8]:
sector_data, market_data = u.fetch_prices(sector_idcs), u.fetch_prices(market_idcs)
market_urc_data = u.fetch_prices(market_urc_idcs)


In [9]:
#drop some columns we don't want
for df in sector_data.values():
    u.drop_col(df, 'Dividend Market Value') #drop div values from all sector data
for df in market_data.values():
    u.drop_col(df, 'Dividend Market Value') #drop div values from all market data
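
Because the cell above discards the return value of `u.drop_col`, the helper presumably modifies each frame in place. A minimal sketch under that assumption (the real helper is not shown, and the `Index Value` column in the toy frame is only illustrative):

```python
import pandas as pd


def drop_col(df, col):
    """Hypothetical stand-in for u.drop_col: drop a column in place if it exists."""
    if col in df.columns:
        df.drop(columns=[col], inplace=True)
    return df


# Toy frame for illustration only.
toy = pd.DataFrame({"Index Value": [100.0, 101.0], "Dividend Market Value": [0.0, 0.0]})
drop_col(toy, "Dividend Market Value")
drop_col(toy, "Dividend Market Value")  # second call is a no-op, not an error
```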


In [10]:
#featurize
sector_data = u.apply_to_all_stocks(u.add_industry_level_fts, sector_data)
market_data = u.apply_to_all_stocks(u.add_market_level_fts, market_data)
market_urc_data = u.apply_to_all_stocks(u.add_market_urc_level_fts, market_urc_data) #same name the join cell reads


The series do not cover the same dates, so we inner-join everything on the date index and keep only the dates for which every indicator has data. We then experiment only with the stocks that have full data over that range.

In [11]:
from functools import reduce  # reduce is no longer a builtin in Python 3

# collect every indicator frame, then inner-join them all on the date index
dfs = list(sector_data.values()) + list(market_data.values()) + list(market_urc_data.values())

industry_mkt_fts = reduce(lambda left, right: pd.merge(left, right, how='inner', left_index=True, right_index=True), dfs)

print("Range of industry and market features is in: ", max(industry_mkt_fts.index), min(industry_mkt_fts.index))


('Range of industry and market features is in: ', Timestamp('2015-08-31 00:00:00'), Timestamp('2011-05-16 00:00:00'))
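
To see why the joined range is narrower than any single series, here is a small self-contained illustration of the same `reduce` and `pd.merge` pattern on three toy frames with staggered dates. The frames and the `inner_join_on_index` name are made up for the example.

```python
from functools import reduce

import pandas as pd


def inner_join_on_index(frames):
    """Inner-join DataFrames on their index, keeping only the shared dates."""
    return reduce(
        lambda left, right: pd.merge(left, right, how="inner",
                                     left_index=True, right_index=True),
        frames,
    )


# Three series that start one day apart.
a = pd.DataFrame({"a": [1, 2, 3]}, index=pd.date_range("2015-01-01", periods=3))
b = pd.DataFrame({"b": [4, 5, 6]}, index=pd.date_range("2015-01-02", periods=3))
c = pd.DataFrame({"c": [7, 8, 9]}, index=pd.date_range("2015-01-03", periods=3))

joined = inner_join_on_index([a, b, c])
print(joined.index.min(), joined.index.max())
```

Only the dates present in all three frames survive, so a single short series is enough to shrink the whole feature table.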


## Clean & Save Individual Stock and Market & Industry Indicators

We clean the individual stocks only after building the market & industry indicators, because each stock must be filtered to the dates the indicators cover. That way every featurization spans the same time frame.
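
One way to get matching time frames is to keep only the rows of a stock whose dates survive the join with the indicators. A minimal sketch, assuming `u.prune_date_range` (not shown in this notebook) behaves roughly like this:

```python
import pandas as pd


def prune_date_range(df, reference):
    """Hypothetical stand-in for u.prune_date_range: keep rows dated in `reference`."""
    return df.loc[df.index.isin(reference.index)]


# Toy frames for illustration only.
stock = pd.DataFrame({"Adj. Close": [1.0, 2.0, 3.0, 4.0]},
                     index=pd.date_range("2015-01-01", periods=4))
merged = pd.DataFrame({"x": [0, 0]}, index=pd.date_range("2015-01-02", periods=2))

pruned = prune_date_range(stock, merged)
```

Filtering on index membership rather than on the minimum and maximum date also drops days that are missing in the middle of the reference range.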

#### Cleaning

In [12]:
cleaned_data = {}      # ticker -> (X, y, X_lately) from the stock's own features
cleaned_data_aug = {}  # ticker -> (X, y, X_lately) with market & industry features added

for stock_ticker, stock_data in tech_data.items():

    # keep only the dates on which both the stock and all indicators have data
    stock_n_market_n_industry = pd.merge(stock_data,
                                         industry_mkt_fts,
                                         how='inner',
                                         left_index=True,
                                         right_index=True)

    #next, clean and normalize
    try:

        ### MARKET & INDUSTRY
        X, y, X_lately = u.clean_and_split(stock_n_market_n_industry,
                                           forecast_pct=0.05,
                                           forecast_col='Adj. Close')

        cleaned_data_aug[stock_ticker] = (X, y, X_lately)

        ### SINGLE STOCK
        stock_data = u.prune_date_range(stock_data, stock_n_market_n_industry)
        X, y, X_lately = u.clean_and_split(stock_data,
                                           forecast_pct=0.05,
                                           forecast_col='Adj. Close')

        cleaned_data[stock_ticker] = (X, y, X_lately)

    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print('skipped one due to NaN/Inf bug!')
    




skipped one due to NaN/Inf bug!
skipped one due to NaN/Inf bug!
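
`u.clean_and_split` is not shown. The sketch below follows a common pattern for this kind of forecasting setup and is an assumption, not the notebook's actual code: drop rows containing NaN or infinite values, use the forecast column shifted `forecast_pct` of the rows into the future as the label, standardize the features, and set aside the most recent rows (which have no known future price) as `X_lately`.

```python
import math

import numpy as np
import pandas as pd


def clean_and_split(df, forecast_pct=0.05, forecast_col="Adj. Close"):
    """Hypothetical stand-in for u.clean_and_split (the real helper is not shown)."""
    df = df.replace([np.inf, -np.inf], np.nan).dropna()
    if df.empty:
        raise ValueError("no finite rows left after cleaning")

    forecast_out = int(math.ceil(forecast_pct * len(df)))
    label = df[forecast_col].shift(-forecast_out)  # value forecast_out rows ahead

    features = df.values.astype(float)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # constant columns would otherwise divide by zero
    features = (features - features.mean(axis=0)) / std

    X = features[:-forecast_out]
    y = label.values[:-forecast_out]
    X_lately = features[-forecast_out:]  # latest rows, future value unknown
    return X, y, X_lately


# Toy frame with one infinite value, for illustration only.
toy = pd.DataFrame({
    "Adj. Close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "PCT_change": [0.1, np.inf, 0.3, -0.2, 0.5, 0.0, 0.2, -0.1, 0.4],
})
X, y, X_lately = clean_and_split(toy, forecast_pct=0.25)
```

Handling infinities and constant columns explicitly is one way to avoid skipping a stock entirely when a single feature misbehaves.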


### Save

In [13]:
#augmented
with open('cleaned_data_aug.pickle', 'wb') as handle:
    pkl.dump(cleaned_data_aug, handle, protocol=pkl.HIGHEST_PROTOCOL)
    
#individual stock
with open('cleaned_data.pickle', 'wb') as handle:
    pkl.dump(cleaned_data, handle, protocol=pkl.HIGHEST_PROTOCOL)
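
For completeness, a short sketch of reading such a file back in a later notebook. The helper names and the toy dictionary are illustrative; the round trip uses a temporary directory so it does not touch the files written above.

```python
import os
import pickle as pkl
import tempfile


def save_pickle(obj, path):
    """Write obj to path with the highest available pickle protocol."""
    with open(path, "wb") as handle:
        pkl.dump(obj, handle, protocol=pkl.HIGHEST_PROTOCOL)


def load_pickle(path):
    """Read a pickled object back; only use on files you trust."""
    with open(path, "rb") as handle:
        return pkl.load(handle)


# Round trip with a toy dictionary shaped like ticker -> (X, y, X_lately).
toy = {"WIKI/DEMO": ([[0.1, 0.2]], [1.0], [[0.3, 0.4]])}
path = os.path.join(tempfile.mkdtemp(), "cleaned_data.pickle")
save_pickle(toy, path)
restored = load_pickle(path)
```

Files written with `HIGHEST_PROTOCOL` may not load on older Python versions, so the reading notebook should run on the same or a newer interpreter.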