# PREPROCESSING

This notebook contains the process to compile datasets for modeling purpose.

#### The main steps of the process:

- Reading in stock price datasets, resturcturing and creating price based labels

- Reading in article headlines, stemming and tokenizing the text

- Merging and concatenate the article headlines and lables

- Split into train and test dataset

- Saving into a csv file


In order to compare the company's performance I used <b>S&P500</b> index.

### Importing the necessary libraries

In [405]:
import pandas as pd
import datetime
import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_row', 1000)

### 1. Reading in stock price datasets

#### Source: https://finance.yahoo.com

In [406]:
companies = ['tesla', 'ford', 'ibm', 'goldman', 'boeing', 'ge']

In [407]:
sp500 = pd.read_csv('financial/SP500.csv')

tesla_stock = pd.read_csv('financial/TSLA.csv')
ford_stock = pd.read_csv('financial/F.csv')
ibm_stock = pd.read_csv('financial/IBM.csv')
goldman_stock = pd.read_csv('financial/GS.csv')
boeing_stock = pd.read_csv('financial/BA.csv')
ge_stock = pd.read_csv('financial/GE.csv')


In [207]:
tesla_stock.head(2)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2010-06-29,19.0,25.0,17.540001,23.889999,23.889999,18766300
1,2010-06-30,25.790001,30.42,23.299999,23.83,23.83,17187100


### 2. Restructuring/labeling stock data

#### Labeling rule:
* <b>good</b>: if the company share price performed better than the index
* <b>bad</b>: if the company share price performed worse than the index

Calculation:
* For calculating changes in stock prices 'Close' prices are used
* 'change': % change compared to the previous day
* 'nextday': % change of the following day

In [408]:
def price_ratios(stock_data, index_data):
    '''
    This function returns a restructured dataframe for labeling purpose.

    The function requires specific structure of the data, which is based on the current 
    datascource (yahoo/finance).

    -------------------------
    Inputs:
        stock_data: dataframe
        index_data: dataframe

    -------------------------
    Returns: dataframe

    '''
    # merge dataframes
    df = pd.merge(stock_data, index_data, how='inner',
                  on='Date', suffixes=('', '_i'))
    
    df.columns = df.columns.str.lower()
    df = df[['date','close', 'close_i']]

    # shifting prices by one day ahead and merge
    df = df.set_index('date')
    shifted_next = df.shift(periods=-1)
    df = pd.merge(df, shifted_next, on='date', suffixes=('', '_next')) #next day prices
    df = df.reset_index()
    
    # shifting prices by one day back and merge
    df = df.set_index('date')
    shifted_prev = df.shift(periods=1)
    df = pd.merge(df, shifted_prev, on='date', suffixes=('', '_prev')) #previous day prices
    df = df.reset_index()

    # creating ratios
    df['change'] = round(df['close'] / df['close_prev'] * 100 - 100, 4)
    df['nextday'] = round(df['close_next'] / df['close'] * 100 - 100, 4)
    df['change_i'] = round(df['close_i'] / df['close_i_prev'] * 100 - 100, 4)
    df['nextday_i'] = round(df['close_i_next'] / df['close_i'] * 100 - 100, 4)

    # adding trading holidays to the dataframe
    dates = []
    last_day = pd.to_datetime(stock_data.iloc[-1, 0], format='%Y-%m-%d')
    for i in range(3650):
        date = last_day - datetime.timedelta(days=i)
        date = date.strftime('%Y-%m-%d')
        dates.append(date)
    date_df = pd.DataFrame(dates, columns=['date'])
    df = pd.merge(date_df, df, on='date', how='left')

    # fill in trading holidays with the following data (becaues dates are in reverse order, it is now ffill)
    df = df.fillna(method='ffill')

    return df


In [409]:
def label(df):
    df.loc[df['change'] > df['change_i'] + 0.5, 'label'] = 'good'

    df.loc[df['change'] < df['change_i'] - 0.5, 'label'] = 'bad'

    df.loc[df.label.isna(), 'label'] = 'neutral'
    
    df = df[['date', 'label']]
    
#     df.date = pd.to_datetime(df.date)
    
    return df

In [410]:
tesla_label = label(price_ratios(tesla_stock, sp500))
ford_label = label(price_ratios(ford_stock, sp500))
ibm_label = label(price_ratios(ibm_stock, sp500))
goldman_label = label(price_ratios(goldman_stock, sp500))
boeing_label = label(price_ratios(boeing_stock, sp500))
ge_label = label(price_ratios(ge_stock, sp500))


In [363]:
tesla_label.label.value_counts()

good       1674
bad        1342
neutral    634 
Name: label, dtype: int64

### 3. Import headlines

In [411]:
import pymongo

myclient = pymongo.MongoClient("mongodb://127.0.0.1:27017/")
news = myclient['news']

In [412]:
tesla_news = pd.DataFrame.from_records(news.tesla_news.find())
ford_news = pd.DataFrame.from_records(news.ford_news.find())
ibm_news = pd.DataFrame.from_records(news.ibm_news.find())
goldman_news = pd.DataFrame.from_records(news.goldman_news.find())
boeing_news = pd.DataFrame.from_records(news.boeing_news.find())
ge_news = pd.DataFrame.from_records(news.ge_news.find())


### 4. Converting dates / restructuring dataframe

In [413]:
def converting_news(some_news):
    some_news.date = pd.to_datetime(some_news.date).apply(lambda x: x.strftime('%Y-%m-%d'))
    some_news['text'] = some_news.headline +' '+ some_news.summary
    some_news = some_news[['date', 'text', 'topic']]
    return some_news

In [414]:
tesla_news = converting_news(tesla_news)
ford_news = converting_news(ford_news)
ibm_news = converting_news(ibm_news)
goldman_news = converting_news(goldman_news)
boeing_news = converting_news(boeing_news)
ge_news = converting_news(ge_news)

### 5. Keeping just the relevant headlines

Relevant a headline or the summary, if it contains the keywords (basically the company name).

In [415]:
def relevant(text, keywords):
    '''
    Inputs:
        text: string
        keywords: list of string objects
    Returns:
        boolean
    '''
    for keyword in keywords:
        if keyword in text:
            return True
    return False

def relevant_rows(df, keywords):
    '''
    Returns the rows of the dataframe, where the text is relevant
    Input:
        df: dataframe
        keywords: list of string objects
    Returns:
        dataframe with the relevant rows
    '''
    df = df.loc[df.text.apply(lambda x: relevant(x, keywords))]
    
    print('Number of rows: ',len(df))
    return df

In [416]:
tesla_news = relevant_rows(tesla_news, ['Elon Musk', 'Tesla'])

Number of rows:  979


In [417]:
ford_news = relevant_rows(ford_news, ['Ford'])

Number of rows:  1447


In [418]:
ibm_news = relevant_rows(ibm_news, ['IBM', 'International Business Machines'])

Number of rows:  509


In [419]:
goldman_news = relevant_rows(goldman_news, ['Goldman'])

Number of rows:  1667


In [420]:
boeing_news = relevant_rows(boeing_news, ['Boeing'])

Number of rows:  1629


In [421]:
ge_news = relevant_rows(ge_news, ['GE', 'General Electric'])

Number of rows:  1085


### 6. Stemming/tokenizing

In [447]:

from nltk.stem import RegexpStemmer
from nltk.stem.snowball import SnowballStemmer

179

The 'stopwords' and 'nowords' were originally based on the stopwords of nltk library. For this sentiment
analysis I review and modified them (removed keywords, added some not predicting words).
Words which modify the meaning of the following words (I call them like 'nowords') will be replaced simply with 'no'.

In [454]:
my_stopwords = ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 
                'are', 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'between', 'both', 'but', 'by',
                'can', 'd', 'did', 'do', 'does', 'doing', 'during', 'each', 'few', 'for', 'from', 'further',
                'had', 'has', 'have', 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his',
                'how', 'i', 'if', 'in', 'into', 'is', 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma',
                'me', 'might', 'more', 'most', 'my', 'myself', 'neeed', 'now', 'o', 'of', 'off', 'on', 'once',
                'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's',
                'same','she', "she's", 'should', "should've", 'so', 'some', 'such','than', 'that', "that'll",
                'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this',
                'those', 'through', 'to', 'too', 'until', 've', 'very', 'was', 'we', 'were', 'what', 'when',
                'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'would', 'could', 'y',
                'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves',
                'mr', 'mrs', 'ms', 'nan', 'inc', 'co', 'com', 'wsj', 'monday', 'tuesday', 'wednesday', 'thursday',
                'friday', 'saturday', 'sunday', 'either', 'shall', 'must', 'with', 'without']

my_nowords = ['aren', "aren't", 'couldn', "couldn't", 'didn',"didn't", 'doesn', "doesn't", 'don', "don't",'hadn',
              "hadn't",'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'neither',
              'nor', 'mustn', "mustn't",'needn', "needn't", 'no', 'shan', "shan't",  'shouldn', "shouldn't",'t', 
              'wasn',"wasn't",'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

In [455]:
import re

def my_tokenizer(text):
    # just keep the words (remove characters)
    tokens = re.findall(r'([a-zA-Z0-9-]+)', text)

    # lowercase
    tokens = [t.lower() for t in tokens]

    # remove stopwords
    tokens = [word for word in tokens if word not in my_stopwords]
    
    #replace no words
    tokens = ['no' for word in tokens if word in my_nowords]
    
    token_text = ' '.join(tokens)
    return token_text

In [456]:
tesla_news['tokens'] = tesla_news.text.apply(my_tokenizer)
ford_news['tokens'] = ford_news.text.apply(my_tokenizer)
ibm_news['tokens'] = ibm_news.text.apply(my_tokenizer)
goldman_news['tokens'] = goldman_news.text.apply(my_tokenizer)
boeing_news['tokens'] = boeing_news.text.apply(my_tokenizer)
ge_news['tokens'] = ge_news.text.apply(my_tokenizer)

In [457]:
ge_news

Unnamed: 0,date,text,topic,tokens
1,2019-10-01,"GE to Pay $2.7 Million Over Cuba Sanctions The conglomerate agreed to pay more than $2.7 million to settle allegations that its subsidiaries violated U.S. economic sanctions on Cuba through a former customer, the Treasury Department said.",RISK & COMPLIANCE JOURNAL,
5,2019-09-16,"Baker Hughes Sheds an Owner and Six Syllables Shareholders didn’t cheer when GE dumped shares of Baker Hughes, but analysts probably did.",OVERHEARD,no no
7,2019-09-12,GE Chief Says Assets Sales Will Reap $38 Billion CEO Larry Culp sees stronger demand at the company’s power division but warns interest rates will boost pension obligations by $7 billion.,BUSINESS,
8,2019-09-12,"Investors Should Fade GE’s Oil Patch Trade Shale drillers may be suffering, but investors have punished the stocks that service them too much.",HEARD ON THE STREET,
9,2019-09-10,"GE to Give Up Majority Control of Baker Hughes General Electric is giving up majority control of Baker Hughes, selling shares in the oil-field services firm that will raise about $3 billion cash but trigger a more-than-$7-billion accounting charge.",BUSINESS,
12,2019-08-30,"General Electric Gets Most of Shareholder Lawsuit Dismissed A federal judge dismissed much of a major shareholder lawsuit against General Electric, knocking down several securities fraud allegations related to accounting at its insurance and power businesses.",BUSINESS,
14,2019-08-19,GE Issues New Rebuttal to Whistleblower’s Claims About Finances General Electric pushed back further against claims by Harry Markopolos that the company hasn’t been forthright about its finances.,BUSINESS,no no
16,2019-08-16,Shares of GE Light Up Again Shares of General Electric bounced back—but this could be just the beginning of another painful stretch for the industrial conglomerate’s stock.,STOCKS,
17,2019-08-16,"J.C. Penney, Viacom, General Electric: Stocks That Defined the Week Major U.S. stock indexes fell for the week on renewed concerns that the economy could be headed toward a recession, fueled by an inversion of the yield curve and other market signals flashing red.",STOCKS,
19,2019-08-15,GE Is New Target of Madoff Whistleblower Bernie Madoff whistleblower Harry Markopolos alleges that GE’s cash situation is worse than disclosed and that GE needs to boost its insurance reserves. GE says the claims are false and misleading. Shares fell sharply after the release of the report.,BUSINESS,


### 7. Merging/concatenating

In [424]:
tesla = pd.merge(tesla_news, tesla_label, how='left', on='date')
ford = pd.merge(ford_news, ford_label, how='left', on='date')
ibm = pd.merge(ibm_news, ibm_label, how='left', on='date')
goldman = pd.merge(goldman_news, goldman_label, how='left', on='date')
boeing = pd.merge(boeing_news, boeing_label, how='left', on='date')
ge = pd.merge(ge_news, ge_label, how='left', on='date')


In [425]:
data = pd.concat([tesla, ford, ibm, goldman, boeing, ge])

### 8. Removing not relevant topics

In [439]:
# first I determine the relevant topics
relevant_topics = data.topic.value_counts().index.to_list()[:34]
for elem in ['PHOTOS', 'LETTERS', 'COMMENTARY', 'BOOKSHELF']:
    relevant_topics.remove(elem)

def relevant_topic(topic):
    if topic in relevant_topics:
        return True
    return False

data = data.loc[data.topic.apply(lambda x: relevant_topic(x))]

In [440]:
data.shape

(6282, 5)

In [442]:
data.set_index('date')

Unnamed: 0_level_0,text,topic,tokens,label
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2019-11-22,Tesla Unveiled a Bulletproof Pickup. Then the Window Broke. Elon Musk embraced unorthodox design choices with Tesla’s all-electric pickupto wow truck buyers and help differentiate it. But the features he is promising could dent the commercial prospects of the vehicle Mr. Musk has dubbed Cybertruck.,BUSINESS,tesla unveiled bulletproof pickup window broke elon musk embraced unorthodox design choices tesla all-electric pickupto wow truck buyers help differentiate features promising dent commercial prospects vehicle musk dubbed cybertruck,bad
2019-11-22,Tesla: Break Glass in Case of Sales Emergency Investors betting on Tesla dominating the truck market have far more to worry about than just a bad omen.,HEARD ON THE STREET,tesla break glass case sales emergency investors betting tesla dominating truck market far worry bad omen,bad
2019-11-22,"Tesla Unveils Electric Pickup and Targets Detroit Rivals’ Profit Engine Elon Musk unveiled the next piece of Tesla’s electric-vehicle vision: a pickup truck with a starting price of $39,900. But the road won’t be easy.",TECH,tesla unveils electric pickup targets detroit rivals profit engine elon musk unveiled next piece tesla electric-vehicle vision pickup truck starting price 39 900 road easy,bad
2019-11-14,"Tesla Makes Some Improvements in Consumer Reports’s Latest Reliability Ranks After tumbling in the rankings last year, electric car maker Tesla rebounded slightly in Consumer Reports’ latest reliability survey with the organization restoring its recommended rating to two models.",BUSINESS,tesla makes improvements consumer reports latest reliability ranks tumbling rankings last year electric car maker tesla rebounded slightly consumer reports latest reliability survey organization restoring recommended rating two models,good
2019-11-12,"Tesla to Build European Car Plant in Berlin, Musk Says Tesla Chief Executive Elon Musk said the auto maker would build a planned European plant to assemble electric vehicles in Germany.",AUTOS,tesla build european car plant berlin musk says tesla chief executive elon musk said auto maker build planned european plant assemble electric vehicles germany,good
2019-10-25,"Procter & Gamble, Nike, Tesla: Stocks That Defined the Week Here are seven major companies whose stocks moved on the week’s news",STOCKS,procter gamble nike tesla stocks defined week seven major companies whose stocks moved week news,good
2019-10-24,"Tesla Turns a Corner. It May Not Be the Last One. Tesla this week posted just its fifth quarterly profit since going public in 2010, after accumulating a loss of more than $6 billion so far.",BUSINESS,tesla turns corner may last one tesla week posted fifth quarterly profit since going public 2010 accumulating loss 6 billion far,good
2019-10-24,"Tesla Delivers a Surprising Profit The electric-car maker posted a surprise profit for the third quarter, easing investors’ fears that the pursuit of growth and record production figures would come at the expense of the bottom line.",BUSINESS,tesla delivers surprising profit electric-car maker posted surprise profit third quarter easing investors fears pursuit growth record production figures come expense bottom line,good
2019-10-24,"Stocks to Watch: Microsoft, Tesla, Ford, Las Vegas Sands, Blackstone Here are some of the companies with shares expected to trade actively in Thursday’s session.",MONEYBEAT,stocks watch microsoft tesla ford las vegas sands blackstone companies shares expected trade actively thursday session,good
2019-10-23,"Tesla Stock Charge Will Have Limited Range Shares surge on a surprise profit, but falling sales mean the debate over long-run viability is far from finished.",HEARD ON THE STREET,tesla stock charge limited range shares surge surprise profit falling sales mean debate long-run viability far finished,bad


### 9. Train-test split

In [333]:
X = data[['tokens']]
y = data['label']

In [335]:
y.shape

(7171,)

In [338]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=.20,
                                                    random_state=10,
                                                    stratify=y)

### 10. Save into a csv file

In [339]:
X_train.to_csv('train_x.csv', index=False)
y_train.to_csv('train_y.csv', index=False)
X_test.to_csv('test_x.csv', index=False)
y_test.to_csv('test_y.csv', index=False)