# PREPROCESSING

This notebook compiles the datasets used for modeling.

#### The main steps of the process:

- Reading in stock price datasets, restructuring them and creating price-based labels

- Reading in article headlines, stemming and tokenizing the text

- Merging and concatenating the article headlines and labels

- Splitting into train and test datasets

- Saving into a csv file


As a benchmark for each company's performance I used the <b>S&P500</b> index.

### Importing the necessary libraries

In [1]:
import pandas as pd
import datetime
import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_colwidth', None)  # None disables truncation (-1 is deprecated)
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 1000)

### 1. Reading in stock price datasets

#### Source: https://finance.yahoo.com

In [406]:
companies = ['tesla', 'ford', 'ibm', 'goldman', 'boeing', 'ge']

In [2]:
sp500 = pd.read_csv('financial/SP500.csv')

tesla_stock = pd.read_csv('financial/TSLA.csv')
ford_stock = pd.read_csv('financial/F.csv')
ibm_stock = pd.read_csv('financial/IBM.csv')
goldman_stock = pd.read_csv('financial/GS.csv')
boeing_stock = pd.read_csv('financial/BA.csv')
ge_stock = pd.read_csv('financial/GE.csv')


In [207]:
tesla_stock.head(2)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2010-06-29,19.0,25.0,17.540001,23.889999,23.889999,18766300
1,2010-06-30,25.790001,30.42,23.299999,23.83,23.83,17187100


### 2. Restructuring/labeling stock data

#### Labeling rule:
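* <b>good</b>: if the company's daily share price change beat the index change by more than 0.5 percentage points
* <b>bad</b>: if the company's daily share price change trailed the index change by more than 0.5 percentage points
* <b>neutral</b>: in every other case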
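
As a quick check of the rule, here is a minimal sketch in plain Python on made-up numbers. The function name `label_for` and the sample values are hypothetical; the 0.5 threshold is the one used by the `label` function further down in this notebook, which also assigns 'neutral' to everything in between.

```python
def label_for(change, change_i, threshold=0.5):
    # change and change_i are the daily % changes of the stock and of the index
    if change > change_i + threshold:
        return 'good'
    if change < change_i - threshold:
        return 'bad'
    return 'neutral'

# hypothetical (stock change, index change) pairs, in percent
examples = [(2.0, 0.3), (-1.5, 0.2), (0.4, 0.1)]
labels = [label_for(c, ci) for c, ci in examples]
print(labels)
```

The third pair stays 'neutral' because the stock is only 0.3 percentage points ahead of the index.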

Calculation:
* Changes in stock prices are calculated from the 'Close' prices
* 'change': % change compared to the previous day
* 'nextday': % change from the current day to the following day
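
The two ratios can be worked through on a toy series. This is only a sketch in plain Python: the closing prices and the helper name `pct_change` are made up, while the formula (`new / old * 100 - 100`, rounded to 4 decimals) is the one used in `price_ratios`.

```python
# hypothetical closing prices for three consecutive trading days
closes = [100.0, 102.0, 99.96]

def pct_change(new, old):
    # same formula as in price_ratios
    return round(new / old * 100 - 100, 4)

# 'change' has no value on the first day, 'nextday' has none on the last day
change = [None] + [pct_change(closes[i], closes[i - 1]) for i in range(1, len(closes))]
nextday = [pct_change(closes[i + 1], closes[i]) for i in range(len(closes) - 1)] + [None]

for day, (c, n) in enumerate(zip(change, nextday)):
    print(day, c, n)
```

Note that 'nextday' on a given day is simply the 'change' value of the following trading day.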

In [3]:
def price_ratios(stock_data, index_data):
    '''
    This function returns a restructured dataframe for labeling purposes.

    The function requires a specific data structure, which is based on the current
    data source (Yahoo Finance).

    -------------------------
    Inputs:
        stock_data: dataframe
        index_data: dataframe

    -------------------------
    Returns: dataframe

    '''
    # merge dataframes
    df = pd.merge(stock_data, index_data, how='inner',
                  on='Date', suffixes=('', '_i'))
    
    df.columns = df.columns.str.lower()
    df = df[['date','close', 'close_i']]

    # shifting prices by one day ahead and merge
    df = df.set_index('date')
    shifted_next = df.shift(periods=-1)
    df = pd.merge(df, shifted_next, on='date', suffixes=('', '_next')) #next day prices
    df = df.reset_index()
    
    # shifting prices by one day back and merge
    df = df.set_index('date')
    shifted_prev = df.shift(periods=1)
    df = pd.merge(df, shifted_prev, on='date', suffixes=('', '_prev')) #previous day prices
    df = df.reset_index()

    # creating ratios
    df['change'] = round(df['close'] / df['close_prev'] * 100 - 100, 4)
    df['nextday'] = round(df['close_next'] / df['close'] * 100 - 100, 4)
    df['change_i'] = round(df['close_i'] / df['close_i_prev'] * 100 - 100, 4)
    df['nextday_i'] = round(df['close_i_next'] / df['close_i'] * 100 - 100, 4)

    # adding trading holidays to the dataframe
    dates = []
    last_day = pd.to_datetime(stock_data['Date'].iloc[-1], format='%Y-%m-%d')
    for i in range(3650):
        date = last_day - datetime.timedelta(days=i)
        date = date.strftime('%Y-%m-%d')
        dates.append(date)
    date_df = pd.DataFrame(dates, columns=['date'])
    df = pd.merge(date_df, df, on='date', how='left')

    # fill in trading holidays with the data of the following trading day
    # (because the dates are in reverse order, this is a forward fill)
    df = df.ffill()

    return df


In [4]:
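def label(df):
    '''
    Labels each day by comparing the stock's daily % change with the index's.

    'good' if the stock beat the index by more than 0.5 percentage points,
    'bad' if it trailed the index by more than 0.5, otherwise 'neutral'.
    Returns a dataframe with the columns 'date' and 'label'.
    '''
    df.loc[df['change'] > df['change_i'] + 0.5, 'label'] = 'good'
    df.loc[df['change'] < df['change_i'] - 0.5, 'label'] = 'bad'
    df.loc[df['label'].isna(), 'label'] = 'neutral'

    return df[['date', 'label']]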

In [5]:
tesla_label = label(price_ratios(tesla_stock, sp500))
ford_label = label(price_ratios(ford_stock, sp500))
ibm_label = label(price_ratios(ibm_stock, sp500))
goldman_label = label(price_ratios(goldman_stock, sp500))
boeing_label = label(price_ratios(boeing_stock, sp500))
ge_label = label(price_ratios(ge_stock, sp500))


In [6]:
tesla_label.label.value_counts()

good       1674
bad        1342
neutral    634 
Name: label, dtype: int64
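
The counts above cover calendar days, not only trading days, because `price_ratios` fills weekends and holidays with the data of the following trading day. The sketch below illustrates that fill in plain Python; the two trading days and their values are made up, and the dictionary-based loop is a stand-in for the merge and forward fill used in the function.

```python
import datetime

# hypothetical daily % changes for a Friday and the following Monday
trading = {'2020-01-03': -0.5, '2020-01-06': 1.2}

# calendar days in reverse order, starting from the last trading day
last_day = datetime.date(2020, 1, 6)
calendar = [(last_day - datetime.timedelta(days=i)).strftime('%Y-%m-%d')
            for i in range(4)]

# forward fill over the reversed calendar: a day without trading data
# inherits the value of the next trading day
filled = {}
carried = None
for day in calendar:
    if day in trading:
        carried = trading[day]
    filled[day] = carried

print(filled)
```

Saturday and Sunday both receive Monday's value, so a headline published over the weekend is matched with the label of the next trading day.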

### 3. Import headlines

In [30]:
import pymongo

myclient = pymongo.MongoClient("mongodb://127.0.0.1:27017/")
news = myclient['news']

In [62]:
tesla_news = pd.DataFrame.from_records(news.tesla_news.find())
ford_news = pd.DataFrame.from_records(news.ford_news.find())
ibm_news = pd.DataFrame.from_records(news.ibm_news.find())
goldman_news = pd.DataFrame.from_records(news.goldman_news.find())
boeing_news = pd.DataFrame.from_records(news.boeing_news.find())
ge_news = pd.DataFrame.from_records(news.ge_news.find())


### 4. Converting dates / restructuring dataframe

In [63]:
def converting_news(some_news):
    some_news.date = pd.to_datetime(some_news.date).apply(lambda x: x.strftime('%Y-%m-%d'))
    some_news['text'] = some_news.headline +' '+ some_news.summary
    some_news = some_news[['date', 'text', 'topic']]
    return some_news

In [64]:
tesla_news = converting_news(tesla_news)
ford_news = converting_news(ford_news)
ibm_news = converting_news(ibm_news)
goldman_news = converting_news(goldman_news)
boeing_news = converting_news(boeing_news)
ge_news = converting_news(ge_news)

### 5. Keeping just the relevant articles

A headline or summary is considered relevant if it contains at least one of the keywords (basically the company name).
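
The check used here is a case-sensitive substring match, so a short keyword such as 'GE' also matches inside longer uppercase words. The sketch below contrasts it with a whole-word match; `relevant_whole_word` is a hypothetical alternative for illustration, not the function used in this notebook, and the first headline is made up.

```python
import re

def relevant_substring(text, keywords):
    # same logic as the substring check used in this notebook
    return any(keyword in text for keyword in keywords)

def relevant_whole_word(text, keywords):
    # hypothetical stricter variant: the keyword must stand alone as a word
    return any(re.search(r'\b' + re.escape(keyword) + r'\b', text) is not None
               for keyword in keywords)

headline = 'MERGER talks resume at chipmaker'
print(relevant_substring(headline, ['GE']))   # matches 'GE' inside 'MERGER'
print(relevant_whole_word(headline, ['GE']))  # no standalone 'GE' in the text
print(relevant_whole_word('GE Realigns Appliances, Lighting', ['GE', 'General Electric']))
```

Whether the stricter variant is worth it depends on how often such accidental matches occur in the headlines.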

In [65]:
def relevant(text, keywords):
    '''
    Inputs:
        text: string
        keywords: list of string objects
    Returns:
        boolean
    '''
    for keyword in keywords:
        if keyword in text:
            return True
    return False

def relevant_rows(df, keywords):
    '''
    Returns the rows of the dataframe, where the text is relevant
    Input:
        df: dataframe
        keywords: list of string objects
    Returns:
        dataframe with the relevant rows
    '''
    df = df.loc[df.text.apply(lambda x: relevant(x, keywords))]
    
    print('Number of rows: ',len(df))
    return df

In [66]:
tesla_news = relevant_rows(tesla_news, ['Tesla'])

Number of rows:  933


In [67]:
ford_news = relevant_rows(ford_news, ['Ford'])

Number of rows:  1447


In [68]:
ibm_news = relevant_rows(ibm_news, ['IBM', 'International Business Machines'])

Number of rows:  509


In [69]:
goldman_news = relevant_rows(goldman_news, ['Goldman'])

Number of rows:  1667


In [70]:
boeing_news = relevant_rows(boeing_news, ['Boeing'])

Number of rows:  1629


In [71]:
ge_news = relevant_rows(ge_news, ['GE', 'General Electric'])

Number of rows:  1085


### 6. Stemming/tokenizing

The 'stopwords' and 'negativewords' lists were originally based on the stopwords of the nltk library. For this sentiment
analysis I reviewed and modified the stopwords (removed keywords, added some non-predictive words).
Words that completely change the meaning of the words that follow them (I call them 'negative words') are simply replaced with 'not'.
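
The effect of these lists can be seen on two short sentences. This is a minimal sketch: the function name `tokenize` and the sample sentences are made up, and the two sets are shortened subsets of `my_stopwords` and `my_negativewords`; the regular expression and the order of the steps follow `my_tokenizer`.

```python
import re

stopwords = {'the', 'a', 't', 'is', 'to'}       # shortened subset of my_stopwords
negative_words = {'doesn', 'isn', 'no', 'won'}  # shortened subset of my_negativewords

def tokenize(text):
    # split on everything that is not a letter, digit or hyphen, then lowercase
    tokens = [t.lower() for t in re.findall(r'[a-zA-Z0-9-]+', text)]
    # drop stopwords, then map negative words to 'not'
    tokens = [t for t in tokens if t not in stopwords]
    return ' '.join('not' if t in negative_words else t for t in tokens)

print(tokenize("Tesla doesn't expect a profit"))  # tesla not expect profit
print(tokenize("Boeing won the contract"))        # boeing not contract
```

The apostrophe splits "doesn't" into 'doesn' and 't', which is why the negative words are stored without their "'t" ending and 't' is a stopword. The second sentence shows a side effect: 'won' stands for "won't", so the verb "won" is also turned into 'not'.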

In [72]:
my_stopwords = ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any',
                'are', 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'between', 'both', 'but', 'by',
                'can', 'd', 'did', 'do', 'does', 'doing', 'during', 'each', 'few', 'for', 'from', 'further',
                'had', 'has', 'have', 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his',
                'how', 'i', 'if', 'in', 'into', 'is', 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma',
                'me', 'might', 'more', 'most', 'my', 'myself', 'need', 'now', 'o', 'of', 'off', 'on', 'once',
                'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'u', 't',
                'same', 'she', "she's", 'should', "should've", 'so', 'some', 'such', 'than', 'that', "that'll",
                'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'k',
                'those', 'through', 'to', 'too', 'until', 've', 'very', 'was', 'we', 'were', 'what', 'when',
                'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'would', 'could', 'y',
                'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves',
                'mr', 'mrs', 'ms', 'nan', 'inc', 'co', 'com', 'wsj', 'monday', 'tuesday', 'wednesday', 'thursday',
                'friday', 'saturday', 'sunday', 'either', 'shall', 'must', 'with', 'without', 'may']

# my_negativewords = ["aren't", "couldn't", "didn't", "doesn't", "don't",
#                     "hadn't", "hasn't", "haven't", "isn't", "mightn't",
#                     'neither', 'nor', "mustn't", "needn't", 'no', "shan't",
#                     "shouldn't", "wasn't", "weren't", "won't", "wouldn't"]


my_negativewords = ['aren', 'couldn', 'didn', 'doesn', 'don', 'hadn', 'hasn', 'haven', 'isn', 'mightn', 'no',
                    'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn', 'neither', 'nor']

In [73]:
import re


def my_tokenizer(text):
    # keep only alphanumeric tokens (hyphens allowed); punctuation is dropped
    token_list = re.findall(r'([a-zA-Z0-9-]+)', text)

    # lowercase
    token_list = [t.lower() for t in token_list]

    # remove stopwords
    token_list = [word for word in token_list if word not in my_stopwords]

    # replace negative words with 'not'
    for i, word in enumerate(token_list):
        if word in my_negativewords:
            token_list[i] = 'not'
    
    # converting list into text
    token_text = ' '.join(token_list)

    return token_text

In [74]:
tesla_news['tokens'] = tesla_news.text.apply(my_tokenizer)
ford_news['tokens'] = ford_news.text.apply(my_tokenizer)
ibm_news['tokens'] = ibm_news.text.apply(my_tokenizer)
goldman_news['tokens'] = goldman_news.text.apply(my_tokenizer)
boeing_news['tokens'] = boeing_news.text.apply(my_tokenizer)
ge_news['tokens'] = ge_news.text.apply(my_tokenizer)

### 7. Merging/concatenating

In [75]:
tesla = pd.merge(tesla_news, tesla_label, how='left', on='date')
ford = pd.merge(ford_news, ford_label, how='left', on='date')
ibm = pd.merge(ibm_news, ibm_label, how='left', on='date')
goldman = pd.merge(goldman_news, goldman_label, how='left', on='date')
boeing = pd.merge(boeing_news, boeing_label, how='left', on='date')
ge = pd.merge(ge_news, ge_label, how='left', on='date')


In [76]:
data = pd.concat([tesla, ford, ibm, goldman, boeing, ge])

### 8. Removing irrelevant topics

In [77]:
# relevant topics: the 34 most frequent topics, minus the sections excluded below
relevant_topics = data.topic.value_counts().index.to_list()[:34]
for elem in ['PHOTOS', 'LETTERS', 'COMMENTARY', 'BOOKSHELF', 'RUMBLE SEAT']:
    relevant_topics.remove(elem)

# keep only the rows whose topic is in the list
data = data.loc[data.topic.isin(relevant_topics)]

In [78]:
data.shape

(6202, 5)

In [85]:
data = data.reset_index()


In [88]:
data.drop(columns = 'index', inplace=True)

In [89]:
data.tail(2)

Unnamed: 0,date,text,topic,tokens,label
6200,2010-01-08,"GE Gains; Bad Day for Alcoa Stocks closed higher Thursday as analyst notes fueled a rally in GE, BofA and across the financial sector, although a downgrade sent Alcoa lower.",,ge gains bad day alcoa stocks closed higher analyst notes fueled rally ge bofa across financial sector although downgrade sent alcoa lower,good
6201,2010-01-05,"GE Realigns Appliances, Lighting Unit is realigning the industrial business formerly part of the division that makes appliances and light bulbs.",BUSINESS,ge realigns appliances lighting unit realigning industrial business formerly part division makes appliances light bulbs,neutral


### 9. Train-test split

In [93]:
from sklearn.model_selection import train_test_split

data_train, data_test = train_test_split(data,
                                         test_size=.20,
                                         random_state=10,
                                         stratify=data.label)

### 10. Save into a csv file

In [339]:
data_train.to_csv('train.csv')
data_test.to_csv('test.csv')