# Basic Stock Prediction Tutorial
---
In this notebook, I look at the original dataset by Aaron. [Source to notebook](https://www.kaggle.com/ndrewgele/omg-nlp-with-the-djia-and-reddit)

Here, I delve into the basics of NLP and focus on the original business problem. 

**"Given a list of headlines, how can we determine if the DJIA rises or falls on that particular day?"**

---



# Outline

## 1.1 Importing Data & Libraries
## 2.1 Using CountVectorizer
## 3.1 Bag of words approach
## 3.2 Basic Logistic Regression
## 3.3 Examining Coefficients
## 4.1 n-gram model (n=2)


## 1.1 Importing libraries and data

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import pandas as pd

In [3]:
data = pd.read_csv("data/Combined_News_DJIA.csv")

In [4]:
data.head(2)

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."


#### Splitting into train and test set

In [5]:
train = data[data['Date'] < '2015-01-01']
test = data[data['Date'] > '2014-12-31']
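
The comparison above works on plain strings because dates written as `YYYY-MM-DD` sort alphabetically in chronological order. Here is a minimal sketch on a made-up three-row frame (the name `toy` and its rows are hypothetical, only the date format mirrors the real data) showing that parsing the dates with `pd.to_datetime` gives the same split:

```python
import pandas as pd

# Hypothetical toy frame; only the 'Date' format mirrors the real data.
toy = pd.DataFrame({
    "Date": ["2014-12-30", "2014-12-31", "2015-01-02"],
    "Label": [1, 0, 1],
})

# String comparison, as in the cell above.
train_str = toy[toy["Date"] < "2015-01-01"]
test_str = toy[toy["Date"] > "2014-12-31"]

# Same split after parsing the strings into real timestamps.
dates = pd.to_datetime(toy["Date"])
train_dt = toy[dates < pd.Timestamp("2015-01-01")]
test_dt = toy[dates > pd.Timestamp("2014-12-31")]

print(len(train_str), len(test_str))
print(train_str.index.equals(train_dt.index), test_str.index.equals(test_dt.index))
```

If the dates were stored in another format (for example `08/08/2008`), the string comparison would no longer be safe and parsing first would be the better choice.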

#### An example of the tokeniser process used later

In [6]:
example = train.iloc[5,2]
print(example)

b"Mom of missing gay man: Too bad he's not a 21-year-old cheerleader, then they'd still be looking for him"


In [7]:
example2 = example.lower()
print(example2)

b"mom of missing gay man: too bad he's not a 21-year-old cheerleader, then they'd still be looking for him"


In [8]:
example3 = CountVectorizer().build_tokenizer()(example2)
print(example3)

['mom', 'of', 'missing', 'gay', 'man', 'too', 'bad', 'he', 'not', '21', 'year', 'old', 'cheerleader', 'then', 'they', 'still', 'be', 'looking', 'for', 'him']


In [23]:
pd.DataFrame([[x,example3.count(x)] for x in set(example3)], columns = ['Word', 'Count'])

Unnamed: 0,Word,Count
0,be,1
1,old,1
2,still,1
3,him,1
4,not,1
5,cheerleader,1
6,then,1
7,21,1
8,missing,1
9,mom,1
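
Notice that `he's` became `he`, and that `a` and the leading `b` are missing from the token list. That follows from the default `token_pattern` of `CountVectorizer`, `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters and treats punctuation as a separator. A small sketch that reproduces the tokens with the standard library alone (the names `TOKEN_PATTERN` and `headline` are mine, for illustration):

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: runs of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# The lowercased headline from above, including the literal b"..." wrapper.
headline = 'b"' + (
    "mom of missing gay man: too bad he's not a 21-year-old cheerleader, "
    "then they'd still be looking for him"
) + '"'

tokens = re.findall(TOKEN_PATTERN, headline)
counts = Counter(tokens)

print(tokens)
print(len(tokens), "tokens,", len(counts), "distinct")
```

Single-character pieces such as `b`, `a`, `s` and `d` never match the pattern, which is why they do not appear as features later on.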


## 2.1 Using CountVectorizer

In [10]:
trainheadlines = []
for row in range(0,len(train.index)):
    trainheadlines.append(' '.join(str(x) for x in train.iloc[row,2:27]))
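
The loop above builds one string per day by joining the 25 headline columns (positions 2 to 26). As a sketch of an equivalent, loop-free way to write it, here is the same idea on a made-up two-row frame (`toy` and its headlines are hypothetical and have only two headline columns):

```python
import pandas as pd

# Hypothetical frame with the same layout idea: Date, Label, then headlines.
toy = pd.DataFrame({
    "Date": ["2008-08-08", "2008-08-11"],
    "Label": [0, 1],
    "Top1": ["first headline", "third headline"],
    "Top2": ["second headline", "fourth headline"],
})

# Loop version, mirroring the cell above (columns from position 2 onward).
looped = []
for row in range(0, len(toy.index)):
    looped.append(" ".join(str(x) for x in toy.iloc[row, 2:]))

# Alternative: cast every cell to str, then join across each row.
joined = toy.iloc[:, 2:].astype(str).apply(" ".join, axis=1).tolist()

print(joined)
print(joined == looped)
```

Both versions produce the same list, so the choice is mostly a matter of taste.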

In [20]:
basicvectorizer = CountVectorizer()
basictrain = basicvectorizer.fit_transform(trainheadlines)
print(basictrain.shape)

(1611, 31675)


In [24]:
type(basictrain)

scipy.sparse.csr.csr_matrix

In [34]:
train.shape

(1611, 27)

#### What's a [sparse matrix?](https://machinelearningmastery.com/sparse-matrices-for-machine-learning/)

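In essence, it is a matrix in which most entries are zero, stored in a way that only keeps the non-zero entries. In this case, every row is one day's combined headlines and every column is a word. Each entry is the number of times that word appears on that day, so 0 means the word does not appear at all.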
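
To make this concrete, here is a minimal sketch on two made-up documents (`docs` is hypothetical, not taken from the dataset). Note that `CountVectorizer` stores how many times each word occurs, so an entry can be larger than 1:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, not taken from the dataset.
docs = [
    "russia russia georgia",
    "olympics open in china",
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)   # SciPy sparse matrix, like basictrain

dense = matrix.toarray()                  # fine for a toy example only
vocab = vectorizer.vocabulary_            # maps each word to its column index

print(sorted(vocab, key=vocab.get))       # words in column order
print(dense)

# Share of entries that are zero; only the non-zero ones are stored.
n_cells = matrix.shape[0] * matrix.shape[1]
print(1 - matrix.nnz / n_cells)
```

On the real training matrix of shape (1611, 31675), calling `toarray()` would allocate about 51 million cells, which is why the sparse format is used.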


## 3.1 What is the bag of words approach?
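
In a bag-of-words representation, each document is reduced to the words it contains and how often each one occurs; the order of the words is thrown away. A minimal sketch with two made-up sentences (`docs` is hypothetical) that use the same words in a different order:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical sentences with the same words in a different order.
docs = ["dog bites man", "man bites dog"]

bow = CountVectorizer().fit_transform(docs).toarray()

print(bow)
# Both rows are identical: bag of words ignores word order.
print((bow[0] == bow[1]).all())
```

The two sentences mean different things, yet the classifier would see exactly the same input for both.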



---

## 3.2 Logistic Regression

Now that we have built a bag-of-words matrix for our train set, we fit a logistic regression model to it, using the `Label` column as the target.

In [35]:
basicmodel = LogisticRegression()
basicmodel = basicmodel.fit(basictrain, train["Label"])



In [36]:
testheadlines = []
for row in range(0,len(test.index)):
    testheadlines.append(' '.join(str(x) for x in test.iloc[row,2:27]))
basictest = basicvectorizer.transform(testheadlines)
predictions = basicmodel.predict(basictest)

#### Confusion Matrix

In [47]:
pd.crosstab(test["Label"], predictions, rownames=["Actual"], colnames=["Predicted"])

Actual \ Predicted,0,1
0,61,125
1,92,100


In [55]:
total = 61 + 125 + 100 + 92
correct = 61 + 100
accuracy = correct/total

print(str(round(accuracy*100,2)) + "%")

42.59%


#### Prediction Accuracy 

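Our prediction accuracy using the basic logistic regression model is 42.59%. For comparison, always predicting Label 1 would score 50.79% on this test set (192 of 378 days), so the basic model does worse than that trivial baseline.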
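
The arithmetic above can be cross-checked without retyping the sums. The sketch below rebuilds label arrays from the four counts in the confusion matrix and uses `accuracy_score` from `sklearn.metrics`; the names `cells`, `y_true` and `y_pred` are mine. It also computes a simple baseline, always predicting the more common class, as a point of reference:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Counts copied from the confusion matrix above, keyed by (actual, predicted).
cells = {(0, 0): 61, (0, 1): 125, (1, 0): 92, (1, 1): 100}

# Expand the counts back into one label per test day.
y_true = np.concatenate([np.full(n, a) for (a, p), n in cells.items()])
y_pred = np.concatenate([np.full(n, p) for (a, p), n in cells.items()])

accuracy = accuracy_score(y_true, y_pred)

# Baseline: always predict the more common class in the test set.
baseline = max(np.mean(y_true == 0), np.mean(y_true == 1))

print(confusion_matrix(y_true, y_pred))
print(round(accuracy * 100, 2), round(baseline * 100, 2))
```

In the notebook itself, `accuracy_score(test["Label"], predictions)` should give the same number directly.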

## 3.3 Examining Coefficients

Next, we examine which words have the largest effect on our classifier, that is, the words with the most positive and the most negative coefficients.

In [56]:
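# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use it instead
basicwords = basicvectorizer.get_feature_names_out()
basiccoeffs = basicmodel.coef_.tolist()[0]
coeffdf = pd.DataFrame({'Word' : basicwords, 
                        'Coefficient' : basiccoeffs})
# Largest coefficients first; ties broken alphabetically
coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[False, True])
coeffdf.head(10)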

Unnamed: 0,Word,Coefficient
19419,nigeria,0.497924
25261,self,0.452526
29286,tv,0.428011
15998,korea,0.425863
20135,olympics,0.425716
15843,kills,0.411636
26323,so,0.411267
29256,turn,0.394855
10874,fears,0.388555
28274,territory,0.384031


In [57]:
coeffdf.tail(10)

Unnamed: 0,Word,Coefficient
27299,students,-0.424441
8478,did,-0.427079
6683,congo,-0.431925
12818,hacking,-0.444069
7139,country,-0.44857
16949,low,-0.463116
3651,begin,-0.470454
25433,sex,-0.494555
24754,sanctions,-0.549725
24542,run,-0.587794
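
One way to read these numbers: in logistic regression, each extra occurrence of a word multiplies the odds of Label 1 by `exp(coefficient)`, with all other word counts held fixed. A small sketch using two coefficients copied from the tables above (the dictionary names are mine):

```python
import math

# Coefficients copied from the head and tail tables above.
coefficients = {"nigeria": 0.497924, "run": -0.587794}

# exp(coef) is the factor applied to the odds of Label = 1
# for each extra occurrence of the word, other counts held fixed.
odds_ratios = {word: math.exp(coef) for word, coef in coefficients.items()}

for word, ratio in odds_ratios.items():
    print(f"{word}: coefficient {coefficients[word]:+.3f}, odds multiplier {ratio:.2f}")
```

A multiplier above 1 pushes the prediction towards Label 1 and a multiplier below 1 pushes it towards Label 0. Given the low test accuracy, these words should not be read as reliable market signals.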


#### Checkpoint

---

So far, we have attempted basic logistic regression using a bag of words approach and obtained an accuracy of 42.59%.

However, single words carry little context: the meaning of a word often depends on its neighbours. Moving on, we will try n=2, meaning that each feature is a pair of adjacent words (a bigram). Hopefully, this will raise the predictive power of our model.
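
To see what n=2 changes, here is a minimal sketch on two made-up sentences (`docs` is hypothetical) that contain the same words in a different order. With `ngram_range=(2, 2)`, the features are pairs of adjacent words, so the two sentences no longer look alike:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences: same words, different order.
docs = ["dog bites man", "man bites dog"]

bigrams = CountVectorizer(ngram_range=(2, 2))
matrix = bigrams.fit_transform(docs).toarray()

print(sorted(bigrams.vocabulary_))   # the bigram features, in column order
print(matrix)
print((matrix[0] == matrix[1]).all())
```

The price is a much larger vocabulary, since the number of distinct word pairs grows far faster than the number of distinct words.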

## 4.1 Advanced Modelling



In [58]:
advancedvectorizer = CountVectorizer(ngram_range=(2,2))
advancedtrain = advancedvectorizer.fit_transform(trainheadlines)

In [59]:
print(advancedtrain.shape)

(1611, 366721)


In [60]:
advancedmodel = LogisticRegression()
advancedmodel = advancedmodel.fit(advancedtrain, train["Label"])



In [61]:
testheadlines = []
for row in range(0,len(test.index)):
    testheadlines.append(' '.join(str(x) for x in test.iloc[row,2:27]))
advancedtest = advancedvectorizer.transform(testheadlines)
advpredictions = advancedmodel.predict(advancedtest)

In [63]:
pd.crosstab(test["Label"], advpredictions, rownames=["Actual"], colnames=["Predicted"])

Actual \ Predicted,0,1
0,66,120
1,45,147


In [64]:
total = 66 + 120 + 147 + 45
correct = 66 + 147
accuracy = correct/total

print(str(round(accuracy*100,2)) + "%")

56.35%


In [45]:
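# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use it instead
advwords = advancedvectorizer.get_feature_names_out()
advcoeffs = advancedmodel.coef_.tolist()[0]
advcoeffdf = pd.DataFrame({'Words' : advwords, 
                        'Coefficient' : advcoeffs})
# Largest coefficients first; ties broken alphabetically
advcoeffdf = advcoeffdf.sort_values(['Coefficient', 'Words'], ascending=[False, True])
advcoeffdf.head(10)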

Unnamed: 0,Words,Coefficient
272047,right to,0.286533
24710,and other,0.275274
285392,set to,0.274698
316194,the first,0.262873
157511,in china,0.227943
159522,in south,0.224184
125870,found in,0.21913
124411,forced to,0.216726
173246,it has,0.211137
322590,this is,0.209239


In [46]:
advcoeffdf.tail(10)


Unnamed: 0,Words,Coefficient
326846,to help,-0.198495
118707,fire on,-0.201654
155038,if he,-0.209702
242528,people are,-0.211303
31669,around the,-0.213362
321333,there is,-0.215699
327113,to kill,-0.221812
340714,up in,-0.226289
358917,with iran,-0.227516
315485,the country,-0.331153


#### Checkpoint

Using n=2, we have repeated the same process and obtained an accuracy of 56.35%, up from 42.59% with single words.

## Conclusion

---

Here, we have examined the [bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) approach - using n=1 and n=2.

Moving forward, we can consider using different classifiers for an improved effect.
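
As a rough sketch of how another classifier could be swapped in, the snippet below wraps the vectoriser and the classifier in a scikit-learn pipeline so that only the last step changes. Multinomial naive Bayes is just one possible choice that I picked for illustration, and the headlines and labels are made up (`toy_headlines` and `toy_labels` stand in for `trainheadlines` and `train["Label"]`):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for trainheadlines / train["Label"]; not real data.
toy_headlines = [
    "markets rally as talks resume",
    "talks resume after long pause",
    "sanctions hit markets hard",
    "new sanctions announced today",
]
toy_labels = [1, 1, 0, 0]

candidates = {
    "logistic regression": LogisticRegression(),
    "naive Bayes": MultinomialNB(),
}

fitted = {}
for name, classifier in candidates.items():
    # The pipeline keeps vectoriser and classifier together,
    # so only the final step differs between candidates.
    model = make_pipeline(CountVectorizer(ngram_range=(2, 2)), classifier)
    fitted[name] = model.fit(toy_headlines, toy_labels)
    print(name, fitted[name].predict(["talks resume today"]))
```

With the real data, each candidate would be fitted on `trainheadlines` and scored on `testheadlines` in the same way as above.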