# Apple Stock Sentiment Prediction Using News Headlines:

In [1]:
## Importing all relevant libraries...
import pandas as pd
import numpy as np
import re
from sklearn import metrics
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [2]:
df = pd.read_csv("Data.csv" , encoding = 'ISO-8859-1')
df.head(3)

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2000-01-03,0,A 'hindrance to operations': extracts from the...,Scorecard,Hughes' instant hit buoys Blues,Jack gets his skates on at ice-cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl's successor drawn into scandal,The difference between men and women,"Sara Denver, nurse turned solicitor",Diana's landmine crusade put Tories in a panic,Yeltsin's resignation caught opposition flat-f...,Russian roulette,Sold out,Recovering a title
1,2000-01-04,0,Scorecard,The best lake scene,Leader: German sleaze inquiry,"Cheerio, boyo",The main recommendations,Has Cubie killed fees?,Has Cubie killed fees?,Has Cubie killed fees?,...,On the critical list,The timing of their lives,Dear doctor,Irish court halts IRA man's extradition to Nor...,Burundi peace initiative fades after rebels re...,PE points the way forward to the ECB,Campaigners keep up pressure on Nazi war crime...,Jane Ratcliffe,Yet more things you wouldn't know without the ...,Millennium bug fails to bite
2,2000-01-05,0,Coventry caught on counter by Flo,United's rivals on the road to Rio,Thatcher issues defence before trial by video,Police help Smith lay down the law at Everton,Tale of Trautmann bears two more retellings,England on the rack,Pakistan retaliate with call for video of Walsh,Cullinan continues his Cape monopoly,...,South Melbourne (Australia),Necaxa (Mexico),Real Madrid (Spain),Raja Casablanca (Morocco),Corinthians (Brazil),Tony's pet project,Al Nassr (Saudi Arabia),Ideal Holmes show,Pinochet leaves hospital after tests,Useful links


## About Data:
- **Label** : '1' means stock price will surge and '0' means stock price will drop.

In [3]:
## Indexing all headline columns for preprocessing.....
data = df.iloc[: , 2:]

## Data Preprocessing:

In [4]:
## Converting all 25 headlines into one processed paragragh...
comb_headlines = []

for entry in range(0 , df.shape[0]):
    temp = data.iloc[entry , :]                               ## Indexing row....
    temp = temp.replace('[^a-zA-Z]' , ' ' , regex = True)     ## Removing all characters other than alphabets with space....
    temp = temp.str.lower()                                   ## Converting all aphabets to lower case....
    temp = ' '.join(str(x) for x in temp.values)              ## Joining all columns to make 1 para...
    comb_headlines.append(temp)                               ## Appending the para to empty list....

In [5]:
## Removing all stopwords from all paragraphs.....
lem = WordNetLemmatizer()

for i in range(len(comb_headlines)):
    para = comb_headlines[i]
    para = para.split()
    para = [lem.lemmatize(word) for word in para if word not in stopwords.words('english')]
    para = " ".join(para)
    comb_headlines[i] = para

In [6]:
## processed data....
df_pro = pd.DataFrame(data = comb_headlines , columns = ['Headlines'])
df_pro['Targets'] = df['Label']
df_pro['Date'] = df['Date']

In [7]:
print("Number of rows in the processed dataset:" , df_pro.shape[0])

Number of rows in the processed dataset: 4101


In [8]:
## Making train test split....
train=df_pro[df_pro['Date']<'20150101']
test=df_pro[df_pro['Date']>'20141231']

In [9]:
## Applying Vectorization.....
from sklearn.feature_extraction.text import TfidfVectorizer

#implement TF-IDF
Tfidf = TfidfVectorizer(ngram_range=(2,3))
x_train = Tfidf.fit_transform(train['Headlines'])
y_train = train['Targets']

## Applying vectorization on training data using TF-IDF before Model Building......
x_test = Tfidf.transform(test['Headlines'])
y_test = test['Targets']

## Model Building:

#### Random Forest Classifier:

In [10]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators = 200 , criterion='entropy')
rf_clf.fit(x_train , y_train)

RandomForestClassifier(criterion='entropy', n_estimators=200)

In [19]:
print("The Train Accuracy is :" , rf_clf.score(x_test , y_test)*100)
print("The Test Accuracy is :" , metrics.accuracy_score(y_test , rf_clf.predict(x_test))*100)
print("\n" , metrics.classification_report(y_test , rf_clf.predict(x_test)))

The Train Accuracy is : 84.39153439153439
The Test Accuracy is : 84.39153439153439

               precision    recall  f1-score   support

           0       0.94      0.73      0.82       186
           1       0.79      0.95      0.86       192

    accuracy                           0.84       378
   macro avg       0.86      0.84      0.84       378
weighted avg       0.86      0.84      0.84       378



In [12]:
## CONFUSION MATRIX for Test Results....
pd.DataFrame(metrics.confusion_matrix(y_test , rf_clf.predict(x_test)))

Unnamed: 0,0,1
0,136,50
1,9,183


#### Naive Bayes Classifier:

In [13]:
from sklearn.naive_bayes import MultinomialNB

NV = MultinomialNB()
NV.fit(x_train , y_train)

MultinomialNB()

In [18]:
print("The Train Accuracy is :" , NV.score(x_train , y_train)*100)
print("The Test Accuracy is :" , metrics.accuracy_score(y_test , NV.predict(x_test))*100)
print("\n" , metrics.classification_report(y_test , NV.predict(x_test)))

The Train Accuracy is : 100.0
The Test Accuracy is : 85.18518518518519

               precision    recall  f1-score   support

           0       1.00      0.70      0.82       186
           1       0.77      1.00      0.87       192

    accuracy                           0.85       378
   macro avg       0.89      0.85      0.85       378
weighted avg       0.89      0.85      0.85       378



In [15]:
## CONFUSION MATRIX for Test Results....
pd.DataFrame(metrics.confusion_matrix(y_test , NV.predict(x_test)))

Unnamed: 0,0,1
0,130,56
1,0,192


## Conclusion:
- **Random Forest Classifier Model:** Gave good test results, with F1-score of 82% and 86%.
- **Naive Bayesian Model:** Proved to be the best fit for this problem. With slightly improved F1-Score of 82% and 87%.