# <font color='green'>Predicting Stock Prices Using News Headlines</font>

### * This kernel builds a model to predict whether the stock goes up or down based on the top 25 news headlines
 
### * The first column is "Date", the second is "Label", and the following ones are news headlines ranging from "Top1" to "Top25".
 
### * In the Label column, the value is "1" when the DJIA Adj Close value rose or stayed the same

### * In the Label column, the value is "0" when the DJIA Adj Close value decreased
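The labelling rule above can be sketched in plain Python. The function name `make_labels` and the sample prices are hypothetical; the dataset already ships with the `Label` column, so this only illustrates the rule.

```python
def make_labels(adj_close):
    """Return 1 where the price rose or stayed the same versus the previous day, else 0."""
    return [1 if today >= prev else 0 for prev, today in zip(adj_close, adj_close[1:])]

prices = [100.0, 101.5, 101.5, 99.0]  # made-up Adj Close values
labels = make_labels(prices)
print(labels)
```

The first day has no previous close to compare against, so the result has one label fewer than there are prices.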

## <font color='darkred'>Objective :</font>
### The goal is to create a machine learning model that predicts whether the stock goes up or down based on the top 25 headlines

## <font color='darkred'>Whole process in detail :</font>
### 1)  Filling null values in the dataset

### 2)  Cleaning the text by removing punctuation and changing all the letters to lowercase

### 3)  Combining all the headlines of a day into one string

### 4)  Applying CountVectorizer to the combined headlines

### 5)  Visualizing the results and choosing the best algorithm based on requirements
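The cleaning and combining steps can be sketched with the standard library alone. This is a minimal illustration, not the notebook's pandas code: the helper names `clean_headline` and `combine_headlines` and the sample headlines are made up.

```python
import re

def clean_headline(text):
    """Replace every non-letter character with a space and lowercase the result."""
    return re.sub(r"[^a-zA-Z]", " ", text).lower()

def combine_headlines(headlines):
    """Join the cleaned headlines of one day into a single string."""
    return " ".join(clean_headline(h) for h in headlines)

day = ["Stocks rally 3% on Fed news!", "Oil falls below $40"]  # made-up headlines
combined = combine_headlines(day)
print(combined)
```

Note that digits are removed along with punctuation, because the pattern keeps letters only.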

In [None]:
import pandas as pd
import numpy as np 
import warnings
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
import plotly.graph_objects as go
import plotly.express as px


warnings.filterwarnings('ignore')

# load the file that contains the date, the up/down label, and the top 25 headlines
data1 = pd.read_csv('../input/stocknews/Combined_News_DJIA.csv')
data1.head()

In [None]:
data1.isnull().sum()

## <font color='darkred'>Data Cleaning</font>

In [None]:
# filling the null values; the headline columns hold text, so a median is not
# defined for them and an empty string is used instead

for col in ['Top23', 'Top24', 'Top25']:
    data1[col] = data1[col].fillna('')

### Train-Test Split 

In [None]:
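# separating the data into train and test sets by date; dates are parsed first
# because comparing 'YYYY-MM-DD' strings against '20150101' is not chronological

dates = pd.to_datetime(data1['Date'])
cutoff = pd.Timestamp('2015-01-01')
train = data1[dates < cutoff].copy()
test = data1[dates >= cutoff].copy()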
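A date split like this one depends on how `Date` is compared. If the column holds ISO-style strings such as `'2015-06-01'` (an assumption; the raw file is not shown here), a plain string comparison against `'20150101'` is lexicographic rather than chronological. The sketch below uses made-up dates and only the standard library to show the difference.

```python
from datetime import date

rows = ["2014-12-31", "2015-06-01", "2016-01-04"]  # made-up ISO-style dates

# Lexicographic comparison: '-' sorts before '0', so '2015-06-01' < '20150101'.
train_str = [d for d in rows if d < "20150101"]
test_str = [d for d in rows if d > "20141231"]

# Chronological comparison after parsing the strings into dates.
cutoff = date(2015, 1, 1)
parsed = [date.fromisoformat(d) for d in rows]
train_dt = [d.isoformat() for d in parsed if d < cutoff]
test_dt = [d.isoformat() for d in parsed if d >= cutoff]

# Rows that land in both the string-based train and test sets.
overlap = sorted(set(train_str) & set(test_str))
print(overlap)
```

Any row that appears in both sets leaks test data into training, which inflates the measured accuracy.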

In [None]:
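# removing punctuation and changing all the letters to lowercase for both train and test

# work on copies so the edits do not act on slices of data1
train, test = train.copy(), test.copy()
headline_cols = [c for c in train.columns if c not in ('Date', 'Label')]

for df in (train, test):
    # replace every non-letter character with a space, headline columns only
    df[headline_cols] = df[headline_cols].replace("[^a-zA-Z]", " ", regex=True)
    for col in headline_cols:
        df[col] = df[col].str.lower()

train.head()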

In [None]:
# combining all the headlines of each train row into one string and collecting them in a list

headlines = [' '.join(str(x) for x in row) for row in train.iloc[:, 2:27].values]
headlines[0]

In [None]:
# combining all the headlines of each test row into one string and collecting them in a list

test_transform = [' '.join(str(x) for x in row) for row in test.iloc[:, 2:27].values]

## <font color='darkred'>Applying Machine Learning Algorithms (Random Forest, XGBoost and CatBoost)</font>

In [None]:
# Applying countvectorizer on headlines list that we created before and max features is set to 100009

countvector=CountVectorizer(ngram_range=(2,2),max_features=100009)
traindataset=countvector.fit_transform(headlines)

randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
randomclassifier.fit(traindataset,train['Label'])



<font color='darkblue'>The maximum number of features for CountVectorizer is set to 100009 because, of the values I tried, it gave the best accuracy with the fewest false positives (see the confusion matrix below). You can try other values yourself; if you find a better accuracy with a different max_features, comment below.</font>
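To make the role of `ngram_range=(2,2)` and `max_features` concrete, here is a plain-Python sketch of bigram counting with a capped vocabulary. It is an approximation for illustration only: the helper names and the corpus are made up, and scikit-learn's tokenisation and tie-breaking differ from this simplified version.

```python
from collections import Counter

def bigrams(text):
    """Return the adjacent word pairs of a whitespace-tokenised string."""
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

def build_vocabulary(docs, max_features):
    """Keep the max_features most frequent bigrams across all documents."""
    counts = Counter(bg for doc in docs for bg in bigrams(doc))
    return sorted(bg for bg, _ in counts.most_common(max_features))

def vectorize(doc, vocabulary):
    """Count how often each vocabulary bigram occurs in one document."""
    counts = Counter(bigrams(doc))
    return [counts[bg] for bg in vocabulary]

docs = ["oil prices fall", "oil prices rise", "stocks rise"]  # made-up corpus
vocab = build_vocabulary(docs, max_features=1)
vectors = [vectorize(d, vocab) for d in docs]
print(vocab, vectors)
```

A smaller `max_features` keeps only the most common bigrams, so rare phrases that appear in a handful of headlines are dropped from the feature matrix.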

### <font color='darkred'>Random Forest</font>

In [None]:
# Applying countvectorizer on test_transform list that we created before 

test_dataset = countvector.transform(test_transform)
predictions = randomclassifier.predict(test_dataset)

In [None]:
# confusion matrix for the random forest predictions

matrix=confusion_matrix(test['Label'],predictions)
print(matrix)

In [None]:
# accuracy score (compares the test dataset's original labels with the predictions)

score=accuracy_score(test['Label'],predictions)
print(score)

In [None]:
# confusion matrix, accuracy and full classification report for the random forest
# (the metric functions are already imported in the first cell)

matrix=confusion_matrix(test['Label'],predictions)
print(matrix)
score=accuracy_score(test['Label'],predictions)
print(score)
report=classification_report(test['Label'],predictions)
print(report)

<font color='darkblue'>Let's apply XGBoost and try different values of max_features and different n-gram sizes for CountVectorizer to see which combination gives the highest accuracy</font>




### <font color='darkred'>XGBoost </font>

In [None]:
# try several vocabulary sizes and n-gram sizes for CountVectorizer with XGBoost
max_features_num = [500,600,700,800,900,1000]
ngram = [1,2,3,4,5]
for i in max_features_num:
    for j in ngram:
        countvector=CountVectorizer(ngram_range=(j,j),max_features=i)
        traindataset=countvector.fit_transform(headlines)
        test_dataset = countvector.transform(test_transform)
        # get_feature_names() was removed in scikit-learn 1.2
        feature_names = countvector.get_feature_names_out()

        xgb = XGBClassifier(random_state =1)
        xgb.fit(pd.DataFrame(traindataset.toarray(), columns=feature_names),train['Label'])
        predictions = xgb.predict(pd.DataFrame(test_dataset.toarray(), columns=feature_names))
        score=accuracy_score(test['Label'],predictions)
        print('max number of features used : {}'.format(i))
        print('ngram_range ({},{})'.format(j,j))
        print(score)
        matrix=confusion_matrix(test['Label'],predictions)
        print('confusion matrix : {}'.format(matrix))
        print('===============================')
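Instead of reading the best combination off the printed output, the search can keep track of it directly. The sketch below shows the pattern only: `best_config` and `fake_evaluate` are hypothetical names, and the scorer returns made-up numbers in place of fitting the vectorizer and XGBoost model.

```python
from itertools import product

def best_config(max_features_options, ngram_options, evaluate):
    """Return (score, max_features, n) for the highest-scoring combination."""
    best = None
    for max_features, n in product(max_features_options, ngram_options):
        score = evaluate(max_features, n)
        if best is None or score > best[0]:
            best = (score, max_features, n)
    return best

# Stand-in scorer with made-up numbers; in the notebook this would fit the
# vectorizer and the XGBoost model and return accuracy_score on held-out data.
def fake_evaluate(max_features, n):
    return 1.0 - abs(max_features - 800) / 1000 - abs(n - 2) / 10

best = best_config([500, 600, 700, 800, 900, 1000], [1, 2, 3, 4, 5], fake_evaluate)
print(best)
```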

<font color='darkblue'>Maximum accuracy :</font>

max number of features used : 800

ngram_range (2,2)

0.8650793650793651

confusion matrix : [[161  25]
 [ 26 166]]
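The reported accuracy can be checked by hand from the confusion matrix above. The sketch assumes the scikit-learn layout, where rows are true labels and columns are predicted labels; precision and recall for the "up" class are derived the same way.

```python
# Confusion matrix reported above; rows are true labels, columns are predictions.
matrix = [[161, 25],
          [26, 166]]

tn, fp = matrix[0]
fn, tp = matrix[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted "up" days, the share that were up
recall = tp / (tp + fn)      # of the actual "up" days, the share that were found

print(accuracy, precision, recall)
```

The accuracy works out to 327 correct predictions out of 378 test rows, which matches the value printed above.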

In [None]:
# refit with the best configuration reported above: ngram_range (2,2), max_features 800
countvector=CountVectorizer(ngram_range=(2,2),max_features=800)
traindataset=countvector.fit_transform(headlines)
test_dataset = countvector.transform(test_transform)
# get_feature_names() was removed in scikit-learn 1.2
feature_names = countvector.get_feature_names_out()

xgb = XGBClassifier(random_state =1)
xgb.fit(pd.DataFrame(traindataset.toarray(), columns=feature_names),train['Label'])
predictions = xgb.predict(pd.DataFrame(test_dataset.toarray(), columns=feature_names))

In [None]:
predictions

In [None]:
matrix=confusion_matrix(test['Label'],predictions)
print(matrix)
score=accuracy_score(test['Label'],predictions)
print(score)
report=classification_report(test['Label'],predictions)
print(report)

## <font color='darkred'> Conclusion</font>



<font color='darkblue'>After all this analysis, we can conclude that the algorithm which gave good accuracy, with more true positives and fewer false negatives, is Random Forest without hyperparameter tuning.</font>