# Introduction to Data Science Project: Daily News for Stock Market Prediction
## Team member: Chien-Ming Huang, Tong Chen, Po Chun Chen


### Data Source: Kaggle 
https://www.kaggle.com/aaron7sun/stocknews/data
#### News Data: 
Daily top 25 headlines from Reddit WorldNews Channel (/r/worldnews), ranked by reddit users' votes for every date.<br> (Range: 2008-06-08 to 2016-07-01)

#### Stock Data: 
Dow Jones Industrial Average (DJIA).<br> (Range: 2008-08-08 to 2016-07-01)


# Data Pipeline for Daily News for Stock Market Prediction
### Build a binary classfication problem:
"1" when DJIA Adj Close value rose or stayed as the same <br>
"0" when DJIA Adj Close value decreased
### Task Evaluation:
Training Set: Data from 2008-08-08 to 2014-12-31 <br>
Test Set: Data from 2015-01-02 to 2016-07-01 <br>
split approximately 80% / 20%
### Evaluation Metric:
Mainly use AUC as the evaluation metric, also include precison, recall, F1-score, support for comparison)
### Preprocessing:
1. Tokenizes
2. Removes stopwords and coverts headlines to lowercase letters
3. Stems
4. Returns a list of the cleaned text

### Bag of Words (N-Gram Model) & TFIDF Intro:
Perform Count Vecotrizer & TF-IDF transformer

### Feature Selection:
1. N-days shifts (1-3)
2. Top 3, 10, 25 news

### Model Selection:
Logistic Regression, Naive Bayes, and Random Forest

### Performance Comparison

### Other insight: Sentimental Analysis

### Other insight: Key Words visualization

In [1]:
# Dataframe
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
from datetime import date

# Data Preprocessing
# Make sure conda has nltk pacakge
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer

# Evaluation Metrics

from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, auc,precision_score, accuracy_score, recall_score, f1_score
from scipy import interp

# Word Count & TF-IDF
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Model Selection
from sklearn.model_selection import GridSearchCV # hyper parameter tuning
from pprint import pprint
from time import time
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier


# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from wordcloud import WordCloud # Need pip install wordcloud first
import matplotlib
matplotlib.rcParams["figure.figsize"] = "8, 8"

## Data Import

In [2]:
# Data import & Create combined for all top 25 news
df = pd.read_csv('/Users/ChienMingHuang/Desktop/Rutgers MIT Course/Fall 2017/Introduction to Data Science/IDSproject/input/Combined_News_DJIA.csv')
df['Combined25']= df.iloc[:,2:27].apply(lambda row: ''.join(str(row.values)), axis=1)
df.head()
# Label variable: 1 if the DJIA stayed the same or rose on that date
#                 0 if the DJIA decreased on that date

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25,Combined25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge""","[ 'b""Georgia \'downs two Russian warplanes\' a..."
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo...","[ ""b'Why wont America and Nato help us? If the..."
2,2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...",...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man...","[ ""b'Remember that adorable 9-year-old who san..."
3,2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',...,b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...,"[""b' U.S. refuses Israel weapons to attack Ira..."
4,2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...,"[""b'All the experts admit that we should legal..."


In [3]:
# Train data (2008-08-08 to 2014-12-31)
train = df.loc[(pd.to_datetime(df["Date"]) <= date(2014,12,31)),['Label','Combined25']]
train.head()

Unnamed: 0,Label,Combined25
0,0,"[ 'b""Georgia \'downs two Russian warplanes\' a..."
1,1,"[ ""b'Why wont America and Nato help us? If the..."
2,0,"[ ""b'Remember that adorable 9-year-old who san..."
3,0,"[""b' U.S. refuses Israel weapons to attack Ira..."
4,1,"[""b'All the experts admit that we should legal..."


In [4]:
# Test data (2015-01-02 to 2016-07-01)
test = df.loc[(pd.to_datetime(df["Date"]) > date(2014,12,31)),['Label','Combined25']]
test.head()

Unnamed: 0,Label,Combined25
1611,1,[ 'Most cases of cancer are the result of shee...
1612,0,[ 'Moscow-&gt;Beijing high speed train will re...
1613,0,"['US oil falls below $50 a barrel'\n ""Toyota g..."
1614,1,"[""'Shots fired' at French magazine HQ""\n '90% ..."
1615,1,[ 'New Charlie Hebdo issue to come out next we...


# ROC Curves Metric

In [5]:
# ROC Curves metric
'''
    Plot ROC curves for the multiclass problem
    based on http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
'''
def ROCCurves (Actual, Predicted):
    
    # Compute ROC curve and ROC area for each class
    n_classes = 2
    fpr = dict()
    tpr = dict()
    roc_auc= dict()
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(Actual.values, Predicted)
        roc_auc[i] = auc(fpr[i], tpr[i])

    # Compute micro-average ROC curve and ROC area
    fpr["micro"], tpr["micro"], _ = roc_curve(Actual.ravel(), Predicted.ravel())
    roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
    ##############################################################################
    # Plot ROC curves for the multiclass problem

    # Compute macro-average ROC curve and ROC area

    # First aggregate all False Positive Rates

    all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))

    # Interpolate all ROC curves at this points (include FPR, TPR)
    mean_tpr = np.zeros_like(all_fpr)
    for i in range(n_classes):
        mean_tpr += interp(all_fpr, fpr[i], tpr[i])

    # Average it and compute AUC
    mean_tpr /= n_classes

    fpr["macro"] = all_fpr
    tpr["macro"] = mean_tpr
    roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

    # Plot all ROC curves
    plt.figure()
    plt.plot(fpr["micro"], tpr["micro"],
         label = 'micro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["micro"]),
         linewidth=2)

    plt.plot(fpr["macro"], tpr["macro"],
         label = 'macro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["macro"]),
         linewidth=2)

    for i in range(n_classes):
        plt.plot(fpr[i], tpr[i], label = 'ROC curve of class {0} (area = {1:0.2f})'
                                   ''.format(i, roc_auc[i]))

    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend(loc = "lower right")

# Evaluation Model 

In [6]:
# List to keep different methods scores to compare
ScoreSummaryByMethod = []

# Evaluation Model build for analyzing model performance
'''
    Prints and plots
    - classification report
    - confusion matrix
    - ROC-AUC
'''
def Evaluation (Method,Comment,Actual, Predicted):

    print (Method)
    print (Comment)
    print (classification_report(Actual,Predicted))
    print ('Confussion matrix:\n', confusion_matrix(Actual,Predicted))
    ROC_AUC = roc_auc_score(Actual,Predicted)
    print ('ROC-AUC: ' + str(ROC_AUC))
    
    Precision = precision_score(Actual,Predicted)
    Accuracy = accuracy_score(Actual,Predicted)
    Recall = recall_score(Actual,Predicted)
    F1 = f1_score(Actual,Predicted)
    ScoreSummaryByMethod.append([Method,Comment,ROC_AUC,Precision,Accuracy,Recall,F1])

# Text Preprocessing

In [7]:
# Text preprocessing:
'''
1. Tokenizes
2. Removes stopwords and coverts headlines to lowercase letters
3. Stems
4. Returns a list of the cleaned text
'''

def text_process(text):
    if pd.isnull(text):
        return []
    
    # Tokenizes with RegexpTokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    text_processed=tokenizer.tokenize(text)
    
    # Removes any stopwords such as , . ; 
    # Coverts headlines to lowercase letters
    text_processed = [word.lower() for word in text_processed if word.lower() not in stopwords.words('english')]
    
    # Stems
    porter_stemmer = PorterStemmer()
    
    text_processed = [porter_stemmer.stem(word) for word in text_processed]
    
    try:
        text_processed.remove('b')
    except: 
        pass
    
    # Returns a list of the cleaned text
    return text_processed

# Bag of Words (N-gram Model) & TF-IDF Intro

Bag of Words (N-gram Model) <br>
An n-gram is a contiguous sequence of n items from a given sequence of text or speech. We use the n-gram to divide the text into different size of words. Then we will use the TF-IDF to calculate the weight of the words.

For example, the text "This is an apple".

When n-gram = 1, the text will be divided into: "This", "is", "an", "apple"

When n-gram = 2, the text will be divided into: "This is", "is an", "an apple"

When n-gram = 3, the text will be divided into: "This is an", "is an apple"

ngram_range:(1,1) is n-gram =1;(1,2) is ngram =1 and ngram =2; (1,3) is ngram values from 1 to 3. <br>
Then we will use the TF-IDF to calculate the weight of the words.<br><br>

TF-IDF <br>
TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling.

1) Term frequency:

$TF_{i,j}$  = $\frac{n_{i,j}}{\sum_{k}n_{k,j}}$

$n_{ij}$ is the number of occurrences of the word in the file $d_j$. And the denominator is the number of occurrences in the file $d_j$ All the words in the number of appearances <br>

2) Inverse document frequency:

$IDF_i$ = $\log{\frac{|D|}{|j:t_i \in d_j|}}$

| D |: The total number of files in the corpus

$|j:t_i \in d_j|$ is the number of $t_i$ in the files <br>

3) $TFIDF_{ij}$ = $TF_{i,j}\times IDF_i$ <br>

# Feature Selection

In [62]:
# Combine Top 3, Top 10 headlines
df['Combined3']= df.iloc[:,2:4].apply(lambda row: ''.join(str(row.values)), axis=1)
df['Combined10']= df.iloc[:,2:11].apply(lambda row: ''.join(str(row.values)), axis=1)

# Create 1 day shift, 2 days shift, 3 days shift
df["d1"] = df["Label"].shift(-1)
df["d1"].drop(df.index[len(df)-1], inplace=True)
df["d2"] = df["Label"].shift(-2)
df["d2"].drop(df.index[len(df)-2], inplace=True)
df["d3"] = df["Label"].shift(-3)
df["d3"].drop(df.index[len(df)-3], inplace=True)

In [9]:
# New train and test data for later feature selection
# Train data (2008-08-08 to 2014-12-31)
train = df.loc[(pd.to_datetime(df["Date"]) <= date(2014,12,31)),['Label','d1','d2','d3','Combined3','Combined10','Combined25']]
# Train data (2008-08-08 to 2014-12-31)
test = df.loc[(pd.to_datetime(df["Date"]) > date(2014,12,31)),['Label','d1','d2','d3','Combined3','Combined10','Combined25']]

train.head()

Unnamed: 0,Label,d1,d2,d3,Combined3,Combined10,Combined25
0,0,1.0,0.0,0.0,"[ 'b""Georgia \'downs two Russian warplanes\' a...","[ 'b""Georgia \'downs two Russian warplanes\' a...","[ 'b""Georgia \'downs two Russian warplanes\' a..."
1,1,0.0,0.0,1.0,"[ ""b'Why wont America and Nato help us? If the...","[ ""b'Why wont America and Nato help us? If the...","[ ""b'Why wont America and Nato help us? If the..."
2,0,0.0,1.0,1.0,"[ ""b'Remember that adorable 9-year-old who san...","[ ""b'Remember that adorable 9-year-old who san...","[ ""b'Remember that adorable 9-year-old who san..."
3,0,1.0,1.0,0.0,"[""b' U.S. refuses Israel weapons to attack Ira...","[""b' U.S. refuses Israel weapons to attack Ira...","[""b' U.S. refuses Israel weapons to attack Ira..."
4,1,1.0,0.0,0.0,"[""b'All the experts admit that we should legal...","[""b'All the experts admit that we should legal...","[""b'All the experts admit that we should legal..."


In [23]:
# Disable Runtime Warning
import warnings
warnings.filterwarnings("ignore")

# Logestic Regression

## No day shift with top 3, top 10, top 25 news

### ngram = (1, 1)

In [32]:
# Linear Regression with ngram = (1, 1), no shift, Top 3 News
lr_1n_t3_0_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 1))),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(C=0.000000001,solver='liblinear',max_iter=200)),
])
lr_1n_t3_0_pipeline.fit(train['Combined3'],train['Label'])
lr_1n_t3_0_prediction = lr_1n_t3_0_pipeline.predict(test['Combined3'])
Evaluation ('Linear Regression','ngram= (1,1), no shift, Top 3 News', test["Label"], lr_1n_t3_0_prediction) 

Linear Regression
ngram= (1,1), no shift, Top 3 News
             precision    recall  f1-score   support

          0       0.00      0.00      0.00       186
          1       0.50      1.00      0.67       189

avg / total       0.25      0.50      0.34       375

Confussion matrix:
 [[  0 186]
 [  0 189]]
ROC-AUC: 0.5


## 1 day shift with top 3, top 10, top 25 news

## 2 day shift with top 3, top 10, top 25 news

## 3 day shift with top 3, top 10, top 25 news

# Bernoulli Naive Bayes

## No day shift with top 3, top 10, top 25 news

### ngram = (1, 1)

In [33]:
# Bernoulli Naive Bayes with ngram = (1, 1), no shift, Top 3 News
bnb_1n_t3_0_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 1))),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB(alpha = 0.5, binarize = 0.0)),
])
bnb_1n_t3_0_pipeline.fit(train['Combined3'],train['Label'])
bnb_1n_t3_0_prediction = bnb_1n_t3_0_pipeline.predict(test['Combined3'])
Evaluation ('Bernoulli Naive Bayes','ngram= (1,1), no shift, Top 3 News', test["Label"], bnb_1n_t3_0_prediction)  

Bernoulli Naive Bayes
ngram= (1,1), no shift, Top 3 News
             precision    recall  f1-score   support

          0       0.49      0.45      0.47       186
          1       0.50      0.54      0.52       189

avg / total       0.50      0.50      0.50       375

Confussion matrix:
 [[ 84 102]
 [ 86 103]]
ROC-AUC: 0.4982932241


In [37]:
# Bernoulli Naive Bayes with ngram = (1, 1), no shift, Top 10 News
bnb_1n_t10_0_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 1))),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB(alpha = 0.5, binarize = 0.0)),
])
bnb_1n_t10_0_pipeline.fit(train['Combined10'],train['Label'])
bnb_1n_t10_0_prediction = bnb_1n_t10_0_pipeline.predict(test['Combined10'])
Evaluation ('Bernoulli Naive Bayes','ngram= (1,1), no shift, Top 10 News', test["Label"], bnb_1n_t10_0_prediction) 

Bernoulli Naive Bayes
ngram= (1,1), no shift, Top 10 News
             precision    recall  f1-score   support

          0       0.47      0.39      0.42       186
          1       0.49      0.57      0.53       189

avg / total       0.48      0.48      0.48       375

Confussion matrix:
 [[ 72 114]
 [ 81 108]]
ROC-AUC: 0.479262672811


In [38]:
# Bernoulli Naive Bayes with ngram = (1, 1), no shift, Top 25 News
bnb_1n_t25_0_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 1))),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB(alpha = 0.5, binarize = 0.0)),
])
bnb_1n_t25_0_pipeline.fit(train['Combined25'],train['Label'])
bnb_1n_t25_0_prediction = bnb_1n_t25_0_pipeline.predict(test['Combined25'])
Evaluation ('Bernoulli Naive Bayes','ngram= (1,1), no shift, Top 25 News', test["Label"], bnb_1n_t25_0_prediction)         

Bernoulli Naive Bayes
ngram= (1,1), no shift, Top 25 News
             precision    recall  f1-score   support

          0       0.49      0.29      0.36       186
          1       0.50      0.70      0.58       189

avg / total       0.49      0.50      0.47       375

Confussion matrix:
 [[ 54 132]
 [ 57 132]]
ROC-AUC: 0.494367639529


### ngram = (1, 2)

In [55]:
# Bernoulli Naive Bayes with ngram = (1, 2), no shift, Top 3 News
bnb_2n_t3_0_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB(alpha = 0.5, binarize = 0.0)),
])
bnb_2n_t3_0_pipeline.fit(train['Combined3'],train['Label'])
bnb_2n_t3_0_prediction = bnb_2n_t3_0_pipeline.predict(test['Combined3'])
Evaluation ('Bernoulli Naive Bayes','ngram= (1,2), no shift, Top 3 News', test["Label"], bnb_2n_t3_0_prediction)  

Bernoulli Naive Bayes
ngram= (1,2), no shift, Top 3 News
             precision    recall  f1-score   support

          0       0.49      0.45      0.47       186
          1       0.50      0.54      0.52       189

avg / total       0.50      0.50      0.50       375

Confussion matrix:
 [[ 84 102]
 [ 86 103]]
ROC-AUC: 0.4982932241


In [None]:
# Bernoulli Naive Bayes with ngram = (1, 2), no shift, Top 10 News
bnb_2n_t10_0_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB(alpha = 0.5, binarize = 0.0)),
])
bnb_2n_t10_0_pipeline.fit(train['Combined10'],train['Label'])
bnb_2n_t10_0_prediction = bnb_2n_t10_0_pipeline.predict(test['Combined10'])
Evaluation ('Bernoulli Naive Bayes','ngram= (1,2), no shift, Top 10 News', test["Label"], bnb_2n_t10_0_prediction)  

In [None]:
# Bernoulli Naive Bayes with ngram = (1, 2), no shift, Top 25 News
bnb_2n_t25_0_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB(alpha = 0.5, binarize = 0.0)),
])
bnb_2n_t25_0_pipeline.fit(train['Combined25'],train['Label'])
bnb_2n_t25_0_prediction = bnb_2n_t25_0_pipeline.predict(test['Combined25'])
Evaluation ('Bernoulli Naive Bayes','ngram= (1,2), no shift, Top 25 News', test["Label"], bnb_2n_t25_0_prediction)  

### ngram = (1, 3)

In [59]:
# Bernoulli Naive Bayes with ngram = (1, 3), no shift, Top 3 News
bnb_3n_t3_0_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 3))),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB(alpha = 0.5, binarize = 0.0)),
])
bnb_3n_t3_0_pipeline.fit(train['Combined3'],train['Label'])
bnb_3n_t3_0_prediction = bnb_3n_t3_0_pipeline.predict(test['Combined3'])
Evaluation ('Bernoulli Naive Bayes','ngram= (1,3), no shift, Top 3 News', test["Label"], bnb_3n_t3_0_prediction)  

Bernoulli Naive Bayes
ngram= (1,3), no shift, Top 3 News
             precision    recall  f1-score   support

          0       0.49      0.45      0.47       186
          1       0.50      0.54      0.52       189

avg / total       0.50      0.50      0.50       375

Confussion matrix:
 [[ 84 102]
 [ 86 103]]
ROC-AUC: 0.4982932241


In [60]:
# Bernoulli Naive Bayes with ngram = (1, 3), no shift, Top 10 News
bnb_3n_t10_0_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 3))),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB(alpha = 0.5, binarize = 0.0)),
])
bnb_3n_t10_0_pipeline.fit(train['Combined10'],train['Label'])
bnb_3n_t10_0_prediction = bnb_3n_t10_0_pipeline.predict(test['Combined10'])
Evaluation ('Bernoulli Naive Bayes','ngram= (1,3), no shift, Top 10 News', test["Label"], bnb_3n_t10_0_prediction)  

Bernoulli Naive Bayes
ngram= (1,3), no shift, Top 10 News
             precision    recall  f1-score   support

          0       0.47      0.39      0.42       186
          1       0.49      0.57      0.53       189

avg / total       0.48      0.48      0.48       375

Confussion matrix:
 [[ 72 114]
 [ 81 108]]
ROC-AUC: 0.479262672811


In [61]:
# Bernoulli Naive Bayes with ngram = (1, 3), no shift, Top 25 News
bnb_3n_t25_0_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 3))),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB(alpha = 0.5, binarize = 0.0)),
])
bnb_3n_t25_0_pipeline.fit(train['Combined25'],train['Label'])
bnb_3n_t25_0_prediction = bnb_3n_t25_0_pipeline.predict(test['Combined25'])
Evaluation ('Bernoulli Naive Bayes','ngram= (1,3), no shift, Top 25 News', test["Label"], bnb_3n_t25_0_prediction)  

Bernoulli Naive Bayes
ngram= (1,3), no shift, Top 25 News
             precision    recall  f1-score   support

          0       0.49      0.29      0.36       186
          1       0.50      0.70      0.58       189

avg / total       0.49      0.50      0.47       375

Confussion matrix:
 [[ 54 132]
 [ 57 132]]
ROC-AUC: 0.494367639529


## 1 day shift with top 3, top 10, top 25 news

### ngram = (1, 1)

In [63]:
# Bernoulli Naive Bayes with ngram = (1, 1), 1 day shift, Top 3 News
bnb_1n_t3_1_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 1))),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB(alpha = 0.5, binarize = 0.0)),
])
bnb_1n_t3_1_pipeline.fit(train['Combined3'],train['d1'])
bnb_1n_t3_1_prediction = bnb_1n_t3_1_pipeline.predict(test['d1'])
Evaluation ('Bernoulli Naive Bayes','ngram= (1,1), 1 day shift, Top 3 News', test["d1"], bnb_1n_t3_1_prediction)  

TypeError: cannot use a string pattern on a bytes-like object

In [None]:
# Bernoulli Naive Bayes with ngram = (1, 1), no shift, Top 10 News
bnb_1n_t10_1_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 1))),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB(alpha = 0.5, binarize = 0.0)),
])
bnb_1n_t10_1_pipeline.fit(train['Combined10'],train['Label'])
bnb_1n_t10_1_prediction = bnb_1n_t10_1_pipeline.predict(test['Combined10'])
Evaluation ('Bernoulli Naive Bayes','ngram= (1,1), no shift, Top 10 News', test["Label"], bnb_1n_t10_1_prediction) 

In [None]:
# Bernoulli Naive Bayes with ngram = (1, 1), no shift, Top 25 News
bnb_1n_t25_1_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 1))),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB(alpha = 0.5, binarize = 0.0)),
])
bnb_1n_t25_1_pipeline.fit(train['Combined25'],train['Label'])
bnb_1n_t25_1_prediction = bnb_1n_t25_1_pipeline.predict(test['Combined25'])
Evaluation ('Bernoulli Naive Bayes','ngram= (1,1), no shift, Top 25 News', test["Label"], bnb_1n_t25_1_prediction)         

### ngram = (1, 2)

### ngram = (1, 3)

## 2 day shift with top 3, top 10, top 25 news

### ngram = (1, 1)

In [33]:
# Bernoulli Naive Bayes with ngram = (1, 1), no shift, Top 3 News
bnb_1n_t3_2_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 1))),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB(alpha = 0.5, binarize = 0.0)),
])
bnb_1n_t3_2_pipeline.fit(train['Combined3'],train['Label'])
bnb_1n_t3_2_prediction = bnb_1n_t3_2_pipeline.predict(test['Combined3'])
Evaluation ('Bernoulli Naive Bayes','ngram= (1,1), no shift, Top 3 News', test["Label"], bnb_1n_t3_2_prediction)  

Bernoulli Naive Bayes
ngram= (1,1), no shift, Top 3 News
             precision    recall  f1-score   support

          0       0.49      0.45      0.47       186
          1       0.50      0.54      0.52       189

avg / total       0.50      0.50      0.50       375

Confussion matrix:
 [[ 84 102]
 [ 86 103]]
ROC-AUC: 0.4982932241


In [37]:
# Bernoulli Naive Bayes with ngram = (1, 1), no shift, Top 10 News
bnb_1n_t10_2_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 1))),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB(alpha = 0.5, binarize = 0.0)),
])
bnb_1n_t10_2_pipeline.fit(train['Combined10'],train['Label'])
bnb_1n_t10_2_prediction = bnb_1n_t10_2_pipeline.predict(test['Combined10'])
Evaluation ('Bernoulli Naive Bayes','ngram= (1,1), no shift, Top 10 News', test["Label"], bnb_1n_t10_2_prediction) 

Bernoulli Naive Bayes
ngram= (1,1), no shift, Top 10 News
             precision    recall  f1-score   support

          0       0.47      0.39      0.42       186
          1       0.49      0.57      0.53       189

avg / total       0.48      0.48      0.48       375

Confussion matrix:
 [[ 72 114]
 [ 81 108]]
ROC-AUC: 0.479262672811


In [38]:
# Bernoulli Naive Bayes with ngram = (1, 1), no shift, Top 25 News
bnb_1n_t25_2_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 1))),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB(alpha = 0.5, binarize = 0.0)),
])
bnb_1n_t25_2_pipeline.fit(train['Combined25'],train['Label'])
bnb_1n_t25_2_prediction = bnb_1n_t25_2_pipeline.predict(test['Combined25'])
Evaluation ('Bernoulli Naive Bayes','ngram= (1,1), no shift, Top 25 News', test["Label"], bnb_1n_t25_2_prediction)         

Bernoulli Naive Bayes
ngram= (1,1), no shift, Top 25 News
             precision    recall  f1-score   support

          0       0.49      0.29      0.36       186
          1       0.50      0.70      0.58       189

avg / total       0.49      0.50      0.47       375

Confussion matrix:
 [[ 54 132]
 [ 57 132]]
ROC-AUC: 0.494367639529


### ngram = (1, 2)

### ngram = (1, 3)

## 3 day shift with top 3, top 10, top 25 news

### ngram = (1, 1)

In [33]:
# Bernoulli Naive Bayes with ngram = (1, 1), no shift, Top 3 News
bnb_1n_t3_3_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 1))),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB(alpha = 0.5, binarize = 0.0)),
])
bnb_1n_t3_3_pipeline.fit(train['Combined3'],train['Label'])
bnb_1n_t3_3_prediction = bnb_1n_t3_3_pipeline.predict(test['Combined3'])
Evaluation ('Bernoulli Naive Bayes','ngram= (1,1), no shift, Top 3 News', test["Label"], bnb_1n_t3_3_prediction)  

Bernoulli Naive Bayes
ngram= (1,1), no shift, Top 3 News
             precision    recall  f1-score   support

          0       0.49      0.45      0.47       186
          1       0.50      0.54      0.52       189

avg / total       0.50      0.50      0.50       375

Confussion matrix:
 [[ 84 102]
 [ 86 103]]
ROC-AUC: 0.4982932241


In [37]:
# Bernoulli Naive Bayes with ngram = (1, 1), no shift, Top 10 News
bnb_1n_t10_3_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 1))),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB(alpha = 0.5, binarize = 0.0)),
])
bnb_1n_t10_3_pipeline.fit(train['Combined10'],train['Label'])
bnb_1n_t10_3_prediction = bnb_1n_t10_3_pipeline.predict(test['Combined10'])
Evaluation ('Bernoulli Naive Bayes','ngram= (1,1), no shift, Top 10 News', test["Label"], bnb_1n_t10_3_prediction) 

Bernoulli Naive Bayes
ngram= (1,1), no shift, Top 10 News
             precision    recall  f1-score   support

          0       0.47      0.39      0.42       186
          1       0.49      0.57      0.53       189

avg / total       0.48      0.48      0.48       375

Confussion matrix:
 [[ 72 114]
 [ 81 108]]
ROC-AUC: 0.479262672811


In [38]:
# Bernoulli Naive Bayes with ngram = (1, 1), no shift, Top 25 News
bnb_1n_t25_3_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer = text_process,ngram_range = (1, 1))),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB(alpha = 0.5, binarize = 0.0)),
])
bnb_1n_t25_3_pipeline.fit(train['Combined25'],train['Label'])
bnb_1n_t25_3_prediction = bnb_1n_t25_3_pipeline.predict(test['Combined25'])
Evaluation ('Bernoulli Naive Bayes','ngram= (1,1), no shift, Top 25 News', test["Label"], bnb_1n_t25_3_prediction)         

Bernoulli Naive Bayes
ngram= (1,1), no shift, Top 25 News
             precision    recall  f1-score   support

          0       0.49      0.29      0.36       186
          1       0.50      0.70      0.58       189

avg / total       0.49      0.50      0.47       375

Confussion matrix:
 [[ 54 132]
 [ 57 132]]
ROC-AUC: 0.494367639529


### ngram = (1, 2)

### ngram = (1, 3)

# Random Forest

## No day shift with top 3, top 10, top 25 news

### ngram = (1, 1)

### ngram = (1, 2)

### ngram = (1, 3)

## 1 day shift with top 3, top 10, top 25 news

### ngram = (1, 1)

### ngram = (1, 2)

### ngram = (1, 3)

## 2 day shift with top 3, top 10, top 25 news

### ngram = (1, 1)

### ngram = (1, 2)

### ngram = (1, 3)

## 3 day shift with top 3, top 10, top 25 news

### ngram = (1, 1)

### ngram = (1, 2)

### ngram = (1, 3)

# Score Summary by Method

In [None]:
# Show the top 20 model performance sorted on AUC performance
df_ScoreSummaryByMethod=DataFrame(ScoreSummaryByMethod,columns=['Method','Comment','ROC_AUC','Precision','Accuracy','Recall','F1'])
df_ScoreSummaryByMethod.sort_values(['ROC_AUC'],ascending=False,inplace=True)
df_ScoreSummaryByMethod.head(20)