## Kaggle: Bag of Words Meets Bags of Popcorn - II
###### This is a tutorial competition from [Kaggle](https://www.kaggle.com/c/word2vec-nlp-tutorial). The goal is to build a binary classification model that predicts the sentiment of a movie review.

The following code contains two models: a logistic regression and an ensemble of logistic regression and a linear SVM.  
For part 1, the logistic model, the leaderboard score is 0.95498 (probability submission) and 0.88604 (binary submission).  
For part 2, the ensemble model, the leaderboard score is 0.95522 (probability submission) and 0.88792 (binary submission).
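
The gap between the probability and binary scores is easier to see with a small example. The sketch below assumes the leaderboard metric is area under the ROC curve, as the competition page states; the `auc` helper and the toy labels and scores are invented for illustration and are not part of the original notebook.

```python
def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data, made up for illustration only.
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]

# Thresholding at 0.5 throws away the ranking inside each class.
binary = [1 if p >= 0.5 else 0 for p in probs]

auc_prob = auc(labels, probs)
auc_bin = auc(labels, binary)
print(auc_prob, auc_bin)
```

Under this assumption, a binary submission creates many ties between positive and negative reviews, which is why the same model tends to score lower when its output is thresholded.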

## Preprocessing 
Removing anything other than HTML tags from the text can noticeably affect accuracy. A single punctuation mark may have little effect, but what about emoticons such as ':-D' or ':-<'? (If a review contains [emoji codes](https://www.emoji.codes/), they surely carry more sentiment than a single word. These datasets do not seem to contain emoji codes, although I haven't checked thoroughly. That's another story...) What about numbers? Here I keep numbers in the text, after comparing the accuracy against training on text with the numbers removed.
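
To make the trade-off concrete, here is a minimal sketch of what the `[^\w\s]` pattern used in `preprocess` does to an emoticon and to numbers. The `strip_punct` helper and the sample sentence are hypothetical and only serve the illustration.

```python
import re

# Anything that is not a word character (letter, digit, underscore) or whitespace.
PUNCT = re.compile(r'[^\w\s]')

def strip_punct(text):
    """Replace every punctuation character with a space."""
    return PUNCT.sub(" ", text)

# Made-up sample review.
sample = "Loved it :-D 10/10, would watch again!"
cleaned = strip_punct(sample)
tokens = cleaned.split()
print(tokens)
```

The emoticon ':-D' collapses to a lone 'D', so its sentiment is lost, while the digits in '10/10' survive as two separate tokens.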

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import nltk
import re

In [2]:
from bs4 import BeautifulSoup
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.utils import check_random_state
from sklearn.model_selection import train_test_split
from nltk.stem.snowball import EnglishStemmer

In [3]:
train=pd.read_csv('E:/542/Pro4/labeledTrainData.tsv', 
                  header=0,delimiter='\t',quoting=3)
test = pd.read_csv("E:/542/Pro4/testData.tsv", header=0, delimiter="\t", quoting=3)

In [4]:
def preprocess(corpus):
    # Keep word characters (letters, digits, underscore) and whitespace;
    # every other character is replaced with a space.
    pattern = re.compile(r'[^\w\s]')
    clean = []
    for review in corpus['review']:
        # Strip HTML tags first, then punctuation.
        review_text = BeautifulSoup(review, 'html.parser').get_text()
        clean.append(pattern.sub(" ", review_text))
    return pd.DataFrame({'text': clean})

In [5]:
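def tokenize(text):
    # Split into word tokens, then reduce each token to its Snowball stem.
    tokens = nltk.word_tokenize(text)
    stemmer = EnglishStemmer()
    # Return a list rather than a lazy map object (Python 3).
    return [stemmer.stem(token) for token in tokens]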

In [6]:
clean_train_df=preprocess(train)
clean_test_df=preprocess(test)

In [7]:
X_train, X_test, y_train, y_test= train_test_split(clean_train_df['text'], train['sentiment'],  test_size=0.25, random_state=check_random_state(888))
X_train2, X_test2, y_train2, y_test2= train_test_split(clean_train_df['text'], train['sentiment'],  test_size=0.25, random_state=check_random_state(1000))
X_train3, X_test3, y_train3, y_test3= train_test_split(clean_train_df['text'], train['sentiment'],  test_size=0.25)
X_trainf, X_testf, y_trainf, y_testf= train_test_split(clean_train_df['text'], train['sentiment'],  test_size=0.25, random_state=check_random_state(512))

## Ensemble method
### Single model
Before using the ensemble method, first check the performance of each single model on a validation set. Naive Bayes has lower accuracy than the other two. Although this ranking may change with parameter tuning, I decided to leave it aside.

Another issue you may notice is that [SGDClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) only returns probabilities when the loss function is 'log' (named 'log_loss' in recent scikit-learn releases) or 'modified_huber'. Here I use the default 'hinge' loss, which gives a linear SVM. So when combining the two models to generate probabilistic predictions, I simply use the binary output of the SGDClassifier. This may affect the variance of the predictions, but if the binary output is regarded as just another scoring function, it is acceptable.
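
The combination rule can be sketched in a few lines. This is an illustration of the rule used later in the notebook (sum the logistic probability and the SVM's 0/1 label, halve the sum for the probability submission, and threshold it at 1.5 for a binary decision). The `combine` helper and the example values are hypothetical.

```python
def combine(prob_pos, svm_label):
    """Combine a logistic positive-class probability with a 0/1 SVM label.

    Returns (averaged score, binary decision).
    """
    total = prob_pos + svm_label
    return total / 2, int(total >= 1.5)

# Made-up (probability, SVM label) pairs.
examples = [(0.9, 1), (0.6, 0), (0.3, 1), (0.1, 0)]
results = [combine(p, label) for p, label in examples]
decisions = [decision for _, decision in results]
print(results)
```

Because the SVM label is either 0 or 1, the sum reaches 1.5 only when the SVM predicts 1 and the logistic probability is at least 0.5, so the binary decision is positive only when both models agree.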

In [8]:
def model1(X_train, y_train, X_test, random_state):
    tfidf=TfidfVectorizer(stop_words = 'english',min_df=5,max_df=0.15,ngram_range=(1, 4),tokenizer=tokenize)
    train_X = tfidf.fit_transform(X_train)
    test_X=tfidf.transform(X_test)
    
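    # dual=True is only supported by the liblinear solver, so request it explicitly.
    model=LogisticRegression(penalty='l2', dual=True, solver='liblinear', random_state=random_state)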
    model.fit(train_X,y_train)
    y_pred=model.predict(test_X)
    y_score=model.predict_proba(test_X)    
    return y_pred,y_score

In [9]:
def model2(X_train, y_train, X_test, random_state):
    tfidf=TfidfVectorizer(stop_words = 'english',min_df=5,max_df=0.15,ngram_range=(1, 4),tokenizer=tokenize)
    train_X = tfidf.fit_transform(X_train)
    test_X=tfidf.transform(X_test)
    
    model2=BernoulliNB(alpha=0.15)
    model2.fit(train_X,y_train)
    y_pred2=model2.predict(test_X)
    y_score2=model2.predict_proba(test_X)
    return y_pred2,y_score2

In [10]:
def model3(X_train, y_train, X_test, random_state):
    tfidf=TfidfVectorizer(stop_words = 'english',min_df=5,max_df=0.15,ngram_range=(1, 4),tokenizer=tokenize)
    train_X = tfidf.fit_transform(X_train)
    test_X=tfidf.transform(X_test)
    model3=SGDClassifier(random_state=random_state)
    #model3=svm.LinearSVC(random_state=random_state)
    model3.fit(train_X,y_train)
    y_pred3=model3.predict(test_X)
    return y_pred3

In [11]:
y_pred1, y_score1= model1(X_train, y_train, X_test,random_state=check_random_state(800))
y_pred2,y_score2= model2(X_train2, y_train2, X_test2,random_state=check_random_state(255))
y_pred3= model3(X_train3, y_train3, X_test3,random_state=check_random_state(1024))
score1 = accuracy_score(y_pred1, y_test)
score2 = accuracy_score(y_pred2, y_test2)
score3 = accuracy_score(y_pred3, y_test3)
print("Logistic Regression prediction accuracy = {0:3.1f}%".format(100.0 * score1))
print("Naive Bayes prediction accuracy = {0:3.1f}%".format(100.0 * score2))
print("SGDClassifier(SVM) prediction accuracy = {0:3.1f}%".format(100.0 * score3))

Logistic Regression prediction accuracy = 89.3%
Naive Bayes prediction accuracy = 87.8%
SGDClassifier(SVM) prediction accuracy = 89.3%


### Ensemble result

In [12]:
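# Caution: X_testf comes from a different random split than X_train and X_train3,
# so most of its rows were seen during training and the accuracy below is optimistic.
y_pred1f,y_score1f= model1(X_train, y_train, X_testf,random_state=check_random_state(800))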
#y_pred2f,y_score2f= model2(X_train2, y_train2, X_testf,random_state=check_random_state(255))
y_pred3f= model3(X_train3, y_train3, X_testf,random_state=check_random_state(1024))

In [13]:
# Sum the logistic positive-class probability and the SVM's 0/1 label;
# a review is labelled positive only when the sum reaches 1.5.
ss = y_score1f[:, 1] + y_pred3f
finaldf = pd.DataFrame({'Sentiment': (ss >= 1.5).astype(int)})

In [14]:
scorefinal = accuracy_score(finaldf['Sentiment'], y_testf)
print(scorefinal)

0.94368


## Generate Prediction of Test Dataset

In [15]:
y_pred1F,y_score1F= model1(clean_train_df['text'], train['sentiment'], clean_test_df['text'],random_state=check_random_state(800))
#y_pred2F,y_score2F= model2(clean_train_df['text'], train['sentiment'], clean_test_df['text'],random_state=check_random_state(255))
y_pred3F= model3(clean_train_df['text'], train['sentiment'], clean_test_df['text'],random_state=check_random_state(1024))
# Probability submission: average of the logistic probability and the SVM label.
Finallist = (y_score1F[:, 1] + y_pred3F) / 2
# Binary submission: positive only when both models predict 1.
Finallist2 = ((y_pred1F + y_pred3F) >= 1.5).astype(int)

In [16]:
output=pd.DataFrame({'id':test['id'],'sentiment':Finallist})
output.to_csv("PredEnsemProb.csv", index=False, quoting=3)

In [17]:
output2=pd.DataFrame({'id':test['id'],'sentiment':Finallist2})
output2.to_csv("PredEnsemBin.csv", index=False, quoting=3)