## Model Ideas:
- naive bayes with SVM
    - text cleaning: remove stop words, punctuation, also try with varying levels of processing
- attention modeling with transfer learning
- logistic regression

### for next time:
- read up on tf-idf
- naive bayes / SVM
- high level understanding of nltk

### outline of steps:
#### preprocessing
- remove blank rows
- change all text to lowercase
- tokenization: break stream of text into list of words or sentences (nltk.word_tokenize, nltk.sent_tokenize)
- remove stopwords and non-alpha text
- word stemming/lemmatization: reduce each word to its root form
- add a column to the original dataset holding the preprocessed text: a list of lemmatized/stemmed words with no punctuation
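
A minimal sketch of the preprocessing steps above. To keep it runnable without downloading NLTK data, it assumes a regex tokenizer and a tiny hand-written stop-word set as stand-ins for `nltk.word_tokenize` and `stopwords.words('english')`. The names `STOP_WORDS`, `preprocess`, and `normalize` are hypothetical. Lemmatization is left as a pluggable function (for example `WordNetLemmatizer().lemmatize`) rather than implemented here.

```python
import re

# Hypothetical stand-in for nltk.corpus.stopwords.words('english'); deliberately tiny.
STOP_WORDS = {"i", "you", "me", "my", "is", "in", "will", "here",
              "the", "a", "an", "and", "to", "of"}


def preprocess(text, stop_words=STOP_WORDS, normalize=lambda word: word):
    """Lowercase, tokenize, drop stop words and non-alphabetic tokens, normalize the rest."""
    # Rough substitute for nltk.word_tokenize: runs of lowercase letters and digits.
    tokens = re.findall(r"[a-z0-9]+", str(text).lower())
    # Keep purely alphabetic tokens that are not stop words.
    kept = [tok for tok in tokens if tok.isalpha() and tok not in stop_words]
    # normalize defaults to the identity; a stemmer or lemmatizer could be passed instead.
    return [normalize(tok) for tok in kept]


rows = ["Sooo SAD I will miss you here in San Diego!!!", "", "my boss is bullying me..."]

# Remove blank rows first, then preprocess what is left.
cleaned = [preprocess(row) for row in rows if row.strip()]
print(cleaned)
```

With pandas, the same function could be applied as `train['text'].astype(str).apply(preprocess)` to build the preprocessed column.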

#### encoding / word vectorization
- encode target variable to a number (sklearn.preprocessing.LabelEncoder)
- calculate tf-idf on the column of preprocessed text
- the tf-idf output is a sparse matrix; printing it lists each nonzero entry as `(row index, vocabulary index of the word)  tf-idf score`
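
The encoding steps above can be sketched on a toy corpus. The documents and labels below are made up for illustration and stand in for the preprocessed text column and the `sentiment` column. The sketch uses the vectorizer's `vocabulary_` mapping to turn each column index back into its word, which makes the `(row, index) score` printout easier to read.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the preprocessed text column and the target column.
docs = ["sooo sad miss san diego", "boss bullying", "happy bday"]
labels = ["negative", "negative", "positive"]

# LabelEncoder assigns integers to classes in sorted order of the class names.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# fit_transform learns the vocabulary and returns a sparse (n_docs, n_terms) matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# vocabulary_ maps term -> column index; invert it to look terms up by index.
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}

# Walk the nonzero entries of the sparse matrix: (row, column) -> tf-idf score.
coo = X.tocoo()
entries = [
    (int(r), index_to_term[int(c)], float(v))
    for r, c, v in zip(coo.row, coo.col, coo.data)
]
for row, term, score in entries:
    print(f"({row}, {term}) {score:.3f}")
```

By default `TfidfVectorizer` L2-normalizes each row, so the scores within one document are relative weights rather than raw counts.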

#### modeling
- fit naive_bayes.MultinomialNB() on the tf-idf matrix (the result of the tf-idf transformation)
- calculate predictions and the accuracy score
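
A minimal sketch of the modeling step, assuming a tiny made-up train/test split rather than the tweet data. The important detail is that the vectorizer is fit on the training text only and then reused, via `transform`, on the held-out text. String labels are passed directly here, since scikit-learn classifiers accept them; the integer-encoded target would work the same way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Made-up examples standing in for preprocessed tweets and their sentiment.
train_docs = ["sooo sad miss", "boss bullying", "happy bday", "love great thanks"]
train_labels = ["negative", "negative", "positive", "positive"]
test_docs = ["sad bullying", "happy thanks"]
test_labels = ["negative", "positive"]

# Learn the vocabulary and idf weights from the training text only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
# Reuse the fitted vocabulary for the held-out text.
X_test = vectorizer.transform(test_docs)

# Fit multinomial naive Bayes on the tf-idf matrix.
model = MultinomialNB()
model.fit(X_train, train_labels)

# Predict on the held-out matrix and score against the true labels.
predictions = model.predict(X_test)
accuracy = accuracy_score(test_labels, predictions)
print(predictions, accuracy)
```

On the real data, `model_selection.train_test_split` could produce the split before vectorizing.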

In [40]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import os

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score

In [2]:
path = '/Users/christinejiang/Documents/Python/data/tweet-sentiment-extraction/'
train = pd.read_csv(path+'train.csv')
test = pd.read_csv(path+'test.csv')
submission = pd.read_csv(path+'sample_submission.csv')

In [3]:
train.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


In [4]:
temp = train.groupby('selected_text').count().sort_values('sentiment',ascending=False).head(20)
temp.style.background_gradient(cmap='mako_r')

selected_text,textID,text,sentiment
good,199,199,199
love,185,185,185
Happy,163,163,163
miss,143,143,143
happy,106,106,106
thanks,98,98,98
great,91,91,91
sad,89,89,89
sorry,82,82,82
Thanks,82,82,82


In [5]:
print(f'train shape:{train.shape}')
display(train.head())
print(f'test shape: {test.shape}')
display(test.head())

train shape:(27481, 4)


Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


test shape: (3534, 3)


Unnamed: 0,textID,text,sentiment
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative
3,01082688c6,happy bday!,positive
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive


In [6]:
test.dtypes

textID       object
text         object
sentiment    object
dtype: object

In [26]:
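# Cast to str so missing values (NaN) do not break the vectorizer
corpus = list(train['text'].astype(str))
vectorizer = TfidfVectorizer()
# Sparse matrix of shape (n_tweets, n_vocabulary_terms)
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on 1.0+
feature_names = vectorizer.get_feature_names_out()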

In [34]:
print(X)

  (0, 10794)	0.3074369990679438
  (0, 25304)	0.39019790983682484
  (0, 12348)	0.332449153509136
  (0, 19605)	0.7626615488152436
  (0, 11515)	0.2470890629972583
  (1, 7808)	0.5082396710971107
  (1, 20219)	0.4815063372325792
  (1, 12491)	0.1777785842206892
  (1, 11735)	0.27636000213407014
  (1, 26209)	0.16531891177156582
  (1, 15589)	0.28046379261867355
  (1, 25527)	0.2590416443756134
  (1, 20135)	0.2940035182063991
  (1, 21583)	0.3772705717955796
  (2, 15185)	0.23247743154143272
  (2, 5065)	0.7364127040606855
  (2, 12841)	0.2097338570214545
  (2, 4664)	0.5686869089975516
  (2, 16099)	0.19041397500089752
  (3, 2775)	0.5070657519380314
  (3, 14060)	0.4708223941249711
  (3, 12734)	0.594892521062647
  (3, 25351)	0.3241544027791997
  (3, 15185)	0.2494742630904795
  (4, 4689)	0.3171944836114534
  :	:
  (27478, 18579)	0.2410399861844931
  (27478, 22777)	0.21200390526958965
  (27478, 22342)	0.22878705459482904
  (27478, 26047)	0.22130032230145363
  (27478, 10816)	0.14707383594906176
  (27478, 1