## Model Ideas:
- naive bayes with SVM
    - text cleaning: remove stop words, punctuation, also try with varying levels of processing
- attention modeling with transfer learning
- logistic regression

### for next time:
- read up on tf-idf
- naive bayes / SVM
- high level understanding of nltk

### outline of steps:
#### preprocessing
- remove blank rows
- change all text to lowercase
- tokenization: break stream of text into list of words or sentences (nltk.word_tokenize, nltk.sent_tokenize)
- remove stopwords and non-alpha text
- word stemming/lemmatization: convert words to their root
- in the original dataset, add a column of preprocessed text holding the list of lemmatized/stemmed words, with no punctuation
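
A minimal sketch of the preprocessing steps above, using only the standard library so it runs without downloading any nltk data. The `STOP_WORDS` set and the `preprocess` helper are hypothetical stand-ins: in the real notebook, `nltk.word_tokenize` and `stopwords.words('english')` would take their place. Lemmatization is left out here; `WordNetLemmatizer().lemmatize(tok)` would be applied to each surviving token.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# nltk.corpus.stopwords.words('english') would be the real source.
STOP_WORDS = {"i", "me", "my", "you", "is", "will", "here", "in",
              "the", "a", "an", "and", "to", "what"}


def preprocess(text):
    """Lowercase, tokenize, and drop stop words and non-alphabetic text."""
    # Keeping only runs of a-z tokenizes and strips punctuation/digits in one
    # step (a crude stand-in for nltk.word_tokenize plus an isalpha() filter).
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


rows = ["Sooo SAD I will miss you here in San Diego!!!",
        "",
        "my boss is bullying me..."]

rows = [r for r in rows if r.strip()]      # remove blank rows
processed = [preprocess(r) for r in rows]  # one token list per row
print(processed)
```

On the real data the same function could be applied with something like `train['text'].astype(str).apply(preprocess)` to build the preprocessed column.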

#### encoding / word vectorization
- encode target variable to a number (sklearn.preprocessing.LabelEncoder)
- calculate tf-idf on the column of preprocessed text
- tf-idf output is a sparse matrix; each non-zero entry prints as (row index, integer index of the word in the vocabulary) followed by the tf-idf score
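
A small sketch of the encoding step, assuming scikit-learn is installed. The three documents and their labels are made up for illustration; they stand in for the preprocessed text column and the `sentiment` column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Toy data (assumption): already-preprocessed text joined back into strings.
docs = ["sooo sad miss san diego", "boss bullying", "happy bday"]
labels = ["negative", "negative", "positive"]

# LabelEncoder sorts the classes, so negative -> 0 and positive -> 1.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# fit_transform learns the vocabulary and returns a sparse matrix
# with one row per document and one column per vocabulary word.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(y)
print(X.shape)
print(vectorizer.vocabulary_)  # maps each word to its integer column index
print(X)                       # non-zero entries with their (row, column) coordinates
```

The integer in each printed coordinate pair can be mapped back to a word through `vectorizer.vocabulary_`.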

#### modeling
- fit naive_bayes.MultinomialNB() on the tf-idf matrix (result of the tf-idf transformation)
- calculate predictions, accuracy score
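
The modeling step might look roughly like this, assuming scikit-learn is installed. The documents and the 1/0 labels (positive/negative) are invented toy data, so the accuracy printed here says nothing about performance on the tweet dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Toy data (assumption): 1 = positive, 0 = negative.
train_docs = ["love good happy", "great thanks love",
              "sad miss sorry", "sorry sad bad"]
train_y = [1, 1, 0, 0]
test_docs = ["happy thanks", "miss sad"]
test_y = [1, 0]

# Fit the vocabulary on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# Fit multinomial naive Bayes on the tf-idf matrix and score the predictions.
model = MultinomialNB()
model.fit(X_train, train_y)
preds = model.predict(X_test)
acc = accuracy_score(test_y, preds)
print(preds, acc)
```

Calling `transform` (not `fit_transform`) on the held-out text keeps its columns aligned with the vocabulary the model was trained on.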

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import os

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score

In [27]:
path = '/Users/christinejiang/Documents/Python/data/tweet-sentiment-extraction/'
train = pd.read_csv(path+'train.csv')
test = pd.read_csv(path+'test.csv')
submission = pd.read_csv(path+'sample_submission.csv')

In [28]:
# train['length'] = [len(str(x).split()) for x in train['selected_text']]

In [29]:
# #data preprocessing
# train['positive'] = [1 if x =='positive' else 0 for x in train['sentiment']]
# train['negative'] = [1 if x =='negative' else 0 for x in train['sentiment']]
# train['neutral'] = [1 if x =='neutral' else 0 for x in train['sentiment']]

# train.head()

In [30]:
temp = train.groupby('selected_text').count().sort_values('sentiment',ascending=False).head(20)
temp.style.background_gradient(cmap='mako_r')

selected_text,textID,text,sentiment
good,199,199,199
love,185,185,185
Happy,163,163,163
miss,143,143,143
happy,106,106,106
thanks,98,98,98
great,91,91,91
sad,89,89,89
sorry,82,82,82
Thanks,82,82,82


In [31]:
print(f'train shape:{train.shape}')
display(train.head())
print(f'test shape: {test.shape}')
display(test.head())

train shape:(27481, 4)


,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


test shape: (3534, 3)


,textID,text,sentiment
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative
3,01082688c6,happy bday!,positive
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive


In [32]:
test.dtypes

textID       object
text         object
sentiment    object
dtype: object

In [33]:
# corpus = list(train['text'].astype(str))
# vectorizer = TfidfVectorizer()
# X = vectorizer.fit_transform(corpus)
# feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

### Kaggle Notebook Walkthrough

In [34]:
#print shape of all datasets
#get distribution of sentiments
#get length of selected text, text
print(f'Train shape: {train.shape}')
print(f'Test shape: {test.shape}')
print(f'Submission shape: {submission.shape}')

Train shape: (27481, 4)
Test shape: (3534, 3)
Submission shape: (3534, 2)


In [35]:
train.head()

,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


In [36]:
print(train.sentiment.value_counts(normalize=True))

neutral     0.404570
positive    0.312288
negative    0.283141
Name: sentiment, dtype: float64


In [37]:
print(test.sentiment.value_counts(normalize=True))

neutral     0.404641
positive    0.312111
negative    0.283248
Name: sentiment, dtype: float64


In [40]:
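# count missing values per column (not displayed: it is not the cell's last expression)
train.isnull().sum()
# drop rows that contain any missing value
train = train.dropna()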

In [41]:
train.isnull().sum()

textID           0
text             0
selected_text    0
sentiment        0
dtype: int64