## Introduction and Dataset

#### Introduction

This notebook performs basic sentiment analysis on movie reviews using scikit-learn in Python. The dataset contains 25,000 positive and 25,000 negative reviews and is split 50/50 into training and test sets. The model is evaluated by its accuracy on the test set.

#### Dataset
The IMDB movie reviews dataset is available at: http://ai.stanford.edu/~amaas/data/sentiment

### Import the libraries

In [1]:
import pandas as pd
import numpy as np
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV

### Loading the dataset

In [2]:
df = pd.read_csv('./movie_data.csv')

In [3]:
# Glance at the contents of the dataset
df.head()
#df['review'][4]

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


### Data Preparation

In [4]:
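# Cleaning function: strip HTML tags and non-word characters, lowercase the text,
# and move emoticons to the end
def preprocessor(text):
    # Remove HTML markup such as <br />
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse runs of non-word characters into single spaces,
    # then append the emoticons with the "nose" character (-) removed
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text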

# Apply the cleaning function to every review
df['review'] = df['review'].apply(preprocessor)
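
To see what the cleaning step does, it can help to run it on a single string. The sketch below repeats the `preprocessor` function with raw-string patterns so that it runs on its own; the sample sentences are made up for illustration and are not taken from the dataset.

```python
import re

def preprocessor(text):
    # Copy of the notebook's cleaning function, using raw-string patterns
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')
    return text

# Hypothetical review text with HTML, punctuation and an emoticon
sample = "<p>This movie is GREAT!!! :-) Loved it.</p>"
cleaned = preprocessor(sample)
print(cleaned)  # this movie is great loved it :)

# Edge case: text that ends in a word character
edge = preprocessor("fun :) indeed")
print(edge)  # fun indeed:)
```

The second call shows a quirk worth knowing about: the emoticons are concatenated without a separator, so when the cleaned text does not end in a space, the first emoticon attaches to the last word.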

### Tokenization of documents and stop words

In [5]:
# Tokenization and Stop Words

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

stop = stopwords.words('english')
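
As a quick illustration of the tokenizer, the sketch below stems a made-up sentence and a few common words. The short word list `stop_sample` is a hypothetical stand-in for the NLTK stop-word list, used here so the example does not need the NLTK corpus download.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    # Split on whitespace and reduce each word to its Porter stem
    return [porter.stem(word) for word in text.split()]

tokens = tokenizer_porter('runners like running and thus they run')
print(tokens)

# Hypothetical mini stop-word list (not the full NLTK list)
stop_sample = ['this', 'was', 'and', 'they']
stemmed_stop = [porter.stem(word) for word in stop_sample]
print(stemmed_stop)
```

Stemming changes some stop words (for example `this` and `was`), so a stemmed token may no longer match an unstemmed stop-word list. One possible remedy, not used in this notebook, would be to stem the stop words as well.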

### Transform Text Data into TF-IDF Vectors

In [6]:
# Converting to tf-idf 
tfidf = TfidfVectorizer(strip_accents = None,
                       lowercase = False,
                       preprocessor = None, #preprocessor function already applied
                       stop_words = stop,
                       tokenizer= tokenizer_porter,
                       use_idf = True,
                       norm = 'l2',
                       smooth_idf = True)
y = df.sentiment.values
x = tfidf.fit_transform(df.review)

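
The weighting selected above (`use_idf = True`, `smooth_idf = True`, `norm = 'l2'`) can be written out by hand. The following is a minimal plain-Python sketch of that scheme, assuming raw term counts for tf and the smoothed idf `ln((1 + n) / (1 + df)) + 1`; the three tiny tokenized documents are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical, already tokenized and stemmed documents
docs = [
    ['good', 'movi', 'good'],
    ['bad', 'movi'],
    ['good', 'act'],
]

n_docs = len(docs)
# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(doc):
    counts = Counter(doc)
    # tf * smoothed idf for every term in the document
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    # L2 normalization so every document vector has unit length
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

vectors = [tfidf_vector(doc) for doc in docs]
for vec in vectors:
    print({term: round(weight, 3) for term, weight in vec.items()})
```

In the second document, `bad` receives a higher weight than `movi` because it occurs in fewer documents, which is the effect the idf term is meant to produce.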


### Document Classification using Logistic Regression

In [7]:
# Train/test split: 50/50, taken in file order
# (with shuffle = False, random_state has no effect)
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=1, test_size = 0.5,
                                                    shuffle = False)
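
Because `shuffle = False`, the split simply takes the first half of the rows for training and the second half for testing. The sketch below shows this on a toy list; the review names and labels are placeholders, not real data.

```python
from sklearn.model_selection import train_test_split

# Placeholder "reviews" and labels, in file order
reviews = ['r0', 'r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9']
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

r_train, r_test, l_train, l_test = train_test_split(
    reviews, labels, test_size=0.5, shuffle=False)

print(r_train)  # first five items, original order
print(r_test)   # last five items, original order

# Passing random_state changes nothing when shuffle=False
r_train2, r_test2, _, _ = train_test_split(
    reviews, labels, test_size=0.5, shuffle=False, random_state=1)
print(r_train == r_train2 and r_test == r_test2)
```

A split like this is only balanced if the rows are already in mixed order. The `df.head()` output above shows interleaved labels, which suggests (but does not prove) that the CSV was shuffled when it was created.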

In [8]:
# Fitting logistic regression model, using 10-fold cross-validation

clf = LogisticRegressionCV(cv = 10, 
                          scoring = 'accuracy',
                          random_state = 0,
                          n_jobs = -1,
                          verbose = 3,
                          max_iter = 300).fit(X_train,  y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:  2.8min remaining:  1.2min
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  3.6min finished


### Model Evaluation

In [9]:
# Evaluating model accuracy

clf.score(X_test, y_test)

0.8948
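
Accuracy is the fraction of test reviews whose predicted label matches the true label. A minimal sketch of that calculation, using invented labels and a helper name (`accuracy`) chosen for this example:

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where prediction and true label agree
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Invented labels: 4 of 5 predictions are correct
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))  # 0.8
```

Assuming the test half holds 25,000 reviews, a score of 0.8948 corresponds to 22,370 correctly classified reviews.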