# Sentiment Classification the old-fashioned way:

`Naive Bayes`, `Logistic Regression`, and `Ngrams`

The purpose of this notebook is to show how sentiment classification is done via the classic techniques of `Naive Bayes`, `Logistic Regression`, and `Ngrams`. We will be using `sklearn` and the `fastai` library.

In a future lesson, we will revisit sentiment classification using `deep learning`, so that you can compare the two approaches.

The content here was extended from Lesson 10 of the fast.ai Machine Learning course. Linear models are pretty close to the state of the art on this task. Jeremy surpassed the state of the art using an RNN in fall 2017.

## 0. The fastai library

We will begin using the fastai library (version 1.0) in this notebook. We will use it more once we move on to neural networks.

The fastai library is built on top of PyTorch and encodes many state-of-the-art best practices. It is used in production at a number of companies. You can read more about it here:
- Fast.ai's software could radically democratize AI

## 1. The IMDB dataset

The Large Movie Review Dataset contains a collection of 50,000 reviews from IMDB. We will use the version hosted as part of the fast.ai datasets on AWS Open Datasets.

The dataset contains an equal number of positive and negative reviews. The authors considered only highly polarized reviews. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset. The dataset is divided into training and test sets, each containing 25,000 labeled reviews.
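The labeling rule described above can be sketched as a small function. This is only an illustration of the thresholds; `polarity_from_score` is a hypothetical helper and is not part of the dataset's tooling or of fastai.

```python
from typing import Optional


def polarity_from_score(score: int) -> Optional[str]:
    """Map a 1-10 rating to a polarity label (hypothetical helper)."""
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    # Scores of 5 and 6 are treated as neutral and left out of the dataset.
    return None


scores = [1, 4, 5, 6, 7, 10]
labels = [polarity_from_score(s) for s in scores]
print(labels)
```

Returning `None` for the middle scores mirrors the fact that neutral reviews are simply excluded rather than given a third label.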

The sentiment classification task consists of predicting the polarity (positive or negative) of a given text.

### Imports

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
from fastai import *
from fastai.text import *
from fastai.utils.mem import GPUMemTrace #call with mtrace
import sklearn.feature_extraction.text as sklearn_text
import pickle 

### Preview the sample IMDb data set

fast.ai has a number of datasets hosted via AWS Open Datasets for easy download. We can see them by checking the docs for URLs (remember ?? is a helpful command):

In [None]:
?? URLs

It is always good to start working on a sample of your data before you use the full dataset: this allows for quicker computations as you debug and get your code working. For IMDB, there is a sample dataset already available:

In [None]:
path = untar_data(URLs.IMDB_SAMPLE)
path

Read the data set into a pandas dataframe, which we can inspect to get a sense of what our data looks like. We see that the three columns contain review label, review text, and the `is_valid` flag, respectively. `is_valid` is a boolean flag indicating whether the row is from the validation set or not.

In [None]:
df = pd.read_csv(path/'texts.csv')
df.head()
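Beyond `head()`, it can help to check the class balance and the size of the training/validation split. The sketch below uses a tiny hand-made dataframe so that it runs without downloading anything. The rows are invented for illustration, and the column names `label`, `text`, and `is_valid` are assumed to match those in `texts.csv`.

```python
import pandas as pd

# Toy stand-in for the sample dataframe; the rows are made up.
toy_df = pd.DataFrame({
    "label": ["negative", "positive", "negative", "positive"],
    "text": [
        "Dull and far too long.",
        "A wonderful film.",
        "Poor acting throughout.",
        "I loved every minute.",
    ],
    "is_valid": [False, False, True, False],
})

# Number of reviews per class.
label_counts = toy_df["label"].value_counts()

# Rows flagged is_valid belong to the validation set; the rest are training rows.
n_valid = int(toy_df["is_valid"].sum())
n_train = len(toy_df) - n_valid

print(label_counts)
print(f"training rows: {n_train}, validation rows: {n_valid}")
```

The same three lines of analysis should apply to `df` once it has been loaded, provided the column names match.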

### Extract the movie reviews from the sample IMDb data set.

#### We will be using TextList from the fastai library:

In [None]:
%%time
# Sometimes throws a `BrokenProcessPool` error. Keep trying until it works!

count = 0
error = True
while error:
    try: 
        # Preprocessing steps
        movie_reviews = (TextList.from_csv(path, 'texts.csv', cols='text')
                         .split_from_df(col=2)
                         .label_from_df(cols=0))
        error = False
        print(f'failure count is {count}\n')    
    except Exception:  # catch ordinary errors, but still allow KeyboardInterrupt to stop the loop
        # accumulate failure count
        count = count + 1
        print(f'failure count is {count}')
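The retry loop above keeps going until the preprocessing succeeds. A bounded variant is sketched below, assuming that giving up after a fixed number of attempts is acceptable. `retry` and `flaky_step` are hypothetical names used only for illustration; `flaky_step` stands in for the `TextList` preprocessing call so that the example runs without fastai.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(fn: Callable[[], T], max_attempts: int = 5) -> T:
    """Call fn until it succeeds or max_attempts is reached (hypothetical helper)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # KeyboardInterrupt is not caught here
            print(f"attempt {attempt} failed: {exc!r}")
            if attempt == max_attempts:
                raise  # give up and surface the last error
    raise RuntimeError("unreachable")


# Stand-in for the flaky preprocessing step: fails twice, then succeeds.
calls = {"n": 0}


def flaky_step() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated failure")
    return "ok"


result = retry(flaky_step)
print(result, calls["n"])
```

Capping the number of attempts means a persistent failure is reported instead of looping forever.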