# Background & Workflow

- I will be analyzing text data. A common approach is to convert the text into some kind of numerical representation, since we can then apply all of our mathematical tools to it. In this small project, I will explore 2 feature engineering methods that convert raw text data into numerical vectors:
    - **Bag of Words (BoW)**
        - BoW encodes an input sentence as the frequency of each word in the sentence. 
        - In this approach, all words contribute equally to the feature vectors.
    - **Term Frequency - Inverse Document Frequency (TF-IDF)**
        - TF-IDF is a measure of how important each term is to a specific document, as compared to an overall corpus. 
        - TF-IDF encodes each word as its frequency in the document of interest, divided by a measure of how common the word is across all documents (the corpus).
        - Using this approach, each word contributes differently to the feature vectors.
        - The assumption behind TF-IDF is that words that appear commonly everywhere say little about what is specific to a document of interest, so it represents each document mainly through the words that set it apart from other documents.

- To compare these 2 methods, I will first apply them to the same Movie Review dataset to analyse sentiment (how positive or negative a text is). To keep the comparison fair, the same type of classifier, an **SVM (support vector machine)**, will be used with both encodings to separate positive reviews from negative reviews.

- With a linear kernel, as used here, an SVM is a simple yet powerful and interpretable linear model. To use it as a classifier, we need at least 2 splits of the data: training data and test data. The training data is used to tune the weight parameters of the SVM so that it learns an optimal way to classify the training data. We can then test this trained SVM classifier on the test data, to see how well it works on data that the classifier has not seen before.
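
To make the BoW encoding described above concrete, here is a minimal sketch that builds count vectors by hand. The two-sentence corpus and the helper name `bow_vector` are made up for illustration; this is not how `scikit-learn` implements it internally, only the same idea in plain Python.

```python
from collections import Counter

# Toy corpus (invented for illustration; not from the Movie Review data).
docs = ["good movie good plot", "bad movie"]

# Vocabulary: every distinct word, sorted so the column order is stable.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, vocab):
    """Return the count of each vocabulary word in `doc` (hypothetical helper)."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

bow = [bow_vector(doc, vocab) for doc in docs]
print(vocab)  # ['bad', 'good', 'movie', 'plot']
print(bow)    # [[0, 2, 1, 1], [1, 0, 1, 0]]
```

Each row is one document and each column is one vocabulary word, which is the shape of input the classifier receives later on.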

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import nltk package 
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

import warnings
warnings.filterwarnings('ignore')

# scikit-learn imports
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, precision_recall_fscore_support

For this project I will be using `nltk`: the Natural Language Toolkit.

To do so, I first need to download some data.

Natural language processing (NLP) often requires corpus data (lists of words, and/or example text data), which is what I download here.

In [2]:
# Download the NLTK 'punkt' tokenizer models and the stopword lists
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /home/crebollarramirez/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/crebollarramirez/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Sentiment Analysis on Movie Review Data

I will apply sentiment analysis to Movie Review (MR) data.

- The MR data contains more than 10,000 reviews collected from the IMDB website, and each of the reviews is annotated as either positive or negative. The numbers of positive and negative reviews are roughly the same. For more information about the dataset, you can visit http://www.cs.cornell.edu/people/pabo/movie-review-data/

- For this project, I have already shuffled the data and truncated it to contain only 5000 reviews.

Goals of this part of the project:
- Transform the raw text data into vectors with the BoW encoding method
- Split the data into training and test sets
- Write a function to train an SVM classifier on the training set
- Test this classifier on the test set and report the results

#### Import data

In [3]:
MR_df = pd.read_csv('data/rt-polarity.tsv', sep='\t', header=None)
MR_df.rename(columns={1: 'label', 2: 'review'}, inplace=True)

# Double Checking the data
MR_df.head()

#### Function that converts string labels to numerical labels


The `convert_label` function:
- takes two parameters, `label` and `direction`
- if `direction` is 'tonumber', 
    - and if the input label is "pos" return 1.0
    - and if the input label is "neg" return 0.0
    - otherwise, return the input label as is
- if `direction` is 'tolabel'
    - and if the input label is `1.0` return "pos"
    - and if the input label is `0.0` return "neg"
    - otherwise, return the label as is
        

In [6]:
def convert_label(label, direction):
    if direction == 'tonumber':
        if label == 'pos':
            return 1.0
        elif label == 'neg':
            return 0.0
        
    elif direction == 'tolabel':
        if label == 1.0:
            return 'pos'
        elif label == 0.0:
            return 'neg'
    
    return label
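
As a quick check of the behaviour listed above, the sketch below repeats the function so that it runs on its own, then converts a few labels in both directions. The `'unknown'` label and the `NaN` value are made-up inputs, chosen to show the "return the label as is" branch.

```python
def convert_label(label, direction):
    # Same logic as the cell above, repeated so this snippet is standalone.
    if direction == 'tonumber':
        if label == 'pos':
            return 1.0
        elif label == 'neg':
            return 0.0
    elif direction == 'tolabel':
        if label == 1.0:
            return 'pos'
        elif label == 0.0:
            return 'neg'
    return label

labels = ['pos', 'neg', 'unknown']
numbers = [convert_label(l, 'tonumber') for l in labels]
print(numbers)      # [1.0, 0.0, 'unknown']

round_trip = [convert_label(n, 'tolabel') for n in numbers]
print(round_trip)   # ['pos', 'neg', 'unknown']

# A missing label (NaN) matches neither branch, so it is returned unchanged.
missing = convert_label(float('nan'), 'tonumber')
print(missing != missing)  # True: NaN is the only float not equal to itself
```

The pass-through of missing values matters later, when unlabeled test reviews sit in the same dataframe as labeled training reviews.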

### Numerical Labels
Converting all labels in `MR_df["label"]` to numerical labels, using the `convert_label` function.

In [9]:
MR_df['Y'] = MR_df['label'].map(lambda x : convert_label(x, direction='tonumber'))

# Verifying if everything is right
MR_df.head()

Unnamed: 0,0,label,review,Y
0,8477,neg,except as an acting exercise or an exceptional...,0.0
1,4031,pos,japanese director shohei imamura 's latest fil...,1.0
2,10240,neg,i walked away not really know who `` they `` w...,0.0
3,8252,neg,what could have been a neat little story about...,0.0
4,1346,pos,no screen fantasy-adventure in recent memory h...,1.0
...,...,...,...,...
4995,1229,pos,pacino is brilliant as the sleep-deprived dorm...,1.0
4996,70,pos,"a taut , intelligent psychological drama .",1.0
4997,9832,neg,' . . . the cast portrays their cartoon counte...,0.0
4998,9140,neg,the central character is n't complex enough to...,0.0


### Convert Text data into vector 

I will now create a `CountVectorizer` object to transform the text data into vectors with numerical values.

To do so, I initialize a `CountVectorizer` object and name it `vectorizer`.

I pass 4 arguments to initialize the `CountVectorizer`:
  1. `analyzer`: `'word'` 
          Analyze the data at the word level.
  2. `max_features`: `2000`
          Set the maximum number of unique words to keep.
  3. `tokenizer`: `word_tokenize`
          Tokenize the text data using `word_tokenize` from NLTK.
  4. `stop_words`: `stopwords.words('english')`
          Remove all English stopwords. We do this since they generally don't provide useful discriminative information.

In [12]:
vectorizer = CountVectorizer(analyzer='word', max_features=2000, tokenizer=word_tokenize, stop_words=stopwords.words('english'))
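
To illustrate what the `stop_words` and `max_features` arguments do to the vocabulary, here is a small sketch in plain Python. The corpus, the two-word stop list, and the alphabetical tie-breaking are all assumptions made for this example; `CountVectorizer` may break ties differently.

```python
from collections import Counter

# Toy data and a tiny stop-word list, both invented for this example.
docs = ["the plot is good", "the acting is bad", "good acting"]
stop_words = {"the", "is"}
max_features = 2

# Count each remaining word across the whole corpus.
totals = Counter(
    word for doc in docs for word in doc.split() if word not in stop_words
)

# Keep the `max_features` most frequent words (ties broken alphabetically here).
ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
vocab = sorted(word for word, _ in ranked[:max_features])
print(vocab)  # ['acting', 'good']
```

With `max_features=2000`, any word outside the 2000 most frequent ones is dropped in the same way, which keeps the feature matrix small.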

### Vectorize reviews

I transform reviews `MR_df["review"]` into vectors using the `vectorizer` I created above:

In [14]:
MR_X = vectorizer.fit_transform(MR_df["review"]).toarray()
MR_Y = np.array(MR_df['Y'])

### Defining the train & test sets

I used `sklearn`'s `train_test_split()` function to define the train and test sets. I stored the train data (predictors) in `MR_train_X` and the labels (outcomes) in `MR_train_Y`. Similarly, I stored the test data in `MR_test_X` and the test labels in `MR_test_Y`.

In addition to providing the predictors (`MR_X`) and outcomes (`MR_Y`) to the function, I set the following arguments, which fix the size of the test set and make the split reproducible:
- `test_size`: 0.2
- `random_state`: 200

In [18]:
MR_train_X, MR_test_X, MR_train_Y, MR_test_Y = train_test_split(MR_X, MR_Y, test_size=0.2, random_state=200)
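
The role of `test_size` and `random_state` can be sketched with a seeded shuffle. The helper `simple_split` below is hypothetical and will not reproduce the exact rows that `train_test_split` selects; it only shows why fixing the seed makes the split repeatable.

```python
import random

def simple_split(items, test_size=0.2, seed=200):
    """Shuffle a copy of `items` with a fixed seed, then cut off a test slice."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))
train_a, test_a = simple_split(data)
train_b, test_b = simple_split(data)

print(len(train_a), len(test_a))  # 8 2
print(test_a == test_b)           # True: same seed, same split
```

Because the same seed is reused for the TF-IDF experiment, both encodings are evaluated on the same set of held-out reviews.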

### SVM

The function `train_SVM` initializes an SVM classifier and trains it.

Inputs: 
- `X`: np.ndarray, training samples
- `y`: np.ndarray, training labels
- `kernel`: string, the kernel type passed to `SVC`; defaults to `"linear"`

Output: a trained classifier `clf`

In [20]:
def train_SVM(X, y, kernel='linear'):
    # Initialize the classifier with the requested kernel, then fit it
    clf = SVC(kernel=kernel)
    clf.fit(X, y)

    return clf
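
Once a linear-kernel SVM is trained, its prediction reduces to the sign of a weighted sum of the features. The sketch below shows that rule with made-up weights for a three-word vocabulary; the function names are hypothetical, and the numbers are not taken from the trained model (for a fitted linear `SVC`, the learned values are exposed as `clf.coef_` and `clf.intercept_`).

```python
def linear_decision(x, w, b):
    """Signed score w . x + b; positive means the 'pos' side of the boundary."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x, w, b):
    """Map the signed score to the numeric labels used in this notebook."""
    return 1.0 if linear_decision(x, w, b) > 0 else 0.0

# Hypothetical weights for the vocabulary ['bad', 'good', 'movie'].
w = [-1.5, 1.2, 0.1]
b = -0.05

print(predict([0, 2, 1], w, b))  # 1.0  (counts for "good good movie")
print(predict([1, 0, 1], w, b))  # 0.0  (counts for "bad movie")
```

This is also why the linear model is interpretable: a large positive weight marks a word that pushes a review towards "pos", and a large negative weight marks one that pushes it towards "neg".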

### Train SVM

I trained an SVM classifier with the default linear kernel on the samples `MR_train_X` and the labels `MR_train_Y`.

In [22]:
MR_clf = train_SVM(MR_train_X, MR_train_Y)

### Predict outcome


In [None]:
MR_predicted_train_Y = MR_clf.predict(MR_train_X)
MR_predicted_test_Y = MR_clf.predict(MR_test_X)

I used `classification_report` to print out the performance of the classifier on the training set:

In [None]:
# The classifier should be able to reach above 90% accuracy
# on the training set
print(classification_report(MR_train_Y,MR_predicted_train_Y))

#### Checking the performance

In [None]:
# The classifier should be able to reach around 67% accuracy on the test set.
print(classification_report(MR_test_Y, MR_predicted_test_Y))
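
The report above lists precision, recall and F1 per class. As a reminder of what those numbers mean, here is a small sketch that computes them by hand for the positive class. The labels are invented for the arithmetic and `binary_metrics` is a hypothetical helper, not part of `sklearn`.

```python
def binary_metrics(y_true, y_pred, positive=1.0):
    """Precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up labels, only to show the arithmetic.
y_true = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
y_pred = [1.0, 1.0, 0.0, 1.0, 0.0, 0.0]

precision, recall, f1 = binary_metrics(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```

A large gap between the training and test reports, as expected here, suggests that the classifier fits the training reviews more closely than it generalizes to new ones.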

# TF-IDF 

I will now explore TF-IDF for sentiment analysis.

TF-IDF is used as an alternative way to encode text data, as compared to the BoW approach.

Goals for this part of the project:
- Transform the raw text data into vectors using TF-IDF
- Train an SVM classifier on the training set and report the performance of this classifier on the test set

### Text Data to Vectors

I created a `TfidfVectorizer` object to transform the text data into vectors with TF-IDF.


Here are the arguments:
  1. `sublinear_tf`: `True`
           Apply sublinear TF scaling, i.e. replace the term frequency `tf` with `1 + log(tf)`.
  2. `analyzer`: `'word'`
           Analyze the data at the word level.
  3. `max_features`: `2000`
           Set the maximum number of unique words to keep.
  4. `tokenizer`: `word_tokenize`
           Tokenize the text data using `word_tokenize` from NLTK.

In [None]:
tfidf = TfidfVectorizer(sublinear_tf=True, analyzer='word', max_features=2000, tokenizer=word_tokenize)
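
To show what sublinear TF scaling changes, the sketch below computes TF-IDF weights by hand on a toy corpus. It follows the smoothed IDF and L2 normalisation that, as far as I understand, `TfidfVectorizer` uses by default; the corpus and the helper `tfidf_vector` are made up for this example, so treat it as an approximation of the idea rather than the library's exact code path.

```python
import math
from collections import Counter

docs = ["good good good movie", "bad movie", "good plot"]  # toy corpus
tokenized = [doc.split() for doc in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
df = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}
# Smoothed inverse document frequency.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_vector(doc, sublinear_tf=True):
    """TF-IDF weights for one tokenized document, L2-normalised."""
    counts = Counter(doc)
    weights = []
    for w in vocab:
        tf = counts[w]
        if sublinear_tf and tf > 0:
            tf = 1 + math.log(tf)  # dampen the effect of repeated words
        weights.append(tf * idf[w])
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights] if norm else weights

good, movie = vocab.index("good"), vocab.index("movie")
raw = tfidf_vector(tokenized[0], sublinear_tf=False)
damped = tfidf_vector(tokenized[0], sublinear_tf=True)

# "good" and "movie" have the same IDF here, so the ratio isolates the TF part.
raw_ratio = raw[good] / raw[movie]
damped_ratio = damped[good] / damped[movie]
print(round(raw_ratio, 3))     # 3.0
print(round(damped_ratio, 3))  # 2.099
```

With sublinear scaling, a word that appears three times counts about twice as much as a word that appears once, rather than three times as much.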

### Transform Reviews 

I have transformed the `review` column of `MR_df` into vectors using the `tfidf` I have created above.

In [None]:
tfidf.fit(MR_df['review'])
MR_tfidf_X = tfidf.transform(MR_df['review']).toarray()

### Train/test split

Using `train_test_split`, I split `MR_tfidf_X` and `MR_Y` into a training set and a test set, with the same `test_size` and `random_state` as before.

In [None]:
MR_train_tfidf_X, MR_test_tfidf_X, MR_train_tfidf_Y, MR_test_tfidf_Y = train_test_split(
    MR_tfidf_X, MR_Y, test_size=0.2, random_state=200)

### Training

I trained an SVM classifier on the samples `MR_train_tfidf_X` and the labels `MR_train_tfidf_Y`.


In [None]:
MR_tfidf_clf = train_SVM(MR_train_tfidf_X, MR_train_tfidf_Y)


### Prediction

In [None]:
MR_pred_train_tfidf_Y = MR_tfidf_clf.predict(MR_train_tfidf_X)

MR_pred_test_tfidf_Y = MR_tfidf_clf.predict(MR_test_tfidf_X)

Again, I used `classification_report` to check the performance on the training set.

In [None]:
print(classification_report(MR_train_tfidf_Y, MR_pred_train_tfidf_Y))

# Again, I check the performance on the test set:
print(classification_report(MR_test_tfidf_Y, MR_pred_test_tfidf_Y))

# Sentiment Analysis on Customer Review with TF-IDF 

In this part of the project, I will use TF-IDF to analyse the sentiment of some Customer Review (CR) data.

The CR data contains around 3771 reviews, and they were all collected from the Amazon website. The reviews are annotated by humans as either positive reviews or negative reviews. In this dataset, the 2 classes are not balanced, as there are twice as many positive reviews as negative reviews.
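
Because the classes are imbalanced, accuracy alone can look better than it is. The short calculation below uses hypothetical counts in the 2:1 ratio described above (not the real totals) to show what a classifier that always predicts "pos" would score.

```python
# Hypothetical counts in a 2:1 ratio (not the real dataset totals).
n_pos, n_neg = 2000, 1000

# A "classifier" that always answers 'pos' is right on every positive review.
baseline_accuracy = n_pos / (n_pos + n_neg)
print(round(baseline_accuracy, 3))  # 0.667

# Its recall on the negative class is zero, which accuracy alone hides.
neg_recall = 0 / n_neg
print(neg_recall)  # 0.0
```

This is one reason to read the per-class precision and recall in `classification_report`, not only the overall accuracy.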

For more information on this dataset, visit https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

I have already split the data into a training set and a test set, in which the training set has labels for the reviews, but the test set doesn't. 

The goal is to train an SVM classifier on the training set, and then predict pos/neg for each review in the test set.

Goals for this part of the project:
- Use the TF-IDF feature engineering method to encode the raw text data into vectors
- Train an SVM classifier on the training set
- Predict labels for the reviews in the test set


### Loading the data

In [None]:
CR_train_file = 'data/custrev_train.tsv'
CR_test_file = 'data/custrev_test.tsv'

CR_train_df = pd.read_csv(CR_train_file, sep='\t', names=['index', 'label', 'review'])
CR_test_df = pd.read_csv(CR_test_file, sep='\t', names=['index', 'review'])


### Concatenation

In [None]:
CR_df = pd.concat([CR_train_df, CR_test_df], ignore_index=True)
CR_df

### Cleaning

Converting all labels in `CR_df["label"]` using the `convert_label` function created previously. 

In [None]:
CR_df['Y'] = CR_df['label'].map(lambda x : convert_label(x, direction='tonumber'))
CR_df

### Using `tfidf`

Transforming reviews `CR_df["review"]` into vectors using the `tfidf` vectorizer I created previously. 

In [None]:
tfidf.fit(CR_df['review'])
CR_tfidf_X = tfidf.transform(CR_df['review']).toarray()
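
Calling `fit` again on the customer reviews replaces what the vectorizer learned from the movie reviews. The sketch below illustrates the general fit/transform pattern with a hypothetical `TinyCountVectorizer` class (counts only, no IDF); it is a stand-in written for this example, not the `sklearn` implementation.

```python
class TinyCountVectorizer:
    """Hypothetical stand-in: fit() learns a vocabulary, transform() applies it."""

    def fit(self, docs):
        words = sorted({w for doc in docs for w in doc.split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for w in doc.split():
                if w in self.vocabulary_:  # words not seen in fit() are dropped
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows

vec = TinyCountVectorizer()
vec.fit(["good movie", "bad movie"])
first_vocab = sorted(vec.vocabulary_)
encoded = vec.transform(["good camera"])
print(first_vocab)  # ['bad', 'good', 'movie']
print(encoded)      # [[0, 1, 0]]

# Fitting again replaces the old vocabulary entirely.
vec.fit(["good camera", "bad battery"])
second_vocab = sorted(vec.vocabulary_)
print(second_vocab)  # ['bad', 'battery', 'camera', 'good']
```

Refitting is what we want here, since product reviews use different words from movie reviews.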

Here I collect the training samples from `CR_tfidf_X` and their numerical labels from `CR_df['Y']`, keeping only the rows that have a label:


In [None]:
# Keep only the rows whose label is not missing (the labeled training reviews)
CR_train_X = CR_tfidf_X[~CR_df['Y'].isnull()]
CR_train_Y = CR_df['Y'][~CR_df['Y'].isnull()]
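
The selection above relies on the unlabeled test reviews having a missing (`NaN`) value in `Y`. The same idea in plain Python, with made-up rows, looks like this; the list `has_label` plays the role of the `~CR_df['Y'].isnull()` mask.

```python
import math

# Made-up rows: (feature vector, numeric label); NaN marks an unlabeled review.
rows = [([0.1, 0.9], 1.0), ([0.7, 0.2], 0.0), ([0.4, 0.4], float('nan'))]

# True where a label is present.
has_label = [not math.isnan(y) for _, y in rows]

train_X = [x for (x, _), keep in zip(rows, has_label) if keep]
train_Y = [y for (_, y), keep in zip(rows, has_label) if keep]
test_X = [x for (x, _), keep in zip(rows, has_label) if not keep]

print(len(train_X), len(test_X))  # 2 1
print(train_Y)                    # [1.0, 0.0]
```

Inverting the mask gives the unlabeled rows, which is how the test samples are collected further down.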

### SVM 
Training an SVM classifier on the samples `CR_train_X` and the labels `CR_train_Y`:

In [None]:
CR_clf = train_SVM(CR_train_X, CR_train_Y)

### Predict: training data

Predicting labels on the training set

In [None]:
CR_pred_train_Y = CR_clf.predict(CR_train_X)


In [None]:
# Checking the classifier accuracy on the train data
print(classification_report(CR_train_Y, CR_pred_train_Y))

# Collecting all test samples from CR_tfidf_X
CR_test_X = CR_tfidf_X[CR_df['Y'].isnull()]

### Predict: test set
Predicting the labels on the test set. I stored the predictions in a pandas DataFrame called `CR_pred_test_Y`, with the numeric predictions in a column (series) `'label'`

In [None]:
CR_pred_test_Y = pd.DataFrame(CR_clf.predict(CR_test_X), columns=['label'])

### Convert labels

Using the `convert_label` function, I converted the predicted numerical labels back to string labels ("pos" and "neg").


In [None]:
CR_test_df['label'] = CR_pred_test_Y['label']
CR_test_df['label'] = CR_test_df['label'].map(lambda x : convert_label(x, direction='tolabel'))

# Now we have a dataset with the labels that were predicted using the model