# Advanced Text Mining Part 2 - Exercises with Answers

In the following exercises, we will run sentiment analysis on movie review data. The reviews were scraped from a website. Each review is a document and, collectively, the reviews form the corpus. We will be using the `movie_reviews.csv` file.

## Exercise 1

#### Task 1
##### Load the following packages/libraries that are used in this module:
##### pandas, numpy, pickle (Helper packages); nltk (natural language toolkit for text processing); scikit-learn; matplotlib (for visualizing)

#### Result:

In [None]:
# Helper packages.
import os
import pandas as pd
import numpy as np
import pickle

# Packages with tools for text processing.
import nltk

# Packages for working with text data and analyzing sentiment.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer

# Packages to build and measure the performance of a logistic regression model.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn import preprocessing

# Package for visualizing the results.
import matplotlib.pyplot as plt



#### Task 2
##### Set `main_dir` to the location of your `booz-allen-hamilton` folder.
##### Make `data_dir` from `main_dir` by appending the remainder of the path to the data directory.
##### Make `plots_dir` from `main_dir` by appending the remainder of the path to the plots directory.
##### Set the working directory to `data_dir`.
##### Check if the working directory is updated to `data_dir`.

#### Result:

In [None]:
from pathlib import Path
# Set `home_dir` to your user home directory.
home_dir = Path.home()

# Set `main_dir` to the location of your `booz-allen-hamilton` folder.
main_dir = home_dir / "Desktop" / "booz-allen-hamilton"

# Make `data_dir` from `main_dir` and the remainder of the path to the data directory.
data_dir = main_dir / "data"

# Make `plots_dir` from `main_dir` and the remainder of the path to the plots directory.
plots_dir = main_dir / "plots"

In [None]:
# Change the working directory.
os.chdir(data_dir)

# Check the working directory.
print(os.getcwd())

#### Task 3

##### Load the `movie_reviews.csv` file and preview the data.

#### Result:

In [None]:
movie_reviews = pd.read_csv('movie_reviews.csv')
movie_reviews.head()

#### Task 4
##### Execute the chunk of code below, which cleans the movie reviews data in the following steps. (These are the same steps used to preprocess the text data in our earlier module.)

##### 1. Convert all characters to lower case
##### 2. Remove stop words
##### 3. Remove punctuation, numbers, and all other symbols that are not letters
##### 4. Stem words
##### 5. Save the cleaned reviews in the list `reviews_clean_list`, create a Document-Term Matrix, and save it as `reviews_DTM`

##### Print the first 10 reviews from `reviews_clean_list`.

#### Result:

In [None]:
# Import the text processing tools before they are used.
# This assumes the NLTK tokenizer and stop-word data are already
# downloaded (see nltk.download if they are missing).
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

reviews = movie_reviews["reviews"]

# Tokenize each review into a list of words.
reviews_tokenized = [word_tokenize(review) for review in reviews]

# Use a set for fast stop-word lookups and create the stemmer once.
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Create a list for clean reviews.
reviews_clean = [None] * len(reviews_tokenized)
# Create a list of word counts for each clean review.
word_counts_per_reviews = [None] * len(reviews_tokenized)

# Process words in all documents.
for i in range(len(reviews_tokenized)):
    # 1. Convert to lower case.
    reviews_clean[i] = [word.lower() for word in reviews_tokenized[i]]

    # 2. Remove stopwords.
    reviews_clean[i] = [word for word in reviews_clean[i] if word not in stop_words]

    # 3. Remove punctuation and any non-alphabetical characters.
    reviews_clean[i] = [word for word in reviews_clean[i] if word.isalpha()]

    # 4. Stem words.
    reviews_clean[i] = [stemmer.stem(word) for word in reviews_clean[i]]

    # Record the word count per review.
    word_counts_per_reviews[i] = len(reviews_clean[i])

# 5. Join the cleaned tokens of each review back into a single string.
reviews_clean_list = [' '.join(review) for review in reviews_clean]

# Build the Document-Term Matrix from the cleaned reviews.
ex_vec = CountVectorizer()
ex_X = ex_vec.fit_transform(reviews_clean_list)

# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from scikit-learn 1.0.
reviews_DTM = pd.DataFrame(ex_X.toarray(), columns = ex_vec.get_feature_names_out())

In [None]:
print(reviews_clean_list[:10])
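
To make the shape of a Document-Term Matrix concrete, here is a minimal hand-rolled sketch on three made-up cleaned reviews. It is not how `CountVectorizer` is implemented (it simply splits on spaces), and `toy_reviews`, `vocabulary` and `toy_dtm` are hypothetical names used only for illustration.

```python
from collections import Counter

# Hypothetical cleaned reviews (already lower-cased, stemmed and space-joined).
toy_reviews = ["great movi great cast", "bad plot", "great plot"]

# Vocabulary: every distinct term, sorted alphabetically.
vocabulary = sorted({term for review in toy_reviews for term in review.split()})

# One row per review, one column per term, each cell holding a term count.
toy_dtm = []
for review in toy_reviews:
    counts = Counter(review.split())
    toy_dtm.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in toy_dtm:
    print(row)
```

Each row corresponds to one review and each column to one term, which is the same layout `reviews_DTM` has.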

#### Task 5
##### We want to analyze the sentiment of the movie reviews.
##### Let us first add the sentiment labels to our cleaned reviews.
##### Load the sentiment analysis function we used in our module.

##### This function returns a list with one label per review:

In [None]:
def sentiment_analysis(texts):
    # Create the analyzer once and reuse it for every text.
    sid = SentimentIntensityAnalyzer()
    list_of_scores = []
    for text in texts:
        compound = sid.polarity_scores(text)["compound"]
        # Compound scores of 0 or more are labeled positive.
        if compound >= 0:
            list_of_scores.append("positive")
        else:
            list_of_scores.append("negative")
    return list_of_scores

##### Assign labels to `reviews_clean_list` using the `sentiment_analysis` function and save them to the `score_labels` variable.

#### Result:

In [None]:
score_labels = sentiment_analysis(reviews_clean_list)
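
The labeling rule inside `sentiment_analysis` can be looked at in isolation. The sketch below applies the same `compound >= 0` cut-off to a few made-up compound scores (VADER compound scores fall between -1 and 1). The helper `label_from_compound` and the scores are hypothetical and exist only for this illustration.

```python
def label_from_compound(compound):
    """Mirror the rule in sentiment_analysis: scores >= 0 are 'positive'."""
    return "positive" if compound >= 0 else "negative"

# Hypothetical compound scores, not taken from the movie reviews.
toy_scores = [0.82, 0.0, -0.05, -0.64]
toy_labels = [label_from_compound(score) for score in toy_scores]
print(toy_labels)
```

Note that a perfectly neutral score of 0 is labeled positive under this rule, so neutral reviews add to the positive class.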

#### Task 6
##### Split the `reviews_DTM` dataset into 70% training and 30% test sets.
##### Split `score_labels` the same way.
##### Use random state 2.
##### Name the output variables the same way we named them in class:
 - `X_train`
 - `X_test`
 - `y_train`
 - `y_test`

#### Result:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(reviews_DTM, 
                                                    score_labels, 
                                                    train_size = 0.70,
                                                    random_state = 2)
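
After a split it can be useful to check how the classes are distributed in each part. The sketch below uses made-up label lists standing in for `y_train` and `y_test`. `class_shares` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of observations that fall in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels standing in for y_train and y_test.
toy_y_train = ["positive"] * 6 + ["negative"] * 1
toy_y_test = ["positive"] * 2 + ["negative"] * 1

print(class_shares(toy_y_train))
print(class_shares(toy_y_test))
```

If the shares differ a lot between the two parts, `train_test_split` accepts a `stratify` argument that keeps the class proportions similar. The exercise does not use it.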

## Exercise 2

#### Task 1
##### Use the `LabelBinarizer` class from the `preprocessing` module to convert the categorical labels in `y_test` to binary target values.

#### Result:

In [None]:
# Initiate the Label Binarizer.
lb = preprocessing.LabelBinarizer()

# Convert y_test to binary integer format.
y_test = lb.fit_transform(y_test)
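
For two classes, label binarization amounts to sorting the class names and mapping the first to 0 and the second to 1. The sketch below reproduces that mapping by hand. `binarize_labels` is a hypothetical helper, and unlike `LabelBinarizer` (which returns a column array) it returns a flat list.

```python
def binarize_labels(labels):
    """Map two string classes to 0/1, with the classes in sorted order."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    return classes, [classes.index(label) for label in labels]

# Hypothetical labels, not taken from the movie reviews.
toy_labels = ["positive", "negative", "positive"]
classes, encoded = binarize_labels(toy_labels)
print(classes)
print(encoded)
```

Because "negative" sorts before "positive", negative reviews become 0 and positive reviews become 1.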

#### Task 2
##### Build the logistic regression model and save it as `log_model` variable, then inspect it.

#### Result:

In [None]:
# Set up logistic regression model.
log_model = LogisticRegression(solver='liblinear')
print(log_model)

#### Task 3
##### Fit the model to `X_train` and `y_train` data.

#### Result:

In [None]:
# Fit the model.
log_model = log_model.fit(X=X_train, y=y_train)

#### Task 4 
##### Use the model and predict on the test dataset.
##### Save the predictions to `y_pred` variable.
##### Convert the categorical `y_pred` values to binary values using Label Binarizer.
##### Print the first 5 values of `y_pred`.

#### Result:

In [None]:
# Predict on test data.
y_pred = log_model.predict(X_test)
print(y_pred)

# Convert y_pred to binary integer format, reusing the encoding already fitted on y_test.
y_pred = lb.transform(y_pred)
print(y_pred[:5])

## Exercise 3

#### Task 1
##### Print the confusion matrix and accuracy on the test data.
##### Interpret the results.

#### Result:

In [None]:
# Take a look at test data confusion matrix.
conf_matrix_test = metrics.confusion_matrix(y_test, y_pred)
print(conf_matrix_test)

# Compute test model accuracy score.
test_accuracy_score = metrics.accuracy_score(y_test, y_pred)
print("Accuracy on test data: ", test_accuracy_score)

# The model predicts the sentiment of the reviews with about 90% accuracy.
# It predicts the positive reviews better than the negative reviews.
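
To see where the numbers in a confusion matrix come from, the sketch below counts them by hand for a handful of made-up 0/1 labels. It follows the same layout as `metrics.confusion_matrix` (rows are actual classes, columns are predicted classes), but `confusion_counts` is a hypothetical helper and the data is invented.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for 0/1 labels (rows: actual, columns: predicted)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Hypothetical actual and predicted labels (1 = positive, 0 = negative).
toy_true = [1, 1, 1, 1, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0]

toy_matrix = confusion_counts(toy_true, toy_pred)
# Accuracy is the share of observations on the diagonal.
toy_accuracy = (toy_matrix[0][0] + toy_matrix[1][1]) / len(toy_true)

print(toy_matrix)
print(toy_accuracy)
```

Comparing the two rows shows whether errors concentrate in one class, which accuracy alone does not reveal.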

#### Task 2
##### Print the classification report, supplying names for the target classes.

#### Result:

In [None]:
# Create a list of target names to interpret class assignments.
target_names = ['Negative', 'Positive']

# Print an entire classification report.
class_report = metrics.classification_report(y_test, y_pred, target_names = target_names)
print(class_report)
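
The precision, recall and F1 columns of a classification report are simple ratios of confusion matrix counts. The sketch below computes them for one class from made-up counts. `precision_recall_f1` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical counts for the positive class.
precision, recall, f1 = precision_recall_f1(tp=3, fp=1, fn=1)
print(precision, recall, f1)
```

Precision answers "of the reviews predicted positive, how many were positive?", while recall answers "of the positive reviews, how many did the model find?".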

#### Task 3
##### Print the probabilities of classifying the reviews as positive/negative.



#### Result:

In [None]:
# Get probabilities instead of predicted values.
test_probabilities = log_model.predict_proba(X_test)
print(test_probabilities[0:5, :])

# Keep only the probability of the positive class (second column).
test_predictions = test_probabilities[:, 1]
print(test_predictions[0:5])

#### Task 4
##### Get TPR, FPR and threshold values.
##### Inspect the results.

#### Result:

In [None]:
# Get FPR, TPR and threshold values.
fpr, tpr, threshold = metrics.roc_curve(y_test, test_predictions)
print("False positive: ", fpr)
print("True positive: ", tpr)
print("Threshold: ", threshold)
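
Each point on an ROC curve is the (FPR, TPR) pair obtained at one threshold. The sketch below computes a few such points by hand on made-up scores, treating a score at or above the threshold as a positive prediction. `roc_point` is a hypothetical helper and the data is invented.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

# Hypothetical labels (1 = positive) and predicted probabilities.
toy_true = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]

for toy_threshold in (0.85, 0.5, 0.3):
    print(toy_threshold, roc_point(toy_true, toy_scores, toy_threshold))
```

Lowering the threshold can only keep or raise both rates, which is why the curve moves from the bottom left towards the top right.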

#### Task 5
##### Compute the AUC and print it.
##### Plot the ROC curve.
##### Interpret the results.
##### Why do you think we have such results?
##### What could we do to improve our model?

#### Result:

In [None]:
# Get AUC by providing the FPR and TPR, so it matches the plotted curve.
auc = metrics.auc(fpr, tpr)
print("Area under the ROC curve: ", auc)

# Make an ROC curve plot.
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

# The model is fairly poor, for a few reasons:
# - Small dataset with relatively few datapoints
# - Unbalanced dataset with the majority of observations being positive and few negative
# - Untuned model

# We could improve our model by:
# - Getting more datapoints (i.e. scraping more reviews or getting our hands on a movie review database)
# - Removing some positive reviews or adding more negative reviews (or even generating some negative reviews!)
# - Tuning the model
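
The area under an ROC curve can be approximated from its (FPR, TPR) points with the trapezoidal rule, which is also what `metrics.auc` applies. The sketch below does this by hand on made-up points. `trapezoid_auc` is a hypothetical helper and the points are invented.

```python
def trapezoid_auc(fpr, tpr):
    """Approximate the area under a curve given by (fpr, tpr) points."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2
        area += width * mean_height
    return area

# Hypothetical ROC points, ordered by increasing FPR.
toy_fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
toy_tpr = [0.0, 2 / 3, 2 / 3, 1.0, 1.0]

toy_auc = trapezoid_auc(toy_fpr, toy_tpr)
print(toy_auc)
```

An area of 0.5 corresponds to the dashed diagonal in the plot (random guessing), and values closer to 1 indicate better separation of the two classes.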