# Assignment 2: Text Classifier
### Conrad Lee

#### Goal:
- Train a text classifier that categorizes movie reviews as "positive" or "negative".

#### Steps:
- Step 1: Read all the text files from aclImdb/train/pos/ and aclImdb/train/neg/ into a pandas DataFrame called train with two columns: review (the text itself) and sentiment (positive or negative). Do the same thing with aclImdb/test and save it to a different DataFrame. If the dataset is too large for your machine, use a sample. (Hint: Use the glob module) (2 points)
- Step 2: Preprocess the text using CountVectorizer or TfidfVectorizer from scikit-learn, adding a preprocessor that removes spurious <br /> tags from the text, or alternatively removing them from the dataframes beforehand (1 point)
- Step 3: Apply the Vectorizer to the train data, fit a logistic regression, and compute the accuracy score (1 point)
- Step 4: Concatenate the train and test data, and use GridSearchCV to optimize the C hyperparameter of the logistic regression (1 point)

#### Sources:
- I forked and drew on Jitendra Reddy's excellent GitHub repository, "Movie Review Classifier", primarily for PyPrind (the progress indicator) and the initial data loading loop.

Link: https://github.com/g10draw/movie_review_classifier

#### Step 1:

In [436]:
# Import libraries necessary for this project
import pandas as pd
import numpy as np
import glob

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

import pyprind  # Progress indicator: shows progress and estimated time until completion

#### glob
- glob(pathname, *, recursive=False)
- Return a list of paths matching a pathname pattern.
- The pattern may contain simple shell-style wildcards a la fnmatch. However, unlike fnmatch, filenames starting with a dot are special cases that are not matched by '*' and '?' patterns.
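
As a small illustration of the wildcard and dot-file behavior described above, the sketch below builds a throwaway directory and matches it with `*.txt` and `*`. The file names are made up for the example and are not part of the IMDb data.

```python
import glob
import os
import tempfile

# Build a temporary directory with made-up file names (not the IMDb data).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("0_9.txt", "1_7.txt", "notes.md", ".hidden.txt"):
        with open(os.path.join(tmp, name), "w", encoding="utf8") as f:
            f.write("placeholder")

    # '*.txt' matches the regular .txt files but skips the dot-file.
    txt_files = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(tmp, "*.txt"))
    )

    # '*' matches every regular file, again without the dot-file.
    all_files = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(tmp, "*"))
    )

print(txt_files)
print(all_files)
```

The results are sorted here only to make the printout stable; `glob.glob` itself returns paths in arbitrary order.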

In [694]:
num_data = pd.DataFrame()
num_data.loc["pos", "train"] = len(glob.glob("./data/aclImdb/train/pos/*.txt"))
num_data.loc["neg", "train"] = len(glob.glob("./data/aclImdb/train/neg/*.txt"))
num_data.loc["pos", "test"] = len(glob.glob("./data/aclImdb/test/pos/*.txt"))
num_data.loc["neg", "test"] = len(glob.glob("./data/aclImdb/test/neg/*.txt"))
num_data

|     | train   | test    |
|-----|---------|---------|
| pos | 12500.0 | 12500.0 |
| neg | 12500.0 | 12500.0 |


There are a total of 50,000 records in the train and test data sets. Half are positive and half are negative.

#### Sampling
- Let's decide how many records we want to sample from the total:

In [700]:
sample = 10000
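
One hypothetical way to turn `sample` into a reproducible, evenly balanced selection is to sort each folder's paths and draw the same number from each with a fixed seed. The helper name `pick_files`, the seed, and the stand-in paths are assumptions made for this sketch, not part of the original notebook.

```python
import random


def pick_files(paths, k, seed=0):
    """Return k paths chosen at random, reproducibly, from a list of paths."""
    ordered = sorted(paths)    # glob returns paths in arbitrary order
    rng = random.Random(seed)  # fixed seed so reruns pick the same files
    return rng.sample(ordered, min(k, len(ordered)))


sample = 10000
per_folder = sample // 4  # train/pos, train/neg, test/pos, test/neg

# Stand-in paths; in the notebook these would come from glob.glob(...).
fake_paths = ["./data/aclImdb/train/pos/%d_8.txt" % n for n in range(12500)]
chosen = pick_files(fake_paths, per_folder)
print(len(chosen))
```

Sorting before sampling means the selection depends only on the seed and the file names, not on the order in which the file system lists them.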

#### PyPrind (Python Progress Indicator)
- ProgPercent(iterations, track_time=True, stream=2, title='', monitor=False, update_interval=None)
- Initializes a progress bar object that displays the progress of an iterative computation on standard output.
- iterations: the number of iterations in the computation.

In [696]:
# Setup PyPrind
pper = pyprind.ProgPercent(sample)

In [697]:
# Create labels for positive and negative
labels = {"pos": 1, "neg": 0}

In [701]:
# Load the reviews into a DataFrame
per_folder = sample // 4  # equal share from each of the four folders
rows = []

for i in ("train", "test"):
    for j in ("pos", "neg"):
        path = "./data/aclImdb/%s/%s/*.txt" % (i, j)
        for file in glob.glob(path)[:per_folder]:
            with open(file, "r", encoding="utf8") as infile:
                rows.append([infile.read(), labels[j]])
            pper.update()

# Build the frame once; DataFrame.append was removed in pandas 2.0
dataframe = pd.DataFrame(rows, columns=["review", "sentiment"])

[100 %] Time elapsed: 00:02:08 | ETA: 00:00:00
Total time elapsed: 00:02:08


In [702]:
# We'll only use 10,000 records (out of the 50,000 total records)
len(dataframe)

10000

In [704]:
# Save the train split, the test split, and the combined data as CSV files.
# Rows 0-4999 come from aclImdb/train and rows 5000-9999 from aclImdb/test.
dataframe[0:5000].to_csv("./movie_reviews_train_data.csv", index=False)
dataframe[5000:10000].to_csv("./movie_reviews_test_data.csv", index=False)
dataframe.to_csv("./movie_reviews_data.csv", index=False)

In [705]:
# Read the dataset back into a data variable
# Note: This is just in case we use the full dataset, and want to avoid re-importing all the data again
data = pd.read_csv("movie_reviews_data.csv")
train = pd.read_csv("movie_reviews_train_data.csv")
test = pd.read_csv("movie_reviews_test_data.csv")

#### Step 2:

#### CountVectorizer:
- Convert a collection of text documents to a matrix of token counts
- This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
- If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data.
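
To make the "matrix of token counts" concrete, here is a plain-Python sketch of the idea on two made-up reviews. It is not scikit-learn's implementation; it only mimics the default behavior described above (lowercasing, tokens of two or more word characters, one column per vocabulary term).

```python
import re
from collections import Counter

# Default token pattern shown in the CountVectorizer signature.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

docs = ["A great great movie", "a terrible movie"]  # made-up reviews

# Tokenize each document (lowercased, tokens of 2+ word characters).
tokenized = [TOKEN.findall(d.lower()) for d in docs]

# The vocabulary is every distinct token, sorted alphabetically.
vocab = sorted({tok for toks in tokenized for tok in toks})

# One row per document, one column per vocabulary term.
counts = [[Counter(toks)[term] for term in vocab] for toks in tokenized]

print(vocab)
print(counts)
```

The single-character token "a" never enters the vocabulary because the pattern requires at least two word characters.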

#### Parameters:
- CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

##### preprocessor : callable or None (default)
- Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
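
Step 2 asks for a preprocessor that removes the spurious `<br />` tags. A minimal sketch of such a callable follows; the name `strip_br` and the exact regular expression are assumptions of this example. Because a custom preprocessor overrides the vectorizer's own preprocessing stage, the sketch also lowercases the text itself.

```python
import re

# Matches <br>, <br/>, <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def strip_br(text):
    """Replace <br /> tags with a space and lowercase the text."""
    # Lowercasing is done here on the assumption that a custom
    # preprocessor takes over the vectorizer's own lowercasing step.
    return BR_TAG.sub(" ", text).lower()


cleaned = strip_br("Loved it.<br /><br />Would watch again!")
print(cleaned)
```

The function could then be passed to the vectorizer as `CountVectorizer(preprocessor=strip_br)`. Replacing each tag with a space rather than an empty string keeps the words on either side of the tag from being glued together.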

##### tokenizer : callable or None (default)
- Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.

##### stop_words : string {'english'}, list, or None (default)
- If 'english', a built-in stop word list for English is used. There are several known issues with 'english' and you should consider an alternative (see "Using stop words" in the scikit-learn user guide).
- If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.
- If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0] to automatically detect and filter stop words based on intra corpus document frequency of terms.

##### max_df : float in range [0.0, 1.0] or int, default=1.0
- When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

##### min_df : float in range [0.0, 1.0] or int, default=1
- When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
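
The effect of the two thresholds can be sketched in plain Python on a made-up corpus. This illustrates the rule described above (drop terms whose document frequency is strictly above `max_df` or strictly below `min_df`); it is not scikit-learn's code, and the threshold values are illustrative.

```python
docs = [
    {"the", "movie", "great"},
    {"the", "movie", "boring"},
    {"the", "plot", "great"},
    {"the", "acting", "great"},
]  # made-up documents, already reduced to their sets of distinct terms

min_df, max_df = 0.5, 0.99  # proportions of documents (illustrative values)
n_docs = len(docs)

# Document frequency: the share of documents that contain each term.
vocab = sorted(set().union(*docs))
doc_freq = {t: sum(t in d for d in docs) / n_docs for t in vocab}

# Keep terms whose document frequency lies within [min_df, max_df].
kept = [t for t in vocab if min_df <= doc_freq[t] <= max_df]

print(doc_freq)
print(kept)
```

Here "the" is dropped for appearing in every document, and the terms that occur in only one of the four documents fall below `min_df`.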


#### Methods:

##### get_feature_names_out()
- Array mapping from feature integer indices to feature names. It replaces get_feature_names(), which was removed in scikit-learn 1.2.

##### fit_transform(raw_documents[, y])
- Learn the vocabulary dictionary and return term-document matrix.

In [722]:
# vectorizer = CountVectorizer()
# vectorizer = CountVectorizer(stop_words='english')
vectorizer = CountVectorizer(stop_words="english", min_df=0.01, max_df=0.99)

In [723]:
# Learn the vocabulary and build the document-term count matrix for all 10,000 reviews
total = vectorizer.fit_transform(data.review)

In [724]:
# X holds the count vectors for the train rows (0-4999)
# Y holds the count vectors for the test rows (5000-9999); it is a feature matrix, not a label vector
X = total[0:5000]
Y = total[5000:10000]

#### The dictionary length varies based on the parameters we give CountVectorizer:
- CountVectorizer() --> 49,444 words
- CountVectorizer(stop_words='english') --> 49,133 words
- CountVectorizer(stop_words='english', min_df=0.01, max_df=0.99) --> 1,542 words

In [725]:
# Count the number of terms in the vocabulary
len(vectorizer.get_feature_names_out())

1542

In [726]:
# Print out the vocabulary
print(vectorizer.get_feature_names_out())



In [727]:
print(total.toarray())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [2 0 0 ... 1 0 0]]


#### Step 3:

#### LogisticRegression:
- LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='warn', max_iter=100, multi_class='warn', verbose=0, warm_start=False, n_jobs=None)

##### fit(X, y[, sample_weight])
- Fit the model according to the given training data.

##### predict(X)
- Predict class labels for samples in X.

##### predict_proba(X)
- Probability estimates.

##### score(X, y[, sample_weight])
- Returns the mean accuracy on the given test data and labels.

##### C : float, default: 1.0
- Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
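
For reference, with labels recoded as $y_i \in \{-1, 1\}$, the L2-penalized objective that C controls is commonly written as follows. The symbols $w$, $c$ and $x_i$ (weights, intercept, and the feature vector of review $i$) are notation introduced here, not taken from the notebook.

$$
\min_{w,\,c}\; \frac{1}{2}\, w^\top w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^\top w + c\right)\right)\right)
$$

C multiplies the data-fit term, so a small C lets the penalty $\frac{1}{2} w^\top w$ dominate and shrinks the weights. This is why smaller values of C mean stronger regularization.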

In [729]:
# Fit the logistic regression model on the train count matrix and the train sentiment labels
clf1 = LogisticRegression(solver="lbfgs", max_iter=500).fit(X, train.sentiment)

In [730]:
# Predict the results for the test data
y_pred1 = clf1.predict(Y)

In [731]:
# These are the predictions for the test set (based on the logistic regression model)
y_pred1

array([1, 1, 1, ..., 0, 0, 0], dtype=int64)

In [732]:
# These are the actual values for the test set
y_true1 = test.sentiment.values
y_true1

array([1, 1, 1, ..., 0, 0, 0], dtype=int64)

In [744]:
# Accuracy score
clf1.score(Y.toarray(), test.sentiment)

0.8186

#### Confusion Matrix
- confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)

In [754]:
# Create a confusion matrix
conf1 = confusion_matrix(y_true1, y_pred1)

In [755]:
# Retrieve the individual TN, FP, FN, and TP
tn1, fp1, fn1, tp1 = confusion_matrix(y_true1, y_pred1).ravel()

In [756]:
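# Let's turn the confusion matrix into a readable DataFrame.
# Each column is a true class, so it holds [predicted pos, predicted neg].
results = pd.DataFrame({"true/pos": [tp1, fn1], "true/neg": [fp1, tn1]})
results.rename(index={0: "pred/pos", 1: "pred/neg"}, inplace=True)
results

|          | true/pos | true/neg |
|----------|----------|----------|
| pred/pos | 2022     | 429      |
| pred/neg | 478      | 2071     |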
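
As a sanity check, the accuracy reported by `score` can be recomputed from the four counts unpacked by `ravel()` (in the order tn, fp, fn, tp). The four values are the ones shown in the table above; which of them is fp and which is fn follows from each class having 2,500 test reviews. Precision and recall are added here for illustration and are not part of the original analysis.

```python
# Counts from the first model's confusion matrix (tn, fp, fn, tp order).
tn, fp, fn, tp = 2071, 429, 478, 2022

total = tn + fp + fn + tp
accuracy = (tp + tn) / total  # share of all reviews classified correctly
precision = tp / (tp + fp)    # share of predicted positives that are truly positive
recall = tp / (tp + fn)       # share of true positives that were found

print(total, accuracy)
print(round(precision, 4), round(recall, 4))
```

The recomputed accuracy matches the 0.8186 returned by `clf1.score`.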


#### Let's try using LogisticRegressionCV instead:

In [753]:
# Run the LogisticRegressionCV model
clf2 = LogisticRegressionCV(cv=5, solver="lbfgs", max_iter=500).fit(X, train.sentiment)
y_pred2 = clf2.predict(Y)
y_true2 = test.sentiment.values
clf2.score(Y.toarray(), test.sentiment)

0.847

In [757]:
# Confusion matrix for LogisticRegressionCV
conf2 = confusion_matrix(y_true2, y_pred2)
tn2, fp2, fn2, tp2 = conf2.ravel()
# Each column is a true class, so it holds [predicted pos, predicted neg].
results2 = pd.DataFrame({"true/pos": [tp2, fn2], "true/neg": [fp2, tn2]})
results2.rename(index={0: "pred/pos", 1: "pred/neg"}, inplace=True)
results2

|          | true/pos | true/neg |
|----------|----------|----------|
| pred/pos | 2133     | 398      |
| pred/neg | 367      | 2102     |


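LogisticRegressionCV, which tunes C by 5-fold cross-validation on the training rows, performs somewhat better than the model with the default C = 1.0: an accuracy of 0.847 versus 0.8186 on the test rows.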

#### Step 4:

#### GridSearchCV
- Exhaustive search over specified parameter values for an estimator.
- GridSearchCV implements a “fit” and a “score” method. It also implements “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.

#### Parameters:
- GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=None, iid='warn', refit=True, cv='warn', verbose=0, pre_dispatch='2*n_jobs', error_score='raise-deprecating', return_train_score='warn')
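
A minimal sketch of how such a search can be wired up, using a tiny made-up corpus rather than the IMDb data. Putting the vectorizer and the classifier in a `Pipeline` is a design choice of this sketch, not of the original notebook: it lets each cross-validation fold learn its vocabulary from its own training part only. The step names `vec` and `clf` and the candidate values for C are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 1 = positive, 0 = negative.
reviews = [
    "great movie loved it", "wonderful acting great plot",
    "loved the story", "great fun wonderful film",
    "terrible movie hated it", "boring plot awful acting",
    "hated the story", "awful boring terrible film",
]
sentiment = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=500)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

# Every candidate is scored by 2-fold cross-validation, then the best
# one is refit on all the data (refit=True is the default).
grid = GridSearchCV(pipe, param_grid, cv=2)
grid.fit(reviews, sentiment)

print(grid.best_params_)
print(grid.best_score_)
```

With so few documents the selected value carries no meaning; the point is only the mechanics of `param_grid`, `cv`, and `best_params_`.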

In [758]:
# Base estimator for the search; max_iter is raised so that 'lbfgs' converges.
# GridSearchCV clones and refits it for every candidate, so no fit is needed here.
clf_grid = LogisticRegression(solver="lbfgs", max_iter=500)

First, let's try each integer value of C from 1 to 9:

In [759]:
parameters = {"C": np.arange(1, 10, 1)}
grid = GridSearchCV(clf_grid, parameters, cv=5)
grid.fit(total, data.sentiment)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=500, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'C': array([1, 2, 3, 4, 5, 6, 7, 8, 9])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [760]:
topC = grid.best_params_["C"]
print(topC)

1


In [762]:
# Run the LogisticRegression model with the optimized hyperparameter C:
clf3 = LogisticRegression(solver="lbfgs", C=topC, max_iter=500).fit(X, train.sentiment)
y_pred3 = clf3.predict(Y)
y_true3 = test.sentiment.values
clf3.score(Y.toarray(), test.sentiment)

0.8186

Next, let's try values of C from 0.1 to 0.9, in increments of 0.1:

In [763]:
parameters = {"C": np.arange(0.1, 1, 0.1)}
grid = GridSearchCV(clf_grid, parameters, cv=5)
grid.fit(total, data.sentiment)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=500, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'C': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [764]:
topC = grid.best_params_["C"]
print(topC)

0.1


In [765]:
# Run the LogisticRegression model with the optimized hyperparameter C:
clf3 = LogisticRegression(solver="lbfgs", C=topC, max_iter=500).fit(X, train.sentiment)
y_pred3 = clf3.predict(Y)
y_true3 = test.sentiment.values
clf3.score(Y.toarray(), test.sentiment)

0.8424

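Among the values tried, C = 0.1 gives the best test accuracy (0.8424, up from 0.8186 at C = 1). Because 0.1 is the smallest value in the grid, the best C may be smaller still.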
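
Since 0.1 sits at the lower edge of the grid tried above, a follow-up search might use candidates spaced by powers of ten rather than evenly spaced steps, which is a common way to explore regularization strength. The range below is an arbitrary illustration, not a result from this notebook.

```python
# Candidate values for C, one per power of ten from 1e-3 to 1e2.
candidates = [10.0 ** k for k in range(-3, 3)]

# Same dictionary shape that GridSearchCV takes as param_grid.
parameters = {"C": candidates}
print(parameters)
```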