# Project 3
# DS 501 - Introduction to Data Science
# Group 3&5

___

# Problem 1 (20 points): Complete Exercise 2: Sentiment Analysis on Movie Reviews from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

## Part 1; Problem 1: Downloading Data
Modify the solution to Exercise 2 so that it runs in this IPython notebook.
* This will likely involve moving data files around and/or making small modifications to the script.

### Download Data Script

In [6]:
"""Script to download the movie review dataset"""

from pathlib import Path
from hashlib import sha256
import tarfile
from urllib.request import urlopen


URL = "http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz"

ARCHIVE_SHA256 = "fc0dccc2671af5db3c5d8f81f77a1ebfec953ecdd422334062df61ede36b2179"
ARCHIVE_NAME = Path(URL.rsplit("/", 1)[1])
DATA_FOLDER = Path("txt_sentoken")


if not DATA_FOLDER.exists():

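    if not ARCHIVE_NAME.exists():
        print("Downloading dataset from %s (3 MB)" % URL)
        # Close the HTTP connection as soon as the download finishes
        with urlopen(URL) as response, open(ARCHIVE_NAME, "wb") as archive:
            archive.write(response.read())

    try:
        print("Checking the integrity of the archive")
        # Raise explicitly: a bare assert is skipped when Python runs with -O
        if sha256(ARCHIVE_NAME.read_bytes()).hexdigest() != ARCHIVE_SHA256:
            raise ValueError("Checksum mismatch for %s" % ARCHIVE_NAME)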

        print("Decompressing %s" % ARCHIVE_NAME)
        with tarfile.open(ARCHIVE_NAME, "r:gz") as archive:
            archive.extractall(path=".")

    finally:
        ARCHIVE_NAME.unlink()

## Part 2; Problem 1: Sentiment Analysis

### Imports

In [7]:
import sys
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn import metrics


### Check the number of samples and that the download script ran properly

In [8]:

# the training data folder is the one extracted by the download script above
movie_reviews_data_folder = DATA_FOLDER
dataset = load_files(movie_reviews_data_folder, shuffle=False)
print("n_samples: %d" % len(dataset.data))



n_samples: 2000


### Sentiment Analysis

In [22]:

# split the dataset in training and test set:
docs_train, docs_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.25, random_state=None)

# TASK: Build a vectorizer / classifier pipeline that filters out tokens
# that are too rare or too frequent
pipeline = Pipeline([
    ('vect', TfidfVectorizer(min_df=3, max_df=0.95)),
    ('clf', LinearSVC(C=5000)),
])

# TASK: Build a grid search to find out whether unigrams or bigrams are
# more useful.
# Fit the pipeline on the training set using grid search for the parameters
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1)
grid_search.fit(docs_train, y_train)

# TASK: print the mean and std for each candidate along with the parameter
# settings for all the candidates explored by grid search.
n_candidates = len(grid_search.cv_results_['params'])
for i in range(n_candidates):
    print(i, 'params - %s; mean - %0.2f; std - %0.2f'
          % (grid_search.cv_results_['params'][i],
             grid_search.cv_results_['mean_test_score'][i],
             grid_search.cv_results_['std_test_score'][i]))

# TASK: Predict the outcome on the testing set and store it in a variable
# named y_predicted
y_predicted = grid_search.predict(docs_test)

# Print the classification report
print(metrics.classification_report(y_test, y_predicted,
                                    target_names=dataset.target_names))

# Print and plot the confusion matrix
cm = metrics.confusion_matrix(y_test, y_predicted)
print(cm)

# import matplotlib.pyplot as plt
# plt.matshow(cm)
# plt.show()

0 params - {'vect__ngram_range': (1, 1)}; mean - 0.86; std - 0.02
1 params - {'vect__ngram_range': (1, 2)}; mean - 0.87; std - 0.02
              precision    recall  f1-score   support

         neg       0.86      0.85      0.85       239
         pos       0.86      0.87      0.87       261

    accuracy                           0.86       500
   macro avg       0.86      0.86      0.86       500
weighted avg       0.86      0.86      0.86       500

[[203  36]
 [ 33 228]]


___
# Problem 2 (20 points): Explore the Scikit-learn TfidfVectorizer Class

**Read the documentation for the TfidfVectorizer class at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.** 


## Part 1; Problem 2:
 Define the term frequency–inverse document frequency (TF-IDF) statistic. The Wikipedia article (https://en.wikipedia.org/wiki/Tf%E2%80%93idf) will likely help.

### Definition of TF-IDF

TF-IDF stands for "term frequency-inverse document frequency." It measures how important a word is to one document relative to a whole collection of documents.

Let's say you have a book about cats sitting in a library of books on other subjects. In this book, the word "cat" is used a lot because it's the main topic. The word "dog" is used only a few times because it's not really related to the subject.

TF-IDF combines two quantities: the number of times a word appears in a document (the "term frequency") and how rare that word is across all the documents in the collection (the "inverse document frequency").

So, in the cat book, "cat" gets a high TF-IDF score because it appears frequently in this book and rarely in the others. "Dog" gets a low score because it appears infrequently here. A word like "the" also gets a low score even though it appears very often, because it occurs in nearly every book and so tells us little about this one.

Basically, TF-IDF highlights the words that are both prominent in a document and distinctive to it.
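
To make the definition concrete, here is a minimal sketch that scores words by hand on a made-up three-document corpus. It assumes the smoothed IDF variant, `ln((1 + n) / (1 + df)) + 1`, and raw counts for the term frequency. The corpus and the helper name `tf_idf` are hypothetical, and the L2 row normalization that TfidfVectorizer applies by default is left out to keep the arithmetic visible.

```python
import math
from collections import Counter

# Hypothetical toy corpus: one "cat" document and two unrelated ones.
corpus = [
    "the cat sat on the mat with another cat",
    "the dog chased the ball",
    "the bird sang in the tree",
]


def tf_idf(term, doc_tokens, tokenized_corpus):
    """Raw term count multiplied by a smoothed inverse document frequency."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for tokens in tokenized_corpus if term in tokens)
    n_docs = len(tokenized_corpus)
    idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF (assumption)
    return tf * idf


tokenized = [doc.split() for doc in corpus]

# "cat" and "the" both occur twice in the first document,
# but "the" occurs in every document and "cat" in only one.
cat_score = tf_idf("cat", tokenized[0], tokenized)
the_score = tf_idf("the", tokenized[0], tokenized)

print("cat: %.3f" % cat_score)
print("the: %.3f" % the_score)
```

Both words have the same term frequency in the first document, so the difference in their scores comes entirely from the inverse document frequency.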

## Part 2; Problem 2:
 Run the TfidfVectorizer class on the training data above (docs_train).


In [10]:
import pandas as pd

vectorizer = TfidfVectorizer(min_df=0.2, max_df=0.95)


vectors = vectorizer.fit_transform(docs_train)
feature_names = vectorizer.get_feature_names_out()
# Convert the sparse matrix to a dense array so it can be shown as a DataFrame
df = pd.DataFrame(vectors.toarray(), columns=feature_names)

print(f"Output Shape: {df.shape}")
df.head()


Output Shape: (1500, 251)


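| | about | acting | action | actor | actors | actually | after | again | all | almost | ... | without | work | world | would | year | years | yet | you | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.151729 | 0.0 | 0.0 | 0.0 | 0.053878 | 0.0 | 0.037662 | 0.054293 | 0.084233 | 0.0 | ... | 0.0 | 0.0 | 0.050958 | 0.0 | 0.05065 | 0.100791 | 0.0 | 0.121046 | 0.0 | 0.104689 |
| 1 | 0.105541 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.039701 | 0.069282 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.073903 | 0.237542 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.042407 | 0.222015 | 0.143885 | 0.0 | 0.150585 | 0.0 | 0.0 | 0.0 | 0.117712 | 0.0 | ... | 0.0 | 0.065406 | 0.071211 | 0.0 | 0.07078 | 0.0 | 0.0 | 0.042289 | 0.0 | 0.0 |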


## Part 3; Problem 2:
- Explore the min_df and max_df parameters of TfidfVectorizer.
- What do they mean? How do they change the features you get?


##### Documentation Definition

**min_df: float or int, default=1**
- When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float in range of [0.0, 1.0], the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**max_df: float or int, default=1.0**
- When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float in range [0.0, 1.0], the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
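
The two thresholds quoted above can be illustrated with a small pure-Python sketch. It is not scikit-learn's implementation: `filter_vocabulary` and the four toy reviews are hypothetical, and tokenization is a plain whitespace split. It only mirrors the documented rule that floats are proportions of documents and integers are absolute counts.

```python
def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies between min_df and max_df.

    Floats are read as a proportion of documents, ints as absolute counts.
    """
    tokenized = [set(doc.split()) for doc in docs]
    n_docs = len(tokenized)
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df

    # Count in how many documents each term appears.
    doc_freq = {}
    for tokens in tokenized:
        for term in tokens:
            doc_freq[term] = doc_freq.get(term, 0) + 1

    return sorted(term for term, df in doc_freq.items() if low <= df <= high)


# Hypothetical toy reviews
toy_docs = [
    "the movie was great",
    "the movie was dull",
    "the plot was great",
    "the acting felt wooden",
]

full_vocab = filter_vocabulary(toy_docs)
trimmed_vocab = filter_vocabulary(toy_docs, min_df=2, max_df=0.75)

print(len(full_vocab), "terms with the defaults")
print(trimmed_vocab)
```

With `min_df=2`, words that occur in a single review are dropped. With `max_df=0.75`, "the" is dropped because it occurs in all four reviews. Raising `min_df` or lowering `max_df` therefore shrinks the feature set from both ends.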

Explanation:

## Part 4; Problem 2:
Explore the ngram_range parameter of TfidfVectorizer. What does it mean? How does it change the features you get? (Note, large values of ngram_range may take a long time to run!)

**ngram_range: tuple (min_n, max_n), default=(1, 1)**
- The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.
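
As a rough illustration of the definition above, the sketch below lists word n-grams for a single made-up review. The helper `extract_ngrams` is hypothetical and uses a plain whitespace split, so it is simpler than the tokenizer TfidfVectorizer uses.

```python
def extract_ngrams(text, ngram_range=(1, 1)):
    """Return all word n-grams with min_n <= n <= max_n, in order of n."""
    tokens = text.split()
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the text.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


review = "this movie is not good"  # hypothetical review text

unigrams = extract_ngrams(review, (1, 1))
unigrams_and_bigrams = extract_ngrams(review, (1, 2))
bigrams_only = extract_ngrams(review, (2, 2))

print(unigrams)
print(bigrams_only)
print(len(unigrams_and_bigrams), "features for ngram_range=(1, 2)")
```

The bigram "not good" captures a negation that the separate unigrams "not" and "good" cannot. The cost is a larger vocabulary, which grows quickly as max_n increases.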

#### Answer:

___
# Problem 3 (20 points): Machine Learning Algorithms

* Based upon Problem 2, pick some parameters for TfidfVectorizer
    * "fit" your TfidfVectorizer using docs_train
    * Compute "Xtrain", a Tf-idf-weighted document-term matrix using the transform function on docs_train
    * Compute "Xtest", a Tf-idf-weighted document-term matrix using the transform function on docs_test
    * Note, be sure to use the same Tf-idf-weighted class (**"fit" using docs_train**) to transform **both** docs_test and docs_train
* Examine two classifiers provided by scikit-learn 
    * LinearSVC
    * KNeighborsClassifier
    * Try a number of different parameter settings for each and judge your performance using a confusion matrix (see Problem 1 for an example).
* Does one classifier, or one set of parameters work better?
    * Why do you think it might be working better?
* For a particular choice of parameters and classifier, look at 2 examples where the prediction was incorrect.
    * Can you conjecture on why the classifier made a mistake for this prediction?
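
The last two steps (reading a confusion matrix and pulling out wrongly predicted examples) can be sketched without any library. The helper names and the demo labels below are hypothetical. In the notebook, `metrics.confusion_matrix` plays the role of the first helper.

```python
def confusion_counts(y_true, y_pred, n_classes=2):
    """Count outcomes; rows are true labels, columns are predicted labels."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def misclassified_indices(y_true, y_pred):
    """Return the positions where the prediction differs from the truth."""
    return [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]


# Hypothetical labels: 0 = neg, 1 = pos
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 0, 1, 0]

cm_demo = confusion_counts(y_true_demo, y_pred_demo)
wrong = misclassified_indices(y_true_demo, y_pred_demo)

print(cm_demo)
print("misclassified positions:", wrong)
```

Applied to the notebook's variables, the indices returned for `y_test` and a vector of predictions could be used to look up the corresponding entries of `docs_test` and read the reviews the classifier got wrong.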

In [33]:
from sklearn.neighbors import KNeighborsClassifier

vectorizer = TfidfVectorizer(min_df=3, max_df=0.90)

pipeline1 = Pipeline([
    ('vect', vectorizer),
    ('clf', LinearSVC(C=15000)),
])

pipeline2 = Pipeline([
    ('vect', vectorizer),
    ('clf', KNeighborsClassifier(n_neighbors=3)),
])


parameters1 = {
    'vect__ngram_range': [(1, 1), (1, 3)],
    'vect__min_df': [3, 4, 5, 6],
    'vect__max_df': [0.7, 0.8, 0.9, 0.95]
}

parameters2 = {
    'vect__ngram_range': [(1, 1), (1, 3)],
    'clf__n_neighbors': [3, 12, 15, 20],
    'vect__min_df': [3, 4, 5, 6],
    'vect__max_df': [0.7, 0.8, 0.9, 0.95]
}

grid_search_linear = GridSearchCV(pipeline1, parameters1, n_jobs=-1)
grid_search_linear.fit(docs_train, y_train)

grid_search_knn = GridSearchCV(pipeline2, parameters2, n_jobs=-1)
grid_search_knn.fit(docs_train, y_train)



GridSearchCV(estimator=Pipeline(steps=[('vect',
                                        TfidfVectorizer(max_df=0.9, min_df=3)),
                                       ('clf',
                                        KNeighborsClassifier(n_neighbors=3))]),
             n_jobs=-1,
             param_grid={'clf__n_neighbors': [3, 12, 15, 20],
                         'vect__max_df': [0.7, 0.8, 0.9, 0.95],
                         'vect__min_df': [3, 4, 5, 6],
                         'vect__ngram_range': [(1, 1), (1, 3)]})

In [34]:
# Performance

def calculate_performance_stats(grid_search):
    """Print cross-validation scores and test-set metrics for a fitted grid search.

    Relies on the notebook-level docs_test, y_test and dataset variables.
    """
    # Print the mean and std of the cross-validation score for every
    # parameter combination explored by the grid search.
    n_candidates = len(grid_search.cv_results_['params'])
    for i in range(n_candidates):
        print(i, 'params - %s; mean - %0.2f; std - %0.2f'
              % (grid_search.cv_results_['params'][i],
                 grid_search.cv_results_['mean_test_score'][i],
                 grid_search.cv_results_['std_test_score'][i]))

    # Predict the outcome on the test set with the best estimator found
    y_predicted = grid_search.predict(docs_test)

    # Print the classification report
    print(metrics.classification_report(y_test, y_predicted,
                                        target_names=dataset.target_names))

    # Print the confusion matrix (rows: true labels, columns: predictions)
    cm = metrics.confusion_matrix(y_test, y_predicted)
    print(cm)

print("Performance of Linear: \n")
calculate_performance_stats(grid_search_linear)

print("Performance of KNN: \n")
calculate_performance_stats(grid_search_knn)

Performance of Linear: 

0 params - {'vect__max_df': 0.7, 'vect__min_df': 3, 'vect__ngram_range': (1, 1)}; mean - 0.85; std - 0.03
1 params - {'vect__max_df': 0.7, 'vect__min_df': 3, 'vect__ngram_range': (1, 3)}; mean - 0.87; std - 0.02
2 params - {'vect__max_df': 0.7, 'vect__min_df': 4, 'vect__ngram_range': (1, 1)}; mean - 0.85; std - 0.03
3 params - {'vect__max_df': 0.7, 'vect__min_df': 4, 'vect__ngram_range': (1, 3)}; mean - 0.87; std - 0.02
4 params - {'vect__max_df': 0.7, 'vect__min_df': 5, 'vect__ngram_range': (1, 1)}; mean - 0.85; std - 0.03
5 params - {'vect__max_df': 0.7, 'vect__min_df': 5, 'vect__ngram_range': (1, 3)}; mean - 0.87; std - 0.02
6 params - {'vect__max_df': 0.7, 'vect__min_df': 6, 'vect__ngram_range': (1, 1)}; mean - 0.84; std - 0.03
7 params - {'vect__max_df': 0.7, 'vect__min_df': 6, 'vect__ngram_range': (1, 3)}; mean - 0.87; std - 0.02
8 params - {'vect__max_df': 0.8, 'vect__min_df': 3, 'vect__ngram_range': (1, 1)}; mean - 0.85; std - 0.02
9 params - {'vect__ma



___
# Problem 4 (20 points): Open Ended Question: Finding the Right Plot

* Can you find a two-dimensional plot in which the positive and negative reviews are separated?
    * This problem is hard since you will likely have thousands of features for each review, and you will need to transform these thousands of features into just two numbers (so that you can make a 2D plot).
* Note, I was not able to find such a plot myself!
    * So, this problem is about **trying** but perhaps **not necessarily succeeding**!
* I tried two things, neither of which worked very well.
    * First, I plotted the length of the review versus the number of features we compute that are in that review.
    * Second, I used Principal Component Analysis on a subset of the features.
* Can you do better than I did!?
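
The first attempt described above (review length versus number of features present in the review) reduces each document to two numbers. A minimal sketch of that reduction follows. The helper `length_vs_feature_count`, the two demo reviews and the demo vocabulary are hypothetical, and whitespace splitting stands in for the vectorizer's tokenizer.

```python
def length_vs_feature_count(docs, vocabulary):
    """Map each document to (token count, distinct vocabulary terms present)."""
    vocab = set(vocabulary)
    points = []
    for doc in docs:
        tokens = doc.split()
        points.append((len(tokens), len(vocab & set(tokens))))
    return points


# Hypothetical reviews and vocabulary
docs_demo = [
    "a great and moving film",
    "dull plot and wooden acting throughout",
]
vocab_demo = ["great", "moving", "dull", "wooden", "acting"]

points_demo = length_vs_feature_count(docs_demo, vocab_demo)
for x, y in points_demo:
    print("length=%d, features=%d" % (x, y))
```

Each pair could then be drawn as one point in a scatter plot, coloured by its label, to check visually whether the two classes separate.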

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Create feature vectors
vectorizer = TfidfVectorizer(min_df=5,
                             max_df=0.8,
                             sublinear_tf=True,
                             use_idf=True)
train_vectors = vectorizer.fit_transform(docs_train)
test_vectors = vectorizer.transform(docs_test)


In [41]:
import time
from sklearn import svm
from sklearn.metrics import classification_report
# Perform classification with SVM, kernel=linear
classifier_linear = svm.SVC(kernel='linear')
t0 = time.time()
classifier_linear.fit(train_vectors, y_train)
t1 = time.time()
prediction_linear = classifier_linear.predict(test_vectors)
t2 = time.time()
time_linear_train = t1-t0
time_linear_predict = t2-t1
# Report the timings
print("Training time: %fs; Prediction time: %fs" % (time_linear_train, time_linear_predict))
# Keep the per-class metrics as a dict for later inspection
report = classification_report(y_test, prediction_linear, output_dict=True)


Training time: 4.529257s; Prediction time: 1.185212s
