# Case Study 4: Deep Learning

**Due Date: February 24, 2020, BEFORE the beginning of class at 11:00am**

NOTE: There are always last-minute issues submitting the case studies. DO NOT WAIT UNTIL THE LAST MINUTE!

<img src="https://mapr.com/blog/demystifying-ai-ml-dl/assets/process.png">

**TEAM Members:** Please EDIT this cell and add the names of all the team members in your team

    member 1
    
    member 2
    
    ...

**Desired outcome of the case study.**
* As in case study 3, we will look at movie reviews from the v2.0 polarity dataset, which comes from
http://www.cs.cornell.edu/people/pabo/movie-review-data.
    * It contains written reviews of movies divided into positive and negative reviews.
 
**NOTE:** This case study is, on purpose, more open-ended than the previous case studies.
* This is intended to help prepare you for the final case study which will be even more open-ended.
    
**Required Readings:** 
* This case study will be based upon the scikit-learn Python library
* Again, we will build upon the tutorial "Working With Text Data", which can be found at http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
* Read about deep learning at https://scikit-learn.org/stable/modules/neural_networks_supervised.html

**Required Python libraries:**
* Same as case study 3, except for the extra credit question.
    * Numpy (www.numpy.org) 
    * Matplotlib (matplotlib.org) 
    * Scikit-learn (scikit-learn.org) 
    * You are also welcome to use the Python Natural Language Processing Toolkit (www.nltk.org) (though it is not required).

**NOTE**
* Please don't forget to save the notebook frequently when working in IPython Notebook, otherwise the changes you made can be lost.

## Problem 1 (20 points): Load in the movie review data, create TF-IDF features, and use two of your favorite classification algorithms from scikit-learn for predicting sentiment

* This problem is, basically, already answered as part of case study 3, so it is fine to use your work from there to help answer this question.

In [1]:
# import scikit-learn

"""Build a sentiment analysis / polarity model
Sentiment analysis can be cast as a binary text classification problem,
that is fitting a linear classifier on features extracted from the text
of the user messages so as to guess whether the opinion of the author is
positive or negative.
In this example we will use a movie review dataset.
"""
# Author: Olivier Grisel <olivier.grisel@ensta.org>
# License: Simplified BSD

import sys
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn import metrics


if __name__ == "__main__":
    # NOTE: we put the following in a 'if __name__ == "__main__"' protected
    # block to be able to use a multi-core grid search that also works under
    # Windows, see: http://docs.python.org/library/multiprocessing.html#windows
    # The multiprocessing module is used as the backend of joblib.Parallel
    # that is used when n_jobs != 1 in GridSearchCV

    # The data folder is hard-coded because sys.argv does not carry a
    # user-supplied path when this code runs inside a notebook.
    movie_reviews_data_folder = 'txt_sentoken'
    dataset = load_files(movie_reviews_data_folder, shuffle=False)
    print("n_samples: %d" % len(dataset.data))

    # split the dataset in training and test set:
    docs_train, docs_test, y_train, y_test = train_test_split(
        dataset.data, dataset.target, test_size=0.25, random_state=None)

    # TASK: Build a vectorizer / classifier pipeline that filters out tokens
    # that are too rare or too frequent
    pipeline = Pipeline([
        ('vect', TfidfVectorizer(min_df=3, max_df=0.95)),
        ('clf', LinearSVC(C=1000)),
    ])

    # TASK: Build a grid search to find out whether unigrams or bigrams are
    # more useful.
    # Fit the pipeline on the training set using grid search for the parameters
    parameters = {
        'vect__ngram_range': [(1, 1), (1, 2)],
    }
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1)
    grid_search.fit(docs_train, y_train)

    # TASK: print the mean and std for each candidate along with the parameter
    # settings for all the candidates explored by grid search.
    n_candidates = len(grid_search.cv_results_['params'])
    for i in range(n_candidates):
        print(i, 'params - %s; mean - %0.2f; std - %0.2f'
                 % (grid_search.cv_results_['params'][i],
                    grid_search.cv_results_['mean_test_score'][i],
                    grid_search.cv_results_['std_test_score'][i]))

    # TASK: Predict the outcome on the testing set and store it in a variable
    # named y_predicted
    y_predicted = grid_search.predict(docs_test)

    # Print the classification report
    print(metrics.classification_report(y_test, y_predicted,
                                        target_names=dataset.target_names))

    # Print and plot the confusion matrix
    cm = metrics.confusion_matrix(y_test, y_predicted)
    print(cm)

    # import matplotlib.pyplot as plt
    # plt.matshow(cm)
    # plt.show()

n_samples: 2000




0 params - {'vect__ngram_range': (1, 1)}; mean - 0.84; std - 0.02
1 params - {'vect__ngram_range': (1, 2)}; mean - 0.85; std - 0.02
              precision    recall  f1-score   support

         neg       0.88      0.86      0.87       258
         pos       0.86      0.87      0.86       242

    accuracy                           0.87       500
   macro avg       0.87      0.87      0.87       500
weighted avg       0.87      0.87      0.87       500

[[223  35]
 [ 31 211]]


In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = docs_train

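# A high min_df (0.1) keeps only bigrams that occur in at least 10% of the
# reviews; max_df=0.9 drops bigrams that occur in more than 90% of them.
# Lower values for both would shift the vocabulary toward much rarer bigrams.
vectorizer = TfidfVectorizer(max_df=0.9, min_df=0.1, ngram_range=(2, 2))

X = vectorizer.fit_transform(corpus)

# Collect the vocabulary. get_feature_names_out() needs scikit-learn >= 1.0;
# older releases expose the same list as get_feature_names().
feature_names = [word.replace("_", "")
                 for word in vectorizer.get_feature_names_out()]
Xtrain = list(set(feature_names))

# print(feature_names)
print(len(feature_names))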

237


In [12]:
#Run two favored algorithms:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

vectorizer = TfidfVectorizer(max_df = 0.95, min_df = 0.0025)
vectorizer_fit = vectorizer.fit(docs_train)

Xtrain = vectorizer_fit.transform(docs_train)

Xtest = vectorizer_fit.transform(docs_test)

classifier_one = LinearSVC()
classifier_one.fit(Xtrain, y_train)

classifier_two = KNeighborsClassifier(n_neighbors=250, weights="distance")
classifier_two.fit(Xtrain, y_train)

prediction_one = classifier_one.predict(Xtest)
prediction_two = classifier_two.predict(Xtest)

print(metrics.confusion_matrix(y_test, prediction_one))
print(metrics.confusion_matrix(y_test, prediction_two))

[[217  41]
 [ 33 209]]
[[201  57]
 [ 64 178]]
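
Since Problem 2 asks for accuracy comparisons against these baselines, it may help to reduce each confusion matrix to a single accuracy number. The sketch below is one way to do that; the helper name `accuracy_from_confusion` is made up for illustration, and the two matrices are copied from the output printed above (LinearSVC first, KNeighborsClassifier second).

```python
import numpy as np


def accuracy_from_confusion(cm):
    """Return the fraction of correct predictions (diagonal / total)."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()


# Confusion matrices copied from the printed output above
svc_cm = [[217, 41], [33, 209]]   # LinearSVC
knn_cm = [[201, 57], [64, 178]]   # KNeighborsClassifier

print("LinearSVC accuracy: %.3f" % accuracy_from_confusion(svc_cm))
print("KNN accuracy:       %.3f" % accuracy_from_confusion(knn_cm))
```

Inside the notebook, `metrics.accuracy_score(y_test, prediction_one)` should give the same number directly from the predictions.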


## Problem 2 (20 points): Use a Multi-Layer Perceptron (MLP) for classifying the reviews.  Explore the parameters for the MLP and compare the accuracies against your baseline algorithms in Problem 1.

**Read the documentation for the MLPClassifier class at https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier.** 
* Try different values for "hidden_layer_sizes".  What do you observe in terms of accuracy?
* Try different values for "activation". What do you observe in terms of accuracy?
* Try different values for "solver". What do you observe in terms of accuracy?
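
One possible way to organize the exploration asked for above is a small loop over parameter combinations. This is only a sketch under stated assumptions: the helper `sweep_mlp` and the tiny made-up corpus are hypothetical and exist only so the example runs on its own, and the particular parameter values are arbitrary starting points, not recommendations.

```python
from itertools import product

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier


def sweep_mlp(X_train, y_train, X_test, y_test,
              layer_options, activations, solvers):
    """Fit one MLP per parameter combination; return (params, accuracy) pairs."""
    results = []
    for layers, activation, solver in product(layer_options, activations, solvers):
        clf = MLPClassifier(hidden_layer_sizes=layers, activation=activation,
                            solver=solver, max_iter=300, random_state=0)
        clf.fit(X_train, y_train)
        accuracy = accuracy_score(y_test, clf.predict(X_test))
        params = {"hidden_layer_sizes": layers,
                  "activation": activation,
                  "solver": solver}
        results.append((params, accuracy))
    return results


# Tiny made-up corpus (NOT the movie review data) so the sketch is runnable.
toy_train_docs = ["a wonderful moving film", "great acting and a great story",
                  "boring plot and terrible acting", "a dull awful mess",
                  "wonderful story and great cast", "terrible boring and dull"]
toy_train_labels = [1, 1, 0, 0, 1, 0]
toy_test_docs = ["a great and moving story", "awful acting and a boring plot"]
toy_test_labels = [1, 0]

toy_vectorizer = TfidfVectorizer()
toy_Xtrain = toy_vectorizer.fit_transform(toy_train_docs)
toy_Xtest = toy_vectorizer.transform(toy_test_docs)

results = sweep_mlp(toy_Xtrain, toy_train_labels, toy_Xtest, toy_test_labels,
                    layer_options=[(10,), (10, 10)],
                    activations=["relu", "tanh"],
                    solvers=["adam", "lbfgs"])
for params, accuracy in results:
    print(params, "accuracy = %.2f" % accuracy)
```

In the notebook, the same helper could be called with the `Xtrain`, `y_train`, `Xtest`, and `y_test` variables from Problem 1. scikit-learn may print convergence warnings for some combinations; those are themselves worth noting in your observations.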

In [18]:
from sklearn.neural_network import MLPClassifier

classifier = MLPClassifier(hidden_layer_sizes=(100,100,100))

classifier.fit(Xtrain, y_train)
y_pred = classifier.predict(Xtest)


In [19]:
print(y_pred)

[1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 0 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 0 0 1 1 1
 1 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 1 1 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0
 1 1 0 1 1 0 1 0 1 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
 1 0 0 0 1 0 1 0 0 0 0 1 1 0 0 1 1 1 0 0 0 1 0 1 0 1 1 1 1 1 1 0 0 1 1 0 0
 1 0 1 1 0 1 1 0 0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 0 0 0 1
 0 0 0 0 0 1 1 0 0 1 1 0 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1 0 1 1
 1 0 1 1 1 0 1 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 1 0 1 1 1 0 0 0 0 1 1 1 0 1 1
 1 0 1 1 1 1 1 1 1 0 0 0 1 1 0 0 1 1 1 0 0 0 0 1 1 0 0 1 1 1 0 1 0 1 0 1 0
 1 1 0 1 1 0 1 0 0 0 1 0 1 0 0 1 1 0 1 0 1 1 1 0 1 1 1 1 0 1 0 1 0 1 1 1 1
 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 1 0 1 1 0 1 1 0 0 0 1 0 1 1 0 1 1
 1 1 0 1 1 0 1 0 1 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 1 1 0 0 0 1 0 1 1 0
 0 0 0 0 1 1 0 1 0 0 0 1 0 0 1 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 1 1 1 1 1 0
 1 1 1 1 1 0 0 1 1 0 1 1 1 1 0 1 0 1 0 0 1 0 1 1 1 0 1 1 1 1 0 1 0 0 1 0 1
 1 1 1 1 0 0 1 1 0 1 1 0 

## Problem 3 (20 points): Accuracy is not everything!  How fast are the algorithms versus their accuracy?
**Compare the runtime of your baseline algorithms to the runtime of the MLPClassifier.**

**The Jupyter command %timeit can be used to measure how long a calculation takes; see https://ipython.readthedocs.io/en/stable/interactive/magics.html.**
* Try different values for "hidden_layer_sizes".  What do you observe in terms of runtime?
* Try different values for "activation". What do you observe in terms of runtime?
* Try different values for "solver". What do you observe in terms of runtime?
* How long does the "fit" function take as opposed to the "predict" function?  Can you explain why?
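
Outside of a notebook cell, where the `%timeit` magic is not available, a rough alternative is to time `fit` and `predict` separately with `time.perf_counter`. The sketch below assumes synthetic data from `make_classification` as a stand-in for the TF-IDF features, and the helper `time_fit_and_predict` is a made-up name. A single timed run is noisy, so treat the numbers as indicative only.

```python
import time

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC


def time_fit_and_predict(model, X_train, y_train, X_test):
    """Return (fit_seconds, predict_seconds) for a single run of each step."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X_test)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds


# Synthetic stand-in data (an assumption; replace with Xtrain/Xtest in the notebook)
X, y = make_classification(n_samples=300, n_features=50, random_state=0)
X_train, X_test = X[:225], X[225:]
y_train = y[:225]

models = {
    "LinearSVC": LinearSVC(),
    "MLPClassifier": MLPClassifier(hidden_layer_sizes=(50,), max_iter=200,
                                   random_state=0),
}

timings = {}
for name, model in models.items():
    timings[name] = time_fit_and_predict(model, X_train, y_train, X_test)
    print("%s: fit %.4fs, predict %.4fs" % ((name,) + timings[name]))
```

In a notebook cell, `%timeit classifier.fit(Xtrain, y_train)` repeats the call several times and reports an average, which is usually the more reliable measurement.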



## Problem 4 (20 points): Business question

* Suppose you had a highly accurate machine learning algorithm that could detect the sentiment of tweets.  What kind of business could you build around it?
* Who would be your competitors, and how large are they?
* What would be the size of the market for your product?
* In addition, assume that your machine learning algorithm was slow to train but fast at making predictions on new data.  How would that affect your business plan?
* How could you use the cloud to support your product?

## Problem 5 (extra credit): MLPs are not all there is!

* In real life, MLPs are rarely, if ever, used for NLP.
     * In this class I don't want you to have to learn too many different libraries, and MLP is all that scikit-learn really has.
* What is a Recurrent Neural Network (RNN), and why might it be better for NLP?
* What is a Convolutional Neural Network (CNN), and why might it be better for NLP?
* PyTorch is a famous library for deep learning, and a complete tutorial for sentiment analysis using PyTorch can be found at https://github.com/bentrevett/pytorch-sentiment-analysis.
    * What do they do for sentiment analysis?
    * For a bit of fun, you can actually run their notebooks on Google Colab at the touch of a button!
  
  

# Slides (for 10 minutes of presentation) (20 points)


1. (5 points) Motivation about the data collection, why the topic is interesting to you. 

2. (10 points) Communicating Results (figure/table)

3. (5 points) Storytelling (How do all the parts (data, analysis, result) fit together as a story?)


# Done

All set! 

**What do you need to submit?**

* **Notebook File**: Save this IPython notebook, and find the notebook file in your folder (for example, "filename.ipynb"). This is the file you need to submit. Please make sure all the plotted tables and figures are in the notebook. If you used "ipython notebook --pylab=inline" to open the notebook, all the figures and tables should have shown up in the notebook.


* **PPT Slides**: Please prepare PPT slides (for a 10-minute talk) to present the case study. We will ask two randomly selected teams to present their case studies in class.


*Please compress all the files into a single zipped file.*


**How to submit:**

        Please submit through email to Prof. Paffenroth (rcpaffenroth@wpi.edu).

#### We auto-process the submissions so make sure your subject line is *exactly*:

### DS3010 Case Study 4 Team ??

#### where ?? is your team number.
        
**Note: Each team needs to submit only one submission.**