# Case Study 4 : Deep Learning

**Due Date: February 24, 2020, BEFORE the beginning of class at 11:00am**

NOTE: There are always last minute issues submitting the case studies. DO NOT WAIT UNTIL THE LAST MINUTE!

<img src="https://mapr.com/blog/demystifying-ai-ml-dl/assets/process.png">

**TEAM Members:** Please EDIT this cell and add the names of all the team members in your team

    Achu Balasubramanian
    
    Irean Ali
    
    Josh Lovering

**Desired outcome of the case study.**
* Similar to case study 3, we will look at movie reviews from the v2.0 polarity dataset, which comes from
http://www.cs.cornell.edu/people/pabo/movie-review-data.
    * It contains written reviews of movies divided into positive and negative reviews.
 
**NOTE:** This case study is, on purpose, more open-ended than the previous case studies.
* This is intended to help prepare you for the final case study which will be even more open-ended.
    
**Required Readings:** 
* This case study will be based upon the scikit-learn Python library
* Again, we will build upon the tutorial "Working With Text Data", which can be found at http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
* Read about deep learning at https://scikit-learn.org/stable/modules/neural_networks_supervised.html

**Required Python libraries:**
* Same as case study 3, except for the extra credit question.
    * Numpy (www.numpy.org) 
    * Matplotlib (matplotlib.org) 
    * Scikit-learn (scikit-learn.org) 
    * You are also welcome to use the Python Natural Language Processing Toolkit (www.nltk.org) (though it is not required).

**NOTE**
* Please don't forget to save the notebook frequently when working in IPython Notebook, otherwise the changes you made can be lost.

## Problem 1 (20 points): Load in the movie review data, create TF-IDF features, and use two of your favorite classification algorithms from scikit-learn for predicting sentiment

* This problem is, basically, already answered as part of case study 3, so it is fine to use your work from there to help answer this question.

In [7]:
# import scikit-learn

"""Build a sentiment analysis / polarity model
Sentiment analysis can be cast as a binary text classification problem,
that is, fitting a linear classifier on features extracted from the text
of the user messages so as to guess whether the opinion of the author is
positive or negative.
In this example we will use a movie review dataset.
"""
# Author: Olivier Grisel <olivier.grisel@ensta.org>
# License: Simplified BSD

import sys
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn import metrics


if __name__ == "__main__":
    # NOTE: we put the following in a 'if __name__ == "__main__"' protected
    # block to be able to use a multi-core grid search that also works under
    # Windows, see: http://docs.python.org/library/multiprocessing.html#windows
    # The multiprocessing module is used as the backend of joblib.Parallel
    # that is used when n_jobs != 1 in GridSearchCV

    # the training data is read from the local 'txt_sentoken' folder
    # (inside a notebook, sys.argv[1] does not hold the dataset path)
    movie_reviews_data_folder = 'txt_sentoken'
    dataset = load_files(movie_reviews_data_folder, shuffle=False)
    print("n_samples: %d" % len(dataset.data))

    # split the dataset in training and test set:
    docs_train, docs_test, y_train, y_test = train_test_split(
        dataset.data, dataset.target, test_size=0.25, random_state=None)

    # TASK: Build a vectorizer / classifier pipeline that filters out tokens
    # that are too rare or too frequent
    pipeline = Pipeline([
        ('vect', TfidfVectorizer(min_df=3, max_df=0.95)),
        ('clf', LinearSVC(C=1000)),
    ])

    # TASK: Build a grid search to find out whether unigrams or bigrams are
    # more useful.
    # Fit the pipeline on the training set using grid search for the parameters
    parameters = {
        'vect__ngram_range': [(1, 1), (1, 2)],
    }
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1)
    grid_search.fit(docs_train, y_train)

    # TASK: print the mean and std for each candidate along with the parameter
    # settings for all the candidates explored by grid search.
    n_candidates = len(grid_search.cv_results_['params'])
    for i in range(n_candidates):
        print(i, 'params - %s; mean - %0.2f; std - %0.2f'
                 % (grid_search.cv_results_['params'][i],
                    grid_search.cv_results_['mean_test_score'][i],
                    grid_search.cv_results_['std_test_score'][i]))

    # TASK: Predict the outcome on the testing set and store it in a variable
    # named y_predicted
    y_predicted = grid_search.predict(docs_test)

    # Print the classification report
    print(metrics.classification_report(y_test, y_predicted,
                                        target_names=dataset.target_names))

    # Print and plot the confusion matrix
    cm = metrics.confusion_matrix(y_test, y_predicted)
    print(cm)

    # import matplotlib.pyplot as plt
    # plt.matshow(cm)
    # plt.show()

n_samples: 2000




0 params - {'vect__ngram_range': (1, 1)}; mean - 0.85; std - 0.01
1 params - {'vect__ngram_range': (1, 2)}; mean - 0.85; std - 0.01
              precision    recall  f1-score   support

         neg       0.87      0.85      0.86       248
         pos       0.86      0.88      0.87       252

    accuracy                           0.87       500
   macro avg       0.87      0.87      0.87       500
weighted avg       0.87      0.87      0.87       500

[[212  36]
 [ 31 221]]


In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = docs_train

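# high max_df and min_df keep only very frequent terms: here, bigrams that
# appear in at least 10% and at most 90% of the training documents
vectorizer = TfidfVectorizer(max_df = 0.9, min_df = 0.1, ngram_range = (2,2))

# lower max_df and min_df values would keep much less frequent terms
X = vectorizer.fit_transform(corpus)
feature_names = []
Xtrain = []
# NOTE: get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead
for word in vectorizer.get_feature_names():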
    feature_names.append(word.replace("_", ""))
Xtrain = list(set(feature_names))
        
# print(feature_names)
print(len(feature_names))

229
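
To make the effect of `min_df` and `max_df` easier to see, here is a minimal sketch on a made-up four-sentence corpus (the sentences and the `toy_` variable names are assumptions for illustration, not part of the movie review data). Terms that appear in too few or too many documents are dropped from the vocabulary before the TF-IDF weights are computed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (NOT the movie review data)
toy_docs = [
    "the movie was great",
    "the movie was awful",
    "the plot was great",
    "the ending felt rushed",
]

# min_df=2    -> drop terms found in fewer than 2 documents
# max_df=0.75 -> drop terms found in more than 75% of the documents
toy_vectorizer = TfidfVectorizer(min_df=2, max_df=0.75)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each surviving term to its column index
kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)
print(toy_matrix.shape)  # (number of documents, number of kept terms)
```

In this toy corpus "the" occurs in every document and is removed by `max_df`, while words such as "awful" or "plot" occur only once and are removed by `min_df`.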


In [9]:
#Run two favored algorithms:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

vectorizer = TfidfVectorizer(max_df = 0.95, min_df = 0.0025)
vectorizer_fit = vectorizer.fit(docs_train)

Xtrain = vectorizer_fit.transform(docs_train)

Xtest = vectorizer_fit.transform(docs_test)

classifier_one = LinearSVC()
classifier_one.fit(Xtrain, y_train)

classifier_two = KNeighborsClassifier(n_neighbors=250,
                                      weights="distance")
classifier_two.fit(Xtrain, y_train)

prediction_one = classifier_one.predict(Xtest)
prediction_two = classifier_two.predict(Xtest)

print(metrics.confusion_matrix(y_test, prediction_one))
print(metrics.confusion_matrix(y_test, prediction_two))

[[208  40]
 [ 36 216]]
[[225  23]
 [ 90 162]]


In [None]:
'''
Linear SVC: 76/500 false predictions (40 + 36 in the confusion matrix above)

K Neighbors: 113/500 false predictions (23 + 90 in the confusion matrix above)
'''
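
The false-prediction counts can be read off a confusion matrix as the sum of its off-diagonal entries. The sketch below applies that to the two matrices printed above; the helper name `error_count` is an assumption for illustration, and the numbers are simply copied from the output of that run.

```python
import numpy as np

def error_count(confusion):
    """Number of misclassified samples: everything off the main diagonal."""
    confusion = np.asarray(confusion)
    return int(confusion.sum() - np.trace(confusion))

# Matrices copied from the printed output above (rows: true class, columns: predicted)
svc_matrix = [[208, 40], [36, 216]]
knn_matrix = [[225, 23], [90, 162]]

for name, matrix in [("Linear SVC", svc_matrix), ("K Neighbors", knn_matrix)]:
    total = int(np.sum(matrix))
    errors = error_count(matrix)
    print(f"{name}: {errors}/{total} false predictions, "
          f"accuracy {1 - errors / total:.3f}")
```

Because the train/test split uses `random_state=None`, these counts change from run to run.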

## Problem 2 (20 points): Use a Multi-Layer Perceptron (MLP) for classifying the reviews.  Explore the parameters for the MLP and compare the accuracies against your baseline algorithms in Problem 1.

**Read the documentation for the MLPClassifier class at https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier.** 
* Try different values for "hidden_layer_sizes".  What do you observe in terms of accuracy?
* Try different values for "activation". What do you observe in terms of accuracy?
* Try different values for "solver". What do you observe in terms of accuracy?

In [15]:
from sklearn.neural_network import MLPClassifier

layer_sizes = [50, 100, 200]
activation_strings = ["identity", "logistic", "tanh", "relu"]
solver_strings = ['lbfgs', 'sgd', 'adam']

for layer in layer_sizes:
    for activation_string in activation_strings:
        for solver_string in solver_strings:
            classifier = MLPClassifier(hidden_layer_sizes=(layer,), activation = activation_string, solver = solver_string)
            classifier.fit(Xtrain, y_train)
            y_pred = classifier.predict(Xtest)
            print("Layer size: " + str(layer) + ", Activation String: " + activation_string + ", Solver String: " + solver_string)
            print(metrics.confusion_matrix(y_test, y_pred))



Layer size: 50, Activation String: identity, Solver String: lbfgs
[[212  26]
 [ 42 220]]
Layer size: 50, Activation String: identity, Solver String: sgd
[[221  17]
 [210  52]]
Layer size: 50, Activation String: identity, Solver String: adam
[[209  29]
 [ 43 219]]
Layer size: 50, Activation String: logistic, Solver String: lbfgs
[[212  26]
 [ 48 214]]
Layer size: 50, Activation String: logistic, Solver String: sgd
[[238   0]
 [262   0]]
Layer size: 50, Activation String: logistic, Solver String: adam
[[210  28]
 [ 41 221]]
Layer size: 50, Activation String: tanh, Solver String: lbfgs
[[214  24]
 [ 42 220]]
Layer size: 50, Activation String: tanh, Solver String: sgd
[[225  13]
 [237  25]]
Layer size: 50, Activation String: tanh, Solver String: adam
[[209  29]
 [ 42 220]]
Layer size: 50, Activation String: relu, Solver String: lbfgs
[[217  21]
 [ 44 218]]
Layer size: 50, Activation String: relu, Solver String: sgd
[[238   0]
 [262   0]]
Layer size: 50, Activation String: relu, Solver Stri

In [None]:
'''
We ran every combination of the parameters. For hidden_layer_sizes we chose
three values: 50, 100, and 200 units in a single hidden layer.

The top 4 results were:
200 units, identity, lbfgs: 65 false predictions
100 units, logistic, lbfgs: 72 false predictions
50 units, logistic, adam: 73 false predictions
50 units, relu, adam: 73 false predictions

Hidden Layer Sizes: On average, the larger the hidden layer, the better the predictions.
However, a layer size of 50 produced 2 of the top 4 results. A smaller layer paired
with the adam solver proved very successful, while in other cases layer sizes
of 100 or 200 worked best.

Activations: Ranked by average number of false predictions, the order from most to least
successful was: identity, relu, tanh, logistic, although all of
these activations had very similar average results. Relu works best with
smaller layer sizes. Identity paired with lbfgs works best with large layer sizes,
while identity paired with adam works best with smaller layer sizes.

Solvers: Ranked by average number of false predictions, the order from most to least
successful was: adam, lbfgs, sgd. sgd had by far the poorest results: it predicted almost every review as negative, which produced a lot of false negatives.
lbfgs works well with larger layer sizes and pairs best with the identity and logistic activations.
adam works best with smaller layer sizes.
'''
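
One way to compare configurations without reading each matrix by eye is to sort them by their number of false predictions. The sketch below does this for the eleven layer-size-50 matrices that are visible in the printed output above (the output is cut off after that, so larger layer sizes are not included); the names `visible_results` and `off_diagonal` are assumptions for illustration.

```python
# Confusion matrices copied from the printed output above (layer size 50 only)
visible_results = {
    ("identity", "lbfgs"): [[212, 26], [42, 220]],
    ("identity", "sgd"): [[221, 17], [210, 52]],
    ("identity", "adam"): [[209, 29], [43, 219]],
    ("logistic", "lbfgs"): [[212, 26], [48, 214]],
    ("logistic", "sgd"): [[238, 0], [262, 0]],
    ("logistic", "adam"): [[210, 28], [41, 221]],
    ("tanh", "lbfgs"): [[214, 24], [42, 220]],
    ("tanh", "sgd"): [[225, 13], [237, 25]],
    ("tanh", "adam"): [[209, 29], [42, 220]],
    ("relu", "lbfgs"): [[217, 21], [44, 218]],
    ("relu", "sgd"): [[238, 0], [262, 0]],
}

def off_diagonal(matrix):
    """False predictions in a 2x2 confusion matrix."""
    return matrix[0][1] + matrix[1][0]

# Fewest false predictions first
ranked = sorted(visible_results.items(), key=lambda item: off_diagonal(item[1]))

for (activation, solver), matrix in ranked:
    print(f"{activation:>8} / {solver:<5}: {off_diagonal(matrix)} false predictions")
```

This ranking covers a single run only. Since neither the split nor the MLP weights are seeded, a different run can reorder the closely matched lbfgs and adam configurations.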

## Problem 3 (20 points): Accuracy is not everything!  How fast are the algorithms versus their accuracy?
**Compare the runtime of your  baseline algorithms to the runtime of the MLPClassifier** 

**The jupyter command %timeit can be used to measure how long a calculation takes https://ipython.readthedocs.io/en/stable/interactive/magics.html.**
* Try different values for "hidden_layer_sizes".  What do you observe in terms of runtime?
* Try different values for "activation". What do you observe in terms of runtime?
* Try different values for "solver". What do you observe in terms of runtime?
* How long does the "fit" function take as opposed to the "predict" function?  Can you explain why?

In [5]:
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

vectorizer = TfidfVectorizer(max_df = 0.95, min_df = 0.0025)
vectorizer_fit = vectorizer.fit(docs_train)


Xtrain = vectorizer_fit.transform(docs_train)


Xtest = vectorizer_fit.transform(docs_test)

classifier_one = LinearSVC()
%timeit classifier_one.fit(Xtrain, y_train)

classifier_two = KNeighborsClassifier(n_neighbors=250,
                                      weights="distance")
%timeit classifier_two.fit(Xtrain, y_train)

%timeit classifier_one.predict(Xtest)
%timeit classifier_two.predict(Xtest)

layer_sizes = [50, 100, 200]
activation_strings = ["identity", "logistic", "tanh", "relu"]
solver_strings = ['lbfgs', 'sgd', 'adam']

for layer in layer_sizes:
    for activation_string in activation_strings:
        for solver_string in solver_strings:
            classifier = MLPClassifier(hidden_layer_sizes=(layer,), activation = activation_string, solver = solver_string)
            print("Layer size: " + str(layer) + ", Activation String: " + activation_string + ", Solver String: " + solver_string)
            %timeit classifier.fit(Xtrain, y_train)
            %timeit classifier.predict(Xtest)


58.4 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
5.81 ms ± 460 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
396 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
227 ms ± 7.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Layer size: 50, Activation String: identity, Solver String: lbfgs
3.65 s ± 724 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
8.6 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Layer size: 50, Activation String: identity, Solver String: sgd
9.23 s ± 3.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
8.74 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Layer size: 50, Activation String: identity, Solver String: adam
39.4 s ± 816 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
9.21 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Layer size: 50, Activation String: logistic, Solver String: lbfgs
6.4 s ± 974 ms per loop (mean ± std. dev. o

In [None]:
'''
Overall, the baseline algorithms performed their operations much faster than the deep learning
algorithms, presumably due to the complexity of the latter.

Hidden Layer Sizes: On average, the larger the hidden layer, the longer the fitting took.
However, the run time grew more slowly than the layer size.

Activations: Identity, relu, and tanh all reported very similar fitting runtimes (around 30 seconds).
Logistic was by far the slowest with an average of over 50 seconds.

Solvers: Adam had by far the longest average run time, with the 12 slowest times all belonging to it.
Lbfgs was the quickest and had good accuracy, while sgd was slightly slower but very inaccurate.

Fitting took much longer than predicting for LinearSVC and for every MLP configuration. This is probably
because fitting has to process the larger training set over many optimization iterations to create the model,
while predicting is a single pass over the smaller test set.
K Neighbors was the exception (about 6 ms to fit versus about 227 ms to predict): its fit step mostly
stores the training data, and the distance computations happen at prediction time.
Something interesting to note is that MLP prediction times were fairly consistent across all choices,
only ranging from 1-3 milliseconds.
'''
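
Outside a notebook, or when a single timed run is enough, fit and predict times can also be collected with `time.perf_counter`. This is a minimal sketch on synthetic data, not the movie review features: the helper `time_fit_predict`, the `make_classification` data, and `n_neighbors=25` are all assumptions chosen to keep the example small.

```python
import time

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

def time_fit_predict(estimator, X_fit, y_fit, X_new):
    """Return (fit_seconds, predict_seconds) for one run of each step."""
    start = time.perf_counter()
    estimator.fit(X_fit, y_fit)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    estimator.predict(X_new)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds

# Synthetic stand-in for the TF-IDF features (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=400, n_features=50, random_state=0)

timings = {}
for name, estimator in [
    ("LinearSVC", LinearSVC()),
    ("KNeighbors", KNeighborsClassifier(n_neighbors=25, weights="distance")),
]:
    timings[name] = time_fit_predict(estimator, X_demo[:300], y_demo[:300], X_demo[300:])
    fit_s, predict_s = timings[name]
    print(f"{name}: fit {fit_s * 1000:.2f} ms, predict {predict_s * 1000:.2f} ms")
```

A single run is noisier than `%timeit`, which repeats the statement and reports a mean and standard deviation, so this is better suited to rough comparisons.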



## Problem 4 (20 points): Business question

* Suppose you had a highly accurate machine learning algorithm that could detect the sentiment of tweets.  What kind of business could you build around that?
* Who would be your competitors, and what are their sizes?
* What would be the size of the market for your product?
* In addition, assume that your machine learning algorithm was slow to train, but fast in making predictions on new data.  How would that affect your business plan?
* How could you use the cloud to support your product?

In [None]:
'''

Business: Create an algorithm that performs sentiment analysis on tweets about gym experiences.
This can be sold to a large gym chain with many locations. The gym can use it to monitor its locations and see whether any
are not being run to its satisfaction and may be damaging its name. We would use keyword queries to find a list of tweets
that mention the respective gym. We would then use further keyword searching and maybe geotags to determine which gym location
is being tweeted about. Our algorithm would run on these lists of tweets mentioning each location and determine which are positive
and which are negative. The gym locations with many negative tweets would be flagged, and some of those tweets could be further
analyzed to find the reason for these poor gym experiences, giving the company an actionable problem to fix.

Competitors: One competitor of ours would be the gym's own marketing team. This team would likely be in charge of managing the gym's Twitter
account and boosting its branches' business. This team would report any dissatisfied customers back to headquarters
in order to notify the company of any flaws that could be fixed. This team may be larger than ours, as this is likely a global
or country-wide gym company. Other competitors would be groups like ours with large training and testing sets and
algorithms trained on sentiment analysis. Many of these competitors may be part of large companies and be of greater size
than us. To combat their larger size, we would gain access to more datasets to further train our algorithm. A downside
is that larger groups can afford more manpower to sort and label datasets manually, so we would have to find or purchase datasets
to be competitive with these groups.

Market Size: English-speaking countries. To expand the business we could incorporate translators or
word sets covering more languages.

Slow to train, fast to predict: We can take time to build the business and train the algorithm.
Then, once we sell it, each application runs quickly.

Cloud Support: Store large numbers of training and testing datasets in order to make the algorithm very thorough. We could train our
algorithm on movie reviews and other sentiment datasets to further expand its versatility in predicting tweets about
different gym branches.

'''

# Problem 5 (extra credit): MLPs are not all there is!

* In real life, MLPs are rarely if ever used for NLP. 
     * In this class I don't want you to have to learn too many different libraries, and MLP is all that scikit-learn really has.
* What is a Recurrent Neural Network (RNN), and why might they be better for NLP?
* What is a Convolutional Neural Network (CNN), and why might they be better for NLP?
* PyTorch is a famous library for deep learning, and a complete tutorial for sentiment analysis using PyTorch can be found at https://github.com/bentrevett/pytorch-sentiment-analysis.
    * What do they do for sentiment analysis?
    * For a bit of fun, you can actually run their notebooks on Google-colab at the touch of a button!
  
  

In [None]:
'''
RNN processes sequences and time series information one step at a time
Ideal for speech and text
Can be used for image captioning
Difficult to train
Training takes a long time
Classification, Regression, Forecasting

CNN is Supervised Machine Learning
Requires large training data sets
Uses 3 main layers:
Convolution
RELU
Pooling
CNN is inspired by the organization of the visual cortex
CNN is a variation of MLP and uses minimal amounts of preprocessing
Ideal for images and videos

PyTorch
Released in 2016
Deep Neural Networks
Tensor Computing

'''
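
To illustrate what "processing in sequence" means for an RNN, here is a minimal NumPy sketch of a plain (untrained) recurrent layer reading a five-word input. Every name, size, and random weight in it is an assumption for illustration; it is not taken from the PyTorch tutorial linked above. The point is that the hidden state after each word depends on all earlier words, so word order matters, whereas unigram TF-IDF features ignore it.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_size, hidden_size, sequence_length = 4, 3, 5

# Hypothetical input: one random vector per word of a 5-word review
word_vectors = rng.normal(size=(sequence_length, embedding_size))

# Randomly initialised weights (a real model would learn these)
W_input = rng.normal(scale=0.1, size=(hidden_size, embedding_size))
W_hidden = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
bias = np.zeros(hidden_size)

hidden_state = np.zeros(hidden_size)
for word_vector in word_vectors:
    # Each step combines the current word with the state carried over
    # from all previous words
    hidden_state = np.tanh(W_input @ word_vector + W_hidden @ hidden_state + bias)

# The final state summarises the whole sequence; a classifier layer
# (not shown) would map it to a positive/negative prediction
print(hidden_state)
```

Because each step needs the previous step's state, the loop cannot be run in parallel across words, which is one reason recurrent models tend to be slow to train.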

# Slides (for 10 minutes of presentation) (20 points)


1. (5 points) Motivation about the data collection, why the topic is interesting to you. 

2. (10 points) Communicating Results (figure/table)

3. (5 points) Story telling (How all the parts (data, analysis, result) fit together as a story?)


# Done

All set! 

**What do you need to submit?**

* **Notebook File**: Save this IPython notebook, and find the notebook file in your folder (for example, "filename.ipynb"). This is the file you need to submit. Please make sure all the plotted tables and figures are in the notebook. If you used "ipython notebook --pylab=inline" to open the notebook, all the figures and tables should have shown up in the notebook.


* **PPT Slides**: Please prepare PPT slides (for a 10-minute talk) to present the case study. We will ask two randomly selected teams to present their case studies in class.


*Please compress all the files into a single zipped file.*


**How to submit:**

        Please submit through email to Prof. Paffenroth (rcpaffenroth@wpi.edu).

#### We auto-process the submissions so make sure your subject line is *exactly*:

### DS3010 Case Study 4 Team ??

#### where ?? is your team number.
        
**Note: Each team only needs to submit one submission.**