# Exercises for Machine Learning

## Load Libraries

In [None]:
from sklearn.feature_extraction.text import *
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics

import pandas as pd

## 5.1. Intro to Classification

### "Spam" or "Ham"

Some of the earliest spam filters were simply Naïve Bayes Classifiers. Let's build one. This dataset contains text messages labeled as either "spam" (messages that are unwanted) or "ham" (wanted messages). Build either an SVM or an NBC and evaluate its effectiveness.

Be sure to:
1. Evaluate how well it finds spam without filtering out good messages.
2. Check whether it does significantly better than random guessing.
3. Determine whether it performs significantly worse on testing data than on training data.
4. Try writing a few sample ham/spam messages and see whether it filters them properly.
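
The last item can be approached by passing hand-written messages straight to a fitted model. The sketch below is only illustrative: it trains a Naïve Bayes classifier on a tiny invented corpus (`toy_messages` and `toy_labels` are made up here, not taken from the SMS dataset), so its predictions say nothing about how the real model behaves.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy corpus (assumption: stands in for the real SMS data)
toy_messages = [
    "are we still meeting for lunch today",
    "call me when you get home",
    "see you at the gym tonight",
    "can you pick up milk on the way home",
    "WINNER! claim your free prize now",
    "free entry to win cash, text now",
    "urgent: you have won a free holiday, call now",
    "claim your cash prize, reply now",
]
toy_labels = ["ham"] * 4 + ["spam"] * 4

# Fit the vectorizer and the classifier on the toy training messages
toy_cv = CountVectorizer()
toy_nb = MultinomialNB()
toy_nb.fit(toy_cv.fit_transform(toy_messages), toy_labels)

# New messages must go through transform (not fit_transform) before predict
samples = ["free cash prize, claim now", "see you at lunch today"]
sample_preds = toy_nb.predict(toy_cv.transform(samples))

for message, label in zip(samples, sample_preds):
    print(f"{label:>4}: {message}")
```

With the real data, the same two calls (`transform`, then `predict`) would be applied using the vectorizer and classifier fitted on `X_train`.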

In [None]:
spam = pd.read_csv("https://raw.githubusercontent.com/Greg-Hallenbeck/HARP-210-NLP/main/datasets/SMSSpamCollection.tsv", sep="\t")

In [None]:
X = spam["text"]
y = spam["class"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

### Sentiment Analysis

Sentiment analysis is a specific type of text classification that determines whether a text describes some person, product, film, and so on positively or negatively.

This dataset contains movie reviews labeled as either 1 (positive) or 0 (negative). Build an SVM to classify them and evaluate its effectiveness.

Note: this dataset is 50 MB. It takes a while to download.

In [None]:
reviews = pd.read_csv("https://drive.google.com/u/0/uc?id=16eZ9zC4m6KodcoFnz1yMicxPnwn3aRIo&export=download")

In [None]:
# CountVectorizer counts the words in each message, drops English stop words, and keeps only the 1000 most frequent words
cv = CountVectorizer(max_features=1000, stop_words='english')
# fit learns that vocabulary from the training data; transform converts each message into word counts
X_train = cv.fit_transform(X_train)

tfidf = TfidfTransformer(use_idf=True)
X_train = tfidf.fit_transform(X_train)

What words are most associated with positive reviews? Negative reviews?

In [None]:
X_test = cv.transform(X_test)
X_test = tfidf.transform(X_test)

In [None]:
classifier = SGDClassifier()  # any classifier could be used here; with its default loss this is a linear SVM
classifier = classifier.fit(X_train, y_train)  # finds the line (hyperplane) that separates the two classes

In [None]:
# Null accuracy is 87%: always guessing "ham" is right 87% of the time.
# The model has to beat this accuracy to be meaningful.
y_train.value_counts(normalize=True)

ham     0.868297
spam    0.131703
Name: class, dtype: float64
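
The null accuracy can also be measured with a baseline model instead of being read off the class proportions. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels with roughly the class balance shown above; `toy_X` and `toy_y` are placeholders, not the real training data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with roughly the class balance above (87% ham, 13% spam)
toy_y = np.array(["ham"] * 87 + ["spam"] * 13)
toy_X = np.zeros((len(toy_y), 1))  # the dummy model ignores the features

# Always predicts the most common class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X, toy_y)

null_accuracy = baseline.score(toy_X, toy_y)
print(null_accuracy)
```

Scoring the same kind of baseline on the test split gives the number the real classifier's test accuracy should be compared against.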

In [None]:
# Train accuracy: 99.3%

y_pred_train = classifier.predict(X_train)
metrics.accuracy_score(y_train, y_pred_train)

0.9934933811981154

In [None]:
# Test accuracy: 98.7%

y_pred_test = classifier.predict(X_test)
metrics.accuracy_score(y_test, y_pred_test)

# We want test accuracy to be higher than null accuracy, or else the model does nothing useful.
# We want test accuracy to be SIMILAR to train accuracy; otherwise the model is overfit (the line is too wiggly).

# If the model is overfit:
  # - it has fit quirks of the training data, so it does not represent the data as well as it might
  # - it may be possible to reduce overfitting and increase test accuracy

0.9865470852017937

In [None]:
print(metrics.classification_report(y_test, y_pred_test, zero_division=0))

# If these were emails, based on precision:

# 99% of the emails in your inbox are actually good emails

# 98% of the emails in your spam folder are actually spam



              precision    recall  f1-score   support

         ham       0.99      1.00      0.99       955
        spam       0.98      0.93      0.95       160

    accuracy                           0.99      1115
   macro avg       0.98      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115
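
The precision and recall columns in the report come from the confusion matrix. As a small worked example on invented labels (`toy_true` and `toy_pred` are made up, not the predictions above), the sketch below counts the four outcomes and computes spam precision and recall by hand.

```python
from sklearn import metrics

# Invented labels: 8 ham and 4 spam messages
toy_true = ["ham"] * 8 + ["spam"] * 4
toy_pred = ["ham"] * 7 + ["spam"] + ["spam"] * 2 + ["ham"] * 2

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = metrics.confusion_matrix(toy_true, toy_pred, labels=["ham", "spam"])
tn, fp, fn, tp = cm.ravel()  # treating "spam" as the positive class

spam_precision = tp / (tp + fp)  # of everything flagged as spam, how much really is spam
spam_recall = tp / (tp + fn)     # of all real spam, how much was caught

print(cm)
print(spam_precision, spam_recall)
```

Here one good message is wrongly filtered (a false positive) and two spam messages slip through (false negatives), which is why recall is lower than precision.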



### News Source

Build an SVM to classify the Netflix dataset by genre (you can see some code that cleans up the genre column in the notes) or by country.

In [None]:
news = pd.read_csv("https://raw.githubusercontent.com/Greg-Hallenbeck/class-datasets/main/datasets/netflix.csv")

## 5.2. Pipelines and Multiple Outputs

Modify your code from Part 1 to use Pipelines instead. You can modify your code in-place above, or re-write it below.

In [None]:
X = spam["text"]
y = spam["class"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=314159)


In [None]:
text_pipeline = Pipeline([
     ('tokenize', CountVectorizer(max_features=10000, stop_words="english")),
     ('tfidf', TfidfTransformer(use_idf=True)),
     ('classifier', SGDClassifier())
    ])

In [None]:
text_pipeline.fit(X_train, y_train)

In [None]:
y_pred_test = text_pipeline.predict(X_test)
y_pred_train = text_pipeline.predict(X_train)

In [None]:
metrics.accuracy_score(y_train, y_pred_train)

0.9993269015032533

In [None]:
metrics.accuracy_score(y_test, y_pred_test)

0.9766816143497757

In [None]:
print(metrics.classification_report(y_test, y_pred_test, zero_division=0))

              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       956
        spam       0.98      0.86      0.91       159

    accuracy                           0.98      1115
   macro avg       0.98      0.93      0.95      1115
weighted avg       0.98      0.98      0.98      1115



Modify the analysis in the lecture to predict the country a movie or show was produced in, rather than its genre.

For the datasets in Part 1, would any of them be improved by using a multi-output classifier?
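
As a reminder of the mechanics, a multi-output setup treats each label as its own yes/no column and fits one classifier per column. The sketch below is a minimal illustration on invented data: `toy_descriptions` and `toy_genres` are made up, and the real Netflix column names and cleaning steps from the lecture are not assumed here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Invented descriptions, each with one or more genres
toy_descriptions = [
    "a hilarious stand-up comedy special",
    "a tense family drama about loss",
    "a funny and heartfelt drama about two friends",
    "a nature documentary about ocean life",
    "a documentary following a political campaign",
    "a goofy comedy about a road trip",
]
toy_genres = [
    ["Comedy"],
    ["Drama"],
    ["Comedy", "Drama"],
    ["Documentary"],
    ["Documentary"],
    ["Comedy"],
]

# Turn the genre lists into a 0/1 matrix with one column per genre
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(toy_genres)

# Same text pipeline as before, but the final step fits one SGDClassifier per column
multi_pipeline = Pipeline([
    ("tokenize", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ("classifier", MultiOutputClassifier(SGDClassifier(random_state=0))),
])
multi_pipeline.fit(toy_descriptions, Y)

pred = multi_pipeline.predict(["a funny comedy about a family"])
pred_genres = mlb.inverse_transform(pred)  # back from 0/1 columns to genre names

print(list(mlb.classes_))
print(pred, pred_genres)
```

A dataset only benefits from this when a single item can carry several labels at once; a message that is either spam or ham has exactly one label, so one ordinary classifier is enough there.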

## 5.3. Transformer Architecture

These questions are in a separate notebook, 5.X2