<a href="https://colab.research.google.com/github/zerotodeeplearning/ztdl-masterclasses/blob/master/solutions_do_not_open/Model_Evaluation_and_Dimensionality_Reduction_with_Scikit_Learn_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learn with us: www.zerotodeeplearning.com

Copyright © 2021: Zero to Deep Learning ® Catalit LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Model Evaluation and Dimensionality Reduction with Scikit Learn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
url = 'https://raw.githubusercontent.com/zerotodeeplearning/ztdl-masterclasses/master/data/'

In [None]:
df = pd.read_csv(url + 'sms.tsv', sep='\t')

In [None]:
df.head()

In [None]:
df['label'].value_counts() / len(df)

In [None]:
y = (df['label'] == 'spam')

### Word count features

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
def cross_val_score_print(model, X, y, cv=3):
    scores = cross_val_score(model, X, y, cv=cv)
    print("Accuracy score: {:0.3} +/- {:0.3}".format(scores.mean(), scores.std()))

In [None]:
vect = CountVectorizer(decode_error='ignore',
                       stop_words='english',
                       binary=True,
                       max_features=2000)

X = vect.fit_transform(df['msg'])

In [None]:
X

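Visualize the first 200 word features for the first 200 messages (since we set `binary=True`, each cell is 1 if the word appears in the message and 0 otherwise).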

In [None]:
N = 200
plt.figure(figsize=(10, 10))
plt.imshow(X[:N, :N].toarray());

In [None]:
cross_val_score_print(DummyClassifier(strategy='most_frequent'), X, y)

In [None]:
cross_val_score_print(LogisticRegression(solver='liblinear'), X, y)

### Feature importances

The model using 2000 word features seems to be performing quite well. Let's find out which words are most strongly associated with spam.

The features we are using indicate word occurrences in a corpus of SMS messages: because the vectorizer was created with `binary=True`, each feature is 1 if the word appears in a message and 0 otherwise. In other words, all the features share the same scale. Under this condition, we can interpret the coefficients of the `LogisticRegression` model as feature importances.

In [None]:
model = LogisticRegression(solver='liblinear')
model.fit(X, y)

In [None]:
word_feature_importances = pd.Series(model.coef_[0],
                                     index=vect.get_feature_names_out()).sort_values()

In [None]:
# Top 20 least spammy words
word_feature_importances.head(20)

In [None]:
# Top 20 most spammy words
word_feature_importances.tail(20)

### Truncated SVD

A common way to visualize high-dimensional feature sets is to use the Truncated SVD dimensionality reduction technique. Let's use it to compress our 2000 sparse features into a 5-dimensional space.
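
To see the mechanics in isolation, here is a minimal sketch on a randomly generated sparse matrix. The matrix and the names `toy_sparse`, `svd` and `toy_reduced` are assumptions made for illustration. It shows that `TruncatedSVD` accepts a sparse input directly and that `explained_variance_ratio_` gives a rough idea of how much variance the retained components capture.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse binary matrix standing in for the word features
rng = np.random.RandomState(0)
toy_sparse = sparse.random(100, 50, density=0.05, format='csr', random_state=rng)
toy_sparse.data[:] = 1.0  # turn the random values into 0/1 indicators

# TruncatedSVD works on the sparse matrix directly, without densifying it
svd = TruncatedSVD(n_components=5, random_state=0)
toy_reduced = svd.fit_transform(toy_sparse)

print(toy_reduced.shape)
# Fraction of the total variance captured by the 5 components
print(svd.explained_variance_ratio_.sum())
```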

In [None]:
from sklearn.decomposition import TruncatedSVD

In [None]:
X_tsvd = pd.DataFrame(TruncatedSVD(n_components=5).fit_transform(X), columns=['c1', 'c2', 'c3', 'c4', 'c5'])
X_tsvd['label'] = y

In [None]:
sns.pairplot(X_tsvd, hue='label');

### Exercise 1: Model validation with Pipelines.

When the feature engineering step involves learning something from the data, we should only learn from the training set.

In the case above, we are learning the vocabulary from the data, but there are many other cases where the transformer learns properties of the data. In these cases, we should proceed with caution and only learn from the training set.

One way to achieve this is to do something like this:

```python
raw_features_train, raw_features_test = train_test_split(....)
transformer = ....

transformer.fit(raw_features_train)
X_train = transformer.transform(raw_features_train)
X_test = transformer.transform(raw_features_test)
```
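
A concrete version of this pattern, sketched on a hypothetical six-message corpus (the messages, labels and `toy_*` / `raw_*` names are made up for illustration). The split uses `shuffle=False` only so that the outcome is easy to follow: the last two messages form the test set. Words that appear only in the test messages never enter the vocabulary, so they are simply ignored at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 1 = spam, 0 = ham
toy_corpus = [
    "win cash now",
    "call me later",
    "cash prize today",
    "lunch later today",
    "free cash offer",
    "see you tonight",
]
toy_labels = [1, 0, 1, 0, 1, 0]

# shuffle=False keeps the order, so the last two messages are the test set
raw_train, raw_test, lab_train, lab_test = train_test_split(
    toy_corpus, toy_labels, test_size=2, shuffle=False)

# The vocabulary is learned from the training messages only
toy_transformer = CountVectorizer(binary=True)
toy_transformer.fit(raw_train)

X_toy_train = toy_transformer.transform(raw_train)
X_toy_test = toy_transformer.transform(raw_test)

print(sorted(toy_transformer.vocabulary_))
# Only "cash" is known to the vocabulary; every other test word is dropped
print(X_toy_test.toarray())
```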

A better way to achieve the same result is to bundle the transformer and the estimator into a [`Pipeline`](https://scikit-learn.org/stable/modules/compose.html).

Complete the following steps:

- Split `df['msg']` and `y` into train and test sets
- Create a pipeline using the `make_pipeline` function that contains at least 2 steps: `vect` and `LogisticRegression()`. Feel free to include additional intermediate steps if you wish
- Train the pipeline model on the training set and compare the training and test scores
- Bonus points if you perform Cross Validation


In [None]:
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn.pipeline import make_pipeline

In [None]:
msg_train, msg_test, y_train, y_test = train_test_split(df['msg'], y, test_size=0.2, random_state=0)

In [None]:
model = make_pipeline(vect,
                      LogisticRegression(solver='liblinear'))

In [None]:
model.fit(msg_train, y_train)

In [None]:
model.score(msg_train, y_train)

In [None]:
model.score(msg_test, y_test)

In [None]:
cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)

cross_val_score_print(model, df['msg'], y, cv=cv)

### Exercise 2: ROC curve and learning curve

- Use the trained pipeline model to calculate the [`roc_curve`](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html)
- Bonus points if you plot it for both the train and test sets defined above.
- Use the pipeline model to calculate and plot the [`learning_curve`](https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html)
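
Before working on the real data, it may help to see what `roc_curve` returns on a handful of scores. The four labels and scores below are made up for illustration, and the `toy_*` names are hypothetical. The sketch also computes the area under the curve with `auc`, which summarizes the whole curve in a single number.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class
toy_true = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])

# One (fpr, tpr) point per decision threshold
toy_fpr, toy_tpr, toy_thresholds = roc_curve(toy_true, toy_scores)

# Area under the ROC curve: 3 of the 4 (positive, negative) pairs are ranked correctly
toy_auc = auc(toy_fpr, toy_tpr)
print(toy_auc)  # 0.75
```

`roc_auc_score(toy_true, toy_scores)` gives the same value without building the curve first.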

In [None]:
from sklearn.metrics import roc_curve

In [None]:
probas_train = model.predict_proba(msg_train)[:, 1]
probas_test = model.predict_proba(msg_test)[:, 1]

In [None]:
fpr_train, tpr_train, _ = roc_curve(y_train, probas_train)
fpr_test, tpr_test, _ = roc_curve(y_test, probas_test)

In [None]:
plt.figure(figsize=(8, 8))
plt.plot(fpr_train, tpr_train)
plt.plot(fpr_test, tpr_test)
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.legend(['train', 'test'])
plt.title('Receiver Operating Characteristic')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate');

In [None]:
from sklearn.model_selection import learning_curve

In [None]:
tsz = np.linspace(0.1, 1, 10)
train_sizes, train_scores, test_scores = learning_curve(model, df['msg'], y, train_sizes=tsz, cv=3)

In [None]:
fig = plt.figure()
plt.plot(train_sizes, train_scores.mean(axis=1), 'ro-', label="Train Scores")
plt.plot(train_sizes, test_scores.mean(axis=1), 'go-', label="Test Scores")
plt.title('Learning Curve: Logistic Regression')
plt.xlabel("Train Size")
plt.ylabel("Average Score, CV=3")
plt.ylim((0.8, 1.0))
plt.legend()
plt.draw()
plt.show()

### Exercise 3

Let's explore the effect of Dimensionality Reduction techniques on a different dataset: the Digits dataset.

The data is loaded for you.

- Use one or more dimensionality reduction techniques (e.g. `PCA`, `TSNE` or other) to compress the 64 pixel features into 2 features.
- Use `sns.scatterplot` to visualize the whole dataset in the reduced space
- Use the `y` variable to color the data: do you see clusters of similar points appear?
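
One possible starting point, sketched with plain `PCA` and with hypothetical variable names (`digits_X`, `digits_y`, `pca`, `digits_2d`) chosen so that they do not clash with the cells below. It projects the 64 pixel features onto 2 components and reports how much of the total variance those 2 components retain, which hints at how much structure a 2D scatter plot can show.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# The Digits dataset ships with scikit-learn: 1797 images of 8x8 = 64 pixels
digits_X, digits_y = load_digits(return_X_y=True)

# Project the 64 pixel features onto the 2 directions of largest variance
pca = PCA(n_components=2)
digits_2d = pca.fit_transform(digits_X)

print(digits_2d.shape)  # (1797, 2)
# Fraction of the total variance retained by the 2 components
print(pca.explained_variance_ratio_.sum())
```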

In [None]:
from sklearn.datasets import load_digits

In [None]:
X, y = load_digits(return_X_y=True)

In [None]:
X.shape

In [None]:
plt.imshow(X[0].reshape(8, 8), cmap='gray');

In [None]:
plt.imshow(X[1].reshape(8, 8), cmap='gray');

In [None]:
from sklearn.decomposition import KernelPCA
from sklearn.manifold import TSNE

In [None]:
# KernelPCA uses a linear kernel by default, which makes it equivalent to standard PCA
X_pca = pd.DataFrame(KernelPCA(n_components=2).fit_transform(X), columns=['c1', 'c2'])
X_pca['label'] = y

In [None]:
plt.figure(figsize=(10, 10))
sns.scatterplot(data=X_pca, x='c1', y='c2', hue='label', palette="Set2");

In [None]:
X_tsne = pd.DataFrame(TSNE().fit_transform(X), columns=['c1', 'c2'])
X_tsne['label'] = y

In [None]:
plt.figure(figsize=(10, 10))
sns.scatterplot(data=X_tsne, x='c1', y='c2', hue='label', palette="Set2");

In [None]:
X_tsne = pd.DataFrame(TSNE(perplexity=5).fit_transform(X), columns=['c1', 'c2'])
X_tsne['label'] = y

In [None]:
plt.figure(figsize=(10, 10))
sns.scatterplot(data=X_tsne, x='c1', y='c2', hue='label', palette="Set2");