## CSCI 470 Activities and Case Studies

1. For all activities, you are allowed to collaborate with a partner. 
1. For case studies, you should work individually and are **not** allowed to collaborate.

By filling out this notebook and submitting it, you acknowledge that you are aware of the above policies and are agreeing to comply with them.

Some considerations with regard to how these notebooks will be graded:

1. Cells in which "# YOUR CODE HERE" is found are the cells where your graded code should be written.
2. In order to test out or debug your code you may also create notebook cells or edit existing notebook cells other than "# YOUR CODE HERE". We actually highly recommend you do so to gain a better understanding of what is happening. However, during grading, **these changes are ignored**. 
3. You must ensure that all your code for the particular task is available in the cells that say "# YOUR CODE HERE".
4. Every cell that says "# YOUR CODE HERE" is followed by a "raise NotImplementedError". You need to remove that line. During grading, if an error occurs then you will not receive points for your work in that section.
5. If your code passes the "assert" statements, then no output will result. If your code fails the "assert" statements, you will get an "AssertionError". Getting an assertion error means you will not receive points for that particular task.
6. If you edit the "assert" statements to make your code pass, they will still fail when they are graded since the "assert" statements will revert to the original. Make sure you don't edit the assert statements.
7. We may sometimes have "hidden" tests for grading. This means that passing the visible "assert" statements is not sufficient. The "assert" statements are there as a guide but you need to make sure you understand what you're required to do and ensure that you are doing it correctly. Passing the visible tests is necessary but not sufficient to get the grade for that cell.
8. When you are asked to define a function, make sure you **don't** use any variables outside of the parameters passed to the function. You can think of the parameters being passed to the function as a hint. Make sure you're using all of those variables.
9. Finally, **make sure you run "Kernel > Restart and Run All"** and pass all the asserts before submitting. If you don't restart the kernel, there may be some code that you ran and deleted that is still being used and that was why your asserts were passing.
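
To make the grading rules above concrete, here is a minimal sketch of how "assert" checks behave. The function `mean_of` is a hypothetical example and not part of this assignment; note that it relies only on its own parameter.

```python
def mean_of(values):
    """Return the arithmetic mean, using only the `values` parameter."""
    return sum(values) / len(values)


# A passing assert is silent: no output is produced.
assert mean_of([1, 2, 3]) == 2

# A failing assert raises AssertionError. It is caught here only so that
# this sketch keeps running; in the graded cells it would stop execution.
try:
    assert mean_of([1, 2, 3]) == 3
    failed = False
except AssertionError:
    failed = True

print("The failing assert raised AssertionError:", failed)
```
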

# Feature Extraction

In this exercise you'll apply a variety of dimensionality reduction techniques to a relatively high-dimensional dataset and investigate how well they perform, both for visualizing the data and as input to a simple supervised learning algorithm.


In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import sklearn as sk
import scipy

np.random.seed(0)
plt.style.use("ggplot")

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import TSNE, SpectralEmbedding
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score


In [None]:
data = load_breast_cancer()

In [None]:
features = pd.DataFrame(data["data"], columns=data["feature_names"])
target = pd.Series(data["target"], name="class")
print(data["DESCR"])

In [None]:
features.head()

In [None]:
features.describe()

In [None]:
target.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=0)

In [None]:
y_train

First, we'll visualize the data with Principal Component Analysis. We can use the scikit-learn [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) and [KernelPCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html?highlight=kernelpca) classes. We'll want to visualize the data in just 2 dimensions.
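
Before applying PCA to the real data, it may help to see the fit/transform pattern on a small example. The sketch below uses hypothetical random data (`X_toy` is an assumption, not the breast cancer dataset). It reduces the data to 2 components and checks how much of the variance those components retain through the `explained_variance_ratio_` attribute.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical toy data: 100 samples, 5 features, where the first
# feature has a much larger scale than the others.
X_toy = rng.normal(size=(100, 5)) * np.array([10.0, 1.0, 1.0, 1.0, 1.0])
X_fit, X_new = X_toy[:80], X_toy[80:]

pca_toy = PCA(n_components=2)
Z_fit = pca_toy.fit_transform(X_fit)  # learns the components from X_fit
Z_new = pca_toy.transform(X_new)      # reuses them, without refitting

print(Z_fit.shape, Z_new.shape)
# Fraction of the total variance captured by each of the 2 components
print(pca_toy.explained_variance_ratio_)
```

The same two-step pattern (fit on one set, transform another) is what the cells below apply to `X_train` and `X_test`.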

In [None]:
pca = PCA(n_components=2)
# We can use fit_transform to fit the PCA and transform the training data in one step
X_train_pca = pca.fit_transform(X_train)


plt.scatter(X_train_pca[:,0], X_train_pca[:,1], c=y_train)
plt.title("PCA")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()


In [None]:
# When we visualize the test data, we have to make sure we **do not fit** again
# We just use the transform method, which reuses the components learned from the training data

X_test_pca = pca.transform(X_test)

plt.scatter(X_test_pca[:,0], X_test_pca[:,1], c=y_test)
plt.title("PCA")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()

In [None]:
for degree in [2,3,4,5]:
    
    kpca = KernelPCA(n_components=2, kernel="poly", degree=degree)
    X_train_kpca = kpca.fit_transform(X_train)
    
    plt.scatter(X_train_kpca[:,0], X_train_kpca[:,1], c=y_train)
    plt.title(f"Polynomial Kernel PCA with degree {degree}")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()
    

Next, we'll try out t-Distributed Stochastic Neighbor Embedding (t-SNE). We can use the [TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html?highlight=tsne#sklearn.manifold.TSNE) class from scikit-learn. We'll start with a hyperparameter search so that we can compare several variations of the visualization. As we covered in the lecture, this is recommended to make sure we understand the neighborhoods identified in the data.

In [None]:
perplexities = [5, 20, 30, 50, 100]
iters = [250, 1000, 3000]

Create an embedding of the training data with [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html), using the given number of iterations (`n`), a random state of 0, and the specified perplexity. Save that embedding to `tsne_embedding`.

The next code block will take some time to run. You might want to come back to it later or continue to the next part while it runs.
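
As a warm-up for the hyperparameters described above, here is a minimal sketch of a single t-SNE run on hypothetical toy data (two Gaussian blobs, not the assignment data). One caveat, stated as an assumption: the keyword for the iteration count appears as `n_iter` in older scikit-learn releases and `max_iter` in newer ones, so the sketch inspects the constructor rather than hard-coding either name.

```python
import inspect

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical toy data: two well-separated Gaussian blobs in 10 dimensions.
X_toy = np.vstack([
    rng.normal(0.0, 1.0, size=(30, 10)),
    rng.normal(5.0, 1.0, size=(30, 10)),
])

# Pick whichever iteration keyword the installed scikit-learn accepts.
tsne_params = inspect.signature(TSNE.__init__).parameters
iter_kwarg = "max_iter" if "max_iter" in tsne_params else "n_iter"

# Perplexity must be smaller than the number of samples (60 here).
toy_tsne = TSNE(n_components=2, perplexity=10, random_state=0, **{iter_kwarg: 250})

# t-SNE has no separate transform step: fit_transform returns the embedding.
toy_embedding = toy_tsne.fit_transform(X_toy)
print(toy_embedding.shape)
```
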

In [None]:
tsne_embeddings = []
for perplexity in perplexities:
    fig, axes = plt.subplots(nrows=1, ncols=len(iters), figsize=(16, 8), sharex=True, sharey=True)
    for i,n in enumerate(iters):
        # Create an embedding using t-SNE using number of iterations (n), a random state of 0 and specified perplexity
        # Save that embedding to tsne_embedding
        # YOUR CODE HERE
        raise NotImplementedError()
        axes[i].scatter(tsne_embedding[:, 0], tsne_embedding[:, 1], c=y_train)
        axes[i].set_title(f"t-SNE\nPerplexity={perplexity}, {n} steps")
    tsne_embeddings.append(tsne_embedding)
    plt.show()

In [None]:
assert tsne_embedding.shape == (455,2)
assert len(tsne_embeddings) == 5

Next, we'll calculate the [spectral embedding](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.SpectralEmbedding.html#sklearn.manifold.SpectralEmbedding) for the data. You can use the [SpectralEmbedding](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.SpectralEmbedding.html?highlight=spectralembedding) class from scikit-learn.

> *Optional* - select an affinity to create the affinity matrix for the manifold graph. You can select a [pairwise distance method](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.pdist.html) to call with [squareform](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.squareform.html#scipy.spatial.distance.squareform). 
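
For the optional route above, note that `pdist` returns distances, while `SpectralEmbedding(affinity="precomputed")` expects similarities. The sketch below shows one possible conversion on hypothetical toy data. The Gaussian kernel and the median-distance bandwidth `sigma` are illustrative assumptions, not choices prescribed by this exercise.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import SpectralEmbedding

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 4))  # hypothetical toy data

# pdist returns a condensed vector of pairwise distances;
# squareform expands it into a symmetric (50, 50) matrix.
D = squareform(pdist(X_toy, metric="euclidean"))

# Turn distances into affinities: large distance -> small affinity.
sigma = np.median(D[D > 0])  # assumed bandwidth heuristic
A = np.exp(-(D ** 2) / (2.0 * sigma ** 2))

toy_spectral = SpectralEmbedding(n_components=2, affinity="precomputed", random_state=0)
toy_spectral_embedding = toy_spectral.fit_transform(A)
print(toy_spectral_embedding.shape)
```
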

In [None]:
# Now use Spectral Embeddings to calculate the embedding of the data
# Save the embedding to spectral_embedding, and transformer to spectral
# YOUR CODE HERE
raise NotImplementedError()
plt.scatter(spectral_embedding[:, 0], spectral_embedding[:, 1], c=y_train)
plt.title("Spectral Embedding")
plt.show()

In [None]:
assert spectral
assert spectral_embedding.shape == (455, 2)

Next, we'll repeat this process with Linear Discriminant Analysis. You can use the [LinearDiscriminantAnalysis](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html) class from scikit-learn.
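
Unlike the previous methods, LDA is supervised, so it needs the labels when fitting. It also yields at most one fewer component than there are classes, which is why a two-class problem gives a one-dimensional embedding. A minimal sketch on hypothetical toy data (two Gaussian classes, not the assignment data):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Hypothetical toy data: two classes with shifted means, 6 features each.
X_toy = np.vstack([
    rng.normal(0.0, 1.0, size=(40, 6)),
    rng.normal(2.0, 1.0, size=(40, 6)),
])
y_toy = np.array([0] * 40 + [1] * 40)

toy_lda = LinearDiscriminantAnalysis()
Z_toy = toy_lda.fit_transform(X_toy, y_toy)  # the labels are required here

# A fitted LDA can project unseen rows with transform.
X_unseen = rng.normal(1.0, 1.0, size=(5, 6))
Z_unseen = toy_lda.transform(X_unseen)

print(Z_toy.shape, Z_unseen.shape)
```
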

In [None]:
# Save the embedding to lda_embedding, and transformer to lda
# YOUR CODE HERE
raise NotImplementedError()
plt.scatter(lda_embedding, [0]*len(lda_embedding), c=y_train)
plt.yticks([])
plt.title("LDA Embedding")
plt.show()

In [None]:
assert lda
assert lda_embedding.shape == (455, 1)

We will now investigate how these methods perform on classification with a basic Linear SVM Classifier.

Since t-SNE and Spectral Embedding do not learn a reusable projection (neither class provides a `transform` method), we cannot transform the test data using the trained model. Instead, you will need to concatenate X_train and X_test, embed the combined data, and then select the training and test rows from the resulting embedding. The [guide on merging dataframes](http://pandas.pydata.org/pandas-docs/stable/merging.html) may be useful.
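
The concatenate-then-split idea can be sketched on hypothetical toy frames as follows. All names prefixed with `toy_` are assumptions made for illustration. The rows are split by position, relying on `pd.concat` keeping the training rows first.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import SpectralEmbedding
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
cols = [f"f{i}" for i in range(5)]

# Hypothetical toy train/test frames with a simple binary label.
toy_train = pd.DataFrame(rng.normal(size=(40, 5)), columns=cols)
toy_test = pd.DataFrame(rng.normal(size=(10, 5)), columns=cols)
toy_y_train = (toy_train["f0"] > 0).astype(int)

# Stack the training rows on top of the test rows and embed them together.
toy_all = pd.concat([toy_train, toy_test], axis=0)
toy_all_embedding = SpectralEmbedding(n_components=2, random_state=0).fit_transform(toy_all)

# The first len(toy_train) rows belong to the training set, the rest to the test set.
n_train = len(toy_train)
toy_train_emb = toy_all_embedding[:n_train]
toy_test_emb = toy_all_embedding[n_train:]

# Fit the classifier on the training part only, then predict on the test part.
toy_svc = LinearSVC().fit(toy_train_emb, toy_y_train)
toy_pred = toy_svc.predict(toy_test_emb)
print(toy_train_emb.shape, toy_test_emb.shape, toy_pred.shape)
```
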


In [None]:
# Select a perplexity for t-SNE and save it to selectedPerplexity
# Create a new TSNE model saved to `tsne_model` (the checks below expect at least 3000 iterations)
#        and use it to fit and transform the whole data (train and test together)
# Then manually select the rows of each group to create 2 embeddings: `tsne_train` and `tsne_test`
# Create a LinearSVC model with default settings fitted to the training embeddings
#       save it as tsne_svc
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Create a new Spectral Embedding model saved to `spectral_model`
#        and use it to fit and transform the whole data (train and test together)
# Then manually select the rows of each group to create 2 embeddings: `spectral_train` and `spectral_test`
# Create a LinearSVC model with default settings fitted to the training embeddings
#       save it as spectral_svc
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Create an embedding `lda_test` from the test data using the lda model you trained
# Create a LinearSVC model with default settings fitted to the training embeddings
#       save it as lda_svc
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert selectedPerplexity >= 2
assert selectedPerplexity <= 100
assert tsne_model
assert tsne_model.n_components == 2
assert tsne_model.perplexity == selectedPerplexity
assert tsne_model.n_iter >= 3000
assert tsne_train.shape == (455,2)
assert tsne_test.shape == (114,2)

assert spectral_model
assert spectral_model.n_components == 2
assert spectral_train.shape == (455,2)
assert spectral_test.shape == (114, 2)

assert lda_test.shape == (114, 1)

assert tsne_svc
assert tsne_svc.coef_.shape[1] == 2
assert spectral_svc
assert spectral_svc.coef_.shape[1] == 2
assert lda_svc
assert lda_svc.coef_.shape[1] == 1

In [None]:
print(f"The t-SNE embedding + Linear SVM scores an F-1 = {f1_score(y_test, tsne_svc.predict(tsne_test)):.3f}.")
print(f"The Spectral Embedding + Linear SVM scores an F-1 = {f1_score(y_test, spectral_svc.predict(spectral_test)):.3f}.")
print(f"The LDA + Linear SVM scores an F-1 = {f1_score(y_test, lda_svc.predict(lda_test)):.3f}.")

Can you think of why we get these scores for the respective models? What happens if we train the same Linear SVM model on all the features?

In [None]:
all_feat_score = f1_score(y_test, LinearSVC().fit(X_train, y_train).predict(X_test))
print(f"The Linear SVM with all features scores an F-1 = {all_feat_score:.3f}.")

## Feedback

In [None]:
def feedback():
    """Provide feedback on the contents of this exercise
    
    Returns:
        string
    """
    # YOUR CODE HERE
    raise NotImplementedError()