# Before you start:

- Comment as much as you can and use the resources
- Happy learning!

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Challenge 1 - Reading and Standardizing the Data

In this challenge we will work with image data and reduce its dimensions to create a two-dimensional plot. While we have not worked much with image data before, an image can be represented as a numpy array in which each pixel is assigned a numeric value. In this lesson, we will look at a dataset of cats and a dataset of dogs and check whether reducing them to two dimensions reveals if the data is separable. We will start by loading and processing the data. Run the cell below to load the two CSV files and convert them into numpy arrays.

## Read the files into dataframes named `dogs` and `cats`

In [None]:
#your code here
cats = np.array(pd.read_csv('../data/cats.csv'))
dogs = np.array(pd.read_csv('../data/dogs.csv'))

Next, we'll examine the shape of both the cats and the dogs arrays. Print out both shapes below.

In [None]:
# Your code here:
cats.shape, dogs.shape

This means that both arrays contain 80 images each (the number of rows). Each image consists of 64x64 pixels (a total of 4096 pixels per image). The images have been flattened so that all 4096 pixel values of an image sit in a single row.
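
To make the flattening concrete, here is a minimal sketch of the round trip between a square image and a single row. The random `image` array is a hypothetical stand-in for one picture, and row-by-row ordering is an assumption about how the CSV rows were produced.

```python
import numpy as np

# Hypothetical stand-in for one 64x64 grayscale image (values 0..255).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64))

# Flatten row by row into a single vector of 4096 pixel values,
# which is how each row of the CSV files is assumed to be laid out.
flat = image.reshape(-1)

# Reshape the vector back into a square image.
restored = flat.reshape(64, 64)

print(flat.shape)      # (4096,)
print(restored.shape)  # (64, 64)
```

The same `reshape(64, 64)` call is what turns a row of `cats` or `dogs` back into something `imshow` can draw.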

Print out row 0 of the cats dataframe to see what the pixel values look like.

In [None]:
# Your code here:
cats[0].reshape(64,64)

Using `matplotlib` we can plot a single cat or a single dog. We do this by reshaping the observation vector back into a square image and then using the `imshow` function.

Plot the image of the first cat and the first dog of your dataframes.

_Extra: you can play with `imshow`'s argument `cmap` to see which colors are suitable for visualization_

In [None]:
# your code here
plt.imshow(cats[0].reshape(64, 64))
plt.show()
plt.imshow(dogs[0].reshape(64, 64))
plt.show()

Now concatenate the cats and dogs dataframes. Make sure to put dogs first. This should result in a dataframe containing 160 observations and 4096 dimensions.

In [None]:
# your code here
arrays = np.concatenate([dogs, cats], axis = 0)

Next, we would like to standardize our data. 

To do that, we will use the `StandardScaler` class from the `sklearn.preprocessing` module.

Remember, we need to standardize the information for each pixel (which are the dimensions of our dataset) so that they can be compared in the PCA algorithm. Otherwise, the result would be dominated by the variable with the highest scale. 
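
As a rough illustration of what per-column standardization does, the sketch below reproduces the calculation by hand in NumPy on a small, made-up matrix. The data and the names `X`, `mean`, `std`, and `X_manual` are hypothetical. By default `StandardScaler` likewise subtracts each column's mean and divides by its population standard deviation.

```python
import numpy as np

# Hypothetical data: 5 observations, 3 features on very different scales.
X = np.array([
    [1.0, 100.0, 0.001],
    [2.0, 300.0, 0.004],
    [3.0, 200.0, 0.002],
    [4.0, 500.0, 0.003],
    [5.0, 400.0, 0.005],
])

# Standardize each column (feature): subtract its mean, then divide by
# its population standard deviation (ddof=0).
mean = X.mean(axis=0)
std = X.std(axis=0)
X_manual = (X - mean) / std

print(X_manual.mean(axis=0).round(6))  # approximately 0 for every column
print(X_manual.std(axis=0).round(6))   # approximately 1 for every column
```

After this step every column contributes on the same scale, so no single pixel dominates the PCA because of its raw range.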


In [None]:
# your code here
from sklearn.preprocessing import StandardScaler 

In [None]:
standard = StandardScaler()
X_std = standard.fit_transform(arrays)
X_std

## Bonus

After standardizing your data, try visualizing your image again. Does the standardization change your original image?

In [None]:
# your code here
plt.imshow(X_std[0].reshape(64, 64))
plt.show()
plt.imshow(X_std[80].reshape(64, 64))
plt.show()

# Challenge 2 - Using PCA

Now that we have created a standardized matrix of cats and dogs, we'll find the two most important components in the data. 

Import `PCA` from sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) and apply it to our standardized data. Create a PCA model with two components.

In [None]:
# Your code here:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(2)

In [None]:
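# Fit the model and project the data onto the two components in one step
X_pca = pca.fit_transform(X_std)

pc1 = X_pca[:, 0]  # scores on the first principal component
pc2 = X_pca[:, 1]  # scores on the second principal component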



After instantiating the `PCA` class and fitting it to the standardized data, the results of your principal component analysis are stored as attributes of the PCA object you created.
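
To give a feel for what those stored results are, here is a NumPy-only sketch of the underlying computation on random, made-up data. It is an illustration under stated assumptions, not sklearn's implementation. The local names `mean`, `components`, and `explained_variance` are chosen to mirror the fitted attributes `mean_`, `components_`, and `explained_variance_`, and the signs of the axes may differ from what sklearn returns.

```python
import numpy as np

# Hypothetical standardized data: 160 observations, 50 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(160, 50))

# PCA centers the data, then takes a singular value decomposition.
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)

n_components = 2
components = Vt[:n_components]         # principal axes, one per row
scores = (X - mean) @ components.T     # data projected onto the axes
explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)

print(components.shape)  # (2, 50)
print(scores.shape)      # (160, 2)
```

The two columns of `scores` play the role of `pc1` and `pc2` in this notebook.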

# Challenge 3 - Plotting the Data

Now that we have reduced our cats and dogs arrays, we can plot the data in a way that the human eye can understand in 2-D. We can look at this plot to see whether there are differences between the cat and dog images. 

In the cell below, create a pandas dataframe with the columns `pc1` and `pc2`, holding the first and second components of your results, respectively.

In [None]:
# Your code here:
df = pd.DataFrame({'pc1': pc1, 'pc2': pc2})
# Dogs were concatenated first, so rows 0-79 are dogs and rows 80-159 are cats
df['cat_dog'] = ['dog'] * 80 + ['cat'] * 80
df

Create a labels list. This list will be of size 160. The first 80 elements in the list will be the word `blue`, and the last 80 elements will be `red`. This will ensure that all dog observations will be colored in blue and all cats will be colored in red. Name this list `colors`.

In [None]:
# Your code here:
df['colors'] = df['cat_dog'].apply(lambda x: 'blue' if x == 'dog' else 'red')
df
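
The instructions ask for a plain list named `colors`, whereas the cell above stores the colors as a dataframe column. A minimal sketch of the list version, assuming the 80 dogs come first and the 80 cats second (the helper names `n_dogs` and `n_cats` are hypothetical):

```python
# 80 dogs come first in the concatenated data, followed by 80 cats.
n_dogs, n_cats = 80, 80
colors = ['blue'] * n_dogs + ['red'] * n_cats

print(len(colors))            # 160
print(colors[0], colors[-1])  # blue red
```

Such a list can be passed directly as `c=colors` to `plt.scatter`.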

Create a scatter plot of `pc1` and `pc2` below. Use the `colors` list above to color the observations by setting `c=colors`.

In [None]:
# Your code here:
plt.scatter('pc1', 'pc2', c = 'colors', data = df);

In [None]:
# or

# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x = 'pc1', y = 'pc2', hue = 'cat_dog', data = df)

Do you see a clear separation between cats and dogs? Write your answer below.

In [None]:
# Your conclusions here:


# Evaluate the results

Calculate how much of the variance your PCA results explain.

In [None]:
# Your code here:
pca.explained_variance_
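
`explained_variance_` reports variances in absolute units. Dividing by the total variance gives each component's share, which sklearn exposes as `explained_variance_ratio_`. The relation can be sketched in NumPy on made-up data, where the names `explained_variance` and `explained_variance_ratio` are local variables rather than sklearn attributes:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data whose columns have standard deviations 3, 2, 1 and 0.5.
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

Xc = X - X.mean(axis=0)
S = np.linalg.svd(Xc, compute_uv=False)

# Variance along each principal axis (absolute units).
explained_variance = S ** 2 / (X.shape[0] - 1)
# Share of the total variance captured by each axis (sums to 1 over all axes).
explained_variance_ratio = explained_variance / explained_variance.sum()

print(explained_variance[:2])
print(explained_variance_ratio[:2])
```

Because this sketch keeps all four axes, the ratios sum to 1. With only two components kept, as in this challenge, the ratios sum to the fraction of variance those two components retain.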

Explain with your own words what those values represent.

In [None]:
# your answer here:

# Bonus Challenge


Recreate your PCA using 20 components. You won't be able to visualize the results this time, but the idea here is to plot a cumulative sum of your explained variance results. Follow the same steps as before (i.e., create a dataframe containing the 20 components and so on).


What do you observe? How much of the information is retained after going from 4096 to 20 dimensions?

In [None]:
pca = PCA(20)

pca.fit(X_std)

In [None]:
pca.explained_variance_ratio_.sum()
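
The exercise asks for a cumulative sum rather than a single total. A minimal NumPy sketch of that running total follows. The data is made up, and the 80% threshold and the names `ratio`, `cumulative`, and `n_needed` are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 160 observations, 30 features with decaying variance.
X = rng.normal(size=(160, 30)) * np.linspace(5.0, 0.5, 30)

Xc = X - X.mean(axis=0)
S = np.linalg.svd(Xc, compute_uv=False)
ratio = S ** 2 / np.sum(S ** 2)

# Running total of explained variance for the first 20 components.
cumulative = np.cumsum(ratio[:20])

# Smallest number of components whose cumulative share reaches 80%, if any.
reached = np.nonzero(cumulative >= 0.8)[0]
n_needed = int(reached[0]) + 1 if reached.size else None

print(cumulative[-1])
print(n_needed)
```

With the fitted model from this notebook, the equivalent curve would be `np.cumsum(pca.explained_variance_ratio_)`, which can be passed to `plt.plot`.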

# Bonus Challenge 2

Apply the `.inverse_transform()` method to the dataframe with the 20 components and store your results. The inverse transform takes the results of the PCA (the reduced-dimension space) and maps them back to the original space (with 4096 dimensions). This will help you visualize how the PCA affected the original data.

In [None]:
# your code
# Project all observations onto the 20 components at once
df_pca20 = pd.DataFrame(pca.transform(X_std))

# Map the reduced data back to the 4096-dimensional (standardized) space
df_inverse = pd.DataFrame(pca.inverse_transform(df_pca20.to_numpy()))
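
Mechanically, the inverse transform multiplies the reduced scores by the principal axes and adds the mean back. The sketch below shows this in NumPy on random, made-up data. The function `reconstruct` is a hypothetical helper, not part of sklearn.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical standardized data: 160 observations, 40 features.
X = rng.normal(size=(160, 40))

mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)

def reconstruct(X, k):
    """Project X onto the first k principal axes and map it back."""
    components = Vt[:k]
    scores = (X - mean) @ components.T   # reduced representation
    return scores @ components + mean    # back in the original space

X_20 = reconstruct(X, 20)   # lossy: only 20 of the 40 axes are kept
X_all = reconstruct(X, 40)  # keeps every axis

print(X_20.shape)             # (160, 40)
print(np.allclose(X_all, X))  # True
```

Keeping every axis reproduces the data exactly, while keeping fewer yields an approximation of the same shape.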

Use `imshow` to plot the first row of the `inverse_transform` results. Compare it with the original image.

In [None]:
# your code
plt.imshow(cats[0].reshape(64, 64)) #original
plt.show()
plt.imshow(np.array(df_inverse.loc[80, :]).reshape(64, 64)) # PCA 20
plt.show()
plt.imshow(dogs[0].reshape(64, 64)) #original
plt.show()
plt.imshow(np.array(df_inverse.loc[0, :]).reshape(64, 64)) # PCA 20
plt.show()
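
Note that the reconstructed rows live in the standardized space, while `cats[0]` and `dogs[0]` hold raw pixel values. For a like-for-like comparison, the scaling can be undone first. With the fitted scaler from this notebook that would presumably be `standard.inverse_transform(...)`. The arithmetic is sketched here in NumPy on made-up data, with hypothetical names throughout.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical pixel data: 10 images of 16 pixels each, values 0..255.
pixels = rng.integers(0, 256, size=(10, 16)).astype(float)

# Standardize per pixel, as done before PCA.
mean = pixels.mean(axis=0)
std = pixels.std(axis=0)
standardized = (pixels - mean) / std

# Undo the standardization to return to the 0..255 pixel scale.
restored = standardized * std + mean

print(np.allclose(restored, pixels))  # True
```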

Change the number of components above to get a feel for how much information is retained (try 100 components).

In [None]:
pca = PCA(100)
pca.fit(X_std)

# Project all observations onto the 100 components at once
df_pca100 = pd.DataFrame(pca.transform(X_std))

# Map the reduced data back to the 4096-dimensional (standardized) space
df_inverse100 = pd.DataFrame(pca.inverse_transform(df_pca100.to_numpy()))

plt.imshow(cats[0].reshape(64, 64)) # Original
plt.show()
plt.imshow(np.array(df_inverse100.loc[80, :]).reshape(64, 64)) # PCA 100
plt.show()
plt.imshow(dogs[0].reshape(64, 64)) # Original
plt.show()
plt.imshow(np.array(df_inverse100.loc[0, :]).reshape(64, 64)) # PCA 100
plt.show()
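
The visual comparison can be complemented with a number. The sketch below measures the mean squared reconstruction error for several component counts on random, made-up data. The counts 2, 20, and 100 echo the ones used in this notebook, and the `errors` dictionary is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical standardized data: 160 observations, 120 features.
X = rng.normal(size=(160, 120))

mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)

errors = {}
for k in (2, 20, 100):
    components = Vt[:k]
    # Project onto k axes, then map back to the original space.
    X_hat = (X - mean) @ components.T @ components + mean
    errors[k] = float(np.mean((X - X_hat) ** 2))  # mean squared error

print(errors)
```

The error shrinks as more components are kept, which is the numeric counterpart of the images getting sharper.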

The results above demonstrate the power of PCA. It finds the combinations of your columns that preserve as much of the information as possible. So, although you lose some information, you effectively reduce the number of dimensions of your dataset. This can be important both for visualization and for understanding the importance of each feature.