# Day 1 Machine learning in Python - Exercises with answers

## Exercise 1

#### Question 1
##### Import the required packages.
##### Set the working directory to data directory.
##### Print the working directory and the plot directory.

#### Answer: 

In [2]:
import os
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import sklearn
from sklearn import cluster
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from scipy.cluster.vq import kmeans
from scipy.spatial.distance import cdist,pdist
from matplotlib import cm

In [3]:
# Set `home_dir` to the current user's home directory.
home_dir = Path.home()

# Set `main_dir` to the location of your `skillsoft-intro-to-machine-learning-in-python` folder.
main_dir = home_dir / "Desktop" / "skillsoft-intro-to-machine-learning-in-python"

# Make `data_dir` from the `main_dir` and remainder of the path to data directory.
data_dir = main_dir / "data"

# Create a plot directory to save our plots
plot_dir = main_dir / "plots"

In [4]:
# Set working directory.
os.chdir(data_dir)


In [None]:
# Check the working directory and the plot directory.
print(os.getcwd())
print(plot_dir)
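
`os.chdir` raises `FileNotFoundError` when the target folder does not exist. The following is a minimal sketch of creating the data and plot folders before changing into them. The `demo_*` names and the temporary folder are hypothetical stand-ins for `main_dir`, `data_dir`, and `plot_dir`, used only so the sketch runs anywhere.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical stand-in for `main_dir`; a temporary folder keeps the sketch self-contained.
demo_main_dir = Path(tempfile.mkdtemp()) / "demo-project"
demo_data_dir = demo_main_dir / "data"
demo_plot_dir = demo_main_dir / "plots"

# Remember where we started so the sketch leaves the session unchanged.
original_cwd = Path.cwd()

# Create both folders (and any missing parents); do nothing if they already exist.
for folder in (demo_data_dir, demo_plot_dir):
    folder.mkdir(parents=True, exist_ok=True)

# Changing directory now succeeds because the folder exists.
os.chdir(demo_data_dir)
print(Path.cwd())
print(demo_plot_dir)

# Return to the original working directory.
os.chdir(original_cwd)
```

With your own paths, the same `mkdir(parents=True, exist_ok=True)` call can be applied to `data_dir` and `plot_dir` before `os.chdir(data_dir)`.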

#### Question 2

##### Load the dataset `fast_food_data.csv` and save it as `ex_subset`.

##### Print the first few rows of `ex_subset` and its summary using `describe()`.

##### Drop all the non-numerical columns from `ex_subset` and print the first few rows again to see what the dataframe looks like.

#### Answer: 

In [None]:
ex_subset = pd.read_csv("fast_food_data.csv")

In [None]:
ex_subset.head()

In [None]:
ex_subset.describe()

In [None]:
ex_subset = ex_subset.drop(['Fast Food Restaurant','Item', 'Type'], axis = 1)

In [None]:
ex_subset.head()

#### Question 3

##### Check how many NAs are in each column and impute them with the column mean.

##### For clustering, we will be using just the `Calories` and `Sodium (mg)` columns. Drop all the other columns from `ex_subset` and name the new dataset as `ex_cluster`. 

##### Print the first few rows of `ex_cluster` to make sure we have the correct dataset.

#### Answer: 

In [None]:
print(ex_subset.isnull().sum())

In [None]:
ex_subset = ex_subset.fillna(ex_subset.mean())

In [None]:
ex_subset.isnull().sum()
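
To see what mean imputation does, here is a small worked sketch on a made-up dataframe (`toy` and its values are hypothetical, not taken from `fast_food_data.csv`). Each missing value is replaced by the mean of the non-missing values in its own column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "Calories": [100.0, np.nan, 300.0],
    "Sodium (mg)": [10.0, 20.0, np.nan],
})

# Count the missing values in each column.
print(toy.isnull().sum())

# `toy.mean()` skips NaNs, so the column means are 200.0 and 15.0.
toy_filled = toy.fillna(toy.mean())
print(toy_filled)
```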

In [None]:
ex_cluster = ex_subset[['Calories', 'Sodium (mg)']]
print(ex_cluster.head())

#### Question 4

##### In the dataset `ex_cluster`, check the data types for all of the columns. 
##### After making sure that all the data is numeric, scale the dataset and name it `ex_cluster_scaled`.
##### When the dataset is scaled, convert `ex_cluster_scaled` back to a pandas dataframe and make sure to name the columns again. 

##### Print out the first few rows of `ex_cluster_scaled` to make sure the column names are correct and are ready for clustering.

#### Answer: 

In [None]:
ex_cluster.dtypes

In [None]:
scaler = MinMaxScaler()
ex_cluster_scaled = scaler.fit_transform(ex_cluster)

In [1]:
ex_cluster_scaled = pd.DataFrame(ex_cluster_scaled, columns = ex_cluster.columns)
print(ex_cluster_scaled.head())
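
With its default settings, `MinMaxScaler` rescales each column to the range [0, 1] using `(x - min) / (max - min)`. The sketch below checks that formula by hand on a made-up dataframe (`toy` and its values are hypothetical).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data on two very different scales.
toy = pd.DataFrame({
    "Calories": [100.0, 250.0, 400.0],
    "Sodium (mg)": [200.0, 800.0, 500.0],
})

# Scale with scikit-learn; the result is a NumPy array without column names.
scaled = MinMaxScaler().fit_transform(toy)

# Apply the min-max formula by hand, column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

# Convert back to a dataframe and restore the column names.
toy_scaled = pd.DataFrame(scaled, columns=toy.columns)
print(toy_scaled)
print(np.allclose(scaled, manual.to_numpy()))
```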


#### Question 5

#####  Rename `ex_cluster_scaled` as `ex_kmeans`.
##### We will be using `ex_cluster_scaled` in other clustering models as well.
##### Plot the two variables from `ex_kmeans` to see their interactions. 
##### Plot `Sodium (mg)` as `y` and `Calories` as `x`.

#### Answer: 

In [None]:
# Copy `ex_cluster_scaled` to `ex_kmeans` so the scaled data stays unchanged for other models.
ex_kmeans = ex_cluster_scaled.copy()

In [None]:
# Plot the data.
plt.scatter(ex_kmeans['Calories'], ex_kmeans['Sodium (mg)'], label = 'True Position') 
plt.title('Calories vs Sodium')
plt.ylabel('Sodium (mg)')
plt.xlabel('Calories')

#### Question 6

##### Let's find an optimal K. 
##### Initialize the k-means with 2 clusters and name it `ex_kmeans_2`. 
##### Fit `ex_kmeans_2` with `ex_kmeans`. 
##### Predict the clusters with `ex_kmeans_2` and name the output `labels`. 
##### Get the cluster centers and name them `C_2`. 
##### Print `C_2` to see what it looks like. 


#### Answer: 

In [None]:
ex_kmeans_2 = KMeans(n_clusters = 2)

In [None]:
ex_kmeans_2 = ex_kmeans_2.fit(ex_kmeans)

In [None]:
labels = ex_kmeans_2.predict(ex_kmeans)

In [None]:
C_2 = ex_kmeans_2.cluster_centers_
print(C_2)
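
K-means starts from random centroids, so cluster numbering and centers can differ between runs. The sketch below, on made-up points (`toy` is hypothetical), fixes `random_state` for reproducibility and uses `fit_predict` to fit and label in one step. This is one possible variation, not a replacement for the answer above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of three points.
toy = np.array([
    [0.0, 0.0], [0.1, 0.1], [0.0, 0.2],
    [0.9, 1.0], [1.0, 0.9], [1.0, 1.0],
])

# `random_state` makes the random initialization repeatable.
model = KMeans(n_clusters=2, n_init=10, random_state=0)

# `fit_predict` fits the model and returns the label of each training point.
toy_labels = model.fit_predict(toy)
centers = model.cluster_centers_

print(toy_labels)
print(centers)
```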

#### Question 7

##### Plot the data with clusters colored in and each centroid plotted.

#### Answer: 

In [None]:
# First, we plot our clusters, colored in by the labels.
plt.scatter(ex_kmeans.iloc[:,0],            
            ex_kmeans.iloc[:,1], 
            c=ex_kmeans_2.labels_, 
            cmap='rainbow')

# Second, we plot the optimized centroids over the clusters.
plt.scatter(C_2[:, 0], 
            C_2[:, 1], 
            c='black', 
            s=200, 
            alpha=0.5)

## Exercise 2

#### Question 1

#####  Get the metrics we need for building an elbow plot.
##### The range for K should be from 1 to 20. 

#### Answer: 

In [None]:
# Set the range of k.
K_MAX = 20
KK = range(1, K_MAX + 1)

# Run `kmeans` for values in the range k = 1-20.
KM = [kmeans(ex_kmeans, k) for k in KK]

# Find the centroids for each KM output. 
centroids = [cent for (cent,var) in KM]

# For each k, calculate the distance from every point to every centroid,
# then find each point's nearest centroid and the distance to it.
D_k = [cdist(ex_kmeans, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D, axis = 1) for D in D_k]
dist = [np.min(D, axis = 1) for D in D_k]

tot_withinss = np.array([np.sum(d**2) for d in dist])           # Total within-cluster sum of squares for each k
totss = np.sum(pdist(ex_kmeans)**2) / ex_kmeans.shape[0]        # The total sum of squares
betweenss = totss - tot_withinss                                # The between-cluster sum of squares for each k
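
As a cross-check of the quantities above, the sketch below computes the same elbow metrics a different way: it uses scikit-learn's `KMeans` and its `inertia_` attribute (the within-cluster sum of squares) instead of `scipy.cluster.vq.kmeans`. It also verifies that the pairwise-distance formula for the total sum of squares equals the sum of squared deviations from the mean. The random `toy` data is a hypothetical stand-in for `ex_kmeans`.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

# Hypothetical stand-in for `ex_kmeans`: 60 random points in the unit square.
rng = np.random.default_rng(0)
toy = rng.random((60, 2))

# Total sum of squares, computed two equivalent ways.
totss_pairwise = np.sum(pdist(toy) ** 2) / toy.shape[0]
totss_centered = np.sum((toy - toy.mean(axis=0)) ** 2)

# Within-cluster sum of squares for k = 1..5, taken from `inertia_`.
ks = range(1, 6)
withinss = np.array([
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy).inertia_
    for k in ks
])

# Percentage of variance explained by the clustering at each k.
explained = (totss_centered - withinss) / totss_centered * 100
for k, pct in zip(ks, explained):
    print(k, round(pct, 1))
```

With a single cluster the only centroid is the overall mean, so the within-cluster sum of squares equals the total and the explained variance is essentially zero.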

#### Question 2
##### Build an elbow curve plot for KMeans clustering and find the optimal K. 

#### Answer: 

In [None]:
# Mark the chosen k on the elbow plot: index 2 in `KK` corresponds to k = 3.
kIdx = 2        # K=3
clr = cm.Spectral( np.linspace(0,1,10) ).tolist()
mrk = 'os^p<dvh8>+x.'

In [None]:
# Elbow curve - explained variance.
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(KK, betweenss/totss*100, 'b*-')
ax.plot(KK[kIdx], betweenss[kIdx]/totss*100, marker='o', markersize=12, 
        markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
ax.set_ylim((0,100))
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Percentage of variance explained (%)')
plt.title('Elbow for KMeans clustering')

#### Question 3
##### Now try the silhouette method to find the optimal number of `k`.

#### Answer:

In [None]:
from sklearn.metrics import silhouette_score

obs = ex_kmeans
silhouette_score_values = []

NumberOfClusters = range(2, 30)

for i in NumberOfClusters:
    # Fit k-means with `i` clusters and label every observation.
    classifier = cluster.KMeans(n_clusters=i, init='k-means++', n_init=10,
                                max_iter=300,
                                tol=0.0001,
                                verbose=0,
                                random_state=None,
                                copy_x=True)
    classifier.fit(obs)
    labels = classifier.predict(obs)
    # Compute the mean silhouette score once and store it.
    silhouette_score_values.append(silhouette_score(obs, labels, metric='euclidean'))

plt.plot(NumberOfClusters, silhouette_score_values)
plt.title("Silhouette score vs. number of clusters")
plt.show()     

Optimal_NumberOf_Components=NumberOfClusters[silhouette_score_values.index(max(silhouette_score_values))]

In [None]:
print("Optimal number of components is:", Optimal_NumberOf_Components)
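
To build intuition for the silhouette method, here is a minimal sketch on synthetic data where the right answer is known: three tight, well-separated blobs. The data, the blob centers, and the range of `k` are all made up for illustration. The score lies between -1 and 1, and higher values mean points sit closer to their own cluster than to the nearest other cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three tight blobs of 30 points each.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
toy = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in blob_centers])

# Mean silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    toy_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy)
    scores[k] = silhouette_score(toy, toy_labels)

# The k with the highest score is the silhouette method's choice.
best_k = max(scores, key=scores.get)
print(scores)
print("Best k:", best_k)
```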

#### Question 4
##### Print the explained variance for both k = 2 and the optimal k and compare.

#### Answer: 

In [None]:
# Explained variance at `k = 2`.
print(betweenss[1]/totss*100)

In [None]:
# Explained variance at the optimal number of clusters, `k = 3`.
print(betweenss[2]/totss*100)

#### Question 5
##### Initialize a new k-means model with K = 3, the optimal number of clusters, and name it `ex_kmeans_3` (that is, `ex_kmeans_K` with K = 3).
##### Fit `ex_kmeans_3`, use the model to predict clusters, and store them in `labels`. 

##### Plot a scatterplot with the optimal number of clusters shown in different colors.
##### Plot the optimized centroids over the clusters.


#### Answer: 

In [None]:
ex_kmeans_3 = KMeans(n_clusters = 3)
ex_kmeans_3 = ex_kmeans_3.fit(ex_kmeans)
labels = ex_kmeans_3.predict(ex_kmeans)
C_3 = ex_kmeans_3.cluster_centers_

In [None]:
plt.scatter(ex_kmeans.iloc[:,0],            
            ex_kmeans.iloc[:,1], 
            c = ex_kmeans_3.labels_, 
            cmap = 'rainbow')


plt.scatter(C_3[:, 0], 
            C_3[:, 1], 
            c = 'black', 
            s = 200, 
            alpha = 0.5)

#### Question 6
##### Create a new dataframe named `clustered_ex` and populate with all the columns from `ex_cluster_scaled`. 
##### Append the predicted clusters, `labels`, to the `clustered_ex` dataframe as a new column named `cluster`. 
##### Print the `clustered_ex` dataframe to inspect the clusters.

#### Answer: 

In [None]:
clustered_ex = ex_cluster_scaled.copy()
clustered_ex['cluster'] = pd.Series(labels)

In [None]:
clustered_ex.head()

#### Question 7
##### Group the `clustered_ex` dataframe by `cluster` to see the group mean of each variable.
##### Name the new dataframe as `ex_cluster_groups_means`. 
##### Print `ex_cluster_groups_means` to inspect each cluster.


#### Answer: 

In [None]:
ex_cluster_groups_means = clustered_ex.groupby('cluster').mean()
ex_cluster_groups_means
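
The group means above are on the scaled [0, 1] range, which can be hard to interpret. One option, sketched below on made-up data, is to map them back to the original units with the scaler's `inverse_transform`. This works for means because min-max scaling is a linear rescaling of each column. The `toy` values and the `toy_*` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: three low-calorie and three high-calorie items.
toy = pd.DataFrame({
    "Calories": [100.0, 120.0, 110.0, 700.0, 720.0, 710.0],
    "Sodium (mg)": [200.0, 220.0, 210.0, 1500.0, 1520.0, 1510.0],
})

# Scale, cluster, and attach the cluster labels.
toy_scaler = MinMaxScaler()
toy_scaled = pd.DataFrame(toy_scaler.fit_transform(toy), columns=toy.columns)
toy_scaled["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy_scaled)

# Group means on the scaled data.
scaled_means = toy_scaled.groupby("cluster").mean()

# Map the scaled means back to the original units.
original_means = pd.DataFrame(
    toy_scaler.inverse_transform(scaled_means),
    columns=toy.columns,
    index=scaled_means.index,
)
print(original_means)
```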