# Module 20 - Unsupervised Machine Learning Challenge

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

## Part 1: Prepare the Data

* Reads the csv into pandas
* Previews the DataFrame
* Removes the MYOPIC column from the dataset
* Standardizes the dataset using a scaler
* Names the resulting DataFrame X

In [2]:
# Reading the CSV into pandas, and previewing the DataFrame
df = pd.read_csv("./Resources/myopia.csv")
df.head()

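|   | AGE | SPHEQ | AL | ACD | LT | VCD | SPORTHR | READHR | COMPHR | STUDYHR | TVHR | DIOPTERHR | MOMMY | DADMY | MYOPIC |
|---|-----|-------|----|-----|----|-----|---------|--------|--------|---------|------|-----------|-------|-------|--------|
| 0 | 6 | -0.052 | 21.889999 | 3.69 | 3.498 | 14.7 | 45 | 8 | 0 | 0 | 10 | 34 | 1 | 1 | 1 |
| 1 | 6 | 0.608 | 22.379999 | 3.702 | 3.392 | 15.29 | 4 | 0 | 1 | 1 | 7 | 12 | 1 | 1 | 0 |
| 2 | 6 | 1.179 | 22.49 | 3.462 | 3.514 | 15.52 | 14 | 0 | 2 | 0 | 10 | 14 | 0 | 0 | 0 |
| 3 | 6 | 0.525 | 22.200001 | 3.862 | 3.612 | 14.73 | 18 | 11 | 0 | 0 | 4 | 37 | 0 | 1 | 1 |
| 4 | 5 | 0.697 | 23.290001 | 3.676 | 3.454 | 16.16 | 14 | 0 | 0 | 0 | 4 | 4 | 1 | 0 | 0 |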


In [3]:
# Checking whether the dataset contains null values or duplicated rows
print("Null values: ", df.isna().sum().sum())
# df.duplicated() returns a boolean Series, so a single .sum() counts the duplicated rows
print("Duplicated values: ", df.duplicated().sum())

Null values:  0
Duplicated values:  0


In [4]:
# Removing the "MYOPIC" column from the dataset, since it is not needed for an unsupervised model
df = df.drop('MYOPIC', axis=1)

In [5]:
# Standardising the dataset using StandardScaler
X_scaled = StandardScaler().fit_transform(df)

In [7]:
# Displaying the resulting scaled data (a NumPy array, not a DataFrame)
X_scaled

array([[-4.20219106e-01, -1.36391690e+00, -8.92861464e-01, ...,
         4.98303926e-01,  9.87137728e-01,  1.00324150e+00],
       [-4.20219106e-01, -3.08612235e-01, -1.71839800e-01, ...,
        -8.75087555e-01,  9.87137728e-01,  1.00324150e+00],
       [-4.20219106e-01,  6.04386289e-01, -9.97682023e-03, ...,
        -7.50233784e-01, -1.01302987e+00, -9.96768974e-01],
       ...,
       [-4.20219106e-01,  1.65169621e+00,  6.52187361e-01, ...,
        -1.37450264e+00,  9.87137728e-01,  1.00324150e+00],
       [-4.20219106e-01, -2.17472219e-01, -8.48716244e-01, ...,
        -1.88391815e-01, -1.01302987e+00, -9.96768974e-01],
       [-4.20219106e-01,  1.58339808e-03, -3.48415042e-01, ...,
        -7.50233784e-01,  9.87137728e-01,  1.00324150e+00]])

## Part 2: Apply Dimensionality Reduction

* A PCA model is created and used to reduce dimensions of the scaled dataset
* The PCA model’s explained variance is set to 90% (0.9)
* The shape of the reduced dataset is examined for reduction in number of features
* A t-SNE model is created and used to reduce dimensions of the scaled dataset
* t-SNE is used to create a plot of the reduced features
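
The steps listed above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual cell: it uses synthetic random data (`X_demo`, a hypothetical stand-in with 14 columns to mirror the feature count after dropping `MYOPIC`) instead of `X_scaled`, and the `random_state` values are arbitrary choices made only for reproducibility.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Assumption: synthetic stand-in for the scaled myopia features (200 rows x 14 columns).
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(200, 14)))

# A float n_components between 0 and 1 keeps the fewest principal components
# whose cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.9)
X_pca = pca.fit_transform(X_demo)

# Compare shapes to see how many features were removed.
print("Shape before PCA:", X_demo.shape)
print("Shape after PCA: ", X_pca.shape)
print("Explained variance retained:", pca.explained_variance_ratio_.sum())

# Reduce the PCA output further to two dimensions with t-SNE for plotting.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_tsne = tsne.fit_transform(X_pca)

# Scatter plot of the two t-SNE dimensions; distinct clusters would show up as separate groups.
fig, ax = plt.subplots()
ax.scatter(X_tsne[:, 0], X_tsne[:, 1], s=10)
ax.set_xlabel("t-SNE dimension 1")
ax.set_ylabel("t-SNE dimension 2")
ax.set_title("t-SNE projection of PCA-reduced features")
```

In the notebook itself, `X_demo` would presumably be replaced by `X_scaled`. Because the synthetic data here is pure noise, the plot is not expected to show meaningful groups.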

## Part 3: Perform a Cluster Analysis with K-means

* A K-means model is created
* A `for` loop is used to create a list of inertias for each `k` from 1 to 10, inclusive
* A plot is created to examine any elbows that exist
* States a brief (1-2 sentence) conclusion on whether patients can be clustered together, and supports it with findings
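
One way the inertia loop and elbow plot described above might look is sketched below. It is an illustration under stated assumptions: the data (`X_demo`) is three synthetic, well-separated blobs rather than the reduced myopia features, and `n_init=10` and `random_state=0` are arbitrary settings chosen for repeatability.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Assumption: three synthetic 2-D blobs stand in for the reduced feature set.
rng = np.random.default_rng(0)
X_demo = np.vstack(
    [rng.normal(loc=center, scale=0.5, size=(100, 2)) for center in (-5, 0, 5)]
)

# Fit one K-means model per k from 1 to 10 (inclusive) and record its inertia,
# i.e. the sum of squared distances from each sample to its nearest centroid.
k_values = list(range(1, 11))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X_demo)
    inertias.append(model.inertia_)

# Elbow plot: the k where the curve stops dropping sharply is a candidate cluster count.
fig, ax = plt.subplots()
ax.plot(k_values, inertias, marker="o")
ax.set_xticks(k_values)
ax.set_xlabel("Number of clusters (k)")
ax.set_ylabel("Inertia")
ax.set_title("Elbow curve")
```

With these synthetic blobs the curve should bend sharply around `k = 3`. On real data the elbow may be much less pronounced, which would itself be evidence that the patients do not cluster cleanly.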

## Part 4: Make a Recommendation