# Data Analysis using KMeans Clustering with Artificial Data (Project 1)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np

## Create some Data


In [None]:
from sklearn.datasets import make_blobs

In [None]:
# Create data: with return_centers=True, make_blobs returns (X, y, centers)
data = make_blobs(n_samples=200, n_features=2, centers=4,
                  cluster_std=1.8, random_state=101, return_centers=True)

In [None]:
data

## Visualize Data

In [None]:
datadf = pd.DataFrame(data[0], columns = ['feature1', 'feature2'])

In [None]:
datadf

In [None]:
plt.scatter(data = datadf, x = 'feature1', y = 'feature2',c=data[1],cmap='rainbow')

## Creating the Clusters


In [None]:
from sklearn.cluster import KMeans

In [None]:
kmeans = KMeans(n_clusters=4)

In [None]:
kmeans.fit(datadf)

In [None]:
kmeans.cluster_centers_

In [None]:
kmeans.labels_

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True,figsize=(10,6))
ax1.set_title('K Means')
ax1.scatter(data = datadf,x= 'feature1', y='feature2',c=kmeans.labels_,cmap='rainbow')
ax1.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], c='black', marker = 'x' , s = 100)
ax2.set_title("Original")
ax2.scatter(data = datadf, x = 'feature1', y = 'feature2',c=data[1],cmap='rainbow')
ax2.scatter(data[2][:,0], data[2][:,1],c='black', marker = 'x' , s = 100)

In the plots above, the K-means clusters closely resemble the original groups, and the fitted cluster centers (black crosses) appear close to the true centers. This suggests that K-means recovered most of the structure that `make_blobs` generated. The match is not expected to be exact: with `cluster_std=1.8` the blobs overlap, so points near a boundary can be assigned to a neighboring cluster. The colors may also differ between the two panels, because K-means numbers its clusters arbitrarily.
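
The visual comparison can be backed by a number. The sketch below is an addition, not part of the original analysis: it regenerates blobs with the same settings and computes the adjusted Rand index between the true blob labels and the K-means labels. The `n_init` and `random_state` arguments passed to `KMeans` are assumptions added for reproducibility.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Same blob settings as above; return_centers is omitted because only X and y are needed
X, y_true = make_blobs(n_samples=200, n_features=2, centers=4,
                       cluster_std=1.8, random_state=101)

# n_init and random_state are assumptions, added so the result is repeatable
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# The adjusted Rand index is 1.0 for identical partitions and near 0.0 for random ones.
# It does not depend on how the cluster ids are numbered.
ari = adjusted_rand_score(y_true, model.labels_)
print(f"Adjusted Rand index: {ari:.3f}")
```

A value close to 1 would support the visual impression; a noticeably lower value would point to overlap between the blobs.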

___
# K Means Clustering Project using Real Data (Project 2)

For this project we will attempt to use KMeans Clustering to cluster universities into two groups, Private and Public.

___
It is **very important to note, we actually have the labels for this data set, but we will NOT use them for the KMeans clustering algorithm, since that is an unsupervised learning algorithm.** 

KMeans is normally used precisely because no labels are available. In this case we will use the labels only afterwards, to get an idea of how well the algorithm performed; that check is usually not possible with KMeans in practice.


## The Data

We will use a data frame with 777 observations on the following 18 variables.
* Private A factor with levels No and Yes indicating private or public university
* Apps Number of applications received
* Accept Number of applications accepted
* Enroll Number of new students enrolled
* Top10perc Pct. new students from top 10% of H.S. class
* Top25perc Pct. new students from top 25% of H.S. class
* F.Undergrad Number of fulltime undergraduates
* P.Undergrad Number of parttime undergraduates
* Outstate Out-of-state tuition
* Room.Board Room and board costs
* Books Estimated book costs
* Personal Estimated personal spending
* PhD Pct. of faculty with Ph.D.’s
* Terminal Pct. of faculty with terminal degree
* S.F.Ratio Student/faculty ratio
* perc.alumni Pct. alumni who donate
* Expend Instructional expenditure per student
* Grad.Rate Graduation rate

## Importing Libraries

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np

## Data

In [None]:
data = pd.read_csv('College_Data', index_col=0)

In [None]:
data.head()

**Checking the data**

In [None]:
data.info()

In [None]:
data.describe()

## Exploratory Data Analysis

It's time to create some data visualizations!

**We are going to create some plots of different features to see how the data is distributed with respect to the Private column.**

In [None]:
sns.set_style('whitegrid')
sns.scatterplot(x= 'Grad.Rate', y='Room.Board', data = data, hue = 'Private')

In [None]:
sns.scatterplot(x = 'Outstate', y='F.Undergrad', data = data, hue = 'Private')

In [None]:
sns.countplot(data = data, x = 'Private')

In [None]:

g= sns.FacetGrid(data, hue='Private',palette='coolwarm',aspect=2,height=6)
g.map(plt.hist,'Outstate',alpha = 0.7, bins= 20)

In [None]:
g= sns.FacetGrid(data, hue='Private',palette='coolwarm',aspect=2,height=6)
g.map(plt.hist,'Grad.Rate',alpha = 0.7, bins= 20)

**Notice how there seems to be a private school with a graduation rate higher than 100%.**

In [None]:
data[data['Grad.Rate']>100]

**We are going to set that school's graduation rate to 100 so it makes sense.**

In [None]:
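# Use .loc with (row label, column label) so the value is written to the DataFrame itself;
# chained indexing like data['Grad.Rate'][...] may modify a temporary copy instead
data.loc['Cazenovia College', 'Grad.Rate'] = 100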

In [None]:
data[data['Grad.Rate']>100]
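
If more than one row had an impossible value, fixing each school by name would not scale. The sketch below shows a single-cell update with `.loc` and, as an alternative, capping the whole column with `Series.clip`. The mini-frame and the index labels `College A`, `College B`, and `College C` are made up for illustration.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the college data
toy = pd.DataFrame({"Grad.Rate": [60, 118, 100]},
                   index=["College A", "College B", "College C"])

# Single-cell fix: .loc[row label, column label] writes into the frame itself
toy.loc["College B", "Grad.Rate"] = 100

# Alternative: cap every value in the column at 100 in one step
toy["Grad.Rate"] = toy["Grad.Rate"].clip(upper=100)
print(toy)
```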

In [None]:
g= sns.FacetGrid(data, hue='Private',palette='coolwarm',aspect=2,height=6)
g.map(plt.hist,'Grad.Rate',alpha = 0.7, bins= 20)

## K Means Cluster Creation

Now it is time to create the Cluster labels!


In [None]:
from sklearn.cluster import KMeans

In [None]:
kmeans1 = KMeans(n_clusters=2)

**Fit the model to all the data except for the Private label.**

In [None]:
kmeans1.fit(data.drop('Private',axis =1))
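
KMeans groups points by Euclidean distance, so columns with large values (such as Apps or Expend) can outweigh columns with small ranges (such as S.F.Ratio). The notebook fits on the raw features. As a possible variation, the sketch below standardizes the features first, using a small made-up frame in place of the real data. The values, `n_init`, and `random_state` are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the college data (values are made up)
toy = pd.DataFrame({
    "Private": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "Apps": [1200, 900, 15000, 22000, 1500, 18000],
    "S.F.Ratio": [11.0, 12.5, 19.0, 21.0, 10.5, 18.5],
})

# Drop the label column, exactly as in the cell above
features = toy.drop("Private", axis=1)

# Scale each column to zero mean and unit variance, then cluster
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
scaled_labels = pipeline.fit_predict(features)
print(scaled_labels)
```

Whether scaling improves the clusters on the real data is not something this sketch establishes; it only shows where the step would go.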

**Cluster centers**

In [None]:
kmeans1.cluster_centers_

In [None]:
kmeans1.labels_

## Evaluation

There is no perfect way to evaluate clustering if you don't have the labels. Since this is just an exercise, we do have the labels, so we take advantage of them to evaluate our clusters. In the real world these labels would usually not be available.
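
For the case where labels really are missing, one common label-free check is the silhouette score, which compares how close each point is to its own cluster versus the nearest other cluster. The sketch below is an addition and runs on synthetic data from `make_blobs`, not the college data. The feature count, the range of `k`, `n_init`, and `random_state` are all assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption); with the real data, X would be
# data.drop('Private', axis=1)
X, _ = make_blobs(n_samples=300, centers=2, n_features=5, random_state=0)

# Silhouette ranges from -1 to 1; higher means tighter, better separated clusters
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
```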

**Create a new column for data called 'cluster', which is 1 for a private school and 0 for a public school.**

In [None]:
def convertor(private):
    if private == 'Yes':
        return 1
    else:
        return 0

In [None]:
data['cluster'] = data['Private'].apply(convertor)
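
The same encoding can also be written without a helper function. This is a minimal sketch on a made-up three-row frame, using `Series.map` with a dictionary:

```python
import pandas as pd

# Hypothetical three-row frame with the same 'Private' values as the real data
toy = pd.DataFrame({"Private": ["Yes", "No", "Yes"]})

# Map each category to its integer code; values missing from the dict would become NaN
toy["cluster"] = toy["Private"].map({"Yes": 1, "No": 0})
print(toy)
```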

In [None]:
data
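
With the true labels encoded as 0 and 1, the cluster assignments can be compared against them numerically. One catch is that KMeans numbers its clusters arbitrarily, so cluster 0 may correspond to private schools. The sketch below uses made-up label arrays and a hypothetical helper, `align`, that flips the binary cluster ids when the flipped version agrees better. With the real data, `true_labels` would be `data['cluster']` and `cluster_ids` would be `kmeans1.labels_`.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Made-up example arrays (1 = private, 0 = public)
true_labels = np.array([1, 1, 1, 0, 0, 1, 0, 1])
cluster_ids = np.array([0, 0, 0, 1, 1, 0, 0, 0])


def align(cluster_ids, true_labels):
    """Flip binary cluster ids if the flipped version matches the labels better."""
    agreement = np.mean(cluster_ids == true_labels)
    return cluster_ids if agreement >= 0.5 else 1 - cluster_ids


aligned = align(cluster_ids, true_labels)

# Rows are true classes, columns are (aligned) cluster ids
print(confusion_matrix(true_labels, aligned))
print(classification_report(true_labels, aligned))
```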

**The plot below compares the KMeans clusters with the original labels using only two features, Enroll and Grad.Rate.**

Because the model was fitted on many features, a single two-dimensional plot cannot show the full clustering. In this view the two panels look broadly similar, which suggests that KMeans captures part of the private/public split; other features might show the separation more clearly. KMeans numbers its clusters arbitrarily, so the colors in the two panels may be swapped.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True,figsize=(10,6))
ax1.set_title('K Means')
ax1.scatter(data = data,x= 'Enroll', y='Grad.Rate',c=kmeans1.labels_,cmap='rainbow')
ax2.set_title("Original")
ax2.scatter(data = data, x = 'Enroll', y = 'Grad.Rate',c='cluster',cmap='rainbow')
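
One way to look at all features at once is to project them onto two principal components and plot those instead of two hand-picked columns. The sketch below is an addition and uses synthetic data with 17 numeric features in place of the college data. Scaling before PCA, `n_init`, and `random_state` are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 17 numeric features (assumption)
X, _ = make_blobs(n_samples=200, centers=2, n_features=17, random_state=0)

# Standardize so that no single feature dominates the projection
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# Project onto the two directions of largest variance
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)

print(coords.shape)
print(pca.explained_variance_ratio_)
```

The columns `coords[:, 0]` and `coords[:, 1]` can then be passed to `plt.scatter` with `c=labels`, in the same way as the two-feature plots above.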
