
#### Problem Statement

Write a program to implement the k-means clustering technique using any standard dataset available in the public domain.

---

### Dataset Description

In this project, we will use a dataset containing carbon dioxide emission figures for different car models.

The dataset includes 32 instances with 5 columns, which are summarised below:

|Column|Description|
|-|-|
|Car|Brand of the car|
|Model|Model of the car|
|Volume|Total space available inside the car (in $litres$)|
|Weight|Total weight of the car (in $kg$)|
|$CO_2$|Total emission of carbon dioxide from the car|



**Note:** *This is a manually created custom dataset for this project.*

---

### List of Activities

**Activity 1:** Import Modules and Read Data

**Activity 2:** Data Cleaning
  
**Activity 3:** Find Optimal Value of `K`

**Activity 4:** Plot Silhouette Scores

#### Activity 1: Import Modules and Read Data

Import the necessary Python modules along with the following modules:

 - `KMeans` - For clustering using K-means.

 - `re` - To remove unwanted rows using regex.

Read the data from a CSV file to create a Pandas DataFrame and go through the necessary data-cleaning process (if required).

**Dataset link:** https://raw.githubusercontent.com/jiss-sngce/CO_3/main/jkcars.csv

In [None]:
# Import the modules and Read the data.
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/jiss-sngce/CO_3/main/jkcars.csv")
# Print the first five records
df.head()

Unnamed: 0,Car,Model,Volume,Weight,CO2
0,Mitsubishi,Space Star,1200,1160,95
1,Skoda,Citigo,1000,929,95
2,Fiat,500,900,865,90
3,Mini,Cooper,1500,1140,105
4,VW,Up!,1000,929,105


In [None]:
# Get the total number of rows and columns, then keep only the three numeric columns used for clustering.
print(df.shape)
new_df = df[['Volume','Weight','CO2']]
new_df.head()

(32, 5)


Unnamed: 0,Volume,Weight,CO2
0,1200,1160,95
1,1000,929,95
2,900,865,90
3,1500,1140,105
4,1000,929,105
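
The cell comment above also mentions data types and missing values, which the cell itself does not check. A minimal sketch of those checks follows, assuming the five rows printed by `df.head()` as a stand-in for the full dataset; the name `sample` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Stand-in for `df`: the five rows printed by df.head() above (assumption,
# used so the sketch runs without downloading the CSV).
sample = pd.DataFrame({
    "Car": ["Mitsubishi", "Skoda", "Fiat", "Mini", "VW"],
    "Model": ["Space Star", "Citigo", "500", "Cooper", "Up!"],
    "Volume": [1200, 1000, 900, 1500, 1000],
    "Weight": [1160, 929, 865, 1140, 929],
    "CO2": [95, 95, 90, 105, 105],
})

# Data type of every column.
print(sample.dtypes)

# Number of missing values per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Number of fully duplicated rows, a common data-cleaning check.
duplicate_rows = int(sample.duplicated().sum())
print(duplicate_rows)
```

On the real data, the same calls can be applied to `df` directly.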





#### Activity 3: Find Optimal Value of `K`

In this activity, you need to find the optimal value of `K` using the silhouette score.
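
For reference, the usual definition of the silhouette coefficient of a single sample $i$ is given below. The symbols are not from this notebook: $a(i)$ denotes the mean distance from $i$ to the other points in its own cluster, and $b(i)$ the mean distance from $i$ to the points of the nearest other cluster.

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
$$

The `silhouette_score` function returns the mean of $s(i)$ over all samples, so values closer to $1$ indicate compact, well-separated clusters.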

1. Create a subset of the dataset consisting of three columns, i.e., `Volume`, `Weight`, and `CO2`.



In [None]:
# Create a new DataFrame consisting of three columns 'Volume', 'Weight', 'CO2'.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
new_df = df[['Volume', 'Weight', 'CO2']]
# Print the first 5 rows of this new DataFrame.
new_df.head()


2. Compute K-Means clustering for the 3D dataset `new_df` by varying `K` from `2` to `10` clusters. Also, for each `K`, calculate the silhouette score using the `silhouette_score` function.

 **Steps to Follow**

 - Create an empty list to store silhouette scores obtained for each `K` (let's say `sil_scores`).

 - Initiate a `for` loop that ranges from 2 to 10.

 - Perform K-means clustering for the current value of `K` inside the `for` loop.

 - Use `fit()` and `predict()` to create clusters.

 - Calculate the silhouette score for the current `K` value using the `silhouette_score()` function and append it to the list `sil_scores`.

 - Create a DataFrame with two columns. The first column must contain `K` values from 2 to 10 and the second column must contain silhouette values obtained after the `for` loop.



In [None]:
# Calculate silhouette scores for different values of 'K'.
# Create an empty list to store silhouette scores obtained for each 'K'
sil_scores = []
cluster = range(2, 11)
for k in cluster:
  # Fit K-means for the current 'K' and predict a cluster label for every row
  kmean_k = KMeans(n_clusters=k, random_state=1)
  kmean_k.fit(new_df)
  cluster_labels = kmean_k.predict(new_df)
  # Silhouette score for the current 'K'
  s = silhouette_score(new_df, cluster_labels)
  sil_scores.append(s)
# Create a DataFrame of 'K' values and their silhouette scores
sil_data = pd.DataFrame({'Clusters': cluster, "silhouette Score": sil_scores})
sil_data

**Q**: What are the maximum silhouette score and the corresponding cluster value?

**A**:
- Maximum silhouette score=
- Corresponding cluster value=
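
One way to read off the maximum score and its `K` is `idxmax()` on the score column. The sketch below is a self-contained illustration on synthetic data, not on the car dataset: the blob centres, the range of `K`, and the names `demo_scores`, `best_k`, and `best_score` are all assumptions made for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated 2D blobs (not the car dataset).
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

# Silhouette score for each K from 2 to 6.
k_values = list(range(2, 7))
scores = []
for k in k_values:
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores.append(silhouette_score(X, labels))

demo_scores = pd.DataFrame({"Clusters": k_values, "silhouette Score": scores})

# Row with the highest silhouette score.
best_row = demo_scores.loc[demo_scores["silhouette Score"].idxmax()]
best_k = int(best_row["Clusters"])
best_score = float(best_row["silhouette Score"])
print(best_k, round(best_score, 3))
```

In the notebook, the same `idxmax()` lookup can be applied to the `sil_data` DataFrame built in the previous cell.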

---

#### Activity 4: Plot Silhouette Scores to Find Optimal Value of `K`

Create a line plot with `K` ranging from `2` to `10` on the $x$-axis and the silhouette scores stored in the `sil_scores` list on the $y$-axis.

In [None]:
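# Plot silhouette scores vs number of clusters.
import matplotlib.pyplot as plt
x = sil_data['Clusters']
plt.plot(x, sil_data['silhouette Score'], marker='o')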

**Q:** Write your observations of the graph.

**A:** From the graph, we can conclude that the optimal value of `K` is 3.

In [None]:
# Clustering the dataset for K = 3

# Perform K-Means clustering with n_clusters = 3 and random_state = 10



# Fit the model to the scaled_df

# Make a series using predictions by K-Means
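
The steps listed in the cell above could be sketched as follows. This is only an illustration under stated assumptions: `scaled_df` is never defined in this notebook, so the sketch fits on an unscaled stand-in built from the five rows shown by `new_df.head()`, and the names `features`, `kmeans_3`, and `cluster_series` are hypothetical. Passing `n_init=10` explicitly is also a choice made here, not something the notebook specifies.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in for the three-column frame: the five rows shown by new_df.head().
features = pd.DataFrame({
    "Volume": [1200, 1000, 900, 1500, 1000],
    "Weight": [1160, 929, 865, 1140, 929],
    "CO2": [95, 95, 90, 105, 105],
})

# K-means with the settings named in the cell comments.
kmeans_3 = KMeans(n_clusters=3, n_init=10, random_state=10)
kmeans_3.fit(features)

# Series of predicted cluster labels, aligned with the rows of `features`.
cluster_series = pd.Series(kmeans_3.predict(features),
                           index=features.index, name="Cluster")
print(cluster_series.value_counts())
```

If the features are standardised first (for example with `StandardScaler`), the fit would be applied to the scaled frame instead.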


In [None]:
# Create a DataFrame with cluster labels for cluster visualisation
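
A possible shape for this step, again as a hedged sketch: the stand-in rows come from the `new_df.head()` output, and `features`, `clustered`, and `cluster_means` are hypothetical names. The labels are attached to a copy, so the original feature frame stays unchanged.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in for the three-column frame: the five rows shown by new_df.head().
features = pd.DataFrame({
    "Volume": [1200, 1000, 900, 1500, 1000],
    "Weight": [1160, 929, 865, 1140, 929],
    "CO2": [95, 95, 90, 105, 105],
})

# Cluster labels for K = 3 (n_init is set explicitly as an assumption).
labels = KMeans(n_clusters=3, n_init=10, random_state=10).fit_predict(features)

# Copy the features and attach the label of each row as a new column.
clustered = features.assign(Cluster=labels)
print(clustered)

# Per-cluster means give a quick numeric view before plotting.
cluster_means = clustered.groupby("Cluster").mean()
print(cluster_means)
```

The `Cluster` column can then be used to colour points in a scatter plot.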


---