## Exercise 1:
#### Load the data.csv file into the DataFrame.

#### Then, implement the K-Means algorithm to split the given data into two clusters. Specify the centroid of each cluster and print its coordinates to the console. Round the result to three decimal places for each coordinate.

#### Steps:

- determine the ranges of values for the variables x1 and x2

- randomly select the centroid from the calculated intervals

- assign points to the nearest centroid

- calculate new centroid (as the arithmetic mean of the coordinates of the points in one cluster)

- go back to step 3 and repeat until converge (10 iterations is enough)



In [1]:
import numpy as np
from numpy.linalg import norm
import pandas as pd
import random


np.random.seed(42)

## Reading Data
df = pd.read_csv('data.csv')

## Determining min max of x1 and x2.
x1_min = df.x1.min()
x1_max = df.x1.max()
 
x2_min = df.x2.min()
x2_max = df.x2.max()
 
## Defining two centroids (random.uniform means that centroids will unlikely be the same.)
centroid_1 = np.array(
    [
        random.uniform(x1_min, x1_max),
        random.uniform(x2_min, x2_max),
    ]
)
centroid_2 = np.array(
    [
        random.uniform(x1_min, x1_max),
        random.uniform(x2_min, x2_max),
    ]
)

## Returning only the values in the df as an array.
data = df.values
 
## Loops and assigns the df values (point) to a cluster where the distance is the shortest from the centroid.
for i in range(10):
    clusters = []
    for point in data:
        centroid_1_dist = norm(centroid_1 - point)
        centroid_2_dist = norm(centroid_2 - point)
        cluster = 1
        if centroid_1_dist > centroid_2_dist:
            cluster = 2
        clusters.append(cluster)

## Adds cluster assigned to the df.
    df['cluster'] = clusters

## Updating centroids to the mean of each cluster    
    centroid_1 = [
        round(df[df.cluster == 1].x1.mean(), 3),
        round(df[df.cluster == 1].x2.mean(), 3),
    ]
    centroid_2 = [
        round(df[df.cluster == 2].x1.mean(), 3),
        round(df[df.cluster == 2].x2.mean(), 3),
    ]

print(centroid_1)
print(centroid_2)

[0.352, 2.502]
[2.663, -3.083]


## Exercise 2:
#### Load the clusters.csv file into the DataFrame. The file contains two variables x1 and x2. The distribution of the variables is as follows:

#### Using the KMeans class from the scikit-learn, split the data into three clusters. Set arguments:

- max_iter=1000

- random_state=42

#### In response, print the coordinates of the centroid of each cluster as shown below.



In [5]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


np.random.seed(42)

## Reading Data
df = pd.read_csv('clusters.txt', delimiter = ',')

## Defining Model
km = KMeans(n_clusters = 3, max_iter = 1000, random_state = 42)

## Fitting Model
km.fit(df)

## Printing coordinates of the centroids
print(km.cluster_centers_)

print(df.head())

[[-0.55537629 -0.32971364]
 [ 4.86661316  0.42352176]
 [-2.15656147 -4.30478556]]
         x1        x2
0 -2.776333 -4.166641
1 -1.335879 -1.083934
2  6.507272 -0.158773
3 -0.956622  0.235036
4 -1.558383 -3.969630


## Exercise 3:
#### Load the clusters.csv file into the DataFrame. The distribution of the variables form this file is as follows:

#### Using the KMeans class from the scikit-learn, the model was created. Make a prediction based on this model (kmeans) and assign a cluster number to each sample in the df DataFrame as 'y_kmeans' column.

#### In response, print the first ten rows of the df DataFrame to the console.

In [6]:
np.random.seed(42)

kmeans = KMeans(n_clusters=3, max_iter=1000, random_state=42)
kmeans.fit(df)

## Predicting 
df['y_kmeans'] = kmeans.predict(df)

print(df.head(10))

         x1        x2  y_kmeans
0 -2.776333 -4.166641         2
1 -1.335879 -1.083934         0
2  6.507272 -0.158773         1
3 -0.956622  0.235036         0
4 -1.558383 -3.969630         2
5 -0.652304 -1.332604         0
6  5.560753  1.517069         1
7 -0.891052 -3.455786         2
8  6.391479  3.597473         1
9  5.812508 -0.845526         1


## Exercise 4:

#### Using the KMeans class (set random_state=42) from the scikit-learn, create a list of WCSS (Within-Cluster Sum-of-Squared) values for the number of clusters from 2 to 9 inclusive. Round WCSS values to two decimal places and print to the console.

In [7]:
wcss = []
for i in range(2, 10):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(df)
    wcss.append(round(kmeans.inertia_, 2))
    
print(wcss)

[5611.32, 1950.88, 1714.8, 1487.39, 1263.45, 1105.1, 977.53, 856.57]


#### Notes:

- km.inertia_ provides the WCSS, indicating how tightly data points cluster around their centers. Lower values suggest better-defined clusters.

## Exercise 5:

#### Using the DBSCAN class from the scikit-learn, create a model to split given dataset into clusters. Set the following arguemnts:

- eps=0.6

- min_samples=7

#### Make a prediction based on this model and assign a new column 'cluster' which stores the cluster number for each sample in the df DataFrame.

#### In response, print the first ten rows of the df DataFrame.

In [10]:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

df = pd.read_csv('clusters.txt', delimiter = ',')

dbs = DBSCAN(eps = 0.6, min_samples = 7)

dbs.fit(df)

## Predicting 
df['cluster'] = dbs.labels_

print(df.head(10))

         x1        x2  cluster
0 -2.776333 -4.166641        0
1 -1.335879 -1.083934        0
2  6.507272 -0.158773        1
3 -0.956622  0.235036        0
4 -1.558383 -3.969630        0
5 -0.652304 -1.332604        0
6  5.560753  1.517069        1
7 -0.891052 -3.455786        0
8  6.391479  3.597473       -1
9  5.812508 -0.845526        1


## Notes:

- In DBSCAN, -1 indicates that a point does not belong to any cluster and is considered noise.