<a href="https://colab.research.google.com/github/dhaev/Machine-Learning/blob/main/TME6015_ASSIGNMENT3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **David Abodunrin**
### **3736201**
### **TME6015 ASSIGNMENT 3**
### **KMeans and Agglomerative clustering**

In [73]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn import metrics
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from time import time

### **Dataset**
> This assignment uses an imbalanced dataset referred to as the 'Ecoli' or 'Protein Localization Sites' dataset from the UCI repository (https://archive.ics.uci.edu/dataset/39/ecoli). We first perform clustering with the highly under-represented classes included, and then repeat the clustering after removing those classes from the dataset.

1. Title: Ecoli(Protein Localization Sites).

2. Creator and Maintainer:
	     Kenta Nakai
             Institute of Molecular and Cellular Biology
	     Osaka University
	     1-3 Yamada-oka, Suita 565 Japan
	     nakai@imcb.osaka-u.ac.jp
             http://www.imcb.osaka-u.ac.jp/nakai/psort.html
   Donor: Paul Horton (paulh@cs.berkeley.edu)
   Date:  September, 1996
   See also: yeast database

3. Past Usage.
Reference: "A Probabilistic Classification System for Predicting the Cellular
           Localization Sites of Proteins", Paul Horton & Kenta Nakai,
           Intelligent Systems in Molecular Biology, 109-115.
	   St. Louis, USA 1996.

4. Number of Instances:  336 for the E.coli dataset and

5. Number of Attributes: 8 ( 7 predictive, 1 name )
	     
6. Attribute Information.

> *  Sequence Name: Accession number for the SWISS-PROT database.
*  mcg: McGeoch's method for signal sequence recognition.
*  gvh: von Heijne's method for signal sequence recognition.
*  lip: von Heijne's Signal Peptidase II consensus sequence score.
*  chg: Presence of charge on N-terminus of predicted lipoproteins.
*  aac: score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
* alm1: score of the ALOM membrane spanning region prediction program.
* alm2: score of ALOM program after excluding putative cleavable signal regions from the sequence.

7. Missing Attribute Values: None.


8. Class Distribution. The class is the localization site.

> * cp  (cytoplasm) -                                   143
*   im  (inner membrane without signal sequence) -       77               
*   pp  (periplasm)  -                                   52
*   imU (inner membrane, uncleavable signal sequence) -  35
*   om  (outer membrane) -                               20
*   omL (outer membrane lipoprotein) -                    5
*   imL (inner membrane lipoprotein) -                    2
*   imS (inner membrane, cleavable signal sequence) -     2



In [74]:
filename = "/content/drive/MyDrive/TME_6015/Assignment_3/ecoli/ecoli.data"
columns = ['Sequence Name','mcg','gvh', 'lip', 'chg', 'aac', 'alm1', 'alm2', 'class']
# The file is whitespace-delimited and has no header row, so column names are supplied
ecoli = pd.read_csv(filename, sep=r"\s+", names=columns)




### **Data Preprocessing 1**
> * Check for missing values
* Ensure all features have the correct data type
* Remove the sequence name, because it is unique for every instance and does not help with prediction or clustering
* Split the features from the labels. The features form an (m x n) array, where m is the number of instances and n is the number of features; the labels are flattened to a 1-D array
* Standardize the dataset
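
The checks in this list can be sketched on a small stand-in frame. The rows below are hypothetical values that only mimic the column layout of the Ecoli file, and the names `toy`, `feature_cols` and `all_numeric` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in rows with a subset of the Ecoli columns
toy = pd.DataFrame(
    {
        "mcg": [0.49, 0.07, 0.56, 0.59],
        "gvh": [0.29, 0.40, 0.40, 0.49],
        "alm1": [0.24, 0.50, 0.49, 0.52],
        "class": ["cp", "cp", "im", "pp"],
    }
)

# 1. Missing values: count NaNs per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# 2. Data types: every feature column should be numeric
feature_cols = toy.columns.drop("class")
all_numeric = all(pd.api.types.is_numeric_dtype(toy[c]) for c in feature_cols)
print(toy.dtypes)

# 3. Standardize: each feature ends up with zero mean and unit variance
scaled = StandardScaler().fit_transform(toy[feature_cols])
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

On the real data the same calls would be applied to `ecoli` after loading it.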


In [75]:
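# Number of distinct sequence names (confirms the column is only an identifier)
len(ecoli['Sequence Name'].unique())
# Drop the identifier column; it carries no information for clustering
ecoli.drop(columns=['Sequence Name'], inplace=True)
# Features: every column except the last; labels: the last column as a 1-D array
data = ecoli.iloc[:, :-1]
labels = ecoli.iloc[:, -1].to_numpy()
(n_samples, n_features), n_classes = data.shape, np.unique(labels).size
print(f"# n_classes: {n_classes}; # samples: {n_samples}; # features {n_features}")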

# n_classes: 8; # samples: 336; # features 7


In [76]:

def bench_k_means(kmeans, name, data, labels):
    """Benchmark a clustering estimator against the true labels.

    Parameters
    ----------
    kmeans : KMeans or AgglomerativeClustering instance
        A clustering estimator with its parameters (e.g. the initialization
        or the linkage) already set.
    name : str
        Name given to the strategy. It will be used to show the results in a
        table. Names starting with "Agglo" are treated as agglomerative
        clustering, which has no inertia.
    data : ndarray of shape (n_samples, n_features)
        The data to cluster.
    labels : ndarray of shape (n_samples,)
        The labels used to compute the clustering metrics which requires some
        supervision.
    """
    is_agglo = name.split(' ')[0] == 'Agglo'

    # Standardize the features and fit the estimator in a single pipeline
    t0 = time()
    estimator = make_pipeline(StandardScaler(), kmeans).fit(data)
    fit_time = time() - t0

    # AgglomerativeClustering has no inertia_, so it is reported only for KMeans
    if is_agglo:
        results = [name, fit_time]
    else:
        results = [name, fit_time, estimator[-1].inertia_]

    # Metrics that require only the true labels and the estimator labels
    clustering_metrics = [
        metrics.fowlkes_mallows_score,
    ]
    results += [m(labels, estimator[-1].labels_) for m in clustering_metrics]

    # Show the results
    if is_agglo:
        formatter_result = "{:9s}\t{:.3f}\t{:.3f}"
    else:
        formatter_result = "{:9s}\t{:.3f}s\t{:.0f}\t{:.3f}"
    print(formatter_result.format(*results))


### **Kmeans Clustering Implementation**
> Two KMeans configurations are implemented. The parameters set are:

* init
* n_clusters
* n_init
* random_state

The main difference between the two configurations is the init method used ('k-means++' versus 'random').
We then try several n_init values to observe how they affect the results of both configurations.
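
As a rough illustration of what n_init and inertia mean, the sketch below fits KMeans on synthetic blobs (an assumption for illustration, not the Ecoli data) and checks that `inertia_` equals the sum of squared distances from each sample to its assigned centroid. With n_init greater than 1, scikit-learn repeats the run with different centroid seeds and keeps the result with the lowest inertia.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, used here only to illustrate the parameters
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.5, random_state=0)

inertias = {}
for init in ("k-means++", "random"):
    for n_init in (1, 5):
        km = KMeans(n_clusters=5, init=init, n_init=n_init, random_state=0).fit(X)
        # Recompute inertia by hand from the fitted centroids and labels
        manual = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
        inertias[(init, n_init)] = km.inertia_
        print(f"{init:10s} n_init={n_init}: inertia={km.inertia_:.1f} manual={manual:.1f}")
```

Inertia only measures cluster compactness; it uses no class labels, so it can improve while agreement with the true classes does not.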

* The number of classes determines the number of clusters

Evaluation metric = Fowlkes-Mallows index (FMI)
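
For reference, the Fowlkes-Mallows score can be computed by counting pairs of samples: FMI = TP / sqrt((TP + FP) * (TP + FN)), where TP counts pairs grouped together in both the true and the predicted labels, FP counts pairs grouped together only in the prediction, and FN counts pairs grouped together only in the true labels. The sketch below uses made-up labels (not results from this notebook) and compares a manual count with `sklearn.metrics.fowlkes_mallows_score`.

```python
from itertools import combinations
from math import sqrt

from sklearn.metrics import fowlkes_mallows_score

# Made-up labels for six samples
true_labels = ["cp", "cp", "cp", "im", "im", "pp"]
pred_labels = [0, 0, 1, 1, 1, 2]

tp = fp = fn = 0
for i, j in combinations(range(len(true_labels)), 2):
    same_true = true_labels[i] == true_labels[j]
    same_pred = pred_labels[i] == pred_labels[j]
    if same_true and same_pred:
        tp += 1  # pair together in both labelings
    elif same_pred:
        fp += 1  # pair together only in the prediction
    elif same_true:
        fn += 1  # pair together only in the true labels

manual_fmi = tp / sqrt((tp + fp) * (tp + fn))
sklearn_fmi = fowlkes_mallows_score(true_labels, pred_labels)
print(f"TP={tp} FP={fp} FN={fn} manual={manual_fmi:.3f} sklearn={sklearn_fmi:.3f}")
# TP=2 FP=2 FN=2 manual=0.500 sklearn=0.500
```

The score ranges from 0 to 1, and higher values mean closer agreement between the clusters and the true classes.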


In [77]:
def kmeans_clustering(n_classes,data,labels,init_numbers):
  for n_init in init_numbers:
    print(f"n_init : {n_init}")
    print(120 * "_")
    print("init\t\ttime\tinertia\tFMI")

    kmeans = KMeans(init="k-means++", n_clusters=n_classes, n_init=n_init, random_state=0)
    bench_k_means(kmeans=kmeans, name="k-means++", data=data, labels=labels)

    kmeans = KMeans(init="random", n_clusters=n_classes, n_init=n_init, random_state=0)
    bench_k_means(kmeans=kmeans, name="random", data=data, labels=labels)
    print(120 * "_")


### **Agglomerative Clustering Implementation**
> Four agglomerative clustering configurations are implemented, with three parameters set:


*   n_clusters
*   metric
*   linkage

All parameters except linkage stay the same across the configurations, to observe how the linkage affects the result.

> * The linkage parameters tested: 'ward', 'single', 'complete', 'average'.
* The metric used: 'euclidean'
* The number of classes determines the number of clusters

Evaluation metric = Fowlkes-Mallows index (FMI)
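
The four linkage criteria differ in how they measure the distance between two clusters before merging them. A minimal sketch on two tiny hypothetical clusters (`A` and `B` are made up for illustration) shows the quantity each criterion looks at; for ward it is written as the increase in within-cluster sum of squares, which is the usual textbook formulation and not necessarily the exact internal scaling used by scikit-learn.

```python
import numpy as np

# Two tiny hypothetical clusters in 2-D
A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 0.0]])

# All pairwise Euclidean distances between points of A and points of B
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

single = pairwise.min()     # distance between the closest pair
complete = pairwise.max()   # distance between the farthest pair
average = pairwise.mean()   # mean distance over all pairs

# Ward: increase in within-cluster sum of squares caused by merging A and B
centroid_gap = A.mean(axis=0) - B.mean(axis=0)
ward_increase = len(A) * len(B) / (len(A) + len(B)) * (centroid_gap @ centroid_gap)

print(f"single={single:.3f} complete={complete:.3f} average={average:.3f} ward={ward_increase:.3f}")
# single=3.000 complete=4.123 average=3.571 ward=12.500
```

At each step, agglomerative clustering merges the pair of clusters with the smallest value of the chosen criterion.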


In [78]:
def agglomerative(n_classes,data,labels):
  # Apply agglomerative clustering once per linkage criterion
  print(120 * "_")
  print("init\t\ttime\tFMI")
  for linkage in ("ward", "single", "complete", "average"):
    agglo = AgglomerativeClustering(n_clusters=n_classes, metric='euclidean', linkage=linkage)
    bench_k_means(kmeans=agglo, name=f"Agglo {linkage}", data=data, labels=labels)
  print(120 * "_")

### **Clustering including highly under-represented classes**

### kmeans clustering

In [79]:
kmeans_clustering(n_classes,data,labels,[1,2,3,14])

n_init : 1
________________________________________________________________________________________________________________________
init		time	inertia	FMI
k-means++	0.013s	571	0.784
random   	0.007s	804	0.584
________________________________________________________________________________________________________________________
n_init : 2
________________________________________________________________________________________________________________________
init		time	inertia	FMI
k-means++	0.006s	525	0.627
random   	0.009s	804	0.584
________________________________________________________________________________________________________________________
n_init : 3
________________________________________________________________________________________________________________________
init		time	inertia	FMI
k-means++	0.008s	525	0.627
random   	0.005s	796	0.581
________________________________________________________________________________________________________________________
n_init : 1

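With n_init = 1, k-means++ achieved its best FMI score (0.784), which indicates the closest agreement between the clusters and the true classes. For random initialization, n_init = 1 and n_init = 2 gave the same FMI (0.584), the best among the random runs shown. Note that k-means++ reached a lower inertia at n_init = 2 and 3 (525 versus 571) but also a lower FMI (0.627), so a lower inertia did not translate into better agreement with the true labels here. Overall, k-means++ performed better than random initialization.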

### agglomerative clustering

In [80]:
agglomerative(n_classes,data,labels)

________________________________________________________________________________________________________________________
init		time	FMI
Agglo ward	0.009	0.635
Agglo single	0.005	0.529
Agglo complete	0.004	0.829
Agglo average	0.004	0.706
________________________________________________________________________________________________________________________


Agglomerative clustering with complete linkage had the best FMI score (0.829).

### **Data Preprocessing 2**
> Remove under-represented classes
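
One way to decide which classes count as under-represented is a count threshold. The sketch below rebuilds a label column from the class distribution listed in the dataset description; the cutoff of 10 samples and the names `labels_demo` and `min_samples` are assumptions made for illustration, not part of the assignment.

```python
import pandas as pd

# Class counts as listed in the dataset description
class_counts = {"cp": 143, "im": 77, "pp": 52, "imU": 35, "om": 20, "omL": 5, "imL": 2, "imS": 2}

# Rebuild a label column with those counts (a stand-in for ecoli['class'])
labels_demo = pd.Series([c for c, n in class_counts.items() for _ in range(n)], name="class")

# Hypothetical cutoff: classes with fewer than 10 samples are treated as under-represented
min_samples = 10
counts = labels_demo.value_counts()
rare_classes = sorted(counts[counts < min_samples].index)
kept = labels_demo[~labels_demo.isin(rare_classes)]

print(rare_classes, len(kept), kept.nunique())
# ['imL', 'imS', 'omL'] 327 5
```

With this cutoff the three smallest classes are dropped, leaving 327 samples in 5 classes.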

In [81]:
# Boolean mask marking rows that belong to the under-represented classes
# (named is_rare to avoid shadowing the built-in filter)
is_rare = ecoli['class'].isin(['imL', 'imS', 'omL'])
# Keep only the rows outside those classes
filtered_data = ecoli[~is_rare]

In [82]:
#get feature data
data = filtered_data.iloc[:,:-1]
#get labels
labels = filtered_data.iloc[:,-1:].values.flatten()

(n_samples, n_features), n_classes = data.shape, np.unique(labels).size
print(f"# n_classes: {n_classes}; # samples: {n_samples}; # features {n_features}")

# n_classes: 5; # samples: 327; # features 7


### **Clustering excluding highly under-represented classes**

### kmeans clustering

In [83]:
kmeans_clustering(n_classes,data,labels,[1,2,3,10,12])

n_init : 1
________________________________________________________________________________________________________________________
init		time	inertia	FMI
k-means++	0.010s	646	0.783
random   	0.003s	905	0.622
________________________________________________________________________________________________________________________
n_init : 2
________________________________________________________________________________________________________________________
init		time	inertia	FMI
k-means++	0.005s	646	0.779
random   	0.007s	646	0.789
________________________________________________________________________________________________________________________
n_init : 3
________________________________________________________________________________________________________________________
init		time	inertia	FMI
k-means++	0.012s	622	0.809
random   	0.013s	646	0.789
________________________________________________________________________________________________________________________
n_init : 1

### agglomerative clustering

In [84]:
agglomerative(n_classes,data,labels)

________________________________________________________________________________________________________________________
init		time	FMI
Agglo ward	0.006	0.819
Agglo single	0.004	0.532
Agglo complete	0.004	0.762
Agglo average	0.004	0.814
________________________________________________________________________________________________________________________


Agglomerative clustering with ward linkage had the best FMI score (0.819).

## Overall

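Based on the FMI scores shown, the best agglomerative configuration outperformed both KMeans variants, both when the under-represented classes were included (complete linkage, 0.829 versus 0.784 for k-means++) and when they were excluded (ward linkage, 0.819 versus 0.809 for k-means++). Single linkage scored lowest in both settings.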
