# Project objective
In this project, we review affinity propagation for clustering cancer cell lines by tissue of origin, using their gene expression profiles from the Cancer Cell Line Encyclopedia dataset. You can also learn about homogeneity, completeness and V-measure as different statistical measures for assessing the performance of a clustering model.

Information about the dataset, some technical details about the machine learning method(s) used, and mathematical details of the quantification approaches are provided in the code.

# Packages we work with in this notebook
We are going to use the following libraries and packages:

* **numpy**: NumPy is the fundamental package for scientific computing with Python. (http://www.numpy.org/)
* **sklearn**: Scikit-learn is a machine learning library for the Python programming language. (https://scikit-learn.org/stable/)
* **Seaborn**: Seaborn is a visualization library in Python. (https://seaborn.pydata.org/)
* **pandas**: Pandas provides easy-to-use data structures and data analysis tools for Python. (https://pandas.pydata.org/)


In [1]:
import numpy as np
import pandas as pd
import sklearn as sk

# Introduction to the dataset

**Name**: Cancer Cell Line Encyclopedia dataset

**Summary**: Identifying tissue of origin of cancer cell lines using their gene expression. Cell lines from 6 tissues were chosen for this code: breast, central_nervous_system, haematopoietic_and_lymphoid_tissue, large_intestine, lung and skin.

**number of features**: 500 (real, positive) 
The top 500 genes, ranked by the variance of their expression in the dataset, are chosen. The right way to select the features is to use only the training set, so that no information leaks from the test set. To keep this teaching code simple, we select them using the whole dataset.
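As a rough illustration of variance-based feature selection, the sketch below keeps the highest-variance rows of a small made-up expression table. The helper `select_top_variance_genes`, the gene names and the values are all hypothetical; the files used in this notebook already contain the selected 500 genes, so this step is not run here.

```python
import pandas as pd

def select_top_variance_genes(expression, n_top):
    """Keep the n_top rows (genes) with the largest variance across columns (cell lines)."""
    variances = expression.var(axis=1)          # one variance per gene
    top_genes = variances.nlargest(n_top).index  # gene names, highest variance first
    return expression.loc[top_genes]

# Hypothetical toy matrix: genes in rows, cell lines in columns
toy_expression = pd.DataFrame(
    [[1, 1, 1, 1],
     [0, 10, 0, 10],
     [4, 5, 4, 5],
     [0, 2, 0, 2]],
    index=['gene_a', 'gene_b', 'gene_c', 'gene_d'],
    columns=['line_1', 'line_2', 'line_3', 'line_4'],
)

selected = select_top_variance_genes(toy_expression, n_top=2)
print(selected.index.tolist())
```

In a supervised setting, the same function would be applied to the training rows only, and the resulting gene list reused for the test rows.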

**Number of data points (instances)**: 550

**dataset accessibility**: Dataset is available as part of PharmacoGx R package (https://www.bioconductor.org/packages/release/bioc/html/PharmacoGx.html)

**Link to the dataset**: https://portals.broadinstitute.org/ccle



## Importing the dataset
We can import the dataset in multiple ways:

**Colab Notebook**: You can download the dataset file (or files) from the link (if provided), upload it to your Google Drive, and then import the file (or files) as follows:

**Note.** When you run the following cell, it tries to connect Colab to Google Drive. Follow steps 1 to 5 in this link (https://www.marktechpost.com/2019/06/07/how-to-connect-google-colab-with-google-drive/) to complete the process. 

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

# This path is common for everybody
# This is the path to your google drive
input_path = '/content/gdrive/My Drive/'
# reading the data (target)
target_dataset_features = pd.read_csv(input_path + 'CCLE_ExpMat_Top500Genes.csv', index_col=0)
target_dataset_output = pd.read_csv(input_path + 'CCLE_ExpMat_Phenotype.csv', index_col=0)
# Transposing the dataframe to put features in the dataframe columns
target_dataset_features = target_dataset_features.transpose()

**Local directory**: If you save the data in a local directory, you need to change "input_path" to the directory in which you saved the file (or files).

**GitHub**: If you use my GitHub (or your own GitHub) repo, you need to change "input_path" to where the file (or files) exist in the repo. For example, when I clone ***ml_in_practice*** from my GitHub, I need to change "input_path" to 'data/' as the file (or files) is saved in the data directory in this repository. 

**Note.** You can also clone my ***ml_in_practice*** repository (here: https://github.com/alimadani/ml_in_practice) and follow the same process.

## Checking the dataset characteristics (number of data points and features)

In [3]:
print('number of features: {}'.format(target_dataset_features.shape[1]))
print('number of data points: {}'.format(target_dataset_features.shape[0]))

number of features: 500
number of data points: 550


## Data preparation

We need to prepare the dataset for machine learning modeling. Here we need to convert categorical columns (strings) to integers. Those columns can be used as the output variable.

In [4]:
# tissueid is the column that contains tissue type information
output_var_names = target_dataset_output['tissueid']
# converting tissue names to labels
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

le.fit(output_var_names)
output_var = le.transform(output_var_names)

# we would like to use all the features as input features of the model
input_features = target_dataset_features
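To make the encoding step concrete, here is a small sketch on made-up tissue names (the list `toy_tissues` is hypothetical, not taken from the dataset). It shows that `LabelEncoder` assigns integers following the sorted order of the class names, and that `inverse_transform` recovers the original strings.

```python
from sklearn import preprocessing

# Hypothetical tissue names, for illustration only
toy_tissues = ['lung', 'breast', 'skin', 'lung', 'breast']

toy_encoder = preprocessing.LabelEncoder()
toy_labels = toy_encoder.fit_transform(toy_tissues)

# Classes are stored in sorted order; each label is an index into classes_
print(list(toy_encoder.classes_))
print(toy_labels.tolist())

# Mapping integer labels back to tissue names
print(list(toy_encoder.inverse_transform(toy_labels)))
```

The same `inverse_transform` call on `le` can be used later to translate integer labels back into tissue names when inspecting clusters.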

### Normalizing feature values
Normalizing feature values usually helps us develop a better machine learning model. Here we normalize the data by subtracting the mean of each feature column from each of its values and then dividing by the standard deviation of the same feature column. This normalization changes the mean of each feature (column) to zero and its standard deviation to one. 


In [5]:
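from sklearn import preprocessing

# z-score each feature column, keeping the original row and column labels
input_features = pd.DataFrame(preprocessing.scale(input_features),
                              index=input_features.index, columns=input_features.columns)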
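The normalization described above can be checked by hand. The sketch below, on random toy data (`toy_features` is made up), computes the z-score manually and compares it with `preprocessing.scale`, which uses the population standard deviation (the NumPy default, `ddof=0`).

```python
import numpy as np
from sklearn import preprocessing

# Toy feature matrix: 20 data points, 3 features (illustrative values only)
rng = np.random.default_rng(0)
toy_features = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Manual z-score: subtract the column mean, divide by the column standard deviation
manual = (toy_features - toy_features.mean(axis=0)) / toy_features.std(axis=0)
scaled = preprocessing.scale(toy_features)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```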

## Building the unsupervised learning (clustering) model
We want to build a clustering model using affinity propagation. As it is an unsupervised process, we do not need the output values of the data points for the modeling.

### Affinity propagation
Clustering using affinity propagation can be summarized in 5 main points:

1) Number of clusters (k) does not need to be determined in advance

2) Determining pairwise similarity between data points

* Similarity between data points i and j: negative squared Euclidean distance between data points i and j (this is what scikit-learn's default `affinity='euclidean'` uses)

3) Determines clusters and representative data points for the clusters (exemplars)

4) Iterative message passing

* Data points compete to become exemplars
* Exemplars are real data points, not centers as in k-means

5) Initialization independent: unlike k-means, no initial cluster centers need to be chosen
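The points above can be sketched in plain NumPy. This is a minimal, unoptimized illustration of the responsibility and availability updates described by Frey and Dueck, not the scikit-learn implementation: the function name `affinity_propagation_sketch` is hypothetical, it runs a fixed number of iterations instead of checking convergence, and it assumes negative squared Euclidean similarity with the median similarity as the preference.

```python
import numpy as np

def affinity_propagation_sketch(X, damping=0.9, n_iter=200):
    n = X.shape[0]
    rows = np.arange(n)
    # Similarity: negative squared Euclidean distance between every pair of points
    diff = X[:, None, :] - X[None, :, :]
    S = -np.sum(diff ** 2, axis=2)
    # Preference (self-similarity): median of the off-diagonal similarities
    S[rows, rows] = np.median(S[~np.eye(n, dtype=bool)])
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    for _ in range(n_iter):
        # r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[rows, idx]
        AS[rows, idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, idx] = S[rows, idx] - second
        R = damping * R + (1 - damping) * R_new
        # a(i,k) = min(0, r(k,k) + sum of positive r(i',k) over i' not in {i, k})
        # a(k,k) = sum of positive r(i',k) over i' != k
        Rp = np.maximum(R, 0)
        Rp[rows, rows] = R[rows, rows]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[rows, rows].copy()
        A_new = np.minimum(A_new, 0)
        A_new[rows, rows] = diag
        A = damping * A + (1 - damping) * A_new
    # Points whose self-responsibility plus self-availability is positive are exemplars
    exemplars = np.flatnonzero(np.diag(A + R) > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)
    # Assign every point to its most similar exemplar; exemplars label themselves
    labels = np.argmax(S[:, exemplars], axis=1)
    labels[exemplars] = np.arange(exemplars.size)
    return exemplars, labels

# Toy data: two small, well-separated groups of points (made-up values)
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 0.3, size=(10, 2)), rng.normal(5, 0.3, size=(10, 2))])
exemplars, labels = affinity_propagation_sketch(X_toy)
print(exemplars, labels)
```

Note that the exemplars are indices of rows in `X_toy`, which reflects the point that exemplars are real data points rather than computed centers.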



**Hyperparameters of affinity propagation:**

* **max_iter**: Maximum number of iterations

* **convergence_iter**: Number of iterations with no change in the number of estimated clusters that stops the convergence.

* **damping (ranges between 0.5 and 1)**: Damping factor is the extent to which the current value is maintained relative to incoming values.
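To get a feel for these hyperparameters, the following sketch fits `AffinityPropagation` with a few damping values on synthetic data from `make_blobs` and reports how many clusters are found. The data and the chosen damping values are illustrative assumptions, and the counts will differ on the cell line dataset. The `random_state` argument is assumed to be available (recent scikit-learn versions).

```python
import numpy as np
from sklearn import cluster
from sklearn.datasets import make_blobs

# Synthetic data standing in for the real features (assumption for illustration)
X_blobs, _ = make_blobs(n_samples=60, centers=3, random_state=0)

n_clusters_by_damping = {}
for damping in (0.5, 0.7, 0.9):
    model = cluster.AffinityPropagation(max_iter=500, convergence_iter=15,
                                        damping=damping, random_state=0)
    model.fit(X_blobs)
    # Count distinct labels; a run that fails to converge labels every point -1
    n_clusters_by_damping[damping] = len(np.unique(model.labels_))

for damping, n_clusters in n_clusters_by_damping.items():
    print('damping = {}: {} clusters'.format(damping, n_clusters))
```

The `preference` parameter, left at its default in this notebook, also influences how many exemplars are chosen and is worth exploring alongside damping.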

In [6]:
from sklearn import cluster

# Create the affinity propagation clustering object
clustmodel = cluster.AffinityPropagation(max_iter=500, convergence_iter = 15, damping = 0.97)

# Fit the model on the normalized features (no output labels are used)
clustmodel.fit(input_features)

AffinityPropagation(affinity='euclidean', convergence_iter=15, copy=True,
                    damping=0.97, max_iter=500, preference=None, verbose=False)

### Number of clusters

As k (the number of clusters) is not given as an input, we cannot expect a specific number of clusters from affinity propagation clustering. Let's check how many clusters affinity propagation identified in our dataset and compare that with the actual number of groups in the dataset:

In [7]:
print('number of unique clusters: {}'.format(len(np.unique(clustmodel.labels_))))
print('number of true groups in the dataset: {}'.format(len(np.unique(output_var))))

number of unique clusters: 21
number of true groups in the dataset: 6


**Note.** We have 21 clusters instead of the original 6 tissue types. This could be due to some underlying groupings within each class. You can play with the "damping" parameter and see how the number of clusters and the performance of the clustering change.
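One way to look for such sub-groupings is a contingency table of true classes against cluster labels. The sketch below uses made-up labels (`toy_tissue` and `toy_cluster` are hypothetical) to show the idea with `pd.crosstab`.

```python
import pandas as pd

# Hypothetical true tissue labels and cluster assignments for 8 cell lines
toy_tissue = ['breast', 'breast', 'breast', 'lung', 'lung', 'lung', 'skin', 'skin']
toy_cluster = [0, 0, 1, 2, 2, 3, 4, 4]

# Rows: tissues, columns: clusters, cells: number of cell lines
table = pd.crosstab(pd.Series(toy_tissue, name='tissue'),
                    pd.Series(toy_cluster, name='cluster'))
print(table)
```

In this toy table, breast and lung are each split over two clusters while every cluster holds a single tissue. In this notebook, something like `pd.crosstab(output_var_names.to_numpy(), clustmodel.labels_)` should give the corresponding table for the real data.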



## Evaluating performance of the model




We can use a summary measure like V-measure to assess the performance of the clustering. With $\beta = 1$ (the default in scikit-learn), the V-measure is the harmonic mean (reciprocal of the arithmetic mean of the reciprocals) of homogeneity and completeness:


* **V-measure (ranges between 0 and 1)**


\begin{equation*} V = \frac{(1+\beta) \times \text{homogeneity} \times \text{completeness}}{\beta \times \text{homogeneity} + \text{completeness}}\end{equation*}

* **Homogeneity (ranges between 0 and 1)**: A homogeneity of 1 means that every cluster contains only elements that are members of the same class. A homogeneity of 0 occurs, for example, when all data points from several classes are placed in a single cluster. 

* **Completeness (ranges between 0 and 1)**: A completeness of 1 means that all elements of any given class are in the same cluster. A completeness of 0 occurs, for example, when there is only one class and every element in it is assigned to a different cluster.

Learn more about **Homogeneity** and **Completeness** here: https://www.aclweb.org/anthology/D07-1043.pdf
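The definitions above can be written out with entropies, following the paper linked above: homogeneity is 1 - H(C|K)/H(C) and completeness is 1 - H(K|C)/H(K), where C are the classes and K the clusters. The sketch below is a plain NumPy illustration with hypothetical helper names, not the scikit-learn implementation.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def conditional_entropy(a, b):
    """H(a | b): entropy of a within each group of b, weighted by group size."""
    a, b = np.asarray(a), np.asarray(b)
    total = 0.0
    for value in np.unique(b):
        mask = b == value
        total += mask.mean() * entropy(a[mask])
    return total

def v_measure_sketch(classes, clusters, beta=1.0):
    h_c, h_k = entropy(classes), entropy(clusters)
    # By convention, each score is 1 when the corresponding entropy is zero
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    denom = beta * homogeneity + completeness
    v = 0.0 if denom == 0 else (1 + beta) * homogeneity * completeness / denom
    return homogeneity, completeness, v

# Two classes, but every point in its own cluster: pure clusters, split classes
h_toy, c_toy, v_toy = v_measure_sketch([0, 0, 1, 1], [0, 1, 2, 3])
print(h_toy, c_toy, v_toy)
```

This toy case mirrors the situation in this notebook: over-splitting classes into many pure clusters keeps homogeneity high while lowering completeness.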





In [8]:
from sklearn import metrics

print("Homogeneity of the identified clusters:", metrics.homogeneity_score(output_var,clustmodel.labels_))
print("Completeness of the identified clusters:", metrics.completeness_score(output_var,clustmodel.labels_))
print("V-measure of the identified clusters:", metrics.v_measure_score(output_var,clustmodel.labels_))

Homogeneity of the identified clusters: 0.7449723315176991
Completeness of the identified clusters: 0.4201893794737535
V-measure of the identified clusters: 0.5373150503532702
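As a quick arithmetic check, plugging the homogeneity and completeness printed above into the V-measure formula with β = 1 reproduces the reported V-measure. The values are copied from the output; only the variable names are new.

```python
# Values copied from the output above
homogeneity = 0.7449723315176991
completeness = 0.4201893794737535
beta = 1.0

# V-measure formula; with beta = 1 this is the harmonic mean of the two scores
v_check = (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)
print(round(v_check, 6))
```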


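As we can see, the completeness of the clustering is low, although the homogeneity could be acceptable. One reason could be the difference between the number of clusters (21) and the number of original classes (6) in the dataset: when one class is split across several clusters, completeness drops even if each cluster is pure.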