In [0]:
import numpy as np

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

To make sure that the output of every line is printed, even when several lines are in the same cell, we can use the following config:

In [0]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Create a directory named *ML_Python_PMCRT* in your Colab Notebooks directory in Google Drive. Then upload the two files "CCLE_ExpMat_Top500Genes.csv" and "CCLE_ExpMat_Pheno.csv" to this directory.

In [0]:
import pandas as pd

CCLE_exp = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/ML_Python_PMCRT/CCLE_ExpMat_Top500Genes.csv', index_col=0)
CCLE_pheno = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/ML_Python_PMCRT/CCLE_ExpMat_Pheno.csv', index_col=0)
####
print("shape of the expression dataframe:", CCLE_exp.shape)
print("shape of the phenotype dataframe:", CCLE_pheno.shape)
####
CCLE_exp = CCLE_exp.transpose()
n_samples, n_features = CCLE_exp.shape


CCLE_exp.shape

In [0]:
import collections
collections.Counter(CCLE_pheno.loc[:,"tissueid"])

In [0]:
type(np.where(CCLE_pheno.loc[:,"tissueid"] == 'breast'))
type(np.where(CCLE_pheno.loc[:,"tissueid"] == 'breast')[0])

tuple

numpy.ndarray

In [0]:
collections.Counter(np.where(CCLE_pheno.loc[:,"tissueid"] == 'breast')[0])

We need to convert the labels (tissue names) to numbers.

In [0]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

le.fit(CCLE_pheno.loc[:,"tissueid"])
y=le.transform(CCLE_pheno.loc[:,"tissueid"])

collections.Counter(y)

LabelEncoder()

Counter({0: 54, 1: 49, 2: 170, 3: 57, 4: 172, 5: 48})
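
To see which number stands for which tissue, the fitted encoder can be inspected. The following is a minimal sketch on a hypothetical toy list of tissue names (not the CCLE labels); `le_demo` and `tissues` are illustrative names only.

```python
from sklearn import preprocessing

# Hypothetical toy labels, not taken from the CCLE files
tissues = ["lung", "breast", "skin", "breast", "lung"]

le_demo = preprocessing.LabelEncoder()
codes = le_demo.fit_transform(tissues)

# classes_ stores the original labels in sorted order;
# the position of a label in classes_ is its numeric code
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform converts numeric codes back to the original labels
decoded = le_demo.inverse_transform(codes)
print(list(decoded))
```

The same two attributes (`classes_` and `inverse_transform`) should be usable on the `le` object fitted above to translate predicted codes back to tissue names.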

In [0]:
CCLE_pheno['Encoded_tissue'] = y
CCLE_pheno.columns.values

# Types of observations

There are three types of data measurements or variables we deal with in building machine learning models.

* **Nominal**:
*Nominal* (categorical or qualitative) variables or observations are labels without any order or quantitative difference between the label classes. Examples of nominal data are tissue type, healthy versus malignant tissue, male versus female, etc.
* **Ordinal**:
For *ordinal* variables, the order of the values assigned (or tested) for each sample is important. Examples of ordinal data are tumor stage, level of satisfaction about a product, etc.
* **Interval**:
For *interval* variables, both the order and the exact difference between the data points matter. Examples of interval data are tumor size and age.
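
As a rough illustration, the three types can be represented differently in pandas. This is a minimal sketch with hypothetical toy values (not from the CCLE data); the variable names are illustrative only.

```python
import pandas as pd

# Nominal: labels with no order (unordered categorical)
tissue = pd.Categorical(["breast", "lung", "breast"])

# Ordinal: labels with a meaningful order (ordered categorical)
stage = pd.Categorical(["II", "I", "III"],
                       categories=["I", "II", "III"], ordered=True)

# Interval: numeric values where differences are meaningful
age = pd.Series([45, 61, 52])

print(tissue.ordered)            # no order is defined for nominal labels
print(stage.min(), stage.max())  # order-based summaries work for ordinal data
print(age.max() - age.min())     # differences are meaningful for interval data
```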

# Building classification models

Among the available classification methods in Python, we focus on the following five to build classification models of tissue type of the cancer cell lines in our dataset:

* Logistic regression
* K-nearest neighbour
* Naive Bayes
* Random forest
* Support vector machine

## Logistic regression
If we have a set of features X1 to Xn, y can be obtained as:
\begin{equation*} y=b0+b1X1+b2X2+...+bnXn\end{equation*}

where y is the predicted value obtained by weighted sum of the feature values.

The probability of a class (for example tissue class BREAST) can then be obtained using the logistic function:

\begin{equation*} p(class=BREAST)=\frac{1}{(1+exp(-y))} \end{equation*}

Based on the given class labels and the features in the training data, the coefficients b0 to bn are obtained during the optimization process.

b0 to bn are fixed for all samples while X1 to Xn are feature values specific to each sample. Hence, the logistic function will give us probability of each class assigned to each sample. Finally, the model will choose the class with the highest probability for each sample.


**Note.** The logistic regression model is parametric, and its parameters are the regression coefficients b0 to bn.
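
The two formulas above can be sketched in a few lines of NumPy for a two-class problem. The coefficient and feature values here are made up for illustration; they are not coefficients fitted on the CCLE data.

```python
import numpy as np

def logistic(y):
    """Logistic function: maps any real value to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-y))

# Hypothetical coefficients: intercept b0 and weights b1..b3
b0 = -1.0
b = np.array([0.8, -0.5, 0.3])

# Two hypothetical samples with three feature values each
X = np.array([[1.0, 2.0, 0.5],
              [3.0, 0.0, 1.0]])

# Weighted sum of the features plus intercept, one value per sample
y_lin = b0 + X @ b

# Probability of the positive class for each sample
p = logistic(y_lin)
print(y_lin)
print(p)
```

The coefficients are shared by both samples, while each row of `X` gives a different weighted sum and therefore a different probability.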


In [0]:
from sklearn.linear_model import LogisticRegression

# Initialize our classifier
logreg = LogisticRegression()

# Fitting the model with the data
logreg.fit(CCLE_exp, y)

In [0]:
y_pred = logreg.predict(CCLE_exp)
print(y_pred)

Checking the accuracy of the model on the same data it was trained on:

In [0]:
from sklearn import metrics

print("accuracy of the predictions:", metrics.accuracy_score(y, y_pred))
print("balanced accuracy of the predictions:", metrics.balanced_accuracy_score(y, y_pred))
print("MCC of the predictions:", metrics.matthews_corrcoef(y, y_pred))
print("Confusion matrix of the predictions:", metrics.confusion_matrix(y, y_pred))
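
The reason for reporting balanced accuracy and MCC next to plain accuracy can be seen on a small imbalanced example. This is a sketch with made-up labels (not CCLE results): a classifier that always predicts the majority class looks good on accuracy only.

```python
from sklearn import metrics

# Hypothetical imbalanced labels: 8 samples of class 0, 2 of class 1
y_true = [0] * 8 + [1] * 2
# A trivial classifier that always predicts the majority class
y_hat = [0] * 10

acc = metrics.accuracy_score(y_true, y_hat)           # fraction of correct predictions
bal = metrics.balanced_accuracy_score(y_true, y_hat)  # mean of per-class recall
mcc = metrics.matthews_corrcoef(y_true, y_hat)        # correlation between truth and prediction

print("accuracy:", acc)
print("balanced accuracy:", bal)
print("MCC:", mcc)
```

Since the tissue classes in our dataset have quite different sizes, it is worth looking at all three numbers rather than accuracy alone.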

## Splitting data to training and testing sets

To investigate the performance of our model, we need to split the data into training and testing sets. This will help us check for potential overfitting in our model training.

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(CCLE_exp, y, test_size=0.4, random_state=5)
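
When class sizes differ, one option is to pass `stratify` to `train_test_split` so that each class keeps roughly the same proportion in both sets. The split above does not use it; the following is only a sketch on hypothetical toy arrays (`X_demo`, `y_demo`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features, imbalanced labels (15 vs 5)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 15 + [1] * 5)

# stratify=y_demo keeps the class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=5, stratify=y_demo)

print("training class counts:", np.bincount(y_tr))
print("testing class counts:", np.bincount(y_te))
```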

Training a logistic regression model on the training set:

In [0]:
logreg = LogisticRegression()

# Fitting the model with the data
logreg.fit(X_train, y_train)

Testing the model on the testing set:

In [0]:
y_pred_test = logreg.predict(X_test)
print(y_pred_test)

In [0]:
print("accuracy of the predictions:", metrics.accuracy_score(y_test, y_pred_test))
print("balanced accuracy of the predictions:", metrics.balanced_accuracy_score(y_test, y_pred_test))
print("MCC of the predictions:", metrics.matthews_corrcoef(y_test, y_pred_test))
print("Confusion matrix of the predictions:", metrics.confusion_matrix(y_test, y_pred_test))

## K nearest neighbour (k-NN)

K nearest neighbour uses a distance metric such as Euclidean distance to measure the similarity of a target data point (sample) in the test or validation set to the data points (samples) in the training set. Based on the user-specified k, it finds the k closest points (samples) to the target data point. It then chooses the most frequent label among the k closest points (majority voting) as the class label of the target sample. The class labels can also be assigned by weighted voting among the k closest data points.

**Note.** The k nearest neighbour is non-parametric.
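
The majority-voting procedure described above can be sketched from scratch in NumPy. This is an illustrative toy version, not how scikit-learn implements `KNeighborsClassifier`; the function `knn_predict` and the toy arrays are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_ref, y_ref, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest points."""
    # Euclidean distance from x_new to every reference (training) sample
    dists = np.sqrt(((X_ref - x_new) ** 2).sum(axis=1))
    # Indices of the k closest reference samples
    nearest = np.argsort(dists)[:k]
    # Most frequent label among the k closest samples
    votes = Counter(y_ref[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_toy = np.array([0, 0, 0, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.05, 0.05])))
print(knn_predict(X_toy, y_toy, np.array([4.8, 5.2])))
```

Note that nothing is learned in advance: the reference samples themselves are needed at prediction time, which is why the method is called non-parametric.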


In [0]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize our classifier
knn = KNeighborsClassifier()

# Fitting the model with the data
knn.fit(CCLE_exp, y)

In [0]:
y_pred = knn.predict(CCLE_exp)
print(y_pred)

In [0]:
print("accuracy of the predictions:", metrics.accuracy_score(y, y_pred))
print("balanced accuracy of the predictions:", metrics.balanced_accuracy_score(y, y_pred))
print("MCC of the predictions:", metrics.matthews_corrcoef(y, y_pred))
print("Confusion matrix of the predictions:", metrics.confusion_matrix(y, y_pred))

In [0]:
knn = KNeighborsClassifier()

# Fitting the model with the data
knn.fit(X_train, y_train)

Testing the model on the testing set:

In [0]:
y_pred_test = knn.predict(X_test)
print(y_pred_test)

In [0]:
print("accuracy of the predictions:", metrics.accuracy_score(y_test, y_pred_test))
print("balanced accuracy of the predictions:", metrics.balanced_accuracy_score(y_test, y_pred_test))
print("MCC of the predictions:", metrics.matthews_corrcoef(y_test, y_pred_test))
print("Confusion matrix of the predictions:", metrics.confusion_matrix(y_test, y_pred_test))

## Naive Bayes
To understand the Naive Bayes algorithm, we need to know what Bayes theorem is. Bayes theorem relates conditional probabilities as follows:

\begin{equation*} p(A|B)p(B)=p(B|A)p(A) \end{equation*}
that can be rewritten as

\begin{equation*} p(A|B)=\frac{p(B|A)p(A)}{p(B)} \end{equation*}

where p(A) and p(B) are probabilities of events A and B, respectively. p(A|B) and p(B|A) are the conditional probabilities of A given B and B given A, respectively.

**Example without numbers**

Now let's assume we have 3 features X1, X2 and X3 and we want to identify the probability of class C for sample A with feature values *x1*, *x2* and *x3*:

\begin{equation*} p(class=C|X1=x1, X2=x2 , X3=x3)=\frac{p(X1=x1|class=C)p(X2=x2|class=C)p(X3=x3|class=C)p(class=C)}{p(X1=x1)p(X2=x2)p(X3=x3)} \end{equation*}

where
\begin{equation*} p(X1=x1, X2=x2 , X3=x3)=p(X1=x1)p(X2=x2)p(X3=x3) \end{equation*}
and
\begin{equation*} p(X1=x1, X2=x2 , X3=x3|class=C)=p(X1=x1|class=C)p(X2=x2|class=C)p(X3=x3|class=C) \end{equation*}

as the features are assumed to be independent of one another (this is the "naive" assumption). The denominator is the same for every class, so it does not change which class receives the highest probability.
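
The calculation can be sketched numerically for two classes and three features. All probabilities below are made up for illustration. One difference from the formula above, stated as an assumption of this sketch: the denominator is obtained by summing the numerators over the classes, which is a common way to normalize in practice, rather than by multiplying the feature marginals.

```python
import numpy as np

# Hypothetical prior probability of each class
priors = {"C": 0.3, "not_C": 0.7}

# Hypothetical per-feature likelihoods p(Xi = xi | class) for X1, X2, X3
likelihoods = {"C": [0.8, 0.6, 0.5],
               "not_C": [0.2, 0.4, 0.5]}

# Numerator of Bayes theorem: product of the likelihoods times the prior
unnormalized = {c: priors[c] * float(np.prod(likelihoods[c])) for c in priors}

# Normalize so that the class probabilities sum to one
evidence = sum(unnormalized.values())
posterior = {c: v / evidence for c, v in unnormalized.items()}

print(posterior)
# The predicted class is the one with the highest posterior probability
print(max(posterior, key=posterior.get))
```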

**Real life example with numbers**

We want to know the chance of having breast cancer if a diagnostic test is positive for a woman between 40 and 60 years of age. This example is mainly for understanding Bayes theorem, not the Naive Bayes classifier. In the case of the Naive Bayes algorithm, this process can be easily extended to multiple features as described in the above example.

***Assumptions (not necessarily correct)***
* 2% of women between 40 and 60 have breast cancer
* True positive rate is 95% (if a woman has breast cancer, the test will be positive with 95% probability).
* False positive rate is 5% (5% of the time, women without breast cancer will be diagnosed positively by the test).

Now the question is *What is the chance of having breast cancer if a woman has a positive result from a diagnostic test?*

\begin{equation*} p(\text{breast cancer} \mid \text{positive})=\frac{p(\text{positive} \mid \text{breast cancer})\,p(\text{breast cancer})}{p(\text{positive})} \end{equation*}

where

\begin{equation*}
\begin{aligned}
p(\text{positive}) &= p(\text{positive} \mid \text{breast cancer})\,p(\text{breast cancer}) + p(\text{positive} \mid \text{no breast cancer})\,p(\text{no breast cancer}) \\
&= 0.95 \times 0.02 + 0.05 \times 0.98 \\
&= 0.068
\end{aligned}
\end{equation*}

Therefore,

\begin{equation*}
\begin{aligned}
p(\text{breast cancer} \mid \text{positive}) &= \frac{p(\text{positive} \mid \text{breast cancer})\,p(\text{breast cancer})}{p(\text{positive})} \\
&= \frac{0.95 \times 0.02}{0.068} \\
&\approx 0.28
\end{aligned}
\end{equation*}


As we can see, there is only a 28% chance of having cancer given a positive test result. Although these numbers are not clinically valid, similar results arise in real disease diagnosis. This is one of the reasons that follow-up examinations by physicians are mandatory after a positive result. Do not panic when you have a positive result, but follow up with your doctor immediately.
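
The same arithmetic can be checked in a few lines of Python. The numbers are the illustrative assumptions from this example, not clinical values; the variable names are chosen here for readability.

```python
# Assumptions from the example above (not clinically valid)
prevalence = 0.02           # p(breast cancer)
true_positive_rate = 0.95   # p(positive | breast cancer)
false_positive_rate = 0.05  # p(positive | no breast cancer)

# Total probability of a positive test
p_positive = (true_positive_rate * prevalence
              + false_positive_rate * (1 - prevalence))

# Bayes theorem: probability of breast cancer given a positive test
p_cancer_given_positive = true_positive_rate * prevalence / p_positive

print(round(p_positive, 3))
print(round(p_cancer_given_positive, 2))
```

Changing `prevalence` shows how strongly the result depends on how common the disease is in the tested group.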

In [0]:
from sklearn.naive_bayes import GaussianNB

# Initialize our classifier
gnb = GaussianNB()

# Train our classifier
model = gnb.fit(CCLE_exp, y)

In [0]:
y_pred = gnb.predict(CCLE_exp)
print(y_pred)

In [0]:
print("accuracy of the predictions:", metrics.accuracy_score(y, y_pred))
print("balanced accuracy of the predictions:", metrics.balanced_accuracy_score(y, y_pred))
print("MCC of the predictions:", metrics.matthews_corrcoef(y, y_pred))
print("Confusion matrix of the predictions:", metrics.confusion_matrix(y, y_pred))

In [0]:
gnb = GaussianNB()

# Fitting the model with the data
gnb.fit(X_train, y_train)

Testing the model on the testing set:

In [0]:
y_pred_test = gnb.predict(X_test)
print(y_pred_test)

In [0]:
print("accuracy of the predictions:", metrics.accuracy_score(y_test, y_pred_test))
print("balanced accuracy of the predictions:", metrics.balanced_accuracy_score(y_test, y_pred_test))
print("MCC of the predictions:", metrics.matthews_corrcoef(y_test, y_pred_test))
print("Confusion matrix of the predictions:", metrics.confusion_matrix(y_test, y_pred_test))