# Bipolar disorder and response to lithium: blood

Dataset Files: 
* gds5393.csv 
* meta-gds5393.csv

**Introduction**

Bipolar disorder is a mental condition characterized by highly variable mood episodes, with periods of euphoria or mania (known as manic episodes) alternating with periods of deep depression (known as depressive episodes). Treatment for bipolar disorder usually includes therapy and medication.

One of the most common medications used to treat bipolar disorder is lithium. Lithium is a mineral that acts as a mood stabilizer and is effective in preventing manic and depressive episodes in people with bipolar disorder. It works by helping to balance the levels of certain chemicals in the brain called neurotransmitters, which are responsible for transmitting information between brain cells.


**Summary** 

Analysis of peripheral blood from patients with bipolar disorder before and 1 month after lithium treatment. Response of patients to lithium assessed after 6 months. Results identify a gene expression signature for the response to lithium treatment in patients with bipolar disorder.

This dataset has 48107 rows (probes) and 120 columns (samples). It contains gene expression data from blood samples of patients with bipolar disorder, some of whom were receiving lithium treatment and some of whom were not.
The file meta-gds5393.csv contains the class labels.


**Loading and analysing the dataset's data and metadata**

In [1]:
# Import the libraries needed for processing
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

from sklearn import preprocessing
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.metrics import accuracy_score

# Read the data using the pandas library
data = pd.read_csv("gds5393.csv", sep=',', index_col = 0)
meta = pd.read_csv("meta-gds5393.csv", sep=',', index_col = 0)

### Data Preparation

In [2]:
# drop all rows with null values; inplace=True modifies the variable data directly
data.dropna(inplace = True)

In [3]:
# transpose so that the data form a samples x genes matrix (one row per sample)
data = data.transpose()

In [4]:
# Dimensions of the data
data.values.shape

(120, 47323)
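
The class labels are read from `meta` by position further down, so the rows of the transposed `data` need to be in the same sample order as the rows of `meta`. A minimal sketch of one way to enforce this is shown here on toy frames; the sample IDs and gene names are made up, and it assumes both tables are indexed by sample ID.

```python
import pandas as pd

# Toy stand-ins for the transposed expression matrix (samples x genes)
# and the metadata table. IDs and values are hypothetical.
data = pd.DataFrame(
    {"gene_a": [1.0, 2.0, 3.0], "gene_b": [4.0, 5.0, 6.0]},
    index=["GSM1", "GSM2", "GSM3"],
)
meta = pd.DataFrame(
    {"other": ["responder", "non-responder", "responder"]},
    index=["GSM3", "GSM1", "GSM2"],  # deliberately in a different order
)

# Reorder the metadata to follow the sample order of `data`.
# .loc raises a KeyError if a sample ID is missing from `meta`.
meta_aligned = meta.loc[data.index]

# Row i of the features and row i of the labels now describe the same sample.
assert (data.index == meta_aligned.index).all()
labels = meta_aligned["other"].values
print(list(labels))
```

If the two indexes already match, the reindexing is a no-op, so the check costs nothing.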

In [5]:
# pre-processing: standardize the data
input_sc = preprocessing.scale(data)

# class to predict
classe = "other"
meta_values = meta[classe].values
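
Note that `preprocessing.scale(data)` computes each gene's mean and standard deviation from all 120 samples, including those later held out for testing. An alternative, sketched below on synthetic data with a hypothetical split, is to fit a `StandardScaler` on the training rows only and reuse its statistics for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic samples x features matrix; values are made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(12, 3))

# Hypothetical split: first 8 rows for training, last 4 for testing.
train_idx, test_idx = np.arange(8), np.arange(8, 12)

# Learn the mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X[train_idx])
X_train = scaler.transform(X[train_idx])
# The test rows are scaled with the training statistics.
X_test = scaler.transform(X[test_idx])

print(X_train.shape, X_test.shape)
```

With this approach the test rows do not influence the scaling, so the test accuracy is a cleaner estimate of performance on unseen samples.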

### Modelling

**Supervised learning**

In [6]:
# Build supervised learning models to predict the class

# compute the number of test samples: one third of the data to test the model, the remaining two thirds for training
num_tst = int(data.shape[0] / 3)
print(num_tst)

indices = np.random.permutation(data.shape[0])
train_in = input_sc[indices[:-num_tst]]
train_out = meta_values[indices[:-num_tst]]
test_in = input_sc[indices[-num_tst:]]
test_out = meta_values[indices[-num_tst:]]

print("Input shape:", input_sc.shape)
print("Train in shape:", train_in.shape)
print("Train out shape:", train_out.shape)
print("Test in shape:", test_in.shape)
print("Test out shape:", test_out.shape)

40
Input shape: (120, 47323)
Train in shape: (80, 47323)
Train out shape: (80,)
Test in shape: (40, 47323)
Test out shape: (40,)
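
The permutation above is unseeded, so each run produces a different split, and the class proportions in the test set are left to chance. A possible alternative is scikit-learn's `train_test_split` with `stratify` and `random_state`. The sketch below runs on synthetic data; the 90/30 label split is a hypothetical imbalance, not the dataset's real one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 120 samples, 5 features, hypothetical 90/30 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = np.array(["non-responder"] * 90 + ["responder"] * 30)

# test_size=40 mirrors int(120 / 3) from the cell above.
# stratify=y keeps the class proportions equal in both subsets;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=40, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
print((y_test == "responder").sum(), "responders in the test set")
```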


**Decision tree**

In [7]:
from sklearn import tree

tree_model = tree.DecisionTreeClassifier()
tree_model = tree_model.fit(train_in, train_out)
tree_pred = tree_model.predict(test_in)

print("Predicted values: ", tree_pred)
print("Actual values: " , test_out)

Predicted values:  ['non-responder' 'non-responder' 'responder' 'non-responder'
 'non-responder' 'responder' 'non-responder' 'responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'responder' 'non-responder'
 'non-responder' 'non-responder' 'responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'responder'
 'non-responder' 'non-responder' 'non-responder' 'responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder']
Actual values:  ['non-responder' 'non-responder' 'non-responder' 'non-responder'
 'responder' 'responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'responder' 'non-responder' 'responder' 'non-responder'
 'non-responder' 'responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-re

**SVMs**

In [8]:
from sklearn import svm

clf = svm.SVC(gamma=0.001, C=100.)
svm_model = clf.fit(train_in, train_out)
svm_pred = clf.predict(test_in)

print("Predicted values: " , svm_pred)
print("Actual values: " , test_out)

Predicted values:  ['non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder']
Actual values:  ['non-responder' 'non-responder' 'non-responder' 'non-responder'
 'responder' 'responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'responder' 'non-responder' 'responder' 'non-responder'
 'non-responder' 'responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-res

**Naive Bayes**

In [9]:
from sklearn.naive_bayes import GaussianNB

gnb_model = GaussianNB()
gnb_model = gnb_model.fit(train_in, train_out)
gnb_pred = gnb_model.predict(test_in)

print("Predicted values: " , gnb_pred)
print("Actual values: " , test_out)

Predicted values:  ['non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder']
Actual values:  ['non-responder' 'non-responder' 'non-responder' 'non-responder'
 'responder' 'responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'responder' 'non-responder' 'responder' 'non-responder'
 'non-responder' 'responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-res

**Logistic regression**

In [10]:
from sklearn import linear_model

# multi_class is omitted: "auto" was the default, and the parameter is deprecated in recent scikit-learn releases
logistic_model = linear_model.LogisticRegression(C=1e5, solver = "liblinear")
logistic_model = logistic_model.fit(train_in, train_out)
logistic_pred =  logistic_model.predict(test_in)

print("Predicted values: " , logistic_pred)
print("Actual values: " , test_out)

Predicted values:  ['responder' 'responder' 'responder' 'responder' 'responder' 'responder'
 'responder' 'responder' 'responder' 'responder' 'responder' 'responder'
 'responder' 'responder' 'responder' 'responder' 'responder' 'responder'
 'responder' 'responder' 'responder' 'responder' 'responder' 'responder'
 'responder' 'responder' 'responder' 'responder' 'responder' 'responder'
 'responder' 'responder' 'responder' 'responder' 'responder' 'responder'
 'responder' 'responder' 'responder' 'responder']
Actual values:  ['non-responder' 'non-responder' 'non-responder' 'non-responder'
 'responder' 'responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'responder' 'non-responder' 'responder' 'non-responder'
 'non-responder' 'responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'responder' 'responder' 'non-responder' 'non-responder' 'non-responder'
 

**KNeighbors: k-nearest neighbours method**

In [11]:
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier()
knn_model = knn_model.fit(train_in, train_out)
knn_pred = knn_model.predict(test_in)

print("Predicted values:\n" , knn_pred)
print("Actual values:\n" , test_out)

Predicted values:
 ['non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'responder' 'non-responder' 'responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'responder'
 'non-responder' 'non-responder' 'responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder']
Actual values:
 ['non-responder' 'non-responder' 'non-responder' 'non-responder'
 'responder' 'responder' 'non-responder' 'non-responder' 'non-responder'
 'non-responder' 'responder' 'non-responder' 'responder' 'non-responder'
 'non-responder' 'responder' 'non-responder' 'non-responder'
 'non-responder' 'non-responder' 'non-responder' 'non-res

### Evaluation

In [12]:
print("Proportion of correctly predicted examples:\n")
print("Decision Tree:", accuracy_score(test_out, tree_pred))
print("SVMs:", accuracy_score(test_out, svm_pred))
print("Naive Bayes:", accuracy_score(test_out, gnb_pred))
print("Logistic Regression:", accuracy_score(test_out, logistic_pred))
print("KNeighbors:", accuracy_score(test_out, knn_pred))

Proportion of correctly predicted examples:

Decision Tree: 0.575
SVMs: 0.725
Naive Bayes: 0.725
Logistic Regression: 0.275
KNeighbors: 0.675
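
Accuracy on its own can hide a model that always predicts the same class. The sketch below rebuilds a 40-sample test set with 29 non-responders and 11 responders, the counts implied by the 0.725 and 0.275 scores of the constant predictions shown earlier, and compares plain accuracy with balanced accuracy and the confusion matrix. The variable names are illustrative and not part of the notebook.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
)

labels = ["non-responder", "responder"]

# 29 + 11 = 40 test samples, matching 0.725 = 29/40 and 0.275 = 11/40.
y_true = ["non-responder"] * 29 + ["responder"] * 11
# A classifier that always answers "non-responder".
y_all_non = ["non-responder"] * 40

acc = accuracy_score(y_true, y_all_non)
# Balanced accuracy averages the recall of each class.
bal = balanced_accuracy_score(y_true, y_all_non)
# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_all_non, labels=labels)

print("accuracy:", acc)
print("balanced accuracy:", bal)
print(cm)
```

The constant prediction has a recall of 1 for non-responders and 0 for responders, which balanced accuracy and the confusion matrix expose and plain accuracy does not.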


Looking at the results for each model, the two with the highest test accuracy were:
* Naive Bayes
* SVMs

Both, however, predicted `non-responder` for every test sample, so their accuracy of 0.725 is simply the share of non-responders in the test set. Next, we use 5-fold cross-validation to analyse the performance of these two models.

In [13]:
from sklearn.model_selection import cross_val_score

print("Naive Bayes")
scores = cross_val_score(gnb_model, input_sc, meta_values, cv = 5)
print(scores)
print(scores.mean())

print("SVMs")
scores = cross_val_score(svm_model, input_sc, meta_values, cv = 5)
print(scores)
print(scores.mean())

Naive Bayes
[0.875      0.83333333 0.79166667 0.79166667 0.79166667]
0.8166666666666667
SVMs
[0.79166667 0.79166667 0.79166667 0.79166667 0.75      ]
0.7833333333333333
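
Most of the fold scores above equal 0.79166667, that is, 19 of 24 samples, so it is worth checking how a classifier that always predicts the most frequent label would score under the same scheme. The sketch below does this on synthetic data with a hypothetical 95/25 label split (an assumption, not a verified property of the dataset). It also puts the scaler inside a pipeline so that each fold is standardized using its own training rows only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 120 samples, 20 random features, hypothetical labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 20))
y = np.array(["non-responder"] * 95 + ["responder"] * 25)

# Seeded, stratified folds so every fold has the same class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# The scaler is refitted on the training part of each fold.
model = make_pipeline(StandardScaler(), GaussianNB())
# Baseline that ignores the features and predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")

model_scores = cross_val_score(model, X, y, cv=cv)
baseline_scores = cross_val_score(baseline, X, y, cv=cv)

print("Naive Bayes:", model_scores.mean())
print("Majority baseline:", baseline_scores.mean())
```

A model is only informative about lithium response if its cross-validated score clearly exceeds that of the majority baseline on the same folds.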
