<a href="https://colab.research.google.com/github/alicezil/38615-Lab-3/blob/main/Wide_Data_and_Linear_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab3: Wide data and linear models
You are provided with a dataset of 554 patients. 80% of the dataset (444 patients) was selected as the training set and 20% (110 patients) as the test set. The features and labels of the training set are in train_X.csv and train_y.csv, respectively. The features of the test set are in test_X.csv; its labels are hidden.

Your task is to predict the disease type (phenotype) from transcriptomics data. Disease: UCEC (uterine corpus endometrial carcinoma). The labels (1/0) encode tumor grade “II-” vs. “III+”.

Specific tasks:
1. Perform binary classification (0/1) using linear models. Measure classification performance using accuracy and F1-score on the given validation set, and report averaged values.
2. Develop a pipeline to try different linear models (linear regression, logistic regression, Ridge regression, LASSO, etc.).
3. Study the effect of regularization parameters on model performance. Which model is best?
4. Compare your best model's accuracy with random guessing (Hint: scramble the labels, a.k.a. Y-randomization).
5. Which genes are most important for the model's decision?
6. Try to visualize the dataset and see whether the two groups of patients can be separated visually.
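
As an illustration of the Y-randomization idea in task 4, here is a minimal sketch, not the lab solution. It uses a synthetic wide dataset from `make_classification` as a stand-in for the real data, and the helper name `mean_cv_accuracy`, the choice of logistic regression, and `C=0.1` are all assumptions made for the example. The point is only the comparison: the same model is scored once on the real labels and several times on shuffled labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a wide dataset (more features than samples).
# This is NOT the lab data; sizes and seeds are arbitrary.
X, y = make_classification(
    n_samples=200, n_features=500, n_informative=10, random_state=0
)


def mean_cv_accuracy(X, y, C=0.1):
    """Mean 5-fold cross-validated accuracy of an L2-regularized logistic regression."""
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(C=C, max_iter=2000),
    )
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()


rng = np.random.default_rng(0)
real_score = mean_cv_accuracy(X, y)

# Y-randomization: permuting the labels destroys any feature-label link,
# so the resulting scores estimate chance-level performance.
shuffled_scores = [mean_cv_accuracy(X, rng.permutation(y)) for _ in range(5)]

print(f"real labels:     {real_score:.3f}")
print(f"shuffled labels: {np.mean(shuffled_scores):.3f}")
```

If the score on the real labels is not clearly above the shuffled-label scores, the model is probably not learning a genuine signal.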

Bonus Qs:
You could use https://www.uniprot.org/ to search for Gene IDs. See if there is a meaningful connection between the top 10 most important genes and the disease. Did your model recapitulate the known associations between genes and disease?

# 1. Exploring the data:

**1.1 Importing necessary libraries**

In [1]:
%pip install --upgrade kneed

import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
import scipy
import kneed

from sklearn import manifold
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import accuracy_score
from kneed import KneeLocator

%matplotlib inline 
sns.set(color_codes=True)



**1.2 Importing and cleaning data**

In [4]:
df_features = pd.read_csv("/content/train_X.csv")
df_labels = pd.read_csv("/content/train_y.csv")
df_test = pd.read_csv("/content/test_X.csv")

Let's inspect the data by looking at the first rows and the summary statistics:

In [5]:
df_features.head()

| Unnamed: 0.1 | Unnamed: 0 | ENSG00000000003 | ENSG00000000005 | ENSG00000000419 | ENSG00000000457 | ENSG00000000938 | ENSG00000000971 | ENSG00000001036 | ENSG00000001084 | ENSG00000001167 | ... | ENSG00000282651 | ENSG00000282815 | ENSG00000282939 | ENSG00000283063 | ENSG00000283439 | ENSG00000283463 | ENSG00000283526 | ENSG00000283586 | ENSG00000283632 | ENSG00000283697 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EB0D68BC-5FF9-44A5-A355-CA5441BFBA0A | 7.062725 | 0.026623 | 6.720413 | 5.449267 | 3.868619 | 4.587771 | 7.165112 | 4.643161 | 6.771731 | ... | 0.325987 | -5.545564 | -5.545564 | -5.545564 | -5.545564 | 4.014351 | 4.841392 | -5.545564 | 5.855893 | 3.618253 |
| 1 | 0876B4BB-58BA-4C4C-84F4-E9D19EF96147 | 5.965392 | -5.431256 | 6.358498 | 4.161479 | 4.585293 | 4.326924 | 6.849703 | 4.391534 | 5.819945 | ... | 5.910874 | -0.945029 | 3.75043 | 1.611211 | -0.498573 | 3.430928 | 3.160435 | -5.431256 | 4.41393 | 3.353496 |
| 2 | EACD1021-7B52-4531-8806-B7555B73AC84 | 7.892221 | -5.85187 | 8.132992 | 5.98632 | 5.422599 | 4.728815 | 8.168477 | 6.289562 | 7.331591 | ... | 10.103565 | -5.85187 | 6.498217 | 5.481945 | -5.85187 | 5.137298 | 4.296777 | -5.85187 | 5.345372 | 5.028567 |
| 3 | 368ACD26-C7FB-4974-BB7F-0AE22670CB0E | 6.826546 | 0.964851 | 5.99828 | 4.991435 | 4.963 | 4.977695 | 7.149421 | 4.570863 | 6.008286 | ... | 2.442099 | -5.994056 | 2.862038 | 1.909955 | 0.56812 | 4.768694 | 3.983207 | -5.994056 | 4.609411 | 4.329472 |
| 4 | F23B0A1A-25AE-41D9-8C49-B692C4FDE1E4 | 7.059095 | 2.429954 | 6.746639 | 5.591316 | 5.11112 | 5.972938 | 7.576201 | 6.032083 | 6.470761 | ... | 5.553223 | -5.870484 | 3.044916 | -5.870484 | 0.01832 | 4.640575 | 4.954957 | -5.870484 | 4.620774 | 4.464277 |


In [8]:
df_features_summary = df_features.describe(include = 'all')
print(df_features_summary)

                                  Unnamed: 0  ENSG00000000003  \
count                                    444       444.000000   
unique                                   444              NaN   
top     EB0D68BC-5FF9-44A5-A355-CA5441BFBA0A              NaN   
freq                                       1              NaN   
mean                                     NaN         6.798864   
std                                      NaN         0.657958   
min                                      NaN         3.865658   
25%                                      NaN         6.447793   
50%                                      NaN         6.852772   
75%                                      NaN         7.226415   
max                                      NaN         8.642886   

        ENSG00000000005  ENSG00000000419  ENSG00000000457  ENSG00000000938  \
count        444.000000       444.000000       444.000000       444.000000   
unique              NaN              NaN              NaN      
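
With thousands of gene columns, the printed `describe` output wraps across many blocks and is hard to scan. One possible alternative, sketched here on a small made-up frame (the `demo` data and the `gene_*` / `sample_*` names are hypothetical, not taken from the lab files), is to summarize only the numeric columns and transpose the result so that each gene becomes a row. The same summary can then be used for quick sanity checks such as missing values and constant columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in: 6 samples x 4 genes, plus a sample-ID column
# named like the "Unnamed: 0" column seen in the real data.
demo = pd.DataFrame(
    rng.normal(size=(6, 4)), columns=[f"gene_{i}" for i in range(4)]
)
demo.insert(0, "Unnamed: 0", [f"sample_{i}" for i in range(6)])

# Summarize numeric columns only and transpose: one row per gene.
summary = demo.select_dtypes(include="number").describe().T
print(summary[["mean", "std", "min", "max"]])

# Sanity checks that matter before fitting linear models.
n_missing = int(demo.isna().sum().sum())                      # total NaN cells
constant_genes = summary.index[summary["std"] == 0].tolist()  # zero-variance columns
print("missing values:", n_missing)
print("constant genes:", constant_genes)
```

On the real data, the transposed summary could also be sorted by `std` to spot genes with very low variance.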

In [6]:
df_labels.head()

| Unnamed: 0.1 | Unnamed: 0 | xml_neoplasm_histologic_grade |
|---|---|---|
| 0 | EB0D68BC-5FF9-44A5-A355-CA5441BFBA0A | 0 |
| 1 | 0876B4BB-58BA-4C4C-84F4-E9D19EF96147 | 1 |
| 2 | EACD1021-7B52-4531-8806-B7555B73AC84 | 0 |
| 3 | 368ACD26-C7FB-4974-BB7F-0AE22670CB0E | 0 |
| 4 | F23B0A1A-25AE-41D9-8C49-B692C4FDE1E4 | 1 |
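
In both tables the `Unnamed: 0` column appears to hold the patient IDs. A common cleaning step in this situation is to use that column as the index and confirm that features and labels refer to the same patients in the same order. The sketch below assumes the CSV files store the IDs in an unnamed first column; it uses tiny in-memory CSVs with made-up IDs and gene names rather than the real files.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for train_X.csv and train_y.csv (made-up values).
features_csv = io.StringIO(
    ",gene_a,gene_b\n"
    "id_1,7.1,0.0\n"
    "id_2,6.0,-5.4\n"
    "id_3,7.9,-5.9\n"
)
labels_csv = io.StringIO(
    ",xml_neoplasm_histologic_grade\n"
    "id_2,1\n"
    "id_1,0\n"
    "id_3,0\n"
)

# index_col=0 turns the unnamed ID column into the row index
# instead of leaving it as a text column called "Unnamed: 0".
X = pd.read_csv(features_csv, index_col=0)
y = pd.read_csv(labels_csv, index_col=0)["xml_neoplasm_histologic_grade"]

# Reorder the labels to follow the feature rows, then verify the alignment.
y = y.loc[X.index]
assert X.index.equals(y.index)

print(X.shape)
print(y.value_counts().to_dict())
```

Keeping the IDs in the index also keeps them out of the feature matrix, so they cannot be passed to a model by accident.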
