# **Mission of the classification notebook**

Picture yourself as a data scientist sitting with executives from SpaceX who have received reports that several of their astronauts are complaining of vision impairment.  In order to do a full investigation, their medical team decided to use minimally-invasive [intraocular fine needle aspiration](https://pubmed.ncbi.nlm.nih.gov/8233394/) to take biopsies from the astronauts and their ground-control counterparts.  Using this tissue, they were able to perform immunostaining microscopy as well as RNA sequencing.  They were also able to obtain intraocular pressure measurements from both the astronauts and their ground-control counterparts.  

Your goal is to determine whether there are [biological pathways](https://en.wikipedia.org/wiki/Biological_pathway) that respond to conditions in space, because if so, there may be a molecular target that can be used to diagnose, monitor, and/or treat this condition. But first you must determine whether there is any association at all between the RNA-seq gene expression data and the measurements obtained from their medical team. Your mission is to evaluate random forest and single-layer perceptron classification algorithms to determine whether the genes expressed in the retinal tissue are predictive of the phenotypic responses that were observed. You are also encouraged to try the [logistic regression algorithm](https://en.wikipedia.org/wiki/Logistic_regression) on the same task.



# Read in the methods

Recall that we have put all the custom Python methods in a separate notebook which you copied to your Google Drive. We need to read those methods into this notebook so that we can use them here. You will be prompted to select the Gmail address to use to permit access to your Google Drive for this notebook.

Note that we will import the methods in the notebook as "m", so all subsequent references to methods in that notebook will be prefixed with "m.".

**IMPORTANT**: Make sure you put a copy of the methods.ipynb in your Google Drive by following [these instructions](https://docs.google.com/document/d/1V9a3Z5YKT2Pbef4fgPAwB83bHX-p-rPBRRwo7w5Bi9k/edit?usp=sharing).

In [1]:
# install and import the python module for importing a notebook
!pip install import_ipynb
import import_ipynb

Collecting import_ipynb
  Downloading import_ipynb-0.1.4-py3-none-any.whl (4.1 kB)
Collecting jedi>=0.16 (from IPython->import_ipynb)
  Downloading jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi, import_ipynb
Successfully installed import_ipynb-0.1.4 jedi-0.19.1


In [None]:
# mount your google drive to this notebook
from google.colab import drive
drive.flush_and_unmount()
drive.mount("mnt", force_remount=True)


In [None]:
# import the "Copy of methods.ipynb" from your google drive into this notebook
m = __import__("mnt/MyDrive/Colab Notebooks/Copy of methods")

# Read in the data

After reading in the methods, we need to read in the data from the NASA OSDR space biology data repository. In this notebook, we will be using the immunostaining microscopy PECAM data from OSD-568, the RNA-seq data from OSD-255, and the tonometry data from OSD-583.

After reading in the data from OSDR, we will reduce the dimensions of the RNA-seq data to include only those genes that are differentially expressed between the ground control and spaceflight samples at a chosen p-value threshold. This is a form of [dimensionality reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction) that will remove some noise from the gene expression so our classification algorithms can focus on the signal.

In [None]:
# define dictionaries for data and metadata
data=dict()
metadata=dict()

In [None]:
# read in metadata
metadata['255'] = m.read_meta_data('255')
metadata['568'] = m.read_meta_data('568')
metadata['583'] = m.read_meta_data('583')

In [None]:
# read in tonometry transformed data from OSD-583
data['iop'] = m.read_phenotype_data('583', 'LSDS-16_tonometry_maoTRANSFORMED')
print('num samples: ', str(len(list(data['iop']['Sample Name']))))
print('samples: ', list(data['iop']['Sample Name']))
data['iop'].head()

In [None]:
# read in the immunostaining PECAM microscopy data from OSD-568
data['immunoMICRO-PECAM'] = m.read_phenotype_data('568', 'LSDS-5_immunostaining_microscopy_PECAMtr_TRANSFORMED')
print('num records: ', len(data['immunoMICRO-PECAM']))
data['immunoMICRO-PECAM'].head()

In [None]:
# use m.read_rnaseq_data() to read in the normalized transcriptomic counts from OSD-255
data['255-normalized'] = m.read_rnaseq_data('255_rna_seq_Normalized_Counts')
data['255-normalized'].head()

In [None]:
# filter genes to those significantly differentially expressed between ground control and space flight
rna_seq = m.filter_by_dgea(data['255-normalized'], metadata['255'],  pval=0.05, l2fc=0)
print('rna_seq data shape: ', rna_seq.shape)
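
The internals of `m.filter_by_dgea` are not shown in this notebook. As a rough illustration of the idea only, the sketch below keeps a gene when its expression differs between flight and ground samples with a p-value below `pval` and an absolute log2 fold change of at least `l2fc`. The exact permutation test, the helper names, and the toy expression values are all assumptions made for this example; the real method may well use a different statistical test.

```python
from itertools import combinations
from math import log2
from statistics import mean


def permutation_pvalue(flight, ground):
    """Exact two-sided permutation p-value for the difference in group means."""
    pooled = list(flight) + list(ground)
    observed = abs(mean(flight) - mean(ground))
    total = sum(pooled)
    n_flight, n_ground = len(flight), len(ground)
    hits = 0
    n_splits = 0
    # enumerate every way of relabelling n_flight of the pooled values as "flight"
    for idx in combinations(range(len(pooled)), n_flight):
        flight_sum = sum(pooled[i] for i in idx)
        diff = abs(flight_sum / n_flight - (total - flight_sum) / n_ground)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
        n_splits += 1
    return hits / n_splits


def filter_genes(expression, pval=0.05, l2fc=0.0):
    """Keep genes with a small p-value and a large enough absolute log2 fold change."""
    kept = []
    for gene, groups in expression.items():
        p = permutation_pvalue(groups["flight"], groups["ground"])
        fold = log2(mean(groups["flight"]) / mean(groups["ground"]))
        if p < pval and abs(fold) >= l2fc:
            kept.append(gene)
    return kept


# hypothetical normalized counts for two genes, four samples per group
toy_expression = {
    "gene_a": {"flight": [50, 55, 60, 58], "ground": [10, 12, 11, 9]},
    "gene_b": {"flight": [20, 22, 19, 21], "ground": [21, 20, 22, 19]},
}
kept_genes = filter_genes(toy_expression)
print(kept_genes)
```

Only `gene_a` passes the filter here, because its flight values are clearly separated from its ground values while `gene_b` has identical group means.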

**QUESTIONS**

1. How many genes were in the RNA-seq dataset before filtering for differentially expressed genes? After filtering?

2. How many samples have IOP measurements? PECAM measurements?

3. What is the name of the column in the PECAM data that we will be using as a phenotype measurement?

# Predict intraocular pressure (IOP) from RNA-seq (gene expression) data

Not all the samples with IOP measurements had their RNA sequenced.  We will need to first subset the IOP data to match those samples with RNA-seq data.

## Prepare the data for the algorithms

In [None]:
# create a dataframe called iop_subset - a subset of data['iop'] - which uses only "Retina_Ground" and "Retina_Flight" samples
samples=list()
for sample in rna_seq.columns[1:]:
  samples.append(metadata['255'][metadata['255']['Sample Name']==sample]['Source Name'].values[0])
samples_short=list()
for sample in samples:
  num = ""
  for c in sample:
    if c.isdigit():
      num += str(c)
  if 'G' in sample:
    samples_short.append("GC" + num)
  elif 'F' in sample:
    samples_short.append("F" + num)
iop_subset = data['iop'][data['iop']['Source Name'].isin(samples_short)]
iop_subset.head()
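
The loop above rebuilds short sample IDs by collecting the digits in each source name and adding a `GC` or `F` prefix. The same rule can be written as a small helper, sketched below. The function name `short_sample_id` and the example source names are hypothetical; check them against the actual `Source Name` values in the OSD-255 metadata.

```python
def short_sample_id(source_name):
    """Return 'GC<digits>' or 'F<digits>', or None if the name has neither letter."""
    digits = "".join(c for c in source_name if c.isdigit())
    if "G" in source_name:  # checked first, as in the notebook loop
        return "GC" + digits
    if "F" in source_name:
        return "F" + digits
    return None


example_names = ["G15", "F7", "V3"]  # hypothetical source names
short_ids = [short_sample_id(name) for name in example_names]
print(short_ids)
```

Returning `None` for names that match neither group makes unexpected samples easy to spot before they are silently dropped.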

In [None]:
# change the names in the rna_seq dataframe to match those in the iop_subset dataframe
# NOTE: this assumes the rows of iop_subset are in the same order as the sample columns of rna_seq
rna_seq.columns = ['Unnamed: 0'] + list(iop_subset['Source Name'])

In [None]:
# compute each sample's IOP as the average of Avg_Left and Avg_Right, then binarize it
# (1 if above the mean IOP, else 0); y is the target (response) in our model.
y = list()
for i in range(len(iop_subset)):
  iop_val=(iop_subset.iloc[i]['Avg_Left'] + iop_subset.iloc[i]['Avg_Right'])/2
  y.append(iop_val)
y = m.np.array(y)
y_classes = list()
for i in y:
  if i > y.mean():
    y_classes.append(1)
  else:
    y_classes.append(0)

y = y_classes
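
Thresholding at the mean turns the continuous IOP values into two classes, and with so few samples it is worth checking how balanced those classes are. A minimal sketch with made-up IOP values (not taken from OSD-583):

```python
from collections import Counter
from statistics import mean

# hypothetical per-sample IOP values (average of left and right eye)
iop_values = [14.0, 15.5, 13.0, 18.5, 16.0, 12.5]

threshold = mean(iop_values)
labels = [1 if value > threshold else 0 for value in iop_values]
class_counts = Counter(labels)

print("threshold:", round(threshold, 2))
print("labels:", labels)
print("class counts:", dict(class_counts))
```

If one class turns out to be much larger than the other, a classifier can reach a high accuracy simply by always predicting the larger class, so the class counts are a useful baseline to keep in mind.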

In [None]:
# create the feature matrix X from the rna-seq values (one row per sample, one column per gene)
X = m.transpose_df(rna_seq, 'Unnamed: 0', 'sample').drop(columns=['sample'])

In [None]:
# split up data between training and testing
X_train, X_test, y_train, y_test = m.train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
# show the dimensions of the training and testing data
print('X train: ', X_train.shape)
print('y train: ', len(y_train))
print('X test: ', X_test.shape)
print('y test: ', len(y_test))
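
`m.train_test_split` is presumably scikit-learn's `train_test_split` made available through the methods notebook (an assumption, since the methods notebook is not shown here). The idea behind a shuffled hold-out split can be sketched in plain Python: shuffle the sample indices, reserve a fraction for testing, and train on the rest. The helper `simple_split` below is illustrative only and will not reproduce scikit-learn's exact shuffling.

```python
import math
import random


def simple_split(n_samples, test_size=0.2, seed=0):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)  # round the test set up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = simple_split(10)  # 10 hypothetical samples
print("train size:", len(train_idx))
print("test size:", len(test_idx))
```

Fixing the seed plays the same role as `random_state=0` in the cell above: the split is random but repeatable.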

**QUESTIONS**

1. How many samples are used for training?

2. How many samples are used for testing?

3. Based on the number of samples used for testing, what are the possible values for the testing accuracy?

## Build a random forest model to predict IOP from gene expression

In [None]:
# run random forest classification on X, y

clf = m.RandomForestClassifier(max_depth=8, random_state=23)
clf.fit(X_train, y_train)

y_train_pred = clf.predict(X_train)
train_acc = m.accuracy_score(y_train, y_train_pred)
print("train accuracy:", train_acc)

y_pred = clf.predict(X_test)
test_acc = m.accuracy_score(y_test, y_pred)
print("test accuracy:", test_acc)

In [None]:
# visualize the random forest
num_trees=10
for i in range(num_trees):
    tree = clf.estimators_[i]
    dot_data = m.export_graphviz(tree,
                               feature_names=X_train.columns,
                               filled=True,
                               impurity=False,
                               proportion=True)
    graph = m.graphviz.Source(dot_data)
    display(graph)

In [None]:
# now create a confusion matrix
from sklearn.metrics import confusion_matrix
y_pred = clf.predict(X_test)
confusion_matrix(y_test, y_pred)
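
For two classes, scikit-learn's `confusion_matrix` returns a 2x2 array whose rows are the true labels and whose columns are the predicted labels, both in sorted label order (0, then 1). The plain-Python sketch below reproduces that layout on made-up labels so the four cells are easy to read; `binary_confusion` is a hypothetical helper, not part of the methods notebook.

```python
def binary_confusion(y_true, y_pred):
    """Rows are true classes (0, 1); columns are predicted classes (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[true_label][pred_label] += 1
    return matrix


# hypothetical labels: 0 = low IOP, 1 = high IOP
y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 0]

cm = binary_confusion(y_true_toy, y_pred_toy)
toy_accuracy = (cm[0][0] + cm[1][1]) / len(y_true_toy)
print(cm)            # [[1, 1], [1, 2]]
print(toy_accuracy)  # 0.6
```

The diagonal cells count correct predictions, so accuracy is the sum of the diagonal divided by the number of samples.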

**QUESTIONS**

1. What is the training accuracy of the random forest model? Test accuracy?

2. Which genes are used in the 10 decision trees of the random forest model visualized above?

3. According to the confusion matrix, how many low IOP samples were correctly classified? How many high IOP samples were correctly classified? How many low IOP samples were confused with high IOP samples?

## Build a single-layer perceptron model that predicts IOP from gene expression


In [None]:
# run single-layer perceptron classification on X, y

from sklearn.linear_model import Perceptron
clf = Perceptron(tol=1e-3, random_state=0)
clf.fit(X_train, y_train)

y_train_pred = clf.predict(X_train)
train_acc = m.accuracy_score(y_train, y_train_pred)
print("train accuracy:", train_acc)

y_pred = clf.predict(X_test)
test_acc = m.accuracy_score(y_test, y_pred)
print("test accuracy:", test_acc)

print('overall score: ', clf.score(X, y))
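
When there are far more genes than samples, a linear model such as the perceptron can often separate the training set perfectly even if the labels carry no signal at all. The sketch below illustrates this with a hand-written perceptron on purely random features and random labels. Everything in it (the helper names, the 12 training samples, the 200 features) is an assumption chosen for illustration and is unrelated to the OSDR data or to scikit-learn's `Perceptron` implementation.

```python
import random


def train_perceptron(X, y, max_epochs=1000):
    """Classic perceptron rule; stops once every training sample is correct."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0
            step = yi - pred  # 0 when correct, +1 or -1 when wrong
            if step != 0:
                w = [wj + step * xj for wj, xj in zip(w, xi)]
                b += step
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def accuracy(X, y, w, b):
    preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0 for xi in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


def random_dataset(rng, n_samples, n_features):
    """Gaussian noise features with coin-flip labels: no real signal."""
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [rng.randint(0, 1) for _ in range(n_samples)]
    return X, y


rng = random.Random(0)
X_train_toy, y_train_toy = random_dataset(rng, n_samples=12, n_features=200)
X_test_toy, y_test_toy = random_dataset(rng, n_samples=4, n_features=200)

w, b = train_perceptron(X_train_toy, y_train_toy)
toy_train_acc = accuracy(X_train_toy, y_train_toy, w, b)
toy_test_acc = accuracy(X_test_toy, y_test_toy, w, b)
print("train accuracy:", toy_train_acc)
print("test accuracy:", toy_test_acc)
```

The training accuracy reaches 1.0 because 12 random points in 200 dimensions are linearly separable for any labelling, while the held-out accuracy has no reason to be better than chance.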


**QUESTIONS**

1. What is the training accuracy of the SLP model?

2. What is the test accuracy of the SLP model?

3. What might explain the discrepancy between the training and testing accuracy?

## BONUS: Build a logistic regression model that predicts IOP from gene expression

In [None]:
# now run logistic regression classification on X, y

clf = m.LogisticRegression(random_state=23)
clf.fit(X_train, y_train)

y_train_pred = clf.predict(X_train)
train_acc = m.accuracy_score(y_train, y_train_pred)
print("train accuracy:", train_acc)

y_pred = clf.predict(X_test)
test_acc = m.accuracy_score(y_test, y_pred)
print("test accuracy:", test_acc)


**QUESTIONS**

1. What is the training accuracy of the logistic regression model?

2. What is the test accuracy of the logistic regression model?

3. Which model has a better test accuracy for predicting IOP from gene expression -- the random forest model, the SLP model, or the logistic regression model?

# Predict immunostaining PECAM microscopy from RNA-seq (gene expression)

Not all the samples with PECAM measurements had their RNA sequenced. We first need to intersect the PECAM samples with the RNA-seq samples.
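
Intersecting the two sample lists is only half of the job: the feature rows and the labels must also end up in the same sample order. A minimal sketch of that alignment, using hypothetical sample IDs and values rather than the real OSD-255 and OSD-568 records:

```python
# hypothetical data keyed by short sample ID
rna_by_sample = {
    "F1": [5.1, 0.2],
    "F2": [4.8, 0.1],
    "GC1": [1.0, 0.9],
    "GC3": [1.2, 1.1],
}
pecam_by_sample = {"F2": 0.71, "GC1": 0.35, "GC2": 0.40, "F1": 0.66}

# keep only samples present in both datasets, in one fixed order
shared = sorted(set(rna_by_sample) & set(pecam_by_sample))

# build features and targets by walking the same ordered list
X_rows = [rna_by_sample[sample] for sample in shared]
y_values = [pecam_by_sample[sample] for sample in shared]

print(shared)
print(X_rows)
print(y_values)
```

Sorting the intersection matters because Python sets have no guaranteed order; building both `X_rows` and `y_values` from the same sorted list keeps row `i` of the features paired with label `i`.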

## Prepare the data for the algorithms

In [None]:
# filter genes to those significantly differentially expressed between ground control and space flight
rna_seq = m.filter_by_dgea(data['255-normalized'], metadata['255'],  pval=0.05, l2fc=0)

In [None]:
# get source names from 255 and sample names in immunoMICRO pecam and intersect the lists and subset the df's
samples_255_dict = dict()
samples_pecam = list()
for i in range(len(metadata['255'])):
  sample = metadata['255'].iloc[i]['Source Name']
  num = ""
  for c in sample:
    if c.isdigit():
      num += str(c)
  if "G" in sample:
    samples_255_dict["GC" + num] = metadata['255'].iloc[i]['Sample Name']

  elif "F" in sample:
    samples_255_dict["F" + num] = metadata['255'].iloc[i]['Sample Name']
  else:
    continue

for sample in data['immunoMICRO-PECAM']['Sample_Name']:
  num = ""
  for c in sample:
    if c.isdigit():
      num += str(c)
  if "G" in sample:
    samples_pecam.append("GC" + num)
  elif "F" in sample:
    samples_pecam.append("F" + num)
  else:
    print('neither ground nor space: ',  sample)
    continue

print('255 samples: ', samples_255_dict.keys())
print('pecam samples: ', samples_pecam)
# intersect 255 samples with immunoMICRO pecam samples
samples_both=list(set(samples_255_dict.keys()) & set(samples_pecam))
print('both: ', samples_both)
# subset 255 and pecam samples from intersection
gsm_samples = list()
for sample in samples_both:
  gsm_samples.append(samples_255_dict[sample])
print('gsm: ', gsm_samples)

In [None]:
# now subset the rna_seq dataframe with samples from the gsm_samples list
X = rna_seq[['Unnamed: 0'] + gsm_samples]
print(X.columns)

In [None]:
# display the short PECAM sample names collected above
samples_pecam

In [None]:
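# create the binary target y from the PECAM 'Average' values (1 if above the mean, else 0)
# NOTE: this loops over every PECAM record, so the number and order of records must match the samples in X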
y = list()
for i in range(len(data['immunoMICRO-PECAM'])):
  pecam_val=data['immunoMICRO-PECAM'].iloc[i]['Average']
  print('sample: ', data['immunoMICRO-PECAM'].iloc[i]['Sample_Name'])
  y.append(pecam_val)

y = m.np.array(y)
y_classes = list()
for p in y:
  if p > y.mean():
    y_classes.append(1)
  else:
    y_classes.append(0)

y = y_classes
print('y = ', y)

In [None]:
# create the feature matrix X from the rna-seq values (one row per sample, one column per gene)
X = m.transpose_df(X, 'Unnamed: 0', 'sample').drop(columns=['sample'])

In [None]:
# split up data into training and testing subsets
X_train, X_test, y_train, y_test = m.train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
# show the dimensions of the training and testing data
print('X train: ', X_train.shape)
print('y train: ', len(y_train))
print('X test: ', X_test.shape)
print('y test: ', len(y_test))

**QUESTIONS**

1. How many samples are used for training the model?

2. How many samples are used for testing the model?

3. Based on the number of samples for testing, what are the possible accuracy scores?

## Build a random forest model to predict PECAM microscopy from gene expression

In [None]:
# now run classification on X, y
max_depth=4
clf = m.RandomForestClassifier(max_depth=max_depth, random_state=23)
clf.fit(X_train, y_train)

y_train_pred = clf.predict(X_train)
train_acc = m.accuracy_score(y_train, y_train_pred)
print("train accuracy:", train_acc)

y_pred = clf.predict(X_test)
accuracy = m.accuracy_score(y_test, y_pred)
print("test accuracy:", accuracy)

In [None]:
# visualize the first 10 trees of the random forest
num_trees=10
for i in range(num_trees):
    tree = clf.estimators_[i]
    dot_data = m.export_graphviz(tree,
                               feature_names=X_train.columns,
                               filled=True,
                               impurity=False,
                               proportion=True)
    graph = m.graphviz.Source(dot_data)
    display(graph)

**QUESTIONS**

1. What is the training accuracy of the random forest model?

2. What is the test accuracy of the random forest model?

3. Which genes are used in the decision trees of the random forest?

## BONUS: Build a logistic regression model to predict PECAM microscopy from gene expression

In [None]:
# now run classification on X, y

clf = m.LogisticRegression(random_state=23)
clf.fit(X_train, y_train)

y_train_pred = clf.predict(X_train)
train_acc = m.accuracy_score(y_train, y_train_pred)
print("train accuracy:", train_acc)

y_pred = clf.predict(X_test)
accuracy = m.accuracy_score(y_test, y_pred)
print("test accuracy:", accuracy)

**QUESTIONS**

1. What is the training accuracy of the logistic regression model?

2. What is the test accuracy of the logistic regression model?

3. Which model has a better test accuracy for predicting PECAM microscopy from gene expression -- the random forest model or the logistic regression model?