Astrobotany project: Finding patterns from space data using statistical methods

In the first milestone, the objective (and name) of the project was defined with the help of the paper by Choi et al. Data exploration was then carried out on the BRIC19 data using R, and several concerns about the data, along with possible remedies, were discussed. Candidate statistical methods were suggested, including unsupervised learning (e.g. clustering) and deep learning (e.g. NLP-based neural networks).
After a formal discussion of the first milestone, it was decided that new data from NASA GeneLab, GLDS-120, will be used for the analysis. The steps carried out in the first milestone, such as data preprocessing and data exploration, will therefore be repeated on the new data before the main model is built. This time, Python will be used for the analysis.

Study Description of GLDS-120: 

Experimentation on the International Space Station has reached the stage where repeated and nuanced transcriptome studies are beginning to illuminate the **structural and metabolic differences between plants grown in space compared to plants on the Earth.** 

Genes that are important in setting up the spaceflight responses are being identified; their roles in spaceflight physiological adaptation are increasingly understood, and the fact that different genotypes adapt differently is recognized. However, the basic question of whether these spaceflight responses are required for survival has yet to be posed, and the fundamental notion that spaceflight responses may be non-adaptive has yet to be explored.
Therefore the experiments presented here were designed to ask **if portions of the plant spaceflight response can be genetically removed without causing loss of spaceflight survival and without causing increased stress responses.**

The CARA experiment compared the spaceflight transcriptome responses of two Arabidopsis ecotypes, Col-0 and WS, as well as that of a PhyD mutant of Col-0. When grown with the ambient light of the ISS, phyD displayed a significantly reduced spaceflight transcriptome response compared to Col-0, **suggesting that altering the activity of a single gene can actually improve spaceflight adaptation by reducing the transcriptome cost of physiological adaptation.** The WS genotype showed an even simpler spaceflight transcriptome response in the ambient light of the ISS, more **broadly indicating that the plant genotype can be manipulated to reduce the transcriptome cost of plant physiological adaptation to spaceflight and suggesting that genetic manipulation might further reduce, or perhaps eliminate the metabolic cost of spaceflight adaptation.** 

When plants were germinated and then left in the dark on the ISS, the WS genotype actually mounted a larger transcriptome response than Col-0, suggesting that the in-space light environment affects physiological adaptation, which further **implies that manipulating the local habitat can also substantially impact the metabolic cost of spaceflight adaptation**.

The 'Normalized_counts' data from GLDS-120 will be the main dataset; 'Array_Genediff_pilot', 'RNAseq_Genediff_pilot' and 'RNAseq_Isoformsdiff_pilot' will be used as supplementary data.

The objective of this project remains the same:
Finding patterns from space data using statistical methods

In [1]:
%load_ext watermark
%watermark -a 'Alex-Seo' -v -p matplotlib,numpy,pandas

Alex-Seo 

CPython 3.6.2
IPython 6.1.0

matplotlib 3.2.0
numpy 1.18.1
pandas 1.0.1


In [3]:
#Data Exploration
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
nc_all = pd.read_csv("GLDS-120_rna_seq_Normalized_Counts.csv")

In [6]:
#Summary
print(nc_all.columns.values)
nc_all.describe(include='all')

['Unnamed: 0' 'Atha_Col-0-PhyD_root_FLT_Alight_Rep1_GSM2493783'
 'Atha_Col-0-PhyD_root_FLT_Alight_Rep2_GSM2493784'
 'Atha_Col-0-PhyD_root_FLT_Alight_Rep3_GSM2493785'
 'Atha_Col-0-PhyD_root_FLT_dark_Rep1_GSM2493792'
 'Atha_Col-0-PhyD_root_FLT_dark_Rep2_GSM2493793'
 'Atha_Col-0-PhyD_root_FLT_dark_Rep3_GSM2493794'
 'Atha_Col-0-PhyD_root_GC_Alight_Rep1_GSM2493765'
 'Atha_Col-0-PhyD_root_GC_Alight_Rep2_GSM2493766'
 'Atha_Col-0-PhyD_root_GC_Alight_Rep3_GSM2493767'
 'Atha_Col-0-PhyD_root_GC_dark_Rep1_GSM2493774'
 'Atha_Col-0-PhyD_root_GC_dark_Rep2_GSM2493775'
 'Atha_Col-0-PhyD_root_GC_dark_Rep3_GSM2493776'
 'Atha_Col-0_root_FLT_Alight_Rep1_GSM2493777'
 'Atha_Col-0_root_FLT_Alight_Rep2_GSM2493778'
 'Atha_Col-0_root_FLT_Alight_Rep3_GSM2493779'
 'Atha_Col-0_root_FLT_dark_Rep1_GSM2493786'
 'Atha_Col-0_root_FLT_dark_Rep2_GSM2493787'
 'Atha_Col-0_root_FLT_dark_Rep3_GSM2493788'
 'Atha_Col-0_root_GC_Alight_Rep1_GSM2493759'
 'Atha_Col-0_root_GC_Alight_Rep2_GSM2493760'
 'Atha_Col-0_root_GC_Alight_Rep3_

Unnamed: 0.1,Unnamed: 0,Atha_Col-0-PhyD_root_FLT_Alight_Rep1_GSM2493783,Atha_Col-0-PhyD_root_FLT_Alight_Rep2_GSM2493784,Atha_Col-0-PhyD_root_FLT_Alight_Rep3_GSM2493785,Atha_Col-0-PhyD_root_FLT_dark_Rep1_GSM2493792,Atha_Col-0-PhyD_root_FLT_dark_Rep2_GSM2493793,Atha_Col-0-PhyD_root_FLT_dark_Rep3_GSM2493794,Atha_Col-0-PhyD_root_GC_Alight_Rep1_GSM2493765,Atha_Col-0-PhyD_root_GC_Alight_Rep2_GSM2493766,Atha_Col-0-PhyD_root_GC_Alight_Rep3_GSM2493767,...,Atha_Ws_root_FLT_Alight_Rep3_GSM2493782,Atha_Ws_root_FLT_dark_Rep1_GSM2493789,Atha_Ws_root_FLT_dark_Rep2_GSM2493790,Atha_Ws_root_FLT_dark_Rep3_GSM2493791,Atha_Ws_root_GC_Alight_Rep1_GSM2493762,Atha_Ws_root_GC_Alight_Rep2_GSM2493763,Atha_Ws_root_GC_Alight_Rep3_GSM2493764,Atha_Ws_root_GC_dark_Rep1_GSM2493771,Atha_Ws_root_GC_dark_Rep2_GSM2493772,Atha_Ws_root_GC_dark_Rep3_GSM2493773
count,23986,23986.0,23986.0,23986.0,23986.0,23986.0,23986.0,23986.0,23986.0,23986.0,...,23986.0,23986.0,23986.0,23986.0,23986.0,23986.0,23986.0,23986.0,23986.0,23986.0
unique,23986,,,,,,,,,,...,,,,,,,,,,
top,AT5G45460,,,,,,,,,,...,,,,,,,,,,
freq,1,,,,,,,,,,...,,,,,,,,,,
mean,,271.395314,277.232597,289.182954,289.71426,280.02401,258.990533,260.75602,302.439771,281.749583,...,294.663725,273.061268,285.883888,277.750948,258.538545,286.332299,262.432956,288.948716,297.528658,285.203647
std,,1353.469798,1337.9631,1482.151966,1560.239516,1806.662967,1744.745931,1430.56722,1852.863195,1661.791337,...,1395.875646,1382.495552,1658.790398,2044.074041,1702.860888,1321.071097,1355.903087,1653.220698,1674.812064,2507.1156
min,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,,10.558713,11.034681,11.198903,11.208779,7.418363,5.17859,9.640505,11.45843,10.868236,...,9.349489,6.909304,8.944097,9.159718,5.653611,10.864496,8.519041,4.816593,6.535961,7.535663
50%,,72.482867,73.71633,74.265394,72.477015,65.433599,64.302697,69.705616,73.065333,73.183726,...,66.423257,62.668809,67.204603,67.648305,63.159644,70.55374,68.947188,57.603644,60.760212,66.765505
75%,,211.638011,214.029798,220.91251,218.012934,207.940249,209.4661,205.339227,223.362101,211.618886,...,202.509334,195.134214,201.563571,204.235876,197.097063,206.261872,206.716153,190.750169,197.74432,200.67655
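
The summary above lists the gene identifiers under a column called 'Unnamed: 0', which suggests that the first column of the CSV has no header. A minimal sketch of one way to handle this, assuming that column holds unique gene IDs (as the matching count and unique values of 23986 indicate), is to load it as the row index with `index_col=0`. The two-gene table below is a toy stand-in with made-up values, and `nc_demo` and `gene_id` are illustrative names that are not part of the original notebook.

```python
import io
import pandas as pd

# Toy stand-in for GLDS-120_rna_seq_Normalized_Counts.csv.
# The sample names follow the real column names; the numbers are made up.
csv_text = """,Atha_Col-0_root_FLT_Alight_Rep1_GSM2493777,Atha_Ws_root_GC_dark_Rep1_GSM2493771
AT1G01010,10.5,0.0
AT1G01020,72.4,65.4
"""

# index_col=0 turns the unnamed first column (gene IDs) into the row index,
# so describe() only summarizes the numeric sample columns.
nc_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
nc_demo.index.name = "gene_id"  # hypothetical label, chosen for readability

print(nc_demo.shape)
print(nc_demo.describe())
```

With the real file, the same call would be `pd.read_csv("GLDS-120_rna_seq_Normalized_Counts.csv", index_col=0)`. Every remaining column is then numeric, so `include='all'` is no longer needed in `describe()`.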


In [None]:
# Divide into separate datasets by genotype: PhyD, Col-0, WS
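
One possible way to carry out this split is to parse the sample column names, which appear to follow the pattern `Atha_<genotype>_root_<FLT|GC>_<Alight|dark>_Rep<n>_<GSM id>`. The sketch below assumes that pattern holds for every sample column; `SAMPLE_RE`, `split_by_genotype` and the toy frame `demo` are hypothetical names introduced for illustration, and the numeric values are made up.

```python
import re
import pandas as pd

# Pattern assumed from the column names printed earlier in the notebook.
SAMPLE_RE = re.compile(
    r"^Atha_(?P<genotype>.+?)_root_(?P<condition>FLT|GC)"
    r"_(?P<light>Alight|dark)_Rep(?P<rep>\d+)_(?P<gsm>GSM\d+)$"
)

def split_by_genotype(df):
    """Return a dict mapping genotype -> sub-DataFrame of its sample columns."""
    groups = {}
    for col in df.columns:
        match = SAMPLE_RE.match(col)
        if match is None:
            continue  # skip non-sample columns such as 'Unnamed: 0'
        groups.setdefault(match.group("genotype"), []).append(col)
    return {genotype: df[cols] for genotype, cols in groups.items()}

# Toy frame: real sample names, made-up counts.
demo = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=["AT1G01010", "AT1G01020"],
    columns=[
        "Atha_Col-0-PhyD_root_FLT_Alight_Rep1_GSM2493783",
        "Atha_Col-0_root_FLT_Alight_Rep1_GSM2493777",
        "Atha_Ws_root_GC_dark_Rep1_GSM2493771",
    ],
)

by_genotype = split_by_genotype(demo)
print({genotype: list(sub.columns) for genotype, sub in by_genotype.items()})
```

Applied to the full table, `split_by_genotype(nc_all)` would return one sub-table each for the keys `Col-0-PhyD`, `Col-0` and `Ws`. The same regular expression also exposes the condition (FLT or GC) and light setting (Alight or dark), should a finer split be needed later.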