# 1 Data Preparation to Integrate the Athal1 (Healthy) RNA Sequence Count Data Set

#### Dataset1: Healthy - Arabidopsis (Col-0)

#### Cynthia Soto
#### Date: August 20, 2020
#### Data type representation: CDS expression levels (gene_name)
#### Data quantified with HTSeq (alignments done with STAR SA)

HTSeq produces absolute (raw) read counts, so the data need to be prepared and transformed before further exploratory data analysis.

Workflow:
1) Explore the data to get the correct format (remove or add headers, remove uninformative rows, etc.).
2) Get basic statistics and count zero values.
3) Add 1 to every count so that zero values do not break the log2 conversion (log2(0) is undefined; NumPy returns -inf with a divide-by-zero warning).
4) Apply the log2 conversion.
5) Get basic statistics.
6) Save the transformations and statistics to CSV files for further analysis.
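
The workflow above can be sketched as a single helper. This is a minimal sketch under stated assumptions, not the code used in this notebook: the function name `prepare_counts` and the in-memory sample are hypothetical, and it assumes a two-column, tab-separated HTSeq output whose summary rows start with `__`. Unlike the cells below, it filters the summary rows by name and leaves the raw counts unchanged.

```python
import io

import numpy as np
import pandas as pd


def prepare_counts(source):
    """Load a two-column HTSeq count table and add a log2(count + 1) column."""
    df = pd.read_csv(source, sep="\t", header=None, names=["Genes", "Counts"])
    # Drop the HTSeq summary rows (assumed to start with "__").
    df = df[~df["Genes"].str.startswith("__")].reset_index(drop=True)
    # Pseudocount of 1 so that zero counts map to log2(1) = 0.
    df["log2_value"] = np.log2(df["Counts"] + 1)
    return df


# Tiny made-up example, not real data.
sample = io.StringIO("geneA\t0\ngeneB\t3\n__no_feature\t10\n")
result = prepare_counts(sample)
print(result)
```

The same function could be called with a file path such as one of the sample files listed below instead of the `StringIO` object.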


In [15]:
import pandas as pd
import numpy as np
import os 

In [16]:
try:
    # Change the current working Directory    
    os.chdir("/home/cyntsc/Proyectos/athal1_htseq-counts/")
    print("Directory changed")
except OSError:
    print("Can't change the Current Working Directory")     
    
print(os.getcwd())

Directory changed
/home/cyntsc/Proyectos/athal1_htseq-counts


In [17]:
# List the files in the current working directory
os.listdir(os.getcwd())

['SRR6283145',
 'SRR3383640_Log2.csv',
 'SRR6283144',
 'SRR3383821_Log2.csv',
 'SRR3383640',
 'SRR3383782_Log2.csv',
 'SRR6283145_Log2.csv',
 'SRR3383641_Log2.csv',
 'SRR3383783_Log2.csv',
 'SRR3383783',
 'SRR3383822_Log2.csv',
 'SRR3383782',
 'SRR3383822',
 'SRR6283144_Log2.csv',
 'SRR3383641',
 'SRR3383821']

In [18]:
# Read the tab-separated HTSeq count file (it has no header row)

df = pd.read_csv("SRR3383640", sep='\t', header=None)

# Name the columns
df.columns = ["Genes", "Counts"]

In [19]:
df.shape

(27660, 2)

In [20]:
df.head(10)
#df.tail(10)
# For these samples, rows up to index 27654 must be included

Unnamed: 0,Genes,Counts
0,AT1G01010,91
1,AT1G01020,108
2,AT1G01030,13
3,AT1G01040,1027
4,AT1G01050,850
5,AT1G01060,288
6,AT1G01070,42
7,AT1G01080,975
8,AT1G01090,1281
9,AT1G01100,483


In [120]:
df.drop(df.tail(5).index, inplace=True)  # drop the last 5 rows (HTSeq summary statistics)
df

Unnamed: 0,Genes,Counts
0,AT1G01010,91
1,AT1G01020,108
2,AT1G01030,13
3,AT1G01040,1027
4,AT1G01050,850
...,...,...
27650,ATMG01350,1
27651,ATMG01360,20
27652,ATMG01370,24
27653,ATMG01400,0
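
Dropping a fixed number of trailing rows relies on the file ending with exactly five summary lines. As a hedged alternative, the rows can be filtered by name, assuming the HTSeq summary rows are the ones whose identifier starts with `__` (for example `__no_feature`). The small table below is made up for illustration.

```python
import pandas as pd

# Made-up stand-in for an HTSeq count table (not real data).
demo = pd.DataFrame(
    {
        "Genes": ["AT1G01010", "AT1G01020", "__no_feature", "__ambiguous"],
        "Counts": [91, 108, 500, 20],
    }
)

# Boolean mask marking the summary rows by their "__" prefix.
is_summary = demo["Genes"].str.startswith("__")

# Keep only real gene rows; the original frame is left untouched.
genes_only = demo.loc[~is_summary].reset_index(drop=True)
print(genes_only)
```

Filtering by name keeps working if the number of summary lines ever differs from five, at the cost of depending on the naming convention.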


In [121]:
## Get the max and min values of the numeric columns
#df.select_dtypes(include=[np.number]).max()
#df.select_dtypes(include=[np.number]).min()
## Get basic statistics
df.describe()
#df.shape

Unnamed: 0,Counts
count,27655.0
mean,347.664328
std,1860.46925
min,0.0
25%,0.0
50%,52.0
75%,281.0
max,153437.0


In [122]:
# Check for zeros; if there are any, add 1 to the values so that log2 is defined (log2(0) gives -inf)
df.isin([0]).sum()

Genes        0
Counts    7396
dtype: int64

In [123]:
## Add 1 to every count so that log2 is defined for zero counts (log2(0) gives -inf)
#df.dtypes
df["Counts"] += 1

In [124]:
df

Unnamed: 0,Genes,Counts
0,AT1G01010,92
1,AT1G01020,109
2,AT1G01030,14
3,AT1G01040,1028
4,AT1G01050,851
...,...,...
27650,ATMG01350,2
27651,ATMG01360,21
27652,ATMG01370,25
27653,ATMG01400,1


In [125]:
# Check that no zero values remain
df.isin([0]).sum()

Genes     0
Counts    0
dtype: int64

In [126]:
# Apply NumPy's log2 function to the shifted absolute counts and store the result in a new column
df['log2_value'] = np.log2(df['Counts'])

In [127]:
df.head(10)

Unnamed: 0,Genes,Counts,log2_value
0,AT1G01010,92,6.523562
1,AT1G01020,109,6.768184
2,AT1G01030,14,3.807355
3,AT1G01040,1028,10.005625
4,AT1G01050,851,9.733015
5,AT1G01060,289,8.174926
6,AT1G01070,43,5.426265
7,AT1G01080,976,9.930737
8,AT1G01090,1282,10.324181
9,AT1G01100,484,8.918863
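
To see why the pseudocount matters, the following minimal sketch compares `np.log2` with and without adding 1. The array `raw` holds made-up counts and is not taken from the sample.

```python
import numpy as np

raw = np.array([0, 1, 3, 1027])  # made-up raw counts

# Without a pseudocount, the zero maps to -inf; the errstate context
# silences NumPy's divide-by-zero warning for this demonstration.
with np.errstate(divide="ignore"):
    naive = np.log2(raw)

# With a pseudocount of 1, zero maps to log2(1) = 0 and all values are finite.
shifted = np.log2(raw + 1)

print(naive)
print(shifted)
```

Writing `np.log2(raw + 1)` also avoids modifying the stored counts, which is a design choice rather than a requirement.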


In [30]:
# Store the DataFrame (counts and log2 values) in a tab-separated file
df.to_csv('SRR3383640_Log2.csv', sep='\t', index=True)

In [31]:
# Get basic statistics for both metrics 
df.describe()

Unnamed: 0,Counts
count,27660.0
mean,383.968944
std,5252.645241
min,0.0
25%,0.0
50%,52.0
75%,281.0
max,798390.0


In [32]:
# Send the dataframe to a statistics file for further analysis
df.describe().to_csv('../athal1_stats/SRR3383640_stats.csv', sep='\t')

In [33]:
# Load the statistics into another DataFrame and add a column with the sample name
df_stats = pd.read_csv("../athal1_stats/SRR3383640_stats.csv", sep='\t')
df_stats["Sample"] = "SRR3383640"

In [34]:
df_stats

Unnamed: 0.1,Unnamed: 0,Counts,Sample
0,count,27660.0,SRR3383640
1,mean,383.968944,SRR3383640
2,std,5252.645241,SRR3383640
3,min,0.0,SRR3383640
4,25%,0.0,SRR3383640
5,50%,52.0,SRR3383640
6,75%,281.0,SRR3383640
7,max,798390.0,SRR3383640


In [35]:

## Reorder the columns so that the sample name comes first
cols = df_stats.columns.tolist()
cols = cols[-1:] + cols[:-1]
df_stats = df_stats[cols]
df_stats


Unnamed: 0.1,Sample,Unnamed: 0,Counts
0,SRR3383640,count,27660.0
1,SRR3383640,mean,383.968944
2,SRR3383640,std,5252.645241
3,SRR3383640,min,0.0
4,SRR3383640,25%,0.0
5,SRR3383640,50%,52.0
6,SRR3383640,75%,281.0
7,SRR3383640,max,798390.0


In [38]:
# Store the statistics for further analysis
df_stats.to_csv("../athal1_stats/SRR3383640_stats.csv", sep='\t', index=False)
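
The same statistics table could also be built without writing and re-reading the file. The sketch below is one possible approach, not the notebook's method: the column name `Statistic` and the sample label `sample_A` are placeholders, and the counts are made up.

```python
import pandas as pd

demo = pd.DataFrame({"Counts": [0, 52, 281, 1027]})  # made-up counts

# describe() puts the statistic names in the index; move them to a column.
stats = demo.describe().rename_axis("Statistic").reset_index()

# Insert the sample name as the first column (the label is a placeholder).
stats.insert(0, "Sample", "sample_A")
print(stats)
```

Naming the index before `reset_index()` avoids the `Unnamed: 0` column that appears when the file is read back with default settings.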

To add more statistics with the same structure, concatenate the DataFrames along the rows with the pandas concat() function, passing the DataFrames in a list as its argument.
Example:
df_row = pd.concat([df1, df2])
df_row
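
A slightly fuller, runnable version of this example is sketched here. The two tables and the sample names `sample_A` and `sample_B` are made up; only `pd.concat` and its `ignore_index` parameter come from pandas itself.

```python
import pandas as pd

# Two made-up statistics tables with the same columns (placeholder sample names).
df1 = pd.DataFrame(
    {"Sample": ["sample_A"] * 2, "Statistic": ["min", "max"], "Counts": [0.0, 10.0]}
)
df2 = pd.DataFrame(
    {"Sample": ["sample_B"] * 2, "Statistic": ["min", "max"], "Counts": [1.0, 20.0]}
)

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 0, 1.
df_row = pd.concat([df1, df2], ignore_index=True)
print(df_row)
```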