### Homework 6 - BIOINF 575

Goals:
* Reading from a file with pandas
* Creating and using pd.DataFrame 
* Saving results to a file

Although this homework has two problems, it should be very short in terms of lines of code to write.


#### Problem 1. 

The data file (`GSE53697_RNAseq_AD.txt`) is a tab-separated file that contains raw read counts and gene expression values in the form of RPKM values from RNA sequencing of 8 control and 9 Alzheimer's disease (AD) samples (human).

The first line in the file is the header that describes the table of values:

    GeneID - Gene ID
    GeneSymbol - Gene Symbol
    8 columns: C1-8_raw 
        raw read count values for the 8 control samples
    9 columns: A9-17_raw 
        raw read count values for the 9 Alzheimer's disease samples
    8 columns: C1-8_rpkm 
        RPKM values for the 8 control samples
    9 columns: A9-17_rpkm 
        RPKM values for the 9 Alzheimer's disease (AD) samples

Reads Per Kilobase of transcript per Million mapped reads (RPKM) is a normalized unit of transcript/gene expression. It normalizes for both sequencing depth and gene length. Longer genes may collect more reads simply because they have more bases that the reads can align to, and that needs to be accounted for. Also, if one sample (s1) has more reads than another sample (s2), then the overall expression of all genes in s1 would appear higher than in s2. Hence, the two samples cannot be compared unless a normalization for sequencing depth is performed.
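As a rough illustration of this normalization, the sketch below applies the usual RPKM definition (read count × 10^9, divided by gene length in bases × total mapped reads). The helper name `rpkm` and all numbers are made up for the example; the data file already contains precomputed RPKM columns, so this is not needed for the homework itself.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """RPKM = read_count * 1e9 / (gene length in bases * total mapped reads)."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Hypothetical values: same read count, but the second gene is twice as long
rpkm_short = rpkm(100, 1000, 1_000_000)
rpkm_long = rpkm(100, 2000, 1_000_000)

# Hypothetical values: twice the reads from a sample sequenced twice as deeply
rpkm_deep = rpkm(200, 1000, 2_000_000)

print(rpkm_short, rpkm_long, rpkm_deep)
```

The longer gene gets half the RPKM for the same read count, and doubling both the read count and the sequencing depth leaves the RPKM unchanged.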

Given this file, select differentially expressed genes between Alzheimer's disease and control. There are many ways to compute differential expression. Here, to go from expression to differential expression, for each gene you have to compute the ratio between two averages: (1) the average expression of the gene in disease and (2) the average expression of the gene in control.

Since these averages might be 0, the division by zero will result in infinite values (`np.inf`, which can be checked with the function `np.isfinite`). Ignore the infinite values. Select the differentially expressed genes and return them as a `pd.DataFrame`.
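A minimal sketch of how such values could be filtered out, using made-up averages for three hypothetical genes (the names `disease_mean` and `control_mean` are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical mean RPKM values for three genes
disease_mean = pd.Series([4.0, 2.0, 0.0], index=["g1", "g2", "g3"])
control_mean = pd.Series([1.0, 0.0, 0.0], index=["g1", "g2", "g3"])

# g1 is an ordinary ratio, g2 divides by zero (inf), g3 is 0/0 (NaN)
ratio = disease_mean / control_mean

# np.isfinite is False for inf, -inf and NaN, so only g1 is kept
finite_ratio = ratio[np.isfinite(ratio)]
print(finite_ratio)
```

The same mask also works after taking the logarithm, where a ratio of 0 would turn into `-inf`.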

#### Assignment Requirement:

* Create a function `compute_DEgenes` that processes a file with the file name given as a parameter and computes a data frame with the differentially expressed genes (genes with a log2 fold change > 1, where the fold change is the ratio between the average of the disease RPKM values and the average of the control RPKM values). The data frame contains only one column, which holds the values of the log2 ratios, and the row names are the gene symbols.
* Test the function and write the results to a file "DEgenes.csv".

As long as your solution makes use of `pd.DataFrame` and returns the required output it is an acceptable solution.

One way to do this is:
* Read the file content into a `pd.DataFrame`. 
* Use subsetting to create a new data frame with the rpkm control values and compute the row means. 
* Use subsetting to create a new data frame with the rpkm AD values and compute the row means. 
* Create a new data frame with the log2 ratios and the gene symbols as row names (gene symbols are available in the initial data frame where the data was read from the file).
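The steps above can be sketched on a made-up miniature table. Everything here is an assumption for illustration: the toy frame has only two control and two AD columns, and `DataFrame.filter` with a regular expression is just one of several ways to subset the columns.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the table (2 control + 2 AD samples)
toy = pd.DataFrame({
    "GeneSymbol": ["G1", "G2", "G3"],
    "C1_rpkm": [1.0, 4.0, 2.0],
    "C2_rpkm": [1.0, 4.0, 2.0],
    "A3_rpkm": [4.0, 4.0, 1.0],
    "A4_rpkm": [4.0, 4.0, 1.0],
}).set_index("GeneSymbol")

# Subset the columns by name pattern, then average across each row
control_mean = toy.filter(regex=r"^C\d+_rpkm$").mean(axis=1)
disease_mean = toy.filter(regex=r"^A\d+_rpkm$").mean(axis=1)

# log2 of the disease/control ratio, indexed by gene symbol
log2ratios = np.log2(disease_mean / control_mean)

# Keep genes with log2 fold change > 1 as a one-column data frame
result = log2ratios[log2ratios > 1].to_frame("log2ratios")
print(result)
```

The toy values contain no zero averages, so no infinite ratios appear here; with the real file those would still have to be handled.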


#### Problem 2. 

The data you will be working with (`clinvar_20190923_short.vcf`) contains several allele frequencies from different databases. The one to look for in this assignment is from the ExAC database.

The beginning of every VCF file contains various sets of information:
* Meta-information (details about the experiment or configuration) lines start with **`##`**
    * These lines are helpful in understanding specialized keys found in the `INFO` column. 
* Header lines (column names) start with **`#`**

From there on, each line is made up of tab (`\t`) separated values that make up eight (8) columns. Those columns are:

1. CHROM (chromosome)
2. POS (base pair position of the variant)
3. ID (identifier if applicable; `.` if not applicable/missing)
4. REF (reference base)
5. ALT (alternate base(s): comma (`,`) separated if applicable)
6. QUAL (Phred-scaled quality score; `.` if not applicable/missing)
7. FILTER (filter status; `.` if not applicable/missing)
8. INFO (any additional information about the variant)
    * Semi-colon (`;`) separated key-value pairs
    * Key-value pairs are equal sign (`=`) separated 
        (key on the left, value on the right)
    * If a key has multiple values, the values are comma (`,`) separated

There are some additional details to consider for this assignment. You will be expected to consider two (2) special types of keys:
* The `AF_EXAC` key that describes the allele frequencies from the ExAC database
    > `##INFO=<ID=AF_EXAC,Number=1,Type=Float,Description="allele frequencies from ExAC">`
    * The data included are `float`ing point numbers
* The `CLNDN` key that gives all the names the given variant is associated with
    > `##INFO=<ID=CLNDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">`
    * The data are `str`ings. **However**, if there are multiple diseases associated with a given variant, the diseases are pipe (`|`) separated.
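As a small illustration of how these two keys could be pulled out of an INFO value, the sketch below parses a made-up, shortened INFO string. The string and the variable names are hypothetical; real INFO values in the file contain many more keys.

```python
# Hypothetical INFO value, shortened for illustration
info = "AF_EXAC=0.00002;ALLELEID=12345;CLNDN=Disease_A|not_provided|Disease_B"

# Key-value pairs are separated by ';' and each pair is split once on '='
fields = dict(pair.split("=", 1) for pair in info.split(";") if "=" in pair)

# The values are read as text, so the frequency is converted to float
af_exac = float(fields["AF_EXAC"])

# Multiple disease names are pipe-separated
diseases = fields["CLNDN"].split("|")

print(af_exac, diseases)
```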

#### Assignment Requirement
* Write a function `select_variants` that takes the file name as a parameter and returns a data frame with rare variants (AF_EXAC < 0.0001) that are associated with at least 2 diseases and have `"C"` as REF and `"T"` as ALT.
* Test the function and write the results to a file "SELvariants.csv".

As long as your solution makes use of `pd.DataFrame` and returns the required output it is an acceptable solution.

One way to do this is:
* Read the table from the file into a `pd.DataFrame`. The `pd.read_csv()` function has a `comment` parameter that can be used to skip the lines that start with `"#"`. However, if that is done, the column labels have to be added by using the `columns` attribute.
* Similar to homework 3, compute a list of diseases for rare variants (variants with AF_EXAC < 0.0001); the function will return an empty list if the variant is not rare or does not have any disease associated. The `not_specified` and `not_provided` values should not appear in these lists. Use the details provided in homework 3. You can do this part of the homework by adapting the `parse_line` function you wrote for homework 3 to a `parse_INFO` function that parses not a whole line but the value in the INFO column, since that is now a column in the data frame. 
* The `apply` function of a `pd.Series` can be used to apply the function `parse_INFO` to each element of the INFO column of the data frame to compute the disease list for each variant, and the `len` function can be applied to the resulting `pd.Series` to compute the length of the disease list for each variant. 
* The result can be stored as a new column in the initial data frame. The new column contains the length of the associated disease list.
* Subset the updated data frame using conditional subsetting. Conditions on the `REF` and `ALT` columns and on the newly added column (that contains the associated disease list length) would compute a data frame with rare variants (AF_EXAC < 0.0001) that are associated with at least 2 diseases and have `"C"` as REF and `"T"` as ALT. 
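The steps above can be sketched on a made-up three-variant VCF held in memory. All data, the column name `N_DISEASES`, and the stand-in function `disease_list` are hypothetical; treating a missing `AF_EXAC` key as "not rare" is an assumption of this sketch, not something the assignment states.

```python
import io
import pandas as pd

# Hypothetical three-variant VCF, tab-separated, for illustration only
toy_vcf = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t1\tC\tT\t.\t.\tAF_EXAC=0.00002;CLNDN=Disease_A|Disease_B\n"
    "1\t200\t2\tC\tT\t.\t.\tAF_EXAC=0.00002;CLNDN=Disease_A|not_provided\n"
    "1\t300\t3\tG\tA\t.\t.\tAF_EXAC=0.00002;CLNDN=Disease_A|Disease_B\n"
)

def disease_list(info):
    """Stand-in for parse_INFO: named diseases of a rare variant, else []."""
    fields = dict(p.split("=", 1) for p in info.split(";") if "=" in p)
    # Assumption: a variant without AF_EXAC is treated as not rare
    if float(fields.get("AF_EXAC", "1")) >= 0.0001:
        return []
    names = fields.get("CLNDN", "").split("|")
    return [n for n in names if n and n not in ("not_provided", "not_specified")]

# comment="#" skips the ## and # lines, so the labels are assigned by hand
df = pd.read_csv(io.StringIO(toy_vcf), sep="\t", comment="#", header=None)
df.columns = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

# Disease list per variant, then its length as a new column
df["N_DISEASES"] = df["INFO"].apply(disease_list).apply(len)

# Combine the three conditions with & (each one wrapped in parentheses)
selected = df[(df["REF"] == "C") & (df["ALT"] == "T") & (df["N_DISEASES"] >= 2)]
print(selected[["POS", "REF", "ALT", "N_DISEASES"]])
```

Only the first toy variant passes all three conditions: the second has a single named disease and the third has the wrong REF/ALT pair.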

    
    


---
## Academic Honor Code
In accordance with Rackham's Academic Misconduct Policy, upon submission of your assignment, you (the student) are indicating acceptance of the following statement:

> “I pledge that this submission is solely my own work.”

As such, the instructors reserve the right to process any and all source code therein contained within the submitted notebooks with source code plagiarism detection software.

Any violations of this agreement will result in swift, sure, and significant punishment.

In [1]:
import numpy as np
import pandas as pd

In [2]:
def compute_DEgenes(file_name):
    '''
    This function selects genes that have more than a 2-fold expression change
    (positive or negative log2 ratio) between a disease and a control state.
    

    This function takes a file (.txt) location as a string and reads the data from that file into a DataFrame.
    A value of 1 is added to both the control and disease mean RPKM values (to avoid division by zero when calculating
    the ratio). The ratio of the disease RPKM to the control RPKM values is computed and the log base 2 of the
    ratio is taken. Genes with an absolute log base 2 ratio greater than 1 are selected
    and saved to a new DataFrame. The new DataFrame is written to a .csv file, 'DEgenes.csv'.

    Parameters: 
    file_name (string): The argument of this function is the (.txt) file location as a string.


    Returns: 
    genes_de_df (DataFrame): This output is a DataFrame of the genes with an absolute log base 2 ratio of
                             disease RPKM to control RPKM values greater than 1
    DEgenes.csv (file.csv): This output is a .csv file of the genes_de_df DataFrame
    
    '''
    # Read the tab-separated table from the given file; GeneSymbol (column 1) becomes the index
    data = pd.read_csv(file_name, header=0, sep='\t', index_col=1)
    # Names of the RPKM columns for the 8 control (C1-C8) and 9 AD (A9-A17) samples
    control_cols = [f'C{i}_rpkm' for i in range(1, 9)]
    disease_cols = [f'A{i}_rpkm' for i in range(9, 18)]
    # Row means plus a pseudocount of 1, so the ratio never divides by zero
    control_mean = data[control_cols].mean(axis=1) + 1
    disease_mean = data[disease_cols].mean(axis=1) + 1
    # log2 of the disease/control ratio for every gene
    log2ratios = np.log2(disease_mean / control_mean)
    # Keep genes with an absolute log2 ratio > 1; to_frame turns the Series into a DataFrame
    genes_de_df = log2ratios[log2ratios.abs() > 1].to_frame('log2ratios')
    genes_de_df.to_csv('DEgenes.csv')
    return genes_de_df

In [3]:
# test the function compute_DEgenes with the given file 'GSE53697_RNAseq_AD.txt'

# save result to file "DEgenes.csv"

compute_DEgenes('GSE53697_RNAseq_AD.txt')

GeneSymbol,log2ratios
B2M,-1.083161
CD44,1.156077
CHI3L1,-1.804778
CHI3L2,-2.707243
DSP,-1.144183
IFI6,-1.516431
GBP1,-1.580063
GBP2,-1.117521
GBP3,-1.09507
IFIT3,-1.414833


In [4]:
def parse_info(INFO):
    '''
    This function creates a list of known diseases associated with a rare gene variant, as determined by an 
    AF_EXAC value < 0.0001. If the variant is not rare, or there is no known associated disease, then an empty
    list is returned.
    

    This function takes one element (a string) of the INFO column of a .vcf file and checks whether the variant
    is rare, as determined by an AF_EXAC value < 0.0001. The associated diseases of the variant are then
    returned in a list. If a gene variant is not rare, or has no known associated diseases, an empty list is
    returned.

    Parameters: 
    INFO (string): The argument of this function is one element of the INFO column of a .vcf file


    Returns: 
    CLNDN_disease_clean (list): This output is a list of the diseases associated with a rare gene variant. If the
                                variant is not rare, or there are no known associated diseases, the list is 
                                returned empty
    
    '''
    # Split the INFO string into its ';'-separated key=value pairs
    fields = dict(pair.split('=', 1) for pair in INFO.split(';') if '=' in pair)
    # No ExAC frequency reported, or the variant is not rare: return an empty list
    if 'AF_EXAC' not in fields or float(fields['AF_EXAC']) >= 0.0001:
        return []
    # Split the pipe-separated disease names and drop the placeholder values
    CLNDN_disease = fields.get('CLNDN', '').split('|')
    CLNDN_disease_clean = [name for name in CLNDN_disease
                           if name and name not in ('not_provided', 'not_specified')]
    return CLNDN_disease_clean

In [5]:
def select_variants(filename):
    '''
    This function will parse for rare gene variants determined by an AF_EXAC value < 0.0001 and variants with 
    more than 1 known associated disease within that pool.
    

    This function takes a .vcf file location as a string to open and read in the data to a DataFrame from said file.
    A pd.series is created from the INFO column of the DataFrame and the function, parse_info, is applied to the series.
    A new column, ASSOCIATED_DISEASE_NUMBER, is added to the initial DataFrame(using the length of the pd.series after
    parse_info is applied). A smaller DataFrame containing only the rare variants with more than one known associated
    disease is created(ref_variants). This DataFrame is then selected for only the gene variants with a REF value of 'C'
    and an ALT value of 'T' and saved into a new DataFrame, variants_selected_df, which is saved to the file 
    'SELvariants.csv'

    Parameters: 
    filename (string): The argument of this function is a file (.vcf) location as a string.


    Returns: 
    variants_selected_df (DataFrame): This output is a DataFrame of the rare gene variants(as determined by an 
                                      AF_EXAC value < 0.0001) with more than one known associated disease,
                                      a REF value of 'C', and an ALT value of 'T'
    SELvariants.csv (file.csv): This output is a .csv file of the variants_selected_df DataFrame
    
    '''
    clinvar_initial=pd.read_csv(filename, sep='\t', comment='#', names=['CHROM','POS','ID','REF','ALT','QUAL','FILTER','INFO'])
    clinvar_INFO=clinvar_initial['INFO']
    clinvar_INFO_applied=clinvar_INFO.apply(parse_info)
    clinvar_initial['ASSOCIATED_DISEASE_NUMBER']=clinvar_INFO_applied.apply(len)
    clinvar_rare_disease=clinvar_initial.loc[clinvar_initial['ASSOCIATED_DISEASE_NUMBER']>1]
    ref_variants=clinvar_rare_disease.loc[clinvar_rare_disease['REF'] == 'C']
    variants_selected_df=ref_variants.loc[ref_variants['ALT'] == 'T']
    variants_selected_df.to_csv('SELvariants.csv')
    
    
    return variants_selected_df

In [6]:
# test the function select_variants with the given file 'clinvar_20190923_short.vcf'

# save result to file "SELvariants.csv"

select_variants('clinvar_20190923_short.vcf')

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,ASSOCIATED_DISEASE_NUMBER
632,1,1806513,224713,C,T,.,.,AF_EXAC=0.00001;ALLELEID=226500;CLNDISDB=.|.|H...,6
795,1,2229492,381375,C,T,.,.,AF_ESP=0.00008;AF_EXAC=0.00005;ALLELEID=364674...,2
818,1,2229730,264199,C,T,.,.,AF_EXAC=0.00000;ALLELEID=257945;CLNDISDB=MedGe...,2
823,1,2302983,520164,C,T,.,.,AF_EXAC=0.00002;ALLELEID=509149;CLNDISDB=MedGe...,2
920,1,2304437,213697,C,T,.,.,AF_EXAC=0.00000;ALLELEID=209481;CLNDISDB=MedGe...,2
922,1,2304450,213678,C,T,.,.,AF_ESP=0.00008;AF_EXAC=0.00009;ALLELEID=209482...,2
950,1,2306158,520169,C,T,.,.,AF_ESP=0.00008;AF_EXAC=0.00002;ALLELEID=509157...,2


-----------

In [7]:
# Do not change this cell
genes_df = pd.read_csv("DEgenes.csv", index_col=0)
genes_df

GeneSymbol,log2ratios
B2M,-1.083161
CD44,1.156077
CHI3L1,-1.804778
CHI3L2,-2.707243
DSP,-1.144183
IFI6,-1.516431
GBP1,-1.580063
GBP2,-1.117521
GBP3,-1.09507
IFIT3,-1.414833


In [8]:
# Do not change this cell
variants_df = pd.read_csv("SELvariants.csv")
variants_df

Unnamed: 0.1,Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,ASSOCIATED_DISEASE_NUMBER
0,632,1,1806513,224713,C,T,.,.,AF_EXAC=0.00001;ALLELEID=226500;CLNDISDB=.|.|H...,6
1,795,1,2229492,381375,C,T,.,.,AF_ESP=0.00008;AF_EXAC=0.00005;ALLELEID=364674...,2
2,818,1,2229730,264199,C,T,.,.,AF_EXAC=0.00000;ALLELEID=257945;CLNDISDB=MedGe...,2
3,823,1,2302983,520164,C,T,.,.,AF_EXAC=0.00002;ALLELEID=509149;CLNDISDB=MedGe...,2
4,920,1,2304437,213697,C,T,.,.,AF_EXAC=0.00000;ALLELEID=209481;CLNDISDB=MedGe...,2
5,922,1,2304450,213678,C,T,.,.,AF_ESP=0.00008;AF_EXAC=0.00009;ALLELEID=209482...,2
6,950,1,2306158,520169,C,T,.,.,AF_ESP=0.00008;AF_EXAC=0.00002;ALLELEID=509157...,2
