### Study session 9 - more bioinformatics
#### Use data and control structures and pandas dataframes 
#### BIOINF 575 - Fall 2023



___ 
#### RECAP: Pandas dataframes

<img src = "https://media.geeksforgeeks.org/wp-content/uploads/finallpandas.png" width = 550/>

https://www.geeksforgeeks.org/python-pandas-dataframe/

#### How does pandas work?

Pandas is built on top of [NumPy](http://www.numpy.org/), and therefore leverages NumPy's C-level speed for its data analysis.

* A NumPy array holds elements of a single data type.
* A pandas DataFrame can hold many types, one per column.
* Think of a table, where each column can be whatever type you want it to be, so long as every item in the column is that same type.
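The difference can be seen in a minimal sketch. The gene names and values below are made up for illustration only.

```python
import numpy as np
import pandas as pd

# A NumPy array has one dtype: mixing ints and floats upcasts everything to float.
arr = np.array([1, 2.5, 3])
print(arr.dtype)

# A DataFrame keeps a separate dtype for each column (toy data).
table = pd.DataFrame({
    "gene": ["geneA", "geneB"],
    "length": [1200, 850],
    "expressed": [True, False],
})
print(table.dtypes)
```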

#### `pd.DataFrame`

**Two-dimensional** labeled data structure with columns of *potentially* different types

```python
# Initialization Syntax
df = pd.DataFrame(data, index, columns, dtype)
```

* **`data`** : what is going inside the DataFrame (numpy ndarray (structured or homogeneous), dict, or DataFrame)
* **`index`** : row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`columns`** : column identifiers
* **`dtype`** : numpy/python based data types
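As a small illustration of these four arguments, the sketch below builds a data frame from a NumPy array. The gene and sample labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy 3 x 2 matrix of integer values.
data = np.array([[1, 2], [3, 4], [5, 6]])

df = pd.DataFrame(
    data,
    index=["geneA", "geneB", "geneC"],   # row labels
    columns=["sample1", "sample2"],      # column labels
    dtype=float,                         # force every value to float
)
print(df)
print(df.loc["geneB", "sample2"])
```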

Attributes (a selection)

['T',
 'at',
 'axes',
 'columns',
 'dtypes',
 'empty',
 'iat',
 'iloc',
 'index',
 'loc',
 'ndim',
 'plot',
 'shape',
 'size',
 'style',
 'values']

_____
_____

#### Gene expression differential analysis and functional annotation

Read the data from the file `expression_data.txt` into a pandas data frame. The file contains a tab-separated matrix with genes on the rows and samples on the columns.
The data file contains comment lines that start with "#".
Each sample belongs to one of two groups, `disease` or `healthy`, and the group labels for the samples are as follows:

```python
    group = ["disease", "healthy", "disease", "disease", "healthy",
             "healthy", "healthy", "disease", "disease", "disease"]
```
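One way to read a tab-separated matrix that contains comment lines is the `comment` argument of `pd.read_csv`. The sketch below uses a small in-memory stand-in, because the exact layout of `expression_data.txt` is not shown here; the header row and gene names are assumptions.

```python
import io
import pandas as pd

# Made-up stand-in for the real file: one comment line, a header row, two genes.
text = (
    "# made-up comment line\n"
    "\ts1\ts2\n"
    "geneA\t1.5\t2.0\n"
    "geneB\t0.3\t0.1\n"
)

# comment="#" skips the comment lines; index_col=0 uses the gene names as row labels.
expr = pd.read_csv(io.StringIO(text), sep="\t", comment="#", index_col=0)
print(expr)
```

For the real file, the `io.StringIO(text)` argument would be replaced by the file name.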

------------
  
<b><font color = "red">Exercise</font> <br></b>


Select the genes that have a p-value < 10% on a t-test between the expression in disease and the expression in healthy.  
The t-test is implemented in the scipy library in the stats module (you can use `from scipy import stats`).   
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html    
Keep in mind that we are running multiple statistical tests, so we should correct for multiple comparisons before interpreting or reading too much into the results.    
https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html   
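As a hint, here is a minimal sketch of a row-wise t-test on a toy matrix. The two genes and their values are invented; only the `group` list comes from the exercise. It assumes that the sample columns are in the same order as the labels.

```python
import pandas as pd
from scipy import stats

group = ["disease", "healthy", "disease", "disease", "healthy",
         "healthy", "healthy", "disease", "disease", "disease"]

# Toy expression matrix: 2 genes x 10 samples (made-up values).
expr = pd.DataFrame(
    [[10, 1, 11, 9, 2, 1, 2, 10, 12, 11],
     [5, 5, 6, 5, 6, 5, 6, 6, 5, 6]],
    index=["gene_up", "gene_flat"],
    columns=[f"s{i}" for i in range(1, 11)],
)

# Boolean mask over the columns, aligned on the sample names.
is_disease = pd.Series(group, index=expr.columns) == "disease"
disease = expr.loc[:, is_disease].to_numpy()
healthy = expr.loc[:, ~is_disease].to_numpy()

# axis=1 runs one test per row (gene).
result = stats.ttest_ind(disease, healthy, axis=1)
p_values = pd.Series(result.pvalue, index=expr.index)

selected = p_values[p_values < 0.10]
print(selected)
```

The p-values in `p_values` are uncorrected; they could then be passed to `multipletests` from statsmodels to adjust for multiple comparisons.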


In [None]:
# Write your solution here



___

<b> <font color = "red">Exercise</font></b>

Select the GO (gene ontology) terms for the genes that have a p-value < 10%.   
Gene ontology terms tell us what biological processes, molecular functions, and cellular components the genes are associated with.     
http://geneontology.org/docs/ontology-documentation/   

The file `gene_go.txt` contains that association.   
Note: look for the `.isin` method of a `pd.Series`.
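A small sketch of how `.isin` filters rows is shown below. The column names `gene` and `go_term` and the GO identifiers are hypothetical placeholders; the real `gene_go.txt` may be laid out differently.

```python
import pandas as pd

# Made-up gene-to-GO association table.
gene_go = pd.DataFrame({
    "gene": ["geneA", "geneA", "geneB", "geneC"],
    "go_term": ["GO:0000001", "GO:0000002", "GO:0000003", "GO:0000002"],
})

# Genes that passed the p-value filter (hypothetical).
selected_genes = ["geneA", "geneC"]

# .isin returns a boolean Series: True where the gene is in the list.
mask = gene_go["gene"].isin(selected_genes)
selected_terms = gene_go.loc[mask, "go_term"].unique()
print(selected_terms)
```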


In [None]:
# Write your solution here



____
____

#### Variant data exploration   
We have variant data for a sample in the file `variant_data_file.vcf`.   
The vcf data file contains information about differences found in the genome of a specific sample when compared to the reference.    
The file contains approximately 1000 differences found in the Y chromosome.





..................................................................................................

<b><font color = "red">Exercise</font> <br></b>

<b>Total depth distribution</b>

We want to have a look at the distribution of the total depth in our data (DP key in the INFO column).
- Read the data into a pandas data frame
    - look at the `comment` and `names` arguments of the `read_csv` function in the pandas library; you will need to add the header by hand
- Write a function that selects the total depth from the INFO part of a variant (row in our dataframe)
- Apply the function for each row of the dataframe which will result in a pandas series with the total depth for each variant 
- Plot the histogram of total depths (pandas series previously computed)
- Plot the histogram of the total depths that are less than 170
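The first three steps above can be sketched on a tiny in-memory VCF. The two variant lines are made up, and the eight column names follow the standard VCF fixed fields; the real file may have extra columns such as FORMAT and a sample column.

```python
import io
import pandas as pd

# Made-up miniature VCF: header lines start with "#", fields are tab-separated.
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "Y\t2728456\t.\tA\tT\t32.8\t.\tDP=7;AF=0.5\n"
    "Y\t2734240\t.\tG\tA\t31.4\t.\tAF=1.0;DP=182\n"
)

# comment="#" also drops the "#CHROM" header line, so the names are given by hand.
columns = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]
variants = pd.read_csv(io.StringIO(vcf_text), sep="\t", comment="#", names=columns)


def get_total_depth(row):
    """Return the DP value from the INFO field of one variant, or None if absent."""
    for entry in row["INFO"].split(";"):
        key, _, value = entry.partition("=")
        if key == "DP":
            return int(value)
    return None


# axis=1 applies the function to each row; the result is a Series of depths.
depth = variants.apply(get_total_depth, axis=1)
print(depth)
```

With matplotlib installed, `depth.hist()` and `depth[depth < 170].hist()` would draw the two histograms.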

In [None]:
# Write your solution here






..................................................................................................

<b><font color = "red">Exercise</font> <br></b>


<b>Quality and SNP distribution</b> 

<b>A.</b> Display the distribution of the quality scores for the variants. 
- Select the QUAL column from the data and plot the histogram

<b>B.</b> Compute the frequency (number of occurrences) of each of the REF ALT combinations available in the dataset and make a bar plot.    
Then, identify the most frequent combination of REF ALT in the dataframe 
- Compute the number of occurrences for REF ALT combinations. 
    - Use the `.value_counts()` method of a dataframe to count the occurrences of each unique combination of row elements.
- The result of value_counts is a pandas series, plot the values using a barplot.
- Select from the series the maximum value together with the associated label: e.g. A T 300
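A minimal sketch of the counting step on made-up REF/ALT pairs follows. It assumes pandas 1.1 or later, where `DataFrame.value_counts` is available.

```python
import pandas as pd

# Made-up REF/ALT pairs for five variants.
variants = pd.DataFrame({
    "REF": ["A", "A", "G", "C", "A"],
    "ALT": ["T", "T", "A", "T", "G"],
})

# One count per unique (REF, ALT) combination, sorted from most to least frequent.
counts = variants[["REF", "ALT"]].value_counts()
print(counts)

# Label and value of the most frequent combination.
top_combination = counts.idxmax()
top_count = counts.max()
print(top_combination, top_count)
```

With matplotlib installed, `counts.plot.bar()` would draw the bar plot.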

In [None]:
# Write your solution here




_______
### <font color = "red">Exercise</font> 

Explain what the following code does and describe how it computes the result it displays:


In [None]:
def compute_perc(seq):
    return 100 * (seq.count("C") + seq.count("G")) / len(seq)


def process_text(text):
    res = 0
    gene = ""
    if text is not None:
        gene, promoter_seq = text.split()
        res = compute_perc(promoter_seq)
    return gene, res


file_name = "gene_promoter_sequence.txt"
with open(file=file_name, mode="r") as promoter_seq_file:
    for line in promoter_seq_file:
        res = process_text(line)
        print("gene", res[0], "percentage", res[1])


In [None]:
## Write your code description here



_______

_______
