# Script to analyze the gene expression over different bins
### Genome Analysis Project, VT24

**Last Changed:** 2024-04-29

## 1) Read the data

Open the data files (.tsv) and store them as pandas DataFrames.

In [1]:
# Import the necessary packages
import os
import pandas as pd

In [10]:
# Define the path to the directory containing the data files (.tsv format)
directory = '/Users/claranordquist/Documents/Universitetet/VT/Genome_Analysis/Lab_Project/Feature_analysis/'

# Create a dictionary that stores one dataframe per bin
# Create a second dictionary that pools the bins into one dataset per environment:
# high oxygen (SRR4342137) and low oxygen (SRR4342139)
bin_dataframes = {}
env_dataframes = {'High':pd.DataFrame(), 'Low':pd.DataFrame()}

# Loop over the files in the input directory and pick those that end with .tsv
# For those, read the files and make them into dataframes
for file in os.listdir(directory):
    if file.endswith('.tsv'):
        basename = file[:-4]
        bin_dataframes[basename] = pd.read_table(os.path.join(directory, file), sep='\t', skiprows=1, header=None)
        bin_dataframes[basename].columns = ["Count", "Gene_ID", "ftype", "length_bp", "gene", "EC_number", "COG", "Product"]

        if basename.endswith('SRR4342137'):
            env_dataframes['High'] = pd.concat([env_dataframes['High'], bin_dataframes[basename]], axis=0)
        else:
            env_dataframes['Low'] = pd.concat([env_dataframes['Low'], bin_dataframes[basename]], axis=0)
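
The loop above treats every file that does not end with SRR4342137 as low oxygen. A minimal sketch of a stricter alternative is shown here, assuming the file basenames always end with the run accession. The names `RUN_TO_ENV` and `environment_for`, and the example file names, are hypothetical and not part of the original notebook.

```python
# Hypothetical lookup table: sequencing run accession -> environment label
RUN_TO_ENV = {
    "SRR4342137": "High",  # high-oxygen sample
    "SRR4342139": "Low",   # low-oxygen sample
}

def environment_for(basename):
    """Return 'High' or 'Low' for a file basename, or raise if the run is unknown."""
    for run, env in RUN_TO_ENV.items():
        if basename.endswith(run):
            return env
    raise ValueError(f"No known run accession in file name: {basename!r}")

# Example file names are made up for illustration
print(environment_for("bin_4_SRR4342137"))
print(environment_for("bin_4_SRR4342139"))
```

Raising an error for unknown names means that a stray .tsv file in the directory would be noticed instead of being silently pooled into the low oxygen dataset.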

In [None]:
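# This can be used to merge rows that share a Gene_ID, in case multiple bins show the same feature
# Don't know if this is necessary though?
# Counts are summed; for the annotation columns, the first value in each group is kept
aggregate_function = {"Count":'sum', "ftype":'first', "length_bp":'first', "gene":'first', "EC_number":'first', "COG":'first', "Product":'first'}

# Store the result back in the dictionary (reassigning only the loop variable would leave env_dataframes unchanged)
# The trailing [data.columns] restores the original column order
for env, data in env_dataframes.items():
    env_dataframes[env] = data.groupby('Gene_ID', as_index=False).aggregate(aggregate_function)[data.columns]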
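
Whether the merge is needed can be checked by looking for repeated Gene_ID values in a pooled table. The sketch below uses a small made-up table (`pooled` is a hypothetical stand-in for one entry of `env_dataframes`) and only shows the check itself.

```python
import pandas as pd

# Toy data standing in for one pooled environment table (values are made up)
pooled = pd.DataFrame({
    "Count": [10, 5, 7],
    "Gene_ID": ["gene_A", "gene_A", "gene_B"],
    "Product": ["hypothetical protein", "hypothetical protein", "ATP synthase"],
})

# True if at least one Gene_ID occurs in more than one row
has_duplicates = pooled["Gene_ID"].duplicated().any()

# Rows that share a Gene_ID with another row (keep=False marks every copy)
duplicated_rows = pooled[pooled["Gene_ID"].duplicated(keep=False)]

print(has_duplicates)
print(duplicated_rows)
```

If the check reports no duplicates for either environment, the aggregation step leaves the counts unchanged and could be skipped.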

## 2) What do I want to visualize?

Count = How many RNA reads were mapped to that feature

Gene_ID = The Prokka ID for the feature

ftype = Feature type, e.g. CDS (coding sequence)

length_bp = Length of the feature

gene = Gene name

EC_number = Numerical classification for enzymes based on the chemical reactions they catalyze

COG = Cluster of orthologous genes

Product = Description of the gene product (when known)
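
Since longer features tend to collect more reads, one option for the visualization is to rank features by a length-normalized count. The sketch below is only one possible approach, not something decided in this notebook: the helper `add_reads_per_kilobase`, the `RPK` column, and the toy values are all assumptions for illustration.

```python
import pandas as pd

def add_reads_per_kilobase(df):
    """Return a copy of df with an extra 'RPK' column: Count / (length_bp / 1000)."""
    out = df.copy()
    out["RPK"] = out["Count"] / (out["length_bp"] / 1000)
    return out

# Toy table with made-up values, using the column names described above
toy = pd.DataFrame({
    "Count": [200, 50, 90],
    "Gene_ID": ["gene_A", "gene_B", "gene_C"],
    "length_bp": [2000, 250, 900],
})

# Rank the features from highest to lowest length-normalized count
ranked = add_reads_per_kilobase(toy).sort_values("RPK", ascending=False)
print(ranked[["Gene_ID", "Count", "length_bp", "RPK"]])
```

In this toy table, gene_B has the fewest reads but ranks first after normalization because it is the shortest feature.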