# **Conceptual Overview:**
This notebook will guide you through the process of compiling and analyzing a subset of gene expression data.

## **Purpose:**
The purpose of this project is to compile a smaller, more focused subset of gene expression data from multiple larger datasets. We will filter specifically for the **BDNF** gene, which encodes brain-derived neurotrophic factor (BDNF), a protein that is critical for the survival and growth of neurons. BDNF also plays a major role in synaptic efficiency and plasticity, the biological basis of learning and memory.

## **Scope:**
In this notebook, we will compile a subset of gene expression data that targets the **BDNF** gene and shows how it is expressed across different regions of the brain. The original datasets come from the Allen Brain Atlas. We will use Python to prepare our data.

## **Major Steps:**
1.   Download the original datasets from the Allen Brain Atlas.
2.   Mount Google Drive to access uploaded files in Colab and import the necessary Python libraries.
3.   Load the datasets (*gene expression*, *brain region*, and *gene name*) using Python.
4.   Filter the data for BDNF gene expression only.
5.   Analyze and clean the new data.
6.   Export the newly compiled subset as a CSV file.





# **Step 1: Downloading the Original Datasets**

To begin with, there are three files you will need to access from the Allen Brain Atlas.

You can access the datasets from this website: https://neuroinformatics.nl/HBP/ABA_mouse/

The three files that you should look for are:

1.   ***ABA_expression.csv*** : This is the main file that contains all of the gene expression data.
2.   ***ABA_structures.csv*** : This is a supplementary file that contains a mapping of row indices in the main file to brain structures, such as the hippocampus and the amygdala.
3.   ***ABA_genes.csv*** : This is another supplementary file that contains a mapping of column indices in the main file to gene names, such as BDNF and SERT.

After you have downloaded these three files, upload them to your Google Drive because we will be accessing them directly from there in the next step!

# **Step 2: Mount Google Drive & Import Python Libraries**

In this step, we will be mounting Google Drive into this Colab notebook so that we can read the files you just stored there directly into Python.

To do so, run the following code:

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


After running this code, you will likely be prompted by Google Colab to give it permission to access your Drive. Click on the provided link, log in with your Google account, and allow the necessary permissions. If you succeeded in mounting Google Drive, you should see the message: "Mounted at /content/gdrive".

Now, we will import the Python libraries necessary for compiling our new subset. The only library necessary for basic data manipulation, which is what we will be doing, is `pandas`.

Import `pandas` by running the following code:

In [None]:
import pandas as pd


Now that you have mounted your Google Drive and imported `pandas`, we can proceed to the next step: loading the files into Python so that we can begin working with them!

# **Step 3: Load the Data into Python**

We are going to load our three datasets into Python as dataframes so that we can easily filter and manipulate the data.

In order to do this, we will be using the `pd.read_csv()` function. It is important that you know *exactly* where the files are stored in your Google Drive, as folder names must be included in the file path. For example, my files are stored in a folder named "ENGL 105: Unit 3," so I will include that when I run my code.

To load your datasets, run the following code:

In [None]:
expression_df = pd.read_csv('/content/gdrive/MyDrive/ENGL 105: Unit 3/ABA_expression.csv', index_col=0)
structures_df = pd.read_csv('/content/gdrive/MyDrive/ENGL 105: Unit 3/ABA_structures.csv')
genes_df = pd.read_csv('/content/gdrive/MyDrive/ENGL 105: Unit 3/ABA_genes.csv')
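
If one of the `pd.read_csv()` calls above fails with a `FileNotFoundError`, it can help to see which paths are missing. The sketch below is one way to check; `find_missing_files` is a hypothetical helper written for this example, and the folder name is the one used in this notebook, so replace it with your own.

```python
from pathlib import Path


def find_missing_files(folder, filenames):
    """Return the names in `filenames` that do not exist inside `folder`."""
    folder = Path(folder)
    return [name for name in filenames if not (folder / name).is_file()]


# Assumed folder: replace with the folder you used in your own Drive.
data_folder = "/content/gdrive/MyDrive/ENGL 105: Unit 3"
expected = ["ABA_expression.csv", "ABA_structures.csv", "ABA_genes.csv"]

missing = find_missing_files(data_folder, expected)
if missing:
    print("Could not find:", ", ".join(missing))
else:
    print("All three files were found.")
```

Any file listed as missing is either in a different folder or has a different name than the path expects.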

Now, to confirm that you loaded your datasets correctly, we will use the `.head()` method to preview the first 5 rows of each dataframe.

Run the following code to ensure that you loaded your data properly:

In [None]:
print("Expression Data:")
print(expression_df.head())

print("\nStructures Data:")
print(structures_df.head())

print("\nGenes Data:")
print(genes_df.head())

Expression Data:
          0.951206  2.42347  0.239863  4.62837   11.7279  5.24365    23.667  \
0.513941                                                                      
0.071430  9.386910  3.43439  0.482486  3.16590  19.28580  2.48365  27.12040   
NaN            NaN      NaN       NaN      NaN       NaN      NaN       NaN   
0.314675  0.703189  2.64723  1.080480  4.72685  11.28940  3.69585  18.70220   
NaN            NaN      NaN       NaN      NaN       NaN      NaN       NaN   
0.509998  0.162417  5.19324  0.164435  2.95629   8.92439  1.75076   9.33423   

          0.0367014  4.92416  0.26645  ...  0.606783  3.12378    0.1101  \
0.513941                               ...                                
0.071430   0.039502  2.87590  1.11642  ...  0.468842  7.75779  0.012348   
NaN             NaN      NaN      NaN  ...       NaN      NaN       NaN   
0.314675   0.164630  3.23008  1.41935  ...  1.043950  2.49682  0.101512   
NaN             NaN      NaN      NaN  ...       NaN  

If you run into an error here, go back and double-check your file paths in Google Drive!
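
The expression preview above shows numbers where column names would normally appear. That pattern can occur when a CSV file has no header row and `pd.read_csv()` treats its first data row as column labels. Whether this applies to `ABA_expression.csv` is an assumption you would need to check against the file itself. The sketch below uses a tiny made-up CSV (not the real data) to show how the options used in this notebook and `header=None` differ in the number of rows and columns they return.

```python
import io

import pandas as pd

# Made-up, header-less CSV text used only for illustration.
toy_csv = "1.0,2.0,3.0\n4.0,5.0,6.0\n7.0,8.0,9.0\n"

# Same options as the loading step: the first row becomes the column
# labels and the first column becomes the row index.
with_defaults = pd.read_csv(io.StringIO(toy_csv), index_col=0)

# header=None keeps every row and every column as data.
no_header = pd.read_csv(io.StringIO(toy_csv), header=None)

print(with_defaults.shape)  # (2, 2)
print(no_header.shape)      # (3, 3)
```

If the real file behaves like the toy example, comparing `expression_df.shape` with `len(structures_df)` and `len(genes_df)` is a quick way to spot a missing row or column.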

Now, we are ready to move on to the next step: finding the BDNF gene and pulling out its data.

# **Step 4: Filter the Data for BDNF Gene Expression**

This is where the process can get complicated or confusing, so here is a brief outline of what we are trying to accomplish in this step:

1.   Find the index of the BDNF gene in our gene list.
2.   Extract the BDNF expression data.
3.   Add the names of the corresponding brain structures.
4.   Create a new dataframe with the names of the brain structures (labels) and our expression data (values).
5.   Preview the data to make sure that our new dataframe looks correct!

## **Find the Index of the BDNF Gene**

In our gene dataframe, each different gene has its own position, or index. In order to compile our focused subset of the gene expression data, we need to know the index of the BDNF gene. This way, we can find its expression data in the main file easily.

To find the index of the BDNF gene, run the following code:



In [None]:
bdnf_index = genes_df[genes_df['acronym'] == 'Bdnf'].index[0]
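
The lookup above uses an exact, case-sensitive match, and `.index[0]` raises an `IndexError` if nothing matches. As a minimal sketch, the hypothetical helper below (`find_gene_position` is not part of pandas) ignores capitalization and gives a clearer error message. It is shown on a made-up gene list; the `acronym` column name is taken from the code above.

```python
import pandas as pd


def find_gene_position(genes_df, acronym, column="acronym"):
    """Return the index label of the first row whose acronym matches, ignoring case."""
    mask = genes_df[column].str.lower() == acronym.lower()
    matches = genes_df[mask].index
    if len(matches) == 0:
        raise ValueError(f"No gene with acronym {acronym!r} was found.")
    return matches[0]


# Made-up gene list used only for illustration.
toy_genes = pd.DataFrame({"acronym": ["Actb", "Bdnf", "Gapdh"]})

print(find_gene_position(toy_genes, "BDNF"))  # 1
```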

## **Extract BDNF Expression Data**
Now that we have found the index of the BDNF gene, we know where its expression data will be located. We want to "pull out" all of the rows in the column that is located at this index. This will give us our expression values. To do this, we are going to use the `.iloc` indexer, which selects rows and columns by position.

To extract the BDNF expression data, run the following code:

In [None]:
bdnf_expression = expression_df.iloc[:, bdnf_index]
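
To see what `.iloc[:, position]` returns, here is a small sketch on a made-up dataframe (the column names and values are invented for illustration). The `:` selects every row, and the number selects one column by its position, counting from 0.

```python
import pandas as pd

# Made-up expression table: one column per gene, one row per region.
toy = pd.DataFrame(
    {"gene_a": [1.0, 2.0], "gene_b": [3.0, 4.0], "gene_c": [5.0, 6.0]}
)

# Select all rows of the column at position 1 (the second column).
second_column = toy.iloc[:, 1]

print(second_column.name)      # gene_b
print(second_column.tolist())  # [3.0, 4.0]
```

The result is a pandas Series, which is why the next steps can set its `.index` and read its `.values`.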


## **Add Brain Regions**
All we have right now are numerical values. In order to complete our new dataframe, we will need to replace the row numbers with the names of the corresponding brain structures.

However, there is an unexpected error that we have run into here. The lengths of `bdnf_expression` and `structures_df` do not match. The former has 1298 rows, while the latter has 1299 rows. The issue is that you cannot attach 1299 labels to only 1298 values; the two must match exactly in length.

To fix this issue, we can drop the extra row in `structures_df` so that the two have the same number of rows!

To drop the extra row and add brain region labels to our data, run the following code:

In [None]:
# Drop the last row of structures_df so that both have 1298 rows:
structures_df = structures_df.iloc[:-1]
# Use the brain structure names as the row labels:
bdnf_expression.index = structures_df['name']
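
Because labels and values must have the same length, it can be useful to check the lengths before assigning the labels. The sketch below shows one way to do that on made-up data; `label_rows` is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd


def label_rows(values, labels):
    """Return a copy of `values` indexed by `labels`, checking lengths first."""
    if len(values) != len(labels):
        raise ValueError(
            f"Length mismatch: {len(values)} values but {len(labels)} labels."
        )
    labeled = values.copy()
    labeled.index = pd.Index(labels)
    return labeled


# Made-up values and region names used only for illustration.
toy_values = pd.Series([0.2, 1.5, 2.1])
toy_labels = pd.Series(["region A", "region B", "region C"])

labeled = label_rows(toy_values, toy_labels)
print(labeled.index.tolist())  # ['region A', 'region B', 'region C']
```

A check like this only confirms that the counts agree. It cannot tell you whether each label is paired with the right value, so it is still worth comparing a few rows against the original files.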

## **Create New Dataframe**
Now that we have our brain region labels and our BDNF expression values, we want to put them together into a new dataframe. To do this, we are going to use the `pd.DataFrame` constructor to reformat our data.

To create our new dataframe, run the following code:


In [None]:
bdnf_df = pd.DataFrame({'Region': bdnf_expression.index,'BDNF_Expression': bdnf_expression.values})


## **Preview the Data**
Now, we are going to use the `.head()` method again to preview our dataset and make sure that everything looks good!

Run the following code to quickly check the new dataset:

In [None]:
bdnf_df.head()


|   | Region | BDNF_Expression |
|---|--------|-----------------|
| 0 | Tuberomammillary nucleus, ventral part | 0.221836 |
| 1 | Primary somatosensory area, mouth, layer 6b | NaN |
| 2 | secondary fissure | 2.18097 |
| 3 | Inferior colliculus | NaN |
| 4 | internal capsule | 1.96829 |


If your dataset does not look right at this point, that is okay! Real data manipulation can be messy and often takes multiple attempts. Here are some tips to help you troubleshoot what might be wrong:

1.   Did you correctly identify the BDNF gene index?
2.   Did you verify that the lengths of your gene expression data and your brain region labels match?
3.   If not, did you handle the extra rows?

Rerun earlier steps in the procedure and preview your data again using the `.head()` function until your dataset looks correct.



# **Step 5: Analysis & Cleaning**
Before we can export our new dataset, we need to check for any missing values (NaNs) and remove them. This way, our finalized dataset is fully ready for future scientific analysis.

To check for and remove missing values, run the following code:


In [None]:
# Check for missing values:
print("Missing values before cleaning:")
print(bdnf_df.isnull().sum())

# Drop any rows with missing values, if any:
bdnf_df = bdnf_df.dropna()

# Confirm that missing values are gone:
print("Missing values after cleaning:")
print(bdnf_df.isnull().sum())

# Preview final, cleaned dataset:
bdnf_df.head()


Missing values before cleaning:
Region               0
BDNF_Expression    456
dtype: int64
Missing values after cleaning:
Region             0
BDNF_Expression    0
dtype: int64


|   | Region | BDNF_Expression |
|---|--------|-----------------|
| 0 | Tuberomammillary nucleus, ventral part | 0.221836 |
| 2 | secondary fissure | 2.18097 |
| 4 | internal capsule | 1.96829 |
| 5 | Principal sensory nucleus of the trigeminal | 1.47377 |
| 6 | Basic cell groups and regions | 1.27221 |


You'll see that we removed the 456 rows with missing BDNF expression values, so every region left in the dataset has a measured value for later analysis.
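
If you want to record how many rows were dropped, or renumber the remaining rows so they run 0, 1, 2, and so on, the sketch below shows one way to do it. It uses a made-up dataframe with the same column names as `bdnf_df`; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing expression value.
toy_df = pd.DataFrame(
    {
        "Region": ["region A", "region B", "region C"],
        "BDNF_Expression": [0.2, np.nan, 2.1],
    }
)

rows_before = len(toy_df)

# Drop rows where the expression value is missing, then renumber the rows.
cleaned = toy_df.dropna(subset=["BDNF_Expression"]).reset_index(drop=True)

rows_removed = rows_before - len(cleaned)
print(rows_removed)            # 1
print(cleaned.index.tolist())  # [0, 1]
```

Renumbering is optional. Without `reset_index(drop=True)`, the surviving rows keep their original row numbers, as in the preview above.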

# **Step 6: Export New CSV File**
Now that we have successfully created a subset of our gene expression dataset, we will export it as a CSV file to our Google Drive.

To export your new dataset, run the following code:


In [None]:
bdnf_df.to_csv('/content/gdrive/MyDrive/ENGL 105: Unit 3/compiled_BDNF_expression.csv', index=False)
print("File exported successfully to your Google Drive!")

File exported successfully to your Google Drive!
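
The message above only confirms that the line of code ran. One way to double-check an export is to read the file back and compare it with what you wrote. The sketch below does this with a made-up dataframe and a temporary folder, so it does not touch your Drive; the file name `toy_export.csv` is invented for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up data with the same column names as bdnf_df.
toy_df = pd.DataFrame(
    {"Region": ["region A", "region B"], "BDNF_Expression": [0.2, 2.1]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_export.csv")
    # index=False leaves the row numbers out of the file, as in the export above.
    toy_df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))           # ['Region', 'BDNF_Expression']
print(reloaded.shape == toy_df.shape)   # True
```

The same idea applies to the real export: read `compiled_BDNF_expression.csv` back with `pd.read_csv()` and compare its shape and columns with `bdnf_df`.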


# **Wrap-Up:**

Congratulations, you have successfully compiled and analyzed a subset of gene expression data that focuses on the **BDNF** gene!

As mentioned earlier, the BDNF gene encodes the BDNF protein, which plays a critical role in learning and long-term memory formation. In addition, altered levels of BDNF expression have been linked to mood disorders such as depression and anxiety, and to neurodegenerative diseases such as Alzheimer's. The data subset you have just compiled could be used in research that explores how BDNF expression correlates with such conditions, which may help us better understand, and perhaps eventually predict and treat, these neurological and psychiatric disorders.