<img src="materials/images/introduction-to-proteomics-cover.png"/>

# Introduction to Proteomics

`🕒 This module should take less than 1 hour to complete.`

`✍️ This notebook is written using Python.`

## What is a protein?

Proteins are essential molecules found in all living organisms. They are made up of amino acids and carry out the majority of biological functions, such as forming the structural fibers of muscle tissue, enzymatically digesting food, and synthesizing and replicating DNA.

Two important functional classes of proteins are **enzymes** and **protein hormones**:

1. **Enzymes** accelerate chemical reactions. They help us to break down large molecules into smaller molecules.

> For example, lactase is an enzyme produced in the small intestine. It breaks down the sugar lactose into two smaller sugars: glucose and galactose. When the small intestine cannot produce enough lactase to break down lactose, people often experience symptoms of lactose intolerance.



<img src="materials/images/enzyme-function.png"/>

2. **Protein hormones** help regulate cell signaling so that the cells in our bodies can maintain the healthy conditions they need to function properly (i.e., homeostasis).

> For example, when the blood sugar level rises, the pancreas produces a protein called insulin. Insulin signals fat cells to start taking up and storing the sugar molecules (i.e., glucose molecules) found in the blood, which keeps the amount of glucose in the blood within a normal range. Let’s review this mechanism using diabetes as an example. Symptoms of diabetes, such as heavy thirst, blurry vision, and sugar in the urine, result from too much sugar building up in the blood. How does the sugar end up accumulating in the blood?



<img src="materials/images/how-insulin-works.png"/>

In a healthy person’s cell, once insulin attaches to the insulin receptor, the cell opens its glucose channel to let glucose molecules in. The cells of a diabetic person fail to respond to insulin’s signal to open the glucose channel. As a result, the glucose molecules that are supposed to enter the cell accumulate in the blood. This is why scientists often refer to the diabetic population as the “insulin resistant population” in scientific publications: the cells in people with diabetes are not responding to insulin properly [1].

<img src="materials/images/insulin-sensitive-vs-resistant.png"/>



---



## What is a proteome?

A proteome is the set of proteins produced in an organism. The term “proteome” was coined by the Australian scientist Marc Wilkins in 1994 to describe the set of proteins encoded by the genome: the “PROTein complement expressed by a genOME” [1][2].

## What is proteomics?

Proteomics is the large-scale study of proteins. Here are two examples of how to distinguish the use of proteome and proteomics:

1. We could say Marc Wilkins is an expert in the field of proteomics.
2. Alternatively, we could say Marc Wilkins has published several research articles studying the proteome of human tissue samples.





---


## How do we quantify proteins or small molecules?

To say whether someone’s blood sugar is high or low, we need to quantify the molecules of interest in a sample, and the same is true for proteins. There are a variety of protein quantitation approaches (e.g., relative quantitation and absolute quantitation), each with its own benefits. An important method for **measuring mass** and **characterizing proteins** is mass spectrometry.

Mass spectrometry is an analytical tool for measuring the mass-to-charge ratio (m/z) of one or more molecules present in a sample [3]. Each ionized molecule has a mass (m) and a charge number (z). The m/z ratio is an important piece of information that helps us identify the molecule.
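As a rough illustration of how m/z relates to a molecule's mass, the sketch below assumes positive-mode ionization in which a molecule of neutral mass M picks up z protons, so that m/z = (M + z × 1.007276) / z. The function name `mz_of_protonated_ion` and the example mass of 1000 Da are hypothetical and are not taken from this module's data.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons (Da)

def mz_of_protonated_ion(neutral_mass, charge):
    """Return m/z for a molecule that carries `charge` extra protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge

# Hypothetical peptide with a neutral mass of 1000 Da
for z in (1, 2, 3):
    print(f"z = {z}: m/z = {mz_of_protonated_ion(1000.0, z):.4f}")
```

Under this assumption, the same molecule shows up at a different m/z for each charge state, which is one reason both m and z matter when identifying a molecule.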


---



## What are the major applications of proteomics?

In personalized medicine, applications of quantitative proteomics are typically based on biomarkers, which are measurable indicators of disease risk or disease status [4]. For example, studies show that the proteomic profile of prostate cancer can change significantly over the course of the disease [5].

<img src="materials/images/applications-of-proteomics.png"/>



---




#### References:

[1] Tyers, M., & Mann, M. (2003). From genomics to proteomics. Nature, 422(6928), 193-197.


[2] Marc Wilkins (geneticist). (2022). Retrieved August 25, 2022, from https://en.wikipedia.org/wiki/Marc_Wilkins_(geneticist)

[3] https://www.broadinstitute.org/technology-areas/what-mass-spectrometry - last access: August 22, 2022.

[4] Schubert, O. T., Röst, H. L., Collins, B. C., Rosenberger, G., & Aebersold, R. (2017). Quantitative proteomics: challenges and opportunities in basic and applied research. Nature protocols, 12(7), 1289-1294.

[5] Latonen, L., Afyounian, E., Jylhä, A., Nättinen, J., Aapola, U., Annala, M., ... & Visakorpi, T. (2018). Integrative proteomics in prostate cancer uncovers robustness against genomic and transcriptomic aberrations during disease progression. Nature communications, 9(1), 1-13.

[6] https://www.genecards.org/cgi-bin/carddisp.pl?gene=CFD - last access: October 19, 2022.



---

# Proteomics Data Analysis

### Background

We will focus our analyses on one patient (Patient Z) from the Integrated Personal Omics Profiling (iPOP) study. Designed and performed at Stanford University, iPOP aims to understand what “healthy” biochemical and physiological profiles look like at a personal level, and what happens when people become ill. This provides the basis for personalized precision medicine, which attempts to tailor medical decisions and treatments to a patient's individual omics profile.

iPOP is a longitudinal study that follows 106 patients and collects samples at regular intervals over several years, through sickness and health. The iPOP cohort also contains a significant proportion of pre-diabetic patients. Therefore, another focus of the study is to better understand how omics profiles are influenced by a pre-diabetic state and by the progression from pre-diabetes to either a normal healthy state or diabetes.

In this module, we will be reading and visualizing the proteomics data using Python. In our analyses, we will focus on a specific time period when Patient Z experienced an infection (and eventually recovered). By the end of this module, we will be able to interpret visualizations of the data and find interesting correlations.

### Load data

Let's read our proteome abundances file named `abundance.csv` using the Python Pandas library. `index_col=0` sets the first entry in each row of the file as the row name.

In our data, labeled row indices are the timepoints at which samples were collected (corresponding to different stages of the infection). Unlabeled (integer) row indices are other samples that were collected from Patient Z. Each column name represents a different protein.

In [None]:
import pandas as pd

data = pd.read_csv('data/abundance.csv', index_col=0)
data
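To see what `index_col=0` does without needing the real file, here is a minimal sketch that reads a tiny made-up CSV from memory. The sample labels `T1` and `T2` and all numbers are hypothetical and are not values from `abundance.csv`.

```python
import io
import pandas as pd

# Made-up CSV text: the first column holds sample labels, the rest are proteins
toy_csv = io.StringIO(
    "sample,ALB,CFD\n"
    "T1,30.0,2.0\n"
    "T2,28.5,1.2\n"
)

toy = pd.read_csv(toy_csv, index_col=0)
print(toy.index.tolist())    # row names come from the first column
print(toy.columns.tolist())  # remaining columns are the protein names
```

Because the first column becomes the index, it is not counted as a data column, so `toy.shape` only counts the protein columns.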

To get an overview of our data, we could take a look at the number of samples and proteins by examining the shape of the DataFrame.

In [None]:
(nRows, nCols) = data.shape
print('Our dataset contains the abundance data for', nCols, 'proteins from', nRows, 'samples.')

### Checking our data

Now, let's make sure our data makes sense. You do not have to worry too much about this code, but we use a `for` loop and the `iloc` indexer to select each of the first five rows by position. `np.argmax` returns the position of the maximum value within each row. We can then print the most abundant protein and the sample name by using these positions to look up the column (protein name) and the index entry (timepoint) in our data table.

In [None]:
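import numpy as np

# Loop over the first five rows of the table
for i in range(5):
    # Position of the largest abundance within row i
    j = np.argmax(data.iloc[i, :].to_numpy())
    print(data.columns[j], 'is the protein with the highest abundance in the', data.index[i], 'sample.')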

**Sanity Check**: ALB (albumin) is the most abundant protein in human blood plasma. It checks out that each of our 5 labeled samples has albumin as the most abundant protein.
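The same check can also be written without an explicit loop. The sketch below uses the pandas method `DataFrame.idxmax` on a small made-up table; the sample labels and abundance values are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Made-up abundances: rows are samples, columns are proteins
toy = pd.DataFrame(
    {"ALB": [30.0, 28.5, 31.2], "CFD": [2.0, 1.2, 1.9]},
    index=["T1", "T2", "T3"],
)

# For each row (axis=1), return the column label holding the largest value
most_abundant = toy.idxmax(axis=1)
print(most_abundant)
```

On the real table, something like `data.iloc[:5].idxmax(axis=1)` should give the same protein names as the loop above, assuming the first five rows are the samples of interest.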

### Plotting

Since the abundances of the assayed proteins vary in scale, we will rescale the measurements of each protein to values between 0 and 1. We obtain this relative abundance by first subtracting the protein's minimum abundance from all measurements of that protein, which makes the minimum value 0. Then, we ensure the maximum value is 1 by dividing all of the shifted measurements by the range (maximum abundance - minimum abundance).

In [None]:
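relativeAbundance = []
for protein in range(data.shape[1]):
    # Abundances of one protein across all samples
    abundance = data.iloc[:, protein]
    minAbundance = abundance.min()
    maxAbundance = abundance.max()
    # Shift so the minimum is 0, then divide by the range so the maximum is 1
    relativeAbundance.append((abundance - minAbundance) / (maxAbundance - minAbundance))
# Each list entry becomes one row, so rows are proteins and columns are samples
relativeAbundance = pd.DataFrame(relativeAbundance)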

In [None]:
relativeAbundance.iloc[0:10, :5]
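For readers who want to verify the scaling by hand, here is a minimal sketch of the same min-max calculation on a tiny made-up table, written in vectorized form. The values and sample labels are hypothetical; only the arithmetic mirrors the loop above.

```python
import pandas as pd

# Made-up abundances: rows are samples, columns are proteins
toy = pd.DataFrame(
    {"ALB": [10.0, 20.0, 30.0], "CFD": [1.0, 3.0, 2.0]},
    index=["T1", "T2", "T3"],
)

# min() and max() work column by column, so each protein is scaled independently
scaled = (toy - toy.min()) / (toy.max() - toy.min())

# Transpose so rows are proteins and columns are samples, like relativeAbundance
scaled = scaled.T
print(scaled)
```

For ALB, the minimum is 10 and the range is 20, so the values 10, 20, and 30 become 0, 0.5, and 1. Note that a protein with identical values in every sample would have a range of 0 and would need special handling.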

Now, we are ready to plot our data as a heatmap. Using the `seaborn` package, we can plot the relative abundances of the first 20 proteins across the first five samples. We choose a heatmap because it lets us easily see changes in protein abundance over time by simply comparing colors. Here, we plot just the first 20 proteins to produce a reasonably sized heatmap. However, we could make additional plots for the remaining proteins.

In [None]:
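import seaborn as sns

# Rows: first 20 proteins; columns: first 5 samples
df = relativeAbundance.iloc[0:20, :5]
# Fix the color scale to the full 0-1 range of the relative abundances
ax = sns.heatmap(df, vmin=0, vmax=1)
ax.figure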

It is difficult to draw firm conclusions from this heatmap alone, but we can observe a noticeable change in the heatmap profile when recovery from the infection first starts. Just before this, we see a low amount (dark cell) of the CFD (Complement Factor D) protein. CFD deficiency has been associated with recurrent bacterial meningitis infections in human patients [6]. One plausible interpretation is that CFD levels returned to their earlier levels as recovery from the infection progressed.
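A pattern spotted by eye in a heatmap can also be checked numerically. The sketch below assumes a table shaped like `relativeAbundance` (rows are proteins, columns are timepoints) and finds where one protein is lowest. The timepoint labels `T1` to `T5` and all values are made up for illustration and are not Patient Z's measurements.

```python
import pandas as pd

# Made-up relative abundances: rows are proteins, columns are timepoints
toy_relative = pd.DataFrame(
    [[1.0, 0.9, 0.8, 0.7, 0.95], [0.6, 0.1, 0.0, 0.4, 0.7]],
    index=["ALB", "CFD"],
    columns=["T1", "T2", "T3", "T4", "T5"],
)

cfd = toy_relative.loc["CFD"]     # one protein across all timepoints
lowest_timepoint = cfd.idxmin()   # column label where CFD is lowest
print("CFD is lowest at", lowest_timepoint)
print(cfd.diff())                 # change from the previous timepoint
```

Positive values in the `diff()` output after the lowest timepoint would correspond to the protein rising again, which is the kind of rebound described above.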



---



## Contributions & acknowledgement

- **Module Content:** Ryan Park
- **Engineering:** Amit Dixit
- **UX/UI Design & Illustration:** Kexin Cha
- **Video Production:** Francesca Goncalves
- **Project Management:** Amir Bahmani, Kexin Cha



---



Copyright (c) 2022 Stanford Data Ocean (SDO)

All rights reserved.