In [1]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="openpyxl") # remove warning from mac

# Workflow Example
## MaxQuant

To perform a downstream analysis of MaxQuant output, we require the `proteinGroups.txt` file generated by MaxQuant. To gain biological insight by comparing phenotypes, an additional metadata file should be provided, in which the sample names match those in the `proteinGroups.txt` file.
First, take a look at the files.
The `proteinGroups.txt` file contains all standard MaxQuant column headers. Later in the analysis, we will use
the protein intensities stored in the `"LFQ intensity [sample]"` columns.

In [2]:
protein_groups = pd.read_csv("../testfiles/maxquant/proteinGroups.txt", sep = "\t", low_memory=False)
protein_groups.head(5)
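
To see which columns the `"LFQ intensity [sample]"` pattern refers to, the matching headers can be listed with plain pandas. The following is a minimal sketch on a made-up table: the sample names `A1` and `A2`, the values, and the variable `toy_groups` are hypothetical stand-ins for the real file.

```python
import pandas as pd

# Toy stand-in for proteinGroups.txt; sample names and values are made up.
toy_groups = pd.DataFrame({
    "Protein IDs": ["P1", "P2"],
    "LFQ intensity A1": [1.0e6, 2.0e6],
    "LFQ intensity A2": [1.5e6, 0.0],
    "Intensity A1": [3.0e6, 4.0e6],
})

# Keep only the headers that start with the LFQ prefix.
prefix = "LFQ intensity "
lfq_columns = [col for col in toy_groups.columns if col.startswith(prefix)]

# Strip the prefix to recover the sample names used in the metadata.
sample_names = [col[len(prefix):] for col in lfq_columns]

print(lfq_columns)
print(sample_names)
```

Applied to the real `protein_groups` DataFrame, the same list comprehension shows which sample names the metadata file has to match.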

Further, we created an Excel file with the metadata corresponding to our `proteinGroups.txt` file. The sample names in the column "sample" match the names in the `proteinGroups.txt` file.

In [3]:
metadata = pd.read_excel("../testfiles/maxquant/metadata.xlsx")
metadata.head(5)

## Start Downstream Analysis

## 0. Import AlphaStats

Import the alphastats library.

In [4]:
import alphastats

## 1. Import Data
Load the MaxQuant `proteinGroups.txt` file and specify the columns that hold the intensities as well as the column used for indexing, here "Protein IDs" (the gene names would be an alternative). Because this column is used for indexing, its values must be unique.

In [5]:
maxquant_data = alphastats.MaxQuantLoader(
    file="../testfiles/maxquant/proteinGroups.txt",
    intensity_column="LFQ intensity [sample]",
    index_column="Protein IDs"
)
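
Whether a candidate index column is unique can be checked with pandas before loading. This is a minimal sketch on a made-up table (the IDs and gene names are hypothetical); it only illustrates the check and is not part of the AlphaStats API.

```python
import pandas as pd

# Toy table; IDs and gene names are made up for illustration.
toy_groups = pd.DataFrame({
    "Protein IDs": ["P1", "P2;P3", "P4"],
    "Gene names": ["GENE1", "GENE2", "GENE2"],
})

# Series.is_unique is True only if no value occurs more than once.
ids_unique = toy_groups["Protein IDs"].is_unique
genes_unique = toy_groups["Gene names"].is_unique

# List the values that would clash if this column were used as the index.
duplicate_mask = toy_groups["Gene names"].duplicated(keep=False)
duplicated_genes = toy_groups.loc[duplicate_mask, "Gene names"].unique().tolist()

print(ids_unique, genes_unique, duplicated_genes)
```

In this toy table "Protein IDs" would be a valid index, while "Gene names" would not because one name occurs twice.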

## 2. Create a DataSet
Combine the imported MaxQuant data with the metadata.

In [6]:
ds = alphastats.DataSet(
    loader = maxquant_data, 
    metadata_path = "../testfiles/maxquant/metadata.xlsx",
    sample_column = "sample" # specify the column that corresponds to the sample names in proteinGroups
)

AlphaStats creates a matrix of the protein intensities, accessible via `ds.mat`, and stores the metadata as a DataFrame in `ds.metadata`.
Our original MaxQuant proteinGroups file contains many more samples than we have metadata for.

## 3. Preprocess

In [7]:
print(f"Number of samples in the matrix: {ds.mat.shape[0]}, number of samples in metadata: {ds.metadata.shape[0]}.")

First, we subset the matrix so that it only contains samples that are also described in the metadata.

In [8]:
ds.preprocess(subset=True)

In [9]:
print(f"Number of samples in the matrix: {ds.mat.shape[0]}, number of samples in metadata: {ds.metadata.shape[0]}.")
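
Conceptually, subsetting keeps only the rows of the intensity matrix whose sample name also appears in the metadata. The sketch below shows that idea with plain pandas on made-up data; `toy_mat` and `toy_metadata` are hypothetical, and the actual AlphaStats implementation may differ.

```python
import pandas as pd

# Toy intensity matrix: samples as rows, proteins as columns (values made up).
toy_mat = pd.DataFrame(
    {"P1": [1.0, 2.0, 3.0], "P2": [4.0, 5.0, 6.0]},
    index=["A1", "A2", "A3"],
)

# Toy metadata that describes only two of the three samples.
toy_metadata = pd.DataFrame({
    "sample": ["A1", "A3"],
    "disease": ["healthy", "sick"],
})

# Keep the rows whose index value is listed in the metadata sample column.
subset_mat = toy_mat[toy_mat.index.isin(toy_metadata["sample"])]

print(subset_mat.shape[0], toy_metadata.shape[0])
```

After this step the number of rows in the matrix equals the number of rows in the metadata, which is what the print statement in the notebook verifies.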

#### Unnormalized data, Sample Distribution

In [10]:
ds.plot_sampledistribution(color = "disease").show(renderer = "png")

Next, the data are preprocessed further:

- Contaminations are removed, as indicated in the following columns: `Only identified by site`, `Reverse`, `Potential contaminant` (MaxQuant specific), and `contamination_library` (added by AlphaStats)
- The data are normalized using quantile normalization
- Missing values are imputed using k-nearest neighbour imputation

In [11]:
ds.preprocess(
    remove_contaminations=True,
    normalization = "quantile",
    imputation = "knn"
)
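
To illustrate what quantile normalization does, here is a minimal NumPy sketch of the textbook procedure: every sample is forced onto the same intensity distribution, namely the mean of the sorted values across samples. It assumes samples as rows, no missing values, and no ties. The function `quantile_normalize` and the toy data are hypothetical, and this is not necessarily how AlphaStats implements the step.

```python
import numpy as np
import pandas as pd

def quantile_normalize(mat: pd.DataFrame) -> pd.DataFrame:
    """Textbook quantile normalization; samples as rows, no missing values."""
    values = mat.to_numpy(dtype=float)
    # Sort each sample's intensities, then average across samples per rank.
    rank_means = np.sort(values, axis=1).mean(axis=0)
    # Rank of each value within its own sample (0 = smallest).
    ranks = values.argsort(axis=1).argsort(axis=1)
    # Replace every value by the mean intensity of its rank.
    return pd.DataFrame(rank_means[ranks], index=mat.index, columns=mat.columns)

# Toy matrix with three samples and three proteins (values made up).
toy_mat = pd.DataFrame(
    [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0], [3.0, 4.0, 8.0]],
    index=["A1", "A2", "A3"],
    columns=["P1", "P2", "P3"],
)

normalized = quantile_normalize(toy_mat)
print(normalized)
```

After normalization each sample contains the same set of values, only in a different order, which is why the box plots of the samples line up.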

#### After quantile normalization, Sample Distribution

In [12]:
ds.plot_sampledistribution(method = "box", color = "disease").show(renderer = "png")

The preprocessing steps can be accessed using:

In [13]:
ds.preprocessing_info

## 4. Visualization

### Principal Component Analysis (PCA)

In [14]:
ds.plot_pca(group = "disease", circle = True).show(renderer = "png")