In [1]:
import time
notebook_start_time = time.perf_counter()

# GalNAc data pre-processing

## Documentation

### GalNAc data column descriptions
* **uniprot**: UniProt accession
* **peptide.id**: Peptide UUID
* **peptide**: peptide AA sequence
* **peptide.start**: absolute start position of peptide in protein AA sequence (1-indexed, inclusive)
* **peptide.end**: absolute end position of peptide in protein AA sequence (1-indexed, inclusive)
* **site**: absolute position of glycosite in protein AA sequence (1-indexed)
* **site.ambiguous.start**: absolute start position of the ambiguous glycosite range in protein AA sequence (1-indexed, inclusive)
* **site.ambiguous.end**: absolute end position of the ambiguous glycosite range in protein AA sequence (1-indexed, inclusive)
* **site.composition**: sugar composition of glycosite (ex. 1xHex3xHexNAc)
  * Hex
  * dHex
  * HexNAc
  * HexHexNAc
  * NeuAc
  * NeuGc
  * Sia
  * Sulf
* **source**: Peptide library (ex. sec\_pna\_try or vva\_chy or tcl\_pna)
  * Peptide origin: sec (Secreted), tcl (Total Cell Lysate), tissue (?)
  * Lectin: vva (Vicia villosa), pna (Arachis hypogaea)
  * Proteases: try (Trypsin), chy (Chymotrypsin), gluc (Glu-C/V-8)
* **quantification**:
  * Value from 0-100: ?
  * potential_ko: ?
  * potential_wt: ?
  * rejected_ratio: ?
  * rejected_singlet_light: ?
  * rejected_singlet_medium: ?
* **quantification.channels**:
  * ko:wt: ?
  * L: ?
  * M: ?
  * M:L: ?
* **site.ambiguity**: If the glycosite was inferred indirectly from other information
  * inferred - By subtracting known 'site.composition' from 'composition' when there is only one other possible glycosite (S/T/Y)
  * missing_site_coverage - If multiple fragments could be glycosylated from the MS data, the site or ambiguous range can be determined from the possible glycosites (S/T/Y)
  * delta_cn_filter - Might be lower quality data, as it doesn't pass a filter on the MS data (cross-referencing peaks we see two groupings and can filter out the group below a cutoff?)
* **quantification.confidence**:
  * high: ?
  * low: ?
  * low_sn: ?
* **composition**: semicolon-separated list of sugar composition of all glycosites on peptide (see site.composition)
* **activation**: Mass spectrometry technique
  * CID - Collision Induced Dissociation - Might not have fragmentation information for peptides
  * HCD - Higher-energy Collisional Dissociation - Might not have fragmentation information for peptides
  * ETD - Electron Transfer Dissociation
* **dataset**: Source dataset
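
To make the composition format above concrete, here is a minimal sketch that parses a `site.composition` string such as `1xHex3xHexNAc` and a semicolon-separated `composition` value. The function names are hypothetical and this is not the `glyc_processing` implementation. Rejecting zero counts is an assumption based on the `0xHexNAc` rows that the notebook reports as invalid.

```python
import re
from collections import Counter

# Sugar names taken from the column description above.
SUGARS = {"Hex", "dHex", "HexNAc", "HexHexNAc", "NeuAc", "NeuGc", "Sia", "Sulf"}
_TERM = re.compile(r"(\d+)x([A-Za-z]+)")


def parse_site_composition(text):
    """Return a Counter of sugar units for e.g. '1xHex3xHexNAc', or None if malformed."""
    terms = _TERM.findall(text)
    # The matched terms must rebuild the input exactly, otherwise there is stray text.
    if not terms or "".join(f"{n}x{s}" for n, s in terms) != text:
        return None
    counts = Counter()
    for n, sugar in terms:
        # Assumption: unknown sugar names and zero counts are treated as invalid.
        if sugar not in SUGARS or int(n) < 1:
            return None
        counts[sugar] += int(n)
    return counts


def parse_composition(text):
    """Parse a semicolon-separated 'composition' value into one Counter per glycosite."""
    parsed = [parse_site_composition(part) for part in text.split(";")]
    return None if any(p is None for p in parsed) else parsed
```

Because a count always precedes each sugar name, `1xHexHexNAc` (one HexHexNAc unit) and `1xHex1xHexNAc` (one Hex plus one HexNAc) stay distinguishable.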

### Basic validation
* **uniprot** must be a valid accession string
* **peptide.id** must be a valid UUID string
* **peptide** must be a valid AA sequence string (only 20 basic AA and X) with at least one S/T/Y or NA
* **peptide.start** & **peptide.end** must be a positive integer range or both NA
* **site** must be a positive integer
* **site.ambiguous.start** & **site.ambiguous.end** must be a positive integer range or both NA
* **site.composition** must be a valid sugar composition string or NA
* **source** must be a string with format: "Peptide Origin"\_"Lectin"\_"Protease1,Protease2..." or "Lectin"\_"Protease1,Protease2..." or NA
* **quantification** must be a float between 0-100 (inclusive?) or a string of either "potential_ko", "potential_wt", "rejected_ratio", "rejected_singlet_light", "rejected_singlet_medium" or NA
* **quantification.channels** must be a string of either "ko:wt", "L", "M", "M:L" or NA
* **site.ambiguity** must be a string of either "inferred", "missing_site_coverage", "delta_cn_filter" or NA
* **quantification.confidence** must be a string of either "high", "low", "low_sn" or NA
* **composition** must be a semicolon-separated list of valid sugar compositions
* **activation** must be one of "CID", "HCD", "ETD" or NA
* **dataset** must be a non-whitespace string
* **origin** must be one of "sec", "tcl", "tissue" or NA
* **lectin** must be one of "pna" or "vva"
* **proteases** must be a comma-separated list of "try", "chy", "gluc" or NA
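
The `source` rule and the derived `origin`, `lectin` and `proteases` fields can be sketched as a small parser. This is an illustration only: `Source` and `parse_source` are hypothetical names, not the library's API. Accepting a trailing underscore with no protease (as in `tcl_vva_`, which appears in the data) is an assumption.

```python
from typing import NamedTuple, Optional, Tuple

ORIGINS = {"sec", "tcl", "tissue"}
LECTINS = {"pna", "vva"}
PROTEASES = {"try", "chy", "gluc"}


class Source(NamedTuple):
    origin: Optional[str]
    lectin: str
    proteases: Tuple[str, ...]


def parse_source(source: str) -> Optional[Source]:
    """Split a 'source' string into its parts; return None if it does not fit the format."""
    parts = source.split("_")
    origin = parts.pop(0) if parts[0] in ORIGINS else None
    if not parts or parts[0] not in LECTINS:
        return None
    lectin = parts.pop(0)
    if len(parts) > 1:
        return None  # more fields than origin_lectin_proteases allows
    # Assumption: an empty protease field (trailing underscore) means "no protease given".
    proteases = tuple(p for p in parts[0].split(",") if p) if parts else ()
    if any(p not in PROTEASES for p in proteases):
        return None
    # A lectin on its own is not one of the allowed combinations.
    if origin is None and not proteases:
        return None
    return Source(origin, lectin, proteases)
```

For example, `parse_source("sec_pna_try")` yields origin `sec`, lectin `pna` and proteases `("try",)`, while `parse_source("vva_chy")` has no origin.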

### Imputing missing info
* If 'site', 'site.ambiguous.start' & 'site.ambiguous.end' are all NA 
  * If 'peptide', 'peptide.start' & 'peptide.end' are not NA, find positions of all S/T/Y in peptide
    * If one position: set 'site' as position
    * If several positions: set 'site.ambiguous.start' & 'site.ambiguous.end' to min(positions) & max(positions)
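
The imputation rule above can be sketched as a pure function over one row. `impute_site` is a hypothetical helper written for illustration. It assumes `peptide_start` is the 1-indexed absolute position of the first residue, as described in the column documentation.

```python
def impute_site(peptide, peptide_start, allowed_aa=("S", "T", "Y")):
    """Return (site, ambiguous_start, ambiguous_end) in absolute 1-indexed coordinates.

    Exactly one candidate residue gives a single site; several candidates give
    the ambiguous range min..max; no candidates leave everything as None.
    """
    positions = [peptide_start + i for i, aa in enumerate(peptide) if aa in allowed_aa]
    if not positions:
        return None, None, None
    if len(positions) == 1:
        return positions[0], None, None
    return None, min(positions), max(positions)
```

For the peptide `HTESLPVSLGPESW` starting at position 263, the candidates are T264, S266, S270 and S275, so the function returns the ambiguous range 264 to 275.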

### Cross-referencing columns
* The allowed combinations of sources are origin_lectin_proteases, origin_lectin, lectin_proteases or none of them
* Either 'site' should be NA or 'site.ambiguous.start' & 'site.ambiguous.end' should be NA
* If 'peptide', 'peptide.start' & 'peptide.end' are not NA, len('peptide') should be equal to 'peptide.end' - 'peptide.start' + 1
* If 'peptide.start' & 'peptide.end' are not NA, 'site', 'site.ambiguous.start' & 'site.ambiguous.end' should be within their range or NA
* If 'peptide', 'peptide.start' & 'peptide.end' are not NA, 'site', 'site.ambiguous.start' & 'site.ambiguous.end' must be S/T/Y or NA
* Rows with same 'peptide.id' should have same 'uniprot', 'peptide', 'peptide.start', 'peptide.end', 'source', 'composition', 'activation' & 'dataset'
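
The per-row rules above (site versus ambiguous range, peptide length, positions within the peptide, allowed residues) can be sketched as one check function. `check_row` is a hypothetical helper for illustration, with `None` standing in for NA. The notebook itself uses the `consistent_*` functions from `glyc_processing`, whose internals are not shown here.

```python
def check_row(peptide, peptide_start, peptide_end,
              site=None, amb_start=None, amb_end=None, allowed_aa="STY"):
    """Return a list of human-readable problems found in one row (empty if none)."""
    problems = []
    if site is not None and (amb_start is not None or amb_end is not None):
        problems.append("both site and ambiguous range set")
    if peptide is None or peptide_start is None or peptide_end is None:
        return problems  # remaining checks need the full peptide info
    if len(peptide) != peptide_end - peptide_start + 1:
        problems.append("peptide length does not match range")
        return problems  # positions cannot be mapped onto the peptide reliably
    for name, pos in (("site", site),
                      ("site.ambiguous.start", amb_start),
                      ("site.ambiguous.end", amb_end)):
        if pos is None:
            continue
        if not peptide_start <= pos <= peptide_end:
            problems.append(f"{name} outside peptide range")
        elif peptide[pos - peptide_start] not in allowed_aa:
            problems.append(f"{name} is not S/T/Y")
    return problems
```

Returning messages rather than a single boolean makes it easier to see which rule a row breaks.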


### Cross-reference columns with UniProt AA sequence
* If 'uniprot', 'peptide', 'peptide.start' & 'peptide.end' are not NA:
  * Check whether the peptide position and sequence match the UniProt sequence
* If 'site' is not Serine (S), Threonine (T) or Tyrosine (Y):
  * If 'peptide' is not NA:
    * Try to re-align 'peptide' to the UniProt sequence; if it does not align:
      * BLAST search UniProt, filtered by organism; if there is not exactly one good match:
        * Discard
* 'site.ambiguous.start' & 'site.ambiguous.end'
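
As a rough illustration of the position check and re-alignment idea above, the sketch below first tests the stated position and then falls back to a unique exact match. `locate_peptide` is a hypothetical helper. Exact string search is used here only as a simple stand-in for a real alignment or BLAST search.

```python
def locate_peptide(sequence, peptide, start=None):
    """Return the 1-indexed start of `peptide` in `sequence`, or None.

    The stated `start` is trusted if the sequence matches there. Otherwise the
    peptide is accepted only if it occurs exactly once in the sequence.
    """
    if not peptide:
        return None
    if start is not None and start >= 1 and sequence[start - 1:start - 1 + len(peptide)] == peptide:
        return start
    # Collect every exact occurrence, including overlapping ones.
    hits = []
    idx = sequence.find(peptide)
    while idx != -1:
        hits.append(idx + 1)
        idx = sequence.find(peptide, idx + 1)
    return hits[0] if len(hits) == 1 else None
```

Requiring a single hit mirrors the "exactly one good match" condition: a peptide that occurs several times cannot be placed unambiguously.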



### Further ideas
* Do alignment, BLAST or other search algorithm for finding correct isoform of peptides?
* Filter out isoforms that only have very few peptides compared to the other isoforms 
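
The isoform-filtering idea could look something like the sketch below. Everything here is an assumption made for illustration: the helper name, the 10% threshold, and the convention that an isoform ID is the entry accession followed by `-N`.

```python
from collections import Counter, defaultdict


def minor_isoforms(isoform_ids, min_fraction=0.1):
    """Return isoforms with far fewer peptides than their entry's best-covered isoform.

    `isoform_ids` holds one ID per peptide. `min_fraction` is a hypothetical cutoff.
    """
    counts = Counter(isoform_ids)
    by_entry = defaultdict(dict)
    for iso, n in counts.items():
        # Assumption: 'P12345-2' belongs to entry 'P12345'.
        by_entry[iso.split("-")[0]][iso] = n
    minor = set()
    for isoforms in by_entry.values():
        top = max(isoforms.values())
        minor.update(iso for iso, n in isoforms.items() if n < min_fraction * top)
    return minor
```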

## Imports

### Built-in imports

In [2]:
from functools import reduce
import operator
from pathlib import Path

### Shared library imports

In [3]:
from glyc_processing import cf
from glyc_processing.data_formats.galnac.config import GalNAcConfig

from glyc_processing.data_formats.galnac.standardization import standardize_df

from glyc_processing.uniprot import get_uniprot_id_mappings, get_uniprot_entries, get_uniprot_isoforms, \
    get_entry_isoforms_dicts, set_new_uniprot_ids, set_missing_peptide_sequences, set_uniprot_isoforms_containing_peptides, \
    check_uniprot_idmapping, consistent_entry_peptide, consistent_entry_sites, set_correct_peptide_ranges

from glyc_processing.misc import display_invalid_rows

from glyc_processing.data_formats.common.validation import valid_uniprot, valid_peptide_id, valid_peptide, valid_peptide_range, \
     valid_single_site, valid_unclear_site_range, consistent_site_or_unclear_range, consistent_peptide_length, \
     consistent_sites_position, consistent_sites_aa, consistent_id_common_info

from glyc_processing.data_formats.galnac.validation import valid_site_composition, valid_source, valid_quantification, \
     valid_quantification_channels, valid_site_ambiguity, valid_quantification_confidence, valid_composition, valid_activation, \
     valid_dataset

from glyc_processing.clustering import cluster_peptides

### External imports

In [4]:
from IPython.display import display
import numpy as np
import pandas as pd

## Paths & Constants

In [5]:
#BASE_DIR = Path("/mnt/g/My Drive/CloudVault/Masters/Data")
BASE_DIR = Path("/home/jakob/Cloudvault_new/Data")

# Data file path
DATA_FILE = BASE_DIR/'NetOGlyc5 data'/'GalNAc data'/'01-raw_data'/'sitedata.flatpack.9606.HexNAc.210826.xlsx'

# Path to write cleaned data file to
CLEAN_POSITIVE_DATA_FILE = BASE_DIR/'NetOGlyc5 data'/'GalNAc data'/'02-GalNAC_processing'/'clean_data.xlsx'

# Path to write list of used uniprot IDs to
CLEAN_POSITIVE_UNIPROT_IDS_FILE = BASE_DIR/'NetOGlyc5 data'/'GalNAc data'/'02-GalNAC_processing'/'uniprot_ids.txt'

In [6]:
# Set right config
cf.use_config(GalNAcConfig)

In [7]:
# The temp folder is used to organize data-specific temporary files
cf.TEMP_DIR = BASE_DIR/'uniprot'/'GalNAc'

# Uniprot release downloads directory path (make sure you have a few GB of free space)
cf.UNIPROT_DOWNLOADS_DIR = BASE_DIR/'uniprot'

# The amino acids that are allowed to be glycosylated
cf.ALLOWED_AA = ('S','T','Y')

# The UniProt release can be 'latest' for the current release or any of those with format (YYYY_MM) found here: https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/
# Warning: UniProt only keeps previous releases other than the first of the year for 2 years, so only use the first yearly release (2015_01, 2021_01 etc.) for reproducibility!
cf.UNIPROT_RELEASE = '2021_01'

# If True ignores existing data-specific temp files and recreates them from scratch
# This should be used if the data or script has changed
cf.IGNORE_EXISTING_FILES = False
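
The release-naming advice in the comments above can be turned into a quick guard. This is a minimal sketch: `is_yearly_release` is a hypothetical helper, not part of `glyc_processing`. It only checks the string format, not whether the release actually exists on the UniProt server.

```python
import re


def is_yearly_release(release):
    """True for release names like '2021_01', i.e. the first release of a year."""
    return re.fullmatch(r"\d{4}_01", release) is not None
```

With this check, `'2021_01'` passes, while `'2021_03'` and `'latest'` are flagged as not reproducible in the long term.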

In [8]:
print(f"Using Uniprot Release {cf.TRUE_UNIPROT_RELEASE}")

Using Uniprot Release 2021_01


## Read data

In [9]:
df = pd.read_excel(DATA_FILE, header=0)
n_original_rows = len(df)

In [10]:
display(df)
display(df.info())

Unnamed: 0,uniprot,peptide.id,peptide,peptide.start,peptide.end,site,site.ambiguous.start,site.ambiguous.end,site.composition,source,quantification,quantification.channels,site.ambiguity,quantification.confidence,composition,activation,dataset
0,C9J069,7e893282-7694-4687-95d0-8d69e6913073,,,,543.0,,,HexNAc,tcl_vva_chy,,,,,1xHexNAc,,AGS.Glycoproteome
1,O00220,a9b4488b-2f63-4dec-8ad0-eec98c199811,,,,117.0,,,HexNAc,tcl_vva_chy,,,,,1xHexNAc,,AGS.Glycoproteome
2,O00220,e4842095-b422-45bd-a382-8cf2de020293,,,,117.0,,,HexNAc,tcl_vva_gluc,,,,,1xHexNAc,,AGS.Glycoproteome
3,O00220,90fe3d78-753c-47ca-ad2f-77bdac09ea6b,,,,120.0,,,HexNAc,tcl_vva_chy,,,,,1xHexNAc,,AGS.Glycoproteome
4,O00220,f96809ce-04c8-4db5-8dde-7303c19bb020,,,,120.0,,,HexNAc,tcl_vva_gluc,,,,,1xHexNAc,,AGS.Glycoproteome
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319486,Q9NX40,2012d47a-e4ab-42b7-bd03-e03480fe2c52,VTSPAADNIEMLPHYEPIPF,145.0,164.0,,,,,tcl_vva_chy,,,,,1xHexNAc,CID,viral.msf.P3HR1.bto.0000164..Other.EBV.infection
319487,Q6ZRI0,230baa2e-421a-4f39-b267-1a223425d2f7,PAQHTTMATRSPALPPETPAAASLSTATDGL,1764.0,1794.0,,,,,tcl_vva_chy,,,,,2xHexNAc,CID,viral.msf.P3HR1.bto.0000164..Other.EBV.infection
319488,Q53GL7,60f2f1e7-e5a7-49de-826e-ec0290c3a78a,GQEGLVEIAMDSPEQEGLVGPMEITMGSL,396.0,424.0,,,,,tcl_vva_chy,,,,,2xHexNAc,CID,viral.msf.P3HR1.bto.0000164..Other.EBV.infection
319489,Q9BXK5,21d3c909-c10c-47db-939f-04eabbf5bf79,HTESLPVSLGPESW,263.0,276.0,,,,,tcl_vva_chy,,,,,2xHexNAc,CID,viral.msf.P3HR1.bto.0000164..Other.EBV.infection


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319491 entries, 0 to 319490
Data columns (total 17 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   uniprot                    319491 non-null  object 
 1   peptide.id                 319491 non-null  object 
 2   peptide                    290654 non-null  object 
 3   peptide.start              309550 non-null  float64
 4   peptide.end                309565 non-null  float64
 5   site                       194901 non-null  float64
 6   site.ambiguous.start       31942 non-null   float64
 7   site.ambiguous.end         31942 non-null   float64
 8   site.composition           226846 non-null  object 
 9   source                     301991 non-null  object 
 10  quantification             103724 non-null  object 
 11  quantification.channels    169433 non-null  object 
 12  site.ambiguity             63919 non-null   object 
 13  quantification.confidence  14

None

## Data standardization

In [11]:
df = standardize_df(df)

## Mapping to UniProt sequences

### Fetch any new UniProt ID Mappings

In [12]:
mappings_dict = get_uniprot_id_mappings(set(df['uniprot'].unique()))

### Check if any IDs didn't map to uniprot or have been merged/demerged/deleted and change them accordingly

In [13]:
check_uniprot_idmapping(set(df['uniprot'].unique()), mappings_dict)

{'new_mapping_ids': set(), 'non_mapping_ids': set()}

In [14]:
set_new_uniprot_ids(df, mappings_dict)

In [15]:
valid_uniprot_rows = display_invalid_rows(df, "Deleted 'uniprot' IDs", valid_uniprot)
df = df[valid_uniprot_rows].reset_index(drop=True)

### Fetch UniProt sequences

In [16]:
get_uniprot_entries(set(df['uniprot'].unique()))
get_uniprot_isoforms(set(df['uniprot'].unique()))
entry_isoforms, isoform_seqs = get_entry_isoforms_dicts()
print(f"Uniprot sequences - Number of entries: {len(entry_isoforms)}, Number of isoforms: {len(isoform_seqs)}")

Uniprot sequences - Number of entries: 4858, Number of isoforms: 11831


### Recover missing peptide ranges

In [17]:
missing_peptide_range_count = (df['peptide'].notna() & df['peptide_start'].isna() & df['peptide_end'].isna()).sum()
print(f"Missing peptide ranges before: {missing_peptide_range_count} {missing_peptide_range_count/df.shape[0]*100:.2f}%")

set_correct_peptide_ranges(df, isoform_seqs)

missing_peptide_range_count = (df['peptide'].notna() & df['peptide_start'].isna() & df['peptide_end'].isna()).sum()
print(f"Missing peptide ranges after: {missing_peptide_range_count} {missing_peptide_range_count/df.shape[0]*100:.2f}%")

Missing peptide ranges before: 3 0.00%
Missing peptide ranges after: 436 0.14%


### Recover missing peptide sequences

In [18]:
missing_peptide_count = (df['peptide'].isna() & df['peptide_start'].notna() & df['peptide_end'].notna()).sum()
print(f"Missing peptide sequences before: {missing_peptide_count} {missing_peptide_count/df.shape[0]*100:.2f}%")

set_missing_peptide_sequences(df, isoform_seqs)

missing_peptide_count = (df['peptide'].isna() & df['peptide_start'].notna() & df['peptide_end'].notna()).sum()
print(f"Missing peptide sequences after: {missing_peptide_count} {missing_peptide_count/df.shape[0]*100:.2f}%")

Missing peptide sequences before: 0 0.00%
Missing peptide sequences after: 0 0.00%


### Set the right isoform (& peptide/site position info) for each peptide

In [19]:
print(f"Non-canonical isoform sites before: {df['uniprot'].str.contains('-').sum()}, Unique: {df.loc[df['uniprot'].str.contains('-'), 'uniprot'].nunique()}")

set_uniprot_isoforms_containing_peptides(df, entry_isoforms, isoform_seqs)

print(f"Non-canonical isoform sites after: {df['uniprot'].str.contains('-').sum()}, Unique: {df.loc[df['uniprot'].str.contains('-'), 'uniprot'].nunique()}")

Non-canonical isoform sites before: 0, Unique: 0
Non-canonical isoform sites after: 0, Unique: 0


## Data Validation

### Basic column validation

In [20]:
def validate_individual_columns(df):
    valid_rows = display_invalid_rows(df, "Invalid 'uniprot' rows", valid_uniprot)
    valid_rows &= display_invalid_rows(df, "Invalid 'peptide_id' rows", valid_peptide_id)
    valid_rows &= display_invalid_rows(df, "Invalid 'peptide' rows", valid_peptide)
    valid_rows &= display_invalid_rows(df, "Invalid 'peptide_start' and/or 'peptide_end' rows", valid_peptide_range)
    valid_rows &= display_invalid_rows(df, "Invalid 'single_site' rows", valid_single_site)
    valid_rows &= display_invalid_rows(df, "Invalid 'unclear_site_start' and/or 'unclear_site_end' rows", valid_unclear_site_range)
    valid_rows &= display_invalid_rows(df, "Invalid 'site_composition' rows", valid_site_composition)
    valid_rows &= display_invalid_rows(df, "Invalid 'source' rows", valid_source)
    valid_rows &= display_invalid_rows(df, "Invalid 'quantification' rows", valid_quantification)
    valid_rows &= display_invalid_rows(df, "Invalid 'quantification_channels' rows", valid_quantification_channels)
    valid_rows &= display_invalid_rows(df, "Invalid 'site_ambiguity' rows", valid_site_ambiguity)
    valid_rows &= display_invalid_rows(df, "Invalid 'quantification_confidence' rows", valid_quantification_confidence)
    valid_rows &= display_invalid_rows(df, "Invalid 'composition' rows", valid_composition)
    valid_rows &= display_invalid_rows(df, "Invalid 'activation' rows", valid_activation)
    valid_rows &= display_invalid_rows(df, "Invalid 'dataset' rows", valid_dataset)
    return valid_rows

In [21]:
valid_rows = validate_individual_columns(df)

#### Invalid 'peptide' rows: 3 (0.00%):

Unnamed: 0,uniprot,peptide_id,peptide,peptide_start,peptide_end,single_site,unclear_site_start,unclear_site_end,site_composition,source,quantification,quantification_channels,site_ambiguity,quantification_confidence,composition,activation,dataset
2396,O43278,ba98e2ed-49fe-4f0d-91b7-a25705ec9ced,O43278,,,,,,,tcl_vva_,,,,,O43278xHexNAc,,AGS.Glycoproteome..unpublished
211590,O15240,3052ff9c-e052-424e-8938-bb1b6aa5ff52,O15240,,,,,,,tcl_vva_,,,,,O15240xHexNAc,,MKN45.Glycoproteome..unpublished
212096,P14625,078a8d1d-14c5-4e5e-98a5-ee4db38ed665,P14625,,,,,,,tcl_vva_,,,,,P14625xHexNAc,,MKN45.Glycoproteome..unpublished


#### Invalid 'unclear_site_start' and/or 'unclear_site_end' rows: 3 (0.00%):

Unnamed: 0,uniprot,peptide_id,peptide,peptide_start,peptide_end,single_site,unclear_site_start,unclear_site_end,site_composition,source,quantification,quantification_channels,site_ambiguity,quantification_confidence,composition,activation,dataset
18278,P02751,5c6b9194-c74e-4834-a1ec-65a1934e89d4,,,,,2345.0,2345.0,1xHexNAc,sec_vva_try,,,,,1xHexNAc,,HaCaT.Glycoproteome
33396,P13611,a79c8374-6d34-4163-95e8-e51d6049f442,,,,,2194.0,2194.0,1xHexNAc,tcl_vva_chy,,,,,1xHexNAc,,Hek293.Glycoproteome
214920,Q6UXD5,500c4483-958d-4339-93d5-6f0b9975d7bb,,,,,100.0,96.0,1xHexNAc,tcl_vva_try,,,,,1xHexNAc,,OVCAR.3.Glycoproteome


#### Invalid 'composition' rows: 1319 (0.41%):

Unnamed: 0,uniprot,peptide_id,peptide,peptide_start,peptide_end,single_site,unclear_site_start,unclear_site_end,site_composition,source,quantification,quantification_channels,site_ambiguity,quantification_confidence,composition,activation,dataset
2396,O43278,ba98e2ed-49fe-4f0d-91b7-a25705ec9ced,O43278,,,,,,,tcl_vva_,,,,,O43278xHexNAc,,AGS.Glycoproteome..unpublished
45051,A1A5C7,46895d9d-1d38-4e5b-a594-ce9fc868366d,GTELAGVTTTGR,134.0,145.0,,135.0,143.0,,tcl_vva_,,,,,0xHexNAc,,HEK293.SC.SC.POMT1KO.POMT2KO.Differential.O.Ga...
45052,A1A5C7,101adbac-0f47-4811-94f8-c3e4b47282ef,GTELAGVTTTGR,134.0,145.0,,135.0,143.0,,tcl_vva_,,,,,0xHexNAc,,HEK293.SC.SC.POMT1KO.POMT2KO.Differential.O.Ga...
45053,A1A5C7,03e01a89-ba1f-45a8-9cda-68a28ec827b8,GTELAGVTTTGR,134.0,145.0,,135.0,143.0,,tcl_vva_,,,,,0xHexNAc,,HEK293.SC.SC.POMT1KO.POMT2KO.Differential.O.Ga...
45054,A1A5C7,e2ddf56d-c6ee-494b-8150-2c361fa7471f,GTELAGVTTTGR,134.0,145.0,,135.0,143.0,,tcl_vva_,,,,,0xHexNAc,,HEK293.SC.SC.POMT1KO.POMT2KO.Differential.O.Ga...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
179259,Q9HC07,e6675d64-6f37-4032-87ea-d06d8a1e4c8d,IFTPAAPVHTNK,71.0,82.0,,73.0,80.0,,tcl_vva_,,M,,,undefinedxHexNAc,,K562.SC.GALNT5.KO.glycoproteome
179260,Q9HC07,f98bff5e-ffc7-4586-8f8a-02eb6deba0fa,IFTPAAPVHTNK,71.0,82.0,,73.0,80.0,,tcl_vva_,,M,,,undefinedxHexNAc,,K562.SC.GALNT5.KO.glycoproteome
180640,Q9NXR5,3cfc9afb-a31c-4300-b711-573905bc5662,NSSSVSNTLTNGCVINGHLDFPSTTPLSGMESR,266.0,298.0,,267.0,297.0,,tcl_vva_,,M,,,undefinedxHexNAc,,K562.SC.GALNT5.KO.glycoproteome
211590,O15240,3052ff9c-e052-424e-8938-bb1b6aa5ff52,O15240,,,,,,,tcl_vva_,,,,,O15240xHexNAc,,MKN45.Glycoproteome..unpublished


### Field consistency validation

In [22]:
def check_field_consistency(df):
    consistent_rows = display_invalid_rows(df, "Inconsistent 'single_site', 'unclear_site_start' & 'unclear_site_end' row combinations", consistent_site_or_unclear_range)
    consistent_rows &= display_invalid_rows(df, "Inconsistent 'peptide' length compared to 'peptide_start' & 'peptide_end' rows", consistent_peptide_length)
    consistent_rows &= display_invalid_rows(df, "Invalid glycosylation site positions outside of peptide position", consistent_sites_position)
    consistent_rows &= display_invalid_rows(df, "Disallowed glycosylation site amino acids in peptide", consistent_sites_aa)
    consistent_rows &= display_invalid_rows(df, "Inconsistent peptide-specific info in rows",
                                            df_func = consistent_id_common_info,
                                            id_column_name = 'peptide_id',
                                            consistent_columns_df = df[['peptide_id', 'uniprot', 'peptide', 'peptide_start', 'peptide_end', 'source', 'composition', 'activation', 'dataset']])
    return consistent_rows

In [23]:
consistent_rows = check_field_consistency(df)

#### Inconsistent 'single_site', 'unclear_site_start' & 'unclear_site_end' row combinations: 9263 (2.90%):

Unnamed: 0,uniprot,peptide_id,peptide,peptide_start,peptide_end,single_site,unclear_site_start,unclear_site_end,site_composition,source,quantification,quantification_channels,site_ambiguity,quantification_confidence,composition,activation,dataset
2396,O43278,ba98e2ed-49fe-4f0d-91b7-a25705ec9ced,O43278,,,,,,,tcl_vva_,,,,,O43278xHexNAc,,AGS.Glycoproteome..unpublished
2956,P19174,e74c8163-5548-4a31-bfeb-226434149ad6,GADKIEGAIDIR,65.0,76.0,,,,,sec_vva_,,,,,1xHexNAc,,AGS.Glycoproteome..unpublished
4109,O00391,8fe2ed5e-17b2-4d07-8e4f-1b60da1a6e05,,,,,,,,sec_vva_try,,,,,1xHexNAc,,COLO.205.Glycoproteome
4110,O00391,a5b032cf-4738-4a72-b2ad-7300ed6df08d,,,,,,,,tcl_vva_try,,,,,1xHexNAc,,COLO.205.Glycoproteome
4111,O00391,2dc2cf5f-44e9-4304-b025-522a6317d91f,,,,,,,,sec_vva_try,,,,,1xHexNAc,,COLO.205.Glycoproteome
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
316815,Q7Z3E5,a8a92743-0fe3-426a-9ec7-bd5f15cde827,LKDVPLLP,307.0,314.0,,,,,tcl_pna_try,,,,,1xHexHexNAc,CID,viral.msf.P3HR1.bto.0000164..Other.EBV.infection
317389,O00429,2f0b6365-c7e9-486e-8cdb-dd664550165e,GEENGVEAEEWGK,80.0,92.0,,,,,tcl_vva_try,,,,,1xHexNAc,CID,viral.msf.P3HR1.bto.0000164..Other.EBV.infection
317575,O94782,8cdd75fc-a488-418a-bcd8-f7c7b7d33293,LFQGQLVLR,429.0,437.0,,,,,tcl_vva_try,,,,,2xHexNAc,CID,viral.msf.P3HR1.bto.0000164..Other.EBV.infection
319439,O95983,6bdb787f-159d-4393-b9f2-1fc1f899d281,AHVEELARDGEAPLDKACAEDDDEEDEEEEEEEPDPDPEM,249.0,288.0,,,,,tcl_pna_chy,,,,,1xHexHexNAc,CID,viral.msf.P3HR1.bto.0000164..Other.EBV.infection


#### Invalid glycosylation site positions outside of peptide position: 9958 (3.12%):

Unnamed: 0,uniprot,peptide_id,peptide,peptide_start,peptide_end,single_site,unclear_site_start,unclear_site_end,site_composition,source,quantification,quantification_channels,site_ambiguity,quantification_confidence,composition,activation,dataset
2244,O00468,4e6850dd-9aed-4957-87db-ba8687fe5119,ALQSNHFELSLR,1910.0,1921.0,1890.0,,,1xHexNAc,sec_vva_,,,,,2xHexNAc,,AGS.Glycoproteome..unpublished
2245,O00468,4e6850dd-9aed-4957-87db-ba8687fe5119,ALQSNHFELSLR,1910.0,1921.0,1896.0,,,1xHexNAc,sec_vva_,,,,,2xHexNAc,,AGS.Glycoproteome..unpublished
2246,O00468,5bf52b55-aee4-4ed1-bcf5-59f1cc86f41f,ALQSNHFELSLR,1910.0,1921.0,1896.0,,,1xHexNAc,sec_vva_,,,,,1xHexNAc,,AGS.Glycoproteome..unpublished
3825,Q9H3R2,325a7707-f81f-4655-a24f-af77fde81828,STGFTNLGAEGSVFPK,471.0,486.0,470.0,,,1xHexNAc,sec_vva_,,,,,3xHexNAc,,AGS.Glycoproteome..unpublished
3828,Q9H3R2,2563a248-cdc8-4528-b9c8-615baddd5e5c,STGFTNLGAEGSVFPK,471.0,486.0,470.0,,,1xHexNAc,sec_vva_,,,,,2xHexNAc,,AGS.Glycoproteome..unpublished
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319163,Q9UPS6,a561f293-bc6b-4b8e-ab65-83c97a1dcae5,AKSQSTETVPATPGGEPPLSGGSSGLSL,1371.0,1398.0,,1333.0,1339.0,1xHexNAc,tcl_vva_chy,,,missing_site_coverage,,1xHexHexNAc,ETD,viral.msf.P3HR1.bto.0000164..Other.EBV.infection
319164,Q9UPS6,a561f293-bc6b-4b8e-ab65-83c97a1dcae5,AKSQSTETVPATPGGEPPLSGGSSGLSL,1371.0,1398.0,,1347.0,1351.0,1xHexHexNAc,tcl_vva_chy,,,missing_site_coverage,,1xHexHexNAc,ETD,viral.msf.P3HR1.bto.0000164..Other.EBV.infection
319303,Q96JK9,bfc61088-9109-489d-9f48-8c98863ff60b,SGLSQGVPGMPAFSQPPAQQQIPSGSF,1049.0,1075.0,,1045.0,1048.0,1xHexNAc,tcl_vva_chy,,,missing_site_coverage,,2xHexNAc,ETD,viral.msf.P3HR1.bto.0000164..Other.EBV.infection
319441,Q9NPR2,7483ca26-2a38-4d73-a5d8-0c9f8db918e7,CSASSVVSPSFVPTGEKPCEQVQF,578.0,601.0,,574.0,586.0,,tcl_pna_chy,,,,,1xHexHexNAc,CID,viral.msf.P3HR1.bto.0000164..Other.EBV.infection


#### Disallowed glycosylation site amino acids in peptide: 5630 (1.76%):

Unnamed: 0,uniprot,peptide_id,peptide,peptide_start,peptide_end,single_site,unclear_site_start,unclear_site_end,site_composition,source,quantification,quantification_channels,site_ambiguity,quantification_confidence,composition,activation,dataset
3819,Q9H3R2,2fd2c3af-4f90-4c8d-9443-f1212682a6a7,DSQMQNPYSR,494.0,503.0,,494.0,501.0,,sec_vva_,,,,,1xHexNAc,,AGS.Glycoproteome..unpublished
3821,Q9H3R2,f686c870-e293-4dcf-9b23-fdec807cadbc,HSSMPRPDY,504.0,512.0,,504.0,511.0,,sec_vva_,,,,,1xHexNAc,,AGS.Glycoproteome..unpublished
3822,Q9H3R2,a42132fb-267e-430e-bc8b-270fd9e07fdc,LRSTGFTNLGAEGSVFPK,469.0,486.0,470.0,,,1xHexNAc,sec_vva_,,,,,3xHexNAc,,AGS.Glycoproteome..unpublished
3824,Q9H3R2,a42132fb-267e-430e-bc8b-270fd9e07fdc,LRSTGFTNLGAEGSVFPK,469.0,486.0,481.0,,,1xHexNAc,sec_vva_,,,,,3xHexNAc,,AGS.Glycoproteome..unpublished
3827,Q9H3R2,325a7707-f81f-4655-a24f-af77fde81828,STGFTNLGAEGSVFPK,471.0,486.0,481.0,,,1xHexNAc,sec_vva_,,,,,3xHexNAc,,AGS.Glycoproteome..unpublished
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
318759,Q99996,298b8f4a-8d5a-4239-ab14-927c4e404680,EAIIASEKEVWNREKL,3645.0,3660.0,3654.0,,,1xHexHexNAc,tcl_pna_chy,,,,,1xHexHexNAc,ETD,viral.msf.P3HR1.bto.0000164..Other.EBV.infection
319302,Q96JK9,bfc61088-9109-489d-9f48-8c98863ff60b,SGLSQGVPGMPAFSQPPAQQQIPSGSF,1049.0,1075.0,1058.0,,,1xHexNAc,tcl_vva_chy,,,missing_site_coverage,,2xHexNAc,ETD,viral.msf.P3HR1.bto.0000164..Other.EBV.infection
319314,O75477,e3e42a0c-6bfe-4b7a-b065-07fc3fccaf02,PFITTFRSVQTT,56.0,67.0,57.0,,,1xHexNAc,tcl_vva_chy,,,,,3xHexNAc,ETD,viral.msf.P3HR1.bto.0000164..Other.EBV.infection
319315,O75477,e3e42a0c-6bfe-4b7a-b065-07fc3fccaf02,PFITTFRSVQTT,56.0,67.0,58.0,,,1xHexNAc,tcl_vva_chy,,,,,3xHexNAc,ETD,viral.msf.P3HR1.bto.0000164..Other.EBV.infection


### UniProt entries data consistency validation

In [24]:
def validate_to_entries(df, isoform_seqs):
    entry_valid_rows = display_invalid_rows(df, "Non-matching peptide sequence between data and UniProt sequence", consistent_entry_peptide, isoform_seqs = isoform_seqs)
    entry_valid_rows &= display_invalid_rows(df, "Non-matching 'site' or 'site_ambiguous_start'/'site_ambiguous_end' between data and UniProt sequence", consistent_entry_sites, isoform_seqs = isoform_seqs)
    return entry_valid_rows

In [25]:
entry_valid_rows = validate_to_entries(df, isoform_seqs)

#### Non-matching peptide sequence between data and UniProt sequence: 328 (0.10%):

Unnamed: 0,uniprot,peptide_id,peptide,peptide_start,peptide_end,single_site,unclear_site_start,unclear_site_end,site_composition,source,quantification,quantification_channels,site_ambiguity,quantification_confidence,composition,activation,dataset
106759,Q8WXI7,c5dbc578-a2cc-433c-aea8-d9a35f05c9d3,STSVGPLYSGCR,12899.0,12910.0,,12899.0,12907.0,,tcl_vva_try,,M,,,1xHexHexNAc,CID,hela.erk8.HeLa.bto.0000567
184447,P98088,b9f95a4f-c917-43bd-b917-52500967a20d,ATSPTQSTSSWQK,2767.0,2779.0,2771.0,,,1xHexNAc,tcl_vva_try,32.251,M:L,missing_site_coverage,,2xHexNAc,ETD,ls174.t3ko.LS174T.bto.0001553..WT.C1GALT1C1..K...
184448,P98088,b9f95a4f-c917-43bd-b917-52500967a20d,ATSPTQSTSSWQK,2767.0,2779.0,,2768.0,2769.0,1xHexNAc,tcl_vva_try,32.251,M:L,missing_site_coverage,,2xHexNAc,ETD,ls174.t3ko.LS174T.bto.0001553..WT.C1GALT1C1..K...
184461,P98088,e4149c23-937b-48f2-83a3-dfbb5b16ee0d,ATSPTQSTSSWQK,2767.0,2779.0,2771.0,,,1xHexNAc,tcl_vva_try,61.106,M:L,,,1xHexNAc,ETD,ls174.t3ko.LS174T.bto.0001553..WT.C1GALT1C1..K...
184471,P98088,49800dc3-7f71-4b1a-8e1e-bba6b236efbe,ATSPTQSTSSWQK,2767.0,2779.0,2768.0,,,1xHexNAc,tcl_vva_try,rejected_ratio,M,,,7xHexNAc,ETD,ls174.t3ko.LS174T.bto.0001553..WT.C1GALT1C1..K...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
218839,P22105,ba9ac1f7-0439-4361-a074-11e0caa240d3,VPGHEDGVTISGLEPDHK,2312.0,2329.0,,2320.0,2322.0,,sec_pna_try,,,,,1xHexHexNAc,CID,plasma.9606.bto.0000131
294344,P22105,e4b7bf06-7b62-4d8a-b950-ce965b139850,ELTVTGSSPDSLSLSWTVPQGRFDSFTVQYKDR,2172.0,2204.0,2176.0,,,1xHexHexNAc,tcl_pna_try,,,missing_site_coverage,,1xHexNAc,ETD,viral.msf.HELF.bto.0003034..Other.HCMV.Towne.i...
294345,P22105,e4b7bf06-7b62-4d8a-b950-ce965b139850,ELTVTGSSPDSLSLSWTVPQGRFDSFTVQYKDR,2172.0,2204.0,,2186.0,2188.0,1xHexHexNAc,tcl_pna_try,,,missing_site_coverage,,1xHexNAc,ETD,viral.msf.HELF.bto.0003034..Other.HCMV.Towne.i...
294346,P22105,e4b7bf06-7b62-4d8a-b950-ce965b139850,ELTVTGSSPDSLSLSWTVPQGRFDSFTVQYKDR,2172.0,2204.0,,2196.0,2201.0,1xHexNAc,tcl_pna_try,,,missing_site_coverage,,1xHexNAc,ETD,viral.msf.HELF.bto.0003034..Other.HCMV.Towne.i...


#### Non-matching 'site' or 'site_ambiguous_start'/'site_ambiguous_end' between data and UniProt sequence: 18537 (5.80%):

Unnamed: 0,uniprot,peptide_id,peptide,peptide_start,peptide_end,single_site,unclear_site_start,unclear_site_end,site_composition,source,quantification,quantification_channels,site_ambiguity,quantification_confidence,composition,activation,dataset
46,O00468,65323e19-97c2-4967-8640-939b21c2807e,,,,1882.0,,,1xHexNAc,sec_vva_chy,,,,,1xHexNAc,,AGS.Glycoproteome
47,O00468,6a0f52fd-5b39-40fa-b79d-a5e107e0b45f,,,,1890.0,,,1xHexNAc,sec_vva_try,,,,,1xHexNAc,,AGS.Glycoproteome
48,O00468,d621131a-2f39-4986-bff4-fffa12a41ee4,,,,1896.0,,,1xHexNAc,sec_vva_try,,,,,1xHexNAc,,AGS.Glycoproteome
50,O00468,5be99f20-2778-4973-832d-b9da5c88664b,,,,1909.0,,,1xHexNAc,sec_vva_gluc,,,,,1xHexNAc,,AGS.Glycoproteome
138,O15031,26767a69-b0a2-4d25-ade7-4c1d8e0153b1,,,,37.0,,,1xHexNAc,tcl_vva_try,,,,,1xHexNAc,,AGS.Glycoproteome
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319314,O75477,e3e42a0c-6bfe-4b7a-b065-07fc3fccaf02,PFITTFRSVQTT,56.0,67.0,57.0,,,1xHexNAc,tcl_vva_chy,,,,,3xHexNAc,ETD,viral.msf.P3HR1.bto.0000164..Other.EBV.infection
319315,O75477,e3e42a0c-6bfe-4b7a-b065-07fc3fccaf02,PFITTFRSVQTT,56.0,67.0,58.0,,,1xHexNAc,tcl_vva_chy,,,,,3xHexNAc,ETD,viral.msf.P3HR1.bto.0000164..Other.EBV.infection
319316,O75477,e3e42a0c-6bfe-4b7a-b065-07fc3fccaf02,PFITTFRSVQTT,56.0,67.0,61.0,,,1xHexNAc,tcl_vva_chy,,,,,3xHexNAc,ETD,viral.msf.P3HR1.bto.0000164..Other.EBV.infection
319441,Q9NPR2,7483ca26-2a38-4d73-a5d8-0c9f8db918e7,CSASSVVSPSFVPTGEKPCEQVQF,578.0,601.0,,574.0,586.0,,tcl_pna_chy,,,,,1xHexHexNAc,CID,viral.msf.P3HR1.bto.0000164..Other.EBV.infection


## Create clean dataframe

In [35]:
n_original_rows

319491

In [32]:
sum(~valid_rows), sum(~valid_rows)/n_original_rows

(1322, 0.004137831738609226)

In [33]:
sum(~consistent_rows), sum(~consistent_rows)/n_original_rows

(24851, 0.0777830987414356)

In [34]:
sum(~entry_valid_rows), sum(~entry_valid_rows)/n_original_rows

(18756, 0.058705879038846165)

In [36]:
sum(~(valid_rows & consistent_rows & entry_valid_rows)), sum(~(valid_rows & consistent_rows & entry_valid_rows))/n_original_rows

(31234, 0.09776175228723188)

In [26]:
valid_and_consistent_rows = valid_rows & consistent_rows & entry_valid_rows

clean_df = df[valid_and_consistent_rows].reset_index(drop=True).copy()

#clean_df['peptide_start'] = clean_df['peptide_start'].astype(pd.Int32Dtype())
#clean_df['peptide_end'] = clean_df['peptide_end'].astype(pd.Int32Dtype())
#clean_df['single_site'] = clean_df['single_site'].astype(pd.Int32Dtype())
#clean_df['unclear_site_start'] = clean_df['unclear_site_start'].astype(pd.Int32Dtype())
#clean_df['unclear_site_end'] = clean_df['unclear_site_end'].astype(pd.Int32Dtype())

print(f"Clean dataframe has {len(clean_df)} rows, {(n_original_rows-len(clean_df))/n_original_rows*100:4.2f}% of original rows were discarded")

Clean dataframe has 288257 rows, 9.78% of original rows were discarded


In [27]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288257 entries, 0 to 288256
Data columns (total 17 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   uniprot                    288257 non-null  object 
 1   peptide_id                 288257 non-null  object 
 2   peptide                    273061 non-null  object 
 3   peptide_start              272843 non-null  float64
 4   peptide_end                272843 non-null  float64
 5   single_site                184888 non-null  float64
 6   unclear_site_start         103369 non-null  float64
 7   unclear_site_end           103369 non-null  float64
 8   site_composition           209738 non-null  object 
 9   source                     273067 non-null  object 
 10  quantification             99123 non-null   object 
 11  quantification_channels    162593 non-null  object 
 12  site_ambiguity             60375 non-null   object 
 13  quantification_confidence  13

In [39]:
len(clean_df['uniprot'].unique())

4684

### Check how redundant peptides are

In [28]:
representative_rows, redundant_rows = cluster_peptides(clean_df)
print(f"Cluster representatives: {sum(representative_rows)} {sum(representative_rows)/clean_df.shape[0]*100}%")
print(f"Redundant rows: {sum(redundant_rows)} {sum(redundant_rows)/clean_df.shape[0]*100}%")
print(f"Invalid peptide rows: {clean_df.shape[0]-sum(representative_rows | redundant_rows)} {(clean_df.shape[0]-sum(representative_rows | redundant_rows))/clean_df.shape[0]*100}%")

Cluster representatives: 279733 97.04291656403834%
Redundant rows: 8524 2.957083435961659%
Invalid peptide rows: 0 0.0%


In [29]:
notebook_end_time = time.perf_counter()
print(f"Notebook took {notebook_end_time-notebook_start_time} seconds to run")

Notebook took 171.53998016699916 seconds to run
