**Goal**: This notebook merges the files `mcs_properties.tsv` and `dpcstruct_consistency.tsv` and writes a CSV file of DPCStruct metacluster properties.

**1. Imports**

In [1]:
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from ydata_profiling import ProfileReport

**2. Loading data**

In [2]:
# A. Path to data
file_path_mcs_props = "/u/mdmc/enyanduk/internship_areasciencepark/Data/dpcstruct/mcs_properties.tsv"
file_path_consistency = "/u/mdmc/enyanduk/internship_areasciencepark/Data/dpcstruct/dpcstruct_consistency.tsv"

In [3]:
# File 1: mcs_properties.tsv
with open(file_path_mcs_props, "r", encoding="utf-8") as f:
    print(repr(f.readline()))

'mcID size len_aa len_std len_ratio plddt disorder alntmscore tmscore lddt prob pident\n'


In [4]:
# File 2: dpcstruct_consistency.tsv
with open(file_path_consistency, "r", encoding="utf-8") as f:
    print(repr(f.readline()))

'mcID score pfam-labels\n'


`Observation`: Despite the `.tsv` extension, both files are space-separated, not tab-separated.
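
The same check can be done programmatically by looking for candidate delimiters in the header line. This is only a minimal sketch: `guess_separator` is a hypothetical helper (not part of pandas), and the header string is copied from the output above.

```python
def guess_separator(header_line: str) -> str:
    """Return 'tab', 'space', or 'unknown' for a header line (hypothetical helper)."""
    line = header_line.rstrip("\n")
    if "\t" in line:
        return "tab"
    if " " in line:
        return "space"
    return "unknown"


# Header of dpcstruct_consistency.tsv as printed above
header = "mcID score pfam-labels\n"
separator = guess_separator(header)
print(separator)
```

Reading with `sep=r"\s+"` is a tolerant choice here, since it accepts single spaces, runs of spaces, and tabs alike.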

In [5]:
# B. Read in the data
# File 1: mcs_properties.tsv
df_mcs_props = pd.read_csv(file_path_mcs_props, sep=r"\s+")
df_mcs_props.head()

Unnamed: 0,mcID,size,len_aa,len_std,len_ratio,plddt,disorder,alntmscore,tmscore,lddt,prob,pident
0,61388,30643,210.005,38.0453,0.181164,82.622,0.284676,0.2305,0.2478,0.3321,0.3082,10.7832
1,49953,28922,282.901,60.4128,0.213547,81.7385,0.225913,0.3476,0.3795,0.3599,0.6535,10.5943
2,52595,26312,122.438,21.1918,0.173082,84.0467,0.286809,0.3575,0.3819,0.3671,0.4309,10.8535
3,42223,17503,260.899,60.0432,0.23014,78.8912,0.223797,0.4164,0.4628,0.4638,0.7612,16.7721
4,124,16132,191.621,37.0314,0.193253,85.6541,0.197148,0.3631,0.3936,0.3936,0.5154,12.6375


In [6]:
df_mcs_props.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28246 entries, 0 to 28245
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   mcID        28246 non-null  int64  
 1   size        28246 non-null  int64  
 2   len_aa      28246 non-null  float64
 3   len_std     28246 non-null  float64
 4   len_ratio   28246 non-null  float64
 5   plddt       28246 non-null  float64
 6   disorder    28246 non-null  float64
 7   alntmscore  28246 non-null  float64
 8   tmscore     28246 non-null  float64
 9   lddt        28246 non-null  float64
 10  prob        28246 non-null  float64
 11  pident      28246 non-null  float64
dtypes: float64(10), int64(2)
memory usage: 2.6 MB


In [7]:
# File 2: dpcstruct_consistency.tsv
df_consistency = pd.read_csv(file_path_consistency, sep=r"\s+")
df_consistency.head()

Unnamed: 0,mcID,score,pfam-labels
0,5,1.0,CL0263
1,6,1.0,PF14802
2,7,1.0,PF00110
3,8,1.0,CL0465
4,9,0.75,CL0272-CL0016


In [8]:
df_consistency.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14423 entries, 0 to 14422
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   mcID         14423 non-null  int64  
 1   score        14423 non-null  float64
 2   pfam-labels  14423 non-null  object 
dtypes: float64(1), int64(1), object(1)
memory usage: 338.2+ KB


The first dataframe has 28,246 metaclusters (one per row), but only 14,423 of them have Pfam-36 labels (clans/families) in the second dataframe.

Let's merge the two dataframes on `mcID` using a left join, which keeps every row of `df_mcs_props` and adds the `score` and `pfam-labels` columns wherever a match exists in `df_consistency`.

In [9]:
# Merge the two dataframes on mcID (left join to keep all rows from df_mcs_props)
df_merged = df_mcs_props.merge(df_consistency, on='mcID', how='left')
# Check the result
df_merged.head()

Unnamed: 0,mcID,size,len_aa,len_std,len_ratio,plddt,disorder,alntmscore,tmscore,lddt,prob,pident,score,pfam-labels
0,61388,30643,210.005,38.0453,0.181164,82.622,0.284676,0.2305,0.2478,0.3321,0.3082,10.7832,0.681352,PF20912-PF11797-PF21307-PF18701-PF20419
1,49953,28922,282.901,60.4128,0.213547,81.7385,0.225913,0.3476,0.3795,0.3599,0.6535,10.5943,0.823722,PF02751-PF10395-PF19755-PF00045-PF03736
2,52595,26312,122.438,21.1918,0.173082,84.0467,0.286809,0.3575,0.3819,0.3671,0.4309,10.8535,0.544473,PF11797-CL0003-PF01324-PF20254-CL0219
3,42223,17503,260.899,60.0432,0.23014,78.8912,0.223797,0.4164,0.4628,0.4638,0.7612,16.7721,0.975607,PF06146-PF08492-CL0003-CL0219-CL0366
4,124,16132,191.621,37.0314,0.193253,85.6541,0.197148,0.3631,0.3936,0.3936,0.5154,12.6375,0.944872,PF04931-CL0219-PF11899-PF05890-CL0273


In [10]:
# A small check : How many rows have Pfam annotations?
print(f"Total rows: {len(df_merged)}")
print(f"Rows with Pfam annotations: {df_merged['pfam-labels'].notna().sum()}")
print(f"Rows without Pfam annotations: {df_merged['pfam-labels'].isna().sum()}")

Total rows: 28246
Rows with Pfam annotations: 14423
Rows without Pfam annotations: 13823
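
The behaviour of the left join can be illustrated on a toy example. The sketch below is an assumption-laden illustration rather than part of the pipeline: the `props` and `consistency` frames are small stand-ins built from a few rows shown above, and the `validate` and `indicator` arguments of `DataFrame.merge` are optional extras that the notebook itself does not use.

```python
import pandas as pd

# Toy stand-ins for df_mcs_props and df_consistency (rows copied from the outputs above)
props = pd.DataFrame({"mcID": [0, 5, 9], "size": [43, 46, 7]})
consistency = pd.DataFrame({
    "mcID": [5, 9],
    "score": [1.0, 0.75],
    "pfam-labels": ["CL0263", "CL0272-CL0016"],
})

# validate="one_to_one" raises a MergeError if mcID is duplicated on either side;
# indicator=True adds a _merge column recording where each row came from.
merged = props.merge(
    consistency, on="mcID", how="left", validate="one_to_one", indicator=True
)

counts = merged["_merge"].value_counts()
print(counts["both"], counts["left_only"])
```

Rows marked `left_only` are the metaclusters without Pfam labels, so their `score` and `pfam-labels` values are `NaN`, which is what the counts above reflect on the full data.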


In [11]:
# Great, the merge succeeded. Let's preprocess the newly merged dataframe:
df = df_merged.copy()
# T1. We drop some columns : alntmscore and prob
df.drop(columns=["alntmscore", "prob"], inplace=True)
# T2. We rename the columns to more meaningful names
df.rename(columns={
    "mcID": "mc_id",
    "size": "mc_size",
    "score":"pfam_score",
    "pfam-labels":"pfam_labels"
    }, inplace=True)
# T3 : We sort the data by mc_id in ascending order
df.sort_values(by="mc_id", inplace=True)
# T4 : Prefix each ID in the mc_id column with "MC", e.g. 1 -> MC1
df["mc_id"] = df["mc_id"].apply(lambda x: f"MC{x}")
# T5 : We reset the index
df.reset_index(drop=True, inplace=True)
# T6 : We ensure each float value in the dataframe is rounded to 2 decimal places
df = df.round(2)
# T7: We replace each NaN value in the pfam_labels column with "NONE"
df["pfam_labels"] = df["pfam_labels"].fillna("NONE")
# Putting everything together, the new head of the dataframe looks like this :
df.head(10)

Unnamed: 0,mc_id,mc_size,len_aa,len_std,len_ratio,plddt,disorder,tmscore,lddt,pident,pfam_score,pfam_labels
0,MC0,43,112.02,12.36,0.11,78.53,0.29,0.54,0.65,23.74,,NONE
1,MC1,19,68.89,7.43,0.11,90.81,0.28,0.77,0.82,25.87,,NONE
2,MC2,19,368.05,52.78,0.14,85.0,0.3,0.46,0.56,19.6,,NONE
3,MC3,18,346.39,34.27,0.1,82.09,0.25,0.55,0.49,14.0,,NONE
4,MC4,12,135.17,13.06,0.1,86.24,0.22,0.69,0.62,23.25,,NONE
5,MC5,46,159.04,38.84,0.24,81.34,0.19,0.56,0.57,17.34,1.0,CL0263
6,MC6,20,242.6,19.41,0.08,71.32,0.33,0.62,0.7,24.5,1.0,PF14802
7,MC7,6,79.0,3.96,0.05,74.06,0.18,0.84,0.82,57.2,1.0,PF00110
8,MC8,24,85.71,6.73,0.08,79.14,0.41,0.71,0.82,49.28,1.0,CL0465
9,MC9,7,50.57,4.81,0.1,81.26,0.19,0.85,0.88,82.23,0.75,CL0272-CL0016
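
The order of steps T3 and T4 matters: sorting is done while `mc_id` is still an integer, so the rows end up in numeric order. A minimal sketch with made-up IDs (the `ids` frame is purely illustrative) shows what would happen if the prefix were added first:

```python
import pandas as pd

ids = pd.DataFrame({"mc_id": [10, 2, 1]})

# Order used in the notebook: sort numerically, then add the prefix
numeric_first = ids.sort_values(by="mc_id")
numeric_first["mc_id"] = "MC" + numeric_first["mc_id"].astype(str)

# Reversed order: add the prefix, then sort the resulting strings
prefix_first = ids.copy()
prefix_first["mc_id"] = "MC" + prefix_first["mc_id"].astype(str)
prefix_first = prefix_first.sort_values(by="mc_id")

print(numeric_first["mc_id"].tolist())  # ['MC1', 'MC2', 'MC10']
print(prefix_first["mc_id"].tolist())   # ['MC1', 'MC10', 'MC2']
```

Sorting the prefixed strings is lexicographic, which would place `MC10` before `MC2`.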


In [12]:
# The last few rows of the dataframe look like this :
df.tail()

Unnamed: 0,mc_id,mc_size,len_aa,len_std,len_ratio,plddt,disorder,tmscore,lddt,pident,pfam_score,pfam_labels
28241,MC64563,63,136.46,22.21,0.16,79.54,0.24,0.76,0.81,42.8,1.0,CL0123
28242,MC64566,76,139.09,19.83,0.14,77.84,0.43,0.37,0.61,17.84,1.0,CL0673
28243,MC64569,21,144.95,25.91,0.18,74.17,0.55,0.44,0.69,27.02,,NONE
28244,MC64572,42,105.1,7.45,0.07,85.94,0.22,0.9,0.88,51.64,,NONE
28245,MC64574,11,90.36,9.42,0.1,86.57,0.2,0.87,0.84,39.65,,NONE


In [13]:
# General information about the dataframe :
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28246 entries, 0 to 28245
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   mc_id        28246 non-null  object 
 1   mc_size      28246 non-null  int64  
 2   len_aa       28246 non-null  float64
 3   len_std      28246 non-null  float64
 4   len_ratio    28246 non-null  float64
 5   plddt        28246 non-null  float64
 6   disorder     28246 non-null  float64
 7   tmscore      28246 non-null  float64
 8   lddt         28246 non-null  float64
 9   pident       28246 non-null  float64
 10  pfam_score   14423 non-null  float64
 11  pfam_labels  28246 non-null  object 
dtypes: float64(9), int64(1), object(2)
memory usage: 2.6+ MB
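
Note that `pfam_score` keeps its 13,823 missing values, while `pfam_labels` is fully populated thanks to the `"NONE"` placeholder. If the individual labels are needed downstream, the column can be split into lists. The sketch below is hedged: `split_pfam_labels` is a hypothetical helper, and it assumes that `-` separates individual Pfam family (`PF...`) and clan (`CL...`) accessions, which the outputs suggest but the source files do not document here.

```python
def split_pfam_labels(labels: str) -> list[str]:
    """Split a pfam_labels string into accessions (hypothetical helper).

    Assumes '-' separates accessions and that 'NONE' marks an unannotated
    metacluster, as introduced in step T7.
    """
    if labels == "NONE":
        return []
    return labels.split("-")


print(split_pfam_labels("CL0272-CL0016"))  # ['CL0272', 'CL0016']
print(split_pfam_labels("NONE"))           # []
```

On the full dataframe this could be applied with `df["pfam_labels"].apply(split_pfam_labels)`.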


**3. Data Exploration : Profiling Report**

In [14]:
# Profiling the dataset
profile = ProfileReport(df, title="DPCStruct Metacluster Properties Profiling Report", explorative=True)
path_to_report = "/u/mdmc/enyanduk/internship_areasciencepark/Notebooks/dpcstruct/profiling_reports/dpcstruct_mcs_properties_report.html"
profile.to_file(path_to_report)


In [15]:
profile



**4. Save the dataframe**

In [16]:
# We save the preprocessed dataframe as a CSV file for our PostgreSQL needs:
output_path = "/u/mdmc/enyanduk/internship_areasciencepark/Dataframes/DPCStruct/dpcstruct_mcs_properties.csv"
df.to_csv(output_path, index=False)
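
Before loading the file elsewhere, it can be useful to see how missing values are written. The following is a minimal round-trip sketch on a toy frame; the `toy` dataframe holds two rows and a subset of columns copied from the outputs above, and an in-memory buffer replaces `output_path` so nothing is written to disk.

```python
import io

import numpy as np
import pandas as pd

# Two rows copied from the preprocessed dataframe above (subset of columns)
toy = pd.DataFrame({
    "mc_id": ["MC0", "MC5"],
    "mc_size": [43, 46],
    "pfam_score": [np.nan, 1.0],
    "pfam_labels": ["NONE", "CL0263"],
})

# Write to an in-memory buffer instead of output_path so the sketch has no side effects
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(buffer.getvalue())
print(reloaded["pfam_score"].isna().sum())
```

A missing `pfam_score` is written as an empty field, whereas `"NONE"` is kept as ordinary text. As far as we know, PostgreSQL's `COPY` in CSV format treats an unquoted empty field as `NULL` by default, so the empty scores should load as nulls, but this is worth confirming against the target table definition.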