The objective of this notebook is to analyze the results of the previous script, 01_explore_location.py.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors


In [3]:
# Load the CSV with the pocket locations
csv_filepath = "/home/alex/Documents/pocket_tool/results/04_combine_pockets_pdb/pockets_location.csv"
location_df = pd.read_csv(csv_filepath, sep=";")
location_df

# Load CSV containing the class per dynid
csv_filepath = "/home/alex/sshfs_mountpoints/verde/Documents/pocket_tool/old/pre250109/results/align.csv"
align_df = pd.read_csv(csv_filepath, sep=";")
dynid_class_df = align_df[["dynid", "class_name"]].drop_duplicates().set_index("dynid")

# Filter for class A only
location_df = location_df.join(dynid_class_df, on="dynid")
location_df = location_df[location_df["class_name"] == "Class A (Rhodopsin)"]
location_df

Unnamed: 0,dynid,trajid,pocketid,is_centered,z_location,tm_contacts,class_name
1,184,11714,7,False,ecl,,Class A (Rhodopsin)
2,124,11193,0,False,above,,Class A (Rhodopsin)
3,910,16517,3,True,below,tm2_tm3_tm5,Class A (Rhodopsin)
4,183,11706,3,False,below,tm2_tm7,Class A (Rhodopsin)
5,51,10522,8,False,above,tm4,Class A (Rhodopsin)
...,...,...,...,...,...,...,...
20440,947,16772,9,False,above,,Class A (Rhodopsin)
20443,736,15338,9,False,below,tm6,Class A (Rhodopsin)
20444,810,15839,3,True,above,tm1_tm2_tm3_tm5_tm6_tm7,Class A (Rhodopsin)
20445,987,17033,3,False,above,tm1_tm3_tm6_tm7,Class A (Rhodopsin)
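
The join-and-filter step above can be sketched on toy data. This is a minimal illustration, not the notebook's real input: the frames, their values and the "Class B1 (Secretin)" row are made up for the example, and the uniqueness check is an added assumption about what the lookup table should satisfy.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (illustrative values, not real data)
location_df = pd.DataFrame({
    "dynid": [184, 124, 910, 5],
    "pocketid": [7, 0, 3, 1],
    "z_location": ["ecl", "above", "below", "icl"],
})
align_df = pd.DataFrame({
    "dynid": [184, 184, 124, 910, 5],
    "class_name": [
        "Class A (Rhodopsin)", "Class A (Rhodopsin)", "Class A (Rhodopsin)",
        "Class A (Rhodopsin)", "Class B1 (Secretin)",
    ],
})

# One class per dynid; a dynid with two classes would duplicate rows in the join
dynid_class_df = align_df[["dynid", "class_name"]].drop_duplicates().set_index("dynid")
assert dynid_class_df.index.is_unique, "a dynid maps to more than one class"

# Left join: the "dynid" column of location_df is matched against the lookup index
joined = location_df.join(dynid_class_df, on="dynid")
class_a = joined[joined["class_name"] == "Class A (Rhodopsin)"]
print(class_a)
```

Checking that the lookup index is unique before joining guards against silently inflating the pocket counts.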


Where are the pockets located along the Z-axis: in the ICL region, in the ECL region, or in the TM region above or below the midline?

In [4]:
location_df.groupby("z_location").size().reset_index(name='count').sort_values(by='z_location', ascending=True)

Unnamed: 0,z_location,count
0,above,7228
1,below,7121
2,ecl,1169
3,icl,2680
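
The same count can be obtained with `value_counts`, which also gives the share of each region directly. A minimal sketch on a toy series (the values are illustrative only):

```python
import pandas as pd

# Toy z_location column (illustrative values, not real data)
z = pd.Series(
    ["above", "below", "above", "icl", "ecl", "below", "above"],
    name="z_location",
)

# Absolute counts and percentage share, both sorted by region name
counts = z.value_counts().sort_index()
percent = (z.value_counts(normalize=True).sort_index() * 100).round(1)

summary = pd.DataFrame({"count": counts, "percent": percent})
print(summary)
```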


How many of the pockets are located in the core of the receptor?

In [5]:
location_df.groupby(["is_centered"]).size().reset_index(name='count').sort_values(by='is_centered', ascending=True)

Unnamed: 0,is_centered,count
0,False,14832
1,True,3366


In [6]:
df_mod = location_df.groupby(["is_centered", "z_location"]).size().reset_index(name='count').sort_values(by='is_centered', ascending=True)
df_mod['frequency'] = df_mod['count'] / df_mod['count'].sum() * 100
df_mod

Unnamed: 0,is_centered,z_location,count,frequency
0,False,above,5489,30.162655
1,False,below,6028,33.124519
2,False,ecl,1051,5.77536
3,False,icl,2264,12.440928
4,True,above,1739,9.555995
5,True,below,1093,6.006155
6,True,ecl,118,0.648423
7,True,icl,416,2.285965
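
The two-way breakdown above can also be laid out as a contingency table. The sketch below uses `pd.crosstab` with `normalize="all"` so that every cell is a percentage of all pockets; the toy frame is illustrative only.

```python
import pandas as pd

# Toy data (illustrative values, not real data)
toy = pd.DataFrame({
    "is_centered": [False, False, True, False, True, False],
    "z_location": ["above", "below", "above", "icl", "below", "above"],
})

# Rows: is_centered, columns: z_location, cells: percent of all pockets
table = pd.crosstab(toy["is_centered"], toy["z_location"], normalize="all") * 100
print(table.round(1))
```

Passing `normalize="index"` instead would give the distribution over regions within the centered and non-centered groups separately.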


How many pockets did not have any contact with a TM helix?

In [7]:
# Select the rows where "tm_contacts" is NaN.
location_df[location_df['tm_contacts'].isna()].groupby("z_location").size().reset_index(name='count').sort_values(by='z_location', ascending=True)

Unnamed: 0,z_location,count
0,above,358
1,below,516
2,ecl,408
3,icl,514
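
Raw counts are easier to compare once they are related to the size of each region. A minimal sketch, on toy data, of computing the share of pockets without TM contacts per region:

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values, not real data)
toy = pd.DataFrame({
    "z_location": ["above", "above", "ecl", "ecl", "icl", "below"],
    "tm_contacts": ["tm1_tm7", np.nan, np.nan, np.nan, "tm6", "tm3"],
})

# Flag pockets without any TM contact, then aggregate per region
summary = (
    toy.assign(no_tm=toy["tm_contacts"].isna())
       .groupby("z_location")["no_tm"]
       .agg(["sum", "count"])
       .rename(columns={"sum": "no_tm_count", "count": "total"})
)
summary["no_tm_percent"] = summary["no_tm_count"] / summary["total"] * 100
print(summary)
```

Dividing the counts above by the per-region totals from In [4] gives roughly 5% (above), 7% (below), 35% (ecl) and 19% (icl).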


In the above region, between which helices are the pockets located?

In [8]:
df_mod = location_df[(location_df["is_centered"] == False) & (location_df["z_location"] == "above")]
df_mod = df_mod.groupby("tm_contacts").size().reset_index(name='count').sort_values(by='count', ascending=False)
df_mod['frequency'] = df_mod['count'] / df_mod['count'].sum() * 100

# Define the colormap from light pink to magenta
cmap = mcolors.LinearSegmentedColormap.from_list("pink_purple", ["#ffe7f8", "#FF00FF"])

# Normalize frequencies so that 1 maps to the lowest color and max maps to the highest
norm = mcolors.Normalize(vmin=1, vmax=df_mod['frequency'].max())

# Apply the colormap to get hex colors
df_mod['color'] = df_mod['frequency'].apply(lambda x: mcolors.to_hex(cmap(norm(x))))
df_mod[0:20]

Unnamed: 0,tm_contacts,count,frequency,color
24,tm1_tm7,574,11.128344,#ff00ff
42,tm3_tm4,488,9.461031,#ff26fe
0,tm1,442,8.569213,#ff3afd
55,tm5,385,7.464133,#ff53fc
26,tm2_tm3,369,7.153936,#ff5bfc
19,tm1_tm2_tm7,282,5.467235,#ff82fb
56,tm5_tm6,277,5.370299,#ff83fb
1,tm1_tm2,241,4.672354,#ff94fb
52,tm4_tm5,220,4.265219,#ff9dfa
51,tm4,217,4.207057,#ff9efa
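
The steps in the cell above (filter, count, convert to percentages, map to colors) could be wrapped in a function so that they can be reused for other regions. This is only a sketch: `contact_frequencies` and its `vmax` parameter are hypothetical names that do not exist in the original notebook, and the toy frame is illustrative.

```python
import pandas as pd
import matplotlib.colors as mcolors


def contact_frequencies(df, z_location, vmax=None):
    """Hypothetical helper: frequency and color per tm_contacts pattern
    for non-centered pockets in one z_location."""
    subset = df[(~df["is_centered"]) & (df["z_location"] == z_location)]
    out = (
        subset.groupby("tm_contacts").size()
              .reset_index(name="count")
              .sort_values(by="count", ascending=False)
    )
    out["frequency"] = out["count"] / out["count"].sum() * 100

    # Same two-color ramp as in the notebook; vmax defaults to this table's maximum
    cmap = mcolors.LinearSegmentedColormap.from_list("pink_magenta", ["#ffe7f8", "#FF00FF"])
    top = out["frequency"].max() if vmax is None else vmax
    norm = mcolors.Normalize(vmin=1, vmax=top)
    out["color"] = out["frequency"].apply(lambda x: mcolors.to_hex(cmap(norm(x))))
    return out


# Toy data (illustrative values, not real data)
toy = pd.DataFrame({
    "is_centered": [False, False, False, False, True],
    "z_location": ["above", "above", "above", "above", "above"],
    "tm_contacts": ["tm1_tm7", "tm1_tm7", "tm1_tm7", "tm3_tm4", "tm5"],
})
result = contact_frequencies(toy, "above")
print(result)
```

Passing the same `vmax` to every call keeps the color scale comparable between regions.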


Summed frequency of the contact patterns that each account for less than 1% of the pockets:

In [10]:
df_mod[df_mod['frequency'] < 1]['frequency'].sum()

9.461031407522295

What about the below region?

In [11]:
df_mod = location_df[(location_df["is_centered"] == False) & (location_df["z_location"] == "below")]
df_mod = df_mod.groupby("tm_contacts").size().reset_index(name='count').sort_values(by='count', ascending=False)
df_mod['frequency'] = round(df_mod['count'] / df_mod['count'].sum() * 100, 1)

# Define the colormap from light pink to magenta
cmap = mcolors.LinearSegmentedColormap.from_list("pink_purple", ["#ffe7f8", "#FF00FF"])

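# Normalize frequencies so that 1 maps to the lowest color and vmax maps to the highest.
# vmax is hard-coded so that both regions share one color scale; note that it differs
# from the above region's maximum in the table shown earlier (11.128344).
norm = mcolors.Normalize(vmin=1, vmax=10.431346)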

# Apply the colormap to get hex colors
df_mod['color'] = df_mod['frequency'].apply(lambda x: mcolors.to_hex(cmap(norm(x))))

df_mod[0:30]

Unnamed: 0,tm_contacts,count,frequency,color
57,tm5_tm6,490,8.9,#ff25fe
47,tm3,362,6.6,#ff5dfc
62,tm7,349,6.3,#ff65fc
50,tm3_tm5,343,6.2,#ff67fc
36,tm2_tm3_tm4,300,5.4,#ff7bfb
48,tm3_tm4,300,5.4,#ff7bfb
54,tm4,275,5.0,#ff85fb
1,tm1,274,5.0,#ff85fb
3,tm1_tm2,269,4.9,#ff88fb
35,tm2_tm3,269,4.9,#ff88fb


In [13]:
df_mod[df_mod['frequency'] < 1]['frequency'].sum()

5.0
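
Note that In [11] rounds the frequencies to one decimal before In [13] sums them, so the tail total is itself built from rounded values. The sketch below, on made-up counts, shows how summing the unrounded frequencies and rounding only for display can give a slightly different result:

```python
import pandas as pd

# Toy counts per contact pattern (illustrative values, not real data)
counts = pd.Series(
    [1800, 1155, 16, 16, 13],
    index=["tm5_tm6", "tm3", "tm1_tm4", "tm2_tm6", "tm1_tm5"],
)
frequency = counts / counts.sum() * 100

# Tail computed from exact frequencies
exact_tail = frequency[frequency < 1].sum()

# Tail computed after rounding each frequency to one decimal
rounded = frequency.round(1)
rounded_tail = rounded[rounded < 1].sum()

print(f"exact: {exact_tail:.2f}, from rounded values: {rounded_tail:.2f}")
```

The gap grows with the number of rare patterns, since each one contributes its own rounding error.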