The objective of this notebook is to analyze the results produced by the previous script: 01_explore_location.py

In [66]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

In [67]:
csv_filepath = "/home/alex/Desktop/pockets_location.csv"
location_df = pd.read_csv(csv_filepath, sep=";")
location_df

Unnamed: 0,dynid,trajid,pocketid,is_centered,z_location,tm_contacts
0,743,15389,6,False,below,tm2_tm3_tm4
1,184,11714,7,False,ecl,
2,124,11193,0,False,above,
3,910,16517,3,True,below,tm2_tm3_tm5
4,183,11706,3,False,below,tm2_tm7
...,...,...,...,...,...,...
20442,193,11777,5,True,below,tm2_tm3_tm7
20443,736,15338,9,False,below,tm6
20444,810,15839,3,True,above,tm1_tm2_tm3_tm5_tm6_tm7
20445,987,17033,3,False,above,tm1_tm3_tm6_tm7


Where are the pockets located along the Z-axis: in the ICL region, in the ECL region, or in the TM region above or below the midline?

In [68]:
location_df.groupby("z_location").size().reset_index(name='count').sort_values(by='z_location', ascending=True)

Unnamed: 0,z_location,count
0,above,8134
1,below,7772
2,ecl,1670
3,icl,2871


How many of the pockets are located in the core of the receptor?

In [69]:
location_df.groupby(["is_centered"]).size().reset_index(name='count').sort_values(by='is_centered', ascending=True)

Unnamed: 0,is_centered,count
0,False,16682
1,True,3765


In [70]:
df_mod = location_df.groupby(["is_centered", "z_location"]).size().reset_index(name='count').sort_values(by='is_centered', ascending=True)
df_mod['frequency'] = df_mod['count'] / df_mod['count'].sum() * 100
df_mod

Unnamed: 0,is_centered,z_location,count,frequency
0,False,above,6181,30.229374
1,False,below,6540,31.985132
2,False,ecl,1530,7.48276
3,False,icl,2431,11.889275
4,True,above,1953,9.551523
5,True,below,1232,6.025334
6,True,ecl,140,0.684697
7,True,icl,440,2.151905
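
The same breakdown can also be laid out as a two-way table, which makes it easier to compare centered and non-centered pockets region by region. The sketch below uses `pd.crosstab` on a small made-up frame (`toy_df` is a stand-in, not the real `location_df`), so the numbers it prints are illustrative only.

```python
import pandas as pd

# Toy stand-in for location_df (made-up rows, NOT the real data)
toy_df = pd.DataFrame({
    "is_centered": [False, False, True, False, True, False],
    "z_location": ["above", "below", "above", "icl", "below", "above"],
})

# Percentage of all pockets in each (is_centered, z_location) cell;
# normalize="all" divides every cell by the grand total
cross_pct = pd.crosstab(
    toy_df["is_centered"], toy_df["z_location"], normalize="all"
) * 100

print(cross_pct.round(1))
```

Passing `normalize="index"` instead would give the distribution over regions within each `is_centered` group, which may be the more useful view when the two groups differ a lot in size.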


How many pockets did not have any contact with a TM helix?

In [86]:
# Select the rows where "tm_contacts" is NaN (no TM helix contact) and count them per region.
location_df[location_df['tm_contacts'].isna()].groupby("z_location").size().reset_index(name='count').sort_values(by='z_location', ascending=True)

Unnamed: 0,z_location,count
0,above,397
1,below,539
2,ecl,717
3,icl,586
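
Raw counts are hard to compare because the regions differ in size, so it can help to express them as a share of each region. A minimal sketch, again on a made-up frame (`toy_df` is hypothetical): the mean of a boolean mask within each group is the fraction of rows in that group where the mask is true.

```python
import pandas as pd

# Toy stand-in for location_df (made-up rows, NOT the real data)
toy_df = pd.DataFrame({
    "z_location": ["above", "above", "ecl", "ecl", "icl", "below"],
    "tm_contacts": ["tm1_tm7", None, None, None, "tm6", "tm3"],
})

# Share (in %) of pockets without any TM contact, per region
no_tm_share = (
    toy_df["tm_contacts"].isna().groupby(toy_df["z_location"]).mean() * 100
)

print(no_tm_share)
```

Dividing the counts above by the per-region totals reported earlier gives roughly 4.9% (above), 6.9% (below), 42.9% (ecl) and 20.4% (icl), so pockets without TM contacts are concentrated in the loop regions.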


In the above region, between which helices are the pockets located?

In [71]:
df_mod = location_df[(location_df["is_centered"] == False) & (location_df["z_location"] == "above")]
df_mod = df_mod.groupby("tm_contacts").size().reset_index(name='count').sort_values(by='count', ascending=False)
df_mod['frequency'] = df_mod['count'] / df_mod['count'].sum() * 100

# Define the colormap from light pink (#ffe7f8) to magenta (#FF00FF)
cmap = mcolors.LinearSegmentedColormap.from_list("pink_purple", ["#ffe7f8", "#FF00FF"])

# Normalize frequencies so that 1 maps to the lowest color and max maps to the highest
norm = mcolors.Normalize(vmin=1, vmax=df_mod['frequency'].max())

# Apply the colormap to get hex colors
df_mod['color'] = df_mod['frequency'].apply(lambda x: mcolors.to_hex(cmap(norm(x))))
df_mod[0:20]

Unnamed: 0,tm_contacts,count,frequency,color
24,tm1_tm7,607,10.431346,#ff00ff
43,tm3_tm4,525,9.022169,#ff22fe
0,tm1,457,7.853583,#ff3ffd
56,tm5,440,7.561437,#ff46fd
26,tm2_tm3,433,7.441141,#ff49fd
57,tm5_tm6,340,5.842928,#ff70fc
19,tm1_tm2_tm7,339,5.825743,#ff71fc
1,tm1_tm2,265,4.554047,#ff90fb
53,tm4_tm5,239,4.107235,#ff9bfa
52,tm4,237,4.072865,#ff9cfa
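
To make the color mapping in the cell above concrete, the sketch below evaluates the same colormap and normalization at three frequencies. Because `Normalize` is used with `vmin=1` and no clipping, any frequency below 1% falls under the range and, as far as matplotlib's default behavior goes, is drawn with the lowest color. The `vmax` value is copied from the table above; the variable names are only illustrative.

```python
import matplotlib.colors as mcolors

# Same colormap and normalization as in the notebook cell
cmap = mcolors.LinearSegmentedColormap.from_list("pink_purple", ["#ffe7f8", "#FF00FF"])
norm = mcolors.Normalize(vmin=1, vmax=10.431346)

# Frequencies below vmin, at vmin, and at vmax
below_min = mcolors.to_hex(cmap(norm(0.2)))
at_min = mcolors.to_hex(cmap(norm(1)))
at_max = mcolors.to_hex(cmap(norm(10.431346)))

print(below_min, at_min, at_max)
```

In practice this means all contact patterns under 1% share the same light pink and cannot be told apart by color alone.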


The summed frequency of the contact patterns that each account for less than 1% of the pockets:

In [72]:
df_mod[df_mod['frequency'] < 1]['frequency'].sum()

9.50335109125279

What about the below region?

In [77]:
df_mod = location_df[(location_df["is_centered"] == False) & (location_df["z_location"] == "below")]
df_mod = df_mod.groupby("tm_contacts").size().reset_index(name='count').sort_values(by='count', ascending=False)
df_mod['frequency'] = round(df_mod['count'] / df_mod['count'].sum() * 100, 1)

# Define the colormap from light pink (#ffe7f8) to magenta (#FF00FF)
cmap = mcolors.LinearSegmentedColormap.from_list("pink_purple", ["#ffe7f8", "#FF00FF"])

# Normalize frequencies so that 1 maps to the lowest color and max maps to the highest
norm = mcolors.Normalize(vmin=1, vmax=10.431346)  # vmax fixed at the maximum frequency of the above region, so both regions share one color scale

# Apply the colormap to get hex colors
df_mod['color'] = df_mod['frequency'].apply(lambda x: mcolors.to_hex(cmap(norm(x))))

df_mod[0:30]

Unnamed: 0,tm_contacts,count,frequency,color
58,tm5_tm6,524,8.7,#ff2afe
48,tm3,380,6.3,#ff65fc
63,tm7,375,6.2,#ff67fc
51,tm3_tm5,368,6.1,#ff6afc
37,tm2_tm3_tm4,322,5.4,#ff7bfb
49,tm3_tm4,318,5.3,#ff7efb
36,tm2_tm3,303,5.0,#ff85fb
1,tm1,302,5.0,#ff85fb
55,tm4,294,4.9,#ff88fb
3,tm1_tm2,293,4.9,#ff88fb


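The summed frequency of the contact patterns below 1% in the below region (computed on the frequencies rounded to one decimal):

In [78]: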
df_mod[df_mod['frequency'] < 1]['frequency'].sum()

5.2
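
The above and below cells repeat the same filter, group, and frequency steps, so they could be folded into one function. The sketch below is one possible way to do that, not the notebook's own code: `contact_frequencies` is a hypothetical helper and `toy_df` is made-up data. It keeps the frequencies unrounded so that sums over subsets are not affected by rounding.

```python
import pandas as pd

def contact_frequencies(df, z_location):
    """Frequency (%) of each tm_contacts pattern among the non-centered
    pockets of one region. Hypothetical helper, for illustration only."""
    subset = df[(~df["is_centered"]) & (df["z_location"] == z_location)]
    out = (
        subset.groupby("tm_contacts")  # rows with NaN tm_contacts are dropped
        .size()
        .reset_index(name="count")
        .sort_values(by="count", ascending=False)
    )
    out["frequency"] = out["count"] / out["count"].sum() * 100
    return out

# Toy stand-in for location_df (made-up rows, NOT the real data)
toy_df = pd.DataFrame({
    "is_centered": [False, False, False, True, False, False],
    "z_location": ["above", "above", "above", "above", "below", "above"],
    "tm_contacts": ["tm1_tm7", "tm1_tm7", "tm5", "tm1_tm7", "tm3", None],
})

above = contact_frequencies(toy_df, "above")
print(above)
```

With such a helper, the shared color scale could be derived from the data (for example the larger of the two regions' maximum frequencies) instead of being typed in by hand.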