# Conservation
Exploratory analysis of phyloP scores per region. 

In [72]:
import pandas as pd

In [73]:
df = pd.read_csv("data/interim/phylop_stats_per_region.tsv", sep="\t")
df.head(3)

Unnamed: 0,symbol,enst,region,oe_ci_hi,constraint,phylop_count,phylop_mean,phylop_median,phylop_std,phylop_sem
0,OR4F5,ENST00000641515,distal_nmd,2.288089,unconstrained,828,2.641932,2.2165,2.394205,0.083204
1,SAMD11,ENST00000616016,distal_nmd,1.655502,unconstrained,293,2.649078,1.939,3.752493,0.219223
2,SAMD11,ENST00000616016,nmd_target,1.476207,unconstrained,1989,3.247014,3.272,3.242537,0.072706


In [74]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54969 entries, 0 to 54968
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   symbol         54969 non-null  object 
 1   enst           54969 non-null  object 
 2   region         54969 non-null  object 
 3   oe_ci_hi       54964 non-null  float64
 4   constraint     54965 non-null  object 
 5   phylop_count   54969 non-null  int64  
 6   phylop_mean    54944 non-null  float64
 7   phylop_median  54944 non-null  float64
 8   phylop_std     54942 non-null  float64
 9   phylop_sem     54942 non-null  float64
dtypes: float64(5), int64(1), object(4)
memory usage: 4.2+ MB


In [75]:
df = df.dropna()
df.shape

(54935, 10)
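The `dropna()` call above removes 34 rows (54969 to 54935). Before dropping, it can help to see which columns contribute the missing values. The following is a minimal sketch on a made-up toy frame; the values are illustrative only, and only the column names are borrowed from the table above.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror the real table.
toy = pd.DataFrame(
    {
        "oe_ci_hi": [0.4, np.nan, 1.2, 0.9],
        "phylop_mean": [1.0, 2.0, np.nan, 0.5],
        "phylop_std": [0.1, 0.2, np.nan, 0.3],
    }
)

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number of rows that dropna() would remove (any NaN in the row)
rows_dropped = len(toy) - len(toy.dropna())

print(missing_per_column)
print(f"Rows dropped: {rows_dropped}")
```

Rows with several missing fields are counted once in `rows_dropped`, so the per-column counts need not sum to the number of dropped rows.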

In [76]:
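# Assign within-region percentiles to oe_ci_hi and phyloP mean
from scipy import stats


def pct(x):
    # Percentile rank (0-1) of each value within its own group
    return stats.percentileofscore(x, x, nan_policy="omit") / 100


# Invert oe_ci_hi so that lower values (stronger constraint) score higher
df["oe_ci_hi_pct"] = df.groupby(["region"])["oe_ci_hi"].transform(
    lambda x: 1 - pct(x)
)

# Higher phylop_mean_pct means stronger conservation within the region
df["phylop_mean_pct"] = df.groupby(["region"])["phylop_mean"].transform(pct)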
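To make the within-region percentiles concrete, here is a small sketch on made-up data. It swaps in pandas' `groupby(...).rank(pct=True)` for `percentileofscore`; with default settings both use the average rank of tied values divided by the group size, so the results should agree, but that equivalence is an assumption worth checking on the real data.

```python
import pandas as pd

# Toy data; the values are made up for illustration only.
toy = pd.DataFrame(
    {
        "region": ["a", "a", "a", "a", "b", "b"],
        "oe_ci_hi": [0.2, 0.4, 0.6, 0.8, 0.5, 1.5],
        "phylop_mean": [3.0, 1.0, 2.0, 4.0, 0.5, 2.5],
    }
)

grouped = toy.groupby("region")

# Fractional rank within each region (average rank / group size)
toy["phylop_mean_pct"] = grouped["phylop_mean"].rank(pct=True)

# Inverted, so the lowest oe_ci_hi in a region gets the highest score
toy["oe_ci_hi_pct"] = 1 - grouped["oe_ci_hi"].rank(pct=True)

print(toy)
```

Because percentiles are computed per region, a value of 0.9 means "top 10% among regions of the same type", not across the whole table.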

## Weakly conserved, strongly constrained regions

In [77]:
print("Size distributions of these regions: ")
df.groupby("region").phylop_count.describe()

Size distributions of these regions: 


region,count,mean,std,min,25%,50%,75%,max
distal_nmd,17977.0,383.557768,581.581861,3.0,139.0,208.0,382.0,20978.0
long_exon,3286.0,768.99787,1452.40225,3.0,142.25,350.5,858.75,34498.0
nmd_target,15603.0,1273.32936,1546.989452,3.0,414.0,881.0,1614.0,74298.0
start_proximal,18069.0,149.905806,2.587004,34.0,150.0,150.0,150.0,150.0


In [78]:
# Filter for constrained regions with weak phyloP scores

m1 = df["phylop_mean_pct"] < 0.1
m2 = df["oe_ci_hi_pct"] > 0.9
m3 = df["constraint"] == "constrained"

lo_hi = df[m1 & m2 & m3]

print(
    f"Weakly conserved and strongly constrained transcript counts per region:\n"
    f"{lo_hi.groupby('region').size()}"
)

Weakly conserved and strongly constrained transcript counts per region:
region
distal_nmd        10
long_exon          4
nmd_target         5
start_proximal     1
dtype: int64
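The three masks above can also be wrapped in a small helper so that the thresholds are explicit and easy to vary. This is only a sketch: the function name `weakly_conserved_constrained` and its parameters are hypothetical, and the toy values are made up.

```python
import pandas as pd


def weakly_conserved_constrained(df, phylop_max=0.1, oe_min=0.9):
    """Hypothetical helper: keep constrained rows with a low phyloP
    percentile and a high (inverted) oe_ci_hi percentile."""
    mask = (
        (df["phylop_mean_pct"] < phylop_max)
        & (df["oe_ci_hi_pct"] > oe_min)
        & (df["constraint"] == "constrained")
    )
    return df[mask]


# Toy frame with made-up values
toy = pd.DataFrame(
    {
        "symbol": ["A", "B", "C", "D"],
        "region": ["nmd_target", "nmd_target", "distal_nmd", "long_exon"],
        "constraint": ["constrained", "unconstrained", "constrained", "constrained"],
        "phylop_mean_pct": [0.05, 0.02, 0.50, 0.09],
        "oe_ci_hi_pct": [0.95, 0.99, 0.97, 0.91],
    }
)

hits = weakly_conserved_constrained(toy)
counts = hits.groupby("region").size()
print(counts)
```

Here only rows A and D pass all three conditions: B is unconstrained and C is not weakly conserved.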


In [79]:
lo_hi

Unnamed: 0,symbol,enst,region,oe_ci_hi,constraint,phylop_count,phylop_mean,phylop_median,phylop_std,phylop_sem,oe_ci_hi_pct,phylop_mean_pct
4802,CFH,ENST00000367429,nmd_target,0.403852,constrained,3293,1.420477,0.624,3.878535,0.067588,0.918413,0.087035
4849,PTPRC,ENST00000442510,nmd_target,0.350835,constrained,3445,0.631048,0.592,5.87396,0.100077,0.940524,0.044543
5042,MDM4,ENST00000367182,distal_nmd,0.349602,constrained,617,0.829961,0.56,1.518276,0.061123,0.991545,0.094176
5905,OR2T4,ENST00000366473,distal_nmd,0.587501,constrained,810,0.802714,0.8595,1.405646,0.049389,0.960561,0.091895
7486,CRACDL,ENST00000397899,long_exon,0.47452,constrained,1281,0.594703,0.401,2.780605,0.07769,0.904443,0.083384
15150,ICE1,ENST00000296564,long_exon,0.483611,constrained,4401,0.133184,-0.053,3.176849,0.047887,0.901096,0.043822
17047,TCOF1,ENST00000643257,nmd_target,0.201935,constrained,4132,1.300442,0.875,2.701502,0.042027,0.983529,0.078126
23463,USP17L2,ENST00000333796,start_proximal,0.355923,constrained,150,-0.50086,-0.313,1.218451,0.099486,0.999889,0.005645
27234,PPP1R26,ENST00000356818,distal_nmd,0.542464,constrained,3477,0.52795,0.337,3.180275,0.053934,0.968404,0.072704
28684,FAS,ENST00000652046,nmd_target,0.31273,constrained,501,0.414747,0.037,3.606344,0.16112,0.953983,0.035506


Many regions with strong `oe_ci_hi` scores are poorly covered and carry an "indeterminate" constraint annotation.
It is therefore sensible to filter for constrained regions only, as the `constraint == "constrained"` mask above does.
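One way to check the point about annotations is to tabulate the `constraint` label among regions that pass the `oe_ci_hi_pct` threshold alone. The sketch below uses a made-up toy frame; the "indeterminate" label is taken from the note above, and the counts are illustrative only.

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame(
    {
        "constraint": [
            "constrained",
            "indeterminate",
            "indeterminate",
            "unconstrained",
            "constrained",
        ],
        "oe_ci_hi_pct": [0.95, 0.98, 0.92, 0.10, 0.40],
    }
)

# Regions in the top decile of the inverted oe_ci_hi percentile
top = toy[toy["oe_ci_hi_pct"] > 0.9]

# How those regions are annotated
annotation_counts = top["constraint"].value_counts()
print(annotation_counts)
```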

One example is the last exon of NANOG (ENST00000229307, OE95=0.43, mean phyloP=0.2).
This appears to be a mammalian expansion of the CDS. 
The highly conserved homeodomains occur in two smaller internal exons.
A quick scan of the literature suggests that the C-terminal domain of human NANOG contains two highly potent transactivating domains, although these lack any homology or structural resemblance to known domains and are poorly modelled by AlphaFold.

Immune genes, e.g. CFH (ENST00000367429) and PTPRC (ENST00000442510)?

The long-exon region of ICE1 is not conserved.

The central exons of TCOF1 (Treacher Collins syndrome) are poorly conserved.

The start-proximal region of USP17L2 is strongly constrained but weakly conserved.

Four ZNF proteins have strong distal constraint but weak conservation.
(But they don't look particularly interesting.)

## Conclusion
Of the genes prioritised with this approach, NANOG looks the most interesting.