# Constraint in paralogs
Exploring differences in constraint between paralogous genes.

In [11]:
import pandas as pd

FILE_IN = "data/interim/paralogs_delta_constraint.tsv"

df = pd.read_csv(FILE_IN, sep="\t")
df.sample(3)

Unnamed: 0,paralog_family_id,region,paralogs,most_constrained,least_constrained,oe95_min,oe95_max,oe95_delta
230,410,nmd_target,"STRN,STRN3,STRN4",STRN3,STRN,0.431526,0.654431,0.222905
446,784,distal_nmd,"SCAF8,SCAF4",SCAF8,SCAF4,0.401902,0.413312,0.01141
330,566,transcript,"BOLL,DAZL",DAZL,BOLL,0.290006,0.871481,0.581475
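
The sampled rows suggest that `oe95_delta` is simply `oe95_max - oe95_min` within each family and region. A quick sanity check of that assumption, sketched here on the three rows shown above rather than on the full file (the name `check` is illustrative):

```python
import numpy as np
import pandas as pd

# The three sampled rows shown above, retyped by hand for a self-contained check.
check = pd.DataFrame(
    {
        "paralog_family_id": [410, 784, 566],
        "region": ["nmd_target", "distal_nmd", "transcript"],
        "oe95_min": [0.431526, 0.401902, 0.290006],
        "oe95_max": [0.654431, 0.413312, 0.871481],
        "oe95_delta": [0.222905, 0.011410, 0.581475],
    }
)

# Assumption: delta is defined as max minus min; allow for rounding in the display.
recomputed = check["oe95_max"] - check["oe95_min"]
delta_ok = np.isclose(recomputed, check["oe95_delta"], atol=1e-6)
print(delta_ok.all())
```

Running the same comparison on `df` itself would confirm whether the relationship holds for every row.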


## To do
- [ ] Plot the distribution of OE95 deltas per region
- [X] Manually explore the ~5 largest oe95 deltas per region
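
As a possible first step towards the open plotting item, the per-region distribution can be summarised numerically before any figure is drawn. This is only a sketch on a toy frame (`toy` is a stand-in for `df`, using a few delta values copied from rows displayed in this notebook):

```python
import pandas as pd

# Toy stand-in for `df`; the real frame has many more rows and regions.
toy = pd.DataFrame(
    {
        "region": ["transcript", "transcript", "start_proximal", "start_proximal"],
        "oe95_delta": [0.581475, 1.573865, 2.700866, 2.060203],
    }
)

# Per-region count, mean, spread and quartiles of the deltas.
summary = toy.groupby("region")["oe95_delta"].describe()
print(summary[["count", "min", "50%", "max"]])
```

On the real data, `df.boxplot(column="oe95_delta", by="region")` should give the corresponding plot, assuming matplotlib is installed.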


In [12]:
df[df.groupby("region")["oe95_delta"].rank(ascending=False) <= 5].sort_values(["region","oe95_delta"], ascending=False)

Unnamed: 0,paralog_family_id,region,paralogs,most_constrained,least_constrained,oe95_min,oe95_max,oe95_delta
591,1018,transcript,"GOLGA2,GOLGA8S,GOLGA8F,GOLGA8M,GOLGA8J,GOLGA8T...",GOLGA6A,GOLGA8F,0.250797,7.40487,7.154073
582,1004,transcript,"SMN2,SMN1,SMNDC1",SMNDC1,SMN2,0.469583,2.716513,2.24693
776,1384,transcript,"DDI2,NRIP3,DDI1,NRIP2,UBAC2",DDI2,NRIP3,0.287987,2.090458,1.802471
324,564,transcript,"SIRPB2,SIRPD,SIRPB1,SIRPG,SIRPA",SIRPA,SIRPD,0.338398,1.912263,1.573865
559,982,transcript,"GJC2,GJD2,GJD3,GJC1",GJC1,GJD3,0.406202,1.922164,1.515962
487,865,start_proximal,"TAF5L,TAF5",TAF5L,TAF5,0.545477,3.246342,2.700866
566,989,start_proximal,"AP2A2,AP1G2,AP4E1,AP1G1,AP2A1",AP1G1,AP1G2,0.476252,2.981643,2.505391
310,547,start_proximal,"RFX8,RFX6,RFX3,RFX4,RFX2,RFX1",RFX3,RFX8,0.326872,2.553914,2.227042
318,559,start_proximal,"HCRTR1,NPFFR2,QRFPR,HCRTR2,NPFFR1",HCRTR2,NPFFR1,0.565018,2.632959,2.067941
200,373,start_proximal,"USP17L17,USP17L18,USP17L7,USP17L2",USP17L2,USP17L17,0.355923,2.416125,2.060203
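
The `rank`-based filter above uses pandas' default tie handling (`method="average"`), so tied deltas can return slightly more or fewer than five rows for a region. If a fixed number of rows per region is wanted, one alternative is to sort and take the head of each group. A minimal sketch on toy data (the helper name `top_n_per_region` is illustrative and not part of the original notebook):

```python
import pandas as pd


def top_n_per_region(frame: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the n rows with the largest oe95_delta within each region."""
    ordered = frame.sort_values(["region", "oe95_delta"], ascending=False)
    # head(n) keeps the first n rows of each group in the sorted order.
    return ordered.groupby("region").head(n)


# Toy data with a tie in region "a".
toy = pd.DataFrame(
    {
        "region": ["a", "a", "a", "b", "b"],
        "oe95_delta": [0.1, 0.5, 0.5, 0.2, 0.9],
    }
)
top = top_n_per_region(toy, n=2)
print(top)
```

On the real data this would be called as `top_n_per_region(df, n=5)`.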


## Initial interpretation
Some of these results are garbage.  
For example, the golgin and SMN families have large constraint deltas, but their members are such close paralogs that they are very poorly covered.
The OE95 scores in these cases are meaningless.
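
One way to keep such families out of downstream summaries is to flag any family containing a gene from a manually curated exclusion list. The sketch below assumes the `paralogs` column is a comma-separated string, as in the output above. The set `POORLY_COVERED` is a hypothetical example, not a vetted list. Because the displayed `paralogs` value for the golgin family is truncated, the flag should be computed on the full file rather than on the printed table.

```python
import pandas as pd

# Hypothetical exclusion list; extend after checking coverage for each gene.
POORLY_COVERED = {"SMN1", "SMN2"}

# Toy rows copied from the output above.
toy = pd.DataFrame(
    {
        "paralog_family_id": [1004, 1384],
        "paralogs": ["SMN2,SMN1,SMNDC1", "DDI2,NRIP3,DDI1,NRIP2,UBAC2"],
    }
)

# Split the comma-separated gene list and test for overlap with the exclusion set.
toy["low_coverage_flag"] = toy["paralogs"].str.split(",").apply(
    lambda genes: not POORLY_COVERED.isdisjoint(genes)
)
print(toy[["paralog_family_id", "low_coverage_flag"]])
```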

### Transcript level
**DDI1, DDI2, NRIP2, NRIP3, UBAC2** are a nicer example.  
DDI2 is highly constrained (LOEUF 0.22). 
DDI1 is not (LOEUF 1.34).
The MRCA of DDI1 and DDI2 is shared with marsupials.
The other genes are more distant paralogs.

**SIRPA, SIRPB1, SIRPB2, SIRPD, SIRPG** are also nice.
They show a pattern of progressive duplications.

| Gene | MRCA | Identity to SIRPA | LOEUF |
|-|-|-|-|
| SIRPA | - | 100% | 0.31 |
| SIRPB1 | Apes | 78% | 1.19 |
| SIRPG | Simians | 73% | 1.36 |
| SIRPD | Placentals | 39% | 1.61 |
| SIRPB2 | Amniotes | 25% | 0.95 |

### Start proximal
**TAF5L vs TAF5** is a nice example.  
Both have strong transcript-level constraint (LOEUF 0.22 and 0.37 respectively).
They have reasonable sequence identity. TAF5L is the smaller gene and has 40% sequence identity to TAF5.
The MRCA is in bilateral animals.
TAF5L is specifically highly constrained in the start proximal region, and in particular the highly expressed first ~140nt of the first exon.
These early nucleotides have especially poor sequence identity with TAF5, have weak phyloP scores, and do not overlap known Pfam domains.

**AP1G1 & AP1G2** are fairly interesting.  
They show very different constraint, and AP1G2 is more lowly expressed by pext.
Fair sequence identity, especially in the first 150nt.
But AP1G1 shows very strong start-proximal constraint, in contrast with AP1G2. 
Both share an alpha adaptin C2 domain which extends from the first coding exon.

The **RFX1 - RFX4** family shows a nice story as well.
(Paralog family ID 547)

### Distal
**FSCN1 and FSCN3** are fairly interesting.
FSCN1 is reasonably constrained (LOEUF 0.51), and similarly in distal regions.  
It has limited sequence identity to FSCN3 (30%).  
One of its fascin domains is encoded in the last exon.  
By contrast, FSCN3 has a poorly conserved last exon which does not encode any Pfam domains.

### Long exon
**ASXL1 vs ASXL3** is interesting.
Both are associated with severe NDD.  
They have limited sequence similarity.
ASXL3 has a distinctive pattern of constraint which marries with the distribution of pathogenic LoFs.
ASXL1 is the inverse. It is apparently enriched for LoFs precisely in those regions.
This is because those LoFs, as somatic variants, drive haematopoietic clonal expansion.
In the germline they cause an NDD.  
It's a neat example! (But not showing anything "new").

**PROX1 and PROX2** show markedly different constraint.

## Conclusions
I don't think there are any startling findings from this first look at the data. There may be further interesting examples to find in a deeper dive, and these statistics are useful to have. They could be added as supplementary information.