# Pipeline comparison
Ours versus the paper's, on the structures from the paper (apo_holo.dat)
- filtering of structures
- holo/apo determination
- isoform determination
- creating pairs based on isoform
- (the comparison of analysis results is in another ipynb; that part is already resolved)

In [11]:
import os
from pathlib import Path

import numpy as np
import pandas as pd

from apo_holo_structure_stats.paper_repl.main import get_paper_apo_holo_dataframe

PIPELINE_OUTPUT_FILE = Path('../../o_isoform_paper_residue_holo_fix.json')

paper_pairs = get_paper_apo_holo_dataframe()
paper_pairs


Unnamed: 0,domain_count,ligand_codes,apo_chain_id,holo_chain_id,apo_pdb_code,holo_pdb_code
0,1,SUC;SUC;,R,P,1a0s,1a0t
1,1,SES;,A,A,1a3y,1dzj
2,1,BEZ;,B,A,1a7u,1a8u
3,2,NGA;NGA;,_,A,1a8d,1d0h
4,2,OXL;,_,A,1a8f,1ryo
...,...,...,...,...,...,...
516,1,BHC;,D,A,4pgm,1bq4
517,1,5AS;,_,A,6rhn,1rzy
518,2,NAG-MAN;,_,A,6taa,2guy
519,2,GOL;GOL;GOL;,D,B,7req,1req


## Subselect only the chains of the structures used in the paper.

In [12]:


chains = pd.read_json(PIPELINE_OUTPUT_FILE)
# drop duplicates, as in the paper a-h pairs there are no duplicates of (pdb_code, chain_id)
# But the outputs of the pipeline contain duplicates, as the only
chains = chains.drop_duplicates()

# resolve it like this: for every underscore, find the matching structure in the output; there should be exactly 1, I think
for i, row in enumerate(paper_pairs.itertuples()):
    row = row._asdict()  # convert to dict, so we can easily index columns with strings

    for apo_or_holo in ('apo', 'holo'):
        if row[f'{apo_or_holo}_chain_id'] != '_':
            continue

        hits = chains[chains.pdb_code == row[f'{apo_or_holo}_pdb_code']]
        if len(hits) != 1:
            print('error', row[f'{apo_or_holo}_pdb_code'])
            raise RuntimeError()

        # change '_' into actual chain id, so we can do merges with pd
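        # (single positional assignment; the chained `df[col].iat[i] = ...` form may not write through to the frame)
        paper_pairs.iat[i, paper_pairs.columns.get_loc(f'{apo_or_holo}_chain_id')] = hits['chain_id'].iat[0]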

# concat apo and holo columns from pairs to a single `paper_chains` df
paper_chains = []
for apo_or_holo in ('apo', 'holo'):
    c1, c2 = f'{apo_or_holo}_pdb_code', f'{apo_or_holo}_chain_id'
    paper_chains.append(
        paper_pairs[[c1, c2]].rename(columns={c1: 'pdb_code', c2: 'chain_id'}))
paper_chains = pd.concat(paper_chains)
paper_chains = paper_chains.set_index(['pdb_code', 'chain_id'], verify_integrity=True)

chains = chains.merge(paper_chains, left_on=['pdb_code', 'chain_id'], right_index=True)
# todo: merging this way does not work; for some reason I get 2 dupes
chains = chains.merge(paper_pairs, how='left', left_on=['pdb_code', 'chain_id'], right_on=['apo_pdb_code', 'apo_chain_id'])

chains = chains.merge(paper_pairs, how='left', left_on=['pdb_code', 'chain_id'], right_on=['holo_pdb_code', 'holo_chain_id'])
# drop chains that were not in paper dataset
chains = chains.dropna(subset=['apo_pdb_code_x', 'apo_pdb_code_y'], how='all')
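The todo in the cell above mentions unexpected duplicates during the merge. A minimal sketch of one way to look for them, on a made-up three-row frame (the values are invented; only the `pdb_code`, `chain_id` and `isoform` column names come from the notebook): `drop_duplicates()` without `subset` keeps rows that share a key but differ in any other column, so checking the key columns directly can surface them.

```python
import pandas as pd

# Toy stand-in for the pipeline output; the rows are hypothetical.
chains = pd.DataFrame({
    'pdb_code': ['1abc', '1abc', '2xyz'],
    'chain_id': ['A', 'A', 'B'],
    'isoform': ['P1', 'P1-2', 'P2'],
})
key = ['pdb_code', 'chain_id']

# Whole-row deduplication: nothing is dropped, because the rows differ in `isoform`.
deduped = chains.drop_duplicates()

# Key-level check: list every row whose (pdb_code, chain_id) occurs more than once.
dupes = deduped[deduped.duplicated(subset=key, keep=False)]
print(dupes)
```

Whether this is what produces the 2 dupes in this notebook is not confirmed; the sketch only shows how to list such rows.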



##### Some chains are missing an isoform - print and remove them:

In [13]:
print()
print(len(chains))
print(chains[chains.isoform.isna()])
chains = chains.dropna(subset=['isoform'])  # will drop those without isoform field
print(len(chains))


1041
    pdb_code                                               path chain_id  \
18      1ap2  apo_holo_structure_stats/paper_repl/pdb_struct...        C   
232     1i3v  apo_holo_structure_stats/paper_repl/pdb_struct...        B   
233     1i3u  apo_holo_structure_stats/paper_repl/pdb_struct...        A   
303     1seo  apo_holo_structure_stats/paper_repl/pdb_struct...        B   
518     1r9e  apo_holo_structure_stats/paper_repl/pdb_struct...        B   
752     2gdb  apo_holo_structure_stats/paper_repl/pdb_struct...        A   

     is_holo isoform  domain_count_x                        ligand_codes_x  \
18     False    None             1.0  VAL-GLN-GLU-ALA-LEU-ASP-LYS-ARG-GLY;   
232    False    None             1.0                                  RR1;   
233     True    None             NaN                                   NaN   
303     True    None             NaN                                   NaN   
518    False    None             1.0                                  G

The first 3 structures are immunoglobulin variable chains (the holo structure paired with 1ap2 does have a UNP, though).
The remaining 3 are obsolete PDB structures (we probably won't process those, so that is OK).


## Compare the apo/holo classification (structure by structure)
- FP, FN, TP, TN
<a id='is_holo_classification'></a>

In [14]:

print('apo', np.sum(chains.apo_pdb_code_y.isna()))
print('holo', np.sum(chains.apo_pdb_code_x.isna()))
if 'is_holo_paper' not in chains.columns:
    # check, (so multiple runs of this cell are allowed)
    chains.insert(loc=3, column='is_holo_paper', value=chains.apo_pdb_code_x.isna())

# feed is_holo into sklearn metrics
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(chains.is_holo_paper, chains.is_holo, digits=3))
tn, fp, fn, tp = confusion_matrix(chains.is_holo_paper, chains.is_holo).ravel()  # line from sklearn docs
print('TP+TN', tp + tn)
print('FP', fp)
print('FN', fn)


chains[chains.is_holo_paper != chains.is_holo]

apo 517
holo 518
              precision    recall  f1-score   support

       False      0.956     0.961     0.959       517
        True      0.961     0.956     0.958       518

    accuracy                          0.958      1035
   macro avg      0.958     0.958     0.958      1035
weighted avg      0.958     0.958     0.958      1035

TP+TN 992
FP 20
FN 23
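As a cross-check, the figures in the report can be recomputed by hand from the printed counts alone. The sketch below assumes nothing beyond the numbers shown in the output above (FP, FN and the two support values) and derives TP and TN from them.

```python
# Counts taken from the output above.
fp, fn = 20, 23
support_apo, support_holo = 517, 518  # paper labels: False (apo), True (holo)

tp = support_holo - fn  # paper-holo chains that we also classify as holo
tn = support_apo - fp   # paper-apo chains that we also classify as apo

accuracy = (tp + tn) / (support_apo + support_holo)
precision_holo = tp / (tp + fp)
recall_holo = tp / (tp + fn)

print(tp, tn, tp + tn)
print(f'accuracy {accuracy:.3f}, precision {precision_holo:.3f}, recall {recall_holo:.3f}')
```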


Unnamed: 0,pdb_code,path,chain_id,is_holo_paper,is_holo,isoform,domain_count_x,ligand_codes_x,apo_chain_id_x,holo_chain_id_x,apo_pdb_code_x,holo_pdb_code_x,domain_count_y,ligand_codes_y,apo_chain_id_y,holo_chain_id_y,apo_pdb_code_y,holo_pdb_code_y
45,1c60,apo_holo_structure_stats/paper_repl/pdb_struct...,A,True,False,P00720,,,,,,,2.0,BME;,A,A,1c62,1c60
102,1f4v,apo_holo_structure_stats/paper_repl/pdb_struct...,A,True,False,P0AE67,,,,,,,1.0,MET-GLY-ASP-SER-ILE-LEU-SER-GLN-ALA-GLU-ILE-AS...,A,A,1eay,1f4v
107,1ehd,apo_holo_structure_stats/paper_repl/pdb_struct...,A,False,True,Q9S7B3,2.0,NAG;NAG;,A,A,1ehd,1ehh,,,,,,
113,1eoa,apo_holo_structure_stats/paper_repl/pdb_struct...,A,True,False,P20371,,,,,,,1.0,CYN;,A,A,1eo2,1eoa
114,1eoa,apo_holo_structure_stats/paper_repl/pdb_struct...,B,True,False,P20372,,,,,,,1.0,CYN;,B,B,1eo2,1eoa
134,2et1,apo_holo_structure_stats/paper_repl/pdb_struct...,A,True,False,P45850,,,,,,,2.0,GLV;,A,A,1fi2,2et1
185,1xz1,apo_holo_structure_stats/paper_repl/pdb_struct...,A,True,False,P02791,,,,,,,1.0,HLT;,A,A,1gwg,1xz1
210,1hjs,apo_holo_structure_stats/paper_repl/pdb_struct...,D,False,True,P83692,1.0,PEG;TRS;,D,A,1hjs,1hju,,,,,,
215,1hnu,apo_holo_structure_stats/paper_repl/pdb_struct...,A,True,False,Q05871,,,,,,,1.0,REO-EDO;,A,A,1hno,1hnu
226,1i1d,apo_holo_structure_stats/paper_repl/pdb_struct...,C,False,True,P43577,1.0,ACO;,C,C,1i1d,1i12,,,,,,


##### 20 False positives (not is_holo_paper)
1vjm also has retinal, in LPC too: https://oca.weizmann.ac.il/oca-bin/Vcofc.cgi?num=1&PDB_ID=1VJM&XID=01319800001637772101
- it has 15 contacts, 10 of them destabilizing (hydrophilic + hydrophobic)
- -> they probably subtract those

1nmc has 5 contacts according to LPC https://oca.weizmann.ac.il/oca-bin/Vcofc.cgi?num=15&PDB_ID=1NMC&XID=01476500001637772518

1n13 also has quite a few contacts, though not all of them are stabilizing https://oca.weizmann.ac.il/oca-bin/Vcofc.cgi?num=9&PDB_ID=1N13&XID=01552300001637772752


LPC does not work with peptide ligands (it returns 404, or the peptide is not listed among the ligands, so they probably handled peptides differently?)

##### 23 False negatives
- these include small molecules with < 6 atoms (which I skip, as they wrote?!), e.g. cyanide (CYN)
- 5 peptides




## Compare the UniProt isoform groups (ideally all of size 2)


In [15]:
chains2 = chains.set_index(['isoform', 'pdb_code', 'chain_id'], verify_integrity=True)
# chains[]
isoform_groups = chains2.groupby(level='isoform')
chains2['isoform_group_size'] = isoform_groups['path'].transform(len)  # todo: can row counts be obtained without having to pick some (arbitrary) column such as `path`?
# by doing transform, I don't have to do a merge (it preserves the df len)
# apply won't work - result won't be broadcasted to original df length
isoform_group_sizes = isoform_groups.size()
print(len(isoform_group_sizes[isoform_group_sizes != 2]))
isoform_group_sizes[isoform_group_sizes != 2]


24


isoform
O68720      1
O83008      1
P00257      1
P00257-2    1
P00766      4
P01869      1
P03367      1
P04585      1
P07254      1
P09012      1
P0A8M3      4
P15273      1
P36655      1
P37595      1
P42212      4
P58162      1
P62508      1
P62509      1
P69178      1
P69179      1
Q569W9      1
Q6KB05      1
Q8GEZ8      1
Q9CAQ2      1
dtype: int64

Three groups of 4 structures and 21 groups of one structure
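Regarding the todo about getting row counts without selecting an arbitrary column: one possible approach, sketched here on an invented four-row frame that only borrows the index level names from the notebook, is to take `groupby(...).size()` and map it back through the `isoform` index level.

```python
import pandas as pd

# Toy frame with the same index layout as `chains2`; the rows are hypothetical.
df = pd.DataFrame(
    {'path': ['a', 'b', 'c', 'd']},
    index=pd.MultiIndex.from_tuples(
        [('P1', '1abc', 'A'), ('P1', '1abd', 'A'), ('P2', '2xyz', 'B'), ('P1', '1abe', 'C')],
        names=['isoform', 'pdb_code', 'chain_id'],
    ),
)

# One size per isoform group, with no column selected.
sizes = df.groupby(level='isoform').size()

# Broadcast the sizes back to the original rows via the index level.
df['isoform_group_size'] = df.index.get_level_values('isoform').map(sizes).to_numpy()
print(df)
```

This should give the same column as `transform(len)` on `path`; it just avoids naming a data column.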

In [16]:
chains2[chains2.isoform_group_size != 2]


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,path,is_holo_paper,is_holo,domain_count_x,ligand_codes_x,apo_chain_id_x,holo_chain_id_x,apo_pdb_code_x,holo_pdb_code_x,domain_count_y,ligand_codes_y,apo_chain_id_y,holo_chain_id_y,apo_pdb_code_y,holo_pdb_code_y,isoform_group_size
isoform,pdb_code,chain_id,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
Q6KB05,1mvu,A,apo_holo_structure_stats/paper_repl/pdb_struct...,True,True,,,,,,,1.0,VAL-GLN-GLU-ALA-LEU-ASP-LYS-ARG-GLY;,C,A,1ap2,1mvu,1
P69178,1c48,E,apo_holo_structure_stats/paper_repl/pdb_struct...,False,False,1.0,GAL-GLC;GAL-GLC;,E,A,1c48,1cqf,,,,,,,1
P69179,1cqf,A,apo_holo_structure_stats/paper_repl/pdb_struct...,True,True,,,,,,,1.0,GAL-GLC;GAL-GLC;,E,A,1c48,1cqf,1
P00766,1cgj,E,apo_holo_structure_stats/paper_repl/pdb_struct...,False,False,1.0,CYS-GLY-VAL-PRO-ALA-ILE-GLN-LEU;,E,B,1cgj,1ab9,,,,,,,4
P00766,1ab9,B,apo_holo_structure_stats/paper_repl/pdb_struct...,True,True,,,,,,,1.0,CYS-GLY-VAL-PRO-ALA-ILE-GLN-LEU;,E,B,1cgj,1ab9,4
P00766,1ab9,C,apo_holo_structure_stats/paper_repl/pdb_struct...,True,True,,,,,,,1.0,THR-PRO-GLY-VAL-TYR;CYS-GLY-VAL-PRO-ALA-ILE-GL...,B,C,1gl1,1ab9,4
P00257-2,1e6e,B,apo_holo_structure_stats/paper_repl/pdb_struct...,False,False,1.0,RUA;,B,B,1e6e,2bt6,,,,,,,1
P00257,2bt6,B,apo_holo_structure_stats/paper_repl/pdb_struct...,True,True,,,,,,,1.0,RUA;,B,B,1e6e,2bt6,1
O83008,1edq,A,apo_holo_structure_stats/paper_repl/pdb_struct...,False,False,3.0,NAA-AMI;,A,A,1edq,1ffq,,,,,,,1
P07254,1ffq,A,apo_holo_structure_stats/paper_repl/pdb_struct...,True,True,,,,,,,3.0,NAA-AMI;,A,A,1edq,1ffq,1


## Compare the number of pairs that came out the same (the rest are pairs that were not formed because both chains are apo or both are holo -> counts)

In [17]:
# for each group (size either == 2 or > 2), make A-H pairs
# > 2: I may also get some FPs relative to the paper
# == 2: I will only have missing pairs (because both are holo/apo, or I don't consider the structure)
# actually, the groups could in theory differ from the paper's, so I could get FPs and FNs there too (but there probably won't be any)

def make_pairs(isoform_group):
    df = isoform_group
    apo = df[~df.is_holo]
    holo = df[df.is_holo]
    return apo.reset_index().merge(holo.reset_index(), how='cross', suffixes=('_apo', '_holo'))

# groups[apo] groups[holo] crossproduct
pairs = isoform_groups.apply(make_pairs).set_index(['isoform_apo', 'pdb_code_apo', 'chain_id_apo',
                                                    'isoform_holo', 'pdb_code_holo', 'chain_id_holo'],
                                                   verify_integrity=True)
pairs

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,path_apo,is_holo_paper_apo,is_holo_apo,domain_count_x_apo,ligand_codes_x_apo,apo_chain_id_x_apo,holo_chain_id_x_apo,apo_pdb_code_x_apo,holo_pdb_code_x_apo,domain_count_y_apo,...,holo_chain_id_x_holo,apo_pdb_code_x_holo,holo_pdb_code_x_holo,domain_count_y_holo,ligand_codes_y_holo,apo_chain_id_y_holo,holo_chain_id_y_holo,apo_pdb_code_y_holo,holo_pdb_code_y_holo,isoform_group_size_holo
isoform_apo,pdb_code_apo,chain_id_apo,isoform_holo,pdb_code_holo,chain_id_holo,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1
A0A0H2US34,2j1r,A,A0A0H2US34,2j1s,A,apo_holo_structure_stats/paper_repl/pdb_struct...,False,False,1.0,FUL;,A,A,2j1r,2j1s,,...,,,,1.0,FUL;,A,A,2j1r,2j1s,2
A0A0K0K1A5,2bnu,B,A0A0K0K1A5,2bnq,E,apo_holo_structure_stats/paper_repl/pdb_struct...,False,False,2.0,SER-LEU-LEU-MET-TRP-ILE-THR-GLN-VAL;,B,E,2bnu,2bnq,,...,,,,2.0,SER-LEU-LEU-MET-TRP-ILE-THR-GLN-VAL;,B,E,2bnu,2bnq,2
B0R5M0,2cc9,A,B0R5M0,2cc8,A,apo_holo_structure_stats/paper_repl/pdb_struct...,False,False,1.0,RBF;,A,A,2cc9,2cc8,,...,,,,1.0,RBF;,A,A,2cc9,2cc8,2
O06553,2aq6,B,O06553,1y30,A,apo_holo_structure_stats/paper_repl/pdb_struct...,False,False,1.0,FMN;,B,A,2aq6,1y30,,...,,,,1.0,FMN;,B,A,2aq6,1y30,2
O06644,1p5h,B,O06644,1p5r,A,apo_holo_structure_stats/paper_repl/pdb_struct...,False,False,4.0,COA;,B,A,1p5h,1p5r,,...,,,,4.0,COA;,B,A,1p5h,1p5r,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Q9X9I8,2fn0,A,Q9X9I8,2fn1,A,apo_holo_structure_stats/paper_repl/pdb_struct...,False,False,3.0,SAL;,A,A,2fn0,2fn1,,...,,,,3.0,SAL;,A,A,2fn0,2fn1,2
Q9Y275,1oqe,F,Q9Y275,1kxg,A,apo_holo_structure_stats/paper_repl/pdb_struct...,False,False,1.0,CIT;,F,A,1oqe,1kxg,,...,,,,1.0,CIT;,F,A,1oqe,1kxg,2
Q9YE81,1xgv,A,Q9YE81,1tyo,A,apo_holo_structure_stats/paper_repl/pdb_struct...,False,False,3.0,ENP;,A,A,1xgv,1tyo,,...,,,,3.0,ENP;,A,A,1xgv,1tyo,2
Q9Z214,1i2h,A,Q9Z214,1ddv,A,apo_holo_structure_stats/paper_repl/pdb_struct...,False,False,1.0,THR-PRO-PRO-SER-PRO-PHE;,A,A,1i2h,1ddv,,...,,,,1.0,THR-PRO-PRO-SER-PRO-PHE;,A,A,1i2h,1ddv,2
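To make the pairing step concrete, here is a small sketch of what the cross merge in `make_pairs` does for a single group. The four rows are invented; `merge(how='cross')` needs pandas 1.2 or newer.

```python
import pandas as pd

# One hypothetical isoform group with two apo and two holo chains.
group = pd.DataFrame({
    'pdb_code': ['1aaa', '1bbb', '1ccc', '1ddd'],
    'chain_id': ['A', 'A', 'B', 'B'],
    'is_holo': [False, False, True, True],
})

apo = group[~group.is_holo]
holo = group[group.is_holo]

# Cartesian product: every apo chain is paired with every holo chain.
pairs = apo.merge(holo, how='cross', suffixes=('_apo', '_holo'))
print(pairs[['pdb_code_apo', 'pdb_code_holo']])
```

A group of size 2 therefore yields one pair if it has one apo and one holo chain, and none if both chains got the same label; a 2+2 group yields four pairs.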


##### Classification report

In [18]:
# count the pairs
# there will be no FP pairs (because our dataset is defined by the paper pairs to begin with), except possibly in groups larger than 2
pairs2 = pairs[pairs.isoform_group_size_apo == 2]
true_positive_pair_indices = (~pairs2.is_holo_paper_apo) & (pairs2.is_holo_paper_holo)
true_positive_pairs = pairs2[true_positive_pair_indices]
# (out of curiosity, check whether any FP pairs were formed (it makes sense that none were))
assert len(pairs2[~true_positive_pair_indices]) == 0

true_positive_pairs_count = 3 + len(true_positive_pairs)  # +3 are in 4-groups, see one of the bottom cells

# print the difference between the number of pairs in the paper and the number we found; what percentage should it be at most?
print(f'paper pairs count {len(paper_pairs)}')
print(f'true positive pairs count {true_positive_pairs_count}')
print(f'recall {true_positive_pairs_count/ len(paper_pairs):.3f}')

# What is the expected recall (lower bound)?
# 1 struct was skipped due to being a protein-nucleic acid complex
# 6 structs without isoform information (api 404)
# max. 33 - 12 + 4 pairs not processed (groups size != 2)
# accuracy on is_holo 0.958 in 1035 structs
tp_pairs_lower_bound = 521 - 1 - 6 - (33-3*4) - ((1 - 0.958) * 1035)
print(f'recall expected at least {tp_pairs_lower_bound/len(paper_pairs):.3f}')

paper pairs count 521
true positive pairs count 462
recall 0.887
recall expected at least 0.863
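The lower bound can be unpacked term by term. The sketch below only restates the arithmetic of the cell above with named quantities (the variable names are mine); it reads the bound as a worst case in which every lost or misclassified chain costs one distinct paper pair.

```python
paper_pairs_count = 521
found_pairs = 462

skipped_nucleic_acid = 1               # protein-nucleic acid complex
missing_isoform = 6                    # chains for which the API returned no isoform
chains_outside_2_groups = 33           # chains in groups of size != 2
chains_in_4_groups = 3 * 4             # the three 4-groups
one_groups = chains_outside_2_groups - chains_in_4_groups
misclassified = (1 - 0.958) * 1035     # is_holo errors, about 43.5 chains

lower_bound = (paper_pairs_count - skipped_nucleic_acid - missing_isoform
               - one_groups - misclassified)

print(f'recall {found_pairs / paper_pairs_count:.3f}')
print(f'recall expected at least {lower_bound / paper_pairs_count:.3f}')
```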


So what remains for me to examine:
a) the three 4-groups and
b) the 33-12 (=21) 1-groups

### The three 4-groups
P00766, P0A8M3, P42212

I'll try feeding the three 4-groups into run_analyses.

Why did they actually end up with two pairs (two sequences) with the same UNP id, when they clustered at 35 % identity?
- hopefully I'll see once I compute the LCS




In [19]:
import logging
import importlib
import apo_holo_structure_stats.paper_repl.main
importlib.reload(apo_holo_structure_stats.paper_repl.main)
from apo_holo_structure_stats.pipeline.run_analyses import JSONAnalysisSerializer
from apo_holo_structure_stats.paper_repl.main import process_pair

logging.root.setLevel(logging.WARNING)

# select isoform_group_size > 2 and feed them into run_analyses (process_pair)
df = pairs[pairs.isoform_group_size_apo > 2]

analyses_fname = 'output_three_large_groups.json'
analyses_serializer = JSONAnalysisSerializer(analyses_fname)
_domains_info = []

for row in df.itertuples():
    print('processing', row.Index)  # todo: I should have used reset_index, but I hoped row.Index would be a namedtuple too
    index = row.Index  # ('P00766', '1cgj', 'E', 'P00766', '1ab9', 'B')
    process_pair(index[1], index[4], index[2], index[5], analyses_serializer, _domains_info)

# print(analyses_serializer.data)
# analyses_serializer.dump_data()


processing ('P00766', '1cgj', 'E', 'P00766', '1ab9', 'B')
processing ('P00766', '1cgj', 'E', 'P00766', '1ab9', 'C')
processing ('P00766', '1gl1', 'B', 'P00766', '1ab9', 'B')
processing ('P00766', '1gl1', 'B', 'P00766', '1ab9', 'C')
processing ('P0A8M3', '1evk', 'A', 'P0A8M3', '1evl', 'A')
processing ('P0A8M3', '1evk', 'A', 'P0A8M3', '1tke', 'A')
processing ('P0A8M3', '1tje', 'A', 'P0A8M3', '1evl', 'A')
processing ('P0A8M3', '1tje', 'A', 'P0A8M3', '1tke', 'A')
processing ('P42212', '1yhh', 'A', 'P42212', '2g16', 'A')
processing ('P42212', '1yhh', 'A', 'P42212', '2g16', 'B')
processing ('P42212', '1yhi', 'A', 'P42212', '2g16', 'A')
processing ('P42212', '1yhi', 'A', 'P42212', '2g16', 'B')

2g16B has nothing in common with 2g16A sequence-wise (they are different segments of one UNP),
so they computed the LCSs and then, of course, clustered those LCSs.
So they ended up with two pairs with disjoint sequences, even though both come from the same UNP.


1evkA 1evlA
1tjeA 1tkeA
- here only these two combinations really came out
- probably disjoint again (verify in UNP/EBI)

P00766
- odd, here it looks like all the sequences have something in common
- unless there is some mutation toward the end, but the end is the same?
- 1cgjE 1ab9B
- 1ab9B and 1ab9C are disjoint
- it doesn't matter that I have more of them, because they only took the cluster center, whereas I take the Cartesian product
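The "disjoint segments of one UNP" situation can be illustrated with a small sketch. It assumes that LCS here means the longest common substring (the notes do not spell this out), and it uses the standard-library `difflib` rather than whatever the paper or the pipeline actually use; the sequences are invented.

```python
from difflib import SequenceMatcher


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous block shared by `a` and `b` ('' if none)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


# Hypothetical UniProt sequence and two chains covering different parts of it.
full_sequence = 'MKTAYIAKQR' + 'GSFVLHEDWP'
chain_1 = full_sequence[:10]
chain_2 = full_sequence[10:]

print(repr(longest_common_substring(chain_1, chain_2)))    # chains share nothing
print(longest_common_substring(full_sequence, chain_2))    # but each matches the UNP
```

Two chains can thus map to the same UniProt accession and still have an empty LCS with each other, which would put them into separate clusters.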

### The 21 1-groups

In [20]:
chains2[chains2.isoform_group_size != 2]


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,path,is_holo_paper,is_holo,domain_count_x,ligand_codes_x,apo_chain_id_x,holo_chain_id_x,apo_pdb_code_x,holo_pdb_code_x,domain_count_y,ligand_codes_y,apo_chain_id_y,holo_chain_id_y,apo_pdb_code_y,holo_pdb_code_y,isoform_group_size
isoform,pdb_code,chain_id,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
Q6KB05,1mvu,A,apo_holo_structure_stats/paper_repl/pdb_struct...,True,True,,,,,,,1.0,VAL-GLN-GLU-ALA-LEU-ASP-LYS-ARG-GLY;,C,A,1ap2,1mvu,1
P69178,1c48,E,apo_holo_structure_stats/paper_repl/pdb_struct...,False,False,1.0,GAL-GLC;GAL-GLC;,E,A,1c48,1cqf,,,,,,,1
P69179,1cqf,A,apo_holo_structure_stats/paper_repl/pdb_struct...,True,True,,,,,,,1.0,GAL-GLC;GAL-GLC;,E,A,1c48,1cqf,1
P00766,1cgj,E,apo_holo_structure_stats/paper_repl/pdb_struct...,False,False,1.0,CYS-GLY-VAL-PRO-ALA-ILE-GLN-LEU;,E,B,1cgj,1ab9,,,,,,,4
P00766,1ab9,B,apo_holo_structure_stats/paper_repl/pdb_struct...,True,True,,,,,,,1.0,CYS-GLY-VAL-PRO-ALA-ILE-GLN-LEU;,E,B,1cgj,1ab9,4
P00766,1ab9,C,apo_holo_structure_stats/paper_repl/pdb_struct...,True,True,,,,,,,1.0,THR-PRO-GLY-VAL-TYR;CYS-GLY-VAL-PRO-ALA-ILE-GL...,B,C,1gl1,1ab9,4
P00257-2,1e6e,B,apo_holo_structure_stats/paper_repl/pdb_struct...,False,False,1.0,RUA;,B,B,1e6e,2bt6,,,,,,,1
P00257,2bt6,B,apo_holo_structure_stats/paper_repl/pdb_struct...,True,True,,,,,,,1.0,RUA;,B,B,1e6e,2bt6,1
O83008,1edq,A,apo_holo_structure_stats/paper_repl/pdb_struct...,False,False,3.0,NAA-AMI;,A,A,1edq,1ffq,,,,,,,1
P07254,1ffq,A,apo_holo_structure_stats/paper_repl/pdb_struct...,True,True,,,,,,,3.0,NAA-AMI;,A,A,1edq,1ffq,1


1-groups
- it is possible that the chain belongs to a different gene (UNP), but they simply found a long LCS there

Some of them are the ones that lost their partner because isoform == null, or because the partner was the RNA chain (max 7)
- e.g. 1mvu (isoform(1ap2) == null)

P69178,1c48
P69179,1cqf
- both bacteriophages, but different ones
- the sequences are identical

P00257-2,1e6e
P00257,2bt6
- I don't understand why SIFTS assigned the first one to the -2 isoform
- the alignment of the 1e6e sequence with P00257 has a higher score (and also looks better)
    - P00257-2 alignment https://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?jobId=emboss_needle-I20211126-144339-0000-20069487-p2m
    - P00257 https://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?jobId=emboss_needle-I20211126-144425-0497-77990938-p1m

O83008,1edq
P07254,1ffq
- both have organism Serratia marcescens
- again I don't understand why different isoforms were assigned (there are quite a few mutations between them)
- yet the poly sequences in the mmCIF are exactly the same
- O83008 looks like the correct one
- It is the same on the EBI website, so the problem is probably somewhere in SIFTS?
    - I don't know why it was assigned this way; it goes by taxonomy + sequence. The taxonomy is the same for both PDB entries, and O83008 is by far the more similar sequence
- in the mmCIF, the accession for 1ffq is different! _struct_ref.pdbx_db_accession          AB015996, but that is GB (GenBank)


P01869,1a3l
Q569W9,1rur
- todo: similar to the last two cases?
- in the mmCIF, chain 1rur H maps to P01869, so why? The API returns exactly the opposite.. https://www.ebi.ac.uk/pdbe/graph-api/mappings/isoforms/1rur
...
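For checking such cases by hand, the raw API response can be pulled for any entry. This is only a sketch: the URL pattern is the one quoted above, the helper names are hypothetical, and nothing is assumed about the layout of the returned JSON, which should be inspected manually.

```python
import json
from urllib.request import urlopen

# Endpoint pattern as quoted in the notes above.
API_BASE = 'https://www.ebi.ac.uk/pdbe/graph-api/mappings/isoforms'


def isoform_mapping_url(pdb_code: str) -> str:
    """Build the isoform-mapping URL for one PDB entry (hypothetical helper)."""
    return f'{API_BASE}/{pdb_code.lower()}'


def fetch_isoform_mapping(pdb_code: str, timeout: float = 30.0):
    """Download and parse the raw JSON; raises urllib.error.HTTPError on e.g. 404."""
    with urlopen(isoform_mapping_url(pdb_code), timeout=timeout) as response:
        return json.load(response)


print(isoform_mapping_url('1RUR'))
```

Calling `fetch_isoform_mapping('1rur')` and `fetch_isoform_mapping('1a3l')` side by side would let the two mappings be compared against the `_struct_ref` records in the mmCIF files.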




## Conclusion
How many individual structures did we "lose" compared to the paper, and can we influence that somehow?

Why does the API not return an isoform (only 6 cases)?
- found out: 3 immunoglobulin variable chains, 3 obsolete entries

Find out why SIFTS maps to a worse UNP even when the chain is more similar to another sequence (in the same taxon).
- roughly 15 cases

IsHolo accuracy: 1035 - 992 = 43 cases [cell](#is_holo_classification)
- the LPC website does not work for me on peptide ligands
- see the cell on apo/holo classification

## TODO

Drop the isoforms, as I did in production, do the LCS within the primary UNP, and then count how many pairs I get..

I can just put that into the scripts I already have, can't I? Yes..