# Burren IBS
This notebook tries to find animals in the burren dataset that are already in the SMARTER database. The hypothesis is that calculating *Identical by State* (*IBS*) between samples will reveal potential duplicates. There are multiple ways to calculate *IBS*: here we calculate it with the [king](https://www.kingrelatedness.com/) software, and then identify the samples that should be removed from the new background dataset. Let's load some useful modules:

In [1]:
import os

import pandas as pd

from src.features.smarterdb import global_connection, Dataset

In [2]:
conn = global_connection()

In [3]:
dataset = Dataset.objects.get(file="burren_et_al_2016.zip")

In [4]:
os.chdir(dataset.working_dir / "doi_10.5061_dryad.q1cv6__v1/")

Open the *king* output file. The latest SMARTER database release may itself contain duplicated samples; however, let's focus on duplicates between SMARTER and the burren dataset:

In [5]:
king = pd.read_table("king.con")
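# flag pairs where both IDs carry 'CH' at positions 2-3 (e.g. 'CHCH-ALP-000002441'), i.e. both are SMARTER samples
king['in_smarter'] = (king['ID1'].str[2:4] == 'CH') & (king['ID2'].str[2:4] == 'CH')
# keep only the pairs that involve a sample outside SMARTER
king = king[~king['in_smarter']]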
king.head()

| | FID1 | ID1 | FID2 | ID2 | N | N_IBS0 | N_IBS1 | N_IBS2 | Concord | HomConc | HetConc | in_smarter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ALP | CHCH-ALP-000002441 | 1 | goat27 | 41790 | 0 | 0 | 41790 | 1.0 | 1.0 | 1.0 | False |
| 1 | ALP | CHCH-ALP-000002442 | 1 | goat28 | 41779 | 0 | 0 | 41779 | 1.0 | 1.0 | 1.0 | False |
| 2 | ALP | CHCH-ALP-000002443 | 1 | goat29 | 41701 | 0 | 0 | 41701 | 1.0 | 1.0 | 1.0 | False |
| 3 | ALP | CHCH-ALP-000002445 | 1 | goat130 | 41769 | 0 | 0 | 41769 | 1.0 | 1.0 | 1.0 | False |
| 4 | ALP | CHCH-ALP-000002448 | 1 | goat131 | 41792 | 0 | 0 | 41792 | 1.0 | 1.0 | 1.0 | False |
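
To make the `N_IBS0`, `N_IBS1`, `N_IBS2` and `Concord` columns more concrete, here is a minimal sketch of how such counts could be derived for one pair of samples. It is not how *king* is implemented: the function `ibs_counts`, the 0/1/2 genotype coding and the `-1` missing-call marker are all assumptions made for illustration, and `Concord` is assumed to be `N_IBS2 / N`, which is consistent with the rows shown above.

```python
import numpy as np


def ibs_counts(g1, g2):
    """Count IBS0/IBS1/IBS2 sites between two genotype vectors.

    Assumption: genotypes are coded as 0, 1 or 2 copies of an allele,
    and -1 marks a missing call (hypothetical encoding).
    """
    g1 = np.asarray(g1)
    g2 = np.asarray(g2)

    # only compare sites where both samples have a genotype call
    called = (g1 >= 0) & (g2 >= 0)
    diff = np.abs(g1[called] - g2[called])

    n = int(called.sum())
    n_ibs2 = int((diff == 0).sum())  # identical genotypes
    n_ibs1 = int((diff == 1).sum())  # one allele shared
    n_ibs0 = int((diff == 2).sum())  # opposite homozygotes

    # assumed definition: fraction of compared sites with identical genotypes
    concord = n_ibs2 / n if n else float("nan")

    return {"N": n, "N_IBS0": n_ibs0, "N_IBS1": n_ibs1,
            "N_IBS2": n_ibs2, "Concord": concord}


# toy genotypes, not taken from the real data
sample_a = [0, 1, 2, 2, -1, 1]
sample_b = [0, 1, 2, 0, 2, 2]
result = ibs_counts(sample_a, sample_b)
```

A true duplicate is expected to have `N_IBS2` equal (or very close) to `N`, as in the pairs listed above.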


Let's count how many duplicates there are between the two datasets (SMARTER vs burren), grouped by family ID:

In [6]:
king[['FID1', 'FID2']].value_counts()

FID1  FID2
ALP   1       66
SAA   1       43
dtype: int64

OK, write the sample names to a file. These samples will be excluded from the *burren* dataset:

In [7]:
king['ID2'].to_csv('to_remove.csv')
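
If the exclusion is later done with PLINK (an assumption, since this notebook does not say which tool consumes the file), its `--remove` option expects a headerless text file with the family ID and the sample ID in the first two columns. A minimal sketch of writing the list in that shape follows; the toy `pairs` table stands in for the filtered `king` dataframe, and the file name `to_remove.txt` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# toy stand-in for the filtered king table (values copied from the rows shown above)
pairs = pd.DataFrame({
    "FID1": ["ALP", "ALP"],
    "ID1": ["CHCH-ALP-000002441", "CHCH-ALP-000002442"],
    "FID2": [1, 1],
    "ID2": ["goat27", "goat28"],
})

# keep family ID and sample ID of the burren side, one row per sample
to_remove = pairs[["FID2", "ID2"]].drop_duplicates()

# write tab-separated, without header and without the dataframe index
out_path = Path(tempfile.mkdtemp()) / "to_remove.txt"
to_remove.to_csv(out_path, sep="\t", header=False, index=False)

written = out_path.read_text()
```

Dropping duplicates guards against a burren sample that matches more than one SMARTER sample being listed twice.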