# SNP50_Breedv2
This notebook analyzes the SNP50_Breedv2 data file, a new release of the SheepHapMap data covering multiple countries and breeds. The goal is to read the sample information and generate a metadata file for the genotype import.

In [1]:
import re

import pandas as pd

from src.features.smarterdb import Dataset, global_connection
from src.features.plinkio import TextPlinkIO
from src.features.utils import get_interim_dir

In [2]:
conn = global_connection()

Get the `Dataset` object:

In [3]:
pattern = re.compile("hapmap", re.IGNORECASE)
dataset = Dataset.objects.get(file=pattern, type_="genotypes")
dataset.contents

['ovine_SNP50HapMap_data/',
 'ovine_SNP50HapMap_data/Heaton/',
 'ovine_SNP50HapMap_data/Heaton/Mike Heaton Sheep 07may2009_DNAReport.csv',
 'ovine_SNP50HapMap_data/Heaton/Mike Heaton Sheep 07may2009_FinalReport.txt',
 'ovine_SNP50HapMap_data/Heaton/Mike Heaton Sheep 07may2009_LocusSummary.csv',
 'ovine_SNP50HapMap_data/Heaton/Mike Heaton Sheep 07may2009_LocusXDNA.csv',
 'ovine_SNP50HapMap_data/Heaton/SNP_Map.txt',
 'ovine_SNP50HapMap_data/Heaton/Sample_Map.txt',
 'ovine_SNP50HapMap_data/OaCoordinates2104.xlsx',
 'ovine_SNP50HapMap_data/Parentage_04_may_09.PED',
 'ovine_SNP50HapMap_data/SNP50_Breedv1/',
 'ovine_SNP50HapMap_data/SNP50_Breedv1/SNP50_Breedv1.map',
 'ovine_SNP50HapMap_data/SNP50_Breedv1/SNP50_Breedv1.ped',
 'ovine_SNP50HapMap_data/SNP50_Breedv2/',
 'ovine_SNP50HapMap_data/SNP50_Breedv2/SNP50_Breedv2.map',
 'ovine_SNP50HapMap_data/SNP50_Breedv2/SNP50_Breedv2.ped',
 'ovine_SNP50HapMap_data/SNP50_Breedv2/ovine SNP50 Breedv2 data release.pdf',
 'ovine_SNP50HapMap_data/ancestral
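
As a rough illustration of how a listing like the one above could be narrowed down to the PLINK text files of this release, the sketch below filters a hard-coded subset of the paths with a case-insensitive regular expression. The `contents` list and the `breedv2_files` name are illustrative only, and nothing here queries the database.

```python
import re

# A few of the paths copied from the listing above (illustrative subset)
contents = [
    "ovine_SNP50HapMap_data/SNP50_Breedv1/SNP50_Breedv1.map",
    "ovine_SNP50HapMap_data/SNP50_Breedv1/SNP50_Breedv1.ped",
    "ovine_SNP50HapMap_data/SNP50_Breedv2/SNP50_Breedv2.map",
    "ovine_SNP50HapMap_data/SNP50_Breedv2/SNP50_Breedv2.ped",
    "ovine_SNP50HapMap_data/SNP50_Breedv2/ovine SNP50 Breedv2 data release.pdf",
]

# Keep only the .map/.ped pair of the Breedv2 release, ignoring case
pattern = re.compile(r"breedv2\.(map|ped)$", re.IGNORECASE)
breedv2_files = [path for path in contents if pattern.search(path)]
print(breedv2_files)
```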

Get the samples from the PED file:

In [4]:
prefix = str(dataset.working_dir / "ovine_SNP50HapMap_data/SNP50_Breedv2/SNP50_Breedv2")
plink_file = TextPlinkIO(prefix=prefix, chip_name=dataset.chip_name, species=dataset.species)
sample_list = plink_file.get_samples()
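
For readers unfamiliar with the PED format: each line describes one sample, and the first six whitespace-separated columns are family ID, individual ID, paternal ID, maternal ID, sex and phenotype, followed by the genotypes. The sketch below reads the individual IDs with plain Python. It assumes `get_samples()` returns individual IDs, which is consistent with the table shown later in this notebook but is not confirmed here. `read_ped_samples` is a hypothetical helper, not part of `TextPlinkIO`, and the in-memory records are made up.

```python
import io


def read_ped_samples(handle):
    """Yield (family_id, individual_id) for each non-empty line of a PED file."""
    for line in handle:
        if not line.strip():
            continue
        # maxsplit=2 avoids splitting the (very long) genotype columns
        family_id, individual_id, _rest = line.split(maxsplit=2)
        yield family_id, individual_id


# Tiny in-memory stand-in for a .ped file; family IDs and genotypes are made up
ped_text = (
    "ImprovedAwassi IAW101 0 0 0 -9 A A G G\n"
    "ImprovedAwassi IAW102 0 0 0 -9 A G G G\n"
)
samples = [iid for _fid, iid in read_ped_samples(io.StringIO(ped_text))]
print(samples)
```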

Define dictionaries that map the three-letter code at the start of each sample ID to its breed name, `fid`, country and breed code:

In [5]:
suffix2breed = {"AAW": "Afec Assaf", "APA": "Arawapa", "EBI": "Egyptian Barki", "ICE": "Icelandic", "IAW": "Improved Awassi", "LAW": "Local Awassi", "SLI": "Sri Lankan"}
suffix2fid = {"AAW": "AfecAssaf", "APA": "Arawapa", "EBI": "EgyptianBarki", "ICE": "Icelandic", "IAW": "ImprovedAwassi", "LAW": "LocalAwassi", "SLI": "SriLankan"}
suffix2country = {"AAW": "Israel", "APA": "New Zealand", "EBI": "Egypt", "ICE": "Iceland", "IAW": "Israel", "LAW": "Israel", "SLI": "Sri Lanka"}
suffix2code = {"AAW": "AAW", "APA": "APA", "EBI": "EBI", "ICE": "ICL", "IAW": "IAW", "LAW": "LAW", "SLI": "SLI"}
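
Because these dictionaries are maintained by hand, a key missing from one of them only surfaces as a `KeyError` when a sample with that code is looked up. A minimal sketch of an up-front consistency check follows. `check_same_keys` is a hypothetical helper, and the dictionaries are two-entry excerpts of the ones defined above.

```python
def check_same_keys(**mappings):
    """Raise ValueError if the given dictionaries do not share the same keys."""
    reference_name, reference = next(iter(mappings.items()))
    for name, mapping in mappings.items():
        difference = set(reference) ^ set(mapping)
        if difference:
            raise ValueError(
                f"{name} and {reference_name} differ on keys: {sorted(difference)}"
            )


# Two-entry excerpts of the dictionaries defined above
suffix2breed = {"ICE": "Icelandic", "IAW": "Improved Awassi"}
suffix2country = {"ICE": "Iceland", "IAW": "Israel"}
suffix2code = {"ICE": "ICL", "IAW": "IAW"}

# Passes silently when all key sets agree
check_same_keys(breed=suffix2breed, country=suffix2country, code=suffix2code)
```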

In [6]:
breeds = []
countries = []
codes = []
fids = []

for sample in sample_list:
    # despite the name, this is the three-letter code at the start of the sample ID
    suffix = sample[:3]
    breeds.append(suffix2breed[suffix])
    countries.append(suffix2country[suffix])
    codes.append(suffix2code[suffix])
    fids.append(suffix2fid[suffix])
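
The same lookup can also be expressed with pandas, which reports unknown codes as missing values instead of stopping at the first `KeyError`. This is only a sketch of an alternative, not the approach used in this notebook. `prefix2breed` is an excerpt of the breed mapping, and `ICE001` and `XXX999` are made-up sample IDs.

```python
import pandas as pd

# Excerpt of the code-to-breed mapping; "XXX999" is a made-up ID with an unknown code
prefix2breed = {"IAW": "Improved Awassi", "ICE": "Icelandic"}
sample_ids = pd.Series(["IAW101", "ICE001", "XXX999"], name="original_id")

# The first three characters of each ID select the breed
prefixes = sample_ids.str[:3]
breeds = prefixes.map(prefix2breed)  # unknown codes become NaN instead of raising

# Collect the IDs that could not be mapped so they can be reviewed
unknown = sample_ids[breeds.isna()].tolist()
print(unknown)
```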

Create a data table:

In [7]:
SNP50_Breedv2 = pd.DataFrame.from_dict({"original_id": sample_list, "breed": breeds, "country": countries, "code": codes, "fid": fids})
SNP50_Breedv2.head()

| | original_id | breed | country | code | fid |
|---|---|---|---|---|---|
| 0 | IAW101 | Improved Awassi | Israel | IAW | ImprovedAwassi |
| 1 | IAW102 | Improved Awassi | Israel | IAW | ImprovedAwassi |
| 2 | IAW103 | Improved Awassi | Israel | IAW | ImprovedAwassi |
| 3 | IAW104 | Improved Awassi | Israel | IAW | ImprovedAwassi |
| 4 | IAW106 | Improved Awassi | Israel | IAW | ImprovedAwassi |


Write metadata to file:

In [8]:
SNP50_Breedv2.to_excel(str(get_interim_dir() / "SNP50_Breedv2.xlsx"))
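
By default pandas writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below demonstrates this with an in-memory CSV so that it runs without an Excel engine. `to_excel` accepts the same `index` parameter. Whether the downstream import expects the index column is not stated here, so passing `index=False` is a suggestion rather than a required fix.

```python
import io

import pandas as pd

table = pd.DataFrame({"original_id": ["IAW101", "IAW102"], "code": ["IAW", "IAW"]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
table.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False keeps only the data columns
without_index = io.StringIO()
table.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```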