# Infer ancestry using EST-SFS
This was the first attempt to create a *tstree* object using EST-SFS. In this example
we collect all *background* samples from the SMARTER database, using all *Ovis
aries* samples as *focal samples* and *European*, *Sardinian* and *Spanish mouflon*
as three different *outgroups* for the EST-SFS inference. Each sample selection is
written to a tab-separated file with two columns and no header (breed code as `FID`,
`smarter_id` as `IID`), so that only the samples we need can be extracted from the
whole genotype files.

Import what is needed to collect sheep *background* samples from the SMARTER database:

In [None]:
import pandas as pd

from tskitetude import get_data_dir
from tskitetude.smarterapi import SheepEndpoint, BreedEndpoint

Connect to the *SMARTER* database and retrieve information on *background* samples, one page at a time:

In [None]:
sheep_api = SheepEndpoint()

data = sheep_api.get_samples(_type="background")
page = 1
sheep = pd.DataFrame(data["items"])

while data["next"] is not None:
    data = sheep_api.get_samples(page=page+1, _type="background")
    df_page = pd.DataFrame(data["items"])
    page = data["page"]
    sheep = pd.concat([sheep, df_page], ignore_index=True)

sheep.info()
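The page-by-page loop above is repeated later for breeds, so it can be factored into a small helper. The following is only a sketch: `fetch_all_items` is a hypothetical name, and it assumes that the endpoint method accepts `page=1` for the first request and that every response carries the `items` and `next` keys used above.

```python
from typing import Any, Callable, Dict, List


def fetch_all_items(fetch_page: Callable[..., Dict[str, Any]], **params: Any) -> List[dict]:
    """Collect the "items" of every page of a paginated endpoint.

    Assumption: ``fetch_page`` accepts a ``page`` keyword and returns a dict
    with an "items" list and a "next" value that is None on the last page.
    """
    page = 1
    data = fetch_page(page=page, **params)
    items = list(data["items"])

    # Keep requesting pages until the response reports no "next" page
    while data["next"] is not None:
        page += 1
        data = fetch_page(page=page, **params)
        items.extend(data["items"])

    return items
```

With this helper, the cell above would reduce to something like `sheep = pd.DataFrame(fetch_all_items(sheep_api.get_samples, _type="background"))`. Building the DataFrame once from the full list also avoids calling `pd.concat` on every page.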

Are those all *background* samples?

In [None]:
sheep.value_counts("type")

Ok. Let's collect all available species:

In [None]:
sheep.value_counts("species")

Ok, now collect all samples which are *Ovis aries*:

In [None]:
ovis_aries = sheep[sheep["species"] == "Ovis aries"]
ovis_aries.head()

How many breeds do I have?

In [None]:
ovis_aries.value_counts("breed")

Ensure that there are no *mouflon* in sheep breed names:

In [None]:
ovis_aries["breed"].str.contains("Mouflon", case=False).any()

Ok, now collect *Ovis aries musimon* samples:

In [None]:
ovis_aries_musimon = sheep[sheep["species"] == "Ovis aries musimon"]
ovis_aries_musimon.head()

How many breeds do I have?

In [None]:
ovis_aries_musimon.value_counts("breed")

Ok, try to collect *European mouflon*:

In [None]:
european_mouflon = ovis_aries_musimon[ovis_aries_musimon["breed"] == "European mouflon"]
european_mouflon.head()

Ok, I'm also interested in *Sardinian mouflon*:

In [None]:
sardinian_mouflon = ovis_aries_musimon[ovis_aries_musimon["breed"] == "Sardinian mouflon"]
sardinian_mouflon.head()

Should I take *Spanish mouflon* as a third outgroup?

In [None]:
spanish_mouflon = ovis_aries_musimon[ovis_aries_musimon["breed"] == "Spanish mouflon"]
spanish_mouflon.head()

Ok, now track those breeds as three different *outgroup* lists:

In [None]:
european_mouflon[["breed_code", "smarter_id"]].to_csv(get_data_dir() / "european_mouflon.tsv", index=False, header=False, sep="\t")
sardinian_mouflon[["breed_code", "smarter_id"]].to_csv(get_data_dir() / "sardinian_mouflon.tsv", index=False, header=False, sep="\t")
spanish_mouflon[["breed_code", "smarter_id"]].to_csv(get_data_dir() / "spanish_mouflon.tsv", index=False, header=False, sep="\t")
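Before using these lists, it may be worth checking that no sample appears in more than one group, since a sample shared between groups would blur the outgroup definition. The sketch below is not part of the original analysis; `find_shared_ids` is a hypothetical helper that works on any iterables of IDs.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def find_shared_ids(groups: Dict[str, Iterable[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Return the IDs shared by each pair of groups (empty dict if none)."""
    as_sets = {name: set(ids) for name, ids in groups.items()}
    shared = {}
    # Compare every pair of groups exactly once
    for (name_a, ids_a), (name_b, ids_b) in combinations(as_sets.items(), 2):
        common = ids_a & ids_b
        if common:
            shared[(name_a, name_b)] = common
    return shared
```

In this notebook it could be called with the `smarter_id` columns, for example `find_shared_ids({"european": european_mouflon["smarter_id"], "sardinian": sardinian_mouflon["smarter_id"], "spanish": spanish_mouflon["smarter_id"]})`. An empty result means the three lists are disjoint.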

Now create a *sample txt* file that I can use to extract the *focal* samples I need from the SMARTER database using plink:

In [None]:
ovis_aries[["breed_code", "smarter_id"]].to_csv(get_data_dir() / "sheep_dataset.tsv", index=False, header=False, sep="\t")
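A file like this can be passed to PLINK's `--keep` option, which expects family ID and individual ID in the first two columns. The sketch below only builds the command line and does not run it. The function name and the file prefixes are hypothetical, and `--chr-set 26` is an assumption about how the sheep genotype files should be read by PLINK 1.9.

```python
from pathlib import Path
from typing import List


def plink_keep_command(bfile: Path, keep_file: Path, out: Path) -> List[str]:
    """Build a PLINK 1.9 command that keeps only the listed samples."""
    return [
        "plink",
        "--chr-set", "26",         # assumption: sheep data, 26 autosome pairs
        "--bfile", str(bfile),     # prefix of the .bed/.bim/.fam fileset
        "--keep", str(keep_file),  # two columns: FID and IID
        "--make-bed",
        "--out", str(out),
    ]


# Hypothetical paths, for illustration only
cmd = plink_keep_command(Path("smarter_sheep"), Path("sheep_dataset.tsv"), Path("sheep_focal"))
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once the paths point to real files.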

## Attempt to limit sample size

Ok, try to download a small dataset to test the pipeline. First, get information
about the breeds:

In [None]:
breed_api = BreedEndpoint()

data = breed_api.get_breeds(species="Sheep")
page = 1
breeds = pd.DataFrame(data["items"])

while data["next"] is not None:
    data = breed_api.get_breeds(page=page+1, species="Sheep")
    df_page = pd.DataFrame(data["items"])
    page = data["page"]
    breeds = pd.concat([breeds, df_page], ignore_index=True)

breeds.info()

Try to select breeds with a limited number of individuals, for example exactly 50:

In [None]:
breeds[breeds["n_individuals"] == 50]

Ok, first focus on the `AustralianMerino` breed:

In [None]:
data = sheep_api.get_samples(code="AME")
page = 1
sheep = pd.DataFrame(data["items"])
sheep.head()

These samples seem to come from the *50K* chip:

In [None]:
sheep["chip_name"].value_counts()

Track those samples in a tab-separated file:

In [None]:
sheep[["breed_code", "smarter_id"]].to_csv(get_data_dir() / "AME_50K.tsv", index=False, header=False, sep="\t")

Now move on to the `Île de France` breed:

In [None]:
data = sheep_api.get_samples(code="IDF")
page = 1
sheep = pd.DataFrame(data["items"])
sheep.head()

These samples seem to come from both the *50K* and *HD* chips:

In [None]:
sheep["chip_name"].value_counts()

Ok, take only the *HD* samples:

In [None]:
sheep = sheep[sheep["chip_name"] == "IlluminaOvineHDSNP"]
sheep.info()

Track those samples in a tab-separated file:

In [None]:
sheep[["breed_code", "smarter_id"]].to_csv(get_data_dir() / "IDF_HD.tsv", index=False, header=False, sep="\t")
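Since every sample list is later consumed as a two-column `FID`/`IID` file, a quick read-back can catch empty fields or duplicated samples before running the pipeline. This is a minimal sketch using only the standard library; `read_keep_file` is a hypothetical helper, not part of `tskitetude`.

```python
import csv
from pathlib import Path
from typing import List, Tuple


def read_keep_file(path: Path) -> List[Tuple[str, str]]:
    """Read a header-less, tab-separated FID/IID file and check its shape."""
    samples = []
    with open(path, newline="") as handle:
        for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
            # Every line must hold exactly two non-empty fields
            if len(row) != 2 or not all(row):
                raise ValueError(f"{path}:{line_no}: expected 2 non-empty columns, got {row!r}")
            samples.append((row[0], row[1]))
    # The same FID/IID pair should never be listed twice
    if len(set(samples)) != len(samples):
        raise ValueError(f"{path}: duplicated FID/IID pairs")
    return samples
```

For example, `len(read_keep_file(get_data_dir() / "IDF_HD.tsv"))` should match the number of rows reported by `sheep.info()` above.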