# SMARTER 50K

This notebook tests tstree on different subsets of the 50K data. A single example of
50 samples is described in `notebooks/03-smarter_database.ipynb`. Here we want
to select different breeds from the 50K data and see how tstree behaves.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from tskitetude import get_data_dir
from tskitetude.smarterapi import SheepEndpoint

Collect all sheep samples from 50K:

In [None]:
sheep_api = SheepEndpoint()

data = sheep_api.get_samples(chip_name="IlluminaOvineSNP50")
page = data["page"]
sheep = pd.DataFrame(data["items"])

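while data["next"] is not None:
    # keep the same chip filter on every page, otherwise later pages
    # would no longer be restricted to the 50K chip
    data = sheep_api.get_samples(page=page+1, chip_name="IlluminaOvineSNP50")
    df_page = pd.DataFrame(data["items"])
    page = data["page"]
    sheep = pd.concat([sheep, df_page], ignore_index=True)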

sheep.info()
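
The pagination loop above can be wrapped in a small helper so that the same filters are sent with every request. The following is only a sketch: `fetch_all` and `fake_get_samples` are hypothetical names that are not part of `tskitetude`, and it assumes that each response is a dict with `"items"`, `"page"` and `"next"` keys (as in the code above) and that pages are numbered from 1.

```python
import pandas as pd


def fetch_all(get_page, **filters):
    """Collect every page from a paginated endpoint into one DataFrame.

    `get_page` is any callable accepting `page=` plus filter keywords and
    returning a dict with "items", "page" and "next" keys (assumed shape).
    """
    data = get_page(page=1, **filters)
    frames = [pd.DataFrame(data["items"])]
    while data["next"] is not None:
        # reuse the same filters for every following page
        data = get_page(page=data["page"] + 1, **filters)
        frames.append(pd.DataFrame(data["items"]))
    return pd.concat(frames, ignore_index=True)


# Hypothetical in-memory stand-in for the real API: three pages, two samples each
def fake_get_samples(page=1, chip_name=None):
    breeds = ["TEX", "MER", "TEX", "SUF", "MER", "TEX"]
    start = (page - 1) * 2
    items = [
        {"breed_code": breed, "chip_name": chip_name}
        for breed in breeds[start:start + 2]
    ]
    return {"items": items, "page": page, "next": page + 1 if page < 3 else None}


toy = fetch_all(fake_get_samples, chip_name="IlluminaOvineSNP50")
print(toy.shape)
```

If `SheepEndpoint.get_samples` accepts `page` as a keyword on the first request too, it could presumably be passed as `get_page` in place of the stand-in.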

Count how many samples I have by breed:

In [None]:
sheep_count = sheep.groupby('breed_code').size().reset_index(name='count')
sheep_count

Plot count distribution:

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
sns.histplot(data=sheep_count, x='count', kde=True, binwidth=10, ax=ax)
plt.show()

The majority of breeds have <= 50 samples:

In [None]:
print(f"There are {sheep_count[sheep_count['count'] <= 50]['count'].sum()} sheep in breeds with 50 or fewer samples.")
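
The sum above counts sheep, while the statement is about breeds. As a minimal sketch of the difference, the example below computes both the number of breeds at or below a threshold and the samples they account for. `toy_count` is a made-up stand-in for `sheep_count`, with invented breed codes and counts.

```python
import pandas as pd

# Toy stand-in for `sheep_count` (invented breed codes and counts)
toy_count = pd.DataFrame({
    "breed_code": ["AAA", "BBB", "CCC", "DDD"],
    "count": [12, 50, 51, 200],
})

threshold = 50
small = toy_count[toy_count["count"] <= threshold]

# number and share of breeds at or below the threshold
n_small_breeds = len(small)
share_small_breeds = n_small_breeds / len(toy_count)

# number of samples belonging to those breeds
n_small_samples = int(small["count"].sum())

print(f"{n_small_breeds} of {len(toy_count)} breeds "
      f"({share_small_breeds:.0%}) have <= {threshold} samples")
print(f"They account for {n_small_samples} samples")
```

Replacing `toy_count` with `sheep_count` should give the corresponding figures for the real data.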