# SMARTER vs Vargoats
In this notebook we try to determine how many samples *Vargoats* and *SMARTER* have in common. First of all, load the *Vargoats* data table:

In [None]:
import pandas as pd

from src.features.smarterdb import global_connection, SampleGoat, Dataset

In [None]:
conn = global_connection()

In [None]:
vargoats = pd.read_excel("VarGoats data access.xlsx", header=1)
vargoats.head()

In [None]:
vargoats.info()

Ok, try to explore `ADAPTmap ID`:

In [None]:
vargoats["ADAPTmap ID"].value_counts()

Well, about `897` Vargoats animals don't have an ADAPTmap ID. Get all the ADAPTmap IDs:

In [None]:
vargoats["ADAPTmap ID"]
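
The number of animals without an ADAPTmap ID can also be computed explicitly instead of being read off `value_counts()`. A minimal sketch on made-up data, assuming (as a later cell of this notebook suggests) that missing IDs are recorded with the placeholder string `not applicable`; the sample IDs below are hypothetical:

```python
import pandas as pd

# Toy stand-in for vargoats["ADAPTmap ID"]; the IDs are made up
adaptmap_ids = pd.Series(
    ["ET_ABR0001", "not applicable", "ET_ABR0002*", "not applicable"],
    name="ADAPTmap ID",
)

# Boolean mask of the rows that carry the placeholder instead of a real ID
missing = adaptmap_ids == "not applicable"

n_missing = int(missing.sum())
n_with_id = int((~missing).sum())
print(f"{n_missing} without an ADAPTmap ID, {n_with_id} with one")
```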

I see that some animals have extra characters in their ID (a `*` and whatever follows it). Try to normalize those samples by keeping only the part before the `*`:

In [None]:
vargoats["ADAPTmap ID"] = vargoats["ADAPTmap ID"].apply(lambda name: name.split('*')[0])
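
The `apply` call above works as long as every value is a string. If the column could also contain real missing values (`NaN`), which is an assumption and not something the data shown here confirms, the vectorised string accessor might be a safer alternative. A small sketch on hypothetical IDs:

```python
import numpy as np
import pandas as pd

# Toy IDs (made up): one with a trailing '*', one clean,
# one placeholder and one truly missing value
ids = pd.Series(["ET_ABR0001*", "ET_ABR0002", "not applicable", np.nan])

# Vectorised equivalent of name.split('*')[0];
# NaN stays NaN instead of raising AttributeError
normalized = ids.str.split("*").str[0]
print(normalized.tolist())
```

A single-character pattern such as `"*"` is treated as a literal string by `Series.str.split`, so it does not need escaping.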

Ok, get all samples from SMARTER database:

In [None]:
samples = SampleGoat.objects.all()
samples.count()

Now read all data into a dataframe:

In [None]:
smarter = pd.read_json(samples.to_json())
smarter["dataset_id"] = smarter["dataset_id"].apply(lambda name: name['$oid'])
smarter.drop("_id", axis=1, inplace=True)
smarter.head()
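
To make the `$oid` handling above easier to follow without a database connection, here is a sketch on made-up records shaped like the extended JSON that `to_json()` is assumed to return (the IDs and the `smarter_id` format are hypothetical). It also wraps the JSON text in `StringIO`, since recent pandas versions deprecate passing a literal JSON string to `pd.read_json`:

```python
import json
from io import StringIO

import pandas as pd

# Made-up records: ObjectIds are serialised as {"$oid": "..."}
records = [
    {
        "_id": {"$oid": "5f0000000000000000000001"},
        "smarter_id": "SAMPLE-0001",
        "dataset_id": {"$oid": "5e0000000000000000000001"},
    },
    {
        "_id": {"$oid": "5f0000000000000000000002"},
        "smarter_id": "SAMPLE-0002",
        "dataset_id": {"$oid": "5e0000000000000000000002"},
    },
]
json_text = json.dumps(records)

# Wrap the text in a file-like object before handing it to pandas
df = pd.read_json(StringIO(json_text))

# Replace each {"$oid": ...} wrapper with the plain hexadecimal string
df["dataset_id"] = df["dataset_id"].apply(lambda value: value["$oid"])

# The MongoDB primary key is not needed for the comparison
df = df.drop(columns="_id")
print(df)
```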

In [None]:
merged_datasets = vargoats.set_index("ADAPTmap ID").join(smarter.set_index("original_id"), lsuffix="vargoats", rsuffix="smarter", how="outer")
merged_datasets.head()
merged_datasets.info()
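
The outer join above keeps every row from both tables, so the number of shared samples still has to be read from the result. One possible way to get it directly is `DataFrame.merge` with `indicator=True`. This is a sketch on toy tables with made-up IDs, not the notebook's actual data; only the column names are taken from the cells above:

```python
import pandas as pd

# Toy tables with made-up IDs; column names follow the notebook
vargoats_toy = pd.DataFrame({
    "ADAPTmap ID": ["A1", "A2", "not applicable"],
    "Original ID": ["V1", "V2", "V3"],
})
smarter_toy = pd.DataFrame({
    "original_id": ["A2", "A3"],
    "smarter_id": ["S1", "S2"],
})

# indicator=True adds a "_merge" column whose values are
# "both", "left_only" or "right_only"
merged = vargoats_toy.merge(
    smarter_toy,
    left_on="ADAPTmap ID",
    right_on="original_id",
    how="outer",
    indicator=True,
)

in_common = int((merged["_merge"] == "both").sum())
print(merged["_merge"].value_counts())
print(f"{in_common} sample(s) in common")
```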

In [None]:
merged_datasets.to_excel("merged_datasets.xlsx")

Is it possible that a Vargoats original ID is in SMARTER but outside ADAPTmap? Get all the non-ADAPTmap samples from Vargoats:

In [None]:
original_ids = vargoats[vargoats["ADAPTmap ID"] == "not applicable"]["Original ID"]

In [None]:
samples = SampleGoat.objects.filter(original_id__in=original_ids.to_list())
samples.count()
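
The same membership check can be sketched in memory with `Series.isin`, which may be handy when the database is not reachable. The tables below are made up and do not reflect the real result of the query above; the approach assumes the SMARTER samples are already loaded in a dataframe with an `original_id` column:

```python
import pandas as pd

# Toy tables with hypothetical IDs
vargoats_toy = pd.DataFrame({
    "ADAPTmap ID": ["A1", "not applicable", "not applicable"],
    "Original ID": ["V1", "V2", "V3"],
})
smarter_toy = pd.DataFrame({
    "original_id": ["A1", "V3", "X9"],
    "smarter_id": ["S1", "S2", "S3"],
})

# Original IDs of the animals that are outside ADAPTmap
original_ids = vargoats_toy.loc[
    vargoats_toy["ADAPTmap ID"] == "not applicable", "Original ID"
]

# Dataframe counterpart of filter(original_id__in=...)
matches = smarter_toy[smarter_toy["original_id"].isin(original_ids)]
print(len(matches))
```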

Vargoats has its own ID representation.

## SMARTER Goat stats by breeds
Let's discover how many breeds are in the SMARTER goat database.

In [None]:
smarter.head()

First question: how many goat breeds are in SMARTER database?

In [None]:
print(f"There are {smarter['breed'].nunique()} goat breeds in smarter database")

Count by `breed` column:

In [None]:
count_breed = pd.DataFrame(data=smarter.groupby(["breed"]).count()["smarter_id"]).rename(columns={"smarter_id": "count"})
count_breed = count_breed.reset_index()
count_breed

Now group by `breed` and `type` columns:

In [None]:
count_breedandtype = pd.DataFrame(data=smarter.groupby(["breed", "type"]).count()["smarter_id"]).rename(columns={"smarter_id": "count"})
count_breedandtype = count_breedandtype.reset_index()
count_breedandtype.info()
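
The two counting cells above can be written more compactly with `groupby(...).size()`. This is an alternative sketch on made-up rows, not a replacement for the cells above: `size()` counts all rows in a group, while `count()` on `smarter_id` skips rows where that column is null, so the two only agree when `smarter_id` is never missing.

```python
import pandas as pd

# Toy stand-in for the smarter dataframe; breeds and IDs are made up
smarter_toy = pd.DataFrame({
    "smarter_id": ["S1", "S2", "S3", "S4", "S5"],
    "breed": ["Alpine", "Alpine", "Landrace", "Landrace", "Saanen"],
    "type": ["background", "background", "background",
             "foreground", "foreground"],
})

# size() counts rows per group; reset_index(name=...) turns the
# result into a dataframe and names the new column
counts = (
    smarter_toy.groupby(["breed", "type"])
    .size()
    .reset_index(name="count")
)
print(counts)
```

In this toy table there are 3 breeds but 4 (breed, type) pairs, because one breed appears with both types.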

This time I have 169 rows. Some breeds have both background and foreground samples:

In [None]:
# https://stackoverflow.com/a/22107169
both_types = count_breedandtype.groupby("breed").filter(lambda x: len(x) > 1)
both_types
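
The `groupby(...).filter(...)` idiom above keeps every breed that occurs in more than one row. A sketch on made-up counts showing what it selects, next to an equivalent formulation with `duplicated(keep=False)` that avoids the Python-level lambda:

```python
import pandas as pd

# Toy version of count_breedandtype; values are made up
counts = pd.DataFrame({
    "breed": ["Alpine", "Landrace", "Landrace", "Saanen"],
    "type": ["background", "background", "foreground", "foreground"],
    "count": [2, 1, 1, 1],
})

# Keep every group that has more than one row (the approach used above)
by_filter = counts.groupby("breed").filter(lambda group: len(group) > 1)

# Same rows: mark every occurrence of a repeated breed value
by_duplicated = counts[counts["breed"].duplicated(keep=False)]

print(by_filter)
```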

Which datasets provide these animals?

In [None]:
dataset_ids = smarter[smarter["breed"].isin(["Fosses", "Landrace", "Provencale"])]["dataset_id"].unique()
datasets = pd.read_json(Dataset.objects.filter(id__in=dataset_ids).fields(type_=1, partner=1, file=1).to_json())
datasets["type"] = datasets["type"].apply(lambda x: x[1])
both_datasets = datasets[["file", "partner", "type"]]
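
The breed names in the cell above are typed in by hand. If the selection is meant to track the breeds found in `both_types`, they could be derived from that table instead. A sketch on toy data, under that assumption; the dataset IDs are hypothetical:

```python
import pandas as pd

# Toy stand-ins for both_types and smarter
both_types_toy = pd.DataFrame({
    "breed": ["Landrace", "Landrace"],
    "type": ["background", "foreground"],
    "count": [1, 1],
})
smarter_toy = pd.DataFrame({
    "breed": ["Alpine", "Landrace", "Landrace"],
    "dataset_id": ["d1", "d2", "d3"],
})

# Breeds that appear with more than one type
breeds = both_types_toy["breed"].unique()

# Datasets that contribute at least one sample of those breeds
dataset_ids = smarter_toy.loc[
    smarter_toy["breed"].isin(breeds), "dataset_id"
].unique()
print(list(dataset_ids))
```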

How many samples are foreground (by breed)?

In [None]:
foreground_breeds = count_breedandtype[count_breedandtype["type"] == "foreground"]
print(f"There are {foreground_breeds['count'].sum()} foreground goat samples")
foreground_breeds

Save data in tables:

In [None]:
with pd.ExcelWriter("smarter_goats_breeds.xlsx") as writer:
    count_breed.to_excel(writer, sheet_name="breeds count", index=False)
    count_breedandtype.sort_values(["type", "breed"]).to_excel(writer, sheet_name="breeds count by type", index=False)
    both_types.to_excel(writer, sheet_name="both foreground and background", index=False)