# SMARTER SUMMARY (2021/09/28)
* [Dataset composition](#datasets-composition)
    - [Foreground / background datasets](#foreground-vs-background-datasets)
    - [Datasets by chip type](#datasets-by-chip-type)
* [Samples composition](#samples-composition)
    - [Foreground / background samples for sheep](#foreground-background-samples-sheep)
    - [Foreground / background samples for goat](#foreground-background-samples-goat)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from shapely.geometry import Point
import geopandas as gpd

from src.features.smarterdb import global_connection, Dataset, SampleSheep, SampleGoat

conn = global_connection()

def fix_id(df: pd.DataFrame):
    """Parse id and make index"""
    df['_id'] = df['_id'].apply(lambda val: val['$oid'])
    df = df.set_index("_id")
    return df

def add_geometry(df: pd.DataFrame):
    """Add a geometry column from locations"""
    
    def get_geometry(value):
        if isinstance(value, list):
            return Point(*value[0]['coordinates'])
        return value

    df['geometry'] = df['locations'].apply(get_geometry)
    return df
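
`to_json()` on a MongoEngine queryset emits MongoDB extended JSON, where each `_id` is an object of the form `{"$oid": "..."}` rather than a plain string. The sketch below replays the steps of `fix_id` on two hand-written records. The record contents and the `records` name are made up for illustration and do not come from the SMARTER database.

```python
import json

import pandas as pd

# Hypothetical records, shaped like the extended JSON produced by to_json()
raw = json.dumps([
    {"_id": {"$oid": "000000000000000000000001"}, "breed": "Texel"},
    {"_id": {"$oid": "000000000000000000000002"}, "breed": "Merino"},
])

records = pd.DataFrame(json.loads(raw))

# Same steps as fix_id: unwrap the ObjectId string, then use it as the index
records["_id"] = records["_id"].apply(lambda val: val["$oid"])
records = records.set_index("_id")

print(records.index.tolist())
```

After this step the index holds plain strings, which makes lookups and joins on the sample or dataset identifier straightforward.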

<a id='datasets-composition'></a>
## Dataset composition
Read the dataset information and describe how many *background* and *foreground* datasets we have:

In [None]:
tmp = Dataset.objects.filter(type_="genotypes").to_json()
datasets = pd.read_json(tmp).dropna(thresh=1, axis=1)
datasets['type'] = datasets['type'].apply(lambda val: val[1])
datasets = fix_id(datasets)
datasets[['breed', 'country', 'species', 'type', 'partner', 'chip_name']]

<a id='foreground-vs-background-datasets'></a>
### Foreground / background datasets
Plot *foreground* vs *background* datasets:

In [None]:
plot = datasets['type'].value_counts().plot.pie(y="type", figsize=(8,8), shadow=True, startangle=45, rotatelabels=45, autopct='%1.1f%%')
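
The percentages that `autopct` draws on the pie chart can also be obtained as numbers with `value_counts(normalize=True)`. A minimal sketch on a made-up `type` column (the six values below are illustrative, not the real dataset counts):

```python
import pandas as pd

# Hypothetical dataset types; the real values come from datasets['type']
types = pd.Series(
    ["background", "background", "background", "background", "foreground", "foreground"],
    name="type",
)

counts = types.value_counts()                # absolute number of datasets per type
shares = types.value_counts(normalize=True)  # fraction of datasets per type

# Both Series share the same index, so they align row by row
summary = pd.DataFrame({"count": counts, "percent": (shares * 100).round(1)})
print(summary)
```

Keeping the table next to the chart makes the exact counts available when the pie labels are too coarse.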

<a id='datasets-by-chip-type'></a>
### Datasets by chip type
Plotting datasets by *chip type*:

In [None]:
plot = datasets['chip_name'].value_counts().plot.pie(y="chip", figsize=(8,8), shadow=True, startangle=45, rotatelabels=45, autopct='%1.1f%%')

<a id='samples-composition'></a>
## Samples composition
<a id='foreground-background-samples-sheep'></a>
### Foreground / Background samples for sheep
Get the *background* and *foreground* sheep samples. This takes two queries, since the type is a property of the `Dataset`, not of the sample:

In [None]:
foreground_sheeps = SampleSheep.objects.filter(dataset__in=Dataset.objects.filter(type_="foreground")).fields(country=True, breed=True, chip_name=True, locations=True)
background_sheeps = SampleSheep.objects.filter(dataset__in=Dataset.objects.filter(type_="background")).fields(country=True, breed=True, chip_name=True, locations=True)
samples_sheep = pd.Series({"foreground": foreground_sheeps.count(), "background": background_sheeps.count()}, name="Sheep")
plot = samples_sheep.plot.pie(y="Sheep", figsize=(8,8), shadow=True, rotatelabels=45, autopct='%1.1f%%')
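
The two-step selection used above (first pick the datasets of a given type, then keep the samples that reference them) can be sketched in plain Python. `ToyDataset`, `ToySample` and `samples_of_type` are hypothetical names used only for this illustration; they are not part of `smarterdb`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyDataset:
    name: str
    type_: str  # "foreground" or "background"


@dataclass(frozen=True)
class ToySample:
    sample_id: str
    dataset: ToyDataset


def samples_of_type(samples, datasets, type_):
    """Step 1: pick the datasets of the given type; step 2: keep their samples."""
    selected = {d for d in datasets if d.type_ == type_}
    return [s for s in samples if s.dataset in selected]


bg = ToyDataset("dataset-a", "background")
fg = ToyDataset("dataset-b", "foreground")
samples = [ToySample("s1", bg), ToySample("s2", bg), ToySample("s3", fg)]

counts = {t: len(samples_of_type(samples, [bg, fg], t)) for t in ("foreground", "background")}
print(counts)
```

The inner dataset lookup plays the role of `Dataset.objects.filter(type_=...)`, and the membership test plays the role of `dataset__in`.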

Most of the data we have comes from background datasets. Where are the samples located, and how are *background* and *foreground* samples distributed? Read the data from the database, then add a geometry column for the `GeoDataFrame`:

In [None]:
tmp = foreground_sheeps.to_json()
foreground_sheeps = pd.read_json(tmp).dropna(thresh=1, axis=1)
tmp = background_sheeps.to_json()
background_sheeps = pd.read_json(tmp).dropna(thresh=1, axis=1)

foreground_sheeps = fix_id(foreground_sheeps)
background_sheeps = fix_id(background_sheeps)

foreground_sheeps = add_geometry(foreground_sheeps)
background_sheeps = add_geometry(background_sheeps)
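
`add_geometry` treats `locations` as a list of GeoJSON-like points and keeps only the first one. GeoJSON stores a position as `[longitude, latitude]`, which matches the `(x, y)` order expected by `shapely.geometry.Point`. The sketch below shows the same extraction with plain tuples, so it runs without shapely. `first_position` and the sample values are hypothetical, and unlike `add_geometry` it returns `None` rather than the original value when no location is present.

```python
import math


def first_position(value):
    """Return (lon, lat) of the first location, or None when there is none."""
    if isinstance(value, list) and value:
        lon, lat = value[0]["coordinates"][:2]
        return (lon, lat)
    return None


# Hypothetical 'locations' values: a one-point list, a two-point list, a missing value
with_one = [{"type": "Point", "coordinates": [9.19, 45.46]}]
with_two = [
    {"type": "Point", "coordinates": [2.35, 48.86]},
    {"type": "Point", "coordinates": [-3.70, 40.42]},
]
missing = math.nan

print(first_position(with_one), first_position(with_two), first_position(missing))
```

Samples with more than one recorded location therefore contribute a single point to the map, namely the first entry of the list.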

Next, we need to get the world boundary features:

In [None]:
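# gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0, so this call needs an older release
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))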

Now convert each `DataFrame` to a `GeoDataFrame`, explicitly stating the *coordinate reference system*, which is *WGS84 (EPSG:4326)*:

In [None]:
background_sheeps = gpd.GeoDataFrame(background_sheeps, crs="EPSG:4326")
background_sheeps = background_sheeps.set_crs(world.crs)

foreground_sheeps = gpd.GeoDataFrame(foreground_sheeps, crs="EPSG:4326")
foreground_sheeps = foreground_sheeps.set_crs(world.crs)
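
Declaring a CRS with `crs=` or `set_crs` only labels the coordinates; it does not transform or validate them (reprojection is what `to_crs` does). Since EPSG:4326 expects longitude in [-180, 180] and latitude in [-90, 90], a cheap range check before labelling can catch some swapped or projected coordinates. A sketch, with the hypothetical helper `looks_like_wgs84` and made-up coordinates:

```python
def looks_like_wgs84(lon, lat):
    """True when (lon, lat) falls inside the valid range of geographic degrees."""
    return -180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0


# Hypothetical coordinates: a valid pair, the same pair with lon/lat swapped,
# and a pair expressed in metres in some projected system
candidates = {
    "valid": (151.21, -33.87),
    "swapped": (-33.87, 151.21),
    "projected": (500000.0, 4649776.0),
}

flags = {name: looks_like_wgs84(lon, lat) for name, (lon, lat) in candidates.items()}
print(flags)
```

The check is only a heuristic: a swapped pair whose values both lie within [-90, 90] passes it, so plotting the points on the world map remains the more reliable verification.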

Now plot the background and foreground sheep samples on the world map:

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
ax.set_aspect('equal')
world.plot(ax=ax, color='white', edgecolor='gray')
plot = foreground_sheeps.plot(ax=ax, marker='o', color='red', markersize=5, label="foreground")
plot = background_sheeps.plot(ax=ax, marker='x', color='blue', markersize=5, label="background")
plot = ax.legend()
_ = plt.title("SMARTER Sheep Samples")

<a id='foreground-background-samples-goat'></a>
### Foreground / Background samples for goat
The same steps are repeated for the goat samples:

In [None]:
foreground_goats = SampleGoat.objects.filter(dataset__in=Dataset.objects.filter(type_="foreground")).fields(country=True, breed=True, chip_name=True, locations=True)
background_goats = SampleGoat.objects.filter(dataset__in=Dataset.objects.filter(type_="background")).fields(country=True, breed=True, chip_name=True, locations=True)
samples_goat = pd.Series({"foreground": foreground_goats.count(), "background": background_goats.count()}, name="Goats")
plot = samples_goat.plot.pie(y="Goats", figsize=(8,8), shadow=True, rotatelabels=45, autopct='%1.1f%%')

In [None]:
tmp = foreground_goats.to_json()
foreground_goats = pd.read_json(tmp).dropna(thresh=1, axis=1)
tmp = background_goats.to_json()
background_goats = pd.read_json(tmp).dropna(thresh=1, axis=1)

foreground_goats = fix_id(foreground_goats)
background_goats = fix_id(background_goats)

foreground_goats = add_geometry(foreground_goats)
background_goats = add_geometry(background_goats)

In [None]:
background_goats = gpd.GeoDataFrame(background_goats, crs="EPSG:4326")
background_goats = background_goats.set_crs(world.crs)

foreground_goats = gpd.GeoDataFrame(foreground_goats, crs="EPSG:4326")
foreground_goats = foreground_goats.set_crs(world.crs)

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
ax.set_aspect('equal')
world.plot(ax=ax, color='white', edgecolor='gray')
plot = foreground_goats.plot(ax=ax, marker='o', color='red', markersize=5, label="foreground")
plot = background_goats.plot(ax=ax, marker='x', color='blue', markersize=5, label="background")
plot = ax.legend()
_ = plt.title("SMARTER Goat Samples")