# SMARTER SUMMARY (2022/09/26)
* [Dataset composition](#datasets-composition)
    - [Foreground / background datasets](#foreground-vs-background-datasets)
    - [Datasets by chip type](#datasets-by-chip-type)
* [Samples composition](#samples-composition)
    - [Foreground / background samples for sheep](#foreground-background-samples-sheep)
    - [Sheep samples by chip type](#sheep-sample-by-chip-type)
    - [Sheep sample locations](#sheep-sample-locations)
    - [Foreground / background samples for goat](#foreground-background-samples-goat)
        - [Greece foreground goat data](#greece-foreground-goat-data)
        - [Sweden foreground goat data](#sweden-foreground-goat-data)

In [None]:
from collections import defaultdict
import pandas as pd
import matplotlib.pyplot as plt
from shapely.geometry import Point
import geopandas as gpd
import pycountry
from mpl_toolkits.axes_grid1 import make_axes_locatable

from src.features.smarterdb import global_connection, Dataset, SampleSheep, SampleGoat

conn = global_connection()

def fix_id(df: pd.DataFrame):
    """Parse id and make index"""
    df['_id'] = df['_id'].apply(lambda val: val['$oid'])
    df = df.set_index("_id")
    return df

def add_geometry(df: pd.DataFrame):
    """Add a geometry column from locations"""
    
    def get_geometry(value):
        if isinstance(value, dict):
            return Point(*value['coordinates'][0])
        return value

    df['geometry'] = df['locations'].apply(get_geometry)
    return df
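
The two helpers above assume a particular record shape in the JSON export: `_id` arrives as MongoDB extended JSON (`{"$oid": ...}`), and `locations`, when present, is a GeoJSON-like dict whose `coordinates` hold a list of points, of which only the first is used. The sketch below mimics that logic on a hand-written record with the standard library only. The record values and the names `extract_oid` and `first_point` are hypothetical, chosen for illustration.

```python
import json

# Hand-written record imitating one exported sample (all values are made up)
raw = '''
{
  "_id": {"$oid": "000000000000000000000001"},
  "locations": {"type": "MultiPoint", "coordinates": [[9.19, 45.46], [9.20, 45.47]]}
}
'''

def extract_oid(record):
    """Return the plain string id, as fix_id does for each row."""
    return record["_id"]["$oid"]

def first_point(value):
    """Return the first (x, y) pair, or the value unchanged if it is not a dict."""
    if isinstance(value, dict):
        return tuple(value["coordinates"][0])
    return value

record = json.loads(raw)
oid = extract_oid(record)
point = first_point(record["locations"])
missing = first_point(None)  # a sample without locations passes through untouched
print(oid, point, missing)
```

Only the first coordinate pair becomes a `Point`, so a sample with several recorded locations is drawn once.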

<a id='datasets-composition'></a>
## Dataset composition
Read the dataset information and describe how many *background/foreground* datasets we have:

In [None]:
tmp = Dataset.objects.filter(type_="genotypes").to_json()
datasets = pd.read_json(tmp).dropna(thresh=1, axis=1)
datasets['type'] = datasets['type'].apply(lambda val: val[1])
datasets = fix_id(datasets)
datasets[['breed', 'country', 'species', 'type', 'partner', 'chip_name', 'n_of_individuals']]
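
The `lambda val: val[1]` step suggests that each dataset stores `type` as a two-element list whose second element carries the foreground/background label. A small sketch under that assumption, with invented values:

```python
from collections import Counter

# Hypothetical 'type' values as they might come out of the JSON export
raw_types = [
    ["genotypes", "background"],
    ["genotypes", "foreground"],
    ["genotypes", "foreground"],
]

# Keep only the second element, mirroring `val[1]` in the cell above
labels = [value[1] for value in raw_types]

# Count how often each label occurs, as value_counts does for the pie chart
counts = Counter(labels)
print(counts)
```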

<a id='foreground-vs-background-datasets'></a>
### Foreground / background datasets
Plotting *foreground* VS *background* datasets:

In [None]:
plot = datasets.value_counts("type").plot.pie(y="type", figsize=(8,8), shadow=True, startangle=45, rotatelabels=45, autopct='%1.1f%%')
_ = plt.title("Foreground vs Background genotype datasets")
plt.ylabel(None)
plt.savefig('smarter-fgVsbg-datasets.png', dpi=300, bbox_inches='tight')
plt.show()

<a id='datasets-by-chip-type'></a>
### Datasets by chip type
Plotting datasets by *chip type*:

In [None]:
plot = datasets['chip_name'].value_counts().plot.pie(y="chip", figsize=(8,8), shadow=True, startangle=45, rotatelabels=45, autopct='%1.1f%%')
_ = plt.title("Datasets by chip type")
plt.ylabel(None)
plt.savefig('smarter-datasets-by-chips.png', dpi=300, bbox_inches='tight')
plt.show()

<a id='samples-composition'></a>
## Samples composition
<a id='foreground-background-samples-sheep'></a>
### Foreground / Background samples for sheep
Get the *background* and *foreground* sheep samples. This takes two queries, since the type is a `Dataset` property:

In [None]:
foreground_sheeps = SampleSheep.objects.filter(type_="foreground").fields(country=True, breed=True, chip_name=True, locations=True)
background_sheeps = SampleSheep.objects.filter(type_="background").fields(country=True, breed=True, chip_name=True, locations=True)
samples_sheep = pd.Series({"foreground": foreground_sheeps.count(), "background": background_sheeps.count()}, name="Sheeps")
plot = samples_sheep.plot.pie(y="Sheeps", figsize=(8,8), shadow=True, startangle=90, rotatelabels=45, autopct='%1.1f%%')
_ = plt.title("Background VS Foreground sheep")
plt.ylabel(None)
plt.savefig('sheep-foreground-vs-background-pie.png', dpi=300, bbox_inches='tight')
plt.show()

Most of the data comes from the foreground datasets (after the latest inserts).

<a id='sheep-sample-by-chip-type'></a>
### Sheep samples by chip type
Try to determine the sample composition by chip type:

In [None]:
sheep_by_chip = defaultdict(list)
for chip_name in SampleSheep.objects.distinct("chip_name"):
    sheep_by_chip['chip_name'].append(chip_name) 
    sheep_by_chip['count'].append(SampleSheep.objects.filter(chip_name=chip_name).count())
sheep_by_chip = pd.DataFrame.from_dict(sheep_by_chip).set_index("chip_name")
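
The loop above issues one `count()` query per chip and collects the results column by column in a `defaultdict(list)`, a layout that `pd.DataFrame.from_dict` accepts directly. Here is a sketch of the same pattern without a database. `fake_counts` is a hypothetical stand-in for the query results, and its chip names and numbers are invented.

```python
from collections import defaultdict

# Hypothetical stand-in for per-chip query results (names and numbers are invented)
fake_counts = {"chip_a": 120, "chip_b": 30, "chip_c": 50}

columns = defaultdict(list)
for chip_name, count in fake_counts.items():
    columns["chip_name"].append(chip_name)
    columns["count"].append(count)

# Each key is a column; all lists have the same length, one entry per row
total = sum(columns["count"])
shares = {
    name: round(100 * n / total, 1)
    for name, n in zip(columns["chip_name"], columns["count"])
}
print(dict(columns))
print(shares)
```

The `shares` values correspond to the percentages that `autopct='%1.1f%%'` prints on the pie chart.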

In [None]:
sheep_by_chip.plot.pie(y="count", figsize=(8,8), shadow=True, startangle=-45, rotatelabels=45, autopct='%1.1f%%', legend=False)
_ = plt.title("Sheep samples by chip type")
plt.ylabel(None)
plt.show()

<a id='sheep-sample-locations'></a>
### Sheep sample locations
Where are the samples located? Where are the *background / foreground* data? Read the data from the database and then add a geometry feature for the `GeoDataFrame`:

In [None]:
tmp = foreground_sheeps.to_json()
foreground_sheeps = pd.read_json(tmp).dropna(axis=0)
tmp = background_sheeps.to_json()
background_sheeps = pd.read_json(tmp).dropna(axis=0)

foreground_sheeps = fix_id(foreground_sheeps)
background_sheeps = fix_id(background_sheeps)

foreground_sheeps = add_geometry(foreground_sheeps)
background_sheeps = add_geometry(background_sheeps)

Next, we need to get the world boundary features:

In [None]:
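# NOTE: `gpd.datasets` was deprecated in GeoPandas 0.14 and removed in 1.0,
# so this call needs an older GeoPandas release
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))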

Now convert each `DataFrame` to a `GeoDataFrame`, explicitly stating the *coordinate system*, which is *WGS84 (EPSG:4326)*:

In [None]:
background_sheeps = gpd.GeoDataFrame(background_sheeps, crs="EPSG:4326")
background_sheeps = background_sheeps.set_crs(world.crs)

foreground_sheeps = gpd.GeoDataFrame(foreground_sheeps, crs="EPSG:4326")
foreground_sheeps = foreground_sheeps.set_crs(world.crs)

Now draw background and foreground sheep on a map:

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
ax.set_aspect('equal')
world.plot(ax=ax, color='white', edgecolor='gray')
plot = foreground_sheeps.plot(ax=ax, marker='o', color='red', markersize=5, label="foreground")
plot = background_sheeps.plot(ax=ax, marker='x', color='blue', markersize=5, label="background")
plot = ax.legend()
_ = plt.title("SMARTER Sheep Samples")
plt.savefig('sheep-foreground-vs-background-map.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
world.plot(ax=ax, color='white', edgecolor='gray')
plot = foreground_sheeps.plot(ax=ax, marker='o', color='red', markersize=5, label="foreground")
plot = background_sheeps.plot(ax=ax, marker='x', color='blue', markersize=5, label="background")
plot = ax.legend()
_ = plt.title("SMARTER Sheep Samples foreground")
_ = plt.xlim([10,140])
_ = plt.ylim([10, 75])
plt.savefig('sheep-foreground.png', dpi=300, bbox_inches='tight')
plt.show()
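
`plt.xlim` and `plt.ylim` only crop the view: every point is still plotted, and those outside the window are simply not visible. To know how many samples fall inside the window, the same bounds can be applied as a filter. A minimal sketch in plain Python, where the coordinate pairs are invented and `in_window` is a hypothetical helper:

```python
# Window used in the zoomed map above: longitude 10..140, latitude 10..75
XLIM = (10, 140)
YLIM = (10, 75)

# Invented (longitude, latitude) pairs
points = [(12.5, 41.9), (-3.7, 40.4), (116.4, 39.9), (18.4, -33.9)]

def in_window(point, xlim=XLIM, ylim=YLIM):
    """Return True if the point lies inside the plotted window."""
    x, y = point
    return xlim[0] <= x <= xlim[1] and ylim[0] <= y <= ylim[1]

visible = [p for p in points if in_window(p)]
print(len(visible), "of", len(points), "points fall inside the window")
```

With a `GeoDataFrame`, the `.cx` coordinate indexer offers the same kind of bounding-box selection.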

<a id='foreground-background-samples-goat'></a>
### Foreground / Background samples for goat

In [None]:
foreground_goats = SampleGoat.objects.filter(dataset__in=Dataset.objects.filter(type_="foreground")).fields(country=True, breed=True, chip_name=True, locations=True, metadata=True)
background_goats = SampleGoat.objects.filter(dataset__in=Dataset.objects.filter(type_="background")).fields(country=True, breed=True, chip_name=True, locations=True, metadata=True)
samples_goat = pd.Series({"foreground": foreground_goats.count(), "background": background_goats.count()}, name="Goats")
plot = samples_goat.plot.pie(y="Goats", figsize=(8,8), shadow=True, rotatelabels=45, autopct='%1.1f%%')
_ = plt.title("Background VS Foreground goats")
plt.ylabel(None)
plt.savefig('goat-foreground-vs-background-pie.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
tmp = foreground_goats.to_json()
foreground_goats = pd.read_json(tmp).dropna(thresh=1, axis=1)
tmp = background_goats.to_json()
background_goats = pd.read_json(tmp).dropna(thresh=1, axis=1)

foreground_goats = fix_id(foreground_goats)
background_goats = fix_id(background_goats)

foreground_goats = add_geometry(foreground_goats)
background_goats = add_geometry(background_goats)

In [None]:
background_goats = gpd.GeoDataFrame(background_goats, crs="EPSG:4326")
background_goats = background_goats.set_crs(world.crs)

foreground_goats = gpd.GeoDataFrame(foreground_goats, crs="EPSG:4326")
foreground_goats = foreground_goats.set_crs(world.crs)

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
ax.set_aspect('equal')
world.plot(ax=ax, color='white', edgecolor='gray')
plot = foreground_goats.plot(ax=ax, marker='o', color='red', markersize=5, label="foreground")
plot = background_goats.plot(ax=ax, marker='x', color='blue', markersize=5, label="background")
plot = ax.legend()
_ = plt.title("SMARTER Goat Samples")
plt.savefig('goat-foreground-vs-background-map.png', dpi=300, bbox_inches='tight')
plt.show()

<a id='greece-foreground-goat-data'></a>
#### Greece foreground goat data

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
world[world.continent == "Europe"].plot(ax=ax, color='white', edgecolor='gray')
plot = foreground_goats.plot(ax=ax, marker='o', color='red', markersize=5, label="foreground")
plot = background_goats.plot(ax=ax, marker='x', color='blue', markersize=5, label="background")
plot = ax.legend()
_ = plt.title("SMARTER Goat Samples foreground")
_ = plt.xlim([15,35])
_ = plt.ylim([30, 50])
plt.savefig('greece-goat-foreground.png', dpi=300, bbox_inches='tight')
plt.show()

<a id='sweden-foreground-goat-data'></a>
#### Sweden foreground goat data

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
world[world.continent == "Europe"].plot(ax=ax, color='white', edgecolor='gray')
plot = foreground_goats.plot(ax=ax, marker='o', color='red', markersize=5, label="foreground")
plot = background_goats.plot(ax=ax, marker='x', color='blue', markersize=5, label="background")
plot = ax.legend()
_ = plt.title("SMARTER Goat Samples foreground")
_ = plt.xlim([0,30])
_ = plt.ylim([50, 70])
plt.savefig('sweden-goat-foreground.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
foreground_goats['region'] = foreground_goats[foreground_goats.country == "Greece"]['metadata'].apply(lambda metadata: metadata['region'])

In [None]:
foreground_goats.value_counts('region')
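
Indexing `metadata['region']` raises `KeyError` for any Greek sample whose metadata lacks that key, and `TypeError` if the metadata value is not a dict at all. A defensive variant is sketched below on plain dicts. The records are invented and `get_region` is a hypothetical helper, not part of the notebook's code.

```python
from collections import Counter

# Invented records; the second has no 'region', the third has no metadata at all
samples = [
    {"country": "Greece", "metadata": {"region": "region_1"}},
    {"country": "Greece", "metadata": {"note": "no region recorded"}},
    {"country": "Greece", "metadata": None},
    {"country": "Sweden", "metadata": {"region": "region_2"}},
]

def get_region(metadata):
    """Return metadata['region'] if available, otherwise None."""
    if isinstance(metadata, dict):
        return metadata.get("region")
    return None

# Restrict to Greek samples, as the cell above does
regions = [get_region(s["metadata"]) for s in samples if s["country"] == "Greece"]

# Count only the samples that actually carry a region
by_region = Counter(r for r in regions if r is not None)
print(regions, by_region)
```

The same function could be passed to `.apply` in place of the lambda, leaving missing regions empty instead of stopping the cell.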

## Draw samples by country
### Sheep by country

Collect all sheep samples in a single dataframe, count them by country and derive the *ISO3* code for each country name:

In [None]:
all_sheeps = pd.concat([foreground_sheeps, background_sheeps], axis=0)
sheeps_by_country = all_sheeps.value_counts('country')
sheeps_by_country = pd.DataFrame(data=sheeps_by_country, columns=['count']).reset_index()
sheeps_by_country['iso_a3'] = sheeps_by_country['country'].apply(lambda country: pycountry.countries.search_fuzzy(country)[0].alpha_3)

Now fix the *ISO3* codes in the world dataset, merge the two dataframes on *ISO3* and cast the result to a `GeoDataFrame`:

In [None]:
def fix_iso_a3(name, iso_a3):
    if iso_a3 == '-99':
        try:
            return pycountry.countries.search_fuzzy(name.split()[-1])[0].alpha_3
        except LookupError:
            return "-99"
    else:
        return iso_a3
    
world['iso_a3'] = world[['name', 'iso_a3']].apply(lambda df: fix_iso_a3(df['name'], df['iso_a3']), axis=1)
sheeps_by_country = pd.merge(sheeps_by_country, world, how="outer", on='iso_a3')[['country', 'iso_a3', 'continent', 'geometry', 'count']]
sheeps_by_country = gpd.GeoDataFrame(sheeps_by_country)
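
`fix_iso_a3` only touches rows whose code is the `-99` placeholder and retries the lookup with the last word of the country name. The sketch below reproduces that control flow with a tiny dictionary in place of `pycountry`. The `KNOWN` table and the `lookup` function are hypothetical stand-ins, not the `pycountry` API, and the input rows are illustrative. It also makes one side effect visible: a name such as "N. Cyprus" resolves to the code of "Cyprus".

```python
# Hypothetical stand-in for pycountry: a tiny name -> ISO 3166-1 alpha-3 table
KNOWN = {"France": "FRA", "Norway": "NOR", "Cyprus": "CYP"}

def lookup(name):
    """Return the alpha-3 code for name, or raise LookupError like a failed search."""
    try:
        return KNOWN[name]
    except KeyError:
        raise LookupError(name)

def fix_iso_a3(name, iso_a3):
    """Replace the '-99' placeholder using the last word of the country name."""
    if iso_a3 != "-99":
        return iso_a3
    try:
        return lookup(name.split()[-1])
    except LookupError:
        # Nothing matched: keep the placeholder so the row stays recognisable
        return "-99"

# Illustrative (name, code) rows
rows = [("Italy", "ITA"), ("France", "-99"), ("N. Cyprus", "-99"), ("Somaliland", "-99")]
fixed = [fix_iso_a3(name, code) for name, code in rows]
print(fixed)
```

Rows that keep `-99` cannot match any sample country in the merge, so they end up among the missing values on the map.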

Now draw a choropleth map using `matplotlib`:

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.1)
plot = sheeps_by_country.plot(
    column='count', 
    ax=ax, 
    legend=True, 
    cax=cax,
    missing_kwds={"color": "lightgrey", "label": "Missing values"})
_ = ax.set_title("SMARTER Sheep Samples by country")
plt.savefig('sheep-by-country.png', dpi=300, bbox_inches='tight')
plt.show()

### Goat by country

In [None]:
all_goats = pd.concat([foreground_goats, background_goats], axis=0)
goats_by_country = all_goats.value_counts('country')
goats_by_country = pd.DataFrame(data=goats_by_country, columns=['count']).reset_index()
goats_by_country['iso_a3'] = goats_by_country['country'].apply(lambda country: pycountry.countries.search_fuzzy(country)[0].alpha_3)

In [None]:
goats_by_country = pd.merge(goats_by_country, world, how="outer", on='iso_a3')[['country', 'iso_a3', 'continent', 'geometry', 'count']]
goats_by_country = gpd.GeoDataFrame(goats_by_country)

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.1)
plot = goats_by_country.plot(
    column='count', 
    ax=ax, 
    legend=True, 
    cax=cax,
    missing_kwds={"color": "lightgrey", "label": "Missing values"})
_ = ax.set_title("SMARTER Goat Samples by country")
plt.savefig('goat-by-country.png', dpi=300, bbox_inches='tight')
plt.show()