<img src='../images/xebia-logo.png' width='300px' align='right' style="padding: 15px">

# Python Packaging

In this notebook you will practice organizing Python code into a package.

The first step will be to set up the basic infrastructure required to run these notebooks. 

**If you are comfortable using git we recommend checking out a new branch to follow along (`git checkout -b BRANCH_NAME`) during the training.**

You need:

1. A `pyproject.toml` file in the root of the repository.
    - You can use `poetry init` to create it
    
1. `pandas` installed as a dependency with `poetry add pandas`. Note that this creates a virtual environment if one does not already exist.

1. `jupyter` installed as a development dependency, so that we can run these notebooks from the virtual environment: `poetry add -G dev jupyter`.
    - **Question:** Why are we adding `jupyter` as a dev-dependency and `pandas` as a normal dependency?

1. Make the `IPython` kernel from the virtual environment accessible to VSCode (and the rest of the system) by running `python -m ipykernel install --user --name=venv` **from the virtual environment**.
    - *Hint:* use `poetry run ...` or `poetry shell`. What was the difference between these two commands?
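
Once the kernel is registered and selected, one quick way to confirm that a notebook really runs inside the virtual environment is to inspect the interpreter from a cell. This is a minimal sketch using only the standard library; the paths it prints depend on your machine and on where Poetry created the environment.

```python
import sys

# Path of the interpreter that is executing this notebook or script.
print(sys.executable)

# In a virtual environment, sys.prefix points at the environment itself,
# while sys.base_prefix points at the Python installation it was created from.
in_virtualenv = sys.prefix != sys.base_prefix
print(f"Running inside a virtual environment: {in_virtualenv}")
```

If the second line prints `False`, the notebook is probably still using a system-wide kernel rather than the one installed from the virtual environment.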

Now let's have a look at the use case this training uses as an example. It concerns an animal shelter that is trying to predict the outcome (e.g. adopted, transferred) of the animals that come through it.

In [None]:
import pandas as pd
import re

def load_data(path):
    """Load the data and convert the column names.

    Parameters
    ----------
    path : str
        Path to data
    Returns
    -------
    df : pandas.DataFrame
        DataFrame with data
    """
    df = (
        pd.read_csv(path, parse_dates=["DateTime"])
        .rename(columns=lambda x: x.replace("upon", "Upon"))
        .rename(columns=convert_camel_case)
        .fillna("Unknown")
    )
    return df


def convert_camel_case(name):
    """Convert camelCaseString to snake_case_string."""
    s1 = re.sub("(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub("([a-z0-9])([A-Z])", r"\1_\2", s1).lower()

In [None]:
animal_outcomes = load_data('../data/train.csv')
animal_outcomes.head()

Feel free to spend some time doing some preliminary data exploration.
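
While exploring, it may help to know how `load_data` arrives at the column names you see. The sketch below repeats `convert_camel_case` from this notebook and applies it to a few raw column names. The names in `raw_columns` are assumptions chosen for illustration, not a confirmed listing of the columns in the CSV file.

```python
import re


def convert_camel_case(name):
    """Convert camelCaseString to snake_case_string (same logic as in load_data)."""
    s1 = re.sub("(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub("([a-z0-9])([A-Z])", r"\1_\2", s1).lower()


# Hypothetical raw column names, used for illustration only.
raw_columns = ["AnimalType", "SexuponOutcome", "AgeuponOutcome", "DateTime"]

for column in raw_columns:
    # Mirror load_data: capitalise "upon" first so that it becomes its own word.
    fixed = column.replace("upon", "Upon")
    print(f"{column} -> {convert_camel_case(fixed)}")
```

The `replace("upon", "Upon")` step matters: without it, a name such as `SexuponOutcome` would be converted to `sexupon_outcome` instead of `sex_upon_outcome`.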

From this dataset you can generate the following features about each animal that may be helpful to train a machine learning model later on.

- boolean indicator for whether it is a dog
- boolean indicator for whether it has a name
- categorical feature indicating its sex
- categorical feature indicating whether it is neutered
- categorical feature indicating its hair type
- age upon outcome in days
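
The last feature involves a small unit conversion. As a rough sketch, the snippet below turns an age string into days using the same approximations as the `add_features` function in this notebook (365 days per year, 30 days per month). The helper `age_to_days` and the example strings are hypothetical and serve only to illustrate the arithmetic.

```python
# Days per unit, using the same approximations as add_features.
DAYS_PER_PERIOD = {
    "year": 365, "years": 365,
    "month": 30, "months": 30,
    "week": 7, "weeks": 7,
    "day": 1, "days": 1,
}


def age_to_days(age):
    """Convert a string such as '2 years' to days (hypothetical helper)."""
    if age == "Unknown":
        # Missing ages are filled with "Unknown" by load_data.
        return None
    count, period = age.split()
    return float(count) * DAYS_PER_PERIOD[period]


# Assumed example inputs in the "<number> <unit>" format.
for example in ["2 years", "3 weeks", "1 month", "Unknown"]:
    print(example, "->", age_to_days(example))
```

Because months and years are approximated, the result is an estimate of the age rather than an exact day count.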

You can add all of these features to the dataset with the functions below.

In [None]:
import numpy as np
import pandas as pd


def add_features(df):
    """Add some features to our data.
    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame with data (see load_data)
    Returns
    -------
    with_features : pandas.DataFrame
        DataFrame with some column features added
    """
    df['is_dog'] = check_is_dog(df['animal_type'])


    # Check if it has a name.
    df['has_name'] = df['name'].str.lower() != 'unknown'


    # Get sex.
    sexUponOutcome = df['sex_upon_outcome']
    sex = pd.Series('unknown', index=sexUponOutcome.index)

    sex.loc[sexUponOutcome.str.endswith('Female')] = 'female'
    sex.loc[sexUponOutcome.str.endswith('Male')] = 'male'
    df['sex'] = sex



    # Check if neutered.
    neutered = sexUponOutcome.str.lower()
    neutered.loc[neutered.str.contains('neutered')] = 'fixed'
    neutered.loc[neutered.str.contains('spayed')] = 'fixed'


    neutered.loc[neutered.str.contains('intact')] = 'intact'
    neutered.loc[~neutered.isin(['fixed', 'intact'])] = 'unknown'


    df['neutered'] = neutered


    # Get hair type.

    hairType = df['breed'].str.lower()
    Valid_hair_types = ['shorthair', 'medium hair', 'longhair']



    for hair in Valid_hair_types:
        is_hair_type = hairType.str.contains(hair)
        hairType[is_hair_type] = hair

    hairType[~hairType.isin(Valid_hair_types)] = 'unknown'


    df['hair_type'] = hairType


    # Age in days upon outcome.

    Split_Age = df['age_upon_outcome'].str.split()
    time = Split_Age.apply(lambda x: x[0] if x[0] != 'Unknown' else np.nan)
    period = Split_Age.apply(lambda x: x[1] if x[0] != 'Unknown' else None)
    period_Mapping = {'year': 365, 'years': 365, 'weeks': 7, 'week': 7,
                      'month': 30, 'months': 30, 'days': 1, 'day': 1}
    days_upon_outcome = time.astype(float) * period.map(period_Mapping)
    df['days_upon_outcome'] = days_upon_outcome



    return df

def check_is_dog(animal_type):
    """Check if the animal is a dog, otherwise return False.
    Parameters
    ----------
    animal_type : pandas.Series
        Type of animal
    Returns
    -------
    result : pandas.Series
        Dog or not
    """
    # Check if it's either a cat or a dog.
    is_cat_dog = animal_type.str.lower().isin(['dog', 'cat'])
    if not is_cat_dog.all():
        print('Found something else but dogs and cats:\n%s',
              animal_type[~is_cat_dog])
        raise RuntimeError("Found pets that are not dogs or cats.")
    is_dog = animal_type.str.lower() == 'dog'
    return is_dog

In [None]:
animal_outcomes = load_data('../data/train.csv')
with_features = add_features(animal_outcomes)
with_features.head()

There are some bad practices going on in the functions above, but don't worry about their quality for now. Let's focus on packaging the code.

## <mark>Exercise</mark>
Your goal is to copy-paste the code from the cells above into a package that exports the functionality that a user (e.g. an analyst writing a report in a notebook or a service serving predictions) would *use*. 

They should be able to import the functions as in the cell below:

In [None]:
from animal_shelter.data import load_data
from animal_shelter.features import add_features
animal_outcomes = load_data('../data/test.csv')
with_features = add_features(animal_outcomes)
with_features.head()

*Hint:* the location of the package should be in this folder structure: `repository_root/src/animal_shelter/__init__.py`
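
To make the layout concrete, the sketch below creates an empty skeleton of that folder structure. It uses a temporary directory as a stand-in for the repository root, and the module names `data.py` and `features.py` are inferred from the import statements above. In practice you would simply create these files in your editor and paste the functions into them.

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for the repository root in this sketch.
repository_root = Path(tempfile.mkdtemp())

package_dir = repository_root / "src" / "animal_shelter"
package_dir.mkdir(parents=True)

# __init__.py marks the directory as a package; data.py and features.py
# are module names inferred from the imports in the exercise.
for module in ["__init__.py", "data.py", "features.py"]:
    (package_dir / module).touch()

# List the files that were created, relative to the repository root.
for path in sorted(repository_root.rglob("*.py")):
    print(path.relative_to(repository_root))
```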

*Hint:* Your `pyproject.toml` file should also point to the path of the code.
```toml
[tool.poetry]
packages = [ { include = "animal_shelter", from = "src" } ]
```
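
After editing `pyproject.toml`, running `poetry install` should install the package into the virtual environment. One way to check whether the package is visible to the current kernel, before importing anything from it, is sketched below with the standard library. It assumes the package is named `animal_shelter`, as in the exercise.

```python
import importlib.util

# find_spec returns None when a top-level package cannot be found
# on the import path of the current interpreter.
spec = importlib.util.find_spec("animal_shelter")

if spec is None:
    print("animal_shelter is not importable yet; try running `poetry install`.")
else:
    print(f"animal_shelter found at: {spec.origin}")
```

If the package is not found even after installing, it is worth checking that the notebook uses the kernel from the virtual environment.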

You can run the cell below to automatically reload changes to the source code of any imported package, which is very useful during development.

In [None]:
%load_ext autoreload
%autoreload 2