# More testing practice...
<img src='../images/xebia-logo.png' width='300px' align='right' style="padding: 15px">

### ... and high-quality-code-writing practice!

Your goal for this section of the training is to refactor code written by a data scientist that implements an ML application for the animal shelter use case.

A good design principle is that *good* functions are *testable* functions. So try to break down the code at the end of the notebook into the smallest units that you think make sense to test.

Once the code is refactored, you should be able to call the following functions:

In [None]:
from animal_shelter.model.train import train
from animal_shelter.model.predict import predict

train("../data/train.csv", "../output/model.pkl")
predict("../data/test.csv", "../output/model.pkl")

1. Create a subpackage called `model` within `animal_shelter`.
2. Create two modules called `train` and `predict` within that subpackage.
3. Copy-paste the code from below into the respective modules, and make sure that all imports are correct.
4. Refactor the code into smaller functions, and write unit tests for their essential behaviour.
    - Think about which individual steps *make sense to test*.
    - Think about which parameter types your functions should accept.
    - Here are some pointers:
        - You probably want a function called `train` that accepts a `Path` (from `pathlib`) to the training data and another `Path` to the location where the fitted model should be saved (e.g. `output/model.pkl`).
        - You probably want to abstract away the process of building the `Pipeline` into a separate function so that you can test that it's constructed properly.
        - You probably also want a function called `predict` that accepts a `Path` to the data and a `Path` to the model used to generate predictions.
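
As a starting point for the second pointer, here is a minimal sketch of what a separate pipeline-building function and its unit test could look like. The names `build_pipeline`, `CAT_FEATURES` and `NUM_FEATURES` are assumptions made for illustration, not part of the existing `animal_shelter` package; the steps themselves are taken from the code at the end of the notebook.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical module-level constants holding the feature names.
CAT_FEATURES = ["animal_type", "is_dog", "has_name", "sex", "hair_type"]
NUM_FEATURES = ["days_upon_outcome"]


def build_pipeline(cat_features=CAT_FEATURES, num_features=NUM_FEATURES):
    """Return an unfitted preprocessing + classifier pipeline (hypothetical helper)."""
    num_transformer = Pipeline(
        steps=[("imputer", SimpleImputer()), ("scaler", StandardScaler())]
    )
    cat_transformer = Pipeline(steps=[("onehot", OneHotEncoder(drop="first"))])
    transformer = ColumnTransformer(
        [
            ("numeric", num_transformer, list(num_features)),
            ("categorical", cat_transformer, list(cat_features)),
        ]
    )
    return Pipeline(
        [("transformer", transformer), ("model", RandomForestClassifier())]
    )


def test_build_pipeline_has_expected_steps():
    # No data is needed: we only inspect how the pipeline is assembled.
    pipeline = build_pipeline()
    assert list(pipeline.named_steps) == ["transformer", "model"]
    assert isinstance(pipeline.named_steps["model"], RandomForestClassifier)
    columns = {
        name: cols
        for name, _, cols in pipeline.named_steps["transformer"].transformers
    }
    assert columns == {"numeric": NUM_FEATURES, "categorical": CAT_FEATURES}
```

Because the function returns an unfitted object, the test runs quickly and does not depend on the CSV files. Whether you also want a test that fits the pipeline on a tiny hand-made `DataFrame` is a design choice left to you.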

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

from animal_shelter.data import load_data
from animal_shelter.features import add_features

raw_data = load_data("../data/train.csv")
with_features = add_features(raw_data)
cat_features = [                                  
    "animal_type",                                        
    "is_dog",                                             
    "has_name",                                           
    "sex",                                                
    "hair_type",                                          
]                                                         
num_features = ["days_upon_outcome"]                  

num_transformer = Pipeline(                                                
    steps=[("imputer", SimpleImputer()), ("scaler", StandardScaler())]     
)                                                                          
cat_transformer = Pipeline(steps=[("onehot", OneHotEncoder(drop="first"))])
transformer = ColumnTransformer(
    # scikit-learn documents `transformers` as a list of tuples
    [
        ("numeric", num_transformer, num_features),
        ("categorical", cat_transformer, cat_features),
    ]
)

clf_model = Pipeline(                                                      
    [("transformer", transformer), ("model", RandomForestClassifier())]    
)
                                                          
X = with_features[cat_features + num_features]
y = with_features["outcome_type"] 

clf_model.fit(X, y)

In [None]:
test_data = load_data("../data/test.csv")
with_features = add_features(test_data)
X_test = with_features[cat_features + num_features]
clf_model.predict(X_test)
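
The two cells above share `clf_model` in memory. Once the code is split into separate `train` and `predict` modules, the fitted model has to travel through a file such as `output/model.pkl`. A minimal sketch of that hand-over, assuming the model is persisted with the standard-library `pickle` module, could look as follows. The helper names `save_model` and `load_model` are hypothetical, and other serialisation choices (e.g. `joblib`) would work equally well.

```python
import pickle
from pathlib import Path
from typing import Any


def save_model(model: Any, model_path: Path) -> None:
    """Pickle a fitted model, creating the parent directory if needed (hypothetical helper)."""
    model_path = Path(model_path)
    model_path.parent.mkdir(parents=True, exist_ok=True)
    with model_path.open("wb") as f:
        pickle.dump(model, f)


def load_model(model_path: Path) -> Any:
    """Load a previously pickled model from disk (hypothetical helper)."""
    with Path(model_path).open("rb") as f:
        return pickle.load(f)
```

Keeping these two steps in their own small functions means they can be unit tested with any picklable object and a temporary directory, without training a real model. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.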