In this notebook, you should implement a first version of a working machine learning model to predict the age of an abalone.

A few guidelines:
- The model does not have to be complex. A simple linear regression model is enough.
- You should use MLflow to track your experiments. You can use the MLflow UI to compare your experiments.
- Do not push any MLflow data to the repository. Only the code to run the experiments is interesting and should be pushed.

### Initializing MLflow

Let's print the tracking server URI, where the experiments and runs are going to be logged. We observe it refers to a local path.

In [None]:
import mlflow

print(f"tracking URI: '{mlflow.get_tracking_uri()}'")

## Loading the data

Let's get back to our abalone data.

In [None]:
from typing import List

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error

DATA_FOLDER = "../data"
input_path = f"{DATA_FOLDER}/abalone.csv"

## 1 - Load data

In [None]:
def load_data(path: str) -> pd.DataFrame:
    """Loading csv file with abalone data"""
    return pd.read_csv(path)


df = load_data(input_path)
df.head()

In [None]:
df.isna().sum()  # the dataframe is clean, without any missing values

## 2 - Prepare the data

Let's prepare the data to make it Machine Learning ready. \
For this, we need to clean it, compute the target (what we want to predict), and compute some features to help the model understand the data better.

### 2-1 Compute the target

We want to predict the age of an abalone. As stated in the dataset description on Kaggle, the age in years is obtained by adding 1.5 to the number of rings.

In [None]:
def compute_target(
    df: pd.DataFrame,
    pre_target_column: str = "Rings",
) -> pd.DataFrame:
    """Add a new column "Age" to the dataframe.
    It will be the target of our machine learning algorithm.
    """
    # Age (in years) = number of rings + 1.5
    df["Age"] = df[pre_target_column] + 1.5
    return df


df = compute_target(df)

In [None]:
df["Age"].describe()

Let's remove outliers and restrict the scope to abalones whose age falls within a plausible range.

In [None]:
df["Age"].plot()
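One common heuristic for choosing such bounds during EDA is the 1.5 × IQR rule. The sketch below applies it to a small made-up series of ages (the values are illustrative, not taken from the dataset). Whether this rule suits the abalone data is a judgment call, and the bounds used in this notebook are set by hand.

```python
import pandas as pd

# Hypothetical ages, for illustration only
ages = pd.Series([4.5, 9.0, 10.5, 12.0, 13.5, 15.0, 45.0])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1

# Values further than 1.5 * IQR from the quartiles are flagged as potential outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]
print(lower, upper, outliers.tolist())
```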

In [None]:
# To adapt depending on the EDA
MIN_AGE = 1
MAX_AGE = 40


def filter_outliers(df: pd.DataFrame, min_age: int, max_age: int) -> pd.DataFrame:
    """Getting rid of extreme values"""
    return df[df["Age"].between(min_age, max_age)]


df = filter_outliers(df, MIN_AGE, MAX_AGE)
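Note that `Series.between` includes both endpoints by default, so rows whose age is exactly `MIN_AGE` or `MAX_AGE` are kept. A minimal sketch on a toy dataframe (made-up values):

```python
import pandas as pd

toy = pd.DataFrame({"Age": [0.5, 1.0, 12.0, 40.0, 43.5]})

# between() is inclusive on both sides by default
kept = toy[toy["Age"].between(1, 40)]
print(kept["Age"].tolist())
```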

### 2-2 Prepare features

#### 2-2-1 Categorical features

Most machine learning models cannot consume raw categorical (string) features directly, so these must be converted to numbers first.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

CATEGORICAL_COLS = ["Sex"]


def encode_categorical_cols(df: pd.DataFrame, categorical_cols: List[str] = None) -> pd.DataFrame:
    """Encode the categorical columns of the dataset (only "Sex" here)
    using an ordinal encoding strategy."""
    if categorical_cols is None:
        categorical_cols = CATEGORICAL_COLS
    df = df.copy()  # work on a copy: df may be a filtered slice of another frame
    encoder = OrdinalEncoder()
    df[categorical_cols] = encoder.fit_transform(df[categorical_cols])
    return df


df = encode_categorical_cols(df)
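Ordinal encoding maps each category to an integer, which implicitly imposes an order (alphabetical here: F < I < M) that a linear model will treat as a numeric scale. One-hot encoding is a common alternative that avoids this. The sketch below contrasts the two on a toy `Sex` column; it is an illustration only, not a change to the pipeline above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

toy = pd.DataFrame({"Sex": ["M", "F", "I", "M"]})

# Ordinal: one integer per category, categories sorted alphabetically
ordinal = OrdinalEncoder().fit_transform(toy[["Sex"]])

# One-hot: one indicator column per category (F, I, M)
one_hot = OneHotEncoder().fit_transform(toy[["Sex"]]).toarray()

print(ordinal.ravel())
print(one_hot)
```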

In [None]:
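from typing import Optional, Tuple


def extract_x_y(
    df: pd.DataFrame,
    dv: Optional[DictVectorizer] = None,
    target_column: str = "Age",
    with_target: bool = True,
) -> Tuple[csr_matrix, Optional[np.ndarray], DictVectorizer]:
    """Separate the training features from the target."""
    # "Rings" is dropped too: "Age" is derived from it, so keeping it would leak the target
    feature_df = df.drop(columns=[target_column, "Rings"], errors="ignore")
    dicts = feature_df.to_dict(orient="records")

    if dv is None:
        # Arguably redundant here, since "Sex" is already ordinal-encoded
        dv = DictVectorizer()
        dv.fit(dicts)

    y = df[target_column].values if with_target else None
    X = dv.transform(dicts)

    return X, y, dv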


X, y, dv = extract_x_y(df)

## Split the data into train and test sets

In [None]:
from sklearn.model_selection import train_test_split

def distinction_train_test(X: csr_matrix, y: np.ndarray):
    """Split the data into train and test sets"""
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = distinction_train_test(X, y)
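By default, `train_test_split` shuffles the rows and holds out 25% of them for the test set, so the split changes on every run. If reproducible metrics matter when comparing MLflow runs, one option is to fix `random_state`. A small sketch on toy arrays (the seed and `test_size` values are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Fixing random_state makes the split identical across calls
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
print(np.array_equal(split_a[3], split_b[3]))
```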


## 3 - Train model

We train a basic linear regression model to have a baseline performance

In [None]:
def train_model(x_train: csr_matrix, y_train: np.ndarray) -> LinearRegression:
    """Train a Linear Regression."""
    lr = LinearRegression()
    lr.fit(x_train, y_train)
    return lr


model = train_model(X_train, y_train)
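As a quick sanity check of what `LinearRegression` does, the sketch below fits it on toy data that follows `y = 2x + 1` exactly, passed as a sparse matrix like the output of `DictVectorizer`. The fitted slope and intercept should be close to 2 and 1.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1 exactly
x_demo = csr_matrix(np.array([[0.0], [1.0], [2.0], [3.0]]))
y_demo = np.array([1.0, 3.0, 5.0, 7.0])

lr_demo = LinearRegression().fit(x_demo, y_demo)
print(lr_demo.coef_, lr_demo.intercept_)
```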

## 4 - Evaluate model

We evaluate the model on train and test data

### 4-1 On train data

In [None]:
def predict_age(input_data: csr_matrix, model: LinearRegression) -> np.ndarray:
    """Predict on train or test data using the model's .predict method"""
    return model.predict(input_data)


def evaluate_model(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Assess the model's performance using the root mean squared error"""
    return root_mean_squared_error(y_true, y_pred)  # the metric may be changed later


prediction = predict_age(X_train, model)
train_me = evaluate_model(y_train, prediction)
train_me

### 4-2 On test data

In [None]:
y_pred_test = predict_age(X_test, model)
test_me = evaluate_model(y_test, y_pred_test)
test_me
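The root mean squared error is expressed in the same unit as the target (years here), and a test value much larger than the train value would hint at overfitting. As a worked example, here is the same quantity computed by hand with NumPy on three made-up predictions:

```python
import numpy as np

y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])

# RMSE = sqrt(mean((y_true - y_pred)^2)); the errors are 1, 0 and -2
rmse_demo = float(np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2)))
print(round(rmse_demo, 3))  # 1.291
```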

In [None]:
# Set the experiment name
mlflow_experiment_path = "/mlflow/linear_reg_abalone"
mlflow.set_experiment(mlflow_experiment_path)

# Start a run
with mlflow.start_run() as run:
    run_id = run.info.run_id

    # Set tags for the run
    mlflow.set_tag("Level", "Development")
    mlflow.set_tag("Team", "Data Science")

    # Load data
    df = load_data(input_path)

    # Compute target
    df = compute_target(df)

    # Filter outliers
    mlflow.log_param("filtered_outliers", True)
    df = filter_outliers(df, MIN_AGE, MAX_AGE)

    # Encode categorical columns
    df = encode_categorical_cols(df)

    # Extract X and y
    X, y, dv = extract_x_y(df)
    X_train, X_test, y_train, y_test = distinction_train_test(X, y)

    mlflow.log_param("train_set_size", X_train.shape[0])
    mlflow.log_param("test_set_size", X_test.shape[0])

    # Train model
    model = train_model(X_train, y_train)

    # Evaluate model
    prediction = predict_age(X_train, model)
    train_me = evaluate_model(y_train, prediction)
    mlflow.log_metric("train_me", train_me)

    # Evaluate model on test set
    y_pred_test = predict_age(X_test, model)
    test_me = evaluate_model(y_test, y_pred_test)
    mlflow.log_metric("test_me", test_me)

    # Log your model
    mlflow.sklearn.log_model(model, "models")

    # Register the model in the MLflow Model Registry (promotion to Production is a separate step)
    mlflow.register_model(f"runs:/{run_id}/models", "linear_reg_abalone")

If the model is satisfactory, we transition the appropriate version to the Production stage. This will make it easier to retrieve the model for predictions.

In [None]:
from mlflow.tracking import MlflowClient

client = MlflowClient()
production_version = 1

client.transition_model_version_stage(
    name="linear_reg_abalone", version=production_version, stage="Production"
)