# MLOps

This notebook contains an example of fitting and evaluating a linear regression model on Titanic data. We use tickets as modelling units (rows, entities), *fare* as the target (possibly log fare) and various features as predictors.

## Data

We use the Titanic dataset and the data preparation from the recent practice (see Data Preparation).

## Tasks

1. Add tracking for the following items to an experiment named `dev-titanic`:
   - Log a regression model
   - Log performance metrics
   - Log the names of the features used (name the parameter "schema")
   - Log the model's class name (name the parameter "model_class")
   - Then, in other runs:
     - Log the summary of the statsmodels OLS model (as text)
     - Log the image `self_description.png`
2. Collect the **same** metrics from various regression models in an experiment named `titanic`
   - Keep the same logging strategy (keep the lines starting with `mlflow.` almost the same), just change the model
   - Use different hyperparameters for different regressors
     - Log the hyperparameters to MLflow
3. OPTIONAL: Compare models using MLflow
   - Find the best performing model using "visual metrics comparison"
   - Find the best performing model using barplots.
4. OPTIONAL: Send some screenshots of a task (preferably the last one) to `samuel.fabo@profinit.eu`
   - Totally not mandatory, but your guide will be glad and would like to see that you learned something useful :)
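
Task 2 asks for the same logging lines across several regressors with different hyperparameters. One way to organise that, sketched here with the standard library only, is to expand a small search space into one flat dictionary per run. The names `SEARCH_SPACE`, `expand_grid` and `run_configs` are hypothetical, and the choice of regressors and grid values is an illustrative assumption, not part of the assignment.

```python
from itertools import product

# Hypothetical search space: regressor class name -> hyperparameter grid.
# The regressors and values are illustrative assumptions only.
SEARCH_SPACE = {
    "LinearRegression": {"fit_intercept": [True]},
    "Ridge": {"alpha": [0.1, 1.0, 10.0]},
    "Lasso": {"alpha": [0.01, 0.1]},
}


def expand_grid(grid):
    """Return every combination of the grid values as a list of dicts."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


# One flat dictionary per planned run: the model class name plus its hyperparameters.
run_configs = [
    {"model_class": name, **params}
    for name, grid in SEARCH_SPACE.items()
    for params in expand_grid(grid)
]

for config in run_configs:
    print(config)
```

Each dictionary could then be passed to `mlflow.log_params` inside its own `mlflow.start_run()` block, so the logging lines stay identical from model to model and only the model construction changes.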

In [4]:
# setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
import statsmodels.api as sm
import statsmodels.formula.api as smf
import mlflow

pd.set_option("display.precision", 2)
plt.rcParams['figure.figsize'] = [8, 6]

In [5]:
# Titanic data reading and preparing - reminder from `Data Preparation` practice
df_t1 = pd.read_csv('titanic_train.csv') # adjust file path
df_t1 = df_t1[['passenger_id', 'ticket', 'pclass', 'fare', 'sex', 'age', 'cabin', 'embarked']]

# cleaning
df_t1 = df_t1[df_t1['fare'].notna() & (df_t1['fare']>0) & (df_t1['embarked'].notna())]

# making new dataset of tickets
# User function
def rate_males(s):
    return np.mean(np.where(s=='male', 1, 0))

### Base table
df_t2_base = df_t1[['ticket', 'pclass', 'fare']].drop_duplicates()
df_t2_base = df_t2_base.set_index('ticket') # setting 'ticket' column as key

### Multiple embarkment solution
df_t2_emb = df_t1.groupby('ticket').agg({'embarked': 'max'})
# no need to set index - groupby + agg sets index by default

### Some chosen features
df_t2_feat = df_t1.groupby('ticket').agg({'ticket': 'count', 'sex': [rate_males],
                                      'age': ['min', 'max', 'mean', 'count'], 'cabin': 'nunique'})
# column names update
df_t2_feat.columns = ['pass_cnt', 'rate_males', 'age_min', 'age_max', 'age_mean', 'age_valid_cnt', 'cabin_cnt']

# sex of the oldest person for the ticket
df_t2_feat_sex_oldest = df_t1.sort_values(by=['ticket', 'age'], ascending=[True, False]) \
    .drop_duplicates('ticket')[['ticket', 'sex']]
df_t2_feat_sex_oldest = df_t2_feat_sex_oldest.set_index('ticket') # setting 'ticket' column as key
df_t2_feat_sex_oldest.columns = ['sex_oldest']

### Joining tables together
df_t2 = df_t2_base.join(df_t2_emb) # join is by default LEFT and index<->index
df_t2 = df_t2.join(df_t2_feat)
df_t2 = df_t2.join(df_t2_feat_sex_oldest)

# mathematical transformations
df_t2['fare_log'] = np.log10(df_t2['fare']) # we use log10 for better interpretation, but simple log is ok, too.
df_t2['fare_per_pass'] = df_t2['fare'] / df_t2['pass_cnt']

# binning, making categories and flags
### pass_cnt
df_t2['pass_cnt_cat'] = pd.cut(df_t2['pass_cnt'], [0, 1, 2, 3, 1000], labels=['1', '2', '3', '4+'])

### age_mean
df_t2['age_mean_cat'] = pd.cut(df_t2['age_mean'], [0, 15, 20, 25, 30, 40, 1000],
                             labels=['15-', '15-20', '20-25', '25-30', '30-40', '40+'])

### cabin_cnt (same approach as pass_cnt)
df_t2['cabin_cnt_cat'] = pd.cut(df_t2['cabin_cnt'], [0, 1, 2, 1000], right=False, labels=['none', '1', '2+'])

# flags
df_t2['flag_child'] = (df_t2['age_min'] < 15)
df_t2['flag_baby'] = (df_t2['age_min'] < 3)

### cleanup
del df_t2_base
del df_t2_emb
del df_t2_feat
del df_t2_feat_sex_oldest

## Linear regression

We learned that *fare* is heavily skewed, so we transformed it with log10 and take *fare_log* as the target. Candidate predictors are *embarked*, *pclass* and *pass_cnt*; the models in this section use only *pclass* and *pass_cnt*.

In [6]:
X = df_t2[['pass_cnt', 'pclass']]
y = df_t2['fare_log']

# fit model
modelA = LinearRegression().fit(X, y)

# get coefficients
print('Intercept: ', modelA.intercept_)
print('Beta coefficients: ', modelA.coef_)

Intercept:  1.7463160900361796
Beta coefficients:  [ 0.17662938 -0.33836997]
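
Because the target is log10 of the fare, the coefficients act multiplicatively on the fare itself. The sketch below back-transforms the values printed above. The helper name `predict_fare` is hypothetical, and the example ticket (one passenger, third class) is an arbitrary illustration.

```python
# Coefficients copied from the output above (target: log10 of fare).
INTERCEPT = 1.7463160900361796
BETA_PASS_CNT = 0.17662938
BETA_PCLASS = -0.33836997


def predict_fare(pass_cnt, pclass):
    """Back-transform the linear prediction from log10(fare) to fare."""
    fare_log = INTERCEPT + BETA_PASS_CNT * pass_cnt + BETA_PCLASS * pclass
    return 10 ** fare_log


# A one-unit increase in a predictor multiplies the predicted fare by 10**beta.
pass_cnt_factor = 10 ** BETA_PASS_CNT  # above 1: each extra passenger raises the fare
pclass_factor = 10 ** BETA_PCLASS      # below 1: each step from class 1 to 2 to 3 lowers it

print(f"Factor per extra passenger: {pass_cnt_factor:.2f}")
print(f"Factor per class step:      {pclass_factor:.2f}")
print(f"Predicted fare, 1 passenger in 3rd class: {predict_fare(1, 3):.2f}")
```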


In [7]:
scores = cross_val_score(LinearRegression(), X, y, cv=4)
print('R2 by cval: ', scores)

R2 by cval:  [0.81738928 0.8683794  0.77887764 0.82084394]
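
The four values are the R² scores of the individual folds. A single number with its spread is easier to compare between models, so the sketch below summarises the scores printed above using only the standard library. The variable names are illustrative.

```python
from statistics import mean, stdev

# Fold scores copied from the cross-validation output above.
fold_scores = [0.81738928, 0.8683794, 0.77887764, 0.82084394]

r2_mean = mean(fold_scores)
r2_std = stdev(fold_scores)  # sample standard deviation across the folds

print(f"R2 = {r2_mean:.3f} +/- {r2_std:.3f} over {len(fold_scores)} folds")
```

The mean is the value that the tracked runs log to MLflow as `r2_score`.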


In [8]:
# statsmodels OLS fit (not a ridge model), used for the detailed summary
modelB = smf.ols("fare_log ~ pass_cnt + pclass", data=df_t2).fit()
ols_summary = modelB.summary()
print(ols_summary)

                            OLS Regression Results                            
Dep. Variable:               fare_log   R-squared:                       0.839
Model:                            OLS   Adj. R-squared:                  0.838
Method:                 Least Squares   F-statistic:                     1680.
Date:                Wed, 07 Dec 2022   Prob (F-statistic):          6.63e-257
Time:                        11:31:27   Log-Likelihood:                 344.40
No. Observations:                 650   AIC:                            -682.8
Df Residuals:                     647   BIC:                            -669.4
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.7463      0.021     83.030      0.0

In [9]:
%set_env MLFLOW_TRACKING_URI=http://127.0.0.1:5000
mlflow.set_experiment("my-experiment")

env: MLFLOW_TRACKING_URI=http://127.0.0.1:5000


<Experiment: artifact_location='file:C:\\Users\\vojta\\UK\\22-23_W\\data-science\\NDBI048-data-science\\practice\\week_9_mlops\\analysis/mlruns/1', creation_time=1670407081422, experiment_id='1', last_update_time=1670407081422, lifecycle_stage='active', name='my-experiment', tags={}>

In [10]:
with mlflow.start_run() as run:
    modelA = LinearRegression().fit(X,y)
    mlflow.sklearn.log_model(modelA, "model")
    
    scores = cross_val_score(LinearRegression(), X, y, cv=4)
    mlflow.log_metric("r2_score", np.mean(scores))
    mlflow.log_param("schema", list(X.columns))
    mlflow.log_param("model_class", type(modelA).__name__)  # class name only, e.g. "LinearRegression"
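
MLflow stores parameter values as strings, so it helps to decide on the text form before logging. The sketch below uses a stand-in class (an assumption, so that it runs without scikit-learn) to show the difference between the string form of a class object and its bare name, along with one possible way to serialise the feature list.

```python
class LinearRegression:
    """Stand-in for sklearn.linear_model.LinearRegression (illustration only)."""


model = LinearRegression()
columns = ["pass_cnt", "pclass"]

# String form of the class object: "<class '...'>" including the module path.
class_repr = str(type(model))

# Bare class name, which is easier to filter and compare in the tracking UI.
class_name = type(model).__name__

# One possible text form for the feature list.
schema = ",".join(columns)

print(class_repr)
print(class_name)
print(schema)
```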



In [11]:
with mlflow.start_run() as run:
    modelB = smf.ols("fare_log ~ pass_cnt + pclass", data=df_t2).fit()
    ols_summary = modelB.summary()
    print(ols_summary)
    mlflow.log_text(str(ols_summary), "ols_summary.txt") # detailed information of model and coefficients

                            OLS Regression Results                            
Dep. Variable:               fare_log   R-squared:                       0.839
Model:                            OLS   Adj. R-squared:                  0.838
Method:                 Least Squares   F-statistic:                     1680.
Date:                Wed, 07 Dec 2022   Prob (F-statistic):          6.63e-257
Time:                        11:31:30   Log-Likelihood:                 344.40
No. Observations:                 650   AIC:                            -682.8
Df Residuals:                     647   BIC:                            -669.4
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.7463      0.021     83.030      0.0

In [13]:
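with mlflow.start_run() as run:
    alpha = 1.0
    ridgeModel = Ridge(alpha=alpha).fit(X,y)
    # The model goes first and the artifact path second. The swapped call
    # mlflow.sklearn.log_model("ridge", ridgeModel) raised
    # RepresenterError: ('cannot represent an object', Ridge())
    mlflow.sklearn.log_model(ridgeModel, "ridge")
    mlflow.log_param("alpha", alpha)  # hyperparameter, as asked for in task 2
    
    scores = cross_val_score(ridgeModel, X, y, cv=3)
    mlflow.log_metric("r2_score", np.mean(scores))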