# MLOps exercises

## Exercise 1

In this exercise, do the following:
1. Create a function that preprocesses new Ames data in the same way as the original Ames data was preprocessed in step 5 of the `MLOps.ipynb` notebook.
2. Create a function that takes a new Ames dataset and a model as input. The function should preprocess the new data and evaluate the model on it using mean absolute error.
3. Test the function from 2. on the "NewAmesData1.csv" dataset and the best model from the `MLOps.ipynb` notebook.
4. Test the function from 2. on the "NewAmesData2.csv" dataset and the best model from the `MLOps.ipynb` notebook. Do you see any drift?
5. Do you see data drift in "NewAmesData2.csv"? If so, for which variables?
6. Do you see data drift in "NewAmesData4.csv"? If so, for which variables?
7. Create a function that retrains a model on the new data as well as the old training data.
8. Retrain `model_final` on the new data "NewAmesData1.csv" as well as the old training data, using the function from 7. Then test the new model on the old test set.
9. Split the "NewAmesData2.csv" dataset into a train and test set. Train the best model from the `MLOps.ipynb` notebook on the training part and test it on the test part. Did you get a better model? Now combine your new training data with the original training data and retrain the model on that. Did that give you a better model?

In [315]:
import pandas as pd


In [316]:
# Load the Ames dataset and keep only the columns used in the notebook
ames = pd.read_csv("AmesHousing.csv")

ames = ames[["Lot Area", "Overall Cond", "Year Built", "Gr Liv Area", "TotRms AbvGrd", "Mo Sold", "Yr Sold", "Bldg Type", "Neighborhood", "SalePrice"]]
ames.head()

Unnamed: 0,Lot Area,Overall Cond,Year Built,Gr Liv Area,TotRms AbvGrd,Mo Sold,Yr Sold,Bldg Type,Neighborhood,SalePrice
0,31770,5,1960,1656,7,5,2010,1Fam,NAmes,215000
1,11622,6,1961,896,5,6,2010,1Fam,NAmes,105000
2,14267,6,1958,1329,6,6,2010,1Fam,NAmes,172000
3,11160,5,1968,2110,8,4,2010,1Fam,NAmes,244000
4,13830,5,1997,1629,6,3,2010,1Fam,Gilbert,189900


1. Create a function that preprocesses new Ames data in the same way as the original Ames data was preprocessed in step 5 of the `MLOps.ipynb` notebook.


In [317]:
# Function to preprocess the Ames dataset
def preprocess_ames_data(data):
    # Work on a copy so the caller's DataFrame is left unchanged
    data = data.copy()

    # Create dummy variables for "Bldg Type" and "Neighborhood"
    data = data.join(pd.get_dummies(data["Bldg Type"], drop_first=True, dtype="int", prefix="BType"))
    data = data.join(pd.get_dummies(data["Neighborhood"], drop_first=True, dtype="int", prefix="Nbh"))

    # Drop original categorical columns
    data.drop(columns=["Bldg Type", "Neighborhood"], inplace=True)
    
    return data
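
One thing to watch when reusing `preprocess_ames_data` on new files: `pd.get_dummies` only creates columns for the categories present in the data it receives, so a new dataset with a missing or unseen neighborhood can end up with a different column set from the one the model was trained on. A minimal sketch of one way to handle this, assuming a hypothetical helper `align_columns` and made-up toy frames, is to reindex the new frame to the training columns:

```python
import pandas as pd


def align_columns(new_data, reference_columns):
    """Reorder new_data to reference_columns (hypothetical helper).

    Columns missing from new_data are added and filled with 0;
    columns that are not in reference_columns are dropped.
    """
    return new_data.reindex(columns=reference_columns, fill_value=0)


# Toy frames standing in for preprocessed Ames data (illustrative values only)
train = pd.DataFrame({"Lot Area": [8000, 9500], "Nbh_NAmes": [1, 0], "Nbh_OldTown": [0, 1]})
new = pd.DataFrame({"Lot Area": [7000], "Nbh_Gilbert": [1], "Nbh_NAmes": [0]})

aligned = align_columns(new, train.columns)
print(aligned.columns.tolist())
```

Dropping an unseen category silently discards information, so in practice it may be worth logging which columns were added or removed.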


Preprocessing the original Ames dataset

In [318]:
# Preprocess the Ames dataset
ames_wd = preprocess_ames_data(ames)
ames_wd.head()

# Function to write a random sample of the preprocessed data to a CSV file
def make_ames_data(data, filename):
    # Take 1000 rows from the data at random and save them to `filename`
    data.sample(1000, random_state=42).to_csv(filename, index=False)

# Make AmesHousing1.csv with the preprocessed data
make_ames_data(ames_wd, "AmesHousing1.csv")

pd.read_csv("AmesHousing1.csv")


Unnamed: 0,Lot Area,Overall Cond,Year Built,Gr Liv Area,TotRms AbvGrd,Mo Sold,Yr Sold,SalePrice,BType_2fmCon,BType_Duplex,...,Nbh_NoRidge,Nbh_NridgHt,Nbh_OldTown,Nbh_SWISU,Nbh_Sawyer,Nbh_SawyerW,Nbh_Somerst,Nbh_StoneBr,Nbh_Timber,Nbh_Veenker
0,5100,7,1925,1666,7,6,2008,161000,0,0,...,0,0,1,0,0,0,0,0,0,0
1,1890,7,1972,1030,6,7,2006,116000,0,0,...,0,0,0,0,0,0,0,0,0,0
2,7162,5,2003,1724,8,5,2006,196500,0,0,...,0,0,0,0,0,0,0,0,0,0
3,8070,5,1994,990,5,8,2007,123600,0,0,...,0,0,0,0,0,0,0,0,0,0
4,7000,8,1926,919,5,7,2008,126000,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,16287,6,1925,1351,7,7,2007,122000,0,0,...,0,0,0,0,0,0,0,0,0,0
996,21780,4,1910,810,4,3,2009,57625,0,0,...,0,0,0,0,0,0,0,0,0,0
997,6324,6,1927,520,4,5,2008,68500,0,0,...,0,0,0,0,0,0,0,0,0,0
998,8712,7,1896,952,5,6,2010,50138,0,0,...,0,0,0,0,0,0,0,0,0,0


2. Create a function that takes a new Ames dataset and a model as input. The function should preprocess the new data and evaluate the model on it using mean absolute error.

In [319]:
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Function that takes an already preprocessed Ames dataset and a model,
# fits the model on 80% of the data and returns the MAE on the remaining 20%
def train_and_test(data, model):
    # Split the data into features and target
    X = data.drop(columns="SalePrice")
    y = data["SalePrice"]
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    predictions = model.predict(X_test)
    
    # Calculate the mean absolute error
    mae = mean_absolute_error(y_test, predictions)
    
    return mae

# Train and test a random forest model
rf = RandomForestRegressor(random_state=42)
mae = train_and_test(ames_wd, rf)

mae

20425.522514220705
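
Note that `train_and_test` refits the model on a split of whatever data it receives. If the goal is to score an already trained model on new data, as the task describes, a variant that skips the fit step may be closer. The following is a minimal sketch under that assumption; the name `evaluate_on_new_data`, the optional `preprocess` argument and the toy data are illustrative and not part of the notebook:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error


def evaluate_on_new_data(new_data, model, preprocess=None, target="SalePrice"):
    """Score an already fitted model on new data without refitting it.

    `preprocess` is an optional callable such as preprocess_ames_data.
    """
    if preprocess is not None:
        new_data = preprocess(new_data)
    X_new = new_data.drop(columns=target)
    y_new = new_data[target]
    return mean_absolute_error(y_new, model.predict(X_new))


# Toy data (illustrative values only): in `old`, price is exactly 100 * area
old = pd.DataFrame({"Gr Liv Area": [1000, 1500, 2000], "SalePrice": [100000, 150000, 200000]})
new = pd.DataFrame({"Gr Liv Area": [1200, 1800], "SalePrice": [120000, 190000]})

fitted = LinearRegression().fit(old.drop(columns="SalePrice"), old["SalePrice"])
mae_new = evaluate_on_new_data(new, fitted)
print(round(mae_new, 2))
```

Because the model is never refitted here, a rising MAE on new data can be read as a sign of drift rather than as noise from a fresh train/test split.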

3. Test the function from 2. on the "NewAmesData1.csv" dataset and the best model from the `MLOps.ipynb` notebook.


In [320]:
X_ames = ames_wd.drop(columns=["SalePrice"])
y_ames = ames_wd.SalePrice
X_train, X_test, y_train, y_test = train_test_split(X_ames, y_ames, test_size=0.2, random_state=1742)

In [321]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=4217)

In [322]:
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)
print(y_train.shape)
print(y_val.shape)
print(y_test.shape)

(1875, 38)
(469, 38)
(586, 38)
(1875,)
(469,)
(586,)


In [323]:
model_rf = RandomForestRegressor()
model_rf.fit(X_train, y_train)
y_pred_rf = model_rf.predict(X_val)
mean_absolute_error(y_val, y_pred_rf)

21945.45637425119

In [324]:
model_rf_500 = RandomForestRegressor(n_estimators=500)
model_rf_500.fit(X_train, y_train)
y_pred_rf_500 = model_rf_500.predict(X_val)
mean_absolute_error(y_val, y_pred_rf_500)

21681.70781693573

In [325]:
model_final = model_rf_500

In [326]:
#Test the function from 2. on the "NewAmesData1.csv" dataset and the best model from the `MLOps.ipynb` notebook.
new_ames = pd.read_csv("AmesHousing1.csv")

mae_new_ames = train_and_test(new_ames, model_final)
mae_new_ames

22809.843675

In [327]:
new_ames = pd.read_csv("AmesHousing1.csv")

mae_new_ames = train_and_test(new_ames, rf)
mae_new_ames

22811.851333333332

4. Test the function from 2. on the "NewAmesData2.csv" dataset and the best model from the `MLOps.ipynb` notebook. Do you see any drift?


In [331]:
# Function to write a random sample of the preprocessed data to a CSV file
def make_ames_data(data, filename):
    # Take 1000 rows from the data at random (different seed than before) and save them to `filename`
    data.sample(1000, random_state=99).to_csv(filename, index=False)

# Make AmesHousing2.csv with the preprocessed data
make_ames_data(ames_wd, "AmesHousing2.csv")

pd.read_csv("AmesHousing2.csv")

Unnamed: 0,Lot Area,Overall Cond,Year Built,Gr Liv Area,TotRms AbvGrd,Mo Sold,Yr Sold,SalePrice,BType_2fmCon,BType_Duplex,...,Nbh_NoRidge,Nbh_NridgHt,Nbh_OldTown,Nbh_SWISU,Nbh_Sawyer,Nbh_SawyerW,Nbh_Somerst,Nbh_StoneBr,Nbh_Timber,Nbh_Veenker
0,11210,5,2005,1614,7,7,2006,221500,0,0,...,0,0,0,0,0,0,0,0,0,0
1,53504,5,2003,3279,12,6,2010,538000,0,0,...,0,0,0,0,0,0,0,1,0,0
2,19690,7,1966,2201,8,8,2006,274970,0,0,...,0,0,0,0,0,0,0,0,0,0
3,7407,7,1957,1236,6,4,2010,149700,0,0,...,0,0,1,0,0,0,0,0,0,0
4,11578,5,2008,1736,7,7,2009,360000,0,0,...,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,13383,5,1969,1404,7,3,2009,160000,0,0,...,0,0,0,0,0,0,0,0,0,0
996,19378,5,2005,2462,9,3,2006,320000,0,0,...,0,0,0,0,0,0,0,0,0,0
997,10480,5,1936,1639,6,3,2008,115000,0,0,...,0,0,0,1,0,0,0,0,0,0
998,13673,5,1962,1696,8,3,2007,143900,0,0,...,0,0,0,0,1,0,0,0,0,0


In [332]:
new_ames = pd.read_csv("AmesHousing2.csv")

mae_new_ames = train_and_test(new_ames, model_final)
mae_new_ames

20357.80727

In [333]:
new_ames = pd.read_csv("AmesHousing2.csv")

mae_new_ames = train_and_test(new_ames, rf)
mae_new_ames

20830.775700000002

5. Do you see a data drift in "NewAmesData2.csv"? If so, for which variables?


In [339]:
# Summary statistics of AmesHousing1.csv, used as the baseline when looking for data drift
ames1 = pd.read_csv("AmesHousing1.csv")

ames1.describe()



Unnamed: 0,Lot Area,Overall Cond,Year Built,Gr Liv Area,TotRms AbvGrd,Mo Sold,Yr Sold,SalePrice,BType_2fmCon,BType_Duplex,...,Nbh_NoRidge,Nbh_NridgHt,Nbh_OldTown,Nbh_SWISU,Nbh_Sawyer,Nbh_SawyerW,Nbh_Somerst,Nbh_StoneBr,Nbh_Timber,Nbh_Veenker
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,...,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,10077.167,5.506,1973.513,1501.627,6.468,6.252,2007.814,183323.831,0.014,0.043,...,0.021,0.065,0.069,0.013,0.045,0.042,0.069,0.02,0.026,0.004
std,6512.654316,1.071967,29.846116,503.497084,1.543165,2.753557,1.322556,82057.291407,0.117549,0.202959,...,0.143456,0.246649,0.253581,0.113331,0.207408,0.20069,0.253581,0.14007,0.159215,0.063151
min,1484.0,1.0,1880.0,438.0,3.0,1.0,2006.0,35311.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7397.25,5.0,1954.0,1128.0,5.0,4.0,2007.0,129375.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,9554.5,5.0,1976.0,1442.0,6.0,6.0,2008.0,161500.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,11643.75,6.0,2003.0,1740.0,7.0,8.0,2009.0,220000.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,115149.0,9.0,2010.0,5642.0,12.0,12.0,2010.0,625000.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


6. Do you see a data drift in "NewAmesData4.csv"? If so, for which variables?


In [340]:
# Looking for data drift in the AmesHousing2.csv dataset
ames2 = pd.read_csv("AmesHousing2.csv")

ames2.describe()

Unnamed: 0,Lot Area,Overall Cond,Year Built,Gr Liv Area,TotRms AbvGrd,Mo Sold,Yr Sold,SalePrice,BType_2fmCon,BType_Duplex,...,Nbh_NoRidge,Nbh_NridgHt,Nbh_OldTown,Nbh_SWISU,Nbh_Sawyer,Nbh_SawyerW,Nbh_Somerst,Nbh_StoneBr,Nbh_Timber,Nbh_Veenker
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,...,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,10038.743,5.668,1971.291,1502.475,6.421,6.289,2007.772,181287.982,0.029,0.038,...,0.023,0.057,0.087,0.016,0.053,0.037,0.07,0.014,0.024,0.013
std,5612.654505,1.112211,29.480153,525.69273,1.54277,2.741349,1.320515,79344.191047,0.16789,0.191292,...,0.149978,0.231959,0.281976,0.125538,0.224146,0.188856,0.255275,0.117549,0.153126,0.113331
min,1476.0,2.0,1880.0,480.0,3.0,1.0,2006.0,35311.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7500.0,5.0,1954.75,1123.5,5.0,4.0,2007.0,129000.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,9402.5,5.0,1972.0,1445.0,6.0,6.0,2008.0,160250.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,11425.75,6.0,1999.0,1739.25,7.0,8.0,2009.0,211125.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,63887.0,9.0,2009.0,5642.0,15.0,12.0,2010.0,755000.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [341]:
# Difference between the summary statistics of AmesHousing1.csv and AmesHousing2.csv; large values hint at drift
ames1.describe() - ames2.describe()


Unnamed: 0,Lot Area,Overall Cond,Year Built,Gr Liv Area,TotRms AbvGrd,Mo Sold,Yr Sold,SalePrice,BType_2fmCon,BType_Duplex,...,Nbh_NoRidge,Nbh_NridgHt,Nbh_OldTown,Nbh_SWISU,Nbh_Sawyer,Nbh_SawyerW,Nbh_Somerst,Nbh_StoneBr,Nbh_Timber,Nbh_Veenker
count,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,38.424,-0.162,2.222,-0.848,0.047,-0.037,0.042,2035.849,-0.015,0.005,...,-0.002,0.008,-0.018,-0.003,-0.008,0.005,-0.001,0.006,0.002,-0.009
std,899.999811,-0.040244,0.365963,-22.195646,0.000395,0.012208,0.002041,2713.10036,-0.050341,0.011667,...,-0.006522,0.014691,-0.028395,-0.012207,-0.016738,0.011833,-0.001694,0.022521,0.006089,-0.05018
min,8.0,-1.0,0.0,-42.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-102.75,0.0,-0.75,4.5,0.0,0.0,0.0,375.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,152.0,0.0,4.0,-3.0,0.0,0.0,0.0,1250.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,218.0,0.0,4.0,0.75,0.0,0.0,0.0,8875.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,51262.0,0.0,1.0,0.0,-3.0,0.0,0.0,-130000.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
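
Raw differences between `describe()` tables are hard to compare across variables because each column has its own scale. One possible refinement, sketched here with a hypothetical helper `mean_shift_report`, toy data, and an arbitrary threshold of 0.2, is to express each mean shift in units of the reference standard deviation:

```python
import pandas as pd


def mean_shift_report(reference, new, threshold=0.2):
    """Rank numeric columns by standardized mean shift (hypothetical helper).

    shift = (mean_new - mean_reference) / std_reference
    The 0.2 threshold is an arbitrary illustrative cut-off.
    """
    cols = reference.select_dtypes("number").columns.intersection(new.columns)
    # Avoid division by zero for constant reference columns
    ref_std = reference[cols].std().replace(0, float("nan"))
    shift = (new[cols].mean() - reference[cols].mean()) / ref_std
    report = pd.DataFrame({"shift": shift, "flagged": shift.abs() > threshold})
    return report.sort_values("shift", key=abs, ascending=False)


# Toy frames (illustrative values only): "Year Built" shifts, "Mo Sold" does not
reference = pd.DataFrame({"Year Built": [1950, 1960, 1970, 1980], "Mo Sold": [3, 6, 6, 9]})
new = pd.DataFrame({"Year Built": [1990, 2000, 2005, 2010], "Mo Sold": [3, 6, 6, 9]})

report = mean_shift_report(reference, new)
print(report)
```

This only looks at means; a shift in spread or shape would need a distribution-level comparison such as a two-sample test.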


7. Create a function that retrains a model on the new data as well as the old training data.


In [342]:
# Function that retrains a model on the new data together with the old training data

def retrain_model(data, new_data, model):
    # Split the old data into features and target
    X = data.drop(columns="SalePrice")
    y = data["SalePrice"]

    # Keep only the training part of the old data (same split as in train_and_test)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Split the new data into features and target
    X_new = new_data.drop(columns="SalePrice")
    y_new = new_data["SalePrice"]

    # Fit once on the old training data and the new data combined.
    # Calling fit twice would discard the first fit, because fit starts from scratch.
    model.fit(pd.concat([X_train, X_new]), pd.concat([y_train, y_new]))

    return model
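
Retraining in place overwrites the fitted state of the model that was passed in, which makes it hard to compare the model before and after. A minimal sketch of an alternative, assuming a hypothetical helper `retrain_on_combined` and toy data, uses `sklearn.base.clone` to fit a fresh copy on the combined data and leave the original untouched:

```python
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression


def retrain_on_combined(model, X_old, y_old, X_new, y_new):
    """Fit a fresh copy of `model` on old and new training data together."""
    X_all = pd.concat([X_old, X_new], ignore_index=True)
    y_all = pd.concat([y_old, y_new], ignore_index=True)
    # clone() copies the hyperparameters but not the fitted state,
    # so the original model keeps its earlier fit
    return clone(model).fit(X_all, y_all)


# Toy data (illustrative values only)
X_old = pd.DataFrame({"Gr Liv Area": [1000.0, 2000.0]})
y_old = pd.Series([100000.0, 200000.0])
X_new = pd.DataFrame({"Gr Liv Area": [1500.0, 2500.0]})
y_new = pd.Series([160000.0, 260000.0])

original = LinearRegression().fit(X_old, y_old)
retrained = retrain_on_combined(original, X_old, y_old, X_new, y_new)
print(original.coef_, retrained.coef_)
```

A linear model is used here only to keep the example small and deterministic; the same pattern applies to `RandomForestRegressor`.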

8. Retrain `model_final` on the new data "NewAmesData1.csv" as well as the old training data, using the function from 7. Then test the new model on the old test set.


In [343]:
# Retrain model_final on the new data "AmesHousing1.csv" (loaded above as ames1) and the old data "AmesHousing.csv", using the function from 7.
model_final_retrained = retrain_model(ames_wd, ames1, model_final)
model_final_retrained

In [344]:
# Score on the old dataset; note that train_and_test refits the model on its own 80/20 split before computing the MAE
mae_retrained = train_and_test(ames_wd, model_final_retrained)
mae_retrained

20466.175740614333

9. Split the "NewAmesData2.csv" dataset into a train and test set. Train the best model from the `MLOps.ipynb` notebook on the training part and test it on the test part. Did you get a better model? Now combine your new training data with the original training data and retrain the model on that. Did that give you a better model?

In [345]:
# Splitting the "AmesHousing2.csv" into a train and test set
X_new_ames = new_ames.drop(columns=["SalePrice"])
y_new_ames = new_ames.SalePrice

X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new_ames, y_new_ames, test_size=0.2, random_state=1742)

In [347]:
# Training the model_final on the training set of the new data
model_final.fit(X_train_new, y_train_new)

In [349]:
# testing the model_final on the test set of the new data
y_pred_final = model_final.predict(X_test_new)
mae_final = mean_absolute_error(y_test_new, y_pred_final)
mae_final

22689.494733333333

In [350]:
# Combining the new training data and the original training data
X_train_combined = pd.concat([X_train, X_train_new])
y_train_combined = pd.concat([y_train, y_train_new])

In [351]:
# Retraining the model_final on the combined training data
model_final.fit(X_train_combined, y_train_combined)

In [352]:
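# Note: train_and_test refits model_final on an 80/20 split of new_ames, so this MAE does not reflect the combined fit above
mae_final_combined = train_and_test(new_ames, model_final)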
mae_final_combined

20337.18568
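
To answer "did that give you a better model?", the candidates need to be scored on one shared test set, without any refitting in between. A minimal sketch of such a comparison, assuming a hypothetical helper `compare_models` and toy data:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error


def compare_models(models, X_test, y_test):
    """Return the MAE of each fitted model on one shared test set, best first."""
    scores = {
        name: mean_absolute_error(y_test, fitted.predict(X_test))
        for name, fitted in models.items()
    }
    return pd.Series(scores, name="MAE").sort_values()


# Toy data (illustrative values only): two models trained on different data
X_a = pd.DataFrame({"Gr Liv Area": [1000.0, 2000.0]})
y_a = pd.Series([100000.0, 200000.0])
X_b = pd.DataFrame({"Gr Liv Area": [1000.0, 2000.0]})
y_b = pd.Series([120000.0, 240000.0])
X_test = pd.DataFrame({"Gr Liv Area": [1500.0]})
y_test = pd.Series([150000.0])

models = {
    "trained_on_a": LinearRegression().fit(X_a, y_a),
    "trained_on_b": LinearRegression().fit(X_b, y_b),
}
scores = compare_models(models, X_test, y_test)
print(scores)
```

In the notebook's setting, the dictionary would hold separately fitted models (for example one trained on the new data only and one on the combined data), with `X_test_new` and `y_test_new` as the shared test set.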