### Get Data

We will download the House Sales in King County, USA dataset from Kaggle. The columns in this dataset are: <br>  <br>
**id** - Unique ID for each home sold <br>
**date** - Date of the home sale <br>
**price** - Price of each home sold <br>
**bedrooms** - Number of bedrooms <br>
**bathrooms** - Number of bathrooms, where 0.5 accounts for a room with a toilet but no shower <br>
**sqft_living** - Square footage of the home's interior living space <br>
**sqft_lot** - Square footage of the land space  <br>
**floors** - Number of floors <br>
**waterfront** - A dummy variable indicating whether the home overlooks the waterfront <br>
**view** - An index from 0 to 4 of how good the view of the property was <br>
**condition** - An index from 1 to 5 on the condition of the home <br>
**grade** - An index from 1 to 13, where 1-3 fall short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high-quality level of construction and design <br>
**sqft_above** - The square footage of the interior housing space that is above ground level  <br>
**sqft_basement** - The square footage of the interior housing space that is below ground level  <br>
**yr_built** - The year the house was initially built <br>
**yr_renovated** - The year of the house’s last renovation <br>
**zipcode** - The ZIP code area the house is in <br>
**lat** - Latitude <br>
**long** - Longitude <br>
**sqft_living15** - The square footage of interior housing living space for the nearest 15 neighbors <br>
**sqft_lot15** - The square footage of the land lots of the nearest 15 neighbors <br>
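
Before modeling, it can help to confirm that a loaded file actually exposes the columns listed above. The following is a minimal sketch and not part of the original notebook; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration.

```python
# Column names as documented in the list above.
EXPECTED_COLUMNS = [
    "id", "date", "price", "bedrooms", "bathrooms", "sqft_living",
    "sqft_lot", "floors", "waterfront", "view", "condition", "grade",
    "sqft_above", "sqft_basement", "yr_built", "yr_renovated", "zipcode",
    "lat", "long", "sqft_living15", "sqft_lot15",
]


def missing_columns(actual_columns):
    """Return the documented columns that are absent from actual_columns."""
    present = set(actual_columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]


# Example with a header that lacks two of the documented columns.
header = [c for c in EXPECTED_COLUMNS if c not in ("lat", "long")]
print(missing_columns(header))  # ['lat', 'long']
```

With pandas, `missing_columns(df.columns)` would be expected to return an empty list right after `pd.read_csv`, before `id` and `date` are dropped.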


In [5]:
%pip install -r requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Collecting mlflow
  Downloading mlflow-1.29.0-py3-none-any.whl (16.9 MB)
Collecting querystring-parser<2
  Using cached querystring_parser-1.2.4-py2.py3-none-any.whl (7.9 kB)
Collecting prometheus-flask-exporter<1
  Using cached prometheus_flask_exporter-0.20.3-py3-none-any.whl (18 kB)
Collecting databricks-cli<1,>=0.8.7
  Using cached databricks-cli-0.17.3.tar.gz (77 kB)
  Preparing metadata (setup.py) ... done
Collecting gunicorn<21
  Using cached gunicorn-20.1.0-py3-none-any.whl (79 kB)
Collecting alembic<2
  Using cached alembic-1.8.1-py3-none-any.whl (209 kB)
Collecting gitpython<4,>=2.1.0
  Using cached GitPython-3.1.27-py3-none-any.whl (181 kB)
Collecting sqlalchemy<2,>=1.4.0
  Downloading SQLAlchemy-1.4.41-cp38-cp38-macosx_10_15_x86_64.whl (1.5 MB)

In [6]:
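import pandas as pd
from sklearn.model_selection import train_test_split

# load the Kaggle CSV (expected in the working directory)
df = pd.read_csv('kc_house_data.csv')
# drop the identifier and sale date, which are not used as features
df = df.drop(['id', 'date'], axis=1)
# drop rows with missing values
df = df.dropna()
# split into input and output elements
X = df.loc[:, df.columns != 'price']
y = df.loc[:, 'price']
# hold out 33% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)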
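
The boolean mask `df.columns != 'price'` keeps every column except the target. A small sketch on a made-up three-row frame shows the selection; the values are invented for illustration and are not taken from the dataset, and `toy`, `X_toy`, and `y_toy` are hypothetical names.

```python
import pandas as pd

# Toy frame with invented values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "price": [300000.0, 450000.0, 520000.0],
    "bedrooms": [2, 3, 4],
    "sqft_living": [900, 1500, 2100],
})

# Boolean mask over the columns: True everywhere except the target.
mask = toy.columns != "price"
X_toy = toy.loc[:, mask]
y_toy = toy.loc[:, "price"]

print(list(X_toy.columns))  # ['bedrooms', 'sqft_living']
print(y_toy.shape)          # (3,)

# Dropping the target by name selects the same features.
assert X_toy.equals(toy.drop(columns="price"))
```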

In [12]:
from mlflow.tracking import MlflowClient

def yield_artifacts(run_id, path=None):
    """Yield all artifacts in the specified run"""
    client = MlflowClient()
    for item in client.list_artifacts(run_id, path):
        if item.is_dir:
            yield from yield_artifacts(run_id, item.path)
        else:
            yield item.path

def fetch_logged_data(run_id):
    """Fetch params, metrics, tags, and artifacts in the specified run"""
    client = MlflowClient()
    data = client.get_run(run_id).data
    # Exclude system tags: https://www.mlflow.org/docs/latest/tracking.html#system-tags
    tags = {k: v for k, v in data.tags.items() if not k.startswith("mlflow.")}
    artifacts = list(yield_artifacts(run_id))
    return {
        "params": data.params,
        "metrics": data.metrics,
        "tags": tags,
        "artifacts": artifacts,
    }
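
To see what the two helpers above do without a tracking server, the traversal can be exercised against an in-memory stand-in. This is only a sketch: `FakeFileInfo`, `FakeClient`, `FAKE_TREE`, and `walk_artifacts` are hypothetical names, and the artifact paths and tag values are invented. The one assumption about MLflow itself is that `list_artifacts` returns objects exposing `path` and `is_dir`, which is what the code above already relies on.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects returned by list_artifacts.
FakeFileInfo = namedtuple("FakeFileInfo", ["path", "is_dir"])

# Invented artifact layout: one directory with two files, plus one top-level file.
FAKE_TREE = {
    None: [FakeFileInfo("model", True), FakeFileInfo("cv_results.csv", False)],
    "model": [FakeFileInfo("model/MLmodel", False),
              FakeFileInfo("model/model.pkl", False)],
}


class FakeClient:
    """Mimics only the list_artifacts(run_id, path) call used above."""

    def list_artifacts(self, run_id, path=None):
        return FAKE_TREE.get(path, [])


def walk_artifacts(client, run_id, path=None):
    """Same depth-first traversal as yield_artifacts, with the client injected."""
    for item in client.list_artifacts(run_id, path):
        if item.is_dir:
            yield from walk_artifacts(client, run_id, item.path)
        else:
            yield item.path


paths = list(walk_artifacts(FakeClient(), run_id="demo"))
print(paths)  # ['model/MLmodel', 'model/model.pkl', 'cv_results.csv']

# The tag filter in fetch_logged_data drops keys with the "mlflow." prefix.
raw_tags = {"mlflow.source.type": "NOTEBOOK", "estimator_name": "GridSearchCV"}
user_tags = {k: v for k, v in raw_tags.items() if not k.startswith("mlflow.")}
print(user_tags)  # {'estimator_name': 'GridSearchCV'}
```

Directories are expanded in place, so files inside `model` appear before the top-level file that follows it in the listing.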

## Train Model

In [14]:
import mlflow
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from pprint import pprint

mlflow.sklearn.autolog()
model = RandomForestRegressor(random_state=42)
param_grid = {
    'n_estimators': [100, 200],
    'max_features': [1.0],
    'max_depth': [4, 6, 8],
    'criterion': ['squared_error']
}
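# define search; with cv unset, GridSearchCV uses 5-fold cross-validation
# and scores each candidate with the estimator's default metric (R^2 here)
search = GridSearchCV(
    estimator=model, param_grid=param_grid, n_jobs=-1)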
result = search.fit(X_train, y_train)
# summarize result
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)
run_id = mlflow.last_active_run().info.run_id

# show data logged in the parent run
print("========== parent run ==========")
for key, data in fetch_logged_data(run_id).items():
    print("\n---------- logged {} ----------".format(key))
    pprint(data)

# show data logged in the child runs
filter_child_runs = "tags.mlflow.parentRunId = '{}'".format(run_id)
runs = mlflow.search_runs(filter_string=filter_child_runs)
param_cols = ["params.{}".format(p) for p in param_grid.keys()]
metric_cols = ["metrics.mean_test_score"]

print("\n========== child runs ==========\n")
pd.set_option("display.max_columns", None)  # prevent truncating columns
print(runs[["run_id", *param_cols, *metric_cols]])


2022/10/02 19:23:25 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '6cc288bd42324baf890ad1be2aa49814', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2022/10/02 19:24:16 INFO mlflow.sklearn.utils: Logging the 5 best runs, one run will be omitted.
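
The log line about an omitted run follows from the size of the grid. As a rough sketch: the grid has 2 × 1 × 3 × 1 = 6 candidates, `GridSearchCV` uses 5-fold cross-validation when `cv` is left at its default, and MLflow's scikit-learn autologging records at most `max_tuning_runs` child runs (5 by default), so one candidate is left out. The variable names below are illustrative.

```python
from itertools import product

param_grid = {
    "n_estimators": [100, 200],
    "max_features": [1.0],
    "max_depth": [4, 6, 8],
    "criterion": ["squared_error"],
}

# Every combination of values is one candidate evaluated by the search.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

n_folds = 5          # scikit-learn's default when cv=None
max_tuning_runs = 5  # default for mlflow.sklearn.autolog()

n_candidates = len(candidates)
n_fits = n_candidates * n_folds + 1  # +1 for the final refit on all training data
n_omitted = max(0, n_candidates - max_tuning_runs)

print(n_candidates, n_fits, n_omitted)  # 6 31 1
```

Passing a larger `max_tuning_runs` to `mlflow.sklearn.autolog()` should log every candidate, at the cost of more child runs per search.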


Best Score: 0.8438245555164338
Best Hyperparameters: {'criterion': 'squared_error', 'max_depth': 8, 'max_features': 1.0, 'n_estimators': 200}

---------- logged params ----------
{'best_criterion': 'squared_error',
 'best_max_depth': '8',
 'best_max_features': '1.0',
 'best_n_estimators': '200',
 'cv': 'None',
 'error_score': 'nan',
 'estimator': 'RandomForestRegressor(random_state=42)',
 'n_jobs': '-1',
 'param_grid': "{'n_estimators': [100, 200], 'max_features': [1.0], "
               "'max_depth': [4, 6, 8], 'criterion': ['squared_error']}",
 'pre_dispatch': '2*n_jobs',
 'refit': 'True',
 'return_train_score': 'False',
 'scoring': 'None',
 'verbose': '0'}

---------- logged metrics ----------
{'best_cv_score': 0.8438245555164338,
 'training_mae': 73012.42647618214,
 'training_mse': 12968713523.052935,
 'training_r2_score': 0.8981991847976264,
 'training_rmse': 113880.25958458707,
 'training_score': 0.8981991847976264}
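
The logged metrics are related to one another, which can be checked directly. The sketch below copies the values from the output above; the interpretation in the comments is a reading of those numbers, not something the notebook states.

```python
import math

# Values copied from the logged metrics above.
metrics = {
    "best_cv_score": 0.8438245555164338,
    "training_mse": 12968713523.052935,
    "training_r2_score": 0.8981991847976264,
    "training_rmse": 113880.25958458707,
    "training_score": 0.8981991847976264,
}

# RMSE is the square root of MSE, so it is in the same units as price.
rmse_from_mse = math.sqrt(metrics["training_mse"])
rmse_matches = math.isclose(rmse_from_mse, metrics["training_rmse"], rel_tol=1e-9)
print(rmse_matches)

# For a regressor, score() is R^2, so the two entries coincide.
print(metrics["training_score"] == metrics["training_r2_score"])  # True

# Gap between the fit on the training data and the cross-validated estimate.
gap = metrics["training_score"] - metrics["best_cv_score"]
print(round(gap, 4))  # 0.0544
```

A training R² somewhat above the cross-validated R² is the usual pattern, since the training score is measured on data the model has already seen.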

---------- logged tags ----------
{'estimator_class': 'sklearn.