# Interview Task for Machine Learning Engineer / Data Scientist Role
## Instructions

**Data**

The data set contains rental property listings from a real estate platform, together with related information such as living area size, rent, location (street and house number if available, ZIP code and state), energy type, etc. It also has two variables containing longer free-text descriptions: “description”, which describes the offer, and “facilities”, which describes all available facilities, the most recent renovation, etc.

**Task**

1. Please train a machine learning model to predict the total rent using only the structural data (without “description” and “facilities” fields).
2. Please train a machine learning model to predict the total rent using both the structural data AND text data (“description” and “facilities” fields).

We expect the performance reporting to follow ML best practices, i.e. please split the data set into the necessary subsets (train, validation, test).
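
As a minimal sketch of such a split, the snippet below shuffles row indices once with a fixed seed and cuts them into three disjoint subsets. The function name `split_indices` and the 70/15/15 proportions are illustrative assumptions, not part of the task statement; in practice a library helper such as scikit-learn's `train_test_split` could be applied twice to the same effect.

```python
import random


def split_indices(n_rows, val_frac=0.15, test_frac=0.15, seed=123):
    """Shuffle row indices once and cut them into train/validation/test.

    The fractions are hypothetical defaults chosen for illustration.
    """
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    if not 0 < val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must lie strictly between 0 and 1")

    indices = list(range(n_rows))
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)

    n_test = round(n_rows * test_frac)
    n_val = round(n_rows * val_frac)

    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
```

The test subset should be touched only once, for the final report; model selection and tuning are done against the validation subset (or with cross-validation on the training subset).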

You may ignore some of the fields during prediction. Please discuss the feature processing, model selection, model performance, a comparison of the two models, and any possible improvements. Please also provide the trained ML model files.

We expect a git repo (link shared privately with us) that contains your solution in Python 3.

Additional points will be considered for DB schema design and access (PostgreSQL preferred), model versioning, containerization, CI/CD and testing.

**Author information**:

Fernando Pozo

- [GitHub (github.com/fpozoc)](https://github.com/fpozoc)
- [Homepage (fpozoc.com)](https://fpozoc.com)

# Model selection, model training and evaluation

In [9]:
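# standard library modules
import pickle  # used below to load the pretrained model
import sys
import warnings

# 3rd party modules
import numpy as np
import pandas as pd
import yaml

# make the project package importable from the notebooks/ directory
sys.path.append('../')
from src.model.model_selection import *  # project helpers (Classifier is presumably defined here)

from xgboost import XGBRegressor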


warnings.filterwarnings("ignore")
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)

%load_ext autoreload
%autoreload 2

!pwd

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
/local/fpozoc/Projects/Nomoko-ML-engineer-interview-task/notebooks


In [2]:
SEED=123

In [4]:
df = pd.read_csv('../data/processed/training_set.v1.tsv.gz', sep='\t', compression='gzip')

In [5]:
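# load the pretrained model; the context manager closes the file handle
with open('../models/model.v.1.0.0.pkl', 'rb') as f:
    pretrained_model = pickle.load(f)

# despite its name, the Classifier wrapper is used here in regression mode
model = Classifier(
    model=pretrained_model,
    df=df[df.columns[1:]],
    features_col=df.columns[2:],
    target_col='totalRent',
    model_type='regression',
    )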

In [6]:
model.evaluate

| Metric | Value |
|---|---|
| R2 | 0.735464 |
| Mean Absolute Error | 110.3784 |
| Mean Squared Error | 23515.884843 |
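
For reference, the three reported metrics can be computed from true and predicted rents as sketched below. This uses the standard textbook definitions in plain Python; the function `regression_metrics` and the toy values are hypothetical, and how `model.evaluate` computes its numbers internally is not shown in this notebook.

```python
def regression_metrics(y_true, y_pred):
    """Return R2, MAE and MSE using their standard definitions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mse = ss_res / n

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot

    return {
        "R2": r2,
        "Mean Absolute Error": mae,
        "Mean Squared Error": mse,
    }


# toy rents, invented purely for illustration
metrics = regression_metrics(
    [500.0, 750.0, 1000.0, 1250.0],
    [550.0, 700.0, 1100.0, 1200.0],
)
```

Because the mean squared error is expressed in squared rent units, its square root is usually easier to compare against the mean absolute error.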


In [7]:
model.cross_validate

| metric | mean | std |
|---|---|---|
| fit_time | 13.4969 | 0.8247 |
| score_time | 0.0355 | 0.0023 |
| test_R2 | 0.7345 | 0.0036 |
| train_R2 | 0.7707 | 0.0011 |
| test_Mean Absolute Error | -110.4552 | 0.5394 |
| train_Mean Absolute Error | -103.063 | 0.154 |
| test_Mean Squared Error | -23555.3292 | 198.1859 |
| train_Mean Squared Error | -20342.1338 | 66.6575 |
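
The error metrics in this output are negative. That pattern matches scikit-learn's convention of negated error scorers (for example `neg_mean_absolute_error`), where a higher score is always better; this is an assumption, since the wrapper's scoring setup is not shown here. Under that assumption, the sketch below flips the sign back for reading and derives the test RMSE and the train/test MAE gap from the mean values reported above. The helper name `to_positive_errors` is hypothetical.

```python
import math

# mean values copied from the cross-validation output above
cv_means = {
    "test_R2": 0.7345,
    "train_R2": 0.7707,
    "test_Mean Absolute Error": -110.4552,
    "train_Mean Absolute Error": -103.063,
    "test_Mean Squared Error": -23555.3292,
    "train_Mean Squared Error": -20342.1338,
}


def to_positive_errors(scores):
    """Flip the sign of metrics whose name ends in 'Error'; keep the rest."""
    return {
        name: (-value if name.endswith("Error") else value)
        for name, value in scores.items()
    }


readable = to_positive_errors(cv_means)

# RMSE is in the same unit as the rent itself
test_rmse = math.sqrt(readable["test_Mean Squared Error"])

# a small train/test gap suggests limited overfitting
mae_gap = readable["test_Mean Absolute Error"] - readable["train_Mean Absolute Error"]
```

Read this way, the cross-validated test scores are close to the single evaluation above (R2 0.7345 vs 0.735464), and the training scores are only slightly better than the test scores.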
