# [v1] - Training & Evaluation (baseline, dummy, PoC, prototype)

> The goal of this notebook is to train and evaluate a Machine Learning model.

---

## Get Training and Testing datasets

In [1]:
import pandas as pd

train_df = pd.read_csv("../datalake/landing/Train_rev1.csv")
test_df = pd.read_csv("../datalake/landing/Test_rev1.csv")

---

## Get Independent and Dependent (target) variables to train the model

In [2]:
# Independent variables = ContractType + ContractTime.
X = train_df.drop(columns=[
    'Id',
    'Title',
    'FullDescription',
    'LocationRaw',
    'LocationNormalized',
    'Company',
    'Category',
    'SalaryRaw',
    'SalaryNormalized',
    'SourceName'
]).fillna(0)  # Fill missing data with zeros.
X

Unnamed: 0,ContractType,ContractTime
0,0,permanent
1,0,permanent
2,0,permanent
3,0,permanent
4,0,permanent
...,...,...
244763,0,contract
244764,0,contract
244765,0,contract
244766,0,contract
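
The cell above fills every missing value with zero. Before doing that, it can help to see how much of each column is actually missing. The following is a minimal sketch on a toy frame (the values are illustrative, not the real data), and the `"missing"` label is a hypothetical alternative to `0`, not what this notebook uses:

```python
import pandas as pd

# Toy frame standing in for train_df (illustrative values, not the real data).
toy_df = pd.DataFrame({
    "ContractType": [None, "full_time", None, None],
    "ContractTime": ["permanent", None, "contract", "permanent"],
})

# isna() marks missing cells; the column-wise mean is the share of missing values.
missing_share = toy_df.isna().mean()
print(missing_share)

# Hypothetical alternative to fillna(0): an explicit string label,
# so each categorical column holds only strings.
filled = toy_df.fillna("missing")
print(filled)
```

On the real data, `train_df[["ContractType", "ContractTime"]].isna().mean()` would give the same kind of per-column share.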


In [3]:
y = train_df["SalaryNormalized"]
y

0         25000
1         30000
2         30000
3         27500
4         25000
          ...  
244763    22800
244764    22800
244765    22800
244766    22800
244767    42500
Name: SalaryNormalized, Length: 244768, dtype: int64

---

## Split the training dataset into "training" and "validation" sets

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42
)
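
With `test_size=0.3`, roughly 30% of the rows go to validation. The sketch below works out the split sizes by hand. It assumes, as far as I understand scikit-learn's behavior, that a fractional `test_size` is rounded up and the remaining rows go to training:

```python
from math import ceil

n_rows = 244768   # length of y, as printed above
test_size = 0.3   # same value passed to train_test_split

# Assumption: the validation share is rounded up and the rest goes to training.
n_valid = ceil(test_size * n_rows)
n_train = n_rows - n_valid
print(n_train, n_valid)
```

The validation count computed this way, 73431, agrees with the shape of the prediction array reported later in the notebook.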

---

## Training the model

In [5]:
from catboost import Pool

# Encapsulate training data.
pool_train = Pool(
    X_train,
    y_train,
    cat_features = ['ContractType', 'ContractTime'],
)

# Encapsulate validation data.
pool_valid = Pool(
    X_valid,
    y_valid,
    cat_features = ['ContractType', 'ContractTime'],
)

In [6]:
from catboost import CatBoostRegressor

model = CatBoostRegressor()

model.fit(
    pool_train,
    eval_set=pool_valid,
    silent=True,
)

<catboost.core.CatBoostRegressor at 0x7f331d186a70>

---

## Making some predictions

In [7]:
# Predictions on the validation data.
salaries_predicted = model.predict(X_valid)

In [8]:
salaries_predicted

array([35356.20052772, 24519.87864313, 35898.96578245, ...,
       35356.20052772, 18477.37896706, 35356.20052772])

In [9]:
salaries_predicted.shape

(73431,)

---

## Comparing "predicted salaries" with actual salaries (y_valid)

**Preparing statistics for predicted salaries:**  
For numeric columns, `describe()` does not include the mode, and I had problems appending it to the resulting DataFrame, so I took a manual approach with dictionaries.

In [10]:
# Create a DataFrame to store salaries predicted.
df_salaries_predicted = pd.DataFrame({'Salary Predicted': salaries_predicted})

In [11]:
# Create a dictionary to store describe() method statistics.
# describe() on the single column returns a Series, which converts directly to a dict.
predicted_dict = df_salaries_predicted['Salary Predicted'].describe().to_dict()

In [12]:
# Append mode() statistics to dictionary.
predicted_dict['mode'] = df_salaries_predicted.mode().iloc[0, 0]

In [13]:
# Create a DataFrame to store statistics of the predicted salaries.
salaries_predicted_statistics = pd.DataFrame({'Statistics of Predicted Salaries': predicted_dict}, predicted_dict.keys())
salaries_predicted_statistics

Unnamed: 0,Statistics of Predicted Salaries
count,73431.0
mean,34160.545258
std,4154.619585
min,18477.378967
25%,35356.200528
50%,35356.200528
75%,35767.796688
max,36596.411695
mode,35356.200528


**Preparing statistics for actual salaries (y_valid):**

In [14]:
# Create a DataFrame to store actual salaries (y_valid).
df_actual_salaries = pd.DataFrame({'Actual Salaries': y_valid})

In [15]:
# Create a dictionary to store describe() method statistics.
# describe() on the single column returns a Series, which converts directly to a dict.
actual_salaries_dict = df_actual_salaries['Actual Salaries'].describe().to_dict()

In [16]:
# Append mode() statistics to dictionary.
actual_salaries_dict['mode'] = df_actual_salaries.mode().iloc[0, 0]

In [17]:
# Create a DataFrame to store statistics of the actual salaries (y_valid).
actual_salaries_statistics = pd.DataFrame({'Statistics of Actual Salaries (y_valid)': actual_salaries_dict}, actual_salaries_dict.keys())
actual_salaries_statistics

Unnamed: 0,Statistics of Actual Salaries (y_valid)
count,73431.0
mean,34070.297531
std,17589.390641
min,5000.0
25%,21500.0
50%,30000.0
75%,42500.0
max,200000.0
mode,35000.0


**Create a diff_df to compare the values:**

In [18]:
diff_df = pd.concat([salaries_predicted_statistics, actual_salaries_statistics], axis=1)

In [19]:
diff_df

Unnamed: 0,Statistics of Predicted Salaries,Statistics of Actual Salaries (y_valid)
count,73431.0,73431.0
mean,34160.545258,34070.297531
std,4154.619585,17589.390641
min,18477.378967,5000.0
25%,35356.200528,21500.0
50%,35356.200528,30000.0
75%,35767.796688,42500.0
max,36596.411695,200000.0
mode,35356.200528,35000.0
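
In the table above, the predicted 25% and 50% quantiles and the mode all sit at about 35,356, which suggests that many rows receive exactly the same prediction. That is expected with only two categorical inputs: rows that share the same ContractType and ContractTime pair get the same predicted salary. A minimal sketch for checking this, using the six values visible in the prediction output (rounded) as stand-in data:

```python
from collections import Counter

# Illustrative predictions, rounded from the array printed earlier (not the full array).
predicted = [35356.20, 24519.88, 35898.97, 35356.20, 18477.38, 35356.20]

# Count how often each distinct predicted value occurs.
counts = Counter(predicted)
print(len(counts), "distinct values")
print(counts.most_common(1))
```

On the full array, `Counter(salaries_predicted.tolist())` would be one way to run the same count.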


---

## Evaluating the model

> Finally, let's **evaluate the model**.

The **evaluation metric** is the mean absolute error (**[MAE](https://en.wikipedia.org/wiki/Mean_absolute_error)**).

In [20]:
from sklearn.metrics import mean_absolute_error

In [21]:
mae = mean_absolute_error(y_valid, salaries_predicted)

In [22]:
mae

12877.813536658401
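
MAE is the average of the absolute differences between actual and predicted values: MAE = (1/n) Σ |y_i − ŷ_i|. The sketch below computes it by hand on toy numbers (chosen for easy arithmetic, not taken from the dataset) to show what `mean_absolute_error` measures:

```python
# Toy values, chosen only to make the arithmetic easy to follow.
actual = [25000, 30000, 42500]
predicted = [27000, 29000, 39500]

# Absolute error per row: |actual - predicted|.
absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]

# MAE is the mean of those absolute errors.
mae_by_hand = sum(absolute_errors) / len(absolute_errors)
print(absolute_errors, mae_by_hand)
```

Read this way, the MAE of about 12,878 reported above means the predictions miss the actual salary by roughly that amount on average, in the same units as SalaryNormalized.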

---

## Saving the model

In [23]:
model.save_model("../datalake/curated/model-v1.cbm")

---

# [v1] - Training & Evaluation (Summary)

 - **This model uses the following variables:**
   - **Independent variables:**
     - ContractType
     - ContractTime
   - **Dependent variable (target):**
     - SalaryNormalized
 - **Preprocessing:**
   - We only apply `fillna(0)` to missing data.
     - 73% of ContractType values were missing.
     - 26% of ContractTime values were missing.
   - **NOTE:**
     - A lot of data is missing, but the focus for now is on creating a baseline model (baseline, dummy, PoC, prototype).
     - That is, the simplest model possible.
 - **Comparison between predicted data and validation data (y_valid), rounded to whole units:**
   - **Mean:**
     - Salary predicted: 34,160
     - y_valid: 34,070
   - **Standard Deviation (std):**
     - Salary predicted: 4,155
     - y_valid: 17,589
   - **Min value:**
     - Salary predicted: 18,477
     - y_valid: 5,000
   - **25% = Lower quartile, or first quartile (Q1):**
     - Salary predicted: 35,356
     - y_valid: 21,500
   - **50% = Second quartile (Q2, or the Median):**
     - Salary predicted: 35,356
     - y_valid: 30,000
   - **75% = The upper quartile, or third quartile (Q3):**
     - Salary predicted: 35,768
     - y_valid: 42,500
   - **Max value:**
     - Salary predicted: 36,596
     - y_valid: 200,000
   - **Mode:**
     - Salary predicted: 35,356
     - y_valid: 35,000
 - **The result of Evaluation Metric (MAE) was:**
   - 12877.813536658401
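
One way to put the MAE in context, which is not part of this notebook, is to compare it with a constant predictor that always answers with the median of the training salaries; among constant predictions, the median is the one that minimizes MAE on the data it is computed from. A minimal sketch on toy values (`y_train_toy` and `y_valid_toy` are hypothetical stand-ins, not the real targets):

```python
from statistics import median

# Toy salaries standing in for y_train and y_valid (illustrative values only).
y_train_toy = [21500, 30000, 42500, 25000, 35000]
y_valid_toy = [20000, 30000, 50000]

# A constant predictor: always answer with the training median.
constant_prediction = median(y_train_toy)

# MAE of the constant predictor on the validation targets.
baseline_mae = sum(abs(y - constant_prediction) for y in y_valid_toy) / len(y_valid_toy)
print(constant_prediction, baseline_mae)
```

If the model's MAE on the real data is not clearly below that of such a constant predictor, the two features are adding little information.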

---

Ro**drigo** **L**eite da **S**ilva - **drigols**