# Training the model

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

### 1. Introduction

In [46]:
df = pd.read_csv('data/preprocessed_data.csv')

In [47]:
# drop the identifier and the 'Deposit' column in a single call
df = df.drop(columns=['_id', 'Deposit'])

### 2. Encoding & Train-Validation-Test Split

One-hot encoding has to be fitted after the train-validation-test split, for two reasons:
1. **Data Leakage Prevention**: If the encoder is fitted before the split, the set of categories it learns includes values that appear only in the validation and test sets. Information from data the model is not supposed to see then shapes the training features, so the validation and test scores no longer reflect performance on unseen data.
2. **Controlling Dimensionality**: One-hot encoding adds one column per category, which raises the dimensionality of the dataset and can make the model more complex and prone to overfitting. Fitting on the training set alone limits the columns to categories actually seen during training; categories that appear only in the validation or test sets get no column of their own.

#### 2.1. Label Encoding of 'Floor'

In [48]:
# get unique values of the column 'Floor' of df
floor_unique = df['Floor'].unique()
# keep only the values that do not contain 'planta ' (but keep the 'entreplanta' values)
floor_values_without_planta = [x for x in floor_unique if 'planta ' not in x or 'entreplanta' in x]
floor_values_without_planta

['bajo exterior',
 'no aplica',
 'bajo interior',
 'semi-sótano interior',
 'semi-sótano exterior',
 ' exterior',
 'entreplanta interior',
 'bajo',
 'entreplanta exterior',
 'sótano exterior',
 ' interior',
 'sótano interior']

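The values that sit on the floor scale are encoded by their position relative to the ground floor:

* 'bajo exterior', 'bajo interior', 'bajo': 0

* 'semi-sótano interior', 'semi-sótano exterior': -1

* 'sótano interior', 'sótano exterior': -2

The remaining values have no clear position on the floor scale, so they are mapped to sentinel values far outside the range of real floor numbers to keep them "neutral":

* 'entreplanta interior', 'entreplanta exterior': -100

* ' exterior', ' interior' (listed above with a leading space): -999

* 'no aplica': 999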
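The cell that defines `extract_floor_number` is not shown in this section. The following is a minimal sketch of what such a function could look like, not the original implementation. It assumes that the remaining values have the form 'planta N ...' with the floor number written as digits, and that surrounding whitespace can be stripped before the lookup; the `SPECIAL_FLOORS` name is hypothetical.

```python
import re

# Mapping taken from the list above; keys are compared after stripping whitespace.
# SPECIAL_FLOORS is a hypothetical name used only for this sketch.
SPECIAL_FLOORS = {
    'bajo': 0, 'bajo exterior': 0, 'bajo interior': 0,
    'semi-sótano interior': -1, 'semi-sótano exterior': -1,
    'sótano interior': -2, 'sótano exterior': -2,
    'entreplanta interior': -100, 'entreplanta exterior': -100,
    'exterior': -999, 'interior': -999,
    'no aplica': 999,
}

def extract_floor_number(value):
    """Return the integer code for a raw 'Floor' value."""
    text = str(value).strip()
    if text in SPECIAL_FLOORS:
        return SPECIAL_FLOORS[text]
    # Assumption: every other value looks like 'planta 3 exterior',
    # so the first run of digits is the floor number.
    match = re.search(r'\d+', text)
    if match is None:
        raise ValueError(f"Unrecognised floor value: {value!r}")
    return int(match.group())
```

Raising an error on unrecognised values, rather than silently returning a default, makes it obvious when the data contains a label that the mapping does not cover.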

In [50]:
df['Floor'] = df['Floor'].apply(extract_floor_number).astype(int)

In [51]:
print(sorted(df['Floor'].unique().tolist()))

[-999, -100, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 24, 26, 30, 53, 60, 999]


#### 2.2. Train-Validation-Test Split

In [53]:
df.shape[0]

8280
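The cell that performs the split itself is not shown here. As an illustration only, a three-way split can be sketched with two calls to scikit-learn's `train_test_split`. The 70/15/15 proportions, the seed, and the helper name `split_train_val_test` are assumptions, not values taken from this notebook, and a small toy frame stands in for `df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_train_val_test(frame, val_size=0.15, test_size=0.15, seed=42):
    """Split a DataFrame into train, validation and test parts.

    The default proportions (70/15/15) are an assumption for this sketch.
    """
    holdout_size = val_size + test_size
    # First cut: training set versus everything held out.
    train_part, holdout = train_test_split(
        frame, test_size=holdout_size, random_state=seed
    )
    # Second cut: divide the held-out rows into validation and test.
    val_part, test_part = train_test_split(
        holdout, test_size=test_size / holdout_size, random_state=seed
    )
    return train_part, val_part, test_part

# Toy frame standing in for df
toy = pd.DataFrame({'Floor': range(100)})
train_df, val_df, test_df = split_train_val_test(toy)
```

Applied to the 8280 rows of `df`, the same call would leave roughly 5796 rows for training and 1242 each for validation and testing under these assumed proportions.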

#### 2.3. One-Hot Encoding of 'Type'

Encode the 'Type' column with one-hot encoding, fitting the encoder on the training set only and reusing it for the validation and test sets.

#### 2.4. Embeddings