# Henry PI 2: Machine Learning

• The "expensive" class includes the mean

• stacking & walking

• check for repeated descriptions: ads published multiple times

• RobustScaler handles outliers better than StandardScaler

• the southernmost record appears to be in Nariño, but there are also records in Amazonas
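
The note about scalers can be illustrated with a minimal sketch. The five values below are made up for illustration (they are not taken from the dataset); the point is only to show how a single outlier affects each scaler.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical feature: four ordinary values and one outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# StandardScaler centers on the mean and divides by the standard deviation
standard = StandardScaler().fit_transform(values).ravel()
# RobustScaler centers on the median and divides by the interquartile range
robust = RobustScaler().fit_transform(values).ravel()

# How much of the scaled range is left for the four ordinary values
standard_spread = standard[:4].max() - standard[:4].min()
robust_spread = robust[:4].max() - robust[:4].min()

print(f'StandardScaler: {standard.round(3)}')
print(f'RobustScaler:   {robust.round(3)}')
print(f'Spread of the ordinary values: {standard_spread:.3f} vs {robust_spread:.3f}')
```

With StandardScaler the outlier inflates the standard deviation, so the ordinary values end up squeezed close together; RobustScaler keeps them spread out because the median and the quartiles barely move.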

## ------------- D A T A --- E X P L O R A T I O N --------------

We start by importing the libraries that we need

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from sklearn import preprocessing
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from helpers import *

In [None]:
# Next we import the dataset with the training data into a Pandas DataFrame

original_df = pd.read_csv('datasets/properties_colombia_train.csv')
#original_df.sample(5)

In [None]:
# Now we obtain some basic information about the DataFrame, along with the mean value from the feature we will use to create the target column

original_price_mean = original_df.price.mean()

print(f'• Original shape: {original_df.shape}\n')
print(f'• Original columns: {original_df.columns}\n')
print(f"• Original price column's mean: {original_price_mean}")

In [None]:
# We look for duplicated registers (spoiler: there are none)

original_df.duplicated().value_counts()

In [None]:
# We look for missing values per feature (we find a lot of them, particularly in l4, l5, l6, rooms, bedrooms, surface_total, surface_covered and price_period)

original_df.isnull().sum()

## ---------- G E N E R A T I N G --- T A R G E T --- C O L U M N ----------

In [None]:
# We start by creating a copy of the original dataset and looking for missing values in the 'price' column (which we can see above as well)
# Our targets will be obtained from the information contained in this column, so any training data without an associated target value will be pretty much useless.

df_Xy = original_df.copy()
df_Xy.price.isnull().sum()

We can see that we have 63 missing values in the 'price' column, the one we will use to create our target classification based on its mean value.

We proceed to drop those registers, because they need a target value in order to be used for training our models.

In [None]:
df_Xy.dropna(subset=['price'], inplace=True)

price_mean_after_dropna = df_Xy.price.mean()

print(f'Original DataFrame Shape: {original_df.shape}')
print(f'• DataFrame Shape (after dropna-price): {df_Xy.shape}\n')
print(f'• DataFrame Price column mean (after dropna-price): {price_mean_after_dropna}\n')
print(f"• Is the price's mean still the same as the original: {price_mean_after_dropna==original_price_mean}")

In [None]:
# We check again for missing values

df_Xy.price.isnull().sum()

In [None]:
# We check for the extreme values in the column

df_Xy.price.min(), df_Xy.price.max()

In [None]:
# We count how many times these extreme values appear

(df_Xy.price == df_Xy.price.min()).sum(), (df_Xy.price == df_Xy.price.max()).sum()

In [None]:
# As we found an absurdly big value as the max value of price column we check for some information about the biggest values in this column.

#df_Xy.sort_values(by='price',ascending=False).head(100).price.mean() # Output: 54,246'115,351.52
#df_Xy.sort_values(by='price',ascending=False).head(1000).price.mean() # Output: 17,829'070,843.313

#### • IMPORTANT NOTE:

Above we can see that there are extreme outliers in the column from which we get our training targets. This is an important situation that must be addressed with the client, as these outliers (especially the big ones) distort the column's mean value, shifting the boundary between the 'expensive' and 'cheap' houses in our target column.
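
A minimal sketch of the effect described in this note, using five made-up prices (they are hypothetical, not rows from the dataset). The median is shown only as a point of comparison; this notebook uses the mean as the threshold.

```python
import numpy as np

# Hypothetical prices: four ordinary listings and one extreme outlier
prices = np.array([150e6, 200e6, 250e6, 300e6, 345e9])

mean_threshold = prices.mean()
median_threshold = np.median(prices)

# Number of listings labeled 'expensive' under each threshold
expensive_by_mean = int((prices >= mean_threshold).sum())
expensive_by_median = int((prices >= median_threshold).sum())

print(f'Mean threshold:   {mean_threshold:,.0f} -> {expensive_by_mean} expensive')
print(f'Median threshold: {median_threshold:,.0f} -> {expensive_by_median} expensive')
```

In this toy case the outlier pulls the mean so high that only the outlier itself is labeled expensive, while the median still splits the listings roughly in half.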

Now we will create the 'target' column using the values from 'price', separating them into two categories based on the mean of the column.

In [None]:
df_Xy['target'] = (df_Xy['price'] >= original_price_mean).astype(int)
print(df_Xy['target'].shape)
df_Xy['target'].value_counts()

In [None]:
# Now we look for the amount of different values per feature (in order to filter out redundant and non-informative features)

for x in df_Xy:
    print(f'\n• {x}:\t{len(df_Xy[x].value_counts())}')

From the output above we can see that:
1) There are several features with only one value throughout all of the 197486 registers (ad_type, l1, price_period, operation_type). These features give us no information.
2) We can see that the columns labeled 'Unnamed: 0' and the 'id' have unique values (identifiers) for each one of the rows and thus are redundant.

We will proceed to create another dataframe from the original one without these features and without the 'price' column, which was only needed to obtain our 'target' column.

After this we will check for duplicates (once we have removed the identifiers that guaranteed every row was unique) and remove them. This will give us a somewhat clean dataset to begin preprocessing our data, i.e. applying to it the changes that we would apply to any input data given to our finished model in order to get predictions from it.
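
The reason duplicates only show up after removing the identifiers can be seen in a small sketch. The toy DataFrame and its 'id' values are hypothetical.

```python
import pandas as pd

# Hypothetical listings: the first two rows describe the same property
toy = pd.DataFrame({
    'id': ['a1', 'b2', 'c3'],
    'l2': ['Antioquia', 'Antioquia', 'Cundinamarca'],
    'bathrooms': [2, 2, 1],
})

# With the unique identifier present, no row is an exact copy of another
dups_with_id = int(toy.duplicated().sum())
# Once the identifier is dropped, the repeated listing becomes visible
dups_without_id = int(toy.drop(columns=['id']).duplicated().sum())

print(dups_with_id, dups_without_id)
```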

In [None]:
# We create a new DataFrame which we will use to train our model with, ignoring the unnecessary columns

df_train = df_Xy.drop(['ad_type', 'l1', 'price_period', 'operation_type', 'Unnamed: 0', 'id', 'price'], axis=1)
print(f'• Training DataFrame Shape: {df_train.shape}\n')
print(f'• Training DataFrame Columns: {df_train.columns}\n')

In [None]:
df_train.duplicated().value_counts()

We can see that after dropping the non-informative and identifier columns we get a total of 4091 duplicated registers.

We proceed to eliminate them.

In [None]:
df_train.drop_duplicates(inplace=True)
df_train.duplicated().value_counts()

## -------------- D A T A --- P R E P R O C E S S I N G --- 1 --------------

### ---------------------- FINDING THE APPROPRIATE TRANSFORMATIONS ----------------------

In this section we will analyze our dataset's features, grouping them by the type of data they hold (date, location, property, advertising).

This way, we will be able to determine the best transformations to perform on each of them in order to feed our models with the best quality data we can get.

In [None]:
df_train.info()

In [None]:
print(f'• Total registers: {len(df_train)}')
print('• Null values per feature:')
df_train.isnull().sum()

In [None]:
def get_info(feature_list, dataset=df_train, maxmin=False, stats=False):
    for x in feature_list:
        types = set()
        for y in dataset[x]:
            types.add(type(y))
        print(f'\n----- {x} -----\n •Data types: {types}\n •Missing values:')
        print(dataset[x].isnull().value_counts(),'\n')
        if maxmin:
            print(f' •Min: {dataset[x].min()}\n •Max: {dataset[x].max()}\n')
        if stats:
            print(f' •Mean: {dataset[x].mean()}\n •Median: {dataset[x].median()}\n •Mode: {dataset[x].mode()}\n')

### 1) DATE FEATURES: start_date, end_date & created_on

In [None]:
date_features = ['start_date', 'end_date', 'created_on']

get_info(date_features, maxmin=True)

Here we can see that the maximum value of the 'end_date' feature is wrong: this column is supposed to hold the date when the 'for sale' ad stopped showing.

In [None]:
df_train.end_date.value_counts()

In [None]:
df_train[['start_date', 'end_date', 'created_on']].sort_values(by=['end_date', 'start_date'], ascending=False).head(11928)

In [None]:
11925/len(df_train)

We have 11925 wrong values in the 'end_date' feature (about 6% of the registers; the ratio computed above is a fraction, not a percentage).

If we use this column, we will try to replace them with the start date plus the average difference between 'start_date' and 'end_date' (computed without the wrong values).

Also, we will convert these columns to the 'datetime' data type and afterwards format them as timestamps.
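
The plan for the date columns could be sketched as follows. This is only an illustration under assumptions: the toy dates, the placeholder value '9999-12-31' and the cutoff year 2100 are hypothetical and are not confirmed by the outputs above.

```python
import pandas as pd

# Hypothetical rows; '9999-12-31' stands in for a wrong end date
toy = pd.DataFrame({
    'start_date': ['2020-01-01', '2020-01-10', '2020-02-01'],
    'end_date':   ['2020-01-11', '9999-12-31', '2020-02-21'],
})

start = pd.to_datetime(toy['start_date'])
# errors='coerce' turns values that cannot be represented into NaT
end = pd.to_datetime(toy['end_date'], errors='coerce')
# Also discard any date past an (assumed) plausible cutoff
end = end.where(end <= pd.Timestamp('2100-01-01'))

# Average ad duration, computed only from the valid rows (NaT is skipped)
mean_duration = (end - start).mean()
end_filled = end.fillna(start + mean_duration)

# Seconds since the Unix epoch
toy['end_ts'] = (end_filled - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
print(end_filled)
```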

### 2) LOCATION FEATURES: l2, l3, l4, l5, l6, geometry, lat & lon

In [None]:
location_features = ['l2', 'l3', 'l4', 'l5', 'l6', 'geometry', 'lat', 'lon']

get_info(location_features[:-2])
get_info(location_features[-2:], maxmin=True)

From the output above we can see that 'l4', 'l5' and 'l6' have more than half of their values missing, so these columns must be dropped.

#### In the 'l2' feature, corresponding to Colombia's departments (their equivalent to states or provinces) we have no values missing.

#### In the case of 'l3', there are 10828 values missing (about 5.5% of the registers). We will replace the missing values with the capital of the corresponding departments obtained from 'l2'.
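
A vectorized sketch of this imputation on a toy DataFrame. The two-entry dictionary `capitals_subset` is a hypothetical stand-in for the full `capitals` dictionary kept in helpers.py.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the department -> capital dictionary
capitals_subset = {'Antioquia': 'Medellín', 'Nariño': 'Pasto'}

toy = pd.DataFrame({
    'l2': ['Antioquia', 'Nariño', 'Antioquia'],
    'l3': ['Envigado', np.nan, np.nan],
})

# Map each department to its capital and use it only where 'l3' is missing
toy['l3'] = toy['l3'].fillna(toy['l2'].map(capitals_subset))
print(toy)
```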

In [None]:
# Here we have a dictionary containing each of the 32 Colombian departments as keys, with their corresponding capitals as values.
# This dictionary is stored in the helpers.py file

'''
capitals = {'Amazonas': 'Leticia', 'Antioquia': 'Medellín', 'Arauca': 'Arauca', 'Atlántico': 'Barranquilla', 'Bolívar' : 'Cartagena', 'Boyacá': 'Tunja',
            'Caldas': 'Manizales', 'Caquetá': 'Florencia', 'Casanare': 'Yopal', 'Cauca': 'Popayán', 'Cesar': 'Valledupar', 'Chocó': 'Quibdó', 'Córdoba': 'Montería',
            'Cundinamarca': 'Bogotá D.C', 'Guainía': 'Puerto Inírida', 'Guaviare': 'San José del Guaviare', 'Huila': 'Neiva', 'La Guajira': 'Riohacha', 
            'Magdalena': 'Santa Marta', 'Meta': 'Villavicencio', 'Nariño': 'Pasto', 'Norte de Santander': 'Cúcuta', 'Putumayo': 'Mocoa', 'Quindío': 'Armenia',
            'Risaralda': 'Pereira', 'San Andrés Providencia y Santa Catalina': 'San Andrés', 'Santander': 'Bucaramanga', 'Sucre': 'Sincelejo', 'Tolima': 'Ibagué',
            'Valle del Cauca': 'Cali', 'Vaupés': 'Mitú', 'Vichada': 'Puerto Carreño'}
'''
len(capitals)


In [None]:
# This is the amount of different cities in the 'l3' feature
len(df_train.l3.unique())

#### Regarding the latitude and longitude values from the dataset, we can see from the output from the function at the beginning of this section that there are 48519 missing values on each of these features.


In [None]:
# Here we check whether the missing values correspond to the same registers in the dataset:

print(f"Rows missing 'lat' values: {len(df_train[df_train['lat'].isnull()])}")
print(f"Rows missing 'lon' values: {len(df_train[df_train['lon'].isnull()])}")
both_missing = df_train['lat'].isnull() & df_train['lon'].isnull()
print(f"Rows missing both 'lat' and 'lon' values: {both_missing.sum()}")


In [None]:
# Here we take a sample as a further check that every row missing a 'lat' value is missing its 'lon' value as well
df_train[df_train['lat'].isnull()].sample(10)

#### In order to analyze the latitudes and longitudes of the rows that have values in these columns, we need to define limits for the Colombian territory, beyond which we shouldn't expect to find any lat or lon values.

![Colombia Latitudes and Longitudes](https://i.imgur.com/ZdKWfRG.png)

In [None]:
# We define the corresponding limits as two lists, one for the latitudes and one for the longitudes
# These limits encompass the Colombian insular territories, which extend further west and north than its continental territory

lat_col = [-4.5, 15]    # Southernmost and northernmost latitudes respectively
lon_col = [-82, -67]    # Westernmost and easternmost longitudes respectively

In [None]:
count_lat_smaller = 0   # Registers with a latitude to the south of Colombia
count_lat_greater = 0   # Registers with a latitude to the north of Colombia
for x in df_train.lat:
    if x<lat_col[0]:
        count_lat_smaller += 1
    elif x>lat_col[1]:
        count_lat_greater += 1
    
print(f'• Latitudes south from Colombia: {count_lat_smaller}\n• Latitudes north from Colombia: {count_lat_greater}')

As we can see, there is only 1 value beyond Colombia's latitudes in each direction in our dataset. We can visualize them:

In [None]:
# The two southernmost registers
df_train.sort_values(by='lat').head(2)

In [None]:
df_train.sort_values(by='lat', ascending=False).head(2)

In [None]:
count_lon_smaller = 0   # Registers with a longitude to the west of Colombia
count_lon_greater = 0   # Registers with a longitude to the east of Colombia
for x in df_train.lon:
    if x<lon_col[0]:
        count_lon_smaller += 1
    elif x>lon_col[1]:
        count_lon_greater += 1
    
print(f'• Longitudes to the west from Colombia: {count_lon_smaller}\n• Longitudes to the east from Colombia: {count_lon_greater}')

We found only one misplaced longitude, to the west of Colombia. Now we visualize it:

In [None]:
df_train.sort_values(by='lon').head(2)

In [None]:
df_train.sort_values(by='lon', ascending=False).head(2)

#### There were only 3 misplaced latitude and longitude values in total. They can be replaced by the coordinates of the city in the register ('l3' value).
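
A sketch of how the out-of-bounds check and the replacement could be done in one pass. The toy rows and the `city_coords` lookup (keyed by department and city, with approximate coordinates) are hypothetical; the notebook itself stores coordinates in `dep_ciud_lat_lon`.

```python
import pandas as pd

lat_col = [-4.5, 15]    # Southernmost and northernmost latitudes
lon_col = [-82, -67]    # Westernmost and easternmost longitudes

# Hypothetical lookup: (department, city) -> approximate (lat, lon)
city_coords = {('Cundinamarca', 'Bogotá D.C'): (4.71, -74.07)}

toy = pd.DataFrame({
    'l2': ['Cundinamarca', 'Cundinamarca'],
    'l3': ['Bogotá D.C', 'Bogotá D.C'],
    'lat': [4.65, 40.40],
    'lon': [-74.10, -3.70],
})

# Rows that have coordinates but fall outside the bounding box
inside = toy['lat'].between(*lat_col) & toy['lon'].between(*lon_col)
outside = toy['lat'].notna() & ~inside
n_outside = int(outside.sum())

# Replace the misplaced coordinates with those of the row's city
for idx in toy.index[outside]:
    lat, lon = city_coords[(toy.at[idx, 'l2'], toy.at[idx, 'l3'])]
    toy.loc[idx, ['lat', 'lon']] = [lat, lon]

print(n_outside)
print(toy)
```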

In [None]:
# Here we get the unique 'l3, l2' (city, department) combinations present in the dataset.

has_city = df_train.l3.notna()
comb_series = (df_train.loc[has_city, 'l3'] + ', ' + df_train.loc[has_city, 'l2']).reset_index(drop=True)
unique_l2_l3 = comb_series.unique()

In [None]:
print(len(unique_l2_l3))

In [None]:
df_cities = df_train.l3.unique()
print(f'Amount of different cities in df_train.l3: {len(df_cities)}')
print(f'Amount of different cities in the combination df_train.l3-df_train.l2: {len(unique_l2_l3)}')

In [None]:
l2_l3_cities = []
repeated_cities = []
for x in unique_l2_l3:
    y = x.split(',')
    if y[0] not in l2_l3_cities:
        l2_l3_cities.append(y[0])
    else:
        repeated_cities.append(y[0])
        print(y)


The output above shows different cities that share a name but belong to different Colombian departments. This explains the difference between the number of unique values in 'l3' and the number of unique combinations of 'l2' and 'l3'.

Below you can see the complete list of the aforementioned combinations and confirm that the cities listed above have two entries in the list.

In [None]:
for x in unique_l2_l3:
    y = x.split(',')
    if y[0] in repeated_cities:
        print(x)
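
The same check can also be sketched with a groupby, counting in how many departments each city name appears. The toy rows below are hypothetical.

```python
import pandas as pd

# Hypothetical rows: one city name shared by two departments
toy = pd.DataFrame({
    'l2': ['Bolívar', 'Cauca', 'Antioquia', 'Antioquia'],
    'l3': ['Santa Rosa', 'Santa Rosa', 'Medellín', 'Medellín'],
})

# Number of distinct departments per city name
depts_per_city = toy.dropna(subset=['l3']).groupby('l3')['l2'].nunique()
# City names that appear in more than one department
ambiguous = depts_per_city[depts_per_city > 1].index.tolist()
print(ambiguous)
```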

In [None]:
# Now we create a dictionary with the coordinates for each of the unique combinations
# This code takes too long to run, so it is commented out and its output is saved in a dictionary in the helpers.py file.

'''
geolocator = Nominatim(user_agent='acidminded')
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

dep_ciud_lat_lon = {}

for x in capitals.keys():
    dep_ciud_lat_lon[x] = {}

for x in unique_l2_l3:
    coor = geocode(x)
    y = x.split(',')
    ciud = y[0]
    dep = y[1][1:]
    dep_ciud_lat_lon[dep][ciud] = {'lat':coor.latitude, 'lon':coor.longitude}

for x in capitals:
    dep = x
    if capitals[x] not in dep_ciud_lat_lon[x]:
        ciud = capitals[x]
        coor = geocode(f'{ciud}, {dep}')
        dep_ciud_lat_lon[dep][ciud] = {'lat':coor.latitude, 'lon':coor.longitude}

print(dep_ciud_lat_lon)
'''

In [None]:
# Here we check for problems in our dictionary; we found one, which was corrected manually.

'''
count = 0
problems = []
for dep in dep_ciud_lat_lon:
    for city in  dep_ciud_lat_lon[dep]:
        if (dep_ciud_lat_lon[dep][city]['lat'] < lat_col[0]) or (dep_ciud_lat_lon[dep][city]['lat'] > lat_col[1]):
            count += 1
            problems.append((dep, city, 'lat problem'))
        if (dep_ciud_lat_lon[dep][city]['lon'] < lon_col[0]) or (dep_ciud_lat_lon[dep][city]['lon'] > lon_col[1]):
            count += 1
            problems.append((dep, city, 'lon problem'))
print(count)
print(problems)
'''

'''
OUTPUT:
1
[('Bolívar', 'Santa Rosa', 'lon problem')]
'''


In [None]:
print(dep_ciud_lat_lon.keys())

In [None]:
# Let's see the missing values from the 'geometry' feature

df_train.geometry.value_counts()

There are 48519 values missing from 'geometry', but once we have obtained the lat and lon for all the rows missing them, we can fill in this feature as well.

### 4) PROPERTY FEATURES: rooms, bedrooms, bathrooms, surface_total, surface_covered & property_type

In [None]:
property_features = ['rooms', 'bedrooms', 'bathrooms', 'surface_total', 'surface_covered', 'property_type']

get_info(property_features[:-1], maxmin=True, stats=True)
get_info(property_features[-1:])

From the output above we can see that, among these features, only 'bathrooms' and 'property_type' have less than half of their values missing.

Because of this, 'bathrooms' and 'property_type' will be the only columns from this subset of features that we will use for training our models for the moment.

We are aware that it may be possible to extract meaningful information from each listing's description in order to fill in the missing data in these columns, and that is a path we may explore when improving our first models. For the moment, these two features will suffice.

#### The missing values in the 'bathrooms' column will be imputed with its mean rounded down (2), which also happens to be its median and mode.

In [None]:
df_train.bathrooms.value_counts()

In [None]:
df_train.sort_values(by=['bathrooms'], ascending=False).head(15)

In [None]:
df_train.property_type.value_counts()

In [None]:
property_types = df_train.property_type.unique()
print(property_types)

In [None]:
print(f'\nHouse or apartment registries by amount of bathrooms:\n')
for x in range(5,21):
    print(f'• {x} bathrooms:')
    print('\t',len(df_train.loc[((df_train.property_type == 'Casa') | (df_train.property_type == 'Apartamento'))&((df_train.bathrooms >= x))]))

It is very unlikely that a house or an apartment will have 6 or more bathrooms. For this reason, those values will be replaced by the mean of the column rounded down (2).
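
Both rules for 'bathrooms' (impute missing values and replace implausible counts for houses and apartments) can be sketched in vectorized form. The toy rows are hypothetical, and the fill value 2 is the rounded mean reported above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'property_type': ['Casa', 'Apartamento', 'Casa', 'Finca'],
    'bathrooms': [1.0, np.nan, 8.0, 7.0],
})

fill_value = 2.0  # Rounded mean of the column, as stated in the text

# Houses and apartments with 6 or more bathrooms are treated as errors
residential = toy['property_type'].isin(['Casa', 'Apartamento'])
too_many = residential & (toy['bathrooms'] >= 6)

# Replace the implausible counts, then fill the missing values
toy['bathrooms'] = toy['bathrooms'].mask(too_many, fill_value).fillna(fill_value)
print(toy)
```

The 'Finca' row keeps its 7 bathrooms because the rule only applies to houses and apartments.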

### 5) ADVERTISING FEATURES: currency, title & description 

In [None]:
advertising_features = ['currency', 'title', 'description']

get_info(advertising_features)

In [None]:
df_train[['currency', 'title', 'description']].info()

In [None]:
df_train[df_train['currency'].isnull()]

In [None]:
# Here we can see that 8 of the registers have a price in usd
df_train.currency.value_counts()

In [None]:
df_train.loc[df_train.currency=='USD']

In [None]:
df_train.description.duplicated().value_counts()


In [None]:
df_train.columns

In [None]:
df_train_nd = df_train[['lat', 'lon', 'l2', 'l3', 'l4',
       'l5', 'l6', 'rooms', 'bedrooms', 'bathrooms', 'surface_total',
       'surface_covered', 'currency', 'title', 'description', 'property_type',
       'geometry', 'target']].copy()
df_train_nd.duplicated().value_counts()

In [None]:
df_train_nd.description.duplicated().value_counts()

In [None]:
df_train_nd.drop_duplicates(inplace=True)
df_train_nd.description.duplicated().value_counts()

In [None]:
df_train_nd.info()

In [None]:
dft = df_train.copy()
dft.shape

In [None]:
dft[['lat', 'lon', 'l2', 'l3', 'l4',
       'l5', 'l6', 'rooms', 'bedrooms', 'bathrooms', 'surface_total',
       'surface_covered', 'currency', 'title', 'description', 'property_type',
       'geometry', 'target']].duplicated().value_counts()

In [None]:
dft.drop_duplicates(subset=['lat', 'lon', 'l2', 'l3', 'l4',
       'l5', 'l6', 'rooms', 'bedrooms', 'bathrooms', 'surface_total',
       'surface_covered', 'currency', 'title', 'description', 'property_type',
       'geometry', 'target'], inplace=True)
dft.shape

In [None]:
dft.columns

In [None]:
dft.duplicated().value_counts()

In [None]:
#dft.surface_total.isnull().value_counts()
#dft.info()

## -------------- D A T A --- P R E P R O C E S S I N G --- 2 --------------

### ------------------------------------ CREATING THE PIPELINE ------------------------------------

We will design a pipeline that receives a dataset with the same features as the one we just explored ('df_train'), as it was at the beginning of the previous section (minus the target column). This pipeline will perform the necessary changes to the dataset, feed it to a model of our selection, perform a cross-validation and give us the results.
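
As a rough sketch of what cross-validating a pipeline looks like in scikit-learn, the example below uses synthetic data from `make_classification` and a simple scaler plus KNN pipeline. The data, the pipeline steps and the choice of 5 folds are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (not the properties dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

demo_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=10)),
])

# The pipeline is refit on each training fold, so the scaler never sees the held-out fold
scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring='recall')
print(scores, scores.mean())
```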

As we concluded in the previous section, we will select a few features that we consider relevant to continue with the data preprocessing and model training: l2, l3, lat, lon, bathrooms and property_type.

l2--- categorical (needs encoding). Missing values: none (ok!)

l3--- categorical (needs encoding). Missing values: need imputation using 'capitals'

lat--- numerical (ok). Missing values: need imputation using 'dep_ciud_lat_lon'. Scaled with the standard scaler

lon--- numerical (ok). Missing values: need imputation using 'dep_ciud_lat_lon'

bathrooms--- numerical (ok). Missing values: need imputation using the rounded mean (2). Values greater than 5 in rows with property type 'Casa' or 'Apartamento' are replaced by 2

property_type--- categorical (needs encoding). Missing values: none (ok!)
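
For comparison, a plan like the one above could also be expressed with scikit-learn's built-in `ColumnTransformer`. This is a different technique from the `FunctionTransformer` steps used below, shown only as a sketch: the toy rows are hypothetical, and the median imputer is a generic stand-in for the dictionary-based imputations of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical rows with the same kinds of columns as the plan above
toy_X = pd.DataFrame({
    'l2': ['Antioquia', 'Cundinamarca', 'Antioquia', 'Valle del Cauca'],
    'property_type': ['Casa', 'Apartamento', 'Casa', 'Lote'],
    'lat': [6.25, 4.71, np.nan, 3.45],
    'lon': [-75.56, -74.07, -75.60, np.nan],
    'bathrooms': [2.0, np.nan, 3.0, 1.0],
})
toy_y = [1, 1, 0, 0]

# Numerical columns: impute, then scale
numeric = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
# Categorical columns: one-hot encode, ignoring categories unseen during fit
categorical = OneHotEncoder(handle_unknown='ignore')

preprocess_cols = ColumnTransformer([
    ('num', numeric, ['lat', 'lon', 'bathrooms']),
    ('cat', categorical, ['l2', 'property_type']),
])
model = Pipeline([
    ('preprocess', preprocess_cols),
    ('clf', DecisionTreeClassifier(max_depth=3, random_state=0)),
])
model.fit(toy_X, toy_y)

# A row with a department that was not present during fit
unseen = pd.DataFrame({'l2': ['Nariño'], 'property_type': ['Casa'],
                       'lat': [1.21], 'lon': [-77.28], 'bathrooms': [2.0]})
pred = model.predict(unseen)
print(pred)
```

Because every step is fitted inside the pipeline, the imputer, scaler and encoder only learn from the rows passed to `fit`.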

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, FunctionTransformer, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score
from sklearn.model_selection import train_test_split

In [None]:
X1 = dft.drop('target',axis=1)
y1 = dft.target

#### Here we define some helper functions that will be used to fill missing values during the preprocessing

In [None]:
def fill_l3(df):
    l3_ok = []
    for x in range(len(df)):
        if type(df.loc[x,'l3']) == float:
            dep = df.loc[x,'l2']
            l3_ok.append(str(capitals[dep]))
        else:
            l3_ok.append(str(df.loc[x,'l3']))
    return pd.Series(l3_ok)
        

In [None]:
def fill_coor(df):
    lat_ok = []
    lon_ok = []
    for x in range(len(df)):
        if str(df.loc[x,'lat']) == 'nan':
            dep, city = df.loc[x,'l2':'l3']
            #print('NAN FOUND: ',dep, city, df.loc[x,'lat'], df.loc[x,'lon'])
            try:
                lat_ok.append(float(dep_ciud_lat_lon[dep][city]['lat']))
                lon_ok.append(float(dep_ciud_lat_lon[dep][city]['lon']))
            except KeyError:
                geolocator = Nominatim(user_agent='acidminded')
                geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
                txt = f'{city}, {dep}'
                coor = geocode(txt)
                lat_ok.append(coor.latitude)
                lon_ok.append(coor.longitude)
                dep_ciud_lat_lon.setdefault(dep, {})[city] = {'lat':coor.latitude, 'lon':coor.longitude}
                print(f'NEW DATA: {dep}, {city} lat: {coor.latitude} lon: {coor.longitude}')
        else:
            lat_ok.append(df.loc[x,'lat'])
            lon_ok.append(df.loc[x,'lon'])
    return pd.Series(lat_ok), pd.Series(lon_ok)

In [None]:
prtypes = ['Casa', 'Apartamento', 'Otro', 'Oficina', 'Finca', 'Lote', 'Local comercial', 'Parqueadero']
prtypes_avgbtrms = {}

for x in prtypes:
    prtypes_avgbtrms[x] = X1.loc[X1.property_type==x].bathrooms.mean()
    if str(prtypes_avgbtrms[x]) == 'nan':
        prtypes_avgbtrms[x] = 1
for x in prtypes_avgbtrms:
    prtypes_avgbtrms[x] = round(prtypes_avgbtrms[x])

def fill_bathrooms(df):  # Impute missing bathrooms and replace implausible counts
    btrms_ok = []
    for x in range(len(df)):
        btrms = df.loc[x,'bathrooms'] 
        prtype = df.loc[x,'property_type']

        if str(btrms) == 'nan':
            btrms_ok.append(prtypes_avgbtrms[prtype])
        elif btrms >= 6:
            if prtype in ['Casa', 'Apartamento']:
                btrms_ok.append(float(2))
            else:
                btrms_ok.append(btrms)
        else:
            btrms_ok.append(btrms)
    return pd.Series(btrms_ok)

In [None]:
#std_scaler = StandardScaler().fit(X1[['lat','lon']].to_numpy()[:,:])
#min_max_scaler = MinMaxScaler().fit(X1[['bathrooms']].to_numpy()[:,:])
print(X1.to_numpy().shape)

In [None]:
def fill_nan(dataset):
    X = dataset[['bathrooms', 'lat', 'lon', 'l2', 'l3', 'property_type']].copy()
    X.reset_index(inplace=True, drop=True)
    # Fill missing values
    X['l3'] = fill_l3(X)
    X['lat'], X['lon'] = fill_coor(X)
    X['bathrooms'] = fill_bathrooms(X)
    # Prepare to transform to numerical
    X['l2'] = X['l2'].astype('category')
    X['l3'] = X['l3'].astype('category')
    X['property_type'] = X['property_type'].astype('category')
    return X

In [None]:
X2 = fill_nan(X1)
X2.isnull().sum()

In [None]:
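# handle_unknown='ignore' encodes categories unseen during fit as all zeros instead of raising an error
l2_coder = OneHotEncoder(handle_unknown='ignore')
l3_coder = OneHotEncoder(handle_unknown='ignore')
pt_coder = OneHotEncoder(handle_unknown='ignore')
depts = list(capitals.keys())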
cities = set()
for dept in dep_ciud_lat_lon:
    for city in dep_ciud_lat_lon[dept]:
        cities.add(city)

l2_cod = l2_coder.fit(pd.Series(depts).values.reshape(-1, 1))
l3_cod = l3_coder.fit(pd.Series(list(cities)).values.reshape(-1, 1))
pt_cod = pt_coder.fit(X2[['property_type']])
'''
l2_cod = l2_coder.fit(X2[['l2']])
l3_cod = l3_coder.fit(X2[['l3']])
pt_cod = pt_coder.fit(X2[['property_type']])
'''

In [None]:
def preprocess_to_num(dataset):
    #print(f'INITIAL TO_NUM SHAPE: {dataset.shape}')
    
    X = dataset.copy()
    
    #print(f'TO_NUM SHAPE BEFORE ONEHOT: {X.shape}')

    #l2_coder = OneHotEncoder()
    #l2_cod = l2_coder.fit_transform(X[['l2']])
    l2_cod = l2_coder.transform(X[['l2']])
    new_l2 = pd.DataFrame(l2_cod.toarray())
    #print(f'+{new_l2.shape} (l2)')

    #l3_coder = OneHotEncoder()
    #l3_cod = l3_coder.fit_transform(X[['l3']])
    l3_cod = l3_coder.transform(X[['l3']])
    new_l3 = pd.DataFrame(l3_cod.toarray())
    #print(f'+{new_l3.shape} (l3)')

    #pt_coder = OneHotEncoder()
    #pt_cod = pt_coder.fit_transform(X[['property_type']])
    pt_cod = pt_coder.transform(X[['property_type']])
    new_pt = pd.DataFrame(pt_cod.toarray(), columns=pt_coder.categories_[0])
    #print(f'+{new_pt.shape} (pt)')

    #print(f'TO_NUM SHAPE AFTER ONEHOT: {X.shape}')

    X_ok = pd.concat([X, new_l2, new_l3, new_pt], axis=1)
    X_ok.drop(['l2','l3','property_type'], axis=1, inplace=True)

    X_num = X_ok.to_numpy()
    #print(f'FINAL TO_NUM SHAPE: {X_num.shape}')

    return X_num
    

X_num = preprocess_to_num(X2)

In [None]:
print(X_num.shape)
X_num

In [None]:
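min_max_scaler = MinMaxScaler().fit(X_num[:,:1])
std_scaler = StandardScaler().fit(X_num[:,1:])

# PCA is fitted on the scaled matrix, the same representation it transforms later
X_scaled = np.copy(X_num)
X_scaled[:,:1] = min_max_scaler.transform(X_num[:,:1])
X_scaled[:,1:] = std_scaler.transform(X_num[:,1:])
pca = PCA(n_components=30, whiten=False).fit(X_scaled)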

def preprocess_std_dimred(matrix):
    #print(f' STANDARDIZATION INITIAL SHAPE: {matrix.shape}')
    X =  np.copy(matrix)
    X[:,:1] = min_max_scaler.transform(matrix[:,:1])
    X[:,1:] = std_scaler.transform(matrix[:,1:])
    #print(f' STANDARDIZATION INTERMEDIATE SHAPE: {X.shape}')
    X = pca.transform(X)
    #print(f' STANDARDIZATION FINAL SHAPE: {X.shape}')
    return X
    

In [None]:
X_ready = preprocess_std_dimred(X_num)
X_ready.shape

In [None]:
fill_df = FunctionTransformer(fill_nan)
df_to_num = FunctionTransformer(preprocess_to_num)
mat_to_X = FunctionTransformer(preprocess_std_dimred)

In [None]:
def preprocess(df):
    df_ok = df.copy()
    df_ok2 = fill_nan(df_ok)
    df_ok3 = preprocess_to_num(df_ok2)
    df_ok4 = preprocess_std_dimred(df_ok3)

    return df_ok4

In [None]:
def fit_and_print(pipeline, X_train, y_train, X_test, y_test):
    #print(f'SHAPE TO FIT: {X_train.shape}')
    pipeline.fit(X_train, y_train)
    train_preds = pipeline.predict(X_train)
    test_preds = pipeline.predict(X_test)
    print('• TRAIN DATA:')
    print(f'Train confusion matrix: \n{confusion_matrix(y_train, train_preds)}')
    print(f'Train accuracy: {accuracy_score(y_train, train_preds)}')
    print(f'Train recall: {recall_score(y_train, train_preds)}\n\n')

    print('• TEST DATA:')
    print(f'Test confusion matrix: \n{confusion_matrix(y_test, test_preds)}')
    print(f'Test accuracy: {accuracy_score(y_test, test_preds)}')
    print(f'Test recall: {recall_score(y_test, test_preds)}')

In [None]:
pipeline_1 = Pipeline([('Fill_DF', fill_df),('To_Num', df_to_num),('Standardize_&_DimRed', mat_to_X), ('KNClassifier-10nneigh', KNeighborsClassifier(n_neighbors=10))])
pipeline_2 = Pipeline([('Fill_DF', fill_df),('To_Num', df_to_num),('Standardize_&_DimRed', mat_to_X), ('KNClassifier-15nneigh', KNeighborsClassifier(n_neighbors=15))])
pipeline_3 = Pipeline([('Fill_DF', fill_df),('To_Num', df_to_num),('Standardize_&_DimRed', mat_to_X), ('DTClassifier-mxdepth8', DecisionTreeClassifier(max_depth=8))])
pipeline_4 = Pipeline([('Fill_DF', fill_df),('To_Num', df_to_num),('Standardize_&_DimRed', mat_to_X), ('DTClassifier-mxdepth15', DecisionTreeClassifier(max_depth=15))])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X2, y1, test_size=0.2, random_state=42)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

In [None]:
# Kneighbors classifier n_neighbors=10
fit_and_print(pipeline_1, X_train, y_train, X_test, y_test)

In [None]:
# Kneighbors classifier n_neighbors=15
fit_and_print(pipeline_2, X_train, y_train, X_test, y_test)

In [None]:
# Decision tree classifier max_depth=8
fit_and_print(pipeline_3, X_train, y_train, X_test, y_test)

In [None]:
# Decision tree classifier max_depth=15
fit_and_print(pipeline_4, X_train, y_train, X_test, y_test)

In [None]:
real_test = pd.read_csv('datasets/properties_colombia_test.csv')
real_test.shape

In [None]:
real_test_X = preprocess(real_test)
real_test_X.shape