# Henry PI 2: Machine Learning

• "Expensive" properties include the mean (price >= mean)

• stacking & walking

• check for repeated descriptions: ads published multiple times

• RobustScaler handles outliers better than StandardScaler

• the southernmost record appears to be in Nariño, but there are also records in Amazonas

## ------------- D A T A --- E X P L O R A T I O N --- 1 --------------

### -------------------------- GETTING TO KNOW THE DATASET --------------------------

We start by importing the libraries that we need

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from sklearn import preprocessing
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from helpers import *


In [2]:
import warnings

def disable_pandas_warnings():
    warnings.resetwarnings()  # Maybe somebody else is messing with the warnings system?
    warnings.filterwarnings('ignore')  # Ignore everything
    # ignore everything does not work: ignore specific messages, using regex
    warnings.filterwarnings('ignore', '.*A value is trying to be set on a copy of a slice from a DataFrame.*')
    warnings.filterwarnings('ignore', '.*indexing past lexsort depth may impact performance*')
disable_pandas_warnings()

In [3]:
# Next we import the dataset with the training data into a Pandas DataFrame

original_df = pd.read_csv('datasets/properties_colombia_train.csv')
#original_df.sample(5)

In [4]:
# Now we obtain some basic information about the DataFrame, along with the mean value from the feature we will use to create the target column

original_price_mean = original_df.price.mean()

print(f'• Original shape: {original_df.shape}\n')
print(f'• Original columns: {original_df.columns}\n')
print(f"• Original price column's mean: {original_price_mean}")

• Original shape: (197549, 27)

• Original columns: Index(['Unnamed: 0', 'id', 'ad_type', 'start_date', 'end_date', 'created_on',
       'lat', 'lon', 'l1', 'l2', 'l3', 'l4', 'l5', 'l6', 'rooms', 'bedrooms',
       'bathrooms', 'surface_total', 'surface_covered', 'price', 'currency',
       'price_period', 'title', 'description', 'property_type',
       'operation_type', 'geometry'],
      dtype='object')

• Original price column's mean: 643605091.0064613


In [5]:
# We look for duplicated rows (spoiler: there are none)

original_df.duplicated().value_counts()

False    197549
dtype: int64

In [6]:
# We look for missing values per feature (we find a lot of them, particularly in l4, l5, l6, rooms, bedrooms, surface_total, surface_covered and price_period)

original_df.isnull().sum()

Unnamed: 0              0
id                      0
ad_type                 0
start_date              0
end_date                0
created_on              0
lat                 49498
lon                 49498
l1                      0
l2                      0
l3                  11032
l4                 152182
l5                 170140
l6                 190682
rooms              170012
bedrooms           157024
bathrooms           41082
surface_total      190575
surface_covered    187747
price                  63
currency               67
price_period       161578
title                   1
description           121
property_type           0
operation_type          0
geometry                0
dtype: int64

### ---------- G E N E R A T I N G --- T A R G E T --- C O L U M N ----------

In [7]:
# We start by creating a copy of the original dataset and looking for missing values in the 'price' column (which we can see above as well)
# Our targets will be obtained from the information contained in this column, so any training data without an associated target value will be pretty much useless.

df_Xy = original_df.copy()
df_Xy.price.isnull().sum()

63

We can see that we have 63 missing values in the 'price' column, the one we will use to create our target classification based on its mean value.

We proceed to drop those rows, because every row needs a target value in order to train our models.

In [8]:
df_Xy.dropna(subset=['price'], inplace=True)

price_mean_after_dropna = df_Xy.price.mean()

print(f'• Original DataFrame Shape: {original_df.shape}')
print(f'• DataFrame Shape (after dropna-price): {df_Xy.shape}\n')
print(f'• DataFrame Price column mean (after dropna-price): {price_mean_after_dropna}')
print(f"• Is the price's mean still the same as the original: {price_mean_after_dropna==original_price_mean}")

• Original DataFrame Shape: (197549, 27)
• DataFrame Shape (after dropna-price): (197486, 27)

• DataFrame Price column mean (after dropna-price): 643605091.0064613
• Is the price's mean still the same as the original: True


In [9]:
# We check again for missing values

df_Xy.price.isnull().sum()

0

In [10]:
# We check for the extreme values in the column

df_Xy.price.min(), df_Xy.price.max()

(0.0, 345000000000.0)

In [11]:
# We check how many times these extreme values appear

df_Xy.price.value_counts()[0], df_Xy.price.value_counts()[345000000000.0]

(4, 1)

UPDATE:

In the next couple of cells we obtain the mean of the 'price' column after removing its outliers. This is the mean we ended up using to create the target column with which our models were trained.

This strategy, implemented late in the process, radically improved our model results and was a response to the problem found in the 'Important Note' registered below.

In [12]:
Q1 = df_Xy.price.quantile(0.25)
Q3 = df_Xy.price.quantile(0.75)
IQR = Q3 - Q1
IQR, df_Xy.shape

(400000000.0, (197486, 27))

In [13]:
df_Xy_no = df_Xy[~((df_Xy['price'] < (Q1 - 1.5 * IQR)) | (df_Xy['price'] > (Q3 + 1.5 * IQR)))]
df_Xy_no.shape

(178169, 27)

In [14]:
price_mean_after_do = df_Xy_no.price.mean()
print(f'Price mean after dropping outliers: {price_mean_after_do}')

Price mean after dropping outliers: 373486750.12355685


#### • IMPORTANT NOTE:

Above we can see that there are extreme outliers in the column from which we derive our training targets. This is an important issue that must be addressed with the client, as these outliers (especially the large ones) distort the column's mean value, shifting the boundary between 'expensive' and 'cheap' properties that defines our target column.

Now we will create the 'target' column using the values from 'price', separating them into two categories based on the mean of the column.

UPDATE: We used the mean of the price column WITHOUT its outliers to improve the model.
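To illustrate why the threshold matters, here is a minimal sketch on synthetic prices (the values and the helper name `iqr_filter` are made up for illustration, they are not part of the notebook). It applies the same 1.5 × IQR rule used above and compares the class balance obtained with each mean.

```python
import pandas as pd

# Synthetic prices (hypothetical values, NOT taken from the dataset)
prices = pd.Series([100.0, 120.0, 150.0, 180.0, 200.0, 220.0, 250.0, 300.0, 50_000.0])

def iqr_filter(s, k=1.5):
    """Keep only the values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s >= q1 - k * iqr) & (s <= q3 + k * iqr)]

raw_mean = prices.mean()                    # pulled upwards by the single huge value
trimmed_mean = iqr_filter(prices).mean()    # mean of the non-outlier prices

# The target is still computed for EVERY row; only the threshold changes
target_raw = (prices >= raw_mean).astype(int)
target_trimmed = (prices >= trimmed_mean).astype(int)

print(f'Raw mean: {raw_mean:.2f} -> expensive rows: {target_raw.sum()}')
print(f'Trimmed mean: {trimmed_mean:.2f} -> expensive rows: {target_trimmed.sum()}')
```

With the raw mean only the outlier itself is labeled as expensive, while the trimmed mean gives a much more balanced split. Note that, as in the notebook, the outlier rows are kept in the training data; they are only excluded when computing the threshold.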

In [15]:
# As we found an absurdly big value as the max value of price column we check for some information about the biggest values in this column.

#df_Xy.sort_values(by='price',ascending=False).head(100).price.mean() # Output: 54,246'115,351.52
#df_Xy.sort_values(by='price',ascending=False).head(1000).price.mean() # Output: 17,829'070,843.313

In [16]:
df_Xy['target'] = (df_Xy['price'] >= price_mean_after_do).astype(int)
print(df_Xy['target'].shape)
df_Xy['target'].value_counts()

(197486,)


0    111846
1     85640
Name: target, dtype: int64

In [17]:
# Now we look for the amount of different values per feature (in order to filter out redundant and non-informative features)

for x in df_Xy:
    print(f'\n• {x}:\t{len(df_Xy[x].value_counts())}')


• Unnamed: 0:	197486

• id:	197486

• ad_type:	1

• start_date:	145

• end_date:	446

• created_on:	145

• lat:	51075

• lon:	50107

• l1:	1

• l2:	31

• l3:	293

• l4:	58

• l5:	20

• l6:	146

• rooms:	29

• bedrooms:	37

• bathrooms:	20

• surface_total:	1030

• surface_covered:	781

• price:	6096

• currency:	2

• price_period:	1

• title:	94963

• description:	111312

• property_type:	8

• operation_type:	1

• geometry:	62785

• target:	2


From the output above we can see that:
1) There are several features with a single value across all of the 197486 rows (ad_type, l1, price_period, operation_type). These features give us no information.
2) The columns labeled 'Unnamed: 0' and 'id' have a unique value (an identifier) for each row, so they carry no information that is useful for our classification purposes.

We will proceed to create another DataFrame from the original one without these features, and also without the 'price' column, which was only needed to obtain our 'target' column.

After this we will check for duplicates (once we have removed the identifiers that guaranteed every row was unique) and remove them. This will give us a somewhat clean dataset to begin preprocessing our data, i.e. applying to it the changes that we would apply to any input data given to our finished model in order to get predictions from it.
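The column selection described above can also be derived programmatically. The following is a small sketch on a toy DataFrame (all values are invented; the column names only mimic the dataset) that finds constant and identifier-like columns with `nunique` and then counts the duplicates that appear once they are dropped.

```python
import pandas as pd

# Toy data (invented values); column names mimic the real dataset
toy = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'ad_type': ['some_ad_type'] * 4,          # constant column
    'bathrooms': [1.0, 2.0, 2.0, 1.0],
    'l2': ['Antioquia', 'Meta', 'Meta', 'Antioquia'],
})

n_unique = toy.nunique(dropna=True)

# Columns with a single value carry no information
constant_cols = n_unique[n_unique <= 1].index.tolist()
# Columns with one distinct value per row behave like identifiers
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

reduced = toy.drop(columns=constant_cols + id_like_cols)
n_dupes = int(reduced.duplicated().sum())
deduped = reduced.drop_duplicates()

print(constant_cols, id_like_cols, n_dupes, deduped.shape)
```

The identifier heuristic is only an assumption that works for this toy example: a continuous feature with no repeated values would be flagged too, so the resulting list should be reviewed by hand before dropping anything.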

In [18]:
# We create a new DataFrame which we will use to train our model with, ignoring the unnecessary columns

df_good_cols = df_Xy.drop(['ad_type', 'l1', 'price_period', 'operation_type', 'Unnamed: 0', 'id', 'price'], axis=1)
print(f'• Training DataFrame Shape: {df_good_cols.shape}\n')
print(f'• Training DataFrame Columns: {df_good_cols.columns}\n')

• Training DataFrame Shape: (197486, 21)

• Training DataFrame Columns: Index(['start_date', 'end_date', 'created_on', 'lat', 'lon', 'l2', 'l3', 'l4',
       'l5', 'l6', 'rooms', 'bedrooms', 'bathrooms', 'surface_total',
       'surface_covered', 'currency', 'title', 'description', 'property_type',
       'geometry', 'target'],
      dtype='object')



In [19]:
df_good_cols.duplicated().value_counts()

False    193400
True       4086
dtype: int64

We can see that after dropping the constant and identifier columns we ended up with duplicated rows.

We proceed to remove them.

In [20]:
df_good_cols.drop_duplicates(inplace=True)
df_good_cols.duplicated().value_counts()

False    193400
dtype: int64

## ---------------- D A T A --- E X P L O R A T I O N --- 2 ----------------

### ---------------------- FINDING THE APPROPRIATE TRANSFORMATIONS ----------------------

In this section we will analyze our dataset's features grouping them by the type of data portrayed in them (date, location, property description and advertisement information).

This way, we will be able to determine the best transformations to perform on each of them in order to feed our models with the best quality data we can get.

In [21]:
df_good_cols.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 193400 entries, 0 to 197548
Data columns (total 21 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   start_date       193400 non-null  object 
 1   end_date         193400 non-null  object 
 2   created_on       193400 non-null  object 
 3   lat              144882 non-null  float64
 4   lon              144882 non-null  float64
 5   l2               193400 non-null  object 
 6   l3               182570 non-null  object 
 7   l4               44348 non-null   object 
 8   l5               26776 non-null   object 
 9   l6               6793 non-null    object 
 10  rooms            27452 non-null   float64
 11  bedrooms         40376 non-null   float64
 12  bathrooms        153023 non-null  float64
 13  surface_total    6942 non-null    float64
 14  surface_covered  9755 non-null    float64
 15  currency         193396 non-null  object 
 16  title            193399 non-null  obje

In [22]:
print(f'• Total registers: {len(df_good_cols)}')
print('• Null values per feature:')
df_good_cols.isnull().sum()

• Total registers: 193400
• Null values per feature:


start_date              0
end_date                0
created_on              0
lat                 48518
lon                 48518
l2                      0
l3                  10830
l4                 149052
l5                 166624
l6                 186607
rooms              165948
bedrooms           153024
bathrooms           40377
surface_total      186458
surface_covered    183645
currency                4
title                   1
description           121
property_type           0
geometry                0
target                  0
dtype: int64

In [23]:
# Here we create a function to gather information about a specific set of features from a dataset

def get_info(feature_list, dataset=df_good_cols, maxmin=False, stats=False):
    for x in feature_list:
        types = set()
        for y in dataset[x]:
            types.add(type(y))
        print(f'\n----- {x} -----\n •Data types: {types}\n •Missing values:')
        print(dataset[x].isnull().value_counts(),'\n')
        if maxmin:
            print(f' •Min: {dataset[x].min()}\n •Max: {dataset[x].max()}\n')
        if stats:
            print(f' •Mean: {dataset[x].mean()}\n •Median: {dataset[x].median()}\n •Mode: {dataset[x].mode()}\n')

### 1) DATE FEATURES: start_date, end_date & created_on

In [24]:
date_features = ['start_date', 'end_date', 'created_on']

get_info(date_features, maxmin=True)


----- start_date -----
 •Data types: {<class 'str'>}
 •Missing values:
False    193400
Name: start_date, dtype: int64 

 •Min: 2020-07-26
 •Max: 2020-12-31


----- end_date -----
 •Data types: {<class 'str'>}
 •Missing values:
False    193400
Name: end_date, dtype: int64 

 •Min: 2020-07-26
 •Max: 9999-12-31


----- created_on -----
 •Data types: {<class 'str'>}
 •Missing values:
False    193400
Name: created_on, dtype: int64 

 •Min: 2020-07-26
 •Max: 2020-12-31



Here we can see that the maximum value of the 'end_date' feature (9999-12-31) is not a valid date: this feature is supposed to hold the date when either the property was sold or the 'for sale' advertisement stopped showing.

In [25]:
# Now we count the occurrences of this wrong value (spoiler: there are a lot of them)
df_good_cols.end_date.value_counts()

9999-12-31    11925
2020-08-27     3994
2020-11-13     3800
2020-07-27     2621
2020-11-30     2399
              ...  
2021-08-16        2
2021-09-26        2
2021-10-03        1
2021-07-04        1
2021-06-20        1
Name: end_date, Length: 446, dtype: int64

In [26]:
# Here we sort the date features by the end_date and see when the next value after '9999-12-31' appears, which is '2021-10-18'
df_good_cols[['start_date', 'end_date', 'created_on']].sort_values(by=['end_date', 'start_date'], ascending=False).head(11928)

Unnamed: 0,start_date,end_date,created_on
6193,2020-12-31,9999-12-31,2020-12-31
9072,2020-12-31,9999-12-31,2020-12-31
20600,2020-12-31,9999-12-31,2020-12-31
25347,2020-12-31,9999-12-31,2020-12-31
44356,2020-12-31,9999-12-31,2020-12-31
...,...,...,...
190288,2020-07-26,9999-12-31,2020-07-26
193889,2020-07-26,9999-12-31,2020-07-26
147834,2020-12-26,2021-10-18,2020-12-26
7068,2020-12-11,2021-10-18,2020-12-11


In [27]:
11925/len(df_good_cols)

0.06165977249224405

We have 11925 wrong values in the 'end_date' feature (about 6.2% of the rows).

After exploring this set of features, my intuition is that this data will not be relevant to our models. But in case we decide to use it in the future, here are some transformations that we may perform on them:

1) Try to replace the wrong 'end_date' values using the average date difference between start_date and end_date (not including the wrong values, of course).

2) Convert these columns to the 'datetime' data type and then change their format into timestamps.
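The two transformations listed above could look roughly like the sketch below. The toy dates are invented, and the choice of Unix seconds as the timestamp format is an assumption; the only detail taken from the notebook is the '9999-12-31' placeholder.

```python
import pandas as pd

SENTINEL = '9999-12-31'  # placeholder value observed in 'end_date'

# Toy data (invented dates)
toy = pd.DataFrame({
    'start_date': ['2020-08-01', '2020-08-10', '2020-09-01'],
    'end_date':   ['2020-08-11', '2020-08-30', SENTINEL],
})

start = pd.to_datetime(toy['start_date'])
# Mask the placeholder BEFORE parsing, so it becomes NaT instead of a bogus date
end = pd.to_datetime(toy['end_date'].where(toy['end_date'] != SENTINEL))

# 1) Average listing duration, computed only from the valid rows (NaT is skipped)
mean_duration = (end - start).mean()
end_filled = end.fillna(start + mean_duration)

# 2) Convert to a numeric timestamp (seconds since 1970-01-01)
epoch = pd.Timestamp('1970-01-01')
end_unix = (end_filled - epoch) // pd.Timedelta(seconds=1)

print(mean_duration)
print(end_filled.tolist())
print(end_unix.tolist())
```

Masking the placeholder explicitly avoids depending on how a given pandas version handles dates beyond the nanosecond-resolution range.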

### 2) LOCATION FEATURES: l2, l3, l4, l5, l6, geometry, lat & lon

In [28]:
location_features = ['l2', 'l3', 'l4', 'l5', 'l6', 'geometry', 'lat', 'lon']

get_info(location_features[:-2])
get_info(location_features[-2:], maxmin=True)


----- l2 -----
 •Data types: {<class 'str'>}
 •Missing values:
False    193400
Name: l2, dtype: int64 


----- l3 -----
 •Data types: {<class 'str'>, <class 'float'>}
 •Missing values:
False    182570
True      10830
Name: l3, dtype: int64 


----- l4 -----
 •Data types: {<class 'str'>, <class 'float'>}
 •Missing values:
True     149052
False     44348
Name: l4, dtype: int64 


----- l5 -----
 •Data types: {<class 'str'>, <class 'float'>}
 •Missing values:
True     166624
False     26776
Name: l5, dtype: int64 


----- l6 -----
 •Data types: {<class 'str'>, <class 'float'>}
 •Missing values:
True     186607
False      6793
Name: l6, dtype: int64 


----- geometry -----
 •Data types: {<class 'str'>}
 •Missing values:
False    193400
Name: geometry, dtype: int64 


----- lat -----
 •Data types: {<class 'float'>}
 •Missing values:
False    144882
True      48518
Name: lat, dtype: int64 

 •Min: -32.787342
 •Max: 34.420334


----- lon -----
 •Data types: {<class 'float'>}
 •Missing values

From the output above we can see that the 'l4', 'l5' and 'l6' features have more than half of their values missing, so these columns must be dropped.

#### In the 'l2' feature, corresponding to Colombia's departments (their equivalent to states or provinces), we have no values missing.

#### In the case of 'l3', there are 10830 values missing (about 5.6% of the rows). We will replace the missing values with the capital of the corresponding department obtained from 'l2'.

In [29]:
# Here we have a dictionary containing each of the 32 Colombian departments as keys, with their corresponding capitals as values.
# This dictionary is stored in the helpers.py file (imported above), so the definition below is kept only for reference.

'''
capitals = {'Amazonas': 'Leticia', 'Antioquia': 'Medellín', 'Arauca': 'Arauca', 'Atlántico': 'Barranquilla', 'Bolívar' : 'Cartagena', 'Boyacá': 'Tunja',
            'Caldas': 'Manizales', 'Caquetá': 'Florencia', 'Casanare': 'Yopal', 'Cauca': 'Popayán', 'Cesar': 'Valledupar', 'Chocó': 'Quibdó', 'Córdoba': 'Montería',
            'Cundinamarca': 'Bogotá D.C', 'Guainía': 'Puerto Inírida', 'Guaviare': 'San José del Guaviare', 'Huila': 'Neiva', 'La Guajira': 'Riohacha', 
            'Magdalena': 'Santa Marta', 'Meta': 'Villavicencio', 'Nariño': 'Pasto', 'Norte de Santander': 'Cúcuta', 'Putumayo': 'Mocoa', 'Quindío': 'Armenia',
            'Risaralda': 'Pereira', 'San Andrés Providencia y Santa Catalina': 'San Andrés', 'Santander': 'Bucaramanga', 'Sucre': 'Sincelejo', 'Tolima': 'Ibagué',
            'Valle del Cauca': 'Cali', 'Vaupés': 'Mitú', 'Vichada': 'Puerto Carreño'}
'''
len(capitals)


32
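As a sketch of how the missing 'l3' values could be filled from this dictionary, the example below uses a two-entry excerpt of `capitals` and a toy DataFrame (the rows are invented). It relies on `Series.map` to look up each department's capital and `fillna` to touch only the missing cities.

```python
import pandas as pd

# Excerpt of the capitals dictionary shown above
capitals_excerpt = {'Antioquia': 'Medellín', 'Nariño': 'Pasto'}

# Toy rows (invented): two of them have no city
toy = pd.DataFrame({
    'l2': ['Nariño', 'Nariño', 'Antioquia'],
    'l3': ['Ipiales', None, None],
})

# Map every department to its capital, then use it only where 'l3' is missing
toy['l3'] = toy['l3'].fillna(toy['l2'].map(capitals_excerpt))

print(toy)
```

Rows that already have a city keep it. A department that is absent from the dictionary would simply leave the value missing, which makes such cases easy to spot afterwards.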

In [30]:
# This is the amount of different cities in the 'l3' feature
len(df_good_cols.l3.unique())

294

#### Regarding the latitude and longitude values from the dataset, the output of the function at the beginning of this section shows that there are 48518 missing values in each of these features.


In [31]:
# Here we check whether the missing values correspond to the same rows in the dataset:

print(f"Rows missing 'lat' values: {len(df_good_cols[df_good_cols['lat'].isnull()])}")
print(f"Rows missing 'lon' values: {len(df_good_cols[df_good_cols['lon'].isnull()])}")
#print(f"Rows missing both 'lat' and 'lon' values (1): {len(df_good_cols[df_good_cols['lat'].isnull()][df_good_cols['lon'].isnull()])}")
print(f"Rows missing both 'lat' and 'lon' values (2): {len(df_good_cols[df_good_cols['lat'].isnull()][df_good_cols[df_good_cols['lat'].isnull()]['lon'].isnull()])}")


Rows missing 'lat' values: 48518
Rows missing 'lon' values: 48518
Rows missing both 'lat' and 'lon' values (2): 48518
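The same check can be written with boolean masks, which avoids the chained indexing used in the cell above. This is a minimal sketch on toy coordinates (invented values):

```python
import numpy as np
import pandas as pd

# Toy coordinates (invented values)
toy = pd.DataFrame({
    'lat': [4.6, np.nan, 6.2, np.nan],
    'lon': [-74.1, np.nan, -75.6, np.nan],
})

lat_missing = toy['lat'].isna()
lon_missing = toy['lon'].isna()

both_missing = int((lat_missing & lon_missing).sum())   # rows missing both values
only_one_missing = int((lat_missing ^ lon_missing).sum())  # rows missing exactly one

print(f'Rows missing both: {both_missing}')
print(f'Rows missing only one of them: {only_one_missing}')
```

If the second count is 0, every row that lacks a latitude also lacks its longitude, and vice versa.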


In [32]:
# Here we take a sample to further illustrate that every row missing a 'lat' value is missing its 'lon' value as well
df_good_cols[df_good_cols['lat'].isnull()].sample(10)

Unnamed: 0,start_date,end_date,created_on,lat,lon,l2,l3,l4,l5,l6,...,bedrooms,bathrooms,surface_total,surface_covered,currency,title,description,property_type,geometry,target
190185,2020-10-09,2020-10-10,2020-10-09,,,Antioquia,Medellín,,,,...,,2.0,,,COP,Apartamento en Venta Ubicado en SABANETA,Codigo Inmueble 642 se Vende apartamento de lu...,Apartamento,POINT EMPTY,0
108099,2020-09-07,2020-09-07,2020-09-07,,,Cundinamarca,Bogotá D.C,Zona Chapinero,Barrios Unidos,,...,,5.0,,,COP,"97944 _ AMPLIO Y ACOGEDOR APARTAMENTO DUPLEX ,...",apto ubicado en la 104 15-75 chico navarra sex...,Apartamento,POINT EMPTY,1
71602,2020-10-14,2020-12-22,2020-10-14,,,Antioquia,Medellín,,,,...,,4.0,,,COP,SE VENDE APARTAMENTO EN SECTOR DE SAN LUCAS - ...,PR 10891. Apartamento en unidad cerrada sector...,Apartamento,POINT EMPTY,1
117112,2020-10-20,2021-02-25,2020-10-20,,,Cundinamarca,Bogotá D.C,Zona Chapinero,Teusaquillo,,...,3.0,3.0,,128.0,COP,Apartamento En Venta En Bogota Ciudad Salitre ...,Apartamento para estrenar en Proyecto Salitre ...,Apartamento,POINT EMPTY,1
39143,2020-08-04,2020-08-05,2020-08-04,,,Antioquia,Medellín,,,,...,,2.0,,,COP,Apartamento en venta en Aviva loma de los bern...,Hermoso apartamento con excelentes acabados co...,Apartamento,POINT EMPTY,1
155317,2020-09-12,2020-09-13,2020-09-12,,,Antioquia,Medellín,,,,...,,2.0,,,COP,Apartamento en Venta Ubicado en MEDELLIN,Codigo Inmueble 6127 COD INTERNO 6127 Lo conoc...,Apartamento,POINT EMPTY,0
183818,2020-08-04,2020-08-05,2020-08-04,,,Antioquia,Medellín,,,,...,,5.0,,,COP,Casa en Venta Ubicado en MEDELLIN,"Codigo Inmueble 52 Casa en Urbanizacion, cerca...",Casa,POINT EMPTY,1
25966,2020-08-04,2020-08-05,2020-08-04,,,Antioquia,Medellín,,,,...,,3.0,,,COP,Apartamento en Venta Ubicado en MEDELLIN,"Codigo Inmueble 554 Apartamento de 180 mtrs2, ...",Apartamento,POINT EMPTY,1
189913,2020-10-09,2020-10-10,2020-10-09,,,Antioquia,,,,,...,,,,,COP,LOTE EN VENTA EN EL RETIRO EL RETIRO SimiCRM6...,622-14599 Lote en Venta Sector Macedonia - El ...,Lote,POINT EMPTY,0
75046,2020-11-27,2020-11-28,2020-11-27,,,Antioquia,Medellín,,,,...,,1.0,,,COP,Apartamento en Venta Ubicado en RIONEGRO,Codigo Inmueble 6368 Y si de vivir relajado se...,Apartamento,POINT EMPTY,0


#### In order to analyze whether the non-missing latitudes and longitudes are in fact located within Colombian territory, we need to define limits beyond which we shouldn't expect to find any lat or lon values.

##### The map below can help us estimate the following limits:

Latitudes (south to north) that encompass colombian territory: -4.5, 15

Longitudes (west to east) that encompass colombian territory: -82, -67


![Colombia Latitudes and Longitudes](https://i.imgur.com/ZdKWfRG.png)

In [33]:
# We define the corresponding limits as two lists, one for the latitudes and one for the longitudes
# These limits encompass the Colombian insular territories, which extend further west and north than its continental territory

lat_col = [-4.5, 15]    # Southernmost and northernmost latitudes respectively
lon_col = [-82, -67]    # Westernmost and easternmost longitudes respectively

In [34]:
count_lat_smaller = 0   # Registers with a latitude to the south of Colombia
count_lat_greater = 0   # Registers with a latitude to the north of Colombia
for x in df_good_cols.lat:
    if x<lat_col[0]:
        count_lat_smaller += 1
    elif x>lat_col[1]:
        count_lat_greater += 1
    
print(f'• Latitudes south from Colombia: {count_lat_smaller}\n• Latitudes north from Colombia: {count_lat_greater}')

• Latitudes south from Colombia: 1
• Latitudes north from Colombia: 1


As we can see, there is only 1 value outside Colombia's latitude range in each direction in our dataset. We can inspect them:

In [35]:
df_good_cols.sort_values(by='lat').head(2)

Unnamed: 0,start_date,end_date,created_on,lat,lon,l2,l3,l4,l5,l6,...,bedrooms,bathrooms,surface_total,surface_covered,currency,title,description,property_type,geometry,target
138682,2020-09-29,2021-07-26,2020-09-29,-32.787342,-71.20732,Cundinamarca,La Calera,,,,...,,6.0,,,COP,51548 LA CALERA MIRADO DEL LAGO,"Casa hermosa,amplia, vigilancia sector&nbsp; t...",Casa,POINT (-71.20732 -32.787342),1
177722,2020-11-19,9999-12-31,2020-11-19,0.823972,-77.62271,Nariño,,,,,...,4.0,,,,COP,Se vende casa en Ipiales,Venta de casa en colina verde\nInfo: 320302104...,Casa,POINT (-77.6227098 0.823972),0


In [36]:
df_good_cols.sort_values(by='lat', ascending=False).head(2)

Unnamed: 0,start_date,end_date,created_on,lat,lon,l2,l3,l4,l5,l6,...,bedrooms,bathrooms,surface_total,surface_covered,currency,title,description,property_type,geometry,target
148562,2020-11-06,2021-07-26,2020-11-06,34.420334,-119.69819,Cundinamarca,Bogotá D.C,,,,...,,4.0,,,COP,51599 SANTA BARBARA APARTAMENTO 506,"Apartamento&nbsp; para&nbsp; remodelar , duple...",Apartamento,POINT (-119.69819 34.420334),1
18218,2020-09-02,2020-09-21,2020-09-02,13.351917,-81.35745,San Andrés Providencia y Santa Catalina,Providencia,,,,...,,,,,COP,Lote Terreno en Venta en Providencia _ wasi150...,De la Isla de San Andrés en Avión son 20 minut...,Lote,POINT (-81.35745049 13.35191746),1


In [37]:
count_lon_smaller = 0   # Registers with a longitude to the west of Colombia
count_lon_greater = 0   # Registers with a longitude to the east of Colombia
for x in df_good_cols.lon:
    if x<lon_col[0]:
        count_lon_smaller += 1
    elif x>lon_col[1]:
        count_lon_greater += 1
    
print(f'• Longitudes to the west from Colombia: {count_lon_smaller}\n• Longitudes to the east from Colombia: {count_lon_greater}')

• Longitudes to the west from Colombia: 1
• Longitudes to the east from Colombia: 0


We found only one misplaced longitude, to the west of Colombia. Now we inspect it:

In [38]:
df_good_cols.sort_values(by='lon').head(2)

Unnamed: 0,start_date,end_date,created_on,lat,lon,l2,l3,l4,l5,l6,...,bedrooms,bathrooms,surface_total,surface_covered,currency,title,description,property_type,geometry,target
148562,2020-11-06,2021-07-26,2020-11-06,34.420334,-119.69819,Cundinamarca,Bogotá D.C,,,,...,,4.0,,,COP,51599 SANTA BARBARA APARTAMENTO 506,"Apartamento&nbsp; para&nbsp; remodelar , duple...",Apartamento,POINT (-119.69819 34.420334),1
114121,2020-09-16,9999-12-31,2020-09-16,12.524494,-81.72839,San Andrés Providencia y Santa Catalina,,,,,...,8.0,,,,COP,Casa de Lujo en la Isla de Andrés,"Para vivir o para invertir, ésta casa llena de...",Finca,POINT (-81.7283900951 12.5244938806),1


In [39]:
df_good_cols.sort_values(by='lon', ascending=False).head(2)

Unnamed: 0,start_date,end_date,created_on,lat,lon,l2,l3,l4,l5,l6,...,bedrooms,bathrooms,surface_total,surface_covered,currency,title,description,property_type,geometry,target
59361,2020-08-01,2020-09-16,2020-08-01,6.189912,-67.48257,Vichada,Puerto Carreño,,,,...,,1.0,,,COP,FINCA AGROINDUSTRIAL PUERTO CARREÑO,FINCA AGROINDUSTRIAL Y GANADERA A 10 MINUTOS...,Otro,POINT (-67.4825696 6.1899117),1
2232,2020-10-03,2020-11-30,2020-10-03,3.870204,-67.924336,Guainía,Inírida,,,,...,,,,,COP,SE VENDE FINCA EN INIRIDA GUAINIA,"Finca de 14,56 hectáreas en venta en la zona r...",Lote,POINT (-67.9243361 3.8702044),0


#### There were only 3 out-of-range latitude and longitude values in total, spread over 2 rows (138682 and 148562). Those can be replaced by the coordinates of the city in the row ('l3' value).
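The range checks above can be expressed without explicit loops. The sketch below uses the coordinates of the rows displayed in the previous cells plus one row without coordinates; resetting the out-of-range values to NaN is an assumption about how they could be handled, so that they can later be filled from the city coordinates together with the other missing values.

```python
import numpy as np
import pandas as pd

LAT_COL = (-4.5, 15)   # southernmost and northernmost latitudes
LON_COL = (-82, -67)   # westernmost and easternmost longitudes

# Coordinates copied from the rows displayed above, plus one row without coordinates
toy = pd.DataFrame({
    'lat': [-32.787342, 34.420334, 0.823972, 13.351917, np.nan],
    'lon': [-71.20732, -119.69819, -77.62271, -81.35745, np.nan],
})

lat_ok = toy['lat'].between(*LAT_COL)   # NaN evaluates to False here
lon_ok = toy['lon'].between(*LON_COL)

# Out of range = has coordinates, but at least one of them is outside the box
out_of_range = toy['lat'].notna() & ~(lat_ok & lon_ok)
n_out_of_range = int(out_of_range.sum())

# Treat them as missing so they can be re-filled from the city ('l3') later
toy.loc[out_of_range, ['lat', 'lon']] = np.nan

print(f'Rows with out-of-range coordinates: {n_out_of_range}')
print(toy)
```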

In [40]:
# Here we build a pd.Series with the 'l3, l2' (city, department) combination for each row that has a city.
# The string concatenation is vectorized, which is much faster than looping over the rows with .iloc

has_city = df_good_cols.l3.notna()
comb_series = (df_good_cols.loc[has_city, 'l3'] + ', ' + df_good_cols.loc[has_city, 'l2']).reset_index(drop=True)
combinations = comb_series.tolist()
unique_l2_l3 = comb_series.unique()

In [41]:
print(len(unique_l2_l3))

299


In [42]:
# Now we check whether the number of combinations is the same as the number of cities alone
df_cities = df_good_cols.l3.unique()
print(f'Amount of different cities in df_train.l3: {len(df_cities)}')
print(f'Amount of different cities in the combination df_train.l3-df_train.l2: {len(unique_l2_l3)}')

Amount of different cities in df_train.l3: 294
Amount of different cities in the combination df_train.l3-df_train.l2: 299


The output above suggests that there are cities with the same name but in different departments ('l2') in our list.

In [43]:
# We will check if there are repeated city names
l2_l3_cities = []
repeated_cities = []
for x in unique_l2_l3:
    y = x.split(',')
    if y[0] not in l2_l3_cities:
        l2_l3_cities.append(y[0])
    else:
        repeated_cities.append(y[0])
        print(y)


['Granada', ' Meta']
['San Martín', ' Cesar']
['Restrepo', ' Meta']
['Guamal', ' Meta']
['Armenia', ' Antioquia']
['Barbosa', ' Santander']


The output above shows different cities that share a name but belong to different Colombian departments. This explains the difference between the number of unique values in 'l3' and the number of unique combinations of 'l2' and 'l3'.

Below you can see the complete list of the aforementioned combinations and confirm that each of the cities listed above has two entries in the list.

In [44]:
for x in unique_l2_l3:
    y = x.split(',')
    if y[0] in repeated_cities:
        print(x)

Armenia, Quindío
Barbosa, Antioquia
Guamal, Magdalena
Restrepo, Valle del Cauca
Granada, Cundinamarca
San Martín, Meta
Granada, Meta
San Martín, Cesar
Restrepo, Meta
Guamal, Meta
Armenia, Antioquia
Barbosa, Santander
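An alternative way to find these repeated names is to count the departments per city with `groupby`. The sketch below uses a few of the (city, department) pairs printed above plus one non-repeated pair; it illustrates the idea and is not the code used in the notebook.

```python
import pandas as pd

# A few (city, department) pairs taken from the output above, plus a non-repeated one
toy = pd.DataFrame({
    'l3': ['Armenia', 'Armenia', 'Granada', 'Granada', 'Ipiales', None],
    'l2': ['Quindío', 'Antioquia', 'Cundinamarca', 'Meta', 'Nariño', 'Nariño'],
})

# One row per distinct (department, city) pair, ignoring rows without a city
pairs = toy.dropna(subset=['l3']).drop_duplicates(['l2', 'l3'])

# Number of different departments in which each city name appears
deps_per_city = pairs.groupby('l3')['l2'].nunique()
repeated_names = sorted(deps_per_city[deps_per_city > 1].index)

print(repeated_names)
```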


In [45]:
# Now we create a dictionary with the coordinates for each of the unique combinatorics
# This code takes too long to run, so it is commented out and its output is saved as a dictionary in the helpers.py file.

'''
geolocator = Nominatim(user_agent='acidminded')
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

dep_ciud_lat_lon = {}

for x in capitals.keys():
    dep_ciud_lat_lon[x] = {}

for x in unique_l2_l3:
    coor = geocode(x)
    y = x.split(',')
    ciud = y[0]
    dep = y[1][1:]
    dep_ciud_lat_lon[dep][ciud] = {'lat':coor.latitude, 'lon':coor.longitude}

for x in capitals:
    dep = x
    if capitals[x] not in dep_ciud_lat_lon[x]:
        ciud = capitals[x]
        coor = geocode(f'{ciud}, {dep}')
        dep_ciud_lat_lon[dep][ciud] = {'lat':coor.latitude, 'lon':coor.longitude}

print(dep_ciud_lat_lon)
'''

In [46]:
# Here we check for problems in our dictionary. We found one, which was corrected manually.

'''
count = 0
problems = []
for dep in dep_ciud_lat_lon:
    for city in  dep_ciud_lat_lon[dep]:
        if (dep_ciud_lat_lon[dep][city]['lat'] < lat_col[0]) or (dep_ciud_lat_lon[dep][city]['lat'] > lat_col[1]):
            count += 1
            problems.append((dep, city, 'lat problem'))
        if (dep_ciud_lat_lon[dep][city]['lon'] < lon_col[0]) or (dep_ciud_lat_lon[dep][city]['lon'] > lon_col[1]):
            count += 1
            problems.append((dep, city, 'lon problem'))
print(count)
print(problems)
'''

'''
OUTPUT:
1
[('Bolívar', 'Santa Rosa', 'lon problem')]

# We fixed this entry by correcting the coordinates for this town manually. It seems that the geocoder (Nominatim, accessed through geopy) returned the wrong coordinates for it.
'''

In [47]:
print(dep_ciud_lat_lon.keys())

dict_keys(['Amazonas', 'Antioquia', 'Arauca', 'Atlántico', 'Bolívar', 'Boyacá', 'Caldas', 'Caquetá', 'Casanare', 'Cauca', 'Cesar', 'Chocó', 'Córdoba', 'Cundinamarca', 'Guainía', 'Guaviare', 'Huila', 'La Guajira', 'Magdalena', 'Meta', 'Nariño', 'Norte de Santander', 'Putumayo', 'Quindío', 'Risaralda', 'San Andrés Providencia y Santa Catalina', 'Santander', 'Sucre', 'Tolima', 'Valle del Cauca', 'Vaupés', 'Vichada'])
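The lookup and the range check from the two commented-out cells above can be combined so that a failed lookup does not stop the loop. geopy's `geocode` returns `None` when it finds no match, which would raise an error on `coor.latitude`. The sketch below shows the idea with a stub geocoder instead of a network call; `build_coordinates`, `fake_geocode` and the coordinates inside the stub are hypothetical and exist only for illustration.

```python
LAT_COL = (-4.5, 15)
LON_COL = (-82, -67)

def build_coordinates(pairs, geocode):
    """Build {department: {city: {'lat', 'lon'}}} and collect the problems found.

    `geocode` is any callable returning a (lat, lon) tuple or None.
    """
    coords, problems = {}, []
    for pair in pairs:
        city, dep = [part.strip() for part in pair.split(',', 1)]
        hit = geocode(pair)
        if hit is None:                      # nothing found: record it and move on
            problems.append((dep, city, 'not found'))
            continue
        lat, lon = hit
        if not (LAT_COL[0] <= lat <= LAT_COL[1]):
            problems.append((dep, city, 'lat problem'))
        if not (LON_COL[0] <= lon <= LON_COL[1]):
            problems.append((dep, city, 'lon problem'))
        coords.setdefault(dep, {})[city] = {'lat': lat, 'lon': lon}
    return coords, problems

def fake_geocode(query):
    """Stub standing in for the real geocoder (coordinates are invented)."""
    table = {
        'Ipiales, Nariño': (0.82, -77.62),
        'Santa Rosa, Bolívar': (10.44, -122.7),   # deliberately wrong longitude
    }
    return table.get(query)

coords, problems = build_coordinates(
    ['Ipiales, Nariño', 'Santa Rosa, Bolívar', 'Nowhere, Meta'], fake_geocode
)
print(problems)
```

With the real geocoder, the callable would need a thin wrapper that converts the returned location object into a `(latitude, longitude)` tuple.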


In [48]:
# Let's see the missing values from the 'geometry' feature

df_good_cols.geometry.value_counts()

POINT EMPTY                        48518
POINT (-73.112 7.119)                264
POINT (-75.572 6.203)                259
POINT (-76.554 3.258)                199
POINT (-74.1376942 4.6303361)        137
                                   ...  
POINT (-74.061 4.715)                  1
POINT (-75.43534316 6.02491343)        1
POINT (-74.0868554 4.6703724)          1
POINT (-76.471 3.434)                  1
POINT (-73.106 7.064)                  1
Name: geometry, Length: 62785, dtype: int64

There are 48518 empty values ('POINT EMPTY') in 'geometry', the same number as the missing 'lat' and 'lon' values. Once we have obtained the coordinates for all our missing 'lat' and 'lon' values we can fill in this feature as well.
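
Once 'lat' and 'lon' are complete, the 'geometry' strings could be rebuilt from them. A minimal sketch, assuming the column keeps the 'POINT (lon lat)' layout seen in the output above; `point_wkt` is a hypothetical helper and not part of the notebook:

```python
import math

def point_wkt(lat, lon):
    """Build a 'POINT (lon lat)' string; hypothetical helper for illustration."""
    # Missing coordinates keep the placeholder already used by the dataset
    if lat is None or lon is None or math.isnan(lat) or math.isnan(lon):
        return 'POINT EMPTY'
    # The dataset stores longitude first, then latitude
    return f'POINT ({lon} {lat})'
```

On a DataFrame it could be applied row by row, for example with `df.apply(lambda r: point_wkt(r['lat'], r['lon']), axis=1)`.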

### 4) PROPERTY FEATURES: rooms, bedrooms, bathrooms, surface_total, surface_covered & property_type

In [49]:
property_features = ['rooms', 'bedrooms', 'bathrooms', 'surface_total', 'surface_covered', 'property_type']

get_info(property_features[:-1], maxmin=True, stats=True)
get_info(property_features[-1:])


----- rooms -----
 •Data types: {<class 'float'>}
 •Missing values:
True     165948
False     27452
Name: rooms, dtype: int64 

 •Min: 1.0
 •Max: 40.0

 •Mean: 3.2904706396619554
 •Median: 3.0
 •Mode: 0    3.0
Name: rooms, dtype: float64


----- bedrooms -----
 •Data types: {<class 'float'>}
 •Missing values:
True     153024
False     40376
Name: bedrooms, dtype: int64 

 •Min: 0.0
 •Max: 96.0

 •Mean: 3.241182880919358
 •Median: 3.0
 •Mode: 0    3.0
Name: bedrooms, dtype: float64


----- bathrooms -----
 •Data types: {<class 'float'>}
 •Missing values:
False    153023
True      40377
Name: bathrooms, dtype: int64 

 •Min: 1.0
 •Max: 20.0

 •Mean: 2.6387536514118795
 •Median: 2.0
 •Mode: 0    2.0
Name: bathrooms, dtype: float64


----- surface_total -----
 •Data types: {<class 'float'>}
 •Missing values:
True     186458
False      6942
Name: surface_total, dtype: int64 

 •Min: 10.0
 •Max: 180000.0

 •Mean: 1329.7345145491213
 •Median: 120.0
 •Mode: 0    60.0
Name: surface_total, dtyp

From the output above we can see that the only features in this subset with fewer than half of their values missing are 'bathrooms' and 'property_type'.

Because of this, 'bathrooms' and 'property_type' will be the only columns from this subset of features that we use to train our models for now.

We are aware that it may be possible to extract meaningful information from each listing's description to fill in the missing data in these columns, and that is a path we may explore when improving our first models. For the moment, these two features will suffice.

In [50]:
# We check whether the situation changes after dropping the date columns (which leaves us with many duplicated records) and then dropping duplicates

df_good_cols_no_dates = df_good_cols.drop(date_features, axis=1).copy()
print(f'\n • GOOD COLS NO DATES •\n\n SHAPE: \n{df_good_cols_no_dates.shape}\n Columns:\n{df_good_cols_no_dates.columns}')

print(f'DUPLICATED INFO:\n{df_good_cols_no_dates.duplicated().value_counts()}\n')
df_good_cols_no_dates.drop_duplicates(inplace=True)
print(f'DUPLICATED INFO AFTER DROP:\n{df_good_cols_no_dates.duplicated().value_counts()}\n')

get_info(property_features[:-1], dataset=df_good_cols_no_dates, maxmin=True, stats=True)
get_info(property_features[-1:], dataset=df_good_cols_no_dates)



 • GOOD COLS NO DATES •

 SHAPE: 
(193400, 18)
 Columns:
Index(['lat', 'lon', 'l2', 'l3', 'l4', 'l5', 'l6', 'rooms', 'bedrooms',
       'bathrooms', 'surface_total', 'surface_covered', 'currency', 'title',
       'description', 'property_type', 'geometry', 'target'],
      dtype='object')
DUPLICATED INFO:
False    120421
True      72979
dtype: int64

DUPLICATED INFO AFTER DROP:
False    120421
dtype: int64


----- rooms -----
 •Data types: {<class 'float'>}
 •Missing values:
True     93507
False    26914
Name: rooms, dtype: int64 

 •Min: 1.0
 •Max: 40.0

 •Mean: 3.2879170691833246
 •Median: 3.0
 •Mode: 0    3.0
Name: rooms, dtype: float64


----- bedrooms -----
 •Data types: {<class 'float'>}
 •Missing values:
True     80702
False    39719
Name: bedrooms, dtype: int64 

 •Min: 0.0
 •Max: 96.0

 •Mean: 3.239129887459402
 •Median: 3.0
 •Mode: 0    3.0
Name: bedrooms, dtype: float64


----- bathrooms -----
 •Data types: {<class 'float'>}
 •Missing values:
False    87751
True     32670
N

#### The missing values from the 'bathrooms' column will be imputed with the rounded mean value for the record's property_type.

In [51]:
df_good_cols.bathrooms.value_counts()

2.0     69156
3.0     30954
1.0     22231
4.0     17019
5.0      7557
6.0      2937
7.0      1158
10.0      928
8.0       681
9.0       329
20.0       14
12.0       14
13.0       11
11.0        8
15.0        7
19.0        6
14.0        5
18.0        5
16.0        2
17.0        1
Name: bathrooms, dtype: int64

In [52]:
df_good_cols.sort_values(by=['bathrooms'], ascending=False).head(15)

Unnamed: 0,start_date,end_date,created_on,lat,lon,l2,l3,l4,l5,l6,...,bedrooms,bathrooms,surface_total,surface_covered,currency,title,description,property_type,geometry,target
185592,2020-09-12,2020-11-13,2020-09-12,3.421,-76.545,Valle del Cauca,Cali,,,,...,20.0,20.0,210.0,692.0,COP,Edificio En Arriendo/venta En Cali Urbanizacin...,"Se Vende o Se alquila, edificio en la zona sur...",Otro,POINT (-76.545 3.421),1
53449,2020-11-30,2021-07-01,2020-11-30,4.644464,-74.060546,Cundinamarca,Bogotá D.C,Zona Chapinero,Chapinero,,...,22.0,20.0,,,COP,"CASA EN VENTA, BOGOTA-CHAPINERO",Casa esquinera remodelada. ambiente acogedor. ...,Casa,POINT (-74.060546 4.6444642),1
175440,2020-10-15,2021-04-07,2020-10-15,,,Atlántico,Barranquilla,,,,...,23.0,20.0,,,COP,"HOTEL EN VENTA, BARRANQUILLA-EL POBLADO",Medellín se ha convertido en un destino turíst...,Otro,POINT EMPTY,1
51712,2020-11-23,2021-08-20,2020-11-23,4.6328,-74.072722,Cundinamarca,Bogotá D.C,Zona Chapinero,Teusaquillo,,...,21.0,20.0,,,COP,"EDIFICIO EN VENTA, BOGOTA-SANTA TERESITA",UBICADÍSIMO EDIFICIO ESQUINERO EN VENTA 21 HA...,Otro,POINT (-74.0727222 4.6328001),1
102407,2020-11-11,2020-11-18,2020-11-11,4.204624,-74.631732,Tolima,Melgar,,,,...,21.0,20.0,,,COP,"HOTEL EN VENTA, MELGAR-MELGAR","Hotel en privilegiada ubicación, colindante co...",Otro,POINT (-74.6317323 4.2046243),1
138694,2020-08-27,9999-12-31,2020-08-27,,,Cundinamarca,Bogotá D.C,Zona Chapinero,Chapinero,Chapinero Central,...,0.0,20.0,,,COP,"EDIFICIO EN VENTA, BOGOTA-CHAPINERO CENTRAL",Edificio para venta en el sector de Chapinero....,Otro,POINT EMPTY,1
96152,2020-09-30,2021-01-18,2020-09-30,4.680884,-74.130453,Cundinamarca,Bogotá D.C,Zona Occidental,Fontibón,,...,0.0,20.0,,,COP,"EDIFICIO EN VENTA, BOGOTA-EL DORADO",Excelente ubicación muy cerca de estaciones de...,Otro,POINT (-74.1304534018 4.6808835909),1
58649,2020-08-21,2020-10-23,2020-08-21,4.652547,-74.058393,Cundinamarca,Bogotá D.C,Zona Chapinero,Chapinero,Quinta Camacho,...,24.0,20.0,,,COP,"CASA EN VENTA, BOGOTA-QUINTA CAMACHO",Arriendo Espectacular hotel ubicado en el cor...,Casa,POINT (-74.0583932 4.6525474),1
69334,2020-09-10,2021-05-24,2020-09-10,4.63738,-74.064302,Cundinamarca,Bogotá D.C,Zona Chapinero,Chapinero,Marly,...,45.0,20.0,,,COP,"EDIFICIO EN ARRIENDO/VENTA, BOGOTA-MARLY",espectacular edificio de 4000 mtrs2 construido...,Otro,POINT (-74.0643016 4.6373802),0
8442,2020-08-05,9999-12-31,2020-08-05,7.114464,-73.119884,Santander,Bucaramanga,,,,...,3.0,20.0,,,COP,"APARTAMENTO EN VENTA, BUCARAMANGA-CONCORDIA",Inmogestion presenta este bonito apartamento e...,Apartamento,POINT (-73.1198842 7.1144635),0


In [53]:
df_good_cols.property_type.value_counts()

Apartamento        98691
Casa               59116
Otro               16150
Lote               15982
Local comercial     1249
Finca               1131
Oficina             1071
Parqueadero           10
Name: property_type, dtype: int64

In [54]:
property_types = df_good_cols.property_type.unique()
print(type(property_types))

<class 'numpy.ndarray'>


In [55]:
print(f'\nHouse or apartment registries by amount of bathrooms:\n')
for x in range(5,21):
    print(f'• {x} bathrooms:')
    print('\t',len(df_good_cols.loc[((df_good_cols.property_type == 'Casa') | (df_good_cols.property_type == 'Apartamento'))&((df_good_cols.bathrooms >= x))]))


House or apartment registries by amount of bathrooms:

• 5 bathrooms:
	 11061
• 6 bathrooms:
	 4390
• 7 bathrooms:
	 2045
• 8 bathrooms:
	 1134
• 9 bathrooms:
	 661
• 10 bathrooms:
	 412
• 11 bathrooms:
	 30
• 12 bathrooms:
	 26
• 13 bathrooms:
	 20
• 14 bathrooms:
	 13
• 15 bathrooms:
	 10
• 16 bathrooms:
	 7
• 17 bathrooms:
	 7
• 18 bathrooms:
	 7
• 19 bathrooms:
	 6
• 20 bathrooms:
	 4


It is very unlikely that a house or an apartment has 6 or more bathrooms. For this reason, those values will be replaced by the floor-rounded mean of the column (2).
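
The two 'bathrooms' rules (capping implausible values and imputing missing ones by property type) can also be sketched with vectorized pandas operations. This is a minimal sketch under stated assumptions: `clean_bathrooms`, `cap` and `fallback` are hypothetical names, and property types with no known bathroom count are assumed to fall back to 1:

```python
import pandas as pd

def clean_bathrooms(df, cap=6, fallback=2.0):
    """Return a cleaned copy of df['bathrooms'] (hypothetical helper)."""
    bath = df['bathrooms'].copy()
    # Rule 1: houses/apartments with `cap` or more bathrooms get the fallback value
    is_home = df['property_type'].isin(['Casa', 'Apartamento'])
    bath[is_home & (bath >= cap)] = fallback
    # Rule 2: missing values take the rounded mean of their property type,
    # computed on the raw (uncapped) column
    type_mean = df.groupby('property_type')['bathrooms'].transform('mean').round()
    bath = bath.fillna(type_mean)
    # Assumption: types without any known value fall back to 1
    return bath.fillna(1.0)
```

Compared with a row-by-row loop, this avoids repeated `.loc` lookups and does not depend on the DataFrame having a 0..n-1 index.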

### 5) ADVERTISING FEATURES: currency, title & description 

In [56]:
advertising_features = ['currency', 'title', 'description']

get_info(advertising_features)


----- currency -----
 •Data types: {<class 'str'>, <class 'float'>}
 •Missing values:
False    193396
True          4
Name: currency, dtype: int64 


----- title -----
 •Data types: {<class 'str'>, <class 'float'>}
 •Missing values:
False    193399
True          1
Name: title, dtype: int64 


----- description -----
 •Data types: {<class 'str'>, <class 'float'>}
 •Missing values:
False    193279
True        121
Name: description, dtype: int64 



In [57]:
df_good_cols[advertising_features].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 193400 entries, 0 to 197548
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   currency     193396 non-null  object
 1   title        193399 non-null  object
 2   description  193279 non-null  object
dtypes: object(3)
memory usage: 5.9+ MB


In [58]:
df_good_cols[df_good_cols['currency'].isnull()]

Unnamed: 0,start_date,end_date,created_on,lat,lon,l2,l3,l4,l5,l6,...,bedrooms,bathrooms,surface_total,surface_covered,currency,title,description,property_type,geometry,target
16240,2020-11-27,2020-11-27,2020-11-27,,,Valle del Cauca,Cali,,,,...,6.0,2.0,,,,Venta casa barrio cuidad Córdoba cali vall,<b>Venta casa barrio cuidad Córdoba cali vall...,Casa,POINT EMPTY,0
53528,2020-12-15,2020-12-15,2020-12-15,,,Tolima,Ibagué,,,,...,4.0,,,,,VENDO APARTAMENTO BALCONES DEL VERGEL,<b>VENDO APARTAMENTO BALCONES DEL VERGEL</b><...,Apartamento,POINT EMPTY,0
90818,2020-10-21,2020-10-21,2020-10-21,,,Antioquia,Sabaneta,,,,...,3.0,,,,,Cod12 Aparatamento en Venta Poblado,<br />\n <br />\n Ref#582518.,Apartamento,POINT EMPTY,0
119682,2020-10-21,2020-10-21,2020-10-21,,,Santander,Bucaramanga,,,,...,,,,,,Vendo Casa Campestre,<br />\n <br />\n Ref#582835.,Finca,POINT EMPTY,0


In [59]:
# Here we can see that 8 of the records have a price in USD
df_good_cols.currency.value_counts()

COP    193388
USD         8
Name: currency, dtype: int64

In [60]:
df_good_cols.loc[df_good_cols.currency=='USD']

Unnamed: 0,start_date,end_date,created_on,lat,lon,l2,l3,l4,l5,l6,...,bedrooms,bathrooms,surface_total,surface_covered,currency,title,description,property_type,geometry,target
5902,2020-08-25,9999-12-31,2020-08-25,4.91431,-73.993189,Cundinamarca,Sopó,,,,...,3.0,,,,USD,Vendo espectacular casa entre Bogota y Briceño,Espectacular casa entre Bogota- Briceño km 12 ...,Casa,POINT (-73.9931885 4.9143096),0
49196,2020-10-07,2021-01-22,2020-10-07,10.513831,-75.498685,Bolívar,Cartagena,,,,...,5.0,,,,USD,Exclusive beach house for sale - Manzanillo de...,¡EXCLUSIVE BEACH HOUSE FOR SALE - MANZANILLO D...,Casa,POINT (-75.4986852407 10.5138313669),0
56159,2020-10-26,2020-10-26,2020-10-26,4.739003,-74.098302,Cundinamarca,Bogotá D.C,Zona Noroccidental,,,...,5.0,,,,USD,Casa en Venta Costa del Este RAH PA: 20-11172,Viva en una casa espaciosa con terrazas y pati...,Casa,POINT (-74.098302 4.7390028),0
56522,2020-10-26,2020-10-26,2020-10-26,4.695757,-74.043894,Cundinamarca,Bogotá D.C,Zona Norte,Usaquén,,...,4.0,,,,USD,Apartamento en Venta Santa Maria RAH PA: 20-10683,Majestuoso apartamento a estrenar con la mejor...,Apartamento,POINT (-74.0438943 4.6957568),0
116979,2020-09-13,2021-01-12,2020-09-13,,,Santander,Bucaramanga,,,,...,,,,,USD,Villa for sale Bali,Villa for sale Bali<br />\n<br />\nLocation: J...,Finca,POINT EMPTY,0
136236,2020-09-16,2020-10-29,2020-09-16,4.622794,-74.09096,Cundinamarca,Bogotá D.C,Zona Centro,Puente Aranda,,...,3.0,2.0,,,USD,Venta Casa Excelente,<br />\n - Calefacción\n- Parrilla\n <br />\n ...,Casa,POINT (-74.0909602 4.622794),0
137025,2020-08-29,9999-12-31,2020-08-29,12.585979,-81.714549,San Andrés Providencia y Santa Catalina,,,,,...,0.0,,,,USD,HOTEL EN VENTA EN LA ISLA DE SAN ANDRÉS,\nUn Hotel Boutique TOTALMENTE frente al mar.\...,Otro,POINT (-81.7145490646 12.5859785199),0
167143,2020-08-08,9999-12-31,2020-08-08,10.829302,-75.16026,Atlántico,,,,,...,,,,,USD,Lote en venta Vía Barranquilla Cartagena,OPORTUNIDAD DE INVERSIÒN EN EL CARIBE COLOMBIA...,Lote,POINT (-75.1602602005 10.8293016581),0


In [61]:
df_good_cols.description.duplicated().value_counts()


False    111313
True      82087
Name: description, dtype: int64

In [62]:
# In this cell we analyzed the properties' descriptions in order to find keywords that may help us identify a positive target, in case we use NLP
# The lists are saved in the next cell
'''
x = df_good_cols[['description','property_type','target']].sample(15)
print(x)
for y in x.description:
    print('----------------------------------------')
    print(y)
'''

In [63]:
description_expensive = ['excelente', 'exclusivo', 'exclusiva', 'club', 'mejor', 'lujo', 'playa', 'piscina', 'jacuzzi', 'terraza', 'campestre', 'condominio']
description_cheap = ['negociable', 'economico', 'economica', 'sencillo', 'sencilla']

In [64]:
df_good_cols.title.duplicated().value_counts()

True     98436
False    94964
Name: title, dtype: int64

## -------------- D A T A --- P R E P R O C E S S I N G --- 1 --------------

### ------------------------ DEFINING THE SUBSET(S) OF TRAINING DATA ------------------------

Now that we have filtered out some columns (resulting in the datasets 'df_good_cols' and 'df_good_cols_no_dates'), we can explore whether it is worth filtering our training data further or whether it is ready to use.

In [65]:
df_good_cols.shape, df_good_cols.columns, f'DUPLICATED REGISTRIES: {df_good_cols.duplicated().value_counts()}'

((193400, 21),
 Index(['start_date', 'end_date', 'created_on', 'lat', 'lon', 'l2', 'l3', 'l4',
        'l5', 'l6', 'rooms', 'bedrooms', 'bathrooms', 'surface_total',
        'surface_covered', 'currency', 'title', 'description', 'property_type',
        'geometry', 'target'],
       dtype='object'),
 'DUPLICATED REGISTRIES: False    193400\ndtype: int64')

In [66]:
df_good_cols_no_dates.shape, df_good_cols_no_dates.columns,f'DUPLICATED REGISTRIES: {df_good_cols_no_dates.duplicated().value_counts()}'

((120421, 18),
 Index(['lat', 'lon', 'l2', 'l3', 'l4', 'l5', 'l6', 'rooms', 'bedrooms',
        'bathrooms', 'surface_total', 'surface_covered', 'currency', 'title',
        'description', 'property_type', 'geometry', 'target'],
       dtype='object'),
 'DUPLICATED REGISTRIES: False    120421\ndtype: int64')

In [67]:
df_good_cols_no_dates.description.duplicated().value_counts()

False    111313
True       9108
Name: description, dtype: int64

In [68]:
df_minimal = df_good_cols_no_dates.drop_duplicates(subset='description').copy()
df_minimal.shape

(111313, 18)

In [69]:
df_minimal.title.duplicated().value_counts()

False    89757
True     21556
Name: title, dtype: int64

In [70]:
X1 = df_good_cols.drop('target',axis=1)
y1 = df_good_cols.target
X2 = df_good_cols_no_dates.drop('target',axis=1)
y2 = df_good_cols_no_dates.target
X3 = df_minimal.drop('target',axis=1)
y3 = df_minimal.target

### ------------------------------------ CREATING THE PIPELINE ------------------------------------

We will design a pipeline that receives a dataset with (at least) the same features as 'df_minimal' (minus the target column).

This pipeline will apply the necessary transformations to the dataset, feed it to a model of our choice, run a cross-validation and give us the results.

As we concluded in the previous section, we will select a few features that we consider relevant for the data preprocessing and model training: l2, l3, lat, lon, bathrooms and property_type.

l2--- categorical (needs encoding). MissVal (0, ok!)

l3--- categorical (needs encoding). MissVal (needs imputation using 'capitals')

lat--- numerical (ok). MissVal (needs imputation using 'dep_ciud_lat_lon'). Scaled with StandardScaler

lon--- numerical (ok). MissVal (needs imputation using 'dep_ciud_lat_lon')

bathrooms-- numerical (ok). MissVal (needs imputation using the rounded mean for its property_type). Values of 6 or more with property type 'Casa' or 'Apartamento' are replaced with 2

property_type--- categorical (needs encoding). MissVal (0, ok!)

description-- generate a numerical column depending on whether words from 'description_expensive' or 'description_cheap' appear in the text (pending implementation)
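
The pending 'description' feature could be sketched as a simple keyword score. This is only an illustration under stated assumptions: `description_score` and `strip_accents` are hypothetical helpers, the two keyword lists are copied from the notebook, and accents are stripped so that 'económico' matches 'economico':

```python
import re
import unicodedata

description_expensive = ['excelente', 'exclusivo', 'exclusiva', 'club', 'mejor', 'lujo',
                         'playa', 'piscina', 'jacuzzi', 'terraza', 'campestre', 'condominio']
description_cheap = ['negociable', 'economico', 'economica', 'sencillo', 'sencilla']

def strip_accents(text):
    """Remove diacritics so accented and unaccented spellings compare equal."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def description_score(text):
    """Count 'expensive' keywords minus 'cheap' keywords; 0 for missing text."""
    if not isinstance(text, str):
        return 0
    # Compare whole words only, in lowercase and without accents
    words = set(re.findall(r'[a-z]+', strip_accents(text.lower())))
    expensive = sum(word in words for word in description_expensive)
    cheap = sum(word in words for word in description_cheap)
    return expensive - cheap
```

A signed score keeps the feature in a single column; two separate counts would be an equally reasonable design if the model benefits from them.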

In [71]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, FunctionTransformer, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, classification_report
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV

#### Here we define some helper functions that will be used to fill missing values during the preprocessing

In [72]:
def fill_l3(df):
    l3_ok = []
    for x in range(len(df)):
        if type(df.loc[x,'l3']) == float:
            dep = df.loc[x,'l2']
            l3_ok.append(str(capitals[dep]))
        else:
            l3_ok.append(str(df.loc[x,'l3']))
    return pd.Series(l3_ok)
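
The same imputation can be written without an explicit loop. A minimal sketch, assuming `capitals` maps each department name to its capital as in the notebook; `fill_l3_vectorized` is a hypothetical name:

```python
import pandas as pd

def fill_l3_vectorized(df, capitals):
    """Return 'l3' with gaps replaced by the capital of the row's department."""
    # Map every department to its capital, then use it only where 'l3' is missing
    return df['l3'].fillna(df['l2'].map(capitals)).astype(str)
```

Because `fillna` aligns on the index, this version does not require the DataFrame index to be reset first.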

In [73]:
def fill_coor(df):
    lat_ok = []
    lon_ok = []
    for x in range(len(df)):
        if str(df.loc[x,'lat']) == 'nan':
            dep, city = df.loc[x,'l2':'l3']
            #print('NAN FOUND: ',dep, city, df.loc[x,'lat'], df.loc[x,'lon'])
            try:
                lat_ok.append(float(dep_ciud_lat_lon[dep][city]['lat']))
                lon_ok.append(float(dep_ciud_lat_lon[dep][city]['lon']))
            except KeyError:
                geolocator = Nominatim(user_agent='acidminded')
                geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
                txt = f'{city}, {dep}'
                coor = geocode(txt)
                lat_ok.append(coor.latitude)
                lon_ok.append(coor.longitude)
                dep_ciud_lat_lon[dep][city] = {'lat':coor.latitude, 'lon':coor.longitude}
                print(f'NEW DATA: {dep}{city} lat: {coor.latitude} lon: {coor.longitude}')
        else:
            lat_ok.append(df.loc[x,'lat'])
            lon_ok.append(df.loc[x,'lon'])
    return pd.Series(lat_ok), pd.Series(lon_ok)
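
For cases where calling the geocoding service is not desirable (for example inside cross-validation), an offline variant can rely only on the dictionary. A minimal sketch, assuming `coords` has the same nested layout as 'dep_ciud_lat_lon'; `fill_coor_lookup` is a hypothetical helper that leaves unknown places as NaN instead of geocoding them:

```python
import pandas as pd

def fill_coor_lookup(df, coords):
    """Fill missing 'lat'/'lon' from a nested {department: {city: {'lat', 'lon'}}} dict."""
    def lookup(row, key):
        if pd.notna(row[key]):
            return row[key]
        # Unknown department/city pairs stay missing instead of raising KeyError
        return coords.get(row['l2'], {}).get(row['l3'], {}).get(key, float('nan'))

    out = df.copy()
    for key in ('lat', 'lon'):
        out[key] = df.apply(lambda row: lookup(row, key), axis=1)
    return out
```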

In [74]:
prtypes = ['Casa', 'Apartamento', 'Otro', 'Oficina', 'Finca', 'Lote', 'Local comercial', 'Parqueadero']
prtypes_avgbtrms = {}

for x in prtypes:
    prtypes_avgbtrms[x] = X1.loc[X1.property_type==x].bathrooms.mean()
    if str(prtypes_avgbtrms[x]) == 'nan':
        prtypes_avgbtrms[x] = 1
for x in prtypes_avgbtrms:
    prtypes_avgbtrms[x] = round(prtypes_avgbtrms[x])

In [75]:
def fill_bathrooms(df):
    # Expects a DataFrame with a 0..n-1 index (fill_nan resets it before calling)
    btrms_ok = []
    for x in range(len(df)):
        btrms = df.loc[x,'bathrooms']
        prtype = df.loc[x,'property_type']

        if str(btrms) == 'nan':
            # Missing value: use the rounded mean for this property type
            btrms_ok.append(prtypes_avgbtrms[prtype])
        elif btrms >= 6 and prtype in ['Casa', 'Apartamento']:
            # Implausible count for a house or apartment: replace it with 2
            btrms_ok.append(float(2))
        else:
            btrms_ok.append(btrms)
    return pd.Series(btrms_ok)

In [76]:
def fill_description(df):
    dsc_ok = []
    for x in range(len(df)):
        dsc = df.loc[x,'description'] 
        if str(dsc) == 'nan':
            dsc_ok.append('no description')
        else:
            dsc_ok.append(dsc)
    return pd.Series(dsc_ok)

In [77]:
def fill_nan(dataset, l3=True, coor=True, bathrooms=True, description=False):
    # Keep only the features used for training
    cols = ['bathrooms', 'lat', 'lon', 'l2', 'l3', 'property_type']
    if description:
        cols.append('description')
    # reset_index returns a new DataFrame, so the caller's data is not modified;
    # the fill_* helpers rely on the resulting 0..n-1 index
    X = dataset[cols].reset_index(drop=True)
    if description:
        X['description'] = fill_description(X)
        X['description'] = X['description'].astype('category')
    # Fill missing values
    if l3:
        X['l3'] = fill_l3(X)
    if coor:
        X['lat'], X['lon'] = fill_coor(X)
    if bathrooms:
        X['bathrooms'] = fill_bathrooms(X)

    # Prepare to transform to numerical
    X['l2'] = X['l2'].astype('category')
    X['l3'] = X['l3'].astype('category')
    X['property_type'] = X['property_type'].astype('category')

    return X

In [78]:
X1.isnull().sum()

start_date              0
end_date                0
created_on              0
lat                 48518
lon                 48518
l2                      0
l3                  10830
l4                 149052
l5                 166624
l6                 186607
rooms              165948
bedrooms           153024
bathrooms           40377
surface_total      186458
surface_covered    183645
currency                4
title                   1
description           121
property_type           0
geometry                0
dtype: int64

In [79]:
X_ok = fill_nan(X1, description=False)
X_ok.isnull().sum()


bathrooms        0
lat              0
lon              0
l2               0
l3               0
property_type    0
dtype: int64

In [80]:
X_ok.head()

Unnamed: 0,bathrooms,lat,lon,l2,l3,property_type
0,4.0,6.203,-75.572,Antioquia,Medellín,Casa
1,2.0,4.722748,-74.073115,Cundinamarca,Bogotá D.C,Apartamento
2,2.0,4.709,-74.03,Cundinamarca,Bogotá D.C,Casa
3,1.0,7.117263,-73.115667,Santander,Bucaramanga,Otro
4,2.0,6.244338,-75.573553,Antioquia,Medellín,Apartamento


In [81]:
X_min_ok = fill_nan(X3, description=False)
X_min_ok.isnull().sum()

bathrooms        0
lat              0
lon              0
l2               0
l3               0
property_type    0
dtype: int64

In [82]:
X_min_ok.head()

Unnamed: 0,bathrooms,lat,lon,l2,l3,property_type
0,4.0,6.203,-75.572,Antioquia,Medellín,Casa
1,2.0,4.722748,-74.073115,Cundinamarca,Bogotá D.C,Apartamento
2,2.0,4.709,-74.03,Cundinamarca,Bogotá D.C,Casa
3,1.0,7.117263,-73.115667,Santander,Bucaramanga,Otro
4,2.0,6.244338,-75.573553,Antioquia,Medellín,Apartamento


In [83]:
l2_coder = OneHotEncoder()
l3_coder = OneHotEncoder()
pt_coder = OneHotEncoder()

depts = capitals.keys()
cities = set()
for dept in dep_ciud_lat_lon:
    for city in dep_ciud_lat_lon[dept]:
        cities.add(city)

l2_cod = l2_coder.fit(pd.Series(depts).values.reshape(-1, 1))
l3_cod = l3_coder.fit(pd.Series(list(cities)).values.reshape(-1, 1))
pt_cod = pt_coder.fit(property_types.reshape(-1,1))

In [84]:
def preprocess_to_num(dataset, l2=True, l3=True, pt=True, description=True):
    X = dataset.copy()
    
    if l2:
        l2_cod = l2_coder.transform(X[['l2']])
        new_l2 = pd.DataFrame(l2_cod.toarray())
        X = pd.concat([X, new_l2], axis=1)
        X.drop('l2', axis=1, inplace=True)

    if l3:    
        l3_cod = l3_coder.transform(X[['l3']])
        new_l3 = pd.DataFrame(l3_cod.toarray())
        X = pd.concat([X, new_l3], axis=1)
        X.drop('l3', axis=1, inplace=True)
    
    if pt:
        pt_cod = pt_coder.transform(X[['property_type']])
        new_pt = pd.DataFrame(pt_cod.toarray(), columns=pt_coder.categories_)
        X = pd.concat([X, new_pt], axis=1)
        X.drop('property_type', axis=1, inplace=True)

    if description:
        pass

    X_num = X.to_numpy()

    return X_num

In [85]:
X_ok = preprocess_to_num(X_ok)
X_min_ok = X_min_ok[['bathrooms', 'lat', 'lon', 'property_type']]
X_min_ok = preprocess_to_num(X_min_ok, l2=False, l3=False)

In [86]:
X_ok.shape

(193400, 346)

In [87]:
X_min_ok.shape

(111313, 11)

In [88]:
#X_ok_corr = pd.concat([X_min_ok,y3], axis=1)


In [89]:
#sns.pairplot(X_ok_corr.sample(frac = 0.1), hue = '')
#plt.show()

In [90]:
min_max_scaler = MinMaxScaler().fit(X_ok[:,:1])
std_scaler = StandardScaler().fit(X_ok[:,1:])
pca = PCA(n_components=30, whiten=False).fit(X_ok)


def preprocess_std_dimred(matrix):
    '''
    min_max_scaler = MinMaxScaler().fit(matrix[:,:1])
    std_scaler = StandardScaler().fit(matrix[:,1:])
    '''
    #print(f' STANDARDIZATION INITIAL SHAPE: {matrix.shape}')
    X =  np.copy(matrix)
    X[:,:1] = min_max_scaler.transform(matrix[:,:1])
    X[:,1:] = std_scaler.transform(matrix[:,1:])
    #print(f' STANDARDIZATION INTERMEDIATE SHAPE: {X.shape}')
    
    X = pca.transform(X)
    #print(f' STANDARDIZATION FINAL SHAPE: {X.shape}')
    return X
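
In the cell above, `pca` is fitted on the unscaled `X_ok`, while `preprocess_std_dimred` applies it to the scaled matrix. A minimal sketch of one way to fit the scalers and the PCA together, assuming scikit-learn's `ColumnTransformer` and `Pipeline`; `make_scale_pca` and the random demo matrix are hypothetical and not part of the notebook:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def make_scale_pca(n_features, n_components=30):
    """Min-max scale column 0, standardize the rest, then reduce with PCA."""
    scale = ColumnTransformer([
        ('minmax', MinMaxScaler(), [0]),
        ('std', StandardScaler(), list(range(1, n_features))),
    ])
    # PCA is fitted on the output of the scaling step, not on the raw matrix
    return Pipeline([('scale', scale), ('pca', PCA(n_components=n_components))])

# Demo on random data (assumption: stands in for the one-hot encoded matrix)
rng = np.random.default_rng(0)
demo = rng.normal(size=(200, 12))
reducer = make_scale_pca(n_features=demo.shape[1], n_components=5)
reduced = reducer.fit_transform(demo)
```

When such a transformer is placed inside the model pipeline, `cross_validate` refits the scalers and the PCA on each training fold, so the validation fold does not influence them.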
    

In [91]:
min_max_scaler2 = MinMaxScaler().fit(X_min_ok[:,:1])
std_scaler2 = StandardScaler().fit(X_min_ok[:,1:])

def preprocess_std_dimred2(matrix):
    '''
    min_max_scaler = MinMaxScaler().fit(matrix[:,:1])
    std_scaler = StandardScaler().fit(matrix[:,1:])
    '''
    #print(f' STANDARDIZATION INITIAL SHAPE: {matrix.shape}')
    X =  np.copy(matrix)
    X[:,:1] = min_max_scaler2.transform(matrix[:,:1])
    X[:,1:] = std_scaler2.transform(matrix[:,1:])
    #print(f' STANDARDIZATION INTERMEDIATE SHAPE: {X.shape}')
    
    return X

In [92]:
X_ok = preprocess_std_dimred(X_ok)
X_ok.shape

(193400, 30)

In [93]:
X_min_ok = preprocess_std_dimred2(X_min_ok)
X_min_ok.shape

(111313, 11)

In [94]:
fill_df = FunctionTransformer(fill_nan)
df_to_num = FunctionTransformer(preprocess_to_num)
mat_to_X = FunctionTransformer(preprocess_std_dimred)
mat_to_X2 = FunctionTransformer(preprocess_std_dimred2)

### Here we start defining our preprocessing functions

We will define two preprocessing functions. The second one differs from the first in that it drops the one-hot encoded 'l2' and 'l3' columns, keeping mostly numerical features (number of bathrooms, lat, lon, and the one-hot encoded property type), and skips the PCA step.

In [95]:
def preprocess1(df):
    df_ok = df.copy()
    df_ok2 = fill_nan(df_ok)
    df_ok3 = preprocess_to_num(df_ok2)
    df_ok4 = preprocess_std_dimred(df_ok3)

    return df_ok4

In [96]:
def preprocess2(df):
    df_ok = df.copy()
    df_ok2 = fill_nan(df_ok)
    df_ok3 = df_ok2[['bathrooms', 'lat','lon', 'property_type']]
    df_ok4 = preprocess_to_num(df_ok3, l2=False, l3=False, description=False)
    df_ok5 = preprocess_std_dimred2(df_ok4)

    return df_ok5

In [97]:
preprocess2(X1)

array([[ 0.15789474,  0.21552   , -0.48713078, ..., -0.0919878 ,
        -0.29511709, -0.00947865],
       [ 0.05263158, -0.45233172,  0.919328  , ..., -0.0919878 ,
        -0.29511709, -0.00947865],
       [ 0.05263158, -0.45853469,  0.959784  , ..., -0.0919878 ,
        -0.29511709, -0.00947865],
       ...,
       [ 0.        , -1.04325684, -1.39637879, ..., -0.0919878 ,
         3.38848558, -0.00947865],
       [ 0.05263158, -0.47393134,  0.84992647, ..., -0.0919878 ,
        -0.29511709, -0.00947865],
       [ 0.        ,  2.35092603,  0.23454461, ..., -0.0919878 ,
        -0.29511709, -0.00947865]])

In [98]:
func_pp1 = FunctionTransformer(preprocess1)
func_pp2 = FunctionTransformer(preprocess2)

In [99]:
KNclf = KNeighborsClassifier()
DTclf = DecisionTreeClassifier()
RFclf = RandomForestClassifier()
LSVCclf = LinearSVC()
SVCclf = SVC()

In [100]:
pipe_kn1 = make_pipeline(func_pp1, KNclf)
pipe_dt1 = make_pipeline(func_pp1, DTclf)
pipe_rf1 = make_pipeline(func_pp1, RFclf)
pipe_LSVC1 = make_pipeline(func_pp1, LSVCclf)
pipe_SVC1 = make_pipeline(func_pp1, SVCclf)

In [101]:
pipe_kn2 = make_pipeline(func_pp2, KNclf)
pipe_dt2 = make_pipeline(func_pp2, DTclf)
pipe_rf2 = make_pipeline(func_pp2, RFclf)
pipe_LSVC2 = make_pipeline(func_pp2, LSVCclf)
pipe_SVC2 = make_pipeline(func_pp2, SVCclf)

## ------------------ T E S T I N G --- T H E --- P I P E L I N E S --------------------

In [102]:
scoring= {'acc': 'accuracy', 'recall': 'recall_macro'}

### CROSS VALIDATION

#### ------------- KNeighbors

In [100]:
cross_validate(pipe_kn1, X1, y1, cv=5, scoring=scoring)

{'fit_time': array([15.76625872, 15.60884166, 16.19298482, 16.75369215, 16.58904529]),
 'score_time': array([ 8.95588946,  9.10872984,  9.51508236,  9.08830619, 10.10334182]),
 'test_acc': array([0.77832933, 0.86804209, 0.86920551, 0.86992942, 0.87223041]),
 'test_recall': array([0.76358807, 0.80184349, 0.79952283, 0.80478933, 0.80945119])}
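
The dictionaries returned by `cross_validate` are easier to compare across models when each metric is reduced to a mean and a standard deviation. A minimal sketch; `summarize_cv` is a hypothetical helper, not part of the notebook:

```python
import numpy as np

def summarize_cv(results, prefix='test_'):
    """Return {metric: (mean, std)} for every test score in a cross_validate result."""
    summary = {}
    for key, values in results.items():
        # Skip timing entries such as 'fit_time' and 'score_time'
        if key.startswith(prefix):
            scores = np.asarray(values, dtype=float)
            summary[key[len(prefix):]] = (float(scores.mean()), float(scores.std()))
    return summary
```

For the run above, for instance, the first fold's accuracy (about 0.778) is well below the other four (about 0.87), which the standard deviation makes visible at a glance.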

In [101]:
cross_validate(pipe_kn1, X2, y2, cv=5, scoring=scoring)

{'fit_time': array([7.82627797, 7.75463176, 7.63043523, 7.55328798, 7.63892317]),
 'score_time': array([4.17761612, 3.95524812, 3.87610173, 3.99858785, 3.86021352]),
 'test_acc': array([0.84748297, 0.84727529, 0.85071651, 0.84926272, 0.84893043]),
 'test_recall': array([0.77789892, 0.77822889, 0.78416586, 0.78501521, 0.7866129 ])}

In [102]:
cross_validate(pipe_kn1, X3, y3, cv=5, scoring=scoring)

{'fit_time': array([6.96152401, 6.92558503, 6.99998975, 6.99631071, 7.0344696 ]),
 'score_time': array([3.71052885, 3.57775402, 3.52755499, 3.51782274, 3.54580307]),
 'test_acc': array([0.84445043, 0.84449535, 0.84575304, 0.84538676, 0.84677927]),
 'test_recall': array([0.77437633, 0.77464498, 0.779482  , 0.77828491, 0.78416648])}

In [105]:
cross_validate(pipe_kn2, X1, y1, cv=5, scoring=scoring) ################################# MOST INTERESTING

{'fit_time': array([14.18290782, 13.50989294, 13.6500113 , 14.99079919, 14.05098796]),
 'score_time': array([56.51609349, 57.96026516, 60.32239842, 71.88497567, 57.31298041]),
 'test_acc': array([0.85630445, 0.8677577 , 0.8691538 , 0.86954161, 0.87158406]),
 'test_recall': array([0.81076846, 0.80374981, 0.79903253, 0.80426948, 0.8085721 ])}

In [106]:
cross_validate(pipe_kn2, X2, y2, cv=5, scoring=scoring)

{'fit_time': array([6.6536355 , 6.34678006, 6.58386087, 6.65801263, 6.37996626]),
 'score_time': array([15.46485734, 14.75310254, 15.19360232, 16.44882417, 17.31507015]),
 'test_acc': array([0.84723376, 0.84715069, 0.85017653, 0.8494704 , 0.84884735]),
 'test_recall': array([0.77668534, 0.77847656, 0.78386068, 0.7858154 , 0.78666773])}

In [107]:
cross_validate(pipe_kn2, X3, y3, cv=5, scoring=scoring)

{'fit_time': array([6.05738521, 6.17148852, 6.20670199, 6.05932212, 6.76290774]),
 'score_time': array([14.26884985, 14.0147624 , 14.46614575, 15.65272021, 14.20572662]),
 'test_acc': array([0.84319274, 0.84471994, 0.84579796, 0.84556644, 0.84601563]),
 'test_recall': array([0.77401461, 0.77539152, 0.77951196, 0.77798711, 0.78300065])}

#### ------------- Decision Tree

In [108]:
cross_validate(pipe_dt1, X1, y1, cv=5, scoring=scoring)

{'fit_time': array([22.89309716, 23.00607824, 23.05350041, 23.22311926, 23.33981466]),
 'score_time': array([3.92868447, 4.08871651, 3.95342374, 4.11821651, 4.0390625 ]),
 'test_acc': array([0.89224127, 0.8897593 , 0.8914398 , 0.89115541, 0.89291347]),
 'test_recall': array([0.8419106 , 0.838449  , 0.84076507, 0.84004655, 0.84469581])}

In [109]:
cross_validate(pipe_dt1, X2, y2, cv=5, scoring=scoring)

{'fit_time': array([13.07382512, 13.40874958, 13.25738335, 14.00550056, 13.46884871]),
 'score_time': array([2.20357609, 2.1289506 , 1.94766235, 2.12201667, 1.91945577]),
 'test_acc': array([0.8465692 , 0.85088885, 0.85063344, 0.85088266, 0.84884735]),
 'test_recall': array([0.79040776, 0.78902065, 0.79111243, 0.79017605, 0.79124384])}

In [110]:
cross_validate(pipe_dt1, X3, y3, cv=5, scoring=scoring)

{'fit_time': array([11.99421215, 11.65551209, 11.80455661, 12.20170116, 12.02365971]),
 'score_time': array([1.93236208, 1.90856266, 1.75685501, 1.80348611, 1.78372407]),
 'test_acc': array([0.84040785, 0.843597  , 0.84287832, 0.83968197, 0.84372473]),
 'test_recall': array([0.77937683, 0.78102712, 0.78347152, 0.77537391, 0.78737947])}

In [111]:
cross_validate(pipe_dt2, X1, y1, cv=5, scoring=scoring)  ##################################### MOST INTERESTING

{'fit_time': array([14.19791508, 13.54227114, 13.5511961 , 14.00640225, 14.02162242]),
 'score_time': array([3.2897625 , 3.32374144, 3.31043124, 3.4192965 , 3.34851623]),
 'test_acc': array([0.89306859, 0.892112  , 0.89200858, 0.89107785, 0.89335298]),
 'test_recall': array([0.84286992, 0.84253602, 0.84041418, 0.84003388, 0.84372781])}

In [112]:
cross_validate(pipe_dt2, X2, y2, cv=5, scoring=scoring)

{'fit_time': array([6.79627919, 6.73659801, 6.77482581, 6.77934933, 6.87619209]),
 'score_time': array([1.71386385, 1.64151239, 1.55327344, 1.56251502, 1.62599087]),
 'test_acc': array([0.84993354, 0.85259179, 0.85478712, 0.84934579, 0.85142264]),
 'test_recall': array([0.78940068, 0.79363048, 0.7960894 , 0.78898515, 0.79544327])}

In [113]:
cross_validate(pipe_dt2, X3, y3, cv=5, scoring=scoring)

{'fit_time': array([6.30907106, 6.3779254 , 6.17683268, 6.31683564, 6.95823431]),
 'score_time': array([1.58472133, 1.58539414, 1.4455018 , 1.69716525, 1.67410851]),
 'test_acc': array([0.84373175, 0.84480977, 0.84741499, 0.84152367, 0.84579103]),
 'test_recall': array([0.78099768, 0.78529714, 0.78876568, 0.77821369, 0.78702764])}

#### ------------- Random Forest

In [134]:
cross_validate(pipe_rf1, X1, y1, cv=5, scoring=scoring)

{'fit_time': array([79.13804507, 79.38540411, 77.21740842, 75.04187083, 74.8404572 ]),
 'score_time': array([4.46452928, 5.16125345, 4.55946589, 5.21439505, 4.64522147]),
 'test_acc': array([0.90054034, 0.89751545, 0.89826521, 0.89898912, 0.90061791]),
 'test_recall': array([0.8517109 , 0.84922634, 0.84868945, 0.84912471, 0.85315687])}

In [135]:
cross_validate(pipe_rf1, X2, y2, cv=5, scoring=scoring)

{'fit_time': array([51.4400506 , 51.87565994, 50.79639697, 49.52957582, 49.37738419]),
 'score_time': array([2.53509378, 2.34621549, 2.36966825, 2.25388694, 2.33751154]),
 'test_acc': array([0.86397242, 0.86708756, 0.86807892, 0.86654206, 0.86641745]),
 'test_recall': array([0.80780845, 0.80865102, 0.81118871, 0.80878486, 0.81162381])}

In [136]:
cross_validate(pipe_rf1, X3, y3, cv=5, scoring=scoring)

{'fit_time': array([48.04076672, 51.57431793, 48.3294239 , 46.45552158, 44.95006466]),
 'score_time': array([2.55691004, 2.2811904 , 2.21504307, 2.10499907, 2.29137826]),
 'test_acc': array([0.85729686, 0.86062076, 0.86210304, 0.85881772, 0.86205193]),
 'test_recall': array([0.79756625, 0.80139492, 0.80333857, 0.79637496, 0.80605101])}

In [103]:
cross_validate(pipe_rf2, X1, y1, cv=5, scoring=scoring)  # MOST INTERESTING random forest run

{'fit_time': array([29.96370602, 27.45436096, 29.10842204, 27.98951554, 27.97447658]),
 'score_time': array([4.22649121, 4.15078521, 4.79818797, 4.2227025 , 4.32479763]),
 'test_acc': array([0.87838676, 0.87939504, 0.8766546 , 0.87546536, 0.87784385]),
 'test_recall': array([0.87503584, 0.87615494, 0.87184818, 0.87112812, 0.87426602])}

In [138]:
cross_validate(pipe_rf2, X2, y2, cv=5, scoring=scoring)

{'fit_time': array([15.57574773, 16.20597696, 17.50141644, 15.57983351, 15.89280057]),
 'score_time': array([2.16879845, 2.08867407, 2.0377593 , 1.97411275, 2.18174982]),
 'test_acc': array([0.86222794, 0.86683835, 0.86915888, 0.86641745, 0.86571132]),
 'test_recall': array([0.80631365, 0.80887068, 0.81179906, 0.80864658, 0.81076669])}

In [139]:
cross_validate(pipe_rf2, X3, y3, cv=5, scoring=scoring)

{'fit_time': array([15.75585413, 14.49987459, 14.35626936, 14.41298056, 14.89055347]),
 'score_time': array([2.14486027, 1.94603801, 1.88956213, 1.89289641, 2.02546144]),
 'test_acc': array([0.85734178, 0.86026142, 0.86416925, 0.85980595, 0.86236636]),
 'test_recall': array([0.79753655, 0.8013342 , 0.80740224, 0.79673593, 0.80614146])}

#### ------------- Linear SVC

In [144]:
cross_validate(pipe_LSVC1, X1, y1, cv=5, scoring=scoring)

{'fit_time': array([73.65826368, 74.66255355, 72.77736354, 73.25964069, 73.58268905]),
 'score_time': array([3.9484601 , 3.78908587, 3.79566717, 3.81826687, 4.00475216]),
 'test_acc': array([0.785284  , 0.26127873, 0.41448848, 0.77920836, 0.77377905]),
 'test_recall': array([0.55552217, 0.5168151 , 0.61242258, 0.53978499, 0.52204582])}

In [146]:
cross_validate(pipe_LSVC1, X2, y2, cv=5, scoring=scoring)

{'fit_time': array([39.52835083, 40.79839277, 39.7820642 , 38.3153944 , 38.46641827]),
 'score_time': array([1.92426848, 1.9485023 , 2.41100883, 1.83251691, 1.78645182]),
 'test_acc': array([0.38349394, 0.7575594 , 0.78380062, 0.79086189, 0.29088266]),
 'test_recall': array([0.58497072, 0.51921621, 0.59842755, 0.68997512, 0.52627793])}

In [147]:
cross_validate(pipe_LSVC1, X3, y3, cv=5, scoring=scoring)

{'fit_time': array([33.52827311, 35.14373541, 36.3456254 , 34.42434764, 35.11392808]),
 'score_time': array([1.76718855, 1.68607497, 1.78291011, 1.64808798, 1.75025916]),
 'test_acc': array([0.27071823, 0.77464852, 0.7750977 , 0.76911329, 0.35720061]),
 'test_recall': array([0.51327619, 0.73353595, 0.5647285 , 0.54892777, 0.56863991])}

In [148]:
cross_validate(pipe_LSVC2, X1, y1, cv=5, scoring=scoring) 

{'fit_time': array([45.07757521, 47.71512008, 43.80670214, 41.08510399, 45.84256291]),
 'score_time': array([3.36925936, 3.46272993, 3.37830639, 3.27445316, 3.29322362]),
 'test_acc': array([0.79272991, 0.79466894, 0.79531529, 0.79394503, 0.79601334]),
 'test_recall': array([0.61565487, 0.61977514, 0.61860001, 0.61949204, 0.61985528])}

In [149]:
cross_validate(pipe_LSVC2, X2, y2, cv=5, scoring=scoring)

{'fit_time': array([25.00814223, 27.02097201, 24.15813971, 25.513304  , 24.24579239]),
 'score_time': array([1.78398895, 1.68497729, 1.82414079, 1.90233016, 1.62385321]),
 'test_acc': array([0.77857618, 0.78135903, 0.78350987, 0.77844237, 0.78292835]),
 'test_recall': array([0.61577966, 0.61978467, 0.61753038, 0.61392847, 0.61664615])}

In [150]:
cross_validate(pipe_LSVC2, X3, y3, cv=5, scoring=scoring)  # first flagged as most interesting; on review, not so interesting

{'fit_time': array([20.8830502 , 22.56210065, 18.5906508 , 23.35831285, 24.53805327]),
 'score_time': array([1.5736599 , 1.56822276, 1.7554636 , 1.48428798, 1.43557072]),
 'test_acc': array([0.7817455 , 0.78331761, 0.78524907, 0.78052286, 0.78501482]),
 'test_recall': array([0.61707929, 0.62063431, 0.61374799, 0.61513484, 0.61992179])}

#### ------------- SVC

In [155]:
cross_validate(pipe_SVC1, X1, y1, cv=5, scoring=scoring)

In [None]:
cross_validate(pipe_SVC1, X2, y2, cv=5, scoring=scoring)

In [None]:
cross_validate(pipe_SVC1, X3, y3, cv=5, scoring=scoring)

In [None]:
cross_validate(pipe_SVC2, X1, y1, cv=5, scoring=scoring)

In [None]:
cross_validate(pipe_SVC2, X2, y2, cv=5, scoring=scoring)

In [101]:
cross_validate(pipe_SVC2, X3, y3, cv=5, scoring=scoring)

{'fit_time': array([224.28532982, 206.27716494, 184.21603179, 189.40349984,
        185.39456367]),
 'score_time': array([87.62521362, 80.33638406, 80.88860798, 80.21557689, 80.72905803]),
 'test_acc': array([0.78340745, 0.78462022, 0.7868661 , 0.78366724, 0.78398167]),
 'test_recall': array([0.59879505, 0.59906713, 0.60677124, 0.60362812, 0.59805   ])}
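
The fold-by-fold arrays above are easier to compare once each metric is reduced to a mean and a spread. The following is a minimal sketch using only the standard library; `summarize_cv` is a hypothetical helper (it is not part of the notebook or of scikit-learn), and the scores are copied from the `pipe_rf2` and `pipe_LSVC1` runs on `X1` shown above.

```python
from statistics import mean, stdev

def summarize_cv(cv_results):
    """Return {metric: (mean, sample std)} for every 'test_*' entry of a cross_validate-style dict."""
    return {
        name: (mean(scores), stdev(scores))
        for name, scores in cv_results.items()
        if name.startswith("test_")
    }

# Scores copied from the outputs above (pipe_rf2 and pipe_LSVC1, both on X1, y1)
rf2_x1 = {
    "test_acc": [0.87838676, 0.87939504, 0.8766546, 0.87546536, 0.87784385],
    "test_recall": [0.87503584, 0.87615494, 0.87184818, 0.87112812, 0.87426602],
}
lsvc1_x1 = {
    "test_acc": [0.785284, 0.26127873, 0.41448848, 0.77920836, 0.77377905],
    "test_recall": [0.55552217, 0.5168151, 0.61242258, 0.53978499, 0.52204582],
}

rf2_summary = summarize_cv(rf2_x1)
lsvc1_summary = summarize_cv(lsvc1_x1)

for label, summary in [("pipe_rf2", rf2_summary), ("pipe_LSVC1", lsvc1_summary)]:
    for metric, (avg, spread) in summary.items():
        print(f"{label:10s} {metric:12s} mean={avg:.4f} std={spread:.4f}")
```

A large spread, as in the `pipe_LSVC1` accuracy, means the model behaves very differently from fold to fold, so its mean alone is not a reliable basis for comparison.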

### Grid Searching the most interesting pipelines per model

In [189]:
kn_params = {}
kn_params['kneighborsclassifier__n_neighbors'] = [50, 70]
kn_params['kneighborsclassifier__weights'] = ['distance']
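# 'precomputed' expects a square distance matrix instead of a feature matrix, so it cannot be fitted by this pipeline; only 'minkowski' is searched
kn_params['kneighborsclassifier__metric'] = ['minkowski']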


In [190]:
grid_kn2 = GridSearchCV(pipe_kn2, kn_params, cv=5, scoring='recall')

In [191]:
grid_kn2.fit(X1, y1)

In [192]:
grid_kn2.best_score_, grid_kn2.best_params_

(0.7478738938240093,
 {'kneighborsclassifier__metric': 'minkowski',
  'kneighborsclassifier__n_neighbors': 50,
  'kneighborsclassifier__weights': 'distance'})

In [123]:
dt_params = {}
dt_params['decisiontreeclassifier__criterion'] = ['gini', 'entropy']
dt_params['decisiontreeclassifier__max_depth'] = [20, None]

In [119]:
grid_dt2 = GridSearchCV(pipe_dt2, dt_params, cv=5, scoring='recall')

In [124]:
grid_dt2.fit(X1, y1)

In [129]:
grid_dt2.best_score_, grid_dt2.best_params_

(0.7253686190745855,
 {'decisiontreeclassifier__criterion': 'gini',
  'decisiontreeclassifier__max_depth': 20})

In [104]:
rf_params = {}
rf_params['randomforestclassifier__n_estimators'] = [50, 100, 150]
rf_params['randomforestclassifier__max_depth'] = [20, None]

In [105]:
grid_rf2 = GridSearchCV(pipe_rf2, rf_params, cv=5, scoring='recall')

In [106]:
grid_rf2.fit(X1, y1)

In [107]:
grid_rf2.best_score_, grid_rf2.best_params_

(0.8040529011230844,
 {'randomforestclassifier__max_depth': None,
  'randomforestclassifier__n_estimators': 150})

### - Let's further explore the most promising grid search: RandomForestClassifier

In [102]:
rf_params2 = {}
rf_params2['randomforestclassifier__n_estimators'] = [80, 100, 120]

In [104]:
grid_rf22 = GridSearchCV(pipe_rf2, rf_params2, cv=5, scoring='recall')

In [105]:
grid_rf22.fit(X1, y1)

In [108]:
grid_rf22.best_score_, grid_rf22.best_params_

(0.7581591516932689, {'randomforestclassifier__n_estimators': 120})

In [104]:
rf_params3 = {}
rf_params3['randomforestclassifier__n_estimators'] = [120,150]
rf_params3['randomforestclassifier__min_samples_split'] = [2]
rf_params3['randomforestclassifier__min_samples_leaf'] = [2]
rf_params3['randomforestclassifier__bootstrap'] = [True, False]

In [105]:
grid_rf23 = GridSearchCV(pipe_rf2, rf_params3, cv=5, scoring='recall')

In [106]:
grid_rf23.fit(X1, y1)

In [107]:
grid_rf23.best_score_, grid_rf23.best_params_

(0.8421829435219902,
 {'randomforestclassifier__bootstrap': False,
  'randomforestclassifier__min_samples_leaf': 2,
  'randomforestclassifier__min_samples_split': 2,
  'randomforestclassifier__n_estimators': 120})

## ---------------------- G E N E R A T I N G --- T H E --- M O D E L ----------------------

In [108]:
real_test = pd.read_csv('datasets/properties_colombia_test.csv')
real_test1 = preprocess1(real_test)
real_test2 = preprocess2(real_test)
#z = pipe_kn1.fit(X3,y3).predict(real_test)

In [109]:
# Final predictions with the random forest parameters selected by the last grid search
final_model = RandomForestClassifier(n_estimators=120,min_samples_leaf=2,min_samples_split=2,bootstrap=False)
X_train = preprocess2(X1)
final_model.fit(X_train, y1)
final_preds = final_model.predict(real_test2)
print(final_preds)

[1 1 0 ... 0 0 0]


In [110]:
final_preds.shape

(65850,)

In [111]:
preds = pd.DataFrame({'pred':final_preds})
preds.head(10)

Unnamed: 0,pred
0,1
1,1
2,0
3,1
4,1
5,1
6,0
7,0
8,0
9,0


In [112]:
preds.to_csv(path_or_buf='acidminded95.csv', index=False)

## -------------------------- One last attempt D:

In [None]:
'''
Our first attempt was a hand-picked list of words that we expected to be common in the descriptions of
expensive or cheap houses. After a few trials we concluded that the presence of these words in the description
had less than 1% correlation with the target variable, so we searched for a more comprehensive list.
'''
'''
description_expensive = ['excelente', 'exclusivo', 'exclusiva', 'club', 'mejor', 'lujo', 'playa', 'piscina', 'jacuzzi', 'terraza', 'campestre', 'condominio']
description_cheap = ['negociable', 'economico', 'economica', 'sencillo', 'sencilla']
'''

In [109]:
# Characters, HTML tags and a few very common words that are replaced by a space.
# Note: every entry is removed as a substring, so 'piso' also cuts into 'pisos'.
chars_to_ignore = ['!','?','@','#','$','%','^','&','*','(',')','-','_','=','+',',','.','1','2','3','4','5','6','7','8','9','0',':',';','/','|','\n','<b>','<br>','<br >', '<br  >','<br />', 'para','como', 'cuenta', 'baño', 'piso']
def get_words(text):
    # The parameter is named 'text' so that the built-in 'str' is not shadowed
    sent = text.lower().strip()
    for char in chars_to_ignore:
        sent = sent.replace(char,' ')
    # Keep only the words with at least 4 characters
    return [word for word in sent.split() if len(word) >= 4]
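
Because `get_words` removes the common words ('para', 'como', 'cuenta', 'baño', 'piso') with `str.replace`, they are also cut out of longer words: 'pisos' becomes 's' and 'parada' becomes 'da', and both leftovers are then dropped by the length filter. The sketch below is one possible token-based alternative, not the approach used in this notebook. `get_words_tokenized`, `STOP_WORDS`, `TAG_RE` and `WORD_RE` are names introduced here for illustration only.

```python
import re

# Whole-word stop list, taken from the word entries of chars_to_ignore
STOP_WORDS = {"para", "como", "cuenta", "baño", "piso"}
TAG_RE = re.compile(r"<[^>]*>")           # crude HTML tag matcher (assumption: tags are well formed)
WORD_RE = re.compile(r"[a-záéíóúüñ]+")    # runs of lowercase Spanish letters

def get_words_tokenized(text, min_len=4):
    """Lowercase, strip HTML tags, split into words, then filter by length and stop list."""
    cleaned = TAG_RE.sub(" ", text.lower())
    return [
        word
        for word in WORD_RE.findall(cleaned)
        if len(word) >= min_len and word not in STOP_WORDS
    ]

sample_words = get_words_tokenized("Apartamento para la venta,<br>2 pisos con baño!")
print(sample_words)
```

Here 'para' and 'baño' are dropped as whole words while 'pisos' survives. The results would differ from the word lists printed below, so the downstream counts would need to be recomputed.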

In [110]:
# flc: feels like cheating... But we're only extracting the most used words in the descriptions of the properties per target category.

flc = pd.DataFrame({'description':X1['description'], 'target':y1})
flc.reset_index(inplace=True, drop=True)
flc['description'] = fill_description(flc)

In [111]:
expensive_words = {}
cheap_words = {}

for x in range(len(flc)):
    dsc = flc.loc[x,'description']
    tar = flc.loc[x,'target']
    
    dsc_list_big = get_words(dsc)
    
    for word in dsc_list_big:
        if tar == 1:
            if word in expensive_words.keys():
                expensive_words[word] += 1
            else:
                expensive_words[word] = 1
        elif tar == 0:
            if word in cheap_words.keys():
                cheap_words[word] += 1
            else:
                cheap_words[word] = 1


expensive_sorted = sorted(expensive_words.items(), key=lambda x:x[1])
cheap_sorted = sorted(cheap_words.items(), key=lambda x:x[1])
print('Expensive top len: ', len(expensive_sorted))
print('Cheap top len: ', len(cheap_sorted))
amount_ew = -800 #round(0-(len(expensive_sorted)/10))
amount_cw = -800 #round(0-(len(cheap_sorted)/10))

ex_top_sorted = expensive_sorted[amount_ew:]
exp = []
for x in ex_top_sorted:
    exp.append(x[0])

ch_top_sorted = cheap_sorted[amount_cw:]
chp = []
for x in ch_top_sorted:
    chp.append(x[0])

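# Drop the words that appear in both top lists.
# Caveat: removing from exp while iterating over it skips the element that follows
# each removal, so some shared words can remain in both lists.
for x in exp:
    if x in chp:
        exp.remove(x)
        chp.remove(x)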

print(f'\n\nMOST EXPENSIVE WORDS ({len(exp)})')
print(exp[-100:])
print(f'\n\nMOST CHEAP WORDS ({len(chp)})')
print(chp[-100:])


Expensive top len:  27499
Cheap top len:  38993


MOST EXPENSIVE WORDS (458)
['buen', 'rutas', 'circuito', 'salon', 'buenas', 'todos', 'alta', 'todo', 'construcción', 'vivir', 'biblioteca', 'vehículos', 'oacute', 'lavandería', 'frente', 'deben', 'manos', 'poblado', 'expertos', 'ampliarte', 'todas', 'iluminación', 'terreno', 'portería', 'altura', 'puerta', 'cubiertos', 'ascensor', 'aire', 'ntilde', 'millones', 'útil', 'exterior', 'encuentra', 'infantil', 'garajes', 'vías', 'interno', 'vendo', 'este', 'solo', 'supermercados', 'calle', 'valor', 'gran', 'excelentes', 'local', 'restaurantes', 'residencial', 'turco', 'hall', 'espacio', 'total', 'natural', 'información', 'construida', 'abierta', 'cerrada', 'estar', 'juegos', 'cancha', 'primer', 'madera', 'segundo', 'area', 'acabados', 'horas', 'vende', 'habitación', 'amplios', 'finca', 'consta', 'garaje', 'parque', 'unidad', 'patio', 'edificio', 'amplio', 'comercial', 'ubicado', 'nivel', 'ropas', 'closet', 'espacios', 'tres', 'vestier', 'parq
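
The loop that drops shared words calls `exp.remove(x)` while iterating over `exp`. Each removal shifts the list, so the element right after a removed word is never checked, and some shared words can stay in both lists. The sketch below reproduces the effect on a toy example and shows one set-based way to avoid it. `top_exclusive_words` is a hypothetical helper and the toy counts are made up for illustration. It returns words from most to least frequent, whereas the lists above are sorted in ascending order.

```python
from collections import Counter

# Toy reproduction of the in-place removal: 'b' is shared but survives in both lists
toy_exp = ["a", "b", "c"]
toy_chp = ["a", "b", "x"]
for x in toy_exp:
    if x in toy_chp:
        toy_exp.remove(x)
        toy_chp.remove(x)

def top_exclusive_words(expensive_counts, cheap_counts, top_n=800):
    """Return the top_n words of each class, minus every word present in both top lists."""
    top_exp = [word for word, _ in expensive_counts.most_common(top_n)]
    top_chp = [word for word, _ in cheap_counts.most_common(top_n)]
    shared = set(top_exp) & set(top_chp)
    exp_only = [word for word in top_exp if word not in shared]
    chp_only = [word for word in top_chp if word not in shared]
    return exp_only, chp_only

# Made-up counts, standing in for expensive_words and cheap_words
exp_only, chp_only = top_exclusive_words(
    Counter({"a": 3, "b": 2, "c": 1}),
    Counter({"a": 5, "b": 4, "x": 1}),
    top_n=3,
)
print(toy_exp, toy_chp)    # shared word 'b' is still present
print(exp_only, chp_only)  # only the exclusive words remain
```

`Counter` can also replace the manual dictionary counting, for example `Counter(word for words in token_lists for word in words)`, where `token_lists` is an assumed iterable of `get_words` outputs.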

In [234]:
def description_interpreter(df):
    dsc_ints = []
    for x in range(len(df)):
        dsc = df.loc[x,'description']
        
        dsc_list = get_words(dsc)
        dsc_in = 0
        
        for word in dsc_list:
            if word in exp:
                dsc_in += 3
            elif word in chp:
                dsc_in -= 15
            else:
                pass
        dsc_ints.append(dsc_in)
    return pd.Series(dsc_ints)

In [174]:
def description_interpreter2(df):
    dsc_ints = []
    for x in range(len(df)):
        dsc = df.loc[x,'description']
        
        dsc_list = get_words(dsc)
        dsc_ew = 0
        dsc_cw = 0
        
        for word in dsc_list:
            if word in exp:
                dsc_ew += 1
            elif word in chp:
                dsc_cw += 16
            else:
                pass
        if dsc_ew > dsc_cw:
            dsc_ints.append(1)
        elif dsc_ew < dsc_cw:
            dsc_ints.append(0)
        else:
            dsc_ints.append(1)
    return pd.Series(dsc_ints)
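
The decision rule inside `description_interpreter2` can be isolated for a single description, which makes the effect of the weights easier to see. This is a minimal sketch: `classify_words` is a hypothetical helper, and the two small word sets only stand in for the much longer `exp` and `chp` lists. Using sets also makes each membership test constant-time instead of a scan over a list.

```python
def classify_words(words, expensive, cheap, cheap_weight=16):
    """Return 1 (expensive) or 0 (cheap) for one tokenised description.

    Same rule as description_interpreter2: each expensive word adds 1,
    each cheap word adds cheap_weight, and ties are labelled 1.
    """
    expensive_hits = sum(1 for word in words if word in expensive)
    # As with the original elif, a word is only checked against the cheap
    # set when it is not in the expensive set
    cheap_hits = sum(
        cheap_weight for word in words if word not in expensive and word in cheap
    )
    return 0 if expensive_hits < cheap_hits else 1

# Illustrative word sets (stand-ins for exp and chp)
expensive = {"lujo", "piscina", "jacuzzi"}
cheap = {"negociable", "economico"}

only_expensive = classify_words(["lujo", "piscina"], expensive, cheap)
mixed = classify_words(["lujo", "piscina", "negociable"], expensive, cheap)
no_hits = classify_words(["casa", "grande"], expensive, cheap)
print(only_expensive, mixed, no_hits)
```

With a weight of 16, a single cheap word outweighs up to 15 expensive ones, and a description with no listed words at all is labelled 1 because ties fall into the expensive class.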

In [175]:
trials = pd.DataFrame(X1['description'][:1000])
trials.reset_index(inplace=True, drop=True)
trials['description'] = fill_description(trials)
#trials.head()
trials['int'] = description_interpreter2(trials)
#trials.head(20)

In [176]:
trials.int.value_counts()

1    612
0    388
Name: int, dtype: int64

In [177]:
trials['target'] = y1[:1000]

In [178]:
trials['int_mm'] = min_max_scaler.transform(trials[['int']])
trials['int_std'] = StandardScaler().fit_transform(trials[['int']])
#trials['int_std'] = std_scaler.transform(trials[['int']])

In [179]:
trials.corr()

Unnamed: 0,int,target,int_mm,int_std
int,1.0,0.229601,1.0,1.0
target,0.229601,1.0,0.229601,0.229601
int_mm,1.0,0.229601,1.0,1.0
int_std,1.0,0.229601,1.0,1.0


In [180]:
def preprocess_std_dimred3(matrix):
    
    X =  np.copy(matrix)
    X[:,:1] = MinMaxScaler().fit_transform(matrix[:,:1])
    X[:,1:] = StandardScaler().fit_transform(matrix[:,1:])
    
    return X

In [181]:
def preprocess3(df):
    df_ok = df.copy()
    df_ok2 = fill_nan(df_ok, description=True)
    df_ok3 = df_ok2[['bathrooms', 'lat','lon', 'property_type', 'description']]
    df_ok3['description'] = description_interpreter2(df_ok3)
    df_ok4 = preprocess_to_num(df_ok3, l2=False, l3=False, description=True)
    df_ok5 = preprocess_std_dimred3(df_ok4)

    return df_ok5

In [182]:
X_fa = preprocess3(X1)

In [183]:
X_fa.shape

(193395, 12)

In [184]:
func_pp3 = FunctionTransformer(preprocess3)

In [185]:
pipe_rf3 = make_pipeline(func_pp3, RFclf)

In [186]:
cross_validate(pipe_rf3, X1, y1, cv=5, scoring=scoring)

{'fit_time': array([88.59920835, 83.58234382, 87.65987062, 81.07367492, 80.16410136]),
 'score_time': array([17.17124271, 17.97142291, 17.16933393, 18.09267282, 17.2541666 ]),
 'test_acc': array([0.819773  , 0.83076088, 0.81483492, 0.81920422, 0.81599835]),
 'test_recall': array([0.72317457, 0.75692586, 0.74251874, 0.70760062, 0.74016004])}

In [187]:
cross_validate(pipe_rf3, X3, y3, cv=5, scoring=scoring)

{'fit_time': array([44.2829988 , 43.11424708, 46.21974206, 43.49551773, 48.17798829]),
 'score_time': array([9.1252501 , 8.89193702, 9.7481122 , 9.46472931, 9.66995454]),
 'test_acc': array([0.78857297, 0.81035799, 0.79751157, 0.80585752, 0.80154523]),
 'test_recall': array([0.74825578, 0.72877742, 0.73261838, 0.73031215, 0.69700394])}

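Neither of the two versions of the 'description_interpreter' function yielded the desired results: with pipe_rf3, recall on X1 ranged from about 0.71 to 0.76 across folds, well below the roughly 0.87 obtained with pipe_rf2.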