# **Diplomatura en Ciencia de Datos, Aprendizaje Automático y sus Aplicaciones**

## **2023 Edition**


----

# Graded assignment - part 2


In exercise 1 of part 1 of the assignment, you selected the rows and columns relevant to the problem of predicting property prices. You also had to reduce the number of possible values of the categorical variables using domain knowledge.

In exercise 2 of part 1, you imputed the missing values of the `Suburb` column and of the columns obtained from the `airbnb` dataset.

In this notebook, **we will use the result of those operations.**


In [71]:
import matplotlib.pyplot as plt
import numpy
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import OneHotEncoder
import seaborn
seaborn.set_context('talk')


In [72]:
# Read the dataset you already have here.
melb_df = pd.read_csv(
    'https://raw.githubusercontent.com/bonafepedro/exploratory_analisis_datacuration/master/data/melbourne_and_airbnb.csv')
melb_df[:]

Unnamed: 0.1,Unnamed: 0,Suburb,Rooms,Type,Price,Distance,Postcode,Bedroom2,Bathroom,Car,YearBuilt,CouncilArea,Regionname,Lattitude,Longtitude,BuildingArea,zipcode_int,airbnb_price_mean,airbnb_record_count
0,0,Abbotsford,2,h,1480000.0,2.5,3067.0,2.0,1.0,1.0,,Yarra,Northern Metropolitan,-37.79960,144.99840,,3067.0,130.624031,258.0
1,1,Abbotsford,2,h,1035000.0,2.5,3067.0,2.0,1.0,0.0,1900.0,Yarra,Northern Metropolitan,-37.80790,144.99340,79.0,3067.0,130.624031,258.0
2,2,Abbotsford,3,h,1465000.0,2.5,3067.0,3.0,2.0,0.0,1900.0,Yarra,Northern Metropolitan,-37.80930,144.99440,150.0,3067.0,130.624031,258.0
3,3,Abbotsford,3,h,850000.0,2.5,3067.0,3.0,2.0,1.0,,Yarra,Northern Metropolitan,-37.79690,144.99690,,3067.0,130.624031,258.0
4,4,Abbotsford,4,h,1600000.0,2.5,3067.0,3.0,1.0,2.0,2014.0,Yarra,Northern Metropolitan,-37.80720,144.99410,142.0,3067.0,130.624031,258.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13367,13367,Wheelers Hill,4,h,1245000.0,16.7,3150.0,4.0,2.0,2.0,1981.0,,South-Eastern Metropolitan,-37.90562,145.16761,,3150.0,124.026455,189.0
13368,13368,Williamstown,3,h,1031000.0,6.8,3016.0,3.0,2.0,2.0,1995.0,,Western Metropolitan,-37.85927,144.87904,133.0,3016.0,191.094595,74.0
13369,13369,Williamstown,3,h,1170000.0,6.8,3016.0,3.0,2.0,4.0,1997.0,,Western Metropolitan,-37.85274,144.88738,,3016.0,191.094595,74.0
13370,13370,Williamstown,4,h,2500000.0,6.8,3016.0,4.0,1.0,5.0,1920.0,,Western Metropolitan,-37.85908,144.89299,157.0,3016.0,191.094595,74.0


In [73]:
melb_df.columns

Index(['Unnamed: 0', 'Suburb', 'Rooms', 'Type', 'Price', 'Distance',
       'Postcode', 'Bedroom2', 'Bathroom', 'Car', 'YearBuilt', 'CouncilArea',
       'Regionname', 'Lattitude', 'Longtitude', 'BuildingArea', 'zipcode_int',
       'airbnb_price_mean', 'airbnb_record_count'],
      dtype='object')

## Exercise 1: Encoding

1. Select all rows and columns of the dataset obtained in part 1 of the assignment, **except** `BuildingArea` and `YearBuilt`, which we will impute again later.

2. Apply one-hot encoding to each row, for both numerical and categorical variables. If you consider it necessary, you may reduce the number of unique categories again.

Some options:
  1. Use `OneHotEncoder` with the `categories` parameter for the categorical variables, then use `numpy.hstack` to concatenate the result with the numerical variables.
  2. `DictVectorizer` with some preprocessing steps beforehand.

Also remember that the `pandas.DataFrame.values` attribute gives access to the numpy matrix underlying a DataFrame.
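Option 1 above can be sketched roughly as follows. This is a minimal illustration on a made-up toy frame (`toy_df` and its values are assumptions, not the real dataset), and it densifies the encoder output with `.toarray()` only because the example is tiny.

```python
import numpy
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame standing in for melb_df (values are made up for illustration).
toy_df = pd.DataFrame({
    'Type': ['h', 'u', 't', 'h'],
    'Regionname': ['Northern Metropolitan', 'Western Metropolitan',
                   'Northern Metropolitan', 'Southern Metropolitan'],
    'Rooms': [2, 3, 1, 4],
    'Distance': [2.5, 6.8, 2.5, 16.7],
})
categorical_cols = ['Type', 'Regionname']
numerical_cols = ['Rooms', 'Distance']

# Fix the category order explicitly through the `categories` parameter.
categories = [sorted(toy_df[col].unique()) for col in categorical_cols]
encoder = OneHotEncoder(categories=categories, handle_unknown='ignore')

# fit_transform returns a sparse matrix by default; densify this small example.
X_cat = encoder.fit_transform(toy_df[categorical_cols]).toarray()

# Concatenate the one-hot block with the untouched numerical columns.
X = numpy.hstack([X_cat, toy_df[numerical_cols].values])
print(X.shape)
```

Passing `categories` explicitly keeps the column order stable, which makes it easier to recover column names later.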


In [74]:
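# NOTE: melb_df1 is only created further down (cells In [77] and In [79]);
# this cell relies on those having been run beforehand.
columnas_object = melb_df1.select_dtypes(include='object').columns
numerical_cols = melb_df1.select_dtypes(include=['int64', 'float64']).columns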

In [75]:
melb_df[columnas_object].nunique()

Type            3
CouncilArea    33
Regionname      8
dtype: int64

In [76]:
# Check for nulls
melb_df[columnas_object].isna().sum()

Type              0
CouncilArea    1348
Regionname        0
dtype: int64

I select the columns, excluding `BuildingArea` and `YearBuilt` as the exercise indicates.
`CouncilArea` has 1348 null values and 33 distinct values: what do you think about excluding it as well?

In [77]:
col_excluir = ['BuildingArea','YearBuilt']
melb_df1 = melb_df.loc[:, ~melb_df.columns.isin(col_excluir)].copy()
melb_df1

Unnamed: 0.1,Unnamed: 0,Suburb,Rooms,Type,Price,Distance,Postcode,Bedroom2,Bathroom,Car,CouncilArea,Regionname,Lattitude,Longtitude,zipcode_int,airbnb_price_mean,airbnb_record_count
0,0,Abbotsford,2,h,1480000.0,2.5,3067.0,2.0,1.0,1.0,Yarra,Northern Metropolitan,-37.79960,144.99840,3067.0,130.624031,258.0
1,1,Abbotsford,2,h,1035000.0,2.5,3067.0,2.0,1.0,0.0,Yarra,Northern Metropolitan,-37.80790,144.99340,3067.0,130.624031,258.0
2,2,Abbotsford,3,h,1465000.0,2.5,3067.0,3.0,2.0,0.0,Yarra,Northern Metropolitan,-37.80930,144.99440,3067.0,130.624031,258.0
3,3,Abbotsford,3,h,850000.0,2.5,3067.0,3.0,2.0,1.0,Yarra,Northern Metropolitan,-37.79690,144.99690,3067.0,130.624031,258.0
4,4,Abbotsford,4,h,1600000.0,2.5,3067.0,3.0,1.0,2.0,Yarra,Northern Metropolitan,-37.80720,144.99410,3067.0,130.624031,258.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13367,13367,Wheelers Hill,4,h,1245000.0,16.7,3150.0,4.0,2.0,2.0,,South-Eastern Metropolitan,-37.90562,145.16761,3150.0,124.026455,189.0
13368,13368,Williamstown,3,h,1031000.0,6.8,3016.0,3.0,2.0,2.0,,Western Metropolitan,-37.85927,144.87904,3016.0,191.094595,74.0
13369,13369,Williamstown,3,h,1170000.0,6.8,3016.0,3.0,2.0,4.0,,Western Metropolitan,-37.85274,144.88738,3016.0,191.094595,74.0
13370,13370,Williamstown,4,h,2500000.0,6.8,3016.0,4.0,1.0,5.0,,Western Metropolitan,-37.85908,144.89299,3016.0,191.094595,74.0


Applying the encoding. First, the distinct values of each categorical column:

In [78]:
for columna in columnas_object:
    valores_distintos = melb_df1[columna].value_counts()
    print(f"Distinct values in column {columna}:")
    print(valores_distintos)
    print()

Distinct values in column Type:
Type
h    9244
u    3016
t    1112
Name: count, dtype: int64

Distinct values in column CouncilArea:
CouncilArea
Moreland             1163
Boroondara           1079
Moonee Valley         996
Darebin               932
Glen Eira             848
Maribyrnong           692
Stonnington           679
Yarra                 640
Port Phillip          614
Banyule               592
Melbourne             466
Bayside               459
Hobsons Bay           432
Brimbank              424
Monash                331
Manningham            311
Whitehorse            302
Kingston              207
Whittlesea            167
Hume                  164
Wyndham                86
Knox                   80
Maroondah              80
Melton                 66
Frankston              53
Greater Dandenong      52
Casey                  38
Nillumbik              36
Yarra Ranges           18
Cardinia                8
Macedon Ranges          7
Unavailable             1
Moorabool  

We see that the `Suburb` column has 314 distinct values, so we leave it out of this analysis: one-hot encoding it would add one column per distinct value and greatly increase memory use. Although, as we saw in class, the result would be stored as a sparse matrix, the imputation models used later would run into trouble because of the amount of memory required.
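An alternative to dropping a high-cardinality column altogether is to keep only its most frequent values and group the rest, as the exercise statement allows. The sketch below is illustrative only: the helper `keep_top_categories`, the label `'Other'`, and the sample values are assumptions, not part of the original notebook.

```python
import pandas as pd

def keep_top_categories(series, k, other_label='Other'):
    """Keep the k most frequent values of `series`; replace the rest with `other_label`.

    Missing values are not among the top values, so they also become `other_label`.
    """
    top_values = series.value_counts().nlargest(k).index
    return series.where(series.isin(top_values), other_label)

# Made-up sample standing in for the Suburb column.
suburbs = pd.Series(['Abbotsford', 'Abbotsford', 'Richmond', 'Richmond',
                     'Richmond', 'Williamstown', 'Kew'])
reduced = keep_top_categories(suburbs, k=2)
print(reduced.value_counts())
```

With this approach the one-hot matrix gains at most `k + 1` columns for that variable, at the cost of merging rare categories.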

In [79]:
col_excluir = ['Suburb']
melb_df1 = melb_df1.loc[:, ~melb_df1.columns.isin(col_excluir)].copy()

In [80]:
melb_df1.columns

Index(['Unnamed: 0', 'Rooms', 'Type', 'Price', 'Distance', 'Postcode',
       'Bedroom2', 'Bathroom', 'Car', 'CouncilArea', 'Regionname', 'Lattitude',
       'Longtitude', 'zipcode_int', 'airbnb_price_mean',
       'airbnb_record_count'],
      dtype='object')

In [81]:
feature_cols = ['Type', 'CouncilArea', 'Regionname']
feature_dict = list(melb_df1[feature_cols].T.to_dict().values())
#feature_dict[:100]

In [82]:
vec = DictVectorizer()
feature_matrix = vec.fit_transform(feature_dict)


In [83]:
#feature_matrix
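# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vec.get_feature_names_out()[:10].tolist()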

['CouncilArea',
 'CouncilArea=Banyule',
 'CouncilArea=Bayside',
 'CouncilArea=Boroondara',
 'CouncilArea=Brimbank',
 'CouncilArea=Cardinia',
 'CouncilArea=Casey',
 'CouncilArea=Darebin',
 'CouncilArea=Frankston',
 'CouncilArea=Glen Eira']

In [84]:
# Before doing this type of conversion, it's mandatory to calculate the
# size of the resulting matrix!
matrix_size_mb = feature_matrix.shape[0] * feature_matrix.shape[1] * 4 / 1024 / 1024
print("The dense matrix will weigh approximately {:.2f} MB".format(matrix_size_mb))

limit_size_mb = 10
precision_type = numpy.float32
if matrix_size_mb < limit_size_mb:  # Matrix is less than 10MB
  dense_feature_matrix = feature_matrix.astype(precision_type).todense()
else:
  # We calculate how many rows would fit given the number of columns
  n_rows = int(limit_size_mb *1024 * 1024 / 4 / feature_matrix.shape[1])
  print("Matrix too big! Using only first {} of {} rows".format(
      n_rows, feature_matrix.shape[0]))
  dense_feature_matrix = feature_matrix[:n_rows].astype(precision_type).todense()

print("Final size: {:.2f} MB".format(dense_feature_matrix.nbytes / 1024 / 1024))

The dense matrix will weigh approximately 2.30 MB
Final size: 2.30 MB


Because the `Suburb` column was removed, the dense matrix takes only 2.3 MB. We are clearly losing information that is probably relevant, but for the practical purposes of this exercise we consider the selection acceptable.

In [85]:
dense_feature_matrix

matrix([[ 0.,  0.,  0., ...,  1.,  0.,  0.],
        [ 0.,  0.,  0., ...,  1.,  0.,  0.],
        [ 0.,  0.,  0., ...,  1.,  0.,  0.],
        ...,
        [nan,  0.,  0., ...,  1.,  0.,  0.],
        [nan,  0.,  0., ...,  1.,  0.,  0.],
        [nan,  0.,  0., ...,  1.,  0.,  0.]], dtype=float32)
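The `nan` entries and the standalone `CouncilArea` feature appear because `DictVectorizer` one-hot encodes string values but treats numeric values, including a float `NaN`, as a plain numeric feature. One possible way to avoid this, sketched here on a made-up frame, is to replace missing values with an explicit string label before vectorizing (the label `'Missing'` and the toy data are assumptions).

```python
import numpy
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Toy frame with one missing categorical value (made up for illustration).
toy_df = pd.DataFrame({
    'Type': ['h', 'u', 'h'],
    'CouncilArea': ['Yarra', numpy.nan, 'Moreland'],
})

# Replace the missing value with an explicit string category.
filled = toy_df.fillna({'CouncilArea': 'Missing'})

vec = DictVectorizer()
feature_matrix = vec.fit_transform(filled.to_dict(orient='records'))

# vocabulary_ maps each feature name to its column index.
feature_names = sorted(vec.vocabulary_)
print(feature_names)
```

Missing values then get their own indicator column, and the resulting matrix contains no `NaN`.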

Now I encode and keep the encoded DataFrame.

In [86]:
# Build a subset with only the categorical columns
df_categorico = melb_df1[feature_cols]

# Create a OneHotEncoder instance that returns a dense array
# (the `sparse` argument was renamed `sparse_output` in scikit-learn 1.2)
encoder = OneHotEncoder(sparse=False)

# Encode the categorical columns
datos_codificados = encoder.fit_transform(df_categorico.values)

# Get the names of the encoded columns
# (get_feature_names_out replaces get_feature_names, removed in scikit-learn 1.2)
nombres_columnas_codificadas = encoder.get_feature_names_out(feature_cols)

# Build a new DataFrame with the encoded features
df_codificado = pd.DataFrame(datos_codificados, columns=nombres_columnas_codificadas)

#print(df_codificado)

In [87]:
df_final = pd.concat([melb_df1, df_codificado], axis=1)
df_final

Unnamed: 0.1,Unnamed: 0,Rooms,Type,Price,Distance,Postcode,Bedroom2,Bathroom,Car,CouncilArea,...,CouncilArea_Yarra Ranges,CouncilArea_nan,Regionname_Eastern Metropolitan,Regionname_Eastern Victoria,Regionname_Northern Metropolitan,Regionname_Northern Victoria,Regionname_South-Eastern Metropolitan,Regionname_Southern Metropolitan,Regionname_Western Metropolitan,Regionname_Western Victoria
0,0,2,h,1480000.0,2.5,3067.0,2.0,1.0,1.0,Yarra,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,1,2,h,1035000.0,2.5,3067.0,2.0,1.0,0.0,Yarra,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,2,3,h,1465000.0,2.5,3067.0,3.0,2.0,0.0,Yarra,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,3,3,h,850000.0,2.5,3067.0,3.0,2.0,1.0,Yarra,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,4,4,h,1600000.0,2.5,3067.0,3.0,1.0,2.0,Yarra,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13367,13367,4,h,1245000.0,16.7,3150.0,4.0,2.0,2.0,,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
13368,13368,3,h,1031000.0,6.8,3016.0,3.0,2.0,2.0,,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
13369,13369,3,h,1170000.0,6.8,3016.0,3.0,2.0,4.0,,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
13370,13370,4,h,2500000.0,6.8,3016.0,4.0,1.0,5.0,,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


## Exercise 2: KNN imputation

The lectures presented the `IterativeImputer` method for imputing missing values in numerical variables. However, the examples shown only used some of the numerical variables in the dataset. In this exercise, we will use the encoded data matrix to impute missing data more accurately.

1. Add the `YearBuilt` and `BuildingArea` columns to the matrix obtained in the previous exercise.
2. Apply an `IterativeImputer` instance with a `KNeighborsRegressor` estimator to impute the values of these variables. Is it necessary to standardize or scale the data beforehand?
3. Plot the distribution of each variable before imputation and with both imputation methods.
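On the scaling question: `KNeighborsRegressor` relies on distances between rows, so features with large ranges tend to dominate unless the data are scaled first. The following is a minimal sketch on synthetic data (the array contents, the column meanings, and the choice of `MinMaxScaler` are assumptions) of scaling, imputing, and mapping back to the original units.

```python
import numpy
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler

rng = numpy.random.default_rng(0)

# Synthetic stand-in: two fully observed columns plus one column to impute.
X = numpy.column_stack([
    rng.integers(1, 6, size=200),    # e.g. number of rooms
    rng.uniform(1, 40, size=200),    # e.g. distance
    rng.uniform(50, 300, size=200),  # e.g. building area
]).astype(float)
X_missing = X.copy()
missing_rows = rng.choice(200, size=60, replace=False)
X_missing[missing_rows, 2] = numpy.nan

# MinMaxScaler ignores NaNs when fitting and keeps them when transforming.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_missing)

# Impute in the scaled space so every feature weighs similarly in the distance.
imputer = IterativeImputer(estimator=KNeighborsRegressor(n_neighbors=5),
                           random_state=0)
X_imputed_scaled = imputer.fit_transform(X_scaled)

# Map the result back to the original units.
X_imputed = scaler.inverse_transform(X_imputed_scaled)
```

Observed entries are left unchanged by the imputer; only the previously missing cells receive estimated values.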

In [91]:
col_imputar = ['BuildingArea','YearBuilt']
df_imputar = melb_df.loc[:, melb_df.columns.isin(col_imputar)].copy()

df_final = pd.concat([df_final, df_imputar], axis=1)
df_final.columns

Index(['Unnamed: 0', 'Rooms', 'Type', 'Price', 'Distance', 'Postcode',
       'Bedroom2', 'Bathroom', 'Car', 'CouncilArea', 'Regionname', 'Lattitude',
       'Longtitude', 'zipcode_int', 'airbnb_price_mean', 'airbnb_record_count',
       'Type_h', 'Type_t', 'Type_u', 'CouncilArea_Banyule',
       'CouncilArea_Bayside', 'CouncilArea_Boroondara', 'CouncilArea_Brimbank',
       'CouncilArea_Cardinia', 'CouncilArea_Casey', 'CouncilArea_Darebin',
       'CouncilArea_Frankston', 'CouncilArea_Glen Eira',
       'CouncilArea_Greater Dandenong', 'CouncilArea_Hobsons Bay',
       'CouncilArea_Hume', 'CouncilArea_Kingston', 'CouncilArea_Knox',
       'CouncilArea_Macedon Ranges', 'CouncilArea_Manningham',
       'CouncilArea_Maribyrnong', 'CouncilArea_Maroondah',
       'CouncilArea_Melbourne', 'CouncilArea_Melton', 'CouncilArea_Monash',
       'CouncilArea_Moonee Valley', 'CouncilArea_Moorabool',
       'CouncilArea_Moreland', 'CouncilArea_Nillumbik',
       'CouncilArea_Port Phillip', 'CouncilA

In [92]:
df_final[col_imputar].isna().sum()

BuildingArea    6374
YearBuilt       5308
dtype: int64

In [96]:
df_final

Unnamed: 0.1,Unnamed: 0,Rooms,Type,Price,Distance,Postcode,Bedroom2,Bathroom,Car,CouncilArea,...,Regionname_Eastern Metropolitan,Regionname_Eastern Victoria,Regionname_Northern Metropolitan,Regionname_Northern Victoria,Regionname_South-Eastern Metropolitan,Regionname_Southern Metropolitan,Regionname_Western Metropolitan,Regionname_Western Victoria,YearBuilt,BuildingArea
0,0,2,h,1480000.0,2.5,3067.0,2.0,1.0,1.0,Yarra,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,
1,1,2,h,1035000.0,2.5,3067.0,2.0,1.0,0.0,Yarra,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1900.0,79.0
2,2,3,h,1465000.0,2.5,3067.0,3.0,2.0,0.0,Yarra,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1900.0,150.0
3,3,3,h,850000.0,2.5,3067.0,3.0,2.0,1.0,Yarra,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,
4,4,4,h,1600000.0,2.5,3067.0,3.0,1.0,2.0,Yarra,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2014.0,142.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13367,13367,4,h,1245000.0,16.7,3150.0,4.0,2.0,2.0,,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1981.0,
13368,13368,3,h,1031000.0,6.8,3016.0,3.0,2.0,2.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1995.0,133.0
13369,13369,3,h,1170000.0,6.8,3016.0,3.0,2.0,4.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1997.0,
13370,13370,4,h,2500000.0,6.8,3016.0,4.0,1.0,5.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1920.0,157.0


In [94]:
# Should we apply MinMaxScaler first? I believe so.

from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable IterativeImputer)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.impute import IterativeImputer

melb_data_mice = df_final.copy(deep=True)

# Note: only the two target columns are passed to the imputer, so each one
# is imputed from the other alone, not from the rest of the encoded matrix.
mice_imputer = IterativeImputer(random_state=0, estimator=KNeighborsRegressor())
melb_data_mice[['YearBuilt','BuildingArea']] = mice_imputer.fit_transform(
    melb_data_mice[['YearBuilt', 'BuildingArea']])



In [97]:
melb_data_mice

Unnamed: 0.1,Unnamed: 0,Rooms,Type,Price,Distance,Postcode,Bedroom2,Bathroom,Car,CouncilArea,...,Regionname_Eastern Metropolitan,Regionname_Eastern Victoria,Regionname_Northern Metropolitan,Regionname_Northern Victoria,Regionname_South-Eastern Metropolitan,Regionname_Southern Metropolitan,Regionname_Western Metropolitan,Regionname_Western Victoria,YearBuilt,BuildingArea
0,0,2,h,1480000.0,2.5,3067.0,2.0,1.0,1.0,Yarra,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2006.8,142.800
1,1,2,h,1035000.0,2.5,3067.0,2.0,1.0,0.0,Yarra,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1900.0,79.000
2,2,3,h,1465000.0,2.5,3067.0,3.0,2.0,0.0,Yarra,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1900.0,150.000
3,3,3,h,850000.0,2.5,3067.0,3.0,2.0,1.0,Yarra,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2006.8,142.800
4,4,4,h,1600000.0,2.5,3067.0,3.0,1.0,2.0,Yarra,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2014.0,142.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13367,13367,4,h,1245000.0,16.7,3150.0,4.0,2.0,2.0,,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1981.0,68.196
13368,13368,3,h,1031000.0,6.8,3016.0,3.0,2.0,2.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1995.0,133.000
13369,13369,3,h,1170000.0,6.8,3016.0,3.0,2.0,4.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1997.0,179.400
13370,13370,4,h,2500000.0,6.8,3016.0,4.0,1.0,5.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1920.0,157.000


Example plot comparing the data distributions obtained with each imputation method.
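Such a comparison could be drawn along these lines. This is only a sketch on synthetic data: the values are made up, and a simple mean fill stands in for the real imputed column, which in the notebook would come from `melb_data_mice`.

```python
import matplotlib.pyplot as plt
import numpy

rng = numpy.random.default_rng(0)

# Synthetic column with 30% missing values (made up for illustration).
original = rng.normal(150, 40, size=500)
original[rng.choice(500, size=150, replace=False)] = numpy.nan
observed = original[~numpy.isnan(original)]

# Stand-in for an imputed column: missing entries filled with the observed mean.
imputed = numpy.where(numpy.isnan(original), observed.mean(), original)

# Overlay both histograms on shared bins so the shapes are comparable.
fig, ax = plt.subplots(figsize=(8, 4))
bins = numpy.linspace(observed.min(), observed.max(), 30)
ax.hist(observed, bins=bins, alpha=0.5, density=True, label='observed only')
ax.hist(imputed, bins=bins, alpha=0.5, density=True, label='after imputation')
ax.set_xlabel('BuildingArea')
ax.set_ylabel('density')
ax.legend()
fig.tight_layout()
```

A spike or a shift in the imputed histogram relative to the observed one suggests the imputation is distorting the distribution.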

## Exercise 3: Dimensionality reduction

Using the matrix obtained in the previous exercise:
1. Apply `PCA` to obtain $n$ principal components of the matrix, where `n = min(20, X.shape[0])`. Is it necessary to standardize or scale the data?
2. Plot the variance captured by the first $n$ principal components, for each $n$.
3. Based on the plot, select the first $m$ columns of the transformed matrix to add as new features to the dataset.
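The three steps can be sketched as follows on synthetic data. Several choices here are assumptions for illustration: the random matrix, standardizing with `StandardScaler` (PCA is sensitive to feature scale), bounding `n` by both dimensions of `X`, and picking `m` with a 50% cumulative-variance threshold rather than by eye.

```python
import matplotlib.pyplot as plt
import numpy
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = numpy.random.default_rng(0)

# Synthetic matrix; one feature is on a much larger scale than the others.
X = rng.normal(size=(300, 30))
X[:, 0] *= 1000

# PCA cannot return more components than rows or columns.
n = min(20, *X.shape)

# Standardize so the large-scale feature does not dominate the components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=n)
X_pca = pca.fit_transform(X_scaled)

# Cumulative variance captured by the first 1..n components.
cumulative = numpy.cumsum(pca.explained_variance_ratio_)
fig, ax = plt.subplots()
ax.plot(range(1, n + 1), cumulative, marker='o')
ax.set_xlabel('number of components')
ax.set_ylabel('cumulative explained variance ratio')

# Keep the smallest m whose cumulative ratio reaches the threshold.
m = int(numpy.searchsorted(cumulative, 0.5) + 1)
X_new_features = X_pca[:, :m]
```

The columns of `X_new_features` can then be appended to the dataset with `numpy.hstack`.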

## Exercise 4: Composing the result

Convert the processed dataset back into a `pandas.DataFrame` and save it to a file.

To do so, you need to keep track of the original name of each column of the matrix, in the correct order. Keep in mind:
1. The `OneHotEncoder.get_feature_names` method (`get_feature_names_out` since scikit-learn 1.0) or the `OneHotEncoder.categories_` attribute return the list of category values corresponding to each index of the matrix.
2. None of the methods applied swaps the columns or rows of the matrix.

In [None]:
## Small example
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

## If we process our data with the following steps:
categorical_cols = ['Type', 'Regionname']
numerical_cols = ['Rooms', 'Distance']
new_columns = []

# Step 1: encode categorical columns
# (on scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`)
encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
X_cat = encoder.fit_transform(melb_df[categorical_cols])
for col, col_values in zip(categorical_cols, encoder.categories_):
  for col_value in col_values:
    new_columns.append('{}={}'.format(col, col_value))
print("Matrix has shape {}, with columns: {}".format(X_cat.shape, new_columns))

# Step 2: Append the numerical columns
X = numpy.hstack([X_cat, melb_df[numerical_cols].values])
new_columns.extend(numerical_cols)
# Report the shape of the combined matrix X, not of X_cat
print("Matrix has shape {}, with columns: {}".format(X.shape, new_columns))

# Step 3: Append some new features, like PCA
pca = PCA(n_components=2)
pca_dummy_features = pca.fit_transform(X)
X_pca = numpy.hstack([X, pca_dummy_features])
new_columns.extend(['pca1', 'pca2'])

## Re-build dataframe (pandas is imported as `pd` in this notebook)
processed_melb_df = pd.DataFrame(data=X_pca, columns=new_columns)
processed_melb_df.head()

Matrix has shape (13580, 11), with columns: ['Type=h', 'Type=t', 'Type=u', 'Regionname=Eastern Metropolitan', 'Regionname=Eastern Victoria', 'Regionname=Northern Metropolitan', 'Regionname=Northern Victoria', 'Regionname=South-Eastern Metropolitan', 'Regionname=Southern Metropolitan', 'Regionname=Western Metropolitan', 'Regionname=Western Victoria']
Matrix has shape (13580, 13), with columns: ['Type=h', 'Type=t', 'Type=u', 'Regionname=Eastern Metropolitan', 'Regionname=Eastern Victoria', 'Regionname=Northern Metropolitan', 'Regionname=Northern Victoria', 'Regionname=South-Eastern Metropolitan', 'Regionname=Southern Metropolitan', 'Regionname=Western Metropolitan', 'Regionname=Western Victoria', 'Rooms', 'Distance']


Unnamed: 0,Type=h,Type=t,Type=u,Regionname=Eastern Metropolitan,Regionname=Eastern Victoria,Regionname=Northern Metropolitan,Regionname=Northern Victoria,Regionname=South-Eastern Metropolitan,Regionname=Southern Metropolitan,Regionname=Western Metropolitan,Regionname=Western Victoria,Rooms,Distance,pca1,pca2
0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,2.5,-7.669418,-0.292703
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,2.5,-7.669418,-0.292703
2,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,3.0,2.5,-7.620201,0.619633
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,3.0,2.5,-7.620201,0.619633
4,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,4.0,2.5,-7.570984,1.531969


## Exercise 5: Documentation

In a `.pdf` or `.md` document, write a report of the operations you performed to obtain the final dataset. It must include:
  1. Criteria for excluding (or including) rows
  2. Interpretation of the columns present
  3. All transformations applied

This document is strictly for technical use; its purpose is to let other developers reproduce the same steps and obtain the same result. It should be detailed but concise. For example:

```
  ## Example exclusion criteria
  1. Examples whose construction year is earlier than 1900 are removed

  ## Selected features
  ### Categorical features
  1. Type: property type. 3 possible values
  2. ...
  All categorical features were encoded with a
  OneHotEncoding method using at most their 30 most
  frequent values.
  
  ### Numerical features
  1. Rooms: Number of rooms
  2. Distance: Distance to the city center.
  3. airbnb_price_mean: The mean daily price of AirBnB
     listings in the same postal code is added.
     [Link to the repository with external data].

  ### Transformations:
  1. All numerical features were standardized.
  2. The `Suburb` column was imputed using the method ...
  3. The `YearBuilt` and ... columns were imputed using the
     algorithm ...
  4. ...

  ### Augmented data
  1. The first 5 columns obtained through the PCA
     method, applied to the fully processed dataset,
     are added.
```
