In [None]:
import pandas as pd
import numpy as np
import duckdb as duck
import pyarrow as pa
import polars as pl
import seaborn as sns

con = duck.connect(database='/home/garcia-ln/Documentos/real-state-prices/data/processed/real_state.duckdb')

## JSON > Parquet
Before we start cleaning and transforming the data for our analysis, we're gonna convert the files to `.parquet` format, so that we're always working with performance-optimized datasets, no matter the situation.  

For that, we're gonna start by loading our data into [Polars](https://pola.rs 'Most Efficient DataFrame Lib for Python') DataFrames and try to get some information from our dataset.

In [None]:
sp_dt = pl.read_json('/home/garcia-ln/Documentos/real-state-prices/data/raw/sp_properties.json')
rj_dt = pl.read_json('/home/garcia-ln/Documentos/real-state-prices/data/raw/rj_properties.json')
pa_dt = pl.read_json('/home/garcia-ln/Documentos/real-state-prices/data/raw/pa_properties.json')
bh_dt = pl.read_json('/home/garcia-ln/Documentos/real-state-prices/data/raw/bh_properties.json')

#sp_dt.write_parquet('/home/garcia-ln/Documentos/real-state-prices/data/processed/sp_properties.parquet')
#rj_dt.write_parquet('/home/garcia-ln/Documentos/real-state-prices/data/processed/rj_properties.parquet')
#pa_dt.write_parquet('/home/garcia-ln/Documentos/real-state-prices/data/processed/pa_properties.parquet')
#bh_dt.write_parquet('/home/garcia-ln/Documentos/real-state-prices/data/processed/bh_properties.parquet')

In [None]:
sql_sp = '''
    CREATE TABLE sp_tbl as 
        SELECT * FROM '~/Documentos/real-state-prices/data/processed/sp_properties.parquet';
    ALTER TABLE sp_tbl
        ADD COLUMN city VARCHAR DEFAULT 'Sao_Paulo'
'''

sql_rj = '''
    CREATE TABLE rj_tbl as 
         SELECT * FROM '~/Documentos/real-state-prices/data/processed/rj_properties.parquet';
    ALTER TABLE rj_tbl
        ADD COLUMN city VARCHAR DEFAULT 'Rio_de_Janeiro'
'''

sql_pa = '''
    CREATE TABLE pa_tbl as 
        SELECT * FROM '~/Documentos/real-state-prices/data/processed/pa_properties.parquet';
    ALTER TABLE pa_tbl
        ADD COLUMN city VARCHAR DEFAULT 'Porto_Alegre'
'''

sql_bh = '''
    CREATE TABLE bh_tbl as 
        SELECT * FROM '~/Documentos/real-state-prices/data/processed/bh_properties.parquet';
    ALTER TABLE bh_tbl
        ADD COLUMN city VARCHAR DEFAULT 'Belo_Horizonte'
'''

con.execute(sql_sp).fetchall()
sp_df = con.table('sp_tbl').df()
display(sp_df)


con.execute(sql_rj).fetchall()
rj_df = con.table('rj_tbl').df()
display(rj_df)


con.execute(sql_pa).fetchall()
pa_df = con.table('pa_tbl').df()
display(pa_df)


con.execute(sql_bh).fetchall()
bh_df = con.table('bh_tbl').df()
display(bh_df)

## Dtypes

Now that we've converted the files from `.json` to `.parquet` and added the `city` feature to our dataset, we're gonna **combine all the tables and define the dtypes of our data**.  


After that, we're gonna **cast every column of our dataset to an appropriate dtype**, so the dataset stays tidy for our cleaning, analysis and modeling.

In [None]:
sql = '''
    CREATE TABLE properties as
        SELECT * FROM sp_tbl 
    UNION ALL 
        SELECT * FROM rj_tbl 
    UNION ALL 
        SELECT * FROM pa_tbl 
    UNION ALL 
        SELECT * FROM bh_tbl
'''

#con.execute(sql).fetchall()

properties = pl.from_pandas(con.table('properties').df())
properties

In [None]:
# Cast the columns to their target dtypes with Polars
properties = properties.with_columns(
    [
        (pl.col('type').cast(pl.Categorical)),
        (pl.col('city').cast(pl.Categorical)),
        (pl.col('address').cast(pl.Categorical)),
        (pl.col('neighborhood').cast(pl.Categorical)),
        (pl.col('footage').cast(pl.Int16)),
        (pl.col('doorms').cast(pl.Int8)),
        (pl.col('garages').cast(pl.Int8)),
        (pl.col('price').cast(pl.Int32))
    ]
)

#properties.write_parquet('/home/garcia-ln/Documentos/real-state-prices/data/processed/properties.parquet')
properties = pl.read_parquet('/home/garcia-ln/Documentos/real-state-prices/data/processed/properties.parquet')

In [None]:
# Here's the SQL query for changing the dtypes of the properties table in DuckDB

con.execute('''
    ALTER TABLE properties
        ALTER type SET DATA TYPE VARCHAR;
    ALTER TABLE properties
        ALTER city SET DATA TYPE VARCHAR;
    ALTER TABLE properties
        ALTER address SET DATA TYPE VARCHAR;
    ALTER TABLE properties
        ALTER neighborhood SET DATA TYPE VARCHAR;
    ALTER TABLE properties    
        ALTER footage SET DATA TYPE SMALLINT;
    ALTER TABLE properties    
        ALTER doorms SET DATA TYPE INT2;
    ALTER TABLE properties
        ALTER garages SET DATA TYPE INT2;
    ALTER TABLE properties    
        ALTER price SET DATA TYPE INT4
'''
)

## EDA
Let's move on to the EDA step of our project. Here we're gonna go through our dataset before applying the changes for **data cleaning, feature selection and engineering, and encoding our categorical features.**  

We're gonna start by understanding the basic information on the qualitative and quantitative features, followed by some visualizations to support the insights for our analysis.  

Before we go on, let's make some changes to our dataset so that we can work with it. In this case, I'm gonna use `seaborn` for our dataviz (which requires the DataFrame in `pandas` format, not `polars`) instead of `plotly express` (which accepts the `polars` DataFrame and generates interactive plots). The reason is that seaborn makes good-looking dataviz simple and easy, and since our DataFrame isn't that big, there's no problem converting it from `polars` to `pandas` just for plotting.  

That said, we're gonna start by converting the `properties` DataFrame.

In [None]:
display(properties.null_count())
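# Use is_null(): comparing with `== None` evaluates to null in Polars, so the filter would keep no rows
null = properties.filter(pl.col('type').is_null()).to_pandas()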
df = properties.to_pandas()
df.dtypes

From this, we can see that we have a very small number of missing values. But before we do anything about them, let's check whether those null values are concentrated in one group or well distributed across all cities and prices. After that, we'll decide whether to **drop those null values or fill them in with some statistical imputation**.

In [None]:
display(null)

In [None]:
sns.set_theme(
    context='notebook', 
    style='darkgrid', 
    font_scale=2,
)

sns.countplot(
    data=null,
    x='city',
    hue='city',
    saturation=1,  # proportion of the original saturation (0 to 1); 1 keeps the full palette colors
)

In [None]:
sns.set_theme(
    context='notebook', 
    style='darkgrid', 
    font_scale=2,
)

sns.displot(
    data=df, 
    x='price', 
    hue='city',
    fill=True, 
    common_norm=False, 
    col='type',
    alpha=.2,
    kind='kde',
    height=4, 
    aspect=1.4,
    log_scale=True,
    linewidth=2.5,
    col_wrap=3
).set_axis_labels('Prices', 'Density').set_titles('{col_name}')#.set(xticks='plain')

In [None]:
properties.null_count()

In [None]:
properties['type'].value_counts().unique()