# Data Science Hands-on

**First day**     

The goals of this notebook are:
- Explore the houses dataset
- Make some plots to inspect different variables
- Select interesting features for ML algorithms
- Fill in missing values
- Transform some features into more informative variables


[Pandas cheatsheet](https://github.com/creyesp/houses-project/blob/add-binder-configs/pandas_cheatsheet.md)


## What are some questions that I can answer with this dataset?
Understanding your dataset is the first step of any data science project. You need to know its limitations and make a list of possible questions that it could answer. These questions can reduce, expand or modify the scope of the project.

Examples:
- We could have great ideas but poor data
- We could be asking the wrong question for our dataset

**Data**: 
- We have a set of features of houses for sale in a specific time window.

**Business question/objective**:
- **New infocasas functionality**: Is it possible to offer an estimated selling price from the house characteristics (uploaded by the owner on the website) without asking an appraiser?


# Exploratory data analysis
- How many rows are in our dataset?
- How many columns are in our dataset?
- What data types are in the columns?
- Are there missing values in the dataset? Can we infer missing values? How?
- Are there outlier values? 

Data types:
- **Numeric**:
    - *Discrete*: variables that take a countable set of possible values.
    - *Continuous*: variables that can take an infinite number of possible values within a range.
- **Categorical, variables that have 2 or more possible values**:
    - *Ordinal*: the values have a meaningful order or rank, e.g. marks A, B, C.
    - *Nominal*: the order of the values has no meaning, e.g. names.
- **Unstructured**:
    - *text*
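
As a rough illustration of how these types can look in pandas, the sketch below builds a toy DataFrame and stores an ordinal variable as an ordered categorical. All column names and values are made up for the example and are not taken from the houses dataset.

```python
import pandas as pd

# Toy data: every name and value here is hypothetical.
toy = pd.DataFrame({
    "rooms": [1, 2, 3],                # numeric, discrete
    "area_m2": [45.5, 80.0, 120.3],    # numeric, continuous
    "grade": ["B", "A", "C"],          # categorical, ordinal
    "owner": ["Ana", "Luis", "Eva"],   # categorical, nominal
})

# An ordered categorical keeps the rank information: C < B < A.
toy["grade"] = pd.Categorical(toy["grade"], categories=["C", "B", "A"], ordered=True)

print(toy.dtypes)
print(toy["grade"].min())  # lowest grade according to the declared order
```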

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Custom module
import handson

%matplotlib notebook

pd.options.display.float_format = '{:.2f}'.format
pd.options.display.max_columns = None

## Load dataset


In [None]:
path_file = '../data/dataset_preprocessed.csv'

# Read csv file and assign to df variable
df = pd.read_csv(path_file)


## General information about our dataset
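
Before filling in the cells below, it may help to see the usual pandas calls on a small example. This is only a sketch on a toy DataFrame with invented values; the same calls should work on `df`, but the outputs will of course differ.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the houses dataset; the values are invented.
toy = pd.DataFrame({
    "precio": [100000.0, 250000.0, np.nan, 180000.0],
    "dormitorios": [2, 3, 1, 3],
    "barrio": ["Centro", None, "Pocitos", "Centro"],
})

print(toy.columns.tolist())      # column names
print(toy.head())                # first rows (5 by default)
n_rows, n_cols = toy.shape       # number of rows and columns
print(toy.dtypes)                # data type of each column

missing_count = toy.isna().sum()         # missing values per column
missing_pct = toy.isna().mean() * 100    # percentage of missing values per column
n_unique = toy.nunique()                 # unique values per column (NaN excluded)
top_prices = toy["precio"].nlargest(2)   # largest values of one feature
```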

In [None]:
# Check the name of columns (features)


In [None]:
# Look at the first 5 rows of our dataset


In [None]:
# Get number of rows and columns


In [None]:
# Get data types of columns


In [None]:
# Check if there are missing values in each column


In [None]:
# Get percentage of missing values from each column


In [None]:
# Get number of unique values for each feature


In [None]:
# Get 10 largest values of some feature


## Statistical summary
### Numeric variables
Look at the summary statistics for each column and check which columns show unusual behavior.
- Are all values positive?
- Is the standard deviation different from zero?
- How far is the 75th percentile from the maximum?
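
The checks above can be sketched on a toy column as follows. The prices are hypothetical and include one extreme value on purpose, so the gap between the 75th percentile and the maximum stands out.

```python
import pandas as pd

# Hypothetical prices, including one extreme value.
prices = pd.DataFrame({"precio": [90.0, 100.0, 110.0, 120.0, 5000.0]})

summary = prices.describe()              # count, mean, std, min, quartiles, max
tails = prices.quantile([0.05, 0.95])    # 5th and 95th percentiles

# A maximum far above the 75th percentile hints at outliers.
gap = summary.loc["max", "precio"] - summary.loc["75%", "precio"]

print(summary)
print(tails)
print(gap)
```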


In [None]:
# Get a summary of the numerical columns from our dataset


In [None]:
# Get percentile 5 and 95


### Categorical summary
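
For non-numerical columns, `describe` reports the count, the number of unique values, the most frequent value and its frequency. A minimal sketch on invented data follows; the string column is created with an explicit object dtype here, as an assumption, so that `include='O'` selects it regardless of the pandas version.

```python
import pandas as pd

# Invented data; "barrio" is stored explicitly as an object column.
toy = pd.DataFrame({
    "barrio": pd.Series(["Centro", "Centro", "Pocitos", None], dtype="object"),
    "precio": [1.0, 2.0, 3.0, 4.0],
})

# include='O' restricts the summary to object (string) columns.
cat_summary = toy.describe(include="O")
print(cat_summary)  # rows: count, unique, top, freq
```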

In [None]:
# Get a summary of the non-numerical columns from our dataset.
# Hint: use include='O' as an argument of the summary function




## Visualization
[Seaborn](https://seaborn.pydata.org/) is a statistical data visualization package built on [Matplotlib](https://matplotlib.org/). It is very useful for EDA and makes it easy to create univariate and bivariate plots.
<img src="img/seaborn.png" />

### Univariate plots
- [Distribution](https://seaborn.pydata.org/generated/seaborn.distplot.html#seaborn.distplot)
- [Histograms](https://seaborn.pydata.org/generated/seaborn.distplot.html#seaborn.distplot)
- [Boxplots](https://seaborn.pydata.org/generated/seaborn.boxplot.html#seaborn.boxplot)


In [None]:
# Plot price distribution
f, ax = plt.subplots()



In [None]:
# Plot boxplots of price grouped by some categorical feature
# ex. estado, barrio, banos, dormitorios, tipo_propiedad

f, ax = plt.subplots(figsize=(5, 5))


In [None]:
f, ax = plt.subplots(figsize=(10, 5))


In [None]:
# Plot histogram of barrio feature
f, ax = plt.subplots()


In [None]:
# Make a histogram of "ano_de_construccion" between 1880 and 2019, in bins of 10 years


### Bivariate plots
- [Scatter](https://seaborn.pydata.org/generated/seaborn.scatterplot.html#seaborn.scatterplot)
- [Pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html#seaborn.pairplot)
- [Relplot](https://seaborn.pydata.org/generated/seaborn.relplot.html#seaborn.relplot)
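
The sketch below shows a scatter plot and a correlation heatmap on synthetic data, where "precio" is generated to grow with "m2_edificados". The numbers are invented and only the seaborn and pandas calls are the point.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: price is built to depend on the built area.
rng = np.random.default_rng(1)
m2 = rng.uniform(40, 300, size=100)
toy = pd.DataFrame({
    "m2_edificados": m2,
    "precio": 1500 * m2 + rng.normal(0, 20000, size=100),
    "gastos_comunes": rng.uniform(0, 10000, size=100),
})

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Relationship between two continuous variables.
sns.scatterplot(data=toy, x="m2_edificados", y="precio", ax=ax1)

# Pairwise Pearson correlations shown as a heatmap.
corr = toy.corr()
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm", ax=ax2)
```

`sns.pairplot` and `sns.relplot` are figure-level functions: they create their own figure, so they are called without an `ax` argument.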

In [None]:
# make scatter plot, ex ano_de_construccion and precio, or gastos_comunes vs precio

f, ax = plt.subplots()


In [None]:
# Make a relplot using for example m2_edificados, precio, banos, dormitorios


In [None]:
# Select continuous features and make a pairplot

features = [
]



In [None]:
# Select continuous features, compute the correlations between them and make a heatmap

features = [

]

f, ax = plt.subplots(figsize=(10, 8))


# Data preparation

**"The quality and quantity of data that you gather will directly determine how good your predictive model can be."**

- Select relevant features
- Clean the data and impute missing values




<table>
  <tr>
    <th>Feature selection</th>
    <th>Filling missing values</th>
  </tr>
  <tr>
    <th><img src="http://dkopczyk.quantee.co.uk/wp-content/uploads/2018/10/feat_sel-600x265.png" /></th>
    <th><img src="https://i.stack.imgur.com/E4mhD.png" /></th>
  </tr>
</table>


## Select features for analysis
Check the dataset [documentation](https://github.com/creyesp/houses-project/blob/add-binder-configs/data/dataset_description.md) to choose the most interesting features for answering our questions.

In [None]:
columns_to_analyze = [
    'ano_de_construccion', 
    'banos',
    'banos_extra',
    'descripcion',
    'disposicion',
    'distancia_al_mar',
    'dormitorios',
    'dormitorios_extra',
    'estado',
    'extra',
    'garajes',
    'garajes_extra',
    'gastos_comunes',
    'tipo_de_publicacion',
    'm2_de_la_terraza',
    'm2_del_terreno',
    'm2_edificados',
    'oficina',
    'penthouse',
    'plantas',
    'plantas_extra',
    'precio',
    'sobre',
    'tipo_propiedad',
    'vista_al_mar',
    'vivienda_social',
    'barrio', 
]

## Split the dataset into numerical and string variables
Pandas has a method, `select_dtypes`, that selects columns by dtype:
- **'object'**: To select strings you must use the object dtype
- **'number'**: To select all numeric types
- **'category'**: To select Pandas categorical dtypes
- **'datetime'**: To select datetimes
- **'timedelta'**: To select timedeltas

In [None]:
df_num = df[columns_to_analyze].select_dtypes(include='number')
df_obj = df[columns_to_analyze].select_dtypes(include='object')

print('Numerical columns: {}\n'.format(df_num.columns.tolist()))
print('Categorical columns: {}'.format(df_obj.columns.tolist()))

## Missing values imputation
There are more sophisticated methods for imputing missing values, such as the [Iterative Imputer](https://towardsdatascience.com/4-tips-for-advanced-feature-engineering-and-preprocessing-ec11575c09ea).

Some features have only one valid value and the rest of the values are NaN (Not a Number), e.g. the "oficina" column. In this case we can infer that a missing value means 0.
- **Look for features where NaN values can be replaced with 0.**

For other features, NaN values should be replaced with a specific value, e.g. "plantas": if a house or apartment doesn't have a valid value, the default value should be 1.
- **Look for features where NaN values can be replaced with a specific value.**
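
Both kinds of replacement can be done with `fillna`. The sketch below uses a toy DataFrame; the choice of which columns get 0 is an assumption made for the example, not a statement about the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data with invented values.
toy = pd.DataFrame({
    "oficina": [1.0, np.nan, np.nan],
    "penthouse": [np.nan, 1.0, np.nan],
    "plantas": [2.0, np.nan, 3.0],
})

# Flag-like columns: a missing value is read as "not present" (0).
flag_cols = ["oficina", "penthouse"]  # hypothetical selection
toy[flag_cols] = toy[flag_cols].fillna(0)

# Column with a natural default: a property has at least one floor.
toy["plantas"] = toy["plantas"].fillna(1)

print(toy)
```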

In [None]:
# Fill missing values with zero
fill_zero_col = [
]
# df_num.loc[:, fill_zero_col] = 

# Fill missing values with 1


We can infer some values of one column from another column. For example, we can fill NaN values in "m2_del_terreno" from "m2_edificados".
- **Select the NaN values in "m2_del_terreno" and fill them with "m2_edificados".**
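
One way to do this, sketched on invented values, is to pass the second column to `fillna`. pandas aligns both Series on the index, so each missing value is taken from the same row.

```python
import numpy as np
import pandas as pd

# Invented values; the last row is missing in both columns.
toy = pd.DataFrame({
    "m2_del_terreno": [200.0, np.nan, np.nan],
    "m2_edificados": [120.0, 80.0, np.nan],
})

# Missing land areas take the built area of the same row.
toy["m2_del_terreno"] = toy["m2_del_terreno"].fillna(toy["m2_edificados"])

print(toy)
```

Rows where both columns are missing stay NaN and need another strategy.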

In [None]:
# Fill missing values using other columns


We can also use statistical measures such as the mean, median or mode to fill missing values.
- **Compute the median of "m2_edificados" and fill NaN values with this result.**

For categorical features we can add a new category to fill missing values.
- **Replace NaN values with a default category for the following features:**
    - "barrio"
    - "disposicion"
    - "tipo_propiedad"
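
Both ideas are sketched below on a toy DataFrame. The values are invented, and the placeholder label "desconocido" is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

# Toy data with invented values.
toy = pd.DataFrame({
    "m2_edificados": [50.0, np.nan, 100.0, 70.0],
    "barrio": ["Centro", None, "Pocitos", None],
})

# The median ignores NaN values by default.
median_m2 = toy["m2_edificados"].median()
toy["m2_edificados"] = toy["m2_edificados"].fillna(median_m2)

# Missing categories become their own explicit category.
toy["barrio"] = toy["barrio"].fillna("desconocido")

print(toy)
```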

In [None]:
# Fill missing categories


## Feature transformation

We can create new features by applying functions or filters to existing ones to get more informative features. Apply the following transformations:
- **Create a binary feature called "cerca_rambla" which is 1 when "distancia_al_mar" < 1000 or "vista_al_mar" is 1, and 0 otherwise.**
- **Create a feature called "m2_index" which is the ratio between "m2_edificados" and "m2_del_terreno".**
- **Create a binary feature called "es_casa" which is 1 if "tipo_propiedad" == 'casas' and 0 if "tipo_propiedad" == 'apartamentos'.**
- **Create a binary feature called "parrillero" which is 1 if the "extra" feature contains 'parrillero', and 0 otherwise.**
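
The transformations listed above can be sketched on a toy DataFrame as follows. The rows are invented, and the real columns may need cleaning first (for example, a zero in "m2_del_terreno" would make the ratio infinite).

```python
import pandas as pd

# Toy rows with invented values.
toy = pd.DataFrame({
    "distancia_al_mar": [500.0, 3000.0, 1500.0],
    "vista_al_mar": [0, 0, 1],
    "m2_edificados": [100.0, 60.0, 90.0],
    "m2_del_terreno": [200.0, 60.0, 180.0],
    "tipo_propiedad": ["casas", "apartamentos", "casas"],
    "extra": ["parrillero, piscina", None, "jardin"],
})

# Boolean conditions combined with | (or), then cast to 0/1.
near = (toy["distancia_al_mar"] < 1000) | (toy["vista_al_mar"] == 1)
toy["cerca_rambla"] = near.astype(int)

# Ratio between built area and land area.
toy["m2_index"] = toy["m2_edificados"] / toy["m2_del_terreno"]

# Equality test cast to 0/1.
toy["es_casa"] = (toy["tipo_propiedad"] == "casas").astype(int)

# Substring search; na=False treats missing text as "not found".
toy["parrillero"] = toy["extra"].str.contains("parrillero", na=False).astype(int)

print(toy[["cerca_rambla", "m2_index", "es_casa", "parrillero"]])
```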

In [None]:
df_num['cerca_rambla'] = 

df_num['m2_index'] =

df_num['es_casa'] = 

df_num['parrillero'] = df_obj['extra'].str.contains('parrillero', na=False)


### Binning
Some variables, such as years or ages, can benefit from being transformed into binned variables.

- **Create a new variable called "decada" by binning "ano_de_construccion" into decades. Use pd.cut()**
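
As a sketch of how `pd.cut` behaves with decade edges, the example below bins a few hypothetical years. `right=False` is used here, as one possible choice, so that each interval includes its left edge and a year like 1880 falls into the 1880 bin.

```python
import numpy as np
import pandas as pd

# Hypothetical construction years, including out-of-range and missing values.
years = pd.Series([1875, 1880, 1955, 2019, np.nan])
years = years.clip(lower=1880, upper=2019)  # NaN is left untouched

edges = np.arange(1880, 2021, 10)    # 1880, 1890, ..., 2020
labels = np.arange(1880, 2020, 10)   # one label per interval: 1880, ..., 2010

# right=False gives intervals closed on the left: [1880, 1890), [1890, 1900), ...
decade = pd.cut(years, bins=edges, labels=labels, right=False)

print(decade.tolist())
```

Missing years stay missing after binning, so they need to be filled before or after this step.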

In [None]:
range_decade = np.arange(1880, 2021, 10)
range_label = np.arange(1880, 2020, 10)
year = df['ano_de_construccion'].copy()
year[year < 1880] = 1880
year[year > 2019] = 2019
year.fillna(1951, inplace=True)

df_num['decada'] = 


Some categorical features are ordinal, so we can map them to numerical values in a specific order.
- **Create a dictionary with all possible values of the "estado" feature and assign each a numerical value, where the minimum value is the worst status and the maximum value is the best status of a property. Then map these values onto the "estado" feature.**
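
The mapping can be sketched with `Series.map`. The status labels below are hypothetical placeholders; the real values of "estado" should be read from the dataset before building the dictionary.

```python
import pandas as pd

# Hypothetical status labels, not the actual values of "estado".
status = pd.Series(["a reciclar", "buen estado", "excelente", None])

# Lower number means worse status, higher number means better status.
order = {"a reciclar": 0, "buen estado": 1, "excelente": 2}

# Values that are not keys of the dictionary (including missing ones) become NaN.
encoded = status.map(order)

print(encoded.tolist())
```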


In [None]:
# Categorical transformation
map_status = {
}
df_num['estado'] = 


One useful transformation is based on the [80/20 rule, or Pareto principle](https://en.wikipedia.org/wiki/Pareto_principle), which states that for many events roughly 80% of the effects come from 20% of the causes. In our case the "barrio" feature shows a similar behaviour.
<img src="https://www.dansilvestre.com/wp-content/uploads/2017/12/DanSilvestre.com_-1.png" width="50%"/>

In [None]:
f, ax = plt.subplots(figsize=(10,5))
(df_obj['barrio'].value_counts().cumsum()/df_obj['barrio'].count()).plot(kind='bar', ax=ax)

# Shrink the x tick labels so the neighborhood names fit
ax.tick_params(axis='x', labelsize='x-small')
ax.grid(axis='y')
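
One way to turn this observation into a transformation is to keep the most frequent categories and group the rest under a single label. The sketch below uses made-up category names and counts, and the 80% threshold and the label "otros" are assumptions chosen for the example.

```python
import pandas as pd

# Made-up categories: 45, 25, 15, 10 and 5 rows respectively.
barrio = pd.Series(["A"] * 45 + ["B"] * 25 + ["C"] * 15 + ["D"] * 10 + ["E"] * 5)

# Share of rows per category, sorted from most to least frequent.
share = barrio.value_counts(normalize=True)
cum_share = share.cumsum()

# Keep the categories whose cumulative share stays within 80% of the rows.
keep = cum_share[cum_share <= 0.8].index

# Every other category is grouped under one label.
barrio_pareto = barrio.where(barrio.isin(keep), other="otros")

print(barrio_pareto.value_counts())
```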

Nominal features like "barrio" can be transformed into numerical variables by applying **ONE-HOT encoding**.
<img src="img/one-hot-encoding.png" />

- **Apply one-hot encoding to the Pareto-transformed "barrio" feature with prefix='ZN_', then assign the result to the zona variable.**
- **Apply one-hot encoding to the "disposicion" feature with prefix='DISP_', then assign the result to the disp variable.**
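
A minimal sketch of one-hot encoding with `pd.get_dummies` on invented values follows. Note that the default `prefix_sep` is `'_'`, so a prefix that already ends in an underscore produces column names with a double underscore.

```python
import pandas as pd

# Invented values standing in for the grouped "barrio" feature.
zone = pd.Series(["Centro", "Pocitos", "otros", "Centro"], name="barrio")

# One column per category; dtype=int gives 0/1 instead of booleans.
zona_toy = pd.get_dummies(zone, prefix="ZN_", dtype=int)

print(zona_toy.columns.tolist())  # names look like 'ZN__Centro'
print(zona_toy)
```

Passing `prefix_sep=''` would avoid the double underscore if that matters for later steps.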


Finally, concatenate all the new features and drop the redundant columns.

In [None]:
df_num_final = pd.concat([df_num, zona, disp], axis=1)
drop_col = ['distancia_al_mar', 'vista_al_mar', 'm2_del_terreno', 'ano_de_construccion']
df_num_final.drop(columns=drop_col, inplace=True)

## Apply custom filters
- **Get the 5th and 95th percentiles (or the 1st and 99th) to get a hint of possible filters for obtaining a clean dataset.**

In [None]:
df_num_final.quantile([0.05, 0.95])

- Create a filter on the following features to get a clean dataset:
  - tipo_propiedad
  - decada
  - oficina
  - penthouse
  - banos
  - dormitorios
  - garajes
  - m2_de_la_terraza
  - m2_edificados
  - gastos_comunes
  - precio
  - m2_index
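
A filter like this is usually a boolean mask built from several conditions joined with `&`. The sketch below uses invented rows and invented thresholds; suitable limits for the real data should come from the percentiles computed above.

```python
import pandas as pd

# Invented rows: one has too many bathrooms, one has an extreme price.
toy = pd.DataFrame({
    "precio": [50000.0, 120000.0, 150000.0, 9000000.0],
    "banos": [1, 2, 12, 2],
    "m2_edificados": [40.0, 80.0, 100.0, 90.0],
})

# Each condition needs its own parentheses when combined with &.
mask = (
    toy["precio"].between(30000, 1000000)   # inclusive on both ends
    & (toy["banos"] <= 5)
    & (toy["m2_edificados"] > 0)
)

print(mask.sum())   # number of rows that pass every condition
clean = toy[mask]
```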

In [None]:
mask = (
    
)
mask.sum()

## Drop uninformative columns and rows with missing values

In [None]:
zero_std_col = df_num_final.columns[df_num_final[mask].std() == 0]
df_final = df_num_final[mask].drop(columns=zero_std_col).astype(float).dropna()


In [None]:
handson.info(df_final)

In [None]:
df_final.describe()

## Save the prepared dataset

In [None]:
df_final.to_csv('../data/dataset_ready.csv', index=False)