# Handling Missing Values

### Table of Contents
1. [Introduction](#introduction)
2. [Data Information](#data-info)
3. [Load Libraries](#load-libraries)
4. [Load Datasets](#load-datasets)
5. [Case Study: Non Time Series Problem](#non-time-series-problem)
6. [Case Study: Time Series Problem](#time-series-problem)

## Introduction <a class="anchor" id="introduction"></a>

Real-world data often contains missing values, which can significantly affect the conclusions drawn from it. A missing value (or missing data) is a value that is not stored for a variable in the observation of interest. Values can be missing for several reasons, such as:
* The data doesn't exist.
* The data was not collected due to human error.
* The data was deleted accidentally.

Either way, we need to address missing values before proceeding with modeling or analysis.
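To make this concrete, here is a minimal sketch of how missing values typically appear in pandas. The DataFrame is a made-up toy example (its column names and values are illustrative assumptions, not the datasets used later), and `isna()` is used to count the gaps per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: both np.nan and None mark a missing value.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "cabin": [None, "C85", None, "C123"],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# isna() returns a boolean mask; summing it counts missing values per column.
missing_count = toy.isna().sum()

# The mean of the same mask is the fraction of missing values per column.
missing_share = toy.isna().mean()

print(missing_count)
print(missing_share)
```

Counting first is usually a sensible starting point, because the share of missing values in a column often guides whether to drop it, fill it, or leave it alone.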

![a](https://i.imgur.com/68u0dD2.png)

## Data Information <a class="anchor" id="data-info"></a>

Two publicly available datasets are used to explain the concepts:

* Titanic for Non Time Series problem
* Water Consumption in a Median Size City (2000 - 2016) for Time Series problem

## Load Libraries <a class="anchor" id="load-libraries"></a>

Load the libraries used in this notebook.

In [1]:
import os
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression

## Load Datasets <a class="anchor" id="load-datasets"></a>

Load the datasets used in the case studies.

In [2]:
df_titanic_train = pd.read_csv('../00_Dataset/titanic/train.csv')
df_titanic_test = pd.read_csv('../00_Dataset/titanic/test.csv')

print('Titanic training data shape: ', df_titanic_train.shape)
print('Titanic testing data shape: ', df_titanic_test.shape)

# Show first five rows of the titanic training dataset
display(df_titanic_train.head())

Titanic training data shape:  (891, 11)
Titanic testing data shape:  (418, 10)


| | survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
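The preview above already shows missing entries in the `cabin` column. As a minimal sketch of how such gaps arise on load, the snippet below parses a small inline CSV that copies a few columns of the previewed rows (the inline text stands in for the real file, which is an assumption made so the example runs on its own) and summarises the affected columns.

```python
import io

import pandas as pd

# Inline CSV copying a few columns of the rows previewed above.
# An empty field between two commas is a missing value.
csv_text = """survived,pclass,sex,age,fare,cabin,embarked
0,3,male,22.0,7.25,,S
1,1,female,38.0,71.2833,C85,C
1,3,female,26.0,7.925,,S
1,1,female,35.0,53.1,C123,S
0,3,male,35.0,8.05,,S
"""

sample = pd.read_csv(io.StringIO(csv_text))

# read_csv parses empty fields as NaN by default.
cabin_missing = sample["cabin"].isna()

# Count and percentage of missing values, keeping only affected columns.
summary = pd.DataFrame({
    "n_missing": sample.isna().sum(),
    "pct_missing": sample.isna().mean() * 100,
})
summary = summary[summary["n_missing"] > 0]

print(summary)
```

The same summary can be computed on `df_titanic_train` directly. Five rows are far too few to judge how much of a column is really missing.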


In [3]:
df_AguaH = pd.read_csv('../00_Dataset/AguaH/AguaH.csv')

print('AguaH data shape: ', df_AguaH.shape)

# Show first five rows of the AguaH dataset
display(df_AguaH.head())

AguaH data shape:  (178597, 89)


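| | USO2013 | TU | DC | M | UL | f.1_ENE_09 | f.1_FEB_09 | f.1_MAR_09 | f.1_ABR_09 | f.1_MAY_09 | ... | f.1_MAR_15 | f.1_ABR_15 | f.1_MAY_15 | f.1_JUN_15 | f.1_JUL_15 | f.1_AGO_15 | f.1_SEP_15 | f.1_OCT_15 | f.1_NOV_15 | f.1_DIC_15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | H3 | DOMESTICO MEDIO | 0.5 | MSDELAUNET | 197.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | ... | 2.0 | 1.0 | 2.0 | 1.0 | 7.0 | 7.0 | 42.0 | 27.0 | 22.0 | 21.0 |
| 1 | H3 | DOMESTICO MEDIO | 0.5 | MSDELAUNET | 307.0 | NaN | 30.0 | 30.0 | 30.0 | 30.0 | ... | 11.0 | 13.0 | 16.0 | 14.0 | 15.0 | 16.0 | 13.0 | 17.0 | 17.0 | 11.0 |
| 2 | H3 | DOMESTICO RESIDENCIAL | 0.5 | MSDELAUNET | 179.0 | NaN | NaN | NaN | NaN | NaN | ... | 6.0 | 9.0 | 7.0 | 8.0 | 8.0 | 10.0 | 12.0 | 9.0 | 6.0 | 3.0 |
| 3 | H3 | DOMESTICO MEDIO | 0.5 | CICASA MMD-15 S | 852.0 | NaN | NaN | NaN | NaN | NaN | ... | 17.0 | 20.0 | 16.0 | 16.0 | 18.0 | 18.0 | 17.0 | 17.0 | 18.0 | 9.0 |
| 4 | H3 | DOMESTICO RESIDENCIAL | 0.5 | NaN | NaN | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | ... | 27.0 | 27.0 | 27.0 | 28.0 | 28.0 | 31.0 | 27.0 | 27.0 | 27.0 | 24.0 |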
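In this wide layout, missing values run along each row rather than down a single column. The sketch below copies the first five monthly columns of the previewed rows, counts the gaps per row, and reshapes the data to long format. It assumes that each `f.1_<month>_<year>` column holds one monthly reading and that each row is one metered account. The names `account_id`, `month`, and `consumption` are hypothetical labels introduced for illustration.

```python
import numpy as np
import pandas as pd

# First five monthly columns of the rows previewed above (values copied from it).
month_cols = ["f.1_ENE_09", "f.1_FEB_09", "f.1_MAR_09", "f.1_ABR_09", "f.1_MAY_09"]
wide = pd.DataFrame(
    [
        [20.0, 20.0, 20.0, 20.0, 20.0],
        [np.nan, 30.0, 30.0, 30.0, 30.0],
        [np.nan] * 5,
        [np.nan] * 5,
        [20.0] * 5,
    ],
    columns=month_cols,
)

# Count missing readings per row (axis=1) instead of per column.
missing_per_row = wide.isna().sum(axis=1)

# Reshape to long format: one row per (account, month) pair.
# "account_id", "month" and "consumption" are hypothetical names.
long_df = (
    wide.rename_axis("account_id")
    .reset_index()
    .melt(id_vars="account_id", var_name="month", value_name="consumption")
)

print(missing_per_row)
print(long_df.head())
```

The long format tends to be easier to work with for time series methods, since each account's readings can then be ordered and processed as a single sequence.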


## Case Study: Non Time Series Problem <a class="anchor" id="non-time-series-problem"></a>

## Case Study: Time Series Problem <a class="anchor" id="time-series-problem"></a>