# How to deal with missing values? 

Missing values can appear as `NaN`, as a placeholder such as `?`, or simply as a blank cell.
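
As a minimal sketch of how these three representations can be normalized when loading data, the snippet below reads a small in-memory CSV. The column names and rows are hypothetical and are not taken from the dataset used later. pandas already treats blank cells and the text `NaN` as missing; the `na_values` parameter of `read_csv` is used here on the assumption that `?` is the only extra placeholder.

```python
import io
import pandas as pd

# Hypothetical CSV text: a "?" placeholder, a blank cell, and a literal NaN.
raw = io.StringIO(
    "city,shape,duration\n"
    "Ithaca,?,10\n"
    "Holyoke,DISK,\n"
    "Abilene,NaN,5\n"
)

# Blank cells and "NaN" are parsed as missing by default;
# na_values adds "?" to the set of markers treated as missing.
sample = pd.read_csv(raw, na_values=["?"])

# Count missing values per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Once every placeholder is parsed as `NaN`, the same `isna()`, `dropna()`, and `fillna()` calls work regardless of how the gap was written in the file.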

There are three common ways to deal with missing values. They are:

*   Drop 
*   Replace
*   Keep

But always remember: the decision is ultimately a business question. It depends on what the data represents and how important it is relative to the whole dataset.

## 1. Drop


*   Drop the variable (entire column) 
*   Drop the data entry (entire row)


## 2. Replace




*   With an average 
*   By frequency
*   Based on other functions
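
The three replacement options can be sketched on a small toy DataFrame. The data and column names below are hypothetical, and linear interpolation is only one possible choice for the "other functions" case, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column.
toy = pd.DataFrame({
    "duration": [10.0, np.nan, 30.0, 20.0],
    "shape": ["DISK", "LIGHT", None, "LIGHT"],
})

# Average: fill numeric gaps with the column mean (NaN is skipped by mean()).
toy["duration"] = toy["duration"].fillna(toy["duration"].mean())

# Frequency: fill categorical gaps with the most common value (the mode).
toy["shape"] = toy["shape"].fillna(toy["shape"].mode()[0])

# Other functions: for example, linear interpolation between neighbours.
readings = pd.Series([1.0, np.nan, 3.0])
interpolated = readings.interpolate()

print(toy)
print(interpolated)
```

The mean and the mode are computed from the observed values only, so the filled column keeps the same average or the same most frequent category it had before.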


## 3. Keep

Just keep missing values
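
Keeping missing values is often workable because many pandas aggregations skip `NaN` by default. The short sketch below uses a hypothetical Series to show this behaviour and how `skipna=False` changes it.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one gap left in place.
values = pd.Series([1.0, np.nan, 3.0])

# By default, NaN is ignored when aggregating.
mean_skipping = values.mean()

# With skipna=False, a single NaN makes the result NaN.
mean_strict = values.mean(skipna=False)

# count() reports only the non-missing entries.
observed = values.count()

print(mean_skipping, mean_strict, observed)
```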

## Setting up the environment

In [45]:
import pandas as pd
import numpy as np
path = '/content/drive/My Drive/Colab Notebooks/Data/'

In [46]:
# DATASET USED : UFO.CSV

ufo = pd.read_csv(path+'ufo.csv')

## Initial analysis

In [47]:
ufo.tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,Grant Park,,TRIANGLE,IL,12/31/2000 23:00
18237,Spirit Lake,,DISK,IA,12/31/2000 23:00
18238,Eagle River,,,WI,12/31/2000 23:45
18239,Eagle River,RED,LIGHT,WI,12/31/2000 23:45
18240,Ybor,,OVAL,FL,12/31/2000 23:59


In [48]:
ufo.shape

(18241, 5)

## Discover how many nulls

In [49]:
# Compared with the total of 18241 rows, Colors Reported has a very large number of nulls

ufo.isna().sum()

City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64
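
Raw counts are easier to judge as a share of all rows. In the output above, 15359 of 18241 rows (about 84%) have no value in Colors Reported. A minimal sketch of computing that share directly follows; it uses a hypothetical stand-in DataFrame so that it runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the ufo DataFrame (not the real data).
sample = pd.DataFrame({
    "City": ["Ithaca", "Holyoke", None, "Abilene"],
    "Colors Reported": [np.nan, np.nan, "RED", np.nan],
    "State": ["NY", "MA", "WI", "KS"],
})

# isna() yields booleans, so the column mean is the fraction of missing values.
missing_share = sample.isna().mean().mul(100).round(1)
print(missing_share)
```

The same expression applied to `ufo` would give the percentage of nulls for each of its columns.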

In [50]:
# isnull() is an alias of isna(), so the result is identical

ufo.isnull().sum()

City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64

## 1.1 Drop values

In [51]:
# The dropna() method accepts several parameters:
# 1. axis: 0 drops rows, 1 drops columns. Default 0.
# 2. how: 'any' drops the row or column if any NA value is present; 'all' drops it only if all values are NA. Default 'any'.
# 3. subset: the labels to check for NA values. If omitted, the entire DataFrame is searched.
# 4. inplace: True modifies the original DataFrame instead of returning a new one.

ufo.dropna(axis=0, subset=['City'], how='any', inplace=True)
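
To see how `axis`, `how`, and `subset` interact, here is a small sketch on a hypothetical three-row DataFrame. The column names mirror the dataset, but the values are invented for illustration. None of the calls use `inplace=True`, so each one returns a new DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing city and an entirely empty column.
toy = pd.DataFrame({
    "City": ["Ithaca", None, "Abilene"],
    "Colors Reported": [np.nan, np.nan, np.nan],
    "State": ["NY", "WI", "KS"],
})

# Drop rows where City is missing; other columns are not checked.
rows_with_city = toy.dropna(axis=0, subset=["City"])

# Drop columns in which every value is missing.
no_empty_columns = toy.dropna(axis=1, how="all")

# Defaults (axis=0, how='any'): drop every row that has any missing value.
complete_rows = toy.dropna()

print(rows_with_city.index.tolist())
print(no_empty_columns.columns.tolist())
print(len(complete_rows))
```

The last call removes every row because Colors Reported is empty throughout, which is why restricting the check with `subset` matters on a dataset where one column is mostly null.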

## 2.1 Replace

In [52]:
# Suppose the client decided that nulls in Colors Reported should be replaced with 'Undefined'

ufo['Colors Reported'] = ufo['Colors Reported'].replace(np.nan, 'Undefined')

In [53]:
ufo.tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,Grant Park,Undefined,TRIANGLE,IL,12/31/2000 23:00
18237,Spirit Lake,Undefined,DISK,IA,12/31/2000 23:00
18238,Eagle River,Undefined,,WI,12/31/2000 23:45
18239,Eagle River,RED,LIGHT,WI,12/31/2000 23:45
18240,Ybor,Undefined,OVAL,FL,12/31/2000 23:59


In [54]:
# Suppose the data engineer asked us to fill nulls in Shape Reported with the most frequent value

ufo['Shape Reported'].value_counts()

LIGHT        2801
DISK         2119
TRIANGLE     1885
OTHER        1402
CIRCLE       1362
SPHERE       1052
FIREBALL     1037
OVAL          844
CIGAR         617
FORMATION     433
VARIOUS       332
RECTANGLE     302
CYLINDER      294
CHEVRON       248
DIAMOND       234
EGG           196
FLASH         188
TEARDROP      119
CONE           60
CROSS          36
DELTA           7
CRESCENT        2
ROUND           2
DOME            1
HEXAGON         1
PYRAMID         1
FLARE           1
Name: Shape Reported, dtype: int64

In [55]:
ufo['Shape Reported'] = ufo['Shape Reported'].replace(np.nan, 'LIGHT')
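
The cell above hardcodes `'LIGHT'` after reading it off the frequency table. As an alternative sketch, the most frequent value can be looked up programmatically, which avoids editing the code if the data changes. The Series below is a hypothetical stand-in for the real column.

```python
import pandas as pd

# Hypothetical stand-in for ufo['Shape Reported'].
shapes = pd.Series(["LIGHT", "DISK", None, "LIGHT", None], name="Shape Reported")

# value_counts() ignores missing values; idxmax() returns the most frequent label.
most_common = shapes.value_counts().idxmax()

# fillna() replaces only the missing entries.
filled = shapes.fillna(most_common)

print(most_common)
print(filled.tolist())
```

Here `fillna` is used instead of `replace(np.nan, ...)`; both fill the gaps, and `fillna` states the intent more directly.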

In [56]:
ufo.tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,Grant Park,Undefined,TRIANGLE,IL,12/31/2000 23:00
18237,Spirit Lake,Undefined,DISK,IA,12/31/2000 23:00
18238,Eagle River,Undefined,LIGHT,WI,12/31/2000 23:45
18239,Eagle River,RED,LIGHT,WI,12/31/2000 23:45
18240,Ybor,Undefined,OVAL,FL,12/31/2000 23:59


In [57]:
# Notice that there are no more nulls in the DataFrame

ufo.isnull().sum()

City               0
Colors Reported    0
Shape Reported     0
State              0
Time               0
dtype: int64