# Dealing With Missing Values

## Why should we treat missing values?

*   Some models cannot work with missing data (Nulls/NaNs)
*   Missing data may be a sign of a wider data issue
*   Missing data can be a useful feature


In Python, we can use the NumPy and pandas libraries to clean our data and handle missing values before further processing.

In [38]:
import numpy as np
import pandas as pd
#importing libraries

In [39]:
df= pd.read_csv('/content/california_cities.csv')
#importing dataset

In [40]:
df.head()
#viewing top 5 rows to understand dataset

| | Unnamed: 0 | city | latd | longd | elevation_m | elevation_ft | population_total | area_total_sq_mi | area_land_sq_mi | area_water_sq_mi | area_total_km2 | area_land_km2 | area_water_km2 | area_water_percent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Adelanto | 34.576111 | -117.432778 | 875.0 | 2871.0 | 31765 | 56.027 | 56.009 | 0.018 | 145.107 | 145.062 | 0.046 | 0.03 |
| 1 | 1 | AgouraHills | 34.153333 | -118.761667 | 281.0 | 922.0 | 20330 | 7.822 | 7.793 | 0.029 | 20.26 | 20.184 | 0.076 | 0.37 |
| 2 | 2 | Alameda | 37.756111 | -122.274444 | NaN | 33.0 | 75467 | 22.96 | 10.611 | 12.349 | 59.465 | 27.482 | 31.983 | 53.79 |
| 3 | 3 | Albany | 37.886944 | -122.297778 | NaN | 43.0 | 18969 | 5.465 | 1.788 | 3.677 | 14.155 | 4.632 | 9.524 | 67.28 |
| 4 | 4 | Alhambra | 34.081944 | -118.135 | 150.0 | 492.0 | 83089 | 7.632 | 7.631 | 0.001 | 19.766 | 19.763 | 0.003 | 0.01 |


In [41]:
df.shape
#shape of dataset (rows, columns)

(482, 14)

In [42]:
df.isnull().sum()
#checking for null values

Unnamed: 0             0
city                   0
latd                   0
longd                  0
elevation_m           48
elevation_ft          12
population_total       0
area_total_sq_mi       2
area_land_sq_mi        0
area_water_sq_mi       1
area_total_km2         5
area_land_km2          4
area_water_km2         4
area_water_percent     5
dtype: int64
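Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and the `summary` name are illustrative assumptions, not part of the dataset above) showing how counts and fractions of missing values can be computed side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "elevation_m": [875.0, np.nan, np.nan, 150.0],
    "area_total_km2": [145.1, 20.3, np.nan, 19.8],
})

# Number of missing cells per column
missing_count = toy.isnull().sum()

# Fraction of rows that are missing per column (mean of a boolean mask)
missing_share = toy.isnull().mean()

# Combine both views and list the worst-affected columns first
summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("share", ascending=False))
```

A column that is missing in half of its rows usually calls for a different treatment than one with a single gap, which is why the share is worth checking before choosing a method.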

### Using the mean to fill missing values

In [43]:
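df['area_water_km2'] = df['area_water_km2'].fillna(df['area_water_km2'].mean())
#filling null cells with the column mean; assigning the result back avoids in-place modification of a selected column, which recent pandas versions warn about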

In [44]:
df.isnull().sum()

Unnamed: 0             0
city                   0
latd                   0
longd                  0
elevation_m           48
elevation_ft          12
population_total       0
area_total_sq_mi       2
area_land_sq_mi        0
area_water_sq_mi       1
area_total_km2         5
area_land_km2          4
area_water_km2         0
area_water_percent     5
dtype: int64
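To make the effect of mean imputation visible, here is a small sketch on a hypothetical four-value column (the `toy` frame and `col_mean` name are assumptions for illustration). It relies on the fact that `Series.mean()` skips NaN values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
toy = pd.DataFrame({"area_water_km2": [0.0, 2.0, np.nan, 4.0]})

# mean() ignores NaN by default: (0 + 2 + 4) / 3 = 2.0
col_mean = toy["area_water_km2"].mean()

# Fill the gap with the mean and assign the result back to the column
toy["area_water_km2"] = toy["area_water_km2"].fillna(col_mean)

print(toy)
```

The filled cell takes the value 2.0, so the column mean is unchanged, although the spread of the column shrinks slightly.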

### Using the median to fill missing values

In [45]:
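df['area_water_percent'] = df['area_water_percent'].fillna(df['area_water_percent'].median())
#filling null cells with the column median, which is less sensitive to extreme values than the mean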

In [46]:
df.isnull().sum()

Unnamed: 0             0
city                   0
latd                   0
longd                  0
elevation_m           48
elevation_ft          12
population_total       0
area_total_sq_mi       2
area_land_sq_mi        0
area_water_sq_mi       1
area_total_km2         5
area_land_km2          4
area_water_km2         0
area_water_percent     0
dtype: int64
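The choice between mean and median matters most when a column contains outliers. The sketch below uses a hypothetical series with one extreme value (all names and numbers are illustrative assumptions) to compare the two fill values.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column: one outlier (100.0) and one missing value
values = pd.Series([1.0, 2.0, 3.0, 100.0, np.nan])

mean_value = values.mean()      # (1 + 2 + 3 + 100) / 4 = 26.5
median_value = values.median()  # middle of [1, 2, 3, 100] = 2.5

filled_with_mean = values.fillna(mean_value)
filled_with_median = values.fillna(median_value)

print("mean fill:  ", filled_with_mean.tolist())
print("median fill:", filled_with_median.tolist())
```

Here the mean is pulled far towards the outlier, while the median stays close to the typical values, so the median is often the safer default for skewed columns.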

### Dropping a column to get rid of missing values

In [47]:
df.drop(columns=["area_total_km2"],inplace=True)
#Deleting column with missing values

In [48]:
df.isnull().sum()

Unnamed: 0             0
city                   0
latd                   0
longd                  0
elevation_m           48
elevation_ft          12
population_total       0
area_total_sq_mi       2
area_land_sq_mi        0
area_water_sq_mi       1
area_land_km2          4
area_water_km2         0
area_water_percent     0
dtype: int64
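Dropping a column discards all of its values, including the ones that are present, so it can help to drop only columns that are mostly empty. The following is a rough sketch of that idea using `DataFrame.dropna` with its `thresh` parameter; the toy frame and the 50% cut-off are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one column is mostly empty, another has a single gap
toy = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "rarely_missing": [1.0, 2.0, np.nan, 4.0],
})

# Keep only columns with at least 50% non-null values (assumed cut-off)
min_non_null = int(0.5 * len(toy))  # 2 of 4 rows

# thresh is the minimum number of non-null values a column needs to survive
trimmed = toy.dropna(axis="columns", thresh=min_non_null)

print(list(trimmed.columns))
```

The column with a single gap survives and can then be imputed with one of the other methods, while the mostly empty column is removed.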

### Replacing missing values in a specific column with a given value

In [49]:
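df['area_land_km2'] = df['area_land_km2'].fillna(121.12)
#filling with a fixed numeric value (a float rather than the string '121.12', so the column stays numeric)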

In [50]:
df.isnull().sum()

Unnamed: 0             0
city                   0
latd                   0
longd                  0
elevation_m           48
elevation_ft          12
population_total       0
area_total_sq_mi       2
area_land_sq_mi        0
area_water_sq_mi       1
area_land_km2          0
area_water_km2         0
area_water_percent     0
dtype: int64
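When filling with a constant, the type of the fill value matters. The sketch below, on a hypothetical column, fills with a float and checks that the column keeps a numeric dtype; the constant `121.12` is reused from the cell above only as an example value.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two gaps
toy = pd.DataFrame({"area_land_km2": [145.062, np.nan, 27.482, np.nan]})

# Use a numeric constant so the column remains float64.
# Note (assumption about typical pandas behaviour): passing a string such as
# '121.12' instead would likely turn the column into a non-numeric dtype.
fill_value = 121.12
toy["area_land_km2"] = toy["area_land_km2"].fillna(fill_value)

print(toy["area_land_km2"].dtype)
print(toy["area_land_km2"].tolist())
```

Keeping the column numeric means later calls such as `mean()` or `describe()` continue to work on it.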

### Filling missing values by interpolating between neighbouring values

In [51]:
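df['area_total_sq_mi'] = df['area_total_sq_mi'].interpolate(method='polynomial', order=2)
#polynomial interpolation of order 2 between the surrounding values (this method requires SciPy to be installed)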

In [52]:
df.isnull().sum()

Unnamed: 0             0
city                   0
latd                   0
longd                  0
elevation_m           48
elevation_ft          12
population_total       0
area_total_sq_mi       0
area_land_sq_mi        0
area_water_sq_mi       1
area_land_km2          0
area_water_km2         0
area_water_percent     0
dtype: int64
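Interpolation estimates a missing value from the values around it. As a simpler illustration than the polynomial method, here is a sketch using linear interpolation, which needs no extra dependencies; the series is a hypothetical example, and linear interpolation treats the rows as equally spaced.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a single gap and a run of two gaps
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

# Linear interpolation draws a straight line between the known neighbours
filled = s.interpolate(method="linear")

print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

Interpolation only makes sense when the row order carries meaning (for example, a time series). For rows sorted alphabetically by city name, neighbouring rows are not necessarily similar, so the estimates should be treated with caution.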

### Forward fill: assuming the missing value is similar to the cell above and copying that value into the missing cell

In [53]:
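df['area_water_sq_mi'] = df['area_water_sq_mi'].ffill()
#forward fill: copy the last valid value downwards (fillna(method='ffill') is deprecated in recent pandas versions)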

In [54]:
df.isnull().sum()

Unnamed: 0             0
city                   0
latd                   0
longd                  0
elevation_m           48
elevation_ft          12
population_total       0
area_total_sq_mi       0
area_land_sq_mi        0
area_water_sq_mi       0
area_land_km2          0
area_water_km2         0
area_water_percent     0
dtype: int64
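Forward fill has one edge case worth knowing: a gap at the very start of a column has no value above it to copy. The sketch below demonstrates this on a hypothetical series using `Series.ffill()`.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap and a gap in the middle
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Each gap takes the last valid value that appears above it
forward_filled = s.ffill()

print(forward_filled.tolist())

# The first element has nothing above it, so it remains missing
print(forward_filled.isnull().sum())
```

If the first row of a column can be missing, a second pass with another method is needed to close that remaining gap.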

### Backward fill: assuming the missing value is similar to the cell below and copying that value into the missing cell

In [55]:
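df['elevation_ft'] = df['elevation_ft'].bfill()
#backward fill: copy the next valid value upwards (fillna(method='bfill') is deprecated in recent pandas versions)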

In [56]:
df.isnull().sum()

Unnamed: 0             0
city                   0
latd                   0
longd                  0
elevation_m           48
elevation_ft           0
population_total       0
area_total_sq_mi       0
area_land_sq_mi        0
area_water_sq_mi       0
area_land_km2          0
area_water_km2         0
area_water_percent     0
dtype: int64
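Backward fill mirrors forward fill, so its blind spot is a gap at the end of a column. The following sketch, on a hypothetical series, shows the leftover trailing gap and one possible way to close it by chaining a forward fill; the chaining is an illustrative choice, not something the notebook above does.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap in the middle and a trailing gap
s = pd.Series([1.0, np.nan, 3.0, np.nan])

# Each gap takes the next valid value that appears below it
backward_filled = s.bfill()

# The last element has nothing below it, so a forward fill closes it
fully_filled = s.bfill().ffill()

print(backward_filled.tolist())
print(fully_filled.tolist())  # [1.0, 3.0, 3.0, 3.0]
```

Checking `isnull().sum()` after a fill, as done throughout this notebook, is what reveals whether such an edge gap was left behind.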

### Dropping a column after recording where its values are not missing

In [59]:
#record where the values are not missing
df['recording_non_null_values'] = df['elevation_m'].notnull()
# dropping the column after it is recorded
df.drop(columns=['elevation_m'],inplace=True) 

In [60]:
df.isnull().sum()

Unnamed: 0                   0
city                         0
latd                         0
longd                        0
elevation_ft                 0
population_total             0
area_total_sq_mi             0
area_land_sq_mi              0
area_water_sq_mi             0
area_land_km2                0
area_water_km2               0
area_water_percent           0
recording_non_null_values    0
dtype: int64
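Recording where values were present keeps the "missingness" itself available as a feature. A related variation, sketched below under stated assumptions, keeps the original column as well: it records a missing-value flag and then imputes instead of dropping. The toy frame and the `elevation_m_was_missing` column name are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a gap in the elevation column
toy = pd.DataFrame({
    "city": ["A", "B", "C"],
    "elevation_m": [875.0, np.nan, 150.0],
})

# Record which rows were missing before any values are filled in
toy["elevation_m_was_missing"] = toy["elevation_m"].isnull()

# Impute with the median of the observed values: (150 + 875) / 2 = 512.5
median_elevation = toy["elevation_m"].median()
toy["elevation_m"] = toy["elevation_m"].fillna(median_elevation)

print(toy)
```

With this approach a model can use both the imputed value and the flag, which is useful when the fact that a value was missing carries information of its own.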

### We can see that after applying these methods the dataset is clean and no missing values remain.

## Thanks!