# Python Pandas Tutorial 9 - Cleaning Data
Welcome to the ninth module of the Pandas tutorial! In this module, we will learn how to cast datatypes and handle missing values.

In [2]:
import pandas as pd
import numpy as np
people = {
    "first": ["Ben", "Han", "Luke", "Anakin", "Leia", "Chew", "Master", np.nan, None, 'NA'],
    "last": ["Kenobi", "Solo", "Skywalker", "Skywalker", "Organa", "Bacca", "Yoda", np.nan, np.nan, 'Missing'],
    "email": ["Benkenobi@email.com", "Hansolo@email.com", "Lukeskywalker@email.com", "Anakinskywalker@vader.com", "Leiaorgana@email.com", "Chewbacca@email.com", "Masteryoda@force.com", None, 'Anonymous@email.com', np.nan],
    "Age": ["57", "33", "23", "45", "23", "204", "900", "None", "None", "None"],
}
 

df = pd.DataFrame(people)
#I added a few extra entries to our data frame so that we have some missing values to work with

One thing you might want to do with missing data is remove it. We are going to remove the rows that contain missing values, using the dropna() method.

In [3]:
df.dropna()
#Now you will notice that there are only seven rows left.

Unnamed: 0,first,last,email,Age
0,Ben,Kenobi,Benkenobi@email.com,57
1,Han,Solo,Hansolo@email.com,33
2,Luke,Skywalker,Lukeskywalker@email.com,23
3,Anakin,Skywalker,Anakinskywalker@vader.com,45
4,Leia,Organa,Leiaorgana@email.com,23
5,Chew,Bacca,Chewbacca@email.com,204
6,Master,Yoda,Masteryoda@force.com,900


In [4]:
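df.dropna(axis='index', how='any') #axis='index' tells Pandas to drop rows that contain missing values; how='any' drops a row if any of its values is missing.
#These are the default arguments, so the result is the same as df.dropna().
#If we passed axis='columns' instead of 'index', it would drop the columns that contain missing values.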


Unnamed: 0,first,last,email,Age
0,Ben,Kenobi,Benkenobi@email.com,57
1,Han,Solo,Hansolo@email.com,33
2,Luke,Skywalker,Lukeskywalker@email.com,23
3,Anakin,Skywalker,Anakinskywalker@vader.com,45
4,Leia,Organa,Leiaorgana@email.com,23
5,Chew,Bacca,Chewbacca@email.com,204
6,Master,Yoda,Masteryoda@force.com,900
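
To make the difference between the two axis values concrete, here is a minimal sketch. The small `demo` frame is a hypothetical example made up for illustration and is not the tutorial's `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: only the "email" column contains a missing value.
demo = pd.DataFrame({
    "first": ["Ben", "Han", "Luke"],
    "last": ["Kenobi", "Solo", "Skywalker"],
    "email": ["Benkenobi@email.com", np.nan, "Lukeskywalker@email.com"],
})

# axis='index' drops the rows that contain a missing value.
rows_dropped = demo.dropna(axis="index", how="any")

# axis='columns' drops the columns that contain a missing value instead.
cols_dropped = demo.dropna(axis="columns", how="any")

print(rows_dropped.shape)          # two rows and three columns remain
print(list(cols_dropped.columns))  # only "first" and "last" remain
```

Dropping columns keeps every row but loses the whole "email" column, so it is usually only worth doing when a column is mostly empty.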


In [5]:
#If a row with at least one value is fine and you only want to get rid of rows where every value is missing, you can change "how" to "all".
df.dropna(axis='index', how='all')

Unnamed: 0,first,last,email,Age
0,Ben,Kenobi,Benkenobi@email.com,57.0
1,Han,Solo,Hansolo@email.com,33.0
2,Luke,Skywalker,Lukeskywalker@email.com,23.0
3,Anakin,Skywalker,Anakinskywalker@vader.com,45.0
4,Leia,Organa,Leiaorgana@email.com,23.0
5,Chew,Bacca,Chewbacca@email.com,204.0
6,Master,Yoda,Masteryoda@force.com,900.0
7,,,,
8,,,Anonymous@email.com,
9,,Missing,,


In [6]:
#Say you only want to drop rows that are missing a value in a specific column. To do this, you can pass in a subset argument.
df.dropna(axis='index', how='any', subset=['last'])
#this only checks the last column
#As in our previous tutorials, doing this does not permanently change our data frame. In order to do that, you'd need to add the inplace=True argument.

Unnamed: 0,first,last,email,Age
0,Ben,Kenobi,Benkenobi@email.com,57.0
1,Han,Solo,Hansolo@email.com,33.0
2,Luke,Skywalker,Lukeskywalker@email.com,23.0
3,Anakin,Skywalker,Anakinskywalker@vader.com,45.0
4,Leia,Organa,Leiaorgana@email.com,23.0
5,Chew,Bacca,Chewbacca@email.com,204.0
6,Master,Yoda,Masteryoda@force.com,900.0
9,,Missing,,
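
The subset argument also accepts several columns, and combining it with how='all' keeps a row as long as at least one of those columns has a value. The sketch below uses a hypothetical `demo` frame and also shows reassignment as an alternative to inplace=True.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only.
demo = pd.DataFrame({
    "first": ["Ben", np.nan, np.nan],
    "last": ["Kenobi", np.nan, np.nan],
    "email": ["Benkenobi@email.com", "Anonymous@email.com", np.nan],
})

# Keep a row if at least one of "last" or "email" is present.
has_contact = demo.dropna(axis="index", how="all", subset=["last", "email"])
print(list(has_contact.index))  # rows 0 and 1 survive, row 2 has neither

# dropna returns a new frame, so demo still has all three rows here.
print(len(demo))

# Reassigning the result makes the change stick without inplace=True.
demo = demo.dropna(axis="index", how="any", subset=["last"])
print(list(demo.index))  # only row 0 has a last name
```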


Since we created our data frame from scratch, we can replace the custom placeholder strings ('NA' and 'Missing') with real NaN values ourselves. Later we will look at handling these placeholders when the data frame is loaded from a CSV file.

In [7]:
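#Here I am replacing the 'NA' and 'Missing' placeholder strings with np.nan
#(use the lowercase np.nan; the np.NaN alias was removed in NumPy 2.0)
df = pd.DataFrame(people)
df.replace('NA', np.nan, inplace=True)
df.replace('Missing', np.nan, inplace=True)
#The Age column uses the string 'None' for unknown ages; here it becomes the string '0'
df.replace('None', '0', inplace=True)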
df

Unnamed: 0,first,last,email,Age
0,Ben,Kenobi,Benkenobi@email.com,57
1,Han,Solo,Hansolo@email.com,33
2,Luke,Skywalker,Lukeskywalker@email.com,23
3,Anakin,Skywalker,Anakinskywalker@vader.com,45
4,Leia,Organa,Leiaorgana@email.com,23
5,Chew,Bacca,Chewbacca@email.com,204
6,Master,Yoda,Masteryoda@force.com,900
7,,,,0
8,,,Anonymous@email.com,0
9,,,,0
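
When several placeholder strings should all become NaN, replace() also accepts a list, so one call can do the work of two. This is a minimal sketch on a hypothetical `demo` frame, not the tutorial's `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame that uses two different placeholder strings.
demo = pd.DataFrame({
    "first": ["Ben", "NA", "Leia"],
    "last": ["Kenobi", "Missing", "Organa"],
})

# Passing a list replaces every listed placeholder in a single call.
cleaned = demo.replace(["NA", "Missing"], np.nan)

# Count the cells that are now recognized as missing.
total_missing = int(cleaned.isna().sum().sum())
print(total_missing)
```

Because inplace=True is not used here, `demo` keeps its placeholder strings and `cleaned` holds the result.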


In [8]:
#You can use .isna() to see whether or not each value is a NaN value.
df.isna()

Unnamed: 0,first,last,email,Age
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,False,False
5,False,False,False,False
6,False,False,False,False
7,True,True,True,False
8,True,True,False,False
9,True,True,True,False
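
On a larger data frame the full True/False mask is hard to read, so it is common to summarize it. The sketch below, on a hypothetical `demo` frame, counts missing values per column and selects the rows that have any.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only.
demo = pd.DataFrame({
    "first": ["Ben", np.nan, np.nan],
    "email": ["Benkenobi@email.com", "Anonymous@email.com", np.nan],
    "Age": ["57", "0", "0"],
})

# True counts as 1 when summed, so this gives the missing count per column.
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Boolean mask over rows: True where at least one value in the row is missing.
rows_with_missing = demo[demo.isna().any(axis="columns")]
print(rows_with_missing)
```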


In [9]:
#fillna() replaces every NaN value with the value you pass in. This can be even more useful if you're trying to analyze numerical data.
df.fillna('MISSING')

Unnamed: 0,first,last,email,Age
0,Ben,Kenobi,Benkenobi@email.com,57
1,Han,Solo,Hansolo@email.com,33
2,Luke,Skywalker,Lukeskywalker@email.com,23
3,Anakin,Skywalker,Anakinskywalker@vader.com,45
4,Leia,Organa,Leiaorgana@email.com,23
5,Chew,Bacca,Chewbacca@email.com,204
6,Master,Yoda,Masteryoda@force.com,900
7,MISSING,MISSING,MISSING,0
8,MISSING,MISSING,Anonymous@email.com,0
9,MISSING,MISSING,MISSING,0
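
A single fill value rarely suits every column. As a sketch of one option, fillna() also accepts a dictionary that maps column names to fill values. The `demo` frame and the chosen fill values below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
demo = pd.DataFrame({
    "first": ["Ben", np.nan],
    "Age": [57.0, np.nan],
})

# Fill each column with a value that matches its type.
filled = demo.fillna({"first": "MISSING", "Age": 0})
print(filled)
```

Filling a numeric column with 0 keeps it numeric, but the zeros then count as real values in calculations such as the mean.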


In [10]:
#If you're trying to find the average of a column, you first have to check what type each column is.
df.dtypes

first    object
last     object
email    object
Age      object
dtype: object

In [11]:
#The Age column holds strings (dtype object), so we need to convert it to the float datatype.
df['Age'] = df['Age'].astype(float)


In [12]:
#We can now get a mean for all of the ages
df['Age'].mean()

128.5

In [13]:
#You can also compute the standard deviation
df['Age'].std()

277.67176866060964
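
Note that the three unknown ages were turned into 0 above, so they pull the mean down. As an alternative sketch (not the approach used in this tutorial), pd.to_numeric with errors='coerce' turns anything that is not a number into NaN, and mean() skips NaN by default. The `ages` series below copies the tutorial's Age values.

```python
import pandas as pd

# Same values as the tutorial's Age column, with 'None' for unknown ages.
ages = pd.Series(["57", "33", "23", "45", "23", "204", "900", "None", "None", "None"])

# errors='coerce' converts unparseable strings such as 'None' to NaN.
ages_numeric = pd.to_numeric(ages, errors="coerce")

# Three values could not be parsed, so three NaN values appear.
print(ages_numeric.isna().sum())

# mean() ignores NaN, so this averages only the seven known ages.
print(ages_numeric.mean())
```

Whether unknown ages should count as 0 or be left out depends on the analysis; leaving them as NaN avoids treating a missing age as a real measurement.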