# Handle Missing Data (replace function)

In [7]:
import pandas as pd
df = pd.read_csv("7-6_weather_data.csv")
df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,-99999,7,Sunny
2,1/3/2017,28,-99999,Snow
3,1/4/2017,-99999,7,0
4,1/5/2017,32,-99999,Rain
5,1/6/2017,31,2,Sunny
6,1/6/2017,34,5,0


You can see here that there are several -99999 values. These are special (sentinel) values: the data is missing, but instead of leaving the cell blank, the source inserted a placeholder number. This happens often when you download data from an external source, and you have to deal with these values before analysing the data.
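Before replacing anything, it can help to count how often the sentinel occurs in each column. The following is a minimal sketch on a small hand-built frame; the `sample` name and its values are made up for illustration and only mirror the weather data above.

```python
import pandas as pd

# Hypothetical sample mirroring the weather data; -99999 marks a missing reading.
sample = pd.DataFrame({
    "temperature": [32, -99999, 28, -99999],
    "windspeed": [6, 7, -99999, 7],
    "event": ["Rain", "Sunny", "Snow", "0"],
})

SENTINEL = -99999

# Comparing the whole frame to the sentinel gives a boolean frame;
# summing that frame counts the matches in each column.
sentinel_counts = (sample == SENTINEL).sum()
print(sentinel_counts)
```

A non-zero count tells you which columns need cleaning.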

In [8]:
# float("nan") is a real missing value (the same as numpy.nan), not the string "NaN"
new_df = df.replace(-99999, float("nan"))
new_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/2/2017,,7.0,Sunny
2,1/3/2017,28.0,,Snow
3,1/4/2017,,7.0,0
4,1/5/2017,32.0,,Rain
5,1/6/2017,31.0,2.0,Sunny
6,1/6/2017,34.0,5.0,0


In [9]:
# A list replaces several sentinel values with the same missing value
new_df = df.replace([-99999, -88888], float("nan"))
new_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/2/2017,,7.0,Sunny
2,1/3/2017,28.0,,Snow
3,1/4/2017,,7.0,0
4,1/5/2017,32.0,,Rain
5,1/6/2017,31.0,2.0,Sunny
6,1/6/2017,34.0,5.0,0


The **replace()** function replaces every occurrence of its first argument with its second.

What if you have two special values, for instance some -99999 and some -88888? The second example passes a list to replace more than one value at once.
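The list form can be tried on its own with a small frame that actually contains both sentinels. This is a sketch with made-up readings (`sample` and `cleaned` are illustrative names), using `numpy.nan` as the missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical readings that use two different sentinels for "missing".
sample = pd.DataFrame({
    "temperature": [32, -99999, 28, -88888],
    "windspeed": [6, 7, -88888, -99999],
})

# Every value in the list is replaced by the same replacement value.
cleaned = sample.replace([-99999, -88888], np.nan)

print(cleaned)
print(cleaned.isna().sum())  # missing values per column
```

Because the result holds real missing values, functions such as `isna()` and `mean()` treat those cells as missing.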

In [10]:
# Keys are column names; each value is what to look for in that column
new_df = df.replace({
    'temperature': -99999,
    'windspeed': -99999,
    'event': '0'
    }, float("nan"))
new_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/2/2017,,7.0,Sunny
2,1/3/2017,28.0,,Snow
3,1/4/2017,,7.0,
4,1/5/2017,32.0,,Rain
5,1/6/2017,31.0,2.0,Sunny
6,1/6/2017,34.0,5.0,


In [12]:
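# Keys are old values and values are their replacements
# ('No Event' does not occur in this data, so the event column is unchanged)
new_df = df.replace({
    -99999: float("nan"),
    'No Event': 'Sunny'
    })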
new_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/2/2017,,7.0,Sunny
2,1/3/2017,28.0,,Snow
3,1/4/2017,,7.0,0
4,1/5/2017,32.0,,Rain
5,1/6/2017,31.0,2.0,Sunny
6,1/6/2017,34.0,5.0,0


Pass a dictionary to **replace()** to specify which value to replace in which column. This is useful when you want to replace all the -99999 values in one column and all the 0 values in another.

Alternatively, pass a dictionary that maps each old value to its new value, as in the second example.
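The per-column form matters when the same value is valid in one column and a placeholder in another. In the sketch below (hypothetical data, illustrative names), a temperature of 0 is a genuine reading, while the string "0" in the event column means no event was recorded.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 0 is a real temperature, but "0" is a placeholder event.
sample = pd.DataFrame({
    "temperature": [32, -99999, 0],
    "windspeed": [6, -99999, 7],
    "event": ["Rain", "0", "Sunny"],
})

# Keys are column names; each value is searched for in that column only.
cleaned = sample.replace(
    {"temperature": -99999, "windspeed": -99999, "event": "0"},
    np.nan,
)
print(cleaned)
```

The temperature of 0 in the last row is kept, which would not be the case if 0 were replaced across the whole frame.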

In [13]:
new_df = df.replace('[A-Za-z]','',regex=True)
new_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,
1,1/2/2017,-99999,7,
2,1/3/2017,28,-99999,
3,1/4/2017,-99999,7,0.0
4,1/5/2017,32,-99999,
5,1/6/2017,31,2,
6,1/6/2017,34,5,0.0


In [14]:
# Apply the pattern only to the listed columns; 'event' is left alone
new_df = df.replace({
    'temperature': '[A-Za-z]',
    'windspeed': '[A-Za-z]',
    }, '', regex=True)
new_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,-99999,7,Sunny
2,1/3/2017,28,-99999,Snow
3,1/4/2017,-99999,7,0
4,1/5/2017,32,-99999,Rain
5,1/6/2017,31,2,Sunny
6,1/6/2017,34,5,0


Say you want to remove units of measurement that people have typed into some cells (e.g. 32 F instead of just 32). Pass **regex=True** to **replace()** so that the first argument is treated as a pattern: here it matches any character in A-Z or a-z and replaces it with an empty string.

However, you don't want to strip the letters from the "event" column, whose values are words such as "Sunny". The second example uses a dictionary to specify which columns the pattern applies to.

**regex** = regular expression, a pattern used to match text.
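The weather file above contains no unit suffixes, so the effect is easier to see on data that does. The sketch below uses made-up cells such as "32 F". The pattern `\s*[A-Za-z]+` is an assumption that also removes the space before the unit, and the `pd.to_numeric` step is an addition for turning the cleaned strings back into numbers.

```python
import pandas as pd

# Hypothetical readings where units were typed into some cells.
sample = pd.DataFrame({
    "temperature": ["32 F", "28", "31 F"],
    "windspeed": ["6 mph", "7", "2 mph"],
    "event": ["Rain", "Snow", "Sunny"],
})

# Strip letters (and any whitespace before them) only in the numeric columns.
pattern = r"\s*[A-Za-z]+"
cleaned = sample.replace(
    {"temperature": pattern, "windspeed": pattern}, "", regex=True
)

# The cleaned cells are still strings, so convert them to numbers.
for column in ["temperature", "windspeed"]:
    cleaned[column] = pd.to_numeric(cleaned[column])

print(cleaned)
```

The "event" column is not listed in the dictionary, so its words are left intact.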

In [18]:
df = pd.DataFrame({
    'score': ['exceptional', 'average', 'good', 'poor', 'average', 'exceptional'],
    'student': ['rob', 'maya', 'parthiv', 'tom', 'julian', 'erica']
    })
df

Unnamed: 0,score,student
0,exceptional,rob
1,average,maya
2,good,parthiv
3,poor,tom
4,average,julian
5,exceptional,erica


In [20]:
new_df = df.replace(['poor', 'average', 'good', 'exceptional'], [1,2,3,4])
new_df

Unnamed: 0,score,student
0,4,rob
1,2,maya
2,3,parthiv
3,1,tom
4,2,julian
5,4,erica


You can also replace a list of values with another list of values; each item in the first list is replaced by the item at the same position in the second.

Here, we construct a different dataframe of scores and students. You often want numbers in the "score" column. Let's say I have an internal map from the word scores to numbers. For instance, "poor" is 1.

The second cell shows how you can use **replace()** to swap the word scores for the number scores.
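The same mapping can be written as a dictionary, which keeps each word next to its number, and it can be limited to one column by nesting it under the column name. This is a sketch, not the only way to do it; `score_map` and `scores` are illustrative names. Depending on the pandas version, the resulting column may keep the object dtype or trigger a downcasting warning.

```python
import pandas as pd

scores = pd.DataFrame({
    "score": ["exceptional", "average", "good", "poor"],
    "student": ["rob", "maya", "parthiv", "tom"],
})

# Word score -> number score, matching the lists used above.
score_map = {"poor": 1, "average": 2, "good": 3, "exceptional": 4}

# The outer key restricts the replacement to the "score" column,
# so a student who happened to be named "good" would be left alone.
numeric_scores = scores.replace({"score": score_map})
print(numeric_scores)
```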