<a href="https://colab.research.google.com/github/brunofbpaula/DataScience-UM-Coursera/blob/main/Pandas/DataFrame/MissingData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Missing Values

If you're running a survey and a respondent didn't answer one question, the missing value is actually an omission. This is called ***Missing at Random*** if there are other variables that might be used to predict the variable which is missing. If there's no relationship to other variables, the data is ***Missing Completely at Random***.

Data might also be missing because it wasn't collected, either because the process responsible for collecting it (such as a researcher) didn't record it, or because it wouldn't make sense to collect it. The latter is extremely common when joining DataFrames from multiple sources.
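
As a rough illustration of how a join introduces missing values, the sketch below merges two small made-up tables. The names `students`, `staff`, and `people`, and all of the values, are assumptions for this example rather than data from this notebook.

```python
import pandas as pd

# Hypothetical tables from two different sources (all values are made up).
students = pd.DataFrame({"name": ["ana", "ben", "cai"], "grade": [88, 92, 75]})
staff = pd.DataFrame({"name": ["ben", "dee"], "salary": [3000, 4200]})

# An outer join keeps every person, so fields that do not apply
# to a given row come back as NaN.
people = students.merge(staff, on="name", how="outer")
print(people)
print(people.isnull().sum())
```

Here a grade makes no sense for someone who is only staff, and a salary makes no sense for someone who is only a student, so the join leaves those cells empty.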

In [None]:
import pandas as pd

In [None]:
# The pandas read_csv() function has a parameter called na_values that lets us
# specify which values should be treated as missing. It accepts a scalar, string, list, or dict.

grades = pd.read_csv('class_grades.csv')
grades.head()

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,,63.15,48.89
3,7,,,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
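
The `na_values` parameter mentioned above could be used roughly as follows. This is a minimal sketch on an in-memory CSV; the column names and the markers `"unknown"` and `"-"` are assumptions chosen for illustration, not something found in `class_grades.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV text in which "unknown" and "-" mark missing grades.
raw = io.StringIO("student,midterm,final\nana,78,unknown\nben,-,91\ncai,66,70\n")

# na_values tells read_csv which extra strings to treat as missing.
marked = pd.read_csv(raw, na_values=["unknown", "-"])
print(marked)
print(marked.isnull().sum())
```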


In [None]:
# Boolean mask
mask = grades.isnull()
mask.head()

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,True,False,False
3,False,True,True,False,False,False
4,False,False,False,False,False,False
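
A Boolean mask like this is often summarized rather than read cell by cell. The sketch below uses a small stand-in table (`sample`, with made-up values) to count missing entries per column and to select the rows that contain any.

```python
import numpy as np
import pandas as pd

# Small stand-in for the grades table (values are made up).
sample = pd.DataFrame({
    "Assignment": [57.14, 95.05, 83.70, np.nan],
    "Midterm": [64.38, 67.50, np.nan, 49.38],
    "Final": [52.50, 68.33, 48.89, 80.56],
})

missing_per_column = sample.isnull().sum()       # number of NaN values in each column
rows_with_missing = sample.isnull().any(axis=1)  # True where a row has at least one NaN

print(missing_per_column)
print(sample[rows_with_missing])
```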


In [None]:
# Dropping every row that has any missing data
grades.dropna().head()

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
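
Dropping every incomplete row can discard a lot of data. As a hedged sketch, `dropna()` also accepts `subset`, `how`, and `thresh` to narrow what gets dropped; the `sample` table below is made up for the example.

```python
import numpy as np
import pandas as pd

# Stand-in table with made-up values; the last row is entirely missing.
sample = pd.DataFrame({
    "Assignment": [57.14, 83.70, np.nan, np.nan],
    "Midterm": [64.38, np.nan, np.nan, np.nan],
    "Final": [52.50, 48.89, 80.56, np.nan],
})

complete_rows = sample.dropna()              # default: drop rows with any NaN
has_final = sample.dropna(subset=["Final"])  # only consider the Final column
not_empty = sample.dropna(how="all")         # drop rows where every value is NaN
mostly_there = sample.dropna(thresh=2)       # keep rows with at least 2 non-NaN values

print(len(complete_rows), len(has_final), len(not_empty), len(mostly_there))
```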


In [None]:
# Filling function: fillna()
# It accepts several kinds of arguments.
# The simplest is a scalar value (a single value that
# replaces all of the missing data)

# Filling missing values with zeros (in place)
grades.fillna(0, inplace=True)
grades.head()

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,0.0,63.15,48.89
3,7,0.0,0.0,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
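
A single scalar is not the only option. As a sketch under the assumption that different columns deserve different defaults, `fillna()` can also take a dictionary keyed by column name. The table and the choice of fill values below are illustrative only.

```python
import numpy as np
import pandas as pd

# Stand-in table with made-up values.
sample = pd.DataFrame({
    "Assignment": [50.0, np.nan, 70.0],
    "Midterm": [np.nan, 80.0, 60.0],
})

# Fill each column with its own value: zero for a missing assignment,
# the column mean for a missing midterm.
filled = sample.fillna({"Assignment": 0, "Midterm": sample["Midterm"].mean()})
print(filled)
```

Without `inplace=True`, `fillna()` returns a new DataFrame and leaves `sample` unchanged.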


In [None]:
# It's also possible to use the na_filter option to turn off detection of
# missing-value markers (empty strings and the values in na_values), which helps
# when an empty or whitespace field is an actual value of interest. In data without
# any NAs, passing na_filter=False can improve the performance of reading a large file.
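
The effect of `na_filter` can be seen on a tiny in-memory CSV. This is a minimal sketch; the `comment` column and its contents are made up for the example.

```python
import io

import pandas as pd

# Hypothetical CSV where an empty field and the text "NA" are real values.
csv_text = "user,comment\ncheryl,\nbob,NA\n"

# Default behaviour: both fields are detected as missing.
default = pd.read_csv(io.StringIO(csv_text))

# With na_filter=False nothing is converted to NaN; the strings are kept as-is.
unfiltered = pd.read_csv(io.StringIO(csv_text), na_filter=False)

print(default["comment"].isnull().sum())
print(unfiltered["comment"].tolist())
```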

## Log case

Let's imagine a scenario: we have to deal with logs from online learning systems, and we are interested in looking at video use in lecture capture systems.

In these systems, it's common for the player to have a heartbeat functionality where playback statistics are sent to the server every so often, maybe every 30 seconds.

These heartbeats can get big, as they can carry the whole state of the playback system: where the video play head is, what the video size is, which video is being rendered to the screen, how loud the volume is, and so on.

In [None]:
# Loading the file
log = pd.read_csv('log.csv')
log.head(10)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


In [None]:
# In this data, the first column is a timestamp in the Unix epoch format.
# The next column is the user name, followed by the web page they're visiting
# and the video that they're playing. Every row of the DataFrame has a playback position.
# As the playback position increases by one, the timestamp increases by about 30 seconds.

# The exception is Bob, who has paused his playback, so as time increases the playback position doesn't change.

It is very difficult for us to try and derive this knowledge from the data, because it's not sorted by timestamp as one might expect. This is actually not uncommon on systems which have a high degree of parallelism.
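
One way to confirm that timestamps are out of order, sketched here on a few made-up rows shaped like the log, is to check whether the column is monotonically increasing. Converting the epoch seconds to datetimes also makes the values easier to read. The `events` frame and the `when` column are assumptions for this example.

```python
import pandas as pd

# A few rows shaped like the log, in the same unsorted order.
events = pd.DataFrame({
    "time": [1469974424, 1469974454, 1469977514, 1469974604],
    "user": ["cheryl", "cheryl", "bob", "cheryl"],
})

# False here, because Bob's later timestamp appears before Cheryl's earlier one.
in_order = events["time"].is_monotonic_increasing
print(in_order)

# Unix epoch seconds converted to readable timestamps (UTC).
events["when"] = pd.to_datetime(events["time"], unit="s")
print(events)
```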

There are a lot of missing values in the paused and volume columns. It's not efficient to send this information across the network if it hasn't changed, so this particular system just inserts null values into the database when there are no changes.

## The method parameter

In [None]:
# The two common fill methods are ffill and bfill. ffill is forward filling:
# it replaces an NA value in a cell with the value from the previous row. bfill is backward filling,
# the opposite of ffill: it fills the missing values with the next valid value.
# Note that the data needs to be sorted for this to have the expected effect.
# In pandas, we can sort either by index or by values.


# Promoting the timestamp to an index and then sorting on the index
log = log.set_index('time')
log = log.sort_index()
log.head(10)

time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,
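
Sorting by values is the other option mentioned in the comments above. The sketch below, on a made-up `events` frame, shows both routes side by side. They produce the same row order, and only differ in whether `time` remains a column or becomes the index.

```python
import pandas as pd

# A few rows shaped like the log, deliberately out of order.
events = pd.DataFrame({
    "time": [1469974424, 1469974454, 1469977514, 1469974604],
    "user": ["cheryl", "cheryl", "bob", "cheryl"],
})

by_value = events.sort_values("time")             # sort by a column's values
by_index = events.set_index("time").sort_index()  # promote to index, then sort

print(by_value)
print(by_index)
```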


It's still not perfect. Taking a close look at the output, we notice that the index is not unique, because two users can use the system at the same time. This is a very common case.
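
Rather than spotting duplicates by eye, the index can be checked directly. This is a minimal sketch on made-up rows; `indexed` and `multi` are names chosen for the example.

```python
import pandas as pd

# Made-up rows where two users share the same timestamp.
indexed = pd.DataFrame(
    {"user": ["cheryl", "sue", "cheryl"], "position": [5, 23, 6]},
    index=pd.Index([1469974424, 1469974424, 1469974454], name="time"),
)

print(indexed.index.is_unique)           # is every index label distinct?
dupes = indexed.index.duplicated(keep=False)
print(indexed[dupes])                    # rows whose timestamp appears more than once

# Adding the user as a second index level makes each (time, user) pair unique.
multi = indexed.set_index("user", append=True)
print(multi.index.is_unique)
```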

In [None]:
# Resetting the index and using multi-level indexing
log = log.reset_index()
log = log.set_index(['time', 'user'])
log.head(10)

time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,


Now that the DataFrame is appropriately indexed and sorted, it's time to fill all missing data using ffill. Remember that when dealing with missing values we can work on individual columns or sets of columns by projecting them. There's no need to fix everything in a single command.

In [14]:
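# Forward fill. Older pandas code writes this as log.fillna(method='ffill');
# the method argument is deprecated since pandas 2.1 in favor of ffill() and bfill().
log = log.ffill()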
log.head()

time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,False,10.0
1469974454,sue,advanced.html,24,False,10.0
1469974484,cheryl,intro.html,7,False,10.0
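
Two variations on the fill above may be worth sketching. The first projects a single column and fills only that one. The second is a possible refinement that is not part of the original walkthrough: a plain forward fill copies the previous row even when it belongs to a different user, so grouping by the `user` index level keeps each user's values separate. The `events` frame below is made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up heartbeat rows for two users, indexed like the log above.
events = pd.DataFrame({
    "time": [1, 1, 2, 2, 3],
    "user": ["cheryl", "sue", "cheryl", "sue", "cheryl"],
    "paused": pd.array([False, True, None, None, None], dtype="boolean"),
    "volume": [10.0, 5.0, np.nan, np.nan, np.nan],
}).set_index(["time", "user"])

# Project one column and fill only that one; "paused" is left untouched.
volume_only = events.copy()
volume_only["volume"] = volume_only["volume"].ffill()

# Forward fill within each user, so Sue's values never leak into Cheryl's rows.
per_user = events.groupby(level="user").ffill()

print(volume_only)
print(per_user)
```

In `volume_only`, Cheryl's later rows inherit Sue's volume of 5.0 because Sue's row comes immediately before them. In `per_user`, each user carries forward only their own last known state.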


Another approach is a customized fill-in that replaces values with the replace() function.
It supports several kinds of replacement: value-to-value, list, dictionary, and regex.

In [16]:
dataframe = pd.DataFrame({'A': [1, 7, 14, 21],'B': [1, 8, 16, 24], 'C': [1, 9, 18, 27]})

dataframe

Unnamed: 0,A,B,C
0,1,1,1
1,7,8,9
2,14,16,18
3,21,24,27


In [17]:
# Replacing 1's with 99
dataframe.replace(1, 99)

Unnamed: 0,A,B,C
0,99,99,99
1,7,8,9
2,14,16,18
3,21,24,27


In [18]:
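# Replacing two values at the same time: 1 becomes 7 and 99 becomes 91
# (there are no 99s here, because the previous replace() returned a new DataFrame)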
dataframe.replace([1, 99], [7, 91])

Unnamed: 0,A,B,C
0,7,7,7
1,7,8,9
2,14,16,18
3,21,24,27
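
The dictionary form mentioned above can be sketched in the same way. A flat dictionary maps old values to new ones across the whole DataFrame, and a nested dictionary restricts the replacement to one column. The replacement values here are arbitrary choices for the example.

```python
import pandas as pd

dataframe = pd.DataFrame({'A': [1, 7, 14, 21], 'B': [1, 8, 16, 24], 'C': [1, 9, 18, 27]})

# Flat dict: every 1 becomes 100 and every 7 becomes 700, in all columns.
mapped = dataframe.replace({1: 100, 7: 700})

# Nested dict: only the 1 in column 'A' is replaced.
only_a = dataframe.replace({'A': {1: -1}})

print(mapped)
print(only_a)
```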


## Replacing with Regex

Back to the logs.

In [19]:
# Loading the file
log = pd.read_csv('log.csv')
log.head(10)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


In [20]:
# Replacing every value that ends in '.html' with 'webpage'
# (the dot is escaped so it matches a literal '.' rather than any character)
log.replace(to_replace=r'.*\.html$', value='webpage', regex=True)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,webpage,5,False,10.0
1,1469974454,cheryl,webpage,6,,
2,1469974544,cheryl,webpage,9,,
3,1469974574,cheryl,webpage,10,,
4,1469977514,bob,webpage,1,,
5,1469977544,bob,webpage,1,,
6,1469977574,bob,webpage,1,,
7,1469977604,bob,webpage,1,,
8,1469974604,cheryl,webpage,11,,
9,1469974694,cheryl,webpage,14,,
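
To see why escaping the dot matters, and how to limit the replacement to one column, consider the sketch below. The `videos` frame, including the deliberately odd value `notes_html`, is made up for the example.

```python
import pandas as pd

# Made-up rows; "notes_html" has no real ".html" extension.
videos = pd.DataFrame({
    "user": ["cheryl", "bob", "sue"],
    "video": ["intro.html", "advanced.html", "notes_html"],
})

# Unescaped dot: "." matches any character, so "notes_html" is replaced too.
loose = videos["video"].replace(r".*.html$", "webpage", regex=True)

# Escaped dot: only values that really end in ".html" are replaced.
strict = videos["video"].replace(r".*\.html$", "webpage", regex=True)

# Series.str.replace swaps only the matched part, here stripping the extension.
stripped = videos["video"].str.replace(r"\.html$", "", regex=True)

print(loose.tolist())
print(strict.tolist())
print(stripped.tolist())
```

Calling `replace()` on `videos["video"]` rather than on the whole DataFrame avoids touching other string columns that might happen to match the pattern.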
