#### Week 3 - Visual Data Analysis

# Lesson 2: Data Cleansing and Missing Data

### Objectives

1. inspecting a dataframe
2. renaming columns
3. replacing values in columns
4. dropping duplicates
5. changing column types
6. dealing with missing values

> ## Warm-up 1
> 1. Read through the article [Data Prep Still Dominates Data Scientists’ Time, Survey Finds!](https://www.datanami.com/2020/07/06/data-prep-still-dominates-data-scientists-time-survey-finds/)  
> 2. Collect your main takeaways!

## Read Data
1. Read the file `messy_data.csv` into a dataframe
2. What seems wrong to you in this dataset? What would you change?

In [1]:
import pandas as pd


In [2]:
# read the `messy_data.csv` into a `data` variable

data = pd.read_csv('./data/messy_data.csv')

In [3]:
data

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
5,60,'2020/12/06',102,127,300.0
6,60,'2020/12/07',110,136,374.0
7,450,'2020/12/08',104,134,253.3
8,30,'2020/12/09',109,133,195.1
9,60,'2020/12/10',98,124,269.0


## 1. inspecting a dataframe

In [4]:
# show the first 5 rows 

data.head()

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0


In [5]:
# show the info of the dataframe - do you see something suspicious here?

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  32 non-null     int64  
 1   Date      31 non-null     object 
 2   Pulse     32 non-null     int64  
 3   Maxpulse  32 non-null     int64  
 4   Calories  30 non-null     float64
dtypes: float64(1), int64(3), object(1)
memory usage: 1.4+ KB


In [6]:
# run describe method on your dataframe

data.describe()

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
count,32.0,32.0,32.0,30.0
mean,68.4375,103.5,128.5,304.68
std,70.039591,7.832933,12.998759,66.003779
min,30.0,90.0,101.0,195.1
25%,60.0,100.0,120.0,250.7
50%,60.0,102.5,127.5,291.2
75%,60.0,106.5,132.25,343.975
max,450.0,130.0,175.0,479.0


In [7]:
data.describe(include='O')

Unnamed: 0,Date
count,31
unique,30
top,'2020/12/12'
freq,2


In [8]:
# checking for missing values

data.isnull()

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
5,False,False,False,False,False
6,False,False,False,False,False
7,False,False,False,False,False
8,False,False,False,False,False
9,False,False,False,False,False


In [9]:
# calculate how many missing values we have for different columns

data.isnull().sum()

Duration    0
Date        1
Pulse       0
Maxpulse    0
Calories    2
dtype: int64

**What problems can you identify after the initial data exploration?**

1. Column names are capitalized
2. There is a misformatted date value in row 26
3. There seems to be a typo in row 7: a duration of 450 min, yet calories in the same range as the other sessions
4. Rows 11 and 12 seem to be duplicates
5. `Date` is a string rather than a pandas datetime
6. The `Date` strings also contain extra single quotes
7. Two columns contain missing values: `Date` and `Calories`
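
Several of the problems listed above can also be surfaced programmatically rather than by eye. The following is a minimal sketch on a small, hypothetical stand-in for the lesson data (the `sample` frame and its values are assumptions made only so the example runs on its own):

```python
import pandas as pd

# Hypothetical miniature of the lesson data, for illustration only
sample = pd.DataFrame({
    "Duration": [60, 60, 60, 450],
    "Date": ["'2020/12/11'", "'2020/12/12'", "'2020/12/12'", "20201226"],
    "Calories": [329.0, 250.7, 250.7, None],
})

# rows that appear more than once (keep=False marks every copy)
duplicate_rows = sample[sample.duplicated(keep=False)]

# dates that do not follow the YYYY/MM/DD pattern once the quotes are stripped
cleaned_dates = sample["Date"].str.strip("'")
bad_dates = sample[~cleaned_dates.str.match(r"^\d{4}/\d{2}/\d{2}$")]

# number of missing values per column
missing_per_column = sample.isnull().sum()
```

Checks like these scale better than scrolling through the table once a dataset has more than a few dozen rows.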


## 2. Renaming columns

### Option 1

In [10]:
# display the current column names

data.columns

Index(['Duration', 'Date', 'Pulse', 'Maxpulse', 'Calories'], dtype='object')

In [11]:
# rename all the column names using the column attribute so they only consist of lower case letters

data.columns = ['duration', 'date', 'pulse', 'maxpulse', 'calories']

In [12]:
data.head()

Unnamed: 0,duration,date,pulse,maxpulse,calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0


### Option 2

In [13]:
# replacing the name of a specific column using the rename method

data.rename(columns={'duration':'duration_min', 'pulse': 'avg_pulse'})

Unnamed: 0,duration_min,date,avg_pulse,maxpulse,calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
5,60,'2020/12/06',102,127,300.0
6,60,'2020/12/07',110,136,374.0
7,450,'2020/12/08',104,134,253.3
8,30,'2020/12/09',109,133,195.1
9,60,'2020/12/10',98,124,269.0


In [14]:
data.columns # renaming didn't stick

Index(['duration', 'date', 'pulse', 'maxpulse', 'calories'], dtype='object')

In [15]:
# remember to set inplace=True to make it persistent

data.rename(columns={'duration':'duration_min', 'pulse': 'avg_pulse'}, 
            inplace=True)

data.head()

Unnamed: 0,duration_min,date,avg_pulse,maxpulse,calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
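
As an alternative to `inplace=True`, the result of `rename` can simply be assigned to a variable. A minimal sketch, using a hypothetical two-column frame `df` rather than the lesson data:

```python
import pandas as pd

# Hypothetical frame, for illustration only
df = pd.DataFrame({"duration": [60, 45], "pulse": [110, 109]})

# rename returns a new DataFrame and leaves the original untouched
renamed = df.rename(columns={"duration": "duration_min", "pulse": "avg_pulse"})
```

Assigning the result back (for example `df = df.rename(...)`) has the same effect as `inplace=True` and makes the change visible in the code.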


In [16]:
# we can also manipulate column names as strings
# keep in mind any string manipulation done to a pandas Series or DataFrame requires `.str` before the method

data.columns = data.columns.str.capitalize()

In [17]:
data.head()

Unnamed: 0,Duration_min,Date,Avg_pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0


In [18]:
# reverting the change using the lower method

data.columns = data.columns.str.lower()

data.head()

Unnamed: 0,duration_min,date,avg_pulse,maxpulse,calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0


## 3. cleaning the date column - replacing values in columns

In [19]:
# what is the format of the values in the `date` column?

data.loc[4,'date']

"'2020/12/05'"

In [20]:
# let's remove single quotes from the values in the date column

data['date'].str.replace("'", "")

0     2020/12/01
1     2020/12/02
2     2020/12/03
3     2020/12/04
4     2020/12/05
5     2020/12/06
6     2020/12/07
7     2020/12/08
8     2020/12/09
9     2020/12/10
10    2020/12/11
11    2020/12/12
12    2020/12/12
13    2020/12/13
14    2020/12/14
15    2020/12/15
16    2020/12/16
17    2020/12/17
18    2020/12/18
19    2020/12/19
20    2020/12/20
21    2020/12/21
22           NaN
23    2020/12/23
24    2020/12/24
25    2020/12/25
26      20201226
27    2020/12/27
28    2020/12/28
29    2020/12/29
30    2020/12/30
31    2020/12/31
Name: date, dtype: object

In [21]:
# make sure the change is persistent

data['date'] = data['date'].str.replace("'", "")

In [22]:
data

Unnamed: 0,duration_min,date,avg_pulse,maxpulse,calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0
5,60,2020/12/06,102,127,300.0
6,60,2020/12/07,110,136,374.0
7,450,2020/12/08,104,134,253.3
8,30,2020/12/09,109,133,195.1
9,60,2020/12/10,98,124,269.0


In [23]:
# now let's replace the misformatted value in row 26
# first "locate" the cell in question
# assign a new value to it

data.loc[26,'date'] = '2020/12/26'

In [24]:
data

Unnamed: 0,duration_min,date,avg_pulse,maxpulse,calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0
5,60,2020/12/06,102,127,300.0
6,60,2020/12/07,110,136,374.0
7,450,2020/12/08,104,134,253.3
8,30,2020/12/09,109,133,195.1
9,60,2020/12/10,98,124,269.0


In [25]:
# same for the duration 450

data.loc[7,'duration_min'] = 45
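
When the row number of a suspicious value is not known in advance, a boolean mask can locate it. This is only a sketch: the `sample` frame and the `max_plausible_minutes` threshold are assumptions chosen for illustration, not values from the lesson.

```python
import pandas as pd

# Hypothetical durations, for illustration only
sample = pd.DataFrame({"duration_min": [60, 45, 450, 30]})

# assumed upper bound for a plausible workout length
max_plausible_minutes = 180

# select the rows whose duration exceeds the bound
suspicious = sample[sample["duration_min"] > max_plausible_minutes]

# .loc also accepts a boolean mask, so the fix does not need the row label
sample.loc[sample["duration_min"] == 450, "duration_min"] = 45
```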

In [26]:
data

Unnamed: 0,duration_min,date,avg_pulse,maxpulse,calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0
5,60,2020/12/06,102,127,300.0
6,60,2020/12/07,110,136,374.0
7,45,2020/12/08,104,134,253.3
8,30,2020/12/09,109,133,195.1
9,60,2020/12/10,98,124,269.0


## 4. dropping duplicate rows

In [27]:
# rows 11 and 12 are identical, we need to drop one of them

data.drop_duplicates(inplace=True)

In [28]:
data


Unnamed: 0,duration_min,date,avg_pulse,maxpulse,calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0
5,60,2020/12/06,102,127,300.0
6,60,2020/12/07,110,136,374.0
7,45,2020/12/08,104,134,253.3
8,30,2020/12/09,109,133,195.1
9,60,2020/12/10,98,124,269.0


In [29]:
# notice that the duplicate row is dropped, but the index now has a gap where that row used to be

# let's reset the index

data.reset_index(drop=True, inplace=True)
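
The two steps above (dropping duplicates, then resetting the index) can also be inspected and combined. A minimal sketch on a hypothetical frame; `ignore_index=True` is a `drop_duplicates` parameter available in pandas 1.0 and later:

```python
import pandas as pd

# Hypothetical frame with one repeated row, for illustration only
sample = pd.DataFrame({
    "date": ["2020/12/11", "2020/12/12", "2020/12/12", "2020/12/13"],
    "calories": [329.0, 250.7, 250.7, 345.3],
})

# duplicated() flags the second and later occurrences of a row
dup_mask = sample.duplicated()

# drop the duplicates and renumber the index in one call
deduplicated = sample.drop_duplicates(ignore_index=True)
```

Looking at `dup_mask` before dropping anything is a cheap way to confirm that only the intended rows will be removed.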


In [30]:
data

Unnamed: 0,duration_min,date,avg_pulse,maxpulse,calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0
5,60,2020/12/06,102,127,300.0
6,60,2020/12/07,110,136,374.0
7,45,2020/12/08,104,134,253.3
8,30,2020/12/09,109,133,195.1
9,60,2020/12/10,98,124,269.0


## 5. changing column types

In [31]:
# what is the date column data type?

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   duration_min  31 non-null     int64  
 1   date          30 non-null     object 
 2   avg_pulse     31 non-null     int64  
 3   maxpulse      31 non-null     int64  
 4   calories      29 non-null     float64
dtypes: float64(1), int64(3), object(1)
memory usage: 1.3+ KB


In [40]:
# See BONUS at the end of the lesson: the interpolation step would go here, instead of the conversion in the next cell

In [41]:
# let's change the format of the date column to datetime (we will also come back to it in week 4!)

data['date'] = pd.to_datetime(data['date'])
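
`pd.to_datetime` can also be told which format to expect. The sketch below, on a hypothetical `raw_dates` series, shows how `format=` together with `errors="coerce"` turns non-conforming strings into `NaT` so they can be found and fixed:

```python
import pandas as pd

# Hypothetical raw values, including one misformatted and one missing date
raw_dates = pd.Series(["2020/12/25", "20201226", None, "2020/12/27"])

# strings that do not match the format become NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, format="%Y/%m/%d", errors="coerce")

# values that were present in the raw data but could not be parsed
unparsed = raw_dates[parsed.isna() & raw_dates.notna()]
```

This makes misformatted entries such as `20201226` visible without having to spot them manually.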

In [42]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   duration_min  31 non-null     int64         
 1   date          30 non-null     datetime64[ns]
 2   avg_pulse     31 non-null     int64         
 3   maxpulse      31 non-null     int64         
 4   calories      29 non-null     float64       
dtypes: datetime64[ns](1), float64(1), int64(3)
memory usage: 1.3 KB


In [43]:
data.head()

Unnamed: 0,duration_min,date,avg_pulse,maxpulse,calories
0,60,2020-12-01,110,130,409.1
1,60,2020-12-02,117,145,479.0
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.4
4,45,2020-12-05,117,148,406.0


In [44]:
# let's also round the calories column and convert to integer

data['calories'].round()



0     409.0
1     479.0
2     340.0
3     282.0
4     406.0
5     300.0
6     374.0
7     253.0
8     195.0
9     269.0
10    329.0
11    251.0
12    345.0
13    379.0
14    275.0
15    215.0
16    300.0
17      NaN
18    323.0
19    243.0
20    364.0
21    282.0
22    300.0
23    246.0
24    334.0
25    250.0
26    241.0
27      NaN
28    280.0
29    380.0
30    243.0
Name: calories, dtype: float64

In [45]:
# changing the column to integer

data['calories'] = data['calories'].round().astype('Int64')
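
The capital `I` in `'Int64'` matters: it selects pandas' nullable integer type, which can hold missing values, whereas the plain NumPy `int64` type cannot. A minimal sketch with hypothetical calorie values:

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
calories = pd.Series([409.1, np.nan, 340.0])

# nullable integer dtype: the missing entry is kept as <NA>
nullable = calories.round().astype("Int64")

# the plain NumPy integer dtype cannot represent NaN, so the cast fails
try:
    calories.round().astype("int64")
    plain_cast_failed = False
except ValueError:
    plain_cast_failed = True
```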

In [46]:
data

Unnamed: 0,duration_min,date,avg_pulse,maxpulse,calories
0,60,2020-12-01,110,130,409.0
1,60,2020-12-02,117,145,479.0
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.0
4,45,2020-12-05,117,148,406.0
5,60,2020-12-06,102,127,300.0
6,60,2020-12-07,110,136,374.0
7,45,2020-12-08,104,134,253.0
8,30,2020-12-09,109,133,195.0
9,60,2020-12-10,98,124,269.0


> # Missing Data 
>
>Missing data can invalidate the data analysis, reports, dashboards, etc. that are being created. Thus it is essential to deal with it in some fashion before the actual analysis begins. 
>
>In `pandas` missing data is represented by the values `NA` or `NaN`. As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data. 
>
>While NaN is the default missing value marker for reasons of computational speed and convenience, we need to be able to easily detect this value with data of different types: floating point, integer, boolean, and general object. In many cases, however, the Python None will arise and we wish to also consider that “missing” or “not available” or “NA”.
>
>Common missing values representations:
>
>- NaN (the primary one and the one used by pandas)
>- Empty string “”
>- Other strings (e.g. “unknown”, “uncategorized”, “?”, etc.)
>- Negative values (i.e. -1, huge negatives like -999)
>
>### Where does missing data come from?
>
>- Failure of measurement
>- No information (e.g. lack of observation)
>- Technical issue (e.g. battery in a smart watch died)
>- Programming error
>
>**Note:** Some missing data can still be representative of an event or data point:
>- Purchase data with NaNs in the coupon discount column
>- Recipe data with NaN for the amount of an ingredient that is not needed
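
The alternative representations listed above (empty strings, placeholder strings, sentinel numbers) are not recognised as missing unless pandas is told about them. The sketch below shows two possible ways to convert them to `NaN`; the CSV text, column names and placeholder values are made up for illustration:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content using -999, "unknown" and "?" as missing markers
csv_text = "city,temperature,sensor\nBerlin,21.5,ok\nParis,-999,unknown\nRome,,?\n"

# option 1: declare the markers per column while reading the file
df = pd.read_csv(
    io.StringIO(csv_text),
    na_values={"temperature": ["-999"], "sensor": ["unknown", "?"]},
)

# option 2: replace the markers in a frame that is already loaded
loaded = pd.DataFrame({"rating": [5, -1, 3], "category": ["a", "", "uncategorized"]})
cleaned = loaded.replace({
    "rating": {-1: np.nan},
    "category": {"": np.nan, "uncategorized": np.nan},
})
```

Once the markers are real `NaN` values, `isnull()`, `fillna()` and `dropna()` treat them like any other missing entry.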

In [47]:
# how to get NaN as a value

import numpy as np

a = np.nan

In [48]:
a

nan

In [49]:
pd.DataFrame({'not_a_number': [a,a,a]})

Unnamed: 0,not_a_number
0,
1,
2,


## 6. Dealing with missing values

If inspection reveals missing values, deciding how to handle them becomes important, and the right choice always depends on the situation. Above all, **avoid introducing bias into the dataset**. One option is to leave the missing values as they are, but that may prevent the use of some valuable data analysis tools. 

**Here are some options as to how to deal with missing data:**

-  drop the observations with missing values
-  insert mode/mean/median depending on data type
-  insert the last or next known value using `pandas.DataFrame.ffill()` or `pandas.DataFrame.bfill()` (in older pandas versions: `fillna(method=...)`)
-  insert the mean/median dependent on another column 
-  for time series data: interpolate using `pandas.Series.interpolate`
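
The options listed above can be sketched on a small, hypothetical frame (the values are made up so that the results are easy to check by hand; which option is appropriate still depends on the data):

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing calorie values
df = pd.DataFrame({
    "duration_min": [60, 60, 45, 45, 60],
    "calories": [400.0, np.nan, 280.0, np.nan, 380.0],
})

# drop the observations with missing values
dropped = df.dropna()

# insert the median of the column
median_filled = df["calories"].fillna(df["calories"].median())

# insert the last known value
forward_filled = df["calories"].ffill()

# insert the mean of the rows that share the same duration
group_mean = df.groupby("duration_min")["calories"].transform("mean")
group_mean_filled = df["calories"].fillna(group_mean)

# linear interpolation between the neighbouring values
interpolated = df["calories"].interpolate()
```

None of these calls modifies `df`; each returns a new object, so the strategies can be compared side by side before one is chosen.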


In [50]:
# calculate the number of missing values

data.isnull().sum()

duration_min    0
date            1
avg_pulse       0
maxpulse        0
calories        2
dtype: int64

In [51]:
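# fill any missing data with a specific value (date column)
# assign the result back: calling fillna(inplace=True) on a selected column is unreliable in newer pandas versions
# (judging by the neighbouring rows, 2020-12-22 would be the more plausible date for the gap)
data['date'] = data['date'].fillna(pd.Timestamp('2020-12-12'))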

In [52]:
data

Unnamed: 0,duration_min,date,avg_pulse,maxpulse,calories
0,60,2020-12-01,110,130,409.0
1,60,2020-12-02,117,145,479.0
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.0
4,45,2020-12-05,117,148,406.0
5,60,2020-12-06,102,127,300.0
6,60,2020-12-07,110,136,374.0
7,45,2020-12-08,104,134,253.0
8,30,2020-12-09,109,133,195.0
9,60,2020-12-10,98,124,269.0


In [53]:
# fill the missing data with the mean of the given values (calories column)

mean_calories = int(data['calories'].mean())

In [54]:
mean_calories = data['calories'].mean().round().astype(int)

In [55]:
mean_calories

306

In [56]:
data['calories'] = data['calories'].fillna(mean_calories)

In [57]:
data['calories'] = data['calories'].astype(int)

In [58]:
data.info() # each column has 31 non-null values in a 31-row dataframe, so no NaNs are left

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   duration_min  31 non-null     int64         
 1   date          31 non-null     datetime64[ns]
 2   avg_pulse     31 non-null     int64         
 3   maxpulse      31 non-null     int64         
 4   calories      31 non-null     int32         
dtypes: datetime64[ns](1), int32(1), int64(3)
memory usage: 1.2 KB


In [61]:
#  we can also choose to drop any rows with missing values
# let's import the original dataset again as 'data_raw'

data_raw = pd.read_csv("./data/messy_data.csv")

data_raw

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
5,60,'2020/12/06',102,127,300.0
6,60,'2020/12/07',110,136,374.0
7,450,'2020/12/08',104,134,253.3
8,30,'2020/12/09',109,133,195.1
9,60,'2020/12/10',98,124,269.0


In [62]:
data_raw.dropna(axis=0) # dropping rows with NaN values

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
5,60,'2020/12/06',102,127,300.0
6,60,'2020/12/07',110,136,374.0
7,450,'2020/12/08',104,134,253.3
8,30,'2020/12/09',109,133,195.1
9,60,'2020/12/10',98,124,269.0


In [63]:
data_raw.dropna(axis=1) # dropping columns with NaN values

Unnamed: 0,Duration,Pulse,Maxpulse
0,60,110,130
1,60,117,145
2,60,103,135
3,45,109,175
4,45,117,148
5,60,102,127
6,60,110,136
7,450,104,134
8,30,109,133
9,60,98,124
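
`dropna` does not have to be all-or-nothing. The `subset` and `thresh` parameters restrict which missing values lead to a row being dropped; the sketch below uses a hypothetical three-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 lacks a date, row 2 lacks calories
df = pd.DataFrame({
    "date": ["2020/12/21", None, "2020/12/23"],
    "pulse": [100, 100, 130],
    "calories": [282.0, 300.0, np.nan],
})

# drop a row only if the date is missing
date_required = df.dropna(subset=["date"])

# keep only rows with at least 3 non-missing values
at_least_three = df.dropna(thresh=3)
```

As with the calls above, `dropna` returns a new dataframe and leaves `df` unchanged unless the result is assigned back.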


# BONUS: using interpolate to fill NaN in the `'date'` column

In [None]:
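# start while data['date'] is still an object column (see above, before pd.to_datetime())
# caveat: treating dates as plain numbers only interpolates sensibly while the gap stays within one month

pd.to_datetime(
    data['date'].astype(str).str.replace("/", "").astype(float).interpolate().astype(int).astype(str),
    format='%Y%m%d')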