# Data Cleansing and Missing Data

## Objectives

1. inspecting a dataframe
2. renaming columns
3. replacing values in columns
4. dropping duplicates
5. changing column types
6. dealing with missing values
7. creating new columns

Data is often very disorganized. Messy data can hinder data exploration and other steps in your analysis. Data cleansing is about identifying incorrect, incomplete, inaccurate, or irrelevant data, fixing the problems, and making sure that all such issues will be fixed automatically in the future.

### Missing Data


Missing data can invalidate the analyses, reports, dashboards, etc. that are built on it. It is therefore essential to deal with it in some fashion before the actual analysis begins.

In pandas, missing data is represented by the values NA or NaN. As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data.

NaN is the default missing value marker for reasons of computational speed and convenience, but we need to be able to detect missing values easily in data of different types: floating point, integer, boolean, and general object. In many cases the Python None will also arise, and we want to treat it as “missing”, “not available”, or “NA” as well.
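As a small illustration of the point above, the sketch below checks that `isna()` flags NaN, None, and NaT alike. The values in `mixed` are invented for this example and are not taken from the dataset used later.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only (not from messy_data.csv)
mixed = pd.Series([1.5, np.nan, None, "text", pd.NaT], dtype="object")

# isna() treats NaN, None and NaT as missing
missing_mask = mixed.isna()
print(missing_mask.tolist())

# notna() is the complement, so summing it counts the usable values
present_count = int(mixed.notna().sum())
print(present_count)
```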

### Common missing value representations

- NaN (the primary one and the one used by pandas)
- Empty string “”
- Other strings (e.g. “unknown”, “uncategorized”, “?”, etc.)
- Negative sentinel values (e.g. -1, or large negatives like -999)

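pandas only recognizes placeholders like these as missing once they are converted to NaN. The sketch below shows one way to do that with `replace()`. The frame, its column names, and the choice of placeholders are assumptions made for illustration; in real data, check that a value such as -1 is really a placeholder before replacing it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the columns and placeholder values are invented
df = pd.DataFrame({
    "city": ["Berlin", "unknown", "", "?"],
    "temperature": [21, -999, 18, -1],
})

cleaned = df.copy()
# Text placeholders become NaN
cleaned["city"] = cleaned["city"].replace(["unknown", "?", ""], np.nan)
# Numeric sentinels become NaN (the column is upcast to float)
cleaned["temperature"] = cleaned["temperature"].replace([-999, -1], np.nan)

print(cleaned.isna().sum())
```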
### Where does missing data come from?

- Failure of measurement
    - No information (e.g. lack of observation)
    - Technical issue (e.g. battery in a smart watch died)
- Programming error

**Note:** Some missing data can still be representative of an event or data point:

- Purchase data with NaN in the coupon discount column: the customer didn’t use a coupon code
- Recipe data with NaN for the amount of an ingredient: the recipe doesn’t need this ingredient at all

### Dealing with missing data
If inspection reveals missing values, the next step is deciding how to handle them, and the right choice always depends on the situation. Above all, avoid introducing bias into the dataset. One option is to leave the missing values as they are, but that can prevent the use of some valuable analysis tools. Other options include:

- drop the observations with missing values
- insert the mode/mean/median, depending on the data type
- insert the next or last known value using pandas.DataFrame.bfill() or pandas.DataFrame.ffill() (in older pandas versions: fillna() with the method argument)
- insert the mean/median dependent on another column
- for time series data: interpolate using pandas.Series.interpolate()
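To make the "last known value", "next known value", and interpolation options concrete, here is a minimal sketch on a toy daily series. The readings are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy daily readings with a two-day gap (invented values)
readings = pd.Series(
    [10.0, np.nan, np.nan, 16.0, 18.0],
    index=pd.date_range("2020-12-01", periods=5, freq="D"),
)

forward = readings.ffill()       # carry the last known value forward
backward = readings.bfill()      # pull the next known value backward
linear = readings.interpolate()  # straight line between known neighbours

print(pd.DataFrame({"ffill": forward, "bfill": backward, "linear": linear}))
```

Forward and backward filling repeat an observed value, while linear interpolation assumes the quantity changed steadily across the gap. Which assumption is reasonable depends on the data.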

### Relevant pandas commands

| Command                    | Description                                                  |
|----------------------------|--------------------------------------------------------------|
| df.info()                  | prints a concise summary of the dataframe                    |
| df.rename()                | alters row or column axis labels                             |
| df.dropna()                | removes rows or columns that contain missing values          |
| df.set_index()             | assigns a column to be the row index of the table            |
| df.reset_index()           | replaces the current row index with the default integer index |
| df["col_name"].isna()      | identifies missing values in a column                        |
| df["col_name"].replace()   | replaces values in a Series                                  |
| df["col_name"].fillna()    | fills missing values with a specified value                  |
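Of the commands in the table, `set_index()` is the only one not used in the walkthrough that follows, so here is a brief sketch of how it pairs with `reset_index()`. The two-row frame is invented for illustration.

```python
import pandas as pd

# Toy frame (invented values)
workouts = pd.DataFrame({
    "date": ["2020/12/01", "2020/12/02"],
    "duration": [60, 45],
})

# Use the date column as the row index, which allows label-based lookups
by_date = workouts.set_index("date")
lookup = by_date.loc["2020/12/02", "duration"]

# reset_index() moves the index back into a regular column
restored = by_date.reset_index()
print(lookup)
print(restored.columns.tolist())
```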


In [1]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv("./data/messy_data.csv")

In [4]:
data

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
5,60,'2020/12/06',102,127,300.0
6,60,'2020/12/07',110,136,374.0
7,450,'2020/12/08',104,134,253.3
8,30,'2020/12/09',109,133,195.1
9,60,'2020/12/10',98,124,269.0


## 1. Inspecting a dataframe

In [5]:
data.head()

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  32 non-null     int64  
 1   Date      31 non-null     object 
 2   Pulse     32 non-null     int64  
 3   Maxpulse  32 non-null     int64  
 4   Calories  30 non-null     float64
dtypes: float64(1), int64(3), object(1)
memory usage: 1.4+ KB


In [7]:
data.describe()

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
count,32.0,32.0,32.0,30.0
mean,68.4375,103.5,128.5,304.68
std,70.039591,7.832933,12.998759,66.003779
min,30.0,90.0,101.0,195.1
25%,60.0,100.0,120.0,250.7
50%,60.0,102.5,127.5,291.2
75%,60.0,106.5,132.25,343.975
max,450.0,130.0,175.0,479.0


In [8]:
data.describe(include='O') # includes objects

Unnamed: 0,Date
count,31
unique,30
top,'2020/12/12'
freq,2


In [9]:
# checking for missing values

data.isnull()

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
5,False,False,False,False,False
6,False,False,False,False,False
7,False,False,False,False,False
8,False,False,False,False,False
9,False,False,False,False,False


In [10]:
# add the missing values up

data.isnull().sum()

Duration    0
Date        1
Pulse       0
Maxpulse    0
Calories    2
dtype: int64
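Column totals say how much is missing but not where. The sketch below, on a small invented frame that mimics the columns above, selects the rows that contain at least one missing value and computes the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Small stand-in for the workout data (values invented)
sample = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Date": ["'2020/12/01'", None, "'2020/12/03'"],
    "Calories": [409.1, 282.4, np.nan],
})

# Rows where any column is missing
rows_with_gaps = sample[sample.isna().any(axis=1)]

# Fraction of missing values per column (mean of a boolean mask)
share_missing = sample.isna().mean()

print(rows_with_gaps)
print(share_missing)
```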

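**We can see the following problems:**

1. Two columns contain missing values: Date (1) and Calories (2)
2. Date is stored as a string rather than in pandas datetime format
3. Column names are capitalized
4. There is a misformatted date value in row 26
5. The Date values are wrapped in single quotes
6. One Date value appears twice, which hints at a duplicated row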

## 2. Renaming columns

In [11]:
# method 1

# renaming all columns

data.columns

Index(['Duration', 'Date', 'Pulse', 'Maxpulse', 'Calories'], dtype='object')

In [16]:
data.columns = ['duration', 'date', 'pulse', 'max_pulse', 'calories']

data

Unnamed: 0,duration,date,pulse,max_pulse,calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
5,60,'2020/12/06',102,127,300.0
6,60,'2020/12/07',110,136,374.0
7,450,'2020/12/08',104,134,253.3
8,30,'2020/12/09',109,133,195.1
9,60,'2020/12/10',98,124,269.0


In [18]:
# method 2

# replacing the name of a specific column

data.rename(columns={'duration':'duration_min', 'pulse': 'avg_pulse'})


Unnamed: 0,duration_min,date,avg_pulse,max_pulse,calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
5,60,'2020/12/06',102,127,300.0
6,60,'2020/12/07',110,136,374.0
7,450,'2020/12/08',104,134,253.3
8,30,'2020/12/09',109,133,195.1
9,60,'2020/12/10',98,124,269.0


In [19]:

data.head()

Unnamed: 0,duration,date,pulse,max_pulse,calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0


In [20]:
# rename() returns a new DataFrame, so the change above was not saved;
# set inplace=True (or assign the result back) to keep it

data.rename(columns={'duration':'duration_min', 'pulse': 'avg_pulse'}, inplace=True)

data.head()

Unnamed: 0,duration_min,date,avg_pulse,max_pulse,calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
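An alternative to `inplace=True` is to assign the result of `rename()` to a name, which leaves the original frame available for comparison. The sketch below uses an invented one-row frame; it also shows that `columns=` accepts a function as well as a dictionary.

```python
import pandas as pd

# One-row toy frame (invented values)
df = pd.DataFrame({"duration": [60], "pulse": [110], "max_pulse": [130]})

# Assigning the result keeps the original frame untouched
renamed = df.rename(columns={"duration": "duration_min", "pulse": "avg_pulse"})

# A function is applied to every column label
upper = df.rename(columns=str.upper)

print(df.columns.tolist())
print(renamed.columns.tolist())
print(upper.columns.tolist())
```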


In [21]:
# we can also manipulate column names as strings
# keep in mind that string methods on a pandas Series or Index (such as data.columns) are accessed through the `.str` accessor


data.columns = data.columns.str.capitalize()

In [22]:
data.head()

Unnamed: 0,Duration_min,Date,Avg_pulse,Max_pulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0


In [23]:
# reverting the change using the lower method

data.columns = data.columns.str.lower()

data.head()

Unnamed: 0,duration_min,date,avg_pulse,max_pulse,calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
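The `.str` methods can be chained, which is handy when column names arrive with stray spaces and mixed case. A minimal sketch, assuming some hypothetical messy labels:

```python
import pandas as pd

# Hypothetical messy labels, invented for illustration
df = pd.DataFrame(columns=[" Duration Min", "Avg Pulse ", "MaxPulse"])

# Trim whitespace, lowercase, then turn inner spaces into underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

print(df.columns.tolist())
```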


## 3. Replacing values in columns

In [24]:
# let's remove those single quotes from the date column

data['date'].str.replace("'", "")

0     2020/12/01
1     2020/12/02
2     2020/12/03
3     2020/12/04
4     2020/12/05
5     2020/12/06
6     2020/12/07
7     2020/12/08
8     2020/12/09
9     2020/12/10
10    2020/12/11
11    2020/12/12
12    2020/12/12
13    2020/12/13
14    2020/12/14
15    2020/12/15
16    2020/12/16
17    2020/12/17
18    2020/12/18
19    2020/12/19
20    2020/12/20
21    2020/12/21
22           NaN
23    2020/12/23
24    2020/12/24
25    2020/12/25
26      20201226
27    2020/12/27
28    2020/12/28
29    2020/12/29
30    2020/12/30
31    2020/12/31
Name: date, dtype: object

In [25]:
data.head()

Unnamed: 0,duration_min,date,avg_pulse,max_pulse,calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0


In [28]:
data['date'] = data['date'].str.replace("'", "")

data.head()

Unnamed: 0,duration_min,date,avg_pulse,max_pulse,calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0


In [29]:
# now let's replace the misformatted value in row 26

data.loc[26, 'date']

'20201226'

In [30]:
data.loc[26, 'date'] = '2020/12/26'

data

Unnamed: 0,duration_min,date,avg_pulse,max_pulse,calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0
5,60,2020/12/06,102,127,300.0
6,60,2020/12/07,110,136,374.0
7,450,2020/12/08,104,134,253.3
8,30,2020/12/09,109,133,195.1
9,60,2020/12/10,98,124,269.0
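Fixing row 26 by hand works because we already spotted it. On larger data, one option is to search for values that do not match the expected pattern. The sketch below assumes the `YYYY/MM/DD` layout seen above and uses a short invented Series; the repair step further assumes that the odd values are in `YYYYMMDD` form.

```python
import numpy as np
import pandas as pd

# Short stand-in for the date column (values invented)
dates = pd.Series(["2020/12/25", "20201226", np.nan, "2020/12/27"])

# True where the string looks like YYYY/MM/DD; missing values count as False
well_formed = dates.str.match(r"^\d{4}/\d{2}/\d{2}$", na=False)

# Present but not well formed: these need a closer look
suspicious = dates[~well_formed & dates.notna()]
print(suspicious)

# Assumed repair: reparse YYYYMMDD and write it back in the common layout
repaired = dates.copy()
repaired.loc[suspicious.index] = (
    pd.to_datetime(suspicious, format="%Y%m%d").dt.strftime("%Y/%m/%d")
)
print(repaired)
```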


## 4. Dropping duplicate rows

In [31]:
# rows 11 and 12 are identical, we need to drop one of them

data.drop_duplicates(inplace=True)

data

Unnamed: 0,duration_min,date,avg_pulse,max_pulse,calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0
5,60,2020/12/06,102,127,300.0
6,60,2020/12/07,110,136,374.0
7,450,2020/12/08,104,134,253.3
8,30,2020/12/09,109,133,195.1
9,60,2020/12/10,98,124,269.0
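Before dropping anything, it can help to look at the duplicates first. The sketch below uses `duplicated()` on a three-row invented frame to list every copy of a repeated row, and shows the `subset` argument for cases where only some columns define a duplicate.

```python
import pandas as pd

# Toy frame in which the last two rows are identical (invented values)
df = pd.DataFrame({
    "duration_min": [60, 60, 60],
    "date": ["2020/12/11", "2020/12/12", "2020/12/12"],
    "avg_pulse": [103, 100, 100],
})

# keep=False marks every copy, not only the later ones
all_copies = df[df.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence by default
deduped = df.drop_duplicates()

# Treat rows as duplicates when only the date repeats
same_day = df.duplicated(subset=["date"])

print(all_copies)
print(deduped.index.tolist())
```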


In [32]:
# the duplicate row is dropped, but the index now has a gap (label 12 is missing)

# let's reset the index
data.reset_index(drop=True, inplace=True)

data

Unnamed: 0,duration_min,date,avg_pulse,max_pulse,calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0
5,60,2020/12/06,102,127,300.0
6,60,2020/12/07,110,136,374.0
7,450,2020/12/08,104,134,253.3
8,30,2020/12/09,109,133,195.1
9,60,2020/12/10,98,124,269.0
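The drop and the index reset can also be done in one call through the `ignore_index` argument of `drop_duplicates()`. A minimal sketch on an invented frame, checking that both routes give the same result:

```python
import pandas as pd

# Toy frame with one repeated row (invented values)
df = pd.DataFrame({"a": [1, 2, 2, 3]})

# Two steps, as in the cells above
two_steps = df.drop_duplicates().reset_index(drop=True)

# One step: relabel the surviving rows 0..n-1 while dropping
one_step = df.drop_duplicates(ignore_index=True)

print(one_step)
```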


## 5. Changing column types

In [33]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   duration_min  31 non-null     int64  
 1   date          30 non-null     object 
 2   avg_pulse     31 non-null     int64  
 3   max_pulse     31 non-null     int64  
 4   calories      29 non-null     float64
dtypes: float64(1), int64(3), object(1)
memory usage: 1.3+ KB


In [34]:
# let's convert the date column to the datetime type

pd.to_datetime(data['date'])

0    2020-12-01
1    2020-12-02
2    2020-12-03
3    2020-12-04
4    2020-12-05
5    2020-12-06
6    2020-12-07
7    2020-12-08
8    2020-12-09
9    2020-12-10
10   2020-12-11
11   2020-12-12
12   2020-12-13
13   2020-12-14
14   2020-12-15
15   2020-12-16
16   2020-12-17
17   2020-12-18
18   2020-12-19
19   2020-12-20
20   2020-12-21
21          NaT
22   2020-12-23
23   2020-12-24
24   2020-12-25
25   2020-12-26
26   2020-12-27
27   2020-12-28
28   2020-12-29
29   2020-12-30
30   2020-12-31
Name: date, dtype: datetime64[ns]

In [35]:
data['date'] = pd.to_datetime(data['date']) # we will dive deeper into the datetime data type in week 4!

data.head()

Unnamed: 0,duration_min,date,avg_pulse,max_pulse,calories
0,60,2020-12-01,110,130,409.1
1,60,2020-12-02,117,145,479.0
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.4
4,45,2020-12-05,117,148,406.0


In [36]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   duration_min  31 non-null     int64         
 1   date          30 non-null     datetime64[ns]
 2   avg_pulse     31 non-null     int64         
 3   max_pulse     31 non-null     int64         
 4   calories      29 non-null     float64       
dtypes: datetime64[ns](1), float64(1), int64(3)
memory usage: 1.3 KB
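`pd.to_datetime()` also accepts an explicit `format` and an `errors` argument. The sketch below, on an invented Series, uses `errors="coerce"` so that strings that do not fit the format become NaT instead of raising an error, and then lists the values that were lost in the conversion.

```python
import pandas as pd

# Stand-in values: one well formed, one misformatted, one missing
raw = pd.Series(["2020/12/25", "20201226", None])

# Strings that do not match the format are turned into NaT
parsed = pd.to_datetime(raw, format="%Y/%m/%d", errors="coerce")

# Present in the input but NaT after parsing: conversion failures
new_failures = parsed.isna() & raw.notna()

print(parsed)
print(raw[new_failures])
```

Coercing is convenient, but it can silently hide bad input, so a check like `new_failures` is worth keeping next to it.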


In [37]:
# let's also round the calories column and convert to integer

data['calories'] = data['calories'].round()

data

Unnamed: 0,duration_min,date,avg_pulse,max_pulse,calories
0,60,2020-12-01,110,130,409.0
1,60,2020-12-02,117,145,479.0
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.0
4,45,2020-12-05,117,148,406.0
5,60,2020-12-06,102,127,300.0
6,60,2020-12-07,110,136,374.0
7,450,2020-12-08,104,134,253.0
8,30,2020-12-09,109,133,195.0
9,60,2020-12-10,98,124,269.0


In [38]:
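# changing the column to integer
# the column still contains NaN, so we use the nullable 'Int64' dtype (capital I);
# plain 'int64' cannot hold missing values and would raise an error

data['calories'] = data['calories'].astype('Int64')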

In [39]:
data

Unnamed: 0,duration_min,date,avg_pulse,max_pulse,calories
0,60,2020-12-01,110,130,409
1,60,2020-12-02,117,145,479
2,60,2020-12-03,103,135,340
3,45,2020-12-04,109,175,282
4,45,2020-12-05,117,148,406
5,60,2020-12-06,102,127,300
6,60,2020-12-07,110,136,374
7,450,2020-12-08,104,134,253
8,30,2020-12-09,109,133,195
9,60,2020-12-10,98,124,269
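The capital "I" in `'Int64'` matters. The sketch below, on an invented Series, contrasts NumPy's `int64`, which cannot represent a missing value, with the nullable pandas `Int64` dtype.

```python
import numpy as np
import pandas as pd

# Rounded calories with one missing value (invented numbers)
calories = pd.Series([409.0, np.nan, 340.0])

# NumPy's int64 has no way to store NaN, so the cast is rejected
try:
    calories.astype("int64")
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# The nullable extension dtype stores the gap as <NA>
nullable = calories.astype("Int64")

print(cast_failed)
print(nullable)
```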


## 6. Dealing with missing values

The cells below apply some of the options listed earlier under “Dealing with missing data”: filling with a specific value, filling with the mean, and dropping rows.


In [None]:
data.isna().sum()

In [None]:
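# fill any missing data with a specific value
# the date column is now datetime, so we pass a Timestamp;
# fillna() returns a new Series, assign it back to keep the change

data['date'].fillna(pd.Timestamp('2020-12-22'))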

In [None]:
# fill the missing data with the mean of the given values
# the built-in round() works whether mean() returns a Python or a NumPy float

mean_calories = int(round(data['calories'].mean()))

data['calories'].fillna(mean_calories)

In [None]:
# we can also choose to drop any rows with missing values
# (dropna() returns a new DataFrame; data itself is unchanged)

data.dropna(axis=0) # or axis='index'
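The option "insert the mean/median dependent on another column" is not covered by the cells above. One way to sketch it is with `groupby().transform()`, here filling missing calories with the median of workouts of the same duration. The frame and the `calories_filled` column name are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy workouts: two durations, each with one missing calories value
df = pd.DataFrame({
    "duration_min": [60, 60, 60, 45, 45],
    "calories": [400.0, np.nan, 300.0, 280.0, np.nan],
})

# Median calories per duration, aligned row by row with the original frame
group_median = df.groupby("duration_min")["calories"].transform("median")

# Fill each gap with the median of its own group
df["calories_filled"] = df["calories"].fillna(group_median)

print(df)
```

Compared with a single global mean, this keeps a 45-minute workout from being filled with a value typical of 60-minute workouts.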

## 7. Creating new columns

New columns can be created in one of two ways:

1. Using bracket notation

2. Using the `eval()` method on the pandas DataFrame.

Calories burned per minute (`calories_per_minute`):

Calculate the rate of burning calories per minute by dividing the total calories burned (`calories`) by the duration of the activity (`duration_min`). This can give insights into the intensity of the exercise.

In [None]:
data['calories_per_minute'] = data['calories'] / data['duration_min']


With `.eval()` we can create a new column from an expression string.

Pulse reserve (`hrr`):

Calculate the difference between the maximum pulse (`max_pulse`) and the average pulse (`avg_pulse`) of each workout. Strictly speaking, heart rate reserve is defined relative to the resting heart rate, which this dataset does not contain, so here `hrr` only indicates how far the peak was above the workout's average pulse.

In [40]:
data.eval('hrr = max_pulse - avg_pulse', inplace = True)
data.columns

Index(['duration_min', 'date', 'avg_pulse', 'max_pulse', 'calories', 'hrr'], dtype='object')
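For comparison, the sketch below builds the same kind of column with `eval()` and with `assign()`, both without `inplace=True`, so each call returns a new frame and leaves the original unchanged. The two-row frame is invented for illustration.

```python
import pandas as pd

# Two-row toy frame (invented values)
df = pd.DataFrame({"avg_pulse": [110, 117], "max_pulse": [130, 145]})

# eval() without inplace=True returns a new DataFrame
with_eval = df.eval("hrr = max_pulse - avg_pulse")

# assign() does the same with ordinary Python expressions
with_assign = df.assign(hrr=df["max_pulse"] - df["avg_pulse"])

print(with_eval)
print("hrr" in df.columns)
```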

Day of the week (`day_of_week`):

Extract the day of the week from the `date` column. This can help analyze trends based on the day of the week when the activity was performed.

In [None]:
data['day_of_week'] = data['date'].dt.day_name()

Weekend vs. weekday (`weekend`):

Create a binary indicator variable that is 1 if the activity was performed on a weekend (Saturday or Sunday) and 0 otherwise.

In [None]:
data['weekend'] = data['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)
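The same indicator can be derived from the numeric `dt.dayofweek` attribute (Monday is 0, Sunday is 6) instead of comparing day names. The sketch below uses a few invented dates, including a missing one, to show that a missing date ends up as 0 with this approach, just as it does with `isin()`.

```python
import pandas as pd

# A Friday, a Saturday, a Sunday and a missing date (invented sample)
dates = pd.Series(pd.to_datetime(["2020-12-04", "2020-12-05", "2020-12-06", None]))

# dayofweek is 5 for Saturday and 6 for Sunday; NaT compares as False
weekend = (dates.dt.dayofweek >= 5).astype(int)

print(dates.dt.day_name().tolist())
print(weekend.tolist())
```

If a missing date should not be counted as a weekday, fill or drop the missing dates before building the indicator.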
