#### Week 3 - Visual Data Analysis

# Lesson 2: Data Cleansing and Missing Data

### Objectives

1. inspecting a dataframe
2. renaming columns
3. replacing values in columns
4. dropping duplicates
5. changing column types
6. dealing with missing values

> ## Warm-up 1
> 1. Read through the article [Data Prep Still Dominates Data Scientists’ Time, Survey Finds!](https://www.datanami.com/2020/07/06/data-prep-still-dominates-data-scientists-time-survey-finds/)  
> 2. Collect your main takeaways!

## Read Data
1. Read the file `messy_data.csv` into a dataframe.
2. What seems wrong to you in this dataset? What would you change?

In [1]:
import pandas as pd


In [2]:
# read the `messy_data.csv` into a `data` variable

data = pd.read_csv('../data/messy_data.csv')

In [3]:
data

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
5,60,'2020/12/06',102,127,300.0
6,60,'2020/12/07',110,136,374.0
7,450,'2020/12/08',104,134,253.3
8,30,'2020/12/09',109,133,195.1
9,60,'2020/12/10',98,124,269.0


## 1. Inspecting a dataframe

In [4]:
# show the first 5 rows 

data.head(5)

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0


In [5]:
# show the info of the dataframe - do you see something suspicious here?

data.info()

# Date is an object (string) column, not a datetime
# 1 missing value in Date / 2 in Calories
# Mismatched date format in row 26
# Date values are wrapped in extra single quotes
# Column names should be lower case
# Unclear naming: is Pulse the average pulse? Are Calories burned or consumed?
# Duration typo in row 7 (450 instead of 45)
# Rows 11 and 12 are duplicates

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  32 non-null     int64  
 1   Date      31 non-null     object 
 2   Pulse     32 non-null     int64  
 3   Maxpulse  32 non-null     int64  
 4   Calories  30 non-null     float64
dtypes: float64(1), int64(3), object(1)
memory usage: 1.4+ KB


In [6]:
# run describe method on your dataframe

data.describe()

# count is the number of non-missing entries
# mean is the mean of the values
# std is the standard deviation
# min is the minimum
# 25%, 50% and 75% are percentiles (50% is the median)
# max is the maximum

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
count,32.0,32.0,32.0,30.0
mean,68.4375,103.5,128.5,304.68
std,70.039591,7.832933,12.998759,66.003779
min,30.0,90.0,101.0,195.1
25%,60.0,100.0,120.0,250.7
50%,60.0,102.5,127.5,291.2
75%,60.0,106.5,132.25,343.975
max,450.0,130.0,175.0,479.0


In [7]:
data.describe(include='O')

# include='O' restricts the summary to object (string) columns

Unnamed: 0,Date
count,31
unique,30
top,'2020/12/12'
freq,2


In [8]:
data.duplicated()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12     True
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
dtype: bool
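
`duplicated()` marks only the second and later copies of a row, which is why row 12 is flagged above but row 11 is not. A minimal sketch on a small made-up frame (the name `df` and its values are illustrative, not taken from `messy_data.csv`) shows how to count the duplicates and how `keep=False` reveals every copy:

```python
import pandas as pd

# Hypothetical frame: rows 1 and 2 are identical
df = pd.DataFrame({
    "duration": [60, 45, 45, 30],
    "pulse": [110, 109, 109, 98],
})

# Count the rows that repeat an earlier row
n_duplicates = df.duplicated().sum()

# keep=False flags every member of a duplicate group, including the first
all_copies = df[df.duplicated(keep=False)]

print(n_duplicates)
print(all_copies)
```

Looking at all copies side by side makes it easier to confirm that the rows really are identical before dropping one of them.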

In [9]:
# checking for missing values

data.isnull()

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
5,False,False,False,False,False
6,False,False,False,False,False
7,False,False,False,False,False
8,False,False,False,False,False
9,False,False,False,False,False


In [10]:
# calculate how many missing values we have for different columns

data.isnull().sum()

Duration    0
Date        1
Pulse       0
Maxpulse    0
Calories    2
dtype: int64
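
Besides the raw counts, it can help to see the share of missing values per column and the rows they sit in. The sketch below uses a small made-up frame (`df` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in both columns
df = pd.DataFrame({
    "date": ["2020/12/01", None, "2020/12/03", "2020/12/04"],
    "calories": [409.1, np.nan, np.nan, 282.4],
})

missing_count = df.isnull().sum()    # number of missing values per column
missing_share = df.isnull().mean()   # fraction of missing values per column

# Rows that contain at least one missing value
rows_with_gaps = df[df.isnull().any(axis=1)]

print(missing_count)
print(missing_share)
print(rows_with_gaps)
```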

**What problems can you identify after the initial data exploration?:**

- list the problems here ...
- ...


+ Date is an object (string) column, not a datetime
+ 1 missing value in Date / 2 in Calories
+ Mismatched date format in row 26
+ Date values are wrapped in extra single quotes
+ Column names should be lower case
+ Unclear naming: is Pulse the average pulse? Are Calories burned or consumed?
+ Duration typo in row 7 (450 instead of 45)
+ Rows 11 and 12 are duplicates

## 2. Renaming columns

### Option 1

In [11]:
# display the current column names

data.columns # attribute

Index(['Duration', 'Date', 'Pulse', 'Maxpulse', 'Calories'], dtype='object')

In [12]:
# rename all the column names using the column attribute so they only consist of lower case letters

data.columns = ['duration', 'date', 'pulse', 'max_pulse', 'calories']
data.columns

Index(['duration', 'date', 'pulse', 'max_pulse', 'calories'], dtype='object')

### Option 2

In [13]:
# replacing the name of a specific column using the rename method

data.rename(columns={'duration':'duration_min',
                     'pulse':'avg_pulse'
                    })

# this only returns a renamed copy; `data` itself is unchanged. To make it permanent...

Unnamed: 0,duration_min,date,avg_pulse,max_pulse,calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
5,60,'2020/12/06',102,127,300.0
6,60,'2020/12/07',110,136,374.0
7,450,'2020/12/08',104,134,253.3
8,30,'2020/12/09',109,133,195.1
9,60,'2020/12/10',98,124,269.0


In [14]:
#... remember to set inplace=True to make it persistent

data.rename(columns={'duration':'duration_min',
                     'pulse':'avg_pulse'
                    },inplace=True)

data.columns

Index(['duration_min', 'date', 'avg_pulse', 'max_pulse', 'calories'], dtype='object')
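
An alternative to `inplace=True` is to assign the result of `rename` to a variable. The sketch below uses a small made-up frame (`df` and `renamed` are illustrative names) to show that the original stays untouched until the result is assigned:

```python
import pandas as pd

# Hypothetical frame with two of the lesson's column names
df = pd.DataFrame({"duration": [60, 45], "pulse": [110, 109]})

# rename returns a new DataFrame; df keeps its old column names
renamed = df.rename(columns={"duration": "duration_min", "pulse": "avg_pulse"})

print(list(df.columns))
print(list(renamed.columns))
```

Writing `df = df.rename(...)` would overwrite the original in the same way `inplace=True` does.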

In [15]:
# we can also manipulate column names as strings
# keep in mind that string methods on a pandas Series or Index (such as data.columns) are accessed through `.str`

data.columns = data.columns.str.upper()

In [16]:
data.columns

Index(['DURATION_MIN', 'DATE', 'AVG_PULSE', 'MAX_PULSE', 'CALORIES'], dtype='object')

### Option 3

In [17]:
# reverting the change using the lower method

data.columns = data.columns.str.lower()
data.columns 

Index(['duration_min', 'date', 'avg_pulse', 'max_pulse', 'calories'], dtype='object')
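
The `.str` methods can be chained, which is handy when column names arrive with stray spaces or mixed case. A minimal sketch, assuming some made-up messy column names that are not part of `messy_data.csv`:

```python
import pandas as pd

# Hypothetical frame whose column names need tidying
df = pd.DataFrame(columns=[" Duration ", "Max Pulse", "CALORIES"])

# strip outer whitespace, lower-case, then replace inner spaces with underscores
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

print(list(df.columns))
```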

## 3. Cleaning the date column - replacing values in columns

In [18]:
# what is the format of the values in the `date` column?

data.dtypes

duration_min      int64
date             object
avg_pulse         int64
max_pulse         int64
calories        float64
dtype: object

In [19]:
# let's remove the single quotes from the date column

# view one value first
data.loc[0,'date']


data['date'].str.replace("'","") # .str.replace(old, new): here every single quote is replaced with an empty string


0     2020/12/01
1     2020/12/02
2     2020/12/03
3     2020/12/04
4     2020/12/05
5     2020/12/06
6     2020/12/07
7     2020/12/08
8     2020/12/09
9     2020/12/10
10    2020/12/11
11    2020/12/12
12    2020/12/12
13    2020/12/13
14    2020/12/14
15    2020/12/15
16    2020/12/16
17    2020/12/17
18    2020/12/18
19    2020/12/19
20    2020/12/20
21    2020/12/21
22           NaN
23    2020/12/23
24    2020/12/24
25    2020/12/25
26      20201226
27    2020/12/27
28    2020/12/28
29    2020/12/29
30    2020/12/30
31    2020/12/31
Name: date, dtype: object

In [20]:
# make sure the change is persistent

data['date'] = data['date'].str.replace("'", "")

In [21]:
# now let's replace the misformatted value in row 26
# first "locate" the cell in question

data.loc[26,'date'] = '2020/12/26'

In [22]:
data.loc[26,'date']

'2020/12/26'
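
In a larger dataset, misformatted dates are hard to spot by eye. One possible way to find them is to test each string against the expected pattern. This is only a sketch: the series `dates` and the regular expression are assumptions chosen to mirror the `YYYY/MM/DD` layout used in this lesson.

```python
import pandas as pd

# Hypothetical date strings: one is misformatted, one is missing
dates = pd.Series(["2020/12/25", "20201226", None, "2020/12/27"])

# Expected layout: four digits, slash, two digits, slash, two digits
pattern = r"^\d{4}/\d{2}/\d{2}$"

# na=False treats missing values as "no match" so the result is purely boolean
matches = dates.str.match(pattern, na=False)

# Keep values that are present but do not follow the pattern
bad = dates[dates.notna() & ~matches]

print(bad)
```

The index of `bad` tells you which rows to correct with `.loc`, as done for row 26 above.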

In [23]:
# same for the duration 450

data.loc[7,'duration_min'] = 45

In [24]:
data.loc[7,'duration_min']

45

## 4. Dropping duplicate rows

In [25]:
# rows 11 and 12 are identical, we need to drop one of them

data.drop_duplicates(inplace=True)

data

Unnamed: 0,duration_min,date,avg_pulse,max_pulse,calories
0,60,2020/12/01,110,130,409.1
1,60,2020/12/02,117,145,479.0
2,60,2020/12/03,103,135,340.0
3,45,2020/12/04,109,175,282.4
4,45,2020/12/05,117,148,406.0
5,60,2020/12/06,102,127,300.0
6,60,2020/12/07,110,136,374.0
7,45,2020/12/08,104,134,253.3
8,30,2020/12/09,109,133,195.1
9,60,2020/12/10,98,124,269.0


In [26]:
# notice now that the duplicate row is dropped but the index is no longer correct (12 has disappeared!)

# let's reset the index

data.index

Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 13, 14, 15, 16, 17,
            18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
           dtype='int64')

In [27]:
data.reset_index()

Unnamed: 0,index,duration_min,date,avg_pulse,max_pulse,calories
0,0,60,2020/12/01,110,130,409.1
1,1,60,2020/12/02,117,145,479.0
2,2,60,2020/12/03,103,135,340.0
3,3,45,2020/12/04,109,175,282.4
4,4,45,2020/12/05,117,148,406.0
5,5,60,2020/12/06,102,127,300.0
6,6,60,2020/12/07,110,136,374.0
7,7,45,2020/12/08,104,134,253.3
8,8,30,2020/12/09,109,133,195.1
9,9,60,2020/12/10,98,124,269.0


In [28]:
# calling .reset_index() without arguments creates a new index and turns the previous index into a column of its own.
# drop=True discards the old index and inplace=True makes the change permanent.

data.reset_index(drop=True, inplace=True)
data.index

RangeIndex(start=0, stop=31, step=1)
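
Dropping duplicates and resetting the index can also be done in one step with the `ignore_index` parameter of `drop_duplicates`. A minimal sketch on a made-up frame (`df` and its values are illustrative):

```python
import pandas as pd

# Hypothetical frame: rows 1 and 2 are identical
df = pd.DataFrame({
    "duration": [60, 45, 45, 30],
    "pulse": [110, 109, 109, 98],
})

# ignore_index=True relabels the remaining rows 0, 1, 2, ...
deduped = df.drop_duplicates(ignore_index=True)

print(deduped)
```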

## 5. Changing column types

In [29]:
# what is the date column data type?

data.dtypes

duration_min      int64
date             object
avg_pulse         int64
max_pulse         int64
calories        float64
dtype: object

In [30]:
# let's change the format of the date column to datetime (we will also come back to it in week 4!)

pd.to_datetime(data['date'])

0    2020-12-01
1    2020-12-02
2    2020-12-03
3    2020-12-04
4    2020-12-05
5    2020-12-06
6    2020-12-07
7    2020-12-08
8    2020-12-09
9    2020-12-10
10   2020-12-11
11   2020-12-12
12   2020-12-13
13   2020-12-14
14   2020-12-15
15   2020-12-16
16   2020-12-17
17   2020-12-18
18   2020-12-19
19   2020-12-20
20   2020-12-21
21          NaT
22   2020-12-23
23   2020-12-24
24   2020-12-25
25   2020-12-26
26   2020-12-27
27   2020-12-28
28   2020-12-29
29   2020-12-30
30   2020-12-31
Name: date, dtype: datetime64[ns]

In [31]:
# but this change isn't permanent yet
data.dtypes

duration_min      int64
date             object
avg_pulse         int64
max_pulse         int64
calories        float64
dtype: object

In [32]:
# overwrite...

data['date'] = pd.to_datetime(data['date'])

In [33]:
data.dtypes

duration_min             int64
date            datetime64[ns]
avg_pulse                int64
max_pulse                int64
calories               float64
dtype: object
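
When the layout of the date strings is known, it can be passed explicitly through `format`, and `errors="coerce"` turns values that cannot be parsed into `NaT` instead of raising an error. The series `raw` below is made up for illustration:

```python
import pandas as pd

# Hypothetical raw values: a valid date, a missing value and an unparseable string
raw = pd.Series(["2020/12/25", None, "not a date"])

# Values that do not fit the format become NaT rather than stopping the conversion
parsed = pd.to_datetime(raw, format="%Y/%m/%d", errors="coerce")

print(parsed)
```

Coercing is convenient, but it can hide problems silently, so it is worth checking `parsed.isna().sum()` against the number of values you expected to be missing.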

In [40]:
data.head()

Unnamed: 0,duration_min,date,avg_pulse,max_pulse,calories
0,60,2020-12-01,110,130,409.1
1,60,2020-12-02,117,145,479.0
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.4
4,45,2020-12-05,117,148,406.0


In [42]:
# let's also round the calories column and convert to integer

data['calories'] = data['calories'].round()
data

Unnamed: 0,duration_min,date,avg_pulse,max_pulse,calories
0,60,2020-12-01,110,130,409.0
1,60,2020-12-02,117,145,479.0
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.0
4,45,2020-12-05,117,148,406.0
5,60,2020-12-06,102,127,300.0
6,60,2020-12-07,110,136,374.0
7,45,2020-12-08,104,134,253.0
8,30,2020-12-09,109,133,195.0
9,60,2020-12-10,98,124,269.0


In [44]:
# changing the column to integer

# using .astype(int) raises an error because of the NaN values. Therefore the nullable type 'Int64', with a capital 'I', is needed:

data['calories'] = data['calories'].astype('Int64')

In [46]:
data.dtypes

duration_min             int64
date            datetime64[ns]
avg_pulse                int64
max_pulse                int64
calories                 Int64
dtype: object
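
The difference between `int` and the nullable `'Int64'` type can be reproduced on a tiny example. This is a sketch with made-up values; the names `calories`, `plain_int_failed` and `nullable` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical rounded calorie values with one gap
calories = pd.Series([409.1, np.nan, 282.4]).round()

# The plain NumPy integer type cannot hold NaN, so the conversion fails
try:
    calories.astype(int)
    plain_int_failed = False
except (ValueError, TypeError):
    plain_int_failed = True

# The nullable pandas type keeps the gap as <NA>
nullable = calories.astype("Int64")

print(plain_int_failed)
print(nullable)
```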

> # Missing Data 
>
>Missing data can invalidate the analyses, reports, and dashboards built on it, so it is essential to deal with it in some fashion before the actual analysis begins.
>
>In `pandas` missing data is represented by the values `NA` or `NaN`. As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data. 
>
>While NaN is the default missing value marker for reasons of computational speed and convenience, we need to be able to easily detect this value with data of different types: floating point, integer, boolean, and general object. In many cases, however, the Python None will arise and we wish to also consider that “missing” or “not available” or “NA”.
>
>Common missing values representations:
>
>- NaN (the primary one and the one used by pandas)
>- Empty string “”
>- Other strings (e.g. “unknown”, “uncategorized”, “?”, etc.)
>- Negative values (e.g. -1, or huge negatives like -999)
>
>### Where does missing data come from?
>
>- Failure of measurement
>- No information (e.g. lack of observation)
>- Technical issue (e.g. battery in a smart watch died)
>- Programming error
>
>**Note:** Some missing data can still be representative of an event or data point:
>
>- Purchase data with NaNs in the coupon discount column
>- Recipe data with NaN for the amount of an ingredient if we don’t need it
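
Placeholders such as empty strings, "unknown" or -999 are not recognised as missing by `isnull()`. One way to handle them is to convert them to `NaN` first. The sketch below is illustrative only: the frame `df` and the choice of placeholder values are assumptions, and real data may use different markers.

```python
import numpy as np
import pandas as pd

# Hypothetical frame using placeholder values instead of NaN
df = pd.DataFrame({
    "city": ["Berlin", "unknown", "", "Hamburg"],
    "age": [34, -999, 28, -1],
})

cleaned = df.copy()

# Replace the assumed placeholders with a real missing-value marker
cleaned["city"] = cleaned["city"].replace(["unknown", ""], np.nan)
cleaned["age"] = cleaned["age"].replace([-999, -1], np.nan)

print(cleaned.isna().sum())
```

After the replacement, the usual tools such as `isnull().sum()`, `fillna()` and `dropna()` see these cells as missing.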

## 6. Dealing with missing values

If inspection reveals missing values, deciding how to handle them becomes a priority. The right choice always depends on the situation. Above all, **avoid introducing bias into the dataset**. One option is therefore to leave the missing values as they are, but that may prevent the use of some valuable data analysis tools.

**Here are some options as to how to deal with missing data:**

-  drop the observations with missing values
-  insert mode/mean/median depending on data type
-  insert the next or last known value using `pandas.DataFrame.fillna()`
-  insert the mean/median dependent on another column 
-  for time series data: interpolate using `pandas.Series.interpolate`
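
The options above can be sketched side by side on a small made-up frame. Everything here (the name `df`, the values, and the choice of `duration_min` as the grouping column) is an assumption for illustration; which strategy is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing calorie values
df = pd.DataFrame({
    "duration_min": [60, 60, 45, 45, 60],
    "calories": [400.0, np.nan, 280.0, np.nan, 380.0],
})

# Insert the median of the known values
median_filled = df["calories"].fillna(df["calories"].median())

# Carry the last known value forward
forward_filled = df["calories"].ffill()

# Interpolate linearly between neighbouring known values
interpolated = df["calories"].interpolate()

# Insert the mean of the rows that share the same duration
group_mean = df.groupby("duration_min")["calories"].transform("mean")
group_mean_filled = df["calories"].fillna(group_mean)

print(pd.DataFrame({
    "median": median_filled,
    "ffill": forward_filled,
    "interpolate": interpolated,
    "group_mean": group_mean_filled,
}))
```

Each strategy produces different values for the same gaps, which is exactly why the choice can introduce bias and should be made deliberately.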


In [48]:
data

Unnamed: 0,duration_min,date,avg_pulse,max_pulse,calories
0,60,2020-12-01,110,130,409.0
1,60,2020-12-02,117,145,479.0
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.0
4,45,2020-12-05,117,148,406.0
5,60,2020-12-06,102,127,300.0
6,60,2020-12-07,110,136,374.0
7,45,2020-12-08,104,134,253.0
8,30,2020-12-09,109,133,195.0
9,60,2020-12-10,98,124,269.0


In [49]:
# calculate the number of missing values
data.isnull().sum()

duration_min    0
date            1
avg_pulse       0
max_pulse       0
calories        2
dtype: int64

In [50]:
# fill any missing data with a specific value (date column)

data.loc[21,'date']

NaT

In [51]:
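# fillna fills all missing values with a given value
# assign the result back: calling fillna(..., inplace=True) on a selected column may not update `data` in newer pandas versions
data['date'] = data['date'].fillna(pd.Timestamp('2020-12-22'))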

data.loc[21,'date']

Timestamp('2020-12-22 00:00:00')

In [54]:
# fill the missing data with the mean of the known values (calories column). Rows 17 and 27 are missing.
# What is the mean for calories?

mean_cal = data['calories'].mean()
mean_cal = mean_cal.round().astype(int)
mean_cal

306

In [56]:
# using .fillna() again

data['calories'].fillna(mean_cal)

# but this isn't persistent: the result is not assigned back to data['calories']

0     409
1     479
2     340
3     282
4     406
5     300
6     374
7     253
8     195
9     269
10    329
11    251
12    345
13    379
14    275
15    215
16    300
17    306
18    323
19    243
20    364
21    282
22    300
23    246
24    334
25    250
26    241
27    306
28    280
29    380
30    243
Name: calories, dtype: Int64

In [57]:
#  we can also choose to drop any rows with missing values

data.isnull().sum()

duration_min    0
date            0
avg_pulse       0
max_pulse       0
calories        2
dtype: int64

In [58]:
data.head()


Unnamed: 0,duration_min,date,avg_pulse,max_pulse,calories
0,60,2020-12-01,110,130,409
1,60,2020-12-02,117,145,479
2,60,2020-12-03,103,135,340
3,45,2020-12-04,109,175,282
4,45,2020-12-05,117,148,406


In [59]:
# remove every column that contains missing values with .dropna and the argument axis=1

data.dropna(axis=1)

# the column 'calories' has disappeared

Unnamed: 0,duration_min,date,avg_pulse,max_pulse
0,60,2020-12-01,110,130
1,60,2020-12-02,117,145
2,60,2020-12-03,103,135
3,45,2020-12-04,109,175
4,45,2020-12-05,117,148
5,60,2020-12-06,102,127
6,60,2020-12-07,110,136
7,45,2020-12-08,104,134
8,30,2020-12-09,109,133
9,60,2020-12-10,98,124


In [60]:
# and we can remove every row that contains missing values with .dropna and the argument axis=0

data.dropna(axis=0)

# rows 17 and 27 have disappeared

Unnamed: 0,duration_min,date,avg_pulse,max_pulse,calories
0,60,2020-12-01,110,130,409
1,60,2020-12-02,117,145,479
2,60,2020-12-03,103,135,340
3,45,2020-12-04,109,175,282
4,45,2020-12-05,117,148,406
5,60,2020-12-06,102,127,300
6,60,2020-12-07,110,136,374
7,45,2020-12-08,104,134,253
8,30,2020-12-09,109,133,195
9,60,2020-12-10,98,124,269
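
`dropna` also accepts parameters that make the dropping more selective. The sketch below uses a made-up frame (`df` and its values are illustrative) to compare the default behaviour with `subset` and `thresh`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in different columns
df = pd.DataFrame({
    "duration_min": [60, 45, 30, 60],
    "date": ["2020-12-01", None, "2020-12-03", None],
    "calories": [409.0, 282.0, np.nan, np.nan],
})

# Default: drop every row that has any missing value
any_missing_dropped = df.dropna()

# Only drop rows where 'calories' is missing
calories_required = df.dropna(subset=["calories"])

# Keep rows that have at least two non-missing values
mostly_complete = df.dropna(thresh=2)

print(any_missing_dropped)
print(calories_required)
print(mostly_complete)
```

None of these calls changes `df`; as elsewhere in this lesson, the result has to be assigned to keep it.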
