# Section D2 - Data Cleaning with Pandas

Topics:
* Setting column data types
* Dropping rows with NaN values
* Removing rows with invalid values

Sometimes you'll need to pre-process your data before you can analyze it or present it for analysis. A few scenarios:
* Excluding rows or columns with missing or invalid data:
  * You have a dataset of several measurements made on many samples. Some samples are missing measurements, so you need to exclude them. You can use `dropna` to remove rows or columns that are missing the needed measurements.
  * Some samples have negative values reported, but this is impossible and must be the result of a transcription error. You can exclude these rows.

* Interpolation:
  * You have time series data collected at irregular intervals and need to interpolate it to a regular interval. 
  * You have spectroscopy data for many samples, each measured at regular but not identical wavelengths, and you need to interpolate each sample so the wavelengths all align.
* Filling gaps in data
  * You have time series data with small gaps - you can forward fill or backward fill them.
* Smoothing out noise in data
  * It is common to use a rolling median to smooth out an analog signal - it might be noisy from second to second, but a rolling median over 20 seconds will smooth it. Often the noise comes from the measurement and not the sample, so the noise should be removed.
  * If you have low-frequency data (e.g. tidal) with regular high-frequency noise in it, you can use a Butterworth filter as a low-pass filter to exclude the high-frequency signal and preserve the signal you want to analyze.
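
A few of the scenarios above can be sketched on a small made-up series. The values and the 3-sample window are illustrative assumptions, not part of the workshop data, and a real signal would likely need a different window.

```python
import numpy as np
import pandas as pd

# Hypothetical signal with a two-sample gap and one noisy spike (9.0)
signal = pd.Series([1.0, 1.1, np.nan, np.nan, 1.2, 9.0, 1.3, 1.2])

# Filling gaps: carry the last valid value forward, or the next one backward
filled_forward = signal.ffill()
filled_backward = signal.bfill()

# Interpolation: estimate missing values linearly between their neighbours
interpolated = signal.interpolate()

# Smoothing: a centered rolling median over 3 samples suppresses the spike.
# The first and last positions are NaN because the window is incomplete there.
smoothed = interpolated.rolling(window=3, center=True).median()

print(pd.DataFrame({'raw': signal,
                    'ffill': filled_forward,
                    'bfill': filled_backward,
                    'interp': interpolated,
                    'smooth': smoothed}))
```

Comparing the columns side by side shows how each method treats the same gap, which helps when deciding which one suits your data.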

In [38]:
import pandas as pd
# df = pd.read_csv('https://raw.githubusercontent.com/a8ksh4/python_workshop/main/SAMPLE_DATA/pulse_calories_modified.csv')
df = pd.read_csv('./SAMPLE_DATA/pulse_calories_modified.csv')
print(df.head(2))
print(df.tail(2))
print(df.info())

  Duration Pulse Maxpulse Calories
0       60   110      130    409.1
1       60   117      145    479.0
    Duration Pulse Maxpulse Calories
169       75   125      150    330.4
170        a     b        c        d
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171 entries, 0 to 170
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Duration  170 non-null    object
 1   Pulse     170 non-null    object
 2   Maxpulse  170 non-null    object
 3   Calories  166 non-null    object
dtypes: object(4)
memory usage: 5.5+ KB
None


## Setting correct data type for columns
Right now, `df.info()` reports a data type of "object" for all columns. Also note the invalid string values in the `df.tail` output. We can use `.astype(...)` to convert the data to numeric types. `astype` raises an exception if any value in the column(s) cannot be converted to the given data type(s). When this happens, we can either identify and fix those values first, or pass `errors='ignore'`, which suppresses the exception and returns the original data unconverted.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
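
As a minimal sketch of the failure mode described above, the toy series below (made-up values, not the workshop file) contains one string that cannot be parsed as an integer:

```python
import pandas as pd

# Hypothetical column as it might arrive from a CSV: every value is a string
raw = pd.Series(['60', '45', 'a'])

try:
    raw.astype('int')
    conversion_failed = False
except ValueError as err:
    # 'a' cannot be parsed as an integer, so the whole conversion fails
    conversion_failed = True
    print(f'astype failed: {err}')

# With the bad value filtered out first, the same call succeeds
clean = raw[raw != 'a'].astype('int')
print(clean.dtype)
```

Note that one bad value is enough to make the conversion of the entire column fail.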

#### *Exercise*
Run the following cell to convert the columns to numeric data types, and modify it as needed to resolve the error from astype.  Add a print statement for df_numeric.info to verify the new data types of each of the columns.
What happened to the string values in the last row of the dataframe?

Also take note of the different usage examples commented out. 

In [37]:
# Single column method:
# df['Duration'] = df['Duration'].astype('int')

# Multiple columns method:
# df = df.astype('float')

# Multiple columns method, with a data type per column:
df_numeric = df.astype({'Duration': 'int', 
                'Pulse': 'int', 
                'Maxpulse': 'int', 
                'Calories': 'float'}, 
                errors='ignore')

    Duration Pulse Maxpulse Calories
166       60   110      145    300.0
167       60  -115      145    310.2
168       75   120      150    320.4
169       75   125      150    330.4
170        a     b        c        d


Since the string values we want to drop are still included, we can try `pd.to_numeric` instead of `astype`.  `to_numeric` accepts the option `errors='coerce'`, which converts incompatible values to NaN rather than preserving the invalid values as `astype` with `errors='ignore'` does.

But `to_numeric` works only on one-dimensional data such as a Series (a single column), so we need to call it once per column or use `.apply` to run it against all columns.

https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html#pandas.to_numeric
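
Here is a small sketch of what `errors='coerce'` does to a single column. The values are made up for illustration:

```python
import pandas as pd

# Hypothetical column mixing numeric strings, a bad string, and a missing value
raw = pd.Series(['60', '45.5', 'a', None])

# Values that cannot be parsed become NaN instead of raising an exception
converted = pd.to_numeric(raw, errors='coerce')
print(converted)
print(converted.dtype)
```

Because NaN is a floating point value, a column that contains coerced values ends up with a float data type even if the valid values are all whole numbers.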

In [41]:
# If I didn't know about .apply, I would use this to convert each column:
# df_numeric = df.copy()
# for col in df_numeric.columns:
#     df_numeric[col] = pd.to_numeric(df_numeric[col], errors='coerce')

# But this is more concise, and in the spirit of how pandas is intended to be used:
df_numeric = df.apply(pd.to_numeric, errors='coerce')


A helpful way to see what was changed is to make a mask indicating which rows in df_numeric now have NaN values, and use that mask to look at the original rows still preserved in df. Below, `.isnull()` returns a boolean dataframe of the same shape as the dataframe it is called on, with True/False indicating whether each cell is NaN.  `.any(axis=1)` then looks across each row and reports True for any row that contains at least one True, returning a Series that we use as a mask.  With `axis=0` it would instead report one value per column.

In [42]:
# Print rows which will cause errors in astype conversion to numeric types:
numeric_na_rows = df_numeric.isnull().any(axis=1)
print(df[numeric_na_rows])

    Duration  Pulse Maxpulse Calories
8        NaN    109      133    195.1
17        45     90      112      NaN
27        60    103      132      NaN
91        45    107      137      NaN
118       60    105      125      NaN
135       20    NaN      156    189.0
141       60     97      127      NaN
146       60    107      NaN    400.0
153    'foo'  'bar'    'bla'    'asd'
170        a      b        c        d
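
To see the difference between `axis=1` and `axis=0` in isolation, here is a sketch on a tiny made-up dataframe (the names `toy`, `a`, and `b` are placeholders):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                    'b': [4.0, 5.0, 6.0]})

cell_is_nan = toy.isnull()              # same shape as toy, True where NaN
row_has_nan = cell_is_nan.any(axis=1)   # one True/False per row
col_has_nan = cell_is_nan.any(axis=0)   # one True/False per column

print(row_has_nan)
print(col_has_nan)

# A row mask selects rows when used to index the dataframe
print(toy[row_has_nan])
```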


## Removing rows with NaN values
Now that we have only numeric values and NaN values, we can simply call df.dropna() to drop rows (or columns, if we specify the axis argument) that have a NaN in them.  Note that dropna returns a new dataframe, so assign the result to a variable to keep it.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
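
As a quick illustration on made-up data (not the workshop dataframe), the sketch below shows the row and column variants of `dropna`, plus the `subset` argument for checking only certain columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                    'b': [4.0, 5.0, 6.0]})

rows_dropped = toy.dropna()              # drops the row containing a NaN
cols_dropped = toy.dropna(axis=1)        # drops the column containing a NaN
only_b = toy.dropna(subset=['b'])        # only considers NaN in column 'b'

print(rows_dropped)
print(cols_dropped)
print(len(toy))  # the original dataframe is unchanged
```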

#### *Exercise*:
Run df_numeric.dropna() in the following cell and use .info() to see how many rows we are left with.  How many were removed from the original df dataframe?

## Identifying bad data analytically
We have a bunch of numeric data in our dataframe now, but we might want to sanity check that it looks correct.  After all, our interns who transcribed this data are overworked, underpaid, and distracted, so there could be mistakes. 

#### *Exercise*:
Use masks to identify data meeting each of the following conditions.  Print and remove the identified rows from our dataset:
* Rows with Pulse or Maxpulse <30 or >220, as these would be impossible for normal humans
* Rows with Pulse greater than Maxpulse, as the average can't possibly be greater than the max observed
* Rows with Duration or Calories < 0, as this is not possible

Remember that `~mask` is the inverse of mask, so if your mask matches the condition you want to remove, then you want to set your dataframe to `df[~mask]` to remove the rows that met the criteria.

You can check each condition one by one, or you could check them in a loop and, in each iteration, "or" the previous mask and the new mask together with "|" (the pipe symbol).  After the loop, your mask will cover all of the conditions and you can remove all matching rows from the dataframe at once.
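
The loop approach can be sketched on an unrelated toy dataset so the exercise is still yours to solve. The column names and thresholds here are made up for illustration:

```python
import pandas as pd

# Hypothetical weather readings with two impossible values
toy = pd.DataFrame({'temp_c': [21.5, -300.0, 19.0, 22.0],
                    'humidity': [40, 55, 130, 60]})

conditions = [
    toy['temp_c'] < -273.15,   # below absolute zero
    toy['humidity'] > 100,     # relative humidity above 100 percent
]

# Start with an all-False mask and "or" each condition into it
bad_rows = pd.Series(False, index=toy.index)
for condition in conditions:
    bad_rows = bad_rows | condition

print(toy[bad_rows])        # rows matching any condition
toy_clean = toy[~bad_rows]  # keep everything else
```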

In [None]:
# Your code here
foo = ...
df_numeric = df_numeric[foo]
...

Another thing we can do is check that the calories burned per unit time are reasonable.  We'll do this by calculating normalized calories per duration and identifying outliers which are more than two standard deviations from the average:

#### *Exercise*
* Create a "NormalizedCalories" column equal to Calories divided by Duration
* Calculate the average and standard deviation of this new column.  
* Create a mask for abs(NormalizedCalories - Average) > (2 * Stdev)
* Print the rows identified and remove them from the dataset.  Does this seem like a reasonable filter for outliers?

A couple helpful functions you can use for this are:
* np.abs(...) to calculate the absolute value
* df['col_name'].describe() will return a pandas **Series** of statistics describing the column.  You can look up values by label, much like a dictionary, so 'std' and 'mean' give you the standard deviation and the average.  Try printing it to see what all is included. 
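
As a minimal sketch with made-up numbers, the result of `describe()` on a numeric column is a pandas Series indexed by statistic name, so individual statistics can be looked up by label:

```python
import pandas as pd

# Hypothetical values, unrelated to the workshop data
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

stats = values.describe()
print(type(stats))
print(stats)

# Look up individual statistics by their label
mean = stats['mean']
std = stats['std']
print(mean, std)
```

The 'std' entry is the sample standard deviation, the same value that `values.std()` returns.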


In [None]:
import numpy as np
# your code here:
# Work on df_numeric, since the columns in df are still strings
df_numeric['NormalizedCalories'] = ...
stats = df_numeric['NormalizedCalories'].describe()
...

## Setting data type of columns

## Interpolation of time series data

In [None]:
# drop columns with missing values
d

#### *Exercise*: