# Section D2 - Data Cleaning with Pandas

Sometimes you'll need to pre-process your data before you can analyze it or present it for analysis. A few scenarios:
* Excluding rows or columns with missing or invalid data:
  * You have a dataset of several measurements made on many samples. Some of the samples don't have all measurements done, so you need to exclude them. You can use dropna to remove rows or columns that are missing the needed measurements.
  * Some samples have negative values reported, but this is impossible and would have been the result of a transcription error. We can exclude these rows. 

* Interpolation:
  * You have time series data collected at irregular intervals and need to interpolate it to a regular interval. 
  * You have spectroscopy data for many samples, measured at regular but slightly different wavelengths for each sample, and you need to interpolate each sample so the wavelengths all align.
* Filling gaps in data
  * You have time series data with small gaps - you can forward fill or backward fill them.
* Smoothing out noise in data
  * It is common to use a rolling median to smooth out an analog signal - it might be noisy from second to second, but a rolling median over 20 seconds will smooth it. Often the noise comes from the measurement and not the sample, so the noise should be removed.
  * If you have low frequency data, e.g. tidal data, with regular high frequency noise in it, you can use a Butterworth filter as a low pass filter to exclude the high frequency signal and preserve the signal you want to analyse.
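
Several of the scenarios above can be sketched with plain pandas. The block below is a minimal illustration, not a recipe: the timestamps, the `reading` column, the 10 second grid, and the 3-sample window are all made-up assumptions chosen to keep the example small. The Butterworth filter is left out because it needs SciPy in addition to pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one reading taken at irregular times, with a gap (NaN)
# and an impossible negative value.
times = pd.to_datetime([
    "2024-01-01 00:00:00",
    "2024-01-01 00:00:07",
    "2024-01-01 00:00:10",
    "2024-01-01 00:00:22",
    "2024-01-01 00:00:30",
])
raw = pd.DataFrame({"reading": [1.0, -5.0, 2.0, np.nan, 4.0]}, index=times)

# Excluding invalid data: drop rows with negative readings (NaN rows are kept for now).
valid = raw[~(raw["reading"] < 0)]

# Filling gaps: forward fill repeats the last known value,
# backward fill uses the next known value.
ffilled = valid["reading"].ffill()
bfilled = valid["reading"].bfill()

# Interpolation: move the known readings onto a regular 10 second grid.
known = valid["reading"].dropna()
grid = pd.date_range(known.index.min(), known.index.max(), freq=pd.Timedelta(seconds=10))
regular = (
    known.reindex(known.index.union(grid))  # add the grid timestamps as NaN rows
    .interpolate(method="time")             # interpolate linearly in time
    .reindex(grid)                          # keep only the grid timestamps
)

# Smoothing: centered rolling median over 3 samples.
smoothed = regular.rolling(window=3, center=True, min_periods=1).median()

print(regular)
print(smoothed)
```

Each step returns a new object and leaves `raw` unchanged, so you can compare the cleaned data against the original at any point.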

## Removing rows with missing data
We can check whether any rows in our data have NaN values using the .isnull() check. NaN is sort of like *None* in regular Python. We can then use df.dropna() to drop rows or columns with NaN values. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html

dropna takes an optional 'axis' argument.  axis=0 is the default and drops every row that contains a NaN value.  axis=1 tells dropna to drop every column that contains a NaN value.
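
To make the effect of 'axis' concrete, here is a small sketch on a made-up dataframe (the `toy` frame and its `length` and `mass` columns are hypothetical, not part of the workout data). It also shows the `subset` argument, which limits the NaN check to the columns you name.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements: sample "c" is missing its mass.
toy = pd.DataFrame(
    {"length": [1.0, 2.0, 3.0], "mass": [10.0, 20.0, np.nan]},
    index=["a", "b", "c"],
)

rows_dropped = toy.dropna()                  # axis=0 (default): removes row "c"
cols_dropped = toy.dropna(axis=1)            # removes the "mass" column
length_only = toy.dropna(subset=["length"])  # only checks "length", so no rows are removed

print(rows_dropped)
print(cols_dropped)
print(length_only)
```

dropna returns a new dataframe by default, so `toy` itself still contains the NaN value afterwards.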

Let's import some data tracking workout length, average pulse, max pulse, and calories burned.  This data was entered manually and might have some missing values:

In [10]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/a8ksh4/python_workshop/main/SAMPLE_DATA/pulse_calories_modified.csv')
print(df.head(2))
print(df.info())

   Duration  Pulse  Maxpulse  Calories
0      60.0  110.0     130.0     409.1
1      60.0  117.0     145.0     479.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  168 non-null    float64
 1   Pulse     168 non-null    float64
 2   Maxpulse  168 non-null    float64
 3   Calories  164 non-null    float64
dtypes: float64(4)
memory usage: 5.4 KB
None


We can check for NaN values with df.isnull(). This function returns a dataframe of the same shape as df, but filled with boolean values: True where the cell is NaN and False otherwise.  The ".any(axis=1)" call checks each row in the resulting dataframe and reports True if any cell in that row is True.  So na_rows is a boolean series marking the rows that contain at least one NaN value, and we can use it as a mask to view those rows in the original dataframe:

In [6]:
na_rows = df.isnull().any(axis=1)
print(df[na_rows])

     Duration  Pulse  Maxpulse  Calories
8         NaN  109.0     133.0     195.1
17       45.0   90.0     112.0       NaN
27       60.0  103.0     132.0       NaN
91       45.0  107.0     137.0       NaN
118      60.0  105.0     125.0       NaN
135      20.0    NaN     156.0     189.0
141      60.0   97.0     127.0       NaN
146      60.0  107.0       NaN     400.0
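
The same boolean dataframe can also be summed to count missing values. The sketch below uses a small made-up frame (`toy` and its values are hypothetical) so the counts are easy to check by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical workout log with a few missing entries.
toy = pd.DataFrame({
    "Duration": [60.0, np.nan, 45.0],
    "Pulse": [110.0, 109.0, np.nan],
    "Calories": [409.1, 195.1, np.nan],
})

mask = toy.isnull()            # True where a cell is NaN
na_per_column = mask.sum()     # True counts as 1, so this counts NaN per column
na_rows = mask.any(axis=1)     # True for rows with at least one NaN
n_na_rows = int(na_rows.sum()) # number of rows that dropna() would remove

print(na_per_column)
print(n_na_rows)
```

Comparing `n_na_rows` with the row count before and after calling dropna is a quick way to confirm that only the expected rows were removed.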


#### *Exercise*:
Use df.dropna() to remove rows with missing values for pulse, calories, or duration.  Run df.info() before and after, and re-run the na_rows code cell above to verify the change in the number of rows.

## Setting data type of columns

## Interpolation of time series data

In [None]:
# drop columns with missing values
df.dropna(axis=1)