# ER131: Data Cleaning and Exploratory Data Analysis

Duncan Callaway

In this notebook we'll work with PurpleAir data to explore the concepts of Structure, Granularity, Scope, Temporality and Faithfulness.  Along the way we'll talk about data cleaning as well.  

[Here's PurpleAir's website](https://www.purpleair.com/map#1/25/-30) -- They have really cool maps!

The way I developed this lecture was by pulling the data down and exploring it.  You'll see my (edited) process of examining the data.

This began by me visiting [this website](https://www.purpleair.com/sensorlist) to look for data.  I used the Chrome browser to pull data (other browsers didn't work).

The folks at PurpleAir also sent me a PDF describing their data, which is available from the instructors.  

In [1]:
import numpy as np
import pandas as pd
import os

## Structure: how are the data stored?  

In [9]:
!ls

CAISO_2017to2018_stack.csv
Icon?
Lecture 05.1 Groupby and Pivot.pptx
Lecture 05.2 Groupby.html
Lecture 05.2 Groupby.ipynb
Lecture 05.3 Pivot.html
Lecture 05.3 Pivot.ipynb
Lecture 05.4-06.1 Data Cleaning, EDA.pptx
Lecture 05.5 EDA.ipynb
US-EPA-PM2.5-AQI-Monitoring.png
data
~$Lecture 05.4-06.1 Data Cleaning, EDA.pptx


Let's look in the data directory:

In [14]:
!ls 'data'

Alameda Gold Coast (outside) (37.767347 -122.267255) Primary Real Time 09_08_2021 09_07_2022.csv
B59-Mech (outside) (37.875921 -122.253082) Primary Real Time 09_08_2021 09_07_2022.csv
Backyard (outside) (37.826875 -122.245254) Primary Real Time 09_08_2021 09_07_2022.csv
Bower House (outside) (37.803884 -122.297151) Primary Real Time 09_08_2021 09_07_2022.csv
Icon?
Moraga Ave (outside) (37.83023 -122.239963) Primary Real Time 09_08_2021 09_07_2022.csv
manzanita at villanova (outside) (37.84099 -122.196456) Primary Real Time 09_08_2021 09_07_2022.csv


### Q: What can we learn from these file names?
* the sensor location appears to be provided in lat / lon coordinates in parens
* the date range is listed
* they are probably csv files.

If you type the lat-lon values into google maps, you'll find they correspond to the locations of purple air sensors with the same name. [Here](https://www.google.com/maps/dir/37.803884,-122.297151/37.826875+-122.245254/37.83023+-122.239963/37.84099+-122.196456/@37.8242299,-122.2991142,12.42z/data=!4m15!4m14!1m0!1m3!2m2!1d-122.245254!2d37.826875!1m3!2m2!1d-122.239963!2d37.83023!1m3!2m2!1d-122.196456!2d37.84099!3e1) is a route through these sites. 
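Since the file names carry metadata, we can pull it out programmatically. Here is a minimal sketch, assuming every file follows the same pattern as the six names listed above (other PurpleAir downloads may be named differently). The function name `parse_sensor_filename` and the field names are my own, not anything from PurpleAir.

```python
import re

# Pattern inferred from the file names in the data directory (an assumption):
# "<name> (<placement>) (<lat> <lon>) <kind> <start> <end>.csv"
PATTERN = re.compile(
    r"^(?P<name>.+?) \((?P<placement>\w+)\) "
    r"\((?P<lat>-?\d+\.\d+) (?P<lon>-?\d+\.\d+)\) "
    r"(?P<kind>.+?) (?P<start>\d{2}_\d{2}_\d{4}) (?P<end>\d{2}_\d{2}_\d{4})\.csv$"
)

def parse_sensor_filename(filename):
    """Return a dict of metadata from a file name, or None if it doesn't match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    info = match.groupdict()
    info["lat"] = float(info["lat"])   # latitude in decimal degrees
    info["lon"] = float(info["lon"])   # longitude in decimal degrees
    return info

info = parse_sensor_filename(
    "Bower House (outside) (37.803884 -122.297151) "
    "Primary Real Time 09_08_2021 09_07_2022.csv"
)
print(info)
```

Files that don't fit the pattern (like `Icon?`) simply return `None`, so they are easy to skip.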

Before proceeding let's find the size of some of these files:

In [29]:
!ls -l 'data/Bower House (outside) (37.803884 -122.297151) Primary Real Time 09_08_2021 09_07_2022.csv'

-rw-------  1 duncancallaway  staff  28354380 Sep  7  2022 data/Bower House (outside) (37.803884 -122.297151) Primary Real Time 09_08_2021 09_07_2022.csv


The number in the middle is the size of the file in bytes -- so 28MB. Pretty big.

Let's read in one of the .csv files:

In [30]:
Bower = pd.read_csv('data/Bower House (outside) (37.803884 -122.297151) Primary Real Time 09_08_2021 09_07_2022.csv')

In [31]:
Bower.head()

Unnamed: 0,created_at,entry_id,PM1.0_CF1_ug/m3,PM2.5_CF1_ug/m3,PM10.0_CF1_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_ATM_ug/m3,Unnamed: 10
0,2021-09-08 00:00:30 UTC,271065,14.34,29.12,38.14,40112.0,-71.0,84.0,38.0,28.73,
1,2021-09-08 00:02:30 UTC,271066,15.74,29.59,37.28,40114.0,-70.0,84.0,38.0,29.15,
2,2021-09-08 00:04:30 UTC,271067,13.86,28.89,41.7,40116.0,-75.0,84.0,37.0,28.64,
3,2021-09-08 00:06:30 UTC,271068,14.43,28.59,45.93,40118.0,-76.0,84.0,37.0,28.41,
4,2021-09-08 00:08:30 UTC,271069,13.0,27.02,41.47,40120.0,-70.0,84.0,36.0,27.02,


### Q: What do you notice about the file contents?

Several things to note from this: 
1. Dates are UTC.
2. Each entry has a unique ID -- could be used to check for time stamp errors or gaps in data
3. Headers have 'CF1' or 'ATM' at the top -- what does that mean?
    1. From the PurpleAir documentation, in this directory, *"ATM is "atmospheric", meant to be used for outdoor applications. CF=1 is meant to be used for indoor or controlled environment applications. However, PurpleAir uses CF=1 values on the map. This value is lower than the ATM value in higher measured concentrations."*  
    2. The explanation is a little vague and suggests further exploration required!
    3. [This](https://amt.copernicus.org/articles/14/4617/2021/) cool paper suggests that the ATM data are 'raw' measurements and that CF1 data have a 3/2 multiplication at concentrations over 25 $\mu$g/m$^3$
4. The columns "UptimeMinutes" and "RSSI_dbm" are not immediately obvious
    1. again from documentation: "uptimeminutes" is time since last restart, and "RSSI_dbm" is wifi signal strength for the device.  
5. The "unnamed: 10" column seems useless, why is it there?
    1. Looking at the data we see a comma before the `\n` (newline character) at the end of the first (header) line; this trailing comma appears to be generating the extra column.
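To see how a trailing comma produces an empty column, here is a small sketch on a made-up CSV string (the column names `a` and `b` are placeholders, not PurpleAir fields). It also shows one possible way to drop such a column; whether you want to do that in your own analysis is a judgment call.

```python
import io

import pandas as pd

# Made-up CSV text with a trailing comma on every line
csv_text = "a,b,\n1,2,\n3,4,\n"

raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())      # pandas invents a name for the empty third field

# Drop any column that contains only missing values
cleaned = raw.dropna(axis="columns", how="all")
print(cleaned.columns.tolist())
```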

### The `.describe` method is one of the most important you can use for EDA

In [32]:
Bower.describe()

Unnamed: 0,entry_id,PM1.0_CF1_ug/m3,PM2.5_CF1_ug/m3,PM10.0_CF1_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_ATM_ug/m3,Unnamed: 10
count,351942.0,351942.0,351942.0,351942.0,351942.0,351942.0,351942.0,351942.0,351942.0,0.0
mean,390592.6365,7.703851,15.989965,24.196788,26747.629834,-70.180916,65.679018,50.249541,14.203361,
std,68985.096555,9.894243,19.466634,24.491579,18879.817383,5.026611,11.876395,14.772064,14.610303,
min,271065.0,0.0,0.0,0.0,1.0,-93.0,36.0,5.0,0.0,
25%,336116.0,1.21,3.59,6.75,10005.0,-71.0,58.0,41.0,3.59,
50%,380108.5,3.76,8.75,16.41,23995.5,-69.0,63.0,55.0,8.75,
75%,444398.75,10.22,20.28,33.27,41299.0,-67.0,71.0,62.0,20.27,
max,532384.0,336.36,1297.91,1582.45,71582.0,-53.0,126.0,77.0,865.12,


As you're learning about your data, `.describe` gives you the chance to do an initial "sniff test" to see whether you think the data are in good condition. 
* Are there anomalously high maxima or low minima?
* Are the averages much higher or lower than what you might expect?
* Are there other characteristics of the distributions that look suspicious, or curious?
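One way to make the sniff test a little more systematic is to compare each column's maximum to its upper quartile. This is only a sketch on made-up numbers, and the factor of 10 is an arbitrary choice of mine, not a standard rule.

```python
import pandas as pd

# Toy data standing in for a sensor table (values are made up)
toy = pd.DataFrame({
    "pm25": [3.0, 5.0, 8.0, 12.0, 20.0, 1300.0],
    "temp_f": [58.0, 60.0, 63.0, 65.0, 71.0, 72.0],
})

summary = toy.describe()

# Flag columns whose max is more than 10x the 75th percentile.
# The factor 10 is an arbitrary threshold chosen for illustration.
ratio = summary.loc["max"] / summary.loc["75%"]
suspicious = ratio[ratio > 10].index.tolist()
print(suspicious)
```

A flagged column isn't necessarily wrong; it just tells you where to look first.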


As an aside, before we do more EDA, let's check the other locations.

In [33]:
backyard = pd.read_csv('data/Backyard (outside) (37.826875 -122.245254) Primary Real Time 09_08_2021 09_07_2022.csv')
np.mean(backyard['PM2.5_CF1_ug/m3'])

11.776786947061435

In [34]:
moraga = pd.read_csv('data/Moraga Ave (outside) (37.83023 -122.239963) Primary Real Time 09_08_2021 09_07_2022.csv')
np.mean(moraga['PM2.5_CF1_ug/m3'])

10.18030246986375

In [35]:
manzanita = pd.read_csv('data/manzanita at villanova (outside) (37.84099 -122.196456) Primary Real Time 09_08_2021 09_07_2022.csv')
np.mean(manzanita['PM2.5_CF1_ug/m3'])

8.031333700168869

In [36]:
alameda = pd.read_csv('data/Alameda Gold Coast (outside) (37.767347 -122.267255) Primary Real Time 09_08_2021 09_07_2022.csv')
np.mean(alameda['PM2.5_CF1_ug/m3'])

13.041692402454741

Now you can see that the mean PM2.5 numbers vary significantly by location.  

If you inspect the data, you'll see a general trend: the further away from the Bay the sensor is, the lower its mean. 
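Reading each file by hand gets repetitive. A loop over the directory does the same job. The sketch below writes two tiny made-up files to a temporary directory so it runs anywhere; the helper name `mean_by_file` is mine, and with the real data you would point it at `'data'` instead.

```python
import tempfile
from pathlib import Path

import pandas as pd

COL = "PM2.5_CF1_ug/m3"

def mean_by_file(data_dir, column=COL):
    """Return a Series of column means, one per CSV file in data_dir."""
    means = {}
    for path in sorted(Path(data_dir).glob("*.csv")):
        means[path.stem] = pd.read_csv(path)[column].mean()
    return pd.Series(means).sort_values()

# Demo on two made-up files (not real sensor data)
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({COL: [10.0, 20.0]}).to_csv(Path(tmp) / "site_a.csv", index=False)
    pd.DataFrame({COL: [4.0, 6.0]}).to_csv(Path(tmp) / "site_b.csv", index=False)
    site_means = mean_by_file(tmp)

print(site_means)
```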

Let's dig in to one sensor a little more

## Granularity: how are the data aggregated?

We'll talk a little more about Temporality in a moment, but time also matters for thinking about granularity.

First we need to pay attention to the fact that this is UTC.  Let's put it in datetime format to prevent mistakes.

In [37]:
Bowertime = pd.to_datetime(Bower['created_at'], utc=True)

In [38]:
Bower['created_at']=Bowertime

In [39]:
Bower['created_at'].dtype

datetime64[ns, UTC]

Yes, that response really means the times are stored with nanosecond precision. However, if you use `.second` on one of the time entries you'll see that the resolution of the recorded values is never finer than 1 second.
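Once the column is timezone-aware, converting to local time is straightforward. Here is a sketch on two timestamps copied from the head of the table, assuming Pacific time (`America/Los_Angeles`) is the local zone we care about. It also checks that nothing finer than whole seconds is recorded in this small sample.

```python
import pandas as pd

# Two timestamps in the same text format as the 'created_at' column
stamps = pd.Series(["2021-09-08 00:00:30 UTC", "2021-09-08 00:02:30 UTC"])

utc_times = pd.to_datetime(stamps, utc=True)

# Convert to local (Pacific) time; the zone name is an assumption about
# what "local" should mean for Bay Area sensors
local_times = utc_times.dt.tz_convert("America/Los_Angeles")
print(local_times)

# True if any timestamp carries a sub-second component
has_subsecond = bool(
    (utc_times.dt.microsecond != 0).any() or (utc_times.dt.nanosecond != 0).any()
)
print(has_subsecond)
```

Note that midnight UTC on September 8 is still the afternoon of September 7 in California, which matters if you later group by day or hour.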

Note: The data are instantaneous measurements, not averaged over time.  
* In practice, this just means there is *no* aggregation in the primary data.

In [40]:
Bower.head()

Unnamed: 0,created_at,entry_id,PM1.0_CF1_ug/m3,PM2.5_CF1_ug/m3,PM10.0_CF1_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_ATM_ug/m3,Unnamed: 10
0,2021-09-08 00:00:30+00:00,271065,14.34,29.12,38.14,40112.0,-71.0,84.0,38.0,28.73,
1,2021-09-08 00:02:30+00:00,271066,15.74,29.59,37.28,40114.0,-70.0,84.0,38.0,29.15,
2,2021-09-08 00:04:30+00:00,271067,13.86,28.89,41.7,40116.0,-75.0,84.0,37.0,28.64,
3,2021-09-08 00:06:30+00:00,271068,14.43,28.59,45.93,40118.0,-76.0,84.0,37.0,28.41,
4,2021-09-08 00:08:30+00:00,271069,13.0,27.02,41.47,40120.0,-70.0,84.0,36.0,27.02,


A nice thing about the datetime format is that you can easily get time information out of it.  For example, let's look at the hour of the entry at position 1000:

In [41]:
Bower.iloc[1000,0].hour

9

Note, we could rename the cols to make things easier if we wished.  I'm not going to because we're not going to be working with this data set for long, but in other cases you might decide to.

What are some examples where there might be more aggregation of the data?
* suppose you manipulated the data to provide hourly, daily, or yearly averages. Then you'd have a granularity aggregated at that particular time scale.
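As a sketch of what coarser granularity looks like, the example below builds two hours of made-up 2-minute readings and aggregates them to hourly means with `resample`. The column name `pm25` and the values are placeholders.

```python
import numpy as np
import pandas as pd

# Two hours of made-up readings, one every 2 minutes
times = pd.date_range("2021-09-08 00:00:30", periods=60, freq="2min", tz="UTC")
toy = pd.DataFrame({"created_at": times, "pm25": np.arange(60, dtype=float)})

# Hourly averages: each output row now summarizes 30 raw measurements
hourly = toy.set_index("created_at")["pm25"].resample("1h").mean()
print(hourly)
```

After this step each row represents an hour rather than an instant, so the granularity of the table has changed even though the underlying measurements have not.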

## Scope: how much time, how many people, what spatial area?
So far we have focused on data from one location -- a sensor in West Oakland. 

From the file name it looks like the data span about 12 months; let's confirm:

In [45]:
Bower['created_at'].describe()

count                                 351942
mean     2022-02-22 12:10:56.014408704+00:00
min                2021-09-08 00:00:30+00:00
25%                2021-12-08 20:49:46+00:00
50%                2022-02-08 00:57:35+00:00
75%                2022-05-08 12:27:22+00:00
max                2022-09-07 23:58:17+00:00
Name: created_at, dtype: object

So it's about one year of data.  

Does the data cover the topic of interest?

In this case, we need to answer the question:  For the PurpleAir data, what topic of interest might the data cover?

#### --> class discussion on this.

Possible reasons the data might be of interest:
* near highways and port of oakland
* near communities that are historically underserved

Possible reasons *not* of interest:
* more important to look at many recent wildfire seasons
* it might be valuable to compare across sites rather than evaluate just one.

## Temporality: How is time represented in the data?
We've already figured out that we're working with UTC dates.  UTC is "Coordinated Universal Time" and is essentially Greenwich Mean Time, the time on the prime meridian.

Can we figure out how frequent measurements are?

Unfortunately I found it difficult to take differences with datetime objects, so I had to write a for loop:

In [46]:
diffs = np.zeros(len(Bower['created_at']))

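# diffs[i] holds the gap in seconds between sample i and sample i+1;
# the final element stays at 0 because nothing follows the last sample
for i in range(len(diffs) - 1):
    diffs[i] = (Bower['created_at'][i+1]
                - Bower['created_at'][i]).total_seconds()  # float seconds, to fit in the NumPy array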

diffs = np.sort((diffs))

print('max diffs:', diffs[:-30:-1])
print('median:', np.median(diffs))

max diffs: [79585. 43629.  3842.  1782.  1782.  1561.  1443.  1441.  1437.   976.
   976.   727.   724.   720.   718.   718.   495.   495.   480.   480.
   388.   363.   360.   359.   358.   358.   346.   292.   276.]
median: 120.0


Looks like for the most part we're sampling every 2 minutes, with a few gaps in the data.  
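For what it's worth, pandas can also take these differences without a loop: `Series.diff` works on datetime columns and returns timedeltas, and `.dt.total_seconds()` turns them into floats. A sketch on four made-up timestamps:

```python
import pandas as pd

# Made-up timestamps: three 2-minute steps would be regular, but the last
# one arrives 6 minutes after the previous sample
times = pd.Series(pd.to_datetime(
    [
        "2021-09-08 00:00:30",
        "2021-09-08 00:02:30",
        "2021-09-08 00:04:30",
        "2021-09-08 00:10:30",
    ],
    utc=True,
))

# The first difference is undefined (NaT), so drop it
gaps = times.diff().dt.total_seconds().dropna()

print("largest gaps:", gaps.sort_values(ascending=False).head(2).tolist())
print("median:", gaps.median())
```

Unlike the preallocated array in the loop version, this Series has exactly one entry per pair of consecutive samples, so there is no trailing zero to keep in mind.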

## Faithfulness: are the data trustworthy?
This one's much harder to assess.  Let's have a look at some basic things we might care about

In [47]:
sum(Bower['PM2.5_ATM_ug/m3'].isna())

0

That tells us there are no NaN values in the PM2.5 data.  Impressive!
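The same check can be run on every column at once with `isna().sum()` on the whole DataFrame. A sketch on a made-up table (column names are placeholders):

```python
import numpy as np
import pandas as pd

# Made-up table: one column with a gap, one complete, one entirely empty
toy = pd.DataFrame({
    "pm25": [1.0, np.nan, 3.0],
    "temp_f": [60.0, 61.0, 62.0],
    "empty": [np.nan, np.nan, np.nan],
})

# Number of missing values in each column
missing = toy.isna().sum()
print(missing)
```

Keep in mind that a count of zero only rules out values coded as NaN; missing time stamps (gaps in the record) don't show up this way.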

In [48]:
Bower.describe()

Unnamed: 0,entry_id,PM1.0_CF1_ug/m3,PM2.5_CF1_ug/m3,PM10.0_CF1_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_ATM_ug/m3,Unnamed: 10
count,351942.0,351942.0,351942.0,351942.0,351942.0,351942.0,351942.0,351942.0,351942.0,0.0
mean,390592.6365,7.703851,15.989965,24.196788,26747.629834,-70.180916,65.679018,50.249541,14.203361,
std,68985.096555,9.894243,19.466634,24.491579,18879.817383,5.026611,11.876395,14.772064,14.610303,
min,271065.0,0.0,0.0,0.0,1.0,-93.0,36.0,5.0,0.0,
25%,336116.0,1.21,3.59,6.75,10005.0,-71.0,58.0,41.0,3.59,
50%,380108.5,3.76,8.75,16.41,23995.5,-69.0,63.0,55.0,8.75,
75%,444398.75,10.22,20.28,33.27,41299.0,-67.0,71.0,62.0,20.27,
max,532384.0,336.36,1297.91,1582.45,71582.0,-53.0,126.0,77.0,865.12,


That's a pretty high PM2.5 average.  And the max is very suspiciously high.  What's going on?

Options: 
1. Wildfire smoke really pumped up the 2.5 values
2. We have a lot of missing data and only values during the wildfires
3. There are some erroneously high values.

Let's start by looking at how many values are big.  

In [49]:
log_ind = Bower.loc[:,'PM2.5_CF1_ug/m3'] > 500 # boolean Series for logical indexing
Bower.loc[log_ind,'PM2.5_CF1_ug/m3']

150102     608.21
150822     608.21
305431    1297.91
305432     878.52
Name: PM2.5_CF1_ug/m3, dtype: float64

Let's look in the vicinity of the high values to see if we believe the trend:

In [50]:
Bower.loc[305420:305435,:]

Unnamed: 0,created_at,entry_id,PM1.0_CF1_ug/m3,PM2.5_CF1_ug/m3,PM10.0_CF1_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_ATM_ug/m3,Unnamed: 10
305420,2022-07-05 05:46:09+00:00,485863,5.17,10.33,12.72,70909.0,-67.0,73.0,59.0,10.33,
305421,2022-07-05 05:48:09+00:00,485864,6.66,10.88,12.31,70911.0,-67.0,72.0,59.0,10.88,
305422,2022-07-05 05:50:09+00:00,485865,6.81,12.11,14.12,70913.0,-72.0,72.0,60.0,12.11,
305423,2022-07-05 05:52:09+00:00,485866,6.98,11.89,14.58,70915.0,-69.0,72.0,60.0,11.89,
305424,2022-07-05 05:54:09+00:00,485867,5.82,11.07,14.45,70917.0,-68.0,73.0,60.0,11.07,
305425,2022-07-05 05:56:09+00:00,485868,5.65,10.33,13.14,70919.0,-67.0,73.0,60.0,10.33,
305426,2022-07-05 05:58:09+00:00,485869,7.17,13.38,16.48,70921.0,-72.0,72.0,60.0,13.38,
305427,2022-07-05 06:00:09+00:00,485870,5.09,8.67,10.78,70923.0,-65.0,72.0,60.0,8.67,
305428,2022-07-05 06:02:09+00:00,485871,4.52,8.33,11.19,70925.0,-69.0,72.0,59.0,8.33,
305429,2022-07-05 06:04:09+00:00,485872,5.62,10.02,11.29,70927.0,-67.0,72.0,60.0,10.02,


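Looks like the PM2.5 readings just before the spike sit around 8 to 13, so the values of 1297.91 and 878.52 at rows 305431 and 305432 would be a jump of roughly two orders of magnitude a couple of samples later. The two identical readings of 608.21 at rows 150102 and 150822 are also suspicious. If I were doing more work here I would look into the sensor more carefully to see if there is any significance to those numbers.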

But for now -- let's just go ahead and replace them with NaN and see what happens:

In [83]:
Bower.loc[log_ind,'PM2.5_CF1_ug/m3'] = np.nan
Bower.describe()

Unnamed: 0,entry_id,PM1.0_CF1_ug/m3,PM2.5_CF1_ug/m3,PM10.0_CF1_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_ATM_ug/m3,Unnamed: 10
count,351942.0,351942.0,351938.0,351942.0,351942.0,351942.0,351942.0,351942.0,351942.0,0.0
mean,390592.6365,7.703851,15.980506,24.196788,26747.629834,-70.180916,65.679018,50.249541,14.203361,
std,68985.096555,9.894243,19.240002,24.491579,18879.817383,5.026611,11.876395,14.772064,14.610303,
min,271065.0,0.0,0.0,0.0,1.0,-93.0,36.0,5.0,0.0,
25%,336116.0,1.21,3.59,6.75,10005.0,-71.0,58.0,41.0,3.59,
50%,380108.5,3.76,8.75,16.41,23995.5,-69.0,63.0,55.0,8.75,
75%,444398.75,10.22,20.28,33.27,41299.0,-67.0,71.0,62.0,20.27,
max,532384.0,336.36,346.05,1582.45,71582.0,-53.0,126.0,77.0,865.12,


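You can see the average came down only slightly (15.99 to 15.98), and the standard deviation dropped a bit as well (19.47 to 19.24): four points out of roughly 350,000 barely move the summary statistics.  And as we'd hope the max is now below 500.  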
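Notice that the summary still shows a maximum of 865.12 in the ATM column and 1582.45 in the PM10 column, because only the CF1 PM2.5 column was modified. If you wanted to apply the same threshold to several columns at once, and leave the original table untouched, `DataFrame.mask` is one option. This is a sketch on made-up numbers; the threshold of 500 is the same judgment call as above, and whether it makes sense for other columns is an open question.

```python
import pandas as pd

# Made-up readings; column names are placeholders
toy = pd.DataFrame({
    "pm25_cf1": [10.0, 12.0, 1297.91, 878.52, 11.0],
    "pm25_atm": [10.0, 12.0, 865.12, 585.0, 11.0],
})

THRESHOLD = 500  # same cutoff used above; an analyst's choice, not a standard

# mask() returns a new DataFrame with NaN wherever the condition is True
cleaned = toy.mask(toy > THRESHOLD)

print(cleaned.describe())
```

Because `cleaned` is a separate object, you can always compare it against `toy` to see exactly what was removed.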