# Exploring AoT Data

One of the most important steps in doing science with data is to familiarize yourself with the data. This includes:
* Getting a first look at the data.
* Understanding how it is structured.
* Looking more closely at the data.
* Filtering the data that you are interested in.

## Getting a First Look

#### Import Pandas

We will use Pandas to read in the data file and work with it.

In [15]:
import pandas as pd

#### Read in the .csv file 

Use the `read_csv` function in Pandas to read in the comma-separated values dataset (.csv file).

In [16]:
# Update the path and filename to point to where the data is currently stored.
data = pd.read_csv("data/AoT_Chicago.complete.recent.csv")

We put the data into a variable called `data`. To see it, we simply "call" that variable.

In [17]:
data

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
0,2021/03/20 00:31:19,001e0610ee36,lightsense,apds_9006_020,intensity,8,0.643
1,2021/03/20 00:31:19,001e0610ee36,lightsense,hih6130,humidity,9272,56.6
2,2021/03/20 00:31:19,001e0610ee36,lightsense,hih6130,temperature,22704,17.17
3,2021/03/20 00:31:19,001e0610ee36,lightsense,hmc5883l,magnetic_field_x,-177,-160.909
4,2021/03/20 00:31:19,001e0610ee36,lightsense,hmc5883l,magnetic_field_y,-1700,-1545.455
...,...,...,...,...,...,...,...
4677,2021/03/20 01:01:23,001e0610ee36,metsense,pr103j2,temperature,686,6.9
4678,2021/03/20 01:01:23,001e0610ee36,metsense,spv1840lr5h_b,intensity,,56.72
4679,2021/03/20 01:01:23,001e0610ee36,metsense,tmp112,temperature,873,6.81
4680,2021/03/20 01:01:23,001e0610ee36,metsense,tsl250rd,intensity,0,0.0
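
If the file is not at hand, the same call can be tried on a small in-memory sample. The sketch below assumes a few rows copied from the output above and wraps them in `io.StringIO`, which `read_csv` accepts in place of a path. The names `sample_csv` and `sample` are illustrative only, and reading `node_id` as a string is an assumption made here so the IDs are never interpreted as numbers.

```python
import io

import pandas as pd

# A few rows copied from the output above; the real file is much longer.
sample_csv = """timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
2021/03/20 00:31:19,001e0610ee36,lightsense,apds_9006_020,intensity,8,0.643
2021/03/20 00:31:19,001e0610ee36,lightsense,hih6130,humidity,9272,56.6
2021/03/20 01:01:23,001e0610ee36,metsense,spv1840lr5h_b,intensity,,56.72
"""

# read_csv accepts a file-like object as well as a path.
sample = pd.read_csv(io.StringIO(sample_csv), dtype={"node_id": str})

print(sample.shape)                 # (rows, columns)
print(sample["value_raw"].isna())   # the empty field is read as NaN
```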


#### Just view the first few entries

In [4]:
data.head()

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
0,2019/07/18 14:00:00,001e0610ee36,metsense,pr103j2,temperature,840,24.45
1,2019/07/18 14:00:00,001e06114fd4,metsense,pr103j2,temperature,839,24.3
2,2019/07/18 14:00:01,001e0610ba13,metsense,pr103j2,temperature,853,26.45
3,2019/07/18 14:00:02,001e0610ee61,metsense,pr103j2,temperature,850,26.0
4,2019/07/18 14:00:02,001e06113ad8,metsense,pr103j2,temperature,837,24.0


#### Just view the last few entries

In [5]:
data.tail()

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
36,2019/07/18 14:00:22,001e06118182,metsense,pr103j2,temperature,845,25.2
37,2019/07/18 14:00:23,001e06113ace,metsense,pr103j2,temperature,831,23.1
38,2019/07/18 14:00:23,001e0611537d,metsense,pr103j2,temperature,845,25.2
39,2019/07/18 14:00:24,001e0610ee43,metsense,pr103j2,temperature,846,25.35
40,2019/07/18 14:00:30,001e0611441e,metsense,pr103j2,temperature,835,23.7
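
Both `head()` and `tail()` return five rows by default and accept an optional row count. A minimal sketch on a made-up frame (`df` and `reading` are illustrative names, not part of the AoT file):

```python
import pandas as pd

# Hypothetical stand-in for the AoT data: ten numbered rows.
df = pd.DataFrame({"reading": range(10)})

first_three = df.head(3)   # the first three rows
last_two = df.tail(2)      # the last two rows

print(first_three)
print(last_two)
```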


## How is it structured?

In [24]:
data.shape

(4682, 7)

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4682 entries, 0 to 4681
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   timestamp  4682 non-null   object
 1   node_id    4682 non-null   object
 2   subsystem  4682 non-null   object
 3   sensor     4682 non-null   object
 4   parameter  4682 non-null   object
 5   value_raw  4538 non-null   object
 6   value_hrf  4636 non-null   object
dtypes: object(7)
memory usage: 256.2+ KB
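
The `info()` output reports every column as `object`, so the value columns are not yet numeric. One possible way to convert them, sketched here on a few hypothetical rows in the same layout, is `pd.to_numeric` with `errors="coerce"` (unparseable entries become `NaN`) and `pd.to_datetime` with an explicit format. Whether non-numeric entries in the real file should be coerced or inspected first depends on the analysis.

```python
import pandas as pd

# Hypothetical rows in the same layout; values are stored as strings,
# mirroring the object dtype reported by info().
df = pd.DataFrame({
    "timestamp": ["2021/03/20 00:31:19", "2021/03/20 01:01:23"],
    "value_raw": ["8", "not_a_number"],
    "value_hrf": ["0.643", "56.72"],
})

# Coerce to numbers; anything unparseable becomes NaN instead of raising.
df["value_raw"] = pd.to_numeric(df["value_raw"], errors="coerce")
df["value_hrf"] = pd.to_numeric(df["value_hrf"], errors="coerce")

# Parse the timestamp strings using the format seen in the file.
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y/%m/%d %H:%M:%S")

print(df.dtypes)
```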


## Looking more closely at the data

#### How many nodes are in this data set?

We can see the unique nodes by dropping every row that repeats a `node_id`, which leaves one row per node.

In [18]:
data.drop_duplicates("node_id")

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
0,2021/03/20 00:31:19,001e0610ee36,lightsense,apds_9006_020,intensity,8,0.643
25,2021/03/20 00:31:21,001e0611804d,lightsense,apds_9006_020,intensity,65535,5267.409
62,2021/03/20 00:31:27,001e0610fb4c,metsense,hih4030,humidity,518,78.56


You're probably wondering whether `drop_duplicates()` deleted a bunch of data. Let's look at `data` again.

In [41]:
data

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
0,2021/03/20 00:31:19,001e0610ee36,lightsense,apds_9006_020,intensity,8,0.643
1,2021/03/20 00:31:19,001e0610ee36,lightsense,hih6130,humidity,9272,56.6
2,2021/03/20 00:31:19,001e0610ee36,lightsense,hih6130,temperature,22704,17.17
3,2021/03/20 00:31:19,001e0610ee36,lightsense,hmc5883l,magnetic_field_x,-177,-160.909
4,2021/03/20 00:31:19,001e0610ee36,lightsense,hmc5883l,magnetic_field_y,-1700,-1545.455
...,...,...,...,...,...,...,...
4677,2021/03/20 01:01:23,001e0610ee36,metsense,pr103j2,temperature,686,6.9
4678,2021/03/20 01:01:23,001e0610ee36,metsense,spv1840lr5h_b,intensity,,56.72
4679,2021/03/20 01:01:23,001e0610ee36,metsense,tmp112,temperature,873,6.81
4680,2021/03/20 01:01:23,001e0610ee36,metsense,tsl250rd,intensity,0,0.0
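
The full table is still there because `drop_duplicates()` returns a new DataFrame and leaves the original untouched. A small sketch with made-up rows (the frame `df` and its values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "node_id": ["a", "a", "b", "c", "c"],
    "sensor": ["s1", "s2", "s1", "s1", "s2"],
})

# Keeps the first row seen for each node_id and returns a new frame.
unique_nodes = df.drop_duplicates("node_id")

print(len(unique_nodes))  # number of distinct nodes
print(len(df))            # original row count is unchanged
```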


#### How many different sensors do we have? (Student)

If we want to keep the result as a new set of data, we can assign the result of the operation to a variable.

In [47]:
sensors = data.drop_duplicates("sensor")
sensors

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
0,2021/03/20 00:31:19,001e0610ee36,lightsense,apds_9006_020,intensity,8,0.643
1,2021/03/20 00:31:19,001e0610ee36,lightsense,hih6130,humidity,9272,56.6
3,2021/03/20 00:31:19,001e0610ee36,lightsense,hmc5883l,magnetic_field_x,-177,-160.909
6,2021/03/20 00:31:19,001e0610ee36,lightsense,ml8511,intensity,9688,45.473
7,2021/03/20 00:31:19,001e0610ee36,lightsense,mlx75305,intensity,1126,11.943
8,2021/03/20 00:31:19,001e0610ee36,lightsense,tmp421,temperature,3392,13.25
9,2021/03/20 00:31:19,001e0610ee36,lightsense,tsl250rd,intensity,9689,23.564
10,2021/03/20 00:31:19,001e0610ee36,lightsense,tsl260rd,intensity,1126,2.926
11,2021/03/20 00:31:19,001e0610ee36,metsense,bmp180,pressure,11444704,1051.96
13,2021/03/20 00:31:19,001e0610ee36,metsense,hih4030,humidity,469,71.22


#### There is an easier way to do this

In [13]:
data.describe()

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
count,4682,4682,4682,4682,4682,4538,4636
unique,184,3,3,18,23,646,538
top,2021/03/20 00:50:31,001e0611804d,metsense,pms7003,intensity,65535,0
freq,62,2627,2246,852,1100,1065,852
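
The `unique` row of `describe()` already answers both questions. If only the counts are needed, `nunique()` and `value_counts()` are more direct. A sketch on made-up rows (the frame `df` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "node_id": ["a", "a", "b", "c", "c", "c"],
    "sensor": ["s1", "s2", "s1", "s1", "s2", "s2"],
})

n_nodes = df["node_id"].nunique()             # count of distinct node IDs
rows_per_node = df["node_id"].value_counts()  # number of rows for each node

print(n_nodes)
print(rows_per_node)
```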


# Filtering Data

You can think of the data frame from two perspectives: columns and rows. Columns are much easier to work with because each one is a Pandas Series, an array-like object whose elements all share the same data type.

## Working with Columns


In [27]:
# List the column headings.
data.columns

Index(['timestamp', 'node_id', 'subsystem', 'sensor', 'parameter', 'value_raw',
       'value_hrf'],
      dtype='object')

We can extract columns by putting brackets around the column names. To select more than one column, we pass a list of column names, which needs its own set of brackets. This is how we reference columns: `dataset_name[['column_name1', 'column_name2', ...]]`

In [36]:
node_id = data[['node_id', 'sensor']]

In [40]:
node_id

Unnamed: 0,node_id,sensor
0,001e0610ee36,apds_9006_020
1,001e0610ee36,hih6130
2,001e0610ee36,hih6130
3,001e0610ee36,hmc5883l
4,001e0610ee36,hmc5883l
...,...,...
4677,001e0610ee36,pr103j2
4678,001e0610ee36,spv1840lr5h_b
4679,001e0610ee36,tmp112
4680,001e0610ee36,tsl250rd


In [38]:
node_id.describe()

Unnamed: 0,node_id,sensor
count,4682,4682
unique,3,18
top,001e0611804d,pms7003
freq,2627,852
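
The number of brackets changes what comes back: a single column name returns a Series, while a list of names returns a DataFrame. A minimal sketch on a made-up frame (`df` and its values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "node_id": ["a", "b"],
    "sensor": ["s1", "s2"],
    "value_hrf": [0.643, 56.6],
})

one_column = df["sensor"]                 # single brackets: a Series
two_columns = df[["node_id", "sensor"]]   # list of names: a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```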


## Renaming Columns

Sometimes the original column names are confusing. You can change them by assigning a list of new column names to `data.columns`. The list must contain one name per column, in the same order as the existing columns.

In [21]:
data.columns = ['Timestamp','Node', 'Subsystem', 'Sensor_Name', 'Measurement', 'Raw_Value', 'Human_Readable']

In [22]:
data.head()

Unnamed: 0,Timestamp,Node,Subsystem,Sensor_Name,Measurement,Raw_Value,Human_Readable
0,2019/07/18 14:00:00,001e0610ee36,metsense,pr103j2,temperature,840,24.45
1,2019/07/18 14:00:00,001e06114fd4,metsense,pr103j2,temperature,839,24.3
2,2019/07/18 14:00:01,001e0610ba13,metsense,pr103j2,temperature,853,26.45
3,2019/07/18 14:00:02,001e0610ee61,metsense,pr103j2,temperature,850,26.0
4,2019/07/18 14:00:02,001e06113ad8,metsense,pr103j2,temperature,837,24.0
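
Assigning to `data.columns` replaces every name at once and depends on getting the order right. When only a few names need to change, `rename()` with a mapping is an alternative. The sketch below uses a made-up frame and returns a new DataFrame rather than modifying the original.

```python
import pandas as pd

df = pd.DataFrame({"node_id": ["a"], "value_hrf": [0.643], "sensor": ["s1"]})

# Map old names to new ones; columns not listed keep their names.
renamed = df.rename(columns={"node_id": "Node", "value_hrf": "Human_Readable"})

print(list(renamed.columns))
print(list(df.columns))  # the original frame is unchanged
```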


### What about rows?

Rows are a little trickier to reference than columns. A column can be referenced by name, and its elements all share the same data type. A row cuts across several columns, so it can contain multiple data types and requires a different way of thinking. Let's not worry about that now.