# Exploring AoT Data

One of the most important steps in doing science with data is to familiarize yourself with the data. This includes:
* Getting a first look at the data.
* Understanding how it is structured.
* Looking more closely at the data.
* Filtering the data that you are interested in.

## Getting a First Look

#### Import Pandas

We will use Pandas to read in the data file and work with it.

In [2]:
import pandas as pd

#### Read in the .csv file 

Use the `read_csv` function in Pandas to read in the comma-separated values dataset (.csv file).

In [4]:
data = pd.read_csv("../data/AoT_Chicago.complete.recent.csv")

We put the data into a variable called `data`. In Pandas, it is stored as a "dataframe". To see it, we simply "call" that variable.

In [5]:
# print the dataframe
data

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
0,2021/03/20 00:31:19,001e0610ee36,lightsense,apds_9006_020,intensity,8,0.643
1,2021/03/20 00:31:19,001e0610ee36,lightsense,hih6130,humidity,9272,56.6
2,2021/03/20 00:31:19,001e0610ee36,lightsense,hih6130,temperature,22704,17.17
3,2021/03/20 00:31:19,001e0610ee36,lightsense,hmc5883l,magnetic_field_x,-177,-160.909
4,2021/03/20 00:31:19,001e0610ee36,lightsense,hmc5883l,magnetic_field_y,-1700,-1545.455
...,...,...,...,...,...,...,...
4677,2021/03/20 01:01:23,001e0610ee36,metsense,pr103j2,temperature,686,6.9
4678,2021/03/20 01:01:23,001e0610ee36,metsense,spv1840lr5h_b,intensity,,56.72
4679,2021/03/20 01:01:23,001e0610ee36,metsense,tmp112,temperature,873,6.81
4680,2021/03/20 01:01:23,001e0610ee36,metsense,tsl250rd,intensity,0,0.0
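
To experiment with `read_csv` without the data file, here is a minimal sketch that reads a few rows from an in-memory string instead. The rows are copied from the output above. The header line is an assumption based on the column names displayed, and `csv_text` and `sample` are names made up for this example.

```python
import io
import pandas as pd

# A few rows copied from the output above, used as a stand-in for the
# real file. The header line is assumed from the displayed column names.
csv_text = """timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
2021/03/20 00:31:19,001e0610ee36,lightsense,apds_9006_020,intensity,8,0.643
2021/03/20 00:31:19,001e0610ee36,lightsense,hih6130,humidity,9272,56.6
2021/03/20 01:01:23,001e0610ee36,metsense,spv1840lr5h_b,intensity,,56.72
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)             # (number of rows, number of columns)
print(sample.columns.tolist())  # column names taken from the header line
```

The empty `value_raw` field in the last row is read as a missing value, which is why it shows up as blank or `NaN` in the displayed table.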


#### Just view the first few entries

In [6]:
# print the first few entries
data.head()

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
0,2021/03/20 00:31:19,001e0610ee36,lightsense,apds_9006_020,intensity,8,0.643
1,2021/03/20 00:31:19,001e0610ee36,lightsense,hih6130,humidity,9272,56.6
2,2021/03/20 00:31:19,001e0610ee36,lightsense,hih6130,temperature,22704,17.17
3,2021/03/20 00:31:19,001e0610ee36,lightsense,hmc5883l,magnetic_field_x,-177,-160.909
4,2021/03/20 00:31:19,001e0610ee36,lightsense,hmc5883l,magnetic_field_y,-1700,-1545.455


#### Just view the last few entries

In [7]:
# print the last few entries
data.tail()

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
4677,2021/03/20 01:01:23,001e0610ee36,metsense,pr103j2,temperature,686.0,6.9
4678,2021/03/20 01:01:23,001e0610ee36,metsense,spv1840lr5h_b,intensity,,56.72
4679,2021/03/20 01:01:23,001e0610ee36,metsense,tmp112,temperature,873.0,6.81
4680,2021/03/20 01:01:23,001e0610ee36,metsense,tsl250rd,intensity,0.0,0.0
4681,2021/03/20 01:01:23,001e0610ee36,metsense,tsys01,temperature,9280204.0,6.97
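
Both `head()` and `tail()` show five rows by default and accept an optional number of rows. A small sketch, using a made-up dataframe called `demo` in place of `data`:

```python
import pandas as pd

# Hypothetical stand-in for `data`: ten rows with the default index 0..9
demo = pd.DataFrame({"reading": range(10)})

first_three = demo.head(3)  # rows with index 0, 1, 2
last_two = demo.tail(2)     # rows with index 8, 9

print(first_three)
print(last_two)
```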


## How is it structured?

In [8]:
data.shape

(4682, 7)

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4682 entries, 0 to 4681
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   timestamp  4682 non-null   object
 1   node_id    4682 non-null   object
 2   subsystem  4682 non-null   object
 3   sensor     4682 non-null   object
 4   parameter  4682 non-null   object
 5   value_raw  4538 non-null   object
 6   value_hrf  4636 non-null   object
dtypes: object(7)
memory usage: 256.2+ KB
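
Notice that every column, including the two value columns, has the dtype `object`. One plausible reason is that some entries are not numbers (for example, the sensor `id` rows hold strings such as `011cd1141800`). If you need numbers, one possible approach is `pd.to_numeric` with `errors="coerce"`, sketched here on a few values copied from this notebook. The name `values` is made up for the example, and the values are assumed to have been read as strings.

```python
import pandas as pd

# Values copied from rows shown in this notebook, assumed to be strings.
# "011cd1141800" is an id, not a measurement; None stands for a missing entry.
values = pd.Series(["0.643", "56.6", "011cd1141800", None], name="value_hrf")

# errors="coerce" replaces anything that cannot be parsed as a number with NaN
numeric = pd.to_numeric(values, errors="coerce")

print(numeric)
print(numeric.dtype)
```

Coercing silently discards the non-numeric entries, so check how many values turn into `NaN` before relying on the result.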


## Looking more closely at the data

We can get a little more detailed summary of the data using the `.describe()` method.

In [10]:
data.describe()

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
count,4682,4682,4682,4682,4682,4538,4636
unique,184,3,3,18,23,646,538
top,2021/03/20 00:51:47,001e0611804d,metsense,pms7003,intensity,65535,0
freq,62,2627,2246,852,1100,1065,852


To drill a little deeper, we can see one row for each unique entry in a column using the `.drop_duplicates()` method.

In [11]:
data.drop_duplicates("node_id")

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
0,2021/03/20 00:31:19,001e0610ee36,lightsense,apds_9006_020,intensity,8,0.643
25,2021/03/20 00:31:21,001e0611804d,lightsense,apds_9006_020,intensity,65535,5267.409
62,2021/03/20 00:31:27,001e0610fb4c,metsense,hih4030,humidity,518,78.56


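You're probably wondering if `drop_duplicates()` deleted a bunch of data. It did not: the method returns a new dataframe and leaves `data` unchanged, which we can check by calling `data` again.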

In [12]:
data

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
0,2021/03/20 00:31:19,001e0610ee36,lightsense,apds_9006_020,intensity,8,0.643
1,2021/03/20 00:31:19,001e0610ee36,lightsense,hih6130,humidity,9272,56.6
2,2021/03/20 00:31:19,001e0610ee36,lightsense,hih6130,temperature,22704,17.17
3,2021/03/20 00:31:19,001e0610ee36,lightsense,hmc5883l,magnetic_field_x,-177,-160.909
4,2021/03/20 00:31:19,001e0610ee36,lightsense,hmc5883l,magnetic_field_y,-1700,-1545.455
...,...,...,...,...,...,...,...
4677,2021/03/20 01:01:23,001e0610ee36,metsense,pr103j2,temperature,686,6.9
4678,2021/03/20 01:01:23,001e0610ee36,metsense,spv1840lr5h_b,intensity,,56.72
4679,2021/03/20 01:01:23,001e0610ee36,metsense,tmp112,temperature,873,6.81
4680,2021/03/20 01:01:23,001e0610ee36,metsense,tsl250rd,intensity,0,0.0
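
The same check can be made with row counts. The sketch below uses a small made-up dataframe, `demo`, built from node and sensor names that appear above. By default `drop_duplicates()` keeps the first row for each value and returns a new dataframe.

```python
import pandas as pd

# Small stand-in for `data`, using node and sensor names from the output above
demo = pd.DataFrame({
    "node_id": ["001e0610ee36", "001e0610ee36", "001e0611804d", "001e0610fb4c"],
    "sensor": ["apds_9006_020", "hih6130", "apds_9006_020", "hih4030"],
})

# Returns a new dataframe; `demo` itself is not modified
unique_nodes = demo.drop_duplicates("node_id")

print(len(unique_nodes))  # one row per distinct node_id
print(len(demo))          # the original still has every row
```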


#### How many different sensors do we have? (Student)

In [13]:
data.drop_duplicates("sensor")

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
0,2021/03/20 00:31:19,001e0610ee36,lightsense,apds_9006_020,intensity,8,0.643
1,2021/03/20 00:31:19,001e0610ee36,lightsense,hih6130,humidity,9272,56.6
3,2021/03/20 00:31:19,001e0610ee36,lightsense,hmc5883l,magnetic_field_x,-177,-160.909
6,2021/03/20 00:31:19,001e0610ee36,lightsense,ml8511,intensity,9688,45.473
7,2021/03/20 00:31:19,001e0610ee36,lightsense,mlx75305,intensity,1126,11.943
8,2021/03/20 00:31:19,001e0610ee36,lightsense,tmp421,temperature,3392,13.25
9,2021/03/20 00:31:19,001e0610ee36,lightsense,tsl250rd,intensity,9689,23.564
10,2021/03/20 00:31:19,001e0610ee36,lightsense,tsl260rd,intensity,1126,2.926
11,2021/03/20 00:31:19,001e0610ee36,metsense,bmp180,pressure,11444704,1051.96
13,2021/03/20 00:31:19,001e0610ee36,metsense,hih4030,humidity,469,71.22
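
If you only need the count rather than the rows, `nunique()` answers the question directly, and `value_counts()` shows how many rows each sensor contributes. A minimal sketch on a made-up dataframe, `demo`, that reuses sensor names from the output above:

```python
import pandas as pd

# Stand-in for `data` with a handful of rows from the output above
demo = pd.DataFrame({
    "sensor": ["apds_9006_020", "hih6130", "hih6130", "hmc5883l", "hmc5883l"],
    "parameter": ["intensity", "humidity", "temperature",
                  "magnetic_field_x", "magnetic_field_y"],
})

n_sensors = demo["sensor"].nunique()        # number of distinct sensors
per_sensor = demo["sensor"].value_counts()  # number of rows for each sensor

print(n_sensors)
print(per_sensor)
```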


If we want to create a new set of data, we can assign our operation to a variable.

In [14]:
sensor = data.drop_duplicates("sensor")
sensor

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
0,2021/03/20 00:31:19,001e0610ee36,lightsense,apds_9006_020,intensity,8,0.643
1,2021/03/20 00:31:19,001e0610ee36,lightsense,hih6130,humidity,9272,56.6
3,2021/03/20 00:31:19,001e0610ee36,lightsense,hmc5883l,magnetic_field_x,-177,-160.909
6,2021/03/20 00:31:19,001e0610ee36,lightsense,ml8511,intensity,9688,45.473
7,2021/03/20 00:31:19,001e0610ee36,lightsense,mlx75305,intensity,1126,11.943
8,2021/03/20 00:31:19,001e0610ee36,lightsense,tmp421,temperature,3392,13.25
9,2021/03/20 00:31:19,001e0610ee36,lightsense,tsl250rd,intensity,9689,23.564
10,2021/03/20 00:31:19,001e0610ee36,lightsense,tsl260rd,intensity,1126,2.926
11,2021/03/20 00:31:19,001e0610ee36,metsense,bmp180,pressure,11444704,1051.96
13,2021/03/20 00:31:19,001e0610ee36,metsense,hih4030,humidity,469,71.22


## Filtering Data

You can think of the data frame from two perspectives: columns and rows. Columns are much easier to work with because they are arrays, meaning all the elements in a column are the same datatype (or they should be).

#### Working with Columns

We can select certain columns by putting the column names in brackets. To select several columns, we need to pass a list of column names, which needs to be in its own set of brackets. This is how we reference columns: `dataset_name[['column_name1', 'column_name2', ...]]`

In [15]:
node_id = data[['node_id', 'sensor']]

In [16]:
node_id

Unnamed: 0,node_id,sensor
0,001e0610ee36,apds_9006_020
1,001e0610ee36,hih6130
2,001e0610ee36,hih6130
3,001e0610ee36,hmc5883l
4,001e0610ee36,hmc5883l
...,...,...
4677,001e0610ee36,pr103j2
4678,001e0610ee36,spv1840lr5h_b
4679,001e0610ee36,tmp112
4680,001e0610ee36,tsl250rd
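
The number of brackets matters. A single column name in one set of brackets gives back a Series (one column), while a list of names gives back a dataframe. A short sketch on a made-up dataframe, `demo`:

```python
import pandas as pd

# Stand-in for `data` with values taken from the output above
demo = pd.DataFrame({
    "node_id": ["001e0610ee36", "001e0611804d"],
    "sensor": ["hih6130", "pr103j2"],
    "parameter": ["humidity", "temperature"],
})

one_column = demo["sensor"]                # single brackets: a Series
two_columns = demo[["node_id", "sensor"]]  # list of names: a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```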


#### Renaming Columns

Sometimes the original column names are confusing. You can change them by assigning a list of new column names to `data.columns`. The list must contain one name for every column, in the same order as the existing columns.

In [17]:
data.columns = ['Timestamp','Node', 'Subsystem', 'Sensor_Name', 'Measurement', 'Raw_Value', 'Human_Readable']

In [18]:
data.head()

Unnamed: 0,Timestamp,Node,Subsystem,Sensor_Name,Measurement,Raw_Value,Human_Readable
0,2021/03/20 00:31:19,001e0610ee36,lightsense,apds_9006_020,intensity,8,0.643
1,2021/03/20 00:31:19,001e0610ee36,lightsense,hih6130,humidity,9272,56.6
2,2021/03/20 00:31:19,001e0610ee36,lightsense,hih6130,temperature,22704,17.17
3,2021/03/20 00:31:19,001e0610ee36,lightsense,hmc5883l,magnetic_field_x,-177,-160.909
4,2021/03/20 00:31:19,001e0610ee36,lightsense,hmc5883l,magnetic_field_y,-1700,-1545.455
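
If you only want to rename a few columns, an alternative is the `rename` method, which takes a mapping from old names to new names and returns a new dataframe. This is a sketch of that alternative on a made-up dataframe, `demo`, not what the cells above do:

```python
import pandas as pd

# Stand-in for `data` with three of its original column names
demo = pd.DataFrame({
    "node_id": ["001e0610ee36"],
    "value_raw": ["8"],
    "value_hrf": ["0.643"],
})

# Map old names to new ones; columns that are not listed keep their names
renamed = demo.rename(columns={"node_id": "Node", "value_hrf": "Human_Readable"})

print(renamed.columns.tolist())
print(demo.columns.tolist())  # the original names are untouched
```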


#### What about rows?

Rows are a little trickier to reference because they are not arrays. In columns, we can reference the name of the columns that we are interested in. Also, the elements in a column are the same data type. Since rows contain the elements of several different arrays, they can contain multiple data types and require a different way of thinking. We can filter rows two ways: by position, using `iloc`, or by label, using `loc`.

For the `iloc` method, you can pass a single number (`iloc[5]`), a range of numbers (`iloc[5:9]`), or a list (`iloc[[23, 45, 87]]`). These values are row positions, counted from 0. Because `data` uses the default index, they match the bold index values in the first column.

`iloc`

In [19]:
# pass a value
data.iloc[45]

Timestamp         2021/03/20 00:31:21
Node                     001e0611804d
Subsystem                    metsense
Sensor_Name                   pr103j2
Measurement               temperature
Raw_Value                         664
Human_Readable                   4.95
Name: 45, dtype: object

In [20]:
# pass a range
data.iloc[16:19]

Unnamed: 0,Timestamp,Node,Subsystem,Sensor_Name,Measurement,Raw_Value,Human_Readable
16,2021/03/20 00:31:19,001e0610ee36,metsense,metsense,id,011cd1141800,011cd1141800
17,2021/03/20 00:31:19,001e0610ee36,metsense,mma8452q,acceleration_x,65328,-12.695
18,2021/03/20 00:31:19,001e0610ee36,metsense,mma8452q,acceleration_y,49392,-985.352


In [21]:
# pass a list
data.iloc[[23, 45, 87]]

Unnamed: 0,Timestamp,Node,Subsystem,Sensor_Name,Measurement,Raw_Value,Human_Readable
23,2021/03/20 00:31:19,001e0610ee36,metsense,tsl250rd,intensity,0,0.0
45,2021/03/20 00:31:21,001e0611804d,metsense,pr103j2,temperature,664,4.95
87,2021/03/20 00:31:44,001e0610ee36,metsense,pr103j2,temperature,686,6.9


`loc`

The `loc` method selects rows by their index labels or by a condition on the values in the columns. You can pass a label, a boolean array, or a conditional expression.

In [22]:
# pass a value
data.loc[34]

Timestamp         2021/03/20 00:31:21
Node                     001e0611804d
Subsystem                  lightsense
Sensor_Name                  tsl250rd
Measurement                 intensity
Raw_Value                       65535
Human_Readable                159.907
Name: 34, dtype: object
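
With the default index, `iloc[34]` and `loc[34]` return the same row, so the difference is easy to miss. It shows up once the labels no longer match the positions, for example after filtering. A small sketch on a made-up dataframe, `demo`:

```python
import pandas as pd

# Stand-in for `data` with the default index 0..4
demo = pd.DataFrame(
    {"Measurement": ["intensity", "humidity", "temperature", "humidity", "pressure"]}
)

# After filtering, the remaining labels are 1 and 3, at positions 0 and 1
humid = demo.loc[demo["Measurement"] == "humidity"]

by_position = humid.iloc[1]  # second row of the filtered dataframe
by_label = humid.loc[3]      # row whose index label is 3

print(by_position.name, by_label.name)  # both refer to the row labelled 3

# Slices differ too: iloc leaves out the end position, loc includes the end label
print(len(demo.iloc[1:3]), len(demo.loc[1:3]))
```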

In [23]:
# pass a conditional
data.loc[data['Measurement'] == 'humidity']

Unnamed: 0,Timestamp,Node,Subsystem,Sensor_Name,Measurement,Raw_Value,Human_Readable
1,2021/03/20 00:31:19,001e0610ee36,lightsense,hih6130,humidity,9272,56.6
13,2021/03/20 00:31:19,001e0610ee36,metsense,hih4030,humidity,469,71.22
14,2021/03/20 00:31:19,001e0610ee36,metsense,htu21d,humidity,47738,85.05
26,2021/03/20 00:31:21,001e0611804d,lightsense,hih6130,humidity,65535,100.0
38,2021/03/20 00:31:21,001e0611804d,metsense,hih4030,humidity,420,63.88
...,...,...,...,...,...,...,...
4629,2021/03/20 01:01:09,001e0611804d,metsense,htu21d,humidity,65535,118.99
4652,2021/03/20 01:01:21,001e0610fb4c,metsense,hih4030,humidity,518,78.56
4658,2021/03/20 01:01:23,001e0610ee36,lightsense,hih6130,humidity,9304,56.79
4670,2021/03/20 01:01:23,001e0610ee36,metsense,hih4030,humidity,466,70.77


In [24]:
# pass multiple conditionals
data.loc[(data['Measurement'] == 'humidity') & (data['Sensor_Name'] == 'htu21d')]

Unnamed: 0,Timestamp,Node,Subsystem,Sensor_Name,Measurement,Raw_Value,Human_Readable
14,2021/03/20 00:31:19,001e0610ee36,metsense,htu21d,humidity,47738,85.05
39,2021/03/20 00:31:21,001e0611804d,metsense,htu21d,humidity,65535,118.99
81,2021/03/20 00:31:44,001e0610ee36,metsense,htu21d,humidity,48202,85.93
106,2021/03/20 00:31:47,001e0611804d,metsense,htu21d,humidity,65535,118.99
148,2021/03/20 00:32:09,001e0610ee36,metsense,htu21d,humidity,48518,86.54
...,...,...,...,...,...,...,...
4537,2021/03/20 01:00:33,001e0610ee36,metsense,htu21d,humidity,52250,93.66
4567,2021/03/20 01:00:44,001e0611804d,metsense,htu21d,humidity,65535,118.99
4604,2021/03/20 01:00:58,001e0610ee36,metsense,htu21d,humidity,52330,93.81
4629,2021/03/20 01:01:09,001e0611804d,metsense,htu21d,humidity,65535,118.99
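
Conditions can be combined in other ways as well: `|` means "or", `~` means "not", and `isin` tests membership in a list. Each condition needs its own parentheses when written inline. A minimal sketch on a made-up dataframe, `demo`, using sensor names from the output above:

```python
import pandas as pd

# Stand-in for `data` with the renamed columns used above
demo = pd.DataFrame({
    "Sensor_Name": ["hih6130", "hih4030", "htu21d", "pr103j2", "tmp112"],
    "Measurement": ["humidity", "humidity", "humidity", "temperature", "temperature"],
})

# Conditions can be stored in variables to keep the filter readable
is_humidity = demo["Measurement"] == "humidity"
is_htu21d = demo["Sensor_Name"] == "htu21d"

either = demo.loc[is_humidity | (demo["Sensor_Name"] == "tmp112")]   # OR
not_htu = demo.loc[is_humidity & ~is_htu21d]                         # AND NOT
chosen = demo.loc[demo["Sensor_Name"].isin(["hih4030", "pr103j2"])]  # membership

print(len(either), len(not_htu), len(chosen))
```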
