# Exploring AoT Data (Lecture)

### Import Pandas

We will use Pandas to read in the data file and manipulate it. NumPy also provides many functions for working with numerical data.

In [1]:
import pandas as pd
import numpy as np

### Read in the .csv file 

Use the `read_csv` function in Pandas to read comma-separated values datasets (.csv files).

In [2]:
# Update the path and filename to reflect where the data is currently stored
data = pd.read_csv("../../Datasets/AoT/July18_2019.csv")
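
To try `read_csv` without the data file on disk, here is a minimal sketch that reads from an in-memory string instead. The name `csv_text` and the use of `io.StringIO` are assumptions made for illustration; the two data rows are copied from the output shown in this lecture.

```python
import io

import pandas as pd

# Two rows in the same layout as the AoT file (illustrative stand-in for the real file)
csv_text = (
    "timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf\n"
    "2019/07/18 14:00:00,001e0610ee36,metsense,pr103j2,temperature,840,24.45\n"
    "2019/07/18 14:00:04,001e0610ba15,metsense,pr103j2,temperature,24,\n"
)

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# The empty last field in the second row is read as a missing value (NaN)
print(sample.shape)
print(sample["value_hrf"].isna().sum())
```

The same call works with a path string, as in the cell above.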

### Print the data

In [3]:
data

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
0,2019/07/18 14:00:00,001e0610ee36,metsense,pr103j2,temperature,840,24.45
1,2019/07/18 14:00:00,001e06114fd4,metsense,pr103j2,temperature,839,24.3
2,2019/07/18 14:00:01,001e0610ba13,metsense,pr103j2,temperature,853,26.45
3,2019/07/18 14:00:02,001e0610ee61,metsense,pr103j2,temperature,850,26.0
4,2019/07/18 14:00:02,001e06113ad8,metsense,pr103j2,temperature,837,24.0
5,2019/07/18 14:00:02,001e0611462f,metsense,pr103j2,temperature,810,20.2
6,2019/07/18 14:00:03,001e0610bbf9,metsense,pr103j2,temperature,845,25.2
7,2019/07/18 14:00:03,001e061146cb,metsense,pr103j2,temperature,835,23.7
8,2019/07/18 14:00:04,001e0610ba15,metsense,pr103j2,temperature,24,
9,2019/07/18 14:00:04,001e0610e537,metsense,pr103j2,temperature,833,23.4


### Just view the first few entries

In [4]:
data.head()

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
0,2019/07/18 14:00:00,001e0610ee36,metsense,pr103j2,temperature,840,24.45
1,2019/07/18 14:00:00,001e06114fd4,metsense,pr103j2,temperature,839,24.3
2,2019/07/18 14:00:01,001e0610ba13,metsense,pr103j2,temperature,853,26.45
3,2019/07/18 14:00:02,001e0610ee61,metsense,pr103j2,temperature,850,26.0
4,2019/07/18 14:00:02,001e06113ad8,metsense,pr103j2,temperature,837,24.0


### Just view the last few entries

In [5]:
data.tail()

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
36,2019/07/18 14:00:22,001e06118182,metsense,pr103j2,temperature,845,25.2
37,2019/07/18 14:00:23,001e06113ace,metsense,pr103j2,temperature,831,23.1
38,2019/07/18 14:00:23,001e0611537d,metsense,pr103j2,temperature,845,25.2
39,2019/07/18 14:00:24,001e0610ee43,metsense,pr103j2,temperature,846,25.35
40,2019/07/18 14:00:30,001e0611441e,metsense,pr103j2,temperature,835,23.7


## Get a Summary of the Data

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   timestamp  41 non-null     object 
 1   node_id    41 non-null     object 
 2   subsystem  41 non-null     object 
 3   sensor     41 non-null     object 
 4   parameter  41 non-null     object 
 5   value_raw  41 non-null     int64  
 6   value_hrf  40 non-null     float64
dtypes: float64(1), int64(1), object(5)
memory usage: 2.4+ KB


In [7]:
data.shape

(41, 7)

### Are we working with a complete data set?

A good way to check is to count the number of non-null elements in each column.

In [8]:
data.count()

timestamp    41
node_id      41
subsystem    41
sensor       41
parameter    41
value_raw    41
value_hrf    40
dtype: int64
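
The counts show that `value_hrf` has one fewer entry than the other columns. One way to locate the gap is sketched below, assuming a small stand-in DataFrame named `df` (a hypothetical name) built from three of the rows shown earlier.

```python
import numpy as np
import pandas as pd

# Small stand-in built from rows shown earlier in the lecture
df = pd.DataFrame({
    "node_id": ["001e0610ee36", "001e0610ba15", "001e0610e537"],
    "value_raw": [840, 24, 833],
    "value_hrf": [24.45, np.nan, 23.4],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Rows that contain at least one missing value
incomplete_rows = df[df.isna().any(axis=1)]

print(missing_per_column)
print(incomplete_rows)
```

`isna().sum()` is the complement of `count()`: it reports what is missing rather than what is present.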

### How many nodes are in this data set?

We can see the unique entries by eliminating duplicates.

In [9]:
data.drop_duplicates("node_id")

Unnamed: 0,timestamp,node_id,subsystem,sensor,parameter,value_raw,value_hrf
0,2019/07/18 14:00:00,001e0610ee36,metsense,pr103j2,temperature,840,24.45
1,2019/07/18 14:00:00,001e06114fd4,metsense,pr103j2,temperature,839,24.3
2,2019/07/18 14:00:01,001e0610ba13,metsense,pr103j2,temperature,853,26.45
3,2019/07/18 14:00:02,001e0610ee61,metsense,pr103j2,temperature,850,26.0
4,2019/07/18 14:00:02,001e06113ad8,metsense,pr103j2,temperature,837,24.0
5,2019/07/18 14:00:02,001e0611462f,metsense,pr103j2,temperature,810,20.2
6,2019/07/18 14:00:03,001e0610bbf9,metsense,pr103j2,temperature,845,25.2
7,2019/07/18 14:00:03,001e061146cb,metsense,pr103j2,temperature,835,23.7
8,2019/07/18 14:00:04,001e0610ba15,metsense,pr103j2,temperature,24,
9,2019/07/18 14:00:04,001e0610e537,metsense,pr103j2,temperature,833,23.4


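You're probably wondering whether `drop_duplicates()` deleted a bunch of data. It did not: by default it returns a new DataFrame and leaves `data` unchanged, as the shape confirms.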

In [10]:
data.shape

(41, 7)
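
The same behavior can be seen on a tiny example. This is a sketch with hypothetical node IDs (`"a"` and `"b"`) and a hypothetical DataFrame name `df`, chosen so that one ID repeats.

```python
import pandas as pd

# Hypothetical data: node "a" appears twice
df = pd.DataFrame({
    "node_id": ["a", "a", "b"],
    "value_raw": [840, 841, 853],
})

# drop_duplicates returns a new DataFrame; df itself is not modified
unique_nodes = df.drop_duplicates("node_id")

print(df.shape)            # original still has all three rows
print(unique_nodes.shape)  # one row per distinct node_id
```

To keep the reduced data, assign the result to a variable, as done here with `unique_nodes`.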

### How many different sensors do we have? (Student)

If we want to create a new set of data, we can assign our operation to a variable.

In [11]:
sensor_count = data.drop_duplicates("sensor")

In [12]:
sensor_count.shape

(1, 7)

## Obviously, there is an easier way to do this.

In [13]:
data.describe()

Unnamed: 0,value_raw,value_hrf
count,41.0,40.0
mean,818.780488,24.30375
std,127.924687,2.003279
min,24.0,19.55
25%,833.0,23.4
50%,840.0,24.45
75%,845.0,25.2375
max,866.0,28.65


# Getting the right data

## Working with Columns


In [14]:
data.columns

Index(['timestamp', 'node_id', 'subsystem', 'sensor', 'parameter', 'value_raw',
       'value_hrf'],
      dtype='object')

We can extract a column's data by putting the column name in brackets. This is how we reference columns: `dataset_name['column name']`

In [15]:
node_id = data['node_id']      # Referencing a single column returns a Series.

In [16]:
node_id

0     001e0610ee36
1     001e06114fd4
2     001e0610ba13
3     001e0610ee61
4     001e06113ad8
5     001e0611462f
6     001e0610bbf9
7     001e061146cb
8     001e0610ba15
9     001e0610e537
10    001e061130f4
11    001e0610f732
12    001e06113cf1
13    001e061146bc
14    001e0611536c
15    001e06118501
16    001e0610e835
17    001e06113a24
18    001e06115365
19    001e0611856d
20    001e061183f3
21    001e061183f5
22    001e0610f02f
23    001e0610ee5d
24    001e0610f05c
25    001e06117b44
26    001e061182a7
27    001e0610bc10
28    001e061183eb
29    001e0610eef2
30    001e06113d20
31    001e061146ba
32    001e0610ba46
33    001e0610f703
34    001e061144be
35    001e06113d22
36    001e06118182
37    001e06113ace
38    001e0611537d
39    001e0610ee43
40    001e0611441e
Name: node_id, dtype: object

What kind of data is node_id?

In [17]:
type(node_id)

pandas.core.series.Series

Even though it is a series, it still keeps its index.

We can extract multiple columns, but we need to pass a list of column names.

In [18]:
node_id = data[['node_id', 'sensor']]      # A list of column names returns a DataFrame, even though the variable is still called node_id.

In [19]:
node_id.head()

Unnamed: 0,node_id,sensor
0,001e0610ee36,pr103j2
1,001e06114fd4,pr103j2
2,001e0610ba13,pr103j2
3,001e0610ee61,pr103j2
4,001e06113ad8,pr103j2


This looks different from when we extracted a single column. Let's check the type.

In [20]:
type(node_id)

pandas.core.frame.DataFrame

### Why the difference?
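
The type depends on what goes inside the brackets: a single column name returns a one-dimensional Series, while a list of names returns a two-dimensional DataFrame, even if the list holds only one name. A minimal sketch, assuming a hypothetical stand-in DataFrame `df`:

```python
import pandas as pd

# Stand-in data using values shown earlier in the lecture
df = pd.DataFrame({
    "node_id": ["001e0610ee36", "001e06114fd4"],
    "sensor": ["pr103j2", "pr103j2"],
})

as_series = df["node_id"]    # a string key selects one column as a Series
as_frame = df[["node_id"]]   # a list key selects columns as a DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The shapes make the distinction visible: the Series has one dimension, and the DataFrame has two.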

## Renaming Columns

In [21]:
data.columns = ['Timestamp','Node', 'Subsystem', 'Sensor_Name', 'Measurement', 'Raw_Value', 'Human_Readable']

In [22]:
data.head()

Unnamed: 0,Timestamp,Node,Subsystem,Sensor_Name,Measurement,Raw_Value,Human_Readable
0,2019/07/18 14:00:00,001e0610ee36,metsense,pr103j2,temperature,840,24.45
1,2019/07/18 14:00:00,001e06114fd4,metsense,pr103j2,temperature,839,24.3
2,2019/07/18 14:00:01,001e0610ba13,metsense,pr103j2,temperature,853,26.45
3,2019/07/18 14:00:02,001e0610ee61,metsense,pr103j2,temperature,850,26.0
4,2019/07/18 14:00:02,001e06113ad8,metsense,pr103j2,temperature,837,24.0
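
Assigning a list to `data.columns` replaces every name by position, so the list must match the number and order of the columns. When only some names need to change, a sketch using `rename()` with a mapping is one alternative. The stand-in DataFrame `df` and the choice of which columns to rename are assumptions for illustration.

```python
import pandas as pd

# Stand-in data with three of the original column names
df = pd.DataFrame({
    "node_id": ["001e0610ee36", "001e06114fd4"],
    "value_raw": [840, 839],
    "value_hrf": [24.45, 24.3],
})

# Map old names to new names; columns not listed keep their names
renamed = df.rename(columns={"node_id": "Node", "value_hrf": "Human_Readable"})

print(list(renamed.columns))
print(list(df.columns))  # rename returns a new DataFrame; df is unchanged
```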


### What about rows?

Rows are a little trickier to reference than columns. A column is a Series that we can reference by name, and its elements all share the same data type. A row cuts across several of those Series, so it can contain multiple data types and requires a different way of thinking.
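
As a first look at row access, here is a minimal sketch using `iloc` (selection by integer position) and `loc` (selection by index label). The stand-in DataFrame `df` is hypothetical and reuses the renamed columns and values shown above.

```python
import pandas as pd

# Stand-in data using the renamed columns
df = pd.DataFrame({
    "Node": ["001e0610ee36", "001e06114fd4", "001e0610ba13"],
    "Raw_Value": [840, 839, 853],
    "Human_Readable": [24.45, 24.3, 26.45],
})

first_row = df.iloc[0]    # one row by position, returned as a Series
first_two = df.iloc[0:2]  # a slice of rows, returned as a DataFrame
by_label = df.loc[2]      # one row by index label

# The row's values come from columns of different types
print(first_row)
print(first_two.shape)
print(by_label["Node"])
```

Because a single row mixes text and numbers, the resulting Series has to hold values of more than one type, which is the difference from a column described above.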