# Topic: Pandas Data Frames
In this class you will get practice at summarising data using Pandas.
Make sure to complete all of Example 1. The remaining examples are provided for those who would like further practice.

### Online Documentation:
* Pandas User Guide: https://pandas.pydata.org/docs/user_guide/index.html
* Pandas User Guide - Intro to data structures: https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro
* Reading from CSV file: https://pandas.pydata.org/docs/user_guide/io.html#io-read-csv-table

# About Data Attribution

<i>
    "<b>Open data</b> is data that is openly accessible, exploitable, editable and shared by anyone for any purpose, even commercially"</i>.
<div style="text-align:right">[<a href="https://en.wikipedia.org/wiki/Open_data">Wikipedia</a>]</div>

<p>Open data is typically distributed under some form of <a href="">open license</a> agreement.</p>
    
<p>That license may impose <i>obligations</i> on the users of that data, including the need to:
<ul>
<li>include a reference to the license</li>
<li>acknowledge the person or organization that provided the data</li>
<li>note clearly when changes have been made to the original data</li>
</ul>

# Example 1: House Hold Energy Data

<h3>Data Attribution</h3>
<table style="border-style:solid; margin-left:0">
        <tr><td>Contributor:</td><td>Jaganadh Gopinadhan</td></tr>
        <tr><td>License:</td><td><a href="https://cdla.io/permissive-1-0/">Community Data License Agreement - Permissive - Version 1.0</a></td></tr>
        <tr><td>Data source:</td><td><a href="https://www.kaggle.com/jaganadhg/house-hold-energy-data">https://www.kaggle.com/jaganadhg/house-hold-energy-data</a></td></tr>
        <tr><td>Local data file:</td><td><a href="D202.csv">D202.csv</a>  </td></tr>
        <tr><td colspan=2>If you share this data, you must preserve this attribution.</td></tr>
</table>      
<p>Make sure you have uploaded file <code>D202.csv</code> to the same folder as this Notebook.</p>

## Importing and inspecting data

In [1]:
# Read file D202.csv into a Pandas DataFrame
import pandas as pd
D202 = pd.read_csv("D202.csv")
D202

Unnamed: 0,TYPE,DATE,START TIME,END TIME,USAGE,UNITS,COST,NOTES
0,Electric usage,10/22/2016,0:00,0:14,0.01,kWh,$0.00,
1,Electric usage,10/22/2016,0:15,0:29,0.01,kWh,$0.00,
2,Electric usage,10/22/2016,0:30,0:44,0.01,kWh,$0.00,
3,Electric usage,10/22/2016,0:45,0:59,0.01,kWh,$0.00,
4,Electric usage,10/22/2016,1:00,1:14,0.01,kWh,$0.00,
...,...,...,...,...,...,...,...,...
70363,Electric usage,10/24/2018,22:45,22:59,0.02,kWh,$0.00,
70364,Electric usage,10/24/2018,23:00,23:14,0.03,kWh,$0.01,
70365,Electric usage,10/24/2018,23:15,23:29,0.03,kWh,$0.01,
70366,Electric usage,10/24/2018,23:30,23:44,0.03,kWh,$0.01,


In [2]:
# Inspect the results above and answer the following questions:
# 1) What is the column name of the first column? TYPE
# 2) What is the row index of the first row? 0



In [3]:
# What is the type of the Date Column?
# It's an object - means it's treating it like a string
D202.dtypes

TYPE           object
DATE           object
START TIME     object
END TIME       object
USAGE         float64
UNITS          object
COST           object
NOTES         float64
dtype: object

In [4]:
# What is the kind of the index? 
# Range index
D202.index

RangeIndex(start=0, stop=70368, step=1)

In [5]:
D202.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70368 entries, 0 to 70367
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   TYPE        70368 non-null  object 
 1   DATE        70368 non-null  object 
 2   START TIME  70368 non-null  object 
 3   END TIME    70368 non-null  object 
 4   USAGE       70368 non-null  float64
 5   UNITS       70368 non-null  object 
 6   COST        70368 non-null  object 
 7   NOTES       0 non-null      float64
dtypes: float64(2), object(6)
memory usage: 4.3+ MB


In [6]:
# Now pass extra parameters to read_csv so that the DATE strings are parsed as dates and the DATE column becomes the index.
D202 = pd.read_csv("D202.csv", parse_dates = ['DATE'], index_col = ['DATE'])
D202

DATE,TYPE,START TIME,END TIME,USAGE,UNITS,COST,NOTES
2016-10-22,Electric usage,0:00,0:14,0.01,kWh,$0.00,
2016-10-22,Electric usage,0:15,0:29,0.01,kWh,$0.00,
2016-10-22,Electric usage,0:30,0:44,0.01,kWh,$0.00,
2016-10-22,Electric usage,0:45,0:59,0.01,kWh,$0.00,
2016-10-22,Electric usage,1:00,1:14,0.01,kWh,$0.00,
...,...,...,...,...,...,...,...
2018-10-24,Electric usage,22:45,22:59,0.02,kWh,$0.00,
2018-10-24,Electric usage,23:00,23:14,0.03,kWh,$0.01,
2018-10-24,Electric usage,23:15,23:29,0.03,kWh,$0.01,
2018-10-24,Electric usage,23:30,23:44,0.03,kWh,$0.01,


In [58]:
D202.index
# What is the kind of the index? Now it's a datetime index
# What is the data type of the values in the index? Datetime

DatetimeIndex(['2016-10-22', '2016-10-22', '2016-10-22', '2016-10-22',
               '2016-10-22', '2016-10-22', '2016-10-22', '2016-10-22',
               '2016-10-22', '2016-10-22',
               ...
               '2018-10-24', '2018-10-24', '2018-10-24', '2018-10-24',
               '2018-10-24', '2018-10-24', '2018-10-24', '2018-10-24',
               '2018-10-24', '2018-10-24'],
              dtype='datetime64[ns]', name='DATE', length=70368, freq=None)
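
The effect of `parse_dates` and `index_col` can be checked without the full file. The sketch below is only an illustration: `sample_csv` is a hypothetical three-row extract written in the same layout as `D202.csv`, not the real data.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same layout as D202.csv
sample_csv = io.StringIO(
    "TYPE,DATE,START TIME,END TIME,USAGE,UNITS,COST,NOTES\n"
    "Electric usage,10/22/2016,0:00,0:14,0.01,kWh,$0.00,\n"
    "Electric usage,10/22/2016,0:15,0:29,0.01,kWh,$0.00,\n"
    "Electric usage,10/23/2016,0:00,0:14,0.02,kWh,$0.00,\n"
)

# Parse the DATE strings as dates and use that column as the row index
sample = pd.read_csv(sample_csv, parse_dates=["DATE"], index_col="DATE")

print(type(sample.index).__name__)  # the kind of index
print(sample.index.dtype)           # the data type of the index values
```

Without `parse_dates`, the same call would leave the index values as plain strings.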

In [8]:
# Is the Date index unique? Why (examine the data)? No: each date repeats once for every 15-minute reading.
D202.index.is_unique

False
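
The index is not unique because each date is shared by many 15-minute readings. One possible way to get a unique timestamp per reading, shown here only as a sketch on hypothetical sample values, is to combine the date with the start time before parsing. The format strings are assumptions based on how the values are displayed above.

```python
import pandas as pd

# Hypothetical sample: two readings on one date, one on the next
sample = pd.DataFrame(
    {
        "DATE": ["10/22/2016", "10/22/2016", "10/23/2016"],
        "START TIME": ["0:00", "0:15", "0:00"],
        "USAGE": [0.01, 0.01, 0.02],
    }
)

# Dates alone repeat once per reading
dates_only = pd.to_datetime(sample["DATE"], format="%m/%d/%Y")
print(dates_only.is_unique)

# Date plus start time gives one timestamp per reading
timestamps = pd.to_datetime(
    sample["DATE"] + " " + sample["START TIME"], format="%m/%d/%Y %H:%M"
)
print(timestamps.is_unique)
```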

In [9]:
D202.describe()
# Why are statistics only listed for the USAGE and NOTES columns? By default, describe() only summarises columns with numeric data types.

Unnamed: 0,USAGE,NOTES
count,70368.0,0.0
mean,0.121941,
std,0.210507,
min,0.0,
25%,0.03,
50%,0.05,
75%,0.12,
max,2.36,
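
If a summary of the text columns is also wanted, `describe` accepts an `include` argument. The sketch below uses a small hypothetical frame rather than the real data.

```python
import pandas as pd

# Hypothetical sample with one numeric and two text columns
sample = pd.DataFrame(
    {
        "USAGE": [0.01, 0.05, 0.30],
        "UNITS": ["kWh", "kWh", "kWh"],
        "COST": ["$0.00", "$0.01", "$0.05"],
    }
)

numeric_summary = sample.describe()           # numeric columns only
all_summary = sample.describe(include="all")  # adds unique, top and freq rows

print(numeric_summary)
print(all_summary)
```

Statistics that do not apply to a column (for example `mean` for `UNITS`) are shown as `NaN` in the combined summary.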


## Selecting, filtering and aggregating data

In [10]:
# select a column
D202['COST']

DATE
2016-10-22    $0.00 
2016-10-22    $0.00 
2016-10-22    $0.00 
2016-10-22    $0.00 
2016-10-22    $0.00 
               ...  
2018-10-24    $0.00 
2018-10-24    $0.01 
2018-10-24    $0.01 
2018-10-24    $0.01 
2018-10-24    $0.01 
Name: COST, Length: 70368, dtype: object

In [11]:
D202.COST

DATE
2016-10-22    $0.00 
2016-10-22    $0.00 
2016-10-22    $0.00 
2016-10-22    $0.00 
2016-10-22    $0.00 
               ...  
2018-10-24    $0.00 
2018-10-24    $0.01 
2018-10-24    $0.01 
2018-10-24    $0.01 
2018-10-24    $0.01 
Name: COST, Length: 70368, dtype: object

In [12]:
# select rows by index label (the date index is not unique, so this returns every row for that date)
D202.loc['2016-10-22']

DATE,TYPE,START TIME,END TIME,USAGE,UNITS,COST,NOTES
2016-10-22,Electric usage,0:00,0:14,0.01,kWh,$0.00,
2016-10-22,Electric usage,0:15,0:29,0.01,kWh,$0.00,
2016-10-22,Electric usage,0:30,0:44,0.01,kWh,$0.00,
2016-10-22,Electric usage,0:45,0:59,0.01,kWh,$0.00,
2016-10-22,Electric usage,1:00,1:14,0.01,kWh,$0.00,
...,...,...,...,...,...,...,...
2016-10-22,Electric usage,22:45,22:59,0.05,kWh,$0.01,
2016-10-22,Electric usage,23:00,23:14,0.30,kWh,$0.05,
2016-10-22,Electric usage,23:15,23:29,0.30,kWh,$0.05,
2016-10-22,Electric usage,23:30,23:44,0.30,kWh,$0.05,


In [13]:
# select row and column
D202.loc['2016-10-22', 'COST']

DATE
2016-10-22    $0.00 
2016-10-22    $0.00 
2016-10-22    $0.00 
2016-10-22    $0.00 
2016-10-22    $0.00 
               ...  
2016-10-22    $0.01 
2016-10-22    $0.05 
2016-10-22    $0.05 
2016-10-22    $0.05 
2016-10-22    $0.05 
Name: COST, Length: 96, dtype: object

In [14]:
# select multiple rows / cols
D202.loc['2016-10-22', ['USAGE','COST']]

DATE,USAGE,COST
2016-10-22,0.01,$0.00
2016-10-22,0.01,$0.00
2016-10-22,0.01,$0.00
2016-10-22,0.01,$0.00
2016-10-22,0.01,$0.00
...,...,...
2016-10-22,0.05,$0.01
2016-10-22,0.30,$0.05
2016-10-22,0.30,$0.05
2016-10-22,0.30,$0.05


In [15]:
# select multiple rows / cols
D202.loc['2016-10-22' : '2016-10-28', ['USAGE','COST']]

DATE,USAGE,COST
2016-10-22,0.01,$0.00
2016-10-22,0.01,$0.00
2016-10-22,0.01,$0.00
2016-10-22,0.01,$0.00
2016-10-22,0.01,$0.00
...,...,...
2016-10-28,0.07,$0.01
2016-10-28,0.05,$0.01
2016-10-28,0.05,$0.01
2016-10-28,0.05,$0.01
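
The date-based `.loc` selections above can be tried on a small frame. This sketch uses hypothetical values. It also shows a partial date string, which selects a whole month, and it assumes the date index is sorted.

```python
import pandas as pd

# Hypothetical sample with a repeated, sorted date index
sample = pd.DataFrame(
    {"USAGE": [0.01, 0.02, 0.03, 0.04]},
    index=pd.to_datetime(
        ["2016-10-22", "2016-10-22", "2016-10-28", "2016-11-01"]
    ),
)
sample.index.name = "DATE"

one_day = sample.loc["2016-10-22"]              # every row on that date
week = sample.loc["2016-10-22":"2016-10-28"]    # both end dates are included
october = sample.loc["2016-10"]                 # partial string: the whole month

print(len(one_day), len(week), len(october))
```

Unlike ordinary Python slices, label slices with `.loc` include the end label.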


In [16]:
midday = D202['START TIME'] == '12:00'
D202[midday] # select all rows of D202 where midday is True

DATE,TYPE,START TIME,END TIME,USAGE,UNITS,COST,NOTES
2016-10-22,Electric usage,12:00,12:14,0.13,kWh,$0.02,
2016-10-23,Electric usage,12:00,12:14,0.05,kWh,$0.01,
2016-10-24,Electric usage,12:00,12:14,0.03,kWh,$0.01,
2016-10-25,Electric usage,12:00,12:14,0.03,kWh,$0.01,
2016-10-26,Electric usage,12:00,12:14,0.05,kWh,$0.01,
...,...,...,...,...,...,...,...
2018-10-20,Electric usage,12:00,12:14,0.05,kWh,$0.01,
2018-10-21,Electric usage,12:00,12:14,0.00,kWh,$0.00,
2018-10-22,Electric usage,12:00,12:14,0.00,kWh,$0.00,
2018-10-23,Electric usage,12:00,12:14,0.02,kWh,$0.00,


In [17]:
# filtering - select all midday observations
D202[D202['START TIME'] == '12:00']


DATE,TYPE,START TIME,END TIME,USAGE,UNITS,COST,NOTES
2016-10-22,Electric usage,12:00,12:14,0.13,kWh,$0.02,
2016-10-23,Electric usage,12:00,12:14,0.05,kWh,$0.01,
2016-10-24,Electric usage,12:00,12:14,0.03,kWh,$0.01,
2016-10-25,Electric usage,12:00,12:14,0.03,kWh,$0.01,
2016-10-26,Electric usage,12:00,12:14,0.05,kWh,$0.01,
...,...,...,...,...,...,...,...
2018-10-20,Electric usage,12:00,12:14,0.05,kWh,$0.01,
2018-10-21,Electric usage,12:00,12:14,0.00,kWh,$0.00,
2018-10-22,Electric usage,12:00,12:14,0.00,kWh,$0.00,
2018-10-23,Electric usage,12:00,12:14,0.02,kWh,$0.00,


In [18]:
# aggregates (a notebook cell displays only the value of its last expression, here the standard deviation)
D202['USAGE'].mean()
D202['USAGE'].median()
D202['USAGE'].std()

0.21050692734054596
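
A notebook cell shows only the value of its last expression, so the cell above displays a single number. One way to see several aggregates together is the `agg` method, sketched here on a hypothetical usage series.

```python
import pandas as pd

# Hypothetical usage readings in kWh
usage = pd.Series([0.01, 0.03, 0.05, 0.12, 2.36], name="USAGE")

# Compute several aggregates in one call; the result is a Series
summary = usage.agg(["mean", "median", "std"])
print(summary)
```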

In [19]:
# Extract just the Usage column from this data frame
D202['USAGE']

DATE
2016-10-22    0.01
2016-10-22    0.01
2016-10-22    0.01
2016-10-22    0.01
2016-10-22    0.01
              ... 
2018-10-24    0.02
2018-10-24    0.03
2018-10-24    0.03
2018-10-24    0.03
2018-10-24    0.03
Name: USAGE, Length: 70368, dtype: float64

In [20]:
D202.USAGE

DATE
2016-10-22    0.01
2016-10-22    0.01
2016-10-22    0.01
2016-10-22    0.01
2016-10-22    0.01
              ... 
2018-10-24    0.02
2018-10-24    0.03
2018-10-24    0.03
2018-10-24    0.03
2018-10-24    0.03
Name: USAGE, Length: 70368, dtype: float64

In [21]:
type(D202)

pandas.core.frame.DataFrame

In [22]:
# What is the Python data type of this result (Data Frame or Data Series?)
# series
type(D202.USAGE)

pandas.core.series.Series

In [23]:
# Extract just the Usage column using a loc expression
D202.loc[:, 'USAGE']

DATE
2016-10-22    0.01
2016-10-22    0.01
2016-10-22    0.01
2016-10-22    0.01
2016-10-22    0.01
              ... 
2018-10-24    0.02
2018-10-24    0.03
2018-10-24    0.03
2018-10-24    0.03
2018-10-24    0.03
Name: USAGE, Length: 70368, dtype: float64

In [24]:
# Extract just the Usage column using an iloc expression
D202.iloc[:, 3]

DATE
2016-10-22    0.01
2016-10-22    0.01
2016-10-22    0.01
2016-10-22    0.01
2016-10-22    0.01
              ... 
2018-10-24    0.02
2018-10-24    0.03
2018-10-24    0.03
2018-10-24    0.03
2018-10-24    0.03
Name: USAGE, Length: 70368, dtype: float64
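
The position `3` in the `iloc` expression depends on the column order. As a sketch on a hypothetical frame, the position can instead be looked up from the column label with `columns.get_loc`.

```python
import pandas as pd

# Hypothetical frame with the same leading columns as the indexed D202 data
sample = pd.DataFrame(
    {
        "TYPE": ["Electric usage"] * 2,
        "START TIME": ["0:00", "0:15"],
        "END TIME": ["0:14", "0:29"],
        "USAGE": [0.01, 0.02],
    }
)

usage_position = sample.columns.get_loc("USAGE")  # integer position of the label
by_position = sample.iloc[:, usage_position]
by_label = sample.loc[:, "USAGE"]

print(usage_position)
```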

In [25]:
# What is the maximum Usage?
max_usage = D202['USAGE'].max()
max_usage

2.36

In [26]:
# Extract the rows from the household energy data frame where the usage is at its maximum.
# How many rows are there?
max_usage_row = D202['USAGE'] == max_usage
D202[max_usage_row]

DATE,TYPE,START TIME,END TIME,USAGE,UNITS,COST,NOTES
2017-12-30,Electric usage,17:00,17:14,2.36,kWh,$0.57,
2017-12-30,Electric usage,17:15,17:29,2.36,kWh,$0.65,
2017-12-30,Electric usage,17:30,17:44,2.36,kWh,$0.65,
2017-12-30,Electric usage,17:45,17:59,2.36,kWh,$0.65,


In [27]:
# Retrieve all rows where the usage is greater than zero
positive_usage_row = D202['USAGE'] > 0
D202_positive = D202[positive_usage_row]

In [28]:
# What was the minimum usage from those rows?
min_usage = D202_positive['USAGE'].min()
min_usage

0.01

In [29]:
# Retrieve the rows where usage was at that minimum
D202_min_usage = D202_positive[D202_positive['USAGE'] == min_usage]
D202_min_usage

DATE,TYPE,START TIME,END TIME,USAGE,UNITS,COST,NOTES
2016-10-22,Electric usage,0:00,0:14,0.01,kWh,$0.00,
2016-10-22,Electric usage,0:15,0:29,0.01,kWh,$0.00,
2016-10-22,Electric usage,0:30,0:44,0.01,kWh,$0.00,
2016-10-22,Electric usage,0:45,0:59,0.01,kWh,$0.00,
2016-10-22,Electric usage,1:00,1:14,0.01,kWh,$0.00,
...,...,...,...,...,...,...,...
2018-10-24,Electric usage,12:45,12:59,0.01,kWh,$0.00,
2018-10-24,Electric usage,13:00,13:14,0.01,kWh,$0.00,
2018-10-24,Electric usage,13:15,13:29,0.01,kWh,$0.00,
2018-10-24,Electric usage,13:30,13:44,0.01,kWh,$0.00,


In [30]:
# Use the unique() method to determine the number of different dates where usage was at that minimum
len(D202_min_usage.index.unique())

298

## Aggregating using the groupby method

In [31]:
# Now we will add some new columns to capture some different aspects of the datetime data. 
# You will learn about this more next week
D202['DAY'] = D202.index.day_name()
D202['MONTH'] = D202.index.month_name()
D202


DATE,TYPE,START TIME,END TIME,USAGE,UNITS,COST,NOTES,DAY,MONTH
2016-10-22,Electric usage,0:00,0:14,0.01,kWh,$0.00,,Saturday,October
2016-10-22,Electric usage,0:15,0:29,0.01,kWh,$0.00,,Saturday,October
2016-10-22,Electric usage,0:30,0:44,0.01,kWh,$0.00,,Saturday,October
2016-10-22,Electric usage,0:45,0:59,0.01,kWh,$0.00,,Saturday,October
2016-10-22,Electric usage,1:00,1:14,0.01,kWh,$0.00,,Saturday,October
...,...,...,...,...,...,...,...,...,...
2018-10-24,Electric usage,22:45,22:59,0.02,kWh,$0.00,,Wednesday,October
2018-10-24,Electric usage,23:00,23:14,0.03,kWh,$0.01,,Wednesday,October
2018-10-24,Electric usage,23:15,23:29,0.03,kWh,$0.01,,Wednesday,October
2018-10-24,Electric usage,23:30,23:44,0.03,kWh,$0.01,,Wednesday,October


In [32]:
# Return a dataframe that displays the average usage per 15 min observation for each day of the week, sorted in descending order.
# Which days of the week have the highest usage?
# Hint: Use groupby to group by day. Use the sort_values() method to sort.
D202.groupby('DAY')['USAGE'].mean().sort_values(ascending = False)

# usage is highest on weekends when people are at home rather than work

DAY
Sunday       0.141234
Saturday     0.134044
Monday       0.120813
Friday       0.119884
Thursday     0.114175
Tuesday      0.112063
Wednesday    0.111278
Name: USAGE, dtype: float64
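
The group-then-aggregate pattern can be traced step by step on a few rows. The values below are hypothetical and chosen only so that the grouping is easy to check by hand.

```python
import pandas as pd

# Hypothetical readings: two Saturdays, one Sunday, one Monday
sample = pd.DataFrame(
    {"USAGE": [0.10, 0.20, 0.05, 0.50]},
    index=pd.to_datetime(
        ["2016-10-22", "2016-10-23", "2016-10-24", "2016-10-29"]
    ),
)
sample["DAY"] = sample.index.day_name()

# Split the rows by DAY, take the mean of USAGE in each group, then sort
mean_by_day = sample.groupby("DAY")["USAGE"].mean()
ranked = mean_by_day.sort_values(ascending=False)
print(ranked)
```

The result is a Series indexed by the group labels, which is why `sort_values` needs no column name.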

In [59]:
# Return a dataframe that displays the average usage per 15 min observation for each month, sorted in descending order.
# Which months have the highest usage?
D202.groupby('MONTH')['USAGE'].mean().sort_values(ascending = False)

# usage is highest in December and January. For this location (Northern hemisphere) this is when it is coldest and heating is run the most.

MONTH
December     0.209395
January      0.203555
February     0.159554
November     0.136546
March        0.130498
July         0.104839
June         0.094736
April        0.093493
September    0.091715
August       0.085867
May          0.082151
October      0.074564
Name: USAGE, dtype: float64

In [60]:
# Return a dataframe that displays the total daily usage for each date. Which date has the highest usage?
# Hint: Use groupby to group by the date index. Use a sum aggregate to get the total for each day.
D202.groupby('DATE')['USAGE'].sum().sort_values().tail(1)

# The highest daily usage of 39.72 kWh was recorded on 1st January 2017.

DATE
2017-01-01    39.72
Name: USAGE, dtype: float64
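
Sorting and taking the last row is one way to find the peak day. An alternative, sketched here on hypothetical values, is `idxmax`, which returns the index label of the largest value directly.

```python
import pandas as pd

# Hypothetical readings on two dates
sample = pd.DataFrame(
    {"USAGE": [0.5, 0.7, 0.2, 0.3]},
    index=pd.to_datetime(
        ["2017-01-01", "2017-01-01", "2017-01-02", "2017-01-02"]
    ),
)
sample.index.name = "DATE"

daily_total = sample.groupby("DATE")["USAGE"].sum()
peak_date = daily_total.idxmax()   # label of the largest daily total
peak_usage = daily_total.max()     # the total itself

print(peak_date, peak_usage)
```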

## Computed columns and the apply method

In [61]:
# Use the apply method to clean the cost column so there is no leading dollar sign, and it returns a float value
# Hint: The values in the cost column are currently strings. Select all values of the string from the 2nd character onwards
#       and then convert to a float.

def clean_cost(cost):
    return float(cost[1: ])

D202['COST'] = D202['COST'].apply(clean_cost)
D202

DATE,TYPE,START TIME,END TIME,USAGE,UNITS,COST,NOTES,DAY,MONTH
2016-10-22,Electric usage,0:00,0:14,0.01,kWh,0.00,,Saturday,October
2016-10-22,Electric usage,0:15,0:29,0.01,kWh,0.00,,Saturday,October
2016-10-22,Electric usage,0:30,0:44,0.01,kWh,0.00,,Saturday,October
2016-10-22,Electric usage,0:45,0:59,0.01,kWh,0.00,,Saturday,October
2016-10-22,Electric usage,1:00,1:14,0.01,kWh,0.00,,Saturday,October
...,...,...,...,...,...,...,...,...,...
2018-10-24,Electric usage,22:45,22:59,0.02,kWh,0.00,,Wednesday,October
2018-10-24,Electric usage,23:00,23:14,0.03,kWh,0.01,,Wednesday,October
2018-10-24,Electric usage,23:15,23:29,0.03,kWh,0.01,,Wednesday,October
2018-10-24,Electric usage,23:30,23:44,0.03,kWh,0.01,,Wednesday,October
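
The exercise asks for `apply`, but the same cleaning can also be done with the vectorised string methods available through `.str`. The sketch below compares the two on hypothetical cost strings. It assumes every value starts with a single dollar sign, as in the data shown above.

```python
import pandas as pd

# Hypothetical cost strings in the same format as the COST column
cost = pd.Series(["$0.00", "$0.01", "$0.65"], name="COST")

def clean_cost(value):
    # Drop the leading dollar sign and convert the rest to a float
    return float(value[1:])

with_apply = cost.apply(clean_cost)

# Vectorised alternative: remove the literal "$" and convert the whole column
with_str = cost.str.replace("$", "", regex=False).astype(float)

print(with_str)
```

Passing `regex=False` matters here because `$` has a special meaning in regular expressions.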


In [62]:
# Now create a new column which represents the cost per kWh. What is the average cost per kWh?
D202['COST PER KWH'] = D202['COST'] / D202['USAGE']
D202['COST PER KWH'].mean()

0.18998251364238192
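
The figure above is the mean of the per-reading ratios. Readings with zero usage and zero cost give `NaN` (0 / 0), which `mean` skips by default. Because each cost is rounded to the cent, ratios for very small readings are coarse. An overall rate, total cost divided by total usage, is a different quantity. The sketch below contrasts the two on hypothetical values and is not a statement about which the exercise intends.

```python
import numpy as np
import pandas as pd

# Hypothetical readings, including one with zero usage and zero cost
sample = pd.DataFrame(
    {"USAGE": [0.0, 0.02, 0.03, 2.36], "COST": [0.0, 0.00, 0.01, 0.65]}
)

per_row = sample["COST"] / sample["USAGE"]   # 0 / 0 gives NaN for the first row
mean_of_ratios = per_row.mean()              # NaN rows are skipped by default
overall_rate = sample["COST"].sum() / sample["USAGE"].sum()

print(mean_of_ratios, overall_rate)
```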

In [64]:
# Use the apply method to categorise whether an observation belongs to a weekday or a weekend. Add this as a new column to the data.
def is_weekend(day):
    return day in ['Saturday', 'Sunday']

D202['WEEKEND'] = D202['DAY'].apply(is_weekend)
D202
        

DATE,TYPE,START TIME,END TIME,USAGE,UNITS,COST,NOTES,DAY,MONTH,COST PER KWH,WEEKEND
2016-10-22,Electric usage,0:00,0:14,0.01,kWh,0.00,,Saturday,October,0.000000,True
2016-10-22,Electric usage,0:15,0:29,0.01,kWh,0.00,,Saturday,October,0.000000,True
2016-10-22,Electric usage,0:30,0:44,0.01,kWh,0.00,,Saturday,October,0.000000,True
2016-10-22,Electric usage,0:45,0:59,0.01,kWh,0.00,,Saturday,October,0.000000,True
2016-10-22,Electric usage,1:00,1:14,0.01,kWh,0.00,,Saturday,October,0.000000,True
...,...,...,...,...,...,...,...,...,...,...,...
2018-10-24,Electric usage,22:45,22:59,0.02,kWh,0.00,,Wednesday,October,0.000000,False
2018-10-24,Electric usage,23:00,23:14,0.03,kWh,0.01,,Wednesday,October,0.333333,False
2018-10-24,Electric usage,23:15,23:29,0.03,kWh,0.01,,Wednesday,October,0.333333,False
2018-10-24,Electric usage,23:30,23:44,0.03,kWh,0.01,,Wednesday,October,0.333333,False
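
For a simple membership test like this one, `apply` is not the only option. The sketch below, on hypothetical dates, shows two vectorised alternatives: `isin` on the day names, and a comparison on `dayofweek` (where Monday is 0 and Sunday is 6).

```python
import pandas as pd

# Hypothetical dates: Friday, Saturday, Sunday, Monday
dates = pd.to_datetime(["2016-10-21", "2016-10-22", "2016-10-23", "2016-10-24"])
day = pd.Series(dates.day_name(), index=dates)

with_apply = day.apply(lambda name: name in ["Saturday", "Sunday"])
with_isin = day.isin(["Saturday", "Sunday"])
with_dayofweek = pd.Series(dates.dayofweek >= 5, index=dates)

print(with_isin)
```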


# Example 2: Thermodynamic Properties of Water

<h3>Data Attribution</h3>
<table style="border-style:solid; margin-left:0">
        <tr><td>Contributor:</td><td>Israel Urieli</td></tr>
        <tr><td>License:</td><td><a href="http://creativecommons.org/licenses/by-nc-sa/3.0/us/">Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States</a></td></tr>
        <tr><td>Data source:</td><td><a href="https://www.ohio.edu/mechanical/thermo/property_tables/H2O/">https://www.ohio.edu/mechanical/thermo/property_tables/H2O/</a></td></tr>
        <tr><td>Local data file:</td><td><a href="H2O_TempSat.csv">H2O_TempSat.csv</a>  </td></tr>
        <tr><td>Local changes:</td><td>Converted from Excel to CSV and top 2 header rows removed.</td></tr>
        <tr><td colspan=2>If you share this data, you must preserve this attribution.</td></tr>
</table>      
<p>Make sure you have uploaded file <code>H2O_TempSat.csv</code> to the same folder as this Notebook.</p>

In [66]:
# Read file H2O_TempSat.csv (the water saturation properties temperature table) into a pandas data frame
# and use the temperature column as the index
water_properties = pd.read_csv('H2O_TempSat.csv', index_col = ['Temp'])
water_properties

Temp,Pressure,vf,vg,uf,ug,hf,hfg,hg,sf,sfg,sg
0.01,0.000612,0.001,205.99,0.0,2374.9,0.0,2500.9,2500.9,0.0,9.1555,9.1555
5.0,0.000873,0.001,147.01,21.0,2381.8,21.02,2489.1,2510.1,0.0763,8.9485,9.0248
10.0,0.001228,0.001,106.3,42.0,2388.6,42.02,2477.2,2519.2,0.1511,8.7487,8.8998
15.0,0.001706,0.001001,77.88,63.0,2395.5,62.98,2465.3,2528.3,0.2245,8.5558,8.7803
20.0,0.002339,0.001002,57.76,83.9,2402.3,83.91,2453.5,2537.4,0.2965,8.3695,8.666
25.0,0.00317,0.001003,43.34,104.8,2409.1,104.83,2441.7,2546.5,0.3672,8.1894,8.5566
30.0,0.004247,0.001004,32.88,125.7,2415.9,125.73,2429.8,2555.5,0.4368,8.0153,8.452
35.0,0.005629,0.001006,25.21,146.6,2422.7,146.63,2417.9,2564.5,0.5051,7.8466,8.3517
40.0,0.007385,0.001008,19.52,167.5,2429.4,167.53,2406.0,2573.5,0.5724,7.6831,8.2555
45.0,0.009595,0.00101,15.25,188.4,2436.1,188.43,2394.0,2582.4,0.6386,7.5247,8.1633


In [67]:
# Write Python code to find the data relating to water at 100 degrees celsius
# Note: this can be done either by using a .loc expression or by filtering the row where the index column == 100
water_properties.loc[100]

Pressure       0.101400
vf             0.001044
vg             1.672000
uf           419.100000
ug          2506.000000
hf           419.170000
hfg         2256.400000
hg          2675.600000
sf             1.307200
sfg            6.046900
sg             7.354100
Name: 100.0, dtype: float64

In [68]:
# Find the hfg enthalpy of water at 65 degrees celsius (hint: use a .loc expression)
water_properties.loc[65, 'hfg']

2345.4

In [70]:
# Write Python code to find all data relating to temperatures of at most 38 degrees celsius
# (38 is not itself a label; a label slice still works provided the Temp index is sorted)
water_properties.loc[:38]

Temp,Pressure,vf,vg,uf,ug,hf,hfg,hg,sf,sfg,sg
0.01,0.000612,0.001,205.99,0.0,2374.9,0.0,2500.9,2500.9,0.0,9.1555,9.1555
5.0,0.000873,0.001,147.01,21.0,2381.8,21.02,2489.1,2510.1,0.0763,8.9485,9.0248
10.0,0.001228,0.001,106.3,42.0,2388.6,42.02,2477.2,2519.2,0.1511,8.7487,8.8998
15.0,0.001706,0.001001,77.88,63.0,2395.5,62.98,2465.3,2528.3,0.2245,8.5558,8.7803
20.0,0.002339,0.001002,57.76,83.9,2402.3,83.91,2453.5,2537.4,0.2965,8.3695,8.666
25.0,0.00317,0.001003,43.34,104.8,2409.1,104.83,2441.7,2546.5,0.3672,8.1894,8.5566
30.0,0.004247,0.001004,32.88,125.7,2415.9,125.73,2429.8,2555.5,0.4368,8.0153,8.452
35.0,0.005629,0.001006,25.21,146.6,2422.7,146.63,2417.9,2564.5,0.5051,7.8466,8.3517
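
An "at most" condition can be written either as a label slice or as a Boolean filter on the index. The sketch below compares the two on a four-row extract. The `hfg` values are copied from the table above, and the label slice assumes the index is sorted.

```python
import pandas as pd

# Four rows taken from the table above (hfg column only)
water = pd.DataFrame(
    {"hfg": [2500.9, 2489.1, 2417.9, 2406.0]},
    index=pd.Index([0.01, 5.0, 35.0, 40.0], name="Temp"),
)

by_slice = water.loc[:38]              # 38 need not be an existing label
by_mask = water[water.index <= 38]     # works even if the index is unsorted

print(by_slice)
```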


In [71]:
# Write Python code to find entropy data (columns sf, sfg and sg) 
# relating to temperatures in the range of 100 - 150 degrees celsius
# Hint: use loc slicing for both the row and column labels
water_properties.loc[100:150, 'sf':'sg']

Temp,sf,sfg,sg
100.0,1.3072,6.0469,7.3541
110.0,1.4188,5.8193,7.2381
120.0,1.5279,5.6012,7.1291
130.0,1.6346,5.3918,7.0264
140.0,1.7392,5.1901,6.9293
150.0,1.8418,4.9953,6.8371


# Example 3: Predictive Maintenance

<h3>Data Attribution</h3>
<table style="border-style:solid; margin-left:0">
        <tr><td>Contributor:</td><td>Stephan Matzka, School of Engineering - Technology and Life, Hochschule für Technik und Wirtschaft Berlin, 12459 Berlin, Germany</td></tr>
        <tr><td>Data source:</td><td><a href="https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset">https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset</a></td></tr>
        <tr><td>Local data file:</td><td><a href="ai4i2020.csv">ai4i2020.csv</a>  </td></tr>
        <tr><td colspan=2>If you share this data, you must preserve this attribution.</td></tr>
</table>      
<p>Make sure you have uploaded file <code>ai4i2020.csv</code> to the same folder as this Notebook.</p>

In [72]:
# Read a local csv file containing predictive maintenance data and store in a data frame.  
# Use the first column (UDI) as the row index.
maintenance_data = pd.read_csv('ai4i2020.csv', index_col = ['UDI'])
maintenance_data

UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TWF,HDF,PWF,OSF,RNF
1,M14860,M,298.1,308.6,1551,42.8,0,0,0,0,0,0,0
2,L47181,L,298.2,308.7,1408,46.3,3,0,0,0,0,0,0
3,L47182,L,298.1,308.5,1498,49.4,5,0,0,0,0,0,0
4,L47183,L,298.2,308.6,1433,39.5,7,0,0,0,0,0,0
5,L47184,L,298.2,308.7,1408,40.0,9,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9996,M24855,M,298.8,308.4,1604,29.5,14,0,0,0,0,0,0
9997,H39410,H,298.9,308.4,1632,31.8,17,0,0,0,0,0,0
9998,M24857,M,299.0,308.6,1645,33.4,22,0,0,0,0,0,0
9999,H39412,H,299.0,308.7,1408,48.5,25,0,0,0,0,0,0


In [73]:
# Find the rows for products that have failed because of tool wear (column TWF)
TWF_failure = maintenance_data['TWF'] > 0
TWF_failure_data = maintenance_data[TWF_failure]
TWF_failure_data

UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TWF,HDF,PWF,OSF,RNF
78,L47257,L,298.8,308.9,1455,41.3,208,1,1,0,0,0,0
1088,H30501,H,296.9,307.8,1549,35.8,206,1,1,0,0,0,0
1510,L48689,L,298.0,308.5,1429,37.7,220,1,1,0,0,0,0
1683,H31096,H,297.9,307.4,1604,36.1,225,1,1,0,0,0,0
1764,L48943,L,298.2,307.6,1511,31.0,209,1,1,0,0,0,0
1997,M16856,M,298.4,308.0,1416,38.2,198,1,1,0,0,0,0
2167,M17026,M,299.6,309.2,1867,23.4,225,1,1,0,0,0,0
2245,M17104,M,299.3,308.4,1542,37.5,203,1,1,0,0,0,0
2672,M17531,M,299.7,309.3,1399,41.9,221,1,1,0,0,0,0
2865,H32278,H,300.6,309.4,1380,47.6,246,1,1,0,0,0,0


In [80]:
# How many rows are in those tool wear results?
TWF_failure_data.shape # 46 rows

(46, 13)
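
The row count can also be read off the Boolean mask itself, since `True` counts as 1 when summed. This is a sketch on a hypothetical five-row frame, not the real maintenance data.

```python
import pandas as pd

# Hypothetical sample: three of the five rows have a tool wear failure
sample = pd.DataFrame(
    {"Type": ["L", "M", "H", "L", "M"], "TWF": [0, 1, 0, 1, 1]}
)

mask = sample["TWF"] > 0
n_rows, n_cols = sample[mask].shape   # shape is a (rows, columns) tuple
n_failures = int(mask.sum())          # True values count as 1

print(n_rows, n_failures)
```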

In [82]:
# What is the average rotational speed of all products?
maintenance_data['Rotational speed [rpm]'].mean()

1538.7761

In [83]:
# Find all the products that have had both heat dissipation failure (HDF) and overstrain failure (OSF)
HD_failure = maintenance_data['HDF'] > 0
OS_failure = maintenance_data['OSF'] > 0
maintenance_data[HD_failure & OS_failure]

UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TWF,HDF,PWF,OSF,RNF
4371,L51550,L,302.0,309.9,1308,57.6,197,1,0,1,0,1,0
4384,L51563,L,301.7,309.5,1298,65.5,229,1,0,1,0,1,0
4463,L51642,L,302.7,310.5,1263,67.8,197,1,0,1,0,1,0
4643,L51822,L,303.2,311.4,1238,54.6,226,1,0,1,0,1,0
4644,L51823,L,303.2,311.4,1324,54.2,228,1,0,1,0,1,0
4730,L51909,L,303.4,311.8,1306,61.0,215,1,0,1,0,1,0


<b>Change Data Frame and Write to File</b>

In [86]:
# A corrupted version of the Ai4i2020 Dataset has been supplied to you that needs to be modified.  
# Write Python code to create a dataframe from the supplied file: "ai4i2020_DODGEY data.csv"
corrupted_data = pd.read_csv('ai4i2020_DODGEY data.csv', index_col = 'UDI', header = 1) # header=1 takes the column names from the 2nd row of the file
corrupted_data

UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TWF,HDF,PWF,OSF,RNF
1,M14860,M,298.1,308.6,1551,42.8,0,0,0,0,0,0,0
2,L47181,L,298.2,308.7,1408,46.3,3,0,0,0,0,0,0
3,L47182,L,298.1,308.5,1498,49.4,5,0,0,0,0,0,0
4,L47183,L,298.2,308.6,1433,39.5,7,0,0,0,0,0,0
5,L47184,L,298.2,308.7,1408,40.0,9,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9996,M24855,M,298.8,308.4,1604,29.5,14,0,0,0,0,0,0
9997,H39410,H,298.9,308.4,1632,31.8,17,0,0,0,0,0,0
9998,M24857,M,299.0,308.6,1645,33.4,22,0,0,0,0,0,0
9999,H39412,H,299.0,308.7,1408,48.5,25,0,0,0,0,0,0


In [88]:
# Make the required change to the dataframe data:
#     The air temperature of UDI 7 should be 298.1 (not 2298.1).
# Hint: use a loc expression to refer to the cell that needs to be changed
corrupted_data.loc[7, 'Air temperature [K]'] = 298.1

In [89]:
# How do we check that it worked?
corrupted_data.loc[7, 'Air temperature [K]']

298.1

In [90]:
# Save the modified data to a file with the name: "ai4i2020_MODIFIED data.csv"
corrupted_data.to_csv('ai4i2020_MODIFIED data.csv')
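
One way to confirm that a saved file holds the corrected value is to read it back. The sketch below does this round trip on a hypothetical two-row frame. The file name `modified_sample.csv` and the temporary folder are placeholders for illustration only.

```python
import os
import tempfile
import pandas as pd

# Hypothetical two-row frame with one corrupted temperature
data = pd.DataFrame(
    {"Air temperature [K]": [298.1, 2298.1]},
    index=pd.Index([6, 7], name="UDI"),
)
data.loc[7, "Air temperature [K]"] = 298.1   # fix the single bad cell

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "modified_sample.csv")
    data.to_csv(path)                              # the index is written as the first column
    reloaded = pd.read_csv(path, index_col="UDI")  # read it back the same way

print(reloaded.loc[7, "Air temperature [K]"])
```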

# Example 4: Oscilloscope Data

<h3>Data Attribution</h3>
<table style="border-style:solid; margin-left:0">
        <tr><td>Contributor:</td><td>Xitong Gao</td></tr>
        <tr><td>Data source:</td><td><a href="https://github.com/admk/Tektronix-Waveform-Converter/blob/master/Test%20Samples/TEK0002.CSV">https://github.com/admk/Tektronix-Waveform-Converter/blob/master/Test%20Samples/TEK0002.CSV</a></td></tr>
        <tr><td>Local data file:</td><td><a href="TEK0002.csv">TEK0002.csv</a>  </td></tr>
        <tr><td colspan=2>If you share this data, you must preserve this attribution.</td></tr>
</table>      
<p>Make sure you have uploaded file <code>TEK0002.csv</code> to the same folder as this Notebook.</p>

In [92]:
# Write Python code to read the file: TEK0002.csv 
scope_data = pd.read_csv('TEK0002.csv')
scope_data

Unnamed: 0,Record Length,2.50E+03,Unnamed: 2,-0.000025,-0.206
0,Sample Interval,2.00E-08,,-0.000025,-0.204
1,Trigger Point,1.25E+03,,-0.000025,-0.204
2,Source,CH1,,-0.000025,-0.198
3,Vertical Units,V,,-0.000025,-0.198
4,Vertical Scale,5.00E-02,,-0.000025,-0.200
...,...,...,...,...,...
2494,,,,0.000025,0.198
2495,,,,0.000025,0.202
2496,,,,0.000025,0.202
2497,,,,0.000025,0.208


What's wrong with the data in this DataFrame?

Open the raw csv file to better understand the problem: <a href="TEK0002.csv">TEK0002.csv</a>

In [93]:
# The first row of the csv file was treated as the header row, but it actually contained data and there are no column names provided.
# Read the data again, but add parameter header=None to tell pandas that there is no header row.
scope_data = pd.read_csv('TEK0002.csv', header = None)
scope_data

Unnamed: 0,0,1,2,3,4
0,Record Length,2.50E+03,,-0.000025,-0.206
1,Sample Interval,2.00E-08,,-0.000025,-0.204
2,Trigger Point,1.25E+03,,-0.000025,-0.204
3,Source,CH1,,-0.000025,-0.198
4,Vertical Units,V,,-0.000025,-0.198
...,...,...,...,...,...
2495,,,,0.000025,0.198
2496,,,,0.000025,0.202
2497,,,,0.000025,0.202
2498,,,,0.000025,0.208
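
The effect of `header=None` is easy to see on two lines of text. In this sketch, `raw` holds the first two rows displayed above. Reading it with the default settings uses the first line as column names, while `header=None` keeps it as data.

```python
import io
import pandas as pd

# The first two rows shown above, as raw CSV text
raw = (
    "Record Length,2.50E+03,,-0.000025,-0.206\n"
    "Sample Interval,2.00E-08,,-0.000025,-0.204\n"
)

with_header = pd.read_csv(io.StringIO(raw))                 # first line becomes the header
without_header = pd.read_csv(io.StringIO(raw), header=None) # columns are numbered instead

print(list(with_header.columns))
print(list(without_header.columns))
```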


In [95]:
# The actual oscilloscope data is in columns 3 and 4, while columns 0 and 1 contain metadata.
# Start by retrieving just the metadata from columns 0 and 1 and rows 0 to 14
metadata = scope_data.loc[:14, :1]
metadata

Unnamed: 0,0,1
0,Record Length,2.50E+03
1,Sample Interval,2.00E-08
2,Trigger Point,1.25E+03
3,Source,CH1
4,Vertical Units,V
5,Vertical Scale,5.00E-02
6,Vertical Offset,0.00E+00
7,Horizontal Units,s
8,Horizontal Scale,5.00E-06
9,Pt Fmt,Y


In [99]:
# Next we will extract the actual oscilloscope data in columns 3 and 4 and change the column names to time and amplitude
# Hint: use the rename method and pass a dictionary mapping old column names to new column names.
# e.g. dataframe.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
scope_data = scope_data.loc[:, 3:]
scope_data = scope_data.rename(columns = {3: 'time', 4: 'amplitude'})
scope_data

Unnamed: 0,time,amplitude
0,-0.000025,-0.206
1,-0.000025,-0.204
2,-0.000025,-0.204
3,-0.000025,-0.198
4,-0.000025,-0.198
...,...,...
2495,0.000025,0.198
2496,0.000025,0.202
2497,0.000025,0.202
2498,0.000025,0.208


In [106]:
# Write code to get a sample of the data by selecting every 100th reading.
# Hint: use slicing, specifying the 'step' argument
scope_data.iloc[::100]

Unnamed: 0,time,amplitude
0,-2.5e-05,-0.206
100,-2.3e-05,-0.128
200,-2.1e-05,-0.122
300,-1.9e-05,-0.128
400,-1.7e-05,-0.124
500,-1.5e-05,-0.126
600,-1.3e-05,-0.126
700,-1.1e-05,-0.032
800,-9e-06,-0.02
900,-7e-06,-0.016
