# Lecture 8 - Pandas Fundamentals

This week you will be introduced to pandas, a library for processing large datasets simply and efficiently. Import pandas by running the Python cell below.

In [1]:
import pandas as pd

## Importing Data
For this lecture we will be analysing the 'LSTM-Multivariate_pollution.csv' dataset. The <code>read_csv</code> function from pandas is called below to import the data, with <code>index_col='date'</code> setting the <code>date</code> column as the row index.

In [4]:
# import the data using the read_csv function
pollution_data = pd.read_csv('Data/LSTM-Multivariate_pollution.csv', index_col='date')
pollution_data

date,pollution,dew,temp,press,wnd_dir,wnd_spd,snow,rain
2/01/2010 0:00,129,-16,-4.0,1020.0,SE,1.79,0,0
2/01/2010 1:00,148,-15,-4.0,1020.0,SE,2.68,0,0
2/01/2010 2:00,159,-11,-5.0,1021.0,SE,3.57,0,0
2/01/2010 3:00,181,-7,-5.0,1022.0,SE,5.36,1,0
2/01/2010 4:00,138,-7,-5.0,1022.0,SE,6.25,2,0
...,...,...,...,...,...,...,...,...
31/12/2014 19:00,8,-23,-2.0,1034.0,NW,231.97,0,0
31/12/2014 20:00,10,-22,-3.0,1034.0,NW,237.78,0,0
31/12/2014 21:00,10,-22,-3.0,1034.0,NW,242.70,0,0
31/12/2014 22:00,8,-22,-4.0,1034.0,NW,246.72,0,0
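
The effect of <code>index_col</code> can be sketched on a small in-memory CSV. The three rows below are copied from the output above; the names <code>csv_text</code> and <code>sample</code> are illustrative only and not part of the lecture code.

```python
import io
import pandas as pd

# Three rows in the same layout as the lecture file (illustrative sample only)
csv_text = """date,pollution,dew,temp,press,wnd_dir,wnd_spd,snow,rain
2/01/2010 0:00,129,-16,-4.0,1020.0,SE,1.79,0,0
2/01/2010 1:00,148,-15,-4.0,1020.0,SE,2.68,0,0
2/01/2010 2:00,159,-11,-5.0,1021.0,SE,3.57,0,0
"""

# index_col moves the 'date' column out of the columns and into the row index
sample = pd.read_csv(io.StringIO(csv_text), index_col="date")

print(sample.index.name)      # name of the row index
print(list(sample.columns))   # 'date' no longer appears among the columns
```

Because <code>date</code> becomes the index, rows can later be looked up by their timestamp label rather than by position.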


## Inspecting Data
Use data inspection tools to answer the following questions:
1. How many observations are in the data?
2. How many columns are in the data?
3. What are the column names?
4. What is the data type of each column?
5. What is the dataframe index? What data type does it have?
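
As a rough guide, the attributes below are one way to answer questions like these. The sketch uses a small hypothetical stand-in called <code>sample</code> rather than the full <code>pollution_data</code>, so the values it reports are not the answers for the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for pollution_data: same style of columns, three rows
sample = pd.DataFrame(
    {
        "pollution": [129, 148, 159],
        "temp": [-4.0, -4.0, -5.0],
        "wnd_dir": ["SE", "SE", "SE"],
    },
    index=pd.Index(
        ["2/01/2010 0:00", "2/01/2010 1:00", "2/01/2010 2:00"], name="date"
    ),
)

n_rows, n_cols = sample.shape         # (number of observations, number of columns)
column_names = list(sample.columns)   # column labels
column_types = sample.dtypes          # one data type per column
index_type = sample.index.dtype       # data type of the row labels

sample.info()                         # prints a combined summary of the above
```

The same attributes and methods can be called on <code>pollution_data</code> directly.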

In [None]:
# insert data inspection code

## Selecting Data
Data can be selected by label using <code>loc</code> or by integer position using <code>iloc</code>. Positions are zero-based, so the first row is at position 0.

General syntax for <code>loc</code> is <code>df_name.loc[row_label, col_label]</code>

General syntax for <code>iloc</code> is <code>df_name.iloc[row_index, col_index]</code>
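
The two selectors can be compared on a small hypothetical dataframe (<code>sample</code>, not the lecture data). The last part assumes the date labels are text in day/month/year order, as they appear in the output above; converting them with <code>pd.to_datetime</code> is one possible way to select a whole month, not necessarily the approach intended for the exercises.

```python
import pandas as pd

# Hypothetical stand-in for pollution_data with text date labels
sample = pd.DataFrame(
    {"pollution": [129, 148, 30], "temp": [-4.0, -4.0, 6.0]},
    index=pd.Index(
        ["2/01/2010 0:00", "2/01/2010 1:00", "15/03/2011 12:00"], name="date"
    ),
)

temp_column = sample.loc[:, "temp"]                    # all rows, one column by label
one_row = sample.loc["2/01/2010 1:00"]                 # one row by label
one_value = sample.loc["2/01/2010 1:00", "pollution"]  # one cell by row and column label
by_position = sample.iloc[0, 1]                        # first row, second column

# Assumption: labels are day/month/year text. Converting them to datetimes
# allows a whole month to be selected with a partial date string.
dated = sample.copy()
dated.index = pd.to_datetime(dated.index, format="%d/%m/%Y %H:%M")
march_2011 = dated.loc["2011-03"]
```

Because positions start at zero, the element in the 100th row and 4th column of a dataframe would be at <code>iloc[99, 3]</code>.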

In [29]:
# select the temp column

In [None]:
# select the row at 1/01/2012 0:00

In [None]:
# select the pollution at 5/03/2013 21:00

In [None]:
# select all observations from March 2011

In [None]:
# select the element in the 100th row and the 4th column

## Filtering Data
Data can be filtered using Boolean arrays just like you saw with NumPy arrays last week. This allows for data to be selected if a certain condition is True. 

The general syntax is <code>df_name[condition]</code>, which reads as: select every row of the dataframe where the condition is True.
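
A minimal sketch of Boolean filtering on a hypothetical dataframe (<code>sample</code>) is shown here. It assumes that a value greater than zero in the <code>snow</code> or <code>rain</code> column means snow or rain was recorded. Conditions are combined with <code>&</code> (and) and <code>|</code> (or), and each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical stand-in for pollution_data
sample = pd.DataFrame(
    {
        "temp": [-4.0, -5.0, 3.0, 1.0],
        "snow": [0, 2, 0, 0],
        "rain": [0, 0, 1, 0],
    }
)

# A comparison on a column gives a Boolean Series, one True/False per row
is_below_zero = sample["temp"] < 0

below_zero = sample[is_below_zero]
cold_no_snow = sample[(sample["temp"] < 0) & (sample["snow"] == 0)]
rain_or_snow = sample[(sample["rain"] > 0) | (sample["snow"] > 0)]
```

Python's <code>and</code> and <code>or</code> keywords do not work element-wise on a Series, which is why the symbol operators are used instead.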


In [None]:
# select all rows where the temperature is below zero

In [None]:
# select all the rows where there is snow

In [None]:
# select all rows that are below zero and don't have snow

In [None]:
# select all rows that have rain or snow

## Computed Columns
A computed column is a column that is calculated from other columns in the dataframe. These are generally calculated using element-wise operators (you should remember these from when you learned about arrays). If you need to define your own custom function for a computation, it can be applied to each element of a column by calling the column's <code>apply</code> method.
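
Both approaches can be sketched on a hypothetical dataframe (<code>sample</code>). The column names <code>press_atm</code> and <code>freezing</code> and the helper <code>describe_temp</code> are made up for illustration; the conversion uses 1 atm = 1013.25 millibars.

```python
import pandas as pd

# Hypothetical stand-in for pollution_data
sample = pd.DataFrame({"press": [1013.25, 1020.0], "temp": [-4.0, 3.0]})

# Element-wise arithmetic: every value in the column is divided by the constant
MILLIBARS_PER_ATMOSPHERE = 1013.25
sample["press_atm"] = sample["press"] / MILLIBARS_PER_ATMOSPHERE


def describe_temp(temp):
    """Hypothetical helper: label a single temperature value."""
    return "above" if temp > 0 else "at or below"


# apply calls describe_temp once for each value in the temp column
sample["freezing"] = sample["temp"].apply(describe_temp)
```

Element-wise operators are usually faster than <code>apply</code>, so <code>apply</code> is best kept for logic that cannot easily be written with operators.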

In [None]:
# create a new column that includes the pressure measured in atmospheres
# (note: the current pressure column is measured in millibars; 1 atm = 1013.25 millibars).

In [None]:
# create a new column that specifies whether the temperature is above or below freezing point (0 degrees)

## Aggregates
Aggregates are used to summarise information in a dataset. They condense many observations down to a single number to describe a property of that variable.

In pandas, these aggregates are available as methods on dataframes and columns, meaning the general syntax is <code>df_name.method_name(arguments)</code>.
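
A few common aggregate methods are sketched here on a hypothetical dataframe (<code>sample</code>), so the numbers are not results for the real dataset. The percentage calculation relies on True counting as 1 and False as 0, so the mean of a Boolean column is the fraction of True values.

```python
import pandas as pd

# Hypothetical stand-in for pollution_data
sample = pd.DataFrame(
    {"pollution": [129, 181, 138, 8], "temp": [-4.0, -5.0, 2.0, 1.0]},
    index=pd.Index(
        ["2/01/2010 0:00", "2/01/2010 3:00", "2/01/2010 4:00", "31/12/2014 19:00"],
        name="date",
    ),
)

mean_pollution = sample["pollution"].mean()   # average of the column
max_pollution = sample["pollution"].max()     # largest value in the column
time_of_max = sample["pollution"].idxmax()    # index label where the largest value occurs

# Fraction of rows where the condition is True, scaled to a percentage
pct_below_zero = (sample["temp"] < 0).mean() * 100
```

Calling an aggregate on the whole dataframe, for example <code>sample.max()</code>, returns one value per column instead of a single number.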

In [None]:
# find the average pollution

In [None]:
# find the max pollution. At what time did this occur?

In [None]:
# find the percentage of observations that have a temperature below zero