# RE2708 Lecture 2

## The PANDAS Module: Working with data

Dr. Cristian Badarinza

## Structure of this Lecture

- First part (1 hour): **Learning**

- Second part (30 minutes): **Reviewing** and **Debugging**

## Table of Contents

### Working with data

1. Loading the library
1. Understanding data frames
1. Reading data from a CSV file
1. Locating rows and columns
1. Cleaning
1. Grouping
1. Merging

## 1. Loading the library

First things first, we start by importing the Python Data Analysis Library (`PANDAS`):

In [None]:
import pandas as pd

Remember our discussion about **objects** in Python. The command above loads the PANDAS library and binds it to the name `pd`. You may wonder: Why should we bother doing that? Answer: Simply to make life easier. Every PANDAS function can now be called as `pd.something` instead of `pandas.something`.

## 2. Understanding data frames

To facilitate the kinds of computations and analyses that we are after, PANDAS works with table-like objects that it calls `DataFrames`: each row is an observation and each column is a variable.

What is a **DataFrame** object? Remember the **dictionary** type we discussed last week? A DataFrame can be built from a dictionary in which each key is a column name and each value is the list of entries in that column. PANDAS then formats the data set nicely as a table:

In [None]:
df = pd.DataFrame(data={'Phone type':['iPhone 8','Galaxy S8','Redmi'],
                        'Year of release':[2018,2017,2019],
                        'Current price':['$700','$800','$500'],
                        'Expected battery time': ['3 days','4 days','2 days']})

In [None]:
df

**Accessing single columns**

In [None]:
df['Phone type']
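
As a small illustration of what column access returns, here is a minimal sketch that reuses two columns of the phone table above. The names `phones`, `one_column` and `two_columns` are only chosen for this example. Single brackets with one name give back a `Series`, while a list of names gives back a smaller `DataFrame`:

```python
import pandas as pd

# Two columns of the phone table from above, rebuilt so this cell runs on its own
phones = pd.DataFrame({
    'Phone type': ['iPhone 8', 'Galaxy S8', 'Redmi'],
    'Year of release': [2018, 2017, 2019],
})

one_column = phones['Phone type']                        # one name: a Series
two_columns = phones[['Phone type', 'Year of release']]  # list of names: a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(phones.shape)  # (number of rows, number of columns)
```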

## 3. Reading data from a CSV file

Most often, data providers store their data sets in CSV files, i.e. files containing *comma-separated values*.

To read the content of a CSV file, PANDAS offers us the function `read_csv()`.

Let's use this function to load some data on HDB resale prices.

In [None]:
df = pd.read_csv('Data/hdb-transactions-2018.csv')
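
To see what `read_csv()` does with the text in a file, here is a minimal sketch that reads a few lines of CSV from a string instead of from disk. The rows and prices below are made up for illustration and are not taken from the HDB file; only the column names `town`, `flat_type` and `resale_price` follow the data set used in this lecture:

```python
import io
import pandas as pd

# Hypothetical file contents: the first line holds the column names
csv_text = """town,flat_type,resale_price
CLEMENTI,4 ROOM,520000
JURONG EAST,3 ROOM,310000
JURONG WEST,5 ROOM,480000
"""

# io.StringIO lets read_csv treat the string as if it were a file
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # the type PANDAS inferred for each column
```

The first line of the text becomes the column names, and PANDAS infers that `resale_price` is numerical.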

### Viewing data frames

The function `head()` shows the **first** 5 rows on the screen, nicely formatted:

In [None]:
df.head()

The function `tail()` shows the **last** 5 rows on the screen:

In [None]:
df.tail()
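
Both functions also accept the number of rows to show. A minimal sketch on a toy data frame (the name `numbers` and its contents are made up for this example):

```python
import pandas as pd

# Toy data frame with ten rows, holding the values 0 to 9
numbers = pd.DataFrame({'value': range(10)})

first_three = numbers.head(3)  # first 3 rows instead of the default 5
last_two = numbers.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```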

### Summary statistics

First of all, let's get some quick summary statistics for our data set.

The function `describe()` shows us the number of observations, the mean of each variable, the minimum, maximum and other statistics:

In [None]:
df.describe()

**Note**: The function `round()` rounds the numbers and the attribute `T` transposes the table (note that `T` is written without parentheses). Together they make the table look nicer. Try it out in the cell above: `df.describe().round().T`

How about the other variables? By default, `describe()` only covers numerical columns, since statistics such as the mean are not defined for text. Instead, we can ask Python to show us their unique values:

In [None]:
print(df['town'].unique())
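
Beyond listing the unique values, we may also want to know how many there are and how often each one occurs. A minimal sketch on a made-up `town` column, using the functions `nunique()` and `value_counts()`:

```python
import pandas as pd

# Made-up column of town names, for illustration only
towns = pd.Series(
    ['CLEMENTI', 'JURONG EAST', 'CLEMENTI', 'JURONG WEST', 'CLEMENTI'],
    name='town',
)

n_towns = towns.nunique()      # how many distinct values
counts = towns.value_counts()  # occurrences of each value, most frequent first

print(n_towns)
print(counts)
```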

## 4. Locating rows and columns

What if we want to locate certain rows and certain columns?

In [None]:
df.loc[[1,2,3],['town','resale_price']]

What if we want to select transactions that meet a certain condition?

In [None]:
df.loc[df['town']=='CLEMENTI',['town','resale_price']].describe()

What if we want to consider more than one town? We can filter with the `isin()` method:

In [None]:
df.loc[df['town'].isin(['JURONG EAST', 'JURONG WEST']),['town','resale_price']].describe()
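
Conditions can also be combined. Here is a minimal sketch on made-up data, assuming the same column names as in the HDB file: `&` means "and", `|` means "or", and `~` means "not". Each condition needs its own parentheses when written inline, so it is often easier to store the conditions in variables first:

```python
import pandas as pd

# Made-up transactions, for illustration only
flats = pd.DataFrame({
    'town': ['CLEMENTI', 'CLEMENTI', 'JURONG EAST', 'JURONG WEST'],
    'flat_type': ['3 ROOM', '4 ROOM', '4 ROOM', '5 ROOM'],
    'resale_price': [350000, 520000, 310000, 480000],
})

is_clementi = flats['town'] == 'CLEMENTI'
is_expensive = flats['resale_price'] > 400000

both = flats.loc[is_clementi & is_expensive, ['town', 'resale_price']]    # and
either = flats.loc[is_clementi | is_expensive, ['town', 'resale_price']]  # or
not_clementi = flats.loc[~is_clementi, ['town', 'resale_price']]          # not

print(both)
print(either)
print(not_clementi)
```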

## 5. Cleaning

Publicly available data often contains outliers or missing observations, or it simply covers more than we need for our analysis.

If we want to drop every row that has a missing value in any column, we can use the function `dropna()`:

In [None]:
df = df.dropna()

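If we want to drop transactions where the price is obviously wrong, e.g. negative or zero, we keep only the rows with a positive price, using the locate (`loc`) indexer. Note that `loc` is followed by square brackets, not parentheses: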

In [None]:
df = df.loc[df['resale_price']>0]
df.describe()
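
Before dropping anything, it can help to count the missing values, and to drop rows only when a column we actually need is missing. A minimal sketch on made-up data, using `isna()` and the `subset` argument of `dropna()`:

```python
import pandas as pd

# Made-up data with one missing town, one missing price and one invalid price
raw = pd.DataFrame({
    'town': ['CLEMENTI', 'JURONG EAST', None, 'JURONG WEST'],
    'resale_price': [520000, float('nan'), 410000, -1],
})

missing_per_column = raw.isna().sum()          # missing values in each column
priced = raw.dropna(subset=['resale_price'])   # drop rows only if the price is missing
clean = priced.loc[priced['resale_price'] > 0] # then drop non-positive prices

print(missing_per_column)
print(len(raw), len(priced), len(clean))
```

The row with the missing town is kept here, because only `resale_price` is listed in `subset`.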

## 6. Grouping

Grouping is by far the most frequent operation that we want to do with data.

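The function that PANDAS offers to do this is called `groupby`, and it is used together with functions such as `max`, `mean` or `count`. We pass `numeric_only=True` so that text columns such as `flat_type` are left out; recent PANDAS versions otherwise raise an error when asked to average text:

In [None]:
df.groupby('town').mean(numeric_only=True)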

What if we want to sort towns in increasing order of prices?

In [None]:
df.groupby('town').mean(numeric_only=True).sort_values(by='resale_price')
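
Often we only care about one variable per group, and want several statistics at once. A minimal sketch on made-up data: we select the column `resale_price` after `groupby` and pass a list of function names to `agg()`:

```python
import pandas as pd

# Made-up transactions, for illustration only
flats = pd.DataFrame({
    'town': ['CLEMENTI', 'CLEMENTI', 'JURONG EAST', 'JURONG WEST', 'JURONG WEST'],
    'resale_price': [400000, 600000, 300000, 450000, 650000],
})

summary = (
    flats.groupby('town')['resale_price']  # one column per group
         .agg(['count', 'mean', 'max'])    # several statistics at once
         .sort_values(by='mean')           # cheapest town first
)

print(summary)
```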

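How about some more complicated overviews? For these we can use the function `pivot_table`, which by default reports the mean of `values` for each combination of `index` and `columns`: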

In [None]:
pd.pivot_table(df, values='resale_price', index=['town'], columns=['flat_type']).round()
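
The statistic shown in each cell can be changed with the `aggfunc` argument. A minimal sketch on made-up data, comparing the mean with a simple count of transactions:

```python
import pandas as pd

# Made-up transactions: JURONG EAST has no 4 ROOM flat in this toy data
flats = pd.DataFrame({
    'town': ['CLEMENTI', 'CLEMENTI', 'CLEMENTI', 'JURONG EAST'],
    'flat_type': ['3 ROOM', '3 ROOM', '4 ROOM', '3 ROOM'],
    'resale_price': [300000, 340000, 520000, 310000],
})

means = pd.pivot_table(flats, values='resale_price', index='town',
                       columns='flat_type', aggfunc='mean')
counts = pd.pivot_table(flats, values='resale_price', index='town',
                        columns='flat_type', aggfunc='count')

print(means)
print(counts)
```

Combinations that never occur in the data show up as `NaN` (missing), which is worth keeping in mind when reading the table.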

## 7. Merging

Finally, we often need to merge data from different sources. For example, we may want to label each town as belonging to a certain region of Singapore. 

For this purpose, let's read in an additional data set:

In [None]:
dreg = pd.read_csv('Data/regions.csv')

In [None]:
dreg

Now, let's merge the two datasets:

In [None]:
df2 = pd.merge(df, dreg, on='town')

In [None]:
df2.head()
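
By default, `pd.merge` keeps only the rows whose `town` appears in both data sets, so transactions in a town without a region label would silently disappear. A minimal sketch on made-up data showing how to check for this, using the arguments `how='left'` and `indicator=True`. The two small tables below are hypothetical stand-ins for the transactions and the regions file:

```python
import pandas as pd

# Hypothetical transactions and region labels; PUNGGOL has no label here
flats = pd.DataFrame({
    'town': ['CLEMENTI', 'JURONG EAST', 'PUNGGOL'],
    'resale_price': [520000, 310000, 450000],
})
regions = pd.DataFrame({
    'town': ['CLEMENTI', 'JURONG EAST'],
    'region': ['West', 'West'],
})

inner = pd.merge(flats, regions, on='town')  # default: only matching towns

# Keep every transaction and add a column '_merge' that records the match
left = pd.merge(flats, regions, on='town', how='left', indicator=True)
unmatched = left.loc[left['_merge'] == 'left_only', 'town'].tolist()

print(len(flats), len(inner), len(left))
print(unmatched)
```

If the list of unmatched towns is not empty, the spelling of the town names in the two files is usually the first thing to check.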

... and build a table that tells us how prices for different flat types vary by region:

In [None]:
pd.pivot_table(df2, values='resale_price', index=['region'], columns=['flat_type']).round()

### THE END