<div style="background-color:lightgrey;
            padding:10px;
            color:black;
            border:black dashed 2px; 
            border-radius:5px;
            margin: 20px 0;">
            
            
# Pandas Data Analysis



**Staff:** Loren Verreyen <br/>
**Support Material:** [exercises](https://github.com/dtaantwerp/dtaantwerp.github.io/blob/DTA_Bootcamp_2021_students/exercises/Questions_2023/12_EX_Pandas.ipynb) <br/>
**Support Sessions:**  Thursday, October 12, 10:30AM

<h2 style="color:purple">Datasets</h2>

- <a style="font-size:120%;color:blue" href="https://raw.githubusercontent.com/dtaantwerp/dtaantwerp.github.io/master--/data/titanic.csv">titanic.csv</a>
- <a style="font-size:120%;color:blue" href="https://raw.githubusercontent.com/dtaantwerp/dtaantwerp.github.io/master/data/311-service-requests.csv">311-service-requests.csv</a>
- <a style="font-size:120%;color:blue" href="https://raw.githubusercontent.com/dtaantwerp/dtaantwerp.github.io/master/data/bikes.csv">bikes.csv</a>
</div>

<div style="background-color:lightgrey;
            padding:10px;
            color:black;
            border:lightgrey solid 2px; 
            border-radius:5px;
            margin: 20px 0;
            text-align:center">
  
# PART 1 : Introduction to Pandas 

</div>

## An introduction to Pandas


#### Learning Objectives
- Understand what Pandas is used for
- Be able to implement the fundamental components of Pandas
- Be familiar with the Pandas approach


#### Programme
- What is Pandas?
- Why would I use it?
- How do I use Pandas?


In [None]:
%matplotlib inline 
# this is magic (a "magic expression" that makes plots appear in the notebook)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## CSV Files (Comma-Separated Values)

CSV files store tabular data as plain text. You may know them from Microsoft Excel, which can open and export them. They store data like this:

```
column1,column2,column3
index1,0,1
index2,3,2
index3,6,3
```

The commas separate the values in the table (hence the name), and line breaks separate the rows.

In the above example, the first row and first column are used as a **header** to specify column names and the index. This is a best practice but not a necessity. 
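As a rough sketch of how that header row and index column are picked up, the toy table above can be read from a string instead of a file. The variable names `csv_text` and `toy` are made up for this illustration.

```python
import io
import pandas as pd

# The same toy table as above, held in a string instead of a file.
csv_text = """column1,column2,column3
index1,0,1
index2,3,2
index3,6,3
"""

# header=0 (the default) takes the column names from the first row;
# index_col=0 uses the first column as the row labels.
toy = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0)

print(toy)
print(toy.loc['index2', 'column2'])
```

Without `index_col=0`, the first column would be kept as an ordinary data column and the rows would simply be numbered.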

## Opening CSVs in Pandas

Pandas works with all major forms of tabular data. It can even import Excel spreadsheets. However, most of the time we work with CSV and JSON files.

In [None]:
df = pd.read_csv('../data/titanic.csv') 

In [None]:
df.head()

#### Some random examples:

In [None]:
df['Survived'].mean()

Example: mean survival rate per sex?

In [None]:
df.groupby('Sex')['Survived'].mean()

Plot that:

In [None]:
df.groupby('Sex')['Survived'].mean().plot(kind='bar');

Fare per class?

In [None]:
df.groupby('Pclass')['Fare'].mean().plot(kind='bar');

Survival rate per class?

In [None]:
df.groupby('Pclass')['Survived'].mean().plot(kind='bar');

Survival vs fare?

In [None]:
df.boxplot(column='Fare', by='Survived');

# The pandas data structures: `DataFrame` and `Series`

Pandas uses two main structures, a DataFrame and a Series. Understanding them will help you to use Pandas for data analysis.

In [None]:
df

## DataFrame

A `DataFrame` is a tabular data structure (multi-dimensional object to hold labeled data) comprised of rows and columns, like a spreadsheet.

The `DataFrame` is a container object, similar to a `dictionary`, but it can be accessed along two axes (rows and columns).

#### Characteristics  
- 2-dimensional data structure
- A table
- Similar to a spreadsheet
- An object type within Python

### Attributes of the DataFrame

Like dictionaries have `keys` and `values`, the contents of a DataFrame can be accessed using a set of attributes. The primary attributes for `DataFrames` are as follows:

In [None]:
df.index # row labels

Note that `index` in Pandas DataFrames refers to **rows**. The row labels above are consecutive integers, so pandas summarizes them as a range (`start=0, stop=891`, where `stop` is exclusive) instead of listing each one. That is different from the column labels, as you'll see below.

In [None]:
df.columns # column labels

The `values` attribute returns an array of arrays (an array is the `numpy` counterpart of a list) that contains the whole dataset, one inner array per row.

In [None]:
df.values

The `shape` attribute is very useful for getting a sense of the size of the dataset. The format is `(n_rows, n_columns)`.

In [None]:
df.shape

DataFrames can contain all kinds of different object types. Standard Python objects like `int` or `str` are put into object types specific to DataFrames. To check the data types of the different columns:

In [None]:
df.dtypes

An overview of that information can be given with the `info()` method:

In [None]:
df.info()

There is also the `describe()` method:

In [None]:
df.describe()
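To see these attributes and methods without scrolling through the full Titanic table, here is a small sketch on a made-up three-row frame (the name `small` and its contents are invented for illustration).

```python
import pandas as pd

small = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [1.5, 2.5, 3.5]})

print(small.index)       # row labels (a RangeIndex here)
print(small.columns)     # column labels
print(small.shape)       # (n_rows, n_columns)
print(small.dtypes)      # one dtype per column
print(small.describe())  # summary statistics, numeric columns only by default
```

Note that `describe()` leaves out the text column `name` unless you ask for it explicitly, e.g. with `include='all'`.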

### Building DataFrames from Scratch

#### With a list

Apart from importing your data from an external source (text files, Excel spreadsheets, databases, ..), it is also common to build dataframes from Python data structures like lists and dictionaries.

Note that with this method, each list represents a single observation or, in this case, a country. You could use other ordered objects as well, such as `tuples`: i.e. tuple of lists, list of tuples, tuple of tuples, list of lists.

In [None]:
data = [
    ['Belgium', 11.3, 30510, 'Brussels'],
    ['France', 64.3, 671308, 'Paris'],
    ['Germany', 81.3, 357050, 'Berlin'],
    ['Netherlands', 16.9, 41526, 'Amsterdam'],
    ['United Kingdom', 64.9, 244820, 'London']
]

headers = ['country', 'population', 'area', 'capital'] # column headers

df_countries = pd.DataFrame(data, columns=headers) # what happens if we don't say `columns=headers`?
df_countries
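To answer the question in the comment above, here is a minimal sketch with a shortened, two-row version of the data (the names `rows` and `unlabeled` are only for this example).

```python
import pandas as pd

rows = [
    ['Belgium', 11.3],
    ['France', 64.3],
]

# Without `columns=`, pandas falls back to integer column labels 0, 1, ...
unlabeled = pd.DataFrame(rows)
print(unlabeled.columns.tolist())

# Labels can still be assigned afterwards.
unlabeled.columns = ['country', 'population']
print(unlabeled)
```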

#### With a dictionary

Rather than feeding a list of lists as rows, we can provide a dictionary of columns. In this case, the keys of the dictionary are the column labels and the values are some kind of ordered iterable (e.g. `list`, `tuple`, `pandas.Series`). 

Note that the items for each iterable should be in the order of the other iterables. For example, $64.3$ in 'population' should correspond with 'France' in 'country'.

In [None]:
data = {
    'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
    'population': [11.3, 64.3, 81.3, 16.9, 64.9],
    'area': [30510, 671308, 357050, 41526, 244820],
    'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']
}
df_countries = pd.DataFrame(data)
df_countries

### One-dimensional data: `Series` (a column of a DataFrame)

A Series is a basic holder for **one-dimensional labeled data**, similar to a `list`, but possessing special methods and attributes for data analysis. A `DataFrame` consists of `Series` objects "glued" together. For instance, if we select the "Age" column below we'll see that the column *is* a `Series`.

#### Characteristics  
- 1 dimensional data structure
- Each **column** in a `DataFrame` is a `Series`
- Each **row** in a `DataFrame` is a `Series`

![series](https://pandas.pydata.org/docs/_images/01_table_series.svg)
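A quick way to check both claims is to pull one column and one row out of a tiny, made-up frame and inspect their types. The frame `people` is an assumption for this sketch, not part of the Titanic data.

```python
import pandas as pd

people = pd.DataFrame({'name': ['Ann', 'Bob'], 'age': [34, 58]})

column = people['age']   # one column
row = people.iloc[0]     # one row, selected by position

print(type(column).__name__, type(row).__name__)
print(row.index.tolist())  # the row's labels are the column names
```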

In [None]:
age = df['Age']
print(type(df['Age']))

In [None]:
age

The `Series` has a lot of useful methods. Here are some examples:

In [None]:
print('mean', age.mean())
print('max', age.max())
print('min', age.min())
print('sum', age.sum())

### Attributes of a Series: `index` and `values`

A Series also has an `index` and a `values` attribute, but no `columns`.

In [None]:
age.index

You can access the underlying numpy array representation with the `.values` attribute:

In [None]:
age.values[:10]

We can access series values via the index, just like for NumPy arrays:

In [None]:
age[0]

Unlike the NumPy array, though, this index can be something other than integers:

In [None]:
df = df.set_index('Name')
df

In [None]:
age = df['Age']
age

How is a `Series` different from a plain `numpy` array? `Series` can have a non-numeric index:

In [None]:
age['Dooley, Mr. Patrick']
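Once the index holds names, it helps to be explicit about whether you mean a label or a position. The sketch below uses a small Series with shortened, made-up labels; `.iloc` is the position-based accessor.

```python
import pandas as pd

ages = pd.Series(
    [22.0, 38.0, 26.0],
    index=['Braund', 'Cumings', 'Heikkinen'],  # shortened labels, for illustration only
    name='Age',
)

print(ages['Cumings'])  # lookup by label
print(ages.iloc[1])     # lookup by position, independent of the labels
```

As far as I know, recent pandas versions no longer encourage `age[0]` on a Series with a non-integer index, so `.iloc[0]` is the safer way to ask for "the first value".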

Many things you can do with numpy arrays can also be done with DataFrames and Series, e.g. element-wise operations:

In [None]:
age * 1000

A range of methods exists:

In [None]:
age.mean()

Fancy indexing, like indexing with a list or boolean indexing:

In [None]:
age[age > 70]

There are also a lot of pandas-specific methods, e.g.:

In [None]:
df['Embarked'].value_counts()

# *Class Exercise*
- How many women were on board? How many men?
- What were the names of the oldest and youngest passenger?
- What was the age distribution of the Titanic passengers?
- What was the maximum Fare that was paid? And the median?
- Calculate the average survival ratio for all passengers (note: the 'Survived' column indicates whether someone survived (1) or not (0)).
- **Can you think of a question yourself that we could answer on the basis of the dataset?**

# Selecting and filtering data

For a DataFrame, basic indexing selects the columns.

Selecting a single column:

In [None]:
df['Age']

or multiple columns:

In [None]:
df[['Age', 'Fare']]

But, slicing accesses the rows:

In [None]:
df[10:15]

### Systematic indexing with `loc` and `iloc`

When using `[]` like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:
    
* `loc`: selection by "label"; or rather: the index
* `iloc`: selection by position

Note that we have changed the index to the `Name` column:

In [None]:
df.head()

In [None]:
df.loc['Bonnell, Miss. Elizabeth', 'Fare']

In [None]:
df.loc['Bonnell, Miss. Elizabeth':'Andersson, Mr. Anders Johan', :]

Selecting by position with `iloc` works similarly to indexing numpy arrays:

In [None]:
df.iloc[0:2,1:3]
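One difference between the two is easy to miss: label slices with `loc` include the end label, while position slices with `iloc` exclude the end position, like ordinary Python slices. A small sketch on a made-up frame (the labels `'a'` to `'d'` are placeholders):

```python
import pandas as pd

fares = pd.DataFrame(
    {'Fare': [7.25, 71.28, 7.92, 53.10], 'Pclass': [3, 1, 3, 1]},
    index=['a', 'b', 'c', 'd'],
)

by_label = fares.loc['a':'c', 'Fare']  # label slice: 'c' is included
by_position = fares.iloc[0:2, 0]       # position slice: position 2 is excluded

print(len(by_label), len(by_position))
```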

The different indexing methods can also be used to assign data:

In [None]:
df.loc['Braund, Mr. Owen Harris', 'Survived'] = 100

In [None]:
df

### Boolean indexing (filtering)

Often, you want to select rows based on a certain condition. This can be done with 'boolean indexing', which works much like it does in numpy.

The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.

In [None]:
df['Fare'] > 50

In [None]:
df[df['Fare'] > 50]

<div style="background-color:lightgrey;
            padding:10px;
            color:black;
            border:lightgrey solid 2px; 
            border-radius:5px;
            margin: 20px 0;
            text-align:center">
  
# PART 2 : Solving Real Problems with Pandas 

</div>

In this part of the class we will work on group problems using a series of datasets.

In [None]:
# we will import a table of complaint call data from New York

# dtype=str reads every column as text
complaints = pd.read_csv('../data/311-service-requests.csv', dtype=str)
complaints.head()

# *Class Exercise: What's the most common complaint type?*

There's a `.value_counts()` method that we can use. If we just want the 10 most common complaint types, we can do this:

In [None]:
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]
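Here is a small sketch of what `value_counts()` returns, using a handful of invented labels rather than the real complaint types. The `normalize=True` option gives shares instead of raw counts.

```python
import pandas as pd

# Invented labels, for illustration only.
calls = pd.Series(['Noise', 'Heating', 'Noise', 'Parking', 'Noise', 'Heating'])

counts = calls.value_counts()                # sorted, most frequent first
shares = calls.value_counts(normalize=True)  # fractions instead of counts

print(counts.head(2))
print(shares.head(2))
```

Because the result is already sorted from most to least frequent, slicing with `[:10]` or calling `.head(10)` gives the top 10.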

But it gets better! We can plot them!

In [None]:
complaint_counts[:10].plot(kind='bar')

With just these few method calls, we turned a large table into a clear answer to a specific question.

## Selecting only noise complaints

I'd like to know which borough has the most noise complaints. First, we'll take a look at the data to see what it looks like:

In [None]:
complaints.head()

To get the noise complaints, we need to find the rows where the "Complaint Type" column is "Noise - Street/Sidewalk". I'll show you how to do that, and then explain what's going on.

In [None]:
noise_complaints = complaints[complaints['Complaint Type'] == "Noise - Street/Sidewalk"]
noise_complaints.head()

If you look at `noise_complaints`, you'll see that this worked, and it only contains complaints with the right complaint type. But how does this work? Let's deconstruct it into two pieces:

In [None]:
# we can use boolean indexing
complaints['Complaint Type'] == "Noise - Street/Sidewalk"

This is a big array of `True`s and `False`s, one for each row in our dataframe. When we index our dataframe with this array, we get just the rows where our boolean array evaluated to `True`.  It's important to note that for row filtering by a boolean array the length of our dataframe's index must be the same length as the boolean array used for filtering.

You can also combine more than one condition with the `&` operator like this:

In [None]:
# these are our boolean indices
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk" 
in_brooklyn = complaints['Borough'] == "BROOKLYN"
complaints[is_noise & in_brooklyn][:5]

Or if we just wanted a few columns:

In [None]:
complaints[is_noise & in_brooklyn][['Complaint Type', 'Borough', 'Created Date', 'Descriptor']][:10]
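Besides `&` (and), masks can be combined with `|` (or) and inverted with `~` (not), and `.loc` lets you filter rows and pick columns in one step. The sketch below uses a three-row toy table with invented rows; `toy_complaints` is an assumed name.

```python
import pandas as pd

# Invented rows that mimic two columns of the complaints table.
toy_complaints = pd.DataFrame({
    'Complaint Type': ['Noise - Street/Sidewalk', 'Illegal Parking', 'Noise - Street/Sidewalk'],
    'Borough': ['BROOKLYN', 'BROOKLYN', 'MANHATTAN'],
})

is_noise = toy_complaints['Complaint Type'] == 'Noise - Street/Sidewalk'
in_brooklyn = toy_complaints['Borough'] == 'BROOKLYN'

# Rows and columns selected in a single .loc call.
both = toy_complaints.loc[is_noise & in_brooklyn, ['Complaint Type', 'Borough']]

# Parentheses are needed when a comparison is written inline,
# because | and & bind more tightly than ==.
either = toy_complaints[(toy_complaints['Borough'] == 'BROOKLYN') | is_noise]
neither = toy_complaints[~(is_noise | in_brooklyn)]

print(len(both), len(either), len(neither))
```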

# *Class Exercise: So, which borough has the most noise complaints?* 


In [None]:
# answer here

(Hint: It's Manhattan!)

But what if we wanted to divide by the total number of complaints per borough, so that the numbers are comparable across boroughs? That would be easy too:

In [None]:
noise_complaint_counts = noise_complaints['Borough'].value_counts()
complaint_counts = complaints['Borough'].value_counts()

In [None]:
noise_complaint_counts / complaint_counts

In [None]:
(noise_complaint_counts / complaint_counts.astype(float)).plot(kind='bar');
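The division works because pandas lines up the two Series by their index labels, not by their order. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Made-up counts; note the different order and the extra borough in `total`.
noise = pd.Series({'MANHATTAN': 30, 'BROOKLYN': 10})
total = pd.Series({'BROOKLYN': 100, 'QUEENS': 50, 'MANHATTAN': 150})

ratio = noise / total  # matched by label, not by position
print(ratio)
```

A label that appears in only one of the two Series (here `QUEENS`) ends up as `NaN` in the result.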

# *Class Exercise: How many ride their bike per day?*



For this problem we'll need to group our dataframe into subsets. We will learn the `groupby` method.

First, we need to load up the data; in this case usage data about bike lanes in Montreal. We've done this before.

In [None]:
bikes = pd.read_csv('../data/bikes.csv', sep=';', encoding='latin1', parse_dates=['Date'], dayfirst=True, index_col='Date')
bikes.head()
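The extra arguments in that `read_csv` call each handle one quirk of the file. Here is a sketch of what they do, using two invented rows in a string instead of the real file (so `encoding` is left out, as it only matters when reading bytes from disk).

```python
import io
import pandas as pd

# Invented rows in the same layout: ';' separators and day-first dates.
raw = "Date;Berri 1\n01/02/2012;35\n02/02/2012;83\n"

toy_bikes = pd.read_csv(
    io.StringIO(raw),
    sep=';',               # fields are separated by semicolons, not commas
    parse_dates=['Date'],  # convert the Date column to datetimes
    dayfirst=True,         # read 01/02/2012 as 1 February, not 2 January
    index_col='Date',      # use the dates as row labels
)

print(toy_bikes.index)
```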

Next up, we're just going to look at the Berri bike path.

In [None]:
bikes['Berri 1'].plot()

We can also isolate this column:

In [None]:
berri_bikes = bikes[['Berri 1']].copy()

In [None]:
berri_bikes[:5]

Next, we need to add a 'weekday' column. First, we can get the weekday from the index. As in Part 1, the index is what's on the left of the above dataframe, under 'Date'. It's basically all the days of the year.

In [None]:
berri_bikes.index

You can see that actually some of the days are missing -- only 310 days of the year are actually there. Who knows why.

Pandas has a bunch of really great time series functionality, so if we wanted to get the day of the month for each row, we could do it like this:

In [None]:
berri_bikes.index.day

We actually want the weekday, though:

In [None]:
berri_bikes.index.weekday

These are the days of the week, where 0 is Monday. I found out that 0 was Monday by checking on a calendar.
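If you'd rather not reach for a calendar, a date index can also report the day names directly. A small sketch on one week of generated dates (not the bike data itself):

```python
import pandas as pd

# One week of dates, starting on 1 January 2012.
days = pd.date_range('2012-01-01', periods=7, freq='D')

print(days.weekday.tolist())     # 0 = Monday ... 6 = Sunday
print(days.day_name().tolist())  # the same information as names
```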

Now that we know how to *get* the weekday, we can add it as a column in our dataframe like this:

In [None]:
berri_bikes.loc[:,'weekday'] = berri_bikes.index.weekday
berri_bikes[:5]

# Adding up the cyclists by weekday

This turns out to be really easy!

Dataframes have a `.groupby()` method that is similar to SQL's `GROUP BY` or Excel pivot tables, if you're familiar with those. I'm not going to explain more about it right now -- if you want to know more, [the documentation](http://pandas.pydata.org/pandas-docs/stable/groupby.html) is really good.

In this case, `berri_bikes.groupby('weekday').aggregate('sum')` means "Group the rows by weekday and then add up all the values with the same weekday".

In [None]:
# the string 'sum' selects pandas' own sum; recent pandas versions warn about the builtin `sum`
weekday_counts = berri_bikes.groupby('weekday').aggregate('sum')
weekday_counts
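To make the "group, then add up" idea concrete, here is a sketch on four invented rows. It uses the `.sum()` and `.mean()` shorthands, which as far as I know are equivalent to calling `aggregate` with the corresponding name.

```python
import pandas as pd

# Invented counts: two Mondays (0) and two Tuesdays (1).
counts = pd.DataFrame({
    'weekday': [0, 1, 0, 1],
    'Berri 1': [10, 20, 30, 40],
})

totals = counts.groupby('weekday').sum()              # one row per weekday
means = counts.groupby('weekday')['Berri 1'].mean()   # a Series, one value per weekday

print(totals)
print(means)
```

The grouping column becomes the index of the result, which is why the weekday numbers show up on the left.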

It's hard to remember what 0, 1, 2, 3, 4, 5, 6 mean, so we can fix it up and graph it:

In [None]:
weekday_counts.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekday_counts

In [None]:
weekday_counts.plot(kind='bar')