<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Pandas for Exploratory Data Analysis

---

# Agenda

#### Housekeeping
#### Pandas for Exploratory Data Analysis
#### BREAK
#### Coding Practice

# Housekeeping

- Unit Project 1 deadline has expired. No exceptions.
- Grades to be returned to you ASAP.
- Solutions will be posted tomorrow for UP1.
- Unit Project 2 will be given to you on Monday 9th July. It is due by end of class on Monday 16th July.

## Learning Objectives

- Discuss the different types of data.
- Define what Pandas is and how it relates to data science.
- Manipulate Pandas `DataFrames` and `Series`.
- Filter and sort data using Pandas.
- Manipulate `DataFrame` columns.
- Know how to handle null and missing values.

# Types of Data

## Discrete vs. Continuous

### Nominal

Categorical data where ordering **does not matter**.

Examples:

- countries
- football teams
- colours

### Ordinal

Categorical data where ordering **does** matter, but **distances between items are not necessarily equal**.

Examples:

- t-shirt sizes (S, M, L)
- classes on trains (first, second)
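
As a rough illustration of ordinal data, the sketch below uses pandas (introduced later in this lesson) to declare an ordered category. The t-shirt orders and the variable names are made up for the example; only the S < M < L ordering comes from the list above.

```python
import pandas as pd

# Hypothetical t-shirt orders; the values are invented for illustration.
size_type = pd.CategoricalDtype(categories=["S", "M", "L"], ordered=True)
sizes = pd.Series(["M", "L", "S", "M"], dtype=size_type)

# Sorting follows the declared order (S < M < L), not alphabetical order.
sorted_sizes = sizes.sort_values().tolist()

# Comparisons are allowed because the categories are ordered.
larger_than_small = (sizes > "S").tolist()

print(sorted_sizes)
print(larger_than_small)
```

Note that the ordering says nothing about how much bigger an L is than an M, which is what separates ordinal data from interval data.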

### Interval

Data where **distances are meaningful**, but **zero doesn't mean an absence**.

Examples:

- temperature in Celsius
    - difference between 2-4C is the same as between 107-109C
    - 0 doesn't mean "no temperature"
- interval data can also be discrete, e.g. a Likert scale (1-5 ratings) **could** be argued to be interval data

### Ratio

Data where **distances are meaningful**, and **zero means a true absence**.

Examples:

- temperature in Kelvin
    - there is nothing below 0 Kelvin, so it is an "absence of temperature"
- height of a building
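
The difference between interval and ratio data can be checked with a little arithmetic. This is a minimal sketch with two made-up temperature readings: differences make sense in Celsius, but a ratio only becomes meaningful once the values are on the Kelvin scale.

```python
# Two hypothetical temperature readings, in degrees Celsius.
cold_c, warm_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = warm_c - cold_c

# Ratios are not: 20 C is not "twice as hot" as 10 C.
naive_ratio = warm_c / cold_c

# Convert to Kelvin (a ratio scale) to get a meaningful ratio.
cold_k, warm_k = cold_c + 273.15, warm_c + 273.15
true_ratio = warm_k / cold_k

print(difference)
print(naive_ratio)
print(round(true_ratio, 3))
```

The Celsius ratio comes out as 2.0, while the Kelvin ratio is only slightly above 1, because Celsius places its zero at an arbitrary point.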

## What is Pandas?

- `pandas` is a data analysis library for Python

- developed by Wes McKinney

- the name is derived from "panel data"

### How does pandas help data analysis?

- fast

- lots of useful functionality

- large open source community

### How does it compare to "raw" Python?

Pandas introduces two important new data types: `Series` and `DataFrame`

`Series`

A `Series` is a sequence of items, where each item has a label. Together the labels form the `index`. Labels are usually unique, although Pandas does not require this.

`DataFrame`

A `DataFrame` is a table of data. Each row has a label (in the `row index`), and each column has a label (in the `column index`). As with a `Series`, labels are usually unique but are not required to be.

Think of `DataFrame` as a table and `Series` as a column or a row.
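
A small sketch of both types, using invented names and numbers, may make the relationship concrete. It builds a `Series` and a `DataFrame` by hand, then shows that pulling a single column or row out of the `DataFrame` gives back a `Series`.

```python
import pandas as pd

# A Series: values plus an index of labels (hypothetical data).
heights = pd.Series([1.65, 1.80, 1.75], index=["ann", "bob", "cat"])

# A DataFrame: a table with a row index and a column index.
people = pd.DataFrame(
    {"height_m": [1.65, 1.80, 1.75], "age": [31, 45, 27]},
    index=["ann", "bob", "cat"],
)

# Selecting one column or one row of a DataFrame gives back a Series.
age_column = people["age"]
bob_row = people.loc["bob"]

print(heights["bob"])
print(type(age_column).__name__, type(bob_row).__name__)
```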

> Behind the scenes, these datatypes use the NumPy ("Numerical Python") library. NumPy primarily adds the `ndarray` (n-dimensional array) datatype to Python. An `ndarray` is similar to a Python list in that it stores ordered data. However, it differs in three respects:
> - Each element has the same datatype (typically fixed-size, e.g., a 32-bit integer).
> - Elements are stored contiguously (immediately after each other) in memory for fast retrieval.
> - The total size of an `ndarray` is fixed.

> Storing `Series` and `DataFrame` data in `ndarray`s makes Pandas faster and more memory-efficient than standard Python datatypes. Many libraries (such as scikit-learn) are built around `ndarray`s rather than Pandas datatypes, so we will frequently convert between them.
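
As a minimal sketch of that conversion, `Series.to_numpy()` returns the data as an `ndarray`. The example values are arbitrary, and the exact integer dtype printed can differ between platforms.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, 3])

# .to_numpy() returns the Series data as a NumPy ndarray.
array = numbers.to_numpy()

print(type(array).__name__)
print(array.dtype)  # a single dtype shared by every element

# A Python list can mix types; an ndarray coerces to one dtype.
mixed = np.array([1, 2.5])
print(mixed.dtype)
```

Here the integer `1` in `mixed` is stored as a float, because every element of an `ndarray` must share the same datatype.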

`pandas` encourages the use of **vectorised** operations.

This means that instead of looping over items and changing them one by one, you apply a single operation to a whole column (or `Series`) at once.

Example:

Imagine you have a list of 100 numbers:

`a = list(range(100))`

How would you get the square of each number using just Python?

In [1]:
a = list(range(100))

new_list = []

for num in a:
    new_list.append(num**2)

print(new_list)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961, 1024, 1089, 1156, 1225, 1296, 1369, 1444, 1521, 1600, 1681, 1764, 1849, 1936, 2025, 2116, 2209, 2304, 2401, 2500, 2601, 2704, 2809, 2916, 3025, 3136, 3249, 3364, 3481, 3600, 3721, 3844, 3969, 4096, 4225, 4356, 4489, 4624, 4761, 4900, 5041, 5184, 5329, 5476, 5625, 5776, 5929, 6084, 6241, 6400, 6561, 6724, 6889, 7056, 7225, 7396, 7569, 7744, 7921, 8100, 8281, 8464, 8649, 8836, 9025, 9216, 9409, 9604, 9801]


For large datasets this loop becomes slow. Pandas lets us square every element in a single expression:

In [2]:
import pandas as pd

numbers = pd.Series(range(100))

squares = numbers ** 2

print(squares.values)

[   0    1    4    9   16   25   36   49   64   81  100  121  144  169
  196  225  256  289  324  361  400  441  484  529  576  625  676  729
  784  841  900  961 1024 1089 1156 1225 1296 1369 1444 1521 1600 1681
 1764 1849 1936 2025 2116 2209 2304 2401 2500 2601 2704 2809 2916 3025
 3136 3249 3364 3481 3600 3721 3844 3969 4096 4225 4356 4489 4624 4761
 4900 5041 5184 5329 5476 5625 5776 5929 6084 6241 6400 6561 6724 6889
 7056 7225 7396 7569 7744 7921 8100 8281 8464 8649 8836 9025 9216 9409
 9604 9801]
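
The same idea carries over to `DataFrame` columns. The sketch below uses a made-up orders table (the column names and values are assumptions for illustration) and computes a new column from two existing ones without writing a loop.

```python
import pandas as pd

# Hypothetical order data, invented for illustration.
orders = pd.DataFrame({"price": [2.50, 4.00, 1.25], "quantity": [4, 1, 8]})

# One expression multiplies the two columns element by element.
orders["total"] = orders["price"] * orders["quantity"]

print(orders["total"].tolist())
```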


Let's compare timings for 10,000 numbers:

In [3]:
%%timeit

a = list(range(10000))

new_list = []

for num in a:
    new_list.append(num**2)

4.65 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [4]:
%%timeit

numbers = pd.Series(range(10000))

squares = numbers ** 2

247 µs ± 2.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
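
In the run above, the vectorised version is roughly 19 times faster than the loop (4.65 ms vs. 247 µs). `%%timeit` only works inside IPython or Jupyter, so here is one possible way to repeat the comparison in a plain Python script with the standard-library `timeit` module. The function names are made up for this sketch, and the timings will differ from machine to machine.

```python
import timeit

import pandas as pd

values = list(range(10_000))
numbers = pd.Series(values)  # built once, outside the timed code


def square_with_loop():
    """Square each number with an explicit Python loop."""
    result = []
    for num in values:
        result.append(num ** 2)
    return result


def square_with_pandas():
    """Square the whole Series in one vectorised expression."""
    return numbers ** 2


# Both approaches should produce the same values.
same = square_with_loop() == square_with_pandas().tolist()

# Time 100 calls of each; the numbers depend on the machine.
loop_seconds = timeit.timeit(square_with_loop, number=100)
pandas_seconds = timeit.timeit(square_with_pandas, number=100)

print(f"loop:   {loop_seconds / 100 * 1e6:.0f} µs per call")
print(f"pandas: {pandas_seconds / 100 * 1e6:.0f} µs per call")
```

Unlike the notebook cells, this sketch builds the list and the `Series` before timing starts, so it measures only the squaring step.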
