# Introduction to Pandas

- prepared by [Rita Colaço](https://www.cpr.ku.dk/staff/?id=621366&vis=medarbejder) 

Pandas is a Python library for data analysis, and its central tool is the **DataFrame**.

Pandas is well suited for many different kinds of data:

- Tabular data with heterogeneously-typed columns, as in an Excel spreadsheet.
- Ordered and unordered time series data.
- Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a pandas data structure.

## Learning goals

- Concept of 
    - vectors (1d-arrays) as series
    - tables as data frames
- Organization of a table: index, columns
- Filtering and slicing
- Data types a Pandas dataframe can handle
- Applying statistics and grouping
- Modifying a table
 

## What is a DataFrame?

A DataFrame is essentially a **table** of data (a tabular data structure) with labeled rows and columns. The rows are labeled by a special data structure called an Index, which permits fast look-up and powerful relational operations.
For example:

| (index) | Name | Age | Height | LikesIceCream |
| :---: | :--: | :--: | :--: | :--: |
| 0     | "Nick" | 22 | 3.4 | True |
| 1     | "Jenn" | 55 | 1.2 | True |
| 2     | "Joe"  | 25 | 2.2 | True |
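
As a minimal sketch, the example table can be built as a DataFrame to see how the index, the columns, and a single column (a Series) relate. The variable names `people` and `ages` are illustrative and not part of the original material.

```python
import pandas as pd

# Rebuild the example table; `people` is an illustrative name.
people = pd.DataFrame({
    "Name": ["Nick", "Jenn", "Joe"],
    "Age": [22, 55, 25],
    "Height": [3.4, 1.2, 2.2],
    "LikesIceCream": [True, True, True],
})

print(people.index)    # row labels: a default RangeIndex covering 0..2
print(people.columns)  # column labels

# A single column is a Series (a labeled 1-D array) sharing the same index.
ages = people["Age"]
print(type(ages).__name__)
```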

## Importing Pandas library

In [2]:
import pandas as pd

## Create a DataFrame directly

### From a List of Lists

In [3]:
data = [
    [2.23, 1, "test"],
    [3.45, 2, "train"],
    [4.5, 3, "test"],
    [6.0, 4, "train"]
]

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

In [4]:
df

| (index) | A | B | C |
| :---: | :--: | :--: | :--: |
| 0 | 2.23 | 1 | test |
| 1 | 3.45 | 2 | train |
| 2 | 4.5 | 3 | test |
| 3 | 6.0 | 4 | train |


### From a List of Dicts

In [5]:
data = [
    {'A':2.23, 'B':1, 'C':"test"},
    {'A':3.45, 'B':2, 'C':"train"},
    {'A':4.5, 'B':3, 'C':"test"},
    {'A':6.0, 'B':4, 'C':"train"}
]

df = pd.DataFrame(data)

In [6]:
df

| (index) | A | B | C |
| :---: | :--: | :--: | :--: |
| 0 | 2.23 | 1 | test |
| 1 | 3.45 | 2 | train |
| 2 | 4.5 | 3 | test |
| 3 | 6.0 | 4 | train |


### From a Dict of Lists

In [7]:
df = pd.DataFrame({
    'A': [2.23, 3.45, 4.5, 6.0],
    'B': [1, 2, 3, 4],
    'C': ["test", "train", "test", "train"]
})

In [8]:
df

| (index) | A | B | C |
| :---: | :--: | :--: | :--: |
| 0 | 2.23 | 1 | test |
| 1 | 3.45 | 2 | train |
| 2 | 4.5 | 3 | test |
| 3 | 6.0 | 4 | train |


### From an empty DataFrame

In [9]:
df = pd.DataFrame()
df['A'] = [2.23, 3.45, 4.5, 6.0]
df['B'] = [1, 2, 3, 4]
df['C'] = ["test", "train", "test", "train"]

In [10]:
df

| (index) | A | B | C |
| :---: | :--: | :--: | :--: |
| 0 | 2.23 | 1 | test |
| 1 | 3.45 | 2 | train |
| 2 | 4.5 | 3 | test |
| 3 | 6.0 | 4 | train |


### Exercise 1
Please recreate the table below as a DataFrame using one of the approaches detailed above:

| Year | Product | Cost |
| :--: | :----:  | :--: |
| 2015 | Apples  | 0.35 |
| 2016 | Apples  | 0.45 |
| 2015 | Bananas | 0.75 |
| 2016 | Bananas | 1.10 |

Which approach did you prefer? Why?

## Making DataFrames from a Data File

Pandas has functions that can make DataFrames from a wide variety of file types. To do this, use one of the functions in Pandas that start with "read_". Here is a non-exhaustive list of examples:

| File Type | Function Name |
| :----:    |  :---:  |
| Excel | `pd.read_excel()` |
| CSV, TSV | `pd.read_csv()` |
| H5, HDF, HDF5 | `pd.read_hdf()` |
| JSON  | `pd.read_json()` |
| SQL | `pd.read_sql_table()` |
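
A minimal sketch of `pd.read_csv()` that runs without any file on disk: an in-memory string stands in for the file. The column names and values are made up for illustration.

```python
import io
import pandas as pd

# A small in-memory CSV stands in for a file on disk (made-up values).
csv_text = "sample,reading\na,1.5\nb,2.5\nc,4.0\n"

# read_csv accepts a path, a URL, or any file-like object.
readings = pd.read_csv(io.StringIO(csv_text))

# The same text with tab separators is handled via the `sep` argument.
tsv_text = csv_text.replace(",", "\t")
readings_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(readings)
print(readings.equals(readings_tsv))
```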



### Loading the Data

In [11]:
# url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
# df = pd.read_csv(url)
url_ecdc_daily_cases = "https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/data.csv"
df = pd.read_csv(url_ecdc_daily_cases)  # dateRep is read as plain text (dtype object)
df.head()

## Examining the Dataset

Sometimes, we might just want to quickly inspect the DataFrame:

### Attributes
```python
df.shape    # Shape of the object (2D)
df.dtypes   # Data types in each column
df.index    # Index range
df.columns  # Column names
```

### Methods

```python
df.describe()   # Descriptive statistics of columns
df.info()       # DataFrame information

```




### Shape

In [12]:
df.shape

(60409, 12)

### Data types

In [13]:
df.dtypes

dateRep                                                        object
day                                                             int64
month                                                           int64
year                                                            int64
cases                                                           int64
deaths                                                          int64
countriesAndTerritories                                        object
geoId                                                          object
countryterritoryCode                                           object
popData2019                                                   float64
continentExp                                                   object
Cumulative_number_for_14_days_of_COVID-19_cases_per_100000    float64
dtype: object

### Index and Columns

In [26]:
df.index

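RangeIndex(start=0, stop=60409, step=1)

In [27]:
df.columns

Index(['dateRep', 'day', 'month', 'year', 'cases', 'deaths',
       'countriesAndTerritories', 'geoId', 'countryterritoryCode',
       'popData2019', 'continentExp',
       'Cumulative_number_for_14_days_of_COVID-19_cases_per_100000'],
      dtype='object')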

## Selecting Data

Pandas often supports several syntaxes for the same operation. For example, a column of a DataFrame can be selected in two ways:

```python
df['Column1']
df.Column1  # only works if the column name is a valid Python identifier (e.g. no whitespace)
```

Multiple columns can also be selected by providing a list:

```python
df[['Column1', 'Column2']]
```

Rows are selected with the **iloc** and **loc** attributes:

```python
df.iloc[5]      # Select a row by integer position (here the 6th row).
df.loc['Row5']  # Select a row by its index label (used if rows are named).
```
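
The selection syntaxes above can be tried on a small table. This is a sketch with made-up data; `df_demo` and its row labels are illustrative names only.

```python
import pandas as pd

# Toy table with named rows; all names here are illustrative.
df_demo = pd.DataFrame(
    {"Column1": [10, 20, 30], "Column 2": ["a", "b", "c"]},
    index=["Row0", "Row1", "Row2"],
)

col_by_key = df_demo["Column1"]        # works for any column name
col_by_attr = df_demo.Column1          # only for names that are valid identifiers
col_with_space = df_demo["Column 2"]   # a name with a space needs brackets

row_by_position = df_demo.iloc[1]      # second row, by integer position
row_by_label = df_demo.loc["Row1"]     # same row, by index label

# loc can pick a row and a column at once.
cell = df_demo.loc["Row2", "Column1"]
print(cell)
```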

However, with large DataFrames, we often just want to see the first or last rows, or even just a sample of the rows.

| Method | Description |
| ---  | --- |
| `df.head(5)` | the first 5 rows |
| `df.tail(5)` | the last 5 rows |
| `df.sample(5)` | a random 5 rows |
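
A short sketch of the three methods on a made-up 100-row table. The `random_state` argument is optional and only used here to make the random sample reproducible.

```python
import pandas as pd

# Illustrative table with the numbers 0..99 in a single column.
numbers = pd.DataFrame({"n": range(100)})

first = numbers.head(5)   # rows 0..4
last = numbers.tail(5)    # rows 95..99

# random_state fixes the seed so the same 5 rows are drawn every time.
picked = numbers.sample(5, random_state=0)

print(first)
print(last)
print(picked)
```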


### Exercise 2

Please open the file `titanic.csv` (using `pd.read_csv`) and use it to answer the following questions about the dataset. If you reach the end of the exercises, explore the dataset and DataFrames more and see what you can find about it!

Display the first 5 lines of the dataset.

Show the last 3 lines of the "alive" column.

Check 3 random lines of the dataset

Make a new dataframe containing just the "survived", "sex", and "age" columns

Make a new dataframe containing just the 10th, 15th and 16th lines of the dataset

## Query/Filtering Data

To select rows based on their values, Pandas supports both NumPy-style boolean indexing:

```python
select_rows = df[df['Column1'] > 0]
```

and an SQL-like query string:
    
```python
df.query('Column1 > 0')
```

One can also filter on multiple conditions using the element-wise ("bit-wise") logical operators: **&** for the intersection of the conditions (and), or **|** for their union (or). Each condition must be wrapped in parentheses.

```python
select_rows = df[(df['Column1'] > 0) & (df['Column2'] > 2)]
```

```python
select_rows = df[(df['Column1'] > 0) | (df['Column2'] > 2)]
```
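
To see what each filter returns, here is a sketch on a made-up four-row table (`df_demo` is an illustrative name). It also shows that the boolean mask is an ordinary Series that can be stored and reused.

```python
import pandas as pd

df_demo = pd.DataFrame({"Column1": [-1, 2, 3, 0], "Column2": [5, 1, 4, 3]})

mask = df_demo["Column1"] > 0            # boolean Series, one value per row
by_mask = df_demo[mask]
by_query = df_demo.query("Column1 > 0")  # same rows, string syntax

# Parentheses are required: & and | bind tighter than comparisons.
both = df_demo[(df_demo["Column1"] > 0) & (df_demo["Column2"] > 2)]
either = df_demo[(df_demo["Column1"] > 0) | (df_demo["Column2"] > 2)]

print(list(both.index), list(either.index))
```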

### Exercise 3
Using the Titanic dataset, let's do some data querying exercises.

What is the ticket fare for the 1st class? The 2nd? The 3rd?

In [15]:
pd.Series([True, True]) & pd.Series([False, False])

0    False
1    False
dtype: bool

Did the oldest passenger on the Titanic survive?

Was the youngest passenger on the Titanic alone?

How many passengers on the Titanic embarked from Cherbourg?

How much money did the Titanic make from passengers from Southampton? From Cherbourg? From Queenstown?

Considering only those passengers older than 22, were there more Males travelling alone from Southampton or Females in Third class from Cherbourg?

## Summarizing/Statistics in DataFrames

Pandas' Series and DataFrames are iterables and can be passed to any function that expects a list or NumPy array, so they work with functions from many other libraries. They also provide their own methods for basic statistics, for example:

```python
df['Column1'].count()  # number of non-missing values
df['Column1'].max()    # largest value
df.loc[df['Column1'] == 'string', 'Column2'].sum()  # sum of Column2 where Column1 equals 'string'

```
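
A small worked sketch of these statistics, using made-up data with one missing value so the difference between counting rows and counting values is visible. All names are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Column1": ["x", "y", "x", "y"],
    "Column2": [1.0, 2.0, 3.0, None],
})

n_values = df_demo["Column2"].count()   # number of non-missing values
largest = df_demo["Column2"].max()
mean_value = df_demo["Column2"].mean()  # missing values are skipped by default

# Sum Column2 over the rows where Column1 equals "x".
x_total = df_demo.loc[df_demo["Column1"] == "x", "Column2"].sum()

# Series are iterable, so Python built-ins work too
# (Column1 has no missing values here).
builtin_max = max(df_demo["Column1"])

print(n_values, largest, mean_value, x_total, builtin_max)
```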

You can also use the `pipe` method to pass a Series or DataFrame to a function as its first argument:

```python
import numpy as np

df['Column1'].pipe(np.mean)  # same as np.mean(df['Column1'])
```
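
The following sketch shows `pipe` with both a NumPy function and a user-defined one. The helper `center` is hypothetical and exists only to illustrate how extra arguments are forwarded.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 6.0])

def center(series, by):
    """Subtract a reference value from every element (illustrative helper)."""
    return series - by

# pipe passes the whole Series as the first argument of the function.
mean_via_pipe = values.pipe(np.mean)

# Extra keyword arguments are forwarded to the function.
centered = values.pipe(center, by=mean_via_pipe)

print(mean_via_pipe)
print(centered)
```

Because each `pipe` call returns the function's result, several calls can be chained into a readable left-to-right pipeline.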


### Exercise 4

What is the mean ticket fare that the passengers paid on the Titanic? And the median?

How many passengers does this dataset contain?

What class ticket did the 10th (index = 9) passenger in this dataset buy?

What proportion of the passengers were alone on the Titanic?

How many different classes were on the Titanic?

How many men and women are in this dataset? (Hint: `value_counts()`)

How many passengers are in each class?

## Transforming/Modifying Data

Any transformation function can be performed on each element of a column or on the entire DataFrame. For example:


```python
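import numpy as np

df['Column1'] * 5           # multiply every element by 5
np.sqrt(df['Column1'])      # apply a NumPy function element-wise
df['Column1'].str.upper()   # string methods live under .str (text columns only)

del df['B']                 # delete column 'B' in place

df['Column1'] = [3, 9, 27, 81]  # Replace the entire column with other values (length must match)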
```
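
Note that the first three expressions return new objects and leave `df` unchanged unless the result is assigned back. A sketch with made-up data (`df_demo` and the new column names are illustrative):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Column1": [1.0, 4.0, 9.0],
    "label": ["a", "b", "c"],
    "B": [0, 0, 0],
})

# Operations return new objects; assign the result to keep it.
df_demo["times_five"] = df_demo["Column1"] * 5
df_demo["root"] = np.sqrt(df_demo["Column1"])
df_demo["label"] = df_demo["label"].str.upper()

# drop returns a copy without column "B";
# `del df_demo["B"]` would remove it in place instead.
df_demo = df_demo.drop(columns="B")

print(df_demo)
```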

### Exercise 5

Get everyone's age if they were still alive today (hint: the Titanic sank in 1912)

Make the class name title-cased (the first letter capitalized)

Make a column called "not_survived", the opposite of the "survived" column

## GroupBy Operations

In most of our tasks, getting single metrics from a dataset is not enough, and we often actually want to compare metrics between groups or conditions.

The **groupby()** method essentially splits the data into different groups depending on a variable of your choice, and allows you to apply summary functions on each group. For example, if you wanted to calculate the mean temperature by month from a given data frame:

```python
df.groupby('month').temperature.mean()
```
where "month" and "temperature" are column names from the data frame.
 
You can also group by multiple columns, by providing a list of column names:
 
```python
df.groupby(['year', 'month']).temperature.mean()
```

The **groupby()** method returns a GroupBy object, whose **.groups** attribute is a dictionary mapping each unique group key to the row labels that belong to that group.

GroupBy objects are **lazy**, meaning they don't compute anything until an aggregation or other operation is requested. Splitting the data into groups, applying a function to each group, and combining the results is called the **"Split-Apply-Combine"** workflow. You can get more info on it [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html).
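
A minimal sketch of the workflow on made-up temperature readings. The table `weather` and its values are illustrative, chosen so the group means are easy to check by hand.

```python
import pandas as pd

weather = pd.DataFrame({
    "year": [2020, 2020, 2020, 2021],
    "month": [1, 1, 2, 1],
    "temperature": [2.0, 4.0, 5.0, 3.0],
})

grouped = weather.groupby("month")   # split: no statistics computed yet
print(grouped.groups)                # maps each month to its row labels

# apply + combine: one mean per group, returned as a Series
monthly_mean = grouped["temperature"].mean()

# Grouping by two columns gives a result indexed by (year, month) pairs.
yearly_monthly_mean = weather.groupby(["year", "month"])["temperature"].mean()

print(monthly_mean)
print(yearly_monthly_mean)
```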

### Exercise 6

Let's try this out on the Titanic dataset.

What was the median ticket fare for each class?

What was the survival rate for each class?

What was the survival rate for each sex?

What was the survival rate, broken down by both sex and class?

Which class tended to travel alone more often? Did it matter where they were embarking from?

What was the ticket fare for each embarking city?

What was the median age of the survivors vs non-survivors, when sex is considered as a factor?

# For Track 2 (or not, what do you think?)

## GroupBy Operations: Multiple Statistics per Group

Another piece of syntax we are going to look at is the **agg()** method. It allows multiple statistics to be calculated per group in a single call.

The instructions to **agg()** are provided in the form of a dictionary, where the keys specify the columns on which to operate, and the values specify the function to run:

```python
df.groupby(['year', 'month']).agg({'duration': 'sum',
                                   'network_type': 'count',
                                   'date': 'first'})
```

You can also apply multiple functions to one column in groups:

```python
df.groupby(['year', 'month']).agg({'duration': ['min', 'max', 'sum'],
                                   'network_type': 'count',
                                   'date': ['min', 'first', 'nunique']})
```
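
A runnable sketch of both forms on a made-up call log. The table `calls` and the output column names `total_duration` and `n_calls` are illustrative. The second call uses pandas' named aggregation, an alternative to the dictionary form that yields flat column names.

```python
import pandas as pd

calls = pd.DataFrame({
    "month": ["jan", "jan", "feb"],
    "duration": [10, 30, 5],
    "network_type": ["mobile", "landline", "mobile"],
})

# String names select the built-in pandas reductions.
summary = calls.groupby("month").agg({
    "duration": ["min", "max", "sum"],
    "network_type": "count",
})
print(summary)  # columns form a two-level (column, statistic) index

# Named aggregation: new_column=(source_column, function).
flat = calls.groupby("month").agg(
    total_duration=("duration", "sum"),
    n_calls=("network_type", "count"),
)
print(flat)
```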

### Exercise 7

Now, let's try to apply it to our Titanic dataset, and answer the following questions.

How many men, women and children survived, and what was their average age?

In [42]:
# df.groupby('who').agg({'survived':sum,
#                        'age':"mean"})

How many males and females, embarking from different towns, were alive? And how many of those were alone?

In [47]:
# df.groupby(['sex', 'class', 'embark_town']).agg({'alive':"count",
#                                                  "alone":sum})

## Handling Missing Values

Missing values are often a concern in data science, for example in proteomics, and can be indicated with a **`None`** or **`NaN`** (`np.nan` in NumPy). Pandas DataFrames have several methods for detecting, removing and replacing these values:

| method | description |
| ---:  | :---- |
| **`isna()`** | Returns True for each NaN |
| **`notna()`** | Returns False for each NaN |
| **`dropna()`** | Returns just the rows without any NaNs |
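
A short sketch of the three methods on a made-up table containing both `np.nan` and `None` (`df_demo` is an illustrative name):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, None]})

missing = df_demo.isna()          # True where a value is missing
present = df_demo.notna()         # the element-wise opposite of isna()
per_column = missing.sum()        # True counts as 1: missing values per column
complete_rows = df_demo.dropna()  # only the rows without any missing value

print(per_column)
print(complete_rows)
```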

### Exercise 8

What proportion of the "deck" column is missing data?

How many rows don't contain any missing data at all?

Make a dataframe with only the rows containing no missing data.

## Imputation

Imputation means replacing the missing values with substitute values, for example a constant or an estimate based on neighbouring data.

| method | description |
| ----: |  :---- |
| **`fillna()`** | Replaces the NaNs with values (provides lots of options) |
| **`ffill()`** | Replaces the NaNs with the previous non-NaN value (equivalent to the older `df.fillna(method='ffill')`, which is deprecated in recent pandas versions) |
| **`bfill()`** | Replaces the NaNs with the following non-NaN value (equivalent to the older `df.fillna(method='bfill')`, which is deprecated in recent pandas versions) |
| **`interpolate()`** | Fills the NaNs by interpolating between the previous and following values (linear by default) |
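
The four methods can be compared side by side on a short made-up Series with a gap of two missing values. The Series `s` is illustrative and deliberately different from the exercise data.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

zeros = s.fillna(0)        # 1, 0, 0, 4
forward = s.ffill()        # 1, 1, 1, 4
backward = s.bfill()       # 1, 4, 4, 4
linear = s.interpolate()   # 1, 2, 3, 4 (linear between the neighbours)

print(pd.DataFrame({"zeros": zeros, "ffill": forward,
                    "bfill": backward, "interpolate": linear}))
```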


### Exercise 9

Using the following DataFrame, solve the exercises below.

In [33]:
data = pd.DataFrame({'time': [0.5, 1., 1.5, None, 2.5, 3., 3.5, None], 'value': [6, 4, 5, 8, None, 10, 11, None]})
data

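| (index) | time | value |
| :---: | :--: | :--: |
| 0 | 0.5 | 6.0 |
| 1 | 1.0 | 4.0 |
| 2 | 1.5 | 5.0 |
| 3 | NaN | 8.0 |
| 4 | 2.5 | NaN |
| 5 | 3.0 | 10.0 |
| 6 | 3.5 | 11.0 |
| 7 | NaN | NaN |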


Replace all the missing "value" rows with zeros.

Replace the missing "time" rows with the previous value.

Replace all of the missing values with the data from the next row. What do you notice when you do this with this dataset?

Linearly interpolate the missing data. What is the result for this dataset?