# Python Data Analysis Library: Pandas 

<div class="alert alert-block alert-success">
    <b>NOTE</b>
    <br>If you are using the Jupyter Hub provided for the IKON training, all the modules should be installed. In this case, please ignore the installation sections.
</div>

## Introduction

This document gives a short introduction to one of the Python data analysis libraries: `Pandas`.

`Pandas` is widely used in data science, machine learning, scientific computing, and many other data-intensive fields. Some of its advantages are:

- data representation: easy to read, suited for data analysis 
- easy handling of missing data
- easy to add/delete columns from `Pandas` data structures
- data alignment: intelligent automatic label-based alignment
- handling large datasets
- powerful grouping of data
- native to `Python`
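
The label-based alignment and missing-data handling listed above can be sketched as follows. The labels and values are made up for illustration.

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative values)
a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0], index=['y', 'z'])

# Addition aligns on labels, not on position; 'x' has no partner in b
total = a + b
print(total)

# The unmatched label yields NaN, which can be filled or dropped
filled = total.fillna(0.0)
dropped = total.dropna()
print(filled)
print(dropped)
```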
  
`Pandas` provides rich data structures and indexing functionality to make it easy to reshape, slice and dice, perform aggregations, and select subsets of data. Its key data structures are called the _Series_ and _DataFrame_.

A _Series_ is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.

_DataFrames_ are two-dimensional tabular, column-oriented data structures with both row and column labels.
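
A minimal sketch of the two structures described above. The labels, names and values are hypothetical and serve only to show how the index and the column labels are used.

```python
import pandas as pd

# A Series: one-dimensional data plus an index of labels
s = pd.Series([4.2, 3.1, 5.0], index=['a', 'b', 'c'])
print(s['b'])        # access a value by its label
print(s.index)

# A DataFrame: columns of possibly different types sharing one row index
df = pd.DataFrame(
    {'name': ['Ann', 'Bob', 'Cy'], 'age': [31, 45, 27]},
    index=['r1', 'r2', 'r3'],
)
print(df.loc['r2', 'age'])   # select by row label and column label
print(df['age'].mean())      # each column behaves like a Series
```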

## Installation

You can install `pandas` using `pip` by typing the following command in a terminal:
```
python -m pip install pandas
```
or with `conda` 

```
conda install pandas
```

or directly from a Jupyter notebook:

```
import sys
!{sys.executable} -m pip install pandas
```

To start using it, simply type:

```
import pandas
```

In [None]:
import numpy as np
import pandas as pd

## Solutions to the exercises
### How to combine many series to form a dataframe?

In [None]:
series1 = pd.Series(['a','b', 'c', 'd'])
series2 = pd.Series([1, 2, 3, 4])

In [None]:
# Solution 1
df = pd.concat([series1, series2], axis=1)
print(f'Solution1:\n{df}')

In [None]:
# Solution 2
df = pd.DataFrame({'col1': series1, 'col2': series2})
print(f'\nSolution2:\n{df}')

### How to stack two series vertically and horizontally?
Stack `series1` and `series2` vertically and horizontally to form a dataframe.

In [None]:
# Input
series1 = pd.Series(range(5))
series2 = pd.Series(list('vwxyz'))

In [None]:
# Solution
# Vertical (Series.append was removed in pandas 2.0; use pd.concat)
pd.concat([series1, series2])

In [None]:
# Horizontal
df = pd.concat([series1, series2], axis=1)
print(df)
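
When two series are stacked vertically with `pd.concat`, each input keeps its original index, so the labels 0 to 4 appear twice in the result. A minimal sketch, using the same inputs as above, of how `ignore_index=True` renumbers the result:

```python
import pandas as pd

series1 = pd.Series(range(5))
series2 = pd.Series(list('vwxyz'))

# Default: the original labels are kept, so 0..4 appear twice
stacked = pd.concat([series1, series2])
print(stacked.index.tolist())

# ignore_index=True assigns a fresh 0..9 index
renumbered = pd.concat([series1, series2], ignore_index=True)
print(renumbered.index.tolist())
```

Whether to renumber depends on whether the original labels carry meaning; for a default integer index they usually do not.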

### How to get the positions of items of series A in another series B?

Get the positions of items of `series2` in `series1` as a list.

In [None]:
# Input
series1 = pd.Series([10, 3, 6, 5, 3, 1, 12, 8, 23])
series2 = pd.Series([1, 3, 5, 23])

In [None]:
# Solution 1
[np.where(i == series1)[0].tolist()[0] for i in series2]

In [None]:
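# Solution 2
# get_loc returns a boolean mask for repeated values (3 occurs twice), so keep first occurrences only
first = series1.drop_duplicates()
[int(first.index[pd.Index(first).get_loc(i)]) for i in series2]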

### How to compute the difference of differences between consecutive numbers of a series?

Compute the difference of differences between the consecutive numbers of `series`.

In [None]:
# Input
series = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

In [None]:
# Solution
print(series.diff().tolist())
print(series.diff().diff().tolist())
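
`Series.diff()` subtracts from each element its predecessor, so the first element has no predecessor and becomes `NaN`; applying it twice leaves two leading `NaN` values. As a rough check, the sketch below recomputes the second difference by hand with `shift()`. The variable names are illustrative.

```python
import pandas as pd

series = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

first = series.diff()    # series[i] - series[i-1]
second = first.diff()    # first[i] - first[i-1]

# The same quantity written out: s[i] - 2*s[i-1] + s[i-2]
manual = series - 2 * series.shift(1) + series.shift(2)
print(second.equals(manual))
```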

### How to check if a dataframe has any missing values?

Check if _df_ has any missing values.

In [None]:
# Input
df = pd.DataFrame(np.random.randn(6, 4),
                  index=list('abcdef'),
                  columns=list('ABCD'))
df['E'] = [0.5, np.nan, -0.33, np.nan, 3.14, 8]

# Solution
df.isnull().values.any()
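
Beyond a single yes/no answer, it is often useful to see where the missing values are. The sketch below uses a small hypothetical frame with fixed values, rather than the random one above, so that the results are reproducible.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; column 'E' contains two missing values
df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'E': [0.5, np.nan, np.nan]})

has_missing = df.isna().any().any()            # True if any cell is NaN
per_column = df.isna().sum()                   # number of NaN values per column
rows_with_missing = df[df.isna().any(axis=1)]  # rows with at least one NaN

print(has_missing)
print(per_column)
print(rows_with_missing)
```

`isna()` and `isnull()` are aliases, so either spelling gives the same result.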

### Playing with `groupby` and csv files

- load the csv file [`biostats.csv`](https://people.sc.fsu.edu/~jburkardt/data/csv/biostats.csv) to a `users` DataFrame
- determine the average, minimum and maximum ages per gender
- determine the average weight of people over 35 years of age

In [None]:
users = pd.read_csv('biostats.csv', skipinitialspace=True) # if the file is in the same directory as this notebook
# users = pd.read_csv('https://people.sc.fsu.edu/~jburkardt/data/csv/biostats.csv', skipinitialspace=True) # to download and create the DataFrame at once
users

In [None]:
# Average age per gender
# option 1
users.groupby('Sex')['Age'].mean()

In [None]:
# option 2
users.groupby('Sex').Age.mean()

In [None]:
# One can also plot the result
users.groupby('Sex')['Age'].mean().plot(kind='bar', grid=True);

In [None]:
# min max age per gender
# option 1
print(f"Minimum age by gender:\n {users.groupby('Sex')['Age'].min()}\nMaximum age by gender:\n {users.groupby('Sex')['Age'].max()}")

In [None]:
# option 2
users.groupby('Sex').Age.min(), users.groupby('Sex').Age.max()

In [None]:
# Other solution: calculate mean, min and max at once
users.groupby('Sex')['Age'].agg(['mean', 'min', 'max'])

In [None]:
# Average height and weight of people over 35 years of age
users.loc[users.Age > 35, 'Height (in)':'Weight (lbs)'].mean()
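
If `biostats.csv` is not at hand, the same `groupby` and filtering steps can be tried on a small in-memory frame. The rows below are made up; only the column names `Sex`, `Age` and `Weight (lbs)` follow the ones used above.

```python
import pandas as pd

# Made-up rows using the same column names as the exercise
users = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'M', 'F'],
    'Age': [41, 30, 38, 29, 52],
    'Weight (lbs)': [170, 120, 140, 160, 130],
})

# Mean, minimum and maximum age per gender in one call
age_stats = users.groupby('Sex')['Age'].agg(['mean', 'min', 'max'])
print(age_stats)

# Average weight of people older than 35
mean_weight_over_35 = users.loc[users['Age'] > 35, 'Weight (lbs)'].mean()
print(mean_weight_over_35)
```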

## References

https://pandas.pydata.org/

https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html

Exercises:   
https://www.w3resource.com/python-exercises/pandas/index.php  
https://github.com/guipsamora/pandas_exercises/tree/master/03_Grouping/Occupation

CSV files:  
https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html