# Pandas 101


## Data wrangling using Pandas
- In science we make a distinction between raw data and tidy data
- The former is usually not annotated, is closer to the acquisition source (i.e. the sensor or measurement device) and requires some processing before it can be analysed
- The latter is a statistical table that serves as the foundation for answering at least one (but often many) questions
- The term data wrangling is a commonly used catch-all for the early stages of the data analytics process
- It covers the steps required to transition from raw to tidy datasets
  - While the steps can change from project to project, the most common ones are: 
    1. Collecting/extracting data: 
       1. Identify the data you need
       2. Decide where to acquire or download it from
       3. Work out how to collect it
    2. Structuring the data  
       1. Most raw data is unstructured and requires some attention to the way you structure it 
       2. In some cases you will use summary scores to simplify the data 
       3. Other cases may call for statistical data mining and transformation 
       4. But in the end we want to transform the data into a format where the following rules apply:
          1. Each row reflects an observation
          2. Each column reflects a variable/feature of the dataset 
          3. Each cell reflects a measurement
          4. Ideally, both rows and columns are labelled 
    3. Exploratory data analysis
       1. Describe the data components using both summary tables and data visualisation  
       2. Identify redundancy 
       3. Identify outliers
       4. Measure missingness 
       5. Identify association between features and categories 
    4. Data cleaning, enriching and fusion 
       1. Removing or clipping outliers 
       2. Removing errors and duplications 
       3. Standardising category names, dates and numeric formats 
       4. Fusing together complementary datasets to improve available information
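
Steps 3 and 4 above can be sketched on a toy table. The column names, the values and the clipping bounds below are invented for illustration and do not come from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw readings: inconsistent labels, a duplicated row,
# a missing value and one implausible temperature.
raw = pd.DataFrame({
    'site': ['A', 'a ', 'B', 'B', 'b'],
    'temp_c': [21.5, 21.5, np.nan, 19.0, 480.0],
})

# Measure missingness: fraction of missing cells per column
missing_fraction = raw.isna().mean()

# Standardise category names (strip whitespace, unify case)
tidy = raw.assign(site=raw['site'].str.strip().str.upper())

# Remove exact duplicate rows
tidy = tidy.drop_duplicates()

# Clip outliers to an assumed plausible range of -50 to 60 degrees
tidy = tidy.assign(temp_c=tidy['temp_c'].clip(lower=-50, upper=60))

print(missing_fraction)
print(tidy)
```

Whether to clip, drop or keep an outlier depends on the project; clipping is used here only because it is one of the options listed above.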



## What is Pandas?
- While part of this list can be achieved using NumPy, that is not what it was designed for 
- Pandas is a package that was designed from scratch to support everything on the list above, and more. 
- It extends NumPy with three core objects:
  - `Index` - Immutable sequence used for indexing and alignment. The basic object storing axis labels for all pandas objects.
  - `Series` - One-dimensional `ndarray` with axis labels (including time series).
  - `DataFrame` - Two-dimensional, size-mutable, tabular data. Data structure also contains labelled axes (rows and columns). 
- Together, these three objects provide efficient tools for the "data munging" tasks that occupy much of a data scientist's time.
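
To make the relationship between the three objects concrete, here is a minimal sketch using a made-up two-column table (the labels and values are arbitrary).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'height': [1.6, 1.8], 'weight': [60, 80]},
    index=['ann', 'bob'],
)

row_labels = df.index        # an Index holding the row labels
col_labels = df.columns      # another Index holding the column labels
heights = df['height']       # one column, returned as a Series
values = heights.to_numpy()  # the values of that Series as a NumPy array

print(type(row_labels).__name__, list(row_labels))
print(type(heights).__name__, heights.index.equals(df.index))
print(type(values).__name__, values)
```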




## Pandas practicalities 
### Importing pandas
- Most of the time we will import both NumPy and Pandas 
- The `as` keyword creates a short alias for each package, which speeds up code writing 

In [23]:
import numpy as np 
import pandas as pd
import datetime as dt


### Pandas basic structures 
#### The Index class 
- As stated above, the major component missing from NumPy is the ability to add context to arrays 
- The Index class provides this context 
- It is in fact the parent of a family of classes, each designed to efficiently perform various operations on the two dimensions of the tidy table format. 
- For example: 
  - RangeIndex : Index over a monotonic integer range (defined by start, stop and step)
  - CategoricalIndex : Index of categories 
  - DatetimeIndex : Index of datetime64 data.
  - MultiIndex : A multi-level, or hierarchical Index.

```{Important}
When you call `pd.Index`, the class will automatically try to infer which of its many subclasses to use for the sake of efficiency  
```
- For example:
  

In [25]:
print(pd.Index(['a','b','c','d','e']))
print(pd.Index(range(2,11,2)))
print(pd.Index([2,4,6,8,10]))
print(pd.Index(pd.Categorical(['a','b','c','d','e'])))
print(pd.Index([dt.datetime(2020,10,1),dt.datetime(2020,10,5)]))
print(pd.Index([('a',1),('a',2),('b',1),('b',3)]))


Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
RangeIndex(start=2, stop=11, step=2)
Int64Index([2, 4, 6, 8, 10], dtype='int64')
CategoricalIndex(['a', 'b', 'c', 'd', 'e'], categories=['a', 'b', 'c', 'd', 'e'], ordered=False, dtype='category')
DatetimeIndex(['2020-10-01', '2020-10-05'], dtype='datetime64[ns]', freq=None)
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 3)],
           )
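
If you prefer not to rely on inference, the subclasses can also be constructed explicitly. The sketch below builds a few of them directly and checks that `pd.Index` arrives at the same classes; the values mirror the cell above.

```python
import datetime as dt
import pandas as pd

# Explicit constructors for some Index subclasses
explicit_range = pd.RangeIndex(start=2, stop=11, step=2)
explicit_multi = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 3)])
explicit_cat = pd.CategoricalIndex(['a', 'b', 'c'])

# The same kind of data passed through the generic constructor
inferred_range = pd.Index(range(2, 11, 2))
inferred_multi = pd.Index([('a', 1), ('a', 2), ('b', 1), ('b', 3)])
inferred_dates = pd.Index([dt.datetime(2020, 10, 1), dt.datetime(2020, 10, 5)])

print(type(inferred_range).__name__, inferred_range.equals(explicit_range))
print(type(inferred_multi).__name__, inferred_multi.equals(explicit_multi))
print(type(inferred_dates).__name__)
print(type(explicit_cat).__name__)
```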


##### An Index is very similar to an immutable NumPy array
- For example, you can use standard indexing notation to retrieve values or slices
- It has many of the attributes that NumPy arrays have
- However, you cannot modify individual entries; you can only replace the whole Index


In [41]:
A2Z = pd.Index([chr(n) for n in range(65,91)])
print(A2Z[2])
print(A2Z[2:5])
print(A2Z[1:6:3])
print(f'size = {A2Z.size}, shape = {A2Z.shape}, dtype = {A2Z.dtype}')

C
Index(['C', 'D', 'E'], dtype='object')
Index(['B', 'E'], dtype='object')
size = 26, shape = (26,), dtype = object
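
The immutability mentioned above can be checked directly. In this sketch, item assignment on an `Index` raises a `TypeError`, while building a new `Index` from the old one works fine.

```python
import pandas as pd

A2Z = pd.Index([chr(n) for n in range(65, 91)])

# In-place modification is rejected
try:
    A2Z[0] = 'a'
    modified_in_place = True
except TypeError as err:
    modified_in_place = False
    print(f'TypeError: {err}')

# Replacing the whole Index with a new one is allowed
a2z = A2Z.str.lower()
print(a2z[:3])
```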


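##### Index objects support many set operations 
- Set operations such as difference, intersection and union are supported either through Python's set operators (`^`, `&`, `|`)
- Or through the equivalent methods (`symmetric_difference`, `intersection`, `union`, `difference`)
- Note that the operator form was deprecated in later pandas 1.x releases, and from pandas 2.0 these operators no longer perform set operations on an `Index`; the outputs shown here come from an older release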

In [45]:
ix_01 = pd.Index(range(1,11,3))
ix_02 = pd.Index(range(1,11,2))
print(f'Index 01 = {list(range(1,11,3))} and Index 02 = {list(range(1,11,2))}')
print(f'{"Symmetric Difference of ix_01 and ix_02":<50} = {ix_01 ^ ix_02}')
print(f'{"Intersection of ix_01 and ix_02":<50} = {ix_01 & ix_02}')
print(f'{"Union of ix_01 and ix_02":<50} = {ix_01 | ix_02}')

Index 01 = [1, 4, 7, 10] and Index 02 = [1, 3, 5, 7, 9]
Symmetric Difference of ix_01 and ix_02            = Int64Index([3, 4, 5, 9, 10], dtype='int64')
Intersection of ix_01 and ix_02                    = RangeIndex(start=1, stop=11, step=6)
Union of ix_01 and ix_02                           = Int64Index([1, 3, 4, 5, 7, 9, 10], dtype='int64')


In [48]:
print(f'{"Difference of ix_01 and ix_02":<50} = {ix_01.difference(ix_02)}')
print(f'{"Difference of ix_02 and ix_01":<50} = {ix_02.difference(ix_01)}')
print(f'{"Intersection of ix_01 and ix_02":<50} = {ix_01.intersection(ix_02)}')
print(f'{"Union of ix_01 and ix_02":<50} = {ix_01.union(ix_02)}')

Difference of ix_01 and ix_02                      = RangeIndex(start=4, stop=13, step=6)
Difference of ix_02 and ix_01                      = Int64Index([3, 5, 9], dtype='int64')
Intersection of ix_01 and ix_02                    = RangeIndex(start=1, stop=11, step=6)
Union of ix_01 and ix_02                           = Int64Index([1, 3, 4, 5, 7, 9, 10], dtype='int64')
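
Since the method form does not depend on how a given pandas version treats the operators, it is the safer default. A minimal sketch using the same two indices, printed as plain lists so the result reads the same regardless of which `Index` subclass is returned:

```python
import pandas as pd

ix_01 = pd.Index(range(1, 11, 3))   # 1, 4, 7, 10
ix_02 = pd.Index(range(1, 11, 2))   # 1, 3, 5, 7, 9

results = {
    'union': ix_01.union(ix_02),
    'intersection': ix_01.intersection(ix_02),
    'difference': ix_01.difference(ix_02),
    'symmetric_difference': ix_01.symmetric_difference(ix_02),
}

for name, ix in results.items():
    print(f'{name:<22} = {list(ix)}')
```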


#### The Pandas Series Object
- A Pandas `Series` is an object that holds at least three attributes: 
  - a one-dimensional array of values,
  - a pandas index, 
  - and a dtype.
- It can be created from any sequence that can be turned into a NumPy array, because the values are stored as a NumPy array
- If you wish, you can explicitly define an index 
- You can also give it a name 
  

In [56]:
print(pd.Series((1,2,3)))
print(pd.Series(np.ones(3),index = ('a','b','c')))
print(pd.Series(range(4,7,2),name='46'))

0    1
1    2
2    3
dtype: int64
a    1.0
b    1.0
c    1.0
dtype: float64
0    4
1    6
Name: 46, dtype: int64


##### Series provide the power of numpy arrays with the flexibility of dictionaries 
- For example, you can use standard indexing notation to retrieve values or slices
- And it has many of the attributes that NumPy arrays have
- It also has access to many analytical methods 
- A series cell can contain any type of data 
- And they are of course mutable 
- If no index is supplied, a Series gets a default integer index (0, 1, 2, ...) 
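
A short sketch of the dictionary-like and array-like behaviour listed above; the labels and numbers are arbitrary.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = s['b']            # dictionary-style lookup by label
by_position = s.iloc[0]      # array-style lookup by position
label_slice = s['a':'b']     # label slices include the end label
has_key = 'c' in s           # membership tests check the index

doubled = s * 2              # vectorised arithmetic, as in NumPy
average = s.mean()           # one of many built-in analytical methods

s['c'] = 99                  # Series are mutable: update an existing label
s['d'] = 5                   # ...or add a new one

print(by_label, by_position, has_key, average)
print(label_slice)
print(s)
```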

#### The Pandas DataFrame Object
- The Series object can be viewed as a coupling of a NumPy array with a pandas Index object 
- The DataFrame object is a set of Series objects that share one Index for their rows, plus a second Index (the columns) that identifies each Series 
- It can be thought of as a dict-like container for Series objects.
- If, for example, we combine several Series whose indices only partly overlap, Pandas will automatically align them and fill the positions that do not match with missing values (`NaN`). 
- Let's see this in action  

In [74]:
col_A = pd.Series((1,2,3))
col_B = pd.Series(list(range(3,7)))
col_C = pd.Series(['a','b','c'],index=[1,5,2])
pd.concat([col_A,col_B,col_C],axis=1)

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 3.0 | NaN |
| 1 | 2.0 | 4.0 | a |
| 2 | 3.0 | 5.0 | c |
| 3 | NaN | 6.0 | NaN |
| 5 | NaN | NaN | b |


````{admonition} Discuss 
How do you think adding the `name` argument changes the DataFrame? 
```{code} python
col_A = pd.Series((1,2,3),name='col_A')
col_B = pd.Series(list(range(3,7)),name='col_B')
col_C = pd.Series(['a','b','c'],index=[1,5,2],name='col_C')
pd.concat([col_A,col_B,col_C],axis=1)
```
````

In [4]:
col_A = pd.Series((1,2,3),name='col_A')
col_B = pd.Series(list(range(3,7)),name='col_B')
col_C = pd.Series(['a','b','c'],index=[1,5,2],name='col_C')
pd.concat([col_A,col_B,col_C],axis=1)

|   | col_A | col_B | col_C |
|---|---|---|---|
| 0 | 1.0 | 3.0 | NaN |
| 1 | 2.0 | 4.0 | a |
| 2 | 3.0 | 5.0 | c |
| 3 | NaN | 6.0 | NaN |
| 5 | NaN | NaN | b |
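
After an alignment like this it is often useful to check where the gaps ended up. The sketch below rebuilds the same three Series and counts the missing values per column. Note that the integer columns become `float64`, because `NaN` is a floating-point value.

```python
import pandas as pd

col_A = pd.Series((1, 2, 3), name='col_A')
col_B = pd.Series(list(range(3, 7)), name='col_B')
col_C = pd.Series(['a', 'b', 'c'], index=[1, 5, 2], name='col_C')

df = pd.concat([col_A, col_B, col_C], axis=1)

missing_per_column = df.isna().sum()   # number of NaN cells in each column
complete_rows = df.dropna()            # rows with no missing value at all

print(missing_per_column)
print(df.dtypes)
print(complete_rows)
```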


#### Constructing DataFrame objects (selected methods)

- A Pandas `DataFrame` can be constructed in many ways.

 

##### The simplest approach - Dictionary  


In [1]:
d = {'col_A': [1, 2], 'col_B': [3, 4]}
df = pd.DataFrame(data=d)
df

|   | col_A | col_B |
|---|---|---|
| 0 | 1 | 3 |
| 1 | 2 | 4 |


##### Dictionary values can be Series


In [9]:
d = {'col_A': col_A, 'col_B': col_B}
df = pd.DataFrame(data=d)
df

|   | col_A | col_B |
|---|---|---|
| 0 | 1.0 | 3 |
| 1 | 2.0 | 4 |
| 2 | 3.0 | 5 |
| 3 | NaN | 6 |


##### A list of lists also works


In [2]:
d = [[1,2],[3,4]]
df = pd.DataFrame(data=d)
df

|   | 0 | 1 |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 3 | 4 |
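
When the data comes as nested lists, the labels can be supplied separately through the `columns` and `index` arguments. A minimal sketch (the label names are made up):

```python
import pandas as pd

rows = [[1, 2], [3, 4]]

# Each inner list becomes one row; labels are passed explicitly
df = pd.DataFrame(rows, columns=['col_A', 'col_B'], index=['row_1', 'row_2'])

# A list of dictionaries also works: the keys become column names
records = [{'col_A': 1, 'col_B': 2}, {'col_A': 3, 'col_B': 4}]
df_from_records = pd.DataFrame(records, index=['row_1', 'row_2'])

print(df)
print(df.equals(df_from_records))
```

Compare this with the dictionary form above, where each key holds a whole column rather than a row.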
