<h1>Pandas Playbook</h1>

This notebook helps you develop some intuition for the key features of *pandas* in a *query-like* way, with a minimum of Python syntax:  
* Basic properties and functions  
* Frequently used operations: slicing, selection, insertion, deletion and aggregation

*pandas* is a library for data exploration and analysis and is built upon other high-performance libraries (*numpy*, *scipy*, *matplotlib*).  
*pandas* provides easy-to-use data structures and data manipulation functions.

More resources on pandas:  
http://pandas.pydata.org/pandas-docs/stable/10min.html  
https://www.kaggle.com/rtatman/data-cleaning-challenge-scale-and-normalize-data/notebook

<h2>Import libraries</h2>

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Set pandas defaults
# Show max 10 rows: head(5) ... tail(5)
pd.set_option('display.max_rows', 10)

## Terminal
Use `!<bash command>` to run terminal commands from a notebook cell

In [None]:
!ls

In [None]:
!head -3 excel_test.xlsx

### Import Excel file

In [None]:
df = pd.read_excel('excel_test.xlsx', skiprows=2)
df

In [None]:
df.sample(5)

## Series
A Series is like one column of a dataframe; conversely, a dataframe consists of multiple Series that share the same index. A Series can also be a row from a dataframe!
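
To make the column/row relationship concrete, here is a minimal sketch on a hypothetical toy dataframe (the data and the labels `'a'` and `'b'` are illustrative and not taken from `excel_test.xlsx`):

```python
import pandas as pd

# Hypothetical toy data; column names and labels are illustrative only
toy = pd.DataFrame({'age': [25, 40], 'weight': [70.5, 82.0]}, index=['a', 'b'])

col = toy['age']      # a column is a Series indexed by the row labels
row = toy.loc['a']    # a row is a Series indexed by the column names

print(type(col).__name__, list(col.index))
print(type(row).__name__, list(row.index))
```

Note that the `name` of a column Series is the column label, while the `name` of a row Series is the row label.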

In [None]:
ser = df.age
ser.head(), ser.tail()

In [None]:
ser.dtype

In [None]:
ser.index

In [None]:
ser.describe()

In [None]:
ser.iloc[2]

In [None]:
ser.loc[[1, 100]]

In [None]:
ser[[4, 3, 1]]

In [None]:
# `in` tests the index labels of a Series, not its values
81 in ser

In [None]:
# Position of the index label 81 (not of the value 81)
ser.index.get_loc(81)

In [None]:
ser.max()

In [None]:
# idxmax() returns the index label of the maximum; argmax() returns its integer position
ser.idxmax()

In [None]:
ser[149]

### Mutable state
NOTE: Objects like Series and DataFrames are MUTABLE!

In [None]:
id(ser)

In [None]:
from copy import deepcopy
ser_copy = deepcopy(ser)

In [None]:
id(ser_copy)

In [None]:
ser_copy == ser

In [None]:
ser_copy is ser
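
What mutability means in practice can be sketched on a small, hypothetical Series: a second name bound to the same object sees in-place changes, while a copy does not. The variable names below are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
alias = s             # second name for the same object, no data copied
snapshot = s.copy()   # independent copy (copy() is deep by default)

s.iloc[0] = 100       # in-place change through the original name

print(alias.iloc[0])     # follows the change, because alias is s
print(snapshot.iloc[0])  # keeps the old value
```

`Series.copy()` is used here as the pandas-native counterpart of `deepcopy`.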

### DataFrame row to Series

In [None]:
row_0 = df.iloc[0,:]
type(row_0)

In [None]:
row_0

In [None]:
row_0.name

In [None]:
row_0 = row_0.rename('first_row')
row_0.name

In [None]:
'weight' in row_0

In [None]:
row_0.index

In [None]:
row_0.isnull().any()

## DataFrames and Series

### Show dataframe and column properties

In [None]:
df.info()

In [None]:
df.select_dtypes(include=['object', 'category'])

In [None]:
df.select_dtypes(include=['int64', 'float64']).columns

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df = df.rename(columns = {
    'sex':'sex',
    'id':'ID',
})
df.head(3)

### List comprehension

In [None]:
[col_name for col_name in df.columns]

In [None]:
df.columns = [col.upper() for col in df.columns]
df.tail(5)

In [None]:
df.columns = [col.lower() for col in df.columns]
df.head(5)

### Type, dtype, category

In [None]:
type(df['age'])

In [None]:
df['age'].dtype, df['sexe'].dtype

In [None]:
df['sexe'].factorize()
df['sexe'].astype('category')

In [None]:
df.values

In [None]:
df[:5]

In [None]:
df.iloc[-5:,]

In [None]:
organs = df.pop('organs')
organs[:5]

In [None]:
type(organs)

In [None]:
df.head()

In [None]:
del df['alcoholic']
df.head()

In [None]:
df.insert(5, 'organs', organs)
df.head(10)

### Check for type and NaN

In [None]:
df['height'].dtype

In [None]:
# Number of missing values in the column
df['height'].isnull().sum()

### Coerce string values to numeric values

In [None]:
df['height_in_meter'] = pd.to_numeric(df['height'], errors='coerce') / 100
df.head()

### Set type to float
Feature needs to be clean, no strings like 'ND'

In [None]:
# df.loc[:, 'height'].astype(np.float32, errors='ignore') # error
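
One possible way to get a clean float column is to coerce first and cast afterwards. The sketch below uses hypothetical height values, with 'ND' standing in for a non-numeric placeholder:

```python
import numpy as np
import pandas as pd

# Hypothetical raw values; 'ND' is a placeholder that cannot be parsed as a number
raw = pd.Series(['180', '175', 'ND'])

# errors='coerce' turns unparseable entries into NaN, after which the cast is safe
clean = pd.to_numeric(raw, errors='coerce').astype(np.float32)

print(clean.dtype)
print(clean.isna().sum())
```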

### Show summary statistics
* The "describe" function returns a dataframe containing summary stats for all numerical columns
* Columns containing non-numerical data are ignored
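
A minimal sketch of this behaviour on a hypothetical two-column dataframe, where only the numerical column appears in the summary:

```python
import pandas as pd

# Hypothetical toy data: one numerical and one text column
toy = pd.DataFrame({'age': [30, 40], 'sexe': ['m', 'f']})

summary = toy.describe()   # default: numerical columns only

print(list(summary.columns))
print(summary.loc['mean', 'age'])
```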

In [None]:
df.describe()

In [None]:
df['smoker'].mean()

### Transpose dataframe

In [None]:
df.describe().T

### Set precision format of DataFrame

In [None]:
pd.options.display.float_format = '{:.1f}'.format
df.describe().T

In [None]:
df.describe().T['75%']

### Get unique values

In [None]:
df['smoker'].unique()

### Get frequency

In [None]:
df.loc[:, 'smoker'].value_counts()

In [None]:
top = 3
df.loc[:, 'age'].value_counts().nlargest(top)

### Set index

In [None]:
df.set_index('id', inplace=True)
df.head()

### Reset index

In [None]:
# reset_index() returns a new DataFrame; without inplace=True, df keeps 'id' as its index
df.reset_index()
df[:5]
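
The difference between the returned dataframe and the original can be sketched on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data with 'id' as index
toy = pd.DataFrame({'id': [1, 2], 'age': [30, 40]}).set_index('id')

flat = toy.reset_index()   # new DataFrame with 'id' back as a regular column

print(toy.index.name)      # the original is unchanged
print(list(flat.columns))
```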

## Accessing columns and rows
There are multiple ways to select columns and rows.  
The preferred way is to use .loc (label-based) and .iloc (position-based)
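
As a minimal sketch of the difference, assume a hypothetical dataframe whose index labels (10, 20, 30) deliberately differ from the row positions (0, 1, 2):

```python
import pandas as pd

# Hypothetical toy data; labels differ from positions on purpose
toy = pd.DataFrame({'age': [30, 41, 52]}, index=[10, 20, 30])

by_label = toy.loc[10, 'age']    # row with index label 10
by_position = toy.iloc[0, 0]     # first row, first column

label_slice = toy.loc[10:20]     # label slices include the end label
pos_slice = toy.iloc[0:1]        # position slices exclude the end position

print(by_label, by_position)
print(len(label_slice), len(pos_slice))
```

The slice lengths differ because `.loc` slices are inclusive of the end label, whereas `.iloc` follows the usual Python half-open convention.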

<h3>Getting column data</h3>

In [None]:
df.age[:5]

In [None]:
df['age'][:5]

In [None]:
df.loc[:5, 'age']

<h3>Getting row data</h3>

In [None]:
#df[1] # KeyError

In [None]:
df.loc[1]

<h3>Getting a row by row number</h3>

In [None]:
df.iloc[0]

### Getting a column by column number

In [None]:
df.iloc[:5, 0]

### Getting multiple columns

In [None]:
df[['age','smoker']][:5]

In [None]:
df.loc[:5, ['age', 'smoker']]

### Getting a specific cell

In [None]:
df.loc[1,'age']

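### Chained indexing: view != copy
Watch out for this caveat! Chained indexing such as `df.loc[1]['age']` first builds an intermediate Series, which may be a copy of the data. Reading through it works, but assigning through it may not change `df`. Prefer a single `.loc[row, column]` call.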

In [None]:
df.loc[1]['age']

In [None]:
a = df.loc[1]['age']
b = df.loc[1,'age']
a == b

In [None]:
a is b

In [None]:
id(a) == id(b)
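
Why chained indexing is risky for assignment can be sketched as follows. The toy dataframe is hypothetical, and the intermediate row is copied explicitly so that the example behaves the same regardless of how a given pandas version handles views:

```python
import pandas as pd

# Hypothetical toy data with mixed column types
toy = pd.DataFrame({'age': [30, 40], 'name': ['x', 'y']}, index=[1, 2])

row = toy.loc[1].copy()          # the intermediate object, made explicit
row['age'] = 99                  # changes only the intermediate Series
after_chained = toy.loc[1, 'age']

toy.loc[1, 'age'] = 99           # a single .loc call writes into toy itself
after_loc = toy.loc[1, 'age']

print(after_chained, after_loc)
```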

## Slicing
By index and by row/column number 

In [None]:
df.index

In [None]:
df.loc[1] == df.iloc[0]

In [None]:
df.loc[5:10]

In [None]:
df.loc[5:10, ['age', 'weight']]

In [None]:
df.iloc[-5:, -3:]

### Conditional slicing

In [None]:
df[df.age < 18]

In [None]:
df[pd.isnull(df)].sample(5)

In [None]:
df.where((df['age'] > 53)).sample(5)

In [None]:
df.where((df['age'] > 52), None).sample(5)

In [None]:
df[(df['age'] > 50) & (df['weight'] < 50)].sort_values('weight')

### Slicing based on value list

In [None]:
age_range = range(0, 20)
age_range

In [None]:
valuelist = list(age_range)
valuelist

In [None]:
df[df.age.isin(valuelist)].sort_values(['sexe', 'age'])

### Create new (calculated) columns

In [None]:
df['is_smoker'] = df['smoker'] == 1
df.head()

In [None]:
df['bmi'] = df['weight'] / df['height_in_meter']**2
df.head()

### Conditionally replace values

In [None]:
df['adult'] = np.where(df.loc[:, 'age'] >= 21, 1, 0).astype(np.uint8)
df.sample(3)

In [None]:
df.loc[:, 'organs'] = np.where(df.loc[:, 'organs'] > 3, 3, df.loc[:, 'organs'])
df.organs.describe()

### Create dummy variables

In [None]:
df = pd.get_dummies(df, columns=['organs'], drop_first=True)
df

## Missing data

In [None]:
df.info()

### Get columns with NaN-like values

In [None]:
na_synonyms = ['ND', 'nan', 'NaN', 'NA', '']
contains_nan = [df.loc[:, feature].isin(na_synonyms).values.any() for feature in df.columns]
df.columns[contains_nan]

In [None]:
df['height'].unique()

In [None]:
df.loc[:, 'height'].value_counts()

In [None]:
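# Replace 'ND' in the height column only; df[mask] = np.nan would blank out entire rows
df.loc[df['height'] == 'ND', 'height'] = np.nan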
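
To limit a replacement to a single column, the boolean mask can be combined with a column label in `.loc`. A minimal sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'ND' marks a missing height
toy = pd.DataFrame({'height': ['180', 'ND', '175'], 'age': [30, 40, 50]})

# Row mask plus column label: only the 'height' cell of matching rows is changed
toy.loc[toy['height'] == 'ND', 'height'] = np.nan

print(toy['height'].isna().sum())
print(toy['age'].tolist())
```

The other columns of the matching rows, such as `age` here, keep their values.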

In [None]:
missing_values_count = df.isnull().sum()
missing_values_count.sort_values()

In [None]:
total_cells = np.prod(df.shape)
missing_values_count.sum() / total_cells

In [None]:
pd.isnull(df)

In [None]:
df.dropna(how='any', axis=0)

In [None]:
df.dropna(how='any', axis=1)

In [None]:
df.notnull().shape

In [None]:
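# Fill numerical columns with their column mean; non-numerical columns are left as they are
df.fillna(value=df.mean(numeric_only=True))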

In [None]:
df.notnull().shape

## Aggregation

In [None]:
df.groupby('sexe').count()

In [None]:
df[df.age.isin([16, 17, 18, 19, 20, 21])].groupby(['age', 'sexe']).count()['is_smoker']

### Get top-n by frequency

In [None]:
top = 3
df.groupby('age')['smoker'].count().sort_values(ascending=False).head(top).index.tolist()

## Stacking, pivoting

In [None]:
df.reset_index(inplace=True)
df.set_index(['sexe', 'is_smoker'], inplace=True)
df.sort_index().stack()[:18]

In [None]:
df.reset_index(inplace=True)
df.columns

In [None]:
pd.pivot_table(df, values='bmi', index=['sexe'], columns=['smoker'])

In [None]:
pd.pivot_table(df, values='bmi', index=['sexe'], columns=['smoker'], margins=True, aggfunc='std')

In [None]:
pd.crosstab(df['smoker'], df['age'])

In [None]:
df['age_group'] = pd.cut(df.age, range(0, 120, 20), right=False) #, labels=labels)
df['age_group'].describe()

In [None]:
df['age_group'].dtype

In [None]:
df['sexe'].astype('category'); # ';' kills output

In [None]:
pd.crosstab(df['smoker'], df['age_group'])

## Correlation matrix

In [None]:
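# Restrict to numerical columns; text and categorical columns cannot be correlated directly
df.corr(numeric_only=True)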

In [None]:
df.apply(lambda x: x.factorize()[0]).corr()

In [None]:
df.loc[:, ['age', 'weight']].plot();

In [None]:
df.loc[df['age'] > 50, ['weight']].plot()
df.loc[df['age'] <= 50, ['weight']].plot();