In [None]:
# update to the latest version
# ! git pull

# Pandas


Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.<br>
* A fast and efficient DataFrame object for data manipulation with integrated indexing
* Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases
* Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form
* Flexible reshaping and pivoting of data sets
* Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
* Columns can be inserted and deleted from data structures for size mutability
* Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets
* High performance merging and joining of data sets
* Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging<br>
<br>
The documentation can be found on
https://pandas.pydata.org/index.html

In [None]:
import pandas as pd #import pandas
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
%matplotlib inline
plt.style.use('ggplot')
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 30)

## 1. Series

* A Series is a one-dimensional object similar to an array, list, or column in a table.
* Each item in the Series gets a labeled index.
* By default, the index labels run from 0 to N-1, where N is the length of the Series.

In [None]:
serie = pd.Series([3,3.14,'Seven',None])
serie

#### Index
We can also set the index to something other than increasing integers


In [None]:
serie2 = pd.Series([3,3.14,'Seven',None], index=['a','b','c','d'])
serie2

The index can also be replaced afterwards

In [None]:
serie2.index = ['integer','float','string','null']
serie2

### Create a series from a dictionary
This is very useful, because a Python dictionary maps directly onto a `json` object

In [None]:
d_cities_population = {'Amsterdam':821752,
     'Istanbul':15030000,
     'London': 8200000, 
     'Paris': 2206000, 
     'Frankfurt':  732688 , 
     'Berlin': 3700000, 
     'Hamburg': 1800000, 
     'Manchester': 541000}
cities_population = pd.Series(d_cities_population)
cities_population

## Accessing the data

#### Index



In [None]:
cities_population.index

#### Values

In [None]:
cities_population.values

## Sorting and Filtering

### Sort the cities by population

In [None]:
# if you do not specify ascending = False, it will use the default value
# which is ascending = True
cities_population.sort_values(ascending = False) 

## Intermezzo 
It is always useful to read the documentation of a function.<br>
For this you can use Python's built-in `help` function (in a Jupyter Notebook, `cities_population.sort_values?` works as well)

In [None]:
help(cities_population.sort_values)

# Exercise 1
What happens if we use <br>
`cities_population.sort_values(inplace=True)`

### Filtering

I am interested in knowing which cities have more than 1 million citizens

In [None]:
cities_population>1000000

This returns a boolean mask (`True` or `False` for each city) that can be used for filtering.<br>
Example:

In [None]:
cities_population[cities_population>1000000]

Or the inverse, by using the `~` sign in front of the mask

In [None]:
mask = cities_population>1000000

In [None]:
cities_population[~mask]

#### Note<br>
You can save the mask in a variable (`mask` in the example) or use the comparison directly inside the brackets (as done above).
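Masks can also be combined. The sketch below uses a small made-up Series (`population`, not the notebook's `cities_population`) to show `&` (and) and `|` (or); the variable names are only for illustration.

```python
import pandas as pd

population = pd.Series({'Amsterdam': 821752, 'London': 8200000,
                        'Paris': 2206000, 'Manchester': 541000})

more_than_1m = population > 1_000_000
less_than_5m = population < 5_000_000

# & keeps the entries where both masks are True
mid_sized = population[more_than_1m & less_than_5m]

# | keeps the entries where at least one mask is True;
# inline comparisons must be wrapped in parentheses
small_or_huge = population[(population < 600_000) | (population > 5_000_000)]

print(mid_sized)
print(small_or_huge)
```

Python's `and` / `or` keywords do not work element by element on a Series, which is why the `&` and `|` operators are used here.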

### Create a dataframe from two series.
For this purpose, we need a new series with similar indexes<br>
Let's create one

In [None]:
d_cities_country = {'Amsterdam':'Netherlands',
     'Istanbul':'Turkey',
     'London': 'UK', 
     'Paris': 'France', 
     'Frankfurt': 'Germany', 
     'Berlin': 'Germany', 
     'Hamburg': 'Germany', 
     'Lyon':'France',
     'Manchester': 'UK'}
cities_country = pd.Series(d_cities_country)
cities_country

### Counting values
How many cities do I have per country in my dataset?

In [None]:
cities_country.value_counts()

### Note: the output of `value_counts` is a Series, where the index holds the unique values and the values are the number of times each one appears

What would happen if we apply `value_counts` twice, i.e.<br>

`cities_country.value_counts().value_counts()`?

# 2 Dataframes

* Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). 
* Arithmetic operations align on both row and column labels. 
* Can be thought of as a dict-like container for Series objects. 
* The primary pandas data structure.

### pandas concat

Read `help(pd.concat)` to see what this function does

In [None]:
pd.concat([cities_country, cities_population])

### Not really what we wanted, right?
* Pandas uses a two-dimensional representation of the data (a table), with rows and columns.<br>
* When applying an operation, we need to specify the direction (i.e. along the rows or along the columns).<br>
* This is defined by the `axis` parameter: `axis=0` refers to the rows, `axis=1` to the columns.
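To get a feeling for `axis`, here is a minimal sketch on a made-up table (the name `table` is only for illustration), using `sum` instead of `concat`:

```python
import pandas as pd

table = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 moves down the rows: one result per column
column_totals = table.sum(axis=0)

# axis=1 moves across the columns: one result per row
row_totals = table.sum(axis=1)

print(column_totals)
print(row_totals)
```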

In [None]:
### concat along the 2nd axis, axis=1 (python starts indexing from 0)
### We will save the output into a new object, called cities
cities=pd.concat([cities_country, cities_population], axis = 1)

In [None]:
cities

#### This is a DataFrame

In [None]:
type(cities)

### How about we give a nice name to the columns?

In [None]:
cities.columns=['Country','Population']
cities

### Accessing the column

In [None]:
cities['Country']

# Exercise 2:
What is the difference between :
* `cities['Country']`
* `cities[['Country']]`
* `cities.Country`

### Filters, sorting
It works the same way as in Series.<br>
Let's filter out the missing value of `Lyon`.<br>
The function `isnull()` tells for each value whether it is null or not.

In [None]:
cities.isnull()

We can use this boolean mask for filtering values

In [None]:
cities[~cities.Population.isnull()] #remember, ~ is negation
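Two equivalent ways of dropping the rows with a missing population, sketched on a small made-up frame (`demo` is not the notebook's `cities` DataFrame):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Country': ['France', 'France', 'UK'],
                     'Population': [2206000, np.nan, 541000]},
                    index=['Paris', 'Lyon', 'Manchester'])

# notna() is the opposite of isnull(), so no ~ is needed
kept_with_mask = demo[demo['Population'].notna()]

# dropna() removes the rows that have a missing value in the listed columns
kept_with_dropna = demo.dropna(subset=['Population'])

print(kept_with_dropna)
```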

## Operations on the DataFrame

Let's create a function that tells whether the country of a city is in the EU or not

In [None]:
def is_EU(x):
    return x in ['Netherlands','Germany','UK','France']

def is_EU_after2019(x):
    '''
    remove the UK after Brexit
    '''
    return x in ['Netherlands','Germany','France']

In [None]:
cities['Country'].apply(is_EU)

In [None]:
cities['Country'].apply(is_EU_after2019)

### But we would like something nicer, like a column saying whether it is EU or non-EU

In [None]:
cities

### Applying a function returns a Series. We can add it as a new column to the DataFrame

In [None]:
cities['isEU']=cities['Country'].apply(is_EU_after2019)

In [None]:
cities

In [None]:
### Let's now convert it into strings
cities['European_Union']=cities.isEU.apply(lambda x: 'EU' if x else 'no EU')

In [None]:
cities
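`apply` calls the Python function once per value. For a simple membership test like this one, a vectorized version is also possible. The sketch below works on a small made-up frame (`demo`), and the list `eu_countries` simply repeats the countries used in `is_EU_after2019`.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Country': ['Netherlands', 'Turkey', 'UK', 'France']},
                    index=['Amsterdam', 'Istanbul', 'London', 'Paris'])

eu_countries = ['Netherlands', 'Germany', 'France']

# isin() checks every row at once and returns a boolean Series
in_eu = demo['Country'].isin(eu_countries)

# np.where picks 'EU' where the mask is True and 'no EU' elsewhere
demo['European_Union'] = np.where(in_eu, 'EU', 'no EU')

print(demo)
```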

### The drop function
 

In [None]:
cities.drop('isEU', axis=1)

### The same can be achieved by selecting the columns we want to keep

In [None]:
cities[['Country','Population','European_Union']]

In [None]:
#REMOVE IN PLACE
cities.drop('isEU',inplace=True, axis=1)

In [None]:
cities

## Missing Values

Missing values can be replaced by the fillna function.<br>
Let's see the help

In [None]:
help(cities.fillna)

For our use case, we want to replace the Population of Lyon with the value we know from Wikipedia, approximately half a million

In [None]:
cities.fillna(value=500000)

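### It is a bit safer to fill only the column we mean

In [None]:
# fill only the Population column and assign the result back to it
# (calling fillna(..., inplace=True) on a selected column is not reliable in recent pandas versions)
cities['Population'] = cities['Population'].fillna(value=500000)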

In [None]:
cities
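`fillna` also accepts a dictionary that maps column names to fill values, so that only the listed columns are touched. A minimal sketch on a made-up frame (`demo`, with an extra missing country that should stay missing):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Country': ['France', None],
                     'Population': [np.nan, 541000.0]},
                    index=['Lyon', 'Manchester'])

# Only the Population column is filled; the missing Country is left alone
filled = demo.fillna({'Population': 500000})

print(filled)
```

Without `inplace=True`, `fillna` returns a new DataFrame and leaves `demo` unchanged.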

## Exercise 3:
The population is a float. Convert it to integers

## Group-By and Aggregations

In [None]:
grouped_cities = cities.groupby('Country')

Is it a Dataframe? Not really

In [None]:
type(grouped_cities)

#### Loop over the different groups

In [None]:
for group, df in grouped_cities:
    print(df)
    print('\n')

### Apply functions

Sum the population per country

In [None]:
grouped_cities['Population'].sum()

### Multiple aggregations

In [None]:
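# named aggregation: new_column_name='function'
# (passing a dict of new names raises an error in pandas 1.0 and later)
countries = grouped_cities['Population'].agg(Total_population='sum', Average_population='mean')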
countries

#### The group-by key (Country in our case) is now by default the index.

But it is probably more useful to have it as a column in a new DataFrame.<br>
If you check the groupby documentation, you can see that there is a parameter `as_index`, which is `True` by default.<br>
Setting it to `False` does the trick

In [None]:
# named aggregation: new_column_name=(source_column, function)
countries = cities.groupby('Country', as_index=False).agg(
    Total_population=('Population', 'sum'),
    Average_population=('Population', 'mean'),
)
countries

## Joining DataFrames
https://pandas.pydata.org/pandas-docs/stable/merging.html

In [None]:
help(pd.merge)

In [None]:
pd.merge(left=cities,
        right = countries,
        on='Country')

### Where are the names of the cities?

In [None]:
cities['City'] = cities.index

In [None]:
cities.reset_index(drop=True, inplace=True)

### Now we can rejoin

In [None]:
extended_cities = pd.merge(left=cities,
        right = countries,
        on='Country')
extended_cities
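By default `pd.merge` performs an inner join, so rows without a match on either side disappear. The sketch below, on two made-up tables (`left` and `right`, with illustrative numbers), shows the difference with `how='left'`; `indicator=True` adds a `_merge` column that tells where each row came from.

```python
import pandas as pd

left = pd.DataFrame({'City': ['Paris', 'Istanbul'],
                     'Country': ['France', 'Turkey']})
right = pd.DataFrame({'Country': ['France', 'Germany'],
                      'Total_population': [2706000, 6232688]})

# default: how='inner', only countries present in both tables survive
inner = pd.merge(left, right, on='Country')

# how='left' keeps every row of the left table; missing matches become NaN
left_join = pd.merge(left, right, on='Country', how='left', indicator=True)

print(inner)
print(left_join)
```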

## Now we have more information. We could use it to create more features/variables

For instance, I want to know which fraction of the country total (the sum over the cities of that country in our dataset) comes from a given city

In [None]:
extended_cities['Population_fraction']=extended_cities['Population']/extended_cities['Total_population']
extended_cities
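The same kind of fraction can be obtained without a merge: `transform` computes the group total and returns it aligned with the original rows. This is a sketch on a small made-up frame (`demo`), not a replacement for the steps above.

```python
import pandas as pd

demo = pd.DataFrame({'City': ['Berlin', 'Hamburg', 'Paris'],
                     'Country': ['Germany', 'Germany', 'France'],
                     'Population': [3700000, 1800000, 2206000]})

# transform('sum') repeats the country total on every row of that country
country_total = demo.groupby('Country')['Population'].transform('sum')

demo['Population_fraction'] = demo['Population'] / country_total

print(demo)
```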

# 3 Reading data from files

So far we have seen academic examples with small, hand-typed data.<br>
Let's now see how we can import data from files

In [None]:
help(pd.read_csv)

In [None]:
data = pd.read_csv('../data/UCI_Credit_Card.csv')

The dataset has been downloaded from Kaggle

https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset

### Let's get some general information about the dataset

How many rows and columns?

In [None]:
data.shape
# 30000 rows and 25 columns

Which columns do we have?

In [None]:
data.columns

Can we say something more about it? For instance, of which type is the data contained in the columns?

In [None]:
data.dtypes

# Important:

Before doing any data science project, it is important to understand the inputs.
On the Kaggle website you can find the information about the dataset, which is reported below

Content

There are 25 variables:

* ID: ID of each client
* LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
* SEX: Gender (1=male, 2=female)
* EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
* MARRIAGE: Marital status (1=married, 2=single, 3=others)
* AGE: Age in years
* PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
* PAY_2: Repayment status in August, 2005 (scale same as above)
* PAY_3: Repayment status in July, 2005 (scale same as above)
* PAY_4: Repayment status in June, 2005 (scale same as above)
* PAY_5: Repayment status in May, 2005 (scale same as above)
* PAY_6: Repayment status in April, 2005 (scale same as above)
* BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
* BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
* BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
* BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
* BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
* BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
* PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
* PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
* PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
* PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
* PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
* PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
* default.payment.next.month: Default payment (1=yes, 0=no)

### Hmmm... Why does the PAY column for September end in 0, while the other September columns end in 1?
Let's rename it

# Exercise: 
Convert the name of the column from PAY_0 to PAY_1

In [None]:
*** your solution here ***

Checking only the first 10 rows to inspect the values

In [None]:
data.head(10)

Or the last 7

In [None]:
data.tail(7)

## See some statistics about my dataset

In [None]:
data.describe() # shows the basic stats about the dataset

## Missing values
Before we start processing the data, we need to check if there are any missing values.

In [None]:
data.isnull().head() # True for every cell whose value is missing, False otherwise (head() shows the first 5 rows)

In [None]:
# any() checks every column of the boolean table.
# If a column contains at least one True (i.e. one missing value), it returns True for that column
data.isnull().any() 

In our case we are lucky (all the values are False, we do not have any missing values). <br>
However, in case of missing values, one needs to think of a strategy to deal with them
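As an illustration of what such a strategy could look like, here is a sketch on a small made-up frame (`demo`; the credit card dataset itself has no missing values). It counts the missing values and shows two common options: dropping incomplete rows, or filling with the column median. Which option is appropriate depends on the problem.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'AGE': [25.0, np.nan, 40.0, 31.0],
                     'LIMIT_BAL': [20000.0, 50000.0, np.nan, 90000.0]})

# How many values are missing in each column?
missing_per_column = demo.isnull().sum()

# Option 1: drop every row that has at least one missing value
dropped = demo.dropna()

# Option 2: fill each column with its own median
filled = demo.fillna(demo.median())

print(missing_per_column)
print(filled)
```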

# Intermezzo: list comprehension in python and applications to a Dataframe
List comprehension is an elegant way to define and create lists in Python. <br>

Example: I want to create a list of the numbers from 0 to 9 that are divisible by 3



In [None]:
# Returns all the numbers from 0 to 9 included
[x for x in range(10)]

In [None]:
# Returns all the numbers from 0 to 9 included where the remainder of the division by 3 is 0 
# (hence numbers divisible by 3)
[x for x in range(10) if x%3==0]

##### Back to the DataFrames
Example: `data.columns` returns an iterable with the column names of the DataFrame. <br>

Let's say we are interested in showing the statistics of the columns related to the monthly bill.<br>
All of these columns start with `BILL_`, and we can use this knowledge to select them in one line of code

In [None]:
# List comprehension
# Return all the columns of the dataframe data where the first 4 characters equal 'BILL'
[col for col in data.columns if col[:4]=='BILL']

In [None]:
# And using this to slice the columns
data[[col for col in data.columns if col[:4]=='BILL']].head() # remember, head() shows only the first 5 rows
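The same selection can be written in a few other ways. The sketch below uses a one-row made-up frame (`demo`) with a handful of the column names; all three variants select the same columns.

```python
import pandas as pd

demo = pd.DataFrame([[1, 20000, 3913, 3102, 0]],
                    columns=['ID', 'LIMIT_BAL', 'BILL_AMT1', 'BILL_AMT2', 'PAY_AMT1'])

# str.startswith reads a bit more clearly than slicing with col[:4]
bill_cols_a = [col for col in demo.columns if col.startswith('BILL')]

# the .str accessor applies the same test to all column names at once
bill_cols_b = demo.columns[demo.columns.str.startswith('BILL')].tolist()

# filter(like=...) keeps the columns whose name contains the given text
bill_frame = demo.filter(like='BILL_AMT')

print(bill_cols_a)
print(bill_frame)
```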

Another example: <br>
    
Let's find the 'categorical' variables, assuming that categorical variables are those with fewer than 10 unique values.<br>
`data['PAY_1'].unique()` returns all the unique values of the PAY_1 column 

In [None]:
print('array:',data['PAY_1'].unique()) # returns the array
print('length:',data['PAY_1'].unique().shape[0]) # returns the length of the array

Using list comprehension, we can find the 'categorical' columns

In [None]:
[col for col in data.columns if data[col].unique().shape[0]<10]
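pandas also has `nunique()`, which counts the unique values of every column in one call. A sketch on a tiny made-up frame (`demo`); the threshold of 3 is chosen only because the example is so small.

```python
import pandas as pd

demo = pd.DataFrame({'SEX': [1, 2, 2, 1],
                     'AGE': [24, 35, 41, 52]})

# number of distinct values per column
unique_counts = demo.nunique()

# keep the names of the columns with fewer than 3 distinct values
few_values = unique_counts[unique_counts < 3].index.tolist()

print(unique_counts)
print(few_values)
```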

## Categorical variables

The trick shown above works in most cases; however, one needs to be careful with categorical variables.<br>
Taking the definition from Wikipedia:<br>
* In statistics, a categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property.<br>

Categorical variables represented with numbers (as in our example) can get our machine learning model into trouble, as the model will interpret them as ordered values.<br>
Take the categorical variables in  our dataset: 
* SEX: Gender (1=male, 2=female)
* EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
* MARRIAGE: Marital status (1=married, 2=single, 3=others)

The numbers suggest that `FEMALE`>`MALE` (2>1), but such an ordering has no meaning.
Let's do the exercise, and then we will see how to deal with it

### Categorical variables: (one-hot) encodings
We would like to represent our categorical variables in a way that our model can process them without wrongly assuming an ordering.<br>
A one-hot encoding is a group of bits in which the only legal combinations are those with a single high (1) bit and all the others low (0).<br>
Let's see an example that will make it clearer:<br>
in `pandas` you can create one-hot encodings by using `pd.get_dummies()`



In [None]:
data['SEX'].value_counts()

In [None]:
pd.get_dummies(data['SEX']).head(10)

In [None]:
ohe_data = pd.get_dummies(data,columns = ['SEX','MARRIAGE','EDUCATION'])
ohe_data.shape, data.shape

#### Note:
* we saved the one-hot encoded DataFrame into a new variable (`ohe_data`)
* if we look at the shape, we added 8 more columns. This is what happens with OHE: the number of columns can explode quickly!
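To see the single high bit per row, here is a sketch on a three-row made-up frame (`demo`):

```python
import pandas as pd

demo = pd.DataFrame({'SEX': [1, 2, 2],
                     'AGE': [24, 35, 41]})

# SEX is replaced by one column per category, named <column>_<value>
encoded = pd.get_dummies(demo, columns=['SEX'])

# exactly one of the new columns is "on" in every row
bits_per_row = encoded[['SEX_1', 'SEX_2']].astype(int).sum(axis=1)

print(encoded)
```

The columns that are not listed in `columns` (here `AGE`) are left as they are.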

## Correlation between variables





In [None]:
ohe_data.corr()

In [None]:
from plotting import plot_dataframe_correlations

In [None]:
plot_dataframe_correlations(ohe_data)

### Let's check the correlations with our target variable

In [None]:
ohe_data.corr()['default.payment.next.month'].sort_values(ascending=False)

# Feature engineering - an example

With a view to building a model, we would like to enrich the information that we have by creating features.<br>
This could allow our model to perform better.<br>
<br>
Feature engineering is driven by the creativity of the Data Scientist, common sense, and business logic.<br>
In our problem, we would like to be able to predict defaults (i.e. the value of `default.payment.next.month`). <br>
We have a lot of information available per client: by common sense, we can imagine that comparing the BILL amount to the total limit on the credit card might be a good indicator of a probable default.<br>
We can create a feature that represents this, by taking the ratio of the BILL_AMT to the LIMIT_BAL

In [None]:
ohe_data['BILL_AMT1']/ohe_data['LIMIT_BAL']

We could also do it for all the bills, in a loop

In [None]:
# as before
bill_columns = [col for col in ohe_data.columns if col[:4]=='BILL']
bill_columns

We also want to create new names for the columns. We can use Python's `str.format` for this.<br>
See this example

In [None]:
for col in bill_columns:
    print('ratio_{}_to_LIMIT'.format(col))

In [None]:
for col in bill_columns:
    
    # define the new column name
    new_column_name = 'ratio_{}_to_LIMIT'.format(col)
    
    # perform the ratio operation, and assign it to the new column
    ohe_data[new_column_name]=ohe_data[col]/ohe_data['LIMIT_BAL']

In [None]:
# as expected, we added 6 new columns (we had 33 before)
ohe_data.shape
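The loop can also be replaced by a single division over all the bill columns at once. A sketch on a two-row made-up frame (`demo`, with only two bill columns):

```python
import pandas as pd

demo = pd.DataFrame({'LIMIT_BAL': [1000.0, 2000.0],
                     'BILL_AMT1': [500.0, 2000.0],
                     'BILL_AMT2': [250.0, 0.0]})
bill_cols = ['BILL_AMT1', 'BILL_AMT2']

# div(..., axis=0) divides every bill column by LIMIT_BAL, row by row
ratios = demo[bill_cols].div(demo['LIMIT_BAL'], axis=0)

# same naming scheme as in the loop
ratios.columns = ['ratio_{}_to_LIMIT'.format(col) for col in bill_cols]

# attach the new columns next to the original ones
demo = pd.concat([demo, ratios], axis=1)

print(demo)
```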

# Functions on rows

Above we have seen how to apply a function to the values of a column.<br>
But what about applying a function to each row? <br>

For instance, if you want to compute the average bill of each client 

In [None]:
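def find_mean(df, columns):
    # add up the values of the requested columns for one row
    sum_ = 0
    for column in columns:
        sum_ += df[column]

    # divide by the number of columns to turn the sum into a mean
    return sum_ / len(columns)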

In [None]:
data['average_bill'] = data.apply(lambda x: find_mean(x,bill_columns), axis=1)
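For a simple statistic like the mean, pandas can work across the columns directly, without `apply`. A sketch on a made-up frame (`demo`):

```python
import pandas as pd

demo = pd.DataFrame({'BILL_AMT1': [100.0, 0.0],
                     'BILL_AMT2': [200.0, 30.0],
                     'BILL_AMT3': [300.0, 60.0]})
bill_cols = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3']

# axis=1: take the mean across the selected columns, one value per row
demo['average_bill'] = demo[bill_cols].mean(axis=1)

print(demo)
```

On a dataset with many rows this is usually much faster than a row-wise `apply`, which calls a Python function once per row.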

# Exercise (20 minutes)

Earlier we looked into fitting 6 points with a linear regression.<br>
Now we want to use it to our advantage, by building a feature that tells us something about the change in trend, for instance:
* is the bill amount increasing or decreasing in the last 6 months?


In [None]:
import utils

In [None]:
def compute_trend(df,columns):
    coordinates = [(-ix,df[column]) for ix, column in enumerate(columns)]
    
    return utils.compute_slope(*coordinates)

In [None]:
data['bill_trend'] = data.apply(lambda x: compute_trend(x,bill_columns), axis =1)

In [None]:
data.head()
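The code of `utils.compute_slope` is not shown in this notebook. To make the idea concrete, the sketch below uses a hypothetical stand-in, `slope_of_points`, which fits a straight line with `np.polyfit`; this is an assumption for illustration and not necessarily how `utils` does it. The frame `demo` is made up.

```python
import numpy as np
import pandas as pd


def slope_of_points(*coordinates):
    """Hypothetical stand-in for utils.compute_slope:
    least-squares slope of a line through (x, y) points."""
    xs = [x for x, _ in coordinates]
    ys = [y for _, y in coordinates]
    slope, _intercept = np.polyfit(xs, ys, deg=1)
    return slope


def trend(row, columns):
    # column number k lies k months in the past, so it gets x = -k
    coordinates = [(-ix, row[column]) for ix, column in enumerate(columns)]
    return slope_of_points(*coordinates)


demo = pd.DataFrame({'BILL_AMT1': [300.0, 100.0],   # most recent month
                     'BILL_AMT2': [200.0, 200.0],
                     'BILL_AMT3': [100.0, 300.0]})  # oldest month
bill_cols = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3']

demo['bill_trend'] = demo.apply(lambda row: trend(row, bill_cols), axis=1)

print(demo)
```

A positive slope means the bill has been growing over time (first row), a negative slope means it has been shrinking (second row).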

# Exporting the dataframe to a file 

The DataFrame can be exported to different formats, using different functions, with the most common being:<br>
* `to_csv()`: save it to a csv (comma separated values) file
* `to_pickle()`: save it to pickle. Pickle is Python's serialization format: it can store almost any Python object (your data, but also models, arrays, etc.) so that it can be loaded in another Python session
* `to_json()`: save it to json
* `to_excel()`: well, we work in a bank, sometimes we need Excel as well :(

Read the documentation of each function, to be sure you do not miss some important details

In [None]:
ohe_data.to_csv('data/processed_data_by_trainees.csv')


In [None]:
# let's see the content of the folder data (! executes a unix command, but this goes beyond the scope of this training)
! ls -lh data
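One detail worth knowing: by default `to_csv()` also writes the index as an extra first column. The sketch below writes a made-up frame (`demo`) to a temporary folder, with and without the index, and reads both files back.

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({'ID': [1, 2], 'AGE': [24, 35]})

with tempfile.TemporaryDirectory() as folder:
    with_index = os.path.join(folder, 'with_index.csv')
    without_index = os.path.join(folder, 'without_index.csv')

    demo.to_csv(with_index)                   # index is written as the first column
    demo.to_csv(without_index, index=False)   # only the real columns are written

    reloaded_with_index = pd.read_csv(with_index)
    reloaded_without_index = pd.read_csv(without_index)

print(reloaded_with_index.columns.tolist())
print(reloaded_without_index.columns.tolist())
```

When the index carries no information (a plain 0, 1, 2, ...), `index=False` avoids an `Unnamed: 0` column when the file is read back.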

# Plotting with pandas
Pandas has some integrated functions to do quick plots

#### Histograms

In [None]:
data['AGE'].hist(color='blue')

#### Scatter plots

In [None]:
data.plot(kind='scatter', x='BILL_AMT1', y='PAY_AMT1')

### Plot histograms by different groupby keys
Example: what do the age distributions for male and female look like?


In [None]:
data[['AGE','SEX']].hist(by='SEX', figsize=(12, 8))

In [None]:
data[[col for col in data.columns if 'PAY_AMT' in col]].plot.box(figsize=(20, 8))

### Time series

`Definition`: a time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data.

In pandas you can plot time series as well.
By selecting a column and then calling plot, it will plot the values ordered by the index

In [None]:
# This is not really a time series (each index represents a different client, so the values are not a dependent sequence), 
# but we show it here for the sake of teaching the pandas API 
data['PAY_AMT1'].plot()
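For comparison, this is what an actual time series looks like in pandas: a Series whose index consists of dates. The values below are made up for illustration.

```python
import pandas as pd

# six month-start dates, from April to September 2005
dates = pd.date_range('2005-04-01', periods=6, freq='MS')

payments = pd.Series([1000.0, 1200.0, 900.0, 1500.0, 1100.0, 1300.0],
                     index=dates)

# moving window statistic: average over the current and the two previous months
rolling_mean = payments.rolling(window=3).mean()

print(payments)
print(rolling_mean)
```

Calling `payments.plot()` on such a Series puts the dates on the x-axis. The first two values of `rolling_mean` are missing, because a window of three months is not yet complete there.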