# Activity 5: Calculate conditional probabilities

Let's practise producing frequency and probability tables using the *pandas* package. This is usually the first step in an analysis of categorical (qualitative) data, and it complements visualisation: charts give a quick overview, while tables provide the exact figures. We will then move on to conditional probabilities.

We will use the "German Credit Data" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)).

Variables used:
- VAR19: Telephone (known/unknown)
- VAR21: Status, the target variable (good/bad)
- VAR15: Housing (rent/own/for free)

In [1]:
import pandas as pd

# Demo for the Telephone variable
# Read in only the two required columns
credit_data1 = pd.read_csv("german_credit.csv", usecols=["Telephone", "Status"])
credit_data1.head(10)

,Telephone,Status
0,Known,Good
1,Unknown,Bad
2,Unknown,Good
3,Unknown,Good
4,Unknown,Bad
5,Known,Good
6,Unknown,Good
7,Known,Good
8,Unknown,Good
9,Unknown,Bad


In [2]:
# Produce a one-way frequency table; its single column is labelled 'Count'
telephone_freq = pd.crosstab(index=credit_data1["Telephone"], columns="Count")
telephone_freq

col_0,Count
Telephone,
Known,404
Unknown,596


In [3]:
# To get proportions, i.e. probabilities, use the 'normalize' option
# Produce a one-way table of proportions; its single column is labelled 'Prob'
telephone_prob = pd.crosstab(index=credit_data1["Telephone"], columns="Prob", normalize='columns')
telephone_prob

col_0,Prob
Telephone,
Known,0.404
Unknown,0.596
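
As a cross-check on the two one-way tables above, the same figures can be obtained with `Series.value_counts`. This is only a sketch: since the CSV file is not bundled here, the `Telephone` column is rebuilt from the counts shown above (404 Known, 596 Unknown), and the names `telephone`, `telephone_counts` and `telephone_props` are illustrative rather than part of the activity.

```python
import pandas as pd

# Assumption: rebuild the Telephone column from the counts in the one-way table
# (404 Known, 596 Unknown) instead of reading german_credit.csv
telephone = pd.Series(["Known"] * 404 + ["Unknown"] * 596, name="Telephone")

# One-way frequency table (counts per category)
telephone_counts = telephone.value_counts().sort_index()

# One-way probability table (counts divided by the number of rows)
telephone_props = telephone.value_counts(normalize=True).sort_index()

print(telephone_counts)
print(telephone_props)
```

With the real data, `credit_data1["Telephone"].value_counts(normalize=True)` should give the same proportions as the `crosstab` call; `crosstab` becomes the more convenient tool once a second variable is involved.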


In [4]:
# To produce a two-way table, pass the second variable as the 'columns' argument
# margins=True adds the row and column totals, labelled 'All'
telephone_cross = pd.crosstab(index=credit_data1["Telephone"], columns=credit_data1["Status"], margins=True)
telephone_cross

Status,Bad,Good,All
Telephone,,,
Known,113,291,404
Unknown,187,409,596
All,300,700,1000


In [5]:
# To produce a two-way table with proportions, use normalize
# normalize='columns' divides each column by its total, so each entry is P(Telephone | Status)
telephone_crossc = pd.crosstab(index=credit_data1["Telephone"], columns=credit_data1["Status"], margins=True, normalize='columns')
telephone_crossc

Status,Bad,Good,All
Telephone,,,
Known,0.376667,0.415714,0.404
Unknown,0.623333,0.584286,0.596
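
The column-normalised table above is a table of conditional probabilities: each entry is the count in a cell divided by its column total, i.e. P(Telephone | Status). The sketch below reproduces those figures by hand from the counts in the two-way table, and also conditions the other way round to get P(Status | Telephone). The `counts` frame is typed in from the output above rather than read from the CSV, and the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the two-way table above (rows: Telephone, columns: Status)
counts = pd.DataFrame(
    {"Bad": [113, 187], "Good": [291, 409]},
    index=pd.Index(["Known", "Unknown"], name="Telephone"),
)
counts.columns.name = "Status"

# P(Telephone | Status): divide each column by its column total
p_tel_given_status = counts / counts.sum(axis=0)

# P(Status | Telephone): divide each row by its row total
p_status_given_tel = counts.div(counts.sum(axis=1), axis=0)

print(p_tel_given_status.round(6))
print(p_status_given_tel.round(6))
```

The two tables answer different questions. P(Known | Bad) = 113/300 is the share of bad-credit customers with a known telephone, whereas P(Bad | Known) = 113/404 (about 0.280) is the share of customers with a known telephone who are bad credit risks. With the original data, passing `normalize='index'` to `pd.crosstab` should give the row-normalised version directly.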


In [6]:
# The same for Housing
credit_data2 = pd.read_csv("german_credit.csv", usecols=["Housing", "Status"])
credit_data2.head(10)

,Housing,Status
0,Own,Good
1,Own,Bad
2,Own,Good
3,Free,Good
4,Free,Bad
5,Free,Good
6,Own,Good
7,Rent,Good
8,Own,Good
9,Own,Bad


In [7]:
# We can skip the one-way table, since its information appears in the margins of the two-way table
housing_cross = pd.crosstab(index=credit_data2["Housing"], columns=credit_data2["Status"], margins=True)
housing_cross

Status,Bad,Good,All
Housing,,,
Free,44,64,108
Own,186,527,713
Rent,70,109,179
All,300,700,1000


In [8]:
# Column proportions: each entry is the conditional probability P(Housing | Status)
housing_crossc = pd.crosstab(index=credit_data2["Housing"], columns=credit_data2["Status"], margins=True, normalize='columns')
housing_crossc

Status,Bad,Good,All
Housing,,,
Free,0.146667,0.091429,0.108
Own,0.62,0.752857,0.713
Rent,0.233333,0.155714,0.179
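
The table above gives P(Housing | Status), but for credit scoring the more useful quantity is often the reverse, P(Bad | Housing). One way to obtain it, sketched below, is Bayes' rule: P(Bad | Housing) = P(Housing | Bad) × P(Bad) / P(Housing). The sketch also computes the same probabilities directly from the row totals to confirm that the two routes agree. The counts are typed in from the Housing two-way table above, and all variable names are illustrative.

```python
import pandas as pd

# Counts copied from the Housing two-way table above (rows: Housing, columns: Status)
counts = pd.DataFrame(
    {"Bad": [44, 186, 70], "Good": [64, 527, 109]},
    index=pd.Index(["Free", "Own", "Rent"], name="Housing"),
)
n = counts.to_numpy().sum()  # total number of customers (1000)

# Ingredients for Bayes' rule
p_housing_given_bad = counts["Bad"] / counts["Bad"].sum()  # the 'Bad' column of the table above
p_bad = counts["Bad"].sum() / n                            # marginal P(Bad) = 0.3
p_housing = counts.sum(axis=1) / n                         # the 'All' column of the table above

# Bayes' rule: P(Bad | Housing) = P(Housing | Bad) * P(Bad) / P(Housing)
p_bad_given_housing = p_housing_given_bad * p_bad / p_housing

# Direct route: bad customers in each housing group divided by the group size
direct = counts["Bad"] / counts.sum(axis=1)

print(p_bad_given_housing.round(4))
print(direct.round(4))
```

Both routes give roughly 0.407 for Free, 0.261 for Own and 0.391 for Rent, so in this sample customers who own their home have the lowest proportion of bad credit.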
