### Preprocessing (31 pts) ###

In [None]:
import pandas as pd
import numpy as np
gc = pd.read_csv('GermanCredit.csv')
gc.head()

1. [8 pts] Drop the 3 columns that contribute the least to the dataset. These are the columns with the most 'none' values. Break ties by going left to right across the columns. (Your code should generalize to dropping n columns, but for the rest of the analysis you can call it with n=3.)

In [None]:
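n = 3
# Turn every 'none' string (any capitalization) into NaN so it counts as missing.
for col in gc.columns:
    gc[col] = gc[col].apply(lambda x: x if str(x).lower() != 'none' else np.nan)
# Record [name, position, non-missing count] for each column.
dropping = []
for i, col in enumerate(gc.columns):
    dropping.append([col, i, gc[col].count()])
# Fewest non-missing values first; ties go to the leftmost column.
dropping.sort(key=lambda x: (x[-1], x[1]))
gc = gc.drop([x[0] for x in dropping[:n]], axis=1)
gc.shape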
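
The tie-breaking rule is easy to check on a small made-up frame. The following is only a sketch: `drop_most_none` and the toy columns are hypothetical names, not part of the assignment, and it counts 'none' strings directly instead of converting them to NaN first.

```python
import pandas as pd

def drop_most_none(df, n):
    """Drop the n columns with the most 'none' entries (hypothetical helper).

    Ties are broken left to right by column position.
    """
    # Count case-insensitive 'none' strings per column.
    counts = [(df[col].astype(str).str.lower().eq('none').sum(), pos, col)
              for pos, col in enumerate(df.columns)]
    # Most 'none' values first; for equal counts, the leftmost column first.
    counts.sort(key=lambda t: (-t[0], t[1]))
    return df.drop(columns=[col for _, _, col in counts[:n]])

toy = pd.DataFrame({
    'a': ['none', 'x', 'none'],
    'b': ['y', 'none', 'None'],
    'c': ['z', 'z', 'z'],
    'd': ['none', 'none', 'none'],
})
result = drop_most_none(toy, 2)
print(list(result.columns))
```

Here `a` and `b` tie with two 'none' values each, so after `d` is dropped the leftmost of the two (`a`) goes next.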

2. [4 pts] Certain values in some of the columns contain unnecessary apostrophes ('). Remove the apostrophes.

In [None]:
gc.replace('\'','', regex=True, inplace=True) 
gc.head()

3. [5 pts] The checking_status column has values in 4 categories: 'no checking', '<0', '0<=X<200', and '>=200'. Change these to 'No Checking', 'Low', 'Medium', and 'High' respectively.

In [None]:
gc['checking_status'] = gc['checking_status'].str.replace('no checking', 'No Checking')
gc['checking_status'] = gc['checking_status'].str.replace('<0', 'Low')
gc['checking_status'] = gc['checking_status'].str.replace('0<=X<200', 'Medium')
gc['checking_status'] = gc['checking_status'].str.replace('>=200', 'High')
gc.head()

4. [5 pts] The savings_status column has values in 5 categories: 'no known savings', '<100', '100<=X<500', '500<=X<1000', and '>=1000'. Change these to 'No Savings', 'Low', 'Medium', 'High', and 'High' respectively. (Yes, the last two are both 'High'.)

In [None]:
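# Replace '500<=X<1000' first: it contains the substring '<100',
# which the '<100' replacement further down would otherwise corrupt.
gc['savings_status'] = gc['savings_status'].str.replace('500<=X<1000', 'High')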
gc['savings_status'] = gc['savings_status'].str.replace('no known savings', 'No Savings')
gc['savings_status'] = gc['savings_status'].str.replace('<100', 'Low')
gc['savings_status'] = gc['savings_status'].str.replace('100<=X<500', 'Medium')
gc['savings_status'] = gc['savings_status'].str.replace('>=1000', 'High')
gc.head()

In [None]:
gc['savings_status'].value_counts()
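
Because `str.replace` works on substrings, the order of the calls matters for these labels. A minimal sketch on made-up data shows what happens when '<100' is replaced too early, and an exact-match alternative using `Series.map` (the `mapping` dictionary is an illustrative name, and it assumes the apostrophes have already been removed):

```python
import pandas as pd

# Toy data with the raw labels from the task (apostrophes already removed).
s = pd.Series(['<100', '500<=X<1000', '100<=X<500', '>=1000', 'no known savings'])

# Substring replacement applied too early: '<100' also matches inside '500<=X<1000'.
wrong = s.str.replace('<100', 'Low', regex=False)

# Exact-match mapping: each full label is looked up once, so order is irrelevant.
mapping = {
    'no known savings': 'No Savings',
    '<100': 'Low',
    '100<=X<500': 'Medium',
    '500<=X<1000': 'High',
    '>=1000': 'High',
}
right = s.map(mapping)

print(wrong.tolist())
print(right.tolist())
```

Note that `map` returns NaN for any label missing from the dictionary, which makes unexpected values easy to spot.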

5. [4 pts] Change the class column values from 'good' to '1' and from 'bad' to '0'.

In [None]:
gc['class'] = gc['class'].replace('good', 1)
gc['class'] = gc['class'].replace('bad', 0)
gc.head()

6. [5 pts] Change the employment column value 'unemployed' to 'Unemployed', and for the others, change to 'Amateur', 'Professional', 'Experienced' and 'Expert', depending on year range.

In [None]:
gc['employment'] = gc['employment'].str.replace('unemployed', 'Unemployed')
gc['employment'] = gc['employment'].str.replace('<1', 'Amateur')
gc['employment'] = gc['employment'].str.replace('1<=X<4',  'Professional')
gc['employment'] = gc['employment'].str.replace('4<=X<7', 'Experienced')
gc['employment'] = gc['employment'].str.replace('>=7', 'Expert')
gc.head()

### Analysis (17 pts) ###

1. [5 pts] Often we need to find correlations between categorical attributes, i.e. attributes whose values fall into one of several categories, such as "yes"/"no" for attr1, or "low", "medium", "high" for attr2.
One such correlation is the count of each combination of categorical values across attributes, e.g. how many instances are "yes" for attr1 and "low" for attr2. A good way to find such counts is the Pandas crosstab function. Use it for the following two counts.

a. [3 pts] Get the count of each category of foreign workers (yes and no) for each class of credit (good and bad).

b. [2 pts] Similarly, get the count of each category of employment for each category of savings_status.

In [None]:
pd.crosstab(gc['foreign_worker'], gc['class'])

In [None]:
pd.crosstab(gc['employment'], gc['savings_status'])
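
To see what `pd.crosstab` returns, here is a small sketch using the attr1/attr2 example from the task description. The data is made up; `margins=True` is an optional argument that adds row and column totals.

```python
import pandas as pd

toy = pd.DataFrame({
    'attr1': ['yes', 'yes', 'no', 'no', 'yes'],
    'attr2': ['low', 'high', 'low', 'low', 'low'],
})

# Rows are attr1 categories, columns are attr2 categories, cells are counts.
counts = pd.crosstab(toy['attr1'], toy['attr2'])

# margins=True appends an 'All' row and column holding the totals.
with_totals = pd.crosstab(toy['attr1'], toy['attr2'], margins=True)

print(counts)
print(with_totals)
```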

2. [4 pts] Find the average credit_amount of single males that have 4<=X<7 years of employment. You can leave the raw result as is, no need for rounding.

In [None]:
gc[(gc['employment'] == 'Experienced') & (gc['personal_status'] == 'male single')]['credit_amount'].mean()

3. [4 pts] Find the average credit duration for each of the job types. You can leave the raw result as is, no need for rounding.

In [None]:
gc.groupby('job')['duration'].mean()

4. [4 pts] For the purpose 'education', what is the most common checking_status and savings_status? 

In [None]:
edu = gc[gc['purpose'] == 'education']
# value_counts() sorts by frequency, so the first index entry is the most common value.
com_check = edu['checking_status'].value_counts().index[0]
com_sav = edu['savings_status'].value_counts().index[0]
print(com_check)
print(com_sav)
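
An alternative way to get the most common value is `Series.mode()`, which also reports ties. A sketch on made-up data (the toy frame and variable names are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    'purpose': ['education', 'education', 'education', 'radio/tv'],
    'checking_status': ['Low', 'Low', 'High', 'Medium'],
})
edu = toy[toy['purpose'] == 'education']

# First entry of value_counts() is the most frequent value.
top_by_counts = edu['checking_status'].value_counts().index[0]

# mode() returns every value tied for the highest count, in sorted order.
modes = edu['checking_status'].mode().tolist()
tied = pd.Series(['Low', 'High']).mode().tolist()

print(top_by_counts, modes, tied)
```

With a tie, `value_counts().index[0]` silently picks one value, while `mode()` returns all of them.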

### Visualization (24 pts) ###

In [None]:
import matplotlib.pyplot as plt

1. [9 pts] Plot subplots of two histograms: one with savings_status on the x-axis and personal_status as different colors, and another with checking_status on the x-axis and personal_status as different colors.

In [None]:
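# Count personal_status within each category and draw grouped bars,
# so each personal_status gets its own color.
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
pd.crosstab(gc['savings_status'], gc['personal_status']).plot(kind='bar', ax=ax1, rot=0)
pd.crosstab(gc['checking_status'], gc['personal_status']).plot(kind='bar', ax=ax2, rot=0)
ax1.set_ylabel('count')
ax2.set_ylabel('count')
plt.tight_layout()
plt.show()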

2. [9 pts] For people having credit_amount more than 4000, plot a bar graph which maps property_magnitude (x-axis) to the average customer age for that magnitude (y-axis).

In [None]:
cred_amount = gc[gc['credit_amount'] > 4000]
age_avg = cred_amount.groupby('property_magnitude')['age'].mean()
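# Take labels and heights from the same Series so they stay aligned.
plt.bar(age_avg.index, age_avg)
plt.xlabel('property_magnitude')
plt.ylabel('average age')
plt.show()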
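
When plotting a grouped result, the bar labels have to come from the same object as the bar heights. A sketch on made-up data shows that `unique()` (order of first appearance) and `groupby` (sorted keys) can list the categories in different orders:

```python
import pandas as pd

toy = pd.DataFrame({
    'property_magnitude': ['real estate', 'car', 'real estate', 'car'],
    'age': [30, 50, 40, 60],
})
age_avg = toy.groupby('property_magnitude')['age'].mean()

# unique() keeps the order in which values first appear in the column.
appearance_order = toy['property_magnitude'].unique().tolist()
# groupby() sorts the group keys by default.
group_order = age_avg.index.tolist()

print(appearance_order)
print(group_order)
```

Pairing `appearance_order` with `age_avg` here would label the 'car' average as 'real estate', which is why `age_avg.index` is the safer source for the x-axis labels.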

3. [6 pts] For people with a "High" savings_status and age above 40, use subplots to plot the following pie charts:
1) Personal status
2) Credit history
3) Job

In [None]:
# Filter once, then draw one pie per attribute.
subset = gc[(gc['savings_status'] == 'High') & (gc['age'] > 40)]
charts = [('personal_status', 'Personal status'),
          ('credit_history', 'Credit history'),
          ('job', 'Job')]

plt.figure(figsize=(15, 5))
for i, (col, title) in enumerate(charts, start=1):
    ax = plt.subplot(1, 3, i)
    counts = subset[col].value_counts()
    ax.pie(counts, labels=counts.index)
    ax.set_title(title)
plt.show()