# Business Analytics - Assignment 2  

**Assignment Points**: 100  
**Submission**: Provide your answers in this notebook and submit it via iLearn

- Where a question requires a written (text) solution, provide your answer in Markdown in the appropriate cells under each question.
- Comment out your print statements unless you are explicitly asked to use the print() function. 
- 5 marks will be deducted for printed outputs that are not asked for.

### About the Assignment

- Assignment 2 extends Assignment 1 on credit card applications. 


- For this assignment there are two files in the `data` folder, `credit_record.csv` and `application_record.csv`, in which bank clients are linked by the `ID` column.
- In `application_record.csv` we have the following variables

| Feature Name         | Explanation     | Additional Remarks |
|--------------|-----------|-----------|
| ID | Randomly allocated client number      |         |
| AMT_INCOME_TOTAL   | Annual income  |  |
| NAME_INCOME_TYPE   | Income Source |  |
| NAME_EDUCATION_TYPE   | Level of Education  |  |
| CODE_GENDER   | Applicant's Gender   |  |
| FLAG_OWN_CAR | Car Ownership |  | 
| CNT_CHILDREN | Number of Children | |
| FLAG_OWN_REALTY | Real Estate Ownership | | 
| NAME_FAMILY_STATUS | Relationship Status | | 
| NAME_HOUSING_TYPE | Housing Type | | 
| DAYS_BIRTH | Birth date (in days) | Counted backwards from the current day (0); -1 means yesterday |
| DAYS_EMPLOYED | Employment start date (in days) | Counted backwards from the current day (0). A positive value means the person is currently unemployed. |
| FLAG_MOBIL | Mobile Phone Ownership | | 
| FLAG_WORK_PHONE | Work Phone Ownership | | 
| FLAG_PHONE | Landline Phone Ownership | | 
| FLAG_EMAIL | Email Ownership | | 
| OCCUPATION_TYPE | Occupation | | 
| CNT_FAM_MEMBERS | Count of Family Members | |



- In `credit_record.csv` we have the following variables


| Feature Name         | Explanation     | Additional Remarks |
|--------------|-----------|-----------|
| ID | Randomly allocated client number | |
| MONTHS_BALANCE | Number of months in the past from now when STATUS is measured | 0 = current month, -1 = last month, -2 = two months ago, etc.|
| STATUS | Number of days a payment is past due | 0: 1-29 days past due; 1: 30-59 days past due; 2: 60-89 days overdue; 3: 90-119 days overdue; 4: 120-149 days overdue; 5: overdue or bad debts, write-offs for more than 150 days; C: paid off that month; X: no loan for the month |

---
---

### Task 1: Reading, Summarising and Cleaning Data (Total Marks: 30)



**Question 1.** 

1. Import the `application_record.csv` and `credit_record.csv` files from `data` folder into pandas DataFrames named `df_application` and `df_credit`, respectively. (1 mark)

2. How many rows are there in `df_application` and `df_credit`, respectively? Answer using both print() function and in Markdown text. (1 mark)

3. How many unique bank clients are there in `df_application` and `df_credit`? Answer using both print() function and in Markdown text. (1 mark)

4. Add the records from `df_credit` to `df_application` by merging the data from the two DataFrames on the `ID` column, and output the joint data into a new DataFrame named `df`. Hint: Use `merge` function from pandas by setting `how` parameter to `inner` (4 marks) 

5. How many rows and how many unique clients are there in `df`? (1 mark)

6. How are multiple rows for each `ID` in `df` different? Answer in Markdown text. (2 marks) 

(10 marks)
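
As a small illustration of the merge step above, the sketch below joins two toy DataFrames on `ID` with `how='inner'`. The frames `applications` and `credit` and their values are made up for illustration and are not the assignment data. It shows why the number of rows and the number of unique clients can differ after the merge: each client keeps one row per credit record.

```python
import pandas as pd

# Toy stand-ins for the two files (hypothetical values, not the real data)
applications = pd.DataFrame({
    "ID": [1, 2, 3],
    "AMT_INCOME_TOTAL": [50000.0, 62000.0, 48000.0],
})
credit = pd.DataFrame({
    "ID": [1, 1, 2, 4],
    "MONTHS_BALANCE": [0, -1, 0, 0],
    "STATUS": ["0", "C", "X", "1"],
})

# An inner join keeps only IDs present in both frames,
# with one output row per matching credit record
merged = pd.merge(applications, credit, on="ID", how="inner")

n_rows = len(merged)                # total number of rows
n_clients = merged["ID"].nunique()  # number of distinct clients
print(n_rows, n_clients)
```

Here client 1 has two credit records, so `len(merged)` is larger than `merged["ID"].nunique()`; clients 3 and 4 appear in only one frame and are dropped.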


In [1]:
from io import StringIO # allows us to read from a string as if we are reading from a file
from sklearn.preprocessing import LabelEncoder
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

In [2]:
import pandas as pd

# Load both files from the data folder
df_application = pd.read_csv('data/application_record.csv')
df_credit = pd.read_csv('data/credit_record.csv')
print(f'There are {len(df_application)} rows in df_application')
print(f'There are {len(df_credit)} rows in df_credit')

# Keep one row per client before counting unique IDs
df_application = df_application.drop_duplicates(subset=['ID'])
y = df_credit['ID'].unique()
print(f'There are {len(df_application)} unique bank clients in df_application')
print(f'There are {len(y)} unique bank clients in df_credit')

# Inner join on the shared ID column
df = pd.merge(df_application, df_credit, on='ID', how='inner')
# NOTE: F is the number of unique IDs in df; the row count of df is len(df)
F = len(df['ID'].unique())
print(f'There are {F} rows and {F} unique clients in df')

There are 438557 rows in df_application
There are 1048575 rows in df_credit
There are 438510 unique bank clients in df_application
There are 45985 unique bank clients in df_credit
There are 36457 rows and 36457 unique clients in df


Question 1) (Markdown answers)
* There are 438557 rows in df_application
* There are 1048575 rows in df_credit
* There are 438510 unique bank clients in df_application
* There are 45985 unique bank clients in df_credit
* There are 36457 rows and 36457 unique clients in df

---

**Question 2.**

1. Change the values of `STATUS` in `df` according to the following mapping: {C, X, 0} -> 0 and {1, 2, 3, 4, 5} -> 1 making sure that the new values of 0 and 1 are encoded as integers. (2 marks)
2. Create a new numpy array named `list_of_defaults` containing *unique* `ID` numbers for the clients who have `STATUS` = 1 in any of the last 12 months in the dataset. (2 marks) 
3. Create a new DataFrame called `df_final` that contains the rows of `df` for which the `ID` are in `list_of_defaults`, keeping only one row for each `ID` (i.e. eliminate rows with duplicate `ID`s while keeping the first duplicate row). How many rows do you have in `df_final`? Answer using both print() function and in Markdown text. (Hint: find out about `isin()` function in pandas.) (2 marks)
4. Add a new column `y = 1` for all the rows in `df_final`. (1 mark)
5. Increase `df_final` to a total of 4,000 rows by adding rows from `df` with unique `ID`s (nonduplicated `ID`s) which are not in `list_of_defaults`. To do this start adding the rows from the beginning of `df`. (Hint: learn what `~`, i.e. tilde sign, does in pandas). (2 marks) 
6. Fill the missing values of `y` in `df_final` with zeros. Remove `STATUS` and `MONTHS_BALANCE` from `df_final`. How many clients with overdue payments of more than 29 days and how many clients with less than 29 days overdue payments are there in `df_final`? Answer using both print() function and in Markdown text. (1 mark)

(10 marks)
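
The hints about `isin()` and `~` can be illustrated on a toy frame. This is only a sketch: the frame `toy` and its values are hypothetical, `STATUS` is assumed to be already mapped to 0/1, and the window `MONTHS_BALANCE` from -12 to 0 is an assumption about how "the last 12 months" is counted.

```python
import pandas as pd

# Hypothetical toy frame; STATUS is already mapped to 0/1
toy = pd.DataFrame({
    "ID": [10, 10, 11, 12, 12, 13],
    "MONTHS_BALANCE": [0, -1, 0, -13, 0, 0],
    "STATUS": [0, 1, 0, 1, 0, 0],
})

# Unique IDs with STATUS == 1 inside the assumed window
recent = toy[toy["MONTHS_BALANCE"].between(-12, 0)]
defaults = recent.loc[recent["STATUS"] == 1, "ID"].unique()

# isin() builds a boolean mask of rows whose ID is in `defaults`;
# ~ negates the mask to select everyone else
in_defaults = toy["ID"].isin(defaults)
defaulters = toy[in_defaults].drop_duplicates("ID", keep="first").assign(y=1)
others = toy[~in_defaults].drop_duplicates("ID", keep="first").assign(y=0)

# Top up with the first two non-default clients
sample = pd.concat([defaulters, others.head(2)], ignore_index=True)
print(sample[["ID", "y"]])
```

Client 12 has `STATUS` = 1 only at month -13, which falls outside the window, so it is treated as a non-default client here.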

In [3]:
import numpy as np

# Map STATUS to a binary flag: 0 = paid off, no loan or at most 29 days past due; 1 = 30 or more days past due
status_mapping = {'C': 0, 'X': 0, '0': 0, '1': 1, '2': 1, '3': 1, '4': 1, '5': 1}
df['STATUS'] = df['STATUS'].map(status_mapping)

# Rows where MONTHS_BALANCE is between -12 and 0
MonthsOnly = df.loc[df['MONTHS_BALANCE'].isin(np.arange(-12, 1, 1))]
# Keep only the rows with STATUS == 1
MonthsOnly = MonthsOnly.query("STATUS == 1")
# Unique IDs of the clients who defaulted in that window
list_of_defaults = MonthsOnly['ID'].unique()

# One row per defaulting client; .copy() avoids a SettingWithCopyWarning when y is added
df_final = MonthsOnly.drop_duplicates('ID').copy()
df_final['y'] = 1

# .loc slicing is inclusive, so 0:Length_change adds 4000 - len(df_final) rows
Length_change = 3999 - len(df_final)
# Non-default clients: one row per ID, index reset, taken from the start of df
non_defaults = df.loc[~df['ID'].isin(list_of_defaults)].drop_duplicates('ID').reset_index(drop=True)
df_final = pd.concat([df_final, non_defaults.loc[0:Length_change, :]])

# The rows added from non_defaults have no y yet, so fill them with 0
df_final['y'] = df_final['y'].fillna(0)

df_final = df_final.drop(['STATUS', 'MONTHS_BALANCE'], axis=1)
n_default = int((df_final['y'] == 1).sum())
n_other = int((df_final['y'] == 0).sum())
print(f'There are {n_default} clients with y = 1 (more than 29 days overdue) and {n_other} clients with y = 0')

There are 1833 clients with y = 1 (more than 29 days overdue) and 2167 clients with y = 0


Question 2) (As Markdown) 
* There are 1833 clients with a payment more than 29 days overdue (`y` = 1)
* There are 2167 clients with no payment more than 29 days overdue (`y` = 0)


<hr style="width:25%;margin-left:0;"> 

**Question 3**. 
- Delete `ID` column from `df_final` (1 mark)
- Of the remaining variables in `df_final` and assuming that `NAME_EDUCATION_TYPE` is the only ordinal variable, how many variables are of numeric and nominal types? Provide lists of all numeric and nominal variables. (6 marks)
- Using an appropriate function find and comment on the missing values in `df_final`, i.e. how many variables and how many observations? (3 marks)   
(10 marks)
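
A minimal sketch of how missing values and column types can be inspected, using a made-up frame `toy` (the values are hypothetical). It counts missing values per column, the number of columns and rows affected, and separates numeric columns by dtype rather than by looking at a single cell.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values
toy = pd.DataFrame({
    "CNT_CHILDREN": [0.0, np.nan, 2.0, 1.0],
    "OCCUPATION_TYPE": ["Laborers", None, None, "Security staff"],
    "FLAG_PHONE": [1, 0, 0, 1],
})

missing_per_column = toy.isna().sum()                       # NaN count per column
columns_with_missing = int((missing_per_column > 0).sum())  # variables affected
rows_with_missing = int(toy.isna().any(axis=1).sum())       # observations affected

# Numeric columns are identified by dtype, so a NaN in the first row does not matter
numeric_columns = toy.select_dtypes(include="number").columns.tolist()
print(missing_per_column, columns_with_missing, rows_with_missing, numeric_columns)
```

Checking the dtype of the whole column is more reliable than checking the type of one value, because a missing entry in a text column is stored as a float `NaN`.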

In [4]:
df_final.drop('ID', inplace=True, axis=1)
df_final.head()
#add to the box below 

| | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | DAYS_BIRTH | DAYS_EMPLOYED | FLAG_MOBIL | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28 | M | Y | Y | 0.0 | 427500.0 | Working | Higher education | Civil marriage | Rented apartment | -12005 | -4542 | 1 | 1 | 0 | 0 | NaN | 2 | 1.0 |
| 374 | F | N | Y | NaN | 157500.0 | Working | Secondary / secondary special | Married | House / apartment | -10031 | -1469 | 1 | 0 | 1 | 0 | Laborers | 2 | 1.0 |
| 962 | M | Y | Y | NaN | 360000.0 | Commercial associate | NaN | Married | House / apartment | -16670 | -5364 | 1 | 0 | 1 | 0 | Security staff | 2 | 1.0 |
| 1692 | F | N | Y | 0.0 | 297000.0 | Commercial associate | NaN | Single / not married | Rented apartment | -15519 | -3234 | 1 | 0 | 0 | 0 | Laborers | 1 | 1.0 |
| 1729 | F | N | Y | 0.0 | 297000.0 | Commercial associate | NaN | Single / not married | Rented apartment | -15519 | -3234 | 1 | 0 | 0 | 0 | Laborers | 1 | 1.0 |


In [5]:

from pandas.api.types import is_numeric_dtype

Nominal = []
Numeric = []
for column in df_final:
    if column == 'NAME_EDUCATION_TYPE':  # the only ordinal variable, so it is skipped
        continue
    elif is_numeric_dtype(df_final[column]):  # use the column dtype; a single value may be NaN
        Numeric.append(column)
    else:
        Nominal.append(column)
#print(f'The numeric values are {Numeric} and there are {len(Numeric)} of them.')
#print(f'The nominal values are {Nominal} and there are {len(Nominal)} of them.')

# Missing values per column
#print(df_final.isna().sum())
# Rows with at least one missing value
#df_final[df_final.isna().any(axis=1)]

Question 3)
* There are 10 numeric variables, as listed below:
    'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'FLAG_MOBIL', 'FLAG_WORK_PHONE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'y'

* There are 7 nominal variables, as listed below:
    'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_INCOME_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE'
    
Three variables have missing values:
* 'CNT_CHILDREN' - numeric, 74 missing
* 'NAME_EDUCATION_TYPE' - ordinal, 1831 missing
* 'OCCUPATION_TYPE' - nominal (stored as text), 648 missing


---
---

### Task 2: Imputing missing values and dealing with categorical features (Total Marks: 30)



**Question 4.** 
- Use an appropriate `pandas` function to impute missing values in `df_final` (10 marks)
    - Be careful when deciding which method to use to replace missing observations 
    - Take into consideration the type of each variable and the best practices we discussed in class/lecture notes
- Briefly explain what you have done and why. (5 marks)

(Total: 15 marks)
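
One possible way to impute by variable type is sketched below on a made-up frame `toy` (hypothetical values). It is an illustration under stated assumptions, not the required answer: the median is used for a count variable so the result stays a whole number, and the mode (most frequent category) is used for a text variable, since a mean is undefined for categories.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "CNT_CHILDREN": [0.0, np.nan, 2.0, 1.0, 1.0],
    "OCCUPATION_TYPE": ["Laborers", None, "Laborers", "Drivers", None],
})

imputed = toy.copy()

# Count variable: the median is robust to outliers
median_children = imputed["CNT_CHILDREN"].median()
imputed["CNT_CHILDREN"] = imputed["CNT_CHILDREN"].fillna(median_children)

# Categorical variable: fill with the most frequent category
most_common_job = imputed["OCCUPATION_TYPE"].mode()[0]
imputed["OCCUPATION_TYPE"] = imputed["OCCUPATION_TYPE"].fillna(most_common_job)

print(imputed)
```

Assigning the result of `fillna` back to the column avoids relying on `inplace=True`, which returns `None` rather than the filled frame.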

In [6]:
 
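df_final = df_final.fillna(df_final.mean(numeric_only=True))  # numeric columns: fill NaN with the column mean
df_final = df_final.fillna(df_final.mode().iloc[0])  # remaining (text) columns: fill NaN with the most frequent value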


---- provide your text answer here ----


<hr style="width:25%;margin-left:0;"> 

**Question 5**. Convert the values in `NAME_EDUCATION_TYPE` as follows
- Lower secondary -> 1
- Secondary / secondary special -> 2
- Incomplete higher -> 3
- Higher education -> 4


(Total: 5 marks)  

In [7]:
new_size_mapping = {'Lower secondary':1, 'Secondary / secondary special':2, 'Incomplete higher':3, 'Higher education':4}

df_final['NAME_EDUCATION_TYPE'] = df_final['NAME_EDUCATION_TYPE'].map(new_size_mapping)

<hr style="width:25%;margin-left:0;"> 

**Question 6**. 

Add dummy variables to `df_final` for all of the nominal features which are currently stored as string (text). 
- Make sure to delete the original variables from the dataframe
- Drop the first column from each set of created dummy variable, i.e. for each feature



(Total: 10 marks)  
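
The two requirements above (remove the original columns, drop the first dummy of each feature) can both be handled by `pd.get_dummies` itself. The sketch below uses a made-up frame `toy` with hypothetical values; `dtype=int` is an optional choice to get 0/1 integers instead of booleans.

```python
import pandas as pd

# Hypothetical toy frame with two text features and one numeric feature
toy = pd.DataFrame({
    "CODE_GENDER": ["M", "F", "F"],
    "NAME_HOUSING_TYPE": ["Rented apartment", "House / apartment", "With parents"],
    "AMT_INCOME_TOTAL": [427500.0, 157500.0, 270000.0],
})

# columns=... replaces the listed columns with their dummies;
# drop_first=True drops the first category of each feature
encoded = pd.get_dummies(
    toy,
    columns=["CODE_GENDER", "NAME_HOUSING_TYPE"],
    drop_first=True,
    dtype=int,
)
print(encoded.columns.tolist())
```

Because `get_dummies` already returns the full frame with the original text columns replaced, there is no need to join its result back onto the original DataFrame.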

In [8]:
# df_final = pd.get_dummies(df_final,columns=Nominal)

In [9]:
# Encode every feature still stored as text; get_dummies replaces the original columns
string_cols = df_final.select_dtypes(exclude='number').columns.tolist()
df_final = pd.get_dummies(df_final, columns=string_cols, drop_first=True, dtype=int)


---
---

### Task 3 Preparing X and y arrays (Total Marks: 10)

**Question 7**. 

- Create a numpy array named `y` from the `y` column of `df_final` making sure that the values of the array `y` are stored as integers (3 marks)   
- Create a numpy array named `X`  from all the remaining features in `df_final` (2 marks)   

(Total: 5 Marks)

In [10]:
# Target as an integer numpy array
y = df_final['y'].to_numpy().astype(int)
# All remaining features as a numpy array (y is dropped by name, not by position)
X = df_final.drop(columns='y').to_numpy(dtype=float)

<hr style="width:25%;margin-left:0;"> 

**Question 8**. 

- Use an appropriate scikit-learn library we used in class to create `y_train`, `y_test`, `X_train` and `X_test` by splitting the data into 70% train and 30% test datasets (2.5 marks) 
    - Set random_state to 7 and stratify the subsamples so that train and test datasets have roughly equal proportions of the target's class labels 
- Standardise the data using `StandardScaler` library (2.5 marks)   

(Total: 5 marks) 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 7, stratify = y)
# print(X_train.shape)
# print(y_train.shape)
# print(X_test.shape)
# print(y_test.shape)
# print(y_train)
# print(y_test)

In [None]:
np.set_printoptions(precision=3, suppress = True) 
sc = StandardScaler()

sc.fit(X_train)

# print(dir(sc))
# print(sc.mean_, sc.scale_)


X_train_scaled = sc.transform(X_train)
# print('means:', X_train.mean(axis=0), X_train_scaled.mean(axis=0))
# print('sigmas', X_train.std(axis=0), X_train_scaled.std(axis=0))

X_test_scaled = sc.transform(X_test)
# print('means:', X_test.mean(axis=0), X_test_scaled.mean(axis=0))
# print('sigmas', X_test.std(axis=0), X_test_scaled.std(axis=0))

---
---

### Task 4. Support Vector Classifier and Accuracies (Total Marks: 30)


**Question 9**. 

- Train a Support Vector Classifier on standardised data (3 marks)
    - Use `linear` kernel and set `random_state` to 7 (don't change any other parameters)
    - Compute and print training and test dataset accuracies
- Train another Support Vector Classifier on standardised data (3 marks)
    - Use `rbf` kernel and set `random_state` to 7 (don't change any other parameters)
    - Compute and print training and test dataset accuracies
- What can you say about the presence of nonlinearities in the dataset? (4 marks)

(Total: 10 marks)  
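
To see how a comparison of kernels can hint at nonlinearity, the sketch below fits linear and RBF Support Vector Classifiers to synthetic data from `make_circles`, which is nonlinear by construction. The data and the names `X_toy`, `y_toy` and `scores` are illustrative assumptions and have nothing to do with the credit dataset.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, deliberately nonlinear data: one class forms a ring around the other
X_toy, y_toy = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=7, stratify=y_toy
)

# Fit the scaler on the training data only, then transform both subsets
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Train one classifier per kernel and store (training accuracy, test accuracy)
scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, random_state=7).fit(X_tr_s, y_tr)
    scores[kernel] = (clf.score(X_tr_s, y_tr), clf.score(X_te_s, y_te))
    print(kernel, scores[kernel])
```

When the RBF kernel clearly outperforms the linear kernel on the test set, as it should on this ring-shaped data, that suggests the class boundary is nonlinear. When the two accuracies are close, a linear boundary is probably adequate.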

In [None]:
from sklearn.svm import SVC

#--------------  1. Linear-kernel SVC on the standardised features
svc_linear = SVC(kernel='linear', random_state=7)
svc_linear.fit(X_train_scaled, y_train)
print('Linear kernel - Accuracy - Training:', svc_linear.score(X_train_scaled, y_train))
print('Linear kernel - Accuracy - Test:', svc_linear.score(X_test_scaled, y_test))

#--------------  2. RBF-kernel SVC on the standardised features
svc_rbf = SVC(kernel='rbf', random_state=7)
svc_rbf.fit(X_train_scaled, y_train)
print('RBF kernel - Accuracy - Training:', svc_rbf.score(X_train_scaled, y_train))
print('RBF kernel - Accuracy - Test:', svc_rbf.score(X_test_scaled, y_test))


---- provide your text answer here ----

<hr style="width:25%;margin-left:0;"> 

**Question 10**

- Extract 2 linear principal components from the standardised features using an appropriate `sklearn` library (5 marks)
- Train a Support Vector Classifier on the computed principal components (5 marks) 
    - Use `rbf` kernel and set `random_state` to 7 (don't change any other parameters)
- Compute and print training and test dataset accuracies (5 marks)
- What can you say about the ability of the 2 principal components to compress the information contained in the features matrix `X`, and why? (5 marks)     


(Total: 20 marks)  
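
How much information two principal components retain can be read from `explained_variance_ratio_`. The sketch below uses synthetic data generated from two latent factors, so two components should capture most of the variance; the data and the names `latent`, `X_toy` and `explained` are illustrative assumptions, not the credit dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Synthetic features: 6 columns driven by 2 latent factors plus small noise
latent = rng.normal(size=(200, 2))
X_toy = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))

# Standardise, then project onto the first two linear principal components
X_toy_scaled = StandardScaler().fit_transform(X_toy)
pca = PCA(n_components=2)
X_toy_pca = pca.fit_transform(X_toy_scaled)

# Share of the total variance captured by each component
explained = pca.explained_variance_ratio_
print(explained, explained.sum())
```

If the two ratios sum to a value close to 1, little information is lost by the compression. A small sum means the features do not lie near a two-dimensional linear subspace, and a classifier trained on the components may lose accuracy.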

In [None]:
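from sklearn.decomposition import PCA
from sklearn.svm import SVC
# Two linear principal components, fitted on the standardised training features only
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
# RBF-kernel SVC trained on the two components
svc_pca = SVC(kernel='rbf', random_state=7)
svc_pca.fit(X_train_pca, y_train)
print('Accuracy - Training:', svc_pca.score(X_train_pca, y_train))
print('Accuracy - Test:', svc_pca.score(X_test_pca, y_test))

plt.scatter(X_train_pca[y_train==0, 0], X_train_pca[y_train==0, 1], color='red', marker='^', alpha=0.5)
plt.scatter(X_train_pca[y_train==1, 0], X_train_pca[y_train==1, 1], color='blue', marker='o', alpha=0.5)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()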

---- provide your text answer here ----


---
---