# Notebook 02 - Data Cleaning

## Objectives

* Check data for duplicate entries
* Determine whether there is any missing data
* Assess column names for inconsistencies and rename these as appropriate
* Assess whether any data needs to have its type changed or be aggregated or renamed
* Clean data
* Develop a data cleaning pipeline for use on future data provided by the client

## Inputs

* CSV file generated in previous notebook: outputs/datasets/collection/credit_card_data.csv

## Outputs

* Cleaned data, to be stored in new folder outputs/datasets/cleaned
* Data cleaning pipeline

## Additional Comments

[add any additional comments here]

## Conclusions

[add conclusions here]


---

# Change working directory

* This notebook is stored in the `jupyter_notebooks` subfolder
* The current working directory therefore needs to be changed from this folder to its parent folder, which is the workspace root

Firstly, the current directory is accessed with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\franc\\credit-card-default\\jupyter_notebooks'

Next, the working directory is set as the parent of the current `jupyter_notebooks` directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` defines the new current directory
* This allows access to all the files and folders within the workspace, rather than solely those within the `jupyter_notebooks` directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Finally, confirm that the new current directory has been successfully set

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\franc\\credit-card-default'
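One caveat with the cells above is that re-running the `os.chdir()` cell moves up one more level each time. The following is a minimal sketch of a guard against that, assuming the notebooks folder is always called `jupyter_notebooks`. The helper name `move_to_workspace` is hypothetical and is not part of the original notebook.

```python
import os


def move_to_workspace(notebook_dir_name="jupyter_notebooks"):
    """Move to the parent directory only when currently inside the notebooks folder.

    `notebook_dir_name` is an assumed folder name; adjust it if the project
    uses a different layout.
    """
    cwd = os.getcwd()
    if os.path.basename(cwd) == notebook_dir_name:
        os.chdir(os.path.dirname(cwd))
    return os.getcwd()


# Calling the helper twice has the same effect as calling it once
print(move_to_workspace())
print(move_to_workspace())
```

Because the check is on the folder name, the cell can be executed repeatedly without leaving the workspace.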

---

# Load data

The data is loaded from the outputs/datasets/collection folder:

In [4]:
import pandas as pd
df = pd.read_csv('outputs/datasets/collection/credit_card_data.csv')
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24,2,2,-1,-1,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0
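The `Unnamed: 0` column in the output above looks like a row index that was written to the CSV when the file was saved. This is an assumption, since the previous notebook is not shown here. If it holds, the sketch below shows how `index_col=0` in `pd.read_csv` would read that column back as the index instead of as data. The CSV text is a small hypothetical stand-in for the real file.

```python
import io

import pandas as pd

# Toy CSV standing in for the real file; the empty first header cell is what
# pandas reads back as "Unnamed: 0"
csv_text = ",ID,LIMIT_BAL\n0,1,20000.0\n1,2,120000.0\n"

# Default behaviour: the saved index becomes an ordinary column
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column is used as the row index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
```

An alternative with the same effect would be to drop the column after loading with `df.drop(columns=['Unnamed: 0'])`.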


---

# Data Examination

## Check for duplicate entries

Firstly, we want to check the data for any duplicate entries.
* We know that the field `ID` corresponds to the customer ID
* Unless a customer holds multiple credit cards, it should therefore not be possible for a customer to appear in the dataset more than once

In [5]:
df[df.duplicated(subset='ID')]

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month


We see that there are no duplicate values recorded for `ID`, so no customer appears in the dataset more than once. Since every row has a distinct `ID`, no two rows can be exact duplicates of each other.
* Since `ID` is a unique customer identifier and does not convey any meaning that could impact the target variable, it can be dropped from the dataset.
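Once `ID` is dropped, rows that differ only in their `ID` would become identical. The sketch below is a possible extra check, not a step taken in this notebook: it counts rows whose remaining columns repeat an earlier row, then drops `ID`. The `toy` DataFrame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 differ only in ID
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "LIMIT_BAL": [20000.0, 20000.0, 90000.0],
    "AGE": [24, 24, 34],
})

# Compare rows on every column except the identifier
feature_cols = [col for col in toy.columns if col != "ID"]
n_repeated = toy.duplicated(subset=feature_cols).sum()
print(n_repeated)

# Drop the identifier once the check has been made
toy_clean = toy.drop(columns=["ID"])
print(toy_clean.columns.tolist())
```

Whether such repeated rows should be removed is a judgement call, since two different customers can legitimately share the same values.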

## Check for missing data

Next, we check whether there is any missing data:

In [6]:
vars_with_missing_data = df.columns[df.isna().any()].tolist()
vars_with_missing_data

[]

There is no missing data in the dataset and so data imputation will not be necessary.
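For future data supplied by the client, a per-column count can be more informative than a list of column names. The following is a minimal sketch on hypothetical toy data, using `isna().sum()` to report only the columns that contain missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing AGE value
toy = pd.DataFrame({
    "AGE": [24, np.nan, 34],
    "LIMIT_BAL": [20000.0, 120000.0, 90000.0],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Keep only the columns that actually have missing values
missing_report = missing_counts[missing_counts > 0]
print(missing_report)
```

An empty `missing_report` corresponds to the empty list returned by the cell above.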

---


# Push files to Repo

* The cleaned data is to be stored in the new folder outputs/datasets/cleaned, so this folder is created first

In [None]:
import os
try:
  # create the folder for the cleaned data named under Outputs
  os.makedirs(name='outputs/datasets/cleaned')
except Exception as e:
  # e.g. the folder already exists
  print(e)
