# Data Cleaning Notebook

## Objectives
- Fill in missing data and handle encoding on categorical variables

## Inputs
- 'outputs/loan_data.csv'

## Outputs
- Cleaned datasets saved to the outputs folder: 'outputs/loan_data_mean.csv' and 'outputs/loan_data_drop.csv'

## Conclusion
- 

## Change Working Directory

The notebooks are stored in a subfolder, so when running a notebook in the editor we need to change the working directory from the notebook's folder to its parent folder.

We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\Bartek\\Desktop\\Predictive-Analysis\\jupyter_notebooks'

- We use os.path.dirname() to get the parent directory
- Then we call the os.chdir() function, which sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory
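
One way to confirm the directory, sketched here under assumptions, is to print `os.getcwd()` again. The helper name `ensure_project_root` is hypothetical, and the folder name `jupyter_notebooks` is taken from the path printed above. The guard is there because re-running the `os.chdir()` cell would otherwise move up one more level each time.

```python
import os


def ensure_project_root(notebook_dir_name="jupyter_notebooks"):
    """Move up one level only if we are still inside the notebooks folder.

    Hypothetical helper: the folder name is assumed from the path shown
    earlier in this notebook.
    """
    cwd = os.getcwd()
    if os.path.basename(cwd) == notebook_dir_name:
        os.chdir(os.path.dirname(cwd))
    return os.getcwd()


# Safe to run repeatedly: the second call leaves the directory unchanged
project_root = ensure_project_root()
print(project_root)
```

The printed path should end with the project folder rather than with `jupyter_notebooks`.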

## Load Data

In [3]:
import pandas as pd
loan_data = pd.read_csv("outputs/loan_data.csv")

## Handle Missing Data

In [4]:
loan_data.isnull().sum()

Loan_ID               0
Gender                5
Married               0
Dependents            8
Education             0
Self_Employed        21
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term     11
Credit_History       30
Property_Area         0
Loan_Status           0
dtype: int64
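
Raw counts are easier to judge next to the share of rows they represent. The sketch below is illustrative only: the `missing_report` helper is hypothetical, and the toy DataFrame uses made-up values with column names borrowed from the output above.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; only the column names come from the notebook
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Credit_History": [1.0, np.nan, np.nan, 0.0],
    "LoanAmount": [120, 66, 141, 95],
})


def missing_report(df):
    """Return the count and percentage of missing values per affected column."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have gaps, largest first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


report = missing_report(toy)
print(report)
```

Calling `missing_report(loan_data)` right after loading would give the same view for the real dataset.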

The next cell fills the missing values in `loan_data`. Numeric columns (`Credit_History`, `Loan_Amount_Term`) are filled with their mean, and categorical columns (`Self_Employed`, `Gender`, `Dependents`) are filled with their mode, i.e. their most frequent value. After this step no column contains missing values, so the dataset can be used for analysis or modeling without discarding any rows.

In [5]:
# Fill missing values for numeric columns with their mean
loan_data['Credit_History'] = loan_data['Credit_History'].fillna(loan_data['Credit_History'].mean())
loan_data['Loan_Amount_Term'] = loan_data['Loan_Amount_Term'].fillna(loan_data['Loan_Amount_Term'].mean())

# Fill missing values for categorical variables with their mode
loan_data['Self_Employed'] = loan_data['Self_Employed'].fillna(loan_data['Self_Employed'].mode()[0])
loan_data['Gender'] = loan_data['Gender'].fillna(loan_data['Gender'].mode()[0])
loan_data['Dependents'] = loan_data['Dependents'].fillna(loan_data['Dependents'].mode()[0])
loan_data.isnull().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
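
The same rule (mean for numeric columns, mode for everything else) can be applied by dtype instead of column by column. This is a minimal sketch, not the notebook's method: the `impute` helper is hypothetical and the toy values are made up, with only the column names taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; only the column names come from the notebook
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 0.0, np.nan],
    "Loan_Amount_Term": [360.0, 180.0, np.nan, 360.0],
    "Gender": ["Male", "Female", "Male", None],
})


def impute(df):
    """Fill numeric columns with the mean and all other columns with the mode."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for col in out.columns:
        if out[col].isnull().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].mean())
            else:
                out[col] = out[col].fillna(out[col].mode()[0])
    return out


imputed = impute(toy)
print(imputed)
```

On this toy data the missing `Credit_History` entry becomes the column mean, which is a fraction rather than 0 or 1. If the column is meant to be a 0/1 flag (an assumption; the notebook does not state it), filling it with the mode may be worth considering.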

This cell builds the alternative dataset `loan_data_drop` by dropping every row that contains a NaN value. Because `loan_data` was already imputed in the previous cell, the raw file is reloaded first; calling `dropna()` on the imputed DataFrame would remove nothing. Once the incomplete rows are gone there is nothing left to impute, so no `fillna()` calls are needed.

In [6]:
# Reload the raw data, since loan_data no longer contains missing values
loan_data_raw = pd.read_csv("outputs/loan_data.csv")

# Drop rows with NaN values; .copy() returns an independent DataFrame
loan_data_drop = loan_data_raw.dropna().copy()
loan_data_drop.isnull().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
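
Both strategies end with zero missing values, so the null counts alone do not tell them apart; the row counts do. The comparison below is a sketch on a toy DataFrame with made-up values (only the column names come from the dataset).

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; rows 1 and 2 each have one missing entry
raw = pd.DataFrame({
    "Credit_History": [1.0, np.nan, 0.0, 1.0],
    "Self_Employed": ["No", "Yes", None, "No"],
})

# Strategy 1: impute a copy (mean for numeric, mode for categorical)
filled = raw.copy()
filled["Credit_History"] = filled["Credit_History"].fillna(filled["Credit_History"].mean())
filled["Self_Employed"] = filled["Self_Employed"].fillna(filled["Self_Employed"].mode()[0])

# Strategy 2: drop incomplete rows, starting from the untouched raw data
dropped = raw.dropna().copy()

print("rows in raw data:    ", len(raw))
print("rows after imputing: ", len(filled))
print("rows after dropping: ", len(dropped))
```

Imputation keeps every row, while dropping keeps only the complete ones. Running the same `len()` checks on `loan_data` and `loan_data_drop` is a quick way to verify that the two saved files really differ.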

## Push File to Repo

In [7]:
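import os

# Create the outputs folder; report (rather than fail) if it already exists
try:
    os.makedirs(name='outputs')
except FileExistsError as e:
    print(e)

# Save both cleaned datasets without the index column
loan_data.to_csv("outputs/loan_data_mean.csv", index=False)
loan_data_drop.to_csv("outputs/loan_data_drop.csv", index=False)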

[WinError 183] Cannot create a file when that file already exists: 'outputs'
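
The message above only reports that the folder already exists. As an alternative sketch, `os.makedirs(..., exist_ok=True)` skips the error entirely, and reading the file back confirms the save. The `save_csv` helper, the temporary folder, and the demo values are all hypothetical, used here so the example does not touch the real `outputs` folder.

```python
import os
import tempfile

import pandas as pd


def save_csv(df, folder, filename):
    """Create the folder if needed, write the CSV, and return its path."""
    os.makedirs(folder, exist_ok=True)  # no error when the folder exists
    path = os.path.join(folder, filename)
    df.to_csv(path, index=False)
    return path


# Made-up rows; only the column names come from the dataset
demo = pd.DataFrame({"Loan_ID": ["A1", "A2"], "LoanAmount": [120, 66]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "outputs")
    save_csv(demo, out_dir, "demo.csv")
    # Second call: the folder already exists, and nothing is raised
    path = save_csv(demo, out_dir, "demo.csv")
    reloaded = pd.read_csv(path)

print(reloaded)
```

In the notebook itself this would amount to calling `save_csv(loan_data, "outputs", "loan_data_mean.csv")` and the equivalent for `loan_data_drop`.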
