# **Data Collection**

## Objectives

* Fetch the dataset from Kaggle as a zip file
* Unzip the file and delete the zip archive
* Save the data as a raw CSV file in an output folder

## Inputs

* Kaggle JSON file (`kaggle.json`): the authentication token.

## Outputs

* Generate dataset: outputs/datasets/collection/LoanDefaultDataset.csv


---

## Environment Setup


In [None]:
%pip install -r /workspace/loan-default-prediction/requirements.txt

## Change working directory

* Change the working directory from its current folder to its parent folder

In [4]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/loan-default-prediction/jupyter_notebooks'

* Make the parent of the current directory the new current directory

In [5]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


* Confirm the new current directory

In [6]:
current_dir = os.getcwd()
current_dir

'/workspace/loan-default-prediction'
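
The three cells above can also be folded into one small helper. This is only a sketch: `move_to_parent` is a hypothetical name introduced here for illustration and is not part of the notebook.

```python
import os
from pathlib import Path


def move_to_parent() -> Path:
    """Make the parent of the current working directory the new working directory."""
    parent = Path.cwd().parent
    os.chdir(parent)
    return parent
```

Calling `move_to_parent()` once from `jupyter_notebooks` would leave the session in the project root. Running it a second time moves up another level, so it should be executed only once per session.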

---

## Dataset Importation

* Install Kaggle

In [None]:
%pip install kaggle==1.5.12

## Configure Kaggle
* Point `KAGGLE_CONFIG_DIR` at the current working directory so the Kaggle client looks for `kaggle.json` there
* Restrict the permissions on `kaggle.json` so that only the file owner can read and write it

In [4]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
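
If the token file is missing, the `chmod` call above fails with a shell error that is easy to overlook. A minimal Python sketch of the same step, assuming a POSIX system, is shown below; `secure_token` is a hypothetical helper name, not something the notebook or the Kaggle package provides.

```python
import os
import stat
from pathlib import Path


def secure_token(token_path: str = "kaggle.json") -> str:
    """Restrict the token to owner read/write and return its mode as an octal string."""
    path = Path(token_path)
    if not path.is_file():
        raise FileNotFoundError(f"{token_path} not found; add the Kaggle token first")
    # Same effect as `chmod 600`: read and write for the owner, nothing for others
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return oct(stat.S_IMODE(path.stat().st_mode))
```

On a POSIX system the returned string is `0o600` once the permissions have been applied.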

## Dataset Import and Download

In [None]:
KaggleDatasetPath = "taweilo/loan-approval-classification-data"
DestinationFolder = "inputs/datasets"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

## Unzip Dataset

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip
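
The shell command above relies on `unzip` and `rm` being available. Where they are not, the same unzip-then-delete step could be sketched with the standard-library `zipfile` module. `unzip_and_remove` is a hypothetical helper name used only for this illustration.

```python
import zipfile
from pathlib import Path
from typing import List


def unzip_and_remove(folder: str) -> List[str]:
    """Extract every .zip archive in `folder` into it, then delete the archives."""
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Delete the archive only after extraction finished without an error
        archive.unlink()
    return extracted
```

In this notebook it would be called as `unzip_and_remove(DestinationFolder)`, and the returned list names the files that were extracted.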

---

## Load and Inspect Kaggle data

* Dataset Loading

In [7]:
import pandas as pd
df = pd.read_csv("inputs/datasets/loan_data.csv")
df.head()

|   | person_age | person_gender | person_education | person_income | person_emp_exp | person_home_ownership | loan_amnt | loan_intent | loan_int_rate | loan_percent_income | cb_person_cred_hist_length | credit_score | previous_loan_defaults_on_file | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | female | Master | 71948.0 | 0 | RENT | 35000.0 | PERSONAL | 16.02 | 0.49 | 3.0 | 561 | No | 1 |
| 1 | 21.0 | female | High School | 12282.0 | 0 | OWN | 1000.0 | EDUCATION | 11.14 | 0.08 | 2.0 | 504 | Yes | 0 |
| 2 | 25.0 | female | High School | 12438.0 | 3 | MORTGAGE | 5500.0 | MEDICAL | 12.87 | 0.44 | 3.0 | 635 | No | 1 |
| 3 | 23.0 | female | Bachelor | 79753.0 | 0 | RENT | 35000.0 | MEDICAL | 15.23 | 0.44 | 2.0 | 675 | No | 1 |
| 4 | 24.0 | male | Master | 66135.0 | 1 | RENT | 35000.0 | MEDICAL | 14.27 | 0.53 | 4.0 | 586 | No | 1 |


* Dataset Inspection

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45000 entries, 0 to 44999
Data columns (total 14 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   person_age                      45000 non-null  float64
 1   person_gender                   45000 non-null  object 
 2   person_education                45000 non-null  object 
 3   person_income                   45000 non-null  float64
 4   person_emp_exp                  45000 non-null  int64  
 5   person_home_ownership           45000 non-null  object 
 6   loan_amnt                       45000 non-null  float64
 7   loan_intent                     45000 non-null  object 
 8   loan_int_rate                   45000 non-null  float64
 9   loan_percent_income             45000 non-null  float64
 10  cb_person_cred_hist_length      45000 non-null  float64
 11  credit_score                    45000 non-null  int64  
 12  previous_loan_defaults_on_file  
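
Beyond `df.info()`, a quick count of missing values and duplicate rows is a common next inspection step. The sketch below is an assumption about how that could look: `quality_summary` is a hypothetical helper, and the toy frame is invented for illustration and does not contain rows from the dataset.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> dict:
    """Return basic data-quality counts for a DataFrame."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy frame for illustration only; these rows are not taken from the dataset
toy = pd.DataFrame(
    {"person_age": [22.0, 21.0, 21.0, None], "loan_status": [1, 0, 0, 1]}
)
print(quality_summary(toy))
```

Applied to the loaded data, the call would be `quality_summary(df)`.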

---

## Output Dataset

* Create an output directory to save the dataset

In [9]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save the raw dataset as a CSV file
df.to_csv("outputs/datasets/collection/LoanDefaultDataset.csv", index=False)
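
To confirm that the file was written as expected without loading it back into pandas, its header and row count can be read with the standard-library `csv` module. This is a minimal sketch; `summarize_csv` is a hypothetical helper name.

```python
import csv


def summarize_csv(path):
    """Return (header, number_of_data_rows) for the CSV file at `path`."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)            # first line holds the column names
        n_rows = sum(1 for _ in reader)  # remaining lines are data rows
    return header, n_rows
```

For the file saved above, the row count should match the 45000 entries reported by `df.info()`.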


---