# **Data Collection**

## Objectives

* Fetch data from Kaggle as a zip file.
* Unzip the file and remove the zip file.
* Save the data as raw data in an output folder.

## Inputs

* Kaggle JSON file (`kaggle.json`) - The Kaggle API authentication token.

## Outputs

* Generated dataset: `outputs/datasets/collection/row/LoanDefaultDataset.csv`

---

## **Setup**

### Install packages

In [None]:
%pip install -r /workspace/loan_default/requirements.txt

### Imports

In [None]:
import os
import pandas as pd

### Change working directory

* Change the working directory from its current folder to its parent folder.

In [None]:
current_dir = os.getcwd()
current_dir

* Make the parent of the current directory the new current directory.

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

* Confirm the new current directory.

In [None]:
current_dir = os.getcwd()
current_dir
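Re-running the `os.chdir` cell above moves up one more level each time, which can silently leave the notebook outside the project root. A minimal sketch of a guarded version follows. It assumes the notebooks live in a folder named `jupyter_notebooks`; that name and the helper `move_to_parent_once` are hypothetical and are demonstrated here on a temporary directory rather than the real project.

```python
import os
import tempfile


def move_to_parent_once(notebook_dir_name):
    """Move up one level only while the current directory is the notebook folder."""
    cwd = os.getcwd()
    if os.path.basename(cwd) == notebook_dir_name:
        os.chdir(os.path.dirname(cwd))
    return os.getcwd()


# Demonstration on a throwaway directory tree (the folder name is an assumption).
original_dir = os.getcwd()
with tempfile.TemporaryDirectory() as tmp_dir:
    project_root = os.path.realpath(tmp_dir)
    notebook_dir = os.path.join(project_root, "jupyter_notebooks")
    os.makedirs(notebook_dir)
    os.chdir(notebook_dir)
    try:
        first_call = move_to_parent_once("jupyter_notebooks")   # moves to the root
        second_call = move_to_parent_once("jupyter_notebooks")  # no further move
    finally:
        os.chdir(original_dir)  # restore the directory the sketch started in
```

Both calls return the same project root, so running the cell twice has no extra effect.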

## **Dataset Importation**

* Install Kaggle.

In [None]:
%pip install kaggle

* Configure Kaggle.
  * Point the Kaggle configuration directory (`KAGGLE_CONFIG_DIR`) to the current working directory.
  * Restrict the permissions of `kaggle.json` so that only the file owner can read and write it.

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
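The `chmod 600` shell call can also be written in pure Python, which helps where the `!` shell escape is unavailable. The sketch below assumes a POSIX system. The helper name `restrict_token_permissions` is hypothetical, and the demonstration uses a throwaway file rather than a real token.

```python
import os
import stat
import tempfile


def restrict_token_permissions(token_path):
    """Set owner-only read/write (0o600) on the file and return the resulting mode."""
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(token_path).st_mode)


# Demonstration on a placeholder file standing in for kaggle.json.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_token = os.path.join(tmp_dir, "kaggle.json")
    with open(demo_token, "w") as fh:
        fh.write("{}")
    demo_mode = restrict_token_permissions(demo_token)
    print(oct(demo_mode))
```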

* Download Dataset as a Zip file.

In [None]:
KaggleDatasetPath = "taweilo/loan-approval-classification-data"
DestinationFolder = "inputs/datasets"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

* Unzip Dataset.

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
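The unzip-and-delete step can also be sketched with Python's standard `zipfile` module, which avoids relying on the `unzip` and `rm` shell commands. The helper `extract_and_remove` is a hypothetical name, and the demonstration builds a small placeholder archive instead of using the Kaggle download.

```python
import os
import tempfile
import zipfile


def extract_and_remove(zip_path, dest_dir):
    """Extract all members of zip_path into dest_dir, delete the archive, return member names."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(dest_dir)
    os.remove(zip_path)
    return names


# Demonstration with a placeholder archive containing one small CSV file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip = os.path.join(tmp_dir, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("loan_data.csv", "a,b\n1,2\n")
    extracted = extract_and_remove(demo_zip, tmp_dir)
    zip_still_exists = os.path.exists(demo_zip)
    csv_exists = os.path.exists(os.path.join(tmp_dir, "loan_data.csv"))
```

Unlike the shell cell, this sketch leaves `kaggle.json` in place; removing the token remains a separate step.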

* Dataset Loading.

In [None]:
df = pd.read_csv("inputs/datasets/loan_data.csv")
df.head()

* Dataset Inspection.

In [None]:
df.info()

> Result:
>
> - `dtypes: float64(6), int64(3), object(5)`

## **Output Dataset**

* Create an output directory.
* Save the dataset to the output directory.

In [None]:
# Create outputs/datasets/collection/row; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection/row', exist_ok=True)

# Save the raw dataset without the index column
df.to_csv("outputs/datasets/collection/row/LoanDefaultDataset.csv", index=False)