# Notebook 01 - Data Collection

## Objectives

* Fetch data from Kaggle and save as raw data

## Inputs

* Kaggle JSON file - authentication token
* Kaggle dataset

## Outputs

* Dataset as a CSV file in the `outputs/datasets/collection` directory

## Additional Comments

* The dataset is publicly available on Kaggle and is anonymised, so there are no privacy concerns to address.
* The dataset is located [here](https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset).


---

# Change working directory

* This notebook is stored in the `jupyter_notebooks` subfolder
* The current working directory therefore needs to be changed to the workspace, i.e., the working directory needs to be changed from the current folder to its parent folder

First, the current directory is retrieved with `os.getcwd()`:

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\franc\\credit-card-default\\jupyter_notebooks'

Next, the working directory is set as the parent of the current `jupyter_notebooks` directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory
* This allows access to all the files and folders within the workspace, rather than solely those within the `jupyter_notebooks` directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Finally, confirm that the new current directory has been successfully set

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\franc\\credit-card-default'
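
One caveat with the cells above: if they are run a second time, the working directory moves up another level, out of the workspace. A minimal sketch of a guard against that is shown here. The helper name `workspace_root` and its `notebooks_folder` parameter are hypothetical and not part of the original notebook; the sketch assumes the notebooks folder is always called `jupyter_notebooks`.

```python
import os


def workspace_root(current_dir, notebooks_folder="jupyter_notebooks"):
    """Return the parent of current_dir only when current_dir is the notebooks folder.

    Hypothetical helper: any other directory is returned unchanged, so calling
    it repeatedly never climbs above the workspace.
    """
    normalised = os.path.normpath(current_dir)
    if os.path.basename(normalised) == notebooks_folder:
        return os.path.dirname(normalised)
    return current_dir


# Safe to re-run: the directory only changes while still inside jupyter_notebooks
os.chdir(workspace_root(os.getcwd()))
```

With this guard, re-running the cell leaves the working directory at the workspace root rather than moving to its parent.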

---

# Fetch data from Kaggle

The dataset can now be fetched from Kaggle, where it is stored.

Firstly, the Kaggle package is installed to allow fetching of the data:

In [5]:
! pip install kaggle==1.5.12



The `kaggle.json` file is then imported to the workspace to authenticate the request to access the data from Kaggle
* This file will not be seen in the public repository since it is linked to my personal Kaggle account and as such is listed in the `.gitignore` file
* The following cell sets the Kaggle API config directory, gets the path to the `kaggle.json` file and then sets the file permissions for the `kaggle.json` file

In [6]:
import stat

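# Tell the Kaggle API to look for kaggle.json in the workspace root
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
# Build the full path to the token file
kaggle_json_path = os.path.join(os.getcwd(), 'kaggle.json')
# Restrict the token file to owner read/write access
os.chmod(kaggle_json_path, stat.S_IREAD | stat.S_IWRITE)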
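
Before attempting the download, it can help to confirm that the token file is present and well formed. The sketch below is an illustration rather than part of the original notebook: `check_kaggle_token` is a hypothetical helper, and it only checks that the file is valid JSON containing non-empty `username` and `key` fields, which are the fields found in a Kaggle API token file. It does not verify the credentials with Kaggle.

```python
import json
import os


def check_kaggle_token(path):
    """Return True if path is a JSON file with non-empty 'username' and 'key' fields.

    Hypothetical helper: this is a local sanity check only and does not
    contact Kaggle to validate the credentials.
    """
    if not os.path.isfile(path):
        return False
    with open(path, encoding="utf-8") as handle:
        try:
            token = json.load(handle)
        except json.JSONDecodeError:
            return False
    return isinstance(token, dict) and all(
        token.get(field) for field in ("username", "key")
    )


token_ok = check_kaggle_token(os.path.join(os.getcwd(), "kaggle.json"))
print("kaggle.json looks usable" if token_ok else "kaggle.json is missing or incomplete")
```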


The dataset can now be imported:

In [7]:
KaggleDatasetPath = "uciml/default-of-credit-card-clients-dataset"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading default-of-credit-card-clients-dataset.zip to inputs/datasets/raw




  0%|          | 0.00/0.98M [00:00<?, ?B/s]
100%|██████████| 0.98M/0.98M [00:00<00:00, 2.81MB/s]
100%|██████████| 0.98M/0.98M [00:00<00:00, 2.79MB/s]


Finally, the files are unzipped and the `kaggle.json` file is removed

In [8]:
import shutil

for file in os.listdir(DestinationFolder):
    if file.endswith(".zip"):
        file_path = os.path.join(DestinationFolder, file)
        shutil.unpack_archive(file_path, DestinationFolder)
        os.remove(file_path)

os.remove("kaggle.json")
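
Note that the cell above raises `FileNotFoundError` if it is run again, because `kaggle.json` has already been deleted. A minimal sketch of a re-runnable variant follows. The function names `unpack_zip_files` and `remove_if_present` are hypothetical and introduced here only for illustration; the unpacking logic is the same as in the cell above.

```python
import os
import shutil


def unpack_zip_files(folder):
    """Unpack every .zip archive in folder, delete it, and return the names handled."""
    handled = []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".zip"):
            archive_path = os.path.join(folder, name)
            shutil.unpack_archive(archive_path, folder)
            os.remove(archive_path)
            handled.append(name)
    return handled


def remove_if_present(path):
    """Delete path if it exists; return True when a file was actually removed."""
    if os.path.isfile(path):
        os.remove(path)
        return True
    return False
```

In the notebook these would be called as `unpack_zip_files(DestinationFolder)` and `remove_if_present("kaggle.json")`; a second run then does nothing instead of raising an error.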

---

# Load and inspect data

The `read_csv` method is used to assign the dataset to a Pandas dataframe, and the first five rows of the resulting dataframe are displayed using the `head` method:

In [9]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/UCI_Credit_Card.csv")
df.head()

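|  | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 4 | 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |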


The dataset has loaded correctly. The `info` method can now be used to display more information:
* Not all columns are displayed in the view above, since there are too many to fit on the screen
* The `info` method also displays the data type (`Dtype`) of each column, so that we can check that the data is of the correct type to be used in our analysis

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   ID                          30000 non-null  int64  
 1   LIMIT_BAL                   30000 non-null  float64
 2   SEX                         30000 non-null  int64  
 3   EDUCATION                   30000 non-null  int64  
 4   MARRIAGE                    30000 non-null  int64  
 5   AGE                         30000 non-null  int64  
 6   PAY_0                       30000 non-null  int64  
 7   PAY_2                       30000 non-null  int64  
 8   PAY_3                       30000 non-null  int64  
 9   PAY_4                       30000 non-null  int64  
 10  PAY_5                       30000 non-null  int64  
 11  PAY_6                       30000 non-null  int64  
 12  BILL_AMT1                   30000 non-null  float64
 13  BILL_AMT2                   300

The dataset has 30,000 rows and 25 columns.
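
The same type and completeness checks can also be collected into a single table, which is easier to scan than the `info` printout when there are many columns. The sketch below is illustrative only: `summarise_columns` is a hypothetical helper, and the small `sample` frame uses made-up values (including a deliberately missing one) rather than the real dataset.

```python
import pandas as pd


def summarise_columns(frame):
    """Return one row per column with its dtype and its count of missing values."""
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "missing": frame.isna().sum(),
        }
    )


# Hypothetical miniature frame, not the real data; None marks a missing value
sample = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "LIMIT_BAL": [20000.0, 120000.0, None],
        "SEX": [2, 2, 1],
    }
)
summary = summarise_columns(sample)
print(summary)
```

In the notebook the helper would be applied to the full dataframe with `summarise_columns(df)`.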

Documentation on appropriate dataset sizes for machine learning commonly suggests that:
* The number of examples should be at least ten times the number of trainable parameters
* All else being equal, the larger the dataset, the better the results can be expected to be

These considerations are both noted in, for example, [Google's notes on the size and quality of a dataset](https://developers.google.com/machine-learning/data-prep/construct/collect/data-size-quality).

For a model with a modest number of trainable parameters, the current dataset has orders of magnitude more examples than parameters, which suggests that it should remain appropriate even if some examples are dropped due to unsuitability. A larger dataset would generally be preferable, and the current dataset is far smaller than some of the examples given in the Google documentation linked above. However, given the datasets available and the processing power available to work with, this dataset should be sufficient for this project.
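
The ten-times rule of thumb can be checked with simple arithmetic. The parameter count used below is an assumption made purely for illustration: it supposes a linear model such as logistic regression, with one weight for each of the 23 feature columns (the 25 columns minus `ID` and the target) plus an intercept. The model actually used in later notebooks may have a different number of parameters.

```python
def examples_per_parameter(n_examples, n_parameters):
    """Return how many examples are available for each trainable parameter."""
    return n_examples / n_parameters


def meets_ten_times_rule(n_examples, n_parameters):
    """Return True if there are at least ten examples per trainable parameter."""
    return n_examples >= 10 * n_parameters


N_EXAMPLES = 30_000                  # rows reported by df.info()
N_COLUMNS = 25                       # columns reported by df.info()
N_FEATURES = N_COLUMNS - 2           # exclude ID and the target column
ASSUMED_PARAMETERS = N_FEATURES + 1  # assumption: one weight per feature plus an intercept

ratio = examples_per_parameter(N_EXAMPLES, ASSUMED_PARAMETERS)
print(f"{ratio:.0f} examples per assumed parameter")  # prints "1250 examples per assumed parameter"
print("Meets ten-times rule:", meets_ten_times_rule(N_EXAMPLES, ASSUMED_PARAMETERS))
```

Under this assumption the dataset exceeds the ten-times threshold by a wide margin, so it would still satisfy the rule after a substantial number of rows were dropped.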

---

# Push files to Repo

The dataframe is now saved in a new `outputs/datasets/collection` folder, before starting work on it in the following notebook.

In [11]:
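import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save without the index so that it is not written as an extra column
df.to_csv("outputs/datasets/collection/credit_card_data.csv", index=False)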
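
To confirm that a file saved with `index=False` reloads with the same columns and no extra index column, a round trip can be tested on a small frame. This is a sketch under stated assumptions: the `sample` frame holds made-up values, and a temporary directory stands in for the workspace so that nothing in the repository is overwritten.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature frame standing in for the real dataset
sample = pd.DataFrame({"ID": [1, 2], "LIMIT_BAL": [20000.0, 120000.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Mirror the notebook's folder layout inside the temporary directory
    target_dir = os.path.join(tmp_dir, "outputs", "datasets", "collection")
    os.makedirs(target_dir, exist_ok=True)
    csv_path = os.path.join(target_dir, "credit_card_data.csv")

    # index=False keeps the row index out of the file
    sample.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(list(reloaded.columns))
print(reloaded.shape)
```

If the index had been written to the file, the reloaded frame would contain an additional unnamed column.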