# Notebook 01 - Data Collection

## Objectives

* Fetch data from Kaggle and save as raw data

## Inputs

* Kaggle JSON file - authentication token
* Kaggle dataset

## Outputs

* Dataset as a CSV file in the `outputs/datasets/collection` directory

## Additional Comments

* The dataset is publicly available on Kaggle and is anonymised, so there are no privacy concerns to address.
* The dataset is located [here](https://www.kaggle.com/datasets/brandao/diabetes).


---

# Change working directory

* This notebook is stored in the `jupyter_notebooks` subfolder
* The working directory therefore needs to be changed from this folder to its parent folder, the workspace root

Firstly, the current directory is retrieved with `os.getcwd()`:

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\franc\\diabetes-data-analysis\\jupyter_notebooks'

Next, the working directory is set as the parent of the current `jupyter_notebooks` directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory
* This allows access to all the files and folders within the workspace, rather than solely those within the `jupyter_notebooks` directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Finally, it is confirmed that the new current directory has been successfully set

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\franc\\diabetes-data-analysis'
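
One caveat with the cells above: because `current_dir` is reassigned after the change, re-running the `os.chdir` cell moves up one more level, out of the workspace. A minimal sketch of a guard follows, assuming the notebooks folder is always named `jupyter_notebooks`; the helper name `workspace_root` is hypothetical and not part of the original notebook.

```python
import os


def workspace_root(path, notebooks_dir="jupyter_notebooks"):
    """Return the parent of `path` only if `path` is the notebooks folder."""
    normalised = os.path.normpath(path)
    if os.path.basename(normalised) == notebooks_dir:
        return os.path.dirname(normalised)
    # Already outside the notebooks folder: leave the path unchanged
    return normalised


# Safe to run repeatedly: once in the workspace, further calls change nothing
os.chdir(workspace_root(os.getcwd()))
```

With this guard, the directory only changes when the notebook is still inside `jupyter_notebooks`.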

# Fetch data from Kaggle

The dataset can now be fetched from Kaggle, where it is stored.

Firstly, the Kaggle package is installed to allow fetching of the data:

In [4]:
! pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Using cached kaggle-1.5.12-py3-none-any.whl
Collecting tqdm (from kaggle==1.5.12)
  Using cached tqdm-4.65.0-py3-none-any.whl (77 kB)
Collecting python-slugify (from kaggle==1.5.12)
  Using cached python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Using cached text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Installing collected packages: text-unidecode, tqdm, python-slugify, kaggle
Successfully installed kaggle-1.5.12 python-slugify-8.0.1 text-unidecode-1.3 tqdm-4.65.0


The `kaggle.json` file is then imported to the workspace to authenticate the request to access the data from Kaggle
* This file will not be seen in the public repository since it is linked to my personal Kaggle account and as such is listed in the `.gitignore` file
* The following cell sets the Kaggle API config directory, gets the path to the `kaggle.json` file and then sets the file permissions for the `kaggle.json` file

In [7]:
import stat

os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
kaggle_json_path = os.path.join(os.getcwd(), 'kaggle.json')
os.chmod(kaggle_json_path, stat.S_IREAD | stat.S_IWRITE)
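
Since the download depends on a valid token, it can help to check the file before using it. The sketch below assumes the token is a JSON object with `username` and `key` entries, which is the usual layout of a Kaggle API token. The helper name `check_kaggle_token` is hypothetical.

```python
import json
import os


def check_kaggle_token(path):
    """Return True if `path` is a JSON file containing the expected keys."""
    if not os.path.isfile(path):
        return False
    with open(path, encoding="utf-8") as f:
        try:
            token = json.load(f)
        except json.JSONDecodeError:
            return False
    # Assumption: a Kaggle token is a JSON object with "username" and "key"
    return isinstance(token, dict) and {"username", "key"}.issubset(token)


if not check_kaggle_token(os.path.join(os.getcwd(), "kaggle.json")):
    print("kaggle.json is missing or malformed")
```

The check only inspects the structure of the file. It does not contact Kaggle, so it cannot confirm that the credentials themselves are valid.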


The dataset can now be imported:

In [8]:
KaggleDatasetPath = "brandao/diabetes"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading diabetes.zip to inputs/datasets/raw




  0%|          | 0.00/4.41M [00:00<?, ?B/s]
 68%|██████▊   | 3.00M/4.41M [00:00<00:00, 24.8MB/s]
100%|██████████| 4.41M/4.41M [00:00<00:00, 28.0MB/s]


Finally, the files are unzipped and the `kaggle.json` file is removed

In [9]:
import shutil

for file in os.listdir(DestinationFolder):
    if file.endswith(".zip"):
        file_path = os.path.join(DestinationFolder, file)
        shutil.unpack_archive(file_path, DestinationFolder)
        os.remove(file_path)

os.remove("kaggle.json")
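
To confirm what the unpacking step produced, the destination folder can be listed. This is a minimal sketch using only the standard library; the helper name `list_csv_files` is hypothetical.

```python
import os


def list_csv_files(folder):
    """Return (name, size_in_bytes) for each CSV file in `folder`, sorted by name."""
    found = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path) and name.lower().endswith(".csv"):
            found.append((name, os.path.getsize(path)))
    return found
```

Calling `list_csv_files(DestinationFolder)` after the cell above should include `diabetic_data.csv`, the file loaded in the next section.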

---

# Load and inspect data

The `read_csv` function is used to load the dataset into a Pandas dataframe, and the first five rows of the resulting dataframe are displayed using the `head` method:

In [11]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/diabetic_data.csv")
df.head()

| | encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | ? | 6 | 25 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | ? | 1 | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ... | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | ? | 1 | 1 | 7 | 2 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |


This view shows that the dataset is wide: not all of the columns are displayed. The `info` method is therefore also used to list every column in the dataframe:

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      101766 non-null  object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   weight                    101766 non-null  object
 6   admission_type_id         101766 non-null  int64 
 7   discharge_disposition_id  101766 non-null  int64 
 8   admission_source_id       101766 non-null  int64 
 9   time_in_hospital          101766 non-null  int64 
 10  payer_code                101766 non-null  object
 11  medical_specialty         101766 non-null  object
 12  num_lab_procedures        101766 non-null  int64 
 13  num_procedures            101766 non-null  int64 
 14  num_

The dataset has 101,766 rows and 50 columns.

Reviewing documentation on an appropriate size of dataset for machine learning, a number of sources suggest that:
* The number of examples should be at least ten times the number of trainable parameters
* All else being equal, the larger the dataset, the better the results can be expected to be

These considerations are both noted in, for example, [Google's notes on the size and quality of a dataset](https://developers.google.com/machine-learning/data-prep/construct/collect/data-size-quality).

The current dataset is expected to have several orders of magnitude more examples than trainable parameters, suggesting that it should remain suitable even if some examples are dropped as unsuitable. A larger dataset would generally be better, and the current dataset is far smaller than some of the examples given in the Google documentation linked above, but given the datasets and processing power available, it should be sufficient for this project.
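
As a rough worked example of the ten-times rule of thumb, the row count reported by `info` gives an upper bound on the number of trainable parameters the dataset would support. This is only arithmetic on figures already reported, and the variable names are illustrative.

```python
n_examples = 101_766          # rows reported by df.info()
examples_per_parameter = 10   # rule of thumb quoted above

# Largest parameter count that still leaves ten examples per parameter
max_parameters = n_examples // examples_per_parameter
print(max_parameters)  # 10176
```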

It is also clear from this initial inspection that some data cleaning will be necessary. Every 'weight' value in the rows displayed is `?`, which appears to be a placeholder for missing data; because `info` reports these entries as non-null, other columns may contain missing data in the same form. This will be explored further in the following notebook.
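
A minimal sketch of how such placeholders could be counted is shown below. It assumes that `?` marks a missing value, which the `head` output suggests but does not confirm. It runs on a tiny made-up sample rather than the real dataset.

```python
import io

import pandas as pd

# Tiny illustrative sample, not the real dataset
sample_csv = io.StringIO(
    "race,weight,time_in_hospital\n"
    "Caucasian,?,1\n"
    "?,?,3\n"
    "AfricanAmerican,[75-100),2\n"
)
sample = pd.read_csv(sample_csv)

# Count "?" placeholders per column; a null check would not detect these
placeholder_counts = (sample == "?").sum()
print(placeholder_counts)
```

Alternatively, passing `na_values="?"` to `read_csv` would convert these placeholders to `NaN` on load, so that the usual null checks apply.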

---

# Push files to Repo

The dataframe is now saved in a new `outputs/datasets/collection` folder, before starting work on it in the following notebook.

In [19]:
import os

# Create the output folder; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/diabetic_data.csv", index=False)