# Notebook 01 - Data Collection

## Objectives

* Fetch data from Kaggle and save as raw data

## Inputs

* Kaggle JSON file - authentication token
* Kaggle dataset

## Outputs

* Dataset saved in CSV format in the `inputs/datasets/raw` directory

## Additional Comments

* The dataset is publicly available on Kaggle and is anonymised, so there are no privacy concerns to address.
* The dataset is located [here](https://www.kaggle.com/datasets/brandao/diabetes).


---

# Change working directory

* This notebook is stored in the `jupyter_notebooks` subfolder
* The working directory therefore needs to be changed from this folder to its parent folder, the workspace root

First, the current directory is retrieved with `os.getcwd()`:

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\franc\\diabetes-data-analysis\\jupyter_notebooks'

Next, the working directory is set to the parent of the current `jupyter_notebooks` directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory
* This allows access to all the files and folders within the workspace, rather than solely those within the `jupyter_notebooks` directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Finally, it is confirmed that the new current directory has been successfully set

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\franc\\diabetes-data-analysis'
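
Because `os.chdir()` moves up one level every time the cell above is executed, re-running it would leave the workspace. The following is a minimal sketch of a guarded version, assuming the notebook folder is named `jupyter_notebooks` as described above; `move_to_workspace` is a hypothetical helper name and is not part of the original notebook.

```python
import os


def move_to_workspace(notebook_dir_name="jupyter_notebooks"):
    """Move to the parent directory only when currently inside the notebook folder.

    `notebook_dir_name` is an assumption taken from the folder name used in this project.
    """
    cwd = os.getcwd()
    if os.path.basename(cwd) == notebook_dir_name:
        # Still inside the notebook folder, so step up to the workspace root
        os.chdir(os.path.dirname(cwd))
    # Otherwise the directory is left unchanged, which makes re-runs harmless
    return os.getcwd()
```

Calling `move_to_workspace()` more than once leaves the working directory at the workspace root after the first call.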

# Fetch data from Kaggle

The dataset can now be fetched from Kaggle, where it is stored.

Firstly, the Kaggle package is installed to allow fetching of the data:

In [4]:
! pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Using cached kaggle-1.5.12-py3-none-any.whl
Collecting tqdm (from kaggle==1.5.12)
  Using cached tqdm-4.65.0-py3-none-any.whl (77 kB)
Collecting python-slugify (from kaggle==1.5.12)
  Using cached python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Using cached text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Installing collected packages: text-unidecode, tqdm, python-slugify, kaggle
Successfully installed kaggle-1.5.12 python-slugify-8.0.1 text-unidecode-1.3 tqdm-4.65.0


The `kaggle.json` file is then imported to the workspace to authenticate the request to access the data from Kaggle
* This file will not be seen in the public repository since it is linked to my personal Kaggle account and as such is listed in the `.gitignore` file
* The following cell sets the Kaggle API config directory, builds the path to the `kaggle.json` file and then sets the file permissions for the `kaggle.json` file

In [7]:
import stat

os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
kaggle_json_path = os.path.join(os.getcwd(), 'kaggle.json')
os.chmod(kaggle_json_path, stat.S_IREAD | stat.S_IWRITE)
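
If `kaggle.json` has not been copied into the workspace root, `os.chmod()` raises a `FileNotFoundError` with little context. The sketch below shows the same configuration step with an explicit check first; `configure_kaggle` is a hypothetical helper name introduced only for illustration.

```python
import os
import stat


def configure_kaggle(config_dir):
    """Point the Kaggle API at `config_dir` and set read/write permission on kaggle.json."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    # Same two steps as the cell above: set the config directory, then the permissions
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Owner read/write; on Windows os.chmod only toggles the read-only flag
    os.chmod(token_path, stat.S_IREAD | stat.S_IWRITE)
    return token_path
```

In this notebook it could be called as `configure_kaggle(os.getcwd())` once the working directory is the workspace root.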


The dataset can now be imported:

In [8]:
KaggleDatasetPath = "brandao/diabetes"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading diabetes.zip to inputs/datasets/raw




  0%|          | 0.00/4.41M [00:00<?, ?B/s]
 68%|██████▊   | 3.00M/4.41M [00:00<00:00, 24.8MB/s]
100%|██████████| 4.41M/4.41M [00:00<00:00, 28.0MB/s]


Finally, the files are unzipped and the `kaggle.json` file is removed

In [9]:
import shutil

for file in os.listdir(DestinationFolder):
    if file.endswith(".zip"):
        file_path = os.path.join(DestinationFolder, file)
        shutil.unpack_archive(file_path, DestinationFolder)
        os.remove(file_path)

os.remove("kaggle.json")
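
Since the archive is deleted straight after extraction, it can be worth confirming that the extraction actually produced CSV files. A minimal sketch, assuming the extracted files sit directly inside the destination folder; `list_csv_files` is a hypothetical helper name, not part of the original notebook.

```python
import os


def list_csv_files(folder):
    """Return the sorted names of the CSV files located directly inside `folder`."""
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith(".csv")
    )
```

Here it could be called as `list_csv_files(DestinationFolder)`; an empty list would indicate that nothing was extracted.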

---


# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
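try:
  # create your folder here by uncommenting the call below and supplying a folder name
  # os.makedirs(name='')
  pass  # placeholder so the otherwise empty try block is valid Python
except Exception as e:
  print(e)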
