# Data Acquisition

In this notebook, I briefly summarise the project idea and explain why I chose this specific data set and what I aim to do with it. I then download the data from Kaggle using its API. So that any interested reader can replicate this, I include tutorial-style instructions.

## Project Idea
### Scope
This is my first solo portfolio project. While I have previously participated in team projects as well as basic Kaggle challenges, I have never completed a project like this entirely on my own. Besides the machine learning modeling, there are many more aspects to pay attention to. Because of that, I will keep this first project rather simple and focus on completing an MVP (minimum viable project) first. After that, I can either improve it incrementally or move on to a more complex and comprehensive project.

### Objective
In this project, I will use data from heart failure patients to predict survival with a statistical machine learning model (as opposed to deep learning). I will include exploratory data analysis, feature engineering, model selection, and evaluation.

### Dataset

Davide Chicco, Giuseppe Jurman: Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making 20, 16 (2020)

https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data

## Download Dataset using API

If you do not have the `kaggle` package installed, please install it by running either:

`pip install kaggle`

or

`conda install -c conda-forge kaggle`.

In [1]:
# -q for quiet installation
!pip install -q kaggle

If you do not have the API key set up yet, please do so by navigating to your account settings in Kaggle and creating a new API key. This downloads a file called `kaggle.json`. Then run:

```
mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
```

If you are currently in the directory you downloaded `kaggle.json` to (for example your `Downloads` directory), you can run the commands unchanged. If you are in another directory, please either specify the path to `kaggle.json` or change to the directory you downloaded it to.
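To double-check the key file from Python, something like the following could be used. It is only a minimal sketch: the helper name `check_kaggle_credentials` is made up for this notebook, and it assumes a POSIX system (macOS or Linux) where `chmod 600` is meaningful. It never prints the key itself.

```python
import json
import stat
from pathlib import Path


def check_kaggle_credentials(path: Path) -> list:
    """Return a list of problems found with the key file (empty if none)."""
    if not path.is_file():
        return [f"{path} does not exist"]
    problems = []
    # chmod 600 means: no permission bits set for group or others
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(f"permissions are {oct(mode)}, expected 0o600")
    # the key file is JSON containing a username and a key
    try:
        content = json.loads(path.read_text())
    except json.JSONDecodeError:
        return problems + ["file is not valid JSON"]
    if not isinstance(content, dict):
        return problems + ["file does not contain a JSON object"]
    for field in ("username", "key"):
        if field not in content:
            problems.append(f"missing field: {field}")
    return problems


# an empty list means that no problem was found
print(check_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json"))
```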

To test whether the installation was successful, you can do one or more of the following:
- run `kaggle -v` in the terminal: you should get something comparable to `Kaggle API 1.5.16`
- run `which kaggle` in the terminal: you should get a path that matches the way you installed it
- run `import kaggle` in Python: it should run without an error
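The checks above can also be approximated from inside the notebook. This is a rough sketch using only the standard library; the function name `kaggle_install_status` is hypothetical. It looks the package up without importing it, so it only tells you whether the package and the command line tool can be found, not whether the API key works.

```python
import importlib.util
import shutil


def kaggle_install_status() -> dict:
    """Report whether the kaggle CLI and the Python package can be found."""
    return {
        # same idea as `which kaggle`: path of the executable, or None
        "cli_path": shutil.which("kaggle"),
        # looks the package up without importing it
        "package_found": importlib.util.find_spec("kaggle") is not None,
    }


print(kaggle_install_status())
```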

In [2]:
# import libraries
import os
import pandas as pd



In [3]:
# this notebook is stored in the notebooks directory, so that is the current working directory
os.getcwd()

'/Users/fabianjkrueger/Documents/portfolio/machine_learning/ml_portfolio/notebooks'

In [4]:
# I prefer working in the project's base directory because it simplifies the paths
os.chdir("..")
os.getcwd()

'/Users/fabianjkrueger/Documents/portfolio/machine_learning/ml_portfolio'
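One caveat with `os.chdir("..")` is that running the cell a second time moves up one more level. A guarded variant could look like the sketch below. The function name `move_to_project_root` is my own, and it assumes the notebooks directory is literally called `notebooks`, as in the path shown above.

```python
import os
from pathlib import Path


def move_to_project_root(notebooks_dir_name: str = "notebooks") -> Path:
    """Move up one level, but only if we are inside the notebooks directory."""
    cwd = Path.cwd()
    if cwd.name == notebooks_dir_name:
        os.chdir(cwd.parent)
    # returns the working directory after the (possible) change
    return Path.cwd()


# safe to run repeatedly: after the first call the name check no longer matches
print(move_to_project_root())
```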

Now use the Kaggle API to download the data into the data directory. The download is treated as the original data of this project and is therefore saved in `data/raw`, even though it was already processed in some way before it was uploaded to Kaggle. Original data is regarded as immutable: it will not be changed at all. Data derived from it will be saved to the `data/interim` and `data/processed` directories.

All data directories are listed in the `.gitignore`, so the data will not be on GitHub. (Disclaimer: If you still find it on GitHub, I might have changed this after writing. You can check the `.gitignore` file to see whether the data folders are still ignored.)
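The directory layout described here could be created up front with a few lines of Python. This is a minimal sketch; the helper name `make_data_dirs` is made up for illustration and not part of any library.

```python
from pathlib import Path

# the three data directories used in this project
DATA_SUBDIRS = ("raw", "interim", "processed")


def make_data_dirs(project_root: Path) -> list:
    """Create data/raw, data/interim and data/processed below project_root."""
    created = []
    for name in DATA_SUBDIRS:
        directory = project_root / "data" / name
        # parents=True also creates data/, exist_ok=True makes re-running safe
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    return created
```

Calling `make_data_dirs(Path.cwd())` from the project's base directory would create any of the three directories that do not exist yet and leave existing ones untouched.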

In [5]:
# download the data set using the API
# --unzip: unzip the downloaded file and delete the zip file right away
# --dataset: the user and data set name, i.e. the path inside Kaggle to download the data set from
# --path: the local path to download the data to
!kaggle datasets download --unzip --dataset andrewmvd/heart-failure-clinical-data --path data/raw

Downloading heart-failure-clinical-data.zip to data/raw
  0%|                                               | 0.00/3.97k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 3.97k/3.97k [00:00<00:00, 4.53MB/s]
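The same download should also be possible from Python instead of the shell. The sketch below uses the `KaggleApi` class that ships with the `kaggle` package; the wrapper name `download_dataset` is my own, and I have not run this variant in this notebook. It needs network access and the API key set up earlier.

```python
def download_dataset(dataset: str, destination: str) -> None:
    """Download and unzip a Kaggle data set via the Python client."""
    # imported inside the function so that defining it does not require the package
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()  # uses the API key set up earlier
    api.dataset_download_files(dataset, path=destination, unzip=True)


# usage (needs network access and a configured API key):
# download_dataset("andrewmvd/heart-failure-clinical-data", "data/raw")
```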


In [8]:
# get file name and full path of newly downloaded and unzipped data file
# it was saved in data/raw -> list files in this folder to screen
!ls data/raw

README.md
heart_failure_clinical_records_dataset.csv
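Instead of copying the file name by hand, the CSV file could also be located from Python. A small sketch, with `list_csv_files` being a hypothetical helper name:

```python
from pathlib import Path


def list_csv_files(directory: Path) -> list:
    """Return all CSV files directly inside directory, sorted by name."""
    return sorted(directory.glob("*.csv"))
```

With the download above, `list_csv_files(Path("data/raw"))` should contain exactly one entry, which can be passed straight to `pd.read_csv`.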


In [9]:
# have a look at the data set
# copy paste file name from cell's output above, read using pandas
pd.read_csv("data/raw/heart_failure_clinical_records_dataset.csv")

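|  | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.00 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.00 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.00 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.00 | 2.7 | 116 | 0 | 0 | 8 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 294 | 62.0 | 0 | 61 | 1 | 38 | 1 | 155000.00 | 1.1 | 143 | 1 | 1 | 270 | 0 |
| 295 | 55.0 | 0 | 1820 | 0 | 38 | 0 | 270000.00 | 1.2 | 139 | 0 | 0 | 271 | 0 |
| 296 | 45.0 | 0 | 2060 | 1 | 60 | 0 | 742000.00 | 0.8 | 138 | 0 | 0 | 278 | 0 |
| 297 | 45.0 | 0 | 2413 | 0 | 38 | 0 | 140000.00 | 1.4 | 140 | 1 | 1 | 280 | 0 |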
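As a quick sanity check, the header of the downloaded file can be compared against the columns seen in the output above. This is a sketch using only the standard library; the helper name `missing_columns` is made up, and the column list is copied from the output of this notebook rather than from any official schema.

```python
import csv

# column names as they appear in the output above
EXPECTED_COLUMNS = [
    "age", "anaemia", "creatinine_phosphokinase", "diabetes",
    "ejection_fraction", "high_blood_pressure", "platelets",
    "serum_creatinine", "serum_sodium", "sex", "smoking", "time",
    "DEATH_EVENT",
]


def missing_columns(csv_path, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from the CSV header."""
    with open(csv_path, newline="") as handle:
        # the first row of the file is the header; an empty file gives []
        header = next(csv.reader(handle), [])
    return [column for column in expected if column not in header]
```

For the file downloaded here, `missing_columns("data/raw/heart_failure_clinical_records_dataset.csv")` should return an empty list.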


There we have it. The data has been downloaded to the `data/raw` directory using Kaggle's API. The exploratory data analysis follows in the next notebook.