# **Data Collection**

## Objectives

* Install dependencies
* Fetch data from Kaggle
* Do an initial inspection of the data
* Save the raw data in the outputs folder

### Inputs

* Kaggle JSON file (`kaggle.json`), used for authentication

### Outputs

* Raw dataset, saved in the outputs folder as a CSV file


---

## Set up the Working Directory

Change the working directory from the notebook's folder to its parent folder (the project root), then confirm the new location.

In [None]:
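import os

# Move up one level so that relative paths resolve from the project root.
# Run this cell only once per session: each run moves up another level.
os.chdir(os.path.dirname(os.getcwd()))
current_dir = os.getcwd()
current_dir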
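
Because the cell above moves up one level each time it runs, re-running it can leave the session outside the project. A minimal sketch of a more defensive alternative follows, assuming the project root is the folder that holds `requirements.txt`. The helper name `find_project_root`, its `marker` parameter, and the `jupyter_notebooks` folder name are hypothetical and not part of this notebook.

```python
import tempfile
from pathlib import Path


def find_project_root(start, marker="requirements.txt"):
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No {marker!r} found at or above {start}")


# Demonstration in a throwaway folder tree, so nothing in the project is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp).resolve()
    (demo_root / "requirements.txt").write_text("pandas\n")
    notebook_dir = demo_root / "jupyter_notebooks"  # hypothetical folder name
    notebook_dir.mkdir()
    found_from_subfolder = find_project_root(notebook_dir)
    found_from_root = find_project_root(demo_root)
```

In the notebook this could be called as `os.chdir(find_project_root(os.getcwd()))`, which is safe to re-run because it stops at the same folder each time.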

## Install Python Packages

Install dependencies from requirements.txt

In [None]:
%pip install -r requirements.txt

___

## Fetch Dataset from Kaggle

Install the Kaggle package, pinned to version 1.5.12.

In [None]:
%pip install kaggle==1.5.12

Add the Kaggle authentication token (`kaggle.json`) to the project root directory.
Read more in the [Kaggle Docs](https://github.com/Kaggle/kaggle-api).

Point the Kaggle client at the token for this session and restrict the file's permissions to its owner.

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
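
The `! chmod` call above relies on a Unix shell and fails with a shell message if the token is missing. A minimal sketch of the same step in plain Python follows, assuming the token sits in the folder passed in. The helper name `secure_kaggle_token` is hypothetical, and the demonstration uses a placeholder file rather than real credentials.

```python
import os
import stat
import tempfile
from pathlib import Path


def secure_kaggle_token(config_dir):
    """Check that kaggle.json exists in `config_dir` and make it owner read/write only."""
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"Expected a Kaggle token at {token}")
    # Equivalent to `chmod 600`; on Windows os.chmod only toggles the read-only flag.
    os.chmod(token, stat.S_IRUSR | stat.S_IWUSR)
    return str(token.parent)


# Demonstration with a placeholder token in a temporary folder (not real credentials).
with tempfile.TemporaryDirectory() as tmp:
    placeholder = Path(tmp) / "kaggle.json"
    placeholder.write_text('{"username": "example", "key": "example"}')
    returned_dir = secure_kaggle_token(tmp)
    token_mode = stat.S_IMODE(placeholder.stat().st_mode)

    # A folder without a token should be reported, not silently accepted.
    try:
        secure_kaggle_token(Path(tmp) / "missing")
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True
```

In the notebook, the returned folder could be assigned directly: `os.environ['KAGGLE_CONFIG_DIR'] = secure_kaggle_token(os.getcwd())`.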

Define the Kaggle dataset and destination folder.
Download the dataset.

In [None]:
KaggleDatasetPath = "yasserh/titanic-dataset"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, then delete the zip file and the `kaggle.json` token file.

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
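
The shell command above depends on `unzip` and `rm` being available. A minimal sketch of the unzip-and-clean-up step using only the standard library follows, assuming every `.zip` file in the folder should be extracted in place. The helper name `extract_and_remove_zips` is hypothetical, and the archive in the demonstration is made up. Removing `kaggle.json` is left out of the sketch.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip archive in `folder` into `folder`, then delete the archive."""
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        archive.unlink()  # delete the archive only after a successful extraction
    return extracted


# Demonstration with a small made-up archive in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_folder = Path(tmp)
    with zipfile.ZipFile(demo_folder / "example.zip", "w") as zf:
        zf.writestr("example.csv", "id,value\n1,a\n")
    extracted_names = extract_and_remove_zips(demo_folder)
    remaining_files = sorted(p.name for p in demo_folder.iterdir())
```

In the notebook this could be called as `extract_and_remove_zips(DestinationFolder)`.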

___

## Initial Inspection of Data

In [None]:
import pandas as pd

# Load the raw CSV extracted from the Kaggle download and preview the first rows
df = pd.read_csv("inputs/datasets/raw/Titanic-Dataset.csv")
df.head()

In [None]:
df.info()
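
`df.info()` reports dtypes and non-null counts. A per-column summary of missing values and a duplicate-row count can make the initial inspection easier to scan. The sketch below is one possible way to build that summary. The helper name `summarise_columns` is hypothetical, and the small frame is made-up toy data with illustrative column names, not rows from the downloaded dataset.

```python
import pandas as pd


def summarise_columns(frame):
    """Return one row per column with dtype, missing count, missing % and unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "missing_pct": (frame.isna().mean() * 100).round(1),
        "unique": frame.nunique(),
    })


# Toy data for illustration only (not taken from the real dataset).
toy = pd.DataFrame({
    "Age": [22.0, None, 35.0, 35.0],
    "Cabin": [None, "C85", "E46", "E46"],
    "Sex": ["male", "female", "female", "female"],
})
summary = summarise_columns(toy)
duplicate_rows = int(toy.duplicated().sum())  # rows identical to an earlier row
```

In the notebook, `summarise_columns(df)` and `df.duplicated().sum()` would give the same overview for the raw data.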

___

## Save Files to Repo

In [None]:
import os

# Create the output folder; exist_ok=True makes the cell safe to re-run
os.makedirs('outputs/datasets/collection', exist_ok=True)

In [None]:
df.to_csv("outputs/datasets/collection/titanic_passengers.csv", index=False)