## Extracting Titanic Disaster Data From Kaggle



**Key Changes**

- The data is now downloaded with the Kaggle API. Follow the steps in this notebook.


### Setup Kaggle API token


Steps:

- Sign up and log in at https://kaggle.com
- Go to the Account tab (`https://www.kaggle.com/<username>/account`) and, under the API section, click `Create API Token`. A file named `kaggle.json` will be downloaded.
- Place `kaggle.json` in the root-level `titanic` folder.
- If you use a version control system such as git, make sure `kaggle.json` is listed in the `.gitignore` file so that you don't accidentally share the credentials.
- Install the `kaggle` package as shown below.

For more information, visit the Kaggle API GitHub page: https://github.com/Kaggle/kaggle-api
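The `.gitignore` step above is easy to forget, so a quick check can help. The following is a minimal sketch, not part of the original notebook: the helper name `is_ignored` is hypothetical, and it only looks for a literal `kaggle.json` line, so glob patterns that would also match the file are not detected.

```python
from pathlib import Path


def is_ignored(gitignore_path, entry="kaggle.json"):
    """Return True if `entry` appears as its own line in the .gitignore file.

    Simplified check (an assumption of this sketch): only exact matches of
    `entry` or `/entry` are recognised, not wildcard patterns.
    """
    path = Path(gitignore_path)
    if not path.is_file():
        # No .gitignore at all means nothing is ignored
        return False
    lines = (line.strip() for line in path.read_text().splitlines())
    return any(line in (entry, "/" + entry) for line in lines)
```

From a notebook inside the project, it could be called as `is_ignored("../.gitignore")`, assuming the `.gitignore` file sits in the project root next to `kaggle.json`.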


In [1]:
!pip install --user --upgrade kaggle



In [None]:
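# restart the kernel so that the newly installed package is picked up
# (relies on the `Jupyter` JavaScript object of the classic Notebook front end)
from IPython.display import display_html

def restart_kernel():
    display_html("<script>Jupyter.notebook.kernel.restart()</script>", raw=True)

restart_kernel()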

#### Read API key

In [2]:
import json
import os

api_file_path = os.path.join(os.path.pardir,'kaggle.json')
with open(api_file_path) as f:
    kaggle_token = json.load(f)
    # kaggle authentication
    os.environ["KAGGLE_USERNAME"] = kaggle_token['username']
    os.environ["KAGGLE_KEY"] = kaggle_token['key']

FileNotFoundError: [Errno 2] No such file or directory: '..\\kaggle.json'

In [3]:
print(os.environ["KAGGLE_USERNAME"])

KeyError: 'KAGGLE_USERNAME'
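The two errors above are linked: `kaggle.json` was not found, so the environment variables were never set. A minimal sketch of a more defensive version of the same step follows; the function name `load_kaggle_credentials` is hypothetical and not part of the original notebook or of the `kaggle` package.

```python
import json
import os


def load_kaggle_credentials(api_file_path):
    """Read kaggle.json and export its values as environment variables.

    Returns the username so the caller can confirm which account is in use.
    """
    if not os.path.isfile(api_file_path):
        # Fail early with the absolute path, which is easier to debug
        raise FileNotFoundError(
            f"kaggle.json not found at {os.path.abspath(api_file_path)}; "
            "download it from your Kaggle account page first"
        )
    with open(api_file_path) as f:
        kaggle_token = json.load(f)
    # environment variables used for Kaggle authentication
    os.environ["KAGGLE_USERNAME"] = kaggle_token["username"]
    os.environ["KAGGLE_KEY"] = kaggle_token["key"]
    return kaggle_token["username"]
```

In the notebook it would be called with the same path as before, for example `load_kaggle_credentials(os.path.join(os.path.pardir, 'kaggle.json'))`.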

### Using Kaggle API 

In [None]:
from kaggle.api.kaggle_api_extended import KaggleApi

#### Important step

You need to accept the terms and conditions of the competition whose dataset you want to download.
- This is a one-time activity per account.
- For the Titanic competition, visit https://www.kaggle.com/c/titanic and click "Join Competition".
- Then click "I Understand and Accept".
- Once done, proceed to the next steps.

In [1]:
# create kaggle API object
api = KaggleApi()
# authenticate
api.authenticate()
# file paths
raw_data_path = os.path.join(os.path.pardir,'data','raw')
# download 'train.csv' and 'test.csv' into the raw_data_path location
api.competition_download_file(competition='titanic', file_name='train.csv', path=raw_data_path, force=True)
api.competition_download_file(competition='titanic', file_name='test.csv', path=raw_data_path, force=True)


NameError: name 'KaggleApi' is not defined

In [None]:
train_data_path = os.path.join(raw_data_path, 'train.csv')
# printing top 5 rows
!head -5 $train_data_path

In [8]:
!ls -l ../data/raw

total 57
-rw-r--r-- 1 nt-user nt-user 28629 Apr 11 05:23 test.csv
-rw-r--r-- 1 nt-user nt-user 61194 Apr 11 05:23 train.csv
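
The `!head` and `!ls` shell commands used above are not available in every environment (for example, a default Windows shell). As a portable alternative, here is a minimal sketch using only the standard library; the helper name `preview_csv` is hypothetical and not part of the original notebook.

```python
import csv
from itertools import islice


def preview_csv(csv_path, n_rows=5):
    """Return the first `n_rows` lines of a CSV file as lists of strings.

    The header line, if present, counts as one of the returned rows.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        # islice stops reading after n_rows, so large files are not loaded fully
        return list(islice(csv.reader(f), n_rows))
```

For the file downloaded above, `preview_csv(train_data_path)` would return the header row followed by the first four passenger records.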


### Building the script file

In [9]:
get_raw_data_script_file = os.path.join(os.path.pardir,'src','data','get_raw_data.py')

In [10]:
%%writefile $get_raw_data_script_file
# -*- coding: utf-8 -*-
import json
import os
import logging

# getting root directory
project_dir = os.path.join(os.path.dirname(__file__), os.pardir, os.pardir)
# read the Kaggle API token and create environment variables
api_file_path = os.path.join(project_dir,'kaggle.json')
with open(api_file_path) as f:
    kaggle_token = json.load(f)
    # environment variable for kaggle authentication
    os.environ["KAGGLE_USERNAME"] = kaggle_token['username']
    os.environ["KAGGLE_KEY"] = kaggle_token['key']


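# imported only after the environment variables are set, because the kaggle
# package attempts to authenticate when it is first imported
from kaggle.api.kaggle_api_extended import KaggleApi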

def main(project_dir):
    '''
    Download the raw Titanic training and test data into data/raw.
    '''
    # get logger
    logger = logging.getLogger(__name__)
    logger.info('getting raw data')
    
    # file name : check from the competition data page
    train_file_name = 'train.csv'
    test_file_name = 'test.csv'

    # file paths
    raw_data_path = os.path.join(project_dir,'data','raw')
    
    # extract data
    api = KaggleApi()
    api.authenticate()
    api.competition_download_file(competition='titanic', file_name=train_file_name, path=raw_data_path, force=True)
    api.competition_download_file(competition='titanic', file_name=test_file_name, path=raw_data_path, force=True)
    logger.info('downloaded raw training and test data')


if __name__ == '__main__':
    
    # setup logger
    log_fmt = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    logging.basicConfig(level=logging.INFO, format=log_fmt)

    # call the main
    main(project_dir)


Writing ../src/data/get_raw_data.py


In [12]:
!python3 $get_raw_data_script_file

2021-04-11 05:23:26,388 - __main__ - INFO - getting raw data
Downloading train.csv to ../src/data/../../data/raw
  0%|                                               | 0.00/59.8k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 59.8k/59.8k [00:00<00:00, 6.03MB/s]
Downloading test.csv to ../src/data/../../data/raw
  0%|                                               | 0.00/28.0k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 28.0k/28.0k [00:00<00:00, 5.45MB/s]
2021-04-11 05:23:27,274 - __main__ - INFO - downloaded raw training and test data
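
The download log shows the path `../src/data/../../data/raw`, which is what joining `os.pardir` twice onto the script's directory produces. It points to the right place, but it can be normalised for more readable log output. A small sketch, using `posixpath` so that the result is the same on every platform:

```python
import posixpath

# Path exactly as it appears in the download log above
logged_path = "../src/data/../../data/raw"

# normpath collapses each "name/.." pair without touching the file system
normalized = posixpath.normpath(logged_path)
print(normalized)  # prints ../data/raw
```

Inside the script itself, wrapping `project_dir` in `os.path.abspath(...)` would have a similar effect and yield an absolute path; that would be a change to the original script rather than a description of what it currently does.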
