In [1]:
# Basic utility functions
import logging
from src.log import logger
from src import paths
from src.utils import list_dir


## Set up Raw Data Source from Kaggle using their API

In [2]:
from src.data import DataSource

In [3]:
ds_name = 'beer_review'
dsrc = DataSource(ds_name)

**Note**: Kaggle requires an account, so you need to be signed in for this step to work. TODO: check later which error messages appear when NOT signed in.

I don't think this is handled nicely yet, so I'll download the Kaggle data "by hand" through the API and then point the raw data source at the local copy.

We'll use the Kaggle API (https://github.com/Kaggle/kaggle-api) for this, since we want to download from "https://www.kaggle.com/rdoume/beerreviews/download".

We'll need to `pip install kaggle` via our `environment.yml`.

Nope. That failed.

To use the Kaggle API, sign up for a Kaggle account at https://www.kaggle.com. Then go to the 'Account' tab of your user profile (`https://www.kaggle.com/<username>/account`) and select 'Create API Token'. This will trigger the download of `kaggle.json`, a file containing your API credentials. Place this file in the location `~/.kaggle/kaggle.json` (on Windows in the location `C:\Users\<Windows-username>\.kaggle\kaggle.json`; you can check the exact location, sans drive, with `echo %HOMEPATH%`). You can define a shell environment variable `KAGGLE_CONFIG_DIR` to change this location to `$KAGGLE_CONFIG_DIR/kaggle.json` (on Windows it will be `%KAGGLE_CONFIG_DIR%\kaggle.json`).

For your security, ensure that other users of your computer do not have read access to your credentials. On Unix-based systems you can do this with the following command:

    chmod 600 ~/.kaggle/kaggle.json

You can also choose to export your Kaggle username and token to the environment:

    export KAGGLE_USERNAME=datadinosaur
    export KAGGLE_KEY=xxxxxxxxxxxxxx

In addition, you can export any other configuration value that normally would be in `$HOME/.kaggle/kaggle.json` in the format `KAGGLE_<VARIABLE>` (note uppercase).
For example, if the file had the variable "proxy" you would export `KAGGLE_PROXY` and it would be discovered by the client.
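
The lookup rules described above can be sketched in a few lines of Python. This is only an illustration, not the Kaggle client's own code: the function names `kaggle_config_file`, `load_kaggle_credentials` and `is_private` are hypothetical, and the choice to prefer environment variables over the file is an assumption of this sketch.

```python
import json
import os
import stat
from pathlib import Path


def kaggle_config_file(env=None):
    """Return where kaggle.json is expected, honoring KAGGLE_CONFIG_DIR."""
    env = os.environ if env is None else env
    config_dir = env.get("KAGGLE_CONFIG_DIR")
    base = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    return base / "kaggle.json"


def load_kaggle_credentials(env=None):
    """Return (username, key).

    Assumption of this sketch: KAGGLE_USERNAME / KAGGLE_KEY in the
    environment take precedence over the contents of kaggle.json.
    """
    env = os.environ if env is None else env
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return env["KAGGLE_USERNAME"], env["KAGGLE_KEY"]
    with open(kaggle_config_file(env)) as f:
        config = json.load(f)
    return config["username"], config["key"]


def is_private(path):
    """True if group and others have no permissions on the file (Unix only)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```

Calling `is_private(kaggle_config_file())` is a quick way to confirm that the `chmod 600` step took effect before trying the API.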

Giving up on the API. Downloading manually.

## Options for adding raw files
This will need to go in docs
* add_url
* add_file

In [4]:
logger.setLevel(logging.DEBUG)

In [5]:
import os
import pathlib

Modify your download location as necessary

In [6]:
# Path.home() works on any OS; PosixPath and $HOME are Unix-only
home_location = pathlib.Path.home()
downloaded_location = home_location / "Downloads" / "beerreviews.zip"
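
Since the file is downloaded by hand, it can help to check that it is really there and really a zip archive before handing it to the `DataSource`. A minimal sketch using only the standard library follows; `check_download` is a hypothetical helper, not part of `src`.

```python
import zipfile
from pathlib import Path


def check_download(path):
    """Return the archive's member names, or raise a descriptive error."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Expected a manual download at {path}")
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path} exists but is not a valid zip archive")
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()
```

For example, `check_download(downloaded_location)` would list the files inside `beerreviews.zip`, or fail early with a clear message if the download is missing or incomplete.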

In [None]:
message = """
Please download the beerreviews dataset from the Kaggle webpage at:
   https://www.kaggle.com/rdoume/beerreviews
This requires creating a Kaggle account and consenting to their terms of service.
"""

In [7]:
dsrc.add_manual_download(message, file_name="beerreviews.zip")

AttributeError: 'DataSource' object has no attribute 'add_message'

In [None]:
dsrc.add_file(source_file=downloaded_location)

In [None]:
dsrc.file_list

In [None]:
dsrc.fetch()

In [None]:
dsrc.file_list

In [None]:
!ls -la $paths.raw_data_path

In [None]:
dsrc.unpack()

In [None]:
!ls -la $paths.interim_data_path

## Save progress!

In [None]:
from src import workflow

In [None]:
workflow.add_datasource(dsrc)

In [None]:
workflow.available_datasources()

Should be able to nuke this from orbit and recover it again at this point...

In [None]:
!cd .. && make clean_raw && make clean_interim

In [None]:
dsrc.file_list

In [None]:
dsrc.fetch()
dsrc.unpack()

Note that `make fetch_sources` and `make unpack_sources` don't seem to work yet at this stage, for reasons still to be tracked down.

## How to get a DataSource you've already started working with

In [None]:
workflow.available_datasources()

In [None]:
dsrc = DataSource.from_name('beer_review')

In [None]:
dsrc.file_list

## Time to add metadata and license info

Uh oh. This data has no license! Says it's from this talk.

https://conferences.oreilly.com/strata/strata-ny-2017/public/schedule/detail/59542

## No need to expand a .zip file on disk if using pandas .read_csv
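
As a minimal sketch of the idea in this heading, a CSV member can be read straight out of a zip archive without unpacking it to disk. The version below uses only the standard library so it runs anywhere; `read_csv_from_zip` is a hypothetical helper, and the UTF-8 encoding is an assumption.

```python
import csv
import io
import itertools
import zipfile


def read_csv_from_zip(zip_path, member=None, max_rows=None):
    """Read rows (as dicts) from a CSV inside a zip, without extracting it."""
    with zipfile.ZipFile(zip_path) as zf:
        if member is None:
            # Pick the only CSV in the archive; refuse to guess otherwise
            csv_members = [n for n in zf.namelist() if n.lower().endswith(".csv")]
            if len(csv_members) != 1:
                raise ValueError(f"Expected exactly one CSV, found {csv_members}")
            member = csv_members[0]
        with zf.open(member) as raw:
            # Assumes the CSV is UTF-8 encoded
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.DictReader(text)
            # islice with max_rows=None reads every row
            return list(itertools.islice(reader, max_rows))
```

With pandas installed, `pd.read_csv` should do the same job in one call: given a path ending in `.zip`, it infers the compression, provided the archive contains a single data file.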

No license. Let's at least write down where it's from. No need to be clever here: just copy the description from JH's repo.

In [None]:
metadata_txt = """
This dataset can be downloaded from the Kaggle \
webpage at https://www.kaggle.com/rdoume/beerreviews \
(requires a Kaggle login)."""

In [None]:
dsrc.add_metadata(contents=metadata_txt, kind='DESCR')

In [None]:
dsrc.file_list

In [None]:
dsrc.fetch(force=True)

In [None]:
dsrc.unpack()

In [None]:
workflow.add_datasource(dsrc)

In [None]:
workflow.available_datasources()