# Kaggle data
---
- Author: Diego Inácio
- GitHub: [github.com/diegoinacio](https://github.com/diegoinacio)
- Notebook: [kaggle-data.ipynb](https://github.com/diegoinacio/machine-learning-notebooks/blob/master/Tips-and-Tricks/kaggle-data.ipynb)
---
Methods to obtain datasets from the *Kaggle* platform.

[Kaggle](https://www.kaggle.com/) is an online community for *data scientists*. The platform gives users ways to practice *machine learning*, *data analysis* and other kinds of *data mining* projects. Many datasets are available on Kaggle, ready to be downloaded and explored. Here we will see how to get them.

In [None]:
import os
import sys
import zipfile
import pandas as pd

## Before getting the data
---
Before downloading any dataset, we have to set up our credentials to use Kaggle's API. The first step is to create an *API token*:

- Visit [Kaggle](https://www.kaggle.com/), go to your profile and click on account;
- Scroll down and click on the **Create new API Token** button.

This will download a file called `kaggle.json`, which contains our **username** and **key**, like:

``` json
{"username": "diegoinacio", "key": "abfj3......2q9b"}
```

- Place this file in `$HOME/.kaggle/kaggle.json`
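
The placement step can also be scripted. The following is a minimal sketch, assuming the token was downloaded to a known location; the helper name `install_kaggle_token` and its arguments are hypothetical and not part of the Kaggle tools. It also restricts the file to its owner, since the Kaggle CLI warns when the token is readable by other users on Unix-like systems.

```python
import shutil
from pathlib import Path


def install_kaggle_token(downloaded_token, home=None):
    """Copy a downloaded kaggle.json into <home>/.kaggle/ (hypothetical helper)."""
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / "kaggle.json"
    shutil.copyfile(downloaded_token, target)

    # Owner read/write only; this has limited effect on Windows
    target.chmod(0o600)
    return target
```

For example, `install_kaggle_token(Path.home() / "Downloads" / "kaggle.json")` would cover the case where the browser saved the token to a `Downloads` folder, which is itself an assumption about your setup.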

## Kaggle API
---
To access the Kaggle API, first install the `kaggle` package:

``` shell
pip install kaggle --upgrade
```

This gives us access to both the **CLI** and the Python interface. To learn more, see the [project repository](https://github.com/Kaggle/kaggle-api).

In [None]:
!kaggle -h

In [None]:
import kaggle

### Datasets via CLI
---

In [None]:
!kaggle datasets list

In [None]:
!kaggle datasets download ahsan81/hotel-reservations-classification-dataset

In [None]:
# Open the downloaded archive and read the CSV straight from it
with zipfile.ZipFile("./hotel-reservations-classification-dataset.zip") as zf:
    df = pd.read_csv(zf.open("Hotel Reservations.csv"))
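
The file name inside the archive is not always obvious from the dataset name. As a small sketch using only the standard library, the hypothetical helper `list_csv_members` below lists the CSV files in a downloaded archive, so the right member can be chosen before it is passed to `pd.read_csv`.

```python
import zipfile


def list_csv_members(zip_path):
    """Return the names of the CSV files stored in a zip archive."""
    with zipfile.ZipFile(zip_path) as zf:
        # namelist() returns every member path stored in the archive
        return [name for name in zf.namelist() if name.lower().endswith(".csv")]
```

Calling `list_csv_members("./hotel-reservations-classification-dataset.zip")` after the download would show which CSV members are available.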

In [None]:
df

### Datasets via Python interface
---

In [None]:
from kaggle.api.kaggle_api_extended import KaggleApi

In [None]:
api = KaggleApi()
api.authenticate()
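
Besides `kaggle.json`, the Kaggle client can also take credentials from the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables. The sketch below assumes the token file is kept somewhere other than the default location; the helper `export_kaggle_credentials` is hypothetical. Because the credentials are read at authentication time, it would need to run before the client is imported or authenticated.

```python
import json
import os
from pathlib import Path


def export_kaggle_credentials(token_path):
    """Load a kaggle.json file and expose it through environment variables."""
    creds = json.loads(Path(token_path).read_text())

    # Variable names read by the Kaggle client
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return creds["username"]
```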

In [None]:
api.dataset_list()

In [None]:
api.dataset_download_files("senapatirajesh/netflix-tv-shows-and-movies")
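
`dataset_download_files` also accepts `path` and `unzip` keyword arguments, which avoids handling the zip file by hand. The following is a minimal sketch, assuming `api` is an already authenticated `KaggleApi` instance as created above; the wrapper name `download_and_extract` and the default `./data` folder are hypothetical.

```python
import os


def download_and_extract(api, dataset, dest="./data"):
    """Download a dataset into dest, extract it, and list the resulting files."""
    os.makedirs(dest, exist_ok=True)

    # path selects the target folder; unzip=True extracts the archive there
    api.dataset_download_files(dataset, path=dest, unzip=True)
    return sorted(os.listdir(dest))
```

With the client from the previous cells, `download_and_extract(api, "senapatirajesh/netflix-tv-shows-and-movies")` would return the names of the files placed in `./data`.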

In [None]:
# Open the downloaded archive and read the CSV straight from it
with zipfile.ZipFile("./netflix-tv-shows-and-movies.zip") as zf:
    df = pd.read_csv(zf.open("NetFlix.csv"))

In [None]:
df

## opendatasets library
---
[opendatasets](https://github.com/JovianHQ/opendatasets/) is a Python library for downloading datasets not only from Kaggle but also from other online sources, such as Google Drive. To install it, run the following command:

``` shell
pip install opendatasets
```

In [None]:
import opendatasets as ods

To download from a supported source, just provide a URL. In the case of Kaggle, the command will interactively ask for the **username** and **key** found in the `kaggle.json` file. Copy and paste them into the interactive widgets.

In [None]:
# This command will ask you for your username and key (from kaggle.json) interactively
dataset_url = "https://www.kaggle.com/rakkesharv/spotify-top-10000-streamed-songs"
ods.download(dataset_url)

In [None]:
df = pd.read_csv("./spotify-top-10000-streamed-songs/Spotify_final_dataset.csv")
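
As the path above shows, the files end up in a folder named after the dataset. When the file names are not known in advance, a short standard-library sketch such as the hypothetical helper `find_csv_files` below can list the CSV files under that folder before one is passed to `pd.read_csv`.

```python
from pathlib import Path


def find_csv_files(folder):
    """Return the CSV files found under a download folder, sorted by path."""
    # rglob searches the folder and all of its subfolders
    return sorted(Path(folder).rglob("*.csv"))
```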

In [None]:
df