# Downloading a Kaggle dataset using the Kaggle API

### What is Kaggle?

Kaggle is a worldwide data science community that hosts thousands of publicly available datasets. You can use them to sharpen your data skills: wrangling data, training Machine Learning models that predict a target value from a set of features, building dashboard visualizations that will impress everyone, or simply preparing material for an online tutorial.

### Trying to use the Kaggle API

To download datasets directly from the Kaggle 'repo' (instead of visiting the site and downloading the required data manually), we need to install a Python package called `kaggle`.

Run the `!kaggle` command in a notebook cell (the `!` prefix passes the command to the shell) to try the Kaggle client for the first time.

In [24]:
!kaggle

/bin/bash: line 1: kaggle: command not found


As we can see, we need to first install this package.
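The same check can be done from Python before shelling out. The snippet below is a minimal sketch using only the standard library; the helper name `is_cli_available` is made up for this tutorial.

```python
import shutil


def is_cli_available(command: str) -> bool:
    """Return True if `command` resolves to an executable on the PATH."""
    return shutil.which(command) is not None


# `is_cli_available` is a hypothetical helper, not part of the kaggle package.
if is_cli_available("kaggle"):
    print("kaggle CLI found")
else:
    print("kaggle CLI not found; install it with `pip install kaggle`")
```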

### Installing the Kaggle API

Run the `!pip install kaggle` command in a Jupyter cell to install `kaggle` into our environment with the pip package manager.

In [25]:
!pip install kaggle

Collecting kaggle
  Downloading kaggle-1.5.13.tar.gz (63 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.3/63.3 kB 1.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.13-py3-none-any.whl size=77717 sha256=26d7e483a94d586c0053472c53e0d6c10440647bfb92c228b4e0e30d5dde6620
  Stored in directory: /home/jovyan/.cache/pip/wheels/f3/16/ff/34e7d368370d4fd68bb749a59f1d2639ed66f3c14358e340a1
Successfully built kaggle
Installing collected packages: kaggle
Successfully installed kaggle-1.5.13


The Kaggle API client is now installed, and we are one step away from getting what we need.
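To confirm which version ended up in the environment without reading through the pip log, the package metadata can be queried from Python. This is a small sketch; `installed_version` is an illustrative helper name, and it returns `None` when the package is absent.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the install succeeded, otherwise None.
print(installed_version("kaggle"))
```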

### Trying to access datasets list from the API

Running the `!kaggle datasets list` command:

In [27]:
!kaggle datasets list

Traceback (most recent call last):
  File "/opt/conda/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/opt/conda/lib/python3.10/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/opt/conda/lib/python3.10/site-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /home/jovyan/.kaggle. Or use the environment method.


### Generating a Kaggle API Token

As the error above shows, the API requires **authentication**, provided in either of the following ways:
  - A `kaggle.json` file in the `.kaggle` directory containing the username and its API token;
  - The `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, correctly set.

Whichever option you choose, the authentication details come from the official Kaggle site. To get the JSON file, generate a token at https://www.kaggle.com/settings, under the API section, by pressing the 'Create New Token' button. The downloaded `kaggle.json` file then goes in the directory named in the error message (`/home/jovyan/.kaggle` in this environment).
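Before retrying the command, it can help to see which of the two methods is actually configured. The sketch below only inspects the environment and the config directory; `credential_source` is a hypothetical helper, and the order of its checks is a choice made for this illustration rather than a statement about the client's internals.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def credential_source(config_dir: Path, env: Mapping[str, str]) -> Optional[str]:
    """Report which authentication method appears to be configured, if any."""
    # Method 2 from the list above: both environment variables are set.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "environment"
    # Method 1 from the list above: a kaggle.json file in the config directory.
    if (config_dir / "kaggle.json").is_file():
        return "file"
    return None


source = credential_source(Path.home() / ".kaggle", os.environ)
print(source or "no Kaggle credentials found")
```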

The generated token looks like this:
```json
{"username":"your_user_name","key":"a_32_long_hashcode"}
```
For example:
```json
{"username":"aderbal","key":"7b10681fbc3f1c5026d7cf01eb3139d8"}
```
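If the token has to be created by hand, for example on a remote notebook server where the file cannot simply be uploaded, a sketch like the following writes it into place. The function name `write_kaggle_credentials` and the placeholder values are assumptions for illustration. Restricting the file to its owner is a precaution, since the Kaggle client warns when the key file is readable by other users.

```python
import json
import os
from pathlib import Path


def write_kaggle_credentials(username: str, key: str, config_dir: Path) -> Path:
    """Write a kaggle.json token file into `config_dir` and return its path."""
    config_dir.mkdir(parents=True, exist_ok=True)
    token_path = config_dir / "kaggle.json"
    token_path.write_text(json.dumps({"username": username, "key": key}))
    # Owner read/write only; on Windows this call has limited effect.
    os.chmod(token_path, 0o600)
    return token_path


# Example call with placeholder values (not real credentials):
# write_kaggle_credentials("your_user_name", "your_api_key", Path.home() / ".kaggle")
```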

### Trying to access datasets list from the API (again)

With the authentication in place, we run the `!kaggle datasets list` command again.

In [29]:
!kaggle datasets list

ref                                                                   title                                                size  lastUpdated          downloadCount  voteCount  usabilityRating  
--------------------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
arnabchaki/data-science-salaries-2023                                 Data Science Salaries 2023 💸                         25KB  2023-04-13 09:55:16          21824        604  1.0              
mauryansshivam/netflix-ott-revenue-and-subscribers-csv-file           Netflix OTT Revenue and Subscribers (CSV File)        2KB  2023-05-13 17:40:23            936         23  1.0              
fatihb/coffee-quality-data-cqi                                        Coffee Quality Data (CQI May-2023)                   22KB  2023-05-12 13:06:39           2369         56  1.0              
darshanprabhu09/stock-prices-f

As we can see, no error message was shown, so the API is correctly authenticated and we can start using it as we need.
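The listing is easier to work with programmatically in CSV form, which the command can emit when given the `--csv` flag. The sketch below parses such text with the standard library. The `sample` string is hand-typed from two rows of the table above (emoji omitted), and it assumes the CSV header uses the same column names as the table; `parse_dataset_listing` is an illustrative helper.

```python
import csv
import io
from typing import Dict, List


def parse_dataset_listing(csv_text: str) -> List[Dict[str, str]]:
    """Turn CSV text from the dataset listing into one dict per dataset."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Hand-built sample; in a notebook the text would come from the CLI output instead.
sample = (
    "ref,title,size,lastUpdated,downloadCount,voteCount,usabilityRating\n"
    "arnabchaki/data-science-salaries-2023,Data Science Salaries 2023,25KB,2023-04-13 09:55:16,21824,604,1.0\n"
    "fatihb/coffee-quality-data-cqi,Coffee Quality Data (CQI May-2023),22KB,2023-05-12 13:06:39,2369,56,1.0\n"
)

rows = parse_dataset_listing(sample)
# Pick the dataset with the highest download count among the parsed rows.
most_downloaded = max(rows, key=lambda row: int(row["downloadCount"]))
print(most_downloaded["ref"])
```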

### Downloading a specific Dataset

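The `!kaggle datasets download -d some_dataset_on_kaggle_site` command downloads the files of the chosen dataset. The argument is the dataset's `ref`, in the form `owner/dataset-name`, as shown in the first column of the listing.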

```bash
!kaggle datasets download -d arnabchaki/data-science-salaries-2023
```

In [37]:
!kaggle datasets download -d arnabchaki/data-science-salaries-2023

Downloading data-science-salaries-2023.zip to /home/jovyan/work/kaggle
  0%|                                               | 0.00/25.4k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 25.4k/25.4k [00:00<00:00, 1.66MB/s]
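When the download has to run from a script rather than a notebook cell, the command line can be assembled in Python first. The sketch below only builds the argument list; `build_download_command` is a hypothetical helper, while `-p` (destination path) and `--unzip` are options of `kaggle datasets download`.

```python
from typing import List, Optional


def build_download_command(
    dataset: str, path: Optional[str] = None, unzip: bool = False
) -> List[str]:
    """Build the argument list for `kaggle datasets download`."""
    owner, _, slug = dataset.partition("/")
    # A dataset ref has exactly two parts: the owner and the dataset slug.
    if not owner or not slug or "/" in slug:
        raise ValueError(f"expected 'owner/dataset-slug', got {dataset!r}")
    command = ["kaggle", "datasets", "download", "-d", dataset]
    if path is not None:
        command += ["-p", path]
    if unzip:
        command.append("--unzip")
    return command


print(build_download_command("arnabchaki/data-science-salaries-2023", path="data", unzip=True))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the credentials are in place.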


### Unzipping the downloaded Dataset

The chosen dataset, *arnabchaki/data-science-salaries-2023*, was downloaded as a zip file containing the desired data, so we need to unzip it.

Running the `unzip` command with the `-d data` option extracts the files from the zip archive into a `data` folder.

In [38]:
!unzip data-science-salaries-2023.zip -d data

Archive:  data-science-salaries-2023.zip
  inflating: data/ds_salaries.csv    
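On systems where the `unzip` command is not available, the same step can be done with Python's standard `zipfile` module. This is a minimal sketch; `extract_archive` is an illustrative helper name, and the commented call reuses the file names from this walkthrough.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_archive(zip_path: Path, destination: Path) -> List[str]:
    """Extract every member of `zip_path` into `destination` and return their names."""
    destination.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(destination)
        return archive.namelist()


# Example call, using the file names from this walkthrough:
# extract_archive(Path("data-science-salaries-2023.zip"), Path("data"))
```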


### Checking the downloaded dataset

The dataset file has now been extracted to the `data` subfolder, as specified, and is ready to use.

```python
# importing pandas (and in case of error, just run `!pip install pandas`)
import pandas as pd

# loading the dataset on a dataframe
df = pd.read_csv('data/ds_salaries.csv')

# checking the first 10 rows of the dataframe
df.head(10)
```

In [39]:
import pandas as pd

In [40]:
df = pd.read_csv('data/ds_salaries.csv')
df.head(10)

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M
5,2023,SE,FT,Applied Scientist,222200,USD,222200,US,0,US,L
6,2023,SE,FT,Applied Scientist,136000,USD,136000,US,0,US,L
7,2023,SE,FT,Data Scientist,219000,USD,219000,CA,0,CA,M
8,2023,SE,FT,Data Scientist,141000,USD,141000,CA,0,CA,M
9,2023,SE,FT,Data Scientist,147100,USD,147100,US,0,US,M


Loading this data into a pandas DataFrame confirms that the correct dataset was downloaded. The next steps could be an *exploratory analysis*, with a couple of seaborn plots and some summary statistics to better understand the data, followed by a deeper analysis of the kind mentioned at the start.