# Connect Kaggle to Colab

## Step 1: Get Kaggle API Token
- Navigate to [Kaggle](https://www.kaggle.com)
- Go to ```My Account``` -> ```API``` -> ```Create New API Token```
- A JSON file named ```kaggle.json``` is downloaded to your local machine. We'll upload it to the Colab server later.

## Step 2: Install the Kaggle library and import the Google Colab files library

In [0]:
# Colab library to upload files to notebook
from google.colab import files

# Install Kaggle library
!pip install -q kaggle

## Step 3: Upload the Kaggle API JSON file to Google Colab

In [0]:
# Upload kaggle API key file
uploaded = files.upload()

Saving kaggle.json to kaggle.json
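
The call to ```files.upload()``` returns a dictionary that maps each uploaded file name to its contents as bytes, so the token can be sanity-checked before it is copied anywhere. The sketch below assumes the token holds the two fields ```username``` and ```key```; the helper name ```check_kaggle_token``` and the sample credentials are hypothetical and only serve as an illustration.

```python
import json

def check_kaggle_token(uploaded, filename="kaggle.json"):
    """Return the username stored in an uploaded Kaggle token.

    `uploaded` is a dict of file name -> bytes, shaped like the value
    returned by google.colab's files.upload().
    """
    if filename not in uploaded:
        raise FileNotFoundError(f"{filename} was not uploaded")
    token = json.loads(uploaded[filename].decode("utf-8"))
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"token is missing fields: {sorted(missing)}")
    return token["username"]

# Stand-in for a real upload result (made-up credentials).
fake_upload = {"kaggle.json": b'{"username": "alice", "key": "0123abcd"}'}
print(check_kaggle_token(fake_upload))
```

In the notebook you would pass the real ```uploaded``` variable instead of ```fake_upload```.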


Once ```kaggle.json```, which contains your username and API key, is in the current directory (```.```), copy it to ```/root/.kaggle/```, where the Kaggle client looks for it.

In [0]:
# Create the target directory first; cp fails if it does not exist
!mkdir -p /root/.kaggle
!cp kaggle.json /root/.kaggle/
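
The same copy can be done from Python, which makes it easy to create the target directory when it is missing. This is a minimal sketch: ```install_token``` is a hypothetical helper, and the demo runs in a temporary directory with made-up credentials rather than in ```/root/.kaggle```.

```python
import shutil
import tempfile
from pathlib import Path

def install_token(src, dest_dir):
    """Copy the token file into dest_dir, creating dest_dir if needed."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(src, dest_dir))

# Demo in a throwaway directory; on Colab dest_dir would be /root/.kaggle.
workdir = Path(tempfile.mkdtemp())
token = workdir / "kaggle.json"
token.write_text('{"username": "alice", "key": "0123abcd"}')

installed = install_token(token, workdir / ".kaggle")
print(installed.name)
```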

However, every user on the machine can now read ```kaggle.json```, which contains your username and API key.

In [0]:
!ls -l /root/.kaggle/

total 4
-rw-r--r-- 1 root root 64 Apr  1 04:38 kaggle.json


We should restrict the permissions of this file, and the ```chmod``` command does that. Each digit in the mode ```600``` sets the permissions for the user (the file's owner), the group, and everyone else, respectively.
- ```6``` means read and write permission for the owner
- ```0``` means no permission for users in the file's group
- ```0``` means no permission for anyone else

In [0]:
!chmod 600 /root/.kaggle/kaggle.json
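
To see where the digits come from, each one is a sum of read (4), write (2), and execute (1). The sketch below decodes a mode into its three parts and then applies ```0o600``` to a throwaway file with ```os.chmod```; the helper names ```rwx``` and ```explain_mode``` are made up for this illustration, and the last step assumes a POSIX system such as the Colab host.

```python
import os
import stat
import tempfile

def rwx(digit):
    """Render one octal permission digit (0-7) as an 'rwx'-style string."""
    return "".join(flag if digit & bit else "-"
                   for flag, bit in (("r", 4), ("w", 2), ("x", 1)))

def explain_mode(mode):
    """Split a mode such as 0o600 into user, group, and others."""
    return {
        "user": rwx((mode >> 6) & 7),
        "group": rwx((mode >> 3) & 7),
        "others": rwx(mode & 7),
    }

print(explain_mode(0o644))  # the permissions `ls -l` showed before chmod
print(explain_mode(0o600))  # the permissions after chmod 600

# Apply the same mode from Python to a throwaway file (POSIX systems).
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, 0o600)
print(stat.filemode(os.stat(path).st_mode))
os.remove(path)
```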

## Step 4: Download dataset from Kaggle
Now go to the Kaggle dataset you are interested in, open the Data tab, copy the API command, and paste it into a Colab cell to download the dataset.


In [0]:
!kaggle datasets download -d ryanxjhan/cbc-news-coronavirus-articles-march-26

cbc-news-coronavirus-articles-march-26.zip: Skipping, found more recently modified local copy (use --force to force download)


You can also list datasets available on Kaggle through the Kaggle API. Replacing the string after the ```-d``` option in the download command lets you download other datasets.

In [0]:
!kaggle datasets list

ref                                                            title                                                size  lastUpdated          downloadCount  
-------------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  
allen-institute-for-ai/CORD-19-research-challenge              COVID-19 Open Research Dataset Challenge (CORD-19)  729MB  2020-03-27 23:46:53          38169  
ryanxjhan/cbc-news-coronavirus-articles-march-26               CBC News Coronavirus/COVID-19 Articles (NLP)          6MB  2020-03-27 23:23:07             37  
sobhanmoosavi/us-accidents                                     US Accidents (3.0 million records)                  199MB  2020-01-17 04:45:09          11801  
fireballbyedimyrnmom/us-counties-covid-19-dataset              US counties COVID 19 dataset                        150KB  2020-03-31 12:41:45             61  
monogenea/birdsongs-from-europe               
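
The value passed to ```-d``` is the ```ref``` column from the listing, in the form ```owner/dataset-name```. As a rough sketch, the same string can be derived from a dataset page URL. The helpers ```dataset_ref``` and ```download_command``` are hypothetical, and the code assumes the URL path is either ```/owner/name``` or ```/datasets/owner/name```.

```python
from urllib.parse import urlparse

def dataset_ref(url):
    """Extract 'owner/dataset-name' from a Kaggle dataset page URL.

    Assumes the path looks like /owner/name or /datasets/owner/name.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"cannot find owner/name in {url!r}")
    return "/".join(parts[:2])

def download_command(url):
    """Build the CLI command used in this step for the given dataset URL."""
    return f"kaggle datasets download -d {dataset_ref(url)}"

url = "https://www.kaggle.com/datasets/ryanxjhan/cbc-news-coronavirus-articles-march-26"
print(download_command(url))
```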

## Step 5: Unzip the data

In [0]:
!unzip cbc-news-coronavirus-articles-march-26.zip
!ls

Archive:  cbc-news-coronavirus-articles-march-26.zip
replace news.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
cbc-news-coronavirus-articles-march-26.zip  kaggle.json  news.csv  sample_data
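
If the interactive "replace?" prompt gets in the way, the archive can also be extracted with Python's standard ```zipfile``` module, which overwrites existing files without asking. A minimal sketch follows; ```extract_all``` is a hypothetical helper, and the demo builds a tiny stand-in archive instead of using the real download.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_all(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return their names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()

# Build a small stand-in archive so the example runs anywhere.
workdir = Path(tempfile.mkdtemp())
demo_zip = workdir / "demo.zip"
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("news.csv", "title,url\nhello,https://example.com\n")

names = extract_all(demo_zip, workdir / "data")
print(names)
```

In the notebook, the call would be ```extract_all('cbc-news-coronavirus-articles-march-26.zip', '.')```.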


## Step 6: Load the data using Pandas
Pandas is a high-level data analysis tool for Python. When working with tabular data, such as data stored in spreadsheets or databases, Pandas is the right tool for you. Pandas will help you to explore, clean and process your data. In Pandas, a data table is called a DataFrame.

In this step, we'll load the CSV file into a Pandas DataFrame. You can then call ```.head()``` to take a quick look at the data.

In [57]:
import pandas as pd
df = pd.read_csv('news.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,authors,title,publish_date,description,text,url
0,0,['Cbc News'],Coronavirus a 'wake-up call' for Canada's pres...,2020-03-27 08:00:00,Canadian pharmacies are limiting how much medi...,Canadian pharmacies are limiting how much medi...,https://www.cbc.ca/news/health/covid-19-drug-s...
1,1,['Cbc News'],Yukon gov't names 2 possible sources of corona...,2020-03-27 01:45:00,The Yukon government has identified two places...,The Yukon government has identified two places...,https://www.cbc.ca/news/canada/north/yukon-cor...
2,2,['The Associated Press'],U.S. Senate passes $2T coronavirus relief package,2020-03-26 05:13:00,The Senate has passed an unparalleled $2.2 tri...,The Senate late Wednesday passed an unparallel...,https://www.cbc.ca/news/world/senate-coronavir...
3,3,['Cbc News'],Coronavirus: The latest in drug treatment and ...,2020-03-27 00:36:00,Scientists around the world are racing to find...,Scientists around the world are racing to find...,https://www.cbc.ca/news/health/coronavirus-tre...
4,4,['Cbc News'],The latest on the coronavirus outbreak for Mar...,2020-03-26 20:57:00,The latest on the coronavirus outbreak from CB...,Trudeau says rules of Quarantine Act will ...,https://www.cbc.ca/news/the-latest-on-the-coro...
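
Columns named ```Unnamed: 0``` usually appear when a CSV was saved together with its row index. Assuming that is what happened here, passing ```index_col=0``` to ```read_csv``` treats the first column as the index instead of as data. The sketch uses a small made-up CSV rather than ```news.csv```.

```python
import io

import pandas as pd

# A tiny CSV whose first column is an unnamed saved index (made-up rows).
csv_text = (
    ",authors,title\n"
    "0,['Cbc News'],First headline\n"
    "1,['The Associated Press'],Second headline\n"
)

default = pd.read_csv(io.StringIO(csv_text))
print(list(default.columns))

indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed.columns))
```

The first call keeps the index as a data column named ```Unnamed: 0```, while the second leaves only ```authors``` and ```title```.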


In [0]:
print(f"The shape of our dataset: {df.shape}")
print(df.groupby('publish_date').first())

The shape of our dataset: (3566, 7)
                     Unnamed: 0  ...                                                url
publish_date                     ...                                                   
2004-01-16 17:09:00        4603  ...  https://www.cbc.ca/news/world/who-finds-link-b...
2004-01-31 00:09:00        1748  ...  https://www.cbc.ca/news/technology/sars-mutate...
2006-11-23 18:11:00        4588  ...  https://www.cbc.ca/news/technology/sars-origin...
2012-09-23 21:50:00        4579  ...  https://www.cbc.ca/news/canada/outbreak-of-vir...
2012-11-02 01:24:00        4606  ...  https://www.cbc.ca/news/canada/nova-scotia/ell...
...                         ...  ...                                                ...
2020-03-27 05:29:00          50  ...  https://www.cbc.ca/news/canada/british-columbi...
2020-03-27 06:28:00          32  ...  https://www.cbc.ca/news/service-canada-covid-1...
2020-03-27 08:00:00           0  ...  https://www.cbc.ca/news/health/covid-19-drug-s
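
Grouping on the raw ```publish_date``` strings creates one group per exact timestamp. To summarise by calendar day instead, one option is to parse the column with ```pd.to_datetime``` and group on the date part. This is a sketch on a small made-up frame, not on the real dataset.

```python
import pandas as pd

# Made-up rows that mimic the publish_date format in news.csv.
sample = pd.DataFrame({
    "publish_date": [
        "2020-03-26 05:13:00",
        "2020-03-26 20:57:00",
        "2020-03-27 08:00:00",
    ],
    "title": ["first", "second", "third"],
})

# Parse the strings into timestamps, then count articles per calendar day.
sample["publish_date"] = pd.to_datetime(sample["publish_date"])
per_day = sample.groupby(sample["publish_date"].dt.date).size()
print(per_day)
```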

## References
1. https://medium.com/@saedhussain/google-colaboratory-and-kaggle-datasets-b57a83eb6ef8
2. https://towardsdatascience.com/setting-up-kaggle-in-google-colab-ebb281b61463
3. https://pandas.pydata.org/docs/


In [0]:
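# Clean up: remove the API token, the downloaded archive, and the extracted CSV.
# The uploaded and unzipped files are in the current working directory, not in ~.
!rm -f kaggle.json /root/.kaggle/kaggle.json
!rm -f *.zip *.csv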