# Create a new (restricted) set of notebooks by experienced Kaggle users

I downloaded ~60,000 notebooks by expert users. Then I decided to make the selection rules more restrictive:

- we consider a user an "expert" only if they have held the "expert" tier in at least one of the categories of expertise for at least 6 months;
- we select only notebooks that were created *after* their authors had become "expert" Kaggle users.
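
The two rules above can be sketched on toy data as follows. This is only an illustration: the column names (`author`, `created_at`, `expert_since`), the sample rows, and the reference date are hypothetical and are not taken from the download script's actual output.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative assumptions.
notebooks = pd.DataFrame({
    "slug": ["alice/nb-early", "alice/nb-late", "bob/nb-recent", "carol/nb-any"],
    "author": ["alice", "alice", "bob", "carol"],
    "created_at": pd.to_datetime(
        ["2018-03-01", "2019-09-15", "2020-06-20", "2020-01-10"]
    ),
})
experts = pd.DataFrame({
    "author": ["alice", "bob"],
    "expert_since": pd.to_datetime(["2019-01-01", "2020-05-01"]),
})

# Assumed date at which the selection is made.
reference_date = pd.Timestamp("2020-07-18")
cutoff = reference_date - pd.DateOffset(months=6)

# Rule 1: keep only users who reached the expert tier at least 6 months ago.
long_standing = experts[experts["expert_since"] <= cutoff]

# Rule 2: keep only notebooks created after the author became an expert.
candidates = notebooks.merge(long_standing, on="author", how="inner")
restricted = candidates[candidates["created_at"] > candidates["expert_since"]]

print(restricted["slug"].tolist())
```

Whether the comparison in rule 2 should be strict (`>`) or inclusive (`>=`) is a choice made here for the sketch, not something the rules above specify.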

This notebook takes as input the `.csv` file produced by the download script and filters its rows according to these more restrictive rules.

In [1]:
import pandas as pd

In [2]:
reduced_set = pd.read_csv('/Users/luigiquaranta/Developer/se4ai/notebookQualityAndReproducibility/KaggleTorrent/notebook_output/Notebooks by experts - full download URLs.csv')
reduced_set.shape

(11592, 2)

In [3]:
reduced_set.head()

Unnamed: 0,slug,url
0,jeffmoser/render-test,https://www.kaggle.com/kernels/scriptcontent/2...
1,jeffmoser/asdfadsf,https://www.kaggle.com/kernels/scriptcontent/3...
2,jeffmoser/test-notebook-health,https://www.kaggle.com/kernels/scriptcontent/3...
3,jeffmoser/testing-notebooks,https://www.kaggle.com/kernels/scriptcontent/3...
4,jeffmoser/save-edit-test,https://www.kaggle.com/kernels/scriptcontent/3...


In [4]:
complete_set = pd.read_csv('/Users/luigiquaranta/Developer/se4ai/notebookQualityAndReproducibility/KaggleTorrent/output_files/notebook_download_paths_1595069757.1767712.csv', sep=';')
complete_set.shape

(61199, 3)

In [5]:
complete_set.head()

Unnamed: 0,slug,url,download_path
0,jeffmoser/i-3-ipython,https://www.kaggle.com/kernels/scriptcontent/7...,/mnt/ext_hdd/local_repositories/kaggle_noteboo...
1,jeffmoser/notebook-a2042497432d665,https://www.kaggle.com/kernels/scriptcontent/8...,/mnt/ext_hdd/local_repositories/kaggle_noteboo...
2,jeffmoser/notebook-213497279134f5b,https://www.kaggle.com/kernels/scriptcontent/8...,/mnt/ext_hdd/local_repositories/kaggle_noteboo...
3,jeffmoser/testing-with-jamie,https://www.kaggle.com/kernels/scriptcontent/1...,/mnt/ext_hdd/local_repositories/kaggle_noteboo...
4,jeffmoser/notebook-ee6ac917bce6da7,https://www.kaggle.com/kernels/scriptcontent/1...,/mnt/ext_hdd/local_repositories/kaggle_noteboo...


In [6]:
complete_set[complete_set['download_path'].isnull()].shape[0]

11

In [7]:
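# Left join on the columns shared by both frames, so every row of reduced_set is kept
merge = pd.merge(reduced_set, complete_set, on=['slug', 'url'], how='left')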
merge.shape

(11592, 3)

In [8]:
merge.head()

Unnamed: 0,slug,url,download_path
0,jeffmoser/render-test,https://www.kaggle.com/kernels/scriptcontent/2...,/mnt/ext_hdd/local_repositories/kaggle_noteboo...
1,jeffmoser/asdfadsf,https://www.kaggle.com/kernels/scriptcontent/3...,/mnt/ext_hdd/local_repositories/kaggle_noteboo...
2,jeffmoser/test-notebook-health,https://www.kaggle.com/kernels/scriptcontent/3...,/mnt/ext_hdd/local_repositories/kaggle_noteboo...
3,jeffmoser/testing-notebooks,https://www.kaggle.com/kernels/scriptcontent/3...,/mnt/ext_hdd/local_repositories/kaggle_noteboo...
4,jeffmoser/save-edit-test,https://www.kaggle.com/kernels/scriptcontent/3...,/mnt/ext_hdd/local_repositories/kaggle_noteboo...


In [9]:
merge[merge['download_path'].isnull()].empty

True
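
A stricter variant of this check could also tell apart rows that found no match at all from rows that matched an entry without a download path. The following is a minimal sketch on hypothetical toy frames (the slugs, URLs, and paths are made up), using the `indicator` and `validate` parameters of `pd.merge`:

```python
import pandas as pd

# Hypothetical toy frames standing in for the restricted and complete sets.
reduced = pd.DataFrame({
    "slug": ["u/a", "u/b", "u/c"],
    "url": ["url-a", "url-b", "url-c"],
})
complete = pd.DataFrame({
    "slug": ["u/a", "u/b", "u/d"],
    "url": ["url-a", "url-b", "url-d"],
    "download_path": ["/tmp/a.ipynb", None, "/tmp/d.ipynb"],
})

# indicator=True adds a `_merge` column; validate raises if keys are duplicated.
checked = pd.merge(
    reduced,
    complete,
    on=["slug", "url"],
    how="left",
    indicator=True,
    validate="one_to_one",
)

# Rows of the restricted set that are absent from the complete set.
unmatched = checked[checked["_merge"] == "left_only"]

# Rows that matched, but whose download path is missing.
matched_without_path = checked[
    (checked["_merge"] == "both") & checked["download_path"].isnull()
]

print(unmatched["slug"].tolist(), matched_without_path["slug"].tolist())
```

With a plain left join, both situations show up as a null `download_path`, so the indicator column is what separates them.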

Output restricted list

Output path list

In [10]:
merge.columns

Index(['slug', 'url', 'download_path'], dtype='object')

# Beginners' notebooks that have already been downloaded

In [11]:
downloaded_beginners = complete_set[~complete_set['slug'].isin(merge['slug'])]
downloaded_beginners.shape

(49601, 3)
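
The filter above is an anti-join on `slug`: it keeps the rows of the complete set whose slug does not appear in the restricted set. A small sketch with made-up slugs shows one property worth keeping in mind: because `isin` matches on `slug` alone, every row that shares a selected slug is removed, so the row counts do not necessarily add up.

```python
import pandas as pd

# Hypothetical toy frames; "u/a" appears twice in the complete set.
complete = pd.DataFrame({
    "slug": ["u/a", "u/a", "u/b", "u/c"],
    "url": ["url-a1", "url-a2", "url-b", "url-c"],
})
restricted = pd.DataFrame({"slug": ["u/a"], "url": ["url-a1"]})

# Anti-join: keep rows whose slug is NOT in the restricted set.
remaining = complete[~complete["slug"].isin(restricted["slug"])]

# Compare the actual size with the naive expectation.
naive_expected = len(complete) - len(restricted)
print(len(remaining), naive_expected)
```

In this notebook, 61199 − 11592 = 49607, while the filtered frame has 49601 rows. Repeated slugs are one possible cause of such a gap; this sketch does not verify that for the real data.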

In [12]:
downloaded_beginners.head()

Unnamed: 0,slug,url,download_path
0,jeffmoser/i-3-ipython,https://www.kaggle.com/kernels/scriptcontent/7...,/mnt/ext_hdd/local_repositories/kaggle_noteboo...
1,jeffmoser/notebook-a2042497432d665,https://www.kaggle.com/kernels/scriptcontent/8...,/mnt/ext_hdd/local_repositories/kaggle_noteboo...
2,jeffmoser/notebook-213497279134f5b,https://www.kaggle.com/kernels/scriptcontent/8...,/mnt/ext_hdd/local_repositories/kaggle_noteboo...
3,jeffmoser/testing-with-jamie,https://www.kaggle.com/kernels/scriptcontent/1...,/mnt/ext_hdd/local_repositories/kaggle_noteboo...
4,jeffmoser/notebook-ee6ac917bce6da7,https://www.kaggle.com/kernels/scriptcontent/1...,/mnt/ext_hdd/local_repositories/kaggle_noteboo...


In [13]:
downloaded_beginners[downloaded_beginners['download_path'].isnull()].shape

(11, 3)

In [14]:
# Drop only the rows without a download path
downloaded_beginners_cleaned = downloaded_beginners.dropna(subset=['download_path'])
downloaded_beginners_cleaned.shape

(49590, 3)

In [15]:
downloaded_beginners_cleaned[downloaded_beginners_cleaned['download_path'].isnull()].shape[0]

0
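
Dropping missing values can be applied to all columns or scoped to a single one, and the two give different results when other columns also contain nulls. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical toy frame: one row lacks a URL, another lacks a download path.
frame = pd.DataFrame({
    "slug": ["u/a", "u/b", "u/c"],
    "url": ["url-a", None, "url-c"],
    "download_path": ["/p/a.ipynb", "/p/b.ipynb", None],
})

# Removes a row if ANY column is null.
dropped_any = frame.dropna()

# Removes a row only if `download_path` is null.
dropped_path_only = frame.dropna(subset=["download_path"])

print(len(dropped_any), len(dropped_path_only))
```

In the cells above the two forms agree, since exactly 11 rows are removed and exactly 11 rows have a null `download_path`.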

In [16]:
downloaded_beginners_cleaned[['download_path']].to_csv('./output_files/download_paths_beginners_restricted.csv', index=False, header=False)
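
With `index=False` and `header=False`, the output file contains one download path per line and nothing else. The sketch below shows this on hypothetical paths, writing to a temporary directory instead of `./output_files/`, and then reads the file back:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical paths standing in for the real download locations.
paths = pd.DataFrame(
    {"download_path": ["/data/nb/one.ipynb", "/data/nb/two.ipynb"]}
)

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "paths.csv"

    # No index column and no header row: one path per line.
    paths[["download_path"]].to_csv(out_file, index=False, header=False)
    lines = out_file.read_text().splitlines()

    # Reading back requires supplying the column name explicitly.
    restored = pd.read_csv(out_file, header=None, names=["download_path"])

print(lines)
```

A file in this shape can be consumed line by line by shell tools or by another script without any CSV parsing.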