<a href="https://colab.research.google.com/github/educatorsRlearners/hugging_face_course/blob/main/05_the_%F0%9F%A4%97_Datasets_library_continued.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating your own dataset

After you've framed the problem, the next step is to source your data.

But what if that dataset doesn't exist?

Well, life is not a Kaggle competition - i.e., people aren't going to hand you a neatly structured CSV file to do your EDA and build your model with.

Therefore, let's explore one possible path for sourcing data: scraping websites with ```requests```. 

## [Getting the data](https://huggingface.co/course/chapter5/5?fw=pt#getting-the-data)

We could code along with the chapter and download the 🤗 Datasets [Issues tab](https://github.com/huggingface/datasets/issues) by using GitHub's REST API to poll the Issues endpoint, which returns a list of JSON objects.

Instead, I'm going to skip to the final ✏️ [Try it out!](https://huggingface.co/course/chapter5/5?fw=pt#:~:text=issues%20and%20comments.-,%E2%9C%8F%EF%B8%8F%20Try%20it%20out!,-Go%20through%20the), which is to create a dataset of GitHub issues for my favorite open source library.

To that end, let's play around with [spaCy](https://github.com/explosion/spaCy). 

As always, we'll start with a single page before requesting all of them. 

In [None]:
!pip install datasets
!pip install requests

In [7]:
import requests 

owner = 'explosion/spacy'

url = "https://api.github.com/repos/"+owner+"/issues?page=1&per_page=1"

response = requests.get(url)

response.status_code

200
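
A `200` tells us the request went through, but GitHub also reports how much of the rate limit is left via the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers. Here is a minimal sketch of how one might read them. The helper name `seconds_until_reset` is my own invention, and it takes any dict-like object so it can be tried without hitting the network (with a real response you would pass `response.headers`).

```python
import time

def seconds_until_reset(headers, now=None):
    """Return how long to wait before the rate limit resets (0 if requests remain)."""
    # Hypothetical helper: assumes the two GitHub rate-limit headers are present;
    # falls back to "requests remain" when they are missing.
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0
    reset_at = int(headers.get("X-RateLimit-Reset", 0))  # Unix epoch seconds
    now = time.time() if now is None else now
    return max(0, reset_at - now)

# Made-up header values, purely for illustration
fake_headers = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "2000"}
print(seconds_until_reset(fake_headers, now=1000))
```

Sleeping for exactly this long is gentler than a fixed one-hour pause, but it is only one possible design.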

In [8]:
response.json()

[{'active_lock_reason': None,
  'assignee': None,
  'assignees': [],
  'author_association': 'MEMBER',
  'body': "<!--- Provide a general summary of your changes in the title. -->\r\n\r\n## Description\r\n<!--- Use this section to describe your changes. If your changes required\r\ntesting, include information about the testing environment and the tests you\r\nran. If your test fixes a bug reported in an issue, don't forget to include the\r\nissue number. If your PR is still a work in progress, that's totally fine – just\r\ninclude a note to let us know. -->\r\n\r\nAdd edit tree lemmatizer, converted from [`spacy_experimental.edit_tree_lemmatizer`](https://github.com/explosion/spacy-experimental/tree/d01fd5b479db823772865c362b4e4e1e706cf554/spacy_experimental/edit_tree_lemmatizer)\r\n\r\n### Types of change\r\n<!-- What type of change does your PR cover? Is it a bug fix, an enhancement\r\nor new feature, or a change to the documentation? -->\r\n\r\nEnhancement\r\n\r\n## Checklist\r\n<!-

In [9]:
GITHUB_TOKEN = "XXXXXX"  # Placeholder: replace with your own personal access token
headers = {"Authorization": f"token {GITHUB_TOKEN}"}
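
Pasting a token straight into a notebook makes it easy to leak by accident. As a rough sketch of an alternative, the token could be read from an environment variable instead. The function name `build_auth_headers` and the choice of `GITHUB_TOKEN` as the variable name are assumptions of mine, not something the course prescribes.

```python
import os

def build_auth_headers(env_var="GITHUB_TOKEN"):
    """Build request headers, adding Authorization only when a token is set."""
    # Hypothetical helper: env_var is an assumed variable name
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get(env_var)
    if token:
        headers["Authorization"] = f"token {token}"
    return headers

headers = build_auth_headers()
```

Without a token the requests still work, just with a much lower unauthenticated rate limit.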

In [10]:
import time
from math import ceil 
from pathlib import Path 
import pandas as pd 
from tqdm.notebook import tqdm 

In [15]:
defaults = {'owner':'explosion',
            'repo': 'spacy',
            'num_issues': 100,
            'rate_limit': 5_000}

In [17]:
def fetch_issues(
    owner='explosion',
    repo='spacy',
    num_issues=100,
    rate_limit=5_000,
    issues_path=Path("."),
    ):
  # Create the output directory if it does not exist yet
  issues_path.mkdir(parents=True, exist_ok=True)

  batch = []
  all_issues = []
  per_page = 100  # Number of issues to return per page
  num_pages = ceil(num_issues / per_page)
  base_url = "https://api.github.com/repos"

  # GitHub numbers pages from 1, so start counting there
  for page in tqdm(range(1, num_pages + 1)):
    query = f"issues?page={page}&per_page={per_page}&state=all"
    issues = requests.get(f"{base_url}/{owner}/{repo}/{query}",
                          headers=headers)
    batch.extend(issues.json())

    if len(batch) > rate_limit and len(all_issues) < num_issues:
      all_issues.extend(batch)
      batch = []  # Flush batch for next time period
      print("Reached GitHub rate limit. Sleeping for one hour ...")
      time.sleep(60 * 60 + 1)

  # After the loop: keep what is left in the final batch, then save once
  all_issues.extend(batch)
  df = pd.DataFrame.from_records(all_issues)
  df.to_json(f"{issues_path}/{repo}-issues.jsonl", orient="records", lines=True)
  print(
      f"Downloaded all the issues for {repo}! Dataset stored at {issues_path}/{repo}-issues.jsonl"
  )

In [18]:
fetch_issues()

  0%|          | 0/1 [00:00<?, ?it/s]

Downloaded all the issues for spacy! Dataset stored at ./spacy-issues.jsonl
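
The page arithmetic in `fetch_issues` is easy to check offline. Below is a small sketch that only builds the query strings, assuming 100 issues per page and counting pages from 1 as GitHub does. The helper name `page_queries` is hypothetical.

```python
from math import ceil

def page_queries(num_issues, per_page=100):
    """Yield the query strings needed to cover num_issues, starting at page 1."""
    # Hypothetical helper: mirrors the query format used in fetch_issues
    num_pages = ceil(num_issues / per_page)
    for page in range(1, num_pages + 1):
        yield f"issues?page={page}&per_page={per_page}&state=all"

for query in page_queries(250):
    print(query)
```

Asking for 250 issues needs three pages, so the last request may return more records than strictly required.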


In [22]:
from datasets import load_dataset

issues_spacy = load_dataset('json',
                            data_files="spacy-issues.jsonl")

issues_spacy

Using custom data configuration default-2c8e049d9fe0e638
Reusing dataset json (/root/.cache/huggingface/datasets/json/default-2c8e049d9fe0e638/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)


  0%|          | 0/1 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app'],
        num_rows: 100
    })
})

Unfortunately, [GitHub's REST API](https://docs.github.com/en/rest/reference/issues#list-issues-assigned-to-the-authenticated-user) treats every pull request as an issue, so when we download the issues, we get the pull requests as well.

So, let's get rid of those 😀
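
One way to tell the two apart is the `pull_request` column listed in the features above: as far as I can tell, it is empty for plain issues and filled in for pull requests. Here is a minimal sketch on made-up records. The sample data and the helper name `is_pull_request` are mine, for illustration only.

```python
def is_pull_request(issue):
    """True when an issue record carries a non-empty `pull_request` entry."""
    return issue.get("pull_request") is not None

# Made-up records that only mimic the shape of the API response
sample = [
    {"number": 1, "title": "Bug report", "pull_request": None},
    {"number": 2, "title": "Add feature", "pull_request": {"url": "https://example.invalid/pr/2"}},
    {"number": 3, "title": "Question"},
]

only_issues = [rec for rec in sample if not is_pull_request(rec)]
print([rec["number"] for rec in only_issues])  # [1, 3]
```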

## [Cleaning up the data](https://huggingface.co/course/chapter5/5?fw=pt#cleaning-up-the-data)