# CodeSearchNet Data Source Notice

In [None]:
!mkdir /content/CodeSearchNet

mkdir: cannot create directory ‘/content/CodeSearchNet’: File exists


In [None]:
%%capture
!pip install docopt

After the CodeSearchNet dataset was archived, its S3 bucket was taken offline. As a result, the download steps in the GitHub installation guide no longer work. A short illustration follows.

In [None]:
import os
from subprocess import call, check_call, CalledProcessError

destination_dir = "/content/CodeSearchNet"

if not os.path.exists(destination_dir):
    os.makedirs(destination_dir)
os.chdir(destination_dir)

try:
    language = "python"
    check_call(['wget', f'https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/{language}.zip', '-O', f'{language}.zip'])
    check_call(['unzip', f'{language}.zip'])
    check_call(['rm', f'{language}.zip'])
except CalledProcessError as e:
    print(f"Error: {e}")
    print(f"Error executing command {e.cmd}")
    print(f"Returned code {e.returncode}")

Error: Command '['wget', 'https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip', '-O', 'python.zip']' returned non-zero exit status 8.
Error executing command ['wget', 'https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip', '-O', 'python.zip']
Returned code 8


Instead, we download the dataset from Hugging Face. Updating `datasets` may not be necessary, but it can help avoid errors related to caching in the local file system.

# Data Fetching

In [1]:
%%capture

!pip install datasets==3.6.0

In [2]:
!pip show datasets

Name: datasets
Version: 3.6.0
Summary: HuggingFace community-driven open-source library of datasets
Home-page: https://github.com/huggingface/datasets
Author: HuggingFace Inc.
Author-email: thomas@huggingface.co
License: Apache 2.0
Location: /usr/local/lib/python3.11/dist-packages
Requires: dill, filelock, fsspec, huggingface-hub, multiprocess, numpy, packaging, pandas, pyarrow, pyyaml, requests, tqdm, xxhash
Required-by: torchtune


Make sure to restart the runtime so that the changes take effect.

In [1]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("code_search_net", "python")
# dataset = load_dataset("Nan-Do/code-search-net-python")

train_data = dataset["train"]
test_data = dataset["test"]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

code_search_net.py: 0.00B [00:00, ?B/s]

The repository for code_search_net contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/code_search_net.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


python.zip:   0%|          | 0.00/941M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/412178 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/22176 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/23107 [00:00<?, ? examples/s]

We can inspect the contents of the dataset object for the training, testing, and validation datasets.

In [2]:
print(train_data.features.keys())

dict_keys(['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'])


In [3]:
train_data.shape

(1880853, 11)

# Question Generation Pipeline

In [4]:
import itertools
import torch
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer
)
from nltk import sent_tokenize

from typing import (
    Tuple,
    Dict,
    Literal,
    List,
    Any,
    Generator,
    overload,
    Union
)

class QGPipeline:
    def __init__(
        self,
        model: str,
        qg_format: Literal["highlight"] = "highlight",
        exclude_after: List[str] = [],
        use_cuda: bool = False
    ):

        self.model = AutoModelForSeq2SeqLM.from_pretrained(model)
        self.tokenizer = AutoTokenizer.from_pretrained(model)
        self.qg_format = qg_format

        assert self.model.__class__.__name__ == "T5ForConditionalGeneration"

        self.device = torch.device("cuda" if torch.cuda.is_available() and use_cuda else "cpu")
        self.model.to(self.device)
        self.model.eval()
        self.use_cuda = use_cuda
        self._exclude_after = list(exclude_after)  # copy so the shared default list is never mutated
        print(f"Using {self.device}")

    def __call__(self, input: Union[Tuple[str, str], List[Tuple[str, str]]]):
        if isinstance(input, tuple):
            # Handle single input
            func_name, docstring = input
            questions = self._generate_questions(func_name, docstring)
            output = [{'answer': func_name, 'question': que} for que in questions]
            if output:
                 return output[0]
            else:
                 return {}

        elif isinstance(input, list) and all(isinstance(item, tuple) for item in input):
            # Handle batch input with proper error handling
            return self._process_batch_generator(input)
        else:
            raise TypeError("Invalid input type. Expected a tuple (func_name, docstring) or a list of such tuples.")


    def _process_batch_generator(self, batch_input: List[Tuple[str, str]]) -> Generator[Dict[str, Any], None, None]:
        """
        Process batch input and yield results with error handling per item
        """
        for i, (func_name, docstring) in enumerate(batch_input):
            try:
                questions = self._generate_questions(func_name, docstring)
                output = [{'answer': func_name, 'question': que} for que in questions]

                if output:
                    yield {
                        'success': True,
                        'index': i,
                        'function_name': func_name,
                        'docstring': docstring,
                        'model_output': output[0],
                        'error': None
                    }
                else:
                    yield {
                        'success': False,
                        'index': i,
                        'function_name': func_name,
                        'docstring': docstring,
                        'model_output': {},
                        'error': 'No questions generated'
                    }

            except Exception as e:
                yield {
                    'success': False,
                    'index': i,
                    'function_name': func_name,
                    'docstring': docstring,
                    'model_output': {},
                    'error': str(e)
                }

    def _generate_questions(self, func_name, docstring):
        #TODO: This can be re-written in a more forceful way for the llm
        inputs = self._prepare_inputs_for_question_extraction(func_name, docstring)

        inputs = self._tokenize(inputs, padding=True, truncation=True)

        with torch.no_grad():
            outs = self.model.generate(
                input_ids=inputs['input_ids'].to(self.device),
                attention_mask=inputs['attention_mask'].to(self.device),
                num_beams=4,

                max_length=32
            )

        questions = [self.tokenizer.decode(ids, skip_special_tokens=True) for ids in outs]

        return questions

    def _tokenize(self, inputs, padding=True, truncation=True, add_special_tokens=True, max_length=512):
        tokenized_inputs = self.tokenizer(
            inputs,
            max_length=max_length,
            add_special_tokens=add_special_tokens,
            truncation=truncation,
            padding=padding,
            return_tensors="pt"
        )
        return tokenized_inputs

    def _prepare_inputs_for_question_extraction(self, func_name, docstring):
        # NOTE: experimental; consider removing :param and :return values.
        # Manual observation suggests the model struggles to understand the purpose of the function in their presence.
        for string in self._exclude_after:
            param_idx = docstring.find(string)
            if param_idx != -1:
                docstring = docstring[:param_idx]
            docstring = docstring.strip()
        input = f"answer: <hl>The function is {func_name}<hl>. Context: {docstring} </s>"

        return [input]

    @property
    def exclude_after(self):
        return self._exclude_after

    @exclude_after.setter
    def exclude_after(self, value):
        self._exclude_after = value
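
To make the highlight-style prompt easier to inspect without loading the model, the sketch below restates the prompt construction as a standalone function. The name `build_highlight_prompt` is hypothetical and not part of the class. It follows `_prepare_inputs_for_question_extraction`, with one assumed difference: whitespace is stripped even when `exclude_after` is empty.

```python
from typing import List


def build_highlight_prompt(func_name: str, docstring: str, exclude_after: List[str]) -> str:
    """Illustrative restatement of the prompt construction (hypothetical helper)."""
    # Cut the docstring at the first occurrence of each marker, e.g. ":param"
    for marker in exclude_after:
        cut = docstring.find(marker)
        if cut != -1:
            docstring = docstring[:cut]
    docstring = docstring.strip()
    # The answer span is wrapped in <hl> tokens, followed by the context
    return f"answer: <hl>The function is {func_name}<hl>. Context: {docstring} </s>"


prompt = build_highlight_prompt(
    "Table.insert",
    "insert new db entry\n\n        :param fields: list of fields to insert",
    exclude_after=[":param", ":return"],
)
print(prompt)
```

With the pipeline itself, the same trimming would be enabled by setting `finetuned_t5.exclude_after = [":param", ":return"]` before calling it.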

In [5]:
finetuned_t5 = QGPipeline(model="valhalla/t5-base-qg-hl")

config.json: 0.00B [00:00, ?B/s]

pytorch_model.bin:   0%|          | 0.00/892M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/129 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/15.0 [00:00<?, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Using cpu


In [6]:
func_name = train_data[4]['func_name']
docstring = train_data[4]['func_documentation_string']
print(docstring)
finetuned_t5((func_name, docstring))

:return: ids of test reports


{'answer': 'ReportsTable.get_report_list',
 'question': 'What is the name of the function that returns the ids of test reports?'}

The result looks promising. Let's run the model on the first 10 docstrings in our dataset.

In [None]:
for i in range(10):
    func_name = train_data[i]['func_name']
    docstring = train_data[i]['func_documentation_string']
    print(f"========Sample{i+1}==========")
    # The full docstring is passed through unchanged; to trim everything from
    # ":param" onward, set finetuned_t5.exclude_after = [":param"] instead.
    print(f"Docstring: {docstring}")
    print(finetuned_t5((func_name, docstring)))


Docstring: update db entry

        :param field_dict: dictionary of fields and values
        :param where_clause: where clause for the update
{'answer': 'Table.update', 'question': 'What is the function for Table.update?'}
Docstring: insert new db entry

        :param fields: list of fields to insert
        :param values: list of values to insert
        :return: row id of the new row
{'answer': 'Table.insert', 'question': 'What is the name of the function that inserts a new db entry?'}
Docstring: :param report: the report to store
        :param test_id: the id of the test reported
        :return: report id
{'answer': 'ReportsTable.store', 'question': 'What is the name of the function that stores the report?'}
Docstring: get report by the test id

        :param test_id: test id
        :return: Report object
{'answer': 'ReportsTable.get', 'question': 'What is the name of the function that returns a report?'}
Docstring: :return: ids of test reports
{'answer': 'ReportsTable.get_re

In [None]:
zipped5 = zip(train_data[:5]['func_name'], train_data[:5]['func_documentation_string'])

generator = finetuned_t5(list(zipped5))
print(generator)

for item in generator:
    # Failed items carry an empty 'model_output', so .get() avoids a KeyError
    ans, ques = item['function_name'], item['model_output'].get('question')
    print("========Sample==========")
    print(f"Question: {ques}")
    print(f"Answer: {ans}")
    print()

<generator object QGPipeline._process_batch_generator at 0x7833286165c0>
Question: What is the function that ensures that the WebSocket connection is open?
Answer: WebSocketCommonProtocol.ensure_open

Question: What is the function that reads incoming messages and puts them in a queue?
Answer: WebSocketCommonProtocol.transfer_data

Question: What is the function that reads a single message from the connection?
Answer: WebSocketCommonProtocol.read_message

Question: What is the function that reads a single data frame from a connection?
Answer: WebSocketCommonProtocol.read_data_frame

Question: What is the function that reads a single frame from a connection?
Answer: WebSocketCommonProtocol.read_frame



# Dataset Processor

In [None]:
import json
import pandas as pd
from pathlib import Path
from tqdm import tqdm
import time
from typing import List, Tuple, Dict, Any
from datasets import Dataset, DatasetDict
from huggingface_hub import repo_exists, auth_check
from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError


def dataset_exists(dataset_name, token=None):
    try:
        load_dataset(dataset_name, token=token, streaming=True)
        return True
    except (RepositoryNotFoundError, OSError, ValueError):
        return False


class DocstringDatasetProcessor:
    def __init__(
        self,
        hf_dataset_name: str,
        batch_size: int = 1000,
        token: str = "",
        save_locally: bool = False,
        local_cache_dir: str = "./cache",
        private_repo: bool = False,
    ):

        self.hf_dataset_name = hf_dataset_name
        self.batch_size = batch_size
        self.private_repo = private_repo
        self.local_cache_dir = Path(local_cache_dir)
        self.local_cache_dir.mkdir(exist_ok=True)
        self.save_locally = save_locally

        self.processed_count = 0
        self.failed_count = 0
        self.all_generated_data = []

        self.token = token

    def process_batch(
        self, batch_data: List[Tuple[str, str]], pipeline: QGPipeline, batch_id: int
    ) -> List[Dict[str, Any]]:
        """Process a batch of (func_name, docstring) tuples with individual error handling"""
        batch_results = []
        batch_success_count = 0
        batch_failure_count = 0

        print(batch_data)

        # Process the entire batch through pipeline
        try:
            generator = pipeline(batch_data)
            for result in generator:

                if result["success"]:
                    batch_results.append(
                        {
                            "function_name": result["function_name"],
                            "docstring": result["docstring"],
                            "question": result["model_output"]["question"],
                        }
                    )
                    batch_success_count += 1
                else:
                    print(
                        f"Failed to process {result['function_name']}: {result['error']}"
                    )
                    batch_failure_count += 1

        except Exception as e:
            # Catastrophic failure - entire batch failed
            print(f"Catastrophic batch failure {batch_id}: {e}")
            batch_failure_count = len(batch_data)
            batch_success_count = 0

        self.processed_count += batch_success_count
        self.failed_count += batch_failure_count

        print(
            f"Batch {batch_id}: {batch_success_count} successful, "
            f"{batch_failure_count} failed out of {len(batch_data)} items"
        )

        if self.save_locally and batch_results:
            self._save_batch_locally(batch_results, batch_id)

        return batch_results

    def _save_batch_locally(self, batch_results: List[Dict], batch_id: int):
        batch_file = self.local_cache_dir / f"batch_{batch_id}.jsonl"
        with open(batch_file, "w") as f:
            for item in batch_results:
                json.dump(item, f)
                f.write("\n")

    def process_full_dataset(self, dataset, pipeline, start_idx: int = 0):
        """Process the entire dataset and upload it to Hugging Face"""

        if not self.token:
            print("Hugging Face token not provided. Terminating.")
            return

        if not self._can_upload():
            print("Cannot upload to Hugging Face: insufficient permissions or repo access.")
            return

        print(f"Starting processing of {len(dataset)} items from index {start_idx}")

        start_time = time.time()

        for batch_start in tqdm(
            range(start_idx, len(dataset), self.batch_size), desc="Processing batches"
        ):
            batch_end = min(batch_start + self.batch_size, len(dataset))
            batch_data = dataset[batch_start:batch_end]
            batch_id = batch_start // self.batch_size

            batch_results = self.process_batch(batch_data, pipeline, batch_id)
            self.all_generated_data.extend(batch_results)

            # print progress
            if batch_id % 10 == 0:
                elapsed = time.time() - start_time
                rate = self.processed_count / elapsed if elapsed > 0 else 0
                print(
                    f"Processed {self.processed_count} items in {elapsed:.2f} seconds. Rate: {rate:.2f} items/sec"
                )

            # running statistics and incremental upload after every batch
            total_time = time.time() - start_time
            rate = self.processed_count / total_time if total_time > 0 else 0
            print(
                f"Processed {self.processed_count} items in {total_time:.2f} seconds. Rate: {rate:.2f} items/sec"
            )

            self._upload_to_hf()


    def _can_upload(self):
        namespace = self.hf_dataset_name.split('/')[0]
        try:
            if dataset_exists(self.hf_dataset_name, token=self.token):
                return True
            else:
                from huggingface_hub import whoami
                user_info = whoami(token=self.token)
                if namespace == user_info["name"]:
                    return True
                else:
                    return False
        except Exception as e:
            print(e)
            return False

    def _upload_to_hf(self):
        try:
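            import hashlib  # local import so this method stays self-contained

            new_dataset = Dataset.from_list(self.all_generated_data)
            # Built-in hash() is salted per interpreter session for strings, so use a
            # digest to keep ids stable across runs (needed for de-duplication below).
            new_dataset = new_dataset.map(
                lambda x: {
                    **x,
                    "id": f"{x['function_name']}_{int(hashlib.md5(x['docstring'].encode('utf-8')).hexdigest(), 16) % 10000}",
                }
            )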

            if dataset_exists(self.hf_dataset_name, token=self.token):
                try:
                    existing_dataset_dict = load_dataset(self.hf_dataset_name, token=self.token)
                    if "train" in existing_dataset_dict:
                        existing_dataset = existing_dataset_dict["train"]
                        combined_data = existing_dataset.to_list() + new_dataset.to_list()
                        seen_ids = set()
                        merged_data = []
                        for item in combined_data:
                            if item["id"] not in seen_ids:
                                merged_data.append(item)
                                seen_ids.add(item["id"])
                        merged_dataset = Dataset.from_list(merged_data)
                        dataset_dict = DatasetDict({"train": merged_dataset})
                    else:
                        dataset_dict = DatasetDict({"train": new_dataset})
                except Exception as e:
                    print(f"Error loading existing dataset, proceeding with new data only: {e}")
                    dataset_dict = DatasetDict({"train": new_dataset})
            else:
                dataset_dict = DatasetDict({"train": new_dataset})

            print("Uploading dataset to Hugging Face...")

            dataset_dict.push_to_hub(
                self.hf_dataset_name,
                token=self.token,
                private=self.private_repo,
                commit_message=f"Add {len(self.all_generated_data)} docstring-question pairs",
            )

            print(
                f"Successfully uploaded dataset to https://huggingface.co/datasets/{self.hf_dataset_name}"
            )

        except Exception as e:
            print(f"Error uploading to Hugging Face: {e}")
            if self.save_locally:
                print("Data is available locally in the cache directory")
            raise

    def load_from_hf(self):
        """Load the dataset from Hugging Face"""
        try:
            dataset = load_dataset(self.hf_dataset_name, token=self.token)
            print(f"Successfully loaded dataset from {self.hf_dataset_name}")
            return dataset
        except Exception as e:
            print(f"Error loading dataset from Hugging Face: {e}")
            raise

    # TODO: resume processing from local cache file
    # TODO: upload from colab cache to permanent file location (local or drive)


In [12]:
# Constants

from google.colab import userdata

HUGGING_FACE_TOKEN = userdata.get("HUGGING_FACE_TOKEN")
hf_dataset_name = "mrinjera/testing"

In [None]:
dataset_processor = DocstringDatasetProcessor(hf_dataset_name, batch_size=10, token=HUGGING_FACE_TOKEN)

batch_raw_data = zip(train_data[:500]['func_name'], train_data[:500]['func_documentation_string'])
batch_zip_data = list(batch_raw_data)

pipeline = finetuned_t5
dataset_processor.process_full_dataset(batch_zip_data, pipeline)

README.md:   0%|          | 0.00/378 [00:00<?, ?B/s]

Starting processing of 500 items from index 0


Processing batches:   0%|          | 0/50 [00:00<?, ?it/s]

[('Utility.findArgs', 'Extracts the list of arguments that start with any of the specified prefix values'), ('Utility.stripArgs', 'Removes any arguments in the supplied list that are contained in the specified blacklist'), ('Utility.capture', 'Executes a child process and captures its output'), ('Utility.run', 'Executes a child process and waits for it to complete'), ('UnrealManagerBase.setEngineRootOverride', 'Sets a user-specified directory as the root engine directory, overriding any auto-detection'), ('UnrealManagerBase.getEngineRoot', 'Returns the root directory location of the latest installed version of UE4'), ('UnrealManagerBase.getEngineVersion', 'Returns the version number of the latest installed version of UE4'), ('UnrealManagerBase.getEngineChangelist', 'Returns the compatible Perforce changelist identifier for the latest installed version of UE4'), ('UnrealManagerBase.isInstalledBuild', 'Determines if the Engine is an Installed Build'), ('UnrealManagerBase.getEditorBinary'

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/11.9k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/10.9k [00:00<?, ?B/s]

Processing batches:   2%|▏         | 1/50 [00:35<28:37, 35.05s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('UnrealManagerBase.getProjectDescriptor', 'Detects the .uproject descriptor file for the Unreal project in the specified directory'), ('UnrealManagerBase.getPluginDescriptor', 'Detects the .uplugin descriptor file for the Unreal plugin in the specified directory'), ('UnrealManagerBase.getDescriptor', 'Detects the descriptor file for either an Unreal project or an Unreal plugin in the specified directory'), ('UnrealManagerBase.listThirdPartyLibs', 'Lists the supported Unreal-bundled third-party libraries'), ('UnrealManagerBase.getThirdpartyLibs', 'Retrieves the ThirdPartyLibraryDetails instance for Unreal-bundled versions of the specified third-party libraries'), ('UnrealManagerBase.getThirdPartyLibCompilerFlags', 'Retrieves the compiler flags for building against the Unreal-bundled versions of the specified third-party libraries'), ('UnrealManagerBase.getThirdPartyLibLinkerFlags', 'Retrieves the linker 

Map:   0%|          | 0/20 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/379 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/10.9k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/11.9k [00:00<?, ?B/s]

Processing batches:   4%|▍         | 2/50 [01:13<29:47, 37.24s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('UnrealManagerBase.getThirdPartyLibDefinitions', 'Retrieves the list of preprocessor definitions for building against the Unreal-bundled versions of the specified third-party libraries'), ('UnrealManagerBase.generateProjectFiles', 'Generates IDE project files for the Unreal project in the specified directory'), ('UnrealManagerBase.cleanDescriptor', 'Cleans the build artifacts for the Unreal project or plugin in the specified directory'), ('UnrealManagerBase.buildDescriptor', 'Builds the editor modules for the Unreal project or plugin in the specified directory, using the specified build configuration'), ('UnrealManagerBase.runEditor', 'Runs the editor for the Unreal project in the specified directory (or without a project if dir is None)'), ('UnrealManagerBase.runUAT', 'Runs the Unreal Automation Tool with the supplied arguments'), ('UnrealManagerBase.packageProject', 'Packages a build of the Unreal pro

Map:   0%|          | 0/30 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/379 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/11.9k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/22 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/7.89k [00:00<?, ?B/s]

Processing batches:   6%|▌         | 3/50 [01:45<27:01, 34.51s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('UnrealManagerBase._getEngineRoot', 'Retrieves the user-specified engine root directory override (if set), or else performs auto-detection'), ('UnrealManagerBase._getEngineVersionDetails', 'Parses the JSON version details for the latest installed version of UE4'), ('UnrealManagerBase._getEngineVersionHash', 'Computes the SHA-256 hash of the JSON version details for the latest installed version of UE4'), ('UnrealManagerBase._runUnrealBuildTool', 'Invokes UnrealBuildTool with the specified parameters'), ('UnrealManagerBase._getUE4BuildInterrogator', 'Uses UE4BuildInterrogator to interrogate UnrealBuildTool about third-party library details'), ('JsonDataManager.getKey', 'Retrieves the value for the specified dictionary key'), ('JsonDataManager.getDictionary', 'Retrieves the entire data dictionary'), ('JsonDataManager.setKey', 'Sets the value for the specified dictionary key'), ('JsonDataManager.setDictiona

Map:   0%|          | 0/40 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/380 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/7.89k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/32 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/8.66k [00:00<?, ?B/s]

Processing batches:   8%|▊         | 4/50 [02:14<24:49, 32.37s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('UE4BuildInterrogator.interrogate', 'Interrogates UnrealBuildTool about the build flags for the specified third-party libraries'), ('UE4BuildInterrogator._absolutePaths', 'Converts the supplied list of paths to absolute pathnames (except for pure filenames without leading relative directories)'), ('UE4BuildInterrogator._flatten', 'Extracts the entry `field` from each item in the supplied iterable, flattening any nested lists'), ('UE4BuildInterrogator._getThirdPartyLibs', 'Runs UnrealBuildTool in JSON export mode and extracts the list of third-party libraries'), ('CMakeCustomFlags.processLibraryDetails', 'Processes the supplied ThirdPartyLibraryDetails instance and sets any custom CMake flags'), ('PluginManager.getPlugins', 'Returns the list of valid ue4cli plugins'), ('ThirdPartyLibraryDetails.getCompilerFlags', 'Constructs the compiler flags string for building against this library'), ('ThirdPartyLibra

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/380 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/8.66k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/42 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/9.80k [00:00<?, ?B/s]

Processing batches:  10%|█         | 5/50 [02:44<23:48, 31.75s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('ThirdPartyLibraryDetails.getLinkerDirectories', 'Returns the list of linker directories for this library, joined using the specified delimiter'), ('ThirdPartyLibraryDetails.getLibraryFiles', 'Returns the list of library files for this library, joined using the specified delimiter'), ('ThirdPartyLibraryDetails.getPreprocessorDefinitions', 'Returns the list of preprocessor definitions for this library, joined using the specified delimiter'), ('ThirdPartyLibraryDetails.getCMakeFlags', 'Constructs the CMake invocation flags string for building against this library'), ('is_changed', 'Checks if current project has any noncommited changes.'), ('user_agent', 'Return a User-Agent that identifies this client.\n\n    Example:\n        python-requests/2.9.1 edx-rest-api-client/1.7.2 ecommerce\n\n    The last item in the list will be the application name, taken from the\n    OS environment variable EDX_REST_API_CLI

Map:   0%|          | 0/60 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/380 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/9.80k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/52 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/11.4k [00:00<?, ?B/s]

Processing batches:  12%|█▏        | 6/50 [03:16<23:11, 31.62s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('simple_response_str', 'Creates an OSP response XML string.\n\n    Arguments:\n        command (str): OSP Command to respond to.\n        status (int): Status of the response.\n        status_text (str): Status text of the response.\n        content (str): Text part of the response XML element.\n\n    Return:\n        String of response in xml format.'), ('bind_socket', 'Returns a socket bound on (address:port).'), ('bind_unix_socket', 'Returns a unix file socket bound on (path).'), ('close_client_stream', 'Closes provided client stream'), ('OSPDaemon.set_command_attributes', 'Sets the xml attributes of a specified command.'), ('OSPDaemon.add_scanner_param', 'Add a scanner parameter.'), ('OSPDaemon.add_vt', 'Add a vulnerability test information.\n\n        Returns: The new number of stored VTs.\n        -1 in case the VT ID was already present and thus the\n        new VT was not considered.\n        -2

Map:   0%|          | 0/70 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/381 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/11.4k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/62 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/13.1k [00:00<?, ?B/s]

Processing batches:  14%|█▍        | 7/50 [03:48<22:49, 31.86s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('OSPDaemon.process_targets_element', 'Receive an XML object with the target, ports and credentials to run\n        a scan against.\n\n        @param: XML element with target subelements. Each target has <hosts>\n        and <ports> subelements. Hosts can be a single host, a host range,\n        a comma-separated host list or a network address.\n        <ports> and  <credentials> are optional. Therefore each ospd-scanner\n        should check for a valid ones if needed.\n\n                Example form:\n                <targets>\n                  <target>\n                    <hosts>localhosts</hosts>\n                    <ports>80,443</ports>\n                  </target>\n                  <target>\n                    <hosts>192.168.0.0/24</hosts>\n                    <ports>22</ports>\n                    <credentials>\n                      <credential type="up" service="ssh" port="22">\n           

Map:   0%|          | 0/80 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/381 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/13.1k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/72 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/14.4k [00:00<?, ?B/s]

Processing batches:  16%|█▌        | 8/50 [04:20<22:20, 31.91s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('OSPDaemon.handle_client_stream', 'Handles stream of data received from client.'), ('OSPDaemon.parallel_scan', 'Starts the scan with scan_id.'), ('OSPDaemon.check_pending_target', 'Check if a scan process is still alive. In case the process\n        finished or is stopped, removes the process from the multiscan\n        _process list.\n        Processes dead and with progress < 100% are considered stopped\n        or with failures. Then will try to stop the other runnings (target)\n        scans owned by the same task.\n\n        @input scan_id        Scan_id of the whole scan.\n        @input multiscan_proc A list with the scan process which\n                              may still be alive.\n\n        @return Actualized list with current runnging scan processes.'), ('OSPDaemon.calculate_progress', 'Calculate the total scan progress from the\n        partial target progress.'), ('OSPDaemon.start_scan',

Map:   0%|          | 0/90 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/381 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/14.4k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/82 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/15.5k [00:00<?, ?B/s]

Processing batches:  18%|█▊        | 9/50 [04:46<20:32, 30.05s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('OSPDaemon.handle_get_vts_command', 'Handles <get_vts> command.\n\n        @return: Response string for <get_vts> command.'), ('OSPDaemon.handle_help_command', 'Handles <help> command.\n\n        @return: Response string for <help> command.'), ('OSPDaemon.get_help_text', 'Returns the help output in plain text format.'), ('OSPDaemon.elements_as_text', 'Returns the elems dictionary as formatted plain text.'), ('OSPDaemon.handle_delete_scan_command', 'Handles <delete_scan> command.\n\n        @return: Response string for <delete_scan> command.'), ('OSPDaemon.delete_scan', 'Deletes scan_id scan from collection.\n\n        @return: 1 if scan deleted, 0 otherwise.'), ('OSPDaemon.get_scan_results_xml', "Gets scan_id scan's results in XML format.\n\n        @return: String of scan results in xml."), ('OSPDaemon.get_xml_str', 'Creates a string in XML Format using the provided data structure.\n\n        @param: D

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/381 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/15.5k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/92 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/16.3k [00:00<?, ?B/s]

Processing batches:  20%|██        | 10/50 [05:18<20:22, 30.56s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('OSPDaemon.get_vts_xml', 'Gets collection of vulnerability test information in XML format.\n        If vt_id is specified, the collection will contain only this vt, if\n        found.\n        If no vt_id is specified, the collection will contain all vts or those\n        passed in filtered_vts.\n\n        Arguments:\n            vt_id (vt_id, optional): ID of the vt to get.\n            filtered_vts (dict, optional): Filtered VTs collection.\n\n        Return:\n            String of collection of vulnerability test information in\n            XML format.'), ('OSPDaemon.handle_get_scanner_details', 'Handles <get_scanner_details> command.\n\n        @return: Response string for <get_scanner_details> command.'), ('OSPDaemon.handle_get_version_command', 'Handles <get_version> command.\n\n        @return: Response string for <get_version> command.'), ('OSPDaemon.handle_command', 'Handles an osp command in a

Map:   0%|          | 0/110 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/382 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/16.3k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/102 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/17.2k [00:00<?, ?B/s]

Processing batches:  22%|██▏       | 11/50 [05:48<19:47, 30.46s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('OSPDaemon.add_scan_host_detail', 'Adds a host detail result to scan_id scan.'), ('OSPDaemon.add_scan_alarm', 'Adds an alarm result to scan_id scan.'), ('VtsFilter.parse_filters', 'Parse a string containing one or more filters\n        and return a list of filters\n\n        Arguments:\n            vt_filter (string): String containing filters separated with\n                semicolon.\n        Return:\n            List with filters. Each filters is a list with 3 elements\n            e.g. [arg, operator, value]'), ('VtsFilter.format_filter_value', 'Calls the specific function to format value,\n        depending on the given element.\n\n        Arguments:\n            element (string): The element of the VT to be formatted.\n            value (dictionary): The element value.\n\n        Returns:\n            Returns a formatted value.'), ('VtsFilter.get_filtered_vts_list', 'Gets a collection of vulnerabi

Map:   0%|          | 0/120 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/382 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/17.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/112 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/18.4k [00:00<?, ?B/s]

Processing batches:  24%|██▍       | 12/50 [06:19<19:23, 30.62s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('target_to_ipv4_short', 'Attempt to return a IPv4 short range list from a target string.'), ('target_to_ipv4_cidr', 'Attempt to return a IPv4 CIDR list from a target string.'), ('target_to_ipv6_cidr', 'Attempt to return a IPv6 CIDR list from a target string.'), ('target_to_ipv4_long', 'Attempt to return a IPv4 long-range list from a target string.'), ('ipv6_range_to_list', 'Return a list of IPv6 entries from start_packed to end_packed.'), ('target_to_ipv6_short', 'Attempt to return a IPv6 short-range list from a target string.'), ('target_to_ipv6_long', 'Attempt to return a IPv6 long-range list from a target string.'), ('target_to_hostname', 'Attempt to return a single hostname list from a target string.'), ('target_to_list', 'Attempt to return a list of single hosts from a target string.'), ('target_str_to_list', 'Parses a targets string into a list of individual targets.')]
Batch 12: 10 successful, 0 

Map:   0%|          | 0/130 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/382 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/18.4k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/122 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/19.1k [00:00<?, ?B/s]

Processing batches:  26%|██▌       | 13/50 [06:52<19:15, 31.24s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('port_range_expand', 'Receive a port range and expands it in individual ports.\n\n    @input Port range.\n    e.g. "4-8"\n\n    @return List of integers.\n    e.g. [4, 5, 6, 7, 8]'), ('port_str_arrange', 'Gives a str in the format (always tcp listed first).\n    T:<tcp ports/portrange comma separated>U:<udp ports comma separated>'), ('ports_str_check_failed', 'Check if the port string is well formed.\n    Return True if fail, False other case.'), ('ports_as_list', 'Parses a ports string into two list of individual tcp and udp ports.\n\n    @input string containing a port list\n    e.g. T:1,2,3,5-8 U:22,80,600-1024\n\n    @return two list of sorted integers, for tcp and udp ports respectively.'), ('port_list_compress', 'Compress a port list and return a string.'), ('valid_uuid', 'Check if value is a valid UUID.'), ('create_args_parser', 'Create a command-line arguments parser for OSPD.'), ('go_to_backgro

Map:   0%|          | 0/140 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/382 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/19.1k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/132 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/20.2k [00:00<?, ?B/s]

Processing batches:  28%|██▊       | 14/50 [07:25<19:03, 31.77s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('main', 'OSPD Main function.'), ('ScanCollection.add_result', 'Add a result to a scan in the table.'), ('ScanCollection.set_progress', "Sets scan_id scan's progress."), ('ScanCollection.set_target_progress', "Sets scan_id scan's progress."), ('ScanCollection.set_host_finished', 'Add the host in a list of finished hosts'), ('ScanCollection.get_hosts_unfinished', 'Get a list of finished hosts.'), ('ScanCollection.results_iterator', "Returns an iterator over scan_id scan's results. If pop_res is True,\n        it removed the fetched results from the list."), ('ScanCollection.del_results_for_stopped_hosts', 'Remove results from the result table for those host'), ('ScanCollection.resume_scan', 'Reset the scan status in the scan_table to INIT.\n        Also, overwrite the options, because a resume task cmd\n        can add some new option. E.g. exclude hosts list.\n        Parameters:\n            scan_id (uu

Map:   0%|          | 0/150 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/382 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/20.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/142 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/21.2k [00:00<?, ?B/s]

Processing batches:  30%|███       | 15/50 [07:52<17:50, 30.59s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('ScanCollection.set_option', "Set a scan_id scan's name option to value."), ('ScanCollection.get_target_progress', "Get a target's current progress value.\n        The value is calculated with the progress of each single host\n        in the target."), ('ScanCollection.get_target_list', "Get a scan's target list."), ('ScanCollection.get_ports', "Get a scan's ports list. If a target is specified\n        it will return the corresponding port for it. If not,\n        it returns the port item of the first nested list in\n        the target's list."), ('ScanCollection.get_credentials', "Get a scan's credential list. It return dictionary with\n        the corresponding credential for a given target."), ('ScanCollection.delete_scan', 'Delete a scan if fully finished.'), ('ResultType.get_str', 'Return string name of a result type.'), ('ResultType.get_type', 'Return string name of a result type.'), ('is_float',

Map:   0%|          | 0/160 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/382 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/21.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/152 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/22.1k [00:00<?, ?B/s]

Processing batches:  32%|███▏      | 16/50 [08:19<16:38, 29.36s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('init_logging', 'Init logging settings with default set to INFO'), ('Rule.keywords', 'Returns a list of all keywords that this rule object has defined.\n        A keyword is considered defined if the value it returns != None.'), ('Rule.check_type_keywords', 'All supported keywords:\n         - allowempty_map\n         - assertion\n         - class\n         - date\n         - default\n         - desc\n         - enum\n         - example\n         - extensions\n         - func\n         - ident\n         - include_name\n         - map_regex_rule\n         - mapping\n         - matching\n         - matching_rule\n         - name\n         - nullable\n         - pattern\n         - pattern_regexp\n         - range\n         - regex_mappings\n         - required\n         - schema\n         - sequence\n         - type\n         - type_class\n         - unique\n         - version'), ('Core._load_extensions',

Map:   0%|          | 0/170 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/382 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/22.1k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/162 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/24.3k [00:00<?, ?B/s]

Processing batches:  34%|███▍      | 17/50 [08:59<17:53, 32.54s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('fold', 'Fold a design to reduce confounding effects.\r\n    \r\n    Parameters\r\n    ----------\r\n    H : 2d-array\r\n        The design matrix to be folded.\r\n    columns : array\r\n        Indices of of columns to fold (Default: None). If ``columns=None`` is\r\n        used, then all columns will be folded.\r\n    \r\n    Returns\r\n    -------\r\n    Hf : 2d-array\r\n        The folded design matrix.\r\n    \r\n    Examples\r\n    --------\r\n    ::'), ('build_regression_matrix', 'Build a regression matrix using a DOE matrix and a list of monomials.\r\n    \r\n    Parameters\r\n    ----------\r\n    H : 2d-array\r\n    model : str\r\n    build : bool-array\r\n    \r\n    Returns\r\n    -------\r\n    R : 2d-array'), ('to_bedtool', 'Convert any iterator into a pybedtools.BedTool object.\n\n    Note that the supplied iterator is not consumed by this function. To save\n    to a temp file or to a kno

Map:   0%|          | 0/180 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/382 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/24.3k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/172 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/28.9k [00:00<?, ?B/s]

Processing batches:  36%|███▌      | 18/50 [09:39<18:31, 34.72s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('from_seqfeature', 'Converts a Bio.SeqFeature object to a gffutils.Feature object.\n\n    The GFF fields `source`, `score`, `seqid`, and `frame` are assumed to be\n    stored as qualifiers.  Any other qualifiers will be assumed to be GFF\n    attributes.'), ('FeatureDB.set_pragmas', 'Set pragmas for the current database connection.\n\n        Parameters\n        ----------\n        pragmas : dict\n            Dictionary of pragmas; see constants.default_pragmas for a template\n            and http://www.sqlite.org/pragma.html for a full list.'), ('FeatureDB._feature_returner', 'Returns a feature, adding additional database-specific defaults'), ('FeatureDB.schema', 'Returns the database schema as a string.'), ('FeatureDB.count_features_of_type', 'Simple count of features.\n\n        Can be faster than "grep", and is faster than checking the length of\n        results from :meth:`gffutils.FeatureDB.featur

Map:   0%|          | 0/190 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/382 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/28.9k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/182 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/30.4k [00:00<?, ?B/s]

Processing batches:  38%|███▊      | 19/50 [10:07<16:54, 32.71s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
Batch 19: 10 successful, 0 failed out of 10 items
Processed 200 items in 643.33 seconds. Rate: 0.31 items/sec


Map:   0%|          | 0/200 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/382 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/30.4k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/192 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/35.4k [00:00<?, ?B/s]

Processing batches:  40%|████      | 20/50 [10:52<18:12, 36.43s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('FeatureDB.bed12', 'Converts `feature` into a BED12 format.\n\n        GFF and GTF files do not necessarily define genes consistently, so this\n        method provides flexiblity in specifying what to call a "transcript".\n\n        Parameters\n        ----------\n        feature : str or Feature instance\n            In most cases, this feature should be a transcript rather than\n            a gene.\n\n        block_featuretype : str or list\n            Which featuretype to use as the exons. These are represented as\n            blocks in the BED12 format.  Typically \'exon\'.\n\n            Use the `thick_featuretype` and `thin_featuretype` arguments to\n            control the display of CDS as thicker blocks and UTRs as thinner\n            blocks.\n\n            Note that the features for `thick` or `thin` are *not*\n            automatically included in the blocks; if you do want them included,\n

Map:   0%|          | 0/210 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/382 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/35.4k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/202 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/41.0k [00:00<?, ?B/s]

Processing batches:  42%|████▏     | 21/50 [11:48<20:26, 42.28s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('feature_from_line', "Given a line from a GFF file, return a Feature object\n\n    Parameters\n    ----------\n    line : string\n\n    strict : bool\n        If True (default), assume `line` is a single, tab-delimited string that\n        has at least 9 fields.\n\n        If False, then the input can have a more flexible format, useful for\n        creating single ad hoc features or for writing tests.  In this case,\n        `line` can be a multi-line string (as long as it has a single non-empty\n        line), and, as long as there are only 9 fields (standard GFF/GTF), then\n        it's OK to use spaces instead of tabs to separate fields in `line`.\n        But if >9 fields are to be used, then tabs must be used.\n\n    keep_order, dialect\n        Passed directly to :class:`Feature`; see docstring for that class for\n        description\n\n    Returns\n    -------\n    A new :class:`Feature` object.

Map:   0%|          | 0/220 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/382 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/41.0k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/210 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/44.1k [00:00<?, ?B/s]

Processing batches:  44%|████▍     | 22/50 [12:22<18:39, 39.98s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('_unjsonify', 'Convert JSON string to an ordered defaultdict.'), ('_feature_to_fields', 'Convert feature to tuple, for faster sqlite3 import'), ('_dict_to_fields', 'Convert dict to tuple, for faster sqlite3 import'), ('merge_attributes', 'Merges two attribute dictionaries into a single dictionary.\n\n    Parameters\n    ----------\n    `attr1`, `attr2` : dict\n\n    Returns\n    -------\n    dict'), ('dialect_compare', 'Compares two dialects.'), ('sanitize_gff_db', "Sanitize given GFF db. Returns a sanitized GFF db.\n\n    Sanitizing means:\n\n    - Ensuring that start < stop for all features\n    - Standardizing gene units by adding a 'gid' attribute\n      that makes the file grep-able\n\n    TODO: Do something with negative coordinates?"), ('sanitize_gff_file', 'Sanitize a GFF file.'), ('is_gff_db', 'Return True if the given filename is a GFF database.\n\n    For now, rely on .db extension.'), ('get_

Map:   0%|          | 0/230 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/382 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/44.1k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/220 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/45.5k [00:00<?, ?B/s]

Processing batches:  46%|████▌     | 23/50 [12:52<16:35, 36.88s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
Batch 23: 10 successful, 0 failed out of 10 items
Processed 240 items in 802.30 seconds. Rate: 0.30 items/sec


Map:   0%|          | 0/240 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/382 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/45.5k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/230 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/51.8k [00:00<?, ?B/s]

Processing batches:  48%|████▊     | 24/50 [13:24<15:23, 35.54s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('_DBCreator.create', 'Calls various methods sequentially in order to fully build the\n        database.'), ('_DBCreator.execute', 'Execute a query directly on the database.'), ('_DBCreator._replace', 'Insert a feature into the database.'), ('wait_for_js', 'Method decorator that waits for JavaScript dependencies before executing `function`.\n    If the function is not a method, the decorator has no effect.\n\n    Args:\n        function (callable): Method to decorate.\n\n    Returns:\n        Decorated method'), ('_decorator', 'Return a class decorator that:\n\n    1) Defines a new class method, `wait_for_js`\n    2) Defines a new class list variable, `store_name` and adds\n        `store_values` to the list.'), ('_wait_for_js', 'Class method added by the decorators to allow\n    decorated classes to manually re-check JavaScript\n    dependencies.\n\n    Expect that `self` is a class that:\n    1) Has be

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/51.8k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/240 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/53.5k [00:00<?, ?B/s]

Processing batches:  50%|█████     | 25/50 [13:54<14:06, 33.84s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
Batch 25: 10 successful, 0 failed out of 10 items
Processed 260 items in 868.46 seconds. Rate: 0.30 items/sec


Map:   0%|          | 0/260 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/53.5k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/250 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/56.4k [00:00<?, ?B/s]

Processing batches:  52%|█████▏    | 26/50 [14:31<13:49, 34.57s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('AxeCoreAuditConfig.customize_ruleset', 'Updates the ruleset to include a set of custom rules. These rules will\n        be _added_ to the existing ruleset or replace the existing rule with\n        the same ID.\n\n        Args:\n\n            custom_ruleset_file (optional): The filepath to the custom rules.\n                Defaults to `None`. If `custom_ruleset_file` isn\'t passed, the\n                environment variable `BOKCHOY_A11Y_CUSTOM_RULES_FILE` will be\n                checked. If a filepath isn\'t specified by either of these\n                methods, the ruleset will not be updated.\n\n        Raises:\n\n            `IOError` if the specified file does not exist.\n\n        Examples:\n\n            To include the rules defined in `axe-core-custom-rules.js`::\n\n                page.a11y_audit.config.customize_ruleset(\n                    "axe-core-custom-rules.js"\n                )\n\n 

Map:   0%|          | 0/270 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/56.4k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/260 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/59.3k [00:00<?, ?B/s]

Processing batches:  54%|█████▍    | 27/50 [15:04<13:09, 34.31s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('_local_browser_class', 'Returns class, kwargs, and args needed to instantiate the local browser.'), ('_remote_browser_class', 'Returns class, kwargs, and args needed to instantiate the remote browser.'), ('_proxy_kwargs', 'Determines the kwargs needed to set up a proxy based on the\n    browser type.\n\n    Returns: a dictionary of arguments needed to pass when\n        instantiating the WebDriver instance.'), ('_required_envs', 'Parse environment variables for required values,\n    raising a `BrowserConfig` error if they are not found.\n\n    Returns a `dict` of environment variables.'), ('_optional_envs', 'Parse environment variables for optional values,\n    raising a `BrowserConfig` error if they are insufficiently specified.\n\n    Returns a `dict` of environment variables.'), ('_capabilities_dict', 'Convert the dictionary of environment variables to\n    a dictionary of desired capabilities to se

Map:   0%|          | 0/280 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/59.3k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/270 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/60.9k [00:00<?, ?B/s]

Processing batches:  56%|█████▌    | 28/50 [15:35<12:09, 33.17s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('Query._execute', 'Run the query, generating data from the `seed_fn` and performing transforms on the results.'), ('Query.execute', 'Execute this query, retrying based on the supplied parameters.\n\n        Keyword Args:\n            try_limit (int): The number of times to retry the query.\n            try_interval (float): The number of seconds to wait between each try (float).\n            timeout (float): The maximum number of seconds to spend retrying (float).\n\n        Returns:\n            The transformed results of the query.\n\n        Raises:\n            BrokenPromise: The query did not execute without a Selenium error after one or more attempts.'), ('Query.first', 'Return a Query that selects only the first element of this Query.\n        If no elements are available, returns a query with no results.\n\n        Example usage:\n\n        .. code:: python\n\n            >> q = Query(lambda: li

Map:   0%|          | 0/290 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/60.9k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/280 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/62.6k [00:00<?, ?B/s]

Processing batches:  58%|█████▊    | 29/50 [16:06<11:22, 32.51s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('prepare_headers', ':type bound_columns: list of BoundColumn'), ('order_by_on_list', 'Utility function to sort objects django-style even for non-query set collections\n\n    :param objects: list of objects to sort\n    :param order_field: field name, follows django conventions, so "foo__bar" means `foo.bar`, can be a callable.\n    :param is_desc: reverse the sorting\n    :return:'), ('default_cell_formatter', ':type column: tri.table.Column'), ('django_pre_2_0_table_context', ':type table: Table'), ('table_context', ':type table: Table'), ('render_table', 'Render a table. This automatically handles pagination, sorting, filtering and bulk operations.\n\n    :param request: the request object. This is set on the table object so that it is available for lambda expressions.\n    :param table: an instance of Table\n    :param links: a list of instances of Link\n    :param context: dict of extra context para

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/62.6k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/290 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/64.7k [00:00<?, ?B/s]

Processing batches:  60%|██████    | 30/50 [16:35<10:33, 31.68s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('InfobloxObjectManager.create_ip_range', 'Creates IPRange or fails if already exists.'), ('InfobloxObjectManager.network_exists', 'Deprecated, use get_network() instead.'), ('InfobloxObjectManager.delete_objects_associated_with_a_record', 'Deletes records associated with record:a or record:aaaa.'), ('Connector._parse_options', 'Copy needed options to self'), ('Connector._parse_reply', 'Tries to parse reply from NIOS.\n\n        Raises exception with content if reply is not in json format'), ('Connector.get_object', "Retrieve a list of Infoblox objects of type 'obj_type'\n\n        Some get requests like 'ipv4address' should be always\n        proxied to GM on Hellfire\n        If request is cloud and proxy is not forced yet,\n        then plan to do 2 request:\n        - the first one is not proxied to GM\n        - the second is proxied to GM\n\n        Args:\n            obj_type  (str): Infoblox obje

Map:   0%|          | 0/310 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/64.7k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/300 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/66.4k [00:00<?, ?B/s]

Processing batches:  62%|██████▏   | 31/50 [17:05<09:49, 31.02s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('EA.from_dict', 'Converts extensible attributes from the NIOS reply.'), ('EA.to_dict', 'Converts extensible attributes into the format suitable for NIOS.'), ('EA._process_value', "Applies processing method for value or each element in it.\n\n        :param func: method to be called with value\n        :param value: value to process\n        :return: if 'value' is list/tupe, returns iterable with func results,\n                 else func result is returned"), ('InfobloxObject.from_dict', 'Build dict fields as SubObjects if needed.\n\n        Checks if lambda for building object from dict exists.\n        _global_field_processing and _custom_field_processing rules\n        are checked.'), ('InfobloxObject.field_to_dict', 'Read field value and converts to dict if possible'), ('InfobloxObject.to_dict', 'Builds dict without None object fields'), ('InfobloxObject.fetch', 'Fetch object from NIOS by _ref or sea

Map:   0%|          | 0/320 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/66.4k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/310 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/67.8k [00:00<?, ?B/s]

Processing batches:  64%|██████▍   | 32/50 [17:35<09:13, 30.75s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('match', 'Matches the given input againts the available\n    file type matchers.\n\n    Args:\n        obj: path to file, bytes or bytearray.\n\n    Returns:\n        Type instance if type matches. Otherwise None.\n\n    Raises:\n        TypeError: if obj is not a supported type.'), ('signature', 'Returns the first 262 bytes of the given bytearray\n    as part of the file header signature.\n\n    Args:\n        array: bytearray to extract the header signature.\n\n    Returns:\n        First 262 bytes of the file content as bytearray type.'), ('get_bytes', 'Infers the input type and reads the first 262 bytes,\n    returning a sliced bytearray.\n\n    Args:\n        obj: path to readable, file, bytes or bytearray.\n\n    Returns:\n        First 262 bytes of the file content as bytearray type.\n\n    Raises:\n        TypeError: if obj is not a supported type.'), ('get_type', 'Returns the file type instance

Map:   0%|          | 0/330 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/67.8k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/320 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/69.3k [00:00<?, ?B/s]

Processing batches:  66%|██████▌   | 33/50 [18:07<08:50, 31.19s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('Tail._sincedb_update_position', 'Retrieves the starting position from the sincedb sql db for a given file\n        Returns a boolean representing whether or not it updated the record'), ('Tail._sincedb_start_position', 'Retrieves the starting position from the sincedb sql db\n        for a given file'), ('Tail._update_file', 'Open the file for tailing'), ('Tail.tail', 'Read last N lines from file fname.'), ('create_transport', 'Creates and returns a transport object'), ('TailManager.listdir', 'HACK around not having a beaver_config stanza\n        TODO: Convert this to a glob'), ('TailManager.update_files', 'Ensures all files are properly loaded.\n        Detects new files, file removals, file rotation, and truncation.\n        On non-linux platforms, it will also manually reload the file for tailing.\n        Note that this hack is necessary because EOF is cached on BSD systems.'), ('TailManager.close

Map:   0%|          | 0/340 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/69.3k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/330 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/70.5k [00:00<?, ?B/s]

Processing batches:  68%|██████▊   | 34/50 [18:41<08:31, 31.94s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('multiline_merge', "Merge multi-line events based.\n\n        Some event (like Python trackback or Java stracktrace) spawn\n        on multiple line. This method will merge them using two\n        regular expression: regex_after and regex_before.\n\n        If a line match re_after, it will be merged with next line.\n\n        If a line match re_before, it will be merged with previous line.\n\n        This function return a list of complet event. Note that because\n        we don't know if an event is complet before another new event\n        start, the last event will not be returned but stored in\n        current_event. You should pass the same current_event to\n        successive call to multiline_merge. current_event is a list\n        of lines whose belong to the same event."), ('create_ssh_tunnel', 'Returns a BeaverSshTunnel object if the current config requires us to'), ('BeaverSubprocess.poll', 

Map:   0%|          | 0/350 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/70.5k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/340 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/71.6k [00:00<?, ?B/s]

Processing batches:  70%|███████   | 35/50 [19:11<07:48, 31.25s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('RedisTransport.invalidate', 'Invalidates the current transport and disconnects all redis connections'), ('RedisTransport.callback', 'Sends log lines to redis servers'), ('RedisTransport._get_next_server', 'Returns a valid redis server or raises a TransportException'), ('RedisTransport._raise_server_index', 'Round robin magic: Raises the current redis server index and returns it'), ('RedisTransport.valid', 'Returns whether or not the transport can send data to any redis server'), ('KafkaTransport.callback', 'publishes lines one by one to the given topic'), ('BaseTransport.format', 'Returns a formatted log line'), ('BaseTransport.get_timestamp', 'Retrieves the timestamp for a given set of data'), ('_make_executable', 'Make the file at path executable.'), ('build_parser', 'Build argument parser.')]
Batch 35: 10 successful, 0 failed out of 10 items
Processed 360 items in 1176.05 seconds. Rate: 0.31 items/s

Map:   0%|          | 0/360 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/71.6k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/350 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/72.4k [00:00<?, ?B/s]

Processing batches:  72%|███████▏  | 36/50 [19:40<07:08, 30.61s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('subset_main', 'Separate method from main() in order to make testing easier and to\n    enable command-line access.'), ('_read_arg', 'If arg is a list with 1 element that corresponds to a valid file path, use\n    set_io.grp to read the grp file. Otherwise, check that arg is a list of strings.\n\n    Args:\n        arg (list or None)\n\n    Returns:\n        arg_out (list or None)'), ('fast_cov', 'calculate the covariance matrix for the columns of x (MxN), or optionally, the covariance matrix between the\n    columns of x and and the columns of y (MxP).  (In the language of statistics, the columns are variables, the rows\n    are observations).\n\n    Args:\n        x (numpy array-like) MxN in shape\n        y (numpy array-like) MxP in shape\n        destination (numpy array-like) optional location where to store the results as they are calculated (e.g. a numpy\n            memmap of a file)\n\n        

Map:   0%|          | 0/370 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/72.4k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/360 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/75.2k [00:00<?, ?B/s]

Processing batches:  74%|███████▍  | 37/50 [20:16<07:00, 32.36s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('get_ordered_idx', 'Gets index values corresponding to ids to subset and orders them.\n    Input:\n        - id_type (str): either "id", "idx" or None\n        - id_list (list): either a list of indexes or id names\n    Output:\n        - a sorted list of indexes to subset a dimension by'), ('parse_metadata_df', 'Reads in all metadata from .gctx file to pandas DataFrame\n    with proper GCToo specifications.\n    Input:\n        - dim (str): Dimension of metadata; either "row" or "column"\n        - meta_group (HDF5 group): Group from which to read metadata values\n        - convert_neg_666 (bool): whether to convert "-666" values to np.nan or not\n    Output:\n        - meta_df (pandas DataFrame): data frame corresponding to metadata fields\n            of dimension specified.'), ('replace_666', 'Replace -666, -666.0, and optionally "-666".\n    Args:\n        meta_df (pandas df):\n        convert_neg_

Map:   0%|          | 0/380 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/75.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/370 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/76.7k [00:00<?, ?B/s]

Processing batches:  76%|███████▌  | 38/50 [20:50<06:32, 32.74s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('GCToo.assemble_multi_index_df', 'Assembles three component dataframes into a multiindex dataframe.\n        Sets the result to self.multi_index_df.\n        IMPORTANT: Cross-section ("xs") is the best command for selecting\n        data. Be sure to use the flag "drop_level=False" with this command,\n        or else the dataframe that is returned will not have the same\n        metadata as the input.\n        N.B. "level" means metadata header.\n        N.B. "axis=1" indicates column annotations.\n        Examples:\n            1) Select the probe with pr_lua_id="LUA-3404":\n            lua3404_df = multi_index_df.xs("LUA-3404", level="pr_lua_id", drop_level=False)\n            2) Select all DMSO samples:\n            DMSO_df = multi_index_df.xs("DMSO", level="pert_iname", axis=1, drop_level=False)'), ('parse', 'The main method.\n\n    Args:\n        - file_path (string): full path to gct(x) file you wa

Map:   0%|          | 0/390 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/76.7k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/380 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/79.1k [00:00<?, ?B/s]

Processing batches:  78%|███████▊  | 39/50 [21:30<06:23, 34.89s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('parse', 'Identifies whether file_path corresponds to a .gct or .gctx file and calls the\n    correct corresponding parse method.\n\n    Input:\n        Mandatory:\n        - gct(x)_file_path (str): full path to gct(x) file you want to parse.\n\n        Optional:\n        - convert_neg_666 (bool): whether to convert -666 values to numpy.nan or not\n            (see Note below for more details on this). Default = False.\n        - rid (list of strings): list of row ids to specifically keep from gctx. Default=None.\n        - cid (list of strings): list of col ids to specifically keep from gctx. Default=None.\n        - ridx (list of integers): only read the rows corresponding to this\n            list of integer ids. Default=None.\n        - cidx (list of integers): only read the columns corresponding to this\n            list of integer ids. Default=None.\n        - row_meta_only (bool): Whether to load

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/79.1k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/390 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/81.8k [00:00<?, ?B/s]

Processing batches:  80%|████████  | 40/50 [22:13<06:14, 37.45s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('assemble_concatenated_meta', 'Assemble the concatenated metadata dfs together. For example,\n    if horizontally concatenating, the concatenated metadata dfs are the\n    column metadata dfs. Both indices are sorted.\n\n    Args:\n        concated_meta_dfs (list of pandas dfs)\n\n    Returns:\n        all_concated_meta_df_sorted (pandas df)'), ('assemble_data', "Assemble the data dfs together. Both indices are sorted.\n\n    Args:\n        data_dfs (list of pandas dfs)\n        concat_direction (string): 'horiz' or 'vert'\n\n    Returns:\n        all_data_df_sorted (pandas df)"), ('do_reset_ids', "Reset ids in concatenated metadata and data dfs to unique integers and\n    save the old ids in a metadata column.\n\n    Note that the dataframes are modified in-place.\n\n    Args:\n        concatenated_meta_df (pandas df)\n        data_df (pandas df)\n        concat_direction (string): 'horiz' or 'vert'\n\

Map:   0%|          | 0/410 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/81.8k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/400 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/83.8k [00:00<?, ?B/s]

Processing batches:  82%|████████▏ | 41/50 [22:46<05:24, 36.03s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('fast_spearman', 'calculate the spearman correlation matrix for the columns of x (with dimensions MxN), or optionally, the spearman correlaton\n    matrix between the columns of x and the columns of y (with dimensions OxP).  If destination is provided, put the results there.\n    In the language of statistics the columns are the variables and the rows are the observations.\n\n    Args:\n        x (numpy array-like) MxN in shape\n        y (optional, numpy array-like) OxP in shape.  M (# rows in x) must equal O (# rows in y)\n        destination (numpy array-like) optional location where to store the results as they are calculated (e.g. a numpy\n            memmap of a file)\n\n        returns:\n            (numpy array-like) array of the covariance values\n                for defaults (y=None), shape is NxN\n                if y is provied, shape is NxP'), ('make_specified_size_gctoo', 'Subsets a GCToo 

Map:   0%|          | 0/420 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/83.8k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/410 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/86.9k [00:00<?, ?B/s]

Processing batches:  84%|████████▍ | 42/50 [23:24<04:53, 36.66s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('LazyUserManager.generate_username', 'Generate a new username for a user'), ('convert', "Convert a temporary user to a real one. Reject users who don't\n    appear to be temporary users (ie. they have a usable password)"), ('is_lazy_user', 'Return True if the passed user is a lazy user.'), ('add', 'Adds a work item to a queue.\n\n    Args:\n        queue_name: Name of the queue to add the work item to.\n        payload: Optional. Payload that describes the work to do as a string.\n            If not a string and content_type is not provided, then this\n            function assumes the payload is a JSON-able Python object.\n        content_type: Optional. Content type of the payload.\n        source: Optional. Who or what originally created the task.\n        task_id: Optional. When supplied, only enqueue this task if a task\n            with this ID does not already exist. If a task with this ID already

Map:   0%|          | 0/430 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/86.9k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/420 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/89.4k [00:00<?, ?B/s]

Processing batches:  86%|████████▌ | 43/50 [24:00<04:14, 36.36s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('query', 'Queries for work items based on their criteria.\n\n    Args:\n        queue_name: Optional queue name to restrict to.\n        build_id: Optional build ID to restrict to.\n        release_id: Optional release ID to restrict to.\n        run_id: Optional run ID to restrict to.\n        count: How many tasks to fetch. Defaults to None, which means all\n            tasks are fetch that match the query.\n\n    Returns:\n        Dictionaries of the most recent tasks that match the criteria, in\n        order of most recently created. When count is 1 the return value will\n        be the most recent task or None. When count is not 1 the return value\n        will be a  list of tasks.'), ('cancel', 'Cancels work items based on their criteria.\n\n    Args:\n        **kwargs: Same parameters as the query() method.\n\n    Returns:\n        The number of tasks that were canceled.'), ('handle_add', 'Adds 

Map:   0%|          | 0/440 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/89.4k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/430 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/90.3k [00:00<?, ?B/s]

Processing batches:  88%|████████▊ | 44/50 [24:29<03:26, 34.36s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('jsonify_error', 'Returns a JSON payload that indicates the request had an error.'), ('ignore_exceptions', 'Decorator catches and ignores any exceptions raised by this function.'), ('timesince', 'Returns string representing "time since" or "time until".\n\n    Examples:\n        3 days ago, 5 hours ago, 3 minutes from now, 5 hours from now, now.'), ('human_uuid', 'Returns a good UUID for using as a human readable string.'), ('get_deployment_timestamp', 'Returns a unique string represeting the current deployment.\n\n    Used for busting caches.'), ('register', 'Registers this module as a worker with the given coordinator.'), ('real_main', 'Runs the ur_pair_diff.'), ('fetch_internal', 'Fetches the given request by using the local Flask context.'), ('fetch_normal', 'Fetches the given request over HTTP.'), ('register', 'Registers this module as a worker with the given coordinator.')]
Batch 44: 10 successful, 0 failed out of 10 items

Map:   0%|          | 0/450 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/90.3k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/440 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/91.3k [00:00<?, ?B/s]

Processing batches:  90%|█████████ | 45/50 [24:59<02:45, 33.02s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('FetchItem.json', "Returns de-JSONed data or None if it's a different content type."), ('CaptureAndDiffWorkflowItem.maybe_imgur', 'Uploads a file to imgur if requested via command line flags.\n\n        Returns either "path" or "path url" depending on the course of action.'), ('real_main', 'Runs diff_my_images.'), ('clean_url', 'Cleans the given URL.'), ('extract_urls', 'Extracts the URLs from an HTML document.'), ('prune_urls', 'Prunes URLs that should be ignored.'), ('real_main', 'Runs the site_diff.'), ('render_or_send', 'Renders an email message for debugging or actually sends it.'), ('send_ready_for_review', 'Sends an email indicating that the release is ready for review.'), ('homepage', 'Renders the homepage.')]
Batch 45: 10 successful, 0 failed out of 10 items
Processed 460 items in 1523.04 seconds. Rate: 0.30 items/sec


Map:   0%|          | 0/460 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/91.3k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/449 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/92.2k [00:00<?, ?B/s]

Processing batches:  92%|█████████▏| 46/50 [25:25<02:03, 30.87s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('new_build', 'Page for crediting or editing a build.'), ('view_build', 'Page for viewing all releases in a build.'), ('view_release', 'Page for viewing all tests runs in a release.'), ('_get_artifact_context', 'Gets the artifact details for the given run and file_type.'), ('view_run', 'Page for viewing before/after for a specific test run.'), ('register', 'Registers this module as a worker with the given coordinator.'), ('get_coordinator', 'Creates a coordinator and returns it.'), ('WorkItem._print_repr', 'Print this WorkItem to the given stack depth.\n\n        The depth parameter ensures that we can print WorkItems in\n        arbitrarily long chains without hitting the max stack depth.\n        This can happen with WaitForUrlWorkflowItems, which\n        create long chains of small waits.'), ('ResultList.error', 'Returns the error for this barrier and all work items, if any.'), ('Barrier.outstanding'

Map:   0%|          | 0/470 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/92.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/459 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/93.1k [00:00<?, ?B/s]

Processing batches:  94%|█████████▍| 47/50 [25:53<01:29, 29.97s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('Barrier.get_item', 'Returns the item to send back into the workflow generator.'), ('WorkflowThread.start', 'Starts the coordinator thread and all related worker threads.'), ('WorkflowThread.stop', 'Stops the coordinator thread and all related threads.'), ('WorkflowThread.join', 'Joins the coordinator thread and all worker threads.'), ('WorkflowThread.wait_one', 'Waits until this worker has finished one work item or died.'), ('superuser_required', 'Requires the requestor to be a super user.'), ('can_user_access_build', 'Determines if the current user can access the build ID in the request.\n\n    Args:\n        param_name: Parameter name to use for getting the build ID from the\n            request. Will fetch from GET or POST requests.\n\n    Returns:\n        The build the user has access to.'), ('build_access_required', "Decorator ensures user has access to the build ID in the request.\n\n    May be 

Map:   0%|          | 0/480 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/93.1k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/468 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/94.2k [00:00<?, ?B/s]

Processing batches:  96%|█████████▌| 48/50 [26:22<00:59, 29.56s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('can_api_key_access_build', 'Determines if the current API key can access the build in the request.\n\n    Args:\n        param_name: Parameter name to use for getting the build ID from the\n            request. Will fetch from GET or POST requests.\n\n    Returns:\n        (api_key, build) The API Key and the Build it has access to.'), ('build_api_access_required', 'Decorator ensures API key has access to the build ID in the request.\n\n    Always calls the given function with the models.Build entity as the\n    first positional argument.'), ('superuser_api_key_required', 'Decorator ensures only superuser API keys can request this function.'), ('manage_api_keys', 'Page for viewing and creating API keys.'), ('revoke_api_key', 'Form submission handler for revoking API keys.'), ('claim_invitations', "Claims any pending invitations for the given user's email address."), ('manage_admins', 'Page for viewing 

Map:   0%|          | 0/490 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/94.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/478 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/95.0k [00:00<?, ?B/s]

Processing batches:  98%|█████████▊| 49/50 [26:50<00:29, 29.15s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing
[('verify_binary', 'Exits the program if the binary from the given flag doesn\'t run.\n\n    Args:\n        flag_name: Name of the flag that should be the path to the binary.\n        process_args: Args to pass to the binary to do nothing but verify\n            that it\'s working correctly (something like "--version") is good.\n            Optional. Defaults to no args.\n\n    Raises:\n        SystemExit with error if the process did not work.'), ('create_release', 'Creates a new release candidate for a build.'), ('_check_release_done_processing', 'Moves a release candidate to reviewing if all runs are done.'), ('_get_release_params', 'Gets the release params from the current request.'), ('_find_last_good_run', 'Finds the last good release and run for a build.'), ('find_run', 'Finds the last good run of the given name for a release.'), ('_get_or_create_run', 'Gets a run for a build or creates it if it do

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/384 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/95.0k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/488 [00:00<?, ? examples/s]

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/96.0k [00:00<?, ?B/s]

Processing batches: 100%|██████████| 50/50 [27:25<00:00, 32.90s/it]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing





In [None]:
for i in range(10):
    print(train_data[i]['func_documentation_string'])

Check that the WebSocket connection is open.

        Raise :exc:`~websockets.exceptions.ConnectionClosed` if it isn't.
Read incoming messages and put them in a queue.

        This coroutine runs in a task until the closing handshake is started.
Read a single message from the connection.

        Re-assemble data frames if the message is fragmented.

        Return ``None`` when the closing handshake is started.
Read a single data frame from the connection.

        Process control frames received before the next data frame.

        Return ``None`` if a close frame is encountered before any data frame.
Read a single frame from the connection.
Write a close frame if and only if the connection state is OPEN.

        This dedicated coroutine must be used for writing close frames to
        ensure that at most one close frame is sent on a given connection.
Send a Ping frame and wait for a Pong frame at regular intervals.

        This coroutine exits when the connection terminates and o

# Localized Dataset Processor

In [30]:
import json
from pathlib import Path
from tqdm import tqdm
import time
from typing import List, Tuple, Dict
from datasets import Dataset, DatasetDict
from huggingface_hub.errors import RepositoryNotFoundError
from datasets import load_dataset

def dataset_exists(dataset_name, token=None):
    try:
        load_dataset(dataset_name, token=token, streaming=True)
        return True
    except (RepositoryNotFoundError, OSError, ValueError):
        return False


class DocstringDatasetProcessor:
    def __init__(
        self,
        hf_dataset_name: str,
        batch_size: int = 1000,
        token: str = "",
        local_cache_dir: str = "./cache",
        private_repo: bool = False,
    ):

        self.hf_dataset_name = hf_dataset_name
        self.batch_size = batch_size
        self.private_repo = private_repo
        self.local_cache_dir = Path(local_cache_dir)
        # parents=True so that a nested cache path such as "./cache/run1" also works
        self.local_cache_dir.mkdir(parents=True, exist_ok=True)

        self.processed_count = 0
        self.failed_count = 0

        self.token = token

        self.mounted_dataset_dict = DatasetDict()

    def process_batch(
        self, batch_data: List[Tuple[str, str]], pipeline: QGPipeline, batch_id: int, split: str = ""
    ):
        """Process a batch of (func_name, docstring) tuples. Results are saved locally; uploading is handled separately after all batches are processed."""
        batch_results = []
        batch_success_count = 0
        batch_failure_count = 0

        try:
            generator = pipeline(batch_data)
            for result in generator:
                if isinstance(result, dict):
                    if result["success"]:
                        #TODO: consider reducing memory overhead as batch docstring exists in two places
                        batch_results.append(
                            {
                                "function_name": result["function_name"],
                                "docstring": result["docstring"],
                                "question": result["model_output"]["question"],
                            }
                        )
                        batch_success_count += 1
                    else:
                        print(
                            f"Failed to process {result['function_name']}: {result['error']}"
                        )
                        batch_failure_count += 1

        except Exception as e:
            # entire batch failed
            print(f"Catastrophic batch failure {batch_id}: {e}")
            batch_failure_count = len(batch_data)
            batch_success_count = 0

        self.processed_count += batch_success_count
        self.failed_count += batch_failure_count

        print(
            f"Batch {batch_id}: {batch_success_count} successful, "
            f"{batch_failure_count} failed out of {len(batch_data)} items"
        )

        if batch_results:
            self._save_batch_locally(batch_results, batch_id, split=split)

    def _save_batch_locally(self, batch_results: List[Dict], batch_id: int, split: str = ""):
        if split:
            split_dir = self.local_cache_dir / split
            split_dir.mkdir(parents=True, exist_ok=True)
            batch_file = split_dir / f"batch_{batch_id}.jsonl"
        else:
            batch_file = self.local_cache_dir / f"batch_{batch_id}.jsonl"
        with open(batch_file, "w") as f:
            for item in batch_results:
                json.dump(item, f)
                f.write("\n")

    def process_full_dataset(self, dataset, pipeline: QGPipeline, start_idx: int = 0, split: str = ""):
        """Process the entire data set and store all processed batches locally."""

        if not self.token:
            print("Hugging Face token not provided. Terminating.")
            return

        print(f"Starting processing of {len(dataset)} items from index {start_idx}")

        start_time = time.time()

        for batch_start in tqdm(
            range(start_idx, len(dataset), self.batch_size), desc="Processing batches"
        ):
            batch_end = min(batch_start + self.batch_size, len(dataset))
            batch_data = dataset[batch_start:batch_end]
            batch_id = batch_start // self.batch_size

            self.process_batch(batch_data, pipeline, batch_id, split=split)

        total_time = time.time() - start_time
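        # NOTE: processed_count accumulates across calls on the same processor,
        # so on a second call the count and rate below include earlier runs.
        print(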
            f"Processed {self.processed_count} items in {total_time:.2f} seconds. Rate: {self.processed_count / total_time:.2f} items/sec"
        )

        print("All batches processed and stored locally.")


    def _can_upload(self):
        """Return True if the target repo already exists or is in the token owner's namespace."""
        from huggingface_hub import whoami

        namespace = self.hf_dataset_name.split("/")[0]
        try:
            # Pass the token so that private repositories are detected as well.
            if dataset_exists(self.hf_dataset_name, token=self.token):
                return True
            user_info = whoami(token=self.token)
            return namespace == user_info["name"]
        except Exception as e:
            print(e)
            return False

    def mount_for_upload(self, directory_path: str, *, split: str):
        """Prepare locally processed batch files for upload under the given split."""
        import glob
        import os

        local_data = []
        batch_files = sorted(glob.glob(os.path.join(directory_path, "batch_*.jsonl")))
        for batch_file in batch_files:
            with open(batch_file, "r") as f:
                for line in f:
                    item = json.loads(line)
                    local_data.append(item)

        local_dataset = Dataset.from_list(local_data)
        self.mounted_dataset_dict[split] = local_dataset

    def upload_local_to_hf(self, hf_dataset_name, private_repo=False):

        dataset_dict = self.mounted_dataset_dict
        token = self.token
        print("Uploading dataset to Hugging Face...")

        dataset_dict.push_to_hub(
            hf_dataset_name,
            token=token,
            private=private_repo,
            commit_message=f"Added rows per split: {dataset_dict.num_rows}",
        )

        print(
            f"Successfully uploaded dataset to https://huggingface.co/datasets/{hf_dataset_name}"
        )

    def load_from_hf(self):
        """Load the dataset from Hugging Face"""
        try:
            dataset = load_dataset(self.hf_dataset_name, token=self.token)
            print(f"Successfully loaded dataset from {self.hf_dataset_name}")
            return dataset
        except Exception as e:
            print(f"Error loading dataset from Hugging Face: {e}")
            raise

    # TODO: resume processing from local cache file
    # TODO: upload from colab cache to permanent file location (local or drive)
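
The first TODO above (resuming from the local cache) could be approached roughly as follows. This is only a sketch: `next_start_index` is a hypothetical helper that is not part of the class, and it assumes the `batch_<id>.jsonl` naming used by `_save_batch_locally` together with `batch_id = batch_start // batch_size`. Because a batch whose items all fail writes no file, the highest existing batch id is used rather than the number of files.

```python
import re
import tempfile
from pathlib import Path


def next_start_index(cache_dir: Path, batch_size: int) -> int:
    """Return the dataset index at which processing could resume.

    Hypothetical helper: assumes files are named batch_<id>.jsonl and that
    batch_id == batch_start // batch_size with start_idx a multiple of batch_size.
    """
    pattern = re.compile(r"batch_(\d+)\.jsonl$")
    batch_ids = [
        int(match.group(1))
        for path in cache_dir.glob("batch_*.jsonl")
        if (match := pattern.search(path.name))
    ]
    if not batch_ids:
        return 0  # nothing cached yet, start from the beginning
    return (max(batch_ids) + 1) * batch_size


# Small demonstration in a throwaway directory with three cached batches.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    for batch_id in (0, 1, 2):
        (demo_dir / f"batch_{batch_id}.jsonl").write_text("{}\n")
    resume_at = next_start_index(demo_dir, batch_size=10)

print(resume_at)
```

The returned value could then be passed as `start_idx` to `process_full_dataset`, provided the same `batch_size` is used as in the interrupted run.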


In [31]:
dataset_processor = DocstringDatasetProcessor(hf_dataset_name, batch_size=10, token=HUGGING_FACE_TOKEN)

batch_raw_data = zip(train_data[:5]['func_name'], train_data[:5]['func_documentation_string'])
batch_zip_data = list(batch_raw_data)

pipeline = finetuned_t5
dataset_processor.process_full_dataset(batch_zip_data, pipeline, split="train")

Starting processing of 5 items from index 0


Processing batches: 100%|██████████| 1/1 [00:17<00:00, 17.68s/it]

Batch 0: 5 successful, 0 failed out of 5 items
Processed 5 items in 17.68 seconds. Rate: 0.28 items/sec
All batches processed and stored locally.





In [32]:
test_batch_raw_data = zip(test_data[:5]['func_name'], test_data[:5]['func_documentation_string'])
test_batch_zip_data = list(test_batch_raw_data)

dataset_processor.process_full_dataset(test_batch_zip_data, pipeline, split="test")

Starting processing of 5 items from index 0


Processing batches: 100%|██████████| 1/1 [00:13<00:00, 13.94s/it]

Batch 0: 5 successful, 0 failed out of 5 items
Processed 10 items in 13.95 seconds. Rate: 0.72 items/sec
All batches processed and stored locally.
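
The second run reports "Processed 10 items" because `processed_count` accumulates across calls to `process_full_dataset` on the same processor, so the printed 0.72 items/sec divides ten items by the duration of the second run only. A minimal sketch of a per-run rate, using a hypothetical `items_per_second` helper and the figures printed above:

```python
def items_per_second(count_before: int, count_after: int, elapsed_seconds: float) -> float:
    """Throughput of one run, given the cumulative counter before and after it."""
    if elapsed_seconds <= 0:
        return 0.0  # avoid dividing by zero on very fast runs
    return (count_after - count_before) / elapsed_seconds


# Figures from the two runs above: 5 items each, cumulative counter 5 and then 10.
train_rate = items_per_second(0, 5, 17.68)
test_rate = items_per_second(5, 10, 13.95)
print(f"train: {train_rate:.2f} items/sec, test: {test_rate:.2f} items/sec")
```

Recording the counter at the start of `process_full_dataset` and subtracting it at the end would give the same per-run figure inside the class.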





In [33]:
dataset_processor.mount_for_upload("cache/test", split="test")
dataset_processor.mount_for_upload("cache/train", split="train")
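
`mount_for_upload` reads the batch files in the order given by `sorted(glob.glob(...))`, which is lexicographic, so `batch_10.jsonl` is read before `batch_2.jsonl` once there are more than ten batches. If the original row order matters, a numeric sort key is one option. The sketch below uses a hypothetical `batch_sort_key` helper and made-up file names:

```python
import re


def batch_sort_key(filename: str) -> int:
    """Extract the numeric batch id from a name like 'batch_12.jsonl'."""
    match = re.search(r"batch_(\d+)\.jsonl$", filename)
    if match is None:
        raise ValueError(f"Unexpected batch file name: {filename}")
    return int(match.group(1))


names = ["batch_10.jsonl", "batch_2.jsonl", "batch_1.jsonl"]
lexicographic = sorted(names)             # string order: 1, 10, 2
numeric = sorted(names, key=batch_sort_key)  # batch-id order: 1, 2, 10
print(lexicographic)
print(numeric)
```

The key also works on full paths, since the pattern is anchored at the end of the string.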

In [34]:
dataset_processor.upload_local_to_hf("mrinjera/testing2")

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/2.29k [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/2.63k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/374 [00:00<?, ?B/s]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing2


In [25]:
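# NOTE: this call passes four positional arguments and does not match the
# upload_local_to_hf(hf_dataset_name, private_repo=False) signature defined above;
# it appears to be left over from an earlier version of the method.
dataset_processor.upload_local_to_hf("cache/train", "mrinjera/testing2", HUGGING_FACE_TOKEN, "test")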

Uploading dataset to Hugging Face...


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading...:   0%|          | 0.00/3.09k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/225 [00:00<?, ?B/s]

Successfully uploaded dataset to https://huggingface.co/datasets/mrinjera/testing2


# Model Training

In [26]:
import torch
import pandas as pd
from datasets import Dataset, load_dataset
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split
from typing import List, Dict, Tuple
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers import (
    models,
    SentenceTransformer,
    SentenceTransformerModelCardData,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    InputExample,
    losses,
)
from transformers import EarlyStoppingCallback
from peft import LoraConfig, TaskType

DATASET_NAME = "mrinjera/testing"
MODEL_NAME = "microsoft/codebert-base"
OUTPUT_MODEL_PATH = "./sbert-function-retrieval"
BATCH_SIZE = 16
EPOCHS = 4
LEARNING_RATE = 2e-5
WARMUP_STEPS = 1000
EVALUATION_STEPS = 5000


def load_and_process_dataset(
    dataset_name: str, token: str
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Load the dataset and process it for training."""
    print(f"Loading dataset: {dataset_name}")

    dataset = load_dataset(dataset_name, token=token)

    train_df = pd.DataFrame(dataset["train"])
    test_df = pd.DataFrame(dataset["test"])

    print(f"Train dataset loaded with {len(train_df)} examples")
    print(f"Test dataset loaded with {len(test_df)} examples")
    print(f"Dataset columns: {train_df.columns.tolist()}")

    train_df = train_df.dropna(subset=["docstring", "question"])
    test_df = test_df.dropna(subset=["docstring", "question"])

    print(f"  Train: {len(train_df)} examples")
    print(f"  Test: {len(test_df)} examples")

    return train_df, test_df


# NOTE: This is not being used in the current implementation
def create_training_examples(df: pd.DataFrame) -> List[InputExample]:
    """Create InputExample objects for SBERT training."""
    examples = []

    for idx, row in df.iterrows():
        # input example with question as query and func_name + docstring as positive document
        combined_text = f"{row['function_name']} {row['docstring']}"
        example = InputExample(texts=[str(row["question"]), combined_text], label=1.0)
        examples.append(example)

    print(f"Created {len(examples)} training examples")
    return examples


def create_evaluation_data(
    eval_df: pd.DataFrame,
) -> Tuple[Dict[str, str], Dict[str, str], Dict[str, set]]:
    """Create evaluation data for Information Retrieval evaluation."""
    # Split data for evaluation
    # eval_df = df.sample(n=min(1000, len(df) // 10), random_state=42)

    queries = {}
    corpus = {}
    relevant_docs = {}

    for idx, row in eval_df.iterrows():
        query_id = f"q_{idx}"
        doc_id = f"d_{idx}"

        queries[query_id] = row["question"]
        corpus[doc_id] = f"{row['function_name']} {row['docstring']}"
        # The evaluator expects a set of relevant document ids per query
        relevant_docs[query_id] = {doc_id}

    print(
        f"Created evaluation data with {len(queries)} queries and {len(corpus)} documents"
    )
    return queries, corpus, relevant_docs


def create_validation_split(
    train_df: pd.DataFrame, validation_size: float = 0.1
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Create a validation split from training data for monitoring during training."""
    train_split, val_split = train_test_split(
        train_df, test_size=validation_size, random_state=42, stratify=None
    )

    print("Created validation split:")
    print(f"  Training: {len(train_split)} examples")
    print(f"  Validation: {len(val_split)} examples")

    return train_split, val_split


def evaluate_on_test_set(
    model: SentenceTransformer,
    test_queries: Dict[str, str],
    test_corpus: Dict[str, str],
    test_relevant_docs: Dict[str, set],
) -> Dict[str, float]:
    """Evaluate the trained model on the test set."""
    print("Evaluating model on test set...")

    # Create test evaluator
    test_evaluator = InformationRetrievalEvaluator(
        queries=test_queries,
        corpus=test_corpus,
        relevant_docs=test_relevant_docs,
        name="test-eval",
    )

    # Evaluate
    test_score = test_evaluator(model, output_path=OUTPUT_MODEL_PATH + "/test_results")

    print(f"Test evaluation completed. Score: {test_score}")
    return test_score


def initialize_model(model_name: str) -> SentenceTransformer:
    # Use the function argument rather than the MODEL_NAME global
    transformer = models.Transformer(model_name)
    pooling = models.Pooling(
        transformer.get_word_embedding_dimension(), pooling_mode="max"
    )
    model = SentenceTransformer(
        modules=[transformer, pooling],
        model_card_data=SentenceTransformerModelCardData(
            language="en",
            license="apache-2.0",
            model_name=f"{model_name} adapter finetuned on CodeSearchNet question-docstring pairs",
        ),
    )
    return model


def train_model(
    model: SentenceTransformer,
    train_dataset: Dataset,
    val_queries: Dict[str, str],
    val_corpus: Dict[str, str],
    val_relevant_docs: Dict[str, set],
    device: torch.device,
) -> SentenceTransformer:
    """Train the SBERT model with Multiple Negatives Ranking Loss."""

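    if isinstance(train_dataset, pd.DataFrame):
        # MultipleNegativesRankingLoss reads the dataset columns in order as
        # (anchor, positive), so keep only the question and the
        # "function_name docstring" text that the evaluation corpus also uses.
        pairs_df = pd.DataFrame(
            {
                "anchor": train_dataset["question"].astype(str),
                "positive": train_dataset["function_name"].astype(str)
                + " "
                + train_dataset["docstring"].astype(str),
            }
        )
        # preserve_index=False keeps the pandas index from becoming an extra column
        train_dataset = Dataset.from_pandas(pairs_df, preserve_index=False)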


    peft_config = LoraConfig(
        task_type=TaskType.FEATURE_EXTRACTION,
        inference_mode=False,
        r=64,
        lora_alpha=128,
        lora_dropout=0.1,
    )

    model.add_adapter(peft_config, adapter_name="This will never exist")

    loss = losses.MultipleNegativesRankingLoss(model)

    # TODO: consider including epochs, warmup steps, optimizer
    args = SentenceTransformerTrainingArguments(
        output_dir=OUTPUT_MODEL_PATH + "/training_checkpoints/",
        # Optional training parameters:
        num_train_epochs=1,
        per_device_train_batch_size=1024,
        per_device_eval_batch_size=1024,
        learning_rate=2e-5,
        warmup_ratio=0.1,
        fp16=False,  # Set to True to train in FP16 mixed precision if your GPU supports it
        bf16=False,  # Set to True if you have a GPU that supports BF16
        batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
        # Optional tracking/debugging parameters:
        eval_strategy="steps",
        eval_steps=100,
        save_strategy="steps",
        save_steps=100,
        save_total_limit=2,
        logging_steps=25,
        logging_first_step=True,
        load_best_model_at_end=True,
        # Evaluator metrics are logged as "eval_<evaluator name>_<metric>",
        # so this key must match the evaluator name used below.
        metric_for_best_model="eval_val_cosine_ndcg@10",
        # run_name=run_name,  # Will be used in W&B if `wandb` is installed
    )

    evaluator = InformationRetrievalEvaluator(
        queries=val_queries,
        corpus=val_corpus,
        relevant_docs=val_relevant_docs,
        name="val",
    )

    early_stopping = EarlyStoppingCallback(
        early_stopping_patience=3, early_stopping_threshold=0.01
    )

    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        loss=loss,
        evaluator=evaluator,
        callbacks=[early_stopping],
    )

    trainer.train()

    model.save_pretrained(OUTPUT_MODEL_PATH)

    return model
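
`InformationRetrievalEvaluator` takes `relevant_docs` as a mapping from each query id to a set of relevant document ids. The difference from a bare string matters because `in` on a string is a substring test. The sketch below builds the three inputs from placeholder rows (the rows and the `build_ir_inputs` helper are hypothetical, not taken from the dataset) and shows the difference:

```python
from typing import Dict, List, Set, Tuple


def build_ir_inputs(
    rows: List[Dict[str, str]],
) -> Tuple[Dict[str, str], Dict[str, str], Dict[str, Set[str]]]:
    """Build queries, corpus and relevant_docs with one relevant document per query."""
    queries: Dict[str, str] = {}
    corpus: Dict[str, str] = {}
    relevant_docs: Dict[str, Set[str]] = {}
    for idx, row in enumerate(rows):
        query_id, doc_id = f"q_{idx}", f"d_{idx}"
        queries[query_id] = row["question"]
        corpus[doc_id] = f"{row['function_name']} {row['docstring']}"
        relevant_docs[query_id] = {doc_id}  # a set, not a bare string
    return queries, corpus, relevant_docs


# Placeholder rows for illustration only.
rows = [
    {"function_name": f"f{i}", "docstring": f"Doc {i}.", "question": f"What does f{i} do?"}
    for i in range(11)
]
queries, corpus, relevant_docs = build_ir_inputs(rows)

# Membership on a set is exact; on a bare string it is a substring test.
exact_match = "d_1" in relevant_docs["q_10"]
substring_match = "d_1" in "d_10"
print(exact_match, substring_match)
```

With string values, document `d_1` would count as relevant for query `q_10`, which would inflate the retrieval metrics on larger evaluation sets.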

In [15]:
from google.colab import userdata

HUGGING_FACE_TOKEN = userdata.get("HUGGING_FACE_TOKEN")
hf_dataset_name = "mrinjera/testing"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_df, test_df = load_and_process_dataset("mrinjera/testing2", HUGGING_FACE_TOKEN)

train_split_df, val_split_df = create_validation_split(train_df)

val_queries, val_corpus, val_relevant_docs = create_evaluation_data(val_split_df)

test_queries, test_corpus, test_relevant_docs = create_evaluation_data(test_df)

Loading dataset: mrinjera/testing2
Train dataset loaded with 5 examples
Test dataset loaded with 5 examples
Dataset columns: ['function_name', 'docstring', 'question']
  Train: 5 examples
  Test: 5 examples
Created validation split:
  Training: 4 examples
  Validation: 1 examples
Created evaluation data with 1 queries and 1 documents
Created evaluation data with 5 queries and 5 documents


In [28]:
model = initialize_model(MODEL_NAME)

In [19]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': True, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

In [29]:
train_model(model, train_split_df, val_queries, val_corpus, val_relevant_docs, device)

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:


Abort: 
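
The run above stopped at the Weights & Biases login prompt. If experiment tracking is not needed, one way to avoid the prompt is to switch the integration off before the trainer is created. The sketch below rests on two assumptions: that the Hugging Face Trainer integration honors the `WANDB_DISABLED` environment variable, and that `report_to="none"` is passed through to `SentenceTransformerTrainingArguments`. The name `extra_training_kwargs` is hypothetical.

```python
import os

# Assumption: the Hugging Face Trainer integration treats WANDB_DISABLED=true
# as "do not log to Weights & Biases". Set it before the trainer is created.
os.environ["WANDB_DISABLED"] = "true"

# Hypothetical extra keyword arguments that could be merged into the
# SentenceTransformerTrainingArguments call in train_model;
# report_to="none" turns off all logging integrations.
extra_training_kwargs = {"report_to": "none"}

print(os.environ["WANDB_DISABLED"], extra_training_kwargs)
```

Either setting alone should be enough. `report_to` has the advantage of being visible next to the other training arguments.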