# Goals:
- More corpus and questions. 
- Ideally we get different style corpuses.
1. Research:
Start with WikiText then use this to look for other text datasets.
2. Compose a list of all different text datasets.
3. Get a sample from each type.
4. Generate questions for each dataset. 


Tokens extracted from good wiki articles
https://huggingface.co/datasets/wikitext

JSON Conversations between a user and a chat bot. 
https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k?row=0

Finance:
https://huggingface.co/datasets/AdaptLLM/finance-tasks?row=12

Finance SEC fillings:
https://huggingface.co/datasets/JanosAudran/financial-reports-sec

PubMed Open Access:
https://huggingface.co/datasets/pmc/open_access

Legal:
https://huggingface.co/datasets/pile-of-law/pile-of-law



Next goal?

We have a lot of weird data. Now what. 

For each dataset, attempt to get large texts out of it. These texts will then be our 'main docs'. Our corpus dataset will be all these docs for which we run chunking over each individual, then union of chunks. Questions generated just be looking at single doc. 

Current goal:
Get 'docs' from datasets.


In [1]:
import requests

def fetch_huggingface_data(params, N=1):
    # URL for the API endpoint
    url = "https://datasets-server.huggingface.co/rows"

    data = None

    for i in range(N):
        # Make the GET request
        params["offset"] = i*100
        response = requests.get(url, params=params)

        # Check if the request was successful
        if response.status_code == 200:
            # Get the JSON response
            data_chunk = response.json()
            if i==0:
                data = data_chunk
            else:
                data['rows'] += data_chunk["rows"]
        else:
            print("Failed to fetch data. Status code:", response.status_code)
    
    return data

import tiktoken

# Count the number of tokens in each page_content
def num_tokens_from_string(string: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding("cl100k_base")
    num_tokens = len(encoding.encode(string, disallowed_special=()))
    return num_tokens


# WikiTexts

In [11]:

total_doc = ''

for i in [0, 100, 200, 300, 400]:
    data = fetch_huggingface_data({
        'dataset': 'wikitext',
        'config': 'wikitext-103-v1',
        'split': 'train',
        'offset': i,
        'length': 100
    })

    for row in data['rows']:
        total_doc += row['row']['text']

print(len(total_doc))
print(total_doc[:100])

with open('../data/wikitexts.md', 'w') as f:
    f.write(total_doc)

118372
 = Valkyria Chronicles III = 
 Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit


# Conversations with Chatbot

In [13]:
data = fetch_huggingface_data({
    'dataset': 'HuggingFaceH4/ultrachat_200k',
    'config': 'default',
    'split': 'train_sft',
    'offset': 0,
    'length': 100
})

In [14]:
data['rows'][1]['row']['messages']

[{'content': 'Which famous landmarks should I visit in London, beyond the usual ones?',
  'role': 'user'},
 {'content': "1. Leadenhall Market - a beautiful indoor market with stunning Victorian architecture, also used as a filming location in the Harry Potter films.\n\n2. St. Dunstan in the East - a ruined church in the middle of the city that has been turned into a beautiful public garden.\n\n3. The Monument - a 202-foot-tall column commemorating the Great Fire of London, with a staircase leading to a viewing platform offering great views of the city.\n\n4. The Camden Town Markets - an eclectic collection of markets offering food, fashion, and vintage items, plus live music and street performers.\n\n5. Novelist's House - the former home of Charles Dickens, now a museum dedicated to his life and works.\n\n6. The Old Operating Theatre - a museum housed in the oldest surviving operating theatre in Europe, with exhibits on the history of surgery and medical practices.\n\n7. The Churchill 

In [15]:
total_doc = '\n\n'.join([str(row['row']['messages']) for row in data['rows']])
print(len(total_doc))

total_doc = total_doc[:40000]

606599


In [6]:
with open('../data/chatlogs.md', 'w') as f:
    f.write(total_doc)

In [None]:
# Assuming df is the existing DataFrame
new_question = {
    'question': 'Question 3',
    'references': [
        {'content': 'Description 4', 'start_index': 40, 'end_index': 400}
    ],
    'corpus_id': 'state_of_the_union'
}

# Create a DataFrame for the new question
new_df = pd.DataFrame([new_question, new_question])

# Append to the original DataFrame
df = pd.concat([df, new_df], ignore_index=True)

In [8]:
import pickle

# Open the pickle file and load the data
with open('./new_questions_references.pkl', 'rb') as f:
    questions_references = pickle.load(f)

# Print the loaded data
print(questions_references)


[("What significant regulatory changes and proposals has President Biden's administration implemented or announced regarding fees and pricing transparency?", [('My administration announced we’re cutting credit card late fees from $32 to $8.', 27340, 27419), ('My administration has proposed rules to make cable, travel, utilities, and online ticket sellers tell you the total price up front so there are no surprises.', 27858, 28015)]), ("What reasons did President Biden give for the failure of a particular bill's passage?", [('But unfortunately, politics have derailed this bill so far.', 29521, 29580), ('I’m told my predecessor called members of Congress in the Senate to demand they block the bill. He feels political win — he viewed it as a — it would be a political win for me and a political loser for him.', 29582, 29788)]), ('What were the cases discussed by President Biden related to reproductive rights in his State of the Union Address?', [('Fourteen months ago — fourteen months ago, 

In [24]:
questions_references[-1]

('What specific legislative actions did President Biden mention in his State of the Union Address aimed at protecting children and veterans?',
 [('Pass bipartisan privacy legislation to protect our children online.',
   43570,
   43637),
  ('That’s why, with the strong support and help of Denis and the VA, I signed the PACT Act — one of the most significant laws ever, helping millions of veterans exposed to toxins who now are battling more than 100 different cancers.',
   43903,
   44132)])

In [13]:
def tuple_to_df_row(question_tuple, corpus_id):
    question, references = question_tuple
    references = [{'content': ref[0], 'start_index': ref[1], 'end_index': ref[2]} for ref in references]
    return {
        'question': question,
        'references': references,
        'corpus_id': corpus_id
    }

In [14]:
import pandas as pd

rows = [tuple_to_df_row(q, 'state_of_the_union') for q in questions_references]

# Create a DataFrame
df = pd.DataFrame(rows)

In [16]:
df.to_csv('../data/questions_df.csv', index=False)


In [21]:
# Read the DataFrame
df = pd.read_csv('../data/questions_df.csv')

# Print the DataFrame
df.head()

# Define the corpus_id variable





Unnamed: 0,question,references,corpus_id
0,What significant regulatory changes and propos...,[{'content': 'My administration announced we’r...,state_of_the_union
1,What reasons did President Biden give for the ...,"[{'content': 'But unfortunately, politics have...",state_of_the_union
2,What were the cases discussed by President Bid...,[{'content': 'Fourteen months ago — fourteen m...,state_of_the_union
3,How many people are no longer denied health in...,[{'content': 'Over 100 million of you can no l...,state_of_the_union
4,"Which country is Putin invading, causing chaos...","[{'content': 'Overseas, Putin of Russia is on ...",state_of_the_union


In [23]:
corpus_id = 'state_of_the_union'

# Get all questions where corpus_id equals corpus_id_var as a string
questions = df[df['corpus_id'] == corpus_id]['question'].tolist()
print(questions)

["What significant regulatory changes and proposals has President Biden's administration implemented or announced regarding fees and pricing transparency?", "What reasons did President Biden give for the failure of a particular bill's passage?", 'What were the cases discussed by President Biden related to reproductive rights in his State of the Union Address?', 'How many people are no longer denied health insurance due to preexisting conditions according to President Biden?', 'Which country is Putin invading, causing chaos in Europe and beyond?', 'When did the murder rate experience the sharpest decrease in history and how did violent crime rates change?', "What is the name of the new health research agency mentioned in President Biden's State of the Union Address that aims to do big things like ending cancer?", 'According to President Biden, what is the current status of China compared to the past decade?', "What type of investment is being made by private companies in America accordi

In [42]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Suppose this is your messy text with inconsistent punctuation
with open('../data/wikitexts.md', 'r') as file:
    text = file.read()

# Sentence you want to search for
query = 'Valkyria Chronicles III was developed for PlayStation Portable: this was due to the team wanting to refine the mechanics created for Valkyria Chronicles II.'

# Split the text into sentences
sentences = text.split('. ')

# Find the sentence that matches the query best
best_match = process.extractOne(query, sentences, scorer=fuzz.token_sort_ratio)

print("Best match found:", best_match[0])
print("Similarity score:", best_match[1])


Best match found: The director of Valkyria Chronicles II , Takeshi Ozawa , returned to that role for Valkyria Chronicles III 
Similarity score: 59


In [34]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Your messy text
text = "He walked with friends , it was quite fun. They enjoyed a lot. What a day!"

# An incomplete or partial query sentence
query = "walked with friends quite"

# Split the text into sentences
sentences = text.split('. ')

# Use a more suitable scorer for incomplete sentences
best_match = process.extractOne(query, sentences, scorer=fuzz.partial_ratio)

# If a match is found, locate it in the original text
if best_match:
    match_text = best_match[0]
    start_index = text.find(match_text)
    end_index = start_index + len(match_text) - 1  # Subtract 1 because index is zero-based

    print("Best match found:", match_text)
    print("Start index:", start_index)
    print("End index:", end_index)
    print("Similarity score:", best_match[1])
else:
    print("No match found")


Best match found: He walked with friends , it was quite fun
Start index: 0
End index: 40
Similarity score: 88


In [37]:
import re



# Example usage:
document = "This is   a sample  document  that   includes   the  queried    text with irregular spacing."
query = "sample  document that includes"
result = find_query_in_text(document, query)

if result:
    print(f"Query found from index {result[0]} to {result[1]}")
    print("Matched text:", document[result[0]:result[1]])
else:
    print("Query not found.")


Query found from index 12 to 45
Matched text: sample  document  that   includes


In [41]:
import re

def find_query_despite_whitespace(document, query):
    # Normalize spaces and newlines in the query
    normalized_query = re.sub(r'\s+', ' ', query).strip()
    
    # Create a regex pattern from the normalized query to match any whitespace characters between words
    pattern = r'\s*'.join(re.escape(word) for word in normalized_query.split())
    
    # Compile the regex to ignore case and search for it in the document
    regex = re.compile(pattern, re.IGNORECASE)
    match = regex.search(document)
    
    if match:
        return document[match.start(): match.end()], match.start(), match.end()
    else:
        return None

# Example usage:
document = "This is a sample\n document  that includes the queried text with irregular\nspacing and line breaks."
query = "sample document that  \nincludes"
result = find_query_despite_whitespace(document, query)

print(result)


('sample\n document  that includes', 10, 41)


# ConvFinQA

In [14]:
data = fetch_huggingface_data({
    'dataset': 'AdaptLLM/finance-tasks',
    'config': 'ConvFinQA',
    'split': 'test',
    'offset': 0,
    'length': 100
}, N=2)

In [None]:
# data['rows']
total_tokens = 0
for i, x in enumerate(data['rows']):
    print(num_tokens_from_string(x['row']['input']))
    total_tokens += num_tokens_from_string(x['row']['input'])
print(total_tokens)

In [17]:
total_doc = '\n\n'.join([row['row']['input'] for row in data['rows']])

In [19]:
with open('../data/finance.md', 'w') as f:
    f.write(total_doc)

# PubMed Open Access

In [2]:
data = fetch_huggingface_data({
    'dataset': 'pmc/open_access',
    'config': '2023-06-17.commercial',
    'split': 'train',
    'offset': 0,
    'length': 100
}, N=1)

In [8]:
data['rows'][0]['row']

{'text': "==== Front\nPLoS BiolPLoS BiolpbioplosbiolPLoS Biology1544-91731545-7885Public Library of Science San Francisco, USA 10.1371/journal.pbio.0000005Research ArticleGenetics/Genomics/Gene TherapyInfectious DiseasesMicrobiologyPlasmodiumThe Transcriptome of the Intraerythrocytic Developmental Cycle of Plasmodium falciparum\n P. falciparum IDC TranscriptomeBozdech Zbynek \n1\nLlinás Manuel \n1\nPulliam Brian Lee \n1\nWong Edith D \n1\nZhu Jingchun \n2\nDeRisi Joseph L joe@derisilab.ucsf.edu\n1\n1Department of Biochemistry and Biophysics, University of California, San FranciscoSan Francisco, CaliforniaUnited States of America2Department of Biological and Medical Informatics, University of California, San FranciscoSan Francisco, CaliforniaUnited States of America10 2003 18 8 2003 18 8 2003 1 1 e512 6 2003 25 7 2003 Copyright: ©2003 Bozdech et al.2003This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, 

In [11]:
total_doc = ''
for row in data['rows']:
    total_doc += "PMID: " + row['row']['pmid'] + "\n"
    total_doc += "Accession ID: " + row['row']['accession_id'] + "\n"
    total_doc += "License: " + row['row']['license'] + "\n"
    total_doc += "Last Updated: " + row['row']['last_updated'] + "\n"
    total_doc += "Retracted: " + row['row']['retracted'] + "\n"
    total_doc += "Citation: " + row['row']['citation'] + "\n"
    total_doc += row['row']['text']
    total_doc += '\n\n'

total_doc = total_doc[:-2]

print(num_tokens_from_string(total_doc))

611475


In [17]:
total_doc = total_doc[:500000]

In [18]:
with open('../data/pubmed.md', 'w') as f:
    f.write(total_doc)

# Legal