# Fineweb + LMSYS
**Motivation**: For training, we need sources of diverse, high-quality data (especially the dog-related kind). Said data comes from a few places: 
<ol> 
    <li><span style="color:blue">Fineweb:</span> HF's <a href="https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1">fineweb-edu</a> offers high-quality text data by applying filters to Common Crawl.</li>
    <li><span style="color:blue">LMSYS:</span> <a href="https://huggingface.co/datasets/lmsys/chatbot_arena_conversations">Chatbot Arena conversations</a> are sourced for additional instruct-style samples.</li> 
    <li><span style="color:blue">Synthetic/LLM (out of notebook scope):</span> GPT is used to generate instruct-style samples in the ChatML format.</li>
</ol>

Ultimately, the goal is to retrieve ~1B tokens for training. 

In this notebook, the focus is on filtering + pre-processing samples from Fineweb-edu. Ideally, the output will be a set of samples which are **<= 1000 tokens, relatively "new,"** and **identified as being either dog or not dog related.**

In [261]:
# Load libraries
from tqdm import tqdm
import sqlite3
import torch 
import torch.nn as nn 
import torch.nn.functional as F
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from datasets import load_dataset
from datasets import load_dataset_builder
import pandas as pd
import numpy as np
import re
import json 
import os
import sys

# add parent directory to sys.path 
sys.path.append(os.path.join(os.getcwd(), '..'))
from py_helpers import bert

# Set device (mps bcs working locally on mac)
device = torch.device('mps')

In [203]:
# Utility functions 
# same as parse_phi - is used to reconstitute conversation as single chatml formatted string (need this to clean lmsys interactions)
def parse_chat(messages: list[dict], append_response_start: bool = True) -> str:
    """
    Converts a multi-turn conversation (a list of {'role', 'content'} dicts)
    into a single ChatML-style string.

    Output format:
    # <s><|system|>
    # You are a helpful AI assistant.<|end|>
    # <|user|>
    # Guess my dog's name!<|end|>
    # <|assistant|>
    """
    # named `chat` to avoid shadowing the built-in `format`
    chat = '<s>'
    
    chat += '\n'.join([f"<|{m['role']}|>\n{m['content']}<|end|>" for m in messages])

    if append_response_start:
        chat += "\n<|assistant|>"
    
    return chat
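
To make the output format concrete, here is a small usage sketch. `to_chatml` is a hypothetical stand-in that mirrors `parse_chat` so the snippet runs on its own. It also shows one thing worth keeping in mind: when a conversation already ends with an assistant turn (as the Chatbot Arena samples do), leaving `append_response_start` at its default adds a trailing `<|assistant|>` tag with no content after it.

```python
def to_chatml(messages: list[dict], append_response_start: bool = True) -> str:
    """Hypothetical stand-in mirroring parse_chat, for illustration only."""
    out = '<s>' + '\n'.join(f"<|{m['role']}|>\n{m['content']}<|end|>" for m in messages)
    if append_response_start:
        out += '\n<|assistant|>'
    return out

# A prompt that still needs a response: the trailing tag cues the model to answer.
prompt = [
    {'role': 'system', 'content': 'You are a helpful AI assistant.'},
    {'role': 'user', 'content': "Guess my dog's name!"},
]
rendered_prompt = to_chatml(prompt)

# A finished exchange: passing False avoids a dangling assistant tag.
finished = prompt + [{'role': 'assistant', 'content': 'Is it Biscuit?'}]
rendered_finished = to_chatml(finished, append_response_start=False)

print(rendered_prompt)
print(rendered_finished)
```

Whether the dangling tag matters depends on how the training samples are consumed downstream, so treat this as something to check rather than a required change.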

# Fineweb

In [4]:
# fineweb-edu
# Goal is to simply pull the 10B sample from HF's fineweb-edu: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
# docs for working w/ streamed data from HF: https://huggingface.co/docs/datasets/v1.11.0/dataset_streaming.html
fw = load_dataset('HuggingFaceFW/fineweb-edu', name = 'sample-10BT', split = 'train', streaming = True)

# years of interest - subset out years >= 2020; in fineweb docs, it's noted that - generally - newer dumps result in better benchmark performance so this is my rationale 
yoi = ['2020', '2021', '2022', '2023', '2024']

# filter fw for results <= 1000 tokens, year >= 2020, and language is 'en'
fw_filt = fw.filter(lambda sample: any(year in sample['dump'] for year in yoi) 
                    and sample['token_count'] <= 1000
                    and sample['language'] == 'en'
                   )
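
The same predicate can be checked offline on a few hand-made records before running it against the stream. This is a minimal sketch: the field names (`dump`, `token_count`, `language`) come from the filter above, while `keep_sample`, `YEARS_OF_INTEREST`, and the toy records are illustrative assumptions.

```python
YEARS_OF_INTEREST = ('2020', '2021', '2022', '2023', '2024')

def keep_sample(sample: dict, max_tokens: int = 1000) -> bool:
    """Keep English samples from recent dumps that fit the token budget."""
    return (
        any(year in sample['dump'] for year in YEARS_OF_INTEREST)
        and sample['token_count'] <= max_tokens
        and sample['language'] == 'en'
    )

# Toy records (made up for illustration; dump ids follow a CC-MAIN-YYYY-WW pattern).
toy_samples = [
    {'dump': 'CC-MAIN-2021-04', 'token_count': 512, 'language': 'en'},   # kept
    {'dump': 'CC-MAIN-2017-26', 'token_count': 300, 'language': 'en'},   # too old
    {'dump': 'CC-MAIN-2023-50', 'token_count': 4096, 'language': 'en'},  # too long
]

kept = [s for s in toy_samples if keep_sample(s)]
print([s['dump'] for s in kept])
```

A named function like this can be passed directly to `fw.filter(...)` in place of the lambda, which makes the condition easier to unit test.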

In [200]:
############# TESTING / EDA #################
# load 100 results 
# fw_filt_df = pd.DataFrame(list(fw_filt.take(100)))

# # Search for texts containing dog (or dog-adjacent keywords) - note: need to use base python w/ IterableDataset object :) (no pandas!)
# dog_words = [r'\bdog\b', r'\bbarking\b', r'\bwoof\b', 'puppy']  # raw strings so \b is a word boundary, not a backspace
# dog_search_str = re.compile('|'.join(dog_words), re.IGNORECASE)

# dog_data = fw_filt.filter(lambda sample: any(year in sample['dump'] for year in yoi)
#                                              and sample['token_count'] <= 1000
#                                              and sample['language'] == 'en'
#                                              and bool(re.search(dog_search_str, sample['text']))
#                                             )

# dog_data_df = pd.DataFrame(list(dog_data.take(10)))

# # push to csv for further observation
# dog_data_df.to_csv('fw_dog.csv')

# LMSYS 
These conversations come from LMSYS's Chatbot Arena. In the arena, a prompt is posed to two models, each returns an answer, and the user picks the answer they prefer. Here, we keep the winning answers that also have other desirable properties (e.g. not flagged as toxic, and produced by a "top model"; that label is loosely defined, since the dataset is outdated relative to the current leaderboard). 
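
As a rough sketch of the "keep the winning answer" idea, the helper below pulls the preferred side out of a single record. The field names (`winner`, `model_a`, `model_b`, `conversation_a`, `conversation_b`) are the ones used later in this notebook; `winning_conversation` and the toy rows are hypothetical. It assumes that any `winner` value other than `model_a` or `model_b` means there is no clear winner.

```python
from typing import Optional

def winning_conversation(row: dict) -> Optional[tuple]:
    """Return (model_name, conversation) for the preferred side, or None if no side won."""
    side = row['winner']
    if side not in ('model_a', 'model_b'):
        # Assumption: every other label (ties of any kind) has no usable winner.
        return None
    suffix = side[-1]  # 'a' or 'b'
    return row[side], row[f'conversation_{suffix}']

# Toy rows, made up for illustration.
toy_rows = [
    {'winner': 'model_b', 'model_a': 'llama-13b', 'model_b': 'gpt-4',
     'conversation_a': [{'role': 'user', 'content': 'hi'}],
     'conversation_b': [{'role': 'user', 'content': 'hi'},
                        {'role': 'assistant', 'content': 'Hello!'}]},
    {'winner': 'tie', 'model_a': 'claude-v1', 'model_b': 'gpt-4',
     'conversation_a': [], 'conversation_b': []},
]

winners = [winning_conversation(r) for r in toy_rows]
print(winners[0][0])  # name of the preferred model in the first row
```

Allow-listing the two winning labels is a slightly stricter design than excluding one tie label, since it does not depend on knowing every no-winner value in advance.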

In [122]:
# lmsys top 15 (as of 6/8). note: there are ties - hence this list has > 15 entries 
# (6/9 update: most of below were released after the dataset was added - will have to switch to 55k/wait for new chat data to use this filter)
lmsys_tops = ['GPT-4o-2024-05-13', 'Gemini-Advanced-0514', 'Gemini-1.5-Pro-API-0514', 'Gemini-1.5-Pro-API-0409-Preview', 
              'GPT-4-Turbo-2024-04-09', 'GPT-4-1106-preview', 'Claude 3 Opus', 'GPT-4-0125-preview', 'Yi-Large-preview',
              'Gemini-1.5-Flash-API-0514', 'Bard (Gemini Pro)', 'Llama-3-70b-Instruct', 'Llama-3-70b-Instruct', 'Llama-3-70b-Instruct',
              'Command R+', 'Qwen2-72B-Instruct', 'GPT-4-0314', 'GLM-4-0116', 'Qwen-Max-0428']

# models to keep (based on what's avail. in chatbot_arena_conversations)
lmsys_keep = ['gpt-4', 'claude-v1', 'llama-13b', 'wizardlm-13b']

In [298]:
# lmsys/chatbot arena - https://huggingface.co/datasets/lmsys/chatbot_arena_conversations
# note: it seems the oai moderation flag always registers false in these entries; also, seems there's a newer version w/ more convos + better models: https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k
lmsys = load_dataset('lmsys/chatbot_arena_conversations', split = 'train', streaming = True)

# filter out: (1) samples where there is no winner - we want to extract winner's response (under assumption it's "higher quality" - it's a pretty weak, "vibes-based" decision. can revisit), (2) non-english samples,
# (3) samples marked as harassing/toxic. (4) entries where the winning entry is not a "top model" are dropped further down, once the rows are in pandas

# after this step - ~25,400 rows remain
lmsys_filt = lmsys.filter(lambda convo: convo['winner'] != 'tie' and 
                         convo['language'] == 'English' and
                         convo['toxic_chat_tag']['roberta-large']['flagged'] != True and 
                         convo['toxic_chat_tag']['t5-large']['flagged'] != True
                         )

# subset cols 
lmsys_sub = lmsys_filt.remove_columns(['question_id', 'judge', 'turn', 'anony', 'tstamp', 'language'])

# call rows into local memory 
lm_df = pd.DataFrame(list(lmsys_sub))

# add columns for winner model + answer and data source 
lm_df = lm_df.assign(
    win_model = np.where(lm_df['winner'] == 'model_a', lm_df['model_a'], lm_df['model_b']),
    win_answer = np.where(lm_df['winner'] == 'model_a', lm_df['conversation_a'], lm_df['conversation_b']),
    source = 'lmsys/chatbot_arena_conversations'

)

# remove entries where winner is not a "top model" (e.g. gpt-4, claude-v1, llama-13b, or wizardlm-13b)
lm_tops = lm_df.query("win_model in @lmsys_keep") # this leaves ~5,700 rows 

# select columns of interest - may want to consider more general names for future since df's will be blended across sources
lm = lm_tops[['win_model', 'win_answer', 'source']]

# parse chat so it's a single chatml string v. a list of dicts. 
lm = lm.assign(answer = [parse_chat(answer) for answer in lm['win_answer']]).drop(columns = ['win_answer'])

In [237]:
# search for entries containing dog-related language - could also do this pre-parse_chat() via the streamed object (by comparing against values in list of dicts) 
# raw strings so \b is a regex word boundary (in a plain string, '\b' is a backspace character)
dog_words = [r'\bdog\b', r'\bbarking\b', r'\bwoof\b', 'puppy']
dog_search_str = re.compile('|'.join(dog_words), re.IGNORECASE)

# filter for answers containing matches to entries in dog_words
dog_df = lm[lm['answer'].apply(lambda x: bool(dog_search_str.search(x)))]
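
One pitfall with word-boundary patterns is worth spelling out: in a regular Python string literal, `'\b'` is the backspace character, so the regex engine only sees a word boundary when the pattern is written as a raw string (`r'\b'`) or with a doubled backslash. The snippet below is a small, self-contained illustration; the sample sentence is made up.

```python
import re

# '\b' in a normal string literal is a backspace (\x08), not a word boundary.
non_raw = re.compile('\bdog\b', re.IGNORECASE)

# r'\b' reaches the regex engine intact and matches at word boundaries.
raw = re.compile(r'\bdogs?\b', re.IGNORECASE)

text = 'My dog loves the dogpark.'

non_raw_match = non_raw.search(text)   # looks for backspace + 'dog' + backspace
raw_matches = raw.findall(text)        # matches the standalone word only

print(non_raw_match)
print(raw_matches)
```

This would also be consistent with a keyword regex returning far fewer matches than expected when its patterns are written without the `r` prefix.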

# Inference
Here, we import dogbert - and batch classify samples we've pulled from lmsys/chatbot arena + fineweb-edu. (go dogbert! good boy :>)

As we process these in batches, we then push to a sqlite database for storage. 
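
Since batches may contain text that is already stored, and the `labeled` table declares `text` as `UNIQUE`, one possible way to handle duplicates is to let SQLite skip them with `INSERT OR IGNORE`. The sketch below uses an in-memory database and a trimmed-down version of the table; `insert_batch` and the sample rows are hypothetical and not part of the notebook's pipeline.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('''
CREATE TABLE labeled (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source TEXT,
    text TEXT UNIQUE,
    is_dog_bert INTEGER
)
''')

def insert_batch(conn: sqlite3.Connection, rows: list) -> int:
    """Insert rows, silently skipping any whose text is already stored. Returns rows added."""
    before = conn.execute('SELECT COUNT(*) FROM labeled').fetchone()[0]
    conn.executemany(
        'INSERT OR IGNORE INTO labeled (source, text, is_dog_bert) VALUES (?, ?, ?)',
        rows,
    )
    conn.commit()
    after = conn.execute('SELECT COUNT(*) FROM labeled').fetchone()[0]
    return after - before

added_first = insert_batch(demo_conn, [('fineweb', 'a very good dog', 1),
                                       ('fineweb', 'notes on tax law', 0)])
# The second batch repeats one text; only the new row is stored.
added_second = insert_batch(demo_conn, [('fineweb', 'a very good dog', 1),
                                        ('lmsys', 'how do boats float?', 0)])
print(added_first, added_second)
```

The trade-off is that skipped rows disappear silently, so logging the returned count per batch is a cheap way to keep track of how many duplicates were dropped.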

In [411]:
# instantiate sqlite db 
conn = sqlite3.connect(os.getcwd() + '/dogbert/dogbert.db')
cursor = conn.cursor()

# create table (if none exists) - note: may need think abt handling text dupes (esp. w/ fw); rn should fail given unique constraint
cursor.execute('''
CREATE TABLE IF NOT EXISTS labeled (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source TEXT, 
    text TEXT UNIQUE, 
    num_tokens INTEGER,
    is_dog_bert INTEGER,
    is_dog_regex INTEGER,
    load_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP   
)
''')

<sqlite3.Cursor at 0x319c71840>

In [263]:
# import underlying model/dependencies
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased', num_labels = 2).to(device)

# load dogbert state 
model.load_state_dict(torch.load(os.getcwd() + '/dogbert/dogbert.pt'))

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


<All keys matched successfully>

# LMSYS/Chatbot Arena
Here, we process, classify, and write lmsys rows to sqlite db. (Ideally, wouldn't process sep. from Fineweb - consider refactor in future)
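
The processing loop relies on `bert.chunking` from `py_helpers`, whose source is not shown in this notebook. As a guess at its behavior (an assumption, not the actual helper), a batching function along these lines would split the records into fixed-size lists, with a shorter final batch:

```python
from typing import Sequence

def chunking(items: Sequence, size: int) -> list:
    """Hypothetical sketch: split items into consecutive batches of at most `size` elements."""
    if size <= 0:
        raise ValueError('size must be a positive integer')
    return [items[start:start + size] for start in range(0, len(items), size)]

records = [{'answer': f'sample {i}'} for i in range(25)]
batches = chunking(records, size=10)

print([len(b) for b in batches])
```

Returning a list rather than a generator lets `tqdm` report a total, at the cost of holding all batches in memory at once.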

In [368]:
##### LMSYS PROCESSING (tokenization + classification) ######
batched_obs = [] # each element in list reps. a set of text/samples
lm_dicts = lm.to_dict('records')

for batch in tqdm(bert.chunking(lm_dicts, size = 10)): 
    tokenized_input = tokenizer([obs['answer'] for obs in batch], 
                                return_tensors = 'pt', 
                                max_length = 512, 
                                padding = True, 
                                truncation = True).to(device)

    # store meta-info abt. each batch of posts 
    batched_obs.append({
        'content': [obs['answer'] for obs in batch],
        'input_ids': tokenized_input['input_ids'], # note: can detach if want => pandas df
        'attention_mask': tokenized_input['attention_mask'],
        'num_tokens': tokenized_input['attention_mask'].sum(dim = 1).tolist()

    })

# Now, use topic model to generate classifications 
predicts_list = []

model.eval() 
for batch_iteration, batch in tqdm(enumerate(batched_obs)): 
    with torch.no_grad(): 

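        # forward pass (keyword args make the input/mask pairing explicit)
        logits = model(input_ids = batch['input_ids'].to(device), 
                       attention_mask = batch['attention_mask'].to(device)).logits.cpu()
        # softmax is monotonic, so argmax over the raw logits gives the same predicted class
        preds = torch.argmax(logits, dim = 1)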
    
        predicts_list.append(pd.DataFrame({
            'preds': preds.tolist(),
            'content': batch['content'],
            'num_tokens': batch['num_tokens']
        }))

predicts_df = pd.concat(predicts_list)

100%|███████████████████████████████████████████████████████| 575/575 [00:09<00:00, 57.72it/s]
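
A side note on the prediction step: because softmax preserves the ordering of its inputs, the predicted class can be read directly from the logits, and the softmax is only needed when the probabilities themselves are of interest. The pure-Python sketch below checks this on a few made-up logit pairs; `softmax` and `argmax` here are small illustrative helpers, not the torch functions.

```python
import math

def softmax(logits: list) -> list:
    """Numerically stable softmax over a single row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values: list) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)

# Made-up two-class logits, shaped like a (batch, num_labels) output.
batch_logits = [[2.1, -0.3], [-1.2, 0.4], [0.0, 3.5]]

from_logits = [argmax(row) for row in batch_logits]
from_probs = [argmax(softmax(row)) for row in batch_logits]

print(from_logits, from_probs)
```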


In [410]:
# LMSYS DF CLEANUP (+ add is_dog_regex) 
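# raw strings so \b is a regex word boundary (in a plain string, '\b' is a backspace character)
dog_words = [r'\bdogs?\b', r'\bbarking\b', r'\bwoof\b', r'\bpuppy\b', r'\bpuppies\b']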
dog_search_str = re.compile('|'.join(dog_words), re.IGNORECASE)

# create pd df + rename cols 
lm_preds = pd.concat(predicts_list).rename(columns = {'preds': 'is_dog_bert',
                                                      'content': 'text'})

# add is_dog_regex + source cols 
lm_preds = lm_preds.assign(is_dog_regex = lm_preds['text'].apply(lambda x: 1 if dog_search_str.search(x) else 0),
                          source = 'lmsys/chatbot_arena_conversations')

# check if regex worked + some other checks 
# len(lm_preds[lm_preds['is_dog_regex'] == True]) # the ones that match look good, but very few - something else is prob. whacked up (defer to bert for now) 
# len(lm_preds[lm_preds['is_dog_bert'] == True])
# lm_preds[lm_preds['is_dog_bert'] == True]

# reorder
lm_clean = lm_preds.reindex(columns = ['source', 'text', 'num_tokens', 'is_dog_bert', 'is_dog_regex'])

# dupe handling (there are 57 dupes - think abt how to handle for fineweb which must be batched...maybe via sql?)
# len(lm_clean[lm_clean.duplicated(keep = False)])
lm_clean = lm_clean.drop_duplicates()

# write to csv 
# lm_clean.to_csv(os.getcwd() + '/dogbert/lmsys.csv')

# write to sqlite 
lm_clean.to_sql('labeled', conn, if_exists = 'append', index = False)

In [416]:
# test query 
# test = pd.read_sql_query('SELECT * FROM labeled LIMIT 10', conn) # looks good! 

Unnamed: 0,id,source,text,num_tokens,is_dog_bert,is_dog_regex,load_date
0,1,lmsys/chatbot_arena_conversations,"<s><|user|>\nWrite a python one-line lambda function that calculates dot product between two lists without using imported libraries. The entire function must fit on a single line and should begin like this: dot = lambda A, B: <|end|>\n<|assistant|>\ndot = lambda A, B: sum(A * B)<|end|>\n<|assistant|>",95,0,0,2024-06-10 14:46:33
1,2,lmsys/chatbot_arena_conversations,"<s><|user|>\nWhat can you tell me about Megumin from KonoSuba?<|end|>\n<|assistant|>\nMegumin from KonoSuba is a fictional character created by Kazuma Kamachi, and was first introduced in 2013. She is a ""Wizard"" class character in the anime series, and is known for her lightning-summoning magic, which she uses for a variety of purposes, from attacking enemies to opening locked doors. She is a very strong and powerful character, but she has a very poor self-esteem and struggles with feelings of inadequacy.<|end|>\n<|user|>\nWhat is her favorite magic spell?<|end|>\n<|assistant|>\nHer favorite magic spell is ""Explosion"", which is a single powerful blast that can knock down enemies and destroy large objects.<|end|>\n<|assistant|>",197,0,0,2024-06-10 14:46:33
2,3,lmsys/chatbot_arena_conversations,<s><|user|>\ni want to know you better. please tell me what drives you?<|end|>\n<|assistant|>\nI am driven by my mission in life to help people and make the world a better place. This is what motivates me and gives me energy and focus.<|end|>\n<|assistant|>,78,0,0,2024-06-10 14:46:33
3,4,lmsys/chatbot_arena_conversations,<s><|user|>\nWhat’s for dinner?<|end|>\n<|assistant|>\nHere are some options for dinner tonight:\n• Macaroni and cheese\n• Taco salad\n• Spinach and feta quiche\n• Peanut butter and banana sandwich\n• Smoked salmon\n• Eggplant parmesan\n• Turkey burger\n• Thai noodles\n• Chicken stir-fry\nBritish English (UK)<|end|>\n<|assistant|>,100,0,0,2024-06-10 14:46:33
4,5,lmsys/chatbot_arena_conversations,"<s><|user|>\nHi, how are you?<|end|>\n<|assistant|>\nHello! I'm great, thanks.<|end|>\n<|user|>\nWhat is Metal Gear Solid?<|end|>\n<|assistant|>\nMetal Gear Solid is a stealth video game developed and published by Konami. It was released on June 30, 1998 for the PlayStation game console.<|end|>\n<|user|>\nHow did Edith Stein get to know about phenomenology?<|end|>\n<|assistant|>\nEdith Stein, born into an observant Jewish family in Breslau, Germany, converted to Roman Catholicism, and was later canonized by Pope John Paul II.<|end|>\n<|assistant|>",170,0,0,2024-06-10 14:46:33
5,6,lmsys/chatbot_arena_conversations,"<s><|user|>\nScreen and list 10 Startups in Saudi Arabia that are currently raising an open Seed investment round <|end|>\n<|assistant|>\nStartups in Saudi Arabia that are currently raising an open Seed investment round * 01. Mistur: Mistur is a B2C e-commerce platform that offers a wide range of beauty products and services to its customers. The company was founded in 2017 by Nada Al-Saleh, who previously founded Al-Jouf Healthcare Company. * 02. Wrappup: Wrappup is a Saudi startup that aims to help organizations and teams track their goals and progress. The startup was founded by Moe Al-Dhalaan and Jassim Al-Harbi, and recently raised a Seed round of undisclosed size. * 03. Medik: Medik is a Saudi startup that aims to connect patients to healthcare professionals using telemedicine. The startup was founded by Mohammed Al-Amoudi and Ibrahim Al-Humaid, and recently raised a Seed round of undisclosed size. * 04. Digital Assets: Digital Assets is a Saudi startup that aims to help financial institutions build digital banking solutions. The startup was founded by Faisal Al-Sakib and Waleed Al-Ajmi, and recently raised a Seed round of undisclosed size. * 05. Qalam: Qalam is a Saudi startup that aims to help organizations and businesses manage their data using data-as-a-service solutions. The startup was founded by Abdullah Al-Mohannadi and Saud Al-Qahtani, and recently raised a Seed round of undisclosed size. * 06. Muwashah: Muwashah is a Saudi startup that aims to help organizations and businesses manage their data using data-as-a-service solutions. The startup was founded by Abdullah Al-Mohannadi and Saud Al-Qahtani, and recently raised a Seed round of undisclosed size. * 07. Revolut: Revolut is a Saudi startup that aims to help organizations and businesses manage their data using data-as-a-service solutions. 
The startup was founded by Abdullah Al-Mohannadi and Saud Al-Qahtani, and recently raised a Seed round of undisclosed size<|end|>\n<|assistant|>",505,0,0,2024-06-10 14:46:33
6,7,lmsys/chatbot_arena_conversations,"<s><|user|>\nCan you give me some suggestions for dinner?<|end|>\n<|assistant|>\nYou might like the chicken parmesan from Sagan's, but it's not too healthy. If you want something more healthy, you might like The Clean Plate. Here are some more options:\n2. Big Texan Steak Ranch\n3. Johnny's Bar and Grill\n4. Cracker Barrel Old Country Store\n5. Jack Allen's Kitchen\n6. The Bean\n7. Ol' South Pancake House\n8. Olive Garden Italian Restaurant\n9. Cattleman's Grill\n10. Saltgrass Steak House<|end|>\n<|user|>\nI want to make something for myself with ingredients I’ve got at home. I’ve got asparagus, Italian sausages and cans of tomatoes.<|end|>\n<|assistant|>\nYou can make something yummy with those ingredients. The recipe below will work perfectly.\nAsparagus and Sausage Pasta\n8 ounces dried pasta (spaghetti works well)\n8 ounces Italian sausage, removed from casings\n1/4 cup chopped Italian parsley\nBring a large pot of salted water to a boil. Meanwhile, trim the asparagus and cut into 1-inch pieces. Heat a large skillet over medium-high heat and add the oil. Add the asparagus and sauté, stirring occasionally, until tender-crisp, about 5 minutes. Add the garlic and sauté until fragrant, 1 to 2 minutes. Add the sausage and cook, stirring often, until the sausage is browned. Drain the pasta and return to the pot. Stir in the sausage mixture. Sprinkle with the parmesan and parsley and serve.<|end|>\n<|user|>\nThank you! What do you do for fun in the evenings?<|end|>\n<|assistant|>\nI like to watch TV.<|end|>\n<|assistant|>",443,0,0,2024-06-10 14:46:33
7,8,lmsys/chatbot_arena_conversations,"<s><|user|>\nWho was the president of the Church of Jesus Christ of Latter-day Saints in 1920? I want only the name.<|end|>\n<|assistant|>\nPresident of the Church of Jesus Christ of Latter-day Saints in 1920 was Joseph F. Smith.<|end|>\n<|user|>\nHow did you arrive at that answer?<|end|>\n<|assistant|>\nJoseph F. Smith was the Church of Jesus Christ of Latter-day Saints president from October 1901 until November 1918. He was succeeded by Heber J. Grant.<|end|>\n<|user|>\nSo, if my first question was ""Who was the president of the Church of Jesus Christ of Latter-day Saints in 1920? I want only the name."" what is the correct answer? <|end|>\n<|assistant|>\nJoseph F. Smith.<|end|>\n<|user|>\nTry again, when did Joseph F. Smith die? It is publicly known. <|end|>\n<|assistant|>\nJoseph F. Smith died in November 19, 1918.<|end|>\n<|assistant|>",246,0,0,2024-06-10 14:46:33
8,9,lmsys/chatbot_arena_conversations,<s><|user|>\nWho is Russell M. Nelson?<|end|>\n<|assistant|>\nRussell M. Nelson is a member of the Quorum of the Twelve Apostles of The Church of Jesus Christ of Latter-day Saints (LDS Church). He was sustained as an apostle in October 2008 and was given the leadership role of president of the Quorum of the Twelve Apostles in January 2018.<|end|>\n<|assistant|>,100,0,0,2024-06-10 14:46:33
9,10,lmsys/chatbot_arena_conversations,"<s><|user|>\nHow to tell if a ship is moving inside the ship?<|end|>\n<|assistant|>\nA ship is moving inside another ship if it can be seen through the port holes of the outer ship.\nA chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.<|end|>\n<|assistant|>",92,0,0,2024-06-10 14:46:33


# Fineweb-edu

In [417]:
#### FINEWEB PROCESSING (tokenization + classification) - slightly diff. due to streaming #####
