# Fineweb + LMSYS
**Motivation**: For training, we need sources of diverse, high-quality data (especially the dog-related kind). Said data comes from a few places: 
<ol> 
    <li><span style="color:blue">Fineweb:</span> HF's <a href="https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1">fineweb-edu</a> offers high-quality text data by applying filters to Common Crawl.</li>
    <li><span style="color:blue">LMSYS:</span><a href="https://huggingface.co/datasets/lmsys/chatbot_arena_conversations"> Chatbot arena convos</a> are sourced for additional instruct style samples.</li> 
    <li><span style="color:blue">Synthetic/LLM (out of notebook scope):</span> GPT is used to generate instruct style samples in the ChatML format.</li>
</ol>

Ultimately, the goal is to collect ~1B tokens for training. 

In this notebook, the focus is on filtering + pre-processing samples from Fineweb-edu. Ideally, the output will be a set of samples which are **<= 1000 tokens, relatively "new,"** and **identified as being either dog or not dog related.**
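
The criteria above can be sketched as a plain predicate over a single sample. This is a minimal sketch, assuming each sample is a dict with `dump`, `token_count`, and `language` fields (the fields used in the Fineweb cell of this notebook). The helper name `keep_sample` and the toy records are hypothetical and only there for illustration.

```python
# Hypothetical helper: True when a sample is recent, short enough, and English.
YEARS_OF_INTEREST = ('2020', '2021', '2022', '2023', '2024')

def keep_sample(sample: dict, max_tokens: int = 1000) -> bool:
    # The crawl year is embedded in the dump name, so a substring check is enough here
    is_recent = any(year in sample['dump'] for year in YEARS_OF_INTEREST)
    return (is_recent
            and sample['token_count'] <= max_tokens
            and sample['language'] == 'en')

# Toy records (made up, not real dataset rows)
samples = [
    {'dump': 'CC-MAIN-2021-04', 'token_count': 512, 'language': 'en'},   # kept
    {'dump': 'CC-MAIN-2017-26', 'token_count': 512, 'language': 'en'},   # too old
    {'dump': 'CC-MAIN-2023-14', 'token_count': 4096, 'language': 'en'},  # too long
]
kept = [s for s in samples if keep_sample(s)]
print(len(kept))
```

Writing the predicate as a named function (rather than an inline lambda) makes it easy to check against a few hand-written records before pointing it at the streamed dataset.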

In [241]:
# Load libraries
import torch 
import torch.nn as nn 
import torch.nn.functional as F
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from datasets import load_dataset
from datasets import load_dataset_builder
import pandas as pd
import numpy as np
import re
import json 
import os

# Set device (mps bcs working locally on mac)
device = torch.device('mps')

In [203]:
# Utility functions 
# same as parse_phi - used to reconstitute a conversation as a single chatml-formatted string (need this to clean lmsys interactions)
def parse_chat(messages: list[dict], append_response_start: bool = True) -> str:
    """
    Converts a multi-turn conversation (list of {'role', 'content'} dicts)
    into a single chat-formatted string using Phi-3-style tags.

    Output format:
    # <s><|system|>
    # You are a helpful AI assistant.<|end|>
    # <|user|>
    # Guess my dog's name!<|end|>
    # <|assistant|>
    """
    # `formatted` rather than `format`, to avoid shadowing the Python builtin
    formatted = '<s>'

    formatted += '\n'.join([f"<|{m['role']}|>\n{m['content']}<|end|>" for m in messages])

    # optionally open an assistant turn so a model can continue from here
    if append_response_start:
        formatted += "\n<|assistant|>"

    return formatted
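
A quick usage check for `parse_chat`. The helper is restated in compact form so the snippet runs on its own, and the two-message conversation is made up for illustration.

```python
def parse_chat(messages, append_response_start=True):
    # Compact restatement of the helper above
    text = '<s>' + '\n'.join(f"<|{m['role']}|>\n{m['content']}<|end|>" for m in messages)
    if append_response_start:
        text += '\n<|assistant|>'
    return text

example = [
    {'role': 'system', 'content': 'You are a helpful AI assistant.'},
    {'role': 'user', 'content': "Guess my dog's name!"},
]

prompt = parse_chat(example)                                  # ends with an open assistant turn
closed = parse_chat(example, append_response_start=False)     # ends after the last <|end|>
print(prompt)
```

One thing to keep in mind: when the conversation already ends with an assistant message (as the winning LMSYS conversations do), the default setting still appends a trailing `<|assistant|>` tag, so `append_response_start = False` may be the better fit for finished training samples.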

# Fineweb

In [4]:
# fineweb-edu
# Goal is to simply pull the 10B sample from HF's fineweb-edu: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
# docs for working w/ streamed data from HF: https://huggingface.co/docs/datasets/v1.11.0/dataset_streaming.html
fw = load_dataset('HuggingFaceFW/fineweb-edu', name = 'sample-10BT', split = 'train', streaming = True)

# years of interest - subset out years >= 2020; in fineweb docs, it's noted that - generally - newer dumps result in better benchmark performance so this is my rationale 
yoi = ['2020', '2021', '2022', '2023', '2024']

# filter fw for results <= 1000 tokens, year >= 2020, and language is 'en'
fw_filt = fw.filter(lambda sample: any(year in sample['dump'] for year in yoi) 
                    and sample['token_count'] <= 1000
                    and sample['language'] == 'en'
                   )

In [200]:
############# TESTING / EDA #################
# load 100 results 
# fw_filt_df = pd.DataFrame(list(fw_filt.take(100)))

# # Search for texts containing dog (or dog-adjacent keywords) - note: need to use base python w/ IterableDataset object :) (no pandas!)
# dog_words = [r'\bdog\b', r'\bbarking\b', r'\bwoof\b', r'puppy']  # raw strings so \b is a word boundary, not a backspace
# dog_search_str = re.compile('|'.join(dog_words), re.IGNORECASE)

# dog_data = fw_filt.filter(lambda sample: any(year in sample['dump'] for year in yoi)
#                                              and sample['token_count'] <= 1000
#                                              and sample['language'] == 'en'
#                                              and bool(re.search(dog_search_str, sample['text']))
#                                             )

# dog_data_df = pd.DataFrame(list(dog_data.take(10)))

# # push to csv for further observation
# dog_data_df.to_csv('fw_dog.csv')

# LMSYS 
These conversations come from LMSYS's Chatbot Arena. The setup: a question/prompt is posed to two models, both provide an answer, and the user selects the one they like most. Here, we filter for the winning answers that also have other desirable properties (e.g. not flagged as toxic, and the winning model is a "top model"; this is loosely defined, since the dataset is outdated). 
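
The "keep the winning answer" step can be sketched on a single record. This is a minimal sketch, assuming each record carries `winner`, `model_a`, `model_b`, `conversation_a`, and `conversation_b` (the columns used in the cells of this section). The helper name `pick_winner` and the toy record are hypothetical, and treating every `winner` value other than `model_a`/`model_b` as "no winner" is an assumption rather than something this notebook states.

```python
def pick_winner(convo: dict):
    """Return (model_name, conversation) for the preferred side, or None if there is no clear winner."""
    side = convo['winner']
    if side not in ('model_a', 'model_b'):
        # Assumption: any other value (e.g. a tie) means there is nothing to extract
        return None
    suffix = side[-1]  # 'a' or 'b'
    return convo[f'model_{suffix}'], convo[f'conversation_{suffix}']

# Toy record (made up for illustration)
record = {
    'winner': 'model_b',
    'model_a': 'llama-13b',
    'model_b': 'gpt-4',
    'conversation_a': [{'role': 'user', 'content': 'Name a dog breed.'},
                       {'role': 'assistant', 'content': 'Cat.'}],
    'conversation_b': [{'role': 'user', 'content': 'Name a dog breed.'},
                       {'role': 'assistant', 'content': 'Beagle.'}],
}

win_model, win_answer = pick_winner(record)
print(win_model)
```

Checking membership in `('model_a', 'model_b')` is slightly stricter than excluding one specific tie label, which guards against a non-winner row silently defaulting to one side.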

In [122]:
# lmsys top 15 (as of 6/8). note: there are ties - hence this list has > 15 entries 
# (6/9 update: most of below were released after the dataset was added - will have to switch to 55k/wait for new chat data to use this filter)
lmsys_tops = ['GPT-4o-2024-05-13', 'Gemini-Advanced-0514', 'Gemini-1.5-Pro-API-0514', 'Gemini-1.5-Pro-API-0409-Preview', 
              'GPT-4-Turbo-2024-04-09', 'GPT-4-1106-preview', 'Claude 3 Opus', 'GPT-4-0125-preview', 'Yi-Large-preview',
              'Gemini-1.5-Flash-API-0514', 'Bard (Gemini Pro)', 'Llama-3-70b-Instruct',
              'Command R+', 'Qwen2-72B-Instruct', 'GPT-4-0314', 'GLM-4-0116', 'Qwen-Max-0428']

# models to keep (based on what's avail. in chatbot_arena_conversations)
lmsys_keep = ['gpt-4', 'claude-v1', 'llama-13b', 'wizardlm-13b']

In [235]:
# lmsys/chatbot arena - https://huggingface.co/datasets/lmsys/chatbot_arena_conversations
# note: it seems the oai moderation flag always registers false in these entries; also, seems there's a newer version w/ more convos + better models: https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k
lmsys = load_dataset('lmsys/chatbot_arena_conversations', split = 'train', streaming = True)

# filter out: (1) samples where there is no winner - we want to extract the winner's response (under the assumption it's "higher quality"; it's a pretty weak, "vibes-based" decision. can revisit),
# (2) non-English samples, and (3) samples flagged as harassing/toxic.
# (4) entries where the winning model is not a "top model" are removed further down, once the rows are in pandas.

# after this step - ~25,400 rows remain
lmsys_filt = lmsys.filter(lambda convo: convo['winner'] != 'tie' and 
                         convo['language'] == 'English' and
                         not convo['toxic_chat_tag']['roberta-large']['flagged'] and 
                         not convo['toxic_chat_tag']['t5-large']['flagged']
                         )

# subset cols 
lmsys_sub = lmsys_filt.remove_columns(['question_id', 'judge', 'turn', 'anony', 'tstamp', 'language'])

# call rows into local memory 
lm_df = pd.DataFrame(list(lmsys_sub))

# add columns for winner model + answer and data source 
lm_df = lm_df.assign(
    win_model = np.where(lm_df['winner'] == 'model_a', lm_df['model_a'], lm_df['model_b']),
    win_answer = np.where(lm_df['winner'] == 'model_a', lm_df['conversation_a'], lm_df['conversation_b']),
    source = 'lmsys/chatbot_arena_conversations'
)

# remove entries where winner is not a "top model" (e.g. gpt-4, claude-v1, llama-13b, or wizardlm-13b)
lm_tops = lm_df.query("win_model in @lmsys_keep") # this leaves ~5,700 rows 

# select columns of interest - may want to consider more general names for future since df's will be blended across sources
lm = lm_tops[['win_model', 'win_answer', 'source']]

# parse chat so it's a single chatml string v. a list of dicts. 
lm = lm.assign(answer = [parse_chat(answer) for answer in lm['win_answer']]).drop(columns = ['win_answer'])

In [237]:
# search for entries containing dog-related language - could also do this pre-parse_chat() via the streamed object (by comparing against values in list of dicts) 
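# raw strings are needed here: in a normal string literal, '\b' is a backspace character, not a regex word boundary
dog_words = [r'\bdog\b', r'\bbarking\b', r'\bwoof\b', r'puppy']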
dog_search_str = re.compile('|'.join(dog_words), re.IGNORECASE)

# filter for answers containing matches to entries in dog_words
dog_df = lm[lm['answer'].apply(lambda x: bool(dog_search_str.search(x)))]
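
A small sanity check of the keyword pattern on made-up strings. For `\b` to act as a regex word boundary, the patterns have to be raw strings (`r'...'`); in a regular string literal, `'\b'` is a backspace character and the pattern would essentially never match. The test sentences below are hypothetical.

```python
import re

# Raw strings so that \b reaches the regex engine as a word boundary
dog_words = [r'\bdog\b', r'\bbarking\b', r'\bwoof\b', r'puppy']
dog_search = re.compile('|'.join(dog_words), re.IGNORECASE)

texts = [
    "My Dog loves the park.",     # whole word, different case
    "A dogmatic argument.",       # 'dog' only as a prefix
    "Puppy-proofing your home",   # 'puppy' has no boundary requirement
    "Hot dogs for lunch",         # plural is not covered by \bdog\b
]
matches = [bool(dog_search.search(t)) for t in texts]
print(matches)  # [True, False, True, False]
```

Note that the plural "dogs" is not matched as written; if plurals should count, a pattern such as `r'\bdogs?\b'` would cover them.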

# Classification
Here, we import dogbert and batch-classify the samples we've pulled from Fineweb + Chatbot Arena. (go dogbert! good boy :>)

As each batch is processed, its results are pushed to a SQLite database for storage. 
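
The classify-then-store loop could look roughly like the sketch below. It uses the standard-library `sqlite3` module with an in-memory database. The table name `samples`, its columns, the helper `store_batch`, and the keyword-based `fake_classify` stand-in (used in place of dogbert so the snippet runs without model weights) are all assumptions made for illustration.

```python
import sqlite3

def store_batch(conn, rows):
    """Write one batch of (id, content, is_dog) rows; re-running a batch overwrites rather than duplicates."""
    conn.executemany(
        "INSERT OR REPLACE INTO samples (id, content, is_dog) VALUES (?, ?, ?)", rows
    )
    conn.commit()

def fake_classify(texts):
    # Stand-in for the real classifier: 1 if 'dog' appears in the text, else 0
    return [int('dog' in t.lower()) for t in texts]

conn = sqlite3.connect(':memory:')  # swap for a file path to persist results
conn.execute(
    "CREATE TABLE IF NOT EXISTS samples (id TEXT PRIMARY KEY, content TEXT, is_dog INTEGER)"
)

# Two toy batches of (id, content) pairs
batches = [
    [('a1', 'How to train your dog'), ('a2', 'Intro to linear algebra')],
    [('b1', 'Dog nutrition basics'), ('b2', 'History of the bicycle')],
]

for batch in batches:
    labels = fake_classify([content for _, content in batch])
    store_batch(conn, [(sid, content, label) for (sid, content), label in zip(batch, labels)])

n_rows, n_dog = conn.execute("SELECT COUNT(*), SUM(is_dog) FROM samples").fetchone()
print(n_rows, n_dog)
```

Committing once per batch means an interrupted run loses at most the batch in flight, and the primary key on `id` keeps reruns idempotent.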

In [None]:
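# import dogbert
# NOTE: this cell is a work in progress. `chunking` and the `to_batch` / `model_name`
# arguments are not defined elsewhere in this notebook, so their shapes below are assumptions.

def chunking(items, size):
    """Yield successive lists of `size` items (assumed helper; the last chunk may be shorter)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def batch_posts(to_batch, model_name):
    # to_batch: list of dicts with 'id' and 'content' keys (assumed)
    # model_name: sequence whose second entry names the saved weights file (assumed)
    print(f'Now classifying {model_name[1]} posts')

    # Import model/dependencies
    tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-cased")
    model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-cased", num_labels = 2).to(device)

    # Load fine-tuned weights for the given topic into memory and switch to inference mode
    model.load_state_dict(torch.load(f"../models/{model_name[1]}.pt", map_location = device))
    model.eval()

    # Posts are batched into groups of 10 posts + tokenized
    # (truncation = True so that inputs longer than 512 tokens are cut down to max_length)
    batched_posts = []
    for batch in chunking(to_batch, size = 10):
        tokenized_input = tokenizer([post['content'] for post in batch], return_tensors = 'pt',
                                    max_length = 512, padding = 'max_length', truncation = True).to(device)

        batched_posts.append({
                'id': [post['id'] for post in batch],
                'content': [post['content'] for post in batch],
                'input_ids': tokenized_input['input_ids'],
                'attention_mask': tokenized_input['attention_mask']
            })

    return model, batched_posts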
