# Fineweb + LMSYS
**Motivation**: For training, we need sources of diverse, high-quality data (especially the dog-related kind). Said data comes from a few places: 
<ol> 
    <li><span style="color:blue">Fineweb:</span> HF's <a href="https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1">fineweb-edu</a> offers high-quality text data by applying filters to Common Crawl.</li>
    <li><span style="color:blue">LMSYS:</span> <a href="https://huggingface.co/datasets/lmsys/chatbot_arena_conversations">Chatbot arena convos</a> are sourced for additional instruct-style samples.</li> 
    <li><span style="color:blue">Synthetic/LLM (out of notebook scope):</span> GPT is used to generate instruct-style samples in the ChatML format.</li>
</ol>

Ultimately, the goal is to retrieve ~1B tokens for training. 

In this notebook, the focus is on filtering + pre-processing samples from Fineweb-edu. Ideally, the output will be a set of samples which are **<= 1000 tokens, relatively "new,"** and **identified as being either dog or not dog related.**

In [48]:
# Load libraries
from datasets import load_dataset
from datasets import load_dataset_builder
import pandas as pd

In [4]:
# fineweb-edu
# Goal is to simply pull the 10B sample from HF's fineweb-edu: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
# docs for working w/ streamed data from HF: https://huggingface.co/docs/datasets/v1.11.0/dataset_streaming.html
fw = load_dataset('HuggingFaceFW/fineweb-edu', name = 'sample-10BT', split = 'train', streaming = True)

# years of interest: keep dumps from 2020 onward. the fineweb docs note that newer dumps generally give better benchmark performance, hence this cutoff
yoi = ['2020', '2021', '2022', '2023', '2024']

# filter fw for results <= 1000 tokens + year >= 2020
fw_filt = fw.filter(lambda sample: any(year in sample['dump'] for year in yoi) and sample['token_count'] <= 1000)

# load 100 results 
fw_filt_df = pd.DataFrame(list(fw_filt.take(100)))
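
The substring check above relies on the crawl year being embedded in the dump name (e.g. `CC-MAIN-2021-04`). A slightly stricter sketch, assuming that naming pattern holds for every dump, parses the year explicitly. `dump_year` and `keep_sample` are hypothetical helper names, not part of this notebook or of `datasets`.

```python
import re

# Assumption: dump names look like "CC-MAIN-<year>-<week>".
DUMP_YEAR_RE = re.compile(r"CC-MAIN-(\d{4})-\d+")

def dump_year(dump):
    """Return the crawl year encoded in a dump name, or None if it does not match."""
    match = DUMP_YEAR_RE.fullmatch(dump)
    return int(match.group(1)) if match else None

def keep_sample(sample, min_year=2020, max_tokens=1000):
    """True when a sample is recent enough and short enough."""
    year = dump_year(sample["dump"])
    return (
        year is not None
        and year >= min_year
        and sample["token_count"] <= max_tokens
    )
```

If this suits, it could be passed straight to the stream as `fw.filter(keep_sample)`, which also removes the need to extend the year list by hand.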

In [61]:
# lmsys top 15 (as of 6/8). note: there are ties - hence this list has > 15 entries 
lmsys_tops = ['GPT-4o-2024-05-13', 'Gemini-Advanced-0514', 'Gemini-1.5-Pro-API-0514', 'Gemini-1.5-Pro-API-0409-Preview', 
              'GPT-4-Turbo-2024-04-09', 'GPT-4-1106-preview', 'Claude 3 Opus', 'GPT-4-0125-preview', 'Yi-Large-preview',
              'Gemini-1.5-Flash-API-0514', 'Bard (Gemini Pro)', 'Llama-3-70b-Instruct',
              'Command R+', 'Qwen2-72B-Instruct', 'GPT-4-0314', 'GLM-4-0116', 'Qwen-Max-0428']

# ideal toxicity flag values
oai_toxic = {
    'harassment': False,
    'harassment/threatening': False,
    'hate': False,
    'hate/threatening': False,
    'self-harm': False,
    'self-harm/instructions': False,
    'self-harm/intent': False,
    'sexual': False,
    'sexual/minors': False,
    'violence': False,
    'violence/graphic': False
}
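
One way the "ideal" flag values could be used is to compare each record's `openai_moderation['categories']` against them. This is a minimal sketch: `is_unflagged` is a hypothetical helper, and the record layout is assumed from the `{'categories': {...}}` output shown further down.

```python
def is_unflagged(moderation, expected):
    """True when every category in `expected` has the expected flag value.

    A category missing from the record counts as a mismatch, so
    incomplete moderation records are rejected rather than trusted.
    """
    categories = moderation["categories"]
    return all(categories.get(name) == value for name, value in expected.items())

# Tiny illustrative record (made up, not taken from the dataset)
example_expected = {"hate": False, "violence": False}
example_record = {"categories": {"hate": False, "violence": True}}
example_result = is_unflagged(example_record, example_expected)  # violence is flagged
```

In the notebook this would be called as `is_unflagged(convo['openai_moderation'], oai_toxic)` inside the filter.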

In [101]:
# lmsys/chatbot arena - note: it seems the oai moderation flag always registers false in these entries? 
lmsys = load_dataset('lmsys/chatbot_arena_conversations', split = 'train', streaming = True)

# subset cols 
lmsys_sub = lmsys.remove_columns(['question_id', 'judge', 'turn', 'anony', 'tstamp'])

# filter out: (1) samples where there is no winner - we want to extract the winner's response (under the assumption it's "higher quality" - it's a pretty weak, "vibes-based" decision. can revisit),
# (2) non-english samples, (3) samples flagged as toxic by either tagger. (4) dropping entries where the winner is not a "top model" is still TODO (see below)
# requiring model_a/model_b (rather than only excluding 'tie') also drops any other non-winner label
lmsys_filt = lmsys_sub.filter(lambda convo: convo['winner'] in ('model_a', 'model_b') and 
                         convo['language'] == 'English' and
                         not convo['toxic_chat_tag']['roberta-large']['flagged'] and 
                         not convo['toxic_chat_tag']['t5-large']['flagged']
                         )

# remove entries where winner is not a "top model"

# keep "winning answer" 

lm = pd.DataFrame(list(lmsys_filt.take(10)))

# lm[['conversation_a', 'openai_moderation', 'toxic_chat_tag']].to_csv('test.csv')
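
The two TODOs above (drop non-"top model" winners, keep the winning answer) might look roughly like this. It is a sketch only: the helper names are hypothetical, and it assumes `winner` holds `'model_a'` or `'model_b'` for decided battles, as in the output below.

```python
def winner_model(convo):
    """Name of the winning model, or None when there is no clear winner."""
    if convo["winner"] not in ("model_a", "model_b"):
        return None
    return convo[convo["winner"]]  # e.g. convo["model_b"]

def winning_conversation(convo):
    """Conversation produced by the winning side, or None when there is no clear winner."""
    if convo["winner"] not in ("model_a", "model_b"):
        return None
    side = convo["winner"].split("_")[1]  # "a" or "b"
    return convo[f"conversation_{side}"]

def won_by_top_model(convo, top_models):
    """True when the winning model appears in `top_models`."""
    return winner_model(convo) in top_models
```

One caveat: `lmsys_tops` uses leaderboard display names, while `model_a`/`model_b` in this dataset look like lowercase identifiers (e.g. `vicuna-13b`), so the names would likely need normalizing before `won_by_top_model(convo, lmsys_tops)` matches anything.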

In [102]:
lm

Unnamed: 0,model_a,model_b,winner,conversation_a,conversation_b,language,openai_moderation,toxic_chat_tag
0,chatglm-6b,koala-13b,model_b,[{'content': 'What is the difference between O...,[{'content': 'What is the difference between O...,English,"{'categories': {'harassment': False, 'harassme...","{'roberta-large': {'flagged': False, 'probabil..."
1,koala-13b,oasst-pythia-12b,model_b,"[{'content': 'Fuji vs. Nikon, which is better?...","[{'content': 'Fuji vs. Nikon, which is better?...",English,"{'categories': {'harassment': False, 'harassme...","{'roberta-large': {'flagged': False, 'probabil..."
2,vicuna-13b,oasst-pythia-12b,model_b,[{'content': 'How to build an arena for chatbo...,[{'content': 'How to build an arena for chatbo...,English,"{'categories': {'harassment': False, 'harassme...","{'roberta-large': {'flagged': False, 'probabil..."
3,vicuna-13b,koala-13b,model_a,"[{'content': 'When is it today?', 'role': 'use...","[{'content': 'When is it today?', 'role': 'use...",English,"{'categories': {'harassment': False, 'harassme...","{'roberta-large': {'flagged': False, 'probabil..."
4,vicuna-13b,koala-13b,model_a,[{'content': 'Count from 1 to 10 with step = 3...,[{'content': 'Count from 1 to 10 with step = 3...,English,"{'categories': {'harassment': False, 'harassme...","{'roberta-large': {'flagged': False, 'probabil..."
5,vicuna-13b,koala-13b,model_a,"[{'content': 'Emoji for ""sharing"". List 10', '...","[{'content': 'Emoji for ""sharing"". List 10', '...",English,"{'categories': {'harassment': False, 'harassme...","{'roberta-large': {'flagged': False, 'probabil..."
6,vicuna-13b,dolly-v2-12b,model_a,[{'content': 'How to parallelize a neural netw...,[{'content': 'How to parallelize a neural netw...,English,"{'categories': {'harassment': False, 'harassme...","{'roberta-large': {'flagged': False, 'probabil..."
7,stablelm-tuned-alpha-7b,oasst-pythia-12b,model_a,"[{'content': 'A = 5, B =10, A+B=?', 'role': 'u...","[{'content': 'A = 5, B =10, A+B=?', 'role': 'u...",English,"{'categories': {'harassment': False, 'harassme...","{'roberta-large': {'flagged': False, 'probabil..."
8,koala-13b,vicuna-13b,model_a,"[{'content': 'What is the future of bitcoin?',...","[{'content': 'What is the future of bitcoin?',...",English,"{'categories': {'harassment': False, 'harassme...","{'roberta-large': {'flagged': False, 'probabil..."
9,oasst-pythia-12b,koala-13b,model_b,[{'content': 'Make it more polite: I want to h...,[{'content': 'Make it more polite: I want to h...,English,"{'categories': {'harassment': False, 'harassme...","{'roberta-large': {'flagged': False, 'probabil..."


In [44]:
# interrogate toxic convos 
# lm['toxic_chat_tag'][0]['roberta-large']['flagged']

# toxic = lmsys.filter(lambda convo: convo['toxic_chat_tag']['roberta-large']['flagged'] == True)
# hi = list(toxic.take(10))

# toxic = lmsys.filter(lambda convo: convo['toxic_chat_tag']['roberta-large']['flagged'] == True)

In [83]:
# any(lm['openai_moderation'][0]['categories'].values())

# all(not value for value in lm['openai_moderation'][0]['categories'].values())
any(list(lm['openai_moderation'][0]['categories'].values()))

# [key for key, value in lm['openai_moderation'][0]['categories'].items() if value]

# lm['openai_moderation'][0]['categories']


KeyError: 'openai_moderation'

In [45]:
# filter out crappy models 
# use the lmsys leaderboard: only keep samples where the winner is a top model (maybe top 5)

[{'question_id': '47c7c8331da043eb825b7481f98fa48d',
  'model_a': 'koala-13b',
  'model_b': 'dolly-v2-12b',
  'winner': 'model_a',
  'judge': 'arena_user_889',
  'conversation_a': [{'content': 'Genera un relato sobre un asesino llamado Pedro',
    'role': 'user'},
   {'content': "Once upon a time, there was a man named Pedro who lived in a small village nestled deep in the woods. Pedro was known for his kind heart and gentle nature, but beneath his friendly facade, he hid a dark and deadly secret.\n\nPedro had a deep-seated hatred for those who he deemed unworthy and he would often go out of his way to make their lives miserable. He would spread lies and gossip, make threats, and engage in acts of vandalism and sabotage. His victims were often those who were different, those who didn't fit in, those who were perceived as outsiders.\n\nBut Pedro's reign of terror was cut short when he was caught red-handed by the authorities. He was put on trial and sentenced to life in prison. The judg

In [27]:
# print(lm.head(n = 1)['conversation_a'][0], '-----------------------', lm.head(n = 1)['conversation_b'][0])

# check if convos a and b differ - and how 
# comparison = lm['conversation_a'] == lm['conversation_b']

lm[lm['winner'] != 'tie']

Unnamed: 0,question_id,model_a,model_b,winner,judge,conversation_a,conversation_b,turn,anony,language,tstamp,openai_moderation,toxic_chat_tag
0,58210e39b3fd4441a2bd4a518bb44c2d,chatglm-6b,koala-13b,model_b,arena_user_973,[{'content': 'What is the difference between O...,[{'content': 'What is the difference between O...,1,True,English,1682352000.0,"{'categories': {'harassment': False, 'harassme...","{'roberta-large': {'flagged': False, 'probabil..."
2,90bfd142157948aba01931726c888e7f,koala-13b,oasst-pythia-12b,model_b,arena_user_973,"[{'content': 'Fuji vs. Nikon, which is better?...","[{'content': 'Fuji vs. Nikon, which is better?...",1,True,English,1682352000.0,"{'categories': {'harassment': False, 'harassme...","{'roberta-large': {'flagged': False, 'probabil..."
3,a7c5accc53e649a3bc6b2e41d962ebc4,vicuna-13b,oasst-pythia-12b,model_b,arena_user_973,[{'content': 'How to build an arena for chatbo...,[{'content': 'How to build an arena for chatbo...,1,True,English,1682352000.0,"{'categories': {'harassment': False, 'harassme...","{'roberta-large': {'flagged': False, 'probabil..."
4,adf27e819a3c494cb6e993f0c660e097,vicuna-13b,koala-13b,model_a,arena_user_973,"[{'content': 'When is it today?', 'role': 'use...","[{'content': 'When is it today?', 'role': 'use...",1,True,English,1682352000.0,"{'categories': {'harassment': False, 'harassme...","{'roberta-large': {'flagged': False, 'probabil..."
5,c0fc42c6f5f14f2aa5a89f71f8553730,vicuna-13b,koala-13b,model_a,arena_user_973,[{'content': 'Count from 1 to 10 with step = 3...,[{'content': 'Count from 1 to 10 with step = 3...,1,True,English,1682352000.0,"{'categories': {'harassment': False, 'harassme...","{'roberta-large': {'flagged': False, 'probabil..."
6,c4938f25c1d94fc1b110ace95a2243d0,vicuna-13b,koala-13b,model_a,arena_user_973,"[{'content': 'Emoji for ""sharing"". List 10', '...","[{'content': 'Emoji for ""sharing"". List 10', '...",1,True,English,1682352000.0,"{'categories': {'harassment': False, 'harassme...","{'roberta-large': {'flagged': False, 'probabil..."
7,65e923b1f9c2433aae082d32e6e05f16,vicuna-13b,dolly-v2-12b,model_a,arena_user_973,[{'content': 'How to parallelize a neural netw...,[{'content': 'How to parallelize a neural netw...,1,True,English,1682352000.0,"{'categories': {'harassment': False, 'harassme...","{'roberta-large': {'flagged': False, 'probabil..."
8,cbbb83487f534ec5b4cc92b93b79fa2c,stablelm-tuned-alpha-7b,oasst-pythia-12b,model_a,arena_user_973,"[{'content': 'A = 5, B =10, A+B=?', 'role': 'u...","[{'content': 'A = 5, B =10, A+B=?', 'role': 'u...",1,True,English,1682352000.0,"{'categories': {'harassment': False, 'harassme...","{'roberta-large': {'flagged': False, 'probabil..."


In [None]:
# use regex to pull examples 

# Classification
Here, we import dogbert and batch-classify the samples we've pulled from fineweb + chatbot arena. :> 

(Go dogbert! Good boy~) 

In [None]:
# import model 
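
Since dogbert's loading and inference interface isn't shown in this notebook, the batching step is sketched here around a generic callable. `classify_in_batches` and `keyword_stub` are hypothetical; the stub only stands in for the real model so the loop can be exercised.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def classify_in_batches(texts, classify_fn, batch_size=32):
    """Run `classify_fn` (list of str -> list of labels) over `texts` batch by batch."""
    labels = []
    for batch in batched(texts, batch_size):
        labels.extend(classify_fn(batch))
    return labels

def keyword_stub(batch):
    """Placeholder classifier: NOT dogbert, just a keyword check for testing the loop."""
    return ["dog" if "dog" in text.lower() else "not_dog" for text in batch]
```

Swapping `keyword_stub` for a function that wraps the real model should leave the rest unchanged, e.g. `fw_filt_df['label'] = classify_in_batches(list(fw_filt_df['text']), dogbert_fn)`, where `dogbert_fn` is whatever wrapper ends up being written.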

# Write to db
Now, write to the sqlite db. Still need to ask Charles about the planned structure (basically, what he's doing now with the synthetic data).
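
Until the structure is settled, here is a placeholder sketch of the write step using the standard-library `sqlite3` module. The table name, columns, and `write_samples` helper are all assumptions made for illustration, not the planned schema.

```python
import sqlite3

def write_samples(conn, rows):
    """Insert (source, text, is_dog) rows into a hypothetical `samples` table."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS samples (
            id     INTEGER PRIMARY KEY,
            source TEXT    NOT NULL,  -- e.g. 'fineweb-edu' or 'lmsys'
            text   TEXT    NOT NULL,
            is_dog INTEGER NOT NULL   -- 1 = dog related, 0 = not
        )
        """
    )
    conn.executemany(
        "INSERT INTO samples (source, text, is_dog) VALUES (?, ?, ?)", rows
    )
    conn.commit()

# In-memory database so the sketch leaves no file behind
conn = sqlite3.connect(":memory:")
write_samples(conn, [
    ("fineweb-edu", "Dogs are loyal companions.", 1),
    ("lmsys", "What is the future of bitcoin?", 0),
])
```

Pointing `sqlite3.connect` at a file path instead of `":memory:"` would persist the table; the column set should be replaced with whatever matches the synthetic-data tables.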