# Fineweb
**Motivation**: Training requires diverse, high-quality data (especially the dog-related kind). The data comes from three sources:
<ol> 
    <li><span style="color:blue">Fineweb:</span> HF's <a href="https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1">fineweb-edu</a> offers high-quality text data by applying filters to Common Crawl.</li>
    <li><span style="color:blue">Synthetic/LLM:</span> GPT is used to generate instruct-style samples in the ChatML format.</li>
    <li><span style="color:blue">LMSYS:</span> <a href="https://huggingface.co/datasets/lmsys/chatbot_arena_conversations">Chatbot Arena conversations</a> are sourced for additional instruct-style samples.</li>
</ol>

Ultimately, the goal is to collect ~1B tokens for training.

This notebook focuses on filtering and pre-processing samples from Fineweb-edu. Ideally, the output will be a set of samples which are **<= 1000 tokens, relatively "new"** (drawn from 2020 or later dumps), and **identified as being either dog or not dog related.**
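The two filtering criteria (recency and length) can be sketched as a standalone predicate that runs without touching the dataset. This is a minimal sketch, not the notebook's own implementation: the helper names `dump_year` and `keep_sample` are hypothetical, and it assumes dump identifiers follow the `CC-MAIN-YYYY-WW` pattern and that each sample carries `dump` and `token_count` fields.

```python
import re
from typing import Optional

MAX_TOKENS = 1000   # upper bound on sample length, in tokens
MIN_YEAR = 2020     # earliest dump year considered "new"

# Assumption: dump names look like 'CC-MAIN-2021-04' (year, then week number).
_DUMP_YEAR = re.compile(r"CC-MAIN-(\d{4})-\d{2}")


def dump_year(dump: str) -> Optional[int]:
    """Return the year encoded in a dump name, or None if it does not match."""
    match = _DUMP_YEAR.fullmatch(dump)
    return int(match.group(1)) if match else None


def keep_sample(sample: dict) -> bool:
    """True if the sample is from a recent dump and is short enough."""
    year = dump_year(sample.get("dump", ""))
    if year is None or year < MIN_YEAR:
        return False
    # A missing token_count is treated as too long, so the sample is dropped.
    return sample.get("token_count", MAX_TOKENS + 1) <= MAX_TOKENS
```

A named predicate like this is easier to test than an inline lambda, and it could be passed to a streamed dataset's `filter` method in place of one.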

In [2]:
# Stream the 10B-token sample of HF's fineweb-edu: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
from datasets import load_dataset
import pandas as pd

# docs for working w/ streamed data from HF: https://huggingface.co/docs/datasets/v1.11.0/dataset_streaming.html
fw = load_dataset('HuggingFaceFW/fineweb-edu', name='sample-10BT', split='train', streaming=True)

# print an example from the dataset - note it's returned as an iterable since we're in streaming mode
# print(next(iter(fw)))

# print # of shards - there are 14 here; these are like data groups
# print(fw.n_shards)

# years of interest: keep only dumps from 2020 onward; the fineweb docs note that, generally,
# newer dumps result in better benchmark performance, which is the rationale here
yoi = ['2020', '2021', '2022', '2023', '2024']

# keep samples from the years of interest that are at most 1000 tokens long
under_thou = fw.filter(
    lambda sample: any(year in sample['dump'] for year in yoi) and sample['token_count'] <= 1000
)

# pull the first matching sample; this can be slow, since the stream is scanned until a sample passes the filter
next(iter(under_thou))

# filtering has to happen on the stream itself (at the fw stage), before samples are materialized,
# rather than after an unfiltered take like the one below
# samples = pd.DataFrame(list(fw.take(1000)))

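Because a filtered stream is scanned lazily, pulling samples from it can run for a long time when matches are rare. One way to keep this bounded is to cap both the number of samples kept and the number scanned. The helper below is a hedged sketch: `take_filtered` and its parameters are hypothetical names, and it works on any Python iterable of dict samples rather than on the HF dataset specifically.

```python
from itertools import islice
from typing import Callable, Iterable, List


def take_filtered(
    stream: Iterable[dict],
    predicate: Callable[[dict], bool],
    n: int,
    max_scanned: int,
) -> List[dict]:
    """Collect up to n samples that pass predicate, scanning at most max_scanned."""
    kept: List[dict] = []
    if n <= 0:
        return kept
    # islice stops the scan after max_scanned samples, even if fewer than n matched.
    for sample in islice(stream, max_scanned):
        if predicate(sample):
            kept.append(sample)
            if len(kept) >= n:
                break
    return kept


# Stand-in stream for illustration only; real samples would come from the dataset.
fake_stream = (
    {"dump": f"CC-MAIN-{2018 + i % 6}-10", "token_count": 400 + 100 * i}
    for i in range(20)
)
subset = take_filtered(
    fake_stream,
    lambda s: s["dump"] >= "CC-MAIN-2020" and s["token_count"] <= 1000,
    n=3,
    max_scanned=10,
)
```

The resulting list of dicts can then be passed to `pd.DataFrame` for inspection.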