---
title: "Filtering FineWeb2 using Polars"
description: "Using Polars to filter the FineWeb2 dataset and other large Hugging Face datasets"
author: "Daniel van Strien"
date: "2024-12-30"
categories: ["polars", "huggingface"]
image: https://huggingface.co/datasets/HuggingFaceFW/admin/resolve/main/fineweb-2-logo.png
toc-depth: 3
---

Recently [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) was released. FineWeb2 builds on the previous [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) dataset to add data for many languages. Building on this work we recently launched a community effort to build educational quality filters for many languages. See this [blog post](https://huggingface.co/blog/davanstrien/fineweb2-community) for more details. 

### Filtering the FineWeb2 dataset to improve educational quality and/or language identification

One of the goals of the FineWeb-c project is to build educational quality filters for many languages. To do this, the community has been annotating the data with educational quality scores. So far, the majority of the datasets for each language consist of a random sample of 1,000 examples from FineWeb2 for that language. However, for some languages, the community has found that:
- the language identification is not always correct
- the educational quality is very low in the sample 

For these languages, we want to enable the community to create extra filters to either help with the language identification or to filter for educational quality. This blog post shows some ways in which you can use Polars to filter the FineWeb2 dataset for higher educational quality and/or for the correct language.


First we'll install the necessary libraries. We'll use `polars` for the data manipulation and `huggingface_hub` to interact with the Hugging Face Hub. The `dask` library is another good option for working with large datasets.


In [3]:
# %pip install polars huggingface_hub tld rich tqdm --upgrade

In [31]:
from huggingface_hub import list_repo_files, hf_hub_download
import polars as pl
from tld import get_tld
from pathlib import Path
from tqdm.auto import tqdm
import os

In [32]:
# increase amount of data polars shows
pl.Config.set_tbl_rows(100)

polars.config.Config

Many large datasets on the Hub are organised into different configurations. These configurations are often named after the language they are in. For example, the FineWeb2 dataset is organised by language and script. Many large datasets will either have structured folders or file names that can be used to filter the dataset. Let's look at the FineWeb2 dataset. We can use the wonderful `huggingface_hub` library to list the files in a repository.


In [33]:
paths = list_repo_files("HuggingFaceFW/fineweb-2", repo_type="dataset")
paths[:10]

['.gitattributes',
 'README.md',
 'data/aai_Latn/test/000_00000.parquet',
 'data/aai_Latn/train/000_00000.parquet',
 'data/aai_Latn_removed/train/000_00000.parquet',
 'data/aak_Latn/test/000_00000.parquet',
 'data/aak_Latn/train/000_00000.parquet',
 'data/aak_Latn_removed/train/000_00000.parquet',
 'data/aau_Latn/test/000_00000.parquet',
 'data/aau_Latn/train/000_00000.parquet']

You can see we have a hidden `.gitattributes` file, a `README.md`, and a `data` directory containing parquet files organised into different subdirectories. Since these names are very clear, we can create a very simple filter to get the Scots language files we're interested in. We'll look for `sco` in the file path, make sure it ends with `parquet`, and make sure it doesn't have `removed` in the path, since these are files that were removed in the FineWeb2 filtering process.


In [34]:
scots = [
    f for f in paths if ("sco" in f and f.endswith("parquet") and "removed" not in f)
]
scots

['data/sco_Latn/test/000_00000.parquet',
 'data/sco_Latn/train/000_00000.parquet']
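
Substring matching works here, but a short code like `sco` could in principle also appear inside an unrelated path. As a minimal sketch, assuming the `data/<config>/<split>/<file>.parquet` layout seen in the listing above, we could match the configuration directory exactly instead. The helper name `files_for_config` and the toy `example_paths` list are made up for illustration.

```python
from pathlib import PurePosixPath


def files_for_config(paths, config, split="train"):
    """Keep parquet paths laid out as data/<config>/<split>/<file>.parquet."""
    selected = []
    for path in paths:
        parts = PurePosixPath(path).parts
        if (
            len(parts) == 4
            and parts[0] == "data"
            and parts[1] == config  # exact match, so "sco_Latn_removed" is skipped
            and parts[2] == split
            and parts[3].endswith(".parquet")
        ):
            selected.append(path)
    return selected


# Toy listing that mimics the repository layout (not the full file list)
example_paths = [
    ".gitattributes",
    "README.md",
    "data/aai_Latn/train/000_00000.parquet",
    "data/aai_Latn_removed/train/000_00000.parquet",
    "data/sco_Latn/test/000_00000.parquet",
    "data/sco_Latn/train/000_00000.parquet",
]

print(files_for_config(example_paths, "sco_Latn"))
```

With the real listing you would pass `paths` instead of `example_paths`.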

## Loading the data in Polars

We can directly load the data from the Hugging Face Hub using the `hf://` protocol. In this case we'll just load the `train` file for the scots language. We'll use `read_parquet` to load the data for now but we'll see below a better way to load the data if you are working with large datasets.


In [35]:
df = pl.read_parquet(f"hf://datasets/HuggingFaceFW/fineweb-2/{scots[-1]}")


Let's take a look at the data. We can see we have a number of columns including the actual text but also some other metadata fields that could be useful for filtering.

In [36]:
df.head(5)

text,id,dump,url,date,file_path,language,language_score,language_script,minhash_cluster_size,top_langs
str,str,str,str,str,str,str,f64,str,i64,str
"""2010 All Ford Mustangs Car Sho…","""<urn:uuid:06f10aff-f1da-4d33-b…","""CC-MAIN-2013-20""","""http://www.allfordmustangs.com…","""2013-05-23T16:34:05Z""","""s3://commoncrawl/crawl-data/CC…","""sco""",0.764794,"""Latn""",1258,"""{""sco_Latn_score"": 0.764793634…"
"""Interested in France? We'll se…","""<urn:uuid:abc6bfe8-7af5-40b9-9…","""CC-MAIN-2013-20""","""http://www.tripadvisor.com/All…","""2013-05-23T16:36:10Z""","""s3://commoncrawl/crawl-data/CC…","""sco""",0.651096,"""Latn""",12,"""{""sco_Latn_score"": 0.651095628…"
"""Sherlock Holmes Sherlock Holme…","""<urn:uuid:11ceff04-f5f5-418c-8…","""CC-MAIN-2014-10""","""http://sco.wikipedia.org/wiki/…","""2014-03-08T05:12:30Z""","""s3://commoncrawl/crawl-data/CC…","""sco""",1.000008,"""Latn""",58,"""{""sco_Latn_score"": 1.000008225…"
"""Munster History[eedit | eedit …","""<urn:uuid:5fd5fa85-72b1-43d3-b…","""CC-MAIN-2014-15""","""http://sco.wikipedia.org/wiki/…","""2014-04-19T09:31:48Z""","""s3://commoncrawl/crawl-data/CC…","""sco""",1.00001,"""Latn""",79,"""{""sco_Latn_score"": 1.000009536…"
"""Snawbuirdin Frae Wikipedia Sna…","""<urn:uuid:72c97fcb-4820-4a52-b…","""CC-MAIN-2014-15""","""http://sco.wikipedia.org/wiki/…","""2014-04-19T09:31:00Z""","""s3://commoncrawl/crawl-data/CC…","""sco""",1.00001,"""Latn""",66,"""{""sco_Latn_score"": 1.000010013…"


We can do some simple EDA style analysis if we want. For example, we can look at the distribution of the language scores.


In [37]:
df.select(pl.col("language_score")).describe()

statistic,language_score
str,f64
"""count""",75821.0
"""null_count""",0.0
"""mean""",0.537262
"""std""",0.214123
"""min""",0.300002
"""25%""",0.371339
"""50%""",0.465798
"""75%""",0.634602
"""max""",1.00001


We can also extract the year from the `dump` column, group by that year, take the mean language score, and plot a bar chart to see if there is a trend over time.


In [38]:
df.with_columns(
    pl.col("dump").str.extract(r"(\d{4})").cast(pl.Utf8).alias("year")
).group_by("year").agg(pl.col("language_score").mean()).sort(
    "year", descending=True
).plot.bar(x="year", y="language_score")


## Heuristics for filtering for higher educational quality in FineWeb2 

Whilst the authors of FineWeb2 aimed to do general quality filtering, there are often additional heuristics that can be used to filter for higher educational quality. For example, we can use the `tld` to filter for higher quality websites. We can also use the `url` to filter for higher quality websites. Many of these heuristics will require some domain knowledge of a particular language and the web ecosystem for that language.

The top-level domain (tld) is a useful heuristic for finding higher quality websites. Strictly speaking, the top-level domain is the part of the hostname after the last dot. For example, the tld of `https://www.wikipedia.org/` is `org`. It often corresponds to a country or a type of organization. The `tld` library also returns multi-part suffixes such as `ac.uk`, which is used by UK academic institutions. We can use this to filter for higher quality websites.
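
To illustrate the idea, here is a deliberately naive sketch using only the standard library. It is not how the `tld` library works internally: the `MULTI_PART_SUFFIXES` set is a tiny hand-picked assumption, whereas a real implementation relies on a full list of public suffixes. The function name `naive_suffix` is made up.

```python
from urllib.parse import urlparse

# Hand-picked for illustration only; real suffix lists are far longer
MULTI_PART_SUFFIXES = {"ac.uk", "org.uk", "co.uk"}


def naive_suffix(url):
    """Return a rough domain suffix for a URL, or None if no hostname is found."""
    host = urlparse(url).hostname
    if host is None:  # e.g. the URL has no scheme such as "https://"
        return None
    labels = host.lower().split(".")
    last_two = ".".join(labels[-2:])
    if last_two in MULTI_PART_SUFFIXES:
        return last_two
    return labels[-1]


print(naive_suffix("https://www.wikipedia.org/"))
print(naive_suffix("https://www.abdn.ac.uk/elphinstone/kist/"))
```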


We can do this by mapping the `url` column to the tld and then filtering for the tlds we're interested in. Let's add a new column with the tld and then filter for the tlds we're interested in.

In [39]:
df = df.with_columns(
    pl.col("url").map_elements(lambda x: get_tld(x), return_dtype=pl.Utf8).alias("tld")
)

In [40]:
import altair as alt

df.select("tld").to_series().value_counts(sort=True).sort(
    "count", descending=True
).head(20).plot.bar(
    x=alt.X("tld", sort="-y"),  # Sort x-axis based on y values in descending order
    y="count",
)

We may already have some knowledge or intuitions about the tlds that are more likely to be higher quality. For example, `.us` appears relatively often. This is likely partly because this domain is more common on the Web generally. We may also see some personal blogs using this domain. Let's take a look at a few examples.


In [41]:
df.filter(pl.col("tld").str.contains("us")).sort(
    "language_score", descending=True
).select("url").to_series().to_list()[:30]

['https://coremc.us/forno-microonde-incasso.html',
 'https://www.awesomedownloadfilestoday.us/1141-haircut-places-near-my-location.html',
 'https://www.awesomedownloadfilestoday.us/1376-haircut-near-my-location.html',
 'https://www.awesomedownloadfilestoday.us/1857-short-haircuts-for-fine-straight-hair.html',
 'https://www.awesomedownloadfilestoday.us/2081-twa-styles-4c-hair.html',
 'http://winserver.us/mid-century-modern-front-door-colors/mid-century-modern-front-door-colors-mid-century-modern-front-doors-door-colors-handles-mi-mid-century-modern-front-door-colours/',
 'https://www.awesomedownloadfilestoday.us/3450-hair-styles-for-thick-short-hair.html',
 'https://notwttodaytes.us/casa-mezcal-mexican-grill-cantina.html',
 'https://www.awesomedownloadfilestoday.us/1857-short-haircuts-for-fine-straight-hair.html',
 'https://www.awesomedownloadfilestoday.us/1737-short-haircuts-for-curly-thick-hair.html',
 'http://uggbootsclearanceoutlet.us/jaguar-xj-sport-2003-2003-jaguar-xj-car-for-sale

These don't look super promising! One domain where we might expect higher quality text for Scots is the `.scot` domain, which is a domain for websites relating to Scotland.


In [42]:
df.filter(pl.col("tld").str.contains("sco")).sort(
    "language_score", descending=True
).select("url").to_series().to_list()[:30]

['https://stormplay.scot/sco/aboot.html',
 'https://www.makforrit.scot/2020/08/29/anent-the-scots-wikipedia-an-sundays-editathon/',
 'https://www.makforrit.scot/2018/12/23/daein-it-yersel/',
 'https://www.makforrit.scot/2019/09/22/uisin-oor-vyce-hou-we-can-gar-political-action-on-scots-inevitable/',
 'https://www.makforrit.scot/2018/02/03/naewey-tae-bide/',
 'https://www.makforrit.scot/',
 'https://www.makforrit.scot/2018/01/27/than-an-nou-poverty-makkin-dae-an-leukin-out-for-ilk-ither/',
 'https://www.makforrit.scot/',
 'https://salvo.scot/the-scottis-constitutional-covin/',
 'https://amylord.scot/gd/hello-welcome/',
 'https://www.makforrit.scot/category/scotland/',
 'https://projects.handsupfortrad.scot/scotslanguageawards/gies-a-scots-phrase-day-2021/',
 'https://scoblog.stormplay.scot/t3ngist-is-gaunae-need-tae-be-delayed.html',
 'https://www.makforrit.scot/2019/10/29/halloween/',
 'https://www.makforrit.scot/category/history/',
 'https://www.makforrit.scot/2018/11/19/three-days-in

Even inside these URLs we can see some scots language so this is promising. 

One of the issues with some of the Scots data in FineWeb2 is that it is in the wrong language. One way we can try to get a sense of where better language data might be in FineWeb2 is to look at the tlds that have the highest language scores. We can do this by grouping by tld and then taking the mean of the language scores. We can then keep only the tlds that have more than 50 rows to make sure we're considering tlds with a reasonable amount of data.


In [43]:
(
    df.group_by("tld")
    .agg(
        [
            pl.col("language_score").count().alias("count"),
            pl.col("language_score").mean().alias("language_score"),
        ]
    )
    .filter(pl.col("count") > 50)  # keep tlds with more than 50 rows; adjust the threshold as needed
    .sort("language_score", descending=True)
)

tld,count,language_score
str,u32,f64
"""scot""",102,0.998978
"""ac.uk""",255,0.95732
"""org.uk""",267,0.926128
"""org""",8806,0.814764
"""co.uk""",659,0.770529
"""blogspot.com""",561,0.65765
"""top""",85,0.581157
"""eu""",275,0.558302
"""de""",362,0.544635
"""club""",807,0.543638


We can see some other potentially promising tlds. For example, `ac.uk` is the UK's higher education domain. We can take a look at the urls that have this tld.

In [44]:
df.filter(pl.col("tld").str.contains("ac.uk")).sort(
    "language_score", descending=True
).select("url").to_series().to_list()[:30]

['https://www.scottishcorpus.ac.uk/document/?documentid=1699',
 'https://www.scottishcorpus.ac.uk/document/?documentid=1759',
 'https://www.abdn.ac.uk/elphinstone/kist/search/display.php?sblk65.dat',
 'https://www.abdn.ac.uk/elphinstone/kist/display/folk-history/357/',
 'https://scotslanguagepolicy.ac.uk/warkshoaps/',
 'https://scotslanguagepolicy.ac.uk/survey-final-weekend/',
 'http://www.abdn.ac.uk/elphinstone/kist/search/display.php?fhrg01.dat',
 'https://www.abdn.ac.uk/elphinstone/kist/display/761/',
 'https://scotslanguagepolicy.glasgow.ac.uk/hae-yer-say/',
 'http://www.abdn.ac.uk/elphinstone/kist/search/display.php?lwee66.dat',
 'https://scotslanguagepolicy.ac.uk/jist-fir-burns-nicht/',
 'https://scotslanguagepolicy.ac.uk/aboot/',
 'https://www.scottishcorpus.ac.uk/document/?documentid=122',
 'http://www.abdn.ac.uk/elphinstone/kist/search/display.php?bgre04.dat',
 'http://www.abdn.ac.uk/elphinstone/kist/search/display.php?arob01.dat',
 'https://www.scottishcorpus.ac.uk/document/?

In this case using some EDA and domain knowledge we can filter for the tlds which are likely to be:

- in the scots language
- higher quality educational websites

We can reduce the FineWeb2 dataset to only include the rows that have these tlds.


In [45]:
good_tlds = ["scot", "ac.uk", "org.uk", "org"]  # is_in matches exactly, so use "scot" rather than "sco"

In [46]:
df.filter(pl.col("tld").is_in(good_tlds)).sort("language_score", descending=True).head(
    5
)

text,id,dump,url,date,file_path,language,language_score,language_script,minhash_cluster_size,top_langs,tld
str,str,str,str,str,str,str,f64,str,i64,str,str
"""Snawbuirdin Frae Wikipedia Sna…","""<urn:uuid:72c97fcb-4820-4a52-b…","""CC-MAIN-2014-15""","""http://sco.wikipedia.org/wiki/…","""2014-04-19T09:31:00Z""","""s3://commoncrawl/crawl-data/CC…","""sco""",1.00001,"""Latn""",66,"""{""sco_Latn_score"": 1.000010013…","""org"""
"""Banner o the Sahrawi Arab Demo…","""<urn:uuid:67052692-6020-4870-9…","""CC-MAIN-2014-15""","""http://sco.wikipedia.org/wiki/…","""2014-04-24T06:38:13Z""","""s3://commoncrawl/crawl-data/CC…","""sco""",1.00001,"""Latn""",27,"""{""sco_Latn_score"": 1.000010013…","""org"""
"""Potosí is a ceety an the caipi…","""<urn:uuid:e49b07bb-d7c9-4905-b…","""CC-MAIN-2014-15""","""http://sco.wikipedia.org/wiki/…","""2014-04-21T15:05:27Z""","""s3://commoncrawl/crawl-data/CC…","""sco""",1.00001,"""Latn""",34,"""{""sco_Latn_score"": 1.000010013…","""org"""
"""Port Moresby Port Moresby (Ing…","""<urn:uuid:bb6b995d-b3e8-4dcd-9…","""CC-MAIN-2014-35""","""http://sco.wikipedia.org/wiki/…","""2014-08-30T16:16:49Z""","""s3://commoncrawl/crawl-data/CC…","""sco""",1.00001,"""Latn""",80,"""{""sco_Latn_score"": 1.000010013…","""org"""
"""Seville Seville is a ceety in …","""<urn:uuid:cdcca31a-693e-463b-a…","""CC-MAIN-2014-42""","""http://sco.wikipedia.org/wiki/…","""2014-10-22T21:45:17Z""","""s3://commoncrawl/crawl-data/CC…","""sco""",1.00001,"""Latn""",31,"""{""sco_Latn_score"": 1.000010013…","""org"""


In [47]:
filtered_df = df.filter(pl.col("tld").is_in(good_tlds)).sort(
    "language_score", descending=True
)

We can now save the filtered data to a new file. We'll save the ids of the rows that are in the filtered dataset to a file. These ids can then be used to upload additional filtered data to the Argilla dataset for the language we're working on.

In [48]:
with open("good_ids", "w") as f:
    for id in filtered_df.select("id").to_series().to_list():
        f.write(f"{id}\n")
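
As a minimal sketch of how such a file could be reused later, the ids can be read back into a set for fast membership checks. The ids and the temporary file location below are made up for illustration.

```python
import tempfile
from pathlib import Path

# Made-up ids standing in for the values of the "id" column
ids = ["<urn:uuid:example-1>", "<urn:uuid:example-2>"]

# Write one id per line, as in the cell above (here to a temporary directory)
ids_file = Path(tempfile.mkdtemp()) / "good_ids"
ids_file.write_text("\n".join(ids) + "\n", encoding="utf-8")

# Read the ids back, one per line, into a set
loaded_ids = set(ids_file.read_text(encoding="utf-8").splitlines())
print(len(loaded_ids))
```

A set like `loaded_ids` could then be passed as a list to `pl.col("id").is_in(...)` to select the same rows again.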

## Filtering other languages

We can also use the same techniques to filter other languages. Some languages have a lot of data, so we can use the `scan_parquet` function to create a `LazyFrame`. This avoids loading all the data into memory. In addition, Polars will perform query optimizations on the `LazyFrame`. This makes the code we use for filtering more efficient without much work on our part.


In [49]:
def get_paths_for_language(language: str):
    return [
        path
        for path in list_repo_files("HuggingFaceFW/fineweb-2", repo_type="dataset")
        if path.endswith("parquet")
        and "removed" not in path
        and "train" in path
        and language in path
    ]


## Filtering with a higher language score

Some texts in FineWeb2 are not assigned the correct language. Language identification is still not a "solved" problem, but we may be able to use a higher confidence threshold to get a set of data that is more likely to be in the correct language. We can then label this data for the educational quality of the text without having to remove as many examples for being in the wrong language.

In [50]:
paths = get_paths_for_language("asm")
paths

['data/asm_Beng/train/000_00000.parquet',
 'data/asm_Latn/train/000_00000.parquet']

Let's load the data for the Assamese language using only one `train` file. Note that `paths[-1]` is the Latin-script file (`asm_Latn`).



In [51]:
df = pl.read_parquet(f"hf://datasets/HuggingFaceFW/fineweb-2/{paths[-1]}")

In [52]:
df.shape

(1104, 11)

We can use the `describe` function to get a sense of the distribution of the language scores.


In [53]:
df.select("language_score").describe()

statistic,language_score
str,f64
"""count""",1104.0
"""null_count""",0.0
"""mean""",0.829071
"""std""",0.231866
"""min""",0.303687
"""25%""",0.660899
"""50%""",0.970777
"""75%""",0.995925
"""max""",0.999965


You can see that the language scores are spread quite widely: the mean is around 0.83, while the minimum is close to 0.30. We might be able to get a better subset of data by filtering for a higher language score. Let's take a look at some examples of the text that have a high language score. This can help give us a sense of what threshold might produce fewer false positives.


In [33]:
from rich import print as rprint

examples_to_show = 3

rprint(
    df.filter(pl.col("language_score") > 0.9)
    .head(examples_to_show)
    .select("text")
    .to_series()
    .to_list()
)


Once we've found a threshold that seems to give better results, we can filter on it. For example, we can keep only rows with a language score greater than 0.95.



In [35]:
df_filtered = df.filter(pl.col("language_score") > 0.95)
df_filtered.shape


(697, 11)

In [36]:
with open("good_ids", "w") as f:
    for id in df_filtered.select("id").to_series().to_list():
        f.write(f"{id}\n")


## Filtering bigger languages

Some languages have a lot of data and so we can use the `scan_parquet` function to create a `LazyFrame`. Let's see how we can do this for the Japanese language.

In [54]:
paths = get_paths_for_language("jpn")
len(paths)

148

You can see here we have many more files. If you have a lot of memory, you could use the standard `read_parquet` function. However, if you don't have a lot of memory, you can use the `scan_parquet` function. This builds a lazy query and only reads data when the query is collected, which is more memory efficient. Even so, we might want to start with a subset of the data to experiment with and then work with the full dataset once we're confident in our filtering.


In [38]:
import random

random.seed(42)

sample_paths = random.choices(paths, k=2)
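
Note that `random.choices` samples with replacement, so it can occasionally pick the same file twice. If distinct files are wanted, `random.sample` is an alternative. The sketch below uses a made-up list of paths and a local `random.Random` instance so it doesn't disturb the global seed.

```python
import random

# Made-up file names that mimic the repository layout
toy_paths = [f"data/jpn_Jpan/train/{i:03d}_00000.parquet" for i in range(10)]

rng = random.Random(42)

# sample() draws without replacement, so the two paths are always different
distinct_sample = rng.sample(toy_paths, k=2)
print(distinct_sample)
```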

In [39]:
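# Optional speed-up: this needs the `hf_transfer` package, and huggingface_hub
# reads the variable at import time, so set it before the import for it to take effect
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"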

Path("temp_data").mkdir(exist_ok=True)

for path in tqdm(sample_paths):
    hf_hub_download(
        repo_id="HuggingFaceFW/fineweb-2",
        repo_type="dataset",
        filename=path,
        local_dir="temp_data",
    )


100%|██████████| 2/2 [00:00<00:00,  5.87it/s]


In [26]:
df = pl.scan_parquet("temp_data/**/*.parquet")
df.head(5).collect()

text,id,dump,url,date,file_path,language,language_score,language_script,minhash_cluster_size,top_langs
str,str,str,str,str,str,str,f64,str,i64,str
"""欲しかった車を探せるサイト 独身時代は、ただ乗れればいいと思…","""<urn:uuid:9221bbac-4ab3-4d7b-9…","""CC-MAIN-2013-20""","""http://careerspaceezine.com/""","""2013-05-20T01:19:20Z""","""s3://commoncrawl/crawl-data/CC…","""jpn""",1.000009,"""Jpan""",1,"""{""jpn_Jpan_score"": 1.000009059…"
""" ふくむすめどうわしゅう(Hukumusume fairy…","""<urn:uuid:d03fc65f-99bb-4095-b…","""CC-MAIN-2013-20""","""http://hukumusume.com/douwa/En…","""2013-05-20T01:18:14Z""","""s3://commoncrawl/crawl-data/CC…","""jpn""",0.992212,"""Jpan""",2,"""{""jpn_Jpan_score"": 0.992212295…"
"""家電通信をお届けします 家電は一度購入したら、何年も使い続け…","""<urn:uuid:89b3dae5-8a49-4d51-a…","""CC-MAIN-2013-20""","""http://wnclivehosting.com/inde…","""2013-05-20T01:57:39Z""","""s3://commoncrawl/crawl-data/CC…","""jpn""",1.00001,"""Jpan""",1,"""{""jpn_Jpan_score"": 1.000010013…"
"""出版社からのコメント MovableTypeの特徴のひとつと…","""<urn:uuid:84019b07-0424-4d79-b…","""CC-MAIN-2013-20""","""http://www.amazon.co.jp/MOVABL…","""2013-05-20T01:50:55Z""","""s3://commoncrawl/crawl-data/CC…","""jpn""",1.000009,"""Jpan""",2,"""{""jpn_Jpan_score"": 1.000008940…"
"""FrontPage 私も結婚することで、今の保険に入ろうか考…","""<urn:uuid:3fc5c2a5-c3a7-409c-b…","""CC-MAIN-2013-20""","""http://www.christian-louboutin…","""2013-05-20T01:59:19Z""","""s3://commoncrawl/crawl-data/CC…","""jpn""",1.00001,"""Jpan""",15,"""{""jpn_Jpan_score"": 1.000009894…"


In [27]:
df.select("language_score").describe()

statistic,language_score
str,f64
"""count""",33735000.0
"""null_count""",0.0
"""mean""",0.999791
"""std""",0.002776
"""min""",0.886358
"""25%""",0.999996
"""50%""",1.000007
"""75%""",1.000009
"""max""",1.00001


In [28]:
df.filter(pl.col("url").str.contains("wikipedia")).count().collect(streaming=True)

text,id,dump,url,date,file_path,language,language_score,language_script,minhash_cluster_size,top_langs
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
55053,55053,55053,55053,55053,55053,55053,55053,55053,55053,55053


In [29]:
japanese_edu_domains = [
    "http://www.asagaku.com/",
    "www3.nhk.or.jp/news/easy/",
    "http://kids.yahoo.co.jp/",
]

In [30]:
df.filter(pl.col("url").is_in(japanese_edu_domains)).count().collect(streaming=True)

text,id,dump,url,date,file_path,language,language_score,language_script,minhash_cluster_size,top_langs
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
3,3,3,3,3,3,3,3,3,3,3


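We'd obviously want to expand this list to include more domains. Note also that `is_in` only keeps rows whose `url` exactly equals one of the listed strings, which is why only 3 rows match here; matching on the domain instead would likely find more pages. Still, you can see how we can use the same techniques to filter very large datasets without running out of memory.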