## Demo for the Hugging Face OpenLLM-France wikimedia dataset collection

- You can stream the collection or download it locally.
  - See here for the steps with `datasets`: https://huggingface.co/datasets/OpenLLM-France/wikimedia?library=datasets
  - Find the full list of languages here: https://huggingface.co/datasets/OpenLLM-France/wikimedia
  ```text
    language          # pages    # words     # characters
    -----------------------------------------------------
    en (English)      16.46 M    6.93 B      39.97 B
    fr (French)       9.66 M     3.07 B      18.00 B
    de (German)       4.56 M     2.21 B      14.83 B
    es (Spanish)      3.06 M     1.56 B      9.07 B
    it (Italian)      2.75 M     1.48 B      8.86 B
    nl (Dutch)        3.16 M     734.36 M    4.40 B
    pt (Portuguese)   1.76 M     710.99 M    4.06 B
    ca (Catalan)      1.44 M     564.51 M    3.33 B
    ar (Arabic)       1.46 M     562.65 M    3.22 B
  ```
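If only a sample of pages is needed, streaming avoids the full download. The following is a minimal sketch, assuming the `streaming=True` option of `datasets.load_dataset`. The helper names `stream_wikimedia` and `take_rows` are hypothetical, and the import sits inside the function only so that `take_rows` can be tried without the library installed.

```python
from itertools import islice


def stream_wikimedia(config="en"):
    """Return an iterable over the train split without downloading it first."""
    # Imported lazily so take_rows() below can be used on any iterable.
    from datasets import load_dataset
    return load_dataset("OpenLLM-France/wikimedia", config,
                        split="train", streaming=True)


def take_rows(rows, n):
    """Collect the first n rows from any iterable of row dicts."""
    return list(islice(rows, n))


# Usage (needs network access and the `datasets` package):
# for row in take_rows(stream_wikimedia("fr"), 3):
#     print(row["title"], row["url"])
```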

In [88]:
from datasets import load_dataset
# Login using e.g. `huggingface-cli login` to access this dataset

In [None]:
ds = load_dataset("OpenLLM-France/wikimedia", "en")
# save ds to disk locally (Arrow format, not JSON) so I don't have to redownload
ds.save_to_disk('wikimedia_en')

Downloading data: 100%|██████████| 496/496 [16:28<00:00,  1.99s/files]
Generating train split: 100%|██████████| 16458534/16458534 [00:54<00:00, 301123.97 examples/s] 


In [7]:
ds = load_dataset("OpenLLM-France/wikimedia", "fr")
ds.save_to_disk('wikimedia_fr')


Downloading data: 100%|██████████| 190/190 [07:12<00:00,  2.28s/files]
Generating train split: 100%|██████████| 9658605/9658605 [00:27<00:00, 353797.93 examples/s] 
Saving the dataset (40/40 shards): 100%|██████████| 9658605/9658605 [00:55<00:00, 174108.84 examples/s]


In [8]:
ds = load_dataset("OpenLLM-France/wikimedia", "es")
ds.save_to_disk('wikimedia_es')


Downloading data: 100%|██████████| 89/89 [03:23<00:00,  2.29s/files]
Generating train split: 100%|██████████| 3057330/3057330 [00:13<00:00, 224644.76 examples/s]
Saving the dataset (20/20 shards): 100%|██████████| 3057330/3057330 [00:06<00:00, 506896.48 examples/s] 


### Searching from local datasets


In [89]:
ds_en = load_dataset("./wikimedia_en")


In [11]:
ds_fr = load_dataset("./wikimedia_fr")
ds_es = load_dataset("./wikimedia_es")

Downloading data: 100%|██████████| 40/40 [00:00<00:00, 43007.48files/s]
Generating train split: 9658605 examples [00:16, 595647.84 examples/s] 
Downloading data: 100%|██████████| 20/20 [00:00<00:00, 158875.15files/s]
Generating train split: 3057330 examples [00:07, 386305.80 examples/s] 
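Directories written by `save_to_disk` can also be reopened with `load_from_disk`, which reads the saved Arrow files directly instead of generating a split again. The following is a sketch under that assumption; `open_saved` is a hypothetical helper name.

```python
import os


def open_saved(path):
    """Reopen a directory previously written with `save_to_disk`."""
    if not os.path.isdir(path):
        raise FileNotFoundError(f"No saved dataset directory at {path!r}")
    # Imported lazily so the path check works even without `datasets` installed.
    from datasets import load_from_disk
    return load_from_disk(path)


# Usage, with the directories saved earlier:
# ds_en = open_saved("./wikimedia_en")
# print(len(ds_en["train"]))
```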


In [90]:
# Dictionary mapping language codes to the word for 'bank' in each language
vehicle_dict = {
    "en": "bank",
    "fr": "banque",
    "es": "banco"
}
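
One way to use such a mapping is to pair each language code with its saved directory and search term before running one search per language. The following is a sketch; `plan_searches` is a hypothetical helper, and the `./wikimedia_<code>` naming pattern is an assumption based on the directories used in this notebook.

```python
def plan_searches(keywords, dir_template="./wikimedia_{lang}"):
    """Pair every language code with its data directory and keyword."""
    return [
        {"lang": lang, "data_dir": dir_template.format(lang=lang), "keyword": word}
        for lang, word in keywords.items()
    ]


keywords = {"en": "bank", "fr": "banque", "es": "banco"}
for job in plan_searches(keywords):
    print(f"{job['lang']}: search {job['data_dir']} for '{job['keyword']}'")
```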

### Tools for filtering a local wikimedia dump

- The data directories are `./wikimedia_en`, `./wikimedia_fr`, and `./wikimedia_es`.

In [91]:
# Notebook-friendly batched search over Hugging Face `datasets` datasets saved on disk.
# This cell only sets up the imports and the list of on-disk locations; the search itself
# runs on an already-loaded dataset such as `ds_en`. Adjust the parameters below and re-run.
import os
import re
import time
from datasets import load_from_disk
from tqdm.auto import tqdm

# Parameters - edit as needed
DATA_DIRS = [
    "./wikimedia_en",
    "./wikimedia_fr",
    "./wikimedia_es"
]


In [92]:
len(ds_en['train'])

16458534

In [102]:
# Read a random sample row from ds_en to see which keys are available
import random

ds_en['train'][random.randint(0, len(ds_en['train']) - 1)]


{'id': 3055983,
 'title': 'Wigan Casino',
 'url': 'https://en.wikipedia.org/wiki/Wigan_Casino',
 'language': 'en',
 'source': 'wikipedia',
 'text': '# Wigan Casino\n\nThe Wigan Casino is the colloquial name for the nightclub the Casino Club, that operated in Wigan between Friday, August 27 1965 (with Shirley Bassey topping the bill) and 1981, associated with the Northern Soul movement in the UK. The club\'s enduring dedication to Northern Soul "all nighters" made it an icon among fans of the genre, continuing the efforts that other clubs such as the Twisted Wheel in Manchester, the Chateau Impney (Droitwich), the Catacombs (Wolverhampton) and the Golden Torch (Tunstall, Stoke-on-Trent) had started. It remains one of the most famous clubs in Northern England. In 1978, allegedly the American music magazine Billboard voted Wigan Casino "The Best Disco in the World", ahead of New York City\'s Studio 54, although there is no tangible evidence of this award ever being publicised.\nThis Engla

In [79]:
ds_en['train'][random.randint(0, len(ds_en['train']) - 1)]

{'id': 71380719,
 'title': '2022 Bahamas boat capsizing',
 'url': 'https://en.wikipedia.org/wiki/2022_Bahamas_boat_capsizing',
 'language': 'en',
 'source': 'wikipedia',
 'text': '# 2022 Bahamas boat capsizing\n\nOn 24 July 2022, at least 17 people died while on a boat near the Bahamas. The boat was reportedly heading towards Florida, when the boat capsized seven miles off of the coast of New Providence. An additional three people were hospitalized in the capsizing.\n\n## Background\n\nThe Bahamas is a common transit route for Haitians attempting to reach the United States, although extreme conditions and rickety vessels in the Bahamas make traversing through the Bahamas dangerous.\n\n## Incident\n\nThe incident occurred at roughly 1 a.m. EDT, according to Bahamian prime minister Philip Davis. Authorities claim that the boat was carrying between 50 and 60 people, and that passengers paid $3,000 to $8,000 to board the boat.\n\n## Victims\n\nAt least 17 people died in the capsizing. Of t
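
Both sampled rows share the same six keys. The following is a small sketch of a check that a row has them before it is searched; `EXPECTED_KEYS` and `missing_keys` are hypothetical names, and the key set is taken only from the two samples above.

```python
EXPECTED_KEYS = ("id", "title", "url", "language", "source", "text")


def missing_keys(row, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent from a row dict."""
    return [key for key in expected if key not in row]


# Toy row for illustration: the 'text' key is deliberately left out.
row = {"id": 3055983, "title": "Wigan Casino",
       "url": "https://en.wikipedia.org/wiki/Wigan_Casino",
       "language": "en", "source": "wikipedia"}
print(missing_keys(row))
```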

In [None]:
# Field candidates to search for text content (in priority order)
FIELD_CANDIDATES = ['text', 'title', 'content']

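def _extract_text_from_row(row):
    """Return the first non-empty candidate field, else all string values joined."""
    for f in FIELD_CANDIDATES:
        if f in row and row[f]:
            return row[f]
    # Fallback: no candidate field is present, so concatenate every non-empty string value
    parts = []
    for k, v in row.items():
        if isinstance(v, str) and v:
            parts.append(v)
    return "\n".join(parts)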


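def search_saved_dataset(dataset,
                         keyword,
                         batch_size,
                         max_results,
                         context_window,
                         case_insensitive):
    """Scan a dataset batch by batch and print every row whose text contains `keyword`.

    `context_window` is the number of words shown on each side of the match;
    the scan stops early once `max_results` hits have been printed.
    """
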
    # Resolve to a Dataset (select split 'train' if present)
    if isinstance(dataset, dict) or hasattr(dataset, 'keys'):
        split = 'train' if 'train' in dataset else list(dataset.keys())[0]
        dataset = dataset[split]

    flag = re.IGNORECASE if case_insensitive else 0
    pattern = re.compile(re.escape(keyword), flag)

    found = 0
    seen = 0
    start_time = time.time()
    
    # Calculate total number of batches for progress bar
    total_rows = len(dataset)
    total_batches = (total_rows + batch_size - 1) // batch_size
    
    # Create progress bar
    pbar = tqdm(total=total_batches, desc=f"Searching for '{keyword}'", unit="batch")

    for batch in dataset.iter(batch_size=batch_size):
        # batch is a dict of lists
        try:
            batch_len = len(next(iter(batch.values())))
        except StopIteration:
            batch_len = 0
        for i in range(batch_len):
            row = {k: (v[i] if isinstance(v, (list, tuple)) else v) for k, v in batch.items()}
            text = _extract_text_from_row(row)
            if not text:
                continue
            m = pattern.search(text)
            if m:
                seen += 1
                start, end = m.span()
                
                # Locate the whitespace-delimited word that contains the match start.
                # Regex spans keep character offsets exact even when words are
                # separated by newlines or repeated spaces (text.split() loses them).
                word_spans = [w.span() for w in re.finditer(r'\S+', text)]
                match_word_idx = 0
                for idx, (_, word_end) in enumerate(word_spans):
                    if word_end > start:
                        match_word_idx = idx
                        break
                
                # Get context window of words
                start_word_idx = max(0, match_word_idx - context_window)
                end_word_idx = min(len(word_spans), match_word_idx + context_window + 1)
                snippet = ' '.join(text[s:e] for s, e in word_spans[start_word_idx:end_word_idx])
                
                # Highlight every occurrence in the snippet, whatever its case
                highlight = pattern.sub(lambda hit: f"**{hit.group(0)}**", snippet)
                
                # Extract all metadata
                doc_id = row.get('id', None)
                title = row.get('title', '')
                url = row.get('url', '')
                language = row.get('language', '')
                source = row.get('source', '')
                
                print(f"[{seen}] id={doc_id} | title={title} | url={url}")
                print(f"      language={language} | source={source}")
                print(f"{highlight}\n{'-'*100}")
                found += 1
                
                # Update progress bar postfix with found count
                pbar.set_postfix({"found": found})
                
                if found >= max_results:
                    pbar.close()
                    total_time = time.time() - start_time
                    print(f"\nReached max_results={max_results}. Time elapsed: {total_time:.2f}s")
                    return
        
        # Update progress bar after each batch
        pbar.update(1)
    
    pbar.close()
    total_time = time.time() - start_time
    print(f"\nSearch complete. Found {found} hits in {total_time:.2f}s")
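
The word-window step can be tried in isolation on a short string. The sketch below is a standalone illustration rather than part of the search function: `context_snippet` is a hypothetical helper that uses regex word spans so that character offsets stay exact across newlines and repeated spaces.

```python
import re


def context_snippet(text, keyword, window=3, case_insensitive=True):
    """Return up to `window` words either side of the first match, highlighted."""
    flags = re.IGNORECASE if case_insensitive else 0
    pattern = re.compile(re.escape(keyword), flags)
    match = pattern.search(text)
    if match is None:
        return None
    # Spans of whitespace-delimited words, in original character offsets.
    spans = [w.span() for w in re.finditer(r"\S+", text)]
    # The first word whose end lies beyond the match start contains the match.
    hit_idx = next((i for i, (_, end) in enumerate(spans) if end > match.start()), 0)
    lo = max(0, hit_idx - window)
    hi = min(len(spans), hit_idx + window + 1)
    snippet = " ".join(text[s:e] for s, e in spans[lo:hi])
    return pattern.sub(lambda m: f"**{m.group(0)}**", snippet)


sample = "# Gabagool!\n\nGabagool! is an American comic book that began in 2002."
print(context_snippet(sample, "comic", window=2))
```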


In [None]:
KEYWORD = "gabagool"
BATCH_SIZE = 1024*10
MAX_RESULTS = 100  # the recorded run below stopped at max_results=5
CONTEXT_WINDOW = 30  # Number of words before and after the match
CASE_INSENSITIVE = True

In [87]:
search_saved_dataset(dataset=ds_en,
                     keyword=KEYWORD,
                     batch_size=BATCH_SIZE,
                     max_results=MAX_RESULTS,
                     context_window=CONTEXT_WINDOW,
                     case_insensitive=CASE_INSENSITIVE)


Searching for 'gabagool':   6%|▋         | 102/1608 [00:18<04:17,  5.86batch/s, found=1]

[1] id=77346990 | title=List of Chapo Trap House episodes (2016–2020) | url=https://en.wikipedia.org/wiki/List_of_Chapo_Trap_House_episodes_(2016%E2%80%932020)
      language=en | source=wikipedia
2017 (2017-07-06) | Premium; Friedland's 3rd episode | | 132 | 123 | "UBIsoft" | Clio Chang | July 10, 2017 (2017-07-10) | — | | 133 | 124 | "**Gabagool**" | — | July 13, 2017 (2017-07-13) | Premium | | 134 | 125 | "Fast And Furious: Toledo Drifter" | Tim Heidecker | July 15, 2017 (2017-07-15) | Heidecker's
----------------------------------------------------------------------------------------------------


Searching for 'gabagool':   9%|▉         | 143/1608 [00:26<05:25,  4.50batch/s, found=2]

[2] id=7100818 | title=Gabagool! | url=https://en.wikipedia.org/wiki/Gabagool!
      language=en | source=wikipedia
# **Gabagool**! **Gabagool**! is an American comic book that began in 2002. It was created by cartoonist Mike Dawson and humorist Chris Radtke, and concerns the various misadventures of a 30-something super-nerd,
----------------------------------------------------------------------------------------------------


Searching for 'gabagool':  14%|█▍        | 226/1608 [00:51<07:30,  3.07batch/s, found=3]

[3] id=55177852 | title=List of Mr. Pickles and Momma Named Me Sheriff characters | url=https://en.wikipedia.org/wiki/List_of_Mr._Pickles_and_Momma_Named_Me_Sheriff_characters
      language=en | source=wikipedia
was one of the mob's most feared hitmen before he was caught by the police. They forced Vito to rat out the entire **Gabagool**ie criminal organization and he was then placed under different surgeries to portray the eponymous legendary figure by the witness protection program. Tommy and Mr. Pickles encounter Bigfoot in the woods and he agrees to help bake a
----------------------------------------------------------------------------------------------------


Searching for 'gabagool':  20%|█▉        | 317/1608 [01:24<06:45,  3.19batch/s, found=4]

[4] id=54591700 | title=List of SuperMansion episodes | url=https://en.wikipedia.org/wiki/List_of_SuperMansion_episodes
      language=en | source=wikipedia
Ranger admits to as the FBI finds the dummy. When Black Saturn, Brad, Cooch, and Jewbot return, they find that dinner was made because Lucini's cousin Angelo makes the best **gabagool** while Kid Victory has cleared things up with the FBI. As Rex vows not to send the four of them shopping again, Cooch reveals that they are banned from the
----------------------------------------------------------------------------------------------------


Searching for 'gabagool':  22%|██▏       | 356/1608 [01:41<05:56,  3.51batch/s, found=5]

[5] id=1548050 | title=Frank Vincent | url=https://en.wikipedia.org/wiki/Frank_Vincent
      language=en | source=wikipedia
The Sopranos | Phil Leotardo | 31 episodes | | 2008 | Stargate Atlantis | Poker Player #1 | Episode: "Vegas" | | 2014–2016 | Mr. Pickles | Jon **Gabagool**i | 2 episodes Voice | | 2016 | Law & Order: Special Victims Unit | Bishop Cattalano | Episode: "Unholiest Alliance" | | 2017 | Neo Yokio | Uncle Albert |
----------------------------------------------------------------------------------------------------

Reached max_results=5. Time elapsed: 101.43s
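
When the hits are wanted as a dataset rather than as printed output, a batched predicate can be passed to `Dataset.filter`. The following is a sketch, assuming the `text` column shown in the sample rows; `make_batch_predicate` is a hypothetical helper, and the `filter` call is left commented out because it scans the whole split.

```python
import re


def make_batch_predicate(keyword, field="text", case_insensitive=True):
    """Build a function mapping a columnar batch to one boolean per row."""
    flags = re.IGNORECASE if case_insensitive else 0
    pattern = re.compile(re.escape(keyword), flags)

    def predicate(batch):
        # `batch` is a dict of lists, one list per column.
        return [bool(text) and pattern.search(text) is not None
                for text in batch[field]]

    return predicate


demo_batch = {"text": ["Angelo makes the best gabagool", "Wigan Casino", None]}
print(make_batch_predicate("Gabagool")(demo_batch))

# With the dataset loaded earlier (scans the whole split):
# hits = ds_en["train"].filter(make_batch_predicate("gabagool"),
#                              batched=True, batch_size=10_240)
```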



