## Demo for the Hugging Face dataset collection OpenLLM-France/wikimedia

- You can stream the collection or download it locally.
  - For the steps with `datasets`, see: https://huggingface.co/datasets/OpenLLM-France/wikimedia?library=datasets
  - The full list of languages is here: https://huggingface.co/datasets/OpenLLM-France/wikimedia

  ```text
  language          # pages    # words     # characters
  -----------------------------------------------------
  en (English)      16.46 M    6.93 B      39.97 B
  fr (French)       9.66 M     3.07 B      18.00 B
  de (German)       4.56 M     2.21 B      14.83 B
  es (Spanish)      3.06 M     1.56 B      9.07 B
  it (Italian)      2.75 M     1.48 B      8.86 B
  nl (Dutch)        3.16 M     734.36 M    4.40 B
  pt (Portuguese)   1.76 M     710.99 M    4.06 B
  ca (Catalan)      1.44 M     564.51 M    3.33 B
  ar (Arabic)       1.46 M     562.65 M    3.22 B
  ```

In [None]:
from datasets import load_dataset

# Log in first, e.g. with `huggingface-cli login`, to access this dataset


In [3]:
ds = load_dataset("OpenLLM-France/wikimedia", "en")
#ds = load_dataset("OpenLLM-France/wikimedia", "de")
#ds = load_dataset("OpenLLM-France/wikimedia", "it")
#ds = load_dataset("OpenLLM-France/wikimedia", "pt")



Downloading data: 100%|██████████| 496/496 [16:28<00:00,  1.99s/files]
Generating train split: 100%|██████████| 16458534/16458534 [00:54<00:00, 301123.97 examples/s] 


In [4]:
# Save ds to a local folder (Arrow format, not JSON) so I don't have to redownload
ds.save_to_disk('wikimedia_en')

Saving the dataset (85/85 shards): 100%|██████████| 16458534/16458534 [02:05<00:00, 131314.85 examples/s]


In [7]:
ds = load_dataset("OpenLLM-France/wikimedia", "fr")
ds.save_to_disk('wikimedia_fr')


Downloading data: 100%|██████████| 190/190 [07:12<00:00,  2.28s/files]
Generating train split: 100%|██████████| 9658605/9658605 [00:27<00:00, 353797.93 examples/s] 
Saving the dataset (40/40 shards): 100%|██████████| 9658605/9658605 [00:55<00:00, 174108.84 examples/s]


In [8]:
ds = load_dataset("OpenLLM-France/wikimedia", "es")
ds.save_to_disk('wikimedia_es')


Downloading data: 100%|██████████| 89/89 [03:23<00:00,  2.29s/files]
Generating train split: 100%|██████████| 3057330/3057330 [00:13<00:00, 224644.76 examples/s]
Saving the dataset (20/20 shards): 100%|██████████| 3057330/3057330 [00:06<00:00, 506896.48 examples/s] 


### Searching from local datasets


In [None]:
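# Note: folders written by save_to_disk() are normally reloaded with
# datasets.load_from_disk(); load_dataset() regenerates a train split
# from the saved shards, as the log below shows.
ds_en = load_dataset("./wikimedia_en")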


Downloading data: 100%|██████████| 85/85 [00:00<00:00, 154002.52files/s]
Generating train split: 7631570 examples [00:27, 156326.57 examples/s]

In [None]:
ds_fr = load_dataset("./wikimedia_fr")
ds_es = load_dataset("./wikimedia_es")

In [None]:
# Dictionary mapping language codes to the search term (the word for 'bank') in each language
vehicle_dict = {
    "en": "bank",
    "fr": "banque",
    "es": "banco"
}

# Prepare a results dictionary grouped by language code
results_by_lang = {lang: [] for lang in vehicle_dict.keys()}

In [None]:
from itertools import islice

import pandas as pd
from tqdm import tqdm

# Initialize results dictionary with language as top-level key
results = {lang: [] for lang in vehicle_dict.keys()}

# Loop through each language and its corresponding search term
for lang_code, vehicle_term in vehicle_dict.items():
    print(f"🌍 Processing {lang_code.upper()} language - searching for '{vehicle_term}'")

    # Stream the dataset for the current language from the Hub
    ds = load_dataset("OpenLLM-France/wikimedia", lang_code,
        streaming=True,
        split='train'
    )

    # Use islice to take only the first TAKE_SIZE documents from the stream
    TAKE_SIZE = 50000  # or any number you want
    limited_ds = list(islice(ds, TAKE_SIZE))
    
    print(' ✅ Total documents loaded:', len(limited_ds))
    print(f"📊 Processing {len(limited_ds)} documents for {lang_code}")
    
    
    # Process each document and create DataFrames, filtering actual occurrences
    lang_results = []
    for data in tqdm(limited_ds, desc=f"Processing {lang_code}"):
        text_content = data.get('text', '')
        # Skip documents that do not contain the search term (case-insensitive substring match)
        if vehicle_term.lower() not in text_content.lower():
            continue
        item_id = data['id']
        
        # Remove the 'id' key for the DataFrame
        data_clean = {k: v for k, v in data.items() if k != 'id'}
        
        # Convert to DataFrame (single row)
        df = pd.DataFrame([data_clean])
        lang_results.append(df)
    
    # Store all DataFrames for this language
    results[lang_code] = lang_results
    print(f"✅ Completed {lang_code}: {len(lang_results)} DataFrames stored (filtered for '{vehicle_term}')")

print(f"🎉 All languages processed! Results structure: {[f'{k}: {len(v)} documents' for k, v in results.items()]}")
print('Number of languages:', len(results.keys()))

for lang, docs in results.items():
    print(f'number of {lang.upper()} hits for {vehicle_dict[lang.lower()]}:', len(docs))


🌍 Processing EN language - searching for 'bank'
 ✅ Total documents loaded: 50000
📊 Processing 50000 documents for en


Processing en: 100%|██████████| 50000/50000 [00:00<00:00, 84307.92it/s] 


✅ Completed en: 2194 DataFrames stored (filtered for 'bank')
🌍 Processing FR language - searching for 'banque'
 ✅ Total documents loaded: 50000
📊 Processing 50000 documents for fr


Processing fr: 100%|██████████| 50000/50000 [00:00<00:00, 112360.29it/s]


✅ Completed fr: 649 DataFrames stored (filtered for 'banque')
🌍 Processing ES language - searching for 'banco'
 ✅ Total documents loaded: 50000
📊 Processing 50000 documents for es


Processing es: 100%|██████████| 50000/50000 [00:00<00:00, 102085.37it/s]

✅ Completed es: 793 DataFrames stored (filtered for 'banco')
🎉 All languages processed! Results structure: ['en: 2194 documents', 'fr: 649 documents', 'es: 793 documents']
Number of languages: 3
number of EN hits for bank: 2194
number of FR hits for banque: 649
number of ES hits for banco: 793
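
The substring test used in the loop also counts longer words that merely contain the term, for example "banker" or "embankment" when searching for `bank`. If only whole-word hits are wanted, a regular expression with word boundaries is one option. The following is a minimal sketch, assuming documents are dicts with a `text` field as in the loop. The helper `count_whole_word_hits` and the sample documents are hypothetical and not taken from the dataset.

```python
import re


def count_whole_word_hits(docs, term):
    """Count documents whose 'text' contains `term` as a whole word, ignoring case."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return sum(1 for doc in docs if pattern.search(doc.get("text", "")))


# Made-up documents for illustration only.
sample_docs = [
    {"id": 1, "text": "The Bank of England was founded in 1694."},
    {"id": 2, "text": "An embankment protects the village."},
    {"id": 3, "text": "She worked as a banker."},
    {"id": 4, "text": "No match here."},
]

# Substring matching, as in the loop: also hits 'embankment' and 'banker'.
substring_hits = sum(1 for d in sample_docs if "bank" in d["text"].lower())
# Whole-word matching: only the first document qualifies.
whole_word_hits = count_whole_word_hits(sample_docs, "bank")

print(substring_hits, whole_word_hits)  # prints: 3 1
```

Whether the stricter match is preferable depends on the goal: the substring test keeps inflected and compound forms, while the word-boundary version trades that recall for precision.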



