# Download datasets

-----
## 1. Bank cards management

> `Schild, E. (2021). French trainset for chatbots dealing with usual requests on bank cards [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4769949.`

#### v1.0.0

In [None]:
! curl -L \
"https://zenodo.org/record/7307432/files/French_trainset_for_chatbots_dealing_with_usual_requests_on_bank_cards_v1.0.0.xlsx?download=1" \
-o "French_trainset_for_chatbots_dealing_with_usual_requests_on_bank_cards_v1.0.0.xlsx"

#### v2.0.0

In [None]:
! curl -L \
"https://zenodo.org/record/7307432/files/French_trainset_for_chatbots_dealing_with_usual_requests_on_bank_cards_v2.0.0.xlsx?download=1" \
-o "French_trainset_for_chatbots_dealing_with_usual_requests_on_bank_cards_v2.0.0.xlsx"

-----
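
The two `curl` cells for the bank cards dataset differ only in the version string, so the same download could also be sketched in plain Python. This is a minimal sketch and not part of the original notebook: `build_filename`, `build_zenodo_url` and `download_trainset` are hypothetical helper names, and the record id and filename pattern are copied from the `curl` commands. `urlretrieve` follows HTTP redirects by default.

```python
from urllib.parse import quote
from urllib.request import urlretrieve

# Values copied from the curl commands of the bank cards section.
ZENODO_RECORD_ID = "7307432"
FILENAME_PATTERN = (
    "French_trainset_for_chatbots_dealing_with_usual_requests_on_bank_cards_v{version}.xlsx"
)


def build_filename(version: str) -> str:
    """Return the local filename for a dataset version such as "1.0.0"."""
    return FILENAME_PATTERN.format(version=version)


def build_zenodo_url(version: str) -> str:
    """Return the download URL used by the curl commands for a given version."""
    filename = quote(build_filename(version))
    return f"https://zenodo.org/record/{ZENODO_RECORD_ID}/files/{filename}?download=1"


def download_trainset(version: str) -> str:
    """Download one version of the trainset into the working directory."""
    filename = build_filename(version)
    urlretrieve(build_zenodo_url(version), filename)  # Requires network access.
    return filename
```

Calling `download_trainset("1.0.0")` and `download_trainset("2.0.0")` would then fetch both files.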
## 2. MLSUM

> `Scialom, T., Dray, P.-A., Lamprier, S., Piwowarski, B., & Staiano, J. (2020). MLSUM: The Multilingual Summarization Corpus (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2004.14900.`

In [1]:
from datasets import load_dataset, load_from_disk, DownloadMode, DownloadConfig
import pandas as pd
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 100

In [3]:
# Download dataset.
dataset_mlsum_fr_train = load_dataset(
    path="mlsum",
    name="fr",
    split="train",
    download_config=DownloadConfig(
        cache_dir=".temp",
        max_retries=5,
    ),
    download_mode=DownloadMode.REUSE_CACHE_IF_EXISTS,
)

# Save dataset to local disk.
dataset_mlsum_fr_train.save_to_disk(
    dataset_path="./.temp/mlsum/train/fr"
)

Using custom data configuration mlsum-aa1badac41007a39


Downloading and preparing dataset json/mlsum to C:/Users/SCHILDEW/.cache/huggingface/datasets/json/mlsum-aa1badac41007a39/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...


HF google storage unreachable. Downloading and preparing it from source


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

DatasetGenerationError: An error occurred while generating the dataset

In [4]:
# Load dataset from local disk.
dataset_mlsum_fr_train = load_from_disk(
    dataset_path="./.temp/mlsum/train/fr",
)
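
Since the download cell can fail (as in the traceback above) while a saved copy may already exist on disk, the two steps could be combined into a single guard. The following is a minimal sketch under assumptions: `load_or_download` is a hypothetical helper, and checking that the directory exists is the only validation it performs, so a partially written directory would not be detected.

```python
from pathlib import Path
from typing import Any, Callable


def load_or_download(
    dataset_path: str,
    load: Callable[[str], Any],
    download: Callable[[], Any],
) -> Any:
    """Load the dataset from `dataset_path` if present, else download and save it."""
    if Path(dataset_path).is_dir():
        # A previous run already saved the dataset: no network access needed.
        return load(dataset_path)
    # No local copy yet: download, then persist for later runs.
    dataset = download()
    dataset.save_to_disk(dataset_path)
    return dataset
```

In this notebook, `load` would be `load_from_disk` and `download` a zero-argument function wrapping the `load_dataset(...)` call of the download cell.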

In [5]:
# Get article titles and topics.
df_mlsum_fr_train = dataset_mlsum_fr_train.to_pandas()[["title", "topic"]]
df_mlsum_fr_train.head(100)

| | title | topic |
|---|---|---|
| 0 | La rentrée littéraire promet un programme de belle qualité | livres |
| 1 | Gordon Brown appelle à une réunion internationale sur le Yémen | proche-orient |
| 2 | L'abandon des poursuites contre Blackwater indigne l'Irak | asie-pacifique |
| 3 | Au moins 93 morts dans un attentat au Pakistan | asie-pacifique |
| 4 | Cinq morts dans un incendie à Nîmes | societe |
| 5 | Le texte sur le transfert des monuments censuré | culture |
| 6 | Les incessantes métamorphoses du Cirque invisible | culture |
| 7 | A New York, les galeries dépriment et guettent la reprise | culture |
| 8 | Le mimétisme, art guerrier animal | planete |
| 9 | Des neurochimistes identifient une molécule de la panique | planete |


In [6]:
# Get most popular topics.
df_most_popular_topics_of_mlsum_fr_train = df_mlsum_fr_train["topic"].value_counts().to_frame()
df_most_popular_topics_of_mlsum_fr_train.head(100)

| | topic |
|---|---|
| economie | 42975 |
| idees | 24271 |
| politique | 24063 |
| societe | 23341 |
| europe | 21651 |
| afrique | 17632 |
| sport | 16049 |
| culture | 15121 |
| planete | 13636 |
| proche-orient | 11891 |


In [11]:
# Select some of the most popular topics.
selected_topics = [
    'economie',
    'politique',
    #'societe',
    'international',
    'sport',
    #'culture',
    'planete',
    #'livres',
    #'technologies',
    #'cinema',
    'sciences',
    'police-justice',
    'disparitions',
    'emploi',
    'sante',
    'musiques',
    'arts',
    'education',
    'climat',
    'immobilier'
]
print("Number of selected topics:", len(selected_topics))

Number of selected topics: 15
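
The sampling cell that follows draws 75 articles per topic without replacement, and `DataFrame.sample` raises a `ValueError` when a topic has fewer rows than requested, which includes topic names absent from the data. A pre-check along these lines could catch that early. This is a sketch only: `find_undersized_topics` is a hypothetical helper and the counts below are toy values, not the real MLSUM figures.

```python
from typing import Dict, Iterable, List


def find_undersized_topics(
    topic_counts: Dict[str, int],
    topics: Iterable[str],
    sample_size: int,
) -> List[str]:
    """Return topics with fewer than `sample_size` articles (missing topics count as 0)."""
    return [
        topic
        for topic in topics
        if topic_counts.get(topic, 0) < sample_size
    ]


# Toy counts for illustration only (not the real MLSUM figures).
toy_counts = {"economie": 120, "sport": 80, "climat": 40}
undersized = find_undersized_topics(
    toy_counts,
    ["economie", "sport", "climat", "arts"],
    sample_size=75,
)
print(undersized)  # ['climat', 'arts']
```

With the notebook's variables, the counts could come from `df_mlsum_fr_train["topic"].value_counts().to_dict()` and the topics from `selected_topics`.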


In [12]:
# Get a subset of articles for each selected topic.
df_mlsum_fr_train_subset = pd.concat([
    df_mlsum_fr_train[
        df_mlsum_fr_train["topic"]==topic
    ].sample(
        n=75,
        random_state=42,
    )
    for topic in selected_topics
])
df_mlsum_fr_train_subset.sort_values(
    by=["topic", "title"],
    inplace=True,
)
df_mlsum_fr_train_subset.reset_index(
    drop=True,
    inplace=True,
)
print("dataset size:", len(df_mlsum_fr_train_subset))
df_mlsum_fr_train_subset["topic"].value_counts().to_frame()

dataset size: 1125


| | topic |
|---|---|
| immobilier | 75 |
| politique | 75 |
| education | 75 |
| musiques | 75 |
| police-justice | 75 |
| sport | 75 |
| international | 75 |
| planete | 75 |
| sciences | 75 |
| economie | 75 |


In [14]:
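# Save the created dataset.
output_path = f"mlsum_fr_train_subset_x{len(df_mlsum_fr_train_subset)}.xlsx"
# The engine is set on the writer: `to_excel` ignores its own `engine` argument
# when it receives an existing `ExcelWriter`.
with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
    df_mlsum_fr_train_subset.to_excel(
        excel_writer=writer,
        sheet_name="dataset",
        index=False,
    )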