<a href="https://colab.research.google.com/github/famma-bench/bench-script/blob/main/notebooks/FAMMA_1_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to the FAMMA Dataset

FAMMA is a multi-modal financial Q&A benchmark dataset. The questions encompass three heterogeneous image types - tables, charts and text & math screenshots - and span eight subfields in finance, comprehensively covering topics across major asset classes.

## Download the dataset

To download the dataset, first clone the repository and install its dependencies.

In [1]:
! git clone https://github.com/famma-bench/bench-script.git
! pip install -r bench-script/requirements.txt

Cloning into 'bench-script'...
remote: Enumerating objects: 677, done.
remote: Counting objects: 100% (52/52), done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 677 (delta 24), reused 28 (delta 18), pack-reused 625 (from 3)
Receiving objects: 100% (677/677), 81.56 MiB | 11.85 MiB/s, done.
Resolving deltas: 100% (388/388), done.
Collecting omegaconf (from -r bench-script/requirements.txt (line 1))
  Downloading omegaconf-2.3.0-py3-none-any.whl.metadata (3.9 kB)
Collecting easyllm_kit (from -r bench-script/requirements.txt (line 2))
  Downloading easyllm_kit-0.0.8.9-py3-none-any.whl.metadata (840 bytes)
Collecting datasets (from -r bench-script/requirements.txt (line 3))
  Downloading datasets-3.4.0-py3-none-any.whl.metadata (19 kB)
Collecting tiktoken (from -r bench-script/requirements.txt (line 4))
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting dictdatabase (from -r bench-script/require

Next, add the repository directory to `sys.path` so that its modules can be imported.

In [1]:
import sys
sys.path.append('./bench-script')
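If the cell above is re-run, or the clone step failed, a slightly more defensive variant may help. The following is a minimal sketch, not part of the repository: the helper name `add_repo_to_path` is hypothetical, and it only assumes that the clone lives in `./bench-script` as above.

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_dir="./bench-script"):
    """Append repo_dir to sys.path once; return True if the directory exists."""
    repo = Path(repo_dir).resolve()
    if not repo.is_dir():
        # The clone is missing, so there is nothing to import from.
        return False
    if str(repo) not in sys.path:
        # Avoid adding duplicate entries when the cell is re-run.
        sys.path.append(str(repo))
    return True


print(add_repo_to_path())
```

A return value of `False` suggests that the `git clone` step did not complete in the current working directory.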

Finally, download the dataset by running the following script.

In [2]:
from famma_runner.utils.data_utils import download_data

# Dataset repository ID on the Hugging Face Hub
hf_dir = "weaverbirdllm/famma"

# Split (version) to download: "release_basic" or "release_livepro".
# If None, the whole dataset is downloaded.
split = "release_basic"

# Local directory in which to save the dataset
save_dir = "./data"

# Whether to load the dataset from a local copy instead of Hugging Face (default: False)
from_local = False

success = download_data(
    hf_dir=hf_dir,
    split=split,
    save_dir=save_dir,
    from_local=from_local,
)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/7.56k [00:00<?, ?B/s]

release_livepro-00000-of-00001.parquet:   0%|          | 0.00/1.29M [00:00<?, ?B/s]

release_basic-00000-of-00001.parquet:   0%|          | 0.00/92.2M [00:00<?, ?B/s]

Generating release_livepro split:   0%|          | 0/103 [00:00<?, ? examples/s]

Generating release_basic split:   0%|          | 0/1945 [00:00<?, ? examples/s]

Saved release_basic split to ./data/release_basic.json

Dataset downloaded and saved to ./data
Images are saved in ./data/images_release_basic


After downloading, the dataset is saved in JSON format at `./data/release_basic.json`, and the images are saved under `./data/images_release_basic`.
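To confirm that the download produced a readable file, the JSON can be opened directly with the standard library. This is a minimal sketch: the helper `load_records` is hypothetical (it is not part of `famma_runner`), and no assumption is made about the internal layout of the file beyond it being valid JSON whose top-level container has a length.

```python
import json
from pathlib import Path


def load_records(path):
    """Load a JSON file; return (data, number of top-level entries).

    Returns (None, 0) if the file does not exist.
    """
    p = Path(path)
    if not p.exists():
        return None, 0
    with p.open(encoding="utf-8") as f:
        data = json.load(f)
    return data, len(data)


# Path taken from the download log above.
data, n = load_records("data/release_basic.json")
print(type(data).__name__, n)
```

Printing the container type first is a cheap way to find out whether the file holds a list of records or a mapping before writing any further inspection code.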

# Dataset Inspection

## Dataset Statistics

We use the following script to show the overall statistics of the dataset.

In [3]:
from famma_runner.utils.descriptive_utils import get_dataset_statistics

stat_dict = get_dataset_statistics('data/release_basic.json')


In [4]:
stat_dict.keys()

dict_keys(['total_count', 'total_main_question_count', 'unique_main_question_ids', 'language_count', 'question_type_count', 'image_type_count', 'image_type_set', 'subfield_count', 'subfield_set', 'topic_difficulty_count', 'explanation_count', 'multiple_images_count', 'arithmetic_count', 'arithmetic_by_language', 'arithmetic_by_difficulty', 'token_counts', 'subfield_difficulty_count', 'total_token_sum', 'language_difficulty_count', 'question_type_difficulty_count'])

In [5]:
'Total questions {}'.format(stat_dict['total_count'])

'Total questions 1945'

In [6]:
# Language distribution
stat_dict['language_count']

defaultdict(int, {'english': 1378, 'chinese': 411, 'french': 156})

In [7]:
# Difficulty distribution
stat_dict['topic_difficulty_count']

defaultdict(int, {'easy': 694, 'medium': 473, 'hard': 778})

In [8]:
# Question type distribution
stat_dict['question_type_count']

defaultdict(int, {'multiple-choice': 1057, 'open question': 888})

In [9]:
# Subfield distribution
stat_dict['subfield_count']

defaultdict(int,
            {'fixed income': 264,
             'equity': 278,
             'portfolio management': 602,
             'financial statement analysis': 100,
             'derivatives': 296,
             'economics': 44,
             'alternative investments': 92,
             'corporate finance': 269})
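The raw counts are easier to compare as percentages. The sketch below copies the counts from the outputs above into plain dictionaries (so it runs without the dataset), checks that each distribution sums to the reported total of 1945, and prints each category's share. The helper name `shares` is hypothetical; with the real data, the same function could presumably be applied to entries of `stat_dict` such as `stat_dict['subfield_count']`.

```python
# Counts copied from the notebook outputs above.
total = 1945
distributions = {
    "language": {"english": 1378, "chinese": 411, "french": 156},
    "difficulty": {"easy": 694, "medium": 473, "hard": 778},
    "question_type": {"multiple-choice": 1057, "open question": 888},
    "subfield": {
        "fixed income": 264,
        "equity": 278,
        "portfolio management": 602,
        "financial statement analysis": 100,
        "derivatives": 296,
        "economics": 44,
        "alternative investments": 92,
        "corporate finance": 269,
    },
}


def shares(counts):
    """Return {category: percentage of total}, largest category first."""
    denom = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {name: round(100 * n / denom, 1) for name, n in ordered}


for name, counts in distributions.items():
    # Every question should be counted exactly once per distribution.
    assert sum(counts.values()) == total, name
    print(name, shares(counts))
```

The assertion doubles as a consistency check: if a category were missing from one of the distributions, its counts would no longer add up to the total question count.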