# Configuration

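Configure the datasets to process. Each key in `sets` is a dataset identifier, which is also the name of that dataset's subdirectory under `data_dir`; each value is a human-readable label. For a list of available datasets, see the contents of the `data` directory.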

In [7]:
sets = {"9200300": "Austria",
        "9200301": "Finland",
        "9200303": "Latvia",
        "9200338": "Hamburg",
        "9200339": "Serbia",
        "9200355": "Berlin",
        "9200356": "Estonia",
        "9200357": "Poland",
        "9200359": "Netherlands",
        "9200396": "Luxembourg"}

data_dir = '../data'

# Data loading

For each dataset, load the JSON object (`id_file_map.json`) from the dataset directory. It maps each object identifier to the text file that holds the object's full text.

In [8]:
import json

maps = {}
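for dataset in sets:
    data_set_dir = f'{data_dir}/{dataset}'
    # Each dataset directory holds a JSON object: object identifier -> text file name
    with open(f'{data_set_dir}/id_file_map.json', 'r') as f:
        id_file_map = json.load(f)

    # A bare f-string inside a loop is discarded, so print it explicitly
    print(f'Loaded map for data set {dataset}: {len(id_file_map):,} items')
    maps[dataset] = id_file_map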
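
The loading step can be tried without the real data. The following is a minimal sketch, assuming the mapping is a flat JSON object of identifier strings to file names. The helper `load_id_file_map`, the identifiers, and the file names are hypothetical and only illustrate the shape of the data.

```python
import json
import tempfile
from pathlib import Path


def load_id_file_map(data_set_dir):
    """Load the identifier -> file name mapping from one dataset directory."""
    with open(Path(data_set_dir) / 'id_file_map.json', 'r', encoding='utf-8') as f:
        return json.load(f)


# Hypothetical example data: these identifiers and file names are made up.
example = {'/0000000/example_object_1': 'example_1.txt',
           '/0000000/example_object_2': 'example_2.txt'}

with tempfile.TemporaryDirectory() as tmp:
    # Write the example mapping where the loader expects to find it
    (Path(tmp) / 'id_file_map.json').write_text(json.dumps(example), encoding='utf-8')
    loaded = load_id_file_map(tmp)

print(f'Loaded map: {len(loaded):,} items')
```

Passing `encoding='utf-8'` explicitly is a choice made for this sketch; the cell above relies on the platform default.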

# Processing

Count the total number of words (whitespace-separated substrings) in each dataset.

In [9]:
for dataset in maps:
    id_file_map = maps[dataset]
    data_set_dir = f'{data_dir}/{dataset}'
    word_count = 0
    print(f'Counting words in {dataset} ({sets[dataset]})')
    for filename in id_file_map.values():
        with open(f'{data_set_dir}/{filename}', 'r') as file:
            # Iterate line by line so large files are not read into memory at once
            for line in file:
                # split() with no argument splits on any run of whitespace
                word_count += len(line.split())

    print(f'Total word count for {dataset} ({sets[dataset]}): {word_count:,} words')

Counting words in 9200300 (Austria)
Total word count for 9200300 (Austria): 2,351,079,191 words
Counting words in 9200301 (Finland)
Total word count for 9200301 (Finland): 393,776,815 words
Counting words in 9200303 (Latvia)
Total word count for 9200303 (Latvia): 964,243,746 words
Counting words in 9200338 (Hamburg)
Total word count for 9200338 (Hamburg): 5,593,768,847 words
Counting words in 9200339 (Serbia)
Total word count for 9200339 (Serbia): 338,080,416 words
Counting words in 9200355 (Berlin)
Total word count for 9200355 (Berlin): 2,996,820,265 words
Counting words in 9200356 (Estonia)
Total word count for 9200356 (Estonia): 351,656,185 words
Counting words in 9200357 (Poland)
Total word count for 9200357 (Poland): 181,102,489 words
Counting words in 9200359 (Netherlands)
Total word count for 9200359 (Netherlands): 2,869,483,985 words
Counting words in 9200396 (Luxembourg)
Total word count for 9200396 (Luxembourg): 29,266,765 words
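
The counting rule used above can be checked on a small in-memory sample. This is a minimal sketch; `count_words` is a hypothetical helper that is not part of the notebook, and it applies the same whitespace split to any iterable of lines.

```python
import io


def count_words(lines):
    """Count whitespace-separated substrings over an iterable of lines."""
    return sum(len(line.split()) for line in lines)


# Repeated spaces, tabs and empty lines do not produce extra words
sample = io.StringIO('one two  three\n\n\tfour\n')
print(count_words(sample))  # → 4
```

Because the split is purely on whitespace, punctuation stays attached to the neighbouring token and a hyphenated word counts once, so the totals are token counts rather than linguistically exact word counts.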
