# Counting all the words in one newspaper set

In this example, we count all the words in a single newspaper set. 

**Note that this example only works if the data directory contains (at least) the set configured at the beginning of the example. If the data has not yet been retrieved and stored in this environment, please run the data retrieval script first!**

## Configuration

Configure the dataset to process. To see which datasets are available in this environment, list the contents of the `data` directory.

In [8]:
# To count words in a set, uncomment the corresponding line below (remove the leading '#')
# and make sure every other `dataset` line stays commented out (starts with '#').

# dataset = '9200300' # Austria
# dataset = '9200301' # Finland
# dataset = '9200303' # Latvia
# dataset = '9200338' # Hamburg
# dataset = '9200339' # Serbia
# dataset = '9200355' # Berlin
# dataset = '9200356' # Estonia
# dataset = '9200357' # Poland
# dataset = '9200359' # Netherlands
dataset = '9200396' # Luxembourg

from os.path import expanduser
datadir = expanduser("~") + '/work/data'
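
Before moving on, it can help to check which sets are actually present. The following is a minimal sketch, not part of the original notebook: the helper name `list_available_datasets` is hypothetical, and it assumes that every retrieved set lives in its own subdirectory of `datadir` and contains an `id_file_map.json` file, as the loading step below suggests.

```python
import os


def list_available_datasets(datadir):
    """Return the sorted names of subdirectories that contain an id_file_map.json.

    Hypothetical helper: assumes one subdirectory per retrieved set.
    """
    if not os.path.isdir(datadir):
        return []
    return sorted(
        name for name in os.listdir(datadir)
        if os.path.isfile(os.path.join(datadir, name, 'id_file_map.json'))
    )


# Same location as configured above.
datadir = os.path.join(os.path.expanduser('~'), 'work', 'data')
print(list_available_datasets(datadir))
```

If the configured `dataset` value does not appear in the printed list, the data retrieval script probably still needs to be run for that set.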

## Data loading

Load the JSON file `id_file_map.json` from the dataset directory. It maps each object identifier to the name of the text file that holds the full text content of that object.

In [9]:
import json

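# Directory that holds all files of the configured set
set_dir = f'{datadir}/{dataset}'

# Read the mapping from object identifier to text file name
with open(f'{set_dir}/id_file_map.json', 'r') as f:
    map = json.load(f)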

f'Loaded map for data set {dataset}: {len(map):,} items'

'Loaded map for data set 9200396: 1,317 items'
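
As an optional sanity check, one could verify that every file named in the mapping is actually present before processing. The sketch below is an illustration only: `find_missing_files` is a hypothetical helper, and the identifiers and file names in the demonstration are made up so that the block runs without the real data.

```python
import os
import tempfile


def find_missing_files(set_dir, id_file_map):
    """Return the identifiers whose mapped text file does not exist under set_dir."""
    return [
        item_id for item_id, filename in id_file_map.items()
        if not os.path.isfile(os.path.join(set_dir, filename))
    ]


# Toy demonstration with made-up identifiers and file names (not real data).
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, 'a.txt'), 'w') as f:
        f.write('some text')
    demo_map = {'id-a': 'a.txt', 'id-b': 'b.txt'}
    demo_missing = find_missing_files(demo_dir, demo_map)

print(demo_missing)  # ['id-b']
```

In the notebook itself, the call would be `find_missing_files(set_dir, map)`; an empty list means that all mapped files can be opened in the processing step.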

## Processing

Count the total number of words (whitespace-separated substrings) in the entire dataset.

In [10]:
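word_count = 0
for item_id, filename in map.items():
    with open(f'{set_dir}/{filename}', 'r') as file:
        # Iterate over the file line by line instead of reading it all at once
        for line in file:
            # split() without arguments splits on any run of whitespace
            word_count += len(line.split())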

f'Total word count: {word_count:,} words'

'Total word count: 29,266,765 words'
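
To make the definition of a "word" used above concrete, here is a small sketch of what splitting on whitespace does and does not count. The helper name `count_words` and the sample strings are illustrative assumptions, not part of the original notebook.

```python
def count_words(text):
    """Count whitespace-separated substrings, the same definition as in the loop above."""
    return len(text.split())


samples = [
    'Hello,   world!',     # repeated spaces are collapsed; punctuation stays attached
    'well-known fact\n',   # a hyphenated term counts as one word
    '',                    # an empty line contributes nothing
    ' 12 . 5 ',            # a stray punctuation mark counts as a word of its own
]
counts = [count_words(s) for s in samples]
print(counts)  # [2, 2, 0, 3]
```

The total reported above is therefore a count of tokens between whitespace, which can differ from a count produced by a linguistic tokenizer.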