In [1]:
import os
from cleaner import clean_years

### Data extraction

You can use the following lines to extract the texts from the XML files and apply some basic preprocessing steps to the raw texts.
Preprocessing the articles can take a long time, so we advise running these lines on small intervals of years.

Due to the NDA, we cannot post the data on GitHub, so you will have to get the data from the cluster and specify the path below. We assume that the texts are in XML files separated by month, such as 01.xml, 02.xml, ..., and sorted into folders by year, exactly like the data on the iccluster. For this example, we retrieve the data from the "Gazette de Lausanne" between 1805 and 1825.
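
Before starting a long extraction, it may help to check that the local copy matches the layout described above. The sketch below is only an illustration: `missing_month_files` is a hypothetical helper, not part of this project, and it assumes that each year folder is named with the four-digit year and that all twelve months are expected (some months may legitimately be absent from the archive).

```python
import os
import tempfile


def missing_month_files(xml_root, year_start, year_end):
    """Return the expected <year>/<MM>.xml paths that do not exist (hypothetical helper)."""
    missing = []
    for year in range(year_start, year_end + 1):
        for month in range(1, 13):
            # Assumed layout: <xml_root>/<year>/<two-digit month>.xml
            path = os.path.join(xml_root, str(year), f"{month:02d}.xml")
            if not os.path.isfile(path):
                missing.append(path)
    return missing


# Demo on a throwaway directory: one year folder holding only January to November.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "1805"))
    for month in range(1, 12):
        open(os.path.join(root, "1805", f"{month:02d}.xml"), "w").close()
    missing = missing_month_files(root, 1805, 1805)
    missing_names = [os.path.basename(p) for p in missing]

print(missing_names)
```

On the real data, the call would be `missing_month_files(DIR_XML_DATA, year_start, year_end)` once the paths in the next cell are defined.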

In [2]:
DATA_DIR = 'data'
DIR_XML_DATA = os.path.join(DATA_DIR, 'GDL')
DIR_OUTPUT_PKL_FILES = os.path.join(DATA_DIR, 'GDL_pkl')
year_start = 1805
year_end = 1825
clean_years(DIR_XML_DATA, DIR_OUTPUT_PKL_FILES, year_start, year_end, False)
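
To follow the advice about small intervals, the year range can be split into short chunks, with one `clean_years` call per chunk. This is a minimal sketch: `year_chunks` is a hypothetical helper, and it assumes that `clean_years` treats both of its year arguments as inclusive, which should be checked against the `cleaner` module before relying on it.

```python
def year_chunks(start, end, size):
    """Yield (first, last) year pairs covering start..end inclusive (hypothetical helper)."""
    for first in range(start, end + 1, size):
        yield first, min(first + size - 1, end)


chunks = list(year_chunks(1805, 1825, 5))
print(chunks)

# Possible use with the extraction call above (left commented out because it
# needs the project's cleaner module and the data):
# for first, last in chunks:
#     clean_years(DIR_XML_DATA, DIR_OUTPUT_PKL_FILES, first, last, False)
```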

### Dictionary creation

In [3]:
from dictFunctions import create_dictionary, load_dictionary, load_dictionaries
from dictFunctions import clean_dict_by_occ, merge_dictionaries
from correctText import clean_dictionary, cleanAndSaveArticles



In [4]:
# One single-year range per year, from year_start to year_end inclusive
ranges = [range(year, year + 1) for year in range(year_start, year_end + 1)]

In [5]:
DIR_OUTPUT_DICTIONARIES = os.path.join(DATA_DIR, 'GDL_dict')
for range_values in ranges:
    create_dictionary(range_values, DIR_OUTPUT_PKL_FILES, DIR_OUTPUT_DICTIONARIES)

In [6]:
interval = 10
years = range(year_start, year_end, interval)
occs_clean = [2,3,3,3,5,5,5,5]

In [7]:
for year in years:
    dict_10years = load_dictionaries(DIR_OUTPUT_DICTIONARIES, year, year + interval)
    dict_10years_cleaned = [clean_dict_by_occ(dictio, occs_clean) for dictio in dict_10years]
    fileName = os.path.join(DIR_OUTPUT_DICTIONARIES, str(year) + '-' + str(year + interval) + '.pkl')
    merge_dictionaries(dict_10years_cleaned, fileName)
    print(str(year) + '-' + str(year + interval) + ' done')

1805-1815 done
1815-1825 done
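
The year arithmetic in the cells above can be checked in isolation. The sketch below only reproduces the `range` computations with the example values; it does not call the project's functions. Whether `load_dictionaries` includes its upper bound is not stated here, so it is unclear from this alone whether 1815 contributes to both windows.

```python
year_start, year_end, interval = 1805, 1825, 10

# One single-year range per year, end year included.
ranges = [range(year, year + 1) for year in range(year_start, year_end + 1)]

# Start years of the merge windows; range() excludes year_end itself.
window_starts = list(range(year_start, year_end, interval))
windows = [(start, start + interval) for start in window_starts]

print(len(ranges))
print(windows)
```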


In [8]:
dictPath = os.path.join(DIR_OUTPUT_DICTIONARIES, str(year_start) + '-' + str(year_start + interval) + '.pkl')
dictionnaries = clean_dictionary(dictPath)

With the following function, you can clean and correct a subset of the articles. The resulting articles can then be used in the other notebooks. This operation can take a very long time.

In [9]:
DIR_OUTPUT_PKL_FILES_CLEANED = os.path.join(DATA_DIR, 'GDL_cleaned')
cleanAndSaveArticles(DIR_OUTPUT_PKL_FILES, DIR_OUTPUT_PKL_FILES_CLEANED, year_start, year_start+1, dictionnaries)

0 articles cleaned, 2137 remaining
100 articles cleaned, 2037 remaining
200 articles cleaned, 1937 remaining
300 articles cleaned, 1837 remaining
400 articles cleaned, 1737 remaining
500 articles cleaned, 1637 remaining
600 articles cleaned, 1537 remaining
700 articles cleaned, 1437 remaining
800 articles cleaned, 1337 remaining
900 articles cleaned, 1237 remaining
1000 articles cleaned, 1137 remaining
1100 articles cleaned, 1037 remaining
1200 articles cleaned, 937 remaining
1300 articles cleaned, 837 remaining
1400 articles cleaned, 737 remaining
1500 articles cleaned, 637 remaining
1600 articles cleaned, 537 remaining
1700 articles cleaned, 437 remaining
1800 articles cleaned, 337 remaining
1900 articles cleaned, 237 remaining
2000 articles cleaned, 137 remaining
2100 articles cleaned, 37 remaining
1805
