## Links to Project Resources

- [Trello board](https://trello.com/invite/b/BWnRAtKJ/3e7ce03017000289323e762d0ed2e304/histaware)
- [Notion Wiki](https://www.notion.so/HistAware-529aba41f84946b19d493394ef6a2748)

# Part I: Text selection

In this first phase of the project, we address the first problem: selecting similar texts. Initially, the scope of the research is limited to texts that deal with `energy`. However, this scope might change or be expanded.

**Phases of Part I:**
- **Validate the approach to the project**:
    1. Decide whether to use title and paragraphs or only one of the two
    2. Find the most efficient way to read all the xml files
    3. Begin to label a golden set of texts that are within the scope of the research AND select the most important keywords that will be used to search for similar texts
    4. Run the text similarity ML algorithm
    5. Have the teaching assistant go through the selection and identify mistakes
- **To think about**: how to keep the relevant information about the text fragment (i.e. newspaper origin and date)?
- **Decide the tools to use for text selection**. Current choices are:
    - Use `sentence-transformers` from UKPLab (https://github.com/UKPLab/sentence-transformers)
        - Generate embeddings on sentences (max 512 tokens)
        - Find similar texts
    - Use `faiss` from Facebook AI (https://github.com/facebookresearch/faiss)
        - Less documentation but seemingly more scalable
    - Use ASReview from Utrecht University
        - A meeting with Jonathan or Raul is necessary to understand the feasibility of this approach
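
One possible way to combine the input-length limit of the embedding model with the need to keep each fragment's origin and date is to split long texts into chunks that each carry their own metadata. The following is only a minimal sketch under stated assumptions: the `Fragment` class, the `chunk_text` helper and the field names are hypothetical, and length is approximated by whitespace-separated words (a transformer tokenizer would count word pieces instead).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fragment:
    """A chunk of text plus the provenance to keep (hypothetical fields)."""
    text: str
    newspaper_title: str
    date: str
    chunk_index: int


def chunk_text(text: str, newspaper_title: str, date: str,
               max_words: int = 512) -> List[Fragment]:
    """Split `text` into chunks of at most `max_words` whitespace-separated words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    fragments = []
    for start in range(0, len(words), max_words):
        chunk = " ".join(words[start:start + max_words])
        fragments.append(Fragment(chunk, newspaper_title, date, start // max_words))
    return fragments


# 1200 words are split into chunks of 512, 512 and 176 words.
fragments = chunk_text("meer energie " * 600, "De Telegraaf", "1950-01-17")
print(len(fragments), [len(f.text.split()) for f in fragments])
```

Because every chunk keeps its newspaper title and date, a similarity hit on a chunk can be traced back to its source without a second lookup.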

### Import statements

In [18]:
from IPython.display import display, clear_output, Markdown
import pathlib
import sys
import pickle
import csv

import numpy as np
import pandas as pd
import logging
import seaborn
import xml.etree.ElementTree as et 
import collections
import xmltodict
from itertools import chain

# Import created modules
#import utils
#import text_selection

# Config for jupyter
%matplotlib inline
%config InlineBackend.figure_format='retina'
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Set parameters & variables

In [19]:
# `main_path` is a reference to `sys.path` itself, not a copy
main_path = sys.path
# Prepend the parent folder; `main_path[0]` is now ".." (the main project folder)
sys.path.insert(0, "..")
# Save path for processed files
save_path=main_path[0]+"/data/processed/"

#### Just some code to print debug information to stdout
np.set_printoptions(threshold=100)

logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO)

## Delpher Dataset

### Create a catalogue of the files

#### Find the location of each article

We save each file's path and name in a dictionary, then convert the dictionary into a DataFrame. The DataFrame's integer index lets us record the position at which parsing stopped or was interrupted, so that we can resume from there (dictionaries preserve insertion order since Python 3.7, but they cannot be indexed by position).

In [3]:
xml_article_names = utils.iterate_directory(
    root_path=main_path[0],
    dir_path="/data/1950/",
    file_type=".xml")
article_names = pd.DataFrame.from_dict(xml_article_names)
article_names.reset_index(inplace=True)
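
The body of `utils.iterate_directory` is not shown in this notebook. As a rough illustration of what such a catalogue step could look like, here is a sketch using only `pathlib`. The `list_files` helper is hypothetical; the keys `article_name`, `article_path` and `article_dir` are borrowed from the `rename` call used later in this notebook.

```python
import pathlib
import tempfile
from typing import Dict, List


def list_files(root: pathlib.Path, file_type: str = ".xml") -> List[Dict[str, str]]:
    """Recursively collect name, path and parent directory of matching files."""
    records = []
    # Sorting gives a stable order, so a position in the list stays meaningful
    # between runs as long as the files on disk do not change.
    for path in sorted(root.rglob(f"*{file_type}")):
        if path.is_file():
            records.append({
                "article_name": path.name,
                "article_path": str(path),
                "article_dir": str(path.parent),
            })
    return records


# Tiny throwaway directory tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    issue_dir = pathlib.Path(tmp) / "1950" / "02-09" / "DDD_010480453"
    issue_dir.mkdir(parents=True)
    (issue_dir / "DDD_010480453_0001_articletext.xml").write_text("<text/>")
    (issue_dir / "notes.txt").write_text("ignored")
    catalogue = list_files(pathlib.Path(tmp))

print(catalogue[0]["article_name"])
```

A list of dictionaries like this can be passed directly to `pd.DataFrame(...)` to obtain one row per file.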

#### Find the location of each metadata file and "ungzip" them

Ungzip the .gz metadata files. This only needs to be done once.

In [5]:
#utils.ungzip_metdata(
#    root_path=main_path[0],
#    dir_path="/data/1950/",
#    file_type=".gz")
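
The helper that performs the decompression is not shown here. A one-off decompression step can be sketched with the standard library as follows. The `ungzip_file` function is hypothetical; the output naming (original name plus `.xml`) is an assumption based on the `...xml.gz.xml` file names that appear in the metadata columns later in this notebook.

```python
import gzip
import pathlib
import shutil
import tempfile


def ungzip_file(gz_path: pathlib.Path) -> pathlib.Path:
    """Decompress `gz_path` next to itself as `<name>.xml`; skip if already done."""
    out_path = gz_path.with_name(gz_path.name + ".xml")
    if not out_path.exists():
        # Stream the content so large files are never held fully in memory.
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return out_path


# Round trip on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = pathlib.Path(tmp) / "example.didl.xml.gz"
    with gzip.open(gz_path, "wb") as handle:
        handle.write(b"<didl>metadata</didl>")
    xml_path = ungzip_file(gz_path)
    restored = xml_path.read_bytes()
    xml_name = xml_path.name

print(xml_name, restored)
```

The existence check makes the step safe to re-run: files that were already decompressed are left untouched.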

In [4]:
gz_metadata_files = utils.iterate_directory_gz(
    root_path=main_path[0],
    dir_path="/data/1950/",
    file_type=".gz")
metadata_files = pd.DataFrame.from_dict(gz_metadata_files)
metadata_files.reset_index(inplace=True)

**Utils Addendum**

To search for an `article_path` or `article_name` given the other, use the following:

In [301]:
#a = article_names.loc[article_names['article_name'] == "DDD_110637387_0004_articletext.xml"]
#a = article_names.iloc[0]
c = article_names.iloc[500000]
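
To make the lookup in both directions concrete, here is a small self-contained sketch on a toy catalogue. The DataFrame contents and the `example/...` paths are illustrative only; only the column names `article_name` and `article_path` come from this notebook.

```python
import pandas as pd

# Toy catalogue with the two relevant columns (paths are illustrative).
catalogue = pd.DataFrame({
    "article_name": ["DDD_110637387_0004_articletext.xml",
                     "DDD_010480453_0022_articletext.xml"],
    "article_path": ["example/dir_a/DDD_110637387_0004_articletext.xml",
                     "example/dir_b/DDD_010480453_0022_articletext.xml"],
})

# Name -> path: select rows with a boolean mask on the name column.
match = catalogue.loc[catalogue["article_name"] == "DDD_110637387_0004_articletext.xml"]
path = match["article_path"].iloc[0]

# Path -> name: the same idea in the other direction.
name = catalogue.loc[catalogue["article_path"] == path, "article_name"].iloc[0]

# Position -> row: `iloc` selects by integer position, not by label.
first_row = catalogue.iloc[0]

print(path, name)
```

Note that `iloc` raises an `IndexError` when the position is beyond the last row, so a position such as 500000 only works on a sufficiently large catalogue.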

### Iterate through the files given

**Attention**: this only needs to be done once, and it takes a **very** long time.
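
Since a full run is long and may be interrupted, one option is to parse in batches and record the index at which to resume. The sketch below is not the code in `utils.iterate_files`; the helper names are hypothetical, and the XML layout (a `<text>` root with `<title>` and `<p>` children) is an assumption about the article files.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterator, List, Tuple


def parse_article(xml_string: str) -> List[Dict[str, str]]:
    """Return one record per <title> or <p> element (assumed article layout)."""
    root = ET.fromstring(xml_string)
    return [{"type": el.tag, "text": (el.text or "").strip()}
            for el in root.iter() if el.tag in ("title", "p")]


def parse_in_batches(documents: List[str], start: int = 0,
                     batch_size: int = 2) -> Iterator[Tuple[int, List[Dict[str, str]]]]:
    """Yield (next_index, records) per batch so the caller can checkpoint and resume."""
    for begin in range(start, len(documents), batch_size):
        records = []
        for doc in documents[begin:begin + batch_size]:
            records.extend(parse_article(doc))
        # `next_index` is the position a later run should pass as `start`.
        yield min(begin + batch_size, len(documents)), records


docs = [f"<text><title>Kop {i}</title><p>Alinea {i}</p></text>" for i in range(5)]

# Resume from index 2, as if a previous run had been interrupted there.
for next_index, records in parse_in_batches(docs, start=2):
    print(next_index, len(records))
```

Saving each batch to its own file together with `next_index` would bound the work lost on an interruption to a single batch.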

#### TODO: make this work scalable?

In [10]:
utils.iterate_files(save_path=save_path, files=article_names)

'Files parsed: 2000'

'Current file: DDD_010537363_0096_articletext.xml (Index: 2000)'

KeyboardInterrupt: 

In [133]:
utils.iterate_metadata(save_path=save_path, files=metadata_files)

TypeError: iterate_metadata() got an unexpected keyword argument 'save_path'

## Text selection model

## Ingest parsed files previously saved

Once we have parsed all the files in the example `data-1950` folder, we obtain 65 files containing the original data in a format that is easier for a machine to read. The files total about 65 × 10 MB = 650 MB, roughly a 5x reduction from the original size of the dataset.

## Read saved files

#### Retrieve all the names of the ftr files saved

In [6]:
ftr_articles = utils.iterate_directory(
    root_path=main_path[0],
    dir_path="/data/processed/processed_articles",
    file_type=".ftr")
ftr_articles = pd.DataFrame(ftr_articles)
ftr_articles.rename({'article_name': 'ftr_name', 'article_path': 'ftr_path', 'article_dir': 'ftr_dir'}, axis=1, inplace=True)

In [7]:
ftr_metadata = utils.iterate_directory(
    root_path=main_path[0],
    dir_path="/data/processed/processed_metadata",
    file_type=".ftr")
ftr_metadata = pd.DataFrame(ftr_metadata)
ftr_metadata.rename({'article_name': 'ftr_name', 'article_path': 'ftr_path', 'article_dir': 'ftr_dir'}, axis=1, inplace=True)

#### Retrieve all the content of the files into a list format

Read one ftr file as a test

In [8]:
df_articles = utils.iterate_ftr(ftr_articles)
# sort_values returns a new DataFrame, so the result must be assigned back
df_articles = df_articles.sort_values(by=["index"], ascending=True)
df_articles.rename({"filepath": "article_filepath", "index": "index_article"}, axis=1, inplace=True)

In [9]:
df_metadata = utils.iterate_ftr(ftr_metadata)
df_metadata.drop(["level_0", "date"], axis=1, inplace=True)
df_metadata.rename({"filepath": "metadata_filepath", "index": "index_metadata"}, axis=1, inplace=True)

Merge articles and metadata into a single DataFrame

In [10]:
df_joined = df_articles.merge(df_metadata, how='left', on='dir')
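
A left merge keeps every article row even when no metadata is found for its directory. The toy example below shows how this behaves and how unmatched rows can be detected. The data are made up, and the check `validate="many_to_one"` assumes there is exactly one metadata row per `dir`, which may or may not hold for the real dataset.

```python
import pandas as pd

articles = pd.DataFrame({
    "dir": ["issue_a", "issue_a", "issue_b"],
    "text": ["first paragraph", "second paragraph", "third paragraph"],
})
metadata = pd.DataFrame({
    "dir": ["issue_a"],
    "newspaper_title": ["De Telegraaf"],
})

joined = articles.merge(
    metadata,
    how="left",
    on="dir",
    validate="many_to_one",  # raises if `dir` is duplicated in `metadata`
    indicator=True,          # adds a `_merge` column: "both" or "left_only"
)

# Articles whose directory has no metadata end up with NaN metadata columns.
missing = joined[joined["_merge"] == "left_only"]
print(len(joined), len(missing))
```

Checking `missing` after the merge is a cheap way to spot directories whose metadata file was not parsed.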

#### Now we have one merged file

This will need to be repeated for every file in the database.

So efficiency is key here!

### Find synonym(s) for the key search word(s)

Load pre-trained nlp model from spacy

In [17]:
import nl_core_news_lg
nlp = nl_core_news_lg.load()

#### Create a list with only the text (paragraphs) and not the other variables

Retrieve all the paragraphs into one single file

In [18]:
a = text_selection.select_articles(
    nlp=nlp,
    word="energie",
    df=df_joined,
    n=50)

Searching using the following synonyms of energie:
['energie', 'oerenergie', 'Cenergie', 'energieeen', 'energieen', 'energi', 'aarde-energie', 'lichtenergie', 'levensenergie', 'energiestroom', 'energie-boost', 'energiedip', 'hulpenergie', 'warmte-energie', 'Bio-energie', 'zonneenergie', 'Reiki-energie', 'hartenergie', 'remenergie', 'aardenergie', 'energievorm', 'energiegolf', 'energiestoot', 'energievol', 'energiegolven', 'bewegingsenergie', 'bio-energie', 'energiecellen', 'basisenergie', 'energievolle', 'energieën', 'energiebron', 'zonenergie', 'stralingsenergie', 'énergie', 'energiebalans', 'energiestromen', 'lichaamsenergie', 'energiebewustzijn', 'energietoevoer', 'energiemix', 'groepsenergie', 'energieboost', 'energievreter', 'Levensenergie', 'waterenergie', 'energie-opwekking', 'energierijk', 'energieverbuik', 'energiebronnen']


  2%|▏         | 1/50 [00:02<02:25,  2.97s/it]


KeyboardInterrupt: 
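
How `text_selection.select_articles` uses the synonyms is not shown in this notebook. One simple and fast way to filter paragraphs on a list of terms is a single compiled regular expression, sketched below. The helper names are hypothetical, and the example paragraphs are illustrative.

```python
import re
from typing import Iterable, List


def build_pattern(terms: Iterable[str]) -> re.Pattern:
    """Compile one case-insensitive regex matching any term as a whole word."""
    unique_terms = set(terms)
    if not unique_terms:
        raise ValueError("at least one search term is required")
    # Longest first, so longer terms are tried before their own prefixes.
    ordered = sorted(unique_terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)


def select_texts(texts: List[str], terms: Iterable[str]) -> List[int]:
    """Return the positions of the texts that mention at least one term."""
    pattern = build_pattern(terms)
    return [i for i, text in enumerate(texts) if pattern.search(text)]


synonyms = ["energie", "energiebron", "bio-energie"]
paragraphs = [
    "Meer energie, meer levenskracht.",
    "De uitslagen in de tweede klasse.",
    "Een nieuwe Energiebron voor de industrie.",
]
print(select_texts(paragraphs, synonyms))
```

Compiling the pattern once and scanning each paragraph a single time avoids looping over all synonyms for every row.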

Using the rows that were found, select the entire record from the merged dataframe

In [20]:
# Save to CSV
a.to_csv(save_path+"energie_search_16092020.csv",
            sep=",",
            quotechar='"',
            index=False)

#with open('list_sentences.pkl', "wb") as fOut:
#    pickle.dump(list_sentences, fOut, protocol=pickle.HIGHEST_PROTOCOL)

In [43]:
a["text"].str.len().max()

3896.6

### Further text selection

In [20]:
# https://www.sbert.net/docs/
from sentence_transformers import SentenceTransformer, LoggingHandler, util

# These are the pure transformers from huggingface
import transformers
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch

import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from collections import defaultdict
from textwrap import wrap
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader

# Set fixed random seed
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

# Find GPU on device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Import previously saved csv

In [25]:
search = pd.read_csv(main_path[0]+"/data/processed/selected_articles/2020-09-22_energie.csv")

In [26]:
search

Unnamed: 0,Unnamed: 0_x,type,text,article_name,date,index_article,article_filepath,dir,Unnamed: 0_y,metadata_title,index_metadata,metadata_filepath,newspaper_title,newspaper_date,newspaper_city,newspaper_publisher,newspaper_source,newspaper_volume,newspaper_issuenumber,newspaper_language
0,942,p,Meer energie 1118111 Meer levenskracht mW^jmÊi...,DDD_010480453_0022_articletext.xml,1950-02-09,100397,../data/1950/02-09/DDD_010480453/DDD_010480453...,../data/1950/02-09/DDD_010480453,779.0,DDD:ddd:010480453:mpeg21.didl.xml.gz.xml,925.0,../data/1950/02-09/DDD_010480453/DDD:ddd:01048...,Het nieuws : algemeen dagblad,1950-02-09,Paramaribo,A.J. Morpurgo,KB C 197,7.0,1984.0,nl
1,966,p,"zijtt mjnist'êr ble'e,k hoopvol té Md ,rsteind...",DDD_010480453_0001_articletext.xml,1950-02-09,100405,../data/1950/02-09/DDD_010480453/DDD_010480453...,../data/1950/02-09/DDD_010480453,779.0,DDD:ddd:010480453:mpeg21.didl.xml.gz.xml,925.0,../data/1950/02-09/DDD_010480453/DDD:ddd:01048...,Het nieuws : algemeen dagblad,1950-02-09,Paramaribo,A.J. Morpurgo,KB C 197,7.0,1984.0,nl
2,1095,p,Hij klom op tot hoofdcommies en referendaris e...,DDD_010950392_0073_articletext.xml,1950-02-09,100454,../data/1950/02-09/DDD_010950392/DDD_010950392...,../data/1950/02-09/DDD_010950392,780.0,DDD:ddd:010950392:mpeg21.didl.xml.gz.xml,926.0,../data/1950/02-09/DDD_010950392/DDD:ddd:01095...,Het vrĳe volk : democratisch-socialistisch dag...,1950-02-09,Rotterdam,De Arbeiderspers,Gemeentearchief Rotterdam,5.0,1444.0,nl
3,1134,p,„De Verenigde Staten overwegen steeds of het m...,DDD_010950392_0001_articletext.xml,1950-02-09,100470,../data/1950/02-09/DDD_010950392/DDD_010950392...,../data/1950/02-09/DDD_010950392,780.0,DDD:ddd:010950392:mpeg21.didl.xml.gz.xml,926.0,../data/1950/02-09/DDD_010950392/DDD:ddd:01095...,Het vrĳe volk : democratisch-socialistisch dag...,1950-02-09,Rotterdam,De Arbeiderspers,Gemeentearchief Rotterdam,5.0,1444.0,nl
4,1235,p,"Telkens opnieuw hebben wij gezien, aldus de mi...",DDD_010950392_0002_articletext.xml,1950-02-09,100510,../data/1950/02-09/DDD_010950392/DDD_010950392...,../data/1950/02-09/DDD_010950392,780.0,DDD:ddd:010950392:mpeg21.didl.xml.gz.xml,926.0,../data/1950/02-09/DDD_010950392/DDD:ddd:01095...,Het vrĳe volk : democratisch-socialistisch dag...,1950-02-09,Rotterdam,De Arbeiderspers,Gemeentearchief Rotterdam,5.0,1444.0,nl
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
536,73218,p,In 3B schakelde Helder de grootste concurrent ...,DDD_010852409_0112_articletext.xml,1950-11-27,127103,../data/1950/11-27/DDD_010852409/DDD_010852409...,../data/1950/11-27/DDD_010852409,980.0,DDD:ddd:010852409:mpeg21.didl.xml.gz.xml,1155.0,../data/1950/11-27/DDD_010852409/DDD:ddd:01085...,De waarheid,1950-11-27,Amsterdam,s.n.,Internationaal Instituut voor Sociale Geschied...,10.0,479.0,nl
537,73362,p,WEST I TWEEDE KLASSE B: AFC—Volendam I—2' Zeeb...,DDD_010852409_0113_articletext.xml,1950-11-27,127150,../data/1950/11-27/DDD_010852409/DDD_010852409...,../data/1950/11-27/DDD_010852409,980.0,DDD:ddd:010852409:mpeg21.didl.xml.gz.xml,1155.0,../data/1950/11-27/DDD_010852409/DDD:ddd:01085...,De waarheid,1950-11-27,Amsterdam,s.n.,Internationaal Instituut voor Sociale Geschied...,10.0,479.0,nl
538,73575,p,DERDE KLASSE WEST I A: Hollandia— *I«m. B. I—3...,DDD_011202181_0119_articletext.xml,1950-11-27,127230,../data/1950/11-27/DDD_011202181/DDD_011202181...,../data/1950/11-27/DDD_011202181,981.0,DDD:ddd:011202181:mpeg21.didl.xml.gz.xml,1156.0,../data/1950/11-27/DDD_011202181/DDD:ddd:01120...,De Tĳd : godsdienstig-staatkundig dagblad,1950-11-27,'s-Hertogenbosch,Gebr. Verhoeven,Koninklijke Bibliotheek C 236,106.0,34592.0,nl
539,99637,p,De laatste uitslagen in de co .:-! petitie van...,DDD_110584788_0094_articletext.xml,1950-01-17,136569,../data/1950/01-17/DDD_110584788/DDD_110584788...,../data/1950/01-17/DDD_110584788,55.0,DDD:ddd:110584788:mpeg21.didl.xml.gz.xml,1241.0,../data/1950/01-17/DDD_110584788/DDD:ddd:11058...,De Telegraaf,1950-01-17,Amsterdam,Dagblad De Telegraaf,KB C 98,53.0,19424.0,nl


### Use the multilingual model pre-trained on 10+ languages

### Play around with `SBERT`

The model is `distiluse-base-multilingual-cased`, from the [SBERT pretrained models page](https://www.sbert.net/docs/pretrained_models.html).

In [27]:
# Load the model
model = SentenceTransformer("../data/models/distiluse-base-multilingual-cased", device=device)

# Load paragraphs
sentences = list(search["text"])

# Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

# Store sentences & embeddings on disk
embeddings_path = '../data/processed/embeddings/test_embeddings.pkl'
with open(embeddings_path, "wb") as fOut:
    pickle.dump({'sentences': sentences, 'embeddings': embeddings}, fOut, protocol=pickle.HIGHEST_PROTOCOL)

# Load sentences & embeddings from the same file
with open(embeddings_path, "rb") as fIn:
    stored_data = pickle.load(fIn)
    stored_sentences = stored_data['sentences']
    stored_embeddings = stored_data['embeddings']

2020-09-22 16:22:49 - Load pretrained SentenceTransformer: ../data/models/distiluse-base-multilingual-cased
2020-09-22 16:22:49 - Load SentenceTransformer from folder: ../data/models/distiluse-base-multilingual-cased
2020-09-22 16:22:49 - loading configuration file ../data/models/distiluse-base-multilingual-cased/0_DistilBERT/config.json
2020-09-22 16:22:49 - Model config DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_hidden_states": true,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "vocab_size": 119547
}

2020-09-22 16:22:49 - loading weights file ../data/models/distiluse-base-multilingual-cased/0_DistilBERT/pytorch_model.bin
2020-09-22 16:22:56 - All model checkpoi


KeyboardInterrupt: 

In [33]:
#Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: ét Stoelrompenfabrlek te Bergschenhoek vraagt halfwas meubelmaKers. Aanmelden . fabriek Rottekade 22, N. Na 20 uur Zaagmolenkade 3b, N. . . ..-.. ... ét Bekwame machlnestlksters, handwerksters en - leerlingen. Hoog loon, prettige werkkring. Confectiebedrijf G.. v.- d.'.Ree, Brede HiUedijk 201, Z. . - ét Flinke verpleeghulp of leerling, ln- of extern. Verpleeginrichting „Maria", I Mathenesserlaan 343. W. '• ••■ ét Fletsjongen. Apotheek Dr Gerhardt & De Keuning, Bergselaan 275, N. Aanm.- tussen 8 en 18 uur. > .. ét Accuraat meisje voor eenvoudig expedltlewerk. Goed - handschrift en goed kunnende rekenen. „Twenithe", Bergweg 293, N. ét ■ Aankomende stiksters en leerlingen v. confectle-grootwerk. Maurltsstr. 77, C. (In de poort). Na 17.30 uur St.-Mariastraat 31, C. ét Wij vragen voor direct éen snljder/opstoter. Betaling volgens C.A.O. Drukkerij C. ChevaUer, Plekstraat 20, Zuid, Telefoon 77700. .. -.'..' .".'■•:..< r ét Machlnestlksters, handwerksters en leerlingen. Herenconfectl

### Semantic search

In [29]:
embedder = SentenceTransformer('../data/models/distiluse-base-multilingual-cased')

# Corpus with example sentences
corpus = list(search["text"])[10:]

corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences: `a` is the result of text_selection.select_articles() above,
# so that cell must have completed before this one is run
queries = list(a["text"])[0:10]

# Find the closest top_k sentences of the corpus for each query sentence based on cosine similarity
top_k = 10
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    cos_scores = cos_scores.cpu()

    #We use np.argpartition, to only partially sort the top_k results
    top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop %d most similar sentences in corpus:" % top_k)

    for idx in top_results[0:top_k]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))

2020-09-22 16:28:07 - Load pretrained SentenceTransformer: ../data/models/distiluse-base-multilingual-cased
2020-09-22 16:28:07 - Load SentenceTransformer from folder: ../data/models/distiluse-base-multilingual-cased
2020-09-22 16:28:07 - loading configuration file ../data/models/distiluse-base-multilingual-cased/0_DistilBERT/config.json
2020-09-22 16:28:07 - Model config DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_hidden_states": true,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "vocab_size": 119547
}

2020-09-22 16:28:07 - loading weights file ../data/models/distiluse-base-multilingual-cased/0_DistilBERT/pytorch_model.bin
2020-09-22 16:28:12 - All model checkpoi





NameError: name 'a' is not defined
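
The ranking step in the cell above can be studied in isolation, without loading a model. The sketch below uses tiny made-up vectors in place of real sentence embeddings and a hypothetical `top_k_cosine` helper; it assumes that no vector has zero length.

```python
import numpy as np


def top_k_cosine(query: np.ndarray, corpus: np.ndarray, k: int):
    """Return (indices, scores) of the k corpus rows most similar to `query`."""
    # After L2 normalisation, the dot product equals the cosine similarity.
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = corpus_norm @ query_norm
    k = min(k, len(scores))
    # argpartition isolates the k best candidates without a full sort;
    # only those k candidates are then sorted by descending score.
    candidates = np.argpartition(-scores, k - 1)[:k]
    order = candidates[np.argsort(-scores[candidates])]
    return order, scores[order]


# Stand-ins for embeddings; real ones would come from `embedder.encode(...)`.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.1])

indices, scores = top_k_cosine(query, corpus, k=2)
print(indices, np.round(scores, 3))
```

Partial sorting matters mainly for large corpora, where sorting every score just to keep the ten best would be wasteful.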

### Playing around with BERTje

In [18]:
from transformers import BertTokenizer, BertModel


tokenizer = BertTokenizer.from_pretrained("wietsedv/bert-base-dutch-cased")
model = BertModel.from_pretrained("wietsedv/bert-base-dutch-cased")

2020-09-02 11:23:32 - Lock 5397104528 acquired on /Users/leonardovida/.cache/torch/transformers/75d9be4cc7910048b3bdd477c435ffc46330193705f74eaf9a4f375cd3be28b2.1e00a56207196ed1759c49bdd1fa93c2fb20273d59fabb0c4c8092f7beb773c2.lock
2020-09-02 11:23:32 - https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/vocab.txt not found in cache or force_download set to True, downloading to /Users/leonardovida/.cache/torch/transformers/tmppdbo_bik



2020-09-02 11:23:33 - storing https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/vocab.txt in cache at /Users/leonardovida/.cache/torch/transformers/75d9be4cc7910048b3bdd477c435ffc46330193705f74eaf9a4f375cd3be28b2.1e00a56207196ed1759c49bdd1fa93c2fb20273d59fabb0c4c8092f7beb773c2
2020-09-02 11:23:33 - creating metadata file for /Users/leonardovida/.cache/torch/transformers/75d9be4cc7910048b3bdd477c435ffc46330193705f74eaf9a4f375cd3be28b2.1e00a56207196ed1759c49bdd1fa93c2fb20273d59fabb0c4c8092f7beb773c2
2020-09-02 11:23:33 - Lock 5397104528 released on /Users/leonardovida/.cache/torch/transformers/75d9be4cc7910048b3bdd477c435ffc46330193705f74eaf9a4f375cd3be28b2.1e00a56207196ed1759c49bdd1fa93c2fb20273d59fabb0c4c8092f7beb773c2.lock
2020-09-02 11:23:33 - loading file https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/vocab.txt from cache at /Users/leonardovida/.cache/torch/transformers/75d9be4cc7910048b3bdd477c435ffc46330193705f74




2020-09-02 11:23:34 - Lock 5357248464 acquired on /Users/leonardovida/.cache/torch/transformers/6702c5c53edb76b65d71f73ff2d9811ba62f16257ea58e36dedceffd71290a6a.1a78bd120fe46d78b55efa59f4ffa1dafcc9242743ab9fd6629d1b56672c9119.lock
2020-09-02 11:23:34 - https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/config.json not found in cache or force_download set to True, downloading to /Users/leonardovida/.cache/torch/transformers/tmpqyn8q3s_



2020-09-02 11:23:34 - storing https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/config.json in cache at /Users/leonardovida/.cache/torch/transformers/6702c5c53edb76b65d71f73ff2d9811ba62f16257ea58e36dedceffd71290a6a.1a78bd120fe46d78b55efa59f4ffa1dafcc9242743ab9fd6629d1b56672c9119
2020-09-02 11:23:34 - creating metadata file for /Users/leonardovida/.cache/torch/transformers/6702c5c53edb76b65d71f73ff2d9811ba62f16257ea58e36dedceffd71290a6a.1a78bd120fe46d78b55efa59f4ffa1dafcc9242743ab9fd6629d1b56672c9119
2020-09-02 11:23:34 - Lock 5357248464 released on /Users/leonardovida/.cache/torch/transformers/6702c5c53edb76b65d71f73ff2d9811ba62f16257ea58e36dedceffd71290a6a.1a78bd120fe46d78b55efa59f4ffa1dafcc9242743ab9fd6629d1b56672c9119.lock
2020-09-02 11:23:34 - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/config.json from cache at /Users/leonardovida/.cache/torch/transformers/6702c5c53edb76b65d71f73ff2d




2020-09-02 11:23:35 - Lock 5390180048 acquired on /Users/leonardovida/.cache/torch/transformers/e5754f612ca0f16edba5b775fdddba806751f5e4b87c5e7f16cc0c8d8d17df4d.b7c03627733fd0712f078a4d3a31ad964550f50a6113efdf874ecbcf5ddf6b53.lock
2020-09-02 11:23:35 - https://cdn.huggingface.co/wietsedv/bert-base-dutch-cased/pytorch_model.bin not found in cache or force_download set to True, downloading to /Users/leonardovida/.cache/torch/transformers/tmp1wjgs76_



2020-09-02 11:24:36 - storing https://cdn.huggingface.co/wietsedv/bert-base-dutch-cased/pytorch_model.bin in cache at /Users/leonardovida/.cache/torch/transformers/e5754f612ca0f16edba5b775fdddba806751f5e4b87c5e7f16cc0c8d8d17df4d.b7c03627733fd0712f078a4d3a31ad964550f50a6113efdf874ecbcf5ddf6b53
2020-09-02 11:24:36 - creating metadata file for /Users/leonardovida/.cache/torch/transformers/e5754f612ca0f16edba5b775fdddba806751f5e4b87c5e7f16cc0c8d8d17df4d.b7c03627733fd0712f078a4d3a31ad964550f50a6113efdf874ecbcf5ddf6b53
2020-09-02 11:24:36 - Lock 5390180048 released on /Users/leonardovida/.cache/torch/transformers/e5754f612ca0f16edba5b775fdddba806751f5e4b87c5e7f16cc0c8d8d17df4d.b7c03627733fd0712f078a4d3a31ad964550f50a6113efdf874ecbcf5ddf6b53.lock
2020-09-02 11:24:36 - loading weights file https://cdn.huggingface.co/wietsedv/bert-base-dutch-cased/pytorch_model.bin from cache at /Users/leonardovida/.cache/torch/transformers/e5754f612ca0f16edba5b775fdddba806751f5e4b87c5e7f16cc0c8d8d17df4d.b7c036




2020-09-02 11:24:40 - All model checkpoint weights were used when initializing BertModel.

2020-09-02 11:24:40 - All the weights of BertModel were initialized from the model checkpoint at wietsedv/bert-base-dutch-cased.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use BertModel for predictions without further training.
