# Part I: Text selection

In this first phase of the project, we tackle the first problem: selecting similar texts. Initially the research is restricted to texts that deal with `energy`. However, this scope may change or be expanded later.

**Phases of Part I:**
- **Validate the approach to the project**:
    1. Decide whether to use title and paragraphs or only one of the two
    2. Find the most efficient way to read all the xml files
    3. Begin to label a golden set of texts that are within the scope of the research AND select the most important keywords that will be used to search for similar texts
    4. Run the text similarity ML algorithm
    5. Have the teaching assistant go through the selection and identify mistakes
- **To think about**: how to keep the relevant information about the text fragment (i.e. newspaper origin and date)?
- **Decide the tools to use for text selection**. Current choices are:
    - Use `sentence-transformers` from UKPLab (https://github.com/UKPLab/sentence-transformers)
        - Generate embeddings on sentences (max 512 tokens)
        - Find similar texts
    - Use `faiss` from Facebook AI (https://github.com/facebookresearch/faiss)
        - Less documentation but seemingly more scalable
    - Use ASReview from Utrecht University
        - A meeting with Jonathan or Raul is necessary to understand the feasibility of this approach
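
The two `sentence-transformers` steps listed above (generate embeddings, then find similar texts) come down to a nearest-neighbour search by cosine similarity. Below is a minimal sketch with NumPy, assuming the embeddings have already been computed (for example with `model.encode(texts)`). The function name `top_k_similar` and the toy vectors are illustrative only and not part of the project code.

```python
import numpy as np


def top_k_similar(query_vec, corpus_vecs, k=3):
    """Return (indices, scores) of the k corpus rows most similar to query_vec."""
    corpus = np.asarray(corpus_vecs, dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # Normalise to unit length so the dot product equals cosine similarity.
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = corpus_norm @ query_norm
    # Sort by descending similarity and keep the first k.
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Toy 3-dimensional "embeddings" standing in for real model output.
corpus_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query_vec = np.array([1.0, 0.05, 0.0])
indices, scores = top_k_similar(query_vec, corpus_vecs, k=2)
```

This brute-force version compares the query against every text, which is fine for a small golden set. For the full corpus the same ranking could probably be delegated to `faiss`, which is why it is listed as the more scalable option.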

## Approach with `sentence-transformers`

In [1]:
from sentence_transformers import SentenceTransformer, LoggingHandler
import numpy as np
import pandas as pd
import logging

#### Just some code to print debug information to stdout
np.set_printoptions(threshold=100)

logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout

### Use the multilingual model pre-trained on 10+ languages

The model is `distiluse-base-multilingual-cased`, taken from the [sbert](https://www.sbert.net/docs/pretrained_models.html) list of pretrained models.

In [2]:
model = SentenceTransformer('distiluse-base-multilingual-cased')

2020-08-27 15:25:05 - Load pretrained SentenceTransformer: distiluse-base-multilingual-cased
2020-08-27 15:25:05 - Did not find a '/' or '\' in the name. Assume to download model from server.
2020-08-27 15:25:05 - Downloading sentence transformer model from https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/distiluse-base-multilingual-cased.zip and saving it at /Users/leonardovida/.cache/torch/sentence_transformers/public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_distiluse-base-multilingual-cased.zip


100%|██████████| 504M/504M [01:04<00:00, 7.81MB/s] 


2020-08-27 15:26:21 - Load SentenceTransformer from folder: /Users/leonardovida/.cache/torch/sentence_transformers/public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_distiluse-base-multilingual-cased.zip
2020-08-27 15:26:22 - loading configuration file /Users/leonardovida/.cache/torch/sentence_transformers/public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_distiluse-base-multilingual-cased.zip/0_DistilBERT/config.json
2020-08-27 15:26:22 - Model config DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_hidden_states": true,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "vocab_size": 119547
}

2020-08-27 15:26:22 - loading weights

### Todos

- Find a way to create an inventory of the files
    - Create a dictionary(?) to keep track of the file locations
- Each Series or DataFrame needs a title that describes the file location
- Compare speed between Series and DataFrame?


In [119]:
import xml.etree.ElementTree as et 
import collections
import sys
import os

sys.path.insert(0, "..")

### Catalogue files

Iterate over the main directory and its subdirectories and create a dictionary that maps each XML file name to its full path.

In [122]:
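# `path` was undefined here; it is assumed to be `sys.path`,
# whose first entry was set to ".." in the cell above
rootdir = os.path.join(sys.path[0], "data")
xml_file_names = {}

# Walk the data directory tree and record every XML file by name
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        filepath = os.path.join(subdir, file)

        if filepath.endswith(".xml"):
            xml_file_names[file] = filepath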
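
The catalogue above keeps only the file name and its path. One possible answer to the open question of how to keep the origin and date of each fragment is to record the folder names as metadata as well. The sketch below assumes the layout `data/<year>/<period>/<issue_id>/<file>.xml` seen in the example paths of this notebook. The meaning of the middle folder (e.g. `01-02`) is not stated, so it is kept as an opaque string. The name `catalogue_articles` and the temporary demo file are hypothetical.

```python
import tempfile
from pathlib import Path


def catalogue_articles(rootdir):
    """Map each XML file name to its path plus metadata read from the folders."""
    root = Path(rootdir)
    catalogue = {}
    for xml_path in sorted(root.rglob("*.xml")):
        parts = xml_path.relative_to(root).parts
        # Only files nested exactly as <year>/<period>/<issue_id>/<file>
        # get metadata; anything else is recorded with None values.
        if len(parts) == 4:
            year, period, issue_id = parts[:3]
        else:
            year = period = issue_id = None
        catalogue[xml_path.name] = {
            "path": str(xml_path),
            "year": year,
            "period": period,
            "issue_id": issue_id,
        }
    return catalogue


# Demo on a throwaway directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    article = Path(tmp, "1950", "01-02", "DDD_010403143",
                   "DDD_010403143_0001_articletext.xml")
    article.parent.mkdir(parents=True)
    article.write_text("<text><title>t</title></text>", encoding="utf-8")
    catalogue = catalogue_articles(tmp)
```

Keeping the metadata next to the path means it can later be merged into the parsed DataFrame by file name, without re-reading the directory tree.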

### Parse XML files

In [117]:
def parse_XML(xml_file):
    """Parse the input XML file and return its content as a one-row
    pandas DataFrame: one column per child tag, with each <p> element
    stored in its own column named p_<position>.
    """

    xtree = et.parse(xml_file)
    xroot = xtree.getroot()
    data = {}

    for i, node in enumerate(xroot):
        if node.tag != "p":
            data[node.tag] = node.text
        else:
            # <p> tags repeat, so suffix each with its position under the root
            data[node.tag+"_"+str(i)] = node.text

    # Build a one-row DataFrame (index 0) from the tag -> text mapping
    out_df = pd.DataFrame.from_records(data, index=[0])

    return out_df
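
For the embedding step, the open choice between title, paragraphs, or both can be kept as a switch when the text is assembled. The sketch below is a hedged variant of the parser, not the project's method: it assumes, as `parse_XML` does, that `title` and `p` are direct children of the root. The function name `article_text`, its flags, and the root tag `<text>` in the sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def article_text(xml_source, use_title=True, use_paragraphs=True):
    """Join the title and/or paragraphs of one article into a single string."""
    root = ET.fromstring(xml_source)
    pieces = []
    for node in root:
        text = (node.text or "").strip()
        if not text:
            continue  # skip empty elements such as <p/>
        if node.tag == "title" and use_title:
            pieces.append(text)
        elif node.tag == "p" and use_paragraphs:
            pieces.append(text)
    return "\n".join(pieces)


# Hypothetical sample document; the root tag name is an assumption.
sample = (
    "<text><title>U.S.A. - CHINA.</title>"
    "<p>First paragraph.</p><p/><p>Second.</p></text>"
)
full_text = article_text(sample)
title_only = article_text(sample, use_paragraphs=False)
```

Producing one string per article avoids the variable number of `p_<n>` columns and gives the model a single input per text.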

In [114]:
a = parse_XML(sys.path[0]+"/data/1950/01-02/DDD_010403143/DDD_010403143_0002_articletext.xml")
b = parse_XML(sys.path[0]+"/data/1950/01-02/DDD_010403143/DDD_010403143_0001_articletext.xml")
c = parse_XML(sys.path[0]+"/data/1950/01-02/DDD_010403143/DDD_010403143_0002_articletext.xml")
# parse_XML returns a one-row DataFrame; take that row as a Series
# so that each article becomes one column
data = pd.DataFrame({'a': a.iloc[0], 'b': b.iloc[0], 'c': c.iloc[0]})
data

| | a | b | c |
|---|---|---|---|
| p_1 | Xew Vork — Volgens de „New Vork Times" zouden ... | Londen — Mao Tse Toeng, de Chinese communistis... | Xew Vork — Volgens de „New Vork Times" zouden ... |
| p_2 | eens geworden zijn, dat van een Xoordamerikaan... | | eens geworden zijn, dat van een Xoordamerikaan... |
| p_3 | Volgens Reston is er een ernstig meningsversch... | | Volgens Reston is er een ernstig meningsversch... |
| p_4 | Het Staatsdepartement was: van gevoelen, dat d... | | Het Staatsdepartement was: van gevoelen, dat d... |
| title | U.S.A. — CHINA. | Chinese communistenleider onderhandelt in Mosk... | U.S.A. — CHINA. |


In [118]:
a = parse_XML(sys.path[0]+"/data/1950/01-02/DDD_010403143/DDD_010403143_0002_articletext.xml")
b = parse_XML(sys.path[0]+"/data/1950/01-02/DDD_010403143/DDD_010403143_0001_articletext.xml")
c = parse_XML(sys.path[0]+"/data/1950/01-02/DDD_010403143/DDD_010403143_0002_articletext.xml")
# Stack the one-row DataFrames: one article per row
pd.concat([a, b, c], ignore_index=True)

| | p_1 | p_2 | p_3 | p_4 | title |
|---|---|---|---|---|---|
| 0 | Xew Vork — Volgens de „New Vork Times" zouden ... | eens geworden zijn, dat van een Xoordamerikaan... | Volgens Reston is er een ernstig meningsversch... | Het Staatsdepartement was: van gevoelen, dat d... | U.S.A. — CHINA. |
| 1 | Londen — Mao Tse Toeng, de Chinese communistis... | | | | Chinese communistenleider onderhandelt in Mosk... |
| 2 | Xew Vork — Volgens de „New Vork Times" zouden ... | eens geworden zijn, dat van een Xoordamerikaan... | Volgens Reston is er een ernstig meningsversch... | Het Staatsdepartement was: van gevoelen, dat d... | U.S.A. — CHINA. |
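
In the one-row-per-article layout the number of `p_<n>` columns varies between articles, and shorter articles leave missing values. Before encoding, the paragraph columns could be collapsed into a single text column. The following is a sketch on a toy frame of the same shape; the helper name `add_text_column` and the toy data are hypothetical.

```python
import pandas as pd


def add_text_column(df):
    """Add a 'text' column joining all non-missing p_<n> cells of each row."""
    # Order paragraph columns numerically so that p_10 comes after p_2.
    para_cols = sorted(
        (c for c in df.columns if c.startswith("p_") and c[2:].isdigit()),
        key=lambda c: int(c[2:]),
    )
    out = df.copy()
    out["text"] = out[para_cols].apply(
        lambda row: " ".join(row.dropna().astype(str)), axis=1
    )
    return out


# Toy frame mirroring the shape of the concatenated output.
toy = pd.DataFrame({
    "p_1": ["first", "only"],
    "p_2": ["second", None],
    "title": ["A", "B"],
})
with_text = add_text_column(toy)
```

The resulting `text` column can then be passed to the model as a plain list of strings, for example with `with_text["text"].tolist()`.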
