In [1]:
from IPython.display import Markdown, display

## Links to Project Resources

- [Trello board](https://trello.com/invite/b/BWnRAtKJ/3e7ce03017000289323e762d0ed2e304/histaware)
- [Notion Wiki](https://www.notion.so/HistAware-529aba41f84946b19d493394ef6a2748)

# Part I: Text selection

In this first phase of the project, we tackle the problem of selecting similar texts. Initially, the research focuses on texts that deal with `energy`. However, this scope might change or be expanded.

**Phases of Part I:**
- **Validate the approach to the project**:
    1. Decide whether to use title and paragraphs or only one of the two
    2. Find the most efficient way to read all the xml files
    3. Begin to label a golden set of texts that are within the scope of the research AND select the most important keywords that will be used to search for similar texts
    4. Run the text similarity ML algorithm
    5. Have the teaching assistant go through the selection and identify mistakes
- **To think about**: how to keep the relevant information about the text fragment (i.e. newspaper origin and date)?
- **Decide the tools to use for text selection**. Current choices are:
    - Use `sentence-transformers` from UKPLab (https://github.com/UKPLab/sentence-transformers)
        - Generate embeddings on sentences (max 512 tokens)
        - Find similar texts
    - Use `faiss` from Facebook AI (https://github.com/facebookresearch/faiss)
        - Less documentation but seemingly more scalable
    - Use ASReview from Utrecht University
        - A meeting with Jonathan or Raul is necessary to understand the feasibility of this approach
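
The embedding-based options above share one core step: ranking candidate texts by how similar their embedding is to the embedding of a query text. The sketch below illustrates that step with the standard library only. The three-dimensional vectors, newspaper names and dates are invented placeholders (real embeddings would come from a model such as `sentence-transformers`), and `Fragment` and `rank_by_similarity` are hypothetical names. Storing the origin and date in the same record as the text is one possible way to keep the metadata attached to each fragment.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Fragment:
    """A text fragment together with the metadata we want to keep."""
    text: str
    newspaper: str               # hypothetical metadata field
    date: str                    # ISO date string, hypothetical metadata field
    embedding: Sequence[float]   # placeholder vector, not a real model output


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: Sequence[float],
                       fragments: List[Fragment]) -> List[Tuple[float, Fragment]]:
    """Return (score, fragment) pairs, most similar first."""
    scored = [(cosine_similarity(query, f.embedding), f) for f in fragments]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


fragments = [
    Fragment("Text about coal mines", "Newspaper A", "1950-01-15", [0.9, 0.1, 0.0]),
    Fragment("Text about football", "Newspaper B", "1950-02-03", [0.0, 0.2, 0.9]),
    Fragment("Text about oil prices", "Newspaper A", "1950-03-21", [0.8, 0.3, 0.1]),
]
query_embedding = [1.0, 0.2, 0.0]  # placeholder embedding of an "energy" query

ranking = rank_by_similarity(query_embedding, fragments)
for score, fragment in ranking:
    print(f"{score:.3f}  {fragment.date}  {fragment.newspaper}  {fragment.text}")
```

With these placeholder vectors, the two energy-related fragments rank above the football fragment, and each result still carries its newspaper and date.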

### Import statements

In [2]:
import numpy as np
import pandas as pd
import logging
import re
from datetime import datetime
import xml.etree.ElementTree as et 
import collections
import sys
import os

%matplotlib inline
%config InlineBackend.figure_format='retina'


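#### Just some code to print debug information to stdout
np.set_printoptions(threshold=100)

# LoggingHandler is provided by sentence-transformers; import it here
# because the other sentence-transformers imports happen in a later cell
from sentence_transformers import LoggingHandler

logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout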

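# `path` is a reference to sys.path (not a copy), so after the insert
# below path[0] is "..", i.e. the main folder relative to this notebook
path = sys.path
# To go back to main folder
sys.path.insert(0, "..")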

### Create a catalogue of the files

We save each file name and its path in a dictionary. We then convert the dictionary into a DataFrame so that we can later keep track of the index at which the parsing stopped or was interrupted (a DataFrame can be addressed by position, whereas a dictionary cannot).

In [20]:
def iterate_directory(path_dir, file_type):
    """Iterate over `path_dir` and its children and create a
    dictionary mapping the name of each file of type `file_type`
    to its path.

    Note: files with the same name in different folders overwrite
    each other, because the file name is used as the key.
    """
    rootdir = path[0] + path_dir
    file_names = {}

    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            filepath = os.path.join(subdir, file)

            if filepath.endswith(str(file_type)):
                file_names[file] = filepath

    return file_names

In [5]:
xml_file_names = iterate_directory("/data/",".xml")
df_file_names = pd.DataFrame(list(xml_file_names.items()),
                             columns=['article_name', 'article_path'])
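
The point of an ordered catalogue is that an interrupted run can resume from a known position. The sketch below shows that idea with the standard library only: it builds a sorted list of (name, path) pairs and slices it from a saved position. The temporary directory, the file names and the `build_catalogue` helper are hypothetical stand-ins for the real data folder and are not part of the notebook.

```python
import tempfile
from pathlib import Path


def build_catalogue(root, suffix):
    """Return a sorted list of (file name, file path) for files under `root`."""
    return sorted((p.name, str(p)) for p in Path(root).rglob("*" + suffix))


with tempfile.TemporaryDirectory() as tmp:
    # Create a tiny fake data folder (hypothetical folders and file names)
    for sub, name in [("1950/01-15", "a_articletext.xml"),
                      ("1950/01-16", "b_articletext.xml"),
                      ("1950/01-16", "notes.txt")]:
        folder = Path(tmp) / sub
        folder.mkdir(parents=True, exist_ok=True)
        (folder / name).write_text("<root/>")

    catalogue = build_catalogue(tmp, ".xml")

# Resume after the first entry, as if parsing had stopped at position 1
last_done = 1
remaining = catalogue[last_done:]
print(len(catalogue), [name for name, _ in remaining])
```

Because the list is sorted, the same position refers to the same file on every run, as long as the folder contents do not change.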

### Parse XML files

In [6]:
def parse_XML(xml_file, title, index):
    """Parse the input XML file and store the result in a pandas
    Series.

    Takes the file path, the file title and the integer index of the
    file in the catalogue DataFrame.
    """

    xtree = et.parse(xml_file)
    xroot = xtree.getroot()

    # Parse the date from the path, which contains <year>/<month>-<day>
    match = re.search(r'\d{4}/\d{2}-\d{2}', xml_file)
    if match is None:
        raise ValueError("No date found in path: " + xml_file)
    date = datetime.strptime(match.group(), '%Y/%m-%d').date()

    # Metadata shared by all nodes of the article
    data = {"article_name": str(title), "date": str(date), "index": index}

    for i, node in enumerate(xroot):
        if node.tag != "p":
            data[node.tag] = node.text
        else:
            # Paragraph tags repeat, so add the node position as a suffix
            data[node.tag+"_"+str(i)] = node.text

    return pd.Series(data)
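
To make the flattening step concrete, here is a small self-contained sketch of the same idea applied to an in-memory XML string. The XML content, the tag layout and the file path are invented for this example and may not match the real article files; only the date pattern `<year>/<month>-<day>` is taken from the regular expression used above.

```python
import re
import xml.etree.ElementTree as et
from datetime import datetime

# Hypothetical article content and path, invented for this example
xml_text = ("<text><title>Kolenmijn</title>"
            "<p>First paragraph.</p><p>Second paragraph.</p></text>")
xml_path = "data/1950/01-15/DDD_000000000_0001_articletext.xml"

# The date is read from the folder structure <year>/<month>-<day>
match = re.search(r"\d{4}/\d{2}-\d{2}", xml_path)
date = datetime.strptime(match.group(), "%Y/%m-%d").date()

record = {"date": str(date)}
for i, node in enumerate(et.fromstring(xml_text)):
    # Paragraph tags repeat, so they get the child position as a suffix
    key = node.tag if node.tag != "p" else f"p_{i}"
    record[key] = node.text

print(record)
```

Note that the paragraph suffix is the position of the node among all children of the root, so the first paragraph is `p_1` here only because a title comes before it.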

**Utils Addendum**

To search for an `article_path` or `article_name` given the other, use the following:

In [301]:
#a = df_file_names.loc[df_file_names['article_name'] == "DDD_110637387_0004_articletext.xml"]
#a = df_file_names.iloc[0]
c = df_file_names.iloc[500000]

### Iterate through the files given

Currently, each parse in this loop takes ~0.012 s. This is slow, and the cause is not the `parse_XML` function (which is efficient) but the `concat` between Series.

At this rate, 100,000 documents take around 20 minutes to parse.
- If possible, replace the concat statement with something more efficient!
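
One possible direction, sketched here with the standard library only, is to collect plain dictionaries and assemble each batch in a single step instead of combining many small objects. `batches` and `records_to_table` are hypothetical helpers and the records are invented; with pandas, passing the list of dictionaries of one batch to `pd.DataFrame` would play the same role. Whether this is actually faster on the real data has not been measured here. The last lines simply check the timing arithmetic quoted above.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def records_to_table(records):
    """Build a column-oriented table from dicts whose keys may differ."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    # Missing values become None so that every column has the same length
    return {col: [rec.get(col) for rec in records] for col in columns}


# Hypothetical parsed articles with a different number of paragraphs
parsed = [
    {"article_name": "a.xml", "p_1": "one"},
    {"article_name": "b.xml", "p_1": "two", "p_2": "three"},
    {"article_name": "c.xml"},
]

# The last batch is smaller than the others but is still produced
tables = [records_to_table(batch) for batch in batches(parsed, 2)]
print(tables[0])
print(tables[1])

# Timing quoted in the text: 0.012 s per file for 100,000 files
total_minutes = 0.012 * 100_000 / 60
print(round(total_minutes))
```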

In [11]:
def iterate_files(files, batch_size=10000):
    """Iterate through files `files`, parse them and save the
    results to Feather files in batches of `batch_size` articles.
    """
    previous_i = 0
    list_series = []

    for n, (index, row) in enumerate(files.iterrows(), start=1):
        if ((n - 1) % 1000 == 0):
            print("Files parsed: "+str(n - 1))
            print("Current file: "+row["article_name"]+"\n")
        list_series.append(parse_XML(row["article_path"], row["article_name"], index))
        # Save when the batch is full, and also after the last file so
        # that a final, smaller batch is not lost
        if (n % batch_size == 0 or n == len(files)):
            file_path = path[0]+"/data/processed/processed_data_"+str(previous_i)+"_"+str(n)+".ftr"
            main = pd.DataFrame(pd.concat(list_series, axis = 1).T)
            main.to_feather(file_path)
            list_series = []
            previous_i = n

In [12]:
iterate_files(df_file_names)

Files parsed: 0
Current file: DDD_110637387_0004_articletext.xml

Files parsed: 1000
Current file: DDD_010865749_0044_articletext.xml

Files parsed: 2000
Current file: DDD_010537363_0050_articletext.xml

Files parsed: 3000
Current file: DDD_011210678_0092_articletext.xml

Files parsed: 4000
Current file: DDD_010612636_0060_articletext.xml

Files parsed: 5000
Current file: DDD_110584865_0073_articletext.xml

Files parsed: 6000
Current file: DDD_010537272_0086_articletext.xml

Files parsed: 7000
Current file: DDD_010862531_0063_articletext.xml

Files parsed: 8000
Current file: DDD_010873914_0016_articletext.xml

Files parsed: 9000
Current file: DDD_011202061_0045_articletext.xml

Files parsed: 10000
Current file: DDD_010850957_0008_articletext.xml

Files parsed: 11000
Current file: DDD_010950415_0125_articletext.xml

Files parsed: 12000
Current file: DDD_010612597_0072_articletext.xml

Files parsed: 13000
Current file: DDD_010733818_0003_articletext.xml

Files parsed: 14000
Current file:

KeyboardInterrupt: 

## Text selection model

### Ingest parsed files

Once we parse all the files in the example `data-1950` folder, we obtain 65 files that contain the parsed data in a format that is easier for a machine to read. At roughly 10 MB per file, the total size is 65*10=650MB, a 5x reduction from the original size of the dataset.

In [14]:
# https://www.sbert.net/docs/
from sentence_transformers import SentenceTransformer, LoggingHandler

# These are the pure transformers from huggingface
import transformers
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch

import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from collections import defaultdict
from textwrap import wrap
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader

# Set matplotlib figure size
rcParams['figure.figsize'] = 12, 8

# Set fixed random seed
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

# Find GPU on device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [23]:
ftr_file_names = iterate_directory("/data/processed/",".ftr")
ftr_file_names = pd.DataFrame(list(ftr_file_names.items()),
                              columns=['ftr_name', 'ftr_path'])

In [27]:
list_ftr = []

for index, row in ftr_file_names.iterrows():
    if index > 1:
        break
    else:
        ftr = pd.read_feather(row["ftr_path"])
        list_ftr.append(ftr)

In [48]:
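list_sentences = []

# Paragraph columns are named p_<n> by parse_XML
p_columns = [c for c in ftr.columns if c.startswith("p_")]

# Note: `ftr` is the last file loaded in the previous cell
for index, row in ftr.iterrows():
    for p in p_columns:
        # Skip missing (None) and empty paragraphs
        if row[p]:
            list_sentences.append(row[p])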

In [55]:
min(list_sentences, key=len)

'8'
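
The shortest paragraph is a single character, which suggests that some paragraphs may be too short to carry any meaning. One possible follow-up, not part of the original pipeline, would be to drop very short paragraphs before computing embeddings. The sketch below uses invented paragraphs, and the threshold `MIN_CHARS` is an arbitrary value chosen for illustration only.

```python
# Hypothetical paragraphs; the single-character entry mimics the '8' found above
paragraphs = ["8", "De kolenprijs stijgt opnieuw.", "", "Olie en gas in Nederland."]

MIN_CHARS = 10  # arbitrary threshold, chosen for illustration only

# Keep paragraphs with at least MIN_CHARS characters after stripping whitespace
kept = [p for p in paragraphs if len(p.strip()) >= MIN_CHARS]
dropped = [p for p in paragraphs if len(p.strip()) < MIN_CHARS]

print(kept)
print(dropped)
```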

### Use the multilingual model pre-trained on 10+ languages

The model is the `distiluse-base-multilingual-cased` model, from [sbert](https://www.sbert.net/docs/pretrained_models.html).

In [2]:
model = SentenceTransformer('distiluse-base-multilingual-cased', device=device)
# Load paragraphs (the list built from the parsed files above)
sentences = list_sentences

#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

#Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

2020-08-27 15:25:05 - Load pretrained SentenceTransformer: distiluse-base-multilingual-cased
2020-08-27 15:25:05 - Did not find a '/' or '\' in the name. Assume to download model from server.
2020-08-27 15:25:05 - Downloading sentence transformer model from https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/distiluse-base-multilingual-cased.zip and saving it at /Users/leonardovida/.cache/torch/sentence_transformers/public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_distiluse-base-multilingual-cased.zip


100%|██████████| 504M/504M [01:04<00:00, 7.81MB/s] 


2020-08-27 15:26:21 - Load SentenceTransformer from folder: /Users/leonardovida/.cache/torch/sentence_transformers/public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_distiluse-base-multilingual-cased.zip
2020-08-27 15:26:22 - loading configuration file /Users/leonardovida/.cache/torch/sentence_transformers/public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_distiluse-base-multilingual-cased.zip/0_DistilBERT/config.json
2020-08-27 15:26:22 - Model config DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_hidden_states": true,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "vocab_size": 119547
}

2020-08-27 15:26:22 - loading weights