## Links to Project Resources

- [Trello board](https://trello.com/invite/b/BWnRAtKJ/3e7ce03017000289323e762d0ed2e304/histaware)
- [Notion Wiki](https://www.notion.so/HistAware-529aba41f84946b19d493394ef6a2748)

# Part I: Text selection

In this first phase of the project, we address the first problem: selecting similar texts. Initially, the research focuses on texts that deal with `energy`. However, this scope might change or be expanded.

**Phases of Part I:**
- **Validate the approach to the project**:
    1. Decide whether to use title and paragraphs or only one of the two
    2. Find the most efficient way to read all the xml files
    3. Begin to label a golden set of texts that are within the scope of the research AND select the most important keywords that will be used to search for similar texts
    4. Run the text similarity ML algorithm
    5. Have the teaching assistant go through the selection and identify mistakes
- **To think about**: how to keep the relevant information about the text fragment (i.e. newspaper origin and date)?
- **Decide the tools to use for text selection**. Current choices are:
    - Use `sentence-transformers` from UKPLab (https://github.com/UKPLab/sentence-transformers)
        - Generate embeddings on sentences (max 512 word pieces, depending on the model)
        - Find similar texts
    - Use `faiss` from Facebook AI (https://github.com/facebookresearch/faiss)
        - Less documentation but seemingly more scalable
    - Use ASReview from Utrecht University ()
        - A meeting with Jonathan or Raul is necessary to understand the feasibility of this approach
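
To make the "generate embeddings, then find similar texts" idea concrete, here is a minimal sketch of the similarity-search step only. It assumes the embeddings already exist (in the project they would come from a model such as `sentence-transformers`); the vectors, the function names `cosine_similarity` and `most_similar`, and the `top_k` parameter are all hypothetical and chosen for illustration. It uses only the standard library; at scale, a library such as `faiss` would likely replace the brute-force loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as "not similar"
    return dot / (norm_a * norm_b)


def most_similar(query, corpus, top_k=3):
    """Return (index, score) pairs for the top_k corpus vectors closest to query."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional "embeddings", made up for illustration only
corpus_embeddings = [
    [0.9, 0.1, 0.0],  # stand-in for an article about oil
    [0.0, 1.0, 0.0],  # stand-in for an unrelated article
    [0.8, 0.2, 0.1],  # stand-in for an article about natural gas
]
golden_embedding = [1.0, 0.0, 0.0]  # stand-in for a labelled "energy" text

ranked = most_similar(golden_embedding, corpus_embeddings, top_k=2)
print(ranked)
```

The golden set from phase 3 would play the role of `golden_embedding`: each labelled text is used as a query, and the highest-scoring corpus texts become candidates for manual review.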

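One possible answer to the questions about reading the XML files efficiently and keeping the newspaper origin and date is to stream each file and carry the metadata along with every text fragment. The sketch below uses `xml.etree.ElementTree.iterparse` from the standard library. The tag and attribute names (`article`, `title`, `p`, `newspaper`, `date`), the `Fragment` class, and the sample document are hypothetical; the real Delpher schema will differ and the names would need to be adapted.

```python
import io
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Fragment:
    """A text fragment together with the metadata we do not want to lose."""
    newspaper: str
    date: str
    title: str
    text: str


def iter_fragments(source):
    """Stream <article> elements from source and yield one Fragment per article."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "article":  # hypothetical tag name
            continue
        paragraphs = [p.text.strip() for p in elem.iter("p") if p.text]
        yield Fragment(
            newspaper=elem.get("newspaper", ""),
            date=elem.get("date", ""),
            title=(elem.findtext("title") or "").strip(),
            text=" ".join(paragraphs),
        )
        elem.clear()  # release the children of articles already processed


# Made-up sample document, for illustration only
SAMPLE = b"""<issue>
  <article newspaper="Voorbeeldkrant" date="1950-01-02">
    <title>Olieprijzen stijgen</title>
    <p>De prijs van olie is gestegen.</p>
    <p>Ook steenkool wordt duurder.</p>
  </article>
</issue>"""

fragments = list(iter_fragments(io.BytesIO(SAMPLE)))
print(fragments[0])
```

Keeping title and paragraphs as separate fields also leaves the first decision (title, paragraphs, or both) open until later.
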
### Import statements

In [55]:
from IPython.display import display, clear_output, Markdown
import pathlib
import sys
import os
sys.path.append("/Users/leonardovida/dev/HistAware")
import pickle
import csv

import numpy as np
import pandas as pd
import logging
import xml.etree.ElementTree as et 
import collections
from itertools import chain
import nl_core_news_lg
from datetime import datetime

# Import created modules
from src import text_selection
from src import iterators
from src import parsers
from src import logger

# Config for jupyter
%config InlineBackend.figure_format='retina'
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Set parameters & variables

In [16]:
FILE_PATH = "/Users/leonardovida/dev/HistAware/"
# Data path for Delpher data
DIR_PATH = os.path.join(FILE_PATH, "data", "1950", "Delpher")
# Save path
SAVE_PATH = os.path.join(FILE_PATH, "data", "processed")
# Decide whether to unzip (gunzip) the metadata
UNGIZP = False
# Decide whether to process and save the article and metadata data
DATAFILE = False
# Keywords to use for the naive text selection
KEYWORDS = ["olie", "aardgas", "steenkool"]
# Number of synonyms to retrieve for each keyword; the more synonyms, the less accurate the matches
NUM_SYNONYMS = 50
# spaCy Dutch pipeline whose word vectors are used to find the synonyms
NLP = nl_core_news_lg.load()

In [18]:
# Find path and name of saved data
print("Find path and name of saved articles")
csv_articles = iterators.iterate_directory(
    dir_path=os.path.join(SAVE_PATH, "processed_articles"), file_type=".csv"
)
csv_articles = pd.DataFrame(csv_articles)
csv_articles.rename(
    {
        "article_name": "csv_name",
        "article_path": "csv_path",
        "article_dir": "csv_dir",
    },
    axis=1,
    inplace=True,
)

Find path and name of saved articles


In [20]:
print("Find path and name of saved metadata")
csv_metadata = iterators.iterate_directory(
    dir_path=os.path.join(SAVE_PATH, "processed_metadata"), file_type=".csv"
)
csv_metadata = pd.DataFrame(csv_metadata)
csv_metadata.rename(
    {
        "article_name": "csv_name",
        "article_path": "csv_path",
        "article_dir": "csv_dir",
    },
    axis=1,
    inplace=True,
)

Find path and name of saved metadata


In [21]:
li = []
for index, row in csv_metadata.iterrows():
    csv_file = pd.read_csv(row["csv_path"])
    li.append(csv_file)
df_metadata = pd.concat(li, axis=0)
df_metadata.drop(["level_0", "date"], axis=1, inplace=True)
df_metadata.rename(
    {"filepath": "metadata_filepath", "index": "index_metadata"},
    axis=1,
    inplace=True,
)

In [51]:
# text_selection.py
from tqdm import tqdm
import pandas as pd
import numpy as np
import re

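def search_synonyms(nlp, word, df, n):
    """Find all texts in which the word or one of its nearest neighbours appears.

    Takes:
        - nlp: loaded spaCy pipeline with word vectors
        - word: string to expand into synonyms
        - df: dataframe with a "text" column in which to search
        - n: the total number of synonyms to retrieve

    Returns a dataframe with one row per (text, matching synonym) and a "count" column.
    """
    # Nearest neighbours of the keyword in the spaCy vector table
    ms = nlp.vocab.vectors.most_similar(
        np.asarray([nlp.vocab.vectors[nlp.vocab.strings[word]]]), n=n
    )
    synonyms = [nlp.vocab.strings[w] for w in ms[0][0]]
    print(f"Searching using the following synonyms of {word}:")
    print(synonyms)
    # Non-inplace dropna, so the caller's dataframe is left untouched
    df = df.dropna(subset=["text"])

    matches = []
    for syn in tqdm(synonyms):
        # Filter the full dataframe for every synonym; overwriting df here
        # would keep only texts that contain all synonyms at once
        found = df[df["text"].str.contains(syn, case=False, regex=False)].copy()
        # Literal, case-insensitive count, consistent with the filter above
        found["count"] = found["text"].str.count(re.escape(syn), flags=re.IGNORECASE)
        matches.append(found)
    # DataFrame.append was removed in pandas 2.0; concatenate once instead
    return pd.concat(matches) if matches else pd.DataFrame()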

def select_articles(nlp, word, df, n):
    res = search_synonyms(nlp, word, df, n)

    # Drop duplicates to keep only individual articles
    res.drop_duplicates(ignore_index=True, inplace=True)

    return res
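
The selection logic can be illustrated without pandas or spaCy. The sketch below is a simplified stand-in, not the notebook's implementation: it keeps every text that mentions at least one term (OR semantics) and counts whole-word, case-insensitive matches, along the lines of the commented-out `\b...\b` pattern. The function names `count_matches` and `select_texts` and the sample sentences are hypothetical.

```python
import re


def count_matches(text, terms):
    """Count whole-word, case-insensitive occurrences of each term in text."""
    counts = {}
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", flags=re.IGNORECASE)
        hits = len(pattern.findall(text))
        if hits:
            counts[term] = hits
    return counts


def select_texts(texts, terms):
    """Keep every text that mentions at least one term (OR semantics)."""
    selected = []
    for text in texts:
        counts = count_matches(text, terms)
        if counts:
            selected.append((text, counts))
    return selected


# Made-up sentences, for illustration only
texts = [
    "Olie en aardolie worden duurder; olie blijft schaars.",
    "De voetbalwedstrijd eindigde in een gelijkspel.",
    "Smeerolie is een product van aardolie.",
]
selected = select_texts(texts, ["olie", "aardolie", "smeerolie"])
for text, counts in selected:
    print(counts, text)
```

Unlike substring matching with `str.contains`, the word-boundary pattern does not count `olie` inside `aardolie`. Which behaviour is preferable depends on how Dutch compounds should be treated, and terms that start with punctuation (such as `-olie`) would need separate handling.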

In [56]:
BATCH_SIZE = 5  # number of csv files to process together
li = []
batch = 0
print("Searching synonyms")
for i, row in csv_articles.iterrows():
    csv_file = pd.read_csv(row["csv_path"])
    li.append(csv_file)
    # Process every full batch, plus the final (possibly smaller) one
    if len(li) == BATCH_SIZE or i == len(csv_articles) - 1:
        # Iterate 250.000 articles at a time
        df_articles = pd.concat(li, axis=0)
        # sort_values returns a new dataframe, so the result must be assigned
        df_articles = df_articles.sort_values(by=["index"], ascending=True)
        df_articles.rename(
            {"filepath": "article_filepath", "index": "index_article"},
            axis=1,
            inplace=True,
        )
        df_joined = df_articles.merge(df_metadata, how="left", on="dir")

        for keyword in KEYWORDS:
            print(f"Searching synonym {keyword}")
            selected_art = select_articles(
                nlp=NLP, word=keyword, df=df_joined, n=NUM_SYNONYMS
            )
            today = datetime.now()
            # Include the batch number so later batches do not overwrite earlier ones
            NAME = f"{today.date()}_{keyword}_{batch:03d}.csv"

            selected_art.to_csv(
                os.path.join(SAVE_PATH, "selected_articles", NAME),
                sep=",",
                quotechar='"',
                index=False,
            )

        # Reset the list of loaded csv files before the next batch
        li = []
        batch += 1


Searching synonyms
Searching synonym olie
  0%|          | 0/50 [00:00<?, ?it/s]
Searching using the following synonyms of olie:
['olie', '-olie', 'olieen', 'olies', 'wokolie', 'MCT-olie', 'bakolie', 'olieën', 'cocosolie', 'smeerolie', 'paraffineolie', 'maïsolie', 'remolie', 'castorolie', 'rijstolie', 'bio-olie', 'olien', 'schalieolie', 'citrusolie', 'spijsolie', 'Gasolie', 'ricinusolie', 'slaolie', 'maisolie', 'zaadolie', 'kruidenolie', 'boterolie', 'kruidnagelolie', 'aardolie', 'Badolie', 'koolzaadolie', 'citroenolie', 'soja-olie', 'mosterdolie', 'teakolie', 'Smeerolie', 'sesamolie', 'oliën', 'dieselolie', 'palmpitolie', 'arachideolie', 'kokosnootolie'