<a href="https://colab.research.google.com/github/ccaballeroh/Translator-Attribution/blob/master/01Processing_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# (Optional)

If running in Google Colab, execute the following cell.

In [0]:
from pathlib import Path
from google.colab import drive
import sys

drive.mount('/content/drive')

FOLDER_thesis = Path(r"./drive/My Drive/00Tesis/")
# Make the thesis folder importable so that `helper` can be found
sys.path.insert(0, f"{FOLDER_thesis}/")
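
To avoid running the Colab cell by accident in a local session, one option is to guard it with a small check. The sketch below assumes that the Colab kernel has already imported the `google.colab` package at start-up, which is a common heuristic and not a guarantee; `running_in_colab` is a hypothetical helper name, not part of the `helper` module.

```python
import sys


def running_in_colab() -> bool:
    """Heuristic: True when the google.colab package is already loaded."""
    # Assumption: Colab kernels preload google.colab; local kernels do not.
    return "google.colab" in sys.modules


if running_in_colab():
    print("Colab detected: mount Google Drive before importing helper.")
else:
    print("Local session: the Colab cell can be skipped.")
```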



# Preprocessing

The *Quixote* files were retrieved from the [website](http://www.husseinabbass.net/translator.html) of Professor Hussein Abbass. The files are plain text (one file per chapter of the two parts of the novel) and only require minor preprocessing: removal of bracketed numbers, collapsing runs of whitespace into a single space, and replacement of special characters such as é and ü.

The Ibsen files were retrieved from [Project Gutenberg](https://www.gutenberg.org). These files therefore contain legal information that must be removed, in addition to the steps applied to the *Quixote* files: removal of bracketed numbers, collapsing of whitespace, and replacement of special characters. Before that cleaning, however, the plays were split into 5 kB chunks.
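
One way to split a play into chunks of roughly 5 kB is sketched below. How `helper.preprocessing` actually splits the plays is not shown here, so the details are assumptions: 5 kB is taken as 5000 bytes of UTF-8 text, chunks break only at whitespace so that no word is cut, and `split_into_chunks` is a hypothetical name.

```python
from typing import List


def split_into_chunks(text: str, max_bytes: int = 5000) -> List[str]:
    """Split text at whitespace into chunks of at most max_bytes (UTF-8)."""
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for word in text.split():
        word_size = len(word.encode("utf-8")) + 1  # +1 for the joining space
        # Start a new chunk when adding this word would exceed the limit.
        if current and size + word_size > max_bytes:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += word_size
    if current:
        chunks.append(" ".join(current))
    return chunks


play = "word " * 3000
print(len(split_into_chunks(play)))
```

A single word longer than `max_bytes` still ends up in a chunk of its own, so the limit is a target and not a hard guarantee in that edge case.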

These operations are encapsulated in two functions, `quixote()` and `ibsen()`, respectively, within the submodule `preprocessing` available in the `helper` module. The functions use the relative paths to the folders containing the raw files (`Raw_Quixote` and `Raw_Ibsen`) in the subfolder `Corpora` and write the processed files to the folders `Proc_Quixote` and `Proc_Ibsen`.

In [0]:
from helper import preprocessing

In [0]:
preprocessing.quixote()

In [0]:
preprocessing.ibsen()

# Processing

Processing consists of creating, for each document in both corpora, an object of the custom class `MyDoc`, available in the `analysis` submodule of the `helper` module. Instantiating these objects requires a spaCy language model. For each corpus, a Python list of the resulting objects is serialized and saved to disk using Python's `pickle` module.

In [0]:
from helper.analysis import MyDoc
from pathlib import Path
import pickle
import spacy

CORPORA = Path(r"./Corpora/")
PICKLE = Path(r"./auxfiles/pickle/")

nlp = spacy.load("en_core_web_md")

docs = {}

for author in ["Quixote", "Ibsen"]:
    path = CORPORA/f"Proc_{author}"
    # Build one MyDoc per non-empty .txt file in the processed corpus folder
    docs[author] = [
        MyDoc(file, nlp)
        for file in path.iterdir()
        if file.suffix == ".txt" and file.stat().st_size != 0
    ]
    # Serialize the list of MyDoc objects and save it to disk
    doc_data = pickle.dumps(docs[author])
    with open(PICKLE/f"{author}.pickle", "wb") as f:
        f.write(doc_data)

# Retrieving processed documents from disk

We can pick up the process from this step by retrieving the processed documents from disk.

In [0]:
from pathlib import Path
import pickle


PICKLE = Path(r"./auxfiles/pickle/")
docs = {}

for author in ["Quixote", "Ibsen"]:
    with open(PICKLE/f"{author}.pickle", "rb") as f:
        doc_data = f.read()

    docs[author] = pickle.loads(doc_data)
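
After loading, a quick sanity check is to count how many documents each translator contributes. The sketch below relies only on the `translator` attribute that the feature-extraction cell also uses. The stand-in objects and the translator labels `"A"` and `"B"` are made up for illustration, and `count_by_translator` is a hypothetical helper.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_translator(documents) -> Counter:
    """Count documents per value of their `translator` attribute."""
    return Counter(doc.translator for doc in documents)


# Stand-ins for MyDoc objects so that the sketch runs on its own.
sample = [
    SimpleNamespace(translator="A"),
    SimpleNamespace(translator="B"),
    SimpleNamespace(translator="A"),
]
print(count_by_translator(sample))
```

With the real data, the same function could be applied to `docs["Quixote"]` and `docs["Ibsen"]` to confirm that no translator is missing before extracting features.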

# Feature extraction

With the processed documents stored in memory in a dictionary, we can generate feature JSON files using the custom function `save_dataset_to_json` available in the `analysis` submodule in the `helper` module. 
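
As a rough picture of what a word n-gram feature looks like, the sketch below counts n-grams over a list of tokens. It is not the implementation of `MyDoc.n_grams`, whose return format is not shown here; `word_ngrams` is a hypothetical function and the tokens are a made-up example.

```python
from collections import Counter
from typing import List


def word_ngrams(tokens: List[str], n: int) -> Counter:
    """Count contiguous n-grams, each joined into a single string."""
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


tokens = "the knight rode and the knight fell".split()
bigrams = word_ngrams(tokens, 2)
print(bigrams.most_common(1))
```

POS n-grams follow the same idea with part-of-speech tags in place of the word forms.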


In [0]:
from helper.analysis import save_dataset_to_json

for author in ["Quixote", "Ibsen"]:
    # syntactic n-grams with n in {2, 3}
    for n in range(2,4):
        FILE_TEMPLATE = f"features{author}_synctactic_n{n}"
        save_dataset_to_json([
            (doc.n_grams_syntactic(n=n), doc.translator) for doc in docs[author]
            ], FILE_TEMPLATE)

    for punct in [True, False]:
        # word n-grams with and without punctuation with n in {1, 2, 3}
        for n in range(1,4):
            FILE_TEMPLATE = f"features{author}_{n}grams{'_punct' if punct else ''}"
            save_dataset_to_json([
                (doc.n_grams(n=n, punct=punct, pos=False), doc.translator) for doc in docs[author]
                ], FILE_TEMPLATE)
        # POS n-grams with and without punctuation with n in {2, 3}
        for n in range(2,4):
            FILE_TEMPLATE = f"features{author}_{n}gramsPOS{'_punct' if punct else ''}"
            save_dataset_to_json([
                (doc.n_grams(n=n, punct=punct, pos=True), doc.translator) for doc in docs[author]
                ], FILE_TEMPLATE)
        # Cohesive markers with and without punctuation
        FILE_TEMPLATE = f"features{author}_cohesive{'_punct' if punct else ''}"
        save_dataset_to_json([
            (doc.cohesive(punct=punct), doc.translator) for doc in docs[author]
            ], FILE_TEMPLATE)
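
To see at a glance which feature sets the cell above produces, the loops can be replayed without touching the documents. The sketch below only rebuilds the `FILE_TEMPLATE` strings, spelled exactly as in the cell; `feature_file_templates` is a hypothetical helper, and whether `save_dataset_to_json` appends an extension or a folder to these names is not shown here.

```python
from typing import List


def feature_file_templates(author: str) -> List[str]:
    """List the FILE_TEMPLATE values generated for one corpus."""
    # Syntactic n-grams, n in {2, 3} (spelling kept as in the cell above).
    names = [f"features{author}_synctactic_n{n}" for n in range(2, 4)]
    for punct in [True, False]:
        suffix = "_punct" if punct else ""
        # Word n-grams, n in {1, 2, 3}
        names += [f"features{author}_{n}grams{suffix}" for n in range(1, 4)]
        # POS n-grams, n in {2, 3}
        names += [f"features{author}_{n}gramsPOS{suffix}" for n in range(2, 4)]
        # Cohesive markers
        names.append(f"features{author}_cohesive{suffix}")
    return names


for name in feature_file_templates("Quixote"):
    print(name)
```

This gives 14 feature sets per corpus: 2 syntactic, plus 6 with punctuation and 6 without.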

# Cleaning (Optional)

We can delete from disk the files generated during the preprocessing and syntactic feature extraction steps, located in the folders `Corpora/Proc_{author}` and `auxfiles/txt/{author}`, using the custom function `clean_files` in the `utils` submodule of the `helper` module.

In [0]:
from helper.utils import clean_files

clean_files()