<a href="https://colab.research.google.com/github/ccaballeroh/Translator-Attribution/blob/master/01Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# (Optional)

If running in Colab, execute the following cells.

In [1]:
from pathlib import Path
import sys

IN_COLAB = "google.colab" in sys.modules

In [2]:
if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive/', force_remount=True)
    ROOT = Path(r"./drive/My Drive/Translator-Attribution")
    sys.path.insert(0,f"{ROOT}/")

Mounted at /content/drive/


# Preprocessing

The *Quixote* files were retrieved from Professor Hussein Abbass's [website](http://www.husseinabbass.net/translator.html). They are plain-text files (one file per chapter of the two parts of the novel) and require only minor preprocessing: removing bracketed numbers, collapsing runs of whitespace into a single space, and replacing special characters such as é and ü.
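As a rough illustration of these three cleaning steps, the sketch below uses a hypothetical `clean_text` helper. It is not the implementation in `helper.preprocessing`: it assumes that bracketed numbers look like `[12]` and that special characters are replaced by their unaccented counterparts.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Hypothetical cleaner: drop [n] markers, strip accents, collapse whitespace."""
    # Remove bracketed numbers such as "[12]" (assumed marker format)
    text = re.sub(r"\[\d+\]", "", text)
    # Decompose accented characters and drop the combining marks: é -> e, ü -> u.
    # Letters without a decomposition (for example ø) are left unchanged.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space
    return re.sub(r"\s+", " ", text).strip()


sample = "Quixote[12]  rode   to\tLa Mancha, café and pingüino"
cleaned = clean_text(sample)
```

Removing the markers before collapsing whitespace avoids leaving double spaces where a marker used to be.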

The Ibsen files were retrieved from [Project Gutenberg](http://www.gutenberg.org). They therefore contain legal information that must be removed, in addition to the same cleaning applied to the *Quixote* files: removing bracketed numbers, collapsing whitespace, and replacing special characters. Before that cleaning, however, the plays were split into 5 kB chunks.
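One way to picture the chunking step is the sketch below. The `split_into_chunks` helper is hypothetical and is not taken from `helper.preprocessing`; it assumes that 5 kB means 5,000 bytes of UTF-8 text and that chunks should break only between words.

```python
def split_into_chunks(text, max_bytes=5_000):
    """Hypothetical chunker: group whole words into chunks of at most max_bytes."""
    chunks, current, size = [], [], 0
    for word in text.split():
        # Count the word plus the single space that joins it to the next one
        word_size = len(word.encode("utf-8")) + 1
        if current and size + word_size > max_bytes:
            chunks.append(" ".join(current))
            current, size = [], 0
        # A single word longer than max_bytes still gets a chunk of its own
        current.append(word)
        size += word_size
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = split_into_chunks("word " * 3000)
```

Measuring the encoded size rather than the character count keeps the chunks close to the intended size on disk even when the text contains non-ASCII characters.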

These operations are encapsulated in two functions, `quixote()` and `ibsen()`, in the `preprocessing` submodule of the `helper` module. The functions read the raw files from the folders `Raw_Quixote` and `Raw_Ibsen` (located via relative paths inside the `Corpora` subfolder) and output the processed files to the folders `Proc_Quixote` and `Proc_Ibsen`. Running them is not necessary if the files have already been preprocessed.

In [3]:
from helper import preprocessing

In colab!


In [None]:
preprocessing.quixote()

In [None]:
preprocessing.ibsen()

# Processing

Processing consists of generating, for each document in both corpora, an object of the custom class `MyDoc`, available in the `analysis` submodule of the `helper` module. Instantiating these objects requires a spaCy language model. A Python list of the objects is then serialized and saved to disk using Python's `pickle` protocol.

**Note:** If the Notebook is being run on Colab, spaCy must be installed first and the English language model downloaded. After that, it is necessary to restart the runtime and run the first cells where the Drive is mounted.

In [4]:
if IN_COLAB:
    # Install the pinned spaCy version and download the English model
    !pip install spacy==2.2.2
    !python -m spacy download en_core_web_md
else:
    try:
        import spacy
        nlp = spacy.load("en_core_web_md")
    except OSError:
        # spacy.load raises OSError when the model is not installed
        !python -m spacy download en_core_web_md

In [None]:
from helper import ROOT
from helper.analysis import MyDoc
from pathlib import Path
import pickle
import spacy
import platform

CORPORA = Path(fr"{ROOT}/Corpora/")
PICKLE = Path(fr"{ROOT}/auxfiles/pickle/")

nlp = spacy.load("en_core_web_md")

# Create the output folder, including any missing parent folders
PICKLE.mkdir(parents=True, exist_ok=True)

docs = {}

for author in ["Quixote", "Ibsen"]:
    path = CORPORA/f"Proc_{author}"
    docs[author] = [
        MyDoc(file, nlp) for file in path.iterdir() if file.suffix == ".txt" and file.stat().st_size != 0
        ]
    # save to disk
    doc_data = pickle.dumps(docs[author])
    with open(PICKLE/f"{author}_{platform.system()}.pickle", "wb") as f:
        f.write(doc_data)
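The save and load round trip used here can be tried in isolation with the sketch below. The `save_docs` and `load_docs` helpers are hypothetical, and plain dictionaries stand in for `MyDoc` objects so that the example runs without spaCy or the `helper` module. The file name follows the same `{author}_{platform.system()}.pickle` pattern as the cell above.

```python
import pickle
import platform
from pathlib import Path


def save_docs(docs, folder, author):
    """Pickle a list of documents to <folder>/<author>_<system>.pickle."""
    folder = Path(folder)
    # Create the folder and any missing parents
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{author}_{platform.system()}.pickle"
    with open(target, "wb") as f:
        pickle.dump(docs, f)
    return target


def load_docs(path):
    """Load a list of documents previously written by save_docs."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Only unpickle files you created yourself, since `pickle` can execute arbitrary code while loading.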

# (Optional) Retrieving Processed Documents from Disk

We can resume from this step by loading the previously processed documents from disk.

If you just processed the documents, you can skip to `Features Extraction`.

In [None]:
from helper import ROOT
from pathlib import Path
import pickle
import platform


PICKLE = Path(fr"{ROOT}/auxfiles/pickle/")
docs = {}

for author in ["Quixote", "Ibsen"]:
    with open(PICKLE/f"{author}_{platform.system()}.pickle", "rb") as f:
        doc_data = f.read()
    docs[author] = pickle.loads(doc_data)

# Features Extraction

With the processed documents stored in memory in a dictionary, we can generate feature JSON files using the custom function `save_dataset_to_json` available in the `analysis` submodule in the `helper` module. 
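To make the idea of a feature file concrete, here is a minimal sketch that counts word n-grams and writes `(features, label)` pairs to JSON. Both `word_ngrams` and `dump_dataset` are hypothetical names, and the record layout is an assumption. The actual `MyDoc.n_grams` method and `save_dataset_to_json` function may work differently.

```python
import json
from collections import Counter
from pathlib import Path


def word_ngrams(tokens, n):
    """Count contiguous word n-grams in a list of tokens."""
    # zip over n shifted copies of the token list yields each n-gram once
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def dump_dataset(dataset, name, folder):
    """Write (features, label) pairs to <folder>/<name>.json and return the path.

    The record layout used here is an assumption for illustration only.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    records = [
        {"features": dict(features), "label": label} for features, label in dataset
    ]
    target = folder / f"{name}.json"
    with open(target, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return target


tokens = "the knight saw the knight".split()
bigrams = word_ngrams(tokens, 2)
```

In this sketch each record pairs one document's feature counts with its translator label, mirroring the `(features, doc.translator)` tuples passed to `save_dataset_to_json` in the cells that follow.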


**DRAFT**

Extract the Ibsen plays separately: the chunks of *Ghosts* go into a training file, and the chunks of the remaining plays go into a test file.

In [None]:
from helper.analysis import save_dataset_to_json

author = "Ibsen"
FILE_TEMPLATE = f"features_{author}_cohesive_parallel_training"

save_dataset_to_json([
    (doc.cohesive(punct=False), doc.translator) for doc in docs[author] if "Ghosts" in doc.filename
], FILE_TEMPLATE)

FILE_TEMPLATE = f"features_{author}_cohesive_parallel_test"

save_dataset_to_json([
    (doc.cohesive(punct=False), doc.translator) for doc in docs[author] if "Ghosts" not in doc.filename
], FILE_TEMPLATE)

In [None]:
from helper.analysis import save_dataset_to_json

for author in ["Quixote", "Ibsen"]:
    # syntactic n-grams with n in {2, 3}
    for n in range(2,4):
        FILE_TEMPLATE = f"features_{author}_syntactic_n{n}"
        save_dataset_to_json([
            (doc.n_grams_syntactic(n=n), doc.translator) for doc in docs[author]
            ], FILE_TEMPLATE)

    for punct in [True, False]:
        # word n-grams with and without punctuation with n in {1, 2, 3}
        for n in range(1,4):
            FILE_TEMPLATE = f"features_{author}_{n}grams{'_punct' if punct else ''}"
            save_dataset_to_json([
                (doc.n_grams(n=n, punct=punct, pos=False), doc.translator) for doc in docs[author]
                ], FILE_TEMPLATE)
        # POS n-grams with and without punctuation with n in {2, 3}
        for n in range(2,4):
            FILE_TEMPLATE = f"features_{author}_{n}gramsPOS{'_punct' if punct else ''}"
            save_dataset_to_json([
                (doc.n_grams(n=n, punct=punct, pos=True), doc.translator) for doc in docs[author]
                ], FILE_TEMPLATE)
        # Cohesive markers with and without punctuation
        FILE_TEMPLATE = f"features_{author}_cohesive{'_punct' if punct else ''}"
        save_dataset_to_json([
            (doc.cohesive(punct=punct), doc.translator) for doc in docs[author]
            ], FILE_TEMPLATE)

# Cleaning (Optional)

We can delete from disk the files generated during the preprocessing and syntactic feature extraction steps, stored in the folders `Corpora/Proc_{author}` and `auxfiles/txt/{author}`, using the custom function `clean_files` in the `utils` submodule of the `helper` module.

In [None]:
from helper.utils import clean_files

clean_files()
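For readers without access to `helper.utils`, the sketch below shows what a cleanup of this kind might look like. The `remove_generated_files` helper is hypothetical. It assumes the generated files are `.txt` files directly inside a folder, and it is not the implementation of `clean_files`.

```python
from pathlib import Path


def remove_generated_files(folder, suffix=".txt"):
    """Delete files with the given suffix directly inside folder.

    Returns the number of files removed; a missing folder counts as zero.
    """
    folder = Path(folder)
    if not folder.is_dir():
        return 0
    removed = 0
    for path in folder.iterdir():
        # Leave subfolders and files with other suffixes untouched
        if path.is_file() and path.suffix == suffix:
            path.unlink()
            removed += 1
    return removed
```

Returning the count makes it easy to confirm that the expected number of files was deleted.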