<a href="https://colab.research.google.com/github/ccaballeroh/Translator-Attribution/blob/master/01Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# (Optional)

If running in Colab, execute the following cells.

In [1]:
from pathlib import Path
import sys

IN_COLAB = "google.colab" in sys.modules

In [2]:
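if IN_COLAB:
    from google.colab import drive
    # Mount Google Drive so the project folder is reachable from the runtime
    drive.mount('/content/drive/', force_remount=True)
    ROOT = Path(r"./drive/My Drive/Translator-Attribution")
    # Make the project's `helper` package importable
    sys.path.insert(0, str(ROOT))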

Mounted at /content/drive/


# Preprocessing

The *Quixote* files were retrieved from Professor Hussein Abbass's [website](http://www.husseinabbass.net/translator.html). They are plain text files, one per chapter of the two parts of the novel, and require only minor preprocessing: removing bracketed numbers, collapsing runs of whitespace into a single space, and replacing special characters such as é and ü.
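As a rough illustration of these three cleanup steps, the following sketch cleans a string using only the standard library. The name `clean_text` is hypothetical, and mapping accented letters to their unaccented base letters is an assumption; the actual code in `helper.preprocessing` may replace special characters differently.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Hypothetical cleanup: drop bracketed numbers, strip accents, collapse whitespace."""
    # Remove bracketed numbers such as "[12]"
    text = re.sub(r"\[\d+\]", "", text)
    # Decompose accented characters and drop the combining marks (é -> e, ü -> u)
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Sancho[3]  said   olé,\n Güelph"))
```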

The Ibsen files were retrieved from [Project Gutenberg](http://www.gutenberg.org), so they contain legal information that needs to be removed. They also require the removal of bracketed numbers, the collapsing of whitespace, and the replacement of special characters. However, before doing that, the plays were split into 5 kB chunks.
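One way to picture the chunking step is the sketch below, which splits a text into pieces of at most a given number of bytes without cutting words. The function name `split_into_chunks`, the choice of 5000 bytes for "5 kB", and the decision to split on word boundaries are all assumptions made for illustration, not a description of what `ibsen()` does.

```python
from typing import List


def split_into_chunks(text: str, chunk_bytes: int = 5000) -> List[str]:
    """Hypothetical chunker: group whole words into chunks of at most chunk_bytes."""
    chunks, current, size = [], [], 0
    for word in text.split():
        # Count the UTF-8 size of the word plus one byte for the joining space
        word_size = len(word.encode("utf-8")) + 1
        if current and size + word_size > chunk_bytes:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += word_size
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "word " * 3000  # about 15 kB of text
pieces = split_into_chunks(sample)
print(len(pieces), [len(p.encode("utf-8")) for p in pieces])
```

Splitting on word boundaries keeps every token intact, at the cost of chunks that are slightly smaller than the nominal size.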

These operations are encapsulated in two functions, `quixote()` and `ibsen()`, in the `preprocessing` submodule of the `helper` module. The functions use relative paths to the folders containing the raw files (`Raw_Quixote` and `Raw_Ibsen`) in the `Corpora` subfolder and output the processed files to the folders `Proc_Quixote` and `Proc_Ibsen`. This step is not necessary if the files have already been preprocessed.
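To decide whether preprocessing can be skipped, one could check whether the output folders already hold processed files. The sketch below is a hypothetical helper, not part of `helper`; it assumes the `Corpora` folder sits in the current working directory and that processed files carry a `.txt` suffix.

```python
from pathlib import Path


def already_processed(folder: Path) -> bool:
    """Return True if folder exists and holds at least one non-empty .txt file."""
    if not folder.is_dir():
        return False
    return any(
        f.suffix == ".txt" and f.stat().st_size > 0 for f in folder.iterdir()
    )


corpora = Path("Corpora")  # assumed location relative to the project root
for name in ("Proc_Quixote", "Proc_Ibsen"):
    status = "ready" if already_processed(corpora / name) else "needs preprocessing"
    print(name, status)
```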

In [3]:
from helper import preprocessing

In colab!


In [None]:
preprocessing.quixote()

In [None]:
preprocessing.ibsen()

# Processing

Processing consists of generating, for each document in both corpora, an object of the custom class `MyDoc`, available in the `analysis` submodule of the `helper` module. Instantiating the objects requires a spaCy language model. For each corpus, a Python list of these objects is serialized and saved to disk using Python's `pickle` module.

**Note:** If the notebook is being run on Colab, spaCy must be installed and the English language model downloaded first. After that, restart the runtime and rerun the first cells, where the Drive is mounted.
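Before downloading anything, it can help to check whether the model is already available. Models fetched with `python -m spacy download` are installed as regular Python packages, so a standard-library lookup is enough for a quick check. The helper name `model_installed` is hypothetical, and this is only a sketch of one possible check.

```python
import importlib.util


def model_installed(name: str = "en_core_web_md") -> bool:
    """Return True if a package with this name can be found by the import system."""
    return importlib.util.find_spec(name) is not None


print("en_core_web_md installed:", model_installed())
```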

In [3]:
if IN_COLAB:
  !pip install spacy==2.2.2
  !python -m spacy download en_core_web_md
else:
  try:
    import spacy
    nlp = spacy.load("en_core_web_md")
  except OSError:
    # spacy.load raises OSError when the model is not installed
    !python -m spacy download en_core_web_md

Collecting spacy==2.2.2
  Downloading https://files.pythonhosted.org/packages/b9/05/e82c888a36f24608664b56abe737f4428410d370791f6112fb3e9b4a4a81/spacy-2.2.2-cp36-cp36m-manylinux1_x86_64.whl (10.3MB)
Collecting thinc<7.4.0,>=7.3.0
  Downloading https://files.pythonhosted.org/packages/07/59/6bb553bc9a5f072d3cd479fc939fea0f6f682892f1f5cff98de5c9b615bb/thinc-7.3.1-cp36-cp36m-manylinux1_x86_64.whl (2.2MB)
Installing collected packages: thinc, spacy
  Found existing installation: thinc 7.4.0
    Uninstalling thinc-7.4.0:
      Successfully uninstalled thinc-7.4.0
  Found existing installation: spacy 2.2.4
    Uninstalling spacy-2.2.4:
      Successfully uninstalled spacy-2.2.4
Successfully installed spacy-2.2.2 thinc-7.3.1
Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_w

In [None]:
from helper import ROOT
from helper.analysis import MyDoc
from pathlib import Path
import pickle
import spacy
import platform

CORPORA = Path(fr"{ROOT}/Corpora/")
PICKLE = Path(fr"{ROOT}/auxfiles/pickle/")

nlp = spacy.load("en_core_web_md")

# Create the output folder, including missing parent folders, if needed
PICKLE.mkdir(parents=True, exist_ok=True)

docs = {}

for author in ["Quixote", "Ibsen"]:
    path = CORPORA/f"Proc_{author}"
    docs[author] = [
        MyDoc(file, nlp) for file in path.iterdir() if file.suffix == ".txt" and file.stat().st_size != 0
        ]
    # Serialize the list of MyDoc objects and save it to disk
    with open(PICKLE/f"{author}_{platform.system()}.pickle", "wb") as f:
        pickle.dump(docs[author], f)
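In a later session the saved lists can be read back instead of being rebuilt. The sketch below shows the round trip with hypothetical helpers `pickle_path` and `load_docs`, using plain strings as stand-ins for `MyDoc` objects and a temporary folder in place of `auxfiles/pickle`. Loading real `MyDoc` objects would also require `helper.analysis` to be importable, because `pickle` stores only a reference to the class.

```python
import pickle
import platform
import tempfile
from pathlib import Path


def pickle_path(folder: Path, corpus: str) -> Path:
    """Build the file name used when saving, e.g. Quixote_Linux.pickle."""
    return folder / f"{corpus}_{platform.system()}.pickle"


def load_docs(folder: Path, corpus: str):
    """Load the list that was pickled for the given corpus."""
    with open(pickle_path(folder, corpus), "rb") as f:
        return pickle.load(f)


# Round trip in a temporary folder, with strings standing in for MyDoc objects
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    with open(pickle_path(folder, "Quixote"), "wb") as f:
        pickle.dump(["chapter 1", "chapter 2"], f)
    restored = load_docs(folder, "Quixote")

print(restored)
```

Because the file name contains `platform.system()`, a pickle written on one operating system is not found under the same name on another.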

# Cleaning (Optional)

We can delete the files generated during the preprocessing and syntactic feature extraction steps, stored in the folders `Corpora/Proc_{author}` and `auxfiles/txt/{author}`, using the custom function `clean_files` in the `utils` submodule of the `helper` module.

In [None]:
from helper.utils import clean_files

clean_files()
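The internals of `clean_files` are not shown in this notebook. As a hypothetical sketch of the kind of operation involved, the function below deletes the `.txt` files directly inside one folder and reports how many were removed; the name `remove_txt_files` and the restriction to `.txt` files are assumptions.

```python
from pathlib import Path


def remove_txt_files(folder: Path) -> int:
    """Delete the .txt files directly inside folder and return how many were removed."""
    removed = 0
    if folder.is_dir():
        for f in folder.iterdir():
            if f.is_file() and f.suffix == ".txt":
                f.unlink()
                removed += 1
    return removed
```

Calling `remove_txt_files(Path("Corpora/Proc_Quixote"))`, for example, would clear one of the folders named above while leaving the folder itself in place.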