# Preprocessing of Corpora

The *Quixote* files were retrieved from the [website](http://www.husseinabbass.net/translator.html) of Professor Hussein Abbass. They are plain text files, one per chapter of the two parts of the novel, and require only minor preprocessing: removing bracketed numbers, collapsing runs of spaces into a single space, and replacing special characters such as é and ü.
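The three cleaning steps listed above can be sketched as follows. This is a minimal illustration, not the code in `preprocessing`: the function name `clean_text` is hypothetical, and it assumes that "replacing special characters" means mapping accented letters to their unaccented base letters.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Hypothetical cleaning step (assumption: not the actual helper code)."""
    # Remove bracketed numbers such as "[12]", e.g. footnote markers.
    text = re.sub(r"\[\d+\]", "", text)
    # Decompose accented letters (é becomes e + combining accent),
    # then drop the combining marks so only the base letter remains.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse runs of spaces or tabs into one space; newlines are kept.
    return re.sub(r"[ \t]+", " ", text).strip()
```

Newlines are preserved here so that paragraph boundaries survive; if the real functions flatten the text completely, `\s+` would be the pattern to use instead.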

The Ibsen files were retrieved from [Project Gutenberg](https://www.gutenberg.org). They therefore contain legal information that must be removed, in addition to the same cleanup applied to the *Quixote* files: removing bracketed numbers, collapsing spaces, and replacing special characters. Before that cleanup, each play was split into 5 KB chunks.
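One way to split a play into pieces of roughly 5 KB is sketched below. The function name `split_into_chunks`, the choice of 5 × 1024 bytes, and the decision to cut only at word boundaries are all assumptions made for illustration; the actual `ibsen()` function may split differently.

```python
def split_into_chunks(text: str, chunk_bytes: int = 5 * 1024) -> list[str]:
    """Greedily pack whole words into chunks of at most `chunk_bytes` UTF-8 bytes.

    Hypothetical helper: a single word longer than `chunk_bytes` still
    becomes its own (oversized) chunk.
    """
    chunks, current, size = [], [], 0
    for word in text.split():
        word_size = len(word.encode("utf-8")) + 1  # +1 for the joining space
        # Start a new chunk when adding this word would exceed the limit.
        if current and size + word_size > chunk_bytes:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += word_size
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Cutting at word boundaries keeps tokens intact for the later spaCy pass, at the cost of chunks being slightly smaller than the nominal size.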

These operations are encapsulated in two functions, `quixote()` and `ibsen()`, in the module `preprocessing`. Each function reads the raw files from its folder (`Raw_Quixote` or `Raw_Ibsen`) inside the subfolder `Corpora`, using relative paths, and writes the processed files to `Proc_Quixote` or `Proc_Ibsen`.
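The folder-to-folder flow described above might look roughly like the sketch below. `process_folder` and its `clean` parameter are hypothetical names introduced here; the real `quixote()` and `ibsen()` functions take no arguments and may be organized differently.

```python
from pathlib import Path
from typing import Callable


def process_folder(raw_dir: Path, proc_dir: Path, clean: Callable[[str], str]) -> int:
    """Apply `clean` to every .txt file in `raw_dir` and write the result to `proc_dir`.

    Hypothetical helper; returns the number of files written.
    """
    proc_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for raw_file in sorted(raw_dir.glob("*.txt")):
        text = raw_file.read_text(encoding="utf-8")
        # Keep the original file name so raw and processed files line up.
        (proc_dir / raw_file.name).write_text(clean(text), encoding="utf-8")
        count += 1
    return count
```

Under these assumptions, a call such as `process_folder(Path("Corpora/Raw_Quixote"), Path("Corpora/Proc_Quixote"), str.strip)` would mirror the raw folder into the processed one.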

In [None]:
from helper import preprocessing

In [None]:
preprocessing.quixote()

In [None]:
preprocessing.ibsen()

# Processing

In [None]:
from helper.analysis import MyDoc
from pathlib import Path
from helper.analysis import save_dataset_to_json
import spacy

In [None]:
CORPORA = Path(r"./Corpora/")
nlp = spacy.load("en_core_web_md")

for author in ["Ibsen"]:
    path = CORPORA/f"Proc_{author}" 
    docs = [MyDoc(file, nlp) for file in path.iterdir() if file.suffix == ".txt" and file.stat().st_size != 0]
    
    for n in range(2,4):
        FILE_TEMPLATE = f"features{author}_synctactic_n{n}" 
        save_dataset_to_json([(doc.n_grams_syntactic(n=n), doc.translator) for doc in docs], FILE_TEMPLATE)

    for punct in [True, False]:
        for n in range(3):
            FILE_TEMPLATE = f"features{author}_{n+1}grams{'_punct' if punct else ''}"
            save_dataset_to_json([(doc.n_grams(n=n+1, punct=punct), doc.translator) for doc in docs], FILE_TEMPLATE)
        for n in range(2):
            FILE_TEMPLATE = f"features{author}_{n+1}gramsPOS{'_punct' if punct else ''}"
            save_dataset_to_json([(doc.n_gramsPOS(n=n+1, punct=punct), doc.translator) for doc in docs], FILE_TEMPLATE)
        # Cohesive features: one dataset per punctuation setting.
        FILE_TEMPLATE = f"features{author}_cohesive{'_punct' if punct else ''}"
        save_dataset_to_json([(doc.cohesive(punct=punct), doc.translator) for doc in docs], FILE_TEMPLATE)

# Cleaning

In [30]:
from helper.utils import clean_example

clean_example()

## Delete the following

In [None]:
import pickle

CORPORA = Path(r"./Corpora/")
PICKLE = Path(r"./auxfiles/pickle/")
nlp = spacy.load("en_core_web_md")

for author in ["Quixote", "Ibsen"]:
    path = CORPORA/f"Proc_{author}"
    docs = [MyDoc(file, nlp) for file in path.iterdir() if file.suffix == ".txt"]
    # Serialize the parsed documents so they can be reloaded without re-running spaCy.
    with open(PICKLE/f"{author}.pickle", "wb") as f:
        pickle.dump(docs, f)

To resume the process from this point, load the pickle file from disk.

In [None]:
# with open(PICKLE/"Quixote.pickle", "rb") as f:
#     doc_data=f.read()

# docs = pickle.loads(doc_data)