# Discourse cohesive markers

This notebook extracts cohesive-marker features from the entire Quixote corpus. Two feature sets are produced: one that uses only the markers listed in `markersList.json` (taken from [this page](http://home.ku.edu.tr/~doregan/Writing/Cohesion.html)), and an extended one that also takes into account the punctuation marks (if any) surrounding each marker.

To find these markers we use spaCy's `PhraseMatcher` class. It operates on spaCy `Doc` objects, so the texts must first be processed with spaCy, although only the tokenizer is needed.
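
As a rough illustration of how a `PhraseMatcher` can be combined with a tokenizer-only pipeline, here is a minimal sketch. It is not the code behind `marker_matcher`: the inline marker list, the `"COHESIVE"` label, and the choice of case-insensitive matching via `attr="LOWER"` are assumptions made for the example, and it assumes spaCy v3's `matcher.add(key, patterns)` call form.

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()  # tokenizer-only pipeline, as in the notebook

# Hypothetical marker list; the notebook reads its own from markersList.json.
markers = ["however", "in addition", "on the other hand"]

# attr="LOWER" compares lowercased token text, so "However" matches "however".
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("COHESIVE", [nlp.make_doc(marker) for marker in markers])

# make_doc runs only the tokenizer, which is all the matcher needs here.
doc = nlp.make_doc("However, he rode on. In addition, Sancho followed.")

# Each match is (match_id, start_token, end_token); slice the Doc to get the span.
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```

Each match carries token offsets, so the tokens immediately before and after a marker can be inspected as well, which is the information the punctuation-aware feature set relies on.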

In [1]:
import os
import spacy
from spacy.lang.en import English  # English tokenizer

In [2]:
from helper.functions import proc_texts, get_translator, marker_matcher, save_dataset_to_json
from helper.features import extract_features_cohesive

In [3]:
nlp = English()  # Pipe with only tokenizer

In [4]:
matcher = marker_matcher(nlp, "markersList.json")  # PhraseMatcher for cohesive markers in file

In [5]:
CORPUS_FOLDER = os.path.join(".", "Corpora", "Proc_Quixote", "")  # trailing "" keeps the final path separator
file_list = os.listdir(CORPUS_FOLDER)
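
Note that `os.listdir` returns entries in arbitrary order and includes subdirectories and any stray files. If the corpus folder may contain anything other than the texts, a filtered, sorted listing along the following lines could be used instead. This is only a sketch: the helper name `list_corpus_files` and the `.txt` suffix are assumptions, not part of the notebook's helpers.

```python
import os


def list_corpus_files(folder, suffix=".txt"):
    """Return the regular files in `folder` ending with `suffix`, sorted by name."""
    return sorted(
        name
        for name in os.listdir(folder)
        # Keep only regular files with the expected (assumed) extension.
        if name.endswith(suffix) and os.path.isfile(os.path.join(folder, name))
    )
```

Sorting makes the order of `docs`, and therefore of the saved feature sets, reproducible across runs and platforms.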

In [6]:
docs = [(proc_texts(CORPUS_FOLDER, file, nlp, debug=True), get_translator(file)) for file in file_list]

In [7]:
featureset_cohesive = [(extract_features_cohesive(doc, matcher), translator) for doc, translator in docs]
featureset_cohesive_extended = [(extract_features_cohesive(doc, matcher, extended=True), translator) for doc, translator in docs]
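
The difference between the plain and the extended feature set can be pictured with a small, library-free sketch. It is not the implementation of `extract_features_cohesive`, whose output format is not shown in this notebook: the function name `count_markers`, the token-list input, and the `(before, marker, after)` keys are all assumptions chosen for illustration.

```python
import string
from collections import Counter

PUNCT = set(string.punctuation)  # single-character ASCII punctuation tokens


def count_markers(tokens, markers, extended=False):
    """Count marker occurrences in a list of tokens.

    With extended=True, the punctuation token directly before and after
    the marker (or "" if there is none) becomes part of the counted key.
    """
    lowered = [token.lower() for token in tokens]
    counts = Counter()
    for marker in markers:
        words = marker.split()
        size = len(words)
        for i in range(len(lowered) - size + 1):
            if lowered[i:i + size] != words:
                continue
            if not extended:
                counts[marker] += 1
                continue
            before = tokens[i - 1] if i > 0 and tokens[i - 1] in PUNCT else ""
            end = i + size
            after = tokens[end] if end < len(tokens) and tokens[end] in PUNCT else ""
            counts[(before, marker, after)] += 1
    return counts


tokens = ["He", "left", ";", "however", ",", "she", "stayed", ".",
          "However", "he", "returned"]
plain = count_markers(tokens, ["however"])
with_punct = count_markers(tokens, ["however"], extended=True)
```

In this toy input the plain count treats both occurrences of "however" as the same feature, while the extended count separates `; however ,` from `. However` followed by no punctuation.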

In [8]:
save_dataset_to_json(featureset_cohesive, "featuresQuixote_cohesive")
save_dataset_to_json(featureset_cohesive_extended, "featuresQuixote_cohesive_punctuation")

file saved
file saved


# Ibsen

In [None]:
CORPUS_FOLDER = os.path.join(".", "Corpora", "Proc_Ibsen", "")  # trailing "" keeps the final path separator
file_list = os.listdir(CORPUS_FOLDER)

docs = [(proc_texts(CORPUS_FOLDER, file, nlp, debug=True), get_translator(file)) for file in file_list]

In [None]:
featureset_cohesive = [(extract_features_cohesive(doc, matcher), translator) for doc, translator in docs]

featureset_cohesive_extended = [(extract_features_cohesive(doc, matcher, extended=True), translator)
                                for doc, translator in docs]

In [None]:
save_dataset_to_json(featureset_cohesive, "featuresIbsen_cohesive")
save_dataset_to_json(featureset_cohesive_extended, "featuresIbsen_cohesive_punctuation")