# Characteristics extraction

In this notebook, we extract characteristics (features) from the pre-processed corpus. Pre-processing consisted of removing special characters and numbers enclosed in brackets or parentheses. The features are word unigrams, bigrams, and trigrams, plus POS bigrams and trigrams, each computed both with and without punctuation.

To obtain these features, we only need to tokenize and tag the texts, not parse them. However, we run the entire pipeline and save the result to disk so that we can reuse it later to extract syntactic information.

In [1]:
import spacy
from pathlib import Path

In [2]:
# our own helper functions
from helper.functions import proc_texts, get_translator, save_dataset_to_json
from helper.features import extract_features_unigrams, extract_features_bigrams, extract_features_trigrams
from helper.features import extract_features_bigramsPOS, extract_features_trigramsPOS

In [3]:
CORPUS_FOLDER = Path(r"./Corpora/Proc_Ibsen_final/")

In [4]:
#nlp = spacy.load("en_core_web_md", disable=["parser", "ner"])
nlp = spacy.load("en_core_web_md")

We process the entire corpus.

In [9]:
docs = [(proc_texts(file, nlp), file.name) for file in CORPUS_FOLDER.iterdir() if file.suffix == ".txt"]

We save the result to disk using the pickle protocol.

In [9]:
import pickle

doc_data = pickle.dumps(docs)

# Build the path with pathlib: the original string contained the invalid escape "\p"
with open(Path("auxfiles") / "pickle" / "ibsen_proc.pickle", "wb") as f:
    f.write(doc_data)

To resume the process from this point later, we can load the pickle file from disk.

In [None]:
# with open(Path("auxfiles") / "pickle" / "ibsen_proc.pickle", "rb") as f:
#     doc_data=f.read()

# docs = pickle.loads(doc_data)

## Word n-gram extraction

In the following two cells, we extract word unigrams, bigrams, and trigrams for the entire corpus, disregarding punctuation, and save the results to disk.

In [12]:
featureset_unigrams = [(extract_features_unigrams(doc, punct=False), get_translator(filename)) for doc, filename in docs]
featureset_bigrams = [(extract_features_bigrams(doc, punct=False), get_translator(filename)) for doc, filename in docs]
featureset_trigrams = [(extract_features_trigrams(doc, punct=False), get_translator(filename)) for doc, filename in docs]

In [11]:
save_dataset_to_json(featureset_unigrams, "featuresIbsen_unigrams")
save_dataset_to_json(featureset_bigrams, "featuresIbsen_bigrams")
save_dataset_to_json(featureset_trigrams, "featuresIbsen_trigrams")

file saved
file saved
file saved

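The `extract_features_*` helpers are not shown in this notebook. As a rough idea of what counting word n-grams with and without punctuation could look like, here is a minimal sketch. It assumes tokens are given as `(text, is_punct)` pairs instead of a spaCy `Doc`, and that words are lowercased; `word_ngram_counts` is a hypothetical name, not one of our helpers.

```python
from collections import Counter

def word_ngram_counts(tokens, n, punct=False):
    """Count lowercased word n-grams in a list of (text, is_punct) pairs.

    Hypothetical stand-in for the notebook's helpers, which take a spaCy Doc.
    """
    # Keep punctuation tokens only when punct=True
    words = [text.lower() for text, is_punct in tokens if punct or not is_punct]
    # Slide a window of size n over the remaining tokens
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# Toy sentence standing in for a processed document
tokens = [("Nora", False), (",", True), ("come", False), ("here", False), (".", True)]

bigrams = word_ngram_counts(tokens, n=2)                    # punctuation dropped
bigrams_punct = word_ngram_counts(tokens, n=2, punct=True)  # punctuation kept
```

Dropping punctuation before building the window means that "nora" and "come" become adjacent, so the two settings can yield different n-grams and not just fewer of them.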

In the following two cells, we repeat the process, this time taking punctuation into account.

In [12]:
featureset_unigrams_punct = [(extract_features_unigrams(doc, punct=True), get_translator(filename)) for doc, filename in docs]
featureset_bigrams_punct = [(extract_features_bigrams(doc, punct=True), get_translator(filename)) for doc, filename in docs]
featureset_trigrams_punct = [(extract_features_trigrams(doc, punct=True), get_translator(filename)) for doc, filename in docs]

In [13]:
save_dataset_to_json(featureset_unigrams_punct, "featuresIbsen_unigrams_punct")
save_dataset_to_json(featureset_bigrams_punct, "featuresIbsen_bigrams_punct")
save_dataset_to_json(featureset_trigrams_punct, "featuresIbsen_trigrams_punct")

file saved
file saved
file saved

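The file layout written by `save_dataset_to_json` is not shown here either. The sketch below is one possible way to store a list of `(features, label)` pairs as JSON. It assumes the features are plain dictionaries of counts; the function name, the record keys, and the toy translator labels are all made up for illustration.

```python
import json
import tempfile
from pathlib import Path

def save_featureset_sketch(featureset, name, folder):
    """Write (features, label) pairs to <folder>/<name>.json and return the path.

    Hypothetical illustration, not the notebook's save_dataset_to_json.
    """
    path = Path(folder) / f"{name}.json"
    # JSON has no tuple type, so each pair becomes a small object
    records = [{"features": features, "label": label} for features, label in featureset]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return path

# Toy data: counts and translator labels are invented for this example
toy_featureset = [
    ({"come here": 2, "nora come": 1}, "translator_a"),
    ({"come here": 1}, "translator_b"),
]

with tempfile.TemporaryDirectory() as tmp:
    saved = save_featureset_sketch(toy_featureset, "toy_bigrams", tmp)
    # Read the file back to check that the round trip preserves the data
    with open(saved, encoding="utf-8") as f:
        loaded = json.load(f)
```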

## POS n-gram extraction

Next, we repeat the process for POS bigrams and trigrams, again both without and with punctuation.

In [14]:
featureset_bigrams_pos = [(extract_features_bigramsPOS(doc, punct=False), get_translator(filename)) for doc, filename in docs]
featureset_trigrams_pos = [(extract_features_trigramsPOS(doc, punct=False), get_translator(filename)) for doc, filename in docs]

In [15]:
save_dataset_to_json(featureset_bigrams_pos, "featuresIbsen_bigrams_pos")
save_dataset_to_json(featureset_trigrams_pos, "featuresIbsen_trigrams_pos")

file saved
file saved


In [16]:
featureset_bigrams_pos_punct = [(extract_features_bigramsPOS(doc, punct=True), get_translator(filename)) for doc, filename in docs]
featureset_trigrams_pos_punct = [(extract_features_trigramsPOS(doc, punct=True), get_translator(filename)) for doc, filename in docs]

In [17]:
save_dataset_to_json(featureset_bigrams_pos_punct, "featuresIbsen_bigrams_pos_punct")
save_dataset_to_json(featureset_trigrams_pos_punct, "featuresIbsen_trigrams_pos_punct")

file saved
file saved

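For POS n-grams the idea is the same, but the window slides over tags instead of words. The sketch below assumes tokens are given as `(text, pos)` pairs with coarse universal tags, and that punctuation is identified by the `PUNCT` tag. Whether our helpers use coarse or fine-grained tags is not shown in this notebook, and `pos_ngram_counts` is a hypothetical name.

```python
from collections import Counter

def pos_ngram_counts(tagged, n, punct=False):
    """Count POS n-grams in a list of (text, pos) pairs.

    Hypothetical stand-in for extract_features_bigramsPOS / trigramsPOS.
    """
    # Keep PUNCT tags only when punct=True
    tags = [pos for _, pos in tagged if punct or pos != "PUNCT"]
    # Slide a window of size n over the tag sequence
    grams = zip(*(tags[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# Toy tagged sentence standing in for a processed document
tagged = [("Nora", "PROPN"), (",", "PUNCT"), ("come", "VERB"),
          ("here", "ADV"), (".", "PUNCT")]

pos_bigrams = pos_ngram_counts(tagged, n=2)
pos_trigrams_punct = pos_ngram_counts(tagged, n=3, punct=True)
```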

Cells with leftover code that might come in handy in the future.

In [None]:
# # Create set of vocab of words
# def create_vocab_words(docs):
#     vocab_words = set()
#     for doc,_ in docs:
#         words = ngrams_tokens(doc, n=1)
#         strings_words = [token.text.lower() for token, in words]
#         vocab_words = vocab_words.union(set(strings_words))
#     return vocab_words

# # Create set of vocab of bigrams
# def create_vocab_bigrams(docs):
#     vocab_bigrams = set()
#     for doc,_ in docs:
#         bigrams = ngrams_tokens(doc, n=2)
#         strings_bigrams = [" ".join([token1.text.lower(), token2.text.lower()])
#                             for token1, token2 in bigrams]
#         vocab_bigrams = vocab_bigrams.union(set(strings_bigrams))
#     return vocab_bigrams

# # Create set of vocab of trigrams
# def create_vocab_trigrams(docs):
#     vocab_trigrams = set()
#     for doc,_ in docs:
#         trigrams = ngrams_tokens(doc, n=3)
#         strings_trigrams = [" ".join([token1.text.lower(), token2.text.lower(), token3.text.lower()])
#                             for token1, token2, token3 in trigrams]
#         vocab_trigrams = vocab_trigrams.union(set(strings_trigrams))
#     return vocab_trigrams

In [None]:
#vocab = create_vocab_words(docs)