## TextProcessor

### Set environment variables

To load modules in this notebook when running it from outside the repository folder, set the script **PATH** below, e.g. ```C:/TextProcessor```:

In [None]:
PATH = "/path/to/TextProcessor" # <-- optional when running from the repository folder

In [None]:
import importlib.util, os

if not os.path.isdir(PATH):
    PATH = os.getcwd()
PATH = os.path.realpath(PATH)

spec = importlib.util.spec_from_file_location("__init__", os.path.join(PATH, "__init__.py"))
init = importlib.util.module_from_spec(spec)
spec.loader.exec_module(init)

%matplotlib inline
%load_ext autoreload
%autoreload 2

### Import functions

In [None]:
import numpy as np
import spacy

from kwic import run_kwic
from nltklib import nlp_text, ngram_an, pos_tagger
from textractlib import textract_file

### Read document file

Supported formats include `.doc(x)`, `.epub`, `.odt` and `.pdf`. **To-do:** add whole folder conversion with recursion.

In [None]:
input_path = "" # <-- file containing text to parse

In [None]:
text = textract_file(input_path, convert=False)

In [None]:
#np.array(text.replace('\n\n','\n').split('\n')) # <-- text
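
The to-do above mentions whole-folder conversion with recursion. A minimal sketch of the file-discovery half, assuming the supported extensions listed above; `find_documents` and `SUPPORTED_EXTENSIONS` are hypothetical names and not part of `textractlib`:

```python
import os

# Assumption: extensions derived from the supported formats listed above
SUPPORTED_EXTENSIONS = (".doc", ".docx", ".epub", ".odt", ".pdf")

def find_documents(folder, extensions=SUPPORTED_EXTENSIONS):
    """Recursively collect paths of files whose extension is supported."""
    matches = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            # Compare case-insensitively so "REPORT.PDF" is also picked up
            if name.lower().endswith(extensions):
                matches.append(os.path.join(root, name))
    return sorted(matches)
```

Each returned path could then be passed to `textract_file(path, convert=False)` in a loop, one document at a time.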

### KWIC Analysis

Look up a keyword in context (KWIC) in a text file or string and return the associated tags. **Hint:** try changing the `size` argument (5 here) for different results.

In [None]:
keyword = "" # <-- keyword to analyze context

run_kwic(text, keyword, size=5)
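
To illustrate the idea behind keyword-in-context, here is a minimal sketch in plain Python. It is not the implementation behind `run_kwic`; `kwic_sketch` is a hypothetical helper, and it assumes that `size` means the number of words kept on each side of the keyword:

```python
import re

def kwic_sketch(text, keyword, size=5):
    """Return (left, keyword, right) context tuples for each keyword match."""
    # Assumption: lowercase word tokens are enough for this illustration
    tokens = re.findall(r"\w+", text.lower())
    keyword = keyword.lower()
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - size):i]    # up to `size` words before
            right = tokens[i + 1:i + 1 + size]   # up to `size` words after
            rows.append((" ".join(left), token, " ".join(right)))
    return rows

# Illustrative sample sentence, not taken from any input document
sample = "The quick brown fox jumps over the lazy dog near the quiet river"
for left, kw, right in kwic_sketch(sample, "the", size=2):
    print(f"{left:>15} | {kw} | {right}")
```

A larger `size` widens the context window, which tends to surface more of the surrounding phrase at the cost of noisier output.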

### Natural Language Processing

#### N-gram analysis

Requires the Natural Language Toolkit (NLTK) to be installed. **Hint:** try changing the value of `n` (2 here) for different results.

In [None]:
grams = ngram_an(text, n=2)
for g in grams:
    print(g)
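
For reference, n-grams are runs of `n` consecutive tokens. The sketch below counts them with the standard library only; it is not how `ngram_an` is implemented, `ngrams_sketch` is a hypothetical name, and whitespace tokenization is a simplifying assumption:

```python
from collections import Counter

def ngrams_sketch(text, n=2):
    """Count n-grams (tuples of n consecutive tokens) in a text."""
    # Assumption: lowercase whitespace tokens; NLTK tokenizers are more careful
    tokens = text.lower().split()
    # zip over n shifted copies of the token list to form sliding windows
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Illustrative sample string, not taken from any input document
counts = ngrams_sketch("to be or not to be", n=2)
for gram, freq in counts.most_common(3):
    print(gram, freq)
```

With `n=2` this yields bigrams and with `n=3` trigrams; higher values give longer but rarer sequences.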

#### Syntax analysis

Perform analysis using `spaCy`, a library for Natural Language Processing featuring NER, POS tagging, dependency parsing, word vectors and more.

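**Note:** the first time you run this, you need to download the spaCy language model, e.g. by executing ```python -m spacy download en``` for English. On spaCy 3 and later the `en` shortcut is no longer available, so use the full model name ```en_core_web_sm``` instead.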

In [None]:
doc = nlp_text(text, 'en') # 'en_core_web_sm'

#### POS tagging

Returns part-of-speech tags for the text.

In [None]:
tags = pos_tagger(doc); tags

#### Analyze syntax

Extracts noun phrases, verb lemmas and named entities from the text.

In [None]:
print("Noun phrases:\n", np.array([chunk.text for chunk in doc.noun_chunks]))

In [None]:
print("Verbs:\n", np.array([token.lemma_ for token in doc if token.pos_ == "VERB"]))

In [None]:
print("Entities:")
for entity in doc.ents:
    print(entity.label_, '=>', entity.text)

#### Compress output → `output.zip`

In [None]:
!zip output.zip *csv *txt
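
The `!zip` command relies on a `zip` executable being available on the system, which may not be the case on Windows. A portable sketch using only the standard library follows; `zip_outputs` is a hypothetical helper, and the `*.csv` / `*.txt` patterns are an assumption that is slightly stricter than the shell globs above:

```python
import glob
import zipfile

def zip_outputs(archive="output.zip", patterns=("*.csv", "*.txt")):
    """Write files matching the patterns into a compressed archive; return their names."""
    # Collect matches first, de-duplicated and in a stable order
    names = sorted({name for pattern in patterns for name in glob.glob(pattern)})
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            zf.write(name)
    return names
```

Calling `zip_outputs()` from the notebook's working directory would produce the same `output.zip` that the download link below points to.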

### [Download output files](output.zip)

_____

### References

* Natural Language Toolkit: https://www.nltk.org/

* spaCy documentation: https://spacy.io

* textract documentation: http://textract.readthedocs.io/en/latest/

* textract @ GitHub: https://github.com/deanmalmgren/textract

* tika-python @ GitHub: https://github.com/chrismattmann/tika-python

* KWIC implementation by @vahbuna: https://github.com/vahbuna/kwic/