## TextProcessor

### Set environment variables

To load the modules used in this notebook when running it from outside the repository folder, set **PATH** below to the repository location, e.g. ```C:/TextProcessor```:

In [None]:
PATH = "/path/to/TextProcessor" # <-- optional when running from the repository folder

In [None]:
import importlib.util, os

if not os.path.isdir(PATH):
    PATH = os.getcwd()
PATH = os.path.realpath(PATH)

spec = importlib.util.spec_from_file_location("__init__", os.path.join(PATH, '__init__.py'))
init = importlib.util.module_from_spec(spec)
spec.loader.exec_module(init)

%matplotlib inline
%load_ext autoreload
%autoreload 2

### Import functions

In [None]:
from TextProcessor import TextProcessor
from kwic import run_kwic
from nltklib import ngram_an
from spacylib import nlp_an
from textractlib import textract_file

### Process text from document

Analyzes keywords in context, n-grams, part-of-speech tags, noun phrases, verbs and entities, writing all output to CSV files.

**Hint:** the `keywords` parameter accepts a single word, a list of words, a comma-separated string, or an integer (to select that many of the most frequent words).
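The accepted forms can be illustrated with a small normalization routine. This is a minimal sketch, not the code used by `TextProcessor`: the helper name `normalize_keywords` and its `tokens` argument are hypothetical, and the integer case is assumed to mean "the N most frequent tokens".

```python
from collections import Counter

def normalize_keywords(keywords, tokens):
    """Hypothetical helper: turn any accepted `keywords` form into a list of words."""
    if isinstance(keywords, int):
        # Assumption: an integer asks for the N most frequent tokens.
        return [word for word, _ in Counter(tokens).most_common(keywords)]
    if isinstance(keywords, str):
        # A single word or a comma-separated string.
        return [k.strip() for k in keywords.split(",") if k.strip()]
    # Any other iterable (list, tuple) is taken item by item.
    return [str(k).strip() for k in keywords]

tokens = "the cat sat on the mat near the cat".split()
print(normalize_keywords("cat", tokens))           # single word
print(normalize_keywords("cat, mat", tokens))      # comma-separated string
print(normalize_keywords(["cat", "mat"], tokens))  # list
print(normalize_keywords(2, tokens))               # two most frequent tokens
```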

In [None]:
text = "" # <-- string or document file with content to parse

In [None]:
TextProcessor(text,
              output_folder='RESULTS',
              keywords='',
              kwic_size=5,
              ngram_size=2,
              ignore_case=False,
              column_csv='',
              allow_download=True,
              language='auto',
              encode='utf-8',
              convert=False,
              binary='')

### Advanced usage

#### Read/convert document file

Supported formats include `.doc(x)`, `.epub`, `.odt` and `.pdf`. **To-do:** add whole folder conversion with recursion.

In [None]:
text = textract_file(text, convert=False)

#### KWIC Analysis

Looks for keywords in context in a text file or string and returns the associated tags. **Hint:** try changing the context `size` (default: 5) for different results.

In [None]:
keywords = "" # <-- keywords to analyze context (optional)

run_kwic(text, keywords, size=5, ignore_case=False, display=True)
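To show what a keyword-in-context window looks like, here is a minimal sketch that does not reproduce `run_kwic` itself. The function `kwic_sketch` is hypothetical; it assumes whitespace tokenization and treats `size` as the number of tokens kept on each side of the keyword.

```python
def kwic_sketch(text, keyword, size=5, ignore_case=False):
    """Hypothetical KWIC: return (left context, keyword, right context) tuples."""
    tokens = text.split()
    norm = (lambda s: s.lower()) if ignore_case else (lambda s: s)
    target = norm(keyword)
    rows = []
    for i, token in enumerate(tokens):
        # Strip surrounding punctuation before comparing with the keyword.
        if norm(token.strip(".,;:!?\"'()")) == target:
            left = " ".join(tokens[max(0, i - size):i])
            right = " ".join(tokens[i + 1:i + 1 + size])
            rows.append((left, token, right))
    return rows

sample = "The quick brown fox jumps over the lazy dog. The fox sleeps."
for left, key, right in kwic_sketch(sample, "fox", size=2):
    print(f"{left:>20} | {key} | {right}")
```

A larger `size` widens the window on both sides, which is why changing it alters the results.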

#### N-gram analysis

Requires the Natural Language Toolkit (NLTK) to be installed. **Hint:** try changing the default value of `n` (2) for different results.

In [None]:
grams = ngram_an(text, n=2)
for g in grams:
    print(g)
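For readers unfamiliar with n-grams, the idea can be sketched in plain Python. This is an illustration only, not what `ngram_an` does internally: the function `ngram_sketch` is hypothetical, uses lowercased whitespace tokens instead of NLTK, and returns frequency counts.

```python
from collections import Counter

def ngram_sketch(text, n=2):
    """Hypothetical n-gram counter over lowercased whitespace tokens."""
    tokens = text.lower().split()
    # Zip n shifted copies of the token list to form sliding windows of length n.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

counts = ngram_sketch("to be or not to be", n=2)
for gram, freq in counts.most_common(3):
    print(" ".join(gram), freq)
```

With `n=2` each item is a pair of adjacent words; raising `n` yields longer and usually rarer sequences.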

#### Syntax analysis

Perform analysis using `spaCy`, a library for Natural Language Processing featuring NER, POS tagging, dependency parsing, word vectors and more.

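**Note:** when running for the first time, you need to download the [NLP model](https://spacy.io/usage/models), e.g. by executing ```python -m spacy download en``` for English. On spaCy 3 and later the short `en` alias is no longer available; use a full model name such as `en_core_web_sm` instead.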

In [None]:
pos_tags, noun_chunks, verbs, entities = nlp_an(text, language='auto', allow_download=False)

In [None]:
#!python -m spacy download en

#### Inspect output

In [None]:
print('POS tags:')
for pos in pos_tags:
    print(pos)

In [None]:
print('Noun phrases:')
for chunk in noun_chunks:
    print(chunk)

In [None]:
print('Verbs:')
for verb in verbs:
    print(verb)

In [None]:
print('Entities:')
for ent in entities:
    print(ent)

#### Compress output → `output.zip`

In [None]:
!zip output.zip *.csv *.txt
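The `zip` command may not be available on every system (for instance on Windows). A portable alternative can be sketched with the standard-library `zipfile` module; the helper name `zip_outputs` and its parameters are hypothetical and not part of this repository.

```python
import zipfile
from pathlib import Path

def zip_outputs(folder=".", archive="output.zip", patterns=("*.csv", "*.txt")):
    """Hypothetical helper: archive files in `folder` that match `patterns`."""
    folder = Path(folder)
    # Collect matching files once, sorted for a stable archive order.
    files = sorted({p for pattern in patterns for p in folder.glob(pattern)})
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in files:
            # Store each file under its base name only, without the folder prefix.
            zf.write(path, arcname=path.name)
    return [p.name for p in files]
```

Calling `zip_outputs()` with no arguments archives the `.csv` and `.txt` files of the current working directory into `output.zip`.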

### [Download output files](output.zip)

_____

### References

* Natural Language Toolkit: https://www.nltk.org/

* spaCy documentation: https://spacy.io

* textract documentation: http://textract.readthedocs.io/en/latest/

* textract @ GitHub: https://github.com/deanmalmgren/textract

* tika-python @ GitHub: https://github.com/chrismattmann/tika-python

* KWIC implementation by @vahbuna: https://github.com/vahbuna/kwic/