## TextProcessor

### Set environment variables

To load this notebook's modules from outside the repository folder, set **PATH** below to the repository location, e.g. `C:/TextProcessor`:

In [None]:
PATH = "/path/to/TextProcessor" # <-- optional when running from inside the repository folder

In [None]:
import importlib.util, os

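# Fall back to the current working directory if PATH is not a valid folder
if not os.path.isdir(PATH):
    PATH = os.getcwd()
PATH = os.path.realpath(PATH)

# Execute the repository's __init__.py from the resolved PATH
spec = importlib.util.spec_from_file_location("__init__", os.path.join(PATH, "__init__.py"))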
init = importlib.util.module_from_spec(spec)
spec.loader.exec_module(init)

%matplotlib inline
%load_ext autoreload
%autoreload 2

### Import functions

In [None]:
from TextProcessor import TextProcessor
from csvlib import read_from_csv
from kwic import run_kwic
from nltklib import ngram_an
from spacylib import nlp_an
from textractlib import textract_file

### Process text from document

Analyzes keywords in context, n-grams, part-of-speech tags, noun phrases, verbs and entities, writing all output to CSV files.

**Hint:** the `keywords` parameter accepts a single word, a list of words, a comma-separated string, or an integer (to use that many of the most frequent words).

In [None]:
text_input = "" # text string or document file with content to parse
keywords = ""   # keywords for KWIC and n-gram analyses e.g. "love,hate"
column = ""     # column index or title to consider from spreadsheet
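The input forms accepted by `keywords` could be normalized along the following lines. This is only a minimal sketch: `normalize_keywords` is a hypothetical helper written for illustration, not a function of TextProcessor, and whitespace tokenization is an assumption.

```python
from collections import Counter

def normalize_keywords(keywords, tokens):
    """Return a list of keywords from any of the accepted input forms.

    Hypothetical helper for illustration; not part of TextProcessor.
    """
    if isinstance(keywords, int):
        # Integer: take that many of the most frequent tokens.
        return [word for word, _ in Counter(tokens).most_common(keywords)]
    if isinstance(keywords, str):
        # Single word or comma-separated string; empty parts are dropped.
        return [k.strip() for k in keywords.split(",") if k.strip()]
    # Any other iterable (e.g. a list) is returned as a list of strings.
    return [str(k) for k in keywords]

# Assumption: tokens come from a plain whitespace split.
tokens = "love love love hate hate is".split()
print(normalize_keywords("love,hate", tokens))
print(normalize_keywords(["love"], tokens))
print(normalize_keywords(2, tokens))
```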

#### Advanced settings

Set the keywords-in-context (KWIC) window size and the n-gram size, among other settings. **Hint:** try changing `kwic_size` and `ngram_size` for different results.

**Note:** the required [NLP models](https://spacy.io/usage/models) can be installed by enabling `allow_download` or by running e.g. `python -m spacy download en_core_web_sm` for English (spaCy 3 no longer accepts the `en` shortcut).

In [None]:
kwic_size = 5           # context window size for KWIC
ngram_size = 2          # size of n-grams to analyze
append_columns = ""     # for KWIC and n-grams e.g. "text"

nlp = True              # enable natural language processing (spaCy)
ignore_case = True      # ignore letter case (A == a)
allow_download = False  # automatic model download from spaCy
convert = False         # extract text from document and exit

model = "auto"          # model selection from spaCy (e.g. "english")
encode = "utf-8"        # file encoding (e.g. "windows-1252")
binary = ""             # binary name to extract text from file

output_folder = "TEXT"  # output folder name to write results to

### Start analysis

Launch the KWIC, n-gram and syntax analyses, and write the results to `output_folder`.

In [None]:
kwic, grams, pos, nouns, verbs, entities = TextProcessor(text_input,
                                                         output_folder=output_folder,
                                                         keywords=keywords,
                                                         column=column, 
                                                         kwic_size=kwic_size,
                                                         ngram_size=ngram_size,
                                                         nlp=nlp, 
                                                         ignore_case=ignore_case,
                                                         append_columns=append_columns,
                                                         allow_download=allow_download,
                                                         model=model,
                                                         encode=encode,
                                                         convert=convert,
                                                         binary=binary)

#### Keywords in context

Displays the data frame of keywords in context.

In [None]:
kwic
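To make the idea behind this table concrete, here is a minimal plain-Python sketch of a keywords-in-context lookup. It is not the code in `kwic.py`: `kwic_lines` is a hypothetical function, and it assumes that `kwic_size` means the number of words kept on each side of the keyword.

```python
def kwic_lines(text, keyword, size=5, ignore_case=True):
    """Return (left, keyword, right) tuples for each keyword occurrence.

    Hypothetical illustration; assumes `size` counts words on each side.
    """
    tokens = text.split()
    target = keyword.lower() if ignore_case else keyword
    rows = []
    for i, token in enumerate(tokens):
        candidate = token.lower() if ignore_case else token
        if candidate == target:
            # Slice up to `size` words before and after the match.
            left = " ".join(tokens[max(0, i - size):i])
            right = " ".join(tokens[i + 1:i + 1 + size])
            rows.append((left, token, right))
    return rows

sample = "All you need is love and love is all you need"
for left, kw, right in kwic_lines(sample, "love", size=2):
    print(f"{left:>12} | {kw} | {right}")
```

A larger `size` widens the context on both sides, which is the effect to expect when changing `kwic_size` under the assumption above.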

#### N-gram analysis

Displays the data frame of n-grams computed with the Natural Language Toolkit (NLTK).

In [None]:
grams
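For readers unfamiliar with n-grams, the sketch below counts them in plain Python. It stands in for the NLTK-based analysis rather than reproducing it: `count_ngrams` is a hypothetical function, and whitespace tokenization is an assumption.

```python
from collections import Counter

def count_ngrams(text, n=2, ignore_case=True):
    """Count n-grams (tuples of n consecutive words) in `text`.

    Hypothetical illustration using a whitespace split, not NLTK.
    """
    tokens = (text.lower() if ignore_case else text).split()
    # Zipping n shifted copies of the token list yields consecutive n-tuples.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

counts = count_ngrams("to be or not to be", n=2)
print(counts.most_common(1))
```

With `n=2` this yields bigrams; raising `n` (the role `ngram_size` plays in the settings) produces longer, and usually rarer, sequences.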

#### Part-of-speech tags

Displays the data frame of POS tags assigned by spaCy.

In [None]:
pos

#### Noun chunks

Displays the data frame of noun phrases (noun chunks) extracted by spaCy.

In [None]:
nouns

#### Verbs

Displays the data frame of verbs, lemmatized by spaCy.

In [None]:
verbs

#### Entities

Displays the entities identified by spaCy.

In [None]:
entities

#### Compress output → `output.zip`

In [None]:
!zip -r output.zip {output_folder}
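The `!zip` command above relies on a `zip` binary being installed, which is not always the case (for instance on Windows). A portable alternative could use the standard library, as sketched below; `zip_folder` is a hypothetical helper, not part of TextProcessor.

```python
import os
import shutil

def zip_folder(folder, archive_name="output"):
    """Create <archive_name>.zip containing `folder`; return the archive path.

    Hypothetical helper for illustration.
    """
    parent, name = os.path.split(os.path.abspath(folder))
    # root_dir/base_dir keep the folder name as the top-level entry in the archive.
    return shutil.make_archive(archive_name, "zip", root_dir=parent, base_dir=name)

# Only archive the default output folder if it already exists.
if os.path.isdir("TEXT"):
    print(zip_folder("TEXT"))
```

In the notebook, calling `zip_folder(output_folder)` would archive whichever folder was configured in the settings cell.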

### [Download output files](output.zip)

_____

### References

* Natural Language Toolkit: https://www.nltk.org/

* spaCy documentation: https://spacy.io

* textract documentation: http://textract.readthedocs.io/en/latest/

* textract @ GitHub: https://github.com/deanmalmgren/textract

* tika-python @ GitHub: https://github.com/chrismattmann/tika-python

* Original KWIC code: https://github.com/vahbuna/kwic/