# textaCy (version 0.6.2)
## textaCy is the first NLP preprocessing library we have tried that offers sophisticated statistical analyses and higher-level tools such as n-gram extraction and vectorization. It is very useful for preparing data for ML.
## Pros: textaCy is fast and efficient and has a wide variety of functions; it is built on spaCy; and its high level of abstraction lets us shape what reaches our ML models at the preprocessing stage.
## Cons: named entity extraction doesn't work for our dataset (no model has been trained on it) ;(

In [1]:
import pandas as pd
import es_core_news_md
import textacy
import spacy
import csv

In [2]:
def fetch_dataset(csv_file_name):
    data = pd.read_csv(csv_file_name, sep=',')

    return data

In [3]:
handle = fetch_dataset('mattermost_running.csv')
mess = handle['text'].to_string()
#must be a string for textacy (note: to_string() also writes the row index numbers into the text)
print(type(mess))


<class 'str'>
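One caveat about `to_string()`: it renders the Series the way it would be printed, so the row index labels become part of the text (they show up in the keyword-search output further down as `12`, `29`, `98`, ...). A minimal sketch of an alternative, assuming a small made-up Series in place of the real `text` column, joins the raw messages with newlines instead:

```python
import pandas as pd

# Hypothetical stand-in for handle['text']; the real column comes from the CSV.
messages = pd.Series(["hola", "necesito una cotizacion"])

rendered = messages.to_string()            # printed form, index labels included
joined = "\n".join(messages.astype(str))   # raw messages only, one per line

print(rendered)
print(joined)
```

Switching the notebook to the joined form would change every count and output shown here, so the cells keep the original `to_string()` call.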


### A textaCy-style keyword search

In [4]:
#look for keywords in the text
def keyword_search(keyword):
    feedback = textacy.text_utils.KWIC(mess, keyword, window_width=35)
    return feedback

In [5]:
keyword_search('poliza')

además necesitaría una copia de la  poliza 
12     Hola estoy en la pagina web
or favor necesito que me envíen la  poliza . Es...
29        deseo consultarle
a con ustedes pero se me venció la  poliza 
98            Es saber si podía ha
a con ustedes pero se me venció la  poliza 
104           Es saber si podía ha
a con ustedes pero se me venció la  poliza 
121           Es saber si podía ha
ola quisiera saber cuando vence la  poliza 
149           Hola quisiera saber 
ola quisiera saber cuando vence la  poliza 
150               si x favor manda
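To make the keyword-in-context idea concrete, here is a minimal pure-Python sketch of what such a search does: find each occurrence of the keyword and keep a fixed number of characters on either side. The function `kwic` and the sample sentence are illustrative assumptions, not textaCy's implementation.

```python
import re

def kwic(text, keyword, window_width=35):
    """Return (left context, match, right context) for each keyword hit."""
    hits = []
    for m in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - window_width):m.start()]
        right = text[m.end():m.end() + window_width]
        hits.append((left, m.group(), right))
    return hits

# Made-up sample text, loosely modelled on the messages above.
sample = "Hola, por favor necesito que me envíen la poliza. Se me venció la poliza ayer."
for left, match, right in kwic(sample, "poliza", window_width=20):
    print("{:>20} {} {:<20}".format(left, match, right))
```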


## Convert our text into a textacy Doc so we can tokenize and analyze it at a higher level

In [6]:
#no need to specify the language: when lang is not given, textacy detects it from the text
doc = textacy.Doc(mess)
doc

Doc(3000 tokens; "0      hola necesito conocer sobre precios de s...")

## textacy.text_stats

In [7]:
#if we convert the doc to .TextStats we can get some statistics, easy and fast
stats = textacy.TextStats(doc)

In [8]:
stats.basic_counts

{'n_chars': 8755,
 'n_long_words': 307,
 'n_monosyllable_words': 1206,
 'n_polysyllable_words': 340,
 'n_sents': 126,
 'n_syllables': 3574,
 'n_unique_words': 825,
 'n_words': 2168}

In [9]:
#unique words
stats.n_unique_words

825
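For intuition about what these counts measure, here is a rough pure-Python approximation. It assumes that `n_chars` counts characters inside words and uses simple regular expressions instead of spaCy's tokenizer, so its numbers will not match `TextStats` exactly; `basic_counts` is a name made up for this sketch.

```python
import re

def basic_counts(text):
    """Rough word and sentence counts using simple regular expressions."""
    words = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "n_chars": sum(len(w) for w in words),   # assumption: characters in words only
        "n_words": len(words),
        "n_unique_words": len(set(words)),
        "n_sents": len(sentences),
    }

print(basic_counts("Hola. Quiero pagar el seguro. Hola, quiero una cotizacion!"))
```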

## textacy.extract: here's a list of some of the things we can extract

In [None]:
"""
>words(doc, filter_stops=True, filter_punct=True, filter_nums=False, include_pos=None, exclude_pos=None, min_freq=1):
>ngrams(doc, n, filter_stops=True, filter_punct=True, filter_nums=False, include_pos=None, exclude_pos=None, min_freq=1):
>named_entities(doc, include_types=None, exclude_types=None, drop_determiners=True, min_freq=1):            
>noun_chunks(doc, drop_determiners=True, min_freq=1):
>subject_verb_object_triples(doc):
>acronyms_and_definitions(doc, known_acro_defs=None):
>direct_quotations(doc):
>pos_regex_matches(doc, pattern):
--- Examples of POS regex patterns:
            * noun phrase: r'<DET>? (<NOUN>+ <ADP|CONJ>)* <NOUN>+'
            * compound nouns: r'<NOUN>+'
            * verb phrase: r'<VERB>?<ADV>*<VERB>+'
            * prepositional phrase: r'<PREP> <DET>? (<NOUN>+<ADP>)* <NOUN>+' """
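The POS regex patterns listed above can be read as ordinary regular expressions over the sequence of part-of-speech tags. The sketch below shows one way to implement that idea in plain Python; the function name, the hand-tagged tokens, and the translation of `<TAG>` into a regex group are assumptions made for illustration, not textaCy's code.

```python
import re

def pos_regex_matches(tagged, pattern):
    """Yield lists of words whose POS tag sequence matches `pattern`."""
    # Drop whitespace, then rewrite each <TAG> or <A|B> as a group matching " TAG".
    pattern = re.sub(r"\s", "", pattern)
    pattern = re.sub(r"<([A-Z|]+)>", r"(?: (?:\1)(?![A-Z]))", pattern)
    # One space before every tag, so counting spaces gives token positions.
    tags = "".join(" " + pos for _, pos in tagged)
    for m in re.finditer(pattern, tags):
        start = tags[:m.start()].count(" ")
        end = tags[:m.end()].count(" ")
        yield [word for word, _ in tagged[start:end]]

# Hand-assigned tags for a made-up sentence (hypothetical, not model output).
tagged = [("la", "DET"), ("pagina", "NOUN"), ("web", "NOUN"),
          ("no", "ADV"), ("funciona", "VERB")]

print(list(pos_regex_matches(tagged, r"<NOUN>+")))
print(list(pos_regex_matches(tagged, r"<DET>? (<NOUN>+ <ADP|CONJ>)* <NOUN>+")))
```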

### What about stopwords? textaCy filters stopwords at extraction time instead of requiring a separate stopword-free copy of the text, as we had to build with spaCy and NLTK. This saves a preprocessing step.

In [10]:
#let's start with n-grams (in this case trigrams)
ngrams = list(textacy.extract.ngrams(doc, 3, filter_stops=True, filter_punct=True))
ngrams

[precios de seguro,
 o posible cliente,
 necesito una cotizacion,
 Hice el pago,
 Y me e,
 imprimir mi cupon,
 cupon de pago,
 Me envia a,
 a la pagina,
 pagina de mercado,
 mercado pago y,
 y me pide,
 Toco el boton,
 medios de pagos,
 imprimir mi cupon,
 cupon de pago,
 Me envia a,
 a la pagina,
 pagina de mercado,
 mercado pago y,
 y me pide,
 Toco el boton,
 medios de pagos,
 venció el seguro,
 necesitaría una copia,
 pagina web y,
 web y quiero,
 viene la grúa,
 Ya pasaron 2,
 Assistance no aparece,
 y para servir,
 Gente necesitamos ayuda,
 Pablo bustos fierro,
 seguro a presentar,
 presentar la boleta,
 Luis de Cabrera,
 pague con débito,
 débito on Line,
 envíen la poliza,
 pagina web y,
 web y quiero,
 esperando q inda,
 inda me conteste,
 mande al grupo,
 anulada por falta,
 falta de pago,
 SE LO HAGO,
 verifico con day,
 pagina web y,
 web y quiero,
 pagina web y,
 web y quiero,
 asegurar mi vehículo,
 Esa me interesa,
 Ford f 100,
 f 100 modelo,
 100 modelo 80,
 paso los da
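Note that the list still contains trigrams such as `y me pide` and `a la pagina`, so not every stop word was caught. As we understand it, `filter_stops=True` only drops n-grams that start or end with a stop word; stop words in the middle are kept. A minimal sketch of that rule in plain Python, with a tiny hand-picked stop list that is an assumption for illustration (not spaCy's Spanish list):

```python
def ngrams(tokens, n, stop_words=frozenset()):
    """Return n-grams of `tokens`, skipping those that start or end with a stop word."""
    grams = []
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if gram[0].lower() in stop_words or gram[-1].lower() in stop_words:
            continue
        grams.append(" ".join(gram))
    return grams

# Illustrative stop list only.
STOPS = {"de", "la", "el", "y", "a", "me", "que"}
tokens = "necesito una copia de la poliza".split()

print(ngrams(tokens, 3))          # no filtering
print(ngrams(tokens, 3, STOPS))   # boundary stop words filtered
```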

In [11]:
#Let's try named entities
entit = list(textacy.extract.named_entities(doc, drop_determiners=True))
print(len(entit))

456


In [12]:
#entity labels: keep each entity span and its label in parallel lists
ents = list(entit)
label = [en.label_ for en in entit]
print(len(label))

456


In [17]:
#Let's viz ...
df = pd.DataFrame({'entity' : ents, 
                  'label' : label})
df

                entity  label
0                 (\n)    ORG
1                 (\n)    ORG
2            (\n, 280)   MISC
3                  ( )    ORG
4  (Perfecto, !, !, !)   MISC
5            (\n, 112)   MISC
6                 (\n)    ORG
7        (Perfecto, !)   MISC
8                  ( )   MISC
9            (\n, 249)   MISC


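### Named entity extraction performs as poorly in textaCy as in spaCy, which is expected: textaCy reads its entities from the underlying spaCy model rather than running its own recognizer. Line breaks and what appear to be row index numbers from the text also end up among the entities (e.g. `(\n, 280)`).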
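Many of the rows above are whitespace, line breaks, or numbers. One cheap clean-up, sketched here on plain strings modelled on those rows, is to keep only candidates that contain at least one letter. The helper `plausible_entity` is a hypothetical name, and the filter is only a heuristic: `Perfecto!` still passes even though it is not a real entity.

```python
def plausible_entity(text):
    """Keep candidates that contain at least one letter."""
    return any(ch.isalpha() for ch in text)

# Candidate strings modelled on the table above (illustrative only).
candidates = ["\n", "\n280", " ", "Perfecto!!!", "\n112", "Perfecto!"]
kept = [c for c in candidates if plausible_entity(c)]
print(kept)
```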

In [13]:
import textacy.keyterms

# textaCy keyterms: a step towards ML
## With textaCy keyterms we can preview which terms dominate our data before ML. Note that the scores below come from the SGRank ranking algorithm; they are not the weights a tf-idf vectorizer will assign.

In [61]:
# output reveals keywords, like before, but now each one has a numerical ranking score
textacy.keyterms.sgrank(doc, ngrams=(1, 2, 3, 4), normalize='lemma', window_width=1500, n_keyterms=10, idf=None)

[('seguro', 0.19278695357946754),
 ('Hola', 0.12481885409310207),
 ('user joined', 0.10946179157848471),
 ('paginar web', 0.0550999266727019),
 ('Quiero', 0.05480748216303773),
 ('pagar', 0.050713555207169275),
 ('gracia', 0.029431723259660005),
 ('favor', 0.028356676186600118),
 ('precio', 0.024231605372670127),
 ('poliza', 0.024204858512917652)]

In [62]:

textacy.keyterms.sgrank(doc, ngrams=(2, 3), normalize='lemma', window_width=1500, n_keyterms=10, idf=None)

[('user joined', 0.7439178671590614), ('paginar web', 0.2560821328409387)]
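Both `sgrank` calls pass `idf=None`, so, as we understand it, no inverse document frequency information enters the ranking. For reference, a tf-idf weight combines how often a term occurs in one document with how rare it is across documents. The sketch below uses the plain tf × log(N / df) form on three made-up messages; scikit-learn's default uses a smoothed variant, so its numbers would differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Made-up messages for illustration.
docs = ["hola quiero pagar el seguro",
        "hola necesito la poliza",
        "hola quiero una cotizacion"]
for w in tf_idf(docs):
    print(w)
```

A term that occurs in every message, such as `hola` here, gets weight zero, while a term unique to one message gets the highest weight.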

## textaCy Bag of Terms:

In [66]:
b_o_t = doc.to_bag_of_terms(ngrams=1, named_entities=True, as_strings=True)
sorted(b_o_t.items(), key=lambda x: x[1], reverse=True)[:15]

[('', 197),
 ('seguro', 33),
 ('y', 29),
 ('Hola', 27),
 ('pagar', 22),
 ('a', 20),
 ('Me', 18),
 ('auto', 16),
 ('q', 15),
 ('Quiero', 15),
 ('querer', 14),
 ('favor', 12),
 ('Quería', 10),
 ('voleta', 10),
 ('Es', 10)]
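The top entry above is the empty string with 197 occurrences, which probably comes from whitespace tokens. A bag of terms is just a mapping from term to count, so the idea, including dropping those empty terms, can be sketched with `collections.Counter`. The token list and the helper name `bag_of_terms` are assumptions for illustration.

```python
from collections import Counter

def bag_of_terms(tokens):
    """Count each non-empty term, ignoring surrounding whitespace."""
    return Counter(t.strip() for t in tokens if t.strip())

# Made-up token list that includes whitespace-only and empty tokens.
tokens = ["seguro", " ", "y", "seguro", "\n", "Hola", "pagar", "seguro", ""]
bag = bag_of_terms(tokens)
print(sorted(bag.items(), key=lambda x: x[1], reverse=True)[:3])
```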

## We will save vectorization for the next notebook, on scikit-learn and ML.