# More feature extraction: Textacy

The Natural Language Toolkit (NLTK) is the granddaddy of text mining libraries for Python, but other tools have become available since its inception. One of them is Textacy. Built on top of spaCy, Textacy does many of the things NLTK does, and more. This extra functionality is both more complicated and more expressive than the Toolkit's.

In [9]:
# configure
CARREL   = 'CodeOfConduct4Lib-2021-03-10'
TEMPLATE = './carrels/%s/etc/reader.txt'
KEYWORD  = 'code'
MODEL   = 'en_core_web_sm'


In [10]:
# require
import textacy
import os
import spacy
from textacy.ke import yake


In [11]:
# slurp up some plain text, closing the file handle when done...
file = TEMPLATE % CARREL
with open( file ) as handle : data = handle.read()

# ...and do the tiniest bit of normalization ("cleaning") against it
data = data.replace( '\n', ' ' ).replace( '\t', ' ').replace( '  ', ' ')
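
Chained `replace` calls collapse only one pair of spaces per pass, so longer runs of whitespace can survive the cleaning step above. What follows is a minimal sketch of a more thorough alternative using only the standard library. The function name `normalize` is hypothetical and not part of the notebook.

```python
import re

def normalize(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # \s+ matches one or more whitespace characters of any kind
    return re.sub(r'\s+', ' ', text).strip()

print(normalize('code\n\n4\tlib    wiki'))
```

Swapping this in may shift the outputs shown in later cells slightly, because spaCy's tokenization depends on the whitespace it is given.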


In [12]:
# perform a keyword-in-context (KWIC) query against the data, i.e. build a concordance
result = textacy.text_utils.KWIC( data, KEYWORD )
print( list( result ) )


                                                   code ofconduct lib cc licensed under cc short link: htt
censed under cc short link: http://bit.ly/coc lib  code  lib seeks to provide a welcoming, professionally 
ent by having that initial discussion themselves.  code  lib understands that there are many reasons speak
ommunity support squad volunteers list, or on the  code  lib wiki.. if you can't find either any such peop
 you want to ensure anonymity. if you are in the # code  lib irc channel, the zoia command to list people 
know how to direct you to them. if you are in the  code  lib slack, you may reach a volunteer by including
 posted to a public channel, such as #general or # code  libcon. you may also private message a known memb
publicly-accessible website(s). if you are in the  code  lib discord server, contact anyone who is assigne
ing the offender, expelling the offender from the  code  lib event, or banning the offender from a chatroo
e offender not be allowed to voluntee
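
The output above lines up each hit of the keyword with a fixed window of text on either side. The idea can be sketched with the standard library alone. This is an illustration of the technique, not textacy's implementation, and the function name `kwic`, its parameters, and the sample sentence are all assumptions.

```python
import re

def kwic(text, keyword, window=50, ignore_case=True):
    """Yield (left context, matched keyword, right context) for each hit."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape treats the keyword as literal text, not as a pattern
    for match in re.finditer(re.escape(keyword), text, flags=flags):
        left  = text[max(0, match.start() - window):match.start()]
        right = text[match.end():match.end() + window]
        yield left, match.group(), right

sample = 'the code lib wiki lists people who follow the code of conduct'
for left, hit, right in kwic(sample, 'code', window=12):
    # right-justify the left context so the keywords form a column
    print(f'{left:>12} {hit} {right}')
```

Returning tuples instead of printing makes the hits easy to count, sort, or filter afterwards.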

In [13]:
# create a spaCy "doc object"; depending on the size of the input, this may take a few minutes to process
size           = os.stat( file ).st_size
nlp            = spacy.load( MODEL  )
nlp.max_length = size
doc            = nlp( data )


In [14]:
doc._.preview

'Doc(1442 tokens: " codeofconduct lib cc licensed under cc short l...")'

In [15]:
textacy.TextStats( doc ).flesch_reading_ease

41.32295496887835
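
The Flesch Reading Ease score combines average sentence length with average syllables per word, and higher scores indicate easier text. A score near 41 falls in the band conventionally described as difficult, roughly college level. The sketch below applies the standard published formula to counts supplied by the caller. The function name and the sample counts are illustrative, and textacy's own counting of words, sentences, and syllables may differ.

```python
def flesch_reading_ease(n_words, n_sents, n_syllables):
    """Standard Flesch Reading Ease formula computed from raw counts."""
    if n_words == 0 or n_sents == 0:
        raise ValueError('need at least one word and one sentence')
    words_per_sentence = n_words / n_sents
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# illustrative counts, not taken from the carrel
score = flesch_reading_ease(n_words=100, n_sents=5, n_syllables=150)
print(round(score, 3))
```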

In [16]:
list(textacy.extract.ngrams( doc, 2, filter_stops=True, filter_punct=True, filter_nums=False) )

[codeofconduct lib,
 lib cc,
 cc licensed,
 cc short,
 short link,
 http://bit.ly/coc lib,
 lib code,
 code lib,
 lib seeks,
 professionally engaging,
 safe conference,
 ongoing community,
 tolerate harassment,
 discriminatory language,
 including sexual,
 sexualized language,
 event venue,
 including talks,
 community channel,
 mailing list,
 unsafe environment,
 includes offensive,
 offensive verbal,
 verbal comments,
 verbal expressions,
 expressions related,
 gender identity,
 gender expression,
 sexual orientation,
 physical appearance,
 body size,
 political beliefs,
 discriminatory images,
 including online,
 deliberate intimidation,
 harassing photography,
 sustained disruption,
 inappropriate physical,
 physical contact,
 unwelcome sexual,
 sexual attention,
 conflict resolution,
 resolution initial,
 initial incident,
 feel comfortable,
 comfortable speaking,
 offending behavior,
 initial discussion,
 code lib,
 lib understands,
 reasons speaking,
 speaking directly,
 persona
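
With `filter_stops=True`, n-grams that begin or end with a stop word are dropped, which is why pairs such as "of the" do not appear above. A minimal sketch of that filtering on plain token lists follows. The tiny stop list and the function name `bigrams` are hypothetical, and spaCy's real stop list is much longer.

```python
STOPS = {'the', 'a', 'to', 'of', 'and', 'in'}   # tiny, hypothetical stop list

def bigrams(tokens, stops=STOPS):
    """Yield adjacent token pairs, skipping pairs that start or end with a stop word."""
    for first, second in zip(tokens, tokens[1:]):
        if first in stops or second in stops:
            continue
        yield first, second

tokens = 'we do not tolerate harassment in any form'.split()
print(list(bigrams(tokens)))
```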

In [17]:
yake( doc )

[('community', 0.16129525407839013),
 ('lib', 0.17105750584987534),
 ('code', 0.17175691455112038),
 ('offender', 0.17277388872364463),
 ('event', 0.1846411322382838),
 ('volunteer', 0.1949806304418236),
 ('support', 0.20092741544357734),
 ('list', 0.24882729969551),
 ('conference', 0.24984179958301242),
 ('code lib', 0.2507713303348389)]
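
YAKE scores run the opposite way from most rankings: the lower the score, the more relevant the term, so the list above is ordered from most to least relevant. A small sketch of working with such (term, score) pairs follows, using a few rows copied from the output above. The variable names are illustrative only.

```python
# a few (term, score) rows copied from the output above
keywords = [
    ('community', 0.16129525407839013),
    ('lib',       0.17105750584987534),
    ('code lib',  0.2507713303348389),
]

# lower score means more relevant, so the best term is the minimum
best_term, best_score = min(keywords, key=lambda pair: pair[1])

# keep only multi-word phrases
phrases = [term for term, score in keywords if ' ' in term]

print(best_term, phrases)
```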

In [18]:
list( textacy.extract.entities( doc ) )

[third, at least one, #, eric, first]

In [19]:
list( textacy.extract.subject_verb_object_triples( doc ) )

[(code lib, seeks, to provide),
 (we, do not tolerate, harassment),
 (that, produces, environment),
 (it, includes, comments),
 (it, includes, expressions),
 (it, includes, disruption),
 (they, have affected, you),
 (offender, may resolve, incident),
 (offended, may resolve, incident),
 (offender, harassing, you),
 (engagement, is not, option),
 (escalation, will need, party),
 (escalation, will need, to step),
 (you, will need, party),
 (you, will need, to step),
 (you, can't find, people),
 (there, will be, staff),
 (you, may use, email address),
 (you, want, to ensure),
 (helpers, may not be, support volunteers),
 (you, may reach, volunteer),
 (who, is assigned, @community_support_volunteers role),
 (who, are designated, role),
 (those, will have, user name),
 (someone, is, support volunteer),
 (you, can see, list),
 (listserv, does have, maintainer),
 (listserv, does have, lease morgan),
 (incident, doesn't pass, step),
 (discussion, reveals, offense),
 (discussion, reveals, apolog

In [20]:
list( textacy.extract.noun_chunks( doc ) )

[ codeofconduct lib cc,
 cc short link,
 http://bit.ly/coc lib code lib,
 welcoming,
 fun,
 safe conference,
 ongoing community,
 ) experience,
 everyone,
 we,
 harassment,
 form,
 discriminatory language,
 imagery,
 sexual or sexualized language,
 imagery,
 event venue,
 talks,
 community channel,
 chatroom,
 mailing list,
 harassment,
 behavior,
 person,
 group,
 unsafe environment,
 it,
 offensive verbal comments,
 non-verbal expressions,
 gender,
 gender identity,
 gender expression,
 sexual orientation,
 disability,
 physical appearance,
 body size,
 race,
 age,
 religious or political beliefs,
 sexual, sexualized, or discriminatory images,
 public,
 online) spaces,
 deliberate intimidation,
 photography,
 recording,
 sustained disruption,
 talks,
 other events,
 inappropriate physical contact,
 sexual attention,
 conflict resolution initial incident,
 you,
 someone,
 other concerns,
 you,
 offender,
 offender,
 they,
 you,
 offending behavior,
 offender,
 incident,
 initial discu

In [21]:
list( textacy.extract.semistructured_statements( doc, entity='you', cue='be' ) ) 

[(you, are, at a conference or other community event),
 (you, are, in the #code lib irc channel),
 (you, are, in the code lib slack),
 (you, are, in the code lib discord server),
 (you, are, on the listserv),
 (you, 're, in a free-for-all for public messages),
 (you, are, welcome to involve)]