# NLPeasy Workshop - Introduction
# SDS 2020, 25th of June 2020

In this demo you will use NLPeasy to quickly set up a Pandas-based pipeline. The results are then saved in Elasticsearch, and a Kibana dashboard is generated automatically to explore the texts and results.

For the workshop you have two ways to participate:
- **mybinder.org:** No setup needed, only a web browser; however, only 1-2 GB of RAM are available
- **own laptop:** You need to install the Python environment and either Docker or Elasticsearch + Kibana

This is why we have two variants of the execution, a small one for binder and a bigger one to use on your laptop:

In [1]:
import os
DOIT_SMALL = 'BINDER_PORT' in os.environ

You can override the binder setting if you need to or want to try the other variant:

In [2]:
# We are on bigger machines for the workshop:
DOIT_SMALL = False
"DOIT_SMALL: {}".format(DOIT_SMALL)

'DOIT_SMALL: False'

In [3]:
if DOIT_SMALL:
    nrows_to_load, spacy_model, spacy_cols = 1000, 'en_core_web_sm', ['myself_summary']
else:
    nrows_to_load, spacy_model, spacy_cols = 10000, 'en_core_web_md', None
"nrows_to_load: {} spacy_model: {} spacy_cols: {}".format(nrows_to_load, spacy_model, spacy_cols)

'nrows_to_load: 10000 spacy_model: en_core_web_md spacy_cols: None'
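
The three configuration cells above can also be folded into one helper. This is only a sketch: the function name `choose_settings` and its `force_small` parameter are made up for illustration, while the values it returns are the ones used in this notebook.

```python
import os

def choose_settings(environ=os.environ, force_small=None):
    """Return (nrows_to_load, spacy_model, spacy_cols) for the current machine."""
    # Auto-detect binder via BINDER_PORT unless the caller overrides the choice.
    small = ('BINDER_PORT' in environ) if force_small is None else force_small
    if small:
        return 1000, 'en_core_web_sm', ['myself_summary']
    return 10000, 'en_core_web_md', None

# Same effect as setting DOIT_SMALL = False by hand.
nrows_to_load, spacy_model, spacy_cols = choose_settings(force_small=False)
print(nrows_to_load, spacy_model, spacy_cols)
```

Passing `force_small=None` (the default) keeps the automatic detection, so the same notebook runs unchanged on binder and on a laptop.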

## Imports

Today we will analyse a dataset using the NLPeasy package. We will also need pandas and numpy.

In [4]:
import pandas as pd
import numpy as np
import nlpeasy as ne
import util

`util` is part of the workshop repository and helps with accessing the Kibana dashboard on binder.

## Connect to Elastic Stack

Connect to a running Elastic instance on your local machine. If none can be found, this function starts an open-source stack in Docker.

By specifying `mount_volume_prefix`, your Elastic data is saved to the local filesystem.

In [5]:
elk = ne.connect_elastic(elk_version='7.10.2',
                         mount_volume_prefix='./elastic-data-live/')

'Elasticsearch already running'

> If the stack is started on Docker, the first run will pull the images (1.3 GB)!
> BTW, this function is not blocking, i.e. the servers might only become active a couple of seconds later.
> Setting `mount_volume_prefix` (e.g. `"./elastic-data/"`) keeps the Elastic data in your
> filesystem, so the data survives container restarts.
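
Because the call does not block, it can help to wait until the servers answer before continuing. The following is a minimal sketch using only the standard library; `http_ok` and `wait_until` are hypothetical helper names, not part of NLPeasy, and the URL assumes Elasticsearch is listening on its default port 9200 without authentication.

```python
import time
import urllib.error
import urllib.request

def http_ok(url, timeout=2.0):
    """Return True if `url` answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: not ready yet.
        return False

def wait_until(probe, timeout=120.0, interval=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call `probe()` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)

# Usage (assumption: Elasticsearch on the default local port):
# wait_until(lambda: http_ok("http://localhost:9200"))
```

The probe is passed in as a function so the same loop can be reused for Kibana (default port 5601) or any other readiness check.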

On binder you do not have access to its localhost. However, we set up some binder-jupyter magic so you can access Kibana:

In [6]:
if util.BINDER:
    util.display_kibana_link()

## Read and Process Data

We use the OkCupid dataset in this workshop (the anonymized, legal version of it).
Let's load the data and see what we have here.

In [7]:
url = "profiles.csv.zip" if "profiles.csv.zip" in os.listdir() else "https://github.com/rudeboybert/JSE_OkCupid/raw/master/profiles.csv.zip"
print(f"Getting data from {url}")
# Read from the source chosen above (local copy if present, otherwise GitHub)
okc = pd.read_csv(url, nrows=nrows_to_load)
okc

Getting data from profiles.csv.zip


Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,currently working as an international agent fo...,making people laugh.<br />\nranting about a go...,"the way i look. i am a six foot half asian, ha...",...,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,...,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,...,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,...,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at:<br />\nhttp://bag...,i smile a lot and my inquisitive nature,...,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,24,curvy,anything,socially,never,graduated from two-year college,"hi my name is chantea and i am 23, i am a hair...",right now i work full time at a hair salon and...,playing basketball... well lets say i use to.....,my hair lol,...,"burlingame, california",has a kid,straight,likes dogs and likes cats,christianity and very serious about it,f,libra but it doesn&rsquo;t matter,when drinking,english,single
9996,24,skinny,,socially,never,graduated from college/university,"i came to the states when i was 17, and i fini...",,,,...,"san francisco, california",doesn&rsquo;t have kids,straight,,,f,virgo,no,"english, chinese",single
9997,19,athletic,,rarely,never,,heey :)<br />\ni am 19 and bi sexual . i have ...,i am currently going to school and so is my bo...,"i am really good at sports and photography , i...","my eyes and hair , or so i've been told . lol :)",...,"martinez, california",,bisexual,has dogs and has cats,,f,aquarius but it doesn&rsquo;t matter,sometimes,english,seeing someone
9998,47,a little extra,,rarely,never,,after giving this much thought i have finally ...,"just doin me now, if happen chance?!?!?<br />\...",learning graphic art<br />\nwriting<br />\ncoo...,my life is quite simple. i get up in the morni...,...,"emeryville, california",,straight,likes dogs and likes cats,other,f,scorpio and it&rsquo;s fun to think about,no,english,single
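
The local-copy-or-download choice made in the loading cell can be written as a small helper. This is a sketch under assumptions: `resolve_source` is a hypothetical name, and it checks for a regular file rather than scanning `os.listdir()`.

```python
from pathlib import Path

def resolve_source(filename, fallback_url, directory="."):
    """Return the local path if `filename` exists in `directory`, else the URL."""
    local = Path(directory) / filename
    return str(local) if local.is_file() else fallback_url

url = resolve_source(
    "profiles.csv.zip",
    "https://github.com/rudeboybert/JSE_OkCupid/raw/master/profiles.csv.zip",
)
print(f"Getting data from {url}")
```

The returned string can be handed to `pd.read_csv(url, nrows=nrows_to_load)` either way, since pandas accepts both local paths and URLs and infers the zip compression from the file extension.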


We have some categorical / continuous variables about the OkCupid users:
* body type
* diet
* education
* religion
* sex
* astro sign
* preferences for alcohol, cigarettes, drugs
* sexual orientation
* relationship status
* language
* location


The users also had to answer some free-text questions (textual data):

* essay0- My self summary
* essay1- What I’m doing with my life
* essay2- I’m really good at
* essay3- The first thing people usually notice about me
* essay4- Favorite books, movies, show, music, and food
* essay5- The six things I could never do without
* essay6- I spend a lot of time thinking about
* essay7- On a typical Friday night I am
* essay8- The most private thing I am willing to admit
* essay9- You should message me if...

Source: https://github.com/rudeboybert/JSE_OkCupid/blob/master/okcupid_codebook.txt  


### Drop some columns

In [8]:
okc.columns

Index(['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'essay0',
       'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7',
       'essay8', 'essay9', 'ethnicity', 'height', 'income', 'job',
       'last_online', 'location', 'offspring', 'orientation', 'pets',
       'religion', 'sex', 'sign', 'smokes', 'speaks', 'status'],
      dtype='object')

In [9]:
okc = okc.drop(['essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay8'], axis=1)

In [10]:
okc = okc.rename(columns={"essay0": "myself_summary", "essay7": "typical_friday_night", "essay9": "message_me_if"})
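
The drop list and the rename mapping above encode the same decision twice (which essays to keep). As a small sketch, the drop list can be derived from the mapping so the two cannot drift apart; the variable names `keep`, `all_essays` and `to_drop` are illustrative only.

```python
# Mapping from the essay columns we keep to their descriptive names.
keep = {
    "essay0": "myself_summary",
    "essay7": "typical_friday_night",
    "essay9": "message_me_if",
}

# The dataset has essay0 ... essay9; everything not in `keep` is dropped.
all_essays = [f"essay{i}" for i in range(10)]
to_drop = [col for col in all_essays if col not in keep]
print(to_drop)
```

With pandas this would then read `okc.drop(columns=to_drop).rename(columns=keep)`.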

In [11]:
okc.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,myself_summary,typical_friday_night,message_me_if,ethnicity,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,trying to find someone to hang out with. i am ...,you want to be swept off your feet!<br />\nyou...,"asian, white",...,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,,,white,...,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...",viewing. listening. dancing. talking. drinking...,"you are bright, open, intense, silly, ironic, ...",,...,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,,you feel so inclined.,white,...,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,,,"asian, black, other",...,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single


## NLPeasy Pipeline

The first step is to construct the pipeline object:

In [12]:
pipeline = ne.Pipeline(index='okc-intro',
                       text_cols=['myself_summary', 'typical_friday_night', 'message_me_if'],
                       num_cols = ['age', 'height', 'income'],
                       tag_cols=['body_type', 'diet', 'drinks', 'drugs', 'education'],
                       elk=elk)

Now let's add stages in the pipeline!

NLPeasy pipelines can have as few or as many stages as you wish. Here we just use the following:
* `VaderSentiment` is a nice rule-based sentiment prediction for English.
* `SpacyEnrichment` uses the `spacy` package together with its language model - here English small (=sm) or medium (=md).
    We use it for a couple of enrichments, as you will see below.

There are more stages in NLPeasy you can use (e.g. RegexTag Extraction, Splitting) or you can define your own functions there.
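
To make the idea of "adding stages" concrete, here is a toy model of a pipeline that supports `+=`. It is explicitly not NLPeasy's implementation: `MiniPipeline` and `word_count` are hypothetical names, and rows are plain dicts instead of a DataFrame.

```python
class MiniPipeline:
    """Toy pipeline: an ordered list of stages applied to each row (a dict)."""

    def __init__(self):
        self.stages = []

    def __iadd__(self, stage):
        # `pipeline += stage` appends the stage and keeps the same object.
        self.stages.append(stage)
        return self

    def process(self, rows):
        results = []
        for row in rows:
            enriched = dict(row)  # copy so the input rows stay untouched
            for stage in self.stages:
                enriched.update(stage(enriched))
            results.append(enriched)
        return results


def word_count(column, out_column):
    """Hypothetical stage factory: count whitespace-separated tokens."""
    def stage(row):
        text = row.get(column) or ""
        return {out_column: len(text.split())}
    return stage


mini = MiniPipeline()
mini += word_count("myself_summary", "myself_summary_words")
print(mini.process([{"myself_summary": "i work in a library"}]))
```

Each stage reads some columns and returns new ones, which is the same input-column / output-column pattern the stages below follow.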

First, the pipeline should calculate the sentiment of the text columns; here we add it for `myself_summary`. The first parameter specifies the input column, the second one the name of the output column:

In [13]:
pipeline += ne.VaderSentiment('myself_summary', 'myself_summary_sentiment')
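
"Rule-based" means the score comes from a lexicon of rated words rather than a trained model. The toy scorer below only illustrates that idea: the five-word `LEXICON` and the function `toy_sentiment` are invented for this sketch, and the real VADER adds a much larger lexicon plus rules for negation, intensifiers and punctuation. The squashing step has the same shape as the normalisation VADER applies to its compound score.

```python
import math

# Hypothetical mini-lexicon; VADER ships a far larger, human-rated one.
LEXICON = {"love": 3.0, "fun": 2.0, "good": 2.0, "hate": -3.0, "boring": -1.5}

def toy_sentiment(text, alpha=15.0):
    """Sum the word valences and squash the total into (-1, 1)."""
    total = sum(LEXICON.get(word, 0.0) for word in text.lower().split())
    return total / math.sqrt(total * total + alpha)

for example in ["i love having fun", "i hate boring parties", "i work in a library"]:
    print(f"{toy_sentiment(example):+.4f}  {example}")
```

Texts without any lexicon word score exactly 0, and the score approaches but never reaches -1 or 1 as more rated words accumulate.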

Now we will add NLPeasy's spaCy enrichment.

For small environments (like binder), do it for just one column. There we also use only the small model, which does not provide vectors.

In [14]:
f"spacy_model={spacy_model}"

'spacy_model=en_core_web_md'

The spacy model takes ~30 secs to load:

In [15]:
pipeline += ne.SpacyEnrichment(nlp=spacy_model,
                               cols=['myself_summary'])

### Run the pipeline

spaCy needs some time to process: the ~60000 messages of the full dataset take about 100 minutes.

We therefore run the pipeline on a subset of the data only.

In [16]:
okc_enriched = pipeline.process(okc.head(100), write_elastic=True, batchsize=20)
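
How NLPeasy uses `batchsize` internally is not shown here; a plausible reading is that the rows are processed in chunks of that size. Under that assumption, the generic chunking pattern looks like this (`batches` is a hypothetical helper, not an NLPeasy function):

```python
from itertools import islice

def batches(rows, batchsize):
    """Yield successive lists of at most `batchsize` items from `rows`."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batchsize))
        if not batch:
            return
        yield batch

# 100 rows with batchsize=20, as in the cell above.
sizes = [len(batch) for batch in batches(range(100), 20)]
print(sizes)
```

Smaller batches keep memory usage low, which matters on binder, at the cost of more round trips.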



Create a Kibana dashboard for all the columns:

In [17]:
pipeline.create_kibana_dashboard()

Trying to do a histogram on str values: 20 - 46
Trying to do a histogram on str values: 60 - 78
Trying to do a histogram on str values: -1 - 1000000
Trying to do a histogram on str values: -0.5574 - 0.9995
okc-intro: adding index-pattern
okc-intro: adding search
okc-intro: adding visualisation for body_type
okc-intro: adding visualisation for diet
okc-intro: adding visualisation for drinks
okc-intro: adding visualisation for drugs
okc-intro: adding visualisation for education
okc-intro: adding visualisation for myself_summary_ents
okc-intro: adding visualisation for myself_summary_subj
okc-intro: adding visualisation for myself_summary_verb
okc-intro: adding visualisation for myself_summary
okc-intro: adding visualisation for typical_friday_night
okc-intro: adding visualisation for message_me_if
okc-intro: adding visualisation for age
okc-intro: adding visualisation for height
okc-intro: adding visualisation for income
okc-intro: adding visualisation for myself_summary_sentiment
okc-in

Open Kibana in a web browser:

In [18]:
util.display_kibana_link() if util.BINDER else elk.kibana

Oh, and are you wondering what took so long during processing and upload?

In [19]:
pipeline._tictoc.summary()

global / process: 2.7s
Stage 1 / VaderSentiment: 218ms
Stage 2 / SpacyEnrichment: 1.8s
SpacyEnrichment / spacy make iter: 36.0us
SpacyEnrichment / spacy iter: 1.5s
SpacyEnrichment / process spacy docs: 245ms
elastic / upload: 670ms
global / concat results: 14ms
global / min_max_calc: 1.8ms
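
A per-section timing summary like the one above can be collected with a small context manager. This is a sketch of the general technique, not the code behind `pipeline._tictoc`; the class `TicToc` and its `section` method are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class TicToc:
    """Accumulate wall-clock time per named section."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add the elapsed time even if the section raised an exception.
            self.totals[name] += time.perf_counter() - start

    def summary(self):
        return "\n".join(
            f"{name}: {seconds:.3f}s" for name, seconds in self.totals.items()
        )

timer = TicToc()
with timer.section("Stage 1 / toy work"):
    sum(i * i for i in range(10_000))
with timer.section("Stage 1 / toy work"):
    sum(i * i for i in range(10_000))
print(timer.summary())
```

Entering the same section twice adds up the durations, which is how repeated per-batch work can end up as a single line in a summary.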
