# NLPeasy Workshop - OKCupid (1/2)

In this demo you will use NLPeasy to quickly set up a Pandas-based pipeline, enhanced with ML methods and pre-trained models (e.g. word embeddings, sentiment analysis). The results are then saved in Elasticsearch, and a Kibana dashboard is generated automatically to explore the texts and results.

There are two ways to participate in the workshop:
- **mybinder.org:** No setup needed, only a web browser; however, only 1-2 GB of RAM are available
- **own laptop:** You need to install the Python environment and either Docker or Elasticsearch+Kibana

This is why we have two variants of the execution, a small one for Binder and a bigger one to use on your laptop:

In [1]:
import os
DOIT_SMALL = 'BINDER_PORT' in os.environ

You can override the Binder setting if you need to or want to try:

In [2]:
# We are on bigger machines for the workshop:
DOIT_SMALL = False
"DOIT_SMALL: {}".format(DOIT_SMALL)

'DOIT_SMALL: False'

In [3]:
if DOIT_SMALL:
    nrows_to_load, spacy_model, spacy_cols = 1000, 'en_core_web_sm', ['myself_summary']
else:
    nrows_to_load, spacy_model, spacy_cols = 10000, 'en_core_web_md', None
"nrows_to_load: {} spacy_model: {} spacy_cols: {}".format(nrows_to_load, spacy_model, spacy_cols)

'nrows_to_load: 10000 spacy_model: en_core_web_md spacy_cols: None'

The fully processed data can be downloaded later for analysis (1.3GB):
> <https://github.com/d-one/NLPeasy-workshop/releases/download/v0.1/okc_enriched.zip>

## Mybinder.org

Online version, so no installation is needed on your laptop - might even work on a laptop ;-)

> <https://mybinder.org/v2/gh/d-one/NLPeasy-workshop/master?urlpath=lab> [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/d-one/NLPeasy-workshop/master?urlpath=lab)

> With prepared-data: <https://mybinder.org/v2/gh/d-one/NLPeasy-workshop/elastic?urlpath=lab> [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/d-one/NLPeasy-workshop/elastic?urlpath=lab)

However, there is only 1-2 GB of RAM available, which is tight for our example. Also, if you lose your connection or close your laptop for 10 minutes, your session is lost.

## Installation on your own Laptop

For this example to work completely you need Python 3.6 or later installed.
You also need to install and start either

- **Docker** <https://www.docker.com/get-started>, direct download links for
    [Mac (DMG)](https://download.docker.com/mac/stable/Docker.dmg) and
    [Windows (exe)](https://download.docker.com/win/stable/Docker%20for%20Windows%20Installer.exe).
- **Elasticsearch** and **Kibana**:
<https://www.elastic.co/downloads/> or
<https://www.elastic.co/downloads/elasticsearch-oss> (pure Apache licensed version)

Ideally before the workshop, issue the following in a terminal or inside this notebook:

The last command downloads a spaCy model for the English language.
For the following you need at least its `md` (= medium) version, which includes word vectors.
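
To check up front whether a model is available, one option is to look for it as an importable package, since models installed through spaCy's download command are regular Python packages. This is only a minimal sketch; the helper name `model_installed` is made up for illustration:

```python
import importlib.util

def model_installed(name):
    """Return True if a top-level package called `name` can be imported."""
    return importlib.util.find_spec(name) is not None

# Report which of the two models used in this notebook are present.
for model in ("en_core_web_sm", "en_core_web_md"):
    print(model, "installed" if model_installed(model) else "missing")
```

If the medium model shows up as missing, the `DOIT_SMALL` variant with `en_core_web_sm` is the fallback.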

## Imports

Today we will analyse a dataset using the NLPeasy package. We will also need pandas and numpy.

In [4]:
import pandas as pd
import numpy as np
import nlpeasy as ne
import util

`util` is part of the workshop repository and helps with accessing the Kibana dashboard on binder.

## Connect to Elastic Stack

Connect to a running Elastic instance on your local machine. If none can be found, an open-source stack is started in Docker.

By specifying `mount_volume_prefix`, your Elastic data is saved on your filesystem.

In [5]:
elk = ne.connect_elastic(elk_version='7.10.2',
                         mount_volume_prefix='./elastic-data-live/')

'Elasticsearch already running'

> If it is started on Docker, it will pull the images (1.3GB) the first time!
> BTW, this function is not blocking, i.e. the servers might only become active a couple of seconds later.
> Setting `mount_volume_prefix="./elastic-data/"` keeps the data of Elastic in your
> filesystem, so the data survives container restarts.
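
Because the start-up is not blocking, it can help to poll until the server answers before writing any data. The following is a generic sketch using only the standard library, not part of NLPeasy; the helper names and the address `http://localhost:9200` (Elasticsearch's usual default) are assumptions:

```python
import time
import urllib.error
import urllib.request

def http_ok(url, timeout=2.0):
    """Return True if `url` answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error status, ...
        return False

def wait_until(probe, attempts=30, delay=2.0):
    """Call `probe()` until it returns True or `attempts` run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False

# Assumed default address of a local Elasticsearch; adjust if yours differs.
# ready = wait_until(lambda: http_ok("http://localhost:9200"))
```

Passing the probe as a function keeps the retry loop independent of the actual check, so the same loop could be reused for Kibana.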

On Binder you do not have access to its localhost. However, we set up some Binder-Jupyter magic so you can access it:

In [6]:
if util.BINDER:
    util.display_kibana_link()

## Read and Process Data

We use the 'OK Cupid' dataset in this workshop (the anonymized legal version of it).
Let's load the data and see what we have here.

In [7]:
url = "profiles.csv.zip" if "profiles.csv.zip" in os.listdir() else "https://github.com/rudeboybert/JSE_OkCupid/raw/master/profiles.csv.zip"
print(f"Getting data from {url}")
# read from `url` so that a local copy of the file is used when present
okc = pd.read_csv(url, nrows=nrows_to_load)
okc.head()

Getting data from profiles.csv.zip


Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,currently working as an international agent fo...,making people laugh.<br />\nranting about a go...,"the way i look. i am a six foot half asian, ha...",...,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,...,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,...,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,...,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at:<br />\nhttp://bag...,i smile a lot and my inquisitive nature,...,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single


We have some categorical / continuous variables about the OkCupid users:
* body type
* diet
* education
* religion
* sex
* astro sign
* preferences for alcohol, cigarettes, drugs
* sexual orientation
* relationship status
* language
* location


But they also needed to answer some questions (textual data):

* essay0: My self summary
* essay1: What I’m doing with my life
* essay2: I’m really good at
* essay3: The first thing people usually notice about me
* essay4: Favorite books, movies, show, music, and food
* essay5: The six things I could never do without
* essay6: I spend a lot of time thinking about
* essay7: On a typical Friday night I am
* essay8: The most private thing I am willing to admit
* essay9: You should message me if...

Source: https://github.com/rudeboybert/JSE_OkCupid/blob/master/okcupid_codebook.txt  


### Drop some columns

In [8]:
okc.columns

Index(['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'essay0',
       'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7',
       'essay8', 'essay9', 'ethnicity', 'height', 'income', 'job',
       'last_online', 'location', 'offspring', 'orientation', 'pets',
       'religion', 'sex', 'sign', 'smokes', 'speaks', 'status'],
      dtype='object')

In [9]:
okc = okc.drop(['essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay8'], axis=1)

In [10]:
okc.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay7,essay9,ethnicity,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,trying to find someone to hang out with. i am ...,you want to be swept off your feet!<br />\nyou...,"asian, white",...,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,,,white,...,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...",viewing. listening. dancing. talking. drinking...,"you are bright, open, intense, silly, ironic, ...",,...,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,,you feel so inclined.,white,...,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,,,"asian, black, other",...,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single


### Preprocessing

We want to do some cleanup on the data first to make it ready for analysis.
* Replace NaN values
* Remove HTML tags
* Height inch --> cm

In [11]:
#1: replace NaN with '', replace '\n' with ' ' 
okc = okc.replace(np.nan, '', regex=True)
okc = okc.replace('\n', ' ', regex=True)

#2: remove html tags
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

fixcols = okc.columns[okc.columns.str.startswith('essay')]
for col in fixcols:
    okc[col] = [BeautifulSoup(text).get_text() for text in okc[col] ]
    print(col + ' done.')

okc['offspring'] = [BeautifulSoup(text).get_text() for text in okc['offspring'] ]
okc['sign'] = [BeautifulSoup(text).get_text() for text in okc['sign'] ]

essay0 done.
essay7 done.
essay9 done.


In [12]:
#convert height in inches to cm
okc['height_cm'] = round(pd.to_numeric(okc['height'])*2.54, 0)
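
As a worked example of the conversion for single values, here is a plain-Python sketch (the notebook cell does the same thing vectorised with pandas; the helper name `inches_to_cm` is hypothetical):

```python
INCH_TO_CM = 2.54  # one inch is defined as exactly 2.54 cm

def inches_to_cm(height_in):
    """Convert inches to whole centimetres; empty or missing input gives None."""
    if height_in in ("", None):
        return None
    return round(float(height_in) * INCH_TO_CM)

print(inches_to_cm(70))    # 177.8 cm, rounded to 178
print(inches_to_cm("66"))  # 167.64 cm, rounded to 168
print(inches_to_cm(""))    # None
```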

### Add Lat/Long for location

Kibana does not contain mappings of place names to geographical coordinates.
We have provided a file that maps city names to coordinates so that we can plot them in Kibana.

In [13]:
cities = pd.read_csv('city_coordinates.csv', index_col=0)
okc = okc.merge(cities, how='left', left_on='location', right_on = 'city')

In [14]:
pd.set_option("display.max_columns", 500)

In [15]:
okc['geolocation'] = okc['lat'].astype(str) + ',' + okc['long'].astype(str) 
okc = okc.drop(columns = ['lat', 'long'])
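
Elasticsearch accepts a `geo_point` as a `"lat,lon"` string, which is the format built above. One thing to watch: after a left merge, locations without a match have missing coordinates, and `astype(str)` turns those into the string `'nan,nan'`. How NLPeasy treats such values is not covered here, so the following helper is only a sketch of one way to guard against them (the name `to_geopoint` is hypothetical):

```python
import math

def to_geopoint(lat, lon):
    """Return 'lat,lon', or None if either coordinate is missing."""
    for value in (lat, lon):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return None
    return f"{lat},{lon}"

print(to_geopoint(37.8708393, -122.2728639))    # 37.8708393,-122.2728639
print(to_geopoint(float("nan"), float("nan")))  # None
```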

### Feature Engineering

Let's construct some other features that could be interesting.
Some of the categorical features are too verbose, and we want to make them shorter.
E.g. from the pets column, we will summarise those that like dogs and those that have dogs into a 'likes_dogs' category.

We will do the same for the cat lovers.

In [16]:
# assign the result instead of calling fillna in place on a column selection
okc['pets'] = okc['pets'].fillna('')
# \b prevents a value like 'dislikes dogs' from matching 'likes dogs'
okc['likes_dogs'] = okc['pets'].str.contains(r'\b(?:likes|has) dogs').astype(int)
okc['likes_cats'] = okc['pets'].str.contains(r'\b(?:likes|has) cats').astype(int)

In [17]:
okc[['pets', 'likes_dogs', 'likes_cats']].head()

Unnamed: 0,pets,likes_dogs,likes_cats
0,likes dogs and likes cats,1,1
1,likes dogs and likes cats,1,1
2,has cats,0,1
3,likes cats,0,1
4,likes dogs and likes cats,1,1
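
The flag logic can also be sketched in plain Python for single values. This version uses a word boundary so that a value such as `'dislikes dogs'` is not counted as `'likes dogs'`; the helper name `pet_flag` is hypothetical:

```python
import re

def pet_flag(pets, animal):
    """1 if the free-text `pets` value says the person likes or has `animal`."""
    pattern = rf"\b(?:likes|has) {re.escape(animal)}\b"
    return int(re.search(pattern, pets) is not None)

for pets in ["likes dogs and likes cats", "has cats", "dislikes dogs and has cats", ""]:
    print(repr(pets), pet_flag(pets, "dogs"), pet_flag(pets, "cats"))
```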


Also, languages are stuffed into one column so let's take them apart.
And let's count how many languages they speak.

In [18]:
okc.columns

Index(['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'essay0',
       'essay7', 'essay9', 'ethnicity', 'height', 'income', 'job',
       'last_online', 'location', 'offspring', 'orientation', 'pets',
       'religion', 'sex', 'sign', 'smokes', 'speaks', 'status', 'height_cm',
       'city', 'geolocation', 'likes_dogs', 'likes_cats'],
      dtype='object')

In [19]:
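# first extract languages without modifiers, e.g. 'german (poorly)' -> 'german'
# regex=True is needed because newer pandas versions treat the pattern literally by default
okc['language'] = okc['speaks'].str.replace(r'\s*\([^)]*\)', '', regex=True).str.split(pat=r',\s*', expand=False)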
# now just split the speaks col and save it in place
okc['speaks'] = okc['speaks'].str.split(pat=r',\s*', expand=False)

In [20]:
okc['language_count'] = okc['speaks'].map(len)
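
For a single value, the two steps (drop the modifiers in parentheses, then split on commas) look like this in plain Python. It is a sketch with a hypothetical helper name, `parse_languages`:

```python
import re

def parse_languages(speaks):
    """Split a `speaks` value and drop fluency modifiers such as '(poorly)'."""
    if not speaks:
        return []
    without_modifiers = re.sub(r"\s*\([^)]*\)", "", speaks)
    return re.split(r",\s*", without_modifiers)

langs = parse_languages("english (fluently), spanish (poorly), french (poorly)")
print(langs, len(langs))  # ['english', 'spanish', 'french'] 3
print(parse_languages("english, french, c++"))  # ['english', 'french', 'c++']
```

The sketch returns an empty list for an empty string. Splitting an empty string directly would give `['']`, which has length 1, so profiles without any language could end up with a count of 1.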

The columns 'religion' and 'sign' also indicate how much people care about the respective topic.
Let's break them into a clean religion and a clean sign column:

In [21]:
# isolate religion
okc[['religion_clean', 'religion_remainder']] = okc['religion'].str.split(pat='and|but', expand=True)
# isolate astro sign
okc[['sign_clean', 'sign_remainder']] = okc['sign'].str.split(pat='and|but', expand=True)

In [22]:
okc['sign_clean'] = okc['sign_clean'].str.strip()
okc['sign_clean'] = okc['sign_clean'].replace('', np.nan)
okc['sign_clean'].unique()

array(['gemini', 'cancer', 'pisces', 'aquarius', 'taurus', 'virgo',
       'sagittarius', 'leo', nan, 'aries', 'libra', 'scorpio',
       'capricorn'], dtype=object)

In [23]:
okc['religion_clean'] = okc['religion_clean'].str.strip()
okc['religion_clean'] = okc['religion_clean'].replace('', np.nan)
okc['religion_clean'].unique()

array(['agnosticism', nan, 'atheism', 'christianity', 'other',
       'catholicism', 'buddhism', 'judaism', 'hinduism', 'islam'],
      dtype=object)
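
A slightly more defensive variant of this split, shown here in plain Python for single values, requires whitespace around `and`/`but` and splits at most once, so a word that merely contains those letters cannot produce extra parts. The helper name is hypothetical and this is a sketch, not the notebook's method:

```python
import re

SEPARATOR = re.compile(r"\s+(?:and|but)\s+")

def split_value_and_attitude(text):
    """Split e.g. 'agnosticism but not too serious about it' into two parts."""
    parts = SEPARATOR.split(text.strip(), maxsplit=1)
    value = parts[0] or None
    attitude = parts[1] if len(parts) > 1 else None
    return value, attitude

print(split_value_and_attitude("agnosticism and very serious about it"))
# ('agnosticism', 'very serious about it')
print(split_value_and_attitude("pisces"))  # ('pisces', None)
print(split_value_and_attitude(""))        # (None, None)
```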

Let's do the same with ethnicity, to get it separated:

In [24]:
okc['ethnicity'] = okc['ethnicity'].str.split(pat=r',\s*', expand=False)

In [25]:
okc = okc.rename(columns={"essay0": "myself_summary", "essay7": "typical_friday_night", "essay9": "message_me_if"})

In [26]:
okc

Unnamed: 0,age,body_type,diet,drinks,drugs,education,myself_summary,typical_friday_night,message_me_if,ethnicity,height,income,job,last_online,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status,height_cm,city,geolocation,likes_dogs,likes_cats,language,language_count,religion_clean,religion_remainder,sign_clean,sign_remainder
0,22,a little extra,strictly anything,socially,never,working on college/university,about me: i would love to think that i was so...,trying to find someone to hang out with. i am ...,you want to be swept off your feet! you are ti...,"[asian, white]",75,-1,transportation,2012-06-28-20-30,"south san francisco, california","doesn’t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,[english],single,190.0,"south san francisco, california","37.6549493,-122.40812509999999",1,1,[english],1,agnosticism,very serious about it,gemini,
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means. 1. i am ...,,,[white],70,80000,hospitality / travel,2012-06-29-21-41,"oakland, california","doesn’t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"[english (fluently), spanish (poorly), french ...",single,178.0,"oakland, california","37.804455700000005,-122.27135630000001",1,1,"[english, spanish, french]",3,agnosticism,not too serious about it,cancer,
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...",viewing. listening. dancing. talking. drinking...,"you are bright, open, intense, silly, ironic, ...",[],68,-1,,2012-06-27-09-10,"san francisco, california",,straight,has cats,,m,pisces but it doesn’t matter,no,"[english, french, c++]",available,173.0,"san francisco, california","37.779280799999995,-122.4192363",0,1,"[english, french, c++]",3,,,pisces,it doesn’t matter
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,,you feel so inclined.,[white],71,20000,student,2012-06-28-14-22,"berkeley, california",doesn’t want kids,straight,likes cats,,m,pisces,no,"[english, german (poorly)]",single,180.0,"berkeley, california","37.8708393,-122.27286389999999",0,1,"[english, german]",2,,,pisces,
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,,,"[asian, black, other]",66,-1,artistic / musical / writer,2012-06-27-21-26,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,[english],single,168.0,"san francisco, california","37.779280799999995,-122.4192363",1,1,[english],1,,,aquarius,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,24,curvy,anything,socially,never,graduated from two-year college,"hi my name is chantea and i am 23, i am a hair...",either home with my son or..... out parting...lol,ur not only about sex...cause im not...,[black],66,20000,other,2012-01-16-23-10,"burlingame, california",has a kid,straight,likes dogs and likes cats,christianity and very serious about it,f,libra but it doesn’t matter,when drinking,[english],single,168.0,"burlingame, california","37.5841026,-122.36608249999999",1,1,[english],1,christianity,very serious about it,libra,it doesn’t matter
9996,24,skinny,,socially,never,graduated from college/university,"i came to the states when i was 17, and i fini...",,,[asian],59,-1,,2012-06-30-02-24,"san francisco, california",doesn’t have kids,straight,,,f,virgo,no,"[english, chinese]",single,150.0,"san francisco, california","37.779280799999995,-122.4192363",0,0,"[english, chinese]",2,,,virgo,
9997,19,athletic,,rarely,never,,heey :) i am 19 and bi sexual . i have a lovin...,,"you are a down to earth girl , looking for som...","[pacific islander, white, other]",61,-1,,2012-03-08-13-08,"martinez, california",,bisexual,has dogs and has cats,,f,aquarius but it doesn’t matter,sometimes,[english],seeing someone,155.0,"martinez, california","38.0138934,-122.13386740000001",1,1,[english],1,,,aquarius,it doesn’t matter
9998,47,a little extra,,rarely,never,,after giving this much thought i have finally ...,"someone to go out with on occasion, instead of...",if you're into having fun. just taking life on...,[black],64,50000,transportation,2012-06-30-06-12,"emeryville, california",,straight,likes dogs and likes cats,other,f,scorpio and it’s fun to think about,no,[english],single,163.0,"emeryville, california","37.8314089,-122.28652659999999",1,1,[english],1,other,,scorpio,it’s fun to think about


All right, we should be all set now to start our analysis.
So let's dive into NLPeasy:

## NLPeasy Pipeline

The first step is to construct a pipeline object:

In [27]:
pipeline = ne.Pipeline(index='okc',
                       text_cols=['myself_summary', 'typical_friday_night', 'message_me_if'],
                       num_cols = ['age', 'height_cm', 'income', 'language_count'],
                       tag_cols=['body_type', 'diet', 'drinks', 'drugs', 'education', 'ethnicity', 'pets',
                                 'religion', 'sex', 'sign', 'smokes', 'speaks', 'status', 'religion_clean',
                                'sign_clean', 'language', 'likes_cats', 'likes_dogs'],
                       geopoint_cols = ['geolocation'], 
                       elk=elk)

The parameters here mean the following:
* `index` will be the name of the Elasticsearch index (something like a Database name).
* `text_cols` specifies which columns of the dataframe are textual.
* `num_cols` will be used for histograms in Kibana.
* `tag_cols` will be used for barplots in Kibana.
* `geopoint_cols` can be used to draw maps in Kibana.
* `elk` is the Elastic stack we connected to above.

Now let's add stages in the pipeline!

NLPeasy pipelines can have as few or as many stages as you wish. Here we just use the following:
* `VaderSentiment` is a nice rule-based sentiment prediction for English.
* `Spacy` uses the `spacy` package together with its language model - here English small (=sm) or medium (=md).
    We use it for a couple of enrichments, as you will see below.

There are more stages in NLPeasy you can use (e.g. RegexTag Extraction, Splitting) or you can define your own functions there.

First, the pipeline should calculate sentiments for all of the text columns. The first parameter specifies the input column, the second the name of the output column:

In [28]:
pipeline += ne.VaderSentiment('myself_summary', 'myself_summary_sentiment')
pipeline += ne.VaderSentiment('typical_friday_night', 'typical_friday_night_sentiment')
pipeline += ne.VaderSentiment('message_me_if', 'message_me_if_sentiment')

Now we will add NLPeasy's spaCy enrichment.

For small environments (like on Binder) we do it for just one column. In that case we also use only the small model, which does not provide vectors.

In [29]:
spacy_cols = spacy_cols or ['myself_summary', 'typical_friday_night', 'message_me_if']
use_spacy_vecs = not spacy_model.endswith('_sm')
"spacy_model={} use_spacy_vecs={} spacy_cols={}".format(spacy_model, use_spacy_vecs,spacy_cols)

"spacy_model=en_core_web_md use_spacy_vecs=True spacy_cols=['myself_summary', 'typical_friday_night', 'message_me_if']"

The spacy model takes ~30 secs to load:

In [30]:
pipeline += ne.SpacyEnrichment(nlp=spacy_model,
                               cols=spacy_cols,
                               vec=use_spacy_vecs)

###### Run the pipeline

spaCy needs some time to process, so the ~60000 messages take about 100 minutes to process.

We do this for a subset of the data as it will take too long otherwise.

In [31]:
okc_enriched = pipeline.process(okc.head(100), write_elastic=True, batchsize=20)



Create a Kibana dashboard of all the columns

In [32]:
pipeline.create_kibana_dashboard()

Trying to do a histogram on str values: 20 - 46
Trying to do a histogram on str values: 152.0 - 198.0
Trying to do a histogram on str values: -1 - 1000000
Trying to do a histogram on str values: 1 - 5
Trying to do a histogram on str values: -0.5574 - 0.9995
Trying to do a histogram on str values: -0.4767 - 0.9844
Trying to do a histogram on str values: -0.8944 - 0.9992
okc: adding index-pattern
okc: setting default index-pattern
okc: adding search
okc: adding visualisation for body_type
okc: adding visualisation for diet
okc: adding visualisation for drinks
okc: adding visualisation for drugs
okc: adding visualisation for education
okc: adding visualisation for ethnicity
okc: adding visualisation for pets
okc: adding visualisation for religion
okc: adding visualisation for sex
okc: adding visualisation for sign
okc: adding visualisation for smokes
okc: adding visualisation for speaks
okc: adding visualisation for status
okc: adding visualisation for religion_clean
okc: adding visualisa

Open Kibana in a web browser

In [33]:
util.display_kibana_link() if util.BINDER else elk.kibana

Oh, and are you wondering what took so long in the upload?

In [34]:
pipeline._tictoc.summary()

global / process: 0.0us
Stage 1 / VaderSentiment: 221ms
Stage 2 / VaderSentiment: 21ms
Stage 3 / VaderSentiment: 76ms
Stage 4 / SpacyEnrichment: 0.0us
SpacyEnrichment / spacy make iter: 148.0us
SpacyEnrichment / spacy iter: 1.9s
SpacyEnrichment / process spacy docs: 854ms
elastic / upload: 1.4s
global / concat results: 31ms
global / min_max_calc: 2.9ms
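
A per-stage timing summary like this can be collected with a small context manager. The following is a generic illustration of the idea and not NLPeasy's `_tictoc` implementation; the class `StageTimer` and the stage names are made up:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StageTimer:
    """Accumulates wall-clock time per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def measure(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add the elapsed time even if the stage raised an exception.
            self.totals[name] += time.perf_counter() - start

    def summary(self):
        return "\n".join(f"{name}: {secs * 1000:.1f}ms"
                         for name, secs in self.totals.items())

timer = StageTimer()
with timer.measure("tokenise"):
    tokens = "hey how's it going".split()
with timer.measure("count"):
    n_tokens = len(tokens)
print(timer.summary())
```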


## Shutdown

Save the data

In [35]:
filename = 'okc_enriched_demo.pickle'
okc_enriched.to_pickle(filename)

If Elastic was started on Docker and you want to shut down the servers, issue:

> *Warning*: If you didn't use a `mount_volume_prefix` when you started the servers, all the data in Elastic and Kibana will be lost!