# NLPeasy Demo - OKCupid
# Applied Machine Learning Days, Lausanne - 25 January 2020

In this demo you will use NLPeasy to quickly set up a Pandas-based pipeline, enhanced with ML methods and pre-trained models (e.g. word embeddings, sentiment analysis). The results are then saved in Elasticsearch, and a Kibana dashboard is generated automatically to explore the texts and results.

For the workshop you have two ways to participate:
- **mybinder.org:** No setup needed, only a web browser; however, only 1-2 GB RAM are available
- **own laptop:** You need to install the Python environment and either Docker or Elasticsearch+Kibana

This is why we have two variants of the execution, a small one for binder and a bigger one to use on your laptop:

In [None]:
import os
DOIT_SMALL = 'BINDER_PORT' in os.environ

You can overwrite the BINDER setting if you need to or want to try:

In [None]:
# DOIT_SMALL = True
"DOIT_SMALL: {}".format(DOIT_SMALL)

In [None]:
if DOIT_SMALL:
    nrows_to_load, spacy_model, spacy_cols = 1000, 'en_core_web_sm', ['essay0']
else:
    nrows_to_load, spacy_model, spacy_cols = None, 'en_core_web_md', None
"nrows_to_load: {} spacy_model: {} spacy_cols: {}".format(nrows_to_load, spacy_model, spacy_cols)

If you want to analyse the fully processed data later, you can download it (1.3GB):
> <https://github.com/d-one/NLPeasy-workshop/releases/download/v0.1/okc_enriched.zip>

## Mybinder.org

Online version, so no installation is needed on your laptop - might even work on a laptop ;-)

> <https://mybinder.org/v2/gh/d-one/NLPeasy-workshop/master?urlpath=lab> [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/d-one/NLPeasy-workshop/master?urlpath=lab)

> With prepared-data: <https://mybinder.org/v2/gh/d-one/NLPeasy-workshop/elastic?urlpath=lab> [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/d-one/NLPeasy-workshop/elastic?urlpath=lab)

However, there is only 1-2 GB of RAM available, which is tough for our example. Also, if you lose your connection or close your laptop for 10 minutes, your session is lost.

## Installation on your own Laptop

For this example to work completely you need Python 3.6 or later installed.
You also need to have installed and started either

- **Docker** <https://www.docker.com/get-started>, direct download links for
    [Mac (DMG)](https://download.docker.com/mac/stable/Docker.dmg) and
    [Windows (exe)](https://download.docker.com/win/stable/Docker%20for%20Windows%20Installer.exe).
- **Elasticsearch** and **Kibana**:
<https://www.elastic.co/downloads/> or
<https://www.elastic.co/downloads/elasticsearch-oss> (pure Apache licensed version)

Ideally before the workshop, issue the following in a terminal or inside this notebook:

The last command downloads a spaCy model for the English language -
for the following you need at least its `md` (=medium) version, which has word vectors.

## Imports

Today we will analyse a dataset using the NLPeasy package. We will also need pandas and numpy.

In [None]:
import pandas as pd
import numpy as np
import nlpeasy as ne
import util

`util` is part of the workshop repository and helps with accessing the Kibana dashboard on binder.

## Connect to Elastic Stack

Connect to a running Elastic instance on your local machine. If none can be found, an open source stack is started on your Docker.

By specifying `mount_volume_prefix`, your Elastic data is saved to your filesystem.

In [None]:
elk = ne.connect_elastic(elk_version='7.5.2',
                         mount_volume_prefix='./elastic-data-all/')

> If it is started on Docker, the first run will pull the images (1.3GB)!
> BTW, this function is not blocking, i.e. the servers might only become active a couple of seconds later.
> Setting `mount_volume_prefix="./elastic-data/"` keeps Elastic's data in your
> filesystem, so that the data survives container restarts.
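
Because the call returns before the servers are ready, a cell that talks to Elasticsearch right away may fail. The following is only a sketch of a polling helper, not part of NLPeasy: `elastic_is_up` and `wait_until` are hypothetical names, and the URL `http://localhost:9200` is an assumption (Elasticsearch's usual default port).

```python
import time
import urllib.error
import urllib.request


def elastic_is_up(url="http://localhost:9200", timeout=2.0):
    """Return True if an HTTP GET on `url` answers with status 200.

    The default URL is an assumption, not something NLPeasy guarantees.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error: treat as "not up yet".
        return False


def wait_until(probe, max_wait=120.0, interval=2.0, sleep=time.sleep):
    """Call `probe()` repeatedly until it returns True.

    Returns True on success and False once `max_wait` seconds have passed.
    """
    deadline = time.monotonic() + max_wait
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

In the notebook you could then call `wait_until(elastic_is_up)` before running the pipeline.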

On binder you do not have access to its localhost. However, we set up some binder-jupyter magic so you can access it:

In [None]:
if util.BINDER:
   util.display_kibana_link()

## Read and Process Data

We use the 'OK Cupid' dataset in this workshop (the anonymized legal version of it).
Let's load the data and see what we have here.

In [None]:
okc = pd.read_csv('https://github.com/rudeboybert/JSE_OkCupid/raw/master/profiles.csv.zip', nrows=nrows_to_load)
okc.head()

We have some categorical / continuous variables about the OK Cupid customers:
* body type
* diet
* education
* religion
* sex
* astro sign
* preferences for alcohol, cigarettes, drugs
* sexual orientation
* relationship status
* language
* location


But they also needed to answer some questions (textual data):

* essay0- My self summary
* essay1- What I’m doing with my life
* essay2- I’m really good at
* essay3- The first thing people usually notice about me
* essay4- Favorite books, movies, show, music, and food
* essay5- The six things I could never do without
* essay6- I spend a lot of time thinking about
* essay7- On a typical Friday night I am
* essay8- The most private thing I am willing to admit
* essay9- You should message me if...

Source: https://github.com/rudeboybert/JSE_OkCupid/blob/master/okcupid_codebook.txt  


### Preprocessing

We want to do some cleanup on the data first to make it ready for analysis.
* Replace NaN values
* Remove HTML tags
* Height inch --> cm

In [None]:
#1: replace NaN with '', replace '\n' with ' ' 
okc = okc.replace(np.nan, '', regex=True)
okc = okc.replace('\n', ' ', regex=True)

#2: remove html tags
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

fixcols = okc.columns[okc.columns.str.startswith('essay')]
for col in fixcols:
    okc[col] = [BeautifulSoup(text).get_text() for text in okc[col] ]
    print(col + ' done.')

okc['offspring'] = [BeautifulSoup(text).get_text() for text in okc['offspring'] ]
okc['sign'] = [BeautifulSoup(text).get_text() for text in okc['sign'] ]

In [None]:
#convert height in inches to cm
okc['height_cm'] = round(pd.to_numeric(okc['height'])*2.54, 0)

### Add Lat/Long for location

Kibana does not contain mappings from place names to geographical coordinates.
We have provided a file that maps city names to coordinates so that we can plot them in Kibana.

In [None]:
cities = pd.read_csv('city_coordinates.csv')
okc = okc.merge(cities, how='left', left_on='location', right_on = 'city')
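
A left merge keeps every profile, but a location that is missing from the coordinates file ends up with empty `lat`/`long` values. A small sketch of how such rows could be spotted with `indicator=True`; the two toy frames are made-up stand-ins for `okc` and `city_coordinates.csv`, and only the column names (`location`, `city`, `lat`, `long`) come from the code above.

```python
import pandas as pd

# Made-up stand-ins for the real data frames.
profiles = pd.DataFrame({"location": ["san francisco, california",
                                      "oakland, california",
                                      "atlantis, nowhere"]})
cities = pd.DataFrame({"city": ["san francisco, california", "oakland, california"],
                       "lat": [37.77, 37.80],
                       "long": [-122.42, -122.27]})

# indicator=True adds a '_merge' column telling where each row came from.
merged = profiles.merge(cities, how="left", left_on="location",
                        right_on="city", indicator=True)

# 'left_only' marks profiles whose location has no coordinates.
unmatched = merged.loc[merged["_merge"] == "left_only", "location"].tolist()
print(unmatched)
```

Rows listed here would show up without a usable position on a Kibana map.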

In [None]:
pd.set_option("display.max_columns", 500)

In [None]:
okc['all_free_text'] =   \
        okc['essay0'] + '\n' + \
        okc['essay1'] + '\n' + \
        okc['essay2'] + '\n' + \
        okc['essay3'] + '\n' + \
        okc['essay4'] + '\n' + \
        okc['essay5'] + '\n' + \
        okc['essay6'] + '\n' + \
        okc['essay7'] + '\n' + \
        okc['essay8'] + '\n' + \
        okc['essay9']

okc['geolocation'] = okc['lat'].astype(str) + ',' + okc['long'].astype(str) 
okc = okc.drop(columns = ['lat', 'long'])

### Feature Engineering

Let's construct some other features that could be interesting.
Some of the categorical features are too verbose, and we want to make them shorter.
E.g. from the pets column, we will summarise those that like dogs and those that have dogs into a 'likes_dogs' category.
We will do the same for the cat lovers.

In [None]:
# reassign instead of a chained inplace fillna (which warns in recent pandas)
okc['pets'] = okc['pets'].fillna('')
# \b keeps 'dislikes dogs' from matching 'likes dogs'
okc['likes_dogs'] = okc['pets'].str.contains(r'\b(?:likes|has) dogs').astype(int)
okc['likes_cats'] = okc['pets'].str.contains(r'\b(?:likes|has) cats').astype(int)
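
A note on matching pet preferences: the substring 'likes dogs' also occurs inside 'dislikes dogs', so a plain substring test would flag such profiles as dog lovers. The sketch below contrasts a naive substring check with a regular expression that uses a word boundary; the sample strings are illustrative values, and `LIKES_OR_HAS_DOGS` is a name introduced here.

```python
import re

# \b requires a word boundary, so 'dislikes dogs' no longer matches.
LIKES_OR_HAS_DOGS = re.compile(r"\b(?:likes|has) dogs\b")

samples = ["likes dogs and likes cats",
           "has dogs",
           "dislikes dogs and likes cats",
           ""]

for text in samples:
    naive = ("likes dogs" in text) or ("has dogs" in text)
    strict = bool(LIKES_OR_HAS_DOGS.search(text))
    print(f"{text!r:35} naive={naive} strict={strict}")
```

The same pattern can be passed to `Series.str.contains`, which treats its argument as a regular expression by default.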

In [None]:
okc[['pets', 'likes_dogs', 'likes_cats']].head()

In [None]:
okc.columns

Also, languages are stuffed into one column so let's take them apart.
And let's count how many languages they speak.

In [None]:
# split language column; ', ' avoids leading spaces, reindex pads to 5 columns if fewer occur
okc[['language0','language1','language2','language3','language4']] = okc['speaks'].str.split(pat=', ',expand=True).reindex(columns=range(5))

In [None]:
okc['count_languages'] = (okc[['language0', 'language1', 'language2','language3','language4']].notnull()).astype(int).sum(axis=1)
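
To see what the split and the count amount to for a single profile, here is a plain-Python sketch. It assumes `speaks` values shaped like `'english (fluently), spanish (poorly)'`; `parse_speaks` is a hypothetical helper, not part of the notebook or of NLPeasy. Note that it counts an empty string as zero languages.

```python
def parse_speaks(speaks):
    """Split a 'speaks' string into (language, level) pairs.

    The level is None when no '(...)' qualifier is given.
    """
    pairs = []
    for part in speaks.split(","):
        part = part.strip()
        if not part:
            continue  # empty input or stray comma
        name, _, rest = part.partition("(")
        level = rest.rstrip(") ").strip() or None
        pairs.append((name.strip(), level))
    return pairs


example = "english (fluently), spanish (poorly), french"
languages = parse_speaks(example)
print(languages)
print(len(languages))
```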

Columns 'religion' and 'sign' also indicate how much people care about the respective topic.
Let's break them into a clean religion and a clean sign column:

In [None]:
# isolate religion
okc[['religion_clean', 'religion_remainder']] = okc['religion'].str.split(pat='and|but',expand=True)
# isolate astro sign
okc[['sign_clean', 'sign_remainder']] = okc['sign'].str.split(pat='and|but',expand=True)

In [None]:
okc['sign_clean'] = okc['sign_clean'].str.strip()
okc['sign_clean'] = okc['sign_clean'].replace('', np.nan)
okc['sign_clean'].unique()

In [None]:
okc['religion_clean'] = okc['religion_clean'].str.strip()
okc['religion_clean'] = okc['religion_clean'].replace('', np.nan)
okc['religion_clean'].unique()

Let's do the same with ethnicity, to get it separated:

In [None]:
okc[['eth0','eth1','eth2','eth3','eth4','eth5','eth6','eth7','eth8']] = okc['ethnicity'].str.split(pat=', ',expand=True)

All right, we should be all set now to start our analysis.
So let's dive deep into NLPeasy:

## NLPeasy Pipeline

The first step is to construct the pipeline object:

In [None]:
pipeline = ne.Pipeline(index='okc',
                       text_cols=['all_free_text', 'essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7', 'essay8', 'essay9'],
                       num_cols = ['age', 'height_cm', 'income'],
                       tag_cols=['body_type', 'diet', 'drinks', 'drugs', 'education', 'ethnicity', 'pets',
                                 'religion', 'sex', 'sign', 'smokes', 'speaks', 'status', 'religion_clean',
                                'sign_clean', 'eth0', 'likes_cats', 'likes_dogs'],
                       geopoint_cols = ['geolocation'], 
                       elk=elk)

The parameters here mean the following:
* `index` will be the name of the Elasticsearch index (something like a database name).
* `text_cols` specifies which columns of the dataframe are textual.
* `num_cols` will be used for histograms in Kibana.
* `tag_cols` will be used for barplots in Kibana.
* `geopoint_cols` can be used to draw maps in Kibana.
* `elk` is the Elastic stack we connected to above.

Now let's add stages in the pipeline!

NLPeasy pipelines can have as few or as many stages as you wish. Here we just use the following:
* `VaderSentiment` is a nice rule-based sentiment prediction for English.
* `SpacyEnrichment` uses the `spacy` package together with its language model - here English small (=sm) or medium (=md).
    We use it for a couple of enrichments, as you will see below.

There are more stages in NLPeasy you can use (e.g. RegexTag Extraction, Splitting) or you can define your own functions there.

First the pipeline should calculate sentiments for all of the text columns. The first parameter specifies the input column, the second one the name of the output column:

In [None]:
pipeline += ne.VaderSentiment('all_free_text' , 'sentiment')
pipeline += ne.VaderSentiment('essay0', 'sentiment_summary')
pipeline += ne.VaderSentiment('essay1', 'sentiment_life')
pipeline += ne.VaderSentiment('essay2', 'sentiment_goodat')
pipeline += ne.VaderSentiment('essay3', 'sentiment_noticeabout')
pipeline += ne.VaderSentiment('essay4', 'sentiment_favorites')
pipeline += ne.VaderSentiment('essay5', 'sentiment_6things')
pipeline += ne.VaderSentiment('essay6', 'sentiment_thinkingabout')
pipeline += ne.VaderSentiment('essay7', 'sentiment_Fridaynight')
pipeline += ne.VaderSentiment('essay8', 'sentiment_privatething')
pipeline += ne.VaderSentiment('essay9', 'sentiment_messageme')
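
The eleven calls above follow one pattern, so the (input column, output column) pairs could also be generated from a mapping. This is only a sketch of that idea: `ESSAY_TOPICS` and `stage_args` are names introduced here, and the line that would add the stages is left as a comment because it needs the `pipeline` and `ne` objects from this notebook.

```python
# Topic suffix per essay column, copied from the calls above.
ESSAY_TOPICS = {
    "essay0": "summary",
    "essay1": "life",
    "essay2": "goodat",
    "essay3": "noticeabout",
    "essay4": "favorites",
    "essay5": "6things",
    "essay6": "thinkingabout",
    "essay7": "Fridaynight",
    "essay8": "privatething",
    "essay9": "messageme",
}

# One pair for the combined text, then one per essay.
stage_args = [("all_free_text", "sentiment")] + [
    (col, f"sentiment_{topic}") for col, topic in ESSAY_TOPICS.items()
]

for col, out in stage_args:
    print(col, "->", out)
    # In the notebook this would be: pipeline += ne.VaderSentiment(col, out)
```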

Now we will add NLPeasy's spaCy enrichment.

For small environments (like on binder) we do it for just one column. In that case we also only use the small model, which does not provide vectors.

In [None]:
spacy_cols = spacy_cols or ['all_free_text', 'essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7', 'essay8', 'essay9']
use_spacy_vecs = not spacy_model.endswith('_sm')
"spacy_model={} use_spacy_vecs={} spacy_cols={}".format(spacy_model, use_spacy_vecs,spacy_cols)

The spacy model takes ~30 secs to load:

In [None]:
pipeline += ne.SpacyEnrichment(nlp=spacy_model,
                               cols=spacy_cols,
                               vec=use_spacy_vecs)

### Run the pipeline

spaCy needs some time to process: the full ~60000 profiles take about 100 minutes.

We therefore do this for a subset of the data only.

In [None]:
okc_enriched = pipeline.process(okc.head(1000), write_elastic=True, batchsize=20)
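
As a rough back-of-the-envelope check of what the subset buys you, the sketch below scales the figure quoted above (about 100 minutes for ~60000 rows). It assumes that processing time grows linearly with the number of rows, which ignores model loading time and hardware differences; `estimate_minutes` is a hypothetical helper.

```python
def estimate_minutes(n_rows, ref_rows=60_000, ref_minutes=100.0):
    """Linear extrapolation from the reference figure (an assumption)."""
    return n_rows * ref_minutes / ref_rows


# 1000 rows, as used in the cell above.
print(round(estimate_minutes(1000), 1), "minutes")
```

Under that assumption, 1000 rows should finish in under two minutes.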

Create a Kibana dashboard of all the columns:

In [None]:
pipeline.create_kibana_dashboard()

Open Kibana in the web browser:

In [None]:
util.display_kibana_link() if util.BINDER else elk.kibana

Oh, and you wonder what took so long in the upload?

In [None]:
pipeline._tictoc.summary()

## Shutdown

Save the data

In [None]:
import datetime
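# compact timestamp without spaces or colons, so the filename is also valid on Windows
filename = '{:%Y%m%d-%H%M%S}_okc_enriched.pickle'.format(datetime.datetime.now())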
okc_enriched.to_pickle(filename)
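
To continue the analysis in a later session, the pickle can be read back with `pd.read_pickle`. A minimal round-trip sketch on a made-up two-column frame, written to a temporary directory so that it does not touch your workshop files:

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for okc_enriched.
frame = pd.DataFrame({"essay0": ["hello world"], "sentiment": [0.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "okc_enriched_demo.pickle")
    frame.to_pickle(path)            # save
    restored = pd.read_pickle(path)  # load again

print(restored.equals(frame))
```

Only load pickle files that you created yourself or that come from a source you trust, since unpickling can execute arbitrary code.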

If Elastic was started on Docker and you want to shut down the servers, issue:

> *Warning*: If you didn't use a `mount_volume_prefix` when you started the servers, all the data in Elastic and Kibana will be lost!