# Who: Andrei Olariu
## Stone Soup Technology
<br/>
<br/>


# What: Matching Journalists with Domain Experts
## Text Classification with BERT and Gradient Boosting Trees from Idea to Production

- intro on me and StoneSoup

- describing the client and his problem

- baseline approach

- Machine Learning approach

- discussion

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/sst.png" width="800vmin" style="padding: 0 0 0 0">
  </center>
</figure>

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/andrei.jpg" width="1200vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

that's me during the pandemic

phd in natural language processing

3 top 10 finishes in kaggle contests

at sst, involved in backend, api and ml work

now you have more context on me and the company I work at

unfortunately, I don't know a lot about you, so I had to prepare this workshop making some assumptions

if you have experience with something and find some things trivial, just breathe in, breathe out, enjoy the moment and bask in your awesomeness

don't wait till the end for questions

# Assumptions

- no/little knowledge of Python
- no/little experience building and deploying Machine Learning models

# Feel free to interrupt me and ask questions

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/guidedpr.png" width="1200vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

We have a database of members, with a short description for each

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/brad.png" width="1200vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

We have new requests coming in from journalists and we want to identify the best matching members

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/request1.png" width="1200vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

What do you think?

How can we approach this problem?

We need a baseline approach. It needs to be:

- easy to implement

- easy to explain/understand/debug

- predictable

- independent of training data

C'mon, let's hear some ideas

actively thinking about a solution will stimulate the creation of new neural paths inside your brain, making you better problem solvers

i swear I'm not moving forward until I get another idea

Our baseline approach:

- index the members' descriptions using a search engine (Solr)

- given a request, send it to the search engine as the query and get the highest ranking members

- **have a moderator review and correct the matches**

- send updated matches to the journalist

notice the third step (hard not to)

we don't trust the current approach - and rightly so, since we haven't evaluated it

we are also building a manually annotated dataset to train an ML model
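to make the second step of the baseline a bit more concrete, here's a tiny stand-in for the search engine; it is not Solr and not our production code, it just scores each member by how many request words show up in their description; the member descriptions, the request text and the `rank_members` helper are all made up for this sketch

```python
import re
from collections import Counter

# Hypothetical member descriptions (the real ones are indexed in Solr).
members = {
    1: "cardiologist with experience in heart disease and public health",
    2: "software engineer focused on machine learning and search engines",
    3: "nutritionist writing about diet, heart health and exercise",
}

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())

def rank_members(request, members, top_k=2):
    """Return the top_k member ids, ranked by word overlap with the request."""
    query = Counter(tokenize(request))
    scores = {}
    for member_id, description in members.items():
        words = Counter(tokenize(description))
        # Counter intersection keeps only the words both texts share.
        scores[member_id] = sum((query & words).values())
    # Highest score first; ties are broken by the smaller member id.
    ranked = sorted(scores, key=lambda m: (-scores[m], m))
    return ranked[:top_k]

request = "Looking for an expert on heart disease and diet"
top_matches = rank_members(request, members)
print(top_matches)  # prints [1, 3]
```

a real search engine does a lot more than this (for example, it gives rare words more weight than words like "and"), but the flow is the same: request in, ranked members out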

In [37]:
import pickle

with open('matches.pickle', 'rb') as f:
    matches = pickle.load(f)

let's load the matches in memory

please have a look, try to get a feel of the data we have

do a very short exploratory analysis session

In [38]:
len(matches)

39375

In [39]:
matches[0]

{'member_id': 113,
 'request_id': 28552,
 'mismatch': False,
 'auto_generated': True}

In [40]:
matches[-1]

{'member_id': 'Andrei',
 'request_id': 'Codiax',
 'mismatch': True,
 'auto_generated': False,
 'message': 'remove this from the matches list'}

In [41]:
matches = matches[:-1]

In [None]:
y = [] # the correct labels
baseline_predictions = []

# write code here...


# .. so that the asserts pass
assert len(y) == len(baseline_predictions) == len(matches)
assert sum(y) == 15739
assert sum(baseline_predictions) == 37010

write code to build the dataset

we'll start with the correct labels for our matches, as well as the baseline predictions

In [42]:
y = [] # the correct labels
baseline_predictions = []

# write code here...
for match in matches:
    if match['auto_generated']:
        baseline_predictions.append(1)
        if match['mismatch']:
            y.append(0)
        else:
            y.append(1)
    else:
        baseline_predictions.append(0)
        if match['mismatch']:
            raise Exception('this is not possible', match)
        else:
            y.append(1)

# .. so that the asserts pass
assert len(y) == len(baseline_predictions) == len(matches)
assert sum(y) == 15739
assert sum(baseline_predictions) == 37010

In [43]:
import numpy as np

y = np.array(y)
baseline_predictions = np.array(baseline_predictions)

sum(baseline_predictions == y) / len(y)

0.3396911667597907

let's compute the accuracy of our baseline approach

is this a good score or a bad score?

if doing pedestrian detection for autonomous driving, probably a bad score

but if you search for something on google and 3 of the first 10 results are what you were looking for, then it's probably a good score

in our case, we can ask for feedback from the moderators, since they're the ones that see these results

apart from that, for us this is just a number out of context

it will prove useful further on, as we can evaluate new models and improvements and compare them to existing models

# Word2vec

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/word2vec.png" width="1200vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>


technique for computing word embeddings - represent words as arrays

underneath, a neural network that would see a lot of phrases

the resulting arrays would show some interesting properties and relationships between words

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/word2vec2.svg" width="1300vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

king - man + woman ~= queen
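here's a toy version of that arithmetic; the 2D vectors below are hand-made for the example (one axis for "royalty", one for "gender"), they are not real word2vec output, which has hundreds of learned dimensions

```python
import math

# Hand-made 2D "embeddings": axis 0 ~ royalty, axis 1 ~ gender.
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman, computed one dimension at a time.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Look for the nearest word, leaving out the three words we started from.
candidates = {w: v for w, v in vectors.items()
              if w not in ("king", "man", "woman")}
closest = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(closest)  # prints queen
```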

very interesting technique with lots of applications, but not a breakthrough on the scale of what neural networks were doing for images at the time, where pretrained networks were being used everywhere

that breakthrough did come, though, in 2018

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/bert2.jpg" width="1200vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

we're no longer working with word embeddings, but with sentence embeddings

these new models are trained on huge datasets and then shared, so you can use them out of the box or fine-tune them on your own small dataset

they revolutionized the field of NLP and led to significant improvements on most problems

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/bert.png" width="1000vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>


## Install

```bash
pip install bert-serving-server  # server
pip install bert-serving-client  # client, independent of `bert-serving-server`
```

the server loads the pretrained network in memory

In [16]:
from bert_serving.client import BertClient

bc = BertClient()
results = bc.encode(['Lion is the king of the jungle',
    'The tiger hunts in this forest',
    'Everybody loves New York'])
results.shape

(3, 768)

you can see the embeddings for the first two sentences are more aligned with each other than with the third sentence

we can't share the dataset with the request content and member descriptions, but we can share the embeddings

In [17]:
print(np.mean(results[0] * results[1]))
print(np.mean(results[0] * results[2]))
print(np.mean(results[1] * results[2]))

0.18777914
0.13391979
0.12317723
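the cell above uses the mean of the element-wise product, which is just the dot product divided by 768; a common alternative is cosine similarity, which also divides by the lengths of the two vectors, so the score stays between -1 and 1 no matter how long the vectors are; a minimal sketch on made-up 3D vectors (not real BERT output), where `cosine_similarity` is a helper defined here and not part of bert-serving

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for the three sentence embeddings.
lion = [1.0, 2.0, 0.0]
tiger = [2.0, 4.0, 0.0]      # same direction as lion, twice as long
new_york = [0.0, 1.0, 3.0]   # points somewhere else

sim_lion_tiger = cosine_similarity(lion, tiger)
sim_lion_new_york = cosine_similarity(lion, new_york)
print(round(sim_lion_tiger, 3), round(sim_lion_new_york, 3))  # prints 1.0 0.283
```

with the numpy arrays from the notebook, the same thing would be `results[0] @ results[1] / (np.linalg.norm(results[0]) * np.linalg.norm(results[1]))`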


In [44]:
with open('embeddings.pickle', 'rb') as f:
    member_embeddings, request_embeddings = pickle.load(f)

In [58]:
from sklearn.model_selection import train_test_split

X = []
for match in matches:
    X.append(member_embeddings[match['member_id']] * \
        request_embeddings[match['request_id']])

X = np.array(X)

X_train, X_test, y_train, y_test, _, baseline_predictions_test = \
    train_test_split(X, y, baseline_predictions, test_size=0.05, random_state=42)

now that we have a training dataset, let's try to plug it into an algorithm and get some results

In [46]:
X_train.shape

(37405, 768)

In [47]:
X_test.shape

(1969, 768)

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/titanic.jpg" width="600vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/forest.png" width="700vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/gbt.png" width="1200vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

In [59]:
from xgboost import XGBClassifier

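model_v1 = XGBClassifier(
    n_estimators=300,  # number of boosted trees
    max_depth=7,       # maximum depth of each tree
    learning_rate=0.1,
    use_label_encoder=False,
    eval_metric='auc',
    # train on the first GPU; XGBoost 2.x selects the GPU with
    # device='cuda' and tree_method='hist' instead of these two options
    gpu_id=0,
    tree_method='gpu_hist',
)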
model_v1.fit(X_train, y_train)
model_v1_predictions = model_v1.predict(X_test)

In [60]:
print('baseline:', sum(baseline_predictions_test == y_test)/len(y_test))
print('new model:', sum(model_v1_predictions == y_test)/len(y_test))

baseline: 0.3484002031488065
new model: 0.8207211782630777


we're using accuracy

great results

job done, go home and feel happy about it

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/acc.png" width="700vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

does anybody feel we might be missing something here?

this formula for accuracy is the general one; for binary classification we have the equivalent, but slightly more detailed:

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/acc2.png" width="700vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/confusion-matrix.png" width="800vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

did anybody figure out where i'm going with this?

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/acc2.png" width="700vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/confusion-matrix2.png" width="800vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

so it seems like we have no true negatives in our dataset

let's say we have 100 members and the algorithm returns only 5 matches; the moderator then adds another 5 members to the results; the remaining 90 members are the true negatives: neither the algorithm nor the human moderator selected them, but they are not in our dataset

we need to add true negatives to our dataset:
- for a better understanding of our algorithm's performance
- for training a better ML model
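to see why the true negative cell is empty, we can sort each record into its confusion matrix cell; the records below are made up, but they have the same `auto_generated` and `mismatch` fields as `matches`, and `confusion_cell` is a helper written just for this sketch

```python
from collections import Counter

# Made-up records with the same fields as `matches`.
toy_matches = [
    {'auto_generated': True,  'mismatch': False},  # search engine was right
    {'auto_generated': True,  'mismatch': True},   # moderator rejected it
    {'auto_generated': True,  'mismatch': True},   # moderator rejected it
    {'auto_generated': False, 'mismatch': False},  # moderator added it
]

def confusion_cell(match):
    """Map one record to TP, FP, FN or TN."""
    predicted = 1 if match['auto_generated'] else 0  # baseline prediction
    actual = 0 if match['mismatch'] else 1           # correct label
    cells = {(1, 1): 'TP', (1, 0): 'FP', (0, 1): 'FN', (0, 0): 'TN'}
    return cells[(predicted, actual)]

counts = Counter(confusion_cell(m) for m in toy_matches)
print(counts['TP'], counts['FP'], counts['FN'], counts['TN'])  # prints 1 2 1 0
```

a pair only gets into the dataset if the search engine returned it or a moderator added it, so the combination "predicted 0, actual 0" never shows up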



In [None]:
EXTEND_COUNT = 120000
# list of dictionaries {'member_id': int, 'request_id': int}
true_negatives = []

# generate 120000 random new matches ...


# ... such that these tests pass
assert len(true_negatives) == EXTEND_COUNT
true_negatives_set = {(tn['member_id'], tn['request_id']) for tn in true_negatives}
assert len(true_negatives_set) == EXTEND_COUNT

old_matches_set = {(d['member_id'], d['request_id']) for d in matches}
member_ids_set = {d['member_id'] for d in matches}
request_ids_set = {d['request_id'] for d in matches}

assert len(true_negatives_set.difference(old_matches_set)) == EXTEND_COUNT
for tn in true_negatives:
    assert tn['member_id'] in member_ids_set
    assert tn['request_id'] in request_ids_set

In [61]:
EXTEND_COUNT = 120000
# list of dictionaries {'member_id': int, 'request_id': int}
true_negatives = []

# generate 120000 random new matches ...
import random

member_ids = list({d['member_id'] for d in matches})
request_ids = list({d['request_id'] for d in matches})
all_matches_set = {(d['member_id'], d['request_id']) for d in matches}

while len(true_negatives) < EXTEND_COUNT:
    member_id = random.choice(member_ids)
    request_id = random.choice(request_ids)
    # skip pairs that are existing matches or were already sampled
    if (member_id, request_id) in all_matches_set:
        continue
    true_negatives.append({
        'member_id': member_id,
        'request_id': request_id,
    })
    all_matches_set.add((member_id, request_id))

# ... such that these tests pass
assert len(true_negatives) == EXTEND_COUNT
true_negatives_set = {(tn['member_id'], tn['request_id']) for tn in true_negatives}
assert len(true_negatives_set) == EXTEND_COUNT

old_matches_set = {(d['member_id'], d['request_id']) for d in matches}
member_ids_set = {d['member_id'] for d in matches}
request_ids_set = {d['request_id'] for d in matches}

assert len(true_negatives_set.difference(old_matches_set)) == EXTEND_COUNT
for tn in true_negatives:
    assert tn['member_id'] in member_ids_set
    assert tn['request_id'] in request_ids_set

In [None]:
X_tn = []
y_tn = []
baseline_predictions_tn = []

# generate a dataset solely for these new true negatives...


# .. so that the asserts pass
assert len(X_tn) == len(y_tn) == len(baseline_predictions_tn)

In [62]:
X_tn = []
y_tn = []
baseline_predictions_tn = []

# generate a dataset solely for these new true negatives...
for match in true_negatives:
    X_tn.append(member_embeddings[match['member_id']] * \
        request_embeddings[match['request_id']])
    y_tn.append(0)
    baseline_predictions_tn.append(0)

# .. so that the assert passes
assert len(X_tn) == len(y_tn) == len(baseline_predictions_tn)

In [None]:
# split this new dataset into train and test...

# ... then extend the previous dataset...

# ... so that the asserts pass
assert X_train_extended.shape == (151405, 768)
assert X_test_extended.shape == (7969, 768)
assert len(y_train_extended) == 151405
assert len(y_test_extended) == 7969
assert len(baseline_predictions_test_extended) == 7969
assert sum(y_train_extended) == 14913
assert sum(y_test_extended) == 826
assert sum(baseline_predictions_test_extended) == 1829

In [63]:
# split this new dataset into train and test...
X_train_tn, X_test_tn, y_train_tn, y_test_tn, _, baseline_predictions_test_tn = \
    train_test_split(X_tn, y_tn, baseline_predictions_tn, test_size=0.05, random_state=42)

# ... then extend the previous dataset...
X_train_extended = np.vstack((X_train, X_train_tn))
X_test_extended = np.vstack((X_test, X_test_tn))
y_train_extended = np.hstack((y_train, y_train_tn))
y_test_extended = np.hstack((y_test, y_test_tn))
baseline_predictions_test_extended = np.hstack((baseline_predictions_test, baseline_predictions_test_tn))

# ... so that the asserts pass
assert X_train_extended.shape == (151405, 768)
assert X_test_extended.shape == (7969, 768)
assert len(y_train_extended) == 151405
assert len(y_test_extended) == 7969
assert len(baseline_predictions_test_extended) == 7969
assert sum(y_train_extended) == 14913
assert sum(y_test_extended) == 826
assert sum(baseline_predictions_test_extended) == 1829

In [None]:
from xgboost import XGBClassifier

# same hyperparameters as model_v1
model_v2 = XGBClassifier(
    n_estimators=300,
    max_depth=7,
    learning_rate=0.1,
    use_label_encoder=False,
    eval_metric='auc',
    gpu_id=0,  # GPU enabled
    tree_method='gpu_hist',
)
model_v2.fit(X_train_extended, y_train_extended)

In [65]:
model_v1_predictions = model_v1.predict(X_test_extended)
model_v2_predictions = model_v2.predict(X_test_extended)

print('baseline:', sum(baseline_predictions_test_extended == y_test_extended)/len(y_test_extended))
print('model v1:', sum(model_v1_predictions == y_test_extended)/len(y_test_extended))
print('model v2:', sum(model_v2_predictions == y_test_extended)/len(y_test_extended))

baseline: 0.8390011293763333
model v1: 0.6987074915296775
model v2: 0.9116576734847535


we've been using accuracy, since we've approached this as a classification problem and we're working with classification models

but this is more of a ranking problem, so we should evaluate it accordingly

for example, when working with accuracy, each trained model needs a threshold that separates the two labels; usually 0.5

but for ranking, there is no such threshold; instead items are assigned a score by the model and then ranked based on the score
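a small sketch of the difference, using made-up scores for a single request; the member ids, the scores and the `TOP_K` value are assumptions for the example, not numbers from our system

```python
# Hypothetical model scores for one request: member id -> score.
scores = {101: 0.48, 102: 0.91, 103: 0.12, 104: 0.45, 105: 0.30}

# Classification view: keep every member at or above a fixed threshold.
THRESHOLD = 0.5
classified = [m for m, s in scores.items() if s >= THRESHOLD]

# Ranking view: sort by score and keep the best k, whatever the scores are.
TOP_K = 3
ranked = sorted(scores, key=scores.get, reverse=True)[:TOP_K]

print(classified)  # prints [102]
print(ranked)      # prints [102, 101, 104]
```

with the threshold, members 101 and 104 are dropped even though they are the next best candidates; with ranking, the journalist still gets three members, ordered from best to worst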

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/roc.jpg" width="900vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

let's look at this chart

i'm not going to explain it from a theoretical perspective - i don't think i'll do a good job

i'll just talk about interpreting it visually

the blue line on the diagonal is what a worthless model would look like - using a coin toss to decide whether matches are good

the better a model is, the more it will lean towards the upper left

can we have something towards the lower right? yes, a model that outputs the opposite of a good prediction; but you can just reverse it and get a good prediction, so in practice there's nothing below the diagonal

the metric we'll be using is called AUC - area under curve; and it's exactly that; worthless gets 0.5, perfect gets 1
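another way to read AUC, without the chart: it's the probability that a randomly picked good match gets a higher score than a randomly picked bad one, with ties counting as half; the sketch below computes it by brute force over all pairs; `pairwise_auc` is a helper written for this example (in the notebook we use `metrics.roc_auc_score` from scikit-learn) and the labels and scores are made up

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
print(auc)  # prints 0.75
```

three of the four positive/negative pairs are in the right order, so we get 0.75; a model that gives everyone the same score wins half of every pair and lands on 0.5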

In [66]:
from sklearn import metrics

y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
predictions = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
print('accuracy:', sum(predictions == y)/len(y))
print('auc:', metrics.roc_auc_score(y, predictions))

accuracy: 0.9
auc: 0.5


when comparing accuracy with auc, there are 2 big things to consider

first is performance on unbalanced datasets

imagine a covid test that always says you don't have the virus; if you apply the test to 100 people and only one has the virus, the accuracy of the test will be 99%, which 99% of the people will say is very good

In [67]:
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
predictions = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2])
print('auc:', metrics.roc_auc_score(y, predictions))

auc: 1.0


second difference - auc doesn't need 0s and 1s, it works with values in between; this is great if you're interested in the confidence or probability of a result; also great for ranking

In [68]:
model_v1_predictions = model_v1.predict_proba(X_test_extended)[:,1]
model_v2_predictions = model_v2.predict_proba(X_test_extended)[:,1]

print('baseline:', metrics.roc_auc_score(y_test_extended, baseline_predictions_test_extended))
print('model v1:', metrics.roc_auc_score(y_test_extended, model_v1_predictions))
print('model v2:', metrics.roc_auc_score(y_test_extended, model_v2_predictions))

baseline: 0.8352458374561322
model v1: 0.7896102247446577
model v2: 0.9018669626607467


## Epilogue

- the model has been in production for 2 years
- there were some questions in the beginning as to why someone was matched or someone wasn't matched; usually you can figure out from the text what's happening
- a year ago, the manual review step was removed, so matches are now being sent directly to members, asking them to comment on the request
- no other complaints since then

## What can we improve?

## Questions?