# Who: Andrei Olariu
## Stone Soup Technology
<br/>
<br/>


# What: Matching Journalists with Domain Experts
## Text Classification with BERT and Gradient Boosting Trees from Idea to Production

- intro on me and StoneSoup

- describing the client and his problem

- baseline approach

- Machine Learning approach

- discussion

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/sst.png" width="800vmin" style="padding: 0 0 0 0">
  </center>
</figure>

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/andrei.jpg" width="1200vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

that's me during the pandemic

PhD in natural language processing

3 top 10 finishes in Kaggle contests

at sst, involved in backend, api and ml work

now you have more context on me and the company I work at

unfortunately, I don't know a lot about you, so I had to prepare this workshop making some assumptions

if you have experience with something and you find some things trivial, then just breathe in, breathe out, enjoy the moment and bask in your awesomeness

don't wait till the end for questions

# Assumptions

- no/little knowledge of Python
- no/little experience building and deploying Machine Learning models

# Feel free to interrupt me and ask questions

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/guidedpr.png" width="1200vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

We have a database of members, with a short description for each

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/brad.png" width="1200vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

We have new requests coming in from journalists and we want to identify the best matching members

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/request1.png" width="1200vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

What do you think?

How can we approach this problem?

We need a baseline approach. It needs to be:

- easy to implement

- easy to explain/understand/debug

- predictable

- independent of training data

C'mon, let's hear some ideas

actively thinking about a solution will stimulate the creation of new neural paths inside your brain, making you better problem solvers

i swear I'm not moving forward until I get another idea

Our baseline approach:

- index the members' descriptions using a search engine (Solr)

- given a request, send it to the search engine as the query and get the highest ranking members

- **have a moderator review and correct the matches**

- send updated matches to the journalist

notice the third step (hard not to)

we don't trust the current approach, and rightly so: we haven't evaluated it yet

we are also building a manually annotated dataset to train an ML model
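
The search step of the baseline can be sketched as building a request to Solr's `select` endpoint. This is a minimal sketch under assumptions: the host, the core name `members`, the field names `id` and `description`, and the example request text are all hypothetical, since the actual Solr setup is not shown in this workshop. The function only builds the URL, so nothing is sent over the network.

```python
from urllib.parse import urlencode


def build_solr_query_url(request_text,
                         base_url="http://localhost:8983/solr",  # hypothetical host
                         core="members",                         # hypothetical core name
                         rows=10):
    """Build a Solr select URL that uses the journalist's request as the query."""
    params = {
        "q": request_text,     # the request text is sent as the query
        "df": "description",   # default search field (hypothetical field name)
        "fl": "id,score",      # return only the member id and its relevance score
        "rows": rows,          # how many top-ranking members to fetch
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


# made-up request text, for illustration only
url = build_solr_query_url("Looking for a nutrition expert", rows=5)
print(url)
```

Free text from a journalist can contain characters that Solr's standard query parser treats as syntax (for example `:` or quotes), so a real integration would probably need to escape the query first.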

In [3]:
import pickle

with open('matches.pickle', 'rb') as f:
    matches = pickle.load(f)  # read the pickled list straight from the file object

In [4]:
len(matches)

39375

In [5]:
matches[0]

{'member_id': 113,
 'request_id': 28552,
 'mismatch': False,
 'auto_generated': True}

In [6]:
matches[-1]

{'member_id': 'Andrei',
 'request_id': 'Codiax',
 'mismatch': True,
 'auto_generated': False,
 'message': 'remove this from the matches list'}

In [7]:
matches = matches[:-1]

In [27]:
y = [] # the correct labels
baseline_predictions = []

# write code here...


# .. so that the asserts pass
assert len(y) == len(baseline_predictions) == len(matches)
assert sum(y) == 15739
assert sum(baseline_predictions) == 37010

In [8]:
y = [] # the correct labels
baseline_predictions = []

# write code here...
for match in matches:
    if match['auto_generated']:
        baseline_predictions.append(1)
        if match['mismatch']:
            y.append(0)
        else:
            y.append(1)
    else:
        baseline_predictions.append(0)
        if match['mismatch']:
            raise Exception('this is not possible', match)
        else:
            y.append(1)

# .. so that the asserts pass
assert len(y) == len(baseline_predictions) == len(matches)
assert sum(y) == 15739
assert sum(baseline_predictions) == 37010

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/acc.png" width="700vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

In [9]:
import numpy as np

y = np.array(y)
baseline_predictions = np.array(baseline_predictions)

sum(baseline_predictions == y) / len(y)

0.3396911667597907

is this a good score or a bad score?

if doing pedestrian detection for autonomous driving, probably a bad score

but if you search for something on Google and 3 of the first 10 results are what you were looking for, then it's probably a good score

in our case, we can ask for feedback from the moderators, since they're the ones that see these results

apart from that, for us this is just a number out of context

speaking of context, does anybody feel we might be missing something here?

this formula for accuracy is the general one; for binary classification there is an equivalent, slightly more detailed form:

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/acc2.png" width="700vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/confusion-matrix.png" width="800vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

did anybody figure out where i'm going with this?

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/acc2.png" width="700vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/confusion-matrix2.png" width="800vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>

so it seems like we have no true negatives in our dataset

let's say we have 100 members, the algorithm returns only 5 matches, and the moderator adds another 5; the remaining 90 members are the true negatives: neither the algorithm nor the human moderator selected them, so they never made it into our dataset

we need to add true negatives to our dataset:
- for a better understanding of our algorithm's performance
- for training an ML model
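
The four cells of the confusion matrix can be counted directly from the label and prediction lists. Below is a minimal sketch on made-up toy data; the helper `confusion_counts` and the toy lists are illustrative and not part of the workshop code. The toy data mimics the real dataset in one respect: every prediction of 0 comes from a moderator-added match, so its label is always 1.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for two equally long lists of 0/1 values."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted and correct
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted, but a mismatch
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed, added by a moderator
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly left out
    return tp, fp, fn, tn


# toy data: no row has both label 0 and prediction 0
toy_y = [1, 0, 0, 1, 1]
toy_pred = [1, 1, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(toy_y, toy_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(tp, fp, fn, tn, accuracy)
```

With no row where both the label and the prediction are 0, `tn` stays at 0 no matter how many rows the dataset has.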



In [None]:
EXTEND_COUNT = 40000
# list of dictionaries {'member_id': int, 'request_id': int}
true_negatives = []

# generate 40000 random new matches ...


# ... such that these tests pass
assert len(true_negatives) == EXTEND_COUNT
true_negatives_set = {(tn['member_id'], tn['request_id']) for tn in true_negatives}
assert len(true_negatives_set) == EXTEND_COUNT

old_matches_set = {(d['member_id'], d['request_id']) for d in matches}
member_ids_set = {d['member_id'] for d in matches}
request_ids_set = {d['request_id'] for d in matches}

assert len(true_negatives_set.difference(old_matches_set)) == EXTEND_COUNT
for tn in true_negatives:
    assert tn['member_id'] in member_ids_set
    assert tn['request_id'] in request_ids_set

In [16]:
EXTEND_COUNT = 40000
# list of dictionaries {'member_id': int, 'request_id': int}
true_negatives = []

# generate 40000 random new matches ...
import random

member_ids = list({d['member_id'] for d in matches})
request_ids = list({d['request_id'] for d in matches})
all_matches_set = {(d['member_id'], d['request_id']) for d in matches}

while len(true_negatives) < EXTEND_COUNT:
    member_id = random.choice(member_ids)
    request_id = random.choice(request_ids)
    # skip pairs that are real matches or that we have already sampled
    if (member_id, request_id) in all_matches_set:
        continue
    true_negatives.append({
        'member_id': member_id,
        'request_id': request_id,
    })
    all_matches_set.add((member_id, request_id))

# ... such that these tests pass
assert len(true_negatives) == EXTEND_COUNT
true_negatives_set = {(tn['member_id'], tn['request_id']) for tn in true_negatives}
assert len(true_negatives_set) == EXTEND_COUNT

old_matches_set = {(d['member_id'], d['request_id']) for d in matches}
member_ids_set = {d['member_id'] for d in matches}
request_ids_set = {d['request_id'] for d in matches}

assert len(true_negatives_set.difference(old_matches_set)) == EXTEND_COUNT
for tn in true_negatives:
    assert tn['member_id'] in member_ids_set
    assert tn['request_id'] in request_ids_set

In [13]:
extended_matches = matches + true_negatives

# update y and baseline_predictions with the new true negatives...


# .. so that the asserts pass
assert len(y) == len(baseline_predictions) == len(extended_matches)
assert sum(y) == 15739
assert sum(baseline_predictions) == 37010

In [17]:
extended_matches = matches + true_negatives

# update y and baseline_predictions with the new true negatives...
y = np.append(y, [0] * EXTEND_COUNT)
baseline_predictions = np.append(baseline_predictions, [0] * EXTEND_COUNT)

# .. so that the asserts pass
assert len(y) == len(baseline_predictions) == len(extended_matches)
assert sum(y) == 15739
assert sum(baseline_predictions) == 37010
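
A rough check of what the extension does to accuracy, using only numbers already printed in this notebook (39374 matches, accuracy 0.3396911667597907, 40000 added pairs). This is a sketch under one loud assumption: every randomly sampled pair really is a negative, so each added row has label 0 and prediction 0 and counts as correct. The helper `accuracy_after` is hypothetical.

```python
n_before = 39374                  # len(matches) after dropping the last entry
acc_before = 0.3396911667597907   # output of the accuracy cell
extend_count = 40000              # EXTEND_COUNT

# number of correct baseline predictions before adding true negatives
correct_before = round(acc_before * n_before)


def accuracy_after(n_negatives):
    """Accuracy once n_negatives assumed true negatives are appended."""
    # each added row has y == 0 and prediction == 0, so it is counted as correct
    return (correct_before + n_negatives) / (n_before + n_negatives)


for k in (0, extend_count, 10 * extend_count):
    print(k, accuracy_after(k))
```

The score rises with the number of negatives we choose to add, so accuracy on the extended dataset depends on `EXTEND_COUNT` as much as on the baseline itself.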

<figure style="display: table; margin: 0 auto">
  <center>
    <img src="images/bert.png" width="1000vmin" style="padding: 4vmin 0 0 0">
  </center>
</figure>


## Install

```bash
pip install bert-serving-server  # server
pip install bert-serving-client  # client, independent of `bert-serving-server`
```

In [20]:
from bert_serving.client import BertClient
bc = BertClient()
results = bc.encode(['First do it', 'then do it right', 'then do it better'])
results.shape

(3, 768)
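
Each text becomes one 768-dimensional vector. Here is a small sketch of how two such vectors might be combined into features for a (member, request) pair. It uses random stand-in vectors instead of real BERT output, the names `member_vec` and `request_vec` are hypothetical, and this is only one simple option, not necessarily the feature set used later in the workshop.

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-ins for two rows of bc.encode([...]): 768 dimensions each
member_vec, request_vec = rng.normal(size=(2, 768))

# element-wise product: one feature per embedding dimension
pair_features = member_vec * request_vec

# summing the element-wise product gives the dot product
dot = pair_features.sum()

# cosine similarity: dot product normalised by the vector lengths
cosine = dot / (np.linalg.norm(member_vec) * np.linalg.norm(request_vec))

print(pair_features.shape, cosine)
```

The element-wise product keeps all 768 dimensions, so a downstream classifier can weigh them individually, while the cosine similarity collapses the pair into a single number.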

In [21]:
results[0]

array([ 1.31864727e-01,  3.24041188e-01, -8.27044010e-01, -5.12046754e-01,
       -2.10917473e-01, -7.56367436e-03,  2.78215021e-01,  1.45472363e-01,
        3.16130877e-01, -8.04042339e-01,  1.51105180e-01, -8.90429616e-02,
       -2.58598894e-01, -2.78033942e-01, -9.10568118e-01,  1.96250051e-01,
       -4.78811204e-01, -2.19583943e-01, -1.24930609e-02,  5.57602704e-01,
       -1.76175982e-01,  2.76079953e-01, -1.14945874e-01,  4.04391944e-01,
        1.26200050e-01, -4.31549132e-01,  3.79727269e-03, -4.06851470e-01,
       -3.90977770e-01,  3.08501661e-01,  3.04518729e-01, -3.86588812e-01,
       -1.80198848e-01, -6.69611692e-02,  7.84349591e-02,  3.09790313e-01,
        4.17365730e-01,  2.92629272e-01, -5.05525947e-01,  1.26575202e-01,
       -4.70998287e-01, -2.99572051e-01, -1.80541933e-01, -5.99076271e-01,
        4.62537371e-02, -7.34317601e-01, -2.08782218e-02, -1.51368544e-01,
       -3.55325997e-01, -7.57688761e-01, -7.27522016e-01,  4.01857048e-01,
        1.07890710e-01,  

In [22]:
results[0] * results[1]

array([ 3.27994078e-02, -3.99684906e-02,  3.22000295e-01,  2.59024799e-02,
        5.97291924e-02, -4.99160924e-05,  1.98063795e-02,  7.61032477e-02,
        8.24118927e-02,  3.75977308e-01, -5.87234087e-03,  9.40696523e-03,
        5.86674549e-02,  1.54506311e-01,  7.41285682e-01,  9.34188068e-02,
        2.53797531e-01,  8.79169703e-02,  4.12853085e-04,  1.42778203e-01,
        6.06828332e-02,  8.12790617e-02,  3.85900103e-02,  4.97599468e-02,
        5.09935990e-02,  7.48353302e-02,  1.09241046e-04,  1.39142290e-01,
        1.63582548e-01, -2.39758641e-02,  9.89605337e-02,  9.30262581e-02,
        3.93940769e-02,  7.11570354e-03,  1.99927427e-02,  1.12612352e-01,
        1.05399601e-01,  6.46991059e-02,  4.01080549e-01, -2.21739504e-02,
        2.42837638e-01,  1.31357461e-01,  2.97974106e-02,  2.44869694e-01,
        9.19163576e-04,  3.15052181e-01,  6.50656223e-03, -2.03785133e-02,
       -1.41895851e-02,  5.85714400e-01,  6.19590640e-01,  5.16663492e-02,
        3.93970357e-03, -