In [1]:
%load_ext autoreload
%autoreload 2

Starting with the necessary imports.

In [2]:
import random

import numpy as np
import pandas as pd

from utils import load_data, evalute_prf, display_errors
from models import USE4ZeroShotClassifier, HuggingFaceZeroShotClassifier

In [3]:
pd.set_option('display.max_colwidth', None)

Now let's load both models, starting with the Hugging Face model.

In [4]:
hface_model = HuggingFaceZeroShotClassifier()

Some layers from the model checkpoint at roberta-large-mnli were not used when initializing TFRobertaModel: ['classifier']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-large-mnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initialized from the model checkpoint at roberta-large-mnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.
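
The internals of `HuggingFaceZeroShotClassifier` are not shown in this notebook. The log above indicates that it loads the `roberta-large-mnli` checkpoint, so one plausible way to get similar behaviour is the `zero-shot-classification` pipeline from the `transformers` library. The sketch below is an assumption, not the author's implementation: the helper names `build_classifier` and `classify_name` and the hypothesis template are hypothetical.

```python
def build_classifier(model_name="roberta-large-mnli"):
    """Load a zero-shot pipeline (downloads the checkpoint on first use)."""
    # Imported lazily so classify_name can be exercised without transformers installed.
    from transformers import pipeline

    return pipeline("zero-shot-classification", model=model_name)


def classify_name(classifier, name, candidate_labels=("Female", "Male")):
    """Return the highest-scoring candidate label for a single name."""
    result = classifier(
        name,
        candidate_labels=list(candidate_labels),
        # Hypothetical template; the one used by the notebook's model is not shown.
        hypothesis_template="This name is {}.",
    )
    # The pipeline returns the labels sorted by descending score.
    return result["labels"][0]
```

With the model downloaded, a call would look like `classify_name(build_classifier(), "Sara")`. Building the pipeline once and reusing it avoids reloading the checkpoint for every name.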


Next, a simple baseline model that I developed on top of the Universal Sentence Encoder from TensorFlow Hub.

In [5]:
use4_model = USE4ZeroShotClassifier()

INFO:absl:Using /var/folders/_x/7nl19fwd3wd88k22c_njm4200000gn/T/tfhub_modules to cache modules.
INFO:absl:Downloading TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder/4'.
INFO:absl:Downloading https://tfhub.dev/google/universal-sentence-encoder/4: 90.00MB
INFO:absl:Downloading https://tfhub.dev/google/universal-sentence-encoder/4: 180.00MB
INFO:absl:Downloading https://tfhub.dev/google/universal-sentence-encoder/4: 270.00MB
INFO:absl:Downloading https://tfhub.dev/google/universal-sentence-encoder/4: 360.00MB
INFO:absl:Downloading https://tfhub.dev/google/universal-sentence-encoder/4: 450.00MB
INFO:absl:Downloading https://tfhub.dev/google/universal-sentence-encoder/4: 540.00MB
INFO:absl:Downloading https://tfhub.dev/google/universal-sentence-encoder/4: 630.00MB
INFO:absl:Downloading https://tfhub.dev/google/universal-sentence-encoder/4: 720.00MB
INFO:absl:Downloading https://tfhub.dev/google/universal-sentence-encoder/4: 810.00MB
INFO:absl:Downloading https://tfhub.d
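
How `USE4ZeroShotClassifier` uses the encoder is not shown here. A common way to build a zero-shot classifier on top of sentence embeddings is to embed both the input and the candidate labels and pick the label with the highest cosine similarity. The sketch below illustrates that idea under this assumption; `zero_shot_predict`, `toy_embed` and the two-dimensional toy vectors are hypothetical stand-ins for a real encoder.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between every row of `a` and every row of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def zero_shot_predict(embed, texts, labels):
    """Assign each text the label whose embedding is most similar to it."""
    text_vecs = np.asarray(embed(texts), dtype=float)
    label_vecs = np.asarray(embed(labels), dtype=float)
    sims = cosine_similarity(text_vecs, label_vecs)
    return [labels[i] for i in sims.argmax(axis=1)]


# Toy vectors standing in for real sentence embeddings (illustration only).
TOY_VECTORS = {
    "Female": [1.0, 0.0],
    "Male": [0.0, 1.0],
    "Sara": [0.9, 0.2],
    "Ren": [0.1, 0.8],
}


def toy_embed(texts):
    """Look up the toy vector for each text."""
    return [TOY_VECTORS[t] for t in texts]


print(zero_shot_predict(toy_embed, ["Sara", "Ren"], ["Female", "Male"]))
# → ['Female', 'Male']
```

With the real encoder, `embed` could be the module returned by `tensorflow_hub.load` for the URL in the log above, which maps a list of strings to a matrix of embeddings.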

Next, we load the pre-processed data (originally sourced from Wikipedia).

In [6]:
female_dict, male_dict = load_data()

Suppose we want to measure the performance of our models on the pooled gender data, i.e. ignoring region information. We build a dict of dicts where:
- the outer dict has two keys, `('All Regions', 'Female')` and `('All Regions', 'Male')`, which are the group names
- each inner dict contains the true label (`Female` or `Male`) and the items (i.e. first names) to be classified

In [7]:
pooled_data = dict()

pooled_data[('All Regions', 'Female')] = dict()
pooled_data[('All Regions', 'Male')] = dict()

pooled_data[('All Regions', 'Female')]['True Label'] = 'Female'
pooled_data[('All Regions', 'Male')]['True Label'] = 'Male'

pooled_data[('All Regions', 'Female')]['Items'] = list()
pooled_data[('All Regions', 'Male')]['Items'] = list()


# Pool the names from every region into a single list per gender.
for names in female_dict.values():
    pooled_data[('All Regions', 'Female')]['Items'].extend(names)

for names in male_dict.values():
    pooled_data[('All Regions', 'Male')]['Items'].extend(names)

In [8]:
pooled_data[('All Regions', 'Female')]['Items'][:5]

['Fatma', 'Karima', 'Fatiha', 'Sara', 'Fatima']

Now let's run both models on the pooled data.

In [9]:
hface_predictions = hface_model.predict_gender(inputs=pooled_data)

In [10]:
use4_predictions = use4_model.predict_gender(inputs=pooled_data)

The `predict_gender` function adds a new `Predictions` entry to each inner dict, holding the model's prediction for each item.

In [11]:
use4_predictions[('All Regions', 'Female')].keys()

dict_keys(['True Label', 'Items', 'Predictions'])

We can use this object to compute metrics such as precision, recall and F1 score.

In [12]:
evalute_prf(predictions=hface_predictions, labels=['Female', 'Male'])

| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Female | 0.881383 | 0.954082 | 0.916292 | 1176 |
| Male | 0.950685 | 0.873322 | 0.910363 | 1192 |


In [13]:
evalute_prf(predictions=use4_predictions, labels=['Female', 'Male'])

| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Female | 0.77 | 0.85119 | 0.808562 | 1176 |
| Male | 0.836142 | 0.749161 | 0.790265 | 1192 |


The Hugging Face model reaches an F1 score of roughly 0.91 on both classes, compared with roughly 0.80 for the USE4 baseline.
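
The implementation of `evalute_prf` is not shown in this notebook. As a rough sketch, per-class precision, recall and F1 could be computed from the nested prediction dict as follows, assuming `Predictions` is a list of label strings, one per item in the group. The function name `precision_recall_f1` and the toy data are hypothetical.

```python
from collections import Counter


def precision_recall_f1(predictions, labels):
    """Compute per-label precision, recall, F1 and support from grouped predictions."""
    true_pos, predicted, actual = Counter(), Counter(), Counter()
    for group in predictions.values():
        true_label = group["True Label"]
        for pred in group["Predictions"]:
            actual[true_label] += 1   # support of the true class
            predicted[pred] += 1      # how often each label was predicted
            if pred == true_label:
                true_pos[true_label] += 1

    scores = {}
    for label in labels:
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label] if actual[label] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = {
            "Precision": precision,
            "Recall": recall,
            "F1": f1,
            "Support": actual[label],
        }
    return scores


# Toy input with placeholder items (not real model output).
toy_predictions = {
    ("All Regions", "Female"): {
        "True Label": "Female",
        "Items": ["name_1", "name_2", "name_3", "name_4"],
        "Predictions": ["Female", "Female", "Female", "Male"],
    },
    ("All Regions", "Male"): {
        "True Label": "Male",
        "Items": ["name_5", "name_6", "name_7", "name_8"],
        "Predictions": ["Male", "Male", "Female", "Female"],
    },
}

scores = precision_recall_f1(toy_predictions, labels=["Female", "Male"])
print(scores["Female"])
```

F1 is the harmonic mean of precision and recall, which is consistent with the tables above: for the Hugging Face model on `Female`, 2 × 0.881383 × 0.954082 / (0.881383 + 0.954082) ≈ 0.9163.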

## Zero-shot location classification

Alternatively, we can evaluate the model per region and display the errors in each case. Let's build the appropriate input.

In [21]:
per_region_data = dict()

# Each (region, gender) group uses the region as its true label.
for gender, case_dict in zip(['Female', 'Male'], [female_dict, male_dict]):
    for region, names in case_dict.items():
        per_region_data[(region, gender)] = dict()
        per_region_data[(region, gender)]['True Label'] = region
        per_region_data[(region, gender)]['Items'] = names


In [22]:
per_region_data[('Italy', 'Female')]

{'True Label': 'Italy',
 'Items': {'Alice',
  'Anna',
  'Aurora',
  'Beatrice',
  'Emma',
  'Ginevra',
  'Giorgia',
  'Giulia',
  'Greta',
  'Sofia'}}

We can then pick a few (region, gender) combinations at random to evaluate.

In [29]:
COMBINATIONS = [
    ('Japan', 'Male'),
    ('Greece', 'Male'),
    ('China', 'Male')
]

In [32]:
location_predictions = hface_model.predict_location(inputs=per_region_data, combinations=COMBINATIONS)

In [33]:
display_errors(location_predictions)

| Region | Gender | correct | wrong | accuracy | True Label | errors (Text, Prediction) |
|---|---|---|---|---|---|---|
| Japan | Male | 19.0 | 2.0 | 0.904762 | Japan | (Hinata, China), (Ren, China) |
| China | Male | 6.0 | 4.0 | 0.6 | China | (Yong, Japan), (Jun, Japan), (Yi, Japan), (Jie, Japan) |
| Greece | Male | 10.0 | 0.0 | 1.0 | Greece | |


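The `accuracy` column above is the recall per class: every item in a group shares the same true label, so `correct / (correct + wrong)` is the fraction of that region's names the model recovered.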
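
The implementation of `display_errors` is not shown in this notebook. The sketch below shows one way the `correct`, `wrong`, `accuracy` and error columns could be derived from a prediction dict, assuming `Predictions` is aligned with the iteration order of `Items`. The function name `summarize_errors` and the toy data are hypothetical.

```python
def summarize_errors(predictions):
    """Count correct and wrong predictions per group and list the misclassified items."""
    summary = {}
    for group, data in predictions.items():
        true_label = data["True Label"]
        pairs = list(zip(data["Items"], data["Predictions"]))
        # Keep (item, prediction) for every item that missed the group's true label.
        errors = [(item, pred) for item, pred in pairs if pred != true_label]
        correct = len(pairs) - len(errors)
        summary[group] = {
            "correct": correct,
            "wrong": len(errors),
            "accuracy": correct / len(pairs) if pairs else 0.0,
            "errors": errors,
        }
    return summary


# Toy input with placeholder names and regions (not real model output).
toy_predictions = {
    ("Region A", "Male"): {
        "True Label": "Region A",
        "Items": ["name_1", "name_2", "name_3", "name_4", "name_5"],
        "Predictions": ["Region A", "Region B", "Region A", "Region A", "Region B"],
    },
}

summary = summarize_errors(toy_predictions)
print(summary[("Region A", "Male")])
```

Here the group has 3 correct and 2 wrong predictions, so its accuracy is 3 / 5 = 0.6, computed the same way as the `correct`, `wrong` and `accuracy` columns in the table above.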