<a href="https://colab.research.google.com/github/christianvedels/A_perfect_storm_replication/blob/main/OccCANINE_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OccCANINE

This notebook demonstrates the capabilities of OccCANINE. For any use at scale, we recommend downloading the entire [GitHub repository](https://github.com/christianvedels/OccCANINE). A natural starting point is the script [PREDICT_HISCOs.py](https://github.com/christianvedels/OccCANINE/blob/main/PREDICT_HISCOs.py).


# Setting up everything
We start by cloning the GitHub repository. This part takes a few minutes.

In [1]:
!rm -rf OccCANINE # Remove existing
!git clone https://github.com/christianvedels/OccCANINE

Cloning into 'OccCANINE'...
remote: Enumerating objects: 2512, done.
remote: Counting objects: 100% (458/458), done.
remote: Compressing objects: 100% (151/151), done.
remote: Total 2512 (delta 360), reused 338 (delta 307), pack-reused 2054
Receiving objects: 100% (2512/2512), 850.69 MiB | 39.79 MiB/s, done.
Resolving deltas: 100% (1308/1308), done.
Updating files: 100% (538/538), done.


We also need one additional package: `unidecode` is not preinstalled in Colab, so we install it with pip.

In [2]:
!pip install unidecode

Collecting unidecode
  Downloading Unidecode-1.3.8-py3-none-any.whl (235 kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.3.8


We set the working directory to the cloned `OccCANINE` folder.

In [3]:
import os
os.chdir('/content/OccCANINE')
print(os.getcwd())

/content/OccCANINE


We load the `OccCANINE` class, an all-in-one class that handles loading the underlying transformer model, prediction, and finetuning.

In [4]:
from OccCANINE.n103_Prediction_assets import OccCANINE # HISCO prediction class

We initialize the model with the class we just loaded.

In [17]:
model = OccCANINE() # This downloads and initializes the model from HuggingFace

# Example 1
*OccCANINE works pretty well out of the box*

In [18]:
model.predict(
    ["tailor of the finest suits", "local boiler maker", "train's fireman"],
    get_dict = True # Simple output
)

[[['79190', 0.81692904, 'Other Tailors and Dressmakers']],
 [['87350', 0.9993395, 'Boilersmith']],
 [['98330', 0.986886, 'Railway SteamEngine Fireman']]]
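
With `get_dict = True`, the output above is a nested list: one inner list per input string, each holding `[code, probability, description]` entries. As a minimal sketch (plain Python, with the printed output copied in by hand rather than produced by a live model call), it could be flattened into one record per candidate code like this:

```python
# Inputs and output copied by hand from the cell above (no model call here).
inputs = ["tailor of the finest suits", "local boiler maker", "train's fireman"]
preds = [
    [["79190", 0.81692904, "Other Tailors and Dressmakers"]],
    [["87350", 0.9993395, "Boilersmith"]],
    [["98330", 0.986886, "Railway SteamEngine Fireman"]],
]

# One record per (input, candidate code) pair.
rows = [
    {"input": text, "hisco": code, "prob": prob, "desc": desc}
    for text, candidates in zip(inputs, preds)
    for code, prob, desc in candidates
]

for row in rows:
    print(row)
```

A list of dicts such as `rows` can be passed directly to `pd.DataFrame(rows)` if a table is preferred.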

# Example 2
***We can get better results by adding language as context***  

`lang="en"`

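- In this case, the codes for the boiler maker and the fireman are unchanged but slightly more probable, while the tailor moves from `79190` (Other Tailors and Dressmakers) to `79100` (Tailor, Specialisation Unknown).
- In other cases, this can make the difference between a correct and an incorrect prediction.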
- OccCANINE is trained on 13 languages and the following number of observations:
  + English: "en" (6.34M)
  + Danish: "da" (4.66M)
  + Swedish: "se" (1.68M)
  + Dutch: "nl" (1.00M)
  + Catalan: "ca" (554K)
  + French: "fr" (243K)
  + Norwegian: "no" (136K)
  + Icelandic: "is" (17.4K)
  + Portuguese: "pt" (17.4K)
  + German: "ge/de" (11.7K)
  + Spanish: "es" (7372)
  + Italian: "it" (3828)
  + Greek: "gr" (1466)
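
The `inputs` column of the batch output further down in this notebook shows strings such as `en[SEP]soldier (reserve)`, which suggests that the language code is prepended to the description with a `[SEP]` separator. The sketch below only illustrates that string format; `add_lang_context` is a hypothetical helper, not a function from the OccCANINE package, and any further preprocessing the library may apply is not shown.

```python
def add_lang_context(text: str, lang: str) -> str:
    """Hypothetical helper: prefix a description with its language code.

    Mirrors the 'lang[SEP]text' format visible in the `inputs` column
    of the prediction output; not part of the OccCANINE API.
    """
    return f"{lang}[SEP]{text}"


example = add_lang_context("soldier (reserve)", "en")
print(example)
```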




In [19]:
model.predict(
    ["tailor of the finest suits", "local boiler maker", "train's fireman"],
    lang = "en",
    get_dict = True # Simple output
)

[[['79100', 0.8775564, 'Tailor, Specialisation Unknown']],
 [['87350', 0.99941397, 'Boilersmith']],
 [['98330', 0.99424595, 'Railway SteamEngine Fireman']]]

# Fast performance for many observations

In [8]:
import pandas as pd
df = pd.read_csv("Data/TOYDATA.csv")
model.verbose = True # Print progress updates during prediction
x = model.predict(
    df["occ1"],
    lang = "en",
    threshold = 0.22 # Optimal for f1
    )

x

Processed batch 40 out of 40 batches
Prediction done. Cleaning results.
Produced HISCO codes for 10000 observations in 0 hours, 0 minutes and 27.223 seconds.
Estimated hours saved compared to human labeller (assuming 10 seconds per label):
 ---> 27 hours, 46 minutes and 13 seconds


Unnamed: 0,inputs,hisco_1,prob_1,desc_1,hisco_2,prob_2,desc_2,hisco_3,prob_3,desc_3,hisco_4,prob_4,desc_4,hisco_5,prob_5,desc_5
0,en[SEP]soldier (reserve),58340,0.999909,Other Military Ranks,,,No pred,,,No pred,,,No pred,,,No pred
1,en[SEP]wine and spirit merchant,41025,0.999686,Working Proprietor (Wholesale or Retail Trade),,,No pred,,,No pred,,,No pred,,,No pred
2,en[SEP]coal merchant (deceased),41025,0.999705,Working Proprietor (Wholesale or Retail Trade),,,No pred,,,No pred,,,No pred,,,No pred
3,en[SEP]paper mill operative,99930,0.441911,Factory Worker,,,No pred,,,No pred,,,No pred,,,No pred
4,en[SEP]soldier (deceased),58340,0.999180,Other Military Ranks,,,No pred,,,No pred,,,No pred,,,No pred
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,"en[SEP]holloware turner, deceased",83320,0.987633,Lathe SetterOperator,,,No pred,,,No pred,,,No pred,,,No pred
9996,en[SEP]construction engineer,2210,0.906727,"Civil Engineer, General",,,No pred,,,No pred,,,No pred,,,No pred
9997,en[SEP]operative brewer,77810,0.992721,"Brewer, General",,,No pred,,,No pred,,,No pred,,,No pred
9998,en[SEP]clothier and outfitter,41030,0.951929,Working Proprietor (Retail Trade),,,No pred,,,No pred,,,No pred,,,No pred
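
The "hours saved" figure in the log can be reproduced by hand. The numbers are consistent with taking the assumed human labelling time (10 seconds per label) and subtracting the model's own runtime; that reading is inferred from the printed values, not from the library's source code.

```python
n_obs = 10_000                 # observations in the toy data
human_seconds_per_label = 10   # assumption stated in the log
model_seconds = 27.223         # runtime reported in the log

# Human labelling time minus model runtime, rounded to whole seconds.
saved_seconds = round(n_obs * human_seconds_per_label - model_seconds)

hours, remainder = divmod(saved_seconds, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{hours} hours, {minutes} minutes and {seconds} seconds")
# prints "27 hours, 46 minutes and 13 seconds"
```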


# Finetuning
OccCANINE works well out of the box, but for even better performance it can be finetuned for a specific domain with relatively few observations.

- To create data for finetuning, OccCANINE can produce initial predictions, which a human labeller then corrects.
- The corrected labels can then serve as training data for finetuning.
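
One possible way to organise that correction step is to send only low-confidence predictions to the human labeller. The sketch below uses a few rows copied from the prediction output above; the cutoff of `0.95` is a hypothetical value chosen purely for illustration, and whether high-confidence predictions can be accepted without review depends on the application.

```python
# (input, top HISCO code, probability), copied from the prediction output above.
predictions = [
    ("soldier (reserve)", "58340", 0.999909),
    ("wine and spirit merchant", "41025", 0.999686),
    ("paper mill operative", "99930", 0.441911),
    ("construction engineer", "2210", 0.906727),
]

REVIEW_CUTOFF = 0.95  # hypothetical cutoff, not a recommended value

# Split predictions into those a human should check and the rest.
needs_review = [p for p in predictions if p[2] < REVIEW_CUTOFF]
accepted = [p for p in predictions if p[2] >= REVIEW_CUTOFF]

print("To review:", [text for text, _, _ in needs_review])
print("Accepted: ", [text for text, _, _ in accepted])
```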


In [9]:
# Setup
import pandas as pd
df = pd.read_csv("Data/TOYDATA.csv")

# Set lang
df["lang"] = "en"  # English

df

Unnamed: 0,occ1,hisco_1,lang
0,soldier (reserve),58340,en
1,wine and spirit merchant,41025,en
2,coal merchant (deceased),41025,en
3,paper mill operative,99930,en
4,soldier (deceased),58340,en
...,...,...,...
9995,"holloware turner, deceased",83320,en
9996,construction engineer,2210,en
9997,operative brewer,77810,en
9998,clothier and outfitter,41030,en


Now we are ready to finetune the model:

In [10]:
model.finetune(
    df,
    label_cols = ["hisco_1"],
    epochs = 3,
    save_model = False # Disable saving of finetuned model
)

==== Started finetuning procedure ====
9000 observations will be used in training.
1000 observations will be used in validation.
Saved tmp files to ../Data/Tmp_finetune
----------
Intital performance:
Validation acc: 0.8939587823275862; Validation loss: 0.00021238651606836356
----------
Epoch 1/3
Train loss 0.00041914516299988865, accuracy 0.8079210069444445
Val loss 0.00017326655506622046, accuracy 0.9049030172413793
Validation loss improved.
----------
Epoch 2/3
Train loss 0.0003105451764390131, accuracy 0.8524305555555556
Val loss 0.0001679362940194551, accuracy 0.9131196120689655
Validation loss improved.
----------
Epoch 3/3
Train loss 0.000293297742448178, accuracy 0.8596788194444445
Val loss 0.00016662566849845462, accuracy 0.9121430495689655
Validation loss improved.
----------
Final performance:
Validation acc: 0.9121430495689655; Validation loss: 0.00016662566849845462
Finetuning completed successfully.
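
To put the log in perspective, the change in validation accuracy can be expressed both in percentage points and as a relative reduction in the error rate. The short calculation below uses only the initial and final validation accuracies printed above; it amounts to a gain of roughly 1.8 percentage points. Note also that validation accuracy was marginally higher after epoch 2 than after epoch 3, even though validation loss kept improving.

```python
initial_acc = 0.8939587823275862  # "Intital performance" line in the log
final_acc = 0.9121430495689655    # "Final performance" line in the log

# Absolute gain in percentage points.
gain_pp = (final_acc - initial_acc) * 100

# Relative reduction in the validation error rate (1 - accuracy).
initial_err = 1 - initial_acc
final_err = 1 - final_acc
rel_err_reduction = (initial_err - final_err) / initial_err

print(f"Accuracy gain: {gain_pp:.2f} percentage points")
print(f"Relative error reduction: {rel_err_reduction:.1%}")
```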


Straight away we can use the model, which now performs better on this specific domain. For example, it is now more certain that "tailor of the finest suits" is a tailor.

In [16]:
model.verbose = False
model.predict(
    ["tailor of the finest suits", "local boiler maker", "train's fireman"],
    lang = "en",
    get_dict = True # Simple output
)

[[['79100', 0.9612704, 'Tailor, Specialisation Unknown']],
 [['87350', 0.99369264, 'Boilersmith']],
 [['98330', 0.9934058, 'Railway SteamEngine Fireman']]]
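
Comparing these probabilities with those from Example 2 (same inputs, `lang = "en"`, before finetuning) shows where the change is concentrated. The values below are copied by hand from the two outputs in this notebook: the tailor probability rises noticeably, while the other two move down very slightly.

```python
# Probabilities of the top code before finetuning (Example 2 output).
before = {
    "tailor of the finest suits": 0.8775564,
    "local boiler maker": 0.99941397,
    "train's fireman": 0.99424595,
}
# Probabilities of the same codes after finetuning (output above).
after = {
    "tailor of the finest suits": 0.9612704,
    "local boiler maker": 0.99369264,
    "train's fireman": 0.9934058,
}

deltas = {text: after[text] - before[text] for text in before}
for text, delta in deltas.items():
    print(f"{text!r}: {before[text]:.4f} -> {after[text]:.4f} ({delta:+.4f})")
```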