<a href="https://colab.research.google.com/github/christianvedels/OccCANINE/blob/main/OccCANINE_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OccCANINE

This notebook demonstrates the capabilities of OccCANINE. For any use at scale, we recommend looking at the [GitHub repository](https://github.com/christianvedels/OccCANINE). A natural starting point is the script [PREDICT_HISCOs.py](https://github.com/christianvedels/OccCANINE/blob/main/PREDICT_HISCOs.py).


# Setting up everything
We start by cloning the GitHub repository

In [1]:
!rm -rf OccCANINE # Remove existing
!git clone https://github.com/christianvedels/OccCANINE

Cloning into 'OccCANINE'...
remote: Enumerating objects: 2749, done.
remote: Counting objects: 100% (695/695), done.
remote: Compressing objects: 100% (257/257), done.
remote: Total 2749 (delta 486), reused 581 (delta 433), pack-reused 2054
Receiving objects: 100% (2749/2749), 861.95 MiB | 30.15 MiB/s, done.
Resolving deltas: 100% (1434/1434), done.
Updating files: 100% (541/541), done.


Next, we install the `histocc` package from the cloned repository.

In [2]:
!pip install OccCANINE/

Processing ./OccCANINE
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: histocc
  Building wheel for histocc (setup.py) ... done
  Created wheel for histocc: filename=histocc-0.1.1-py3-none-any.whl size=351922 sha256=7bdcbcdd7bb8954cd3b1ef7f3e43be8522663f7050f3b47a2f8e9db1759e9211
  Stored in directory: /tmp/pip-ephem-wheel-cache-7mzlbcyn/wheels/81/0a/20/821ea5b7ddc47314568c18dee9e67d8401154089f1dcb3fdd2
Successfully built histocc
Installing collected packages: histocc
Successfully installed histocc-0.1.1


One more package is required: `unidecode` is not preinstalled in Colab and needs to be installed.

In [3]:
!pip install unidecode

Collecting unidecode
  Downloading Unidecode-1.3.8-py3-none-any.whl (235 kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.3.8


We load the `OccCANINE` class, an all-in-one class that handles loading the underlying transformer model, prediction, and finetuning.

In [5]:
from histocc import OccCANINE # HISCO prediction class

We initialize the model with the class we just loaded

In [6]:
model = OccCANINE() # This downloads and initializes the model from HuggingFace

tokenizer_config.json:   0%|          | 0.00/854 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/657 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/670 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/101 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/534M [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/529M [00:00<?, ?B/s]

# Example 1
*OccCANINE works pretty well out of the box*

In [7]:
model.predict(
    ["tailor of the finest suits", "local boiler maker", "train's fireman"],
    get_dict = True # Simple output
)

[[[79190, 0.81692904, 'Other Tailors and Dressmakers']],
 [[87350, 0.9993395, 'Boilersmith']],
 [[98330, 0.986886, 'Railway SteamEngine Fireman']]]
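
The nested list returned with `get_dict = True` can be unpacked into flat records for further processing. The sketch below is only an illustration: it hardcodes the output shown above instead of calling the model, and the helper name `flatten_predictions` is hypothetical, not part of `histocc`.

```python
# Inputs and output copied from the cell above (one inner list per input string).
inputs = ["tailor of the finest suits", "local boiler maker", "train's fireman"]
raw = [
    [[79190, 0.81692904, "Other Tailors and Dressmakers"]],
    [[87350, 0.9993395, "Boilersmith"]],
    [[98330, 0.986886, "Railway SteamEngine Fireman"]],
]


def flatten_predictions(inputs, raw):
    """Hypothetical helper: one dict per (input, predicted code) pair."""
    records = []
    for text, candidates in zip(inputs, raw):
        # Each candidate is a [HISCO code, probability, description] triple.
        for code, prob, desc in candidates:
            records.append(
                {"input": text, "hisco": code, "prob": prob, "desc": desc}
            )
    return records


records = flatten_predictions(inputs, raw)
for rec in records:
    print(f'{rec["input"]!r} -> {rec["hisco"]} ({rec["desc"]}, p={rec["prob"]:.3f})')
```

Because each input can in principle carry more than one candidate code, the helper loops over the inner list rather than assuming exactly one prediction per input.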

# Example 2
***We can get better results by adding language as context***  

`lang="en"`

- In this case, we simply get a higher probability of the correct label.
- In other cases, this can make the difference between a correct and an incorrect prediction.
- OccCANINE is trained on 13 languages with the following numbers of observations:
  + English: "en" (6.34M)
  + Danish: "da" (4.66M)
  + Swedish: "se" (1.68M)
  + Dutch: "nl" (1.00M)
  + Catalan: "ca" (554K)
  + French: "fr" (243K)
  + Norwegian: "no" (136K)
  + Icelandic: "is" (17.4K)
  + Portuguese: "pt" (17.4K)
  + German: "ge/de" (11.7K)
  + Spanish: "es" (7372)
  + Italian: "it" (3828)
  + Greek: "gr" (1466)
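
Since the `lang` argument takes one of the codes listed above, it can be convenient to check a code before passing it on. The following is a minimal sketch under stated assumptions: the counts are the rounded figures from the list, the helper `normalize_lang` is hypothetical, and treating "de" as an alias of "ge" is a guess based on the "ge/de" entry. Whether `histocc` itself validates language codes is not stated in this notebook.

```python
# Approximate training observations per language code, as listed above.
TRAINING_OBS = {
    "en": 6_340_000,
    "da": 4_660_000,
    "se": 1_680_000,
    "nl": 1_000_000,
    "ca": 554_000,
    "fr": 243_000,
    "no": 136_000,
    "is": 17_400,
    "pt": 17_400,
    "ge": 11_700,  # listed above as "ge/de"; "ge" is used as the key here
    "es": 7_372,
    "it": 3_828,
    "gr": 1_466,
}
ALIASES = {"de": "ge"}  # assumption: both spellings refer to German


def normalize_lang(lang):
    """Hypothetical helper: return a listed code or raise ValueError."""
    code = lang.lower()
    code = ALIASES.get(code, code)
    if code not in TRAINING_OBS:
        raise ValueError(f"Unknown language code: {lang!r}")
    return code


total = sum(TRAINING_OBS.values())
for code, n_obs in TRAINING_OBS.items():
    print(f"{code}: {n_obs / total:.2%} of training observations")
```

The printed shares make the imbalance visible: the languages at the bottom of the list contribute only a tiny fraction of the training data.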




In [8]:
model.predict(
    ["tailor of the finest suits", "local boiler maker", "train's fireman"],
    lang = "en",
    get_dict = True # Simple output
)

[[[79100, 0.8775564, 'Tailor, Specialisation Unknown']],
 [[87350, 0.99941397, 'Boilersmith']],
 [[98330, 0.99424595, 'Railway SteamEngine Fireman']]]

# Fast performance for many observations

In [9]:
import pandas as pd

In [10]:
df = pd.read_csv("OccCANINE/histocc/Data/TOYDATA.csv")
model.verbose = True # Print progress updates
x = model.predict(
    df["occ1"],
    lang = "en",
    threshold = 0.22 # Optimal for f1
    )

x

Processed batch 40 out of 40 batches
Prediction done. Cleaning results.
Produced HISCO codes for 10000 observations in 0 hours, 0 minutes and 27.452 seconds.
Estimated hours saved compared to human labeller (assuming 10 seconds per label):
 ---> 27 hours, 46 minutes and 13 seconds
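
The time-saved estimate in the log can be reproduced with simple arithmetic. The sketch below assumes that the model's own runtime is subtracted from the human labelling time; this is inferred from the numbers in the log, not taken from the `histocc` source.

```python
n_obs = 10_000          # observations reported in the log
seconds_per_label = 10  # human labelling speed assumed in the log
model_seconds = 27.452  # model runtime reported in the log

# Human time minus model time, split into hours, minutes and seconds.
saved = n_obs * seconds_per_label - model_seconds
hours, remainder = divmod(saved, 3600)
minutes, seconds = divmod(remainder, 60)

print(f"{int(hours)} hours, {int(minutes)} minutes and {round(seconds)} seconds")
# prints "27 hours, 46 minutes and 13 seconds"
```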


| Unnamed: 0 | inputs | hisco_1 | prob_1 | desc_1 | hisco_2 | prob_2 | desc_2 | hisco_3 | prob_3 | desc_3 | hisco_4 | prob_4 | desc_4 | hisco_5 | prob_5 | desc_5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | en[SEP]soldier (reserve) | 58340.0 | 0.999909 | Other Military Ranks |  |  | No pred |  |  | No pred |  |  | No pred |  |  | No pred |
| 1 | en[SEP]wine and spirit merchant | 41025.0 | 0.999686 | Working Proprietor (Wholesale or Retail Trade) |  |  | No pred |  |  | No pred |  |  | No pred |  |  | No pred |
| 2 | en[SEP]coal merchant (deceased) | 41025.0 | 0.999705 | Working Proprietor (Wholesale or Retail Trade) |  |  | No pred |  |  | No pred |  |  | No pred |  |  | No pred |
| 3 | en[SEP]paper mill operative | 99930.0 | 0.441911 | Factory Worker |  |  | No pred |  |  | No pred |  |  | No pred |  |  | No pred |
| 4 | en[SEP]soldier (deceased) | 58340.0 | 0.999180 | Other Military Ranks |  |  | No pred |  |  | No pred |  |  | No pred |  |  | No pred |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | en[SEP]holloware turner, deceased | 83320.0 | 0.987633 | Lathe SetterOperator |  |  | No pred |  |  | No pred |  |  | No pred |  |  | No pred |
| 9996 | en[SEP]construction engineer | 2210.0 | 0.906727 | Civil Engineer, General |  |  | No pred |  |  | No pred |  |  | No pred |  |  | No pred |
| 9997 | en[SEP]operative brewer | 77810.0 | 0.992721 | Brewer, General |  |  | No pred |  |  | No pred |  |  | No pred |  |  | No pred |
| 9998 | en[SEP]clothier and outfitter | 41030.0 | 0.951929 | Working Proprietor (Retail Trade) |  |  | No pred |  |  | No pred |  |  | No pred |  |  | No pred |
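
To build intuition for the `threshold` argument and the five `hisco_*` slots in the output, the sketch below shows one way a probability threshold could turn per-code probabilities into up to five predictions. This is an assumption about the general technique, not the `histocc` implementation: the function `apply_threshold` and the two placeholder codes are made up for illustration, and only the `99930` probability is taken from the output above.

```python
def apply_threshold(probs, threshold=0.22, max_codes=5):
    """Keep codes whose probability exceeds `threshold`, most likely first."""
    kept = sorted(
        ((code, p) for code, p in probs.items() if p > threshold),
        key=lambda item: item[1],
        reverse=True,
    )[:max_codes]
    # Pad with empty slots so every row has `max_codes` entries,
    # mirroring the hisco_1 ... hisco_5 columns.
    return kept + [(None, None)] * (max_codes - len(kept))


# 99930 / 0.441911 comes from the "paper mill operative" row above;
# the other two entries are invented placeholders, not model output.
example = {99930: 0.441911, "placeholder_a": 0.20, "placeholder_b": 0.05}
result = apply_threshold(example)
print(result)
```

Under this reading, a lower threshold admits more candidate codes per observation, while a higher threshold leaves more slots empty.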


# Finetuning
OccCANINE works well out of the box, but for even better performance it can be finetuned for a specific domain with a few observations.

- To create data for finetuning, OccCANINE can be used for initial predictions, which can then be corrected by a human labeller.
- In turn, the corrected predictions can be used as training data in finetuning.
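
The workflow in the two bullets above can be sketched as follows. This is a minimal illustration using only the standard library: the predictions are copied from the output earlier in this notebook, while the review cut-off `REVIEW_BELOW` and the `human_labels` mapping are hypothetical. The resulting columns (`occ1`, `hisco_1`, `lang`) match those of the data frame used for finetuning in this notebook.

```python
import csv
import io

# Initial model predictions: (description, predicted HISCO code, probability).
predictions = [
    ("soldier (reserve)", 58340, 0.999909),
    ("paper mill operative", 99930, 0.441911),
    ("construction engineer", 2210, 0.906727),
]

REVIEW_BELOW = 0.9  # hypothetical cut-off for sending a row to a human labeller

# Hypothetical human review result; here the labeller confirms the model's code.
human_labels = {"paper mill operative": 99930}

rows = []
for text, code, prob in predictions:
    if prob < REVIEW_BELOW:
        if text not in human_labels:
            continue  # not reviewed yet, so leave it out of the training data
        code = human_labels[text]
    rows.append({"occ1": text, "hisco_1": code, "lang": "en"})

# Write the training data as CSV (to an in-memory buffer for this example).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["occ1", "hisco_1", "lang"])
writer.writeheader()
writer.writerows(rows)
training_csv = buffer.getvalue()
print(training_csv)
```

Routing only low-probability rows to the labeller keeps the manual effort small, at the cost of trusting the model on confident predictions.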


In [11]:
# Setup
df = pd.read_csv(
    "OccCANINE/histocc/Data/TOYDATA.csv"
    )
label_cols = ["hisco_1"]

# Set lang
df["lang"] = "en"  # English

df

| Unnamed: 0 | occ1 | hisco_1 | lang |
| --- | --- | --- | --- |
| 0 | soldier (reserve) | 58340 | en |
| 1 | wine and spirit merchant | 41025 | en |
| 2 | coal merchant (deceased) | 41025 | en |
| 3 | paper mill operative | 99930 | en |
| 4 | soldier (deceased) | 58340 | en |
| ... | ... | ... | ... |
| 9995 | holloware turner, deceased | 83320 | en |
| 9996 | construction engineer | 2210 | en |
| 9997 | operative brewer | 77810 | en |
| 9998 | clothier and outfitter | 41030 | en |


We use the `.finetune` method to finetune the model.

In [12]:
model.finetune(
    df,
    label_cols,
    batch_size=64,
    save_name = "Finetune_toy_model",
    verbose_extra = True # Print even more updates while finetuning.
)

==== Started finetuning procedure ====
9000 observations will be used in training.
1000 observations will be used in validation.
Saved tmp files to Data/Tmp_finetune
----------
Intital performance:
Validation acc: 0.894140625; Validation loss: 0.0002107825112034334
----------
Epoch 1/3
Batch 141/141 - Loss: 0.0004, Acc: 0.8500, ETA: 0m0s of 1m34s
Epoch completed.
Train loss 0.00040298335513030677, accuracy 0.8155363475177304
Val loss 0.00016636294640193228, accuracy 0.916796875
Validation loss improved. Saved improved model
----------
Epoch 2/3
Batch 141/141 - Loss: 0.0002, Acc: 0.9000, ETA: 0m0s of 1m36s
Epoch completed.
Train loss 0.00030619350757577017, accuracy 0.8596631205673759
Val loss 0.00016594656472079805, accuracy 0.912890625
Validation loss improved. Saved improved model
----------
Epoch 3/3
Batch 141/141 - Loss: 0.0004, Acc: 0.8500, ETA: 0m0s of 1m36s
Epoch completed.
Train loss 0.0002664030719559197, accuracy 0.8730496453900709
Val loss 0.00016394033627875615, accuracy 0.
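
The `Batch 141/141` counter in the log follows from the training set size and the `batch_size` passed to `.finetune`. The short calculation below reproduces it from the figures in the log; that the final, smaller batch is kept rather than dropped is an assumption inferred from the count.

```python
import math

n_train = 9_000   # training observations reported in the log
n_val = 1_000     # validation observations reported in the log
batch_size = 64   # value passed to model.finetune above

# Share of the data held out for validation.
val_share = n_val / (n_train + n_val)

# Number of batches per epoch when the last partial batch is kept.
batches_per_epoch = math.ceil(n_train / batch_size)

# Size of that last partial batch.
last_batch_size = n_train - (batches_per_epoch - 1) * batch_size

print(val_share, batches_per_epoch, last_batch_size)
# prints "0.1 141 40"
```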


- The finetuned model is saved in a new folder called "Finetuned".
- It can be loaded in other projects with the `OccCANINE` class.



In [13]:
model_new = OccCANINE("Finetuned/Finetune_toy_model", hf = False)
model_new.predict(
    ["tailor of the finest suits", "local boiler maker", "train's fireman"],
    lang = "en",
    get_dict = True # Simple output
)

[[[79100, 0.94281733, 'Tailor, Specialisation Unknown']],
 [[87350, 0.9936194, 'Boilersmith']],
 [[98330, 0.97705656, 'Railway SteamEngine Fireman']]]