<a href="https://colab.research.google.com/github/christianvedels/OccCANINE/blob/main/OccCANINE_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OccCANINE

This notebook demonstrates the capabilities of OccCANINE. For any use at scale, we recommend downloading the entire [GitHub repository](https://github.com/christianvedels/OccCANINE). A natural starting point is the script [PREDICT_HISCOs.py](https://github.com/christianvedels/OccCANINE/blob/main/PREDICT_HISCOs.py).


# Setting up everything
We start by cloning the GitHub repository.

In [19]:
!rm -rf OccCANINE # Remove existing
!git clone https://github.com/christianvedels/OccCANINE

Next we install the other required packages. `unidecode` is not included in Colab by default and needs to be installed.

In [13]:
!pip install unidecode



We set the working directory to the cloned `OccCANINE` folder.

In [14]:
import os
os.chdir('/content/OccCANINE')
print(os.getcwd())

/content/OccCANINE


We load the `OccCANINE` class, an all-in-one class that handles loading the underlying transformer model, prediction, and finetuning.

In [15]:
from OccCANINE.n103_Prediction_assets import OccCANINE # HISCO prediction class

We initialize the model with the class we just loaded.

In [16]:
model = OccCANINE() # This downloads and initializes the model from HuggingFace

FileNotFoundError: [Errno 2] No such file or directory: '../Data/Key.csv'

# Example 1
*OccCANINE works pretty well out of the box*

In [6]:
model.predict(
    ["tailor of the finest suits", "local boiler maker", "train's fireman"],
    get_dict = True # Simple output
)

[[['79190', 0.81692904, 'Other Tailors and Dressmakers']],
 [['87350', 0.9993395, 'Boilersmith']],
 [['98330', 0.986886, 'Railway SteamEngine Fireman']]]
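The output above is a nested list: one inner list per input string, each holding `[code, probability, description]` entries. A minimal sketch of turning it into flat records follows. The helper `flatten_predictions` is hypothetical and not part of OccCANINE; the values are copied from the output above.

```python
inputs = ["tailor of the finest suits", "local boiler maker", "train's fireman"]

# Output copied from the cell above.
raw = [
    [['79190', 0.81692904, 'Other Tailors and Dressmakers']],
    [['87350', 0.9993395, 'Boilersmith']],
    [['98330', 0.986886, 'Railway SteamEngine Fireman']],
]

def flatten_predictions(inputs, raw):
    """Hypothetical helper: one flat record per (input, predicted code)."""
    records = []
    for text, preds in zip(inputs, raw):
        for code, prob, desc in preds:
            records.append(
                {"input": text, "hisco": code, "prob": float(prob), "desc": desc}
            )
    return records

records = flatten_predictions(inputs, raw)
for r in records:
    print(f"{r['input']!r} -> {r['hisco']} ({r['prob']:.3f}) {r['desc']}")
```

Keeping the code as a string preserves any leading zeros.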

# Example 2
***We can get better results by adding language as context***  

`lang="en"`

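- In this case the boiler maker and the fireman keep their codes with a higher probability, while the tailor moves from `79190` (Other Tailors and Dressmakers) to `79100` (Tailor, Specialisation Unknown).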
- In other cases, this can make the difference between a correct and an incorrect prediction.
- OccCANINE is trained on 13 languages and the following number of observations:
  + English: "en" (6.34M)
  + Danish: "da" (4.66M)
  + Swedish: "se" (1.68M)
  + Dutch: "nl" (1.00M)
  + Catalan: "ca" (554K)
  + French: "fr" (243K)
  + Norwegian: "no" (136K)
  + Icelandic: "is" (17.4K)
  + Portuguese: "pt" (17.4K)
  + German: "ge/de" (11.7K)
  + Spanish: "es" (7372)
  + Italian: "it" (3828)
  + Greek: "gr" (1466)
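Since the amount of training data differs by several orders of magnitude across languages, it can be useful to check a language code before predicting. The sketch below copies the codes and approximate counts from the list above. The helper `check_lang` and its cutoff are hypothetical and not part of OccCANINE, and accepting both "ge" and "de" for German is an assumption based on the "ge/de" entry.

```python
# Language codes and approximate training counts, copied from the list above.
# The list gives German as "ge/de"; accepting both spellings is an assumption.
TRAINING_OBS = {
    "en": 6_340_000, "da": 4_660_000, "se": 1_680_000, "nl": 1_000_000,
    "ca": 554_000, "fr": 243_000, "no": 136_000, "is": 17_400,
    "pt": 17_400, "ge": 11_700, "de": 11_700, "es": 7_372,
    "it": 3_828, "gr": 1_466,
}

def check_lang(lang, low_data_cutoff=20_000):
    """Hypothetical helper (not part of OccCANINE): validate a language code.

    Returns the code unchanged, raises ValueError for unknown codes, and
    prints a note for languages below the (arbitrary) cutoff.
    """
    if lang not in TRAINING_OBS:
        raise ValueError(f"Unknown language code: {lang!r}")
    if TRAINING_OBS[lang] < low_data_cutoff:
        print(f"Note: only about {TRAINING_OBS[lang]:,} training observations for {lang!r}")
    return lang

check_lang("en")  # passes silently
check_lang("gr")  # prints a note about the small training set
```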




In [7]:
model.predict(
    ["tailor of the finest suits", "local boiler maker", "train's fireman"],
    lang = "en",
    get_dict = True # Simple output
)

[[['79100', 0.8775564, 'Tailor, Specialisation Unknown']],
 [['87350', 0.99941397, 'Boilersmith']],
 [['98330', 0.99424595, 'Railway SteamEngine Fireman']]]
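To see what the language context changed, the two outputs can be compared side by side. The sketch below uses only the values printed in Example 1 and Example 2; the variable names are illustrative.

```python
inputs = ["tailor of the finest suits", "local boiler maker", "train's fireman"]

# Outputs copied from Example 1 (no language) and Example 2 (lang="en").
without_lang = [
    [['79190', 0.81692904, 'Other Tailors and Dressmakers']],
    [['87350', 0.9993395, 'Boilersmith']],
    [['98330', 0.986886, 'Railway SteamEngine Fireman']],
]
with_lang = [
    [['79100', 0.8775564, 'Tailor, Specialisation Unknown']],
    [['87350', 0.99941397, 'Boilersmith']],
    [['98330', 0.99424595, 'Railway SteamEngine Fireman']],
]

comparison = []
for text, before, after in zip(inputs, without_lang, with_lang):
    # Each input has exactly one prediction in these outputs.
    (code_a, prob_a, _), (code_b, prob_b, _) = before[0], after[0]
    comparison.append({
        "input": text,
        "code_changed": code_a != code_b,
        "prob_gain": prob_b - prob_a,
    })

for row in comparison:
    print(row)
```

In these outputs the top probability rises for all three inputs, and the top code changes only for the tailor (`79190` to `79100`).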

# Fast performance for many observations

In [9]:
import pandas as pd
df = pd.read_csv("Data/TOYDATA.csv")
model.verbose = True # Print progress updates
x = model.predict(
    df["occ1"],
    lang = "en",
    threshold = 0.22 # Optimal for f1
    )

x # Change to English marriage certificates.

Processed batch 40 out of 40 batches
Prediction done. Cleaning results.
Produced HISCO codes for 10000 observations in 0 hours, 0 minutes and 27.668 seconds.
Estimated hours saved compared to human labeller (10 seconds per label):
 ---> 27 hours, 46 minutes and 12 seconds
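The figures in this log can be checked by hand. At 10 seconds per label, 10000 labels take 100000 seconds, or 27 hours, 46 minutes and 40 seconds. The reported 27 hours, 46 minutes and 12 seconds is consistent with subtracting the model's own runtime of about 28 seconds, although how OccCANINE computes the figure internally is an assumption here. A short sketch of the arithmetic:

```python
def split_hms(seconds):
    """Split a duration in seconds into (hours, minutes, seconds)."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return int(hours), int(minutes), secs

n_obs = 10_000            # observations reported in the log
n_batches = 40            # batches reported in the log
model_seconds = 27.668    # model runtime reported in the log
seconds_per_label = 10    # human labelling speed stated in the log

human_seconds = n_obs * seconds_per_label
print(split_hms(human_seconds))                  # (27, 46, 40)
print(split_hms(human_seconds - model_seconds))  # 27 h, 46 min, about 12.3 s
print(n_obs / n_batches)                         # average observations per batch
print(round(n_obs / model_seconds))              # observations per second, rounded
```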


Unnamed: 0,inputs,hisco_1,prob_1,desc_1,hisco_2,prob_2,desc_2,hisco_3,prob_3,desc_3,hisco_4,prob_4,desc_4,hisco_5,prob_5,desc_5
0,en[SEP]soldier (reserve),58340,0.999909,Other Military Ranks,,,No pred,,,No pred,,,No pred,,,No pred
1,en[SEP]wine and spirit merchant,41025,0.999686,Working Proprietor (Wholesale or Retail Trade),,,No pred,,,No pred,,,No pred,,,No pred
2,en[SEP]coal merchant (deceased),41025,0.999705,Working Proprietor (Wholesale or Retail Trade),,,No pred,,,No pred,,,No pred,,,No pred
3,en[SEP]paper mill operative,99930,0.441911,Factory Worker,,,No pred,,,No pred,,,No pred,,,No pred
4,en[SEP]soldier (deceased),58340,0.999180,Other Military Ranks,,,No pred,,,No pred,,,No pred,,,No pred
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,"en[SEP]holloware turner, deceased",83320,0.987633,Lathe SetterOperator,,,No pred,,,No pred,,,No pred,,,No pred
9996,en[SEP]construction engineer,2210,0.906727,"Civil Engineer, General",,,No pred,,,No pred,,,No pred,,,No pred
9997,en[SEP]operative brewer,77810,0.992721,"Brewer, General",,,No pred,,,No pred,,,No pred,,,No pred
9998,en[SEP]clothier and outfitter,41030,0.951929,Working Proprietor (Retail Trade),,,No pred,,,No pred,,,No pred,,,No pred
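The table above has two features worth noting: the `inputs` column shows the language code joined to the description with `[SEP]`, and each row has five prediction slots, of which only those with a code are filled (the rest show `No pred`, presumably because no further code reached the threshold). A minimal sketch of reading one row follows. The helper `read_row` is hypothetical and works on a plain dict rather than on the actual return value of `model.predict`.

```python
SEP = "[SEP]"
MAX_CODES = 5  # the table has columns hisco_1 .. hisco_5

def read_row(row):
    """Hypothetical helper: split `inputs` and collect the filled prediction slots."""
    lang, _, occ = row["inputs"].partition(SEP)
    preds = []
    for i in range(1, MAX_CODES + 1):
        code = row.get(f"hisco_{i}")
        if code in (None, ""):
            continue  # empty slot, shown as "No pred" in the description column
        preds.append((code, row[f"prob_{i}"], row[f"desc_{i}"]))
    return {"lang": lang, "occ": occ, "predictions": preds}

# Row 3 of the table above, written out as a plain dict.
row = {"inputs": "en[SEP]paper mill operative",
       "hisco_1": "99930", "prob_1": 0.441911, "desc_1": "Factory Worker"}
for i in range(2, MAX_CODES + 1):
    row.update({f"hisco_{i}": None, f"prob_{i}": None, f"desc_{i}": "No pred"})

print(read_row(row))
```

With a pandas DataFrame, empty cells are typically `NaN` rather than `None`, so the emptiness check would use `pd.isna` instead.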


# Finetuning
OccCANINE works well out of the box, but for even better performance it can be finetuned for a specific domain with a few observations.

- To create data for finetuning, OccCANINE can be used for initial predictions, which a human labeller then corrects.
- The corrected predictions can then be used as training data for finetuning.
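The predict-then-correct workflow can be sketched in plain Python. The column names `occ1`, `lang` and `hisco_1` follow the cells in this notebook; the helper `build_training_rows` is hypothetical, and the single correction is for illustration only (it reuses the code the model returned for the tailor in Example 2), not a real reviewer decision.

```python
import csv
import io

# Initial predictions, copied from the Example 1 output.
predictions = [
    {"occ1": "tailor of the finest suits", "hisco_1": "79190"},
    {"occ1": "local boiler maker", "hisco_1": "87350"},
    {"occ1": "train's fireman", "hisco_1": "98330"},
]

# Hypothetical reviewer corrections, keyed by description (illustration only).
corrections = {"tailor of the finest suits": "79100"}

def build_training_rows(predictions, corrections, lang):
    """Hypothetical helper: apply human corrections on top of model predictions."""
    rows = []
    for p in predictions:
        rows.append({
            "occ1": p["occ1"],
            "lang": lang,
            # Use the corrected code if there is one, else keep the prediction.
            "hisco_1": corrections.get(p["occ1"], p["hisco_1"]),
        })
    return rows

training_rows = build_training_rows(predictions, corrections, lang="en")

# Write the rows as CSV text (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["occ1", "lang", "hisco_1"])
writer.writeheader()
writer.writerows(training_rows)
print(buffer.getvalue())
```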


In [10]:
# Setup
df = pd.read_csv(
    "Data/TOYDATA.csv"
    )
label_cols = ["hisco_1"]

# Set lang
df["lang"] = "en"  # English

df

Unnamed: 0,occ1,lang
0,soldier (reserve),en
1,wine and spirit merchant,en
2,coal merchant (deceased),en
3,paper mill operative,en
4,soldier (deceased),en
...,...,...
9995,"holloware turner, deceased",en
9996,construction engineer,en
9997,operative brewer,en
9998,clothier and outfitter,en
