# Lawyer / Sawyer; Engineer / Engineer

In [42]:
from histocc import OccCANINE
model = OccCANINE(verbose=False)
model_old = OccCANINE("OccCANINE", verbose=False)

Defining input

In [32]:
examples1 = ["lawyer", "lawyer & editor", "lawyer (retired)", "lawyer retired", "lawyer clerk"]
examples2 = ["asst. chief engineer", "engineer"]

# Adding context 1
examples1_context_city = [i + " in the city" for i in examples1]
examples2_context_city = [i + " in the city" for i in examples2]

# Adding context 2
examples1_context_office = [i + " in an office" for i in examples1]
examples2_context_office = [i + " in an office" for i in examples2]

# Adding context 3
examples1_context_company = [i + " in a company" for i in examples1]
examples2_context_company = [i + " in a company" for i in examples2]

## Explanation:
Below we try to replicate the problem and check whether the seq2seq decoder fixes it. We also try adding context to the input string.

### 1. Replicating problem with flat output (like v1 OccC)
Below the problem is replicated. 'lawyer' is labelled as 'sawyer', and so is 'lawyer (retired)', but not 'lawyer retired'.
The engineers are both labelled as 'Machinery Fitters and Machine Assemblers' (84100), which is a good baseline guess. That is, the problem does not replicate in this case, although in the specific context of city directories we should expect '02000 Engineer' to be the more likely label.

In [16]:
res = model(examples1, behavior = "fast", lang = "en")
res

Based on behavior = 'fast', prediction_type was automatically set to 'flat'


Unnamed: 0,occ1,hisco_1,prob_1,desc_1,hisco_2,prob_2,desc_2,hisco_3,prob_3,desc_3,hisco_4,prob_4,desc_4,hisco_5,prob_5,desc_5
0,en[SEP]lawyer,73210,0.744454,"Sawyer, General",,,No pred,,,No pred,,,No pred,,,No pred
1,en[SEP]lawyer & editor,12110,0.987649,Lawyer,,,No pred,,,No pred,,,No pred,,,No pred
2,en[SEP]lawyer (retired),73210,0.59782,"Sawyer, General",12410.0,0.398772,Solicitor,12110.0,0.245023,Lawyer,,,No pred,,,No pred
3,en[SEP]lawyer retired,12110,0.976207,Lawyer,,,No pred,,,No pred,,,No pred,,,No pred
4,en[SEP]lawyer clerk,39340,0.827256,Legal Clerk,,,No pred,,,No pred,,,No pred,,,No pred


In [18]:
res = model(examples2, behavior = "fast", lang = "en")
res

Based on behavior = 'fast', prediction_type was automatically set to 'flat'


Unnamed: 0,occ1,hisco_1,prob_1,desc_1,hisco_2,prob_2,desc_2,hisco_3,prob_3,desc_3,hisco_4,prob_4,desc_4,hisco_5,prob_5,desc_5
0,en[SEP]asst. chief engineer,84100,0.33167,Machinery Fitters and Machine Assemblers,,,No pred,,,No pred,,,No pred,,,No pred
1,en[SEP]engineer,84100,0.775141,Machinery Fitters and Machine Assemblers,2000.0,0.223059,"Engineer, Specialisation Unknown",,,No pred,,,No pred,,,No pred
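
As a rough way to read the flat output above, the sketch below checks where the intended code ranks among the returned candidates for each lawyer string. The candidate lists are copied by hand from the table; the `expected` codes and the helper `rank_of_expected` are assumptions introduced here for illustration and are not part of `histocc`.

```python
# Candidate HISCO codes per input, copied from the flat output above (top-1 first).
flat_candidates = {
    "lawyer": [73210],
    "lawyer & editor": [12110],
    "lawyer (retired)": [73210, 12410, 12110],
    "lawyer retired": [12110],
    "lawyer clerk": [39340],
}

# ASSUMPTION: hand-assigned target codes (12110 Lawyer, 39340 Legal Clerk).
# These are illustrative labels, not ground truth from the source.
expected = {
    "lawyer": 12110,
    "lawyer & editor": 12110,
    "lawyer (retired)": 12110,
    "lawyer retired": 12110,
    "lawyer clerk": 39340,
}


def rank_of_expected(candidates, target):
    """Return the 1-based rank of target among candidates, or None if absent."""
    return candidates.index(target) + 1 if target in candidates else None


ranks = {s: rank_of_expected(c, expected[s]) for s, c in flat_candidates.items()}
for string, rank in ranks.items():
    print(f"{string!r}: rank of expected code = {rank}")
```

Under these assumed targets, 'lawyer (retired)' still has Lawyer among its candidates, only ranked below Sawyer, while plain 'lawyer' does not return it at all.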


### 2. Trying seq2seq decoder (default in new version)
Using the updated model, which predicts the code digit by digit, we improve somewhat, although 'sawyer/lawyer' is still mixed up in the simple case. 'engineer' and 'asst. chief engineer' are now not labelled consistently.

In [17]:
res = model(examples1, lang = "en")
res

Based on behavior = 'good', prediction_type was automatically set to 'greedy'


Unnamed: 0,occ1,hisco_1,desc_1,conf
0,en[SEP]lawyer,73210,"Sawyer, General",0.760004
1,en[SEP]lawyer & editor,12110,Lawyer,0.680842
2,en[SEP]lawyer (retired),12410,Solicitor,0.585299
3,en[SEP]lawyer retired,12110,Lawyer,0.886585
4,en[SEP]lawyer clerk,39340,Legal Clerk,0.886244


In [19]:
res = model(examples2, lang = "en")
res

Based on behavior = 'good', prediction_type was automatically set to 'greedy'


Unnamed: 0,occ1,hisco_1,desc_1,conf
0,en[SEP]asst. chief engineer,2000,"Engineer, Specialisation Unknown",0.554545
1,en[SEP]engineer,84100,Machinery Fitters and Machine Assemblers,0.778638


### 3. Adding context
The model seems to be able to do a better job when there is other relevant information in the string. 'lawyer' is only one character away from 'sawyer' but very few sawyers work in an office or live in the city. We can simply add this to the string. It is the type of information which would often occur naturally in our training data anyway.

We try adding " in the city" and " in an office", and also " in a company". The first could reliably be added in the motivating example of city directories, since it is in fact people living in the city that appear in the data. The office context could also be used if it is possible to be relatively sure that the strings are mostly office workers.
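
The "one character away" remark can be checked with a plain Levenshtein (edit) distance. This is a minimal, self-contained sketch; the function `levenshtein` is written here for illustration and is not something `histocc` is known to provide.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("lawyer", "sawyer"))
print(levenshtein("lawyer in an office", "sawyer in an office"))
```

Appending a suffix leaves the edit distance between the two strings unchanged; what changes is that the model sees additional characters that are more plausible for one occupation than the other.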

In [33]:
res = model(examples1_context_city, lang = "en") # With ' in the city' context as the last part of every string
res

Based on behavior = 'good', prediction_type was automatically set to 'greedy'


Unnamed: 0,occ1,hisco_1,desc_1,conf
0,en[SEP]lawyer in the city,94990,Other Production and Related Workers Not Elsew...,0.264434
1,en[SEP]lawyer & editor in the city,12110,Lawyer,0.475744
2,en[SEP]lawyer (retired) in the city,12110,Lawyer,0.423914
3,en[SEP]lawyer retired in the city,12110,Lawyer,0.41341
4,en[SEP]lawyer clerk in the city,39340,Legal Clerk,0.501577


In [25]:
res = model(examples2_context_city, lang = "en")
res

Based on behavior = 'good', prediction_type was automatically set to 'greedy'


Unnamed: 0,occ1,hisco_1,desc_1,conf
0,en[SEP]asst. chief engineer in the city,94990,Other Production and Related Workers Not Elsew...,0.204317
1,en[SEP]engineer in the city,94990,Other Production and Related Workers Not Elsew...,0.370008


In [36]:
res = model(examples1_context_office, lang = "en") # With ' in an office' context as the last part of every string
res

Based on behavior = 'good', prediction_type was automatically set to 'greedy'


Unnamed: 0,occ1,hisco_1,desc_1,conf
0,en[SEP]lawyer in an office,12110,Lawyer,0.700124
1,en[SEP]lawyer & editor in an office,12110,Lawyer,0.531699
2,en[SEP]lawyer (retired) in an office,12110,Lawyer,0.562139
3,en[SEP]lawyer retired in an office,12110,Lawyer,0.891648
4,en[SEP]lawyer clerk in an office,39340,Legal Clerk,0.877609


In [27]:
res = model(examples2_context_office, lang = "en")
res

Based on behavior = 'good', prediction_type was automatically set to 'greedy'


Unnamed: 0,occ1,hisco_1,desc_1,conf
0,en[SEP]asst. chief engineer in an office,4315,Ship's Chief Engineer,0.577685
1,en[SEP]engineer in an office,39310,"Office Clerk, General",0.888074


In [30]:
res = model(examples1_context_company, lang = "en") # With ' in a company' context as the last part of every string
res

Based on behavior = 'good', prediction_type was automatically set to 'greedy'


Unnamed: 0,occ1,hisco_1,desc_1,conf
0,en[SEP]lawyer in a company,12110,Lawyer,0.750336
1,en[SEP]lawyer & editor in a company,12110,Lawyer,0.413249
2,en[SEP]lawyer (retired) in a company,12110,Lawyer,0.493027
3,en[SEP]lawyer retired in a company,12110,Lawyer,0.656803
4,en[SEP]lawyer clerk in a company,39310,"Office Clerk, General",0.54282


In [31]:
res = model(examples2_context_company, lang = "en") # With ' in a company' context as the last part of every string
res

Based on behavior = 'good', prediction_type was automatically set to 'greedy'


Unnamed: 0,occ1,hisco_1,desc_1,conf
0,en[SEP]asst. chief engineer in a company,2000,"Engineer, Specialisation Unknown",0.78194
1,en[SEP]engineer in a company,84100,Machinery Fitters and Machine Assemblers,0.466215
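
To compare the contexts at a glance, the sketch below counts, for the five lawyer strings, how many greedy top-1 codes match an assumed target under each suffix. The codes are copied by hand from the outputs above; the `expected` list is an assumption made for illustration (12110 Lawyer for the first four strings, 39340 Legal Clerk for 'lawyer clerk'), and the count is a strict match that treats 12410 Solicitor as a miss.

```python
# Greedy top-1 HISCO codes for examples1, copied from the outputs above.
# Order: lawyer, lawyer & editor, lawyer (retired), lawyer retired, lawyer clerk
top1_by_context = {
    "no context":   [73210, 12110, 12410, 12110, 39340],
    "in the city":  [94990, 12110, 12110, 12110, 39340],
    "in an office": [12110, 12110, 12110, 12110, 39340],
    "in a company": [12110, 12110, 12110, 12110, 39310],
}

# ASSUMPTION: hand-assigned target codes, not ground truth from the source.
expected = [12110, 12110, 12110, 12110, 39340]


def n_matches(predicted, target):
    """Count positions where the predicted code equals the target code."""
    return sum(p == t for p, t in zip(predicted, target))


matches = {ctx: n_matches(codes, expected) for ctx, codes in top1_by_context.items()}
for ctx, n in matches.items():
    print(f"{ctx:>13}: {n}/{len(expected)} match the assumed target")
```

The engineer strings are left out because the text treats their correct code as context-dependent, so no single target is assumed for them.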


# What about the old OccCANINE?

In [43]:
res = model_old(examples1 + examples1_context_office, lang = "en", behavior = "fast") # Plain strings first, then the same strings with ' in an office' appended
res

Based on behavior = 'fast', prediction_type was automatically set to 'flat'


Unnamed: 0,occ1,hisco_1,prob_1,desc_1,hisco_2,prob_2,desc_2,hisco_3,prob_3,desc_3,hisco_4,prob_4,desc_4,hisco_5,prob_5,desc_5
0,en[SEP]lawyer,73210,0.844501,"Sawyer, General",,,No pred,,,No pred,,,No pred,,,No pred
1,en[SEP]lawyer & editor,12110,0.670885,Lawyer,12410.0,0.538294,Solicitor,,,No pred,,,No pred,,,No pred
2,en[SEP]lawyer (retired),73210,0.591734,"Sawyer, General",12410.0,0.296036,Solicitor,,,No pred,,,No pred,,,No pred
3,en[SEP]lawyer retired,12110,0.874454,Lawyer,,,No pred,,,No pred,,,No pred,,,No pred
4,en[SEP]lawyer clerk,12110,0.594191,Lawyer,,,No pred,,,No pred,,,No pred,,,No pred
5,en[SEP]lawyer in an office,12110,0.940642,Lawyer,,,No pred,,,No pred,,,No pred,,,No pred
6,en[SEP]lawyer & editor in an office,12110,0.974351,Lawyer,,,No pred,,,No pred,,,No pred,,,No pred
7,en[SEP]lawyer (retired) in an office,12110,0.959118,Lawyer,,,No pred,,,No pred,,,No pred,,,No pred
8,en[SEP]lawyer retired in an office,12110,0.98537,Lawyer,,,No pred,,,No pred,,,No pred,,,No pred
9,en[SEP]lawyer clerk in an office,39340,0.845586,Legal Clerk,,,No pred,,,No pred,,,No pred,,,No pred
