## Generate JSON files for an occupation/industry classification scheme

This notebook takes a user-chosen occupation/industry classification scheme, defined in an XLSX file, and generates JSON files from it. The tool uses these files downstream to return a numeric code and description for an occupation entered by the user. <br>
The returned code and description are the closest match the tool finds to the user's free-text occupation description.<br>
Run the code cells in order to ensure correct operation.  <br>
This assumes all prerequisites are satisfied and setup has been completed following the instructions in the [README.md](README.md).

In [1]:
import os
# Move up two directories so the package import and the relative data path resolve
os.chdir("../../")
from occupationcoder.createdictionaries import build_dict

Specify the (relative) path to the XLSX file that defines the classification scheme, and the name of the sheet that contains the data. The sheet name must be given even if it is "Sheet1". The file assigned to `file_name` is assumed to be in the `data/` subfolder of the repository. In `code_col`, give the name of the column in the XLSX file that contains the numeric occupation codes.

In [2]:
file_name = "data/ISCO-08 EN Structure and definitions.xlsx"
sheet_name = 'ISCO-08 EN Struct and defin'
code_col = 'ISCO 08 Code'

In [3]:
# Read the file and load the data
input_df = build_dict.load_file(filename=file_name,
                                sheet_name=sheet_name,
                                code_col=code_col)

The classification defined in `file_name` / `sheet_name` above will contain some redundant text that should be removed before the classification is written to JSON. 
As defined in the next cell, the column names and the corresponding text to delete relate to the ISCO classification scheme.
Replace the current column names, 'Tasks include' and 'Included occupations', with the names of the columns that contain text to be removed.  <br>
List the phrases to be removed from each column within the square brackets, separated by commas, with each phrase enclosed in quotation marks.  <br>
The values pre-populated below are provided as examples only and should be modified to match the given occupation classification scheme.

In [4]:
exclude_text = {
    'Tasks include': ['Tasks performed', 'Tasks include', 'Tasks performed by', 'usually include'],
    'Included occupations': ['Examples of the occupations classified here:',
                             'Occupations in this major group are classified into the following',
                             'Occupations in this sub-major group are classified into the following',
                             'Occupations in this minor group are classified into the following',
                             'major group', 'minor group', 'sub-major group', 'sub-', 'unit group'],
}
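
To make the effect of `exclude_text` concrete, the following is a minimal sketch of substring removal on a single piece of text. It is not the implementation used by `build_dict`: the helper name `strip_phrases`, the sample sentence, and the choice to remove longer phrases first are all assumptions made for illustration.

```python
def strip_phrases(text, phrases):
    """Remove every phrase in `phrases` from `text` (hypothetical helper)."""
    # Assumption: remove longer phrases first, so that 'Tasks performed by'
    # is matched before its prefix 'Tasks performed'.
    for phrase in sorted(phrases, key=len, reverse=True):
        text = text.replace(phrase, "")
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


# Made-up sample text, not taken from the ISCO spreadsheet.
sample = "Tasks performed by managers usually include planning budgets"
phrases = ["Tasks performed", "Tasks include", "Tasks performed by", "usually include"]
cleaned = strip_phrases(sample, phrases)
print(cleaned)
```

If phrases were removed in the listed order instead, 'Tasks performed' would be stripped first and a stray 'by' would remain, so it is worth checking the generated JSON for leftovers of this kind.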

Specify the names of the columns containing the following information:

- `code_col` (str): Name of the column in the XLSX file (`file_name` / `sheet_name`) that contains the numeric codes for the occupation descriptions in the classification scheme.
- `bucket_cols` (list): List of strings corresponding to the column names to be processed into word buckets.
- `exact_col` (str, optional): Name of the dataframe column containing expected exact job title matches. Default: ''.
- `exact_col_split` (str, optional): Character string on which the job title matches in `exact_col` are split, for example hard returns ('\n') or dashes ('-'). Only needed when `exact_col` is specified.
- `exclude_text` (dict, optional): Dictionary of lists of strings, where keys are the columns from which substrings should be removed and values are the lists of substrings to remove from the given column. Only needed if `exact_col` is specified.
- `exclude_pattern` (str, optional): Regular expression to remove from any given column.
- `output_files` (dict): Dictionary specifying the output file names for the word bucket JSON and the exact match JSON (if needed). Keys should be 'buckets' and 'exact', and values should be strings of output file names. Default: {'buckets': 'buckets.json', 'exact': 'titles.json'}.
- `bucket_field_names` (list, optional): List of strings specifying the field titles in the word bucket JSON. Default: ['code', 'description'].
- `level` (list, optional): List of level codes (numbers) to process. If not specified (default: None), all levels present in the input are included.
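
As an illustration of how `exact_col_split` and `exclude_pattern` might interact, the sketch below splits one cell of job titles on a separator after removing a regex pattern. The helper `split_titles`, the sample cell, and the pattern are hypothetical and only show the idea; the actual processing in `build_dict.process_file` may differ.

```python
import re


def split_titles(cell, sep="\n", exclude_pattern=None):
    """Split one cell of job titles into a cleaned list (hypothetical helper)."""
    if exclude_pattern:
        # Drop everything matching the regex before splitting.
        cell = re.sub(exclude_pattern, "", cell)
    # Split on the separator and trim surrounding whitespace.
    titles = [part.strip() for part in cell.split(sep)]
    # Discard empty entries produced by blank lines.
    return [title for title in titles if title]


# Made-up cell content, not taken from the ISCO spreadsheet.
cell = "Chef (head)\nPastry cook\n\nSous-chef (2nd)"
titles = split_titles(cell, sep="\n", exclude_pattern=r"\s*\([^)]*\)")
print(titles)
```

Here the pattern removes parenthesised qualifiers, and the blank line in the cell does not produce an empty title.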

The values pre-populated below are provided as examples only and should be modified to match the given input.

In [5]:
build_dict.process_file(input_df=input_df,
                        code_col=code_col,
                        bucket_cols=['Title EN', 'Definition', 'Tasks include'],
                        exact_col='Included occupations',
                        exact_col_split='\n',
                        exclude_text=exclude_text,
                        output_files={'buckets': 'isco/buckets_isco.json',
                                      'exact': 'isco/titles_isco.json'},
                        bucket_field_names=['isco_code', 'Titles_nospace'],
                        level=4)
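
Once the cell above has run, it can be useful to confirm that the JSON files were written and to look at a few entries. The sketch below uses only the standard library; `preview_json` is a hypothetical helper, and both the location where the output files end up and their internal structure are assumptions to check against your own run.

```python
import json
from pathlib import Path


def preview_json(path, n=3):
    """Load a JSON file and return (number_of_entries, first n entries)."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Expected output file not found: {path}")
    with path.open(encoding="utf-8") as fh:
        data = json.load(fh)
    # Handle both a top-level list and a top-level dict.
    items = list(data.items()) if isinstance(data, dict) else list(data)
    return len(items), items[:n]
```

For example, `count, head = preview_json("isco/buckets_isco.json")` would return the number of entries and the first three of them, provided the path is adjusted to wherever the file was actually written.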