# Normalizing custom model analysis results

This notebook demonstrates how to use dictionaries to normalize NLU results from a custom language model that was created from those same dictionaries.

- Step 1: Import saved analysis results
- Step 2: View results
- Step 3: Normalize results
- Step 4: Save results

## Step 1: Import saved analysis results

In the [Custom language model](https://github.com/spackows/CASCON-2019_NLP-workshops/blob/master/notebooks/Notebook-3_Custom-language-model.ipynb) notebook, we saved the custom language model results in a JSON file as a Project Asset.

To import the saved data into this notebook, perform these steps:
1. Open the data panel by clicking on the **Find and Add Data** icon ( <img style="margin: 0px; padding: 0px; display: inline;" src="https://github.com/spackows/CASCON-2019_NLP-workshops/raw/master/images/find-add-data-icon.png"/> )
2. Click on the empty cell below
3. In the data panel, under the file named <code>NLU-results-custom-model.json</code>, click **Insert to code** and then select **Insert Credentials**
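
The inserted cell defines a credentials dictionary (referred to as `credentials_1` later in this notebook). As a minimal sketch, assuming the dictionary carries the six keys that the helper function in the next cell reads, a quick check like this can catch an incomplete insert early. The function name `missing_credential_keys` and the placeholder values are hypothetical and not part of the original notebook.

```python
# Keys read by the copyToNotebookDir helper in this notebook
REQUIRED_KEYS = [
    "IBM_API_KEY_ID", "IAM_SERVICE_ID", "IBM_AUTH_ENDPOINT",
    "ENDPOINT", "BUCKET", "FILE",
]

def missing_credential_keys(credentials):
    """Return the required keys that are absent from the credentials dict."""
    return [key for key in REQUIRED_KEYS if key not in credentials]

# Placeholder values only; real values come from "Insert Credentials"
example_credentials = {
    "IBM_API_KEY_ID": "<api-key>",
    "IAM_SERVICE_ID": "<service-id>",
    "IBM_AUTH_ENDPOINT": "<auth-endpoint>",
    "ENDPOINT": "<endpoint>",
    "BUCKET": "<bucket>",
    "FILE": "NLU-results-custom-model.json",
}

print(missing_credential_keys(example_credentials))  # prints []
```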

In [2]:
# Define a helper function for copying files from  
# Project storage to the notebook working directory
#

from ibm_botocore.client import Config
import ibm_boto3

def copyToNotebookDir( credentials ):
    cos = ibm_boto3.client(
        service_name='s3',
        ibm_api_key_id=credentials['IBM_API_KEY_ID'],
        ibm_service_instance_id=credentials['IAM_SERVICE_ID'],
        ibm_auth_endpoint=credentials['IBM_AUTH_ENDPOINT'],
        config=Config(signature_version='oauth'),
        endpoint_url=credentials['ENDPOINT'])
    cos.download_file(Bucket=credentials['BUCKET'],Key=credentials['FILE'],Filename=credentials['FILE'])
    print( "Done: '" + credentials['FILE'] + "'" )

In [3]:
copyToNotebookDir( credentials_1 )

Done: 'NLU-results-custom-model.json'


In [4]:
import json
with open( credentials_1['FILE'] ) as json_file:
    raw_results_list = json.load(json_file)
raw_results_list[0:3]

[{'header': '-------------------------------------------------------------',
  'message': 'Good morning can you help me upload a shapefile?',
  'actions': ['upload'],
  'objects': ['shapefile'],
  'tech': [],
  'docs': [],
  'persona': [],
  'spacer': ''},
 {'header': '-------------------------------------------------------------',
  'message': 'Good night where to place my file to import it into notebook?',
  'actions': ['import'],
  'objects': ['notebook'],
  'tech': [],
  'docs': [],
  'persona': [],
  'spacer': ''},
 {'header': '-------------------------------------------------------------',
  'message': 'hai how can i do analyze with csv file is there any tutorial on it',
  'actions': ['analyze'],
  'objects': [],
  'tech': [],
  'docs': ['tutorial'],
  'persona': [],
  'spacer': ''}]

## Step 2: View results

View the action words, object words, and technology words, listed from most to least frequent. Only words that occur more than once are shown.

In [35]:
# Define a helper function for counting words
#
from collections import OrderedDict
def countWords( results_list, entity_type ):
    all_words = {}
    for result in results_list:
        words_arr = result[entity_type]
        for word in words_arr:
            if( word not in all_words ):
                all_words[word] = 0
            all_words[word] += 1
    common_words = dict( [ (k,v) for k,v in all_words.items() if v > 1 ] )
    ordered_common_words = OrderedDict( sorted( common_words.items(), key=lambda x:x[1], reverse=True ) )
    return ordered_common_words
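
For comparison, the same counting idea can be sketched with `collections.Counter` from the standard library. This is an alternative illustration, not the notebook's own helper. The name `count_words_sketch`, the `min_count` parameter, and the sample data are hypothetical.

```python
from collections import Counter

def count_words_sketch(results_list, entity_type, min_count=2):
    """Count words of one entity type, keeping those seen at least min_count times."""
    counts = Counter(
        word for result in results_list for word in result[entity_type]
    )
    # most_common() returns (word, count) pairs sorted by descending count
    return [(word, n) for word, n in counts.most_common() if n >= min_count]

# Hypothetical sample data in the same shape as raw_results_list
sample_results = [
    {"actions": ["create", "upload"]},
    {"actions": ["create"]},
    {"actions": ["creating", "upload"]},
    {"actions": ["create"]},
]

print(count_words_sketch(sample_results, "actions"))
# prints [('create', 3), ('upload', 2)]
```

Note that "creating" is dropped here because it occurs only once, which is the same filtering that `countWords` applies.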

In [17]:
actions_counts  = countWords( raw_results_list, "actions" )
objects_counts = countWords( raw_results_list, "objects" )
tech_counts    = countWords( raw_results_list, "tech" )

In [18]:
print( "\nActions" )
actions_counts


Actions


OrderedDict([('create', 10),
             ('upload', 5),
             ('import', 3),
             ('download', 3),
             ('creating', 3),
             ('add', 3),
             ('connection', 3),
             ('connect', 2),
             ('training', 2),
             ('export', 2),
             ('deploy', 2),
             ('signup', 2)])

In [19]:
print( "\nObjects" )
objects_counts


Objects


OrderedDict([('notebook', 14),
             ('project', 8),
             ('model', 7),
             ('account', 4),
             ('Notebook', 3),
             ('trial', 2)])

In [33]:
print( "\nTechnologies" )
tech_counts


Technologies


OrderedDict([('R', 4),
             ('WML', 2),
             ('Github', 2),
             ('python', 2),
             ('jupyter', 2),
             ('spark', 2)])

## Step 3: Normalize results

There are some noisy results above.  For example, in `actions_counts`, "create" and "creating" are counted as separate entities.  But for our analysis purposes, those both refer to the same action.  Instead of being counted separately, they should be counted together.

#### Dictionary files

To train the custom language model, we created dictionary files that looked like this:

[Action words](https://raw.githubusercontent.com/spackows/CASCON-2019_NLP-workshops/master/custom-language-model/dictionaries/action.csv)

```
lemma,poscode,surface
select,2,select,selecting
create,2,create,creating
train,2,train,training
load,2,load,loading,upload,uploading
sign up,2,sign up,sign-up,signup,register,registering
import,2,import,importing,imported
...
```

Given that we went to the trouble of creating those dictionaries, let's use them to *normalize* results.  For example, count "create" and "creating" as two instances of the same action.
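
To make the file layout concrete, here is a minimal sketch of how one dictionary line breaks down into its parts. It assumes the layout shown in the sample above: the lemma first, then the poscode, then one or more surface forms. The function name `parse_dictionary_line` is hypothetical.

```python
def parse_dictionary_line(line):
    """Split one CSV dictionary line into (lemma, poscode, surface_forms)."""
    fields = line.strip().split(",")
    lemma = fields[0]
    poscode = fields[1]
    surface_forms = fields[2:]  # every remaining column is a surface form
    return lemma, poscode, surface_forms

lemma, poscode, surface_forms = parse_dictionary_line(
    "load,2,load,loading,upload,uploading\n"
)
print(lemma)          # prints load
print(surface_forms)  # prints ['load', 'loading', 'upload', 'uploading']
```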

#### Method: _lookup_

The way we'll use those dictionary files to normalize results is this:

From the dictionaries, create a look-up structure of important words that maps any `surface` form back to its `lemma` form.

For example, using the action words dictionary, "loading", "upload", and "uploading" should all map back to: "load".
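
The mapping described above can be sketched as a plain Python dictionary keyed by surface form. This is a simplified illustration using two lines from the sample dictionary. The name `build_lookup` is hypothetical, and unlike the notebook's own code further down, this sketch also maps each lemma to itself.

```python
sample_lines = [
    "create,2,create,creating",
    "load,2,load,loading,upload,uploading",
]

def build_lookup(lines):
    """Map every surface form to its lemma."""
    lookup = {}
    for line in lines:
        fields = line.strip().lower().split(",")
        lemma, surface_forms = fields[0], fields[2:]
        for form in surface_forms:
            # setdefault keeps the first lemma seen for a surface form
            lookup.setdefault(form, lemma)
    return lookup

lookup = build_lookup(sample_lines)
for word in ["loading", "upload", "uploading", "notebook"]:
    # Words that are not in the dictionary are returned unchanged
    print(word, "->", lookup.get(word, word))
```

The first three words map to "load", while "notebook" is not in these two sample lines and passes through as-is.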

#### Other methods

We could use *stemming* or *lemmatization* libraries. But why? We already have these dictionaries of words we care about, so let's just use those!
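
To illustrate the difference, here is a toy comparison. The `strip_suffix` function is a deliberately crude, hypothetical stand-in for rule-based suffix stripping. It is not a real stemming algorithm, and real libraries are far more sophisticated. The point is only that general word-form rules have no way of knowing domain-specific groupings, whereas the dictionary mappings (copied from the lookup output in this notebook) encode them directly.

```python
def strip_suffix(word):
    """Toy suffix stripping; NOT a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Mappings taken from the dictionary-based lookup shown in this notebook
dictionary_lookup = {
    "uploading": "load",
    "signup": "sign up",
    "register": "sign up",
}

for word in ["uploading", "signup", "register"]:
    print(
        word,
        "| suffix rule:", strip_suffix(word),
        "| dictionary:", dictionary_lookup.get(word, word),
    )
```

The suffix rule turns "uploading" into "upload" and leaves "signup" and "register" untouched, while the dictionary groups them under "load" and "sign up".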

In [21]:
import urllib.request
import re

def readSource( url ):
    # Download the file and return its lines as decoded strings
    with urllib.request.urlopen( url ) as content:
        return [ line.decode("utf-8") for line in content ]

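def addLookups( lines_arr, lookup_dict ):
    # Skip the header row (index 0)
    for i in range( 1, len( lines_arr ) ):
        line = lines_arr[i]
        line = re.sub( r"\s+$", "", line )  # strip trailing whitespace and newline
        arr = line.split( "," )
        lemma = arr[0].lower()
        # Columns 0 and 1 are the lemma and poscode.  In the samples shown,
        # column 2 repeats the lemma, so the variants start at column 3.
        for j in range( 3, len( arr ) ):
            variant = arr[j].lower()
            # Keep the first mapping if a variant appears more than once
            if variant not in lookup_dict:
                lookup_dict[ variant ] = lemma
    return lookup_dict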

def readCustomDictionaries( url_arr ):
    lookup_dict = {}
    for url in url_arr:
        lines_arr = readSource( url )
        lookup_dict = addLookups( lines_arr, lookup_dict )
    return lookup_dict

In [40]:
action_dict_url = "https://raw.githubusercontent.com/spackows/CASCON-2019_NLP-workshops/master/custom-language-model/dictionaries/action.csv"
obj_dict_url    = "https://raw.githubusercontent.com/spackows/CASCON-2019_NLP-workshops/master/custom-language-model/dictionaries/obj.csv"
tech_dict_url   = "https://raw.githubusercontent.com/spackows/CASCON-2019_NLP-workshops/master/custom-language-model/dictionaries/tech.csv"

In [41]:
lookup_struct = readCustomDictionaries( [ action_dict_url, obj_dict_url, tech_dict_url ] )
lookup_struct

{'selecting': 'select',
 'creating': 'create',
 'training': 'train',
 'loading': 'load',
 'upload': 'load',
 'uploading': 'load',
 'sign-up': 'sign up',
 'signup': 'sign up',
 'register': 'sign up',
 'registering': 'sign up',
 'importing': 'import',
 'imported': 'import',
 'adding': 'add',
 'recovering': 'recover',
 'changing': 'change',
 'building': 'build',
 'login': 'log in',
 'logging in': 'log in',
 'sign in': 'log in',
 'signing in': 'log in',
 'sign-in': 'log in',
 'signin': 'log in',
 'connecting': 'connect',
 'connection': 'connect',
 'connections': 'connect',
 'deploys': 'deploy',
 'deployed': 'deploy',
 'deploying': 'deploy',
 'setting up': 'set up',
 'setup': 'set up',
 'set-up': 'set up',
 'editing': 'edit',
 'exceeds': 'exceed',
 'exceeded': 'exceed',
 'exceeding': 'exceed',
 'exporting': 'export',
 'analyzing': 'analyze',
 'downloading': 'download',
 'accessing': 'access',
 'acess': 'access',
 'saving': 'save',
 'initiating': 'initiate',
 'preparing': 'prepare',
 'reques

In [42]:
# Normalize results
#

def normalize( word, lookup_struct ):
    if word in lookup_struct:
        return lookup_struct[word]
    else:
        return word
    
normalized_results_list = []
for result in raw_results_list:
    actions_arr = []
    for action in result["actions"]:
        actions_arr.append( normalize( action.lower(), lookup_struct ) )
    objects_arr = []
    for obj in result["objects"]:
        objects_arr.append( normalize( obj.lower(), lookup_struct ) )
    tech_arr = []
    for tech in result["tech"]:
        tech_arr.append( normalize( tech.lower(), lookup_struct ) )
    normalized_results_list.append( { "header"  : result["header"],
                                      "message" : result["message"],
                                      "actions" : actions_arr,
                                      "objects" : objects_arr,
                                      "tech"    : tech_arr,
                                      "docs"    : result["docs"],
                                      "persona" : result["persona"],
                                      "spacer"  : result["spacer"] } )

In [46]:
actions_counts_normalized  = countWords( normalized_results_list, "actions" )
objects_counts_normalized = countWords( normalized_results_list, "objects" )
tech_counts_normalized    = countWords( normalized_results_list, "tech" )
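
The per-field normalization loop used to build `normalized_results_list` can also be written more compactly. The following is a sketch of an equivalent approach for a single result, assuming the same result shape as above. The function name `normalize_result`, the `fields` parameter, and the sample data are hypothetical.

```python
def normalize_result(result, lookup, fields=("actions", "objects", "tech")):
    """Return a copy of result with the given fields lowercased and normalized."""
    normalized = dict(result)  # shallow copy; other fields are kept unchanged
    for field in fields:
        normalized[field] = [
            lookup.get(word.lower(), word.lower()) for word in result[field]
        ]
    return normalized

# Hypothetical sample lookup and result
sample_lookup = {"creating": "create", "upload": "load"}
sample_result = {
    "message": "Creating a Notebook, then upload data",
    "actions": ["Creating", "upload"],
    "objects": ["Notebook"],
    "tech": [],
    "docs": [],
}

normalized = normalize_result(sample_result, sample_lookup)
print(normalized["actions"])  # prints ['create', 'load']
print(normalized["objects"])  # prints ['notebook']
```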

In [49]:
print( "\nActions (normalized)" )
actions_counts_normalized


Actions (normalized)


OrderedDict([('create', 14),
             ('load', 7),
             ('connect', 6),
             ('import', 4),
             ('download', 3),
             ('add', 3),
             ('sign up', 3),
             ('log in', 2),
             ('exceed', 2),
             ('access', 2),
             ('train', 2),
             ('export', 2),
             ('deploy', 2)])

In [50]:
print( "\nObjects (normalized)" )
objects_counts_normalized


Objects (normalized)


OrderedDict([('notebook', 19),
             ('model', 8),
             ('project', 8),
             ('account', 5),
             ('trial', 5),
             ('limit', 2),
             ('endpoint', 2)])

In [52]:
print( "\nTechnologies (normalized)" )
tech_counts_normalized


Technologies (normalized)


OrderedDict([('machine learning', 8),
             ('r', 6),
             ('spark', 4),
             ('cloudant', 3),
             ('object storage', 3),
             ('github', 2),
             ('api', 2),
             ('python', 2),
             ('jupyter', 2)])
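
To see the effect of normalization at a glance, the raw and normalized counts can be compared side by side. This is a minimal sketch: the helper name `count_changes` is hypothetical, and the sample values are a subset copied from the action outputs shown in this notebook. A normalized total can be larger than the sum of the variants listed earlier, presumably because `countWords` hides words that occur only once in the raw results.

```python
def count_changes(before, after):
    """Return {word: (count_before, count_after)} for words whose count changed."""
    changes = {}
    for word, new_count in after.items():
        old_count = before.get(word, 0)  # 0 if the word was not listed before
        if old_count != new_count:
            changes[word] = (old_count, new_count)
    return changes

# Subset of the action counts shown in this notebook
raw_counts = {"create": 10, "creating": 3, "upload": 5, "connection": 3, "connect": 2}
normalized_counts = {"create": 14, "load": 7, "connect": 6}

for word, (old, new) in count_changes(raw_counts, normalized_counts).items():
    print(word, old, "->", new)
```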

## Step 4: Save results

Save the normalized NLU custom model results in a JSON file as a Project Asset.

To save the normalized results as a JSON file asset in our Watson Studio project, we need a project token.

Follow the steps in this topic: [Adding a project token](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/token.html?audience=wdp&context=data)

***The project token is added in the very first cell at the top of the notebook.  Don't forget to scroll up and run that cell.***

(If you forget to run the inserted cell, you'll see the error <code>name 'project' is not defined</code> when you try to run the next cell below.)

In [55]:
project.save_data( 'NLU-results-custom-model-normalized.json', json.dumps( normalized_results_list, indent=3 ) , overwrite=True )

{'file_name': 'NLU-results-custom-model-normalized.json',
 'message': 'File saved to project storage.',
 'bucket_name': 'cascon2019-donotdelete-pr-gsnhbqe4skdcxh',
 'asset_id': '4a5a476b-ce32-49a6-91de-ae09d12c4dfd'}
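
As a local sanity check that does not need the project token, the serialization step can be tried on its own. This sketch uses only the standard `json` module and a hypothetical one-item stand-in for `normalized_results_list`. It confirms that the data survives a round trip through the same `json.dumps( ..., indent=3 )` call used above.

```python
import json

# Hypothetical stand-in for normalized_results_list
sample_normalized = [
    {
        "message": "How do I upload a file?",
        "actions": ["load"],
        "objects": [],
        "tech": [],
    },
]

serialized = json.dumps(sample_normalized, indent=3)
restored = json.loads(serialized)

print(restored == sample_normalized)  # prints True
```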

Copyright © 2019 IBM. This notebook and its source code are released under the terms of the MIT License.