# Process Synonyms

This notebook combines Python libraries (NLTK and inflect) with the Google Natural Language API (machine learning) to expand the chatbot's vocabulary by generating synonyms for the topics created in the previous notebook.

In [1]:
!pip uninstall -y google-cloud-datastore

Uninstalling google-cloud-datastore-1.2.0:
  Successfully uninstalled google-cloud-datastore-1.2.0


In [2]:
!pip install google-cloud-datastore==1.2.0

Collecting google-cloud-datastore==1.2.0
  Using cached https://files.pythonhosted.org/packages/ab/04/c261a6236a846dd2aeb4dd74ac7ddc8012b00434a9661d31ad8b7a9bd9b6/google_cloud_datastore-1.2.0-py2.py3-none-any.whl
Installing collected packages: google-cloud-datastore
Successfully installed google-cloud-datastore-1.2.0


Select Reset Session > Restart, then resume with the following cells.

In [1]:
# Only need to do this once...
!pip install inflect

Collecting inflect
  Downloading https://files.pythonhosted.org/packages/86/02/e6b11020a9c37d25b4767a1d0af5835629f6e75d6f51553ad07a4c73dc31/inflect-2.1.0-py2.py3-none-any.whl (40kB)
Installing collected packages: inflect
Successfully installed inflect-2.1.0


In [2]:
# Only need to do this once...
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /content/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /content/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [3]:
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

In [4]:
from google.cloud import datastore

In [5]:
datastore_client = datastore.Client()

In [6]:
# Fetch every Topic entity, reusing the Datastore client created above.
query = datastore_client.query(kind='Topic')
results = list(query.fetch())

In [7]:
import inflect
plurals = inflect.engine()

## Extract Synonyms with Python
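Split each topic name into words and look up synonyms for each word in WordNet, a lexical database used here as a "thesaurus" through NLTK. Only noun senses contribute synonyms, each synonym is also stored in its plural form, and a word that has any adjective sense gets no dictionary synonyms at all. Store the synonyms in Datastore and link them back to the topic. Note that this section uses "stop words" to filter out articles and other function words that don't contribute to the meaning of the topic.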
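The stop-word step can be illustrated without NLTK. The following is a minimal sketch, assuming a small hand-written stop-word set in place of the NLTK English list; `STOP_WORDS`, `content_words`, and the topic "leave of absence" are hypothetical names and data used only for illustration.

```python
# Hypothetical, abbreviated stop-word set. The notebook itself uses
# nltk.corpus.stopwords.words('english'), which is much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def content_words(topic, stop_words=STOP_WORDS):
    """Split a topic name on whitespace and drop any stop words."""
    return [word for word in topic.split() if word not in stop_words]


print(content_words("leave of absence"))  # hypothetical topic name
print(content_words("annual salary"))
```

The comparison is case-sensitive, as it is in the notebook, so this sketch assumes topic names are already lower case.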

In [8]:
from nltk.corpus import wordnet
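# Python 2 only: the sets module is deprecated in favor of the built-in set.
from sets import Set

for result in results:
  for word in result.key.name.split():

    # Skip articles, prepositions and other stop words.
    if word in stop:
      continue

    synonyms = Set()
    for syn in wordnet.synsets(word):

      # Noun senses: keep each purely alphabetic lemma and its plural.
      if ".n." in str(syn):
        for l in syn.lemmas():
          lemma = l.name()
          if (lemma.isalpha()):
            synonyms.add(lemma)
            synonyms.add(plurals.plural(lemma))

      # Any adjective sense: discard every synonym collected for this word.
      if ".a." in str(syn):
        synonyms = Set()
        break

    print result.key.name, word, synonyms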
    
    # Store the full topic name, the word itself, each dictionary synonym
    # and the word's plural as Synonym entities that point back to the topic.
    kind = 'Synonym'
    names = [result.key.name, word] + list(synonyms) + [plurals.plural(word)]

    for name in names:
      synonym_key = datastore_client.key(kind, name)

      synonym = datastore.Entity(key=synonym_key)
      synonym['synonym'] = result.key.name

      datastore_client.put(synonym)
    



annual salary annual Set([])
annual salary salary Set([u'wage', u'salary', u'remuneration', u'pay', u'salaries', u'earnings', u'pays', u'wages', u'earning', u'remunerations'])
compassionate leave compassionate Set([])
compassionate leave leave Set([u'partings', u'leaves', u'farewells', u'leave', u'farewell', u'parting'])
confidential information confidential Set([])
confidential information information Set([u'info', u'information', u'informations', u'entropy', u'entropies', u'infos', u'data', u'datas'])
disability leave disability Set([u'handicap', u'disabilities', u'disability', u'disablements', u'handicaps', u'disablement', u'impairment', u'impairments'])
disability leave leave Set([u'partings', u'leaves', u'farewells', u'leave', u'farewell', u'parting'])
discipline discipline Set([u'discipline', u'bailiwicks', u'disciplines', u'fields', u'study', u'field', u'subjects', u'bailiwick', u'corrections', u'studies', u'correction', u'subject'])
employee classifications employee Set([u'empl
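Each `Synonym` entity is keyed by the synonym text and holds a single topic name, so the stored data behaves like a dictionary from synonym to topic. The sketch below models that with a plain `dict` as a stand-in for Datastore; `build_synonym_index` is a hypothetical helper, and the synonym lists are a small subset of the output above. Because `put` overwrites an entity that already exists under the same key, a word shared by two topics (such as "leave") ends up pointing at whichever topic was written last.

```python
def build_synonym_index(topic_synonyms):
    """Map every synonym key to one topic; later writes overwrite earlier ones."""
    index = {}
    for topic, synonyms in topic_synonyms:
        index[topic] = topic  # the full topic name is stored as its own synonym
        for synonym in synonyms:
            index[synonym] = topic
    return index


# Subset of the synonyms printed above, in the same topic order.
topic_synonyms = [
    ("annual salary", ["salary", "wage", "pay"]),
    ("compassionate leave", ["leave", "leaves"]),
    ("disability leave", ["disability", "leave", "leaves"]),
]

index = build_synonym_index(topic_synonyms)
print(index["wage"])
print(index["leave"])
```

If shared words should map to several topics, one option would be to store a list of topics per synonym instead of a single value; that is a design change the notebook does not make.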

## Extract Synonyms with Machine Learning
Use the Google Natural Language API (machine learning) to score the "salience" of each entity found in a topic's text, that is, its relevance to the overall meaning of the passage. Entities whose salience exceeds a threshold are stored as synonyms for the topic, except entities of type ORGANIZATION, which are skipped. Note: the salience threshold is currently set at 0.1. You can adjust this threshold to see the effect on synonym generation.
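The effect of the threshold can be explored offline. This is a minimal sketch, assuming a hypothetical `Entity` namedtuple in place of the objects the API returns and a hypothetical `select_synonyms` helper; the first four salience values are rounded from the notebook's sample output for the "renovations" topic, and the ORGANIZATION row is invented for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the entity objects returned by the API.
Entity = namedtuple("Entity", ["name", "type", "salience"])


def select_synonyms(entities, threshold=0.1, excluded_types=("ORGANIZATION",)):
    """Return lower-cased names of entities above the salience threshold."""
    return [
        entity.name.lower()
        for entity in entities
        if entity.type not in excluded_types and entity.salience > threshold
    ]


entities = [
    Entity("odours", "OTHER", 0.272),
    Entity("renovations", "EVENT", 0.151),
    Entity("building materials", "OTHER", 0.139),
    Entity("noise levels", "OTHER", 0.139),
    Entity("Employer", "ORGANIZATION", 0.2),  # hypothetical entity
]

print(select_synonyms(entities))                  # default threshold of 0.1
print(select_synonyms(entities, threshold=0.15))  # stricter threshold
```

Raising the threshold from 0.1 to 0.15 drops "building materials" and "noise levels" from this sample, which gives fewer but more central synonyms.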

In [9]:
!pip install google-cloud-language

Collecting google-cloud-language
  Downloading https://files.pythonhosted.org/packages/95/03/af29755711f5e30e6228f03083de0b06089c8ea41e4a67a172e7e65daaa4/google_cloud_language-1.1.0-py2.py3-none-any.whl (64kB)
Installing collected packages: google-cloud-language
Successfully installed google-cloud-language-1.1.0


In [10]:
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types

In [11]:
# Create the API client once and reuse it for every topic.
client = language.LanguageServiceClient()

for result in results:

  print result.key.name, '*'*5, result['action_text']

  # Wrap the topic text in a plain-text Document for analysis.
  document = types.Document(
        content=result['action_text'],
        type=enums.Document.Type.PLAIN_TEXT)

  entities = client.analyze_entities(document).entities

  # entity types from enums.Entity.Type
  entity_type = ('UNKNOWN', 'PERSON', 'LOCATION', 'ORGANIZATION',
                   'EVENT', 'WORK_OF_ART', 'CONSUMER_GOOD', 'OTHER')

  for entity in entities:
    # Keep entities above the 0.1 salience threshold that are not organizations.
    if (entity_type[entity.type] != 'ORGANIZATION') and (entity.salience > 0.1):
      
      print('=' * 20)
      print('         name: %s' % (entity.name.lower()))
      print('         type: %s' % (entity_type[entity.type]))
      print('     salience: %s' % (entity.salience))
      
      kind = 'Synonym'
      synonym_key = datastore_client.key(kind, entity.name.lower())
      
      synonym = datastore.Entity(key=synonym_key)
      synonym['synonym'] = result.key.name

      datastore_client.put(synonym)

annual salary ***** Salaries shall be determined by the Executive Director, based on budget considerations and commensurate with the experience of the successful candidate.   The organization shall pay employees on a bi-weekly basis, less the usual and necessary statutory and other deductions payable in accordance with the Employer’s standard payroll practices.  These payroll practices may be changed from time to time at the Employer’s sole discretion.  Currently, payday occurs every second Thursday and covers the pay period ended the previous Saturday.

         name: salaries
         type: OTHER
     salience: 0.224026843905
         name: executive director
         type: PERSON
     salience: 0.224026843905
compassionate leave ***** [THE ORGANIZATION] will grant up to three (3) working days per event on the occasion of a death in the staff member’s immediate family.  Immediate family is defined as: parent(s), step parent(s), foster parent(s), sibling(s), grandparent(s), spouse (in

         name: copyrights
         type: OTHER
     salience: 0.105179071426
         name: patents
         type: OTHER
     salience: 0.105179071426
jury duty ***** Employees will be allowed up to two (2) weeks paid time off for jury duty.  After that, employees will be asked to continue jury duty without pay.  Any compensation, covering the first two (2) weeks, received from the court system shall be surrendered to the Organization.  A copy of the notice to serve should be provided for inclusion in the employee’s personnel file.

         name: employees
         type: PERSON
     salience: 0.439145386219
         name: jury duty
         type: OTHER
     salience: 0.166851088405
layoff ***** Operation requirements are subject to change based on workload and the funding levels received on an annual basis.  All efforts will be made to keep staff in a position similar, in scope and salary, to that they have become accustom to.  If the organization is unable to do this, then employees 

         name: employment opportunities
         type: OTHER
     salience: 0.729986727238
renovations ***** As odours from building materials and noise levels for tools can cause discomfort to employees, renovations will be scheduled to have a minimum impact on employees.  This may include renovating during non work hours (evenings & weekends) and ensuring direct ventilation to control fumes.  Carpets should be installed and cloth furniture unwrapped late in the day so emissions may occur during non working hours.

         name: odours
         type: OTHER
     salience: 0.272073924541
         name: renovations
         type: EVENT
     salience: 0.15124745667
         name: building materials
         type: OTHER
     salience: 0.138610512018
         name: noise levels
         type: OTHER
     salience: 0.138610512018
resignation ***** After completion of the first ninety (90) days of the probationary period, employees must give the Employer two (2) weeks’ notice of resignation. 