## Entity Extraction with Comprehend Entity Detection and POS (Part of Speech) Tagging

In sentences where Comprehend finds a commercial item, it's important to detect every word that makes up the entity's full name.


In this section I take the Comprehend entity detection results, keep the commercial items, and build each item's full name.


For example, in the phrase "My least favorite, that I've tried, is Maybelline Dream Liquid Mousse ." Comprehend only detects "Dream" as a commercial item, but we want the full name, which is "Maybelline Dream Liquid Mousse".


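To get the full name we can use the results of POS tagging: the name is bounded on either side by verbs, adverbs, adjectives, and other parts of speech that do not belong to it, so we extend the detected entity in both directions until we reach one of them.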
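
The expansion idea can be tried offline before calling Comprehend. The sketch below is only an illustration: the POS tags are hand-assigned rather than real API output, the token dictionaries merely mimic the shape of the `SyntaxTokens` entries, and `build_tokens`, `expand_entity` and `NAME_TAGS` are hypothetical names that do not come from this notebook's `utils` module. It assumes that only nouns and proper nouns can belong to a product name.

```python
# Hand-labelled example; the tags are illustrative assumptions, not Comprehend output.
phrase = "My least favorite is Maybelline Dream Liquid Mousse ."
tags = ["PRON", "ADJ", "ADJ", "AUX", "PROPN", "PROPN", "PROPN", "PROPN", "PUNCT"]

# Assumption: only these tags may appear inside a product name.
NAME_TAGS = {"NOUN", "PROPN"}


def build_tokens(text, tags):
    """Split on whitespace and attach character offsets and a POS tag to each word."""
    tokens, cursor = [], 0
    for word, tag in zip(text.split(), tags):
        begin = text.index(word, cursor)
        end = begin + len(word)
        tokens.append({"Text": word, "BeginOffset": begin, "EndOffset": end,
                       "PartOfSpeech": {"Tag": tag}})
        cursor = end
    return tokens


def expand_entity(tokens, begin, end):
    """Grow the span [begin, end) left and right while the neighbours look like name parts."""
    first = next(i for i, t in enumerate(tokens) if t["BeginOffset"] >= begin)
    last = max(i for i, t in enumerate(tokens) if t["EndOffset"] <= end)
    while first > 0 and tokens[first - 1]["PartOfSpeech"]["Tag"] in NAME_TAGS:
        first -= 1
    while last < len(tokens) - 1 and tokens[last + 1]["PartOfSpeech"]["Tag"] in NAME_TAGS:
        last += 1
    return " ".join(t["Text"] for t in tokens[first:last + 1])


tokens = build_tokens(phrase, tags)
begin = phrase.index("Dream")  # pretend the entity detector returned only "Dream"
full_name = expand_entity(tokens, begin, begin + len("Dream"))
print(full_name)
```

With these hand-assigned tags, the span stops at "is" on the left and at the final period on the right, because neither tag is in `NAME_TAGS`.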

In [8]:
import os
import pickle
import re

import boto3
from IPython.display import display, HTML
from utils import string_process, get_phrases_index_comment, get_item_phrase, get_item_offset, is_valid_pos

In [2]:
comprehend = boto3.client('comprehend')

In [3]:
comment_count = 0
entity_count = 0
display_html = ''

In [4]:
def process_submission(filename):
    with open(f'submissions/{filename}', 'rb') as handle:
        comments = pickle.load(handle)
    for comment in comments:
        process_comment(comment['text'])
    

In [5]:
def process_comment(comment):
    global comment_count
    global entity_count
    global display_html
    comment_count += 1 
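    # Make sure the comment ends with a period (also safe for empty comments),
    # then pad every period with spaces so it becomes a separate token
    if not comment.endswith('.'):
        comment = comment + '.'
    comment = comment.replace('.', ' . ')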
    
    entities = comprehend.detect_entities(Text=comment,LanguageCode='en')['Entities']
    commercial_items = [ent for ent in entities if ent['Type'] == 'COMMERCIAL_ITEM']
    phrase_sep = get_phrases_index_comment(comment)
    for item in commercial_items:
        entity_count += 1
        try:
            item_phrase = get_item_phrase(item['BeginOffset'],item['EndOffset'],phrase_sep,comment)
            item_position = get_item_offset(item['Text'],item_phrase)
            pos_tags = comprehend.detect_syntax(Text=item_phrase,LanguageCode='en')['SyntaxTokens']
            full_item_name = ''
            # Traverse the tokens forwards, starting at the detected entity
            for i in range(0,len(pos_tags)):
                if pos_tags[i]['BeginOffset'] >= item_position[0]:
                    if pos_tags[i]['EndOffset'] <= item_position[1]:
                        full_item_name += ' ' +  str(pos_tags[i]['Text'])
                    else:
                        if is_valid_pos(pos_tags[i]):
                            full_item_name += ' ' +  str(pos_tags[i]['Text'])
                        else:
                            break

            # Traverse the tokens backwards, considering only those that end before the entity
            for i in range(len(pos_tags) -1,-1,-1):
                if pos_tags[i]['EndOffset'] < item_position[0]:
                    if is_valid_pos(pos_tags[i]):
                        full_item_name = str(pos_tags[i]['Text']) + ' ' +  full_item_name 
                    else:
                        break

            display_html += f"<hr> The entity <b> {item['Text']} </b>  <br /> in the  phrase  <b> {item_phrase} </b> <br/> "\
                 f"becomes <b> {full_item_name.strip()}</b> with POS tagging <hr>"
        except Exception:
            # Skip this item on any error (for example, a phrase or offset lookup that fails)
            pass
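
The helpers `get_phrases_index_comment` and `get_item_phrase` come from a local `utils` module that is not shown here, so their exact behavior is unknown. As a rough guess at the idea, the sketch below splits a comment into phrases at each period and returns the phrase that contains a given entity span. The names `phrase_bounds` and `phrase_containing` are hypothetical and the real helpers may work differently.

```python
def phrase_bounds(comment, sep="."):
    """Return (start, end) character ranges of the phrases delimited by sep."""
    bounds, start = [], 0
    for i, ch in enumerate(comment):
        if ch == sep:
            bounds.append((start, i + 1))
            start = i + 1
    # Keep a trailing phrase only if it contains something other than whitespace
    if comment[start:].strip():
        bounds.append((start, len(comment)))
    return bounds


def phrase_containing(comment, begin, end, bounds):
    """Return the phrase whose range fully covers [begin, end), or None."""
    for lo, hi in bounds:
        if lo <= begin and end <= hi:
            return comment[lo:hi].strip()
    return None


# The comment is already padded the way process_comment pads periods
comment = "I like it . Dream is too thick . "
bounds = phrase_bounds(comment)
begin = comment.index("Dream")
phrase = phrase_containing(comment, begin, begin + len("Dream"), bounds)
print(phrase)
```

Restricting the syntax call to a single phrase keeps the token list short and makes the entity offsets relative to that phrase, which is presumably why the notebook recomputes the item position with `get_item_offset`.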

In [6]:
for file in os.listdir('submissions'):
    # Only pickled submissions are processed, so only those get a header
    if file.endswith('.pickle'):
        display_html += f'Submission  <u> {file} </u> <br /> '
        process_submission(file)

In [7]:
print(f'Found {entity_count} commercial items in {comment_count} comments')
display(HTML(display_html))

Found 414 commercial items in 351 comments
