# Text Mining Process for Face Recognition Data

## Getting the desired text from the patents

We start by parsing the files that contain the patents and collecting the text that is relevant to our topic.
This can take a few minutes.

In [1]:
import os
import re

path = 'C:\\Users\\Stefano\\Desktop\\Stefano\\Business\\Project\\Material\\MyPatents'
print("Getting files...")
# getting all files from the directory given by the path
files = os.listdir(path)
# moving to the desired directory
os.chdir(path)
text = ""
print("START!")
for filename in files:
    # the context manager closes each file as soon as it has been read
    with open(filename, encoding="utf-8") as file:
        text += file.read()
print("END!")


Getting files...
START!
END!


After obtaining the full text of the files, we extract the abstract and the claims, the two sections we use in our analysis.

In [2]:
# capture everything between <abstract> and </abstract>; (?s:...) lets '.' match newlines
p1 = re.findall(r'(?<=<abstract>\n)(?s:.*?)(?=\n</abstract>)', text)
print(p1)
print("ABSTRACT TEXT OBTAINED!")

['An electronic apparatus including an image capturing device, a storage device and a processor and an operation method thereof are provided. The image capturing device captures an image for a user, and the storage device records a plurality of modules. The processor is coupled to the image capturing device and the storage device and is configured to: configure the image capturing device to capture a head image of a user; perform a face recognition operation to obtain a face region; detect a plurality of facial landmarks within the face region; estimate a head posture angle of the user according to the facial landmarks; calculate a gaze position where the user gazes on the screen according to the head posture angle, a plurality of rotation reference angle, and a plurality of predetermined calibration positions; and configure the screen to display a corresponding visual effect according to the gaze position.', 'The present disclosure provides a computation method and product thereof. Th

In [3]:
# capture everything between <claims> and </claims>; (?s:...) lets '.' match newlines
p2 = re.findall(r'(?<=<claims>\n)(?s:.+?)(?=\n</claims>)', text)
print(p2)
print("CLAIM TEXT OBTAINED!")

CLAIM TEXT OBTAINED!
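
The two extraction patterns above can be checked on a small hand-made string before running them on the full corpus. The sketch below is illustrative only: `sample` and `extract_sections` are hypothetical names, and the tag layout (tag, newline, body, newline, closing tag) is assumed from the patterns used in the cells above.

```python
import re

# Toy stand-in for the concatenated patent files; the tag layout
# (tag, newline, body, newline, closing tag) is an assumption.
sample = (
    "<abstract>\nA method for face recognition.\n</abstract>\n"
    "<claims>\n1. A method comprising:\ncapturing an image.\n</claims>\n"
    "<abstract>\nA gaze tracking apparatus.\n</abstract>\n"
)

def extract_sections(text, tag):
    """Return the body of every <tag>...</tag> block found in text."""
    # re.DOTALL lets '.' match newlines, so multi-line bodies are captured;
    # the lazy '.*?' stops at the first closing tag instead of the last.
    pattern = rf"(?<=<{tag}>\n).*?(?=\n</{tag}>)"
    return re.findall(pattern, text, flags=re.DOTALL)

abstracts = extract_sections(sample, "abstract")
claims = extract_sections(sample, "claims")
print(abstracts)
print(claims)
```

Each list element is the body of one section, so the number of elements should match the number of patents that contain that section.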


### Text cleaning
First we lowercase the text of both sections, then we normalize whitespace and remove punctuation and digits.

In [4]:
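# lower() is a Python string method
lower_atext = ""
for abstract_text in p1:
    # append each lowercased abstract to a single string holding all the text;
    # no separator is added, so the last word of one abstract runs into the
    # first word of the next (see "position.the" in the output below)
    lower_atext += abstract_text.lower()
lower_atext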

"an electronic apparatus including an image capturing device, a storage device and a processor and an operation method thereof are provided. the image capturing device captures an image for a user, and the storage device records a plurality of modules. the processor is coupled to the image capturing device and the storage device and is configured to: configure the image capturing device to capture a head image of a user; perform a face recognition operation to obtain a face region; detect a plurality of facial landmarks within the face region; estimate a head posture angle of the user according to the facial landmarks; calculate a gaze position where the user gazes on the screen according to the head posture angle, a plurality of rotation reference angle, and a plurality of predetermined calibration positions; and configure the screen to display a corresponding visual effect according to the gaze position.the present disclosure provides a computation method and product thereof. the com

In [5]:
lower_ctext = ""
for claim_text in p2:
    # append each lowercased claims section to a single string holding all the text
    lower_ctext += claim_text.lower()
lower_ctext



In [6]:
# whitespace removal for both sections: collapse any run of whitespace into one space
def remove_whitespace(text):
    return " ".join(text.split())

lowera_text = remove_whitespace(lower_atext)
lowera_text
lowerc_text = remove_whitespace(lower_ctext)
lowerc_text



In [7]:
#punctuation and digits removal: we replace any undesired character with a ''
for char in '?.,!/;:()1234567890':  
    lowera_text = lowera_text.replace(char,'')
print(lowera_text)
for char in '?.,!/;:()1234567890':  
    lowerc_text = lowerc_text.replace(char,'')
print(lowerc_text)

an electronic apparatus including an image capturing device a storage device and a processor and an operation method thereof are provided the image capturing device captures an image for a user and the storage device records a plurality of modules the processor is coupled to the image capturing device and the storage device and is configured to configure the image capturing device to capture a head image of a user perform a face recognition operation to obtain a face region detect a plurality of facial landmarks within the face region estimate a head posture angle of the user according to the facial landmarks calculate a gaze position where the user gazes on the screen according to the head posture angle a plurality of rotation reference angle and a plurality of predetermined calibration positions and configure the screen to display a corresponding visual effect according to the gaze positionthe present disclosure provides a computation method and product thereof the computation method
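
The three cleaning steps above (lowercasing, whitespace normalization, punctuation and digit removal) could also be folded into a single helper. This is a minimal sketch, not the notebook's own code: `clean_text` and `REMOVE` are hypothetical names, and the character set is copied from the cell above. Unlike the cells above, it collapses whitespace last, so no double spaces remain where a character was deleted.

```python
import string

# Characters stripped in the notebook's punctuation step, plus digits.
REMOVE = "?.,!/;:()" + string.digits

def clean_text(raw):
    """Lowercase, drop punctuation and digits, then collapse whitespace."""
    lowered = raw.lower()
    # str.translate deletes every character listed in REMOVE in one pass.
    stripped = lowered.translate(str.maketrans("", "", REMOVE))
    # split() with no argument splits on any whitespace run.
    return " ".join(stripped.split())

print(clean_text("The  Processor (12) is coupled\nto the device; it detects 3 faces."))
```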

### KeyWord Cleaning
We do a preliminary keyword removal: the highest-ranked keyphrases and keywords are generic terms in this corpus (for example image, user, device), so we strip them from the text before the main analysis.

In [8]:
import pke

# initialize keyphrase extraction model, here TopicRank
print("Initializing extractor...")
extractor = pke.unsupervised.TopicRank()

# load the content of the document; here the document is passed as a raw
# string and preprocessing is carried out using spacy
print("Loading text...")
extractor.load_document(input=lowera_text, language='en')

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. `(Noun|Adj)*`)
print("Candidate Selection...")
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm
print("Weighting...")
extractor.candidate_weighting()

# N-best selection, keyphrases contains the 10 highest scored candidates as
# (keyphrase, score) tuples
print("Selecting 10 best candidates...")
keyphrases = extractor.get_n_best(n=10)
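for phrase, score in keyphrases:  # avoid shadowing the built-in name tuple
    print(phrase)
    # plain substring replacement: the phrase is also removed inside longer words
    lowera_text = lowera_text.replace(phrase, '')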
print(keyphrases)

Initializing extractor...
Loading text...
Candidate Selection...
Weighting...
Selecting 10 best candidates...
image
computation method
user
face recognition operation
device
face region
plurality
processor
system
facial recognitiona camera
[('image', 0.04694284852728213), ('computation method', 0.031442039493363806), ('user', 0.027788302431730104), ('face recognition operation', 0.02196271915915034), ('device', 0.020319617325812032), ('face region', 0.019573677639206713), ('plurality', 0.01656449323088685), ('processor', 0.01628557320191861), ('system', 0.014369073395178037), ('facial recognitiona camera', 0.013183701899647368)]


In [9]:
# initialize keyphrase extraction model, here TopicRank
print("Initializing extractor...")
extractor = pke.unsupervised.TopicRank()

# load the content of the document; here the document is passed as a raw
# string and preprocessing is carried out using spacy
print("Loading text...")
extractor.load_document(input=lowerc_text, language='en')

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. `(Noun|Adj)*`)
print("Candidate Selection...")
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm
print("Weighting...")
extractor.candidate_weighting()

# N-best selection, keyphrases contains the 10 highest scored candidates as
# (keyphrase, score) tuples
print("Selecting 10 best candidates...")
keyphrases = extractor.get_n_best(n=10)
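for phrase, score in keyphrases:  # avoid shadowing the built-in name tuple
    print(phrase)
    # plain substring replacement: the phrase is also removed inside longer words
    lowerc_text = lowerc_text.replace(phrase, '')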
print(keyphrases)

Initializing extractor...
Loading text...
Candidate Selection...
Weighting...
Selecting 10 best candidates...
claim
image
method
plurality
electronic device
user
faces
second face recognition operation
processor
feature tensors
[('claim', 0.04382683192815315), ('image', 0.027245173031580146), ('method', 0.02484087304737162), ('plurality', 0.02188169701653695), ('electronic device', 0.016951377660441365), ('user', 0.01656228675250289), ('faces', 0.015134349086063339), ('second face recognition operation', 0.013276317914395708), ('processor', 0.013040922806219603), ('feature tensors', 0.012906305396095897)]


In [10]:
from keybert import KeyBERT
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(lowera_text)
for keyword, score in keywords:  # avoid shadowing the built-in name tuple
    print(keyword)
    lowera_text = lowera_text.replace(keyword, '')

print(kw_model.extract_keywords(lowera_text, keyphrase_ngram_range=(1, 1), stop_words=None))

recognition
recognizing
recognitiona
features
classification
[('detection', 0.4378), ('classifier', 0.4285), ('computing', 0.4276), ('classifying', 0.4272), ('supervised', 0.4258)]


In [11]:
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(lowerc_text)
for keyword, score in keywords:  # avoid shadowing the built-in name tuple
    print(keyword)
    lowerc_text = lowerc_text.replace(keyword, '')

print(kw_model.extract_keywords(lowerc_text, keyphrase_ngram_range=(1, 1), stop_words=None))

calibration
tracking
recognition
orientation
recognizing
[('analyses', 0.4213), ('capturing', 0.4131), ('posture', 0.4113), ('gaze', 0.4086), ('vision', 0.4033)]
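
The removal cells above use `str.replace`, which also deletes a keyword where it is only part of a longer word (removing `image` turns `images` into `s`). A whole-word variant could look like the sketch below; `remove_phrases` is a hypothetical helper, not part of pke or KeyBERT, and the sample sentence is made up for illustration.

```python
import re

def remove_phrases(text, phrases):
    """Delete each phrase only where it appears as whole words."""
    if not phrases:
        return text
    # Longest phrases first, so 'face region' is tried before 'face'.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"
    without = re.sub(pattern, " ", text)
    # Collapse the gaps left behind by the removed phrases.
    return " ".join(without.split())

text = "the user captures images and the image shows a face region"
print(remove_phrases(text, ["image", "user", "face region"]))
```

Here `images` survives because `\b` requires a word boundary on both sides of the match.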


### Tokenization
In this step we tokenize the text of both sections.

In [12]:
import nltk
# the output is a list, where each element is a sentence of the text; punctuation was
# removed above, so few sentence boundaries are left to split on
nltk.sent_tokenize(lowera_text)
nltk.sent_tokenize(lowerc_text)



In [13]:
# the output is a list, where each element is a token of the original text
tokenized_text_a = nltk.word_tokenize(lowera_text)
print(tokenized_text_a)

tokenized_text_c = nltk.word_tokenize(lowerc_text)
print(tokenized_text_c)

['an', 'electronic', 'apparatus', 'including', 'an', 'capturing', 'a', 'storage', 'and', 'a', 'and', 'an', 'operation', 'method', 'thereof', 'are', 'provided', 'the', 'capturing', 'captures', 'an', 'for', 'a', 'and', 'the', 'storage', 'records', 'a', 'of', 'modules', 'the', 'is', 'coupled', 'to', 'the', 'capturing', 'and', 'the', 'storage', 'and', 'is', 'configured', 'to', 'configure', 'the', 'capturing', 'to', 'capture', 'a', 'head', 'of', 'a', 'perform', 'a', 'to', 'obtain', 'a', 'detect', 'a', 'of', 'facial', 'landmarks', 'within', 'the', 'estimate', 'a', 'head', 'posture', 'angle', 'of', 'the', 'according', 'to', 'the', 'facial', 'landmarks', 'calculate', 'a', 'gaze', 'position', 'where', 'the', 'gazes', 'on', 'the', 'screen', 'according', 'to', 'the', 'head', 'posture', 'angle', 'a', 'of', 'rotation', 'reference', 'angle', 'and', 'a', 'of', 'predetermined', 'calibration', 'positions', 'and', 'configure', 'the', 'screen', 'to', 'display', 'a', 'corresponding', 'visual', 'effect', '



### StopWords removal
We remove the stopwords from the text. The text is in English, so we use the English stopword list.

In [14]:
from nltk.corpus import stopwords
stopwords_en = stopwords.words('english')

In [15]:
# we prepare an empty list, which will contain the words after the stopwords removal
tokenized_vector_a = []

# we iterate over the list of tokens obtained through the tokenization
for token in tokenized_text_a:
    # if a token is not a stopword, we insert it in the list
    if token not in stopwords_en:
        tokenized_vector_a.append(token)

# the output is a list of all the tokens of the original text excluding the stopwords
print(tokenized_vector_a)

['electronic', 'apparatus', 'including', 'capturing', 'storage', 'operation', 'method', 'thereof', 'provided', 'capturing', 'captures', 'storage', 'records', 'modules', 'coupled', 'capturing', 'storage', 'configured', 'configure', 'capturing', 'capture', 'head', 'perform', 'obtain', 'detect', 'facial', 'landmarks', 'within', 'estimate', 'head', 'posture', 'angle', 'according', 'facial', 'landmarks', 'calculate', 'gaze', 'position', 'gazes', 'screen', 'according', 'head', 'posture', 'angle', 'rotation', 'reference', 'angle', 'predetermined', 'calibration', 'positions', 'configure', 'screen', 'display', 'corresponding', 'visual', 'effect', 'according', 'gaze', 'positionthe', 'present', 'disclosure', 'provides', 'product', 'thereof', 'adopts', 'fusion', 'method', 'perform', 'machine', 'learning', 'computations', 'technical', 'effects', 'present', 'disclosure', 'include', 'fewer', 'computations', 'less', 'power', 'consumptiona', 'method', 'detecting', 'body', 'information', 'passengers', '

In [16]:
# we prepare an empty list, which will contain the words after the stopwords removal
tokenized_vector_c = []

# we iterate over the list of tokens obtained through the tokenization
for token in tokenized_text_c:
    # if a token is not a stopword, we insert it in the list
    if token not in stopwords_en:
        tokenized_vector_c.append(token)

# the output is a list of all the tokens of the original text excluding the stopwords
print(tokenized_vector_c)
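
The two filtering loops above can be written more compactly with a list comprehension, and converting the stopword list to a set makes each membership test constant time on average. The sketch below is self-contained, so it uses a tiny made-up stopword set; `STOPWORDS` and `drop_stopwords` are hypothetical names, and in the notebook the set would presumably be built as `set(stopwords_en)`.

```python
# Tiny illustrative stopword set; in the notebook this would be
# built from the NLTK list, e.g. set(stopwords_en).
STOPWORDS = {"a", "an", "the", "and", "of", "to", "is", "for"}

def drop_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only tokens that are not in the stopword set."""
    # Membership tests on a set are O(1) on average, versus O(n) on a list.
    return [tok for tok in tokens if tok not in stopwords]

tokens = ["an", "electronic", "apparatus", "and", "a", "storage", "device"]
print(drop_stopwords(tokens))
```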



### POS Analysis
We now do the POS analysis: we use POS tagging to assign each word its part-of-speech tag, then we clean and simplify the tagged text.

In [17]:
pos_tagging_a = nltk.pos_tag(tokenized_vector_a)
print(pos_tagging_a)

[('electronic', 'JJ'), ('apparatus', 'NN'), ('including', 'VBG'), ('capturing', 'VBG'), ('storage', 'NN'), ('operation', 'NN'), ('method', 'NN'), ('thereof', 'NN'), ('provided', 'VBD'), ('capturing', 'VBG'), ('captures', 'NNS'), ('storage', 'NN'), ('records', 'NNS'), ('modules', 'NNS'), ('coupled', 'VBD'), ('capturing', 'VBG'), ('storage', 'NN'), ('configured', 'VBD'), ('configure', 'NN'), ('capturing', 'VBG'), ('capture', 'NN'), ('head', 'NN'), ('perform', 'NN'), ('obtain', 'VB'), ('detect', 'JJ'), ('facial', 'JJ'), ('landmarks', 'NNS'), ('within', 'IN'), ('estimate', 'JJ'), ('head', 'NN'), ('posture', 'NN'), ('angle', 'NN'), ('according', 'VBG'), ('facial', 'JJ'), ('landmarks', 'NN'), ('calculate', 'NN'), ('gaze', 'NN'), ('position', 'NN'), ('gazes', 'VBZ'), ('screen', 'JJ'), ('according', 'VBG'), ('head', 'NN'), ('posture', 'NN'), ('angle', 'JJ'), ('rotation', 'NN'), ('reference', 'NN'), ('angle', 'NN'), ('predetermined', 'VBD'), ('calibration', 'NN'), ('positions', 'NNS'), ('config

In [18]:
pos_tagging_c = nltk.pos_tag(tokenized_vector_c)
print(pos_tagging_c)



In [19]:
cleaned_POS_text_a = []

for token, tag in pos_tagging_a:
    # POS tagged text is a list of tuples, where the first element is a token and the second one is
    # the Part of Speech. If the POS has length == 1, the token is punctuation, otherwise it is not, and we insert it
    # in the list cleaned_POS_text_a
    if len(tag) > 1:
        cleaned_POS_text_a.append((token, tag))

print(cleaned_POS_text_a)

[('electronic', 'JJ'), ('apparatus', 'NN'), ('including', 'VBG'), ('capturing', 'VBG'), ('storage', 'NN'), ('operation', 'NN'), ('method', 'NN'), ('thereof', 'NN'), ('provided', 'VBD'), ('capturing', 'VBG'), ('captures', 'NNS'), ('storage', 'NN'), ('records', 'NNS'), ('modules', 'NNS'), ('coupled', 'VBD'), ('capturing', 'VBG'), ('storage', 'NN'), ('configured', 'VBD'), ('configure', 'NN'), ('capturing', 'VBG'), ('capture', 'NN'), ('head', 'NN'), ('perform', 'NN'), ('obtain', 'VB'), ('detect', 'JJ'), ('facial', 'JJ'), ('landmarks', 'NNS'), ('within', 'IN'), ('estimate', 'JJ'), ('head', 'NN'), ('posture', 'NN'), ('angle', 'NN'), ('according', 'VBG'), ('facial', 'JJ'), ('landmarks', 'NN'), ('calculate', 'NN'), ('gaze', 'NN'), ('position', 'NN'), ('gazes', 'VBZ'), ('screen', 'JJ'), ('according', 'VBG'), ('head', 'NN'), ('posture', 'NN'), ('angle', 'JJ'), ('rotation', 'NN'), ('reference', 'NN'), ('angle', 'NN'), ('predetermined', 'VBD'), ('calibration', 'NN'), ('positions', 'NNS'), ('config

In [20]:
cleaned_POS_text_c = []

for token, tag in pos_tagging_c:
    # POS tagged text is a list of tuples, where the first element is a token and the second one is
    # the Part of Speech. If the POS has length == 1, the token is punctuation, otherwise it is not, and we insert it
    # in the list cleaned_POS_text_c
    if len(tag) > 1:
        cleaned_POS_text_c.append((token, tag))

print(cleaned_POS_text_c)



In [21]:
def simpler_pos_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return "a"
    elif nltk_tag.startswith('V'):
        return "v"
    elif nltk_tag.startswith('N'):
        return "n"
    elif nltk_tag.startswith('R'):
        return "r"
    else:         
        return None
    
simpler_POS_text_a = []

# for each tuple of the list, we create a new tuple: the first element is the token, the second is
# the simplified pos tag, obtained calling the function simpler_pos_tag()
# then we append the new created tuple to a new list, which will be the output
for token, tag in cleaned_POS_text_a:
    if tag == 'NNP':   # this is because there is some text in Japanese categorized as 'NNP';
                       # no other relevant words are categorized in such a way
        continue
    POS_tuple = (token, simpler_pos_tag(tag))
    simpler_POS_text_a.append(POS_tuple)
    
print(simpler_POS_text_a)

simpler_POS_text_c = []

for token, tag in cleaned_POS_text_c:
    if tag == 'NNP':   # this is because there is some text in Japanese categorized as 'NNP';
                       # no other relevant words are categorized in such a way
        continue
    POS_tuple = (token, simpler_pos_tag(tag))
    simpler_POS_text_c.append(POS_tuple)
    
print(simpler_POS_text_c)

[('electronic', 'a'), ('apparatus', 'n'), ('including', 'v'), ('capturing', 'v'), ('storage', 'n'), ('operation', 'n'), ('method', 'n'), ('thereof', 'n'), ('provided', 'v'), ('capturing', 'v'), ('captures', 'n'), ('storage', 'n'), ('records', 'n'), ('modules', 'n'), ('coupled', 'v'), ('capturing', 'v'), ('storage', 'n'), ('configured', 'v'), ('configure', 'n'), ('capturing', 'v'), ('capture', 'n'), ('head', 'n'), ('perform', 'n'), ('obtain', 'v'), ('detect', 'a'), ('facial', 'a'), ('landmarks', 'n'), ('within', None), ('estimate', 'a'), ('head', 'n'), ('posture', 'n'), ('angle', 'n'), ('according', 'v'), ('facial', 'a'), ('landmarks', 'n'), ('calculate', 'n'), ('gaze', 'n'), ('position', 'n'), ('gazes', 'v'), ('screen', 'a'), ('according', 'v'), ('head', 'n'), ('posture', 'n'), ('angle', 'a'), ('rotation', 'n'), ('reference', 'n'), ('angle', 'n'), ('predetermined', 'v'), ('calibration', 'n'), ('positions', 'n'), ('configure', 'v'), ('screen', 'a'), ('display', 'n'), ('corresponding', '
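
The tag simplification above maps the first letter of each Penn Treebank tag to one of the four POS codes that the WordNet lemmatizer accepts. A table-driven equivalent is sketched below; `WORDNET_POS` and `to_wordnet_pos` are hypothetical names, and the sample pairs are taken from the tagging output shown earlier.

```python
# First letter of a Penn Treebank tag -> WordNet POS code.
WORDNET_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(tagged_tokens):
    """Map (token, treebank_tag) pairs to (token, wordnet_pos) pairs."""
    result = []
    for token, tag in tagged_tokens:
        if tag == "NNP":  # the notebook skips proper nouns entirely
            continue
        # Tags outside the four families map to None.
        result.append((token, WORDNET_POS.get(tag[:1])))
    return result

sample = [("electronic", "JJ"), ("apparatus", "NN"),
          ("including", "VBG"), ("within", "IN")]
print(to_wordnet_pos(sample))
```

Tokens mapped to `None` are later lemmatized with the lemmatizer's default POS.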



### Lemmatization
In this step we lemmatize the POS-tagged text, obtaining the final two lists with all the lemmas we need for our analysis.

In [22]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [29]:
lemmatized_text_a = []

for token, pos in simpler_POS_text_a:
    if pos is None:
        # no WordNet POS available: fall back to the lemmatizer's default
        lemmatized_text_a.append(lemmatizer.lemmatize(token))
    else:
        lemmatized_text_a.append(lemmatizer.lemmatize(token, pos=pos))
    
print(lemmatized_text_a)

['electronic', 'apparatus', 'include', 'capture', 'storage', 'operation', 'method', 'thereof', 'provide', 'capture', 'capture', 'storage', 'record', 'module', 'couple', 'capture', 'storage', 'configure', 'configure', 'capture', 'capture', 'head', 'perform', 'obtain', 'detect', 'facial', 'landmark', 'within', 'estimate', 'head', 'posture', 'angle', 'accord', 'facial', 'landmark', 'calculate', 'gaze', 'position', 'gaze', 'screen', 'accord', 'head', 'posture', 'angle', 'rotation', 'reference', 'angle', 'predetermine', 'calibration', 'position', 'configure', 'screen', 'display', 'correspond', 'visual', 'effect', 'accord', 'gaze', 'positionthe', 'present', 'disclosure', 'provide', 'product', 'thereof', 'adopts', 'fusion', 'method', 'perform', 'machine', 'learn', 'computation', 'technical', 'effect', 'present', 'disclosure', 'include', 'few', 'computation', 'less', 'power', 'consumptiona', 'method', 'detect', 'body', 'information', 'passenger', 'vehicle', 'base', 'human', "'", 'status', 'pro

In [30]:
lemmatized_text_c = []

for token, pos in simpler_POS_text_c:
    if pos is None:
        # no WordNet POS available: fall back to the lemmatizer's default
        lemmatized_text_c.append(lemmatizer.lemmatize(token))
    else:
        lemmatized_text_c.append(lemmatizer.lemmatize(token, pos=pos))
    
print(lemmatized_text_c)

['configure', 'make', 'screen', 'display', 'frame', 'comprise', 'capture', 'device', 'storage', 'device', 'store', 'module', 'couple', 'capture', 'device', 'storage', 'device', 'configure', 'execute', 'module', 'storage', 'device', 'configure', 'screen', 'display', 'marker', 'object', 'predetermine', 'position', 'configure', 'capture', 'device', 'capture', 'first', 'head', 'look', 'predetermined', 'position', 'perform', 'first', 'face', 'operation', 'first', 'head', 'obtain', 'first', 'face', 'region', 'correspond', 'predetermined', 'position', 'detect', 'first', 'facial', 'landmark', 'correspond', 'first', 'face', 'region', 'calculate', 'rotation', 'reference', 'angle', 'look', 'predetermine', 'position', 'accord', 'first', 'facial', 'landmark', 'configure', 'capture', 'device', 'capture', 'second', 'head', 'perform', 'second', 'head', 'obtain', 'second', 'face', 'region', 'detect', 'second', 'facial', 'landmark', 'within', 'second', 'face', 'region', 'estimate', 'head', 'posture', 'a

In [31]:
lem_text_a = ""
for abstract_text in lemmatized_text_a:
    lem_text_a += abstract_text + " " #we pick each word and add to a variable, which will contain all the text
lem_text_a

"electronic apparatus include capture storage operation method thereof provide capture capture storage record module couple capture storage configure configure capture capture head perform obtain detect facial landmark within estimate head posture angle accord facial landmark calculate gaze position gaze screen accord head posture angle rotation reference angle predetermine calibration position configure screen display correspond visual effect accord gaze positionthe present disclosure provide product thereof adopts fusion method perform machine learn computation technical effect present disclosure include few computation less power consumptiona method detect body information passenger vehicle base human ' status provide method include step passenger body information-detecting inputting interior vehicle face network detect face passenger output passenger feature information inputting interior body network detect body output body-part length information b retrieve specific height mappin

In [32]:
lem_text_c = ""
for claim_text in lemmatized_text_c:
    lem_text_c += claim_text + " " #we pick each word and add to a variable, which will contain all the text
lem_text_c

"configure make screen display frame comprise capture device storage device store module couple capture device storage device configure execute module storage device configure screen display marker object predetermine position configure capture device capture first head look predetermined position perform first face operation first head obtain first face region correspond predetermined position detect first facial landmark correspond first face region calculate rotation reference angle look predetermine position accord first facial landmark configure capture device capture second head perform second head obtain second face region detect second facial landmark within second face region estimate head posture angle accord second facial landmark calculate gaze position screen accord head posture angle rotation reference angle predetermine position configure screen display correspond visual effect accord gaze position accord wherein gaze position comprises first coordinate value first axial

### KeyWord Extraction
Finally we extract keywords and keyphrases from the processed text to identify the main themes of the patents.

In [38]:
# initialize keyphrase extraction model, here TopicRank
print("Initializing extractor...")
extractor = pke.unsupervised.TopicRank()

# load the content of the document; here the document is passed as a raw
# string and preprocessing is carried out using spacy
print("Loading text...")
extractor.load_document(input=lem_text_a, language='en')

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. `(Noun|Adj)*`)
print("Candidate Selection...")
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm
print("Weighting...")
extractor.candidate_weighting()

# N-best selection, keyphrases contains the 30 highest scored candidates as
# (keyphrase, score) tuples
print("Selecting 30 best candidates...")
keyphrases = extractor.get_n_best(n=30)
for phrase, score in keyphrases:
    print(phrase)
print(keyphrases)

Initializing extractor...
Loading text...
Candidate Selection...
Weighting...
Selecting 30 best candidates...
fusion method
order
aspect smart television tv
region interest
detect
local patch
field view head-mounted display base
different process
human group
vergence distance
configuration
customer information
step manage count specific facial
network
motion subject
input
face cluster
pre-processed multi-channel channel pre-processed
electronic apparatus
pixel pixel
area view
photo album
feature extractor
module
interior vehicle face network
location
face face correspond respective person
notification instruction electronic
location-based access control secure resource
regular pixel
[('fusion method', 0.050166996687558967), ('order', 0.0227374855218926), ('aspect smart television tv', 0.020706482678782133), ('region interest', 0.01973492968607286), ('detect', 0.018785845382593425), ('local patch', 0.018697103953675444), ('field view head-mounted display base', 0.0175542834262552), ('di

In [39]:
# initialize keyphrase extraction model, here TopicRank
print("Initializing extractor...")
extractor = pke.unsupervised.TopicRank()

# load the content of the document; here the document is passed as a raw
# string and preprocessing is carried out using spacy
print("Loading text...")
extractor.load_document(input=lem_text_c, language='en')

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. `(Noun|Adj)*`)
print("Candidate Selection...")
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm
print("Weighting...")
extractor.candidate_weighting()

# N-best selection, keyphrases contains the 60 highest scored candidates as
# (keyphrase, score) tuples
print("Selecting 60 best candidates...")
keyphrases = extractor.get_n_best(n=60)
for phrase, score in keyphrases:
    print(phrase)
print(keyphrases)

Initializing extractor...
Loading text...
Candidate Selection...
Weighting...
Selecting 60 best candidates...
device
feature map correspond interior
face detection comprise
process
location
keypoint heatmap
upcoming medium program base profile
tree module
specific person
parameter motion parameter
compute system
configure
respective
face correspond respective person
invitee
model
position
comprises
specific facial neural aggregation network
data
count specific facial satisfies preset
audio feedback signal
operation
facial area subject detect
machine
possible system
information
local patch
attention parameter
angle rotation preset direction
face subtract area
topic multimodal file comprise
memory store instruction
preset source cover accord
feature extraction network ii generate
layer computer implement
second characteristic information
consumer
region classifier code
interior face network
convolutional layer
historical time historical date
second yaw angle
training
input interface
phot

In [40]:
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(lem_text_a)
for keyword, score in keywords:  # avoid shadowing the built-in name tuple
    print(keyword)

print(kw_model.extract_keywords(lem_text_a, keyphrase_ngram_range=(1, 1), stop_words=None))

posture
faceembodiments
capture
classify
feature
[('posture', 0.3497), ('faceembodiments', 0.2996), ('capture', 0.2826), ('classify', 0.2775), ('feature', 0.2763)]


In [41]:
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(lem_text_c)
for keyword, score in keywords:  # avoid shadowing the built-in name tuple
    print(keyword)

print(kw_model.extract_keywords(lem_text_c, keyphrase_ngram_range=(1, 1), stop_words=None))

posture
alignment
interpolation
rotated
tilt
[('posture', 0.4437), ('alignment', 0.4106), ('interpolation', 0.3969), ('rotated', 0.3907), ('tilt', 0.3821)]
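
The four TopicRank cells in this notebook repeat the same load, select, weight and rank steps with a different input text and `n`. One way to avoid the duplication is a small wrapper, sketched below. `top_keyphrases` is a hypothetical helper; it only calls the four methods already used in the cells above. `FakeExtractor` is a made-up stand-in that scores words by frequency so the sketch runs without pke installed; it is not TopicRank.

```python
def top_keyphrases(text, make_extractor, n=10, language="en"):
    """Run the load / select / weight / rank steps and return the n best."""
    extractor = make_extractor()
    extractor.load_document(input=text, language=language)
    extractor.candidate_selection()
    extractor.candidate_weighting()
    return extractor.get_n_best(n=n)


class FakeExtractor:
    """Hypothetical stand-in so the sketch runs without pke installed."""

    def load_document(self, input, language):
        self.words = input.split()

    def candidate_selection(self):
        self.candidates = sorted(set(self.words))

    def candidate_weighting(self):
        # Score each candidate by its relative frequency.
        total = len(self.words)
        self.scores = {c: self.words.count(c) / total for c in self.candidates}

    def get_n_best(self, n):
        # Highest score first; ties broken alphabetically.
        ranked = sorted(self.scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:n]


for phrase, score in top_keyphrases("face gaze face posture", FakeExtractor, n=2):
    print(f"{phrase}: {score:.2f}")
```

In the notebook, passing `pke.unsupervised.TopicRank` as `make_extractor` together with `lem_text_a` or `lem_text_c` should reproduce the cells above, assuming the same pke version.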
