# Text Mining Process for Face Recognition Data

## Getting the desired text from the patents

We start by parsing the patent files and looking for the ones that can give us useful information about the topic we are interested in.
This may take a few minutes.

In [80]:
import os
import re

path = 'C:\\Users\\Stefano\\Desktop\\Stefano\\Business\\Project\\Material\\MyPatents'
print("Getting files...")
# list all the files in the directory given by the path
files = os.listdir(path)
text = ""
print("START!")
for filename in files:
    # open each file by its full path; the with-block closes it after reading
    with open(os.path.join(path, filename), encoding="utf-8") as file:
        text += file.read()
print("END!")


Getting files...
START!
END!
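
The loading step can also be packaged as a small function. The following is a minimal sketch, assuming every regular file in the folder is a UTF-8 patent file; `read_all_files` is a hypothetical helper name, not something defined in this notebook.

```python
from pathlib import Path


def read_all_files(directory):
    """Concatenate the contents of every regular file in `directory`.

    Hypothetical helper: assumes all files are UTF-8 encoded text.
    """
    parts = []
    # sorted() makes the concatenation order reproducible across runs
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_file():  # skip sub-directories
            parts.append(entry.read_text(encoding="utf-8"))
    return "".join(parts)
```

With this helper, the cell above would reduce to `text = read_all_files(path)`.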


After obtaining the full text of the files, we extract the abstract and the claims, which are the two sections we need for our analysis.

In [81]:
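# (?s:...) lets '.' match newlines inside the group, so a section may span several lines;
# the lookbehind/lookahead keep the tags themselves out of the result
p1 = re.findall(r'(?<=<abstract>\n)(?s:.*?)(?=\n</abstract>)', text)  # list of abstracts
p2 = re.findall(r'(?<=<claims>\n)(?s:.+?)(?=\n</claims>)', text)      # list of claims sections
print(p1)
print(p2)
print("TEXT OBTAINED!")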

['An electronic apparatus including an image capturing device, a storage device and a processor and an operation method thereof are provided. The image capturing device captures an image for a user, and the storage device records a plurality of modules. The processor is coupled to the image capturing device and the storage device and is configured to: configure the image capturing device to capture a head image of a user; perform a face recognition operation to obtain a face region; detect a plurality of facial landmarks within the face region; estimate a head posture angle of the user according to the facial landmarks; calculate a gaze position where the user gazes on the screen according to the head posture angle, a plurality of rotation reference angle, and a plurality of predetermined calibration positions; and configure the screen to display a corresponding visual effect according to the gaze position.', 'The present disclosure provides a computation method and product thereof. Th
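
To see what the two patterns do, here is a minimal sketch on a made-up string. The sample text is invented for illustration, and the sketch uses a capture group with `re.DOTALL` rather than the lookbehind/lookahead form above; both approaches return only the text between the tags.

```python
import re

# Invented sample: two tiny "patents", each with an abstract and claims.
sample = (
    "<abstract>\nA face\nrecognition device.\n</abstract>\n"
    "<claims>\n1. A device.\n2. The device of claim 1.\n</claims>\n"
    "<abstract>\nA gaze tracker.\n</abstract>\n"
    "<claims>\n1. A tracker.\n</claims>\n"
)

# re.DOTALL lets '.' match newlines, so a section may span several lines;
# the lazy quantifier '.*?' stops at the first closing tag.
abstracts = re.findall(r"<abstract>\n(.*?)\n</abstract>", sample, flags=re.DOTALL)
claims = re.findall(r"<claims>\n(.*?)\n</claims>", sample, flags=re.DOTALL)

print(abstracts)
print(claims)
```

Each list holds one string per patent, in the order the patents appear in the concatenated text.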

### Text cleaning
First we convert the text of both sections to lowercase, then we remove extra whitespace, punctuation and digits.

In [82]:
# lower() is a Python string method
lower_atext = ""
lower_ctext = ""
for abstract_text in p1:
    # we lowercase each abstract and append it to a single string holding all the abstract text
    # (note: no separator is inserted between consecutive abstracts)
    lower_atext += abstract_text.lower()

for claim_text in p2:
    # same for each claims section
    lower_ctext += claim_text.lower()
lower_ctext



In [83]:
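# whitespace removal for both sections
def remove_whitespace(text):
    # split() with no arguments splits on any run of whitespace, so joining with a single
    # space collapses newlines, tabs and repeated spaces
    return " ".join(text.split())

lowera_text = remove_whitespace(lower_atext)
lowerc_text = remove_whitespace(lower_ctext)
lowerc_text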



In [84]:
# punctuation and digit removal: we replace every undesired character with ''
unwanted_chars = '?.,!/;:()1234567890'
for char in unwanted_chars:
    lowera_text = lowera_text.replace(char, '')
print(lowera_text)
for char in unwanted_chars:
    lowerc_text = lowerc_text.replace(char, '')
print(lowerc_text)

an electronic apparatus including an image capturing device a storage device and a processor and an operation method thereof are provided the image capturing device captures an image for a user and the storage device records a plurality of modules the processor is coupled to the image capturing device and the storage device and is configured to configure the image capturing device to capture a head image of a user perform a face recognition operation to obtain a face region detect a plurality of facial landmarks within the face region estimate a head posture angle of the user according to the facial landmarks calculate a gaze position where the user gazes on the screen according to the head posture angle a plurality of rotation reference angle and a plurality of predetermined calibration positions and configure the screen to display a corresponding visual effect according to the gaze positionthe present disclosure provides a computation method and product thereof the computation method
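
The output above contains the fused token `positionthe`: the abstracts are concatenated without a separator, so once the full stop is removed, the last word of one abstract runs into the first word of the next. Below is a minimal sketch of one way to avoid this, assuming the sections can simply be joined with a space. `clean_sections` is a hypothetical helper, and it collapses whitespace after removing characters so that a deleted digit does not leave a double space behind.

```python
def clean_sections(sections, unwanted="?.,!/;:()1234567890"):
    """Lowercase, join with spaces, drop unwanted characters, collapse whitespace.

    Hypothetical helper: joining with " " keeps the last word of one section
    from fusing with the first word of the next.
    """
    joined = " ".join(section.lower() for section in sections)
    # str.translate with a deletion table removes every unwanted character in one pass
    joined = joined.translate(str.maketrans("", "", unwanted))
    # split()/join collapses any run of whitespace into a single space
    return " ".join(joined.split())


print(clean_sections(["Gaze position.", "The present  disclosure 12."]))
```

Under these assumptions, `lowera_text = clean_sections(p1)` would replace the three cleaning cells for the abstracts, and likewise `clean_sections(p2)` for the claims.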

### Tokenization
In this step we split the cleaned text of both sections into tokens (individual words).

In [85]:
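import nltk
# sent_tokenize returns a list where each element is a sentence of the input text;
# the result is not stored here, and since punctuation was already removed above,
# little sentence structure is left to detect
nltk.sent_tokenize(lowera_text)
nltk.sent_tokenize(lowerc_text)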
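
Sentence splitting depends on punctuation, so the order of the steps matters. The sketch below illustrates this with a deliberately naive regex splitter; `naive_sentences` is a hypothetical stand-in used only so the example runs without NLTK data, and it is not equivalent to `nltk.sent_tokenize`.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (rough stand-in)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


raw = "The device captures an image. The processor detects a face."
cleaned = raw.lower().replace(".", "")

print(naive_sentences(raw))      # two sentences
print(naive_sentences(cleaned))  # one element: the boundary is gone
```

If sentence-level information were needed later, the split would have to happen before the punctuation is removed.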



In [86]:
# the output is a list, where each element is a token of the original text
tokenized_text_a = nltk.word_tokenize(lowera_text)
print(tokenized_text_a)

tokenized_text_c = nltk.word_tokenize(lowerc_text)
print(tokenized_text_c)

['an', 'electronic', 'apparatus', 'including', 'an', 'image', 'capturing', 'device', 'a', 'storage', 'device', 'and', 'a', 'processor', 'and', 'an', 'operation', 'method', 'thereof', 'are', 'provided', 'the', 'image', 'capturing', 'device', 'captures', 'an', 'image', 'for', 'a', 'user', 'and', 'the', 'storage', 'device', 'records', 'a', 'plurality', 'of', 'modules', 'the', 'processor', 'is', 'coupled', 'to', 'the', 'image', 'capturing', 'device', 'and', 'the', 'storage', 'device', 'and', 'is', 'configured', 'to', 'configure', 'the', 'image', 'capturing', 'device', 'to', 'capture', 'a', 'head', 'image', 'of', 'a', 'user', 'perform', 'a', 'face', 'recognition', 'operation', 'to', 'obtain', 'a', 'face', 'region', 'detect', 'a', 'plurality', 'of', 'facial', 'landmarks', 'within', 'the', 'face', 'region', 'estimate', 'a', 'head', 'posture', 'angle', 'of', 'the', 'user', 'according', 'to', 'the', 'facial', 'landmarks', 'calculate', 'a', 'gaze', 'position', 'where', 'the', 'user', 'gazes', 'o



### Stopword removal
We remove the stopwords from the text. The patents are written in English, so we use the English stopword list.

In [87]:
from nltk.corpus import stopwords
stopwords_en = stopwords.words('english')

In [88]:
# we prepare an empty list, which will contain the tokens left after stopword removal
tokenized_vector_a = []

# we iterate over the list of tokens obtained through tokenization
for token in tokenized_text_a:
    # if a token is not a stopword, we append it to the list
    if token not in stopwords_en:
        tokenized_vector_a.append(token)

# the output is a list of all the tokens of the original text excluding the stopwords
print(tokenized_vector_a)

# we prepare an empty list, which will contain the tokens left after stopword removal
tokenized_vector_c = []

# we iterate over the list of tokens obtained through tokenization
for token in tokenized_text_c:
    # if a token is not a stopword, we append it to the list
    if token not in stopwords_en:
        tokenized_vector_c.append(token)

# the output is a list of all the tokens of the original text excluding the stopwords
print(tokenized_vector_c)

['electronic', 'apparatus', 'including', 'image', 'capturing', 'device', 'storage', 'device', 'processor', 'operation', 'method', 'thereof', 'provided', 'image', 'capturing', 'device', 'captures', 'image', 'user', 'storage', 'device', 'records', 'plurality', 'modules', 'processor', 'coupled', 'image', 'capturing', 'device', 'storage', 'device', 'configured', 'configure', 'image', 'capturing', 'device', 'capture', 'head', 'image', 'user', 'perform', 'face', 'recognition', 'operation', 'obtain', 'face', 'region', 'detect', 'plurality', 'facial', 'landmarks', 'within', 'face', 'region', 'estimate', 'head', 'posture', 'angle', 'user', 'according', 'facial', 'landmarks', 'calculate', 'gaze', 'position', 'user', 'gazes', 'screen', 'according', 'head', 'posture', 'angle', 'plurality', 'rotation', 'reference', 'angle', 'plurality', 'predetermined', 'calibration', 'positions', 'configure', 'screen', 'display', 'corresponding', 'visual', 'effect', 'according', 'gaze', 'positionthe', 'present', '
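
The two loops above apply the same filter, so they can be expressed as one function. This is a minimal sketch: `remove_stopwords` is a hypothetical helper, and `sample_stopwords` is a tiny hand-written stand-in for the NLTK list so that the example runs without downloading any corpus.

```python
# Tiny stand-in for stopwords.words('english'); the real NLTK list is longer.
sample_stopwords = {"a", "an", "and", "the", "of", "to", "is", "for"}


def remove_stopwords(tokens, stopword_set):
    """Return the tokens that are not in `stopword_set`, preserving order."""
    return [token for token in tokens if token not in stopword_set]


tokens = ["an", "image", "capturing", "device", "and", "a", "processor"]
print(remove_stopwords(tokens, sample_stopwords))
```

In the notebook this could be called as `remove_stopwords(tokenized_text_a, set(stopwords_en))`; converting the list to a set makes each membership test faster on long token lists.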



### POS Analysis
We now perform part-of-speech (POS) analysis: we tag each token with its part of speech, then we filter the tagged tokens and map the tags to a simpler set.

In [89]:
pos_tagging_a = nltk.pos_tag(tokenized_vector_a)
print(pos_tagging_a)
pos_tagging_c = nltk.pos_tag(tokenized_vector_c)
print(pos_tagging_c)

[('electronic', 'JJ'), ('apparatus', 'NN'), ('including', 'VBG'), ('image', 'NN'), ('capturing', 'NN'), ('device', 'NN'), ('storage', 'NN'), ('device', 'NN'), ('processor', 'NN'), ('operation', 'NN'), ('method', 'NN'), ('thereof', 'NN'), ('provided', 'VBD'), ('image', 'NN'), ('capturing', 'VBG'), ('device', 'NN'), ('captures', 'NNS'), ('image', 'NN'), ('user', 'JJ'), ('storage', 'NN'), ('device', 'NN'), ('records', 'NNS'), ('plurality', 'NN'), ('modules', 'VBZ'), ('processor', 'NN'), ('coupled', 'VBN'), ('image', 'NN'), ('capturing', 'VBG'), ('device', 'JJ'), ('storage', 'NN'), ('device', 'NN'), ('configured', 'VBD'), ('configure', 'JJ'), ('image', 'NN'), ('capturing', 'VBG'), ('device', 'JJ'), ('capture', 'NN'), ('head', 'NN'), ('image', 'NN'), ('user', 'NN'), ('perform', 'VB'), ('face', 'NN'), ('recognition', 'NN'), ('operation', 'NN'), ('obtain', 'VB'), ('face', 'JJ'), ('region', 'NN'), ('detect', 'JJ'), ('plurality', 'NN'), ('facial', 'JJ'), ('landmarks', 'NNS'), ('within', 'IN'), 
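
A quick way to get a feel for the tagger's output is to count how often each tag occurs. The sketch below does this with `collections.Counter` on the first few pairs copied from the output above; on the real data, `pos_tagging_a` would take the place of `sample_tagged`.

```python
from collections import Counter

# First few (token, tag) pairs copied from the output above.
sample_tagged = [
    ("electronic", "JJ"), ("apparatus", "NN"), ("including", "VBG"),
    ("image", "NN"), ("capturing", "NN"), ("device", "NN"),
]

# Count the tags only, ignoring the tokens.
tag_counts = Counter(tag for _, tag in sample_tagged)
print(tag_counts.most_common())
```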



In [90]:
cleaned_POS_text_a = []

for token, tag in pos_tagging_a:
    # POS tagged text is a list of (token, tag) tuples. Punctuation tags such as '.', ',' or ':'
    # have length 1, so we keep only the pairs whose tag is longer than one character
    if len(tag) > 1:
        cleaned_POS_text_a.append((token, tag))

print(cleaned_POS_text_a)

cleaned_POS_text_c = []

for token, tag in pos_tagging_c:
    # same filter for the claims
    if len(tag) > 1:
        cleaned_POS_text_c.append((token, tag))

print(cleaned_POS_text_c)

[('electronic', 'JJ'), ('apparatus', 'NN'), ('including', 'VBG'), ('image', 'NN'), ('capturing', 'NN'), ('device', 'NN'), ('storage', 'NN'), ('device', 'NN'), ('processor', 'NN'), ('operation', 'NN'), ('method', 'NN'), ('thereof', 'NN'), ('provided', 'VBD'), ('image', 'NN'), ('capturing', 'VBG'), ('device', 'NN'), ('captures', 'NNS'), ('image', 'NN'), ('user', 'JJ'), ('storage', 'NN'), ('device', 'NN'), ('records', 'NNS'), ('plurality', 'NN'), ('modules', 'VBZ'), ('processor', 'NN'), ('coupled', 'VBN'), ('image', 'NN'), ('capturing', 'VBG'), ('device', 'JJ'), ('storage', 'NN'), ('device', 'NN'), ('configured', 'VBD'), ('configure', 'JJ'), ('image', 'NN'), ('capturing', 'VBG'), ('device', 'JJ'), ('capture', 'NN'), ('head', 'NN'), ('image', 'NN'), ('user', 'NN'), ('perform', 'VB'), ('face', 'NN'), ('recognition', 'NN'), ('operation', 'NN'), ('obtain', 'VB'), ('face', 'JJ'), ('region', 'NN'), ('detect', 'JJ'), ('plurality', 'NN'), ('facial', 'JJ'), ('landmarks', 'NNS'), ('within', 'IN'), 



In [91]:
def simpler_pos_tag(nltk_tag):
    # map a Penn Treebank tag to the single-letter POS used by the WordNet lemmatizer
    if nltk_tag.startswith('J'):
        return "a"   # adjective
    elif nltk_tag.startswith('V'):
        return "v"   # verb
    elif nltk_tag.startswith('N'):
        return "n"   # noun
    elif nltk_tag.startswith('R'):
        return "r"   # adverb
    else:
        return None

simpler_POS_text_a = []

# for each (token, tag) pair we build a new tuple: the first element is the token, the second is
# the simplified POS tag returned by simpler_pos_tag(); we then append it to the output list
for token, tag in cleaned_POS_text_a:
    if tag == 'NNP':   # some Japanese text is tagged 'NNP';
                       # no other relevant words are tagged this way
        continue
    simpler_POS_text_a.append((token, simpler_pos_tag(tag)))

print(simpler_POS_text_a)

simpler_POS_text_c = []

for token, tag in cleaned_POS_text_c:
    if tag == 'NNP':   # same 'NNP' filter for the claims
        continue
    simpler_POS_text_c.append((token, simpler_pos_tag(tag)))

print(simpler_POS_text_c)

[('electronic', 'a'), ('apparatus', 'n'), ('including', 'v'), ('image', 'n'), ('capturing', 'n'), ('device', 'n'), ('storage', 'n'), ('device', 'n'), ('processor', 'n'), ('operation', 'n'), ('method', 'n'), ('thereof', 'n'), ('provided', 'v'), ('image', 'n'), ('capturing', 'v'), ('device', 'n'), ('captures', 'n'), ('image', 'n'), ('user', 'a'), ('storage', 'n'), ('device', 'n'), ('records', 'n'), ('plurality', 'n'), ('modules', 'v'), ('processor', 'n'), ('coupled', 'v'), ('image', 'n'), ('capturing', 'v'), ('device', 'a'), ('storage', 'n'), ('device', 'n'), ('configured', 'v'), ('configure', 'a'), ('image', 'n'), ('capturing', 'v'), ('device', 'a'), ('capture', 'n'), ('head', 'n'), ('image', 'n'), ('user', 'n'), ('perform', 'v'), ('face', 'n'), ('recognition', 'n'), ('operation', 'n'), ('obtain', 'v'), ('face', 'a'), ('region', 'n'), ('detect', 'a'), ('plurality', 'n'), ('facial', 'a'), ('landmarks', 'n'), ('within', None), ('face', 'n'), ('region', 'n'), ('estimate', 'n'), ('head', 'n
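
The same mapping can be written as a lookup table on the first letter of the tag, with the `NNP` filter folded into one reusable function. This is a sketch under the same rules as the cell above; `TAG_PREFIX_TO_POS` and `simplify_tagged` are hypothetical names, and the sample pairs (including `'Tokyo'`) are invented for illustration.

```python
# First letter of a Penn Treebank tag -> single-letter POS (adjective, verb, noun, adverb).
TAG_PREFIX_TO_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}


def simplify_tagged(tagged, skip_tags=("NNP",)):
    """Map (token, tag) pairs to (token, simple_pos), dropping tags in `skip_tags`."""
    simplified = []
    for token, tag in tagged:
        if tag in skip_tags:
            continue
        # .get() returns None for tags outside the four groups, as in simpler_pos_tag()
        simplified.append((token, TAG_PREFIX_TO_POS.get(tag[:1])))
    return simplified


sample = [("electronic", "JJ"), ("within", "IN"), ("Tokyo", "NNP"), ("gazes", "VBZ")]
print(simplify_tagged(sample))
```

Calling `simplify_tagged(cleaned_POS_text_a)` and `simplify_tagged(cleaned_POS_text_c)` would then cover both sections without repeating the loop.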



### Lemmatization
In this step we lemmatize the POS-tagged tokens, which gives us the two final lists of lemmas (one for the abstracts, one for the claims) that we need for our analysis.

In [92]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [93]:
lemmatized_text_a = []

for token, pos in simpler_POS_text_a:
    if pos is None:
        # no simplified tag available: use the lemmatizer's default POS (noun)
        lemmatized_text_a.append(lemmatizer.lemmatize(token))
    else:
        lemmatized_text_a.append(lemmatizer.lemmatize(token, pos=pos))

print(lemmatized_text_a)

lemmatized_text_c = []

for token, pos in simpler_POS_text_c:
    if pos is None:
        lemmatized_text_c.append(lemmatizer.lemmatize(token))
    else:
        lemmatized_text_c.append(lemmatizer.lemmatize(token, pos=pos))

print(lemmatized_text_c)

['electronic', 'apparatus', 'include', 'image', 'capturing', 'device', 'storage', 'device', 'processor', 'operation', 'method', 'thereof', 'provide', 'image', 'capture', 'device', 'capture', 'image', 'user', 'storage', 'device', 'record', 'plurality', 'modules', 'processor', 'couple', 'image', 'capture', 'device', 'storage', 'device', 'configure', 'configure', 'image', 'capture', 'device', 'capture', 'head', 'image', 'user', 'perform', 'face', 'recognition', 'operation', 'obtain', 'face', 'region', 'detect', 'plurality', 'facial', 'landmark', 'within', 'face', 'region', 'estimate', 'head', 'posture', 'angle', 'user', 'accord', 'facial', 'landmark', 'calculate', 'gaze', 'position', 'user', 'gaze', 'screen', 'accord', 'head', 'posture', 'angle', 'plurality', 'rotation', 'reference', 'angle', 'plurality', 'predetermine', 'calibration', 'position', 'configure', 'screen', 'display', 'correspond', 'visual', 'effect', 'accord', 'gaze', 'positionthe', 'present', 'disclosure', 'provide', 'compu

['electronic', 'device', 'configure', 'make', 'screen', 'display', 'plurality', 'image', 'frame', 'comprise', 'image', 'capture', 'device', 'storage', 'device', 'store', 'plurality', 'module', 'processor', 'couple', 'image', 'capture', 'device', 'storage', 'device', 'configure', 'execute', 'module', 'storage', 'device', 'configure', 'screen', 'display', 'plurality', 'marker', 'object', 'plurality', 'predetermine', 'calibration', 'position', 'configure', 'image', 'capturing', 'device', 'capture', 'plurality', 'first', 'head', 'image', 'user', 'look', 'predetermine', 'calibration', 'position', 'perform', 'plurality', 'first', 'face', 'recognition', 'operation', 'first', 'head', 'image', 'obtain', 'plurality', 'first', 'face', 'region', 'correspond', 'predetermined', 'calibration', 'position', 'detect', 'plurality', 'first', 'facial', 'landmark', 'correspond', 'first', 'face', 'region', 'calculate', 'plurality', 'rotation', 'reference', 'angle', 'user', 'look', 'predetermined', 'calibrati
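
The two lemmatization loops can likewise be folded into one function that takes the lemmatizing callable as an argument. The sketch below is hypothetical: `lemmatize_pairs` is not part of the notebook, and `toy_lemmatize` is a toy stand-in (it only strips a trailing "s" from nouns) so that the example runs without WordNet data. It is not a real lemmatizer.

```python
def lemmatize_pairs(pairs, lemmatize):
    """Lemmatize (token, pos) pairs.

    `lemmatize` is any callable with the call shape lemmatize(word, pos=...),
    such as WordNetLemmatizer().lemmatize.
    """
    lemmas = []
    for token, pos in pairs:
        if pos is None:
            lemmas.append(lemmatize(token))  # fall back to the default POS
        else:
            lemmas.append(lemmatize(token, pos=pos))
    return lemmas


def toy_lemmatize(word, pos="n"):
    """Toy stand-in so the sketch runs without WordNet; NOT a real lemmatizer."""
    if pos == "n" and word.endswith("s"):
        return word[:-1]
    return word


pairs = [("landmarks", "n"), ("within", None), ("gazes", "v")]
print(lemmatize_pairs(pairs, toy_lemmatize))
```

In the notebook, the call would be `lemmatize_pairs(simpler_POS_text_a, lemmatizer.lemmatize)`, and the same for `simpler_POS_text_c`.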