# Procesamiento del lenguaje natural y minería de texto (PLN)



Esta práctica consistirá en que hemos seleccionado dos artículos de un determinado tema y mediante un algoritmo de scoring-sentences seleccionaremos las frases más importantes de cada artículo.


In [1]:
# packages
import sys
import os
import nltk
from collections import Counter

### Documentos de texto

Hemos elegido el tópico de la biografía de Einstein, por ello, hemos buscado dos artículos de diferentes fuentes que nos hacen un resumen sobre su vida.

Las dos fuentes son:
    - https://www.nobelprize.org/nobel_prizes/physics/laureates/1921/einstein-bio.html - Artículo 1
    - https://www.biography.com/people/albert-einstein-9285408 - Artículo 2
    

In [6]:
f = open("Albert_Einstein_1.txt",encoding="utf8")
document1 = f.read()
print(document1)

Albert Einstein was born at Ulm, in Württemberg, Germany, on March 14, 1879. Six weeks later the family moved to Munich, where he later on began his schooling at the Luitpold Gymnasium. Later, they moved to Italy and Albert continued his education at Aarau, Switzerland and in 1896 he entered the Swiss Federal Polytechnic School in Zurich to be trained as a teacher in physics and mathematics. In 1901, the year he gained his diploma, he acquired Swiss citizenship and, as he was unable to find a teaching post, he accepted a position as technical assistant in the Swiss Patent Office. In 1905 he obtained his doctor's degree.

During his stay at the Patent Office, and in his spare time, he produced much of his remarkable work and in 1908 he was appointed Privatdozent in Berne. In 1909 he became Professor Extraordinary at Zurich, in 1911 Professor of Theoretical Physics at Prague, returning to Zurich in the following year to fill a similar post. In 1914 he was appointed Director of the Kaiser

In [5]:
f = open("Albert_Einstein.txt",encoding="utf8")
document2 = f.read()
print(document2)

Albert Einstein was a German-born physicist who developed the general theory of relativity. He is considered one of the most influential physicists of the 20th century.
Who Was Albert Einstein?
Albert Einstein (March 14, 1879 to April 18, 1955) was a German mathematician and physicist who developed the special and general theories of relativity. In 1921, he won the Nobel Prize for physics for his explanation of the photoelectric effect. In the following decade, he immigrated to the U.S. after being targeted by the Nazis. His work also had a major impact on the development of atomic energy. In his later years, Einstein focused on unified field theory. With his passion for inquiry, Einstein is generally considered the most influential physicist of the 20th century.
Albert Einstein’s Inventions and Discoveries
As a physicist, Einstein had many discoveries, but he is perhaps best known for his theory of relativity and the equation E=MC2, which foreshadowed the development of atomic power a

Como nuestro objetivo es analizar palabra a palabra cada artículo, necesitamos limpiarlo, es decir, hacer un preprocesamiento de texto y para ello eliminamos los símbolos como '\n':

In [4]:
document2 = document2.replace('\n','')
document1 = document1.replace('\n','')

document1 = document1.replace("\'s","'s")
document2 = document2.replace("\'s","'s")
print(document1)
print(document2)

Albert Einstein was born at Ulm, in Württemberg, Germany, on March 14, 1879. Six weeks later the family moved to Munich, where he later on began his schooling at the Luitpold Gymnasium. Later, they moved to Italy and Albert continued his education at Aarau, Switzerland and in 1896 he entered the Swiss Federal Polytechnic School in Zurich to be trained as a teacher in physics and mathematics. In 1901, the year he gained his diploma, he acquired Swiss citizenship and, as he was unable to find a teaching post, he accepted a position as technical assistant in the Swiss Patent Office. In 1905 he obtained his doctor's degree.During his stay at the Patent Office, and in his spare time, he produced much of his remarkable work and in 1908 he was appointed Privatdozent in Berne. In 1909 he became Professor Extraordinary at Zurich, in 1911 Professor of Theoretical Physics at Prague, returning to Zurich in the following year to fill a similar post. In 1914 he was appointed Director of the Kaiser W

### Tf-idf and entities

Tf-idf, es una librería que nos permite a partir de la creacción de un diccionario obtener el peso de cada palabra dentro del texto mediante unas reglas establecidas. Nos interesa aquellas palabras que tengan mayor peso pues serán las más importantes del texto.

In [7]:
my_documents = [document1,document2]

# packages
from nltk.stem import WordNetLemmatizer
from gensim.corpora.dictionary import Dictionary
from nltk.corpus import stopwords
import re
from nltk.tokenize import regexp_tokenize

# lower words
doc_lower = []
for doc in my_documents:
    document = doc.lower()
    doc_lower.append(document)

# tokenizamos el documento y eliminamos aquellas palabras denominadas 'stopwords' (signos de puntuación, palabras muy comunes, etc...)
tokenized_words = []

for i in doc_lower:
    x = regexp_tokenize(i,r'(\d+|\w+)')
    tokenized_words.append(x)

filtered_words = []
list1 = []
list2 = []
for i in range(len(tokenized_words)):
    for word in tokenized_words[i]:
        if word not in stopwords.words('english'):
            if i == 0:
                list1.append(word)
            else:
                list2.append(word)

filtered_words.append(list1)
filtered_words.append(list2)


# dictionary creation
dictionary = Dictionary(filtered_words)
dictionary.token2id


# corpus
corpus = []

for i in filtered_words:
    x = dictionary.doc2bow(i)
    corpus.append(x)



In [8]:
# PALABRAS CON MÁS PESO DEL DOCUMENTO 1
from gensim.models.tfidfmodel import TfidfModel
# creation of tfidf model
tfidf = TfidfModel(corpus)


relevant_words_document1 = []
weights_words_document1 = []
# document 1
doc = corpus[0]

# weights
tfidf_weights = tfidf[doc]
# order
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)
# best words 
for term_id, weight in sorted_tfidf_weights[:20]:
    print(dictionary.get(term_id), weight)
    relevant_words_document1.append(dictionary.get(term_id))
    weights_words_document1.append(weight)

mechanics 0.2777777777777778
problems 0.2222222222222222
professor 0.2222222222222222
1920 0.16666666666666666
america 0.16666666666666666
berlin 0.16666666666666666
important 0.16666666666666666
movement 0.16666666666666666
statistical 0.16666666666666666
1914 0.1111111111111111
1916 0.1111111111111111
1950 0.1111111111111111
appointed 0.1111111111111111
contributed 0.1111111111111111
gained 0.1111111111111111
interpretation 0.1111111111111111
leading 0.1111111111111111
medal 0.1111111111111111
moved 0.1111111111111111
philosophy 0.1111111111111111


In [9]:
# document 2
doc = corpus[1]

relevant_words_document2 = []
weights_words_document2 = []
# weights
tfidf_weights = tfidf[doc]
# order
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)
# best words 
for term_id, weight in sorted_tfidf_weights[:20]:
    print(dictionary.get(term_id), weight)
    relevant_words_document2.append(dictionary.get(term_id))
    weights_words_document2.append(weight)

would 0.23199266800191015
brain 0.15466177866794012
e 0.15466177866794012
considered 0.12888481555661677
one 0.12888481555661677
physicist 0.12888481555661677
scientists 0.12888481555661677
thus 0.12888481555661677
universe 0.12888481555661677
bomb 0.1031078524452934
couple 0.1031078524452934
due 0.1031078524452934
energy 0.1031078524452934
ideas 0.1031078524452934
mc2 0.1031078524452934
nobel 0.1031078524452934
prize 0.1031078524452934
speech 0.1031078524452934
study 0.1031078524452934
u 0.1031078524452934


A parte del uso de esta librería para saber las palabras con más peso del documento, he decidido añadir un plus y es saber cuales son las entidades de más peso (año, nombres propios, lugares, etc...) de cada documento utilizando la librería spacy:

In [10]:
# importancia de las entidades
import spacy
nlp = spacy.load('en')

# document 1
doc = nlp(document1)
entities_1 = []
for ent in doc.ents:
    entities_1.append(ent.text.lower())

# document 2
doc = nlp(document1)
entities_2 = []
for ent in doc.ents:
    entities_2.append(ent.text.lower())

entities = [entities_1,entities_2]


# dictionary creation
dictionary = Dictionary(entities)
dictionary.token2id


# corpus
corpus = []

for i in entities:
    x = dictionary.doc2bow(i)
    corpus.append(x)


# DOCUMENT 1
doc = corpus[0]

relevant_entities_document1 = []
weights_entities_document1=[]
# weights
tfidf_weights = tfidf[doc]
# order
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)
# best entities
print('MOST RELEVANT ENTITIES IN DOCUMENT 1')
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)
    relevant_entities_document1.append(dictionary.get(term_id))
    weights_entities_document1.append(weight)


print('====================================')


doc = corpus[1]

relevant_entities_document2 = []
weights_entities_document2 = []
# weights
tfidf_weights = tfidf[doc]
# order
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)
# best entities
print('MOST RELEVANT ENTITIES IN DOCUMENT 2')
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)
    relevant_entities_document2.append(dictionary.get(term_id))
    weights_entities_document2.append(weight)

MOST RELEVANT ENTITIES IN DOCUMENT 1
einstein 0.5241424183609592
1905 0.20965696734438366
1914 0.20965696734438366
1916 0.20965696734438366
1920 0.20965696734438366
MOST RELEVANT ENTITIES IN DOCUMENT 2
einstein 0.5241424183609592
1905 0.20965696734438366
1914 0.20965696734438366
1916 0.20965696734438366
1920 0.20965696734438366


Una vez tengamos ya cuáles son las palabras con más peso y el peso de las entidades más relevantes, introduzco las palabras en una lista y su respectivo peso en otra lista en el mismo orden, puesto que luego es necesario para hacer el scoring-sentences:

In [13]:
# document 1 
key_words_document1 = []
for i in relevant_words_document1:
    key_words_document1.append(i)

for j in relevant_entities_document1:
    key_words_document1.append(j)
    
weights_document1 = []
for i in weights_words_document1:
    weights_document1.append(i)
    
for j in weights_entities_document1:
    weights_document1.append(j)
# document 2
key_words_document2 = []
for i in relevant_words_document2:
    key_words_document2.append(i)

for j in relevant_entities_document2:
    key_words_document2.append(j)
    
weights_document2 = []
for i in weights_words_document2:
    weights_document2.append(i)
    
for j in weights_entities_document2:
    weights_document2.append(j)
    
print('KEY WORDS FOR CHOOSING SENTENCES DOC 1')
print(key_words_document1)
print(weights_document1)


print('KEY WORDS FOR CHOOSING SENTENCES DOC 2')
print(key_words_document2)
print(weights_document2)


KEY WORDS FOR CHOOSING SENTENCES DOC 1
['mechanics', 'problems', 'professor', '1920', 'america', 'berlin', 'important', 'movement', 'statistical', '1914', '1916', '1950', 'appointed', 'contributed', 'gained', 'interpretation', 'leading', 'medal', 'moved', 'philosophy', 'einstein', '1905', '1914', '1916', '1920']
[0.2777777777777778, 0.2222222222222222, 0.2222222222222222, 0.16666666666666666, 0.16666666666666666, 0.16666666666666666, 0.16666666666666666, 0.16666666666666666, 0.16666666666666666, 0.1111111111111111, 0.1111111111111111, 0.1111111111111111, 0.1111111111111111, 0.1111111111111111, 0.1111111111111111, 0.1111111111111111, 0.1111111111111111, 0.1111111111111111, 0.1111111111111111, 0.1111111111111111, 0.5241424183609592, 0.20965696734438366, 0.20965696734438366, 0.20965696734438366, 0.20965696734438366]
KEY WORDS FOR CHOOSING SENTENCES DOC 2
['would', 'brain', 'e', 'considered', 'one', 'physicist', 'scientists', 'thus', 'universe', 'bomb', 'couple', 'due', 'energy', 'ideas', 

### Pronoun transformation

Antes de comenzar a seleccionas las frases más relevantes del documento, hay un matiz que no debemos pasar por alto y es que una de las entidades con más peso es 'einstein' y en la mayoría de frases para referirse a él utiliza 'he', por lo tanto, para la selccion debemos hacer un 'pronoun transformation' que consistirá que donde pone 'he' lo sustituiremos por 'einstein'.

In [16]:
from nltk.tokenize import sent_tokenize

# tokenize sentences document 1
sentences_document1 = sent_tokenize(document1)


# tokenize sentences document 2
sentences_document2 = sent_tokenize(document2)

In [19]:
sentences_document1_new = []
for sentence in sentences_document1:
    sentence = sentence.lower()
    if "he" in sentence:
        sentence = re.sub(r'\bhe\b','einstein',sentence)
    sentences_document1_new.append(sentence)
    
sentences_document2_new =[]
for sentence in sentences_document2:
    sentence = sentence.lower()
    if "he" in sentence:
        sentence = re.sub(r'\bhe\b','einstein',sentence)
    sentences_document2_new.append(sentence)


### Our algorithm to choose sentences

Para elegir las frases más relevantes, hemos utilizado el mismo método que en el artículo 'http://www.doiserbia.nb.rs/img/doi/1820-0214/2015/1820-02141500060N.pdf'. 

En resumen, lo que haremos será hacer un score para cada frase basado en tres:
    - El primer score se basará en el score de las palabras que hemos seleccionado con sus respectivos pesos.
    - El segundo se basará en la longitud de la frase con respecto a la media de la longitud total de las frases
    - Y el último se considera que la primera y la última frase del documento siempre son más relevantes que el resto.
    
Se suman los scores y aquellas frases que tengan más score* son las seleccionadas como frases más importantes.

* He considerado que se seleccionan aquellas frases que tengan un score superior a la media de los scores.

In [20]:
# (1) word-score (just for relevant words) --> TF-IDF (we added entities as well)

# (2) Sentence Lenght Score
     # L(Si,di) = abs[ (number of words of the sentences) - (average of lenght of the sentences)] / (average of length of the sentences)
# (3) Sentence position
     # 0.8 if it is the first phrase
     # 0.2 if it is the last phrase
     # 0 in other case

# document 1
score_sentences_total_document1 = []
score_word_sentences_document1 = []

print('DOCUMENT 1')
print('========================')
print('WORD SCORE SENTENCES')

for i in sentences_document1_new:
    score = 0
    for j in key_words_document1:
        if j in i:
            pos = key_words_document1.index(j)
            score = score + weights_document1[pos]
        else:
            score = score
    score_word_sentences_document1.append(score)
print(score_word_sentences_document1)
print('========================')
print('SENTENCE LENGHT SCORE')
lenght_sentences = 0
for i in sentences_document1_new:
    lenght = len(i.split())
    lenght_sentences = lenght_sentences + lenght

average = float(lenght_sentences)/float(len(sentences_document1_new))
print('average = ',average)

score_len_sentences_document1 = []
for i in sentences_document1_new:
    score = abs(len(i.split()) - average)/ average
    score_len_sentences_document1.append(score)
print(score_len_sentences_document1)

print('============================')
print('SENTENCE POSITION SCORE')
score_sentence_position = []
for i in sentences_document1_new:
    score = 0
    if sentences_document1_new.index(i) == 0:
        score = score + 0.8
    elif sentences_document1_new.index(i) == (len(sentences_document1_new)-1):
        score = score + 0.2
    else:
        score = 0
    score_sentence_position.append(score)
print(score_sentence_position)
    
print('===========================')
print('TOTAL SCORE')
for i in range(len(score_len_sentences_document1)):
    score_total = score_word_sentences_document1[i] + score_sentence_position[i] + score_len_sentences_document1[i]
    score_sentences_total_document1.append(score_total)

print(score_sentences_total_document1)
    
    
print('==========================')
print('SENTENCES SELECTED')
# we are going to choose the sentences which they are above average of score
score_average = float(sum(score_sentences_total_document1)/len(score_sentences_total_document1))
print('score_average =', score_average)

sentences_selected = []
for i in score_sentences_total_document1:
    if i>score_average:
        pos = score_sentences_total_document1.index(i)
        sentence = sentences_document1[pos]
        sentences_selected.append(sentence)
print('we selected', len(sentences_selected),'sentences')
print(sentences_selected)


DOCUMENT 1
WORD SCORE SENTENCES
[0.5241424183609592, 0.6352535294720703, 0.6352535294720703, 0.6352535294720703, 0.7337993857053429, 0.6352535294720703, 0.7463646405831814, 1.2463646405831814, 1.3019201961387368, 0.5241424183609592, 0.801920196138737, 0.7463646405831814, 0.5241424183609592, 0.5241424183609592, 0.801920196138737, 1.3574757516942926, 0.5241424183609592, 1.0241424183609593, 1.3019201961387368, 1.1352535294720703, 1.079697973916515, 0.5241424183609592, 1.7337993857053429, 0, 0.3888888888888889, 0.801920196138737, 1.1352535294720703, 0.7463646405831814, 0.6908090850276258, 0.5241424183609592, 0.5241424183609592]
SENTENCE LENGHT SCORE
average =  24.903225806451612
[0.47797927461139894, 0.2370466321243523, 0.4054404145077721, 0.36528497409326427, 0.7189119170984456, 0.12435233160621766, 0.12435233160621766, 0.2370466321243523, 0.4455958549222798, 0.3976683937823834, 0.6463730569948187, 0.2370466321243523, 0.15673575129533676, 0.5181347150259067, 0.4857512953367876, 0.16450777

In [22]:
# document 1
score_sentences_total_document2 = []
score_word_sentences_document2 = []

print('DOCUMENT 2')
print('========================')
print('WORD SCORE SENTENCES')

for i in sentences_document2_new:
    score = 0
    for j in key_words_document2:
        if j in i:
            pos = key_words_document2.index(j)
            score = score + weights_document2[pos]
        else:
            score = score
    score_word_sentences_document2.append(score)
print(score_word_sentences_document2)
print('========================')
print('SENTENCE LENGHT SCORE')
lenght_sentences = 0
for i in sentences_document2_new:
    lenght = len(i.split())
    lenght_sentences = lenght_sentences + lenght

average = float(lenght_sentences)/float(len(sentences_document1_new))
print('average = ',average)

score_len_sentences_document2 = []
for i in sentences_document2_new:
    score = abs(len(i.split()) - average)/ average
    score_len_sentences_document2.append(score)
print(score_len_sentences_document2)

print('============================')
print('SENTENCE POSITION SCORE')
score_sentence_position = []
for i in sentences_document2_new:
    score = 0
    if sentences_document2_new.index(i) == 0:
        score = score + 0.8
    elif sentences_document2_new.index(i) == (len(sentences_document2_new)-1):
        score = score + 0.2
    else:
        score = 0
    score_sentence_position.append(score)
print(score_sentence_position)
    
print('===========================')
print('TOTAL SCORE')
for i in range(len(score_len_sentences_document2)):
    score_total = score_word_sentences_document2[i] + score_sentence_position[i] + score_len_sentences_document2[i]
    score_sentences_total_document2.append(score_total)

print(score_sentences_total_document2)
    
    
print('==========================')
print('SENTENCES SELECTED')
# we are going to choose the sentences which they are above average of score
score_average = float(sum(score_sentences_total_document2)/len(score_sentences_total_document2))
print('score_average =', score_average)

sentences_selected = []
for i in score_sentences_total_document2:
    if i>score_average:
        pos = score_sentences_total_document2.index(i)
        sentence = sentences_document2[pos]
        sentences_selected.append(sentence)
print('we selected', len(sentences_selected),'sentences')
print(sentences_selected)

DOCUMENT 2
WORD SCORE SENTENCES
[0.8076890125855161, 1.168566496144043, 0.6788041970288994, 0.8076890125855161, 0.8850199019194862, 0.7819120494741927, 0.25776963111323353, 0.7819120494741927, 1.0396816805874263, 1.1170125699213962, 0.888461164373283, 0.6788041970288994, 0.9107968650308095, 0.7819120494741927, 0.9107968650308095, 1.197784721709163, 0.36087748355852695, 0.9107968650308095, 0.7819120494741927, 0.7819120494741927, 0.25776963111323353, 0.8076890125855161, 0.6788041970288994, 0.7819120494741927, 0.8850199019194862, 1.013904717476103, 0.8850199019194862, 0.36087748355852695, 0.25776963111323353, 0.36087748355852695, 0.6186471146717605, 0.9881277543647795, 0.7819120494741927, 0.8850199019194862, 1.013904717476103, 0.6788041970288994, 0.7819120494741927, 0.6788041970288994, 0.7819120494741927, 0.8850199019194862, 0.7819120494741927, 0.6788041970288994, 0.15466177866794012, 0.28354659422455686, 0.15466177866794012, 1.0396816805874263, 1.4005591641459532, 0.9365738281421329, 1.1