# Extractive text summarization with Python
This method uses **scikit-learn** and **spaCy** to select the most important sentences in a given text. The selected sentences are kept whole.  
A sentence is delimited by a **.** or **;**.  
#### Advantages  
Few installations are needed, since the method relies on spaCy, which already exists as a widget in Textable Prototypes. 
Because the method is extractive, the retained sentences can be inspected directly.
#### Disadvantages
scikit-learn still has to be installed. 
The algorithm does not create new sentences, so the summary reads less naturally. 

Install the libraries:
- `$ pip install -U spacy`
- `$ pip install -U scikit-learn`  

Download the language models you need:
- English `$ python -m spacy download en_core_web_sm`
- Portuguese `$ python -m spacy download pt_core_news_sm`
- French `$ python -m spacy download fr_core_news_sm`

In [2]:
# Imports
import spacy
# To run this in another language, uncomment the matching STOP_WORDS import
from spacy.lang.pt.stop_words import STOP_WORDS
# from spacy.lang.en.stop_words import STOP_WORDS
# from spacy.lang.fr.stop_words import STOP_WORDS
from sklearn.feature_extraction.text import CountVectorizer

# Load the models by name with spaCy's built-in load()
nlpPT = spacy.load("pt_core_news_sm")
nlpFR = spacy.load("fr_core_news_sm")
nlpEN = spacy.load("en_core_web_sm")

In [3]:
# Read the example .txt files (assumed to be UTF-8 encoded) and parse them
with open('frankEN.txt', encoding='utf-8') as f:
    textEN = f.read()
docEN = nlpEN(textEN)
with open('frankFR.txt', encoding='utf-8') as f:
    textFR = f.read()
docFR = nlpFR(textFR)
with open('frankPT.txt', encoding='utf-8') as f:
    textPT = f.read()
docPT = nlpPT(textPT)

In [24]:
# Global variable: number of sentences to keep in the summary

number_sents = 3

## Counting words

The next cell segments the text into sentences, then builds a list of the distinct words that appear anywhere in the text. spaCy's predefined STOP_WORDS list is removed from it, because those words carry little information: they are, for example, determiners or very common words with little meaning. 

We then count how many times each word appears in each sentence. 

Finally, we build a dictionary whose keys are the words and whose values are the number of times each word appears. 
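
To make the three steps concrete, here is a minimal plain-Python sketch of the same idea on a toy corpus. It is not `CountVectorizer` itself: the sentences, the stop-word set, and the whitespace tokenization are assumptions made up for illustration.

```python
from collections import Counter

# Toy data (made up for illustration, not taken from the example texts)
sentences = ["the cat sat on the mat", "the dog chased the cat"]
stop_words = {"the", "on"}

# Split each sentence on whitespace and drop the stop words
tokenized = [[w for w in s.lower().split() if w not in stop_words] for s in sentences]

# Distinct words of the whole text, in alphabetical order
vocabulary = sorted({w for words in tokenized for w in words})

# Count matrix: one row per sentence, one column per vocabulary word
counts = [Counter(words) for words in tokenized]
matrix = [[c[w] for w in vocabulary] for c in counts]

# Dictionary word -> total number of occurrences in the text
word_frequency = {w: sum(row[i] for row in matrix) for i, w in enumerate(vocabulary)}

print(vocabulary)
print(matrix)
print(word_frequency)
```

Summing each column of the matrix gives the per-word totals, which is the same operation as `toarray().sum(axis=0)` on the scikit-learn matrix.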

In [19]:
# We continue with the Portuguese example
# corpus is a list that contains each sentence of the document separately
corpus = [sent.text.lower() for sent in docPT.sents]

# Convert the text to a matrix of token counts, dropping the STOP_WORDS, which carry very little information
cv = CountVectorizer(stop_words=list(STOP_WORDS))
X = cv.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
word_list = cv.get_feature_names_out()

# Print the stop word list
print("List of STOP_WORDS:")
print(cv.get_stop_words())
# Print the distinct words (without stop words) in alphabetical order
print("List of words from text:")
print(word_list)
# Matrix: how many times each word appears in each sentence
print("Matrix:")
print(X.toarray())

# Count how many times each distinct word appears in the whole text
count_list = X.toarray().sum(axis=0)

# Create the word-frequency dictionary
word_frequency = dict(zip(word_list, count_list))
print(word_frequency)

List of STOP_WORDS:
frozenset({'portanto', 'a', 'esses', 'dez', 'tempo', 'quarto', 'daquela', 'tudo', 'sem', 'tem', 'novas', 'vossas', 'vários', 'ligado', 'o', 'adeus', 'nove', 'primeiro', 'toda', 'tipo', 'fazemos', 'era', 'conhecida', 'ali', 'dezanove', 'atrás', 'dezassete', 'mas', 'momento', 'quinta', 'quê', 'pois', 'muitos', 'tão', 'quer', 'parte', 'e', 'te', 'fui', 'como', 'vós', 'dessa', 'tanta', 'daquele', 'algumas', 'qualquer', 'isto', 'caminho', 'as', 'nenhuma', 'apoia', 'exemplo', 'grande', 'nossa', 'porquê', 'ter', 'todos', 'disso', 'depois', 'bem', 'dezasseis', 'então', 'estado', 'estava', 'todo', 'fim', 'longe', 'novo', 'favor', 'que', 'partir', 'é', 'fomos', 'vêm', 'comprida', 'podia', 'vens', 'às', 'ademais', 'tiveram', 'após', 'mil', 'primeira', 'obrigada', 'temos', 'aquilo', 'teve', 'demais', 'à', 'estiveram', 'por', 'até', 'poder', 'vocês', 'fora', 'seu', 'meu', 'nova', 'naquela', 'numa', 'onze', 'seria', 'geral', 'diz', 'fostes', 'poderá', 'outra', 'talvez', 'ele', 'e

## Sort the dictionary
This next cell first **sorts the counts** of the dictionary we just created,  
then computes the **relative frequency** of each word. 

The number of top words shown is set by the slice on `val` (here `val[-3:]`). The negative index is awkward and is meant to be replaced. 
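
One possible way to avoid the negative index is sketched below with `heapq.nlargest` from the standard library. The counts and the `top_n` name are assumptions made up for illustration. Unlike the slice on `val`, this returns exactly `top_n` words, so words tied with the last one are not all included.

```python
import heapq

# Toy counts (made up for illustration)
word_frequency = {"amigo": 4, "anos": 3, "mar": 2, "navio": 1}
top_n = 2  # hypothetical parameter: how many top words to show

# Pick the top_n most frequent words without any negative indexing
top_words = heapq.nlargest(top_n, word_frequency, key=word_frequency.get)

# Divide every count by the highest count, so the most frequent word scores 1.0
highest = max(word_frequency.values())
relative_frequency = {w: c / highest for w, c in word_frequency.items()}

print(top_words)
print(relative_frequency)
```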

In [23]:
# Sort the counts, then list the words whose count is among the three highest
val=sorted(word_frequency.values())
higher_word_frequencies = [word for word,freq in word_frequency.items() if freq in val[-3:]]
print("\nWords with higher frequencies: ", higher_word_frequencies)

# Divide each count by the highest count to get relative frequencies
higher_frequency = val[-1]
for word in word_frequency.keys():
    word_frequency[word] = (word_frequency[word]/higher_frequency)


Words with higher frequencies:  ['amigo', 'anos', 'outro', 'tenha']


## Sentence score

Here we give each sentence a score: the sum of the relative frequencies of the frequent words it contains. 
The sentences with the highest scores will be chosen for the summary. 
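
The scoring rule can be sketched in plain Python as follows. The sentences and relative frequencies are assumptions made up for illustration, and splitting on whitespace stands in for spaCy's tokens. The sketch selects sentences by index rather than by score value, which keeps exactly `number_sents` sentences and preserves their original order.

```python
# Toy relative frequencies and sentences (made up for illustration)
relative_frequency = {"amigo": 1.0, "anos": 0.75, "mar": 0.5}
sentences = [
    "o amigo viu o mar",
    "choveu ontem",
    "amigo de muitos anos",
    "o mar estava calmo",
]
number_sents = 2


def score(sentence):
    """Sum the relative frequencies of the words found in the sentence."""
    return sum(relative_frequency.get(w, 0.0) for w in sentence.lower().split())


scores = [score(s) for s in sentences]

# Rank sentence indices by score, keep the best ones, then restore text order
ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
kept = sorted(ranked[:number_sents])
summary = [sentences[i] for i in kept]

print(scores)
print(" ".join(summary))
```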

In [15]:
# Initialise the dictionary that maps each sentence to its score
sentence_rank={}

# For each word in each sentence ... 
for sent in docPT.sents:
    for word in sent :    
        # if the word appears in word_frequency dict
        if word.text.lower() in word_frequency.keys(): 
            # If the sentence is already in sentence_rank dict, we add points
            if sent in sentence_rank.keys():
                sentence_rank[sent]+=word_frequency[word.text.lower()]
            # else we create a new key/value pair in dict    
            else:
                sentence_rank[sent]=word_frequency[word.text.lower()]
                
# Sort the sentence scores from highest to lowest
top_sentences = sorted(sentence_rank.values(), reverse=True)
# number_sents (defined above) sets how many sentences we keep for the summary
top_sent = top_sentences[:number_sents]

In [31]:
# We can now create a summary from those sentences:
summary=[]
for sent,strength in sentence_rank.items():
    if strength in top_sent:
        summary.append(sent)
for i in summary:
    print(i,end=" ")

Não tenho ninguém próximo a mim, sereno e corajoso, que tenha uma mentalidade elevada e aberta, cujas aptidões sejam iguais às minhas, para aprovar ou corrigir meus planos. e eu realmente anseio por um amigo que tenha discernimento suficiente para não me ver como um sonhador e paciência para ajudar-me a organizar minhas idéias. Passei a juventude em solidão, vivi meus melhores anos em sua suave e feminina companhia, e isso moldou meu caráter de tal forma que sou incapaz de superar o desgosto intenso que me causa a brutalidade, tão comum nos navios. Há alguns anos ele amou uma jovem senhora russa de pequena fortuna e, como ele havia ganho uma considerável quantia em dinheiro, o pai da moça consentiu no casamento. Na prática, sou muito ativo, trabalhador, um operário pronto a executar tudo com perseverança, mas ao lado disso há um amor, uma crença no assombroso inserida em todos os meus projetos, que me coloca distante dos caminhos normais dos homens, impelindo-me para o mar bravo. 

#### This method will also work with other languages, as long as the matching nlp model and STOP_WORDS list are used. 
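
As a sketch of how the language could be switched in one place, the helper below pairs each language code with its model and stop-word list. `MODEL_NAMES`, `stop_words_module`, and `load_language` are hypothetical names introduced here, not part of spaCy. The table only covers the three models used in this notebook, and spaCy is imported lazily so the lookup table can be used on its own.

```python
import importlib

# Model names used in this notebook; any other language needs its own entry
MODEL_NAMES = {
    "en": "en_core_web_sm",
    "fr": "fr_core_news_sm",
    "pt": "pt_core_news_sm",
}


def stop_words_module(lang):
    """Return the dotted path of spaCy's stop-word module for a language code."""
    return f"spacy.lang.{lang}.stop_words"


def load_language(lang):
    """Hypothetical helper: return (nlp, STOP_WORDS) for one language code."""
    import spacy  # requires spaCy and the chosen model to be installed

    nlp = spacy.load(MODEL_NAMES[lang])
    stop_words = importlib.import_module(stop_words_module(lang)).STOP_WORDS
    return nlp, stop_words
```

With this in place, `nlp, STOP_WORDS = load_language("fr")` followed by `doc = nlp(textFR)` would replace the hand-edited imports, and the rest of the notebook would run on `doc` unchanged.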