# Script extracting example sentences from a corpus

**Author**: Diane Marquette <br>
**Date created**: 18/01/2021 <br>
**Date last modified**: 22/01/2021 <br>
**Python Version**: 3.8.5 <br>

**Note**: This script has been adapted to run on **Google Colab**. This spares us from running the sentence extraction cell locally, which took more than 6 hours to execute.

The goal of this script is to **extract example sentences from a French corpus**. 

We **focus on rare words that appear only 3 times** in the entire WordLex corpus. Example sentences are especially valuable for these words, as they usually have a specific usage context. In addition, we are not overwhelmed by hundreds of potential example sentences.

We use the French corpus from January 2012 provided by [HC Corpora](http://corpora.epizy.com/corpora.html). It comes with a "Full Word List" that contains every word found in the corpus and the number of times it appears.

Because this word list is raw and includes a lot of "garbage", we also check that each word exists in the [Lexique 383](http://www.lexique.org/shiny/openlexicon/) dataset.


**Inputs**:
- <code>french_corpus_2012_01_23</code> folder (from HC Corpora) including <code>Fre_Blogs.txt</code>, <code>Fre_Newspapers.txt</code> and <code>Fre_Twitter.txt</code>
- <code>french_wordlist_full.txt</code> file (from HC Corpora) associated with the January 2012 corpus
- <code>Lexique-query-2021-01-18 8-3-27.xlsx</code> file (downloaded from Lexique 383's website on January 18th, 2021), of which we only kept 4 columns ("word", "lemme", "cgram" and "cgramortho")

**Output**:
- CSV file with 3 columns ('Word', 'Language' and 'Example') containing example sentences for words appearing only 3 times in the original corpus

## Import libraries

We use the [sentence-splitter](https://pypi.org/project/sentence-splitter/) package to split the documents into sentences. It relies on a heuristic algorithm that supports several languages, including French, to split plain text into a list of sentences.

In [None]:
!pip install sentence_splitter



In [2]:
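import io, os, codecs
import string
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from sentence_splitter import split_text_into_sentences
nltk.download('punkt')  # models for word_tokenize; newer NLTK releases may need 'punkt_tab'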

We mount our Drive on the Colab VM so that we can later export the generated CSV file there.

In [None]:
from google.colab import drive

# Mount your Drive to the Colab VM.
drive.mount('/drive')

Mounted at /drive


We upload all five files used in this script to the Colab VM. 

In [None]:
from google.colab import files
uploaded = files.upload()

Saving Fre_Blogs.txt to Fre_Blogs (2).txt
Saving Fre_Newspapers.txt to Fre_Newspapers (1).txt
Saving Fre_Twitter.txt to Fre_Twitter (1).txt
Saving french_wordlist_full.txt to french_wordlist_full (1).txt
Saving Lexique-query-2021-01-18 8-3-27.xlsx to Lexique-query-2021-01-18 8-3-27 (1).xlsx


## Import the list of words appearing in the HC Corpora corpus

In [None]:
file_WordLex = 'french_wordlist_full.txt'

WordLex = pd.read_csv(io.BytesIO(uploaded[file_WordLex]), 
                      sep='\t', names=['Word', 'Frequency'])
WordLex.head()

Unnamed: 0,Word,Frequency
0,µ,16
1,µa,1
2,µalchimie,2
3,µallé,1
4,µbio,1


In [None]:
WordLex.shape

(574343, 2)

We only keep the words appearing 3 times in the corpus.

In [None]:
rare_words = WordLex.loc[WordLex['Frequency']==3]
rare_words.head()

Unnamed: 0,Word,Frequency
9,µg/l,3
13,µsd,3
28,0.02,3
46,0.12,3
48,0.13,3


In [None]:
rare_words.shape

(33129, 2)

Among the 574'343 words in the list, only 33'129 appear exactly 3 times in the January 2012 corpus.

## Import the list of words included in the Lexique 383 dataset  

This list is much "cleaner" than the one from HC Corpora. We use the <code>openpyxl</code> package to read the .xlsx file.

In [None]:
file_lexique383 = 'Lexique-query-2021-01-18 8-3-27.xlsx'

lexique383 = pd.read_excel(io.BytesIO(uploaded[file_lexique383]), sheet_name='Sheet1', engine='openpyxl')
lexique383.head()

Unnamed: 0,Word,lemme,cgram,cgramortho
0,a,a,NOM,"NOM,AUX,VER"
1,a,avoir,AUX,"NOM,AUX,VER"
2,a,avoir,VER,"NOM,AUX,VER"
3,a capella,a capella,ADV,ADV
4,a cappella,a cappella,ADV,ADV


In [None]:
lexique383.shape

(142694, 4)

It "only" includes 142'694 entries. A spelling can appear in several rows, one per lemma and grammatical category (see "a" above).

## Find words appearing three times in the corpus AND included in Lexique 383

From the list of words appearing 3 times in the corpus, we only keep those that are also included in the Lexique 383 dataset.

In [None]:
common_words = pd.merge(lexique383, rare_words, on='Word', how='inner')
common_words.head()

Unnamed: 0,Word,lemme,cgram,cgramortho,Frequency
0,abandonnerez,abandonner,VER,VER,3
1,abandonniez,abandonner,VER,VER,3
2,abattirent,abattre,VER,VER,3
3,abattis,abattis,NOM,"NOM,VER",3
4,abattis,abattre,VER,"NOM,VER",3


In [None]:
common_words.shape

(5810, 5)

In [None]:
print("Number of unique spellings: {}".format(len(common_words.Word.unique())))

Number of unique spellings: 5392


In [None]:
print("Number of unique lemmas: {}".format(len(common_words.lemme.unique())))

Number of unique lemmas: 4692


The merge returns only 5'810 rows, covering 5'392 unique spellings, out of the 33'129 words occurring 3 times in the corpus. The other words are not included in Lexique 383.
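
The row count of the merge can exceed the number of distinct spellings because an inner merge returns one row per matching lexicon entry, and a spelling may be listed under several lemmas. A minimal sketch on made-up data (the two toy frames below are illustrative stand-ins, not rows read from the real files):

```python
import pandas as pd

# Toy stand-ins for lexique383 and rare_words (hypothetical rows).
lexicon = pd.DataFrame({
    "Word": ["abattis", "abattis", "abattirent", "zèbre"],
    "lemme": ["abattis", "abattre", "abattre", "zèbre"],
})
rare = pd.DataFrame({
    "Word": ["abattis", "abattirent", "µsd"],
    "Frequency": [3, 3, 3],
})

# Inner merge: keeps only words present in both frames,
# with one output row per matching lexicon row.
merged = pd.merge(lexicon, rare, on="Word", how="inner")

n_rows = len(merged)                      # rows in the merged table
n_spellings = merged["Word"].nunique()    # distinct spellings among them
print(n_rows, n_spellings)
```

Here "abattis" contributes two rows (one per lemma), while "µsd" and "zèbre" are dropped because they appear in only one of the two frames.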

## Import the corpus

We load the documents from all three source types into a single dataframe.

In [None]:
corpus_files = ['Fre_Newspapers.txt', 'Fre_Blogs.txt', 'Fre_Twitter.txt']

column_names = ['Source', 'Publication date', 'Source Type', 'Topics', 'Text']
corpus = pd.DataFrame(columns = column_names)

# create dataframe with corpus docs
for file in corpus_files:
    content = pd.read_csv(io.BytesIO(uploaded[file]), sep='\t', names=column_names)
    corpus = pd.concat([corpus, content], ignore_index=True)

corpus.head()

Unnamed: 0,Source,Publication date,Source Type,Topics,Text
0,nicematin.com,2011/10/11,1,0,"17 heures à l'arrière du parc Phœnix, à Nice. ..."
1,metrofrance.com,2011/04/07,1,0,"Le chef de la diplomatie française, Alain Jupp..."
2,leparisien.fr,2011/11/27,1,0,"Depuis, la situation s’est améliorée… à Arago,..."
3,ledauphine.com,2011/04/11,1,0,Noël Duchêne et Patrice Vaniscotte ont fait le...
4,ladepeche.fr,2012/01/11,1,412,C'est l'entreprise familiale Souyris de Carmau...
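
One point that may be worth checking when free text is read from tab-separated files: with default settings, CSV readers treat a double quote at the start of a field as the beginning of a quoted field, which can absorb tabs and line breaks until a closing quote is found. The sketch below illustrates this with Python's standard `csv` module on invented lines. `pd.read_csv` exposes a `quoting` argument that accepts the same constants (for example `quoting=csv.QUOTE_NONE`). Whether the HC Corpora files are affected has not been verified here.

```python
import csv
import io

# Invented sample: three tab-separated documents. The first text starts with
# a double quote and the second one ends with a double quote.
sample = (
    'a.com\t2011/10/11\t1\t0\t"Bonjour à tous\n'
    'b.com\t2011/10/12\t1\t0\tmerci"\n'
    'c.com\t2011/10/13\t1\t0\tUne phrase simple.\n'
)

def read_rows(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting))

default_rows = read_rows(sample, csv.QUOTE_MINIMAL)  # default quoting behaviour
raw_rows = read_rows(sample, csv.QUOTE_NONE)         # quotes are ordinary characters

print(len(default_rows), len(raw_rows))
```

With the default mode, the first two documents are merged into a single record. With `csv.QUOTE_NONE`, each line stays a separate document.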


## Extract example sentences from corpus

In [None]:
# Tokenize a document
def tokenize(text):
    """Remove ASCII punctuation from the text, then split it into word tokens."""
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = word_tokenize(text)
    return tokens
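
Note that `string.punctuation` only lists ASCII punctuation, so French typographic marks such as « », … or ’ are not removed by the filter in `tokenize`. A possible alternative, sketched here with the standard `unicodedata` module, drops every character whose Unicode category starts with `P`. The function name `strip_punctuation` is made up for this illustration.

```python
import string
import unicodedata

def strip_punctuation(text):
    """Remove every character that belongs to a Unicode punctuation category."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))

sample = "« Depuis, la situation s’est améliorée… »"

# ASCII-only filter, as in tokenize(): the comma goes, the French marks stay.
ascii_only = "".join(ch for ch in sample if ch not in string.punctuation)

# Unicode-aware filter: guillemets, ellipsis and typographic apostrophe go too.
unicode_aware = strip_punctuation(sample)

print(ascii_only)
print(unicode_aware)
```

Removing the apostrophe glues "s’est" into "sest". Replacing punctuation with a space instead of deleting it would avoid that, at the cost of splitting hyphenated words.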

In [None]:
%%time

data = {}
# counter to keep track of how many example sentences we found
i = 0

for word in common_words.Word.unique():
    # retrieve documents (i.e., rows) from the corpus including this word;
    # regex=False makes pandas treat the word as a literal string, not as a pattern
    example_docs = corpus[corpus['Text'].str.contains(word, na=False, case=False, regex=False)]
    if example_docs.shape[0] == 0:
        print("No document including the word {} has been found.\n".format(word))
    else:
        for index, row in example_docs.iterrows():
            # check that the document really includes the word (e.g., 'beau')
            # and not a string containing this word (e.g., 'beauté')
            if word.lower() in word_tokenize(row['Text'].lower()):
                try:
                    sentences = split_text_into_sentences(text=row['Text'], language='fr')
                except Exception as error:
                    print("Something went wrong when splitting the text into sentences: {}".format(error))
                    continue
                for sent in sentences:
                    # case-sensitive substring test on each sentence
                    if word in sent:
                        # store the sentence under the next counter value
                        i += 1
                        data[i] = [word, 'FRA', sent]


# convert dictionary into a pandas dataframe
example_sentences = pd.DataFrame.from_dict(data, orient='index', columns=['Word', 'Language', 'Example'])

No document including the word abandonnerez has been found.

No document including the word abusèrent has been found.

No document including the word accélérais has been found.

No document including the word adresserez has been found.

No document including the word agençait has been found.

No document including the word alanguissait has been found.

No document including the word alchimies has been found.

No document including the word amphigouriques has been found.

No document including the word américaniser has been found.

No document including the word anabase has been found.

No document including the word androgynie has been found.

No document including the word angéliquement has been found.

No document including the word annihilent has been found.

No document including the word anoxie has been found.

No document including the word anti-castristes has been found.

No document including the word anti-patriotique has been found.

No document including the word anti-reflets

In [5]:
example_sentences.shape

(13574, 3)

We notice that quite a few words did not yield any example sentence (see the messages above), and we have not identified the cause. However, we still managed to extract 13'574 example sentences.
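
The extraction loop first selects documents by substring and then checks the tokens to tell 'beau' apart from 'beauté'. A possible single-step alternative is a regular expression that refuses matches glued to other word characters. The sketch below uses only the standard `re` module, and the helper name `contains_whole_word` is hypothetical. `re.escape` keeps characters such as `.` or `/` inside a word from being read as regex syntax.

```python
import re

def contains_whole_word(word, text):
    """Return True if `word` occurs in `text` as a whole word, ignoring case."""
    # (?<!\w) and (?!\w) require that no letter, digit or underscore
    # touches the match on either side.
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

in_plain = contains_whole_word("beau", "Quel beau jardin !")
in_longer_word = contains_whole_word("beau", "La beauté du jardin.")
in_capitalised = contains_whole_word("abattis", "Abattis et gravats.")

print(in_plain, in_longer_word, in_capitalised)
```

One caveat of this definition: a hyphen is not a word character, so "reflets" would still count as a whole word inside "anti-reflets".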

In [12]:
# keep one lexique383 row per spelling (the first one) so that each sentence gets a single lemma
first_lemma = lexique383.drop_duplicates(subset='Word', keep='first')
final_output = pd.merge(example_sentences, first_lemma, on='Word', how='inner')
final_output.drop(columns=['cgram', 'cgramortho'], inplace=True)
final_output.head()

Unnamed: 0,Word,Language,Example,lemme
0,abandonniez,FRA,Si vous maintenez cette pratique sans faille d...,abandonner
1,abattirent,FRA,"Mais des hommes sont venus, qui firent avec ce...",abattre
2,abattirent,FRA,"A ce spectacle, les chevaliers indignés se ruè...",abattre
3,abattirent,FRA,« Lui et ses compagnons abattirent une infinit...,abattre
4,abattis,FRA,Mohamed VI... numérotez vos abattis... votre h...,abattis


/!\ We added the lemma to make it easier to import the example sentences into Kam4D. However, the same spelling can come from multiple lemmas. The lemmas in the CSV file were obtained by merging the <code>example_sentences</code> table with the <code>lexique383</code> one. If a spelling has several lemmas in lexique383, only the first one listed is kept (duplicate spellings are dropped before the merge).
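
If keeping a single lemma per spelling loses too much information, one option is to collect all candidate lemmas for each spelling before merging. A minimal sketch on invented rows (the toy frames and the `lemmes` column name are assumptions made for this illustration):

```python
import pandas as pd

# Toy stand-ins for lexique383 and example_sentences (hypothetical rows).
lexicon = pd.DataFrame({
    "Word": ["abattis", "abattis", "abattirent"],
    "lemme": ["abattis", "abattre", "abattre"],
})
sentences = pd.DataFrame({
    "Word": ["abattis", "abattirent"],
    "Language": ["FRA", "FRA"],
    "Example": ["numérotez vos abattis", "ils abattirent les murs"],
})

# One row per spelling, with its distinct lemmas joined by "|"
# in their order of appearance in the lexicon.
lemmas = (
    lexicon.drop_duplicates(subset=["Word", "lemme"])
    .groupby("Word", sort=False)["lemme"]
    .agg("|".join)
    .reset_index(name="lemmes")
)

# Left merge: every sentence is kept exactly once.
output = pd.merge(sentences, lemmas, on="Word", how="left")
print(output)
```

Each example sentence then appears once, and the choice between lemmas can be made later, for instance at import time.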

In [None]:
# export dataframe as a CSV file
final_output.to_csv('/drive/My Drive/example_sentences.csv', index=False, header=True) 