Run the following two cells first

In [None]:
import pandas as pd
import nltk
import csv
import tarfile
import string
from collections import Counter
from nltk import RegexpTokenizer
nltk.download('stopwords')
nltk.download('punkt')

In [None]:
pd.set_option('display.max_colwidth', None) # To display the full content of the column (-1 is no longer accepted by recent pandas versions)
# pd.set_option('display.max_rows', None) # To display ALL rows of the dataframe (otherwise you can decide the max number)

# Read sentences (do this first)

Reading all sentences takes a long time so let's split the process into two steps. You only need to run the two following cells once.

In [None]:
!cat sentences_detailed.tar.bz2.part* > sentences_detailed.tar.bz2
def read_sentences_file():
    with tarfile.open('./sentences_detailed.tar.bz2', 'r:*') as tar:
        csv_path = tar.getnames()[0]
        return pd.read_csv(tar.extractfile(csv_path), 
                sep='\t', 
                header=None, 
                names=['sentenceID', 'ISO', 'Text', 'Username', 'Date added', 'Date last modified'],
                quoting=csv.QUOTE_NONE)

In [None]:
all_sentences = read_sentences_file()

Now you can fetch the sentences of a specific language using the following cells. When you want to change your target language, you can start again from here.

Note that by default, we get rid of the `ISO`, `Date added`, `Date last modified`, and `Username` columns.  
If you need any of these columns, comment out the corresponding `del` line by adding a `#` at its beginning.
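
As an alternative to commenting lines out, the columns to keep could be passed as an argument. The sketch below is only an illustration: `sentences_of_language_keeping` and its `keep` parameter are hypothetical names that are not part of this notebook, and the toy table contains made-up values.

```python
import pandas as pd

def sentences_of_language_keeping(sentences, language, keep=()):
    """Hypothetical variant: select one language and keep extra columns by name."""
    optional = ['ISO', 'Username', 'Date added', 'Date last modified']
    to_drop = [col for col in optional if col not in keep]
    selected = sentences[sentences['ISO'] == language]
    # drop() returns a new frame, so the original table is left untouched.
    return selected.drop(columns=to_drop).set_index('sentenceID')

# Toy data standing in for the real export (values are made up).
toy = pd.DataFrame({
    'sentenceID': [1, 2, 3],
    'ISO': ['fra', 'eng', 'fra'],
    'Text': ['Bonjour.', 'Hello.', 'Merci.'],
    'Username': ['a', 'b', 'c'],
    'Date added': ['2020-01-01'] * 3,
    'Date last modified': ['2020-01-02'] * 3,
})
kept = sentences_of_language_keeping(toy, 'fra', keep=('Username',))
```

Here `kept` holds the two French rows, indexed by `sentenceID`, with the `Text` and `Username` columns.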

In [None]:
def sentences_of_language(sentences, language):
    # Work on a copy so that deleting columns does not trigger a SettingWithCopyWarning
    target_sentences = sentences[sentences['ISO'] == language].copy()
    del target_sentences['Date added']
    del target_sentences['Date last modified']
    del target_sentences['ISO']
    del target_sentences['Username']
    target_sentences = target_sentences.set_index("sentenceID")
    return target_sentences

Choose your target language as a 3-letter ISO code (`cmn`, `fra`, `jpn`, `eng`, etc.).

In [None]:
language = 'fra'
sentences = sentences_of_language(all_sentences, language)

The following cell displays the first five sentences of your set, just for a quick check.

In [None]:
sentences.head()

To get only the text of the sentence with a specific ID, use the syntax `sentences.loc[<sentenceID>].Text`.

In [None]:
sentences.loc[1115].Text

# Sentences containing a specific word

First, run the following cell

In [None]:
def get_sentences(word, sentences):
    # Match the word as typed, plus its capitalized, uppercase, and lowercase forms.
    variants = {word, word.capitalize(), word.upper(), word.lower()}
    mask = pd.Series(False, index=sentences.index)
    for variant in variants:
        mask |= sentences['Text'].str.contains(variant, na=False)
    # A single boolean mask returns each matching sentence once, in its original order.
    return sentences[mask]

Choose the word you want to search for and run the cell: all sentences (from your sentence set) containing your word will be displayed.  
The occurrences that match are your word as typed, plus its lowercase, uppercase, and capitalized forms.  
For example, if you look for `beauty`, sentences starting with `Beauty` will also match.
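
A fully case-insensitive search is another option. The sketch below assumes a literal (non-regex) match is wanted; `get_sentences_any_case` is a hypothetical name and the toy sentences are made up. Unlike the function above, it would also match mixed-case spellings such as `bEauty`.

```python
import pandas as pd

def get_sentences_any_case(word, sentences):
    """Hypothetical helper: literal, case-insensitive substring search."""
    # regex=False treats `word` as plain text; case=False ignores letter case.
    mask = sentences['Text'].str.contains(word, case=False, regex=False)
    return sentences[mask]

# Toy data (made up) with the same layout as `sentences`.
toy = pd.DataFrame(
    {'Text': ['Beauty is subjective.', 'She has inner beauty.', 'BEAUTY!', 'Nothing here.']},
    index=pd.Index([10, 11, 12, 13], name='sentenceID'),
)
matches = get_sentences_any_case('beauty', toy)
```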

In [None]:
word = "skis"
get_sentences(word, sentences)

If you want to check (some of) the sentences containing a word exactly as typed (that is, with a case-sensitive match), you can use the following:

In [None]:
word = "exemple"
sentences[sentences['Text'].str.contains(word)]

# Checking how many sentences for a list of words

Suppose that you want to check how many sentences contain a specific word. You could use `get_sentences` above and count the results. However, if you have several words in mind, and you only want to know how many sentences contain them, you can use the following.

/!\ Currently, only sentences matching **exactly** your word will be counted (no uppercase, no capitalization, etc.) /!\

First, run the cell below.

In [None]:
def how_many_sentences(word_list, sentences):
    for w in word_list:
        print(w + "\t\t" + str(len(sentences[sentences['Text'].str.contains(w)])))
#         if len(sentences[sentences['Text'].str.contains(w)]) <= 10:
#             print(w + "\t\t" + str(len(sentences[sentences['Text'].str.contains(w)])))

Then, replace `word_list` with the words you are interested in.  
Do not forget the brackets and the quotes. `word_list` format should be `word_list = ["word1", "word2", ..., "wordn"]`

In [None]:
word_list = ["manger", "skis", "mirage", "oasis"]
how_many_sentences(word_list, sentences)

Now, suppose that you only want to check the words from your list that appear in at most `n` sentences.

First, run the cell below.

In [None]:
def how_many_sentences_under_threshold(word_list, threshold, sentences):
    for w in word_list:
        nb_occurences = len(sentences[sentences['Text'].str.contains(w)])
        if nb_occurences <= threshold:
            print(w + "\t\t" + str(nb_occurences))

Write your own list of words, as specified above, and set `n` to the number of sentences you want to use as a threshold.  
For example, if `n` is set to 10, only words that appear in 10 sentences or fewer will be printed, along with the number of sentences in which they appear.
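
If the counts need to be reused rather than only printed, they could be collected in a pandas `Series`. This is a minimal sketch under stated assumptions: `sentence_counts` is a hypothetical name, the toy sentences are made up, and the match is literal (`regex=False`), which differs from the regex-based default used in the cells of this notebook.

```python
import pandas as pd

def sentence_counts(word_list, sentences):
    """Hypothetical helper: number of sentences containing each word (exact case)."""
    counts = {
        w: int(sentences['Text'].str.contains(w, regex=False).sum())
        for w in word_list
    }
    return pd.Series(counts, name='sentences')

# Toy data (made up) standing in for `sentences`.
toy = pd.DataFrame({'Text': ['Je veux manger.', 'Il aime manger.', 'Un mirage !', 'Rien.']})
counts = sentence_counts(['manger', 'mirage', 'oasis'], toy)

# Same "at most n" rule as the threshold function, here with n = 1.
rare = counts[counts <= 1]
```

The resulting `Series` can then be sorted or filtered like any other pandas object.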

In [None]:
word_list = ["manger", "skis", "mirage", "oasis"]
n = 10
how_many_sentences_under_threshold(word_list, n, sentences)

# Word analysis

Some standard symbols to ignore are given by the following cell.  

In [None]:
string.punctuation

You should add punctuation specific to your target language to `additional_punctuation` below (keep the same list format).

In [None]:
additional_punctuation = ['``', "''", '...', '’', '«', '»']

The following cell will display the list of what will be considered "useless" words. Those are common [stop words](https://en.wikipedia.org/wiki/Stop_words) plus all the punctuation symbols defined above.  
If you're not happy with this list, you can limit it to punctuation only by removing `nltk.corpus.stopwords.words(language)`, or extend it by adding another list to `useless_words`.

This list of stop words uses the `stopwords` corpus of the nltk package. Here, `language` is the lowercase English name of the language, not the 3-letter ISO code used earlier. Note that only a limited number of languages are available. Currently available are  
`arabic`, `azerbaijani`, `danish`, `dutch`, `english`, `finnish`, `french`, `german`, `greek`, `hungarian`, `indonesian`, `italian`, `kazakh`, `nepali`, `norwegian`, `portuguese`, `romanian`, `russian`, `slovene`, `spanish`, `swedish`, `tajik`, `turkish`

In [None]:
language = "french"
useless_words = nltk.corpus.stopwords.words(language) + list(string.punctuation) + additional_punctuation
useless_words

In [None]:
# List of sentence texts in sentences['Text']
texts = [text for text in sentences['Text']]
all_words = [word for text in texts for word in nltk.word_tokenize(text)]
# "Raw" number of words
len(all_words)

In [None]:
# Using a RegexpTokenizer to improve tokenizing of French sentences.
# We want to split at apostrophes: a letter followed by an apostrophe,
# a run of word characters, or a single punctuation mark.
tokenizer = RegexpTokenizer(r"\w'|\w+|[^\w\s]")
# A set makes the membership test much faster than a list
useless_set = set(useless_words)
filtered_words = [word.lower() for text in texts for word in tokenizer.tokenize(text) if word.lower() not in useless_set]
# Filter numbers written with digits
filtered_words = [word for word in filtered_words if not word.isdigit()]
# Number of filtered words
len(filtered_words)
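
The intent stated in the comments (splitting at apostrophes) can be checked on a single sentence with the standard `re` module. The pattern below is an assumption about that intent, and the sample sentence is made up. Note that it only handles the straight apostrophe `'`, not the typographic `’`.

```python
import re

# A letter followed by an apostrophe, or a run of word characters,
# or a single character that is neither a word character nor whitespace.
pattern = re.compile(r"\w'|\w+|[^\w\s]")

tokens = pattern.findall("J'aime l'été.")
print(tokens)  # ["J'", 'aime', "l'", 'été', '.']
```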

In [None]:
# Number of unique words
len(set(filtered_words))

In [None]:
word_counter = Counter(filtered_words)
most_common_words = word_counter.most_common()
most_common_words

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
sorted_word_counts = sorted(list(word_counter.values()), reverse=True)

plt.loglog(sorted_word_counts)
plt.ylabel("Freq")
plt.xlabel("Word Rank")

In [None]:
df = pd.DataFrame.from_dict(word_counter, orient='index')
df = df.rename(columns={'index':'word', 0:'count'})
df = df.sort_values(by='count', ascending=False)
df

## Words that appear only once

In [None]:
unique_words = df[df['count'] == 1]
unique_words

Use `df.head(n)` for the `n` most used words.  
Use `df.tail(n)` for the `n` least used words.  
You can use `df[m:n]` for the words between the m-th and n-th most used.

For example, you can go through the words that are used only once to quickly find typos or erroneous words. First list some of the least used words with `df.tail(n)`, then use `sentences[sentences['Text'].str.contains(word)]` with each word you fetched. That way, you can quickly check the sentences containing that word.
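
The same idea can be sketched directly on a `Counter`, without going through the dataframe. The token list below is made-up toy data standing in for `filtered_words`, and `singletons` is a hypothetical name.

```python
from collections import Counter

# Toy token list (made up); 'chein' plays the role of a typo for 'chien'.
tokens = ['chat', 'chien', 'chat', 'oiseau', 'chein', 'chat', 'chien']
counter = Counter(tokens)

# Words seen exactly once are candidates for typos or rare vocabulary.
singletons = sorted(word for word, count in counter.items() if count == 1)
print(singletons)  # ['chein', 'oiseau']
```

Each candidate still has to be checked by eye, since a word used once may simply be rare rather than wrong.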

In [None]:
# First ten elements
df.head(10)

In [None]:
# From 11th to 20th
df[10:20]

In [None]:
# Last ten elements
df.tail(10)

In [None]:
# From the 15th-to-last to the 11th-to-last element
df[len(df)-15:len(df)-10]

In [None]:
test = df.head()
test

In [None]:
[t[0] for t in most_common_words[:10]]

The following displays the 15 least used words along with the sentences that contain them. Note, however, that this is a simplistic approach that may not return exactly what you want: if the word is `cat`, it will also return sentences containing `Cat`, `cats`, and so on.
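
If plural or embedded forms are unwanted, one possible refinement is to match on word boundaries. This is a sketch only: `get_sentences_whole_word` is a hypothetical name, the toy sentences are made up, and `\b` follows the regex notion of a word, which may not suit every language.

```python
import re
import pandas as pd

def get_sentences_whole_word(word, sentences):
    """Hypothetical helper: match `word` only as a standalone token, ignoring case."""
    # re.escape neutralizes regex metacharacters; \b marks word boundaries.
    pattern = r'\b' + re.escape(word) + r'\b'
    mask = sentences['Text'].str.contains(pattern, case=False, regex=True)
    return sentences[mask]

# Toy data (made up) standing in for `sentences`.
toy = pd.DataFrame({'Text': ['The cat sleeps.', 'Cats purr.', 'Cat!', 'Concatenate.']})
whole = get_sentences_whole_word('cat', toy)
```

With this toy data, only `The cat sleeps.` and `Cat!` are kept.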

In [None]:
n = 15
check_list = [t[0] for t in most_common_words[len(df)-n:len(df)]]
for word in check_list:
    print(word)
    display(get_sentences(word, sentences))