This notebook focuses on users' sentences and audio contributions. You can:
- [Find sentences of specific users](#users_sentences)
- [Check audio contributions](#audio) and fetch sentences with or without audio.

Before experimenting with any of the possibilities described above, you need to set up and execute the cells under the [Read sentences section](#read_sentences).

If you're new to Jupyter, please click on `Cell > Run All` from the top menu to see what the notebook does. You should see that cells that are running have an `In[*]` that will become `In[n]` when their execution is finished (`n` is a number). To run a specific cell, click in it and press `Shift + Enter` or click the `Run` button of the top menu. 

In any case, to be able to use the notebook correctly, please run the two following cells first.

In [None]:
import pandas as pd
import csv
import tarfile

In [None]:
pd.set_option('display.max_colwidth', None) # To display full content of the column (None = no limit; -1 is rejected by recent pandas versions)
# pd.set_option('display.max_rows', None) # To display ALL rows of the dataframe (otherwise you can decide the max number)

<a id='read_sentences'></a>
# Read sentences

Reading all sentences takes a long time so let's split the process into two steps. You only need to run the two following cells once.

In [None]:
!cat sentences_detailed.tar.bz2.part* > sentences_detailed.tar.bz2
def read_sentences_file():
    with tarfile.open('./sentences_detailed.tar.bz2', 'r:*') as tar:
        csv_path = tar.getnames()[0]
        return pd.read_csv(tar.extractfile(csv_path), 
                sep='\t', 
                header=None, 
                names=['sentenceID', 'ISO', 'Text', 'Username', 'Date added', 'Date last modified'],
                quoting=csv.QUOTE_NONE)

In [None]:
all_sentences = read_sentences_file()

Now, you can fetch sentences of a specific language using the following cells. If you want to change your target language, you can start again from here.

Note that by default, we get rid of the `ISO`, `Date added`, and `Date last modified` columns.  
If you need any of these columns, comment out the corresponding `del` lines in the next cell by adding a `#` at the beginning of each of them.

Then run the following cell.

In [None]:
def sentences_of_language(sentences, language):
    target_sentences = sentences[sentences['ISO'] == language]
    del target_sentences['Date added']
    del target_sentences['Date last modified']
    del target_sentences['ISO']
    target_sentences = target_sentences.set_index("sentenceID")
    return target_sentences

Choose your target `language` as a 3-letter ISO code (`cmn`, `fra`, `jpn`, `eng`, etc.), and run the next one.

In [None]:
language = 'eng'  # <-- Modify this value
sentences = sentences_of_language(all_sentences, language)

Now, the variable `sentences` contains the sentences of the language you specified. Wanna check? The following cell displays five random sentences from your set, just for a quick check.

In [None]:
sentences.sample(5)

<a id='users_sentences'></a>
# Get sentences of specific users

As its name indicates, you can use this section to fetch sentences belonging to specific users.

Run the following cell (you don't have to modify it).

In [None]:
def get_sentences_of_user(sentences, users):
    target = sentences[sentences['Username'].isin(users)]
    print(len(target), "sentences fetched.")
    return target.sort_values(by='Username') # Modify this to change the sorting

You can specify the name of the user(s) whose sentences you want in the `usernames` list. By running the following cell, you will fetch the sentences in the language you set above.

In [None]:
usernames = ['AlanF_US', 'CK']  # <-- Modify these values
user_sentences = get_sentences_of_user(sentences, usernames)
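
Keep in mind that `isin` matches usernames exactly, so capitalization matters. The following is a minimal sketch on made-up data (the `toy` frame and the `wanted` set are hypothetical, for illustration only) showing the exact match next to one possible case-insensitive variant.

```python
import pandas as pd

# Made-up data standing in for `sentences`
toy = pd.DataFrame(
    {'Text': ['Hi.', 'Bye.', 'Thanks.'],
     'Username': ['CK', 'ck', 'AlanF_US']},
    index=pd.Index([1, 2, 3], name='sentenceID'))

# Exact match, as done in get_sentences_of_user: only 'CK' is kept
exact = toy[toy['Username'].isin(['CK'])]

# Case-insensitive variant: lowercase the column and compare to lowercase names
wanted = {'ck'}
loose = toy[toy['Username'].str.lower().isin(wanted)]

print(exact)
print(loose)
```

If a username returns 0 sentences, checking its exact spelling is a good first step.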

The following cell displays a small sample of the sentences you fetched, just to check everything looks fine.

In [None]:
user_sentences.sample(10)

<a id='audio'></a>
# Audio

In this section, you can filter your sentence set to fetch only the sentences that have audio (or only those that do not). Note that you need to have prepared the `user_sentences` variable in the previous section.

Run the following cell (you don't have to modify it).

In [None]:
def read_audio_file():
    with tarfile.open('./sentences_with_audio.tar.bz2', 'r:*') as tar:
        csv_path = tar.getnames()[0]
        sentences = pd.read_csv(tar.extractfile(csv_path), 
                sep='\t', 
                header=None, 
                names=['sentenceID', 'Username', 'License', 'Attribution URL'], 
                quoting=csv.QUOTE_NONE)
    del sentences['License']
    del sentences['Attribution URL']
    return sentences

sentences_with_audio = read_audio_file()
audio_ids = sentences_with_audio['sentenceID'].values

Now, you should have all sentences having audio inside the `sentences_with_audio` variable. You can quickly check if `sentences_with_audio` looks OK by running the following cell.

In [None]:
sentences_with_audio.sample(10)

The following section allows you to check which sentences of your current `user_sentences` set have / do not have audio. That's where you'll need to make sure to have run the cells of the previous section.

Note that `user_sentences` was filtered by the sentence author, not the audio contributor.
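
To make the distinction concrete, here is a minimal sketch on made-up data. It assumes that the `Username` column of the audio file holds the audio contributor; the toy frames and the `contributor` variable are hypothetical. It selects the sentences recorded by a given user, regardless of who wrote them.

```python
import pandas as pd

# Made-up stand-ins for `sentences` and `sentences_with_audio`
toy_sentences = pd.DataFrame(
    {'Text': ['Hello.', 'Good morning.', 'See you.'],
     'Username': ['CK', 'AlanF_US', 'CK']},  # sentence authors
    index=pd.Index([1, 2, 3], name='sentenceID'))
toy_audio = pd.DataFrame(
    {'sentenceID': [1, 2],
     'Username': ['AlanF_US', 'AlanF_US']})  # assumed: audio contributors

contributor = 'AlanF_US'  # hypothetical audio contributor

# IDs of the sentences this user recorded
recorded_ids = toy_audio.loc[toy_audio['Username'] == contributor, 'sentenceID']

# Sentences with audio by that user, whoever the author is
recorded_by_contributor = toy_sentences[toy_sentences.index.isin(recorded_ids)]
print(recorded_by_contributor)
```

In this toy example, sentence 1 was written by CK but recorded by AlanF_US, so it appears in the result.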

Run the following cell (you don't have to modify it).

In [None]:
def subset_with_audio(sentences, audio_ids):
    # Keep the rows whose sentence ID (the index) appears in audio_ids
    target = sentences[sentences.index.isin(audio_ids)]
    print(len(target), "sentences with audio fetched.")
    return target

def subset_without_audio(sentences, audio_ids):
    # Keep the rows whose sentence ID (the index) does NOT appear in audio_ids
    target = sentences[~sentences.index.isin(audio_ids)]
    print(len(target), "sentences without audio fetched.")
    return target

user_sentences_with_audio = subset_with_audio(user_sentences, audio_ids)
user_sentences_without_audio = subset_without_audio(user_sentences, audio_ids)

Now, `user_sentences_with_audio` and `user_sentences_without_audio` contain the sentences of your current set with and without audio, respectively. You can fetch them and play with them.  
That way, you can extract the sentences you want to record in the way you want: in order, at random, containing specific words, etc. (You may need to have a look at other notebooks, or get your hands dirty, to achieve what you want though.)
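
As one illustration of filtering by word, the sketch below keeps the sentences containing a given word, using pandas' `str.contains` with a regular expression. The toy frame and the `word` variable are made up for the example; on real data you would apply the same filter to `user_sentences_without_audio`.

```python
import re
import pandas as pd

# Made-up data standing in for `user_sentences_without_audio`
toy_no_audio = pd.DataFrame(
    {'Text': ['I like tea.', 'Tom likes coffee.', 'Tea is ready.'],
     'Username': ['CK', 'CK', 'AlanF_US']},
    index=pd.Index([10, 11, 12], name='sentenceID'))

word = 'tea'  # hypothetical search term

# \b marks word boundaries; re.escape protects special characters in the word
pattern = rf'\b{re.escape(word)}\b'

# case=False ignores capitalization, so 'Tea' matches too
with_word = toy_no_audio[toy_no_audio['Text'].str.contains(pattern, case=False, regex=True)]
print(with_word)
```

Word boundaries work well for languages that separate words with spaces; for languages such as Japanese or Chinese, a plain substring search (without `\b`) is probably a better fit.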

Note also that taking sentences at random may require a bit more management if you want to do it often. Because the selection is random, you may get the same sentences several times, so you will need some way to keep track of the ones you have already taken: saving them to files, copy-pasting them directly, etc.
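
One possible way to avoid drawing the same sentence twice is to remember the IDs already handed out and exclude them before sampling. This is only a sketch on made-up data: `sample_unseen` and `already_taken` are hypothetical names, and the set lives in memory, so you would have to save it to a file yourself to keep it between sessions.

```python
import pandas as pd

# Made-up data standing in for `user_sentences_without_audio`
toy_no_audio = pd.DataFrame(
    {'Text': [f'Sentence {i}.' for i in range(1, 11)]},
    index=pd.Index(range(1, 11), name='sentenceID'))

already_taken = set()  # sentence IDs handed out so far

def sample_unseen(frame, n, taken):
    """Sample up to n rows whose ID is not in `taken`, then record their IDs."""
    remaining = frame[~frame.index.isin(list(taken))]
    picked = remaining.sample(min(n, len(remaining)))
    taken.update(picked.index)
    return picked

first_batch = sample_unseen(toy_no_audio, 4, already_taken)
second_batch = sample_unseen(toy_no_audio, 4, already_taken)
print(first_batch)
print(second_batch)
```

The two batches never overlap, and once every sentence has been taken the function simply returns an empty set.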

Just for an illustration, here is how to take the first 50 sentences of `user_sentences_without_audio`. 

As a sidenote, by default, a maximum of 60 rows will be displayed (the first 30 and the last 30). If you want to display more, you can use 
```
pd.set_option('display.max_rows', n)
```
where n is the number of rows you want to display at maximum.
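
If you only want to raise the limit for a single display, pandas also provides `pd.option_context`, which restores the previous setting when the `with` block ends. The sketch below uses a made-up frame (`toy`) and an arbitrary limit of 200.

```python
import pandas as pd

# Made-up frame with 100 rows
toy = pd.DataFrame({'Text': [f'Sentence {i}.' for i in range(100)]})

before = pd.get_option('display.max_rows')

# The option is changed only inside the `with` block
with pd.option_context('display.max_rows', 200):
    inside = pd.get_option('display.max_rows')
    rendered = repr(toy)  # rendered with the temporary limit

after = pd.get_option('display.max_rows')
print(before, inside, after)
```

In a notebook, you would typically call `display(toy)` inside the `with` block instead of `repr(toy)`.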

Slicing is often better than displaying everything, but in this particular case, we may need to display one or two hundred sentences so...

In [None]:
user_sentences_without_audio[:50]

50 sentences at random:

In [None]:
user_sentences_without_audio.sample(50)

## More filtering, limiting to one user

Suppose you created a set with the sentences belonging to two users, let's say AlanF_US and CK, and created `user_sentences_without_audio` as above. Now, if you want to restrict the set to only one user, you can of course go back to the [Get sentences of specific users](#users_sentences) section and do everything again. However, there is a simpler way!

The sentences without audio of a specific user can be fetched in the following manner. Here, following our example, we filter to the sentences belonging to CK only.

In [None]:
username = 'CK'  # <-- Modify this value
one_user_no_audio = user_sentences_without_audio[user_sentences_without_audio['Username'] == username]
print(f'{one_user_no_audio.shape[0]} sentences without audio belonging to {username}.')

Of course, you can fetch from the `one_user_no_audio` set the same way as before. For example, to get 50 random sentences, run the following cell.

In [None]:
one_user_no_audio.sample(50)

For more advanced ways of fetching sentences, check the other notebooks and add code below!