Run the following two cells first.

In [None]:
import pandas as pd
import csv
import tarfile

In [None]:
pd.set_option('display.max_colwidth', None) # To display full content of the column (recent pandas versions reject the old value -1)
# pd.set_option('display.max_rows', None) # To display ALL rows of the dataframe (otherwise you can decide the max number)

# Read sentences (do this first)

Reading all sentences takes a long time, so let's split the process into two steps. You only need to run the following two cells once.

In [None]:
!cat sentences_detailed.tar.bz2.part* > sentences_detailed.tar.bz2
def read_sentences_file():
    with tarfile.open('./sentences_detailed.tar.bz2', 'r:*') as tar:
        csv_path = tar.getnames()[0]
        return pd.read_csv(tar.extractfile(csv_path), 
                sep='\t', 
                header=None, 
                names=['sentenceID', 'ISO', 'Text', 'Username', 'Date added', 'Date last modified'],
                quoting=csv.QUOTE_NONE)

In [None]:
all_sentences = read_sentences_file()
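
The `quoting=csv.QUOTE_NONE` argument is worth a word: sentences can contain unbalanced quotation marks, and the default quoting rules would then merge several fields or rows together. Here is a minimal sketch of the same reading logic on a tiny archive built in memory. The sample rows, the usernames and the `sample.csv` member name are all made up for illustration.

```python
import csv
import io
import tarfile

import pandas as pd

COLUMNS = ['sentenceID', 'ISO', 'Text', 'Username', 'Date added', 'Date last modified']

# Invented two-row sample; the first sentence starts with an unbalanced quote.
SAMPLE_TSV = (
    '1\teng\t"Hi, she said.\tuser_a\t2020-01-01 00:00:00\t2020-01-01 00:00:00\n'
    '2\tfra\tBonjour.\tuser_b\t2020-01-02 00:00:00\t2020-01-02 00:00:00\n'
)

def build_sample_archive():
    """Return an in-memory .tar.bz2 archive holding a single TSV member."""
    payload = SAMPLE_TSV.encode('utf-8')
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode='w:bz2') as tar:
        info = tarfile.TarInfo(name='sample.csv')
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer

def read_sample(archive):
    """Read the first member of the archive as a tab-separated table."""
    with tarfile.open(fileobj=archive, mode='r:*') as tar:
        member = tar.getnames()[0]
        return pd.read_csv(tar.extractfile(member),
                           sep='\t',
                           header=None,
                           names=COLUMNS,
                           quoting=csv.QUOTE_NONE)  # treat quotes as ordinary characters

sample = read_sample(build_sample_archive())
print(sample[['sentenceID', 'ISO', 'Text']])
```

With `QUOTE_NONE`, the leading quotation mark stays part of the sentence text and both rows are kept separate.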

Now you can fetch the sentences of a specific language using the following cells. When you want to change your target language, you can start again from here.

In [None]:
def sentences_of_language(sentences, language):
    # Keep only the rows of the target language
    target_sentences = sentences[sentences['ISO'] == language]
    # Drop the columns we no longer need, then index by sentence ID
    target_sentences = target_sentences.drop(columns=['Date added', 'Date last modified', 'ISO'])
    return target_sentences.set_index('sentenceID')

Choose your target language as a 3-letter ISO code (`cmn`, `fra`, `jpn`, `eng`, etc.).

In [None]:
language = 'eng'
sentences = sentences_of_language(all_sentences, language)
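
If you are not sure which codes are available, counting the values of the `ISO` column is one way to find out. The sketch below runs on a toy dataframe with invented rows; on the real data you would use `all_sentences` instead of `toy_sentences`.

```python
import pandas as pd

# Toy stand-in for `all_sentences`; the rows are invented for illustration.
toy_sentences = pd.DataFrame({
    'sentenceID': [1, 2, 3, 4],
    'ISO': ['eng', 'fra', 'eng', 'jpn'],
    'Text': ['Hello.', 'Bonjour.', 'Thanks.', 'ありがとう。'],
    'Username': ['user_a', 'user_b', 'user_a', 'user_c'],
})

# Number of sentences per language code, most frequent first.
language_counts = toy_sentences['ISO'].value_counts()
print(language_counts)

# Guard against typos in the code before filtering.
language = 'eng'
if language not in language_counts.index:
    raise ValueError(f'Unknown language code: {language}')
```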

The following cell displays the first five sentences of your set, just for a quick check.

In [None]:
sentences.head()

## Get sentences of specific users

First, run the following cell.

In [None]:
def get_sentences_of_user(sentences, users):
    target = sentences[sentences['Username'].isin(users)]
    print(len(target), "sentences fetched.")
    return target.sort_values(by='Username') # Modify this to change the sorting

You can specify the name of the user(s) whose sentences you want in the `usernames` list. By running the following cell, you will fetch the sentences in the language you set above.

In [None]:
usernames = ['AlanF_US', 'CK']
user_sentences = get_sentences_of_user(sentences, usernames)
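
The function sorts by username, but other orderings may suit you better. Here are two possible variants, shown on a toy dataframe with invented rows. Note that the `key` argument of `sort_values` requires pandas 1.1 or later.

```python
import pandas as pd

# Toy stand-in for `user_sentences`; the rows are invented for illustration.
toy_user_sentences = pd.DataFrame(
    {'Text': ['I agree.', 'Hi.', 'See you tomorrow.'],
     'Username': ['user_b', 'user_a', 'user_a']},
    index=pd.Index([30, 10, 20], name='sentenceID'),
)

# Sort by sentence ID first, then by username with a stable sort,
# so that each user's sentences stay in ID order.
by_user_then_id = (toy_user_sentences
                   .sort_index()
                   .sort_values(by='Username', kind='mergesort'))

# Shortest sentences first (requires pandas >= 1.1 for `key`).
shortest_first = toy_user_sentences.sort_values(by='Text', key=lambda col: col.str.len())

print(by_user_then_id)
print(shortest_first)
```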

The following cell displays a small sample of the sentences you fetched, just to check everything looks fine.

In [None]:
user_sentences.sample(10)

# Audio

First, run the following two cells.

In [None]:
def read_audio_file():
    with tarfile.open('./sentences_with_audio.tar.bz2', 'r:*') as tar:
        csv_path = tar.getnames()[0]
        sentences = pd.read_csv(tar.extractfile(csv_path), 
                sep='\t', 
                header=None, 
                names=['sentenceID', 'Username', 'License', 'Attribution URL'], 
                quoting=csv.QUOTE_NONE)
    del sentences['License']
    del sentences['Attribution URL']
    return sentences

In [None]:
sentences_with_audio = read_audio_file()
audio_ids = sentences_with_audio['sentenceID'].values

You can quickly check that the `sentences_with_audio` set looks OK by running the following cell:

In [None]:
sentences_with_audio.sample(10)

The following section lets you check which sentences of your current `user_sentences` set have audio and which do not.

Note that `user_sentences` was filtered by the sentence author, not the audio contributor.

In [None]:
def subset_with_audio(sentences, audio_ids):
    # Use the `sentences` argument rather than the global `user_sentences`
    target = sentences[sentences.index.isin(audio_ids)]
    print(len(target), "sentences with audio fetched.")
    return target

In [None]:
def subset_without_audio(sentences, audio_ids):
    # Use the `sentences` argument rather than the global `user_sentences`
    target = sentences[~sentences.index.isin(audio_ids)]
    print(len(target), "sentences without audio fetched.")
    return target

In [None]:
user_sentences_with_audio = subset_with_audio(user_sentences, audio_ids)

In [None]:
user_sentences_without_audio = subset_without_audio(user_sentences, audio_ids)
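
Since `user_sentences` is filtered by sentence author, you may sometimes want the opposite view: the sentences whose audio was recorded by a given contributor. The sketch below assumes that the `Username` column of `sentences_with_audio` identifies the audio contributor. The helper `sentences_recorded_by` and the toy rows are hypothetical, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for `sentences` and `sentences_with_audio`; all rows are invented.
toy_sentences = pd.DataFrame(
    {'Text': ['Hello.', 'Thanks.', 'Goodbye.'],
     'Username': ['author_a', 'author_b', 'author_a']},
    index=pd.Index([1, 2, 3], name='sentenceID'),
)
toy_audio = pd.DataFrame({
    'sentenceID': [1, 2],
    'Username': ['speaker_x', 'speaker_y'],  # assumed to be the audio contributor
})

def sentences_recorded_by(sentences, audio, contributors):
    """Return the sentences whose audio was contributed by one of `contributors`."""
    recorded_ids = audio.loc[audio['Username'].isin(contributors), 'sentenceID']
    return sentences[sentences.index.isin(recorded_ids)]

recorded = sentences_recorded_by(toy_sentences, toy_audio, ['speaker_x'])
print(recorded)
```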

You can fetch the sentences without audio of a specific user as follows:

In [None]:
one_user_no_audio = user_sentences_without_audio[user_sentences_without_audio['Username'] == 'CK']
print(f'{one_user_no_audio.shape[0]} sentences without audio.')

And now you can fetch them and play with them.  
That way, you can extract the sentences you want to record in whatever way you like: in order, at random, containing specific words, etc. (You may need to have a look at other notebooks, or get your hands dirty, to achieve what you want, though.)
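
As an illustration of the "containing specific words" case, here is a minimal sketch based on `str.contains`. The helper `containing_word` and the toy rows are hypothetical. The `\b` word boundaries work for languages that separate words with spaces; for languages such as Japanese or Chinese you would probably want a plain substring match instead.

```python
import re

import pandas as pd

# Toy stand-in for `one_user_no_audio`; the rows are invented for illustration.
toy_no_audio = pd.DataFrame(
    {'Text': ['I like tea.', 'Teachers work hard.', 'Do you want coffee?'],
     'Username': ['user_a', 'user_a', 'user_a']},
    index=pd.Index([1, 2, 3], name='sentenceID'),
)

def containing_word(sentences, word):
    """Keep the sentences that contain `word` as a whole word, ignoring case."""
    pattern = r'\b' + re.escape(word) + r'\b'
    mask = sentences['Text'].str.contains(pattern, case=False, regex=True)
    return sentences[mask]

tea_sentences = containing_word(toy_no_audio, 'tea')
print(tea_sentences)
```

Only the first sentence is kept here: "Teachers" contains the letters "tea" but not as a separate word.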

Note also that taking sentences at random may require a bit more management if you do it often: being random, the sampling may return the same sentences several times. You can keep track of the ones you already took using files, copy-pasting directly, etc.
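
One possible way to avoid drawing the same sentences twice is to remember the IDs you already took and exclude them from the next draw. This is only a sketch: the helper `sample_unseen`, its `seed` parameter and the toy rows are hypothetical, and the record is kept in memory here rather than in a file.

```python
import pandas as pd

# Toy stand-in for `one_user_no_audio`; the rows are invented for illustration.
toy_no_audio = pd.DataFrame(
    {'Text': [f'Sentence {i}.' for i in range(1, 11)]},
    index=pd.Index(range(1, 11), name='sentenceID'),
)

def sample_unseen(sentences, n, seen_ids, seed=None):
    """Sample up to `n` sentences whose IDs are not in `seen_ids`, then record them."""
    remaining = sentences[~sentences.index.isin(list(seen_ids))]
    picked = remaining.sample(min(n, len(remaining)), random_state=seed)
    seen_ids.update(picked.index)  # remember what was drawn
    return picked

seen_ids = set()
first_batch = sample_unseen(toy_no_audio, 4, seen_ids, seed=0)
second_batch = sample_unseen(toy_no_audio, 4, seen_ids, seed=0)
print(sorted(first_batch.index), sorted(second_batch.index))
```

To keep the record across sessions, `seen_ids` could be saved to a file and loaded again before the next draw.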

Just for an illustration, here is how to take the first 50 sentences of `one_user_no_audio`. 

As a side note, pandas displays at most 60 rows by default and truncates longer dataframes, showing only their first and last rows. If you want to display more, you can use
```
pd.set_option('display.max_rows', n)
```
where `n` is the maximum number of rows you want to display.
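
If you only need the larger limit for a single cell, `pd.option_context` changes the option temporarily and restores the previous value afterwards. Here is a small sketch on a toy dataframe with invented rows.

```python
import pandas as pd

# Toy dataframe with 200 invented rows.
toy_frame = pd.DataFrame({'Text': [f'Sentence {i}.' for i in range(200)]})

default_max_rows = pd.get_option('display.max_rows')

# Inside the block the limit is 200 rows; it is restored when the block ends.
with pd.option_context('display.max_rows', 200):
    rows_inside = pd.get_option('display.max_rows')
    print(toy_frame)  # in a notebook, display(toy_frame) would be used instead

rows_after = pd.get_option('display.max_rows')
```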

Slicing is often better than displaying everything, but in this particular case, we may need to display one or two hundred sentences so...

In [None]:
one_user_no_audio[:50]

50 sentences at random:

In [None]:
one_user_no_audio.sample(50)

For more advanced ways of fetching sentences, check the other notebooks and add code here!