# **Process Ngram Files**

## Generate Vocabulary File
Make a list of the _n_ most common unigrams (1grams). This file can be used to filter multi-token ngrams. Unigrams containing **untagged tokens**, **numerals**, or **non-alphabetic characters** are dropped.
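As a rough illustration of the drop rule above, the sketch below keeps a unigram only if it carries a part-of-speech tag and its base form is purely alphabetic. It assumes the `word_TAG` token layout and the tag set of the Google Books Ngram corpus. The function name `keep_unigram` is hypothetical and is not taken from `download_and_filter_ngrams.py`.

```python
# POS tags assumed to follow the Google Books Ngram convention.
POS_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
            "ADP", "NUM", "CONJ", "PRT", "X", "."}


def keep_unigram(token: str) -> bool:
    """Return True if the token is tagged and its base form is alphabetic."""
    base, sep, tag = token.rpartition("_")
    if not sep or tag not in POS_TAGS:
        return False  # untagged token
    # str.isalpha() rejects digits, hyphens, punctuation, and empty strings.
    return base.isalpha()
```

Under these assumptions, `keep_unigram("history_NOUN")` is true, while `"history"`, `"1984_NUM"`, and `"well-known_ADJ"` are all rejected.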

#### Set base directory

In [1]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

#### Download

In [None]:
!python download_and_filter_ngrams.py \
    --ngram_size 1 \
    --processes 14 \
    --file_range 0 24

#### Lowercase

In [None]:
!python lowercase.py \
    --input_dir "{base_dir}/1gram_files/original" \
    --output_dir "{base_dir}/1gram_files/lowercase" \
    --processes 14

#### Lemmatize

In [None]:
!python lemmatize.py \
    --input_dir "{base_dir}/1gram_files/lowercase" \
    --output_dir "{base_dir}/1gram_files/lemmas" \
    --processes 14

#### Sort and concatenate

In [None]:
!python sort_and_concatenate.py \
    --input_dir "{base_dir}/1gram_files/lemmas" \
    --temp_dir "{base_dir}/1gram_files/temp" \
    --output_file "{base_dir}/1gram_files/concatenated/1grams_sort.jsonl" \
    --processes 14
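One common way to sort and concatenate many files without loading them all into memory is a k-way merge of chunks that are already sorted. The sketch below shows that idea with `heapq.merge`. It is not necessarily how `sort_and_concatenate.py` works, and the `"ngram"` key in each JSONL record is an assumed field name.

```python
import heapq
import json
from typing import Iterable, Iterator


def merge_sorted_chunks(chunks: Iterable[Iterable[str]]) -> Iterator[str]:
    """Lazily merge iterables of JSONL lines, each already sorted by 'ngram'."""
    # "ngram" is an assumed field name; adjust to the real record layout.
    return heapq.merge(*chunks, key=lambda line: json.loads(line)["ngram"])
```

Because `heapq.merge` is lazy, each chunk can be an open file handle, so only one line per chunk needs to be held in memory at a time.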

In [None]:
!python verify_sort.py \
    --input_file "{base_dir}/1gram_files/concatenated/1grams_sort.jsonl"
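A sort check of this kind can be done in a single streaming pass that compares each record with the previous one. The sketch below is a minimal version under the assumption that records are JSON objects with an `"ngram"` key. Both the key name and the function `is_sorted` are illustrative and are not taken from `verify_sort.py`.

```python
import json
from typing import Iterable


def is_sorted(lines: Iterable[str], key: str = "ngram") -> bool:
    """Return True if the JSONL lines are in non-decreasing order of `key`."""
    previous = None
    for line in lines:
        current = json.loads(line)[key]
        if previous is not None and current < previous:
            return False  # found an out-of-order pair
        previous = current
    return True
```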

#### Remove stopwords

In [None]:
!python remove_stopwords.py \
    --input_file "{base_dir}/1gram_files/concatenated/1grams_sort.jsonl" \
    --output_file "{base_dir}/1gram_files/concatenated/1grams_sort-stop.jsonl" \
    --removal_method token

#### Remove short words

In [None]:
!python remove_short_words.py \
    --input_file "{base_dir}/1gram_files/concatenated/1grams_sort-stop.jsonl" \
    --output_file "{base_dir}/1gram_files/concatenated/1grams_sort-short.jsonl" \
    --min_length 3 \
    --removal_method token
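The two removal steps above can be pictured as token-level filters. The sketch below assumes that `--removal_method token` removes individual tokens rather than whole ngrams, and that `--min_length 3` drops tokens shorter than three characters. Neither behavior is confirmed by this notebook. The stopword set is a tiny illustrative sample, and all function names are hypothetical.

```python
from typing import Callable, List

# Tiny illustrative sample; not the list used by remove_stopwords.py.
STOPWORDS = {"the", "of", "and", "a", "in"}


def is_stopword(token: str) -> bool:
    return token in STOPWORDS


def is_short(token: str, min_length: int = 3) -> bool:
    # Assumption: tokens with fewer than min_length characters are dropped.
    return len(token) < min_length


def drop_tokens(tokens: List[str], should_drop: Callable[[str], bool]) -> List[str]:
    """Return the tokens for which should_drop is False, preserving order."""
    return [t for t in tokens if not should_drop(t)]
```

For example, `drop_tokens(["the", "history", "of", "art"], is_stopword)` leaves `["history", "art"]`. For a unigram, an empty result would mean the entry is discarded.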

#### Consolidate

In [None]:
!python consolidate_ngrams.py \
    --input_file "{base_dir}/1gram_files/concatenated/1grams_sort-short.jsonl" \
    --output_file "{base_dir}/1gram_files/concatenated/1grams_sort-short-consol.jsonl"
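Lowercasing and lemmatizing can map several original ngrams onto the same string, so a sorted file may contain runs of identical entries. The sketch below merges such adjacent duplicates by summing their counts. It assumes records with `"ngram"` and `"freq"` keys, which are illustrative names. The real record layout and the logic of `consolidate_ngrams.py` may differ.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, Iterable, Iterator


def consolidate(records: Iterable[Dict]) -> Iterator[Dict]:
    """Merge adjacent records that share an ngram, summing 'freq'.

    The input must already be sorted by 'ngram', because groupby only
    groups consecutive items.
    """
    for ngram, group in groupby(records, key=itemgetter("ngram")):
        yield {"ngram": ngram, "freq": sum(r["freq"] for r in group)}
```

This is why consolidation comes after sorting: with sorted input, duplicates are always adjacent and the merge needs only one pass.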

#### Add index

In [None]:
!python index_ngrams.py \
    --input_file "{base_dir}/1gram_files/concatenated/1grams_sort-short-consol.jsonl" \
    --output_file "{base_dir}/1gram_files/concatenated/1grams_sort-short-consol-index.jsonl"

#### Create file of _n_ most common tokens

In [None]:
!python make_vocab_list.py \
    --input_file "{base_dir}/1gram_files/concatenated/1grams_sort-short-consol-index.jsonl" \
    --n_vocab 100000 \
    --output_file "{base_dir}/valid_vocab_lookup.txt" \
    --membership_file "{base_dir}/valid_vocab_membertest.txt"
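Selecting the _n_ most common tokens amounts to ranking tokens by frequency and keeping the top of the list. The sketch below assumes a mapping from token to count, and returns both a ranked list and a set for fast membership tests. How the two output files of `make_vocab_list.py` actually differ is not documented here, so that pairing is an assumption, as is the name `build_vocab`.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple


def build_vocab(counts: Dict[str, int], n_vocab: int) -> Tuple[List[str], FrozenSet[str]]:
    """Return the n_vocab most frequent tokens as a ranked list and as a set."""
    ranked = [token for token, _ in Counter(counts).most_common(n_vocab)]
    return ranked, frozenset(ranked)
```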

## Process multigram files
Download multigrams (_n_ = 2–5). Drop those containing **untagged tokens**, **numerals**, or **non-alphabetic characters**.

Optionally, specify a **vocabulary file** for additional filtering. Vocabulary filtering discards ngrams containing tokens absent from the vocabulary file. For matching only, part-of-speech (POS) tags are stripped and the base tokens are lowercased and lemmatized. When an ngram passes the vocabulary filter, it is kept unchanged, with the original case, inflection, and POS tags of its tokens.
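The matching logic described above can be sketched as follows: each token is normalized only for the vocabulary lookup, and the original ngram string is returned untouched if every token matches. The `word_TAG` layout is assumed, and `TOY_LEMMAS` is a hypothetical stand-in for a real lemmatizer. None of the names below come from `download_and_filter_ngrams.py`.

```python
from typing import Optional, Set

# Hypothetical stand-in for a real lemmatizer.
TOY_LEMMAS = {"dogs": "dog", "running": "run", "barked": "bark"}


def normalize(token: str) -> str:
    """Strip the POS tag, lowercase, and lemmatize a token for matching."""
    base = token.rpartition("_")[0] or token  # fall back if there is no tag
    lowered = base.lower()
    return TOY_LEMMAS.get(lowered, lowered)


def filter_ngram(ngram: str, vocab: Set[str]) -> Optional[str]:
    """Return the original ngram if all tokens are in vocab, else None."""
    if all(normalize(tok) in vocab for tok in ngram.split()):
        return ngram  # original case, inflection, and tags are preserved
    return None
```

With `vocab = {"dog", "run"}`, the ngram `"Dogs_NOUN running_VERB"` is returned unchanged, while `"Dogs_NOUN barked_VERB"` yields `None` because `bark` is not in the vocabulary.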

#### Download

In [None]:
!python download_and_filter_ngrams.py \
    --ngram_size 5 \
    --processes 48 \
    --file_range 10001 15000 \
    --vocab_file "{base_dir}/valid_vocab_membertest.txt" \
    --overwrite

Loaded vocabulary file.

Downloading and filtering.

Files:  20%|█████▋                       | 1982/10001 [46:09<3:45:01,  1.68s/it]

#### Lowercase

In [3]:
!python lowercase.py \
    --input_dir "{base_dir}/5gram_files/original" \
    --output_dir "{base_dir}/5gram_files/lowercase" \
    --processes 14

Files: 100%|███████████████████████████| 19423/19423 [00:59<00:00, 328.19file/s]

Processing complete.


#### Lemmatize

In [4]:
!python lemmatize.py \
    --input_dir "{base_dir}/5gram_files/lowercase" \
    --output_dir "{base_dir}/5gram_files/lemmas" \
    --processes 14

Files: 100%|███████████████████████████| 19423/19423 [00:26<00:00, 739.45file/s]

Processing complete.


#### Sort and concatenate

In [None]:
!python sort_and_concatenate.py \
    --input_dir "{base_dir}/5gram_files/lemmas" \
    --temp_dir "{base_dir}/5gram_files/temp" \
    --output_file "{base_dir}/5gram_files/4-concatenated/1-5grams_sort.jsonl" \
    --processes 14

In [None]:
!python verify_sort.py \
    --input_file "{base_dir}/5gram_files/4-concatenated/1-5grams_sort.jsonl"

#### Consolidate ngrams

In [None]:
!python consolidate_ngrams.py \
    --input_file "{base_dir}/5gram_files/4-concatenated/1-5grams_sort.jsonl" \
    --output_file "{base_dir}/5gram_files/4-concatenated/2-5grams_consolidated.jsonl"

#### Remove stopwords

In [None]:
!python remove_stopwords.py \
    --input_file "{base_dir}/5gram_files/4-concatenated/2-5grams_consolidated.jsonl" \
    --output_file "{base_dir}/5gram_files/4-concatenated/3-5grams_no_stop.jsonl" \
    --removal_method token

#### Remove short words

In [None]:
!python remove_short_words.py \
    --input_file "{base_dir}/5gram_files/4-concatenated/3-5grams_no_stop.jsonl" \
    --output_file "{base_dir}/5gram_files/4-concatenated/4-5grams_no_short.jsonl" \
    --min_length 3 \
    --removal_method token

#### Index ngrams

In [None]:
!python index_ngrams.py \
    --input_file "{base_dir}/5gram_files/4-concatenated/4-5grams_no_short.jsonl" \
    --output_file "{base_dir}/5gram_files/4-concatenated/5-5grams_indexed.jsonl"

In [None]:
!python print_jsonl_lines.py \
    --file_path "{base_dir}/5gram_files/4-concatenated/5-5grams_indexed.jsonl" \
    --start 60000 \
    --end 60500