# **Process Ngram Files**

## Generate Vocabulary File
Make a list of the _n_ most common unigrams (1grams). This file can be used to filter multi-token ngrams. Unigrams containing **untagged tokens**, **numerals**, or **non-alphabetic characters** are dropped.
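
The drop rule can be sketched as a single predicate. This is a minimal illustration, not the logic of `download_and_filter_ngrams.py`: it assumes tokens take the Google Books form `word_TAG` (for example `run_VERB`), and the name `is_valid_token` is hypothetical.

```python
# Hypothetical sketch: decide whether a unigram survives the filter.
# Assumes the Google Books "word_TAG" token convention.
POS_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
            "ADP", "NUM", "CONJ", "PRT", "X"}


def is_valid_token(token: str) -> bool:
    """Return True for a tagged, purely alphabetic token such as 'run_VERB'."""
    base, sep, tag = token.rpartition("_")
    if not sep or tag not in POS_TAGS:
        return False  # untagged token
    # str.isalpha() is False for empty strings and for any digit or
    # punctuation character; it accepts non-ASCII letters such as "é".
    return base.isalpha()
```

Under these assumptions `run_VERB` is kept, while `run` (untagged), `1984_NUM` (numeral), and `e-mail_NOUN` (non-alphabetic) are dropped.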

#### Select the appropriate base directory

In [None]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

In [None]:
base_dir = "/Volumes/knowleslab/sharedresources/NLP_corpora/Google_Books/20200217/eng"
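
As an alternative to running only one of the two cells, the choice could be automated. The helper below is a hypothetical convenience, not part of the repository: it returns the first candidate path that exists on the current machine.

```python
from pathlib import Path


def pick_base_dir(candidates):
    """Return the first candidate that exists as a directory, else raise."""
    for candidate in candidates:
        if Path(candidate).is_dir():
            return candidate
    raise FileNotFoundError(f"None of these directories exist: {candidates}")
```

Calling `pick_base_dir` with the two paths above would set `base_dir` to whichever volume is mounted.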

#### Download

In [None]:
!python download_and_filter_ngrams.py \
    --ngram_size 1 \
    --processes 12 \
    --file_range 0 24 \
    --output_dir {base_dir}

#### Lowercase

In [None]:
!python lowercase.py \
    --input_dir "{base_dir}/1gram_files/orig" \
    --output_dir "{base_dir}/1gram_files/lower" \
    --processes 12

#### Lemmatize

In [None]:
!python lemmatize.py \
    --input_dir "{base_dir}/1gram_files/lower" \
    --output_dir "{base_dir}/1gram_files/lemmas" \
    --processes 12

#### Remove stopwords

In [None]:
!python remove_stopwords.py \
    --input_dir "{base_dir}/1gram_files/lemmas" \
    --output_dir "{base_dir}/1gram_files/stop" \
    --processes 14 \
    --removal_method token \
    --min_tokens 1

#### Remove short words

In [None]:
!python remove_short_words.py \
    --input_dir "{base_dir}/1gram_files/lemmas" \
    --output_dir "{base_dir}/1gram_files/short" \
    --processes 14 \
    --min_length 3 \
    --removal_method token \
    --min_tokens 1
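
The flags suggest one plausible reading, sketched here as an assumption rather than a description of `remove_short_words.py`: with `--removal_method token`, individual tokens shorter than `--min_length` are deleted, and the ngram is discarded when fewer than `--min_tokens` tokens remain. The name `drop_short_tokens` is hypothetical.

```python
def drop_short_tokens(tokens, min_length=3, min_tokens=1):
    """Remove tokens shorter than min_length; return None if too few remain.

    Assumed semantics only: the real script may treat these flags differently.
    """
    kept = [tok for tok in tokens if len(tok) >= min_length]
    return kept if len(kept) >= min_tokens else None
```

With `min_length=3` and `min_tokens=2`, the tokens `["the", "ox", "ran"]` would become `["the", "ran"]`, whereas `["an", "ox"]` would be discarded entirely.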

#### Sort and concatenate

In [None]:
!python sort_into_single_file.py \
    --input_dir "{base_dir}/1gram_files/short" \
    --temp_dir "{base_dir}/1gram_files/temp" \
    --output_file "{base_dir}/1gram_files/concat/1grams_short-sort.jsonl" \
    --processes 14

In [None]:
!python verify_sort.py \
    --input_file "{base_dir}/1gram_files/concat/1grams_short-sort.jsonl"
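
A sort check of this kind needs only one pass and constant memory. The sketch below assumes the sort key is the whole line; `verify_sort.py` may instead compare a field inside each JSON record. The name `is_sorted` is hypothetical.

```python
def is_sorted(lines):
    """Return True if no line is smaller than the line before it."""
    previous = None
    for line in lines:
        if previous is not None and line < previous:
            return False  # first out-of-order pair found; stop early
        previous = line
    return True
```

Because it accepts any iterable of strings, an open file handle can be passed directly, so a file far larger than memory can be checked.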

#### Consolidate

In [None]:
!python consolidate_ngrams.py \
    --input_file "{base_dir}/1gram_files/concat/1grams_short-sort.jsonl" \
    --output_file "{base_dir}/1gram_files/concat/1grams_short-sort-consol.jsonl"
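
Lowercasing and lemmatizing can turn distinct original ngrams into identical strings, and sorting places those duplicates on adjacent lines, so they can be merged in one streaming pass. The sketch below illustrates that idea under an assumed record schema (`{"ngram": ..., "freq": {year: count}}`); the actual fields used by `consolidate_ngrams.py` are not shown in this notebook.

```python
import json
from collections import Counter
from itertools import groupby


def consolidate(sorted_lines):
    """Merge adjacent records sharing an ngram, summing their per-year counts.

    Assumes a hypothetical schema: {"ngram": str, "freq": {year: count}}.
    """
    records = (json.loads(line) for line in sorted_lines)
    for ngram, group in groupby(records, key=lambda rec: rec["ngram"]):
        totals = Counter()
        for rec in group:
            totals.update(rec["freq"])  # adds counts year by year
        yield json.dumps({"ngram": ngram, "freq": dict(totals)})
```

`groupby` only merges neighbouring records, which is why the input must already be sorted.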

#### Add index

In [None]:
!python index_ngrams.py \
    --input_file "{base_dir}/1gram_files/concat/1grams_short-sort-consol.jsonl" \
    --output_file "{base_dir}/1gram_files/concat/1grams_short-sort-consol-index.jsonl"

#### Create file of _n_ most common tokens

In [None]:
!python make_vocab_list.py \
    --input_file "{base_dir}/1gram_files/concat/1grams_short-sort-consol-index.jsonl" \
    --n_vocab 100000 \
    --output_file "{base_dir}/valid_vocab_lookup.txt" \
    --membership_file "{base_dir}/valid_vocab_membertest.txt"
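
Selecting the _n_ most frequent tokens does not require sorting the whole file. A minimal sketch, assuming each record carries per-year counts under a hypothetical `freq` field and that total frequency is their sum (the format of the two output files is not reproduced here):

```python
import heapq
import json


def most_common_tokens(lines, n_vocab):
    """Return the n_vocab tokens with the highest total frequency, highest first.

    Assumes a hypothetical schema: {"ngram": str, "freq": {year: count}}.
    """
    def totals():
        for line in lines:
            rec = json.loads(line)
            yield sum(rec["freq"].values()), rec["ngram"]

    # nlargest streams the input and keeps only n_vocab candidates in memory.
    return [token for _, token in heapq.nlargest(n_vocab, totals())]
```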

#### View contents of ngram file

In [None]:
!python print_jsonl_lines.py \
    --file_path "{base_dir}/1gram_files/concat/1grams_short-sort-consol-index.jsonl" \
    --start 0 \
    --end 50
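
Reading a slice of a large JSONL file can be done lazily. The sketch below treats `start` as 0-based and `end` as exclusive, which is an assumption; `print_jsonl_lines.py` may count differently. The name `read_line_range` is hypothetical.

```python
from itertools import islice


def read_line_range(path, start, end):
    """Return lines start..end (0-based, end exclusive) from a text file."""
    with open(path, encoding="utf-8") as handle:
        # islice skips ahead without holding earlier lines in memory.
        return [line.rstrip("\n") for line in islice(handle, start, end)]
```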

## Process Multigram Files
Download multigrams (_n_ = 2–5). Drop those containing **untagged tokens**, **numerals**, or **non-alphabetic characters**.

Optionally, specify a **vocabulary file** for additional filtering. Vocabulary filtering discards ngrams containing tokens absent from the vocabulary file. For matching, part-of-speech (POS) tags are stripped and the base tokens are lowercased and lemmatized; when an ngram passes the vocabulary filter, the original case and inflection of its tokens are preserved and the POS tags are reattached.
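
The matching step can be sketched as follows. This is an illustration under stated assumptions, not the code in `download_and_filter_ngrams.py`: tokens are assumed to look like `word_TAG`, the function name `passes_vocab_filter` is hypothetical, and the `lemmatize` argument is a stand-in for whatever lemmatizer the pipeline uses (it defaults to the identity function).

```python
def passes_vocab_filter(ngram, vocab, lemmatize=lambda word: word):
    """Return True if every token's normalized base form is in vocab.

    Normalization happens on a copy of each token, so the ngram passed in
    keeps its original case, inflection, and POS tags.
    """
    for token in ngram.split():
        # Strip a trailing _TAG if present; fall back to the whole token.
        base = token.rpartition("_")[0] or token
        if lemmatize(base.lower()) not in vocab:
            return False
    return True
```

With a vocabulary of `{"dog", "run"}` and a lemmatizer that maps `dogs` to `dog` and `ran` to `run`, the ngram `Dogs_NOUN ran_VERB` passes and is kept exactly as written, while `Cats_NOUN ran_VERB` is discarded.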

#### Set the appropriate base directory

In [8]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

In [None]:
base_dir = "/Volumes/knowleslab/sharedresources/NLP_corpora/Google_Books/20200217/eng"

#### Download

In [None]:
!python download_and_filter_ngrams.py \
    --ngram_size 5 \
    --processes 48 \
    --file_range 14720 14720 \
    --vocab_file "{base_dir}/valid_vocab_membertest.txt" \
    --output_dir {base_dir} \
    --save_empty

#### Lowercase

In [None]:
!python lowercase.py \
    --input_dir "{base_dir}/5gram_files/orig" \
    --output_dir "{base_dir}/5gram_files/lower" \
    --processes 14

#### Lemmatize

In [None]:
!python lemmatize.py \
    --input_dir "{base_dir}/5gram_files/lower" \
    --output_dir "{base_dir}/5gram_files/lemmas" \
    --processes 14

#### Remove stopwords

In [None]:
!python remove_stopwords.py \
    --input_dir "{base_dir}/5gram_files/lemmas" \
    --output_dir "{base_dir}/5gram_files/stop" \
    --processes 14 \
    --removal_method token \
    --min_tokens 2

#### Remove short words

In [None]:
!python remove_short_words.py \
    --input_dir "{base_dir}/5gram_files/lemmas" \
    --output_dir "{base_dir}/5gram_files/short" \
    --processes 14 \
    --min_length 3 \
    --removal_method token \
    --min_tokens 2

#### Sort and concatenate

In [9]:
!python sort_into_single_file.py \
    --input_dir "{base_dir}/5gram_files/short" \
    --temp_dir "{base_dir}/5gram_files/temp" \
    --output_file "{base_dir}/5gram_files/concat/5grams_sort.jsonl" \
    --processes 30

Sorting individual files:

Files: 100%|██████████████████████████████| 1794/1794 [01:35<00:00, 18.80file/s]

Merge-sorting files:

Lines: 307089794line [13:15, 385854.06line/s]

Processing complete.


In [10]:
!python verify_sort.py \
    --input_file "{base_dir}/5gram_files/concat/5grams_sort.jsonl"

Lines: 307089794line [08:48, 581209.87line/s]

The file is sorted.

Processing complete.


#### Consolidate ngrams

In [12]:
!python consolidate_ngrams.py \
    --input_file "{base_dir}/5gram_files/concat/5grams_sort.jsonl" \
    --output_file "{base_dir}/5gram_files/concat/5grams_consol.jsonl"

Consolidating ngrams.

Lines: 307089794lines [38:21, 133425.09lines/s]

Processing complete.


#### View contents of ngram file

In [None]:
!python print_jsonl_lines.py \
    --file_path "{base_dir}/5gram_files/concat/5grams_consol.jsonl" \
    --start 0 \
    --end 50

## Make Yearly Files
Reorganize the consolidated ngrams into one file per year, each listing every ngram's frequency in that year.

#### Select the appropriate base directory

In [None]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

In [None]:
base_dir = "/Volumes/knowleslab/sharedresources/NLP_corpora/Google_Books/20200217/eng"

#### Create yearly files

In [None]:
!python make_yearly_files.py \
    --input_file "{base_dir}/5gram_files/concat/5grams_consol.jsonl" \
    --output_dir "{base_dir}/5gram_files/year_files" \
    --chunk_size 100000 \
    --processes 14
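
The reorganization amounts to pivoting from one record per ngram to one record per ngram per year. A minimal in-memory sketch, assuming a hypothetical schema in which each record stores per-year counts under `freq`; the real script takes a `--chunk_size`, presumably to bound memory, which this sketch does not attempt:

```python
import json
from collections import defaultdict


def split_by_year(lines):
    """Group records by year: {year: [{"ngram": ..., "freq": count}, ...]}.

    Assumes a hypothetical schema: {"ngram": str, "freq": {year: count}}.
    """
    by_year = defaultdict(list)
    for line in lines:
        rec = json.loads(line)
        for year, count in rec["freq"].items():
            by_year[year].append({"ngram": rec["ngram"], "freq": count})
    return dict(by_year)
```

Each key of the returned dictionary would correspond to one output file, such as `2019.jsonl`.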

#### View contents of ngram file

In [None]:
!python print_jsonl_lines.py \
    --file_path "{base_dir}/5gram_files/year_files/2019.jsonl" \
    --start 51 \
    --end 100