# **Process Ngram Files**

## Generate Vocabulary File
Make a list of the _n_ most common unigrams (1grams). This file can be used to filter multi-token ngrams. Unigrams containing **untagged tokens**, **numerals**, or **non-alphabetic characters** are dropped.
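
The drop rule can be sketched as follows. This is a minimal illustration, not the code in `download_and_filter_ngrams.py`: it assumes tokens carry a POS suffix of the form `word_TAG` from the Google Books Ngram tag set, and the name `keep_unigram` is hypothetical.

```python
import re

# Assumption: tokens look like "walk_VERB", tagged with the Google Books
# Ngram POS tag set. An untagged token has no recognized suffix.
POS_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "NUM", "CONJ", "PRT", "X"}
ALPHABETIC = re.compile(r"^[^\W\d_]+$")  # one or more Unicode letters, nothing else


def keep_unigram(token: str) -> bool:
    """Return True if the token is POS-tagged and its base form is purely alphabetic."""
    base, sep, tag = token.rpartition("_")
    if not sep or tag not in POS_TAGS:
        return False  # untagged token
    return bool(ALPHABETIC.match(base))  # rejects numerals and other non-letters


candidates = ["walk_VERB", "walk", "1999_NUM", "rock'n_NOUN", "Café_NOUN"]
print([t for t in candidates if keep_unigram(t)])  # ['walk_VERB', 'Café_NOUN']
```

The letter-only pattern accepts accented characters, so words such as `Café` survive; a stricter ASCII-only rule would use `[A-Za-z]+` instead.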

#### Select the appropriate base directory

In [1]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

In [10]:
base_dir = "/Volumes/knowleslab/sharedresources/NLP_corpora/Google_Books/20200217/eng"

#### Download

In [8]:
conda install nltk

Channels:
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/anaconda3/envs/hist_w2v_env

  added / updated specs:
    - nltk


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    nltk-3.9.1                 |  py312hca03da5_0         2.7 MB
    ------------------------------------------------------------
                                           Total:         2.7 MB

The following NEW packages will be INSTALLED:

  click              pkgs/main/osx-arm64::click-8.1.7-py312hca03da5_0 
  joblib             pkgs/main/osx-arm64::joblib-1.4.2-py312hca03da5_0 
  nltk               pkgs/main/osx-arm64::nltk-3.9.1-py312hca03da5_0 
  regex              pkgs/main/osx-arm64::regex-2024.9.11-py312h80987f9_0 



Downloading and Extracting Packages:
                                                  

In [16]:
!python download_and_filter_ngrams.py \
    --ngram_size 1 \
    --processes 12 \
    --file_range 0 24 \
    --output_dir "{base_dir}"

Downloading and filtering.

Files: 100%|████████████████████████████████████| 24/24 [04:16<00:00, 10.70s/it]

Processing complete.


#### Lowercase

In [None]:
!python lowercase.py \
    --input_dir "{base_dir}/1gram_files/orig" \
    --output_dir "{base_dir}/1gram_files/lower" \
    --processes 12

#### Lemmatize

In [None]:
!python lemmatize.py \
    --input_dir "{base_dir}/1gram_files/lower" \
    --output_dir "{base_dir}/1gram_files/lemmas" \
    --processes 14

#### Remove stopwords

In [None]:
!python remove_stopwords.py \
    --input_dir "{base_dir}/1gram_files/lemmas" \
    --output_dir "{base_dir}/1gram_files/stop" \
    --processes 14 \
    --removal_method token \
    --min_tokens 1

#### Remove short words

In [None]:
!python remove_short_words.py \
    --input_dir "{base_dir}/1gram_files/lemmas" \
    --output_dir "{base_dir}/1gram_files/short" \
    --processes 14 \
    --min_length 3 \
    --removal_method token \
    --min_tokens 1
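
One plausible reading of the `--removal_method token` and `--min_tokens` flags in the two cells above: offending tokens are removed individually, and an ngram is discarded if fewer than `min_tokens` tokens remain. The sketch below illustrates that reading only; the function name, the tiny stopword set, and the behavior itself are assumptions, not taken from `remove_stopwords.py` or `remove_short_words.py`.

```python
# Hypothetical stopword list for illustration; a real run would use a fuller list.
STOPWORDS = frozenset({"the", "of", "and", "a", "in", "to"})


def remove_tokens(tokens, min_length=1, min_tokens=1, stopwords=frozenset()):
    """Drop stopwords and short tokens; return None if too few tokens survive."""
    kept = [t for t in tokens if t not in stopwords and len(t) >= min_length]
    return kept if len(kept) >= min_tokens else None


# Stopword removal on a 5gram that keeps at least two tokens.
print(remove_tokens(["the", "history", "of", "art", "in"], min_tokens=2, stopwords=STOPWORDS))
# ['history', 'art']

# Short-word removal that leaves a single token, so the ngram is discarded.
print(remove_tokens(["to", "be", "or", "not", "to"], min_length=3, min_tokens=2))
# None
```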

#### Sort and concatenate

In [None]:
!python sort_into_single_file.py \
    --input_dir "{base_dir}/1gram_files/short" \
    --temp_dir "{base_dir}/1gram_files/temp" \
    --output_file "{base_dir}/1gram_files/concat/1grams_short-sort.jsonl" \
    --processes 14

In [None]:
!python verify_sort.py \
    --input_file "{base_dir}/1gram_files/concat/1grams_short-sort.jsonl"
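
Sorting many files into one, with a temporary directory, is commonly done as an external merge sort: sort each chunk independently, write it to disk, then merge the sorted chunks lazily. The sketch below shows that general technique together with a sortedness check. It is not the implementation of `sort_into_single_file.py` or `verify_sort.py`; the record fields `ngram` and `freq` and all function names are assumptions.

```python
import heapq
import json
import tempfile
from pathlib import Path


def sort_chunks(chunks, temp_dir):
    """Sort each chunk in memory and write it to its own temporary JSONL file."""
    paths = []
    for i, chunk in enumerate(chunks):
        path = Path(temp_dir) / f"chunk_{i}.jsonl"
        with path.open("w", encoding="utf-8") as f:
            for rec in sorted(chunk, key=lambda r: r["ngram"]):
                f.write(json.dumps(rec) + "\n")
        paths.append(path)
    return paths


def read_records(path):
    """Yield one parsed record per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


def merge_sorted(paths):
    """Lazily k-way merge files that are each already sorted by ngram."""
    return heapq.merge(*(read_records(p) for p in paths), key=lambda r: r["ngram"])


def is_sorted(records):
    """Return True if records are in non-decreasing ngram order."""
    prev = None
    for rec in records:
        if prev is not None and rec["ngram"] < prev:
            return False
        prev = rec["ngram"]
    return True


chunks = [
    [{"ngram": "pear", "freq": 3}, {"ngram": "apple", "freq": 5}],
    [{"ngram": "fig", "freq": 2}, {"ngram": "apple", "freq": 1}],
]
with tempfile.TemporaryDirectory() as tmp:
    merged = list(merge_sorted(sort_chunks(chunks, tmp)))

print([r["ngram"] for r in merged])  # ['apple', 'apple', 'fig', 'pear']
print(is_sorted(merged))  # True
```

Because `heapq.merge` reads one record at a time from each chunk, memory use depends on the number of chunks rather than the total corpus size.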

#### Consolidate

In [None]:
!python consolidate_ngrams.py \
    --input_file "{base_dir}/1gram_files/concat/1grams_short-sort.jsonl" \
    --output_file "{base_dir}/1gram_files/concat/1grams_short-sort-consol.jsonl"
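
Lowercasing and lemmatizing can map several original ngrams to the same string, and sorting places those duplicates next to each other. Consolidation can then be done in a single pass by summing adjacent records. The following is a sketch of that idea under an assumed record layout (`ngram`, `freq`); the actual fields handled by `consolidate_ngrams.py` may differ.

```python
from itertools import groupby
from operator import itemgetter


def consolidate(sorted_records):
    """Merge adjacent records that share an ngram, summing their frequencies."""
    for ngram, group in groupby(sorted_records, key=itemgetter("ngram")):
        yield {"ngram": ngram, "freq": sum(rec["freq"] for rec in group)}


records = [
    {"ngram": "apple", "freq": 5},
    {"ngram": "apple", "freq": 1},
    {"ngram": "fig", "freq": 2},
]
consolidated = list(consolidate(records))
print(consolidated)  # [{'ngram': 'apple', 'freq': 6}, {'ngram': 'fig', 'freq': 2}]
```

`groupby` only groups consecutive items, which is why the input must be sorted first.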

#### Add index

In [None]:
!python index_ngrams.py \
    --input_file "{base_dir}/1gram_files/concat/1grams_short-sort-consol.jsonl" \
    --output_file "{base_dir}/1gram_files/concat/1grams_short-sort-consol-index.jsonl"

#### Create file of _n_ most common tokens

In [None]:
!python make_vocab_list.py \
    --input_file "{base_dir}/1gram_files/concat/1grams_short-sort-consol-index.jsonl" \
    --n_vocab 100000 \
    --output_file "{base_dir}/valid_vocab_lookup.txt" \
    --membership_file "{base_dir}/valid_vocab_membertest.txt"
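
Selecting the _n_ most common tokens does not require sorting the whole file by frequency. A minimal sketch with `heapq.nlargest` follows; the record fields and the function name are assumptions, and the split into an ordered list and a set only mirrors a guess about what the lookup and membership files are for.

```python
import heapq


def top_n_tokens(records, n):
    """Return the n most frequent ngrams, most frequent first."""
    top = heapq.nlargest(n, records, key=lambda r: r["freq"])
    return [r["ngram"] for r in top]


records = [
    {"ngram": "apple", "freq": 6},
    {"ngram": "fig", "freq": 2},
    {"ngram": "pear", "freq": 3},
]
vocab = top_n_tokens(records, 2)  # ordered list, e.g. for lookup by rank
members = frozenset(vocab)        # set, for fast membership tests

print(vocab)  # ['apple', 'pear']
print("fig" in members)  # False
```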

#### View contents of ngram file

In [None]:
!python print_jsonl_lines.py \
    --file_path "{base_dir}/1gram_files/concat/1grams_short-sort-consol-index.jsonl" \
    --start 0 \
    --end 50

## Process Multigram Files
Download multigrams (_n_ = 2–5). Drop those containing **untagged tokens**, **numerals**, or **non-alphabetic characters**.

Optionally, specify a **vocabulary file** for additional filtering. Vocabulary filtering discards ngrams containing tokens absent from the vocabulary file. For matching, part-of-speech (POS) tags are stripped and the base tokens are lowercased and lemmatized; when an ngram passes the vocabulary filter, the original case and inflection of its tokens are preserved and the POS tags are reattached.
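
The matching logic described above can be sketched as below. This is an illustration under assumptions, not the code in `download_and_filter_ngrams.py`: tokens are assumed to be space-separated with a `word_TAG` suffix, the function names are hypothetical, and a toy dictionary stands in for a real lemmatizer so the example runs without external data.

```python
def base_form(token, lemmatize):
    """Strip the POS tag, lowercase, and lemmatize a single token for matching."""
    base, sep, _tag = token.rpartition("_")
    word = base if sep else token
    return lemmatize(word.lower())


def vocab_filter(ngram, vocab, lemmatize):
    """Return the ngram unchanged if every normalized token is in vocab, else None."""
    if all(base_form(tok, lemmatize) in vocab for tok in ngram.split()):
        return ngram  # original case, inflection, and POS tags are untouched
    return None


# Stand-in lemmatizer for the example; a real pipeline would use a proper one.
TOY_LEMMAS = {"dogs": "dog", "chased": "chase", "cats": "cat"}


def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)


vocab = {"dog", "chase", "cat"}
print(vocab_filter("Dogs_NOUN chased_VERB cats_NOUN", vocab, toy_lemmatize))
# Dogs_NOUN chased_VERB cats_NOUN
print(vocab_filter("Dogs_NOUN chased_VERB zebras_NOUN", vocab, toy_lemmatize))
# None
```

Normalization is applied only to a temporary copy of each token, so nothing has to be "reattached" explicitly: the ngram that passes is simply returned as it came in.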

#### Set base directory

In [5]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

#### Download

In [6]:
!python download_and_filter_ngrams.py \
    --ngram_size 5 \
    --processes 48 \
    --file_range 0 19422 \
    --vocab_file "{base_dir}/valid_vocab_membertest.txt" \
    --save_empty

Loaded vocabulary file.

Downloading and filtering.

Files:   0%|                               | 2/19423 [00:34<79:14:19, 14.69s/it]


KeyboardInterrupt



#### Lowercase

In [None]:
!python lowercase.py \
    --input_dir "{base_dir}/5gram_files/orig" \
    --output_dir "{base_dir}/5gram_files/lower" \
    --processes 14

#### Lemmatize

In [None]:
!python lemmatize.py \
    --input_dir "{base_dir}/5gram_files/lower" \
    --output_dir "{base_dir}/5gram_files/lemmas" \
    --processes 14

#### Remove stopwords

In [None]:
!python remove_stopwords.py \
    --input_dir "{base_dir}/5gram_files/lemmas" \
    --output_dir "{base_dir}/5gram_files/stop" \
    --processes 14 \
    --removal_method token \
    --min_tokens 2

#### Remove short words

In [None]:
!python remove_short_words.py \
    --input_dir "{base_dir}/5gram_files/lemmas" \
    --output_dir "{base_dir}/5gram_files/short" \
    --processes 14 \
    --min_length 3 \
    --removal_method token \
    --min_tokens 2

#### Sort and concatenate

In [None]:
!python sort_into_single_file.py \
    --input_dir "{base_dir}/5gram_files/lemmas" \
    --temp_dir "{base_dir}/5gram_files/temp" \
    --output_file "{base_dir}/5gram_files/concatenated/5grams_sort.jsonl" \
    --processes 14

In [None]:
!python verify_sort.py \
    --input_file "{base_dir}/5gram_files/concatenated/5grams_sort.jsonl"

#### Consolidate ngrams

In [None]:
!python consolidate_ngrams.py \
    --input_file "{base_dir}/5gram_files/concatenated/5grams_sort.jsonl" \
    --output_file "{base_dir}/5gram_files/concatenated/5grams_sort-stop-consol.jsonl"

#### View contents of ngram file

In [None]:
!python print_jsonl_lines.py \
    --file_path "{base_dir}/5gram_files/concatenated/5grams_sort-stop-consol.jsonl" \
    --start 0 \
    --end 50