# **Process Ngram Files**

## Generate Vocabulary File
Make a list of the _n_ most common unigrams (1grams). This file can be used to filter multi-token ngrams. Unigrams containing **untagged tokens**, **numerals**, or **non-alphabetic characters** are dropped.
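
As a rough illustration of the drop rule described above, the sketch below keeps a unigram only if it carries a part-of-speech suffix and its base form is purely alphabetic. The function name `keep_unigram` and the `POS_TAGS` set are assumptions made for this example; the actual logic lives in `download_and_filter_ngrams.py` and may differ.

```python
# Assumed tag set: the part-of-speech suffixes used by Google Books Ngrams.
POS_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "NUM", "CONJ", "PRT", "X"}


def keep_unigram(token: str) -> bool:
    """Return True if the token is tagged and its base form is purely alphabetic."""
    base, sep, tag = token.rpartition("_")
    if not sep or tag not in POS_TAGS:
        return False  # untagged token
    # str.isalpha() is False for numerals, punctuation, and the empty string.
    return base.isalpha()


candidates = ["water_NOUN", "water", "1984_NUM", "e-mail_NOUN"]
print([t for t in candidates if keep_unigram(t)])  # ['water_NOUN']
```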

#### Select the appropriate base directory

In [None]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

#### Download

In [None]:
!python download_and_filter_ngrams.py \
    --ngram_size 1 \
    --processes 24 \
    --file_range 0 0 \
    --output_dir {base_dir} \
    --save_empty

#### Lowercase

In [None]:
!python lowercase.py \
    --input_dir "{base_dir}/1gram_files/orig" \
    --output_dir "{base_dir}/1gram_files/lower" \
    --processes 24

#### Lemmatize

In [None]:
!python lemmatize.py \
    --input_dir "{base_dir}/1gram_files/lower" \
    --output_dir "{base_dir}/1gram_files/lemmas" \
    --processes 24

#### Remove stopwords

In [None]:
!python remove_stopwords.py \
    --input_dir "{base_dir}/1gram_files/lemmas" \
    --output_dir "{base_dir}/1gram_files/stop" \
    --processes 24 \
    --removal_method token \
    --min_tokens 1

#### Remove short words

In [None]:
!python remove_short_words.py \
    --input_dir "{base_dir}/1gram_files/stop" \
    --output_dir "{base_dir}/1gram_files/short" \
    --processes 24 \
    --min_length 3 \
    --removal_method token \
    --min_tokens 1 \
    --overwrite

#### Sort and concatenate

In [None]:
!python sort_into_single_file.py \
    --input_dir "{base_dir}/1gram_files/short" \
    --temp_dir "{base_dir}/1gram_files/temp" \
    --output_file "{base_dir}/1gram_files/concat/1grams_prepped.jsonl" \
    --processes 24 \
    --overwrite

In [None]:
!python verify_sort.py \
    --input_file "{base_dir}/1gram_files/concat/1grams_prepped.jsonl"

#### Consolidate

In [None]:
!python consolidate_ngrams.py \
    --input_file "{base_dir}/1gram_files/concat/1grams_prepped.jsonl" \
    --output_file "{base_dir}/1gram_files/concat/1grams_consol.jsonl" \
    --strip_tags \
    --overwrite
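
Consolidation presumably merges repeated entries for the same ngram, which lowercasing, lemmatizing, and tag stripping can produce. The sketch below shows one way to do that on an already sorted stream. It assumes records with `ngram`, `freq`, and `doc` fields (per-year counts), and the function name `consolidate` is hypothetical; it is not the code in `consolidate_ngrams.py`.

```python
from collections import Counter
from itertools import groupby


def consolidate(sorted_records):
    """Merge adjacent records that share an ngram, summing per-year counts."""
    for ngram, group in groupby(sorted_records, key=lambda r: r["ngram"]):
        freq, doc = Counter(), Counter()
        for rec in group:
            freq.update(rec["freq"])  # adds counts year by year
            doc.update(rec["doc"])
        yield {
            "ngram": ngram,
            "freq_tot": sum(freq.values()),
            "doc_tot": sum(doc.values()),
            "freq": dict(freq),
            "doc": dict(doc),
        }


# Toy input, already sorted by ngram (made-up counts).
records = [
    {"ngram": "river flow", "freq": {"1900": 1}, "doc": {"1900": 1}},
    {"ngram": "river flow", "freq": {"1900": 2, "1901": 1}, "doc": {"1900": 1, "1901": 1}},
    {"ngram": "water run", "freq": {"1900": 4}, "doc": {"1900": 3}},
]
merged = list(consolidate(records))
print(len(records), "->", len(merged))
```

Because `groupby` only merges adjacent items, this approach depends on the sort step having run first.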

#### Add index

In [None]:
!python index_ngrams.py \
    --input_file "{base_dir}/1gram_files/concat/1grams_consol.jsonl" \
    --output_file "{base_dir}/1gram_files/concat/1grams_indexed.jsonl" \
    --overwrite

#### Create file of _n_ most common tokens

In [None]:
!python make_vocab_list.py \
    --input_file "{base_dir}/1gram_files/concat/1grams_indexed.jsonl" \
    --n_vocab 80000 \
    --output_file "{base_dir}/valid_vocab_lookup.txt" \
    --membership_file "{base_dir}/valid_vocab_membertest.txt" \
    --plain_text
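
Selecting the _n_ most common tokens can be sketched as a top-_n_ selection by total frequency. The example assumes each record has a `freq_tot` field holding its overall count; the helper `top_n_tokens` is hypothetical and only illustrates the idea, not what `make_vocab_list.py` does internally.

```python
import heapq


def top_n_tokens(records, n):
    """Return the n tokens with the highest total frequency, most frequent first."""
    top = heapq.nlargest(n, records, key=lambda r: r["freq_tot"])
    return [r["ngram"] for r in top]


# Toy records with made-up counts.
records = [
    {"ngram": "water", "freq_tot": 5},
    {"ngram": "river", "freq_tot": 9},
    {"ngram": "brook", "freq_tot": 1},
]
vocab = top_n_tokens(records, n=2)
print(vocab)  # ['river', 'water']
```

`heapq.nlargest` keeps only _n_ candidates in memory at a time, which suits a large unigram file better than sorting every record.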

#### View contents of ngram file

In [None]:
!python print_jsonl_lines.py \
    --file_path "{base_dir}/1gram_files/concat/1grams_consol.jsonl" \
    --start 50000 \
    --end 50001

## Process Multigram Files
Download multigrams (_n_ = 2–5). Drop those containing **untagged tokens**, **numerals**, or **non-alphabetic characters**.

Optionally, specify a **vocabulary file** for additional filtering. Vocabulary filtering discards ngrams containing tokens absent from the vocabulary file. For matching, part-of-speech (POS) tags are stripped and the base tokens are lowercased and lemmatized; when an ngram passes the vocabulary filter, the original case and inflection of its tokens are preserved and the POS tags reattached.
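
The matching behavior described above can be sketched as follows: each token is normalized only for the membership test, and the ngram string itself is returned untouched. The `POS_TAGS` set, the toy `TOY_LEMMAS` lookup (a stand-in for a real lemmatizer), and the function names are assumptions made for illustration.

```python
# Assumed tag set: the part-of-speech suffixes used by Google Books Ngrams.
POS_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "NUM", "CONJ", "PRT", "X"}

# Toy stand-in for a real lemmatizer (hypothetical, illustration only).
TOY_LEMMAS = {"rivers": "river", "flows": "flow"}


def normalize(token: str) -> str:
    """Strip a trailing POS tag, lowercase, and lemmatize a single token."""
    base, sep, tag = token.rpartition("_")
    if not sep or tag not in POS_TAGS:
        base = token  # no recognizable tag: use the token as-is
    base = base.lower()
    return TOY_LEMMAS.get(base, base)


def passes_vocab(ngram: str, vocab: set) -> bool:
    """True if every normalized token of the ngram is in the vocabulary."""
    return all(normalize(tok) in vocab for tok in ngram.split())


vocab = {"river", "flow"}
ngrams = ["Rivers_NOUN flows_VERB", "Rivers_NOUN xylophones_NOUN"]

# Surviving ngrams keep their original case, inflection, and tags.
kept = [g for g in ngrams if passes_vocab(g, vocab)]
print(kept)  # ['Rivers_NOUN flows_VERB']
```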

#### Set the appropriate base directory

In [None]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

#### Download

In [None]:
!python download_and_filter_ngrams.py \
    --ngram_size 5 \
    --processes 48 \
    --file_range 0 19422 \
    --vocab_file "{base_dir}/valid_vocab_membertest.txt" \
    --output_dir {base_dir} \
    --save_empty \
    --overwrite \
    --strip_tags \
    --test_file "/scratch/edk202/hist_w2v/test.jsonl"

#### Lowercase

In [None]:
!python lowercase.py \
    --input_dir "{base_dir}/5gram_files/orig" \
    --output_dir "{base_dir}/5gram_files/lower" \
    --processes 24

#### Lemmatize

In [None]:
!python lemmatize.py \
    --input_dir "{base_dir}/5gram_files/lower" \
    --output_dir "{base_dir}/5gram_files/lemmas" \
    --processes 24

#### Remove stopwords

In [None]:
!python remove_stopwords.py \
    --input_dir "{base_dir}/5gram_files/lemmas" \
    --output_dir "{base_dir}/5gram_files/stop" \
    --processes 24 \
    --removal_method token \
    --min_tokens 2

#### Remove short words

In [None]:
!python remove_short_words.py \
    --input_dir "{base_dir}/5gram_files/stop" \
    --output_dir "{base_dir}/5gram_files/short" \
    --processes 24 \
    --min_length 3 \
    --removal_method token \
    --min_tokens 2

#### Sort and concatenate

In [None]:
!python sort_into_single_file.py \
    --input_dir "{base_dir}/5gram_files/short" \
    --temp_dir "{base_dir}/5gram_files/temp" \
    --output_file "{base_dir}/5gram_files/concat/5grams_prepped.jsonl" \
    --processes 24

In [None]:
!python verify_sort.py \
    --input_file "{base_dir}/5gram_files/concat/5grams_prepped.jsonl"

#### Consolidate ngrams

In [None]:
!python consolidate_ngrams.py \
    --input_file "{base_dir}/5gram_files/concat/5grams_prepped.jsonl" \
    --output_file "{base_dir}/5gram_files/concat/5grams_consol.jsonl" \
    --strip_tags \
    --overwrite

#### View contents of ngram file

In [None]:
!python print_jsonl_lines.py \
    --file_path "{base_dir}/5gram_files/concat/5grams_consol.jsonl" \
    --start 5000 \
    --end 5050

## Make Yearly Files
Reorganize ngrams into year-specific files specifying each ngram's frequency for the year.
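
A minimal sketch of this reorganization, assuming each consolidated record stores its per-year counts in a `freq` mapping keyed by year. The function name `split_by_year`, the shape of the per-year entries, and the second sample record are assumptions for illustration; `make_yearly_files.py` may use a different layout.

```python
import json
from collections import defaultdict


def split_by_year(records):
    """Group per-ngram records into {year: [per-year entries]}."""
    by_year = defaultdict(list)
    for rec in records:
        for year, count in rec["freq"].items():
            by_year[year].append({"ngram": rec["ngram"], "freq": count})
    return dict(by_year)


records = [
    {"ngram": "abdallah tell", "freq": {"1850": 22, "2018": 3}},
    {"ngram": "river flow", "freq": {"2018": 7}},  # made-up sample record
]
yearly = split_by_year(records)
for year in sorted(yearly):
    # In a real run, each year's entries would be written to "<year>.jsonl".
    print(year, [json.dumps(entry) for entry in yearly[year]])
```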

#### Select the appropriate base directory

In [None]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

#### Create yearly files

In [None]:
!python make_yearly_files.py \
    --input_file "{base_dir}/5gram_files/concat/5grams_consol.jsonl" \
    --output_dir "{base_dir}/5gram_files/year_files/jsonl" \
    --chunk_size 100000 \
    --processes 24

#### View contents of ngram file

In [None]:
!python print_jsonl_lines.py \
    --file_path "{base_dir}/5gram_files/year_files/jsonl/2019.jsonl" \
    --start 10000 \
    --end 10500

In [None]:
!python jsonl_to_plain_text.py \
    --input_dir "{base_dir}/1gram_files/year_files/jsonl" \
    --output_dir "{base_dir}/1gram_files/year_files/text" \
    --processes 48

#### Convert JSONL files to plain text for Word2Vec efficiency

In [None]:
!python jsonl_to_plain_text.py \
    --input_dir "{base_dir}/5gram_files/year_files/jsonl" \
    --output_dir "{base_dir}/5gram_files/year_files/text" \
    --processes 48

# **Train Word2Vec**

#### Set the appropriate base directory

In [None]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

#### Train a basic model

In [None]:
!python train_word2vec.py \
    --corpus_file "{base_dir}/5gram_files/year_files/text/2019.txt" \
    --model_file "{base_dir}/5gram_files/word2vec_model.model" \
    --vector_file "{base_dir}/5gram_files/word_vectors.txt" \
    --vector_size 300 \
    --window 5 \
    --sg 1 \
    --negative 10 \
    --min_count 5 \
    --sample 1e-5 \
    --workers 48 \
    --epochs 10 \
    --alpha 0.025 \
    --min_alpha 0.0001

In [None]:
from gensim.models import Word2Vec

# Load the model
model = Word2Vec.load(f'{base_dir}/5gram_files/word2vec_model.model')

similar_words = model.wv.most_similar('water_ADJ', topn=10)
print(similar_words)

# **Process Unigrams**

## Process Unigrams

In [None]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

In [None]:
!python 1_download.py --ngram_size 1 --ngram_type tagged --proj_dir {base_dir} --workers 48 \
--overwrite --compress

In [None]:
!python 2_convert.py --ngram_size 1 --ngram_type tagged --proj_dir {base_dir} --workers 48 \
--overwrite --compress

In [None]:
!python 3_lowercase.py --ngram_size 1 --proj_dir {base_dir} --workers 48 \
--overwrite --compress

In [None]:
!python 4_lemmatize.py --ngram_size 1 --proj_dir {base_dir} --workers 48 \
--overwrite --compress

In [None]:
!python 5_filter.py --ngram_size 1 --proj_dir {base_dir} --workers 48 \
--numerals --nonalpha --stopwords --min_token_length 3 --min_tokens 1 --overwrite --compress

In [None]:
!python 6_combine.py --ngram_size 1 --proj_dir {base_dir} --workers 48 --overwrite

In [None]:
!python 7_index.py --ngram_size 1 --proj_dir {base_dir} --input_file {base_dir}/1gram_files/6corpus/1-00000-to-00017.jsonl \
--overwrite --vocab_file 80000 --workers 48

In [None]:
!python print_jsonl_lines.py --file_path "{base_dir}/1gram_files/6corpus/1-00000-to-00017.jsonl" --start 0 --end 5 --parse

In [None]:
!python verify_sort.py --input_file "{base_dir}/1gram_files/6corpus/1-00000-to-00017.jsonl"

## Process Multigrams

In [1]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

In [None]:
!python 1_download.py \
    --ngram_size 5 \
    --ngram_type tagged \
    --proj_dir {base_dir} \
    --overwrite \
    --compress

In [None]:
!python 2_convert.py \
    --ngram_size 5 \
    --ngram_type tagged \
    --proj_dir {base_dir} \
    --compress

In [None]:
!python 3_lowercase.py \
    --ngram_size 5 \
    --proj_dir {base_dir} \
    --compress

In [None]:
!python 4_lemmatize.py \
    --ngram_size 5 \
    --proj_dir {base_dir} \
    --compress

In [None]:
!python 5_filter.py \
    --ngram_size 5 \
    --proj_dir {base_dir} \
    --numerals \
    --nonalpha \
    --stopwords \
    --min_token_length 3 \
    --min_tokens 2 \
    --compress \
    --vocab_file {base_dir}/1gram_files/6corpus/1-00000-to-00017-vocab_list_match.txt

In [1]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

!python 6_sort2.py \
    --ngram_size 5 \
    --proj_dir {base_dir} \
    --workers 10 \
    --sort_key ngram \
    --compress \
    --sort_order ascending \
    --end_iteration 2

Start Time:                2025-01-01 16:52:43.530206

Sort Info
Input directory:           /vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/5filter
Sorted directory:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/temp
Temp directory:            /vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/tmp
Merged file:               /vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/6corpus/5gram-merged.jsonl.lz4
Files available:           6520
Files to use:              6520
First file to get:         /vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/5filter/5-00069-of-19423.jsonl.lz4
Last file to get:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/5filter/5-19422-of-19423.jsonl.lz4
Ngram size:                5
Number of workers:         10
Compress output files:     True
Overwrite existing files:  False
Sort key:                  ngram
Sort order:                ascending
He

In [2]:
!python 7_consolidate.py \
    --ngram_size 5 \
    --proj_dir {base_dir} \
    --overwrite \
    --compress

Start Time:                2024-12-31 17:52:08.091974

Consolidation Info
Merged file:               /vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/6corpus/5gram-merged.jsonl.lz4
Consolidated directory:    /vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/6corpus/5gram-consolidated.jsonl.lz4
Ngram size:                5
Compress output files:     True
Overwrite existing files:  True

Consolidating   |██████████████████████████████████████████████████| 100.0% 276470316   /276470316

Lines before consolidation:  276470316
Lines after consolidation:   75107076

End Time:                  2024-12-31 19:10:05.741234


In [3]:
import shutil

import lz4.frame

input_path = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/6corpus/5gram-consolidated.jsonl.lz4"
output_path = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/6corpus/5gram-consolidated.jsonl"

def decompress_lz4_file(input_path, output_path, chunk_size=16 * 1024 * 1024):
    """Stream-decompress an LZ4 frame file to disk in fixed-size chunks."""
    # Copying in chunks avoids holding the whole decompressed corpus in memory.
    with lz4.frame.open(input_path, "rb") as compressed_file:
        with open(output_path, "wb") as decompressed_file:
            shutil.copyfileobj(compressed_file, decompressed_file, chunk_size)

decompress_lz4_file(input_path, output_path)

In [6]:
!python print_jsonl_lines.py \
    --file_path "{base_dir}/5gram_files/6corpus/5gram-consolidated.jsonl" \
    --start 50000 \
    --end 50100

Line 50000: {"ngram":"abdallah tell","freq_tot":201,"doc_tot":199,"freq":{"1758":1,"1845":1,"1848":1,"1849":3,"1850":22,"1851":2,"1852":1,"1853":1,"1854":3,"1855":2,"1856":2,"1857":2,"1859":4,"1861":1,"1863":3,"1864":1,"1865":2,"1866":1,"1867":1,"1868":7,"1869":4,"1871":1,"1872":1,"1873":2,"1874":1,"1880":2,"1881":5,"1882":6,"1883":2,"1884":7,"1885":5,"1886":2,"1888":2,"1889":1,"1890":1,"1891":2,"1892":1,"1893":2,"1895":2,"1896":5,"1897":7,"1898":1,"1899":2,"1900":3,"1902":1,"1904":1,"1907":4,"1908":1,"1909":3,"1910":1,"1911":1,"1915":1,"1970":2,"2003":2,"2007":1,"2014":2,"1887":2,"1894":3,"1901":9,"1914":1,"1930":1,"1932":3,"1934":3,"1954":2,"1955":2,"1957":1,"1962":2,"1978":1,"1979":1,"1991":1,"1997":1,"1999":2,"2000":1,"2001":4,"2008":3,"2013":3,"2015":3,"2017":1,"2018":3},"doc":{"1758":1,"1845":1,"1848":1,"1849":3,"1850":22,"1851":2,"1852":1,"1853":1,"1854":3,"1855":2,"1856":2,"1857":2,"1859":4,"1861":1,"1863":3,"1864":1,"1865":2,"1866":1,"1867":1,"1868":7,"1869":4,"1871":1,"1872":

In [4]:
!python verify_sort.py \
    --input_file "{base_dir}/5gram_files/6corpus/5gram-consolidated.jsonl" \
    --field ngram \
    --sort_order ascending

Lines: 75107076line [08:47, 142445.80line/s]

The file is sorted.

Processing complete.
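
For reference, a sort check like the one above can be done in a single streaming pass by comparing each key with the previous one. This is a minimal sketch, assuming one JSON object per line; the function `is_sorted` is hypothetical and is not the implementation in `verify_sort.py`.

```python
import json


def is_sorted(lines, field="ngram", ascending=True):
    """Return True if the JSONL lines are ordered by the given field."""
    prev = None
    for line in lines:
        key = json.loads(line)[field]
        if prev is not None:
            out_of_order = key < prev if ascending else key > prev
            if out_of_order:
                return False
        prev = key
    return True


sample = ['{"ngram": "abdallah tell"}', '{"ngram": "river flow"}', '{"ngram": "water run"}']
print(is_sorted(sample))                    # True
print(is_sorted(sample, ascending=False))   # False
```

Passing an open file handle as `lines` keeps memory use constant, since only the previous key is retained.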


In [None]:
!python simulate_merge.py \
    --file_dir "{base_dir}/5gram_files/temp" \
    --workers 48

In [None]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

!python simulate_merge2.py \
    --ngram_size 5 \
    --file_dir "{base_dir}/5gram_files/temp" \
    --tmp_dir "{base_dir}/5gram_files/tmp" \
    --compress \
    --sort_key ngram \
    --sort_order ascending \
    --workers 48