# Table of Contents
- [What is the Google Books Ngram database?](#introduction)
- [What functions do we need to work with the database?](#functions)
  - [How do we deal with large files?](#function-split)
  - [How do we extract the data?](#function-extract)
  - [How do we export the data?](#function-export)
- [Let's split the large files](#split)
- [Let's extract the data](#extract)
- [References](#references)

# What is the Google Books Ngram database? <a name="introduction"></a>

The [Google Books Ngram dataset](https://storage.googleapis.com/books/ngrams/books/datasetsv3.html) is a massive collection of word and phrase frequencies extracted from millions of digitized books, spanning centuries of written history. This dataset powers the [Google Books Ngram Viewer](https://books.google.com/ngrams/info), an interactive tool that lets users visualize the frequency of words and phrases over time. The Google Books corpus provides a unique lens into linguistic, cultural, and historical trends by tracking how language usage evolves over time [[1](#michel2010)].

## Tell me more...

We will work with Version 3 of the dataset, released in February 2020. This version is based on over 8 million books scanned by the Google Books team and spans the years 1500 to 2019. It covers 8 languages and provides *n*-gram data (sequences of 1 to 5 words) along with metadata such as publication year, match count, and volume count.

# What's the objective of this notebook?

The Google Books Ngram dataset is shared as gzip-compressed (.gz) files. Our goal is to have a (not so streamlined) workflow to process the compressed data. By the end, we'll have .csv files with the n-grams that we're interested in, enabling us to explore beyond what the Ngram Viewer offers.

In [1]:
import pandas as pd
import numpy as np
import gzip
import re

# Mount notebook to Google Drive
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/sp2025/dtsc4301/dtsc-capstone/data/google-books-ngram/
!ls

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/.shortcut-targets-by-id/1huccIkFQ4UJwb4XQ6Tg9yR5F4AOEtFkP/DTSC Capstone/data/google-books-ngram
 chi-sim   eng	 eng-us  'eng-us csv'   fre   ger   heb   ita   rus   spa


# What functions do we need to work with the database? <a name="functions"></a>

We define three functions below to help in processing and extracting the data from the Google Ngram files:
  - `split(filename, lines)`: Helps us split the data files into smaller subsets. Processing the large files from the Google Ngram dataset requires more resources than Google Colab provides.
  - `extract(filename, dataframe, yr=None, freq=None, regex=None)`: Helps us extract the n-grams that we're interested in, given criteria defined by the year, frequency, and regex.
  - `to_csv(dataframe, filename)`: Helps us export the extracted records to a CSV file.

## Splitting Files <a name="function-split"></a>

In [2]:
def split(filename: str, lines: int = 1000000) -> None:
    """
    Splits a compressed Google Ngram dataset file into smaller .gz files.

    Needed because processing large files needs more resources than Google Colab
    provides.

    Parameters
    ----------
    filename : str
        The name of the gzip-format file to be opened.
    lines : int
        The number of lines per output file.
    """
    # Google Ngram files are shared as gzip-compressed files
    with gzip.open(filename, 'rt', encoding='utf-8') as f:
        # Match naming convention of Google Ngram files
        part_num = 0
        out_filename = f'{filename[:-3]}-pt-{part_num:05d}.gz'
        outfile = gzip.open(out_filename, 'wt', encoding='utf-8')

        for i, line in enumerate(f, start=1):
            outfile.write(line)

            # Output progress updates to not go insane while waiting
            if i % 1000000 == 0:
                print(f'Line {i} written to {out_filename}')

            # Start a new part file after every `lines` lines
            if i % lines == 0:
                outfile.close()
                print(f'File saved as {out_filename}')
                part_num += 1
                out_filename = f'{filename[:-3]}-pt-{part_num:05d}.gz'
                outfile = gzip.open(out_filename, 'wt', encoding='utf-8')

        outfile.close()
        print(f'File saved as {out_filename}')
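
Before splitting, it can help to know how many lines a file holds, since that determines how many part files will contain data for a given `lines` value. The sketch below is one way to check; `count_lines` is a hypothetical helper (not part of the notebook), and the tiny temporary file stands in for a real Ngram file.

```python
import gzip
import os
import tempfile


def count_lines(filename: str) -> int:
    """Count the lines in a gzip-compressed text file without loading it into memory."""
    with gzip.open(filename, 'rt', encoding='utf-8') as f:
        return sum(1 for _ in f)


# Demonstrate on a tiny stand-in file rather than a real Ngram file
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'sample.gz')
    with gzip.open(sample, 'wt', encoding='utf-8') as f:
        f.write('a\t2000,1,1\nb\t2001,2,1\nc\t2002,3,2\n')
    n_lines = count_lines(sample)

# Ceiling division gives the number of parts that will hold data
lines_per_part = 2
n_parts = -(-n_lines // lines_per_part)
print(n_lines, n_parts)
```

On a real file the count takes a full pass over the compressed data, so it is probably only worth running once per file.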

## Extracting Data <a name="function-extract"></a>
### File Format <a name="file-format"></a>
Each of the files is compressed tab-separated data. In Version 3, each line has the following format:
```
ngram TAB year_1,match_count_1,volume_count_1 TAB year_2 ... TAB year_x,match_count_x,volume_count_x
```

As an example, here are the first and second lines from the `teferral_NOUN` file of the English 1-grams ([1-00023-of-00024.gz](http://storage.googleapis.com/books/ngrams/books/20200217/eng/1-00023-of-00024.gz)):
```
www.socialstudies	2000,1,1  2001,1,1 ... 2018,2,2  2019,2,2
πόνοι	1868,1,1	1876,1,1	1943,1,1 ... 2018,19,10 2019,42,13
```

The second line tells us that the word "πόνοι":
- In 1868, occurred 1 time overall, in 1 distinct book of their sample.
- In 2019, occurred 42 times overall, in 13 distinct books of their sample.
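
To make the format concrete, here is a minimal sketch of how a single line could be unpacked into one row per year. `parse_line` is a hypothetical helper used only for illustration, and the sample string contains just the records shown in the example; the years elided by `...` are left out.

```python
def parse_line(line: str) -> list:
    """Turn one raw Ngram line into [ngram, year, match_count, volume_count] rows."""
    ngram, *yearly = line.rstrip('\n').split('\t')
    rows = []
    for entry in yearly:
        year, match_count, volume_count = entry.split(',')
        rows.append([ngram, int(year), int(match_count), int(volume_count)])
    return rows


# Shortened version of the "πόνοι" line: only the records shown in the
# example are included
sample = 'πόνοι\t1868,1,1\t1876,1,1\t1943,1,1\t2018,19,10\t2019,42,13'
rows = parse_line(sample)
print(rows[0])   # ['πόνοι', 1868, 1, 1]
print(rows[-1])  # ['πόνοι', 2019, 42, 13]
```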

### Extracting Data

We pass the following regex to the `extract()` function to omit 1-grams that begin with special characters or numbers:
```
^[^a-zA-Z\u00C0-\u024F\u0400-\u04FF\u0590-\u05FF\u0600-\u06FF\u4E00-\u9FFF].*
```
Where:
- `^` → Start of the string.
- `[^...]` → Matches anything except:
  - `a-zA-Z` → Basic Latin letters.
  - `\u00C0-\u024F` → Latin-1 Supplement letters plus Latin Extended-A & B (Covers accented letters like é, ñ, ü, ç, etc., used in French, German, and Spanish).
  - `\u0400-\u04FF` → Cyrillic (Covers Russian and other Slavic languages).
  - `\u0590-\u05FF` → Hebrew.
  - `\u0600-\u06FF` → Arabic.
  - `\u4E00-\u9FFF` → Chinese (CJK Unified Ideographs).
- `.*` → Allows the rest of the string to be anything.
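
A quick way to see what the pattern does is to run it over a few tokens. In the sketch below, a match means the token would be omitted. Apart from the two n-grams taken from the file-format example, the sample tokens are hand-picked for illustration and are not drawn from the dataset.

```python
import re

# Same pattern as above; the `re` module resolves the \uXXXX escapes itself
pattern = re.compile(
    r'^[^a-zA-Z\u00C0-\u024F\u0400-\u04FF\u0590-\u05FF\u0600-\u06FF\u4E00-\u9FFF].*'
)

samples = ['www.socialstudies', 'école', 'слово', '2019', '_NOUN_', 'πόνοι']

# True means the token matches the pattern and would be skipped
omitted = {token: bool(pattern.search(token)) for token in samples}
for token, is_omitted in omitted.items():
    print(f'{token!r}: {"omitted" if is_omitted else "kept"}')
```

Note that Greek (`\u0370-\u03FF`) is not among the listed ranges, so a token such as "πόνοι" is treated like one that starts with a special character and is omitted.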

In [3]:
def extract(filename: str, dataframe: list, yr: int = None, freq: int = None, regex : str = None) -> list:
    """
    Extracts qualifying n-grams from a compressed Google Ngram dataset file
    and appends them to the provided dataframe list.

    Parameters
    ----------
    filename : str
        The name of the gzip-format file to be opened.
    dataframe : list
        The list to append qualifying n-gram records to. Qualifying records
        are determined by the `yr` and `freq` parameters.
    yr : int
        The lower bound for years of interest. If `yr` = 1950, then qualifying
        records must have a year of at least 1950.
    freq : int
        The lower bound for match count. If `freq` = 40, then qualifying
        records must have a match count of at least 40.
    regex : str
        The Regular Expression pattern to match. If provided, records
        matching the pattern will be omitted from the dataframe list.

    Returns
    -------
    list
        The modified dataframe list with appended qualifying n-gram records.
    """

    # Google Ngram files are shared as gzip-compressed files
    with gzip.open(filename=filename, mode='rt', encoding='utf-8') as f:
        for line in f:
            # Each line in file has the format: ngram TAB year,match_count,volume_count TAB...
            line = line.strip().split('\t')

            # Skip entries that match the pattern (e.g. begin with special characters or numbers)
            if regex and re.search(regex, line[0]):
                continue

            # Iterate over yearly records (omitting n-gram at index 0)
            for year in line[1:]:
                # Split comma-separated year-related data (format: year,match_count,volume_count)
                year = year.split(',')

                # Skip entries that don't meet thresholds to reduce file size
                if yr and int(year[0]) < yr:
                    continue
                if freq and int(year[1]) < freq:
                    continue

                dataframe.append([line[0]] + year)

    # Return the list, as promised by the signature and docstring
    return dataframe

## Convert to CSV <a name="function-export"></a>

In [4]:
def to_csv(dataframe: list, filename: str) -> None:
    """
    Converts a dataframe list to a CSV file.

    Parameters
    ----------
    dataframe : list
        The list to be converted to a CSV file.
    filename : str
        The name of the CSV file to be created.

    Returns
    -------
    None
    """

    # Output as .csv file for further analysis
    df = pd.DataFrame(dataframe, columns=['ngram', 'year', 'match_count', 'volume_count'])
    df.to_csv(filename, index=False, header=False)
    print(f'File saved as {filename}')
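
Because `to_csv()` writes the file with `header=False`, the column names have to be supplied again when the data is read back. The sketch below shows one way to do that with the standard library; the in-memory text is a stand-in for a file written by `to_csv()`, and its two rows reuse the "πόνοι" records from the file-format example.

```python
import csv
import io

# Column order used by `to_csv()`
columns = ['ngram', 'year', 'match_count', 'volume_count']

# Stand-in for a file written by `to_csv()`: no header row
csv_text = 'πόνοι,2018,19,10\nπόνοι,2019,42,13\n'

# Supplying `fieldnames` tells DictReader to treat the first row as data
reader = csv.DictReader(io.StringIO(csv_text), fieldnames=columns)
records = list(reader)
print(records[0])
```

Every value comes back as a string here, so the numeric columns need converting before analysis. In pandas, `pd.read_csv(filename, header=None, names=columns)` should restore the same column labels.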

# Extracting Data <a name="extract"></a>

## Generating File Paths <a name="file-paths"></a>

The corpus has the following number of files for each language:

| Language Code | Language             | Number of Files |
|---------------|----------------------|-----------------|
| **chi-sim**   | Chinese (Simplified) | 1               |
| **eng**       | English              | 24              |
| **eng-us**    | American English     | 14              |
| **fre**       | French               | 6               |
| **ger**       | German               | 8               |
| **heb**       | Hebrew               | 1               |
| **ita**       | Italian              | 2               |
| **rus**       | Russian              | 2               |
| **spa**       | Spanish              | 3               |

Where each file has the naming convention:
```
./{language}/1-{file_index}-of-{total_files}.gz
```

For example, the English file with index 6 (the seventh file, since indices start at 0) has the path:
```
./eng/1-00006-of-00024.gz
```
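
Going the other way, a path that follows this convention can be split back into its parts, which is handy for sanity-checking file names. This is only a sketch: `PATH_PATTERN` and `parse_path` are hypothetical names, and the pattern assumes lowercase language codes and five-digit, zero-padded indices as in the example above.

```python
import re

# Pattern for paths such as './eng/1-00006-of-00024.gz'
PATH_PATTERN = re.compile(
    r'^\./(?P<language>[a-z-]+)/1-(?P<index>\d{5})-of-(?P<total>\d{5})\.gz$'
)


def parse_path(path: str) -> dict:
    """Split a 1-gram file path into language, file index, and total file count."""
    match = PATH_PATTERN.match(path)
    if match is None:
        raise ValueError(f'Unexpected file path: {path}')
    return {
        'language': match['language'],
        'index': int(match['index']),
        'total': int(match['total']),
    }


info = parse_path('./eng/1-00006-of-00024.gz')
print(info)  # {'language': 'eng', 'index': 6, 'total': 24}
```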

The code below automates the generation of file paths for each language:

In [5]:
# Map language codes to the number of available files
file_qty = {
    'chi-sim': 1,  # Chinese (simplified)
    'eng': 24,     # English
    'eng-us': 14,  # American English
    'fre': 6,      # French
    'ger': 8,      # German
    'heb': 1,      # Hebrew
    'ita': 2,      # Italian
    'rus': 2,      # Russian
    'spa': 3       # Spanish
}

file_paths = []

# Iterate over each language in `file_qty`
for language, qty in file_qty.items():
    # Generate file paths in the format: './eng/1-00006-of-00024.gz'
    files = [f'./{language}/1-{j:05d}-of-{qty:05d}.gz' for j in range(qty)]

    # Append the list of file paths to `file_paths`
    file_paths.append(files)

def extend_file_path(files: int, file_path: str) -> list:
    """Returns the paths of the `files` part files that `split()` creates from `file_path`."""
    return [f'{file_path[0:-3]}-pt-{i:05d}.gz' for i in range(files)]

# file_paths[2] holds the 'eng-us' paths (third key in `file_qty`)
files = extend_file_path(files=2, file_path=file_paths[2][4])
print(files)

['./eng-us/1-00004-of-00014-pt-00000.gz', './eng-us/1-00004-of-00014-pt-00001.gz']


In [8]:
split(file_paths[2][4])

Line 1000000 written to ./eng-us/1-00004-of-00014-pt-00000.gz
Line 2000000 written to ./eng-us/1-00004-of-00014-pt-00000.gz
File saved as ./eng-us/1-00004-of-00014-pt-00000.gz
Line 3000000 written to ./eng-us/1-00004-of-00014-pt-00001.gz
File saved as ./eng-us/1-00004-of-00014-pt-00001.gz


In [None]:
# Set parameters
df = [] # Empty list to append qualifying records to
year = 1900 # Lower bound for years of interest (e.g. 1900-2019)
match_count = 0 # Lower bound for match count of interest (e.g. 40 keeps only records with at least 40 matches; 0 keeps all)
regex = '^[^a-zA-Z\u00C0-\u024F\u0400-\u04FF\u0590-\u05FF\u0600-\u06FF\u4E00-\u9FFF].*'

for filename in files: # Iterate over each part file generated above
    extract(filename=filename, dataframe=df, yr=year, freq=match_count, regex=regex) # Clean and enumerate qualifying records
    to_csv(dataframe=df, filename=f'{filename[:-3]}.csv') # Export records to CSV file (swap the .gz extension for .csv)
    df = [] # Reset dataframe list for next file

# References <a name="references"></a>
1. Jean-Baptiste Michel\*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, William Brockman, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden\*. *Quantitative Analysis of Culture Using Millions of Digitized Books*. **Science** (Published online ahead of print: 12/16/2010)<a name="michel2010"></a>