ngramr.plus

R-CMD-check Tests CRAN status CRAN downloads Lifecycle: stable License: MIT

Overview

ngramr.plus provides functions for extracting frequency data from Google Books Ngram datasets without requiring local downloads. The package enables researchers to analyze word usage patterns across centuries of published texts in multiple varieties of English.

Key Features

  • Direct access to Google Books Ngram data (no local downloads required)
  • Multiple English varieties: All English, British English, American English
  • Flexible time aggregation: by year or decade
  • Support for 1-5 grams: single words to 5-word phrases
  • Memory efficient: uses chunked reading for large datasets
  • Robust error handling: comprehensive network and data validation
  • Normalized frequencies: returns both raw counts and per-million-word rates

Why ngramr.plus?

While the ngramr package excels at plotting trends, ngramr.plus returns the underlying frequency data, enabling:

  • Precise identification of peaks and troughs in word usage
  • Advanced statistical analysis of linguistic trends
  • Integration with other text analysis workflows
  • Custom visualization and modeling approaches
  • Variability-based neighbor clustering for periodization (see vnc package)

Installation

Install from CRAN:

install.packages("ngramr.plus")

Or install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("browndw/ngramr.plus")

Quick Start

library(ngramr.plus)

# Get yearly frequency data for "zinger" in US English
zinger_data <- google_ngram("zinger", variety = "us", by = "year")
head(zinger_data)
#>   Year  AF    Total Per_10.6
#> 1 1950   5  1234567     4.05
#> 2 1955  12  1456789     8.24
#> 3 1960  45  1678901    26.80

# Search for multiple forms of the same lemma in British English by decade
walk_data <- google_ngram(c("walk", "walks", "walked"), 
                          variety = "gb", by = "decade")

# Search for phrases (up to 5 words)
ml_data <- google_ngram("machine learning", variety = "eng", by = "year")

Main Function

google_ngram()

Extracts frequency data from the Google Books Ngram corpus with the following parameters:

  • word_forms: Character vector of words/phrases to search (max 5 tokens per ngram)
  • variety: English variety - "eng" (all), "gb" (British), or "us" (American)
  • by: Time aggregation - "year" or "decade"

Returns: Data frame with columns:

  • Year/Decade: Time period
  • AF: Absolute frequency (total occurrences)
  • Total: Total words in corpus for that period
  • Per_10.6: Normalized frequency per million words
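The normalized rate in Per_10.6 can be recomputed directly from AF and Total. A minimal sketch, using the illustrative numbers from the Quick Start output (not real corpus totals):

```r
# Per-million-word rate: absolute frequency divided by the corpus
# total for that period, scaled to one million words.
af <- 45
total <- 1678901
per_million <- af / total * 1e6
round(per_million, 2)
```

This is the standard normalization for comparing frequencies across periods with different corpus sizes.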

Data Processing Notes

Performance Considerations

  • File sizes vary greatly: Files for rare initial letters (Q, X, Z) process in seconds, while files for common letters or common bigrams may take several minutes
  • Memory efficient: Uses chunked reading to handle multi-gigabyte Google datasets
  • Internet required: Functions access live Google Books repositories
  • Progress tracking: Shows progress bar for longer operations
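The chunked-reading strategy can be sketched with base R connections. This is an illustrative sketch only; the package's actual chunk size and parsing logic may differ, and a small temporary TSV stands in for a multi-gigabyte Google file:

```r
# Create a small stand-in TSV (word \t year \t count).
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("zinger\t1950\t5", "zing\t1950\t7", "zinger\t1951\t6"), tsv_path)

# Read the file a few lines at a time, keeping only matching rows,
# so the full file is never held in memory at once.
con <- file(tsv_path, open = "r")
hits <- character(0)
repeat {
  chunk <- readLines(con, n = 2)   # read 2 lines per chunk
  if (length(chunk) == 0) break
  hits <- c(hits, chunk[startsWith(chunk, "zinger\t")])
}
close(con)
length(hits)  # 2 matching rows
```

Only the filtered rows accumulate, which is what keeps memory use flat even when the source file is very large.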

Word Form Guidelines

  • Use inflected forms of the same lemma (e.g., c("teenager", "teenagers"))
  • Case insensitive: Automatically handles capitalization differences
  • Hyphenated words: Count as multiple tokens (e.g., "x-ray" = 3 tokens)
  • Special characters: Properly escaped for pattern matching
  • Unicode support: Handles international characters
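Case-insensitive matching of the supplied word forms can be sketched as an anchored alternation pattern. This is illustrative only; the package's actual matching and escaping logic may differ:

```r
# Build an anchored pattern from the word forms and match
# case-insensitively against candidate tokens.
word_forms <- c("teenager", "teenagers")
pattern <- paste0("^(", paste(word_forms, collapse = "|"), ")$")
grepl(pattern, c("Teenager", "TEENAGERS", "teen"), ignore.case = TRUE)
```

The anchors (^ and $) prevent partial matches such as "teen", while ignore.case handles capitalization differences.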

Error Handling

The package includes comprehensive error handling for:

  • Network issues: Timeout, connectivity problems
  • Missing data: Files not found, empty results
  • Input validation: Token limits, mixed ngram lengths
  • Server errors: HTTP errors, temporary unavailability
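For long-running analysis scripts, a query can be wrapped in tryCatch so a network failure yields NULL instead of aborting the whole script. A sketch; safe_ngram is a hypothetical helper, not part of the package:

```r
# Hypothetical wrapper: returns NULL on error rather than stopping,
# and reports the failure as a message.
safe_ngram <- function(...) {
  tryCatch(
    google_ngram(...),
    error = function(e) {
      message("ngram query failed: ", conditionMessage(e))
      NULL
    }
  )
}

# result <- safe_ngram("zinger", variety = "us", by = "year")
```

Callers can then test is.null(result) and skip or retry that query.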

Example Analysis

library(ngramr.plus)
library(ggplot2)

# Analyze the rise of technology terminology starting with 'c'
tech_data <- google_ngram(c("computer", "computing", "cyber"), 
                          variety = "eng", by = "year")

# Plot the trend
ggplot(tech_data, aes(x = Year, y = Per_10.6)) +
  geom_line() +
  geom_point() +
  labs(title = "Rise of Computing Terminology",
       x = "Year", 
       y = "Frequency (per million words)") +
  theme_minimal()

Documentation

  • Function help: ?google_ngram
  • Package vignette: browseVignettes("ngramr.plus")
  • Full documentation: ReadTheDocs

Citation

If you use ngramr.plus in research, please cite:

citation("ngramr.plus")

Related Packages

  • ngramr: Efficient ngram visualization
  • vnc: Variability-based neighbor clustering
  • tidytext: Text mining with tidy principles

Contributing

Bug reports and feature requests are welcome on GitHub Issues.

License

MIT License. See LICENSE file for details.
