ngramr.plus

R-CMD-check Tests CRAN status CRAN downloads Lifecycle: stable License: MIT

Overview

ngramr.plus provides functions for extracting frequency data from Google Books Ngram datasets without requiring local downloads. The package enables researchers to analyze word usage patterns across centuries of published texts in multiple varieties of English.

Key Features

  • Direct access to Google Books Ngram data (no local downloads required)
  • Multiple English varieties: All English, British English, American English
  • Flexible time aggregation: by year or decade
  • Support for 1-5 grams: single words to 5-word phrases
  • Memory efficient: uses chunked reading for large datasets
  • Robust error handling: comprehensive network and data validation
  • Normalized frequencies: returns both raw counts and per-million-word rates

Why ngramr.plus?

While the ngramr package excels at plotting trends, ngramr.plus returns the underlying frequency data, enabling:

  • Precise identification of peaks and troughs in word usage
  • Advanced statistical analysis of linguistic trends
  • Integration with other text analysis workflows
  • Custom visualization and modeling approaches
  • Variability-based neighbor clustering for periodization (see vnc package)

Installation

Install from CRAN:

install.packages("ngramr.plus")

Or install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("browndw/ngramr.plus")

Quick Start

library(ngramr.plus)

# Get yearly frequency data for "zinger" in US English
zinger_data <- google_ngram("zinger", variety = "us", by = "year")
head(zinger_data)
#>   Year  AF    Total Per_10.6
#> 1 1950   5  1234567     4.05
#> 2 1955  12  1456789     8.24
#> 3 1960  45  1678901    26.80

# Search for multiple forms of the same lemma in British English by decade
walk_data <- google_ngram(c("walk", "walks", "walked"), 
                          variety = "gb", by = "decade")

# Search for phrases (up to 5 words)
ml_data <- google_ngram("machine learning", variety = "eng", by = "year")

Main Function

google_ngram()

Extracts frequency data from the Google Books Ngram corpus with the following parameters:

  • word_forms: Character vector of words/phrases to search (max 5 tokens per ngram)
  • variety: English variety - "eng" (all), "gb" (British), or "us" (American)
  • by: Time aggregation - "year" or "decade"

Returns: Data frame with columns:

  • Year/Decade: Time period
  • AF: Absolute frequency (total occurrences)
  • Total: Total words in corpus for that period
  • Per_10.6: Normalized frequency per million words
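The normalized rate in Per_10.6 can be recomputed directly from AF and Total. A minimal sketch, using the illustrative numbers from the Quick Start output (not real corpus totals):

```r
# Per-million-word rate: absolute frequency divided by the corpus
# total for that period, scaled to one million words.
af <- 45
total <- 1678901
per_million <- af / total * 1e6
round(per_million, 2)
```

This is the standard normalization for comparing frequencies across periods with different corpus sizes.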

Data Processing Notes

Performance Considerations

  • File sizes vary greatly: Files for rare initial letters (Q, X, Z) process in seconds, while files for common letters or common bigrams may take several minutes
  • Memory efficient: Uses chunked reading to handle multi-gigabyte Google datasets
  • Internet required: Functions access live Google Books repositories
  • Progress tracking: Shows progress bar for longer operations
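The chunked-reading strategy can be sketched with base R connections. This is an illustrative sketch only; the package's actual chunk size and parsing logic may differ, and a small temporary TSV stands in for a multi-gigabyte Google file:

```r
# Create a small stand-in TSV (word \t year \t count).
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("zinger\t1950\t5", "zing\t1950\t7", "zinger\t1951\t6"), tsv_path)

# Read the file a few lines at a time, keeping only matching rows,
# so the full file is never held in memory at once.
con <- file(tsv_path, open = "r")
hits <- character(0)
repeat {
  chunk <- readLines(con, n = 2)   # read 2 lines per chunk
  if (length(chunk) == 0) break
  hits <- c(hits, chunk[startsWith(chunk, "zinger\t")])
}
close(con)
length(hits)  # 2 matching rows
```

Only the filtered rows accumulate, which is what keeps memory use flat even when the source file is very large.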

Word Form Guidelines

  • Use inflected forms of the same lemma (e.g., c("teenager", "teenagers"))
  • Case insensitive: Automatically handles capitalization differences
  • Hyphenated words: Count as multiple tokens (e.g., "x-ray" = 3 tokens)
  • Special characters: Properly escaped for pattern matching
  • Unicode support: Handles international characters
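Case-insensitive matching of the supplied word forms can be sketched as an anchored alternation pattern. This is illustrative only; the package's actual matching and escaping logic may differ:

```r
# Build an anchored pattern from the word forms and match
# case-insensitively against candidate tokens.
word_forms <- c("teenager", "teenagers")
pattern <- paste0("^(", paste(word_forms, collapse = "|"), ")$")
grepl(pattern, c("Teenager", "TEENAGERS", "teen"), ignore.case = TRUE)
```

The anchors (^ and $) prevent partial matches such as "teen", while ignore.case handles capitalization differences.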

Error Handling

The package includes comprehensive error handling for:

  • Network issues: Timeout, connectivity problems
  • Missing data: Files not found, empty results
  • Input validation: Token limits, mixed ngram lengths
  • Server errors: HTTP errors, temporary unavailability
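For long-running analysis scripts, a query can be wrapped in tryCatch so a network failure yields NULL instead of aborting the whole script. A sketch; safe_ngram is a hypothetical helper, not part of the package:

```r
# Hypothetical wrapper: returns NULL on error rather than stopping,
# and reports the failure as a message.
safe_ngram <- function(...) {
  tryCatch(
    google_ngram(...),
    error = function(e) {
      message("ngram query failed: ", conditionMessage(e))
      NULL
    }
  )
}

# result <- safe_ngram("zinger", variety = "us", by = "year")
```

Callers can then test is.null(result) and skip or retry that query.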

Example Analysis

library(ngramr.plus)
library(ggplot2)

# Analyze the rise of technology terminology starting with 'c'
tech_data <- google_ngram(c("computer", "computing", "cyber"), 
                          variety = "eng", by = "year")

# Plot the trend
ggplot(tech_data, aes(x = Year, y = Per_10.6)) +
  geom_line() +
  geom_point() +
  labs(title = "Rise of Computing Terminology",
       x = "Year", 
       y = "Frequency (per million words)") +
  theme_minimal()

Documentation

  • Function help: ?google_ngram
  • Package vignette: browseVignettes("ngramr.plus")
  • Full documentation: ReadTheDocs

Citation

If you use ngramr.plus in research, please cite:

citation("ngramr.plus")

Related Packages

  • ngramr: Efficient ngram visualization
  • vnc: Variability-based neighbor clustering
  • tidytext: Text mining with tidy principles

Contributing

Bug reports and feature requests are welcome on GitHub Issues.

License

MIT License. See LICENSE file for details.
