Pull bigrams from the _Google Million_ data set.

> The "Google Million". All are in English with dates ranging from 1500 to 2008. No more than about 6000 books were chosen from any one year, which means that all of the scanned books from early years are present, and books from later years are randomly sampled. The random samplings reflect the subject distributions for the year (so there are more computer books in 2000 than 1980).

See
- https://books.google.com/ngrams/info
- https://storage.googleapis.com/books/ngrams/books/datasetsv2.html

In [None]:
# drop records from before this year
MIN_YEAR = 2000

## Step 1: Fetch data

In [2]:
!mkdir -p ngrams

In [28]:
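import requests as rq
from tqdm.notebook import tqdm

# the 2-gram data is split into 100 zipped shards, numbered 0-99
for i in tqdm(range(100)):
    file = f'googlebooks-eng-1M-2gram-20090715-{i}.csv.zip'
    url = f'http://storage.googleapis.com/books/ngrams/books/{file}'
    response = rq.get(url)
    # fail loudly on HTTP errors instead of saving an error page as a .zip
    response.raise_for_status()
    with open(f'ngrams/{file}', 'wb') as f:
        f.write(response.content)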
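
Re-running the cell above fetches all 100 shards again, even those already on disk. A minimal sketch of a resumable variant follows. The helper names `shard_name`, `missing_shards`, and `download_missing` are hypothetical and not part of the original notebook. The sketch also assumes that an empty file indicates a failed download.

```python
from pathlib import Path

BASE_URL = 'http://storage.googleapis.com/books/ngrams/books'


def shard_name(i):
    """File name of the i-th 2-gram shard (same pattern as the cell above)."""
    return f'googlebooks-eng-1M-2gram-20090715-{i}.csv.zip'


def missing_shards(directory='ngrams', n_shards=100):
    """Return (url, path) pairs for shards that are absent or empty on disk."""
    directory = Path(directory)
    todo = []
    for i in range(n_shards):
        path = directory / shard_name(i)
        # assumption: a zero-byte file is a leftover from a failed download
        if not path.exists() or path.stat().st_size == 0:
            todo.append((f'{BASE_URL}/{shard_name(i)}', path))
    return todo


def download_missing(directory='ngrams', n_shards=100):
    """Download only the shards reported by missing_shards()."""
    import requests as rq  # imported here so the helpers above work offline

    for url, path in missing_shards(directory, n_shards):
        response = rq.get(url, timeout=60)
        response.raise_for_status()
        path.write_bytes(response.content)
```

Calling `download_missing()` after an interrupted run would then fetch only what is still missing from `ngrams/`.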

## Step 2: Filter/aggregate

Here, we do light filtering and aggregation.

For efficiency we drop all records from years prior to `MIN_YEAR`.

We also drop any bigram that contains characters other than ASCII letters and spaces.

The raw data has one row per bigram per year, so we group by bigram and sum the occurrences across the remaining years.
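
To make the effect of the filter and the aggregation concrete, here is a small sketch on made-up rows. The toy values are illustrative only and are not taken from the data set. `MIN_YEAR` is repeated so the example runs on its own.

```python
import string

import pandas as pd

MIN_YEAR = 2000  # same threshold as at the top of the notebook

# made-up rows in the same five-column layout as the raw files
toy = pd.DataFrame(
    [
        ('hello world', 1999, 5, 5, 3),  # dropped: before MIN_YEAR
        ('hello world', 2000, 7, 6, 4),
        ('hello world', 2001, 2, 2, 2),
        ('foo 42', 2001, 9, 9, 9),       # dropped: contains digits
        ('good day', 2005, 1, 1, 1),
    ],
    columns=['ngram', 'year', 'occurrences', 'pages', 'books'],
)

# matches any character that is not an ASCII letter or a space
non_letter = f'[^{string.ascii_letters} ]'

keep = toy.year.ge(MIN_YEAR) & ~toy.ngram.str.contains(non_letter)
totals = toy[keep].groupby('ngram').occurrences.sum()
print(totals)
```

Only `good day` (1 occurrence) and `hello world` (7 + 2 = 9 occurrences) remain in `totals`.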

In [60]:
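import string

import pandas as pd
from tqdm.notebook import tqdm

# matches any character that is not an ASCII letter or a space
NON_LETTER = f'[^{string.ascii_letters} ]'

for i in tqdm(range(100)):
    file = f'googlebooks-eng-1M-2gram-20090715-{i}.csv.zip'
    try:
        # the files have no header row, so name the columns on read
        df = pd.read_csv(
            f'ngrams/{file}', sep='\t', header=None,
            names=['ngram', 'year', 'occurrences', 'pages', 'books'],
        )
    except Exception:
        # skip shards that are missing or fail to parse
        continue
    # keep recent, letters-only bigrams and sum their occurrences over the years
    counts = df[
        df.year.ge(MIN_YEAR)
        & ~df.ngram.str.contains(NON_LETTER)
    ].groupby('ngram').occurrences.sum()
    counts.to_csv(f"ngrams/{file.replace('.csv.zip', '-clean.csv.gz')}")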
