# Digging into the data BERT was trained on

The [Hugging Face model card for bert-base-uncased](https://huggingface.co/bert-base-uncased) lists the datasets BERT was trained on. After reading the paper [Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus](https://arxiv.org/pdf/2105.05241.pdf), contributing further analysis seems like a worthy cause, for this model and for others.

## BookCorpus and BookCorpusOpen

Hugging Face links to the [homepage of the original BookCorpus authors](https://yknzhu.wixsite.com/mbweb), but they no longer host the original dataset.

[The dataset card for BookCorpus](https://huggingface.co/datasets/bookcorpus) describes a reproduction, but it doesn't include metadata about the books the content comes from.

### Optional Exercise: Extending the Analysis via the BookCorpusOpen Dataset

The dataset described in [the dataset card for BookCorpusOpen](https://huggingface.co/datasets/bookcorpusopen) is also an imperfect reproduction, but it does include metadata, so there is an opportunity to extend the paper [Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus](https://arxiv.org/pdf/2105.05241.pdf) with further analysis.

Perhaps we could email the authors of the copyrighted books and try to crowdfund some money for them, since many companies make a lot of money from BERT-based models that were trained on the authors' hard work without their permission?

Or explore the dataset to see if you can find any other insights. 

In [1]:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("bookcorpusopen")


In [2]:
df = pd.DataFrame(dataset["train"])

In [3]:
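import re

def get_email_addresses(text):
    # Rough match for name@domain.label. The range A-z also covers the ASCII
    # punctuation between 'Z' and 'a', and only one label after the first dot
    # is captured.
    return re.findall(r'[0-9a-zA-z]+@[0-9a-zA-z]+\.[0-9a-zA-z]+', text)

def get_copyright(text):
    # 'Copyright' plus an optional year, e.g. 'Copyright 2015' or 'Copyright '.
    return re.findall(r'Copyright\s*\d*', text)

def get_license_notes(text):
    # First line of text after a 'License Notes' heading and a blank line.
    return re.findall(r'License Notes\n\n(.*?)(?:\n)', text)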
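
The pattern in `get_email_addresses` captures only one label after the first dot, so an address on a multi-part domain such as `.co.uk` comes back shortened. Below is a minimal sketch of a somewhat stricter variant. `EMAIL_PATTERN` and `get_email_addresses_strict` are hypothetical names that are not part of the notebook, and the pattern is a pragmatic approximation rather than a full validator for every legal address.

```python
import re

# Hypothetical helper, not from the original notebook.
EMAIL_PATTERN = re.compile(
    r"[A-Za-z0-9._%+-]+"                   # local part
    r"@"
    r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"   # domain: two or more dot-separated labels
)

def get_email_addresses_strict(text):
    """Return the unique addresses in `text`, in order of first appearance."""
    seen = []
    for match in EMAIL_PATTERN.findall(text):
        # A trailing dot or hyphen usually belongs to the sentence, not the address.
        address = match.rstrip(".-")
        if address not in seen:
            seen.append(address)
    return seen
```

It could be applied the same way as the other helpers, for example `df["text"].apply(get_email_addresses_strict)`. Results will differ from the original column wherever an address uses a multi-part domain or appears more than once in a book.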

In [5]:
df["email_addresses"] = df["text"].apply(get_email_addresses)
df["copyright"] = df["text"].apply(get_copyright)
df["license_notes"] = df["text"].apply(get_license_notes)

In [7]:
# Keep only the books where email_addresses, license_notes, and copyright are all non-empty
df_notgood = df[(df["email_addresses"].str.len() > 0) & (df["license_notes"].str.len() > 0) & (df["copyright"].str.len() > 0)]

In [8]:
df_notgood.head()

| index | title | text | email_addresses | copyright | license_notes |
|---|---|---|---|---|---|
| 5 | `satan-the-sworn-enemy-of-mankind.epub.txt` | `\n## SATAN:\n\n## The Sworn Enemy of Mankind...` | `[info@harunyahya.com]` | `[Copyright ]` | `[## All rights reserved. No part of this publi...` |
| 12 | `crown-of-thorns-the-race-to-clone-jesus-christ...` | `\n### Crown of Thorns -\n\n### The Race To Cl...` | `[iancpirvine@hotmail.co, iancpirvine@hotmail.co]` | `[Copyright 2001]` | `[This book is licensed for your personal enjoy...` |
| 27 | `shattered.epub.txt` | `\n\nSHATTERED\n\nby\n\nSandra Madera\n\nEdited...` | `[smadera23@yahoo.com]` | `[Copyright ]` | `[This ebook is licensed for your personal enjo...` |
| 33 | `a-course-in-miracles-wth-comments-by-ron-rasmu...` | `\n\n### A Course in Miracles with Comments\n\n...` | `[ARequiredCourse@yahoogroups.com, Rasmussen@ao...` | `[Copyright 2015]` | `[This ebook is licensed for your personal enjo...` |
| 60 | `red-warp.epub.txt` | `\nRed Warp\n\nDon DeBon\n\nStandard Edition\n\...` | `[debon@gmail.com]` | `[Copyright , Copyright ]` | `[This e-book is licensed for your personal enj...` |
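
The first five rows only show that such books exist. A natural next question is how common each signal is across the whole dataset. The following is a minimal sketch that uses only the standard library and the same three patterns as the helper functions above. `PATTERNS` and `signal_shares` are hypothetical names that are not part of the notebook.

```python
import re

# Same patterns as the notebook's helper functions, compiled once.
# The names below are illustrative and not from the original notebook.
PATTERNS = {
    "email_addresses": re.compile(r"[0-9a-zA-z]+@[0-9a-zA-z]+\.[0-9a-zA-z]+"),
    "copyright": re.compile(r"Copyright\s*\d*"),
    "license_notes": re.compile(r"License Notes\n\n(.*?)(?:\n)"),
}

def signal_shares(texts):
    """Return, per signal, the percentage of texts with at least one match."""
    texts = list(texts)
    if not texts:
        return {name: 0.0 for name in PATTERNS}
    shares = {}
    for name, pattern in PATTERNS.items():
        hits = sum(1 for text in texts if pattern.search(text))
        shares[name] = 100.0 * hits / len(texts)
    return shares
```

Calling `signal_shares(df["text"])` would give one percentage per signal. On the notebook's DataFrame, the same figure for a single column should also be available as `(df["email_addresses"].str.len() > 0).mean() * 100`.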


## Wikipedia 

### Optional Exercise: Extending the Analysis to the Wikipedia Dataset

I'm not sure whether the dataset described in [the Hugging Face dataset card for Wikipedia](https://huggingface.co/datasets/wikipedia) is the original Wikipedia dataset that BERT was trained on, but perhaps it would be informative to extend the paper [Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus](https://arxiv.org/pdf/2105.05241.pdf) with further analysis of it?

Can you aggregate the data and create a data card that reports, for both the BookCorpus and Wikipedia datasets, the percentage of the data that falls under each of various categories?
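
One possible starting point for such a data card is sketched below, assuming you already have one category label per document for each dataset. Where those labels come from (genre metadata, a classifier, manual annotation) is left open, and `category_percentages` and `render_data_card` are hypothetical names introduced only for illustration.

```python
from collections import Counter

def category_percentages(labels):
    """Map each category label to its share of `labels`, in percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100.0 * n / total for category, n in counts.items()}

def render_data_card(labels_by_dataset):
    """Render a Markdown table: one row per category, one column per dataset."""
    shares = {
        name: category_percentages(labels)
        for name, labels in labels_by_dataset.items()
    }
    names = list(shares)
    # Union of all categories seen in any dataset, in alphabetical order.
    categories = sorted({cat for per_dataset in shares.values() for cat in per_dataset})
    lines = [
        "| Category | " + " | ".join(names) + " |",
        "|---" * (len(names) + 1) + "|",
    ]
    for category in categories:
        # A category missing from a dataset is reported as 0.0%.
        cells = [f"{shares[name].get(category, 0.0):.1f}%" for name in names]
        lines.append(f"| {category} | " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

For example, `render_data_card({"BookCorpusOpen": book_labels, "Wikipedia": wiki_labels})` would return a table with one percentage column per dataset, where `book_labels` and `wiki_labels` are lists of labels you supply.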