# Exploring Pretraining Datasets

## C4

C4 is a massive dataset used to pre-train T5. In our pre-training setting we use it only as a regularizer, so that the model does not forget how to understand natural language.

Download link: https://huggingface.co/datasets/allenai/c4/tree/main

In [1]:
import pandas as pd
import json
import random
import gzip
import glob
import numpy as np

from tqdm.notebook import tqdm
from sentence_splitter import split_text_into_sentences
from nltk.tokenize import word_tokenize

### Reading

In [2]:
with open('../storage/datasets/c4/c4_data_original.json') as f:
    c4_original = [json.loads(line)['text'] for line in f]
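The cell above loads every document into memory from a single JSON-lines file. As a minimal sketch, assuming the data is instead kept as gzipped JSON-lines shards with a `text` field per line, the texts could be streamed one at a time. `iter_texts` is a hypothetical helper name, not part of the original notebook.

```python
import gzip
import json
from typing import Iterator


def iter_texts(path: str, field: str = "text") -> Iterator[str]:
    """Yield one text field per JSON line, from a plain or gzipped file.

    Hypothetical helper: the opener is chosen from the file extension.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)[field]
```

Because this is a generator, it can be passed directly to a loop or wrapped in `list(...)` when the whole file is needed at once.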

### Processing

* Removing newlines
* Sentence splitting
* Shuffling

In [3]:
# Remove newlines
c4 = [text.replace('\n', ' ') for text in c4_original]

# Sentence split
c4_sentences = []
for text in tqdm(c4):
    sents = split_text_into_sentences(text, language='en')
    c4_sentences.extend(sents)

random.shuffle(c4_sentences)

del c4

  0%|          | 0/356317 [00:00<?, ?it/s]
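The shuffle above is unseeded, so each run produces a different sentence order. The three processing steps can be sketched in a repeatable form as follows. This is an illustration only: `naive_split` is a crude regex stand-in for `split_text_into_sentences`, and `preprocess` and its `seed` parameter are hypothetical names.

```python
import random
import re
from typing import List


def naive_split(text: str) -> List[str]:
    """Crude stand-in for a real sentence splitter: break after ., ! or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def preprocess(texts: List[str], seed: int = 0) -> List[str]:
    """Remove newlines, split into sentences, then shuffle repeatably."""
    sentences = []
    for text in texts:
        sentences.extend(naive_split(text.replace("\n", " ")))
    # A dedicated, seeded generator makes the shuffle repeatable
    # without touching the global random state.
    random.Random(seed).shuffle(sentences)
    return sentences
```

Calling `preprocess` twice with the same `seed` returns the sentences in the same order, which helps when comparing pre-training runs.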

## WDC


The WDC (Web Data Commons) Web Table Corpus is a collection of tables crawled from the web. We use the 2015 release linked below.

Download link: http://webdatacommons.org/webtables/2015/downloadInstructions.html

### Reading

In [2]:
table_paths = glob.glob('../storage/datasets/wdc/original/1438042981460.12/warc/*')

def get_tables():
    """Lazily yield one parsed table per JSON line from the first 500 files."""
    for table_path in table_paths[:500]:
        # Binary mode: json.loads accepts bytes and decodes them itself
        with gzip.open(table_path, 'rb') as f:
            for line in f:
                try:
                    yield json.loads(line)
                except UnicodeDecodeError:
                    # Skip lines that cannot be decoded
                    continue

### Filtering

In [3]:
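def is_english(table):
    # Rough heuristic: keep URLs that contain one of these domain suffixes.
    # This is a substring match on the URL, not a language check on the content.
    return any(domain in table['url'] for domain in ['.com', '.eu', '.uk', '.net', '.org'])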

def has_header(table):
    return table['hasHeader']

def is_not_empty(table):
    return len(table['relation']) >= 2

def is_not_huge(table):
    return len(table['relation']) < 50 and len(table['relation'][0]) < 15

def has_title_or_page_title(table):
    return table['title'] != '' or table['pageTitle'] != ''

def title_is_not_huge(table):
    # Accept the table if either title is non-empty and shorter than 5 words
    return any(table[key] != '' and len(table[key].split()) < 5
               for key in ('title', 'pageTitle'))

filtered_tables = [table for table in get_tables() 
                  if has_header(table) and 
                  is_not_empty(table) and 
                  is_english(table) and
                  is_not_huge(table) and
                  has_title_or_page_title(table) and
                  title_is_not_huge(table)]

print(len(filtered_tables))

122124
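The combined filter reports only how many tables survive. To see which filter removes the most tables, the first failing filter can be tallied per table. The sketch below uses toy tables and only two of the filters above. `filter_with_report` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter


def has_header(table):
    return table["hasHeader"]


def is_not_empty(table):
    return len(table["relation"]) >= 2


def filter_with_report(tables, filters):
    """Apply filters in order; record which filter rejected each table first."""
    kept = []
    rejected_by = Counter()
    for table in tables:
        for check in filters:
            if not check(table):
                rejected_by[check.__name__] += 1
                break
        else:
            # No filter rejected the table
            kept.append(table)
    return kept, rejected_by


# Toy data for illustration only
toy_tables = [
    {"hasHeader": True, "relation": [["a", "b"], ["1", "2"]]},
    {"hasHeader": False, "relation": [["a", "b"], ["1", "2"]]},
    {"hasHeader": True, "relation": [["a", "b"]]},
]
kept, rejected_by = filter_with_report(toy_tables, [has_header, is_not_empty])
print(len(kept), dict(rejected_by))
```

Since a table is attributed only to the first filter it fails, the counts depend on the order in which the filters are listed.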


## Analysis of TextBefore and TextAfter

In [26]:
def calculate_token_overlap(text, row):
    text_tokens = word_tokenize(text)
    
    # Require at least 8 tokens so that the text resembles real prose
    if len(row) == 0 or len(text_tokens) < 8:
        return 0
    
    token_set = set(text_tokens)
    row_set = set(row)
    
    return len(token_set.intersection(row_set)) / min(len(text_tokens), len(row_set))


def calculate_row_overlaps(table, text_position="textBeforeTable"):
    return [calculate_token_overlap(table[text_position], row) 
            for row in table['relation'][1:]]  # Skip the header


def calculate_dataset_overlaps(tables):
    before_table_max = []
    after_table_max = []
    
    for table in tqdm(tables):
        before_table_max.append(max(calculate_row_overlaps(table, text_position="textBeforeTable")))
        after_table_max.append(max(calculate_row_overlaps(table, text_position="textAfterTable")))

    return before_table_max, after_table_max


with open('../storage/datasets/wdc/filtered/1438042981525.10.json', 'r') as inp:
    filtered_tables = json.load(inp)

before_overlaps, after_overlaps = calculate_dataset_overlaps(filtered_tables)

  0%|          | 0/37056 [00:00<?, ?it/s]
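To make the overlap score concrete, here is a small worked example. It follows the same formula as `calculate_token_overlap`, but as a simplifying assumption it tokenizes on whitespace instead of calling `word_tokenize`, so it runs without NLTK. The text and row are made up for illustration.

```python
def token_overlap(text, row, min_tokens=8):
    """Shared distinct tokens divided by min(token count, distinct row cells)."""
    tokens = text.split()  # assumption: whitespace split instead of NLTK
    if len(row) == 0 or len(tokens) < min_tokens:
        return 0.0
    shared = set(tokens) & set(row)
    return len(shared) / min(len(tokens), len(set(row)))


# Made-up example: 10 tokens in the text, 4 distinct cells in the row
text = "W: Valdez 5.0 innings with 6.0 hits and 2.0 runs"
row = ["IP", "5.0", "1.0", "2.0", "1.0"]

# "5.0" and "2.0" appear in both, so the score is 2 / min(10, 4) = 0.5
print(token_overlap(text, row))

# Texts shorter than 8 tokens always score 0
print(token_overlap("too short", row))
```

Note that the denominator is the smaller of the two sizes, so a short row whose cells all appear in a long text scores 1.0 regardless of how much unrelated text surrounds them.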

In [27]:
before_overlaps = np.array(before_overlaps)
after_overlaps = np.array(after_overlaps)

## How many tables have at least one row whose overlap exceeds the threshold?
thresh = 0.4
print(f"More than {thresh} | TextBefore: {np.sum(before_overlaps > thresh) / len(before_overlaps)}")
print(f"More than {thresh} | TextAfter: {np.sum(after_overlaps > thresh) / len(after_overlaps)}")

## What is the average overlap of the text before and the text after?
print(f"Average overlap | TextBefore: {np.mean(before_overlaps)}")
print(f"Average overlap | TextAfter: {np.mean(after_overlaps)}")

More than 0.4 | TextBefore: 0.15104166666666666
More than 0.4 | TextAfter: 0.40282275474956825
Average overlap | TextBefore: 0.14498885856155933
Average overlap | TextAfter: 0.3959055534994654


In [30]:
before_overlaps = np.array(before_overlaps)
after_overlaps = np.array(after_overlaps)

thresh = 0.4
after_inds = np.where(after_overlaps > thresh)[0]
before_inds = np.where(before_overlaps > thresh)[0]


for i, ind in enumerate(before_inds):
    table = pd.DataFrame(filtered_tables[ind]['relation'][1:], columns=filtered_tables[ind]['relation'][0])
    print(filtered_tables[ind]['textBeforeTable'])
    display(table)
    
    print("-" * 90)
    
    if i > 20:
        break
    

Artist's attributes Catálogos Colecciónes Instituciones Representantes Exposiciones Colectivas Exposiciones Individuales Datos Biográficos Análisis Perfil Fila (2015): Simeon Ignatov , Already a member? Login now Buy a membership You will only see limited information for the artist until you purchase a membership. Full Access to this page is restricted to members of ArtFacts.Net. Artistas Exposiciones Instituciones Noticias Instituciones Exposiciones Artistas Contacto Productos y Precios ¿Quiénes somos? Identificación Saber más Acceso gratuito Italiano Français Español English Deutsch ArtFacts.net Please enable JavaScript to view this page correctly


Unnamed: 0,Artistas,Manoela Ignatova,Eleonora Hadzhinikolova,Velizar Dimchev,Plamen Assenov,Alexander Tsanev
0,Fila,,,,,
1,№,1.0,1.0,1.0,1.0,1.0


------------------------------------------------------------------------------------------
Saraperos de Saltillo HR: MXO: Heras 2 (10), Sandoval, Jo 1 (1) W: Valdez, Ro (W, 2-0, 6.16) ; L: Guerrero, D (L, 4-2, 5.30) 0 13 10 x 0 2 0 2 1 0 1 4 Mexico 0 11 3 0 0 1 0 1 0 0 1 0 Saltillo E H R 9 8 7 6 5 4 3 2 1 Final Mexican League (AAA) --Major League Baseball-- American League (MAJ) National League (MAJ) --Triple-A-- International League (AAA) Mexican League (AAA) Pacific Coast League (AAA) --Double-A-- Eastern League (AA) Southern League (AA) Texas League (AA) --High Class A-- California


Unnamed: 0,Player,"Valdez, Ro (W, 2-0)","Reyes, D","Castaneda, F","Sandoval, Ju"
0,IP,5.0,1.0,2.0,1.0
1,H,6.0,3.0,1.0,1.0
2,R,2.0,1.0,0.0,0.0
3,ER,2.0,1.0,0.0,0.0
4,BB,3.0,0.0,0.0,0.0
5,SO,2.0,2.0,3.0,0.0
6,HR,0.0,0.0,0.0,0.0
7,ERA,6.16,4.91,6.0,4.95


------------------------------------------------------------------------------------------
Midland RockHounds HR: MRO: Carter 1 (), Everidge 1 () W: Deduno (W, 6-1, 2.92) ; L: Cramer (L, 2-3, 4.74) 0 12 8 x 1 0 0 0 0 0 0 7 Tulsa 4 9 3 0 1 0 0 0 0 2 0 0 Midland E H R 9 8 7 6 5 4 3 2 1 Final Texas League (AA) --Major League Baseball-- American League (MAJ) National League (MAJ) --Triple-A-- International League (AAA) Mexican League (AAA) Pacific Coast League (AAA) --Double-A-- Eastern League (AA) Southern League (AA) Texas League (AA) --High Class A-- California League (HiA)


Unnamed: 0,Player,"Cramer (L, 2-3)",Heuser,Hunton
0,IP,5.0,2.0,1.0
1,H,9.0,2.0,1.0
2,R,7.0,0.0,1.0
3,ER,5.0,0.0,1.0
4,BB,2.0,1.0,0.0
5,SO,6.0,3.0,0.0
6,HR,0.0,0.0,0.0
7,ERA,4.74,5.24,2.25


------------------------------------------------------------------------------------------
Midland RockHounds HR: MRO: Carter 1 (), Everidge 1 () W: Deduno (W, 6-1, 2.92) ; L: Cramer (L, 2-3, 4.74) 0 12 8 x 1 0 0 0 0 0 0 7 Tulsa 4 9 3 0 1 0 0 0 0 2 0 0 Midland E H R 9 8 7 6 5 4 3 2 1 Final Texas League (AA) --Major League Baseball-- American League (MAJ) National League (MAJ) --Triple-A-- International League (AAA) Mexican League (AAA) Pacific Coast League (AAA) --Double-A-- Eastern League (AA) Southern League (AA) Texas League (AA) --High Class A-- California League (HiA)


Unnamed: 0,Player,"Deduno (W, 6-1)",Cedeno
0,IP,7.0,2.0
1,H,7.0,2.0
2,R,2.0,1.0
3,ER,2.0,1.0
4,BB,2.0,1.0
5,SO,5.0,1.0
6,HR,1.0,1.0
7,ERA,2.92,4.03


------------------------------------------------------------------------------------------
Corpus Christi Hooks HR: COR: Heras, L 1 (1) W: Cruz, L (W, 2-0, 0.56) ; L: Geer (L, 8-5, 3.12) 0 2 0 0 0 0 0 0 0 0 0 0 San Antonio 0 11 5 1 1 0 0 0 0 3 0 0 Corpus Christi E H R 9 8 7 6 5 4 3 2 1 Final Texas League (AA) --Major League Baseball-- American League (MAJ) National League (MAJ) --Triple-A-- International League (AAA) Mexican League (AAA) Pacific Coast League (AAA) --Double-A-- Eastern League (AA) Southern League (AA) Texas League (AA) --High Class A-- California League


Unnamed: 0,Player,"Geer (L, 8-5)","Campos, L",Zavada,McBryde
0,IP,6.0,1.0,1.0,1.0
1,H,5.0,1.0,3.0,2.0
2,R,3.0,0.0,1.0,1.0
3,ER,3.0,0.0,1.0,1.0
4,BB,1.0,0.0,0.0,0.0
5,SO,4.0,1.0,1.0,2.0
6,HR,1.0,0.0,0.0,0.0
7,ERA,3.12,0.65,2.41,1.99


------------------------------------------------------------------------------------------
Lehigh Valley IronPigs W: Abad, F (W, 1-0, 0.00) ; L: Stutes (L, 0-1, 3.18) 0 7 3 x 1 0 0 0 0 0 0 2 Syracuse 3 7 2 x 0 1 0 1 0 0 0 0 Lehigh Valley E H R 9 8 7 6 5 4 3 2 1 Final International League (AAA) --Major League Baseball-- American League (MAJ) National League (MAJ) --Triple-A-- International League (AAA) Mexican League (AAA) Pacific Coast League (AAA) --Double-A-- Eastern League (AA) Southern League (AA) Texas League (AA) --High Class A-- California League (HiA) Carolina League (HiA) Florida State


Unnamed: 0,Player,Rosenberg,"Stutes (L, 0-1)"
0,IP,6.0,1.2
1,H,5.0,2.0
2,R,2.0,1.0
3,ER,1.0,1.0
4,BB,1.0,2.0
5,SO,4.0,1.0
6,HR,0.0,0.0
7,ERA,6.0,3.18


------------------------------------------------------------------------------------------
Lehigh Valley IronPigs W: Abad, F (W, 1-0, 0.00) ; L: Stutes (L, 0-1, 3.18) 0 7 3 x 1 0 0 0 0 0 0 2 Syracuse 3 7 2 x 0 1 0 1 0 0 0 0 Lehigh Valley E H R 9 8 7 6 5 4 3 2 1 Final International League (AAA) --Major League Baseball-- American League (MAJ) National League (MAJ) --Triple-A-- International League (AAA) Mexican League (AAA) Pacific Coast League (AAA) --Double-A-- Eastern League (AA) Southern League (AA) Texas League (AA) --High Class A-- California League (HiA) Carolina League (HiA) Florida State


Unnamed: 0,Player,Perry,"Romero, J (H, 1)","Davis, E (BS, 1)","Abad, F (W, 1-0)"
0,IP,5.0,1.0,1.0,1.0
1,H,4.0,1.0,2.0,0.0
2,R,1.0,0.0,1.0,0.0
3,ER,1.0,0.0,1.0,0.0
4,BB,0.0,0.0,0.0,0.0
5,SO,4.0,0.0,2.0,0.0
6,HR,0.0,0.0,0.0,0.0
7,ERA,6.3,0.0,2.25,0.0


------------------------------------------------------------------------------------------
Coming soon... Start xChange Mint 1 jbvintage Start xChange Near Mint/Mint 1 piyn   Condition Qty Username No users have to trade for this. Wants Haves For Sale For Trade Goudey Cincinnati Reds Adams, Sparky , Bottomley, Jim , Comorosky, Adam A. , Piet, Tony 1935 Goudey 4-in-1 Related Marketplace Products View All


Unnamed: 0,Username,piyn,jbvintage
0,Qty,1,1
1,Condition,Near Mint/Mint,Mint
2,,Start xChange,Start xChange


------------------------------------------------------------------------------------------
Coming soon... View All » Start xChange Mint 1 gray3 Start xChange Near Mint/Mint 1 bghoosier Start xChange Mint 1 tylern1 Start xChange Near Mint/Mint 1 gjhaedicke   Condition Qty Username view all » Start xChange Near Mint/Mint 9 lmcgown Start xChange Mint 1 dabears525 Start xChange Mint 1 shanedogg22 Start xChange Mint 1 franrose   Condition Qty Username Wants Haves For Sale For Trade Topps Allen and Ginter


Unnamed: 0,Username,gjhaedicke,tylern1,bghoosier,gray3,bblenk,Orioles12
0,Qty,1,1,1,1,1,1
1,Condition,Near Mint/Mint,Mint,Near Mint/Mint,Mint,Near Mint/Mint,Near Mint/Mint
2,,Start xChange,Start xChange,Start xChange,Start xChange,Start xChange,Start xChange


------------------------------------------------------------------------------------------
Run Time: 0.000309 Params: 1 WHERE permission_combination_id = ? FROM xf_permission_combination SELECT cache_value   Queries (9, time: 0.0080s, 2.4%) Memory: 6.8624 MB (Peak: 7.3214 MB) Page Time: 0.3353s


Unnamed: 0,Select Type,SIMPLE,SIMPLE.1,SIMPLE.2,SIMPLE.3,SIMPLE.4,SIMPLE.5
0,Table,user,user_profile,user_option,user_privacy,permission_combination,session_activity
1,Type,const,const,const,const,const,const
2,Possible Keys,PRIMARY,PRIMARY,PRIMARY,PRIMARY,PRIMARY,PRIMARY
3,Key,PRIMARY,PRIMARY,PRIMARY,PRIMARY,PRIMARY,
4,Key Len,4,4,4,4,4,
5,Ref,const,const,const,const,const,
6,Rows,1,1,1,1,1,1
7,Extra,,,,,,Impossible ON condition


------------------------------------------------------------------------------------------
Run Time: 0.000309 Params: 1 WHERE permission_combination_id = ? FROM xf_permission_combination SELECT cache_value   Queries (9, time: 0.0080s, 2.4%) Memory: 6.8624 MB (Peak: 7.3214 MB) Page Time: 0.3353s


Unnamed: 0,Select Type,SIMPLE,SIMPLE.1,SIMPLE.2,SIMPLE.3
0,Table,user_follow,user,user_option,user_profile
1,Type,ref,eq_ref,eq_ref,eq_ref
2,Possible Keys,"PRIMARY,follow_user_id",PRIMARY,PRIMARY,PRIMARY
3,Key,follow_user_id,PRIMARY,PRIMARY,PRIMARY
4,Key Len,4,4,4,4
5,Ref,const,dancefo2_xeno2.user_follow.user_id,dancefo2_xeno2.user.user_id,dancefo2_xeno2.user_follow.user_id
6,Rows,1,1,1,1
7,Extra,Using index; Using temporary; Using filesort,Using where,Using where,


------------------------------------------------------------------------------------------
(05/05/1984) Christopher BIRCHALL Trinidad and Tobago } } $j("#expColl").attr("src", "/imgml/icons/expand.gif"); $j("#playerSearchForm").attr("style","height: 26px;"); } else { $j("#expColl").attr("src", "/imgml/icons/collapse.gif"); $j("#playerSearchForm").attr("style","height: 136px;"); if($j("#expColl").attr("src").indexOf("/imgml/icons/expand.gif") >= 0) { $j("#advS").toggle(); { function toggleAdvanced() checkLoading() } } } } } break; rd[i].checked='true'; { if(rd[i].value== cp) { for(var i in rd) var rd = document.getElementsByName('cp'); if(!cp){cp='s';} $j('#pn').val(pn); { if(pn) var cp =qs.cp var pn =qs.pn.replace('+',' '); var qs =location.search.substring(1).toQueryParams(); { if(location.search.length>0) { function checkLoading() } } if(t.length >=3){return true;}else{return false;} { function checkChars(t) } alert('Please insert at least three characters in the search field'); { fu

Unnamed: 0,Tournaments,FIFA World Cup™ Final,FIFA World Cup™ Qualifier
0,Editions,2006,"2010, 2014"
1,MP,3,21
2,W,0,11
3,D,1,3
4,L,2,7
5,GF,0,1
6,GA,0,0
7,Y,0,4
8,2YC,0,0
9,R,0,0


------------------------------------------------------------------------------------------
Waltz, tango, fox trot and cha cha! Learn basic ballroom dance with your favorite partner. Pre-registration required. 105204 - Ballroom Dance Showing: 1 to 15 Total Results: 15 Activity Search Results Enroll Now Enroll Now All Instructors All Instructors Bishop Steven Walker Ray Availability Option: All Classes Only Available Classes Available Nonresident Classes Time Blocks: Grade: Unspecified Pre-K (0-3) Pre-K (3-5) Kinder 1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th 12th Age: All Ages 3 months 6 months 9 months Adult Senior Begin Month: All Months January February March April May June July August September October November December Gender: Co-ed Male


Unnamed: 0,Cart Icon,Read Notice,Read Notice.1
0,Activity,105212-01,105212-02
1,Description,Tappercise,Tappercise
2,Dates,09/02/15 - 10/07/15,10/14/15 - 11/18/15
3,Time,5:30P - 6:15P,5:30P - 6:15P
4,Days,W,W
5,Location,Waters-Moss,Waters-Moss
6,Fees,$35,$35
7,Ages,16 years and Up,16 years and Up
8,Status,Unavailable,Unavailable
9,Single Icon,Item Details,Item Details


------------------------------------------------------------------------------------------
Waltz, tango, fox trot and cha cha! Learn basic ballroom dance with your favorite partner. Pre-registration required. 105204 - Ballroom Dance Showing: 1 to 15 Total Results: 15 Activity Search Results Enroll Now Enroll Now All Instructors All Instructors Bishop Steven Walker Ray Availability Option: All Classes Only Available Classes Available Nonresident Classes Time Blocks: Grade: Unspecified Pre-K (0-3) Pre-K (3-5) Kinder 1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th 12th Age: All Ages 3 months 6 months 9 months Adult Senior Begin Month: All Months January February March April May June July August September October November December Gender: Co-ed Male


Unnamed: 0,Cart Icon,Read Notice,Read Notice.1,Read Notice.2
0,Activity,105300-01,105300-02,105300-03
1,Description,Clogging,Clogging,Clogging
2,Dates,09/08/15 - 09/29/15,10/06/15 - 10/27/15,11/03/15 - 12/01/15
3,Time,6:00P - 7:00P,6:00P - 7:00P,6:00P - 7:00P
4,Days,Tu,Tu,Tu
5,Location,Waters-Moss,Waters-Moss,Waters-Moss
6,Fees,$30,$30,$30
7,Ages,10 years and Up,10 years and Up,10 years and Up
8,Status,Unavailable,Unavailable,Unavailable
9,Single Icon,Item Details,Item Details,Item Details


------------------------------------------------------------------------------------------
Waltz, tango, fox trot and cha cha! Learn basic ballroom dance with your favorite partner. Pre-registration required. 105204 - Ballroom Dance Showing: 1 to 15 Total Results: 15 Activity Search Results Enroll Now Enroll Now All Instructors All Instructors Bishop Steven Walker Ray Availability Option: All Classes Only Available Classes Available Nonresident Classes Time Blocks: Grade: Unspecified Pre-K (0-3) Pre-K (3-5) Kinder 1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th 12th Age: All Ages 3 months 6 months 9 months Adult Senior Begin Month: All Months January February March April May June July August September October November December Gender: Co-ed Male


Unnamed: 0,Cart Icon,Read Notice,Read Notice.1
0,Activity,107703-01,107703-02
1,Description,Toddler Tumble Tots,Toddler Tumble Tots
2,Dates,09/03/15 - 09/17/15,09/24/15 - 10/08/15
3,Time,6:30P - 7:00P,6:30P - 7:00P
4,Days,Th,Th
5,Location,Waters-Moss,Waters-Moss
6,Fees,$35,$35
7,Ages,2 years to under 4 years,2 years to under 4 years
8,Status,Unavailable,Unavailable
9,Single Icon,Item Details,Item Details


------------------------------------------------------------------------------------------
Waltz, tango, fox trot and cha cha! Learn basic ballroom dance with your favorite partner. Pre-registration required. 105204 - Ballroom Dance Showing: 1 to 15 Total Results: 15 Activity Search Results Enroll Now Enroll Now All Instructors All Instructors Bishop Steven Walker Ray Availability Option: All Classes Only Available Classes Available Nonresident Classes Time Blocks: Grade: Unspecified Pre-K (0-3) Pre-K (3-5) Kinder 1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th 12th Age: All Ages 3 months 6 months 9 months Adult Senior Begin Month: All Months January February March April May June July August September October November December Gender: Co-ed Male


Unnamed: 0,Cart Icon,Read Notice,Read Notice.1
0,Activity,107704-01,107704-02
1,Description,Toddler Jazz & Ballet,"Toddler Jazz, Ballet"
2,Dates,09/03/15 - 09/17/15,09/24/15 - 10/08/15
3,Time,6:00P - 6:30P,6:00P - 6:30P
4,Days,Th,Th
5,Location,Waters-Moss,Waters-Moss
6,Fees,$35,$35
7,Ages,2 years to under 4 years,2 years to under 4 years
8,Status,Unavailable,Unavailable
9,Single Icon,Item Details,Item Details


------------------------------------------------------------------------------------------
Harvard 3 Boston College 2 Boston College at Harvard Preview Box Score | Recap | Video | }); }); $(this).toggleClass('active').next('ul').toggleClass('active'); $('.secondary-nav > h1').click(function() { $(function() { })(jQuery); }); }); $('.has-submenu.active').removeClass('active'); $(document).on('click', function(e) { }); return false; $(this).parent().toggleClass('active'); $('.has-submenu > a', '#global-nav').on('click', function() { }); return false; $('#global-nav').removeClass('active'); $('#jump-to-global-nav').removeClass('active'); $('#global-nav .close').click(function(e) { }); $('#global-nav').toggleClass('active'); $(this).toggleClass('active'); e.preventDefault(); $('#jump-to-global-nav').click(function(e) { $(function() { (function($) { 9/21/2014 at 2:00 pm @ Cambridge, Mass. (Malkin Athletic Cent)


Unnamed: 0,Final,Boston College (5-6),Harvard (6-2)
0,1,23,25
1,2,24,26
2,3,25,13
3,4,25,23
4,5,14,16
5,Score,2,3


------------------------------------------------------------------------------------------
35 5 4 3 4 3 4 4 4 4 Par 2866 480 361 122 343 155 321 300 382 402 Red tee 2974 500 361 142 349 155 383 300 382 402 Gold tee 3050 500 361 164 349 175 383 334 382 402 White tee 3368 542 387 198 377 217 426 380 405 436 Blue tee Out* 9 8 7 6 5 4 3 2 1 Hole Front Nine 5,830 131 74.7 71 Red tee 6,113 131 75.7 71 Gold tee 6,196 126 70.8 71 White tee 6,834 130 73.2 71 Blue tee Yardage Slope Rating Par Twitter – http://www.pga.com/ Regulation Length Holes – 18 Greens Grass Type – Blue Grass Architect Name – David A. Rainville, ASGCA/Harry Rainville Course Details/History


Unnamed: 0,Hole,Blue tee,White tee,Gold tee,Red tee,Par
0,10,469,435,435,416,4
1,11,377,334,334,319,4
2,12,458,403,403,382,4
3,13,158,147,147,133,3
4,14,526,470,470,419,5
5,15,397,355,355,320,4
6,16,422,389,382,382,4
7,17,129,114,114,114,3
8,18,530,499,499,479,5
9,In*,3466,3146,3139,2964,36


------------------------------------------------------------------------------------------
36 4 4 3 5 4 5 3 4 4 Par 2602 313 263 99 472 337 439 131 260 288 Gold tee 3099 366 352 128 543 390 497 183 320 320 White tee 3308 388 389 142 557 404 560 194 328 346 Blue tee 3506 426 428 152 576 418 569 212 336 389 Black tee Out* 9 8 7 6 5 4 3 2 1 Hole Front Nine 5,251 125 71.3 72 Gold tee 6,144 129 70.4 72 White tee 6,544 133 72.2 72 Blue tee 6,909 138 73.9 72 Black tee Yardage Slope Rating Par Twitter – http://www.pga.com/ Regulation Length Holes – 18 Greens Grass Type – Blue Grass Architect Name – Joel Goldstrand Course Details/History


Unnamed: 0,Hole,Black tee,Blue tee,White tee,Gold tee,Par
0,10,428,400,382,363,4
1,11,311,297,277,248,4
2,12,507,490,476,436,5
3,13,442,412,371,330,4
4,14,421,397,361,309,4
5,15,149,136,116,101,3
6,16,395,382,373,299,4
7,17,190,182,174,117,3
8,18,560,540,515,446,5
9,In*,3403,3236,3045,2649,36


------------------------------------------------------------------------------------------
Join us for a 26-mile loop ride around the City of Columbia. Plan to ride a mix of both soft surface trails, bike lanes and streets with low to medium traffic volume. Participants must have intermediate on-road riding skills. Staff will provide SAG (support and gear) for minor maintenance issues. Please make sure you are self-supported with an inner tube (correct size and valve type) or patch kit and water. Helmets required. Optional: Lunch (on your own) after the ride. Registration suggested. Drop-in participants are welcome. 118107 - Loop the City Ride Showing: 1 to 24 Total Results: 24 Activity Search Results Enroll Now Enroll Now All Instructors All Instructors Bishop Steven Walker Ray Availability Option: All Classes Only Available Classes Available Nonresident Classes Time Blocks: Grade: Unspecified Pre-K (0-3) Pre-K (3-5) Kinder 1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th


Unnamed: 0,Cart Icon,Read Notice,Read Notice.1,Read Notice.2
0,Activity,118300-01,118300-02,118300-03
1,Description,City Cycling,City Cycling,City Cycling
2,Dates,08/29/15 - 08/29/15,09/19/15 - 09/19/15,10/17/15 - 10/17/15
3,Time,9:00A - 2:00P,9:00A - 2:00P,9:00A - 2:00P
4,Days,Sa,Sa,Sa
5,Location,Armory Sports Center,Armory Sports Center,Armory Sports Center
6,Fees,$0,$0,$0
7,Ages,14 years and Up,14 years and Up,14 years and Up
8,Status,Unavailable,Unavailable,Unavailable
9,Single Icon,Item Details,Item Details,Item Details


------------------------------------------------------------------------------------------
Join us for a 26-mile loop ride around the City of Columbia. Plan to ride a mix of both soft surface trails, bike lanes and streets with low to medium traffic volume. Participants must have intermediate on-road riding skills. Staff will provide SAG (support and gear) for minor maintenance issues. Please make sure you are self-supported with an inner tube (correct size and valve type) or patch kit and water. Helmets required. Optional: Lunch (on your own) after the ride. Registration suggested. Drop-in participants are welcome. 118107 - Loop the City Ride Showing: 1 to 24 Total Results: 24 Activity Search Results Enroll Now Enroll Now All Instructors All Instructors Bishop Steven Walker Ray Availability Option: All Classes Only Available Classes Available Nonresident Classes Time Blocks: Grade: Unspecified Pre-K (0-3) Pre-K (3-5) Kinder 1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th


Unnamed: 0,Cart Icon,Read Notice,Read Notice.1,Read Notice.2
0,Activity,118302-01,118302-02,118302-03
1,Description,Bicycle Maintenance at Home,Bicycle Maintenance at Home,Bicycle Maintenance
2,Dates,08/20/15 - 08/20/15,09/10/15 - 09/10/15,10/28/15 - 10/28/15
3,Time,6:00P - 7:30P,6:00P - 7:30P,6:00P - 7:30P
4,Days,Th,Th,W
5,Location,Armory Sports Center,Armory Sports Center,Armory Sports Center
6,Fees,$0,$0,$0
7,Ages,14 years and Up,14 years and Up,14 years and Up
8,Status,Unavailable,Unavailable,Unavailable
9,Single Icon,Item Details,Item Details,Item Details


------------------------------------------------------------------------------------------
Join us for a 26-mile loop ride around the City of Columbia. Plan to ride a mix of both soft surface trails, bike lanes and streets with low to medium traffic volume. Participants must have intermediate on-road riding skills. Staff will provide SAG (support and gear) for minor maintenance issues. Please make sure you are self-supported with an inner tube (correct size and valve type) or patch kit and water. Helmets required. Optional: Lunch (on your own) after the ride. Registration suggested. Drop-in participants are welcome. 118107 - Loop the City Ride Showing: 1 to 24 Total Results: 24 Activity Search Results Enroll Now Enroll Now All Instructors All Instructors Bishop Steven Walker Ray Availability Option: All Classes Only Available Classes Available Nonresident Classes Time Blocks: Grade: Unspecified Pre-K (0-3) Pre-K (3-5) Kinder 1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th


Unnamed: 0,Cart Icon,Read Notice,Read Notice.1,Read Notice.2
0,Activity,118306-01,118306-02,118306-03
1,Description,Fix-a-Flat Class,Fix-a-Flat Class,Fix-a-Flat Class
2,Dates,08/27/15 - 08/27/15,10/13/15 - 10/13/15,11/04/15 - 11/04/15
3,Time,6:00P - 7:30P,6:00P - 7:30P,6:00P - 7:30P
4,Days,Th,Tu,W
5,Location,Armory Sports Center,Armory Sports Center,Armory Sports Center
6,Fees,$0,$0,$0
7,Ages,3 years and Up,3 years and Up,3 years and Up
8,Status,Unavailable,Unavailable,Unavailable
9,Single Icon,Item Details,Item Details,Item Details


------------------------------------------------------------------------------------------


### Check specific cases

In [31]:
# How many tables have 'Cart Icon' as their top-left cell
cart_icon_tables = 0

for table in filtered_tables:
    if table['relation'][0][0] == 'Cart Icon':
        cart_icon_tables += 1

cart_icon_tables

1238
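Rather than checking one value at a time, the most frequent top-left cells can be tallied in a single pass, which may surface other repeated page templates. This is a minimal sketch on toy data. `top_left_cells` is a hypothetical helper name.

```python
from collections import Counter


def top_left_cells(tables, n=10):
    """Return the n most common values of table['relation'][0][0]."""
    counts = Counter(
        table["relation"][0][0]
        for table in tables
        if table["relation"] and table["relation"][0]  # guard against empty tables
    )
    return counts.most_common(n)


# Toy data for illustration only
toy_tables = [
    {"relation": [["Cart Icon", "Activity"], ["Read Notice", "105212-01"]]},
    {"relation": [["Cart Icon", "Activity"], ["Read Notice", "105300-01"]]},
    {"relation": [["Player", "IP"], ["Rosenberg", "6.0"]]},
]
print(top_left_cells(toy_tables, n=2))
```

On the real data this would be called as `top_left_cells(filtered_tables)`.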

## Analysis of Title and PageTitle

In [74]:
has_title = 0
has_page_title = 0
has_both = 0
has_none = 0

for table in filtered_tables:
    if table['title'] != '':
        has_title += 1
    if table['pageTitle'] != '':
        has_page_title += 1
    if table['title'] != '' and table['pageTitle'] != '':
        has_both += 1
    if table['title'] == '' and table['pageTitle'] == '':
        has_none += 1
        
print(f"Has title: {has_title / len(filtered_tables)}")
print(f"Has page title: {has_page_title / len(filtered_tables)}")
print(f"Has both: {has_both / len(filtered_tables)}")
print(f"Has none: {has_none / len(filtered_tables)}")

Has title: 0.22697422292096558
Has page title: 0.9997707248370509
Has both: 0.22674494775801643
Has none: 0.0


In [75]:
# Explore the difference between title and pageTitle
counter = 0
for table in filtered_tables:
    if table['title'] != '':
        print(table)
        counter += 1
        print("-" * 120)
    if counter > 5:
        break
        
print("!" * 240)
        
counter = 0
for table in filtered_tables:
    if table['pageTitle'] != '':
        print(table)
        counter += 1
        print("-" * 120)
    if counter > 5:
        break

{'relation': [['SEASON', '2008', '2009', 'TOTAL'], ['GP', '16', '6', '23'], ['G', '0', '0', '0'], ['A', '0', '0', '0'], ['PTS', '0', '0', '0'], ['SHOTS', '1', '0', '1'], ['SHOT %', '.000', '.000', '.000'], ['SOG', '1', '0', '1'], ['SOG%', '1.000', '.000', '1.000'], ['GW', '0', '0', '0'], ['PK-ATT', '0-0', '0-0', '0-0']], 'pageTitle': 'Brown', 'title': 'Career Statistics', 'url': 'http://brownbears.com/sports/m-soccer/2010-11/bios/smith_ian00.html', 'hasHeader': True, 'headerPosition': 'FIRST_ROW', 'tableType': 'RELATION', 'tableNum': 1, 's3Link': 'common-crawl/crawl-data/CC-MAIN-2015-32/segments/1438042981460.12/warc/CC-MAIN-20150728002301-00160-ip-10-236-191-2.ec2.internal.warc.gz', 'recordEndOffset': 37217939, 'recordOffset': 37209737, 'tableOrientation': 'HORIZONTAL', 'TableContextTimeStampAfterTable': '{26037=Brown University Athletics | 235 Hope St. Box 1932 | Providence, RI 02912, 21585=Before Brown: Transferred from Western Kentucky University where he started as a freshman ... 