# Finding surprisingly Canadian words

In this notebook, we sample random Wikipedia pages and look for ones which we would not expect to have high canadianness. This is mainly to identify unexpected relations.

Wikipedia entries such as 'Ice hockey' and 'Maple leaf' having high canadianness would match our expectations. But can we discover things which have high canadianness even though we would not have expected it at first?

In [1]:
import mwclient

import numpy as np
import pandas as pd

import btb.utils.tools as btbtools
import btb.utils.wikiquery as wq
import btb.utils.userresolver as usr

import wikidat.utils.ipresolver as ipr

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
expEdits = wq.getTotalContributions()
wikiEN = mwclient.Site('en.wikipedia.org')
bots = wq.getAllBots(wikiEN)

In [5]:
store = {}
threshold = 0.5   # 'Surprising' pages should have at least 50% canadianness 
numEntries = 350  # Generate 350 random pages

In [6]:
# And let's load existing results
import pickle
# Pickle files should be opened in binary mode
with open('surprisinglyCanadian.pkl', 'rb') as f:
    store = pickle.load(f)
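On a first run the pickle file does not exist yet, so the load above fails. A minimal sketch of a more forgiving loader, assuming the saved results are a plain dict; the helper name `load_store` is hypothetical and not part of the notebook or its libraries:

```python
import os
import pickle


def load_store(path):
    """Return previously saved results, or an empty dict if none exist yet.

    Hypothetical helper, not part of btb or wikidat.
    """
    if not os.path.exists(path):
        return {}
    # Pickle files should be opened in binary mode.
    with open(path, 'rb') as f:
        return pickle.load(f)


store = load_store('surprisinglyCanadian.pkl')
```

With this in place the sampling loop can start from an empty `store` and pick up where it left off on later runs.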

In [None]:
counter = 0
while len(store)<numEntries:
    # Generate random pages in sets of 10
    randPages = wikiEN.random(namespace=0, limit=10)
    for randPage in randPages:
        ips, usrs, nrevs = wq.getContributionsForPage(wikiEN, randPage['title'])
        knwRevs, conf, nIP, nUsr, nBot, nUnkn = btbtools.prepareData(ips, usrs, bots)
        cmpEdits = btbtools.compareEdits(expEdits, knwRevs)
        
        ca_e, ca_o, ca_m = cmpEdits['CA']
        # Discard pages with low confidence on measure
        if ca_m > threshold and conf > 0.75:
            print(' - %6.4f (%4.2f): %s' % (ca_m, conf, randPage['title']))
            store[randPage['title']] = cmpEdits
        counter = counter + 1

This is the list of pages which we found to be 'surprisingly Canadian'. Each entry shows the canadianness measure, the confidence in parentheses, and the page title. A handful of pages (highlighted in bold) have been manually selected for further investigation because they seem interesting.

 - **0.5740 (0.79): The Little Prince (2015 film)**
 - **0.7840 (0.84): Essen Stadtbahn**
 - **0.7618 (0.77): Pop music in Ukraine**
 - **0.6220 (0.88): 1979 Australian Sports Car Championship**
 - **0.9449 (0.78): Awards of the Ministry for Emergency Situations of Russia**
 - **0.9189 (0.81): 2009–10 Los Angeles Lakers season**
 - 0.5410 (0.81): List of Official Subscription Plays Chart number-one songs of the 2000s
 - 0.5320 (0.79): Celebration Rock
 - 0.9460 (1.00): Kohlhiesel's Daughters
 - 0.9460 (1.00): Antonio Rampoldi
 - 0.6422 (0.95): Bholu (mascot)
 - 0.9460 (0.87): Fluke (album)
 - 0.9362 (0.87): 2009 BDO Canadian Open of Curling
 - 0.8920 (0.81): 1926–27 Detroit Cougars season
 - 0.9460 (1.00): Codroy, Newfoundland and Labrador
 - 0.9417 (0.87): Île aux Coudres Airport
 - 0.8380 (1.00): Shelleyan
 - 0.9100 (1.00): The Girl from Woolworth's
 - 0.6220 (0.78): Cambria County Conservation And Recreation Authority
 - 0.5005 (0.76): Daniele Gaither
 - 0.9169 (0.81): 2006–07 ECHL season
 - 0.9460 (1.00): Silva (film)
 - 0.5140 (1.00): Cricket in Chile
 - 0.9413 (0.93): Bar River Water Aerodrome
 - 0.9229 (0.91): List of Canadian films of 1995
 - 0.6220 (0.78): Children Galore
 - 0.8110 (0.78): Jim Lawrence (baseball)
 - 0.9438 (0.88): Ukrainian Cultural Heritage Village
 - 0.9460 (1.00): Janówek, Gmina Wiskitki
 - 0.8684 (0.76): RAF Middleton St George
 - 0.8785 (0.82): Johann IV
 - 0.9280 (1.00): Homer Floyd
 - 0.8920 (1.00): Samuel B. Chipman
 - 0.7570 (0.78): Beader
 - 0.7192 (0.79): Jethro New
 - 0.7300 (0.77): YBE
 - 0.9460 (0.80): Acalypha wilderi
 - 0.9232 (0.80): Steve Gould (curler)
 - 0.8380 (1.00): Łęgi, Opole Voivodeship
 - 0.6760 (0.86): Luke Dawson
 - 0.9190 (1.00): Burnett Inlet
 - 0.9460 (0.89): March 16 (Eastern Orthodox liturgics)
 - 0.5444 (0.77): Green Angel
 - 0.9300 (0.78): Starfleet Orion
 - 0.9383 (0.80): CCGS Cape Cockburn
 - 0.8482 (0.84): Geronimo Stilton
 - 0.9021 (0.78): Vanessa King
 - 0.7300 (0.83): International Bridge Walk
 - 0.9460 (1.00): 2005 Challenge Bell – Doubles
 - 0.9411 (1.00): Lee Valmassy
 - 0.7030 (0.76): Jone o Grinfilt
 - 0.8110 (0.78): Gaston Leroux (ice hockey)
 - 0.9451 (0.88): CIT du Sud-Ouest
 - 0.9145 (0.90): Icons of Horror Collection - Sam Katzman
 - 0.9190 (0.83): The River of Stars (film)
 - 0.9460 (0.90): 1st Parliament of British Columbia
 - 0.9400 (1.00): Castle Building
 - 0.6220 (0.88): Alonzo Bertram See
 - 0.6760 (1.00): Bigfin snake eel
 - 0.8158 (0.82): Abdi Mohamoud Omar
 - 0.5140 (0.82): Too Hot to Handle (1977 film)
 - 0.8920 (1.00): Isthmian
 - 0.9190 (0.86): Hasan Yalnızoğlu
 - 0.9421 (0.94): 18th General Assembly of Nova Scotia
 - 0.8596 (0.79): Morgan Peck
 - 0.7840 (0.89): Donnell Cameron
 - 0.7840 (0.80): SPSD
 - 0.9111 (0.82): Hans Stüwe
 - 0.9266 (0.78): York Region EMS
 - 0.9460 (1.00): Noria (disambiguation)
 - 0.9280 (0.80): 1971 Calgary Stampeders season
 - 0.7300 (1.00): Biathlon at the 1992 Winter Olympics – Women's relay
 - 0.9421 (0.79): Homewood Airport
 - 0.6580 (0.76): 1963 Buffalo Bills season
 - 0.7840 (1.00): 2001 South African Figure Skating Championships
 - 0.9432 (0.80): 2010 UNAF U-23 Tournament
 - 0.9422 (0.94): Insignificance
 - 0.9244 (1.00): The Count of Cagliostro
 - 0.6220 (1.00): Shane Williams (Australian footballer)
 - 0.9460 (1.00): Irasto Knights
 - 0.5140 (0.90): John Collett Ryland
 - 0.5176 (0.79): Jack Bauer (cyclist)
 - 0.6220 (0.78): Lake West, Dallas
 - 0.9434 (0.77): 2008 Winnipeg Blue Bombers season
 - 0.8920 (1.00): Julius Caesar (1914 film)
 - 0.9424 (0.94): RadioSonic
 - 0.9379 (0.96): Rigoletto... in Bluegrass
 - 0.8380 (1.00): Keith O'Connell
 - 0.9370 (0.93): Steve Orcherton
 - 0.6220 (0.88): Bebop & Beyond
 - 0.8440 (0.81): Survive This
 - 0.5140 (0.82): Jean-Baptiste Jean
 - 0.9460 (0.83): Sinister Street (film)
 - 0.6940 (0.81): Graham Briggs
 - 0.9352 (1.00): Striplin Lone Ranger
 - 0.9451 (0.96): 22nd New Brunswick Legislature
 - 0.9393 (0.84): Vicki Conrad
 - 0.7840 (0.80): Greenfield (Charlotte Court House, Virginia)
 - 0.9442 (0.85): List of Toronto rapid transit stations
 - 0.9460 (1.00): Rotec Panther
 - 0.9309 (0.76): Xtra Vancouver
 - 0.7300 (0.83): First Follett Ministry
 - 0.9340 (0.79): Fergus Shanahan
 - 0.9460 (1.00): V. echinata
 - 0.9460 (0.80): Jim Eglinski
 - 0.9460 (1.00): Tachie
 - 0.9366 (0.90): Hermann Picha
 - 0.9100 (0.83): Toronto blackout
 - 0.7948 (0.76): Wildlife Preservation Canada
 - 0.9352 (0.82): Timiskaming (electoral district)
 - 0.7840 (0.80): Apterichtus australis
 - 0.8110 (1.00): Ron Davies (Western Australian politician)
 - 0.7840 (0.80): University of Tennessee at Nashville
 - 0.9100 (0.77): Harold Danforth
 - 0.9280 (0.80): Trenck (film)
 - 0.9432 (0.85): Paudash Lake
 - 0.8110 (0.78): Gaston Leroux (ice hockey)
 - 0.9447 (0.96): 2009 Quebec Scotties Tournament of Hearts
 - 0.5680 (0.80): Sandi Valentinčič
 - 0.9055 (1.00): The Last Bandit
 - 0.9460 (0.92): CIT Sorel-Varennes
 - 0.7840 (1.00): The Oath of Peter Hergatz
 - 0.9280 (0.80): The Great Impersonation (1935 film)
 - 0.9100 (1.00): Peter Winser
 - 0.5680 (0.89): Glemmtal
 - 0.7840 (0.77): John Cooper (tennis)
 - 0.9460 (0.77): Cudworth Airport
 - 0.9439 (0.78): Corey Chamblin
 - 0.8920 (1.00): Joe Lane Travis
 - 0.6760 (0.86): Carmichael Park, Tingalpa
 - 0.9306 (0.82): Journal Pioneer
 - 0.9460 (0.78): List of people from Sherbrooke
 - 0.9398 (0.76): Royal Vancouver Yacht Club
 - 0.7840 (1.00): Colobocrossa
 - 0.9443 (0.83): Leeds—Grenville
 - 0.9415 (0.84): Brant (provincial electoral district)
 - 0.5680 (0.80): Byam Martin Channel
 - 0.9460 (0.79): Goodsoil Airport
 - 0.9411 (0.86): Meadow Lake Tribal Council (Saskatchewan)
 - 0.6220 (0.88): FIFA Female Player of the Century
 - 0.7300 (0.77): Tim McIntyre
 - 0.6220 (0.88): Fighting Marshal
 - 0.9460 (0.79): Leaf Rapids Water Aerodrome
 - 0.6760 (1.00): Jury Box (game)
 - 0.9445 (0.93): Your Mom's House
 - 0.6760 (0.86): Periprosthetic
 - 0.9244 (0.78): Harry James Barber
 - 0.7720 (0.84): Wisconsin Coach Lines
 - 0.8110 (0.78): WCGB
 - 0.8766 (0.89): Augustus Island
 - 0.8650 (1.00): Flying Legend
 - 0.9409 (0.88): Victoria Poon
 - 0.9244 (0.88): Ben Antao
 - 0.6760 (0.78): Bert Gardiner
 - 0.5140 (1.00): KEAD
 - 0.9460 (1.00): Vortech Skylark
 - 0.9448 (0.85): District of Saskatchewan
 - 0.9460 (1.00): Reindeer Lake/Lindbergh Lodge Aerodrome
 - 0.9460 (1.00): Bluffers (disambiguation)
 - 0.9350 (0.75): Fort Henry, Ontario
 - 0.6220 (0.88): Calvin L. Brown
 - 0.9460 (1.00): Airdrome Morane Saulnier L
 - 0.8920 (0.80): Awada
 - 0.5680 (0.89): Kankakee Community Resource Center
 - 0.9190 (1.00): Peter Rolston
 - 0.8920 (0.86): Wilmot Howard Cole
 - 0.9437 (0.86): Crown Collection
 - 0.8380 (1.00): Peter Thompson (antiquarian)
 - 0.9325 (0.83): Duncan McArthur (Canadian politician)
 - 0.8920 (1.00): VSW
 - 0.9408 (0.94): LIM-49 Nike Zeus
 - 0.9280 (0.89): The Falcon in Danger
 - 0.8164 (0.85): Joe Fiorito
 - 0.8380 (1.00): Bathycongrus parapolyporus
 - 0.9423 (0.84): Louise (2003 film)
 - 0.5005 (0.79): Purgaz
 - 0.8380 (0.90): Les Ramsay
 - 0.7300 (0.83): Kristin Jacobi
 - 0.9460 (0.80): List of population centres in Ontario
 - 0.9460 (0.78): Charlie's Burgers
 - 0.9460 (0.80): Spotted Island Air Station
 - 0.5410 (0.77): Munchies (TV series)
 - 0.7840 (1.00): Sharab
 - 0.9460 (1.00): Fisher Hudson
 - 0.6994 (0.80): Faouzi Chaouchi
 - 0.8920 (1.00): Acromycter nezumi
 - 0.9325 (0.81): Save a Little Sunshine
 - 0.9115 (0.76): Fauna of Canada
 - 0.9460 (1.00): 2013 Challenge Chateau Cartier de Gatineau
 - 0.7300 (0.83): Arjunrao Bharbhare
 - 0.9344 (0.81): Yassine Boukhari
 - 0.8920 (1.00): Bill Lasseter
 - 0.9229 (0.77): Dancing Mad
 - 0.9445 (0.80): Société des designers graphiques du Québec
 - 0.8020 (0.79): Let's Go (Shawn Desman song)
 - 0.9190 (1.00): Pascale Fonteneau
 - 0.9460 (0.83): Saskatoon Silver Springs
 - 0.7300 (0.83): James Dinwiddie
 - 0.9345 (0.77): Departures (TV series)
 - 0.5140 (0.87): Sarah Mayer
 - 0.5503 (0.76): Collie Buddz
 - 0.8700 (0.77): Strange Behaviour
 - 0.9460 (1.00): William Clapham
 - 0.9393 (0.82): Sweet Honey in the Rock: Raise Your Voice
 - 0.8650 (0.83): The Flaw (1955 film)
 - 0.9414 (0.78): Treaty of Versailles (1757)
 - 0.8920 (0.80): John David MacRae
 - 0.9460 (1.00): Howland H-3 Pegasus
 - 0.7030 (0.85): The White Spirit
 - 0.5993 (0.79): Elmopalooza
 - 0.6760 (1.00): Cycling at the 2012 Summer Paralympics – Women's individual pursuit C5
 - 0.7300 (0.83): Thomas Luther Shepherd
 - 0.7715 (0.84): List of American films of 2014
 - 0.9055 (0.88): James Reid (Ontario politician)
 - 0.9460 (1.00): Lepreau Parish, New Brunswick
 - 0.9460 (1.00): Prairie Rose School Division No. 8
 - 0.9190 (1.00): John Montgomery (shipbuilder)
 - 0.7840 (0.92): 1 Razlog
 - 0.9460 (0.93): Anjou (AMT)
 - 0.6760 (0.80): Angie Moretto
 - 0.5140 (0.82): William Henry Gillespie
 - 0.5714 (0.77): Ellen Wong
 - 0.5680 (0.89): Whirley Hall
 - 0.8560 (1.00): Night Birds
 - 0.9424 (0.84): George Arnold (settler)
 - 0.9450 (0.98): Mariusz Linke
 - 0.9055 (0.88): Roy Travers
 - 0.9343 (0.79): Masonville Place
 - 0.7300 (1.00): Altenglan station
 - 0.9460 (0.77): Stirling Aerodrome
 - 0.8110 (0.84): Legendary Assassin
 - 0.9460 (0.88): Conrad, Yukon
 - 0.6220 (0.88): Rowing at the 2002 Asian Games – Men's lightweight double sculls
 - 0.9393 (0.90): Paramount Model 120 Sportster
 - 0.9418 (0.84): Alexander Gibson (industrialist)
 - 0.5140 (0.86): How Green was my Cactus
 - 0.9414 (0.78): Treaty of Versailles (1757)
 - 0.9067 (0.90): Nova Scotia Federation of Labour
 - 0.8866 (0.78): Nothing to Worry About
 - 0.6400 (0.87): Charles Adamu
 - 0.7840 (0.80): Khama
 - 0.9460 (0.86): Bead Game
 - 0.8200 (0.91): Indian Country (disambiguation)
 - 0.9405 (0.91): Overdale, Montreal
 - 0.8920 (1.00): Peculator verconis
 - 0.9460 (1.00): D. laeta
 - 0.9460 (0.82): Kamsack Airport
 - 0.9156 (0.78): Paul Mercier (Bloc Québécois MP)
 - 0.9405 (0.75): Peter Braid
 - 0.8920 (0.80): George Alexander McQuibban
 - 0.7480 (0.78): History of Stonyhurst College
 - 0.9136 (1.00): William Frederick Meyers
 - 0.9460 (0.79): Interprovincial Amateur Hockey Union
 - 0.9244 (0.78): Eberhard Frowein
 - 0.8650 (0.83): Robert Hankinson
 - 0.9460 (0.90): 16th General Assembly of Newfoundland
 - 0.8200 (0.80): 2011 Formula BMW Talent Cup season
 - 0.9460 (0.86): Pimachiowin Aki
 - 0.9377 (0.88): Fugitive in Trieste
 - 0.7840 (1.00): List of football clubs in Sweden – Ö
 - 0.9460 (1.00): Airdrome DeHavilland DH-2
 - 0.7840 (0.80): Old Gaol
 - 0.5140 (0.82): Catholic Order of Foresters
 - 0.9448 (0.99): Greater Sudbury municipal election, 2014
 - 0.9460 (1.00): Fort of Greta (Horta)
 - 0.7975 (0.83): Danny Vasquez
 - 0.9409 (0.80): Christine Magee
 - 0.9456 (0.90): Royal Canadian Air Force Women's Division
 - 0.8650 (0.91): Lucille Lisle
 - 0.6160 (0.78): Air-sea rescue
 - 0.5860 (0.77): Overturn
 - 0.8920 (1.00): Édouard-Charles St-Père
 - 0.8560 (1.00): Ascetoaxinus
 - 0.7075 (0.81): RIT Tigers men's ice hockey
 - 0.9190 (1.00): Carle Bernier-Genest
 - 0.9400 (0.83): Wood Buffalo municipal election, 2007
 - 0.9424 (0.80): Ferland Airport
 - 0.8071 (0.78): Donald Hayes
 - 0.9433 (0.88): Saskatchewan municipal elections, 2006
 - 0.8200 (0.83): Brunnsåkersskolan
 - 0.9400 (0.95): Agape International Missions
 - 0.8920 (1.00): Game on Board
 - 0.5410 (0.79): Stewart Lerman
 - 0.9460 (1.00): Moyes Delta Gliders
 - 0.7300 (0.83): Constituency PB-38 (Mastung)
 - 0.9100 (0.79): Victoire Terminus
 - 0.9440 (0.86): Saskatchewan general election, 1948
 - 0.8920 (0.77): Clear Passage Island
 - 0.9190 (0.90): Margaret Rideout
 - 0.9444 (0.94): 32nd New Brunswick Legislature
 - 0.9460 (1.00): Chester Melanson
 - 0.9257 (0.77): Pulmonary hypoplasia
 - 0.9460 (1.00): Wild Rose School Division No. 66
 - 0.6220 (0.78): The Kaleidoscope
 - 0.9443 (0.82): 2009–10 QMJHL season
 - 0.9352 (1.00): Larry Birkbeck
 - 0.5680 (1.00): Barbara Kuriger
 - 0.9383 (0.80): Walter Davey
 - 0.9100 (1.00): The Charmer (1931 film)
 - 0.6760 (0.86): Dyke Acland Bay
 - 0.9145 (0.76): Hongliutan
 - 0.6760 (1.00): Electoral history of Thomas F. Bayard
 - 0.8920 (0.80): Robert S. Copeland
 - 0.9215 (0.80): John Turnbull (actor)
 - 0.9398 (0.78): RCAF Station Grostenquin
 - 0.6040 (0.76): Frank Rozelaar-Green
 - 0.6490 (0.87): 2013 Milwaukee IndyFest
 - 0.6760 (1.00): Gene Derfler
 - 0.8650 (1.00): Harringay Green Lanes
 - 0.6760 (1.00): Order of Saint Mark
 - 0.8380 (1.00): The Haller Case
 - 0.8650 (0.83): Frank Tennant
 - 0.8380 (0.90): Normandie Heights, Pasadena, California
 - 0.9362 (0.76): Sirius Joyport
 - 0.9460 (0.80): Algoma (provincial electoral district)
 - 0.8380 (1.00): Doto alidrisi
 - 0.7300 (0.83): Kenneth Buckley
 - 0.6760 (0.78): Frederick Roberts (politician)
 - 0.7840 (0.80): John Driscoll (jockey)
 - 0.6040 (0.76): Puppy Love/Sleigh Ride (S Club Juniors song)
 - 0.9460 (1.00): Fyodor Mezentsev
 - 0.9247 (0.79): OutTV
 - 0.9352 (0.92): Heping Road
 - 0.9316 (0.79): Meanings of minor planet names: 78001–79000
 - 0.9423 (0.82): Jenn McGinn
 - 0.9100 (1.00): Charles Alphonse Fournier
 - 0.9400 (0.87): Communities in Bloom
 - 0.8380 (0.84): SFContario
 - 0.9386 (0.82): Stratford Municipal Airport
 - 0.8039 (0.81): Indo-Pak Confederation
 - 0.9240 (0.84): Hayford Hobbs
 - 0.8920 (1.00): Philip Fudge
 - 0.9460 (0.77): Empress/McNeill Spectra Energy Aerodrome
 - 0.5950 (0.86): HIPO model
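
The selection rule that produced this list can be sketched in isolation. The scores below are made-up illustrative values, not results from the notebook, and `is_surprising` is a hypothetical helper name; only the two cut-offs (0.5 for the measure and 0.75 for the confidence) are taken from the sampling loop.

```python
def is_surprising(measure, confidence, threshold=0.5, min_conf=0.75):
    """Keep a page only if both the measure and its confidence are high enough."""
    return measure > threshold and confidence > min_conf


# Hypothetical (measure, confidence) pairs, for illustration only.
candidates = {
    'Page A': (0.92, 0.81),   # kept: both values above their cut-offs
    'Page B': (0.92, 0.60),   # dropped: confidence too low
    'Page C': (0.31, 0.95),   # dropped: measure below threshold
}

kept = sorted(t for t, (m, c) in candidates.items() if is_surprising(m, c))
for title in kept:
    m, c = candidates[title]
    print(' - %6.4f (%4.2f): %s' % (m, c, title))
```

Both comparisons are strict and the printed values are rounded to two decimals, so an entry shown with a confidence of 0.75 presumably had a value slightly above 0.75 before rounding.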

In [97]:
# Save results (binary mode, as required by pickle)
with open('surprisinglyCanadian.pkl', 'wb') as f:
    pickle.dump(store, f)

# A closer look
Here we take a closer look at the pages which we have manually selected.

In [69]:
def makeSummary(wikiEN, title):
    ips, usrs, nrevs = wq.getContributionsForPage(wikiEN, title)
    knwRevs, conf, nIP, nUsr, nBot, nUnkn = btbtools.prepareData(ips, usrs, bots)
    
    ipContribs = btbtools.countContributions(np.unique(ips), ipr.getCountryCode)
    usrLang = lambda user: usr.getUserCountry(user)
    userContribs = btbtools.countContributions(np.unique(usrs), usrLang)

    # Returns a tuple:
    #   number of Canadian revisions
    #   number of revisions (summed over all countries in knwRevs)
    #   number of Canadian sources (IPs + users)
    #   number of sources (unique users + unique IPs)
    knwCA = knwRevs.get('CA', 0)
    sumC = sum(knwRevs.values())
    ipCA = ipContribs.get('CA', 0)
    usrCA = userContribs.get('CA', 0)
    sourcesCA = ipCA + usrCA
    return knwCA, sumC, sourcesCA, len(np.unique(usrs)) + len(np.unique(ips))
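
`makeSummary` distinguishes revisions from sources: a source is one unique IP address or user name, however many revisions it made. The internals of `btbtools.countContributions` are not shown in this notebook, so the following is only a stand-in that illustrates the idea of counting unique contributors per country. The function `count_by_country`, the lookup table and its addresses are all hypothetical.

```python
from collections import Counter


def count_by_country(contributors, resolve_country):
    """Count contributors per country code using the given resolver function."""
    return Counter(resolve_country(c) for c in contributors)


# Invented lookup table standing in for a real IP-to-country resolver.
# The addresses come from ranges reserved for documentation.
fake_geo = {'192.0.2.1': 'CA', '192.0.2.2': 'CA', '198.51.100.7': 'DE'}

ip_counts = count_by_country(sorted(fake_geo), fake_geo.get)
print(ip_counts['CA'], sum(ip_counts.values()))
```

A `Counter` returns 0 for countries that never occur, which would make an explicit check for a missing `'CA'` key unnecessary in this sketch.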

In [90]:
# Selected words
titles = [
    "The Little Prince (2015 film)", 
    "Essen Stadtbahn", 
    "Pop music in Ukraine", 
    "1979 Australian Sports Car Championship", 
    "Awards of the Ministry for Emergency Situations of Russia", 
    u"2009–10 Los Angeles Lakers season" ]

summaries = { title: makeSummary(wikiEN, title) for title in titles }

pd.DataFrame([ (t,ncr,nr,ncs,ns) for t,(ncr,nr,ncs,ns) in summaries.items() ], 
             columns=[ 'Title', 'N_CA revisions', 'N revisions', 'N_CA sources', 'N Sources' ])

|   | Title | N_CA revisions | N revisions | N_CA sources | N Sources |
|---|-------|----------------|-------------|--------------|-----------|
| 0 | Essen Stadtbahn | 4 | 16 | 1 | 6 |
| 1 | Pop music in Ukraine | 39 | 172 | 3 | 83 |
| 2 | The Little Prince (2015 film) | 9 | 71 | 4 | 44 |
| 3 | 2009–10 Los Angeles Lakers season | 273 | 410 | 9 | 155 |
| 4 | Awards of the Ministry for Emergency Situation... | 153 | 156 | 1 | 14 |
| 5 | 1979 Australian Sports Car Championship | 1 | 7 | 1 | 5 |


On closer inspection, these pages either have a very small number of revisions or have their Canadian revisions made by a small number of users. They therefore do not reflect a general Canadian interest in the topic, but rather one Canadian contributor (or a handful) who is very interested in it. In other words, these pages are outliers.
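
One way to make this concentration visible is to divide revisions by sources. The sketch below uses only the figures from the summary table; the variable names are chosen here for illustration.

```python
# (title, Canadian revisions, revisions, Canadian sources, sources),
# copied from the summary table.
rows = [
    ('Essen Stadtbahn', 4, 16, 1, 6),
    ('Pop music in Ukraine', 39, 172, 3, 83),
    ('The Little Prince (2015 film)', 9, 71, 4, 44),
    ('2009–10 Los Angeles Lakers season', 273, 410, 9, 155),
    ('Awards of the Ministry for Emergency Situations of Russia', 153, 156, 1, 14),
    ('1979 Australian Sports Car Championship', 1, 7, 1, 5),
]

# For each page: (revisions per Canadian source, revisions per source overall).
ratios = {}
for title, n_ca_rev, n_rev, n_ca_src, n_src in rows:
    ratios[title] = (n_ca_rev / float(n_ca_src), n_rev / float(n_src))

for title, (per_ca_source, per_source) in ratios.items():
    print('%-58s %7.2f %7.2f' % (title, per_ca_source, per_source))
```

With these figures, the single Canadian source on the Awards page accounts for 153 revisions, and the nine Canadian sources on the Lakers page average about 30 revisions each, compared with fewer than three revisions per source overall on that page.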