# Discovery and Representation of Open Making Related Terms

This notebook sketches an initial exercise in discovering keywords related to open making. The input text is harvested by a web crawler that identifies and crawls semantically related Wikipedia articles.

In [1]:
from utils import tokenizer
import nltk
from nltk import FreqDist
from math import log
import json, csv

## 1. Loading a reference English language corpus

In [2]:
from nltk.corpus import brown
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

## 2. Stop words

### 2.1 Standard stop words

In [3]:
with open("data/stopwords_standard.txt", "r") as f:
    STOP_WORDS_STANDARD = set(f.read().strip().split("\n"))
print(STOP_WORDS_STANDARD)

{"they're", "let's", 'as', "we've", 'itself', 'up', "don't", "he'll", 'about', 'she', 'until', "weren't", 'too', 'at', 'like', 'theirs', "wouldn't", 'yours', "what's", 'most', 'your', 'had', 'there', 'any', 'could', 'it', 'for', "we'll", 'why', 'if', "you'll", 'few', 'whom', "she'll", 'further', 'cannot', 'that', "you've", 'his', 'hers', "wasn't", "i'm", 'those', "she's", 'from', 'an', 'ought', 'during', 'under', 'below', 'me', "haven't", 'he', "they've", 'they', 'not', "it's", 'same', 'having', "you're", 'been', "when's", 'of', 'this', 'on', 'myself', "can't", 'doing', 'r', "we'd", 'before', 'just', 'was', "hadn't", 'in', 'herself', 'to', 'out', 'does', 'so', 'or', "doesn't", 'is', 'after', 'because', 'should', 'down', "she'd", 'then', 'while', 'com', "you'd", 'their', "aren't", 'yourselves', 'over', "they'd", 'them', 'between', 'each', 'our', 'again', "won't", 'you', 'know', "that's", 'off', 'both', "he'd", 'http', 'am', 'nor', 'ourselves', "there's", 'my', 'when', 'who', "couldn't",

### 2.2 Open-making related stop words

In [4]:
with open("data/stopwords_openmaker.txt", "r") as f:
    STOP_WORDS_OPENMAKER = set(f.read().strip().split("\n"))
print(STOP_WORDS_OPENMAKER)

{'may', 'well', 'one', 'often', 'almost', 'also', 'many'}


## 3. Removing stop words from the reference English corpus

In [5]:
# Merge the two stop-word sets.
STOP_WORDS = STOP_WORDS_STANDARD.union(STOP_WORDS_OPENMAKER)
print(STOP_WORDS)

{'as', 'up', 'she', 'until', "weren't", 'too', "wouldn't", 'yours', "what's", 'most', 'there', 'it', 'for', "we'll", 'whom', "she'll", 'further', 'cannot', 'that', 'his', 'hers', 'from', 'ought', 'under', 'he', 'they', "it's", 'having', "you're", 'been', "when's", 'of', "can't", 'doing', "we'd", 'was', "hadn't", 'often', 'to', 'out', 'does', 'so', "doesn't", 'after', 'because', 'then', 'com', "you'd", 'yourselves', "they'd", 'them', 'each', 'again', "won't", "that's", 'off', 'almost', "he'd", 'http', 'am', 'nor', 'ourselves', "there's", 'my', 'when', 'who', 'only', 'did', 'its', 'above', 'and', 'where', 'into', 'also', 'many', 'such', 'would', 'all', 'well', "i've", 'has', "i'd", 'themselves', 'being', "i'll", 'but', 'other', 'were', "hasn't", 'have', 'what', 'through', 'with', 'do', "how's", 'more', "where's", 'once', 'get', "shan't", 'how', 'by', 'own', 'than', 'him', "here's", 'one', "they're", "let's", "we've", 'itself', "don't", "he'll", 'about', 'at', 'like', 'theirs', 'your', 'h

In [6]:
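# Load English words from the Brown corpus, lower-cased, removing stop words.
# Note: the stop-word test is applied to the original-case token, so
# capitalised stop words (e.g. 'The') pass the filter and are counted as 'the'.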
english_freq_dist = FreqDist([w.lower() for w in nltk.corpus.brown.words()
                              if w not in STOP_WORDS])
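
The order of operations in the cell above matters: the stop-word test sees the token as it appears in the corpus, and only afterwards is the token lower-cased. A minimal sketch on made-up tokens (the token list and the two-word stop list are assumptions for illustration, not Brown data) shows how filtering before and after lower-casing differ:

```python
from collections import Counter

# Toy data, assumed for illustration only.
tokens = ["The", "maker", "and", "the", "Maker", "And", "tools"]
stop_words = {"the", "and"}

# Same pattern as the cell above: test the raw token, then lower-case it.
filter_then_lower = Counter(w.lower() for w in tokens if w not in stop_words)

# Alternative: lower-case first, then test against the stop list.
lower_then_filter = Counter(w.lower() for w in tokens if w.lower() not in stop_words)

print(filter_then_lower)
print(lower_then_filter)
```

With the first variant, capitalised stop words survive under their lower-cased form; with the second they are removed. Which behaviour is wanted depends on how the reference counts are used later.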

## 4. Removing the rare words.

Below we remove rare words from the reference distribution. The code keeps only words with an occurrence frequency above 2, i.e. words that appear at least three times.

In [7]:
english_freq_dist = {k:v for k,v in english_freq_dist.items() if v > 2}
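
The threshold is easy to misread, so here is a small sketch of the same filter written as a helper. `drop_rare` and its `min_count` parameter are hypothetical names, not part of the notebook; `min_count=3` corresponds to the `v > 2` condition above.

```python
from collections import Counter

def drop_rare(freq, min_count):
    """Keep only entries whose count is at least min_count."""
    return {word: count for word, count in freq.items() if count >= min_count}

# Toy counts, assumed for illustration only.
toy_freq = Counter({"design": 5, "maker": 3, "kludge": 2, "solder": 1})

kept = drop_rare(toy_freq, min_count=3)  # equivalent to `v > 2`
print(kept)
```

Note that the result is a plain `dict`, as in the cell above, so `FreqDist` methods such as `most_common` are no longer available on it.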

## 5. Loading the input Open Maker corpus

In [8]:
# Load the harvested text from Wikipedia.
with open("data/wikipedia.json", "r") as f: OM_Corpus_text = f.read()
OM_Corpus = json.loads(OM_Corpus_text)

In [9]:
# The total number of wiki articles used:
print(len(OM_Corpus))

152


In [10]:
# Column names of the corpus.
OM_Corpus[0].keys()

dict_keys(['theme.id', 'title', 'url', 'depth', 'text'])
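
Each entry of `OM_Corpus` is a dictionary with the five keys shown above. As a sketch of how such records can be summarised, the snippet below counts pages per theme and per crawl depth. The three sample records are made up for illustration; only the key names come from the notebook.

```python
from collections import Counter

# Hypothetical records that mimic the key layout of OM_Corpus.
sample_corpus = [
    {"theme.id": 0, "title": "Seed page", "url": "https://example.org/seed",
     "depth": 0, "text": "seed text"},
    {"theme.id": 0, "title": "Linked page", "url": "https://example.org/linked",
     "depth": 1, "text": "linked text"},
    {"theme.id": 1, "title": "Other seed", "url": "https://example.org/other",
     "depth": 0, "text": "other text"},
]

pages_per_theme = Counter(page["theme.id"] for page in sample_corpus)
pages_per_depth = Counter(page["depth"] for page in sample_corpus)

print(pages_per_theme)
print(pages_per_depth)
```

Replacing `sample_corpus` with `OM_Corpus` should give the same summary for the harvested data.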

In [11]:
def display_pages(tid):
    meme = [page for page in OM_Corpus if page['theme.id'] == tid]
    for m in meme:
        print(m['depth'],m['title'], m['url'])

In [12]:
display_pages(0)

0 Do it yourself https://en.wikipedia.org/wiki/Do_it_yourself
1 Edupunk https://en.wikipedia.org/wiki/Edupunk
1 Prosumer https://en.wikipedia.org/wiki/Prosumer
1 How-to https://en.wikipedia.org/wiki/How-to
1 Kludge https://en.wikipedia.org/wiki/Kludge
1 Bricolage https://en.wikipedia.org/wiki/Bricolage
1 Junk box https://en.wikipedia.org/wiki/Junk_box
1 Number 8 wire https://en.wikipedia.org/wiki/Number_8_wire
1 Ready-to-assemble furniture https://en.wikipedia.org/wiki/Ready-to-assemble_furniture
1 Open design https://en.wikipedia.org/wiki/Open_Design
1 Hackerspace https://en.wikipedia.org/wiki/Hackerspace
1 Instructables https://en.wikipedia.org/wiki/Instructables
1 Handyman https://en.wikipedia.org/wiki/Handyman
1 Circuit bending https://en.wikipedia.org/wiki/Circuit_bending
1 Project GreenWorld International https://en.wikipedia.org/wiki/Project_GreenOman
1 3D printing https://en.wikipedia.org/wiki/3D_printing


In [13]:
display_pages(1)

0 Open design https://en.wikipedia.org/wiki/Open_design
1 Knowledge commons https://en.wikipedia.org/wiki/Knowledge_commons
1 Open Source Ecology https://en.wikipedia.org/wiki/Open_Source_Ecology
1 Computer-aided design https://en.wikipedia.org/wiki/Computer-aided_design
1 Open Source Initiative https://en.wikipedia.org/wiki/Open_Source_Initiative
1 Open Architecture Network https://en.wikipedia.org/wiki/Open_Architecture_Network
1 Open-source architecture https://en.wikipedia.org/wiki/Open-source_architecture
1 Commons-based peer production https://en.wikipedia.org/wiki/Commons-based_peer_production
1 Open standard https://en.wikipedia.org/wiki/Open_standard
1 OpenCores https://en.wikipedia.org/wiki/OpenCores
1 Co-creation https://en.wikipedia.org/wiki/Co-creation
1 OpenBTS https://en.wikipedia.org/wiki/OpenBTS
1 Open manufacturing https://en.wikipedia.org/wiki/Open_manufacturing
1 Open-source hardware https://en.wikipedia.org/wiki/Open-source_hardware
1 Open source appropriate techno

In [14]:
display_pages(2)

0 Sustainability https://en.wikipedia.org/wiki/Sustainability
1 Sustainability standards and certification https://en.wikipedia.org/wiki/Sustainability_standards_and_certification
1 Appropriate technology https://en.wikipedia.org/wiki/Appropriate_technology
1 Sustainable development https://en.wikipedia.org/wiki/Sustainable_development
1 Environmental issue https://en.wikipedia.org/wiki/Environmental_issue
1 World Cities Summit https://en.wikipedia.org/wiki/World_Cities_Summit
1 Ecopsychology https://en.wikipedia.org/wiki/Ecopsychology
1 Book:Sustainability https://en.wikipedia.org/wiki/Book:Sustainability
1 Sustainable design https://en.wikipedia.org/wiki/Sustainable_design
1 Circles of Sustainability https://en.wikipedia.org/wiki/Circles_of_Sustainability
1 Sustainability science https://en.wikipedia.org/wiki/Sustainability_science
1 Sustainable living https://en.wikipedia.org/wiki/Sustainable_living
1 Index of sustainability articles https://en.wikipedia.org/wiki/List_of_sustainabil

In [15]:
display_pages(3)

0 Maker culture https://en.wikipedia.org/wiki/Maker_culture
1 Modular design https://en.wikipedia.org/wiki/Modular_design
1 Open-source car https://en.wikipedia.org/wiki/Open-source_car
1 Electric vehicle conversion https://en.wikipedia.org/wiki/Electric_vehicle_conversion
1 Thingiverse https://en.wikipedia.org/wiki/Thingiverse
1 Fab lab https://en.wikipedia.org/wiki/Fab_Lab_(fabrication_laboratory)
1 SparkFun Electronics https://en.wikipedia.org/wiki/SparkFun
1 RepRap project https://en.wikipedia.org/wiki/RepRap
1 Distributed manufacturing https://en.wikipedia.org/wiki/Distributed_manufacturing
1 Craft production https://en.wikipedia.org/wiki/Craft_production
1 Autonomous building https://en.wikipedia.org/wiki/Autonomous_building
1 Open-source hardware https://en.wikipedia.org/wiki/Open_source_hardware
1 Kit car https://en.wikipedia.org/wiki/Kit_car


In [16]:
display_pages(4)

0 Innovation https://en.wikipedia.org/wiki/Innovation
1 Competitive intelligence https://en.wikipedia.org/wiki/Creative_competitive_intelligence
1 Multiple discovery https://en.wikipedia.org/wiki/Multiple_discovery
1 UNDP Innovation Facility https://en.wikipedia.org/wiki/UNDP_Innovation_Facility
1 Open Innovations (event) https://en.wikipedia.org/wiki/Open_Innovations_(Forum_and_Technology_Show)
1 Trans-cultural diffusion https://en.wikipedia.org/wiki/Diffusion_(anthropology)
1 Individual capital https://en.wikipedia.org/wiki/Individual_capital
1 Innovation system https://en.wikipedia.org/wiki/Innovation_system
1 Public domain https://en.wikipedia.org/wiki/Public_domain
1 Ingenuity https://en.wikipedia.org/wiki/Ingenuity
1 Sustainable Development Goals https://en.wikipedia.org/wiki/Sustainable_Development_Goals
1 Participatory design https://en.wikipedia.org/wiki/Participatory_design
1 Innovation management https://en.wikipedia.org/wiki/Innovation_management
1 Information revolution ht

In [17]:
display_pages(5)

0 Collaboration https://en.wikipedia.org/wiki/Collaboration
1 Wikinomics https://en.wikipedia.org/wiki/Wikinomics
1 Collaborative editing https://en.wikipedia.org/wiki/Collaborative_editing
1 Telepresence https://en.wikipedia.org/wiki/Telepresence
1 Knowledge management https://en.wikipedia.org/wiki/Knowledge_management
1 The Culture of Collaboration https://en.wikipedia.org/wiki/The_Culture_of_Collaboration
1 Collaborative governance https://en.wikipedia.org/wiki/Collaborative_governance
1 Community film https://en.wikipedia.org/wiki/Community_film
1 Collaborative innovation network https://en.wikipedia.org/wiki/Collaborative_innovation_network
1 Design thinking https://en.wikipedia.org/wiki/Design_thinking
1 Role-based collaboration https://en.wikipedia.org/wiki/Role-based_collaboration
1 Intranet portal https://en.wikipedia.org/wiki/Intranet_portal
1 Critical thinking https://en.wikipedia.org/wiki/Critical_thinking
1 Facilitation (business) https://en.wikipedia.org/wiki/Facilitation

## 6. Analyzing a specific corpus based on a theme

In [18]:
# Note that theme.id 0 corresponds to the "Do it yourself" theme.
input_text = " ".join([page['text'] for page in OM_Corpus if page['theme.id'] == 0])

In [19]:
# Tokenizing the input text:
tokenized = tokenizer.tokenize_words(input_text)
number_of_words = len(tokenized)
print(number_of_words, OM_Corpus[0]['title'])

59758 Do it yourself

### 6.1 Computing the frequency distribution of each token, i.e. word, term, punctuation, etc.

In [20]:
input_freq_dist = FreqDist(tokenized)

In [21]:
input_freq_dist.most_common(20)

[('\n', 5728),
 ('the', 2809),
 ('and', 1980),
 ('of', 1849),
 ('to', 1509),
 ('in', 1199),
 ('a', 1191),
 ('"', 593),
 ('is', 587),
 ('for', 561),
 ('as', 465),
 ('that', 454),
 ('by', 391),
 ('on', 363),
 ('or', 343),
 ('are', 323),
 ('thinking', 323),
 ('with', 310),
 ('design', 295),
 ('be', 277)]

### 6.2 Removing punctuation and stopwords from the input corpus

In [22]:
for stopword in STOP_WORDS:
    if stopword in input_freq_dist:
        del input_freq_dist[stopword]
        
for punctuation in tokenizer.CHARACTERS_TO_SPLIT:
    if punctuation in input_freq_dist:
        del input_freq_dist[punctuation]

# Re-check the most common words after cleaning:
input_freq_dist.most_common(80)

[('thinking', 323),
 ('design', 295),
 ('collaborative', 218),
 ('collaboration', 202),
 ('crowdsourcing', 172),
 ('work', 155),
 ('knowledge', 147),
 ('community', 138),
 ('process', 132),
 ('information', 120),
 ('critical', 110),
 ('learning', 109),
 ('management', 105),
 ('research', 104),
 ('used', 89),
 ('project', 87),
 ('business', 86),
 ('people', 86),
 ('use', 86),
 ('social', 85),
 ('new', 85),
 ('film', 84),
 ('users', 80),
 ('first', 75),
 ('different', 74),
 ('students', 74),
 ('telepresence', 74),
 ('public', 70),
 ('problems', 70),
 ('ideas', 68),
 ('software', 66),
 ('group', 65),
 ('development', 65),
 ('time', 64),
 ('problem', 61),
 ('example', 61),
 ('systems', 61),
 ('conference', 59),
 ('projects', 57),
 ('call', 57),
 ('methods', 56),
 ('skills', 55),
 ('system', 55),
 ('using', 53),
 ('participants', 52),
 ('technology', 51),
 ('world', 51),
 ('together', 51),
 ('online', 51),
 ('help', 50),
 ('tools', 50),
 ('music', 49),
 ('free', 49),
 ('leadership', 48),
 (

### 6.3 Removing rare words from input distribution

In [23]:
input_freq_dist = {k:v for k,v in input_freq_dist.items() if v > 1}

## 7. Comparing input vs English corpus volumes

### 7.1 Total words (after cleaning) 

In [24]:
n_input = sum(input_freq_dist.values())
n_english = sum(english_freq_dist.values())
n_input, n_english

(28583, 679519)

### 7.2 Unique words (after cleaning)

In [25]:
n_unique_word_input = len(input_freq_dist.items())
n_unique_word_brown = len(english_freq_dist.items())
n_unique_word_input, n_unique_word_brown

(3833, 20591)

### 7.3 Cleaned set of input words/terms

The cleaned words of the input corpus are listed below for visual inspection. Such inspections will be used to improve both tokenization and filtering.

In [26]:
input_freq_dist

{'collaboration': 202,
 'uses': 21,
 'see': 22,
 'disambiguation': 2,
 'definition': 18,
 'music': 49,
 'two': 41,
 'artists': 11,
 'featuring': 2,
 'purposeful': 5,
 'relationship': 20,
 'parties': 14,
 'strategically': 2,
 'choose': 5,
 'cooperate': 4,
 'order': 38,
 'achieve': 13,
 'shared': 33,
 'overlapping': 3,
 'objectives': 7,
 'collaborative': 218,
 'leadership': 48,
 'developing': 19,
 'effective': 19,
 'partnerships': 11,
 'communities': 26,
 'schools': 22,
 'rubin': 3,
 'explains': 7,
 'b': 4,
 'voluntary': 5,
 'nature': 15,
 'success': 14,
 'depends': 5,
 "leader's": 2,
 'ability': 26,
 'build': 11,
 'maintain': 5,
 'relationships': 15,
 'similar': 20,
 'closely': 9,
 'cooperation': 9,
 'requires': 16,
 'although': 17,
 'form': 43,
 'social': 85,
 'within': 35,
 'decentralized': 2,
 'group': 65,
 'teams': 14,
 'work': 155,
 'collaboratively': 4,
 'obtain': 8,
 'greater': 13,
 'resources': 30,
 'recognition': 4,
 'reward': 5,
 'facing': 4,
 'competition': 7,
 'structured': 

### 7.4 Set of terms/words that occur in both corpora

In [27]:
common_words = [w for w in input_freq_dist.keys() & english_freq_dist.keys()]
print(len(common_words))

3001
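
The `&` in the cell above works because dictionary key views support set operations. A minimal sketch with toy dictionaries (the words and counts are assumptions for illustration):

```python
# Toy frequency tables, assumed for illustration only.
input_counts = {"design": 4, "hackerspace": 3, "work": 2}
reference_counts = {"design": 10, "work": 50, "river": 7}

shared = input_counts.keys() & reference_counts.keys()      # in both tables
input_only = input_counts.keys() - reference_counts.keys()  # only in the input

print(sorted(shared))
print(sorted(input_only))
```

Both results are sets, so their iteration order is not guaranteed; sorting before printing gives a stable listing.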


In [28]:
for w in common_words: print(w)

credit
19th
agents
gold
heads
analyze
yellow
chosen
belief
evaluating
digital
multitude
chief
though
substantial
john
community
coalition
medicine
discovering
systemic
ross
mark
40
staff
gaining
worked
used
flood
room
pizza
creativity
rigorous
general
front
consists
index
campaign
revenue
actions
report
low
ted
relationships
legislation
without
skills
g
new
higher
two-thirds
organize
canada
economic
saves
armed
de
sold
logical
accept
transactions
stravinsky
investors
transparent
near
share
largest
assisting
joseph
zones
entire
n
route
possibilities
hours
martin
regions
side
user
learn
avoid
courses
items
velocity
convert
initial
instrumental
alongside
emergence
potentially
differently
children's
list
rules
viewed
mob
need
earn
observation
conventional
brief
isolated
london
growth
reconstruct
consensus
innovation
embedded
continuing
drama
co-operative
attempt
american
order
ford
died
arena
event
shown
diversity
sign
something
nature
choices
rational
august
survival
educated
exploration


efficiently
systematic
distinguish
outcome
aside
translations
pool
changing
extend
c
computer
spring
taylor
goes
recent
granted
investigate
eye
led
plans
concerns
nevertheless
lets
drop
pragmatic
least
included
application
balloon
implies
howard
contemporary
associates
diverse
points
government
start
clicked
face
towards
social
sports
proposed
property
awareness
alexander
reading
kingdom
compiled
conducted
puts
reference
established
news
stimulates
showers
comprehensive
thomas
taking
linear
forms
premise
formerly
paris
2
needed
conclusions
turned
employers
global
toronto
defines
cardinal
fields
stars
build
quarters
approach
instruments
mind
professional
rewarded
defense
confused
begin
fund
track
placed
supply
point
gene
identify
step
blurred
believe
necessarily
mean
pianist
types
via
increases
calling
researchers
key
gabriel
shortly
add
learned
file
criteria
service
analysis
explanation
adapted
uncertainty
critic
pipeline
novel
sides
journalism
reviewed
arrangements
distinguishing
invo

### 7.5 Set of terms/words that occur in the sample but not in the reference corpus

TO BE EXAMINED: This set still needs to be incorporated into the analysis. It may capture the specificity of the content to a great extent, so we need to assign a mapping score to each word in it.

In [29]:
input_specifics = dict()
for w in input_freq_dist.keys() - english_freq_dist.keys():
    input_specifics[w] = input_freq_dist[w]
    print(w)

prepaid
atandt
tagging
iws
acm
website
dispositions
ii
tackling
losee
designerly
self-interest
josef
stanford's
open-api
mompou
noncommercial
breakthroughs
chopin
interact
boye
kerry
theorist
inducement
ornithology
stakeholders
gps
inc
pivotal
it-mediated
adaptability
flat-rate
schmidt
rittel
macrowork
sharepoint
app
contexts
critique
learners
popularized
grieg
human-centered
creators
options
donation
formats
cognitive
soft-systems
diy
globally
hierarchies
istockphoto
developers
online
origination
brogeland
crowdsourcing
funded
workspace
facebook
brainstorming
disambiguation
synthesizing
amateurs
tchaikovsky
ims
entrepreneurs
nouno
transactional
rubin
skunk
olmsted
wikipedia
kurt
qatar
menter
hci
saxton
funding
usenet
centres
explores
waldegrave
joaqun
audio
internalization
knowledge-based
participatory
co-founder
ontario
facilitation
recognised
occupiers
leifer
barcelona
dissemination
videos
routing
instructors
code-named
ckos
tacit
isbn
unresolved
ambidextrous
lifecycle
carlo
keyword

In [30]:
print(len(input_specifics))

832
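
One possible way to score these input-only words, offered here as an assumption rather than as the notebook's method, is add-one (Laplace) smoothing: every word receives a pseudo-count of 1 in both corpora, so a word absent from the reference no longer causes a division by zero. The function name `smoothed_log_ratio` and all numbers below are hypothetical.

```python
from math import log

def smoothed_log_ratio(word, input_freq, ref_freq, n_input, n_ref, vocab_size):
    """Log ratio of add-one smoothed relative frequencies."""
    p_input = (input_freq.get(word, 0) + 1) / (n_input + vocab_size)
    p_ref = (ref_freq.get(word, 0) + 1) / (n_ref + vocab_size)
    return log(p_input / p_ref)

# Toy counts, assumed for illustration only.
input_freq = {"crowdsourcing": 9, "design": 6}
ref_freq = {"design": 20, "river": 30}
n_input = sum(input_freq.values())
n_ref = sum(ref_freq.values())
vocab_size = len(input_freq.keys() | ref_freq.keys())

scores = {
    word: smoothed_log_ratio(word, input_freq, ref_freq, n_input, n_ref, vocab_size)
    for word in ("crowdsourcing", "design")
}
for word, score in scores.items():
    print(word, round(score, 3))
```

In this toy setting the word missing from the reference gets a clearly positive score, while the word common to both corpora stays close to zero. Whether add-one smoothing is the right choice for the real data is an open question.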


## 8. Stemming (if needed)

In [31]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
for k,v in input_freq_dist.items():
    stemmed = stemmer.stem(k)
    if stemmed != k: print(k, "->", stemmed)

collaboration -> collabor
uses -> use
disambiguation -> disambigu
definition -> definit
artists -> artist
featuring -> featur
purposeful -> purpos
parties -> parti
strategically -> strateg
choose -> choos
cooperate -> cooper
achieve -> achiev
shared -> share
overlapping -> overlap
objectives -> object
collaborative -> collabor
developing -> develop
effective -> effect
partnerships -> partnership
communities -> commun
schools -> school
explains -> explain
voluntary -> voluntari
nature -> natur
depends -> depend
leader's -> leader'
ability -> abil
relationships -> relationship
closely -> close
cooperation -> cooper
requires -> requir
decentralized -> decentr
teams -> team
collaboratively -> collabor
resources -> resourc
recognition -> recognit
facing -> face
competition -> competit
structured -> structur
methods -> method
encourage -> encourag
communication -> commun
specifically -> specif
increase -> increas
engage -> engag
solving -> solv
forms -> form
charts -> chart
useful -> use
sit

profiles -> profil
interactions -> interact
asynchronous -> asynchron
serves -> serv
commonly -> commonli
easily -> easili
interaction -> interact
telepresence -> telepres
effectiveness -> effect
factors -> factor
nearly -> nearli
dramatically -> dramat
cheaply -> cheapli
niche -> nich
produced -> produc
formerly -> formerli
coined -> coin
creative -> creativ
energy -> energi
numbers -> number
coordinated -> coordin
usually -> usual
meaningful -> meaning
hierarchical -> hierarch
financial -> financi
compensation -> compens
compares -> compar
centralized -> central
decision -> decis
tagging -> tag
jobs -> job
anyone -> anyon
interested -> interest
means -> mean
announcements -> announc
website -> websit
clickworkers -> clickwork
another -> anoth
helps -> help
massively -> massiv
distributed -> distribut
mitchell -> mitchel
presentation -> present
emerging -> emerg
collaborationism -> collaboration
acquired -> acquir
meaning -> mean
referring -> refer
countries -> countri
occupiers -> oc

dissemination -> dissemin
benchmarking -> benchmark
incentives -> incent
accelerate -> acceler
impressive -> impress
concrete -> concret
initially -> initi
supported -> support
world's -> world'
objective -> object
ckos -> cko
maximise -> maximis
theoretical -> theoret
aspects -> aspect
academics -> academ
thomas -> thoma
magazine -> magazin
subsequently -> subsequ
capital -> capit
maturity -> matur
publications -> public
contribution -> contribut
overall -> overal
disciplines -> disciplin
vary -> vari
debates -> debat
ecological -> ecolog
identity -> ident
details -> detail
perspective -> perspect
perspectives -> perspect
relevance -> relev
suggested -> suggest
translate -> translat
findings -> find
frameworks -> framework
distinguishing -> distinguish
categorizing -> categor
distinguishes -> distinguish
internalised -> internalis
aware -> awar
opposite -> opposit
holds -> hold
communicated -> commun
internalization -> intern
considers -> consid
cycle -> cycl
suggests -> suggest
store

displayed -> display
contemporary -> contemporari
attendees -> attende
introduces -> introduc
integrating -> integr
hexagonal -> hexagon
gathering -> gather
googleable -> googleabl
nongoogleable -> nongoogl
setting -> set
bias -> bia
observe -> observ
views -> view
determine -> determin
confidence -> confid
publishes -> publish
offered -> offer
volunteer -> volunt
motivate -> motiv
utilizing -> util
multitude -> multitud
offering -> offer
applying -> appli
reduced -> reduc
uncertainty -> uncertainti
presents -> present
placed -> place
promising -> promis
holistic -> holist
happens -> happen
pressing -> press
operational -> oper
so-called -> so-cal
soft-systems -> soft-system
christopher -> christoph
alexander -> alexand
knowledge-based -> knowledge-bas
incorporated -> incorpor
assessment -> assess
arguably -> arguabl
establishes -> establish
partly -> partli
anything -> anyth
logical -> logic
dilemmas -> dilemma
showing -> show
developments -> develop
inadequate -> inadequ
inquiry -> i
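
If stemming is adopted, the counts of words that share a stem would need to be merged before scoring. The sketch below shows that merge with a pluggable `stem` callable. `merge_by_stem` is a hypothetical helper, and `toy_stem` is a deliberately crude stand-in so that the example runs without NLTK; in the notebook, `stemmer.stem` could be passed instead.

```python
from collections import Counter

def merge_by_stem(freq, stem):
    """Sum the counts of all words that map to the same stem."""
    merged = Counter()
    for word, count in freq.items():
        merged[stem(word)] += count
    return merged

def toy_stem(word):
    # Crude stand-in for a real stemmer: strips a trailing "s" only.
    return word[:-1] if word.endswith("s") else word

# Counts taken from the cleaned distribution shown in section 7.3.
freq = {"relationship": 20, "relationships": 15, "group": 65}
merged = merge_by_stem(freq, toy_stem)
print(merged)
```

With a real stemmer, unrelated words can collapse onto the same stem (the output above maps both `communities` and `communication` to `commun`), so merged counts are worth inspecting before they are used.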

## 9. Computing representation power of common words.

In [32]:
# Score every word that occurs in both corpora.
makerness = {}
# common_words = [w[0] for w in common_words]
for w in common_words:
    # Consider only words longer than one character.
    if len(w) > 1:
        # Log of the ratio between the word's relative frequency in the input
        # corpus and its relative frequency in the reference corpus.
        score = log((input_freq_dist[w] / n_input) / (english_freq_dist[w] / n_english))
        makerness[w] = score
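
The score above is the natural logarithm of a ratio of relative frequencies: a word's share of the input corpus divided by its share of the reference corpus. A positive value means the word is over-represented in the input, zero means equal shares, and a negative value means it is under-represented. The sketch below illustrates the three cases; `log_ratio` is a hypothetical helper and all counts are toy values.

```python
from math import log

def log_ratio(count_input, n_input, count_ref, n_ref):
    """Natural log of (relative frequency in input / relative frequency in reference)."""
    return log((count_input / n_input) / (count_ref / n_ref))

# Toy totals and counts, assumed for illustration only.
n_input, n_ref = 1_000, 10_000

over = log_ratio(20, n_input, 20, n_ref)    # 10x more frequent in the input
same = log_ratio(5, n_input, 50, n_ref)     # same relative frequency
under = log_ratio(1, n_input, 100, n_ref)   # 10x less frequent in the input

print(over, same, under)
```

Because the ratio uses relative frequencies, the very different corpus sizes (see section 7.1) do not by themselves push scores up or down.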

In [33]:
common_words

['credit',
 '19th',
 'agents',
 'gold',
 'heads',
 'analyze',
 'yellow',
 'chosen',
 'belief',
 'evaluating',
 'digital',
 'multitude',
 'chief',
 'though',
 'substantial',
 'john',
 'community',
 'coalition',
 'medicine',
 'discovering',
 'systemic',
 'ross',
 'mark',
 '40',
 'staff',
 'gaining',
 'worked',
 'used',
 'flood',
 'room',
 'pizza',
 'creativity',
 'rigorous',
 'general',
 'front',
 'consists',
 'index',
 'campaign',
 'revenue',
 'actions',
 'report',
 'low',
 'ted',
 'relationships',
 'legislation',
 'without',
 'skills',
 'g',
 'new',
 'higher',
 'two-thirds',
 'organize',
 'canada',
 'economic',
 'saves',
 'armed',
 'de',
 'sold',
 'logical',
 'accept',
 'transactions',
 'stravinsky',
 'investors',
 'transparent',
 'near',
 'share',
 'largest',
 'assisting',
 'joseph',
 'zones',
 'entire',
 'n',
 'route',
 'possibilities',
 'hours',
 'martin',
 'regions',
 'side',
 'user',
 'learn',
 'avoid',
 'courses',
 'items',
 'velocity',
 'convert',
 'initial',
 'instrumental',
 '

In [34]:
# Sorting by scores:
for k,v in sorted(makerness.items(), key=lambda x:x[1], reverse=True): print(k,v)

collaboration 5.991934107047692
users 5.758840224880314
user 5.308639222930758
portal 5.288836595634578
participants 5.1739066289606015
web 4.987731502850657
global 4.960332528662542
innovation 4.9362349770834815
challenges 4.873321151672912
virtual 4.854972013004716
researchers 4.7780109718685875
motivations 4.6501776003587025
expertise 4.634910128227914
turk 4.554867420554378
connect 4.4678560435647485
organizational 4.449506904896552
template 4.449506904896552
implementation 4.384968383758981
perspectives 4.372545863760424
citation 4.331723869240168
educators 4.267185348102597
environments 4.267185348102597
solving 4.224625733683801
digital 4.210026934262649
capacities 4.198192476615645
media 4.187142640429061
composers 4.187142640429061
vs 4.180173971112967
prototype 4.149402312446214
les 4.149402312446214
coined 4.149402312446214
definitions 4.149402312446214
indigenous 4.149402312446214
design 4.1193499673798115
feedback 4.084863791308642
links 4.05587625443539
allows 4.052775476

authors 2.5179854932933377
engineering 2.5179854932933377
leadership 2.5179854932933377
distributed 2.5117935230454167
allowing 2.5071745771891223
article 2.5044134157477944
processes 2.4928176215864113
individuals 2.489031530930321
stravinsky 2.475425878874542
investors 2.475425878874542
promotes 2.475425878874542
defining 2.475425878874542
introduces 2.475425878874542
featuring 2.475425878874542
inventions 2.475425878874542
modes 2.475425878874542
prizes 2.475425878874542
discusses 2.475425878874542
entails 2.475425878874542
assess 2.475425878874542
disruptive 2.475425878874542
paradigm 2.475425878874542
violinist 2.475425878874542
discern 2.475425878874542
surgery 2.475425878874542
commons 2.475425878874542
professionals 2.475425878874542
antonio 2.475425878874542
huddle 2.475425878874542
circle 2.475425878874542
innate 2.475425878874542
differs 2.475425878874542
intertwined 2.475425878874542
prize 2.475425878874542
wiley 2.475425878874542
tutor 2.475425878874542
founded 2.475425878

experiences 1.7632305033439022
approach 1.7575860857242254
intellectual 1.751507039647843
agencies 1.7494888754916058
suggests 1.7471873785033265
written 1.7425383696650827
amateur 1.7414567037943416
corporations 1.7414567037943416
writing 1.7386036348119351
teaching 1.7364691622833024
impact 1.7364691622833024
skill 1.7334885341451647
providing 1.7300929451440263
parties 1.7300929451440263
different 1.729634964829175
codes 1.7216540764981618
pierre 1.7216540764981618
significantly 1.7216540764981618
notions 1.7216540764981618
alternatives 1.7216540764981618
lessons 1.7216540764981618
lists 1.7216540764981618
compensation 1.7216540764981618
movements 1.7163207305227992
medicine 1.7132858268276452
meeting 1.7105867698584802
review 1.7081707261608747
transparent 1.7022359906410602
embrace 1.7022359906410602
cooperation 1.7022359906410602
owns 1.7022359906410602
rely 1.7022359906410602
ego 1.7022359906410602
analogy 1.7022359906410602
incorporated 1.7022359906410602
monitoring 1.702235990

judging 1.1536700388922225
validity 1.1536700388922225
alter 1.1536700388922225
referred 1.1536700388922223
system 1.1452209844056949
advanced 1.1441912949376787
lead 1.1419739991290314
tests 1.137140736941012
collection 1.1356515333895443
wave 1.1316911321734473
dedicated 1.1316911321734473
encouraging 1.1316911321734473
channels 1.1316911321734473
agriculture 1.1316911321734473
lighting 1.1316911321734473
publications 1.1316911321734473
address 1.1273527305748492
individual 1.126096711988123
accordingly 1.1208802160692315
bird 1.1208802160692315
involve 1.1208802160692315
internal 1.1208802160692315
extend 1.1208802160692315
mainly 1.1208802160692315
language 1.1162825068206021
significant 1.11212103597935
build 1.11212103597935
allowed 1.11212103597935
numerous 1.1101849269524835
source 1.1101849269524835
prior 1.1101849269524835
thousands 1.1101849269524835
similar 1.1080595276401701
lack 1.1071500232573297
components 1.1071500232573297
identity 1.1071500232573297
compared 1.103117

particularly 0.48755153072019647
rule 0.48755153072019647
led 0.48299571418433584
driven 0.48299571418433584
hundreds 0.48299571418433584
literature 0.4754485085489529
century 0.47491159578437686
observed 0.47394587866441795
positive 0.47394587866441795
london 0.47169615893040245
labor 0.46721184648307384
beginning 0.46660190440865934
become 0.46330450129415407
radical 0.46052285833227724
replace 0.46052285833227724
voting 0.46052285833227724
representing 0.46052285833227724
era 0.46052285833227724
alex 0.46052285833227724
text 0.46052285833227724
virtue 0.46052285833227724
medium 0.46052285833227713
charges 0.46052285833227713
mounted 0.46052285833227713
edward 0.46052285833227713
target 0.46052285833227713
prime 0.46052285833227713
given 0.4552036968546773
issue 0.4472776315822566
ideal 0.4439935563810667
available 0.4399035711295415
better 0.438543951613502
salt 0.4385439516135019
flight 0.4385439516135019
chamber 0.4385439516135019
superior 0.4385439516135019
recommended 0.43854395

lose -0.19872277055198673
minor -0.19872277055198673
black -0.19872277055198673
million -0.20363678535441576
account -0.2073065142433783
plan -0.20852677064860758
king -0.2101514663756095
fall -0.2124216149101487
tree -0.21581720391128684
housing -0.21581720391128684
europe -0.21581720391128684
responsibility -0.21581720391128684
drop -0.21581720391128684
tension -0.21581720391128684
headed -0.21581720391128684
class -0.21823558477556843
around -0.21848981129276338
armed -0.23262432222766807
chemical -0.23262432222766807
balance -0.2326243222276682
purposes -0.2326243222276682
dance -0.2326243222276682
run -0.24210306618221197
base -0.24367415841425305
civil -0.24367415841425305
thought -0.24625647101772563
words -0.24733046961736369
european -0.24915362417887862
showing -0.24915362417887862
boston -0.24915362417887862
feelings -0.24915362417887862
minister -0.24915362417887862
recently -0.2573169348180395
annual -0.2654141450506589
values -0.2654141450506589
save -0.2654141450506591
w

In [35]:
# newline='' is recommended by the csv module to avoid blank rows on some platforms.
with open('makerness_diy.csv', 'w', newline='') as csvfile:
    thewriter = csv.writer(csvfile, delimiter=',')
    for k,v in sorted(makerness.items(), key=lambda x:x[1], reverse=True):
        thewriter.writerow([k,v])
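
To reuse the scores later, the file can be read back into a dictionary. The sketch below first writes a tiny two-row file to a temporary directory so that it is self-contained; the two rows are copied from the sorted output above and the file name is a placeholder.

```python
import csv
import os
import tempfile

# Two rows copied from the sorted output above, used as sample data.
rows = [("collaboration", 5.991934107047692), ("users", 5.758840224880314)]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "makerness_sample.csv")  # placeholder file name

    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="") as csvfile:
        csv.writer(csvfile, delimiter=",").writerows(rows)

    # Read the scores back; csv returns strings, so convert to float.
    with open(path, newline="") as csvfile:
        loaded = {word: float(score) for word, score in csv.reader(csvfile)}

print(loaded)
```

Pointing `path` at `makerness_diy.csv` instead should load the full score table in the same way.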