# Discovery and Representation of Open Making Related Terms

This notebook sketches an initial exercise in discovering keywords related to open making. The input text is harvested by a web crawler that identifies and crawls semantically related Wikipedia articles.

In [1]:
from utils import tokenizer
import nltk
from nltk import FreqDist
from math import log
import json, csv

## 1. Loading a reference English language corpus

In [2]:
from nltk.corpus import brown
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

## 2. Stop words

### 2.1 Standard stop words

In [3]:
with open("data/stopwords_standard.txt", "r") as f:
    STOP_WORDS_STANDARD = set(f.read().strip().split("\n"))
print(STOP_WORDS_STANDARD)

{'at', 'not', 'once', 'be', 'ours ', 'further', 'down', 'if', "don't", 'have', 'into', 'how', 'why', 'its', 'himself', "he'd", 'until', 'we', 'whom', 'http', "he's", 'off', 'our', "hasn't", "that's", "isn't", 'up', 'been', 'itself', 'what', 'your', 'has', 'more', "she'd", "didn't", 'with', 'because', "we'd", 'to', 'all', 'each', "couldn't", "here's", 'while', 'get', "he'll", 'so', 'again', 'him', 'and', 'for', 'her', "they'll", 'above', 'was', 'about', 'between', 'or', 'where', 'www', 'as', 'their', "you'll", 'is', 'me', 'when', 'yours', "she's", 'same', 'no', "they're", 'having', 'any', 'them', 'themselves', 'by', 'were', 'there', 'an', 'who', 'yourselves', "haven't", 'would', 'cannot', "shan't", "what's", "won't", 'this', "weren't", 'the', "we're", 'most', "i've", 'should', 'just', "they'd", "wasn't", "i'd", 'nor', "we've", "i'll", "you're", "how's", "doesn't", "when's", 'in', "i'm", "aren't", 'herself', "we'll", 'but', 'am', 'ought', 'such', 'my', 'out', 'know', "there's", 'are', "y
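The printed set contains the entry `'ours '` with a trailing space, which suggests that a line in the file carries stray whitespace and would therefore never match the token `ours`. A minimal sketch of a more forgiving loader, assuming one stop word per line and UTF-8 encoding; `load_stop_words` is a hypothetical helper, not part of `utils`:

```python
import os
import tempfile

def load_stop_words(path):
    """Read one stop word per line, trimming whitespace and skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

# Demonstration on a throwaway file (the real list lives in data/).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("ours \nat\n\nNot\n")
demo_stop_words = load_stop_words(tmp.name)
os.remove(tmp.name)
print(sorted(demo_stop_words))
```

Lowercasing the entries is an extra assumption here; it only matters if the file mixes cases.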

### 2.2 Open-making related stop words

In [4]:
with open("data/stopwords_openmaker.txt", "r") as f:
    STOP_WORDS_OPENMAKER = set(f.read().strip().split("\n"))
print(STOP_WORDS_OPENMAKER)

{'well', 'also', 'may', 'almost', 'one', 'often', 'many'}


## 3. Removing stop words from the reference English corpus

In [5]:
# Merge the two sets of stop words.
STOP_WORDS = STOP_WORDS_STANDARD.union(STOP_WORDS_OPENMAKER)
print(STOP_WORDS)

{'once', 'ours ', 'down', 'if', 'how', 'its', 'himself', 'we', 'whom', 'http', 'off', 'may', "hasn't", "that's", 'up', 'what', 'itself', 'has', "didn't", 'with', "we'd", 'to', 'all', "couldn't", 'get', "he'll", 'again', 'for', "they'll", 'above', 'between', 'or', 'where', 'well', 'also', 'www', 'yours', 'me', 'when', 'same', 'no', 'having', 'them', 'were', 'who', 'would', 'cannot', "what's", 'the', "we're", 'should', 'just', "they'd", "we've", "i'd", 'one', "how's", "aren't", "we'll", 'but', 'my', "there's", 'are', "you've", "mustn't", 'hers', 'than', 'only', 'under', 'then', 'can', 'could', 'like', "she'll", 'it', 'those', 'other', 'after', 'do', 'of', "can't", 'com', "why's", 'r', 'through', 'very', 'too', 'did', 'a', 'does', 'that', 'often', 'own', "it's", 'at', 'not', 'be', 'further', "don't", 'have', 'into', 'why', "he'd", 'until', "he's", 'our', "isn't", 'been', 'your', 'more', "she'd", 'because', 'each', "here's", 'while', 'so', 'him', 'and', 'her', 'was', 'about', 'as', 'their'

In [6]:
# Load English words from the Brown corpus, removing stop words.
english_freq_dist = FreqDist([w.lower() for w in nltk.corpus.brown.words()
                              if w not in STOP_WORDS])
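Note that the comprehension above lowercases a token only after the stop-word test, so a capitalized form such as `The` passes the filter and is then counted as `the`. The toy example below (made-up tokens, with `collections.Counter` standing in for `FreqDist`) contrasts the two orderings:

```python
from collections import Counter

stop_words = {"the", "of"}
tokens = ["The", "maker", "of", "the", "Maker", "movement"]

# Test applied to the raw token: "The" is not in the set, so it survives.
raw_test = Counter(w.lower() for w in tokens if w not in stop_words)

# Test applied to the lowercased token: every casing of a stop word is dropped.
lowered_test = Counter(w.lower() for w in tokens if w.lower() not in stop_words)

print(raw_test)
print(lowered_test)
```

Switching the notebook to the second form would change the reference counts, so the totals reported in later sections would need to be recomputed.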

## 4. Removing rare words

Below we remove rare words from the reference distribution. The code keeps only words that occur more than twice.

In [7]:
english_freq_dist = {k:v for k,v in english_freq_dist.items() if v > 2}
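The dictionary comprehension replaces the `FreqDist` with a plain `dict`, so methods such as `most_common` are no longer available afterwards. If they are still needed, the filtered counts can be wrapped in a `Counter`. A small sketch with made-up counts and a hypothetical `drop_rare` helper:

```python
from collections import Counter

def drop_rare(freq, min_count):
    """Keep only the entries whose count is strictly greater than min_count."""
    return Counter({word: n for word, n in freq.items() if n > min_count})

toy_counts = Counter({"design": 5, "maker": 3, "kludge": 2, "edupunk": 1})
kept = drop_rare(toy_counts, 2)
print(kept.most_common())
```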

## 5. Loading the input Open Maker corpus

In [8]:
# Load the text harvested from Wikipedia.
with open("data/wikipedia.json", "r") as f:
    OM_Corpus_text = f.read()
OM_Corpus = json.loads(OM_Corpus_text)

In [9]:
# The total number of wiki articles used:
print(len(OM_Corpus))

152


In [10]:
# Column names of the corpus.
OM_Corpus[0].keys()

dict_keys(['theme.id', 'title', 'url', 'depth', 'text'])

In [11]:
def display_articles(tid):
    articles = [article for article in OM_Corpus if article['theme.id'] == tid]
    for article in articles:
        print(article['depth'], article['title'], article['url'])

In [12]:
display_articles(0)

0 Do it yourself https://en.wikipedia.org/wiki/Do_it_yourself
1 Edupunk https://en.wikipedia.org/wiki/Edupunk
1 Prosumer https://en.wikipedia.org/wiki/Prosumer
1 How-to https://en.wikipedia.org/wiki/How-to
1 Kludge https://en.wikipedia.org/wiki/Kludge
1 Bricolage https://en.wikipedia.org/wiki/Bricolage
1 Junk box https://en.wikipedia.org/wiki/Junk_box
1 Number 8 wire https://en.wikipedia.org/wiki/Number_8_wire
1 Ready-to-assemble furniture https://en.wikipedia.org/wiki/Ready-to-assemble_furniture
1 Open design https://en.wikipedia.org/wiki/Open_Design
1 Hackerspace https://en.wikipedia.org/wiki/Hackerspace
1 Instructables https://en.wikipedia.org/wiki/Instructables
1 Handyman https://en.wikipedia.org/wiki/Handyman
1 Circuit bending https://en.wikipedia.org/wiki/Circuit_bending
1 Project GreenWorld International https://en.wikipedia.org/wiki/Project_GreenOman
1 3D printing https://en.wikipedia.org/wiki/3D_printing


In [13]:
display_articles(1)

0 Open design https://en.wikipedia.org/wiki/Open_design
1 Knowledge commons https://en.wikipedia.org/wiki/Knowledge_commons
1 Open Source Ecology https://en.wikipedia.org/wiki/Open_Source_Ecology
1 Computer-aided design https://en.wikipedia.org/wiki/Computer-aided_design
1 Open Source Initiative https://en.wikipedia.org/wiki/Open_Source_Initiative
1 Open Architecture Network https://en.wikipedia.org/wiki/Open_Architecture_Network
1 Open-source architecture https://en.wikipedia.org/wiki/Open-source_architecture
1 Commons-based peer production https://en.wikipedia.org/wiki/Commons-based_peer_production
1 Open standard https://en.wikipedia.org/wiki/Open_standard
1 OpenCores https://en.wikipedia.org/wiki/OpenCores
1 Co-creation https://en.wikipedia.org/wiki/Co-creation
1 OpenBTS https://en.wikipedia.org/wiki/OpenBTS
1 Open manufacturing https://en.wikipedia.org/wiki/Open_manufacturing
1 Open-source hardware https://en.wikipedia.org/wiki/Open-source_hardware
1 Open source appropriate techno

In [14]:
display_articles(2)

0 Sustainability https://en.wikipedia.org/wiki/Sustainability
1 Sustainability standards and certification https://en.wikipedia.org/wiki/Sustainability_standards_and_certification
1 Appropriate technology https://en.wikipedia.org/wiki/Appropriate_technology
1 Sustainable development https://en.wikipedia.org/wiki/Sustainable_development
1 Environmental issue https://en.wikipedia.org/wiki/Environmental_issue
1 World Cities Summit https://en.wikipedia.org/wiki/World_Cities_Summit
1 Ecopsychology https://en.wikipedia.org/wiki/Ecopsychology
1 Book:Sustainability https://en.wikipedia.org/wiki/Book:Sustainability
1 Sustainable design https://en.wikipedia.org/wiki/Sustainable_design
1 Circles of Sustainability https://en.wikipedia.org/wiki/Circles_of_Sustainability
1 Sustainability science https://en.wikipedia.org/wiki/Sustainability_science
1 Sustainable living https://en.wikipedia.org/wiki/Sustainable_living
1 Index of sustainability articles https://en.wikipedia.org/wiki/List_of_sustainabil

In [15]:
display_articles(3)

0 Maker culture https://en.wikipedia.org/wiki/Maker_culture
1 Modular design https://en.wikipedia.org/wiki/Modular_design
1 Open-source car https://en.wikipedia.org/wiki/Open-source_car
1 Electric vehicle conversion https://en.wikipedia.org/wiki/Electric_vehicle_conversion
1 Thingiverse https://en.wikipedia.org/wiki/Thingiverse
1 Fab lab https://en.wikipedia.org/wiki/Fab_Lab_(fabrication_laboratory)
1 SparkFun Electronics https://en.wikipedia.org/wiki/SparkFun
1 RepRap project https://en.wikipedia.org/wiki/RepRap
1 Distributed manufacturing https://en.wikipedia.org/wiki/Distributed_manufacturing
1 Craft production https://en.wikipedia.org/wiki/Craft_production
1 Autonomous building https://en.wikipedia.org/wiki/Autonomous_building
1 Open-source hardware https://en.wikipedia.org/wiki/Open_source_hardware
1 Kit car https://en.wikipedia.org/wiki/Kit_car


In [16]:
display_articles(4)

0 Innovation https://en.wikipedia.org/wiki/Innovation
1 Competitive intelligence https://en.wikipedia.org/wiki/Creative_competitive_intelligence
1 Multiple discovery https://en.wikipedia.org/wiki/Multiple_discovery
1 UNDP Innovation Facility https://en.wikipedia.org/wiki/UNDP_Innovation_Facility
1 Open Innovations (event) https://en.wikipedia.org/wiki/Open_Innovations_(Forum_and_Technology_Show)
1 Trans-cultural diffusion https://en.wikipedia.org/wiki/Diffusion_(anthropology)
1 Individual capital https://en.wikipedia.org/wiki/Individual_capital
1 Innovation system https://en.wikipedia.org/wiki/Innovation_system
1 Public domain https://en.wikipedia.org/wiki/Public_domain
1 Ingenuity https://en.wikipedia.org/wiki/Ingenuity
1 Sustainable Development Goals https://en.wikipedia.org/wiki/Sustainable_Development_Goals
1 Participatory design https://en.wikipedia.org/wiki/Participatory_design
1 Innovation management https://en.wikipedia.org/wiki/Innovation_management
1 Information revolution ht

In [17]:
display_articles(5)

0 Collaboration https://en.wikipedia.org/wiki/Collaboration
1 Wikinomics https://en.wikipedia.org/wiki/Wikinomics
1 Collaborative editing https://en.wikipedia.org/wiki/Collaborative_editing
1 Telepresence https://en.wikipedia.org/wiki/Telepresence
1 Knowledge management https://en.wikipedia.org/wiki/Knowledge_management
1 The Culture of Collaboration https://en.wikipedia.org/wiki/The_Culture_of_Collaboration
1 Collaborative governance https://en.wikipedia.org/wiki/Collaborative_governance
1 Community film https://en.wikipedia.org/wiki/Community_film
1 Collaborative innovation network https://en.wikipedia.org/wiki/Collaborative_innovation_network
1 Design thinking https://en.wikipedia.org/wiki/Design_thinking
1 Role-based collaboration https://en.wikipedia.org/wiki/Role-based_collaboration
1 Intranet portal https://en.wikipedia.org/wiki/Intranet_portal
1 Critical thinking https://en.wikipedia.org/wiki/Critical_thinking
1 Facilitation (business) https://en.wikipedia.org/wiki/Facilitation

## 6. Analyzing a specific corpus based on a theme

In [18]:
def get_title(Corpus, theme_id):
    title = ''
    for article in Corpus:
        if article['theme.id'] == theme_id:
            title = article['title']
            break
    return title
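`get_title` relies on the first matching article being the theme's seed page, which the listings above suggest is the one with `depth` 0. A sketch that makes this assumption explicit; `get_seed_title` and the two sample records are illustrative, and `depth` is assumed to be stored as an integer:

```python
def get_seed_title(corpus, theme_id):
    """Return the title of the depth-0 article of a theme, or '' if none exists."""
    return next(
        (a["title"] for a in corpus
         if a["theme.id"] == theme_id and a["depth"] == 0),
        "",
    )

# Illustrative records only; the real corpus is loaded from data/wikipedia.json.
sample_corpus = [
    {"theme.id": 4, "title": "Ingenuity", "depth": 1},
    {"theme.id": 4, "title": "Innovation", "depth": 0},
]
print(get_seed_title(sample_corpus, 4))
print(repr(get_seed_title(sample_corpus, 9)))
```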

### 6.0 Selecting the specific theme (a sub-corpus).

In [19]:
# For a different sub-corpus, use the corresponding theme ID.
current_theme_id = 4

In [20]:
current_title = get_title(OM_Corpus, current_theme_id)

In [21]:
output_fname = "_".join([word.capitalize() for word in current_title.split(" ")])
print(current_title, "::", output_fname)

Innovation :: Innovation


In [22]:
# Note that theme.id 0 corresponds to "Do it yourself".
input_text = " ".join([page['text'] for page in OM_Corpus if page['theme.id'] == current_theme_id])

In [23]:
print(input_text)

Innovation 
 For other uses see 
 Innovation disambiguation 
 Innovation 
 can be defined simply as a "new idea device or method" 
 However innovation is often also viewed as the application of better solutions that meet new requirements unarticulated needs or existing market needs 
 This is accomplished through more-effective 
 products 
 processes 
 services 
 technologies 
 or business models that are readily available to 
 markets 
 governments 
 and 
 society 
 The term "innovation" can be defined as something original and more effective and as a consequence new that "breaks into" the market or society 
 It is related to but not the same as 
 invention 
 Innovation is often manifested via the 
 engineering 
 process The opposite of innovation is 
 exnovation 
 While a novel device is often described as an innovation in economics management science and other fields of practice and analysis innovation is generally considered to be the result of a process that brings together various

In [24]:
# Tokenizing the input text:
tokenized = tokenizer.tokenize_words(input_text)
number_of_words = len(tokenized)
print(number_of_words,current_title)

108495 Innovation


### 6.1 Computing the frequency distribution of tokens (words, terms, punctuation, etc.)

In [25]:
input_freq_dist = FreqDist(tokenized)

In [26]:
input_freq_dist.most_common(20)

[('\n', 15885),
 ('the', 4659),
 ('of', 3260),
 ('and', 3142),
 ('in', 1946),
 ('to', 1814),
 ('a', 1714),
 ('"', 1108),
 ('is', 965),
 ('that', 752),
 ('for', 746),
 ('by', 721),
 ('as', 719),
 ('innovation', 697),
 ('or', 526),
 ('are', 523),
 ('on', 498),
 ('research', 454),
 ('be', 447),
 ('creativity', 441)]

### 6.2 Removing punctuation and stopwords from the input corpus

In [27]:
for stopword in STOP_WORDS:
    if stopword in input_freq_dist:
        del input_freq_dist[stopword]
        
for punctuation in tokenizer.CHARACTERS_TO_SPLIT:
    if punctuation in input_freq_dist:
        del input_freq_dist[punctuation]

# Re-check the most common words after cleaning:
input_freq_dist.most_common(80)

[('innovation', 697),
 ('research', 454),
 ('creativity', 441),
 ('technology', 393),
 ('development', 337),
 ('new', 269),
 ('first', 252),
 ('russian', 235),
 ('creative', 228),
 ('knowledge', 206),
 ('process', 194),
 ('intelligence', 194),
 ('empire', 175),
 ('century', 167),
 ('information', 165),
 ('public', 164),
 ('theory', 154),
 ('-0', 153),
 ('business', 150),
 ('system', 138),
 ('product', 132),
 ('soviet', 132),
 ('work', 129),
 ('market', 127),
 ('ideas', 127),
 ('value', 126),
 ('domain', 124),
 ('social', 123),
 ('union', 122),
 ('e', 118),
 ('use', 118),
 ('management', 117),
 ('used', 114),
 ('technologies', 110),
 ('diffusion', 110),
 ('model', 110),
 ('technological', 110),
 ('innovations', 106),
 ('p', 106),
 ('people', 104),
 ('invention', 103),
 ('open', 103),
 ('j', 101),
 ('leadership', 97),
 ('economic', 96),
 ('early', 95),
 ('example', 94),
 ('idea', 90),
 ('data', 90),
 ('b', 89),
 ('bc', 88),
 ('history', 86),
 ('developed', 85),
 ('time', 85),
 ('competit

### 6.3 Removing rare words from input distribution

In [28]:
input_freq_dist = {k:v for k,v in input_freq_dist.items() if v > 1}

## 7. Comparing input vs English corpus volumes

### 7.1 Total words (after cleaning) 

In [29]:
n_input = sum(input_freq_dist.values())
n_english = sum(english_freq_dist.values())
n_input, n_english

(52891, 679519)

### 7.2 Unique words (after cleaning)

In [30]:
n_unique_word_input = len(input_freq_dist.items())
n_unique_word_brown = len(english_freq_dist.items())
n_unique_word_input, n_unique_word_brown

(5951, 20591)

### 7.3 Cleaned set of input words/terms

The cleaned word list is printed below for visual inspection. Such inspections will be used to improve both tokenization and filtering.

In [31]:
input_freq_dist

{'innovation': 697,
 'uses': 32,
 'see': 59,
 'disambiguation': 6,
 'defined': 30,
 'simply': 3,
 'new': 269,
 'idea': 90,
 'device': 12,
 'method': 44,
 'however': 69,
 'viewed': 15,
 'application': 27,
 'better': 26,
 'solutions': 28,
 'meet': 14,
 'requirements': 13,
 'needs': 47,
 'existing': 49,
 'market': 127,
 'accomplished': 4,
 'products': 81,
 'processes': 78,
 'services': 41,
 'technologies': 110,
 'business': 150,
 'models': 23,
 'readily': 5,
 'available': 32,
 'markets': 33,
 'governments': 16,
 'society': 50,
 'term': 74,
 'something': 16,
 'original': 37,
 'effective': 24,
 'consequence': 5,
 'breaks': 2,
 'related': 61,
 'invention': 103,
 'manifested': 3,
 'via': 8,
 'engineering': 17,
 'process': 194,
 'opposite': 3,
 'novel': 32,
 'described': 22,
 'economics': 38,
 'management': 117,
 'science': 72,
 'fields': 15,
 'practice': 25,
 'analysis': 53,
 'generally': 26,
 'considered': 28,
 'result': 46,
 'brings': 3,
 'together': 29,
 'various': 38,
 'ideas': 127,
 'way

### 7.4 Set of terms/words that occur in both corpora

In [32]:
common_words = [w for w in input_freq_dist.keys() & english_freq_dist.keys()]
print(len(common_words))

4212
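Dictionary key views support set operators directly, which is what the intersection above relies on; the same holds for the set difference. A toy illustration with invented counts:

```python
# Invented counts, used only to show the operators.
sample = {"innovation": 30, "open-source": 12, "process": 8}
reference = {"process": 40, "innovation": 2, "fiction": 60}

shared = sample.keys() & reference.keys()       # words found in both
sample_only = sample.keys() - reference.keys()  # words missing from the reference

print(sorted(shared))
print(sorted(sample_only))
```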


In [33]:
for w in common_words: print(w)

latter
approximately
baltic
require
cities
participation
decades
flying
segments
d
face-to-face
resource
readily
achievements
approach
officially
worth
gun
equality
accomplishments
oslo
null
network
records
rare
whole
meeting
distinguishing
carnegie
publishing
segment
symbol
desire
ray
believe
fresh
newspaper
audience
empirically
editor
days
institution
pressures
demonstration
component
dual
automobiles
printed
traditional
york
liquid
formation
cast
successful
source
become
guns
limited
immortality
due
full
literary
chariot
copy
render
wagner
beam
complexity
unpublished
typewriter
completion
significantly
retrieved
supersonic
distinct
names
island
socialist
serving
enact
professor
satisfaction
program
half
hiring
foam
alloys
pulp
deaths
involving
journals
process
invested
rome
length
shorter
kind
college
discontinuous
enforcement
advance
perennial
processes
steam
faculty
scholarly
closely
leonard
potential
trend
england
studying
argued
taught
chemical
base
version
themes
deliver
safety

### 7.5 Set of terms/words that occur in the sample but not in the reference corpus

TO BE EXAMINED: This set still needs to be incorporated into the analysis. It may capture much of the specificity of the content, so each word in it needs to be assigned a mapping score.

In [34]:
input_specifics = dict()
for w in input_freq_dist.keys() - english_freq_dist.keys():
    input_specifics[w] = input_freq_dist[w]
    print(w)

keqiang
mona
eds
reside
tu-95
makarov
navigational
open-source
internallocalized
default
hobson
behaviours
deindividuation
x-axis
entrepreneurship
nasa
wai
duchy
2011-08-25
ultra
bisphenol
yablochkov
der
jean-marc
lanka
coalitions
mongolia
titanium
predictors
phased
electro
collaborative
mallory
practitioner
generative
cole
3-dimensional
ktp
wireless
sputniks
youtube
vandervert's
cherenkov
elliot
twitter
sobyanin
maser
platforming
carol
floppy
domesticated
cip
mp
kozbelt
waldemar
biometrics
transplantation
recursive
redirects
-series
unipolar
camming
componential
viral
isbn
risk-taking
fleming
yearneeded
taha
acquisitions
lateral
licensor
porcelain
oleg
criticised
brainstorming
preventers
scorers
concise
real-world
wittwer
topol
christensen
1830s
mednick
bower
dedicating
modalities
nickerson
conductors
open-license
docks
petrol
rushton
kaspersky
counterfactual
oncolytic
new-market
immersive
ci
futures
cleanup
eyetap
printers
fixed-wing
hierarchies
lu
estonia
formalities
nurturing
metho

In [35]:
print(len(input_specifics))

1739
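One possible way to score these words, offered only as a sketch and not as the notebook's method, is add-one (Laplace) smoothing: every word gets a pseudo-count in both corpora, so a word absent from the reference still has a finite log ratio. `smoothed_score` and its `vocab_size` and `alpha` parameters are hypothetical names, and the counts are toy values.

```python
from math import log

def smoothed_score(word, sample, reference, vocab_size, alpha=1.0):
    """Log ratio of additively smoothed relative frequencies."""
    n_sample = sum(sample.values())
    n_reference = sum(reference.values())
    p_sample = (sample.get(word, 0) + alpha) / (n_sample + alpha * vocab_size)
    p_reference = (reference.get(word, 0) + alpha) / (n_reference + alpha * vocab_size)
    return log(p_sample / p_reference)

# Toy counts: "open-source" does not occur in the reference at all.
sample = {"open-source": 12, "process": 8}
reference = {"process": 40, "fiction": 60}
vocab = len(sample.keys() | reference.keys())

for w in ("open-source", "process"):
    print(w, round(smoothed_score(w, sample, reference, vocab), 3))
```

With this choice, words unique to the input receive high but bounded scores, which would let them be ranked alongside the common words.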


## 8. Stemming (if needed)

In [36]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
for k,v in input_freq_dist.items():
    stemmed = stemmer.stem(k)
    if stemmed != k: print(k, "->", stemmed)

innovation -> innov
uses -> use
disambiguation -> disambigu
defined -> defin
simply -> simpli
device -> devic
however -> howev
viewed -> view
application -> applic
solutions -> solut
requirements -> requir
needs -> need
existing -> exist
accomplished -> accomplish
products -> product
processes -> process
services -> servic
technologies -> technolog
business -> busi
models -> model
readily -> readili
available -> avail
markets -> market
governments -> govern
society -> societi
something -> someth
original -> origin
effective -> effect
consequence -> consequ
breaks -> break
related -> relat
invention -> invent
manifested -> manifest
engineering -> engin
opposite -> opposit
described -> describ
economics -> econom
management -> manag
science -> scienc
fields -> field
practice -> practic
analysis -> analysi
generally -> gener
considered -> consid
brings -> bring
together -> togeth
various -> variou
ideas -> idea
industrial -> industri
innovations -> innov
created -> creat
empirically -> em
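If stemming is adopted, the counts of inflected forms would have to be merged under their stem. A sketch using four counts and three stems taken from the outputs above; `merge_by_stem` is a hypothetical helper, and the lookup table stands in for `stemmer.stem` so that the snippet runs without NLTK:

```python
from collections import Counter

def merge_by_stem(freq, stem):
    """Sum the counts of all words that share a stem."""
    merged = Counter()
    for word, count in freq.items():
        merged[stem(word)] += count
    return merged

counts = {"innovation": 697, "innovations": 106, "idea": 90, "ideas": 127}
stems = {"innovation": "innov", "innovations": "innov", "ideas": "idea"}

# Words without an entry in the table are treated as their own stem.
merged = merge_by_stem(counts, lambda w: stems.get(w, w))
print(merged)
```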

## 9. Computing the representation power of common words

In [37]:
# Combine the two distributions into one score per common word.
makerness = {}
# common_words = [w[0] for w in common_words]
for w in common_words:
    # Consider only words longer than one character.
    if len(w) > 1:
        # Score: log of the ratio between the word's relative frequency in the
        # input corpus and its relative frequency in the reference corpus.
        score = log((input_freq_dist[w] / n_input) / (english_freq_dist[w] / n_english))
        makerness[w] = (score, input_freq_dist[w])

In [38]:
# Sorting by scores:
for k,v in sorted(makerness.items(), key=lambda x:x[1][0], reverse=True): print(v[0],k,v[1])

7.154027264577092 innovation 697
6.444972300982508 creativity 441
5.830296735864057 innovations 106
5.548884276425872 inventions 80
5.41535288380135 global 70
5.206393967479096 organizational 71
5.176208991140698 domain 124
5.174190826984462 disruptive 55
4.951047275670252 user 44
4.7657614990475805 technology 393
4.62689552343708 empire 175
4.590033930132921 eric 23
4.5680550234141455 researchers 30
4.5546320030820056 dynasty 37
4.5546320030820056 citation 37
4.545582167562087 researcher 22
4.470074615053942 template 34
4.3989786933702115 expertise 19
4.3857334666201915 millennium 25
4.372310446288051 digital 37
4.363260610768132 technological 110
4.357650497877366 invented 79
4.344911472099937 2nd 18
4.31674059513324 users 35
4.3110099204242545 forum 58
4.287753058259987 disruption 17
4.2579000951103065 inventors 22
4.192148717547526 invention 103
4.162589915305982 agenda 25
4.140117059453923 paradox 44
4.137272107321691 hypothetical 39
4.090677333715695 creative 228
4.07557853831635
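The score is the natural log of the ratio between a word's relative frequency in the input corpus and in the reference corpus, so a positive value means the word is over-represented in the input. As a check, the totals from section 7.1 and the input count of `innovation` reproduce the top score if the word's Brown count is taken to be 7. That count is inferred from the printed score; the notebook does not show it. `log_ratio` is a hypothetical helper name.

```python
from math import log

def log_ratio(count_input, n_input, count_reference, n_reference):
    """Log of (relative frequency in input) / (relative frequency in reference)."""
    return log((count_input / n_input) / (count_reference / n_reference))

n_input, n_english = 52891, 679519   # totals printed in section 7.1
score = log_ratio(697, n_input, 7, n_english)  # 7 is an inferred Brown count
print(score)
```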

In [39]:
OUTPUT_FOLDER = "./output/"
csvfile_name = OUTPUT_FOLDER + "makerness_" + output_fname + ".csv"
with open(csvfile_name, 'w', newline='') as csvfile:
    thewriter = csv.writer(csvfile, delimiter=',')
    for k,v in sorted(makerness.items(), key=lambda x:x[1][0], reverse=True):
        thewriter.writerow([k,v[0],v[1]])

In [40]:
print(csvfile_name)

./output/makerness_Innovation.csv
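The CSV has no header row, so a reader has to know the column order. A sketch that writes one, using an in-memory buffer and two scores copied from the output above; the column names `term`, `score`, and `count` are suggestions only.

```python
import csv
import io

rows = {"innovation": (7.154027264577092, 697),
        "creativity": (6.444972300982508, 441)}

buffer = io.StringIO(newline="")  # stands in for open(path, "w", newline="")
writer = csv.writer(buffer)
writer.writerow(["term", "score", "count"])  # suggested header, not in the original file
for term, (score, count) in sorted(rows.items(), key=lambda x: x[1][0], reverse=True):
    writer.writerow([term, score, count])

csv_text = buffer.getvalue()
print(csv_text)
```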
