# Discovery and Representation of Open Making Related Terms

This notebook sketches an initial exercise in discovering keywords related to open making. The input text is harvested by a web crawler that identifies and crawls semantically related Wikipedia articles.

In [1]:
from utils import tokenizer
import nltk
from nltk import FreqDist
from math import log
import json, csv

## 1. Loading a reference English language corpus

In [2]:
from nltk.corpus import brown
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

## 2. Stop words

### 2.1 Standard stop words

In [3]:
with open("data/stopwords_standard.txt", "r") as f:
    STOP_WORDS_STANDARD = set(f.read().strip().split("\n"))
print(STOP_WORDS_STANDARD)

{'had', 'doing', 'between', 'get', 'if', "i've", 'not', 'than', "they've", "wasn't", 'we', "there's", 'on', 'as', 'ourselves', 'each', 'there', 'a', "he'll", "you'd", 'off', 'so', 'can', 'his', 'hers', 'more', 'ours ', 'its', "aren't", 'under', "won't", 'some', 'from', 'like', 'that', "you'll", 'my', 'into', 'or', "it's", "isn't", 'who', 'he', 'they', "we'd", 'know', 'once', "i'm", "can't", "shan't", 'an', "she'd", 'then', "didn't", "here's", 'until', 'com', 'same', 'such', 'here', 'our', "haven't", "she'll", 'these', "they'd", 'with', "what's", 'http', "we're", 'she', 'theirs', 'nor', 'am', 'because', 'yourselves', 'having', 'i', 'do', 'what', 'where', 'itself', 'most', 'before', "couldn't", 'herself', 'down', 'during', 'by', 'up', 'was', 'this', 'of', "who's", 'about', "i'll", "where's", 'when', 'themselves', "wouldn't", 'himself', "hasn't", "they'll", 'could', "that's", 'few', 'again', 'are', "they're", 'too', 'own', "you've", 'ought', 'be', 'above', 'did', 'no', 'those', 'whom', 'w
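
The printed set contains the entry `'ours '` with a trailing space, which suggests stray whitespace on that line of the file. A minimal sketch of a loader that trims each line and skips blank ones follows; the helper name `load_stopwords` and the demo file are hypothetical and not part of the notebook or its `utils` module.

```python
import os
import tempfile


def load_stopwords(path):
    """Read one stop word per line, trimming whitespace and skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


# Demo on a throwaway file that mimics the stray trailing space.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "stopwords_demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("ours \nits\n\nunder\n")
    demo_stopwords = load_stopwords(demo_path)

print(sorted(demo_stopwords))
```

With such a loader, `'ours'` would match the token `ours`, which the entry `'ours '` cannot.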

### 2.2 Open-making related stop words

In [4]:
with open("data/stopwords_openmaker.txt", "r") as f:
    STOP_WORDS_OPENMAKER = set(f.read().strip().split("\n"))
print(STOP_WORDS_OPENMAKER)

{'many', 'one', 'also', 'may', 'almost', 'often', 'well'}


## 3. Removing stop words from the reference English corpus

In [5]:
# Merge the two stop-word sets.
STOP_WORDS = STOP_WORDS_STANDARD.union(STOP_WORDS_OPENMAKER)
print(STOP_WORDS)

{'had', "i've", "there's", 'on', 'ourselves', "he'll", "you'd", 'can', 'hers', 'ours ', "aren't", 'under', "won't", 'from', 'like', 'that', "you'll", 'my', 'or', "it's", 'they', 'once', "can't", "shan't", 'an', "here's", 'often', 'such', 'our', "she'll", 'these', 'theirs', 'nor', 'am', 'because', 'where', 'before', 'herself', 'was', 'by', 'this', 'about', 'himself', 'could', "that's", 'few', 'again', 'are', "they're", 'own', 'be', 'no', "let's", "we've", "she's", 'how', 'you', 'which', "shouldn't", 'the', 'is', 'him', 'yourself', 'your', 'both', 'after', 'below', 'cannot', 'to', 'r', 'other', 'well', "mustn't", "hadn't", 'further', 'but', 'all', 'have', 'why', "he'd", 'being', "weren't", "how's", 'at', "why's", 'just', 'and', 'should', 'doing', 'between', 'get', 'if', 'not', 'than', "they've", "wasn't", 'we', 'as', 'each', 'there', 'a', 'off', 'so', 'his', 'may', 'almost', 'more', 'its', 'some', 'into', "isn't", 'who', 'he', "we'd", 'know', "i'm", "she'd", 'then', "didn't", 'until', 'c

In [6]:
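# Load the words of the Brown corpus, lowercased, with stop words removed.
# Note: the stop-word test runs on the original casing, so capitalized forms
# such as "The" pass the filter and are counted under their lowercase form.
english_freq_dist = FreqDist([w.lower() for w in nltk.corpus.brown.words()
                              if w not in STOP_WORDS])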
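
Because the filter in cell [6] tests `w` before it is lowercased, a capitalized stop word such as "The" is not found in `STOP_WORDS` and ends up counted as "the". A minimal sketch of a variant that lowercases first is shown here on a toy word list with `collections.Counter` (the class `FreqDist` builds on), so it runs without the Brown corpus. The function name `count_non_stopwords` and the toy data are illustrative assumptions. Applying this variant to the notebook would change the reference counts reported later.

```python
from collections import Counter


def count_non_stopwords(words, stop_words):
    """Count lowercased words, testing the lowercased form against stop_words."""
    lowered = (w.lower() for w in words)
    return Counter(w for w in lowered if w not in stop_words)


toy_words = ["The", "maker", "and", "the", "Maker", "space"]  # hypothetical sample
toy_stop_words = {"the", "and"}

toy_counts = count_non_stopwords(toy_words, toy_stop_words)
print(toy_counts)
```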

## 4. Removing rare words

Below we remove rare words from the reference distribution before computing total counts. The code keeps only words that occur more than twice (frequency above 2).

In [7]:
english_freq_dist = {k:v for k,v in english_freq_dist.items() if v > 2}

## 5. Loading the input Open Maker corpus

In [8]:
# Load the text harvested from Wikipedia.
with open("data/wikipedia.json", "r") as f:
    OM_Corpus = json.load(f)

In [9]:
# The total number of wiki articles used:
print(len(OM_Corpus))

152


In [10]:
# Field names of each article record in the corpus.
OM_Corpus[0].keys()

dict_keys(['theme.id', 'title', 'url', 'depth', 'text'])

In [11]:
def display_articles(tid):
    """Print crawl depth, title and URL of every article whose theme.id equals tid."""
    articles = [article for article in OM_Corpus if article['theme.id'] == tid]
    for article in articles:
        print(article['depth'], article['title'], article['url'])

In [12]:
display_articles(0)

0 Do it yourself https://en.wikipedia.org/wiki/Do_it_yourself
1 Edupunk https://en.wikipedia.org/wiki/Edupunk
1 Prosumer https://en.wikipedia.org/wiki/Prosumer
1 How-to https://en.wikipedia.org/wiki/How-to
1 Kludge https://en.wikipedia.org/wiki/Kludge
1 Bricolage https://en.wikipedia.org/wiki/Bricolage
1 Junk box https://en.wikipedia.org/wiki/Junk_box
1 Number 8 wire https://en.wikipedia.org/wiki/Number_8_wire
1 Ready-to-assemble furniture https://en.wikipedia.org/wiki/Ready-to-assemble_furniture
1 Open design https://en.wikipedia.org/wiki/Open_Design
1 Hackerspace https://en.wikipedia.org/wiki/Hackerspace
1 Instructables https://en.wikipedia.org/wiki/Instructables
1 Handyman https://en.wikipedia.org/wiki/Handyman
1 Circuit bending https://en.wikipedia.org/wiki/Circuit_bending
1 Project GreenWorld International https://en.wikipedia.org/wiki/Project_GreenOman
1 3D printing https://en.wikipedia.org/wiki/3D_printing


In [13]:
display_articles(1)

0 Open design https://en.wikipedia.org/wiki/Open_design
1 Knowledge commons https://en.wikipedia.org/wiki/Knowledge_commons
1 Open Source Ecology https://en.wikipedia.org/wiki/Open_Source_Ecology
1 Computer-aided design https://en.wikipedia.org/wiki/Computer-aided_design
1 Open Source Initiative https://en.wikipedia.org/wiki/Open_Source_Initiative
1 Open Architecture Network https://en.wikipedia.org/wiki/Open_Architecture_Network
1 Open-source architecture https://en.wikipedia.org/wiki/Open-source_architecture
1 Commons-based peer production https://en.wikipedia.org/wiki/Commons-based_peer_production
1 Open standard https://en.wikipedia.org/wiki/Open_standard
1 OpenCores https://en.wikipedia.org/wiki/OpenCores
1 Co-creation https://en.wikipedia.org/wiki/Co-creation
1 OpenBTS https://en.wikipedia.org/wiki/OpenBTS
1 Open manufacturing https://en.wikipedia.org/wiki/Open_manufacturing
1 Open-source hardware https://en.wikipedia.org/wiki/Open-source_hardware
1 Open source appropriate techno

In [14]:
display_articles(2)

0 Sustainability https://en.wikipedia.org/wiki/Sustainability
1 Sustainability standards and certification https://en.wikipedia.org/wiki/Sustainability_standards_and_certification
1 Appropriate technology https://en.wikipedia.org/wiki/Appropriate_technology
1 Sustainable development https://en.wikipedia.org/wiki/Sustainable_development
1 Environmental issue https://en.wikipedia.org/wiki/Environmental_issue
1 World Cities Summit https://en.wikipedia.org/wiki/World_Cities_Summit
1 Ecopsychology https://en.wikipedia.org/wiki/Ecopsychology
1 Book:Sustainability https://en.wikipedia.org/wiki/Book:Sustainability
1 Sustainable design https://en.wikipedia.org/wiki/Sustainable_design
1 Circles of Sustainability https://en.wikipedia.org/wiki/Circles_of_Sustainability
1 Sustainability science https://en.wikipedia.org/wiki/Sustainability_science
1 Sustainable living https://en.wikipedia.org/wiki/Sustainable_living
1 Index of sustainability articles https://en.wikipedia.org/wiki/List_of_sustainabil

In [15]:
display_articles(3)

0 Maker culture https://en.wikipedia.org/wiki/Maker_culture
1 Modular design https://en.wikipedia.org/wiki/Modular_design
1 Open-source car https://en.wikipedia.org/wiki/Open-source_car
1 Electric vehicle conversion https://en.wikipedia.org/wiki/Electric_vehicle_conversion
1 Thingiverse https://en.wikipedia.org/wiki/Thingiverse
1 Fab lab https://en.wikipedia.org/wiki/Fab_Lab_(fabrication_laboratory)
1 SparkFun Electronics https://en.wikipedia.org/wiki/SparkFun
1 RepRap project https://en.wikipedia.org/wiki/RepRap
1 Distributed manufacturing https://en.wikipedia.org/wiki/Distributed_manufacturing
1 Craft production https://en.wikipedia.org/wiki/Craft_production
1 Autonomous building https://en.wikipedia.org/wiki/Autonomous_building
1 Open-source hardware https://en.wikipedia.org/wiki/Open_source_hardware
1 Kit car https://en.wikipedia.org/wiki/Kit_car


In [16]:
display_articles(4)

0 Innovation https://en.wikipedia.org/wiki/Innovation
1 Competitive intelligence https://en.wikipedia.org/wiki/Creative_competitive_intelligence
1 Multiple discovery https://en.wikipedia.org/wiki/Multiple_discovery
1 UNDP Innovation Facility https://en.wikipedia.org/wiki/UNDP_Innovation_Facility
1 Open Innovations (event) https://en.wikipedia.org/wiki/Open_Innovations_(Forum_and_Technology_Show)
1 Trans-cultural diffusion https://en.wikipedia.org/wiki/Diffusion_(anthropology)
1 Individual capital https://en.wikipedia.org/wiki/Individual_capital
1 Innovation system https://en.wikipedia.org/wiki/Innovation_system
1 Public domain https://en.wikipedia.org/wiki/Public_domain
1 Ingenuity https://en.wikipedia.org/wiki/Ingenuity
1 Sustainable Development Goals https://en.wikipedia.org/wiki/Sustainable_Development_Goals
1 Participatory design https://en.wikipedia.org/wiki/Participatory_design
1 Innovation management https://en.wikipedia.org/wiki/Innovation_management
1 Information revolution ht

In [17]:
display_articles(5)

0 Collaboration https://en.wikipedia.org/wiki/Collaboration
1 Wikinomics https://en.wikipedia.org/wiki/Wikinomics
1 Collaborative editing https://en.wikipedia.org/wiki/Collaborative_editing
1 Telepresence https://en.wikipedia.org/wiki/Telepresence
1 Knowledge management https://en.wikipedia.org/wiki/Knowledge_management
1 The Culture of Collaboration https://en.wikipedia.org/wiki/The_Culture_of_Collaboration
1 Collaborative governance https://en.wikipedia.org/wiki/Collaborative_governance
1 Community film https://en.wikipedia.org/wiki/Community_film
1 Collaborative innovation network https://en.wikipedia.org/wiki/Collaborative_innovation_network
1 Design thinking https://en.wikipedia.org/wiki/Design_thinking
1 Role-based collaboration https://en.wikipedia.org/wiki/Role-based_collaboration
1 Intranet portal https://en.wikipedia.org/wiki/Intranet_portal
1 Critical thinking https://en.wikipedia.org/wiki/Critical_thinking
1 Facilitation (business) https://en.wikipedia.org/wiki/Facilitation

## 6. Analyzing a specific corpus based on a theme

In [18]:
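def get_title(Corpus, theme_id):
    """Return the title of the first article found for theme_id, or '' if there is none.

    In the listings above, the first article of each theme is its depth-0 seed article.
    """
    title = ''
    for article in Corpus:
        if article['theme.id'] == theme_id:
            title = article['title']
            break
    return title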

### 6.0 Selecting the specific theme (a sub-corpus)

In [19]:
# To analyze a different sub-corpus, set the corresponding theme ID.
current_theme_id = 1

In [20]:
current_title = get_title(OM_Corpus, current_theme_id)

In [21]:
output_fname = "_".join([word.capitalize() for word in current_title.split(" ")])
print(current_title, "::", output_fname)

Open design :: Open_Design


In [22]:
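# Note: theme.id 0 corresponds to "Do it yourself".
# The filter below matches on the title, so it gathers every record titled like the
# theme's seed article; an article crawled under several themes is included once per record.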
input_text = " ".join([page['text'] for page in OM_Corpus if page['title'] == current_title])
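
The selection above matches on `title`, so it picks up only the records titled like the seed article. If the intent is to analyze the whole sub-corpus of a theme, one option is to filter on `theme.id` and skip URLs already seen. The sketch below assumes the record fields listed in cell [10]. The helper `theme_text`, its `max_depth` parameter and the toy records are hypothetical, and using it would change every count reported later.

```python
def theme_text(corpus, theme_id, max_depth=None):
    """Join the text of all articles in a theme, skipping URLs already seen."""
    seen_urls = set()
    parts = []
    for article in corpus:
        if article["theme.id"] != theme_id:
            continue
        if max_depth is not None and article["depth"] > max_depth:
            continue
        if article["url"] in seen_urls:
            continue
        seen_urls.add(article["url"])
        parts.append(article["text"])
    return " ".join(parts)


# Toy records with the same fields as OM_Corpus; titles, URLs and texts are made up.
toy_corpus = [
    {"theme.id": 0, "title": "Seed A", "url": "https://example.org/a", "depth": 0, "text": "alpha"},
    {"theme.id": 0, "title": "Linked B", "url": "https://example.org/b", "depth": 1, "text": "beta"},
    {"theme.id": 1, "title": "Linked B", "url": "https://example.org/b", "depth": 0, "text": "beta"},
    {"theme.id": 1, "title": "Linked C", "url": "https://example.org/c", "depth": 1, "text": "gamma"},
]

whole_theme = theme_text(toy_corpus, 1)             # seed plus linked articles
seed_only = theme_text(toy_corpus, 1, max_depth=0)  # seed article only
print(whole_theme, "|", seed_only)
```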

In [23]:
print(input_text)

Open design 
 RepRap 
 general-purpose 3D printer that not only could be used to make structures and functional components for open-design projects but is an open-source project itself 
 Uzebox is an open-design video game console 
 Bug Labs 
 open source hardware 
 Zoybar 
 open source guitar kit With 3-D printed body 
 Open design 
 is the development of physical products machines and systems through use of publicly shared design information Open design involves the making of both 
 free and open-source software 
 FOSS as well as 
 open-source hardware 
 The process is generally facilitated by the Internet and often performed without monetary compensation The goals and philosophy are identical to that of the 
 open-source movement 
 but are implemented for the development of physical products rather than software 
 Open design is a form of 
 co-creation 
 where the final product is designed by the users rather than an external stakeholder such as a private company 
 Sources of the op

In [24]:
# Tokenizing the input text:
tokenized = tokenizer.tokenize_words(input_text)
number_of_words = len(tokenized)
print(number_of_words, current_title)

2150 Open design


### 6.1 Computing the frequency distribution of tokens (words, terms, punctuation, etc.)

In [25]:
input_freq_dist = FreqDist(tokenized)

In [26]:
input_freq_dist.most_common(20)

[('\n', 270),
 ('the', 90),
 ('open', 83),
 ('design', 63),
 ('and', 60),
 ('of', 60),
 ('to', 44),
 ('open-source', 30),
 ('a', 30),
 ('in', 28),
 ('source', 26),
 ('software', 26),
 ('for', 24),
 ('is', 24),
 ('open-design', 20),
 ('an', 18),
 ('hardware', 18),
 ('are', 18),
 ('as', 16),
 ('movement', 16)]

### 6.2 Removing punctuation and stopwords from the input corpus

In [27]:
for stopword in STOP_WORDS:
    if stopword in input_freq_dist:
        del input_freq_dist[stopword]
        
for punctuation in tokenizer.CHARACTERS_TO_SPLIT:
    if punctuation in input_freq_dist:
        del input_freq_dist[punctuation]

# Re-check the most common words after cleaning:
input_freq_dist.most_common(80)

[('open', 83),
 ('design', 63),
 ('open-source', 30),
 ('source', 26),
 ('software', 26),
 ('open-design', 20),
 ('hardware', 18),
 ('movement', 16),
 ('development', 12),
 ('organizations', 12),
 ('3d', 10),
 ('projects', 10),
 ('free', 10),
 ('physical', 8),
 ('designs', 8),
 ('cad', 8),
 ('project', 6),
 ('machine', 6),
 ('compared', 6),
 ('manufacturing', 6),
 ('dr', 6),
 ('engineering', 6),
 ('technology', 6),
 ('currently', 6),
 ('developing', 6),
 ('cost', 6),
 ('effort', 6),
 ('modular', 6),
 ('used', 4),
 ('products', 4),
 ('use', 4),
 ('shared', 4),
 ('information', 4),
 ('making', 4),
 ('without', 4),
 ('rather', 4),
 ('co-creation', 4),
 ('company', 4),
 ('sources', 4),
 ('current', 4),
 ('directions', 4),
 ('sharing', 4),
 ('knowledge', 4),
 ('principles', 4),
 ('related', 4),
 ('established', 4),
 ('definition', 4),
 ('potential', 4),
 ('together', 4),
 ('several', 4),
 ('university', 4),
 ('ecology', 4),
 ('two', 4),
 ('hand', 4),
 ('people', 4),
 ('time', 4),
 ('common'

### 6.3 Removing rare words from input distribution

In [28]:
input_freq_dist = {k:v for k,v in input_freq_dist.items() if v > 1}

## 7. Comparing input vs English corpus volumes

### 7.1 Total words (after cleaning) 

In [29]:
n_input = sum(input_freq_dist.values())
n_english = sum(english_freq_dist.values())
n_input, n_english

(1240, 679519)

### 7.2 Unique words (after cleaning)

In [30]:
n_unique_word_input = len(input_freq_dist.items())
n_unique_word_brown = len(english_freq_dist.items())
n_unique_word_input, n_unique_word_brown

(377, 20591)

### 7.3 Cleaned set of input words/terms

The cleaned list of words in the selected corpus is shown below for visual inspection. Such inspections will be used to improve both tokenization and filtering.

In [31]:
input_freq_dist

{'18th': 2,
 '19th': 2,
 '3-d': 2,
 '3d': 10,
 'access': 2,
 'advanced': 2,
 'aggressive': 2,
 'aguaclara': 2,
 'alike': 2,
 'alliance': 4,
 'alternative': 2,
 'although': 2,
 'applications': 2,
 'apply': 2,
 'appropedia': 2,
 'appropriate': 4,
 'architecture': 4,
 'area': 2,
 'artefact': 2,
 'article': 2,
 'attribution': 2,
 'augustin': 2,
 'available': 2,
 'back': 2,
 'barriers': 2,
 'basis': 2,
 'benefit': 2,
 'benefits': 2,
 'beyond': 2,
 'body': 2,
 'bruce': 2,
 'bug': 2,
 'cad': 8,
 'cases': 2,
 'cells': 2,
 'centralized': 2,
 'century': 2,
 'certain': 2,
 'cheaper': 2,
 'child': 4,
 'circuits': 2,
 'closely': 2,
 'co-creation': 4,
 'code': 2,
 'coined': 2,
 'collaborate': 4,
 'collaborative': 4,
 'come': 2,
 'commercial': 2,
 'common': 4,
 'commons': 4,
 'commons-based': 2,
 'community': 2,
 'company': 4,
 'compared': 6,
 'compensation': 2,
 'complexity': 2,
 'components': 2,
 'computer': 2,
 'computer-controlled': 2,
 'concept': 2,
 'consisting': 2,
 'console': 2,
 'constructio

### 7.4 Set of terms/words that occur in both corpora

In [32]:
common_words = list(input_freq_dist.keys() & english_freq_dist.keys())
print(len(common_words))

293


In [33]:
for w in common_words: print(w)

concept
centralized
foundation
places
hardware
provide
emerged
unrelated
creation
e
trend
help
designers
systems
everyone
individuals
testing
skills
photographs
group
game
mechanical
motors
local
industrial
compensation
late
needing
generally
similar
nothing
policies
bruce
external
appropriate
eric
company
might
reference
effort
end
sharing
projects
farming
licensed
models
required
trends
network
modular
expression
g
development
title
coined
per
tim
century
design
mechanism
innovation
value
cost
respects
certain
digital
benefits
period
miscellaneous
great
bug
rather
hand
use
put
compared
media
corporation
complexity
closely
widely
significantly
private
benefit
movements
groups
electronics
principles
available
framework
physical
resource
turbine
time
understood
country
early
19th
printer
reduced
university
members
technology
involve
give
beyond
thesis
established
engineering
community
creative
free
designs
tools
traced
two
overcome
duplication
computer
article
basis
performed
process
re

### 7.5 Set of terms/words that occur in the sample but not in the reference corpus

TO BE EXAMINED: This set still needs to be incorporated into the scoring. It may capture the specificity of the content to a great extent, so each word in it needs to be assigned a mapping score.

In [34]:
input_specifics = dict()
for w in input_freq_dist.keys() - english_freq_dist.keys():
    input_specifics[w] = input_freq_dist[w]
    print(w)

littlebits
wikispeed
organisation
zoybar
initiatives
lamberts
commons-based
elektor
uzebox
co-creation
laptop
formalized
kadushin
open-design
sensorica
3-d
printable
funding
opencores
delft
kit
high-tech
aguaclara
dr
open-source
ohanda
copyleft
kiani
thingiverse
openbts
reprap
implemented
internet
nayfeh
sepehr
collaborative
odf
zoetrope
nascent
osp
stakeholder
grid
perens
unites
cad
customized
3d
computer-controlled
ronen
instructables
manifesto
openstructures
labs
augustin
openbook
attribution
website
vallance
o'reilly
appropedia
mit
reinoud
netbook
funded
technologies
facilitated
visualisation
phd
fab
software
collaborate
geometrical
artefact
usefully
sustainable
samir
general-purpose
ecology
standardization
unported
ecological
video
console
patenting


In [35]:
print(len(input_specifics))

84
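
Sections 7.4 and 7.5 split the input vocabulary into two disjoint parts using set operations on dictionary key views. With the notebook's figures, the 293 shared words plus the 84 input-only words account for all 377 unique input words reported in 7.2. A small sketch of the same partition on toy dictionaries follows; all counts in it are stand-ins and the variable names are illustrative.

```python
# Toy stand-ins for input_freq_dist and english_freq_dist (counts are made up).
input_counts = {"design": 63, "hardware": 18, "reprap": 2, "cad": 8}
reference_counts = {"design": 40, "hardware": 11, "time": 1500}

# Key views behave like sets: & gives the intersection, - the difference.
shared = input_counts.keys() & reference_counts.keys()
input_only = input_counts.keys() - reference_counts.keys()

print(sorted(shared), sorted(input_only))

# The two parts never overlap and together cover the whole input vocabulary.
partition_is_complete = len(shared) + len(input_only) == len(input_counts)
print(partition_is_complete)
```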


## 8. Stemming (if needed)

In [36]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
# Print only the words that the Porter stemmer changes.
for k in input_freq_dist:
    stemmed = stemmer.stem(k)
    if stemmed != k:
        print(k, "->", stemmed)

general-purpose -> general-purpos
used -> use
structures -> structur
functional -> function
components -> compon
projects -> project
open-source -> open-sourc
console -> consol
labs -> lab
source -> sourc
hardware -> hardwar
printed -> print
body -> bodi
development -> develop
physical -> physic
products -> product
machines -> machin
systems -> system
publicly -> publicli
shared -> share
information -> inform
involves -> involv
making -> make
software -> softwar
generally -> gener
facilitated -> facilit
performed -> perform
monetary -> monetari
compensation -> compens
goals -> goal
philosophy -> philosophi
identical -> ident
implemented -> implement
co-creation -> co-creat
designed -> design
users -> user
external -> extern
stakeholder -> stakehold
private -> privat
company -> compani
sources -> sourc
directions -> direct
machine -> machin
compared -> compar
organizations -> organ
sharing -> share
manufacturing -> manufactur
traced -> trace
century -> centuri
aggressive -> aggress
pate
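
The output above shows several inflected forms collapsing onto one stem, for example `source` and `sources` onto `sourc`, and `machine` and `machines` onto `machin`. If stemming is adopted, the counts of such forms could be merged before scoring. The sketch below is one way to do that, not a step the notebook performs. `merge_by_stem` is a hypothetical helper, a dictionary lookup stands in for `PorterStemmer.stem` so the block runs without NLTK, and the count for `machines` is made up.

```python
from collections import Counter


def merge_by_stem(freq, stem):
    """Sum the counts of all words that share the same stem."""
    merged = Counter()
    for word, count in freq.items():
        merged[stem(word)] += count
    return merged


# Stems copied from the stemmer output above.
known_stems = {"source": "sourc", "sources": "sourc",
               "machine": "machin", "machines": "machin"}

# Counts for source, sources and machine come from section 6.2; machines is an assumption.
toy_freq = {"source": 26, "sources": 4, "machine": 6, "machines": 2}

merged = merge_by_stem(toy_freq, lambda w: known_stems.get(w, w))
print(merged)
```

Passing `stemmer.stem` as the `stem` argument would apply the same merge with the real Porter stemmer.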

## 9. Computing the representation power of common words

In [37]:
# Score each word shared by the input and reference corpora.
makerness = {}
for w in common_words:
    # Consider only words longer than one character.
    if len(w) > 1:
        # Log-ratio score: log of the word's relative frequency in the input
        # divided by its relative frequency in the reference corpus.
        score = log((input_freq_dist[w] / n_input) / (english_freq_dist[w] / n_english))
        makerness[w] = (score, input_freq_dist[w])

In [38]:
# Sorting by scores:
for k,v in sorted(makerness.items(), key=lambda x:x[1][0], reverse=True): print(v[0],k,v[1])

6.798750300417696 hardware 18
6.711738923428066 modular 6
6.306273815319902 commons 4
5.900808707211738 eric 2
5.900808707211738 coined 2
5.900808707211738 printer 2
5.900808707211738 foss 2
5.900808707211738 fledgling 2
5.900808707211738 non-profit 2
5.900808707211738 lab 2
5.900808707211738 portal 2
5.713210093316939 design 63
5.613126634759957 bruce 2
5.613126634759957 bug 2
5.613126634759957 repository 2
5.613126634759957 circuits 2
5.613126634759957 modifying 2
5.613126634759957 visually 2
5.389983083445747 needing 2
5.389983083445747 miscellaneous 4
5.389983083445747 cornell 2
5.389983083445747 virtual 2
5.389983083445747 voiced 2
5.294672903641422 architecture 4
5.207661526651792 licensed 2
5.207661526651792 digital 2
5.207661526651792 turbine 2
5.207661526651792 focusing 2
5.207661526651792 users 2
5.053510846824533 unrelated 2
5.053510846824533 innovation 2
5.053510846824533 designs 8
5.02107557107138 source 26
4.963063040336323 open 83
4.9199794542000115 19th 2
4.919979454200
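
As a worked check of the score, the sketch below recomputes the value listed above for `hardware` from the totals in section 7.1. The reference count of 11 is an assumption: it is not shown in the notebook and is chosen because it reproduces the printed score. The same function takes an optional pseudo-count `alpha` (additive smoothing), which is one possible way, not the notebook's method, to give a finite score to the words of section 7.5 that are absent from the reference corpus. The function name and its parameters are illustrative.

```python
from math import log


def log_ratio(c_in, n_in, c_ref, n_ref, alpha=0.0, vocab_size=0):
    """Log of (relative frequency in input) / (relative frequency in reference).

    alpha > 0 adds a pseudo-count to the reference side (additive smoothing),
    so a word with c_ref == 0 still gets a finite score.
    """
    ref_rel = (c_ref + alpha) / (n_ref + alpha * vocab_size)
    if ref_rel <= 0:
        raise ValueError("reference frequency is zero; use alpha > 0")
    return log((c_in / n_in) / ref_rel)


N_INPUT, N_ENGLISH = 1240, 679519  # totals from section 7.1
N_UNIQUE_BROWN = 20591             # unique reference words from section 7.2

# 'hardware': input count 18; the reference count 11 is an ASSUMPTION (see lead-in).
hardware_score = log_ratio(18, N_INPUT, 11, N_ENGLISH)

# 'open-source': input count 30; absent from the filtered reference, so c_ref is taken as 0.
open_source_score = log_ratio(30, N_INPUT, 0, N_ENGLISH,
                              alpha=1.0, vocab_size=N_UNIQUE_BROWN)

print(hardware_score, open_source_score)
```

With `alpha=0` the function reduces to the expression used in cell [37]. The value of `alpha` controls how strongly unseen words are favored and would need tuning against the visual inspections mentioned in section 7.3.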

In [39]:
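import os

OUTPUT_FOLDER = "./output/"
os.makedirs(OUTPUT_FOLDER, exist_ok=True)  # create the folder if it does not exist yet
csvfile_name = OUTPUT_FOLDER + "makerness_" + output_fname + ".csv"
# newline="" lets the csv module control line endings (avoids blank rows on Windows).
with open(csvfile_name, "w", newline="") as csvfile:
    thewriter = csv.writer(csvfile, delimiter=",")
    # One row per word: word, score, input frequency; highest score first.
    for k, v in sorted(makerness.items(), key=lambda x: x[1][0], reverse=True):
        thewriter.writerow([k, v[0], v[1]])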