# Bocoup Keywords

## Finding key terms on every page of Bocoup.com using tf-idf

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

### 1. Download bocoup.com: 

### 2. Load Python libs

In [53]:
from __future__ import division     # Make Python 2 `/` do true division; must come before other imports

import os                           # Walking the file system and getting paths
import re                           # Regex
import sys                          # Getting the system-specific unicode range
import math                         # logarithms
import nltk                         # Tokenizing string
import unicodedata                  # Unicode character database, for building our exclude list

from bs4 import BeautifulSoup       # Parsing HTML
from collections import Counter     # dict subclass with handy features

### 3. Define Characters to Exclude (Punctuation)

We want to exclude most punctuation from our corpus for a couple of reasons:

- Most punctuation (e.g. ellipses, apostrophes) tells us nothing about a page's key terms.
- BeautifulSoup converts most HTML entities (usually used for punctuation, e.g. ellipses and "smart quotes") to Unicode characters. When we later try to convert this Unicode string to an ASCII string (which is nicer for nltk to parse), we get encoding errors.

In [54]:
punc = u''

# Loop through all characters in unicode, pulling out those that are punctuation
for i in xrange(sys.maxunicode + 1):  # + 1 so the last code point is included
    char = unichr(i)
    if unicodedata.category(char).startswith('P'): # Add exceptions here
        punc += char

exclude = set(punc)
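As a quick check of what this exclude set does, the sketch below rebuilds it and strips two sample strings. It is written for Python 3 (`chr` and `range` in place of `unichr` and `xrange`), and `strip_punctuation` and the sample strings are hypothetical, used only for illustration.

```python
import sys
import unicodedata

# Python 3 equivalent of the exclude set built above
exclude = {
    chr(i)
    for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
}


def strip_punctuation(text):
    """Drop every character whose Unicode category starts with 'P'."""
    return ''.join(ch for ch in text if ch not in exclude)


# Smart quotes, ellipses and ASCII punctuation are all removed
print(strip_punctuation('don\u2019t stop\u2026 \u201cever\u201d!'))

# Hyphens and periods are punctuation too, so compounds and filenames fuse
print(strip_punctuation('easily-discoverable application.js'))
```

The second call returns `easilydiscoverable applicationjs`, which is likely why fused tokens of that kind show up in the results in section 8. Exceptions such as the hyphen could be carved out at the `# Add exceptions here` comment if that matters.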

### 4. Set up containers for our word lists

In [55]:
# Count the number of documents that contain a token
token_document_count = Counter()

# Dict of site pages with counts of instances of each token on that page, keyed by document path
bocoup_com = {}

### 5. Recursively walk the site, tokenizing text content

In [76]:
# Total number of documents in the data set
num_documents = 0

# We're going to walk the local bocoup.com directory
walk_dir = 'bocoup.com'

# Walk
for root, subdirs, files in os.walk(walk_dir):

    # Every file represents a page of the site
    for filename in files:
        
        # TODO: skip this file (`continue`) if it isn't *.html
        
        # Add this file to the document count
        num_documents += 1

        # Get the path for this document
        file_path = os.path.join(root, filename)

        # Parse the html, closing the file handle once it has been read
        with open(file_path) as html_file:
            soup = BeautifulSoup(html_file, 'lxml')

        # Get rid of code examples, other tags whose text nodes would pollute our results
        for tag in soup.find_all(['code', 'head', 'iframe', 'pre', 'script', 'style']):
            tag.decompose()

        # Get contents of all text nodes, stripping extraneous whitespace
        clean_soup = soup.get_text(" ", strip=True).lower()
        
        # Remove URLs that appear in the text
        clean_soup = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', clean_soup, flags=re.M)

        # Remove excluded characters from string
        clean_soup = ''.join([ch for ch in clean_soup if ch not in exclude])
        
        # Break string into discrete tokens
        tokens = nltk.word_tokenize(clean_soup)

        # Create a Counter with the resulting tokens
        document_tokens = Counter(tokens)
            
        # Count one more document appearance for each unique token in this file
        for token in document_tokens:
            token_document_count[token] += 1
        
        # Save the token counts for this document in our dict of all documents
        # Remove [index].html from this key, so it more closely resembles the URL
        bocoup_com[file_path.replace("/index.html", "").replace(".html", "")] = document_tokens
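To see what the URL-removal step in the loop above actually strips, here is a small standalone sketch using the same pattern. `URL_PATTERN`, `strip_urls` and the sample sentences are hypothetical names and data for illustration only.

```python
import re

# Same pattern as in the loop above
URL_PATTERN = (
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]'
    r'|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)


def strip_urls(text):
    """Remove anything the pattern recognizes as a URL."""
    return re.sub(URL_PATTERN, '', text)


# A plain URL is removed up to the next space
print(repr(strip_urls('read more at https://bocoup.com/weblog today')))

# '#' is not in any of the character classes, so the match stops there
print(repr(strip_urls('see https://bocoup.com/weblog#comments now')))
```

In the second case the fragment (`#comments`) is left behind in the text, so words from URL fragments can still reach the tokenizer once the punctuation is stripped.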

### 6. Calculate Inverse Document Frequency for each term in the corpus

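The fewer documents a term appears in, the more "special" the term is. The IDF of a term is the natural logarithm of the total number of documents divided by the number of documents that contain the term.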

In [77]:
idf = {}

for token in token_document_count:
    idf[token] = math.log(num_documents / token_document_count[token])
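A toy corpus makes the effect of this formula easier to see. The sketch below is written for Python 3 (where `/` is already true division), and `toy_pages` and the other `toy_*` names are made-up data, not part of the site.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three "pages", already tokenized
toy_pages = {
    'page/a': ['jquery', 'training', 'class'],
    'page/b': ['jquery', 'plugin'],
    'page/c': ['jquery', 'training', 'robot'],
}

# Count how many pages contain each token (each page counts once per token)
toy_document_count = Counter()
for tokens in toy_pages.values():
    toy_document_count.update(set(tokens))

toy_num_documents = len(toy_pages)

# Same formula as above: log(total documents / documents containing the token)
toy_idf = {
    token: math.log(toy_num_documents / count)
    for token, count in toy_document_count.items()
}

for token, value in sorted(toy_idf.items(), key=lambda item: item[1]):
    print('{:10s} {:.3f}'.format(token, value))
```

A token that appears on every page (`jquery` here) gets an IDF of `log(1) = 0`, so its tf-idf is 0 no matter how often it is used. Tokens that appear on a single page get the highest IDF.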

### 7. Calculate tf-idf for each term in each document

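The tf-idf score for a term in a document is its term frequency in that document multiplied by its inverse document frequency. Rather than the raw count, the term frequency used below is scaled against the count of the most common term on the page: `0.5 + 0.5 * count / max_num`.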

In [78]:
tfidf = {}

for path in bocoup_com:
    
    # Use a Counter here for the sweet, sweet `most_common()` method we use below
    tfidf[path] = Counter()

    # Get the number of times the most common term on this page is used
    # We need this to normalize our frequency, below
    max_num = max(bocoup_com[path].values())

    # Calculate the tf-idf for each word in this document
    for word in bocoup_com[path]:
        
        # Normalize to prevent bias towards longer docs
        tf = 0.5 + 0.5 * bocoup_com[path][word] / max_num
        
        # The tf-idf for this word in this document is the frequency this word appears 
        # in the document multiplied by its IDF (rarity in the corpus)
        tfidf[path][word] = tf * idf[word]
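The normalized term frequency is worth a closer look on its own. The sketch below applies the same `0.5 + 0.5 * count / max` formula to made-up counts; `augmented_tf` and `toy_counts` are hypothetical names, and the code assumes Python 3 division.

```python
from collections import Counter


def augmented_tf(page_counts):
    """Scale each count against the most common term on the page."""
    max_count = max(page_counts.values())
    return {
        word: 0.5 + 0.5 * count / max_count
        for word, count in page_counts.items()
    }


# Hypothetical token counts for a single page
toy_counts = Counter({'the': 10, 'jquery': 4, 'robot': 1})
toy_tf = augmented_tf(toy_counts)

for word, value in sorted(toy_tf.items(), key=lambda item: -item[1]):
    print('{:8s} {:.2f}'.format(word, value))
```

With this scaling the term frequency always falls between 0.5 and 1.0, so a word used once still scores at least half as much as the page's most common word. That leaves the IDF doing most of the ranking, which may explain why the top terms in section 8 lean towards rare words.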

### 8. Check it out

In [79]:
for path in tfidf:
    print(path)
    for tup in tfidf[path].most_common(10):
        print('--{}'.format(tup[0].encode('utf-8')))

bocoup.com/education/classes/end-to-end-testing-and-continuous-integration.1
--bh
--photo
--innovating
--mastery
--firsthand
--creatively
--educators
--intensive
--indepth
--chicago
bocoup.com/about/bocouper/jasmin-jata
--ceramic
--rose
--kelleher
--cephalopod
--reveling
--biology
--scientific
--outlook
--unknown
--overlords
bocoup.com/weblog/training-jquery-san-francisco
--guy
--proceeds
--enhancing
--struggled
--dabbled
--traversing
--val
--primer
--299
--beginner
bocoup.com/weblog/git-workflow-walkthrough-reviewing-pull-requests
--yannicks
--biggiethat
--commenting
--submitter
--coder
--spot
--pretend
--merely
--accident
--occur
bocoup.com/weblog/tag/johnny-five
--milliamps
--paired
--measurement
--programmed
--electronic
--lipo
--easilydiscoverable
--motor
--pasting
--precious
bocoup.com/weblog/organizing-your-backbone-js-application-with-modules
--applicationjs
--memoizing
--applicationreadyjs
--buddy
--namespaces
--elegant
--helperjs
--naivejs
--mutable
--encapsulated
bocoup.com/

## Next Steps

- Nearest-neighbor similarity between pages
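
One way this next step could be approached is cosine similarity between the per-page tf-idf vectors. This is an assumption about the method, not something the notebook settles on. The sketch below uses made-up vectors in the same shape as `tfidf` (a dict of path to term scores); `cosine_similarity`, `nearest_neighbor` and `toy_tfidf` are hypothetical names.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-score dicts."""
    shared_terms = set(vec_a) & set(vec_b)
    dot = sum(vec_a[term] * vec_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(score * score for score in vec_a.values()))
    norm_b = math.sqrt(sum(score * score for score in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbor(path, vectors):
    """Return the other page whose vector is most similar to `path`."""
    others = [other for other in vectors if other != path]
    return max(
        others,
        key=lambda other: cosine_similarity(vectors[path], vectors[other]),
    )


# Hypothetical tf-idf scores, keyed by page path like `tfidf` above
toy_tfidf = {
    'page/a': {'jquery': 1.0, 'plugin': 0.5},
    'page/b': {'jquery': 0.9, 'plugin': 0.4},
    'page/c': {'robot': 1.2, 'motor': 0.8},
}

print(nearest_neighbor('page/a', toy_tfidf))
```

Pages that share no terms score 0, and a page compared with itself scores 1, so the function skips the page being queried.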