# Custom Sources - HTML

In [5]:
import nltk
from urllib import urlopen  # Python 2; under Python 3 use: from urllib.request import urlopen

Web pages are written in HTML, so when you download a page directly, you get all of the markup and scripts back along with the text.

In [6]:
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

In [7]:
html = urlopen(url).read()

In [8]:
html

'<!DOCTYPE html>\n<html lang="en" dir="ltr" class="client-nojs">\n<head>\n<meta charset="UTF-8" />\n<title>Python (programming language) - Wikipedia, the free encyclopedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>window.RLQ = window.RLQ || []; window.RLQ.push( function () {\nmw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Python_(programming_language)","wgTitle":"Python (programming language)","wgCurRevisionId":690626735,"wgRevisionId":690626735,"wgArticleId":23862,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Use dmy dates from August 2015","Wikipedia articles needing page number citations from January 2012","Articles with DMOZ links","Good articles","Class-based programming languages","Cross-platform free software","Dutch inventions"

We will use the Python library BeautifulSoup to strip away the HTML markup.

In [9]:
from bs4 import BeautifulSoup

In [10]:
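# Name the parser explicitly so the result does not depend on which parsers are installed.
web_str = BeautifulSoup(html, "html.parser").get_text()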

In [11]:
web_tokens = nltk.word_tokenize(web_str)

In [12]:
web_tokens[0:25]

[u'Python',
 u'(',
 u'programming',
 u'language',
 u')',
 u'-',
 u'Wikipedia',
 u',',
 u'the',
 u'free',
 u'encyclopedia',
 u'document.documentElement.className',
 u'=',
 u'document.documentElement.className.replace',
 u'(',
 u'/',
 u'(',
 u'^|\\s',
 u')',
 u'client-nojs',
 u'(',
 u'\\s|',
 u'$',
 u')',
 u'/']
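
The token list above shows JavaScript from the page's script tags mixed in with the title. One way to avoid that is to skip script and style elements while collecting text. The following is a minimal sketch that uses the Python 3 standard library's html.parser in place of BeautifulSoup; the names VisibleTextParser and visible_text are illustrative, and the sample string is hand-written rather than taken from the live page.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def visible_text(html):
    """Return the text of an HTML string without script or style content."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hand-written sample, loosely modeled on the page head shown earlier.
sample = (
    "<html><head><title>Python</title>"
    "<script>document.documentElement.className = 'x';</script></head>"
    "<body><p>Python is a language.</p></body></html>"
)
print(visible_text(sample))
```

visible_text expects a decoded string, so under Python 3 the bytes returned by urlopen(url).read() would need to be decoded first (the page above declares UTF-8).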

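With a little manual work we can find the main body of text. We search the extracted string for the sentence that opens the article. The find method returns the character offset of the match, or -1 if the sentence is not on the page.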

In [19]:
start = web_str.find("Python is a widely used general-purpose, high-level programming language.")

The first section of the Wikipedia entry ends with "CPython is managed by the non-profit Python Software Foundation."

In [20]:
end = web_str.find("CPython is managed by the non-profit Python Software Foundation.")

In [21]:
last_sent = len("CPython is managed by the non-profit Python Software Foundation.")

In [22]:
intro = web_str[start:end+last_sent]
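
The slicing steps above can be wrapped in a small helper. This is a sketch under one assumption: that a missing sentence should raise an error, because str.find returns -1 on failure and a -1 offset would silently produce the wrong slice. The function name extract_between and the sample string are illustrative, not part of the original notebook.

```python
def extract_between(text, first_sentence, last_sentence):
    """Return the slice of text from first_sentence through last_sentence.

    Raises ValueError if either sentence is missing, instead of letting
    the -1 returned by str.find produce a wrong slice.
    """
    start = text.find(first_sentence)
    if start == -1:
        raise ValueError("opening sentence not found")
    # Look for the closing sentence only after the opening one.
    end = text.find(last_sentence, start)
    if end == -1:
        raise ValueError("closing sentence not found")
    # find gives the offset where the match begins, so add its length
    # to include the closing sentence itself.
    return text[start:end + len(last_sentence)]


page = "menu junk. First sentence. Middle part. Last sentence. footer junk."
print(extract_between(page, "First sentence.", "Last sentence."))
# prints: First sentence. Middle part. Last sentence.
```

Sentences copied from a live page go stale when the article is edited, so failing loudly is usually easier to debug than an empty or truncated intro.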

In [23]:
intro_tokens = nltk.word_tokenize(intro)
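
Wikipedia prose carries bracketed footnote markers such as [20], and the tokenizer splits each of them into three separate tokens. If those are unwanted, one option is to remove them from the string before the tokenizing step above. A minimal sketch, assuming the markers are always digits inside square brackets; strip_citation_markers is a hypothetical helper name and the sample string is hand-written.

```python
import re

# Matches bracketed numeric footnote markers such as [20] or [3].
CITATION_MARKER = re.compile(r"\[\d+\]")


def strip_citation_markers(text):
    """Remove markers like [20] from text, leaving everything else as is."""
    return CITATION_MARKER.sub("", text)


sample = "a high-level programming language.[20][21] Its design philosophy"
print(strip_citation_markers(sample))
# prints: a high-level programming language. Its design philosophy
```

The pattern leaves non-numeric brackets such as [citation needed] untouched; whether those should also go depends on the analysis.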

In [24]:
print(intro_tokens)

[u'Python', u'is', u'a', u'widely', u'used', u'general-purpose', u',', u'high-level', u'programming', u'language', u'.', u'[', u'20', u']', u'[', u'21', u']', u'Its', u'design', u'philosophy', u'emphasizes', u'code', u'readability', u',', u'and', u'its', u'syntax', u'allows', u'programmers', u'to', u'express', u'concepts', u'in', u'fewer', u'lines', u'of', u'code', u'than', u'would', u'be', u'possible', u'in', u'languages', u'such', u'as', u'C++', u'or', u'Java', u'.', u'[', u'22', u']', u'[', u'23', u']', u'The', u'language', u'provides', u'constructs', u'intended', u'to', u'enable', u'clear', u'programs', u'on', u'both', u'a', u'small', u'and', u'large', u'scale', u'.', u'[', u'24', u']', u'Python', u'supports', u'multiple', u'programming', u'paradigms', u',', u'including', u'object-oriented', u',', u'imperative', u'and', u'functional', u'programming', u'or', u'procedural', u'styles', u'.', u'It', u'features', u'a', u'dynamic', u'type', u'system', u'and', u'automatic', u'memory', u'm