# Project: Analyzing Wikipedia Pages

The objective is to identify patterns in Wikipedia's writing and content presentation style.

Our main goals will be to:

- Extract only the text from the Wikipedia pages, and remove all HTML and JavaScript markup.
- Remove common page headers and footers from the Wikipedia pages.
- Figure out what tags are the most common in Wikipedia pages.
- Figure out patterns in the text.
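
As a rough illustration of the first goal, here is a minimal sketch that strips markup using only the standard library's `html.parser`. It is not the approach used in the cells below, which rely on BeautifulSoup; the `TextExtractor` class, the `extract_text` helper, and the sample string are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and ignore the bodies of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the text content of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical sample, not taken from the dataset.
sample = (
    "<div id='content'><h1>Title</h1>"
    "<script>var x = 1;</script>"
    "<p>Hello <b>world</b>.</p></div>"
)
text = extract_text(sample)
print(text)
```

Removing the shared page headers and footers would still need a separate step, for example restricting the input to the main content element first.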

In [1]:
import os
os.listdir("wiki")

['Furubira_District,_Hokkaido.html',
 'Valentin_Yanin.html',
 'Kings_XI_Punjab_in_2014.html',
 'William_Harvey_Lillard.html',
 'Radial_Road_3.html',
 'George_Weldrick.html',
 'Zgornji_Otok.html',
 'Blue_Heelers_(season_8).html',
 'Taggen_Nunatak.html',
 '1951_National_League_tie-breaker_series.html',
 'List_of_number-one_singles_of_1993_(Finland).html',
 'Vrila.html',
 'William_Henry_Porter.html',
 'Clive_Brown_(footballer).html',
 '2010_Karshi_Challenger_%E2%80%93_Singles.html',
 'Blick_nach_Rechts.html',
 'Central_District_(Rezvanshahr_County).html',
 'Gal%C3%A1pagos,_Guadalajara.html',
 'Campus_of_Texas_A%26M_University.html',
 'Alexios_Aspietes.html',
 'Mei_Lanfang.html',
 'Thalkirchen-Obersendling-Forstenried-F%C3%BCrstenried-Solln.html',
 'Coalville_Town_railway_station.html',
 'Gennady_Lesun.html',
 'Bartrum_Glacier.html',
 'Victor_S._Mamatey.html',
 'Gottfried_Keller.html',
 'Table_Point_Formation.html',
 'Nobuhiko_Ushiba.html',
 'Master_of_Space_and_Time.html',
 'Early_medieva

The main content of each page is nested inside the `<div>` element whose `id` is `content`.

## Exploring the Data

In [2]:
import concurrent.futures as cf
import time as t

pool = cf.ThreadPoolExecutor(max_workers=4)
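def read_data(filename):
    # Wikipedia pages are UTF-8 encoded; being explicit avoids relying on the platform default.
    with open(filename, encoding="utf-8") as f:
        data = f.read()
    return data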

start = t.time()
filenames = ["wiki/{}".format(f) for f in os.listdir("wiki")]
content = pool.map(read_data, filenames)
content = list(content)
print(t.time() - start)
articles = [f.replace(".html", "").replace("wiki/", "") for f in filenames]

0.40323901176452637


We've observed that threading does not noticeably reduce the time needed to read the files. Each file is small and quick to read, so the overhead of creating and coordinating threads can outweigh the benefit of overlapping the reads.
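
One way to check this kind of claim is to time a plain sequential loop against the thread pool on the same files. The sketch below is an assumption about how such a comparison could be set up, not a measurement from this project: `time_reads` is a hypothetical helper, and the demo writes small temporary files instead of using the `wiki` folder.

```python
import concurrent.futures as cf
import os
import tempfile
import time


def read_data(filename):
    with open(filename, encoding="utf-8") as f:
        return f.read()


def time_reads(filenames, max_workers=4):
    """Read the files sequentially, then with threads, and time both runs."""
    start = time.perf_counter()
    sequential = [read_data(name) for name in filenames]
    sequential_time = time.perf_counter() - start

    start = time.perf_counter()
    with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
        threaded = list(pool.map(read_data, filenames))
    threaded_time = time.perf_counter() - start

    return sequential, threaded, sequential_time, threaded_time


# Demo on temporary files (hypothetical data, not the wiki dataset).
with tempfile.TemporaryDirectory() as tmp:
    demo_files = []
    for i in range(20):
        path = os.path.join(tmp, "page_{}.html".format(i))
        with open(path, "w", encoding="utf-8") as f:
            f.write("<html><body>page {}</body></html>".format(i))
        demo_files.append(path)
    sequential, threaded, seq_time, thr_time = time_reads(demo_files)

print("sequential: {:.4f}s, threaded: {:.4f}s".format(seq_time, thr_time))
```

Because the operating system caches files after the first read, whichever run goes second may look faster than it would on a cold cache, so the two timings are only a rough guide.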

## Cleaning Markup

In [3]:
from bs4 import BeautifulSoup
def parse_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    return str(soup.find_all("div", id="content")[0])
start = t.time()
pool = cf.ProcessPoolExecutor(max_workers=3)
parsed = pool.map(parse_html, content)
parsed = list(parsed)
print(t.time() - start)

55.0821008682251


In [4]:
parsed[0]

'<div class="mw-body" id="content" role="main">\n<a id="top"></a>\n<div id="siteNotice"><!-- CentralNotice --></div>\n<div class="mw-indicators">\n</div>\n<h1 class="firstHeading" id="firstHeading" lang="en">Furubira District, Hokkaido</h1>\n<div class="mw-body-content" id="bodyContent">\n<div id="siteSub">From Wikipedia, the free encyclopedia</div>\n<div id="contentSub"></div>\n<div class="mw-jump" id="jump-to-nav">\n\t\t\t\t\tJump to:\t\t\t\t\t<a href="#mw-head">navigation</a>, \t\t\t\t\t<a href="#p-search">search</a>\n</div>\n<div class="mw-content-ltr" dir="ltr" id="mw-content-text" lang="en"><table class="plainlinks metadata ambox ambox-content ambox-Unreferenced" role="presentation">\n<tr>\n<td class="mbox-image">\n<div style="width:52px"><a class="image" href="/wiki/File:Question_book-new.svg"><img alt="" data-file-height="399" data-file-width="512" height="39" src="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png" srcset="//upl

## Most Used Tags

In [6]:
def count_tags(html):
    soup = BeautifulSoup(html, 'html.parser')
    tags = dict()
    for tag in soup.find_all():
        if tag.name not in tags:
            tags[tag.name] = 0
        tags[tag.name] += 1
    return tags

start = t.time()
pool = cf.ProcessPoolExecutor(max_workers=3)
tags = pool.map(count_tags, parsed)
tags = list(tags)
tag_counts = dict()
for tag in tags:
    for k, v in tag.items():
        if k not in tag_counts:
            tag_counts[k] = 0
        tag_counts[k] += v
print(t.time() - start)
tag_counts

26.37105083465576


{'a': 161065,
 'abbr': 3665,
 'annotation': 2,
 'area': 39,
 'audio': 2,
 'b': 14455,
 'bdi': 4,
 'big': 75,
 'blockquote': 58,
 'br': 4986,
 'caption': 200,
 'center': 64,
 'cite': 3563,
 'code': 108,
 'dd': 1376,
 'del': 2,
 'div': 28581,
 'dl': 457,
 'dt': 334,
 'font': 40,
 'h1': 999,
 'h2': 4045,
 'h3': 777,
 'h4': 117,
 'h5': 4,
 'h6': 1,
 'hr': 51,
 'i': 18246,
 'img': 6701,
 'li': 85779,
 'map': 2,
 'math': 2,
 'mo': 2,
 'mrow': 2,
 'mstyle': 2,
 'noscript': 999,
 'ol': 858,
 'p': 7998,
 'pre': 1,
 'q': 76,
 'rb': 16,
 'rp': 32,
 'rt': 16,
 'ruby': 16,
 's': 10,
 'samp': 2,
 'semantics': 2,
 'small': 3272,
 'source': 2,
 'span': 67350,
 'strong': 599,
 'sub': 151,
 'sup': 11157,
 'table': 4010,
 'td': 57673,
 'th': 14472,
 'tr': 27300,
 'u': 51,
 'ul': 10972,
 'wbr': 85}

The large numbers of `<a>`, `<li>`, `<span>` and `<td>` tags indicate that the articles rely heavily on hyperlinks, ordered and unordered lists, and tables. `<a>` is by far the most common tag, which probably reflects how densely Wikipedia links key terms to their own articles.
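
The per-article dictionaries can also be merged with `collections.Counter`, which makes it easy to list the most common tags directly. This is a minimal sketch, not the code that produced the output above; `merge_counts` and the sample dictionaries are hypothetical.

```python
from collections import Counter


def merge_counts(per_article_counts):
    """Sum a sequence of {tag: count} dictionaries into one Counter."""
    total = Counter()
    for counts in per_article_counts:
        total.update(counts)  # adds counts rather than overwriting them
    return total


# Hypothetical per-article results, not taken from the dataset.
sample_counts = [{"a": 3, "li": 2}, {"a": 4, "td": 5}]
merged = merge_counts(sample_counts)
top_tags = merged.most_common(2)
print(top_tags)
```

With the real data, `merge_counts(tags).most_common(5)` would give the five most frequent tags without scanning the full dictionary by eye.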

## Most Used Words

In [7]:
from collections import Counter
import re

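def count_words(html):
    soup = BeautifulSoup(html, 'html.parser')
    txt = soup.get_text()
    # Replace every run of non-word characters with a single space (raw string avoids an invalid escape).
    txt = re.sub(r"\W+", " ", txt.lower())
    wrd = txt.split()
    # Keep only words of five or more characters to drop most stop words.
    wrd = [w for w in wrd if len(w) >= 5]
    return Counter(wrd).most_common(10)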

start = t.time()
pool = cf.ProcessPoolExecutor(max_workers=3)
words = pool.map(count_words, parsed)
words = list(words)
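# word_counts maps each word to the number of articles in which it is among the top 10,
# not to its total number of occurrences.
word_counts = dict()
for wc in words:
    for word, count in wc:
        if word not in word_counts:
            word_counts[word] = 0
        word_counts[word] += 1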
print(t.time() - start)
word_counts

26.579265832901


{'wordpress': 1,
 'video': 7,
 'final': 3,
 'legislative': 6,
 'prannathji': 1,
 'sparganothis': 1,
 'television': 21,
 'mount': 5,
 'wangerooge': 1,
 'partisan': 1,
 'brighouse': 1,
 'october': 28,
 'hardcore': 1,
 'noise': 1,
 'surgical': 1,
 'heroes': 1,
 'staff': 1,
 'elected': 5,
 '43667': 1,
 'editor': 2,
 'riding': 1,
 'spiders': 2,
 'taylour': 1,
 'llwyn': 1,
 'potheri': 1,
 'schauspielhaus': 1,
 'guimarães': 1,
 'strut': 1,
 'prison': 1,
 'feiglin': 1,
 'walking': 1,
 'statistics': 8,
 'dodgers': 1,
 'service': 16,
 'theatre': 6,
 'julien': 1,
 'jennifer': 1,
 'hurricanes': 1,
 'hiroshima': 1,
 'wright': 1,
 'scene': 1,
 'inquiry': 1,
 'incubus': 1,
 'slovakia': 1,
 'piracy': 1,
 'katie': 1,
 'maubeuge': 1,
 'tachinidaetachinidae': 1,
 'carley': 1,
 'clarkes': 1,
 'feedjit': 1,
 'turkey': 3,
 'virgin': 1,
 'classic': 2,
 'simon': 2,
 'women': 22,
 'books': 4,
 'landhi': 1,
 'often': 2,
 'faculty': 1,
 'recommendations': 1,
 'solid': 1,
 'muslim': 1,
 'george': 10,
 'eternament

## Top Ten Words

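Selecting only the top 10 words from each article significantly improves performance, because each worker process returns far less data and there is far less to merge. As a result, the values in `word_counts` are the number of articles in which a word ranks in the top 10.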
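
A minimal sketch of how the per-article top-10 lists could be reduced to an overall ranking, assuming the ranking is by the number of articles in which a word reaches the top 10. The helpers `top_words` and `article_frequency` and the sample texts are hypothetical, and plain strings stand in for the parsed HTML.

```python
import re
from collections import Counter


def top_words(text, n=10, min_length=5):
    """Return the n most common words of at least min_length characters."""
    words = re.findall(r"\w+", text.lower())
    return Counter(w for w in words if len(w) >= min_length).most_common(n)


def article_frequency(per_article_top):
    """Count in how many articles each word appears in the top-n list."""
    freq = Counter()
    for top in per_article_top:
        freq.update(word for word, _ in top)
    return freq


# Hypothetical article texts, not taken from the dataset.
texts = [
    "October elections elected the legislative council in October",
    "Television service started in October",
]
per_article = [top_words(text) for text in texts]
freq = article_frequency(per_article)
print(freq.most_common(10))
```

Ranking by total occurrences instead would mean adding each word's count rather than 1, but words outside an article's top 10 would still be left out of the total.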