# Analyzing Wikipedia Pages

In this project, we'll analyze 54 megabytes of data scraped from Wikipedia articles to find patterns in the writing and content style. Our main goals are to extract the text from the articles by stripping out all HTML and JavaScript markup, remove page headers and footers, find the most common tags, and then find patterns in the text.

The articles were scraped by downloading the contents of random Wikipedia pages using the code in the `scrape_random.py` file. Each page was then saved into the `wiki` folder under the last component of its URL. Note: due to the large amount of data, I've added only a few of these files to this repository to serve as examples.

## Introduction to the Wikipedia Data

The first thing we'll do is list all of the files in our `wiki` folder, and then open up some of the files to see what our data looks like.

In [1]:
import os

os.listdir('wiki')

In [2]:
len(os.listdir('wiki'))
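
To get a rough sense of how much data is actually on disk, the file sizes could be summed as well. This is a minimal sketch and not part of the original notebook: the helper name `folder_size_mb` is hypothetical, and it assumes a megabyte means 10**6 bytes.

```python
import os

def folder_size_mb(folder):
    """Return the combined size of the regular files in `folder`, in megabytes (10**6 bytes)."""
    total_bytes = 0
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path):  # skip sub-directories
            total_bytes += os.path.getsize(path)
    return total_bytes / 1_000_000

# Only report a size if the folder is present in the working directory
if os.path.isdir('wiki'):
    print('{:.1f} MB'.format(folder_size_mb('wiki')))
```

Since only a few sample files are included in this repository, the number printed here will be far smaller than the full dataset.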

In [3]:
with open('wiki/One_Night_of_Sin.html') as f:
    print(f.read())

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>One Night of Sin - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"One_Night_of_Sin","wgTitle":"One Night of Sin","wgCurRevisionId":766528038,"wgRevisionId":766528038,"wgArticleId":16423543,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 German-language sources (de)","EngvarB from January 2014","Use dmy dates from January 2014","Articles with hAudio microformats","Certification Table Entry usages for Austria","Certification Table Entry usages for Canada","Certification Table Entry usages for Germany","Certification Table Entry usages for Switzer

The content we want is nested within the `div` tag with the `content` id. We'll need to read in the data and remove the extra markup that we don't need.

## Read In the Data

Since reading in our data is I/O-bound, we'll use threads to read in all of the data more quickly.

In [4]:
import concurrent.futures
import time

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def read_data(filename):
    with open(filename) as f:
        data = f.read()
    return data

start = time.time()

filenames = ['wiki/{}'.format(f) for f in os.listdir('wiki')]
content = pool.map(read_data, filenames)
content = list(content)

end = time.time()
print('Total time: ', end - start)

# Create a list that contains the article names without the prefix or suffix
articles = [f.replace('.html', '').replace('wiki/', '') for f in filenames]

Total time:  0.2015669345855713


Threading doesn't seem to make much of a difference to performance here. This is likely because the files are small and quick to read, so the overhead of creating threads and dispatching work to them offsets most of the gain.
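
To check a claim like this, the threaded read can be timed against a plain sequential loop over the same files. The sketch below is an assumption about how such a comparison might look, not the code that produced the timing above. The names `read_sequential`, `read_threaded`, and `timed` are hypothetical, and the explicit UTF-8 encoding is assumed from the `charset` declared in the sample page.

```python
import concurrent.futures
import time

def read_data(filename):
    # Assumes the pages are UTF-8 encoded
    with open(filename, encoding='utf-8') as f:
        return f.read()

def read_sequential(filenames):
    # Baseline: read the files one after another in the main thread
    return [read_data(name) for name in filenames]

def read_threaded(filenames, max_workers=4):
    # Same work, spread across a small thread pool; map() preserves input order
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_data, filenames))

def timed(func, *args):
    # Return the function's result together with the elapsed wall-clock time in seconds
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start
```

Calling `timed(read_sequential, filenames)` and `timed(read_threaded, filenames)` gives two timings to compare. Whichever call runs second may benefit from the operating system's file cache, so it is worth repeating the measurement in both orders.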

## Remove Extraneous Markup

Now that we have the data, the next thing we'll do is remove all of the markup that's outside of `div#content`.

The best way for us to do this is by using an HTML-parsing package called BeautifulSoup. This process is more CPU-intensive, so we'll use a process pool to see if it helps improve the speed.

We'll write a function that takes an HTML document, parses it, and returns the contents we need in string format. Then we'll time the function when using a process pool to see how its performance is affected.

In [5]:
from bs4 import BeautifulSoup

def parse_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    return str(soup.find_all('div', id='content')[0])

start = time.time()

pool = concurrent.futures.ProcessPoolExecutor(max_workers=3)
parsed = pool.map(parse_html, content)
parsed = list(parsed)

end = time.time()
print('Total time: ', end - start)

Total time:  15.193037033081055


In [6]:
parsed[0]

'<div class="mw-body" id="content" role="main">\n<a id="top"></a>\n<div id="siteNotice"><!-- CentralNotice --></div>\n<div class="mw-indicators">\n</div>\n<h1 class="firstHeading" id="firstHeading" lang="en">Ronald McCaffer</h1>\n<div class="mw-body-content" id="bodyContent">\n<div id="siteSub">From Wikipedia, the free encyclopedia</div>\n<div id="contentSub"></div>\n<div class="mw-jump" id="jump-to-nav">\n\t\t\t\t\tJump to:\t\t\t\t\t<a href="#mw-head">navigation</a>, \t\t\t\t\t<a href="#p-search">search</a>\n</div>\n<div class="mw-content-ltr" dir="ltr" id="mw-content-text" lang="en"><table class="plainlinks metadata ambox ambox-content ambox-multiple_issues compact-ambox" role="presentation">\n<tr>\n<td class="mbox-image">\n<div style="width:52px"><img alt="" data-file-height="40" data-file-width="40" height="40" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Ambox_important.svg/40px-Ambox_important.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Ambox_

This step was much slower and more CPU-intensive than reading the files. Spreading the work across several worker processes seems to speed things up.

## Finding Common Tags

Next, we'll write a function to count how many times each tag occurs. This will help us understand how the Wikipedia pages are structured.

In [7]:
from bs4 import BeautifulSoup

def count_tags(html):
    soup = BeautifulSoup(html, 'html.parser')
    
    tags = {}
    for tag in soup.find_all():
        if tag.name not in tags:
            tags[tag.name] = 0
        tags[tag.name] += 1
    return tags

start = time.time()

pool = concurrent.futures.ProcessPoolExecutor(max_workers=3)
tags = pool.map(count_tags, parsed)
tags = list(tags)

tag_counts = {}
for tag in tags:
    for key, value in tag.items():
        if key not in tag_counts:
            tag_counts[key] = 0
        tag_counts[key] += value

end = time.time()
print('Total time: ', end - start)

tag_counts

Total time:  9.183030128479004


{'div': 28581,
 'a': 161065,
 'h1': 999,
 'table': 4010,
 'tr': 27300,
 'td': 57673,
 'img': 6701,
 'span': 67350,
 'b': 14455,
 'small': 3272,
 'i': 18246,
 'br': 4986,
 'p': 7998,
 'h2': 4045,
 'ul': 10972,
 'li': 85779,
 'h3': 777,
 'abbr': 3665,
 'noscript': 999,
 'ol': 858,
 'sup': 11157,
 'th': 14472,
 'cite': 3563,
 'strong': 599,
 'caption': 200,
 'big': 75,
 'dl': 457,
 'dt': 334,
 'dd': 1376,
 'sub': 151,
 'code': 108,
 'blockquote': 58,
 'h4': 117,
 'wbr': 85,
 'q': 76,
 'center': 64,
 'bdi': 4,
 'hr': 51,
 'pre': 1,
 'u': 51,
 'audio': 2,
 'source': 2,
 's': 10,
 'h5': 4,
 'math': 2,
 'semantics': 2,
 'mrow': 2,
 'mstyle': 2,
 'mo': 2,
 'annotation': 2,
 'map': 2,
 'area': 39,
 'ruby': 16,
 'rb': 16,
 'rp': 32,
 'rt': 16,
 'h6': 1,
 'samp': 2,
 'font': 40,
 'del': 2}
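
The merging loop above can also be written with `collections.Counter`, which handles missing keys automatically. This is an alternative sketch rather than the notebook's code: `merge_counts` is a hypothetical helper, and the sample input is made up for illustration.

```python
from collections import Counter

def merge_counts(per_article_counts):
    """Sum a sequence of {tag: count} dictionaries into a single Counter."""
    total = Counter()
    for counts in per_article_counts:
        total.update(counts)  # adds to existing counts instead of overwriting them
    return total

# Made-up per-article tag counts, standing in for the output of count_tags
sample = [{'div': 3, 'a': 10}, {'a': 4, 'p': 2}]
merged = merge_counts(sample)
print(merged.most_common(2))  # [('a', 14), ('div', 3)]
```

A `Counter` also provides `most_common()`, which saves sorting the totals by hand.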

## Finding Common Words

Now we can proceed by finding all of the common words in the articles. We'll have to experiment a bit to figure out the optimal count to work with and whether to use threads or processes.

In [8]:
from bs4 import BeautifulSoup
from collections import Counter
import re

def count_words(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Common words to ignore
    filter_list = ['a', 'i', 'to', 'for', 'the', 'of', 'and', 'in', 'from', 'an', 'this', 'is', 'was', 'with', 'on', '\'', 's', 'by', 'as', 'it', 'at']
    text = soup.get_text()
    # Lowercase the text and collapse runs of non-word characters into single spaces
    text = re.sub(r'\W+', ' ', text.lower())
    words = text.split(' ')
    words = [w for w in words if w not in filter_list]
    return Counter(words).most_common(50) # Select only the top 50 words from each article

start = time.time()

pool = concurrent.futures.ProcessPoolExecutor(max_workers=3)
words = pool.map(count_words, parsed)
words = list(words)

# For each word, count the number of articles in which it ranks among the top 50
word_counts = {}
for word_count in words:
    for word, _ in word_count:
        if word not in word_counts:
            word_counts[word] = 0
        word_counts[word] += 1

end = time.time()

print('Total time: ', end - start)

# Sort the dictionary values in descending order
sorted_word_counts = {}
sorted_keys = sorted(word_counts, key=word_counts.get, reverse=True)

for w in sorted_keys:
    sorted_word_counts[w] = word_counts[w]
    
sorted_word_counts

Total time:  8.975099086761475


{'edit': 738,
 'wikipedia': 707,
 '1': 671,
 '': 669,
 '2': 508,
 '3': 368,
 'free': 352,
 'retrieved': 334,
 'encyclopedia': 329,
 'categories': 321,
 'jump': 314,
 'search': 313,
 'navigation': 300,
 'e': 298,
 'articles': 289,
 '4': 286,
 'stub': 265,
 '5': 253,
 'he': 226,
 'that': 214,
 't': 210,
 'article': 203,
 'references': 191,
 'v': 187,
 'his': 184,
 '6': 182,
 'n': 170,
 '10': 161,
 'help': 150,
 'also': 148,
 'be': 146,
 'or': 145,
 'new': 142,
 'w': 139,
 'may': 130,
 'links': 125,
 '7': 124,
 '0': 117,
 'are': 116,
 'external': 115,
 'has': 114,
 'which': 114,
 'united': 113,
 '8': 110,
 'b': 109,
 '2013': 108,
 'can': 107,
 'coordinates': 106,
 '9': 105,
 'born': 103,
 'you': 103,
 'c': 97,
 'were': 96,
 '2011': 95,
 'm': 95,
 '2014': 94,
 '12': 93,
 'see': 93,
 'state': 92,
 '2010': 91,
 '2015': 91,
 'one': 89,
 'national': 89,
 'family': 89,
 'name': 89,
 'p': 88,
 '11': 88,
 'states': 88,
 'county': 87,
 'first': 85,
 'university': 83,
 'february': 82,
 'other': 82,

Selecting only a small number of words from each article speeds up the algorithm. We could also exclude any words shorter than 5 characters to help clean up some of the unnecessary values we see, such as single letters, digits, and the empty string.
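
One way to apply that length filter is to tokenize with a regular expression and drop short tokens before counting. This is a minimal sketch, assuming the 5-character cutoff suggested above. The names `count_long_words`, `min_length`, and `top_n` are hypothetical, and the function takes plain text rather than HTML.

```python
import re
from collections import Counter

def count_long_words(text, min_length=5, top_n=50):
    """Count the words of at least `min_length` characters and return the `top_n` most common."""
    # \w+ matches runs of letters, digits, and underscores, so no empty strings are produced
    tokens = re.findall(r'\w+', text.lower())
    long_tokens = [t for t in tokens if len(t) >= min_length]
    return Counter(long_tokens).most_common(top_n)

sample = ('Wikipedia, the free encyclopedia. Wikipedia articles cite '
          'references; Wikipedia articles link.')
print(count_long_words(sample, top_n=2))  # [('wikipedia', 3), ('articles', 2)]
```

Note that this cutoff also removes four-digit years such as `2013`, which may or may not be desirable depending on the analysis.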

## Conclusion & Next Steps

In this project we've done some basic analysis on scraped Wikipedia data and worked to optimize the code performance.

Some next steps to continue this analysis could be to:

* Look at which tags have the most content.
* Find the articles that are most commonly linked.
* Find the most common phrases.
* Calculate the distribution of letters per word.
* Use readability metrics to find the average reading level of Wikipedia articles.
* Find which images are most commonly shown.
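
As a starting point for two of these ideas, the distribution of word lengths and the most common two-word phrases can both be computed with `collections.Counter`. This is only a sketch on plain text: `tokenize`, `word_length_distribution`, and `common_bigrams` are hypothetical names, treating a "phrase" as a pair of adjacent words is an assumption, and the tokenizer keeps only the ASCII letters a-z as a simplification.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplification: keep only runs of the ASCII letters a-z
    return re.findall(r'[a-z]+', text.lower())

def word_length_distribution(text):
    """Map each word length to the number of words of that length."""
    return Counter(len(word) for word in tokenize(text))

def common_bigrams(text, top_n=10):
    """Return the `top_n` most common pairs of adjacent words."""
    words = tokenize(text)
    return Counter(zip(words, words[1:])).most_common(top_n)

sample = 'the free encyclopedia that anyone can edit the free encyclopedia'
print(sorted(word_length_distribution(sample).items()))
print(common_bigrams(sample, top_n=2))
```

On the real data, the text passed in would come from `soup.get_text()` for each parsed article.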

We could also continue downloading more data and optimize our code to work efficiently and effectively with increasing amounts of it. The idea for this project comes from the [DATAQUEST](https://app.dataquest.io/) **Optimizing Code Performance on Large Datasets** course.