# Analyzing Wikipedia Pages

In this short, data-engineering-focused project, we'll be working with data scraped from Wikipedia.

## Introduction
Wikipedia is maintained by volunteer content contributors and editors who continuously improve content. Wikipedia can be edited by anyone, and because it is crowdsourced, it's been able to rapidly assemble a huge library of articles.

In this guided project, we'll be analyzing 54 megabytes of articles to find patterns in Wikipedia's writing and content presentation style. The articles were scraped by requesting random pages on Wikipedia, then downloading their contents with the requests package (http://docs.python-requests.org/en/master/). The scraping code is in the scrape_random.py file, in the same folder as this project.

Articles were saved using the last component of their URLs. For example, a page on Wikipedia has the URL structure https://en.wikipedia.org/wiki/Yarkant_County. If we were saving the article with the previous URL, we'd save it to the file Yarkant_County.html. All the data files are stored in the wiki folder. Note that the files are raw HTML.
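The naming scheme above can be sketched in a few lines. This is only an illustration under stated assumptions: `url_to_filename` and `filename_to_title` are hypothetical helpers, not functions taken from scrape_random.py. The second helper also decodes percent-escapes, which can appear in file names whose article titles contain non-ASCII characters.

```python
from urllib.parse import unquote, urlsplit

def url_to_filename(url):
    """Return the file name an article URL would be saved under (hypothetical helper)."""
    # The last path component of the URL identifies the article.
    last_component = urlsplit(url).path.rsplit("/", 1)[-1]
    return last_component + ".html"

def filename_to_title(filename):
    """Turn a saved file name back into a readable title (hypothetical helper)."""
    # Strip the extension, decode percent-escapes, and swap underscores for spaces.
    stem = filename[:-len(".html")] if filename.endswith(".html") else filename
    return unquote(stem).replace("_", " ")

print(url_to_filename("https://en.wikipedia.org/wiki/Yarkant_County"))
print(filename_to_title("Gal%C3%A1pagos,_Guadalajara.html"))
```

The first call prints Yarkant_County.html, matching the example above.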

Also note that each page is a fairly standard HTML page with embedded JavaScript code. We'll be able to ignore the embedded JavaScript during our analysis, so it should not be a problem.

The main goals:
* Extract only the text from the Wikipedia pages, and remove all HTML and JavaScript markup.
* Remove common page headers and footers from the Wikipedia pages.
* Figure out what tags are the most common in Wikipedia pages.
* Figure out patterns in the text.

Let's get started by exploring a single data file.

In [1]:
# get some tools
import os

# list all files in wiki folder - it's a lot!
os.listdir("wiki")

['Furubira_District,_Hokkaido.html',
 'Valentin_Yanin.html',
 'Kings_XI_Punjab_in_2014.html',
 'William_Harvey_Lillard.html',
 'Radial_Road_3.html',
 'George_Weldrick.html',
 'Zgornji_Otok.html',
 'Blue_Heelers_(season_8).html',
 'Taggen_Nunatak.html',
 '1951_National_League_tie-breaker_series.html',
 'List_of_number-one_singles_of_1993_(Finland).html',
 'Vrila.html',
 'William_Henry_Porter.html',
 'Clive_Brown_(footballer).html',
 '2010_Karshi_Challenger_%E2%80%93_Singles.html',
 'Blick_nach_Rechts.html',
 'Central_District_(Rezvanshahr_County).html',
 'Gal%C3%A1pagos,_Guadalajara.html',
 'Campus_of_Texas_A%26M_University.html',
 'Alexios_Aspietes.html',
 'Mei_Lanfang.html',
 'Thalkirchen-Obersendling-Forstenried-F%C3%BCrstenried-Solln.html',
 'Coalville_Town_railway_station.html',
 'Gennady_Lesun.html',
 'Bartrum_Glacier.html',
 'Victor_S._Mamatey.html',
 'Gottfried_Keller.html',
 'Table_Point_Formation.html',
 'Nobuhiko_Ushiba.html',
 'Master_of_Space_and_Time.html',
 'Early_medieva

In [5]:
# count the files (not displayed, since it isn't the cell's last expression)
len(os.listdir("wiki"))

# read one of them 
with open("wiki/Tommy_Hunter_(baseball).html") as f:
    print(f.read())

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Tommy Hunter (baseball) - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Tommy_Hunter_(baseball)","wgTitle":"Tommy Hunter (baseball)","wgCurRevisionId":768192910,"wgRevisionId":768192910,"wgArticleId":18744949,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Use mdy dates from February 2016","1986 births","Living people","Sportspeople from Indianapolis","Baseball players from Indiana","Major League Baseball pitchers","Texas Rangers players","Baltimore Orioles players","Chicago Cubs players","Cleveland Indians players","Alabama Crimson Tide baseball 

This is a typical HTML page, even though it may look a bit cryptic to an unfamiliar reader. The main content appears to be nested inside the div tag with the id content.

## Reading in the Data
Now that we know the file structure, and the structure of a single file, we can read in all of the files. This will get us started in our explorations.

As this task is I/O bound, we can use threads to help us read in the data more quickly.

Note: It may make sense to benchmark the performance with multiple threads versus single threaded performance.
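One way to run that benchmark is sketched below. It is an illustration rather than the notebook's own code: `time_reads` is a hypothetical helper, the demo files are throwaway files created in a temporary folder, and the explicit UTF-8 encoding is an assumption based on the charset the pages declare. Point it at the wiki folder to get real numbers.

```python
import concurrent.futures
import os
import tempfile
import time

def read_data(filename):
    # Assumes UTF-8, which is the charset the saved pages declare.
    with open(filename, encoding="utf-8") as f:
        return f.read()

def time_reads(filenames, max_workers=None):
    """Return (seconds, contents); max_workers=None means a plain sequential loop."""
    start = time.perf_counter()
    if max_workers is None:
        contents = [read_data(name) for name in filenames]
    else:
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
            contents = list(pool.map(read_data, filenames))
    return time.perf_counter() - start, contents

# Demo on throwaway files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    demo_files = []
    for i in range(20):
        path = os.path.join(folder, "page_{}.html".format(i))
        with open(path, "w", encoding="utf-8") as f:
            f.write("<html>{}</html>".format(i) * 1000)
        demo_files.append(path)

    sequential_time, sequential = time_reads(demo_files)
    threaded_time, threaded = time_reads(demo_files, max_workers=4)

print("sequential: {:.4f}s, threaded: {:.4f}s".format(sequential_time, threaded_time))
```

Timings from a single run are noisy, so it is worth repeating the measurement a few times before drawing conclusions.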

In [7]:
# get some necessary modules for threading and timing
import concurrent.futures
import time

# create a pool of 4 workers (arbitrary to user machine)
pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

# make function to read file to avoid redundancy
def read_data(filename):
    with open(filename) as f:
        data = f.read()
    return data

# time the start, read in files, and make into a content list
start = time.time()
filenames = ["wiki/{}".format(f) for f in os.listdir("wiki")]
content = pool.map(read_data, filenames)
content = list(content)

# time the end, see the difference (seconds) and convert to article name
end = time.time()
print(end - start)
articles = [f.replace(".html", "").replace("wiki/", "") for f in filenames]

0.39919543266296387


Note that the time printed above will vary from run to run and from machine to machine.

However, after doing some profiling, it doesn't appear that threading makes a huge difference to performance. This may be because the files are small and quick to read, so any time saved is offset by the overhead of creating and coordinating the threads.

## Remove Extraneous Markup
Now that we've read in the data files, we can remove the extraneous markup outside the div#content tag, which seems to hold most of the content. For example:

    <html>
        <head>
            <script>...</script>
            ...
        </head>
        <body>
            ...
            <div id="content">
                ...
            </div>
            ...
        </body>
    </html>

We can use the BeautifulSoup package for this. BeautifulSoup enables us to extract all of the content inside a specific tag. For more information on BeautifulSoup, see the Reddit API project in the Data Analysis GitHub repo, or the documentation: https://www.crummy.com/software/BeautifulSoup/.

Using the BeautifulSoup package, we'll parse each wiki article, then extract the div with id content and everything inside it.

Since this operation is more CPU intensive than before, let's try using a process pool to see if the speed improves.

In [10]:
# get BS from bs4 edition
from bs4 import BeautifulSoup

# create function that reads in html, parses, extracts content div to string
def parse_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    return str(soup.find_all("div", id="content")[0])

# time the start, create pool with 3 workers (arbitrary), parse content to list
start = time.time()
pool = concurrent.futures.ProcessPoolExecutor(max_workers=3)
parsed = pool.map(parse_html, content)
parsed = list(parsed)

# time the end and see the difference (this will be quite long - over half a min)
end = time.time()
print(end - start)

41.40981125831604


In [11]:
# let's look at some content
parsed[0]

'<div class="mw-body" id="content" role="main">\n<a id="top"></a>\n<div id="siteNotice"><!-- CentralNotice --></div>\n<div class="mw-indicators">\n</div>\n<h1 class="firstHeading" id="firstHeading" lang="en">Furubira District, Hokkaido</h1>\n<div class="mw-body-content" id="bodyContent">\n<div id="siteSub">From Wikipedia, the free encyclopedia</div>\n<div id="contentSub"></div>\n<div class="mw-jump" id="jump-to-nav">\n\t\t\t\t\tJump to:\t\t\t\t\t<a href="#mw-head">navigation</a>, \t\t\t\t\t<a href="#p-search">search</a>\n</div>\n<div class="mw-content-ltr" dir="ltr" id="mw-content-text" lang="en"><table class="plainlinks metadata ambox ambox-content ambox-Unreferenced" role="presentation">\n<tr>\n<td class="mbox-image">\n<div style="width:52px"><a class="image" href="/wiki/File:Question_book-new.svg"><img alt="" data-file-height="399" data-file-width="512" height="39" src="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png" srcset="//upl

It turns out that this operation is quite slow and CPU-intensive. Using as many processes as there are available processors appears to speed things up.
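A minimal sketch of choosing the pool size from the machine rather than hard-coding it follows. `pick_worker_count` and its `reserve` parameter are hypothetical names introduced here for illustration.

```python
import os

def pick_worker_count(reserve=0):
    """Suggest a process-pool size: one worker per available CPU, minus `reserve`."""
    cpus = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, cpus - reserve)

print(pick_worker_count())
```

The result can be passed as max_workers when creating the ProcessPoolExecutor. Leaving max_workers unset should have a similar effect, since the executor is documented to default to the number of processors on the machine.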

## Finding Common Tags
Now that we've extracted the main part of each page, let's count up how many times each tag occurs. This will give us clues about how Wikipedia pages are typically structured. For example, if there are a lot of a tags on each page, we know that Wikipedia articles tend to be very connected to other articles or pages. On the other hand, a lot of div tags will tell us that Wikipedia pages tend to have a nested structure with many page elements.

We can count tags by calling the BeautifulSoup.find_all() method with no arguments, then iterating through all of the tags it returns.

In [13]:
# create function to count tags - parse and use find_all to fill dict
def count_tags(html):
    soup = BeautifulSoup(html, 'html.parser')
    tags = {}
    for tag in soup.find_all():
        if tag.name not in tags:
            tags[tag.name] = 0
        tags[tag.name] += 1
    return tags

# time the start, make pool with 3 workers (arbitrary), map the function to list
start = time.time()
pool = concurrent.futures.ProcessPoolExecutor(max_workers=3)
tags = pool.map(count_tags, parsed)
tags = list(tags)

# make tag_counts dict and double loop to count tags
tag_counts = {}
for tag in tags:
    for k,v in tag.items():
        if k not in tag_counts:
            tag_counts[k] = 0
        tag_counts[k] += v
        
# time the end and see the difference (intensive - about half a min)
end = time.time()
print(end - start)

# display tag counts
tag_counts

20.382586240768433


{'a': 161065,
 'abbr': 3665,
 'annotation': 2,
 'area': 39,
 'audio': 2,
 'b': 14455,
 'bdi': 4,
 'big': 75,
 'blockquote': 58,
 'br': 4986,
 'caption': 200,
 'center': 64,
 'cite': 3563,
 'code': 108,
 'dd': 1376,
 'del': 2,
 'div': 28581,
 'dl': 457,
 'dt': 334,
 'font': 40,
 'h1': 999,
 'h2': 4045,
 'h3': 777,
 'h4': 117,
 'h5': 4,
 'h6': 1,
 'hr': 51,
 'i': 18246,
 'img': 6701,
 'li': 85779,
 'map': 2,
 'math': 2,
 'mo': 2,
 'mrow': 2,
 'mstyle': 2,
 'noscript': 999,
 'ol': 858,
 'p': 7998,
 'pre': 1,
 'q': 76,
 'rb': 16,
 'rp': 32,
 'rt': 16,
 'ruby': 16,
 's': 10,
 'samp': 2,
 'semantics': 2,
 'small': 3272,
 'source': 2,
 'span': 67350,
 'strong': 599,
 'sub': 151,
 'sup': 11157,
 'table': 4010,
 'td': 57673,
 'th': 14472,
 'tr': 27300,
 'u': 51,
 'ul': 10972,
 'wbr': 85}

Based on our findings, there are quite a few a, li, span, and td tags. This indicates that articles tend to have lots of links, along with lists and tables. Links (a tags) are the most numerous, which shows how interconnected articles on Wikipedia are. This is what we would expect, and that is a good thing in analysis: confirmation is just as useful as discovery.
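The step that merges the per-article tallies can also be written with collections.Counter, which adds counts together on update. The sketch below is an alternative to the double loop, not the notebook's own code. `merge_counts` is a hypothetical helper and the sample tallies are made up.

```python
from collections import Counter

def merge_counts(per_article_counts):
    """Sum an iterable of {tag: count} dicts into one Counter."""
    total = Counter()
    for counts in per_article_counts:
        total.update(counts)  # update() adds counts rather than replacing them
    return total

# Made-up per-article tallies, standing in for the output of count_tags.
sample = [{"a": 4, "div": 1}, {"a": 2, "li": 5}, {"div": 2}]
merged = merge_counts(sample)
print(merged.most_common(2))
```

A Counter also provides most_common(), which saves sorting the dictionary by hand when looking for the most frequent tags.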

## Finding Common Words
After finding the common tags, we should be able to find the common words in the article body. We can apply any definition of "word" that we want, but it helps to apply a criterion of some sort, such as a minimum length. This process will be very similar to finding the common tags.

**Note: One thing to be aware of here is that depending on the words you choose and depending on your user machine, you may run out of memory, or performance may be slow. Try tweaking things to improve performance if necessary.**

In [15]:
# get some counting module help
from collections import Counter
import re

# create function to count words - parse, get_text, clean and keep words of 5+ letters
def count_words(html):
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text()
    # raw string so that \W is read as a regex escape, not a string escape
    text = re.sub(r"\W+", " ", text.lower())
    words = text.split(" ")
    words = [w for w in words if len(w) >= 5]
    return Counter(words).most_common(10)

# time the start, make pool of 3 workers (arbitrary), map count_words into list
start = time.time()
pool = concurrent.futures.ProcessPoolExecutor(max_workers=3)
words = pool.map(count_words, parsed)
words = list(words)

# make word_counts dict: for each word, count the articles in which it is a top-10 word
word_counts = {}
for wc in words:
    for word, count in wc:
        if word not in word_counts:
            word_counts[word] = 0
        word_counts[word] += 1
        
# time the end and show the difference - same as above - about half a min
end = time.time()
print(end - start)

# display word counts
word_counts

21.52505874633789


{'genus': 20,
 'schleswig': 1,
 'daughter': 5,
 'frescoes': 1,
 'appearance': 2,
 'nadeshiko': 1,
 'flanders': 1,
 'medical': 6,
 'tournaments': 1,
 'clare': 1,
 'ioana': 1,
 'chicken': 1,
 'heath': 1,
 'march': 31,
 'daveigh': 1,
 'festival': 11,
 'bijar': 1,
 'trees': 1,
 'elizabeth': 2,
 'pradesh': 3,
 'hardcore': 1,
 'fuselage': 1,
 'england': 14,
 'independent': 3,
 'faber': 2,
 'gustavsson': 1,
 'ontario': 2,
 'growth': 3,
 'pascal': 1,
 'winds': 1,
 'walking': 1,
 'ministry': 3,
 'honey': 1,
 'gokhale': 1,
 'champion': 1,
 'diptera': 1,
 'distortion': 1,
 'items': 1,
 'fischtown': 1,
 'kattukukke': 1,
 'editions': 1,
 'suter': 1,
 'panthers': 1,
 'recke': 1,
 'ingelheim': 1,
 'triathlon': 1,
 'suburbs': 1,
 'georgia': 5,
 'pless': 1,
 'genes': 1,
 'arena': 2,
 'malta': 2,
 'catalunya': 1,
 'cárdenas': 1,
 'artistic': 2,
 'these': 4,
 'liberal': 6,
 'minardi': 1,
 'atlantic': 3,
 'logistic': 1,
 'lavender': 1,
 'vitalic': 1,
 'kolbuszowa': 1,
 'donje': 1,
 'constitution': 1,
 'me

In [23]:
# pool.map on its own returns a lazy iterator - wrap it in list() to get the results
words = pool.map(count_words, parsed)
words

<generator object result_iterator at 0x7f7ca0192d80>

Selecting only the top 10 words from each article speeds up performance quite a bit, most likely because far less data has to be passed back from the worker processes and merged.
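Note that the aggregation loop in the word-count cell adds 1 per article, so each number is the count of articles in which a word made the top 10, not the total number of occurrences. The sketch below contrasts the two tallies on plain text. `top_words` is a hypothetical stand-in for count_words that skips the HTML parsing, and the two sample texts are made up.

```python
import re
from collections import Counter

def top_words(text, min_length=5, top_n=10):
    """Return the top_n most frequent words of at least min_length characters."""
    words = re.sub(r"\W+", " ", text.lower()).split()
    return Counter(w for w in words if len(w) >= min_length).most_common(top_n)

# Two made-up article texts standing in for parsed pages.
articles_text = [
    "The genus contains three species. The genus is widespread.",
    "Another genus, found across England and northern England alike.",
]
per_article = [top_words(text) for text in articles_text]

article_frequency = Counter()  # number of articles where the word makes the top list
total_occurrences = Counter()  # summed occurrences across those top lists
for top in per_article:
    for word, count in top:
        article_frequency[word] += 1
        total_occurrences[word] += count

print(article_frequency["genus"], total_occurrences["genus"])
```

Which tally is more useful depends on the question: article frequency shows how widespread a word is, while total occurrences favors words that a few articles repeat heavily.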

## Further Analysis / Next Steps
Even though this was a fairly basic analysis of wiki articles, there are a few more things to explore. For complete analysis projects, take a look at the Reddit API project mentioned earlier in this writeup. The material is essentially the same, though this project focused a bit more on data engineering.

## Analysis Questions
With that being said, some topics to explore might be:
* What tags have the most content inside of them?
* What articles are most commonly linked to from our articles?
* What phrases are the most common?
* What's the distribution of letters per word? How many 3 letter words are there? 4 letter?
* What's the average reading level of a Wikipedia article? Use readability metrics.
* What images are most commonly shown in articles?

Most of these questions can be answered, given some time, by using the functions above and looking into the HTML files. Treat this list as a guide for working through questions that you find interesting.
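As one example, the question about the most commonly linked articles could be approached as sketched below. To keep the example dependency-free it uses the standard library's html.parser in place of BeautifulSoup. `WikiLinkCollector` and `count_links` are hypothetical names, and the sample HTML is made up.

```python
from collections import Counter
from html.parser import HTMLParser

class WikiLinkCollector(HTMLParser):
    """Collect the targets of internal /wiki/ links (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith("/wiki/"):
            # Drop any #fragment so links to sections count toward the article.
            self.targets.append(href[len("/wiki/"):].split("#", 1)[0])

def count_links(html):
    collector = WikiLinkCollector()
    collector.feed(html)
    return Counter(collector.targets)

sample_html = (
    '<p><a href="/wiki/Hokkaido">Hokkaido</a> and '
    '<a href="/wiki/Japan#History">Japan</a>, also '
    '<a href="/wiki/Japan">Japan</a> and '
    '<a href="https://example.org">external</a></p>'
)
link_counts = count_links(sample_html)
print(link_counts.most_common(1))
```

On real pages this would also pick up non-article targets such as /wiki/File: links, so some filtering on the target name would probably be needed.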

For example, it is fairly easy to find the tags with the most content: just sort them by the length of their contents. It is also fairly easy to find the distribution of letters per word: change the len(w) >= 5 filter in count_words to an exact match such as len(w) == 3, len(w) == 4, and so on. If you then used this distribution to see which article topics it favors, that could be useful. Analyses build on each other, and which ones are worth pursuing mostly comes down to your own interests.
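The letter-count distribution can also be computed in a single pass, rather than by rerunning the word function once per length. This is a minimal sketch on plain text: `word_length_distribution` is a hypothetical helper and the sample sentence is made up.

```python
import re
from collections import Counter

def word_length_distribution(text):
    """Map word length -> number of words of exactly that length."""
    words = re.sub(r"\W+", " ", text.lower()).split()
    return Counter(len(w) for w in words)

sample_text = "The cat sat on the warm mat near a door"
lengths = word_length_distribution(sample_text)
print(sorted(lengths.items()))
```

For the sample sentence, this reports five 3-letter words and three 4-letter words.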

## Conclusion
Most problems in data engineering are also fundamentally scale problems. The more data you have, the harder it is to process, and the more tradeoffs you have to make. It would be useful to download more articles, so that scale becomes more of a problem. Can you work with 1 gigabyte of articles and still have reasonably fast processing methods? How about 10 gigabytes? This project did not go that far for two reasons:
* Downloading 10 GB of data for an example project is a bit overkill.
* With more data, there are more tradeoffs, but ultimately more repetition as well. While you may have to optimize here and there, the code generally keeps the same shape.
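If the data did grow, one option (not used in this project) would be to process the files in batches, so that only one batch of HTML is held in memory at a time. The sketch below shows only the batching step. `batched` is a hypothetical helper and the file names are made up.

```python
import itertools

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch

filenames = ["wiki/page_{}.html".format(i) for i in range(7)]  # made-up names
batches = list(batched(filenames, 3))
print([len(batch) for batch in batches])
```

Each batch could then be read, parsed, and reduced to its counts before the next one is loaded.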

All in all, this project was a good introduction to data engineering, combined with some simple tools such as threading and BeautifulSoup. Threading can be very useful on large projects, but most of the time you won't need it.

It all comes down to experience: knowing when not to use something is just as important as knowing when to use it. Threading and data engineering, when used well, can make data analysis far more powerful!