# Analyzing Wikipedia Pages

In this project, we'll analyze 54 megabytes of data scraped from Wikipedia articles to find patterns in the writing and content style. Our main goals will be to extract the text from the articles and remove all HTML and JavaScript markup, remove page headers and footers, find the most common tags, and then find patterns in the text.

The articles were scraped by downloading the contents of random Wikipedia pages using the code found in the `scrape_random.py` file, and each page was then saved into our `wiki` folder under the last component of its URL. Note: due to the large amount of data, I've only added a few of these files to this repository to serve as an example.

## Introduction to the Wikipedia Data

The first thing we'll do is list all of the files in our `wiki` folder, and then open up one of the files to see what our data looks like.

In [1]:
import os

os.listdir('wiki')

['Ronald_McCaffer.html',
 'Communities_of_Tulu_Nadu.html',
 'Mountune_Racing.html',
 'Tim_Spencer_(singer).html',
 'Nathaniel_Merriman.html',
 'One_Night_of_Sin.html',
 'Middle_Park,_Victoria.html',
 'Zgornji_Otok.html',
 'Josef_Mik.html',
 'Gaston_Lane.html',
 '2008_Fed_Cup_World_Group_II.html',
 'Phenacobius_catostomus.html',
 'Dowell_Philip_O%27Reilly.html',
 'Hebden_Bridge_Picture_House.html',
 'Plze%C5%88_Zoo.html',
 'Lower_Blackburn_Grade_Bridge.html',
 'DWTE-TV.html',
 'HD_90156.html',
 'Ordinary,_Virginia.html',
 'Cyclohexane_conformation.html',
 'Bifidocarpus.html',
 'Terry_Cox.html',
 'Furubira_District,_Hokkaido.html',
 'Kentucky_Theater.html',
 'Smeaton,_East_Lothian.html',
 'Alexander_Rizzoni.html',
 'Charged_Records.html',
 'Kate_Harwood.html',
 'Goodnight%E2%80%93Loving_Trail.html',
 'Aniavan.html',
 'Athletics_at_the_1994_Commonwealth_Games_%E2%80%93_Men%27s_pole_vault.html',
 'Doumanaba.html',
 'East_Down_(Northern_Ireland_Parliament_constituency).html',
 'Coenaculum_s

In [2]:
len(os.listdir('wiki'))

999
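
Because each file is named after the last component of its URL, several names in the listing are percent-encoded (for example `Plze%C5%88_Zoo.html`). As a minimal sketch, assuming we want readable titles, the standard library's `urllib.parse.unquote` can decode them. The helper name `filename_to_title` is hypothetical and is not part of the original project.

```python
from urllib.parse import unquote


def filename_to_title(filename):
    """Turn a saved file name into a readable article title (hypothetical helper)."""
    # Drop the '.html' extension if present.
    if filename.endswith('.html'):
        filename = filename[:-len('.html')]
    # Decode percent-escapes such as %27, then swap underscores for spaces.
    return unquote(filename).replace('_', ' ')


# File names taken from the listing above.
for name in ['Plze%C5%88_Zoo.html',
             'Dowell_Philip_O%27Reilly.html',
             'Goodnight%E2%80%93Loving_Trail.html']:
    print(filename_to_title(name))
```

The raw file names are still what we need for opening the files; the decoded titles are only useful for display.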

In [3]:
with open('wiki/One_Night_of_Sin.html') as f:
    print(f.read())

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>One Night of Sin - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"One_Night_of_Sin","wgTitle":"One Night of Sin","wgCurRevisionId":766528038,"wgRevisionId":766528038,"wgArticleId":16423543,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 German-language sources (de)","EngvarB from January 2014","Use dmy dates from January 2014","Articles with hAudio microformats","Certification Table Entry usages for Austria","Certification Table Entry usages for Canada","Certification Table Entry usages for Germany","Certification Table Entry usages for Switzer

The content we want is nested within the `div` tag with the `content` id. We'll need to read in the data and remove the extra markup that we don't need.
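
To make the idea concrete, here is a rough sketch of pulling the text out of the `div` with the `content` id using only the standard library's `html.parser`. The class `ContentDivExtractor`, the function `extract_content_text`, and the sample markup are all assumptions made for illustration. This is not necessarily the approach used later in this project, and a dedicated HTML parsing library would likely be more robust on real pages.

```python
from html.parser import HTMLParser


class ContentDivExtractor(HTMLParser):
    """Collect text found inside <div id="content"> (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # open <div> tags, counted from the content div inward
        self.skip = 0     # > 0 while inside <script> or <style>
        self.chunks = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == 'div':
                self.depth += 1
            elif tag in ('script', 'style'):
                self.skip += 1
        elif tag == 'div' and dict(attrs).get('id') == 'content':
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            if tag == 'div':
                self.depth -= 1
            elif tag in ('script', 'style') and self.skip:
                self.skip -= 1

    def handle_data(self, data):
        # Keep non-empty text inside the content div, ignoring script/style bodies.
        if self.depth and not self.skip and data.strip():
            self.chunks.append(data.strip())


def extract_content_text(html):
    parser = ContentDivExtractor()
    parser.feed(html)
    return ' '.join(parser.chunks)


# Made-up sample page, not taken from the scraped data.
sample = ('<html><body><div id="header">Top</div>'
          '<div id="content"><h1>Title</h1><div><p>Body text</p></div>'
          '<script>var x = 1;</script></div>'
          '<div id="footer">Bottom</div></body></html>')
print(extract_content_text(sample))
```

The sketch only tracks nested `div` tags, so it assumes the page's `div` elements are properly closed.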

## Read in the Data

In [4]:
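import concurrent.futures
import time

# Thread pool used to read several files at the same time.
pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def read_data(filename):
    # The pages declare UTF-8, so decode explicitly rather than
    # relying on the platform's default encoding.
    with open(filename, encoding='utf-8') as f:
        data = f.read()
    return data

start = time.time()
filenames = ['wiki/{}'.format(f) for f in os.listdir('wiki')]
# pool.map returns results in the same order as filenames.
content = pool.map(read_data, filenames)
content = list(content)

end = time.time()
print(end - start)
# Article names: the file names without the folder prefix and extension.
articles = [f.replace('.html', '').replace('wiki/', '') for f in filenames]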

0.17207717895507812
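
One detail worth spelling out: `Executor.map` yields results in the same order as its input, so the page at position `i` in `content` belongs to the name at position `i` in `articles`. The sketch below checks this on a throwaway folder. The function `read_all` and the tiny sample files are hypothetical and are not part of the project's data.

```python
import concurrent.futures
import os
import tempfile


def read_data(filename):
    with open(filename, encoding='utf-8') as f:
        return f.read()


def read_all(folder, max_workers=4):
    """Read every file in folder with a thread pool (hypothetical helper)."""
    # Sort the names so the order is reproducible across platforms.
    filenames = [os.path.join(folder, name) for name in sorted(os.listdir(folder))]
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, whichever thread finishes first.
        content = list(pool.map(read_data, filenames))
    articles = [os.path.splitext(os.path.basename(f))[0] for f in filenames]
    return articles, content


# Build a few small stand-in files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    for name in ('B_page', 'A_page', 'C_page'):
        with open(os.path.join(folder, name + '.html'), 'w', encoding='utf-8') as f:
            f.write('<p>{}</p>'.format(name))
    articles, content = read_all(folder)

# Pair each article name with its page text.
pages = dict(zip(articles, content))
print(pages['A_page'])
```

Using the pool as a context manager shuts the worker threads down once the reads finish. A long-lived pool, as in the cell above, is the better fit when the same workers will be reused for later steps.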
