# Analyzing Wikipedia Pages

In this project, we'll analyze 54 megabytes of data scraped from Wikipedia articles to find patterns in the writing and content style. We'll do this by implementing our own simplified version of the `grep` command-line utility. Our goal is to run a case-insensitive search for all occurrences of a given string across all files and report the specific locations in the files where that string is found.

## Introduction to the Wikipedia Data

The articles were scraped by downloading the contents of random Wikipedia pages using the code in the `scrape_random.py` file, and each page was saved in the `wiki` folder under the last component of its URL. Note: due to the large amount of data, I've added only a few of these files to this repository to serve as an example.

The first thing we'll do is list all of the files in our `wiki` folder, and then open one of them to see what our data looks like.

In [12]:
import os

# Listing out the file names
file_names = os.listdir('wiki')
file_names

['Ronald_McCaffer.html',
 'IwakiIshikawa_Station.html',
 'Communities_of_Tulu_Nadu.html',
 'Meydane_Jahad_Metro_Station.html',
 '2014E2809315_Kansas_State_Wildcats_men27s_basketball_team.html',
 'Mountune_Racing.html',
 'Tim_Spencer_(singer).html',
 'Nathaniel_Merriman.html',
 'Embraer_Unidade_GaviC3A3o_Peixoto_Airport.html',
 'Campus_of_Texas_A26M_University.html',
 'One_Night_of_Sin.html',
 'Zgornji_Otok.html',
 'Josef_Mik.html',
 'Lin_Chiayu.html',
 'Gaston_Lane.html',
 '2008_Fed_Cup_World_Group_II.html',
 'Phenacobius_catostomus.html',
 'Hebden_Bridge_Picture_House.html',
 'ThorntonleBeans.html',
 'Lower_Blackburn_Grade_Bridge.html',
 'HD_90156.html',
 'Cyclohexane_conformation.html',
 'Bifidocarpus.html',
 'Terry_Cox.html',
 'Kentucky_Theater.html',
 'Alexander_Rizzoni.html',
 'Charged_Records.html',
 'Kate_Harwood.html',
 'Aniavan.html',
 'Doumanaba.html',
 'East_Down_(Northern_Ireland_Parliament_constituency).html',
 'Coenaculum_secundum.html',
 'Switzerland_at_the_1992_Winter_O

In [13]:
len(file_names)

999

In [14]:
# Read in and print the first file
with open(os.path.join('wiki', file_names[0])) as f:
    print(f.read())

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Ronald McCaffer - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Ronald_McCaffer","wgTitle":"Ronald McCaffer","wgCurRevisionId":726527002,"wgRevisionId":726527002,"wgArticleId":17402798,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["BLP articles lacking sources from March 2011","All BLP articles lacking sources","Articles with topics of unclear notability from March 2015","All articles with topics of unclear notability","All stub articles","Academics of Loughborough University","Scottish civil engineers","Fellows of the Royal Academy of Engineerin

## Adding a MapReduce Framework

We'll start by adding a MapReduce function to help us process the large amounts of data.

In [15]:
import math
import functools
from multiprocessing import Pool

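def make_chunks(data, num_chunks):
    # Round the chunk size up so that no element is left out
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

def map_reduce(data, num_processes, mapper, reducer):
    chunks = make_chunks(data, num_processes)
    # The context manager shuts the worker pool down once mapping is done
    with Pool(num_processes) as pool:
        chunk_results = pool.map(mapper, chunks)
    # Fold the per-chunk results into a single value
    return functools.reduce(reducer, chunk_results)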
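
To make the chunking step concrete, here is a small sketch that runs `make_chunks` on a toy list and then maps and reduces the chunks sequentially. The list of numbers and the summing step are illustrative assumptions only; they stand in for the file names and mappers used in the rest of the notebook.

```python
import functools
import math

def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Toy data (an assumption for illustration, not the Wikipedia files)
numbers = list(range(10))
chunks = make_chunks(numbers, 4)
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

# Map each chunk to a partial sum, then reduce the partial sums to one total
partial_sums = [sum(chunk) for chunk in chunks]
total = functools.reduce(lambda a, b: a + b, partial_sums)
print(total)  # 45

# Rounding the chunk size up can yield fewer chunks than requested
print(len(make_chunks(list(range(9)), 4)))  # 3
```

The last line shows that the number of chunks is an upper bound rather than a guarantee, so some worker processes may receive no work when the data is small.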

## Counting the Number of Lines in All Files

In [21]:
def map_line_count(file_names):
    # Count the lines in every file of this chunk
    total = 0
    for name in file_names:
        with open(os.path.join('wiki', name)) as f:
            total += len(f.readlines())
    return total

def reduce_line_count(count1, count2):
    # Combine the line counts of two chunks
    return count1 + count2

map_reduce(file_names, 8, map_line_count, reduce_line_count)

499797

## Grep

Selecting only a small number of words from each article speeds up the algorithm. We could also exclude any words shorter than 5 characters to help clean up some of the unnecessary values we see.
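
As a rough illustration of the case-insensitive search described in the introduction, the sketch below maps each chunk of files to a dictionary of matching line indexes and merges the dictionaries in the reducer. The names `map_grep`, `reduce_grep`, and `grep`, the `folder` parameter, and the UTF-8 encoding are assumptions made for this example, not the notebook's original code.

```python
import functools
import os

def map_grep(file_names, target, folder="wiki"):
    """Return {file name: [line indexes]} for lines containing target, ignoring case."""
    target = target.lower()
    matches = {}
    for name in file_names:
        # Assumption: the pages are UTF-8, as their <meta charset> tag declares
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            hits = [i for i, line in enumerate(f) if target in line.lower()]
        if hits:
            matches[name] = hits
    return matches

def reduce_grep(matches1, matches2):
    """Merge two result dictionaries; chunks hold different files, so keys do not collide."""
    merged = dict(matches1)
    merged.update(matches2)
    return merged

def grep(file_names, target, num_chunks=4, folder="wiki"):
    """Sequential stand-in for map_reduce: map every chunk, then reduce the results."""
    chunk_size = max(1, -(-len(file_names) // num_chunks))  # ceiling division
    chunks = [file_names[i:i + chunk_size]
              for i in range(0, len(file_names), chunk_size)]
    mapper = functools.partial(map_grep, target=target, folder=folder)
    return functools.reduce(reduce_grep, map(mapper, chunks), {})
```

Because `map_reduce` calls the mapper with a single argument, the search string is bound with `functools.partial`. The same partial could be passed to `map_reduce` together with `reduce_grep` to run the search across several processes.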

## Conclusion & Next Steps

In this project we've done some basic analysis on scraped Wikipedia data and worked to optimize the code performance.

Some next steps we could take to continue this analysis:

* Look at what tags have the most content.
* Find the articles that are most commonly linked.
* Find the most common phrases.
* Calculate the distribution of letters per word.
* Use readability metrics to find the average reading level of Wikipedia articles.
* Find what images are most commonly shown.

We could also keep downloading more data and optimize our code to work efficiently with increasing amounts of it. The idea for this project comes from the [dataquest.io](https://app.dataquest.io/) **Parallel Processing** course.