# Guided Project:
### Analyzing Wikipedia Pages

## Introducing Wikipedia Data

In this guided project, you'll be working with data scraped from [Wikipedia](https://www.wikipedia.org/), a popular online encyclopedia. Wikipedia is maintained by volunteer content contributors and editors who continuously improve content. Anyone can edit Wikipedia, and you can read more about how to make an edit [here](https://en.wikipedia.org/wiki/Help:Editing). Because Wikipedia is crowdsourced, it's been able to rapidly assemble a huge library of articles.<br>

In this guided project, we'll be analyzing `54` megabytes worth of articles to figure out patterns in the Wikipedia writing and content presentation style. The articles were scraped by hitting random pages on Wikipedia, then downloading the contents using the [requests](http://docs.python-requests.org/en/master/) package. The scraping code is in this folder, in the `scrape_random.py` file. If you need a refresher on web scraping and HTML, you may want to check out our [course](https://www.dataquest.io/course/apis-and-scraping) before trying this guided project.<br>

Articles were saved using the last component of their URLs. For example, a page on Wikipedia has the URL structure `https://en.wikipedia.org/wiki/Yarkant_County`. If we were saving the article with the previous URL, we'd save it to the file `Yarkant_County.html`. All the data files are stored in the `wiki` folder. Note that the files are raw HTML -- here are the first few lines of `Yarkant_County.html`:

![yarkant-country](https://s3.amazonaws.com/dq-content/yarkant_country.png)

Note that the page is a fairly standard HTML page with embedded [JavaScript](https://developer.mozilla.org/en-US/docs/Web/JavaScript) code. We can ignore the embedded JavaScript during our analysis, so don't worry too much about it right now.<br>
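
The naming rule described above can be sketched in a few lines. This is only an illustration of the rule, not the code from `scrape_random.py` (which isn't shown here); the helper names `article_filename` and `readable_title` are hypothetical.

```python
from urllib.parse import unquote, urlparse


def article_filename(url):
    """Return the file name an article URL would be saved under (assumed rule)."""
    # The last component of the URL path names the article.
    last_component = urlparse(url).path.rsplit("/", 1)[-1]
    return last_component + ".html"


def readable_title(filename):
    """Decode percent-escapes and underscores, for display purposes only."""
    name = filename[:-len(".html")] if filename.endswith(".html") else filename
    return unquote(name).replace("_", " ")


print(article_filename("https://en.wikipedia.org/wiki/Yarkant_County"))
print(readable_title("1953%E2%80%9354_FA_Cup_qualifying_rounds.html"))
```

Some file names in the `wiki` folder contain percent-escapes such as `%E2%80%93`, because the URL component was saved as-is. `urllib.parse.unquote` turns these back into the original characters when a human-readable title is needed.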

Our main goals will be to:
* Extract only the text from the Wikipedia pages, and remove all HTML and JavaScript markup.
* Remove common page headers and footers from the Wikipedia pages.
* Figure out what tags are the most common in Wikipedia pages.
* Figure out patterns in the text.

* List all of the files in the `wiki` folder.
* Count up and display the number of files in the `wiki` folder.
* Display a single file from the `wiki` folder, and look at the raw HTML. Do you see any patterns?

In [1]:
# all of the files in the wiki folder
%ls wiki

100_Greatest_Romanians.html
104th_Logistic_Support_Brigade_(United_Kingdom).html
16th_Virginia_Infantry.html
1896_Indiana_Hoosiers_football_team.html
1898_Colgate_football_team.html
1910_in_literature.html
1915_Montana_football_team.html
1951_National_League_tie-breaker_series.html
1953%E2%80%9354_FA_Cup_qualifying_rounds.html
1958_Wightman_Cup.html
1988_State_of_Origin_series.html
1st_Strategic_Aerospace_Division.html
2001_Australian_Individual_Speedway_Championship.html
2001_NCAA_Division_I_Field_Hockey_Championship.html
2004_Tuvalu_A-Division.html
2005%E2%80%9306_in_Welsh_football.html
2007%E2%80%9308_Huddersfield_Town_A.F.C._season.html
2008_Fed_Cup_World_Group_II.html
2009_English_cricket_season.html
2009_Wo

In [2]:
# count the number of files in wiki folder
%ls wiki | wc -l

999
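
The `%ls` and `wc` commands above rely on IPython magics and a Unix shell. A portable alternative, sketched here with the standard library only, is to list the folder from Python. The helper name `list_html_files` is hypothetical, and the count it reports may differ from the shell output if the folder holds non-HTML files.

```python
import os


def list_html_files(folder):
    """Return the sorted names of the .html files directly inside folder."""
    return sorted(name for name in os.listdir(folder) if name.endswith(".html"))


# Only runs when the data folder is present next to the notebook.
if os.path.isdir("wiki"):
    html_files = list_html_files("wiki")
    print(len(html_files))
    print(html_files[:5])
```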


In [32]:
# display a single file from wiki folder and look at the raw html
%cat wiki/100_Greatest_Romanians.html | head -50

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>100 Greatest Romanians - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"100_Greatest_Romanians","wgTitle":"100 Greatest Romanians","wgCurRevisionId":739997309,"wgRevisionId":739997309,"wgArticleId":5885981,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Use dmy dates from November 2012","Articles containing Romanian-language text","Greatest Nationals","Lists of Romanian people","Romanian Television","Romanian television series"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",

## HTML Structure

```
<!DOCTYPE html>
<html>
<head>
    <title> ... </title>
    <script> ... </script>
    <link ... />
    <meta ... />
</head>
<body>
    <div>
    ...
    </div>
    <div>
    ...
    </div>
...
</body>
</html>
```

## Reading In The Data

Now that we know the file structure, and the structure of a single file, we can read in all of the files. This will get us started in our explorations.<br>

As this task is **I/O bound**, we can use **threads** to help us read in the data more quickly. You may want to experiment with different thread counts to figure out what makes sense. It may also make sense to benchmark multi-threaded performance against single-threaded performance.

* Read all of the files into a list called `content`.
  * Each value in the list should be the content of a file.
* Create a list called `articles` that contains the article names.
  * These should be the filenames without the `wiki/` prefix and the `.html` suffix.
* Experiment with different performance options, and write up your findings.

In [1]:
import os
import concurrent.futures
import requests
from bs4 import BeautifulSoup

In [2]:
def get_content(filename):
    
    with open('wiki/'+filename) as f:
        page = f.read()
    return page

In [3]:
pool = concurrent.futures.ThreadPoolExecutor(max_workers = 2)
files = os.listdir('wiki/')
content = list(pool.map(get_content, files))

In [4]:
files[0]

'Ronald_McCaffer.html'

In [5]:
content[0][:50]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en'

In [6]:
def get_title(filename):
    return filename[:-5]

In [7]:
articles = list(pool.map(get_title, files))

In [8]:
articles[:5]

['Ronald_McCaffer',
 'Communities_of_Tulu_Nadu',
 'Mountune_Racing',
 'Tim_Spencer_(singer)',
 'Nathaniel_Merriman']
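
The cells above read the files with a fixed pool of two threads but do not compare configurations. A minimal sketch of such a benchmark follows, assuming the files are UTF-8 encoded (the pages declare `charset="UTF-8"`); the helper names `read_file` and `time_reads` are hypothetical.

```python
import os
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Read one file and return its text."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def time_reads(paths, workers, trials=3):
    """Return the median wall-clock time (seconds) to read all paths.

    workers=1 reads sequentially; larger values use a thread pool.
    """
    durations = []
    for _ in range(trials):
        start = time.perf_counter()
        if workers == 1:
            contents = [read_file(p) for p in paths]
        else:
            # The with-block shuts the pool down after each trial.
            with ThreadPoolExecutor(max_workers=workers) as pool:
                contents = list(pool.map(read_file, paths))
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Only runs when the data folder is present next to the notebook.
if os.path.isdir("wiki"):
    paths = [os.path.join("wiki", name) for name in os.listdir("wiki")]
    for workers in [1, 2, 4, 8]:
        print(workers, time_reads(paths, workers))
```

Keep in mind that the operating system caches recently read files, so repeated trials mostly measure cached reads rather than disk access.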

## Remove Extraneous Markup

Now that we've read in the data files, we can remove the extraneous markup. Most of the article content sits inside the `div` with id `content` (`div#content`), so everything outside that tag can be dropped:

```html
<script>...</script>
        ...
    
    
        ...
        <div id="content">
            ...
        </div>
        ...
```

We can use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) package for this. `BeautifulSoup` enables us to extract all of the content inside a specific tag.<br>

Using the `BeautifulSoup` package, we'll parse each wiki article, then extract the div with id `content` and everything inside it.<br>

Since this operation is more CPU intensive than before, let's try using a process pool to see if the speed improves.

* Write a function that:
  * Reads in an HTML document.
  * Parses it using `BeautifulSoup`.
  * Extracts the `div` with id `content` from the page.
  * Calls `str()` on the result to convert it from a BeautifulSoup object to a string.
  * Returns the content.
* Execute the function across the data using a process pool.
* Assign the result to `parsed`.
* Try different numbers of workers, and see how that affects performance.

In [9]:
def get_content(filename):
    """Parse one article and return its div#content element as a string."""
    with open('wiki/'+filename) as f:
        content = f.read()
    
    parser = BeautifulSoup(content, 'html.parser')
    # select() returns a list of matches; an id should match a single element
    content_div = str(parser.select("#content")[0])
    
    return content_div


In [12]:
# function testing
get_content(files[0])[:100]

'<div class="mw-body" id="content" role="main">\n<a id="top"></a>\n<div id="siteNotice"><!-- CentralNot'

In [10]:
parsed = list(pool.map(get_content, files))

In [11]:
parsed[0][:100]

'<div class="mw-body" id="content" role="main">\n<a id="top"></a>\n<div id="siteNotice"><!-- CentralNot'

In [19]:
# try different numbers of thread-pool workers

import time
import statistics

for numworkers in [2, 3, 5]:
    
    times = []
    
    # run 5 trials for each worker count
    # and print out the median
    for test in range(5):
        start = time.time()
        # the with-block shuts the pool down after each trial
        with concurrent.futures.ThreadPoolExecutor(max_workers=numworkers) as pool:
            parsed = list(pool.map(get_content, files))
        end = time.time()
        times.append(end - start)
    
    times_median = statistics.median(times)
    print('num of workers: {}'.format(numworkers))
    print('time duration: {}'.format(times_median))
    

num of workers: 2
time duration: 38.45559644699097
num of workers: 3
time duration: 39.808757066726685
num of workers: 5
time duration: 39.92155313491821


## Finding Common Tags

Now that we've extracted the main part of each page, let's count up how many times each tag occurs. This will give us clues about how Wikipedia pages are typically structured. For example, if there are a lot of `a` tags on each page, we know that Wikipedia articles tend to be very connected to other articles or pages. On the other hand, a lot of `div` tags will tell us that Wikipedia pages tend to have a nested structure with many page elements.<br>

We can count tags using the [BeautifulSoup.find_all()](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) method with no argument, then iterating through all of the tags.

* Figure out if you should run this step with threads or with processes, and the optimal count.
* Count up the number of each tag in each document.
* Combine all the dictionaries to get the overall tag counts.
* Display the tag counts, and look over them. Does this look like what you'd expect? If you look at the actual underlying Wikipedia pages, do they look how you'd expect?

In [12]:
from multiprocessing import Pool
from concurrent.futures import ThreadPoolExecutor

In [13]:
def opt_count(num_processes, func, args,
                 ttype, test_times = 5):
    
    times_tot = []
    
    for nump in num_processes:
        
        times = []
        
        for test in range(test_times):
            start = time.time()
            
            if ttype == 'process':
                p = Pool(nump)
            elif ttype == 'thread':
                p = ThreadPoolExecutor(max_workers=nump)
            else:
                raise ValueError('Invalid test type: {}'.format(ttype))
            
            result = list(p.map(func, args))
            end = time.time()
            times.append(end-start)
            
        times_tot.append(statistics.median(times))
    
    return dict(zip(num_processes, times_tot))


In [14]:
def count_tags(content):
    
    parser = BeautifulSoup(content, 'html.parser')
    tags = parser.find_all()
    tagsdict = {}
    
    for tag in tags:
        tag_name = tag.name
        
        if tag_name in tagsdict:
            tagsdict[tag_name] += 1
        else:
            tagsdict[tag_name] = 1
    
    return tagsdict

In [36]:
# approach 1. threads
# optimizing test using first 10 parsed contents
opt_count([2, 3, 5], count_tags, parsed[:10], 'thread')

{2: 0.198805570602417, 3: 0.19154906272888184, 5: 0.1617581844329834}

In [37]:
# approach 2. processes
# optimizing test using first 10 parsed contents
opt_count([2, 3, 5], count_tags, parsed[:10], 'process')

{2: 0.13243317604064941, 3: 0.17528796195983887, 5: 0.18668794631958008}

* Run `count_tags` over every document in `parsed` using a pool of 2 processes, the fastest configuration in the test above.

In [None]:
p = Pool(2)
count_tags_dictlist = list(p.map(count_tags, parsed))

In [16]:
first_count_tag_dict = count_tags_dictlist[0]

for count_tag_dict in count_tags_dictlist[1:]:
    
    for k, v in count_tag_dict.items():
        
        if k in first_count_tag_dict:
            first_count_tag_dict[k] += v
        else:
            first_count_tag_dict[k] = v
        
import operator
sorted_tag_dicts = sorted(first_count_tag_dict.items(), key=operator.itemgetter(1),
                        reverse=True)
sorted_tag_dicts[:10]

[('a', 161065),
 ('li', 85779),
 ('span', 67350),
 ('td', 57673),
 ('div', 28581),
 ('tr', 27300),
 ('i', 18246),
 ('th', 14472),
 ('b', 14455),
 ('sup', 11157)]
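
Combining the per-document dictionaries can also be written with `collections.Counter`, which adds counts on `update` instead of replacing them. This is an alternative sketch rather than the approach used in the cell above; `merge_counts` is a hypothetical helper name.

```python
from collections import Counter


def merge_counts(count_dicts):
    """Sum a sequence of {key: count} dictionaries into a single Counter."""
    total = Counter()
    for counts in count_dicts:
        total.update(counts)  # adds counts rather than replacing them
    return total


example = [{"a": 3, "li": 1}, {"a": 2, "div": 4}]
print(merge_counts(example).most_common(2))
```

Unlike the loop above, which accumulates into `count_tags_dictlist[0]` and so modifies the first document's dictionary in place, this version leaves the input dictionaries untouched.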

### Note.
* **`a` tags (links)**
  * by far the most frequent tag
* **`li` tags (list items), `span` tags (inline spans of content), `td` tags (table cells) and `div` tags (content dividers)**
  * rank next

## Finding Common Words
After finding the common tags, we should be able to find the common words in the article body. We can apply any definition of "word" that we want, but it might be helpful to apply similar criteria to what we saw in the last mission.<br>

One thing to be aware of here is that depending on the words you choose, you may run out of memory, or performance may be slow. Try tweaking things to improve performance.

* Figure out if you should run this with threads or with processes, and the optimal count.
* Parse each document, do any cleanup you want, and extract the words.
* Count up the number of words in each document.
* Combine the dictionaries to get the overall word counts across all of the documents.
* See if you can tweak the criteria of what a word is to increase performance.
  * If you only consider words with more than 7 letters, how does that change performance?
  * Does only returning the top words from the function change performance?

In [208]:
import re
from collections import Counter

In [209]:
def word_frequencies(content):
    
    # first regex removes all tags
    # second regex replaces whitespace chars (newlines, tabs, spaces)
    # third regex removes punctuation
    # fourth regex collapses runs of whitespace into a single space
    re_exp_to_clean = [r"<[^>]*>", r"[\n\s\t]", r"[*·.,\:\-\"\'\[\]\(\)]+", r"[\s]+"]
    
    cleaned_article = content.lower()
    for exp in re_exp_to_clean:
        cleaned_article = re.sub(exp, " ", cleaned_article)
    
    # final clean up: replace leftover non-word chars, then trim the ends
    # (raw string avoids an invalid-escape warning; strip() is safe on empty input)
    cleaned_article = re.sub(r"\W+", " ", cleaned_article).strip(" ")
    
    words = cleaned_article.split(" ")
    count = Counter(words)
    
    return dict(count)

In [211]:
# test
word_frequencies(parsed[0])

{'1': 5,
 '1943': 2,
 '1965': 1,
 '1977': 1,
 '1987': 1,
 '1991': 1,
 '1992': 1,
 '1993': 1,
 '1997': 2,
 '1998': 1,
 '2': 2,
 '2002': 2,
 '2006': 1,
 '2009': 2,
 '2011': 2,
 '2015': 2,
 '3': 2,
 '726527002': 1,
 'a': 9,
 'about': 1,
 'academic': 2,
 'academics': 4,
 'academy': 3,
 'adding': 1,
 'additional': 1,
 'all': 3,
 'amp': 1,
 'and': 16,
 'appointments': 1,
 'are': 1,
 'article': 4,
 'articles': 5,
 'associated': 1,
 'at': 5,
 'awarded': 1,
 'be': 3,
 'been': 1,
 'beyond': 1,
 'biography': 5,
 'births': 1,
 'blp': 2,
 'board': 2,
 'books': 3,
 'born': 2,
 'british': 2,
 'bsc': 1,
 'building': 1,
 'business': 2,
 'by': 3,
 'caledonian': 1,
 'can': 1,
 'cannot': 1,
 'career': 2,
 'categories': 2,
 'chancellor': 1,
 'chartered': 1,
 'citations': 1,
 'citing': 1,
 'civil': 4,
 'construction': 7,
 'contentious': 1,
 'contents': 1,
 'contractual': 1,
 'court': 1,
 'coverage': 1,
 'cranfield': 2,
 'dean': 1,
 'deleted': 1,
 'deputy': 1,
 'director': 1,
 'discuss': 1,
 'dsc': 1,
 'duti

In [212]:
# threads or processes?
opt_count([2, 3, 5], word_frequencies, parsed[:10], 'thread')

{2: 0.023484230041503906, 3: 0.021152019500732422, 5: 0.022765159606933594}

In [213]:
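# this run fails: the pools created by earlier opt_count calls were never closed,
# so the notebook process most likely ran out of file descriptors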
opt_count([2, 3, 5], word_frequencies, parsed[:10], 'process')

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-213-2cb48168b98e>", line 1, in <module>
    opt_count([2, 3, 5], word_frequencies, parsed[:10], 'process')
  File "<ipython-input-13-57c7b66e93b4>", line 14, in opt_count
    p = Pool(nump)
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/context.py", line 119, in Pool
    context=self.get_context())
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 156, in __init__
    self._setup_queues()
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 249, in _setup_queues
    self._inqueue = self._ctx.SimpleQueue()
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.f

OSError: [Errno 24] Too many open files
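
`Errno 24` means the process hit its limit on open file descriptors. A plausible cause is that `opt_count` creates a new pool for every trial and never closes it, and each process pool holds several pipes open. One way to avoid this is to shut every pool down as soon as its trial ends. The sketch below does so with context managers; it swaps `multiprocessing.Pool` for `concurrent.futures.ProcessPoolExecutor` so that both pool types share one interface, and `opt_count_closing` is a hypothetical name.

```python
import statistics
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

EXECUTORS = {"process": ProcessPoolExecutor, "thread": ThreadPoolExecutor}


def opt_count_closing(worker_counts, func, args, ttype, test_times=5):
    """Median run time per worker count; every pool is shut down after use."""
    if ttype not in EXECUTORS:
        raise ValueError("ttype must be 'process' or 'thread'")
    medians = {}
    for workers in worker_counts:
        times = []
        for _ in range(test_times):
            start = time.perf_counter()
            # Leaving the with-block waits for the workers and frees their resources.
            with EXECUTORS[ttype](max_workers=workers) as pool:
                list(pool.map(func, args))
            times.append(time.perf_counter() - start)
        medians[workers] = statistics.median(times)
    return medians


print(opt_count_closing([2, 3], len, ["ab", "cde"], "thread", test_times=2))
```

Note that the measured time here includes pool startup and shutdown. In a notebook that has already leaked pools, restarting the kernel is the simplest way to release the descriptors.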

In [214]:
# threads with max_workers=3 were fastest in the test above

p = ThreadPoolExecutor(max_workers=3)
word_freqs_list = list(p.map(word_frequencies, parsed))

In [219]:
total_word_freqs = {}

for wf in word_freqs_list:
    for k,v in wf.items():
        if k not in total_word_freqs:
            total_word_freqs[k] = 0
        total_word_freqs[k] += v

total_word_freqs

{'ronald': 21,
 'mccaffer': 6,
 'from': 5877,
 'wikipedia': 2924,
 'the': 34728,
 'free': 1227,
 'encyclopedia': 1064,
 'jump': 1027,
 'to': 11270,
 'navigation': 1013,
 'search': 1085,
 'this': 2410,
 'article': 1116,
 'has': 1130,
 'multiple': 105,
 'issues': 133,
 'please': 269,
 'help': 924,
 'improve': 203,
 'it': 2674,
 'or': 1490,
 'discuss': 38,
 'these': 382,
 'on': 4952,
 'talk': 100,
 'page': 301,
 'learn': 266,
 'how': 407,
 'and': 15270,
 'when': 846,
 'remove': 264,
 'template': 287,
 'messages': 43,
 'biography': 216,
 'of': 20366,
 'a': 12584,
 'living': 346,
 'person': 84,
 'needs': 116,
 'additional': 249,
 'citations': 301,
 'for': 4440,
 'verification': 78,
 'by': 4641,
 'adding': 194,
 'reliable': 165,
 'sources': 514,
 'contentious': 20,
 'material': 185,
 'about': 775,
 'persons': 35,
 'that': 2498,
 'is': 4935,
 'unsourced': 253,
 'poorly': 28,
 'sourced': 27,
 'must': 129,
 'be': 1444,
 'removed': 154,
 'immediately': 55,
 'especially': 80,
 'if': 266,
 'potent

In [220]:
sorted_total_word_freqs = sorted(total_word_freqs.items(), 
                                 key=operator.itemgetter(1),
                                 reverse=True)

sorted_total_word_freqs[:10]

[('the', 34728),
 ('of', 20366),
 ('and', 15270),
 ('in', 14774),
 ('a', 12584),
 ('to', 11270),
 ('1', 6301),
 ('from', 5877),
 ('s', 5368),
 ('2', 5015)]

In [231]:
# package the steps above into one function
# so we can run some experiments

def get_total_word_freqs():
    
    p = ThreadPoolExecutor(max_workers=3)
    word_freqs_list = list(p.map(word_frequencies, parsed))
    
    total_word_freqs = {}

    for wf in word_freqs_list:
        for k,v in wf.items():
            if k not in total_word_freqs:
                total_word_freqs[k] = 0
            total_word_freqs[k] += v
            
    sorted_total_word_freqs = sorted(total_word_freqs.items(), 
                                 key=operator.itemgetter(1),
                                 reverse=True)
    
    return sorted_total_word_freqs
    

In [232]:
def performance_test(func, times=10):
    
    time_taken = []
    for i in range(times):
        start = time.time()
        func()
        time_taken.append(time.time() - start)
        
    return statistics.median(time_taken)

In [233]:
# default running time
performance_test(get_total_word_freqs)

3.4575916528701782

In [234]:
# The experiments work by redefining the 'word_frequencies' function
# EX 1. Keep only words with at least 7 characters

def word_frequencies(content):
    
    re_exp_to_clean = [r"<[^>]*>", r"[\n\s\t]", r"[*·.,\:\-\"\'\[\]\(\)]+", r"[\s]+"]
    
    cleaned_article = content.lower()
    for exp in re_exp_to_clean:
        cleaned_article = re.sub(exp, " ", cleaned_article)
        
    # final clean up: replace leftover non-word chars, then trim the ends
    # (raw string avoids an invalid-escape warning; strip() is safe on empty input)
    cleaned_article = re.sub(r"\W+", " ", cleaned_article).strip(" ")
    
    words = cleaned_article.split(" ")
    
    words_except_lt_7chars = []
    for word in words:
        if len(word) >= 7:
            words_except_lt_7chars.append(word)
    
    count = Counter(words_except_lt_7chars)
    
    return dict(count)

In [235]:
performance_test(get_total_word_freqs)

3.4399609565734863

In [236]:
# The experiments work by redefining the 'word_frequencies' function
# EX 2. On top of EX 1, return only each document's most common words

def word_frequencies(content):
    
    re_exp_to_clean = [r"<[^>]*>", r"[\n\s\t]", r"[*·.,\:\-\"\'\[\]\(\)]+", r"[\s]+"]
    
    cleaned_article = content.lower()
    for exp in re_exp_to_clean:
        cleaned_article = re.sub(exp, " ", cleaned_article)
        
    # final clean up: replace leftover non-word chars, then trim the ends
    # (raw string avoids an invalid-escape warning; strip() is safe on empty input)
    cleaned_article = re.sub(r"\W+", " ", cleaned_article).strip(" ")
    
    words = cleaned_article.split(" ")
    
    words_except_lt_7chars = []
    for word in words:
        if len(word) >= 7:
            words_except_lt_7chars.append(word)
    
    # return the 30 most common words
    count = Counter(words_except_lt_7chars).most_common(30)
    
    return dict(count)

In [237]:
performance_test(get_total_word_freqs)

3.474249005317688

### Summary for experiments


Median run time in seconds over 10 runs:

default|words with<br>`>= 7` chars only|top 30 words<br>(`>= 7` chars)
:---:|:---:|:---:
3.458|3.440|3.474
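
The three timings are nearly identical. One possible explanation, which the measurements above do not confirm, is that the regex cleaning shared by all three versions dominates the run time, so filtering or truncating the word list afterwards changes little. Timing the stages separately would check this. The sketch below uses a simplified two-regex cleaner rather than the notebook's `word_frequencies`; `clean` and `stage_times` are hypothetical names.

```python
import re
import time
from collections import Counter

TAG_RE = re.compile(r"<[^>]*>")
NON_WORD_RE = re.compile(r"\W+")


def clean(content):
    """Strip tags and non-word characters, returning a list of lowercase words."""
    text = TAG_RE.sub(" ", content.lower())
    return NON_WORD_RE.sub(" ", text).split()


def stage_times(documents, min_len=1):
    """Return (seconds spent cleaning, seconds spent filtering and counting)."""
    clean_seconds = count_seconds = 0.0
    for doc in documents:
        start = time.perf_counter()
        words = clean(doc)
        middle = time.perf_counter()
        Counter(w for w in words if len(w) >= min_len)
        end = time.perf_counter()
        clean_seconds += middle - start
        count_seconds += end - middle
    return clean_seconds, count_seconds


sample = ["<p>Wikipedia is a free encyclopedia</p>"] * 100
print(stage_times(sample, min_len=7))
```

Running `stage_times(parsed, min_len=7)` in the notebook would show how the total splits between the two stages for the real articles.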