# Guided Project: Analyzing Wikipedia Pages

In this guided project, you'll be working with data scraped from Wikipedia, a popular online encyclopedia. We'll be analyzing 54 megabytes worth of articles to figure out patterns in the Wikipedia writing and content presentation style. The articles were scraped by hitting random pages on Wikipedia, then downloading the contents using the requests package.

Our main goals will be to:
- Extract only the text from the Wikipedia pages, and remove all HTML and JavaScript markup.
- Remove common page headers and footers from the Wikipedia pages.
- Figure out what tags are the most common in Wikipedia pages.
- Figure out patterns in the text.

In [1]:
import os
import glob
import time
import concurrent.futures
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd
import re

In [2]:
# Getting list of wiki files and looking at one closer
wiki_pages_folder = "C:\\Users\\clayt\\Documents\\Jupyter Notebooks\\Dataquest\\WikiPages\\wiki"
wiki_files = glob.glob(wiki_pages_folder + "\\*.html")
print("Number of .html files in wiki folder: {}\n".format(len(wiki_files)))

print("What files look like:")
# Read First 1000 characters from first file
with open(wiki_files[0], encoding='utf-8') as f:
    r = f.read()
    print(r[:1000])

Number of .html files in wiki folder: 996

What files look like:
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>École des Mines de Douai - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"École_des_Mines_de_Douai","wgTitle":"École des Mines de Douai","wgCurRevisionId":766474818,"wgRevisionId":766474818,"wgArticleId":1225267,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Coordinates on Wikidata","Articles lacking in-text citations from March 2016","All articles lacking in-text citations","Grandes écoles","Educational institutions established in 1878","Université Lille Nord de France","Douai"],"wg

## Reading in Files With Threads

Now that we know the file structure, and the structure of a single file, we can read in all of the files. This will get us started in our explorations.

As this task is I/O bound, we can use threads to help us read in the data more quickly. We will experiment with different thread counts to figure out what makes sense. We will benchmark the performance with multiple threads versus single threaded performance.

In [3]:
def read_file(file):
    with open(file, encoding='utf-8') as f:
        lines = f.read()
    return lines

thread_number = [4, 3, 2, 1]

for num in thread_number:
    content = None
    start = time.time()
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=num)
    content = pool.map(read_file,wiki_files)
    content = list(content)
    end = time.time() - start
    print("Time of {} thread(s): {:0.2f} secs".format(num,end))

Time of 4 thread(s): 0.25 secs
Time of 3 thread(s): 0.26 secs
Time of 2 thread(s): 0.27 secs
Time of 1 thread(s): 0.28 secs


There does not seem to be a large improvement as the number of threads increases.
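As a minimal sketch of the same benchmark, the version below wraps each pool in a `with` block so it is shut down after every run, and times the reads with `time.perf_counter()`, which is intended for measuring short durations. The temporary files and the `timed_read` helper are assumptions made for illustration; they stand in for the wiki folder and are not part of the original notebook.

```python
import concurrent.futures
import os
import tempfile
import time

def read_file(path):
    with open(path, encoding="utf-8") as f:
        return f.read()

def timed_read(paths, workers):
    """Read every path with `workers` threads; return (contents, seconds)."""
    start = time.perf_counter()
    # The with-block shuts the pool down once all reads are finished
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        contents = list(pool.map(read_file, paths))
    return contents, time.perf_counter() - start

# Stand-in data: a few small temporary files instead of the wiki folder
with tempfile.TemporaryDirectory() as folder:
    paths = []
    for i in range(20):
        path = os.path.join(folder, "page_{}.html".format(i))
        with open(path, "w", encoding="utf-8") as f:
            f.write("<html>page {}</html>".format(i))
        paths.append(path)

    for workers in [4, 1]:
        contents, seconds = timed_read(paths, workers)
        print("Time of {} thread(s): {:0.4f} secs".format(workers, seconds))
```

`pool.map` returns results in the same order as the input paths, so `contents[i]` always belongs to `paths[i]` regardless of which thread finished first.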

In [4]:
# Create list of article names: the file name without its folder or .html extension
articles = [os.path.splitext(os.path.basename(x))[0] for x in wiki_files]
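The article names taken from the file names are still percent-encoded (for example `%C3%89cole_des_Mines_de_Douai`). As a sketch of how they could be turned into readable titles, the snippet below uses `pathlib.PureWindowsPath` and `urllib.parse.unquote` from the standard library. The helper names `article_name` and `readable_title` and the folder in `example_path` are made up for illustration.

```python
from pathlib import PureWindowsPath
from urllib.parse import unquote

def article_name(path):
    """Return the file name without its folder or .html extension."""
    return PureWindowsPath(path).stem

def readable_title(name):
    """Decode percent-escapes and turn underscores into spaces."""
    return unquote(name).replace("_", " ")

# Hypothetical path; only the file name matches the real data
example_path = r"C:\wiki\%C3%89cole_des_Mines_de_Douai.html"
name = article_name(example_path)
print(name)
print(readable_title(name))
```

`PureWindowsPath` parses backslash-separated paths on any operating system, which keeps the sketch independent of the machine it runs on.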

## Parsing and Extracting Data From HTML Files

Using the BeautifulSoup package, we'll parse each wiki article, then extract the div with id <code>content</code> and everything inside it.

Since this operation is more CPU intensive than before, let's try using a <b>process pool</b> to see if the speed improves.

### Additional Concerns for Running <code>concurrent.futures.ProcessPoolExecutor</code> on Windows 10 in Jupyter Notebooks

Under Windows, it is important to protect the main loop of code to avoid recursive spawning of subprocesses when using <code>ProcessPoolExecutor</code> or any other parallel code which spawns new processes.

Basically, all code that creates new processes must be under <code>if \_\_name__ == '\__main__':</code>. For this reason, the code cannot be executed directly in the Jupyter notebook on Windows 10.

To get around this, I wrote the code for <code>ProcessPoolExecutor</code> to a separate file and ran it as a script, so that it runs under <code>if \_\_name__ == '\__main__':</code> outside of the notebook's interpreter.

In [5]:
# Need to save content outside of jupyter notebook so that ProcessPoolExecutor can access it
import json
with open('wiki_content.txt', 'w', encoding='utf-8') as f:
    json.dump(content, f)

# Text that will be saved as a file and run separately from the notebook's interpreter
fn="wiki_1.py"
with open(fn, "w", encoding='utf-8') as f: f.write(
"""import concurrent.futures
from bs4 import BeautifulSoup
import time
import json

def extract_content(content):
    soup = BeautifulSoup(content, 'html.parser')
    div = soup.find("div", id="content")
    return str(div)
    
def main():
    
    #Read in data file
    with open('wiki_content.txt', 'r', encoding='utf-8') as f:
        content = json.load(f)
    
    process_pool = [6,5,4,3,2,1]
    
    for num in process_pool:
        content_extracted = None
        start = time.time()
        pool = concurrent.futures.ProcessPoolExecutor(max_workers=num)
        content_extracted = pool.map(extract_content,content)
        content_extracted = list(content_extracted)
        end = time.time() - start
        print("Time of {} processes: {:0.2f} secs".format(num,end))
    
    # Write out content_extracted file
    with open('wiki_content_extracted.txt', 'w', encoding='utf-8') as f:
        json.dump(content_extracted, f)
        
if __name__=='__main__':
    main()
"""
)

# Run file outside of jupyter notebook    
!python $fn

# Read in wiki_content_extracted.txt file
with open('wiki_content_extracted.txt', 'r', encoding='utf-8') as f:
    wiki_content_extracted = json.load(f)

Time of 6 processes: 12.70 secs
Time of 5 processes: 11.49 secs
Time of 4 processes: 11.23 secs
Time of 3 processes: 13.83 secs
Time of 2 processes: 20.07 secs
Time of 1 processes: 37.80 secs


It looks like increasing the number of processes reduced the execution time up to 4 processes. Beyond that point, additional processes did not reduce the time any further. Going from 1 to 4 processes made the run about 3.4 times faster (37.80 secs versus 11.23 secs).
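The speedup figure can be checked directly from the printed timings. The sketch below also computes a parallel efficiency (speedup divided by the number of processes), which is an extra measure I am adding for illustration and is not part of the original analysis.

```python
# Timings copied from the output above: processes -> seconds
timings = {6: 12.70, 5: 11.49, 4: 11.23, 3: 13.83, 2: 20.07, 1: 37.80}

baseline = timings[1]
for n in sorted(timings):
    speedup = baseline / timings[n]
    efficiency = speedup / n  # fraction of ideal linear scaling
    print("{} processes: speedup {:.2f}x, efficiency {:.0%}".format(n, speedup, efficiency))

speedup_at_4 = baseline / timings[4]
```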

## Looking at Structure of Wiki Articles Through Tag Counts

Now that we've extracted the main part of each page, let's count how many times each tag occurs. This will give us clues about how Wikipedia pages are typically structured. For example, if there are a lot of <code>a</code> tags on each page, we know that Wikipedia articles tend to be heavily linked to other articles or pages. On the other hand, a lot of <code>div</code> tags will tell us that Wikipedia pages tend to have a nested structure with many page elements.

We can count tags using the BeautifulSoup.find_all() method with no argument, then iterating through all of the tags.
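To make the idea concrete, here is a small sketch of the counting step on a hand-written HTML fragment. The fragment is invented for illustration; only `find_all()` and `Counter` come from the approach described above.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Made-up fragment standing in for one extracted article
html = '<div id="content"><p>Hello <a href="#">link</a> and <a href="#">another</a></p></div>'

soup = BeautifulSoup(html, "html.parser")
# find_all() with no arguments returns every tag in the document
counts = Counter(tag.name for tag in soup.find_all())
print(counts)
```

Text nodes such as "Hello" are not tags, so they do not appear in the count.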

We will be using the <code>concurrent.futures.ProcessPoolExecutor</code>, since this process is computationally intensive and therefore CPU bound.

We will look at the types of tags in each wiki article to understand how it is structured. The stats that need to be computed are:
- number of each tag in each document
- overall number of tags in all documents

In [6]:
# Text that will be saved as a file and run separately from the notebook's interpreter
fn="wiki_2.py"
with open(fn, "w", encoding='utf-8') as f: f.write(
"""import concurrent.futures
from bs4 import BeautifulSoup
import time
import json
from collections import Counter

def tag_count(content):
    found = BeautifulSoup(content, 'html.parser').find_all()
    c = Counter([tag.name for tag in found])
    return c

def main():
    
    # Read in data file
    with open('wiki_content_extracted.txt', 'r', encoding='utf-8') as f:
        content = json.load(f)
    
    process_pool = [6,5,4,3,2,1]
    
    for num in process_pool:
        c = None
        start = time.time()
        pool = concurrent.futures.ProcessPoolExecutor(max_workers=num)
        content_count = pool.map(tag_count,content)
        content_count = list(content_count)
        end = time.time() - start
        print("Time of {} processes: {:0.2f} secs".format(num,end))
        
    # Write out content_count file
    with open('wiki_content_count.txt', 'w', encoding='utf-8') as f:
        json.dump(content_count, f)
        
if __name__=='__main__':
    main()
"""
)

# Run file outside of jupyter notebook    
!python $fn

# Read in wiki_content_count.txt file
with open('wiki_content_count.txt', 'r', encoding='utf-8') as f:
    wiki_content_count = json.load(f)

Time of 6 processes: 8.17 secs
Time of 5 processes: 7.87 secs
Time of 4 processes: 8.16 secs
Time of 3 processes: 8.65 secs
Time of 2 processes: 12.50 secs
Time of 1 processes: 22.15 secs


Looking at the time it took to count the tags, there was no real improvement once the number of processes went past 4. This makes sense, since my computer has 4 processor cores. Past 4 processes, new processes just wait for computer resources to open up, so adding more does not help.

Now we will go on to look at the tags per article and compute the total tag counts across all of the wiki articles analyzed.
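One detail worth keeping in mind here: the tag counts were written to disk with `json.dump`, and JSON has no notion of a `Counter`, so they come back as plain dictionaries. The sketch below shows that round trip on two made-up counts and one way of adding them up; the sample data is invented for illustration.

```python
import json
from collections import Counter

# Two made-up per-article tag counts, standing in for the real data
per_article = [Counter({"a": 3, "div": 1}), Counter({"a": 2, "span": 4})]

# json stores each Counter as an ordinary object, so loading gives plain dicts
loaded = json.loads(json.dumps(per_article))
print(type(loaded[0]).__name__)

# Re-wrap each dict and accumulate; Counter.update() adds counts together
total = Counter()
for counts in loaded:
    total.update(Counter(counts))

print(total.most_common())
```

Unlike `dict.update`, which overwrites values, `Counter.update` sums them, which is what makes it suitable for building an overall total.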

In [7]:
# Count Per Article
articles_count = [['name','Counter']]
for i in range(len(articles)):
    articles_count.append([articles[i],Counter(wiki_content_count[i])])

# Overall Count
c_all_wiki = Counter()
# Use a separate loop variable so the articles list is not overwritten
for article in articles_count[1:]:
    c_all_wiki += article[1]

Now that I have counted the tags, I will rearrange the counts into a pandas dataframe for analysis.

In [8]:
# Column order: every tag seen in any article, then the article name
tag_columns = list(c_all_wiki.keys())

# One record per article; articles_count[0] is the header row, so skip it
records = [{**counts, 'name': name} for name, counts in articles_count[1:]]

# Build the dataframe in one step; this avoids row-by-row DataFrame.append,
# which pandas 2.0 removed
df_wiki_count = pd.DataFrame(records, columns=tag_columns + ['name'])

# Tags missing from an article become NaN; replace them with 0
df_wiki_count = df_wiki_count.fillna(0)

# Number the rows starting from 1
df_wiki_count.index = range(1, len(df_wiki_count) + 1)

df_wiki_count

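| | div | a | h1 | table | caption | tr | th | td | span | br | ... | audio | source | del | h5 | h6 | map | area | pre | samp | name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 55 | 580 | 1 | 7.0 | 2.0 | 84.0 | 58.0 | 172.0 | 116.0 | 33.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | %C3%89cole_des_Mines_de_Douai |
| 2 | 37 | 487 | 1 | 4.0 | 0.0 | 22.0 | 14.0 | 22.0 | 37.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | %C3%89taule |
| 3 | 18 | 104 | 1 | 4.0 | 0.0 | 9.0 | 6.0 | 9.0 | 30.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | %C5%8Cnog%C5%8D_Station |
| 4 | 26 | 167 | 1 | 1.0 | 0.0 | 3.0 | 1.0 | 2.0 | 42.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100_Greatest_Romanians |
| 5 | 16 | 53 | 1 | 1.0 | 0.0 | 8.0 | 7.0 | 7.0 | 34.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 104th_Logistic_Support_Brigade_(United_Kingdom) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 992 | 18 | 74 | 1 | 2.0 | 0.0 | 11.0 | 2.0 | 18.0 | 38.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Zeiraphera |
| 993 | 31 | 142 | 1 | 3.0 | 0.0 | 19.0 | 12.0 | 18.0 | 65.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Zgornji_Otok |
| 994 | 16 | 49 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 44.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Zhongxing_Guesthouse |
| 995 | 71 | 831 | 1 | 24.0 | 0.0 | 140.0 | 71.0 | 129.0 | 72.0 | 10.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ZNF644 |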


### Overall Averages and Ranges for all tags seen in wiki articles analyzed

In [9]:
df1 = df_wiki_count.describe().transpose().sort_values(by=['mean'], ascending=False)
s1 = df_wiki_count.agg(['sum']).transpose().drop('name')['sum']
df1['sum_all_tags'] = s1
df1.sort_values(by=['sum_all_tags'], ascending=False)

| | count | mean | std | min | 25% | 50% | 75% | max | sum_all_tags |
|---|---|---|---|---|---|---|---|---|---|
| a | 996.0 | 160.918675 | 205.182665 | 13.0 | 51.0 | 92.0 | 177.75 | 2271.0 | 160275 |
| li | 996.0 | 85.477912 | 134.938581 | 3.0 | 17.0 | 35.0 | 97.0 | 1214.0 | 85136 |
| span | 996.0 | 67.518072 | 99.753569 | 0.0 | 20.0 | 41.0 | 73.0 | 1119.0 | 67248 |
| td | 996.0 | 57.779116 | 212.345337 | 0.0 | 6.0 | 17.0 | 35.0 | 4516.0 | 57548 |
| div | 996.0 | 28.599398 | 25.778437 | 12.0 | 16.0 | 21.0 | 32.0 | 434.0 | 28485 |
| tr | 996.0 | 27.293173 | 46.482724 | 0.0 | 6.0 | 15.0 | 30.0 | 617.0 | 27184 |
| i | 996.0 | 18.276104 | 39.677102 | 0.0 | 2.0 | 6.0 | 16.0 | 644.0 | 18203 |
| b | 996.0 | 14.488956 | 32.266025 | 0.0 | 3.0 | 6.0 | 12.0 | 644.0 | 14431 |
| th | 996.0 | 14.463855 | 22.145605 | 0.0 | 3.0 | 8.0 | 18.0 | 311.0 | 14406 |
| sup | 996.0 | 11.194779 | 27.53914 | 0.0 | 1.0 | 3.0 | 10.0 | 347.0 | 11150 |


### Looking at some max tag articles

In [10]:
col = ['a','li','span','div','td','img','h2','h3','code']

for c in col:
    # idxmax() returns the index label of the row with the largest count
    top_article = df_wiki_count.loc[df_wiki_count[c].idxmax(), 'name']
    print('Wiki Article with most <{}> tags:\nhttps://en.wikipedia.org/wiki/{}\n'.format(c, top_article))

Wiki Article with most <a> tags:
https://en.wikipedia.org/wiki/List_of_Spaghetti_Western_films

Wiki Article with most <li> tags:
https://en.wikipedia.org/wiki/Gandigwad

Wiki Article with most <span> tags:
https://en.wikipedia.org/wiki/2016%E2%80%9317_New_Zealand_Football_Championship

Wiki Article with most <div> tags:
https://en.wikipedia.org/wiki/Double_Wing_Attack

Wiki Article with most <td> tags:
https://en.wikipedia.org/wiki/Double_Wing_Attack

Wiki Article with most <img> tags:
https://en.wikipedia.org/wiki/2009_World_Junior_Ice_Hockey_Championships_rosters

Wiki Article with most <h2> tags:
https://en.wikipedia.org/wiki/List_of_birds_of_Malta

Wiki Article with most <h3> tags:
https://en.wikipedia.org/wiki/Bias

Wiki Article with most <code> tags:
https://en.wikipedia.org/wiki/Oldham_Metropolitan_Borough_Council_election,_2000



## Looking at Content of Wiki Articles Through Word Counts

After finding the common tags, we should be able to find the common words in the article body. We can apply any definition of "word" that we want. Here, a word is any run of consecutive letters a to z (after lower-casing the text) bounded by characters outside that set. We will also keep only words of five or more letters, which removes most uninteresting common words.

Since I will be analyzing the words within the documents and doing calculations on them, this is a CPU-bound program, and therefore we will use the <code>ProcessPoolExecutor</code>.
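The word pattern used in the script is `[a-z]{5,}`, applied to lower-cased text. As a quick sketch of what that pattern does and does not match, here it is on a made-up sentence; the sample text is an assumption for illustration only.

```python
import re
from collections import Counter

# Made-up sentence standing in for the text of an article
text = "The École des Mines retrieved articles; retrieved 2016"

# Same pattern as the script: runs of five or more letters a-z
words = Counter(re.findall(r"[a-z]{5,}", text.lower()))
print(words)
```

Short words ("the", "des") and digits are dropped. Accented letters are not in `a-z`, so "école" is cut down to "cole", which is too short to count. This means non-English words can be split or lost under this definition.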

In [11]:
# Text that will be saved as a file and run separately from the notebook's interpreter
fn="wiki_3.py"
with open(fn, "w", encoding='utf-8') as f: f.write(
"""import concurrent.futures
from bs4 import BeautifulSoup
import time
import json
from collections import Counter
import re

def word_count_ext(html):
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text().lower()
    c = Counter(re.findall(r'[a-z]{5,}', text))
    return c

def main():
    
    # Read in data file
    with open('wiki_content_extracted.txt', 'r', encoding='utf-8') as f:
        content = json.load(f)
    
    process_pool = [1,2,3,4,5,6]
    
    for num in process_pool:
        c = None
        start = time.time()
        pool = concurrent.futures.ProcessPoolExecutor(max_workers=num)
        word_count = pool.map(word_count_ext,content)
        word_count = list(word_count)
        end = time.time() - start
        print("Time of {} processes: {:0.2f} secs".format(num,end))
        
    # Write out word_count file
    with open('wiki_word_count.txt', 'w', encoding='utf-8') as f:
        json.dump(word_count, f)
        
if __name__=='__main__':
    main()
"""
)

# Run file outside of jupyter notebook    
!python $fn

# Read in wiki_word_count.txt file
with open('wiki_word_count.txt', 'r', encoding='utf-8') as f:
    wiki_word_count = json.load(f)

Time of 1 processes: 22.88 secs
Time of 2 processes: 12.34 secs
Time of 3 processes: 8.79 secs
Time of 4 processes: 6.67 secs
Time of 5 processes: 6.73 secs
Time of 6 processes: 6.95 secs


In [12]:
# Overall Count of Words in All Articles
c_all_wiki_words = Counter()
for words in wiki_word_count:
    c_all_wiki_words += words

print('50 Most Common Words Over 4 Characters in Length Found:')
print(c_all_wiki_words.most_common(50))

50 Most Common Words Over 4 Characters in Length Found:
[('retrieved', 3208), ('wikipedia', 2613), ('categories', 1925), ('articles', 1807), ('references', 1404), ('united', 1383), ('which', 1238), ('title', 1203), ('article', 1109), ('search', 1082), ('encyclopedia', 1061), ('index', 1046), ('first', 1024), ('https', 1022), ('university', 1015), ('navigation', 1010), ('oldid', 996), ('links', 968), ('external', 956), ('october', 918), ('after', 909), ('their', 899), ('other', 896), ('national', 882), ('august', 881), ('september', 876), ('november', 843), ('production', 830), ('world', 828), ('states', 824), ('march', 795), ('school', 778), ('about', 774), ('county', 764), ('january', 757), ('district', 754), ('december', 750), ('february', 747), ('history', 731), ('state', 716), ('american', 702), ('april', 693), ('south', 680), ('italian', 647), ('italy', 623), ('north', 614), ('years', 589), ('there', 575), ('family', 569), ('coordinates', 564)]
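Several of the top entries (for example 'retrieved', 'wikipedia', 'navigation' and 'oldid') look like they may come from page headers, footers and reference sections rather than from article text, although I have not confirmed that here. One of the goals listed at the start was to remove common headers and footers, so a rough first step could be to filter such words out of the counts. The `boilerplate` set below is a hand-picked guess for illustration, not a definitive list.

```python
from collections import Counter

# A few entries copied from the output above
word_counts = Counter({"retrieved": 3208, "wikipedia": 2613, "united": 1383,
                       "university": 1015, "navigation": 1010, "oldid": 996})

# Hand-picked guess at boilerplate words; not a definitive list
boilerplate = {"retrieved", "wikipedia", "navigation", "oldid"}

filtered = Counter({w: n for w, n in word_counts.items() if w not in boilerplate})
print(filtered.most_common())
```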
