# Analyzing Wikipedia Pages

## Introduction

Volunteer contributors and editors maintain Wikipedia by continuously improving its content. As a result, Wikipedia hosts a huge library of articles. In this project, we will implement a simplified version of the grep command-line utility to search 54 megabytes of articles scraped from Wikipedia.

Our main goals are:

* Search for all occurrences of a string in all of the files.
* Provide a case-insensitive option to the search.
* Refine the results by reporting the exact position of each match within its file.

## Introducing Wikipedia Data

Each article was saved under the last component of its URL, and all data files are in the wiki folder. To start, we'll list all of the files in this folder.

In [1]:
# List files in the wiki folder and count number of files
import os

folder = 'wiki'
files = os.listdir(folder)
number_files = len(files)
print('Number of files in the wiki folder: ', number_files)

Number of files in the wiki folder:  999
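
`os.listdir` returns names in arbitrary order and also lists subdirectories, so a slightly more defensive listing can make later runs reproducible. The sketch below is an illustration only; the helper name `list_article_files` is hypothetical and not part of this notebook.

```python
import os

def list_article_files(folder):
    """Return the sorted names of the regular files in folder (hypothetical helper)."""
    names = []
    for name in os.listdir(folder):
        # Skip subdirectories so that only readable files are returned
        if os.path.isfile(os.path.join(folder, name)):
            names.append(name)
    return sorted(names)
```

Calling `list_article_files('wiki')` would then give the same order on every run, which makes the chunks handed to each process predictable.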


In [2]:
# Read the first file and print first ten lines
with open(os.path.join(folder, files[0])) as first_file:
    first_file = list(first_file)
first_file[:10]

['<!DOCTYPE html>\n',
 '<html class="client-nojs" lang="en" dir="ltr">\n',
 '<head>\n',
 '<meta charset="UTF-8"/>\n',
 '<title>Bay of Concepción - Wikipedia</title>\n',
 '<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n',
 '<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Bay_of_Concepción","wgTitle":"Bay of Concepción","wgCurRevisionId":647460156,"wgRevisionId":647460156,"wgArticleId":16044270,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Coordinates on Wikidata","All stub articles","Landforms of Bío Bío Region","Bays of Chile","Bío Bío Region geography stubs"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""

## Adding the MapReduce Framework

We start by adding the MapReduce framework, since the rest of this project builds on it.

In [3]:
import math
import functools
from multiprocessing import Pool

def make_chunks(data, num_chunks):
    # Split data into at most num_chunks slices of nearly equal size
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

def map_reduce(data, num_processes, mapper, reducer):
    # Apply mapper to each chunk in a separate process
    chunks = make_chunks(data, num_processes)
    pool = Pool(num_processes)
    chunks_results = pool.map(mapper, chunks)
    pool.close()
    pool.join()
    # Fold the per-chunk results into a single value
    return functools.reduce(reducer, chunks_results)
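
To see how the data is divided before it reaches the pool, the chunking step can be run on its own. The snippet below repeats the `make_chunks` logic so that it runs standalone; the sample lists are made up for illustration.

```python
import math

def make_chunks(data, num_chunks):
    # Same logic as the notebook's make_chunks, repeated so the example runs alone
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Ten items split for three processes: chunk_size is ceil(10 / 3) = 4
print(make_chunks(list(range(10)), 3))
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Rounding up can yield fewer chunks than requested: ceil(5 / 4) = 2
print(make_chunks(list('abcde'), 4))
# → [['a', 'b'], ['c', 'd'], ['e']]
```

The second call shows that `num_chunks` is an upper bound, so some processes in the pool may receive no work when the data is small.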

### Total number of lines in all files in the wiki folder using MapReduce

In [4]:
def open_files(files):
    # Map step: total number of lines in the given chunk of files
    len_files = 0
    for name_file in files:
        with open(os.path.join('wiki', name_file)) as open_file:
            len_files += len(list(open_file))
    return len_files

def count_lines(len_files1, len_files2):
    # Reduce step: add two partial line counts
    return len_files1 + len_files2

map_reduce(files, 10, open_files, count_lines)

499797
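
One way to gain confidence in a MapReduce result is to run the same kind of mapper and reducer in a single process and compare the totals. The sketch below assumes a hypothetical helper, `serial_map_reduce`, and uses a tiny temporary folder rather than the wiki data.

```python
import functools
import os
import tempfile

def count_lines_in_files(paths):
    # Mapper: total number of lines across the given file paths
    total = 0
    for path in paths:
        with open(path) as opened_file:
            total += sum(1 for _ in opened_file)
    return total

def add_counts(count1, count2):
    # Reducer: combine two partial counts
    return count1 + count2

def serial_map_reduce(chunks, mapper, reducer):
    # Hypothetical stand-in for map_reduce that runs in a single process
    return functools.reduce(reducer, map(mapper, chunks))

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, text in enumerate(['a\nb\n', 'c\n', 'd\ne\nf\n']):
        path = os.path.join(tmp, f'page_{i}.html')
        with open(path, 'w') as new_file:
            new_file.write(text)
        paths.append(path)
    # Two chunks: the first two files, then the last one
    total_lines = serial_map_reduce([paths[:2], paths[2:]], count_lines_in_files, add_counts)

print(total_lines)  # → 6
```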

## Grep Exact Match

Our first MapReduce grep algorithm will locate every line, across all files in the wiki folder, that contains a given string.

In [5]:
def find_string(files):
    # Map step: for each file, collect the indexes of the lines containing str_given.
    # str_given is a global variable that must be set before grep_mapreduce is called.
    findings = {}
    for file_name in files:
        with open(os.path.join('wiki', file_name)) as opened_file:
            for row_index, row in enumerate(opened_file):
                if str_given in row:
                    findings.setdefault(file_name, []).append(row_index)
    return findings

def reduce_dict(dict1, dict2):
    dict1.update(dict2)
    return dict1
        
def grep_mapreduce(folder, num_process):
    files_in_folder = os.listdir(folder)
    return map_reduce(files_in_folder, num_process, find_string, reduce_dict)
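
Merging with `dict.update` works here because every file belongs to exactly one chunk, so two partial results never share a key. If keys could collide, `update` would silently overwrite one list with the other. A reducer that concatenates instead might look like the following sketch; `merge_findings` is a hypothetical name and the inputs are toy data.

```python
def merge_findings(dict1, dict2):
    # Hypothetical reducer that concatenates lists instead of overwriting them
    merged = {key: list(rows) for key, rows in dict1.items()}
    for key, rows in dict2.items():
        merged.setdefault(key, []).extend(rows)
    return merged

print(merge_findings({'a.html': [1]}, {'a.html': [4], 'b.html': [2]}))
# → {'a.html': [1, 4], 'b.html': [2]}
```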

### Finding occurrences of the string 'data'

In [6]:
str_given = 'data'
data_occurrences = grep_mapreduce('wiki', 10)
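
The mapper reads the search string from the global `str_given`, which relies on the worker processes inheriting the parent's globals (as they do with the fork start method). One possible alternative, sketched here as an assumption rather than this notebook's method, binds the string with `functools.partial`. The name `find_string_in` and its `folder` and `target` parameters are hypothetical.

```python
import functools
import os
import tempfile

def find_string_in(files, folder, target):
    # Hypothetical mapper: map each file name to the indexes of lines containing target
    findings = {}
    for file_name in files:
        with open(os.path.join(folder, file_name)) as opened_file:
            for row_index, row in enumerate(opened_file):
                if target in row:
                    findings.setdefault(file_name, []).append(row_index)
    return findings

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'a.html'), 'w') as new_file:
        new_file.write('no match\nsome data here\nData\n')
    # partial fixes folder and target, leaving a one-argument mapper
    mapper = functools.partial(find_string_in, folder=tmp, target='data')
    result = mapper(['a.html'])

print(result)  # → {'a.html': [1]}
```

Because the bound mapper takes a single argument, it has the shape that `map_reduce` expects from its `mapper` parameter.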

## Grep Case Insensitive

We can make our previous solution case-insensitive by converting both the given string and each line of the file to lowercase inside our function.

In [7]:
def find_string_insensitive(files):
    # Map step: same as find_string, but compares lowercased text
    findings = {}
    target = str_given.lower()
    for file_name in files:
        with open(os.path.join('wiki', file_name)) as opened_file:
            for row_index, row in enumerate(opened_file):
                if target in row.lower():
                    findings.setdefault(file_name, []).append(row_index)
    return findings

def grep_mapreduce_insensitive(folder, num_process):
    files_in_folder = os.listdir(folder)
    return map_reduce(files_in_folder, num_process, find_string_insensitive, reduce_dict)

### Finding all occurrences of the string 'data'

In [8]:
str_given = 'data'
data_occurrences_insensitive = grep_mapreduce_insensitive('wiki', 10)
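
The core of the case-insensitive search is a single comparison on lowercased text. The snippet below isolates that comparison; `contains` is a hypothetical helper and the sample line is made up.

```python
def contains(line, target, ignore_case=False):
    # Lowercase both sides so that 'Data', 'DATA' and 'data' all compare equal
    if ignore_case:
        return target.lower() in line.lower()
    return target in line

line = '<meta name="Data source" content="wiki"/>'
print(contains(line, 'data'))                    # → False
print(contains(line, 'data', ignore_case=True))  # → True
```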

## Checking the Implementation

Since we saved the result dictionary of each function, we can check, file by file, whether the case-insensitive algorithm found more matching lines than the exact one.

In [9]:
for file in data_occurrences:
    if data_occurrences[file] != data_occurrences_insensitive[file]:
        print(f'{file}: {len(data_occurrences_insensitive[file]) - len(data_occurrences[file])} new occurrences.')

Table_Point_Formation.html: 1 new occurrences.
Ingrid_GuimarC3A3es.html: 1 new occurrences.
Jules_Verne_ATV.html: 2 new occurrences.
Pictogram.html: 1 new occurrences.
Claire_Danes.html: 2 new occurrences.
PTPRS.html: 1 new occurrences.
A_Beautiful_Valley.html: 1 new occurrences.
Mudramothiram.html: 1 new occurrences.
Gordon_Bau.html: 2 new occurrences.
Embraer_Unidade_GaviC3A3o_Peixoto_Airport.html: 1 new occurrences.
Code_page_1023.html: 3 new occurrences.
Cryptographic_primitive.html: 1 new occurrences.
Alex_Kurtzman.html: 1 new occurrences.
Filip_Pyrochta.html: 1 new occurrences.
Morgana_King.html: 1 new occurrences.
Don_Parsons_(ice_hockey).html: 1 new occurrences.
Bias.html: 1 new occurrences.
Tomohiko_ItC58D_(director).html: 2 new occurrences.
Imperial_Venus_(film).html: 1 new occurrences.
Camp_Nelson_Confederate_Cemetery.html: 1 new occurrences.
Benny_Lee.html: 1 new occurrences.
Kul_Gul.html: 1 new occurrences.
Medicago_murex.html: 1 new occurrences.
Oldfield_Baby_Great_Lakes.
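
Note that the loop above only visits files that already had an exact match, so a file whose only matches are, say, 'Data' would not be reported. A comparison that also covers those files could be sketched as follows; the function name and the toy dictionaries are made up for illustration and are not results from the wiki folder.

```python
def new_match_counts(exact, insensitive):
    # For each file, count the lines found only by the case-insensitive search
    counts = {}
    for file_name, rows in insensitive.items():
        extra = len(rows) - len(exact.get(file_name, []))
        if extra > 0:
            counts[file_name] = extra
    return counts

# Toy data, not actual results
exact = {'a.html': [3, 8]}
insensitive = {'a.html': [3, 5, 8], 'b.html': [1]}
print(new_match_counts(exact, insensitive))  # → {'a.html': 1, 'b.html': 1}
```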

## Finding Match Positions on Lines

So far, we have only recorded the numbers of the lines that contain a match. The next implementation extends the last algorithm to also report where each match is located within the line.

Instead of storing just the line number, each list in the dictionary will now hold tuples containing the line number and the index of the first character of the match in that line.
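
To make the new structure concrete, here is what one entry could look like, using the first three positions that appear in the results table at the end of this project. The grouping step is only an illustration of how the tuples can be consumed.

```python
from collections import defaultdict

# Each file maps to a list of (line index, character index) tuples
findings = {'Valentin_Yanin.html': [(6, 840), (6, 890), (66, 90)]}

# Group the character indexes by line to see how many matches each line has
by_line = defaultdict(list)
for line_index, char_index in findings['Valentin_Yanin.html']:
    by_line[line_index].append(char_index)

print(dict(by_line))  # → {6: [840, 890], 66: [90]}
```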

### Subproblem with first_file

We will test how to find a word in a line before applying the code in our function. We'll use the sixth line (index 5) of the file stored in the first_file variable at the beginning of this project and the word 'client', which occurs twice in that line, at indexes 96 and 119.

In [10]:
first_file[5]

'<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n'

In [11]:
def find_in_row(row, word):
    index = row.find(word, 0)
    occurrences = []
    while index != -1:
        occurrences.append(index)
        index = row.find(word, index + 1)
        
    return occurrences

print(find_in_row(first_file[5], 'client'))
print(first_file[5][96:96 + len('client')])
print(first_file[5][119:119 + len('client')])

[96, 119]
client
client
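
`find_in_row` advances by one character after each hit, so it also reports overlapping matches. If overlapping matches are not needed, the standard `re` module offers an alternative that handles case-insensitivity without lowercasing the row. This is a sketch under that assumption, and `find_in_row_regex` is a hypothetical name.

```python
import re

def find_in_row_regex(row, word, ignore_case=True):
    # re.escape makes the word match literally; finditer yields non-overlapping matches
    flags = re.IGNORECASE if ignore_case else 0
    return [match.start() for match in re.finditer(re.escape(word), row, flags)]

print(find_in_row_regex('Data and data and DATA', 'data'))  # → [0, 9, 18]
print(find_in_row_regex('aaaa', 'aa'))                      # → [0, 2]
```

For the second input, `find_in_row` would also report the overlapping match that starts at index 1.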


In [12]:
def find_string_insensit_pos(files):
    # Map step: record (line index, character index) for every case-insensitive match
    findings = {}
    # Lowercase the target once; the rows are lowercased too, so both sides must agree
    target = str_given.lower()
    for file_name in files:
        with open(os.path.join('wiki', file_name)) as opened_file:
            for row_index, row in enumerate(opened_file):
                # find_in_row returns an empty list when the row has no match
                for char_index in find_in_row(row.lower(), target):
                    findings.setdefault(file_name, []).append((row_index, char_index))
    return findings



def grep_mapreduce_insensit_pos(folder, num_process):
    files_in_folder = os.listdir(folder)
    return map_reduce(files_in_folder, num_process, find_string_insensit_pos, reduce_dict)

str_given = 'science'
science_match = grep_mapreduce_insensit_pos('wiki', 10)

## Displaying the Results

We can write our results to a CSV file so that they are easier to inspect.

In [13]:
import csv
# Number of characters to show before and after the match
context_delta = 20
# newline='' lets the csv module control line endings, as its documentation recommends
with open('results.csv', 'w', newline='') as new_file:
    results = csv.writer(new_file)
    rows = [['File', 'Line', 'Index', 'Context']]
    for key in science_match:
        # Reread the original file so the context keeps its original casing
        with open(os.path.join('wiki', key)) as opened_file:
            lines = list(opened_file)
        for line_index, index in science_match[key]:
            start_context = max(index - context_delta, 0)
            end_context = index + len(str_given) + context_delta
            context = lines[line_index][start_context:end_context]
            rows.append([key, line_index, index, context])

    results.writerows(rows)
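
The context window is the part of this cell most prone to off-by-one mistakes, so it can help to test it in isolation. The sketch below wraps the same slicing arithmetic in a hypothetical `extract_context` function and runs it on made-up lines.

```python
def extract_context(line, index, match_length, delta=20):
    # Clamp the start at 0: a negative start would make the slice count from the end of the line
    start = max(index - delta, 0)
    end = index + match_length + delta  # slicing past the end of a string is safe
    return line[start:end]

line = 'x' * 30 + 'science' + 'y' * 30
print(extract_context(line, 30, len('science')))
# → xxxxxxxxxxxxxxxxxxxxscienceyyyyyyyyyyyyyyyyyyyy

print(extract_context('a science', 2, len('science')))
# → a science
```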

In [14]:
import pandas as pd
visualization = pd.read_csv('results.csv')
visualization.head(15)

Unnamed: 0,File,Line,Index,Context
0,Valentin_Yanin.html,6,840,"the USSR Academy of Sciences"",""Full Members of"
1,Valentin_Yanin.html,6,890,"Russian Academy of Sciences"",""Demidov Prize la"
2,Valentin_Yanin.html,66,90,"i/Soviet_Academy_of_Sciences"" class=""mw-redirec"
3,Valentin_Yanin.html,66,145,"=""Soviet Academy of Sciences"">Soviet Academy of"
4,Valentin_Yanin.html,66,173,""">Soviet Academy of Sciences</a>; he became a f"
5,Valentin_Yanin.html,144,1440,"the_USSR_Academy_of_Sciences"" title=""Category:F"
6,Valentin_Yanin.html,144,1502,"the USSR Academy of Sciences"">Full Members of t"
7,Valentin_Yanin.html,144,1548,the USSR Academy of Sciences</a></li><li><a hre
8,Valentin_Yanin.html,144,1632,"_Russian_Academy_of_Sciences"" title=""Category:F"
9,Valentin_Yanin.html,144,1697,"Russian Academy of Sciences"">Full Members of t"
