## Analysis of Wikipedia Pages

Wikipedia is a free online resource of information. It is a volunteer content contributors and editors website with contributors all over the world. Anyone can edit [Wikiedia](https://en.wikipedia.org/wiki/Main_Page). This project implements the grep command-line utility and searches for data in articles. The `grep`command allows searching for textual data in all files from a given directory. For this project, all the data files are stored in a folder called `wiki`. 

In [1]:
import os
import math
import functools
from multiprocessing import Pool

## List and Iterate over All Files

List and iterate over all files in the `wiki` folder 

In [2]:
file_len = 0
file_names = os.listdir('wiki')
for fname in file_names:
    #print(fname)
    file_len += 1

In [3]:
print('number of files:', file_len)

number of files: 999


## Read and Print Contents of First File

In [4]:
# read the contents of the first file and print
folder_name = 'wiki'
with open(os.path.join(folder_name, file_names[0])) as fl:
    print(fl.read())

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Bay of Concepción - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Bay_of_Concepción","wgTitle":"Bay of Concepción","wgCurRevisionId":647460156,"wgRevisionId":647460156,"wgArticleId":16044270,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Coordinates on Wikidata","All stub articles","Landforms of Bío Bío Region","Bays of Chile","Bío Bío Region geography stubs"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNa

## Adding MapReduce Function

The MapReduce function will be used will be used throughout this project

In [5]:
def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

def map_reduce(data, num_processes, mapper, reducer):
    chunks = make_chunks(data, num_processes)
    pool = Pool(num_processes)
    chunk_results = pool.map(mapper, chunks)
    return functools.reduce(reducer, chunk_results)

## Total Number of Lines in All Files

We use the MapReduce function to count the total number of lines in all files

In [6]:
# MapReduce function to count the total number of lines in all files
def map_lines_count(file_names):
    file_lines = 0
    for fl in file_names:
        with open(os.path.join(folder_name, fl)) as flname:
            flines = [line for line in flname.readlines()]
            file_lines += len(flines)
    return file_lines


def reduce_lines_count(filelines1, filelines2):
    merged = filelines1 + filelines2
    return merged

total_lines_count = map_reduce(file_names, 10, map_lines_count, reduce_lines_count)

In [7]:
total_lines_count

499797

## Create a Grep Function 

The grep algorithm does the following:
- Locates all lines in all files that contains a given string
- Returns a dictionary with file names as keys and the values as a list of all line numbers that contain the given string
- Files that do not contain the given string are not included in the result dictionary

For the given string **'data'**, find all occurrences in the files in the `wiki` folder

We define a `reduce_map_grep` function which takes the folder name ~ `wiki` and a chuck of file names to process for the occurrences of the given string.

The dictionary update function is then used to merge the result dictionaries.

In [8]:
# grep function mapper function to find all lines in all files with a target string
# function creates a dictionary with file names as keys and values are lists indexes
# that contain the target string

def map_grep_find_match(file_names):
    string_match = {}
    for fl in file_names:
        with open(fl) as flname:
            flines = [line for line in flname.readlines()]
            for ln_count, line in enumerate(flines):
                if string_target in line and fl in string_match:
                    string_match[fl].append(ln_count)
                elif string_target in line and not fl in string_match:
                    string_match[fl] = []
    return string_match

def reduce_grep_find_match(filelines1, filelines2):
    filelines1.update(filelines2)
    merged = filelines1.update(filelines2)
    return filelines1
        
def reduce_map_grep(folder_name, processes):
    file_names = [os.path.join(folder_name, fn) for fn in os.listdir(folder_name)]
    return map_reduce(file_names, processes, map_grep_find_match, reduce_grep_find_match)

In [9]:
string_target = 'data'
string_occurrence_sensitive = reduce_map_grep('wiki', 10)

## Case Insensitive Matches

Case matters in python and therefore the given string of 'data' will for example miss the word 'Data'. To be able to cater for all cases, the `grep`function can be improved by making it case insensitive.

In [10]:
# case insensitive function find all occurrences of a given string
def map_grep_find_case_insensitive_match(file_names):
    string_match = {}
    for fl in file_names:
        with open(fl) as flname:      
            flines = [line for line in flname.readlines()]
            for ln_count, line in enumerate(flines):
                if string_target.lower() in line.lower() and fl in string_match:
                    string_match[fl].append(ln_count)
                elif string_target in line and not fl in string_match:
                    string_match[fl] = []    
    return string_match


def reduce_map_grep_insensitive_match(folder_name, processes):
    file_names = [os.path.join(folder_name, fn) for fn in os.listdir(folder_name)]
    return map_reduce(file_names, processes, map_grep_find_case_insensitive_match, reduce_grep_find_match)
    
string_target = 'data'
string_occurrence_insensitive = reduce_map_grep_insensitive_match('wiki', 10)

## Checking the Implementation for More Matches

Having stored the results of the search for the given string **'data'** in **`string_occurrence_sensitive`** and **`string_occurrence_insensitive`** variables, we must verify that the new implementation works by checking if more matches are found than the previous implementation. To do this, we can iterate over the file names and compare and print the length difference between the results.

In [11]:
for ky in string_occurrence_insensitive:
    if not ky in string_occurrence_sensitive:
        print(len(string_occurrence_insensitive[ky]), 'new matches ' 'for',(os.path.join(folder_name, ky)))
    elif len(string_occurrence_insensitive[ky]) > len(string_occurrence_sensitive[ky]):
        print(len(string_occurrence_insensitive[ky]), 'new matches ' 'for',(os.path.join(folder_name, ky)))

13 new matches for wiki/wiki/Table_Point_Formation.html
12 new matches for wiki/wiki/Ingrid_GuimarC3A3es.html
22 new matches for wiki/wiki/Jules_Verne_ATV.html
26 new matches for wiki/wiki/Pictogram.html
14 new matches for wiki/wiki/Claire_Danes.html
17 new matches for wiki/wiki/PTPRS.html
4 new matches for wiki/wiki/A_Beautiful_Valley.html
10 new matches for wiki/wiki/Gordon_Bau.html
12 new matches for wiki/wiki/Embraer_Unidade_GaviC3A3o_Peixoto_Airport.html
3 new matches for wiki/wiki/Code_page_1023.html
7 new matches for wiki/wiki/Cryptographic_primitive.html
7 new matches for wiki/wiki/Alex_Kurtzman.html
6 new matches for wiki/wiki/Filip_Pyrochta.html
10 new matches for wiki/wiki/Morgana_King.html
15 new matches for wiki/wiki/Bias.html
8 new matches for wiki/wiki/Tomohiko_ItC58D_(director).html
8 new matches for wiki/wiki/Imperial_Venus_(film).html
13 new matches for wiki/wiki/Camp_Nelson_Confederate_Cemetery.html
12 new matches for wiki/wiki/Kul_Gul.html
10 new matches for wiki/wi

## Finding Index Positions on Lines

Let is begin by creating a function that takes a given string and finds the location of the string in a line of text.

In [12]:
# function takes a line of text and a given string and returns the index location if the shring
def find_line_indices(line, target):
    results = []
    si = 0
    for f in range(len(line)):
        j = line.find(target, si)
        if j != -1:
            si = j + 1
            results.append(j)
    return results

s = "Data engineers work in a variety of settings to build systems that collect, manage, and convert raw data into usable information for data scientists and business analysts to interpret.".lower()
print(find_line_indices(s, "data"))

[0, 100, 133]


Finding all location indexes of a given string extends the function for finding a given string using case insensitivity in all lines. For each occurrence on the line, add its index to the line index.

In [13]:
# find all the occurrences of the target string in all the files
# for each occurrence, return the pair such that the first number is
# the line index and the second number is the index of the target string
# on that line
def map_grep_find_line_indices(file_names):
    indices = {}
    for fl in file_names:
        with open(fl) as flname:
            flines = [line.lower() for line in flname.readlines()]
            for ln_index, line in enumerate(flines):
                match_indices = find_line_indices(line, string_target.lower())
                if not fl in indices:
                    indices[fl] = []
                else:
                    indices[fl] += [(ln_index, mi) for mi in match_indices]
    return indices

def reduce_map_grep_line_indices(folder_name, processes):
    file_names = [os.path.join(folder_name, fn) for fn in os.listdir(folder_name)]
    return map_reduce(file_names, processes, map_grep_find_line_indices, reduce_grep_find_match)
 
string_target = 'data'
line_indices = reduce_map_grep_line_indices('wiki', 10)

## Displaying the Results

To make the results more readable, we write it in a CSV file. We will also show the text around each occurrence.


In [14]:
#write results to csv file
import csv

# How many character to show before and after the match
context_delta = 30

with open('results.csv', 'w', encoding='utf-8') as fl:
    writer = csv.writer(fl)
    rows = [["File", "Line", "Index", "Context"]]
    for fln in line_indices:
        with open(fln) as f:
            lines = [line.strip() for line in f.readlines()]
        for line, index in line_indices[fln]:
            start = max(index - context_delta, 0)
            end   = index + len(string_target) + context_delta
            rows.append([fln, line, index, lines[line][start:end]])
    writer.writerows(rows)

In [15]:
import pandas
df = pandas.read_csv("results.csv")
df.head(10)

Unnamed: 0,File,Line,Index,Context
0,wiki/Bay_of_ConcepciC3B3n.html,6,422,"egories"":[""Coordinates on Wikidata"",""All stub ..."
1,wiki/Bay_of_ConcepciC3B3n.html,45,628,"78-sj18-04-quiriquina.jpg 2x"" data-file-width=..."
2,wiki/Bay_of_ConcepciC3B3n.html,45,650,"jpg 2x"" data-file-width=""960"" data-file-height..."
3,wiki/Bay_of_ConcepciC3B3n.html,58,447,"aps, aerial photos, and other data for this lo..."
4,wiki/Bay_of_ConcepciC3B3n.html,58,692,"aps, aerial photos, and other data for this lo..."
5,wiki/Bay_of_ConcepciC3B3n.html,60,18,"<table class=""metadata plainlinks stub"" role=""..."
6,wiki/Bay_of_ConcepciC3B3n.html,62,568,"o_Region%2C_Chile.svg.png 2x"" data-file-width=..."
7,wiki/Bay_of_ConcepciC3B3n.html,62,590,"png 2x"" data-file-width=""600"" data-file-height..."
8,wiki/Bay_of_ConcepciC3B3n.html,105,40,"atlinks"" class=""catlinks"" data-mw=""interface"">..."
9,wiki/Bay_of_ConcepciC3B3n.html,105,748,"tegory:Coordinates_on_Wikidata"" title=""Categor..."
