# Analyzing Wikipedia Pages

## List all files in the wiki folder

We can create a list with the names of all files in the wiki folder using the `os.listdir()` function.

In [1]:
import os

file_names = os.listdir("wiki")
len(file_names)

999
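`os.listdir()` returns every entry in the folder, in arbitrary order, and that includes subdirectories. As a minimal sketch, assuming we only want regular `.html` files in a stable order, a helper could look like the one below. The name `list_html_files` and the throwaway demo folder are hypothetical and not part of the project.

```python
import os
import tempfile

def list_html_files(folder):
    """Return the sorted names of the regular .html files in folder."""
    return sorted(
        name for name in os.listdir(folder)
        if name.endswith(".html") and os.path.isfile(os.path.join(folder, name))
    )

# Demo on a temporary folder so the sketch runs without the wiki data
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ["b.html", "a.html", "notes.txt"]:
        with open(os.path.join(demo_folder, name), "w") as f:
            f.write("")
    # A directory whose name ends in .html is skipped by the isfile check
    os.mkdir(os.path.join(demo_folder, "sub.html"))
    demo_listing = list_html_files(demo_folder)
    print(demo_listing)  # ['a.html', 'b.html']
```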

## MapReduce Function

We start by defining the MapReduce function so that we can use it throughout the project.

In [2]:
import math
import functools
from multiprocessing import Pool

def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

def map_reduce(data, num_processes, mapper, reducer):
    chunks = make_chunks(data, num_processes)
    # The context manager shuts the worker pool down once the map step is done
    with Pool(num_processes) as pool:
        chunk_results = pool.map(mapper, chunks)
    return functools.reduce(reducer, chunk_results)
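To see how the data is split and combined, here is a small sketch that runs the same steps in a single process. `map_reduce_serial` is a hypothetical stand-in for `map_reduce` (it swaps `pool.map` for the built-in `map`), which can be handy when debugging a mapper or reducer.

```python
import functools
import math

def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

def map_reduce_serial(data, num_chunks, mapper, reducer):
    """Hypothetical single-process version of map_reduce, for debugging."""
    chunks = make_chunks(data, num_chunks)
    chunk_results = map(mapper, chunks)
    return functools.reduce(reducer, chunk_results)

numbers = list(range(10))

# ceil(10 / 3) = 4, so the last chunk is shorter than the others
chunks = make_chunks(numbers, 3)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Each chunk is summed by the mapper, then the partial sums are added up
total = map_reduce_serial(numbers, 3, sum, lambda a, b: a + b)
print(total)  # 45
```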

## Count the Total Number of Lines in All Files

In [3]:
def map_line_count(file_chunk):
    total = 0
    for file in file_chunk:
        with open(os.path.join("wiki", file)) as f:
            total += len(f.readlines())
    return total

def reduce_line_count(count_chunk_1, count_chunk_2):
    return count_chunk_1 + count_chunk_2

map_reduce(file_names, 8, map_line_count, reduce_line_count)

499797
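The mapper above loads every line of a file into a list just to count them. As a possible alternative, assuming memory use matters for larger files, the lines can be counted while iterating over the file. The helper name `count_lines` and the demo file are hypothetical.

```python
import os
import tempfile

def count_lines(path):
    """Count lines by iterating over the file instead of loading it whole."""
    with open(path) as f:
        return sum(1 for _ in f)

# Demo on a temporary file so the sketch runs without the wiki data
with tempfile.TemporaryDirectory() as demo_folder:
    demo_path = os.path.join(demo_folder, "demo.html")
    with open(demo_path, "w") as f:
        f.write("first\nsecond\nthird\n")
    demo_count = count_lines(demo_path)
    print(demo_count)  # 3
```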

## Grep string function

The mapper function receives a chunk of file names and records, for each file, the indexes of the lines that contain the target string. If a file contains no occurrences, we choose not to include an entry for that file in the result dictionary.

The reducer function uses the `dict.update()` method to merge the result dictionaries. Because each file belongs to exactly one chunk, the dictionaries never share keys, so no entry is overwritten.

Note that the `target` variable is defined outside the functions and holds the string we are looking for.

In [4]:
target = "data"

def map_grep_string(file_chunk):
    result = {}
    for file in file_chunk:
        with open(os.path.join("wiki", file)) as f:
            file_lines = f.readlines()
        for index, sentence in enumerate(file_lines):
            if target in sentence:
                if file not in result:
                    result[file] = []
                result[file].append(index)
    return result

def reduce_grep_string(result_1, result_2):
    result_1.update(result_2)
    return result_1
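Relying on a global `target` works here, but depending on how worker processes are started, a later reassignment of a global may not be visible to them. One possible alternative, sketched below, binds the target to the mapper with `functools.partial`; a partial of a module-level function can be pickled, so it can be passed to `pool.map`. The function `map_grep_with_target`, its `folder` parameter, and the demo files are hypothetical.

```python
import functools
import os
import tempfile

def map_grep_with_target(target, folder, file_chunk):
    """Variant of map_grep_string that takes target and folder as arguments."""
    result = {}
    for file in file_chunk:
        with open(os.path.join(folder, file)) as f:
            for index, sentence in enumerate(f):
                if target in sentence:
                    # Create the list on first match, then record the line index
                    result.setdefault(file, []).append(index)
    return result

# Demo on temporary files so the sketch runs without the wiki data
with tempfile.TemporaryDirectory() as demo_folder:
    with open(os.path.join(demo_folder, "a.html"), "w") as f:
        f.write("no match here\nsome data\nmore data\n")
    with open(os.path.join(demo_folder, "b.html"), "w") as f:
        f.write("nothing\n")
    # The partial fixes the first two arguments; only the chunk remains
    mapper = functools.partial(map_grep_with_target, "data", demo_folder)
    demo_result = mapper(["a.html", "b.html"])
    print(demo_result)  # {'a.html': [1, 2]}
```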

## Finding the occurrences of "data"

In [5]:
target = "data"
data_occurrences = map_reduce(file_names, 8, map_grep_string, reduce_grep_string)

## Allow for Case-Insensitive Matches

We can allow case-insensitive matches by converting both the target and the file contents to lowercase before we match.

In [6]:
target = "data"

def map_grep_icase(file_chunk):
    result = {}
    for file in file_chunk:
        with open(os.path.join("wiki", file)) as f:
            file_lines = f.readlines()
        for index, sentence in enumerate(file_lines):
            if target.lower() in sentence.lower():
                if file not in result:
                    result[file] = []
                result[file].append(index)
    return result


new_data_occurrences = map_reduce(file_names, 8, map_grep_icase, reduce_grep_string)
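The effect of lowercasing both sides can be checked on a single line. The helper `contains_icase` below is a hypothetical illustration of the comparison used in the mapper.

```python
def contains_icase(line, target):
    """Case-insensitive substring test, as used in map_grep_icase."""
    return target.lower() in line.lower()

sample = "Data science is fun"
exact = "data" in sample                  # case-sensitive test misses "Data"
relaxed = contains_icase(sample, "data")  # lowercasing both sides finds it
print(exact, relaxed)  # False True
```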

## Checking that we find more matches

We already stored the results in the variables `data_occurrences` and `new_data_occurrences`. To check that the second version of the algorithm finds more matches, we can loop over the files in the new result and print how many more matching lines each file has than before.

In [7]:
for fn in new_data_occurrences:
    if fn not in data_occurrences:
        print("Found {} new matches on file {}".format(len(new_data_occurrences[fn]), fn))
    elif len(new_data_occurrences[fn]) > len(data_occurrences[fn]):
        print("Found {} new matches on file {}".format(len(new_data_occurrences[fn]) - len(data_occurrences[fn]), fn))

Found 1 new matches on file Table_Point_Formation.html
Found 1 new matches on file Ingrid_GuimarC3A3es.html
Found 2 new matches on file Jules_Verne_ATV.html
Found 1 new matches on file Pictogram.html
Found 2 new matches on file Claire_Danes.html
Found 1 new matches on file PTPRS.html
Found 1 new matches on file A_Beautiful_Valley.html
Found 1 new matches on file Mudramothiram.html
Found 2 new matches on file Gordon_Bau.html
Found 1 new matches on file Embraer_Unidade_GaviC3A3o_Peixoto_Airport.html
Found 3 new matches on file Code_page_1023.html
Found 1 new matches on file Cryptographic_primitive.html
Found 1 new matches on file Alex_Kurtzman.html
Found 1 new matches on file Filip_Pyrochta.html
Found 1 new matches on file Morgana_King.html
Found 1 new matches on file Don_Parsons_(ice_hockey).html
Found 1 new matches on file Bias.html
Found 2 new matches on file Tomohiko_ItC58D_(director).html
Found 1 new matches on file Imperial_Venus_(film).html
Found 1 new matches on file Camp_Nelson_
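The same comparison can be wrapped in a function that returns the differences instead of printing them, which makes it easy to compute a total. This is a sketch only: `count_new_matches` is a hypothetical helper, and the two small dictionaries are made-up inputs, not results from the wiki data.

```python
def count_new_matches(old, new):
    """Return {file: extra matching lines} for files that gained matches."""
    gained = {}
    for fn, lines in new.items():
        # Files missing from the old result count as having zero matches
        extra = len(lines) - len(old.get(fn, []))
        if extra > 0:
            gained[fn] = extra
    return gained

# Made-up results for illustration
old_demo = {"a.html": [1, 4]}
new_demo = {"a.html": [1, 2, 4], "b.html": [0]}

gained_demo = count_new_matches(old_demo, new_demo)
print(gained_demo)                # {'a.html': 1, 'b.html': 1}
print(sum(gained_demo.values()))  # 2
```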

## Finding match indexes on lines

We need to solve a subproblem before we implement this one: Given a string and a target, find all occurrences of the target within that string.

In [8]:
def find_match_indexes(line, target):
    result = []
    pos = line.find(target)
    while pos != -1:
        result.append(pos)
        pos = line.find(target, pos+1)
    return result

# Test implementation
s = "Data science is related to data mining, machine learning and big data.".lower()
print(find_match_indexes(s, "data"))

[0, 27, 65]
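A similar result can be obtained with the `re` module, shown here as an alternative sketch rather than a replacement. `find_match_indexes_re` is a hypothetical name. One difference is worth knowing: `re.finditer` returns non-overlapping matches, while the `str.find` loop above advances one character at a time and therefore also reports overlapping ones.

```python
import re

def find_match_indexes(line, target):
    # Same loop as in the cell above, repeated so the sketch is self-contained
    result = []
    pos = line.find(target)
    while pos != -1:
        result.append(pos)
        pos = line.find(target, pos+1)
    return result

def find_match_indexes_re(line, target):
    # re.escape makes the target match literally, even if it contains
    # characters that are special in regular expressions
    return [m.start() for m in re.finditer(re.escape(target), line)]

s = "Data science is related to data mining, machine learning and big data.".lower()
print(find_match_indexes_re(s, "data"))  # [0, 27, 65]

# Overlapping occurrences are where the two approaches differ
print(find_match_indexes("aaa", "aa"))     # [0, 1]
print(find_match_indexes_re("aaa", "aa"))  # [0]
```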


## Finding All Match Positions on Lines

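We can use the `find_match_indexes()` function defined above to find all match locations within a line.

After finding all indexes in one line, we create `(line index, match index)` tuples by pairing each match index with the index of the line. Note that this mapper adds an entry for every file, so files without matches map to an empty list.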

In [9]:
target = "science"

def map_grep_pos(file_chunk):
    result = {}
    for file in file_chunk:
        with open(os.path.join("wiki", file)) as f:
            file_lines = f.readlines()
        for index, sentence in enumerate(file_lines):
            match_list = find_match_indexes(sentence.lower(), target.lower())
            if file not in result:
                result[file] = []
            result[file] += [(index, match_index) for match_index in match_list]
    return result

all_occurrences = map_reduce(file_names, 8, map_grep_pos, reduce_grep_string)

In [10]:
all_occurrences

{'Bay_of_ConcepciC3B3n.html': [],
 'Bye_My_Boy.html': [],
 'Valentin_Yanin.html': [(6, 840),
  (6, 890),
  (66, 90),
  (66, 145),
  (66, 173),
  (144, 1440),
  (144, 1502),
  (144, 1548),
  (144, 1632),
  (144, 1697),
  (144, 1746)],
 'Kings_XI_Punjab_in_2014.html': [],
 'William_Harvey_Lillard.html': [(80, 166)],
 'Radial_Road_3.html': [],
 'George_Weldrick.html': [],
 'Zgornji_Otok.html': [],
 'Blue_Heelers_(season_8).html': [],
 'Taggen_Nunatak.html': [],
 'Henri_BraqueniC3A9.html': [],
 'Vrila.html': [],
 'William_Henry_Porter.html': [],
 'Clive_Brown_(footballer).html': [],
 'Blick_nach_Rechts.html': [],
 'Central_District_(Rezvanshahr_County).html': [],
 'Alexios_Aspietes.html': [],
 'Mei_Lanfang.html': [],
 'Wangeroogeclass_tug.html': [],
 'Dowell_Philip_O27Reilly.html': [],
 'Coalville_Town_railway_station.html': [],
 'Gennady_Lesun.html': [],
 'Bartrum_Glacier.html': [],
 'Victor_S._Mamatey.html': [(48, 682), (48, 728), (48, 767)],
 'Gottfried_Keller.html': [],
 'Table_Point_F
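Most entries in this output are empty lists. If only files with at least one match are of interest, the result can be filtered with a dictionary comprehension. The sketch below uses three entries copied from the output above; the variable names are hypothetical.

```python
# Three entries taken from the output above
occurrences_demo = {
    "Bye_My_Boy.html": [],
    "William_Harvey_Lillard.html": [(80, 166)],
    "Victor_S._Mamatey.html": [(48, 682), (48, 728), (48, 767)],
}

# An empty list is falsy, so `if matches` keeps only files with matches
non_empty = {file: matches for file, matches in occurrences_demo.items() if matches}
print(sorted(non_empty))  # ['Victor_S._Mamatey.html', 'William_Harvey_Lillard.html']

# Total number of (line, index) pairs across the remaining files
total_matches = sum(len(matches) for matches in non_empty.values())
print(total_matches)  # 4
```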

## Output result to CSV

Let's display the results. We will create a `CSV` file listing all occurrences. We will also show the text around each occurrence.
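The text around an occurrence is a slice of the line that starts a fixed number of characters before the match and ends the same number of characters after it. A minimal sketch of that slicing follows; `context_window` is a hypothetical helper. The start is clamped to 0 so that a match near the beginning of the line does not produce a negative index, while an end past the line length is harmless because Python slices stop at the end of the string.

```python
def context_window(line, index, target_len, delta):
    """Return the match plus up to delta characters on each side."""
    start = max(index - delta, 0)
    end = index + target_len + delta
    return line[start:end]

line = "Data science is related to data mining"

# "science" starts at index 5; show 4 characters on each side
middle = context_window(line, 5, len("science"), 4)
print(repr(middle))  # 'ata science is '

# A match at the very start of the line: the window is clamped at index 0
at_start = context_window(line, 0, len("Data"), 4)
print(repr(at_start))  # 'Data sci'
```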

In [11]:
import csv

# How many characters to show before and after the match
context_delta = 30

# newline="" is what the csv module recommends for files passed to csv.writer
with open("results.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    fields = [["File", "Line", "Index", "Context"]]
    rows = []
    for file, val in all_occurrences.items():
        with open(os.path.join("wiki", file)) as f:
            # Remove only the line ending: stripping leading whitespace
            # would shift the match indexes computed earlier
            lines = [line.rstrip("\n") for line in f.readlines()]
        for fline, index in val:
            start = max(index - context_delta, 0)
            end   = index + len(target) + context_delta
            rows.append([file, fline, index, lines[fline][start:end]])
    writer.writerows(fields)
    writer.writerows(rows)
    

In [12]:
import pandas as pd
df = pd.read_csv("results.csv")
df.head(10)

| | File | Line | Index | Context |
|---|---|---|---|---|
| 0 | Valentin_Yanin.html | 6 | 840 | `embers of the USSR Academy of Sciences","Full ...` |
| 1 | Valentin_Yanin.html | 6 | 890 | `ers of the Russian Academy of Sciences","Demid...` |
| 2 | Valentin_Yanin.html | 66 | 90 | `href="/wiki/Soviet_Academy_of_Sciences" class=...` |
| 3 | Valentin_Yanin.html | 66 | 145 | `ect" title="Soviet Academy of Sciences">Soviet...` |
| 4 | Valentin_Yanin.html | 66 | 173 | `f Sciences">Soviet Academy of Sciences</a>; he...` |
| 5 | Valentin_Yanin.html | 144 | 1440 | `rs_of_the_USSR_Academy_of_Sciences" title="Cat...` |
| 6 | Valentin_Yanin.html | 144 | 1502 | `rs of the USSR Academy of Sciences">Full Membe...` |
| 7 | Valentin_Yanin.html | 144 | 1548 | `rs of the USSR Academy of Sciences</a></li><li...` |
| 8 | Valentin_Yanin.html | 144 | 1632 | `of_the_Russian_Academy_of_Sciences" title="Cat...` |
| 9 | Valentin_Yanin.html | 144 | 1697 | `of the Russian Academy of Sciences">Full Membe...` |
