## Introduction:
In this project, data scraped from Wikipedia will be used. Volunteer content contributors and editors maintain Wikipedia by continuously improving content. Anyone can edit Wikipedia (you can read more about how to make an edit here). Because Wikipedia is crowdsourced, it has rapidly assembled a huge library of articles.

In this guided project, a simplified version of the grep command-line utility to search for data in 54 megabytes worth of articles will be used. The grep utility essentially allows searching for textual data in all files from a given directory.

Articles were saved using the last component of their URLs. For example, a page on Wikipedia has the URL structure https://en.wikipedia.org/wiki/Yarkant_County. If the article with the previous URL was saved, it would be saved to the file Yarkant_County.html. All the data files are in the wiki folder. Note that the files are raw HTML. Below a list of all the files are provided.

In [1]:
## Viewing files within directory
import os

file_names = os.listdir("wiki")

print("number of files in wiki folder: ", len(file_names), "\n")

for index, file_name in enumerate(file_names):
    print(index, file_name)

number of files in wiki folder:  999 

0 Bay_of_ConcepciC3B3n.html
1 Bye_My_Boy.html
2 Valentin_Yanin.html
3 Kings_XI_Punjab_in_2014.html
4 William_Harvey_Lillard.html
5 Radial_Road_3.html
6 George_Weldrick.html
7 Zgornji_Otok.html
8 Blue_Heelers_(season_8).html
9 Taggen_Nunatak.html
10 Henri_BraqueniC3A9.html
11 Vrila.html
12 William_Henry_Porter.html
13 Clive_Brown_(footballer).html
14 Blick_nach_Rechts.html
15 Central_District_(Rezvanshahr_County).html
16 Alexios_Aspietes.html
17 Mei_Lanfang.html
18 Wangeroogeclass_tug.html
19 Dowell_Philip_O27Reilly.html
20 Coalville_Town_railway_station.html
21 Gennady_Lesun.html
22 Bartrum_Glacier.html
23 Victor_S._Mamatey.html
24 Gottfried_Keller.html
25 Table_Point_Formation.html
26 Nobuhiko_Ushiba.html
27 Master_of_Space_and_Time.html
28 Early_medieval_states_in_Kazakhstan.html
29 Eressa_aperiens.html
30 Myrtle_(sternwheeler).html
31 Abanycha_bicolor.html
32 JeecyVea.html
33 Aubrey_Fair.html
34 Ingrid_GuimarC3A3es.html
35 Urban_chicken.html
36

884 Holly_Golightly_(comics).html
885 Johann_Christoph_Hoffbauer.html
886 Matthew_Liptak.html
887 Programa_Mejorando_tu_Autoestima.html
888 Philaeus.html
889 SalemAuburn_Streets_Historic_District.html
890 Kate_Harwood.html
891 Baltic_Peak.html
892 Laredo_Colombia_Solidarity_Port_of_Entry.html
893 104th_Logistic_Support_Brigade_(United_Kingdom).html
894 Pichoy_Airfield.html
895 Wreck_Island_Natural_Area_Preserve.html
896 Desmiphora_bijuba.html
897 Webb_Dock_railway_line.html
898 Helena_Nyblom.html
899 Hall_of_Mental_Cultivation.html
900 Alpine_skiing_at_the_1994_Winter_Olympics_E28093_Men27s_combined.html
901 John_Whewell.html
902 Gulliver_Mickey.html
903 Henry_Horace_Williams.html
904 Hebden_Bridge_Picture_House.html
905 Oued_TlC3A9lat.html
906 Oldham_Metropolitan_Borough_Council_election_2000.html
907 Circus_Avenue.html
908 S27portable_Scoreboards.html
909 Kettu_Kalyanam.html
910 Burnin27_Sneakers.html
911 Union_United_Church.html
912 Svendborg_SkibsvC3A6rft.html
913 Campus_of_Texas_A

### Reading the first file
To read and print the contents of the first file.In order to do this one needs to join the name of the file with the wiki folder. This can be done using the os.path.join() function.

In [2]:
with open(os.path.join("wiki", file_names[0]), errors = "ignore") as f:
    print(f.read())

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Bay of Concepción - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Bay_of_Concepción","wgTitle":"Bay of Concepción","wgCurRevisionId":647460156,"wgRevisionId":647460156,"wgArticleId":16044270,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Coordinates on Wikidata","All stub articles","Landforms of Bío Bío Region","Bays of Chile","Bío Bío Region geography stubs"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNa

### Adding the MapReduce function to this project
Adding the MapReduce function so that it can be used throughout the project.

In [3]:
import math
import functools
import multiprocessing
from multiprocessing import Pool

number_of_cpu = multiprocessing.cpu_count()
print(number_of_cpu)

def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

def map_reduce(data, num_processes, mapper, reducer):
    chunks = make_chunks(data, num_processes)
    pool = Pool(num_processes)
    chunk_results = pool.map(mapper, chunks)
    return functools.reduce(reducer, chunk_results)

8


Making use of the map reduce function to count number of lines in all files

In [4]:
## mapper function
def map_line_count(file_names):
    total = 0
    for fn in file_names:
        with open(os.path.join("wiki", fn)) as f:
            total += len(f.readlines())
    return total

## reducer function
def reduce_line_count(count1, count2):
    return count1 + count2

print(map_reduce(file_names, 8, map_line_count, reduce_line_count))

499797


### Creating a grep string function
To define a mapreduce_grep_string() function that takes two arguments as input:

- A path to a folder. In the case of this project we will only use it on the wiki folder but having this argument makes the function easier to reuse.

- The string that we want to find.

The mapper function receives a chunk of filenames and calculates all occurrences of the target string on them. If a file contains no occurrences, it will not include an entry for that file in the result dictionary.

The reducer function uses the dict.update() method to merge the resulting dictionaries.

Note that the target variable will be defined outside and will be the string that is searched for.

In [5]:
def map_grep_string(file_names):
    ''' Function which counts occurences of string and return their indexes'''
    results = {}
    for fn in file_names:
        with open(fn) as f:
            lines = [line for line in f.readlines()]
        for line_index, line in enumerate(lines):
            if target in line:
                if fn not in results:
                    results[fn] = []
                results[fn].append(line_index)
    return results

## reducer function
def reduce_grep(lines1, lines2):
    lines1.update(lines2)
    return lines1

## mapper function
def mapreduce_grep(path, num_processes):
    file_names = [os.path.join(path, fn) for fn in os.listdir(path)]
    return map_reduce(file_names, num_processes,  map_grep_string, reduce_grep)

In [6]:
## Finding all occurences of "data"
target = "data"
data_occurrences = mapreduce_grep("wiki", 8)

### Allow for case insensitive matches
To allow case insensitive matches by converting both the target and the file contents to lowercase before the matching.

In [7]:
def map_grep_insensitive(file_names):
    ''' Function which counts occurences of string (case insensitive) and return their indexes'''
    results = {}
    for fn in file_names:
        with open(fn) as f:
            lines = [line.lower() for line in f.readlines()]
        for line_index, line in enumerate(lines):
            if target.lower() in line:
                if fn not in results:
                    results[fn] = []
                results[fn].append(line_index)
    return results

def mapreduce_grep_insensitive(path, num_processes):
    file_names = [os.path.join(path, fn) for fn in os.listdir(path)]
    return map_reduce(file_names, num_processes,  map_grep_insensitive, reduce_grep)

target = "data"
new_data_occurrences = mapreduce_grep_insensitive("wiki", 8)

### Checking that we find more matches
The results has been stored in the variables data_occurrences and new_data_occurrences. Now to check if one can find more matches with the second version of the algorithm, for this we can loop over the file names and print the length difference between the results.

In [8]:
for file_name in new_data_occurrences:
    if file_name not in data_occurrences:
        print("Found {} new matches on file {}".format(len(new_data_occurrences[file_name]), file_name))
    elif len(new_data_occurrences[file_name]) > len(data_occurrences[file_name]):
        print("Found {} new matches on file {}".format(len(new_data_occurrences[file_name]) - len(data_occurrences[file_name]), file_name))

Found 1 new matches on file wiki/Table_Point_Formation.html
Found 1 new matches on file wiki/Ingrid_GuimarC3A3es.html
Found 2 new matches on file wiki/Jules_Verne_ATV.html
Found 1 new matches on file wiki/Pictogram.html
Found 2 new matches on file wiki/Claire_Danes.html
Found 1 new matches on file wiki/PTPRS.html
Found 1 new matches on file wiki/A_Beautiful_Valley.html
Found 1 new matches on file wiki/Mudramothiram.html
Found 2 new matches on file wiki/Gordon_Bau.html
Found 1 new matches on file wiki/Embraer_Unidade_GaviC3A3o_Peixoto_Airport.html
Found 3 new matches on file wiki/Code_page_1023.html
Found 1 new matches on file wiki/Cryptographic_primitive.html
Found 1 new matches on file wiki/Alex_Kurtzman.html
Found 1 new matches on file wiki/Filip_Pyrochta.html
Found 1 new matches on file wiki/Morgana_King.html
Found 1 new matches on file wiki/Don_Parsons_(ice_hockey).html
Found 1 new matches on file wiki/Bias.html
Found 2 new matches on file wiki/Tomohiko_ItC58D_(director).html
Found

### Function to find indexes where string occurs
Given a string, find occurrences of the target within that string. First example will be for a general string provided before including it in the mapreduce function.

In [9]:
def find_index_matches(string, target):
    results = []
    i = string.find(target, 0)
    while i != -1: # find function returns -1 when it does not find anything, while loop will ensure find function continues
        results.append(i) # append indexes until no more occurences are found
        i = string.find(target, i + 1)
    return results

# Test implementation
s = '''Data science is related to data mining, machine learning and big data.
ML is great in deriving results based on historical data.'''.lower()
print(find_index_matches(s, "data"))

[0, 27, 65, 123]


### Finding all match locations
Updating the grep function to now also include all occurences where target string occurs multiple times within the same line. This is so that all instances of the target string are returned instead of just the indexes of the lines where the target string occurs within each file.

In [10]:
def map_grep_match_indexes(file_names):
    results = {}
    for fn in file_names:
        with open(fn) as f:
            lines = [line.lower() for line in f.readlines()]
        for line_index, line in enumerate(lines):
            match_indexes = find_index_matches(line, target.lower())
            if fn not in results:
                results[fn] = []
            results[fn] += [(line_index, match_index) for match_index in match_indexes]
    return results

def mapreduce_grep_match_indexes(path, num_processes):
    file_names = [os.path.join(path, fn) for fn in os.listdir(path)]
    return map_reduce(file_names, num_processes,  map_grep_match_indexes, reduce_grep)

target = "science"
occurrences = mapreduce_grep_match_indexes("wiki", 8)

### Displaying the results
Creating a CSV file listing all occurrences.This will also show the text around each occurrence.

In [13]:
occurrences
import csv

# How many character to show  in the CSV file before and after the match
context_delta = 30

with open("results.csv", "w") as f:
    writer = csv.writer(f)
    rows = [["File", "Line", "Index", "Context"]]
    for fn in occurrences: # making use of previous map reduce occurrences
        with open(fn) as f:
            lines = [line.strip() for line in f.readlines()]
        for line, index in occurrences[fn]:
            start = max(index - context_delta, 0)
            end   = index + len(target) + context_delta
            rows.append([fn, line, index, lines[line][start:end]])
    writer.writerows(rows)    

In [14]:
## Viewing in pandas
import pandas
df = pandas.read_csv("results.csv")
df.head(10)

Unnamed: 0,File,Line,Index,Context
0,wiki/Valentin_Yanin.html,6,840,"embers of the USSR Academy of Sciences"",""Full ..."
1,wiki/Valentin_Yanin.html,6,890,"ers of the Russian Academy of Sciences"",""Demid..."
2,wiki/Valentin_Yanin.html,66,90,"href=""/wiki/Soviet_Academy_of_Sciences"" class=..."
3,wiki/Valentin_Yanin.html,66,145,"ect"" title=""Soviet Academy of Sciences"">Soviet..."
4,wiki/Valentin_Yanin.html,66,173,"f Sciences"">Soviet Academy of Sciences</a>; he..."
5,wiki/Valentin_Yanin.html,144,1440,"rs_of_the_USSR_Academy_of_Sciences"" title=""Cat..."
6,wiki/Valentin_Yanin.html,144,1502,"rs of the USSR Academy of Sciences"">Full Membe..."
7,wiki/Valentin_Yanin.html,144,1548,rs of the USSR Academy of Sciences</a></li><li...
8,wiki/Valentin_Yanin.html,144,1632,"of_the_Russian_Academy_of_Sciences"" title=""Cat..."
9,wiki/Valentin_Yanin.html,144,1697,"of the Russian Academy of Sciences"">Full Membe..."
