# Analyzing Wikipedia Pages

In this project, we'll work with data scraped from [Wikipedia](https://www.wikipedia.org/). We'll implement a simplified version of the [grep command-line utility](https://en.wikipedia.org/wiki/Grep) to **search for data in 54 megabytes of articles using a MapReduce parallel processing algorithm**. The `grep` utility searches files for text; our version searches all files in a given directory.

Volunteer content contributors and editors maintain Wikipedia by continuously improving content. Anyone can edit Wikipedia (you can read more about how to make an edit [here](https://en.wikipedia.org/wiki/Help:Editing)). Because Wikipedia is crowdsourced, it has rapidly assembled a huge library of articles.

Articles were saved using the last component of their URLs. For example, the article at `https://en.wikipedia.org/wiki/Yarkant_County` is saved to the file `Yarkant_County.html`. All the data files are in the `wiki` folder. Note that the files are raw HTML. Here are the first few lines of `Yarkant_County.html`:

![image info](https://dq-content.s3.amazonaws.com/569/1.1-m569.svg)
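
The naming rule described above can be sketched in a few lines. The helper name `article_file_name` is hypothetical, and this sketch covers only the simple case shown in the example; how the scraper handled special characters in URLs is not shown in this project.

```python
def article_file_name(url):
    """Return the last component of an article URL with `.html` appended."""
    # Drop any trailing slash, then keep only the text after the final slash
    last_component = url.rstrip('/').rsplit('/', 1)[-1]
    return last_component + '.html'

example_name = article_file_name('https://en.wikipedia.org/wiki/Yarkant_County')
print(example_name)
```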

We're going to treat those files as plain text, so we won't rely on any of their HTML structure.

Our main goals are:
* Search for all occurrences of a string in all of the files.
* Provide a case-insensitive option to the search.
* Refine the results by providing the specific locations of the matches within the files.

## 1. List file names in the `wiki` folder

We'll get started by exploring a single data file. Before we do that we'll need to list and iterate over all files in the `wiki` folder.

In [1]:
# Import libraries we'll use
import os
import pandas as pd
import numpy as np
import math
import functools
from multiprocessing import Pool
import csv

In [2]:
# Generate a list of all file names in the `wiki` folder
file_names = os.listdir('wiki')

# Calculate and print the number of files in the `wiki` folder
print('There are', len(file_names), 'files in the `wiki` folder.')

There are 999 files in the `wiki` folder.


In [3]:
# Read the first file in the `wiki` folder and print its contents
with open(os.path.join('wiki', file_names[0])) as file:
    print(file.read())

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Bay of Concepción - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Bay_of_Concepción","wgTitle":"Bay of Concepción","wgCurRevisionId":647460156,"wgRevisionId":647460156,"wgArticleId":16044270,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Coordinates on Wikidata","All stub articles","Landforms of Bío Bío Region","Bays of Chile","Bío Bío Region geography stubs"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNa

## 2. Adding the map_reduce() function

In [4]:
def make_chunks(data, num_chunks):
    """Separate data as evenly as possible into a specified number of chunks.
    
    Args:
        data (sequence): A sliceable collection such as a list, tuple, or pandas dataframe.
        num_chunks (int): The number of chunks to separate the data into.
        
    Returns:
        list: A list of data chunks; there are at most `num_chunks` of them.
        
    Example:
        >>> make_chunks([1,2,3,4,5,6,7,8], 3)
            [[1,2,3], [4,5,6], [7,8]]
    """
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

def map_reduce(data, num_processes, mapper, reducer):
    """Apply map and reduce functions to data in parallel using a specified number of processes.
    
    Args:
        data (sequence): A sliceable collection such as a list, tuple, or pandas dataframe.
        num_processes (int): The number of processes to run in parallel. Should not exceed
                             the number of CPU cores.
        mapper (function): A mapper function to apply to each chunk of the data.
        reducer (function): A reducer function that combines two mapper results into one value.
        
    Returns:
        variable: Depends on the type returned by the reducer function.
        
    Example:
        >>> map_reduce([1,2,3,4,5,6,7,8], 3, max, max)
            8
    """
    chunks = make_chunks(data, num_processes)
    with Pool(num_processes) as pool:
        chunk_results = pool.map(mapper, chunks)
    return functools.reduce(reducer, chunk_results)
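
To see what `map_reduce()` computes without starting any processes, here is a minimal serial sketch of the same flow. The name `map_reduce_serial` is hypothetical and not part of the project; the chunking logic is repeated so the block runs on its own.

```python
import functools
import math

def make_chunks(data, num_chunks):
    """Split `data` into consecutive slices of equal size (the last may be shorter)."""
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def map_reduce_serial(data, num_chunks, mapper, reducer):
    """Same flow as map_reduce(), but the mapper runs on each chunk one after another."""
    chunks = make_chunks(data, num_chunks)
    chunk_results = [mapper(chunk) for chunk in chunks]
    return functools.reduce(reducer, chunk_results)

numbers = [1, 2, 3, 4, 5, 6, 7, 8]
print(make_chunks(numbers, 3))                                   # [[1, 2, 3], [4, 5, 6], [7, 8]]
print(map_reduce_serial(numbers, 3, sum, lambda a, b: a + b))    # 36

# Because the chunk size is rounded up, fewer chunks than requested can come back
print(make_chunks([1, 2, 3, 4], 3))                              # [[1, 2], [3, 4]]
```

The last call shows why the number of chunks is an upper bound: four items split with a chunk size of two fill only two chunks.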

## 3. Counting the total number of lines in all files

Let's explore the data a little bit more and count the total number of lines in all files stored in the `wiki` folder.

In [5]:
def map_line_count(chunk_file_names):
    """Count the total number of lines in a chunk of files from the `wiki` folder.
    
    Args:
        chunk_file_names (list): A list of file names as strings.
        
    Returns:
        int: The total number of lines in the specified file names within the `wiki` folder.
    """
    total = 0
    for file_name in chunk_file_names:
        with open(os.path.join('wiki', file_name)) as file:
            total += len(file.readlines())
    return total

def reduce_line_count(count1, count2):
    """Add two total counts.
    
    Args:
        count1 (int): An integer.
        count2 (int): An integer.
        
    Returns:
        int: The sum of the two integers.
    """
    return count1 + count2

# Calculate the total number of lines in all files in the `wiki` folder
total_lines = map_reduce(file_names, 4, map_line_count, reduce_line_count)
print('There are a total of {:,d} lines in all files in the `wiki` folder'.format(total_lines))

There are a total of 499,797 lines in all files in the `wiki` folder
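
`file.readlines()` loads a whole file into memory before counting. As an alternative sketch (not what the mapper above does), iterating over the file object counts lines one at a time. The helper `count_lines` and the temporary sample file are illustrative assumptions.

```python
import os
import tempfile

def count_lines(path):
    """Count lines by iterating over the file instead of building a list."""
    with open(path) as file:
        return sum(1 for _ in file)

# Create a small throwaway file so the example does not depend on the `wiki` folder
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'sample.html')
    with open(path, 'w') as file:
        file.write('<html>\n<body>\n</body>\n</html>\n')
    line_count = count_lines(path)

print(line_count)  # 4
```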


## 4. Searching for all occurrences of a string in all of the files

This will be the first **MapReduce grep algorithm**. The goal is to locate all lines in all files from the `wiki` folder that contain a given string.

We'll search for occurrences of the string "data" in all files. The mapper lowercases each line before matching, so the search is case-insensitive as long as the target itself is lowercase.

In [6]:
target = 'data'

def map_string_line_index(chunk_file_names):
    """Find all lines that contain the target string and return those line indices for each file
        containing the target string.
    
    Args:
        chunk_file_names (list): A list of file names as strings.
        
    Returns:
        dictionary: With the format: {file name : line indices containing the target string}
    """
    line_indices = {}
    for file_name in chunk_file_names:
        with open(os.path.join('wiki', file_name)) as file:
            lines = [line.lower() for line in file.readlines()]
            for i, line in enumerate(lines):
                if target in line:
                    if file_name not in line_indices:
                        line_indices[file_name] = []
                    line_indices[file_name].append(i)
    return line_indices

def reduce_string_line_index(lines1, lines2):
    """Merge the contents of two dictionaries.
    Args:
        lines1 (dictionary): Dictionary containing file names and index numbers.
        lines2 (dictionary): Another dictionary containing file names and index numbers.
        
    Returns:
        dictionary: A dictionary resulting from joining two dictionaries
    """
    lines1.update(lines2)
    return lines1

# Find all occurrences of the string 'data' in the files stored in the `wiki` folder
occurences_of_string = map_reduce(file_names, 4, map_string_line_index, reduce_string_line_index)
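
One of our goals is a case-insensitive option. A minimal sketch of making case sensitivity an explicit choice, rather than always lowercasing, might look like the following. The function `matching_line_indices` and its `case_sensitive` parameter are hypothetical names, and the function works on a list of lines rather than on files so that it runs on its own.

```python
def matching_line_indices(lines, target, case_sensitive=True):
    """Return the indices of the lines that contain `target`."""
    if not case_sensitive:
        # Lowercase both sides so that the comparison ignores case
        target = target.lower()
        lines = [line.lower() for line in lines]
    return [i for i, line in enumerate(lines) if target in line]

sample_lines = ['Data science', 'no match here', 'metadata']
print(matching_line_indices(sample_lines, 'data'))                        # [2]
print(matching_line_indices(sample_lines, 'Data', case_sensitive=False))  # [0, 2]
```

Lowercasing the target as well as the lines means the caller does not have to remember to pass a lowercase string.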

## 4.1 Improving the algorithm to provide locations within each line

The current implementation will just return the index of lines where the target string is located.

The **new implementation should return pairs of indices where the first value is the line index and the second value is the index of the first character of the match on that line**.
For the new implementation, we'll only need to modify the mapper function so that it returns a list of pairs with all occurrences of the target on all files from the given chunk.
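
The core of this change is finding every position of the target within a single line. A small sketch of that step, using repeated calls to `str.find`, is shown below; the helper name `find_all` is an assumption made for illustration.

```python
def find_all(line, target):
    """Return the starting index of every occurrence of `target` in `line`."""
    indices = []
    i = line.find(target)
    while i > -1:
        indices.append(i)
        # Resume the search one character after the previous match
        i = line.find(target, i + 1)
    return indices

print(find_all('data and metadata', 'data'))  # [0, 13]
print(find_all('aaa', 'aa'))                  # [0, 1]
```

Because the search resumes at `i + 1` rather than after the end of the match, overlapping occurrences are reported too, as the second call shows.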

In [31]:
def map_string_line_firstchar_index(file_names):
    """Find all lines that contain the target string and the index position within those lines
      where the target string begins. Return those line/position index pairs as
      a list of tuples for each file containing the target string.
    
    Args:
        file_names (list): A list of file paths as strings.
        
    Returns:
        dictionary: Containing the file path as the key and a list of tuples containing
                    the line index and position index of the target string.
    """
    results = {}
    for file_name in file_names:
        with open(file_name) as file:
            lines = [line.lower() for line in file.readlines()]
            for line_index, line in enumerate(lines):
                if target in line:
                    char_index = []
                    i = line.find(target)
                    while i > -1:
                        char_index.append(i)
                        i = line.find(target, i+1)
                    if file_name not in results:
                        results[file_name] = []
                    results[file_name] += [(line_index, match_index) for match_index in char_index]
    return results

# Find all occurrences of the string 'data' in the files stored in the `wiki` folder
occurences_of_string = map_reduce([os.path.join('wiki', file_name) for file_name in file_names], 
                                  4, 
                                  map_string_line_firstchar_index, 
                                  reduce_string_line_index)
occurences_of_string

{'wiki/Valentin_Yanin.html': [],
 'wiki/William_Harvey_Lillard.html': [],
 'wiki/Victor_S._Mamatey.html': [],
 'wiki/Table_Point_Formation.html': [],
 'wiki/Master_of_Space_and_Time.html': [],
 'wiki/Urban_chicken.html': [],
 'wiki/AlMidan.html': [],
 'wiki/Jules_Verne_ATV.html': [],
 'wiki/Pictogram.html': [],
 'wiki/Claire_Danes.html': [],
 'wiki/Diplacus_aurantiacus.html': [],
 'wiki/Supermoon.html': [],
 'wiki/Louis_Vivet.html': [],
 'wiki/Kelvin_R._Throop.html': [],
 'wiki/Kristiene_Gong.html': [],
 'wiki/Embraer_Unidade_GaviC3A3o_Peixoto_Airport.html': [],
 'wiki/Alex_Kurtzman.html': [],
 'wiki/Church_of_the_SubGenius.html': [],
 'wiki/Regina_Pacis_Catholic_Secondary_School.html': [],
 'wiki/List_of_women27s_football_clubs_in_Japan.html': [],
 'wiki/Bias.html': [],
 'wiki/Olive_Dennis.html': [],
 'wiki/Igor_and_Grichka_Bogdanoff.html': [],
 'wiki/Thomas_Croci.html': [],
 'wiki/Mariana_Mazzucato.html': [],
 'wiki/Oldfield_Baby_Great_Lakes.html': [],
 'wiki/Derek_Acorah.html': [],


## 5. Writing results into a CSV file

Our grep algorithm can now find all matches. However, the dictionary it produces doesn't make those matches easy to inspect.

Let's write the results into a CSV file using four columns:

1. File: the path of the file containing the match.
2. Line: the index of the line containing the match.
3. Index: the index of the first character of the match within that line.
4. Context: the text around the match so that users can see the context.
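
The Context column boils down to slicing a window around the match. Here is a minimal sketch of that slice, assuming a hypothetical helper named `context_window` and a made-up sample line:

```python
def context_window(line, index, target, context_delta=30):
    """Return the match plus up to `context_delta` characters on each side."""
    start = max(index - context_delta, 0)       # Never start before the beginning of the line
    end = index + len(target) + context_delta   # Slicing past the end of the line is safe
    return line[start:end]

line = 'maps, aerial photos, and other data for this location'
index = line.find('data')
window = context_window(line, index, 'data', context_delta=10)
print(repr(window))  # 'and other data for this '
```

Clamping `start` at zero matters because a negative start index would count from the end of the string and return the wrong slice.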

In [23]:
def write_csv(map_reduce_dictionary, csv_name, context_delta=30):
    """Write the results of our grep search to a CSV file."""
    rows = [['File', 'Line', 'Index', 'Context']]
    for file_name in map_reduce_dictionary:
        # The dictionary keys already include the directory, so open them directly
        with open(file_name) as source_file:
            # Remove only the trailing newline so that the match indices still line up
            lines = [line.rstrip('\n') for line in source_file.readlines()]
        for line, index in map_reduce_dictionary[file_name]:
            start = max(index - context_delta, 0) # The starting index for the context column
            end = index + len(target) + context_delta # The ending index for the context column
            # Use the key as is; joining it with 'wiki' again would duplicate the folder
            rows.append([file_name, line, index, lines[line][start:end]])
    # newline='' is what the csv module recommends when opening the output file
    with open(csv_name, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerows(rows)
    
write_csv(occurences_of_string, 'results.csv')

In [24]:
# Check contents of csv file by loading it into a dataframe
result_df = pd.read_csv('results.csv')
result_df.head(10)

Unnamed: 0,File,Line,Index,Context
0,wiki/Bay_of_ConcepciC3B3n.html,6,422,"egories"":[""Coordinates on Wikidata"",""All stub ..."
1,wiki/Bay_of_ConcepciC3B3n.html,45,628,"78-sj18-04-quiriquina.jpg 2x"" data-file-width=..."
2,wiki/Bay_of_ConcepciC3B3n.html,45,650,"jpg 2x"" data-file-width=""960"" data-file-height..."
3,wiki/Bay_of_ConcepciC3B3n.html,58,447,"aps, aerial photos, and other data for this lo..."
4,wiki/Bay_of_ConcepciC3B3n.html,58,692,"aps, aerial photos, and other data for this lo..."
5,wiki/Bay_of_ConcepciC3B3n.html,60,18,"<table class=""metadata plainlinks stub"" role=""..."
6,wiki/Bay_of_ConcepciC3B3n.html,62,568,"o_Region%2C_Chile.svg.png 2x"" data-file-width=..."
7,wiki/Bay_of_ConcepciC3B3n.html,62,590,"png 2x"" data-file-width=""600"" data-file-height..."
8,wiki/Bay_of_ConcepciC3B3n.html,105,40,"atlinks"" class=""catlinks"" data-mw=""interface"">..."
9,wiki/Bay_of_ConcepciC3B3n.html,105,748,"tegory:Coordinates_on_Wikidata"" title=""Categor..."


## 5.1 Putting it all together

We'll put the MapReduce and CSV writer functions together and test them on the "science" string.

In [28]:
def grep_to_csv(file_names, directory, csv_name, num_processes=4, context_delta=30):    
    occurences = map_reduce([os.path.join(directory, file_name) for file_name in file_names], 
                            num_processes, 
                            map_string_line_firstchar_index,
                            reduce_string_line_index)
    
    write_csv(occurences, csv_name, context_delta)

In [30]:
target = 'science'
grep_to_csv(file_names, 'wiki', 'science.csv')
pd.read_csv('science.csv').head(10)

Unnamed: 0,File,Line,Index,Context
0,wiki/Valentin_Yanin.html,6,840,"embers of the USSR Academy of Sciences"",""Full ..."
1,wiki/Valentin_Yanin.html,6,890,"ers of the Russian Academy of Sciences"",""Demid..."
2,wiki/Valentin_Yanin.html,66,90,"href=""/wiki/Soviet_Academy_of_Sciences"" class=..."
3,wiki/Valentin_Yanin.html,66,145,"ect"" title=""Soviet Academy of Sciences"">Soviet..."
4,wiki/Valentin_Yanin.html,66,173,"f Sciences"">Soviet Academy of Sciences</a>; he..."
5,wiki/Valentin_Yanin.html,144,1440,"rs_of_the_USSR_Academy_of_Sciences"" title=""Cat..."
6,wiki/Valentin_Yanin.html,144,1502,"rs of the USSR Academy of Sciences"">Full Membe..."
7,wiki/Valentin_Yanin.html,144,1548,rs of the USSR Academy of Sciences</a></li><li...
8,wiki/Valentin_Yanin.html,144,1632,"of_the_Russian_Academy_of_Sciences"" title=""Cat..."
9,wiki/Valentin_Yanin.html,144,1697,"of the Russian Academy of Sciences"">Full Membe..."


## Conclusion

We worked with data scraped from [Wikipedia](https://www.wikipedia.org/) to implement a simplified version of the [grep command-line utility](https://en.wikipedia.org/wiki/Grep) using [MapReduce](https://en.wikipedia.org/wiki/MapReduce) parallel processing. We can now run a simple string search over the files in a designated directory ("wiki" in our example) and generate a CSV report with the results.

### Next steps

There are many improvements we can add to our algorithm. The grep command offers [many other options](https://www.gnu.org/software/grep/manual/grep.html).

Some ideas:

* Consider files located in subdirectories.
* Use the `re` module to make it possible to search for regular expressions.
* Make it possible to specify the search options rather than having a search function for each set of options.
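
As a starting point for the last two ideas, here is a rough sketch that searches with the `re` module and fixes the search options ahead of time with `functools.partial`. The names `search_lines` and `search_science` are hypothetical, and the function works on a list of lines rather than on files to keep the example self-contained.

```python
import functools
import re

def search_lines(lines, pattern, ignore_case=False):
    """Return (line index, character index) pairs for every regex match."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [(line_index, match.start())
            for line_index, line in enumerate(lines)
            for match in regex.finditer(line)]

# Bind the options once; the resulting callable takes only the lines to search
search_science = functools.partial(search_lines, pattern=r'sciences?', ignore_case=True)

sample_lines = ['Academy of Sciences', 'no match', 'science and Science']
print(search_science(sample_lines))  # [(0, 11), (2, 0), (2, 12)]
```

Binding the options this way would avoid relying on a global `target` variable. Note that `re.finditer` reports non-overlapping matches only, which differs from the `str.find` loop used earlier.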