## MapReduce

In this assignment, we will explore how to use some of the methods we have learned about, but with Big Data.

MapReduce is a programming model for performing parallel processing on Big Data. It is powerful, yet relatively simple.

There are two basic steps:
1. _Mapper_ - Turn each item into zero or more key-value pairs.
2. _Reducer_ - Produce output values by grouping together values from each corresponding key.

In [18]:
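# Some necessary code
from collections import defaultdict, Counter
import re, datetime
from functools import partial

def tokenize(message):
    message = message.lower()                       # convert to lowercase
    all_words = re.findall("[a-z0-9']+", message)   # extract the words
    return all_words                                # keep duplicates so repeats are counted

def word_count_old(documents):
    """word count without MapReduce
    (assumed definition; the original cell is not shown)"""
    return Counter(word
                   for document in documents
                   for word in tokenize(document))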

In [20]:
documents = ["data data data science","big data","data problems"]
result = word_count_old(documents)
print(result)

Counter({'data': 5, 'science': 1, 'big': 1, 'problems': 1})


This old way of counting words works fine for hundreds of documents, but not for millions. We need a way to distribute the documents across several computers, have each of them count its share, and return the counts to a central hub. This is what MapReduce provides.
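
Before introducing the mapper and reducer, here is a minimal single-machine sketch of the "count separately, then merge" idea. The helpers split_into_chunks and count_chunk and the choice of two chunks are assumptions made for illustration; nothing here actually runs in parallel.

```python
from collections import Counter
import re

def tokenize(message):
    """lowercase the message and return every word (duplicates kept)"""
    return re.findall("[a-z0-9']+", message.lower())

def split_into_chunks(items, num_chunks):
    """hypothetical helper: deal the items round-robin into num_chunks lists"""
    return [items[i::num_chunks] for i in range(num_chunks)]

def count_chunk(chunk):
    """what one 'worker' would do: count the words in its own documents"""
    return Counter(word for document in chunk for word in tokenize(document))

documents = ["data data data science", "big data", "data problems"]

# each chunk stands in for the share of documents sent to one machine
partial_counts = [count_chunk(chunk)
                  for chunk in split_into_chunks(documents, 2)]

# the 'central hub' merges the per-worker counts
total = sum(partial_counts, Counter())
print(total)
```

The merged total matches the single-pass count because addition does not care how the documents were split up.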

In [21]:
# The mapper function turns each document into (word, 1) pairs
def wc_mapper(document):
    """for each word in document, emit (word,1)"""
    for word in tokenize(document):
        yield (word,1)
        
# The reducer function collects the results
def wc_reducer(word, counts):
    """sum up the counts for a word"""
    yield (word, sum(counts))

This is fine when you have a parallel processing environment, and later we will learn about cloud computing and how to use MapReduce in the cloud. For now, let's simulate it on a single computer.

In [22]:
def word_count(documents):
    """count the words in the input documents using MapReduce"""
    
    # place to store grouped values
    collector = defaultdict(list)
    
    for document in documents:
        for word, count in wc_mapper(document):
            collector[word].append(count)
            
    # add a statement to print the collector here
    print(collector)        
    return [output
            # replace items() with iteritems() if you get an error
           for word, counts in collector.items()
           for output in wc_reducer(word,counts)]

Create a list of documents where there is some overlap in the words in each document (don't use more than a total of about 5-6 words). Use your word_count function on this list.

e.g., ["data science", "big data", "science fiction", "data mining"]

Add a print statement to the function so you can see the values in collector after the mapper function has run.

What if a document has more than one occurrence of a word, e.g., "data data science"? Can you alter the tokenize function to fix this problem?

In [23]:
# make a list of documents here
documents = ["data science", "big data", "science fiction", "data mining"]
word_count(documents)

defaultdict(<class 'list'>, {'data': [1, 1, 1], 'science': [1, 1], 'big': [1], 'fiction': [1], 'mining': [1]})


[('data', 3), ('science', 2), ('big', 1), ('fiction', 1), ('mining', 1)]

Let's create a more general MapReduce function now. Hint: This function will look nearly identical to the word_count function above with some substitutions.

Rather than (word, count), it should be (key, value) to be more general.

In [24]:
def map_reduce(inputs, mapper, reducer):
    """runs MapReduce on inputs using functions mapper and reducer"""
    # place to store grouped values
    collector = defaultdict(list)
    
    # loop over the inputs, calling mapper on each one
    for item in inputs:
        for key, value in mapper(item):
            collector[key].append(value)
            
    # uncomment to inspect the grouped values
    # print(collector)
    
    # call the reducer once per key and flatten the outputs
    return [output
            for key, values in collector.items()
            for output in reducer(key, values)]
            

If all went well, you should be able to call the word count with the code below.

In [25]:
word_counts = map_reduce(documents, wc_mapper, wc_reducer)
print(word_counts)

[('data', 3), ('science', 2), ('big', 1), ('fiction', 1), ('mining', 1)]


Let's also create a more general reducer function, where we can change the aggregation that is used, e.g., sum, max, or min.

In [26]:
def reduce_values_using(aggregation_fn, key, values):
    """reduces a key-values pair by applying aggregation_fn"""
    yield (key, aggregation_fn(values))
    
def values_reducer(aggregation_fn):
    """turns a function (values->output) into a reducer that
    maps (key, values)->(key, output)"""
    return partial(reduce_values_using, aggregation_fn)
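
The partial call in values_reducer pre-fills the first argument of reduce_values_using. The short sketch below shows the two equivalent ways of calling it; the name demo_reducer is used only for this illustration.

```python
from functools import partial

def reduce_values_using(aggregation_fn, key, values):
    """reduces a key-values pair by applying aggregation_fn"""
    yield (key, aggregation_fn(values))

# partial fixes aggregation_fn=sum, so the result takes only (key, values),
# which is the call signature that map_reduce uses for a reducer
demo_reducer = partial(reduce_values_using, sum)

# these two calls do the same work
direct = list(reduce_values_using(sum, "data", [1, 1, 1]))
via_partial = list(demo_reducer("data", [1, 1, 1]))
print(direct, via_partial)
```

Both lists contain the single pair ('data', 3).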

In [27]:
sum_reducer = values_reducer(sum)
max_reducer = values_reducer(max)
min_reducer = values_reducer(min)
count_distinct_reducer = values_reducer(lambda values: len(set(values)))

Try doing the word count with your documents using each of the reducers above. Do you get the results you expected? What do you think is going on?

In [28]:
word_counts = map_reduce(documents, wc_mapper, sum_reducer)
print(word_counts)

[('data', 3), ('science', 2), ('big', 1), ('fiction', 1), ('mining', 1)]


In [29]:
word_counts = map_reduce(documents, wc_mapper, max_reducer)
print(word_counts)

[('data', 1), ('science', 1), ('big', 1), ('fiction', 1), ('mining', 1)]


In [30]:
word_counts = map_reduce(documents, wc_mapper, min_reducer)
print(word_counts)

[('data', 1), ('science', 1), ('big', 1), ('fiction', 1), ('mining', 1)]


In [31]:
word_counts = map_reduce(documents, wc_mapper, count_distinct_reducer)
print(word_counts)

[('data', 1), ('science', 1), ('big', 1), ('fiction', 1), ('mining', 1)]
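
One likely explanation for the identical results: wc_mapper only ever emits the value 1, so every list in the collector contains nothing but 1s. The max and min of such a list are both 1, and it has exactly one distinct value. As a hedged illustration, the sketch below uses a hypothetical mapper, wc_per_document_mapper, that emits per-document counts instead, which makes max meaningful.

```python
from collections import Counter, defaultdict
from functools import partial
import re

def tokenize(message):
    return re.findall("[a-z0-9']+", message.lower())

def map_reduce(inputs, mapper, reducer):
    collector = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            collector[key].append(value)
    return [output
            for key, values in collector.items()
            for output in reducer(key, values)]

def reduce_values_using(aggregation_fn, key, values):
    yield (key, aggregation_fn(values))

max_reducer = partial(reduce_values_using, max)

def wc_per_document_mapper(document):
    """hypothetical mapper: emit (word, count of word within this document)"""
    for word, count in Counter(tokenize(document)).items():
        yield (word, count)

documents = ["data data data science", "big data", "data problems"]

# for each word, the largest number of times it appears in any one document
max_per_document = map_reduce(documents, wc_per_document_mapper, max_reducer)
print(max_per_document)
```

Here the values collected for "data" are [3, 1, 1], so max_reducer reports 3 rather than 1.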


Let's explore more by looking at social network status updates.

In [32]:
status_updates = [
    {"id": 1, 
     "username" : "joelgrus", 
     "text" : "Is anyone interested in a data science book?",
     "created_at" : datetime.datetime(2013, 12, 21, 11, 47, 0),
     "liked_by" : ["data_guy", "data_gal", "bill"] },
    # add your own
    {"id": 2, 
     "username" : "dnisarg13", 
     "text" : "I love nirva & data science",
     "created_at" : datetime.datetime(2013, 4, 21, 11, 47, 0),
     "liked_by" : ["krupa", "data_gal", "jahanvi"] },
    {"id": 3, 
     "username" : "dnisarg13", 
     "text" : "I love nirva & data science",
     "created_at" : datetime.datetime(2013, 4, 21, 11, 47, 0),
     "liked_by" : ["krupa", "data_gal", "jahanvi"] },
    {"id": 4, 
     "username" : "nirvavyas", 
     "text" : "i love nisarg",
     "created_at" : datetime.datetime(2013, 6, 21, 11, 47, 0),
     "liked_by" : ["data_guy", "nisarg", "nirva"] }
]

Let's create a mapper that counts the number of times "data science" is mentioned per day of the week.

In [33]:
def data_science_day_mapper(status_update):
    """yields (day_of_week, 1) if status_update contains "data science" """
    if "data science" in status_update["text"].lower():
        day_of_week = status_update["created_at"].weekday()
        yield (day_of_week, 1)
        
data_science_days = map_reduce(status_updates, 
                               data_science_day_mapper, 
                               sum_reducer)
print(data_science_days)

[(5, 1), (6, 2)]
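
The keys in this output are the integers returned by weekday(), where Monday is 0 and Sunday is 6. A small sketch for turning them into day names with the standard calendar module follows; the names depend on the locale, and an English locale is assumed here.

```python
import calendar
import datetime

# (day_of_week, count) pairs as printed above
data_science_days = [(5, 1), (6, 2)]

# calendar.day_name is indexed the same way as weekday(): Monday is 0
named = [(calendar.day_name[day], count) for day, count in data_science_days]
print(named)

# sanity check against the first status update's timestamp
first_update_day = datetime.datetime(2013, 12, 21, 11, 47, 0).weekday()
print(first_update_day)
```

So the single mention on day 5 falls on a Saturday, and the two mentions on day 6 fall on a Sunday.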


Let's imagine another task. Say we want to profile each user by the most common word in their status updates. There are three possible approaches. Which is right?
1. key is username, values are words and counts
2. key is word, values are usernames and counts
3. key is username and word, values are counts

Let's define a mapper and reducer for this task.

In [34]:
def words_per_user_mapper(status_update):
    user=status_update["username"]
    for word in tokenize(status_update["text"]):
        yield (user,(word,1))
            
def most_popular_word_reducer(user, words_and_counts):
    """given a sequence of (word, count) pairs, 
    return the word with the highest total count"""
    word_counts = Counter()
    for word, count in words_and_counts:
        word_counts[word] += count
    
    # find the most common word and return that (key, value) pair
    word,count = word_counts.most_common(1)[0]
    
    yield (user,(word,count))
    
user_words = map_reduce(status_updates,
                        words_per_user_mapper, 
                        most_popular_word_reducer)
print(user_words)

[('joelgrus', ('is', 1)), ('dnisarg13', ('i', 2)), ('nirvavyas', ('i', 1))]


Now you create a mapper for finding out the number of distinct status-likers for each user.

In [35]:
def liker_mapper(status_update):
    """return (user,liker) pairs"""
    user = status_update["username"]
    
    for liker in status_update["liked_by"]:
        yield (user,liker)
distinct_likers_per_user = map_reduce(status_updates,
                                     liker_mapper,
                                     count_distinct_reducer)
print(distinct_likers_per_user)

[('joelgrus', 3), ('dnisarg13', 3), ('nirvavyas', 3)]


Let's end this lesson by defining a mapper and reducer for matrix multiplication. Let's assume an $m\times n$ matrix $A$, and an $n\times k$ matrix $B$.

$C_{ij} = \sum_{l=1}^n A_{il}B_{lj}$

Assume the matrices
A = [[3, 2, 0],
    [0, 0, 0]]
B = [[4, -1, 0],
    [10, 0, 0],
    [0, 0, 0]]
are stored in a single list of (matrix_name, row, column, value) tuples, with the zero entries left out:

entries = [("A",0,0,3), ("A",0,1,2),
            ("B",0,0,4), ("B",0,1,-1), ("B",1,0,10)]

Our mapper will return the key-value pair ((row,col) of $C$, (col of $A$, value of $A$)) for elements of $A$ and ((row,col) of $C$, (row of $B$, value of $B$)) for elements of $B$.

In [36]:
def matrix_multiply_mapper(n, element):
    """n is the common dimension (columns of A, rows of B); it is also used
    here as the loop bound over the rows and columns of C, so C is assumed
    to be at most n by n. element is a tuple (matrix_name, i, j, value)"""
    matrix, i, j, value = element

    if matrix == 'A':
        # A[i][j] contributes to every C[i][column]:
        # emit ((i, column), (j, value)) over all columns of C
        for column in range(n):
            yield ((i, column), (j, value))
    else:
        # B[i][j] contributes to every C[row][j]:
        # emit ((row, j), (i, value)) over all rows of C
        for row in range(n):
            yield ((row, j), (i, value))
     
def matrix_multiply_reducer(n, key, indexed_values):
    results_by_index = defaultdict(list)
    
    # this reducer works the same as the word count reducer,
    # collecting all the pairs of A and B for each element of C
    for index,value in indexed_values:
        results_by_index[index].append(value)
        
    # sum up all the products of the positions with two (non-zero) results
    sum_product = sum(results[0]*results[1]
                      for results in results_by_index.values()
                      if len(results) == 2)
    # finally if the terms are != 0 then yield (key, value), where value is the result of the sum-product
    if sum_product != 0:
        yield (key,sum_product)

Once you have your mapper and reducer finished, try it out.

In [37]:
entries = [("A", 0, 0, 3), ("A", 0, 1,  2),
           ("B", 0, 0, 4), ("B", 0, 1, -1), ("B", 1, 0, 10)]
mapper = partial(matrix_multiply_mapper, 3) # what does partial do here?
reducer = partial(matrix_multiply_reducer, 3)
map_reduce(entries,mapper,reducer) # [((0, 0), 32), ((0, 1), -3)]

[((0, 0), 32), ((0, 1), -3)]
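
As a cross-check on the MapReduce result, the same product can be computed directly from the dense matrices. This is only a verification sketch; the function name matrix_multiply_dense is an assumption for illustration.

```python
A = [[3, 2, 0],
     [0, 0, 0]]
B = [[4, -1, 0],
     [10, 0, 0],
     [0, 0, 0]]

def matrix_multiply_dense(A, B):
    """plain nested-loop product, used only to check the MapReduce result"""
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][l] * B[l][j] for l in range(inner))
             for j in range(cols)]
            for i in range(rows)]

C = matrix_multiply_dense(A, B)
print(C)

# keep only the non-zero entries, in the same ((row, col), value) form
nonzero = [((i, j), value)
           for i, row in enumerate(C)
           for j, value in enumerate(row)
           if value != 0]
print(nonzero)
```

By hand, $C_{00} = 3\cdot 4 + 2\cdot 10 = 32$ and $C_{01} = 3\cdot(-1) + 2\cdot 0 = -3$, which agrees with both computations.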

## Mappers of Mappers

Let's try this with a larger text file. Use what you learned last week to open the file 'genesis.txt' and display the 10 most common words.

In [38]:
!cat data/genesis.txt | python egrep.py "[A-Z,a-z]" | python most_common_words.py 10

3593	and
2453	the
1346	of
639	he
639	his
604	to
596	unto
586	in
480	that
468	i


Now create a MapReduce implementation to output the most common words. Let's not reinvent the wheel. Can we run MapReduce within a mapper? Certainly!

In [113]:
def file_mapper(filename):
    """Create a mapper that reads in the lines of <filename>
    counts the words in each line and then returns the (key,value)
    pairs for those word counts"""
    

Now run your mapper with the sum_reducer to see if it works.

In [166]:
filenames = ["data/genesis.txt",
            "data/Luke.txt",
            "data/Kings.txt"]
word_counts = map_reduce(filenames, file_mapper, sum_reducer)

TypeError: 'NoneType' object is not iterable
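
The TypeError above is what map_reduce raises when a mapper returns None, which is what a function containing only a docstring does. Below is a minimal sketch of one way file_mapper could be filled in, by running an inner MapReduce over the lines of the file. The two small files written to a temporary directory are stand-ins with made-up content, so the sketch runs without the data folder.

```python
from collections import defaultdict
from functools import partial
import os
import re
import tempfile

def tokenize(message):
    return re.findall("[a-z0-9']+", message.lower())

def wc_mapper(document):
    for word in tokenize(document):
        yield (word, 1)

def reduce_values_using(aggregation_fn, key, values):
    yield (key, aggregation_fn(values))

sum_reducer = partial(reduce_values_using, sum)

def map_reduce(inputs, mapper, reducer):
    collector = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            collector[key].append(value)
    return [output
            for key, values in collector.items()
            for output in reducer(key, values)]

def file_mapper(filename):
    """run an inner MapReduce over the lines of <filename> and
    return its (word, count) pairs as this mapper's output"""
    with open(filename) as f:
        return map_reduce(f, wc_mapper, sum_reducer)

# stand-in files (hypothetical content) so the sketch is runnable anywhere
tmp_dir = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "big data\ndata science\n"),
                   ("b.txt", "data mining\n")]:
    path = os.path.join(tmp_dir, name)
    with open(path, "w") as f:
        f.write(text)
    paths.append(path)

# the outer MapReduce sums the per-file counts produced by file_mapper
word_counts = map_reduce(paths, file_mapper, sum_reducer)
print(word_counts)
```

Each file is counted on its own by the inner call, and the outer sum_reducer then adds the per-file counts for each word.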

In [304]:
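f1r = []
f2r = []
f3r = []

def sub_mapper1(filename):
    """append (word, 1) to f1r for every whitespace-separated word in the file"""
    with open(filename, 'r') as f:
        for line in f:
            for word in line.strip().split():
                f1r.append((word, 1))
    return f1r

def sub_mapper2(filename):
    """same as sub_mapper1, but collects into f2r"""
    with open(filename, 'r') as f:
        for line in f:
            for word in line.strip().split():
                f2r.append((word, 1))
    return f2r

def sub_mapper3(filename):
    """same as sub_mapper1, but collects into f3r"""
    with open(filename, 'r') as f:
        for line in f:
            for word in line.strip().split():
                f3r.append((word, 1))
    return f3r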

In [305]:
f1,f2,f3 = filenames

In [306]:
sub_mapper1(f1)
sub_mapper2(f2)
sub_mapper3(f3)

[('THE', 1),
 ('FIRST', 1),
 ('BOOK', 1),
 ('OF', 1),
 ('THE', 1),
 ('KINGS', 1),
 ('COMMONLY', 1),
 ('CALLED,', 1),
 ('THE', 1),
 ('THIRD', 1),
 ('BOOK', 1),
 ('OF', 1),
 ('THE', 1),
 ('KINGS', 1),
 ('CHAPTER', 1),
 ('1', 1),
 ('1', 1),
 ('Now', 1),
 ('king', 1),
 ('David', 1),
 ('was', 1),
 ('old', 1),
 ('[and]', 1),
 ('stricken', 1),
 ('in', 1),
 ('years;', 1),
 ('and', 1),
 ('they', 1),
 ('covered', 1),
 ('him', 1),
 ('with', 1),
 ('clothes,', 1),
 ('but', 1),
 ('he', 1),
 ('gat', 1),
 ('no', 1),
 ('heat.', 1),
 ('2', 1),
 ('Wherefore', 1),
 ('his', 1),
 ('servants', 1),
 ('said', 1),
 ('unto', 1),
 ('him,', 1),
 ('Let', 1),
 ('there', 1),
 ('be', 1),
 ('sought', 1),
 ('for', 1),
 ('my', 1),
 ('lord', 1),
 ('the', 1),
 ('king', 1),
 ('a', 1),
 ('young', 1),
 ('virgin:', 1),
 ('and', 1),
 ('let', 1),
 ('her', 1),
 ('stand', 1),
 ('before', 1),
 ('the', 1),
 ('king,', 1),
 ('and', 1),
 ('let', 1),
 ('her', 1),
 ('cherish', 1),
 ('him,', 1),
 ('and', 1),
 ('let', 1),
 ('her', 1),
 ...

In [331]:
def most_popular_word_reducer(file,n):
    """given a sequence of (word, count) pairs, 
    return the word with the n-th highest total count"""
    word_counts = Counter()
    for word, count in file:
        word_counts[word] += count
    
    # take the n-th most common word and return it as a one-element list
    word,count = word_counts.most_common(n)[n-1]
    return [(word,count)]

In [343]:
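def top(file,n):
    """return the n most common words, most common first,
    as one [(word, count)] entry per rank"""
    pop = []                    # local list, so repeated calls do not accumulate
    for i in range(1,n+1):      # ranks 1..n inclusive
        mp = most_popular_word_reducer(file,i)
        pop.append(mp)
    return pop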

In [264]:
top(f3r,10)

NameError: name 'top' is not defined

In [265]:
import re
namefiles = ["names/yob1880.txt",
            "names/yob1881.txt"]

In [276]:
list1 = []
def mapper(namefiles):
    for namefile in namefiles:
        with open(namefile,'r') as f1:
            for line in f1:
                for word in line.strip().split(','):
                    list1.append((word))
    return list1
   

In [277]:
mapper(namefiles)

['Mary',
 'F',
 '7065',
 'Anna',
 'F',
 '2604',
 'Emma',
 'F',
 '2003',
 'Elizabeth',
 'F',
 '1939',
 'Minnie',
 'F',
 '1746',
 'Margaret',
 'F',
 '1578',
 'Ida',
 'F',
 '1472',
 'Alice',
 'F',
 '1414',
 'Bertha',
 'F',
 '1320',
 'Sarah',
 'F',
 '1288',
 'Annie',
 'F',
 '1258',
 'Clara',
 'F',
 '1226',
 'Ella',
 'F',
 '1156',
 'Florence',
 'F',
 '1063',
 'Cora',
 'F',
 '1045',
 'Martha',
 'F',
 '1040',
 'Laura',
 'F',
 '1012',
 'Nellie',
 'F',
 '995',
 'Grace',
 'F',
 '982',
 'Carrie',
 'F',
 '949',
 'Maude',
 'F',
 '858',
 'Mabel',
 'F',
 '808',
 'Bessie',
 'F',
 '796',
 'Jennie',
 'F',
 '793',
 'Gertrude',
 'F',
 '787',
 'Julia',
 'F',
 '783',
 'Hattie',
 'F',
 '769',
 'Edith',
 'F',
 '768',
 'Mattie',
 'F',
 '704',
 'Rose',
 'F',
 '700',
 'Catherine',
 'F',
 '688',
 'Lillian',
 'F',
 '672',
 'Ada',
 'F',
 '652',
 'Lillie',
 'F',
 '647',
 'Helen',
 'F',
 '636',
 'Jessie',
 'F',
 '635',
 'Louise',
 'F',
 '635',
 'Ethel',
 'F',
 '633',
 'Lula',
 'F',
 '621',
 'Myrtle',
 'F',
 '615',
 ...

In [283]:
def remove_values_from_list(the_list, val):
        while val in the_list:
            the_list.remove(val)

In [296]:
# "F" and "M" evaluates to just "M", so remove each value separately
for gender in ("F", "M"):
    remove_values_from_list(list1, gender)
list1

['Mary',
 '7065',
 'Anna',
 '2604',
 'Emma',
 '2003',
 'Elizabeth',
 '1939',
 'Minnie',
 '1746',
 'Margaret',
 '1578',
 'Ida',
 '1472',
 'Alice',
 '1414',
 'Bertha',
 '1320',
 'Sarah',
 '1288',
 'Annie',
 '1258',
 'Clara',
 '1226',
 'Ella',
 '1156',
 'Florence',
 '1063',
 'Cora',
 '1045',
 'Martha',
 '1040',
 'Laura',
 '1012',
 'Nellie',
 '995',
 'Grace',
 '982',
 'Carrie',
 '949',
 'Maude',
 '858',
 'Mabel',
 '808',
 'Bessie',
 '796',
 'Jennie',
 '793',
 'Gertrude',
 '787',
 'Julia',
 '783',
 'Hattie',
 '769',
 'Edith',
 '768',
 'Mattie',
 '704',
 'Rose',
 '700',
 'Catherine',
 '688',
 'Lillian',
 '672',
 'Ada',
 '652',
 'Lillie',
 '647',
 'Helen',
 '636',
 'Jessie',
 '635',
 'Louise',
 '635',
 'Ethel',
 '633',
 'Lula',
 '621',
 'Myrtle',
 '615',
 'Eva',
 '614',
 'Frances',
 '605',
 'Lena',
 '603',
 'Lucy',
 '590',
 'Edna',
 '588',
 'Maggie',
 '582',
 'Pearl',
 '569',
 'Daisy',
 '564',
 'Fannie',
 '560',
 'Josephine',
 '544',
 'Dora',
 '524',
 'Rosa',
 '507',
 'Katherine',
 '502',
 ...

In [299]:
list_names = list1[0::2]
list_names

['Mary',
 'Anna',
 'Emma',
 'Elizabeth',
 'Minnie',
 'Margaret',
 'Ida',
 'Alice',
 'Bertha',
 'Sarah',
 'Annie',
 'Clara',
 'Ella',
 'Florence',
 'Cora',
 'Martha',
 'Laura',
 'Nellie',
 'Grace',
 'Carrie',
 'Maude',
 'Mabel',
 'Bessie',
 'Jennie',
 'Gertrude',
 'Julia',
 'Hattie',
 'Edith',
 'Mattie',
 'Rose',
 'Catherine',
 'Lillian',
 'Ada',
 'Lillie',
 'Helen',
 'Jessie',
 'Louise',
 'Ethel',
 'Lula',
 'Myrtle',
 'Eva',
 'Frances',
 'Lena',
 'Lucy',
 'Edna',
 'Maggie',
 'Pearl',
 'Daisy',
 'Fannie',
 'Josephine',
 'Dora',
 'Rosa',
 'Katherine',
 'Agnes',
 'Marie',
 'Nora',
 'May',
 'Mamie',
 'Blanche',
 'Stella',
 'Ellen',
 'Nancy',
 'Effie',
 'Sallie',
 'Nettie',
 'Della',
 'Lizzie',
 'Flora',
 'Susie',
 'Maud',
 'Mae',
 'Etta',
 'Harriet',
 'Sadie',
 'Caroline',
 'Katie',
 'Lydia',
 'Elsie',
 'Kate',
 'Susan',
 'Mollie',
 'Alma',
 'Addie',
 'Georgia',
 'Eliza',
 'Lulu',
 'Nannie',
 'Lottie',
 'Amanda',
 'Belle',
 'Charlotte',
 'Rebecca',
 'Ruth',
 'Viola',
 'Olive',
 'Amelia',
 

In [300]:
list_count = list1[1::2]
list_count

['7065',
 '2604',
 '2003',
 '1939',
 '1746',
 '1578',
 '1472',
 '1414',
 '1320',
 '1288',
 '1258',
 '1226',
 '1156',
 '1063',
 '1045',
 '1040',
 '1012',
 '995',
 '982',
 '949',
 '858',
 '808',
 '796',
 '793',
 '787',
 '783',
 '769',
 '768',
 '704',
 '700',
 '688',
 '672',
 '652',
 '647',
 '636',
 '635',
 '635',
 '633',
 '621',
 '615',
 '614',
 '605',
 '603',
 '590',
 '588',
 '582',
 '569',
 '564',
 '560',
 '544',
 '524',
 '507',
 '502',
 '473',
 '471',
 '471',
 '462',
 '436',
 '427',
 '414',
 '411',
 '410',
 '406',
 '404',
 '403',
 '391',
 '388',
 '365',
 '361',
 '345',
 '344',
 '323',
 '319',
 '317',
 '306',
 '303',
 '302',
 '301',
 '299',
 '286',
 '283',
 '277',
 '274',
 '259',
 '252',
 '249',
 '248',
 '245',
 '241',
 '238',
 '237',
 '236',
 '234',
 '229',
 '224',
 '221',
 '221',
 '215',
 '213',
 '210',
 '210',
 '204',
 '204',
 '198',
 '192',
 '191',
 '183',
 '167',
 '166',
 '165',
 '162',
 '153',
 '151',
 '149',
 '144',
 '141',
 '138',
 '138',
 '137',
 '136',
 '132',
 '131',
 '131',

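# drop every third element, starting at index 1, in a single step;
# deleting one index at a time inside a loop shifts the remaining positions
del list1[1::3]
list1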

In [304]:
import pandas as pd
babyname_list = pd.DataFrame(
    {'Names': list_names,
     'Names total Count': list_count
    })

In [306]:
#babyname_list

In [330]:
babyname_list['Ex_names'] = babyname_list['Names'].str.extract('(^M*)', expand=False).str.strip()
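# sort_values returns a new DataFrame; the result is not assigned here,
# so the rows shown below stay in file order
babyname_list.sort_values(by='Names total Count', ascending=False)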
result = babyname_list.loc[babyname_list['Ex_names'] == "M"]
result.head(100)

Unnamed: 0,Names,Names total Count,Ex_names
0,Mary,7065,M
4,Minnie,1746,M
5,Margaret,1578,M
15,Martha,1040,M
20,Maude,858,M
21,Mabel,808,M
28,Mattie,704,M
39,Myrtle,615,M
45,Maggie,582,M
54,Marie,471,M


For your assignment, I want you to create a new reducer that returns the top $n$ words in a list of documents.
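
One possible shape for such a reducer is sketched below. It assumes a hypothetical mapper, all_words_mapper, that sends every word to one shared key, so that a single reducer call sees the counts for the whole collection. The names all_words_mapper, top_n_words_reducer, and the key "all_documents" are illustrative choices, not part of the assignment.

```python
from collections import Counter, defaultdict
from functools import partial
import re

def tokenize(message):
    return re.findall("[a-z0-9']+", message.lower())

def map_reduce(inputs, mapper, reducer):
    collector = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            collector[key].append(value)
    return [output
            for key, values in collector.items()
            for output in reducer(key, values)]

def all_words_mapper(document):
    """hypothetical mapper: emit every word under one shared key"""
    for word in tokenize(document):
        yield ("all_documents", (word, 1))

def top_n_words_reducer(n, key, words_and_counts):
    """total the (word, count) pairs and yield the n most common"""
    word_counts = Counter()
    for word, count in words_and_counts:
        word_counts[word] += count
    yield (key, word_counts.most_common(n))

documents = ["data science", "big data", "science fiction", "data mining"]

# partial fixes n, leaving the (key, values) signature map_reduce expects
top_2 = map_reduce(documents, all_words_mapper,
                   partial(top_n_words_reducer, 2))
print(top_2)
```

Because everything goes to one key, a single reducer does all the counting, which would not scale on a real cluster. A two-stage design (a word count first, then a top-n pass over the much smaller list of counts) is a common alternative.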

In [368]:
babyname_list['Ex_names2'] = babyname_list['Names'].str.extract('(ar)', expand=False).str.strip()
result = babyname_list.loc[babyname_list['Ex_names2'] == "ar"]
result.head(10)

Unnamed: 0,Names,Names total Count,Ex_names,Ex_names2
0,Mary,7065,M,ar
5,Margaret,1578,M,ar
9,Sarah,1288,,ar
11,Clara,1226,,ar
15,Martha,1040,M,ar
19,Carrie,949,,ar
46,Pearl,569,,ar
54,Marie,471,M,ar
72,Harriet,319,,ar
74,Caroline,306,,ar


In [462]:
namefiles = ["names/yob1880.txt",
            "names/yob1881.txt"]

In [469]:
# one file per year, 1880 through 2016
namefiles = ["names/yob%d.txt" % year for year in range(1880, 2017)]

In [470]:
def mapper(namefiles):
    for namefile in namefiles:
        with open(namefile,'r') as file:
            for line in file:
                for word in line.strip().split(','):
                    list1.append((word))
    return list1

In [472]:
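def reducer(list1,startingwith,topnwords):
    # "F" and "M" evaluates to just "M", so remove each value separately
    for gender in ("F", "M"):
        remove_values_from_list(list1, gender)
    list_names = list1[0::2]
    list_count = list1[1::2]
    babyname_list = pd.DataFrame(
    {'Names': list_names,
     'Names total Count': list_count
    })
    babyname_list['Starting with?'] = babyname_list['Names'].str.extract('(^%s*)'%startingwith, expand=False).str.strip()
    # sort_values returns a new DataFrame; the result is not assigned, so rows stay in file order
    babyname_list.sort_values(by='Names total Count', ascending=False)
    # compare against the argument instead of a hard-coded "M"
    result = babyname_list.loc[babyname_list['Starting with?'] == startingwith]
    return result.head(topnwords)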

In [473]:
def MapReduce(namefiles,list1,startingwith,topnwords):
    mapper(namefiles)
    return reducer(list1,startingwith,topnwords)

In [None]:
MapReduce(namefiles,list1,"M",10)

In [440]:
namefiles = ["names/yob1880.txt",
            "names/yob1881.txt"]

In [448]:
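def reducer(list1,contains,topnwords):
    # "F" and "M" evaluates to just "M", so remove each value separately
    for gender in ("F", "M"):
        remove_values_from_list(list1, gender)
    list_names = list1[0::2]
    list_count = list1[1::2]
    babyname_list = pd.DataFrame(
    {'Names': list_names,
     'Names total Count': list_count
    })
    babyname_list['Contains?'] = babyname_list['Names'].str.extract('(%s)'%contains, expand=False).str.strip()
    # sort_values returns a new DataFrame; the result is not assigned, so rows stay in file order
    babyname_list.sort_values(by='Names total Count', ascending=False)
    # compare against the argument instead of a hard-coded "ar"
    result = babyname_list.loc[babyname_list['Contains?'] == contains]
    return result.head(topnwords)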

In [452]:
def MapReduce(namefiles,list1,contains,topnwords):
    mapper(namefiles)
    return reducer(list1,contains,topnwords)

In [453]:
MapReduce(namefiles,list1,"ar",10)

Unnamed: 0,Names,Names total Count,Contains?
0,Mary,7065,ar
5,Margaret,1578,ar
9,Sarah,1288,ar
11,Clara,1226,ar
15,Martha,1040,ar
19,Carrie,949,ar
46,Pearl,569,ar
54,Marie,471,ar
72,Harriet,319,ar
74,Caroline,306,ar
