#Hadoop Final Project 
###Udacity Course: Intro to Hadoop and MapReduce  
####Zach Farmer
***

##Table of Contents     
[Introduction](#Introduction)      
[Decision Process](#Decision Process)    
[Student Times](#Student Times)    
[Post and Answer Length](#Post and Answer Length)   
[Top Tags](#Top Tags)    
[Study Groups](#Study Groups)   
[Search Functionality](#Search Functionality)   



<a id="Introduction"></a> 
***

> This notebook contains the code for the final project, but the code cannot run against a native Hadoop Distributed File System (HDFS) from within the notebook. The code is designed for Cloudera's Hadoop distribution and was tested on a local VM (on my machine) containing the necessary software. In this notebook I use the Unix Bourne shell as a proxy to approximate the behavior seen in the Hadoop environment. Note that this approximation runs on a single machine and cannot scale the way an actual Hadoop cluster does; it shares little with a real HDFS deployment. The mappers and reducers should, however, be portable as written in this report and could be run in a Hadoop framework as is. The Cloudera distribution supports Hadoop Streaming, and as a result the mappers and reducers are written in Python. There are two datasets: a small sample used to confirm that the process and code are valid, and the final dataset on which the MapReduce tasks were designed to run.  

###Introduction
*Intro to Hadoop and MapReduce project
In this project you will work with some discussion forum (also sometimes called discussion board) data. It is a type of user generated content that you can find all around the web. Most popular websites have some kind of a forum, and the things you will do in this project can transfer to other similar projects. This page will be followed by various questions about the data set.* __Udacity -- Intro to Hadoop and MapReduce Final Project__     

**The Data Set**   
This particular dataset was taken from the Udacity forums the first months after the launch of this course. Udacity forums were run on a free, opensource software called OSQA, which was designed to be similar to the popular StackOverflow forums. The basic structure is - the forum has nodes. All nodes have a body and author_id. Top level nodes are called questions, and will also have a title and tags. Questions can have answers. Both questions and answers can have comments.

You will have to run the code mostly on your VMs, or on your real Hadoop cluster, if you have set up one. You can download the additional dataset [here](http://content.udacity-data.com/course/hadoop/forum_data.tar.gz "http://content.udacity-data.com/course/hadoop/forum_data.tar.gz"). To unarchive it, download it to your VM, put in the data directory and run:  

`tar zxvf forum_data.tar.gz`     

There are 2 files in the dataset. The first is "forum_node.tsv", and that contains all forum questions and answers in one table. It was exported from the RDBMS by using tab as a separator, and enclosing all fields in double quotes. If you finished Lesson 4, you already know how to deal with such files. You can find the field names in the first line of the file "forum_node.tsv". The ones that are the most relevant to the task are:      

* "id": id of the node      
* "title": title of the node. in case "node_type" is "answer" or "comment", this field will be empty        
* "tagnames": space separated list of tags     
* "author_id": id of the author      
* "body": content of the post     
* "node_type": type of the node, either "question", "answer" or "comment"      
* "parent_id": node under which the post is located, will be empty for "questions"     
* "abs_parent_id": top node where the post is located      
* "added_at": date added     

The second table is "forum_users.tsv". It contains fields for "user_ptr_id" - the id of the user. "reputation" - the reputation, or karma of the user, earned when other users upvote their posts, and the number of "gold", "silver" and "bronze" badges earned. The actual database has more fields in this table, like user name nickname, bio (if set) etc, but we have removed this information here.
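
The mappers in this report address fields by position (for example `line[3]` for `author_id`). As a minimal sketch, the positions can instead be derived from the header row. The sample header here is an assumption: it contains only the nine fields listed above, in that order, while the real `forum_node.tsv` header has additional columns. The helper name `field_positions` is hypothetical.

```python
import csv

# Sample header (assumption): only the nine fields listed above, in order.
# The real forum_node.tsv header carries further columns after these.
SAMPLE_HEADER = ('"id"\t"title"\t"tagnames"\t"author_id"\t"body"\t'
                 '"node_type"\t"parent_id"\t"abs_parent_id"\t"added_at"')


def field_positions(header_line):
    """Map each field name in a tab-separated, quoted header to its index."""
    # csv.reader accepts any iterable of lines, so a one-item list is enough.
    row = next(csv.reader([header_line], delimiter="\t"))
    return dict((name, idx) for idx, name in enumerate(row))


positions = field_positions(SAMPLE_HEADER)
# positions["author_id"] is 3 and positions["added_at"] is 8, which matches
# the indexes the mappers in this report use.
```

Looking positions up by name keeps a mapper working if the column order changes, at the cost of having to see the header row first, which only one mapper will receive when the input is split across a cluster.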



In [20]:
#unzip the tar.gz file if forum_data is not already in the local directory.
import os 

#print os.listdir(os.getcwd())
if "forum_data" not in os.listdir(os.getcwd()):
    !tar zxvf forum_data.tar.gz

<a id="Decision Process"></a>
***

###Decision Process       
***Let's assume you have an active community site, similar to the Udacity forum, where users can post different information. You want to obtain some statistics about user behavior. Is it a good idea to use MapReduce/Hadoop to process the data? Consider how each of the 3V's of Big Data would affect this decision process.***    

**Answer:**    
1. Volume     
2. Variety    
3. Velocity      

Udacity may have several million users, and it is likely that only a fraction of them have used the forums. Storage prices continue to fall, and most of the forum data is text that could probably fit on a few large drives. The data for this project certainly fits on one machine. On the basis of Volume, therefore, you could argue that Hadoop is not really necessary. This assessment depends on how much data accrues over the lifespan of the forum: if it grew beyond several terabytes, then HDFS would make sense for storing and processing it. 
As stated in the prompt, users post many different kinds of content: text, code snippets, numbers, images (PNGs, etc.), HTML, and LaTeX, for example. A large Variety of data may be recorded in the forums, and storing it in HDFS would simplify storage and widen the range of acceptable inputs, because all data is accepted in its raw format. This flexibility would be especially convenient if Udacity decided to collect more data or accept a wider range of data types from the forums without overhauling its RDBMS schemas to accommodate changing strategies and maturing data analytics.       
Finally, regarding Velocity, most forum data remains relevant for long periods, and dumping old posts because of storage limits does not appear to be a good strategy; acquiring more space would be the preferred approach. The speed at which data arrives and is turned into usable statistics is probably not that important, since the statistics run on the forum data are unlikely to be time sensitive. A case can therefore be made that Hadoop is not necessary on the basis of Velocity, unless the statistics run over more data than fits on a single machine, in which case we would want to run them on a cluster, if only to debug our code more quickly.        

Overall, we can conclude that using MapReduce/Hadoop to process the data would be a viable design decision (a good idea), given the large variety of data found in forum posts and the possibly large quantity of posts, which will continue to grow over time. Velocity is less significant to the decision to use Hadoop in this instance.    
   


<a id="Student Times"></a>
***

###Student Times   
We have a lot of passionate students that bring a lot of value to forums. Forums also sometimes need a watchful eye on them, to make sure that posts are tagged in a way that helps to find them, that the tone on forums stays positive, and in general - they need people who can perform some management tasks - forum moderators. These are usually chosen from students who already have shown that they are active and helpful forum participants.      

Our students come from all around the world, so we need to know both at what times of day the activity is the highest, and to know which of the students are active at that time.     

In this exercise your task is to find for each student what is the hour during which the student has posted the most posts. Output from reducers should be:    
   
`author_id    hour`      

For example:    

```
13431511\t13
54525254141\t21
```  

If there is a tie: there are multiple hours during which a student has posted a maximum number of posts, please print the student-hour pairs on separate lines. The order in which these lines appear in your output does not matter.      

You can ignore the time-zone offset for all times - for example in the following line: "2012-02-25 08:11:01.623548+00" - you can ignore the +00 offset.     

In order to find the hour posted, please use the added_at field and NOT the last_activity_at field.       

To make sure your code is running properly, we have put together a smaller data set and set of expected outputs for you to use to check your work. Please click [here](https://www.udacity.com/wiki/ud617/local-testing-instructions "https://www.udacity.com/wiki/ud617/local-testing-instructions") to access the instructions to use it.
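
The two rules above (ignore the time-zone offset, and report every hour that ties for the maximum) can be illustrated in memory for a single student. This is only a sketch of the logic, not the streaming reducer itself; the function names `hour_of` and `peak_hours` are hypothetical, and the timestamps are assumed to follow the `YYYY-MM-DD HH:MM:SS...` layout shown in the task text.

```python
from collections import Counter


def hour_of(added_at):
    """Return the hour from a timestamp like '2012-02-25 08:11:01.623548+00'.

    Characters 11-12 hold the two-digit hour, so the fractional seconds and
    the time-zone offset are never looked at.
    """
    return int(added_at[11:13])


def peak_hours(timestamps):
    """Return every hour tied for the most posts, in ascending order."""
    counts = Counter(hour_of(ts) for ts in timestamps)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(hour for hour, n in counts.items() if n == top)


# Made-up timestamps for one student: two posts at 08h, two at 21h, one at 03h.
sample = [
    "2012-02-25 08:11:01.623548+00",
    "2012-02-26 08:30:00.000000+00",
    "2012-02-27 21:05:10.5+00",
    "2012-02-28 21:59:59.1+00",
    "2012-03-01 03:00:00.0+00",
]
tied = peak_hours(sample)  # [8, 21]: both hours are reported
```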

In [38]:
%%writefile mapper_studentTime.py
#!/usr/bin/python 
#Author: Zach Farmer
#Purpose: find each student's most active posting hour.  

import sys
import csv 
from datetime import datetime  

def mapper():
    """
        Parameters: sys.stdin from Udacity forum_node data
        
        Output: key,value pair where key is the author_id and 
        the value is the added_at hour 
    """
    
    reader = csv.reader(sys.stdin, delimiter="\t") 
    writer = csv.writer(sys.stdout, delimiter="\t", quotechar='"',
                        quoting=csv.QUOTE_ALL)  
    
    for line in reader: 
        #skip malformed rows and the header row
        if len(line) < 9 or line[8] == "added_at":
            continue
        
        author_id = line[3] 
        #added_at looks like "2012-02-25 08:11:01.623548+00". Parsing only the
        #first 19 characters ignores the fractional seconds and the time-zone
        #offset, so a timestamp without a fractional part parses as well.
        hour = datetime.strptime(line[8][:19], "%Y-%m-%d %H:%M:%S").hour
        writer.writerow([author_id,hour])
        
if __name__ == "__main__":
    mapper()

Overwriting mapper_studentTime.py


In [39]:
%%bash 
##Test mapper output
#chmod 764 mapper_studentTime.py #make executable 
cat student_test_posts.csv | ./mapper_studentTime.py | sort 

"100000005"	"1"
"100000066"	"1"
"100000066"	"5"
"100002460"	"12"
"100003192"	"8"
"100003268"	"15"
"100004467"	"12"
"100004467"	"20"
"100004819"	"10"
"100004819"	"4"
"100004819"	"4"
"100004819"	"5"
"100005156"	"17"
"100007808"	"12"
"100008254"	"22"
"100010128"	"14"
"100012200"	"5"
"100019875"	"5"
"100020526"	"14"
"100071170"	"12"
"100071170"	"12"
"100071170"	"14"
"100071170"	"5"


In [171]:
%%writefile reducer_studentTime_sansNumpy.py   
#!/usr/bin/python   
#Author: Zach 
#Purpose: find each student's most active posting hour. This script does not 
#utilize numpy. 

import sys
import csv 

def write_peak_hours(writer, author_id, hours):
    """
        Write one author_id,hour row for every hour that ties for the
        author's highest post count.
    """
    top_count = max(hours)
    for hour, hour_count in enumerate(hours):
        if hour_count == top_count:
            writer.writerow([author_id, hour])

def reducer():
    """
        Parameters: sys.stdin from mapper_studentTime 
        
        Output: author_id followed by the hour where the id was most
        active.   
    """
    reader = csv.reader(sys.stdin, delimiter="\t")   
    writer = csv.writer(sys.stdout, delimiter="\t", quotechar='"',
                        quoting=csv.QUOTE_ALL)  
    
    oldKey = None
    hours = [0] * 24
    
    for line in reader:
        currentKey = line[0]  
        currentHour = int(line[1]) #csv.reader has already removed the quotes
        
        if oldKey and oldKey != currentKey:
            #key changed: flush the previous author and reset the counts
            write_peak_hours(writer, oldKey, hours)
            hours = [0] * 24
            
        oldKey = currentKey
        hours[currentHour] += 1

    #flush the last author, unless the input was empty
    if oldKey is not None:
        write_peak_hours(writer, oldKey, hours)

if __name__ == "__main__":
    reducer() 
              



Overwriting reducer_studentTime_sansNumpy.py


In [98]:
%%writefile reducer_studentTime.py   
#!/usr/bin/python   
#Author: Zach 
#Purpose: find each student's most active posting hour. This script utilizes the 
#numpy module. I personally prefer this method but we cannot assume that all the 
#machines in the Hadoop cluster will have numpy installed. 

import sys
import csv 
import numpy as np

def write_peak_hours(writer, author_id, hours):
    """
        Write one author_id,hour row for every hour that ties for the
        author's highest post count.
    """
    for hour in np.flatnonzero(hours == hours.max()):
        writer.writerow([author_id, hour])

def reducer():
    """
        Parameters: sys.stdin from mapper_studentTime 
        
        Output: author_id followed by the hour where the id was most
        active.   
    """
    reader = csv.reader(sys.stdin, delimiter="\t")   
    writer = csv.writer(sys.stdout, delimiter="\t", quotechar='"',
                        quoting=csv.QUOTE_ALL)  
    
    oldKey = None
    hours = np.zeros(24, dtype=int)
    
    for line in reader:
        currentKey = line[0]  
        currentHour = int(line[1]) #csv.reader has already removed the quotes
        
        if oldKey and oldKey != currentKey:
            #key changed: flush the previous author and reset the counts
            write_peak_hours(writer, oldKey, hours)
            hours = np.zeros(24, dtype=int)
    
        oldKey = currentKey
        hours[currentHour] += 1

    #flush the last author, unless the input was empty
    if oldKey is not None:
        write_peak_hours(writer, oldKey, hours)

if __name__ == "__main__":
    reducer() 
              


Overwriting reducer_studentTime.py


In [172]:
%%bash 
##Test reducer output
#chmod 764 reducer_studentTime.py #make executable 
#chmod 764 reducer_studentTime_sansNumpy.py
cat student_test_posts.csv | ./mapper_studentTime.py | sort | ./reducer_studentTime_sansNumpy.py



"100000005"	"1"
"100000066"	"1"
"100000066"	"5"
"100002460"	"12"
"100003192"	"8"
"100003268"	"15"
"100004467"	"12"
"100004467"	"20"
"100004819"	"4"
"100005156"	"17"
"100007808"	"12"
"100008254"	"22"
"100010128"	"14"
"100012200"	"5"
"100019875"	"5"
"100020526"	"14"
"100071170"	"12"


In [None]:
#Run on the whole dataset
#!cat forum_data/forum_node.tsv | ./mapper_studentTime.py | sort | ./reducer_studentTime_sansNumpy.py

<a id="Post and Answer Length"></a>  
****

###Post and Answer Length     
We are interested to see if there is a correlation between the length of a post and the length of answers.   

Write a mapreduce program that would process the forum_node data and output the length of the post and the average answer (just answer, not comment) length for each post. You will have to decide how to write both the mapper and the reducer to get the required result.

To make sure your code is running properly, we have put together a smaller data set and set of expected outputs for you to use to check your work. Please click [here](https://www.udacity.com/wiki/ud617/local-testing-instructions) to access the instructions to use it.  

> Hints for writing reducer code         
Code should not use a data structure (e.g. a dictionary) in the reducer that stores a large number of keys. Remember that Hadoop already sorts the mapper output based on key, such that key-value pairs with the same key will appear consecutively as input to the reducer. Make sure you take advantage of this ordering when you write your reducer code.           
This is part of a more general principle connected with the Volume characteristic of Big Data. Mappers and reducers read through very large amounts of data and we should be mindful, as we write mapper and reducer code, of how much data we store in main memory.
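
The hint above can be sketched with `itertools.groupby`, which walks key-sorted input and hands over one key's values at a time, so nothing is stored per key. This is an illustration of the principle rather than the reducer used in this report; `reduce_sorted` and `mean` are hypothetical names, and the sample pairs are answer lengths taken from the small test dataset.

```python
from itertools import groupby
from operator import itemgetter


def reduce_sorted(pairs, reduce_fn):
    """Yield (key, reduce_fn(values)) for key-sorted (key, value) pairs.

    Only one key's group is consumed at a time, so memory use does not grow
    with the number of distinct keys.
    """
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, reduce_fn(value for _, value in group)


def mean(values):
    """Average an iterable in a single pass without storing it."""
    total, count = 0.0, 0
    for value in values:
        total += value
        count += 1
    return total / count if count else 0.0


# Sorted (question id, answer length) pairs, as the shuffle would deliver them.
sample = [("3778", 164), ("6011204", 158), ("6011204", 219)]
result = list(reduce_sorted(sample, mean))
# result == [("3778", 164.0), ("6011204", 188.5)]
```

The grouping is only correct when the input is sorted by key, which Hadoop guarantees between the map and reduce phases and which `sort` provides in the shell pipeline.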

In [195]:
%%writefile mapper_PostLength.py   
#!/usr/bin/python 
#Author: Zach Farmer
#Purpose: determine if correlation btw. post length and answers length.   

import sys
import csv 

def mapper():
    """
        Parameters: sys.stdin from forum_node.tsv or small test dataset 
        
        Output: key is the node id for a question or the abs_parent_id for
        an answer; value is the body length, prefixed with "Q-" or "A-".
    """
    
    reader = csv.reader(sys.stdin, delimiter="\t")   
    writer = csv.writer(sys.stdout, delimiter="\t", quotechar='"',
                        quoting=csv.QUOTE_ALL)  
    
    for line in reader:
        if len(line) < 8: #skip malformed rows
            continue
        
        node_type = line[5] 
        body_len = len(line[4]) 
        
        #comments and the header row match neither branch and are dropped
        if node_type == "question":
            writer.writerow([line[0],"Q-"+str(body_len)])
        elif node_type == "answer":
            writer.writerow([line[7],"A-"+str(body_len)])
    
        
if __name__=="__main__":
    mapper()

Overwriting mapper_PostLength.py


In [196]:
%%bash 
#chmod 764 mapper_PostLength.py
cat student_test_posts.csv | ./mapper_PostLength.py |sort 

"111"	"Q-35"
"15084"	"Q-237"
"2"	"Q-145"
"262"	"Q-50"
"26454"	"Q-101"
"3778"	"A-164"
"3778"	"Q-69"
"6011204"	"A-158"
"6011204"	"A-219"
"6011204"	"Q-2651"
"6011936"	"A-125"
"6011936"	"A-760"
"6011936"	"Q-347"
"6012754"	"A-414"
"6012754"	"Q-369"
"6015491"	"A-313"
"6015491"	"A-65"
"6015491"	"Q-170"
"66193"	"A-288"
"66193"	"A-302"
"66193"	"A-34"
"66193"	"Q-60"
"7185"	"Q-86"


In [203]:
%%writefile reducer_PostLength.py
#!/usr/bin/python  
#Author: Zach Farmer
#Purpose: Determine if correlation btw. post length and answers length.   

import sys
import csv 

def write_lengths(writer, node_id, q_post_len, answ_len, count):
    """
        Write the question length and the average answer length for one
        question. The average is 0 when the question has no answers.
    """
    if count != 0:
        avg_answ_len = answ_len/float(count)
    else:
        avg_answ_len = 0
    writer.writerow([node_id, q_post_len, avg_answ_len])

def reducer(): 
    """
        Parameters: sys.stdin from mapper_PostLength 
        
        Output: node id, post length of the question and average post
        length of its answers
    """
    
    reader = csv.reader(sys.stdin, delimiter="\t")   
    writer = csv.writer(sys.stdout, delimiter="\t", quotechar='"',
                        quoting=csv.QUOTE_ALL)  
    
    oldKey = None
    q_post_len = 0
    answ_len = 0.0 
    count = 0
    
    for line in reader: 
        
        currentKey = line[0]
        node_type,post_length = line[1].split("-") 
        
        if oldKey and oldKey != currentKey:
            write_lengths(writer, oldKey, q_post_len, answ_len, count)
            #reset all per-question state, including the question length,
            #so answers whose question row is missing do not inherit it
            q_post_len = 0
            answ_len = 0.0
            count = 0
        
        oldKey = currentKey
        
        if node_type == "Q":
            q_post_len = int(post_length)
        elif node_type == "A":
            answ_len += int(post_length)
            count += 1
    
    #flush the last question, unless the input was empty
    if oldKey is not None:
        write_lengths(writer, oldKey, q_post_len, answ_len, count)
        
if __name__ == "__main__":
    reducer()   
    

Overwriting reducer_PostLength.py


In [204]:
%%bash 
#chmod 764 reducer_PostLength.py
cat student_test_posts.csv | ./mapper_PostLength.py |sort |./reducer_PostLength.py

"111"	"35"	"0"
"15084"	"237"	"0"
"2"	"145"	"0"
"262"	"50"	"0"
"26454"	"101"	"0"
"3778"	"69"	"164.0"
"6011204"	"2651"	"188.5"
"6011936"	"347"	"442.5"
"6012754"	"369"	"414.0"
"6015491"	"170"	"189.0"
"66193"	"60"	"208.0"
"7185"	"86"	"0"


In [None]:
#Run on the whole dataset
#!cat forum_data/forum_node.tsv | ./mapper_PostLength.py |sort |./reducer_PostLength.py
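
The reducer reports the two lengths per question but does not itself measure the correlation the task asks about. One possible follow-up, sketched here under assumptions, is to compute a Pearson coefficient over the reducer output. The function name `pearson` is hypothetical, and excluding questions without answers is a choice made for this sketch, because their average of 0 is a placeholder rather than a measurement.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Rows from the test run above that have at least one answer:
# (question length, average answer length).
rows = [(69, 164.0), (2651, 188.5), (347, 442.5),
        (369, 414.0), (170, 189.0), (60, 208.0)]
r = pearson([q for q, _ in rows], [a for _, a in rows])
```

Six rows are far too few to draw a conclusion from; the same function would have to be run over the reducer output for the full dataset.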

<a id="Top Tags"></a>    
**** 

###Top Tags    
We are interested in seeing what the top tags used in posts are.    

Write a mapreduce program that would output Top 10 tags, ordered by the number of questions they appear in.

For an extra challenge you can think about how to get a top 10 list of tags, where they are ordered by some weighted score of your choice. If you decide to do this, then please submit your solution to the regular problem and then also submit this extra challenge problem in separate files as described on the instruction page.

To make sure your code is running properly, we have put together a smaller data set and set of expected outputs for you to use to check your work. Please click [here](https://www.udacity.com/wiki/ud617/local-testing-instructions) to access the instructions to use it.

Please note that you should only look at tags appearing in questions themselves (i.e. nodes with node_type "question"), not on answers or comments.


In [222]:
%%writefile mapper_TopTags.py   
#!/usr/bin/python  
#Author: Zach Farmer
#Purpose: Return Top 10 tags found in questions 

import sys
import csv    

def mapper():   
    """
        Parameters: sys.stdin from forum_node.tsv or test file.
        
        Output: tab-delimited csv lines (tag, question node id) for input
        to reducer_TopTags.py
    """
    
    reader = csv.reader(sys.stdin, delimiter="\t")   
    writer = csv.writer(sys.stdout, delimiter="\t", quotechar='"',
                        quoting=csv.QUOTE_ALL)  
    
    for line in reader:
        if len(line) < 6: #skip malformed rows
            continue
        
        node_id = line[0]
        node_type = line[5]     
        
        if node_type == "question":
            #split() drops empty strings; set() counts a tag once per question
            for tag in sorted(set(line[2].split())):
                writer.writerow([tag,node_id])
            
        
if __name__ == "__main__":
    mapper() 
    

Overwriting mapper_TopTags.py


In [223]:
%%bash
chmod 764 mapper_TopTags.py   
cat student_test_posts.csv | ./mapper_TopTags.py | sort 

"application"	"26454"
"board"	"15084"
"browsers"	"262"
"bug"	"111"
"cs101"	"111"
"cs101"	"15084"
"cs101"	"2"
"cs101"	"262"
"cs101"	"26454"
"cs101"	"3778"
"cs101"	"66193"
"cs101"	"7185"
"cs212"	"66193"
"cs253"	"6011204"
"cs253"	"6011936"
"cs253"	"6012754"
"cs253"	"6015491"
"cs253"	"66193"
"cs262"	"66193"
"deadlines"	"6012754"
"digital"	"15084"
"discussion"	"26454"
"discussion"	"6011204"
"discussion"	"6011936"
"discussion"	"6015491"
"discussion"	"66193"
"google-appengine"	"6011936"
"homework"	"6011204"
"homework"	"6012754"
"html"	"6011936"
"hungarian"	"7185"
"hw2-1"	"6012754"
"issues"	"111"
"issues"	"262"
"issues"	"3778"
"jobs"	"15084"
"jobs"	"26454"
"lessons"	"15084"
"lessons"	"66193"
"meta"	"111"
"meta"	"66193"
"nationalities"	"111"
"nationalities"	"7185"
"offtopic"	"6015491"
"profile"	"3778"
"udacity"	"6015491"
"udacity-future"	"6015491"
"video"	"111"
"welcome"	"2"
"welcome"	"66193"
"welcome"	"7185"


In [274]:
%%writefile reducer_TopTags.py
#!/usr/bin/python  
#Author: Zach Farmer
#Purpose: Return Top 10 tags found in questions   

import sys
import csv 

def update_top_ten(top_ten, tag, count):
    """
        Keep at most ten tag:count entries in top_ten. Once it is full, a
        new tag replaces the current minimum only if its count is larger.
    """
    if len(top_ten) < 10:
        top_ten[tag] = count
    else:
        min_tag = min(top_ten, key=top_ten.get)
        if count > top_ten[min_tag]:
            del top_ten[min_tag]
            top_ten[tag] = count

def reducer(): 
    """
        Parameters: sys.stdin from mapper_TopTags.py
        
        Output: Top ten tags from question nodes 
    """
    
    reader = csv.reader(sys.stdin, delimiter="\t")   
    writer = csv.writer(sys.stdout, delimiter="\t", quotechar='"',
                        quoting=csv.QUOTE_ALL)  
    
    oldKey = None 
    count = 0 
    
    #bounded at ten entries, so memory use does not grow with the number of tags
    topTen_tagDict = {}
    for line in reader:
    
        currentKey = line[0]   
        
        if oldKey and oldKey != currentKey:
            update_top_ten(topTen_tagDict, oldKey, count)
            count = 0 
        
        oldKey = currentKey 
        count += 1
    
    #flush the last tag, unless the input was empty. Going through the same
    #helper also handles inputs with fewer than ten distinct tags.
    if oldKey is not None:
        update_top_ten(topTen_tagDict, oldKey, count)
        
    for key,count in sorted(topTen_tagDict.items(), key = lambda x: x[1], reverse=True):
        writer.writerow([key,count]) 
            
if __name__ == "__main__": 
    reducer()  
    
            
    

Overwriting reducer_TopTags.py


In [275]:
%%bash 
#chmod 764 reducer_TopTags.py 
cat student_test_posts.csv | ./mapper_TopTags.py | sort | ./reducer_TopTags.py 

"cs101"	"8"
"discussion"	"5"
"cs253"	"5"
"welcome"	"3"
"issues"	"3"
"lessons"	"2"
"jobs"	"2"
"meta"	"2"
"nationalities"	"2"
"homework"	"2"


In [None]:
#Run on the whole dataset
#!cat forum_data/forum_node.tsv | ./mapper_TopTags.py | sort | ./reducer_TopTags.py 
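
On a single machine the pipeline above behaves as if there were one reducer. On a cluster with several reducers, each one sees only its own share of the tags and would emit its own top ten, so a final merge is needed (or the job must be configured with a single reducer). A minimal sketch of that merge, assuming each tag is counted by exactly one reducer; `top_n` and `merge_partial_tops` are hypothetical names.

```python
import heapq


def top_n(tag_counts, n=10):
    """Return the n (tag, count) pairs with the largest counts.

    heapq.nlargest keeps at most n candidates in memory while it scans.
    """
    return heapq.nlargest(n, tag_counts, key=lambda pair: pair[1])


def merge_partial_tops(partials, n=10):
    """Combine per-reducer top-n lists into one global top-n list.

    Assumes each tag was counted by exactly one reducer, which holds when
    the tag is the partitioning key.
    """
    merged = (pair for partial in partials for pair in partial)
    return top_n(merged, n)


# Two made-up partial results, using counts from the test output above.
partials = [
    [("cs101", 8), ("cs253", 5)],
    [("discussion", 5), ("welcome", 3)],
]
overall = merge_partial_tops(partials, n=3)
```

The merge is exact under the stated assumption because any tag in the global top ten must also be in the top ten of the one reducer that counted it.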

<a id="Study Groups"></a>   
***

###Study Groups    
We might want to help students form study groups. But first we want to see if there are already students on forums that communicate a lot between themselves.   

As the first step for this analysis we have been tasked with writing a mapreduce program that for each forum thread (that is a question node with all its answers and comments) would give us a list of students that have posted there - either asked the question, answered a question or added a comment. If a student posted to that thread several times, they should be added to that list several times as well, to indicate intensity of communication.     

To make sure your code is running properly, we have put together a smaller data set and set of expected outputs for you to use to check your work. Please click [here](https://www.udacity.com/wiki/ud617/local-testing-instructions) to access the instructions to use it.     


In [3]:
%%writefile mapper_studyGroups.py   
#!/usr/bin/python  
#Author: Zach Farmer  
#Purpose: Find potential study groups by measuring frequency of students post per question.  

import sys 
import csv   

def mapper():
    """
        parameter: sys.stdin from forum_node.tsv or the sample test file student_test_posts.csv
        
        output: sys.stdout .tsv lines containing the thread's question node_id and the 
        author_id of the question node and of each child answer and comment post. Will 
        include duplicate author_ids if an author posted more than once.
    """
    
    reader = csv.reader(sys.stdin, delimiter="\t")
    writer = csv.writer(sys.stdout, delimiter="\t", quotechar='"',\
                        quoting = csv.QUOTE_ALL)   
    
    for line in reader:
        
        node_id = line[0]
        node_type = line[5]  
        author_id = line[3] 
        abs_parent_id = line[7]
        
        
        if node_type == "question":
            writer.writerow([node_id,author_id])
        elif node_type == "answer" or node_type == "comment":
            writer.writerow([abs_parent_id, author_id])

if __name__ == "__main__":
    mapper()   

Overwriting mapper_studyGroups.py


In [4]:
%%bash 
#chmod 764 mapper_studyGroups.py 
cat student_test_posts.csv | ./mapper_studyGroups.py | sort 

"111"	"100000066"
"15084"	"100004819"
"2"	"100000005"
"262"	"100004819"
"26454"	"100003192"
"3778"	"100000066"
"3778"	"100008254"
"6011204"	"100010128"
"6011204"	"100020526"
"6011204"	"100071170"
"6011936"	"100004819"
"6011936"	"100019875"
"6011936"	"100071170"
"6012754"	"100004819"
"6012754"	"100012200"
"6015491"	"100004467"
"6015491"	"100005156"
"6015491"	"100071170"
"66193"	"100002460"
"66193"	"100004467"
"66193"	"100007808"
"66193"	"100071170"
"7185"	"100003268"


In [15]:
%%writefile reducer_studyGroups.py   
#!/usr/bin/python  
#Author: Zach Farmer  
#Purpose: Find potential study groups by measuring frequency of students post per question.  

import sys 
import csv  

def reducer(): 
    """
        parameter: sys.stdin from mapper_studyGroups.py
        
        output: sys.stdout .tsv lines containing the post node_id for the top level
        question and all the author_ids associated with posts related to the question.
    """
    
    reader = csv.reader(sys.stdin, delimiter="\t")
    #default quoting, so the list of author_ids is written without quotes
    writer = csv.writer(sys.stdout, delimiter="\t", quotechar='"')   
    
    oldKey = None
    author_ids = []
    for line in reader:
        
        currentKey = line[0] 
        
        if oldKey and oldKey != currentKey: 
            writer.writerow([oldKey,author_ids])
            author_ids = []
            
        oldKey = currentKey 
        author_ids.append(int(line[1]))
            
    #flush the last thread, unless the input was empty
    if oldKey is not None:
        writer.writerow([oldKey,author_ids])

if __name__ == "__main__": 
    reducer()  
        
    

Overwriting reducer_studyGroups.py


In [16]:
%%bash  
#chmod 764 reducer_studyGroups.py  
cat student_test_posts.csv | ./mapper_studyGroups.py | sort | ./reducer_studyGroups.py

111	[100000066]
15084	[100004819]
2	[100000005]
262	[100004819]
26454	[100003192]
3778	[100000066, 100008254]
6011204	[100010128, 100020526, 100071170]
6011936	[100004819, 100019875, 100071170]
6012754	[100004819, 100012200]
6015491	[100004467, 100005156, 100071170]
66193	[100002460, 100004467, 100007808, 100071170]
7185	[100003268]


In [None]:
#Run on the whole dataset
#!cat forum_data/forum_node.tsv | ./mapper_studyGroups.py | sort | ./reducer_studyGroups.py
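
The per-thread lists are only the first step toward finding students who communicate a lot. One hypothetical next step, not part of the assignment, is to count how many threads each pair of students shares. The sketch below works in memory on a few threads from the test output, so it suits a small sample rather than the full dataset; `count_shared_threads` is an assumed name.

```python
from collections import Counter
from itertools import combinations


def count_shared_threads(threads):
    """Count, for every pair of students, the threads both posted in.

    `threads` maps a thread id to the list of author ids that posted there
    (duplicates allowed, as in the reducer output).
    """
    pair_counts = Counter()
    for author_ids in threads.values():
        # Deduplicate so that a pair is counted once per thread.
        for pair in combinations(sorted(set(author_ids)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Three threads taken from the test output above.
threads = {
    "6011204": [100010128, 100020526, 100071170],
    "6011936": [100004819, 100019875, 100071170],
    "6012754": [100004819, 100012200],
}
pairs = count_shared_threads(threads)
```

Pairs with a high count are candidates for a study group. Weighting by the duplicate entries instead of deduplicating would capture intensity within a thread as well.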

<a id="Search Functionality"></a>   
***

###Search Functionality      
**Improving the search functionality and index-building**    
In lesson 4 you built an index which included {"word":"forum entries that include the word"}. This can be used to search efficiently for forum posts that contain a specific word. Can you think of improvements you could make to the process of building an index by using the design patterns you learned in Lesson 4?   

The improvements might include improving the efficiency of the index building by applying some of the MapReduce design patterns or changing the index to include other features from the data.    

**Answer:**    
If we were to keep the same index-building features, I would consider implementing a combiner to sub-aggregate the words and their node_ids on each mapper's machine. This would reduce the traffic passed to the reducers and ideally speed up the entire MapReduce process. If, however, we were willing to change the methodology a little, I might consider a) removing stop words and b) returning only the node_id of the parent question in which the word is nested. This would require more searching on the part of the person using the index but would greatly reduce the size of the MapReduce task. More radically, I might also consider indexing only the tags of the nodes, since I imagine most people use an index to search for a particular topic in order to find posts on the subject, and tags ideally summarize the key topic discussed in a post. On the topic of relevance, we could order the node_id lists by the perceived value of the word in a post. For example, if the word appears many times in a post, then that post's node_id would be given higher priority and listed closer to the beginning of the list. We might also weight more heavily the nodes where the word appears in the tags or title of the post, giving those node_ids preference in the listing order on the basis that their information may be more relevant.
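
The first two ideas, local aggregation and stop-word removal, can be sketched together. The code below aggregates inside the mapper, which is a stand-in for a separate combiner with the same effect on traffic: each word leaves the mapper once, with all of its node ids, instead of once per occurrence. The stop-word list, the tokenizer, and the function name are illustrative assumptions and may differ from what Lesson 4 used.

```python
import re
from collections import defaultdict

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = frozenset(["a", "an", "and", "the", "of", "to", "is", "in"])

# Treat runs of lowercase letters, digits, and apostrophes as words.
# This tokenizer is an assumption made for the sketch.
WORD_RE = re.compile(r"[a-z0-9']+")


def map_with_local_aggregation(posts):
    """Build word -> sorted node ids for one mapper's share of the posts.

    `posts` is an iterable of (node_id, body) pairs.
    """
    index = defaultdict(set)
    for node_id, body in posts:
        for word in WORD_RE.findall(body.lower()):
            if word not in STOP_WORDS:
                index[word].add(node_id)
    return dict((word, sorted(ids)) for word, ids in index.items())


# Two made-up posts.
posts = [(1, "The mapper is slow"), (2, "Is the reducer slow?")]
local_index = map_with_local_aggregation(posts)
# local_index == {"mapper": [1], "slow": [1, 2], "reducer": [2]}
```

The trade-off is that the mapper now holds a dictionary in memory, so on large inputs it would need to flush the dictionary periodically.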