# HW 1: Web log data wrangling

Please also refer to the HW1 [README](https://github.com/berkeley-cs186/course/tree/master/hw1) for the full assignment details.

--------------------------------------------

## Introduction

### Jupyter Notebooks w/ IPython

Jupyter Notebook is a web-based interactive computing system that allows you to mix code and rich text in one document. A notebook consists of a sequence of cells, which can be run using the "Play" button in the toolbar or by hitting Shift-Enter on the keyboard.

In HW1, you will primarily use code cells containing IPython code. You can find a tour and pointers to more documentation in the `Help` menu above.


### The dataset

Let's take a look at the data. These web logs were produced by an Apache web server. Each line represents a request to the server that originally hosted an early viral video from 2002.

In [1]:
import os
DATA_DIR = os.environ['MASTERDIR'] + '/sp16/hw1/'

In [2]:
with open(DATA_DIR + "web_log_small.log") as log_file:
    sample_line = log_file.readline()

print sample_line

62.172.72.131 - - [02/Jan/2003:02:06:41 -0700] "GET /random/html/riaa_hacked/ HTTP/1.0" 200 10564 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0; WWP 17 August 2001)"



This format is called "Combined Log Format", and you can find a description of each of the fields [here](https://httpd.apache.org/docs/1.3/logs.html#common).
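
To make the field layout concrete, here is a minimal sketch that splits a Combined Log Format line into named fields with a regular expression. The pattern and the names `COMBINED_RE` and `parse_combined` are illustrative assumptions, not part of the assignment, and the pattern does not handle escaped quotes inside quoted fields.

```python
import re

# Illustrative pattern for the Combined Log Format; group names are our own choice.
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_combined(line):
    """Return a dict of fields, or None if the line does not match."""
    match = COMBINED_RE.match(line)
    return match.groupdict() if match else None

# The sample line shown above, split across literals for readability.
sample = ('62.172.72.131 - - [02/Jan/2003:02:06:41 -0700] '
          '"GET /random/html/riaa_hacked/ HTTP/1.0" 200 10564 "-" '
          '"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0; WWP 17 August 2001)"')
fields = parse_combined(sample)
print(fields['host'])
print(fields['time'])
```

For this assignment only the host and timestamp fields matter, so a plain `split(" ")` is also sufficient as long as the first five space-separated tokens are well formed.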

Here's another way to view the first line of the dataset. We can run a shell command using the [`!` operator](https://ipython.org/ipython-doc/3/interactive/reference.html#system-shell-access) (a feature of IPython).

In [3]:
!head -1 {DATA_DIR}web_log_small.log

62.172.72.131 - - [02/Jan/2003:02:06:41 -0700] "GET /random/html/riaa_hacked/ HTTP/1.0" 200 10564 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0; WWP 17 August 2001)"
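
The `!` syntax only works inside IPython. In a plain Python script, a rough equivalent is the standard `subprocess` module. The sketch below is an illustration under stated assumptions: `first_line` is a hypothetical helper name, it assumes a Unix-like system where `head` is on the `PATH`, and it runs on a throwaway temporary file rather than the course dataset.

```python
import os
import subprocess
import tempfile

def first_line(path):
    """Return the first line of `path` by running `head -n 1` on it."""
    output = subprocess.check_output(["head", "-n", "1", path])
    # check_output returns bytes; decode them to text.
    return output.decode("utf-8")

# Demonstrate on a small temporary file.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "w") as demo_file:
    demo_file.write("first line\nsecond line\n")

result = first_line(demo_path)
os.remove(demo_path)
print(result)
```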


-----------

## Your Assignment

Fill in the `process_logs` function below to complete the specification in the README. You can add any helper functions you need. You may use any of Python 2's standard libraries available on the instructional machines. You cannot use (and shouldn't need) any external libraries.

Remember, you need to ensure that your code will scale to datasets that are bigger than memory -- no matter how large or skewed the dataset or how much memory is on your test machine. Avoid keeping data structures of unbounded size in memory, since they **won't** scale, e.g.:

- having a list of every line in the dataset
- having a dictionary with a key for every IP address
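
One way to avoid such structures is to consume a sorted stream and remember only the group currently being built. The sketch below illustrates the idea for sessions. It assumes the input is already sorted by IP and then by timestamp, and it uses the 1800-second inactivity cutoff that appears in the solution code in this notebook. The name `sessionize` is hypothetical.

```python
def sessionize(sorted_hits, timeout=1800):
    """Yield (ip, session_length, num_hits) from hits sorted by ip, then time."""
    current_ip = None
    first_ts = last_ts = None
    num_hits = 0
    for ip, ts in sorted_hits:
        # A hit extends the session only if it has the same ip and a small gap.
        same_session = (ip == current_ip and ts - last_ts <= timeout)
        if not same_session:
            if current_ip is not None:
                yield (current_ip, last_ts - first_ts, num_hits)
            current_ip, first_ts, num_hits = ip, ts, 0
        last_ts = ts
        num_hits += 1
    # Emit the final session, if the stream was not empty.
    if current_ip is not None:
        yield (current_ip, last_ts - first_ts, num_hits)

hits = [("1.1.1.1", 0), ("1.1.1.1", 100), ("1.1.1.1", 5000), ("2.2.2.2", 50)]
sessions = list(sessionize(hits))
print(sessions)
```

Because `sessionize` is a generator that keeps only four variables of state, its memory use does not grow with the number of hits or distinct IP addresses. The small list here is only for demonstration; any iterator of `(ip, timestamp)` pairs works.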

Finally, to ensure proper grading, please make sure all of your log processing code (including `import` statements) is between the **BEGIN/END STUDENT CODE** cells. Do not modify or remove either of these cells.

### * BEGIN STUDENT CODE *

In [4]:
import apachetime
import time
import csv

def apache_ts_to_unixtime(ts):
    """
    @param ts - an Apache timestamp string, e.g. '[02/Jan/2003:02:06:41 -0700]'
    @returns int - a Unix timestamp in seconds
    """
    dt = apachetime.apachetime(ts)
    unixtime = time.mktime(dt.timetuple())
    return int(unixtime)
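
The helper above relies on the course-provided `apachetime` module. For readers without that module, the following sketch shows one way to do a similar conversion with only the standard library. It is an independent illustration, not the reference implementation: `apache_ts_to_utc_seconds` is a hypothetical name, it interprets the offset as a UTC offset, and its results may differ from the helper above, which goes through `time.mktime` and the machine's local time zone. For the assignment itself, keep using the provided helper.

```python
import calendar
import time

def apache_ts_to_utc_seconds(ts):
    """Convert '[02/Jan/2003:02:06:41 -0700]' to seconds since the Unix epoch."""
    stamp, offset = ts.strip("[]").split(" ")
    # Parse the date and time; '%b' expects English month abbreviations.
    parsed = time.strptime(stamp, "%d/%b/%Y:%H:%M:%S")
    # The offset looks like '-0700' or '+0130': sign, hours, minutes.
    sign = -1 if offset[0] == "-" else 1
    offset_seconds = sign * (int(offset[1:3]) * 3600 + int(offset[3:5]) * 60)
    # timegm treats the parsed fields as UTC, so subtract the offset to correct.
    return calendar.timegm(parsed) - offset_seconds

print(apache_ts_to_utc_seconds("[02/Jan/2003:02:06:41 -0700]"))
```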

In [5]:
def process_logs(dataset_iter):
    """
    Processes the input stream, and outputs the CSV files described in the README.    
    This is the main entry point for your assignment.
    
    @param dataset_iter - an iterator of Apache log lines.
    """
    # FIX ME
#     with open("hits.csv", "w+") as hits_file:        
#         for i, line in enumerate(dataset_iter):            
#             if i % 1e5 == 0:
#                 print i,
        
#         print "Done."

    
    print "starting!"
    with open("hits.csv", "w+") as hits_file:
        hitswriter = csv.writer(hits_file, delimiter = ',', lineterminator='\n')
        hitswriter.writerow(["ip", "timestamp"])

        for i, line in enumerate(dataset_iter):
            # echo = write args to stdout
            # ! indicates a unix command 
            # $ is like extract value of variable, I think
            # unix just goes awk '{print $1}' for the first thing 
            # ip = !echo $line | awk '{{print $$1}}' 
            split_line = line.split(" ")
            ip = split_line[0]
            # ap_timestamp = !echo $line | awk '{{print $$4, $$5}}'
            ap_timestamp = split_line[3] + " " + split_line[4]
            timestamp = apache_ts_to_unixtime(ap_timestamp)

            hitswriter.writerow([ip, timestamp])
            
    # hits_file is closed automatically when the with block above exits.
              
        
        
        
    # sessions.csv
    tmp_sessions = !mktemp 
    tmp_sessions = tmp_sessions[0] # tmp_sessions ^ is an "ipython SList"
    # !sort -s hits.csv > {tmp_sessions}
    !tail -n +2 hits.csv | sort -s > {tmp_sessions}   # Don't want sort header
    # The logs are already nearly time-sorted, apart from occasional 1-second inversions.
    # Sort by ip. The sort compares lines as text, not by the numeric value of the ip,
    # but that matches the reference output, and sessionizing only needs each ip's
    # hits to end up adjacent, not in numeric ip order.
    # {} expands a Python expression inside an IPython shell command.
    
    with open("sessions.csv", "w+") as sessions_file:
        sessionswriter = csv.writer(sessions_file, delimiter = ',', lineterminator='\n')
        sessionswriter.writerow(["ip", "session_length", "num_hits"])
        
        # Iterate through, start calculating.
        prev_ip, prev_time, curr_length, curr_hits = 0, -1, 0, 1
        flag = False # Not a legit setting yet
        wrote = False
        sorted_f = open(tmp_sessions, "r")
        for line in sorted_f:
            #curr_ip = !line | tr ',' $'\n' | sed -n "1p" 
            #curr_time = !line | tr ',' $'\n' | sed -n "2p"
            
            split_line = line.split(",")
            curr_ip = split_line[0]
            curr_time = int(split_line[1])
            
            if curr_ip == prev_ip and prev_time >= 0 and abs(curr_time - prev_time) <= 1800:
                # part of same session - update
                curr_length = curr_length + abs(curr_time - prev_time) # weird -1 thing
                curr_hits = curr_hits + 1
                wrote = False
            # Else different session - write old one if legit, and update still
            else:
                wrote = False
                if flag:
                    sessionswriter.writerow([prev_ip, curr_length, curr_hits])
                    wrote = True
                curr_length = 0
                curr_hits = 1

#             if curr_ip != "ip" and curr_time != "timestamp":
#                 prev_ip = curr_ip
#                 prev_time = curr_time

            prev_ip = curr_ip
            prev_time = curr_time

            flag = True
        # And I guess, check the last one as well
        if not wrote:
            # Write it
            sessionswriter.writerow([prev_ip, curr_length, curr_hits])
            
    sorted_f.close()  # sessions_file is already closed by the with block; close the temp file handle
    
    
    
    
    
    # session_length_plot.csv
    # I think Varun said a dictionary is okay, because it doesn't increase linearly. 
    
    tmp_sessions_length = !mktemp
    tmp_sessions_length = tmp_sessions_length[0]
    # sort by session_length - column 2
    # !tail -n +2 sessions.csv | sort -s -k 2 > {tmp_sessions_length}
    !tail -n +2 sessions.csv | sort -s -n -t "," -k2 > {tmp_sessions_length}
        
    with open("session_length_plot.csv", "w+") as session_length_file:
        lengthwriter = csv.writer(session_length_file, delimiter = ',', lineterminator='\n')
        lengthwriter.writerow(["left", "right", "count"])
        
        
        
#         # I think I might generate the first one first
#         session_dict = {(0, 2): 0}
#         curr_limit = 2 # Don't forget it's a soft limit!!!
        
        # Urgh I dunno if want to do dictionaries, because have to write in order. 
        # Maybe just do it running, like with previous one. 
        curr_left = 0
        curr_right = 2
        curr_count = 0
        
        legit = False
        
        # Do a similar thing to before, I guess. Need write in order
        sorted_sessions_f = open(tmp_sessions_length, "r")
        for line in sorted_sessions_f:
            
            split_line = line.split(",")
            session_length = int(split_line[1])
            
            # Good bin
            if session_length >= curr_left and session_length < curr_right:
                curr_count += 1
                # print "went in good bin, session_length = " + str(session_length)
                
            # Bad bin
            else:
                # Write old one
                if legit and curr_count != 0:
                    #print "wrote"
                    lengthwriter.writerow([curr_left, curr_right, curr_count])
                
                # And then generate new one - for loop until it's in range, I guess
                while not (session_length >= curr_left and session_length < curr_right):
                    #print "session length = " + str(session_length)
                    curr_left = curr_right
                    curr_right = curr_right * 2
                    #print curr_right
                    
                curr_count = 1
                
            legit = True
        
        # Don't forget check last one!
        # I think always need to do a final write 
        # Either alone or together, need to write it
        lengthwriter.writerow([curr_left, curr_right, curr_count])
        
        # And then do it as it comes
        
        # And don't forget to clear it out afterwards! 
        
        
    sorted_sessions_f.close()  # session_length_file is already closed by the with block
    
            
  
        
    

### * END STUDENT CODE *

------------------------
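
The `session_length_plot.csv` step in the student code above groups session lengths into bins that double in width: `[0, 2)`, `[2, 4)`, `[4, 8)`, and so on. The sketch below restates that idea in isolation. The names `doubling_bin` and `histogram` are hypothetical, the input is assumed to be in ascending order, and only non-empty bins are emitted, so it is an illustration of the technique rather than a replacement for the code above.

```python
def doubling_bin(length):
    """Return (left, right) with left <= length < right, doubling from [0, 2)."""
    left, right = 0, 2
    while length >= right:
        left, right = right, right * 2
    return (left, right)

def histogram(sorted_lengths):
    """Yield (left, right, count) for each non-empty bin, in ascending order."""
    current, count = None, 0
    for length in sorted_lengths:
        bin_ = doubling_bin(length)
        if bin_ != current:
            # The stream moved on to a new bin; emit the finished one.
            if current is not None:
                yield current + (count,)
            current, count = bin_, 0
        count += 1
    if current is not None:
        yield current + (count,)

rows = list(histogram([0, 1, 1, 3, 9, 1500]))
print(rows)
```

Because the input is sorted, each bin is complete as soon as a larger value arrives, so only one bin and one counter are held in memory at a time.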


In [6]:
def process_logs_small():
    """
    Runs the process_logs function with the small dataset (186 MB).
    """        
    with open(DATA_DIR + "web_log_small.log") as log_file:
        process_logs(log_file)

In [7]:
%time process_logs_small()

starting!
CPU times: user 19.6 s, sys: 696 ms, total: 20.3 s
Wall time: 21.4 s


In [8]:
import zipfile

def process_logs_large():
    """
    Runs the process_logs function on the full dataset. The code below
    performs a streaming unzip of the compressed dataset, which is 158MB.
    This saves the 1.6GB of disk space needed to unzip this file onto disk.
    """
    with zipfile.ZipFile(DATA_DIR + "web_log_large.zip") as z:
        fname = z.filelist[0].filename
        f = z.open(fname)
        process_logs(f)
        f.close()

In [9]:
%time process_logs_large()

starting!
CPU times: user 2min 14s, sys: 4.58 s, total: 2min 18s
Wall time: 2min 30s


---------------

# Testing

As mentioned in the README, we provide reference output only for the small dataset. `diff_against_reference()` produces a `.diff` file if there's a difference between your output and the reference output.

If you're unfamiliar with the format of `diff`'s output, you can read about it [here](https://en.wikipedia.org/wiki/Diff_utility#Usage).

There are other diff utilities which produce colored/side-by-side output, making it easier to see differences. If you're interested, try:

```
$ vimdiff hits.csv ~cs186/sp16/hw1/ref_output_small/hits.csv
OR
$ git diff hits.csv ~cs186/sp16/hw1/ref_output_small/hits.csv
```

In [10]:
import os

ref_output_dir = DATA_DIR + "ref_output_small/"

def _diff_helper(f, unordered=False):
    """
    @param f (str) - filename to diff with reference output
    @param unordered (bool) - whether the ordering of the lines matters
    """
    if not os.path.isfile(f):
        print "FAIL - {} does not exist.".format(f)
        return
    
    if unordered:
        tmp1 = !mktemp
        tmp1 = tmp1[0]
        !sort {f} > {tmp1}
        !sort {ref_output_dir + f} | diff {tmp1} - > {f}.diff
    else:
        !diff {f} {ref_output_dir + f} > {f}.diff
    
    success = _exit_code == 0
    if success:
        !rm {f}.diff
        print "PASS - {} matched reference output.".format(f)
    else:
        print "FAIL - {} did not match reference output. See {}.diff.".format(f, f)
        

def diff_against_reference():
    """
    Compares the output files in the current directory with the reference output.
    If there is a difference, writes a ".diff" file, e.g. hits.csv.diff.
    """ 
    _diff_helper("hits.csv")
    _diff_helper("sessions.csv", unordered=True)
    _diff_helper("session_length_plot.csv")

In [11]:
process_logs_small()
diff_against_reference()

starting!
PASS - hits.csv matched reference output.
PASS - sessions.csv matched reference output.
PASS - session_length_plot.csv matched reference output.



### Testing Memory Usage

For additional testing, we've included a script which:
 - (1) makes sure all of your log processing code is between the BEGIN/END STUDENT CODE CELLS above, so it will work with our autograder
 - (2) runs your code with a memory cap of 1MB. If you see a `MemoryError`, it's a sign your code is not doing appropriate streaming and/or divide-and-conquer!
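
As an illustration of the divide-and-conquer idea mentioned above, here is a minimal external merge sort sketch: it sorts fixed-size chunks in memory, spills each to a temporary file, and then merges the sorted files lazily with `heapq.merge`. The name `external_sort` and the `chunk_size` default are assumptions for this example. It expects every input line to end with a newline, it opens one temporary file per chunk, and it makes no claim about fitting under any particular memory cap. The Unix `sort` command used in the solution above performs a similar external sort for you.

```python
import heapq
import itertools
import tempfile

def external_sort(lines, chunk_size=100000):
    """Yield newline-terminated `lines` in sorted order using bounded memory."""
    chunk_files = []
    iterator = iter(lines)
    while True:
        # Hold at most `chunk_size` lines in memory at a time.
        chunk = sorted(itertools.islice(iterator, chunk_size))
        if not chunk:
            break
        spill = tempfile.TemporaryFile(mode="w+")
        spill.writelines(chunk)
        spill.seek(0)
        chunk_files.append(spill)
    # heapq.merge pulls lines lazily, about one buffered line per chunk file.
    for line in heapq.merge(*chunk_files):
        yield line
    for spill in chunk_files:
        spill.close()

data = ["pear\n", "apple\n", "fig\n", "kiwi\n", "date\n"]
result = list(external_sort(data, chunk_size=2))
print(result)
```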
 
Make sure to save your notebook (`File > Save and Checkpoint`) before running the next cell.

In [12]:
!bash test_memory_usage.sh

[NbConvertApp] Converting notebook hw1.ipynb to python
Running process_logs_large()
starting!
Memory Test Done.
