https://github.com/jimwaldo/APCOMP221Code

# AC 221 - Problem Set 3

## Problem 1

### Finding Duplicates

We know from Problem 2 that there are no incomplete or malformed lines. That means we can use pandas to find duplicates, and our only concern is that the two fields we're interested in, `course_id` and `user_id`, are read correctly. We can ensure that by enforcing their data types.

In [2]:
import pandas as pd
import numpy as np
import csv

In [4]:
df = pd.read_csv('sample_set.csv',
                 parse_dates=['cert_created_date','cert_modified_date',
                              'start_time', 'first_event', 'last_event'],
                 # np.str and np.bool were removed from NumPy; the builtin
                 # types are the equivalent dtypes.
                 dtype={
                     'course_id': str,
                     'user_id': np.int64,
                     'registered': bool,
                     'explored': bool,
                     'un_developing_nation': str},
                 verbose=False)



In [10]:
df.shape

(1100005, 91)

In [15]:
n_duplicates = df.duplicated(subset=['course_id', 'user_id'], keep='first').sum()
n_duplicates

962555

There are 962,555 duplicate rows. This count includes all instances of each duplicated row except the first, so it is a count of how many rows would be removed if we were to remove duplicates. We can confirm this by actually dropping the duplicates and counting the rows.

In [16]:
unique_rows = df.drop_duplicates(subset=['course_id', 'user_id'], keep='first')
unique_rows.shape

(137450, 91)

In [17]:
n_duplicates + unique_rows.shape[0] == df.shape[0]

True

Our numbers square.
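To make the meaning of `keep='first'` concrete, here is a minimal sketch on a toy DataFrame. The data and the `grade` column values are made up for illustration; only the `duplicated` and `drop_duplicates` calls mirror what we did above.

```python
import pandas as pd

# Toy data: the (course_id, user_id) pair ("A", 1) appears three times.
toy = pd.DataFrame({
    "course_id": ["A", "A", "A", "B", "B"],
    "user_id":   [1,   1,   1,   1,   2],
    "grade":     [0.1, 0.2, 0.3, 0.4, 0.5],
})

key = ["course_id", "user_id"]

# keep='first' marks every repeat of a key except its first occurrence.
n_dups = int(toy.duplicated(subset=key, keep="first").sum())

# Dropping with the same settings leaves one row per distinct key.
n_unique = len(toy.drop_duplicates(subset=key, keep="first"))

print(n_dups, n_unique, len(toy))
```

Here two rows are flagged and three survive, so the two counts add up to the five rows we started with, which is the same identity we checked on the real data.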

**TODO** check that this squares with [Jim's code](https://github.com/jimwaldo/APCOMP221Code/blob/master/find_duplicates.py).

In [None]:
"""
Created on 2019-03-14
@author waldo
"""

import csv

def find_dups(csv_in, csv_out):
    total_lines = 0
    unique_lines = 0
    dup_lines = 0
    lines_seen = set()

    for l in csv_in:
        total_lines += 1
        key = get_key(l)
        if key not in lines_seen:
            lines_seen.add(key)
            csv_out.writerow(l)
            unique_lines += 1
        else:
            dup_lines += 1
    return total_lines, unique_lines, dup_lines
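Note that `get_key` is not defined in the excerpt above. As a rough sketch of how the function could be exercised, the block below supplies a hypothetical `get_key` that assumes `course_id` and `user_id` are the first two columns (as they are in the header of `sample_set.csv`) and runs `find_dups` on a small in-memory CSV. The toy rows are made up, and the real helper in the repository may differ.

```python
import csv
import io

def get_key(row):
    # Hypothetical helper, assumed for illustration: the key is the
    # (course_id, user_id) pair taken from the first two columns.
    return (row[0], row[1])

def find_dups(csv_in, csv_out):
    # Same logic as the excerpt above, repeated so this block runs on its own.
    total_lines = 0
    unique_lines = 0
    dup_lines = 0
    lines_seen = set()
    for l in csv_in:
        total_lines += 1
        key = get_key(l)
        if key not in lines_seen:
            lines_seen.add(key)
            csv_out.writerow(l)
            unique_lines += 1
        else:
            dup_lines += 1
    return total_lines, unique_lines, dup_lines

# Toy input: the pair ("A", "1") appears twice.
raw = "course_id,user_id,grade\nA,1,0.1\nA,1,0.2\nB,1,0.3\n"
reader = csv.reader(io.StringIO(raw, newline=""))
header = next(reader)  # skip the header so it is not counted as a data row

out_buf = io.StringIO()
writer = csv.writer(out_buf)
writer.writerow(header)

total, unique, dups = find_dups(reader, writer)
print(total, unique, dups)
```

Like `duplicated(keep='first')`, this keeps the first row seen for each key and counts every later repeat as a duplicate, so the two approaches should agree as long as the keys are compared the same way.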

### Repairing Missing Values

**TODO**

## Problem 2

There are numerous ways a line could be corrupt. We will consider a line corrupt if it either fails to parse as a CSV record or does not have the number of fields the header specifies (a truncated line, for example, will have fewer).

First, let's see how many newline characters are in the file. This is an upper bound on the number of rows our CSV should have (because fields can contain newline characters).

In [18]:
!wc -l sample_set.csv

1100006 sample_set.csv


There are 1,100,006 lines in the CSV. We can see below that the first line is a header and the second line is where data begins.

In [19]:
!head -n 2 sample_set.csv

course_id,user_id,registered,viewed,explored,certified,completed,ip,cc_by_ip,countryLabel,continent,city,region,subdivision,postalCode,un_major_region,un_economic_group,un_developing_nation,un_special_region,latitude,longitude,LoE,YoB,gender,grade,passing_grade,start_time,first_event,last_event,nevents,ndays_act,nplay_video,nchapters,nforum_posts,nforum_votes,nforum_endorsed,nforum_threads,nforum_comments,nforum_pinned,roles,nprogcheck,nproblem_check,nforum_events,mode,is_active,cert_created_date,cert_modified_date,cert_status,verified_enroll_time,verified_unenroll_time,profile_country,y1_anomalous,email_domain,language_brwsr,language_brwsr_country,language_brwsr_sec,language_brwsr_sec_country,language_brwsr_code,language_brwsr_subcode,language_brwsr_sec_code,language_brwsr_sec_subcode,language_brwsr_nevents,language_brwsr_ndiff,language,language_download,language_nevents,language_ndiff,ntranscript,nshow_answer,nvideo,nvideos_unique_viewed,nvideos_total_watched,nseq_goto,nseek_video,np

So we'd expect at most 1,100,005 rows of data.
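To illustrate why the newline count is only an upper bound, here is a small sketch with made-up data in which one quoted field contains a newline. The column names and values are invented for the example.

```python
import csv
import io

# Three records (header + 2 data rows), but one field spans two physical lines.
raw = 'id,comment\n1,"line one\nline two"\n2,plain\n'

# This is what `wc -l` would report: the number of newline characters.
n_newlines = raw.count("\n")

# The csv module treats the quoted newline as part of the field value.
rows = list(csv.reader(io.StringIO(raw, newline="")))
n_data_rows = len(rows) - 1  # subtract the header

print(n_newlines, n_data_rows)
```

The text has four newlines but only two data rows, so subtracting one for the header from the newline count overestimates the number of rows whenever a field contains an embedded newline.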

We already have some indication from Problem 1 that there is exactly this number of rows when pandas parses the file for us. Let's double-check. For our second pass at counting the number of good rows, we will use an adapted version of [the code](https://github.com/jimwaldo/APCOMP221Code/blob/master/clean_csv.py) Professor Waldo showed in class.

In [31]:
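import csv

def count_line_status(csv_in):
    """Count total, good, and bad rows from a csv.reader (header excluded)."""
    total_lines = 0
    good_lines = 0
    bad_lines = 0

    # The header defines how many fields a good row must have.
    header = next(csv_in)
    l_len = len(header)
    while True:
        try:
            l = next(csv_in)
        except StopIteration:
            # End of input: nothing more to count.
            break
        except (csv.Error, UnicodeDecodeError):
            # The row could not be read or parsed; count it as bad and go on.
            total_lines += 1
            bad_lines += 1
            continue
        total_lines += 1
        if len(l) == l_len:
            good_lines += 1
        else:
            bad_lines += 1
    return total_lines, good_lines, bad_lines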

In [32]:
with open('sample_set.csv') as f:
    reader = csv.reader(f)
    total_lines, good_lines, bad_lines = count_line_status(reader)
print('Total lines = ' + str(total_lines), 'Good lines = ' + str(good_lines),
          'Bad lines = ' + str(bad_lines))

Total lines = 1100005 Good lines = 1100005 Bad lines = 0


This gives us further evidence there are zero corrupt lines. Every row parses correctly and every row has exactly 91 fields. Let's do the same thing one more time, but with the CSV parser set to strict, so that malformed input raises an exception instead of being silently accepted.

In [33]:
class MyDialect(csv.Dialect):
    strict = True
    skipinitialspace = True
    quoting = csv.QUOTE_ALL
    delimiter = ','
    quotechar = '"'
    lineterminator = '\n'
    
with open('sample_set.csv') as f:
    reader = csv.reader(f, MyDialect())
    total_lines, good_lines, bad_lines = count_line_status(reader)
    
print('Total lines = ' + str(total_lines), 'Good lines = ' + str(good_lines),
          'Bad lines = ' + str(bad_lines))

Total lines = 1100005 Good lines = 1100005 Bad lines = 0


We're told again there are zero bad lines. As one final check, let's count the number of commas in each line. A well-structured line should have 90 commas separating the 91 fields. If a line were truncated, it would likely have fewer than 90. This is not a rigorous test, because a line may contain commas that are part of a field value rather than field delimiters, but it will likely provide some indication.

In [35]:
import collections
n_commas_per_line = []
with open('sample_set.csv') as f:
    for line in f:
        n_commas = line.count(',')
        n_commas_per_line.append(n_commas)
collections.Counter(n_commas_per_line)

Counter({90: 1096357, 91: 3624, 92: 25})

We see that every line has at least 90 commas, so between this result and those above, it's quite likely there are no truncated or unparsable lines. Since this makes answering the rest of the problem less meaningful, I asked on Piazza and was given another dataset, one known to contain corrupt lines, against which to test the above methods. Let's see if they work.

In [36]:
with open('sample_set2.csv') as f:
    reader = csv.reader(f)
    total_lines, good_lines, bad_lines = count_line_status(reader)
print('Total lines = ' + str(total_lines), 'Good lines = ' + str(good_lines),
          'Bad lines = ' + str(bad_lines))

Total lines = 1000000 Good lines = 923993 Bad lines = 76007


Bad lines are detected here, so the method does flag corrupt lines when they exist. That supports the conclusion that there are no bad lines in the original file.
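As a smaller sanity check of the same idea, here is a minimal sketch that runs a field-count check on a hand-made CSV with two deliberately broken rows: one truncated row and one with a stray character after a closing quote. The helper `count_rows` and the toy data are written for this example (they are not the function or the dataset used above), and the sketch also shows how the `strict` parser setting changes what gets flagged.

```python
import csv
import io

def count_rows(reader):
    # Illustrative helper: returns (good, bad) counts, excluding the header.
    header = next(reader)
    good = bad = 0
    while True:
        try:
            row = next(reader)
        except StopIteration:
            break
        except csv.Error:
            # Only raised for malformed quoting when the reader is strict.
            bad += 1
            continue
        if len(row) == len(header):
            good += 1
        else:
            bad += 1
    return good, bad

# Row 2 is truncated (2 fields); row 3 has a stray 'x' after a closing quote.
raw = 'a,b,c\n1,2,3\n4,5\n6,"7"x,8\n9,10,11\n'

lenient = count_rows(csv.reader(io.StringIO(raw, newline="")))
strict = count_rows(csv.reader(io.StringIO(raw, newline=""), strict=True))

print("lenient:", lenient)
print("strict: ", strict)
```

With the default settings only the truncated row is flagged, because the parser quietly accepts the stray character and the row still has three fields. With `strict=True` the malformed quoting raises `csv.Error`, so both broken rows are counted as bad.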

**TODO** address rest of question.

## Problem 3