## Problem Set 2


### Problem 1

Clean up the file. This means getting rid of duplicates; you can assume that no student can register for the same course more than once. How many duplicate records do you find? Some of the fields have bad or missing values; repair those that you can (and explain what a repair means).

### Import Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import hashlib
from functools import reduce

In [None]:
NA_FILL_VALUE = 0

In [None]:
columnNames = []
with open('dirty_sample_small_header.csv', 'r') as headerFile:
    headerReader = csv.reader(headerFile, delimiter=',')
    for row in headerReader:
        columnNames.append(row[1])
        
numCols = len(columnNames)

In [None]:
invalidCols = 0; duplicateRows = 0; keptRows = 0; totalRows = 0
onHeader = True
hashes = set()
with open('dirty_sample_small.csv', 'r') as dataFile:
    with open('valid_rows_sample_small.csv', 'w') as outFile:
        dataReader = csv.reader(dataFile, delimiter=',')
        outWriter = csv.writer(outFile, delimiter = ',')
        for row in dataReader:
            # Skip the header line
            if onHeader:
                # Because the column "viewed" has data that is logically inconsistent, drop it.  
                # Some entries have "Registered=True" and "Viewed=False". The values seem to actually be more similar
                # to "explored" than viewed.  Additionally including one of viewed, explored, certified, or completed
                # makes the column headers not align with the data so we need to drop one of them. 
                row = list(filter(lambda x: x != "viewed", row))
                outWriter.writerow(row)
                onHeader = False; continue
        
            totalRows += 1
            # Ignore rows with incorrect number of columns
            if len(row) != numCols:
                invalidCols += 1
                continue 
            # Get the md5 hash of each row to determine whether the row is duplicated or not
            else:
                m = hashlib.md5()
                
                # Two rows are duplicates if they have the same (course_id, student_id)
                # fullRowStr = reduce((lambda x, y: x + y), row).encode('utf-8')
                # Join with a separator so that ("ab", "c") and ("a", "bc") do not produce the same key
                rowStr = (row[0] + '\x1f' + row[1]).encode('utf-8')
                m.update(rowStr)
                hashedRow = m.hexdigest()
                if hashedRow in hashes:
                    duplicateRows += 1
                    continue 
                # If it's a new row, write it to the cleaned dataset valid_rows...
                else:            
                    keptRows += 1
                    hashes.add(hashedRow)
                    # Also ignore the data in the very last column as it does not seem to correspond to any of the 
                    # columns in this area of the dataset
                    outWriter.writerow(row[:-1])
print("Dropped: %d   Duplicates: %d   Kept: %d   Total: %d" % (invalidCols, duplicateRows, keptRows, totalRows))

# If we only drop duplicates that match on all fields these are the results.   
# Dropped: 15708   Duplicates: 595797   Kept: 49981   Total: 661486

In [None]:
invalidCols + duplicateRows + keptRows == totalRows

In [None]:
colToType = {
    "registered" : bool, 
    "explored" : bool,
    "certified" : bool,
    "completed" : bool,
    "latitude" : float, 
    "longitude" : float, 
    "YoB" : int, 
    "start_time" : "date",
    "first_event" : "date",
    "last_event" : "date", 
    "nevents" : int, 
    "ndays_act" : int, 
    "nplay_video" : int,
    "nchapters" : int, 
    "nforum_posts" : int, 
    "nforum_votes" : int, 
    "nforum_endorsed" : int, 
    "nforum_threads" : int, 
    "nforum_comments" : int, 
    "nforum_pinned" : int, 
    "nprogcheck" : int, 
    "nproblem_check" : int, 
    "nforum_events" : int, 
    # encoded as "0" or "1" (not "False" or "True"), need to convert to bool after converting to int
    "is_active" : int, 
    "cert_created_date" : "date", 
    "cert_modified_date" : "date"    
}

In [None]:
def convertToBool(x):
    if x == 'True': return True
    else: return False

In [None]:
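# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0
df_test = pd.read_csv("valid_rows_sample_small.csv", sep=',', engine='python', on_bad_lines='skip', dtype='unicode')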

# Use Pandas drop_duplicates() as evidence that dataset is deduplicated
print("Deduplicated Valid Rows: %d\tFully Deduplicated: %r" 
      % (len(df_test), len(df_test) == len(df_test.drop_duplicates())))
print("Columns: %d" % len(df_test.columns.values))

# Convert types of columns
for colName, colType in colToType.items():
    if colType == "date":
        df_test[colName] = pd.to_datetime(df_test[colName])
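    elif colType == int:
        # Missing cells are read as NaN (not the string 'nan'), so coerce to numeric and fill before casting.
        # Note that errors='coerce' also turns unparseable strings into NaN instead of raising.
        df_test[colName] = pd.to_numeric(df_test[colName], errors='coerce').fillna(NA_FILL_VALUE).astype(int)
    elif colType == float:
        df_test[colName] = pd.to_numeric(df_test[colName], errors='coerce')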
    elif colType == bool:
        df_test[colName] = df_test[colName].apply(lambda x: convertToBool(x))

# special case for is_active
df_test.is_active = df_test.is_active.astype(bool)

Some fields may hold values of incompatible types. This can happen when no data is stored for a variable, when a user did not complete the course or course registration, or when a column mixes several data types. For example, a string representation of an age cannot be compared to a number. If a user entered N/A or left a field blank, the missing value may show up under several spellings, such as NA, na, or NaN.
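As a minimal sketch of how the different spellings of a missing value could be collapsed into one representation, the helpers below use plain Python. The names `MISSING_TOKENS`, `normalize_missing`, and `parse_int` are hypothetical and not part of the notebook, and the token list is an assumption that should be adjusted after inspecting each column with `unique()`.

```python
# Assumed spellings of a missing value; extend after inspecting the data.
MISSING_TOKENS = {"", "na", "n/a", "nan", "none", "null"}


def normalize_missing(value):
    """Return None for any missing-value spelling, else the stripped string."""
    if value is None:
        return None
    text = str(value).strip()
    if text.lower() in MISSING_TOKENS:
        return None
    return text


def parse_int(value, fill=0):
    """Parse an integer count, falling back to `fill` when missing or malformed."""
    text = normalize_missing(value)
    if text is None:
        return fill
    try:
        # Going through float accepts strings such as "3.0"
        return int(float(text))
    except (ValueError, OverflowError):
        return fill


print(normalize_missing(" N/A "), parse_int("3.0"), parse_int("na"))
```

Mapping every spelling to `None` first keeps the type conversion separate from the question of what counts as missing.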

In [None]:
df_test['LoE'].unique()

In [None]:
# df_test_no_dup was never defined; use the deduplicated frame directly
df_test_unique = df_test.drop_duplicates()

In [None]:
# Drop columns that are entirely NA, then fill the remaining missing values
original_columns = set(df_test_unique.columns.values)
df = df_test_unique.dropna(axis = 1, how = 'all').fillna(NA_FILL_VALUE)
new_columns = set(df.columns.values)
print("Removed columns", original_columns - new_columns)

In order to repair bad or missing values, we must understand which columns these values come from, which type all the data in that column should be represented with, and how empty values should be coded.
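One way to find out which columns the bad or missing values come from is to count missing cells per column before deciding on a repair. The sketch below does this with the standard `csv` module on a small made-up sample. `YoB` and `LoE` are columns from the notebook, while the other column names, the sample rows, and the `MISSING` set are assumptions for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; values are made up for illustration.
SAMPLE = """course_id,user_id,YoB,LoE
C1,u1,1990,Bachelors
C1,u2,,NA
C2,u3,nan,Masters
"""

MISSING = {"", "na", "n/a", "nan"}  # assumed spellings of a missing value


def count_missing(lines):
    """Return a Counter mapping column name -> number of missing cells."""
    missing = Counter()
    for record in csv.DictReader(lines):
        for column, value in record.items():
            if column is None:
                # Extra cells beyond the header are collected under the key None
                continue
            if value is None or value.strip().lower() in MISSING:
                missing[column] += 1
    return missing


missing_per_column = count_missing(io.StringIO(SAMPLE))
print(dict(missing_per_column))
```

Columns with many missing cells are candidates for a fill value or for removal, while columns with only a few can often be repaired row by row.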

In [None]:
df_test_repair = df_test_unique
df_test_repair[df_test_repair.notnull()]

### Problem 2

Some lines may be corrupt; get rid of those or mark them in some way to show that they are not good lines. How many corrupt lines are there? Does the count of corrupt lines change if you get rid of them before getting rid of the duplicate records? What difference might this make to the remaining data set?

We treat a line with no user id or no course number as corrupt. The count of corrupt lines is the same whether we remove them before or after removing the duplicate records.
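The definition above can be sketched as a small predicate that marks corrupt lines instead of dropping them. The function names and the sample rows are hypothetical. The column positions 0 and 1 are an assumption based on the `row[0] + row[1]` key used for deduplication earlier in the notebook.

```python
def is_corrupt(row, course_idx=0, user_idx=1):
    """True when the row lacks a course number or a user id."""
    if len(row) <= max(course_idx, user_idx):
        # Truncated line: one of the two key fields is absent
        return True
    return not row[course_idx].strip() or not row[user_idx].strip()


def mark_corrupt(rows):
    """Pair each row with a corrupt flag so bad lines stay visible."""
    return [(row, is_corrupt(row)) for row in rows]


# Hypothetical rows laid out as (course_id, user_id, YoB)
rows = [
    ["C1", "u1", "1990"],
    ["", "u2", "1985"],   # no course number
    ["C2", "", "1979"],   # no user id
    ["C3"],               # truncated line
]
marked = mark_corrupt(rows)
corrupt_count = sum(flag for _, flag in marked)
print(corrupt_count)
```

Keeping the flag alongside each row makes it possible to count corrupt lines at any stage of the pipeline without losing the original data.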

### Problem 3

What are some possible sources of bias in this data set? Is there anything unusual about the data set that you should flag?

Possible sources of bias include students who registered more than once for the same class, students who registered for the same class under different names, and students who enrolled but never participated in the class. Unusual features worth flagging are the enormous number of duplicate rows and the many unreliable birth dates. In addition, a student who did not pass because of failed assignments and a student who did not pass because they never engaged with the course beyond registration are both recorded simply as not passing.
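To illustrate the last point, the sketch below separates the two kinds of non-passing students using the `registered`, `certified`, and `nevents` columns. The function `outcome` and its labels are hypothetical, and treating `nevents == 0` as "never engaged" is an assumption about how the data was recorded, not something the dataset documents.

```python
def outcome(registered, certified, nevents):
    """Classify a registration into a coarse outcome label."""
    if not registered:
        return "not registered"
    if certified:
        return "certified"
    if nevents == 0:
        # Assumption: zero recorded events means no activity after registration
        return "never engaged"
    return "engaged, not certified"


# Hypothetical records: (registered, certified, nevents)
records = [(True, False, 0), (True, False, 42), (True, True, 310)]
labels = [outcome(*record) for record in records]
print(labels)
```

Reporting these groups separately avoids reading "not certified" as a single population when comparing completion rates.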