## Problem Set 2


### Problem 1

Clean up the file. This means getting rid of duplicates; you can assume that no student can register for the same course more than once. How many duplicate records do you find? Some of the fields have bad or missing values; repair those that you can (and explain what a repair means).

### Import Data

In [50]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import hashlib
from functools import reduce

In [35]:
NA_FILL_VALUE = 0

In [36]:
columnNames = []
with open('dirty_sample_small_header.csv', 'r') as headerFile:
    headerReader = csv.reader(headerFile, delimiter=',')
    for row in headerReader:
        columnNames.append(row[1])
        
numCols = len(columnNames)

In [105]:
invalidCols = duplicateRows = keptRows = totalRows = 0
onHeader = True
hashes = set()
with open('dirty_sample_small.csv', 'r') as dataFile:
    with open('valid_rows_sample_small.csv', 'w') as outFile:
        dataReader = csv.reader(dataFile, delimiter=',')
        outWriter = csv.writer(outFile, delimiter = ',')
        for row in dataReader:
            # Copy the header line to the output unchanged, then move on to the data rows
            if onHeader:
                outWriter.writerow(row)
                onHeader = False
                continue
        
            totalRows += 1
            # Ignore rows with incorrect number of columns
            if len(row) != numCols:
                invalidCols += 1
                continue 
            # Get the md5 hash of each row to determine whether the row is duplicated or not
            else:
                m = hashlib.md5()
                # Join the fields with a separator so that ['ab', 'c'] and ['a', 'bc'] hash differently
                rowStr = '\x1f'.join(row).encode('utf-8')
                m.update(rowStr)
                hashedRow = m.hexdigest()
                if hashedRow in hashes:
                    duplicateRows += 1
                    continue 
                # If it's a new row, write it to the cleaned dataset valid_rows...
                else:            
                    keptRows += 1
                    hashes.add(hashedRow)
                    outWriter.writerow(row)
print("Dropped: %d   Duplicates: %d   Kept: %d   Total: %d" % (invalidCols, duplicateRows, keptRows, totalRows))

Dropped: 15708   Duplicates: 595797   Kept: 49981   Total: 661486


In [106]:
invalidCols + duplicateRows + keptRows == totalRows

True

In [107]:
# Skip malformed lines (pandas >= 1.3; older versions spell this error_bad_lines=False)
df_test = pd.read_csv("valid_rows_sample_small.csv", sep=',', engine='python', on_bad_lines='skip', dtype='unicode')

In [108]:
# Use Pandas drop_duplicates() as evidence that dataset is deduplicated
print("Deduplicated Valid Rows: %d\tFully Deduplicated: %r" 
      % (len(df_test), len(df_test) == len(df_test.drop_duplicates())))

Deduplicated Valid Rows: 49981	Fully Deduplicated: True
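
The check above confirms that no fully identical rows remain. Because the problem states that a student cannot register for the same course more than once, a stricter check would look for repeated `(course_id, user_id)` pairs, which can occur even when other fields differ. The following is a minimal sketch on a toy frame standing in for `df_test`; the toy values and the choice to keep the first row per pair are assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for df_test (the real frame has 48 columns).
toy = pd.DataFrame({
    "course_id": ["HarvardX/PH525.1x/1T2018"] * 4,
    "user_id":   ["7940", "7940", "21193", "28454"],
    "viewed":    ["False", "True", "True", "False"],
})

key = ["course_id", "user_id"]

# Rows whose (course_id, user_id) pair appears more than once,
# even though the rows are not identical in every field.
repeated = toy[toy.duplicated(subset=key, keep=False)]

# Keep one row per registration; which row to keep is a policy choice.
one_per_registration = toy.drop_duplicates(subset=key, keep="first")

print(len(repeated), len(one_per_registration))
```

On the real data, `repeated` would be worth inspecting by hand before deciding which copy of each registration to keep.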


In [109]:
len(df_test.columns.values)

48

In [111]:
df_test.head(10)

Unnamed: 0,course_id,user_id,registered,viewed,explored,certified,completed,ip,cc_by_ip,countryLabel,...,nforum_pinned,roles,nprogcheck,nproblem_check,nforum_events,mode,is_active,cert_created_date,cert_modified_date,cert_status
0,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,Student,0,0,0,audit,1,,,,
1,HarvardX/PH525.1x/1T2018,21193,True,True,False,False,103.108.88.2,,,,...,Student,0,142,0,audit,0,,,,
2,HarvardX/PH525.1x/1T2018,27938,True,True,False,False,179.214.111.130,BR,Brazil,Americas,...,Student,0,207,0,audit,0,,,,
3,HarvardX/PH525.1x/1T2018,28454,True,False,False,False,177.18.230.216,BR,Brazil,Americas,...,Student,0,0,0,audit,1,,,,
4,HarvardX/PH525.1x/1T2018,43100,False,,False,False,213.249.56.36,GR,Greece,Europe,...,Student,0,0,0,audit,1,,,,
5,HarvardX/PH525.1x/1T2018,44413,False,,False,False,95.91.213.204,DE,Germany,Europe,...,Student,0,0,0,audit,1,,,,
6,HarvardX/PH525.1x/1T2018,45324,False,,False,False,99.49.42.33,US,United States,Americas,...,Student,0,0,0,audit,1,,,,
7,HarvardX/PH525.1x/1T2018,45875,True,False,False,False,190.62.244.108,SV,El Salvador,Americas,...,Student,0,0,0,audit,0,,,,
8,HarvardX/PH525.1x/1T2018,52081,False,,False,False,165.225.104.86,US,United States,Americas,...,Student,0,0,0,audit,1,,,,
9,HarvardX/PH525.1x/1T2018,54513,True,True,False,False,71.212.103.239,US,United States,Americas,...,Student,0,520,0,audit,1,,,,


In [28]:
df_test.index = df_test.user_id
df_test = df_test.drop('user_id', axis = 1)

AttributeError: 'DataFrame' object has no attribute 'user_id'
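
The `AttributeError` above is consistent with the cell having been run a second time, after `user_id` had already been moved into the index and dropped as a column; that reading is an assumption based on the out-of-order cell number. A minimal sketch of a version that can be re-run safely is shown here on a toy frame; `index_by_user_id` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for df_test.
toy = pd.DataFrame({"user_id": ["7940", "21193"], "viewed": ["False", "True"]})

def index_by_user_id(frame):
    """Move user_id into the index; do nothing if that has already happened."""
    if "user_id" in frame.columns:
        return frame.set_index("user_id")
    return frame

once = index_by_user_id(toy)
twice = index_by_user_id(once)  # second call is a no-op rather than an error
```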

Some fields may hold values of incompatible types. This can happen when no data was stored for a variable, when a user did not complete the course or the course registration, or when a column mixes several data types. For example, a string representation of an age cannot be compared with a number. Likewise, a value that a user entered as N/A or left blank may be recorded variously as NA, na, or NaN, and each spelling is interpreted differently.
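
As an illustration of the point about mixed types, the sketch below maps several spellings of "missing" to a real `NaN` and then converts a string column to numbers. The `age` column, its values, and the `MISSING_TOKENS` list are hypothetical and would need to be adapted after inspecting the actual file.

```python
import pandas as pd

# Hypothetical list of spellings to treat as missing.
MISSING_TOKENS = ["", "NA", "na", "N/A", "NaN", "nan"]

# Toy column read as strings, as dtype='unicode' would produce.
toy = pd.DataFrame({"age": ["34", "N/A", "", "twenty", "51"]})

# Replace every missing-value spelling with a real NaN.
cleaned = toy.mask(toy.isin(MISSING_TOKENS))

# Convert to numbers; anything unparseable ("twenty") also becomes NaN.
cleaned["age"] = pd.to_numeric(cleaned["age"], errors="coerce")
```

After this step the column can be compared with numbers, and all missing entries share one representation.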

In [49]:
df_test_unique = df_test_no_dup

In [53]:
# Remove NA columns
original_columns = set(df_test_unique.columns.values)
df = df_test_unique.dropna(axis = 1, how = 'all').fillna(NA_FILL_VALUE)
new_columns = set(df.columns.values)
print("Removed columns", original_columns - new_columns)

Removed columns set()


To repair bad or missing values, we must know which columns they come from, what type the data in each of those columns should have, and how empty values should be coded.
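
These three questions can be written down as an explicit per-column plan. The sketch below is one possible shape for such a plan, not the repair actually applied in this notebook: the `REPAIR_PLAN` dictionary, the fill values, and the toy data are all assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in with two of the real column names.
toy = pd.DataFrame({
    "nproblem_check": ["142", None, "0"],
    "cc_by_ip":       ["BR", None, "GB"],
})

# Which columns do the missing values come from?
missing_per_column = toy.isna().sum()

# Hypothetical plan: how to convert each column, what to use for
# missing entries, and which dtype the column should end up with.
REPAIR_PLAN = {
    "nproblem_check": {
        "convert": lambda s: pd.to_numeric(s, errors="coerce"),
        "fill": 0,
        "dtype": "int64",
    },
    "cc_by_ip": {
        "convert": lambda s: s,
        "fill": "unknown",
        "dtype": "object",
    },
}

repaired = toy.copy()
for column, plan in REPAIR_PLAN.items():
    converted = plan["convert"](repaired[column])
    repaired[column] = converted.fillna(plan["fill"]).astype(plan["dtype"])
```

Filling a count with 0 asserts that the event never happened, which is a stronger statement than "unknown"; whether that is acceptable is a per-column decision.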

In [54]:
df_test_repair = df_test_unique
df_test_repair[df_test_repair.notnull()]

user_id,course_id,registered,viewed,explored,certified,completed,ip,cc_by_ip,countryLabel,continent,...,nforum_pinned,roles,nprogcheck,nproblem_check,nforum_events,mode,is_active,cert_created_date,cert_modified_date,cert_status
7940,HarvardX/PH525.1x/1T2018,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,Middlesbrough,...,Student,0,0,0,audit,1,,,,
21193,HarvardX/PH525.1x/1T2018,True,True,False,False,103.108.88.2,,,,,...,Student,0,142,0,audit,0,,,,
27938,HarvardX/PH525.1x/1T2018,True,True,False,False,179.214.111.130,BR,Brazil,Americas,Brasília,...,Student,0,207,0,audit,0,,,,
28454,HarvardX/PH525.1x/1T2018,True,False,False,False,177.18.230.216,BR,Brazil,Americas,Guarulhos,...,Student,0,0,0,audit,1,,,,
28454,HarvardX/PH525.1x/1T2018,True,False,False,False,177.18.230.216,BR,Brazil,Americas,Guarulhos,...,,,,,,,,,,
28454,HarvardX/PH525.1x/1T2018,True,False,False,False,177.18.230.216,BR,Brazil,Americas,Guarulhos,...,,,,,,,,,,
43100,HarvardX/PH525.1x/1T2018,False,,False,False,213.249.56.36,GR,Greece,Europe,Athens,...,Student,0,0,0,audit,1,,,,
44413,HarvardX/PH525.1x/1T2018,False,,False,False,95.91.213.204,DE,Germany,Europe,Berlin,...,Student,0,0,0,audit,1,,,,
45324,HarvardX/PH525.1x/1T2018,False,,False,False,99.49.42.33,US,United States,Americas,Austin,...,Student,0,0,0,audit,1,,,,
45875,HarvardX/PH525.1x/1T2018,True,False,False,False,190.62.244.108,SV,El Salvador,Americas,San Salvador,...,Student,0,0,0,audit,0,,,,


### Problem 2

Some lines may be corrupt; get rid of those or mark them in some way to show that they are not good lines. How many corrupt lines are there? Does the count of corrupt lines change if you get rid of them before getting rid of the duplicate records? What difference might this make to the remaining data set?

We treat a line with no user ID or course number as corrupt. The count of corrupt lines does not change whether we remove them before or after removing the duplicate records.
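
The effect of ordering can be checked directly by counting corrupt lines in both orders. The sketch below does this on a toy frame; `count_corrupt` is a hypothetical helper and the toy rows are invented. In this toy data the two counts differ, because one corrupt row is itself duplicated; on the real file the two numbers agree only if no corrupt line is repeated. The rows that survive both steps are the same in either order.

```python
import pandas as pd

ID_COLUMNS = ["user_id", "course_id"]
COURSE = "HarvardX/PH525.1x/1T2018"

# Toy rows: one valid row (twice), one corrupt row (twice), one more corrupt row.
toy = pd.DataFrame({
    "course_id": [COURSE, None, None, COURSE, COURSE],
    "user_id":   ["7940", "21193", "21193", None, "7940"],
})

def count_corrupt(frame):
    """Count rows that are missing user_id or course_id."""
    return int(frame[ID_COLUMNS].isna().any(axis=1).sum())

corrupt_before_dedup = count_corrupt(toy)
corrupt_after_dedup = count_corrupt(toy.drop_duplicates())

# The remaining data set, computed in both orders.
remaining_a = toy.dropna(subset=ID_COLUMNS).drop_duplicates()
remaining_b = toy.drop_duplicates().dropna(subset=ID_COLUMNS)

print(corrupt_before_dedup, corrupt_after_dedup, remaining_a.equals(remaining_b))
```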

### Problem 3

What are some possible sources of bias in this data set? Is there anything unusual about the data set that you should flag?

Possible sources of bias in this data set include students who registered more than once for the same class or registered for the same class under different names, and students who enrolled but never participated in the class. The data set also has an enormous number of duplicate rows and many unreliable birth dates. In addition, a student who did not pass because of failed assignments and a student who did not pass because they never engaged with the course beyond registration are both recorded simply as not passing.
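
One way to make the last point visible is to split the non-certified students by whether they ever engaged with the course. The sketch below uses the `viewed` and `certified` columns for this; treating `viewed` as a proxy for engagement is an assumption, and the `outcome` function and toy values are hypothetical.

```python
import pandas as pd

# Toy stand-in using two of the real column names.
toy = pd.DataFrame({
    "user_id":   ["7940", "21193", "27938", "28454"],
    "viewed":    [False, True, True, False],
    "certified": [False, False, True, False],
})

def outcome(row):
    """Label each student so the two kinds of 'not passing' stay distinct."""
    if row["certified"]:
        return "certified"
    if row["viewed"]:
        return "engaged, not certified"
    return "never engaged"

toy["outcome"] = toy.apply(outcome, axis=1)
counts = toy["outcome"].value_counts()
print(counts)
```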