**Problem 1**
Clean up the file. This means getting rid of duplicates; you can assume that no student can register for the same course more than once. How many duplicate records do you find? Some of the fields have bad or missing values; repair those that you can (and explain what a repair means).

Note that unit tests are included throughout the file after the functionality of each

In [1]:
import pandas as pd
import csv
import sys

Opening the data: Because of the number of corrupted lines and entries in the data, we could not open the file in one step, either by reading it into a pandas dataframe or by reading it in line by line. The code below instead tries to read each line, appends only the readable lines to a list, good_lines, and counts the lines that could not be read.

In [2]:
# Note - change file path or put csv in your directory to run this code yourself
with open('/dirty_sample_small.csv') as f:
    length = 661486
    good_lines = []
    bad_lines = 0
    for i in range(length):
        try:
            value = f.readline()
            good_lines.append(value)
        except:
            bad_lines += 1
    print("This file has", bad_lines, "unreadable lines")

This file has 0 unreadable lines
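
The loop above reads in text mode and relies on a bare `except`. As a minimal sketch of a stricter variant, assuming that "unreadable" means "not valid UTF-8", each line can be read as bytes and decoded on its own. The helper name `split_readable_lines` and the sample bytes are hypothetical, for illustration only.

```python
import io

def split_readable_lines(raw_bytes, encoding="utf-8"):
    """Return (good_lines, bad_count), decoding a bytes buffer one line at a time."""
    good_lines = []
    bad_count = 0
    for raw_line in io.BytesIO(raw_bytes):
        try:
            good_lines.append(raw_line.decode(encoding))
        except UnicodeDecodeError:
            # Only this line is rejected; later lines are still attempted
            bad_count += 1
    return good_lines, bad_count

# Made-up sample: a header, one valid row, and one row with invalid bytes
sample = b"course_id,user_id\nHarvardX/PH525.1x/1T2018,7940\n\xff\xfe broken\n"
good, bad = split_readable_lines(sample)
print(len(good), "readable lines,", bad, "unreadable lines")
```

For the real file, the same function could be fed the result of `open(path, "rb").read()`, at the cost of holding the whole file in memory.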


In [3]:
print("We are left with", len(good_lines), "readable lines.")

We are left with 661486 readable lines.


Read in header to figure out which categories we are supposed to have in our dataframe.

In [4]:
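# Create header dataframe from header csv
# (DataFrame.from_csv was removed in pandas 1.0; pd.read_csv(path, index_col=0, parse_dates=True) is its replacement)
header = pd.DataFrame.from_csv('/dirty_sample_small_header.csv')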
header.loc[-1] = ['course_id']
header.index = header.index + 1
header = header.sort_index()
header = header.transpose()
header

  


0,0.1,2,3,4,5,6,7,8,9,10,...,39,40,41,42,43,44,45,46,47,48
course_id,course_id,user_id,registered,viewed,explored,certified,completed,ip,cc_by_ip,countryLabel,...,nforum_pinned,roles,nprogcheck,nproblem_check,nforum_events,mode,is_active,cert_created_date,cert_modified_date,cert_status


In [5]:
good_lines[:2]

['course_id,user_id,registered,viewed,explored,certified,completed,ip,cc_by_ip,countryLabel,continent,city,region,subdivision,postalCode,un_major_region,un_economic_group,un_developing_nation,un_special_region,latitude,longitude,LoE,YoB,gender,grade,passing_grade,start_time,first_event,last_event,nevents,ndays_act,nplay_video,nchapters,nforum_posts,nforum_votes,nforum_endorsed,nforum_threads,nforum_comments,nforum_pinned,roles,nprogcheck,nproblem_check,nforum_events,mode,is_active,cert_created_date,cert_modified_date,cert_status\n',
 'HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,Middlesbrough,MDB,Middlesbrough,,Northern Europe,Developed regions,,,54.5728,-1.1628,,,,,0.7,2018-04-10 08:30:28,2018-04-10 08:30:28.055918,2018-04-10 08:30:38.600946,6,2,0,,,,,,,,Student,0,0,0,audit,1,,,,\n']

In [6]:
df = pd.DataFrame([sub.split(",") for sub in good_lines])

In [7]:
# TEST - ensure that dataframe works
print(len(df))

661486


In [8]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,39,40,41,42,43,44,45,46,47,48
0,course_id,user_id,registered,viewed,explored,certified,completed,ip,cc_by_ip,countryLabel,...,roles,nprogcheck,nproblem_check,nforum_events,mode,is_active,cert_created_date,cert_modified_date,cert_status\n,
1,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
2,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
3,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
4,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,


In [9]:
# Preview the frame without the header row (drop returns a copy; df itself is unchanged)
df.drop([0])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,39,40,41,42,43,44,45,46,47,48
1,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
2,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
3,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
4,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
5,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
6,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
7,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
8,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
9,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
10,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,


In [10]:
my_header = header.iloc[0].reset_index()['course_id']
my_header

0                course_id
1                  user_id
2               registered
3                   viewed
4                 explored
5                certified
6                completed
7                       ip
8                 cc_by_ip
9             countryLabel
10               continent
11                    city
12                  region
13             subdivision
14              postalCode
15         un_major_region
16       un_economic_group
17    un_developing_nation
18       un_special_region
19                latitude
20               longitude
21                     LoE
22                     YoB
23                  gender
24                   grade
25           passing_grade
26              start_time
27             first_event
28              last_event
29                 nevents
30               ndays_act
31             nplay_video
32               nchapters
33            nforum_posts
34            nforum_votes
35         nforum_endorsed
36          nforum_threads
3

In [11]:
# Set header dataframe as header of main dataframe
df = df.rename(columns=my_header)

In [12]:
df.head()

Unnamed: 0,course_id,user_id,registered,viewed,explored,certified,completed,ip,cc_by_ip,countryLabel,...,roles,nprogcheck,nproblem_check,nforum_events,mode,is_active,cert_created_date,cert_modified_date,cert_status,48
0,course_id,user_id,registered,viewed,explored,certified,completed,ip,cc_by_ip,countryLabel,...,roles,nprogcheck,nproblem_check,nforum_events,mode,is_active,cert_created_date,cert_modified_date,cert_status\n,
1,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
2,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
3,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
4,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,


In [13]:
# Preview the frame without the header row (drop returns a copy; df itself is unchanged)
df.drop([0])

Unnamed: 0,course_id,user_id,registered,viewed,explored,certified,completed,ip,cc_by_ip,countryLabel,...,roles,nprogcheck,nproblem_check,nforum_events,mode,is_active,cert_created_date,cert_modified_date,cert_status,48
1,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
2,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
3,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
4,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
5,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
6,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
7,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
8,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
9,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,
10,HarvardX/PH525.1x/1T2018,7940,False,,False,False,81.108.107.58,GB,United Kingdom,Europe,...,0,0,0,audit,1,,,,\n,


**Removing Duplicate Records** Now that we have loaded our data into a workable dataframe, we can begin our analysis. We first look for duplicates within the 'user_id' column, a column that should have all unique values, to determine how many duplicate students we have in our dataframe (and thus need to remove).

In [14]:
# Find number of unique records in user_id column
unique_entries = df['user_id'].unique()
print("This dataset has", len(unique_entries), "unique entries")
print("Thus, we have", len(df) - len(unique_entries), "duplicate entries")

This dataset has 49145 unique entries
Thus, we have 612341 duplicate entries


In [15]:
# Create data frame with only unique entries (based on user id)
unique_df = df.drop_duplicates(subset='user_id')

In [16]:
print(len(unique_df))

49145
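
The problem statement defines a duplicate as the same student registered more than once for the same course, so a stricter variant would key on the pair (course_id, user_id) rather than on user_id alone. The following is a sketch on made-up toy rows, not a rerun of the analysis above.

```python
import pandas as pd

# Toy data: user 1 appears twice in course A and once in course B
toy = pd.DataFrame({
    "course_id": ["A", "A", "B", "B"],
    "user_id":   [1, 1, 1, 2],
    "grade":     [0.5, 0.5, 0.9, 0.7],
})

key = ["course_id", "user_id"]
# duplicated() flags every repeat of a key after its first occurrence
n_duplicates = int(toy.duplicated(subset=key).sum())
deduped = toy.drop_duplicates(subset=key, keep="first")
print(n_duplicates, "duplicate rows;", len(deduped), "rows kept")
```

With this key, user 1's registration in course B is kept, whereas de-duplicating on user_id alone would discard it.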


**Filling Missing Values** With our new dataframe (where non-unique rows have been dropped), we fill all missing values using a statistically representative draw from that column.

In [17]:
import numpy as np

Methods for repairing values -- modified from those written for the Reidentification pset. These methods take a statistically representative draw from the values in any given column to repair nan values.

In [18]:
# Generate dictionary with representative proportions from each column
columns = unique_df.columns
columns

Index([           'course_id',              'user_id',           'registered',
                     'viewed',             'explored',            'certified',
                  'completed',                   'ip',             'cc_by_ip',
               'countryLabel',            'continent',                 'city',
                     'region',          'subdivision',           'postalCode',
            'un_major_region',    'un_economic_group', 'un_developing_nation',
          'un_special_region',             'latitude',            'longitude',
                        'LoE',                  'YoB',               'gender',
                      'grade',        'passing_grade',           'start_time',
                'first_event',           'last_event',              'nevents',
                  'ndays_act',          'nplay_video',            'nchapters',
               'nforum_posts',         'nforum_votes',      'nforum_endorsed',
             'nforum_threads',      'nforum_comments

In [19]:
# Generate dictionary of representative values for each column in our dataframe
col_props = {}
for column in columns:
    col_props[column] = dict(unique_df[column].value_counts(normalize=True))
#print(col_props)

In [20]:
# Method that takes in a column and draws a representative value
def generate_col_val(col):
    # Candidate values and their observed proportions in this column
    possVals = list(col_props[col].keys())
    possProps = list(col_props[col].values())
    # Draw a single index, weighted by how often each value appears
    propIndex = np.random.choice(len(possVals), p=possProps)
    return possVals[int(propIndex)]

In [21]:
print(generate_col_val('region'))




In [22]:
# Fill nans (empty values) with a representative draw from column
unique_df.fillna(np.nan)
for col in columns:
    filled_df = unique_df.replace(to_replace={col: {np.nan: generate_col_val(col)}})
filled_df

ValueError: a must be greater than 0
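
The error above is the kind `np.random.choice` raises when it is asked to choose from zero candidates, which would happen for a column with no observed values. Below is a hedged sketch of a fill routine that skips such columns and draws a separate value for every missing cell. It assumes that empty strings should count as missing (naive comma splitting produces `''` rather than NaN); the function name `fill_with_representative_draws` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_with_representative_draws(frame, seed=0):
    """Fill missing cells by sampling from each column's observed values."""
    rng = np.random.default_rng(seed)
    filled = frame.copy()
    for col in filled.columns:
        # Treat both NaN and '' as missing
        missing = filled[col].isna() | (filled[col] == "")
        observed = filled.loc[~missing, col]
        if observed.empty or not missing.any():
            continue  # nothing to sample from, or nothing to fill
        # Uniform sampling from observed rows reproduces the value proportions
        filled.loc[missing, col] = rng.choice(observed.to_numpy(), size=int(missing.sum()))
    return filled

toy = pd.DataFrame({"gender": ["m", "", "f", "m"], "notes": ["", "", "", ""]})
result = fill_with_representative_draws(toy)
print(result)
```

Sampling uniformly from the observed rows is equivalent to drawing from the `value_counts(normalize=True)` proportions used above, but it avoids assigning one shared draw to every missing cell in a column.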

**Repairing "Bad" Values** Now that we have a dataframe with all empty values filled, we also need to look through our dataset to repair any "bad" values.

First, we repair the 'bad' values of those columns whose datatypes are easy to distinguish. Here, a bad value would be some value other than 'TRUE' or 'FALSE' in a binary column, or a non-integer or non-float value in a column that is supposed to contain integers and floats.

In [23]:
# list of columns with TRUE/FALSE values
TF_cols = ['registered', 'explored', 'viewed', 'certified']

In [24]:
# Replace any non-TRUE/FALSE values in these columns with a representative draw from the columns as a whole
for col in TF_cols:
    mask = ((filled_df[col] != ('TRUE')) & (filled_df[col] != ('FALSE')))
    filled_df.loc[mask, col] = generate_col_val(col)

In [25]:
# Unit test - check to make sure all extraneous values have been replaced for T/F cols
for col in TF_cols:
    print(filled_df[col].value_counts())
print("All tests passed")

True    49145
Name: registered, dtype: int64
False    49145
Name: explored, dtype: int64
True    49145
Name: viewed, dtype: int64
False    49145
Name: certified, dtype: int64
All tests passed
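
The check above prints "All tests passed" regardless of what the columns contain. Below is a sketch of an assertion-based version, assuming the allowed values are the strings 'True' and 'False' (the spelling used in the sample row shown earlier). The helper name `assert_only_allowed` and the toy frame are hypothetical.

```python
import pandas as pd

def assert_only_allowed(frame, cols, allowed):
    """Raise AssertionError if any listed column holds a value outside `allowed`."""
    for col in cols:
        unexpected = set(frame[col].unique()) - set(allowed)
        assert not unexpected, f"{col} has unexpected values: {sorted(map(str, unexpected))}"

toy = pd.DataFrame({"registered": ["True", "False"], "viewed": ["True", "maybe"]})

# Passes silently: only allowed values are present
assert_only_allowed(toy, ["registered"], {"True", "False"})

# Fails loudly: 'maybe' is not an allowed value
try:
    assert_only_allowed(toy, ["viewed"], {"True", "False"})
    caught = False
except AssertionError as err:
    caught = True
    print(err)
```

A check like this would also reveal whether a mask written against 'TRUE'/'FALSE' matched the spelling actually present in the data.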


The True/False columns are the only ones in this dataset that we could repair in this fashion. To repair the dataset fully, we could go through each column, determine its majority datatype (i.e. string vs int vs float), and then replace all values that are not of that type with a representative draw from the rest of the column, using the function defined above. However, given the number of columns in this dataset, this is not feasible to do by hand. The resulting dataset, which has its empty values filled, its T/F values fixed, and clearly corrupted entries removed, is therefore a significantly better dataset than the one we started with.

**Problem 2** Some lines may be corrupt; get rid of those or mark them in some way to show that they are not good lines. How many corrupt lines are there? Does the count of corrupt lines change if you get rid of them before getting rid of the duplicate records? What difference might this make to the remaining data set?

First we will count the number of lines in the original dataset. As visible below, there are 661486 lines in the original dataset.

In [26]:
# Note: File path may need to be changed
# Count number of rows in original dataset
with open("/dirty_sample_small.csv") as f:  # the with-block closes the file when done
    reader = csv.DictReader(f)
    count = 0
    for row in reader:
        count += 1
print("Count of rows is")
print(count)

Count of rows is
661486


Next, we will remove "corrupt" lines. In order to do this, we first had to decide what "corrupt" means.

We noticed that when we originally tried to read in lines from the dataset, we received errors that the processor was expecting some number of fields and receiving another number of fields. This seemed to be a form of corruption.

In the next section of code, you'll see that we set error_bad_lines = False. By default, a line with too many fields causes an exception to be raised; with this setting, pandas instead drops those "bad lines" from the DataFrame.

Of course, there might be many other types of corruption. Maybe values were unintentionally swapped in the dataset. Maybe we don't realize that we are missing several pieces of the data that was collected.

We chose to remove the corrupt lines instead of marking them in some way, to simplify the rest of the process. However, note that in the next chunk of code we included "warn_bad_lines = False". When this is set to True, pandas prints out each line that has been removed and why it was removed. If we (or another interested party) wanted to look back and identify the issues with our dataset, we could change it to "warn_bad_lines = True" and observe the messages printed out.

In [27]:
# Note: File path may need to be changed
# Read in original dataset without bad lines (as defined in analysis above)
df2 = pd.read_csv('/dirty_sample_small.csv',warn_bad_lines = False, error_bad_lines = False, low_memory = False)
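
A note for readers on newer pandas: `error_bad_lines` and `warn_bad_lines` were deprecated in pandas 1.3 and removed in 2.0, with `on_bad_lines` taking their place. The sketch below shows the equivalent of the call above on a small in-memory sample (the sample text is made up); `on_bad_lines="warn"` would correspond to `warn_bad_lines = True`.

```python
import io
import pandas as pd

# Made-up sample: the second data row has one field too many
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n")

# 'skip' silently drops rows with too many fields (requires pandas >= 1.3)
df_skip = pd.read_csv(sample, on_bad_lines="skip")
print(len(df_skip), "rows kept")
```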

In [80]:
print("Count of rows after removing corrupt ones from original dataset is ")
print(len(df2))

Count of rows after removing corrupt ones from original dataset is 
652326


Since the count of rows started at 661486 and is now 652326, we know that we removed 9,160 rows.
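
The field-count mismatch described above can also be measured directly, without pandas. The sketch below counts data rows whose number of fields differs from the header's; the helper name `count_field_mismatches` and the sample text are hypothetical. Note that it counts rows that are too short as well as too long, whereas pandas only drops rows with too many fields and pads short rows with NaN, so the two counts need not agree.

```python
import csv
import io

def count_field_mismatches(text):
    """Count data rows whose field count differs from the header's."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return sum(1 for row in reader if len(row) != len(header))

# Made-up sample: one good row, one short row, one long row
sample = "a,b,c\n1,2,3\n4,5\n6,7,8,9\n"
print(count_field_mismatches(sample))  # prints 2
```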

In [81]:
print(df2)

                       course_id  user_id registered viewed explored  \
0       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
1       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
2       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
3       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
4       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
5       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
6       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
7       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
8       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
9       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
10      HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
11      HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
12      HarvardX/PH525.1x/1T2018     7940      False    NaN    F

Next, we noticed several oddities in the DataFrame. Many of the columns appear to have been corrupted with swapped values, including "completed", "ip", "cc_by_ip", "countryLabel", "continent", "city", "region", "subdivision", "postalCode", "un_major_region", "un_economic_group", "un_developing_nation", "un_special_region", "latitude", "longitude", "LoE", "YoB", "gender", "passing_grade", "nforum_events", "nforum_pinned", and "roles". For example, the "ip" column is filled with values like "GB" and "BR" that appear to be country codes, while the "completed" column is filled with values that appear to be formatted as IP addresses.

We considered swapping these values back in order to repair the DataFrame, but chose not to take the risk of assuming features of the data based only on characteristics like formatting. While it certainly looks like the "completed" column contains the IP addresses, how would we switch them back? Replace this user's "ip" value with this user's "completed" value? What if they weren't just swapped horizontally? What if the data was in fact corrupted in combination with another dataset?

For these reasons, we chose to get rid of these columns. While this obviously cuts down on the data available to analyze (e.g. we're losing all location information), we value accuracy and did not want to make assumptions about the data that might damage its accuracy further. We chose to drop only one column in each cell so that someone looking at our notebook in the future can recreate any combination of these drops if they are able to investigate the corruption further and reach accurate conclusions about how the data can be repaired more completely.

In [82]:
df2 = df2.drop('completed', 1)

KeyError: "['completed'] not found in axis"

In [83]:
df2 = df2.drop('ip', 1)

KeyError: "['ip'] not found in axis"

In [84]:
df2 = df2.drop('cc_by_ip', 1)

KeyError: "['cc_by_ip'] not found in axis"

In [85]:
df2 = df2.drop('countryLabel', 1)

KeyError: "['countryLabel'] not found in axis"

In [34]:
df2 = df2.drop('continent', 1)

In [35]:
df2 = df2.drop('city', 1)

In [36]:
df2 = df2.drop('region', 1)

In [37]:
df2 = df2.drop('subdivision', 1)

In [38]:
df2 = df2.drop('postalCode', 1)

In [39]:
df2 = df2.drop('un_major_region', 1)

In [40]:
df2 = df2.drop('un_economic_group', 1)

In [41]:
df2 = df2.drop('un_developing_nation', 1)

In [42]:
df2 = df2.drop('un_special_region', 1)

In [43]:
df2 = df2.drop('latitude', 1)

In [44]:
df2 = df2.drop('longitude', 1)

In [45]:
df2 = df2.drop('LoE', 1)

In [46]:
df2 = df2.drop('YoB', 1)

In [47]:
df2 = df2.drop('gender', 1)

In [48]:
df2 = df2.drop('nforum_events', 1)

In [49]:
df2 = df2.drop('nforum_pinned',1)

In [50]:
df2 = df2.drop('roles',1)

In [51]:
df2 = df2.drop('passing_grade',1)
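
Two caveats about the drop cells above: the positional axis argument in `df2.drop('completed', 1)` is no longer accepted from pandas 2.0 onward, and rerunning a cell raises a KeyError once the column is gone. Where per-cell granularity is not needed, a sketch like the following (toy data and the name `suspect_cols` are made up) uses the keyword form and tolerates reruns.

```python
import pandas as pd

toy = pd.DataFrame({"user_id": [1, 2], "ip": ["GB", "BR"], "city": ["x", "y"]})
suspect_cols = ["ip", "city"]

# Keyword form works on old and new pandas; errors="ignore" skips absent columns
trimmed = toy.drop(columns=suspect_cols, errors="ignore")

# Running the same drop again is a no-op instead of a KeyError
trimmed = trimmed.drop(columns=suspect_cols, errors="ignore")
print(list(trimmed.columns))  # prints ['user_id']
```

Editing `suspect_cols` still lets a later reader recreate any combination of drops.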

In [52]:
print(df2)

652326
                       course_id  user_id registered viewed explored  \
0       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
1       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
2       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
3       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
4       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
5       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
6       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
7       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
8       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
9       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
10      HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
11      HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
12      HarvardX/PH525.1x/1T2018     7940      False    N

In [None]:
print(len(df2))

Our choice to fix the category corruption by removing columns did not impact the number of rows in the DataFrame.

Now we will analyze what happens if we execute this process after performing the operations from Problem 1 (getting rid of duplicates and repairing bad or missing values).

In [87]:
print(len(filled_df))

49145


We can see that the DataFrame resulting from Problem 1 has 49,145 rows.

In [53]:
filled_df.to_csv('filled_df.csv')

In [54]:
df3 = pd.read_csv('filled_df.csv',warn_bad_lines = False, error_bad_lines = False, low_memory = False)

In [86]:
print("Count of rows after removing corrupt ones from dataset resulting from Problem 1 is ")
print(len(df3))

Count of rows after removing corrupt ones from dataset resulting from Problem 1 is 
49145


In [56]:
print(df3)

       Unnamed: 0                 course_id              user_id  registered  \
0               0                 course_id              user_id        True   
1               1  HarvardX/PH525.1x/1T2018                 7940        True   
2              16  HarvardX/PH525.1x/1T2018                21193        True   
3              38  HarvardX/PH525.1x/1T2018                27938        True   
4              48  HarvardX/PH525.1x/1T2018                28454        True   
5              71  HarvardX/PH525.1x/1T2018                43100        True   
6              79  HarvardX/PH525.1x/1T2018                44413        True   
7              98  HarvardX/PH525.1x/1T2018                45324        True   
8             117  HarvardX/PH525.1x/1T2018                45875        True   
9             128  HarvardX/PH525.1x/1T2018                52081        True   
10            129  HarvardX/PH525.1x/1T2018                54513        True   
11            139  HarvardX/PH525.1x/1T2

In [57]:
df3 = df3.drop('completed', 1)

In [58]:
df3 = df3.drop('ip', 1)

In [59]:
df3 = df3.drop('cc_by_ip', 1)

In [60]:
df3 = df3.drop('countryLabel', 1)

In [61]:
df3 = df3.drop('continent', 1)

In [62]:
df3 = df3.drop('city', 1)

In [63]:
df3 = df3.drop('region', 1)

In [64]:
df3 = df3.drop('subdivision', 1)

In [65]:
df3 = df3.drop('postalCode', 1)

In [66]:
df3 = df3.drop('un_major_region', 1)

In [67]:
df3 = df3.drop('un_economic_group', 1)

In [68]:
df3 = df3.drop('un_developing_nation', 1)

In [69]:
df3 = df3.drop('un_special_region', 1)

In [70]:
df3 = df3.drop('latitude', 1)

In [71]:
df3 = df3.drop('longitude', 1)

In [72]:
df3 = df3.drop('LoE', 1)

In [73]:
df3 = df3.drop('YoB', 1)

In [74]:
df3 = df3.drop('gender', 1)

In [75]:
df3 = df3.drop('nforum_events', 1)

In [76]:
df3 = df3.drop('nforum_pinned',1)

In [77]:
df3 = df3.drop('roles',1)

In [78]:
df3 = df3.drop('passing_grade',1)

In [88]:
print(len(df3))

49145


We can see that, after the process in Problem 1, our anti-corruption operations do not remove any complete rows from the dataset.

This is because we remove corruption in two ways. One of the ways is taking out "bad lines" as we read in the dataset. We were not able to remove duplicates (or even read the file) in Problem 1 without removing some level of corruption by removing bad lines, so they have already been removed. The other way is deleting corrupted columns. As discussed above, this does not impact the number of rows.

In future iterations, it might be helpful to figure out which operation is more expensive: removing duplicates or removing corrupt lines. If we could optimize to do the more expensive one on the smaller dataset, we would be better prepared to scale.

**Problem 3** What are some possible sources of bias in this data set? Is there anything unusual about the data set that you should flag?

Now, we will discuss some of the potential manifestations of bias in this dataset.
Firstly, there is inherent bias in which fields EdX chose to collect. For example, it might be an assertion of EdX's own value system to check how many forum posts a user made instead of evaluating the quality of those forum posts. On the other hand, this could just be based on EdX's technical ability (which might be a space of bias in and of itself).

Another manifestation of bias in this dataset might be what the options were for users to select. For example, it seems that users were asked to check their country from a list of options. There are various controversies surrounding how to identify certain countries and this could certainly result in bias.

Another manifestation of bias might be the mere style of entering values. Some fields provide preset options while others ask for user input. It would be interesting to see which style results in more false entries.

Something we found very interesting was the dataset's distinction between developing nations and developed regions. Is this field automatically set by EdX when the user enters their information or does the user get to distinguish?

Additionally, it is worthwhile to note that our methods of repairing data may have introduced bias into the dataset. Given that missing and "bad" values were filled with a representative draw from the remaining data, it is possible that bias could have been introduced based on which values were "bad" in the first place. In order to fix this, it would be worthwhile to conduct an analysis into which data was "bad" or missing in the first place (i.e. if bad or missing data was skewed towards users with a certain profile). Then, the repairs could be corrected accordingly to minimize bias.
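
As a starting point for that analysis, the rate of missing values in a field can be compared across user groups. The sketch below uses made-up toy rows and assumes, purely for illustration, that "mode" is the grouping of interest and "YoB" the field being checked.

```python
import numpy as np
import pandas as pd

# Toy data: year of birth is missing more often for audit users
toy = pd.DataFrame({
    "mode": ["audit", "audit", "verified", "verified"],
    "YoB":  [np.nan, np.nan, 1990.0, np.nan],
})

# Share of missing YoB values within each enrollment mode
missing_rate = toy["YoB"].isna().groupby(toy["mode"]).mean()
print(missing_rate)
```

Large differences between groups would suggest that the values are not missing at random, in which case a column-wide representative draw would pull the under-reported groups toward the profile of the others.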