![](CountCode.png)

In [27]:
import pandas as pd
import csv
import sys

First, we count the rows in the original dataset. As the output below shows, there are 661486 rows (csv.DictReader uses the first line as the header, so that line is not counted).

In [28]:
# Open the file in a context manager so it is closed once counting is done.
with open("/dirty_sample_small.csv") as f:
    reader = csv.DictReader(f)
    count = 0
    for row in reader:
        count += 1
print("Count of rows is")
print(count)


Count of rows is
661486


Next, we will remove "corrupt" lines. In order to do this, we first had to decide what "corrupt" means.

When we first tried to read the dataset, the parser raised errors saying that it expected a certain number of fields on a line but found a different number. We treated this mismatch as a form of corruption.
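
To make the field-count mismatch concrete, here is a minimal sketch that tallies how many rows have each number of fields. The helper name `field_count_histogram` and the three-column sample are made up for illustration; they are not part of our dataset or of the code we ran.

```python
import csv
import io
from collections import Counter


def field_count_histogram(lines):
    """Return a Counter mapping 'number of fields' to 'number of rows'."""
    reader = csv.reader(lines)
    return Counter(len(row) for row in reader)


# Hypothetical three-column sample; the last data row has an extra field.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6\n7,8,9,10\n")
histogram = field_count_histogram(sample)
print(histogram)
```

Run against an open file handle instead of the sample, a tally like this would show how many lines deviate from the most common field count, which is the kind of line the parser complained about.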

In the next section of code, you'll see that we set error_bad_lines = False. By default (error_bad_lines = True), a line with too many fields causes an exception to be raised and no DataFrame is returned. Setting it to False makes pandas drop those "bad lines" from the DataFrame instead.

Of course, there might be many other types of corruption. Maybe values were unintentionally swapped in the dataset. Maybe we don't realize that we are missing several pieces of the data that was collected.

We wanted to preserve as much of the data as possible, so we only dealt with the type of corruption that was preventing our code from processing the dataset.

We chose to remove the corrupt lines rather than mark them in some way, to simplify the rest of the process. Note that the next chunk of code also passes "warn_bad_lines = False". When this is set to True (with error_bad_lines = False), pandas prints a warning for each line it drops, including the reason it was dropped. If we later wanted to go back and identify what was wrong with our dataset, we could change it to "warn_bad_lines = True" and inspect the warnings.

In [29]:
df = pd.read_csv('/dirty_sample_small.csv',warn_bad_lines = False, error_bad_lines = False, low_memory = False)
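
The error_bad_lines and warn_bad_lines arguments used above were deprecated in pandas 1.3 and removed in pandas 2.0 in favor of a single on_bad_lines argument. The sketch below assumes pandas 1.3 or later and uses a made-up in-memory CSV rather than our dataset; it shows what we understand to be the equivalent way to drop lines with too many fields.

```python
import io

import pandas as pd

# Hypothetical sample: the second data line has four fields instead of three.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n")

# on_bad_lines="skip" drops lines with too many fields without reporting them.
# Use on_bad_lines="warn" to have each dropped line reported instead.
clean = pd.read_csv(sample, on_bad_lines="skip")
print(len(clean))
```

On older pandas versions, the call in the cell above remains the way to get this behavior.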

In [30]:
print("Count of rows after removing corrupt ones is ")
print(len(df))

Count of rows after removing corrupt ones is 
652326


The row count started at 661486 and is now 652326, so we removed 9160 rows.
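
The subtraction can be checked directly, along with the share of the original rows that was lost. This is only a sketch using the two counts printed above; the variable names are ours.

```python
rows_before = 661486  # rows counted with csv.DictReader
rows_after = 652326   # len(df) after read_csv dropped the bad lines

rows_removed = rows_before - rows_after
share_removed = rows_removed / rows_before

print(rows_removed)
print(f"{share_removed:.2%} of the original rows")
```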

In [31]:
print(df)

                       course_id  user_id registered viewed explored  \
0       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
1       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
2       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
3       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
4       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
5       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
6       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
7       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
8       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
9       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
10      HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
11      HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
12      HarvardX/PH525.1x/1T2018     7940      False    NaN    F

Next, we noticed several oddities in the DataFrame. It appears that many of the columns were corrupted with swapped values, including "completed", "ip", "cc_by_ip", "countryLabel", "continent", "city", "region", "subdivision", "postalCode", "un_major_region", "un_economic_group", "un_developing_nation", "un_special_region", "latitude", "longitude", "LoE", "YoB", "gender", "passing_grade", "nforum_events", "nforum_pinned", and "roles". For example, the "ip" column is filled with values like "GB" and "BR" that appear to be country codes, while the "completed" column is filled with values that are formatted like IP addresses.

We considered swapping these values back in order to repair the DataFrame, but chose not to take the risk of assuming features of the data based only on characteristics like formatting. While it certainly looks like the "completed" column contains the IP addresses, how would we switch them back? Replace this user's "ip" value with this user's "completed" value? What if the values weren't just swapped horizontally within a row? What if the corruption involved another dataset?

For these reasons, we chose to get rid of these columns. While this obviously cuts down on the data available to analyze (e.g. we're losing all location information), we value accuracy and did not want to make assumptions about the data that might damage its accuracy further.

We chose to drop the columns in separate cells so that someone looking at our notebook in the future can recreate any combination of these drops if they are able to investigate the corruption further and reach accurate conclusions about how the data can be repaired more completely.
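
The observation that one column "looks like" IP addresses can be quantified without repairing anything. The sketch below uses Python's standard ipaddress module to measure what fraction of a column's values parse as IP addresses. The helper names and the sample values are hypothetical stand-ins, not values taken from our dataset, and a high fraction would still only describe the formatting, not prove how the values were swapped.

```python
import ipaddress


def looks_like_ip(value):
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(str(value).strip())
        return True
    except ValueError:
        return False


def share_matching(values, predicate):
    """Fraction of values for which predicate is True (0.0 if there are none)."""
    values = list(values)
    if not values:
        return 0.0
    return sum(1 for v in values if predicate(v)) / len(values)


# Hypothetical values standing in for the "completed" and "ip" columns.
completed_sample = ["192.0.2.10", "198.51.100.7", "203.0.113.5", "True"]
ip_sample = ["GB", "BR", "US", "IN"]

print(share_matching(completed_sample, looks_like_ip))
print(share_matching(ip_sample, looks_like_ip))
```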

In [32]:
df = df.drop('completed', axis=1)

In [33]:
df = df.drop('ip', axis=1)

In [34]:
df = df.drop('cc_by_ip', axis=1)

In [35]:
df = df.drop('countryLabel', axis=1)

In [36]:
df = df.drop('continent', axis=1)

In [37]:
df = df.drop('city', axis=1)

In [38]:
df = df.drop('region', axis=1)

In [39]:
df = df.drop('subdivision', axis=1)

In [40]:
df = df.drop('postalCode', axis=1)

In [41]:
df = df.drop('un_major_region', axis=1)

In [42]:
df = df.drop('un_economic_group', axis=1)

In [43]:
df = df.drop('un_developing_nation', axis=1)

In [44]:
df = df.drop('un_special_region', axis=1)

In [45]:
df = df.drop('latitude', axis=1)

In [46]:
df = df.drop('longitude', axis=1)

In [47]:
df = df.drop('LoE', axis=1)

In [48]:
df = df.drop('YoB', axis=1)

In [49]:
df = df.drop('gender', axis=1)

In [50]:
df = df.drop('nforum_events', axis=1)

In [51]:
df = df.drop('nforum_pinned', axis=1)

In [52]:
df = df.drop('roles', axis=1)

In [54]:
df = df.drop('passing_grade', axis=1)
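
One possible alternative to the separate cells above, shown only as a sketch and not what we ran, is to keep the suspect column names in a single list and drop them in one call. Editing the list then plays the same role as choosing which cells to run. The miniature DataFrame and the name `suspect_columns` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature frame; the real DataFrame has many more columns.
demo = pd.DataFrame({
    "course_id": ["HarvardX/PH525.1x/1T2018"],
    "user_id": [7940],
    "completed": ["192.0.2.10"],
    "ip": ["GB"],
})

# Columns we consider unreliable; edit this list to keep or drop a column.
suspect_columns = ["completed", "ip", "cc_by_ip"]

# errors="ignore" skips names that are not present (here "cc_by_ip").
cleaned = demo.drop(columns=suspect_columns, errors="ignore")
print(list(cleaned.columns))
```

The trade-off is that a single call is shorter but makes it less obvious in the notebook history which columns were removed at which step.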

In [55]:
print(df)

                       course_id  user_id registered viewed explored  \
0       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
1       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
2       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
3       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
4       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
5       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
6       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
7       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
8       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
9       HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
10      HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
11      HarvardX/PH525.1x/1T2018     7940      False    NaN    False   
12      HarvardX/PH525.1x/1T2018     7940      False    NaN    F

Now, we will discuss some of the ways bias might show up in this dataset.

First, there is inherent bias in which fields EdX chose to collect. For example, counting how many forum posts a user made, rather than evaluating the quality of those posts, might reflect EdX's own value system. On the other hand, it could simply reflect what EdX was technically able to measure (which might itself be a source of bias).

Another source of bias might be the options that users were given to select from. For example, it seems that users were asked to select their country from a list of options. There are various controversies over how certain countries and territories should be identified, so the choice of list could introduce bias.

Bias might also come from the way values are entered. Some fields provide preset options while others ask for free-form user input. It would be interesting to see which style results in more false entries.

Something we found very interesting was the dataset's distinction between developing nations and developed regions. Is this field set automatically by EdX when the user enters their information, or does the user make the distinction?