# EDA of OULAD Dataset with focus on Pandas' groupby() Function

This exercise continues the exploratory data analysis (EDA) of two datasets retrieved from the Open Learning Analytics project. The two files were combined in [a separate exercise](https://github.com/datastate/analytics/blob/master/InitialCleaningOpenLearningAnalytics.ipynb); here the focus is on Pandas' groupby() function, which is particularly useful when several variables need to be combined to produce a single result.

This is the [Open Learning Analytics dataset](https://analyse.kmi.open.ac.uk/open_dataset) used in this project.
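As a quick illustration of what groupby() does before it is applied to the real data, the sketch below runs the split-apply-combine pattern on a tiny made-up frame. The column names mirror the dataset, but the values and the `toy` and `module_means` names are assumptions for illustration only.

```python
import pandas as pd

# Toy data: column names mirror the dataset, the values are made up.
toy = pd.DataFrame({
    "code_module": ["AAA", "AAA", "BBB", "BBB"],
    "score_pct": [60, 80, 70, 90],
})

# Split the rows by module, apply mean() to each group,
# and combine the per-group results into a single Series.
module_means = toy.groupby("code_module")["score_pct"].mean()
print(module_means)
```

The result is indexed by the grouping column, so `module_means["AAA"]` looks up the mean for a single module.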

In [1]:
import pandas as pd

# Raw string so the backslashes in the Windows path are not read as escape sequences
sadata = pd.read_csv(r"C:\dataHub\combined_assessments.csv")
print(sadata.sample(n=2))

        row_num  id_assessment  id_student  time_taken_days  \
81037     84362          15005      440225              133   
138039   143371          34909      647320              241   

       result_transferred_from_previous  score_pct code_module  \
81037                                No         60         BBB   
138039                               No         84         FFF   

       code_presentation gender               region         highest_edu  \
81037              2013J      F  East Anglian Region  Lower Than A Level   
138039             2014J      F              Ireland  Lower Than A Level   

       imd_band_pct age_band  num_prev_attempts  current_credits disabled  \
81037         50-60    35-55                  0               60        N   
138039        30-40     0-35                  0              120        N   

       final_result  
81037          Pass  
138039         Pass  


In [2]:
sadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198004 entries, 0 to 198003
Data columns (total 17 columns):
row_num                             198004 non-null int64
id_assessment                       198004 non-null int64
id_student                          198004 non-null int64
time_taken_days                     198004 non-null int64
result_transferred_from_previous    198004 non-null object
score_pct                           198004 non-null int64
code_module                         198004 non-null object
code_presentation                   198004 non-null object
gender                              198004 non-null object
region                              198004 non-null object
highest_edu                         198004 non-null object
imd_band_pct                        198004 non-null object
age_band                            198004 non-null object
num_prev_attempts                   198004 non-null int64
current_credits                     198004 non-null int64
disabled   

In [3]:
sadata['row_num'] = sadata['row_num'].astype(str)
sadata['id_assessment'] = sadata['id_assessment'].astype(str)
sadata['id_student'] = sadata['id_student'].astype(str)
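The likely motivation for casting the identifier columns to strings is that describe() summarises numeric columns by default, and a mean or standard deviation of an ID is meaningless. The sketch below shows the effect on a two-row toy frame; the `toy`, `before`, and `after` names are hypothetical.

```python
import pandas as pd

# Two rows with an integer identifier and a numeric score (toy data).
toy = pd.DataFrame({"id_student": [440225, 647320], "score_pct": [60, 84]})

# While the identifier is an integer, describe() treats it as a measurement.
before = list(toy.describe().columns)

# After casting to str, describe() skips it by default.
toy["id_student"] = toy["id_student"].astype(str)
after = list(toy.describe().columns)

print(before, after)
```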

In [4]:
sadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198004 entries, 0 to 198003
Data columns (total 17 columns):
row_num                             198004 non-null object
id_assessment                       198004 non-null object
id_student                          198004 non-null object
time_taken_days                     198004 non-null int64
result_transferred_from_previous    198004 non-null object
score_pct                           198004 non-null int64
code_module                         198004 non-null object
code_presentation                   198004 non-null object
gender                              198004 non-null object
region                              198004 non-null object
highest_edu                         198004 non-null object
imd_band_pct                        198004 non-null object
age_band                            198004 non-null object
num_prev_attempts                   198004 non-null int64
current_credits                     198004 non-null int64
disabled

In [17]:
sadata.describe().astype(int)

       time_taken_days  score_pct  num_prev_attempts  current_credits
count           198004     198004             198004           198004
mean               114         75                  0               78
std                 72         19                  0               38
min                -11          0                  0               30
25%                 49         65                  0               60
50%                114         79                  0               60
75%                172         89                  0               90
max                608        100                  6              630


In [6]:
# List the distinct values of each categorical column via its group keys
cat_cols = ['code_module', 'code_presentation', 'region', 'highest_edu',
            'imd_band_pct', 'age_band', 'num_prev_attempts', 'gender']
for col in cat_cols:
    print(sadata.groupby(col).groups.keys())

dict_keys(['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF', 'GGG'])
dict_keys(['2013B', '2013J', '2014B', '2014J'])
dict_keys(['East Anglian Region', 'East Midlands Region', 'Ireland', 'London Region', 'North Region', 'North Western Region', 'Scotland', 'South East Region', 'South Region', 'South West Region', 'Wales', 'West Midlands Region', 'Yorkshire Region'])
dict_keys(['A Level or Equivalent', 'HE Qualification', 'Lower Than A Level', 'No Formal quals', 'Post Graduate Qualification'])
dict_keys(['0-10', '20-30', '20-Oct', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90', '90-100'])
dict_keys(['0-35', '35-55', '55<='])
dict_keys([0, 1, 2, 3, 4, 5, 6])
dict_keys(['F', 'M'])


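The '20-Oct' IMD band appears incorrect. For the purposes of this exercise, it is assumed that it should have been 20-30, in line with the other categories. Note, however, that '20-30' is already present in the list above while no '10-20' band appears, so '20-Oct' may equally be a '10-20' label that a spreadsheet converted to a date. In practice, one does not make such assumptions without due diligence: discussions with business partners, verification of the data capture process, etc.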
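Part of that due diligence can be automated. The sketch below, run on a made-up Series rather than the real column, flags every band label that does not look like "number-number" and counts how often each offender occurs. The `bands` and `is_valid` names and the pattern itself are assumptions for illustration.

```python
import pandas as pd

# Toy labels standing in for the imd_band_pct column.
bands = pd.Series(["0-10", "20-Oct", "20-30", "30-40", "20-Oct"])

# A well-formed band is assumed to look like "<digits>-<digits>".
is_valid = bands.str.match(r"^\d+-\d+$")

# Show each malformed label together with the number of rows it affects.
print(bands[~is_valid].value_counts())
```

Knowing how many rows carry the malformed label helps decide whether the fix matters for the analysis at all.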

In [7]:
sadata['imd_band_pct'] = sadata['imd_band_pct'].str.replace("20-Oct", "20-30")

In [8]:
print(sadata.groupby(['imd_band_pct']).groups.keys())

dict_keys(['0-10', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90', '90-100'])


# Grouping by a single variable

In [9]:
# Average score per Index of Multiple Deprivation (IMD) band ("...where the student lived...")
sadata.groupby('imd_band_pct')['score_pct'].mean().astype(int)

imd_band_pct
0-10      71
20-30     73
30-40     74
40-50     75
50-60     75
60-70     75
70-80     75
80-90     77
90-100    77
Name: score_pct, dtype: int32
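One detail worth keeping in mind when reading these figures: `.astype(int)` drops the fractional part rather than rounding, so a group mean of 71.7 is reported as 71. The sketch below contrasts the two on toy scores; the values and the `truncated` and `rounded` names are made up for illustration.

```python
import pandas as pd

# Toy scores chosen so that one group mean has a large fractional part.
toy = pd.DataFrame({
    "imd_band_pct": ["0-10", "0-10", "0-10", "20-30", "20-30", "20-30"],
    "score_pct": [70, 72, 73, 73, 75, 77],
})

means = toy.groupby("imd_band_pct")["score_pct"].mean()

truncated = means.astype(int)          # fractional part is discarded
rounded = means.round().astype(int)    # rounded to the nearest integer first

print(pd.DataFrame({"mean": means, "truncated": truncated, "rounded": rounded}))
```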

In [10]:
sadata.groupby('highest_edu')['score_pct'].mean().astype(int)

highest_edu
A Level or Equivalent          75
HE Qualification               77
Lower Than A Level             73
No Formal quals                69
Post Graduate Qualification    83
Name: score_pct, dtype: int32

In [11]:
sadata.groupby('num_prev_attempts')['score_pct'].mean().astype(int)

num_prev_attempts
0    75
1    71
2    70
3    66
4    69
5    67
6    77
Name: score_pct, dtype: int32

In [12]:
# Average score of students with no formal qualifications per code module
sadata[sadata['highest_edu'] == 'No Formal quals'].groupby('code_module')['score_pct'].mean().astype(int)

code_module
BBB    70
CCC    66
DDD    60
EEE    72
FFF    71
GGG    71
Name: score_pct, dtype: int32

In [13]:
# Average scores of students in the lowest IMD band (0-10%) per number of previous attempts
sadata[sadata['imd_band_pct'] == '0-10'].groupby('num_prev_attempts')['score_pct'].mean().astype(int)

num_prev_attempts
0    72
1    69
2    67
3    57
4    72
5    58
6    75
Name: score_pct, dtype: int32
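Group means such as those above say nothing about how many rows sit behind each figure. One way to judge how much weight a mean deserves is to request the row count alongside it, as in this sketch on made-up data (the `toy` and `summary` names are hypothetical).

```python
import pandas as pd

# Toy data in which one group (6 previous attempts) holds a single row.
toy = pd.DataFrame({
    "num_prev_attempts": [0, 0, 0, 0, 1, 1, 6],
    "score_pct": [75, 80, 70, 75, 70, 72, 77],
})

# Passing a list of function names to agg() yields one column per statistic.
summary = toy.groupby("num_prev_attempts")["score_pct"].agg(["mean", "count"])
print(summary)
```

A high mean backed by a count of one is a single observation, not a trend.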

# Grouping by multiple variables

In [14]:
# What is the average score per number of previous attempts within each IMD band per geographical region?
gp_multi_1 = sadata.groupby(['region', 'imd_band_pct', 'num_prev_attempts'])['score_pct'].mean().astype(int)
gp_multi_1.sample(n = 5)

region                imd_band_pct  num_prev_attempts
South West Region     40-50         1                    72
Scotland              40-50         2                    77
West Midlands Region  30-40         2                    72
North Western Region  60-70         0                    76
South Region          80-90         2                    62
Name: score_pct, dtype: int32

In [15]:
# What is the average score for each code module per number of previous attempts per education level within each age band?
gp_multi_2 = sadata.groupby(['age_band', 'highest_edu', 'num_prev_attempts', 'code_module'])['score_pct'].mean().astype(int)
gp_multi_2.sample(n=5)

age_band  highest_edu            num_prev_attempts  code_module
35-55     HE Qualification       4                  BBB            75
          A Level or Equivalent  1                  DDD            71
0-35      A Level or Equivalent  1                  FFF            74
          Lower Than A Level     2                  GGG            81
55<=      HE Qualification       0                  FFF            83
Name: score_pct, dtype: int32
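Grouping by several columns returns a Series with a MultiIndex, which can be hard to scan in long form. A minimal sketch, on made-up data with hypothetical `grouped` and `wide` names, of two common ways to work with such a result: looking up one combination with a tuple, and pivoting one index level into columns with unstack().

```python
import pandas as pd

# Toy data with two grouping columns.
toy = pd.DataFrame({
    "age_band": ["0-35", "0-35", "35-55", "35-55"],
    "code_module": ["BBB", "FFF", "BBB", "FFF"],
    "score_pct": [70, 74, 75, 83],
})

grouped = toy.groupby(["age_band", "code_module"])["score_pct"].mean()

# Look up a single combination of group keys with a tuple.
print(grouped.loc[("35-55", "FFF")])

# Move the code_module level into columns: one row per age band.
wide = grouped.unstack("code_module")
print(wide)
```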

# Brief Illustration of Acquiring Descriptive Statistics

In [18]:
gp_codes = sadata.groupby(
    ['code_module', 'code_presentation']
).agg(
    {
        'num_prev_attempts': "max", # maximum previous attempts within each code presentation per code module
        'score_pct': "mean" # average score within each code presentation per code module
        # based on business requirements, many more descriptive statistics can be generated in similar fashion
    }
)
gp_codes.sample(n = 5)

                               num_prev_attempts  score_pct
code_module code_presentation
BBB         2013J                              5  78.608247
BBB         2014J                              5  66.872123
FFF         2013J                              4  76.418760
FFF         2014J                              6  77.951975
CCC         2014B                              0  72.747125
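The dictionary form of agg() keeps the original column names, which becomes ambiguous once several statistics are taken from the same column. As an alternative sketch, named aggregation (available in pandas 0.25 and later) lets each output column be labelled explicitly. The data and the output names `max_prev_attempts`, `mean_score`, and `n_rows` are made up for illustration.

```python
import pandas as pd

# Toy data with the same grouping columns as the cell above.
toy = pd.DataFrame({
    "code_module": ["BBB", "BBB", "FFF", "FFF"],
    "code_presentation": ["2013J", "2013J", "2014J", "2014J"],
    "num_prev_attempts": [0, 5, 1, 6],
    "score_pct": [70, 80, 75, 85],
})

# Each keyword is an output column: (source column, aggregation function).
stats = toy.groupby(["code_module", "code_presentation"]).agg(
    max_prev_attempts=("num_prev_attempts", "max"),
    mean_score=("score_pct", "mean"),
    n_rows=("score_pct", "size"),
)
print(stats)
```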


# Conclusion

This exercise illustrated EDA using the Open Learning Analytics dataset. After a minor cleanup of rogue data, various single-variable and multiple-variable analyses were conducted. I put the power of Pandas' groupby() function to work to obtain views of the data from numerous angles. Lastly, I illustrated how to use the agg() function to retrieve descriptive statistics. This EDA only scratches the surface: as usual, let specific business requirements drive the analysis.