<center><h1>Data Pipeline</h1></center>
This notebook shows how to use the `utils` package to build a data pipeline for the `open-university` research project.

The entire pipeline has three steps:
1. load raw data with the `data_loader` module
2. preprocess data with the `preprocessing` module
3. wrap and combine features with the `features` module

In [1]:
import sys
import os
import pandas as pd
sys.path = [os.path.abspath('..')] + sys.path # don't need this if you have installed the utils
import seaborn as sns
from collections import Counter
from utils import settings, data_loader, features, exceptions, preprocessing

### 1. load data

In [2]:
# utils comes with a data_loader module that simplifies loading all the data
# data_loader.get_raw_data() reads every .csv file inside settings.DATA_DIR into a Python dictionary
data_container = data_loader.get_raw_data()
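
For readers without the `utils` package at hand, the idea behind such a loader can be sketched with plain pandas. This is only an illustration, not the actual implementation of `data_loader.get_raw_data()`; the function name `load_csv_dir` is hypothetical, and keying the dictionary by file name without the extension is an assumption.

```python
from pathlib import Path

import pandas as pd


def load_csv_dir(data_dir):
    """Read every .csv file in data_dir into a dict of DataFrames.

    Hypothetical helper: keys are the file names without the .csv extension.
    """
    return {
        path.stem: pd.read_csv(path)
        for path in sorted(Path(data_dir).glob("*.csv"))
    }
```

Calling `load_csv_dir(some_directory)` on a folder of .csv files would then give a dictionary whose keys can be listed with `.keys()`, as in the next cell.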

In [3]:
# here are the seven files
data_container.keys()

dict_keys(['student_assessment', 'student_info', 'student_vle', 'courses', 'vle', 'student_registration', 'assessments'])

![database schema](database_schema.png)

In [4]:
# the data_loader module also comes with a train_test_split function
# by default, this function splits the data into training and testing sets by presentation
# (the counts below show 2014J held out for testing)
train, test = data_loader.train_test_split(data_container['courses'])

In [5]:
train['code_presentation'].value_counts()

2013J    6
2014B    6
2013B    3
Name: code_presentation, dtype: int64

In [6]:
test['code_presentation'].value_counts()

2014J    7
Name: code_presentation, dtype: int64
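
The split above keeps the 2013J, 2014B, and 2013B presentations for training and holds out 2014J for testing. A minimal sketch of that kind of split in plain pandas follows. The function `split_by_presentation`, its parameters, and the toy `courses_demo` frame are hypothetical and are not part of `utils`; the real `train_test_split` may work differently.

```python
import pandas as pd


def split_by_presentation(df, test_presentations=("2014J",), col="code_presentation"):
    """Return (train, test), where test holds the rows of the held-out presentations."""
    is_test = df[col].isin(test_presentations)
    return df[~is_test], df[is_test]


# Toy data for illustration only, not the real courses table.
courses_demo = pd.DataFrame({
    "code_module": ["AAA", "AAA", "BBB", "BBB"],
    "code_presentation": ["2013J", "2014J", "2013B", "2014J"],
})
train_demo, test_demo = split_by_presentation(courses_demo)
```

Splitting on the presentation column rather than at random keeps every row of a given presentation on the same side of the split.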

### 2. extract features

Preprocessing and feature engineering are the **most time-consuming** parts of learning analytics.
If:
1. you don't care about explaining what the features mean, and
2. you have enough data

I strongly suggest you try out one of the deep learning frameworks, such as [TensorFlow](https://www.tensorflow.org/).

In [7]:
# utils.preprocessing comes with two helper classes that preprocess numeric and categorical features
# you can check the code to see how to customize their behavior
num = preprocessing.NumericData()
cat = preprocessing.CategoricalData()
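
As a rough picture of what numeric and categorical preprocessing typically involves, here is a small pandas sketch: z-score standardization for numbers and one-hot encoding for categories. It is an assumption that `NumericData` and `CategoricalData` behave along these lines; the helper names `standardize` and `one_hot`, the choice of population standard deviation, and the toy values are all hypothetical.

```python
import pandas as pd


def standardize(series):
    """Z-score a numeric Series (population standard deviation; an assumption)."""
    return (series - series.mean()) / series.std(ddof=0)


def one_hot(series):
    """One-hot encode a categorical Series, prefixing columns with its name."""
    return pd.get_dummies(series, prefix=series.name, dtype=float)


# Toy values for illustration only.
credits_demo = pd.Series([60, 120, 60, 240], name="studied_credits")
gender_demo = pd.Series(["M", "F", "F", "M"], name="gender")

scaled_demo = standardize(credits_demo)   # mean 0, unit variance
encoded_demo = one_hot(gender_demo)       # columns gender_F and gender_M
```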

In [8]:
# preprocessing.ColumnExtractor is a helper class that extracts a specific column from the raw data and sets the index
extractor = preprocessing.ColumnExtractor(
    data_container['student_info'], 
    index_col=['code_module', 'code_presentation', 'id_student'])
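
Conceptually, an extractor like this indexes the table once and then hands back single columns that share that index. The sketch below shows the idea in plain pandas. `SimpleColumnExtractor` is a hypothetical stand-in, not the `utils` class, and the toy rows are made up for illustration.

```python
import pandas as pd


class SimpleColumnExtractor:
    """Hypothetical stand-in: index a DataFrame once, then pull out single columns."""

    def __init__(self, df, index_col):
        self._indexed = df.set_index(index_col)

    def extract(self, col_name):
        # Returns a Series that carries the (multi-)index set in __init__.
        return self._indexed[col_name]


# Toy rows for illustration only.
info_demo = pd.DataFrame({
    "code_module": ["AAA", "AAA"],
    "code_presentation": ["2013J", "2013J"],
    "id_student": [1, 2],
    "gender": ["M", "F"],
})
extractor_demo = SimpleColumnExtractor(
    info_demo, index_col=["code_module", "code_presentation", "id_student"])
gender_col_demo = extractor_demo.extract("gender")
```

Because every extracted column shares the same index, the columns can later be aligned row by row without an explicit join key.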

In [9]:
# we will only extract a small set of columns from the raw data for demonstration purposes
columns = []

cat_column_names = ['gender', 'region', 'highest_education']
num_column_names = ['studied_credits']

for col_name in cat_column_names:
    columns.append((col_name, extractor.extract(col_name), cat)) # (column_name, raw_data, processor_obj)

for col_name in num_column_names:
    columns.append((col_name, extractor.extract(col_name), num))

In [10]:
# now, encode the columns if needed, and wrap each one in the Feature class
# features.FeatureDict is a helper class that holds all the features
feature_container = features.FeatureDict()

for col_name, raw_data, processor in columns:
    processed = processor.fit_transform(raw_data)
    feature = features.Feature(col_name, processed)
    feature_container[feature.name] = feature

### 3. combine features

In [11]:
# you can merge all the features easily with the FeatureDict.merge() method
merged = feature_container.merge([])
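
Merging features that share an index amounts to a column-wise concatenation. The following sketch uses `pd.concat` on toy frames to show the idea. It is an assumption that `FeatureDict.merge()` does something comparable; the variable names and values here are hypothetical.

```python
import pandas as pd

# Toy features sharing one MultiIndex, for illustration only.
index_demo = pd.MultiIndex.from_tuples(
    [("AAA", "2013J", 1), ("AAA", "2013J", 2)],
    names=["code_module", "code_presentation", "id_student"],
)
gender_feature = pd.DataFrame(
    {"gender_M": [1.0, 0.0], "gender_F": [0.0, 1.0]}, index=index_demo)
credits_feature = pd.DataFrame({"studied_credits": [0.5, -0.5]}, index=index_demo)

# axis=1 places the features side by side, aligning rows on the shared index.
merged_demo = pd.concat([gender_feature, credits_feature], axis=1)
```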

In [12]:
merged.data.head()

| code_module | code_presentation | id_student | gender_M | gender_F | region_Scotland | region_East Anglian Region | region_London Region | region_South Region | region_North Western Region | region_West Midlands Region | region_South West Region | region_East Midlands Region | region_South East Region | region_Wales | highest_education_A Level or Equivalent | highest_education_Lower Than A Level | highest_education_HE Qualification | highest_education_No Formal quals | highest_education_Post Graduate Qualification | studied_credits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAA | 2013J | 11391 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 3.901483 |
| AAA | 2013J | 28400 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | -0.481076 |
| AAA | 2013J | 30268 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.481076 |
| AAA | 2013J | 31604 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.481076 |
| AAA | 2013J | 32885 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | -0.481076 |
