# (6) Cleaning Coursera Metadata

* **author** = Diego Sapunar-Opazo
* **copyright** = Copyright 2019, Thesis M.Sc. Diego Sapunar - Pontificia Universidad Católica de Chile
* **credits** = Diego Sapunar-Opazo, Ronald Perez, Mar Perez-Sanagustin, Jorge Maldonado-Mahauad
* **maintainer** = Diego Sapunar-Opazo
* **email** = dasapunar@uc.cl
* **status** = Dev

This script reads Coursera's report and extracts the metadata needed for the analysis. It also uses information from the face-to-face (f2f) component, for example to match Coursera's weeks with the weeks of the f2f course.

The files created will be:

coursera_weeks.csv

(1) **week_id**, which corresponds to Coursera's internal id for a week

(2) **week**, which corresponds to the week related to the face-to-face component

coursera_lessons.csv

(1) **lesson_id**, which corresponds to Coursera's internal id for a lesson

(2) **week**, which corresponds to the week related to the face-to-face component

coursera_items.csv

(1) **item_id**, which corresponds to Coursera's internal id for an item

(2) **item_type_id**, which corresponds to Coursera's internal id for an item's type

(3) **week**, which corresponds to the week related to the face-to-face component

coursera_weeks_items.csv: coursera_items.csv aggregated by week

(1) **week**, which corresponds to the week related to the face-to-face component

(2) **videos**, number of videos in that week

(3) **readings**, number of readings in that week

(4) **quiz**, number of quizzes in that week

## Part 0: Import Packages

In [1]:
# data analysis and wrangling
import pandas as pd
import numpy as np

## Part 1: Getting the Data

In [2]:
def read_data(path):
    '''
    Read a .csv file and convert it into a Pandas DataFrame.
    
    Input:
    path - String: path where the .csv is located.
    
    Output:
    Pandas DataFrame: .csv in the Pandas DataFrame format.
    '''
    return pd.read_csv(path)

## Part 2: Data Preprocessing

In [11]:
def preprocc_data(df, slices=False, columns_to_rename=False, categories=False):
    '''
    From a dataframe, (1) keep only the necessary columns; (2) rename columns; and (3) cast columns to category type.
    
    Input: 
    df - Pandas DataFrame: dataframe to be cleaned.
    slices - List of Ints: positions of the columns to keep. If False, all columns are kept.
    columns_to_rename - Dict: columns to rename. Key: original name, Value: new name.
    categories - List of Strings: names of the columns to cast to category type. If you renamed some columns, use the new names.
    
    Output:
    df - Pandas DataFrame: the dataframe already cleaned.
    '''
    
    df_cleaned = df.copy()
    
    # slicing the columns, keeping only the ones needed
    if slices:
        df_cleaned = df_cleaned.iloc[:, slices]
    
    del df  # clean memory
    
    # rename columns
    if columns_to_rename:
        df_cleaned.rename(columns_to_rename, 
                          inplace=True, 
                          axis=1)
    
    if categories:
        for cat in categories:
            # creating categories
            df_cleaned[cat] = df_cleaned[cat].astype('category')
    
    return df_cleaned
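
The three steps performed by `preprocc_data` (slice by position, rename, cast to category) can be sketched on a toy table. This is a minimal illustration, not the notebook's own code: the frame `toy` and its values are made up, and the steps are chained inline instead of calling the function.

```python
import pandas as pd

# Toy stand-in for a Coursera report table (values are invented).
toy = pd.DataFrame({
    'course_branch_id': ['b1', 'b1', 'b1'],
    'course_module_id': ['m1', 'm2', 'm3'],
    'course_branch_module_order': [0, 1, 2],
})

# (1) keep columns by position, (2) rename them, (3) cast one to category
toy_clean = (
    toy.iloc[:, [1, 2]]
       .rename(columns={'course_module_id': 'week_id',
                        'course_branch_module_order': 'week'})
       .astype({'week_id': 'category'})
)

toy_columns = list(toy_clean.columns)
toy_dtype = str(toy_clean['week_id'].dtype)
```

Selecting columns by position, as `slices` does, depends on the column order of the input file. Selecting by name would be more robust if the report layout ever changes.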

In [25]:
def merging(df1, df2, variable1, variable2):
    '''
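    Merge df1 and df2 on a pair of key columns. Rows with missing values are dropped
    and both keys are cast to str before merging. Note: df1 and df2 are modified in place.
    
    Input:
    df1 - Pandas DataFrame
    df2 - Pandas DataFrame
    variable1 - String: name of the key column in df1.
    variable2 - String: name of the key column in df2.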
    
    Output:
    Pandas DataFrame
    '''

    df1.dropna(inplace=True)
    df2.dropna(inplace=True)
    
    # getting same types
    df1[variable1] = df1[variable1].astype('str')
    df2[variable2] = df2[variable2].astype('str')
    
    return pd.merge(left=df1, right=df2, left_on=variable1, right_on=variable2)
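
Casting both keys to `str` matters when the same id is stored with different dtypes in the two tables. The sketch below shows the idea on toy data. The helper name `merge_on_str_keys` and all values are hypothetical, and unlike `merging` it works on copies so the caller's dataframes are left untouched.

```python
import pandas as pd

def merge_on_str_keys(df1, df2, key1, key2):
    # Hypothetical non-mutating variant: dropna() returns new frames,
    # so the inputs keep their rows and dtypes.
    left = df1.dropna().copy()
    right = df2.dropna().copy()
    # Align key dtypes so equal ids compare equal.
    left[key1] = left[key1].astype(str)
    right[key2] = right[key2].astype(str)
    return pd.merge(left, right, left_on=key1, right_on=key2)

# Toy data: integer keys on one side, string keys on the other.
lessons = pd.DataFrame({'course_module_id': [1, 2, 3],
                        'course_lesson_id': ['l1', 'l2', 'l3']})
weeks = pd.DataFrame({'week_id': ['1', '2', '4'],
                      'week': [1, 1, 2]})

merged = merge_on_str_keys(lessons, weeks, 'course_module_id', 'week_id')
matched_lessons = merged['course_lesson_id'].tolist()
```

The merge is an inner join (the `pd.merge` default), so lessons whose module has no matching week are silently dropped.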
    

## Part 3: Export Data

In [26]:
def export_data(df, path, columns_to_drop=False):
    '''
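    Export df as a .csv file to the given path.
    
    Input:
    df - Pandas DataFrame: dataframe to be exported.
    path - String: path where the .csv will be exported.
    columns_to_drop - List of Strings: columns to drop before exporting. If False, no column is dropped.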
    '''
    if columns_to_drop:
        df.drop(columns_to_drop, axis=1, inplace=True)
        
    df.to_csv(path, index=False)
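
To see what `index=False` does to the exported file, the following sketch writes a toy dataframe to an in-memory buffer instead of a file on disk. The frame, the `scratch` column and the buffer are assumptions made for illustration only.

```python
import io
import pandas as pd

# Toy dataframe with one helper column that should not be exported.
toy = pd.DataFrame({'week_id': ['m1', 'm2'],
                    'week': [1, 2],
                    'scratch': [0, 0]})

buffer = io.StringIO()
# Drop the helper column and write without the row index.
toy.drop(columns=['scratch']).to_csv(buffer, index=False)

csv_text = buffer.getvalue()
header = csv_text.splitlines()[0]

# Reading the text back gives the same two columns and two rows.
round_trip = pd.read_csv(io.StringIO(csv_text))
round_trip_columns = list(round_trip.columns)
```

Without `index=False`, the CSV would gain an unnamed first column holding the row index, which would then reappear as `Unnamed: 0` when the file is read back.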

## Part 4: Main

In [70]:
_branch = 'branch~JZ-LmYwtEeijexJtZffLBA'

# WEEKS
_weeks_path = '../data/original_data/coursera/CourseraReport/course_branch_modules.csv'
df_modules = read_data(_weeks_path)

mask_modules_branch = df_modules['course_branch_id'] == _branch
df_modules = df_modules.loc[mask_modules_branch]

_columns_to_rename = {
    'course_module_id': 'week_id',
    'course_branch_module_order': 'week'
}
df_weeks = preprocc_data(df_modules, slices=[1,2], columns_to_rename=_columns_to_rename)
# shift the module order by one, then map the shifted values 2..12 back to 1..11
# so that the weeks match the course design
df_weeks['week'] = df_weeks['week'] + 1

values_to_replace = {
    2: 1,
    3: 2,
    4: 3,
    5: 4,
    6: 5,
    7: 6,
    8: 7,
    9: 8,
    10: 9,
    11: 10,
    12: 11
}
df_weeks.replace(values_to_replace, inplace=True)
export_data(df_weeks, '../data/final_model/coursera_weeks.csv')
del df_modules

# LESSONS
_lessons_path = '../data/original_data/coursera/CourseraReport/course_branch_lessons.csv'
df_lessons = read_data(_lessons_path)

mask_lessons_branch = df_lessons['course_branch_id'] == _branch
df_lessons = df_lessons.loc[mask_lessons_branch]

df_lessons = merging(df_lessons, df_weeks, 'course_module_id', 'week_id')

_columns_to_rename = {
    'course_lesson_id': 'lesson_id'
}
df_lessons = preprocc_data(df_lessons, slices=[1, 6], columns_to_rename=_columns_to_rename)

export_data(df_lessons, '../data/final_model/coursera_lessons.csv')

# ITEMS
_items_path = '../data/original_data/coursera/CourseraReport/course_branch_items.csv'

df_items = pd.read_csv(_items_path, usecols=list(range(5)))

mask_items_branch = df_items['course_branch_id'] == _branch
df_items = df_items.loc[mask_items_branch]

df_items = merging(df_items, df_lessons, 'course_lesson_id', 'lesson_id')
del df_lessons

_columns_to_rename = {
    'course_item_id': 'item_id',
    'course_item_type_id': 'item_type_id'
}
df_items = preprocc_data(df_items, slices=[1, 4, 5, 6], columns_to_rename=_columns_to_rename)
df_items.replace({106: 6}, inplace=True)
export_data(df_items, '../data/final_model/coursera_items.csv')
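
The week remapping in the WEEKS step (add one, then replace 2..12 with 1..11) can be checked on a handful of values. The sketch below assumes the module order starts at 0, which the notebook does not state, and computes the lookup in plain Python so that each value is mapped exactly once.

```python
import pandas as pd

# Assumed raw module orders (0-based is an assumption, not confirmed by the source).
order = pd.Series([0, 1, 2, 3, 11])

shifted = order + 1
values_to_replace = {k: k - 1 for k in range(2, 13)}  # same mapping as in the cell above

# Apply the mapping once per value; values missing from the dict stay as they are.
week = [values_to_replace.get(v, v) for v in shifted.tolist()]

# Under the 0-based assumption, the net effect equals clipping the order at 1.
clipped = order.clip(lower=1).tolist()
```

Here `week` is `[1, 1, 2, 3, 11]`, so orders 0 and 1 both end up in week 1. If that is the intent, `clip(lower=1)` expresses it in one step. It also avoids calling `replace` on the whole dataframe, which applies the mapping to every column and not only to `week`.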

In [130]:
# Creating coursera_weeks_items.csv
# count the items per (week, item type)
aux = df_items.groupby(['week', 'item_type_id']).count()

# one row per week, one column per item type
n = pd.pivot_table(aux,
                   index='week',
                   columns='item_type_id',
                   values='lesson_id',
                   aggfunc=np.sum)
export_data(n.reset_index().rename({1: 'videos', 3: 'readings', 6: 'quiz'}, axis=1),
            '../data/final_model/coursera_weeks_items.csv')
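
The aggregation above can be illustrated on a toy items table. This is a sketch with made-up rows, and it counts directly inside `pivot_table` instead of grouping first. The type ids 1, 3 and 6 and their labels follow the rename used in the cell above.

```python
import pandas as pd

# Toy items table (rows are invented for illustration).
items = pd.DataFrame({
    'item_id': ['i1', 'i2', 'i3', 'i4', 'i5'],
    'item_type_id': [1, 1, 3, 6, 1],
    'week': [1, 1, 1, 1, 2],
    'lesson_id': ['l1', 'l1', 'l2', 'l2', 'l3'],
})

# One row per week, one column per item type, cells hold item counts.
counts = pd.pivot_table(items,
                        index='week',
                        columns='item_type_id',
                        values='item_id',
                        aggfunc='count')

weeks_items = counts.reset_index().rename(
    columns={1: 'videos', 3: 'readings', 6: 'quiz'})
```

A week without any item of a given type gets a missing value, not 0, in that column (here, week 2 has no readings or quizzes). Chaining `.fillna(0)` before exporting is one way to make those zeros explicit.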