# (2) Cleaning Student lms ID

* **author** = Diego Sapunar-Opazo
* **copyright** = Copyright 2019, Thesis M.Sc. Diego Sapunar - Pontificia Universidad Católica de Chile
* **credits** = Diego Sapunar-Opazo, Ronald Perez, Mar Perez-Sanagustin, Jorge Maldonado-Mahauad
* **maintainer** = Diego Sapunar-Opazo
* **email** = dasapunar@uc.cl
* **status** = Dev

This notebook reads the Coursera gradebook and the students .csv file and creates a .csv file with two columns:

(1) **num_alumno**, the internal ID of each face-to-face student, and

(2) **lms_id**, the student's internal Coursera LMS ID.

## Part 0: Import Packages

In [1]:
# data analysis and wrangling
import pandas as pd
import numpy as np

## Part 1: Getting the Data

In [2]:
def read_data(path):
    '''
    Read a .csv file and convert it into a Pandas DataFrame.
    
    Input:
    path - String: path where the .csv is located.
    
    Output:
    Pandas DataFrame: .csv in the Pandas DataFrame format.
    '''
    return pd.read_csv(path)
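
`pd.read_csv` infers column types, so an ID column that looks numeric is read as integers. If the student IDs can contain leading zeros (an assumption; the real data is not shown here), passing `dtype` keeps them as text. The sketch below uses made-up CSV content in memory rather than the project's files.

```python
import io
import pandas as pd

# Hypothetical CSV content; the real files are not shown in this notebook.
csv_text = "num_alumno,sec\n00123,1\n04567,2\n"

# Default parsing infers integers, so leading zeros are lost.
df_default = pd.read_csv(io.StringIO(csv_text))

# Forcing the ID column to str keeps the values exactly as written.
df_as_str = pd.read_csv(io.StringIO(csv_text), dtype={"num_alumno": str})

print(df_default["num_alumno"].tolist())
print(df_as_str["num_alumno"].tolist())
```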

## Part 2: Data Preprocessing

In [3]:
def preprocc_data(df, slices=False, columns_to_rename=False, categories=False):
    '''
    Clean a dataframe in three optional steps: (1) keep the needed columns; (2) rename columns; and (3) cast columns to category type.
    
    Input: 
    df - Pandas DataFrame: dataframe to be cleaned.
    slices - List of Ints: positions of the columns to keep.
    columns_to_rename - Dict: columns to rename. Key: original name, Value: new name.
    categories - List of Strings: names of the columns to cast to category type. If you renamed some columns, use the new names.
    
    Output:
    Pandas DataFrame: a cleaned copy of the input dataframe.
    '''
    
    df_cleaned = df.copy()
    
    # keep only the needed columns, selected by position
    if slices:
        df_cleaned = df_cleaned.iloc[:,slices]
    
    del df  # drop the local reference to the original dataframe
    
    # rename columns
    if columns_to_rename:
        # use the function argument, not the global _columns_to_rename
        df_cleaned = df_cleaned.rename(columns=columns_to_rename)
    
    if categories:
        for cat in categories:
            # creating categories
            df_cleaned[cat] = df_cleaned[cat].astype('category')
    
    return df_cleaned
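
The three steps above (select by position, rename, cast to category) can be tried on a toy dataframe. This is only a sketch of the same pandas operations: the values and the `Full Name` column are invented for illustration, and only the two renamed header names come from this notebook.

```python
import pandas as pd

# Toy gradebook; values and the "Full Name" column are hypothetical.
gradebook = pd.DataFrame({
    "Anonymized Coursera ID": ["abc123", "def456"],
    "Full Name": ["A", "B"],
    "Student ID": [1001, 1002],
})

# (1) keep columns 0 and 2 by position, (2) rename them
cleaned = gradebook.iloc[:, [0, 2]].rename(
    columns={"Anonymized Coursera ID": "lms_id", "Student ID": "num_alumno"}
)

# (3) cast one column to the category dtype
cleaned["lms_id"] = cleaned["lms_id"].astype("category")

print(cleaned.dtypes)
```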

In [4]:
def get_students(df_lms, df_students):
    '''
    Keep only the lms_id values in df_lms that belong to students in df_students.
    
    Input:
    df_lms - Pandas DataFrame: df with the lms ids.
    df_students - Pandas DataFrame: df with the students of interest.
    
    Output:
    Pandas DataFrame: df_students merged with df_lms on num_alumno, restricted to the students present in both.
    '''
    # getting same types
    df_lms['num_alumno'] = df_lms['num_alumno'].astype('str')
    df_students['num_alumno'] = df_students['num_alumno'].astype('str')
    
    return df_students.merge(df_lms, on='num_alumno')
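
`DataFrame.merge` defaults to an inner join, so students without a matching gradebook row disappear without notice. One way to audit this, sketched here on invented data, is a left join with `indicator=True`, plus `validate` to fail early if an ID is duplicated on either side. The IDs and the `audit`, `matched`, and `unmatched` names are illustrative only.

```python
import pandas as pd

# Hypothetical inputs with IDs already stored as strings.
df_students = pd.DataFrame({"num_alumno": ["1001", "1002", "1003"], "sec": [1, 1, 2]})
df_lms = pd.DataFrame({"num_alumno": ["1001", "1003", "9999"], "lms_id": ["abc", "def", "zzz"]})

# Left join keeps every student; _merge records where each row was found.
audit = df_students.merge(
    df_lms,
    on="num_alumno",
    how="left",
    indicator=True,
    validate="one_to_one",  # raises MergeError if a key repeats on either side
)

# Rows found in both tables are what an inner join would return.
matched = audit[audit["_merge"] == "both"].drop(columns="_merge")

# Students with no gradebook entry.
unmatched = audit.loc[audit["_merge"] == "left_only", "num_alumno"].tolist()

print(len(matched), unmatched)
```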

## Part 3: Export Data

In [5]:
def export_data(df, path):
    '''
    Export df as a .csv file to the given path, without the 'sec' column.
    
    Input:
    df - Pandas DataFrame: dataframe to be exported.
    path - String: path where the .csv will be exported.
    '''
    # drop on a copy so the caller's dataframe keeps its 'sec' column
    df = df.drop('sec', axis=1)

    df.to_csv(path, index=False)
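
A variant of the export step that leaves the caller's dataframe untouched and tolerates a missing column might look like the sketch below. `export_without_columns` and its `drop` parameter are hypothetical names, and the example writes to an in-memory buffer instead of a real path.

```python
import io
import pandas as pd

def export_without_columns(df, path_or_buffer, drop=("sec",)):
    """Write df to CSV without the listed columns; the input df is not modified."""
    # errors="ignore" skips any listed column that is absent instead of raising.
    out = df.drop(columns=list(drop), errors="ignore")
    out.to_csv(path_or_buffer, index=False)
    return out

# Hypothetical one-row dataframe for illustration.
df = pd.DataFrame({"num_alumno": ["1001"], "sec": [1], "lms_id": ["abc"]})
buffer = io.StringIO()
exported = export_without_columns(df, buffer)

print(buffer.getvalue())
```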

## Part 4: Main

In [199]:
_gradebook_path = '../data/raw_data/coursera/coursera_gradebook_edited.csv'
_students_path = '../data/clean_data/students_sec.csv'
_columns_to_rename = {
    'Anonymized Coursera ID': 'lms_id',
    'Student ID': 'num_alumno'
}

_export_path = '../data/clean_data/students_lms_id.csv'

df_lms = preprocc_data(read_data(_gradebook_path), 
                       slices=[0,2], 
                       columns_to_rename=_columns_to_rename)
df_students = read_data(_students_path)
df = get_students(df_lms, df_students)  # argument order matches the signature
export_data(df, _export_path)

In [200]:
len(df)

226