# Learning Path Recommender System

In [17]:
# importing all the required modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# the following classes and functions come from the surprise package
from surprise import NormalPredictor
from surprise import Reader
from surprise import SVD
# `evaluate` was removed in Surprise 1.1.0; cross_validate (below) replaces it
from surprise import Dataset
from surprise.model_selection import cross_validate
from sklearn.metrics import accuracy_score
%matplotlib inline

### Reading the dataset

Here we read the logs into a pandas DataFrame.

In [18]:
# Read the dataset into pandas data frame
logs_df = pd.read_csv("logs_1000.csv", sep=',', header=None,
                      names=["UserId", "Event", "StartTime", "EndTime"],
                      parse_dates=[2, 3], infer_datetime_format=True)

Let's find out how many rows (events) the log contains.

In [19]:
# Print the number of rows in the data frame
number_of_events = logs_df.shape[0]
print("This Log file has {} rows/events".format(number_of_events))

This Log file has 86395 rows/events


Let's check the structure of the DataFrame.

In [20]:
# Print the logs_df information
logs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86395 entries, 0 to 86394
Data columns (total 4 columns):
UserId       86395 non-null int64
Event        86395 non-null object
StartTime    86395 non-null datetime64[ns]
EndTime      85395 non-null datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(1)
memory usage: 2.6+ MB


Inspect the data by calling `head` on the DataFrame.

In [21]:
# Displaying top 10 rows of dataframe
logs_df.head(10)

| | UserId | Event | StartTime | EndTime |
|---|---|---|---|---|
| 0 | 1 | P1 | 2014-05-13 19:41:34 | NaT |
| 1 | 1 | P1M4L2A2 | 2014-05-13 22:24:13 | 2014-05-13 22:24:13 |
| 2 | 1 | P1M4L2V3 | 2014-05-17 09:50:03 | 2014-05-17 11:50:03 |
| 3 | 1 | P1M4L2V2 | 2014-05-18 04:49:28 | 2014-05-18 06:49:28 |
| 4 | 1 | P1M4L2A1 | 2014-05-20 13:26:18 | 2014-05-20 13:26:18 |
| 5 | 1 | P1M4L2A3 | 2014-05-22 06:37:51 | 2014-05-22 06:37:51 |
| 6 | 1 | P1M4L2V1 | 2014-05-24 00:24:26 | 2014-05-24 02:24:26 |
| 7 | 1 | P1M4L2S1 | 2014-05-24 21:36:31 | 2014-05-24 21:36:31 |
| 8 | 1 | P1M4L2V2 | 2014-05-27 06:17:49 | 2014-05-27 08:17:49 |
| 9 | 1 | P1M4L2A1 | 2014-05-31 16:36:33 | 2014-05-31 16:36:33 |


### Column Definition
- UserId      : Unique ID given to a user/student
- Event       : The event name
- StartTime   : Start time of the event
- EndTime     : End time of the event

### Expanding the event acronym 

Each letter in an event code stands for the following:
-    P Program
-    M Module
-    L Lesson
-    A Additional Links
-    Q Quiz
-    V Video
-    S Assignment
-    F Final Assignment

Example: `P1M4L2V2` is the event for Program 1, Module 4, Lesson 2, Video 2.
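The expansion above can be sketched as a small parser. This is only an illustration: the helper name `parse_event` and the constant `COMPONENT_NAMES` are hypothetical, and the sketch assumes every letter in a code is followed by at least one digit, which matches the sample rows but is not confirmed for every event type.

```python
import re

# Letter-to-name mapping taken from the list above.
COMPONENT_NAMES = {
    "P": "Program",
    "M": "Module",
    "L": "Lesson",
    "A": "Additional Links",
    "Q": "Quiz",
    "V": "Video",
    "S": "Assignment",
    "F": "Final Assignment",
}

# One known letter followed by one or more digits, e.g. "P1" or "V2".
_TOKEN = re.compile(r"([PMLAQVSF])(\d+)")


def parse_event(code):
    """Split an event code such as 'P1M4L2V2' into {component name: number}."""
    # Assumption: a valid code consists only of letter+digits tokens.
    if not re.fullmatch(r"(?:[PMLAQVSF]\d+)+", code):
        raise ValueError(f"unrecognised event code: {code!r}")
    return {COMPONENT_NAMES[letter]: int(number)
            for letter, number in _TOKEN.findall(code)}


print(parse_event("P1M4L2V2"))
# {'Program': 1, 'Module': 4, 'Lesson': 2, 'Video': 2}
```

Applied to the `Event` column (for example with `logs_df["Event"].map(parse_event)`), a helper like this would turn each code into its program, module, lesson, and activity parts.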

#### More info on the events:

-    Every lesson has 8 events.
-    Quizzes and assignments occur only once.
-    Each video lasts 2 hours.
-    Each quiz lasts 30 minutes.
-    For additional links and assignments, the start and end times are the same.
-    The logs cover the period 2013-2018.
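The duration rules above can be checked against the data. The following is a minimal sketch on a toy frame built from three rows of the `head` output; the names `sample`, `EXPECTED`, and `mismatches` are illustrative, and it assumes the event type is the last letter in the code. Final Assignment (F) has no stated duration, so it is left unchecked.

```python
import pandas as pd

# Toy frame built from three rows shown in the head() output above.
sample = pd.DataFrame({
    "Event": ["P1M4L2A2", "P1M4L2V3", "P1M4L2S1"],
    "StartTime": pd.to_datetime(["2014-05-13 22:24:13",
                                 "2014-05-17 09:50:03",
                                 "2014-05-24 21:36:31"]),
    "EndTime": pd.to_datetime(["2014-05-13 22:24:13",
                               "2014-05-17 11:50:03",
                               "2014-05-24 21:36:31"]),
})

# Expected durations per event type, as stated in the list above.
EXPECTED = {
    "V": pd.Timedelta(hours=2),
    "Q": pd.Timedelta(minutes=30),
    "A": pd.Timedelta(0),
    "S": pd.Timedelta(0),
}

# Assumption: the event type is the last letter in the code.
sample["Type"] = sample["Event"].str.extract(r"([A-Z])\d*$", expand=False)
sample["Duration"] = sample["EndTime"] - sample["StartTime"]
sample["Expected"] = sample["Type"].map(EXPECTED)

# Rows whose type has a stated duration but whose actual duration differs.
mismatches = sample[sample["Expected"].notna()
                    & (sample["Duration"] != sample["Expected"])]
print("Rows violating the stated durations:", len(mismatches))
```

Running the same steps on `logs_df` after the null rows are removed would show whether the full log follows these rules.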


### Data Cleaning

In [22]:
# Count number of null values
logs_df.isnull().sum()

UserId          0
Event           0
StartTime       0
EndTime      1000
dtype: int64

The EndTime column has 1000 null entries. Let's inspect them.

In [23]:
# Inspect the null values
logs_df[logs_df["EndTime"].isnull()]

| | UserId | Event | StartTime | EndTime |
|---|---|---|---|---|
| 0 | 1 | P1 | 2014-05-13 19:41:34 | NaT |
| 109 | 2 | P1 | 2016-12-29 20:40:51 | NaT |
| 183 | 3 | P1 | 2013-08-16 14:07:05 | NaT |
| 277 | 4 | P1 | 2017-09-22 04:37:38 | NaT |
| 315 | 5 | P1 | 2017-04-17 00:22:53 | NaT |
| 392 | 6 | P1 | 2014-02-09 18:42:12 | NaT |
| 487 | 7 | P1 | 2013-12-03 02:05:17 | NaT |
| 593 | 8 | P1 | 2014-10-16 16:18:08 | NaT |
| 697 | 9 | P1 | 2017-10-12 04:17:19 | NaT |
| 731 | 10 | P1 | 2013-02-10 13:43:13 | NaT |


The rows with a null EndTime are the program start events. Since we will not use that data for our recommender system, we drop those rows.
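Before dropping, the claim that every null EndTime belongs to a program start event can be verified rather than read off the first few rows. The sketch below uses a toy frame (`toy` is an illustrative name) and assumes a program start event is a `P` followed only by digits, as in the rows shown above.

```python
import pandas as pd

# Toy frame mimicking the log: two program-start rows and one regular event.
toy = pd.DataFrame({
    "UserId": [1, 1, 2],
    "Event": ["P1", "P1M4L2A2", "P1"],
    "StartTime": pd.to_datetime(["2014-05-13 19:41:34",
                                 "2014-05-13 22:24:13",
                                 "2016-12-29 20:40:51"]),
    "EndTime": pd.to_datetime([None, "2014-05-13 22:24:13", None]),
})

# Select the rows with a missing EndTime.
null_rows = toy[toy["EndTime"].isnull()]

# Assumption: a program start event is 'P' followed only by digits.
all_program_starts = null_rows["Event"].str.match(r"^P\d+$").all()
print("All null-EndTime rows are program starts:", all_program_starts)

# Restricting dropna to EndTime makes the intent explicit.
cleaned = toy.dropna(subset=["EndTime"])
print("Rows remaining:", len(cleaned))
```

Because EndTime is the only column with nulls in this log, `dropna(subset=["EndTime"])` and a plain `dropna()` remove the same rows; the `subset` form simply documents which column drives the drop.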

In [24]:
# Dropping the rows with null entries
logs_df.dropna(inplace=True)

In [25]:
# Number of rows dropped 
print("Total number of rows dropped: {}".format(number_of_events - logs_df.shape[0]))

Total number of rows dropped: 1000
