#  Behavior Detection

## Introduction

Computer environments such as those based on educational games, interactive simulations, and educational platforms are providing more and more data, which can enable a personalized adaptation of the environment itself. For instance, this data can be used to train models able to detect the extent to which students are using the educational platform properly and react accordingly. Empowering platforms with these models can serve as a means for adaptive interventions that are of paramount importance to ensure no student is left behind.      

The goal of this homework is to build a **behavior detector**, namely a classifier. Specifically, you are asked to build a detector that classifies whether students are off-task, i.e., whether they are performing interactions that are not related to the classroom's objectives. To this end, we will use a public data set stored in CSV format in <code>ca1-dataset.csv</code>. The dataset includes features at the grain size of all the actions that occurred during a 20-second field observation of a student (so one student can appear in more than one record of the dataset). An example feature associated with one record is the number of wrong actions made by the corresponding student in the last 20 seconds (more details on the features are provided later). In addition to the features, each record includes the "OffTask" label (Y or N), which is the target we ask you to predict from the values of the features.

Specifically, we will ask you to:
1. **Part 1:** Explore the dataset and select up to **5** features from those in the CSV file that you think are the most predictive of the off-task label.
2. **Part 2:** Design, fit, and interpret a Regression model for off-task prediction, based on the features you selected.
3. **Part 3:** Design, fit, and interpret a Decision Tree classifier for off-task prediction, based on the features you selected, and investigate the impact of one hyper-parameter on the final results you obtain.
4. **Part 4:** Design, fit, and interpret a Random Forest classifier for off-task prediction, based on the features you selected, investigate the impact of one hyper-parameter on the final results you obtain, and compare your findings with those you obtained with a single decision tree.
5. **Part 5:** Conduct feature engineering to improve the features in the original data set, using the data in a second, more fine-grained dataset we will provide to you. Specifically, you will be asked to create at least **5** new features that cannot be created using just the original data set, add the new features to the original data set, and see what impact they have on the Random Forest classifier.





## About the data

Our target is to identify one specific student behavior: being off-task. The features describe how the students interact with the system; each one summarizes the individual actions that occurred within one observation.

The description of the features for the aggregated observations in <code>ca1-dataset.csv</code> is provided below.

In the description, each line in the raw data set is associated with an interaction widget in the educational software, denoted by "Cell". In addition, "Production" is the skill that the students are expected to learn.

| Name                   | Description                         |
| ---------------------- | ------------------------------------------------------------ |
| Avgright | Average of the right feature (1 if the action is right, 0 otherwise) during observations |
| Avgbug | Average of the bug feature (1 if there is a bug, 0 otherwise) during observations |
| Avgpchange | Average number of changes of the knowledge estimate during observations |
| Avgtime | Average time spent during observations |
| AvgtimeSDnormed | Average of (time taken – avg(cell)) / SD(cell) for the last action of the observations (SD is the standard deviation function) |
| Avgtimelast3SDnormed | Average of (time taken – avg(cell)) / SD(cell) for the last 3 actions of the observations (SD is the standard deviation function) |
| Avgtimelast5SDnormed | Average of (time taken – avg(cell)) / SD(cell) for the last 5 actions of the observations (SD is the standard deviation function) |
| Avgnotright | Average number of not right actions during observations |
| Avghowmanywrong-up | Average of the total number of actions where this production was wrong |
| Avgwrongpct-up | Average of (total number of actions where this production was wrong, not just on the first attempt) / (number of steps where the skill was encountered so far, including the current one) |
| Avgtimeperact-up | Average of the timeperact-up feature (total time so far on all actions involving this production, for all problems) |
| AvgPrev3Count-up | Average count of how many of the last 3 actions involved the same interface widget, during observations |
| AvgPrev5Count-up | Average count of how many of the last 5 actions involved the same interface widget, during observations |
| Avgrecent5wrong | Average number of wrong actions in the last 5 actions |
| Avgmanywrong-up | Average of the total number of wrong actions up to the current action |
| Unique-id | ID of the aggregated observation (one observation aggregates interactions within a 20-second timeframe) |
| namea | ID of the student |
| OffTask | Classification target |

**The data set is available in the folder data**. 


In [None]:
#### PACKAGE IMPORTS ####

# Your libraries here
# Run this cell first to import all required packages. Do not make any imports elsewhere in the notebook

# YOUR CODE HERE
raise NotImplementedError()

%matplotlib inline

## **0 Load the data set**
---

In [None]:
df = pd.read_csv("./data/ca1-dataset.csv")

# Let's see what the dataframe looks like
print("length of the dataframe:", len(df))
print("first rows of the dataframe:\n")
df.head()

<a id="section1"></a>
## 1 Preprocess the data
----

In this section, your goal is to understand the data and select 5 meaningful features that will be used in the later sections to predict the off-task behavior.   

Specifically, you should:

1. Select 5 features from the original dataset that are meaningful for the off-task prediction task.  
2. Justify your decision: How did you select the five features? Add visualizations (from your Exploratory Data Analysis) to support your answer. 
3. Create a function that splits the data set into X (the five features you selected) and y (target variable) and converts the target variable to an appropriate format. 
4. Justify your decision: What proportion of the data set will be used to validate the models? Why? 
5. Finally, carry out any other necessary preprocessing steps.
6. Justify the changes made to the data.
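
As a starting point for the exploratory analysis, the sketch below shows two common checks: the class balance of the target and the absolute correlation of each numeric feature with the 0/1-encoded target. It runs on a synthetic stand-in, not on the homework data; the frame `toy` and its values are hypothetical, and correlation is only one of several reasonable selection criteria.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the homework data (hypothetical values, not ca1-dataset.csv).
rng = np.random.default_rng(0)
n = 200
off_task = rng.random(n) < 0.2  # imbalanced target, roughly 20% positives
toy = pd.DataFrame({
    "Avgtime": np.where(off_task, rng.normal(30, 5, n), rng.normal(10, 5, n)),
    "Avgright": rng.random(n),  # unrelated to the target by construction
    "OffTask": np.where(off_task, "Y", "N"),
})

# 1) Class balance: accuracy alone can mislead when one class dominates.
class_share = toy["OffTask"].value_counts(normalize=True)

# 2) Absolute correlation of each numeric feature with the 0/1-encoded target.
target = (toy["OffTask"] == "Y").astype(int)
correlations = (
    toy.drop(columns="OffTask")
    .corrwith(target)
    .abs()
    .sort_values(ascending=False)
)
print(class_share)
print(correlations)
```

A strongly imbalanced `class_share` is worth keeping in mind later, because a model that always predicts the majority class already reaches a high accuracy.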

<a id="section1.1"></a>
### 1.1 

List 5 features from the original dataset that are meaningful for the off-task prediction task.  

In [None]:
#### GRADED CELL ####
### 1.1

meaningful_features = []

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
df_five = df[meaningful_features + ['OffTask']]
df_five_id = df[['Unique-id']]
df_five.head()

### 1.2
Justify your decisions. How did you select the five features? Add visualizations to support your answer. 

YOUR ANSWER HERE

### 1.3

Create a function that splits the data set into training and validation sets. The target variable (y) is off-task.
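
A minimal sketch of such a split on a toy frame, assuming scikit-learn's `train_test_split` is used. The frame `toy` and its column names are hypothetical; `stratify` and `random_state` are optional choices made here for illustration, not requirements of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for df_five (hypothetical values).
toy = pd.DataFrame({
    "feat_a": np.arange(20, dtype=float),
    "feat_b": np.arange(20, dtype=float) ** 2,
    "OffTask": ["Y"] * 5 + ["N"] * 15,
})

X = toy.drop(columns="OffTask")
y = (toy["OffTask"] == "Y").astype(int).to_numpy()  # Y -> 1, N -> 0

# stratify=y keeps the Y/N ratio the same in both splits; random_state
# makes the split reproducible. Both are choices, not requirements.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(X_tr.shape, X_va.shape, y_va.mean())
```

Stratifying matters most when the positive class is rare, since a purely random 20% split could otherwise contain very few off-task observations.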

In [None]:
#### GRADED CELL ####
### 1.3
def split_data(df):
    """
    Splits data into X_train, X_val, y_train, y_val. 
    
    20% of the data should be randomly assigned to the validation set. 
    
    Parameters
    ----------
    df : DataFrame with the five selected features 
    
    Returns:
    -------
    X_train: DataFrame with features (training set) 
    X_val: DataFrame with features (validation set)
    y_train: np.array with target variable (training set) (Should take values 0,1)
    y_val: np.array with target variable (validation set) (Should take values 0,1)
                
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return X_train, X_val, y_train, y_val

In [None]:
X_train, X_val, y_train, y_val = split_data(df_five)

<a id="section2"></a>
## 2 Regression model
---

As seen in class, different link functions can be used in Generalized Linear Models.

Select the appropriate link function to understand the relationship between OffTask (target label) and the previously selected features. 

In this exercise, you should:
1. Create a regression model to explain the variable `OffTask` using the previously selected features.  
2. Calculate accuracy of the model's predictions using the validation set. 
3. Interpret and explain the coefficients. Which features are significant? Did the model successfully describe the data? How do you know?  
4. What changes could you make to improve the accuracy of the model (with the same type of regression model)?
5. Implement your suggestions and re-run the model.
6. Do you observe any change in the accuracy? Discuss your results.

### 2.1
Create a generalized linear model (GLM) to explain the variable `OffTask`.

In [None]:
#### GRADED CELL ####
### 2.1
def build_regression(X_train, y_train):
    """
    Fits a regression model that explains y_train using X_train

    Parameters
    ----------
    X_train: DataFrame with features (training set) 
    y_train: np.array with target variable (training set) 

    Returns:
    -------
    clf: fitted regression model
    summary: detailed regression output
             including coefficients and associated p-values
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return clf, summary

In [None]:
reg, summary = build_regression(X_train, y_train)
summary

### 2.2 

Calculate the accuracy. Given the classifier `clf`, features `X_val` and labels `y_val`, calculate the prediction accuracy. 
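
If the regression model returns probabilities rather than class labels (as is commonly the case for a binary target), they have to be thresholded before they can be compared with `y_val`. The sketch below illustrates this step with hypothetical numbers; the helper name `accuracy_from_probabilities` and the 0.5 threshold are assumptions made for illustration.

```python
import numpy as np

def accuracy_from_probabilities(probabilities, y_true, threshold=0.5):
    """Share of samples whose thresholded probability matches the label."""
    predicted = (np.asarray(probabilities) >= threshold).astype(int)
    return float(np.mean(predicted == np.asarray(y_true)))

# Hypothetical predicted probabilities of OffTask = 1 and true labels.
probs = np.array([0.9, 0.2, 0.6, 0.4, 0.1])
labels = np.array([1, 0, 0, 1, 0])
acc = accuracy_from_probabilities(probs, labels)
print(acc)
```

The threshold is itself a design choice: with an imbalanced target, 0.5 is not necessarily the value that gives the most useful detector.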

In [None]:
#### GRADED CELL ####
### 2.2
def calculate_accuracy(clf, X_val, y_val):
    """
    Calculates accuracy (percentage of validation samples 
    that are correctly classified) for the regression model

    Parameters
    ----------
    clf: previously trained classifier
    X_val: DataFrame with features (validation set)
    y_val: np.array with target variable (validation set)
    
    Returns:
    -------
    accuracy: float 
                
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return accuracy

In [None]:
"validation score of regression:", calculate_accuracy(reg, X_val, y_val)

### 2.3
Interpret and explain the coefficients. Which features are significant? Did the model successfully describe the data? How do you know? 

YOUR ANSWER HERE

### 2.4
What changes could you make to improve the accuracy of the model (with the same type of regression model)?
Explain and justify your decisions.

YOUR ANSWER HERE

### 2.5 

Implement your suggestions and re-run the model. 

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### 2.6
Do you observe any change in the accuracy? Discuss your results.

YOUR ANSWER HERE

<a id="section3"></a>
## 3 Decision Tree
---

As mentioned during the lecture, decision trees are a classification model with a tree-like structure. In the following questions, you should:

1. Train a decision tree with the original 5 features selected in question 1.1. Return the names of the three most important features.  
2. Interpret the decision tree and feature importances based on the provided visualization.  
3. Play with the maximum depth hyper-parameter. 
4. Interpret your results. 



### 3.1
Train a decision tree with the original 5 features selected in question 1.1. Return the names of the three most important features.  
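
The sketch below shows one way to turn a fitted tree's importances into an ordered list of feature names, assuming scikit-learn's `DecisionTreeClassifier`. The data is synthetic and the column names (`signal`, `noise_1`, ...) are hypothetical; only `signal` determines the label, so it should rank first.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical): only "signal" determines the label.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "noise_1": rng.random(300),
    "signal": rng.random(300),
    "noise_2": rng.random(300),
    "noise_3": rng.random(300),
})
y_toy = (X_toy["signal"] > 0.5).astype(int).to_numpy()

tree_clf = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_clf.fit(X_toy, y_toy)

# feature_importances_ is aligned with the column order of X_toy.
order = np.argsort(tree_clf.feature_importances_)[::-1]
top_three = list(X_toy.columns[order[:3]])
print(top_three)
```

Features that the tree never splits on get an importance of zero, so their relative order in such a list carries no meaning.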

In [None]:
#### GRADED CELL ####
### 3.1
def build_decision_tree(X_train, y_train, max_depth=3):
    """
    Train a decision tree classifier.
    1. Create a decision tree classifier given max_depth
    2. Train the classifier
    3. Get the importance of features
    
    Parameters
    ----------
    X_train: DataFrame with the 5 selected features (training set) 
    y_train: np.array with target variable (training set)
    max_depth : maximum depth of the decision tree
 
    
    Returns:
    -------
    clf: decision tree classifier
    feature_importance: list of the names of the 3 most important features,
                        ordered by feature importance

    """
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return clf, feature_importance

In [None]:
clf, feature_importance = build_decision_tree(X_train, y_train) # train the model using training data

In [None]:
print("validation score of decision tree:", clf.score(X_val, y_val))
print("most important features:", feature_importance)

Now we can visualize the decision tree

In [None]:
tree.plot_tree(clf);

### 3.2 

Interpret the decision tree and feature importances based on the provided visualization. 


YOUR ANSWER HERE

### 3.3 

Play with the maximum depth hyper-parameter. 

You might have noticed that there is a `max_depth` hyper-parameter in the Decision Tree classifier. 
Which max_depth parameter leads to the highest validation score?

Plot the accuracy with the varying depths (choose an appropriate range). 
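
A sketch of such a sweep on synthetic data, assuming scikit-learn. The depth range 1 to 10 and the data are hypothetical, and the plotting step is left out: the block only collects the training and validation accuracies per depth.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary problem (hypothetical data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 5))
noise = rng.normal(scale=0.5, size=400)
y_toy = ((X_toy[:, 0] + X_toy[:, 1] + noise) > 0).astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

depths = range(1, 11)
train_acc, val_acc = [], []
for depth in depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_acc.append(model.score(X_tr, y_tr))  # accuracy on data the tree has seen
    val_acc.append(model.score(X_va, y_va))    # accuracy on held-out data

best_depth = list(depths)[int(np.argmax(val_acc))]
print(best_depth, max(val_acc))
```

Passing `depths` together with `train_acc` and `val_acc` to a line plot shows both curves side by side; a growing gap between them at larger depths is the usual sign of overfitting.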

In [None]:
#### GRADED CELL ####
### 3.3
def explore_max_depth(X_train, X_val, y_train, y_val):
    """
    Explore the max depth parameter
    1. Get the accuracy score of the decision tree classifier at different depths
    2. Plot the accuracy at different depths
    
    Parameters
    ----------
    X_train: DataFrame with features (training set) 
    X_val: DataFrame with features (validation set)
    y_train: np.array with target variable (training set) 
    y_val: np.array with target variable (validation set)
    
    
    Returns:
    -------
    None

    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
explore_max_depth(X_train, X_val, y_train, y_val)

### 3.4 

Interpret your results. Which max_depth do you think is the best? 

YOUR ANSWER HERE

<a id="section4"></a>
## 4 Random Forest
---

In this section, you will explore another classifier: the Random Forest classifier. As mentioned during the lecture, the Random Forest classifier combines the output of multiple decision trees in order to generate the final output.
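
To make "combines the output of multiple decision trees" concrete: in scikit-learn's `RandomForestClassifier`, the forest's class probabilities are the average of the probabilities predicted by its individual trees. The sketch below checks this on synthetic, hypothetical data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem (hypothetical data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] - X_toy[:, 2] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Each fitted tree is available in forest.estimators_; averaging their
# class probabilities reproduces the forest's own probabilities.
per_tree = np.stack([t.predict_proba(X_toy) for t in forest.estimators_])
averaged = per_tree.mean(axis=0)
same = np.allclose(averaged, forest.predict_proba(X_toy))
print(same)
```

Because each tree is trained on a different bootstrap sample and considers random feature subsets, the averaged prediction is usually more stable than that of any single tree.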

You should:

1. Train a Random Forest with the original 5 features selected in question 1.1. Return the names of the three most important features.  
2. Interpret your results.
3. Compare the results from the three models.
4. Play with the Random Forest hyper-parameters. 
5. Interpret your results. 

### 4.1
Train a Random Forest with the original 5 features selected in question 1.1. Return the names of the three most important features.  

In [None]:
#### GRADED CELL ####
### 4.1
def build_random_forest(X_train, y_train, n_estimators=10):
    """
    Train a Random Forest classifier.
    1. Create a Random Forest classifier given n_estimators
    2. Train the classifier 
    3. Extract the feature importance
    
    Parameters
    ----------
    X_train: DataFrame with features (training set) 
    y_train: np.array with target variable (training set)  
    n_estimators : the number of estimators (trees) in the random forest classifier
    
    
    Returns:
    -------
    clf: random forest classifier
    feature_importance: list of the names of the 3 most important features,
                        ordered by feature importance
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return clf, feature_importance

In [None]:
clf, feature_importance = build_random_forest(X_train, y_train) 

print("validation score of random forest:", clf.score(X_val, y_val))
print("most important features:", feature_importance)

### 4.2 

Interpret your results. How do the top three features from the RF differ from the top features from the DT? 


YOUR ANSWER HERE

### 4.3 

Compare and discuss the results of the three models (Regression, Decision Tree, and Random Forest).

YOUR ANSWER HERE

### 4.4


As we have seen in the lecture and in the tutorial, there are multiple hyper-parameters that can be tuned in the Random Forest classifier (number of estimators, maximum depth, function to measure the quality of a split, minimum number of samples required to be at a leaf node, minimum weighted fraction of the sum total of weights, etc.).

Pick only **one** hyper-parameter and explore the changes in accuracy as you vary the values. 

Plot the accuracy with the varying values for the hyper-parameter you selected (choose an appropriate range). 


In [None]:
#### GRADED CELL ####
### 4.4
def explore_hyperparameter(X_train, X_val, y_train, y_val):
    """
    Explore ONE chosen hyperparameter
    1. Get the accuracy score of Random Forest classifier 
        with varying values of ONE hyperparameter
    2. Plot the accuracy at different values 
    
    Parameters
    ----------
    X_train: DataFrame with features (training set) 
    X_val: DataFrame with features (validation set)
    y_train: np.array with target variable (training set) 
    y_val: np.array with target variable (validation set)
    
    
    Returns:
    -------
    None

    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
explore_hyperparameter(X_train, X_val, y_train, y_val)

### 4.5 

Interpret your results. Justify your choice of hyper-parameter. What effect does the hyper-parameter selected have on the model? In this case, which is the optimal value? Why? 

YOUR ANSWER HERE


## 5 Creative extension
---

In this last part, we will ask you to be creative and extend the set of five features you are using. 

1. Conduct feature engineering to create at least 5 new features that cannot be created using just the original data set. Add the new features to the original data set. 
2. Train a Random Forest classifier with the extended set of features. 
3. Interpret your results. 

The goal of this last part is to build a better behavior detector (classifier), using the features you selected in the first part as well as new features you will create based on the new data we are now providing to you for this question. 

In the first part, you used the dataset stored in <code>ca1-dataset.csv</code> in CSV format; this data set has already been aggregated by `Unique-id`. Now, you will also need to work with the dataset stored in <code>ca2-dataset.csv</code>, also in CSV format.

These two files represent the same data at two different grain sizes. Specifically, the new data (<code>ca2-dataset.csv</code>) represents individual raw student actions within the educational software, while the previous data set (<code>ca1-dataset.csv</code>) is at the grain size of all the actions that occurred during a 20-second field observation of a student. Note that the individual raw student actions are labeled with the same `Unique-id` values as the observations (each `Unique-id` corresponds to a single field observation). In this question, you must conduct feature engineering to improve the features in the original data set, using the data in the new data set. You must create at least **five** new features that cannot be created using just the original data set, and add the new features to the set of features you selected in the first part of the homework. 

#### About the individual raw student actions dataset

Our target is to identify one specific student behavior: being off-task. A detailed description of the features is given below. The features describe how the student interacts with the system in individual actions. 

The description of the features for the individual raw actions in <code>ca2-dataset.csv</code> is provided below. 


| Name                   | Description                         |
| ---------------------- | ------------------------------------------------------------ |
| right | 1 if the action is right, 0 otherwise |
| bug | 1 if there is a bug, 0 otherwise |
| pknow-1 | Knowledge estimate before the action |
| pknow-2 | Knowledge estimate after the action |
| pchange | 1 if the knowledge estimate changes, 0 otherwise |
| time | Time spent in seconds |
| timeSDnormed | For the last action, (time taken – avg(cell)) / SD(cell) (SD is the standard deviation function) |
| timelast3SDnormed | For the last 3 actions, (time taken – avg(cell)) / SD(cell) (SD is the standard deviation function) |
| timelast5SDnormed | For the last 5 actions, (time taken – avg(cell)) / SD(cell) (SD is the standard deviation function) |
| notright | 1 if the action is not right, 0 otherwise |
| howmanywrong-up | How many of the actions are wrong up to the current action |
| wrongpct-up | (total number of wrong actions) / (number of steps where the skill was encountered so far) |
| timeperact-up | Total time so far on all actions involving this production, for all problems |
| Prev3Count-up | Count of, for each of the last 3 actions, how many involved the same interface widget |
| Prev5Count-up | Count of, for each of the last 5 actions, how many involved the same interface widget |
| recent5wrong | Of the last 5 actions, how many were wrong |
| manywrong-up | Total number of actions where this production was wrong up to the current action |
| Unique-id | ID of the aggregated observation (one observation aggregates interactions within a 20-second timeframe) |


**Please note that pknow-1 and pknow-2 exist only in the second data set.**

**The data set is available in the folder data**. 

In [None]:
# Read the second dataset
df_raw = pd.read_csv("./data/ca2-dataset.csv")

# Let's see what the dataframe looks like
print("length of the dataframe:", len(df_raw))
print("first rows of the dataframe:\n")
df_raw.head()

You should note that the features in <code>ca1-dataset.csv</code> were obtained by conducting feature engineering on <code>ca2-dataset.csv</code>. To make this clearer, we provide you with an example. Let's consider the row with index 52 in the original dataset:

In [None]:
df.loc[52]

This row has been obtained by computing the average of each column for the rows between 55 and 58 in the dataset <code>ca2-dataset.csv</code>. Specifically, in the second dataset, **we group by the column Unique-id** to group all the rows recorded in the same 20 seconds of interactions. Then, we select the group whose Unique-id equals that of the row above in the original dataset. 



In [None]:
groups = df_raw.groupby(by='Unique-id', as_index=False).mean()
groups[groups['Unique-id'] == df.loc[52]['Unique-id']]

It can be observed that the Avgright in <code>df.loc[52]</code> is the same as the value we obtained by grouping the data in the second dataset and picking the corresponding `Unique-id`. The same observation applies to the other columns. 
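
The same group-by mechanism can produce statistics other than the mean, which is one way to obtain features that the aggregated data set cannot express. The sketch below uses tiny hypothetical stand-ins for the two data sets; the new feature names (`n_actions`, `max_time`, `std_time`) are illustrative assumptions, not a prescribed solution.

```python
import pandas as pd

# Tiny stand-ins for the two data sets (hypothetical values).
raw = pd.DataFrame({
    "Unique-id": ["obs1", "obs1", "obs1", "obs2", "obs2"],
    "time": [2.0, 4.0, 9.0, 1.0, 1.0],
    "right": [1, 0, 0, 1, 1],
})
aggregated = pd.DataFrame({
    "Unique-id": ["obs1", "obs2"],
    "Avgtime": [5.0, 1.0],
})

# Named aggregation: statistics other than the mean carry information
# that the averaged data set cannot express.
new_features = raw.groupby("Unique-id", as_index=False).agg(
    n_actions=("time", "count"),
    max_time=("time", "max"),
    std_time=("time", "std"),
)

# A left merge keeps every aggregated observation, in the original order.
extended = aggregated.merge(new_features, on="Unique-id", how="left")
print(extended)
```

Note that statistics such as the standard deviation are undefined for observations with a single action, so missing values may need to be handled before training.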

### 5.1 
Create at least 5 new features from `df_raw`.

In [None]:
#### GRADED CELL ####
### 5.1
def extend_features(df_five, df_raw, df_five_id=None):
    """
    Create at least 5 new features from df_raw 
    
    Parameters
    ----------
    df_five : DataFrame with processed data (5 features)
    df_raw:   DataFrame with raw data
    df_five_id: (optional) DataFrame with Unique-ids from df_five 
    
    Returns:
    -------
    df_ext_five: DataFrame with extended features 
        (the five you selected in the first part + 
        the five you create here + the target off-task label)

    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return df_ext_five

In [None]:
df_ext_five = extend_features(df_five, df_raw, df_five_id)
X_train, X_val, y_train, y_val = split_data(df_ext_five)

### 5.2

Use the previously created `build_random_forest` function to train the models with the extended DataFrame. Print out the accuracy. 

In [None]:
# YOUR CODE HERE
raise NotImplementedError()


### 5.3

Interpret and compare your results. Did the new features improve the score? Write down your intuition about the possible reasons.

YOUR ANSWER HERE