# STUDENT INTERVENTION

In [None]:
from IPython.display import Image
Image(filename="intervention.jpg", width=1000, height=800)

# 1. Business Problem
## 1.1 Context
This dataset was collected from two Portuguese schools and describes student achievement in secondary education. The data was gathered from school reports and questionnaires.

## 1.2 Problems with current approach
Nowadays, teachers interact little with students, so they do not know each student's strengths and weaknesses. Many students cannot recognize their own weaknesses, which stops them from developing their skills. Many schools have no proper student intervention procedure. As a result, students fall behind academically, and by the time the problem is noticed even their parents may be unable to help.

## 1.3 Problem Statement
Many schools have hired us as data science consultants. If students are made aware of their weaknesses, they can work on them and improve. We need to identify students who might need early intervention before they fail to graduate.

## 1.4 Business Objectives and Constraints
* Deliverable: Trained model file
* Model interpretability is very important
* Output probabilities along with the prediction
* No latency constraints

# 2. Machine Learning Problem
## 2.1 Data Overview

For this project:
1. The dataset has 395 observations.
2. Each observation includes the student's status.

**Target variable**<br>
'passed' – Current student status (Passed/Failed)

**Features**

Student information
* school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira) 
* gender - student's gender (binary: 'F' - female or 'M' - male) 
* age - student's age (numeric: from 15 to 22) 
* address - student's home address type (binary: 'U' - urban or 'R' - rural) 
* studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours) 
* failures - number of past class failures (numeric: n if 1<=n<3, else 4) 
* schoolsup - extra educational support (binary: yes or no) 
* paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no) 
* activities - extra-curricular activities (binary: yes or no) 
* nursery - attended nursery school (binary: yes or no) 
* higher - wants to take higher education (binary: yes or no) 
* romantic - with a romantic relationship (binary: yes or no)
* freetime - free time after school (numeric: from 1 - very low to 5 - very high) 
* goout - going out with friends (numeric: from 1 - very low to 5 - very high) 
* Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high) 
* Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high) 
* health - current health status (numeric: from 1 - very bad to 5 - very good) 
* absences - number of school absences (numeric: from 0 to 93) 

Family information

* famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3) 
* Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart) 
* Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education) 
* Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education) 
* Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other') 
* Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other') 
* reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other') 
* guardian - student's guardian (nominal: 'mother', 'father' or 'other') 
* famsup - family educational support (binary: yes or no) 
* internet - Internet access at home (binary: yes or no) 
* famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent) 

Distance information
*  traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

## 2.2 Mapping Business problem to ML problem
### 2.2.1 Type of Machine Learning Problem
This is a classification problem, because there are two possible discrete outcomes, as is typical of classification:
* Students who need early intervention.
* Students who do not need early intervention.

We can classify accordingly with a binary outcome such as:
* Yes, 1, for students who need early intervention.
* No, 0, for students who do not need early intervention.

Because the outcome is discrete rather than continuous, this is not a regression problem.

### 2.2.2 Evaluation Metric (KPI)
Since this is a binary classification problem, we use the following metrics:
* **Confusion matrix** - Gives a clear view of the number of correct and incorrect predictions made by the model.
* **ROC-AUC** - Considers the ranking of the output probabilities and intuitively measures the likelihood that the model can distinguish a positive point from a negative point. (**Note:** ROC-AUC is typically used for binary classification only.) We will use AUC to select the best model.
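
The ranking interpretation of AUC can be made concrete with a small sketch. The helper `pairwise_auc` and the labels and scores below are made up for illustration; they are not part of the notebook or of scikit-learn. The function counts how often a positive example receives a higher score than a negative one, with ties counted as half.

```python
from itertools import product


def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = needs intervention) and predicted probabilities
y_true_demo = [1, 0, 1, 0, 0, 1]
scores_demo = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]

auc_value = pairwise_auc(y_true_demo, scores_demo)
print(auc_value)
```

On real data, `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently; the quadratic loop here is only meant to show what the number measures.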

# 3. Exploratory Data Analysis
Importing the libraries

In [None]:
# NumPy for numerical computing
import numpy as np

# Pandas for DataFrames
import pandas as pd

# Matplotlib for visualization
from matplotlib import pyplot as plt
# display plots in the notebook
%matplotlib inline
# import color maps
from matplotlib.colors import ListedColormap

# Seaborn for easier visualization
import seaborn as sns

# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")

# Function for splitting training and test set
from sklearn.model_selection import train_test_split

# Function to perform data standardization 
from sklearn.preprocessing import StandardScaler

# Libraries to perform hyperparameter tuning
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Machine Learning Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# import xgboost
import os
mingw_path = 'C:\\Program Files\\mingw-w64\\x86_64-7.2.0-posix-seh-rt_v5-rev0\\mingw64\\bin'
os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']
from xgboost import XGBClassifier
from xgboost import plot_importance  ## to plot feature importance

# Evaluation metrics
from sklearn.metrics import roc_curve, auc, roc_auc_score, confusion_matrix

# To save the final model on disk
# (sklearn.externals.joblib was removed from recent scikit-learn releases)
import joblib

In [None]:
# Set printing options.
np.set_printoptions(precision=2, suppress=True)
# These options determine the way floating point numbers, arrays and other NumPy objects are displayed.

## 3.1 Reading the data

In [None]:
df['reason'].value_counts()

In [None]:
## We can also use bar plots instead
plt.figure(figsize=(9,7))
sns.countplot(y='Mjob', data=df)

In [None]:
## We can also use bar plots instead
plt.figure(figsize=(9,7))
sns.countplot(y='Fjob', data=df)

In [None]:
## We can also use bar plots instead
plt.figure(figsize=(9,7))
sns.countplot(y='reason', data=df)

In [None]:
df['passed'].value_counts()

Approximately 67.1% of students passed and 32.9% failed.

This means the dataset is **not balanced**.
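
One consequence of an imbalanced target is that plain accuracy can look good for a useless model. The sketch below uses invented labels (not the real data) to show the class proportions and the accuracy of a baseline that always predicts the majority class.

```python
from collections import Counter

# Hypothetical labels standing in for df['passed']; not the real data
labels_demo = ['yes'] * 8 + ['no'] * 4

counts = Counter(labels_demo)
total = sum(counts.values())
proportions = {label: count / total for label, count in counts.items()}

# Accuracy obtained by always predicting the most common class
majority_baseline = max(proportions.values())

print(proportions)
print(majority_baseline)
```

With roughly two thirds of students passing, a model that always predicts "passed" would already be right about two thirds of the time, which is one reason to prefer a ranking metric such as AUC here. In pandas, `df['passed'].value_counts(normalize=True)` gives the same proportions directly.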

## 3.4 Segmentations
Segment the target variable (passed) by key features

### Univariate segmentations

In [None]:
## passed vs absences
sns.boxplot(y='passed', x='absences', data=df)

Students with fewer absences tended to pass the exam. However, many students who failed also had few absences, so absences alone do not separate the two groups well. Let's check the next feature.

In [None]:
## passed vs failures
sns.boxplot(y='passed', x='failures', data=df)

* This makes intuitive sense: students with no past failures tended to pass the final exam.
* Students with past failures tended to fail the final exam as well.

### 6.1.1 Train test split

In [None]:
# Create separate object for target variable
y = df.passed

# Create separate object for input features
X = df.drop('passed', axis=1)

In [None]:
# Split X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=1234,
                                                    stratify=df.passed)

In [None]:
# Print number of observations in X_train, X_test, y_train, and y_test
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

### 6.1.2 Data standardization
* In data standardization we perform zero-mean centering and unit scaling, i.e. we transform every feature to have a mean of 0 and a standard deviation of 1.
* Thus we use the **mean** and **standard deviation** of each feature.
* It is very important to save the **mean** and **standard deviation** of each feature from the **training set**, because we apply the same values to the test set.
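
A toy illustration of the points above, using small made-up arrays rather than the notebook's `X_train` and `X_test`: the statistics come from the training rows only and are then reused unchanged on the test rows. `ddof=1` is used so that NumPy matches the default of pandas' `.std()`.

```python
import numpy as np

# Hypothetical feature matrices (rows = students, columns = features)
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test_demo = np.array([[2.5, 25.0], [5.0, 50.0]])

# Statistics are computed from the training rows only
mean_demo = X_train_demo.mean(axis=0)
std_demo = X_train_demo.std(axis=0, ddof=1)  # ddof=1 matches pandas' default

# The same mean and std are applied to both sets
X_train_scaled = (X_train_demo - mean_demo) / std_demo
X_test_scaled = (X_test_demo - mean_demo) / std_demo

print(X_train_scaled.mean(axis=0))         # close to 0 for every column
print(X_train_scaled.std(axis=0, ddof=1))  # close to 1 for every column
print(X_test_scaled.mean(axis=0))          # generally not 0
```

The scaled test set does not end up with mean 0 and standard deviation 1, and that is expected: recomputing the statistics on the test set would leak information about it into the preprocessing.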

In [None]:
train_mean = X_train.mean()
train_std = X_train.std()

In [None]:
## Save these mean and std dev values
train_mean.to_pickle("train_mean.pkl")
train_std.to_pickle("train_std.pkl")

In [None]:
## Standardize the train data set
X_train = (X_train - train_mean) / train_std

In [None]:
## Check for mean and std dev.
X_train.describe()

In [None]:
## Note: We use train_mean and train_std to standardize the test data set
X_test = (X_test - train_mean) / train_std

In [None]:
## Check for mean and std dev. - not exactly 0 and 1
X_test.describe()

## 6.2 Model-1 Logistic Regression

In [None]:
# Dataframe dimensions
df.shape

In [None]:
type(df)

In [None]:
# Columns of the dataframe
df.columns

In [None]:
# First 5 rows
df.head()

In [None]:
# Column datatypes
df.dtypes

In [None]:
df.dtypes[df.dtypes=='object']

In [None]:
pd.set_option('display.max_columns', 100) ## display max 100 columns
df.head()

In [None]:
# Last 5 rows
df.tail()

In [None]:
# Calculate number of students
n_students = df.shape[0]

# Calculate number of features
n_features = df.shape[1] - 1

# Calculate passing students
# Data filtering using .loc[rows, columns]
passed = df.loc[df.passed == 'yes', 'passed']
n_passed = passed.shape[0]

# Calculate failing students
failed = df.loc[df.passed == 'no', 'passed']
n_failed = failed.shape[0]

# Calculate graduation rate
total = float(n_passed + n_failed)
grad_rate = float(n_passed * 100 / total)

print("Total no.of students =",n_students)
print("No.of features =",n_features)
print("No.of students who passed =",n_passed)
print("No.of students who failed =",n_failed)
print("Graduation rate of the class =",grad_rate)

## 3.2.Distribution of Numeric Features

In [None]:
# Plot histogram grid
df.hist(figsize=(10,10), xrot=-45)

**Observations:**

The histograms of the numeric features suggest the following.

Alcohol consumption (Dalc and Walc):

Workdays:
* More than 200 students report very low alcohol consumption on workdays, and very few report very high consumption.

Weekend:
* Compared with Dalc, consumption is higher at the weekend: more students fall into the higher Walc categories.

Father's and mother's education (Fedu and Medu):

* Most parents have at least some formal education.

Absences:

* Roughly 290 students fall in the lowest absence bin, and only a handful have a very high number of absences. This could be an important feature, since students who are absent very often have a much lower chance of passing the final exam.

Failures:

* Few students have failed past classes. This is also likely to be a useful feature for prediction.

Famrel:

* Some students report very bad family relationships, which may affect them mentally and ultimately contribute to failure.

Free time and Go out:

* More than 150 students report a medium amount of free time after school, and a similar number go out with friends at a medium level.

Health:

* About 45-50 students report very bad health, which may affect them physically and contribute to failure.

Study time:

* About 200 students are in study-time category 2 (2 to 5 hours per week), which is fairly little.

Travel time:

* Most students have a short commute between home and school.

In [None]:
# Summarize numerical features
# Generates descriptive statistics.
# Summarizes the central tendency(the tendency for the values of a random variable to cluster round its mean, mode, or median.)
# Summarizes dispersion and shape of a dataset’s distribution, excluding NaN values.
df.describe()

* Scan the min, max and mean rows and make sure the values make sense.
* No feature is constant, since none has a standard deviation of 0.
* The minimum and maximum values of all the features look reasonable.
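
The standard-deviation check mentioned above can also be automated instead of read off the `describe()` table. This is a minimal sketch on an invented frame; the column `constant_flag` is hypothetical and exists only to trigger the check.

```python
import pandas as pd

# Hypothetical frame; 'constant_flag' is an invented column that never varies
demo = pd.DataFrame({
    'age': [15, 16, 17, 18],
    'absences': [0, 4, 2, 10],
    'constant_flag': [1, 1, 1, 1],
})

# Restrict to numeric columns, then look for a standard deviation of exactly 0
numeric = demo.select_dtypes(include='number')
stds = numeric.std()
constant_cols = stds[stds == 0].index.tolist()

print(constant_cols)
```

A column with zero standard deviation carries no information for a model, and it would also cause a division by zero in a manual standardization step.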

## 3.3 Distribution of categorical features

In [None]:
# Summarize categorical features
df.describe(include=['object'])

* There are more categorical features than numeric features.
* Most of the categorical features are binary.
* Only 4 features (Mjob, Fjob, reason and guardian) are nominal.

Let's check the frequency of the values of the nominal features.

In [None]:
df['Mjob'].value_counts()

In [None]:
df['Fjob'].value_counts()

In [None]:
# Columns
df.columns

In [None]:
# We want to get the column name "passed", which is the last column
df.columns[-1]

In [None]:
# This would get everything except for the last element that is "passed"
df.columns[:-1]

In [None]:
# Extract feature columns
# As seen above, we're getting all the columns except "passed" here but we're converting it to a list
feature_cols = list(df.columns[:-1])

In [None]:
# Extract target column 'passed'
# As seen above, since "passed" is last in the list, we're extracting using [-1]
target_col = df.columns[-1]

In [None]:
# Show the list of columns
print("Feature columns =",feature_cols)
print("\nTarget column =",target_col)

In [None]:
# Separate the data into feature data and target data (X_all and Y_all, respectively)
X_all = df[feature_cols]
Y_all = df[target_col]

In [None]:
# Show the feature information by printing the first five rows
X_all.head()

In [None]:
# Show the target information by printing the first five rows
Y_all.head()

**Finally, convert 'passed' (target variable) into a binary indicator variable.**
* 'Failed' should be 1
* 'Passed' should be 0

In [None]:
# Convert passed to an indicator variable: 1 = failed ('no'), 0 = passed ('yes')
df['passed'] = (df.passed == 'no').astype(int)

To confirm we did that correctly, display the proportion of students in our dataset who failed.

In [None]:
# The proportion of observations who 'failed'
df.passed.mean()

This matches the earlier count.

## 5.2 One-Hot Encoding for categorical variables

In [None]:
# Create new dataframe with dummy features
df = pd.get_dummies(df, columns=['school', 'gender', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic'])

# Display first 10 rows
df.head(10)

**Save this dataframe as your analytical base table for future use.**
* Remember to set the argument index=False so that only the data is saved, without the row index.

In [None]:
# Save analytical base table
df.to_csv('Student_new_DB.csv', index=False)

# 6. Machine Learning Models
## 6.1 Data Preparation

In [None]:
print(df.shape)

In [None]:
df

**What to look for?**
* The colorbar on the right explains the meaning of the heatmap - Dark colors indicate **strong negative correlations** and light colors indicate **strong positive correlations**.

# 4. Data Cleaning
## 4.1 De-duplication and dropping unwanted observations

In [None]:
# Drop duplicates
df = df.drop_duplicates()
print(df.shape)

There are no duplicates.

## 4.2 Outliers
A visual check of the previous analysis suggests that outliers will not be a major problem.

## 4.3 Missing Data

In [None]:
# Display number of missing values by feature
df.isnull().sum()

There are no missing values.

# 5. Feature Engineering
* Our dataset is small, so sparse data is a potential concern.
* A common problem in machine learning is sparse data, which can degrade the performance of algorithms and the accuracy of their predictions.
* Data is considered sparse when many expected values are missing, which is common in large-scale data analysis. Our dataset has no missing values, so this will not be a major problem here.
* We won't always have much domain knowledge for the problem. In these situations, we should rely on exploratory analysis to give us hints for better feature engineering.

## 5.1 Identify feature and target columns

* It is often the case that the data you obtain contains non-numeric features. 
* This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.
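
A common way to turn non-numeric columns into numeric ones is one-hot encoding with `pd.get_dummies`. The sketch below runs it on a tiny invented frame (three rows' worth of values borrowed from the feature descriptions, not from the real data) so the effect on the columns is easy to see.

```python
import pandas as pd

# Hypothetical mini-frame with two categorical columns and one numeric column
demo = pd.DataFrame({
    'guardian': ['mother', 'father', 'other', 'mother'],
    'internet': ['yes', 'no', 'yes', 'yes'],
    'age': [15, 16, 17, 18],
})

# Each listed column is replaced by one indicator column per category
encoded = pd.get_dummies(demo, columns=['guardian', 'internet'])

print(encoded.columns.tolist())
print(encoded.shape)
```

Each categorical column is expanded into one indicator column per category, named `<column>_<category>`, while numeric columns such as `age` pass through unchanged.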

In [None]:
## passed vs studytime
sns.boxplot(y='passed', x='studytime', data=df)

* This also makes sense: the students who failed are concentrated around study-time category 2 (2 to 5 hours per week).
* The students who passed span study-time categories 1 to 3. One possible explanation is that students who pass despite little study time grasp the material more quickly.

### Bivariate segmentations

In [None]:
# Scatterplot of studytime vs. failures
sns.lmplot(x='studytime', y='failures', hue='passed', data=df, fit_reg=False)

* This is a **bivariate segmentation** because we are plotting the relationship between two variables while segmenting classes using color.
* It's a quick way to see if there are potential interactions between different features.
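
Because `studytime` and `failures` take only a few distinct values, many points in such a scatterplot overlap. As a complement, a pivot table can report the failure rate for every combination of the two features. This is a sketch on invented rows; the `failed` column (1 = failed, 0 = passed) is an assumed stand-in for the encoded target.

```python
import pandas as pd

# Hypothetical rows; 'failed' is 1 for students who failed, 0 otherwise
demo = pd.DataFrame({
    'studytime': [1, 1, 2, 2, 1, 2],
    'failures':  [0, 1, 0, 1, 1, 0],
    'failed':    [0, 1, 0, 1, 1, 0],
})

# Mean of a 0/1 column = share of failing students in each cell
fail_rate = demo.pivot_table(
    index='studytime', columns='failures', values='failed', aggfunc='mean'
)

print(fail_rate)
```

Cells whose failure rate differs sharply from the overall rate point to feature combinations worth a closer look.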

In [None]:
# Get just the prediction for the positive class (1)
y_pred_proba = model.predict_proba(X_test)[:,1]

In [None]:
# Display first 10 predictions
y_pred_proba[:10]

**Note:** Just as above, we can use these probabilities for model interpretation
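
One way such probabilities might be used in practice is to flag students above a cut-off and to rank them by risk. The probabilities and the 0.5 threshold below are assumptions for illustration only; they do not come from the fitted model.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 ('Failed') for five students
proba_demo = np.array([0.15, 0.82, 0.40, 0.65, 0.05])
threshold = 0.5  # assumed cut-off; not taken from the notebook

# Boolean mask of students at or above the cut-off
flagged = proba_demo >= threshold

# Student indices ordered from highest to lowest predicted risk
priority = np.argsort(-proba_demo)

print(flagged)
print(priority)
```

Lowering the threshold flags more students at the cost of more false alarms, so the choice depends on how much intervention capacity a school has.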

In [None]:
confusion_matrix(y_test, y_pred).T

In [None]:
# Calculate ROC curve from y_test and pred
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

In [None]:
# Plot the ROC curve
fig = plt.figure(figsize=(8,8))
plt.title('Receiver Operating Characteristic')

# Plot ROC curve
plt.plot(fpr, tpr, label='l1')
plt.legend(loc='lower right')

# Diagonal 45 degree line
plt.plot([0,1],[0,1],'k--')

# Axes limits and labels
plt.xlim([-0.1,1.1])
plt.ylim([-0.1,1.1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
# Calculate AUC for Train set from predicted probabilities (consistent with the test AUC)
roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])

In [None]:
# Calculate AUC for Test set
print(auc(fpr, tpr))

#### Feature Importance

In [None]:
## Building the model again with the best hyperparameters
model = RandomForestClassifier(n_estimators=200, min_samples_split=5, min_samples_leaf=2)
model.fit(X_train, y_train)

In [None]:
indices = np.argsort(-model.feature_importances_)
print("The features in order of importance are:")
print(50*'-')
for feature in X.columns[indices]:
    print(feature)

## 6.4 Model-3 XGBoost

**Note:**
* The probability values represent the probability of a data point belonging to class 1 ('Failed')

In [None]:
# Calculate AUC for Train set from predicted probabilities (consistent with the test AUC)
print(roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]))

In [None]:
## Building the model again with the best hyperparameters
model = LogisticRegression(C=10, penalty = 'l2')
model.fit(X_train, y_train)

In [None]:
indices = np.argsort(-abs(model.coef_[0,:]))
print("The features in order of importance are:")
print(50*'-')
for feature in X.columns[indices]:
    print(feature)
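
The ranking above keeps only the feature names. A small variation, sketched here with made-up names and coefficients rather than the fitted model, also keeps the signed coefficient, which shows the direction of each effect: a positive coefficient pushes the prediction towards class 1 ('Failed'). Comparing magnitudes is only meaningful because the features were standardized beforehand.

```python
import numpy as np

# Hypothetical feature names and coefficients; coef_demo mimics the
# (1, n_features) shape of model.coef_ for a binary logistic regression
feature_names = np.array(['failures', 'absences', 'goout'])
coef_demo = np.array([[0.9, 0.1, -0.4]])

# Sort by absolute value, largest first
order = np.argsort(-np.abs(coef_demo[0]))
ranked = list(zip(feature_names[order].tolist(), coef_demo[0][order].tolist()))

for name, weight in ranked:
    print(f"{name:10s} {weight:+.2f}")
```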

## 6.3 Model-2 Random Forest

In [None]:
df = pd.read_csv("student-data.csv")

In [None]:
model.best_estimator_

In [None]:
y_train_pred = model.predict(X_train)

In [None]:
y_pred = model.predict(X_test)

**Note:** Just as above we can use these probabilities to get model interpretation

In [None]:
# Calculate AUC for Train from predicted probabilities (consistent with the test AUC)
roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])

In [None]:
# Calculate AUC for Test
print(auc(fpr, tpr))

# 7. Save the winning model to disk
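
A minimal sketch of persisting a fitted model with `joblib`, assuming a scikit-learn estimator. The tiny training set, the name `demo_model` and the file name `final_model.pkl` are placeholders; in the notebook the object to save would be whichever tuned model achieved the best test AUC.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical, tiny training set standing in for X_train / y_train
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
demo_model = LogisticRegression().fit(X_demo, y_demo)

# 'final_model.pkl' is an assumed file name, written to a temporary folder
path = os.path.join(tempfile.mkdtemp(), 'final_model.pkl')
joblib.dump(demo_model, path)

# Reload and confirm that the restored model gives identical probabilities
restored = joblib.load(path)
same = np.array_equal(
    restored.predict_proba(X_demo), demo_model.predict_proba(X_demo)
)
print(same)
```

The saved training mean and standard deviation need to be shipped together with the model file, since new data must be standardized with the same values before prediction.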

In [None]:
# Scatterplot of absences vs. failures
sns.lmplot(x='absences', y='failures', hue='passed', data=df, fit_reg=False)

In [None]:
# Scatterplot of studytime vs. absences
sns.lmplot(x='studytime', y='absences', hue='passed', data=df, fit_reg=False)

## 3.6 Correlations
* Finally, let's take a look at the relationships between pairs of numeric features.
* ***Correlation*** is a value between -1 and 1 that represents how closely the values of two features move in unison.
* Positive correlation means that as one feature increases, the other increases; e.g. a child's age and her height.
* Negative correlation means that as one feature increases, the other decreases; e.g. hours spent studying and number of parties attended.
* Correlations near -1 or 1 indicate a strong relationship.
* Those closer to 0 indicate a weak relationship.
* 0 indicates no linear relationship.
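
The definition above can be checked by computing the Pearson correlation by hand. The helper `pearson` and the numbers are invented for illustration. The second example shows a caveat: a correlation of 0 rules out a linear relationship only, since `z` below is fully determined by `x`.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    da, db = a - a.mean(), b - b.mean()
    return (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())


# Made-up values: hours studied and parties attended
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.0, 4.0, 4.0, 2.0, 1.0])
r_linear = pearson(x, y)

# z depends entirely on x, but symmetrically around the mean of x
z = (x - x.mean()) ** 2
r_symmetric = pearson(x, z)

print(r_linear)     # strongly negative
print(r_symmetric)  # essentially zero despite the exact dependence
```

`np.corrcoef(x, y)[0, 1]` and `DataFrame.corr()` return the same coefficient; the manual version is only meant to show what the number measures.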

In [None]:
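# Correlations between numeric columns only (text columns cannot be correlated)
df.select_dtypes(include='number').corr()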

#### A lot of numbers make things difficult to read. So let's visualize this.

In [None]:
# The liblinear solver supports both the 'l1' and 'l2' penalties
tuned_params = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000], 'penalty': ['l1', 'l2']}
model = GridSearchCV(LogisticRegression(solver='liblinear'), tuned_params, scoring = 'roc_auc', n_jobs=-1)
model.fit(X_train, y_train)

In [None]:
## Predict Train set results
y_train_pred = model.predict(X_train)

In [None]:
## Predict Test set results
y_pred = model.predict(X_test)