# Predicting Student Success (Naive Bayesian and Regression Approaches) 
Two fundamental learning approaches in predictive analytics are probability-based learning and error-based learning. In this homework assignment, you will explore how these two approaches can be used to build predictive models. 

You will use a subset of the _Student Performance Data Set_ from UCI (https://archive.ics.uci.edu/ml/datasets/student+performance). This dataset addresses the factors that affect student achievement in secondary education at two Portuguese schools. The attributes include grades as well as demographic, social, and school-related features, and the data was collected from school reports and questionnaires. In this homework assignment you will use the mathematics scores (provided in the file 'student-mat.csv'). You will apply initial pre-processing and build learning models for predicting student success. 

Initial loading and pre-processing steps are shown in the next cell. Please use the descriptive features outlined below. The target variable you will derive (or use) is based on the third-year/final grade of the students.

In [1]:
# load libraries
import pandas as pd
import numpy as np
# load and process data
mat_df = pd.read_csv('student-mat.csv', sep=';')
mat_df = mat_df.drop(columns=['G1', 'G2'])  # G1 and G2 are the earlier grades; we are only interested in the final grade (G3)
_target = 'G3'
desc_features = ['school', 'sex', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 
        'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
        'Walc', 'health', 'absences']
mat_df[desc_features]

Unnamed: 0,school,sex,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,...,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences
0,GP,F,U,GT3,A,4,4,at_home,teacher,course,...,yes,no,no,4,3,4,1,1,3,6
1,GP,F,U,GT3,T,1,1,at_home,other,course,...,yes,yes,no,5,3,3,1,1,3,4
2,GP,F,U,LE3,T,1,1,at_home,other,other,...,yes,yes,no,4,3,2,2,3,3,10
3,GP,F,U,GT3,T,4,2,health,services,home,...,yes,yes,yes,3,2,2,1,1,5,2
4,GP,F,U,GT3,T,3,3,other,other,home,...,yes,no,no,4,3,2,1,2,5,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,MS,M,U,LE3,A,2,2,services,services,course,...,yes,no,no,5,5,4,4,5,4,11
391,MS,M,U,LE3,T,3,1,services,services,course,...,yes,yes,no,2,4,5,3,4,2,3
392,MS,M,R,GT3,T,1,1,other,other,course,...,yes,no,no,5,5,3,3,3,3,3
393,MS,M,R,LE3,T,3,2,services,other,course,...,yes,yes,no,4,4,1,3,4,5,0


### Q1. Create a new categorical target variable, called 'passed' based on grade obtained in the third year (see G3). Any student who gets 9 or above will be considered as passed. Also, split the data into training and testing (33% for testing). 

In [2]:
# your answer for Q1 goes here.
# Create the new target variable 'passed': 1 means passed (G3 >= 9), 0 means failed
mat_df["passed"] = np.where(mat_df[_target] >= 9, 1, 0)
desc_features.append("passed")
data_final = mat_df[desc_features]  # final DataFrame for further analysis


In [3]:
## Splitting of data into training & testing set
from sklearn.model_selection import train_test_split
X=data_final[data_final.columns[0:29]]
Y=data_final["passed"]
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size=0.33, random_state=26)
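
As a quick sanity check on the split, the sketch below repeats the Q1 steps on a made-up frame. The frame `toy_df`, its random values, and the row count of 395 are assumptions for illustration only (the real data comes from 'student-mat.csv'). With 395 rows and `test_size=0.33`, the test partition has 131 rows, which matches the total of each confusion matrix reported in Q2.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 395 rows, one feature, one grade column.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "absences": rng.integers(0, 30, size=395),
    "G3": rng.integers(0, 21, size=395),
})

# Same rule as in Q1: a final grade of 9 or above counts as passed.
toy_df["passed"] = (toy_df["G3"] >= 9).astype(int)

toy_X = toy_df[["absences"]]
toy_y = toy_df["passed"]
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.33, random_state=26
)

# The test size is rounded up; the remaining rows go to training.
print(len(toy_X_train), len(toy_X_test))
```

`train_test_split` also accepts a `stratify` argument that keeps the pass/fail ratio equal in both partitions. It is not used here because it would produce a different split from the one the rest of this notebook relies on.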

### Q2. Using the training and testing partitions from Q1, train a model using the Naive Bayes algorithm. Test and analyze the trained model.
You are expected to use the 'passed' variable (categorical target) instead of G3 (do not include G3 in your training or testing set). You can use the GaussianNB() model available in the sklearn library and make use of the OrdinalEncoder utility class.
Using the testing set, calculate the accuracy score and display the confusion matrix.

Additionally, print the `class_prior_` attribute of your classifier. What do these numbers represent? How would the prediction results change if you initialized them to "[0.5, 0.5]"?

In [4]:
# answer for Q2
## Encoding of dataset using ordinal encoder
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
X=encoder.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size=0.33, random_state=26)
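
To make the encoding step concrete, here is a minimal sketch of what `OrdinalEncoder` does, using a tiny made-up frame (`toy` and its three rows are illustrative, not taken from the dataset). Each column's distinct values are sorted, and every value is replaced by its position in that sorted list.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical three-row sample with two categorical columns.
toy = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "sex": ["F", "M", "M"],
})

toy_encoder = OrdinalEncoder()
toy_encoded = toy_encoder.fit_transform(toy)

print(toy_encoder.categories_)  # categories learned per column, in sorted order
print(toy_encoded)              # each value replaced by its index in that list
```

Note that `GaussianNB` then treats these integer codes as continuous values with a Gaussian distribution per class, which is only an approximation for nominal features such as `Mjob`.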

In [5]:
## Training the Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_train, y_train)

GaussianNB()

In [6]:
## Predictions made by our classifier
y_pred = clf.predict(X_test)

In [7]:
# Accuracy of the model
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.7633587786259542

In [8]:
## Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm
df_cm = pd.DataFrame(cm)
df_cm

Unnamed: 0,0,1
0,11,16
1,15,89


In [9]:
## class_prior_ of the model: the prior probability of each class, estimated from the training data.
print(clf.class_prior_ )

[0.28409091 0.71590909]


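These numbers are the prior probabilities of the two classes (0 = failed, 1 = passed). If no priors are specified, they are estimated from the training data as the relative frequency of each class. Here, about 28% of the training students failed and about 72% passed.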
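
The relationship between `class_prior_` and the class frequencies can be checked with a small sketch. The data below is synthetic (`toy_X` and `toy_y` are made up, with 30 samples of class 0 and 70 of class 1); only the `GaussianNB` behaviour is the point.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data: 100 samples, 3 features, 30% class 0 and 70% class 1.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 3))
toy_y = np.array([0] * 30 + [1] * 70)

# Without explicit priors, class_prior_ equals the relative class frequencies.
toy_clf = GaussianNB().fit(toy_X, toy_y)
class_freq = np.bincount(toy_y) / len(toy_y)
print(toy_clf.class_prior_, class_freq)

# With explicit priors, the supplied values are used instead of the frequencies.
toy_clf_uniform = GaussianNB(priors=[0.5, 0.5]).fit(toy_X, toy_y)
print(toy_clf_uniform.class_prior_)
```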

#### Training a new model using different class prior probabilities

In [10]:
clf_updated = GaussianNB(priors=[0.5,0.5]) ### [0.5,0.5] class probabilities
clf_updated.fit(X_train, y_train)

GaussianNB(priors=[0.5, 0.5])

In [11]:
y_pred_new = clf_updated.predict(X_test)

In [12]:
# Accuracy of the model
accuracy = accuracy_score(y_test, y_pred_new)
accuracy

0.7099236641221374

In [13]:
## Confusion matrix
cm = confusion_matrix(y_test, y_pred_new)
cm
df_cm = pd.DataFrame(cm)
df_cm

Unnamed: 0,0,1
0,14,13
1,25,79
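
To see how the change of priors affected the predictions, the two confusion matrices reported in Q2 can be compared directly. The sketch below copies those matrices by hand and assumes the sklearn convention (rows are true labels, columns are predicted labels); the helper `summarize` is introduced here only for illustration.

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows: true, columns: predicted).
cm_empirical = np.array([[11, 16], [15, 89]])  # priors estimated from the data
cm_uniform = np.array([[14, 13], [25, 79]])    # priors set to [0.5, 0.5]

def summarize(cm):
    """Return overall accuracy and per-class recall for a confusion matrix."""
    accuracy = np.trace(cm) / cm.sum()
    recall_per_class = np.diag(cm) / cm.sum(axis=1)
    return accuracy, recall_per_class

acc_emp, rec_emp = summarize(cm_empirical)
acc_uni, rec_uni = summarize(cm_uniform)

print("accuracy:", acc_emp, "->", acc_uni)
print("recall (failed, passed):", rec_emp, "->", rec_uni)
print("predicted as failed:", cm_empirical[:, 0].sum(), "->", cm_uniform[:, 0].sum())
```

With equal priors the model predicts 'failed' more often (39 test students instead of 26). Recall for the failed class rises, recall for the passed class falls, and overall accuracy drops because passed students are the majority of the test set.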


### Q3. Build a logistic regression model using the same training/testing split you used in Q2. Test and analyze the trained model.  
Using the testing set, calculate the accuracy score and display the confusion matrix.

Also print the `coef_` attribute of your logistic regression classifier. What do these numbers in `coef_` represent? 

In [15]:
# answer for Q3 
## Training a logistic regression model
from sklearn.linear_model import LogisticRegression
log_clf = LogisticRegression(random_state=18).fit(X_train, y_train)

In [16]:
## Predictions made by logistic regression
log_pred=log_clf.predict(X_test)

In [18]:
# Accuracy of the logistic regression model
accuracy = accuracy_score(y_test, log_pred)
accuracy

0.8015267175572519

In [19]:
## Confusion matrix of logistic regression
cm = confusion_matrix(y_test, log_pred)
cm
df_cm = pd.DataFrame(cm)
df_cm

Unnamed: 0,0,1
0,11,16
1,10,94


In [20]:
## Coefficients of the decision function
print(log_clf.coef_)

[[ 0.41459296  0.14603337  0.31205744  0.44998563  0.03584358  0.06344831
   0.11585604 -0.04763119  0.07368474  0.17761244 -0.1055507  -0.28581192
   0.29442115 -0.66130109 -0.44661368 -0.55369528  0.25170124  0.25562687
  -0.38328839  0.57356185  0.02525186 -0.3057767   0.14499349  0.22796183
  -0.5632803  -0.18350942  0.26254696  0.00797929 -0.0063134 ]]


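This is the list of coefficients, one per feature, in the decision function. Each coefficient is the change in the log-odds of the 'passed' class for a one-unit increase in that feature, with the other features held fixed. When the features are normalized to a common scale, the magnitudes of the coefficients can be compared to gauge how much weight each feature carries in the decision.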
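
The role of `coef_` can be illustrated by rebuilding the model's output by hand. The sketch below uses synthetic data (`toy_X`, `toy_y`, and the rule that generates the labels are assumptions, not the student data) and checks that the weighted sum of the features plus the intercept gives the log-odds, and that the sigmoid of the log-odds gives the predicted probability of class 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem with 3 features.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.5, size=200)
toy_y = (toy_X[:, 0] - 0.5 * toy_X[:, 1] + noise > 0).astype(int)

toy_log = LogisticRegression().fit(toy_X, toy_y)

# Decision function: weighted sum of the features plus the intercept (log-odds).
log_odds = toy_X @ toy_log.coef_.ravel() + toy_log.intercept_[0]

# The sigmoid turns the log-odds into the probability of class 1.
prob_class1 = 1.0 / (1.0 + np.exp(-log_odds))

print(toy_log.coef_)  # shape (1, n_features) for a binary problem
print(np.allclose(log_odds, toy_log.decision_function(toy_X)))
print(np.allclose(prob_class1, toy_log.predict_proba(toy_X)[:, 1]))
```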

### Q4. Build a linear regression model using the same training/testing split you used in Q2.  
This time you are expected to use the 'G3' attribute in the original dataset as the target variable (continuous). The descriptive features you will use are provided below (in the `reg_desc_feat` list). Use the same split as in Q1 (33% testing set) and calculate the mean absolute error.

In addition, print the `coef_` and `intercept_` attributes of your linear regression model. What are these numbers?

In [23]:
## Preparation of data for modelling
reg_desc_feat = ['school', 'address', 'famsize', 'Mjob', 'Fjob', 'guardian', 'paid', 'activities', 'romantic' ]
# answer for Q4
X_linear=mat_df[reg_desc_feat]
Y_linear=mat_df[_target]

In [26]:
## Encoding, and getting the same split as in Q2 by keeping the same value of random_state
encoder_linear = OrdinalEncoder()
X_linear=encoder_linear.fit_transform(X_linear)
X_train, X_test, y_train, y_test = train_test_split( X_linear, Y_linear, test_size=0.33, random_state=26)

In [27]:
## Training linear regression
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train, y_train)

In [28]:
## Predicted values by our model
reg_pred=reg.predict(X_test)

In [29]:
## Mean absolute error produced by the model
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, reg_pred)

3.223741477595621

In [30]:
## Coefficients of the linear regression model
print(reg.coef_)

[ 0.39347619  0.20570381  0.81397323  0.2593395   0.0448912  -0.58638011
  0.9681467   0.13602719 -0.91172801]


In [31]:
## Intercept of linear regression model
print(reg.intercept_)

9.425721982738176


coef_: These are the estimated coefficients of the linear regression model, one per feature, so the array has length n (the number of features, here 9). Each value is the change in the predicted G3 for a one-unit increase in that feature.

intercept_: This is the independent (constant) term of the linear regression function, that is, the predicted G3 when every feature value is 0.
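
As a minimal sketch of how the two attributes combine, the example below fits a model on synthetic, noise-free data (`toy_X`, `toy_y`, and the true weights 2, -1 and offset 10 are made up for illustration) and reproduces `predict` by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known linear rule without noise.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(50, 2))
toy_y = 10.0 + 2.0 * toy_X[:, 0] - 1.0 * toy_X[:, 1]

toy_reg = LinearRegression().fit(toy_X, toy_y)

# A prediction is the dot product of the features with coef_, plus intercept_.
manual_pred = toy_X @ toy_reg.coef_ + toy_reg.intercept_

print(toy_reg.coef_, toy_reg.intercept_)
print(np.allclose(manual_pred, toy_reg.predict(toy_X)))
```

Because the synthetic target has no noise, the fitted `coef_` and `intercept_` recover the weights and offset used to generate it. On the student data the fit is only approximate, which is what the mean absolute error measures.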