In [None]:
import warnings 
warnings.filterwarnings('ignore')

## K-Nearest-Neighbors

KNN belongs to the supervised learning family of algorithms. Informally, this means that we are given a labelled dataset consisting of training observations (x, y) and would like to capture the relationship between x and y. More formally, our goal is to learn a function h: X → Y so that, given an unseen observation x, h(x) can confidently predict the corresponding output y.

In this module we will explore the inner workings of KNN, how to choose an optimal value of K, and how to use KNN from scikit-learn.
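
To make the idea concrete, here is a minimal from-scratch sketch of the KNN prediction rule: measure the distance from a query point to every training point, take the k closest, and return the majority label. The function name `knn_predict` and the toy data are made up for illustration; the rest of this notebook uses scikit-learn's `KNeighborsClassifier` instead.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training observation
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3)
print(pred_a, pred_b)
```

Because the prediction depends directly on distances, features measured on very different scales can dominate the vote, which is one reason the choice of K and the pre-processing steps matter.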

## Overview

1. Read the problem statement.

2. Get the dataset.

3. Explore the dataset.

4. Pre-process the dataset.

5. Visualize the data.

6. Transform the dataset for building the machine learning model.

7. Split the data into train and test sets.

8. Build the model.

9. Apply the model.

10. Evaluate the model.

11. Find the optimal K value.

12. Repeat steps 8, 9 and 10.

## Problem statement

### Dataset

The data set we’ll be using is the Iris Flower Dataset which was first introduced in 1936 by the famous statistician Ronald Fisher and consists of 50 observations from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals.

**Attributes of the dataset:** https://archive.ics.uci.edu/ml/datasets/Iris

**Train the KNN algorithm to be able to distinguish the species from one another given the measurements of the 4 features.**

In [2]:
# All the Imports

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.impute import SimpleImputer  # sklearn.preprocessing.Imputer was removed in scikit-learn 0.22
from sklearn.model_selection import cross_val_score


column_name=['sepal length in cm','sepal width in cm','petal length in cm','petal width in cm','class']
dataFrame=pd.read_csv("C:/Users/anupama.pushparaju/Documents/Python/Internal_Lab_Residency3/Iris.txt",names=column_name)
print('**** DATAFRAME ****')
dataFrame

**** DATAFRAME ****


|   | sepal length in cm | sepal width in cm | petal length in cm | petal width in cm | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |


## Question 1

Import the data set and print 10 random rows from the data set

In [None]:
rand_10=dataFrame.sample(10)
rand_10

## Data Pre-processing

## Question 2 - Estimating missing values

*It's not good to always remove records with missing values, since we may end up losing data points. Instead, we replace the missing values with an estimated value (the median).*

In [None]:
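from sklearn.impute import SimpleImputer

# Separate the features from the target column
data_Temp=dataFrame.drop('class',axis=1)
# Replace missing values in each column with that column's median
imputer=SimpleImputer(missing_values=np.nan,strategy='median')
data_Temp=pd.DataFrame(imputer.fit_transform(data_Temp),columns=data_Temp.columns,index=data_Temp.index)
data_Temp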

## Question 3 - Dealing with categorical data

Change all the classes to numerical values (0 to 2).

In [None]:
from sklearn.preprocessing import LabelEncoder
labelEncoder=LabelEncoder()

dataFrame['class']=labelEncoder.fit_transform(dataFrame['class'])

dataFrame['class'].unique()

## Question 4

*Observe the association of each independent variable with target variable and drop variables from feature set having correlation in range -0.1 to 0.1 with target variable.*

In [None]:
dataFrame.corr()
#No variables to be dropped as the correlation does not fall into the range of -0.1 to 0.1
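
Rather than reading the correlation matrix by eye, the filter described in this question can also be applied programmatically. The sketch below runs on a small made-up frame (the column names `strong` and `noise` are hypothetical) so that one feature clearly falls inside the -0.1 to 0.1 band; on the Iris data the same steps would be applied to `dataFrame`.

```python
import pandas as pd

# Hypothetical toy frame: 'strong' tracks the class, 'noise' does not
toy = pd.DataFrame({
    "strong": [1.0, 2.0, 3.0, 1.1, 2.1, 3.1],
    "noise":  [1.0, 2.0, 1.0, 1.0, 2.0, 1.0],
    "class":  [0, 1, 2, 0, 1, 2],
})

# Pearson correlation of every feature with the target
target_corr = toy.corr()["class"].drop("class")
# Features whose correlation with the target lies in [-0.1, 0.1]
weak = target_corr[target_corr.abs() <= 0.1].index.tolist()
reduced = toy.drop(columns=weak)
print(weak)
```

Note that Pearson correlation against a label-encoded class treats the class codes as ordered numbers, so this is a rough screening step rather than a rigorous test of association.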

## Question 5

*Observe the variance of each independent variable and drop variables with zero or almost zero variance (variance < 0.1). They will have almost no influence on the classification.*

In [None]:
dataFrame.var()
#The variance of all four independent variables is greater than 0.1, so we are not dropping any variables.

## Question 6

*Plot the scatter matrix for all the variables.*

In [None]:

import seaborn as sns
sns.pairplot(dataFrame,diag_kind='kde')

## Split the dataset into training and test sets

## Question 7

*Split the dataset into training and test sets with 80-20 ratio.*

In [None]:
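from sklearn.model_selection import train_test_split

# Features: every column except the target
x=dataFrame.drop('class',axis=1)
print(x)
# Target as a 1-D Series, which is the shape scikit-learn expects for y
y=dataFrame['class']
print(y)
X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.20,random_state=1)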


## Question 8 - Model

*Build the model and train and test on training and test sets respectively using **scikit-learn**. Print the Accuracy of the model with different values of **k=3,5,9**.*

**Hint:** For accuracy you can check **accuracy_score()** in scikit-learn

In [None]:
from sklearn.metrics import accuracy_score

for k in [3,5,9]:
    nNH=KNeighborsClassifier(n_neighbors=k,weights='distance')
    nNH.fit(X_train,y_train)
    predicted_labels=nNH.predict(X_test)
    # accuracy_score compares the true test labels with the predictions
    print('For k ='+str(k)+' score is '+str(accuracy_score(y_test,predicted_labels)))
    # print('Confusion matrix:\n', metrics.confusion_matrix(y_test,predicted_labels))

## Question 9 - Cross Validation

Run KNN with the number of neighbours set to 1, 3, 5, ..., 19 and *find the **optimal number of neighbours** from this list using the misclassification error.*

Hint:

Misclassification error (MSE) = 1 - test accuracy score. Calculate the MSE for each model with neighbours = 1, 3, 5, ..., 19 and find the model with the lowest MSE. (Here MSE abbreviates misclassification error, not mean squared error.)

In [None]:

cv_scores=[]
misCalError=[]
k_values=np.arange(1,20,2,dtype='int')
range_of_K=pd.DataFrame(k_values,columns=['k'])
for i in k_values:
    nNH=KNeighborsClassifier(n_neighbors=i,weights='distance')

    #10-fold cross-validation on the training data
    scores=cross_val_score(nNH,X_train,y_train,cv=10,scoring='accuracy')
    cv_scores.append(scores.mean())

    #Misclassification error on the held-out test set
    nNH.fit(X_train,y_train)
    misCalError.append(1-nNH.score(X_test,y_test))

range_of_K['Mis_Cal_Error']=misCalError


In [None]:
range_of_K

## Question 10

*Plot misclassification error vs k (with k value on X-axis) using matplotlib.*

In [None]:
plt.plot(range_of_K['k'],range_of_K['Mis_Cal_Error'])
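
The table and plot show the error for each k, but Question 9 also asks for the optimal k itself. A minimal sketch of that last step follows. It makes two assumptions that differ from the cells above: it loads scikit-learn's bundled copy of Iris via `load_iris` instead of the local `Iris.txt`, and it ranks k by the cross-validated error on the training data rather than by the test-set error, so that the test set is not used for model selection.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled Iris data stands in for the local Iris.txt file
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=1)

k_values = list(range(1, 20, 2))
cv_errors = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k, weights='distance')
    # Mean 10-fold cross-validated accuracy on the training split
    acc = cross_val_score(model, X_tr, y_tr, cv=10, scoring='accuracy').mean()
    cv_errors.append(1 - acc)

# k with the lowest cross-validated misclassification error
best_k = k_values[int(np.argmin(cv_errors))]
print('Lowest cross-validated misclassification error at k =', best_k)
```

When several values of k tie for the lowest error, `np.argmin` returns the first, i.e. the smallest k among them.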

# Classification vs. Regression
The aim of this project is to predict how likely a student is to pass. Which type of supervised learning problem is this, classification or regression? Why?
Answer:
This project is a classification problem because the variable to predict, i.e. whether a student passes or fails, is categorical. In this case it is a dichotomous categorical variable whose only two possible values are "pass" and "fail".

### Overview:

1. Read the problem statement.

2. Get the dataset.

3. Explore the dataset.

4. Pre-process the dataset.

5. Transform the dataset for building the machine learning model.

6. Split the data into train and test sets.

7. Build the model.

8. Apply the model.

9. Evaluate the model.

10. Provide insights.

## Problem Statement 

Using logistic regression, **predict student performance**. The classification goal is to predict whether a student will pass or fail.

## Dataset 

This data describes student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and they were collected using school reports and questionnaires. The source provides two datasets, one for Mathematics and one for Portuguese language; this notebook uses the Mathematics file (student-mat.csv).

**Source:** https://archive.ics.uci.edu/ml/datasets/Student+Performance

# Question 1 - Exploring the Data
*Read the dataset file using pandas. Take care about the delimiter.*

#### Answer:

In [11]:
#Load the dataframe
dataFrame_student=pd.read_csv('C:\\Users\\anupama.pushparaju\\Documents\\Python\\Internal_Lab_Residency3\\student\\student-mat.csv',delimiter=';')
print('*** Student Table ***')
print(dataFrame_student)
#Exploring the dataset
print(dataFrame_student.columns)
print(dataFrame_student.shape)


*** Student Table ***
    school sex  age address famsize Pstatus  Medu  Fedu      Mjob      Fjob  \
0       GP   F   18       U     GT3       A     4     4   at_home   teacher   
1       GP   F   17       U     GT3       T     1     1   at_home     other   
2       GP   F   15       U     LE3       T     1     1   at_home     other   
3       GP   F   15       U     GT3       T     4     2    health  services   
4       GP   F   16       U     GT3       T     3     3     other     other   
5       GP   M   16       U     LE3       T     4     3  services     other   
6       GP   M   16       U     LE3       T     2     2     other     other   
7       GP   F   17       U     GT3       A     4     4     other   teacher   
8       GP   M   15       U     LE3       A     3     2  services     other   
9       GP   M   15       U     GT3       T     3     4     other     other   
10      GP   F   15       U     GT3       T     4     4   teacher    health   
11      GP   F   15       U   

# Question 2 - drop missing values
*Set the index name of the dataframe to **"number"**. Check sample of data to drop if any missing values are there.*
*Use .dropna() function to drop the NAs*

#### Answer:

In [None]:
len(missing_values)

In [12]:
#Setting the index
dataFrame_student.index.name='number'
print(dataFrame_student)

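#Checking for null values
missing_values=dataFrame_student.columns[dataFrame_student.isnull().any()]
if len(missing_values)==0:
    print("No missing values")
#Drop any rows that contain missing values (a no-op when there are none)
dataFrame_student=dataFrame_student.dropna()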

       school sex  age address famsize Pstatus  Medu  Fedu      Mjob  \
number                                                                 
0          GP   F   18       U     GT3       A     4     4   at_home   
1          GP   F   17       U     GT3       T     1     1   at_home   
2          GP   F   15       U     LE3       T     1     1   at_home   
3          GP   F   15       U     GT3       T     4     2    health   
4          GP   F   16       U     GT3       T     3     3     other   
5          GP   M   16       U     LE3       T     4     3  services   
6          GP   M   16       U     LE3       T     2     2     other   
7          GP   F   17       U     GT3       A     4     4     other   
8          GP   M   15       U     LE3       A     3     2  services   
9          GP   M   15       U     GT3       T     3     4     other   
10         GP   F   15       U     GT3       T     4     4   teacher   
11         GP   F   15       U     GT3       T     2     1  serv

# Transform Data

## Question 3

*Print all the attribute names which are not numerical.*

**Hint:** check **select_dtypes()** and its **include** and **exclude** parameters.

#### Answer:

In [13]:
#Keep only the columns whose dtype is not numerical
data_temp=dataFrame_student.select_dtypes(exclude=["number"])
print("Non-numerical attributes")
print(data_temp.columns.tolist())


Non-numerical attributes
['school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic']

# Question 4 - Drop variables with less variance

*Find the variance of each numerical independent variable and drop those whose variance is less than 1. Use the .var() method to check the variance.*

In [24]:
print("**** Before Dropping ****")
#Variance is only defined for the numerical columns
var_dataFrame=dataFrame_student.var(numeric_only=True)
print(var_dataFrame)
#Names of the columns whose variance is below 1
low_var_cols=var_dataFrame[var_dataFrame < 1].index.tolist()
print(low_var_cols)
student_data=dataFrame_student.drop(low_var_cols,axis=1)
print("*** After Dropping ***")
student_data.columns

**** Before Dropping ****
age            1.628285
Medu           1.198445
Fedu           1.184180
traveltime     0.486513
studytime      0.704324
failures       0.553017
famrel         0.803997
freetime       0.997725
goout          1.239388
Dalc           0.793420
Walc           1.658678
health         1.932944
absences      64.049541
G1            11.017053
G2            14.148917
G3            20.989616
dtype: float64
['traveltime', 'studytime', 'failures', 'famrel', 'freetime', 'Dalc']
*** After Dropping ***


Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 'famsup', 'paid',
       'activities', 'nursery', 'higher', 'internet', 'romantic', 'goout',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
      dtype='object')

#### Variables with low variance take almost the same value for all records. Hence, they do not contribute much to classification.

# Question 5 - Encode all categorical variables to numerical

Take the list of categorical attributes (from the above result) and convert them into numerical variables. After that, print the head of the dataframe and check the values.

**Hint:** check **sklearn LabelEncoder()**

#### Answer:

In [20]:
student_cols=dataFrame_student.select_dtypes(exclude=np.number).columns.tolist()
student_cols

['school',
 'sex',
 'address',
 'famsize',
 'Pstatus',
 'Mjob',
 'Fjob',
 'reason',
 'guardian',
 'schoolsup',
 'famsup',
 'paid',
 'activities',
 'nursery',
 'higher',
 'internet',
 'romantic']

In [22]:
from sklearn import preprocessing
labelEncoder=preprocessing.LabelEncoder()

for column in student_cols:
    student_data[column]=labelEncoder.fit_transform(student_data[column])
student_data.head()

| number | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | higher | internet | romantic | goout | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 18 | 1 | 0 | 0 | 4 | 4 | 0 | 4 | ... | 1 | 0 | 0 | 4 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | 0 | 0 | 17 | 1 | 0 | 1 | 1 | 1 | 0 | 2 | ... | 1 | 1 | 0 | 3 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | 0 | 0 | 15 | 1 | 1 | 1 | 1 | 1 | 0 | 2 | ... | 1 | 1 | 0 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | 0 | 0 | 15 | 1 | 0 | 1 | 4 | 2 | 1 | 3 | ... | 1 | 1 | 1 | 2 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | 0 | 0 | 16 | 1 | 0 | 1 | 3 | 3 | 2 | 2 | ... | 1 | 0 | 0 | 2 | 2 | 5 | 4 | 6 | 10 | 10 |


# Question 6 - Convert the continuous values of grades into classes

*Consider the values in G1, G2 and G3 with >= 10 as pass(1) and < 10 as fail(0) and encode them into binary values. Print head of dataframe to check the values.*

#### Answer:
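
One possible approach, sketched on a small made-up frame rather than the real `student_data` (the three rows below are hypothetical): compare the grade columns against the threshold of 10 and cast the resulting booleans to integers.

```python
import pandas as pd

# Hypothetical grades standing in for the G1, G2, G3 columns of student_data
grades = pd.DataFrame({"G1": [5, 15, 10], "G2": [6, 14, 9], "G3": [6, 15, 10]})

grade_cols = ["G1", "G2", "G3"]
# >= 10 becomes pass (1), < 10 becomes fail (0)
grades[grade_cols] = (grades[grade_cols] >= 10).astype(int)
print(grades.head())
```

On the notebook's data the same two lines would be applied to `student_data` in place of `grades`.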

# Question 7

*Consider G3 as the target attribute and all remaining attributes as features to predict G3. Now, separate the feature and target attributes into separate dataframes named X and y.*

#### Answer:
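
A minimal sketch of the separation, again on a made-up frame (the name `encoded` and its values are hypothetical stand-ins for the encoded `student_data`): drop the target column to get the features, and select it to get the target.

```python
import pandas as pd

# Hypothetical encoded data with G3 as the binary target
encoded = pd.DataFrame({
    "age":  [18, 17, 15, 16],
    "Medu": [4, 1, 1, 3],
    "G1":   [0, 0, 0, 1],
    "G2":   [0, 0, 0, 1],
    "G3":   [0, 0, 1, 1],
})

X = encoded.drop(columns="G3")  # all attributes except the target
y = encoded["G3"]               # the target as a 1-D Series
print(X.shape, y.shape)
```

Selecting `y` as a Series rather than a one-column dataframe avoids shape warnings when it is later passed to a scikit-learn estimator.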

# Question 8 - Training and testing data split

*So far, you have converted all categorical features into numeric values. Now, split the data into training and test sets with training size of 300 records. Print the number of train and test records.*

**Hint:** check **train_test_split()** from **sklearn**

#### Answer:
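
A sketch of a fixed-size split: passing an integer as `train_size` to `train_test_split` requests that many training records, and the remainder goes to the test set. The data below is random placeholder data, and the total of 395 rows is an assumption about the size of the student file rather than something shown above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder features and target (hypothetical), 395 rows assumed
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.integers(0, 5, size=(395, 4)), columns=["f1", "f2", "f3", "f4"])
y_demo = pd.Series(rng.integers(0, 2, size=395), name="G3")

# An integer train_size means an absolute number of training records
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, train_size=300, random_state=1
)
print("Train records:", len(X_train_demo))
print("Test records:", len(X_test_demo))
```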

# Question 9 - Model Implementation and Testing the Accuracy

*Build a **LogisticRegression** classifier using the **fit()** function in sklearn. You need to import both LogisticRegression and accuracy_score from sklearn.*

#### Answer:
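
A minimal sketch of fitting and scoring the classifier. It uses synthetic data from `make_classification` as a stand-in for the student features (an assumption for the sake of a self-contained example), and `max_iter=1000` is an illustrative setting to give the solver room to converge, not a requirement of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded student data (hypothetical)
X_syn, y_syn = make_classification(n_samples=395, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, train_size=300, random_state=1)

# Fit on the training records only
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

# Accuracy on the held-out test records
test_acc = accuracy_score(y_te, clf.predict(X_te))
print("Test accuracy:", test_acc)
```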

# Question 10 - Print the intercept of the Logistic regression model (0.5 points)

The value of the intercept is stored in the model itself. You can read it from the .intercept_ attribute.

# Question 11 - Print the coefficients of the model and name the coefficient which has the highest impact on the dependent variable

Hint: Use .coef_ to get the coefficients and use pd.DataFrame to store the coefficients in a dataframe with the same column names as the independent variable dataframe
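
The following sketch shows how the fitted attributes can be inspected. The features `feat_a`, `feat_b` and `feat_c` and the rule that generates the target are hypothetical; the target is deliberately driven mostly by `feat_b` so that one coefficient clearly stands out.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical features; the target depends mainly on feat_b
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(200, 3)), columns=["feat_a", "feat_b", "feat_c"])
target = (3.0 * features["feat_b"] + 0.3 * features["feat_a"] > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(features, target)

# intercept_ and coef_ are attributes set by fit()
print("Intercept:", clf.intercept_)
coef_table = pd.DataFrame(clf.coef_, columns=features.columns)
print(coef_table)

# Feature with the largest absolute coefficient
strongest = coef_table.abs().iloc[0].idxmax()
print("Largest absolute coefficient:", strongest)
```

Comparing coefficient magnitudes is only meaningful when the features are on comparable scales, so on the student data it may be worth standardizing the features before reading the largest coefficient as the highest impact.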

# Question 12 - Predict the dependent variable for both training and test dataset

accuracy_score() should help you to print the accuracies
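
As a sketch of this last step, the fitted model's `predict()` can be called on both splits and each set of predictions compared with the true labels. The data is again synthetic (an assumption, generated with `make_classification`), so the printed accuracies say nothing about the student dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the student features and pass/fail target (hypothetical)
X_all, y_all = make_classification(n_samples=395, n_features=6, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(X_all, y_all, train_size=300, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

# Predictions for the training and the test records
train_pred = model.predict(X_fit)
test_pred = model.predict(X_hold)

train_acc = accuracy_score(y_fit, train_pred)
test_acc_hold = accuracy_score(y_hold, test_pred)
print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc_hold)
```

A training accuracy far above the test accuracy would suggest overfitting, which is why both numbers are worth printing side by side.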