# Classification

Let's start by importing the necessary libraries and viewing the first five rows of the dataset.

In [1]:
import numpy as np
import pandas as pd

data = pd.read_csv('input/train.csv')

data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


We next check for missing values using the ```.info()``` method of the dataframe.

In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
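
The non-null counts above tell us indirectly how many values are missing in each column. As a minimal sketch, the same information can be read off directly with ```isnull().sum()```. The small frame below (`toy`) is made-up data for illustration, not the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# Count the missing entries in each column
missing = toy.isnull().sum()
print(missing)
```

Applied to `data`, this should report 177 missing values for *Age*, 687 for *Cabin*, and 2 for *Embarked*, which follows from subtracting the non-null counts above from the 891 rows.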


## Preprocessing

Let's first handle the *Age* column by filling in the null values with -0.5, a placeholder that cannot occur as a real age.

In [3]:
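# assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas
data["Age"] = data["Age"].fillna(-0.5)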

In [4]:
data["Embarked"].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

For the *Embarked* column we fill in the NaNs with 'S', which is the most frequently occurring class.

In [5]:
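# assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas
data["Embarked"] = data["Embarked"].fillna("S")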
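
Instead of reading the most frequent class off the ```value_counts()``` output and hard-coding it, it can also be computed. The following is a minimal sketch on a made-up series (`embarked` is a hypothetical stand-in for the real column).

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing ports of embarkation
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() ignores NaN by default and returns the most frequent value(s)
most_common = embarked.mode()[0]

# Fill the gaps with that value and keep the result in a new series
filled = embarked.fillna(most_common)
print(most_common, filled.isnull().sum())
```

If several values tie for most frequent, ```mode()``` returns all of them in sorted order, so taking element 0 is a simple but arbitrary tie-break.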

In [6]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
categorical = ["Sex", "Embarked"]
for col in categorical:
    # refit the encoder on each column and replace its labels with integer codes
    data[col] = le.fit_transform(data[col])
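
After encoding, it is easy to lose track of which integer stands for which label. As a rough sketch, the mapping can be recovered from the encoder's ```classes_``` attribute. The `sex` series and the `mapping` dictionary below are illustrative names, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column
sex = pd.Series(["male", "female", "female", "male"])

le = LabelEncoder()
codes = le.fit_transform(sex)

# classes_ holds the original labels in sorted order;
# the position of a label in classes_ is its integer code
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)
```

By the same sorted-order rule, the *Embarked* values are encoded as C = 0, Q = 1, and S = 2. Because the loop reuses one encoder, ```classes_``` only reflects the last column it was fitted on.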

Next, we drop the *Name*, *Ticket*, and *Cabin* columns, which we will not use.

In [7]:
data.drop(["Name","Ticket","Cabin"],axis = 1, inplace=True)

Let's call the ```.info()``` method again as a sanity check to ensure that everything went as expected.

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null int64
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       891 non-null int64
dtypes: float64(2), int64(7)
memory usage: 62.7 KB


## Dataset preparation and model evaluation

We next split the dataset into independent variables (features) and the dependent (target) variable, as the scikit-learn API expects. Then we generate separate training and test sets using the ```train_test_split()``` function from scikit-learn.

In [9]:
X_cla = data.drop("Survived",axis=1)
y_cla = data["Survived"]

In [10]:
from sklearn.model_selection import train_test_split

X_train_cla, X_test_cla, y_train_cla, y_test_cla = train_test_split(X_cla, y_cla, test_size=0.33, random_state=42)
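
To make the effect of ```test_size=0.33``` concrete, here is a small sketch that splits placeholder arrays with the same number of rows as our dataset (891). The arrays `X` and `y` are made up, and only their length matters here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 891 rows, like the dataset above
X = np.arange(891 * 2).reshape(891, 2)
y = np.arange(891) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# The test set size is rounded up, the rest goes to the training set
print(X_train.shape, X_test.shape)
```

Fixing ```random_state``` makes the shuffle reproducible, so repeated runs produce the same split.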

We now fit a couple of classifiers (K-nearest neighbors and Random Forest) and evaluate their performance.

In [11]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

In [13]:
# initialize classifiers by creating objects
knn = KNeighborsClassifier()
rf = RandomForestClassifier()

# fit to training data
knn.fit(X_train_cla,y_train_cla)
rf.fit(X_train_cla,y_train_cla)

print(f"Score of KNN classifier on training set: {knn.score(X_train_cla,y_train_cla)}")
print(f"Score of KNN classifier on test set: {knn.score(X_test_cla,y_test_cla)}\n\n")
print(f"Score of Random Forest Classifier on training set: {rf.score(X_train_cla,y_train_cla)}")
print(f"Score of Random Forest Classifier on test set: {rf.score(X_test_cla,y_test_cla)}")

Score of KNN classifier on training set: 0.7399328859060402
Score of KNN classifier on test set: 0.6576271186440678


Score of Random Forest Classifier on training set: 0.9714765100671141
Score of Random Forest Classifier on test set: 0.8
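
For classifiers, ```.score()``` reports the mean accuracy, that is, the fraction of samples classified correctly. The sketch below checks this against ```accuracy_score``` on synthetic data from ```make_classification```, which is an assumption made purely for illustration and is unrelated to the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

knn = KNeighborsClassifier().fit(X_train, y_train)

# score() predicts on X_test and compares the predictions with y_test
score = knn.score(X_test, y_test)
manual = accuracy_score(y_test, knn.predict(X_test))
print(score, manual)
```

Read this way, the gap between the Random Forest's training and test scores above suggests that it fits the training set more closely than it generalizes.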


# Regression

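To demonstrate the ease of use of the scikit-learn API, we'll now quickly go through an example of regression. For that, we'll use the Boston housing dataset, which we can import directly from ```sklearn.datasets```. Note that ```load_boston``` was removed in scikit-learn 1.2, so this cell requires an earlier version.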

In [14]:
from sklearn.datasets import load_boston

boston=load_boston()
type(boston)

sklearn.utils.Bunch

A ```Bunch``` behaves much like a dictionary, and its keys can be listed with the ```.keys()``` method.

In [15]:
print(boston.keys())

dict_keys(['data', 'target', 'feature_names', 'DESCR'])
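
Besides dictionary-style access, a ```Bunch``` also exposes its keys as attributes, which is why ```boston.data``` and ```boston.target``` work. A minimal sketch with a hand-built ```Bunch``` follows; its contents are hypothetical and only mirror some of the keys listed above.

```python
import numpy as np
from sklearn.utils import Bunch

# Hypothetical Bunch built by hand for illustration
bunch = Bunch(
    data=np.array([[1.0, 2.0], [3.0, 4.0]]),
    target=np.array([10.0, 20.0]),
    feature_names=["a", "b"],
)

# Key access and attribute access return the very same object
same = bunch["data"] is bunch.data
print(list(bunch.keys()), same)
```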


Let's have a look at the dataset.

In [40]:
X_reg=pd.DataFrame(boston.data,columns=boston.feature_names)
y_reg=pd.DataFrame(boston.target)
y_reg.columns=['Price']
pd.concat([X_reg,y_reg],axis=1).head()

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,Price
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


We next prepare the training and test sets. Since ```y_reg``` is a single-column DataFrame (a column vector), we call ```.values.ravel()``` on it to convert it to a 1-D array.

In [41]:
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg.values.ravel(), test_size=0.33, random_state=42)
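
The shape change performed by ```.values.ravel()``` can be seen on a tiny example. The frame below is a hypothetical stand-in for ```y_reg``` that reuses the first three prices shown earlier.

```python
import pandas as pd

# Hypothetical single-column target frame
y = pd.DataFrame({"Price": [24.0, 21.6, 34.7]})

two_d = y.values          # 2-D array with a single column
one_d = y.values.ravel()  # flattened 1-D array

print(two_d.shape, one_d.shape)
```

Passing the 1-D form avoids the warning that some scikit-learn estimators emit when the target is given as a column vector.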

We use two regressors: KNN and Random Forest.

In [42]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

knnr = KNeighborsRegressor()
rfr = RandomForestRegressor()

knnr.fit(X_train_reg, y_train_reg)
rfr.fit(X_train_reg, y_train_reg)

print(f"Score of KNN regressor on training set: {knnr.score(X_train_reg,y_train_reg)}")
print(f"Score of KNN regressor on test set: {knnr.score(X_test_reg,y_test_reg)}\n\n")
print(f"Score of Random Forest regressor on training set: {rfr.score(X_train_reg,y_train_reg)}")
print(f"Score of Random Forest regressor on test set: {rfr.score(X_test_reg,y_test_reg)}")

Score of KNN regressor on training set: 0.6422259111197384
Score of KNN regressor on test set: 0.5748334691810936


Score of Random Forest regressor on training set: 0.9700652699175766
Score of Random Forest regressor on test set: 0.8497987337635583
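
Note that for regressors ```.score()``` is not an accuracy but the coefficient of determination (R²), whose best possible value is 1.0 and which can be negative for a poor model. The sketch below verifies this against ```r2_score``` on synthetic data from ```make_regression```. The data and variable names are assumptions for illustration and are unrelated to the Boston dataset.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, used only for illustration
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

knnr = KNeighborsRegressor().fit(X_train, y_train)

# score() computes R^2 between y_test and the predictions on X_test
score = knnr.score(X_test, y_test)
manual = r2_score(y_test, knnr.predict(X_test))
print(score, manual)
```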
