# Classification Models

## Package: Scikit-Learn

Classification is a large domain in the field of statistics and machine learning. Generally, classification can be broken down into two areas:

- Binary classification, where we wish to group an outcome into one of two groups.

- Multi-class classification, where we wish to group an outcome into one of multiple (more than two) groups.



## Logistic Regression

Logistic Regression is a type of Generalized Linear Model (GLM) that uses the logistic function to model the probability of a binary outcome as a function of one or more independent variables, which may be continuous or categorical.
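As a rough illustration of the logistic function itself, the sketch below maps a linear predictor to a probability. The intercept and slope are hypothetical values chosen only for the example; they are not fitted to any data.

```python
import numpy as np


def logistic(z):
    """Map a real-valued linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical coefficients, chosen only for illustration
intercept, slope = -1.0, 2.0

x = np.array([-2.0, 0.0, 0.5, 3.0])
probabilities = logistic(intercept + slope * x)

# Larger values of the linear predictor give probabilities closer to 1
print(probabilities)
```

A linear predictor of zero corresponds to a probability of 0.5, which is the usual decision boundary between the two classes.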

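To fit a binary logistic regression with sklearn, we create a `LogisticRegression` estimator and call its `fit` method with the predictors `X` and the target `y`. The default settings already handle a binary target, so the `multi_class` parameter does not need to be set (recent scikit-learn releases deprecate it).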

`from sklearn.linear_model import LogisticRegression`


We can then use the `predict` method to predict class labels for new data, the `predict_proba` method to get the predicted class probabilities, and the `score` method to get the mean prediction accuracy.
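A minimal sketch of these calls, assuming a synthetic dataset from `make_classification` in place of real data (the sample sizes, `max_iter` value and variable names are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data (an assumption; it stands in for a real dataset)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

# Fit the classifier; a larger max_iter gives the solver room to converge
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

labels = clf.predict(X_test)                # hard class labels (0 or 1)
probabilities = clf.predict_proba(X_test)   # one column per class
accuracy = clf.score(X_test, y_test)        # mean accuracy on the test set

print(labels[:5])
print(probabilities[:5])
print(accuracy)
```

Each row of `probabilities` sums to 1, and `predict` returns the class with the larger probability.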


## Support Vector Machines

Support Vector Machines (SVMs) are a more flexible type of classification algorithm: they can perform linear classification, and by swapping in a non-linear kernel function (for example a polynomial or RBF kernel) they can also learn non-linear decision boundaries.


`from sklearn import svm`
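To make the linear versus non-linear distinction concrete, here is a small sketch that fits `svm.SVC` with a linear kernel and with an RBF kernel. It assumes the synthetic `make_moons` dataset, whose two classes are not linearly separable; the dataset and settings are illustrative, not part of the original analysis.

```python
from sklearn import svm
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Two interleaving half-circles (an assumption, used only for illustration)
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

# Same estimator class, two different kernels
linear_svm = svm.SVC(kernel='linear').fit(X_train, y_train)
rbf_svm = svm.SVC(kernel='rbf').fit(X_train, y_train)

linear_accuracy = linear_svm.score(X_test, y_test)
rbf_accuracy = rbf_svm.score(X_test, y_test)

print('linear kernel accuracy:', linear_accuracy)
print('RBF kernel accuracy:', rbf_accuracy)
```

On data shaped like this the RBF kernel usually scores higher, but the gap depends on the noise level and the random split.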

## Random Forests

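Random Forests are an ensemble learning method that fits many Decision Trees, each on a bootstrap sample of the data with a random subset of features considered at each split, and combines their predictions (by averaging the trees' class probabilities in sklearn). We can again fit them using sklearn, use them to predict outcomes, and get the mean prediction accuracy.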

`from sklearn.ensemble import RandomForestClassifier`
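A short sketch of fitting and scoring a forest, assuming synthetic data from `make_classification`; the number of trees and the dataset parameters are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary data (an assumption; it stands in for a real dataset)
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

# 100 trees, each grown on a bootstrap sample of the training rows
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

predictions = forest.predict(X_test)      # predicted class labels
accuracy = forest.score(X_test, y_test)   # mean accuracy on the test set

print(len(forest.estimators_), 'trees fitted')
print('accuracy:', accuracy)
```

The fitted trees are available in `forest.estimators_`, and `forest.feature_importances_` gives one importance value per feature, summing to 1.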

## Neural Networks

Neural Networks are a machine learning algorithm built from layers of simple units ("neurons"): each unit computes a weighted sum of its inputs and passes it through a non-linear activation function, and one or more hidden layers sit between the inputs and the output. They are loosely inspired by a very simplified model of the brain and are used to model and predict data.

`from sklearn.neural_network import MLPClassifier`
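The sketch below fits a small multi-layer perceptron with two hidden layers. It assumes synthetic data and wraps the model in a pipeline with `StandardScaler`, since these networks tend to train more reliably on scaled inputs; the layer sizes and `max_iter` are illustrative choices rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data (an assumption; it stands in for a real dataset)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

# Scale the inputs, then fit a network with hidden layers of 16 and 8 units
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0),
)
mlp.fit(X_train, y_train)

accuracy = mlp.score(X_test, y_test)  # mean accuracy on the test set
print('accuracy:', accuracy)
```

The fitted network is the last step of the pipeline (`mlp[-1]`); its `coefs_` attribute holds one weight matrix per pair of adjacent layers.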

In [None]:
# import packages
import pandas as pd
from statsmodels.api import Logit

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import matplotlib.pyplot as plt

# plt.style.use('seaborn-whitegrid')
plt.rcParams['figure.figsize'] = [11, 7]

In [None]:
hotel_data = pd.read_csv('hotelloyaltydata.csv')

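# convert target variable to float codes; keep the class names in the same
# (sorted) order as the codes so reports label each class correctly
redeemer = hotel_data['Reedemer'].astype('category')
label = redeemer.cat.categories.astype(str).tolist()
hotel_data['Reedemer'] = redeemer.cat.codes.astype('float')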

# Convert str to categorical
hotel_data['Customer Segment'] = hotel_data['Customer Segment'].astype('category')
hotel_data['Income'] = hotel_data['Income'].astype('category')
hotel_data['Status'] = hotel_data['Status'].astype('category')
hotel_data['Region'] = hotel_data['Region'].astype('category')

# Dummy variables for all categorical columns (float dtype so that both sklearn and statsmodels accept them)
hotel_data = pd.concat([hotel_data, pd.get_dummies(hotel_data['Customer Segment'], prefix='CustomerSeg', drop_first=True, dtype=float)], axis=1) # Not relevant
hotel_data = pd.concat([hotel_data, pd.get_dummies(hotel_data['Income'], prefix='Income', drop_first=True, dtype=float)], axis=1) # Not relevant
hotel_data = pd.concat([hotel_data, pd.get_dummies(hotel_data['Status'], prefix='Status', drop_first=True, dtype=float)], axis=1)
hotel_data = pd.concat([hotel_data, pd.get_dummies(hotel_data['Region'], prefix='Region', drop_first=True, dtype=float)], axis=1)

# Non-numerical columns
non_numerical_columns = ['Customer Key', 'First Name', 'Last Name', 'Customer Segment', 'Income', 'Status', 'Region']

# Drop all non-numerical columns (the dummy variables added above are kept)
numerical_data = hotel_data.drop(columns=non_numerical_columns)

# Split data into train [70%] and test [30%]; shuffle=False keeps the original row order
train, test = train_test_split(numerical_data, train_size=0.7, shuffle=False)

# Predictor variables
X_train = train.drop(columns='Reedemer')
X_test = test.drop(columns='Reedemer')

# Target Variable
Y_train = train['Reedemer']
Y_test = test['Reedemer']

# LogisticRegression Model
LR = LogisticRegression().fit(X_train, Y_train)

# Mean prediction accuracy on the test set
LR.score(X_test, Y_test)

## Package: Statsmodels

`from statsmodels.api import Logit`

In [None]:
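# Fit model (note: statsmodels does not add an intercept automatically,
# so this model has no constant term unless X_train contains one)
sm_lr = Logit(Y_train, X_train).fit()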

# Check Summary
sm_lr.summary()

## Random Forest and Decision Tree Plot

In [None]:
random_forest = RandomForestClassifier()
random_forest.fit(X_train, Y_train)

y_hat = random_forest.predict(X_test)

print(metrics.classification_report(Y_test.values, y_hat, target_names=label))

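# Feature importances, largest first
feature_importances = pd.Series(random_forest.feature_importances_, index=X_test.columns)
print(feature_importances.sort_values(ascending=False).head())

# Plot the first tree of the forest, limited to the top three levels
tree.plot_tree(random_forest.estimators_[0], max_depth=3, feature_names=list(X_test.columns), class_names=list(label), filled=True)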
plt.show()