#Introduction
Classification can be performed on structured or unstructured data.

To start with, let's learn about the classification of structured data.

Before getting into structured data, what is classification?

Classification is a technique for categorizing data into a given number of classes. The main goal of a classification problem is to identify the category/class that a new data point falls under.

Now, what is structured data?

Any data with a high level of organization can be considered structured data. This includes data in an Excel sheet, a relational database, etc.

#Vocabulary: Classification
Classifier: An algorithm that maps the input data to a specific category.

Feature: An individual measurable property of the phenomenon being observed.

Feature selection: The process of identifying/deriving the most meaningful data (features) from the given input.

Classification model: A model that tries to draw conclusions from the input values given for training, and then predicts the class labels/categories for new data.

#Vocabulary: Classification Types
Binary classification: A classification task with two possible outcomes. Eg: Gender classification (Male/Female).

Multi-class classification: Classification with more than two classes, where each sample is assigned to one and only one target label. Eg: An animal can be a cat or a dog, but not both at the same time.

Multi-label classification: A classification task where each sample is mapped to a set of target labels (more than one class). Eg: A news article can be about sports, a person, and a location at the same time.

Supervised classification: A technique where the learning is based on a training set of correctly labeled observations. Eg: Email classification where the input data is a set of emails labeled as spam/not spam.

Unsupervised classification: Grouping the observations into categories based on some similarity measure. Eg: Grouping news articles based on their content.

![alt text](https://docs-cdn.fresco.me/system/attachments/files/000/112/728/large/2c2cedcedda1397beb4102f8a3426420a3ea6c0d/Sdc_pipeline_v1.jpeg)



In [2]:
import sklearn.datasets

iris = sklearn.datasets.load_iris()

print(type(iris))

<class 'sklearn.utils.Bunch'>


#Problem Description
Let us understand structured data classification through a case study:

Churn Analysis in Telecommunication:

A customer is called a “churner” when he/she discontinues their subscription with a company and moves their business to a competitor. Predicting and preventing customer churn can protect a significant source of revenue for a business.

Here, we use a telecom customer data set to identify the customers who are likely to churn.

#Feature Identification
In any classification problem, identifying the right features plays a major role. So, how do we identify the features (columns) that influence the prediction?

In our case, analyzing the dataset shows that a column like Phone is likely irrelevant: a phone number is a unique identifier and does not depend on the call usage pattern.

Since Churn? is our target variable, we remove it from the feature set.

With these assumptions, we extract all the relevant columns required for our classification.

In [None]:
import numpy as np
import pandas as pd
#To read csv file
churn = pd.read_csv('dataset.csv', sep=',')
data_size=churn.shape
print(data_size)
churn_col_names=list(churn.columns)
print(churn_col_names)
print(churn.describe())
print(churn.head(3))
churn_target=churn['Churn?'] 
print(churn_target)
#Phone number : unique number (might not influence prediction)
#Churn? : target variable (not required in feature set)
cols_to_drop = ['Phone','Churn?']
#axis=1 depicts drop along columns
churn_feature = churn.drop(cols_to_drop,axis=1)
print(churn_feature)

#Categorical Data
Data that can be grouped into one or more categories is known as categorical data.

For example, in our case study the telecom users can be grouped by state, so State can be considered a categorical variable. The same is true for Area Code.

Data represented in a yes/no fashion can also be considered categorical. For example, the people with an international plan (Int'l Plan=yes) can be considered a group. Hence, Int'l Plan is also a categorical variable.

To identify the categorical variables in a dataset, select the columns that hold non-numeric (object) values, for example with churn.select_dtypes(include=[object]).

#Handling Categorical Data
Categorical data carries a lot of information, so such variables need to be handled rather than ignored.

Also, most machine learning algorithms in Python (the sklearn library) require input features as numerical arrays. Passing categorical data as-is results in an error. The following are some methods for dealing with such variables:

Convert to boolean

Label Encoding

One hot Encoding

#Convert to boolean
The 'yes'/'no' type categorical variables can be converted to boolean values (True/False). NumPy treats these as 1 or 0 when the data is converted to a numeric array.

In our example, the columns Int'l Plan and VMail Plan are categorical variables with yes/no values.



In [None]:
churn_categorical = churn.select_dtypes(include=[object])
print(churn_categorical)

#Changing the 'yes or no' values to boolean
yes_no_cols = ["Int'l Plan","VMail Plan"]
churn_feature[yes_no_cols] = churn_feature[yes_no_cols] == 'yes'
print(churn_feature)

#Label Encoding
Label encoding is a technique used to map non-numerical labels to numerical labels. The labels are encoded with values ranging from 0 to N-1, where N is the number of unique labels.

```
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
churn_feature['Area Code'] = label_encoder.fit_transform(churn_feature['Area Code'])
print(churn_feature)
```
After label encoding, the Area Code column is converted to numerical values.

Note: Make sure that the values in the label encoded fields do not overlap with any other columns within the dataset.

In [None]:
print('Churn data size before one hot encoding',churn_feature.shape)
print('No of unique states',len(churn_feature['State'].unique()))
#Give the feature and columns to one hot encode in 'columns' and column rename prefix in 'prefix'
churn_dumm=pd.get_dummies(churn_feature, columns=["State"], prefix=["State"])
print('Churn data size after one hot encoding',churn_dumm.shape)
#Converting to a numpy matrix of floats
churn_matrix = churn_dumm.values.astype(float)
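
The one hot encoding applied to the State column above replaces a categorical column with one indicator column per distinct value. The following is a minimal sketch on a made-up frame; the state codes and minute values are illustrative and are not taken from the churn dataset.

```python
import pandas as pd

# Toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "State": ["KS", "OH", "KS", "NJ"],
    "Day Mins": [265.1, 161.6, 243.4, 299.4],
})

# One indicator column per distinct state, each named with the "State_" prefix.
toy_encoded = pd.get_dummies(toy, columns=["State"], prefix=["State"])

print(list(toy_encoded.columns))
print(toy_encoded.shape)
```

The number of columns grows with the number of distinct values, which is why the shape is printed before and after encoding in the cell above.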

#Missing Values
If we encounter any missing values for any feature in our dataset, we need to handle those values for better classification.

Missing Inputs: Strategies

By deleting the observations

If we have a sufficient number of observations in our data, we can delete the rows (observations) that contain missing values. Make sure that deleting them does not introduce any bias.

By deleting the variables

We can drop variables based on a threshold. If a variable has more than 50% missing values, we can eliminate that feature. If a variable has only about 20% missing values, we can impute it rather than drop it.

Imputing

The missing values can be replaced by the mean, median or mode of the values present in that column.
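
The three strategies above can be sketched with pandas as follows. The frame and its column names are hypothetical, and the 50% threshold simply mirrors the rule of thumb mentioned above.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps; the column names are hypothetical.
df = pd.DataFrame({
    "mins": [10.0, np.nan, 30.0, 40.0],
    "calls": [1.0, 2.0, np.nan, 4.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 5.0],
})

# Strategy 1: delete the observations (rows) that contain any missing value.
rows_dropped = df.dropna(axis=0)

# Strategy 2: delete the variables (columns) with more than 50% missing values.
missing_share = df.isna().mean()
cols_kept = df.loc[:, missing_share <= 0.5]

# Strategy 3: impute the remaining gaps with the column mean.
imputed = cols_kept.fillna(cols_kept.mean())

print(rows_dropped.shape)
print(list(cols_kept.columns))
print(imputed)
```

Note how deleting rows leaves a single observation here, which illustrates why row deletion is only reasonable when enough data remains.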

#Imputing-Missing Values
In our dataset, we do not have any missing values. This might not always be the case. In order to demonstrate how to deal with such cases, here is a code snippet:

```
import numpy as np
from sklearn.impute import SimpleImputer
#Missing values replaced by mean
imp=SimpleImputer(missing_values=np.nan,strategy='mean',copy=True)
#Fit to data, then transform it.
churn_matrix=imp.fit_transform(churn_matrix)
```


In this example, each NaN value is replaced by the mean of the column it appears in.

#Standardization
Standardization is a technique for re-scaling a variable to a mean of zero and a standard deviation of one.

It centers and rescales the data without changing the shape of its distribution.

The mean is subtracted from each value, which results in a mean of zero. The difference is then divided by the standard deviation, resulting in a standard deviation of one.
```
from sklearn.preprocessing import StandardScaler
#Standardize the data by removing the mean and scaling to unit variance
scaler = StandardScaler()
#Fit to data, then transform it.
churn_matrix = scaler.fit_transform(churn_matrix)

```
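
To make the arithmetic behind standardization concrete, here is a small sketch that applies the two steps by hand with NumPy. The sample values are made up; they were chosen so that the mean is 5 and the standard deviation is 2.

```python
import numpy as np

# Made-up sample with mean 5 and (population) standard deviation 2.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Subtract the mean, then divide by the standard deviation.
# x.std() uses ddof=0 (population std), which is also what StandardScaler uses.
z = (x - x.mean()) / x.std()

print(z)
print(z.mean(), z.std())
```

The resulting values have a mean of zero and a standard deviation of one, while their relative spacing is unchanged.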

#Class Imbalance
Class imbalance occurs when the number of data samples of one class is much smaller than the number of data samples of another class.

Machine learning algorithms generally work better when the number of samples in each class is roughly equal. In classification problems, a large difference in the number of samples per class can lead to model fitting issues, with the model favoring the majority class.

#Handling Class Imbalance
How to handle class imbalance:

One technique to solve the class imbalance problem is balancing the dataset.

Under-sampling samples the majority class down to the same size as the minority class.

Oversampling samples the minority class up to the same size as the majority class.

At the algorithm level, we can adjust the class weight or the decision threshold, or modify an algorithm to perform well on imbalanced data.

Sometimes nothing needs to be done: the model works well without any data modification.
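
As a rough illustration of the two data-level techniques above, the sketch below performs random under-sampling and random oversampling with NumPy only. The labels (90 versus 10) and the placeholder feature column are made up, and random resampling is just one simple way to balance a dataset.

```python
import numpy as np

rng = np.random.default_rng(7)

# Made-up imbalanced labels: 90 non-churners (0) and 10 churners (1).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Under-sampling: keep a random subset of the majority class (no replacement).
under_idx = np.concatenate([
    rng.choice(majority_idx, size=len(minority_idx), replace=False),
    minority_idx,
])

# Oversampling: draw minority samples with replacement up to the majority size.
over_idx = np.concatenate([
    majority_idx,
    rng.choice(minority_idx, size=len(majority_idx), replace=True),
])

X_under, y_under = X[under_idx], y[under_idx]
X_over, y_over = X[over_idx], y[over_idx]

print(np.bincount(y_under))  # class counts after under-sampling
print(np.bincount(y_over))   # class counts after oversampling
```

Resampling like this is normally applied to the training data only, so that the test set still reflects the real class distribution. For the algorithm-level route, many scikit-learn classifiers accept a class_weight parameter.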

#Other Pre-processing Techniques
There are many more pre-processing techniques apart from the ones discussed.

However, based on the requirement and the type of dataset in hand, you might have to decide which of them to use.

A few other techniques are as follows:

Outlier Detection

Multi-Collinearity

Label Encoding

Normalization

Discretization

Correlation Analysis

Note: You can refer this for further details on pre-processing.

#Classification Algorithms
There are various algorithms for solving classification problems. Code to try out a few of these algorithms is covered in the upcoming cards.

We will discuss the following:

Decision Tree Classifier

Naive Bayes Classifier

Stochastic Gradient Descent Classifier

Support Vector Machine Classifier

Random Forest Classifier

Note: The explanations of these algorithms are given in the Machine Learning Axioms course. Refer to that course for further details.

#How Does a Classifier Work?
The following are the steps involved in building a classification model:

Initialize the classifier to be used.

Train the classifier - All classifiers in scikit-learn use a fit(X, y) method to fit the model (training) for the given training data X and training labels y.

Predict the target - Given unlabeled observations X, predict(X) returns the predicted labels y.

Evaluate the classifier model - score(X, y) returns the mean accuracy on the given test data X and test labels y.
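
The four steps above can be sketched end to end on a tiny made-up dataset. The data below is not the churn data; it is a single feature with two well-separated classes, chosen so the result is easy to verify by eye.

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up, clearly separable toy data: one feature, two classes.
X_train = [[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [[1.5], [9.5]]
y_test = [0, 1]

clf = DecisionTreeClassifier(random_state=0)  # 1. initialize the classifier
clf.fit(X_train, y_train)                     # 2. train on labeled data
predicted = clf.predict(X_test)               # 3. predict labels for new data
accuracy = clf.score(X_test, y_test)          # 4. evaluate (mean accuracy)

print(predicted, accuracy)
```

The same fit, predict and score calls apply unchanged to the other scikit-learn classifiers used in this notebook.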

#Train and Test Data
The following code snippet partitions the data into train and test sets for building the classifier model. This split is used to explain the classification algorithms.

```
seed=7 #To generate same sequence of random numbers
from sklearn.model_selection import train_test_split
#Splitting the data for training and testing(90% train,10% test)
train_data,test_data, train_label, test_label = train_test_split(churn_matrix, churn_target, test_size=.1,random_state=seed)
```


#Decision Tree Classification
It is one of the most commonly used classification techniques for performing binary as well as multi-class classification.

The decision tree model predicts the class/target by learning simple decision rules from the features of the data.

```
from sklearn.tree import DecisionTreeClassifier
#Initializing decision tree classifier
classifier=DecisionTreeClassifier(random_state=seed)
#Model training
classifier = classifier.fit(train_data, train_label)
#After being fitted, the model can then be used to predict the output.
churn_predicted_target=classifier.predict(test_data)
#Evaluating the classifier
score = classifier.score(test_data, test_label)
print('Decision Tree Classifier : ',score)
```

#Stochastic Gradient Descent Classifier
Used for large scale learning

Supports different loss functions & penalties for classification

```
from sklearn.linear_model import SGDClassifier
classifier =  SGDClassifier(loss='modified_huber', shuffle=True,random_state=seed)
classifier = classifier.fit(train_data, train_label)
churn_predicted_target=classifier.predict(test_data)
score = classifier.score(test_data, test_label)
print('SGD classifier : ',score)

```

#Support Vector Machine
Support Vector Machine (SVM) is effective in high-dimensional spaces.

It remains effective in cases where the number of dimensions is greater than the number of samples.

It works really well when there is a clear margin of separation between the classes.
```
from sklearn.svm import SVC
classifier = SVC(kernel="linear", C=0.025,random_state=seed)
classifier = classifier.fit(train_data, train_label)
churn_predicted_target=classifier.predict(test_data)
score = classifier.score(test_data, test_label)
print('SVM Classifier : ',score)
```

#Random Forest Classifier
Controls overfitting

A random forest fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy.

```
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=10,random_state=seed)
classifier = classifier.fit(train_data, train_label)
churn_predicted_target=classifier.predict(test_data)
score = classifier.score(test_data, test_label)
print('Random Forest Classifier : ',score)
```

#Model Tuning
The classification algorithms in machine learning are parameterized. Modifying any of those parameters can influence the results, so algorithm/model tuning is essential for finding the best model.

For example, let's take the Random Forest Classifier and change the values of a few parameters (n_estimators, max_features).

```
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=5, n_estimators=15, max_features=60,random_state=seed)
classifier = classifier.fit(train_data, train_label)
score=classifier.score(test_data, test_label)
print('Random Forest classification after model tuning',score)
```
Refer to the scikit-learn tutorials and try changing the parameters of the other classifiers and analysing the results.
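
Instead of changing parameters by hand, the search can be automated. The sketch below uses scikit-learn's GridSearchCV, which is a different technique from the manual edit shown above. The synthetic data from make_classification and the parameter values in the grid are assumptions for illustration; they stand in for the churn features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the churn features (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=12, random_state=7)

# Candidate values to try; every combination is evaluated with 3-fold CV.
param_grid = {"n_estimators": [10, 15], "max_features": [3, 6]}
search = GridSearchCV(
    RandomForestClassifier(max_depth=5, random_state=7),
    param_grid,
    cv=3,
)
search.fit(X, y)

print(search.best_params_)  # best combination found in the grid
print(search.best_score_)   # its mean cross-validated accuracy
```

A grid search only explores the values listed in param_grid, so the grid itself still has to be chosen with some care.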


#Partitioning the Data
It is a methodological mistake to train and test on the same dataset: a classifier that has simply memorized its training data would score well yet fail to predict correctly for unseen data. This situation is called overfitting.

To avoid this problem, we split our data into a training set, a validation set and a test set.

Training Set: The data used to train the classifier.

Validation Set: The data used to tune the classifier model parameters, i.e., to understand how well the model has been trained (a part of the training data).

Testing Set: The data used to evaluate the performance of the classifier (data unseen by the classifier).

This helps us to know how well our model performs.
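
One simple way to obtain the three sets is to call train_test_split twice. The sketch below uses placeholder data, and the 60/20/20 proportions are an assumption chosen for illustration rather than a recommendation from this course.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # 100 placeholder samples, 2 features
y = np.array([0, 1] * 50)           # placeholder labels

# First hold out 20% as the test set ...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)

# ... then carve a validation set out of the remaining 80% (0.25 * 80 = 20).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=7)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The test set is set aside first so that it plays no part in either training or parameter tuning.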

![alt text](https://docs-secure-cdn.fresco.me/system/attachments/files/006/327/967/original/a78b9ae8ab284bedb34b00c49d6f5f0df153dc91/cross_validation_V1.gif?Expires=1569930650&Signature=kLB~eFdXWz~1wgBLJAgQkiptP1ZBDL0cItFGIrJYszOlGPoCgzF1llS7SvMNCfvB~pLZ-eWwP4J1hQXsHhFJ~zeM3foAYAfrLqfpn-f2XgVUoSOwRWSUKxIISu3A7GTHhXEpVjvuGyWgliY4K7-BAOpCpr-IzhEodV5XiNV5Ne0GjPEJov1mYy~49OzuYsoJdJdkugbHNCB7Pdpeaj7rW-R39C7P8Lzi95JCMwG7kjgy90dYRzFG9L0FRN3r2UMBb66iQvwuaR38ddeI0lXgGa4QKdYK561DfS9dXzqxXyJLOikyClurJOLcYDFn4OCNauacVlHQ2MpldOFFQV4wLg__&Key-Pair-Id=APKAJUTRVJCFRZY3Z43A)

#Cross Validation
Cross validation is a model validation technique for evaluating the performance of a model on unseen data (the validation set).

It gives a better estimate of accuracy on unseen data than the training accuracy does.

Points to remember:

Cross validation gives high variance if the testing set and training set are not drawn from the same population.

Allowing training data to be included in the testing data will not give actual performance results.

In cross validation, the number of samples used for training the model is reduced, and the results depend upon the choice of the pair of training and testing sets.

You can refer to the various CV approaches from here.
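
As a minimal sketch of cross validation in practice, scikit-learn's cross_val_score trains and scores a model on several train/validation splits in one call. The synthetic data below is an assumption standing in for the churn features, and 5 folds is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=10, random_state=7)

# Each of the 5 folds is used once for validation and 4 times for training.
scores = cross_val_score(DecisionTreeClassifier(random_state=7), X, y, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across the folds
```

The spread of the per-fold scores gives a feel for how much the result depends on the particular split.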

#Stratified Shuffle Split
StratifiedShuffleSplit suits our case study because the dataset has a class imbalance, which can be seen from the class counts printed by the code snippet below.

StratifiedShuffleSplit splits the data in a random manner while preserving the proportion of samples from each class in the train and test sets.
```
from sklearn.model_selection import StratifiedShuffleSplit
#Class counts of the target show the imbalance
print(churn_target.value_counts())
sss = StratifiedShuffleSplit(n_splits=1,test_size=0.1, random_state=7)
sss.get_n_splits(churn_matrix,churn_target)
print(sss)
```
test_size=0.1 denotes that 10% of the dataset is used for testing.
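
To see what stratification does, the sketch below splits made-up labels (170 of one class and 30 of the other, not the churn data) and compares the class shares before and after the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up imbalanced labels: 85% class 0 and 15% class 1.
y = np.array([0] * 170 + [1] * 30)
X = np.zeros((len(y), 1))  # placeholder features; the split depends on y only

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=7)
train_idx, test_idx = next(splitter.split(X, y))

print(np.bincount(y) / len(y))                     # overall class shares
print(np.bincount(y[train_idx]) / len(train_idx))  # shares in the train part
print(np.bincount(y[test_idx]) / len(test_idx))    # shares in the test part
```

The minority class keeps the same share in both parts, so the test score is not distorted by a split that happens to contain too few churners.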

#Stratified Shuffle Split Contd...
This selection is then used to split the data into test and train sets.
```
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
classifiers = [
    DecisionTreeClassifier(),
    GaussianNB(),
    SGDClassifier(loss='modified_huber', shuffle=True),
    SVC(kernel="linear", C=0.025),
    KNeighborsClassifier(),
    OneVsRestClassifier(LinearSVC()),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=10),
    AdaBoostClassifier(),
   ]
for clf in classifiers:
    score=0
    for train_index, test_index in sss.split(churn_matrix,churn_target):
        X_train, X_test = churn_matrix[train_index], churn_matrix[test_index]
        #iloc selects by position, matching the positional indices returned by split
        y_train, y_test = churn_target.iloc[train_index], churn_target.iloc[test_index]
        clf.fit(X_train, y_train)
        score=score+clf.score(X_test, y_test)
    #Print the classifier name alongside its score
    print(type(clf).__name__, score)
```
The above code evaluates a list of classifiers on the same stratified split. It helps to select the best classifier based on the validation scores. The classifier with the highest score can be used for building the classification model.

Note: You may add or remove classifiers based on the requirement.

#Classification Accuracy
The classification accuracy is defined as the fraction of predictions that are correct, often reported as a percentage.
```
from sklearn.metrics import accuracy_score
print('Accuracy Score',accuracy_score(test_label,churn_predicted_target))
```
This simple classification accuracy does not tell us the types of errors made by our classifier.

It is the easiest metric to compute, but it does not reveal the underlying distribution of response values.
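
A small worked example shows why accuracy alone can mislead on imbalanced data. The labels below are made up: a "classifier" that always predicts the majority class never finds a churner, yet still scores high.

```python
# Made-up labels: 9 non-churners and 1 churner.
actual = [False] * 9 + [True]

# A "classifier" that always predicts the majority class.
predicted = [False] * 10

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

print(accuracy)  # 0.9
```

The 90% accuracy hides the fact that the only churner was missed, which is the kind of error the confusion matrix makes visible.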

#Confusion Matrix
It is a technique used to evaluate the performance of a classifier.

It depicts the performance in a tabular form that has two dimensions, namely the “actual” and “predicted” sets of data.

The cells of the table show the counts of false positives, false negatives, true positives and true negatives.
```
from sklearn.metrics import confusion_matrix
print('Confusion Matrix',confusion_matrix(test_label,churn_predicted_target))
```
The first parameter holds the true values and the second parameter holds the predicted values.

![alt text](https://docs-cdn.fresco.me/system/attachments/files/000/111/328/large/caf65e294bcfc5805872ceb53c49f83120d48141/perfomance_evalusation_measures.jpeg)

#Confusion Matrix
The above image is a confusion matrix for a two class classifier.

In the table,

TP (True Positive) - The number of correct predictions that the occurrence is positive

FP (False Positive) - The number of incorrect predictions that the occurrence is positive

FN (False Negative) - The number of incorrect predictions that the occurrence is negative

TN (True Negative) - The number of correct predictions that the occurrence is negative

TOTAL - The total number of occurrences
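
The four counts defined above can be computed by hand from a pair of label lists. The labels in this sketch are made up for illustration.

```python
# Made-up labels for eight customers (True = churned).
actual    = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a and p)          # predicted positive, correct
fp = sum(1 for a, p in pairs if not a and p)      # predicted positive, wrong
fn = sum(1 for a, p in pairs if a and not p)      # predicted negative, wrong
tn = sum(1 for a, p in pairs if not a and not p)  # predicted negative, correct
total = tp + fp + fn + tn

print(tp, fp, fn, tn, total)  # 2 1 1 4 8
```

For comparison, scikit-learn's confusion_matrix puts actual classes in rows and predicted classes in columns, so for boolean labels its layout is [[TN, FP], [FN, TP]].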


#Classification Report
The classification_report function builds a text report showing the commonly used classification metrics.

```
from sklearn.metrics import classification_report
target_names = ['False.', 'True.']
print(classification_report(test_label, churn_predicted_target, target_names=target_names))
```
Precision

When a positive value is predicted, how often is the prediction correct?

Recall

It is the true positive rate: when the actual value is positive, how often is the prediction correct?
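
Both questions translate directly into ratios of confusion matrix counts. The counts in this sketch are hypothetical and are not results from the churn model.

```python
# Hypothetical confusion matrix counts, chosen for easy arithmetic.
tp, fp, fn = 30, 10, 20

# Precision: of everything predicted positive, how much was actually positive?
precision = tp / (tp + fp)

# Recall: of everything actually positive, how much was predicted positive?
recall = tp / (tp + fn)

print(precision, recall)  # 0.75 0.6
```

A model can trade one for the other: predicting "churn" more readily tends to raise recall while lowering precision.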

To know more about model evaluation, check this link.

In [None]:
import pandas as pd
iris =pd.read_csv( "https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv")
classes=list(iris['species'].unique())
print(iris.size)
print(classes)

750
['setosa', 'versicolor', 'virginica']
