# Essentials of Machine Learning Algorithms (with Python)

<a href= 'https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/'>Link to the tutorial</a>

Broadly, there are 3 types of Machine Learning Algorithms..

1. Supervised Learning
How it works: This algorithm consist of a target / outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables). Using these set of variables, we generate a function that map inputs to desired outputs. The training process continues until the model achieves a desired level of accuracy on the training data. Examples of Supervised Learning: Regression, Decision Tree, Random Forest, KNN, Logistic Regression etc.


2. Unsupervised Learning
How it works: In this algorithm, we do not have any target or outcome variable to predict / estimate.  It is used for clustering population in different groups, which is widely used for segmenting customers in different groups for specific intervention. Examples of Unsupervised Learning: Apriori algorithm, K-means.


3. Reinforcement Learning:
How it works:  Using this algorithm, the machine is trained to make specific decisions. It works this way: the machine is exposed to an environment where it trains itself continually using trial and error. This machine learns from past experience and tries to capture the best possible knowledge to make accurate business decisions. Example of Reinforcement Learning: Markov Decision Process


### List of Common Machine Learning Algorithms
Here is the list of commonly used machine learning algorithms. These algorithms can be applied to almost any data problem:

* Linear Regression
* Logistic Regression
* Decision Tree
* SVM
* Naive Bayes
* kNN
* K-Means
* Random Forest
* Dimensionality Reduction Algorithms
* Gradient Boosting algorithms
* GBM
* XGBoost
* LightGBM
* CatBoost


## 1. Linear Regression
It is used to estimate real values (cost of houses, number of calls, total sales etc.) based on continuous variable(s). Here, we establish relationship between independent and dependent variables by fitting a best line. This best fit line is known as regression line and represented by a linear equation Y= a *X + b.

The best way to understand linear regression is to relive this experience of childhood. Let us say, you ask a child in fifth grade to arrange people in his class by increasing order of weight, without asking them their weights! What do you think the child will do? He / she would likely look (visually analyze) at the height and build of people and arrange them using a combination of these visible parameters. This is linear regression in real life! The child has actually figured out that height and build would be correlated to the weight by a relationship, which looks like the equation above.

In this equation:

* Y – Dependent Variable
* a – Slope
* X – Independent variable
* b – Intercept
These coefficients a and b are derived based on minimizing the sum of squared difference of distance between data points and regression line.

Look at the below example. Here we have identified the best fit line having linear equation y=0.2811x+13.9. Now using this equation, we can find the weight, knowing the height of a person.


Linear Regression is of mainly two types: Simple Linear Regression and Multiple Linear Regression. Simple Linear Regression is characterized by one independent variable. And, Multiple Linear Regression(as the name suggests) is characterized by multiple (more than 1) independent variables. While finding best fit line, you can fit a polynomial or curvilinear regression. And these are known as polynomial or curvilinear regression.

In [1]:
#let us read the data
import pandas as pd


### I am using a dataset "breatCancer" from UCI for these tests because it is already cleaned

In [2]:
data = pd.read_csv('../breastCancer.csv')

In [3]:
data.shape

(569, 32)

In [4]:
data.head(10)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
5,843786,M,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,...,15.47,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244
6,844359,M,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,...,22.88,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368
7,84458202,M,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,...,17.06,28.14,110.6,897.0,0.1654,0.3682,0.2678,0.1556,0.3196,0.1151
8,844981,M,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,...,15.49,30.73,106.2,739.3,0.1703,0.5401,0.539,0.206,0.4378,0.1072
9,84501001,M,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,...,15.09,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075


In [5]:
y = data.iloc[:,1]

In [6]:
#the values are categorical so we need to transform them
y[:10]

0    M
1    M
2    M
3    M
4    M
5    M
6    M
7    M
8    M
9    M
Name: diagnosis, dtype: object

In [7]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [8]:
y[:10]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [9]:
X = data.iloc[:,2:29]

## This is where I split the data so that it can fit into the model

In [10]:
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.30, random_state=7)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((398, 27), (398,), (171, 27), (171,))

In [11]:
X_test

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst
350,11.660,17.07,73.70,421.0,0.07561,0.03630,0.008306,0.011620,0.1671,0.05731,...,0.006296,0.022160,0.002668,13.28,19.74,83.61,542.5,0.09958,0.06476,0.03046
259,15.530,33.56,103.70,744.9,0.10630,0.16390,0.175100,0.083990,0.2091,0.06650,...,0.010220,0.009947,0.003359,18.49,49.54,126.30,1035.0,0.18830,0.55640,0.57030
115,11.930,21.53,76.53,438.6,0.09768,0.07849,0.033280,0.020080,0.1688,0.06194,...,0.007711,0.012780,0.003856,13.67,26.15,87.54,583.0,0.15000,0.23990,0.15030
60,10.170,14.88,64.55,311.9,0.11340,0.08061,0.010840,0.012900,0.2743,0.06960,...,0.008193,0.041830,0.005953,11.02,17.45,69.86,368.6,0.12750,0.09866,0.02168
275,11.890,17.36,76.20,435.6,0.12250,0.07210,0.059290,0.074040,0.2015,0.05875,...,0.019100,0.026780,0.003002,12.40,18.99,79.46,472.4,0.13590,0.08368,0.07153
53,18.220,18.70,120.30,1033.0,0.11480,0.14850,0.177200,0.106000,0.2092,0.06310,...,0.009222,0.026740,0.005126,20.60,24.13,135.10,1321.0,0.12800,0.22970,0.26230
221,13.560,13.90,88.59,561.3,0.10510,0.11920,0.078600,0.044510,0.1962,0.06303,...,0.008360,0.018420,0.002918,14.98,17.13,101.10,686.6,0.13760,0.26980,0.25770
284,12.890,15.70,84.08,516.6,0.07818,0.09580,0.111500,0.033900,0.1432,0.05935,...,0.017740,0.018780,0.003696,13.90,19.69,92.12,595.6,0.09926,0.23170,0.33440
146,11.800,16.58,78.99,432.0,0.10910,0.17000,0.165900,0.074150,0.2678,0.07371,...,0.018430,0.056280,0.004635,13.74,26.38,91.93,591.7,0.13850,0.40920,0.45040
480,12.160,18.03,78.29,455.3,0.09087,0.07838,0.029160,0.015270,0.1464,0.06284,...,0.005161,0.014540,0.001858,13.34,27.87,88.83,547.4,0.12080,0.22790,0.16200


In [12]:
X_train.head(10)

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst
543,13.21,28.06,84.88,538.4,0.08671,0.06877,0.02987,0.03275,0.1628,0.05781,...,0.009117,0.01724,0.001343,14.37,37.17,92.48,629.6,0.1072,0.1381,0.1062
58,13.05,19.31,82.61,527.2,0.0806,0.03789,0.000692,0.004167,0.1819,0.05501,...,0.004167,0.0219,0.00299,14.23,22.25,90.24,624.1,0.1021,0.06191,0.001845
436,12.87,19.54,82.67,509.2,0.09136,0.07883,0.01797,0.0209,0.1861,0.06347,...,0.006502,0.02223,0.002378,14.45,24.38,95.14,626.9,0.1214,0.1652,0.07127
453,14.53,13.98,93.86,644.2,0.1099,0.09242,0.06895,0.06495,0.165,0.06121,...,0.01136,0.02207,0.003563,15.8,16.93,103.1,749.9,0.1347,0.1478,0.1373
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4
386,12.21,14.09,78.78,462.0,0.08108,0.07823,0.06839,0.02534,0.1646,0.06154,...,0.01087,0.01921,0.004622,13.13,19.29,87.65,529.9,0.1026,0.2431,0.3076
74,12.31,16.52,79.19,470.9,0.09172,0.06829,0.03372,0.02272,0.172,0.05914,...,0.007965,0.01386,0.002304,14.11,23.21,89.71,611.1,0.1176,0.1843,0.1703
264,17.19,22.07,111.6,928.3,0.09726,0.08995,0.09061,0.06527,0.1867,0.0558,...,0.009875,0.01144,0.001575,21.58,29.33,140.5,1436.0,0.1558,0.2567,0.3889
510,11.74,14.69,76.31,426.0,0.08099,0.09661,0.06726,0.02639,0.1499,0.06758,...,0.01528,0.0226,0.006822,12.45,17.6,81.25,473.8,0.1073,0.2793,0.269
532,13.68,16.33,87.76,575.5,0.09277,0.07255,0.01752,0.0188,0.1631,0.06155,...,0.005077,0.01054,0.001697,15.85,20.2,101.6,773.4,0.1264,0.1564,0.1206


In [13]:
#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
# Create linear regression object
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(X_train, y_train)
linear.score(X_train, y_train)
#Equation coefficient and Intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
#Predict Output
predicted = linear.predict(X_test)

Coefficient: 
 [-2.07308728e-01  9.66241529e-03  1.35833802e-02  5.66048010e-04
 -1.99267143e+00 -5.30668862e+00  9.65806455e-01  3.95429565e+00
  8.10433592e-01  5.58426833e+00  1.53554857e-01 -2.62394603e-02
 -1.88840380e-02  6.42084918e-04  7.79970515e+00 -7.02596823e-01
 -4.68122170e+00  1.39662745e+01  5.17055371e+00  7.02418771e+00
  2.32777227e-01  2.29762745e-03  1.24914929e-05 -1.31925243e-03
  2.45853093e+00  5.09704868e-01  6.75234751e-01]
Intercept: 
 -2.0455083789919124


In [14]:
z = [int(i)for i in predicted] #these are the values predicted we take either 0 or 1

Z = pd.DataFrame(z)
Z

Unnamed: 0,0
0,0
1,1
2,0
3,0
4,0
5,0
6,0
7,0
8,0
9,0


In [15]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 171 entries, 350 to 2
Data columns (total 27 columns):
radius_mean               171 non-null float64
texture_mean              171 non-null float64
perimeter_mean            171 non-null float64
area_mean                 171 non-null float64
smoothness_mean           171 non-null float64
compactness_mean          171 non-null float64
concavity_mean            171 non-null float64
concave points_mean       171 non-null float64
symmetry_mean             171 non-null float64
fractal_dimension_mean    171 non-null float64
radius_se                 171 non-null float64
texture_se                171 non-null float64
perimeter_se              171 non-null float64
area_se                   171 non-null float64
smoothness_se             171 non-null float64
compactness_se            171 non-null float64
concavity_se              171 non-null float64
concave points_se         171 non-null float64
symmetry_se               171 non-null float64
fr

In [16]:
#accuracy
from sklearn.metrics import accuracy_score
y = accuracy_score(y_test,Z)
y

0.8070175438596491

### How do i get the values of all the dataset?

In [17]:
#if I use the full dataset of X which is not sliced I can be able to get the values of all the inputs
w = linear.predict(X)

In [18]:
#turn the values into dataframes
w1 = [int(i) for i in w]
w2 = pd.DataFrame(w1)
w2.head(10)

Unnamed: 0,0
0,1
1,0
2,1
3,1
4,0
5,0
6,0
7,0
8,0
9,1


In [19]:
#i want to add the values to their IDs and the original diagnosis
labels = ['Orig','Predicted','ID']
l1 = data.iloc[:,1]
l2 = w2.iloc[:,0]
l3 = data.iloc[:,0]
columns = [l1,l2,l3]
# for i, j in zip(l1,l2):
#     print(i,j)
df = pd.DataFrame(columns,labels)
df.T.head(10)

Unnamed: 0,Orig,Predicted,ID
0,M,1,842302
1,M,0,842517
2,M,1,84300903
3,M,1,84348301
4,M,0,84358402
5,M,0,843786
6,M,0,844359
7,M,0,84458202
8,M,0,844981
9,M,1,84501001


# 2. Logistic Regression
Don’t get confused by its name! It is a classification not a regression algorithm. It is used to estimate discrete values ( Binary values like 0/1, yes/no, true/false ) based on given set of independent variable(s). In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function. Hence, it is also known as logit regression. Since, it predicts the probability, its output values lies between 0 and 1 (as expected).

Again, let us try and understand this through a simple example.

Let’s say your friend gives you a puzzle to solve. There are only 2 outcome scenarios – either you solve it or you don’t. Now imagine, that you are being given wide range of puzzles / quizzes in an attempt to understand which subjects you are good at. The outcome to this study would be something like this – if you are given a trignometry based tenth grade problem, you are 70% likely to solve it. On the other hand, if it is grade fifth history question, the probability of getting an answer is only 30%. This is what Logistic Regression provides you.

Coming to the math, the log odds of the outcome is modeled as a linear combination of the predictor variables.


odds= p/ (1-p) = probability of event occurrence / probability of not event occurrence
ln(odds) = ln(p/(1-p))
logit(p) = ln(p/(1-p)) = b0+b1X1+b2X2+b3X3....+bkXk


Above, p is the probability of presence of the characteristic of interest. It chooses parameters that maximize the likelihood of observing the sample values rather than that minimize the sum of squared errors (like in ordinary regression).

Now, you may ask, why take a log? For the sake of simplicity, let’s just say that this is one of the best mathematical way to replicate a step function. I can go in more details, but that will beat the purpose of this article.

In [20]:
X2 = X.copy() #this is fine for our tex
X2

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst
0,17.990,10.38,122.80,1001.0,0.11840,0.27760,0.300100,0.147100,0.2419,0.07871,...,0.015870,0.03003,0.006193,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.71190
1,20.570,17.77,132.90,1326.0,0.08474,0.07864,0.086900,0.070170,0.1812,0.05667,...,0.013400,0.01389,0.003532,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.24160
2,19.690,21.25,130.00,1203.0,0.10960,0.15990,0.197400,0.127900,0.2069,0.05999,...,0.020580,0.02250,0.004571,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.45040
3,11.420,20.38,77.58,386.1,0.14250,0.28390,0.241400,0.105200,0.2597,0.09744,...,0.018670,0.05963,0.009208,14.910,26.50,98.87,567.7,0.20980,0.86630,0.68690
4,20.290,14.34,135.10,1297.0,0.10030,0.13280,0.198000,0.104300,0.1809,0.05883,...,0.018850,0.01756,0.005115,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.40000
5,12.450,15.70,82.57,477.1,0.12780,0.17000,0.157800,0.080890,0.2087,0.07613,...,0.011370,0.02165,0.005082,15.470,23.75,103.40,741.6,0.17910,0.52490,0.53550
6,18.250,19.98,119.60,1040.0,0.09463,0.10900,0.112700,0.074000,0.1794,0.05742,...,0.010390,0.01369,0.002179,22.880,27.66,153.20,1606.0,0.14420,0.25760,0.37840
7,13.710,20.83,90.20,577.9,0.11890,0.16450,0.093660,0.059850,0.2196,0.07451,...,0.014480,0.01486,0.005412,17.060,28.14,110.60,897.0,0.16540,0.36820,0.26780
8,13.000,21.82,87.50,519.8,0.12730,0.19320,0.185900,0.093530,0.2350,0.07389,...,0.012260,0.02143,0.003749,15.490,30.73,106.20,739.3,0.17030,0.54010,0.53900
9,12.460,24.04,83.97,475.9,0.11860,0.23960,0.227300,0.085430,0.2030,0.08243,...,0.014320,0.01789,0.010080,15.090,40.68,97.65,711.4,0.18530,1.05800,1.10500


In [21]:
y2 = data.iloc[:,1]
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y2 = le.fit_transform(y2)

In [22]:
#split the data
from sklearn.model_selection import train_test_split

X2_train,X2_test,y2_train,y2_test, = train_test_split(X2,y2,test_size = 0.25, random_state = 7)

In [23]:
#Import Library
from sklearn.linear_model import LogisticRegression
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create logistic regression object
model = LogisticRegression()
# Train the model using the training sets and check score
model.fit(X2, y2)
model.score(X2, y2)
#Equation coefficient and Intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
#Predict Output
predicted2 = model.predict(X2_test)
predicted2

Coefficient: 
 [[-2.11342248e+00 -1.20723647e-01  6.71357864e-02  2.46808939e-03
   1.52495081e-01  4.07565401e-01  6.45674995e-01  3.37782817e-01
   2.27634878e-01  2.67091529e-02  2.00587323e-02 -1.23058948e+00
  -4.63832723e-02  9.68119889e-02  1.65755128e-02  1.21701405e-03
   5.24547534e-02  4.00069633e-02  4.40252015e-02 -5.48599772e-03
  -1.28854764e+00  3.43611451e-01  1.32890025e-01  2.43158691e-02
   2.82076100e-01  1.16375973e+00  1.59740370e+00]]
Intercept: 
 [-0.39424138]


array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

In [24]:
predicted2

array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

In [25]:
accuracy_score(y2_test,predicted2)#calculate the accuracy score, and its better than logistics regression

0.958041958041958

In [26]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y2_test,predicted2))

[[97  1]
 [ 5 40]]


In [27]:
from sklearn.metrics import classification_report
print(classification_report(y2_test,predicted2))

             precision    recall  f1-score   support

          0       0.95      0.99      0.97        98
          1       0.98      0.89      0.93        45

avg / total       0.96      0.96      0.96       143



#### Furthermore..
There are many different steps that could be tried in order to improve the model:

* including interaction terms
* removing features
* regularization techniques
* using a non-linear model


# 3. Decision Tree
This is one of my favorite algorithm and I use it quite frequently. It is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets. This is done based on most significant attributes/ independent variables to make as distinct groups as possible. For more details, you can read: Decision Tree Simplified.

Tree based learning algorithms are considered to be one of the best and mostly used supervised learning methods. Tree based methods empower predictive models with high accuracy, stability and ease of interpretation. Unlike linear models, they map non-linear relationships quite well. They are adaptable at solving any kind of problem at hand (classification or regression).

Methods like decision trees, random forest, gradient boosting are being popularly used in all kinds of data science problems. Hence, for every analyst (fresher also), it’s important to learn these algorithms and use them for modeling.

#### Example:-

Let’s say we have a sample of 30 students with three variables Gender (Boy/ Girl), Class( IX/ X) and Height (5 to 6 ft). 15 out of these 30 play cricket in leisure time. Now, I want to create a model to predict who will play cricket during leisure period? In this problem, we need to segregate students who play cricket in their leisure time based on highly significant input variable among all three.

This is where decision tree helps, it will segregate the students based on all values of three variable and identify the variable, which creates the best homogeneous sets of students (which are heterogeneous to each other). In the snapshot below, you can see that variable Gender is able to identify best homogeneous sets compared to the other two variables.

Types of Decision Trees
Types of decision tree is based on the type of target variable we have. It can be of two types:

* Categorical Variable Decision Tree: Decision Tree which has categorical target variable then it called as categorical variable decision tree. Example:- In above scenario of student problem, where the target variable was “Student will play cricket or not” i.e. YES or NO.

* Continuous Variable Decision Tree: Decision Tree has continuous target variable then it is called as Continuous Variable Decision Tree.

Example:- Let’s say we have a problem to predict whether a customer will pay his renewal premium with an insurance company (yes/ no). Here we know that income of customer is a significant variable but insurance company does not have income details for all customers. Now, as we know this is an important variable, then we can build a decision tree to predict customer income based on occupation, product and various other variables. In this case, we are predicting values for continuous variable.

### Important Terminology related to Decision Trees
Let’s look at the basic terminology used with Decision trees:

* Root Node: It represents entire population or sample and this further gets divided into two or more homogeneous sets.
* Splitting: It is a process of dividing a node into two or more sub-nodes.
* Decision Node: When a sub-node splits into further sub-nodes, then it is called decision node.
* Leaf/ Terminal Node: Nodes do not split is called Leaf or Terminal node
* Pruning: When we remove sub-nodes of a decision node, this process is called pruning. You can say opposite process of splitting.
* Branch / Sub-Tree: A sub section of entire tree is called branch or sub-tree.
* Parent and Child Node: A node, which is divided into sub-nodes is called parent node of sub-nodes where as sub-nodes are the child of parent node.

## Advantages
* Easy to Understand: Decision tree output is very easy to understand even for people from non-analytical background. It does not require any statistical knowledge to read and interpret them. Its graphical representation is very intuitive and users can easily relate their hypothesis.
* Useful in Data exploration: Decision tree is one of the fastest way to identify most significant variables and relation between two or more variables. With the help of decision trees, we can create new variables / features that has better power to predict target variable. You can refer article (Trick to enhance power of regression model) for one such trick.  It can also be used in data exploration stage. For example, we are working on a problem where we have information available in hundreds of variables, there decision tree will help to identify most significant variable.
* Less data cleaning required: It requires less data cleaning compared to some other modeling techniques. It is not influenced by outliers and missing values to a fair degree.
* Data type is not a constraint: It can handle both numerical and categorical variables.
* Non Parametric Method: Decision tree is considered to be a non-parametric method. This means that decision trees have no assumptions about the space distribution and the classifier structure.
 

## Disadvantages
* Over fitting: Over fitting is one of the most practical difficulty for decision tree models. This problem gets solved by setting constraints on model parameters and pruning (discussed in detailed below).
* Not fit for continuous variables: While working with continuous numerical variables, decision tree looses information when it categorizes variables in different categories.
 

In [28]:
X3 = X2.copy()

In [29]:
y3 = y2.copy()

In [30]:
from sklearn.model_selection import train_test_split

X3_train, X3_test, y3_train, y3_test = train_test_split(X3,y3, test_size = 0.25,random_state = 10)
X3_train.shape, X3_test.shape, y3_train.shape, y3_test.shape

((426, 27), (143, 27), (426,), (143,))

In [31]:
#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create tree object 
model = tree.DecisionTreeClassifier(criterion='gini') # for classification, here you can change the algorithm as gini or entropy (information gain) by default it is gini  
# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training sets and check score
model.fit(X3, y3)
model.score(X3, y3)
#Predict Output
predicted3= model.predict(X3_test)

In [32]:
predicted3

array([1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1,
       0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1])

In [33]:
accuracy_score(y3_test, predicted3)

1.0

# 4. SVM (Support Vector Machine)
It is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is number of features you have) with the value of each feature being the value of a particular coordinate.

For example, if we only had two features like Height and Hair length of an individual, we’d first plot these two variables in two dimensional space where each point has two co-ordinates (these co-ordinates are known as Support Vectors)

## What is a classification analysis?
Let’s consider an example to understand these concepts. We have a population composed of 50%-50% Males and Females. Using a sample of this population, you want to create some set of rules which will guide us the gender class for rest of the population. Using this algorithm, we intend to build a robot which can identify whether a person is a Male or a Female. This is a sample problem of classification analysis. Using some set of rules, we will try to classify the population into two possible segments. For simplicity, let’s assume that the two differentiating factors identified are : Height of the individual and Hair Length. Following is a scatter plot of the sample.

<img src="xyplot.webp">

The blue circles in the plot represent females and green squares represents male. A few expected insights from the graph are :

1. Males in our population have a higher average height.

2. Females in our population have longer scalp hairs.

If we were to see an individual with height 180 cms and hair length 4 cms, our best guess will be to classify this individual as a male. This is how we do a classification analysis.

## What is a Support Vector and what is SVM?
Support Vectors are simply the co-ordinates of individual observation. For instance, (45,150) is a support vector which corresponds to a female. Support Vector Machine is a frontier which best segregates the Male from the Females. In this case, the two classes are well separated from each other, hence it is easier to find a SVM.

### How to find the Support Vector Machine for case in hand?
There are many possible frontier which can classify the problem in hand. Following are the three possible frontiers.

<img src="xyplot1.webp">

How do we decide which is the best frontier for this particular problem statement?

The easiest way to interpret the objective function in a SVM is to find the minimum distance of the frontier from closest support vector (this can belong to any class). For instance, orange frontier is closest to blue circles. And the closest blue circle is 2 units away from the frontier. Once we have these distances for all the frontiers, we simply choose the frontier with the maximum distance (from the closest support vector). Out of the three shown frontiers, we see the black frontier is farthest from nearest support vector (i.e. 15 units).

### What if we do not find a clean frontier which segregates the classes?
Our job was relatively easier finding the SVM in this business case. What if the distribution looked something like as follows :

<img src="xyplot2.webp">
In such cases, we do not see a straight line frontier directly in current plane which can serve as the SVM. In such cases, we need to map these vector to a higher dimension plane so that they get segregated from each other. Such cases will be covered once we start with the formulation of SVM. For now, you can visualize that such transformation will result into following type of SVM.

<img src="xyplot3.webp">

Each of the green square in original distribution is mapped on a transformed scale. And transformed scale has clearly segregated classes. Many algorithms have been proposed to make these transformations and some of which will be discussed in following articles.

In [34]:
X4 = X2.copy()
X4.head(10)

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4
5,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,0.2087,0.07613,...,0.01137,0.02165,0.005082,15.47,23.75,103.4,741.6,0.1791,0.5249,0.5355
6,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,0.1794,0.05742,...,0.01039,0.01369,0.002179,22.88,27.66,153.2,1606.0,0.1442,0.2576,0.3784
7,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,0.2196,0.07451,...,0.01448,0.01486,0.005412,17.06,28.14,110.6,897.0,0.1654,0.3682,0.2678
8,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,0.235,0.07389,...,0.01226,0.02143,0.003749,15.49,30.73,106.2,739.3,0.1703,0.5401,0.539
9,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,0.203,0.08243,...,0.01432,0.01789,0.01008,15.09,40.68,97.65,711.4,0.1853,1.058,1.105


In [35]:
y4 = y2.copy()
y4[:10]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [36]:
from sklearn.model_selection import train_test_split
X4_test, X4_train, y4_test, y4_train = train_test_split(X4,y4,test_size = 0.30,random_state = 10)

In [37]:
#Import Library
from sklearn.svm import SVC
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create SVM classification object 
model = SVC(kernel = 'linear') # there is various option associated with it, this is simple for classification. You can refer link, for mo# re detail.
# Train the model using the training sets and check score
model.fit(X4, y4)
model.score(X4, y4)
#Predict Output
predicted4= model.predict(X4_test)

In [41]:
predicted4[:10]

array([0, 1, 0, 0, 1, 0, 0, 0, 0, 0])

In [39]:
accuracy_score(y4_test, predicted4)

0.9597989949748744

In [40]:
from sklearn.metrics import classification_report
print(classification_report(y4_test, predicted4 ))

             precision    recall  f1-score   support

          0       0.95      0.98      0.97       245
          1       0.97      0.92      0.95       153

avg / total       0.96      0.96      0.96       398



# 5. Naive Bayes
It is a classification technique based on Bayes’ theorem with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier would consider all of these properties to independently contribute to the probability that this fruit is an apple.

Naive Bayesian model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.

Bayes theorem provides a way of calculating posterior probability P(c|x) from P(c), P(x) and P(x|c). Look at the equation below:

<img src="Bayes_rule.webp">

Here,

* P(c|x) is the posterior probability of class (target) given predictor (attribute). 
* P(c) is the prior probability of class. 
* P(x|c) is the likelihood which is the probability of predictor given class. 
* P(x) is the prior probability of predictor.

### Example: 
Let’s understand it using an example. Below I have a training data set of weather and corresponding target variable ‘Play’. Now, we need to classify whether players will play or not based on weather condition. Let’s follow the below steps to perform it.

Step 1: Convert the data set to frequency table

Step 2: Create Likelihood table by finding the probabilities like Overcast probability = 0.29 and probability of playing is 0.64.

<img src = "Bayes_2.webp">

Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of prediction.

Problem: Players will pay if weather is sunny, is this statement is correct?

We can solve it using above discussed method, so P(Yes | Sunny) = P( Sunny | Yes) * P(Yes) / P (Sunny)

Here we have P (Sunny |Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P( Yes)= 9/14 = 0.64

Now, P (Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which has higher probability.

Naive Bayes uses a similar method to predict the probability of different class based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.


In [42]:
X5 = X2.copy()
y5 = y2.copy()

In [44]:
from sklearn.model_selection import train_test_split

X5_train, X5_test, y5_train, y5_test = train_test_split(X5,y5,test_size = 0.30 ,random_state = 10)
X5_train.shape, X5_test.shape, y5_train.shape, y5_test.shape

((398, 27), (171, 27), (398,), (171,))

In [45]:
#Import Library
from sklearn.naive_bayes import GaussianNB
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create SVM classification object model = GaussianNB() # there is other distribution for multinomial classes like Bernoulli Naive Bayes, Refer link
# Train the model using the training sets and check score
model.fit(X5, y5)
#Predict Output
predicted5 = model.predict(X5_test)

In [48]:
predicted5

array([1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1,
       0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0])

In [47]:
from sklearn.metrics import accuracy_score
accuracy_score(y5_test,predicted5)

0.9707602339181286

In [50]:
from sklearn.metrics import average_precision_score
average_precision_score(y5_test,predicted5)

0.92949251660224

In [56]:
from sklearn.metrics import classification_report
u =classification_report(y5_test,predicted5)
print(u)

             precision    recall  f1-score   support

          0       0.98      0.97      0.98       112
          1       0.95      0.97      0.96        59

avg / total       0.97      0.97      0.97       171



# 6. kNN (k- Nearest Neighbors)
It can be used for both classification and regression problems. However, it is more widely used in classification problems in the industry. K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of its k neighbors. The case being assigned to the class is most common amongst its K nearest neighbors measured by a distance function.

These distance functions can be Euclidean, Manhattan, Minkowski and Hamming distance. First three functions are used for continuous function and fourth one (Hamming) for categorical variables. If K = 1, then the case is simply assigned to the class of its nearest neighbor. At times, choosing K turns out to be a challenge while performing kNN modeling.

KNN can easily be mapped to our real lives. If you want to learn about a person, of whom you have no information, you might like to find out about his close friends and the circles he moves in and gain access to his/her information!

Things to consider before selecting kNN:

* KNN is computationally expensive
* Variables should be normalized else higher range variables can bias it
* Works on pre-processing stage more before going for kNN like outlier, noise removal

In [58]:
X6 = X2.copy()
y6 = y2.copy()

In [60]:
X6_train, X6_test, y6_train, y6_test = train_test_split(X6,y6, test_size = 0.30, random_state = 5)
X6_train.shape, X6_test.shape, y6_train.shape, y6_test.shape

((398, 27), (171, 27), (398,), (171,))

In [62]:
#Import Library
from sklearn.neighbors import KNeighborsClassifier
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create KNeighbors classifier object model 
KNeighborsClassifier(n_neighbors=6) # default value for n_neighbors is 5
# Train the model using the training sets and check score
model.fit(X6, y6)
#Predict Output
predicted6= model.predict(X6_test)

In [63]:
accuracy_score(y6_test, predicted6)

0.9824561403508771

# 7. K-Means

## Unsupervised, Clustering Algorithm

It is a type of unsupervised algorithm which  solves the clustering problem. Its procedure follows a simple and easy  way to classify a given data set through a certain number of  clusters (assume k clusters). Data points inside a cluster are homogeneous and heterogeneous to peer groups.

Remember figuring out shapes from ink blots? k means is somewhat similar this activity. You look at the shape and spread to decipher how many different clusters / population are present!

### How K-means forms cluster:

K-means picks k number of points for each cluster known as centroids.
Each data point forms a cluster with the closest centroids i.e. k clusters.
Finds the centroid of each cluster based on existing cluster members. Here we have new centroids.
As we have new centroids, repeat step 2 and 3. Find the closest distance for each data point from new centroids and get associated with new k-clusters. Repeat this process until convergence occurs i.e. centroids does not change.

### How to determine value of K:

In K-means, we have clusters and each cluster has its own centroid. Sum of square of difference between centroid and the data points within a cluster constitutes within sum of square value for that cluster. Also, when the sum of square values for all the clusters are added, it becomes total within sum of square value for the cluster solution.

We know that as the number of cluster increases, this value keeps on decreasing but if you plot the result you may see that the sum of squared distance decreases sharply up to some value of k, and then much more slowly after that. Here, we can find the optimum number of cluster.

<img src = 'Kmeans.webp'>

In [72]:
X7 = X2.copy()
y7 = y2.copy()

In [73]:
from sklearn.model_selection import train_test_split
X7_train, X7_test, y7_train, y7_test = train_test_split(X7,y7, test_size = 0.30, random_state = 0)

In [89]:
# #checking the optimak number of clusters
# from sklearn.decomposition import PCA 
# pca = PCA(n_components=8)
# fit = pca.fit(X7)
# fit

In [88]:
# import matplotlib.pyplot as plt
# val = pca.explained_variance_ratio_
# plt.plot(val)
# plt.show()

In [90]:
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=5,random_state= 0)

In [92]:
fit = k_means.fit(X7)

In [98]:
k_means.predict(X7_test)

array([0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 4, 0, 4, 2, 1, 4, 1, 0, 4,
       0, 2, 1, 2, 2, 0, 0, 1, 2, 1, 0, 4, 2, 0, 0, 1, 0, 4, 0, 2, 4, 2,
       0, 1, 2, 0, 2, 0, 4, 0, 1, 2, 2, 2, 2, 2, 2, 1, 0, 4, 2, 2, 4, 2,
       1, 4, 1, 2, 0, 4, 2, 0, 1, 0, 2, 2, 2, 2, 4, 1, 4, 2, 4, 2, 0, 2,
       4, 1, 2, 0, 0, 0, 2, 2, 1, 0, 2, 2, 2, 2, 0, 0, 1, 2, 4, 0, 0, 4,
       0, 1, 4, 0, 2, 2, 0, 2, 2, 0, 2, 2, 0, 2, 1, 2, 0, 2, 2, 2, 4, 2,
       0, 0, 2, 2, 0, 0, 3, 2, 2, 0, 1, 2, 2, 1, 0, 4, 2, 0, 2, 0, 2, 0,
       2, 0, 0, 0, 2, 4, 1, 2, 2, 4, 2, 4, 0, 4, 0, 0, 2], dtype=int32)

In [None]:
#Another way to do it is as below:-

In [80]:
from sklearn.neighbors import KNeighborsClassifier
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create KNeighbors classifier object model 
KNeighborsClassifier(n_neighbors=6) # default value for n_neighbors is 5
# Train the model using the training sets and check score
model.fit(X7, y7)
#Predict Output
predicted7= model.predict(X7_test)

In [81]:
#testing accuracy of the module
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y7_test,predicted7 )

0.930393907748069

# 8. Random Forest
Random Forest is a trademark term for an ensemble of decision trees. In Random Forest, we’ve collection of decision trees (so known as “Forest”). To classify a new object based on attributes, each tree gives a classification and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).

Each tree is planted & grown as follows:

* If the number of cases in the training set is N, then sample of N cases is taken at random but with replacement. This sample will be the training set for growing the tree.
* If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
* Each tree is grown to the largest extent possible. There is no pruning.
For more details on this algorithm, comparing with decision tree and tuning model parameters, I would suggest you to read these articles:

With increase in computational power, we can now choose algorithms which perform very intensive calculations. One such algorithm is “Random Forest”, which we will discuss in this article. While the algorithm is very popular in various competitions (e.g. like the ones running on Kaggle), the end output of the model is like a black box and hence should be used judiciously.

Before going any further, here is an example on the importance of choosing the best algorithm.

## Case Study
Following is a distribution of Annual income Gini Coefficients across different countries :

<img src = 'oecd-income_inequality_2013_2.png'>

Mexico has the second highest Gini coefficient and hence has a very high segregation in annual income of rich and poor. Our task is to come up with an accurate predictive algorithm to estimate annual income bracket of each individual in Mexico. The brackets of income are as follows :

* 1. Below $40,000

* 2. $40,000 – 150,000

* 3. More than $150,000

Following are the information available for each individual :

1. Age , 2. Gender,  3. Highest educational qualification, 4. Working in Industry, 5. Residence in Metro/Non-metro

We need to come up with an algorithm to give an accurate prediction for an individual who has following traits:

1. Age : 35 years , 2, Gender : Male , 3. Highest Educational Qualification : Diploma holder, 4. Industry : Manufacturing, 5. Residence : Metro

We will only talk about random forest to make this prediction in this article.

 

### The algorithm of Random Forest

Random forest is like bootstrapping algorithm with Decision tree (CART) model. Say, we have 1000 observation in the complete population with 10 variables. Random forest tries to build multiple CART model with different sample and different initial variables. For instance, it will take a random sample of 100 observation and 5 randomly chosen initial variables to build a CART model. It will repeat the process (say) 10 times and then make a final prediction on each observation. Final prediction is a function of each prediction. This final prediction can simply be the mean of each prediction.

 

## Back to Case  study
#### Disclaimer : The numbers in this article are illustrative

Mexico has a population of 118 MM. Say, the algorithm Random forest picks up 10k observation with only one variable (for simplicity) to build each CART model. In total, we are looking at 5 CART model being built with different variables. In a real life problem, you will have more number of population sample and different combinations of  input variables.

#### Salary bands :

Band 1 : Below $40,000

Band 2: $40,000 – 150,000

Band 3: More than $150,000

Following are the outputs of the 5 different CART model.

## CART 1 : Variable Age
<img src = 'rf1.png'>

## CART 2 : Variable Gender
<img src = 'rf2.png'>

## CART 3 : Variable Education
<img src = 'rf3.png'>

## CART 4 : Variable Residence

<img src = 'rf4.png'>
## CART 5 : Variable Industry
<img src = 'rf5.png'>

Using these 5 CART models, we need to come up with singe set of probability to belong to each of the salary classes. For simplicity, we will just take a mean of probabilities in this case study. Other than simple mean, we also consider vote method to come up with the final prediction. To come up with the final prediction let’s locate the following profile in each CART model :

1. Age : 35 years , 2, Gender : Male , 3. Highest Educational Qualification : Diploma holder, 4. Industry : Manufacturing, 5. Residence : Metro

For each of these CART model, following is the distribution across salary bands :

<img src = 'DF.png'>

The final probability is simply the average of the probability in the same salary bands in different CART models. As you can see from this analysis, that there is 70% chance of this individual falling in class 1 (less than $40,000) and around 24% chance of the individual falling in class 2.

In [99]:
x8 = X2.copy()
y8 = y2.copy()

In [100]:
from sklearn.model_selection import train_test_split
x8_train, x8_test, y8_train, y8_test = train_test_split(x8,y8, test_size = 0.30, random_state = 5)
x8_train.shape, x8_test.shape, y8_train.shape, y8_test.shape

((398, 27), (171, 27), (398,), (171,))

In [101]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(x8,y8)
predicted8 = model.predict(x8_test)

In [102]:
accuracy_score(y8_test,predicted8)

0.9941520467836257

In [103]:
print(classification_report(y8_test,predicted8))

             precision    recall  f1-score   support

          0       0.99      1.00      1.00       110
          1       1.00      0.98      0.99        61

avg / total       0.99      0.99      0.99       171



# Dimensionality Reduction Algorithms
In the last 4-5 years, there has been an exponential increase in data capturing at every possible stages. Corporates/ Government Agencies/ Research organisations are not only coming with new sources but also they are capturing data in great detail.

For example: E-commerce companies are capturing more details about customer like their demographics, web crawling history, what they like or dislike, purchase history, feedback and many others to give them personalized attention more than your nearest grocery shopkeeper.

As a data scientist, the data we are offered also consist of many features, this sounds good for building good robust model but there is a challenge. How’d you identify highly significant variable(s) out 1000 or 2000? In such cases, dimensionality reduction algorithm helps us along with various other algorithms like Decision Tree, Random Forest, PCA, Factor Analysis, Identify based on correlation matrix, missing value ratio and others.

In May ‘ 2015, we conducted a Data Hackathon ( a data science competition) in Delhi-NCR, India.

We gave participants the challenge to identify Human Activity Recognition Using Smartphones Data Set. The data set had 561 variables for training model used for the identification of Human activity in test data set.

The participants in hackathon had varied experience and expertise level. As expected, the experts did a commendable job at identifying the human activity. However, beginners & intermediates struggled with sheer number of variables in the dataset (561 variables). Under the pressure of time, these people tried using variables  really without understanding the significance level of variable(s).  They lacked the skill to filter information from seemingly high dimensional problems and reduce them to a few relevant dimensions – the skill of dimension reduction.

Further, this lack of skill came across in several forms in way of questions asked by various participants:

There are too many variables – do I need to explore each and every variable?
Are all variables important?
All variables are numeric and what if they have multi-collinearity? How can I identify these variables?
I want to use decision tree. It can automatically select the right variables. Is this a right technique?
I am using random forest but it is taking a high execution time because of high number of features
Is there any machine learning algorithm that can identify the most significant variables automatically?
As this is a classification problem, can I use SVM with all variables?
Which is the best tool to deal with high number of variable, R or Python?
If you have faced similar questions, you are reading the right article. In this article, we will look at various methods to identify the significant variables using the most common dimension reduction *techniques and methods.*


## Table of Contents
Why Dimension Reduction is Important in machine learning and predictive modeling?
What are Dimension Reduction techniques?
What are the benefits of using Dimension Reduction techniques?
What are the common methods to reduce number of Dimensions?
Is Dimensionality Reduction good or bad?


### Why Dimension Reduction is important in machine learning & predictive modeling?

The problem of unwanted increase in dimension is closely related to fixation of measuring / recording data at a far granular level then it was done in past. This is no way suggesting that this is a recent problem. It has started gaining more importance lately due to surge in data.

Lately, there has been a tremendous increase in the way sensors are being used in the industry. These sensors continuously record data and store it for analysis at a later point. In the way data gets captured, there can be a lot of redundancy. For example, let us take case of a motorbike rider in racing competitions. Today, his position and movement gets measured by GPS sensor on bike, gyro meters, multiple video feeds and his smart watch. Because of respective errors in recording, the data would not be exactly same. However, there is very little incremental information on position gained from putting these additional sources. Now assume that an analyst sits with all this data to analyze the racing strategy of the biker – he/ she would have a lot of variables / dimensions which are similar and of little (or no) incremental value. This is the problem of high unwanted dimensions and needs a treatment of dimension reduction.

Let’s look at other examples of new ways of data collection:

* Casinos are capturing data using cameras and tracking each and every move of their customers.
* Political parties are capturing data by expanding their reach on field
* Your smart phone apps collects a lot of personal details about you
* Your set top box collects data about which programs preferences and timings
* Organizations are evaluating their brand value by social media engagements (comments, likes), followers, positive and negative sentiments

With more variables, comes more trouble! And to avoid this trouble, dimension reduction techniques comes to the rescue.

### What are Dimension Reduction techniques?
Dimension Reduction refers to the process of converting a set of data having vast dimensions into data with lesser dimensions ensuring that it conveys similar information concisely. These techniques are typically used while solving machine learning problems to obtain better features for a classification or regression task.

Let’s look at the image shown below. It shows 2 dimensions x1 and x2, which are let us say measurements of several object in cm (x1) and inches (x2). Now, if you were to use both these dimensions in machine learning, they will convey similar information and introduce a lot of noise in system, so you are better of just using one dimension. Here we have converted the dimension of data from 2D (from x1 and x2) to 1D (z1), which has made the data relatively easier to explain.

In similar ways, we can reduce n dimensions of data set to k dimensions (k < n) . These k dimensions can be directly identified (filtered) or can be a combination of dimensions (weighted averages of dimensions) or new dimension(s) that represent existing multiple dimensions well.


What are the benefits of Dimension Reduction?
Let’s look at the benefits of applying Dimension Reduction process:

* It helps in data compressing and reducing the storage space required
* It fastens the time required for performing same computations. Less dimensions leads to less computing, also less dimensions can allow usage of algorithms unfit for a large number of dimensions
* It takes care of multi-collinearity that improves the model performance. It removes redundant features. For example: there is no point in storing a value in two different units (meters and inches).
* Reducing the dimensions of data to 2D or 3D may allow us to plot and visualize it precisely. You can then observe patterns more clearly. Below you can see that, how a 3D data is converted into 2D. First it has identified the 2D plane then represented the points on these two new axis z1 and z2.
* It is helpful in noise removal also and as result of that we can improve the performance of models.


### What are the common methods to perform Dimension Reduction?

There are many methods to perform Dimension reduction. I have listed the most common methods below:

1. Missing Values: While exploring data, if we encounter missing values, what we do? Our first step should be to identify the reason then impute missing values/ drop variables using appropriate methods. But, what if we have too many missing values? Should we impute missing values or drop the variables?

I would prefer the latter, because it would not have lot more details about data set. Also, it would not help in improving the power of model. Next question, is there any threshold of missing values for dropping a variable? It varies from case to case. If the information contained in the variable is not that much, you can drop the variable if it has more than ~40-50% missing values.

2. Low Variance: Let’s think of a scenario where we have a constant variable (all observations have same value, 5) in our data set. Do you think, it can improve the power of model? Ofcourse NOT, because it has zero variance. In case of high number of dimensions, we should drop variables having low variance compared to others because these variables will not explain the variation in target variables.

3. Decision Trees: It is one of my favorite techniques. It can be used as a ultimate solution to tackle multiple challenges like missing values, outliers and identifying significant variables. It worked well in our Data Hackathon also. Several data scientists used decision tree and it worked well for them.

4. Random Forest: Similar to decision tree is Random Forest. I would also recommend using the in-built feature importance provided by random forests to select a smaller subset of input features. Just be careful that random forests have a tendency to bias towards variables that have more no. of distinct values i.e. favor numeric variables over binary/categorical values.

 
5. High Correlation: Dimensions exhibiting higher correlation can lower down the performance of model. Moreover, it is not good to have multiple variables of similar information or variation also known as “Multicollinearity”. You can use Pearson (continuous variables) or Polychoric (discrete variables) correlation matrix to identify the variables with high correlation and select one of them using VIF (Variance Inflation Factor). Variables having higher value ( VIF > 5 ) can be dropped.

 

6. Backward Feature Elimination: In this method, we start with all n dimensions. Compute the sum of square of error (SSR) after eliminating each variable (n times). Then, identifying variables whose removal has produced the smallest increase in the SSR and removing it finally, leaving us with n-1 input features.

Repeat this process until no other variables can be dropped. Recently in Online Hackathon organised by Analytics Vidhya (11-12 Jun’15), Data scientist who held second position used Backward Feature Elimination in linear regression to train his model.
Reverse to this, we can use “Forward Feature Selection” method. In this method, we select one variable and analyse the performance of model by adding another variable. Here, selection of variable is based on higher improvement in model performance.


7. Factor Analysis: Let’s say some variables are highly correlated. These variables can be grouped by their correlations i.e. all variables in a particular group can be highly correlated among themselves but have low correlation with variables of other group(s). Here each group represents a single underlying construct or factor. These factors are small in number as compared to large number of dimensions. However, these factors are difficult to observe. There are basically two methods of performing factor analysis:

>>EFA (Exploratory Factor Analysis)

>>CFA (Confirmatory Factor Analysis)

8. Principal Component Analysis (PCA): In this technique, variables are transformed into a new set of variables, which are linear combination of original variables. These new set of variables are known as principle components. They are obtained in such a way that first principle component accounts for most of the possible variation of original data after which each succeeding component has the highest possible variance.

The second principal component must be orthogonal to the first principal component. In other words, it does its best to capture the variance in the data that is not captured by the first principal component. For two-dimensional dataset, there can be only two principal components. Below is a snapshot of the data and its first and second principal components. You can notice that second principle component is orthogonal to first principle component.

In [104]:
x9 = X2.copy()
y9 = y2.copy()

In [106]:
x9.head(10)

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4
5,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,0.2087,0.07613,...,0.01137,0.02165,0.005082,15.47,23.75,103.4,741.6,0.1791,0.5249,0.5355
6,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,0.1794,0.05742,...,0.01039,0.01369,0.002179,22.88,27.66,153.2,1606.0,0.1442,0.2576,0.3784
7,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,0.2196,0.07451,...,0.01448,0.01486,0.005412,17.06,28.14,110.6,897.0,0.1654,0.3682,0.2678
8,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,0.235,0.07389,...,0.01226,0.02143,0.003749,15.49,30.73,106.2,739.3,0.1703,0.5401,0.539
9,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,0.203,0.08243,...,0.01432,0.01789,0.01008,15.09,40.68,97.65,711.4,0.1853,1.058,1.105


In [109]:
y9[:10]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [110]:
from sklearn.model_selection import train_test_split
x9_train, x9_test, y9_train, y9_test = train_test_split(x9,y9, test_size = 0.2, random_state= 1234)
x9_train.shape, x9_test.shape, y9_train.shape, y9_test.shape

((455, 27), (114, 27), (455,), (114,))

In [114]:
from sklearn.decomposition import PCA
model = PCA(n_components= 10,whiten=True, tol= 0.4, random_state=1234)

In [115]:
#lets check the parameters we can change in this model
model.get_params()

{'copy': True,
 'iterated_power': 'auto',
 'n_components': 10,
 'random_state': 1234,
 'svd_solver': 'auto',
 'tol': 0.4,
 'whiten': True}

In [116]:
#let us fit the model now and see:
model.fit(x9,y9)

PCA(copy=True, iterated_power='auto', n_components=10, random_state=1234,
  svd_solver='auto', tol=0.4, whiten=True)

In [117]:
#reduce the dimensions of test and train
train_reduced = model.fit_transform(x9_train,y9_train)

In [118]:
train_reduced

array([[-0.39630535, -0.35143118,  0.34482807, ...,  0.28332037,
         0.4985502 , -0.6342089 ],
       [-0.79802171, -0.0785324 ,  0.23554171, ...,  0.05156462,
        -0.33181675, -0.45196981],
       [-0.59896309,  0.36911938, -0.00330755, ...,  1.28286509,
        -0.36380683,  0.00466264],
       ...,
       [-0.4016872 , -0.47900239,  0.21638678, ..., -0.44507672,
         0.85890525, -0.05576649],
       [ 0.87304025,  1.0675429 ,  1.03066507, ..., -0.24664661,
        -0.08509915,  1.22927685],
       [-0.55643337,  0.32696579, -0.32475448, ..., -0.39264916,
        -0.53607077, -0.00221829]])

## Gradient Boosting Algorithms

GBM is a boosting algorithm used when we deal with plenty of data to make a prediction with high prediction power. Boosting is actually an ensemble of learning algorithms which combines the prediction of several base estimators in order to improve robustness over a single estimator. It combines multiple weak or average predictors to a build strong predictor. These boosting algorithms always work well in data science competitions like Kaggle, AV Hackathon, CrowdAnalytix.

There are some machine learning engines. These engines make use of certain algorithms and help user reach to the output stage. Some of the most popular engines are Decision Tree and Regression.

In this article, we’ll introduce you to some of the best practices used to enhance power of these engines to achieve a higher predictability using an additional booster.



#### Where are Boosted algorithms required?
Boosted algorithms are used where we have plenty of data to make a prediction. And we seek exceptionally high predictive power. It is used to for reducing bias and variance in supervised learning. It combines multiple weak predictors to a build strong predictor.

If you ever want to participate in Kaggle competitions, I would suggest that you bookmark this article. Participants in Kaggle completitions use these boosting algorithms extensively.

The underlying engine used for boosting algorithms can be anything. For instance, AdaBoost is a boosting done on Decision stump. There are many other boosting algorithms which use other types of engine such as:

1. GentleBoost

2. Gradient Boosting (Always my first choice for any Kaggle problem)

3.  LPBoost

4. BrownBoost

Perhaps, I can go on adding more engines to this list. But, I would like to focus on these five boosting techniques which are the most commonly used. **Let’s first learn about – AdaBoost.**


## What are Classifier Boosting Algorithms ?
Classification problem is the one where we need to assign every observation to a given set of class. The easiest classification problem is the one with binary class. This problem can be solved using AdaBoost. Let’s take a very simple example to understand the underlying concept of AdaBoost. You have two classes : 0’s and 1’s. Each number is an observation. The only two features available is x-axis and y-axis. For instance (1,1)  is a 0 while (4,4) is a 1. Now using these two features you need to classify each observation. Our ultimate objective remains the same as any classifier problem : find the classification boundary. Following are the step we follow to apply an AdaBoost.

**Step 1 : Visualize the data :** Let’s first understand the data and find insights on whether we have a linear classifier boundary. As shown below, no such boundary exist which can separate 0’s from 1’s.

<img src = 'gb1.png'>


**Step 2 : Make the first Decision stump :** You have already read about decision trees in many of our previous articles. Decision stump is a unit depth tree which decides just 1 most significant cut on features. Here it chooses draw the boundary starting from the third row from top. Now the yellow portion is expected to be all 0’s and unshaded portion to be all 1’s. However, we see high number of false positive post we build this decision stump. We have nine 1’s being wrongly qualified as 0’s. And similarly eighteen 0’s qualified as 1’s.

<img src = 'pgb2.webp'>

**Step 3 : Give additional weight to mis-classified observations:** Once we know the misclassified observations, we give additional weight to these observations. Hence, you see 0’s and 1’s in bold which were misclassified before. In the next level, we will make sure that these highly weighted observation are classified correct

<img src = 'gb3.webp'>

**Step 4 : Repeat the process and combine all stumps to get final classifier :** We repeat the process multiple times and focus more on previously misclassified observations. Finally, we take a weighted mean of all the boudaries discovered which will look something as below.

<img src = 'gb4.webp'>

### Brief introduction to Regression boosters
Similar to classifier boosters, we also have regression boosters. In these problems we have continuous variable to predict. This is commonly done using gradient boosting algorithm. Here is a non-mathematical description of how gradient boost works :

Type of Problem – You have a set of variables vectors x1 , x2 and x3. You need to predict y which is a continuous variable.

Steps of Gradient Boost algorithm

* Step 1 : Assume mean is the prediction of all variables.

* Step 2 : Calculate errors of each observation from the mean (latest prediction).

* Step 3 : Find the variable that can split the errors perfectly and find the value for the split. This is assumed to be the latest prediction.

* Step 4 : Calculate errors of each observation from the mean of both the sides of split (latest prediction).

* Step 5 : Repeat the step 3 and 4 till the objective function maximizes/minimizes.

* Step 6 : Take a weighted mean of all the classifiers to come up with the final model.

We have excluded the mathematical formation of boosting algorithms from this article to keep the article simple.

In [122]:
x10 = X2.copy()
y10 = y2.copy()

In [123]:
from sklearn.model_selection import train_test_split
x10_train, x10_test,y10_train,y10_test = train_test_split(x10,y10, test_size = 0.30, random_state = 1234)

In [130]:
#Import Library
from sklearn.ensemble import GradientBoostingClassifier
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create Gradient Boosting Classifier object
model2= GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
# Train the model using the training sets and check score
model2.fit(x10, y10)
#Predict Output
predicted10= model2.predict(x10_test)

In [131]:
accuracy_score(y10_test,predicted10)

1.0

<a href = 'https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/'>Hyper Parameter Tuning XGBoost</a>

In [132]:
#supposing I play with the hyper parameters in the model
model2.get_params()
#below are the parameters we can play with

{'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 1.0,
 'loss': 'deviance',
 'max_depth': 1,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'presort': 'auto',
 'random_state': 0,
 'subsample': 1.0,
 'verbose': 0,
 'warm_start': False}

## Advantages of XGBoost

I’ve always admired the boosting capabilities that this algorithm infuses in a predictive model. When I explored more about its performance and science behind its high accuracy, I discovered many advantages:

* Regularization:
>Standard GBM implementation has no regularization like XGBoost, therefore it also helps to reduce overfitting.

>In fact, XGBoost is also known as ‘regularized boosting‘ technique.

* Parallel Processing:
>XGBoost implements parallel processing and is blazingly faster as compared to GBM.

>But hang on, we know that boosting is sequential process so how can it be parallelized? We know that each tree can be built only after the previous one, so what stops us from making a tree using all cores? I hope you get where I’m coming from. Check this link out to explore further.

>XGBoost also supports implementation on Hadoop.

* High Flexibility
>XGBoost allow users to define custom optimization objectives and evaluation criteria.

>This adds a whole new dimension to the model and there is no limit to what we can do.

* Handling Missing Values
> XGBoost has an in-built routine to handle missing values.

>User is required to supply a different value than other observations and pass that as a parameter. XGBoost tries different things as it encounters a missing value on each node and learns which path to take for missing values in future.


* Tree Pruning:
>A GBM would stop splitting a node when it encounters a negative loss in the split. Thus it is more of a greedy algorithm.

>XGBoost on the other hand make splits upto the max_depth specified and then start pruning the tree backwards and remove splits beyond which there is no positive gain.

>Another advantage is that sometimes a split of negative loss say -2 may be followed by a split of positive loss +10. GBM would stop as it encounters -2. But XGBoost will go deeper and it will see a combined effect of +8 of the split and keep both.

* Built-in Cross-Validation
>XGBoost allows user to run a cross-validation at each iteration of the boosting process and thus it is easy to get the exact optimum number of boosting iterations in a single run.

>This is unlike GBM where we have to run a grid-search and only a limited values can be tested.

* Continue on Existing Model
>User can start training an XGBoost model from its last iteration of previous run. This can be of significant advantage in certain specific applications.

>GBM implementation of sklearn also has this feature so they are even on this point.
I hope now you understand the sheer power XGBoost algorithm. Note that these are the points which I could muster. You know a few more? Feel free to drop a comment below and I will update the list.

Did I whet your appetite ? Good. You can refer to following web-pages for a deeper understanding:

### I landed on some links that have some useful information about the topic above:

Studied various articles on handling data:
 1- <a href ='https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python' >Cool Article</a>
 2- <a href ='https://www.kaggle.com/nanomathias/feature-engineering-importance-testing' >Feature Engineering</a>
 3- <a href ='https://www.kaggle.com/asindico/customer-segments-with-pca' >Principle Component Analysis</a>
 4- <a href ='https://www.kaggle.com/dansbecker/cross-validation' >Cross Validation</a>
 5- <a href ='https://www.kaggle.com/willkoehrsen/intro-to-model-tuning-grid-and-random-search' >Hyper parameter tuning</a>
 6- <a href ='https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/' >XBoost</a>
  7- <a href ='https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/#' >complete-guide-parameter-tuning-gradient-boosting-gbm-python/#</a>

## Catboost
CatBoost is a recently open-sourced machine learning algorithm from Yandex. It can easily integrate with deep learning frameworks like Google’s TensorFlow and Apple’s Core ML.

The best part about CatBoost is that it does not require extensive data training like other ML models, and can work on a variety of data formats; not undermining how robust it can be.

Make sure you handle missing data well before you proceed with the implementation.

Catboost can automatically deal with categorical variables without showing the type conversion error, which helps you to focus on tuning your model better rather than sorting out trivial errors.

>> 'Error: could not convert string to float:'

This error occurs when dealing with categorical (string) variables. In sklearn, you are required to convert these categories in the numerical format.

In order to do this conversion, we use several pre-processing methods like “label encoding”, “one hot encoding” and others.

In this article, I will discuss a recently open sourced library ” CatBoost” developed and contributed by Yandex. CatBoost can use categorical features directly and is scalable in nature.

### What is CatBoost?
CatBoost is a recently open-sourced machine learning algorithm from Yandex. It can easily integrate with deep learning frameworks like Google’s TensorFlow and Apple’s Core ML. It can work with diverse data types to help solve a wide range of problems that businesses face today. To top it up, it provides best-in-class accuracy.

It is especially powerful in two ways:

>>It yields state-of-the-art results without extensive data training typically required by other machine learning methods, and

>>Provides powerful out-of-the-box support for the more descriptive data formats that accompany many business problems.

“CatBoost” name comes from two words “Category” and “Boosting”.

As discussed, the library works well with multiple Categories of data, such as audio, text, image including historical data.

“Boost” comes from gradient boosting machine learning algorithm as this library is based on gradient boosting library. Gradient boosting is a powerful machine learning algorithm that is widely applied to multiple types of business challenges like fraud detection, recommendation items, forecasting and it performs well also. It can also return very good result with relatively less data, unlike DL models that need to learn from a massive amount of data.


### Advantages of CatBoost Library
* **Performance:** CatBoost provides state of the art results and it is competitive with any leading machine learning algorithm on the performance front.
* **Handling Categorical features automatically:** We can use CatBoost without any explicit pre-processing to convert categories into numbers. CatBoost converts categorical values into numbers using various statistics on combinations of categorical features and combinations of categorical and numerical features. You can read more about it here.
* **Robust:** It reduces the need for extensive hyper-parameter tuning and lower the chances of overfitting also which leads to more generalized models. Although, CatBoost has multiple parameters to tune and it contains parameters like the number of trees, learning rate, regularization, tree depth, fold size, bagging temperature and others. You can read about all these parameters here.
* **Easy-to-use:** You can use CatBoost from the command line, using an user-friendly API for both Python and R.

In [1]:
import pandas as pd
import numpy as np

from catboost import CatBoostRegressor

from sklearn.model_selection import train_test_split

Xx_train, Xx_validation, yx_train, yx_validation = train_test_split(X5, y5, train_size=0.7, random_state=1234)
#categorical_features_indices = np.where(X.dtypes != np.float)[0]

#importing library and building model
from catboost import CatBoostRegressormodel
modelX=CatBoostRegressor(iterations=50, depth=3, learning_rate=0.1, loss_function='RMSE')

modelX.fit(Xx_train, yx_train,cat_features=categorical_features_indices,eval_set=(X_validation, y_validation),plot=True)
modelX.predict(yx_validation)
submission = pd.DataFrame()

submission['Item_Identifier'] = test['Item_Identifier']
submission['Outlet_Identifier'] = test['Outlet_Identifier']
submission['Item_Outlet_Sales'] = model.predict(test)

  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)


NameError: name 'X5' is not defined