<h1>Essentials of Machine Learning Algorithms (with Python)</h1>


Machine learning is the subfield of computer science that "gives computers the ability to learn without being explicitly programmed" (Arthur Samuel, 1959). Having evolved from the study of pattern recognition and computational learning theory in artificial intelligence, machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms go beyond strictly static program instructions by building a model from sample inputs and using it to make data-driven predictions or decisions. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms is infeasible; example applications include spam filtering, optical character recognition (OCR), search engines and computer vision.

Machine learning is closely related to (and often overlaps with) computational statistics, which also focuses on prediction-making through the use of computers. It has strong ties to mathematical optimization, which delivers methods, theory and application domains to the field. Machine learning is sometimes conflated with data mining, although the latter subfield focuses more on exploratory data analysis and is known as unsupervised learning.

Within the field of data analytics, machine learning is a method used to devise complex models and algorithms that lend themselves to prediction; in commercial use, this is known as predictive analytics. These analytical models allow researchers, data scientists, engineers, and analysts to "produce reliable, repeatable decisions and results" and uncover "hidden insights" through learning from historical relationships and trends in the data.

To read more go to: https://en.wikipedia.org/wiki/Machine_learning

<h3>Why is machine learning important?</h3>

Resurging interest in machine learning is due to the same factors that have made data mining and Bayesian analysis more popular than ever: growing volumes and varieties of available data, computational processing that is cheaper and more powerful, and affordable data storage.

All of these things mean it's possible to quickly and automatically produce models that can analyze bigger, more complex data and deliver faster, more accurate results – even on a very large scale. And by building precise models, an organization has a better chance of identifying profitable opportunities – or avoiding unknown risks.

<h5>Broadly, there are 3 types of Machine Learning Algorithms:</h5>

<h5>Supervised Learning:</h5>

<b>How it works</b>: This type of algorithm has a target / outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps inputs to desired outputs. The training process continues until the model achieves a desired level of accuracy on the training data. Examples of Supervised Learning: Regression, Decision Tree, Random Forest, KNN, Logistic Regression etc.
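As a minimal sketch of this idea, the snippet below learns a mapping from predictors to a target and then scores it on rows it has not seen. The synthetic data from `make_classification`, the variable names and the choice of `LogisticRegression` are assumptions made purely for illustration; they are not the dataset or model used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic predictors (X_demo) and target (y_demo); illustrative data only
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)

# Keep 25% of the rows aside so the model is scored on unseen examples
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Learn a function that maps the predictors to the target
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("train accuracy: %.2f, test accuracy: %.2f" % (train_accuracy, test_accuracy))
```

Comparing the two numbers gives a first hint of how well the learned function generalizes beyond the training rows.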

 
<h5>Unsupervised Learning:</h5>

<b>How it works</b>: In this type of algorithm, we do not have any target or outcome variable to predict / estimate. It is used for clustering a population into different groups, for example segmenting customers into groups for specific interventions. Examples of Unsupervised Learning: Apriori algorithm, K-means.
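The following sketch shows clustering without any target variable. The two "customer" groups are made-up numbers (hypothetical spend and visit counts) generated with NumPy, and `KMeans` with two clusters is one possible choice rather than a recommendation.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two hypothetical customer groups described by (spend, visits per month)
rng = np.random.RandomState(0)
low_spenders = rng.normal(loc=[20.0, 2.0], scale=1.0, size=(50, 2))
high_spenders = rng.normal(loc=[80.0, 10.0], scale=1.0, size=(50, 2))
customers = np.vstack([low_spenders, high_spenders])

# No labels are supplied; K-means groups the rows by similarity alone
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)

print("cluster sizes:", np.bincount(segments))
print("cluster centres:")
print(kmeans.cluster_centers_)
```

Because the two groups are far apart, each cluster should recover one of them; on real data the segments are rarely this clean.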

 
<h5>Reinforcement Learning:</h5>

<b>How it works</b>: Using this type of algorithm, the machine is trained to make specific decisions. It works this way: the machine is exposed to an environment where it trains itself continually using trial and error. The machine learns from past experience and tries to capture the best possible knowledge to make accurate decisions. Example of Reinforcement Learning: Markov Decision Process.
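To make trial-and-error learning concrete, here is a small sketch using tabular Q-learning, one common reinforcement learning method that the text above does not name. The environment is a made-up five-cell corridor with a reward at the right end; the learning rate, discount factor and purely random exploration are illustrative assumptions.

```python
import numpy as np

n_states, n_actions = 5, 2           # corridor cells 0..4; actions: 0 = left, 1 = right
goal = n_states - 1                  # reaching the last cell ends the episode with reward 1
alpha, gamma = 0.5, 0.9              # learning rate and discount factor (illustrative)
rng = np.random.RandomState(0)
Q = np.zeros((n_states, n_actions))  # estimated value of each action in each cell

for episode in range(500):
    state = 0
    while state != goal:
        action = rng.randint(n_actions)  # trial and error: act at random
        next_state = max(state - 1, 0) if action == 0 else state + 1
        reward = 1.0 if next_state == goal else 0.0
        # Q-learning update: move the estimate toward
        # reward + discounted value of the best next action
        target = reward + gamma * Q[next_state].max()
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

# Best known action in every non-goal cell
greedy_policy = Q[:goal].argmax(axis=1)
print("greedy policy (1 = right):", greedy_policy)
```

After training, the greedy policy moves right in every cell, which is the shortest route to the reward.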

To read more go to : https://www.analyticsvidhya.com/blog/2015/08/common-machine-learning-algorithms/

<h3>List of Common Machine Learning Algorithms:</h3>

Here is the list of commonly used machine learning algorithms. These algorithms can be applied to almost any data problem:

<ol><li> Linear Regression</li>
<li>Logistic Regression</li>
<li>Decision Tree</li>
<li>SVM</li>
<li>Naive Bayes</li>
<li>KNN</li>
<li>K-Means</li>
<li>Random Forest</li>
<li>Dimensionality Reduction Algorithms</li>
<li>Gradient Boosting &amp; AdaBoost</li></ol>



<h3>Choose The Best Machine Learning Model</h3>

How do you choose the best model for your problem?

When you work on a machine learning project, you often end up with multiple good models to choose from. Each model will have different performance characteristics.

Using resampling methods like cross validation, you can get an estimate for how accurate each model may be on unseen data. You need to be able to use these estimates to choose one or two best models from the suite of models that you have created.

<h3>Compare Machine Learning Models Carefully</h3>

When you have a new dataset, it is a good idea to visualize the data using different techniques in order to look at the data from different perspectives.

The same idea applies to model selection. You should use a number of different ways of looking at the estimated accuracy of your machine learning algorithms in order to choose the one or two to finalize.

A way to do this is to use different visualization methods to show the average accuracy, variance and other properties of the distribution of model accuracies.
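As a rough text-only stand-in for such visualizations, the sketch below summarizes the distribution of cross-validated accuracies for two candidate models. The synthetic dataset, the two candidates and the dictionary name `summary` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative data only; replace with your own X and y
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

summary = {}
for name, estimator in candidates.items():
    # One accuracy value per fold
    fold_scores = cross_val_score(estimator, X_demo, y_demo, cv=5)
    summary[name] = (fold_scores.mean(), fold_scores.std())
    print("%-14s mean=%.3f std=%.3f min=%.3f max=%.3f" % (
        name, fold_scores.mean(), fold_scores.std(),
        fold_scores.min(), fold_scores.max()))
```

The per-fold scores collected here are also the natural input for a box plot if a plotting library is available.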

In the next section you will discover exactly how you can implement various algorithms in Python and judge their performance based on accuracy <b>(with scikit-learn)</b>.

NOTE: <b>Scikit-learn</b> (formerly scikits.learn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
To know more about Scikit-learn, follow the link: http://scikit-learn.org/stable/

<h3>Packages needed to be imported</h3><pre>
1.scikit-learn
2.numpy</pre>

<h4>Installing scikit-learn and numpy</h4>

In [1]:
# uncomment to install
# !pip install -U scikit-learn
# !pip install -U numpy

<h3> Cross Validation: </h3>

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting.   
To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set.

There is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.
However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.
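The cost of a three-way partition can be seen in a short sketch. The toy arrays and the 60/20/20 proportions below are assumptions for illustration; `train_test_split` is simply applied twice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each (illustrative only)
X_all = np.arange(200).reshape(100, 2)
y_all = np.arange(100) % 2

# First hold out 20% of the data as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0
)

# Then hold out 25% of the remainder as the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Only 60 of the 100 samples remain for learning, and a different `random_state` would put different samples in each part.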

A solution to this problem is a procedure called <b>cross-validation (CV for short)</b>.
A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:

<ul>
<li>A model is trained using k-1 of the folds as training data;</li>
<li>the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).</li>
</ul>

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary test set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
<br>
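The k-fold procedure can be written out by hand to show what happens in each fold. This is a sketch on synthetic data; the use of `KFold` with shuffling, a decision tree as the model and k = 5 are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Illustrative data only
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0
)

k = 5
fold_accuracies = []
splitter = KFold(n_splits=k, shuffle=True, random_state=0)
for train_idx, valid_idx in splitter.split(X_demo):
    # Train on k-1 folds ...
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # ... and validate on the remaining fold
    fold_accuracies.append(model.score(X_demo[valid_idx], y_demo[valid_idx]))

# The reported measure is the average over the k folds
cv_accuracy = np.mean(fold_accuracies)
print("per-fold accuracy:", np.round(fold_accuracies, 2))
print("mean accuracy: %.2f" % cv_accuracy)
```

`cross_val_score`, used in the cells that follow, wraps this kind of loop in a single call.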

Below, we apply cross-validation to the training data for each algorithm and report the resulting accuracy range.

<h4>Reading training and heldout data</h4>

In [2]:
import csv
import numpy

def load_csv(path):
    # Text mode with newline="" is what the csv module expects in Python 3
    with open(path, newline="") as f:
        rows = numpy.array(list(csv.reader(f)))
    return rows[1:]  # skip the first row (assumed to be the header)

# Training data: column 4 holds the label, columns 5 onwards hold the features
result = load_csv("Data/EssentialsofML/trainingFilteredWithNewFeatures.csv")
X = result[:, 5:].astype('int')
y = result[:, 4]

# Heldout data uses the same column layout
resultHeldout = load_csv("Data/EssentialsofML/heldoutFilteredWithNewFeatures.csv")
heldoutX = resultHeldout[:, 5:].astype('int')
heldouty = resultHeldout[:, 4]

<h3>1. Support vector machines (SVMs) </h3>

SVMs are a set of supervised learning methods used for classification, regression and outlier detection.

The advantages of support vector machines are:<pre>
1.Effective in high dimensional spaces.
2.Still effective in cases where number of dimensions is greater than the number of samples.
3.Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
4.Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.</pre>

The disadvantages of support vector machines include:<pre>
1.If the number of features is much greater than the number of samples, the method is likely to give poor performance.
2.SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.</pre>
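Two of the points above, interchangeable kernels and the use of only a subset of training points, can be checked with a short sketch. The synthetic data and the dictionary name `support_counts` are assumptions for illustration; the exact counts will differ on other data.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Illustrative data only
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0
)

support_counts = {}
for kernel in ("linear", "rbf"):
    # The kernel is selected with a single constructor argument
    model = SVC(kernel=kernel).fit(X_demo, y_demo)
    # Only the support vectors are kept for the decision function
    support_counts[kernel] = model.support_vectors_.shape[0]
    print("%s kernel: %d support vectors out of %d training points" % (
        kernel, support_counts[kernel], len(X_demo)))
```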

<h5>Class required to implement support vector machines </h5><br>
class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape=None, random_state=None)

To read more, follow : http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

<h4>Cross validation for SVM</h4>

In [None]:
from sklearn import svm
from sklearn.model_selection import cross_val_score

#create model
clf = svm.SVC()

#Evaluate a score by cross-validation
scores = cross_val_score(clf, X, y, cv=2)

#report accuracy
print("Accuracy for SVM: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy for SVM: 0.58 (+/- 0.00)


<h4>SVM on heldout data</h4>

In [None]:
from sklearn import svm
from sklearn.metrics import classification_report

#create model
clf = svm.SVC()

#fit the model according to the training data.
clf.fit(X, y)

#predict labels for heldout data
predictedy = clf.predict(heldoutX)

#print precision recall table
print(classification_report(heldouty, predictedy))

#report accuracy on heldout data
print("Accuracy= ", clf.score(heldoutX, heldouty))

<h3>2.Random forest classifier.</h3>

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
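Because each tree is fitted on a bootstrap sample, the rows a tree never saw can serve as a built-in validation set. The sketch below, on synthetic data chosen only for illustration, turns this on with `oob_score=True` and inspects the fitted ensemble.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative data only
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# bootstrap=True draws each tree's sample with replacement;
# oob_score=True scores every row using only the trees that did not see it
forest = RandomForestClassifier(
    n_estimators=50, bootstrap=True, oob_score=True, random_state=0
)
forest.fit(X_demo, y_demo)

print("number of trees:", len(forest.estimators_))
print("out-of-bag accuracy: %.2f" % forest.oob_score_)
```

The out-of-bag accuracy is an estimate of generalization performance obtained without a separate validation split.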

<h5>Class required to implement random forest classifier</h5><br>
class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_split=1e-07, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)

To read more, follow : http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

<h4>Cross validation for Random Forest Classifier</h4>

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

#create model
clf3 = RandomForestClassifier(n_estimators=50)

#Evaluate a score by cross-validation
scores3 = cross_val_score(clf3, X, y, cv=10)

#report accuracy
print("Accuracy for Random Forest Classifier: %0.2f (+/- %0.2f)" % (scores3.mean(), scores3.std() * 2))

Accuracy for Random Forest Classifier: 0.75 (+/- 0.04)


<h4>Random forest classifier on heldout data</h4>

In [4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

#create model
clf = RandomForestClassifier(n_estimators=100)

#fit the model according to the training data.
clf.fit(X, y)

#predict labels for heldout data
predictedy = clf.predict(heldoutX)

#print precision recall table
print(classification_report(heldouty, predictedy))

#report accuracy on heldout data
print("Accuracy= ", clf.score(heldoutX, heldouty))

             precision    recall  f1-score   support

          A       0.89      0.44      0.59        54
          B       0.78      0.79      0.79       860
          D       0.89      0.45      0.59       228
          T       0.71      0.81      0.76       853

avg / total       0.77      0.75      0.75      1995

Accuracy=  0.752882205514


<h3>3. Logistic Regression (aka logit, MaxEnt) classifier</h3>

In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the ‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option is set to ‘multinomial’. (Currently the ‘multinomial’ option is supported only by the ‘lbfgs’, ‘sag’ and ‘newton-cg’ solvers.)

This class implements regularized logistic regression using the ‘liblinear’ library, ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers. It can handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit floats for optimal performance; any other input format will be converted (and copied).

The ‘newton-cg’, ‘sag’, and ‘lbfgs’ solvers support only L2 regularization with primal formulation. The ‘liblinear’ solver supports both L1 and L2 regularization, with a dual formulation only for the L2 penalty.
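The practical difference between the two penalties can be sketched as follows. The synthetic data, the value `C=0.1` and the variable names are illustrative assumptions, and the `penalty` and `solver` arguments follow the class signature quoted in this section, so newer scikit-learn releases may spell these options differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative data: 20 features, only a few of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=4, random_state=0
)

# L2 penalty with the lbfgs solver
l2_model = LogisticRegression(penalty="l2", solver="lbfgs", C=0.1, max_iter=1000)
l2_model.fit(X_demo, y_demo)

# L1 penalty requires a solver that supports it, such as liblinear
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
l1_model.fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
print("zero coefficients with L2:", n_zero_l2)
print("zero coefficients with L1:", n_zero_l1)
```

L1 regularization tends to set some coefficients exactly to zero, which is why it is often used when a sparse model is wanted.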

<h5>Class required to implement Logistic Regression (aka logit, MaxEnt) classifier</h5><br>
class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)

To read more, follow : http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

<h4>Cross validation for Logistic Regression</h4>

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score

#create model
clf2 = LogisticRegressionCV()

#Evaluate a score by cross-validation
scores2 = cross_val_score(clf2, X, y, cv=3)

#report accuracy
print("Accuracy for Logistic Regression: %0.2f (+/- %0.2f)" % (scores2.mean(), scores2.std() * 2))

<h4>Logistic Regression Classifier on heldout data</h4>

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report

#create model
clf =  LogisticRegressionCV()

#fit the model according to the training data.
clf.fit(X, y)

#predict labels for heldout data
predictedy = clf.predict(heldoutX)

#print precision recall table
print(classification_report(heldouty, predictedy))

#report accuracy on heldout data
print("Accuracy= ", clf.score(heldoutX, heldouty))

<h3>4. Decision trees </h3>

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
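The "simple decision rules" can be printed directly. This sketch uses the iris dataset bundled with scikit-learn and a depth limit of 2, both chosen only to keep the output short; `export_text` is available in reasonably recent scikit-learn releases.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small bundled dataset, used here purely for illustration
iris = load_iris()

# A shallow tree keeps the learned rules readable
tree_model = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_model.fit(iris.data, iris.target)

# Render the learned if/else rules as plain text
rules = export_text(tree_model, feature_names=list(iris.feature_names))
print(rules)
```

Each printed line is a threshold test on one feature, and each leaf shows the predicted class.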

<h5>Class required to implement Decision trees </h5><br>
class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_split=1e-07, class_weight=None, presort=False)

To read more, follow : http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

<h4>Cross validation for Decision Tree Classifier</h4>

In [4]:
from sklearn import tree
from sklearn.model_selection import cross_val_score

#create model
clf1 = tree.DecisionTreeClassifier()

#Evaluate a score by cross-validation
scores1 = cross_val_score(clf1, X, y, cv=10)

#report accuracy
print("Accuracy for Decision Tree Classifier: %0.2f (+/- %0.2f)" % (scores1.mean(), scores1.std() * 2))

Accuracy for Decision Tree Classifier: 0.65 (+/- 0.03)


<h4>Decision Tree Classifier on heldout data</h4>

In [16]:
from sklearn import tree
from sklearn.metrics import classification_report

#create model
clf = tree.DecisionTreeClassifier()

#fit the model according to the training data.
clf.fit(X, y)

#predict labels for heldout data
predictedy = clf.predict(heldoutX)

#print precision recall table
print(classification_report(heldouty, predictedy))

#report accuracy on heldout data
print("Accuracy= ", clf.score(heldoutX, heldouty))

             precision    recall  f1-score   support

          A       0.47      0.39      0.42        54
          B       0.69      0.71      0.70       860
          D       0.53      0.47      0.50       228
          T       0.65      0.66      0.65       853

avg / total       0.65      0.65      0.65      1995

Accuracy=  0.651127819549


<h3>Comparison between the above implemented algorithms</h3><br>
<table>
<tr>
<td><b>Classification Algorithm</b></td>
<td><b>Accuracy</b></td>
</tr>
<tr>
<td>SVM</td>
<td>0.597493734336</td>
</tr>
<tr>
<td>Random Forest</td>
<td>0.726817042607</td>
</tr>
<tr>
<td>Logistic Regression</td>
<td>0.7609022556397</td>
</tr>
<tr>
<td>Decision Tree</td>
<td>0.65313283208</td>
</tr>

</table>

<h2>References</h2>
<br>
Scikit Learn : http://scikit-learn.org
<br>
Numpy : http://www.numpy.org/

<h2>Developers</h2>
<ul>
<li>Bhargavkumar Patel <a href="mailto:bhargav079@gmail.com">bhargav079@gmail.com</a><br></li>
<li>Minesh Gandhi <a href="mailto:mineshmini33@gmail.com">mineshmini33@gmail.com</a><br></li>
<li>Prachi Agarwal <a href="mailto:24prachiagarwal@gmail.com">24prachiagarwal@gmail.com</a></li>
</ul>