### Linear regression

We start formulating a model from some data by first fitting a straight line (that is, an equation) with some parameters and coefficients over this data. If we assume that the sample data has only a single dimension, we can think of linear regression as a way of fitting a straight line, **y = mx + c**, over the sample data.

The linear regression model is simply described as a linear equation that represents the **regressand** or **dependent variable** of the model. The formulated regression model can have one or more parameters depending on the available data, and these parameters of the model are also termed the **regressors**, **features**, or **independent variables** of the model.

A linear model of our data is often created by using the **ordinary least squares (OLS)** curve-fitting algorithm. Once we've defined a linear model over our data, not all the points will lie on the line that is plotted to represent the formulated model. Each data point has some deviation from the linear model's plot. To represent the overall deviation of the model from some given data, we can use the **residual sum of squares**, **mean-squared error**, and **root mean-squared error** functions. The value of each of these three functions is a **scalar measure** of the amount of error in the formulated model.
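To make the OLS idea concrete for the single-feature case, here is a minimal sketch in plain Python. The function name `fit_line` and the sample values are illustrative assumptions, not part of any library; the slope and intercept use the standard closed-form OLS estimates for a line **y = mx + c**.

```python
# Minimal sketch of an OLS fit for one feature: y is approximated by m*x + c.
# fit_line and the sample data below are hypothetical, for illustration only.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    # Slope = (co-variation of x and y) / (variation of x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point of means
    c = mean_y - m * mean_x
    return m, c

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # these points lie exactly on y = 2x + 1
m, c = fit_line(xs, ys)
```

Because the sample points lie exactly on a line, the fit recovers its slope and intercept; with noisy data the same formulas give the line with the smallest sum of squared deviations.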

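The **sum of squared errors of prediction (SSE)** is simply the sum of the squared errors in a formulated model, where each error is the difference between an observed value of the dependent variable and the value predicted by the model. The SSE is also termed the **residual sum of squares (RSS)**.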

The **mean-squared error (MSE)** measures the average magnitude of error in a formulated model without considering the direction of the errors. We can calculate this value by squaring the differences of all the given values of the dependent variable and their corresponding predicted values on the formulated linear model, and calculating the mean of these squared errors. If the MSE of a formulated model is zero, then we can say that the model fits the given data perfectly.

The **root mean-squared error (RMSE)** is simply the square root of the MSE and is often used to measure the deviation of a formulated linear model.
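The three error measures described above can be sketched in a few lines of Python. The function names `rss`, `mse`, and `rmse` and the sample values are assumptions made for this illustration.

```python
from math import sqrt

# Hypothetical helpers illustrating the three error measures.
def rss(actual, predicted):
    # Sum of squared differences between observed and predicted values
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def mse(actual, predicted):
    # Mean of the squared differences
    return rss(actual, predicted) / float(len(actual))

def rmse(actual, predicted):
    # Square root of the MSE, in the same units as the dependent variable
    return sqrt(mse(actual, predicted))

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]  # errors: 1, 0, -1, -2

total_error = rss(actual, predicted)   # 1 + 0 + 1 + 4 = 6.0
mean_error = mse(actual, predicted)    # 6.0 / 4 = 1.5
root_error = rmse(actual, predicted)   # sqrt(1.5)
```

Note that squaring removes the sign of each error, which is why positive and negative deviations do not cancel out.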

In order to formulate a model that best fits the sample data, we should strive to minimize the previously described values. For some given data, we can formulate several models and calculate the error for each model. This calculated error can then be used to determine which formulated model is the best fit for the data, thus selecting the optimal linear model for the given data.

Based on the MSE of a formulated model, the model is said to have a cost function. **The problem of fitting a linear model over some data is equivalent to the problem of minimizing the cost function of a formulated linear model**.
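As a sketch of what this cost function looks like for the single-feature line **y = mx + c**, assume $n$ sample points $(x_i, y_i)$. The symbols $J$, $n$, $x_i$, $y_i$, $\bar{x}$, and $\bar{y}$ are notation introduced here for illustration. Taking the MSE as the cost and setting its partial derivatives to zero gives the usual OLS estimates:

```latex
J(m, c) = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - (m x_i + c) \bigr)^2

\frac{\partial J}{\partial m} = -\frac{2}{n} \sum_{i=1}^{n} x_i \bigl( y_i - (m x_i + c) \bigr) = 0
\qquad
\frac{\partial J}{\partial c} = -\frac{2}{n} \sum_{i=1}^{n} \bigl( y_i - (m x_i + c) \bigr) = 0

m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
c = \bar{y} - m \bar{x}
```

Here $\bar{x}$ and $\bar{y}$ are the means of the observed $x_i$ and $y_i$ values.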

## Classification

In classification, we are interested in the category of the observed values rather than in predicting a value based on the given set of values. The independent variables of the classification model are also termed the **explanatory variables** of the model, and the dependent variable is also called the **outcome**, **category**, or **class** of the observed values. The outcome of a classification model is always a discrete value. This is one of the primary differences between classification and regression. A **classifier** can be formally defined as a function that maps a set of values to a category or class. Similar to regression, the problem of classifying the observed values for the given independent variables is analogous to determining the best-fit function for the given training data.

### Logistic regression

Regression means that we try to find the best-fit set of parameters. Finding the best fit is similar to regression, and in this method

### Naive Bayes

### K-nearest neighbor

This algorithm is a form of **lazy learning** in which all the computation is deferred until classification. It can be applied to regression as well by simply selecting the predicted value as the average of the nearest values of the dependent variable for a set of observed feature values. This technique of modeling regression is, in fact, a generalization of **linear interpolation**.
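The following is a minimal sketch of both uses, assuming Euclidean distance and a majority vote among the `k` nearest samples. The function names and the toy data are hypothetical. Nothing is computed ahead of time; all the work happens when a query arrives, which is what makes the method lazy.

```python
from collections import Counter
from math import sqrt

# Hypothetical helpers; Euclidean distance is an assumption of this sketch.
def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train, query, k):
    # train is a list of (features, target) pairs; keep the k closest ones
    return sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]

def knn_classify(train, query, k=3):
    # Predict the most common class among the k nearest neighbors
    votes = Counter(target for _, target in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    # Predict the average target value of the k nearest neighbors
    neighbors = k_nearest(train, query, k)
    return sum(target for _, target in neighbors) / float(len(neighbors))

labelled = [((1.0, 1.0), 'a'), ((1.2, 0.8), 'a'), ((0.9, 1.1), 'a'),
            ((5.0, 5.0), 'b'), ((5.2, 4.9), 'b'), ((4.8, 5.1), 'b')]
numeric = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

predicted_class = knn_classify(labelled, (1.1, 1.0), k=3)
predicted_value = knn_regress(numeric, (2.0,), k=3)
```

In this toy data the query `(1.1, 1.0)` sits inside the `'a'` cluster, and the three values nearest to `2.0` are 10, 20, and 30, so the regression estimate is their mean.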

### Decision Trees

The process of constructing a decision tree is loosely based on the concepts of **information entropy and information gain** from information theory. A decision tree is a graph that describes a model of decisions and their possible consequences. An internal node in a decision tree represents a decision, or rather a condition on a particular feature in the context of classification. In the simplest case, it has two possible outcomes that are represented by the left and right subtrees of the node. Of course, a node in the decision tree could also have more than two subtrees. Each leaf node represents a particular class.

There are actually several algorithms that are used to construct a decision tree from some training data. Generally, the tree is constructed by splitting the set of sample values in the training data into smaller subsets based on an attribute value test. The process is repeated in each subset until splitting a given subset of sample values no longer adds internal nodes to the decision tree.
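One common way to choose the attribute to test at each split, and the one used by `calcShannonEnt` and `chooseBestFeatureToSplit` in the code that follows, is to pick the attribute with the highest information gain. As a sketch, with notation introduced here for illustration: let $S$ be a set of samples, $p_i$ the fraction of samples in $S$ that belong to class $i$, and $S_v$ the subset of $S$ in which attribute $A$ has the value $v$.

```latex
H(S) = -\sum_{i} p_i \log_2 p_i

\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)
```

For example, a set with two samples of one class and three of another has $H(S) = -(0.4 \log_2 0.4 + 0.6 \log_2 0.6) \approx 0.971$ bits.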

Once a decision tree has been created, we can optionally perform **pruning** on the tree. Pruning is simply the process of removing any extraneous decision nodes from the tree. This can be thought of as a form of regularization of the decision tree, through which we reduce overfitting of the estimated decision tree model.

**J48** is an open source implementation of the **C4.5** algorithm in Java.

**Pros**: Computationally cheap to use, easy for humans to understand learned results, missing values OK, can deal with irrelevant features.

**Cons**: Prone to overfitting.

**Works with**: Numeric values and nominal values.

In [85]:
# Function to calculate the Shannon Entropy of a dataset
from math import log
import operator

def calcShannonEnt(dataset):
    """Return the Shannon entropy (in bits) of the class labels in dataset."""
    numEntries = len(dataset)
    # Count how often each class label (the last column of a row) occurs
    labelCounts = {}
    for featVec in dataset:
        currentLabel = featVec[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1

    # Entropy is the negated sum of p * log2(p) over all class labels
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)

    return shannonEnt

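def splitDataSet(dataset, axis, value):
    """Return the rows whose feature at index axis equals value, without that feature."""
    retDataSet = []
    for featVec in dataset:
        if featVec[axis] == value:
            # Copy the row, leaving out the column we split on
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet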

def createTree(dataset, labels):
    """Recursively build a decision tree as nested dictionaries."""
    classList = [example[-1] for example in dataset]
    # Stop when every sample in this subset has the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # Stop when no features are left; fall back to the majority class
    if len(dataset[0]) == 1:
        return majorityCnt(classList)
    bestFeature = chooseBestFeatureToSplit(dataset)
    bestFeatureLabel = labels[bestFeature]
    myTree = {bestFeatureLabel: {}}
    # Remove the chosen label from a copy so the caller's list is not modified
    subLabels = labels[:]
    del subLabels[bestFeature]
    uniqueVals = set(example[bestFeature] for example in dataset)
    for value in uniqueVals:
        subset = splitDataSet(dataset, bestFeature, value)
        myTree[bestFeatureLabel][value] = createTree(subset, subLabels)
    return myTree

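def majorityCnt(classList):
    """Return the class label that occurs most often in classList."""
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    # Sort the (label, count) pairs by count, highest first
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]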

def chooseBestFeatureToSplit(dataset):
    numFeatures = len(dataset[0]) - 1
    baseEntropy = calcShannonEnt(dataset)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataset]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataset, i, value)
            prob = len(subDataSet)/float(len(dataset))
            newEntropy += prob * calcShannonEnt(subDataSet)

        infoGain = baseEntropy - newEntropy
        if (infoGain > bestInfoGain):
            bestInfoGain = infoGain
            bestFeature = i

    return bestFeature

def createDataSet():
    dataset = [[1, 1, 'yes'],
              [1, 1, 'yes'],
              [1, 0, 'no'],
              [0, 1, 'no'],
              [0, 1, 'no'],]

    labels = ['no surfacing', 'flippers']

    return dataset, labels

In [77]:
myDat, labels = createDataSet()

In [49]:
calcShannonEnt(myDat)

0.9709505944546686

In [50]:
myDat[0][-1] = 'maybe'
calcShannonEnt(myDat)

1.3709505944546687

In [52]:
splitDataSet(myDat, 0, 1)

[[1, 'maybe'], [1, 'yes'], [0, 'no']]

In [59]:
chooseBestFeatureToSplit(myDat)

0

In [87]:
myDat, labels = createDataSet()
myTree = createTree(myDat, labels)
myTree

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
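The nested dictionary returned by `createTree` can be walked to classify a new sample. The `classify` function below is a hypothetical helper added for illustration and is not part of the code above. It assumes that every non-dictionary value is a leaf holding a class, and that the sample lists its feature values in the same order as the label list.

```python
# Hypothetical helper: walk a tree built by createTree to classify one sample.
def classify(tree, feature_labels, sample):
    # Any value that is not a dict is a leaf, that is, the predicted class
    if not isinstance(tree, dict):
        return tree
    # Each internal node is a dict with a single key: the feature it tests
    feature_label = next(iter(tree))
    branches = tree[feature_label]
    feature_index = feature_labels.index(feature_label)
    # Raises KeyError if the sample has a value never seen during training
    subtree = branches[sample[feature_index]]
    return classify(subtree, feature_labels, sample)

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
feature_labels = ['no surfacing', 'flippers']

result_both = classify(tree, feature_labels, [1, 1])
result_no_flippers = classify(tree, feature_labels, [1, 0])
result_surfacing = classify(tree, feature_labels, [0, 1])
```

Only a sample with both features set to 1 reaches the `'yes'` leaf; every other path through this tree ends in `'no'`.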