# Machine Learning

1. Supervised: Induction --> Specifics to generalization.
2. Unsupervised: Finding structure in the data, description.  
3. Reinforcement: Getting feedback and learning from it.

## A combination of unsupervised and supervised learning can do wonders.
Data -> Unsupervised learning --> labels --> Supervised learning --> Knowledge

# Naive Bayes

Bayes Rule: 

P(c) = Probability of cancer (the event occurring). (Prior Probability)

Test evidence:
P(pos/c) = Probability that the test is positive when cancer is present. -- Sensitivity (SENACPO)
P(neg/-c) = Probability that the test is negative when cancer is absent. -- Specificity (SPACNE)

Remember:

SENACPO -- Number of times we are correct when ACtual value is POsitive.
SPACNE  -- Number of times we are correct when ACtual value is NEgative. 

BAYES RULE:

(PRIOR PROB) . (TEST EVIDENCE) -> (POSTERIOR PROB)

P(c) = 0.01 (1%)  |  P(-c) = 0.99 (99%)

SENSITIVITY: P(pos/c) = 0.9 (90%) | P(neg/c) = 0.1 (10%)

SPECIFICITY: P(neg/-c) = 0.9 (90%)| P(pos/-c) = 0.1 (10%) 


JOINT PROB :

P(c, pos) = P(c) . P(pos/c) = 0.01 x 0.90 = 0.009

P(-c, pos) = P(-c) . P(pos/-c) = 0.99 x 0.1 = 0.099


Joint probabilities generally don't add up to 1. Normalize them so that they do.

(Normalizer) factor = P(c, pos) + P(-c, pos) = P(pos) = 0.108

ACTUAL POSTERIOR PROB:

P(c/pos) = 0.009/factor = 0.0833 (8.33%)

P(-c/pos) = 0.099/factor = 0.9167 (91.67%)

P(c/pos) + P(-c/pos) = 0.0833 + 0.9167 = 1.0
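
The arithmetic above can be reproduced with a short Python sketch. The function name `posterior` and its argument names are illustrative choices for these notes, not part of any library.

```python
def posterior(prior, sensitivity, specificity):
    """Return P(c | pos) and P(-c | pos) for a positive test result."""
    # Joint probabilities: prior times test evidence
    joint_c = prior * sensitivity                   # cancer and positive test
    joint_not_c = (1 - prior) * (1 - specificity)   # no cancer and positive test
    # The normalizer is the total probability of a positive test
    normalizer = joint_c + joint_not_c
    return joint_c / normalizer, joint_not_c / normalizer


p_c, p_not_c = posterior(prior=0.01, sensitivity=0.9, specificity=0.9)
print(round(p_c, 4), round(p_not_c, 4))
```

Even with a 90% sensitive and 90% specific test, the posterior stays low because the prior is only 1%.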




##  Uses of Naive Bayes:
    
### It is used a lot for text learning. 

### Why Naive? 

It assumes the features are independent of each other given the class. For text, this means it ignores word order and only considers the frequency of words.


### Strengths and Weaknesses

Strengths:
1. Can handle lots of words (features) 20k and more.
2. Easy to implement.

Weakness:
1. Ignores word order.

# SVM - Support Vector Machine

Finds a hyperplane (a line in 2D) that separates the classes while staying at the maximum distance from the nearest data points of each class.

The distance from the hyperplane or line to the nearest point is called the MARGIN.

The best hyperplane/line is the one that maximizes the margin while classifying the most points correctly. SVM prioritizes correct classification first, then margin.

Tolerates individual outliers: a few points on the wrong side do not stop it from finding a separator.

## Non Linear SVMs

Adding a new feature built from a mathematical combination of existing features, e.g. z = x^2 + y^2 or z = |x|, can make the classes separable by a hyperplane even when they cannot be separated linearly using the original features.

Kernels do this implicitly: they map data that is not linearly separable in low dimensions into a high-dimensional space, find a hyperplane there, and return the solution to the lower dimension in the form of a non-linear separator.
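
The effect of adding a feature such as z = x^2 + y^2 can be seen in a small sketch. The two rings of points and the helper names `ring` and `add_z` are made up for illustration; this is not how an SVM library works internally, only a demonstration of why the extra feature helps.

```python
import math

# Two classes that no straight line in (x, y) can separate:
# class 0 lies on a circle of radius 1, class 1 on a circle of radius 3.
def ring(radius, n):
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

inner = ring(1.0, 12)   # class 0
outer = ring(3.0, 12)   # class 1

# New feature: z = x^2 + y^2 (squared distance from the origin)
def add_z(points):
    return [(x, y, x * x + y * y) for x, y in points]

inner_z = [z for _, _, z in add_z(inner)]
outer_z = [z for _, _, z in add_z(outer)]

# In the new feature a single threshold (the plane z = const) separates the classes
threshold = (max(inner_z) + min(outer_z)) / 2
print(max(inner_z) < threshold < min(outer_z))
```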

### Parameters in SVM

Kernel = rbf, sigmoid, poly, linear, custom etc.

Gamma = radial influence of a single data point. Low gamma means far-reaching influence, high gamma means close influence.
The 'gamma' parameter has no effect on the 'linear' kernel for SVMs. The key parameter for that kernel is "C".

C = Controls the tradeoff between a simple decision boundary and correctly classifying training points.
The larger the C, the more training points are classified correctly; the lower the C, the simpler the decision boundary.

Overfitting can be controlled through the parameters of the algo, for example C, gamma and kernel in the case of SVMs.


### Advantages

Memory efficient, as it uses only a subset of the training points (the support vectors).
Performs well in high dimensional data.

### Disadvantages

Doesn't scale well to very large datasets: training time grows between quadratically and cubically with the number of samples.
Doesn't work well with lots of noise.
Very slow compared to Naive Bayes.


# SVM tips:

1. Changing the kernel can change accuracy drastically, e.g. rbf 48% to linear 97%.
2. Reducing the sample size speeds up training and prediction but reduces testing accuracy.
3. SVMs do not scale well (training time is at least quadratic in the number of samples).
4. Optimized rbf 99%, linear 97%.


# Decision Trees

Creates axis-aligned (piecewise linear) decision boundaries.

# Parameters

min_samples_split = 2 (default)
A node won't be split if the number of samples at it is < min_samples_split.

The larger the min_samples_split, the fewer the splits, the lower the complexity and the less the overfitting.

# Entropy
## Measure of impurity in a bunch of examples.

Purity: Having all examples of the same class in a split section.

Entropy/impurity: Having examples of more than one class at a node.

Entropy is defined for a node. A node with multiple classes has non-zero entropy;
if a node has only one class, it is a pure node and its entropy is 0.

Entropy = 1.0 when examples are evenly split between two classes.
Entropy = 0 when only one class is present in a split. Pure!

Objective: Minimizing impurity in splitting.

$$\text{Entropy} = -\sum_i p_i \log_2(p_i)$$

where i is a class and p_i is the fraction of samples in the split that belong to that class.

## Information Gain

Gain = entropy(parent) - [weighted average] entropy(children)

The [weighted average] uses the proportion of samples going into each child of the split,
e.g. 2/3 and 1/3.

The higher the gain, the lower the entropy in the children, the purer the nodes and the better the classification. Objective: Maximize gain.
Decision trees maximize gain.
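
The gain formula above can be sketched in a few lines of Python. The function names and the example split (class counts of 2 and 2 at the parent) are illustrative assumptions.

```python
from math import log2

def entropy(counts):
    """Entropy of a node given a list of per-class sample counts."""
    total = sum(counts)
    # Classes with zero samples contribute nothing (0 * log 0 is taken as 0)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """parent: class counts at the parent; children: class counts per child node."""
    n = sum(parent)
    # Weight each child's entropy by the share of samples that reach it
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical split: the parent holds 2 samples of class A and 2 of class B.
# The left child receives [2 A, 1 B] (3/4 of the samples), the right child [0 A, 1 B].
gain = information_gain([2, 2], [[2, 1], [0, 1]])
print(round(gain, 4))
```

A split that leaves every child pure would give the maximum possible gain, equal to the parent's entropy.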

## Gini and Entropy:
sklearn offers two split criteria here, gini and entropy. The default is gini.

# Bias Variance Trade-off

Bias: 

1. A high-bias ML algo learns almost nothing from the data and practically ignores it.
2. Pays little attention to data.
3. High error on training set (low Rsquared, high SSE)
4. Oversimplified

Variance: Willingness and flexibility of an algo to learn. 

1. A high-variance ML algo is highly sensitive to the data and can't generalize.
2. Memorizes the data.
3. Fails to generalize well.
4. Much higher error on the testing set than on the training set (low Rsquared, high SSE)
5. Overfitting

In stats : Variance means spread of a data distribution.

## Underfit (High Bias) ----> Good Model ----> Overfit (High Variance)
## Find the optimal number of features for a good model that balances bias and variance. This can be done by:

1. Regularization: Penalizes extra features.




## Decision Tree :
Strengths: Can be combined into bigger classifiers (ensemble methods).

Weakness: Overfitting. (Be careful about parameter tuning). 

# Reducing complexity of algorithms and improving speed

1. Tune parameters
2. Identify the necessary features and use only them for building the model.
(Generally, the more features the algo has, the more complex the fit.)





```python
# Entropy calculator
from math import log2

def entrocalc(class_samples):
    """Entropy of a node; class_samples maps each class to its sample count."""
    tot_samples = sum(class_samples.values())
    entropy = 0
    for key in class_samples:
        pi = class_samples[key] / tot_samples
        # Skip empty classes: log2(0) is undefined and 0 * log2(0) is taken as 0
        if pi > 0:
            entropy = entropy - pi * log2(pi)
    return entropy
```

# More Data Better Results 
(Generally) better than even a super optimized algo.


# Types of Data

1. Numerical: Numbers like 234, 453.0 etc., e.g. age, height, score.

2. Categorical: Discrete values like gender, color, material, job title etc

3. TimeSeries: Temporal data (timestamp)

4. Text: Words



# Be very careful about introducing features that come from different sources depending on the class! It’s a classic way to accidentally introduce biases and mistakes.

# Regression (Continuous output)

Minimizes the sum of squared errors (SSE), where error = actual - predicted.
Finds the slope and intercept of the line that minimizes the SSE.

Minimizing absolute error is not used because it can give more than one best-fit line.

With squared errors there is only one such line. SSE is also easier to implement.
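
A minimal sketch of a least-squares line fit, using the standard closed-form solution for one input variable. The data and the function names `fit_line` and `sse` are made up for illustration; sklearn's own implementation is more general.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sse(xs, ys, slope, intercept):
    """Sum of squared errors (actual - predicted)."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Made-up data lying close to y = 2x + 1
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, sse(xs, ys, slope, intercept))
```

Any other slope and intercept, including the "true" 2 and 1, gives an SSE at least as large on this data.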
## Problem with SSE:
1. Adding more data increases SSE, but that doesn't mean the fit is worse.


SSE minimization is done by:

Ordinary Least Squares, OLS (used in sklearn)

Gradient descent

## Performance Measure for Regression : R-Squared

R-Squared: "How much of the change in the output is explained by the change in the input."

 0.0 < R-Squared < 1.0 (Best)
 
Negative R-Squared is possible when the model fits worse than always predicting the mean of the output.

Advantage over SSE:
1. Independent of the number of data points.

The higher the R-Squared, the better. Max value = 1.0
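
R-Squared can be computed by hand as 1 - SS_res / SS_tot. The sketch below uses made-up numbers and an illustrative function name to show the three typical cases, including a negative value.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    ss_tot = sum((a - mean_y) ** 2 for a in actual)                # total variation
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(actual, [1.0, 2.0, 3.0, 4.0])    # perfect predictions
baseline = r_squared(actual, [2.5, 2.5, 2.5, 2.5])   # always predicting the mean
reversed_fit = r_squared(actual, [4.0, 3.0, 2.0, 1.0])  # worse than the mean
print(perfect, baseline, reversed_fit)
```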


# Always visualize your data. 

# Outliers

### Rare data points which don't follow the trend.

Causes:
1. Sensor Malfunction - to be ignored
2. Data entry errors - to be ignored  
3. Freak events: - to be paid attention to.  e.g. Fraud detection

Removal:

1. Train > 2. Remove (the 10% of points with the largest residual error) > 3. Train again

(Repeat steps 2 & 3 until satisfied)
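
One pass of the train, remove, retrain loop might look like the sketch below. It assumes a simple one-variable linear fit; the data and the helper names `fit_line` and `drop_worst` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def drop_worst(xs, ys, fraction=0.1):
    """Train, then drop the given fraction of points with the largest residuals."""
    slope, intercept = fit_line(xs, ys)
    residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    keep = len(xs) - max(1, int(fraction * len(xs)))
    # Indices of the points with the smallest residuals, restored to original order
    order = sorted(range(len(xs)), key=lambda i: residuals[i])[:keep]
    order.sort()
    return [xs[i] for i in order], [ys[i] for i in order]

# Made-up data on y = 2x, with one corrupted point at x = 5
xs = list(range(10))
ys = [2.0 * x for x in xs]
ys[5] = 60.0

clean_xs, clean_ys = drop_worst(xs, ys)           # steps 1 and 2
slope, intercept = fit_line(clean_xs, clean_ys)   # step 3: train again
print(len(clean_xs), slope, intercept)
```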


# Visualization is one of the most powerful tools for finding outliers!

# First thing to do is to Identify and Clean the outliers

# Unsupervised learning

# K means clustering

Steps:

1. Assign: Randomly place the initial cluster centers.
2. Cluster identification: Assign each point to its nearest cluster center to identify the clusters.
3. Centroid: Find the centroid of each cluster. These centroids become the new cluster centers.
4. Repeat 2 & 3 until the cluster centers stop updating.
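
The steps above can be sketched in plain Python for 2D points. Picking random data points as the initial centers is one common choice and an assumption here, as are the function name `kmeans` and the toy data; sklearn's `KMeans` uses a more careful initialization.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial cluster centers
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: (p[0] - centers[j][0]) ** 2
                                        + (p[1] - centers[j][1]) ** 2)
            clusters[nearest].append(p)
        # Step 3: move each center to the centroid of its cluster
        new_centers = []
        for j, cluster in enumerate(clusters):
            if cluster:
                new_centers.append((sum(p[0] for p in cluster) / len(cluster),
                                    sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centers.append(centers[j])  # an empty cluster keeps its center
        # Step 4: stop once the centers no longer move
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))
```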


sklearn params:

n_clusters = Number of clusters we want.
n_init = How many times the algorithm is run with different initial centers. Play with this if you see the clustering being affected by initialization.

max_iter = Maximum number of iterations in a single run.


Limitations :

1. Premature convergence to sub-optimal solutions (local minima).
2. Can result in different clusters depending on initialization.


# Feature Scaling :

Rescaling the features such that they are on the same scale and have equal influence on the results.

X' = (X - Xmin)/(Xmax - Xmin)
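
A small sketch of the formula above. The function name and the sample values are made up; sklearn's `MinMaxScaler` applies the same idea column by column.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1] using X' = (X - Xmin) / (Xmax - Xmin)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot rescale")
    return [(v - lo) / (hi - lo) for v in values]

weights = [115, 140, 175]   # made-up data
scaled = min_max_scale(weights)
print(scaled)
```

The smallest value always maps to 0 and the largest to 1, which is why a single extreme outlier squeezes every other value into a narrow band.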

Before scaling, REMOVE THE OUTLIERS because OUTLIERS WILL MESS WITH SCALING.

sklearn's MinMaxScaler

## Algorithms that compute distances across two or more dimensions are affected by feature scaling.
Regression is not affected, because each feature comes with its own coefficient, which absorbs the scale of that feature.
Decision trees are not affected either: their decision boundaries are always vertical or horizontal, so the relative size of different features doesn't matter.
K-means and SVMs are affected by scaling because they compute distances across different dimensions.


# Learning From Text

Bag of Words: Frequency counts of the words occurring in a text.

Using sklearn's CountVectorizer
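
The idea behind a bag of words can be shown with the standard library alone. This is only an illustration with a naive whitespace tokenizer and a made-up function name; sklearn's `CountVectorizer` tokenizes differently and returns a sparse matrix.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word."""
    return Counter(text.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")
print(a)
print(a == b)   # word order is lost, so both sentences give the same bag
```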

### Not all words are equally important. Words like the/a/an/is etc. don't tell much about what's going on; these redundant words are called "STOPWORDS".

Remove stopwords before starting text analysis.

STEMMER: Used to consolidate different words with the same stem, like respond, responsiveness etc.
nltk has various stemmers, e.g. the snowball stemmer.






```python
from nltk.corpus import stopwords

# Requires the corpus to be downloaded once: nltk.download('stopwords')
sw = stopwords.words('french')

# How many stop words are in the French nltk corpus
print(len(sw))
```

```
155
```


## Order of operation in Text processing

1. Stop words removal
2. Stemming
3. Bag of words

# TF-IDF
Term Frequency: How many times a word occurs in a document.
Inverse Document Frequency: Weights a word by how rare it is across documents; the more documents a word occurs in, the lower its weight.
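
A minimal sketch of the weighting, assuming the plain variant tf * log(N / df), where N is the number of documents and df is the number of documents containing the word. Libraries such as sklearn use smoothed and normalized variants, so their numbers will differ. The function name and documents are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)   # term frequency within this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

docs = ["the cat sat", "the dog sat", "the bird flew"]
rows = tf_idf(docs)
for row in rows:
    print(row)
```

A word that appears in every document, like "the" here, gets a weight of zero, while a word unique to one document gets the highest weight.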

# Feature Selection

1. Select best features
2. Engineer new features
3. Getting Rid of features

## Engineering new Feature:

1. Use Human Intuition
2. Code up the feature
3. Visualize : See if there are trends which can be utilized by ML algos. 
4. Repeat

### Beware of programming bugs that might creep in while engineering new features.

1. Anyone can make mistakes--be skeptical of your results!
2. 100% accuracy should generally make you suspicious. Extraordinary claims require extraordinary proof.
3. If there's a feature that tracks your labels a little too closely, it's very likely a bug!
4. If you're sure it's not a bug, you probably don't need machine learning--you can just use that feature alone to assign labels.

## Getting Rid of features

Remove the feature when:

1. It's noisy
2. It's highly correlated with another feature. (Repeating information)
3. It causes overfitting
4. It slows down training/testing

## General Rule

# Features are not equal to information. Features attempt to access information.

# Goal: Bare minimum number of features that give the most info.

# Univariate Feature Selection:
Treats each feature independently and asks how much power it gives you in classifying or regressing.

sklearn:

1. SelectPercentile: selects the X% of features that are most powerful
2. SelectKBest: selects the K features that are most powerful
3. TfidfVectorizer's max_df and min_df can also help get the right features.

Text data has lots and lots of features, so feature reduction is useful.
More generally, feature reduction can be used for any high-dimensional data.

# A classic way to overfit an algorithm is by using lots of features and not a lot of training data




# PCA
Principal Component Analysis

Finds a new coordinate system by shifting and rotating the current one, which can then be used to reduce dimensionality.

The new center is the center (mean) of the data, and the principal axis is the direction with the most variation.

Gives importance vectors (directions)

Gives the spread along each of them

Part of the beauty of PCA is that the data doesn't have to be perfectly 1D in order to find the principal axis!

Composite features made using PCA can be used for dimensionality reduction.

In stats: Variance means the spread of a data distribution.

The principal component direction is the one that has maximum variance (spread), because projecting onto that direction minimizes information loss.

The greater the distance of a data point from the principal component, the greater the information loss.
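
For 2D data the first principal component can be found directly from the covariance matrix. The sketch below uses the closed-form angle of maximum variance for a 2x2 symmetric matrix; the function name and the data points are made up, and real libraries use a general decomposition instead.

```python
import math

def first_principal_component(points):
    """Return (center, unit direction, variance along it) for 2D points."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix
    sxx = sum((x - cx) ** 2 for x, _ in points) / n
    syy = sum((y - cy) ** 2 for _, y in points) / n
    sxy = sum((x - cx) * (y - cy) for x, y in points) / n
    # Angle of the direction of maximum variance
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    # Variance of the data projected onto that direction
    variance = (sxx * direction[0] ** 2
                + 2 * sxy * direction[0] * direction[1]
                + syy * direction[1] ** 2)
    return (cx, cy), direction, variance

# Made-up points scattered around the line y = x
points = [(0, 0.2), (1, 0.9), (2, 2.1), (3, 2.8), (4, 4.0)]
center, direction, variance = first_principal_component(points)
print(center, direction, variance)
```

The direction comes out close to the diagonal, and almost all of the total variance lies along it, which is why dropping the second component loses little information for this data.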

## PCA transforms features into principal components.

## Principal components are used as new features.

## Principal components are perpendicular to each other and thus uncorrelated.

## Max number of PCs = number of features


## When to use PCA.

1. Identifying latent features driving the patterns in the data.
2. Dimensionality Reduction.

    a. Visualizing High Dimensional data.
    
    b. Reduce Noise
    
    c. Make algos work better with fewer inputs. 
    
A higher F1 score means a better classifier.
But more PCs don't mean a better classifier; there is an optimal number of PCs that gives the best results.

# Do not perform feature selection before PCA because it'll throw information away. Feature selection can be performed after PCA to help improve the model.

# Validation

### Train Test split:

Splitting the data into training and testing sets, using only the training set for training and only the testing set to evaluate the model.

1. Serves as a check on overfitting.
2. Gives an estimate of performance on independent set.
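
A hand-rolled sketch of such a split, for illustration only. The function name, the 30% test fraction and the seed are assumptions; in practice sklearn provides a ready-made helper for this.

```python
import random

def split_train_test(samples, test_fraction=0.3, seed=42):
    """Shuffle a copy of the samples and cut it into training and testing parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))          # stand-in for real samples
train_set, test_set = split_train_test(data)
print(train_set, test_set)
```

Shuffling before cutting matters when the data is stored in some order, for example sorted by class.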

## Flow for split,pca, model training and prediction

![title](train_test_pca_svm_flow.PNG)


# K-fold cross validation

Divide the dataset into k subsets, use each subset as the testing set once with the remaining subsets as the training set, and report the average performance over the k runs.

1. Slower to run than a single train/test split.
2. Better estimate of model accuracy than train/test split.
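
The fold bookkeeping can be sketched as follows. The function names are hypothetical, and `evaluate` stands for any user-supplied function that trains on the training indices and returns a score on the testing indices.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def average_score(evaluate, n_samples, k):
    """Average a user-supplied evaluate(train_idx, test_idx) score over the folds."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(test_idx)
```

Note that this sketch splits the samples in their stored order, without shuffling and without looking at the class labels.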

Plain K-fold just splits the data irrespective of which classes end up in train/test. This might result in training the model on one class and using it to predict another, which will, as expected, perform poorly.

## Training data should be such that it has a similar presence of all the classes as in the complete data set. 

# Stratified K-fold

Ensures that each set contains approximately the same percentage of samples of each target class as the complete set.

# Evaluation Metrics

## Classification 

### Accuracy = Number of items predicted correctly / total number of items.

### Recall = When the actual value is positive, how often are we correct? TP / (TP + FN)

### Precision = When predicting positive, how often are we correct? TP / (TP + FP)
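
These three metrics can be computed by counting outcomes. The sketch below uses made-up labels, with 1 as the positive class; the function names are illustrative.

```python
def accuracy(actual, predicted):
    """Share of all items that were predicted correctly."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)

def precision_recall(actual, predicted, positive=1):
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # true positives
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false positives
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 0, 0, 1, 0, 0, 1]
acc = accuracy(actual, predicted)
precision, recall = precision_recall(actual, predicted)
print(acc, precision, recall)
```

Here the classifier finds only half of the actual positives (recall 0.5), while two out of its three positive predictions are right (precision about 0.67).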


