# Classification and Hypothesis Testing

## Hypothesis Testing

Suppose you went to a supermarket to buy a 500 ml packet of juice. After reaching home, you felt that the amount of juice seemed to be on the lower side. You measured it and found that the packet contained only 478 ml. You then searched the internet and found that the company claims an average of 490 ml of juice per packet, with a standard deviation of 10 ml.

To test this claim, you collected a sample of 40 juice packets and found that the mean amount of juice in this sample is 485 ml and the sample standard deviation is 8 ml. You start by assuming that this shortfall happened purely by chance and that the true average is still 490 ml. This assumption is called the **Null Hypothesis**.

The alternative to this assumption is that the company's claim is wrong and the juice packets contain, on average, less than 490 ml of juice. This is called the **Alternate Hypothesis**.

Formally we define:
- **H0**: the average amount of juice in a packet is 490 ml
- **Ha**: the average amount of juice in a packet is less than 490 ml

Next, you decide how much evidence you need before rejecting H0. A common choice is to reject H0 only if a result at least as extreme as the one observed would occur less than 5% of the time when the company's claim is true. In other words, you accept a 5% risk of wrongly accusing the company when it is in fact providing 490 ml of juice on average.

- This 5% is called the **level of significance (alpha)**. It is the probability of rejecting the null hypothesis when it is actually true.

To carry out the test, we also assume that the data is normally distributed. Now we have everything we require. The next step is to find the probability of observing a sample mean of 485 ml or less if the null hypothesis is true. As the population standard deviation is known, we can calculate this probability using the z-statistic:

$$Z = \frac{\text{sample mean} - \text{population mean}}{\text{population standard deviation} / \sqrt{n}}$$

$$Z = \frac{485 - 490}{10 / \sqrt{40}} = -3.162$$

This means that the sample mean is 3.162 standard errors below the population mean. As we learned in the Statistics for Data Science week, about 95% of the data lies within 2 standard deviations of the mean, so a value this far out falls well outside that range and has a probability of less than 5% under the null hypothesis. Hence, we can say that this is unlikely to be a random fluctuation and that there is something fishy about the company's claim.

Another important concept is the **P-value**. From the z-score we know that the probability is less than 5%, but how small is it exactly?

We can find this probability using the Z-table. For a z-score of 3.16 (the absolute value of our statistic), the table gives a cumulative probability of ~0.99921. The P-value is therefore 1 - 0.99921 = ~0.0008, which is less than the significance level of 5% (i.e., 0.05). **Hence we reject the null hypothesis.**
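The calculation for the juice example can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the function name `one_sample_z_test` is an illustrative assumption, and the test is one-tailed (lower tail), matching the suspicion that the packets contain less than 490 ml.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, pop_mean, pop_std, n):
    """Return the z-statistic and the lower-tail p-value of a one-sample z-test."""
    standard_error = pop_std / math.sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    # P(Z <= z) under the standard normal distribution
    p_value = NormalDist().cdf(z)
    return z, p_value

# Numbers from the juice example
z, p_value = one_sample_z_test(sample_mean=485, pop_mean=490, pop_std=10, n=40)
alpha = 0.05  # level of significance

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The printed values should agree with the hand calculation (z of about -3.162 and a p-value of about 0.0008).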



### Important Terminology - Hypothesis Testing

1. **Type I Error:** Rejecting the null hypothesis when it is actually true. $\alpha$ is the probability of a Type I error, also called the level of significance or the significance level. It is the accepted risk level, set to 5% by default.
2. **Type II Error:** Failing to reject a null hypothesis that is actually false. $\beta$ is the probability of a Type II error.
3. **P-value:** The actual risk level calculated from the data. The p-value is the probability of obtaining observations at least as extreme as those in the sample, given that the null hypothesis is true. A p-value less than 0.05 is conventionally considered statistically significant. It indicates strong evidence against the null hypothesis, because such data would be observed less than 5% of the time if the null were true. It is not the probability that the null hypothesis is correct.
4. **Confidence Level:** Indicates how confident we are in the procedure, usually set at 95%. This means that if we repeat the sampling many times, about 95% of the resulting confidence intervals will contain the true population parameter.
5. **Confidence Interval:** An interval, or range of values, that is expected to cover the true unknown parameter value. The most common value for $\alpha$ is 0.05, so 95% confidence intervals are typically constructed.
6. The confidence interval is calculated as $\bar{X} \pm Z \frac{\sigma}{\sqrt{n}}$, where $\bar{X}$ is the sample mean, $Z$ is the critical value of the test statistic, $\sigma$ is the population standard deviation, and $n$ is the sample size.
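
As a worked illustration of the confidence interval formula, the sketch below applies it to the juice example (sample mean 485 ml, population standard deviation 10 ml, n = 40). The function name `z_confidence_interval` is an assumption made for this example, and only the Python standard library is used.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, pop_std, n, confidence=0.95):
    """Confidence interval for the mean when the population std is known."""
    alpha = 1 - confidence
    # Two-sided critical value, about 1.96 for 95% confidence
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_critical * pop_std / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

lower, upper = z_confidence_interval(sample_mean=485, pop_std=10, n=40)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

The claimed mean of 490 ml lies outside this interval, which is consistent with rejecting the null hypothesis in the juice example.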

## Logistic Regression

- Logistic Regression is a popular statistical model used for binary classification. It predicts the probability that an observation belongs to one of two possible classes.

Instead of fitting a best-fit line directly to the target, a logistic regression model passes the output of the linear function through the sigmoid (logistic) function so that it falls between 0 and 1.
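
The squashing step can be sketched as follows. This is a minimal illustration, not a trained model: the weights, bias, and the 0.5 decision threshold are hypothetical values chosen only to show how a linear output is turned into a probability.

```python
import math

def sigmoid(z):
    """Map any real number into the range 0 to 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, this form avoids overflow in exp()
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def predict_probability(features, weights, bias):
    """Probability of the positive class for one observation."""
    linear_output = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(linear_output)

# Hypothetical weights and bias, chosen only for illustration
weights, bias = [0.8, -0.4], 0.1
probability = predict_probability([2.0, 1.0], weights, bias)

# Assumed threshold of 0.5 to turn the probability into a class label
predicted_class = 1 if probability >= 0.5 else 0
print(round(probability, 4), predicted_class)
```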

**Advantages**

- Logistic Regression is one of the simplest machine learning algorithms and is easy to implement, yet it trains efficiently in many cases. For the same reasons, training a model with this algorithm doesn't require high computation power.
- Logistic Regression works well when the dataset is linearly separable. A dataset is said to be linearly separable if it is possible to draw a straight line (or, in higher dimensions, a hyperplane) that separates the two classes of data from each other. Logistic regression is used when your Y variable can take only two values, and if the data is linearly separable, this algorithm is an efficient way to classify it into two separate classes.

**Assumptions:**

- Binary logistic regression requires the dependent variable to be categorical and binary.
- Logistic regression requires the observations to be independent of each other. In other words, the observations should not come from repeated measurements or matched data.
- Logistic regression requires there to be little or no multicollinearity among the independent variables. This means that the independent variables should not be too highly correlated with each other.

## Support Vector Machines

**Support Vector Machines (SVM)** is a supervised learning algorithm used for regression, classification and outlier detection. The main objective of SVM is to create a line or a hyperplane which separates the data into classes.

An SVM finds the points closest to the line from both classes. These points are called support vectors. Next, we compute the distance between the line and the support vectors. This distance is called the margin. Our goal is to maximize the margin. The hyperplane for which the margin is maximum is the optimal hyperplane for the SVM. Thus, an SVM tries to make a decision boundary in such a way that the separation or margin between the two classes (the street) is as wide as possible.

We have two kinds of margins:
1. Hard margin
2. Soft margin

**Hard Margin**:
In the Hard Margin method, the decision boundary makes sure that all the data points are classified correctly. This means that the classifier is not going to make any room for error in the training data. It may also reduce the size of the margin to accomplish this, defeating the whole purpose of using an SVM. This method is also sensitive to outliers.

**Soft Margin**:
As most real-world data is not fully linearly separable, we should allow some margin violation or misclassification of the samples to occur in our margin; this is called the Soft Margin method of classification. The idea is to keep the margin as wide as possible. It tries to balance the trade-off between finding a line that maximizes the margin, as well as minimizing the misclassification. The Soft Margin method is robust to outliers, and generalizes well on unseen data.
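
The trade-off between a wide margin and few violations is commonly written as minimizing $\frac{1}{2}\lVert w \rVert^2 + C \sum_i \max(0, 1 - y_i (w \cdot x_i + b))$. The sketch below evaluates that objective for a given hyperplane. It assumes labels coded as -1 and +1; the penalty parameter `C`, the function names, and the toy data are assumptions introduced for illustration.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Width of the 'street' between the two margin boundaries: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def soft_margin_objective(w, b, X, y, C=1.0):
    """0.5 * ||w||^2 + C * sum of hinge losses (labels must be -1 or +1)."""
    hinge_losses = [
        max(0.0, 1.0 - yi * decision_value(w, b, xi)) for xi, yi in zip(X, y)
    ]
    return 0.5 * sum(wi * wi for wi in w) + C * sum(hinge_losses)

# Hypothetical toy data and hyperplane, for illustration only
X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]]
y = [1, 1, -1, -1]
w, b = [1.0, 1.0], -3.0

print(margin_width(w))
print(soft_margin_objective(w, b, X, y, C=1.0))
```

A larger `C` penalizes margin violations more heavily and moves the model toward hard-margin behavior, while a smaller `C` tolerates more violations in exchange for a wider margin.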


**Non-Linear SVM**: 

If the data is not linearly separable, we can apply a technique called the **kernel trick**, which implicitly maps the data into a higher-dimensional space where a linear separator may exist.

<img src="Screenshot 2023-06-14 at 17.22.36.png">

**Kernel Functions**

Kernel functions can be regarded as tuning choices in an SVM model. They remove the need to explicitly compute the higher-dimensional vector space, which makes it practical to deal with non-linearly separable data. Let us discuss two of the most widely used kernel functions:

- Polynomial kernel: A polynomial function is used with a degree k to separate the non-linear data by transforming it into higher dimensions. 

- Radial Basis Function kernel: This kernel function is also known as the Gaussian kernel function. It corresponds to a feature space with an infinite number of dimensions, which lets it separate non-linear data. It depends on a hyperparameter 'γ' (gamma), whose suitable value depends on the scale of the features, so it should be tuned together with data normalization. The smaller the value of γ, the smoother the decision boundary, giving higher bias and lower variance. On the other hand, a higher value of γ gives lower bias and higher variance, which can lead to overfitting.
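
Both kernels can be written as plain functions of two input vectors. The sketch below uses their usual textbook forms; the parameter names `degree`, `coef0`, and `gamma`, and their default values, are assumptions for illustration.

```python
import math

def polynomial_kernel(x, y, degree=3, coef0=1.0):
    """Polynomial kernel: (x . y + coef0) ** degree."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return (dot + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Two hypothetical points
a, b = [1.0, 2.0], [2.0, 0.0]

print(polynomial_kernel(a, b, degree=2))
print(rbf_kernel(a, b, gamma=0.1), rbf_kernel(a, b, gamma=10.0))
```

With the RBF kernel, a larger gamma makes the similarity fall off faster with distance, so each training point influences a smaller neighborhood.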

**Advantages and Disadvantages of SVM**

Some of the advantages of SVMs are:
1. They are flexible with respect to unstructured, structured, and semi-structured data.
2. The Kernel function eases the complexities in almost any data type.
3. Overfitting is not observed as often as in other models.

Despite these advantages, SVMs also have certain disadvantages:
1. Training time is high on large datasets.
2. The model is hard to interpret, since it behaves largely like a black box.


## Decision Trees

Decision trees are tree-based models that help in making a decision in both regression and classification problems. To make a decision, they use a hierarchical structure and split the dataset into smaller subsets.

Formally, the decision tree splits the data based on different splitting methods. One of the most commonly used methods is based on Entropy and Information Gain.

- **Entropy:** Entropy is the measure of randomness or impurity contained in a dataset.
- **Information Gain:** It is the measure of the information gained by adding a feature/independent variable or, in other words, reduction in the impurity after adding a feature. We simply subtract the entropy of Y given X from the entropy of Y to calculate the reduction of impurity about Y given an additional piece of information X.
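
In symbols, entropy is $H(Y) = -\sum_i p_i \log_2 p_i$ and information gain is $IG(Y, X) = H(Y) - H(Y \mid X)$. The sketch below computes both from lists of class labels; the function names and the toy labels are assumptions made for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    probabilities = [count / total for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in probabilities)

def information_gain(parent_labels, child_label_groups):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent_labels)
    weighted_child_entropy = sum(
        len(group) / total * entropy(group) for group in child_label_groups
    )
    return entropy(parent_labels) - weighted_child_entropy

# Hypothetical labels and two candidate splits
parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]  # each child is pure
useless_split = [["yes", "no"], ["yes", "no"]]  # children are as mixed as the parent

print(entropy(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split that produces pure children removes all impurity, while a split whose children are as mixed as the parent gains nothing.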

**Important Terminology**

- **Root Node:** The root node is from where the decision tree starts. It represents the entire population or samples which get divided into two or more branches.
- **Branch or Sub-Tree:** A part of the entire decision tree is called a branch or sub-tree.
- **Splitting:** Dividing a node into two or more sub-nodes based on if-else conditions.
- **Decision Node:** A sub-node that splits into further sub-nodes. In simple terms, every node is a decision node, except for leaf nodes.
- **Leaf or Terminal Node:** This is the end of the decision tree where it cannot be split into further sub-nodes.
- **Depth of the tree:** The depth of a decision tree is the number of levels of splits (edges) on the longest path from the root node down to the furthest leaf node. The tree shown below has a depth equal to 2.

**Construction of the Decision Trees**

- We have different splitting methods to decide the split in the decision tree. The feature whose split leaves the lowest entropy, i.e., gives the highest information gain, is selected as our root node.

<img src="Screenshot 2023-06-14 at 17.37.23.png">

Decision trees only make the optimal split at each node, and the algorithm does not consider the larger problem as a whole. Also, once a split has been made, it is never reconsidered. Decision trees are hence considered **Greedy Algorithms**, which is a Computer Science term for any algorithm that tries to approximate the globally optimal solution to a problem by finding the locally optimal solution at each step of the problem instead.

#### Tree Pruning

As the tree grows large, the tendency to overfit on the train data increases because, in a large tree, splits are made even to get small gains. One of the techniques used to handle the overfitting problem of Decision Trees is **Tree Pruning**.

Pruning **selectively removes** certain parts of a tree to improve the tree's structure and reduce overfitting. It reduces the size of a Decision Tree, a step that may slightly increase your training error but may also drastically decrease your testing error, making it more adaptable to new, unseen data.

<img src="Screenshot 2023-06-14 at 17.46.24.png">

## Bagging and Random Forest

**Bagging**

- **Ensemble Methods:** Ensemble methods train multiple models on different subsets of the dataset and aggregate the outputs of these models to get the final result. These multiple models are also called the **base models/weak models**.

The two most common types of Ensemble models are:

1. Bagging
2. Boosting

**Bagging** = **Bootstrap** + **Aggregation**

Bootstrapping is a statistical procedure that resamples a single dataset, with replacement, to create many simulated samples. Aggregation is the process of combining the results of all the models to get the final result.

A Bagging classifier is an ensemble method that trains a base model on each random subset of the original dataset. These subsets are made using bootstrapping and the classifier then aggregates their individual predictions (either by voting for classification or by averaging for regression) to form a final prediction. 
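
The two halves of bagging can be sketched separately. In this minimal example, `bootstrap_sample`, `majority_vote`, and `average` are hypothetical helper names, the dataset is a stand-in list of row indices, and the base models themselves are left out.

```python
import random
from collections import Counter

def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items with replacement."""
    return rng.choices(dataset, k=len(dataset))

def majority_vote(predictions):
    """Aggregate class predictions from several base models."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Aggregate numeric predictions from several base models."""
    return sum(predictions) / len(predictions)

rng = random.Random(42)    # fixed seed so the sketch is reproducible
dataset = list(range(10))  # hypothetical stand-in for training rows

# One bootstrapped training set per base model
samples = [bootstrap_sample(dataset, rng) for _ in range(3)]
for sample in samples:
    print(sorted(sample))

print(majority_vote(["cat", "dog", "cat"]))  # classification
print(average([1.0, 2.0, 3.0]))              # regression
```

Because sampling is done with replacement, each bootstrapped set usually contains some rows more than once and omits others.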

**Random Forest**

The Random Forest is a type of bagging algorithm where the base models are decision trees, and the algorithm uses only a subset of randomly picked independent variables (features) for each node's branching possibilities, unlike in plain bagging, where all the features are considered for splitting a node.

Bootstrapped samples are taken from the original training data, and on each bootstrapped training dataset a decision tree is built by considering only a subset of features at each split. The results from all the decision trees are combined, and the final prediction is made using voting or averaging.

For regression problems, we take the average of all the predictions obtained from the different decision trees as our final prediction.
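
What distinguishes a Random Forest from plain bagging is the random feature subset drawn at every split. The sketch below shows only that step. The helper name `candidate_features` is hypothetical, and using the square root of the number of features as the subset size is an assumption (a common default, not something stated in these notes).

```python
import math
import random

def candidate_features(all_features, rng, n_candidates=None):
    """Pick the random subset of features considered at one split."""
    if n_candidates is None:
        # Assumption: square root of the feature count, a common default
        n_candidates = max(1, int(math.sqrt(len(all_features))))
    return rng.sample(all_features, n_candidates)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
features = ["f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "f9"]

# A fresh subset is drawn for every split in every tree
subset_for_split = candidate_features(features, rng)
print(subset_for_split)

# Considering all features at every split corresponds to plain bagging
all_for_split = candidate_features(features, rng, n_candidates=len(features))
```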

<img src ="Screenshot 2023-06-14 at 17.53.36.png">

## K-Nearest Neighbours

The K-Nearest Neighbors (KNN) algorithm is a type of supervised algorithm that can be used for both classification and regression problems. However, it is primarily used in the industry for classification problems.

KNN is well defined by the two properties listed below:
- KNN is a lazy learning algorithm because it does not have a specialized training phase and uses all of the training data while classifying a new data point.
- KNN is a non-parametric learning algorithm because it makes no assumptions about the underlying data distribution.

**How does the K-NN Algorithm function?**

The K-nearest neighbors (KNN) algorithm predicts the values of new data points based on 'feature similarity,' which means that the new data point will be assigned a value based on how closely it matches the points in the training set. The following steps will help us understand how it works.

- Step-1: Select the number K of the neighbors
- Step-2: Calculate the Euclidean distance from the new data point to each point in the training set.
- Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
- Step-4: Among these K neighbors, count the number of data points in each category.
- Step-5: Assign the new data point to the category that has the most neighbors.
- Step-6: Our model is ready.
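
The steps above can be sketched in plain Python. This is a minimal illustration for classification only; the function name `knn_predict` and the toy data are assumptions, and ties between categories are not handled specially.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest neighbors."""
    # Step 2: Euclidean distance from the new point to every training point
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 4-5: count the labels among the neighbors and return the most common
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data for illustration
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

prediction = knn_predict(train_points, train_labels, (2.0, 2.0), k=3)
print(prediction)
```

Note that there is no separate training step: all the work happens at prediction time, which is why KNN is called a lazy learner.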

Here are some things to keep in mind when choosing the value of K in the K-NN algorithm:

- There is no specific way to determine the best value for "K," so we must experiment with various values to find the best one. 
- A very low value for K, such as K=1 or K=2, can be noisy and cause outlier effects in the model.
- Larger values of K reduce the effect of noise, but a value that is too large can blur the boundaries between classes and increase computation.