# Precision/Recall:
- True Positive (TP): correctly predicted positive
- True Negative (TN): correctly predicted negative
- False Positive (FP): incorrectly predicted positive
- False Negative (FN): incorrectly predicted negative
- Accuracy: (TP + TN) / (TP + TN + FP + FN)
- Precision: TP / (TP + FP), the fraction of predicted positives that are actually positive
- Recall: TP / (TP + FN), the fraction of actual positives that are correctly predicted as positive
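
The definitions above can be checked with a few lines of Python. This is a minimal sketch: the function name `classification_metrics` and the example counts are made up for illustration, and returning 0.0 when a denominator is zero is just one possible convention.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision is undefined when nothing is predicted positive;
    # returning 0.0 in that case is an assumption of this sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall is undefined when there are no actual positives.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts: 40 TP, 45 TN, 10 FP, 5 FN
accuracy, precision, recall = classification_metrics(tp=40, tn=45, fp=10, fn=5)
print(accuracy, precision, recall)
```
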
## Trading off Precision and Recall
- Precision and recall usually trade off against each other: raising the decision threshold tends to increase precision and decrease recall, and lowering it tends to do the opposite
- F1 score: 2 * (precision * recall) / (precision + recall)
- F1 score is the harmonic mean of precision and recall rather than the arithmetic mean because the harmonic mean is dominated by the smaller value: if either precision or recall is 0, the F1 score is 0, and the F1 score is 1 only if both precision and recall are 1
- This makes F1 a suitable single number for models that need a balance between the two. For example, a spam classifier would ideally have both high precision and high recall, but in practice some precision is often sacrificed to improve recall, or vice versa. A model that does well on one metric and very poorly on the other gets a low F1 score, whereas its arithmetic mean would still look moderate
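
To make the difference between the harmonic and arithmetic mean concrete, here is a small sketch. The function name `f1_score` and the precision/recall values are illustrative assumptions, not results from a real model.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A balanced model and a lopsided one (hypothetical values)
balanced = f1_score(0.5, 0.5)
lopsided = f1_score(1.0, 0.02)
arithmetic_mean = (1.0 + 0.02) / 2

print(balanced)         # 0.5
print(lopsided)         # about 0.039, pulled towards the low recall
print(arithmetic_mean)  # 0.51, hides the poor recall
```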

# Decision Trees:
- Decision trees are a supervised learning algorithm, mostly used for classification problems, although they can also be used for regression. Despite being fairly simple and easy to understand, decision trees are powerful and widely used. They are also the building blocks of more advanced algorithms such as random forests and gradient boosting.

## Decision Tree Learning:
- Maximize purity: purity measures how homogeneous the labels at a node are (a pure node contains examples of a single class)
- Entropy: measure of impurity in a bunch of examples
- p1 = probability of positive examples
- p0 = probability of negative examples
- p1 + p0 = 1
- Entropy: H(p1, p0) = -p1 log2(p1) - p0 log2(p0), with the convention 0 log2(0) = 0
- Since p0 = 1 - p1, this can be written as H(p1) = -p1 log2(p1) - (1 - p1) log2(1 - p1)
- The logarithm is taken in base 2 so that entropy is measured in bits. With two classes, entropy then ranges from 0 (a pure node) to 1 (a 50/50 mix of labels)
- Entropy measures the uncertainty of the labels at a node. High entropy means the labels are heavily mixed; low entropy means the node contains mostly one class
- Decision trees use entropy to decide which feature to split on: for each candidate feature, the entropy of the node is compared with the weighted entropy of the subsets the split would produce. A good split produces children whose weighted entropy is much lower than the entropy of the parent
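
As a quick check of the formula, the entropy of a node can be computed as below. This is a minimal sketch; the function name `binary_entropy` is chosen here for illustration.

```python
import math


def binary_entropy(p1):
    """Entropy in bits of a node whose fraction of positive examples is p1."""
    if p1 in (0.0, 1.0):
        return 0.0  # convention: 0 * log2(0) = 0, so a pure node has zero entropy
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)


print(binary_entropy(0.5))   # 1.0, maximally mixed
print(binary_entropy(0.25))  # about 0.811
print(binary_entropy(1.0))   # 0.0, pure node
```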

## Choosing a split:
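- Split on the feature that gives the largest reduction in entropy, i.e. the maximum information gain
- Information Gain: the reduction in entropy from the parent node to its child nodes
- Information Gain = H(parent) - (w_left * H(left) + w_right * H(right)), where w_left and w_right are the fractions of the parent's examples that go to each child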
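
The information gain of a candidate split can be sketched as follows. The helper names `entropy` and `information_gain` and the toy label lists are assumptions made for this example.

```python
import math


def entropy(labels):
    """Entropy in bits of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)


def information_gain(parent, left, right):
    """H(parent) minus the weighted average entropy of the two children."""
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return entropy(parent) - (w_left * entropy(left) + w_right * entropy(right))


# Toy node with 5 positives and 5 negatives, split into two children
parent = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
left = [1, 1, 1, 1, 0]
right = [1, 0, 0, 0, 0]
gain = information_gain(parent, left, right)
print(gain)  # about 0.278
```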

## Decision Tree Algorithm:
- Recursively split the data based on the feature that gives the maximum information gain
- Stop when we reach a pure state or when we reach a predefined depth
Order:
- Start with the root node
- Find the feature that gives the maximum information gain
- Split the data based on the feature
- Repeat the process on each child node
- Stop when we reach a pure state or when we reach a predefined depth
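
The recursive procedure above can be sketched in a few dozen lines. This is an illustrative toy, not a production implementation: it assumes binary (0/1) features and labels, stores the tree as nested dictionaries, and all function names are made up for the example.

```python
import numpy as np


def entropy(y):
    """Entropy in bits of a 0/1 label array."""
    if len(y) == 0:
        return 0.0
    p1 = y.mean()
    if p1 == 0 or p1 == 1:
        return 0.0
    return float(-p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1))


def best_feature(X, y):
    """Return (feature index, gain) of the split with maximum information gain."""
    best, best_gain = None, 0.0
    for j in range(X.shape[1]):
        mask = X[:, j] == 1
        if mask.all() or not mask.any():
            continue  # this feature does not separate the examples at all
        left, right = y[mask], y[~mask]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = entropy(y) - weighted
        if gain > best_gain:
            best, best_gain = j, gain
    return best, best_gain


def build_tree(X, y, depth=0, max_depth=2):
    """Split recursively until a node is pure, max_depth is reached, or no split helps."""
    majority = int(y.mean() >= 0.5)
    if entropy(y) == 0.0 or depth == max_depth:
        return {"leaf": majority}
    j, _ = best_feature(X, y)
    if j is None:
        return {"leaf": majority}
    mask = X[:, j] == 1
    return {
        "feature": j,
        "yes": build_tree(X[mask], y[mask], depth + 1, max_depth),
        "no": build_tree(X[~mask], y[~mask], depth + 1, max_depth),
    }


# Toy data in which the label equals feature 0
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])
tree = build_tree(X, y)
print(tree)
```

On this toy data a single split on feature 0 already yields two pure children, so the recursion stops after one level.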

## Regression Trees:
- Regression trees are used for predicting continuous values
- While calculating information gain, we use variance instead of entropy
- Variance: measure of how far a set of numbers is spread out from their average value
- Variance = sum((x - mean)^2) / n
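- Information Gain (here, the reduction in variance) = Variance(parent) - [weighted average]Variance(children), where each child is weighted by the fraction of the parent's examples it receives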
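
A short sketch of the variance-based criterion, assuming the target values are given as plain lists. The function name `variance_reduction` and the numbers are illustrative only.

```python
import numpy as np


def variance_reduction(parent, left, right):
    """Variance(parent) minus the weighted average variance of the children."""
    parent, left, right = (np.asarray(a, dtype=float) for a in (parent, left, right))
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    # np.var divides by n by default, matching Variance = sum((x - mean)^2) / n
    return parent.var() - (w_left * left.var() + w_right * right.var())


# Hypothetical target values: the split separates small values from large ones
parent = [1, 2, 3, 11, 12, 13]
left = [1, 2, 3]
right = [11, 12, 13]
reduction = variance_reduction(parent, left, right)
print(reduction)
```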

### Trees are highly sensitive to small changes in the data
- Small changes in the data can cause a large change in the final estimated tree
- This makes the tree unstable
- This problem is mitigated by using ensemble methods like random forests and gradient boosting
- Tree ensemble methods are also used for feature selection
- In tree ensemble methods, we build multiple trees and combine their predictions (averaging for regression, voting for classification) to get more stable predictions

# Random Forests:
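- In the random forest algorithm we use many trees, and each tree is trained on a different random sample of the training data
- The sampling is done with replacement (a bootstrap sample, typically the same size as the original training set), so some examples appear more than once and others not at all
- To make the trees more different from each other, only a random subset of the features is considered when choosing each split
- Random forests are used for both classification and regression tasks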
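
The two sources of randomness can be sketched with NumPy. This only illustrates the sampling steps, not a full random forest; the helper names and the toy arrays are assumptions made for the example.

```python
import numpy as np


def bootstrap_sample(X, y, rng):
    """Draw len(X) examples with replacement (some rows repeat, some are left out)."""
    m = X.shape[0]
    idx = rng.integers(0, m, size=m)
    return X[idx], y[idx]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider for a split."""
    return rng.choice(n_features, size=k, replace=False)


rng = np.random.default_rng(0)
X = np.arange(40).reshape(10, 4)  # 10 toy examples with 4 features
y = np.arange(10) % 2

X_boot, y_boot = bootstrap_sample(X, y, rng)
features = random_feature_subset(n_features=4, k=2, rng=rng)
print(X_boot.shape, features)
```

A commonly suggested choice for `k` is around the square root of the number of features, but that is a rule of thumb rather than a requirement.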

# XGBoost:
- XGBoost stands for eXtreme Gradient Boosting
- Boosted trees intuition:
- Boosting is an ensemble technique where new models are added to correct the errors made by existing models
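
To illustrate the idea of adding models that correct earlier errors, here is a simplified sketch of boosting for squared error: each round fits a one-split tree (a stump) to the current residuals. This is not XGBoost's actual algorithm, and the function names, learning rate, and toy data are assumptions made for the example.

```python
import numpy as np


def fit_stump(x, residual):
    """Best single-threshold split of 1-D x, predicting the mean residual on each side."""
    best = None
    for t in np.unique(x)[:-1]:  # thresholds that leave both sides non-empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]


def predict_stump(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)


def boost(x, y, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to what is still unexplained."""
    prediction = np.full_like(y, y.mean(), dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residual = y - prediction  # errors made by the current ensemble
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return prediction, stumps


x = np.linspace(0, 6, 50)
y = np.sin(x)
baseline_mse = np.mean((y - y.mean()) ** 2)
prediction, stumps = boost(x, y)
boosted_mse = np.mean((y - prediction) ** 2)
print(baseline_mse, boosted_mse)
```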

```python
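# Assumes X_train, y_train and X_test are already defined
# (for example, from a train/test split of your dataset).

# classification
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)  # predicted class labels

# regression
from xgboost import XGBRegressor
regressor = XGBRegressor()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)  # predicted continuous values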
```

# Unsupervised Learning:


## Clustering:
- Clustering is the task of dividing data points into groups such that points in the same group are more similar to each other than to points in other groups
- Clustering is an unsupervised learning technique
- Clustering algorithms are used for exploratory data analysis to find hidden patterns or grouping in data
## K-Means Clustering:
- step1: choose the number of clusters k
- step2: select k random points from the data as centroids
- step3: assign all the points to the closest cluster centroid
- step4: recompute the centroids of newly formed clusters
- step5: repeat steps 3 and 4 until the centroids stop moving (equivalently, until the assignments no longer change)

```
Repeat {
    # u(1), ..., u(K) are the cluster centroids
    # c(i) is the index of the cluster centroid closest to x(i)

    # assign points to clusters
    for i = 1 to m:
        c(i) := argmin over j in 1..K of ||x(i) - u(j)||^2

    # move cluster centroids
    for j = 1 to K:
        u(j) := mean of all x(i) such that c(i) = j
} until convergence
```
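
The pseudocode above translates fairly directly into NumPy. The following is a minimal sketch rather than a reference implementation: the function name `k_means`, the iteration cap, the fixed seed, and the handling of empty clusters (keeping the old centroid) are all choices made for this example.

```python
import numpy as np


def k_means(X, k, n_iters=100, seed=0):
    """Run k-means once from a random initialization; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids with k distinct training examples
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Squared distance from every point to every centroid, shape (m, k)
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assignments = distances.argmin(axis=1)  # c(i) for every point
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[assignments == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, assignments


# Two well-separated toy blobs
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
centroids, assignments = k_means(X, k=2)
print(centroids)
print(assignments)
```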

## K-means optimization objective:
- The optimization objective of k-means (also called the distortion) is to minimize the average squared distance between each point and the centroid of the cluster it is assigned to
- J(c(1), ..., c(m), u(1), ..., u(k)) = (1/m) * sum over i = 1..m of ||x(i) - u(c(i))||^2
- c(i) = index of the cluster (1 to k) to which x(i) is currently assigned
- u(j) = cluster centroid j
- u(c(i)) = centroid of the cluster to which x(i) is assigned
- k = number of clusters
- m = number of training examples

### Random Initialization:
- Randomly pick k training examples
- set u(1), ..., u(k) equal to these k examples
- Run k-means
- Random initialization can sometimes lead to bad clusters
- To solve this problem, we run k-means multiple times from multiple random initializations
- We then pick the clustering that gave us the lowest cost
- The cost is the k-means objective, i.e. the average squared distance between each point and its assigned cluster centroid: J(c(1), ..., c(m), u(1), ..., u(k)) = (1/m) * sum over i = 1..m of ||x(i) - u(c(i))||^2
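
The restart strategy can be sketched as follows: run k-means several times, compute J for each run, and keep the run with the lowest cost. The function names, the number of restarts, and the toy data are assumptions for this example, and the inner k-means loop is deliberately compact.

```python
import numpy as np


def run_k_means(X, k, rng, n_iters=50):
    """One k-means run from a random initialization; returns (centroids, assignments)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assignments = distances.argmin(axis=1)
        for j in range(k):
            if np.any(assignments == j):
                centroids[j] = X[assignments == j].mean(axis=0)
    # Recompute the assignments so they match the final centroids
    distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, distances.argmin(axis=1)


def distortion(X, centroids, assignments):
    """J = (1/m) * sum over i of ||x(i) - u(c(i))||^2"""
    return float(((X - centroids[assignments]) ** 2).sum(axis=1).mean())


def best_of_restarts(X, k, n_restarts=10, seed=0):
    """Run k-means n_restarts times and keep the clustering with the lowest cost."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        centroids, assignments = run_k_means(X, k, rng)
        cost = distortion(X, centroids, assignments)
        if best is None or cost < best[0]:
            best = (cost, centroids, assignments)
    return best


# Three small toy blobs
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [8.0, 8.0], [8.0, 9.0], [9.0, 8.0],
              [0.0, 8.0], [1.0, 9.0], [0.0, 9.0]])
cost, centroids, assignments = best_of_restarts(X, k=3)
print(cost)
```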

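- `numpy.linalg.norm(x, ord=None, axis=None, keepdims=False)` computes the norm of a vector or matrix, which is useful for the distance calculations in k-means
- With the default arguments it returns the Euclidean (2-) norm of the flattened input: ||x|| = sqrt(sum(x(i)^2)). Passing `axis` computes the norm along that axis, for example one norm per row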
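
A small example of using `np.linalg.norm` for the closest-centroid step. The point and centroid values are made up for illustration.

```python
import numpy as np

x = np.array([3.0, 4.0])
centroids = np.array([[0.0, 0.0], [3.0, 0.0]])

length = np.linalg.norm(x)                         # sqrt(3^2 + 4^2) = 5.0
distances = np.linalg.norm(centroids - x, axis=1)  # one distance per centroid: [5.0, 4.0]
closest = int(np.argmin(distances))                # index of the closest centroid: 1

print(length, distances, closest)
```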

## Anomaly Detection:
- Anomaly detection is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data
- Anomaly detection is applicable in a variety of domains, such as intrusion detection, fraud detection, fault detection, system health monitoring, event detection in sensor networks, and detecting ecosystem disturbances
## Density Estimation:
- Density estimation is the construction of an estimate, based on observed data, of an unobservable underlying probability density function
## Gaussian Distribution:
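- The Gaussian (normal) distribution is a continuous, bell-shaped probability distribution. Among other uses, it approximates the binomial distribution when the number of trials is large
- Parameters of the Gaussian distribution:
    - mean $\mu$: the centre of the bell curve
    - variance $\sigma^2$: how wide the curve is ($\sigma$ is the standard deviation)
    - the standard normal distribution is the special case with mean = 0 and variance = 1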
- probability: $p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$    
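
The density formula can be evaluated directly. This is a minimal sketch; the function name `gaussian_pdf` and the argument name `sigma2` (for the variance) are chosen here for illustration.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Gaussian density p(x; mu, sigma^2), where sigma2 is the variance."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


peak = gaussian_pdf(0.0, mu=0.0, sigma2=1.0)  # density at the mean, about 0.3989
tail = gaussian_pdf(3.0, mu=0.0, sigma2=1.0)  # three standard deviations away, about 0.0044
print(peak, tail)
```
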
## Parameter Estimation:
- The parameters are estimated by maximum likelihood estimation
- Maximum likelihood estimation: given some data, which parameter values of the assumed probability distribution make the observed data most probable?
- given a dataset x(1), ..., x(m)
- assume that each x(i) is drawn independently from $p(x; \mu, \sigma^2)$
- parameters are $\mu$ and $\sigma^2$
- likelihood function: $L(\mu, \sigma^2) = \prod_{i=1}^m p(x^{(i)}; \mu, \sigma^2)$
- then we find the parameters that maximize the likelihood function (in practice, its logarithm). For the Gaussian this gives $\mu = \frac{1}{m}\sum_{i=1}^m x^{(i)}$ and $\sigma^2 = \frac{1}{m}\sum_{i=1}^m (x^{(i)} - \mu)^2$
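
For a Gaussian, the maximum likelihood estimates are the sample mean and the sample variance with a 1/m factor. A minimal sketch, with a made-up function name and toy data:

```python
import numpy as np


def estimate_gaussian(x):
    """Maximum likelihood estimates of mu and sigma^2 for 1-D data."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()  # divides by m (the MLE), not m - 1
    return mu, sigma2


x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma2 = estimate_gaussian(x)
print(mu, sigma2)  # 5.0 4.0
```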