# What is OOD (out-of-domain) in Classification?

An out-of-domain (OOD) input is one that belongs to none of the classes the classifier was trained on. Out-of-domain (OOD) detection for low-resource text classification is a realistic but understudied task. The goal is to detect OOD cases when only limited in-domain (ID) training data is available, since training data is often insufficient in machine learning applications. In this work, we propose an OOD-resistant Prototypical Network to tackle this zero-shot OOD detection and few-shot ID classification task. Evaluation on real-world datasets shows that the proposed solution outperforms state-of-the-art methods on the zero-shot OOD detection task while maintaining competitive performance on the ID classification task.
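To make the idea concrete, here is a minimal sketch of prototype-based classification with a distance threshold for rejecting OOD inputs. It is not the OOD-resistant Prototypical Network from the abstract above: the 2-D "embeddings", the helper names `fit_prototypes` and `predict_with_ood`, and the threshold value are all assumptions chosen for illustration.

```python
import numpy as np

def fit_prototypes(embeddings, labels):
    """Return the class ids and the mean embedding (prototype) of each class."""
    classes = np.unique(labels)
    prototypes = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, prototypes

def predict_with_ood(x, classes, prototypes, threshold):
    """Return the nearest class, or -1 (OOD) if every prototype is farther than threshold."""
    distances = np.linalg.norm(prototypes - x, axis=1)
    nearest = int(np.argmin(distances))
    return -1 if distances[nearest] > threshold else int(classes[nearest])

rng = np.random.default_rng(0)
# Two toy in-domain classes with five examples each (a few-shot setting).
class_0 = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(5, 2))
class_1 = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(5, 2))
embeddings = np.vstack([class_0, class_1])
labels = np.array([0] * 5 + [1] * 5)

classes, prototypes = fit_prototypes(embeddings, labels)
in_domain_pred = predict_with_ood(np.array([0.1, -0.1]), classes, prototypes, threshold=1.5)
ood_pred = predict_with_ood(np.array([10.0, -10.0]), classes, prototypes, threshold=1.5)
print(in_domain_pred, ood_pred)
```

A point close to a prototype gets that prototype's class, while a point far from every prototype is flagged as OOD. In practice the embeddings would come from a trained text encoder and the threshold would be tuned on held-out data.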


# What is Density-Based Clustering?

Density-Based Clustering refers to unsupervised machine learning methods that identify distinctive clusters in the data, based on the idea that a cluster/group in a data space is a contiguous region of high point density, separated from other clusters by sparse regions. The data points in the separating, sparse regions are typically considered noise/outliers.

Cluster analysis is an important problem in data analysis. Data scientists use clustering to identify malfunctioning servers, group genes with similar expression patterns, identify anomalies in biomedical images, and for many other applications.

There are many families of data clustering algorithms, and you may be familiar with the most popular ones: k-Means and DBSCAN. K-Means determines k centroids (the centers of the data clusters) and clusters points by assigning each one to its nearest centroid. DBSCAN, the density-based counterpart, grows clusters outward from points that have enough neighbors within a fixed radius and labels the remaining points as noise.
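The difference between centroid-based and density-based clustering shows up clearly on non-convex data. The sketch below compares scikit-learn's `KMeans` and `DBSCAN` on the two-moons toy dataset; the `eps` and `min_samples` values are assumptions picked for this dataset, not general recommendations.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: contiguous dense regions that are not convex.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
kmeans_ari = adjusted_rand_score(y_true, kmeans_labels)
dbscan_ari = adjusted_rand_score(y_true, dbscan_labels)
n_noise = int((dbscan_labels == -1).sum())  # DBSCAN marks noise points with -1

print(f"k-Means ARI: {kmeans_ari:.2f}")
print(f"DBSCAN ARI:  {dbscan_ari:.2f} (noise points: {n_noise})")
```

k-Means splits the plane with a straight boundary and cuts both moons, whereas DBSCAN follows the dense regions and should recover each moon as one cluster.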


![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

# Types of Optimization Algorithms in Machine Learning

An "optimizer" tells the network how to change its weights.
We've described the problem we want the network to solve, but now we need to say how to solve it. This is the job of the optimizer. The optimizer is an algorithm that adjusts the weights to minimize the loss.
Virtually all of the optimization algorithms used in deep learning belong to a family called stochastic gradient descent. They are iterative algorithms that train a network in steps. One step of training goes like this:

1. Sample some training data and run it through the network to make predictions.
2. Measure the loss between the predictions and the true values.
3. Finally, adjust the weights in a direction that makes the loss smaller.
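The three steps above can be sketched in a few lines of NumPy. This is only an illustration: the "network" is a single linear layer, the loss is mean squared error, and the data, `learning_rate`, and `batch_size` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

w = np.zeros(3)          # the weights the optimizer adjusts
learning_rate = 0.1

def training_step(w, batch_size=32):
    # 1. Sample some training data and make predictions.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    X_batch, y_batch = X[idx], y[idx]
    predictions = X_batch @ w
    # 2. Measure the loss between predictions and true values.
    loss = np.mean((predictions - y_batch) ** 2)
    # 3. Adjust the weights in the direction that makes the loss smaller.
    gradient = 2 * X_batch.T @ (predictions - y_batch) / batch_size
    return w - learning_rate * gradient, loss

losses = []
for _ in range(200):
    w, loss = training_step(w)
    losses.append(loss)

print("learned weights:", np.round(w, 2))
```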

Bracketing Algorithms:
Bracketing optimization algorithms are intended for optimization problems with one input variable where the optimum is known to exist within a specific range.

Bracketing algorithms are able to efficiently navigate the known range and locate the optimum, although they assume only a single optimum is present (referred to as unimodal objective functions).

Some bracketing algorithms can be used without derivative information when it is not available.

Examples of bracketing algorithms include:

Fibonacci Search

Golden Section Search

Bisection Method
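As an illustration of the bracketing idea, here is a minimal sketch of golden section search, which needs no derivatives. The objective function and the starting range are assumptions chosen so that the function is unimodal on the bracket.

```python
import math

def golden_section_search(f, low, high, tol=1e-6):
    """Shrink the bracket [low, high] around the minimum of a unimodal f."""
    inv_phi = (math.sqrt(5) - 1) / 2  # about 0.618
    a, b = low, high
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            # Minimum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # Minimum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2

x_min = golden_section_search(lambda x: (x - 2.0) ** 2 + 1.0, low=0.0, high=5.0)
print(round(x_min, 4))
```

Each iteration reuses one of the two interior evaluations, so only one new function call is needed per step.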


Local Descent Algorithms:
Local descent optimization algorithms are intended for optimization problems with more than one input variable and a single global optimum (e.g. a unimodal objective function).

Perhaps the most common example of a local descent algorithm is the line search algorithm.

Line Search: there are many variations of the line search (e.g. the Brent-Dekker algorithm), but the procedure generally involves choosing a direction to move in the search space, then performing a bracketing-type search along a line or hyperplane in the chosen direction.

This process is repeated until no further improvements can be made.

The limitation is that it is computationally expensive to optimize each directional move in the search space.
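A minimal sketch of this procedure, assuming the search direction is the negative gradient and the one-dimensional search along that direction is delegated to SciPy's `minimize_scalar` (Brent's method by default). The quadratic objective is an assumption for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def objective(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2

def gradient(x):
    return np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])

def line_search_descent(x0, max_iter=100, tol=1e-6):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        direction = -gradient(x)              # choose a direction to move in
        if np.linalg.norm(direction) < tol:   # no further improvement possible
            break
        # 1-D search for the best step length along the chosen direction.
        result = minimize_scalar(lambda alpha: objective(x + alpha * direction))
        x = x + result.x * direction
    return x

x_opt = line_search_descent([5.0, 3.0])
print(np.round(x_opt, 4))
```

Every outer iteration runs a full inner optimization, which is the computational cost mentioned above.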


First-Order Algorithms:
First-order optimization algorithms explicitly involve using the first derivative (gradient) to choose the direction to move in the search space.

The procedures involve first calculating the gradient of the function, then following the gradient in the opposite direction (e.g. downhill to the minimum for minimization problems) using a step size (also called the learning rate).

The step size is a hyperparameter that controls how far to move in the search space, unlike “local descent algorithms” that perform a full line search for each directional move.

A step size that is too small results in a search that takes a long time and can get stuck, whereas a step size that is too large will result in zig-zagging or bouncing around the search space, missing the optimum completely.

First-order algorithms are generally referred to as gradient descent, with more specific names referring to minor extensions to the procedure, e.g.:

Gradient Descent

Momentum

Adagrad

RMSProp

Adam
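The effect of the step size and of the momentum extension can be seen on a small example. The sketch below uses an assumed quadratic objective f(x) = x0^2 + 10 * x1^2; the learning rates and momentum value are illustrative choices.

```python
import numpy as np

def gradient(x):
    # Gradient of f(x) = x0^2 + 10 * x1^2, whose minimum is at the origin.
    return np.array([2.0 * x[0], 20.0 * x[1]])

def descend(x0, learning_rate, momentum=0.0, steps=100):
    x = np.asarray(x0, dtype=float)
    velocity = np.zeros_like(x)
    for _ in range(steps):
        # With momentum=0 this is plain gradient descent.
        velocity = momentum * velocity - learning_rate * gradient(x)
        x = x + velocity
    return x

plain = descend([5.0, 5.0], learning_rate=0.02)
with_momentum = descend([5.0, 5.0], learning_rate=0.02, momentum=0.9)
too_large = descend([5.0, 5.0], learning_rate=0.11, steps=50)

print("plain:        ", np.linalg.norm(plain))
print("with momentum:", np.linalg.norm(with_momentum))
print("step too large:", np.linalg.norm(too_large))
```

With the same learning rate, momentum ends closer to the minimum after 100 steps, while the larger step size overshoots along the steep axis on every update and moves away from the minimum.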


Second-Order Algorithms:
Second-order optimization algorithms explicitly involve using the second derivative (Hessian) to choose the direction to move in the search space.

These algorithms are only appropriate for those objective functions where the Hessian matrix can be calculated or approximated.

Examples of second-order optimization algorithms for univariate objective functions include:

Newton’s Method

Secant Method

Second-order methods for multivariate objective functions that approximate the Hessian, rather than computing it exactly, are referred to as Quasi-Newton Methods.

Quasi-Newton Method

There are many Quasi-Newton Methods, and they are typically named for the developers of the algorithm, such as:

Davidon-Fletcher-Powell (DFP)

Broyden-Fletcher-Goldfarb-Shanno (BFGS)

Limited-memory BFGS (L-BFGS)
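A brief sketch of both cases: a hand-written Newton's method for a univariate function, and SciPy's BFGS implementation for a multivariate one. The univariate polynomial and the starting points are assumptions; `rosen` and `rosen_der` are SciPy's built-in Rosenbrock test function and its gradient.

```python
import math

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

def newton_minimize(first_derivative, second_derivative, x0, steps=20):
    """Univariate Newton's method: each step divides f'(x) by f''(x)."""
    x = x0
    for _ in range(steps):
        x = x - first_derivative(x) / second_derivative(x)
    return x

# f(x) = x^4 - 3x^2 + 2, so f'(x) = 4x^3 - 6x and f''(x) = 12x^2 - 6.
x_newton = newton_minimize(lambda x: 4 * x**3 - 6 * x,
                           lambda x: 12 * x**2 - 6,
                           x0=2.0)
print(x_newton, math.sqrt(1.5))  # local minimum at sqrt(1.5)

# BFGS builds up an approximation of the inverse Hessian from gradient values only.
result = minimize(rosen, x0=np.array([-1.2, 1.0]), jac=rosen_der, method="BFGS")
print(result.x)  # the Rosenbrock minimum is at (1, 1)
```

Passing `method="L-BFGS-B"` instead selects SciPy's limited-memory variant, which avoids storing a dense matrix and is the usual choice when there are many variables.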

Now that we are familiar with the so-called classical optimization algorithms, let’s look at algorithms used when the objective function is not differentiable.


Stochastic Algorithms:
Stochastic optimization algorithms make use of randomness in the search procedure and are suited to objective functions for which derivatives cannot be calculated.

Unlike the deterministic direct search methods, stochastic algorithms typically involve a lot more sampling of the objective function, but are able to handle problems with deceptive local optima.

Stochastic optimization algorithms include:

Simulated Annealing

Evolution Strategy

Cross-Entropy Method
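As an example of this family, here is a minimal sketch of simulated annealing on a one-dimensional function with many deceptive local minima. The objective, the cooling schedule (`start_temp / (1 + k)`), and the step size are assumptions for illustration.

```python
import math
import random

def objective(x):
    # Many local minima near the integers; the global minimum is at x = 0.
    return x * x + 10.0 * (1.0 - math.cos(2.0 * math.pi * x))

def simulated_annealing(x0, steps=5000, start_temp=10.0, step_size=0.5, seed=0):
    rng = random.Random(seed)
    current, current_value = x0, objective(x0)
    best, best_value = current, current_value
    for k in range(steps):
        temperature = start_temp / (1 + k)          # cool down over time
        candidate = current + rng.gauss(0.0, step_size)
        candidate_value = objective(candidate)
        delta = candidate_value - current_value
        # Always accept improvements; accept worse points with a probability
        # that shrinks as the temperature drops.
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            current, current_value = candidate, candidate_value
            if current_value < best_value:
                best, best_value = current, current_value
    return best, best_value

best_x, best_value = simulated_annealing(x0=4.0)
print(best_x, best_value)
```

The search only ever evaluates the objective itself, never a derivative, and the random jumps let it leave the local minimum it starts in.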

Batch Gradient Descent:
Batch Gradient Descent computes the gradient over the full training set at each step, which makes it very slow and computationally expensive on very large training sets. However, it works well for convex or relatively smooth error manifolds. Batch GD also scales well with the number of features.

Mini-Batch Gradient Descent:
Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches that are used to calculate model error and update model coefficients.

Implementations may choose to sum the gradient over the mini-batch or to average it; averaging over more examples further reduces the variance of the gradient.

Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning.
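Batch, mini-batch, and stochastic gradient descent differ only in how many examples feed each update. The sketch below makes that a single `batch_size` parameter; the linear model, the synthetic data, and the hyperparameters are assumptions for illustration.

```python
import numpy as np

def minibatch_gradient_descent(X, y, batch_size, learning_rate=0.05, epochs=50, seed=0):
    """Fit linear-regression weights; batch_size=len(X) gives batch GD, 1 gives SGD."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))              # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            error = X[batch] @ w - y[batch]
            gradient = 2 * X[batch].T @ error / len(batch)  # averaged over the batch
            w -= learning_rate * gradient
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 2))
true_w = np.array([1.5, -0.5])
y = X @ true_w + 0.05 * rng.normal(size=256)

w_batch = minibatch_gradient_descent(X, y, batch_size=len(X))  # 1 update per epoch
w_mini = minibatch_gradient_descent(X, y, batch_size=32)       # 8 updates per epoch
w_sgd = minibatch_gradient_descent(X, y, batch_size=1)         # 256 updates per epoch
print(w_batch, w_mini, w_sgd)
```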



Direct Algorithms:
Direct optimization algorithms are for objective functions for which derivatives cannot be calculated.

The algorithms are deterministic procedures and often assume the objective function has a single global optimum, i.e. that it is unimodal.

Direct search methods are also typically referred to as “pattern search” methods, as they may navigate the search space using geometric shapes or decisions, e.g. patterns.

Gradient information is approximated directly (hence the name) from the objective function by comparing the relative difference between scores for points in the search space. These direct estimates are then used to choose a direction to move in the search space and triangulate the region of the optimum.

Examples of direct search algorithms include:

Cyclic Coordinate Search

Powell’s Method

Hooke-Jeeves Method

Nelder-Mead Simplex Search
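For instance, the Nelder-Mead simplex search is available in SciPy through `minimize(..., method="Nelder-Mead")`. The sketch below applies it to SciPy's built-in Rosenbrock function without passing any gradient; the starting point and tolerances are assumptions.

```python
import numpy as np
from scipy.optimize import minimize, rosen

# No jac= argument: the method only compares objective values at simplex vertices.
result = minimize(
    rosen,
    x0=np.array([-1.2, 1.0]),
    method="Nelder-Mead",
    options={"xatol": 1e-8, "fatol": 1e-8},
)

print(result.x)     # the Rosenbrock minimum is at (1, 1)
print(result.nfev)  # number of objective evaluations used
```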


Population Algorithms:
Population optimization algorithms are stochastic optimization algorithms that maintain a pool (a population) of candidate solutions that together are used to sample, explore, and home in on an optimum.

Algorithms of this type are intended for more challenging objective functions that may have noisy function evaluations and many global optima (multimodal), where finding a good or good-enough solution is challenging or infeasible using other methods.

The pool of candidate solutions adds robustness to the search, increasing the likelihood of overcoming local optima.

Examples of population optimization algorithms include:

Genetic Algorithm

Differential Evolution

Particle Swarm Optimization
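As a short example, SciPy ships a differential evolution implementation. The sketch below runs it on the two-dimensional Rastrigin function, a standard multimodal test problem; the bounds and seed are assumptions.

```python
import numpy as np
from scipy.optimize import differential_evolution

def rastrigin(x):
    # Highly multimodal; the global minimum is 0 at the origin.
    x = np.asarray(x)
    return 10.0 * len(x) + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x))

bounds = [(-5.12, 5.12), (-5.12, 5.12)]
result = differential_evolution(rastrigin, bounds, seed=0)

print(result.x, result.fun)
```

Because a whole population is spread over the bounds, the search is less likely to settle in the first local minimum it encounters than a single-point method started at one location.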



Whale Optimization Algorithm:
Whale optimization algorithm (WOA) is a recently developed swarm-based meta-heuristic algorithm, based on the bubble-net hunting maneuver of humpback whales, for solving complex optimization problems. It has been widely adopted as a swarm intelligence technique in various engineering fields due to its simple structure, few required operators, fast convergence speed, and good balance between the exploration and exploitation phases. Owing to its performance and efficiency, the algorithm has been applied extensively across multidisciplinary fields in the recent past.

WOA is a swarm-based intelligent algorithm proposed for continuous optimization problems. It has been reported to perform competitively with other recent meta-heuristic methods. Compared with other swarm intelligence methods, it is easy to implement and robust. The algorithm requires few control parameters; practically, only a single parameter (time interval) needs to be fine-tuned. In WOA, a population of humpback whales searches a multi-dimensional search space for food, as shown in Fig. 2. The locations of the individual whales are represented as different decision variables, while the distance between each whale and the food corresponds to the value of the objective cost. Note that the time-dependent location of a whale is determined by three operational processes: (1) shrinking encircling of prey, (2) the bubble-net attacking method (exploitation phase), and (3) the search for prey (exploration phase). Figure 3 shows the basic presentation of the WOA.
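The three operational processes can be sketched as follows. This is a simplified illustration based on the commonly cited WOA update rules, not a reference implementation: the function name `whale_optimization`, the spiral constant `b`, the 50/50 choice between encircling and spiral moves, the immediate update of the best position, and the sphere test function are all assumptions.

```python
import numpy as np

def whale_optimization(objective, bounds, n_whales=30, n_iters=200, b=1.0, seed=0):
    rng = np.random.default_rng(seed)
    low, high = np.asarray(bounds, dtype=float).T
    whales = rng.uniform(low, high, size=(n_whales, len(low)))
    fitness = np.array([objective(w) for w in whales])
    best = whales[fitness.argmin()].copy()
    best_value = float(fitness.min())

    for t in range(n_iters):
        a = 2.0 * (1.0 - t / n_iters)        # decreases linearly from 2 towards 0
        for i in range(n_whales):
            A = 2.0 * a * rng.random() - a   # step coefficient in [-a, a]
            C = 2.0 * rng.random()
            if rng.random() < 0.5:
                if abs(A) < 1.0:
                    target = best            # (1) shrinking encircling of the prey
                else:
                    target = whales[rng.integers(n_whales)]  # (3) search for prey
                new_position = target - A * np.abs(C * target - whales[i])
            else:
                # (2) bubble-net attack: spiral move around the best position
                l = rng.uniform(-1.0, 1.0)
                distance = np.abs(best - whales[i])
                new_position = distance * np.exp(b * l) * np.cos(2.0 * np.pi * l) + best
            whales[i] = np.clip(new_position, low, high)
            value = objective(whales[i])
            if value < best_value:
                best, best_value = whales[i].copy(), float(value)
    return best, best_value

def sphere(x):
    return float(np.sum(x**2))

best_position, best_value = whale_optimization(sphere, bounds=[(-10.0, 10.0)] * 3)
print(best_position, best_value)
```

Early on, large values of `a` make `abs(A) >= 1` likely, so whales move relative to random peers (exploration); as `a` shrinks, moves concentrate around the best position found so far (exploitation).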

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

https://medium.com/@minions.k/optimization-techniques-popularly-used-in-deep-learning-3c219ec8e0cc

# Bagging and Boosting

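Bagging, also known as bootstrap aggregating, is an ensemble learning technique that helps to improve the performance and accuracy of machine learning algorithms. It trains several copies of a model on bootstrap samples of the training data and combines their predictions, which addresses the bias-variance trade-off by reducing the variance of the prediction model. Bagging helps to reduce overfitting and is used for both regression and classification models, most commonly with decision tree algorithms. Boosting, by contrast, trains models sequentially, with each new model concentrating on the examples the previous ones handled poorly, and mainly reduces bias.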
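A small comparison with scikit-learn, as a sketch: a single decision tree, a bagged ensemble of trees (`BaggingClassifier` uses a decision tree as its default base model), and a boosted ensemble (`GradientBoostingClassifier`). The dataset and the number of estimators are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    # Bagging: 50 trees, each fit on a bootstrap sample, predictions combined by voting.
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    # Boosting: trees are added one after another to correct earlier errors.
    "boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```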

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

![image-4.png](attachment:image-4.png)



https://www.upgrad.com/blog/bagging-vs-boosting/

# How to Apply Activation Functions in Machine Learning for Multi-Class Classification, with Examples

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)

* Softmax

![image.png](attachment:image.png)
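For multi-class classification, softmax is typically applied to the raw scores (logits) of the output layer to turn them into class probabilities. A minimal NumPy sketch, with made-up logits for two samples and three classes:

```python
import numpy as np

def softmax(logits):
    # Subtracting the row maximum does not change the result but avoids overflow.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
probabilities = softmax(logits)
predicted_classes = probabilities.argmax(axis=1)  # most probable class per sample

print(np.round(probabilities, 3))
print(predicted_classes)
```

Each row of `probabilities` sums to 1, and the predicted class is the index of the largest probability.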

# What Are Clustering Algorithms?

Affinity Propagation

Agglomerative Clustering

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

K-Means

Mini-Batch K-Means

Mean Shift

OPTICS (Ordering Points To Identify the Clustering Structure)

Spectral Clustering

Mixture of Gaussians
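Several of the algorithms listed above are available in scikit-learn with a similar interface. The sketch below runs four of them on the same synthetic blobs and scores each against the known grouping; the dataset parameters and the choice of four algorithms are assumptions for illustration.

```python
from sklearn.cluster import AgglomerativeClustering, Birch, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Three well-separated groups of points in 2-D.
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

models = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglomerative Clustering": AgglomerativeClustering(n_clusters=3),
    "BIRCH": Birch(n_clusters=3),
    "Mixture of Gaussians": GaussianMixture(n_components=3, random_state=0),
}

# fit_predict returns one cluster label per sample for every estimator here.
scores = {name: adjusted_rand_score(y_true, model.fit_predict(X))
          for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: ARI = {score:.2f}")
```

On data this simple the methods largely agree; they differ more on clusters with irregular shapes, unequal densities, or noise.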

# Automated ML (AutoML)

In [2]:
!pip install lazypredict

Collecting lazypredict
  Using cached lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Collecting lightgbm
  Using cached lightgbm-3.3.3-py3-none-win_amd64.whl (1.0 MB)
Collecting xgboost
  Downloading xgboost-1.6.2-py3-none-win_amd64.whl (125.4 MB)
Installing collected packages: xgboost, lightgbm, lazypredict
Successfully installed lazypredict-0.2.12 lightgbm-3.3.3 xgboost-1.6.2


In [5]:
import pandas as pd
import plotly.express as px

# Load the iris dataset bundled with Plotly as a pandas DataFrame.
df = px.data.iris()
df.head()

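| | sepal_length | sepal_width | petal_length | petal_width | species | species_id |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1 |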


In [6]:
# One-hot encode the categorical "species" column; dtype=int yields 0/1 columns.
df_cat_to_array = pd.get_dummies(df, dtype=int)
df_cat_to_array = df_cat_to_array.drop("species_id", axis=1)
df_cat_to_array

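| | sepal_length | sepal_width | petal_length | petal_width | species_setosa | species_versicolor | species_virginica |
|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 1 | 0 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 1 | 0 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 1 | 0 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 1 | 0 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 0 | 0 | 1 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 0 | 0 | 1 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 0 | 0 | 1 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 0 | 0 | 1 |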


In [7]:
import lazypredict
from sklearn.model_selection import train_test_split
from lazypredict.Supervised import LazyRegressor

In [8]:
# Predict sepal_width from the remaining columns.
X = df_cat_to_array.drop(["sepal_width"], axis=1)
Y = df_cat_to_array["sepal_width"]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=64)
# LazyRegressor fits many scikit-learn regressors and returns a score table plus predictions.
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models, pred = reg.fit(X_train, X_test, y_train, y_test)
models

If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:

from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(with_mean=False), Lars())

If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:

kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| SVR | 0.7 | 0.76 | 0.19 | 0.02 |
| NuSVR | 0.7 | 0.76 | 0.19 | 0.33 |
| KNeighborsRegressor | 0.64 | 0.71 | 0.2 | 0.08 |
| RandomForestRegressor | 0.62 | 0.7 | 0.21 | 0.22 |
| GradientBoostingRegressor | 0.62 | 0.7 | 0.21 | 0.08 |
| LGBMRegressor | 0.6 | 0.69 | 0.21 | 0.56 |
| HistGradientBoostingRegressor | 0.6 | 0.68 | 0.21 | 1.51 |
| HuberRegressor | 0.6 | 0.68 | 0.21 | 0.04 |
| Ridge | 0.6 | 0.68 | 0.22 | 0.02 |
| RidgeCV | 0.6 | 0.68 | 0.22 | 0.01 |


In [14]:
!pip install tpot

Collecting tpot
  Using cached TPOT-0.11.7-py3-none-any.whl (87 kB)
Collecting deap>=1.2
  Downloading deap-1.3.3-cp39-cp39-win_amd64.whl (114 kB)
Collecting update-checker>=0.16
  Using cached update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Collecting stopit>=1.1.1
  Using cached stopit-1.1.2.tar.gz (18 kB)
Building wheels for collected packages: stopit
  Building wheel for stopit (setup.py): started
  Building wheel for stopit (setup.py): still running...
  Building wheel for stopit (setup.py): finished with status 'done'
  Created wheel for stopit: filename=stopit-1.1.2-py3-none-any.whl size=11956 sha256=f837d03cb4c5e93c3d0a1ed6020835db6b658e14ec0cb1263be0770a7ad7bc8f
  Stored in directory: c:\users\anas\appdata\local\pip\cache\wheels\48\8c\93\3afb1916772591fe6bcc25cdf8b1c5bdc362f0ec8e2f0fd413
Successfully built stopit
Installing collected packages: update-checker, stopit, deap, tpot
Successfully installed deap-1.3.3 stopit-1.1.2 tpot-0.11.7 update-checker-0.18.0


In [1]:
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')




Generation 1 - Current best internal CV score: 0.9844058928817294

Generation 2 - Current best internal CV score: 0.9844058928817294

Generation 3 - Current best internal CV score: 0.9844058928817294

Generation 4 - Current best internal CV score: 0.9844058928817294

Generation 5 - Current best internal CV score: 0.9851493873055212

Best pipeline: KNeighborsClassifier(GradientBoostingClassifier(input_matrix, learning_rate=0.01, max_depth=10, max_features=0.2, min_samples_leaf=5, min_samples_split=5, n_estimators=100, subsample=0.7500000000000001), n_neighbors=3, p=2, weights=distance)
0.9844444444444445
