# Neural Network in Scikit-Learn

![image.png](attachment:image.png)

Now when we talk about building neural network models, scikit-learn is not the first library that will come to your mind. That's because support for neural networks in scikit-learn is currently quite limited. Before we dive into what exactly neural networks are and how they help build machine-learning models, let's compare and contrast what scikit-learn has to offer with other deep-learning frameworks such as TensorFlow and PyTorch. 


If you want to build machine-learning models using traditional techniques such as decision trees, linear regression, support-vector machines, the most popular library out there for general purpose ML is scikit-learn. 


TensorFlow and PyTorch specialize in deep learning. If you're looking to build highly customized neural network models, you'll choose to go with TensorFlow or PyTorch.

The scikit-learn library tries to make things as easy for you as possible. There is a vast array of estimators available for common machine-learning tasks: classification, regression, clustering, and even dimensionality reduction.
Under the hood, almost all of these estimators are implemented using traditional machine-learning techniques.



With TensorFlow and PyTorch, there are a relatively small number of algorithmic estimators for common problems. TensorFlow and PyTorch will give you the building blocks that you need to build complex neural networks. 

With the scikit-learn library, there are many estimators available that are implemented using traditional machine-learning algorithms such as those I listed earlier: support-vector machines, decision trees, random forests, and other ensemble techniques.


With TensorFlow and PyTorch, there is little to no support for traditional ML.


Now it is possible to build neural networks using scikit-learn, and we'll see how shortly, but the support for neural networks is very, very limited. 


TensorFlow and PyTorch are entirely focused on building neural networks of different types: convolutional neural networks, recurrent neural networks, you name it.


Scikit-learn estimator objects can work directly with data stored in pandas data frames or NumPy arrays. 


When you're working with TensorFlow or PyTorch, you have to work with special data types called tensors. Tensors are essentially just multi-dimensional arrays, but they can be placed on accelerators such as GPUs and are designed for large-scale, distributed training and prediction.


Scikit-learn offers no specialized GPU support for its estimator objects and other models, whereas with TensorFlow and PyTorch, they help you leverage the power of GPUs in a built-in manner. 


While it is possible to use multiple processors to train scikit-learn models, it's not really suited to distributed training. 


TensorFlow, as well as PyTorch have extensive support for distributed training on multiple devices on the same machine, as well as multiple machines in a cluster. 


Now, neural networks in scikit-learn do exist, but you'll see that there are a limited number of neural network building blocks. 



With TensorFlow and PyTorch, well, they are for neural networks. You have a large number of neuron types, activation functions, loss functions, etc., to really build your custom model. 



With scikit-learn, you can only build one kind of neural network model, that is the fully-connected neural network with regularization to prevent overfitting. 


The building blocks of deep-learning frameworks support all kinds of neural network layers whether they are convolutional neural networks or recurrent neural networks. If you're looking to build recurrent neural networks to work with sequential data or convolutional neural networks to work with image data, you can't really use scikit-learn. 


With TensorFlow and PyTorch, it's relatively simple to build complex RNN and CNN architectures.


And finally, scikit-learn does not really offer support for pre-trained models, which you can use for transfer learning. 


With TensorFlow, as well as PyTorch, there is an impressive array of pre-trained models available for all kinds of use cases. 



Now that we know what scikit-learn does and does not support as far as neural networks are concerned, let's discuss what kind of models we can build using neural networks in scikit-learn. 


Scikit-learn offers high-level estimators to build fully-connected dense neural networks for supervised learning problems such as classification and regression.


Scikit-learn also offers a neural network solution for dimensionality reduction, which is an unsupervised learning technique. 


The neural networks that you can build in scikit-learn are called multi-layer perceptrons. Multi-layer perceptrons are what we'll use to build our classification and regression models, and we'll perform dimensionality reduction using restricted Boltzmann machines, or RBMs.
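As a rough illustration of what such an estimator looks like in practice, here is a minimal sketch of a multi-layer perceptron classifier. The synthetic dataset, the single hidden layer of 16 neurons, and the other parameter values are assumptions chosen only for demonstration; `MLPRegressor` and `BernoulliRBM` live in the same `sklearn.neural_network` module.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic two-class data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# One hidden layer of 16 neurons (an arbitrary choice);
# alpha is the strength of the L2 regularization term.
clf = MLPClassifier(
    hidden_layer_sizes=(16,),
    activation="relu",
    alpha=1e-4,
    max_iter=1000,
    random_state=0,
)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```

Note that the estimator accepts plain NumPy arrays directly, and that the fitted weights are exposed as one matrix per layer-to-layer connection in `clf.coefs_`.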

# Perceptrons and Neurons



Neural networks in scikit-learn are built using multi-layer perceptrons, so what exactly is a perceptron and how does that relate to  the neurons in a neural network? 

Supervised learning techniques such as classification and regression can be implemented in scikit-learn using multi-layer perceptrons, and for unsupervised learning, we have restricted Boltzmann machines.


Let's look at multi-layer perceptrons, and to do that, we have to understand what exactly a perceptron is. 


When neural networks were originally invented, the active learning unit was the perceptron. 
This is the simplest artificial neural network architecture, and it was originally invented back in 1957 by Frank Rosenblatt. 

You can think of the perceptron as the precursor to the neuron that is used today: the neuron is the active learning unit of a modern neural network, and the perceptron is its predecessor.


Just like a neuron, a perceptron can accept several X values at the input, and it calculates the weighted sum of inputs. 


![image.png](attachment:image.png)




To this weighted sum of the inputs, the perceptron then applies a step function, which has a threshold.


Now the output of the perceptron is positive if this weighted sum is above the threshold, and the output of the perceptron is a negative value if this weighted sum is below the threshold. 


Now we'll talk about how a neuron works in just a bit, but the basic functionality is the same, weighted sum of inputs and then a function, which in the neuron is called the activation function. 


In the perceptron, it's the threshold function. So essentially a perceptron is simply a neuron with a step activation function. 

![image.png](attachment:image.png)


The step activation function is the threshold that we spoke of earlier. 
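The perceptron described above can be sketched in a few lines of plain Python. The function name, the weights, and the threshold below are hypothetical values picked for illustration, and the choice of +1 and -1 as the positive and negative outputs is also an assumption.

```python
def perceptron_output(inputs, weights, threshold):
    """Return +1 if the weighted sum of the inputs exceeds the threshold, else -1."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else -1


# Hypothetical weights and threshold, chosen only for illustration.
weights = [0.5, -0.25, 1.0]
threshold = 1.0

# Weighted sum: 0.5 - 0.5 + 1.5 = 1.5, which is above the threshold.
high = perceptron_output([1.0, 2.0, 1.5], weights, threshold)

# Weighted sum: 0.5 - 0.5 + 0.5 = 0.5, which is below the threshold.
low = perceptron_output([1.0, 2.0, 0.5], weights, threshold)

print(high, low)  # 1 -1
```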


Now that we understand what the perceptron is and its relationship with the neuron, from here on in, we only talk of neurons and not perceptrons. 



Even though in scikit-learn neural networks are referred to as multi-layer perceptrons, the active learning unit is a neuron, and you can apply different activation functions just like you can with a neuron. 



This leads us to our next concept. What exactly is a neuron, and how does it work? How does it learn from your data? 



![image.png](attachment:image.png)


A neuron is nothing but a mathematical function which accepts a number of X values at its input, and once the mathematical function is applied to these X values, you get the output Y value of the neuron. 


Now for an active neuron, any change in the inputs of the neuron should trigger a corresponding change in its output.




### SIMPLE - EXPLANATION 


#### What does a neuron do?
The operations done by each neuron are pretty simple:

![image.png](attachment:image.png)
Fig. Operations done by a neuron


First, it adds up the values of all the neurons from the previous column that it is connected to. In the figure above, there are 3 inputs (x1, x2, x3) coming to the neuron, so 3 neurons of the previous column are connected to our neuron.


Before being added, each value is multiplied by another variable called a “weight” (w1, w2, w3), which determines the strength of the connection between the two neurons. Each connection between neurons has its own weight, and these weights are the main values that are modified during the learning process.



Moreover, a bias value may be added to the total. It does not come from a specific neuron; in most modern networks, including scikit-learn's multi-layer perceptrons, it is learned during training along with the weights.
After all those summations, the neuron finally applies a function called the “activation function” to the obtained value.


![image.png](attachment:image.png)
Fig. Sigmoid function


The activation function usually serves to turn the total value calculated before into a number between 0 and 1 (done, for example, by the sigmoid function shown in the figure above). Other functions exist and produce different output ranges, but they serve the same purpose of shaping the neuron's output.



That’s all a neuron does! It takes all the values from connected neurons multiplied by their respective weights, adds them, and applies an activation function. Then, the neuron is ready to send its new value to other neurons.
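The three steps (multiply by weights, add up along with the bias, apply the activation) can be sketched as follows. The input values, weights, and bias are made up for illustration and are picked so that the weighted sum comes out at zero.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# Illustrative values: 0.4*1.0 + 0.6*0.5 + 0.2*(-1.0) - 0.5 is (approximately) 0.
x = [1.0, 0.5, -1.0]
w = [0.4, 0.6, 0.2]
b = -0.5

y = neuron_output(x, w, b)
print(round(y, 3))  # 0.5, because sigmoid(0) = 0.5
```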


After every neuron of a column has done this, the neural network passes to the next column. In the end, the last values obtained should be usable to determine the desired output.


Now that we understand what a neuron does, we could possibly create any network we want. However, there are other operations to implement to make a neural network learn.

*** 

#### How does a neural network learn?

**** 
Yep, creating variables and making them interact with each other is great, but that is not enough to make the whole neural network learn by itself. We need to prepare a lot of data to give to our network. Those data include the inputs and the output expected from the neural network.


Let’s take a look at how the learning process works:



First of all, remember that when an input is given to the neural network, it returns an output. On the first try, it can’t get the right output on its own (except by luck), and that is why, during the learning phase, every input comes with its label, which says what output the neural network should have produced. If the output is correct, the current parameters are kept and the next input is given. However, if the obtained output doesn’t match the label, the weights (and biases) are changed. Those are the only variables that can be changed during the learning phase. This process can be imagined as a panel of knobs that are turned to new positions every time an input isn’t guessed correctly.


To determine which weights to modify, and by how much, a process called “backpropagation” is used. We won’t linger on the details here, but it consists of going back through the neural network and inspecting every connection to check how the output would change in response to a change in that weight.

Finally, there is one last parameter that controls the way the neural network learns: the “learning rate”. This value determines how quickly the neural network learns or, more specifically, whether it modifies a weight little by little or in bigger steps. Small values are typical; scikit-learn's multi-layer perceptron estimators, for example, default to an initial learning rate of 0.001, and a value as large as 1 often makes training unstable.
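To get a feel for what the learning rate does, here is a small sketch that repeatedly adjusts a single weight to minimize a toy loss. The loss function `(w - 3)**2`, the step count, and the two learning rates are assumptions made for illustration; the right learning rate for a real network depends on the problem.

```python
def minimize(learning_rate, steps=50, start=0.0):
    """Gradient descent on loss(w) = (w - 3)**2, whose minimum is at w = 3."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)   # derivative of the loss with respect to w
        w -= learning_rate * gradient
    return w


w_small = minimize(learning_rate=0.1)  # small steps: settles very close to 3
w_large = minimize(learning_rate=1.0)  # big steps: jumps between 0 and 6 forever

print(round(w_small, 3), w_large)
```

With the small learning rate, each step shrinks the distance to the minimum; with the large one, every update overshoots by exactly as much as it corrects, so the weight never settles.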

*** 

### Interconnection between neurons
_________________________________

![image.png](attachment:image.png)

*** 

The outputs of neurons feed into the neurons from the next layer, so the neurons in the different layers of the neural network are connected together, and each of these interconnections is associated with a weight. 



![image.png](attachment:image.png)

<img src = "./files/Capture1.png">

If the second neuron is sensitive to the output of the first neuron, it means that the two neurons have a strong connection. During the training of the neural network, the more sensitive the second neuron is to the output of the first, the larger the weight associated with this interconnection. W increases as the sensitivity of the neuron to its input increases.


The training process of the neural network is responsible for figuring out these Ws, the weights for the various neuron interconnections. 

#### So what exactly is within a neuron? 


A single neuron applies just two mathematical functions to its inputs. The first of these functions is called the affine transformation, and the second function is called the activation function. 


The <b>affine transformation</b> is responsible for learning linear relationships that exist in your data between the input X variables and the output Y of the neuron.


<img src = "./files/Capture2.png">


The affine transformation calculates a weighted sum of the input X variables and adds a bias. 

- Now if you remember the perceptron, its first mathematical operation was to calculate a weighted sum of the inputs. The perceptron's similarity to the neuron that we're discussing here should become more obvious.


<img src = "./files/Capture3.png">


Observe the second mathematical function that a neuron applies to its inputs. The <b> activation function </b> is one that helps discover non-linear relationships that exist in your data. 

- If you go back to our discussion on the perceptron, you'll remember that the activation function in the perceptron was the step function or a threshold function. 

Combining the affine transformation with the step (threshold) function meant that the perceptron could only work with linearly separable data.




#### What does it mean to have linearly separable data? 

Let's consider data in two dimensions. 

![image.png](attachment:image.png)

Here is a line that we can draw to separate the two classes represented by the red and green colors. 


Data that is distributed in this manner where a linear boundary exists between classes can be thought of as linearly separable data. 


Perceptrons, which had the threshold function as the activation function, could only work with data of this kind. 
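This limitation can be checked with a short experiment. The sketch below trains a single perceptron with the classic perceptron learning rule (a technique named here for illustration, not something the text above prescribes) on two tiny datasets: the logical AND, which is linearly separable, and the logical XOR, which is not. The epoch count and learning rate are arbitrary assumptions.

```python
def step(z):
    return 1 if z >= 0 else 0


def train_perceptron(samples, epochs=25, learning_rate=0.5):
    """Perceptron learning rule: nudge the weights only when a prediction is wrong."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, label in samples:
            prediction = step(sum(w * x for w, x in zip(weights, inputs)) + bias)
            error = label - prediction   # 0 when correct, +1 or -1 when wrong
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


def accuracy(samples, weights, bias):
    correct = sum(
        step(sum(w * x for w, x in zip(weights, inputs)) + bias) == label
        for inputs, label in samples
    )
    return correct / len(samples)


and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_accuracy = accuracy(and_data, *train_perceptron(and_data))
xor_accuracy = accuracy(xor_data, *train_perceptron(xor_data))

print(and_accuracy)  # 1.0: a straight line separates the two classes
print(xor_accuracy)  # below 1.0: no straight line can separate XOR
```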


When you're working with neurons though, there are a large number of activation functions that you can use with your neuron. And it is this activation function within the neuron which allows the neuron to learn more complex relationships in your data. 




<img src = "./files/Capture4.png">

The activation function in a neuron can be a fairly simple one. It can simply be the identity function, where the output of the affine transformation is passed through as the output of the neuron itself. Such a neuron is often referred to as a linear neuron.

However, choosing the <b>right activation function</b> for your neuron is an important part of the design of your neural network, and it is the combination of the affine transformation and a non-linear activation function that allows networks of neurons to learn complex, non-linear relationships.


There is a wide variety of activation functions that you can use with your neuron. Each has its strengths and weaknesses, and you really have to train the model on your data to figure out which one works well.

Here are the shapes of the curves of some commonly used activation functions:

<img src = "./files/Capture5.png">

These include ReLU, the logistic (sigmoid) function, tanh, the step function, and so on.

<img src = "./files/Capture6.png">

Notice how all of these activation functions have an active region or a region which has a gradient. It is this gradient that allows neurons to be sensitive to changes in the input.


During the training of your neural network, active neurons will operate in the active region. 


In order to train and adjust the weights of the neural network, the activation functions should be active and not be in the saturation region highlighted here.
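The difference between the active region and the saturation region can be made concrete with a few lines of Python. The sketch below defines the activation functions named above and compares the gradient of the logistic function at two inputs; the specific inputs 0 and 10 are arbitrary choices for illustration.

```python
import math


def relu(z):
    return max(0.0, z)


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    return math.tanh(z)


def step(z):
    return 1.0 if z >= 0 else 0.0


def logistic_gradient(z):
    """Derivative of the logistic function: s * (1 - s)."""
    s = logistic(z)
    return s * (1.0 - s)


grad_active = logistic_gradient(0.0)      # 0.25, the steepest point of the curve
grad_saturated = logistic_gradient(10.0)  # tiny: the curve is almost flat here

print(grad_active, grad_saturated)
```

A neuron sitting in the flat part of the curve barely reacts to changes in its input, which is why its weights are hard to adjust during training.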


<img src = "./files/Capture7.png">


As you can see, the idea behind perceptrons and neurons is fairly simple, but many of these simple neurons arranged in layers can do magical stuff, and these are the building blocks of your neural network.

<img src = "./files/Capture8.png">


*** 

# Multi-layer Perceptrons and Neural Networks

*** 

A multi-layer perceptron is simply a feed-forward neural network. What does feed forward mean? The inputs flow forward through the layers of the neural network.

<img src = "./files/Capture9.png">


The neurons in the layer receive their inputs from neurons in the previous layer and pass their outputs to neurons in the next layer. What you see on the image is a feed-forward neural network. This is a multi-layer perceptron. 





A neural network such as this is composed of the neurons that we discussed before, and these neurons are arranged in layers.


Each layer of the neural network learns different features and details from the training data that is used to train this model. 


Layers in a neural network are groups of neurons that perform similar functions. 


In our image here, you can imagine that the first layer, when looking at image data, extracts pixel-related information, the second layer looks at edges, the third layer looks for corners, the fourth layer looks at object parts and other layers aggregate all of this information together to perform, say, image classification. 


If you take a magnifying glass to the layers of the neural network, 

<img src = "./files/Capture10.png">

you'll find that each layer consists of individual interconnected neurons and observe how neurons in one layer receive input from neurons in the previous layer and pass their output to neurons in the next layer. 


You can see that the input is fed forward through this network. This is a feed-forward network. This is a multi-layer perceptron. 


In feed-forward neural networks, there are no connections between neurons in the same layer. 


Neurons are only connected to neurons in the layer after them and the layer before them.
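A forward pass through such a network can be sketched as a chain of layer computations, where each layer only sees the outputs of the layer before it. The layer sizes, weights, and biases below are hypothetical values chosen so the arithmetic is easy to follow by hand.

```python
def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def dense_layer(inputs, weights, biases, activation):
    """Each row of `weights` holds the incoming weights of one neuron in the layer."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical parameters: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden_weights = [[1.0, -1.0], [0.5, 0.5]]
hidden_biases = [0.0, -1.0]
output_weights = [[2.0, 1.0]]
output_biases = [0.5]

x = [3.0, 1.0]
hidden = dense_layer(x, hidden_weights, hidden_biases, relu)
output = dense_layer(hidden, output_weights, output_biases, identity)

print(hidden)  # [2.0, 1.0]
print(output)  # [5.5]
```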


You can see that neural networks are essentially just computation graphs that learn from your data. And this leads us right into a discussion of deep learning with neural networks. 


We know what neural networks are, directed computation graphs that learn relationships that exist in your data, and these relationships are learned by the active learning units, the neurons that are arranged in layers. 


A simple rule of thumb is the more complex the graph, the more relationships it can learn. So complicated neural networks are built up by setting up complex interconnections in the graph. 


Neural networks are the most popular kind of deep-learning models.<b> Deep learning here refers to the depth of the computation graph, the number of layers in your neural network.</b>

*** 

# Training a Neural Network

*** 

In order to build a neural network model to make predictions, you need to design your neural network and then train the neural network. 


Neural networks are made up of active learning units called neurons. These are the nodes in the computation graph, which are simple entities.


Each neuron performs very simple operations on data, and the neurons are connected in complex, sophisticated ways. 


A neural network is simply a network of these complex interconnections between simple neurons. 


Once you think of the neural network as a computation graph, you know that the interconnections between neurons can be configured in many different ways, and these different kinds of interconnections give you different designs for your neural network such as <b>convolutional neural networks or recurrent neural networks. </b>


The whole idea behind arranging the neurons in your neural network in layers is the fact that you have groups of neurons, which perform similar operations, and all of these neurons together form a layer. 


When we looked at how a neuron works, we discussed that each neuron applies two simple mathematical functions to its input.


The first of these operations is the affine transformation that calculates the weighted sum of inputs and adds a bias. 


We also discussed the fact that the weight corresponding to a connection between two neurons gets stronger if one neuron is sensitive to the output of the other. 


A good question to ask right now would be where do the values of W and b come from? How do we find the weights between the interconnections of the neurons in a neural network? 

This is important because finding the best values for W and b is crucial. This is what allows the neurons in a neural network to make complex predictions. The weights and biases of your neural network are often referred to as the neural network parameters, and their best values are found using a cost function, an optimizer, and a corpus of training data. The process of finding the best values of W and b is referred to as the training process.


The objective of the training process is to use a corpus of training data to minimize the cost function, with the optimizer updating the weights and biases of the neural network.


Training the neural network happens via a process called gradient descent optimization. 



Let's visually understand what gradient descent is. Let's consider W and b to be the weights and biases of our neural network model, and on the third axis, we model the loss. 


<img src = "./files/Capture11.png">


Think of the loss as a measure of how much the predicted values from our neural network model differ from the actual values in our training data. For different values of the weights and biases of our neural network model, the loss of the model will be different. 

<img src = "./files/Capture12.png">

Hypothetically, let's assume that the loss is represented by this shape that you see above. 


The objective of the gradient descent algorithm to train neural networks is to minimize the loss of your neural network model, so we are looking for those values of W and b that correspond to the smallest value of loss. 

<img src = "./files/Capture13.png">



Now the weights and biases of your neural network will be initialized to some values, and let's say that these values of W and b correspond to some initial value of loss on this loss surface.


The training process of a neural network involves walking down this surface, tweaking the values of W and b until we get to the smallest value of loss.

<img src = "./files/Capture14.png">


Gradient descent refers to converging on this best value of W and b using an optimization algorithm. 
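The walk down the loss surface can be sketched for the simplest possible model: a single weight W and a single bias b fitted to a handful of points. The toy data (generated from y = 2x + 1), the mean squared error loss, the learning rate, and the number of steps are all assumptions made for illustration.

```python
# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def mse(w, b):
    """Mean squared error between the predictions w*x + b and the targets."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


w, b = 0.0, 0.0          # initial values of the parameters
learning_rate = 0.05
initial_loss = mse(w, b)

for _ in range(2000):
    # Partial derivatives of the loss with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Step downhill, against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mse(w, b)
print(round(w, 3), round(b, 3))  # close to 2.0 and 1.0
```

A real neural network has many more parameters and a far less regular loss surface, but the update rule for each parameter has the same shape.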




During the training of a neural network, the error measured at the output is propagated backward from the later layers to the earlier ones, so that each weight and bias can be adjusted according to how much it contributed to that error. This backward pass is called back propagation, and <b>back propagation</b> is the standard algorithm for computing the updates used to train feed-forward neural networks.



The weights and biases of your neural network will be found during the training process, and the training algorithm will use the weights to tell a neuron what inputs to the neuron matter and which inputs do not. And it'll also apply a corrective bias if needed. 


The weights and biases found in the training process belong to the affine transformation within each neuron. On its own, this can only be used to learn linear functions, but we can generalize it using the activation function, which is a non-linear function.


The most common activation function is the <b>ReLU</b> activation, which is simply the max function of the input value and 0, so negative values are clamped to 0 in the ReLU activation function. 

Positive values go through as is. 


Now you'll see that in most of our neural networks, we tend to use the ReLU activation function because empirically, in the real world, it has proven to work very well. The term ReLU stands for rectified linear unit: $\mathrm{ReLU}(x) = \max(0, x)$.

### Terms to learn:

## Statistical noise 


Statistical noise is the random irregularity we find in any real-life data. It has no pattern: one minute your readings might be too small, and the next they might be too large. These errors are usually unavoidable and unpredictable.



##### Quantifying Statistical Noise
Statistical noise generally consists of errors and residuals:

- Errors might include measurement errors and sampling errors; the differences between the observed values we’ve actually measured and their ‘true values’. While most errors are unavoidable, systematic errors can usually be avoided. They creep into your data when you make the same mistake over and over again. For example, let’s say you wanted to know something about the general health of the population, but only surveyed patients in doctors’ waiting rooms. That systematic error (polling sick people over and over again) will create a statistic that’s completely off the mark.


- The residual of observed data is the difference between your observed value (again, that data point you’ve measured) and the predicted value; not the ‘true value’ per se but the point in space your theory tells you the data point should lie on. In regression analysis, it’s the distance between your observed data point and the regression line.




##### The Significance of Noise


Recognizing and quantifying the amount of statistical noise in a data set is an important step in analysis; a step which will allow us to see immediately whether or not data shifts are significant or simply part of the static.

**** 

#### What is bias?
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the problem. It typically leads to high error on both training and test data.


Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines.



Examples of high-bias machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression.

#### What is variance?
Variance is the variability of a model's prediction for a given data point. A model with high variance pays a lot of attention to the training data (an overfitted curve passes close to the training points but can lie far from a new point) and does not generalize to data it hasn’t seen before. As a result, such models perform very well on training data but have high error rates on test data.


Examples of low-variance machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression.



Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines.

## OVERFITTING AND UNDERFITTING

The Bias-Variance Tradeoff is relevant for supervised machine learning - specifically for predictive modeling. It's a way to diagnose the performance of an algorithm by breaking down its prediction error.

In machine learning, an algorithm is simply a repeatable process used to train a model from a given set of training data.

You have many algorithms to choose from, such as Linear Regression, Decision Trees, Neural Networks, SVMs, and so on.
As you might imagine, each of those algorithms behaves very differently, each shining in different situations. One of the key distinctions is how much bias and variance they produce.

There are 3 types of prediction error: bias, variance, and irreducible error.

Irreducible error is also known as "noise," and it can't be reduced by your choice in algorithm. It typically comes from inherent randomness, a mis-framed problem, or an incomplete feature set.

The other two types of errors, however, can be reduced because they stem from your algorithm choice.

#### Error from Bias
Bias is the difference between your model's expected predictions and the true values.

That might sound strange because shouldn't you "expect" your predictions to be close to the true values? Well, it's not always that easy because some algorithms are simply too rigid to learn complex signals from the dataset.

Imagine fitting a linear regression to a dataset that has a non-linear pattern:

No matter how many more observations you collect, a linear regression won't be able to model the curves in that data! This is known as under-fitting.

![image.png](attachment:image.png)


#### Error from Variance


Variance refers to your algorithm's sensitivity to specific sets of training data.

High variance algorithms will produce drastically different models depending on the training set.

For example, imagine an algorithm that fits a completely unconstrained, super-flexible model to the same dataset from above:

![image.png](attachment:image.png)

As you can see, this unconstrained model has basically memorized the training set, including all of the noise. This is known as over-fitting.

![image.png](attachment:image.png)

<b>Overfitting</b> : 

Intuitively, overfitting occurs when the model or the algorithm fits the training data too well, noise included. Specifically, overfitting occurs if the model or algorithm shows low bias but high variance.

<b>Underfitting</b> : This is when your model is unable to capture the relationships that exist in training. The model performs poorly on the training data itself. There is no point in using it on test data because it hasn't trained completely. 

Underfitting is often not discussed as it is easy to detect given a good performance metric. The remedy is to move on and try alternate machine learning algorithms. Nevertheless, it does provide a good contrast to the problem of overfitting.

#### Examples of Overfitting


Let’s say we want to predict if a student will land a job interview based on her resume.

Now, assume we train a model from a dataset of 10,000 resumes and their outcomes.

Next, we try the model out on the original dataset, and it predicts outcomes with 99% accuracy… wow!

But now comes the bad news.

When we run the model on a new (“unseen”) dataset of resumes, we only get 50% accuracy… uh-oh!

Our model doesn’t generalize well from our training data to unseen data.

This is known as overfitting, and it’s a common problem in machine learning and data science.
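The gap between training and test accuracy can be reproduced with a deliberately extreme model. The sketch below uses a one-nearest-neighbor rule that simply memorizes the training set; the synthetic one-dimensional data and the 20% label-noise rate are assumptions made for illustration, not the resume dataset from the example.

```python
import random

random.seed(0)


def make_data(n):
    """Label is 1 when x > 0.5, but about 20% of labels are flipped to simulate noise."""
    data = []
    for _ in range(n):
        x = random.random()
        label = int(x > 0.5)
        if random.random() < 0.2:
            label = 1 - label
        data.append((x, label))
    return data


train, test = make_data(200), make_data(200)


def predict_1nn(x):
    """Memorizing model: return the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def accuracy(data):
    return sum(predict_1nn(x) == label for x, label in data) / len(data)


train_accuracy = accuracy(train)  # every point is its own nearest neighbor
test_accuracy = accuracy(test)    # the memorized noise does not transfer

print(train_accuracy, test_accuracy)
```

Because the model has memorized the noisy labels, it looks perfect on the data it was trained on and noticeably worse on data it has not seen.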

![image.png](attachment:image.png)

![image.png](attachment:image.png)

### How to Prevent Overfitting
Detecting overfitting is useful, but it doesn’t solve the problem. Fortunately, you have several options to try.

Here are a few of the most popular solutions for overfitting:

#### Cross-validation
Cross-validation is a powerful preventative measure against overfitting.

The idea is clever: Use your initial training data to generate multiple mini train-test splits. Use these splits to tune your model.

In standard k-fold cross-validation, we partition the data into k subsets, called folds. Then, we iteratively train the algorithm on k-1 folds while using the remaining fold as the test set (called the “holdout fold”).

![image.png](attachment:image.png)

Cross-validation allows you to tune hyperparameters with only your original training set. This allows you to keep your test set as a truly unseen dataset for selecting your final model.
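The partitioning step of k-fold cross-validation can be sketched in plain Python. The helper name `k_fold_splits` is hypothetical, and this version does not shuffle the data first, which you would normally want; scikit-learn provides the same idea through `KFold` and `cross_val_score` in `sklearn.model_selection`.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first few folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]                # the holdout fold
        train_idx = indices[:start] + indices[start + size:]  # the other k-1 folds
        yield train_idx, test_idx
        start += size


splits = list(k_fold_splits(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each sample ends up in the holdout fold exactly once, so every data point is used for both training and validation across the k rounds.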

#### Train with more data
It won’t work every time, but training with more data can help algorithms detect the signal better. For example, if you were modeling height vs. age in children, sampling more schools would help your model.

Of course, that’s not always the case. If we just add more noisy data, this technique won’t help. That’s why you should always ensure your data is clean and relevant.

#### Remove features
Some algorithms have built-in feature selection.

For those that don’t, you can manually improve their generalizability by removing irrelevant input features.

An interesting way to do so is to tell a story about how each feature fits into the model. This is like the data scientist's spin on software engineer’s rubber duck debugging technique, where they debug their code by explaining it, line-by-line, to a rubber duck.

If anything doesn't make sense, or if it’s hard to justify certain features, this is a good way to identify them.
In addition, there are several feature selection heuristics you can use for a good starting point.

#### Early stopping


When you’re training a learning algorithm iteratively, you can measure how well each iteration of the model performs.

Up until a certain number of iterations, new iterations improve the model. After that point, however, the model’s ability to generalize can weaken as it begins to overfit the training data.

Early stopping refers to stopping the training process before the learner passes that point.

![image.png](attachment:image.png)

Today, this technique is mostly used in deep learning while other techniques (e.g. regularization) are preferred for classical machine learning.
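One common way to implement early stopping is to watch a validation loss and stop once it has failed to improve for a fixed number of consecutive iterations. The sketch below assumes that approach; the function name, the `patience` parameter, and the list of losses are all hypothetical. In scikit-learn, `MLPClassifier` exposes a similar idea through its `early_stopping` and `n_iter_no_change` parameters.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the best epoch, stopping once the loss has not
    improved for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training here
    return best_epoch


# Hypothetical validation losses: improving until epoch 3, then getting worse.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.40]
best = early_stopping_epoch(losses, patience=2)
print(best)  # 3
```

Note the trade-off: training stops after epoch 5, so the later improvement at epoch 6 is never seen. A larger `patience` tolerates longer plateaus at the cost of more training time.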

#### Regularization


Regularization refers to a broad range of techniques for artificially forcing your model to be simpler.

The method will depend on the type of learner you’re using. For example, you could prune a decision tree, use dropout on a neural network, or add a penalty parameter to the cost function in regression.

Oftentimes, the regularization method is a hyperparameter as well, which means it can be tuned through cross-validation.
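As one concrete instance of adding a penalty to the cost function, the sketch below adds an L2 penalty (the sum of squared weights, scaled by a strength `alpha`) to a mean squared error. The data, weights, and `alpha` values are made up for illustration, and L2 is only one of several possible penalties.

```python
def mse(weights, bias, samples):
    """Mean squared error of a linear model on (inputs, target) pairs."""
    return sum(
        (sum(w * x for w, x in zip(weights, xs)) + bias - y) ** 2
        for xs, y in samples
    ) / len(samples)


def regularized_cost(weights, bias, samples, alpha):
    """Mean squared error plus an L2 penalty on the weights (not the bias)."""
    penalty = alpha * sum(w ** 2 for w in weights)
    return mse(weights, bias, samples) + penalty


# Illustrative data and parameters.
samples = [([1.0, 2.0], 5.0), ([2.0, 0.0], 4.0)]
weights, bias = [2.0, 1.0], 1.0

plain = regularized_cost(weights, bias, samples, alpha=0.0)      # 0.5
penalized = regularized_cost(weights, bias, samples, alpha=0.1)  # 0.5 + 0.1 * 5

print(plain, penalized)
```

Because large weights now increase the cost, the optimizer is pushed toward smaller weights and therefore simpler models; `alpha` is the hyperparameter you would tune through cross-validation.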


#### Ensembling


Ensembles are machine learning methods for combining predictions from multiple separate models. There are a few different methods for ensembling, but the two most common are:

Bagging attempts to reduce the chance of overfitting complex models.

- It trains a large number of "strong" learners in parallel.
- A strong learner is a model that's relatively unconstrained.
- Bagging then combines all the strong learners together in order to "smooth out" their predictions.


Boosting attempts to improve the predictive flexibility of simple models.

- It trains a large number of "weak" learners in sequence.
- A weak learner is a constrained model (e.g. you could limit the max depth of each decision tree).
- Each one in the sequence focuses on learning from the mistakes of the one before it.
- Boosting then combines all the weak learners into a single strong learner.

While bagging and boosting are both ensemble methods, they approach the problem from opposite directions.

Bagging uses complex base models and tries to "smooth out" their predictions, while boosting uses simple base models and tries to "boost" their aggregate complexity.

*** 

#### Underfitting Remedies

Using more (good!) data is sometimes suggested as a remedy, although extra data alone rarely fixes a model that is too simple for the problem. The following approaches can also be used to tackle underfitting.


- Increase the size or number of parameters in the ML model.


- Increase the complexity or type of the model.


- Increase the training time until the cost function is minimized.