Beata Sirowy

# Machine Learning: Basics
Based on: 

Géron, A. (2023) _Hands-On Machine Learning
with Scikit-Learn, Keras, and TensorFlow_ 

Liu, Y. (2020) _Python Machine Learning By Example_

![image.png](attachment:image.png)

![image.png](attachment:image.png)

__Machine learning__ is the process of teaching computers to learn and make decisions from data without being explicitly programmed. 

It can also be defined as the science (and art) of programming computers so
they can learn from data.

It is a subset of artificial intelligence where computers are trained to learn from data and make decisions based on patterns they identify, without needing explicit programming for each specific task. 
- It involves using algorithms to parse data, learn from it, and make predictions or decisions. 
- This process improves over time as the system is exposed to more data, enhancing its accuracy and performance in various applications.

__Use cases__:
- Problems for which existing solutions require a lot of fine-tuning or
long lists of rules (a machine learning model can often simplify code
and perform better than the traditional approach)
- Complex problems for which using a traditional approach yields no
good solution (the best machine learning techniques can perhaps find a
solution)
- Fluctuating environments (a machine learning system can easily be
retrained on new data, always keeping it up to date)
- Getting insights about complex problems and large amounts of data

__Concrete examples of machine learning tasks, along with
the techniques that can tackle them:__

- __Analyzing images of products__ on a production line to automatically classify
them
This is image classification, typically performed using convolutional
neural networks or sometimes transformers

- __Detecting tumors in brain scans__
This is semantic image segmentation, where each pixel in the image is
classified (as we want to determine the exact location and shape of
tumors), typically using CNNs or transformers.

- __Automatically classifying news articles__
This is natural language processing (NLP), and more specifically text
classification, which can be tackled using recurrent neural networks
(RNNs) and CNNs, but transformers work even better.

- __Automatically flagging offensive comments__ on discussion forums
This is also text classification, using the same NLP tools.

- __Summarizing long documents automatically__
This is a branch of NLP called text summarization, again using the same
tools.

- __Creating a chatbot or a personal assistant__
This involves many NLP components, including natural language
understanding (NLU) and question-answering modules.

- __Forecasting your company’s revenue next year__, based on many performance
metrics. This is a regression task (i.e., predicting values) that may be tackled
using any regression model, such as a linear regression or polynomial
regression model, a regression support vector machine, a regression random forest, or an
artificial neural network. If you want to take into
account sequences of past performance metrics, you may want to use
RNNs, CNNs, or transformers.

- __Making your app react to voice commands__
This is speech recognition, which requires processing audio samples:
since they are long and complex sequences, they are typically processed
using RNNs, CNNs, or transformers.

- __Detecting credit card fraud__
This is anomaly detection, which can be tackled using isolation forests,
Gaussian mixture models, or autoencoders.

- __Segmenting clients__ based on their purchases so that you can design a
different marketing strategy for each segment
This is clustering, which can be achieved using e.g. k-means or DBSCAN.

- __Representing a complex, high-dimensional dataset__ in a clear and insightful
diagram. This is data visualization, often involving dimensionality reduction
techniques.

- __Recommending a product__ that a client may be interested in, based on past
purchases. This is a recommender system. One approach is to feed past purchases
(and other information about the client) to an artificial neural network, and get it to output the most likely next purchase. This neural net would typically be trained on past sequences of purchases
across all clients.

- __Building an intelligent bot for a game__
This is often tackled using reinforcement learning (RL),
which is a branch of machine learning that trains agents (such as bots)
to pick the actions that will maximize their rewards over time (e.g., a
bot may get a reward every time the player loses some life points),
within a given environment (such as the game). The famous AlphaGo
program that beat the world champion at the game of Go was built using
RL.

### The main production-ready Python ML frameworks:

#### Scikit-Learn 
It is very easy to use, yet it implements many machine
learning algorithms efficiently, so it makes for a great entry point to
learning machine learning. 
- It was created by David Cournapeau in
2007, and is now led by a team of researchers at the French Institute
for Research in Computer Science and Automation (Inria).

#### TensorFlow 
It is a more complex library for distributed numerical
computation. 
- It makes it possible to train and run very large neural
networks efficiently by distributing the computations across potentially hundreds of multi-GPU (graphics processing unit) servers. 
- TensorFlow
(TF) was created at Google and supports many of its large-scale
machine learning applications. 
- It was open sourced in November 2015.
- Ideal for large-scale projects and production environments that require high-performance and scalable models.

#### Keras

It is a high-level deep learning API that makes it very simple to
train and run neural networks. 
- Keras comes bundled with TensorFlow,
and it relies on TensorFlow for all the intensive computations.


#### PyTorch 
It is used for applications such as computer vision and natural language processing.
- Developed by Meta AI (formerly Facebook AI Research Lab).
- Its initial release in 2016 quickly garnered attention due to its flexibility, ease of use, and dynamic computation graph.
- Ideal for research and small-scale projects prioritizing flexibility, experimentation and quick editing capabilities for models

## Types of ML systems

A machine learning system is fed with input data—this can be numerical,
textual, visual, or audiovisual. 

The system usually has an output—this can
be a floating-point number, for instance, the acceleration of a self-driving
car, or it can be an integer representing a category (also called a class), for
example, a cat or tiger from image recognition.

The main task of machine learning is to explore and construct algorithms
that can learn from historical data and make predictions on new input data.

For a data-driven solution, we need to define (or have it defined by an
algorithm) an evaluation function called loss or cost function, which
measures how well the models are learning. In this setup, we create an
optimization problem with the goal of learning in the most efficient and
effective way.
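
As a minimal sketch of what such a loss function can look like, the snippet below computes the mean squared error, one common choice for regression. The target values and the two sets of predictions are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and the predictions of two candidate models.
targets = [3.0, 5.0, 7.0]
model_a = [2.5, 5.0, 7.5]  # close to the targets
model_b = [0.0, 0.0, 0.0]  # ignores the inputs entirely

loss_a = mean_squared_error(targets, model_a)
loss_b = mean_squared_error(targets, model_b)
print(f"model A loss: {loss_a:.3f}, model B loss: {loss_b:.3f}")
```

A lower loss means the predictions are closer to the targets, so an optimizer would prefer model A here.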


There are so many different types of machine learning systems that it is
useful to classify them in broad categories, based on the following criteria:
- How they are supervised during training (__supervised, unsupervised,
semi-supervised, self-supervised, and others__)
- Whether or not they can learn incrementally on the fly (__online versus
batch learning__)
- Whether they work by simply comparing new data points to known
data points, or instead by detecting patterns in the training data and
building a predictive model, much like scientists do (__instance-based
versus model-based learning__)

These criteria are not exclusive; you can combine them in any way you like. E.g. a spam filter may be an online, model-based, supervised learning system.


### Training Supervision

![image.png](attachment:image.png)

#### __Supervised learning:__ 

![image.png](attachment:image.png)

When learning data comes with a description,
targets, or desired output besides indicative signals, the learning goal is
to find a general rule that maps input to output. 
- This kind of learning
data is called __labeled data__. 
- The learned rule is then used to label new
data with unknown output. 
- The labels are usually provided by event logging systems or evaluated by human experts. 
- If it's feasible, they may also be produced by human raters, through
crowdsourcing, for instance. 
- Supervised learning is commonly used in daily
applications, such as __face and speech recognition, products or movie
recommendations, sales forecasting, and spam email detection__.

We can further subdivide supervised learning into:

- __Regression__ trains on and predicts continuous-valued responses, that is, a target numeric value such as the price
of a house or a car, given a set of features (mileage, age, brand, etc.). To train the system, you need to give it many examples of cars, including both their features and their targets (i.e.,
their prices).

- __Classification__ attempts to find the appropriate class label, for example
labeling a sentiment as positive or negative, or predicting whether a loan will default.

- Note that some regression models can be used for classification as well, and
vice versa. For example, __logistic regression is commonly used for
classification__, as it can output a value that corresponds to the probability of
belonging to a given class (e.g., 20% chance of being spam).
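
The regression case can be sketched with a one-feature linear model fitted by ordinary least squares. The mileage and price figures below are invented for illustration, and `fit_line` is a hypothetical helper, not a library function.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up labeled examples: mileage (thousands of km) -> price (thousands).
mileage = [10, 50, 90, 130]  # feature
price = [19, 15, 11, 7]      # target

slope, intercept = fit_line(mileage, price)

# Predict the price of a car the model has not seen before.
predicted = slope * 60 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={predicted:.2f}")
```

The training data contains both the feature and the target; the learned rule is then applied to a new car whose price is unknown.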

![image-2.png](attachment:image-2.png)

- _The words target and label are generally treated as synonyms in supervised learning, but
target is more common in regression tasks and label is more common in classification
tasks._ 
- _Features are sometimes called predictors or attributes. These terms
may refer to individual samples (e.g., “this car’s mileage feature is equal to 15,000”) or
to all samples (e.g., “the mileage feature is strongly correlated with price”)._

#### __Unsupervised learning__: 
When the learning data only contains
indicative signals without any description attached, it's up to us to find
the structure of the data underneath, to discover hidden information, or
to determine how to describe the data. This kind of learning data is
called __unlabeled data.__ 

Unsupervised learning can be helpful in 
- __Detecting
anomalies__, such as fraud or defective equipment, or automatically removing outliers from a
dataset before feeding it to another learning algorithm. The system is shown
mostly normal instances during training, so it learns to recognize them;
then, when it sees a new instance, it can tell whether it looks like a normal
one or whether it is likely an anomaly.
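
A very simple way to illustrate this idea is a threshold on the distance from the mean of the normal instances. This is only a sketch under the assumption that normal values are roughly bell-shaped; the transaction amounts and the threshold of three standard deviations are illustrative choices, and real systems typically use more robust methods.

```python
import statistics


def fit_normal_profile(values):
    """Learn what 'normal' looks like: the mean and standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def is_anomaly(value, mean, std, threshold=3.0):
    """Flag values further than `threshold` standard deviations from the mean."""
    return abs(value - mean) > threshold * std


# Hypothetical transaction amounts seen during training (mostly normal).
normal_amounts = [20, 22, 19, 21, 20, 18, 23, 21, 19, 22]
mean, std = fit_normal_profile(normal_amounts)

print(is_anomaly(21, mean, std))   # a typical amount
print(is_anomaly(500, mean, std))  # far outside the usual range
```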

![image-2.png](attachment:image-2.png)

- __Novelty detection__ is a very similar task to anomaly detection: it aims to detect new instances that look different from all instances in the training set. This requires having a very “clean” training set, devoid of any instance that you would like the algorithm to detect. For
example, if you have thousands of pictures of dogs, and 1% of these
pictures represent Chihuahuas, then a novelty detection algorithm should
not treat new pictures of Chihuahuas as novelties. (On the other hand,
anomaly detection algorithms may consider these dogs as so rare and so
different from other dogs that they would likely classify them as anomalies).
- __Grouping customers__ with similar online behaviors for a marketing campaign.
- __Data visualization__ that makes data more digestible. These
algorithms try to preserve as much structure as they can (e.g., trying to keep
separate clusters in the input space from overlapping in the visualization) so
that you can understand how the data is organized and perhaps identify
unsuspected patterns.
- __Dimensionality reduction__  distills relevant information from noisy data, without losing too much information. One way to do this is to
merge several correlated features into one. For example, a car’s mileage
may be strongly correlated with its age, so the dimensionality reduction
algorithm will merge them into one feature that represents the car’s wear
and tear. This is called __feature extraction__.
- __Association rule learning__: here
the goal is to dig into large amounts of data and discover interesting
relations between attributes. For example, suppose you own a supermarket.
Running an association rule learning algorithm on your sales logs may reveal that people who
purchase barbecue sauce and potato chips also tend to buy steak. Thus, you
may want to place these items close to one another.
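
The supermarket example can be made concrete with the two standard measures of an association rule, support and confidence. The baskets below are made up, and the helper names are hypothetical; this is a sketch of the measures, not a full rule-mining algorithm.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in transactions if items <= basket)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical sales log: one set of items per basket.
baskets = [
    {"barbecue sauce", "potato chips", "steak"},
    {"barbecue sauce", "potato chips", "steak", "soda"},
    {"barbecue sauce", "potato chips"},
    {"bread", "milk"},
    {"steak", "bread"},
]

rule_support = support(baskets, {"barbecue sauce", "potato chips", "steak"})
rule_confidence = confidence(baskets, {"barbecue sauce", "potato chips"}, {"steak"})
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here two of the three baskets that contain both barbecue sauce and potato chips also contain steak, which is the kind of relation an association rule learner would surface.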

__Example - a clustering algorithm__:

Say you have a lot of data about your blog’s visitors. You may
want to run a clustering algorithm to try to detect groups of similar visitors
(Figure 1-8). At no point do you tell the algorithm which group a visitor
belongs to: it finds those connections without your help. For example, it
might notice that 40% of your visitors are teenagers who love comic books
and generally read your blog after school, while 20% are adults who enjoy
sci-fi and who visit during the weekends. If you use a hierarchical
clustering algorithm, it may also subdivide each group into smaller groups.
This may help you target your posts for each group.
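
To make the clustering idea concrete, here is a minimal sketch of k-means on a single feature. The visitor ages and the starting centroids are invented for illustration, and k-means is just one possible algorithm (the hierarchical clustering mentioned above works differently).

```python
def assign(values, centroids):
    """Put each value into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for v in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        clusters[nearest].append(v)
    return clusters


def k_means_1d(values, centroids, iterations=10):
    """Alternate between assigning points and moving centroids to cluster means."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = assign(values, centroids)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, assign(values, centroids)


# Hypothetical visitor ages; no group labels are given to the algorithm.
visitor_ages = [13, 14, 15, 16, 35, 38, 40, 42]
centroids, groups = k_means_1d(visitor_ages, centroids=[10, 50])
print(centroids)
print(groups)
```

The algorithm is never told which visitor belongs to which group; the two groups emerge from the distances alone.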

![image.png](attachment:image.png)

- _It is often a good idea to try to reduce the number of dimensions in your training data
using a dimensionality reduction algorithm before you feed it to another machine
learning algorithm (such as a supervised learning algorithm)._ 
- _It will run much faster, the
data will take up less disk and memory space, and in some cases it may also perform
better._

#### __Semi-supervised learning__

If some, but not all, learning samples are labeled, we have semi-supervised
learning. This makes use of unlabeled data (typically a large amount) for training, besides a small amount of labeled data. 
- Semi-supervised learning is applied in cases where it is expensive to acquire a fully labeled dataset and more practical to label a small subset.

![image.png](attachment:image.png)

Most semi-supervised learning algorithms are combinations of
unsupervised and supervised algorithms. 
- For example, a clustering
algorithm may be used to group similar instances together, and then every
unlabeled instance can be labeled with the most common label in its cluster.
- Once the whole dataset is labeled, it is possible to use any supervised
learning algorithm.
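
The label-propagation step described above can be sketched as follows. The cluster assignments are assumed to come from some earlier clustering step, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def propagate_labels(cluster_ids, labels):
    """Fill in missing labels (None) with the most common label in the same cluster."""
    votes = {}
    for cid, label in zip(cluster_ids, labels):
        if label is not None:
            votes.setdefault(cid, Counter())[label] += 1

    filled = []
    for cid, label in zip(cluster_ids, labels):
        if label is None and cid in votes:
            label = votes[cid].most_common(1)[0][0]
        filled.append(label)  # stays None if the cluster has no labeled member
    return filled


# Assumed output of a clustering step, plus a few human-provided labels.
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
labels = ["cat", None, None, "dog", "dog", "cat", None]

filled = propagate_labels(cluster_ids, labels)
print(filled)
```

Existing labels are left untouched; only the unlabeled instances inherit the majority label of their cluster, after which any supervised algorithm can be trained on the result.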

__Example: photo hosting sites:__
- Once you upload all your family photos to the service, it automatically
recognizes that the same person A shows up in photos 1, 5, and 11, while
another person B shows up in photos 2, 5, and 7. This is the unsupervised
part of the algorithm (clustering). 
- Now all the system needs is for you to tell
it who these people are. Just add one label per person and it is able to name
everyone in every photo, which is useful for searching photos.

#### __Self-supervised learning__

- Self-supervised learning involves generating a fully
labeled dataset from a fully unlabeled one.
- Once the whole dataset is
labeled, any supervised learning algorithm can be used.
- For example, if you have a large dataset of unlabeled images, you can
randomly mask a small part of each image and then train a model to recover
the original image.
- During training, the masked images are used as the inputs to the model, and the original images are used as the
labels.
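
The way labels are generated from unlabeled data can be sketched like this. For brevity an "image" is a flat list of pixel values and a hidden pixel is represented by 0; both are simplifying assumptions, and `make_masked_pair` is a hypothetical helper.

```python
import random


def make_masked_pair(image, mask_size, rng):
    """Return (masked_input, label); the label is the untouched original."""
    start = rng.randrange(len(image) - mask_size + 1)
    masked = list(image)
    for i in range(start, start + mask_size):
        masked[i] = 0  # 0 stands in for a hidden pixel
    return masked, list(image)


rng = random.Random(0)          # fixed seed so the example is repeatable
image = [5, 3, 8, 6, 7, 2, 9, 4]  # toy unlabeled "image" with non-zero pixels

masked, label = make_masked_pair(image, mask_size=3, rng=rng)
print("input:", masked)
print("label:", label)
```

No human annotation is involved: every unlabeled image yields its own (input, label) training pair.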

![image.png](attachment:image.png)

The resulting model may be quite useful in itself—for example, to repair
damaged images or to erase unwanted objects from pictures. But more often
than not, a model trained using self-supervised learning is not the final goal.
- You’ll usually want to tweak and fine-tune the model for a slightly different
task.
- For example, suppose that what you really want is a pet
classification model: given a picture of any pet, it will tell you what species
it belongs to.
- You can start by training an image-repairing model using self-supervised
learning. 
- Once it’s performing well, it should be able to distinguish different
pet species: when it repairs an image of a cat whose face is masked, it must
know not to add a dog’s face. 
- It is then possible to tweak the
model so that it predicts pet species instead of repairing images.
- The final
step consists of fine-tuning the model on a labeled dataset: the model
already knows what cats, dogs, and other pet species look like, so in this step it only needs to learn the mapping between the species it
already knows and the labels we expect from it.

Transferring knowledge from one task to another is called transfer learning.
- It’s one
of the most important techniques in machine learning today, especially when using deep
neural networks (i.e., neural networks composed of many layers of neurons).

Some people consider self-supervised learning to be a part of unsupervised
learning, since it deals with fully unlabeled datasets, but in some respects it is closer to supervised learning:
- self-supervised
learning uses (generated) labels during training 
- the term “unsupervised learning” is generally
used when dealing with tasks like clustering, dimensionality reduction, or
anomaly detection, whereas self-supervised learning focuses on the same
tasks as supervised learning: mainly classification and regression.

#### __Reinforcement learning:__

Learning data provides feedback so that the
system adapts to dynamic conditions in order to achieve a certain goal
in the end. 

The learning system,
called __an agent__, can observe the environment, select and
__perform actions__, and __get rewards__ in return (__or penalties__ in the form of
negative rewards). It must then learn by itself what
is the best strategy, called __a policy__, to get the most reward over time.

- The system evaluates its performance based on the
feedback responses and reacts accordingly. 
- Examples
include __robotics for industrial automation, self-driving cars, and the
Go-playing program AlphaGo__. 
- The key difference between reinforcement
learning and supervised learning is the __interaction with the
environment.__
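
The agent, action, reward and policy loop can be sketched with a deliberately tiny setting: an environment with three possible actions and a fixed reward for each. This is a strong simplification (there is no state, and rewards are deterministic), and the epsilon-greedy strategy used here is just one simple way to balance exploration and exploitation.

```python
import random


def train_agent(rewards, steps=1000, epsilon=0.2, seed=0):
    """Learn an estimate of each action's value by trial and error."""
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)  # the agent's current value estimates
    counts = [0] * len(rewards)
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(rewards))  # explore a random action
        else:
            action = max(range(len(rewards)), key=lambda a: estimates[a])  # exploit
        reward = rewards[action]  # feedback from the environment
        counts[action] += 1
        # Running average of the rewards observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates


def greedy_policy(estimates):
    """The learned policy: always pick the action with the highest estimate."""
    return max(range(len(estimates)), key=lambda a: estimates[a])


# Hypothetical environment: a penalty, a neutral outcome, and a reward.
estimates = train_agent(rewards=[-1.0, 0.0, 1.0])
best_action = greedy_policy(estimates)
print(estimates, best_action)
```

The agent is never told which action is best; it discovers this only through the rewards it receives after acting.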

![image.png](attachment:image.png)

__Example: DeepMind’s AlphaGo program__
- In 2017 it beat
Ke Jie, the number one ranked player in the world at the time, at the game
of Go. 
- It learned its winning policy by analyzing millions of games, and
then playing many games against itself. 
- Note that learning was turned off
during the games against the champion; AlphaGo was just applying the
policy it had learned. 
- This is called
offline learning.

### Batch Versus Online Learning

#### Batch learning
In batch learning, the system is incapable of learning incrementally: it must
be trained using all the available data. 
- This will generally take a lot of time
and computing resources, so it is typically done offline. 
- First the system is
trained, and then it is launched into production and runs without learning
anymore; it just applies what it has learned. 
- This is called offline learning.

Unfortunately, a model’s performance tends to decay slowly over time,
simply because the world continues to evolve while the model remains
unchanged. This phenomenon is often called __model rot or data drift__. The
solution is to regularly retrain the model on up-to-date data.

#### __ML algorithms__

Machine learning algorithms have evolved over time. We can roughly categorize them
into four main approaches: 
- __logic-based learning__: uses basic
rules specified by human experts; with these rules, systems try to
reason using formal logic, background knowledge, and hypotheses
- __statistical learning__: attempts to find a function to formalize the
relationships between variables
- __artificial neural networks (ANNs)__: imitate animal brains and
consist of interconnected neurons that are also an imitation of biological
neurons. They try to model complex relationships between input and output
values and to capture patterns in data. 
- __genetic algorithms (GA)__: were
popular in the 1990s. They mimic the biological process of evolution and
try to find the optimal solutions using methods such as mutation and
crossover.

We are currently seeing a revolution in __deep learning__, which we might
consider __a rebranding of neural networks__. The term deep learning was
coined around 2006 and refers to deep neural networks with many layers.
The breakthrough in deep learning was the result of the integration and
utilization of graphics processing units (GPUs), which massively speed
up computation.
- Some believe that deep learning
resembles the way humans learn and that it may therefore be able to deliver on the
promise of sentient machines.

#### __Moore's law:__
It is an empirical observation
claiming that __computer hardware improves exponentially with time__. The
law was first formulated by Gordon Moore, the co-founder of Intel, in 1965. 
- The consensus seems to be that Moore's law should continue to be valid for
a couple of decades. This gives some credibility to Ray Kurzweil's
predictions of achieving true machine intelligence by 2029.

![image.png](attachment:image.png)

#### __Overfitting, underfitting, and the bias-variance trade-off__
__Overfitting:__  
- A model fits the existing
observations too well but fails to predict future new observations. 
- This can occur
when we extract too much information from the training sets and
make our model work well only with them, which is called __low bias__ in
machine learning.
- __Bias__ is the difference between the average prediction and the true value.
- As a result, the model will perform poorly on datasets that weren't seen before. We call
this situation __high variance__ in machine learning. 
- __Variance__ measures the spread of the predictions, that is, how much they
vary.


![image.png](attachment:image.png)

- Overfitting occurs when we try to describe the learning rules based on too
many parameters relative to the small number of observations, instead of
the underlying relationship.
- Overfitting also takes place when we make the model excessively
complex so that it fits every training sample, much like memorizing the
answers to all exam questions.
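
The "memorizing the answers" behaviour can be illustrated with a toy comparison. The data is synthetic (a trend of y = 2x with alternating noise of plus or minus 1), and the memorizing model is a hypothetical nearest-neighbour lookup chosen only because it fits every training sample exactly.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_line(xs, ys):
    """Least-squares line: a simple model with only two parameters."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def memorizer_predict(train_x, train_y, x):
    """Overly flexible model: return the target of the nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


# Synthetic training data: trend y = 2x plus alternating noise.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [2 * x + e for x, e in zip(train_x, [1, -1, 1, -1, 1, -1])]
# New observations that follow the underlying trend.
test_x = [0.4, 1.4, 2.4, 3.4, 4.4]
test_y = [2 * x for x in test_x]

slope, intercept = fit_line(train_x, train_y)
line_train = mse(train_y, [slope * x + intercept for x in train_x])
line_test = mse(test_y, [slope * x + intercept for x in test_x])
memo_train = mse(train_y, [memorizer_predict(train_x, train_y, x) for x in train_x])
memo_test = mse(test_y, [memorizer_predict(train_x, train_y, x) for x in test_x])

print(f"line:      train={line_train:.3f} test={line_test:.3f}")
print(f"memorizer: train={memo_train:.3f} test={memo_test:.3f}")
```

The memorizer reaches zero error on the training set because it has also memorized the noise, yet it does worse than the straight line on the new observations.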

__Underfitting:__

When a model is underfit, it doesn't
perform well on the training sets and won't do so on the testing sets, which
means it fails to capture the underlying trend of the data. 
- Underfitting may
occur if we aren't using enough data to train the model.
- It may also happen if we're
trying to fit the wrong kind of model to the data.
- We call any of these situations __high bias__ in machine learning.
- The __variance is low__, however, as the performance on the training and test sets is
pretty consistent, in a bad way.

![image.png](attachment:image.png)

 __The bias-variance trade-off__

- bias is the error stemming from incorrect assumptions in the learning
algorithm; high bias results in underfitting 
- variance measures how sensitive the model prediction is to variations in the datasets 
- hence, we need to avoid cases where either bias or variance is getting high 
- in practice, there is an explicit trade-off
between them, where decreasing one increases the other. 
This is the so-called bias-variance trade-off

The more complex the learning model ŷ(x) is, and the larger the training set, the lower the bias becomes. 
However, the model also shifts more in order to fit the additional data points.
As a result, the variance increases.

We usually employ the cross-validation technique as well as regularization
and feature reduction to find the optimal model balancing bias and variance
and to diminish overfitting.

### __Avoiding overfitting with cross-validation__

In machine learning, the
validation procedure helps to evaluate how the models will generalize to
independent or unseen datasets in a simulated setting. 

In a conventional validation setting: 
- the original data is partitioned into three subsets, usually
__60% for the training set__, __20% for the validation set__, and the rest __(20%) for
the testing set__. 
- This setting suffices if we have enough training samples
after partitioning and we only need a __rough estimate__ of simulated
performance. 
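
A minimal sketch of the 60/20/20 partition described above. The function name, the default ratios and the fixed seed are illustrative assumptions.

```python
import random


def train_val_test_split(samples, val_ratio=0.2, test_ratio=0.2, seed=42):
    """Shuffle the samples and cut them into training, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_ratio)
    n_val = round(len(shuffled) * val_ratio)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Shuffling before cutting matters when the original data is ordered, for instance by date or by class.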

Otherwise, __cross-validation__ is preferable: 

- In one round of cross-validation, __the original data is divided into two
subsets__, for training and testing (or validation), respectively. 
- The testing
performance is recorded. 
- Similarly, __multiple rounds of cross-validation are
performed__ under different partitions. 
- Testing results from all rounds are
finally averaged to generate a more reliable estimate of model prediction
performance. 
- Cross-validation helps to reduce variability and, therefore,
limit overfitting.

_When the training size is very large, it's often sufficient to split it into
training, validation, and testing (three subsets) and conduct a performance
check on the latter two. Cross-validation is less preferable in this case since
it's computationally costly to train a model for each single round. But if you
can afford it, there's no reason not to use cross-validation. When the size
isn't so large, cross-validation is definitely a good choice._

There are mainly two cross-validation schemes in use: 

In __the exhaustive scheme__, we leave out a fixed number of
observations in each round as testing (or validation) samples and use the
remaining observations as training samples. 
- This process is repeated until
all possible different subsets of samples are used for testing once. 
- For
instance, we can apply __Leave-One-Out Cross-Validation (LOOCV)__,
which lets each sample be in the testing set once. For a dataset of size n,
LOOCV requires n rounds of cross-validation. This can be slow
when n gets large. The following diagram presents the workflow of
LOOCV:

![image.png](attachment:image.png)

The __non-exhaustive scheme__, as the name implies, doesn't
try out all possible partitions. The most widely used type of this scheme
is __k-fold cross-validation__. 
- We first randomly split the original data into k
equal-sized folds. 
- In each trial, one of these folds becomes the testing set,
and the rest of the data becomes the training set.
- We repeat this process k times, with each fold being the designated testing
set once. 
- Finally, we average the k sets of test results for the purpose of
evaluation. Common values for k are 3, 5, and 10. The following table
illustrates the setup for five-fold cross-validation:

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

- K-fold cross-validation often has a lower variance compared to LOOCV,
since we're using a chunk of samples instead of a single one for validation.
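
The k-fold procedure can be sketched in a few lines. The baseline model used here (always predicting the mean of the training targets) and the target values are placeholders chosen to keep the example short; setting k equal to the number of samples gives LOOCV.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validate_mean_predictor(ys, k):
    """Average test MSE of a baseline that always predicts the training mean."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in train_idx) / len(train_idx)
        fold_mse = sum((ys[i] - prediction) ** 2 for i in test_idx) / len(test_idx)
        scores.append(fold_mse)
    return sum(scores) / len(scores)


ys = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]  # made-up targets
folds = list(k_fold_indices(len(ys), k=5))
avg_mse = cross_validate_mean_predictor(ys, k=5)
print(len(folds), avg_mse)
```

Every sample lands in a test fold exactly once, and the model is refitted from scratch for each fold.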

__The holdout method__

We can also randomly split the data into training and testing sets numerous
times. This is formally called the holdout method. The problem with this
approach is that some samples may never end up in the testing set, while
others may be selected for the testing set multiple times.

__In summary__, cross-validation derives a more accurate assessment of model
performance by combining measures of prediction performance on different
subsets of data. This technique not only reduces variance and helps avoid
overfitting, but also gives an insight into how the model will generally
perform in practice.

### __Avoiding overfitting with regularization__

Recall that the
unnecessary complexity of the model is a source of overfitting. 
- __Occam's razor:__ the simplest hypothesis that fits data
should be preferred. 
- One justification is that there are fewer simple
models than complex models, so a simple model that fits the data is less likely to do so by coincidence.
- Regularization adds extra penalty terms to the error function we're trying to
minimize, in order to penalize complex models.

It is much easier to find a model that perfectly captures all training data points
with a high-order polynomial function, as its search space is much larger
than that of a linear function. However, these easily obtained models
generalize worse than linear models and are more prone to overfitting.

![image.png](attachment:image.png)

- The linear model is preferable as it may generalize better to more data
points drawn from the underlying distribution. 
- We can use regularization to
reduce the influence of the high orders of polynomial by imposing penalties
on them. This will discourage complexity, even though a less accurate and
less strict rule is learned from the training data.
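
The effect of a penalty can be sketched with the simplest possible case: a single weight w with an L2 (ridge-style) penalty. The error function is assumed to be the sum of squared errors plus `lam * w**2`; the data and the values of `lam` are made up.

```python
def penalized_loss(xs, ys, w, lam):
    """Sum of squared errors plus an L2 penalty on the weight."""
    fit_error = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return fit_error + lam * w ** 2


def ridge_weight(xs, ys, lam):
    """Closed-form minimizer of penalized_loss for the model y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by y = 2x, so the unpenalized weight is 2

weights = {lam: ridge_weight(xs, ys, lam) for lam in (0.0, 1.0, 100.0)}
for lam, w in weights.items():
    print(f"lam={lam:>5}: w={w:.3f}")
```

As the penalty strength grows, the weight is pulled toward zero: a small penalty barely changes the fit, while a very large one moves the model far from the relationship in the data.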

Besides __penalizing complexity__, we can also __stop a training procedure early__
as a form of regularization. 
- If we limit the time a model spends learning or
we set some internal stopping criteria, it's more likely to produce a simpler
model.
- The model complexity will be controlled in this way and, hence,
overfitting becomes less probable. 
- This approach is called __early
stopping__ in machine learning.
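
One common stopping criterion, assumed here for illustration, is to watch the loss on a validation set and stop once it has not improved for a given number of epochs (the "patience"). The loss values below are made up.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch  # stop training here
    return best_epoch, len(val_losses) - 1  # never triggered: ran to the end


# Hypothetical validation loss per epoch: it improves, then starts to rise.
val_losses = [1.00, 0.80, 0.60, 0.65, 0.70, 0.72, 0.50]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(best_epoch, stop_epoch)
```

Training stops at epoch 4 and the model from epoch 2 is kept. The later improvement at epoch 6 is never seen, which shows that the patience value is itself something to tune.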

__In summary__, it's worth noting that regularization should be kept at a
moderate level or, to be more precise, fine-tuned to an optimal level. Too
little regularization has no noticeable impact; too much regularization
will result in underfitting, as it moves the model away from the ground
truth.

### __Avoiding overfitting with feature selection and dimensionality reduction__

We typically represent data as a grid of numbers (a matrix). Each column
represents a variable, which we call a feature in machine learning (e.g. size, colour, shape, etc.). In
supervised learning, one of the variables is actually not a feature, but the
label that we're trying to predict (e.g. value). Each row is an
example that we can use for training or testing.

__Feature selection is the process of picking a subset of significant features for use in
better model construction.__  

In principle, feature selection boils down to multiple binary decisions about
whether to include each feature.
- For n features, we get $2^n$ possible feature sets, which
can be a very large number when there are many features. 
- For example, for
10 features, we have 1,024 ($2^{10}$) possible feature sets. (If we're
deciding what clothes to wear, for instance, the features can be temperature, rain, the
weather forecast, and where we're going.) 
- Some features are either
redundant or irrelevant, and hence can be discarded with little loss.

The number of features corresponds to the dimensionality of the data. Our
machine learning approach depends on the number of dimensions versus
the number of examples.

- For instance, text and image data are very high
dimensional, while stock market data has relatively fewer dimensions.
- Fitting high-dimensional data is computationally expensive and is prone to
overfitting due to the high complexity. 
- Higher dimensions are also
impossible to visualize, and therefore we can't use simple diagnostic
methods.
- It's therefore often important to do good feature selection. 


Basically, we have two options:
- we either start with all of the features and remove features iteratively, 
- or we
start with a minimum set of features and add features iteratively. 
- We then
take the best feature sets for each iteration and compare them.
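
The second option (start small and add features one at a time) can be sketched as a greedy loop. The scoring function is a stand-in: in practice it would train and validate a model on the given features, whereas here it is a made-up table of per-feature usefulness with a small cost per feature.

```python
def forward_selection(features, score):
    """Greedily add the feature that improves the score most, until nothing helps."""
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining:
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Hypothetical usefulness of each feature; "colour" adds nothing.
usefulness = {"mileage": 0.5, "age": 0.3, "colour": 0.0}


def toy_score(subset):
    """Placeholder for a validation score: total usefulness minus a cost per feature."""
    return sum(usefulness[f] for f in subset) - 0.05 * len(subset)


selected, best_score = forward_selection(usefulness, toy_score)
print(selected, round(best_score, 2))
```

This evaluates far fewer subsets than brute force, at the price of possibly missing the best combination, since each step only looks one feature ahead.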

At a certain point, brute-force evaluation becomes infeasible. Hence, more advanced
feature selection algorithms were invented to distill the most useful features/signals.

Another common approach to reducing dimensionality is to transform high-dimensional
data into a lower-dimensional space. This is known
as __dimensionality reduction__.

So far we've talked about how the goal of machine learning is to
find __the optimal generalization / fitting__  to the data, and how to avoid ill-
generalization. 
In the next two sections, we will explore how to get closer
to the goal throughout individual phases of machine learning, including: 
- data preprocessing
- feature engineering
- modeling
