In [1]:
# Fundamentals of Machine Learning

# Contents:
#     1) Forms of machine learning beyond classification and regression
#     2) Formal evaluation procedures for machine-learning models
#     3) Preparing data for deep learning
#     4) Feature engineering
#     5) Tackling overfitting
#     6) The universal workflow for approaching machine learning problems
    
    

In [2]:
# Four Branches of Machine Learning

# 1) Supervised Learning
#    Supervised learning is the most common case. It consists of learning to map input data to known targets (also
#    called annotations), given a set of examples that are often annotated by humans. Almost all applications of
#    deep learning that are in the spotlight these days belong in this category, such as optical character
#    recognition, speech recognition, image classification, and language translation.

#    Although supervised learning mostly consists of classification and regression, there are more exotic variants
#    as well:
#     - Sequence generation: Given a picture, predict a caption describing it
#     - Syntax tree prediction: Given a sentence, predict its decomposition into a syntax tree
#     - Object detection: Given a picture, draw a bounding box around certain objects inside the picture. This can
#       also be expressed as a classification problem, or a joint classification and regression problem
#     - Image segmentation: Given a picture, draw a pixel-level mask on a specific object

# 2) Unsupervised Learning
#    This branch of machine learning consists of finding interesting transformations of the input data without the help 
#    of any targets, for the purposes of data visualization, data compression, or data denoising, or to better 
#    understand the correlations present in the data at hand. Unsupervised learning is the bread and butter of data 
#    analytics, and it's often a necessary step in better understanding a dataset before attempting to solve a 
#    supervised-learning problem. Dimensionality reduction and clustering are well-known categories of unsupervised 
#    learning. 

# 3) Self-supervised learning
#    This is a specific instance of supervised learning, but it's different enough to deserve its own category. It is
#    essentially supervised learning without human-annotated labels. There are still labels involved, because the
#    learning has to be supervised by something, but they're generated from the input data itself, typically using a
#    heuristic algorithm.
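
#    A minimal sketch of labels generated from the input data itself. The next-item heuristic, the helper name
#    `next_item_pairs`, and the toy sentence are illustrative assumptions, not something prescribed by these notes.

```python
def next_item_pairs(sequence, window=3):
    """Build (inputs, target) pairs where each target is the item that follows a window."""
    pairs = []
    for start in range(len(sequence) - window):
        inputs = sequence[start:start + window]
        target = sequence[start + window]  # the label comes from the data, not from a human
        pairs.append((inputs, target))
    return pairs


tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = next_item_pairs(tokens, window=3)
for inputs, target in pairs:
    print(inputs, "->", target)
```

#    No annotation step is needed here: sliding the window over any raw sequence yields as many training pairs as
#    the sequence allows.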
 
# 4) Reinforcement learning
#    In reinforcement learning, an agent receives information about its environment and learns to choose actions that 
#    will maximize some reward. For instance, a neural network that looks at a video game screen and outputs game 
#    actions in order to maximize its score can be trained via reinforcement learning.

In [3]:
# Classification and Regression Glossary:
    
# Sample/input - One data point that goes into your model
# Prediction/output - What comes out of the model
# Target - The truth, or what your model should ideally have predicted, according to an external source of data
# Prediction error/loss value - A measure of the distance between your model's prediction and the target
# Classes - A set of possible labels to choose from in a classification problem
# Label - A specific instance of a class annotation in a classification problem
# Annotations - All targets for a dataset, typically collected by humans
# Binary classification - A classification task where each input sample should be categorized into one of two 
#                         exclusive categories
# Multiclass classification - A classification task where each input sample should be categorized into one of more 
#                             than two categories
# Scalar regression - A task where the target is a continuous scalar value
# Vector regression - A task where the target is a set of continuous values
# Mini-batch/batch - A small set of samples that are processed simultaneously by the model. The number of samples 
#                    is often a power of 2, to facilitate memory allocation on the GPU
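
# As a rough illustration of the mini-batch entry, the sketch below cuts a dataset into batches. The helper name
# `iterate_minibatches` and the batch size of 4 are assumptions chosen for the example.

```python
def iterate_minibatches(samples, batch_size=4):
    """Yield consecutive slices of at most `batch_size` samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


samples = list(range(10))
batches = list(iterate_minibatches(samples, batch_size=4))
print(batches)
```

# The last batch is smaller whenever the dataset size is not a multiple of the batch size.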

In [1]:
# Evaluating machine-learning models

# Developing a model always involves tuning its configuration, like choosing the number of layers or the size of the 
# layers. These settings are called the hyperparameters of the model, to distinguish them from the parameters, which 
# are the network's weights.

# Tuning is a form of learning, a search for a good configuration in some parameter space. As a result, tuning the 
# configuration of the model based on its performance on the validation set can quickly result in overfitting to 
# the validation set, even though your model is never directly trained on it.

# Every time you tune a hyperparameter of your model based on the model's performance on the validation set, some 
# information about the validation data leaks into the model. If you do this only once, for one parameter, then
# very few bits of information will leak, and your validation set will remain reliable to evaluate the model. But
# if you repeat this many times - running one experiment, evaluating on the validation set, and modifying your model 
# as a result - then you'll leak an increasingly significant amount of information about the validation set into 
# the model.

# There are a few ways to set data apart for evaluation:
#     1) Simple hold-out validation: Set apart some fraction of your data as your test set. Train on the remaining
#        data, and evaluate on the test set. 
#     2) K-Fold Validation: Split data into K partitions of equal size. For each partition i, train a model on the 
#        remaining K-1 partitions, and evaluate it on partition i. Your final score is then the average of the K 
#        scores obtained. This method is helpful when the performance of your model shows significant variance based 
#        on your train-test split. Like hold-out validation, this method does not exempt you from using a distinct 
#        validation set for model calibration.
#     3) Iterated K-Fold Validation with shuffling: This is for when you have relatively little data available and you 
#        need to evaluate your model as precisely as possible. It applies K-fold validation multiple times, shuffling 
#        the data every time before splitting it K-ways. The final score is the average of the scores obtained at each 
#        run of K-fold validation. Note that you end up training and evaluating P*K models where P is the number of 
#        iterations you use, which can be very expensive.
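
# The three protocols above can be sketched in plain Python. This is a minimal sketch, not a reference
# implementation: `train_and_evaluate` is a hypothetical callback that would train a fresh model on `train` and
# return its score on `val`, and the dummy version below only records split sizes.

```python
import random


def hold_out_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and set apart a fraction of it as the test set."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def k_fold_splits(data, k):
    """Yield one (train, validation) pair per partition i."""
    fold_size = len(data) // k
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs leftover samples when len(data) is not divisible by k.
        end = start + fold_size if i < k - 1 else len(data)
        yield data[:start] + data[end:], data[start:end]


def k_fold_score(data, k, train_and_evaluate):
    """Average the K validation scores."""
    scores = [train_and_evaluate(train, val) for train, val in k_fold_splits(data, k)]
    return sum(scores) / len(scores)


def iterated_k_fold_score(data, k, iterations, train_and_evaluate, seed=0):
    """Run K-fold validation P times, reshuffling before each run (P * K models in total)."""
    rng = random.Random(seed)
    run_scores = []
    for _ in range(iterations):
        shuffled = list(data)
        rng.shuffle(shuffled)
        run_scores.append(k_fold_score(shuffled, k, train_and_evaluate))
    return sum(run_scores) / len(run_scores)


# Hypothetical stand-in for real training: records the split sizes and returns a fake score.
calls = []


def dummy_train_and_evaluate(train, val):
    calls.append((len(train), len(val)))
    return len(val) / (len(train) + len(val))


data = list(range(10))
train, test = hold_out_split(data, test_fraction=0.2)
score = iterated_k_fold_score(data, k=5, iterations=3, train_and_evaluate=dummy_train_and_evaluate)
print(len(train), len(test), len(calls), round(score, 3))
```

# With K = 5 and P = 3 the callback runs 15 times, which is where the P*K cost comes from.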






In [2]:
# Data Preprocessing, Feature Engineering, and Feature Learning

# Vectorization:
#     The process of turning all inputs and targets of a neural network into tensors of floating-point data
#     or, in some cases, tensors of integers.

# Value Normalization:
#     We have done value normalization twice already. In the digit classification example, we started from image data 
#     encoded as integers in the 0-255 range (grayscale values), cast the data to float32, and divided by 255 to get
#     floating-point values in the 0-1 range. In the house prices example, we started from features that took a
#     variety of ranges - some already floating-point values and some integers - and we normalized each feature
#     independently so that it had a standard deviation of 1 and a mean of 0.

#     To make learning easier for the network, the input data should:
#         1) Take small values (typically in the 0-1 range)
#         2) Be homogeneous, so all features should take values in roughly the same range
#         3) Where possible, have each feature normalized independently to a mean of 0 and a standard deviation of 1
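
#     Both normalizations described above can be sketched with the standard library. The function names and the toy
#     rows are assumptions for illustration. Reusing the training statistics on new data is a common convention
#     that these notes do not spell out.

```python
from statistics import mean, pstdev


def scale_pixels(pixels):
    """Map integer grayscale values in 0-255 to floats in the 0-1 range."""
    return [p / 255.0 for p in pixels]


def fit_standardizer(rows):
    """Compute the per-feature mean and standard deviation from the training rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Guard against constant features, whose standard deviation is 0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Centre each feature on 0 and scale it to unit standard deviation."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train_rows = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
means, stds = fit_standardizer(train_rows)
train_scaled = standardize(train_rows, means, stds)
# Reuse the training statistics for new data instead of recomputing them.
test_scaled = standardize([[2.0, 400.0]], means, stds)
print(train_scaled, test_scaled)
```

#     After standardization the two features, which originally differed by a factor of 100, occupy the same range.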
        
# Feature Engineering:
#     Feature engineering is the process of using your own knowledge about the data and about the machine-learning 
#     algorithm at hand (in this case, a neural network), to make the algorithm work better by applying hardcoded 
#     transformations to the data before it goes into the model. 
    
#     Example of feature engineering:
#         Suppose we want to create a model that inputs an image of a clock and outputs the time of day. 
        
#         Approach 1: We can use the raw pixels of the image as input data, but then we have a big problem on our
#                     hands. We'll need a convolutional neural network to solve it, which is quite expensive.
#         Approach 2: We can come up with a much better input feature for a machine learning algorithm. We can
#                     write a Python script to follow the black pixels of the clock hands and output the (x, y) 
#                     coordinates of the tip of each hand. Then a simple machine learning algorithm can learn to 
#                     associate these coordinates with the right time of day.
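
#         One possible way to engineer the feature further is to turn each (x, y) tip coordinate into an angle.
#         This is a sketch under stated assumptions: the coordinates are taken relative to the clock centre with y
#         pointing up, and `hand_angle` and the sample tips are hypothetical. The polar conversion is an extra step
#         beyond Approach 2 as described above.

```python
import math


def hand_angle(x, y):
    """Clockwise angle in degrees from the 12 o'clock position.

    Assumes (x, y) is the hand tip relative to the clock centre, with y pointing up.
    """
    return math.degrees(math.atan2(x, y)) % 360.0


hour_tip = (1.0, 0.0)    # hypothetical tip of the hour hand, pointing at 3
minute_tip = (0.0, 1.0)  # hypothetical tip of the minute hand, pointing at 12
features = (hand_angle(*hour_tip), hand_angle(*minute_tip))
print(features)
```

#         With angles as features, 30 degrees on the hour hand corresponds to one hour and 6 degrees on the minute
#         hand to one minute, so very little is left for a model to learn.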
                    
#     Feature engineering is important because:
#         1) Good features allow you to solve problems more elegantly while using fewer resources
#         2) Good features let you solve a problem with far less data. The ability of deep learning models to
#            learn features on their own relies on having lots of training data available. If you only have a
#            few samples, then the information value in their features becomes critical.
        
        
        

In [1]:
# Overfitting and underfitting

# The fundamental issue in machine learning is the tension between optimization and generalization

# Optimization is the process of adjusting a model to get the best performance possible on the training data
# Generalization refers to how well the trained model performs on data it has never seen before

# To prevent a model from learning misleading or irrelevant patterns in the training data, the best solution is
# to get more training data. When you can't do this, the next-best solution is to modulate the quantity of information
# that your model is allowed to store, or to add constraints on what information it's allowed to store. If a network
# can only afford to memorize a small number of patterns, the optimization process will force it to focus on the
# most prominent patterns, which have a better chance of generalizing well.

# Regularization is the process of fighting overfitting in a model.

# Methods of Regularization:
    
#     Reducing the network's size:
#         The simplest way to prevent overfitting is to reduce the number of learnable parameters in the model. The 
#         number of learnable parameters in a model is often referred to as the model's capacity. Intuitively, a model 
#         with more parameters has more memorization capacity and therefore can learn a perfect dictionary-like mapping 
#         between training samples and their targets, which is a mapping without any generalization power. At the same 
#         time, your model shouldn't be starved for memorization resources.    
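
#         To make "number of learnable parameters" concrete, the sketch below counts the weights and biases of a
#         stack of fully connected layers. The layer sizes are illustrative assumptions, and
#         `dense_parameter_count` is a hypothetical helper.

```python
def dense_parameter_count(layer_sizes):
    """Count parameters of a dense network given [input_dim, hidden_1, ..., output_dim].

    Each dense layer has n_in * n_out weights plus n_out biases.
    """
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


bigger = dense_parameter_count([10000, 16, 16, 1])
smaller = dense_parameter_count([10000, 4, 4, 1])
print(bigger, smaller)
```

#         Shrinking the hidden layers from 16 to 4 units cuts the capacity to roughly a quarter, which is the kind
#         of knob to turn while watching the validation loss.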

#     Adding weight regularization:
#         Occam's razor states that given two explanations for something, the explanation most likely to be correct 
#         is the simplest one - the one that makes fewer assumptions. Similarly, simpler models are less likely to
#         overfit than complex ones. 
        
#         A simple model in this context is a model where the distribution of parameter values has less entropy, or
#         a model with fewer parameters. 
        
#         Weight regularization is a common way to mitigate overfitting. It constrains the complexity of a network 
#         by forcing its weights to take only small values, which makes the distribution of weight values more
#         regular. This is done by adding a cost associated with having large weight values to the loss function.
        
#         There are two main types of weight regularization:
#             1) L1 regularization - The cost added is proportional to the absolute value of the weight coefficients
#             2) L2 regularization - The cost added is proportional to the square of the values of the weight 
#                coefficients
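
#         The two penalties can be written out directly. This is a minimal sketch: the function names, the
#         `strength` coefficients, and the example weights are assumptions for illustration.

```python
def l1_penalty(weights, strength):
    """Cost proportional to the sum of absolute weight values."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """Cost proportional to the sum of squared weight values."""
    return strength * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """Add the weight penalties to the loss measured on the data."""
    return data_loss + l1_penalty(weights, l1) + l2_penalty(weights, l2)


weights = [0.5, -2.0, 0.0, 1.5]
print(l1_penalty(weights, 0.01), l2_penalty(weights, 0.01))
print(regularized_loss(1.0, weights, l2=0.01))
```

#         Because the L2 term squares each weight, the single large weight (-2.0) dominates that penalty, whereas
#         the L1 term charges every unit of magnitude equally.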
            
#     Adding dropout:
#         This technique was inspired by a fraud-prevention mechanism used by banks. Tellers are moved around 
#         regularly, because successfully defrauding the bank would require cooperation between employees. Keeping 
#         them from working together for long makes such schemes harder to organize.
        
#         Similarly, dropout disrupts cooperation between units in a layer by randomly zeroing some of their outputs 
#         during training. With units removed at random, it becomes harder for the network to build arbitrary, 
#         unwanted patterns that depend on specific units firing together.
        
#         Dropout works well in practice because it prevents the co-adaptation of neurons during the training phase.
        
#         At test time, no units are dropped out. Instead, the layer's output values are scaled down by a factor 
#         equal to the fraction of units kept during training (1 - dropout rate), to balance for the fact that more
#         units are active than at training time.
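
#         A plain-Python sketch of dropout on a list of activations. The function names and the all-ones input are
#         assumptions for illustration. Note that at test time the outputs are multiplied by the keep fraction
#         (1 - rate) so their expected magnitude matches training. The "inverted" variant instead rescales during
#         training and leaves test-time outputs untouched.

```python
import random


def dropout_train(values, rate, rng):
    """Training time: zero each value with probability `rate`."""
    return [0.0 if rng.random() < rate else v for v in values]


def dropout_test(values, rate):
    """Test time: drop nothing, but scale by the fraction of units kept in training."""
    return [v * (1.0 - rate) for v in values]


def inverted_dropout_train(values, rate, rng):
    """Alternative: scale kept values up during training, so test time needs no scaling."""
    keep = 1.0 - rate
    return [0.0 if rng.random() < rate else v / keep for v in values]


rng = random.Random(0)
activations = [1.0] * 10000
train_out = dropout_train(activations, rate=0.5, rng=rng)
test_out = dropout_test(activations, rate=0.5)
inverted_out = inverted_dropout_train(activations, rate=0.5, rng=rng)

# The average activation seen in training roughly matches the scaled test-time output.
print(sum(train_out) / len(train_out), sum(test_out) / len(test_out))
```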
        
    


In [None]:
Universal Workflow of Machine Learning

Defining the problem and assembling a dataset:
    First, you must define the problem at hand:
        What will your input data be? What are you trying to predict?
        What type of problem are you facing? 
        
        You hypothesize that your outputs can be predicted given your inputs
        You hypothesize that your available data is sufficiently informative to
            learn the relationship between inputs and outputs

Choosing a measure of success

Deciding on an evaluation protocol

Preparing your data

Developing a model that does better than a baseline

Scale up: developing a model that overfits
    
Regularizing your model and tuning your hyperparameters


    






