# Machine Learning Terminology Explained

Explanations of some of the terminology used in Machine Learning and Data Science.

Useful information on formatting Jupyter notebooks is available at [http://jupyter.org/](http://jupyter.org/ "http://jupyter.org/")
and [https://jupyter-notebook.readthedocs.io](https://jupyter-notebook.readthedocs.io "https://jupyter-notebook.readthedocs.io").

<html>
<dl>

  <dt><i>A priori</i> association rules</dt>
  <dd>Rules observed in the training data, on the basis of which
    future data can be classified. Often used in Bayesian classification.</dd>

  <dt>Bagging</dt>
  <dd>A method of classifying a data item by the majority vote of the
  classifiers trained on random subsets of the training data.</dd>


  <dt>Bayesian probability</dt>
  <dd>Bayesian probability is an interpretation of the concept of probability, in which, instead of frequency or propensity of some phenomenon, probability is interpreted as reasonable expectation representing a state of knowledge or as quantification of a personal belief. The Bayesian interpretation of probability can be seen as an extension of propositional logic that enables reasoning with hypotheses, i.e., the propositions whose truth or falsity is uncertain. In the Bayesian view, a probability is assigned to a hypothesis, whereas under frequentist inference, a hypothesis is typically tested without being assigned a probability.

Bayesian probability belongs to the category of evidential probabilities; to evaluate the probability of a hypothesis, the Bayesian probabilist specifies some <i>prior probability</i>, which is then updated to a <i>posterior probability</i> in the light of new, relevant data (evidence). The Bayesian interpretation provides a standard set of procedures and formulae to perform this calculation.

The term Bayesian derives from the 18th-century mathematician and theologian Thomas Bayes, who provided the first mathematical treatment of a non-trivial problem of statistical data analysis using what is now known as Bayesian inference. Mathematician Pierre-Simon Laplace pioneered and popularised what is now called Bayesian probability.</dd>


  <dt>Correlation</dt>
  <dd>The correlation coefficient measures the extent to which two variables are associated with one another. When high values of v1 go with high values of v2, v1 and v2 are positively associated. When high values of v1 are associated with low values of v2, v1 and v2 are negatively associated. The correlation coefficient is a standardized metric, so it always ranges from –1 (perfect negative correlation) to +1 (perfect positive correlation). A correlation coefficient of 0 indicates no linear correlation, but be aware that random arrangements of data will produce both positive and negative values for the correlation coefficient just by chance. Correlation does not imply causation.</dd>

  <dt>Data</dt>
  <dd>A collection of observations.</dd>

  <dt>Decision Tree</dt> 
  <dd>A model that classifies a data item into the class at a leaf
      node, reached by following the branches of the tree whose conditions match the
      properties of the data item.</dd>

  <dt>Ensemble learning</dt> 
  <dd>A method of learning in which the results of several learning algorithms
      are combined to reach a final conclusion.</dd>
      

  <dt>Fit</dt>
  <dd>Applying a learning algorithm to data using analytical approaches.</dd>

  <dt>Genetic algorithms</dt> 
  <dd>Machine learning algorithms inspired by the genetic
  processes; for example, an evolutionary process in which the classifiers with the best
  accuracy are trained further.</dd>

  <dt>Hyperparameters</dt>
  <dd>The settings of a learning algorithm that need to be set before training.</dd>

  <dt>k-Means Clustering algorithm</dt> 
  <dd>The clustering algorithm that divides the dataset into the k groups such that 
      the members of each group are as similar as possible, that is, closest
      to each other.</dd>

  <dt>k-Nearest Neighbors algorithm</dt>
  <dd>An algorithm that estimates an unknown data item to be like 
      the majority of its k closest neighbors.</dd>

  <dt>Learning algorithms</dt>
  <dd>An algorithm used to learn the best parameters of a model - for example, 
    linear regression, naive Bayes, or decision trees.</dd>

  <dt>Loss</dt>
  <dd>A metric to maximise or minimise through training.</dd>


  <dt>Models</dt>
  <dd>An output of a learning algorithm's training. Learning algorithms train models which we can then use to make predictions.   </dd>

  <dt>Naive Bayes classifier</dt>
  <dd>A way to classify a data item using Bayes' theorem about
      conditional probabilities, given below.
      The naive aspect of the algorithm is its assumption that the given
      variables in the data are independent of one another, given the class.

   \begin{equation*}
         P(A|B)  =  \frac{P(B|A) \times P(A)}{P(B)}
   \end{equation*}
  </dd>


   <dt>Neural networks</dt>
   <dd>A machine learning algorithm consisting of a network of
   simple classifiers making decisions based on the input or the results of the other
   classifiers in the network.</dd>


  <dt>Normalize Data (Normalization)</dt>
  <dd>Normalization can refer to different techniques depending on context. Here, we use normalization
to refer to rescaling an input variable to the range between 0 and 1. Normalization requires
that you know the minimum and maximum values for each attribute being normalized.</dd>


  <dt>Observation</dt>
  <dd>A single unit in our level of observation - for example, a person, a sale, or a record.</dd>

  <dt>Parameters</dt>
  <dd>The weights or coefficients of a model learned through training.</dd>


  <dt>Performance</dt>
  <dd>A metric used to evaluate a model.</dd>

  <dt><i>Posterior</i> probability</dt>
  <dd>In Bayesian statistics, the posterior probability of a random event or an uncertain proposition is the conditional probability that is assigned after the relevant evidence or background is taken into account. Similarly, the posterior probability distribution is the probability distribution of an unknown quantity, treated as a random variable, conditional on the evidence obtained from an experiment or survey. "Posterior", in this context, means after taking into account the relevant evidence related to the particular case being examined.</dd>

  <dt>Random Decision Tree</dt> 
  <dd>A decision tree in which every branch is formed using
      only a random subset of the available variables during its construction.</dd>
      
  <dt>Random Forest</dt> 
  <dd>An ensemble of random decision trees, each constructed on a
      random subset of the data sampled with replacement, in which a data item is assigned to
      the class that receives the majority vote of the trees.</dd>

   <dt>Regression analysis</dt>
   <dd>A method for estimating the unknown parameters in a
       functional model that predicts the output variable from the input variables; for
       example, estimating $m$ and $c$ in the linear model $y = mx + c$.</dd>


  <dt>Standardize Data (Standardization)</dt>
  <dd>Standardization is a rescaling technique that shifts and scales the data so that its mean is
0 and its standard deviation is 1. Together, the mean and the standard
deviation can be used to summarize a normal distribution, also called the Gaussian distribution
or bell curve. Standardization requires that the mean and standard deviation of the values for each column be known
prior to scaling.
  </dd>  

  <dt>Support Vector Machines (SVMs)</dt>
  <dd>A classification algorithm that finds the maximum-margin hyperplane that divides the training 
      data into the given classes. This hyperplane is then used to classify new
      data.</dd>

  <dt>Time series analysis</dt> 
  <dd>The analysis of data dependent on time; it mainly includes
   the analysis of trend and seasonality.</dd>


  <dt>Train</dt>
  <dd>Applying a learning algorithm to data using numerical approaches such as gradient descent.</dd>

</dl>
</html>

## General Notes

### 1. When to Normalize and Standardize

Standardization is a scaling technique that assumes your data conforms to a normal distribution.
If a given data attribute is normal or close to normal, this is probably the scaling method to use.
It is good practice to record the summary statistics used in the standardization process so that
you can apply them when standardizing data in the future that you may want to use with your
model. Normalization is a scaling technique that does not assume any specific distribution.

If your data is not normally distributed, consider normalizing it prior to applying your
machine learning algorithm. It is good practice to record the minimum and maximum values
for each column used in the normalization process, again, in case you need to normalize new
data in the future to be used with your model.
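
The two scaling methods, and the advice to record their summary statistics, can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function names (`fit_min_max`, `normalize`, `fit_mean_std`, `standardize`) and the sample values are hypothetical, the population standard deviation is an assumed choice, and columns are assumed not to be constant (so there is no division by zero).

```python
import statistics


def fit_min_max(column):
    """Record the minimum and maximum of a column of training values."""
    return min(column), max(column)


def normalize(column, col_min, col_max):
    """Rescale values to the range 0..1 using previously recorded min and max."""
    span = col_max - col_min  # assumed non-zero, i.e. the column is not constant
    return [(x - col_min) / span for x in column]


def fit_mean_std(column):
    """Record the mean and (population) standard deviation of a column."""
    return statistics.mean(column), statistics.pstdev(column)


def standardize(column, mean, std):
    """Centre values on 0 with standard deviation 1 using recorded statistics."""
    return [(x - mean) / std for x in column]  # std assumed non-zero


# Hypothetical training column and new data arriving later.
train = [10.0, 20.0, 30.0]
new = [15.0, 40.0]

# Record the statistics once, from the training data only.
col_min, col_max = fit_min_max(train)
mean, std = fit_mean_std(train)

# Reuse the recorded statistics for both the training data and the new data.
train_normalized = normalize(train, col_min, col_max)
new_normalized = normalize(new, col_min, col_max)
train_standardized = standardize(train, mean, std)
new_standardized = standardize(new, mean, std)
```

Note that new values lying outside the recorded minimum and maximum (40.0 here) normalize to values outside the range 0 to 1, which is one reason to keep the recorded statistics alongside the model.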

### 2. Bayesian Probability Example

This example is adapted from https://en.wikipedia.org/wiki/Posterior_probability

Suppose a college has 60% male (M) and 40% female (F) students. The male students all wear trousers. The female students wear skirts or trousers in equal numbers. An observer selects a random student from a distance: all the observer can see is that the student is wearing trousers. Given this knowledge, what is the probability that this student is female? The correct answer can be computed using Bayes' theorem.

We have two events. Event $F$ is that the observed student is female. Event $T$ is that the observed student is wearing trousers. To compute the posterior probability P(F|T), we first need to know:

- P(F), or the probability that the student is female regardless of any other information. The observer sees a random student, meaning that all students have the same probability of being observed, and 40% of the students are female, so this probability equals 0.4.
- P(M), or the probability that the student is not female (i.e. is male) regardless of any other information (M is the complementary event to F). This is 60%, or 0.6.
- P(T|F), or the probability of the student wearing trousers given that the student is female. As they are as likely to wear skirts as trousers, this is 0.5.
- P(T|M), or the probability of the student wearing trousers given that the student is male. This is given as 1.
- P(T), or the probability of a (randomly selected) student wearing trousers regardless of any other information. All male students (60% of all students) wear trousers, and half of the female students (half of 40%, i.e. 20% of all students) wear trousers. Since 60% plus 20% equals 80%, this probability is 0.8.
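
The value of P(T) can also be checked by writing the reasoning out as the law of total probability. The source does not name this rule; it is used here only as a restatement of the numbers in the list above:

   \begin{equation*}
         P(T)  =  P(T|F) \times P(F) + P(T|M) \times P(M)  =  0.5 \times 0.4 + 1 \times 0.6  =  0.8
   \end{equation*}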



Having assigned these probabilities, we can now plug the numbers into Bayes' formula to compute the posterior probability.
The original formula:

   \begin{equation*}
         P(A|B)  =  \frac{P(B|A) \times P(A)}{P(B)}
   \end{equation*}

The same formula with our events substituted:

   \begin{equation*}
         P(F|T)  =  \frac{P(T|F) \times P(F)}{P(T)}
   \end{equation*}

With the assigned probabilities substituted:

   \begin{align*}
         P(F|T)  &=  \frac{0.5 \times 0.4}{0.8} \\
                 &=  \frac{0.2}{0.8} \\
                 &=  0.25
   \end{align*}

The intuition behind this result is that, out of every hundred students (60 male and 40 female), 80 wear trousers (60 male and 20 female). Since we observe trousers, the student must be one of these 80. As 20/80 = 1/4 of them are female, the probability that the student in trousers is female is 1/4, or 0.25.
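
The same calculation can be reproduced in a few lines of Python. This is a minimal sketch using only the probabilities stated in this example; the function name `posterior` and the variable names are hypothetical, chosen here for illustration.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence


p_f = 0.4          # P(F): the student is female
p_m = 0.6          # P(M): the student is male
p_t_given_f = 0.5  # P(T|F): a female student wears trousers
p_t_given_m = 1.0  # P(T|M): a male student wears trousers

# P(T): probability that a randomly selected student wears trousers,
# summed over the two groups of students.
p_t = p_t_given_f * p_f + p_t_given_m * p_m

p_f_given_t = posterior(p_f, p_t_given_f, p_t)  # P(F|T)
p_m_given_t = posterior(p_m, p_t_given_m, p_t)  # P(M|T)

print(round(p_t, 2), round(p_f_given_t, 2), round(p_m_given_t, 2))
# prints: 0.8 0.25 0.75
```

The two posteriors sum to 1, as expected, because a student observed in trousers must be either female or male in this example.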