# Machine Learning   

## What is ML?

>  "[*Machine Learning is the*] *field of study that gives computers the ability to learn without being explicitly programmed*" *(Samuel, 1959)*  

> "*A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.*" *(Mitchell, 1997)*  

> "*ML is a branch of AI that systematically applies algorithms to synthesize the underlying relationships among data and information.*" *(Awad & Khanna, 2015)*

> "*Machine Learning is the science (and art) of programming computers so they can learn from data.*" *(Géron, 2017)*




## Machine intelligence measure
Alan Turing introduced a benchmark for demonstrating machine intelligence: a machine has to be intelligent and responsive in a manner that cannot be differentiated from that of a human being (Turing, 1950).




References:    
  1. Samuel, Arthur L. “Some Studies in Machine Learning Using the Game of Checkers,” IBM Journal of R&D 44:1.2 (1959): 210–229.  
  2. Thomas M. Mitchell. Machine Learning (1st. ed.). McGraw-Hill, Inc., USA. 1997.
  3. Awad M., Khanna R. (2015) Machine Learning. In: Efficient Learning Machines. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4302-5990-9_1
  4. Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Sebastopol, CA: O'Reilly Media, 2017.
  5. Turing, Alan M. “Computing machinery and intelligence.” Mind (1950): 433–460.
  
  
  
  
## Where is ML used?

* Web search engines and Ad placement  
  * improving the search results
  * product recommendations
* Stock market prediction
* Traffic prediction
* Price forecasting
* Weather forecasting
* Big Data Analytics
* Credit scoring 
* Financial cyber security  
  * online detection and tracking of monetary fraud  
  * isolation of illegitimate transactions 
* Filtering services
  * email spam filtering
  * bad pop-up ads removal
  * malware protection
* Gene sequence analysis
* Behavior analysis
* Drug Development
* Face recognition

## Venn diagram

<div style="width:image width px; 
            font-size:80%; 
            text-align:center; 
            float: left; padding-left-right-top-bottom:0.5em;  
            border-style: solid; border-color: rgba(211, 211, 211, 0.000);
            background-color: rgba(0, 0, 0, 0.000);">
    <img src="./pics/DS_VD.png" 
         alt="Data science Venn diagram by Drew Conway" 
         width=400 
         style="padding-bottom:0.5em;"/>
    <div style="padding: 3px; 
                width: 400px; 
                word-wrap: break-word; 
                text-align:justify;">
        Illustration of DS Venn diagram by Drew Conway in 2010. <br> 
        <a href="http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram" 
           style="float: left;"> 
           Source 
        </a>
    </div>
</div>

<div style="width:image width px; 
            font-size:80%; 
            text-align:center; 
            float: left; padding-left-right-top-bottom:0.5em;  
            border-style: solid; border-color: rgba(211, 211, 211, 0.000);
            background-color: rgba(0, 0, 0, 0.000);">
    <img src="./pics/DS_VD_2.jpg" 
         alt="AI/ML/DL Venn diagram by Gregory Piatetsky-Shapiro" 
         width=400
         style="padding-bottom:0.5em;"/>
    <div style="padding: 3px; 
                width: 400px; 
                word-wrap: break-word; 
                text-align:justify;">
        Illustration of AI/ML/DL Venn diagram by Gregory Piatetsky-Shapiro. <br> 
        <a href="https://www.kdnuggets.com/2016/03/data-science-puzzle-explained.html" 
           style="float: left;"> 
           Source 
        </a>
    </div>
</div>

## Key Terminology in ML

* **Classifier**.  
  A method that receives a new, unlabeled instance (an observation described by its features) and identifies the category or class to which it belongs. Many commonly used classifiers employ statistical inference (a probability measure) to choose the best label for a given instance.

* **Confusion Matrix** (aka **Error Matrix**).  
  A matrix that summarizes the performance of a classification algorithm. It compares the predicted classification against the actual classification in terms of true positive, false positive, true negative, and false negative counts. A confusion matrix for a two-class classifier system (Kohavi and Provost, 1998) follows: 

<img src="./pics/errormatrix.jpg" style="width:400px;">

* **Accuracy** (and its complement, **Error Rate**).  
  The rate of correct (for the error rate, incorrect) predictions made by the model over a dataset. Accuracy is usually estimated by using an independent test set that was not used at any time during the learning process. More complex accuracy estimation techniques, such as **cross-validation** and **bootstrapping**, are commonly used, especially with datasets containing a small number of **instances**.

<img src="./pics/accuracy.jpg" style="width:500px;">

where P is precision, R is recall, and β takes a value from 0 to infinity (∞) and controls the relative weight assigned to P and R.
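
As a minimal sketch of the quantities above, the following computes accuracy, error rate, precision (P), recall (R), and the F-measure from the four confusion matrix counts. The function name `classification_metrics` and the example counts are illustrative assumptions. The F-measure is written in its common form (1+β²)·P·R / (β²·P+R), which is assumed to match the formula in the figure.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute basic metrics from two-class confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
    }


# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
print(metrics)
```

With β = 1 precision and recall are weighted equally; values of β above 1 favor recall, and values below 1 favor precision.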

* **Cost**.  
  A measure of how well a model performs: it quantifies the deviation between predicted and actual values (or class labels). An optimization procedure attempts to minimize the cost function.

* **Cross-Validation**.  
  A verification technique that evaluates how well a model generalizes to an independent dataset. It sets aside part of the data to test the trained model during the training phase, which helps detect overfitting. Cross-validation can also be used to compare the performance of various prediction functions. In **k-fold cross-validation**, the training dataset is arbitrarily partitioned into **k** mutually exclusive subsamples (or **folds**) of equal size. The model is trained **k** times, where each iteration uses one of the k subsamples for testing (cross-validating) and the remaining **k-1** subsamples for training the model. The k results are averaged to produce a single accuracy estimate.
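
The k-fold procedure can be sketched in plain Python as follows. The helper names (`k_fold_indices`, `cross_validate`, `fit_mean`, `mean_abs_error`) and the toy data are assumptions made for illustration. Shuffling before partitioning is omitted to keep the example deterministic, and the score here is an error measure rather than an accuracy.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validate(xs, ys, k, fit, score):
    """Train k times and return the average of the k per-fold scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return sum(scores) / len(scores)


def fit_mean(xs, ys):
    """Toy model: always predict the mean of the training targets."""
    return sum(ys) / len(ys)


def mean_abs_error(model, xs, ys):
    """Average absolute deviation between the constant prediction and the targets."""
    return sum(abs(y - model) for y in ys) / len(ys)


xs = list(range(10))
ys = [2.0 * x for x in xs]
print(cross_validate(xs, ys, k=5, fit=fit_mean, score=mean_abs_error))
```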

* **Data Mining**.  
  The process of knowledge discovery or pattern detection in a large dataset. The methods involved in data mining aid in extracting the accurate data and transforming it to a known structure for further evaluation.

* **Dataset**.  
  A collection of data that conform to a schema with no ordering requirements. In a typical dataset, each column represents a feature and each row represents a member of the dataset.

* **Dimension**.  
  A set of attributes that defines a property. The primary functions of dimension are filtering, classification, and grouping.

* **Induction Algorithm**.  
  An algorithm that uses the training dataset to generate a model that generalizes beyond the training dataset.

* **Instance**.  
  An object characterized by feature vectors from which the model is either trained for generalization or used for prediction.

* **Knowledge discovery**.  
  The process of abstracting knowledge from structured or unstructured sources to serve as the basis for further exploration. Such knowledge is collectively represented as a schema and can be condensed in the form of a model or models to which queries can be made for statistical prediction, evaluation, and further knowledge discovery.

* **Model**.  
  A structure that summarizes a dataset for description or prediction. Each model can be tuned to the specific requirements of an application. Applications in big data have large datasets with many predictors and **features** that are too complex for a simple parametric model to extract useful information. The learning process synthesizes the parameters and the structures of a model from a given dataset.  
  Models may be categorized as:  
   * **parametric**: described by a finite set of parameters, such that, given those parameters, future predictions are independent of the observed dataset    
   * **nonparametric**: described by an infinite set of parameters, such that the data distribution cannot be expressed in terms of a finite set of parameters. Nonparametric models are simple and flexible, and make fewer assumptions, but they require larger datasets to derive accurate conclusions.

* **Feature vector**.  
  An n-dimensional numerical vector of explanatory variables representing an instance of some object that facilitates processing and statistical analysis. Feature vectors are often weighted to construct a predictor function that is used to evaluate the quality or fitness of the prediction. The dimensionality of a feature vector can be reduced by various dimensionality reduction techniques, such as:  
  * **principal component analysis (PCA)**  
  * **multilinear subspace reduction**  
  * **isomaps**  
  * **latent semantic analysis (LSA)**  
  The vector space associated with these vectors is often called the feature space.
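
To make dimensionality reduction concrete, here is a small sketch of PCA restricted to two-dimensional feature vectors, where the leading eigenvector of the covariance matrix has a closed form. The function name `pca_first_component` and the sample points are assumptions for illustration. Real projects would normally rely on a linear algebra library instead.

```python
import math


def pca_first_component(points):
    """Project 2-D points onto their first principal component."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the symmetric 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix in closed form.
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Matching eigenvector; handle the already-diagonal case separately.
    if b != 0:
        vx, vy = b, lam - a
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    # One coordinate per point instead of two.
    projected = [x * vx + y * vy for x, y in centered]
    explained = lam / (a + c) if (a + c) else 0.0
    return (vx, vy), projected, explained


# Made-up points lying exactly on the line y = 2x.
direction, reduced, explained = pca_first_component(
    [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
)
print(direction, reduced, explained)
```

Because the sample points lie on a single line, one component captures all of the variance, so the two-dimensional feature vectors can be replaced by one number each without losing information.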
  

References:  
  1. Awad M., Khanna R. (2015) Machine Learning. In: Efficient Learning Machines. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4302-5990-9_1
  2. Kohavi, Ron, and Foster Provost. “Glossary of Terms.” Machine Learning 30, no. 2–3 (1998): 271–274.

### The main steps in the process of developing ML algorithms:

  1. **Collect the data**.  
     Select the subset of all available data attributes that might be useful in solving the problem. Selecting all the available data may be unnecessary or counterproductive. Depending upon the problem, data can either be retrieved through a data-stream API (such as CPU performance counters) or synthesized by combining multiple data streams. In some cases, the input data streams, whether raw or synthetic, may be statistically preprocessed to improve usage or reduce bandwidth.
     
     
  2. **Preprocess the Data**.  
     Present the data in a manner that is understood by the consumer of the data. Preprocessing consists of the following three steps:  
     2.1. **Formatting**. The data needs to be presented in a usable format. Using an industry-standard format enables plugging the solution into multiple vendors, which in turn can mix and match algorithms and data sources such as XML, HTML, and SOAP.  
     2.2. **Cleaning**. The data needs to be cleaned by removing, substituting, or fixing corrupt or missing data. In some cases, data needs to be normalized, discretized, averaged, smoothed, or differentiated for efficient usage. In other cases, data may need to be transmitted as integers, double-precision values, or strings.   
     2.3. **Sampling**. The data needs to be sampled at regular or adaptive intervals in such a way that redundancy is minimized without loss of information for transmission via communication channels.
         
     
  3. **Transform the data**.  
     Transform the data specific to the algorithm and the knowledge of the problem. Transformation can be in the form of feature scaling, decomposition, or aggregation. Features can be decomposed to extract the useful components embedded in the data or aggregated to combine multiple instances into a single feature.
     
     
  4. **Train the algorithm**.  
     Select the training and testing datasets from the transformed data. An algorithm is trained on the training dataset and evaluated against the test set. The transformed training dataset is fed to the algorithm for extraction of knowledge or information. This learned knowledge or information is stored as a model to be used for cross-validation and actual usage. *Unsupervised learning, having no target value, does not require a training step against labels; the algorithm is fitted directly to the unlabeled data*.
     
     
  5. **Test the algorithm**.   
     Evaluate the algorithm to test its effectiveness and performance. This step enables quick determination of whether any learnable structures can be identified in the data. The trained model makes predictions on the test dataset, and comparing those predictions with the known outcomes indicates the performance of the model. If the performance needs improvement, repeat the previous steps, changing the data streams, sampling rates, transformations, linearizing models, outlier removal methodology, and biasing schemes.
     
     
  6. **Execute and predict**.  
     Apply the validated model to perform an actual task of prediction. If new data are encountered, the model is retrained by applying the previous steps. The process of training may coexist with the real task of predicting future behavior.
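
Steps 1 to 5 above can be sketched end to end in plain Python. Everything in this example is an assumption made for illustration: the toy data, the helper names, mean imputation as the cleaning method, min-max scaling as the transformation, and a nearest-centroid classifier as the algorithm being trained.

```python
def impute_mean(rows):
    """Cleaning: replace missing values (None) with the column mean."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        present = [r[j] for r in rows if r[j] is not None]
        means.append(sum(present) / len(present))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


def min_max_scale(rows):
    """Transformation: rescale every feature to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(r, lows, highs)]
        for r in rows
    ]


def train_centroids(rows, labels):
    """Training: store the mean feature vector (centroid) of each class."""
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append(row)
    return {
        label: [sum(col) / len(col) for col in zip(*members)]
        for label, members in grouped.items()
    }


def predict(model, row):
    """Prediction: return the label of the closest centroid."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


# 1. Collect: made-up data with two features and one missing value.
raw = [[1.0, 10.0], [2.0, None], [1.5, 12.0], [8.0, 50.0], [9.0, 55.0], [8.5, 52.0]]
labels = ["low", "low", "low", "high", "high", "high"]

# 2-3. Preprocess (clean) and transform (scale).
features = min_max_scale(impute_mean(raw))

# 4. Train on part of the data. 5. Test on the held-out part.
train_idx, test_idx = [0, 1, 3, 4], [2, 5]
model = train_centroids([features[i] for i in train_idx], [labels[i] for i in train_idx])
correct = sum(predict(model, features[i]) == labels[i] for i in test_idx)
accuracy = correct / len(test_idx)
print(accuracy)
```

For brevity the imputation and scaling statistics are computed on the full dataset. In practice they would usually be computed on the training split only and then reused for test and new data, so that no information from the test set leaks into training.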
     
     
<img src="./pics/01_09.png" style="width:700px;">

References:    
  1. Awad M., Khanna R. (2015) Machine Learning. In: Efficient Learning Machines. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4302-5990-9_1
  2. Sebastian Raschka, Vahid Mirjalili. Python Machine Learning. 3rd Edition. Birmingham, UK: Packt Publishing, 2019. ISBN: 978-1789955750

## ML categories

* **Supervised Learning** [learning with a teacher]  
    * Classification
    * Regression

* **Unsupervised Learning** [learning without a teacher]  
    * Clustering
    * Dimensionality reduction
    
* **Reinforcement Learning (RL)** [learning with reinforcement]  


<img src="./pics/ml_types_usage.png" style="width:700px;">


## Fundamental ML Algorithms

* Linear Regression
* Logistic Regression
* k-Means Clustering
* k-Nearest Neighbors (kNN)
* Support Vector Machines (SVM)
* Decision Trees
* Random Forests (RF)
* Naive Bayes 
* Ensembles
* Boosting
* Dimensionality Reduction
* Neural Networks (NN) 
* Deep Learning (DL)
* Reinforcement Learning (RL)

## Supervised Learning
<img src="./pics/sup-unsup.jpeg" style="width:400px;">
<img src="./pics/class-regress.png" style="width:600px;">
<img src="./pics/class-regress_example1.png" style="width:400px;">

* Regression  
    These algorithms are normally used for predicting a single number. If you needed to create an algorithm that predicted a stock price based on features of stocks, you would select this type of model. Such numeric targets are called *continuous variables*.
    
   * Linear  
       Model: y=ax+b => y=β₀+β₁x₁+…+βᵢxᵢ,   
       where β₀ is the y-intercept, the y-value when all explanatory variables are set to zero. β₁ to βᵢ are the coefficients for variables x₁ to xᵢ: each is the amount by which y increases or decreases with a one-unit change in that variable, assuming that all other variables are held constant. For example, if the equation were y=1+2x₁+3x₂, then y would increase from 1 to 3 if x₁ increased from 0 to 1 and x₂ stayed at 0.
   * Logistic  
       Model: y= 1 / (1+e^-(β₀+β₁x₁+…+βᵢxᵢ)),  
       which constrains it to values between 0 and 1.  
       For this reason, it is mostly used for binary target variables where the possible values are zero or one, or where the target is the probability of a binary variable. The equation keeps predictions from being illogical in the sense of having probabilities below 0 or above 1.
       
       
<img src="./pics/regr_lin_log.jpeg" style="width:400px;">
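
The linear and logistic models above can be evaluated directly. This sketch reuses the coefficients from the worked example y=1+2x₁+3x₂. The function names `linear_predict` and `logistic_predict` are assumptions for illustration, and no fitting is performed.

```python
import math


def linear_predict(betas, xs):
    """Linear model: betas[0] is the intercept, betas[1:] pair up with xs."""
    return betas[0] + sum(b * x for b, x in zip(betas[1:], xs))


def logistic_predict(betas, xs):
    """Logistic model: pass the linear predictor through the sigmoid."""
    return 1.0 / (1.0 + math.exp(-linear_predict(betas, xs)))


betas = [1.0, 2.0, 3.0]  # coefficients of y = 1 + 2*x1 + 3*x2

y_at_origin = linear_predict(betas, [0.0, 0.0])   # x1 = 0, x2 = 0
y_after_step = linear_predict(betas, [1.0, 0.0])  # x1 raised by one unit
p_at_origin = logistic_predict(betas, [0.0, 0.0])

print(y_at_origin, y_after_step, p_at_origin)
```

The linear output is unbounded, while the logistic output always stays strictly between 0 and 1, which is why it can be read as a probability.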
   
   
* Classification  
   These algorithms are used to predict which class, out of a set of possible classes, an instance belongs to. This could be a simple "yes or no" classification (*binary* classification, when you have 2 classes), or "red, green or blue" (*multi-class* classification, when you have more than two classes). If you needed to predict whether an unknown person was male or female from features, you would select this type of model. Such categorical targets are called *discrete variables*.
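
As one possible illustration of multi-class classification, the sketch below uses k-nearest neighbors (kNN) with a majority vote. The function name `knn_classify`, the made-up color data, and the choice of squared Euclidean distance are assumptions for this example.

```python
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points nearest to query."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up (red, green, blue) intensity triples with their class labels.
points = [
    (0.9, 0.1, 0.1), (0.8, 0.2, 0.1),
    (0.1, 0.9, 0.2), (0.2, 0.8, 0.1),
    (0.1, 0.2, 0.9), (0.2, 0.1, 0.8),
]
labels = ["red", "red", "green", "green", "blue", "blue"]

prediction = knn_classify(points, labels, (0.85, 0.15, 0.1), k=3)
print(prediction)
```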


## Unsupervised Learning

* Unsupervised machine learning looks for structure in data without human guidance. Unsupervised machine learning algorithms rely on data that has no labels, predefined features, or specified classification sets.

<img src="./pics/unsup_example1.png" style="width:600px;">


<img src="./pics/unsup_scheme.jpg" style="width:800px;">

## Supervised vs Unsupervised


* k-means example

<img src="./pics/sup_unsup_example1.jpg" style="width:800px;">
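
A minimal sketch of k-means follows, written as the usual alternation between an assignment step and an update step. The function name `k_means`, the sample points, and the explicitly supplied starting centroids are assumptions for illustration. Practical implementations typically choose starting centroids randomly and stop once assignments no longer change.

```python
def k_means(points, centroids, n_iter=10):
    """Alternate assignment and update steps for a fixed number of iterations."""
    clusters = []
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        centroids = [
            tuple(sum(col) / len(col) for col in zip(*members)) if members else centroid
            for members, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Made-up, unlabeled 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

No labels are passed in: the grouping comes only from distances between points, which is what distinguishes this from the supervised examples.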

### Book recommendations
* **[en]** [Python Machine Learning - 3rd edition](https://sebastianraschka.com/books.html) by Sebastian Raschka, Vahid Mirjalili (2020)
* **[en]** [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow - 2nd edition](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/) by Aurélien Géron (2019)

### Online course recommendations   
* **[ru]** [Advanced Machine Learning Specialization](https://www.coursera.org/specializations/aml?ranMID=40328) **(HSE)**  
* **[en]** [Deep Learning Specialization](https://www.coursera.org/specializations/deep-learning) **(deeplearning.ai)** by [Andrew Ng](https://scholar.google.com/citations?user=mG4imMEAAAAJ&hl=en)
* **[en]** [Natural Language Processing Specialization](https://www.coursera.org/specializations/natural-language-processing) **(deeplearning.ai)**
* **[en]** [MicroMasters® in Statistics and Data Science](https://www.edx.org/micromasters/mitx-statistics-and-data-science) **(MIT)** at **EdX** 


### Sources for datasets
* [Kaggle](https://www.kaggle.com/datasets) (sometimes with code/notebooks)
* [data.world](https://data.world/)
* [U.S. Government’s open data](https://catalog.data.gov/dataset)


```python
def ml(self):
    # Read the `mlcc` attribute from the object and return it.
    mlcc = self.mlcc
    return mlcc
```