*This notebook is part of the course materials for CS 345: Machine Learning Foundations and Practice at Colorado State University.
Original versions were created by Asa Ben-Hur.
The content is available [on GitHub](https://github.com/asabenhur/CS345).*

*The text is released under the [CC BY-SA license](https://creativecommons.org/licenses/by-sa/4.0/), and code is released under the [MIT license](https://opensource.org/licenses/MIT).*


<a href="https://colab.research.google.com/github/asabenhur/CS345/blob/master/fall22/notebooks/module08_01_conclusions.ipynb">
  <img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


# Summary / Conclusions

In this course we focused on what happens after we have a set of features that characterize our dataset.  However, there is a lot that a machine learning practitioner needs to consider before that:

* What is the problem that you are trying to solve?  What kind of machine learning problem is it, and is there data that can be used to address it?

* Do you have legal access to the data?

* Do you understand the data?  Without understanding the data, you cannot design good features.  Your classifier is only as good as the features you provide it.  The saying "garbage-in garbage-out" definitely applies!

* Is your data clean?
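
As a rough illustration of the last question, the sketch below runs a few basic sanity checks with pandas.  The table, its column names, and the "plausible range" threshold are all made up for this example; real checks depend on what the data is supposed to mean.

```python
import numpy as np
import pandas as pd

# A tiny made-up table standing in for a real dataset (hypothetical values).
df = pd.DataFrame({
    "height_cm": [170.0, 165.0, np.nan, 180.0, 165.0],
    "weight_kg": [65.0, 58.0, 72.0, -1.0, 58.0],
    "label": ["a", "b", "a", "b", "b"],
})

# Number of missing values in each column.
n_missing = df.isna().sum()

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# Values outside a plausible range; the threshold is an assumption for this toy data.
n_bad_weight = int((df["weight_kg"] <= 0).sum())

print(n_missing)
print("duplicate rows:", n_duplicates)
print("non-positive weights:", n_bad_weight)
```

Checks like these do not make the data clean, but they often reveal problems that are much harder to diagnose once a classifier has been trained.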


Once you have a dataset it is time to consider which machine learning approach to use.  There are a lot of options, and it can be difficult to choose.  
The following image from the scikit-learn webpage summarizes and provides a rough roadmap for making these choices:

<img style="padding: 10px; float:center;" alt="https://scikit-learn.org/stable/tutorial/machine_learning_map/" src="https://scikit-learn.org/stable/_static/ml_map.png" width="700">


The following figure is a rough description of the process of coming up with a machine learning model:

<img style="padding: 10px; float:center;" src="https://github.com/asabenhur/CS345/raw/master/fall20/notebooks/figures/machine_learning_process.svg" width="500">



The following table can help with these choices:  it compares the characteristics of the classification algorithms we studied.  Several of these algorithms can also be applied to regression problems, where the same comparison holds.

| Algorithm /<br/>Characteristic |  logistic<br/> regression |  KNN  | decision<br/>trees  | ensemble<br/>methods |  feed-forward<br/>networks | SVM |
|:-----------|:-------:|:--------:|:-------:|:------:|:-------:|:------:|
| **Predictive power** | ✔️ | ✔️ | ❌ | ✔️✔️ | ✔️✔️ | ✔️✔️ |
| **Interpretability** | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ❌ |
| **Scalability<br/>(large N)** | ✔️✔️ | ❌ | ✔️✔️ | ✔️✔️ | ❌ | ✔️✔️ |
| **Scalability<br/>(large d)** | ✔️ | ❌ | ❌ | ✔️✔️ | ✔️ | ✔️✔️ |
| **Handle categorical<br/>data** | ❌ | ❌ | ✔️✔️ | ✔️✔️ | ❌ | ❌ |
| **Handle missing<br/>data** | ❌ | ❌ | ✔️✔️ | ✔️✔️ | ❌ | ❌ |

Legend:  ✔️✔️: high performer, ✔️: mid performer, ❌: low performer

A few comments regarding this table:

* This assessment is based on information gleaned from the literature and the author's experience.

* Ensemble methods refer to random forests and the various flavors of gradient boosting.

* Predictive power is a method's ability to produce "state-of-the-art" performance.  Note that neural networks have the potential for high accuracy, but this requires very careful model selection.  SVMs also require careful setting of hyperparameters, but the space of hyperparameters is much smaller and it is easier to find optimal hyperparameters.  Among the state-of-the-art methods, ensemble methods require the least amount of "fiddling", and often produce great results out of the box (as long as you use enough classifiers in your ensemble).

* Scalability with respect to the number of examples ($N$) refers to the algorithm's ability to handle large datasets.  Note that SVMs are highly scalable for the linear version, but a lot less so for the nonlinear version.

* Scalability with respect to the number of features ($d$) refers to the algorithm's robustness with respect to the presence of a large number of features, many of which might be irrelevant.

* When it comes to neural networks, the table refers to feed-forward networks (multi-layer perceptrons).  Deep learning is addressed below.

* SVMs have another advantage over the listed classifiers in their flexibility to model a variety of data through their use of kernels.  Similar flexibility is afforded by deep learning methods.
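
One way to check an assessment like this on your own data is to run the same cross-validation on each candidate method.  The following is a minimal sketch that makes several assumptions: it uses the breast cancer dataset bundled with scikit-learn as a stand-in, mostly default hyperparameters, and 5-fold cross-validation.  Results on a single small dataset say little about the general ranking in the table.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (an assumption for this sketch): 569 examples, 30 features.
X, y = load_breast_cancer(return_X_y=True)

# Methods that are sensitive to feature scale are wrapped in a pipeline so that
# the scaler is fit on the training folds only.
models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "feed-forward network": make_pipeline(
        StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC()),
}

results = {}
for name, model in models.items():
    # Accuracy on each of the 5 folds.
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name:22s} {scores.mean():.3f} +/- {scores.std():.3f}")
```

None of the hyperparameters here were tuned, so the comparison favors methods that work well out of the box.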

### What about deep learning?

In the above table we focused on standard feed-forward neural networks.  We have all heard of deep learning.  What is the role of these methods in our machine learning toolbox?  Well, if you have image or text data you should definitely look into deep learning.  This also extends to problems where you have non-standard data like sequences (e.g. acoustic signals or DNA/protein sequences) and graphs.  The amazing flexibility and sophisticated architectures found in deep learning are definitely worth exploring in such cases.  However, keep in mind that they require lots of data for effective training.

On the flip side, if your data consists of fixed-dimensional vectors like those we analyzed in this course, deep learning is not likely to help you much.

### What you have learned

In this course we covered a lot of ground.  You have gained an understanding of most of the steps that go into designing effective machine learning solutions:

* Writing effective and efficient code for manipulating matrix and vector data
* Choosing the right way to formulate the problem
* Representing features well: accuracy depends on the quality of the features and how you choose to represent them (feature scaling/normalization)
* Distinguishing training accuracy from testing accuracy
* Recognizing overfitting
* Performing model selection using a validation set
* Being aware of biases
* Conducting meaningful machine learning experiments using scikit-learn
* Presenting code, data, and results using Jupyter notebooks
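
Several of these steps can be combined in one short experiment.  The sketch below is one possible arrangement, not a prescribed recipe: the wine dataset bundled with scikit-learn, the SVM classifier, the split sizes, and the grid of `C` values are all choices made for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# Hold out a test set first; it is used only once, at the very end.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Split the remainder into a training set and a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)

# Model selection: pick the value of C with the best validation accuracy.
candidate_C = [0.01, 0.1, 1, 10, 100]
best_C, best_val_acc = None, -1.0
for C in candidate_C:
    # The scaler sits inside the pipeline, so it is fit on training data only.
    model = make_pipeline(StandardScaler(), SVC(C=C))
    model.fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Retrain on training + validation data with the selected C, then test once.
final_model = make_pipeline(StandardScaler(), SVC(C=best_C))
final_model.fit(X_trainval, y_trainval)
train_acc = final_model.score(X_trainval, y_trainval)
test_acc = final_model.score(X_test, y_test)

print("selected C:", best_C)
print(f"training accuracy: {train_acc:.3f}")
print(f"test accuracy:     {test_acc:.3f}")
```

A large gap between the training and test accuracy is a sign of overfitting.  Because the test set played no part in choosing `C`, the test accuracy is not biased by the model selection step.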

### Topics we haven't covered

Machine learning is a very broad area that is growing rapidly.  Here are some important aspects that were not covered:

* Unsupervised learning:  dimensionality reduction (PCA/t-SNE) and clustering
* Deep learning
* Kernel methods and support vector machines
* Graphical models
