In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

# Advanced Machine Learning
<!--exclude from index-->

## Module description

Advanced Machine Learning is a continuation of our Machine Learning study track.  As the module name suggests, we push beyond the basics into advanced techniques, with a particular focus on handling unstructured data such as text.  We discuss "bag of words" models, feature engineering, and topic modeling as ways to extract useful information, such as sentiment, from text.  We also explore support vector machines, decision trees, and random forests, along with methods for time series and signal processing.  Students come away with intuition about the suitability of different techniques for different problems.

Several real-world applications such as recommendation systems, outlier detection, and sentiment analysis are included in the module, with case studies that demonstrate how advanced machine learning methods solve common problems.

We continue coverage of the powerful and widely used Python package `Scikit-Learn`, including how to implement custom transformers and estimators and combine them into a full "Extract-Transform-Load" pipeline that turns raw data into useful outcomes.
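
As a rough illustration of what a custom transformer inside a pipeline can look like, here is a minimal sketch.  The class name `LogTransformer` and the toy data are invented for this example and are not taken from the module notebooks; the sketch only assumes the standard Scikit-Learn convention that a transformer provides `fit` and `transform`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class LogTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical custom transformer: applies log(1 + x) to every feature."""

    def fit(self, X, y=None):
        # Nothing is learned from the data; return self so calls can be chained.
        return self

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


# Toy data invented for this sketch: y is exactly 2 * log(1 + x).
X = np.array([[0.0], [1.0], [3.0], [7.0], [15.0]])
y = 2.0 * np.log1p(X).ravel()

# The pipeline applies the transformer first, then fits the regressor
# on the transformed features.
pipe = Pipeline([
    ("log", LogTransformer()),
    ("regressor", LinearRegression()),
])
pipe.fit(X, y)

# New raw data passes through the same transformation before prediction.
prediction = pipe.predict(np.array([[31.0]]))
```

Inheriting from `TransformerMixin` supplies `fit_transform` for free, and `BaseEstimator` supplies `get_params` and `set_params`, which is what lets the object take part in tools such as grid search.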

## Learning outcomes

At the conclusion of this module, students should: 

* Be familiar with advanced machine learning tools in Scikit-Learn such as:
   - `CountVectorizer`, `HashingVectorizer`, and `TfidfVectorizer` for working with text data
   - Support Vector Machines for regression and classification, as well as outlier/novelty detection
   - Decision Trees and Random Forests
* Be familiar with methods for processing text data for topic classification and sentiment analysis
* Understand different metrics for measuring the performance of machine learning models (e.g. accuracy vs. precision in classification) and when each metric is appropriate
* Be able to work with the Scikit-Learn API, specifically: 
   - Know how to create custom estimators and transformers to manipulate data effectively and build specialized prediction methods
   - Be able to effectively use built-in tools such as `Pipeline` and `FeatureUnion` to produce a "start-to-end" ETL workflow
* Know common procedures for analyzing time series data
* Have good intuition about which machine learning methods are appropriate in different situations
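
To make the accuracy vs. precision distinction from the list above concrete, the following small sketch scores a hypothetical classifier on an unbalanced set of labels.  The labels and predictions are invented for illustration; the functions come from `sklearn.metrics`.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Invented labels: 8 negatives (0) and 2 positives (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A hypothetical classifier flags three samples as positive;
# only one of those three is actually positive.
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

# Accuracy: fraction of all predictions that are correct (7 of 10).
accuracy = accuracy_score(y_true, y_pred)
# Precision: fraction of predicted positives that are truly positive (1 of 3).
precision = precision_score(y_true, y_pred)
# Recall: fraction of true positives that were found (1 of 2).
recall = recall_score(y_true, y_pred)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Here the accuracy looks respectable at 0.70 while the precision is only about 0.33, which is why accuracy alone can be misleading when one class is rare.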


## Prerequisites

We assume an understanding of the Python programming language, including data types such as `list`, `dict`, and `tuple`, list and dictionary comprehensions, writing functions, and basic object-oriented programming techniques in Python.  

In addition, students should understand the Pandas `DataFrame` data structure and how to manipulate data in a `DataFrame` (selecting, filtering, creating new columns from existing columns, and the various ways of indexing and slicing DataFrames), and should be comfortable with NumPy arrays.
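
As a quick self-check, the `DataFrame` operations listed above look roughly like the following.  The column names and values are invented for this sketch.

```python
import pandas as pd

# A small table invented for this example.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 64.0, 72.25, 81.0],
})

# Selecting a single column returns a Series.
heights = df["height_cm"]

# Filtering rows with a boolean mask.
tall = df[df["height_cm"] > 165]

# Creating a new column from existing columns.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# Label-based indexing with .loc and position-based indexing with .iloc.
first_name = df.loc[0, "name"]
last_row = df.iloc[-1]

# Converting selected columns to a NumPy array, e.g. to pass to Scikit-Learn.
features = df[["height_cm", "weight_kg"]].to_numpy()
```

If each of these lines reads naturally, the Pandas and NumPy prerequisites are most likely covered.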

We also expect students to have familiarity with some basic machine learning algorithms such as linear regression, logistic regression, $k$-nearest neighbors, and decision trees, together with knowledge of how these methods can be used for regression or classification purposes.  

Finally, students should be able to use the `Pipeline` and `FeatureUnion` objects in the Scikit-Learn library, together with transformers and estimators, to construct a start-to-end "Extract-Transform-Load" pipeline taking data from raw form to final prediction.    
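
For reference, a minimal sketch of the kind of `Pipeline` and `FeatureUnion` usage assumed here is shown below.  The documents, labels, step names, and the particular choice of vectorizers and classifier are all invented for illustration rather than taken from the module notebooks.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Tiny invented corpus: 1 = positive review, 0 = negative review.
docs = [
    "great movie loved it",
    "wonderful acting great story",
    "terrible movie hated it",
    "awful acting terrible story",
]
labels = [1, 1, 0, 0]

# FeatureUnion runs each transformer on the same input and
# concatenates the resulting feature matrices column-wise.
features = FeatureUnion([
    ("word_counts", CountVectorizer()),
    ("char_tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
])

# The pipeline chains feature extraction with a final estimator,
# so raw text goes in and predictions come out.
model = Pipeline([
    ("features", features),
    ("classifier", LogisticRegression()),
])
model.fit(docs, labels)
train_predictions = model.predict(docs)
```

With only four documents this is purely a structural example; the point is the shape of the workflow, not the quality of the resulting model.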

## Suggested notebooks

Due to time constraints, it is unlikely that all of these notebooks will be covered during the module.  We begin with `AM_Natural_Language_Processing`; which additional notebooks are covered depends on the interests of the students and the instructor.

1.  `AM_Natural_Language_Processing`
1.  `AM_Sentiment_Analysis`
1.  `AM_Decision_Trees_and_Random_Forests`
1.  `AM_Support_Vector_Machines`
1.  `AM_Time_Series`
1.  `AM_Naive_Bayes`
1.  `AM_Outlier_Detection`
1.  `AM_Unbalanced_Classes`
1.  `AM_Recommendation_Engines`
1.  `AM_Digital_Signals`
1.  `AM_Choosing_ML_Algorithms`

## Additional resources

### General Machine Learning 

* Wikipedia's articles on 
   - Machine Learning: https://en.wikipedia.org/wiki/Machine_learning
   - Supervised (Machine) Learning: https://en.wikipedia.org/wiki/Supervised_learning  
   - Unsupervised (Machine) Learning: https://en.wikipedia.org/wiki/Unsupervised_learning
   
### Scikit-Learn

* The online Scikit-Learn documentation is extensive:
https://scikit-learn.org/stable/

* Scikit-Learn's introductory tutorial:
https://scikit-learn.org/stable/tutorial/basic/tutorial.html

* The Scikit-Learn algorithm cheat sheet:
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

*  The Scikit-Learn `Pipeline` object: 
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

### Pandas and NumPy

Pandas DataFrames and NumPy arrays are commonly used data structures that interact well with the Scikit-Learn Machine Learning API.

**Pandas**
* A cheat sheet with a summary of commonly used Pandas operations:
http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

* A blog post discussing the connection between Pandas and NumPy: https://blog.thedataincubator.com/2018/02/numpy-and-pandas/

**NumPy**
* A quick-start tutorial on NumPy:
https://docs.scipy.org/doc/numpy/user/quickstart.html

* A NumPy cheat sheet geared for those transitioning from MATLAB:
https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html

* 100 NumPy exercises: 
http://www.labri.fr/perso/nrougier/teaching/numpy.100/

### Data Sources
Where can one find data sets to try out machine learning algorithms?  Here are a few online sources.

* University of California Irvine Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php (This is a well-known repository of common and some not-so-common data sets.)

* Google Dataset Search: https://toolbox.google.com/datasetsearch (Provides links to various online data sets; try a keyword search.)

###  The Mathematics of Machine Learning

You can do machine learning without understanding the fine details of the underlying mathematics, but if you want to delve into this world, here are two good starting points:
*  __[The Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/)__ by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (A rather comprehensive treatment, available for free online.)
*  __[An Introduction to Statistical Learning](https://www-bcf.usc.edu/~gareth/ISL/)__ by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (A gentler, less mathematically intense introduction than the book above.)

*Copyright &copy; 2019 The Data Incubator.  All rights reserved.*