# Statistical Learning 
This Jupyter Notebook covers "An Introduction to Statistical Learning with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. The book is considered a canonical text in statistical/machine learning and is an excellent reference for moving forward in your analytics career. ([Download the book](http://link.Springer.com/book/10.1007/978-1-4614-7138-7))

Statistics has been in the news for a long time. With booming industries built around it, the field is gaining prominence quickly.

Some of the highlights:
- "How IBM built Watson, its Jeopardy-playing supercomputer", by Dawn Kawamoto, DailyFinance, 02/08/2011.
- Quote of the Day, New York Times, August 5, 2009: "I keep saying that the sexy job in the next 10 years will be statisticians. And I'm not kidding." (Hal Varian, chief economist at Google)




### Statistical Learning Problems
The following chapters cover statistical learning problems such as these:
- Identify the risk factors for prostate cancer.
- Classify a recorded phoneme based on a log-periodogram.
- Predict whether someone will have a heart attack on the basis of demographic, diet and clinical measurements.
- Customize an email spam detection system.
- Identify the numbers in a handwritten zip code.
- Classify a tissue sample into one of several cancer classes, based on a gene expression profile.
- Establish the relationship between salary and demographic variables in population survey data.
- Classify the pixels in a LANDSAT image, by usage.

Learning problems fall into two broad categories, supervised and unsupervised, each with its own set of tools:

### The Supervised Learning Problem
Starting point:
- Outcome measurement Y (also called dependent variable, response, target).
- Vector of p predictor measurements X (also called inputs, regressors, covariates, features, independent variables).
- In the regression problem, Y is quantitative (e.g., price, blood pressure).
- In the classification problem, Y takes values in a finite, unordered set (survived/died, digit 0-9, cancer class of tissue sample).
- We have training data $(x_1, y_1), \ldots, (x_N, y_N)$. These are observations (examples, instances) of these measurements.
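
As a rough illustration of the notation above, the sketch below lays out a tiny training set with N = 4 observations and p = 2 predictors. All values and variable names are hypothetical and chosen only to show the difference between a quantitative outcome (regression) and a categorical one (classification).

```python
# Hypothetical toy data: each row of X is one observation x_i with p = 2
# predictor measurements (age in years, smoker flag).
X = [
    [63.0, 1.0],
    [45.0, 0.0],
    [58.0, 1.0],
    [39.0, 0.0],
]

# Regression: the outcome Y is quantitative (here, a made-up blood pressure).
y_regression = [142.0, 118.0, 135.0, 110.0]

# Classification: the outcome Y takes values in a finite, unordered set.
y_classification = ["died", "survived", "survived", "survived"]

N = len(X)     # number of observations
p = len(X[0])  # number of predictors per observation

# The training data as pairs (x_1, y_1), ..., (x_N, y_N).
training_data = list(zip(X, y_regression))
print(f"N={N}, p={p}, first pair={training_data[0]}")
```

The same predictor matrix `X` can be paired with either kind of outcome; what changes is the type of method used to learn from it.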

Objectives:

On the basis of the training data we would like to:
- Accurately predict unseen test cases.
- Understand which inputs affect the outcome, and how.
- Assess the quality of our predictions and inferences.
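
The first and third objectives can be sketched with a simple experiment: fit a model on training data, then measure its error on held-out test cases it has never seen. The example below is a minimal sketch, not the book's code. It assumes synthetic data generated from a known line plus noise, uses least-squares regression with a single predictor, and relies only on the Python standard library; the function names are made up for this illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic data (an assumption): y = 2 + 3x + Gaussian noise with sd 1.
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 3.0 * x + random.gauss(0, 1) for x in xs]

# Hold out the last 50 observations as unseen test cases.
x_train, y_train = xs[:150], ys[:150]
x_test, y_test = xs[150:], ys[150:]


def fit_simple_linear_regression(x, y):
    """Return the least-squares (intercept, slope) for one predictor."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def mean_squared_error(x, y, intercept, slope):
    """Average squared difference between observed and predicted outcomes."""
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return sum(r * r for r in residuals) / len(residuals)


intercept, slope = fit_simple_linear_regression(x_train, y_train)
train_mse = mean_squared_error(x_train, y_train, intercept, slope)
test_mse = mean_squared_error(x_test, y_test, intercept, slope)

print(f"intercept={intercept:.2f}, slope={slope:.2f}")
print(f"train MSE={train_mse:.2f}, test MSE={test_mse:.2f}")
```

The test error, not the training error, is the honest estimate of how well the model predicts new cases. The fitted slope also speaks to the second objective, since it describes how the input affects the outcome.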

Philosophy:
    
- It is important to understand the ideas behind the various techniques, in order to know how and when to use them.
- One has to understand the simpler methods first, in order to grasp the more sophisticated ones.
- It is important to accurately assess the performance of a method, to know how well or how badly it is working [simpler methods often perform as well as fancier ones!]
- This is an exciting research area, having important applications in science, industry and finance.
- Statistical learning is a fundamental ingredient in the training of a modern data scientist.

### Unsupervised learning
- No outcome variable, just a set of predictors (features) measured on a set of samples.
- The objective is fuzzier: find groups of samples that behave similarly, find features that behave similarly, find linear combinations of features with the most variation.
- It is difficult to know how well you are doing.
- Different from supervised learning, but can be useful as a pre-processing step for supervised learning.
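
One of the objectives listed above, finding the linear combination of features with the most variation, is what principal components analysis does. The sketch below is a minimal illustration under stated assumptions: the data are synthetic, there are only two features, and the leading direction is computed from the closed-form angle for a symmetric 2x2 covariance matrix rather than with a general-purpose library routine.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Synthetic data (an assumption): two correlated features and no outcome Y.
x1 = [random.gauss(0, 1) for _ in range(500)]
x2 = [v + random.gauss(0, 0.3) for v in x1]


def centered(values):
    """Subtract the mean so that the feature has mean zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


c1, c2 = centered(x1), centered(x2)
n = len(c1)

# Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
a = sum(u * u for u in c1) / (n - 1)
c = sum(v * v for v in c2) / (n - 1)
b = sum(u * v for u, v in zip(c1, c2)) / (n - 1)

# For a symmetric 2x2 matrix, the leading eigenvector makes this angle
# with the first coordinate axis.
theta = 0.5 * math.atan2(2 * b, a - c)
direction = (math.cos(theta), math.sin(theta))

# Variance of the data projected onto that direction.
scores = [direction[0] * u + direction[1] * v for u, v in zip(c1, c2)]
projected_variance = sum(s * s for s in scores) / (n - 1)

print(f"direction=({direction[0]:.3f}, {direction[1]:.3f})")
print(f"var(x1)={a:.3f}, var(x2)={c:.3f}, var(projection)={projected_variance:.3f}")
```

Because the two features move together, the projection onto the leading direction has more variance than either feature alone. There is no outcome to check the result against, which is why it is hard to say how well an unsupervised method is doing.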

### Statistical Learning versus Machine Learning
- Machine learning arose as a subfield of Artificial Intelligence.
- Statistical learning arose as a subfield of Statistics.
- There is much overlap, as both fields focus on supervised and unsupervised problems:
  - Machine learning has a greater emphasis on large-scale applications and prediction accuracy.
  - Statistical learning emphasizes models and their interpretability, and precision and uncertainty.
- But the distinction has become more and more blurred, and there is a great deal of "cross-fertilization".
- Machine learning has the upper hand in Marketing!

#### Course Texts
The course will cover most of the material in this [Springer book (ISLR)](http://link.Springer.com/book/10.1007/978-1-4614-7138-7), published in 2013, which the instructors (Trevor Hastie and Robert Tibshirani) coauthored with Gareth James and Daniela Witten. Each chapter ends with an R lab, in which examples are developed. A free electronic version of the book was scheduled to be available from the instructors' websites by January 1st, 2014.

[This Springer book (ESL)](https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12.pdf) is more mathematically advanced than ISLR and covers a broader range of topics. The second edition was published in 2009 and was coauthored by the instructors and Jerome Friedman. The book is available from Springer and Amazon, and a free electronic version is available from the instructors' websites.