# Introduction to Data Science (for complete beginners) 

## 1. What is Data Science?

* Depends who you ask, but my definition is:
	* Data Science is a multidisciplinary field that combines Software Engineering, Statistics and good old-fashioned domain knowledge.
	* Data Scientists are people who are better at statistics than any software engineer and better at software engineering than any statistician.

* What's the difference between Automation and Data Science?
	* Automation is about creating an agent that is able to perform a task; Data Science is about providing new insight into problem domains.

* Okay, so what is Big Data then?
	* Good question ...
	* Big Data is a phrase that has almost become meaningless. What makes data big?
	* My opinion: if your dataset doesn't fit in the main memory of your machine, your data is big.
	* Worth noting: bigger doesn't always mean better. It's more important to get a representative sample.



## The Data Science lifecycle

## 2. Where do I get my data?

That depends entirely on what kind of data you wish to work with, and what you're trying to achieve. If you wish to experiment with some of the techniques covered today, or want to play with some new toy from Google (TensorFlow anyone?), then publicly available datasets are a great way to get to grips with some of these tools.

However, if you wish to deliver some new insight for yourself or the business, then you'll need to find or build your own dataset.

### Public Datasets

There are many great resources online with collections of publicly available datasets. Some of my personal favourites include:

* [Kaggle (the home for all things Data Science)](https://www.kaggle.com/datasets)
* [University of California, Irvine Machine Learning repository](http://archive.ics.uci.edu/ml/)
* [Quandl](https://www.quandl.com)
* [Amazon's AWS Datasets](http://aws.amazon.com/datasets/)
* [University of Edinburgh's dataset collection](http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html)


## 3. Understanding your data

Before you begin modelling your data, there are really two important questions to ask:

* What kind of task do I want to perform with this data? (Regression vs Classification vs Clustering)
* What kind of learning do I need in order to achieve that task? (Supervised vs Unsupervised)

### Supervised vs Unsupervised Learning

When discussing learning methods (we want our algorithms to learn features of our data, in order to create models) we often mention what type of learning we wish to do. These are categorised into *supervised* and *unsupervised* learning tasks.

#### Supervised

Supervised learning is the process of learning by example, similar to a teacher teaching a student. Your training data consists of examples, each with some feature/predictor variables and a label/target variable. In symbols, the features form a vector $\mathbf{x}$ and we want to learn a function $f$ that maps it to the label $y$:

$$\mathbf{x} = (x_1,x_2,...,x_n)^T$$
$$f(\mathbf{x}) \rightarrow y  $$

An example of a supervised task would be to determine the creditworthiness of a customer, using features obtained from their financial history, given similar data on past customers together with whether or not they defaulted.

#### Unsupervised

Unsupervised learning is the opposite: our data is unlabelled, and our algorithms try to find some structure or rules based on the data we provide.

One of the most famous examples of an unsupervised machine learning algorithm in the real world would probably be Google's PageRank algorithm.

### Regression vs Classification vs Clustering 

#### Classification

* Classification is probably the most intuitive task: your algorithm will attempt to assign your observations to discrete classes, based on previously observed examples.

* An algorithm that implements classification is called a **classifier**, and classification is an instance of a supervised learning task.

* That is, we are classifying based on example classes we've seen before, where we have some label determining the correct result.

Typical classification tasks include:

* Spam or Ham detection for emails
* Creditworthiness: would you give this person a loan?
* Detecting whether or not a human face is present in the picture

* To the board for an Oranges to Lemons example!

#### Regression

* Where classifiers deal with discrete classifications, regression algorithms deal with continuous response variables. 
* Remember $y=mx + c$ ? Then you've used a regression algorithm!
* Regression models aim to predict a response variable, given some known variables. 
* When you're predicting within a range already observed in the dataset, you are *interpolating*
* When you're predicting outside of the range of the dataset, you are *extrapolating*
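
To make the $y=mx + c$ idea concrete, here is a minimal sketch of a least-squares line fit in plain Python. The data points and the helper names `fit_line` and `predict` are made up for illustration; in practice you would more likely reach for a library implementation.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = m*x + c for a single input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

# Made-up observations that lie close to y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]

m, c = fit_line(xs, ys)

def predict(x):
    """Predict the response for x using the fitted line."""
    return m * x + c

inside = predict(2.5)    # x lies within the observed range: interpolating
outside = predict(20.0)  # x lies far outside the observed range: extrapolating
print(f"m = {m:.2f}, c = {c:.2f}")
print(f"interpolated y(2.5) = {inside:.2f}")
print(f"extrapolated y(20) = {outside:.2f}")
```

The extrapolated value is only trustworthy if the straight-line relationship continues to hold outside the data we observed, which the data alone cannot tell us.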

Typical regression tasks include:

* Predicting the price of the stock market 
* Calculating life expectancy
* I'm sure people in the room have even more examples...


#### Clustering

* Unlike Classification and Regression, Clustering is an unsupervised task
* We wish to group similar or related data into some sort of cluster
* This can be a hard task to get right, and can be used to achieve different goals
* Generally used where getting hold of labelled data is impossible or infeasible!
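
As a rough illustration of grouping unlabelled points, here is a small sketch of k-means, one common clustering algorithm (chosen here purely as an example, it is not the only option). The points, the starting centroids and the function name `k_means` are all made up for this sketch.

```python
def k_means(points, centroids, iterations=10):
    """A basic k-means loop on 2D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c  # an empty cluster keeps its old centroid
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Six unlabelled points that visibly form two groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Start with one centroid in each corner of the data
centroids, clusters = k_means(points, [points[0], points[3]])
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

Note that we never told the algorithm which group a point belongs to. We only had to choose how many clusters to look for, which is one of the things that makes clustering hard to get right.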

Typical examples include:

* Topic modeling, grouping articles together based on underlying "themes" of your articles
* Forming connected graphs of references between documents to infer relevancy (ever heard of Google?)
* Analysis of 'tribe' behaviour in social groups


Whenever we have a new dataset to work with, it's important to be able to manipulate and play with our new dataset. This helps us get a feel for our dataset, and allows us to:

* Maximise the insight into the dataset and its main characteristics
* Detect mistakes, missing values, outliers and anomalies
* Determine relationships between the input variables
* Improve the quality of our data and avoid feeding our models with garbage
* Determine what algorithms we wish to use to build our models. 

So with this in mind ...

## 4. How do I preprocess my data?

When creating statistical models, our models are only as good as the data we provide. This experience is nicely summarised by the expression "Garbage in, Garbage out".

Therefore it's worth investing time upfront to ensure your data is up to par, saving you headaches (and poor results) in the future. For a Data Scientist, this preprocessing takes up the majority of your time.

As a result, there are some common problems that you need to look out for.

### Balanced Datasets

Imagine you were building a classifier to determine whether a physicist who has recently been awarded their PhD is likely to be awarded a Nobel Prize in the future. What simple rule would guarantee excellent performance of your classifier?

This is why having balanced datasets is important: heavily imbalanced classes can skew the perceived performance of your classifier.
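
The sketch below puts some numbers on the Nobel Prize example. The counts are entirely hypothetical and only there to show the effect: a rule that always answers "no" looks almost perfect on accuracy while never identifying a single future laureate.

```python
# Hypothetical numbers: 10,000 new physics PhDs, 2 of whom later win a Nobel Prize
labels = [1] * 2 + [0] * 9998        # 1 = future laureate, 0 = not

# The "simple rule": always predict that nobody wins
predictions = [0] * len(labels)

correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)
winners_found = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)

print(f"accuracy = {accuracy:.4f}")
print(f"laureates identified = {winners_found}")
```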

### Scaling

Below is a table containing the first 5 records from a collection, and we're looking at two columns in particular. Can you spot what may be an issue here?


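```python
# Jupyter magic: render plots inline in the notebook
%matplotlib inline
import numpy as np
import pandas as pd

# Load the wine quality dataset straight from the URL
wineQ = pd.read_csv(
    'https://raw.githubusercontent.com/schafer14/Machine-Learning-Wine-DataSets/master/files/winelist.csv',
)
header = wineQ.columns.values  # column names as a NumPy array
wineQ.head()  # show the first 5 records
```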

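| Unnamed: 0 | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |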
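
One way to see the issue is to compare the ranges of two columns before and after rescaling. The sketch below assumes the two columns of interest are `total sulfur dioxide` and `density` (the text does not say which two are meant) and uses standardisation (z-scores) as one possible scaling method. The helper names are made up for illustration.

```python
from statistics import mean, pstdev

# The five values shown in the table above for two of the columns
total_sulfur_dioxide = [34, 67, 54, 60, 34]
density = [0.9978, 0.9968, 0.997, 0.998, 0.9978]

def standardise(values):
    """Rescale values to zero mean and unit standard deviation (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

scaled_so2 = standardise(total_sulfur_dioxide)
scaled_density = standardise(density)

# Raw ranges differ by several orders of magnitude
print("raw ranges:   ", value_range(total_sulfur_dioxide), value_range(density))
# After standardising, both columns live on a comparable scale
print("scaled ranges:", round(value_range(scaled_so2), 2),
      round(value_range(scaled_density), 2))
```

This matters for any model that measures distances between rows: without scaling, the column with the largest numbers dominates the distance and the other column is effectively ignored.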


## 5. Building your models

A combination of task + learning method generally drives the choice of our models. 

Today we will discuss three different models, k-Nearest Neighbour, Decision Trees, and their bigger brother, Random Forests.

These models were chosen today because they are relatively simple to understand, quick to explain, and the resulting model is human interpretable (i.e. we understand why a model made a choice).



### K-Nearest Neighbour

k-Nearest Neighbour (or k-NN for short) is a classifier that tries to determine the label of an input based on its distance to labelled examples in some multidimensional feature space.

Essentially you're trying to find similar examples, and assign classes based on similarity.
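
Stripped of any library, the idea fits in a few lines. This is a minimal sketch rather than a production implementation: the fruit measurements are invented, and `knn_predict` is a hypothetical helper name.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the labels of the k closest points and take the most common one
    nearest_labels = [label for _, label in sorted(distances)[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Invented (width cm, height cm) measurements for a handful of fruit
train_points = [(7.0, 7.2), (7.5, 7.4), (6.8, 7.0), (5.5, 8.5), (5.8, 9.0), (5.2, 8.2)]
train_labels = ["orange", "orange", "orange", "lemon", "lemon", "lemon"]

round_fruit = knn_predict(train_points, train_labels, (7.1, 7.3))
tall_fruit = knn_predict(train_points, train_labels, (5.6, 8.8))
print(round_fruit, tall_fruit)
```

There is no real "training" step here: the model simply remembers the examples and does all of its work at prediction time.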

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import neighbors

n_neighbors = 15

# import some data to play with
wineQ = pd.read_csv(
    'https://raw.githubusercontent.com/schafer14/Machine-Learning-Wine-DataSets/master/files/winelist.csv',
)
# We only take two features so that the decision boundary can be drawn in 2D.
# .values gives plain NumPy arrays, which support the X[:, 0] slicing used below.
X = wineQ[['fixed acidity', 'volatile acidity']].values
y = wineQ['quality'].values

h = .02  # step size in the mesh

for weights in ['uniform']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
    clf.fit(X, y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xlabel('fixed acidity')
    plt.ylabel('volatile acidity')
    plt.title("Wine quality classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))

plt.show()
```

### Decision Trees (and Random Forests)

Imagine the game 20 Questions, where you try to guess the name of a celebrity by asking a series of questions about them.
Decision trees are a very similar concept: by maximising a function that measures *information gain*, you can create a classifier that asks a series of questions about the data to try to discern the class of a data point.
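
To give a feel for what "information gain" measures, here is a small sketch that scores two candidate questions. It assumes the entropy-based definition of information gain, which is one common choice; the fruit labels and function names are made up for illustration.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Eight fruits, and the two groups produced by two different yes/no questions
parent = ["orange"] * 4 + ["lemon"] * 4
perfect_split = [["orange"] * 4, ["lemon"] * 4]
useless_split = [["orange", "orange", "lemon", "lemon"],
                 ["orange", "orange", "lemon", "lemon"]]

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
print(gain_perfect, gain_useless)
```

A question that separates the classes cleanly has a high gain, while a question that leaves each group as mixed as before has none. A decision tree greedily picks the question with the highest gain at each step.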

Decision Trees are not particularly performant models on their own, and are often described as *weak learners*. But imagine: could you build lots of these things, which all ask slightly different questions, and through some wisdom of the crowd come to better conclusions? Yes we can! This is called an *ensemble method*, and it allows you to combine a collection of weak learners to create a strong learner. A Random Forest is exactly this: an ensemble of decision trees.
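
The "wisdom of the crowd" effect can be checked with a little arithmetic. The sketch below assumes every voter is right 60% of the time and, importantly, that voters make their mistakes independently of one another. Real trees in a forest are never fully independent, so treat this as an idealised illustration. The function name is made up.

```python
from math import comb

def majority_vote_accuracy(n_voters, p_correct):
    """Probability that more than half of n independent voters are correct."""
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct ** k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )

# Accuracy of the majority vote as the crowd grows
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

The more the individual learners disagree in their errors, the more a vote can cancel those errors out, which is why ensembles try to make each tree ask slightly different questions.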

## 6. Evaluating your models

Now that you have your models you need some method of scoring or measuring their performance relative to one another. 

It's bad practice to test your model on the same data you trained it with. So we will generally take our dataset and split it into a testing component and a training component. A 30/70 test/train split is a common choice.
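
A split like this can be sketched in a few lines of plain Python. The helper name `split_train_test` and the fixed seed are assumptions made for this example; libraries such as scikit-learn ship their own, more capable version.

```python
import random

def split_train_test(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # Shuffle first so the test set is not just the first rows of the file
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for 100 records
train, test = split_train_test(rows)
print(len(train), len(test))
```

Every record ends up in exactly one of the two sets, so the model is always scored on examples it has never seen.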

### Precision, Recall and Accuracy

When evaluating models, accuracy is the most intuitive measure. Surely if I am correct more often, then my classifier is great?

Because this isn't strictly true, we have two other measures, *Precision* and *Recall*.

Precision is the ability of a classifier not to label a negative example as positive.

Recall is the ability of the classifier to find all positive examples.

Writing $tp$, $fp$ and $fn$ for the number of true positives, false positives and false negatives:

$precision = \frac{tp}{tp + fp}$

$recall = \frac{tp}{tp + fn}$
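
The two formulas translate directly into code. The labels and predictions below are made up for illustration, and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives and true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

# Made-up ground truth and predictions for ten examples
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / len(y_true)
print(f"precision = {precision:.2f}, recall = {recall:.2f}, accuracy = {accuracy:.2f}")
```

Here the classifier is right on most examples overall, yet it misses half of the positives, which is exactly the kind of gap accuracy alone hides.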

### Overfitting

## 7. Conclusions and further reading

Today we have covered:

* Defined what Data Science is, and took our first step towards becoming world-leading Data Scientists
* Looked at some popular online data repositories
* Discussed the differences between supervised and unsupervised learning
* Considered the differences between classification, regression and clustering tasks
* Discovered why it's so important to look at our data before we even start to process or model it
* Met the idea of Garbage in, Garbage out
* Cleaning and scaling our data, in order to improve the performance of our models
* Learnt about three popular models for machine learning: 
	* k-Nearest Neighbours
	* Decision Trees
	* Random Forests
* Discussed how we should evaluate our models, and why evaluation is so important in the first place. 

### Further watching/listening/reading

Watch:

* Andrew Ng's Stanford course on Machine Learning

Listen:

* Talking Machines
* Goldman Sachs episode on Data Science
* Data Viz




