# Intro to Data Science @ SzISz

## Table of contents

- <a href="#Administration">Administration</a>
- <a href="#Intro">Intro</a>
- <a href="#Overview">Overview</a>
- <a href="#Regression">Regression</a>

## Administration

### Curriculum:
- Overview, technical basics, regression
- Data Discovery, Naive linear classifiers
- Data Transformation, Decision trees
- Dimensionality Reduction
- ...

### Requirements:

A selected project submitted to one of 
<a href="https://www.kaggle.com/competitions">kaggle.com</a>'s competitions.

## Intro

### WTF is Data Science?

According to a random Venn diagram:

<img src="http://b-i.forbesimg.com/gilpress/files/2013/05/Data_Science_VD.png" width=300 align="left">

As a metro map: 

<a href="http://nirvacana.com/thoughts/wp-content/uploads/2013/07/RoadToDataScientist1.png" target="new">
    <img src="http://nirvacana.com/thoughts/wp-content/uploads/2013/07/RoadToDataScientist1.png" width=500 align="left">
</a>

### At the end of the day:

It's just a fancier name for Data Mining. Maybe throw some more hacking skills into the mix.


### Who is a Data Scientist then?

- "A data scientist is a statistician who lives in San Francisco"
- "A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician."


### Thanks, much clearer now. (NOT) Can you at least tell me what a data scientist does?
#### A.k.a: the typical workflow - The Knowledge Discovery Process

<img src="http://www.cs.utexas.edu/users/csed/doc_consortium/DC99/wooley-image1.gif">

## Overview

There is a lot of "implicit" information in the data which humans can't directly observe, but which can be extracted by statistical methods (a.k.a. _analytics_). This is exactly our goal. Basically, there are two main types of analytics:

### Descriptive analytics

**Goal:** To extract the most valuable information from a given dataset.  
**Example:** A store has some information on its customers, and from that information it can determine what types of people visit its stores (like students, retirees, etc.). This way it can adjust the stores' opening hours to fit the needs of the different groups of customers it serves. (This is called clustering.)
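
The store example can be sketched with scikit-learn's `KMeans`. This is only an illustration under made-up assumptions: the two features (age and typical visiting hour), the two customer groups and all the numbers are invented here, not taken from any real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)

# Hypothetical customers described by [age, typical visiting hour]
students = np.column_stack([rng.normal(21, 2, 50), rng.normal(17, 1, 50)])
retirees = np.column_stack([rng.normal(70, 4, 50), rng.normal(10, 1, 50)])
customers = np.vstack([students, retirees])

# Ask for two clusters; note that no labels are passed to the model
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

# Each row is the "average customer" of one discovered group
print(kmeans.cluster_centers_)
```

The cluster centers then suggest when each group tends to visit, which is the kind of information the store could use to set its opening hours.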

### Predictive analytics

**Goal:** To predict missing information based on previous knowledge.  
**Example:** When you apply for a loan, the bank feeds your data into its model to predict the probability of you repaying that loan. Depending on this prediction, it can decide whether or not to grant you the loan you asked for.
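
A rough sketch of the loan example, assuming a single invented feature (monthly income) and a made-up repayment rule. A real bank's model would use many more attributes; `LogisticRegression` is just one possible choice of model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Synthetic past applicants: monthly income (in thousands) and whether they repaid
income = rng.uniform(1, 10, size=200)
repaid = (income + rng.normal(0, 1.5, size=200) > 5).astype(int)  # invented rule plus noise

# Fit the model on the previous knowledge
model = LogisticRegression()
model.fit(income[:, np.newaxis], repaid)

# Predicted probability of repayment for two new (hypothetical) applicants
applicants = np.array([[2.0], [9.0]])
proba = model.predict_proba(applicants)[:, 1]
print(proba)
```

The bank could then grant the loan only if the predicted probability is above some threshold of its choosing.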
  
---
  
There is another way of categorizing the methods that we will learn about in this course: **supervised** and **unsupervised** learning.  
**Supervised learning** is based on data that is already 'labeled'. In other words, we have data for which we know what the correct output is. We train our model on this dataset, and afterwards the model can predict the output for any input we give it. The simplest supervised learning method is linear regression.  
With **unsupervised learning** we don't know what the correct output should be; instead, we try to detect structure in the data. The simplest example of this is descriptive statistics, or the clustering example above.
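
As a minimal sketch of the supervised case, here is a linear regression fitted on labeled synthetic data. The underlying line (y = 2x + 1) and the noise level are assumptions made up for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Labeled data: inputs x with known outputs y (an invented noisy line)
x = rng.uniform(0, 10, size=50)
y_known = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=50)

# Train on the labeled examples
reg = LinearRegression()
reg.fit(x[:, np.newaxis], y_known)

slope = reg.coef_[0]
intercept = reg.intercept_
print(slope, intercept)

# Predict the output for an input the model has not seen
print(reg.predict(np.array([[12.0]])))
```

The fitted slope and intercept should land close to the values used to generate the data, which is what "learning from labeled examples" means here.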

### Validation

How can we validate our model/output? In the case of unsupervised learning there is no known correct output to compare against, so we can't, at least not directly. With supervised learning, however, the basic idea is pretty straightforward. We split our dataset into two parts: a training set and a test set. We train our model _only on the training set_, and then compare the model's output on the test set to the known correct output.
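
The split described above can be sketched with scikit-learn's `train_test_split`. The synthetic data, the 25% test size and the choice of a logistic regression model are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic labeled data: the label is 1 where the clean input is positive
X_demo = rng.normal(size=(200, 1))
y_demo = (X_demo[:, 0] > 0).astype(int)
X_demo += 0.3 * rng.normal(size=X_demo.shape)  # blur the inputs a little

# Hold out 25% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Train only on the training set ...
model = LogisticRegression().fit(X_train, y_train)

# ... and compare predictions with the known labels of the test set
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

Because the test samples were never seen during training, the accuracy measured on them is a fairer estimate of how the model behaves on new data.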

## Regression, and a little scikit-learn

<img src="http://vignette2.wikia.nocookie.net/nickelodeon/images/1/14/Ren%2B%2BStimpy.jpg" width=200 align=left>

In [None]:
import numpy as np

from sklearn import linear_model
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

%matplotlib inline
import matplotlib.pyplot as plt

## First, generate the data and add some noise to it

In [None]:
# get 100 samples from a normal distribution
n_samples = 100
np.random.seed(0)
X = np.random.normal(size=n_samples)

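# y is a step function of X: 1 where X > 0, otherwise 0
# (np.float was removed from recent NumPy versions, so use the builtin float)
y = (X > 0).astype(float)

# now stretch the positive values of X and add some Gaussian noise
X[X > 0] *= 4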
X += .3 * np.random.normal(size=n_samples)

# reshape X to be a 100 x 1 "matrix", or a column vector
X = X[:, np.newaxis] 

## Now introducing pipelines

Since we only want a logistic regression in our model, we could simply use the `LogisticRegression` class from the imported `linear_model` module. However, there is a useful concept called a **pipeline**, which really comes in handy when dealing with more complicated models.

When dealing with data, we may first want to transform it to make it more digestible to our estimators (e.g. by getting rid of some attributes). There can be multiple transformation steps involved in our process, and each transformation may have multiple parameters that can be tweaked independently. Pipelines provide a wrapper for these steps, which makes working with these transformations easier and more concise.
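
To illustrate a pipeline with more than one step, here is a small sketch that puts a `StandardScaler` transformation in front of a logistic regression. The scaler, the step names and the synthetic data are assumptions chosen for this example; they are not part of the model used in the rest of this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic data with two attributes on a large scale
X_demo = rng.normal(loc=50, scale=10, size=(100, 2))
y_demo = (X_demo[:, 0] > 50).astype(int)

# Each step is a (name, object) pair; all steps but the last must be transformers
scaled_pipe = Pipeline(steps=[('scale', StandardScaler()),
                              ('logistic', LogisticRegression())])

# fit() runs the scaler first, then trains the estimator on the scaled data
scaled_pipe.fit(X_demo, y_demo)

# Parameters of every step are reachable as <step name>__<parameter name>
print(sorted(scaled_pipe.get_params().keys()))
```

The `<step name>__<parameter name>` naming is what makes it possible to tune the parameters of any step from outside the pipeline.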

In [None]:
# In this example, we only want a logistic regression
logistic = linear_model.LogisticRegression()
pipe = Pipeline(steps=[('logistic', logistic)])

# However, there is a regularization parameter C that can be chosen freely for our logistic regression estimator
# Here we define which values of C we should try when building our model
Cs = np.logspace(-4, 4, 3) # C = [0.0001, 1.0, 10000]

# GridSearchCV is the tool for trying all the combinations of the freely chosen parameters with our model
# For now we only change one parameter, but every new parameter multiplies the number of cases to try
# So if we had another parameter with 4 values to try, there would be 12 cases altogether
estimator = GridSearchCV(pipe, dict(logistic__C=Cs))
estimator.fit(X, y)
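
The multiplication of cases can be checked with scikit-learn's `ParameterGrid`, which enumerates the same combinations a grid search would try. The second parameter (`tol`, with four arbitrary values) is only an assumption added here to show the effect.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Three values of C, as above, plus a hypothetical second parameter with four values
grid = ParameterGrid({
    'logistic__C': np.logspace(-4, 4, 3),
    'logistic__tol': [1e-5, 1e-4, 1e-3, 1e-2],
})

n_combinations = len(grid)
print(n_combinations)

# Each element is one parameter setting that would be fitted and scored
for params in list(grid)[:3]:
    print(params)
```

Keep in mind that a grid search also cross-validates every combination, so the number of model fits is larger still.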

In [None]:
# We can see what the results of the different models were:
estimator.cv_results_
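
`cv_results_` is a fairly large dictionary, so it can help to pick out just a few entries. The sketch below rebuilds a small search on synthetic data of its own (an assumption, so that it runs independently of the cells above) and prints the mean cross-validation score for each value of C.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)

# Synthetic data, similar in spirit to the data generated earlier
X_demo = rng.normal(size=(100, 1))
y_demo = (X_demo[:, 0] > 0).astype(int)
X_demo += 0.3 * rng.normal(size=X_demo.shape)

demo_pipe = Pipeline(steps=[('logistic', LogisticRegression())])
search = GridSearchCV(demo_pipe, {'logistic__C': np.logspace(-4, 4, 3)})
search.fit(X_demo, y_demo)

# One entry per tried parameter setting
results = search.cv_results_
for params, score in zip(results['params'], results['mean_test_score']):
    print(params, round(score, 3))

# The setting with the highest mean score is kept as the best one
print(search.best_params_, search.best_score_)
```

Other keys, such as `std_test_score` and `rank_test_score`, follow the same one-entry-per-setting layout.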

In [None]:
# to see how our best model behaves
X_test = np.linspace(-5, 10, 300)
X_test = X_test[:, np.newaxis]

bestimator = estimator.best_estimator_
prediction = bestimator.predict(X_test)

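# plot the predicted class over the test range (line)
# together with the noisy training points (black dots)
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.plot(X_test.ravel(), prediction)
plt.scatter(X.ravel(), y, color='black', zorder=20)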