![](imgs/deepsense_header.png)

# Machine Learning and Big Data

A course by [deepsense.io](http://deepsense.io/).

## Part 0: Tools

This is a short, gentle introduction to the tools we are going to use.

![](imgs/ipython_notebook.png)

### Tools

In this course we will use:

* [Python](https://www.python.org/) - a popular programming language
* [scikit-learn](http://scikit-learn.org/) - the main machine learning library for Python

As an interface/IDE, we will use:

* [IPython Notebook](http://ipython.org/notebook.html) - an interactive environment in the browser

We are also going to use a few common Python libraries:

* [Pandas](http://pandas.pydata.org/) - tabular data, reading and writing,
* [matplotlib](http://matplotlib.org/) - plots,
* [seaborn](http://stanford.edu/~mwaskom/software/seaborn/) - beautiful plots,
* [NumPy](http://www.numpy.org/) - numerical operations.

Don't be afraid if some of these tools are new to you. You are here to learn them. :)
As long as you have some very basic knowledge of Python, you should be able to follow along.


### Installation

There are many ways to install Python and its libraries. The easiest one is to use the [Anaconda](https://store.continuum.io/cshop/anaconda/) distribution. It installs Python (2.7 or 3.4) along with the most useful libraries for data science.

The only other libraries we need to install are `seaborn`, `pyspark` and `findspark`.

The Anaconda distribution comes with IPython Notebook.
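To check that the installation worked, one option is to ask Python whether each library can be found. The sketch below is an assumption on our side, not part of the course materials: it uses the standard-library `importlib.util.find_spec` (Python 3.4+), and the helper name `check_installed` is made up for illustration. Note that scikit-learn is imported as `sklearn`.

```python
import importlib.util

# Import names of the libraries listed above (scikit-learn is imported as "sklearn").
REQUIRED = ["numpy", "pandas", "matplotlib", "sklearn",
            "seaborn", "pyspark", "findspark"]


def check_installed(names):
    """Return a dict mapping each top-level module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, ok in check_installed(REQUIRED).items():
    print("{:12s} {}".format(name, "OK" if ok else "MISSING"))
```

Anything reported as `MISSING` still needs to be installed before the labs.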

### Labs

During this workshop we will use IPython Notebook (with the necessary libraries) installed on Amazon Web Services. You will be given a link that looks like `ec2-54-148-43-50.us-west-2.compute.amazonaws.com`.

### IPython Notebook

IPython Notebook makes a few things easy:

* iterative development,
* modifying code on feedback,
* incorporating plots,
* adding notes (including pictures and equations).

This environment is frequently used in data science, especially during exploratory data analysis and early machine learning experiments.

It's data, it's the unknown. So you need a lot of feedback from it.

Oh, and IPython Notebooks are great for learning and presenting. 

(BTW: This note is in IPython Notebook as well!)

## Code

Press `[shift]+[enter]` to execute a cell.

In [None]:
2 + 5

In [None]:
pets = ["cat", "python", "elephant"]

for pet in pets:
    print("I have a {}. A wonderful animal, indeed!".format(pet.upper()))

In [None]:
[len(pet) for pet in pets]
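Cells in a notebook share one namespace, so a cell can use `pets` only after the cell that defines it has been run; otherwise Python raises a `NameError`. A minimal, self-contained sketch of both cases (the name `undefined_pets` is hypothetical and exists only to trigger the error):

```python
pets = ["cat", "python", "elephant"]

# Works, because `pets` is defined above.
lengths = [len(pet) for pet in pets]
print(lengths)

# Fails, because nothing named `undefined_pets` has been defined.
try:
    [len(pet) for pet in undefined_pets]
except NameError as err:
    message = str(err)
    print("NameError:", message)
```

If you see such an error in a notebook, run the cells above the failing one first.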

## Markdown

**Markdown** is a [markup language](https://en.wikipedia.org/wiki/Markup_language) allowing us to write notes.

It includes:

* lists,
* *italics*,
* **bold text**,
* ~~strikethrough~~,
* `monospaced code`,
* [links](http://deepsense.io/).


Moreover, it has support both for inline $\LaTeX$ formulas such as $\sqrt{2} = 1.41\ldots$ and full line equations:

$$\sum_{k=1}^\infty \frac{(-1)^{k+1}}{2k-1} = 1 - \tfrac{1}{3} + \tfrac{1}{5} - \ldots = \frac{\pi}{4} $$

And, of course, code blocks:

```javascript
var list = ["a", "list", "in", "JavaScript", "ES6"];
list
  .map((x) => x.length)
  .reduce((x, y) => x + y);

```
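As a side note, the alternating series $1 - \tfrac{1}{3} + \tfrac{1}{5} - \ldots = \frac{\pi}{4}$ from the equation above is easy to check numerically. This is only a sketch; the function name `leibniz_partial_sum` is ours, and the series converges slowly, so many terms are needed for a few correct digits.

```python
import math


def leibniz_partial_sum(n_terms):
    """Sum the first n_terms of 1 - 1/3 + 1/5 - ... (terms k = 1 .. n_terms)."""
    # (-1.0) keeps the division in floating point on Python 2 as well.
    return sum((-1.0) ** (k + 1) / (2 * k - 1) for k in range(1, n_terms + 1))


for n in (10, 1000, 100000):
    approx = leibniz_partial_sum(n)
    print(n, approx, abs(approx - math.pi / 4))
```

The last column is the distance from $\frac{\pi}{4}$; it shrinks as more terms are added.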


## Code autocompletion

In [None]:
from numpy import random

In [None]:
random.ran

Start typing `random.ran` and press `tab` to see the available completions.

In [None]:
random.randn?
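The `?` suffix in the cell above is IPython syntax for showing an object's documentation. Outside IPython, a rough equivalent can be sketched with the standard-library `inspect` module. To stay dependency-free, this example assumes the standard-library `random.gauss` as a stand-in for `numpy.random.randn`.

```python
import inspect
import random  # standard-library module, standing in for numpy.random

# The call signature and the cleaned-up docstring, similar to what `?` displays.
signature = inspect.signature(random.gauss)
doc = inspect.getdoc(random.gauss)

print(signature)
print(doc)
```

The built-in `help(random.gauss)` prints similar information in one call.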

## IPython magic

Cells starting with `%%something` are special: they run IPython cell magics. For example, `%%timeit` measures how long a cell takes to run!

In [None]:
%%timeit
acc = 0
for x in range(1000000):
    acc += 1

In [None]:
%%timeit
acc = 0
for x in range(1000000):
    acc += x**5 - 3 * x**2
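`%%timeit` only works inside IPython. In a plain Python script, a similar measurement can be sketched with the standard-library `timeit` module. The function names and the smaller loop size `N` below are our choices, made so that the example runs quickly.

```python
import timeit

N = 100000  # smaller than in the cells above, so the example finishes quickly


def constant_increment():
    acc = 0
    for x in range(N):
        acc += 1
    return acc


def polynomial_sum():
    acc = 0
    for x in range(N):
        acc += x**5 - 3 * x**2
    return acc


timings = {}
for func in (constant_increment, polynomial_sum):
    # Run each function once per repeat and keep the best of three runs.
    timings[func.__name__] = min(timeit.repeat(func, number=1, repeat=3))
    print("{}: {:.4f} s per run".format(func.__name__, timings[func.__name__]))
```

Taking the minimum of several repeats reduces the influence of other processes running on the machine.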

## Shell commands

Shell commands such as `ls` work only on Unix-based systems (Linux, Mac OS X).

In [None]:
!ls

In [None]:
files = !ls
for f in files:
    # print files whose 4th character is "1"; skip names that are too short
    if len(f) > 3 and f[3] == "1":
        print(f)
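Since `!ls` is unavailable on Windows, a portable variant of the same filter can be sketched with the standard-library `os.listdir`. The helper name `files_with_char_at` is hypothetical.

```python
import os


def files_with_char_at(directory, index, char):
    """Return sorted names in `directory` whose character at `index` equals `char`."""
    return sorted(
        name for name in os.listdir(directory)
        if len(name) > index and name[index] == char  # skip names that are too short
    )


# Same filter as above: names in the current directory whose 4th character is "1".
print(files_with_char_at(".", 3, "1"))
```

Unlike `ls`, `os.listdir` also returns hidden files (names starting with a dot).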

## HTML and JavaScript

In [None]:
from IPython.display import Javascript, HTML

In [None]:
Javascript("alert('It is JavaScript!')")

In [None]:
HTML("We can <i>generate</i> <code>html</code> code <b>directly</b>!")

## Plots

In [None]:
# make plots appear in the notebook
%matplotlib inline

# a plotting library
import matplotlib.pyplot as plt

# a numerical library
import numpy as np

In [None]:
X = np.linspace(-5, 5, 100)  # a vector of a hundred equally spaced points
Y = np.sin(X)                # sine of each X
plt.plot(X, Y)               # a line plot

## Displaying other content

In [None]:
from IPython import display

In [None]:
display.Image(url="http://imgs.xkcd.com/comics/python.png")

In [None]:
display.YouTubeVideo("H6dLGQw9yFQ")

In [None]:
display.Latex(r"$\lim_{x \to 0} (1+x)^{1/x} = e$")

## Interactivity

In [None]:
from IPython.html.widgets import interact

In [None]:
@interact
def greeting(text="World"):
    print("Hello {}".format(text.upper()))

In [None]:
@interact
def greeting(x=2, y=2):
    print("{} times {} is {}".format(x, y, x*y))

## Other notes

* Feel free to write your notes in the notebook.
* There will be exercises. Ones with a star (★) are additional, i.e. don't feel compelled to do them; do them only if you have extra time.