# Data-mining and Introduction to Machine Learning using Python
***

##### Why and what is [IPython notebook](https://ipython.org/notebook.html)? It makes data easy to explore, share, and visualize, and it can be used with R, Julia, and Python

Data-mining, data-crunching, data-munging, data-wrangling. 

It all involves certain activities like data cleaning, data transformation, data preparation, data visualization.

But it all starts with getting ***RAW DATA***

# Getting raw data

#### Before we can validate and analyze data, we need to get the raw data

### An HTTP request

In [None]:
import requests

print(requests.get("http://www.blocket.se/stockholm?q=macbook air").text)

### Our research for buying a new computer

#### Here we parse the HTML into a format that we can use

In [None]:
import lxml.html

page = lxml.html.parse("http://www.blocket.se/stockholm?q=macbook")
# This is probably illegal. Do not use at all
items_data = []
for el in page.getroot().find_class("media item_row ptm pbm nmt"):
    links = el.find_class("item_link")
    images = el.find_class("item_image")
    prices = el.find_class("list_price")
    if links and images and prices and prices[0].text:
        items_data.append({"name": links[0].text,
                           "image": images[0].attrib['src'],
                           "price": int(prices[0].text.split(":")[0].replace(" ", ""))})
print(len(items_data))
items_data
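The price conversion inside the loop above is the step most likely to break, so here is a minimal sketch of it in isolation. The sample strings are assumptions about how a listing might format its price (for example `12 500:-`), and `parse_price` is a hypothetical helper, not part of the notebook.

```python
def parse_price(text):
    """Turn a price string such as '12 500:-' into an int, or None if that fails."""
    if not text:
        return None
    # Keep what comes before the first colon, then drop the spaces
    # used as thousands separators.
    digits = text.split(":")[0].replace(" ", "").strip()
    try:
        return int(digits)
    except ValueError:
        return None

# Hypothetical sample strings, not real listings.
samples = ["12 500:-", " 8 900:-", "Ring för pris", None]
print([parse_price(s) for s in samples])
```

Returning `None` instead of raising lets the caller skip listings without a usable price rather than abort the whole scrape.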

In [None]:
%matplotlib inline
prices = []
prices_for_retina = []
for item in items_data:
    prices.append(item['price'])
    if 'Retina' in item['name']:
        prices_for_retina.append(item['price'])
    
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [None]:
plt.hist(prices, 5, histtype='bar')
plt.hist(prices_for_retina, 2, histtype='bar',color='blue')
plt.show()

# Communicating with APIs

In [None]:
import requests

response = requests.get("https://www.googleapis.com/books/v1/volumes", params={"q":"machine learning"})
raw_data = response.json()
titles = [item['volumeInfo']['title'] for item in raw_data['items']]
titles
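The list comprehension above assumes that the response contains an `items` list and that every item has a title. A response with no hits may lack those keys, so a more defensive version could look like the sketch below. The payload is made up for illustration and only mimics the shape used above; `extract_titles` is a hypothetical helper.

```python
import json

# Hypothetical payload, shaped like the response used above but invented here.
sample = json.loads("""
{
  "items": [
    {"volumeInfo": {"title": "Example Book One"}},
    {"volumeInfo": {}},
    {"volumeInfo": {"title": "Example Book Two"}}
  ]
}
""")

def extract_titles(raw_data):
    """Collect titles, skipping entries that lack one instead of raising KeyError."""
    titles = []
    for item in raw_data.get("items", []):
        title = item.get("volumeInfo", {}).get("title")
        if title is not None:
            titles.append(title)
    return titles

print(extract_titles(sample))
print(extract_titles({}))  # a response without "items" yields an empty list
```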

***
#### APIs - [reddit comments on APIs](https://www.reddit.com/r/webdev/comments/3wrswc/what_are_some_fun_apis_to_play_with/)
* Riot Games
* [import.io](https://www.import.io/) - turns HTML pages to JSON
* Spotify
* Telegram
* SoundCloud
* Reddit
* YouTube
* Wunderground - Weather
* Kandy - Messaging and video calling
* Star Wars - All aboard the hype train
* Marvel Comics
* Mashape - API Market
* [FoaaS](http://www.foaas.com/) - Fuck off as a Service
* BreweryDB - Beer, beer, beer
* Slack
* Geo Names - Names of places
* Common Crawl - Lots of data, petabytes of it
* Programmers API - TV Show Info
* FitBit - FitBit fitness tracker
* JawBone - Jawbone fitness tracker
* Moneypot - BitCoin Gambling
* Steam - PCMR incoming
* Twilio - VoIP & Text Messaging
* Firebase - Build your own API
* IBM Watson - Cognitive Computing with IBM
* Lob - Email Postcards
* Algolia - Search as a Service
* Battle.net - Blizzard
* Free Geo IP - Get geolocation of IP
* The Counted - officer-involved killings in the US
* Wolfram Alpha
* IFTTT
* USDA National Nutrient Database - United States Department of Agriculture
* Twitter
* Nutritionix - Nutrition DB
* Geode Systems - lotsa data
* Programmable Web API's - /thread (though keep going, this is fun)
* Pokémon API
* Open Weather Map - yet more weather

#### Download some stock data using the API from [Quandl](https://www.quandl.com/)

In [None]:
import pandas as pd
import sys
sys.path.append('/usr/local/lib/python3.5/site-packages/quandl/')
import quandl

df = quandl.get('WIKI/GOOGL')

In [None]:
df.head()

### Extracting the "relevant" features: we want the data to be as simple as possible without losing information

In [None]:

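# We do not need the other columns: the adjusted columns carry
# the same information, only in a nicer format.
# .copy() gives an independent frame, so adding columns to it later
# does not trigger pandas' SettingWithCopyWarning.
df = df[['Adj. Open',  'Adj. High',  'Adj. Low',  'Adj. Close', 'Adj. Volume']].copy()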

# high - low percentage per day
df['highlow_percentage'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] * 100.0

# percentage change per day
df['percentage_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0

In [None]:
df = df[['Adj. Close', 'highlow_percentage', 'percentage_change', 'Adj. Volume']]
df.head()
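To make the two derived features concrete, here is a small worked example of the same formulas on a single day. The prices are made up for illustration and are not real GOOGL data; the function names simply mirror the column names used above.

```python
def highlow_percentage(high, low, close):
    """Daily trading range as a percentage of the closing price."""
    return (high - low) / close * 100.0

def percentage_change(open_price, close):
    """Move from open to close as a percentage of the opening price."""
    return (close - open_price) / open_price * 100.0

# Made-up prices for one day.
day = {"open": 98.0, "high": 105.0, "low": 95.0, "close": 100.0}

print(round(highlow_percentage(day["high"], day["low"], day["close"]), 2))  # 10.0
print(round(percentage_change(day["open"], day["close"]), 2))               # 2.04
```

Both features are relative values, so they stay comparable across days even when the absolute price level changes a lot.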

# Machine Learning

Subfield of artificial intelligence. 

***Learning from examples and experience***

* AlphaGo competing in the game Go

* A similar approach is used in learning to play Mario [MarIO](https://www.youtube.com/watch?v=qv6UVOQ0F44)
 * A great video for an introduction on neural networks
 
***Why Machine Learning?***

How would you start identifying a fruit by hand?

Telling an orange from an apple with hand-written rules quickly turns into something like this:
```python
def define_fruit():
    # lots of code
    pass

def detect_colors(image):
    # lots of code
    pass

def analyze_shapes(image):
    # more code
    pass

def guess_texture(image):
    # code.............
    pass
```

Machine learning problems can be separated into a few large categories:

* ***supervised learning***, in which the data comes with additional attributes that we want to predict. It can be either:
 * ***classification***, discrete output variables
 * ***regression***, continuous output variables
* ***unsupervised learning***, in which the training data consists of inputs without any target values.

## Supervised Learning
Features already in place

* Collect training data
* Train Classifier
* Make Predictions

We want to tell oranges from apples

Weight | Texture | Label
--- | --- | ---
150 | Bumpy | Orange
170 | Bumpy | Orange
140 | Soft | Apple
130 | Soft | Apple
... | ... | ...

In [None]:
# input to the classifier
features = [[140, 'soft'], [130,'soft'], [150,'bumpy'],[170,'bumpy']]

# output for the classifier
labels = ['apple', 'apple','orange','orange']

# The classifier needs numeric input, so we encode the text values as ints

# 1 = 'soft', 0 = 'bumpy'
features = [[140, 1], [130,1], [150,0],[170,0]]

# 0 = 'apple', 1 = 'orange'
labels = [0,0,1,1]
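Typing the encoded lists by hand works for four rows but does not scale. The sketch below derives the same encoded lists from the raw values with plain dictionaries; the names `texture_codes`, `label_codes`, and `label_names` are hypothetical and chosen only for this illustration.

```python
# Raw rows as they appear in the table: [weight, texture] and a fruit name.
raw_features = [[140, "soft"], [130, "soft"], [150, "bumpy"], [170, "bumpy"]]
raw_labels = ["apple", "apple", "orange", "orange"]

# Same integer codes as in the cell above.
texture_codes = {"soft": 1, "bumpy": 0}
label_codes = {"apple": 0, "orange": 1}

features = [[weight, texture_codes[texture]] for weight, texture in raw_features]
labels = [label_codes[name] for name in raw_labels]

# Reverse mapping, handy for turning a numeric prediction back into a name.
label_names = {code: name for name, code in label_codes.items()}

print(features)
print(labels)
print(label_names[1])
```

Keeping the mapping in one place also makes it harder to mix up which integer stands for which fruit.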

### Now that we have the data let's make something from it

In [None]:
from sklearn import tree

# the classifier doesn't know anything yet
clf = tree.DecisionTreeClassifier()

Now we will train the classifier on the training data that we have

In [None]:
clf = clf.fit(features,labels)

***Notice that we want to predict something that is not in the training data***

In [None]:
clf.predict([[160,0]])
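To build intuition for what the fitted classifier might be doing, here is a hand-written rule that is consistent with the four training rows. It is an illustration only: the threshold of 145 is an assumption chosen to sit between the heaviest apple and the lightest orange, and the tree that scikit-learn actually learns may split on texture instead.

```python
def classify_by_weight(weight, threshold=145):
    """Hypothetical one-split rule: above the threshold means orange (1), else apple (0)."""
    return 1 if weight > threshold else 0

# Encoded training rows from above: ([weight, texture], label).
training_rows = [([140, 1], 0), ([130, 1], 0), ([150, 0], 1), ([170, 0], 1)]

# The rule reproduces every training label...
print(all(classify_by_weight(row[0]) == label for row, label in training_rows))
# ...and also gives an answer for the unseen fruit with weight 160.
print(classify_by_weight(160))
```

The difference with machine learning is that nobody has to pick the threshold by hand: the classifier finds a suitable split from the examples.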

# Classic Machine Learning Problem

## [Iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set)

Classify the type of flower based on
* sepal width
* sepal length
* petal width
* petal length

scikit-learn comes with some practice [datasets](http://scikit-learn.org/stable/datasets/)

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

print(iris.feature_names)
print(iris.target_names)
print(iris.data[0])
print(iris.target[0])

# Preparing Data
Testing data are examples used to "test" the classifier's accuracy.

They will ***NOT*** be part of the training data.

A common convention is ***80% training*** and ***20% testing***, but it is by no means a rule to apply blindly from the start.
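An 80/20 split can be sketched with nothing but the standard library, as below. `split_indices` is a hypothetical helper written for this illustration; the seed and the 20% default are assumptions, not values taken from the notebook.

```python
import random

def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

# The Iris data set has 150 samples.
train_idx, test_idx = split_indices(150)
print(len(train_idx), len(test_idx))
```

Shuffling before splitting matters when the data is ordered by class, because otherwise the test set could end up with examples of a single class. scikit-learn offers `train_test_split` in `sklearn.model_selection` for the same job.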

### Testing data

In [None]:
# The dataset is ordered by species, with 50 samples each,
# so indices 0, 50 and 100 give us one test example per species

test_index = [0,50,100]

### Training data

In [None]:
# NumPy arrays are implemented in C, so array operations are pretty fast
import numpy as np

# training data
train_target = np.delete(iris.target, test_index)
train_data = np.delete(iris.data, test_index, axis=0)

# testing data
test_target = iris.target[test_index]
test_data = iris.data[test_index]

## Training classifier

In [None]:
clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

# We expect the classifier to give us [0,1,2] from test_target
print(test_target)

print(clf.predict(test_data))
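Comparing the two printed arrays by eye works for three test flowers, but a single number is easier to track. The sketch below computes plain accuracy, the fraction of predictions that match the expected labels. `accuracy` is a hypothetical helper and the second set of predictions is made up to show a miss.

```python
def accuracy(predicted, expected):
    """Fraction of predictions that match the expected labels."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if len(expected) == 0:
        raise ValueError("cannot score an empty test set")
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)

# Expected labels match test_target above; the predictions are invented.
print(accuracy([0, 1, 2], [0, 1, 2]))  # all three correct
print(accuracy([0, 2, 2], [0, 1, 2]))  # one of three wrong
```

With only three test examples the score is very coarse, which is one reason to hold out a larger share of the data for testing.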

# DONE! We have now built our first machine learning model

### Classification
* Naive Bayes
* Decision trees / Random Forest
* SVM
* KNN
* ... many more!

Recall the features that we have, we need to evaluate them carefully.


Next we will visualize the decision tree to look in more depth at what actually happens.

The code is taken from the scikit-learn documentation on [decision tree visualization](http://scikit-learn.org/stable/modules/tree.html)

## Visualize the Decision Tree Classifier

In [None]:
import sys
sys.path.append('/usr/local/lib/python3.5/site-packages/pydot/')

import pydotplus
# sklearn.externals.six is gone from newer scikit-learn releases; io.StringIO works on Python 3
from io import StringIO
from IPython.display import Image  
dot_data = StringIO()  
tree.export_graphviz(clf, out_file=dot_data,  
                         feature_names=iris.feature_names,  
                         class_names=iris.target_names,  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) 
graph.write_pdf("iris.pdf")
Image(graph.create_png()) 

In [None]:
# first testing flower
print(test_data[0], test_target[0])

print(iris.feature_names)
print("targets: ",iris.target_names)

In [None]:
# second testing flower
print(test_data[1], test_target[1])




# Parting words!

### *Choosing good features is one of your most important jobs*

because the ML models are dumb....

### *No free hunch*

* Be skeptical and try to find loopholes in the data
* Models have a score function and can be cross-validated, but that is for another time


# Hope it was fun everyone! Please give feedback on what can be improved!