# Chapter 1: Getting Started with Python Machine Learning


+ Python is a flexible language for rapid prototyping.
+ The underlying algorithms in its numerical libraries are written in optimized C or C++.
+ Thus the resulting code is fast and robust enough to be used in production as well.


+ Code for the book is available at https://github.com/luispedro/BuildingMachineLearningSystemsWithPython


+ Machine Learning is also referred to as Data Mining or Predictive Analysis.


#### Machine Learning Workflow (most of the time will be spent in rather mundane tasks)
+ Reading in the data and cleaning it
+ Exploring and understanding the input data
+ Analyzing how best to present the data to the learning algorithm
+ Choosing the right model and learning algorithm
+ Measuring the performance correctly
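
As a rough illustration of how these steps fit together, the sketch below runs them on a tiny synthetic dataset using only NumPy. The data, the variable names, and the choice of a straight-line model fitted with `np.polyfit` are assumptions made for this example, not something the notes prescribe.

```python
import numpy as np

# Hypothetical raw data: inputs (x) and measurements (y); one reading is missing.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 3.0, np.nan, 7.0, 9.0, 11.0])

# 1. Read and clean: drop rows where the target is missing.
valid = ~np.isnan(y)
x_clean, y_clean = x[valid], y[valid]

# 2. Explore: basic summary statistics.
print("rows kept:", x_clean.size, "mean y:", y_clean.mean())

# 3./4. Present the data and choose a model: a degree-1 polynomial (straight line).
slope, intercept = np.polyfit(x_clean, y_clean, 1)

# 5. Measure performance: mean squared error on the data used for fitting.
predictions = slope * x_clean + intercept
mse = np.mean((predictions - y_clean) ** 2)
print("slope=%.2f intercept=%.2f mse=%.6f" % (slope, intercept, mse))
```

In a real project the error would normally be measured on data held out from fitting; the sketch skips that to stay short.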

#### Feature Engineering
Often you will not feed the data directly into your machine learning algorithm. Instead, you will find that you can refine parts of the data before training, and the algorithm will often reward you with increased performance. A simple algorithm on refined data generally outperforms a very sophisticated algorithm on raw data. This part of the machine learning workflow is called feature engineering, and it is usually an exciting and rewarding challenge.
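
A minimal sketch of the idea, assuming a hypothetical target that depends on the square of the input: the same simple least-squares model fits poorly on the raw value and almost perfectly once a squared feature is supplied. The data and the helper name `fit_error` are invented for illustration.

```python
import numpy as np

# Hypothetical data: the target depends on the square of the raw input.
x = np.linspace(-3.0, 3.0, 50)
y = x ** 2 + 1.0

def fit_error(features, target):
    """Fit ordinary least squares (with intercept) and return the mean squared error."""
    design = np.column_stack([features, np.ones(len(target))])
    coeffs, _, _, _ = np.linalg.lstsq(design, target, rcond=None)
    return np.mean((design @ coeffs - target) ** 2)

raw_error = fit_error(x, y)              # raw feature only
engineered_error = fit_error(x ** 2, y)  # engineered feature: x squared

print("MSE with raw feature:        %.4f" % raw_error)
print("MSE with engineered feature: %.4f" % engineered_error)
```

The model is the same in both calls; only the representation of the input changes.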

#### When Stuck, Use
+ http://metaoptimize.com/qa
+ http://stats.stackexchange.com
+ http://stackoverflow.com
+ #machinelearning on Freenode IRC (https://freenode.net/)
+ http://www.TwoToReal.com

#### Machine Learning Blogs
+ http://blog.kaggle.com

#### SciPy.org [SciPy, Pandas, NumPy, Matplotlib, SymPy, IPython]  
Python is an interpreted language. One reason for its popularity in numerical work is that it can offload number-crunching tasks to lower-level C or Fortran extensions such as NumPy (which provides highly optimized multidimensional arrays) and SciPy (which uses NumPy data structures to provide a set of fast numerical recipes).
+ http://www.scipy-lectures.org/index.html#
+ http://scipy.org/docs.html

## NumPy


In [49]:
import sys
print("-- python version: %s" % sys.version)

# don't do the following in order to avoid name collisions
# from numpy import *
# instead use this
import numpy as np
print("-- numpy version: %s" % np.version.full_version)
print("-- numpy version: %s" % np.__version__)
import scipy as sp
print("-- scipy version: %s" % sp.__version__ )
import sklearn as skl
print("-- scikit-learn version: %s" % skl.__version__)
import matplotlib as mpl
print("-- matplotlib version: %s" % mpl.__version__)

-- python version: 3.5.2 |Anaconda 4.0.0 (64-bit)| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
-- numpy version: 1.10.4
-- numpy version: 1.10.4
-- scipy version: 0.17.0
-- scikit-learn version: 0.17.1
-- matplotlib version: 1.5.1


In [48]:
## NumPy array
a=np.array([0,1,2,3,4,5])
print("-- numpy array: %s" % a)
print("-- dimensions: %s" % a.ndim)
print("-- shape: "); print(a.shape)

-- numpy array: [0 1 2 3 4 5]
-- dimensions: 1
-- shape: 
(6,)


In [41]:
## Reshape a into a 2-dimensional array (3 rows, 2 columns).
b=a.reshape((3,2))
print("-- 3x2 matrix: "); print(b)
print("-- dimensions: %s" % b.ndim)
print("-- shape: "); print(b.shape)

-- 3x2 matrix: 
[[0 1]
 [2 3]
 [4 5]]
-- dimensions: 2
-- shape: 
(3, 2)
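
As a small additional sketch, `reshape` can infer one dimension when it is given as `-1`, and it raises a `ValueError` if the requested shape does not match the number of elements. The array name `m` is only for illustration.

```python
import numpy as np

m = np.arange(6)            # [0 1 2 3 4 5]

# Let NumPy infer the number of rows from the array size.
two_cols = m.reshape((-1, 2))
print(two_cols.shape)       # (3, 2)

# Flatten back to one dimension.
flat = two_cols.ravel()
print(flat.ndim)            # 1

# A shape that does not match the 6 elements is rejected.
try:
    m.reshape((4, 2))
except ValueError as error:
    print("ValueError: " + str(error))
```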


In [42]:
## NumPy avoids copies where possible: reshape() returned a view, so writing to b also changes a.
b[1,0]=77
print("-- b: "); print(b)
print("-- a: "); print(a)

## For an independent copy, use copy().
c=a.reshape((3,2)).copy()
print("-- c: "); print(c)
c[0,0]=-99
print("-- a: "); print(a)
print("-- c: "); print(c)

-- b: 
[[ 0  1]
 [77  3]
 [ 4  5]]
-- a: 
[ 0  1 77  3  4  5]
-- c: 
[[ 0  1]
 [77  3]
 [ 4  5]]
-- a: 
[ 0  1 77  3  4  5]
-- c: 
[[-99   1]
 [ 77   3]
 [  4   5]]
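
One way to check whether two arrays share the same data is `np.shares_memory`, which is available in reasonably recent NumPy releases. The sketch below uses fresh arrays (the names `src`, `view`, and `independent` are hypothetical) so that it does not disturb `a`.

```python
import numpy as np

src = np.arange(6)
view = src.reshape((3, 2))                # shares data with src
independent = src.reshape((3, 2)).copy()  # separate memory

print(np.shares_memory(src, view))         # True
print(np.shares_memory(src, independent))  # False

view[1, 0] = 77           # visible through src
independent[0, 0] = -99   # not visible through src
print(src)                # [ 0  1 77  3  4  5]
```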


In [43]:
## Operations are propagated to the individual elements.
d=np.array([1,2,3,4,5])
print("-- d: "); print(d)
print("-- d*2: "); print(d*2)
print("-- d**2: "); print(d**2)

## Contrast this with ordinary Python lists.
o=[1,2,3,4,5]
print("-- o: "); print(o)
print("-- o*2: "); print(o*2)
print("-- o**2: "); 
try:
    print(o**2)
except TypeError as typeError:
    print("TypeError: " + str(typeError))

-- d: 
[1 2 3 4 5]
-- d*2: 
[ 2  4  6  8 10]
-- d**2: 
[ 1  4  9 16 25]
-- o: 
[1, 2, 3, 4, 5]
-- o*2: 
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
-- o**2: 
TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'
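
To get element-wise behaviour from a plain list, one would need an explicit loop or comprehension, or convert the list to an array first. A short sketch (the name `values` is illustrative):

```python
import numpy as np

values = [1, 2, 3, 4, 5]

# Plain Python: an explicit comprehension over the elements.
squares_list = [v ** 2 for v in values]

# NumPy: convert once, then operate on the whole array.
squares_array = np.asarray(values) ** 2

print(squares_list)    # [1, 4, 9, 16, 25]
print(squares_array)   # [ 1  4  9 16 25]
```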


In [44]:
## NumPy indexing.
print("-- a[np.array([2,3,4])]: "); print(a[np.array([2,3,4])])
print("-- a>4: "); print(a>4)
print("-- a[a>4]: "); print(a[a>4])
print("-- trim outliers; a[a>4]=4; a: "); a[a>4]=4; print(a)
print("-- clip outliers; a.clip(0,4): "); print(a.clip(0,4))

-- a[np.array([2,3,4])]: 
[77  3  4]
-- a>4: 
[False False  True False False  True]
-- a[a>4]: 
[77  5]
-- trim outliers; a[a>4]=4; a: 
[0 1 4 3 4 4]
-- clip outliers; a.clip(0,4): 
[0 1 4 3 4 4]
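
The two approaches differ in one respect worth noting: assigning through a boolean mask changes the array in place, while `clip` returns a new array and leaves the original untouched. A sketch on a fresh array (`readings` is a hypothetical name):

```python
import numpy as np

readings = np.array([0, 1, 77, 3, 4, 5])

clipped = readings.clip(0, 4)   # new array; readings is unchanged
print(clipped)                  # [0 1 4 3 4 4]
print(readings)                 # [ 0  1 77  3  4  5]

readings[readings > 4] = 4      # in-place modification through a boolean mask
print(readings)                 # [0 1 4 3 4 4]
```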


In [45]:
## NumPy NaN (use np.nan; the np.NAN alias is not available in newer NumPy releases).
e=np.array([1,2,np.nan,3,4])
print("-- e: "); print(e)
print("-- np.isnan(e): "); print(np.isnan(e))
print("-- e[~np.isnan(e)]"); print(e[~np.isnan(e)])
print("-- np.mean(e[~np.isnan(e)])"); print(np.mean(e[~np.isnan(e)]))

-- e: 
[  1.   2.  nan   3.   4.]
-- np.isnan(e): 
[False False  True False False]
-- e[~np.isnan(e)]
[ 1.  2.  3.  4.]
-- np.mean(e[~np.isnan(e)])
2.5
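
NumPy also ships NaN-aware reductions such as `np.nanmean`, which skip missing values without explicit masking. A small sketch, with `samples` as an illustrative name:

```python
import numpy as np

samples = np.array([1, 2, np.nan, 3, 4])

print(np.mean(samples))                      # nan: a single NaN poisons the plain mean
print(np.mean(samples[~np.isnan(samples)]))  # 2.5
print(np.nanmean(samples))                   # 2.5
```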


In [46]:
## Comparing runtime
## We should strive to use highly optimized NumPy or SciPy extension functions (e.g., dot()).
import timeit as tt
normal_py_sec=tt.timeit('sum(x*x for x in range(1000))',number=10000)
print("-- Normal Python: %f sec" % normal_py_sec)
## Naive NumPy: the multiplication is vectorized, but Python's built-in sum() still loops over the elements.
naive_np_sec=tt.timeit('sum(na*na)',setup="import numpy as np; na=np.arange(1000)",number=10000)
print("-- Naive Numpy: %f sec" % naive_np_sec)
good_np_sec=tt.timeit('na.dot(na)',setup="import numpy as np; na=np.arange(1000)",number=10000)
print("-- Good Numpy: %f sec" % good_np_sec)

-- Normal Python: 0.981800 sec
-- Naive Numpy: 0.890005 sec
-- Good Numpy: 0.014418 sec
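
Before comparing timings it is worth confirming that the three variants compute the same value. The sketch below does that and then times each variant as a callable. Absolute numbers depend on the machine and library versions, so none are shown here; the labels and the reduced repeat count are choices made for this example.

```python
import timeit
import numpy as np

na = np.arange(1000)

plain = sum(x * x for x in range(1000))  # pure Python loop
naive = sum(na * na)                     # NumPy multiply, but Python-level sum
good = na.dot(na)                        # multiply and sum inside NumPy

print(plain, int(naive), int(good))      # all three should match

for label, func in [
    ("plain Python", lambda: sum(x * x for x in range(1000))),
    ("naive NumPy", lambda: sum(na * na)),
    ("NumPy dot", lambda: na.dot(na)),
]:
    seconds = timeit.timeit(func, number=1000)
    print("-- %s: %f sec" % (label, seconds))
```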


In [47]:
## Speed comes at a price
## NumPy arrays can hold only one data type
f=np.array([1,2,3])
print("-- f: "); print(f)
print("-- f data type: "); print(f.dtype)

## NumPy coerces to a common data type that can represent every element
g=np.array([1,"stringy"])
print("-- g: "); print(g)
print("-- g data type: "); print(g.dtype)
h=np.array([1,"stringy",set([1,2,3])])
print("-- h: "); print(h)
print("-- h data type: "); print(h.dtype)

-- f: 
[1 2 3]
-- f data type: 
int32
-- g: 
['1' 'stringy']
-- g data type: 
<U11
-- h: 
[1 'stringy' {1, 2, 3}]
-- h data type: 
object
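
When the intended type is known, it can be stated explicitly rather than left to coercion. A brief sketch (the array names are illustrative):

```python
import numpy as np

# Mixing ints and floats promotes everything to float.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)               # float64

# Request a dtype up front...
as_float = np.array([1, 2, 3], dtype=np.float64)
print(as_float.dtype)            # float64

# ...or convert an existing array; astype returns a new array.
as_int = mixed.astype(np.int64)  # fractional parts are truncated
print(as_int)                    # [1 2 3]
```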
