# Welcome to the ACM Machine Learning Subcommittee! 

Python libraries we will be using:

* NumPy
    - numerical computation
* Pandas 
    - handling heterogeneous, non-numerical data
* Scikit-Learn 
    - training and evaluating models
    - fetching datasets
* Matplotlib 
    - pretty pictures
    - pyplot: MATLAB-like syntax in Python

In [2]:
from __future__ import print_function
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt

# Numpy Primer

Main data structure: ndarray, an N-dimensional array whose elements all share one data type

https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.ndarray.html
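
Before creating matrices with the helper functions, it can help to see what an ndarray carries around with it. The following is a minimal sketch; the array `arr` and its values are made up for illustration.

```python
import numpy as np

# Build a small 2-D array from a nested Python list (values are illustrative).
arr = np.array([[1, 2, 3],
                [4, 5, 6]], dtype=np.float64)

print(arr.ndim)   # number of dimensions (axes)
print(arr.shape)  # length along each axis
print(arr.dtype)  # one element type shared by the whole array
print(arr.size)   # total number of elements
```

Every element shares the same dtype, which is what lets NumPy store the data in one contiguous block and run fast vectorized operations on it.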

In [15]:
zero_matrix = np.zeros(shape=(3,3))
ones_matrix = np.ones(shape=(3,5))
rand_matrix = np.random.rand(5,2)

print(zero_matrix, zero_matrix.shape)
print(ones_matrix, ones_matrix.shape)
print(rand_matrix, rand_matrix.shape)

[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]] (3, 3)
[[ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]] (3, 5)
[[ 0.40183315  0.04235873]
 [ 0.84510916  0.99702359]
 [ 0.56819211  0.90863983]
 [ 0.25558301  0.27273653]
 [ 0.44254216  0.36666297]] (5, 2)


In [21]:
mult_matrix = np.dot(ones_matrix, rand_matrix)  # for 2-D arrays, np.dot is matrix multiplication (also handles matrix-vector products)

print(mult_matrix, mult_matrix.shape) # (3 x 5) * (5 x 2) => (3 x 2)

[[ 2.5132596   2.58742165]
 [ 2.5132596   2.58742165]
 [ 2.5132596   2.58742165]] (3, 2)
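
A small sketch of the shape rule for matrix multiplication, assuming Python 3.5 or newer so that the `@` operator is available. The arrays `a` and `b` are illustrative stand-ins with fixed values so the result is reproducible.

```python
import numpy as np

a = np.ones((3, 5))
b = np.arange(10, dtype=float).reshape(5, 2)  # fixed values instead of random ones

product_dot = np.dot(a, b)
product_at = a @ b  # for 2-D arrays, @ gives the same result as np.dot

print(product_at.shape)
print(np.array_equal(product_dot, product_at))

# Inner dimensions must agree: (5 x 2) @ (3 x 5) is invalid because 2 != 3.
try:
    b @ a
except ValueError as err:
    print("shape mismatch:", err)
```

Multiplying by a matrix of ones sums each column of `b`, which is why every row of the product is identical, just like in the output above.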


# Array slicing

Shorthand syntax for accessing sub-arrays by 'slicing' along the array's dimensions

Suppose we only want rows 3 (inclusive) to 5 (exclusive) and columns 4 (inclusive) to 7 (exclusive). We would use the following line:

    array[3:5, 4:7]
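
To make that concrete, here is a hedged sketch on a made-up 6 x 8 array named `grid`, where each entry encodes its own position so it is easy to see which rows and columns were selected.

```python
import numpy as np

# A 6 x 8 grid where each entry is 10*row + column (illustrative data).
grid = np.array([[10 * r + c for c in range(8)] for r in range(6)])

sub = grid[3:5, 4:7]  # rows 3 and 4, columns 4, 5 and 6
print(sub)
print(sub.shape)
```

The stop index is excluded in both dimensions, so the result has 5 - 3 = 2 rows and 7 - 4 = 3 columns.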

In [39]:
# array[start : stop]
# array[x_start : x_stop, y_start : y_stop, ...]

print(zero_matrix[:,:]) # the full matrix
print()
print(zero_matrix[2,:]) # just the bottom row
print()
print(ones_matrix[0, 2:5]) # row 0, columns 2,3,4 => shape=(3,)
print()
print(rand_matrix[:3, 0:]) # rows 0,1,2, columns 0,1 => shape=(3,2)

[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]

[ 0.  0.  0.]

[ 1.  1.  1.]

[[ 0.40183315  0.04235873]
 [ 0.84510916  0.99702359]
 [ 0.56819211  0.90863983]]
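
Two details about slicing are easy to miss: an integer index drops its axis while a slice keeps it, and basic slices are views onto the original data rather than copies. A minimal sketch, using a fresh array `m` so the matrices above are left alone:

```python
import numpy as np

m = np.zeros((3, 3))

row_1d = m[2, :]    # integer index drops that axis -> shape (3,)
row_2d = m[2:3, :]  # length-one slice keeps the axis -> shape (1, 3)
print(row_1d.shape, row_2d.shape)

# Basic slices are views onto the same memory, so writes show up in the original.
row_1d[0] = 7.0
print(m[2, 0])

# Use .copy() when an independent array is needed.
independent = m[2, :].copy()
independent[1] = 9.0
print(m[2, 1])
```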


# Pandas

But what if I want to use heterogeneous, possibly non-numerical data?

pd.DataFrame

* Pandas' main data structure
* Heterogeneous data
* Supports non-numerical data
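
As a quick sketch of what "heterogeneous" means here, the DataFrame below mixes text, integer, and floating-point columns. The column names and values are hypothetical, invented only for illustration.

```python
import pandas as pd

# Hypothetical records: one text column, one integer column, one float column.
df = pd.DataFrame({
    "label": ["a", "b", "c"],
    "count": [3, 10, 7],
    "ratio": [0.25, 0.8, 0.5],
})

print(df.dtypes)              # each column has its own dtype
print(df["count"].sum())      # column-wise arithmetic on a numeric column
print(df[df["ratio"] > 0.4])  # boolean filtering keeps whole rows
```

Unlike an ndarray, where one dtype covers every element, each DataFrame column keeps its own type.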

Example Dataset: Boston Housing

https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

In [50]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
from sklearn.datasets import load_boston

"""
BOSTON HOUSING DATASET
    - 506 samples
    - 13 features
    - 1  target value (MEDV: median home value in $1000s)
"""

boston_housing = load_boston()
print(boston_housing.data.shape)
print(boston_housing.target.shape)

(506, 13)
(506,)


In [None]:
X = boston_housing.data
y = boston_housing.target
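
One possible next step is to wrap the features and target in a DataFrame so the columns have names. The sketch below is an assumption about how that might look, not part of the original notebook. So that it runs without `load_boston`, it uses random placeholder arrays with the same shapes as `X` and `y`; they are NOT the real data.

```python
import numpy as np
import pandas as pd

# Placeholder arrays with the same shapes as boston_housing.data / .target.
# These are random numbers, NOT the real dataset.
rng = np.random.RandomState(0)
X = rng.rand(506, 13)
y = rng.rand(506)

# Feature names as documented for the Boston Housing dataset.
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

df = pd.DataFrame(X, columns=feature_names)
df["MEDV"] = y  # target column: median home value

print(df.shape)
print(df.head())
```

With the real dataset loaded, `boston_housing.feature_names` should supply the same column names, so the hard-coded list would not be needed.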