# Machine learning modules

## Numpy

Manipulating `numpy` arrays is an important part of machine learning (and of scientific computing in general) in Python. NumPy arrays resemble Python lists, but they hold elements of a single type, support fast element-wise arithmetic, and can represent vectors and matrices.

In [None]:
import numpy as np

# Generating a random array
X = np.random.rand(3, 5)
print(X)

Accessing elements is similar to list indexing, but with one index per dimension:

In [None]:
# get a single element 
# (here: an element in the first row and column)
X[0, 0]

Access an entire row:

In [None]:
X[1]

Or an entire column:

In [None]:
X[:, 1]

You can also do mathematical operations, like a matrix transpose:

$$\begin{bmatrix}
    1 & 2 & 3 & 4 \\
    5 & 6 & 7 & 8
\end{bmatrix}^T
= 
\begin{bmatrix}
    1 & 5 \\
    2 & 6 \\
    3 & 7 \\
    4 & 8
\end{bmatrix}
$$



In [None]:
X.T

In [None]:
# Creating a 1-D array of 5 evenly spaced numbers from 0 to 12 (endpoints included)
y = np.linspace(0, 12, 5)
print(y)

In [None]:
# Turning the array into a column vector of shape (5, 1)
print(y[:, np.newaxis])
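
As a minimal sketch of what `np.newaxis` does to the shape, the cell below compares the array before and after, and checks that `reshape(-1, 1)` builds the same column. The names `y_col`, `same`, and `grid` are illustrative only.

```python
import numpy as np

y = np.linspace(0, 12, 5)      # 1-D array, shape (5,)
y_col = y[:, np.newaxis]       # 2-D column, shape (5, 1)

print(y.shape, y_col.shape)

# reshape(-1, 1) builds the same column; -1 lets NumPy infer the length
same = np.array_equal(y_col, y.reshape(-1, 1))
print(same)

# Broadcasting a (5, 1) column against a (5,) array gives a (5, 5) grid
grid = y_col * y
print(grid.shape)
```

The same column-times-row broadcasting is what produces a 2-D image from two 1-D arrays in the `imshow` example later in this notebook.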

In [None]:
# Reshaping an array
print(X.reshape(5, 3))
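
Reshaping is easy to confuse with transposing, since both can turn a 3x5 array into a 5x3 one. The sketch below uses a small hand-made array (`A` is an illustrative name, not part of the notebook above) to show that the two operations arrange the elements differently.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

reshaped = A.reshape(3, 2)       # refills row by row: [[0, 1], [2, 3], [4, 5]]
transposed = A.T                 # swaps the axes:     [[0, 3], [1, 4], [2, 5]]

print(reshaped)
print(transposed)
print(np.array_equal(reshaped, transposed))
```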

Arrays of integers or boolean values can also be used as indices:

In [None]:
# Indexing by an array of integers (fancy indexing)
indices = np.array([3, 1, 0])
print(indices)
X[:, indices]
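
Indexing with boolean values works along the same lines. The following is a small sketch on a fixed array (`A`, `mask`, and `col_mask` are illustrative names), so the result does not depend on random numbers.

```python
import numpy as np

A = np.array([[0.2, 0.9, 0.4],
              [0.7, 0.1, 0.8]])

mask = A > 0.5          # boolean array with the same shape as A
print(mask)

# Selecting with a full-shape mask returns a flat 1-D array of the matches
print(A[mask])

# A 1-D boolean array can select along one axis, here the columns
col_mask = np.array([True, False, True])
print(A[:, col_mask])
```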

## matplotlib

Another important part of machine learning is the visualization of data.  The most common
tool for this in Python is [`matplotlib`](http://matplotlib.org).  It is an extremely flexible package, and
we will go over some basics here.

Since we are using Jupyter notebooks, we can use one of IPython's convenient built-in "[magic functions](https://ipython.org/ipython-doc/3/interactive/magics.html)", the "matplotlib inline" mode, which draws plots directly inside the notebook.

In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Plotting a line
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x));

In [None]:
# Scatter-plot points
x = np.random.normal(size=500)
y = np.random.normal(size=500)
plt.scatter(x, y);

`imshow` displays a matrix as if it were an image:

In [None]:
# Showing images using imshow
# - note that origin is at the top-left by default!

x = np.linspace(1, 12, 100)
y = x[:, np.newaxis]

im = y * np.sin(x) * np.cos(y)
print(im.shape)

plt.imshow(im);

In [None]:
# Contour plots 
# - note that origin here is at the bottom-left by default!
plt.contour(im);
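
Because the two functions place the origin differently by default, the image and the contour plot above appear vertically flipped relative to each other. One way to line them up, sketched here under the assumption that a bottom-left origin is wanted, is to pass `origin='lower'` to `imshow`. The axes names `ax_img` and `ax_cont` are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(1, 12, 100)
y = x[:, np.newaxis]
im = y * np.sin(x) * np.cos(y)

fig, (ax_img, ax_cont) = plt.subplots(1, 2, figsize=(8, 4))

# origin='lower' puts row 0 at the bottom, matching contour's default
ax_img.imshow(im, origin='lower')
ax_img.set_title("imshow, origin='lower'")

ax_cont.contour(im)
ax_cont.set_title('contour')
```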

## Pandas

Pandas is a library for data import, export, and manipulation. Unlike NumPy, pandas represents data as DataFrames, which behave more like database tables than like large matrices.

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame({'A' : 1.,
                   'B' : pd.Timestamp('20130102'),
                   'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                   'D' : np.array([3] * 4,dtype='int32'),
                   'E' : pd.Categorical(["test","train","test","train"]),
                   'F' : 'foo' })
df

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.values

In [None]:
df.describe()

In [None]:
df.sort_values(by='E')

In [None]:
df['A']

In [None]:
df['A'] = np.random.rand(4)

In [None]:
df

In [None]:
df[df.A > 0.7]

Importing and exporting data is straightforward with CSV and other formats:

In [None]:
df.to_csv('foo.csv')

In [None]:
with open('foo.csv', 'r') as f:
    print(f.read())

In [None]:
df2 = pd.read_csv('foo.csv')
df2
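
Note that the row index is written to the CSV as an unnamed first column, so a plain `read_csv` brings it back as an extra column. The sketch below shows one way to restore it with `index_col=0`. It uses a small toy frame and an in-memory string instead of `foo.csv`, and the names `csv_text`, `plain`, and `restored` are illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({'A': [0.1, 0.2], 'B': ['x', 'y']})

# With no path argument, to_csv returns the CSV text as a string
csv_text = df.to_csv()

# Default read: the saved index comes back as an extra 'Unnamed: 0' column
plain = pd.read_csv(io.StringIO(csv_text))
print(list(plain.columns))

# index_col=0 tells pandas to use the first column as the index again
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(restored.columns))
```

Alternatively, passing `index=False` to `to_csv` leaves the index out of the file altogether.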

Plotting is simple with pandas:

In [None]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()

In [None]:
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.head()

In [None]:
df.plot()  # creates its own figure, so a separate plt.figure() call is not needed
plt.legend(loc='best');

It is also easy to convert a pandas DataFrame into a NumPy array:

In [None]:
ndf = np.array(df)
type(ndf)

In [None]:
ndf[:5,:]
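
pandas also provides a `to_numpy()` method on DataFrames for the same conversion. The sketch below, on two small toy frames (`numeric` and `mixed` are illustrative names), shows that the dtype of the resulting array depends on the column types.

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
mixed = pd.DataFrame({'a': [1.0, 2.0], 'label': ['x', 'y']})

# All-float columns convert to a float64 array of the same shape
arr = numeric.to_numpy()
print(type(arr), arr.dtype, arr.shape)

# Mixed column types fall back to the generic object dtype
print(mixed.to_numpy().dtype)
```

An object array is much less useful for numerical work, which is one reason to convert all columns to numbers before leaving pandas.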

## Exercise

Use Pandas to import one of the following datasets as a DataFrame:

+ http://archive.ics.uci.edu/ml/datasets/Abalone
+ https://archive.ics.uci.edu/ml/datasets/Yeast
+ http://archive.ics.uci.edu/ml/datasets/Wine

Once you have the data in a Pandas DataFrame, use either NumPy or Pandas to convert all data into numerical form, then normalize each feature `X` to the range [0, 1] by computing `(X - minimum(X))/(maximum(X) - minimum(X))`.
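
As a hint, here is a minimal sketch of the normalization formula on a made-up two-column frame, not on any of the datasets above. The column names and values are invented for illustration, and the sketch assumes that every column is already numeric and not constant (a constant column would divide by zero).

```python
import pandas as pd

# Toy data, not one of the exercise datasets
toy = pd.DataFrame({'length': [2.0, 4.0, 6.0],
                    'weight': [10.0, 15.0, 30.0]})

# min() and max() work column by column, so each feature is scaled separately
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```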

The Pandas documentation can be helpful; the "10 minutes to pandas" guide (https://pandas.pydata.org/pandas-docs/stable/10min.html) is a good starting point.