# MILK machine learning python tool package

# 1. Motivation - one paragraph explaining the main objective of the library and the problem it is trying to solve.
Milk is a machine learning toolkit for Python [1]. Its main features are supervised and unsupervised learning algorithms. It complements other packages by emphasizing speed and low memory usage: the performance-sensitive code is implemented in C++, a compiled language, which gives it a speed advantage over code running in the Python interpreter (as discussed in the class 1 lecture), while the interfaces are written in Python for convenience. The major supervised classifiers in Milk are support vector machines, k-nearest neighbors, random forests, and decision trees. For unsupervised learning, Milk supports k-means clustering (which can cluster millions of data points efficiently) and affinity propagation. Another main feature is feature selection for classification.

# 2. Context - alternative solutions for solving the problem
Other machine learning libraries, including scikit-learn, Theano, and Pylearn2, can also be used to solve classification problems.

# 3. Installation instructions, platform restriction and dependent libraries

### Installation instructions
- Step 1: On Windows, install from the command prompt by entering "pip install milk".
If this step fails, a likely cause is that a required system component, such as a C++ compiler (for example, the Microsoft Visual C++ build tools), is missing or out of date.

- Step 2: An alternative way to install Milk is to download a wheel file (see Installing binary extensions [2]) from this website [3]. Choose the package file that matches the version of Python installed on your computer. The commands 'import sys' and 'print(sys.version)' show which version you are using.
Once the file is downloaded, open the command prompt, go to the directory that contains the wheel file, and type 'pip install downloadedfilename'. This avoids the problem encountered in step 1. The following video demonstrates step 2 if anything is unclear [4].
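
Wheel filenames usually encode a Python tag (such as `cp39`) and a platform string, so it can help to print both before choosing a file. The snippet below is a minimal sketch using only the standard library; the helper name `wheel_hints` is our own and is not part of Milk or pip.

```python
import struct
import sys
import sysconfig


def wheel_hints():
    """Return the CPython tag, platform string and pointer width of this interpreter."""
    python_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    bits = struct.calcsize("P") * 8  # size of a pointer in bits: 32 or 64
    return python_tag, sysconfig.get_platform(), bits


python_tag, platform_name, bits = wheel_hints()
print("Python tag:", python_tag)
print("Platform:", platform_name)
print("Interpreter is {}-bit".format(bits))
```

A wheel whose filename contains the printed Python tag and a matching platform is the one most likely to install cleanly.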

- To check that the installation succeeded, type 'import milk' in Python.
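
The same check can be scripted without actually importing the package, which is convenient when testing several dependencies at once. This is a small sketch based on the standard library's `importlib.util.find_spec`; the helper name `is_installed` is an assumption of ours.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of Milk and of its NumPy dependency.
for name in ("numpy", "milk"):
    status = "found" if is_installed(name) else "missing"
    print("{}: {}".format(name, status))
```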

### platform restriction 
#### Mac users
- From the terminal run: "pip install milk"

### dependent libraries
- Milk depends on NumPy (numerical Python) and is optimized for NumPy arrays. Its support vector machine classifiers are based on libsvm [5].

In [1]:
import milk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
?milk

# 4. Minimal working example

There are many supervised and unsupervised learners to choose from in the Milk package. Milk also lets the user change various parameters to control the classifier inputs (i.e., it is not a black-box package).

Here we demonstrate a minimal working example using the default supervised classifier in the Milk package, which is a support vector machine. The options available to the user can be inspected with the command '?milk.defaultclassifier'.

In [None]:
# Load data from the ex2data CSV file, which has two features, 'X1' and 'X2'. The class labels are in the column named 'class'.
df = pd.read_csv('ex2data.csv')
df.head()

In [None]:
# plot the observations with the two features 'X1' and 'X2'
plt.scatter(df['X1'], df['X2'], c=df['class'], cmap=plt.cm.Paired)
plt.xlabel('X1')
plt.ylabel('X2')

In [None]:
#Initializing the classifier by defining the features and class labels
features = np.array(df.iloc[:,:2])
labels = np.array(df['class'])
classifier = milk.supervised.defaultclassifier()#SVM classifier is the default supervised classifier

#training the model
model_defaultclassifier = classifier.train(features, labels)

In [None]:
#Classify new points
# Testing the model with data point [-2,-3] which belongs to False class
model_defaultclassifier.apply([-2,-3])

In [None]:
# Testing the model with data point [3,2] which  belongs to True class
model_defaultclassifier.apply([3,2])
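
Checking two hand-picked points shows that the model runs, but a held-out split gives a better picture of accuracy. The sketch below assumes only the interface used in the cells above: a learner with `train(features, labels)` that returns a model with `apply(sample)`. The `NearestCentroidLearner` class and the synthetic blobs are hypothetical stand-ins, so the code runs without Milk or `ex2data.csv`.

```python
import numpy as np


def holdout_split(n_samples, test_fraction=0.25, seed=0):
    """Return shuffled (train_idx, test_idx) index arrays."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]


def evaluate(learner, features, labels, test_fraction=0.25, seed=0):
    """Train on one part of the data and return accuracy on the held-out part."""
    train_idx, test_idx = holdout_split(len(labels), test_fraction, seed)
    model = learner.train(features[train_idx], labels[train_idx])
    predicted = np.array([model.apply(x) for x in features[test_idx]])
    return float(np.mean(predicted == labels[test_idx]))


class NearestCentroidModel(object):
    """Predicts the class whose training centroid is closest to the sample."""

    def __init__(self, classes, centroids):
        self.classes = classes
        self.centroids = centroids

    def apply(self, sample):
        distances = np.sum((self.centroids - np.asarray(sample)) ** 2, axis=1)
        return self.classes[np.argmin(distances)]


class NearestCentroidLearner(object):
    """Hypothetical stand-in learner; not part of Milk."""

    def train(self, features, labels):
        classes = np.unique(labels)
        centroids = np.array([features[labels == c].mean(axis=0) for c in classes])
        return NearestCentroidModel(classes, centroids)


# Synthetic two-class data standing in for ex2data.csv.
rng = np.random.RandomState(42)
demo_features = np.vstack([rng.randn(50, 2) - 3.0, rng.randn(50, 2) + 3.0])
demo_labels = np.array([0] * 50 + [1] * 50)

demo_accuracy = evaluate(NearestCentroidLearner(), demo_features, demo_labels)
print("held-out accuracy:", demo_accuracy)
```

With Milk installed, passing `milk.supervised.defaultclassifier()` as the learner together with the `features` and `labels` arrays from the cells above should work the same way, since it exposes the same `train`/`apply` interface.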

# 5. 2-3 examples of typical use-cases

## Building a tree classifier using milk package

In [None]:
#Building a tree using milk package
#Define the tree
tree=milk.supervised.tree_learner(min_split=1)
#Developing a model using the defined tree with features and labels as the inputs
model_treeClassifier = tree.train(features, labels)

In [None]:
# Predictions using the tree model
model_treeClassifier.apply([-2,-3])

In [None]:
model_treeClassifier.apply([3,2])

## SVM classifier using C value and various kernels

In [None]:
# Building an SVM classifier using the milk package. This time we specify the C value and the kernel we want to use.
# The milk.supervised.svm module provides these kernels:
# 'rbf_kernel',
# 'polynomial_kernel',
# 'precomputed_kernel',
# 'dot_kernel',
# The same module also lists these names, which are learners and helpers rather than kernels:
# 'svm_raw',
# 'svm_binary',
# 'svm_to_binary',
# 'svm_sigmoidal_correction',
# 'sigma_value_fisher',
# 'fisher_tuned_rbf_svm',

features = np.array(df.iloc[:,:2])
labels = np.array(df['class'])
classifier = milk.supervised.svm_simple(C=10,kernel=milk.supervised.svm.rbf_kernel(2.**-3))

#training the model
model_SVM_simple = classifier.train(features, labels)

In [None]:
# Predictions using the SVM model
model_SVM_simple.apply([-2,-3])

In [None]:
model_SVM_simple.apply([3,2])
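
To build intuition for the kernel width passed to `rbf_kernel`, the sketch below evaluates a Gaussian kernel by hand. It assumes the form exp(-||x1 - x2||^2 / sigma); whether Milk uses exactly this parameterization should be confirmed against the `rbf_kernel` docstring. The function `rbf` is our own illustration, not Milk's code.

```python
import numpy as np


def rbf(x1, x2, sigma):
    """Gaussian kernel in the assumed form exp(-||x1 - x2||^2 / sigma)."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / sigma))


sigma = 2.0 ** -3  # same width value as in the cell above

print(rbf([3, 2], [3, 2], sigma))      # identical points: exactly 1.0
print(rbf([3, 2], [3.1, 2], sigma))    # nearby point: still close to 1
print(rbf([3, 2], [-2, -3], sigma))    # distant point: practically 0
```

Under this assumption a small sigma makes the kernel very local, so only training points close to a query influence its prediction. The C value, in turn, controls how strongly training errors are penalized relative to the width of the margin.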

## K nearest neighbor supervised classifier using milk package

In [None]:
import milk.supervised.knn

X = np.array([[0,0,0],[1,1,1]])         
Y = np.array([ 1, -1 ])

#Defining the KNN model
kNN = milk.supervised.knn.kNN(k=1)#'k=1' defines no. of neighbors to consider.
#Training the model with vectors X and Y
kNN = kNN.train(X,Y)

In [None]:
#Testing the KNN model
kNN.apply(X[0]) == Y[0]

In [None]:
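#Testing the KNN model: [0,0,1] is nearer to X[0] than to X[1], so the prediction is Y[0] and this comparison with Y[1] is False
kNN.apply([0,0,1]) == Y[1]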
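
The nearest-neighbor rule itself is short enough to write out. The following is a plain NumPy sketch of k-nearest-neighbor voting with squared Euclidean distance; it illustrates the technique and is not Milk's implementation. The names `knn_predict`, `X_demo` and `Y_demo` are ours.

```python
import numpy as np


def knn_predict(train_X, train_y, sample, k=1):
    """Return the majority label among the k training points closest to sample."""
    distances = np.sum((train_X - np.asarray(sample)) ** 2, axis=1)
    nearest = np.argsort(distances)[:k]
    values, counts = np.unique(train_y[nearest], return_counts=True)
    return values[np.argmax(counts)]  # ties resolve to the smallest label


X_demo = np.array([[0, 0, 0], [1, 1, 1]])
Y_demo = np.array([1, -1])

# Squared distances from [0, 0, 1] are 1 (to the first row) and 2 (to the second),
# so the first row's label wins.
print(knn_predict(X_demo, Y_demo, [0, 0, 1]))
```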

## Explaining feature selection using Milk package for supervised learning.

In [None]:
#The dataset Auto has 8 feature vectors in total.

# load data
df1 = pd.read_csv('Auto.csv')

#Converting the horsepower feature from object to float; entries that cannot be parsed become NaN
df1['horsepower'] = pd.to_numeric(df1['horsepower'], errors='coerce')

df1.head()

In [None]:
# Defining the class labels that mark low or high mileage for a given car.
# If the mileage of the car is at or below the median of the mpg column, it is low mileage and is represented by '0'; otherwise it is high mileage, represented by '1'.

mlgmedian=df1['mpg'].median()

# #Creating the response variable based on median of mpg
df1['labels'] = (df1['mpg'] > mlgmedian).astype(int)

df1.head()

In [None]:
# Selecting the most important features using the milk package
import milk.supervised.featureselection
features = np.array(df1.iloc[:,:8])
labels = np.array(df1['labels'])
selected_features = milk.supervised.featureselection.sda(features,labels)
selected_features

As the output of selected_features shows, columns [7,0] of the df1 dataframe, which refer to mpg and cylinders, came out as the important predictors of low or high mileage for a given car.

Column 7, which refers to mpg, should obviously be the most important predictor, since the response variable is defined from the median mpg value.
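
Because the label is derived from mpg, keeping mpg among the candidate features lets the selector find the answer trivially. One way to probe the remaining predictors is to build the feature matrix without the target column. The sketch below uses a few hypothetical toy rows (not taken from Auto.csv) and a helper, `build_features`, that is our own.

```python
import numpy as np
import pandas as pd


def build_features(frame, target, exclude=()):
    """Return (matrix, column_names) of numeric columns, minus target and excluded ones."""
    numeric = frame.select_dtypes(include=[np.number])
    keep = [c for c in numeric.columns if c != target and c not in exclude]
    return numeric[keep].values, keep


# Hypothetical toy rows for illustration only.
toy = pd.DataFrame({
    'mpg': [18.0, 15.0, 32.0, 36.0],
    'cylinders': [8, 8, 4, 4],
    'weight': [3504, 3693, 2130, 1985],
    'name': ['a', 'b', 'c', 'd'],
})
toy['labels'] = (toy['mpg'] > toy['mpg'].median()).astype(int)

matrix, names = build_features(toy, target='mpg', exclude=('labels',))
print(names)
print(matrix.shape)
```

A matrix built this way could then be passed to `milk.supervised.featureselection.sda` as in the cell above; the returned indices would refer to positions in `names` rather than to columns of the original dataframe.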

# Unsupervised learning using Milk package

## K-means clustering (unsupervised)

In [None]:
# Demonstrating Kmeans using Milk package
df3 = pd.read_csv('kmeans_simple.csv')
df3.head()

In [None]:
# visualize data
df3.plot.scatter('x1','x2')

In [None]:
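# Cluster the points into 3 groups; the DataFrame is converted to a NumPy array first, since Milk is optimized for NumPy arrays
cluster_ids,centroids=milk.unsupervised.kmeans(df3.values, 3, distance='euclidean', max_iter=1000)
cluster_ids,centroids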

In [None]:
bb=pd.DataFrame(centroids)
bb

In [None]:
#plotting the centroids along with the datapoints. The 'x' marks on the plot represent the centroids obtained from the kmeans algorithm.
bb.columns=['a','b']
plt.scatter(bb['a'],bb['b'],marker='x',s=100)
plt.scatter(df3['x1'],df3['x2'])
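
For readers who want to see what happens inside a k-means call, here is a minimal NumPy sketch of Lloyd's algorithm, the standard alternation between assigning points to the nearest centroid and recomputing centroids. It illustrates the technique and is not Milk's C++ implementation; the function name `lloyd_kmeans`, its `init` argument and the synthetic blobs are assumptions of ours.

```python
import numpy as np


def lloyd_kmeans(points, k, max_iter=100, init=None, seed=0):
    """Cluster points into k groups; returns (assignments, centroids)."""
    points = np.asarray(points, dtype=float)
    if init is None:
        # Default: pick k distinct data points as starting centroids.
        rng = np.random.RandomState(seed)
        centroids = points[rng.choice(len(points), size=k, replace=False)]
    else:
        centroids = np.array(init, dtype=float)
    assignments = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every centroid.
        distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assignments = distances.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[new_assignments == j]
            if len(members):  # leave an empty cluster's centroid where it was
                new_centroids[j] = members.mean(axis=0)
        converged = (np.array_equal(new_assignments, assignments)
                     and np.allclose(new_centroids, centroids))
        assignments, centroids = new_assignments, new_centroids
        if converged:
            break
    return assignments, centroids


# Three synthetic, well-separated blobs of 30 points each.
rng = np.random.RandomState(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
demo_points = np.vstack([c + 0.5 * rng.randn(30, 2) for c in centers])

# Start from one point of each blob so the outcome does not depend on random initialization.
ids, found = lloyd_kmeans(demo_points, k=3, init=demo_points[[0, 30, 60]])
print(found)
```

With random initialization the algorithm can settle in a poor local optimum, which is why library implementations typically expose a seed or an initial-centroid argument.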

## PCA using Milk package

In [None]:
#Defining random numbers. We considered 4 features in X
np.random.seed(123)
X = np.random.rand(10,4)
X[:,1] += np.random.rand(10)**2*X[:,0] 
X[:,1] += np.random.rand(10)**2*X[:,0] 
X[:,2] += np.random.rand(10)**2*X[:,0] 

#Applying PCA using milk package
Y,V = milk.unsupervised.pca(X, zscore=True)
# Y is the transformed matrix, with the same dimensions as X
# V contains the principal components
Y,V
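
As a cross-check on what a PCA with z-scoring computes, the sketch below standardizes each column and takes a singular value decomposition with NumPy. It is an illustration of the technique, not Milk's implementation, so the signs and orientation of the components may differ from the `Y, V` returned above. The name `pca_sketch` and the `_demo` variables are ours.

```python
import numpy as np


def pca_sketch(X, zscore=True):
    """Return (projected data, principal directions as rows, explained variance ratios)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    if zscore:
        centered = centered / X.std(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    projected = np.dot(centered, Vt.T)
    explained = (S ** 2) / np.sum(S ** 2)
    return projected, Vt, explained


np.random.seed(123)
X_demo = np.random.rand(10, 4)
X_demo[:, 1] += np.random.rand(10) ** 2 * X_demo[:, 0]

Y_demo, V_demo, ratio = pca_sketch(X_demo)
print(ratio)  # fraction of total variance captured by each component
```

The explained-variance ratios indicate how many components are worth keeping when PCA is used for dimensionality reduction.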

# 6. List other interesting or useful features (additional examples are not required)

- All the features of the MILK package are well explained in reference [6].

- SVMs (Support Vector Machines) -  supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis [5 - 7]. 

- k-means - a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining [6, 8].

- Random Forests - an ensemble learning method for classification, regression and other tasks, that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees [6, 9].

- LASSO (least absolute shrinkage and selection operator) - a regression analysis method that performs variable selection and regularization to enhance prediction accuracy and interpretability of a statistical model [6, 10].

- Self-organizing maps - a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional, discretized representation of the input space of the training samples, called a map. This is a method to reduce dimensionality [6, 11].

- Stepwise Discriminant Analysis for feature selection -  a statistical analysis to predict a categorical dependent variable (called a grouping variable) by one or more continuous or binary independent variables (called predictor variables) [6, 12].

- Non-negative matrix factorisation - a group of algorithms in multivariate analysis and linear algebra where a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements [6, 13].

- Affinity propagation - a clustering algorithm based on the concept of "message passing" between data points [6, 14].

# 7. Summary and personal assessment of the library

The Milk package is a powerful machine learning toolkit. Its speed and low memory usage make Milk well suited to large datasets that would otherwise be time-consuming to process with other toolkits. Additionally, the many features Milk encompasses make it a versatile tool for running many different models. As seen from the examples in part 5, these features cover supervised and unsupervised learning, cluster analysis, decision trees, and much more.

In the examples above, Milk performed well for the different scenarios and datasets we used. It also proved flexible in how the inputs to a model are defined. Each of the features we used performed its assigned task correctly and efficiently. We would recommend using the Milk package for any machine learning project.

# 8. References

[1] https://pypi.python.org/pypi/milk/#downloads

[2] https://docs.python.org/3/installing/

[3] http://www.lfd.uci.edu/~gohlke/pythonlibs/

[4] https://www.youtube.com/watch?v=jnpC_Ib_lbc

[5] https://www.csie.ntu.edu.tw/~cjlin/libsvm/

[6] http://pydoc.net/Python/milk/0.6.1/

[7] https://en.wikipedia.org/wiki/Support_vector_machine

[8] https://en.wikipedia.org/wiki/K-means_clustering

[9] https://en.wikipedia.org/wiki/Random_forest

[10] https://en.wikipedia.org/wiki/Lasso_(statistics)

[11] https://en.wikipedia.org/wiki/Self-organizing_map

[12] https://en.wikipedia.org/wiki/Discriminant_function_analysis

[13] https://en.wikipedia.org/wiki/Non-negative_matrix_factorization

[14] https://en.wikipedia.org/wiki/Affinity_propagation