>### HW6.0. 
>In mathematics, computer science, economics, or management science what is mathematical optimization?  
Give an example of an optimization problem that you have worked with directly or that your organization has worked on.  
Please describe the objective function and the decision variables.  
Was the project successful (deployed in the real world)?  
Describe.

>### HW6.1 
>Optimization theory:  
For unconstrained univariate optimization, what are the first order Necessary Conditions for Optimality (FOC)?  
What are the second order optimality conditions (SOC)?  
Give a mathematical definition.  
Also, in Python, plot the univariate function 
$x^3 - 12x^2 - 6$ defined over the real domain -6 to +6. 

>Also plot its corresponding first and second derivative functions. Eyeballing these graphs, identify candidate optimal points and then classify them as local minima or maxima.  
Highlight and label these points in your graphs. Justify your responses using the FOC and SOC.
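
One possible way to approach the plotting and classification task is sketched below. It is a minimal sketch, not a reference solution: the function names `classify_critical_points` and `plot_function_and_derivatives` and the output file name `hw6_1.png` are hypothetical, and the polynomial is represented with `numpy.poly1d` so that the derivatives come from `deriv` rather than being typed by hand.

```python
import numpy as np

# f(x) = x^3 - 12x^2 - 6 and its first two derivatives as polynomials
f = np.poly1d([1, -12, 0, -6])
f_prime = f.deriv(1)         # 3x^2 - 24x
f_double_prime = f.deriv(2)  # 6x - 24


def classify_critical_points(lo=-6.0, hi=6.0):
    """Return (x, label) for each real root of f' inside [lo, hi]."""
    results = []
    for root in f_prime.roots:
        if abs(root.imag) > 1e-12 or not (lo <= root.real <= hi):
            continue  # FOC candidates must be real and inside the domain
        x = float(root.real)
        curvature = f_double_prime(x)
        if curvature > 0:
            label = "local minimum"
        elif curvature < 0:
            label = "local maximum"
        else:
            label = "inconclusive (f'' = 0)"
        results.append((x, label))
    return results


def plot_function_and_derivatives(path="hw6_1.png", lo=-6.0, hi=6.0):
    """Plot f, f' and f'' on [lo, hi] and mark the classified critical points."""
    # Imported here so the classification above runs without matplotlib installed
    import matplotlib.pyplot as plt

    xs = np.linspace(lo, hi, 400)
    fig, axes = plt.subplots(3, 1, sharex=True, figsize=(6, 9))
    curves = zip(axes, (f, f_prime, f_double_prime), ("f(x)", "f'(x)", "f''(x)"))
    for ax, poly, name in curves:
        ax.plot(xs, poly(xs))
        ax.axhline(0.0, color="gray", linewidth=0.5)
        ax.set_ylabel(name)
    for x, label in classify_critical_points(lo, hi):
        axes[0].plot([x], [f(x)], "ro")
        axes[0].annotate(label, (x, f(x)), textcoords="offset points", xytext=(5, 10))
    axes[-1].set_xlabel("x")
    fig.savefig(path)
    plt.close(fig)
```

Since $f'(x) = 3x(x - 8)$, the FOC candidates are $x = 0$ and $x = 8$, and only $x = 0$ lies in the domain. There $f''(0) = -24 < 0$, so the SOC classifies it as a local maximum. Calling `plot_function_and_derivatives()` writes the three stacked plots to the assumed file name.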

>For unconstrained multi-variate optimization, what are the first order Necessary Conditions for Optimality (FOC)?  
What are the second order optimality conditions (SOC)?  
Give a mathematical definition.  
What is the Hessian matrix in this context?
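
As a hedged illustration of how the multivariate conditions can be checked numerically, the sketch below uses a hypothetical example function $g(x, y) = x^2 + 3y^2 - 2x$, which is not part of the assignment. It tests the FOC by checking that the gradient vanishes, then uses the signs of the Hessian eigenvalues for the SOC.

```python
import numpy as np


def gradient(p):
    """Gradient of the hypothetical example g(x, y) = x^2 + 3y^2 - 2x."""
    x, y = p
    return np.array([2.0 * x - 2.0, 6.0 * y])


def hessian(p):
    """Hessian of g: the matrix of second partial derivatives (constant here)."""
    return np.array([[2.0, 0.0], [0.0, 6.0]])


def classify_stationary_point(p, tol=1e-9):
    # FOC: the gradient must vanish at a candidate point
    if np.linalg.norm(gradient(p)) > tol:
        return "not stationary"
    # SOC: inspect the signs of the eigenvalues of the (symmetric) Hessian
    eigenvalues = np.linalg.eigvalsh(hessian(p))
    if np.all(eigenvalues > tol):
        return "local minimum"      # positive definite
    if np.all(eigenvalues < -tol):
        return "local maximum"      # negative definite
    if np.any(eigenvalues > tol) and np.any(eigenvalues < -tol):
        return "saddle point"       # indefinite
    return "inconclusive"           # semidefinite: the SOC does not decide
```

For this example, `classify_stationary_point([1.0, 0.0])` reports a local minimum, because the gradient is zero there and both Hessian eigenvalues (2 and 6) are positive.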

>### HW6.3 Convex optimization 
>What makes an optimization problem convex?  
What are the first order Necessary Conditions for Optimality in convex optimization?  
What are the second order optimality conditions for convex optimization?  
Are both necessary to determine the maximum or minimum of candidate optimal solutions?

>Fill in the BLANKS here:  
Convex minimization, a subfield of optimization, studies the problem of minimizing BLANK functions over BLANK sets. The BLANK property can make optimization in some sense "easier" than the general case - for example, any local minimum must be a global minimum.

>### HW 6.4
>The learning objective function for weighted ordinary least squares (WOLS) (aka weighted linear regression) is defined as follows:  

>$0.5 * \sum_i{ weight_i * (W X_i - y_i)^2}$, where the sum runs over the training examples $i$

>Here the training set consists of input variables X (in vector form) and a target variable y, and W is the vector of coefficients for the linear regression model.

>Derive the gradient for this weighted OLS by hand, showing and explaining each step.



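<strong>Step 1:</strong> $1/2*\sum_i{ weight_i*(WX_i-y_i)^2}$
> This is the starting objective, summed over the training examples $i$. For readability, $W$ and $X_i$ are treated as scalars here; in the vector case $WX_i$ is a dot product and the partial derivative with respect to each component of $W$ follows the same steps.  

<strong>Step 2:</strong> $1/2*\sum_i{ weight_i*(WX_i-y_i)*(WX_i-y_i)}$
> Write the square as a product before taking the derivative with respect to W  

<strong>Step 3:</strong> $1/2*\sum_i{ weight_i*(W^2X_i^2-2WX_iy_i+y_i^2)}$
> Multiply out the terms and combine  

<strong>Step 4:</strong> $1/2*\sum_i{ weight_i*(2WX_i^2-2X_iy_i)}$
> Take the derivative with respect to W; the $y_i^2$ term does not depend on W and drops out

<strong>Step 5:</strong> $1/2*\sum_i{ weight_i*2X_i(WX_i-y_i)}$
> Factor $2X_i$ out of the remaining terms

<strong>Step 6:</strong> $\sum_i{ weight_i*X_i(WX_i-y_i)}$
> The 1/2 and the 2 cancel, leaving the gradient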
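
A derivation like this can be sanity-checked numerically by comparing the closed-form gradient with a central finite-difference approximation of the objective. The sketch below does that on a small synthetic dataset; the data, the example weights drawn from a uniform range, and the test point `W` are all hypothetical values chosen only for the check.

```python
import numpy as np


def wols_objective(W, X, y, weights):
    """0.5 * sum_i weight_i * (W . X_i - y_i)^2"""
    residuals = X @ W - y
    return 0.5 * np.sum(weights * residuals ** 2)


def wols_gradient(W, X, y, weights):
    """Final step in vector form: sum_i weight_i * X_i * (W . X_i - y_i)"""
    return X.T @ (weights * (X @ W - y))


def numerical_gradient(W, X, y, weights, eps=1e-6):
    """Central finite differences, one coordinate of W at a time."""
    grad = np.zeros_like(W)
    for j in range(W.size):
        step = np.zeros_like(W)
        step[j] = eps
        grad[j] = (wols_objective(W + step, X, y, weights)
                   - wols_objective(W - step, X, y, weights)) / (2.0 * eps)
    return grad


rng = np.random.default_rng(0)
x_values = rng.uniform(-4, 4, 50)
X = np.column_stack((np.ones(50), x_values))      # first column is the intercept
y = X @ np.array([-4.0, 1.0]) + rng.normal(0, 0.5, 50)
weights = rng.uniform(0.5, 2.0, 50)               # hypothetical positive example weights
W = np.array([0.3, -0.7])                         # arbitrary point at which to compare

analytic = wols_gradient(W, X, y, weights)
numeric = numerical_gradient(W, X, y, weights)
print(analytic, numeric)
```

If the two printed vectors agree to several significant digits, the closed-form gradient is consistent with the objective.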

>### HW 6.5
>Write a MapReduce job in MRJob to do the training at scale of a weighted OLS model using gradient descent.  

>Generate one million datapoints just like in the following notebook:  http://nbviewer.ipython.org/urls/dl.dropbox.com/s/kritdm3mo1daolj/MrJobLinearRegressionGD.ipynb  

>Weight each example as follows: 

>$weight(x)= abs(1/x)$

>Sample 1% of the data in MapReduce and use the sampled dataset to train a (weighted if available in SciKit-Learn) linear regression model locally using  SciKit-Learn (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

>Plot the resulting weighted linear regression model versus the original model that you used to generate the data. Comment on your findings.
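
The local weighted fit can be sketched as follows. This is a rough illustration under stated assumptions: it draws 10,000 points directly instead of sampling 1% of the data in MapReduce, and it swaps in NumPy's least-squares solver for scikit-learn so that it has no extra dependency. With scikit-learn, the equivalent step would pass the weights through the `sample_weight` argument of `LinearRegression.fit`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                                   # roughly 1% of one million points
x = rng.uniform(-4, 4, n)
y = x * 1.0 - 4 + rng.normal(0, 0.5, n)      # same generating model as the notebook

keep = x != 0                                # weight abs(1/x) is undefined at x = 0
x, y = x[keep], y[keep]
weights = np.abs(1.0 / x)

# Weighted least squares: scale each row by sqrt(weight), then solve ordinary least squares
design = np.column_stack((np.ones_like(x), x))
root_w = np.sqrt(weights)
coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
intercept, slope = coef

print("fitted:   y = %.3f * x + %.3f" % (slope, intercept))
print("original: y = 1.000 * x - 4.000")
```

Because `abs(1/x)` grows without bound near zero, a few points close to $x = 0$ can dominate the weighted intercept estimate, which is worth keeping in mind when comparing the fitted line with the original model.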

In [3]:
import numpy as np
import pylab 
size = 1000000
x = np.random.uniform(-4, 4, size)
y = x * 1.0 - 4 + np.random.normal(0,0.5,size)
data = np.column_stack((y, x))  # columns: y, x (zip is a lazy iterator in Python 3)
np.savetxt('LinearRegression.csv',data, delimiter = ",")

In [4]:
!head LinearRegression.csv

-6.650992125999841242e+00,-2.625148229118576815e+00
-1.798573558430904828e+00,1.501572938676488000e+00
-4.738800795810907296e+00,-4.718581851287790840e-01
-4.564200222550757857e+00,-1.741144385132784578e-01
-6.169588744364752131e+00,-1.597100336185655500e+00
-4.927822157877639775e+00,-5.294701138536934693e-01
-6.149374572233639036e+00,-1.520726945757780335e+00
-8.180457655822662488e+00,-3.847785247801682296e+00
-1.230371438409911145e+00,2.830359334988319375e+00
-4.483833424271095325e+00,1.563974898459772334e-01


In [None]:
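%%writefile GD_WOLS_LinearRegression.py
import numpy as np
from mrjob.job import MRJob
from mrjob.protocol import PickleProtocol
from mrjob.step import MRStep

class GD_WOLS_LinearRegression(MRJob):
    # numpy arrays travel from mapper to reducer, so pickle them
    INTERNAL_PROTOCOL = PickleProtocol
    
    def configure_options(self):
        super(GD_WOLS_LinearRegression, self).configure_options()
        self.add_passthrough_option(
            '--weights_file'
                , dest='weights_file'
                , help='Weights file')
    
    def init_weights(self):
        # read the current coefficients (intercept first) from the weights file
        with open(self.options.weights_file,'r') as r:
            self.weights = np.array([float(v) for v in r.readline().split(',')])
        
        # initialize the partial gradient for this iteration
        self.partial_gradient_values = np.zeros(len(self.weights))
        self.partial_count = 0
    
    def partial_gradient(self, _, line):
        # each input line is "y,x"
        D = [float(v) for v in line.split(',')]
        y, x = D[0], np.array(D[1:])
        if x[0] == 0:
            return  # weight(x) = abs(1/x) is undefined at x = 0, so skip the example
        example_weight = abs(1.0 / x[0])
        y_hat = self.weights[0] + np.dot(self.weights[1:], x)
        # accumulate weight_i * (y_i - y_hat_i) * [1, x_i]: the negative WOLS gradient,
        # i.e. the direction the driver adds to the coefficients
        self.partial_gradient_values += example_weight * (y - y_hat) * np.concatenate(([1.0], x))
        self.partial_count = self.partial_count + 1
    
    def partial_gradient_emit(self):
        yield None, (self.partial_gradient_values, self.partial_count)
    
    # The reducer and steps() below are a sketch of one way to complete the job
    def gradient_accumulator(self, _, partial_gradient_records):
        # sum the partial gradients from all mappers and average over the examples
        total_gradient, total_count = 0.0, 0
        for gradient, count in partial_gradient_records:
            total_gradient = total_gradient + gradient
            total_count = total_count + count
        if total_count:
            yield None, (total_gradient / total_count).tolist()
    
    def steps(self):
        return [MRStep(mapper_init=self.init_weights
                       , mapper=self.partial_gradient
                       , mapper_final=self.partial_gradient_emit
                       , reducer=self.gradient_accumulator)]

if __name__ == '__main__':
    GD_WOLS_LinearRegression.run()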

>### HW6.6 Clean up notebook for GMM via EM

>Using the following notebook as a starting point:

>http://nbviewer.jupyter.org/urls/dl.dropbox.com/s/0t7985e40fovlkw/EM-GMM-MapReduce%20Design%201.ipynb 

>Improve this notebook as follows:  
* Add in equations into the notebook (not images of equations)   
* Number the equations  
* Make sure the equation notation matches the code, and that the code and comments refer to the equation numbers  
* Comment the code  
* Rename/Reorganize the code to make it more readable  
* Rerun the examples and produce similar graphics (or possibly better graphics)  

>### HW6.7 Implement Bernoulli Mixture Model via EM
>Implement the EM clustering algorithm to determine Bernoulli Mixture Model for discrete data in MRJob.
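
Before moving the algorithm into MRJob, it can help to have a single-machine version of the E-step and M-step to compare against. The following is a minimal NumPy sketch under stated assumptions: the function names, the random initialization range, and the clipping constant `eps` are illustrative choices, and the input `X` is assumed to be a binary (0/1) matrix with one row per example.

```python
import numpy as np


def e_step(X, mix, mu):
    """Responsibilities and total log-likelihood for binary X (N, D), mu (K, D)."""
    # log p(x_n | component k) = sum_d x_nd log mu_kd + (1 - x_nd) log(1 - mu_kd)
    log_px = X @ np.log(mu).T + (1.0 - X) @ np.log(1.0 - mu).T
    log_joint = log_px + np.log(mix)                     # shape (N, K)
    # log-sum-exp over components for numerical stability
    max_log = log_joint.max(axis=1, keepdims=True)
    log_norm = max_log + np.log(np.exp(log_joint - max_log).sum(axis=1, keepdims=True))
    return np.exp(log_joint - log_norm), float(log_norm.sum())


def m_step(X, resp, eps=1e-6):
    """Re-estimate mixing proportions and Bernoulli means from responsibilities."""
    n_k = resp.sum(axis=0)                               # effective count per component
    mix = n_k / X.shape[0]
    mu = (resp.T @ X) / n_k[:, None]
    return mix, np.clip(mu, eps, 1.0 - eps)              # keep the logs finite


def fit_bernoulli_mixture(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    mix = np.full(k, 1.0 / k)
    mu = rng.uniform(0.25, 0.75, size=(k, X.shape[1]))   # random start breaks symmetry
    log_likelihoods = []
    for _ in range(n_iter):
        resp, log_likelihood = e_step(X, mix, mu)
        log_likelihoods.append(log_likelihood)
        mix, mu = m_step(X, resp)
    return mix, mu, resp, log_likelihoods
```

In a MapReduce design, the per-example work in `e_step` maps naturally onto mappers, while the sums in `m_step` (`n_k` and `resp.T @ X`) are the quantities a reducer would aggregate. The word-count vectors in the Tweet Dataset would first need to be binarized (for example, count > 0) to fit the Bernoulli model.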

>As a unit test:


>As a test: use the same dataset from HW 4.5, the Tweet Dataset. 
Using this data, you will implement a 1000-dimensional EM-based Bernoulli Mixture Model algorithm in MRJob, clustering the users
by their 1000-dimensional word stripes/vectors using K = 4.  
Repeat this experiment using your KMeans MRJob implementation from HW4.  
Report the Rand index score using the class code as the ground truth label for both algorithms and comment on your findings.
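
For the scoring step, the Rand index is the fraction of example pairs on which two labelings agree, meaning both put the pair in the same group or both put it in different groups. A small pure-Python sketch is shown below; the function name `rand_index` is an illustrative choice, and the quadratic loop over pairs is adequate for about 1,000 users but would need a contingency-table formulation for much larger datasets.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of pairs on which the two labelings agree (same/same or different/different)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    if len(labels_true) < 2:
        raise ValueError("at least two examples are needed to form a pair")
    agreements = 0
    total_pairs = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agreements += 1
        total_pairs += 1
    return agreements / total_pairs
```

The score does not depend on how clusters are numbered, so `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` is 1.0 even though the cluster IDs are swapped relative to the class codes.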

>Here is some more information on the Tweet Dataset.

>Here you will use a different dataset consisting of word-frequency distributions 
for 1,000 Twitter users. These Twitter users use language in very different ways,
and were classified by hand according to the criteria:

>0: Human, where only basic human-human communication is observed.

>1: Cyborg, where language is primarily borrowed from other sources
(e.g., jobs listings, classifieds postings, advertisements, etc...).

>2: Robot, where language is formulaically derived from unrelated sources
(e.g., weather/seismology, police/fire event logs, etc...).

>3: Spammer, where language is replicated to high multiplicity
(e.g., celebrity obsessions, personal promotion, etc... )

>Check out the preprints of recent research,
which spawned this dataset:

>http://arxiv.org/abs/1505.04342  
>http://arxiv.org/abs/1508.01843

>The main data lie in the accompanying file:

>topUsers_Apr-Jul_2014_1000-words.txt

>and are of the form:

>USERID,CODE,TOTAL,WORD1_COUNT,WORD2_COUNT,...
>.
>.

>where

>USERID = unique user identifier  
>CODE = 0/1/2/3 class code  
>TOTAL = sum of the word counts

>Using this data, you will implement a 1000-dimensional K-means algorithm in MrJob on the users
by their 1000-dimensional word stripes/vectors using several 
centroid initializations and values of K.