# SIDAS502: Math Methods for Data Science 


# School of Information, University of Michigan


## Week3: Optimization/Canonical Formulation/Loss Function 

### Version 1.3

This assignment is designed to help you gain practical knowledge of how optimization works. You should be able to describe how optimization works in everyday life, mathematics, and data science. At the end of this assignment you should be able to identify canonical formulations for linear and logistic regression, express the intuition behind a loss function, and explain how it relates to optimization. Please read the directions carefully, as we want to avoid submissions that are marked incorrect due to formatting mistakes. You will be using sympy, numpy, and scipy for this assignment.

Please enter your name: "Aseem Sachdeva"

## Part1: Optimization Problem

<b>1.</b> Below is the function f(x) that you are trying to minimize. 

![function plotted on xy axis graph](assets/optimization.png)

<strong>1.1</strong> \[1 pt\] Where is the global maximum?

Please store the answer in the form of a single-character string named <strong>ANS11</strong>. 

In [170]:
ANS11="A"
# YOUR CODE HERE
#raise NotImplementedError()

In [171]:
assert type(ANS11) == str, "Problem 1.1, testing ANS11, type of value stored in variable does not match the expected type. Expecting String."

<strong>1.2</strong> \[1 pt\] Where is the global minimum?

Please store the answer in the form of a single-character string named <strong>ANS12</strong>.

In [172]:
ANS12= "F"
# YOUR CODE HERE
#raise NotImplementedError()

In [173]:
assert type(ANS12) == str,  "Problem 1.2, testing ANS12, type of value stored in variable does not match the expected type. Expecting String."
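The difference between a local and a global extremum can also be checked numerically. The sketch below is only an illustration: the function `g` is a hypothetical stand-in (it is not the function in the figure above), and a dense grid search over a fixed interval is just one simple way to locate the global minimum and maximum on that interval.

```python
import numpy as np

# Hypothetical function with two valleys; NOT the f(x) plotted in the assignment.
def g(x):
    return x**4 - 4 * x**2 + x

# Evaluate g on a dense grid over the interval [-2.5, 2.5].
xs = np.linspace(-2.5, 2.5, 5001)
ys = g(xs)

# The global minimum/maximum on the interval are the smallest/largest grid values.
x_min = xs[np.argmin(ys)]
x_max = xs[np.argmax(ys)]

print("global minimum near x =", round(float(x_min), 3))
print("global maximum on the interval at x =", round(float(x_max), 3))
```

This `g` also has a second, shallower valley at positive x. That valley is a local minimum only, because the valley at negative x is deeper.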

## Part2: Linear Regression Implementation on Data 

In this problem, you will implement linear regression models with the scikit-learn library, as well as an OLS implementation with the statsmodels library.

Difference between Statsmodels and Scikit-learn libraries: 

Scikit-learn: Simple and efficient tools for data mining and data analysis

Statsmodels: a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

Sources:
<ul>
    <li>https://towardsdatascience.com/introduction-to-linear-regression-in-python-c12a072bedf0</li>
    <li>https://towardsdatascience.com/linear-regression-simplified-ordinary-least-square-vs-gradient-descent-48145de2cf76</li>
    <li>https://www.statsmodels.org/stable/index.html</li>
    <li>https://scikit-learn.org/stable/</li>
</ul>

### 2. Simple Regression with Scikit-learn 

In [174]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import pandas as pd

In [175]:
#Let's read in nyc flight csv file 
nyc = pd.read_csv('assets/nyc.csv')

In [176]:
##Let's explore what is inside the data 
nyc.head()

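| Unnamed: 0.1 | Unnamed: 0 | X | year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | origin | dest | air_time | distance | hour | minute |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2013 | 1 | 1 | 517.0 | 2.0 | 830.0 | 11.0 | UA | N14228 | 1545 | EWR | IAH | 227.0 | 1400 | 5.0 | 17.0 |
| 1 | 2 | 2 | 2013 | 1 | 1 | 533.0 | 4.0 | 850.0 | 20.0 | UA | N24211 | 1714 | LGA | IAH | 227.0 | 1416 | 5.0 | 33.0 |
| 2 | 3 | 3 | 2013 | 1 | 1 | 542.0 | 2.0 | 923.0 | 33.0 | AA | N619AA | 1141 | JFK | MIA | 160.0 | 1089 | 5.0 | 42.0 |
| 3 | 4 | 4 | 2013 | 1 | 1 | 544.0 | -1.0 | 1004.0 | -18.0 | B6 | N804JB | 725 | JFK | BQN | 183.0 | 1576 | 5.0 | 44.0 |
| 4 | 5 | 5 | 2013 | 1 | 1 | 554.0 | -6.0 | 812.0 | -25.0 | DL | N668DN | 461 | LGA | ATL | 116.0 | 762 | 5.0 | 54.0 |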


In [177]:
## Data Cleaning: Drop all null values
nyc=nyc.dropna()

In [178]:
## Change datatype from dataframe to array 
x = nyc['distance'].values.reshape((-1, 1))
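The reshape is needed because scikit-learn estimators expect the predictors as a 2-D array of shape `(n_samples, n_features)`, while a single DataFrame column yields a 1-D array. A minimal sketch, using the five `distance` values from the `head()` output above as sample data:

```python
import numpy as np

# Five distance values copied from the head() output above.
distance = np.array([1400, 1416, 1089, 1576, 762])
print(distance.shape)  # 1-D: (5,)

# -1 lets numpy infer the number of rows; 1 fixes a single feature column.
x = distance.reshape((-1, 1))
print(x.shape)  # 2-D: (5, 1)
```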

In [179]:
## model building 
model = LinearRegression().fit(x, nyc['arr_delay'])

In [180]:
## Show the slope of the model 
print('slope:', model.coef_)

slope: [-0.00483183]


In [181]:
## Show the intercept of the model 
print('intercept:', model.intercept_)

intercept: 11.027244871764514


In [182]:
## predict arr_delay time when distance is 20
model.predict([[20]])

array([10.9306082])
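For a simple linear model, `predict` evaluates `intercept + slope * x`. As a quick sanity check, the sketch below plugs in the slope and intercept printed above (copied by hand, so the slope is only accurate to the printed digits) and reproduces the prediction for a distance of 20.

```python
# Values copied from the printed model output above.
slope = -0.00483183
intercept = 11.027244871764514

# A simple linear regression prediction is intercept + slope * x.
distance = 20
predicted_arr_delay = intercept + slope * distance

print(round(predicted_arr_delay, 4))
```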

#### An example of simple linear regression with Scikit-learn

Please apply simple linear regression in Scikit-learn to build a model called <strong>arr_dep_model</strong> that captures the relationship between arrival delay (dependent variable) and departure delay (independent variable).

In [183]:
x1= nyc['dep_delay'].values.reshape((-1, 1))
arr_dep_model = LinearRegression().fit(x1, nyc['arr_delay'])

<strong>2.1</strong> \[1 pt\] What is the slope of the arr_dep_model?
Please store the answer into variable <strong>ANS21</strong> and round your answer to the thousandths place (e.g. `ANS21=0.111`).

In [184]:
x1= nyc['dep_delay'].values.reshape((-1, 1))
arr_dep_model = LinearRegression().fit(x1, nyc['arr_delay'])
ANS21= arr_dep_model.coef_
ANS21 = np.round(ANS21, 3)

# YOUR CODE HERE
#raise NotImplementedError()

In [185]:
assert type(ANS21) == np.ndarray, "Problem 2.1, testing ANS21, type of value stored in variable does not match the expected type. Expecting np.array"

<strong>2.2</strong> \[1 pt\] What is the intercept of the arr_dep_model?

Please store the answer into variable <strong>ANS22</strong> and round your answer to the thousandths place (e.g. `ANS22=0.111`).

In [186]:
ANS22 = arr_dep_model.intercept_
ANS22 = np.round(ANS22, 3)
# YOUR CODE HERE
#raise NotImplementedError()

In [187]:
assert type(ANS22) == np.float64, "Problem 2.2, testing ANS22, type of value stored in variable does not match the expected type. Expecting np.float64"

<strong>2.3</strong> \[1 pt\] Put the answers from 2.1 and 2.2 into slope-intercept form into a variable called <strong>ANS23</strong>. Store the equation as a string with no spaces and no Y=/f(x)=. Use a lower case “x” as the representation of the independent variable. Please round the values to the thousandths place, as you did for questions 2.1 and 2.2.

In [188]:
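ANS21 = arr_dep_model.coef_
ANS21 = np.round(ANS21, 3)  # thousandths place, as in question 2.1
ANS22 = arr_dep_model.intercept_
ANS22 = np.round(ANS22, 3)

# Slope-intercept form with no spaces; the negative intercept carries its own sign
ANS23 = "1.02x-4.057"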
# YOUR CODE HERE
#raise NotImplementedError()

In [189]:
#hidden tests are within this cell
assert type(ANS23) == str, "Problem 2.3, testing ANS23, type of value stored in variable does not match the expected type. Expecting String"

<strong>2.4</strong> \[1 pt\] If a flight’s departure is delayed for 15 minutes, what is its predicted arrival delay? The inputs for your slope and intercept should be rounded to the thousandths place. Please store the answer into variable <strong>ANS24</strong> (rounded to the **hundredths** place, e.g. `ANS24=0.11`). Your answer should be the following type: python float.

In [190]:
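ANS21 = arr_dep_model.coef_
ANS21 = np.round(ANS21, 3)
ANS22 = arr_dep_model.intercept_
ANS22 = np.round(ANS22, 3)

# Predict from the rounded slope and intercept, then round to the hundredths place.
# np.float was removed in NumPy 1.24, so the built-in float() does the conversion.
ANS24 = float(np.round(ANS21[0] * 15 + ANS22, 2))
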
# YOUR CODE HERE
#raise NotImplementedError()

In [191]:
assert type(ANS24) == float,  "Problem 2.4, testing ANS24, type of value stored in variable does not match the expected type. Expecting Float"
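The type checks in this notebook compare `type(...)` exactly, so it matters whether a value is a `np.float64` or a built-in `float`. The sketch below illustrates the difference; the `intercept` value is a hypothetical stand-in for a fitted model's `intercept_`.

```python
import numpy as np

# Hypothetical value standing in for a fitted model's intercept_.
intercept = np.float64(-4.0570123)

rounded = np.round(intercept, 3)   # still a numpy scalar
as_python_float = float(rounded)   # converted to the built-in float

print(type(rounded) == np.float64)       # True
print(type(rounded) == float)            # False: exact type comparison
print(isinstance(rounded, float))        # True: np.float64 subclasses float
print(type(as_python_float) == float)    # True
```

Because the asserts use `type(x) == float` rather than `isinstance`, a value produced by `np.round` has to be wrapped in `float()` wherever a python float is expected.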

### 3. Multiple  Regression with Scikit-learn
We are going to run through an example of using scikit-learn to do a multiple linear regression model. 

**Begin example:**

In [192]:
# Build linear regression model using dep_delay and distance as predictors
# Split data into predictors X and output Y
predictors = ['dep_delay', 'distance']
X = nyc[predictors]
y = nyc['arr_delay']

# Initialise and fit model
lm = LinearRegression()
model = lm.fit(X, y)

In [193]:
print(f'intercept = {model.intercept_}')
print(f'coefficient = {model.coef_}')

intercept = -1.4987042957768262
coefficient = [ 1.01791457 -0.00250182]


In [194]:
model.predict(X)

array([ -2.96542015,  -0.9696201 ,  -2.1873548 , ..., 260.29482685,
        90.42100499, 179.45209055])

In [195]:
new_X = [[30, 200]]
print(model.predict(new_X))

[28.53836913]
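With several predictors, the prediction is the intercept plus the dot product of the coefficient vector with each row of inputs. The sketch below reproduces the prediction above from the printed intercept and coefficients. The values are copied by hand, so small differences in the last digits are expected.

```python
import numpy as np

# Values copied from the printed model output above.
intercept = -1.4987042957768262
coef = np.array([1.01791457, -0.00250182])  # [dep_delay, distance]

# One row per flight: dep_delay = 30, distance = 200.
new_X = np.array([[30, 200]])

# Matrix form of the model: y_hat = X @ coef + intercept
manual_prediction = new_X @ coef + intercept
print(manual_prediction)
```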


**Example end**

Now, it is your turn! 

Please apply multiple linear regression in Scikit-learn to build a model called arr_del_model to see how air_time and distance influence arrival delays. You will use the results from this model to answer questions 3.1 - 3.4.

In [196]:
air_distance_predictors = ['air_time', 'distance']
X = nyc[air_distance_predictors]
y = nyc['arr_delay']


lm = LinearRegression()
arr_del_model = lm.fit(X, y)

print(arr_del_model.coef_)
print(arr_del_model.intercept_)
print(arr_del_model.predict([[175,1300]]))

[ 0.70092783 -0.09636352]
-4.275665100362692
[-6.88587014]


<strong>3.1</strong> \[1 pt\] Please identify the coefficient value for air_time. Assign the coefficient to <strong>ANS31</strong> and round to the thousandth decimal point (ex. 1.001). The value of <strong>ANS31</strong> should be a numpy float or a python float, depending on how you solved the problem.


In [197]:
ANS31 = 0.701
# YOUR CODE HERE
#raise NotImplementedError()

In [198]:
assert (type(ANS31) == np.float64) or (type(ANS31) == float), "Problem 3.1, testing ANS31, type of value stored in variable does not match the expected type. Expecting python or numpy float."

<strong>3.2</strong> \[1 pt\] Please identify the coefficient value for distance. Assign the coefficient to <strong>ANS32</strong> and round to the thousandth decimal point (ex. 1.001). The value of <strong>ANS32</strong> should be a numpy float or a python float, depending on how you solved the problem.

In [199]:
ANS32 = -0.096
# YOUR CODE HERE
#raise NotImplementedError()

In [200]:
assert (type(ANS32) == np.float64) or (type(ANS32) == float), "Problem 3.2, testing ANS32, type of value stored in variable does not match the expected type. Expecting python or numpy float."

<strong>3.3</strong> \[1 pt\] Please identify the intercept for the relationship between arr_delay, air_time, and distance. Assign the intercept to <strong>ANS33</strong> and round to the thousandth decimal point (ex. 1.001). The value of <strong>ANS33</strong> should be a numpy float or a python float, depending on how you solved the problem.

In [201]:
ANS33 = np.round(-4.275665100362692,3)
# YOUR CODE HERE
#raise NotImplementedError()

In [202]:
assert type(ANS33) == np.float64, "Problem 3.3, testing ANS33, type of value stored in variable does not match the expected type. Expecting np.float64."

<strong>3.4</strong> \[1 pt\] If a flight’s air_time is 175 and distance is 1300, what is its predicted arrival delay? The input values should be rounded to the thousandths place per your answers to 3.1, 3.2, and 3.3. Please store your answer into variable <strong>ANS34</strong> as a float (rounded to the thousandths place, e.g.: `ANS34=0.110`) 

In [203]:
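# Use the rounded coefficients and intercept from 3.1 - 3.3, as the question asks
ANS34 = float(np.round(ANS31 * 175 + ANS32 * 1300 + ANS33, 3))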
# YOUR CODE HERE
#raise NotImplementedError()

In [204]:
#hidden tests are within this cell

### 4. Linear Regression with statsmodels 

First, we will run through an example of linear regression using statsmodels. 

**Begin example:**

Please implement linear regression on the dataset.

    Linear regression refers to any linear relationship between the independent and dependent variables, but the term does not say how the model is fitted. As a result, there are several different approaches to linear regression problems. Here, I would like to introduce one common method called ordinary least squares (OLS). OLS finds the best-fit line by minimizing the sum of the squared vertical distances (residuals) between the data points and the line.
    Below is the mathematical formulation of ordinary least squares.
    Assume we have data consisting of pairs (x1,y1),(x2,y2),..., and we are trying to find the best-fit line for the data:
    (1) Find the mean of the x values and the mean of the y values
        ![the equation of the means for X and Y](assets/Mean.png)
    (2) Calculate the slope of the best-fit line
        ![the equation for the slope](assets/slope.png)
    (3) Compute the y-intercept of the line 
        ![the equation for computing the y-intercept](assets/y-intercept.png)
    (4) The result is the regression line with the smallest sum of squared distances from the data points to the line
        ![a graph showing the regression line with the least square distance](assets/regression_line.png)
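The four steps above can be sketched in a few lines of numpy. This assumes the equation images show the standard closed-form OLS formulas (slope = sum of (x - mean_x)(y - mean_y) divided by sum of (x - mean_x)^2, intercept = mean_y - slope * mean_x). The five data points are made up for illustration and do not come from the flights dataset.

```python
import numpy as np

# Small made-up dataset, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# (1) Means of x and y.
x_mean, y_mean = x.mean(), y.mean()

# (2) Slope: covariance of x and y divided by the variance of x.
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# (3) Intercept: the fitted line passes through the point of means.
intercept = y_mean - slope * x_mean

# (4) The loss being minimized: the sum of squared residuals.
def sum_squared_residuals(m, b):
    return np.sum((y - (m * x + b)) ** 2)

best_loss = sum_squared_residuals(slope, intercept)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, loss = {best_loss:.3f}")
```

Nudging the slope or intercept away from the fitted values increases `sum_squared_residuals`. This is the sense in which OLS is an optimization problem with the squared error as its loss function.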

In [205]:
# To run OLS on the data, import the modules below:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy
import csv
import pandas as pd 
import math

statsmodels.formula.api allows you to use R-Style formulas: y ~ x1 + x2 + x3 + ...

    1.y represents the outcome/dependent variable
    2.x1, x2, x3, etc represent explanatory/independent variables

What is the relationship between arrival delay and distance? 

In [206]:
model1 = smf.ols('arr_delay ~ distance', data=nyc).fit()
model1.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | arr_delay | R-squared: | 0.007 |
| Model: | OLS | Adj. R-squared: | 0.007 |
| Method: | Least Squares | F-statistic: | 198.0 |
| Date: | Sun, 20 Oct 2019 | Prob (F-statistic): | 8.25e-45 |
| Time: | 18:59:00 | Log-Likelihood: | -135020.0 |
| No. Observations: | 26398 | AIC: | 270000.0 |
| Df Residuals: | 26396 | BIC: | 270100.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 11.0272 | 0.427 | 25.808 | 0.000 | 10.190 | 11.865 |
| distance | -0.0048 | 0.000 | -14.072 | 0.000 | -0.006 | -0.004 |

| | | | |
|---|---|---|---|
| Omnibus: | 28994.599 | Durbin-Watson: | 1.591 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 7569174.795 |
| Skew: | 5.203 | Prob(JB): | 0.0 |
| Kurtosis: | 85.3 | Cond. No. | 2140.0 |


Based on the model, we know that 

    arr_delay = -0.0048*distance + 11.0272

**Example end**

Now it is your turn!

<strong>4.1</strong> Build an OLS model called model2 that shows the relationship between arr_delay and dep_delay. You will not submit anything for this question, but you need the output to answer 4.2 through 4.4.

In [207]:
model2 = smf.ols('arr_delay ~ dep_delay', data=nyc).fit()
model2.summary()
#raise NotImplementedError()

| | | | |
|---|---|---|---|
| Dep. Variable: | arr_delay | R-squared: | 0.84 |
| Model: | OLS | Adj. R-squared: | 0.84 |
| Method: | Least Squares | F-statistic: | 138200.0 |
| Date: | Sun, 20 Oct 2019 | Prob (F-statistic): | 0.0 |
| Time: | 18:59:02 | Log-Likelihood: | -110950.0 |
| No. Observations: | 26398 | AIC: | 221900.0 |
| Df Residuals: | 26396 | BIC: | 221900.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -4.0570 | 0.103 | -39.264 | 0.000 | -4.260 | -3.854 |
| dep_delay | 1.0202 | 0.003 | 371.798 | 0.000 | 1.015 | 1.026 |

| | | | |
|---|---|---|---|
| Omnibus: | 3397.727 | Durbin-Watson: | 1.604 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9831.122 |
| Skew: | 0.695 | Prob(JB): | 0.0 |
| Kurtosis: | 5.647 | Cond. No. | 39.1 |


<strong>4.2</strong> \[1 pt\] Please identify the coefficient value (slope) for dep_delay. Assign the coefficient to <strong>ANS42</strong> and round to the thousandths decimal place (e.g. 1.001). The value of <strong>ANS42</strong> should be a python float.

In [208]:
ANS42 = 1.020
# YOUR CODE HERE
#raise NotImplementedError()

In [209]:
assert type(ANS42) == float, "Problem 4.2, testing ANS42, type of value stored in variable does not match the expected type. Expecting python float."

<strong>4.3</strong> \[1 pt\] Please identify the intercept for the relationship between arr_delay and dep_delay. Assign the intercept to <strong>ANS43</strong> and round to the thousandth decimal point (ex. 1.001). The value of <strong>ANS43</strong> should be a python float.

In [210]:
ANS43 = -4.057
# YOUR CODE HERE
#raise NotImplementedError()

In [211]:
assert type(ANS43) == float, "Problem 4.3, testing ANS43, type of value stored in variable does not match the expected type. Expecting python float."

<strong>4.4</strong> \[1 pt\] If a flight's departure was delayed by 10 minutes, what is its predicted arrival delay based on model2? Assign your answer to the variable <strong>ANS44</strong> and round to the thousandths place (e.g. 1.001). The value of <strong>ANS44</strong> should be a python float. The slope and intercept used as inputs should be rounded to the thousandths place as well. 

In [212]:
# Use the slope and intercept rounded to the thousandths place (ANS42, ANS43)
ANS44 = round(ANS42 * 10 + ANS43, 3)

# YOUR CODE HERE
#raise NotImplementedError()

In [213]:
assert type(ANS44) == float, "Problem 4.4, testing ANS44, type of value stored in variable does not match the expected type. Expecting python float."

### 5. Gradient Descent Computation 

<strong>5.1</strong> \[1 pt\] Assume we have a function f(x, y) = 5x + 4xy - y^2. What is its gradient? Please store the answer into a string type variable <strong>ANS51</strong> (e.g. `ANS51 = "[3x+y 3x-2y]"`, with only a space between the two elements).

In [214]:
ANS51 = "[5+4y 4x-2y]"
# YOUR CODE HERE
#raise NotImplementedError()

In [215]:
#hidden tests are within this cell

<strong>5.2</strong> \[1 pt\] Evaluate the gradient at the point (3,-1). Your solution should be stored into variable ANS52 as a string (e.g. `ANS52= "[1 1]"`, with only a space between the two elements).

In [216]:
ANS52 = "[1 14]"
# YOUR CODE HERE
#raise NotImplementedError()

In [217]:
#hidden tests are within this cell
assert len(ANS52.strip().split()) == 2, "Problem 5.2, testing ANS52, you should have two values separated by a single space"
assert ANS52.strip().count(" ") == 1, "Problem 5.2, testing ANS52, more than one space is being used"

<strong>5.3</strong> \[1 pt\] What is the magnitude of the gradient at the point (3,-1)? Please store your answer into variable <strong>ANS53</strong> as a float rounded to three decimal places (thousandths place, e.g. 1.001).

In [229]:
import math
ANS53 = np.round(math.sqrt(197),3)
# YOUR CODE HERE
#raise NotImplementedError()

In [230]:
#hidden tests are within this cell
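A hand-derived gradient can be checked numerically with central finite differences. The sketch below is one way to do that for f(x, y) = 5x + 4xy - y^2 at the point (3, -1). The helper `numerical_gradient` and the step size `h` are illustrative choices, not part of the assignment.

```python
import math

def f(x, y):
    return 5 * x + 4 * x * y - y ** 2

def numerical_gradient(func, x, y, h=1e-5):
    """Approximate the partial derivatives with central differences."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return df_dx, df_dy

gx, gy = numerical_gradient(f, 3, -1)

# The magnitude of the gradient is its Euclidean norm.
magnitude = math.hypot(gx, gy)

print(round(gx, 3), round(gy, 3), round(magnitude, 3))
```

The numerical values should agree with the analytic gradient [5+4y 4x-2y] evaluated at (3, -1), and the magnitude with the square root of 1^2 + 14^2 = 197.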


<strong>5.4</strong> \[1 pt\] Starting from the given point (3,-1) and using the letter `a` as a symbol for your learning rate, find the new point reached after one gradient descent step. Store your answer into variable <strong>ANS54</strong> as a python string (e.g. ``ANS54 = "(1a, 2-10a)"`` with a comma and only one space between the two elements and no spaces within the elements). 

In [233]:
ANS54 = "(3-a, -1-14a)"
# YOUR CODE HERE
#raise NotImplementedError()

In [234]:
assert len(ANS54.strip().split(", ")) == 2, "Problem 5.4, testing ANS54, make sure to separate the elements by one space"
assert ANS54.strip().count(" ") == 1, "Problem 5.4, testing ANS54, try checking your whitespace"
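A gradient descent step moves from the current point in the direction opposite the gradient, scaled by the learning rate: new point = point - a * gradient. The sketch below applies one such step to f(x, y) = 5x + 4xy - y^2 from (3, -1). The learning rate `a = 0.1` is an arbitrary value chosen for illustration.

```python
def f(x, y):
    return 5 * x + 4 * x * y - y ** 2

def gradient(x, y):
    # Analytic partial derivatives of f: (df/dx, df/dy)
    return 5 + 4 * y, 4 * x - 2 * y

def descent_step(x, y, a):
    """Take one gradient descent step with learning rate a."""
    gx, gy = gradient(x, y)
    return x - a * gx, y - a * gy

a = 0.1  # hypothetical learning rate
new_x, new_y = descent_step(3, -1, a)

print("new point:", (round(new_x, 3), round(new_y, 3)))
print("f before:", f(3, -1), "f after:", round(f(new_x, new_y), 3))
```

With a symbolic `a`, the same update gives the point (3-a, -1-14a). Plugging in a concrete value shows that the function value decreases after the step.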

Sources:
<ul>
    <li>https://www.statsmodels.org/stable/index.html</li>
    <li>https://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/</li>
    <li>https://towardsdatascience.com/linear-regression-simplified-ordinary-least-square-vs-gradient-descent-48145de2cf76</li>
</ul>