# Chapter 7: Moving Beyond Linearity

- Polynomial regression extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power. For example, a cubic regression uses three variables, X, X², and X³, as predictors. This approach is a simple way to obtain a nonlinear fit to data.
- Step functions cut the range of a variable into K distinct regions in order to produce a qualitative variable. This has the effect of fitting a piecewise constant function.
- Regression splines are more flexible than polynomials and step functions, and in fact are an extension of the two. They involve dividing the range of X into K distinct regions. Within each region, a polynomial function is fit to the data. However, these polynomials are constrained so that they join smoothly at the region boundaries, or knots. Provided that the interval is divided into enough regions, this can produce an extremely flexible fit.
- Smoothing splines are similar to regression splines, but arise in a slightly different situation. Smoothing splines result from minimizing a residual sum of squares criterion subject to a smoothness penalty.
- Local regression is similar to splines, but differs in an important way. The regions are allowed to overlap, and indeed they do so in a very smooth way.
- Generalized additive models allow us to extend the methods above to deal with multiple predictors.

## 7.1 Polynomial Regression
- Generally speaking, it is unusual to use d greater than 3 or 4 because for large values of d, the polynomial curve can become overly flexible and can take on some very strange shapes.
- This is mainly used to understand the relationship between the predictor and the response (dependent) variable.
- <img src="./images/Linearity_1.png" width=500px>
- We noticed in the chart that there are two groups of earners: one above the 250 mark and the other below it.
- The author uses a logistic regression to find how age contributes to a person earning more than 250. The drawback is that the group earning more than 250 is small relative to the entire population, which makes the estimates for that group less certain.
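
To make the "extra predictors" idea concrete, here is a minimal sketch of cubic polynomial regression fit by ordinary least squares. The data and the helper names `poly_design` and `fit_least_squares` are made up for illustration; they are not from the book or its lab.

```python
import numpy as np

def poly_design(x, degree):
    """Columns x, x^2, ..., x^degree (the intercept is added separately)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** d for d in range(1, degree + 1)])

def fit_least_squares(features, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bp]."""
    X = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef

# Hypothetical noiseless data from an exact cubic, so the fit should
# recover the coefficients that generated it.
x = np.linspace(-2.0, 2.0, 25)
y = 1.0 + 0.5 * x - 2.0 * x ** 2 + 0.3 * x ** 3
beta = fit_least_squares(poly_design(x, degree=3), y)
```

The model is still linear in the coefficients, which is why plain least squares is enough even though the fitted curve is nonlinear in x.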

## 7.2 Step Functions
- Another way to do this is to create different "subsets"
- This way, we can model each subset with its own specific information
- We pretty much break X into different bins
- I(·) is an indicator function that returns 1 if the condition is true and 0 otherwise
- The bad thing about step functions is that the bins are difficult to choose or adjust; the cutpoints are very arbitrary
- $y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \dots + \beta_K C_K(x_i) + \epsilon_i$
- The fitted value within each bin is the average of the y values of the observations whose X falls in that bin
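
A small sketch of the step-function fit, assuming made-up data and cutpoints. The helper `step_design` is hypothetical; it builds the indicator columns C_1, ..., C_K and leaves the first bin to the intercept.

```python
import numpy as np

def step_design(x, cutpoints):
    """Indicator columns C_1..C_K; the bin x < c_1 is absorbed by the intercept."""
    x = np.asarray(x, dtype=float)
    bins = np.digitize(x, cutpoints)   # 0 for x < c_1, ..., K for x >= c_K
    K = len(cutpoints)
    return np.column_stack([(bins == k).astype(float) for k in range(1, K + 1)])

# Hypothetical data (think age vs. wage) and arbitrary cutpoints.
x = np.array([20, 25, 30, 40, 45, 55, 60, 70], dtype=float)
y = np.array([50, 60, 70, 100, 110, 90, 95, 60], dtype=float)
cutpoints = [35, 50, 65]

X = np.column_stack([np.ones(len(x)), step_design(x, cutpoints)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta   # piecewise constant: one value per bin
```

Here β0 is the mean of y in the first bin, and each βj is the difference between the mean of bin j and that first-bin mean, so the fitted value in every bin is just the bin average.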

## 7.3 Basis Functions
- We apply different functions to the values of X. For example, we might use the second power of X or the bin indicators from the step functions
- We can think of (7.7) as a standard linear model with predictors $b_1(x_i), b_2(x_i), \dots, b_K(x_i)$.
    - $y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \beta_3 b_3(x_i) + \dots + \beta_K b_K(x_i) + \epsilon_i$ (7.7)
- The predictors can be thought of as functions of X
- The idea is to have at hand a family of functions or transformations that can be applied to a variable X: b1(X), b2(X), . . . , bK(X)
- The basis functions are fixed and known: we choose them ahead of time, before fitting
- Regression splines are a very common choice of basis functions
- For a polynomial fit, the basis functions are the powers of x
- Meaning, a basis function is a function that can change or reshape the landscape of the data
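
As a short worked illustration in the notation of (7.7), the two earlier methods are special cases of the basis-function model. Polynomial regression and step functions correspond to the choices

$$
b_j(x_i) = x_i^{\,j} \quad \text{(polynomial regression)}, \qquad b_j(x_i) = I(c_j \le x_i < c_{j+1}) \quad \text{(piecewise constant)}.
$$

In both cases the coefficients are still estimated by ordinary least squares, because the model is linear in the β's once the b_j are fixed.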

## 7.4 Regression Splines

### 7.4.1 Piecewise Polynomials
- The coefficients β0, β1, β2, and β3 differ in different parts of the range of X. The points where the coefficients change are called knots.
- We fit separate polynomials over different parts of the range of X
- The values of the coefficients change from region to region
- For example, a piecewise cubic with no knots is just a standard cubic polynomial, as in polynomial regression with d = 3.
- A piecewise cubic polynomial with a single knot at a point c takes the form
- <img src="./images/Linearity_2.png" width=500px>
- Notice that as the value of x changes, we use a different model: one cubic for x < c and another for x ≥ c. It will be interesting to see how the cutoff values of X are chosen
- The first polynomial function has coefficients $\beta_{01}, \beta_{11}, \beta_{21}, \beta_{31}$, and the second has coefficients $\beta_{02}, \beta_{12}, \beta_{22}, \beta_{32}$.
- Using more knots leads to a more flexible piecewise polynomial. In general, if we place K different knots throughout the range of X, then we will end up fitting K + 1 different cubic polynomials, since the number of regions is one more than the number of knots
- We do not need to use a cubic polynomial. For example, we can instead fit piecewise linear functions. In fact, our piecewise constant functions of Section 7.2 are piecewise polynomials of degree 0!
- <img src="./images/Linearity_3.png" width=500px>
- Notice that in the image above, the upper left graph shows that the fitted function is discontinuous: there is a jump in the curve at the knot. Nothing ties the two pieces together, since each of the two polynomials has its own four coefficients!
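
The jump at the knot can be seen numerically. This is a minimal sketch on simulated data (not the Wage data), with one hypothetical knot at 5 and two unconstrained cubics fit separately on each side.

```python
import numpy as np

def fit_cubic(x, y):
    """Least squares cubic; returns coefficients [b0, b1, b2, b3]."""
    X = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def eval_cubic(coef, x):
    return coef[0] + coef[1] * x + coef[2] * x ** 2 + coef[3] * x ** 3

# Simulated data, assumed only for illustration.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
knot = 5.0

left = fit_cubic(x[x < knot], y[x < knot])      # cubic for x < knot
right = fit_cubic(x[x >= knot], y[x >= knot])   # cubic for x >= knot

# Nothing forces the two pieces to agree at the knot, so this is generally nonzero.
jump = eval_cubic(right, knot) - eval_cubic(left, knot)
n_params = left.size + right.size               # 4 + 4 free coefficients
```

The eight free coefficients are the eight degrees of freedom mentioned for the unconstrained piecewise cubic.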

### 7.4.2 Constraints and Splines
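- To fix the jump, we can fit the piecewise polynomial under the constraint that the fitted curve must be continuous. The top right plot in Figure 7.3 shows the resulting fit.
- How the constraint is actually imposed is not explained by the author at this point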
- In the lower left plot, we have added two additional constraints: now both the first and second derivatives of the piecewise polynomials are continuous at age=50.
- We are requiring that the piecewise polynomial be not only continuous when age=50, but also very smooth.
- So in the top left plot, we are using eight degrees of freedom, but in the bottom left plot we imposed three constraints (continuity, continuity of the first derivative, and continuity of the second derivative), meaning that it has five degrees of freedom
- In this case, we can think of the degrees of freedom as the number of coefficients that are free to vary
- Remember, the reason it is 8 is that we have two equations with four coefficients each!
- Once we add constraints, those coefficients are no longer free to be whatever they want to be!
- Each new constraint we impose removes one degree of freedom
- The curve in the bottom left plot is called a cubic spline.
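
The same counting argument works for any number of knots. As a short worked equation: with K knots there are K + 1 cubics with four coefficients each, and three continuity constraints at every knot, so

$$
\underbrace{4(K+1)}_{\text{coefficients of the } K+1 \text{ cubics}} \;-\; \underbrace{3K}_{\text{three constraints per knot}} \;=\; K + 4 .
$$

For the single knot at age = 50 this gives 8 − 3 = 5 degrees of freedom, matching the count above.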

### 7.4.3 The Spline Basis Representation
- A cubic spline with K knots can be modeled as
- <img src="./images/Linearity_4.png" width=500px>
- The model can be fit with least squares, where we minimize the total squared error of the fit
- Remember that we can fit different polynomial functions to different parts of the data. For example, one polynomial may work well for one part of the data (wages) and another for a different part
- For the spline basis, we use a similar idea
- The most direct way to represent a cubic spline is to start off with a basis for a cubic polynomial (namely x, x², x³) and then add one truncated power basis function per knot
- <img src="./images/Linearity_5.png" width=500px>
- h(x, ξ) is the truncated power basis function, and the weird ξ is the knot
- One can show that adding a term of the form β4h(x, ξ) to the model (7.8) for a cubic polynomial will lead to a discontinuity in only the third derivative at ξ; the function will remain continuous, with continuous first and second derivatives, at each of the knots.
    - So, the third derivative will not remain continuous, but we do not care about the third derivative. As long as the function and its first and second derivatives are continuous, the curve looks smooth at the knot.
- In other words, in order to fit a cubic spline to a data set with K knots, we perform least squares regression with an intercept and 3 + K predictors, of the form X, X², X³, h(X, ξ1), h(X, ξ2), . . . , h(X, ξK), where ξ1, . . . , ξK are the knots.
- <img src="./images/Linearity_6.png" width=500px>
- I guess the 4 in the K + 4 degrees of freedom comes from the intercept plus the three polynomial terms (x, x², x³)
- A natural spline is a regression spline with additional boundary constraints: the function is required to be linear at the boundary (in the region where X is smaller than the smallest knot, or larger than the largest knot).
- This additional constraint means that natural splines generally produce more stable estimates at the boundaries.
- **For an ordinary regression spline, the confidence bands in the boundary region appear fairly wild, which is what motivates the boundary constraints above.**
- Intuition: Where X lies beyond the outermost knots there is little data, so the confidence bands can look a bit wild. To deal with this we add a constraint that requires the function to be linear at the outer edges. This makes the fit more stable!!
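
Here is a minimal sketch of the truncated power basis and the resulting least squares fit. The knot locations, the coefficients, and the helper names are all hypothetical. This fits an ordinary cubic spline, not a natural spline (the boundary constraints are not imposed here).

```python
import numpy as np

def truncated_power(x, knot):
    """h(x, xi) = (x - xi)^3 if x > xi, else 0."""
    return np.where(x > knot, (x - knot) ** 3, 0.0)

def cubic_spline_design(x, knots):
    """Intercept, x, x^2, x^3, then one truncated power column per knot."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [truncated_power(x, k) for k in knots]
    return np.column_stack(cols)

knots = [2.0, 4.0, 6.0]                 # hypothetical knot locations
x = np.linspace(0.0, 8.0, 81)
X = cubic_spline_design(x, knots)       # 81 rows, K + 4 = 7 columns

# Noiseless response built from known coefficients, so least squares
# should reproduce them.
true_beta = np.array([1.0, -0.5, 0.2, 0.05, -0.3, 0.4, -0.2])
y = X @ true_beta
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The design matrix has K + 4 columns, which is another way to see the K + 4 degrees of freedom. In practice the truncated power basis becomes numerically ill-conditioned with many knots, so software typically uses an equivalent but better-behaved basis.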

### 7.4.4 Choosing the Number and Locations of the Knots
- One way to do this is to specify the desired degrees of freedom, and then have the software automatically place the corresponding number of knots at uniform quantiles of the data.
- Is the reason the knots sit at three different percentiles that we asked for four degrees of freedom?
- One option is to try out different numbers of knots and see which produces the best looking curve
- A somewhat more objective approach is to use cross-validation, as discussed in Chapters 5 and 6.
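- A natural spline can be thought of as a spline that follows an extra guideline at the boundaries: it must be linear beyond the outermost knots
- For a cubic spline, the knots can be placed where YOU believe they are most appropriate, for example where the function seems to change most rapidly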
- <img src="./images/Linearity_7.png" width=500px>
- The reason we look at the degrees of freedom is that they tell the software how many knots to place
- Four degrees of freedom leads to three knots.
- Of course, the exact reason is somewhat technical, and it is not fully explained here
- In Section 7.7 we fit additive spline models simultaneously on several variables at a time. This could potentially require the selection of degrees of freedom for each variable. In cases like this we typically adopt a more pragmatic approach and set the degrees of freedom to a fixed number, say four, for all terms.
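
The two ideas above (knots at uniform quantiles, and cross-validation to pick how many) can be sketched as follows. This is an illustration on simulated data using an ordinary cubic spline with a truncated power basis; the function names `quantile_knots`, `spline_design`, and `cv_mse` are hypothetical.

```python
import numpy as np

def quantile_knots(x, n_knots):
    """Place n_knots interior knots at uniform quantiles of x."""
    probs = np.linspace(0.0, 1.0, n_knots + 2)[1:-1]
    return np.quantile(x, probs)

def spline_design(x, knots):
    """Cubic spline design matrix via the truncated power basis."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.where(x > k, (x - k) ** 3, 0.0) for k in knots]
    return np.column_stack(cols)

def cv_mse(x, y, n_knots, n_folds=10, seed=0):
    """K-fold cross-validated MSE for a cubic spline with quantile knots."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    sq_err = 0.0
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        knots = quantile_knots(x[train], n_knots)   # knots chosen on training data only
        beta, *_ = np.linalg.lstsq(spline_design(x[train], knots), y[train], rcond=None)
        pred = spline_design(x[held_out], knots) @ beta
        sq_err += np.sum((y[held_out] - pred) ** 2)
    return sq_err / len(x)

# Simulated data, assumed only for illustration.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, 300)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

scores = {k: cv_mse(x, y, k) for k in range(1, 8)}
best_k = min(scores, key=scores.get)   # number of knots with the lowest CV error
```

With three knots, `quantile_knots` returns the 25th, 50th, and 75th percentiles, which is the placement described in the notes.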

### 7.4.5 Comparison to Polynomial Regression
- Regression splines often give superior results to polynomial regression. This is because unlike polynomials, which must use a high degree (exponent in the highest monomial term, e.g. X¹⁵) to produce flexible fits, splines introduce flexibility by increasing the number of knots but keeping the degree fixed.
- Another helpful property is that we do not have to keep one polynomial fixed over the whole range regardless of the data. With splines, the fit has more flexibility to follow the data locally.
- The extra flexibility in the polynomial produces undesirable results at the boundaries, while the natural cubic spline still provides a reasonable fit to the data.

## 7.5 Smoothing Splines

### 7.5.1 An Overview of Smoothing Splines
- Regression splines are created by specifying a set of knots, producing a sequence of basis functions, and then using least squares to estimate the spline coefficients.
- **BASIS FUNCTION: The idea is to have at hand a family of functions or transformations that can be applied to a variable X: function b1(X), b2(X), . . . , bK(X).**
- One thing that is a bit unclear: with smoothing splines we estimate the function g itself, rather than coefficients attached to basis functions chosen in advance!
- In fitting a smooth curve to a set of data, what we really want to do is find some function, say g(x), that fits the observed data well: that is, we want $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - g(x_i))^2$ to be small.
- However, if we don’t put any constraints on g(xi), then we can always make RSS zero simply by choosing g such that it interpolates all of the yi. Such a function would woefully overfit the data—it would be far too flexible.
- <img src="./images/Linearity_8.png" width=500px>
- λ is a nonnegative tuning parameter.
- The function g that minimizes (7.11) is known as a smoothing spline.
- Equation 7.11 takes the “Loss+Penalty” formulation that we encounter in the context of ridge regression and the lasso in Chapter 6.
- The term $\sum_{i=1}^{n} (y_i - g(x_i))^2$ is a loss function that encourages g to fit the data well, and the term $\lambda \int g''(t)^2 \, dt$ is a penalty term
- The double-prime notation $g''$ denotes the second derivative of the function g
- The first derivative g'(t) measures the slope of a function at t, and the second derivative corresponds to the amount by which the slope is changing
- The second derivative measures the roughness of the curve. Think about it: where the first derivative is at a min or max, the second derivative is 0
    - If the second derivative is 0 everywhere, the slope never changes and the curve is a straight line
    - As the curve becomes more wiggly, the second derivative gets larger in absolute value
- (The second derivative of a straight line is zero; note that a line is perfectly smooth.)
- The ∫ notation is an integral, which we can think of as a summation over the range of t. In other words, $\int g''(t)^2 \, dt$ is simply a measure of the total change in the function g'(t), over its entire range
- Intuition: The integral is an area under a curve, here the curve of g''(t)². If the first derivative is nearly constant, we know that the function is close to a straight line
    - Thus, the integral of the squared second derivative will be small
    - Conversely, if g is jumpy and variable then g'(t) will vary significantly and $\int g''(t)^2 \, dt$ will take on a large value.
- Thus if the integral is very small, the penalty barely affects our fit. However, if our curve is not smooth, we would have to pay a large penalty
- The larger the value of λ, the smoother g will be
    - Why? At first I was not sure why an increase in lambda leads to a smoother function
    - A larger lambda puts more weight on the penalty term, so the minimizer is pushed toward smoother functions with a small $\int g''(t)^2 \, dt$. As lambda goes to infinity, g becomes a straight line
    - When lambda is 0, the penalty has no effect and we do not get the benefit of smoothing: g simply interpolates the data
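
The effect of λ can be seen in a discrete analogue of the "Loss + Penalty" criterion. Note the assumption: this is not the smoothing spline itself. It replaces the integral of g''(t)² with a sum of squared second differences on an evenly spaced grid, which keeps the code short while showing the same behaviour. All names are hypothetical.

```python
import numpy as np

def second_difference_matrix(n):
    """(n-2) x n matrix D with (D g)_i = g_i - 2 g_{i+1} + g_{i+2}."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

def penalized_fit(y, lam):
    """Minimize sum (y_i - g_i)^2 + lam * sum (second differences of g)^2."""
    n = len(y)
    D = second_difference_matrix(n)
    # Setting the gradient to zero gives (I + lam * D'D) g = y.
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

def roughness(g):
    """Discrete stand-in for the integral of g''(t)^2."""
    return np.sum(np.diff(g, n=2) ** 2)

# Simulated data, assumed only for illustration.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * t) + rng.normal(scale=0.2, size=t.size)

g_rough = penalized_fit(y, lam=0.0)    # no penalty: reproduces y exactly
g_mid = penalized_fit(y, lam=10.0)     # moderate smoothing
g_smooth = penalized_fit(y, lam=1e8)   # huge penalty: essentially a straight line
```

As λ grows, the roughness of the fitted values shrinks toward zero, which is the discrete version of "the larger the value of λ, the smoother g will be".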
    
- SMOOTHING SPLINES VS REGRESSION
    - <img src="./images/Linearity_13.png" width=400px> 


### 7.5.2 Choosing the Smoothing Parameter λ
- We have seen that a smoothing spline is simply a natural cubic spline with knots at every unique value of xi.
- It is possible to show that as λ increases from 0 to ∞, the effective degrees of freedom, which we write dfλ, decrease from n to 2.
- **effective degrees of freedom instead of degrees of freedom**
    - Usually degrees of freedom refer to the number of free parameters, such as the number of coefficients fit in a polynomial or cubic spline.
    - Although a smoothing spline has n parameters and hence n nominal degrees of freedom, these n parameters are heavily constrained or shrunk down.
    - Hence dfλ is a measure of the flexibility of the smoothing spline—the higher it is, the more flexible (and the lower-bias but higher-variance) the smoothing spline.
        - When a function is more flexible, it means we do not have a bias towards a specific structure. Lower bias with higher variance means we might overfit the data
    - The definition of effective degrees of freedom is somewhat technical.
- <img src="./images/Linearity_9.png" width=200px>
    - where $\hat{g}_\lambda$ is the solution to (7.11) for a particular choice of λ; that is, it is an n-vector containing the fitted values of the smoothing spline at the training points x1, . . . , xn.
    - Equation 7.12 indicates that the vector of fitted values when applying a smoothing spline to the data can be written as an n × n matrix Sλ (for which there is a formula) times the response vector y.
    - <img src="./images/Linearity_10.png" width=200px> 
        - Which is the sum of the diagonal elements of the matrix Sλ.
- In fitting a smoothing spline, we do not need to select the number or location of the knots—there will be a knot at each training observation, x1, . . . , xn. Instead, we have another problem: we need to choose the value of λ.
- It turns out that the leave-one-out cross-validation error (LOOCV) can be computed very efficiently for smoothing splines, with essentially the same cost as computing a single fit
- <img src="./images/Linearity_11.png" width=400px> 
- The notation $\hat{g}_\lambda^{(-i)}(x_i)$ indicates the fitted value for this smoothing spline evaluated at xi, where the fit uses all of the training observations except for the ith observation (xi, yi).
- In contrast, $\hat{g}_\lambda(x_i)$ indicates the smoothing spline function fit to all of the training observations and evaluated at xi.
- <img src="./images/Linearity_12.png" width=400px> 
- Since there is little difference between the two fits, the smoothing spline fit with 6.8 degrees of freedom is preferable, since in general simpler models are better unless the data provides evidence in support of a more complex model.
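
Both the trace definition of the effective degrees of freedom and the LOOCV shortcut can be checked numerically on any linear smoother of the form ĝ = Sλ y. The sketch below uses a discrete second-difference smoother as a stand-in for the smoothing spline (an assumption made to keep the code self-contained, not the book's Sλ), and compares the shortcut against refitting n times.

```python
import numpy as np

def second_difference_matrix(n):
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

def smoother_matrix(n, lam):
    """S_lambda such that g_hat = S_lambda @ y for the discrete smoother."""
    D = second_difference_matrix(n)
    return np.linalg.inv(np.eye(n) + lam * D.T @ D)

def effective_df(S):
    """df_lambda = trace(S_lambda), the sum of the diagonal elements."""
    return float(np.trace(S))

def loocv_shortcut(S, y):
    """Sum over i of [(y_i - g_hat(x_i)) / (1 - S_ii)]^2, from a single fit."""
    resid = y - S @ y
    return float(np.sum((resid / (1.0 - np.diag(S))) ** 2))

def loocv_brute_force(y, lam):
    """Refit n times, each time giving observation i zero weight in the loss."""
    n = len(y)
    D = second_difference_matrix(n)
    P = lam * D.T @ D
    total = 0.0
    for i in range(n):
        w = np.ones(n)
        w[i] = 0.0
        g = np.linalg.solve(np.diag(w) + P, w * y)
        total += (y[i] - g[i]) ** 2
    return total

# Simulated data, assumed only for illustration.
rng = np.random.default_rng(0)
n = 30
t = np.linspace(0.0, 1.0, n)
y = np.sin(2 * np.pi * t) + rng.normal(scale=0.2, size=n)

df_no_penalty = effective_df(smoother_matrix(n, 0.0))      # equals n
df_heavy_penalty = effective_df(smoother_matrix(n, 1e8))   # approaches 2
S = smoother_matrix(n, 5.0)
cv_fast = loocv_shortcut(S, y)
cv_slow = loocv_brute_force(y, 5.0)
```

The effective degrees of freedom run from n (no penalty) down toward 2 (a straight line), and the shortcut agrees with the brute-force refits while needing only one fit.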

## 7.6 Local Regression
- Local regression is a different approach for fitting flexible non-linear functions, which involves computing the fit at a target point x0 using only the nearby training observations.
- Intuition: We are looking at the observations that are around it!
- Note that in Step 3 of Algorithm 7.1, the weights Ki0 will differ for each value of x0.
- Pretty much, each target point uses different nearby observations, so the weights have to be recomputed for the observations being used
- <img src="./images/Linearity_15.png" width=400px> 
- In order to perform local regression, there are a number of choices to be made, 
    - Such as how to define the weighting function K, 
    - Whether to fit a linear, constant, or quadratic regression in Step 3 above. 
    - The most important choice is the span s, defined in Step 1 above. The span plays a role like that of the tuning parameter λ in smoothing splines: it controls the flexibility of the non-linear fit.
    - The smaller the value of s, the more local and wiggly will be our fit; alternatively, a very large value of s will lead to a global fit to the data using all of the training observations.
- <img src="./images/Linearity_16.png" width=400px> 
- The span is the fraction of the training points (the nearest ones) that we use at each target point, i.e. the number of neighbors divided by the total number of points
- If the span is larger, then the fit is more generalized (smoother)
- The main point of the weights is to filter out the points that are not near the target point we are looking at
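
The steps of local regression at a single target point can be sketched as below. Two assumptions are made that the notes leave open: the weighting function K is taken to be the tricube kernel (a common choice, not stated in the notes), and the local model is linear. The function name `local_linear_fit` is hypothetical.

```python
import numpy as np

def local_linear_fit(x, y, x0, span):
    """Local linear regression estimate at x0 using the fraction `span` of nearest points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = min(n, max(3, int(np.ceil(span * n))))   # number of neighbours used
    dist = np.abs(x - x0)
    nearest = np.argsort(dist)[:k]               # indices of the k closest points
    d_max = dist[nearest].max()
    if d_max == 0.0:
        w = np.ones(k)
    else:
        # Tricube weights (assumed choice of K): the farthest neighbour gets weight 0.
        w = (1.0 - (dist[nearest] / d_max) ** 3) ** 3
    # Weighted least squares: minimise sum_i w_i (y_i - b0 - b1 x_i)^2.
    X = np.column_stack([np.ones(k), x[nearest]])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y[nearest] * sw, rcond=None)
    return beta[0] + beta[1] * x0

# Exactly linear data, so both a local and a global span recover the line.
x = np.linspace(0.0, 10.0, 50)
y = 3.0 + 2.0 * x
fit_local = local_linear_fit(x, y, x0=4.2, span=0.3)
fit_global = local_linear_fit(x, y, x0=4.2, span=1.0)
```

The weights are recomputed for every new x0, which is why the fit has to be redone at each target point. On noisy data a small span gives a wigglier curve and a span of 1 uses all of the training observations.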

## 7.7 Generalized Additive Models
- Here we explore the problem of flexibly predicting Y on the basis of several predictors, X1, . . . , Xp. This amounts to an extension of multiple linear regression.
- Generalized additive models (GAMs) provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity.

### 7.7.1 GAMs for Regression Problems
- It is called an additive model because we calculate a separate fj for each Xj , and then add together all of their contributions.
- We are creating a function for each of the variables and then adding them together to provide a finalized answer
- The beauty of GAMs is that we can use these methods as building blocks for fitting an additive model. In fact, for most of the methods that we have seen so far in this chapter, this can be done fairly trivially. Take, for example, natural splines, and consider the task of fitting the model
- <img src="./images/Linearity_18.png" width=400px> 
- Here year and age are quantitative variables, and education is a qualitative variable with five levels: Less than HS, HS, Less than Coll, Coll, More than Coll, referring to the amount of high school or college education that an individual has completed. We fit the first two functions using natural splines. We fit the third function using a separate constant for each level, via the usual dummy variable approach of Section 3.3.1.
    - Meaning that the quantitative variables (year, age) get smooth functions, with knots marking where the relationship is allowed to change
    - The qualitative variable (education) is handled differently: each category gets its own constant
- <img src="./images/Linearity_17.png" width=400px> 
- The left-hand panel indicates that holding age and education fixed, wage tends to increase slightly with year; this may be due to inflation.
- The center panel indicates that holding education and year fixed, wage tends to be highest for intermediate values of age, and lowest for the very young and very old.
- The right-hand panel indicates that holding year and age fixed, wage tends to increase with education: the more educated a person is, the higher their salary, on average. All of these findings are intuitive.
- We do not have to use splines as the building blocks for GAMs: we can just as well use local regression, polynomial regression, or any combination of the approaches seen earlier in this chapter in order to create a GAM. GAMs are investigated in further detail in the lab at the end of this chapter.
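
Because each building block is just a set of basis columns, an additive model like the one above can be fit as one big least squares regression. The sketch below uses simulated stand-ins for year, age, and education, and swaps in ordinary cubic splines (truncated power basis) for the natural splines used in the book. All names and data are hypothetical.

```python
import numpy as np

def spline_basis(x, knots):
    """Cubic spline basis without intercept: x, x^2, x^3, and (x - knot)^3_+."""
    cols = [x, x ** 2, x ** 3]
    cols += [np.where(x > k, (x - k) ** 3, 0.0) for k in knots]
    return np.column_stack(cols)

def dummy_columns(labels, levels):
    """One 0/1 column per level except the first, which is the baseline."""
    return np.column_stack([(labels == lev).astype(float) for lev in levels[1:]])

# Simulated data, assumed only for illustration.
rng = np.random.default_rng(0)
n = 400
x1 = rng.uniform(0.0, 1.0, n)        # stand-in for year
x2 = rng.uniform(0.0, 1.0, n)        # stand-in for age
levels = np.array(["a", "b", "c"])   # stand-in for education levels
x3 = rng.choice(levels, size=n)
effect = {"a": 0.0, "b": 1.0, "c": 2.5}
y = (2.0 + x1 + np.sin(3 * x2)
     + np.array([effect[v] for v in x3])
     + rng.normal(scale=0.1, size=n))

B1 = spline_basis(x1, knots=[0.5])   # 4 columns for f1
B2 = spline_basis(x2, knots=[0.5])   # 4 columns for f2
B3 = dummy_columns(x3, levels)       # 2 columns for f3
X = np.column_stack([np.ones(n), B1, B2, B3])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Each f_j is the part of the fit that comes from its own block of columns.
f1 = B1 @ beta[1:5]
f2 = B2 @ beta[5:9]
f3 = B3 @ beta[9:]
fitted = beta[0] + f1 + f2 + f3
```

Plotting f1, f2, and f3 against their own variables gives the kind of three-panel display described in these notes, with each panel showing one variable's contribution while the others are held fixed.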


- Pros and Cons
    - GAMs allow us to fit a non-linear fj to each Xj, so that we can automatically model non-linear relationships that standard linear regression will miss. We do not need to manually try out many different transformations on each variable.
    - The non-linear fits can potentially make more accurate predictions for the response Y.
    - Because the model is additive, we can examine the effect of each Xj on Y individually while holding all of the other variables fixed.
    - The smoothness of the function fj for the variable Xj can be summarized via degrees of freedom.
    - The main limitation of GAMs is that the model is restricted to be additive. With many variables, important interactions can be missed. However, as with linear regression, we can manually add interaction terms to the GAM model by including additional predictors of the form Xj × Xk. In addition we can add low-dimensional interaction functions of the form fjk(Xj,Xk) into the model; such terms can be fit using two-dimensional smoothers such as local regression, or two-dimensional splines (not covered here).

- GAMs provide a useful compromise between linear and fully nonparametric models.

### 7.7.2 GAMs for Classification Problems
- Equation 7.18 is a logistic regression GAM. It has all the same pros and cons as discussed in the previous section for quantitative responses.
- <img src="./images/Linearity_19.png" width=400px> 
- <img src="./images/Linearity_20.png" width=400px> 
- <img src="./images/Linearity_21.png" width=400px> 
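
In a logistic regression GAM the additive part models the log-odds, so turning the component functions into a probability is a one-line transformation. A minimal sketch, assuming the fj values have already been estimated (the numbers below are made up, and `gam_probability` is a hypothetical helper):

```python
import numpy as np

def gam_probability(beta0, components):
    """p(X) from log(p / (1 - p)) = beta0 + f_1(X_1) + ... + f_p(X_p)."""
    log_odds = beta0 + np.sum(components, axis=0)
    return 1.0 / (1.0 + np.exp(-log_odds))

# Hypothetical values f_1(x_i1) and f_2(x_i2) for three observations.
f1 = np.array([-1.0, 0.0, 2.0])
f2 = np.array([1.0, 0.0, 1.0])
p = gam_probability(beta0=0.0, components=[f1, f2])
```

Estimating the fj themselves requires an iterative fit (as with ordinary logistic regression), which this sketch does not attempt.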
