# SYD DAT 8 Homework 2 - Visualisation and Regression

## Homework - Due Friday 30th June

#### Setup
* Sign up for an AWS account

#### Communication
* Imagine you are trying to explain to someone what Linear Regression is - but they have no programming/maths experience? How would you explain the overall process, what a p-value means and what R-Squared means?
* Read the paper [Useful things to know about machine learning]( https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf). 
    * What have we covered so far from this paper? 
    * Explain sections 6-13 in your own words

#### Machine Learning
* Describe 3 ways we can select what features to use in a model
* Complete the first 3 exercises from Chapter 3 of Introduction to Statistical Learning in Python

#### Course Project
* For the following tasks, set up a new GitHub repository for your project and share it with Alasdair and Ian over Slack.
* Load the data you have gathered for your project into Python and run some summary statistics over the data. Are there any interesting features of the data that jump out? (Include the code)
* Draft/Sketch (or wireframe) some data visualisations that would be useful for you to explore your data set
* Are there any regression or clustering techniques you could use in your project? Write them down (with the corresponding scikit-learn function) and what you think you would get out of them. Try them out if you get a chance.


**Instructions: copy this file and append your name to the filename, e.g. Homework2_ian_hansel.ipynb.
Then commit it in your local repository, push it to your GitHub account and create a pull request so I can see your work. Remember, if you get stuck, look at the slides covering Fork, Clone, Commit, Push and Pull request.**

#### Communication
**Imagine you are trying to explain to someone what Linear Regression is - but they have no programming/maths experience? How would you explain the overall process, what a p-value means and what R-Squared means?**

Linear regression is a statistical model that attempts to use one independent variable (X; the predictor) to predict the value of a dependent variable (Y). For example, using one's height to predict their weight.

A slightly more complex version uses multiple predictors to determine the value of a single target. This is known as multiple linear regression. 

In both cases, the model attempts to create a function: a way of defining the relationship between the dependent and independent variables. As the name suggests, linear regression creates a linear (straight-line) relationship between the variables. The most common way to achieve this is the ordinary least squares method, which returns the function that minimises the sum of the squared differences between the actual and predicted values of Y.
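As a rough illustration of ordinary least squares with a single predictor, the sketch below fits a straight line using only the standard library. The height and weight figures are invented for the example, and `fit_line` is a hypothetical helper name, not part of any library.

```python
# Minimal ordinary least squares for one predictor (standard library only).
# The height/weight numbers below are invented for illustration.

def fit_line(xs, ys):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 67, 74, 84]

intercept, slope = fit_line(heights_cm, weights_kg)
predicted_175 = intercept + slope * 175

print(f"weight = {intercept:.1f} + {slope:.2f} * height")
print(f"predicted weight at 175 cm: {predicted_175:.1f} kg")
```

In practice a library such as scikit-learn or statsmodels would do this fitting, but the arithmetic underneath is the same.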

Once a regression model has been fit to a set of data, statistical software produces a summary table that assesses how well our model can predict Y. The most important values to consider are the p-value and R-squared. Each of our predictors will receive a corresponding p-value. The overall model will receive a single R-squared value.

A p-value is assigned to each predictor variable and tests the null hypothesis that the variable's coefficient is equal to zero, that is, that it has no effect on the dependent variable. Generally, a p-value < 0.05 is treated as evidence of a significant effect, and we conclude that the corresponding variable does have an influence. For each variable with p < 0.05, we then look at its beta coefficient. If the value is positive, the interpretation is that for every 1-unit increase in the predictor (holding any other predictors constant), the dependent variable is predicted to increase by the value of the coefficient. Similarly, a negative beta indicates an inverse relationship between the two variables, i.e. as X increases, Y is predicted to decrease.

R-squared, on the other hand, assesses the fit of the model in its entirety. It is a statistical measure of how closely the data fit the model's regression line. Expressed as a percentage, R-squared indicates the proportion of the variability in the response data that is explained by our model. Models with a higher R-squared can be interpreted as fitting the data better than those with lower values.
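To make the R-squared definition concrete, here is a small sketch that computes it as one minus the ratio of unexplained to total variability. The observed values and the model's predictions are made-up numbers, and `r_squared` is a hypothetical helper.

```python
def r_squared(actual, predicted):
    """Proportion of the variability in `actual` explained by `predicted`."""
    mean_y = sum(actual) / len(actual)
    # Variability left over after the model's predictions.
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    # Total variability around the mean of the observed values.
    ss_total = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_residual / ss_total

actual = [52, 58, 67, 74, 84]              # invented observations
model_predictions = [51, 59, 67, 75, 83]   # invented predictions from some fitted line
mean_only = [67] * 5                       # "model" that always predicts the mean

r2_model = r_squared(actual, model_predictions)
r2_baseline = r_squared(actual, mean_only)

print(f"R-squared of the model:    {r2_model:.3f}")
print(f"R-squared of the baseline: {r2_baseline:.3f}")
```

A model that only ever predicts the mean explains none of the variability, which is why it scores zero.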

**Read the paper [Useful things to know about machine learning]( https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf).** 

**What have we covered so far from this paper?**
- Classification, specifically logistic regression
- K-Nearest Neighbours 
- cross-validation - test/retest
- overfitting vs underfitting; Variance/Bias Tradeoff
- Feature regularization


**Explain Sections 6-13 in your own words**

*Section 6: Intuition Fails in High Dimensions*

The section invites us to visualise a K-Nearest Neighbours situation. When too many dimensions are included in the model, all points end up at a similar distance from the point of interest, so the choice of which K points are nearest becomes effectively random. This raises the importance of reducing the number of features within our model to only the most relevant and informative.
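A quick simulation can illustrate this effect. The sketch below draws uniformly random points (an assumption made purely for illustration) and compares the distance to the nearest point with the distance to the farthest one. As the number of dimensions grows, the ratio moves towards 1, meaning the "nearest" neighbour is barely nearer than any other point.

```python
import math
import random

def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of the smallest to the largest distance from a random query point."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return min(distances) / max(distances)

for dim in (2, 10, 100, 1000):
    print(dim, round(nearest_to_farthest_ratio(dim), 3))
```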

*Section 7: Theoretical Guarantees are not what they seem*
This section offers us a number of theoretical proofs but cautions the reader that, though they may hold true, they should not be relied on exclusively when making practical machine learning decisions. There is no such thing as a 'free lunch'.

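For example, one theoretical guarantee is that, given infinite data, the learner will output the correct classifier. However, in practice we are never close to having infinite data, and because of the bias-variance tradeoff, a learner that is better with infinite data can be worse than a simpler one when data is finite.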

Theoretical guarantees should be used to understand algorithm decisions. Just because a learner has a theoretical justification and works in practice does not mean that the underlying theory is responsible. 

*Section 8: Feature Engineering is Key*
The most important determinant of machine learning success is feature selection and manipulation. Raw data does not often come in a form that is easily learnable, but a data scientist can transform it into features amenable to learning. Most of the effort in machine learning goes into these tasks because feature engineering is domain-specific. The goal of feature engineering is to provide many independent features that correlate well with the class.

Attempts have been made to automate feature selection by, for example, asking the learner to remove features that offer the least information. However, this ignores interaction effects between features. Ultimately, feature engineering remains within the realm of human intuition, creativity and 'black art'.


*Section 9: More data beats a cleverer algorithm*
Generally, a model will perform better if we add more data than if we improve the algorithm and keep the data constant. However, because time is a bottleneck, adding more data may not always be the right decision. This is why simpler classifiers are often used in practice over more complex ones.

In the end, practitioners should aim to produce learners that create 'human-understandable output'. This may be more difficult to quantify than accuracy and computational cost, but learners that score highly on this type of output will offer the most value.

*Section 10: Learn many models, not just one*
There is no one correct learner, only the right learner for a given application. Today, best practice involves combining many variations of learners into model ensembles.

Model ensembles can be created by:
- bagging = combining classifiers that have been learned through resampling of training data
- boosting = weights are given to training examples. The weights vary so that each new classifier focuses on the examples the previous learner got wrong.
- stacking = outputs of individual classifiers are used as inputs for new 'higher order' classifiers.
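Of the three approaches listed above, bagging is the simplest to sketch. The example below is a toy version under stated assumptions: the data is one-dimensional and invented, each base classifier is a single threshold (a 'stump'), and the helper names `fit_stump`, `bagged_stumps` and `predict` are hypothetical.

```python
import random
from collections import Counter

def fit_stump(points):
    """Pick the threshold that misclassifies the fewest points.
    A stump predicts class 1 when x > threshold, otherwise class 0."""
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(label) for x, label in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagged_stumps(points, n_models=25, seed=0):
    """Train each stump on a bootstrap resample of the training data."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_models):
        resample = [rng.choice(points) for _ in points]  # sample with replacement
        thresholds.append(fit_stump(resample))
    return thresholds

def predict(thresholds, x):
    """Combine the stumps by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Invented training data: (feature value, class label).
train = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (5.0, 1), (6.0, 1)]
ensemble = bagged_stumps(train)

print(predict(ensemble, 0.5), predict(ensemble, 6.5))
```

Using an odd number of stumps avoids tied votes in this two-class setting.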


*Section 11: Simplicity does not imply accuracy*
Simpler models, or hypothesis spaces, are not necessarily more accurate than complex ones. However, simpler is preferred because simplicity is a virtue in its own right. 

*Section 12: Representable does not imply learnable*
Functions that can be represented are not necessarily learnable. For example, if a hypothesis space has many local optima, a learner may not find the true function even if it can be represented.
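The local optima problem can be seen with a one-dimensional stand-in for a learner's error surface. In the sketch below, the function `f` is an arbitrary example chosen because it has two minima. Plain gradient descent finds the better one only when it starts on the correct side.

```python
def f(x):
    """An example error surface with one local and one global minimum."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(start, learning_rate=0.01, steps=2000):
    """Repeatedly step downhill from `start`."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x

x_from_right = gradient_descent(2.0)    # settles in the shallower minimum
x_from_left = gradient_descent(-2.0)    # settles in the deeper minimum

print(f"start  2.0 -> x = {x_from_right:.3f}, f(x) = {f(x_from_right):.3f}")
print(f"start -2.0 -> x = {x_from_left:.3f}, f(x) = {f(x_from_left):.3f}")
```

Both runs stop where the gradient is zero, yet only one of them has found the lowest point on the curve.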

*Section 13: Correlation does not imply causation*
Machine learning typically uses observational data rather than data from an experimental design. Thus, relationships found between variables using machine learning should not be assumed to be causal.

Correlations do, however, signal the potential for a causal relationship and should therefore drive future investigations.

#### Machine Learning

**Describe 3 ways we can select what features to use in a model**

1. Fit separate linear regression models for every combination of features (best subset selection): This would be the most thorough way to determine which features have the most influence on our target and can be completed when a small number of features need to be tested.

However, if a large number of features are to be tested, the computational time required becomes a bottleneck. We also increase the risk of adding redundant features, those that do not add any predictive value to the model. The more features we add, the greater the chance of collinearity as well, which we do not want in our models.

2. Ridge Regression - A penalty term, scaled by a tuning parameter lambda, is added to the model's least squares objective. The penalty grows with the squared size of the coefficients, so it pushes the model to stay as simple as possible. The larger lambda is, the more the beta coefficient of each predictor is shrunk towards, but never exactly to, zero.

3. Lasso - Unlike ridge regression, the lasso penalty allows some betas to reach exactly zero, depending on the size of the tuning parameter lambda. When a predictor's beta reaches zero, we can deem that feature irrelevant since it has effectively been removed from the model.
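The difference between the ridge and lasso penalties can be sketched with a small coordinate descent routine. This is a teaching sketch under stated assumptions, not how a library such as scikit-learn implements it: the data is invented, already centred (so there is no intercept), and `coordinate_descent` and `soft_threshold` are hypothetical helper names.

```python
def soft_threshold(value, lam):
    """Shrink value towards zero by lam; return exactly 0.0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def coordinate_descent(X, y, lam, penalty, sweeps=100):
    """Minimise (1/2n) * RSS plus a penalty, one coefficient at a time.
    penalty="lasso" adds lam * sum(|b_j|); penalty="ridge" adds (lam/2) * sum(b_j^2)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Average product of feature j with the residual that leaves feature j out.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            if penalty == "lasso":
                beta[j] = soft_threshold(rho, lam) / z
            else:
                beta[j] = rho / (z + lam)
    return beta

# Invented data: the target depends strongly on the first feature only.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [3.2, 2.8, -3.1, -2.9]

ridge = coordinate_descent(X, y, lam=0.5, penalty="ridge")
lasso = coordinate_descent(X, y, lam=0.5, penalty="lasso")

print("ridge:", [round(b, 3) for b in ridge])
print("lasso:", [round(b, 3) for b in lasso])
```

With these numbers, ridge keeps small non-zero coefficients on the two weak features, while lasso sets them to exactly zero and so drops them from the model.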





**Complete the first 3 exercises from chapter 3 of Introduction to Statistical Learning in Python**
  
*Question 1:*

Each of the predictor variables in Table 3.4 (TV, Radio, Newspaper) is given an associated p-value. This value indicates whether a significant relationship exists between it and the target variable, Sales (as measured in thousands of units).

The null hypothesis for each p-value is that there exists no relationship between the predictor variable and Sales. A p-value < 0.05 is sufficient to reject the null and conclude that, all else equal, the corresponding predictor variable has a significant impact on number of units sold. 
      
Based on Table 3.4 we can conclude that, of the 3 predictors within the multiple regression model, TV and Radio advertising significantly affect Sales. The amount allocated to newspaper advertising has no statistically significant effect.
      
To offer a meaningful conclusion, the model predicts that, all else held constant, a \$1000 increase in the budget for TV advertising will lead to approximately 46 additional units sold (the TV coefficient is 0.046 and Sales is measured in thousands of units). Similarly, if the company were to increase its radio advertising budget by an additional \$1000 (and hold everything else constant), it can expect an increase of approximately 189 units sold.
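The unit conversion behind these figures is easy to get wrong, so here is a small worked check. It assumes a radio coefficient of 0.189, which is the value implied by the 189 units quoted above, with Sales in thousands of units and budgets in thousands of dollars. The helper `extra_units_sold` is hypothetical.

```python
def extra_units_sold(coefficient, budget_increase_dollars):
    """Convert a regression coefficient into extra units sold.
    Assumes Sales is in thousands of units and budgets are in thousands of dollars."""
    budget_increase_thousands = budget_increase_dollars / 1000
    extra_sales_thousands = coefficient * budget_increase_thousands
    return extra_sales_thousands * 1000

radio_coefficient = 0.189  # assumed from the 189 units quoted in the text

radio_units = round(extra_units_sold(radio_coefficient, 1000))
print(radio_units)
```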

*Question 2:*

The K-nearest neighbours (KNN) algorithm is used for both classification and regression. In both cases, the number of neighbours (K) is selected by the user. The choice of K influences the fit of the model.

For classification models: Assume K = 3. The KNN classification model will find the 3 closest points in the training data set to the value whose group membership we are trying to predict.  The model identifies the value as belonging to the class in which the majority of the K neighbour observations belong. 

For Regression Models: The output of a KNN algorithm is an estimate for a continuous variable. The use of KNN in regression is an example of a non-parametric model. Unlike linear regression, this model makes no assumptions about the underlying structure of f(x). 
Similar to the classification model, K is selected by the user. Again, assuming K=3, the model will determine the 3 closest points to the value in question. The predicted Y value is an average of the known Y values of these points. 
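The two uses of KNN differ only in how the neighbours' values are combined, which the sketch below tries to show. It uses a single invented feature, and the function names are hypothetical. scikit-learn offers `KNeighborsClassifier` and `KNeighborsRegressor` for real work.

```python
from collections import Counter

def k_nearest(train, query, k):
    """Return the k training pairs (x, target) whose x is closest to the query."""
    return sorted(train, key=lambda pair: abs(pair[0] - query))[:k]

def knn_classify(train, query, k=3):
    """Classification: majority vote among the k nearest labels."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Regression: average of the k nearest target values."""
    values = [value for _, value in k_nearest(train, query, k)]
    return sum(values) / len(values)

# Invented training data with a single feature.
class_data = [(1.0, "A"), (2.0, "A"), (3.0, "B"), (6.0, "B"), (7.0, "B")]
reg_data = [(1.0, 10.0), (2.0, 12.0), (3.0, 17.0), (6.0, 30.0), (7.0, 33.0)]

predicted_class = knn_classify(class_data, 2.2)
predicted_value = knn_regress(reg_data, 2.2)

print(predicted_class, predicted_value)
```

For the query 2.2, the three nearest training points are at 2.0, 3.0 and 1.0, so the classifier votes between A, B and A, and the regressor averages 12, 17 and 10.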

*Question 3:*

(a) iii is correct. With Gender coded as 1 for female and 0 for male, being female adds 35 - 10 × GPA to the predicted salary, so for a fixed IQ and GPA, females are predicted to earn more only while GPA is below 3.5. Once GPA exceeds 3.5, males earn more on average (for males, both the Gender term and the GPA × Gender interaction term are zero).

(b) The predicted salary of a female with an IQ of 110 and a GPA of 4.0 is \$137,100.

(c) False. The size of a beta coefficient does not indicate the amount of evidence that the predictor influences the target. The more relevant value to consider when determining whether an interaction effect is present is the associated p-value.
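The prediction in part (b) can be checked with a few lines of arithmetic. The coefficients below are taken from the exercise statement in the textbook and are an assumption here, since this notebook does not reproduce them. Salary is in thousands of dollars and Gender is coded 1 for female and 0 for male.

```python
def predicted_salary(gpa, iq, is_female):
    """Predicted starting salary in thousands of dollars.
    Coefficients assumed from the Chapter 3, Exercise 3 statement."""
    gender = 1 if is_female else 0
    return (
        50
        + 20 * gpa
        + 0.07 * iq
        + 35 * gender
        + 0.01 * gpa * iq       # GPA x IQ interaction
        - 10 * gpa * gender     # GPA x Gender interaction
    )

salary_thousands = predicted_salary(gpa=4.0, iq=110, is_female=True)
print(f"${salary_thousands * 1000:,.0f}")
```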