<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Week 6 Review - Solutions

 _**Author:** Noelle B. (DSI-DEN)_

---
We will review the learning objectives of each lesson this week and answer questions related to them.

---
## 5.07 Object Oriented Programming

### Explain why object-oriented programming (OOP) is central to the Python programming language.

**Q1.** Define `object oriented programming` and `functional programming` and give a few examples of each paradigm.

> **Answer:**  
- **Object-oriented programming**: Organizing code around objects, which are instances of classes (your own data types) that maintain their own state and have methods associated with them. Ex: Python, Ruby
- **Functional programming**: Organizing code around functions that take inputs and return outputs, ideally without modifying shared state. Ex: R, Scala
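
As a rough illustration of the two styles, the sketch below computes the same running total both ways. The `RunningTotal` class and the `add` function are hypothetical names made up for this example.

```python
from functools import reduce

# Object-oriented style: the object keeps its own state and exposes methods.
class RunningTotal:
    def __init__(self):
        self.total = 0

    def add(self, amount):
        self.total += amount  # the object's state changes in place


# Functional style: no stored state; a pure function returns a new value.
def add(total, amount):
    return total + amount


tracker = RunningTotal()
for amount in [1, 2, 3]:
    tracker.add(amount)

oop_total = tracker.total
functional_total = reduce(add, [1, 2, 3], 0)
print(oop_total, functional_total)
```

Both approaches arrive at the same number; the difference is where the state lives.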

### Build a Python class.

**Q2.** Build a `Tree` class in Python that takes in at least a `tree_type` attribute and has at least one method (ex. grow).

In [3]:
# Answer:
class Tree:
    def __init__(self, tree_type, height):
        self.tree_type = tree_type
        self.height = height
    
    def grow(self, grow_amount):
        print('Growing...')
        self.height += grow_amount
        print(f'Done! My new height is {self.height}')

In [4]:
new_tree = Tree('aspen', 72)
new_tree.grow(15)

Growing...
Done! My new height is 87


### Describe object mutability in Python.

**Q3.** What is object mutability in Python?

> **Answer:**  
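Mutability means the state of an object can change after it is created. For example, we changed the height of the tree in the above example. Lists and dictionaries are also mutable, while strings and tuples are immutable.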
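
A small sketch of the idea using built-in types (the variable names are made up for illustration): a list can be changed in place, while a tuple cannot.

```python
# Lists are mutable: the same object can be changed in place.
heights = [72, 80]
alias = heights          # both names point to the same list object
alias.append(87)
print(heights)           # the change is visible through either name

# Tuples are immutable: attempting to change one raises a TypeError.
fixed = (72, 80)
try:
    fixed[0] = 99
    tuple_changed = True
except TypeError:
    tuple_changed = False
print(tuple_changed)
```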

### Understand how Scikit-Learn uses OOP to build estimators and transformers.

**Q4.** Briefly describe how Scikit-Learn uses OOP to build estimators and transformers.

> **Answer:**  
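Estimators and transformers in Sklearn are classes. You instantiate an object with its hyperparameters, and calling `.fit()` stores what was learned from the data as attributes on that object (by convention these end in an underscore, such as `coef_`). This can be seen in the code for sklearn and is the reason that certain attributes can only be accessed after an estimator is fit.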
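
To make the pattern concrete, here is a toy class that mimics the fit/predict interface. `MeanRegressor` and its `shift` hyperparameter are hypothetical and not part of scikit-learn; the sketch only shows why a fitted attribute does not exist until `fit` has been called.

```python
class MeanRegressor:
    """Toy estimator (hypothetical, not part of scikit-learn)."""

    def __init__(self, shift=0.0):
        # Hyperparameters are stored at instantiation, before seeing data.
        self.shift = shift

    def fit(self, X, y):
        # Learned state is stored on the object; the trailing underscore
        # follows scikit-learn's naming convention for fitted attributes.
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ + self.shift for _ in X]


model = MeanRegressor()
fitted_before = hasattr(model, 'mean_')   # nothing has been learned yet
model.fit([[1], [2], [3]], [10, 20, 30])
fitted_after = hasattr(model, 'mean_')    # set during fit
predictions = model.predict([[4], [5]])
print(fitted_before, fitted_after, predictions)
```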

### Set up their working directory to import a class from another file.

**Q5.** If I had a `Tree` class in my `growtree.py` file, how would I import this into Python?

In [None]:
# Answer:
from growtree import Tree
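
The import above works when `growtree.py` is in a folder that Python searches. The sketch below is one way to see this end to end: it writes a stand-in `growtree.py` to a temporary folder (purely for demonstration), adds that folder to `sys.path`, and then imports the class.

```python
import importlib
import os
import sys
import tempfile

# Write a stand-in growtree.py to a temporary folder (for demonstration only).
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, 'growtree.py'), 'w') as f:
    f.write(
        "class Tree:\n"
        "    def __init__(self, tree_type):\n"
        "        self.tree_type = tree_type\n"
    )

# Python searches the folders listed in sys.path when importing.
sys.path.insert(0, module_dir)
importlib.invalidate_caches()

from growtree import Tree

tree = Tree('aspen')
print(tree.tree_type)
```

When `growtree.py` sits in the same folder as the notebook or script, the `sys.path` step is usually not needed.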

---
## 6.01 Classification and Regression Trees

### Understand the intuition behind decision trees.

**Q6.** Briefly describe what a decision tree is and how it works.

> **Answer:**  
A decision tree is a model that attempts to predict values based on a series of splits. It is comprised of a root node, parent nodes, child nodes, and finally leaf nodes that depict the prediction of the value.

### Calculate Gini.

**Q7.** Calculate the Gini impurity for the following list:  
`cities = ['Denver', 'Fort Collins', 'Boulder', 'Fort Collins', 'Denver']`

> **Answer:**  
Gini impurity = 1 - sum(P(class i)^2)  
= 1 - P('Denver')^2 - P('Fort Collins')^2 - P('Boulder')^2  
= 1 - (2/5)^2 - (2/5)^2 - (1/5)^2  
= 1 - (4/25) - (4/25) - (1/25)  
= 16/25  
= 0.64
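
The same calculation can be checked in code. This is a minimal sketch; `gini_impurity` is a helper name chosen for this example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())


cities = ['Denver', 'Fort Collins', 'Boulder', 'Fort Collins', 'Denver']
cities_gini = gini_impurity(cities)
print(round(cities_gini, 2))
```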

### Describe how decision trees use gini to make decisions.

**Q8.** How is gini used in decision trees to make decisions?

> **Answer:**  
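At each node, the tree considers the candidate splits and chooses the feature (and split point) that decreases the Gini impurity the most from the parent node to the child nodes, weighting each child by the number of observations it contains.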
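
As a sketch of how two candidate splits might be compared, the code below computes the decrease in Gini impurity for each. The data and the helper names (`gini`, `gini_decrease`) are made up for illustration.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


parent = ['yes', 'yes', 'yes', 'no', 'no', 'no']

# Candidate split A separates the classes perfectly.
split_a = gini_decrease(parent, ['yes', 'yes', 'yes'], ['no', 'no', 'no'])
# Candidate split B leaves both children mixed.
split_b = gini_decrease(parent, ['yes', 'yes', 'no'], ['yes', 'no', 'no'])

print(round(split_a, 3), round(split_b, 3))
```

Split A produces the larger decrease, so a tree choosing between these two candidates would pick split A.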

### Fit, generate predictions from, and evaluate decision tree models.

**Q9.** Using the following dataset, fit, generate predictions from, and evaluate a decision tree.

In [9]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# load data
bc = load_breast_cancer()
bcdf = pd.DataFrame(bc.data, columns = bc.feature_names)
bcdf['type'] = bc.target

X = bcdf.drop(columns='type')
y = bcdf['type']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [10]:
# Answer:
dt = DecisionTreeClassifier()

dt.fit(X_train, y_train)
print('Training accuracy:', dt.score(X_train, y_train))
print('Testing accuracy:', dt.score(X_test, y_test))

Training accuracy: 1.0
Testing accuracy: 0.951048951048951


### Interpret and tune max_depth, min_samples_split, min_samples_leaf, ccp_alpha & visualize a decision tree.

**Q10.** Describe what the following hyperparameters do in decision trees:
- max_depth
- min_samples_split
- min_samples_leaf
- ccp_alpha

>**Answer:** (all from global lesson 6.01)  
- `max_depth`: The maximum depth of the tree. By default, the nodes are expanded until all leaves are pure (or some other argument limits the growth of the tree). In the 20 questions analogy, this is like "How many questions we can ask?"
- `min_samples_split`: The minimum number of samples required to split an internal node. By default, the minimum number of samples required to split is 2. That is, if there are two or more observations in a node and if we haven't already achieved maximum purity, we can split it!
- `min_samples_leaf`: The minimum number of samples required to be in a leaf node (a terminal node at the end of the tree). By default, the minimum number of samples required in a leaf node is 1. (This should ring alarm bells - it's very possible that we'll overfit our model to the data!)
- `ccp_alpha`: A complexity parameter similar to $\alpha$ in regularization. As ccp_alpha increases, we regularize more. By default, this value is 0.
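
To see these hyperparameters in action, the sketch below compares a fully grown tree with two regularized trees on the breast cancer data. The specific values (`max_depth=3`, `min_samples_leaf=5`, `ccp_alpha=0.01`) are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fully grown tree (defaults) versus two regularized trees.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                 random_state=42).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.01,
                                random_state=42).fit(X_train, y_train)

# Report depth, number of leaves, and testing accuracy for each tree.
for name, tree in [('full', full), ('shallow', shallow), ('pruned', pruned)]:
    print(name, tree.get_depth(), tree.get_n_leaves(),
          round(tree.score(X_test, y_test), 3))
```

In practice these values would usually be chosen with a search such as `GridSearchCV` rather than set by hand.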

---
## 6.02 Bagging

### Define ensemble model.

**Q11.** Define `ensemble model`.

> **Answer:**   
An ensemble model combines the predictions of several models (often weak learners) to form a single, stronger learner. This is the "wisdom of the crowds" idea.
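
The "wisdom of the crowds" idea can be sketched with a little probability. Assuming (unrealistically) that each model is correct 60% of the time and that the models' errors are independent, the code below computes how often a majority vote is correct. The function name and the numbers are illustrative only.

```python
from math import comb


def majority_vote_accuracy(n_models, p_correct):
    """Probability that more than half of n independent models are correct."""
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct ** k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


single = majority_vote_accuracy(1, 0.6)
ensemble = majority_vote_accuracy(101, 0.6)
print(round(single, 3), round(ensemble, 3))
```

Real models trained on the same data make correlated errors, so the gain is smaller in practice than this idealized calculation suggests.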

### Name three advantages of using ensemble models.

**Q12.** Name three advantages of using ensemble models.

>**Answer:** (from global lesson 6.02)  
- The `statistical` benefit to ensemble methods: By building one model, our predictions are almost certainly going to be wrong. Predictions from one model might overestimate housing prices; predictions from another model might underestimate housing prices. By "averaging" predictions from multiple models, we'll see that we can often cancel our errors out and get closer to the true function $f$.
- The `computational` benefit to ensemble methods: It might be impossible to develop one model that globally optimizes our objective function. (Remember that CART models reach locally-optimal solutions that aren't guaranteed to be the globally-optimal solution.) In these cases, it may be impossible for one CART to arrive at the true function $f$. However, generating many different models and averaging their predictions may allow us to get results that are closer to the global optimum than any individual model.
- The `representational` benefit to ensemble methods: Even if we had all the data and all the computer power in the world, it might be impossible for one model to exactly equal $f$. For example, a linear regression model can never model a relationship where a one-unit change in $X$ is associated with some different change in $Y$ based on the value of $X$. All models have some shortcomings. (See the no free lunch theorems.) While individual models have shortcomings, by creating multiple models and aggregating their predictions, we can actually create predictions that represent something that one model cannot ever represent.

### Define and execute bootstrapping.

**Q13.** Define `bootstrapping`.

> **Answer:**  
Bootstrapping is the process of generating a new sample by randomly drawing observations from the data with replacement, typically with the same size as the original data.

**Q14.** Create a bootstrapped sample given the following data:  
`example = [1, 4, 5, 2, 12, 6, 3, 19, 27, 10, 24, 42]`

In [14]:
import numpy as np

example = [1, 4, 5, 2, 12, 6, 3, 19, 27, 10, 24, 42]

In [16]:
# Answer:
np.random.choice(example, size=len(example), replace=True)

array([ 6,  2,  5,  3, 24,  1, 19,  1, 42,  3,  6,  6])

### Fit and evaluate bagged decision trees.

**Q15.** Using the following dataset, fit and evaluate bagged decision trees.

In [17]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier

# load data
bc = load_breast_cancer()
bcdf = pd.DataFrame(bc.data, columns = bc.feature_names)
bcdf['type'] = bc.target

X = bcdf.drop(columns='type')
y = bcdf['type']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [18]:
# Answer
bag = BaggingClassifier(random_state=42) # default is decision tree

bag.fit(X_train, y_train)
print('Training accuracy:', bag.score(X_train, y_train))
print('Testing accuracy:', bag.score(X_test, y_test))

Training accuracy: 0.9929577464788732
Testing accuracy: 0.9440559440559441


---
## 6.03 Random Forests

### Define ensemble model and describe three rationales for using ensemble models.

**Q16.** What is an ensemble model and why would you use them?

>**Answer:**   
Ensemble models are many weak learners combined to make a strong learner. They tend to be less overfit and more accurate than single models.

### Execute bootstrapping.

**Q17.** How is bootstrapping used in Random Forest models?

>**Answer:**  
Each random forest tree is built using a bootstrapped sample of the data.

### Describe the differences among and implement bagged decision trees, random forests, and ExtraTrees.

**Q18.** Describe the differences between bagged decision trees, random forests, and ExtraTrees.

>**Answer:**  
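All three models build many decision trees and aggregate their results to come up with a final prediction. Bagged decision trees fit each tree on a bootstrapped sample of the data. Random forests are like bagged decision trees, but at each split only a random subset of features is considered. ExtraTrees are like random forests, but the split threshold for each candidate feature is chosen at random rather than optimized. Note that scikit-learn's ExtraTrees estimators do not bootstrap by default (`bootstrap=False`), so each tree is fit on the full training set unless you change that setting.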
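
A minimal sketch of fitting the three models side by side on the breast cancer data. The choice of `n_estimators=50` is arbitrary, and exact scores may vary across scikit-learn versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    'bagged trees': BaggingClassifier(n_estimators=50, random_state=42),
    'random forest': RandomForestClassifier(n_estimators=50, random_state=42),
    'extra trees': ExtraTreesClassifier(n_estimators=50, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    # The bootstrap attribute shows whether each tree sees a resampled dataset.
    print(name, 'bootstrap =', model.bootstrap,
          'test accuracy =', round(scores[name], 3))
```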

### Identify advantages and disadvantages of using CART models.

**Q19.** What are some advantages and disadvantages of using CART models?

>**Answer:**  
- Advantages: can have high accuracy, are simple to explain, interpretable
- Disadvantages: tendency to overfit (hence why we use things like random forests)

---
## 6.04 Boosting

### Understand the differences between bagging and boosting

**Q20.** What is the difference between bagging and boosting?

> **Answer:**  
Bagging uses many models trained individually and aggregates the results. Boosting uses the results of the previous model to create a hopefully better model.

### Understand how boosting is an ensemble method

**Q21.** Describe how boosting works.

>**Answer:**  
In general, boosting combines many weak learners one at a time to create a strong learner, iterating off the previous results to make the model better.

### Learn the pros and cons to using boosting models

**Q22.** What are the pros and cons to using a gradient-boosting model?

> **Answer:** (from global lesson 6.04)  
- Pros: Natural handling of mixed data types (= heterogeneous features), Predictive power, Robustness to outliers in output space (via robust loss functions)
- Cons: Scalability (due to the sequential nature of boosting, it can hardly be parallelized), difficult hyperparameters to tune

### Learn the effect of boosting on the bias-variance trade-off

**Q23.** What happens to bias and variance when using boosting models?

> **Answer:**  
Bias tends to decrease and variance can sometimes decrease as well.

### Learn the math and procedure for AdaBoost, the "classic" boosting model

**Q24.** Briefly describe how AdaBoost works.

> **Answer:**  
AdaBoost fits a sequence of weak learners on repeatedly modified versions of the data. At each iteration, it reweights observations based on how well the model performed on the previous iteration, so the next model focuses on the misclassifications/weaknesses of the prior models.
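
One round of the reweighting can be sketched by hand. The example below assumes the classic binary AdaBoost formulation, where the learner weight is $\alpha = \frac{1}{2}\ln\left(\frac{1 - \text{error}}{\text{error}}\right)$; library implementations may use a slightly different variant. The five observations are made up for illustration.

```python
import math

# Five observations with equal starting weights; True marks a correct prediction.
weights = [0.2, 0.2, 0.2, 0.2, 0.2]
correct = [True, True, True, True, False]

# Weighted error rate of the current weak learner.
error = sum(w for w, ok in zip(weights, correct) if not ok)

# Learner weight ("amount of say") in the classic binary formulation.
alpha = 0.5 * math.log((1 - error) / error)

# Up-weight mistakes, down-weight correct predictions, then renormalize.
updated = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
total = sum(updated)
new_weights = [w / total for w in updated]

print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update, the single misclassified observation carries half of the total weight, so the next weak learner is pushed to get it right.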

### Understand the differences between AdaBoost and gradient-boosting models

**Q25.** Briefly describe how AdaBoost is different from gradient-boosting models.

> **Answer:**  
AdaBoost adjusts each model based on the wrong classifications of the previous models, while gradient boosting uses the residuals of the previous model to improve the following model.
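
The residual-fitting idea behind gradient boosting can be sketched from scratch. This is a simplified illustration for squared-error loss (where the residuals are the negative gradient), using one-split "stumps" as weak learners; the data, the `fit_stump` helper, and the learning rate are all made up for this example.

```python
def fit_stump(x, residuals):
    """Find the threshold split on x that minimizes squared error of residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def mse(y, preds):
    return sum((yi - p) ** 2 for yi, p in zip(y, preds)) / len(y)


x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.5, 2.0, 6.0, 6.5, 7.0]
learning_rate = 0.5  # arbitrary value chosen for illustration

# Stage 0: start from a constant prediction (the mean of y).
preds = [sum(y) / len(y)] * len(y)
errors = [mse(y, preds)]

# Each later stage fits a stump to the current residuals and adds it in.
for _ in range(3):
    residuals = [yi - p for yi, p in zip(y, preds)]
    stump = fit_stump(x, residuals)
    preds = [p + learning_rate * stump(xi) for p, xi in zip(preds, x)]
    errors.append(mse(y, preds))

print([round(e, 3) for e in errors])
```

Each stage lowers the training error because it models what the previous stages got wrong, whereas AdaBoost changes the observation weights instead.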

---
## 6.05 Support Vector Machines (SVMs)

### Describe linear separability.

**Q26.** What does it mean for data to be linearly separable?

> **Answer:**  
Data that is linearly separable can be completely separated by a line (in two dimensions) or, more generally, by a hyperplane.

### Differentiate between maximal margin classifiers, support vector classifiers, and support vector machines.

**Q27.** Briefly describe Maximal Margin SVMs.

> **Answer:**  
Maximal Margin SVMs attempt to find the separating hyperplane that completely separates the categories while maximizing the margin, the distance from the hyperplane to the nearest points of each class.

**Q28.** Briefly describe Soft Margin SVMs.

> **Answer:**  
Soft Margin SVMs allow for a little more error than Maximal Margin SVMs. They find the best separating hyperplane that mostly separates the categories, penalizing points that fall on the wrong side of the margin according to their distance from it.

**Q29.** Briefly describe Kernel SVMs.

> **Answer:**  
Kernel SVMs take advantage of the Kernel Trick to increase dimensionality in order to better separate the data.

### Implement SVMs in scikit-learn.

**Q30.** Using the provided data, implement a Support Vector Classifier and score your model.

In [4]:
# imports
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# data
bc = load_breast_cancer()
bcdf = pd.DataFrame(bc.data, columns = bc.feature_names)
bcdf['type'] = bc.target
X_train, X_test, y_train, y_test = train_test_split(bcdf.drop(columns='type'), bcdf['type'])

In [5]:
# Answer:
svc = SVC()

svc.fit(X_train, y_train)
print('Training accuracy:', svc.score(X_train, y_train))
print('Testing accuracy:', svc.score(X_test, y_test))

Training accuracy: 1.0
Testing accuracy: 0.5944055944055944
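
The large gap between training and testing accuracy above is a sign of overfitting. SVMs are sensitive to feature scale, so one common remedy (an addition here, not part of the original answer) is to standardize the features before fitting. The sketch below assumes a fixed `random_state=42` for reproducibility; exact scores depend on the scikit-learn version and the split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Standardize inside a pipeline so the scaler is fit on training data only.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_train, y_train)

train_acc = scaled_svc.score(X_train, y_train)
test_acc = scaled_svc.score(X_test, y_test)
print('Training accuracy:', train_acc)
print('Testing accuracy:', test_acc)
```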




### Describe the effects of C and kernels on SVMs.

**Q31.** What is the Kernel trick and why is it useful in SVMs?

> **Answer:**  
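The Kernel trick lets an SVM behave as if the data had been mapped into a higher-dimensional space, where the classes may be easier to separate, without ever computing that mapping explicitly. A kernel function computes the inner products in the higher-dimensional space directly from the original features. This allows us to have high dimensionality without sacrificing computing ease.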
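
A small numerical sketch of the idea, using a degree-2 polynomial kernel on two-dimensional points. The function names and the points are made up for illustration. The kernel is computed entirely in the original 2-D space, yet it matches the inner product of an explicit 3-D feature map.

```python
import math


def poly_kernel(x, z):
    """Degree-2 polynomial kernel computed in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def feature_map(x):
    """Explicit 3-D feature map whose inner product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


x, z = (1, 2), (3, 4)

kernel_value = poly_kernel(x, z)
explicit_value = sum(a * b for a, b in zip(feature_map(x), feature_map(z)))
print(kernel_value, round(explicit_value, 6))
```

For kernels such as the RBF kernel, the implied feature space is infinite-dimensional, so the explicit route is not even available; the kernel function is the only practical way to work there.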

---
## 6.06 Generalized Linear Models (GLMs)

### Describe generalized linear models.

**Q32.** What are the three components of GLMs?

> **Answer:**  
Systematic component, random component, and link function.
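
As a concrete illustration, here is how the three components could be written for a Poisson regression with a single predictor and a log link. The notation ($\mu_i$, $\eta_i$) is chosen for this sketch.

$$
\begin{aligned}
\text{Random component:} \quad & Y_i \sim \text{Poisson}(\mu_i) \\
\text{Systematic component:} \quad & \eta_i = \beta_0 + \beta_1 X_i \\
\text{Link function:} \quad & \log(\mu_i) = \eta_i
\end{aligned}
$$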

### Fit Poisson and Gamma regression models in statsmodels.

**Q33.** Use the given data to fit a Poisson regression using statsmodels.

In [18]:
# imports
import pandas as pd
import numpy as np
import statsmodels.api as sm
from scipy.stats import poisson

# create random poisson data
# https://www.datacamp.com/community/tutorials/probability-distributions-python
# https://stats.stackexchange.com/questions/27443/generate-data-samples-from-poisson-regression
X_data = np.random.choice(np.linspace(0, 15, 500), size=10000)
y = poisson.rvs(mu=3, size=10000)
X = sm.add_constant(X_data)

In [19]:
# Answer:
# From global lesson 6.06
glm_poi = sm.GLM(
    y, X,
    family=sm.families.Poisson(link = sm.families.links.log())
).fit()

glm_poi.summary()

| Dep. Variable: | y | No. Observations: | 10000.0 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 9998.0 |
| Model Family: | Poisson | Df Model: | 1.0 |
| Link Function: | log | Scale: | 1.0 |
| Method: | IRLS | Log-Likelihood: | -19237.0 |
| Date: | Thu, 23 Apr 2020 | Deviance: | 10774.0 |
| Time: | 12:40:42 | Pearson chi2: | 9810.0 |
| No. Iterations: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.1014 | 0.012 | 95.335 | 0.000 | 1.079 | 1.124 |
| x1 | -0.0006 | 0.001 | -0.427 | 0.669 | -0.003 | 0.002 |


### Interpret coefficients from Poisson and Gamma regression models.

**Q34.** Suppose you get the following coefficient for your Poisson regression: `0.054`. Interpret the meaning of this coefficient.

In [20]:
np.exp(0.054)

1.05548460215508

> **Answer:**  
Holding any other predictors constant, as $X_i$ increases by 1, I expect $Y$ to be multiplied by a factor of $e^{\beta_1}$.  
So, for a one-unit increase in X, we expect y to be multiplied by about 1.055, which is roughly a 5.5% increase.

### Describe iteratively reweighted least squares.

**Q35.** Briefly describe how iteratively reweighted least squares works.

> **Answer:** (from global lesson 6.06)  
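A solution is initially guessed, then iteratively refined until we converge on an answer. At each iteration, a weighted least squares problem is solved, with the weights recomputed from the current estimate. IRLS belongs to the same family of iterative optimization algorithms as gradient descent.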
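
The sketch below shows what IRLS could look like for a Poisson regression with a log link, written with NumPy. It is an illustrative implementation under those assumptions, not the code statsmodels uses internally; `irls_poisson` and the simulated data are made up for this example.

```python
import numpy as np


def irls_poisson(X, y, n_iter=25, tol=1e-8):
    """IRLS for Poisson regression with a log link (illustrative sketch)."""
    beta = np.zeros(X.shape[1])            # initial guess
    for _ in range(n_iter):
        eta = X @ beta                     # linear predictor
        mu = np.exp(eta)                   # mean under the log link
        z = eta + (y - mu) / mu            # working response
        W = mu                             # working weights (diagonal)
        # Weighted least squares step: solve (X' W X) beta = X' W z.
        XtW = X.T * W
        new_beta = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


rng = np.random.default_rng(42)
y = rng.poisson(lam=3, size=1000).astype(float)
X = np.ones((1000, 1))                     # intercept-only design matrix

beta_hat = irls_poisson(X, y)
print(beta_hat, np.log(y.mean()))
```

For an intercept-only model, the fitted coefficient should match the log of the sample mean, which gives a simple check that the iterations converged.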

---
## 6.07 Gradient Descent

### Understand the intuition behind gradient descent & Implement gradient descent.

**Q36.** Briefly describe how gradient descent works and what it is used for.

> **Answer:**  
Gradient descent is used to find the parameter values that minimize your loss function. The new position is calculated by taking the old position and subtracting alpha (the learning rate) times the gradient/derivative at that position.
1. Compute the derivative
2. Move in the appropriate direction
3. Repeat steps 1 & 2
4. Stop when your move is small enough
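
The steps above can be sketched for a one-dimensional loss. The function name, the example loss $f(x) = (x - 3)^2$, and the default settings are assumptions made for illustration.

```python
def gradient_descent(gradient, start, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Minimize a one-dimensional function given its derivative."""
    x = start
    for _ in range(max_iter):
        step = alpha * gradient(x)   # steps 1 and 2: derivative, then move
        x = x - step
        if abs(step) < tol:          # step 4: stop when the move is small
            break
    return x


# Example loss: f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 4))
```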

### Understand common pitfalls associated with gradient descent.

**Q37.** What are a few common pitfalls in gradient descent?

> **Answer:**  
Finding a local minimum instead of a global minimum, too large of an alpha, too small of an alpha.

### Identify solutions for common pitfalls.

**Q38.** What are some techniques to avoid failure to converge?

> **Answer:**  
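Change your alpha. If alpha is too large, the updates can overshoot the minimum and diverge; if alpha is too small, the updates may be so slow that the algorithm does not converge within the allotted number of iterations.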
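
A quick sketch of how the choice of alpha plays out on the simple loss $f(x) = x^2$. The three alpha values and the helper name `run_descent` are arbitrary choices for illustration.

```python
def run_descent(alpha, start=5.0, n_steps=100):
    """Gradient descent on f(x) = x^2 (derivative 2x) for a fixed number of steps."""
    x = start
    for _ in range(n_steps):
        x = x - alpha * 2 * x
    return x


too_small = run_descent(alpha=0.001)   # barely moves from the start
reasonable = run_descent(alpha=0.1)    # ends very close to the minimum at 0
too_large = run_descent(alpha=1.1)     # overshoots further on every step

print(too_small, reasonable, too_large)
```

With the same number of steps, only the middle value of alpha ends up near the minimum.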