# DSC 80: Lab 08

### Due Date: Tuesday, May 25th 11:59PM

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding work will be developed in an accompanying `lab*.py` file that is imported into the current notebook.

Labs and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying python file will be tested (a la DSC 20),
2. The notebook will be graded (for graphs and free response questions).

**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for working in the Notebook**:
- The notebooks serve to present the questions to you and give you a place to present your results for later review.
- The notebooks for *lab assignments* are not graded (only the `.py` file is).
- Notebooks for PAs will serve as a final report for the assignment, and contain conclusions and answers to open ended questions that are graded.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file.

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional functions to solve the lab! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `lab**.py` (much like we do in the notebook).
- Always document your code!

### Importing code from `lab**.py`

* We import our `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab**.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab**.py` in the notebook.
    - `autoreload` is necessary because, upon import, `lab**.py` is compiled to bytecode (in the directory `__pycache__`). Subsequent imports of `lab**` merely import the existing compiled python.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import lab08 as lab

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Scaling Transformations: log vs square root

**Question 1**

A scaling transformation transforms the scale of the data of a particular quantitative column. Mathematically, each data point $x_i$ is replaced with the transformed value $y_i = f(x_i)$, where $f$ is a transformation function. In general, it is not easy to select a good transformation for a given prediction problem. There are many transformations to choose from and each has a different mathematical intuition. 

Generally, the goal of a scaling transformation is to change the data from a complicated, non-linear relationship into a *linear* relationship. Linear relationships are very easy to understand and easily used by models (e.g. linear regression).

Non-linear growth is a commonly seen relationship in data. Sometimes this growth is *exponential* and sometimes it is by a *fixed power*. The scaling transformations that turn these types of growth linear are *log* and *root* transformations respectively.
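
As a short worked derivation of why these transformations linearize the two kinds of growth, assume idealized, noise-free data with constants $a > 0$, $b$, and $k > 0$ (these symbols are illustrative and not part of the lab):

$$y = a e^{b x} \quad\Longrightarrow\quad \log y = \log a + b x$$

$$y = a x^{k} \quad\Longrightarrow\quad y^{1/k} = a^{1/k} x$$

In the first case $\log y$ is a linear function of $x$; in the second, the $k$-th root of $y$ is. Real data is noisy, so the transformed relationship will only be approximately linear.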

In this problem you need to decide what transformation can be applied to a given dataset in order to make the relationship as *linear as possible*.


* To practice: create a dataframe consisting of the numbers $1$ to $99$ squared and plot the values. Then apply the square root transformation, add another column to the original dataframe and plot the columns side by side. What change do you observe?  

* Now repeat exactly the same steps but this time create a dataframe with an exponential distribution by raising the value `e` to the powers 1 to 99. Plot these values, perform a log transformation and plot the results, as above. What did you observe?

* Let's apply these ideas to the real dataset `homeruns`. You are given an MLB home run dataset with 120 yearly observations from 1900 to 2019. It includes a count of the number of [home runs](http://m.mlb.com/glossary/standard-stats/home-run) hit each year. You need to decide which transformation works better for this dataset: a square root or a log transformation. 

*Note (A few helpful hints)*: 
* You may find `sns.regplot` and `scipy.stats.linregress` useful for judging the effectiveness of your transformations! 
* Recall that a well-fit linear model has no patterns in its residuals -- `sns.residplot` can help with this decision.
* If you need a refresher on correlation coefficients, see [DSC10](https://www.inferentialthinking.com/chapters/15/1/Correlation.html) as well as the Discussion 08 notebook.
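
One possible way to compare transformations numerically is to look at the correlation coefficient reported by `scipy.stats.linregress` before and after transforming. The sketch below uses synthetic, noise-free quadratic data (not the `homeruns` dataset), so the names `r_raw`, `r_sqrt`, and `r_log` are illustrative only.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic data (NOT the homeruns dataset): noise-free quadratic growth.
x = np.arange(1, 100)
y = x ** 2

# Correlation between x and the raw values, then each transformed version.
r_raw = linregress(x, y).rvalue
r_sqrt = linregress(x, np.sqrt(y)).rvalue
r_log = linregress(x, np.log(y)).rvalue

print(f"raw:  r = {r_raw:.4f}")
print(f"sqrt: r = {r_sqrt:.4f}")
print(f"log:  r = {r_log:.4f}")
```

For this toy data the square root recovers an exactly linear relationship. A correlation coefficient alone can hide curvature, so it is worth checking a residual plot as well.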

Create a function `best_transformation` that returns an integer with the value corresponding to the following choices:

1. Square root transformation.
2. Log transformation
3. Both work the same.
4. Neither gives a transformation revealing a linear relationship. 


In [None]:
import os

homeruns_fp = os.path.join('data', 'homeruns.csv')
homeruns = pd.read_csv(homeruns_fp)

# Diamond Pricing 

The next problems deal with predicting the price of a diamond based on standard measured properties of diamonds. You will use linear regression to predict the price, while improving the quality of your predictions using *feature engineering*.

Since this question is supposed to help you understand feature engineering, **you will be building these features from scratch**, instead of using the built in `sklearn` or `pandas` methods.

The diamond dataset is downloaded using `seaborn`, via `sns.load_dataset('diamonds')`. The dataset is a DataFrame with 53940 rows and 10 variables:

|column|description|
|---|---|
|price|price in US dollars (326 - 18,823 USD)|
|carat|weight of the diamond (0.2 - 5.01)|
|cut|quality of the cut (Fair, Good, Very Good, Premium, Ideal)|
|color|diamond colour, from J (worst) to D (best)|
|clarity|a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))|
|x|length in mm (0 - 10.74)|
|y|width in mm (0 - 58.9)|
|z|depth in mm (0 - 31.8)|
|depth|total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43 - 79)|
|table|width of top of diamond relative to widest point (43 - 95)|

In [None]:
diamonds = sns.load_dataset('diamonds')
diamonds.head()

### Ordinal Encoding

**Question 2**

Every categorical variable in the dataset is an ordinal column. Recall that *ordinal encoding* is a feature transformation that maps the values of an ordinal column to the natural numbers (preserving the order of the column values). Create a function `create_ordinal` that takes in `diamonds` and returns a dataframe of ordinal features with names `ordinal_<col>` where `<col>` is the original categorical column name.

*Note*: Remember, you are creating this function using basic pandas. You should create a helper function that takes in a single column and an ordering for that column!
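
As a minimal sketch of the kind of helper described in the note, the example below encodes a toy column given an explicit ordering. The function name `ordinal_encode`, the toy data, and the choice to start numbering at 0 are all assumptions made for illustration; check the doctests in the starter code for the expected convention.

```python
import pandas as pd

def ordinal_encode(col, ordering):
    """Map each value in `col` to its position in `ordering` (worst to best).

    Hypothetical helper; numbering from 0 is an assumption.
    """
    mapping = {value: rank for rank, value in enumerate(ordering)}
    return col.map(mapping)

# Toy data, not the diamonds dataset.
sizes = pd.Series(['small', 'large', 'medium', 'small'])
encoded = ordinal_encode(sizes, ['small', 'medium', 'large'])
print(encoded.tolist())  # [0, 2, 1, 0]
```

If a column has a pandas categorical dtype, the result of `.map` may itself be categorical, so you may need to convert it to an integer dtype afterwards.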

### Nominal Encoding 

**Question 3**

**One-hot encoding**

Even though the categorical variables in the dataset are ordinal, we can still treat them as nominal by forgetting about the ordering of the columns. Treating the categorical columns as nominal, we might one-hot encode them. 

Create a function `create_one_hot` that takes in `diamonds` and returns a dataframe of one-hot encoded features with names `one_hot_<col>_<val>` where `<col>` is the original categorical column name, and `<val>` is the value found in the categorical column `<col>`.

*Note 1:* Create a helper function that creates the one-hot encoding for a single column. **Do not** use `sklearn` or `pd.get_dummies` for this question!

*Note 2:* The code in lecture for one-hot encoding is inefficient and is simply there to illustrate how it works. Make sure the function you create is optimized; otherwise it will time out on Gradescope.
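
One way to avoid a slow row-by-row loop is to compare the whole column against its distinct values at once using NumPy broadcasting. The sketch below does this for a toy column; the helper name `one_hot_encode` and the toy data are assumptions for illustration, not the starter code's API.

```python
import numpy as np
import pandas as pd

def one_hot_encode(col):
    """Return a 0/1 DataFrame with one column per distinct value of `col`.

    Hypothetical helper shown on toy data.
    """
    values = np.asarray(col.unique(), dtype=object)
    rows = col.to_numpy(dtype=object)
    # Broadcasting compares every row against every distinct value at once,
    # producing an (n_rows, n_values) boolean matrix without a Python loop.
    matrix = (rows[:, None] == values[None, :]).astype(int)
    names = [f'one_hot_{col.name}_{val}' for val in values]
    return pd.DataFrame(matrix, columns=names, index=col.index)

toy = pd.Series(['a', 'b', 'a', 'c'], name='letter')
out = one_hot_encode(toy)
print(out)
```

Each row of the result contains exactly one 1, in the column matching that row's original value.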

**Encoding with proportions**

Similar to the one-hot encoding case, you can replace a value in a nominal column with the likelihood that value appears in the column. This might be a reasonable approach to predicting the price of a diamond, as you might expect *rarer attributes to be considered more valuable* than common ones.

Create a function `create_proportions` that takes in `diamonds` and returns a dataframe of proportion-encoded features with names `proportion_<col>` where `<col>` is the original categorical column name.

*Note:* If a column consists of the values `['a', 'b', 'a', 'c']`, then the proportion encoded column is `[0.5, 0.25, 0.5, 0.25]`. 
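
The example in the note can be reproduced with a short sketch. The helper name `proportion_encode` is hypothetical, and this is one possible approach using basic pandas rather than the required implementation.

```python
import pandas as pd

def proportion_encode(col):
    """Replace each value with the fraction of rows that share that value."""
    # value_counts(normalize=True) gives a Series: value -> proportion.
    proportions = col.value_counts(normalize=True)
    return col.map(proportions)

toy = pd.Series(['a', 'b', 'a', 'c'])
encoded = proportion_encode(toy)
print(encoded.tolist())  # [0.5, 0.25, 0.5, 0.25]
```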

### Quantitative Encoding (quadratic features)

**Question 4**

Linear regression doesn't capture non-linear dependencies between variables. However, you can create features that encode such dependencies *before* fitting your regression model. Creating polynomial features is one way to do this. For example, the `diamonds` dataset contains each dimension for the stone (`x`,`y`,`z`). However, different combinations of size may be more valuable than others: a "deep and wide" diamond might be considered more valuable than a shallow, but "long and wide" diamond.

Create a function `create_quadratics` that takes in `diamonds` and returns a dataframe of quadratic-encoded features `<col1> * <col2>` where `<col1>` and `<col2>` are the original quantitative columns. The output dataframe should contain one column for every distinct pair of quantitative columns (aside from `price`, which should be left out).

*Note*: **Do not** use `sklearn` for this question! It is ok to loop through the columns of `diamonds` to do this question.
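
A minimal sketch of building pairwise product columns, assuming a toy dataframe whose columns are all quantitative. The helper name `pairwise_products` is hypothetical; for `diamonds` you would first select the quantitative columns and drop `price`.

```python
from itertools import combinations

import pandas as pd

def pairwise_products(df):
    """Return one product column per unordered pair of columns in `df`."""
    products = {
        f'{a} * {b}': df[a] * df[b]
        for a, b in combinations(df.columns, 2)
    }
    return pd.DataFrame(products)

# Toy data, not the diamonds dataset.
toy = pd.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6]})
out = pairwise_products(toy)
print(out)
```

`itertools.combinations` yields each unordered pair once, so `x * y` appears but `y * x` and `x * x` do not.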

### Comparing Performance

**Question 5**

Which features are most able to predict the price of a diamond in a linear regression model? 

Among the original columns, `carat` gives the best predictions when used in a *single-variable* linear regression model. Below, you will fit a single-variable linear regression model for each variable (both in the dataset, as well as the engineered features from the questions above).

* What is the $R^2$ of a regression model built on the variable `carat`?
* What is the RMSE of the linear-predictor built on `carat` (in USD)?
* What is the *second best* feature in the original dataset (as measured by $R^2$)?
* What is the best *new* engineered feature from the questions above (Q2, Q3, and Q4), as measured by $R^2$?
* Which *categorical* feature results in the best predictor (as measured by $R^2$)?

Now, you will compare a multivariate regression model fitted with the original (quantitative) columns with a multivariate regression model fitted with both the original (quantitative) columns, as well as the features engineered in the problem above (Q4). 

* What is the percent decrease in RMSE between the two models (given as a number between 0 and 1)? (*Note*: RMSE is measured in USD! But no need to round).

Create a function `comparing_performance` that returns a list containing the 6 answers above.

*Hint:* Use the `sklearn` pattern included below. Train many linear regression models and sift through the results!

In [None]:
from sklearn.linear_model import LinearRegression

X = ...
y = ...

lr = LinearRegression()
lr.fit(X, y)  # X is dataframe of training data; y a series of prices
lr.score(X, y)  # R-squared
lr.predict(X) # predicted prices
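
The pattern above can be extended to compute RMSE and compare two models. The sketch below uses synthetic data (not `diamonds`), a hypothetical helper `rmse`, and assumes that "percent decrease" means `(rmse_base - rmse_engineered) / rmse_base`; confirm that definition against the doctests.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends quadratically on x, plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 3, size=200)
y = 2 * x ** 2 + rng.normal(0, 0.1, size=200)

def rmse(model, X, y):
    """Root mean squared error of `model` on (X, y)."""
    residuals = y - model.predict(X)
    return np.sqrt(np.mean(residuals ** 2))

X_base = x.reshape(-1, 1)              # original feature only
X_eng = np.column_stack([x, x ** 2])   # original plus an engineered feature

rmse_base = rmse(LinearRegression().fit(X_base, y), X_base, y)
rmse_eng = rmse(LinearRegression().fit(X_eng, y), X_eng, y)

# Assumed definition of percent decrease, as a number between 0 and 1.
pct_decrease = (rmse_base - rmse_eng) / rmse_base
print(rmse_base, rmse_eng, pct_decrease)
```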

# Feature engineering with `Sklearn`

In this section, you will use `sklearn` transformers/estimators for feature engineering. While everything you do with `sklearn` is possible to do with Pandas, `sklearn` transformers will enable you to couple your feature engineering with your modeling. This will allow you to more quickly build and assess your models in `sklearn`.

Recall from lecture that `sklearn` is built on `numpy`, and so its objects speak `ndarray` objects, *not* `DataFrame` objects! Each of the methods below should (1) first turn the input Pandas DataFrame into a numpy array, then (2) use the `.transform` method of an initialized `sklearn` transformer object. You should **not** use dataframe methods like `apply` in this problem.

In [None]:
from sklearn.preprocessing import Binarizer, QuantileTransformer, FunctionTransformer

### Turning a quantitative column into a binary column

In this section, you will create a `TransformDiamonds` class that contains the transformation-methods below. In the starter code, there is a skeleton for `TransformDiamonds` that is initialized with a dataframe `diamonds`.

**Question 6**

In the `diamonds` dataset, define a stone as *large* if it is greater than 1 carat. Use the `Binarizer` class to code up this transformation logic. Create a method `transformCarat` that takes in a dataframe like `diamonds` and returns a binarized `carat` column (an `np.ndarray`) as described above.
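
As a minimal sketch of how `Binarizer` behaves, the example below applies it to a toy 2D array rather than to `diamonds`. Values strictly greater than `threshold` become 1 and all others become 0; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Toy weights as a 2D array (n_samples, 1); sklearn expects 2D input.
weights = np.array([[0.5], [1.0], [1.5], [2.0]])

bi = Binarizer(threshold=1.0)
# Binarizer is stateless, so fitting does nothing beyond input validation.
flags = bi.fit_transform(weights)
print(flags.ravel())  # [0. 0. 1. 1.]
```

Note that a value exactly equal to the threshold maps to 0.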

**Question 7**. You will now transform the `carat` column so that each diamond's weight (in carats) is replaced with the *percentile* in which its weight lies. The percentile is measured with reference to the entire dataset that was passed in when the `TransformDiamonds` object was initialized. Create a method `transform_to_quantiles` that takes in a dataframe like `diamonds` and returns an `np.ndarray` of quantiles of the weight (i.e. carats) of each diamond.

*Hint:* To do this, use `QuantileTransformer` in `sklearn.preprocessing`. Note that you will have to use the `.fit` method before transforming, because `QuantileTransformer` needs to learn the percentiles of the reference data before it can map values to their quantiles!

*Note:* You will see a warning in the doctest saying that the number of rows (10) is smaller than the default number of quantiles (1000) for `QuantileTransformer`. This is expected behavior, since the doctest only transforms the first 10 rows of the dataframe.
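
The fit-then-transform pattern can be sketched on toy data as follows. The reference values and `n_quantiles=100` are assumptions chosen to keep the example small; the output is on a 0 to 1 scale.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Toy reference data: the values 1..100 as a 2D array (n_samples, 1).
reference = np.arange(1, 101, dtype=float).reshape(-1, 1)

qt = QuantileTransformer(n_quantiles=100)
qt.fit(reference)  # learn the quantiles of the reference data

# Map new values to where they fall within the reference distribution.
new_values = np.array([[1.0], [50.5], [100.0]])
quantiles = qt.transform(new_values)
print(quantiles.ravel())
```

The smallest reference value maps to 0, the largest to 1, and values in between are interpolated.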

**Question 8** Next, you will recreate a feature giving the "depth percentage" of a diamond. Suppose the approximate depth percentage of a diamond is $Depth \% = \frac{z}{(x+y)/2} \times 100$ where $x,y,z$ are the dimensions of the diamond given by columns of the same name. Create a method `transform_to_depth_pct` that takes in a dataframe like `diamonds` and returns an `np.ndarray` consisting of the approximate depth percentage of each diamond. The percentage should be between 0 and 100. You can compare your results with the `depth` column in the original dataset. 

*Hint:* Use `FunctionTransformer` in `sklearn.preprocessing`; your 'custom function' needs to input an `ndarray`, not a `DataFrame`.

*Again*: This may seem like an unnecessary function because **apply** does the "same" thing. You will learn in lecture that `FunctionTransformer` greatly simplifies the preprocessing step.

*Note*: Zero-division errors can be ignored; leave the resulting `np.NaN` values as they are.
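
As a minimal sketch of how `FunctionTransformer` wraps a custom function that operates on an `ndarray`, the example below uses a generic two-column ratio on toy data. The function `row_ratio` is hypothetical and is not the depth-percentage formula itself.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def row_ratio(arr):
    """Divide the first column by the second, expressed as a percentage."""
    return arr[:, 0] / arr[:, 1] * 100

# Wrap the plain function so it can be used like any sklearn transformer.
ft = FunctionTransformer(row_ratio)

data = np.array([[1.0, 4.0], [3.0, 6.0]])
result = ft.transform(data)
print(result)  # [25. 50.]
```

Because the wrapped object exposes `.transform`, it can later be combined with other `sklearn` steps in the same way as built-in transformers.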

## Congratulations! You're done!

* Submit the lab on Gradescope