
This study guide should reinforce and provide practice for all of the concepts you have seen in the past week. There is a mix of written questions and coding exercises. Both are equally important for preparing you for the sprint challenge and for speaking about these topics comfortably in interviews and on the job.

If you get stuck or are unsure of something, remember the 20-minute rule. If that doesn't help, research a solution with Google and Stack Overflow. Only once you have exhausted these methods should you turn to your Team Lead; they won't be there during your sprint challenge or an interview. That being said, don't hesitate to ask for help if you truly are stuck.

Have fun studying!

## Questions

When completing this section, try to limit your answers to 2-3 sentences and use plain English as much as possible. It's very easy to hide incomplete knowledge and understanding behind fancy or technical words, so imagine you are explaining these things to a non-technical interviewer.

1. What is train/test split?
```
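Train/test split means dividing your data into two parts: a training set used to fit the model and a test set held back to check how well the model predicts data it has not seen. In scikit-learn, the train_test_split function in sklearn.model_selection does this.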
```
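To make the idea concrete, here is a minimal hand-rolled sketch of a random split. It is not scikit-learn's implementation; `simple_train_test_split` and the 20 made-up rows are hypothetical names and data used only for illustration.

```python
import random


def simple_train_test_split(rows, test_size=0.25, seed=0):
    """Shuffle a copy of rows and cut it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))  # made-up stand-in for 20 observations
train_rows, test_rows = simple_train_test_split(rows, test_size=0.25, seed=42)
print(len(train_rows), len(test_rows))  # 15 5
```

Every row ends up in exactly one of the two sets, which is the property that lets the test set act as unseen data.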

2. What is a baseline?
```
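A baseline is a simple starting point for comparison, such as always predicting the mean of the target. A model is only useful if it beats the baseline.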
```
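As a small illustration, the sketch below builds a mean baseline and measures how far off it is on held-out values. The numbers are made up, and the mean is only one possible choice of baseline.

```python
y_train = [10.0, 12.0, 14.0, 20.0]  # made-up training targets
y_test = [11.0, 15.0]               # made-up test targets

# The baseline ignores all features and always guesses the training mean.
guess = sum(y_train) / len(y_train)

# Average absolute error of that constant guess on the test targets.
baseline_error = sum(abs(y - guess) for y in y_test) / len(y_test)
print(guess, baseline_error)  # 14.0 2.0
```

Any model you fit afterwards should reach a lower test error than this number.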

3. What is one-hot encoding and how do you do it?
```
One-hot encoding converts a categorical variable into numeric columns that ML algorithms can use. Each category is first mapped to an integer index, then represented as a binary vector that is all zeros except for a 1 at that category's index. In Python you can do it with pandas get_dummies, scikit-learn's OneHotEncoder, or category_encoders' OneHotEncoder.
```
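The two steps in that answer (map categories to integers, then build binary vectors) can be sketched in plain Python. The helper `one_hot_encode` and the color values are hypothetical; in practice you would use one of the library encoders instead.

```python
def one_hot_encode(values):
    """Return (categories, vectors) for a list of categorical values."""
    categories = sorted(set(values))                      # fixed category order
    index = {cat: i for i, cat in enumerate(categories)}  # category -> integer
    vectors = []
    for value in values:
        row = [0] * len(categories)  # all zeros...
        row[index[value]] = 1        # ...except a 1 at the category's index
        vectors.append(row)
    return categories, vectors


colors = ['red', 'green', 'blue', 'green']  # made-up categorical column
categories, encoded = one_hot_encode(colors)
print(categories)  # ['blue', 'green', 'red']
for color, row in zip(colors, encoded):
    print(color, row)
```

Each row contains exactly one 1, and its position identifies the category.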

4. What is Linear Regression?
```
Linear regression finds the line that best fits the data points, so that we can predict outputs for inputs that are not in our data set, on the assumption that those outputs fall on or near the line.
```
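For a single feature, the best-fit line has a closed-form solution under ordinary least squares. The sketch below is one way to compute it by hand, with a hypothetical helper `fit_line` and made-up points; scikit-learn's `LinearRegression` handles the general multi-feature case.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # made-up points that lie exactly on y = 1 + 2x
intercept, slope = fit_line(xs, ys)
prediction = intercept + slope * 5.0  # predict for an input not in the data
print(intercept, slope, prediction)   # 1.0 2.0 11.0
```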

5. What is MAE? R^2?
```
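MAE (Mean Absolute Error) is the average absolute difference between the predicted values and the actual values, in the same units as the target. R^2 (the coefficient of determination) is the share of the variation in the target that the model explains: 1 is a perfect fit, and 0 is no better than always predicting the mean.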
```
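Both metrics are short formulas, so it can help to compute them by hand once. This sketch uses hypothetical helper names (`mae_by_hand`, `r2_by_hand`) and made-up values; in the exercises you would call `mean_absolute_error` and `r2_score` from `sklearn.metrics`.

```python
def mae_by_hand(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """1 minus (squared error of the model / squared error of the mean)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]  # made-up actual values
y_pred = [2.0, 5.0, 8.0, 9.0]  # made-up predictions
print(mae_by_hand(y_true, y_pred))  # 0.5
print(r2_by_hand(y_true, y_pred))   # 0.9
```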

6. What are coefficients?
```
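Coefficients are the numbers that multiply each feature in the model's equation. In linear regression, a coefficient tells you how much the prediction changes when that feature increases by one unit and the other features stay the same.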
```
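A quick way to see this is to write the linear equation out and change one input at a time. The intercept and coefficient values below are made up for illustration, not taken from a fitted model.

```python
intercept = 3.0
coefficients = {'x1': 2.0, 'x2': 0.5}  # made-up values


def predict(x1, x2):
    """y = intercept + coef_x1 * x1 + coef_x2 * x2"""
    return intercept + coefficients['x1'] * x1 + coefficients['x2'] * x2


# Raise x1 by one unit while holding x2 fixed: the change equals x1's coefficient.
diff_x1 = predict(2, 100) - predict(1, 100)
# Raise x2 by one unit while holding x1 fixed: the change equals x2's coefficient.
diff_x2 = predict(1, 101) - predict(1, 100)
print(diff_x1, diff_x2)  # 2.0 0.5
```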

7. What is RMSE?
```
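RMSE (Root Mean Squared Error) is the square root of the MSE, the average of the squared errors. It is in the same units as the target and penalizes large errors more heavily than MAE does.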
```
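The following sketch computes RMSE by hand and compares it with MAE on the same made-up predictions, where a single large miss dominates. The helper names `rmse` and `mae` are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def mae(y_true, y_pred):
    """Mean absolute error, for comparison."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]  # made-up actual values
y_pred = [1.0, 2.0, 3.0, 8.0]  # three exact predictions and one miss of 4
print(rmse(y_true, y_pred))  # 2.0
print(mae(y_true, y_pred))   # 1.0
```

Squaring before averaging is what makes the single large error count for more in RMSE than in MAE.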



## Practice Problems

# **Use any Data Set you want**

Do train/test split

In [0]:
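# Split the data at random. `df` is assumed to be an already-loaded DataFrame
# with 'price' and 'dishwasher' columns. Splitting on a price threshold would
# give the train and test sets different distributions.
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=42)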

In [0]:
train.shape, test.shape #Check the data split.

In [0]:
train['dishwasher'].mean() 

In [0]:
# Arrange y target vectors
target = 'dishwasher'
y_train = train[target]
y_test = test[target]

In [0]:
y_train

In [0]:
y_test

In [0]:
# Get mean baseline
print('Mean Baseline (using 0 features)')
guess = y_train.mean()

In [0]:
guess

In [0]:
# Train Error
from sklearn.metrics import mean_absolute_error
y_pred = [guess] * len(y_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error (dishwasher): {mae:.2f}')

In [0]:
# Test Error
y_pred = [guess] * len(y_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test Error (dishwasher): {mae:.2f}')

In [0]:
import plotly.express as px

px.scatter(
    train,
    x='dishwasher',
    y='price',
    text='price',
    title='How dishwashers change price',
    trendline='ols',  # Ordinary Least Squares
)

What is the Mean Absolute Error?

In [0]:
from sklearn.metrics import mean_absolute_error
y_pred = [guess] * len(y_train)
mae = mean_absolute_error(y_train, y_pred)
# Assuming a 0/1 target, multiply by 100 to report percentage points
print(f'Train MAE: {mae * 100:.2f} percentage points')

In [0]:
y_pred = [guess] * len(y_test)
mae = mean_absolute_error(y_test, y_pred)
# Assuming a 0/1 target, multiply by 100 to report percentage points
print(f'Test MAE: {mae * 100:.2f} percentage points')

What is the R^2 score?

In [0]:
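import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

# X and y are not defined earlier in this guide. Assumption: one numeric
# feature column and the target chosen above; swap in your own columns.
features = ['price']
X = df[features].to_numpy()
y = df[target].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)

In [0]:
from IPython.display import display, HTML
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures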


# Credit for PolynomialRegression: Jake VanderPlas, Python Data Science Handbook, Chapter 5.3
# https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html#Validation-curves-in-Scikit-Learn
def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree), 
                         LinearRegression(**kwargs))


polynomial_degrees = range(1, 10, 2)
train_r2s = []
test_r2s = []

for degree in polynomial_degrees:
    model = PolynomialRegression(degree)
    display(HTML(f'Polynomial degree={degree}'))
    
    model.fit(X_train, y_train)
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    display(HTML(f'<b style="color: blue">Train R2 {train_r2:.2f}</b>'))
    display(HTML(f'<b style="color: red">Test R2 {test_r2:.2f}</b>'))

    plt.scatter(X_train[:,0], y_train, color='blue', alpha=0.5)
    plt.scatter(X_test[:,0], y_test, color='red', alpha=0.5)
    plt.xlabel(features)
    plt.ylabel(target)
    
    x_domain = np.linspace(X.min(), X.max())
    curve = model.predict(x_domain[:, np.newaxis])  # predict needs a 2D array
    plt.plot(x_domain, curve, color='blue')
    plt.show()
    display(HTML('<hr/>'))
    
    train_r2s.append(train_r2)
    test_r2s.append(test_r2)
    
display(HTML('Validation Curve'))
plt.plot(polynomial_degrees, train_r2s, color='blue', label='Train')
plt.plot(polynomial_degrees, test_r2s, color='red', label='Test')
plt.xlabel('Model Complexity (Polynomial Degree)')
plt.ylabel('R^2 Score')
plt.legend()
plt.show() 

Do one-hot encoding: encode your categorical feature(s).

In [0]:
# Label-encode a categorical column as integer codes
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df['created'] = le.fit_transform(df['created'])

In [0]:
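# One-hot encode with scikit-learn
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Assumption: 'created' is the categorical column to encode; the remaining
# columns pass through unchanged. sparse_threshold=0 forces a dense array.
column_transformer = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['created'])],
    remainder='passthrough',
    sparse_threshold=0,
)
encoded = column_transformer.fit_transform(df)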

Fit your model.

In [0]:
import category_encoders as ce

encoder = ce.OneHotEncoder(use_cat_names=True)
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

In [0]:
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

for i in range(1, len(X_train.columns) + 1):
  selector = SelectKBest(score_func=f_regression, k=i)
  model = LinearRegression()

  X_train_selected = selector.fit_transform(X_train, y_train)
  X_test_selected = selector.transform(X_test)
  model.fit(X_train_selected, y_train)
  y_pred = model.predict(X_test_selected)
  mae = mean_absolute_error(y_test, y_pred)
  print(f'Test MAE with {i} features: {mae:.0f}')

What is the Mean Absolute Error and R^2 score for your model?

In [0]:
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Report both test metrics for each number of selected features
for i in range(1, len(X_train.columns) + 1):
  selector = SelectKBest(score_func=f_regression, k=i)
  model = LinearRegression()

  X_train_selected = selector.fit_transform(X_train, y_train)
  X_test_selected = selector.transform(X_test)
  model.fit(X_train_selected, y_train)
  y_pred = model.predict(X_test_selected)
  mae = mean_absolute_error(y_test, y_pred)
  r2 = r2_score(y_test, y_pred)
  print(f'Test MAE with {i} features: {mae:.0f}, R^2: {r2:.2f}')

Print or plot the coefficients

In [0]:
# Fit a linear regression on all features

model_full = LinearRegression()
model_full.fit(X_train, y_train)

In [0]:
# Calculate training error
y_pred = model_full.predict(X_train)
mean_absolute_error(y_train, y_pred)

In [0]:
# Calculate test error by applying the model to new data
y_pred = model_full.predict(X_test)
mean_absolute_error(y_test, y_pred)

In [0]:
# get coefficients
model_full.intercept_, model_full.coef_

In [0]:
beta0 = model_full.intercept_
beta1, beta2 = model_full.coef_
print(f'y = {beta0} + {beta1}x1 + {beta2}x2')

In [0]:
# This is easier to read
import pandas as pd

print('Intercept', model_full.intercept_)
coefficients = pd.Series(model_full.coef_, X_train.columns)
print(coefficients.to_string())

In [0]:
X_train.describe()

In [0]:
model = model_full 

In [0]:
model.predict([[0, 0]])

In [0]:
model.predict([[1, 0]])

In [0]:
model.predict([[1, 0]]) - model.predict([[0, 0]])

In [0]:
model.predict([[2, 0]])

In [0]:
model.predict([[2, 0]]) - model.predict([[1, 0]])  # The same diff - linear!

In [0]:
model.predict([[2, 100]])

In [0]:
model.predict([[2, 100]]) - model.predict([[2, 0]])

In [0]:
model.predict([[3, 100]])

In [0]:
model.predict([[3, 100]]) - model.predict([[2, 100]])

In [0]:
model.predict([[3, 200]])

In [0]:
model.predict([[3, 200]]) - model.predict([[3, 100]])

# **What to study**

- train/test split
- baseline
- one-hot encoding
- Linear Regression
- MAE
- R^2
- coefficients
- RMSE