## 🎯 Machine Learning Reboot Challenge 🚀
Get ready to apply your knowledge in a hands-on experience with our Airbnb data.

We'll work through:

- 🧹 Data Cleaning
- 🚦 Train-Test Split
- 📈 Linear Regression
- 🌐 Random Forest Regressor
- 🔁 Cross-Validation
- 🎯 Logistic Regression
- 🎯 K-Means Clustering

At the end of each task, you'll answer questions 📝 to test your understanding.

Let's dive in and bring these concepts to life! 🏊‍♀️🏊‍♂️

### 1) Understand 👏 The 👏 Data 👏

For this challenge, we're going to work through some real-life AirBnb data. We'll be using data from multiple cities, so we're going to focus on keeping our code nice and reusable (that way we don't have to rewrite all of our data manipulation steps each time we work with another csv!).

First up: [Asheville, North Carolina](https://wagon-public-datasets.s3.amazonaws.com/data-science-images/05-ML/Reboot-2/asheville_airbnb.csv). Download the csv from the link provided. We're going to be doing a linear regression, trying to predict the `price` of an AirBnb given all the other info we have about it! Load the DataFrame and take a look at the distribution of the dependent variable. Also check out its minimum and maximum values. 

N.B. Practice reading in your data cleanly and take care to set an `index_col` when you load up your DataFrame. 
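
As a rough illustration of loading with `index_col` and inspecting a target, here is a minimal sketch on a made-up four-row CSV. The ids and prices are invented, and we assume prices arrive as strings such as `$1,250.00`, so treat it as a pattern rather than the real file.

```python
import io

import pandas as pd

# Made-up stand-in for a listings file; the ids and prices are invented.
csv_text = """id,price
101,$85.00
102,"$1,250.00"
103,$40.00
104,$120.00
"""

# index_col=0 uses the first column (the listing id) as the row index.
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)

# regex=False treats "$" as a literal character, not the regex end-of-string anchor.
toy["price"] = (
    toy["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

summary = toy["price"].describe()  # count, mean, std, min, quartiles, max
skewness = toy["price"].skew()     # positive values point to a long right tail
print(summary)
print(f"skewness: {skewness:.2f}")
```

A histogram (for example `toy["price"].hist()`) is another quick way to see the shape of the distribution.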
    

In [None]:
! mkdir -p data
! curl "https://wagon-public-datasets.s3.amazonaws.com/data-science-images/05-ML/Reboot-2/asheville_airbnb.csv" > "data/asheville_airbnb.csv"

In [None]:
import pandas as pd

df = pd.read_csv("data/asheville_airbnb.csv", index_col = 0)

# $CHALLENGIFY_BEGIN
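# regex=False makes sure "$" is treated as a literal character, not a regex anchor
df["price"] = df["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)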

df["price"].min()

df["price"].max()
# $CHALLENGIFY_END

Looks quite skewed! We're going to focus on AirBnb listings priced above 50 dollars and less than 1500 dollars. Create a DataFrame named `reduced` that reflects this change.

In [None]:
# $CHALLENGIFY_BEGIN
reduced = df[(df["price"] > 50) & (df["price"] < 1500)].copy()
# $CHALLENGIFY_END

Run the asserts throughout the notebook to make sure you're on the right track!

In [None]:
assert(reduced.shape == (2746, 21))

Now look at all of our columns and pick out only the ones that you think might help us with our linear regression task (along with our price column!).

In [None]:
# $CHALLENGIFY_BEGIN
reduced.columns
# $CHALLENGIFY_END

Only once you've done some investigation yourself, open the hint below.

<details>
<summary>click here for a hint 👆</summary>



Here we've compiled a list for you that should serve as a good starting point, but you're welcome to pick some of your own.

```
interesting_cols = [
    'price',
    'room_type',
    'accommodates',
    'bathrooms_text',
    'bedrooms',
    'beds',
    'minimum_nights',
    'number_of_reviews',
    'review_scores_rating',
    'instant_bookable',
]
```

</details>

In [None]:
# $CHALLENGIFY_BEGIN

interesting_cols = [
    'price',
    'room_type',
    'accommodates',
    'bathrooms_text',
    'bedrooms',
    'beds',
    'minimum_nights',
    'number_of_reviews',
    'review_scores_rating',
    'instant_bookable'
]
# $CHALLENGIFY_END

In [None]:
# $CHALLENGIFY_BEGIN

relevant = reduced[interesting_cols].copy()
# $CHALLENGIFY_END

__Question 1:__
What are the primary differences between feature selection and feature engineering in the context of data preprocessing for machine learning models?

 A) Feature selection involves creating new features from existing ones, while feature engineering involves selecting the most relevant features. <br>
 B) Feature selection involves transforming numerical features into categorical ones, while feature engineering involves encoding categorical features.<br>
C) Feature selection focuses on reducing the dimensionality of the dataset by choosing only relevant features, while feature engineering involves creating new features from existing ones.<br>
D) Feature selection is only applicable to linear models, while feature engineering is used for non-linear models.

Save your answer as a string (either "A", "B", "C" or "D") in the variable below

In [None]:
answer_1 = "Save answer letter here"

__Question 2:__ What would be a __useful__ example of feature engineering for this data?

A) Adding together `bedrooms` and `beds` to make a combined `all_bed_info` column <br>
B) Using the `latitude` and `longitude` points for each listing to create a `distance_from_downtown` feature<br>
C) Using `price` divided by `bedrooms` to create a new `price_per_room` feature<br>
D) Converting our `review_scores_rating` into a categorical variable

Save your answer as a string (either "A", "B", "C" or "D") in the variable below

In [None]:
answer_2 = "Save answer letter here"

Check your null and missing values. Proportionally speaking, how much of the dataset do they represent? Do we have a solid imputation strategy or can we just drop them?
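
To make "proportionally speaking" concrete, here is a small sketch on an invented DataFrame (the values are made up, only the column names echo our data). It shows the share of missing values per column and the share of rows we would lose by dropping them.

```python
import numpy as np
import pandas as pd

# Invented values, used only to illustrate the calculation.
toy = pd.DataFrame({
    "bedrooms": [1, 2, np.nan, 3, 1],
    "beds": [1, 2, 2, np.nan, 1],
    "review_scores_rating": [4.8, np.nan, 4.5, np.nan, 5.0],
})

missing_share = toy.isna().mean()          # fraction of missing values per column
rows_lost = toy.isna().any(axis=1).mean()  # fraction of rows that dropna() would remove
print(missing_share)
print(f"rows with at least one missing value: {rows_lost:.0%}")
```

If the share of lost rows is small, dropping is usually the simplest option; if it is large, an imputation strategy deserves more thought.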

In [None]:
# $CHALLENGIFY_BEGIN
relevant.isna().sum()
# $CHALLENGIFY_END

Make your decision and proceed to the next test cell.

In [None]:
# $CHALLENGIFY_BEGIN
relevant.dropna(inplace = True)
# $CHALLENGIFY_END

In [None]:
# Test cell
assert(relevant.shape == (2409, 10))

Now we need to take everything we have here and ensure that it's ready to be passed to our model. That means it has to be expressed as a number! So let's extract information from our `string` columns (`instant_bookable` and `bathrooms_text` we're looking at you 👀 - we may need some `regex` here to help us) and One Hot Encode our `room_type` (`pd.get_dummies()` is a very useful function for helping us do this). We will provide you with the cleaning function here but you will have to apply it to the DataFrame yourself.
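
If the encoding steps feel abstract, this sketch shows the two ideas on a tiny invented DataFrame: mapping `'t'`/`'f'` strings to 1/0 and one-hot encoding a categorical column. The rows are made up, so use it as a pattern rather than the answer.

```python
import pandas as pd

# Tiny invented sample with the same kind of columns as our listings.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
    "instant_bookable": ["t", "f", "t"],
})

# Map the 't' / 'f' flags to 1 / 0.
toy["instant_bookable"] = toy["instant_bookable"].map({"t": 1, "f": 0})

# One column per category, prefixed with the original column name.
encoded = pd.get_dummies(toy, columns=["room_type"])
print(encoded.columns.tolist())
```

Each distinct `room_type` value becomes its own indicator column, so the number of columns grows with the number of categories.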

In [None]:
import re

def extract_number(text):
    if text and type(text)==str:
        match = re.search(r'\d+(\.\d+)?', text)
        return float(match.group()) if match else None
    else:
        return None

In [None]:
# $CHALLENGIFY_BEGIN
relevant['bathrooms_text'] = relevant['bathrooms_text'].apply(extract_number)

relevant['instant_bookable'] = relevant['instant_bookable'].map({'t': 1, 'f': 0})

relevant = pd.get_dummies(relevant, columns=['room_type'])
# $CHALLENGIFY_END

In [None]:
# Run the next test

In [None]:
assert("object" not in list(relevant.dtypes))

__Question 3:__ What does the call `re.search(r'\d+(\.\d+)?', text)` in the cell above do when applied to a given text?

A) It matches any sequence of digits. <br>
B) It matches any floating-point number in the text.<br>
C) It matches any integer or decimal number in the text.<br>
D) It matches any sequence of characters that starts with a digit.<br>

Save your answer as a string (either "A", "B", "C" or "D") in the variable below

In [None]:
answer_3 = "Save answer letter here"

Great, all sorted! You've done all of your cleaning steps! Before we proceed, we're going to wrap up everything we've just done into one cleaning function. Why? Because it'll make working with other datasets so much easier down the line. Copy and paste your code chunks from the cells above into one large function.

In [None]:
def df_cleaner(df):
    # Converts a messy DataFrame into one
    # that contains only the relevant columns
    # with no null values and only numerical data.
    return clean_df

In [None]:
# $CHALLENGIFY_BEGIN
def df_cleaner(df):
    
    copy = df.copy()
    
    # regex=False: with regex=True, "$" is the end-of-string anchor and the dollar sign is never removed
    copy["price"] = copy["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)
    
    reduced = copy[(copy["price"] > 50) & (copy["price"] < 1500)].copy()
    
    relevant = reduced[interesting_cols].copy()
    
    relevant['bathrooms_text'] = relevant['bathrooms_text'].apply(extract_number)

    relevant['instant_bookable'] = relevant['instant_bookable'].map({'t': 1, 'f': 0})

    relevant = pd.get_dummies(relevant, columns=['room_type'])
    
    clean_df = relevant.dropna()
    
    return clean_df
# $CHALLENGIFY_END

Once you've coded your function, run the cell below to test it out!

In [None]:
new_df = pd.read_csv("data/asheville_airbnb.csv", index_col = 0)
df_cleaner(new_df)

In [None]:
# Run this test cell below

In [None]:
new_df = pd.read_csv('data/asheville_airbnb.csv')
assert(df_cleaner(new_df).shape == (2408, 13))
# `in` on a Series checks the index labels, so convert the dtypes to a list first
assert("object" not in list(df_cleaner(new_df).dtypes))

### 2) Train Test Split

To model, we need to create our X and y, then split up our data with a train test split! We'll do an `80/20` split with a random state of `42`.
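
As a quick sanity check of what an `80/20` split produces, here is a sketch on ten invented rows; the arrays are placeholders, not our listings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten invented rows with two features each, plus a matching target.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# test_size=0.2 keeps 20% of the rows aside; random_state makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Every row ends up in exactly one of the two sets, which is what lets the test set stand in for unseen data.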

In [None]:
clean_df = df_cleaner(new_df)

In [None]:
# $CHALLENGIFY_BEGIN
X = clean_df.drop("price", axis = 1)
# $CHALLENGIFY_END

In [None]:
# $CHALLENGIFY_BEGIN
y = clean_df["price"]
# $CHALLENGIFY_END

In [None]:
from sklearn.model_selection import train_test_split
# $CHALLENGIFY_BEGIN
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# $CHALLENGIFY_END

__Question 4__: What is the primary purpose of performing a train-test split in data science when building a machine learning model?

A) To divide the dataset into multiple subsets for parallel processing.<br>
B) To divide the dataset into two parts: one for training the model and one for testing its performance.<br>
C) To merge two datasets for increased model accuracy.<br>
D) To ensure the model has access to the entire dataset during training.



In [None]:
answer_4 = "Save answer letter here"

Now we need to scale our `X_train` - to keep it simple let's use MinMax

In [None]:
# $CHALLENGIFY_BEGIN
from sklearn.preprocessing import MinMaxScaler
# $CHALLENGIFY_END

In [None]:
# $CHALLENGIFY_BEGIN
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# $CHALLENGIFY_END

__Question 5:__ What is the purpose of applying Min-Max Scaler to features in data preprocessing for machine learning?

A) To eliminate outliers from the dataset.<br>
B) To reduce the dimensionality of the data.<br>
C) To standardize the features to have a mean of 0 and a standard deviation of 1.<br>
D) To scale the features to a specific range, usually between 0 and 1.


In [None]:
answer_5 = "Save answer letter here"

__Question 6:__ Perhaps more importantly - __why__ do we scale our data in ML?

A) To reduce the number of features in the dataset for faster processing.<br>
B) To ensure that the model always converges to the global optimum.<br>
C) To avoid numerical instability and speed up the optimization process.<br>
D) To eliminate outliers and anomalies from the dataset.

In [None]:
answer_6 = "Save answer letter here"

Now we're ready to model! Fit a simple Linear Regression model from `sklearn` to your training data

### 3) Linear Regression

First up - calculate the baseline Mean Squared Error for our linear regression model. Think through what our simplest possible guess would be, then calculate the MSE of making that guess every time on the test set.
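
The mechanics of scoring a constant guess can be sketched as follows. The helper name `constant_guess_mse` and the numbers are made up for illustration; which constant to guess is left to you.

```python
import numpy as np

def constant_guess_mse(y_true, guess):
    """MSE obtained by predicting the same `guess` for every row (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean((y_true - guess) ** 2))

# Two invented targets and an arbitrary guess of 200:
# ((150 - 200)^2 + (250 - 200)^2) / 2 = 2500
print(constant_guess_mse([150.0, 250.0], guess=200.0))  # 2500.0
```

Any model worth keeping should beat this number on the same test set.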

__Question 7:__ What is the simplest and most common baseline guess for a machine learning regression model?


A) Median of the target variable<br>
B) Maximum value of the target variable<br>
C) Mean of the target variable<br>
D) Minimum value of the target variable

In [None]:
answer_7 = "Save answer letter here"

In [None]:
# $CHALLENGIFY_BEGIN
price_mean = y_train.mean()
# $CHALLENGIFY_END

In [None]:
# $CHALLENGIFY_BEGIN
baseline = ((y_test - price_mean) ** 2).mean()
baseline
# $CHALLENGIFY_END

In [None]:
assert(round(baseline) == 16467)

Now instantiate a Linear Regression model from `sklearn` and fit it to your data!

In [None]:
# $CHALLENGIFY_BEGIN
from sklearn.linear_model import LinearRegression
# $CHALLENGIFY_END

In [None]:
# $CHALLENGIFY_BEGIN
linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train)
# $CHALLENGIFY_END

Predict on your test set and __DO NOT FORGET TO SCALE YOUR X_TEST!__
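
The reason we only call `transform` on the test set can be seen on a tiny invented example: the scaler learns its minimum and maximum from the training data and reuses them, so test values outside that range land outside `[0, 1]`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented single-feature data: the training range is 1 to 5.
train = np.array([[1.0], [3.0], [5.0]])
test = np.array([[2.0], [7.0]])

toy_scaler = MinMaxScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns min=1, max=5 -> [0, 0.5, 1]
test_scaled = toy_scaler.transform(test)        # reuses them -> [0.25, 1.5]

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Calling `fit_transform` on the test set instead would leak information about the test data into the preprocessing and make the two sets inconsistent.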

In [None]:
# $CHALLENGIFY_BEGIN
X_test_scaled = scaler.transform(X_test)
# $CHALLENGIFY_END

In [None]:
# $CHALLENGIFY_BEGIN
y_pred = linear_model.predict(X_test_scaled)
# $CHALLENGIFY_END

Compute the MSE between your predictions (use `sklearn.metrics` to expedite things) and your real answers. Assign it to the variable `mse`

In [None]:
# $CHALLENGIFY_BEGIN
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)

mse
# $CHALLENGIFY_END

In [None]:
assert(mse < 9000)

What is the R-squared of your model prediction? Save it in a variable `r_2`
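
For reference, R-squared can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses invented numbers and checks the result against `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation: 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean: 20
r2_manual = 1 - ss_res / ss_tot

print(f"{r2_manual:.3f}")  # 0.925
assert np.isclose(r2_manual, r2_score(y_true, y_hat))
```

For a fitted regressor, `model.score(X, y)` returns this same quantity.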

In [None]:
# $CHALLENGIFY_BEGIN
r_2 = linear_model.score(X_test_scaled, y_test)
# $CHALLENGIFY_END

In [None]:
assert(r_2 > 0.4)

__Question 8:__ What does the coefficient of determination, R-squared, measure in the context of regression models?

A) The percentage of variance in the dependent variable explained by the independent variable/s.<br>
B) The percentage of variance in the independent variable explained by the dependent variable/s.<br>
C) The percentage of correct predictions made by the regression model.<br>
D) The percentage of outliers in the dataset that affect the regression model's performance.

In [None]:
answer_8 = "Save answer letter here"

### 4) Random Forest Regression

Not a bad first attempt - let's try quickly implementing a different model - a `RandomForestRegressor` - to see if we get better results. Instantiate a vanilla (no hyperparam tuning) `RandomForestRegressor` with a `random_state` of 42. Evaluate your model in the same way as before, with the `rf_mse` and `rf_r_2` variables storing your results.

In [None]:
# $CHALLENGIFY_BEGIN
from sklearn.ensemble import RandomForestRegressor
# $CHALLENGIFY_END

In [None]:
# $CHALLENGIFY_BEGIN
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train_scaled, y_train)
# $CHALLENGIFY_END

In [None]:
# $CHALLENGIFY_BEGIN
rf_preds = rf.predict(X_test_scaled)

rf_mse = mean_squared_error(y_test, rf_preds)

# Score the random forest itself, not the earlier linear model
rf_r_2 = rf.score(X_test_scaled, y_test)
# $CHALLENGIFY_END


In [None]:
assert rf_mse < 8000
assert rf_r_2 > 0.45

__Question 9:__ How does a Random Forest algorithm work in simple terms?

A) It creates multiple decision trees and combines their predictions to make more accurate and robust predictions.<br>
B) It randomly selects features from the dataset to build a single powerful decision tree.<br>
C) It uses a random process to shuffle the data and find the best fit for the target variable.<br>
D) It relies on the randomness of the data to make predictions without using any decision trees.

In [None]:
answer_9 = "Save answer letter here"

### 5) Cross Validation

[Here](https://wagon-public-datasets.s3.amazonaws.com/data-science-images/05-ML/Reboot-2/new_york.csv) is the dataset for New York's listings. Use your cleaning function to preprocess it, then see how the two different models perform against each other and against a new NYC baseline MSE! 

In [None]:
! curl https://wagon-public-datasets.s3.amazonaws.com/data-science-images/05-ML/Reboot-2/new_york.csv > data/new_york.csv

Now we're going to use cross validation to see which model performs best. We're going to try our `LinearRegression`, our `RandomForestRegressor` and also a `KNeighborsRegressor`.

Remember each of the steps - this should be muscle memory now:

- Load the DataFrame
- Clean it (using the function we wrote above)
- Create the X and y
- Train test split (use `random_state = 42` for this notebook)
- Calculate baseline MSE
- Scale the X_train
- Cross validate each model, scoring with "mse"
- Fit the best model on the train for real
- Predict on the test
- Calculate the test MSE 
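
The cross-validation step above can be sketched on synthetic data as follows. The data is randomly generated for illustration, and the point to notice is that scikit-learn's scorer is called `"neg_mean_squared_error"`, so the scores come back negative and need their sign flipped to read as an MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data: three features, a known linear signal, a little noise.
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 1, size=(100, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=100)

# One score per fold; higher is better, hence the negated MSE.
neg_mse = cross_val_score(
    LinearRegression(), X_toy, y_toy, cv=5, scoring="neg_mean_squared_error"
)
cv_mse = -neg_mse.mean()  # flip the sign to get an ordinary MSE
print(neg_mse)
print(f"mean cross-validated MSE: {cv_mse:.4f}")
```

When comparing models, the one with the lowest cross-validated MSE (the negated score closest to zero) is the one to fit on the full training set.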

In [None]:
# Load the data (don't forget index_col)

In [None]:
# $CHALLENGIFY_BEGIN
ny_df = pd.read_csv("data/new_york.csv", index_col = 0)
# $CHALLENGIFY_END

In [None]:
# Clean all at once

In [None]:
# $CHALLENGIFY_BEGIN
clean_ny = df_cleaner(ny_df)
# $CHALLENGIFY_END

In [None]:
# Create your X and y

In [None]:
# $CHALLENGIFY_BEGIN
ny_X = clean_ny.drop("price", axis = 1)
ny_y = clean_ny["price"]
# $CHALLENGIFY_END

In [None]:
# Train test split (with random_state = 42)

In [None]:
# $CHALLENGIFY_BEGIN
# Split the New York data (ny_X, ny_y), not the Asheville X and y from earlier
ny_X_train, ny_X_test, ny_y_train, ny_y_test = train_test_split(ny_X, ny_y, test_size=0.2, random_state=42)
# $CHALLENGIFY_END

In [None]:
# Calculate the baseline

In [None]:
# $CHALLENGIFY_BEGIN
ny_mean = ny_y_train.mean()
ny_baseline = ((ny_y_test - ny_mean)**2).mean()
ny_baseline
# $CHALLENGIFY_END

In [None]:
# Scale the X_train

In [None]:
# $CHALLENGIFY_BEGIN
ny_scaler = MinMaxScaler()
ny_X_scaled = ny_scaler.fit_transform(ny_X_train)
# $CHALLENGIFY_END

In [None]:
# Generate cross validation models for all three models (remember random_state of 42 for Random Forest)
# You will also have to think carefully about how you get an "mse" out of your cross validation!

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
# $CHALLENGIFY_BEGIN
ny_linear_model = LinearRegression()
ny_knn_regressor = KNeighborsRegressor()
ny_rf_model = RandomForestRegressor(random_state = 42)
# $CHALLENGIFY_END

In [None]:
# $CHALLENGIFY_BEGIN
from sklearn.model_selection import cross_val_score
ny_linear_score = cross_val_score(ny_linear_model, ny_X_scaled, ny_y_train, cv = 5, scoring = "neg_mean_squared_error")
ny_neighbor_score = cross_val_score(ny_knn_regressor, ny_X_scaled, ny_y_train, cv = 5, scoring = "neg_mean_squared_error")
ny_rf_score = cross_val_score(ny_rf_model, ny_X_scaled, ny_y_train, cv = 5, scoring = "neg_mean_squared_error")
# $CHALLENGIFY_END


In [None]:
print(ny_linear_score.mean(), ny_neighbor_score.mean(), ny_rf_score.mean())

__Question 10:__ Why do we perform cross-validation in machine learning?

A) To evaluate the model's performance on the training data.<br>
B) To estimate the model's performance on unseen data and assess its generalization ability.<br>
C) To increase the size of the training data for better model training.<br>
D) To reduce overfitting and prevent the model from memorizing the training data.

In [None]:
answer_10 = "Save answer letter here"

In [None]:
# Fit our best model for real

In [None]:
# $CHALLENGIFY_BEGIN
ny_rf_model.fit(ny_X_scaled, ny_y_train)
# $CHALLENGIFY_END

In [None]:
# Predict on scaled X_test

In [None]:
# $CHALLENGIFY_BEGIN
ny_X_test_scaled = ny_scaler.transform(ny_X_test)
# $CHALLENGIFY_END

In [None]:
# $CHALLENGIFY_BEGIN
ny_rf_pred = ny_rf_model.predict(ny_X_test_scaled)
# $CHALLENGIFY_END

In [None]:
# Calculate MSEs

In [None]:
# $CHALLENGIFY_BEGIN
ny_rf_mse = mean_squared_error(ny_y_test, ny_rf_pred)
# $CHALLENGIFY_END

In [None]:
# $CHALLENGIFY_BEGIN
ny_rf_mse
# $CHALLENGIFY_END

### 6) Logistic Regression

We've been doing such a good job at AirBnB that an airline company has heard of our data science talents 👀

They have assigned us a task! Use [this data](https://wagon-public-datasets.s3.amazonaws.com/data-science-images/05-ML/Reboot-2/Invistico_Airline.csv) to figure out what is making their customers satisfied or not given a whole host of other features. They have given us their dataset and told us the goal is simple:

- Show which features are having the largest impact on customer satisfaction

The only other clue they have given us is that there are some null values in our `'Arrival Delay in Minutes'` column and that they would like us to fill that column with the `median` as our `.fillna()` strategy. The rest is up to you now: you'll have to do all of the feature preprocessing and transformation yourself and fit your own Logistic Regression model (from `sklearn`). Then you'll answer a few questions on the data. 

Remember your process for exploring your data:
1) Take the time to investigate your target <br>
2) See if there is a class imbalance<br>
3) Check your independent variables <br>
4) Make sure everything is nicely scaled (just using your training data split)<br>
5) Fit on the train<br>
6) Predict on the __scaled__ test data
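
Step 2 above (checking for class imbalance) can be sketched on a handful of invented labels; the real target will of course have its own proportions.

```python
import pandas as pd

# Four invented labels, standing in for a real target column.
target = pd.Series(
    ["satisfied", "dissatisfied", "satisfied", "satisfied"], name="satisfaction"
)

# normalize=True returns class shares instead of raw counts.
shares = target.value_counts(normalize=True)
print(shares)
```

If one class dominates heavily, accuracy alone becomes misleading, which is one reason to also look at precision and recall.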

In [None]:
# $CHALLENGIFY_BEGIN
from sklearn.linear_model import LogisticRegression
# $CHALLENGIFY_END

Loading the data:

In [None]:
! curl "https://wagon-public-datasets.s3.amazonaws.com/data-science-images/05-ML/Reboot-2/Invistico_Airline.csv" > data/Invistico_Airline.csv

In [None]:
# $CHALLENGIFY_BEGIN
airline_df = pd.read_csv("data/Invistico_Airline.csv")
# $CHALLENGIFY_END

In [None]:
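# Once you've given the data a first look - fill missing ["Arrival 
# Delay in Minutes"] values with the median value as we have been instructed
# $CHALLENGIFY_BEGIN
# Assign the result back: calling fillna(inplace=True) on a selected column may not update the DataFrame
arrival_median = airline_df['Arrival Delay in Minutes'].median()
airline_df['Arrival Delay in Minutes'] = airline_df['Arrival Delay in Minutes'].fillna(arrival_median)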
# $CHALLENGIFY_END

In [None]:
# Make sure our target "satisfaction" is a 0 or 1, not a string

In [None]:
# $CHALLENGIFY_BEGIN
airline_df["satisfaction"] = (airline_df["satisfaction"] == "satisfied").astype(int)
# $CHALLENGIFY_END

In [None]:
# Convert categorical variables into dummy/indicator 
# variables (i.e., one-hot encoding w/ pd.get_dummies())
# $CHALLENGIFY_BEGIN
airline_df = pd.get_dummies(airline_df, drop_first=True)
# $CHALLENGIFY_END

In [None]:
# Create your X and y

In [None]:
# $CHALLENGIFY_BEGIN
X, y = airline_df.drop("satisfaction", axis = 1), airline_df["satisfaction"]
# $CHALLENGIFY_END

In [None]:
# Train test split (use random_state = 42)

In [None]:
# $CHALLENGIFY_BEGIN
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# $CHALLENGIFY_END

In [None]:
# Scale your X_train

In [None]:
# $CHALLENGIFY_BEGIN
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# $CHALLENGIFY_END

In [None]:
# Fit a logistic regression model on your train

In [None]:
# $CHALLENGIFY_BEGIN
log_model = LogisticRegression(max_iter = 1000)
log_model.fit(X_train_scaled, y_train)
# $CHALLENGIFY_END

In [None]:
# Scale your X_test and predict

In [None]:
# $CHALLENGIFY_BEGIN
X_test_scaled = scaler.transform(X_test)
# $CHALLENGIFY_END

In [None]:
# $CHALLENGIFY_BEGIN
log_preds = log_model.predict(X_test_scaled)
# $CHALLENGIFY_END

In [None]:
# Create two variables - accuracy and conf_matrix - for your accuracy and confusion matrix

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix
# $CHALLENGIFY_BEGIN
accuracy = accuracy_score(y_test, log_preds)
conf_matrix = confusion_matrix(y_test, log_preds)
# $CHALLENGIFY_END

A quick test to make sure you're on the right track

In [None]:
assert(accuracy > 0.8)

Plot your confusion matrix visually with the following code:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Not Satisfied', 'Satisfied'], 
            yticklabels=['Not Satisfied', 'Satisfied'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()

Now calculate your `precision` and `recall` by hand and save them as variables
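
As a hedged sketch of the arithmetic, here is how the four cells of a confusion matrix turn into precision and recall. The labels below are invented, so the numbers will not match your model's.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented true labels and predictions (1 = positive class).
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_hat = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

toy_precision = tp / (tp + fp)  # of everything predicted positive, how much was right
toy_recall = tp / (tp + fn)     # of everything actually positive, how much was found

print(toy_precision, toy_recall)  # 0.75 0.75
assert np.isclose(toy_precision, precision_score(y_true, y_hat))
assert np.isclose(toy_recall, recall_score(y_true, y_hat))
```

In scikit-learn's layout the rows of the matrix are the actual classes and the columns are the predicted ones, which matches the axis labels on the heatmap.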

In [None]:
# $CHALLENGIFY_BEGIN
precision, recall = 0.8531885073580939, 0.851339067198098

# $CHALLENGIFY_END

In [None]:
assert(round(precision, 2) == 0.85)
assert(round(recall, 2) == 0.85)

__Question 11:__ In binary classification, what is the main difference between precision and recall?

A) Precision measures the ability of a model to correctly identify positive instances, while recall measures the ability to correctly identify negative instances.<br>
B) Precision measures the ability of a model to correctly identify negative instances, while recall measures the ability to correctly identify positive instances.<br>
C) Precision measures the overall accuracy of the model, while recall measures the model's ability to handle imbalanced datasets.<br>
D) Precision is the ratio of true positives to the sum of true positives and false positives, while recall is the ratio of true positives to the sum of true positives and false negatives.

In [None]:
answer_11 = "Save answer letter here"

Take a moment to look at your model's coefficients - run the code below to make a DataFrame that shows the coefficient for each feature if you'd like

In [None]:
import pandas as pd
coefficients = log_model.coef_[0]
column_names = X.columns
coef_df = pd.DataFrame({'Feature': column_names, 'Coefficient': coefficients})


In [None]:
# $CHALLENGIFY_BEGIN
coef_df.sort_values(by = "Coefficient")
# $CHALLENGIFY_END

__Question 12:__ Based on the provided DataFrame showing the coefficients of our model, what can we conclude about the impact of a __one minute increase__ in the "Arrival Delay in Minutes" feature on the model's results? Think carefully about all the steps you have taken to produce your model.

A) One minute increase in "Arrival Delay in Minutes" has a considerable impact on the model's predictions.<br>
B) One minute increase in "Arrival Delay in Minutes" has no impact on the model's predictions.<br>
C) A one-minute increase in "Arrival Delay in Minutes" probably has a fairly small impact on the model's results, given that the coefficient shown was calculated on our MinMax-scaled data.

In [None]:
answer_12 = "Save answer letter here"

__Question 13:__ In the provided DataFrame showing the coefficients of a logistic regression model, which of the following features appears to have the most significant impact on the model's predictions?

A) Inflight wifi <br>
B) Inflight entertainment<br>
C) Cleanliness<br>
D) On-board service



In [None]:
answer_13 = "Save answer letter here"

### 7) K-Means Clustering: Optimal number of NY Boroughs

Create a DataFrame that only includes the `latitude` and `longitude` columns from the above NY AirBnB dataset. We are going to do a little unsupervised learning to see if we can replicate the New York City zone limits from our own data!

<img src = "https://wagon-public-datasets.s3.amazonaws.com/data-science-images/05-ML/Reboot-2/shutterstock-152208935.webp">

As you can see, we have 5 key boroughs in New York City: Staten Island, Brooklyn, Queens, the Bronx, and Manhattan! Is this the optimal number of boroughs? We're going to use the placement of our apartments listed on AirBnB to see if an unsupervised learning method will replicate the same kinds of boundaries!

Create `locations` - a DataFrame with only the lat and longs from our AirBnB dataset.

In [None]:
# $CHALLENGIFY_BEGIN
locations = ny_df[["latitude", "longitude"]].copy()
# $CHALLENGIFY_END

Import KMeans and set your number of clusters to 5, then fit it on `locations`

In [None]:
# $CHALLENGIFY_BEGIN
from sklearn.cluster import KMeans
# $CHALLENGIFY_END

In [None]:
# $CHALLENGIFY_BEGIN
k_model = KMeans(n_clusters = 5)
# $CHALLENGIFY_END

In [None]:
# $CHALLENGIFY_BEGIN
k_model.fit(locations)
# $CHALLENGIFY_END

Add your labels to your DataFrame.

In [None]:
# $CHALLENGIFY_BEGIN
locations["labels"] = k_model.labels_ 
# $CHALLENGIFY_END

In [None]:
# $CHALLENGIFY_BEGIN
locations
# $CHALLENGIFY_END

Scatterplot your results with a different colour for each neighborhood

In [None]:
# $CHALLENGIFY_BEGIN
sns.scatterplot(data = locations, x ="longitude", y = "latitude", hue = "labels")
# $CHALLENGIFY_END

Do we approximate similar boundaries for the boroughs of New York from our sampled data?

Try some different values of `n_clusters` and use the elbow method to see if you find an optimal number of clusters.
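
One way to set up the elbow method is sketched below on synthetic blobs (generated with `make_blobs`, not our listings). The idea is to fit K-Means for a range of `n_clusters` values and record the inertia, the within-cluster sum of squared distances, for each.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D points drawn around four centres, purely for illustration.
points, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=42)

inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(points)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting inertia against `k` (for example with `plt.plot(list(inertias), list(inertias.values()))`) and looking for the point where the curve flattens gives a candidate for the number of clusters.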

__Question 14:__ How does the K-means algorithm work in layman's terms?

A) K-means algorithm finds the mean of all data points and groups them based on their distance to this mean.<br>
B) K-means algorithm calculates the distance between each data point and its nearest neighbor to create clusters.<br>
C) K-means algorithm randomly selects K data points as cluster centers, then assigns each data point to the nearest center and recalculates the center's position based on the data points in that cluster.<br>
D) K-means algorithm sorts the data points in ascending order and assigns the first K points to the same cluster, then continues with the next K points until all data points are grouped.

In [None]:
answer_14 = "Save answer letter here"

__Question 15:__ K-Means is an example of an unsupervised learning technique, where we start out with no labels for our data. Which of the following is another example of an unsupervised learning technique?

A) Principal Component Analysis (PCA), a technique used for feature reduction and data dimensionality reduction.<br>
B) Decision Tree, a model used for classification and regression tasks with labeled data.<br>
C) Support Vector Machine (SVM), a model used for binary classification with labeled data.<br>
D) Random Forest, an ensemble model combining multiple decision trees for classification and regression with labeled data.

In [None]:
answer_15 = "Save answer letter here"

### TEST YOUR ANSWERS

Make sure you have saved all of your answers as uppercase "A", "B", "C" or "D" and then run the cell below

In [None]:
# $CHALLENGIFY_DELETE
# Define correct answers as variables
answer_1 = "C"  
answer_2 = "B"  
answer_3 = "B"  
answer_4 = "B"  
answer_5 = "D"  
answer_6 = "C"  
answer_7 = "C"  
answer_8 = "A"  
answer_9 = "A"  
answer_10 = "B" 
answer_11 = "D"
answer_12 = "C" 
answer_13 = "B" 
answer_14 = "C" 
answer_15 = "A" 


# $CHALLENGIFY_DELETE

In [None]:
from test_answers import check_answers

student_answers = {
    "question_1": answer_1,  
    "question_2": answer_2,  
    "question_3": answer_3,  
    "question_4": answer_4,  
    "question_5": answer_5,  
    "question_6": answer_6,  
    "question_7": answer_7,  
    "question_8": answer_8,  
    "question_9": answer_9,  
    "question_10": answer_10,
    "question_11": answer_11,
    "question_12": answer_12,
    "question_13": answer_13,
    "question_14": answer_14,
    "question_15": answer_15,
}

check_answers(student_answers)