<font size="+3"><strong>Predicting Price with Neighborhood</strong></font>

In [2]:
import warnings
from glob import glob

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import wqet_grader
from category_encoders import OneHotEncoder
from IPython.display import VimeoVideo
from sklearn.linear_model import LinearRegression, Ridge  # noqa: F401
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.utils.validation import check_is_fitted

warnings.simplefilter(action="ignore", category=FutureWarning)
wqet_grader.init("Project 2 Assessment")

In the last lesson, we created a model that used location — represented by latitude and longitude — to predict price. In this lesson, we're going to use a different representation for location: neighborhood. 

In [2]:
VimeoVideo("656790491", h="6325554e55", width=600)

# Prepare Data

## Import

In [3]:
def wrangle(filepath):
    # Read CSV file
    df = pd.read_csv(filepath)

    # Subset data: Apartments in "Capital Federal", less than 400,000
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    mask_apt = df["property_type"] == "apartment"
    mask_price = df["price_aprox_usd"] < 400_000
    df = df[mask_ba & mask_apt & mask_price]

    # Subset data: Remove outliers for "surface_covered_in_m2"
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]

    # Split "lat-lon" column
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)

    

    return df

In the last lesson, we used our `wrangle` function to import two CSV files as DataFrames. But what if we had hundreds of CSV files to import? Wrangling them one-by-one wouldn't be an option. So let's start with a technique for reading several CSV files into a single DataFrame. 

The first step is to gather the names of all the files we want to import. We can do this using pattern matching. 

In [4]:
VimeoVideo("656790237", h="1502e3765a", width=600)

**Task 2.3.1:** Use [`glob`](https://docs.python.org/3/library/glob.html#glob.glob) to create a list that contains the filenames for all the Buenos Aires real estate CSV files in the `data` directory. Assign this list to the variable name `files`.

- [<span id='technique'>Assemble a list of path names that match a pattern in <span id='tool'>glob.](../%40textbook/02-python-advanced.ipynb#Working-with-strings-)

In [5]:
files = glob("./data/buenos-aires-real-estate-[0-5].csv")
files

['./data/buenos-aires-real-estate-2.csv',
 './data/buenos-aires-real-estate-5.csv',
 './data/buenos-aires-real-estate-4.csv',
 './data/buenos-aires-real-estate-3.csv',
 './data/buenos-aires-real-estate-1.csv']

In [6]:
# Check your work
assert len(files) == 5, f"`files` should contain 5 items, not {len(files)}"
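
Notice that the filenames in the output above aren't in numerical order: `glob` makes no promise about the order of its results. If order matters to you, one option is to wrap the call in `sorted`. Here's a minimal sketch in a throwaway directory. The filenames are made up for illustration and are not the course data.

```python
import tempfile
from glob import glob
from pathlib import Path

# Build a temporary directory with hypothetical file names.
tmp_dir = Path(tempfile.mkdtemp())
for name in ["real-estate-1.csv", "real-estate-2.csv", "real-estate-9.csv", "notes.txt"]:
    (tmp_dir / name).touch()

# "[0-5]" matches exactly one character in the range 0 through 5,
# so "real-estate-9.csv" and "notes.txt" are skipped.
pattern = str(tmp_dir / "real-estate-[0-5].csv")

# Sort the matches so the result is reproducible across machines.
matched = sorted(glob(pattern))
matched_names = [Path(p).name for p in matched]
print(matched_names)
```

Sorting doesn't change which files are read, only the order in which they end up in the list.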

The next step is to read each of the CSVs in `files` into a DataFrame, and put all of those DataFrames into a list. What's a good way to iterate through `files` so we can do this? A `for` loop!

In [7]:
VimeoVideo("656789768", h="3b8f3bca0b", width=600)

**Task 2.3.2:** Use your `wrangle` function in a `for` loop to create a list named `frames`. The list should contain the cleaned DataFrames created from the CSV filenames you collected in `files`.

- [What's a <span id='term'>for loop</span>?](../%40textbook/01-python-getting-started.ipynb#Python-for-Loops)
- [<span id='technique'>Write a for loop in <span id='tool'>Python.](../%40textbook/01-python-getting-started.ipynb#Working-with-for-Loops)

In [8]:
frames = []
for item in files:
    df = wrangle(item)
    frames.append(df)
frames[0].head()

Unnamed: 0,operation,property_type,place_with_parent_names,price,currency,price_aprox_local_currency,price_aprox_usd,surface_total_in_m2,surface_covered_in_m2,price_usd_per_m2,price_per_m2,floor,rooms,expenses,properati_url,lat,lon
2,sell,apartment,|Argentina|Capital Federal|Recoleta|,215000.0,USD,3259916.0,215000.0,40.0,35.0,5375.0,6142.857143,,1.0,3500.0,http://recoleta.properati.com.ar/12j4v_venta_d...,-34.588993,-58.400133
9,sell,apartment,|Argentina|Capital Federal|Recoleta|,341550.0,USD,5178717.72,341550.0,,90.0,,3795.0,8.0,2.0,,http://recoleta.properati.com.ar/100t0_venta_d...,-34.588044,-58.398066
12,sell,apartment,|Argentina|Capital Federal|Monserrat|,1386000.0,ARS,1382153.13,91156.62,39.0,33.0,2337.349231,42000.0,,,,http://monserrat.properati.com.ar/t05l_venta_d...,-34.62332,-58.397461
13,sell,apartment,|Argentina|Capital Federal|Belgrano|,105000.0,USD,1592052.0,105000.0,,33.0,,3181.818182,1.0,1.0,,http://belgrano.properati.com.ar/zsd5_venta_de...,-34.553897,-58.451939
17,sell,apartment,|Argentina|Capital Federal|Villa del Parque|,89681.0,USD,1359779.19,89681.0,46.0,39.0,1949.586957,2299.512821,,1.0,1500.0,http://villa-del-parque.properati.com.ar/12q2f...,-34.628813,-58.47223


In [9]:
# Check your work
assert len(frames) == 5, f"`frames` should contain 5 items, not {len(frames)}"
assert all(
    [isinstance(frame, pd.DataFrame) for frame in frames]
), "The items in `frames` should all be DataFrames."

The final step is to use pandas to combine all the DataFrames in `frames`. 

In [10]:
VimeoVideo("656789700", h="57adef4afe", width=600)

**Task 2.3.3:** Use [`pd.concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) to concatenate the items in `frames` into a single DataFrame `df`. Make sure you set the `ignore_index` argument to `True`.

- [<span id='technique'>Concatenate two or more DataFrames using <span id='tool'>pandas.](../%40textbook/03-pandas-getting-started.ipynb#Concatenating-DataFrames)

In [11]:
df = pd.concat(frames, ignore_index=True)
df.head()

Unnamed: 0,operation,property_type,place_with_parent_names,price,currency,price_aprox_local_currency,price_aprox_usd,surface_total_in_m2,surface_covered_in_m2,price_usd_per_m2,price_per_m2,floor,rooms,expenses,properati_url,lat,lon
0,sell,apartment,|Argentina|Capital Federal|Recoleta|,215000.0,USD,3259916.0,215000.0,40.0,35.0,5375.0,6142.857143,,1.0,3500.0,http://recoleta.properati.com.ar/12j4v_venta_d...,-34.588993,-58.400133
1,sell,apartment,|Argentina|Capital Federal|Recoleta|,341550.0,USD,5178717.72,341550.0,,90.0,,3795.0,8.0,2.0,,http://recoleta.properati.com.ar/100t0_venta_d...,-34.588044,-58.398066
2,sell,apartment,|Argentina|Capital Federal|Monserrat|,1386000.0,ARS,1382153.13,91156.62,39.0,33.0,2337.349231,42000.0,,,,http://monserrat.properati.com.ar/t05l_venta_d...,-34.62332,-58.397461
3,sell,apartment,|Argentina|Capital Federal|Belgrano|,105000.0,USD,1592052.0,105000.0,,33.0,,3181.818182,1.0,1.0,,http://belgrano.properati.com.ar/zsd5_venta_de...,-34.553897,-58.451939
4,sell,apartment,|Argentina|Capital Federal|Villa del Parque|,89681.0,USD,1359779.19,89681.0,46.0,39.0,1949.586957,2299.512821,,1.0,1500.0,http://villa-del-parque.properati.com.ar/12q2f...,-34.628813,-58.47223


In [12]:
# Check your work
assert len(df) == 6582, f"`df` is the wrong size: {len(df)}."
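
Why does `ignore_index` matter? Each cleaned DataFrame keeps the row labels from its original CSV, so the same label can show up in more than one frame. The toy example below (made-up prices, not the course data) is a small sketch of the difference.

```python
import pandas as pd

# Two toy frames that keep non-contiguous row labels,
# the way a DataFrame does after rows are filtered out.
frame_a = pd.DataFrame({"price": [100, 200]}, index=[2, 9])
frame_b = pd.DataFrame({"price": [300]}, index=[2])

# Without ignore_index, the original labels are kept, duplicates included.
kept = pd.concat([frame_a, frame_b])

# With ignore_index=True, rows are relabeled 0, 1, 2, ...
reset = pd.concat([frame_a, frame_b], ignore_index=True)

print(kept.index.tolist())
print(reset.index.tolist())
```

With a fresh index, every row has a unique label, which makes later label-based lookups unambiguous.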

Excellent work! You can now clean and combine as many CSV files as your computer can handle. You're well on your way to working with big data. 📈

## Explore

Looking through the output from the `df.head()` call above, there's a little bit more cleaning we need to do before we can work with the neighborhood information in this dataset. The good news is that, because we're using a `wrangle` function, we only need to change the function to re-clean all of our CSV files. This is why functions are so useful.

In [13]:
VimeoVideo("656791659", h="581201dc92", width=600)

**Task 2.3.4:** Modify your `wrangle` function to create a new feature `"neighborhood"`. You can find the neighborhood for each property in the `"place_with_parent_names"` column. For example, a property with the place name `"|Argentina|Capital Federal|Palermo|"` is located in the neighborhood `"Palermo"`. Also, your function should drop the `"place_with_parent_names"` column.

Be sure to rerun all the cells above before you continue.

- [<span id='technique'>Split the strings in one column to create another using <span id='tool'>pandas.](../%40textbook/03-pandas-getting-started.ipynb#Splitting-Strings)

In [79]:
def wrangle(filepath):
    # Read CSV file
    df = pd.read_csv(filepath)

    
    # Subset data: Apartments in "Capital Federal", less than 400,000
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    mask_apt = df["property_type"] == "apartment"
    mask_price = df["price_aprox_usd"] < 400_000
    df = df[mask_ba & mask_apt & mask_price]

    # Subset data: Remove outliers for "surface_covered_in_m2"
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]

    # Split "lat-lon" column
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)

    
    # Note: "neighborhood" is extracted from "place_with_parent_names"
    # after the frames are concatenated, in the cells below.
    
    return df

In [80]:
files = glob("./data/buenos-aires-real-estate-[0-5].csv")
files
frames = []
for item in files:
    df = wrangle(item)
    frames.append(df)
frames[0].head()
df1 = pd.concat(frames)

In [81]:
df1['place_with_parent_names']

2               |Argentina|Capital Federal|Recoleta|
9               |Argentina|Capital Federal|Recoleta|
12             |Argentina|Capital Federal|Monserrat|
13              |Argentina|Capital Federal|Belgrano|
17      |Argentina|Capital Federal|Villa del Parque|
                            ...                     
8589            |Argentina|Capital Federal|Barracas|
8590             |Argentina|Capital Federal|Almagro|
8593            |Argentina|Capital Federal|Barracas|
8601         |Argentina|Capital Federal|San Nicolás|
8604               |Argentina|Capital Federal|Boedo|
Name: place_with_parent_names, Length: 6582, dtype: object

In [82]:
df1['neighborhood'] = df1['place_with_parent_names'].str.split('|', expand=True)[3]
df1 = df1.drop(columns='place_with_parent_names')

In [83]:
df1['neighborhood']

2               Recoleta
9               Recoleta
12             Monserrat
13              Belgrano
17      Villa del Parque
              ...       
8589            Barracas
8590             Almagro
8593            Barracas
8601         San Nicolás
8604               Boedo
Name: neighborhood, Length: 6582, dtype: object

In [84]:
df = df1

In [85]:
# Check your work
assert df.shape == (6582, 17), f"`df` is the wrong size: {df.shape}."
assert (
    "place_with_parent_names" not in df
), 'Remember to remove the `"place_with_parent_names"` column.'

In [86]:
df['neighborhood']

2               Recoleta
9               Recoleta
12             Monserrat
13              Belgrano
17      Villa del Parque
              ...       
8589            Barracas
8590             Almagro
8593            Barracas
8601         San Nicolás
8604               Boedo
Name: neighborhood, Length: 6582, dtype: object
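
To see what the string split is doing, it can help to try it on a few hand-written place names. This is a minimal sketch. The third value is a hypothetical place name with no neighborhood level, included to show what happens in that case.

```python
import pandas as pd

places = pd.Series(
    [
        "|Argentina|Capital Federal|Palermo|",
        "|Argentina|Capital Federal|Villa del Parque|",
        "|Argentina|Capital Federal|",  # hypothetical: no neighborhood level
    ]
)

# Each string starts with "|", so piece 0 is empty. Pieces 1 to 3 are
# country, city, and neighborhood. The trailing "|" leaves an empty last piece.
parts = places.str.split("|", expand=True)
neighborhood = parts[3]
print(neighborhood.tolist())
```

When a place name stops at the city level, piece 3 is an empty string. That may be where the `neighborhood_` column in the encoded data later in this lesson comes from, though we haven't checked that against the raw files.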

## Split

At this point, you should feel more comfortable with splitting data, so we're going to condense the whole process down to one task. 

In [43]:
VimeoVideo("656791577", h="0ceb5341f8", width=600)

**Task 2.3.5:** Create your feature matrix `X_train` and target vector `y_train`. `X_train` should contain one feature: `"neighborhood"`. Your target is `"price_aprox_usd"`. 

- [What's a <span id='term'>feature matrix?](../%40textbook/15-ml-regression.ipynb#Linear-Regression)
- [What's a <span id='term'>target vector?](../%40textbook/15-ml-regression.ipynb#Linear-Regression)
- [<span id='technique'>Subset a DataFrame by selecting one or more columns in <span id='tool'>pandas.](../%40textbook/04-pandas-advanced.ipynb#Subset-a-DataFrame-by-Selecting-One-or-More-Columns) 
- [<span id='technique'>Select a Series from a DataFrame in <span id='tool'>pandas.](../%40textbook/04-pandas-advanced.ipynb#Select-a-Series-from-a-DataFrame) 

In [87]:
target = "price_aprox_usd"
features = ["neighborhood"]
y_train = df[target]
X_train = df[features]
display(y_train)
display(X_train)

2       215000.00
9       341550.00
12       91156.62
13      105000.00
17       89681.00
          ...    
8589     73536.95
8590    119000.00
8593     62000.00
8601    125000.00
8604     78000.00
Name: price_aprox_usd, Length: 6582, dtype: float64

Unnamed: 0,neighborhood
2,Recoleta
9,Recoleta
12,Monserrat
13,Belgrano
17,Villa del Parque
...,...
8589,Barracas
8590,Almagro
8593,Barracas
8601,San Nicolás


In [45]:
# Check your work
assert X_train.shape == (6582, 1), f"`X_train` is the wrong size: {X_train.shape}."
assert y_train.shape == (6582,), f"`y_train` is the wrong size: {y_train.shape}."

# Build Model

## Baseline

Let's also condense the code we use to establish our baseline. 

In [46]:
VimeoVideo("656791443", h="120a740cc3", width=600)

**Task 2.3.6:** Calculate the baseline mean absolute error for your model.

- [<span id='term'>What's a performance metric?](../%40textbook/12-ml-core.ipynb#Performance-Metrics)
- [<span id='term'>What's mean absolute error?](../%40textbook/12-ml-core.ipynb#Performance-Metrics)
- [<span id='technique'>Calculate summary statistics for a DataFrame or Series in <span id='tool'>pandas.](../%40textbook/05-pandas-summary-statistics.ipynb#Working-with-Summary-Statistics)
- [<span id='technique'>Calculate the mean absolute error for a list of predictions in <span id='tool'>scikit-learn.](../%40textbook/15-ml-regression.ipynb#Calculating-the-Mean-Absolute-Error-for-a-List-of-Predictions)

In [88]:
y_mean = y_train.mean()
y_pred_baseline = [y_mean] * len(y_train)
print("Mean apt price:", y_mean)
print("Baseline MAE:", mean_absolute_error(y_train, y_pred_baseline))

Mean apt price: 132383.83701458527
Baseline MAE: 44860.10834274133


The mean apartment price and baseline MAE should be similar to, but not identical to, the values from the last lesson. The numbers change because we're working with more data.
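
If you'd like to check the arithmetic behind the baseline, here's a small worked example that computes the mean absolute error by hand with NumPy. The four prices are made up for illustration.

```python
import numpy as np

# Toy prices in USD (made up for illustration).
y_toy = np.array([100_000.0, 150_000.0, 200_000.0, 350_000.0])

# The baseline "model" predicts the mean price for every apartment.
y_mean_toy = y_toy.mean()

# MAE is the average absolute distance between the truth and the prediction.
baseline_mae_toy = np.abs(y_toy - y_mean_toy).mean()
print(y_mean_toy, baseline_mae_toy)
```

The mean is 200,000 and the absolute errors are 100,000, 50,000, 0, and 150,000, so the baseline MAE works out to 75,000.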

## Iterate

If you try to fit a `LinearRegression` predictor to your training data at this point, you'll get an error that looks like this:

```
ValueError: could not convert string to float
```

What does this mean? When you fit a linear regression model, you're asking scikit-learn to perform a mathematical operation. The problem is that our training set contains neighborhood information in non-numerical form. In order to create our model we need to **encode** that information so that it's represented numerically. The good news is that there are lots of transformers that can do this. Here, we'll use the one from the [Category Encoders](https://contrib.scikit-learn.org/category_encoders/index.html) library, called a [`OneHotEncoder`](https://contrib.scikit-learn.org/category_encoders/onehot.html).

Before we include this transformer in our pipeline, let's explore how it works. 

In [49]:
VimeoVideo("656792790", h="4097efb40d", width=600)

**Task 2.3.7:** First, instantiate a `OneHotEncoder` named `ohe`. Make sure to set the `use_cat_names` argument to `True`. Next, fit your transformer to the feature matrix `X_train`. Finally, use your encoder to transform the feature matrix `X_train`, and assign the transformed data to the variable `XT_train`.

- [What's <span id='term'>one-hot encoding?](../%40textbook/13-ml-data-pre-processing-and-production.ipynb#One-Hot-Encoding)
- [<span id='technique'>Instantiate a transformer in <span id='tool'>scikit-learn.](../%40textbook/13-ml-data-pre-processing-and-production.ipynb#One-Hot-Encoding)
- [<span id='technique'>Fit a transformer to training data in <span id='tool'>scikit-learn.](../%40textbook/13-ml-data-pre-processing-and-production.ipynb#One-Hot-Encoding)
- [<span id='technique'>Transform data using a transformer in <span id='tool'>scikit-learn.](../%40textbook/13-ml-data-pre-processing-and-production.ipynb#One-Hot-Encoding)

In [50]:
ohe = OneHotEncoder(use_cat_names=True)
ohe.fit(X_train)  # Learn which neighborhoods appear in the training data

XT_train = ohe.transform(X_train)  # Encode the feature matrix as 0/1 columns
print(XT_train.shape)
XT_train.head()

(6582, 57)


Unnamed: 0,neighborhood_Recoleta,neighborhood_Monserrat,neighborhood_Belgrano,neighborhood_Villa del Parque,neighborhood_Villa Pueyrredón,neighborhood_Almagro,neighborhood_Palermo,neighborhood_,neighborhood_Tribunales,neighborhood_Balvanera,...,neighborhood_Velez Sarsfield,neighborhood_Monte Castro,neighborhood_Las Cañitas,neighborhood_Constitución,neighborhood_Parque Avellaneda,neighborhood_Villa Soldati,neighborhood_Pompeya,neighborhood_Versalles,neighborhood_Villa Real,neighborhood_Catalinas
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
13,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
17,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [51]:
# Check your work
assert XT_train.shape == (6582, 57), f"`XT_train` is the wrong shape: {XT_train.shape}"
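
To get a feel for what one-hot encoding produces, here's a tiny sketch that uses `pd.get_dummies` from pandas as a stand-in for the Category Encoders transformer. It isn't the same tool: `get_dummies` orders its columns alphabetically and doesn't remember the categories it has seen, so it can't be dropped into a pipeline. The idea is the same, though.

```python
import pandas as pd

# Toy feature matrix with three distinct neighborhoods.
X_toy = pd.DataFrame({"neighborhood": ["Recoleta", "Monserrat", "Recoleta", "Belgrano"]})

# One 0/1 column per category. Each row has exactly one 1.
XT_toy = pd.get_dummies(X_toy, columns=["neighborhood"], dtype=int)
print(XT_toy)
```

One column of text with three distinct values becomes three numerical columns, which is why our single `"neighborhood"` feature turned into 57 columns above.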

Now that we have an idea for how the `OneHotEncoder` works, let's bring it into our pipeline.

In [52]:
VimeoVideo("656792622", h="0b9d189e8f", width=600)

**Task 2.3.8:** Create a pipeline named `model` that contains a `OneHotEncoder` transformer and a `LinearRegression` predictor. Then fit your model to the training data. 

- [What's a <span id='term'>pipeline?](../%40textbook/13-ml-data-pre-processing-and-production.ipynb#scikit-learn-in-Production)
- [<span id='technique'>Create a pipeline in <span id='tool'>scikit-learn.](../%40textbook/13-ml-data-pre-processing-and-production.ipynb#Creating-a-Pipeline-in-scikit-learn)

In [107]:
model = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    Ridge()
)

model.fit(X_train, y_train)  # Fit the whole pipeline to the training data

Pipeline(steps=[('onehotencoder',
                 OneHotEncoder(cols=['neighborhood'], use_cat_names=True)),
                ('ridge', Ridge())])

In [56]:
# Check your work
check_is_fitted(model[-1])

Wow, you just built a model with a transformer and a predictor! When you started this course, did you think you'd be able to do something like that? 😁

## Evaluate

Regardless of how you build your model, the evaluation step stays the same. Let's see how our model performs with the training set.

In [57]:
VimeoVideo("656792525", h="09edc1c3d6", width=600)

**Task 2.3.9:** First, create a list of predictions for the observations in your feature matrix `X_train`. Name this list `y_pred_training`. Then calculate the training mean absolute error for your predictions in `y_pred_training` as compared to the true targets in `y_train`.

- [<span id='technique'>Generate predictions using a trained model in <span id='tool'>scikit-learn.](../%40textbook/15-ml-regression.ipynb#Generating-Predictions-Using-a-Trained-Model)
- [<span id='technique'>Calculate the mean absolute error for a list of predictions in <span id='tool'>scikit-learn.](../%40textbook/15-ml-regression.ipynb#Calculating-the-Mean-Absolute-Error-for-a-List-of-Predictions)

In [108]:
y_pred_training = model.predict(X_train)
mae_training = mean_absolute_error(y_train, y_pred_training)
print("Training MAE:", round(mae_training, 2))

Training MAE: 39350.22


Now let's check our test performance. 

**Task 2.3.10:** Run the code below to import your test data `buenos-aires-test-features.csv` into a DataFrame and generate a list of predictions using your model. Then run the following cell to submit your predictions to the grader.

- [What's generalizability?](../%40textbook/12-ml-core.ipynb#Generalization)
- [<span id='technique'>Generate predictions using a trained model in <span id='tool'>scikit-learn.](../%40textbook/15-ml-regression.ipynb#Generating-Predictions-Using-a-Trained-Model)
- [<span id='technique'>Calculate the mean absolute error for a list of predictions in <span id='tool'>scikit-learn.](../%40textbook/15-ml-regression.ipynb#Calculating-the-Mean-Absolute-Error-for-a-List-of-Predictions)

In [109]:
X_test = pd.read_csv("data/buenos-aires-test-features.csv")[features]
y_pred_test = model.predict(X_test)
y_pred_test[:5]

array([246624.69462384, 161355.96873409,  98232.05130782, 110846.03037654,
       127777.53819661])

In [110]:
wqet_grader.grade("Project 2 Assessment", "Task 23.10", list(y_pred_test))

# Communicate Results

If we write out the equation for our model, it'll be too big to fit on the screen. That's because, when we used the `OneHotEncoder` to encode the neighborhood data, we created a much wider DataFrame, and each column/feature has its own coefficient in our model's equation.

<center><img src="../images/proj-2.006.png" alt="Equation: y = β0 + β1 x1 + β2 x2 + ... + β59 x59 + β60 x60 " style="width: 800px;"/></center>

This is important to keep in mind for two reasons. First, it means that this is a **high-dimensional** model. Instead of a 2D or 3D plot, we'd need a 58-dimensional plot to represent it, which is impossible! Second, it means that we'll need to extract and represent the information for our equation a little differently than before. Let's start by getting our intercept and coefficients.

In [None]:
VimeoVideo("656793909", h="fca67856b4", width=600)

**Task 2.3.11:** Extract the intercept and coefficients for your model. 

- [What's an <span id='term'>intercept</span> in a linear model?](../%40textbook/12-ml-core.ipynb#Model-Types)
- [What's a <span id='term'>coefficient</span> in a linear model?](../%40textbook/12-ml-core.ipynb#Model-Types)
- [<span id='technique'>Access an object in a pipeline in <span id='tool'>scikit-learn.](../%40textbook/13-ml-data-pre-processing-and-production.ipynb#Accessing-an-Object-in-a-Pipeline)

In [93]:
intercept = model.named_steps['linearregression'].intercept_
coefficients = model.named_steps['linearregression'].coef_
print("coefficients len:", len(coefficients))
print(coefficients[:5])  # First five coefficients

coefficients len: 57
[2.64168099e+17 2.64168099e+17 2.64168099e+17 2.64168099e+17
 2.64168099e+17]


In [94]:
# Check your work
assert isinstance(
    intercept, float
), f"`intercept` should be a `float`, not {type(intercept)}."
assert isinstance(
    coefficients, np.ndarray
), f"`coefficients` should be a `np.ndarray`, not {type(coefficients)}."
assert coefficients.shape == (
    57,
), f"`coefficients` is wrong shape: {coefficients.shape}."
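
The key you pass to `named_steps` depends on which predictor is in your pipeline, because `make_pipeline` names each step after its lowercased class name. The sketch below shows this on toy data. It uses scikit-learn's own `OneHotEncoder` as a stand-in for the Category Encoders version, and the neighborhood labels and prices are made up.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder  # stand-in for the Category Encoders class

X_toy = pd.DataFrame({"neighborhood": ["A", "B", "A", "C"]})
y_toy = [100.0, 200.0, 110.0, 300.0]

toy_model = make_pipeline(OneHotEncoder(handle_unknown="ignore"), Ridge())
toy_model.fit(X_toy, y_toy)

# Step names are the lowercased class names.
step_names = list(toy_model.named_steps)
print(step_names)

# Look the predictor up by name, or by position with toy_model[-1].
toy_intercept = toy_model.named_steps["ridge"].intercept_
toy_coefficients = toy_model.named_steps["ridge"].coef_
print(toy_coefficients.shape)
```

So a pipeline built with `LinearRegression` has a `"linearregression"` step, while one built with `Ridge` has a `"ridge"` step. Indexing by position with `[-1]` works for either.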

We have the values of our coefficients, but how do we know which features they belong to? We'll need to get that information by going into the part of our pipeline that did the encoding.

In [95]:
VimeoVideo("656793812", h="810161b84e", width=600)

**Task 2.3.12:** Extract the feature names of your encoded data from the `OneHotEncoder` in your model.

- [Access an object in a pipeline in scikit-learn.](../%40textbook/13-ml-data-pre-processing-and-production.ipynb#Accessing-an-Object-in-a-Pipeline)

In [96]:
feature_names = model.named_steps['onehotencoder'].get_feature_names()
print("features len:", len(feature_names))
print(feature_names[:5])  # First five feature names

features len: 57
['neighborhood_Recoleta', 'neighborhood_Monserrat', 'neighborhood_Belgrano', 'neighborhood_Villa del Parque', 'neighborhood_Villa Pueyrredón']


In [97]:
# Check your work
assert isinstance(
    feature_names, list
), f"`feature_names` should be a `list`, not {type(feature_names)}."
assert len(feature_names) == len(
    coefficients
), "You should have the same number of features and coefficients."

We have coefficients and feature names, and now we need to put them together. For that, we'll use a Series.

In [101]:
VimeoVideo("656793718", h="1e2a1e1de8", width=600)

**Task 2.3.13:** Create a pandas Series named `feat_imp` where the index is your `feature_names` and the values are your `coefficients`.

- [<span id='technique'>Create a Series in <span id='tool'>pandas.](../%40textbook/03-pandas-getting-started.ipynb#Working-with-Columns)

In [102]:
feat_imp = pd.Series(coefficients, index=feature_names)
feat_imp.head()

neighborhood_Recoleta            2.641681e+17
neighborhood_Monserrat           2.641681e+17
neighborhood_Belgrano            2.641681e+17
neighborhood_Villa del Parque    2.641681e+17
neighborhood_Villa Pueyrredón    2.641681e+17
dtype: float64

In [103]:
# Check your work
assert isinstance(
    feat_imp, pd.Series
), f"`feat_imp` should be a `pd.Series`, not {type(feat_imp)}."
assert feat_imp.shape == (57,), f"`feat_imp` is wrong shape: {feat_imp.shape}."
assert all(
    a == b for a, b in zip(sorted(feature_names), sorted(feat_imp.index))
), "The index of `feat_imp` should be identical to `feature_names`."

To be clear, it's definitely not a good idea to show this long equation to an audience, but let's print it out just to check our work. Since there are so many terms to print, we'll use a `for` loop.

In [104]:
VimeoVideo("656797021", h="dc90e6dac3", width=600)

**Task 2.3.14:** Run the cell below to print the equation that your model has determined for predicting apartment price based on neighborhood.

- [What's an f-string?](../%40textbook/02-python-advanced.ipynb#Working-with-f-strings-)

In [105]:
print(f"price = {intercept.round(2)}")
for f, c in feat_imp.items():
    print(f"+ ({round(c, 2)} * {f})")

price = -2.641680987765331e+17
+ (2.6416809877672438e+17 * neighborhood_Recoleta)
+ (2.641680987766318e+17 * neighborhood_Monserrat)
+ (2.6416809877669786e+17 * neighborhood_Belgrano)
+ (2.6416809877663885e+17 * neighborhood_Villa del Parque)
+ (2.6416809877664346e+17 * neighborhood_Villa Pueyrredón)
+ (2.641680987766549e+17 * neighborhood_Almagro)
+ (2.641680987766981e+17 * neighborhood_Palermo)
+ (2.6416809877663184e+17 * neighborhood_)
+ (2.6416809877664378e+17 * neighborhood_Tribunales)
+ (2.6416809877664026e+17 * neighborhood_Balvanera)
+ (2.6416809877670768e+17 * neighborhood_Barrio Norte)
+ (2.6416809877664813e+17 * neighborhood_Once)
+ (2.6416809877665766e+17 * neighborhood_San Telmo)
+ (2.6416809877660138e+17 * neighborhood_Villa Lugano)
+ (2.641680987766641e+17 * neighborhood_Coghlan)
+ (2.6416809877664666e+17 * neighborhood_Barracas)
+ (2.641680987766642e+17 * neighborhood_Villa Urquiza)
+ (2.6416809877665594e+17 * neighborhood_Abasto)
+ (2.6416809877665786e+17 * neighborhoo

<div class="alert alert-block alert-warning">
<b>Warning:</b> In the first lesson for this project, we said that you shouldn't make any changes to your model after you see your test metrics. That's still true. However, we're breaking that rule here so that we can discuss overfitting. In future lessons, you'll learn how to protect against overfitting without checking your test set.
</div>

In [106]:
VimeoVideo("656799309", h="a7130deb64", width=600)

**Task 2.3.15:** Scroll up, change the predictor in your model to `Ridge`, and retrain it. Then evaluate the model's training and test performance. Do you still have an overfitting problem? If not, extract the intercept and coefficients again (you'll need to change your code a little bit) and regenerate the model's equation. Does it look different than before?

- What's <span id='term'>overfitting?
- What's <span id='term'>regularization?
- What's <span id='term'>ridge regression?

In [None]:
# Check your work
assert isinstance(
    model[-1], Ridge
), "Did you retrain your model using a `Ridge` predictor?"
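
Why were the intercept and coefficients in the first equation so enormous, and so close to cancelling each other out? One plausible reading is that the one-hot columns always add up to 1, which is the same information the intercept carries, so you can shift a huge amount from the intercept into the coefficients without changing a single prediction. The NumPy sketch below illustrates that idea on made-up numbers. It is not a reconstruction of what scikit-learn does internally.

```python
import numpy as np

# One-hot matrix for four toy apartments in three made-up neighborhoods.
X_onehot = np.array(
    [
        [1, 0, 0],
        [0, 1, 0],
        [1, 0, 0],
        [0, 0, 1],
    ],
    dtype=float,
)

# With an intercept column of ones there are 4 columns, but the one-hot
# columns already sum to that column, so only 3 are linearly independent.
design = np.column_stack([np.ones(len(X_onehot)), X_onehot])
rank = np.linalg.matrix_rank(design)
print(design.shape[1], rank)

# Two very different equations that give identical predictions.
intercept_small, coef_small = 5.0, np.array([10.0, 20.0, 30.0])
shift = 1e6
pred_small = intercept_small + X_onehot @ coef_small
pred_huge = (intercept_small - shift) + X_onehot @ (coef_small + shift)
same_predictions = np.allclose(pred_small, pred_huge)
print(same_predictions)
```

Ridge regression adds a penalty on large coefficients, so among equations that fit the training data about equally well, it favors the ones with small, readable values.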

We're back on track with our model, so let's create a visualization that will help a non-technical audience understand which features matter most to our model when it predicts apartment price. 

In [3]:
VimeoVideo("656798530", h="9a9350eff1", width=600)

**Task 2.3.16:** Create a horizontal bar chart that shows the top 15 coefficients for your model, based on their absolute value.

- [What's a <span id='term'>bar chart</span>?](../%40textbook/07-visualization-pandas.ipynb#Bar-Charts)
- [<span id='technique'>Create a bar chart using <span id='tool'>pandas</span></span>.](../%40textbook/07-visualization-pandas.ipynb#Bar-Charts)
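
Here's one way the chart could be built, sketched on a toy Series so it runs on its own. The feature names and values are hypothetical. In the lesson you'd start from `feat_imp` and keep the top 15 instead of the top 3, and the axis labels are only suggestions.

```python
import pandas as pd

# Hypothetical coefficients. In the lesson, these come from `feat_imp`.
feat_imp_toy = pd.Series(
    {
        "neighborhood_A": 50_000.0,
        "neighborhood_B": -42_000.0,
        "neighborhood_C": 1_500.0,
        "neighborhood_D": -300.0,
    }
)

top_n = 3  # the task asks for 15

# Sort by absolute value (ascending) and keep the largest `top_n`.
top = feat_imp_toy.sort_values(key=abs).tail(top_n)

# In a horizontal bar chart, the last row is drawn at the top,
# so the largest coefficient ends up as the top bar.
ax = top.plot(kind="barh")
ax.set_xlabel("Importance [USD]")
ax.set_ylabel("Feature")
ax.set_title("Feature importance for apartment price")
```

Sorting by absolute value keeps large negative coefficients in the chart, which matters here because neighborhoods that lower the predicted price are just as informative as those that raise it.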

Looking at this bar chart, we can see that the poshest neighborhoods in Buenos Aires like [Puerto Madero](https://en.wikipedia.org/wiki/Puerto_Madero) and [Recoleta](https://en.wikipedia.org/wiki/Recoleta,_Buenos_Aires) increase the predicted price of an apartment, while more working-class neighborhoods like [Villa Soldati](https://en.wikipedia.org/wiki/Villa_Soldati) and [Villa Lugano](https://en.wikipedia.org/wiki/Villa_Lugano) decrease the predicted price. 

Just for fun, check out [this song](https://www.youtube.com/watch?v=RGlunBDvsaw) by Kevin Johansen about Puerto Madero. 🎶

---
Copyright © 2022 WorldQuant University. This
content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
