## Predicting Price with Size


In [None]:
import warnings

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression #build a model
from sklearn.metrics import mean_absolute_error #evaluate a model
from sklearn.utils.validation import check_is_fitted #validate our model


In [None]:
In this project, you're working for a client who wants to create a model that can predict the price of apartments in the city of Buenos Aires — with a focus on apartments that cost less than $400,000 USD.

 Write a function named wrangle that takes a file path as an argument and returns a DataFrame.

In [None]:
def wrangle(filepath):
    # Read the CSV file into a DataFrame
    df = pd.read_csv(filepath)
    return df

Now that we have a function written, let's test it out on one of the CSV files we'll use in this project. 

In [None]:
df = wrangle("data/buenos-aires-real-estate-1.csv")
print("df shape:", df.shape)
df.head()

At this point, your DataFrame df should have no more than 8,606 observations. Check this with an assert statement.

In [None]:
# Check your work
assert (
    len(df) <= 8606
), f"`df` should have no more than 8606 observations, not {len(df)}."

Task 2.1.3: Add to your wrangle function so that the DataFrame it returns only includes apartments in Buenos Aires ("Capital Federal") that cost less than $400,000 USD. Then recreate df from data/buenos-aires-real-estate-1.csv by re-running the cells above. 

In [None]:
def wrangle(filepath):
    # Read the CSV file into a DataFrame
    df = pd.read_csv(filepath)
    # Subset to properties in "Capital Federal"
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    # Subset to apartments
    mask_apt = df["property_type"] == "apartment"
    # Subset to properties where "price_aprox_usd" < 400,000
    mask_price = df["price_aprox_usd"] < 400000
    df = df[mask_ba & mask_apt & mask_price]
    return df
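
To see how the three boolean masks combine, here is a minimal sketch on a made-up DataFrame. The rows are hypothetical and only the column names mirror the project's CSV files; a row survives only if all three conditions are true for it.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the project.
toy = pd.DataFrame(
    {
        "place_with_parent_names": [
            "|Argentina|Capital Federal|Palermo|",
            "|Argentina|Cordoba|",
            "|Argentina|Capital Federal|Recoleta|",
            "|Argentina|Capital Federal|Belgrano|",
        ],
        "property_type": ["apartment", "apartment", "house", "apartment"],
        "price_aprox_usd": [150_000.0, 120_000.0, 300_000.0, 450_000.0],
    }
)

# Each mask is a boolean Series with one True/False value per row
mask_ba = toy["place_with_parent_names"].str.contains("Capital Federal")
mask_apt = toy["property_type"] == "apartment"
mask_price = toy["price_aprox_usd"] < 400_000

# `&` keeps only the rows where every mask is True
subset = toy[mask_ba & mask_apt & mask_price]
print(subset)
```

Only the first row passes all three filters: the second is outside Capital Federal, the third is a house, and the fourth is too expensive.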

In [None]:
df = wrangle("data/buenos-aires-real-estate-1.csv")
print(df.shape)

To check your work, df should have no more than 1,781 observations.

In [None]:
# Check your work
assert (
    len(df) <= 1781
), f"`df` should have no more than 1781 observations, not {len(df)}."

In [None]:
# The histogram shows that the data contains outliers
plt.hist(df["surface_covered_in_m2"])
plt.xlabel("Area [sq meters]")
plt.title("Distribution of Apartment Sizes");

In [None]:
# Summary statistics for the "surface_covered_in_m2" feature
df["surface_covered_in_m2"].describe()


In [None]:
def wrangle(filepath):
    # Read the CSV file into a DataFrame
    df = pd.read_csv(filepath)
    # Subset to properties in "Capital Federal"
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    # Subset to apartments
    mask_apt = df["property_type"] == "apartment"
    # Subset to properties where "price_aprox_usd" < 400,000
    mask_price = df["price_aprox_usd"] < 400000
    df = df[mask_ba & mask_apt & mask_price]
    # Remove outliers by keeping "surface_covered_in_m2" values
    # between the 10th and 90th percentiles
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]
    return df
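
The quantile-based trimming step can be tried in isolation. The sketch below uses a made-up Series of areas (the values are hypothetical) with one tiny and one huge apartment, and keeps only the values between the 10th and 90th percentiles.

```python
import pandas as pd

# Hypothetical areas in square meters, including two obvious outliers
areas = pd.Series(
    [10, 30, 35, 40, 45, 50, 55, 60, 65, 900], name="surface_covered_in_m2"
)

# 10th and 90th percentiles of the feature
low, high = areas.quantile([0.1, 0.9])

# Keep only values inside [low, high]; between() is inclusive by default
trimmed = areas[areas.between(low, high)]

print("low:", low, "high:", high)
print(trimmed.tolist())
```

Note that this approach always discards roughly the bottom and top 10% of the data, whether or not those observations are really unusual.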

Correlation between price and area

In [None]:
plt.scatter(x=df["surface_covered_in_m2"],y=df["price_aprox_usd"])
plt.xlabel("Area [sq meters]")
plt.ylabel("Price [USD]")
plt.title("Buenos Aires: Price vs. Area");

## Splitting feature matrix

In [None]:
features = ["surface_covered_in_m2"]
X_train = df[features]
X_train.head()

Fit a linear regression model to the mexico-city-real-estate-2.csv data set to relate "price_aprox_usd" and "surface_covered_in_m2".

In [None]:
# Import data
columns = ["price_aprox_usd", "surface_covered_in_m2"]
mexico_city2 = pd.read_csv("data/mexico-city-real-estate-2.csv",usecols=columns)
# Drop rows with missing values
mexico_city2.dropna(inplace=True)

# Split data into feature matrix and target vector
X = mexico_city2[["surface_covered_in_m2"]]
y = mexico_city2["price_aprox_usd"]

# Instantiate predictor
lr = LinearRegression()

# Fit predictor to data
lr.fit(X,y)

Read the data from mexico-city-real-estate-4.csv into a DataFrame and then generate a list of price predictions for the properties using your model lr

In [None]:
# Import data
mexico_city4 = pd.read_csv("data/mexico-city-real-estate-4.csv",usecols=["surface_covered_in_m2"])

# Drop missing values
mexico_city4.dropna(inplace=True)

# Generate predictions
price_pred = lr.predict(mexico_city4)

# Print predictions
price_pred[:5]

In [None]:
https://youtu.be/Q81RR3yKn30

Ridge Regression

Sometimes, the values for the coefficients and the intercept, both positive and negative, are very large. When you see this in a linear model, especially a high-dimensional one, the model is typically overfitting to the training data and then can't generalize to the test data. Some people call this the curse of dimensionality. ☠️

The way to solve this problem is to use regularization, a group of techniques that prevent overfitting. In this case, we'll change the predictor from LinearRegression to Ridge, which is a linear regressor with an added tool for keeping model coefficients from getting too big.
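
As a rough illustration of what Ridge does to coefficient size, the sketch below fits both predictors to synthetic data with two nearly identical features. The data, the random seed, and the choice of alpha=1.0 are assumptions made for this example, not values from the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (an assumption for illustration): x2 is almost a copy of x1
rng = np.random.default_rng(42)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X_toy = np.column_stack([x1, x2])
y_toy = 3 * x1 + rng.normal(scale=0.5, size=n)

# Ordinary least squares has no penalty on coefficient size
ols = LinearRegression().fit(X_toy, y_toy)

# Ridge adds a penalty on the squared size of the coefficients;
# alpha controls how strong the penalty is (1.0 is an arbitrary choice here)
ridge = Ridge(alpha=1.0).fit(X_toy, y_toy)

print("LinearRegression coefficients:", ols.coef_)
print("Ridge coefficients:           ", ridge.coef_)
```

The Ridge coefficients should never have a larger overall magnitude than the LinearRegression ones, and with features this similar the difference is usually easy to see.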

Calculating the Mean Absolute Error for a List of Predictions

Plots are great for displaying information, but a single value that tells you the typical error in a prediction is helpful too. This value is called the mean absolute error (MAE), and it's defined as the average magnitude of the errors in the predictions. The closer the MAE is to 0, the better our model fits the data. scikit-learn will calculate it for you if you pass it the actual prices from the test data set and the price predictions from your regression model. Let's see how our lr model did by comparing its predictions to the true values in mexico_city_labels.


In [None]:
from sklearn.metrics import mean_absolute_error

# scikit-learn's convention is (y_true, y_pred); MAE itself is symmetric
mean_absolute_error(mexico_city_labels, price_pred_example)
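
To make the definition concrete, here is a small sketch that computes the MAE by hand and compares it with scikit-learn's result. The three prices and predictions are made up for the example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true prices and predictions
y_true = np.array([100_000.0, 150_000.0, 200_000.0])
y_pred = np.array([110_000.0, 140_000.0, 230_000.0])

# MAE by hand: average of the absolute differences
mae_manual = np.mean(np.abs(y_true - y_pred))

# MAE from scikit-learn
mae_sklearn = mean_absolute_error(y_true, y_pred)

print("Manual MAE: ", round(mae_manual, 2))
print("sklearn MAE:", round(mae_sklearn, 2))
```

The absolute errors are 10,000, 10,000 and 30,000, so both calculations give about 16,666.67.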

Access an Attribute of a Trained Model

After training a model that fits a straight line to your data, you can obtain the parameters that define that line. We're particularly interested in the slope lr.coef_ and the axis intercept lr.intercept_.


In [None]:
print(lr.coef_)

In [None]:
print(lr.intercept_)

Multicollinearity

When you're creating a linear model that uses many features to make predictions, some of those features can be highly correlated with each other. This isn't a problem that's going to break your model; it will still make predictions and it might have good performance metrics. But it is an issue if you want to interpret the coefficients for your model because it becomes hard to tell which features are truly important. 
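
One way to spot highly correlated features is a correlation matrix. The sketch below builds a synthetic DataFrame in which the local-currency price is just the USD price times a fixed exchange rate; the data and the rate of 18.0 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption): local price is an exact multiple of USD price
rng = np.random.default_rng(0)
price_usd = rng.uniform(50_000, 400_000, size=200)
exchange_rate = 18.0  # hypothetical fixed exchange rate

toy = pd.DataFrame(
    {
        "price_aprox_usd": price_usd,
        "price_aprox_local_currency": price_usd * exchange_rate,
        "surface_covered_in_m2": price_usd / 2_000 + rng.normal(scale=10, size=200),
    }
)

# Pairwise Pearson correlations between all numeric columns
corr = toy.corr()
print(corr.round(3))
```

A correlation very close to 1 between two features suggests they carry the same information, which makes their individual coefficients hard to interpret.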

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Import data
columns = [
    "price",
    "price_aprox_local_currency",
    "price_aprox_usd",
    "surface_total_in_m2",
    "surface_covered_in_m2",
    "price_per_m2",
]
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv", usecols=columns)

# Drop missing values
mexico_city1.dropna(inplace=True)

mexico_city1.head()

In [None]:
mexico_city1.corr()

Fit a linear regression model for surface_covered_in_m2 as a function of price_aprox_usd and price_aprox_local_currency.

In [None]:
lr = LinearRegression()
lr.fit(
    mexico_city1[["price_aprox_usd", "price_aprox_local_currency"]],
    mexico_city1["surface_covered_in_m2"],
)

In [None]:
print(lr.coef_)

We need to drop columns before dropping rows; the order of operations matters. If we dropped rows with missing values first, we would lose every row that has a NaN in a column we plan to discard anyway. The code looks like this:

In [None]:
mexico_city1 = mexico_city1.drop(
    ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1
)
mexico_city1 = mexico_city1.dropna(axis=0)
mexico_city1.head()
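
The effect of the ordering is easy to see on a tiny example. In the sketch below, the DataFrame is made up, and the mostly empty "floor" column stands in for the columns being dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "floor" is almost entirely missing
toy = pd.DataFrame(
    {
        "price_aprox_usd": [100_000.0, 150_000.0, np.nan, 200_000.0],
        "surface_covered_in_m2": [40.0, 55.0, 60.0, 70.0],
        "floor": [np.nan, np.nan, 3.0, np.nan],
    }
)

# Rows first: every row has a NaN somewhere, so nothing is left
rows_first = toy.dropna(axis=0).drop(["floor"], axis=1)

# Columns first: only the row with a missing price is removed
cols_first = toy.drop(["floor"], axis=1).dropna(axis=0)

print("Rows first:   ", len(rows_first), "rows left")
print("Columns first:", len(cols_first), "rows left")
```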

In [None]:
##Splitting the target

In [None]:
target = "price_aprox_usd"
y_train = df[target]
y_train.shape

In [None]:
### Model Building

Baseline
One way to judge a model's performance is to see how a "dumb" model would perform on the same data. Some people call this a naïve or baseline model; it's a model that always makes the same prediction, no matter what the input is.

Calculate the mean of your target vector y_train and assign it to the variable y_mean.

In [None]:
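y_mean = y_train.mean()  # the single value a naïve model predicts
# Repeat the mean once per observation to get the baseline predictions
y_pred_baseline = [y_mean] * len(y_train)
y_mean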

In machine learning, a regression problem is when you need to build a model that's going to predict a continuous, numerical value, like the sale price of an apartment. One of the models that you can use for regression problems is called linear regression. In its simplest form, we fit a model that will predict a single output variable (called a target vector) as a linear function of a single input variable (called a feature matrix).
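
Here is a minimal sketch of that simplest form, using made-up data that lies exactly on the line price = 20,000 + 2,000 × area. The numbers are assumptions chosen so the result is easy to check; the column names match the ones used in this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on a straight line
area = np.array([30.0, 45.0, 60.0, 75.0, 90.0])
price = 20_000 + 2_000 * area

# The feature matrix is two-dimensional (one column), the target vector one-dimensional
X_toy = pd.DataFrame({"surface_covered_in_m2": area})
y_toy = pd.Series(price, name="price_aprox_usd")

toy_model = LinearRegression().fit(X_toy, y_toy)

print("Slope:    ", round(toy_model.coef_[0], 2))
print("Intercept:", round(toy_model.intercept_, 2))
```

Because the data has no noise, the fitted slope and intercept should match the 2,000 and 20,000 used to generate it, up to floating-point error.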

 Add a line to the plot below that shows the relationship between the observations X_train and our dumb model's predictions y_pred_baseline. Be sure that the line color is orange, and that it has the label "Baseline Model".

In [None]:
plt.plot(X_train, y_pred_baseline, color="orange", label="Baseline Model")
plt.scatter(X_train, y_train)
plt.xlabel("Area [sq meters]")
plt.ylabel("Price [USD]")
plt.title("Buenos Aires: Price vs. Area")
plt.legend();

Calculate the baseline mean absolute error for your predictions in y_pred_baseline as compared to the true targets in y_train.

In [None]:
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)

print("Mean apt price", round(y_mean, 2))
print("Baseline MAE:", round(mae_baseline, 2))
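
The same kind of baseline can also be built with scikit-learn's DummyRegressor, which is a different tool from the list-of-means approach used above but gives the same predictions when its strategy is "mean". The sketch below uses made-up data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical areas and prices
X_toy = np.array([[30.0], [45.0], [60.0], [75.0]])
y_toy = np.array([100_000.0, 120_000.0, 160_000.0, 180_000.0])

# A regressor that ignores the features and always predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_toy, y_toy)
y_pred_toy = baseline.predict(X_toy)

mae_toy = mean_absolute_error(y_toy, y_pred_toy)

print("Baseline prediction:", y_pred_toy[0])
print("Baseline MAE:", round(mae_toy, 2))
```

Here the mean price is 140,000, so every prediction is 140,000 and the baseline MAE is 30,000. Any model worth keeping should beat that number.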

Iterate

The next step in building a model is iterating. This involves building a model, training it, evaluating it, and then repeating the process until you're happy with the model's performance. Even though the model we're building is linear, the iteration process rarely follows a straight line. Be prepared for trying new things, hitting dead-ends, and waiting around while your computer does long computations to train your model.

In [None]:
model = LinearRegression()  # instantiate the model
model.fit(X_train, y_train)  # fit the model to the training data
# Check that the model is fitted
#check_is_fitted(model)
y_pred_training = model.predict(X_train)  # make predictions using the training set
#y_pred_training[:5]
mae_training = mean_absolute_error(y_train, y_pred_training)  # evaluate using the mean absolute error
#print("Training MAE:", round(mae_training, 2))

In [None]:
##making prediction on the test data
X_test = pd.read_csv("data/buenos-aires-test-features.csv")[features]
y_pred_test = pd.Series(model.predict(X_test))
y_pred_test.head()

In [None]:
#intercept of a model

In [None]:
intercept = round(model.intercept_,2)
print("Model Intercept:", intercept)
assert any([isinstance(intercept, int), isinstance(intercept, float)])

In [None]:
#coefficient of a model

In [None]:
coefficient = round(model.coef_[0], 2)
print('Model coefficient for "surface_covered_in_m2":', coefficient)
assert any([isinstance(coefficient, int), isinstance(coefficient, float)])

In [None]:
##generating an equation
print(f"apt_price = {intercept} + {coefficient} * surface_covered")
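
A quick way to convince yourself that this equation really is the model: compute intercept + coefficient × area by hand and compare it with what predict returns. The sketch below does this on synthetic data; the data and the seed are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic areas and noisy prices (assumed values, not project data)
rng = np.random.default_rng(1)
area = rng.uniform(30, 120, size=40).reshape(-1, 1)
price = 15_000 + 2_200 * area[:, 0] + rng.normal(scale=5_000, size=40)

toy_model = LinearRegression().fit(area, price)

# Apply the equation by hand using the fitted attributes
manual_pred = toy_model.intercept_ + toy_model.coef_[0] * area[:, 0]

# Compare with the model's own predictions
print(np.allclose(manual_pred, toy_model.predict(area)))
```

If you compare against the rounded intercept and coefficient instead, expect small differences, because the equation printed above uses values rounded to two decimal places.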

Add a line to the plot below that shows the relationship between the observations in X_train and your model's predictions y_pred_training. Be sure that the line color is red, and that it has the label "Linear Model".

In [None]:
plt.plot(X_train, model.predict(X_train),color="r",label="Linear Model")
plt.scatter(X_train, y_train)
plt.xlabel("surface covered [sq meters]")
plt.ylabel("price [usd]")
plt.legend();