## Exercise

1. Import the `housing.csv` dataset from the dataset directory. The `good_bad` feature is an "artificial" binary feature indicating whether the house is nice or not. **Note** that this is not exactly the same dataset as `kc_house_data.csv`!

2. Use logistic regression to predict the `good_bad` feature from `sqft_living`.

3. Find the accuracy of your model.

4. Print a report of the classification scores.

In [1]:
import pandas as pd

df = pd.read_csv("../../datasets/housing.csv")
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,good_bad
0,6865200140,20140529T000000,485000.0,4,1.0,1600,4300,1.5,0,0,...,1600,0,1916,0,98103,47.6648,-122.343,1610,4300,0
1,3362400511,20150304T000000,570000.0,3,1.75,1260,3328,1.0,0,0,...,700,560,1905,0,98103,47.6823,-122.349,1380,3536,0
2,3362400431,20140626T000000,518500.0,3,3.5,1590,1102,3.0,0,0,...,1590,0,2010,0,98103,47.6824,-122.347,1620,3166,1
3,2331300505,20140613T000000,822500.0,5,3.5,2320,4960,2.0,0,0,...,1720,600,1926,0,98103,47.6763,-122.352,1700,4960,0
4,1994200024,20141104T000000,511000.0,3,1.0,1430,3455,1.0,0,0,...,980,450,1947,0,98103,47.6873,-122.336,1450,4599,0


In [2]:
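from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# Double brackets keep the single feature as a 2D DataFrame, which scikit-learn expects for X.
model.fit(X=df[["sqft_living"]], y=df["good_bad"])

# Predict on the same rows the model was fitted on (no held-out test set).
y_pred = model.predict(df[["sqft_living"]])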

In [3]:
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_true=df["good_bad"], y_pred=y_pred)

print(f"Accuracy: {acc:.2%}")

Accuracy: 59.80%


We see that our model predicts roughly 60% of the samples correctly. Note that this is measured on the same data the model was trained on. Let's look at the classification report.
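To make the accuracy figure concrete, here is a minimal sketch of what it measures: the share of samples whose prediction matches the true label. The lists `toy_true` and `toy_pred` and the helper `class_recall` are hypothetical and made up for illustration; they are not taken from `housing.csv`. The sketch also shows that a single overall number can hide very different behavior per class, which is what the classification report breaks down.

```python
# Hypothetical toy labels for illustration only; not taken from housing.csv.
toy_true = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0]
toy_pred = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]

# Accuracy: share of samples whose prediction matches the true label.
matches = [t == p for t, p in zip(toy_true, toy_pred)]
accuracy = sum(matches) / len(matches)


def class_recall(label):
    """Share of samples with the given true label that were predicted correctly."""
    hits = [m for t, m in zip(toy_true, matches) if t == label]
    return sum(hits) / len(hits)


print(f"Accuracy: {accuracy:.2%}")
print(f"Correct among true 0s: {class_recall(0):.2%}")
print(f"Correct among true 1s: {class_recall(1):.2%}")
```

On this toy data the overall accuracy looks acceptable while one class is recognized far less often than the other.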

In [4]:
from sklearn.metrics import classification_report

report = classification_report(y_true=df["good_bad"], y_pred=y_pred)

print(report)

              precision    recall  f1-score   support

           0       0.61      0.84      0.71       347
           1       0.55      0.27      0.36       255

    accuracy                           0.60       602
   macro avg       0.58      0.55      0.53       602
weighted avg       0.59      0.60      0.56       602



The report sheds some light on the strengths and weaknesses of the model. Some insights:

- The model is bad at identifying good houses. It recognizes only 27% of them (recall for class 1).
- Even when the model predicts that a house is good, the prediction is correct only 55% of the time (precision for class 1), so trusting it is little better than flipping a coin.
- If we discard every house the model predicts as "bad", we remove 84% of the bad houses, but also the 73% of good houses that the model misclassifies as bad. If we are okay with this tradeoff, we can narrow down the houses to a much smaller number.
- The dataset contains 347 "bad" houses and 255 "good" ones, so there are roughly 90 more "bad" houses than "good" ones, which might explain the better performance of our model on the "bad" houses.
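
The arithmetic behind these insights can be sketched from the report alone. The report does not print the underlying counts, so the sketch below estimates them by rounding recall times support; treat the counts as an assumption consistent with the rounded figures, not as the exact confusion matrix. The dictionary names are hypothetical.

```python
# Support (true samples per class) and recall, copied from the report above.
support = {"bad": 347, "good": 255}
recall = {"bad": 0.84, "good": 0.27}

# Assumption: the exact counts are not printed in the report, so they are
# estimated here by rounding recall * support.
correct = {label: round(recall[label] * support[label]) for label in support}
missed = {label: support[label] - correct[label] for label in support}

total = sum(support.values())
accuracy = sum(correct.values()) / total

# Precision: correct predictions of a class / all predictions of that class.
# Houses predicted "good" = good houses found + bad houses mistaken for good.
predicted_good = correct["good"] + missed["bad"]
predicted_bad = correct["bad"] + missed["good"]
precision_good = correct["good"] / predicted_good
precision_bad = correct["bad"] / predicted_bad

# Baseline: always predict the majority class ("bad").
baseline = max(support.values()) / total

print(f"Estimated correct counts: {correct}")
print(f"Accuracy: {accuracy:.2%} (majority baseline: {baseline:.2%})")
print(f"Precision good: {precision_good:.2f}, bad: {precision_bad:.2f}")
print(f"Houses left after dropping predicted-bad ones: {predicted_good} of {total}")
print(f"Good houses lost by that filter: {missed['good'] / support['good']:.0%}")
```

Under these estimates, the derived accuracy and precision values agree with the rounded figures in the report. The majority baseline, which depends only on the support counts, comes out at about 57.6%, so the model is only about two percentage points better than always predicting "bad".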