# Predicting values - 1

This notebook aims to help you get started with ML using the housing dataset.
The goal is to create a simple model based on some basic EDA, apply it to our housing data, and calculate its performance.

In [1]:
import pandas as pd

housing = pd.read_csv('housing-classification-iter-0-2.csv')

## Initial exploration

What columns exist in this data? What are their data types?

In [5]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   LotArea       1460 non-null   int64  
 1   LotFrontage   1201 non-null   float64
 2   TotalBsmtSF   1460 non-null   int64  
 3   BedroomAbvGr  1460 non-null   int64  
 4   Fireplaces    1460 non-null   int64  
 5   PoolArea      1460 non-null   int64  
 6   GarageCars    1460 non-null   int64  
 7   WoodDeckSF    1460 non-null   int64  
 8   ScreenPorch   1460 non-null   int64  
 9   Expensive     1460 non-null   int64  
dtypes: float64(1), int64(9)
memory usage: 114.2 KB


Do we have missing values in this dataset?

In [2]:
housing.isna().sum()

LotArea           0
LotFrontage     259
TotalBsmtSF       0
BedroomAbvGr      0
Fireplaces        0
PoolArea          0
GarageCars        0
WoodDeckSF        0
ScreenPorch       0
Expensive         0
dtype: int64

In [12]:
housing.dropna()

,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch,Expensive,preds1
0,8450,65.0,856,3,0,0,2,0,0,0,0
1,9600,80.0,1262,3,1,0,2,298,0,0,0
2,11250,68.0,920,3,1,0,2,0,0,0,0
3,9550,60.0,756,3,1,0,3,0,0,0,0
4,14260,84.0,1145,4,1,0,3,192,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
1455,7917,62.0,953,3,1,0,2,0,0,0,0
1456,13175,85.0,1542,3,2,0,2,349,0,0,1
1457,9042,66.0,1152,4,2,0,1,0,0,1,1
1458,9717,68.0,1078,2,0,0,1,366,0,0,0
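Note that `dropna()` returns a new DataFrame and leaves the original untouched unless the result is assigned. The sketch below illustrates this on a small made-up frame; `toy` and `cleaned` are hypothetical names, and the values are invented for the example rather than taken from the housing file.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value; the column names mirror the housing data,
# the values are made up for illustration.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0],
    "Fireplaces": [0, 1, 1],
})

# dropna() returns a new DataFrame; `toy` itself is not modified.
cleaned = toy.dropna()

print(len(toy), len(cleaned))
```

To keep the cleaned version in the notebook, the result would have to be assigned, for example `housing = housing.dropna()`.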


Do we have duplicated information?

In [3]:
housing.duplicated().sum()

14
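As a rough illustration of what this number means, `duplicated()` marks every row that repeats an earlier row, so the sum counts the repeats and not the first occurrences. The frame below is a made-up example (`toy` and `deduplicated` are hypothetical names), and whether duplicates should be dropped from the housing data is a separate decision that the notebook does not make.

```python
import pandas as pd

# Made-up rows: the first row appears three times in total.
toy = pd.DataFrame({
    "LotArea": [8450, 8450, 9600, 8450],
    "Fireplaces": [0, 0, 1, 0],
})

# Counts only the repeats (rows 1 and 3), not the first occurrence.
n_duplicates = toy.duplicated().sum()

# Keeps the first occurrence of each distinct row.
deduplicated = toy.drop_duplicates()

print(n_duplicates, len(deduplicated))
```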

Is there any column that helps us identify if a house is expensive or not?

In [14]:
housing.groupby(['Fireplaces','Expensive']).agg(count = ('Expensive','count'))

Fireplaces,Expensive,count
0,0,680
0,1,10
1,0,482
1,1,168
2,0,78
2,1,37
3,0,3
3,1,2
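Raw counts are hard to compare across groups of different sizes, so it can help to turn them into the share of expensive houses per fireplace count. The sketch below re-enters the counts from the table above by hand; `counts`, `table` and `share_expensive` are names chosen for this example.

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.DataFrame({
    "Fireplaces": [0, 0, 1, 1, 2, 2, 3, 3],
    "Expensive":  [0, 1, 0, 1, 0, 1, 0, 1],
    "count":      [680, 10, 482, 168, 78, 37, 3, 2],
})

# One row per fireplace count, one column per class.
table = counts.pivot(index="Fireplaces", columns="Expensive", values="count")

# Share of expensive houses within each fireplace group.
share_expensive = table[1] / table.sum(axis=1)

print(share_expensive.round(3))
```

The share rises with the number of fireplaces, but it stays below one half in every group, which is worth keeping in mind when choosing a threshold.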


## Create your first model

Based on the previous exploration, you have found some columns that are related to the price of a house. Now it's your turn to create a Python function that classifies whether a house is going to be expensive (`1`) or not (`0`). Read the following article on the [platform](https://platform.wbscodingschool.com/courses/data-science/14406/) to understand more about this process.

Your function should look like this:

```py
def my_dummy_model(variable_affecting_the_price_of_a_house): 
    """
    Given a variable that affects the price of a house, 
    return if it is expensive (1) or not (0)
    """
    if variable_affecting_the_price_of_a_house > 12345: 
        return 1
    else:
        return 0
```

What are the predictions of your model?

In [7]:
# Predict using the number of fireplaces
def price_range(Fireplaces):
    """Return 1 (expensive) if the house has more than one fireplace, else 0."""
    if Fireplaces > 1:
        return 1
    else:
        return 0

In [24]:
housing['preds1'] = [price_range(Fireplaces) for Fireplaces in housing['Fireplaces']]
housing.head()

,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch,Expensive,preds1
0,8450,65.0,856,3,0,0,2,0,0,0,0
1,9600,80.0,1262,3,1,0,2,298,0,0,0
2,11250,68.0,920,3,1,0,2,0,0,0,0
3,9550,60.0,756,3,1,0,3,0,0,0,0
4,14260,84.0,1145,4,1,0,3,192,0,0,0
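For a single-threshold rule like this one, the same predictions can also be obtained without a Python loop by comparing the whole column at once. This is only an alternative sketch on a made-up column, not a change to the notebook's approach; `toy`, `loop_preds` and `vector_preds` are hypothetical names.

```python
import pandas as pd

def price_range(fireplaces):
    """Same rule as in the notebook: more than one fireplace means expensive."""
    return 1 if fireplaces > 1 else 0

# Made-up column covering each fireplace count seen in the data.
toy = pd.DataFrame({"Fireplaces": [0, 1, 2, 3]})

# Row-by-row version, as used in the notebook.
loop_preds = [price_range(f) for f in toy["Fireplaces"]]

# Vectorized version: boolean comparison, then cast True/False to 1/0.
vector_preds = (toy["Fireplaces"] > 1).astype(int)

print(loop_preds, vector_preds.tolist())
```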


## Evaluate its performance

How can we evaluate our model? Is there a way to check its performance?

In [17]:
correctness_check = (
housing
    .filter(['Expensive','preds1'])
    .assign(check = lambda x: x['Expensive'] == x['preds1'])
)
correctness_check

,Expensive,preds1,check
0,0,0,True
1,0,0,True
2,0,0,True
3,0,0,True
4,0,0,True
...,...,...,...
1455,0,0,True
1456,0,1,False
1457,1,1,True
1458,0,0,True


In [19]:
(correctness_check['check'].sum() / correctness_check.shape[0])

0.8226027397260274
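A single accuracy number hides which kind of mistake the model makes. As a minimal sketch, a cross-tabulation of true labels against predictions shows the errors per class. The labels and predictions below are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Made-up labels and predictions.
toy = pd.DataFrame({
    "Expensive": [0, 0, 0, 1, 1, 0],
    "preds1":    [0, 0, 1, 1, 0, 0],
})

# Accuracy: the mean of a boolean Series is the share of True values.
accuracy = (toy["Expensive"] == toy["preds1"]).mean()

# Rows are the true class, columns the predicted class.
confusion = pd.crosstab(toy["Expensive"], toy["preds1"])

print(accuracy)
print(confusion)
```

The off-diagonal cells of `confusion` count cheap houses predicted as expensive and expensive houses predicted as cheap.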

In [20]:
# remove the predictions from our original data
housing.drop(columns=['preds1'], inplace=True)

**The concept of train and test - 2**


In this part we will:

1. Split our housing data into two sets, train and test.
2. Create a model (function) based on an exploration that uses only the train set.
3. Apply the model to our train and test sets.
4. Evaluate the performance on both sets.

In [21]:
import pandas as pd

housing = pd.read_csv('https://raw.githubusercontent.com/JoanClaverol/housing_data/main/housing-classification-iter-0-2.csv')

## Train and test creation

Is there a way to create a train set and a test set using sklearn?

In [22]:
# Create a train and a test with sklearn
from sklearn.model_selection import train_test_split

X = housing.drop(columns=['Expensive'])
y = housing.filter(['Expensive'])

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.75, random_state=8)
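When one class is much rarer than the other, a random split can leave the two sets with slightly different class proportions. `train_test_split` accepts a `stratify` argument that keeps the proportions the same in both sets. The notebook does not use it; the sketch below is an optional variation on made-up data, with `X_toy` and `y_toy` as hypothetical names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 20 of them labelled expensive.
X_toy = pd.DataFrame({"Fireplaces": [2] * 20 + [0] * 80})
y_toy = pd.Series([1] * 20 + [0] * 80, name="Expensive")

# stratify=y_toy keeps the 20% share of expensive houses in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.75, random_state=8, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```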

Explore and create the model using the train set

In [23]:
# quick exploration of our train data
(
X_train
    .assign(Expensive = y_train)
    .groupby(['Fireplaces','Expensive'])
    .agg(count = ('Expensive','count'))
    )

Fireplaces,Expensive,count
0,0,532
0,1,8
1,0,334
1,1,121
2,0,63
2,1,32
3,0,3
3,1,2


**Create your model after exploring the train set**


Based on your exploration, what features will have the most impact on the predictions? Create a function based on those features.

In [25]:
# Let's predict  
X_train['preds1'] = [price_range(Fireplaces) for Fireplaces in X_train['Fireplaces']]

In [26]:
# results train data
(
X_train
    .filter(['preds1'])
    .assign(expensive = y_train)
    .assign(check = lambda x: x['expensive'] == x['preds1'])['check']
    .sum()
 ) / len(y_train)

0.821917808219178

In [30]:
# results test data
acc_2nd = (
X_test
  .assign(preds1 = lambda x: [price_range(Fireplaces) for Fireplaces in x['Fireplaces']])
  .filter(['preds1'])
  .assign(expensive = y_test)
  .assign(check = lambda x: x['expensive'] == x['preds1'])['check']
  .sum()
 ) / len(y_test)

acc_2nd

0.8246575342465754
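The train and test evaluations above repeat the same steps, so they could be wrapped in one small helper. This is a sketch under assumptions: `accuracy` is a hypothetical helper that is not part of the notebook, and it expects the features to contain a `Fireplaces` column. The data below is made up.

```python
import pandas as pd

def price_range(fireplaces):
    """Same rule as in the notebook: more than one fireplace means expensive."""
    return 1 if fireplaces > 1 else 0

def accuracy(features, labels, model):
    """Share of rows where model(Fireplaces) equals the true label."""
    preds = features["Fireplaces"].apply(model)
    # ravel() lets labels be either a Series or a one-column DataFrame.
    return (preds.to_numpy() == labels.to_numpy().ravel()).mean()

# Made-up features and labels, shaped like X_test and y_test.
X_toy = pd.DataFrame({"Fireplaces": [0, 1, 2, 2]})
y_toy = pd.DataFrame({"Expensive": [0, 0, 1, 0]})

toy_accuracy = accuracy(X_toy, y_toy, price_range)
print(toy_accuracy)
```

With such a helper, the train and test scores would come from the same code path, which makes them easier to compare.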

In [31]:
# class balance in the train set
(
  X_train
    .assign(expensive = y_train)
    .groupby(['expensive'])
    .agg(count = ('expensive','count'))
)

expensive,count
0,932
1,163
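These class counts give a useful reference point: a model that always predicts the majority class (`0`) already reaches a certain accuracy, and any real model should be compared against it. The sketch below computes that baseline from the counts shown above; the variable names are chosen for this example.

```python
# Class counts copied from the train-set output above.
n_cheap, n_expensive = 932, 163

# Accuracy of always predicting the majority class.
baseline_accuracy = max(n_cheap, n_expensive) / (n_cheap + n_expensive)

print(round(baseline_accuracy, 4))  # 0.8511
```

The fireplace rule scored about 0.822 on the train set, so on this data it does not beat the majority-class baseline, which suggests trying other features or thresholds.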
