<a href="https://colab.research.google.com/github/ag-wnl/kaggle_models/blob/main/housing_price_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import torch
import torchvision
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/shiv1709/House_price_prediction/master/USA_Housing.csv')
df.head()

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1059034.0,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701..."
1,79248.642455,6.0029,6.730821,3.09,40173.072174,1505891.0,"188 Johnson Views Suite 079\nLake Kathleen, CA..."
2,61287.067179,5.86589,8.512727,5.13,36882.1594,1058988.0,"9127 Elizabeth Stravenue\nDanieltown, WI 06482..."
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1260617.0,USS Barnett\nFPO AP 44820
4,59982.197226,5.040555,7.839388,4.23,26354.109472,630943.5,USNS Raymond\nFPO AE 09386


# **Exploring Dataset**

In [3]:
print(df.columns.values)

['Avg. Area Income' 'Avg. Area House Age' 'Avg. Area Number of Rooms'
 'Avg. Area Number of Bedrooms' 'Area Population' 'Price' 'Address']


In [4]:
df.shape

(5000, 7)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Avg. Area Income              5000 non-null   float64
 1   Avg. Area House Age           5000 non-null   float64
 2   Avg. Area Number of Rooms     5000 non-null   float64
 3   Avg. Area Number of Bedrooms  5000 non-null   float64
 4   Area Population               5000 non-null   float64
 5   Price                         5000 non-null   float64
 6   Address                       5000 non-null   object 
dtypes: float64(6), object(1)
memory usage: 273.6+ KB


In [6]:
for i in df.columns:
    print(f"column : {i} and no. of unique values: {len(df[i].unique())}")

# Heuristic: few unique values suggests a categorical feature; many unique values suggests a continuous one

column : Avg. Area Income and no. of unique values: 5000
column : Avg. Area House Age and no. of unique values: 5000
column : Avg. Area Number of Rooms and no. of unique values: 5000
column : Avg. Area Number of Bedrooms and no. of unique values: 255
column : Area Population and no. of unique values: 5000
column : Price and no. of unique values: 5000
column : Address and no. of unique values: 5000
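
The same per-column count is available directly from pandas. The sketch below uses a small made-up frame (`df_demo` and the threshold of 3 are illustrative assumptions, not part of the notebook) to show `DataFrame.nunique()` and one way to pick out low-cardinality columns.

```python
import pandas as pd

# Toy data standing in for the housing frame (values are illustrative only)
df_demo = pd.DataFrame({
    "rooms": [7.0, 7.0, 9.0, 6.0],
    "price": [1.0e6, 1.5e6, 1.1e6, 1.3e6],
})

# nunique() counts distinct non-null values per column
counts = df_demo.nunique()
print(counts)

# Columns at or below an arbitrary threshold are candidates for categorical treatment
low_card = counts[counts <= 3].index.tolist()
print(low_card)  # ['rooms']
```

Note that `nunique()` ignores missing values by default, whereas `len(df[i].unique())` counts NaN as one more value; the two agree here because the data has no nulls.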


In [7]:
for itr, i in enumerate(df['Address'][0:5]):
    print('\n', itr+1, ' ',i)


 1   208 Michael Ferry Apt. 674
Laurabury, NE 37010-5101

 2   188 Johnson Views Suite 079
Lake Kathleen, CA 48958

 3   9127 Elizabeth Stravenue
Danieltown, WI 06482-3489

 4   USS Barnett
FPO AP 44820

 5   USNS Raymond
FPO AE 09386


Notes on the cell below:
1) pd.Series(i) wraps the string i in a pandas Series so that the .str methods can be used.

2) .str.extract returns a DataFrame, so [0] selects its first column (a Series) and .iloc[0] then takes the single string out of it.

In [8]:
for itr, i in enumerate(df['Address'][0:10]):
    print('\n')
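    apart = pd.Series(i).str.extract(r'(\w+\s?\w*\.\s?\w+)')[0]  # text around a literal '.', e.g. 'Ferry Apt. 674'; NaN when the address has no '.'
    place = pd.Series(i).str.extract(r'([A-Z][a-z]+)')[0]  # first capitalized word, e.g. 'Michael' (the street name, not the city)
    state = pd.Series(i).str.extract(r'([A-Z]{2})')[0]  # first two consecutive capitals, e.g. 'NE' (but 'US' for 'USS Barnett')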
    print(f"{itr}) Apartment: {apart.iloc[0]}, Place: {place.iloc[0]}, State: {state.iloc[0]}")




0) Apartment: Ferry Apt. 674, Place: Michael, State: NE


1) Apartment: nan, Place: Johnson, State: CA


2) Apartment: nan, Place: Elizabeth, State: WI


3) Apartment: nan, Place: Barnett, State: US


4) Apartment: nan, Place: Raymond, State: US


5) Apartment: Islands Apt. 443, Place: Jennifer, State: KS


6) Apartment: nan, Place: Daniel, State: CO


7) Apartment: nan, Place: Joyce, State: TN


8) Apartment: nan, Place: Gilbert, State: US


9) Apartment: nan, Place: Unit, State: DP
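
As the output shows, the first-match patterns pick up the street name and, for ship addresses, the letters 'US'. If the city and state on the second address line are what is wanted, one possible alternative is a single pattern anchored on the newline. This is a sketch under the assumption that civilian addresses follow the form "City, ST zip"; the group names `City` and `State` are illustrative, and military addresses such as "FPO AP 44820" simply come back as NaN.

```python
import pandas as pd

addresses = pd.Series([
    "208 Michael Ferry Apt. 674\nLaurabury, NE 37010-5101",
    "188 Johnson Views Suite 079\nLake Kathleen, CA 48958",
    "USS Barnett\nFPO AP 44820",
])

# Named groups become the column names of the returned DataFrame
pattern = r"\n(?P<City>[^,\n]+), (?P<State>[A-Z]{2}) \d{5}"
parts = addresses.str.extract(pattern)
print(parts)
```

Rows that do not match the pattern are left as NaN, so they can be inspected or handled separately instead of receiving a misleading value.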


## Adding Place and State columns as extra features

In [9]:
df['Place'] = df['Address'].str.extract(r'([A-Z][a-z]+)', expand=False)
df['State'] = df['Address'].str.extract(r'([A-Z]{2})', expand=False)

df.head()

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address,Place,State
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1059034.0,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701...",Michael,NE
1,79248.642455,6.0029,6.730821,3.09,40173.072174,1505891.0,"188 Johnson Views Suite 079\nLake Kathleen, CA...",Johnson,CA
2,61287.067179,5.86589,8.512727,5.13,36882.1594,1058988.0,"9127 Elizabeth Stravenue\nDanieltown, WI 06482...",Elizabeth,WI
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1260617.0,USS Barnett\nFPO AP 44820,Barnett,US
4,59982.197226,5.040555,7.839388,4.23,26354.109472,630943.5,USNS Raymond\nFPO AE 09386,Raymond,US


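### **Rounding off values to define a proper scale**

The room, bedroom and house-age averages are rounded to whole numbers, and income and population are floored to multiples of 10,000, so that each feature takes only a handful of distinct values.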

In [22]:
df['Avg. Area Number of Rooms'] = df['Avg. Area Number of Rooms'].round()
df['Avg. Area Number of Bedrooms'] = df['Avg. Area Number of Bedrooms'].round()
df['Avg. Area House Age'] = df['Avg. Area House Age'].round()
df['Avg. Area Income'] = (df['Avg. Area Income'] // 10000)*10000
df['Area Population'] = (df['Area Population'] // 10000)*10000

df.head()

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Place,State
0,70000.0,6.0,7.0,4.0,20000.0,1059034.0,Michael,NE
1,70000.0,6.0,7.0,3.0,40000.0,1505891.0,Johnson,CA
2,60000.0,6.0,9.0,5.0,30000.0,1058988.0,Elizabeth,WI
3,60000.0,7.0,6.0,3.0,30000.0,1260617.0,Barnett,US
4,50000.0,5.0,8.0,4.0,20000.0,630943.5,Raymond,US
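
The floor-division step can be checked on a few values. The sketch below reuses three incomes from the first rows of the dataset (the variable names are illustrative) and contrasts flooring with rounding to the nearest 10,000.

```python
import pandas as pd

income = pd.Series([79545.458574, 61287.067179, 59982.197226])

# Floor to the lower multiple of 10,000 (what the cell above does)
binned = (income // 10000) * 10000
print(binned.tolist())   # [70000.0, 60000.0, 50000.0]

# Rounding to the nearest 10,000 would give different bins
rounded = income.round(-4)
print(rounded.tolist())  # [80000.0, 60000.0, 60000.0]
```

Flooring always maps a value to the bottom of its 10,000-wide bin, so 79,545 and 70,001 land in the same category.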


In [25]:
for i in df.columns:
    print(f"column : {i} and no. of unique values: {len(df[i].unique())}")

column : Avg. Area Income and no. of unique values: 9
column : Avg. Area House Age and no. of unique values: 8
column : Avg. Area Number of Rooms and no. of unique values: 9
column : Avg. Area Number of Bedrooms and no. of unique values: 5
column : Area Population and no. of unique values: 7
column : Price and no. of unique values: 5000
column : Place and no. of unique values: 1193
column : State and no. of unique values: 62


**Creating categorical features: every column that now has a small number of unique values**

In [28]:
cat_features = ['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']
out_feature = 'Price'

Making one label encoder per categorical feature

In [29]:
from sklearn.preprocessing import LabelEncoder
label_encoders = {}

df1 = pd.DataFrame()

for feature in cat_features:
    label_encoders[feature] = LabelEncoder()
    df1[feature] = label_encoders[feature].fit_transform(df[feature])

In [33]:
df1.head(7)

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population
0,5,3,4,2,2
1,5,3,4,1,4
2,4,3,6,3,3
3,4,4,3,1,3
4,3,2,5,2,2
5,6,2,3,2,2
6,4,3,5,1,6
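
To see what the encoder does to one column, here is a minimal sketch on a few binned income values (the array is illustrative). `LabelEncoder` sorts the distinct values and replaces each with its position in that sorted list.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

income = np.array([70000.0, 70000.0, 60000.0, 60000.0, 50000.0])

enc = LabelEncoder()
codes = enc.fit_transform(income)

print(enc.classes_)  # sorted distinct values: 50000, 60000, 70000
print(codes)         # [2 2 1 1 0]

# The fitted encoder can map codes back to the original values
restored = enc.inverse_transform(codes)
```

Keeping the fitted encoders in a dict, as the notebook does, makes it possible to encode new rows consistently later.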


Stacking the encoded columns into one array, ready to convert to a tensor

In [37]:
cat_features = np.stack([df1['Avg. Area Income'], df1['Avg. Area House Age'], df1['Avg. Area Number of Rooms'], df1['Avg. Area Number of Bedrooms'], df1['Area Population']], axis = 1)
cat_features

array([[5, 3, 4, 2, 2],
       [5, 3, 4, 1, 4],
       [4, 3, 6, 3, 3],
       ...,
       [4, 4, 2, 0, 3],
       [4, 3, 4, 3, 4],
       [4, 3, 4, 2, 4]])

In [42]:
cat_features = torch.tensor(cat_features, dtype = torch.int64)
# embedding layers are indexed with integers, so the categorical codes are stored as int64
cat_features


tensor([[5, 3, 4, 2, 2],
        [5, 3, 4, 1, 4],
        [4, 3, 6, 3, 3],
        ...,
        [4, 4, 2, 0, 3],
        [4, 3, 4, 3, 4],
        [4, 3, 4, 2, 4]])

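Continuous Features (Place and State are text columns, so they would need encoding before they could be turned into a float tensor; the conversion below is left commented out)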

In [44]:
cont_features = ['Place', 'State']

In [46]:
# cont_values = np.stack([df[i].values for i in cont_features], axis = 1)
# cont_values = torch.tensor(cont_values, dtype = torch.float)
# cont_values

**Now our Dependent Value (Output)**

In [48]:
y = torch.tensor(df['Price'].values, dtype = torch.float).reshape(-1, 1)
y

tensor([[1059033.5000],
        [1505890.8750],
        [1058988.0000],
        ...,
        [1030729.5625],
        [1198656.8750],
        [1298950.5000]])

In [49]:
cat_features.shape, y.shape

(torch.Size([5000, 5]), torch.Size([5000, 1]))

## **Embedding for categorical features**

In [51]:
cat_dims = [len(df1[col].unique()) for col in ['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']]
cat_dims

[9, 8, 9, 5, 7]

In [53]:
# ^^ the number of categories per feature, i.e. how many rows each embedding layer needs; the output size of each embedding is chosen next

To set the output dimension:
(Rule of thumb) base it on the input dimension: min(50, (no. of unique values in that feature + 1) // 2), i.e. half the number of categories, rounded up and capped at 50

In [54]:
embedding_dim = [(x, min(50, (x+1) // 2)) for x in cat_dims]
embedding_dim

[(9, 5), (8, 4), (9, 5), (5, 3), (7, 4)]
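
The rule can be wrapped in a small helper. This is only a sketch; the function name `embedding_sizes` and the `cap` parameter are illustrative, not part of the notebook.

```python
def embedding_sizes(cardinalities, cap=50):
    """Return (num_categories, embedding_width) pairs using half the
    cardinality, rounded up, and limited to `cap`."""
    return [(n, min(cap, (n + 1) // 2)) for n in cardinalities]

sizes = embedding_sizes([9, 8, 9, 5, 7])
print(sizes)  # [(9, 5), (8, 4), (9, 5), (5, 3), (7, 4)]

# The cap only matters for high-cardinality features, e.g. the 1193 Place values
print(embedding_sizes([1193]))  # [(1193, 50)]
```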

Creating the embedding layers (one per categorical feature)

In [55]:
import torch.nn as nn
import torch.nn.functional as F

embeded_representation = nn.ModuleList([nn.Embedding(inp, out) for inp, out in embedding_dim])
embeded_representation

ModuleList(
  (0): Embedding(9, 5)
  (1): Embedding(8, 4)
  (2): Embedding(9, 5)
  (3): Embedding(5, 3)
  (4): Embedding(7, 4)
)

In [57]:
pd.set_option('display.max_rows', 500)
embedding_val = []

for i,e in enumerate(embeded_representation):
    embedding_val.append(e(cat_features[:, i]))

embedding_val

[tensor([[-0.8846, -1.0587,  0.6495,  0.7460, -0.5137],
         [-0.8846, -1.0587,  0.6495,  0.7460, -0.5137],
         [-1.1311,  0.9083,  0.1261,  0.6482, -0.6468],
         ...,
         [-1.1311,  0.9083,  0.1261,  0.6482, -0.6468],
         [-1.1311,  0.9083,  0.1261,  0.6482, -0.6468],
         [-1.1311,  0.9083,  0.1261,  0.6482, -0.6468]],
        grad_fn=<EmbeddingBackward0>),
 tensor([[-1.5902,  0.0477,  0.5649, -0.5513],
         [-1.5902,  0.0477,  0.5649, -0.5513],
         [-1.5902,  0.0477,  0.5649, -0.5513],
         ...,
         [ 0.8115, -0.3430,  0.6055,  0.4808],
         [-1.5902,  0.0477,  0.5649, -0.5513],
         [-1.5902,  0.0477,  0.5649, -0.5513]], grad_fn=<EmbeddingBackward0>),
 tensor([[ 0.0130, -0.7576,  2.0050, -0.0554,  0.7508],
         [ 0.0130, -0.7576,  2.0050, -0.0554,  0.7508],
         [-0.6022, -0.5371,  0.8519, -0.1774,  0.2153],
         ...,
         [ 1.4094, -0.2659,  0.0943,  0.9846,  0.8617],
         [ 0.0130, -0.7576,  2.0050, -0.0554

In [58]:
z = torch.concat(embedding_val, axis = 1)
z

tensor([[-0.8846, -1.0587,  0.6495,  ...,  0.3583, -0.9259, -1.6701],
        [-0.8846, -1.0587,  0.6495,  ...,  1.0507,  0.4214,  0.5756],
        [-1.1311,  0.9083,  0.1261,  ..., -0.4191,  2.0477, -1.3497],
        ...,
        [-1.1311,  0.9083,  0.1261,  ..., -0.4191,  2.0477, -1.3497],
        [-1.1311,  0.9083,  0.1261,  ...,  1.0507,  0.4214,  0.5756],
        [-1.1311,  0.9083,  0.1261,  ...,  1.0507,  0.4214,  0.5756]],
       grad_fn=<CatBackward0>)
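
A small shape check makes the lookup-and-concatenate step concrete. The sketch below rebuilds the five embedding layers with the sizes listed earlier and feeds in the first three rows of codes; the variable names are illustrative, and the weights are freshly initialized, so the numbers differ from the output above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

sizes = [(9, 5), (8, 4), (9, 5), (5, 3), (7, 4)]
embeds = nn.ModuleList([nn.Embedding(n, d) for n, d in sizes])

# First three rows of the encoded categorical features
codes = torch.tensor([[5, 3, 4, 2, 2],
                      [5, 3, 4, 1, 4],
                      [4, 3, 6, 3, 3]])

# Column i indexes embedding layer i; each lookup returns one vector per row
vectors = [e(codes[:, i]) for i, e in enumerate(embeds)]
z_demo = torch.cat(vectors, dim=1)
print(z_demo.shape)  # torch.Size([3, 21]) since 5 + 4 + 5 + 3 + 4 = 21

# Rows 0 and 1 share income code 5, so they receive the same income vector
same = torch.equal(vectors[0][0], vectors[0][1])
print(same)  # True
```

An embedding is a lookup table: the vector for code 5 is simply row 5 of the layer's weight matrix, which is why identical codes produce identical rows in the output above.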

In [59]:
dropout = nn.Dropout(0.4)
final_embed = dropout(z)
final_embed

tensor([[-1.4743, -1.7646,  1.0825,  ...,  0.0000, -0.0000, -2.7835],
        [-0.0000, -1.7646,  1.0825,  ...,  1.7511,  0.7024,  0.9593],
        [-0.0000,  0.0000,  0.0000,  ..., -0.6984,  3.4128, -2.2495],
        ...,
        [-1.8851,  1.5138,  0.2101,  ..., -0.6984,  0.0000, -2.2495],
        [-0.0000,  1.5138,  0.2101,  ...,  1.7511,  0.0000,  0.9593],
        [-0.0000,  1.5138,  0.2101,  ...,  0.0000,  0.0000,  0.9593]],
       grad_fn=<MulBackward0>)
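
The dropout output contains zeros and values larger than the inputs (for example, -0.8846 becomes -1.4743, which is -0.8846 / 0.6). The sketch below, on a tensor of ones with illustrative names, shows this behaviour: in training mode surviving entries are scaled by 1 / (1 - p), and in evaluation mode the layer passes its input through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(0.4)
x = torch.ones(1000, 21)

# Training mode (the default): about 40% of entries are zeroed, the rest scaled up
out_train = drop(x)
kept = out_train[out_train != 0]
kept_fraction = kept.numel() / x.numel()
print(kept.unique())      # a single value, 1 / 0.6
print(kept_fraction)      # close to 0.6

# Evaluation mode: dropout is a no-op
drop.eval()
out_eval = drop(x)
```

The rescaling keeps the expected value of each activation the same in both modes, which is why `model.eval()` should be called before making predictions.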

In [61]:
class forwardNN(nn.Module):
    def __init__(self, embedding_dim, out_sz, layers, dropout_ratio = 0.5, n_cont = 0):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(inp, out) for inp, out in embedding_dim])
        self.emb_drop = nn.Dropout(dropout_ratio)
        # n_cont is an added argument (default 0): the number of continuous inputs, if any
        self.bn_cont = nn.BatchNorm1d(n_cont) if n_cont > 0 else None

        # input width = total embedding width + number of continuous inputs
        n_emb = sum((out for inp, out in embedding_dim))
        n_in = n_emb + n_cont

        # one Linear -> ReLU -> BatchNorm -> Dropout block per entry in `layers`
        layerlist = []
        for i in layers:
            layerlist.append(nn.Linear(n_in, i))
            layerlist.append(nn.ReLU(inplace = True))
            layerlist.append(nn.BatchNorm1d(i))
            layerlist.append(nn.Dropout(dropout_ratio))
            n_in = i
        # final projection to the output size (n_in now equals layers[-1])
        layerlist.append(nn.Linear(n_in, out_sz))

        self.layers = nn.Sequential(*layerlist)

    # forward must be a method of the class, not a function nested inside __init__
    def forward(self, x_cat, x_cont = None):
        embeddings = []
        for i, e in enumerate(self.embeds):
            embeddings.append(e(x_cat[:, i]))
        x = torch.cat(embeddings, 1)
        x = self.emb_drop(x)

        # continuous inputs are optional; when present they are normalized and appended
        if self.bn_cont is not None:
            x_cont = self.bn_cont(x_cont)
            x = torch.cat([x, x_cont], 1)
        x = self.layers(x)
        return x
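
Before training, it helps to confirm that a network of this shape accepts a batch of category codes and returns one prediction per row. The following is a self-contained sketch, not the notebook's model: `TinyTabularNet` is a hypothetical, categorical-only variant, and the hidden sizes `[100, 50]` and batch size 32 are arbitrary choices.

```python
import torch
import torch.nn as nn

class TinyTabularNet(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, embedding_dim, out_sz, layers, dropout_ratio=0.5):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(n, d) for n, d in embedding_dim])
        self.emb_drop = nn.Dropout(dropout_ratio)
        blocks, n_in = [], sum(d for _, d in embedding_dim)
        for width in layers:
            blocks += [nn.Linear(n_in, width), nn.ReLU(),
                       nn.BatchNorm1d(width), nn.Dropout(dropout_ratio)]
            n_in = width
        blocks.append(nn.Linear(n_in, out_sz))
        self.layers = nn.Sequential(*blocks)

    def forward(self, x_cat):
        # Look up one embedding per categorical column and concatenate them
        x = torch.cat([e(x_cat[:, i]) for i, e in enumerate(self.embeds)], dim=1)
        return self.layers(self.emb_drop(x))

embedding_dim = [(9, 5), (8, 4), (9, 5), (5, 3), (7, 4)]
model = TinyTabularNet(embedding_dim, out_sz=1, layers=[100, 50])

# Random category codes: one column per feature, each within that feature's range
x_cat = torch.stack([torch.randint(0, n, (32,)) for n, _ in embedding_dim], dim=1)

model.eval()  # disable dropout and use BatchNorm running statistics
with torch.no_grad():
    pred = model(x_cat)
print(pred.shape)  # torch.Size([32, 1])
```

The output has the same `(rows, 1)` layout as the `y` tensor built earlier, so a regression loss such as `nn.MSELoss` could compare the two directly.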

5000