## Project: Zillow's market value estimation

**Introduction:**\
A Zestimate is Zillow’s estimated market value for a home. It is computed with a proprietary formula from public and user-submitted data, such as details about the home (bedrooms, bathrooms, home age, etc.), location, property tax assessment information, and the sales histories of the subject home and of other homes recently sold in the area.

**Objective:**\
In this competition, Zillow is asking you to predict the log-error between their Zestimate and the actual sale price, given all the features of a home. The log error is defined as\
$\text{logerror} = \log(\text{Zestimate}) - \log(\text{SalePrice})$\
and it is recorded in the transactions file `train.csv`. The task is to predict the logerror for the months in Fall 2017.
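As a quick illustration of the definition above, here is a minimal sketch of how the log error could be computed. The helper name `log_error` and the sample prices are hypothetical, and the natural logarithm is assumed since the text only says "log".

```python
import numpy as np


def log_error(zestimate, sale_price):
    """Return log(Zestimate) - log(SalePrice), element-wise.

    Assumes the natural logarithm and strictly positive prices.
    """
    zestimate = np.asarray(zestimate, dtype=float)
    sale_price = np.asarray(sale_price, dtype=float)
    return np.log(zestimate) - np.log(sale_price)


# Hypothetical values, for illustration only (not taken from train.csv)
zestimates = np.array([310_000.0, 250_000.0, 500_000.0])
sale_prices = np.array([300_000.0, 250_000.0, 550_000.0])

errors = log_error(zestimates, sale_prices)
print(errors)
```

A positive value means the Zestimate was above the sale price, zero means an exact match, and a negative value means the Zestimate was too low.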

### Import of Python libraries

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import tensorflow as tf

### Import of the data

In [19]:
# Load the preprocessed features (tab-separated file; the first column is the index)
df_data = pd.read_csv('df_data_processed.csv', sep="\t", index_col=0)

In [20]:
df_data.head()

| | yearbuilt | regionidcity | regionidcounty | regionidzip | roomcnt | bathroomcnt | bedroomcnt | calculatedfinishedsquarefeet | assessmentyear |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1963.0 | 37688.0 | 3101.0 | 96337.0 | 0.0 | 0.0 | 0.0 | 1581.0 | 2016.0 |
| 1 | 1963.0 | 37688.0 | 3101.0 | 96337.0 | 0.0 | 0.0 | 0.0 | 1581.0 | 2015.0 |
| 2 | 1959.0 | 51617.0 | 3101.0 | 96095.0 | 0.0 | 0.0 | 0.0 | 73026.0 | 2016.0 |
| 3 | 1948.0 | 12447.0 | 3101.0 | 96424.0 | 0.0 | 0.0 | 0.0 | 5068.0 | 2016.0 |
| 4 | 1947.0 | 12447.0 | 3101.0 | 96450.0 | 0.0 | 0.0 | 0.0 | 1776.0 | 2016.0 |


In [17]:
# Load the preprocessed labels (tab-separated file; the first column is the index)
df_label = pd.read_csv('df_label_processed.csv', sep="\t", index_col=0)

In [18]:
df_label.head()

| | taxvaluedollarcnt |
|---|---|
| 0 | 9.0 |
| 1 | 27516.0 |
| 2 | 1434941.0 |
| 3 | 1174475.0 |
| 4 | 440101.0 |


In [21]:
print('DataFrame df_data has shape = ', df_data.shape)
print('DataFrame df_label has shape = ', df_label.shape)

DataFrame df_data has shape =  (2950951, 9)
DataFrame df_label has shape =  (2950951, 1)


In [28]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_data, df_label, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build the neural network model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1)  # Output layer
])

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

# Train the model
model.fit(X_train_scaled, y_train, epochs=10, batch_size=1024, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1cff3d48460>

In [29]:
# Make predictions
y_pred = model.predict(X_test_scaled)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error (RMSE):", rmse)

Root Mean Squared Error (RMSE): 771343.8434182632
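
An RMSE expressed in the units of the target is hard to judge in isolation. One possible sanity check, which is not part of the original notebook, is to compare it with the RMSE of a constant predictor that always outputs the training-set mean. The sketch below uses only NumPy. The helper names `rmse` and `mean_baseline_rmse` are hypothetical, and the log-normal demo data is a synthetic stand-in, not the Zillow data.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two array-likes of equal length."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mean_baseline_rmse(y_train, y_test):
    """RMSE obtained by always predicting the mean of the training targets."""
    baseline_value = np.asarray(y_train, dtype=float).mean()
    y_test = np.asarray(y_test, dtype=float).ravel()
    return rmse(y_test, np.full(y_test.shape, baseline_value))


# Synthetic, skewed targets used only to demonstrate the helpers
rng = np.random.default_rng(42)
y_train_demo = rng.lognormal(mean=12.5, sigma=1.0, size=1000)
y_test_demo = rng.lognormal(mean=12.5, sigma=1.0, size=250)

print("Baseline RMSE (predicting the training mean):",
      mean_baseline_rmse(y_train_demo, y_test_demo))
```

With the notebook's own variables, `mean_baseline_rmse(y_train, y_test)` would give a reference value; a model RMSE that is not clearly below it suggests the network has learned little beyond the average target.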
