# EDA project as part of the neuefische Data Science Bootcamp 2024

---

# Introduction

This notebook was created as part of the Data Science Bootcamp, which I attended in spring 2024 to make my way out of academia and into the world of data and predictions.

The aim of this EDA project is to demonstrate what we've learnt over the last few weeks in a (slightly) larger project and to help a simulated customer find a house.

---

## About the Client

![client](./misc/client.jpeg)

## Questions

In order to help the client find two houses, I'll guide the EDA along the following questions:

- How does the location affect the price of the house?
- Does the time of the year affect house prices?
   - And if so, what other factors may influence this dependency?

## Hypotheses

Moreover, the following specific hypotheses will underpin the conclusions drawn from the data and support the final recommendations:

1. Houses in the city are more expensive than in the countryside.
2. The fluctuation of house prices over the course of the year depends on the region.
3. Houses in need of renovation are less affected by price fluctuations.

---

## About the Dataset

The dataset, provided by neuefische, is a version of the widely used dataset of house sale prices for King County, which includes Seattle. It covers homes sold between May 2014 and May 2015 and contains the following columns/features:

| column name | description |
| --- | ----------- |
| id | unique identifier for a house |
| date | date the house was sold |
| price | sale price (prediction target) |
| bedrooms | # of bedrooms |
| bathrooms | # of bathrooms |
| sqft_living | square footage of the home |
| sqft_lot | square footage of the lot |
| floors | floors (levels) in the house |
| waterfront | whether the house has a view of a waterfront |
| view | quality of the view |
| condition $^1$ | overall condition of the house |
| grade $^1$ | overall grade given to the housing unit, based on the King County grading system |
| sqft_above | square footage of the house apart from the basement |
| sqft_basement | square footage of the basement |
| yr_built | year the house was built |
| yr_renovated | year the house was renovated |
| zipcode | zip code |
| lat | Latitude coordinate |
| long | Longitude coordinate |
| sqft_living15 | The square footage of interior housing living space for the nearest 15 neighbors |
| sqft_lot15 | The square footage of the land lots of the nearest 15 neighbors |

$^1$: more information about the _Grade_ & _Condition_ categories may be [found here](https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r)

---

# Preparations

## Import libraries & set global settings

As a first step, import the required libraries and define global settings.

In [116]:
# --- hide code --- import ---
import os, json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium as fl
import geopandas as gpd
from geopy.distance import distance 


# plot styles
#plt.style.use('https://github.com/dhaitz/matplotlib-stylesheets/raw/master/pitayasmoothie-light.mplstyle')
plt.style.use('https://github.com/dhaitz/matplotlib-stylesheets/raw/master/pacoty.mplstyle')

%matplotlib inline
plt.rcParams['figure.dpi'] = 300
plt.rcParams.update({ "figure.figsize" : (8, 5),"axes.facecolor" : "white", "axes.edgecolor":  "black"})
plt.rcParams["figure.facecolor"]= "w"

sns.set_theme(rc={"figure.dpi":300, 'savefig.dpi':300})
sns.set_context('notebook')
sns.set_style("ticks")

## Import raw data from csv & geojson files

Import the raw dataset from the csv file and the geojson files used for the choropleth plots. Also, drop variables, features and columns that are not used, and rename the remaining columns to lower case.

In [117]:
# --- hide code --- import ---
# import the data from a csv-file + drop 'id'
df = pd.read_csv(os.path.join('data', 'eda.csv'))
df.drop(columns=['id'], inplace=True)
df.columns = map(str.lower, df.columns)

# load GeoJSON
geo_json_full = os.path.join('data', 'KingCounty.geojson')
geo_json = os.path.join('data', 'KC_area.geojson')
with open(geo_json, 'r') as jsonFile:
    geo_json_data = json.load(jsonFile)

# open GeoJSON file as geopandas 
gdf = gpd.read_file(geo_json_full)
gdf.drop(columns=['OBJECTID', 'ZIPCODE','COUNTY'], inplace=True)
gdf.rename(columns={'COUNTY_NAME': 'COUNTY', 'ZIP': 'zipcode', 'PREFERRED_CITY': 'city'}, inplace=True)
gdf.columns = map(str.lower, gdf.columns)

# load pre_calculated distance matrix between every house (precalculated, because this took some minutes)
#############

---

# Functions

Here I define functions that come in handy in the course of this analysis. These are:
- convert square feet to square meters

In [118]:
# --- hide code --- functions ---
# convert square_feet to square_meters
def sqft2sqm(cols):
    for col in cols:
        if col in df.columns:
            df[col.replace('sqft', 'sqm')] = (df[col] / 10.764).round(2)  # 1 sqm is roughly 10.764 sqft
            df.drop(columns=[col], inplace=True)

def get_nearest_featx(ref_house, feat):
    pass
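
The helper above modifies the global `df` in place. As a minimal sketch of an alternative, the same conversion could be written so that it returns a new dataframe and leaves its input untouched. The name `sqft_to_sqm`, the constant `SQFT_PER_SQM` and the toy dataframe are assumptions made for illustration; the divisor 10.764 is the one used in the notebook.

```python
import pandas as pd

SQFT_PER_SQM = 10.764  # same rounded factor as in the notebook's sqft2sqm


def sqft_to_sqm(data, cols):
    """Return a copy in which each listed sqft_* column is replaced by a sqm_* column."""
    out = data.copy()
    for col in cols:
        if col in out.columns:  # silently skip columns that do not exist
            out[col.replace('sqft', 'sqm')] = (out[col] / SQFT_PER_SQM).round(2)
            out = out.drop(columns=col)
    return out


# toy data for illustration only
demo = pd.DataFrame({'sqft_living': [1180.0, 2570.0], 'price': [221900, 538000]})
converted = sqft_to_sqm(demo, ['sqft_living', 'sqft_lot'])
print(converted)
```

Returning a copy makes the cell safe to re-run, whereas an in-place version only converts on its first run because the `sqft_*` columns are dropped afterwards.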

---

# Data Cleaning

## Data Inspection

First, inspect the dataframe which contains the raw housing sales data.

In [119]:
# --- hide code --- inspect ---
print(df.info(), '\n')
df.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           21597 non-null  object 
 1   price          21597 non-null  float64
 2   bedrooms       21597 non-null  float64
 3   bathrooms      21597 non-null  float64
 4   sqft_living    21597 non-null  float64
 5   sqft_lot       21597 non-null  float64
 6   floors         21597 non-null  float64
 7   waterfront     19206 non-null  float64
 8   view           21534 non-null  float64
 9   condition      21597 non-null  int64  
 10  grade          21597 non-null  int64  
 11  sqft_above     21597 non-null  float64
 12  sqft_basement  21145 non-null  float64
 13  yr_built       21597 non-null  int64  
 14  yr_renovated   17749 non-null  float64
 15  zipcode        21597 non-null  int64  
 16  lat            21597 non-null  float64
 17  long           21597 non-null  float64
 18  sqft_l

Unnamed: 0,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,2014-10-13,221900.0,3.0,1.0,1180.0,5650.0,1.0,,0.0,3,7,1180.0,0.0,1955,0.0,98178,47.5112,-122.257,1340.0,5650.0
1,2014-12-09,538000.0,3.0,2.25,2570.0,7242.0,2.0,0.0,0.0,3,7,2170.0,400.0,1951,19910.0,98125,47.721,-122.319,1690.0,7639.0
2,2015-02-25,180000.0,2.0,1.0,770.0,10000.0,1.0,0.0,0.0,3,6,770.0,0.0,1933,,98028,47.7379,-122.233,2720.0,8062.0


Then, also perform the following steps:

- fix `yr_renovated` -> divide by 10 (the raw values carry an extra trailing digit, e.g. `19910.0`)
- check for duplicates
- check for missing data
- edit data types

In [120]:
# --- hide code --- fix yr_renovated ---
df['yr_renovated'] = df['yr_renovated'] / 10

In [121]:
# --- hide code --- check for duplicate rows ---
print('duplicates: ', df.duplicated().sum())  # number of rows that repeat an earlier row

duplicates:  0


In [122]:
# --- hide code --- check for missing values ---
print('missing values per feature:')
nullseries = df.isnull().sum()
print(nullseries[nullseries > 0])

missing values per feature:
waterfront       2391
view               63
sqft_basement     452
yr_renovated     3848
dtype: int64


There are a few features with missing values:
- _sqft_basement_:
  - the basement area can be calculated from the dataset as `sqft_living - sqft_above`
- _yr_renovated_:
  - not possible to impute
  - furthermore, most (~96%) of the values are `0`
  - the column will be dropped
- _view & waterfront_:
  - the original plan: find the nearest house and, if it is closer than 500 m, assume both houses have the same view/waterfront (this step is only drafted as commented-out code and is not applied)
  - otherwise, and in the cell as it currently runs, the missing value is imputed with the median view/waterfront of the zipcode
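
The two-step rule for _view & waterfront_ can be sketched as follows. This is a minimal illustration on toy data, not the notebook's implementation: the helper names `haversine_km` and `impute_from_neighbour` are hypothetical, and a plain haversine formula is swapped in for `geopy.distance` so that the block is self-contained and vectorised over all candidate houses.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees, scalars and arrays broadcast."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def impute_from_neighbour(data, feat, max_km=0.5):
    """Fill gaps in data[feat]: nearest known house if closer than max_km, else zipcode median."""
    known = data[data[feat].notna()]
    zip_median = known.groupby('zipcode')[feat].median()
    filled = data[feat].copy()
    for idx in data.index[data[feat].isna()]:
        dist = haversine_km(data.at[idx, 'lat'], data.at[idx, 'long'],
                            known['lat'].to_numpy(), known['long'].to_numpy())
        nearest = dist.argmin()  # position of the closest house with a known value
        if dist[nearest] < max_km:
            filled.loc[idx] = known[feat].iloc[nearest]
        else:
            filled.loc[idx] = zip_median.get(data.at[idx, 'zipcode'], np.nan)
    return filled


# toy data for illustration only: two houses have no 'view' value
houses = pd.DataFrame({
    'zipcode': [98001, 98001, 98001, 98002, 98002, 98002],
    'lat':     [47.500, 47.501, 47.600, 47.700, 47.750, 47.900],
    'long':    [-122.3, -122.3, -122.3, -122.2, -122.2, -122.2],
    'view':    [4.0, np.nan, 0.0, 2.0, np.nan, 0.0],
})
filled = impute_from_neighbour(houses, 'view')
print(filled)
```

In the toy data the first gap sits about 110 m from a house with a known value and copies it, while the second gap has no neighbour within 500 m and falls back to the median of its zipcode. Each missing row costs one vectorised distance computation against all known rows, which avoids a nested Python loop over every pair of houses.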


In [123]:
# --- hide code --- fix missing values ---
# basement
df['sqft_basement'] = df['sqft_living'] - df['sqft_above']

# yr_renovated
df.drop(columns=['yr_renovated'], inplace=True)

# view

# def get_nearest_feat(r, feat):
#     dis = df.apply(lambda x: distance((x['lat'], x['long']), (r['lat'], r['long'])).km, axis=1)

#     # d, i = [], []
#     # for index, row in df.iterrows():
#     #     d.append(distance((row['lat'], row['long']), (r['lat'], r['long'])).km)
#     #     i.append(index)
#     # tmp = pd.DataFrame(data = {'d': d, 'i': i})
#     tmp = dis.sort_values()

#     if tmp[1] < 1:
#         return 77
#         # return df.l[1]
#     else:
#         return 99

# y = df.apply(lambda x: get_nearest_feat(x, 'view') if np.isnan(x['view']) else x['view'], axis=1)

# x = df['view'].fillna(df['zipcode'].map(df.groupby('zipcode')['view'].median().to_dict()))
# nullseries = x.isnull().sum()
# print(nullseries[nullseries > 0])

# view
zip_view_dict = df.groupby('zipcode')['view'].median().to_dict()
df['view'] = df['view'].fillna(df['zipcode'].map(zip_view_dict))

# waterfront
zip_waterfront_dict = df.groupby('zipcode')['waterfront'].median().to_dict()
df['waterfront'] = df['waterfront'].fillna(df['zipcode'].map(zip_waterfront_dict))

# ----- check for missing values -----
print('missing values per feature:')
nullseries = df.isnull().sum()
print(nullseries[nullseries > 0])

missing values per feature:
Series([], dtype: int64)
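
The basement fix recomputes the column for every row. As a hedged sketch, the relation `sqft_living - sqft_above` could first be checked on the rows where the basement area is known, and then used to fill only the gaps. The toy dataframe and the names `derived`, `known` and `identity_holds` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# toy data for illustration only: the second row has no basement value
toy = pd.DataFrame({
    'sqft_living':   [1180.0, 2570.0, 770.0],
    'sqft_above':    [1180.0, 2170.0, 770.0],
    'sqft_basement': [0.0, np.nan, 0.0],
})

derived = toy['sqft_living'] - toy['sqft_above']
known = toy['sqft_basement'].notna()

# does the relation hold wherever the basement area is actually recorded?
identity_holds = np.isclose(toy.loc[known, 'sqft_basement'], derived[known]).all()
print('identity holds on known rows:', identity_holds)

# fill only the missing entries and keep the recorded ones
toy['sqft_basement'] = toy['sqft_basement'].fillna(derived)
print(toy)
```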


In [124]:
# --- hide code --- data types ---
# float -> int
df.price = df.price.astype('int64')
df.bedrooms = df.bedrooms.astype('int64')
df.waterfront = df.waterfront.astype('int64')
df.view = df.view.astype('int64')

# transform date column to datetime
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

## Add Features

Add new features to dataframe:
- transform square_feet to square_meters

- add feature `sqm_garden`, which estimates the size of the garden (`lot - (above/floors)`)
  - this assumes that the above-ground footage of the house is equally distributed across floors
  
- add new date features to be able to group by
  - month
  - week

In [125]:
# --- hide code --- add features ---
# sqft -> sqm
sqft2sqm(['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15'])

# add size of garden
df["sqm_garden"] = df["sqm_lot"] - (df["sqm_above"] / df["floors"])

# add date features
df["date_month"] = df["date"].dt.month
df["date_week"] = df["date"].dt.isocalendar().week
df['date_year_week'] = df["date"].dt.strftime('%Y-%W') # sortable year-week key; %W counts weeks from the first Monday of the year, so it can differ from the ISO week in date_week
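
The two week features use different conventions: `date_week` is the ISO week, while `date_year_week` is built with `%W`. If a key consistent with the ISO week were preferred, it could be assembled from `isocalendar()` as in the sketch below. The names `iso_key` and `strftime_key` and the sample dates are assumptions for illustration.

```python
import pandas as pd

# sample dates for illustration only
dates = pd.Series(pd.to_datetime(['2014-10-13', '2014-12-09', '2015-02-25', '2014-12-29']))

iso = dates.dt.isocalendar()  # dataframe with the columns year, week and day
iso_key = iso['year'].astype(str) + '-' + iso['week'].astype(str).str.zfill(2)

# the notebook's variant: calendar year plus %W week number
strftime_key = dates.dt.strftime('%Y-%W')

print(pd.DataFrame({'date': dates, 'iso_key': iso_key, 'strftime_key': strftime_key}))
```

The ISO year is taken from `isocalendar()` as well, because dates at the end of December can already belong to week 1 of the following ISO year.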

## Final Descriptive Inspection

After cleaning up, check the dataframe once more. No missing values remain and the data types are as expected.

In [126]:
# --- hide code --- inspect ---
print(df.info(), '\n')
df.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 23 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   date            21597 non-null  datetime64[ns]
 1   price           21597 non-null  int64         
 2   bedrooms        21597 non-null  int64         
 3   bathrooms       21597 non-null  float64       
 4   floors          21597 non-null  float64       
 5   waterfront      21597 non-null  int64         
 6   view            21597 non-null  int64         
 7   condition       21597 non-null  int64         
 8   grade           21597 non-null  int64         
 9   yr_built        21597 non-null  int64         
 10  zipcode         21597 non-null  int64         
 11  lat             21597 non-null  float64       
 12  long            21597 non-null  float64       
 13  sqm_living      21597 non-null  float64       
 14  sqm_lot         21597 non-null  float64       
 15  sq

Unnamed: 0,date,price,bedrooms,bathrooms,floors,waterfront,view,condition,grade,yr_built,...,sqm_living,sqm_lot,sqm_above,sqm_basement,sqm_living15,sqm_lot15,sqm_garden,date_month,date_week,date_year_week
0,2014-10-13,221900,3,1.0,1.0,0,0,3,7,1955,...,109.62,524.9,109.62,0.0,124.49,524.9,415.28,10,42,2014-41
1,2014-12-09,538000,3,2.25,2.0,0,0,3,7,1951,...,238.76,672.8,201.6,37.16,157.0,709.68,572.0,12,50,2014-49
2,2015-02-25,180000,2,1.0,1.0,0,0,3,6,1933,...,71.53,929.02,71.53,0.0,252.69,748.98,857.49,2,9,2015-08
