# House Sales in King County, USA
----

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

**id**: Unique identifier for a house

**date**: Date the house was sold

**price**: Sale price (the prediction target)

**bedrooms**: Number of bedrooms in the house

**bathrooms**: Number of bathrooms in the house

**sqft_living**: Square footage of the home

**sqft_lot**: Square footage of the lot

**floors**: Total floors (levels) in the house

**waterfront**: Whether the house has a view of a waterfront

**view**: Has been viewed

**condition**: Overall condition of the house

**grade**: Overall grade given to the housing unit, based on the King County grading system

**sqft_above**: Square footage of the house, excluding the basement

**sqft_basement**: Square footage of the basement

**yr_built**: Year the house was built

**yr_renovated**: Year the house was renovated

**zipcode**: ZIP code

**lat**: Latitude coordinate

**long**: Longitude coordinate

**sqft_living15**: Living room area in 2015 (implies some renovations). This might or might not have affected the lot size area.

**sqft_lot15**: Lot size area in 2015 (implies some renovations)

In [None]:
# Dependencies and Setup
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import linregress
import gmaps
import os

# Importing Data

- Load and read csv file

In [None]:
# File to Load:
file_to_load='kc_house_data.csv' 

# Read kc_house_data csv file and store into data frame:
kc_house_data_df=pd.read_csv(file_to_load)
kc_house_data_df.head()

# Slicing Data
- Only display in dataframe columns: 'id', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'waterfront', 'yr_built', 'zipcode', 'lat', 'long'

In [None]:
# Columns to display
house_data_df = kc_house_data_df[['id', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 
                                  'waterfront', 'yr_built', 'zipcode', 'lat', 'long']]

# Formatting
#house_data_df.style.format({"price": "${:,.2f}", "bathrooms":"{:.2f}"})

house_data_df.head()



# Statistical Summary
- Obtain a statistical summary of the dataframe

In [None]:
house_data_df.describe()

In [None]:
house_data_df.dtypes

# 1. What’s the average number of bedrooms and bathrooms in a house, and is the price higher when the house has more bedrooms or bathrooms?

In [None]:
# Calculate the average number of bedrooms
avg_bedroom = house_data_df['bedrooms'].mean()
print(f"The average number of bedrooms in a house is {round(avg_bedroom,0)}")


# Calculate the average number of bathrooms
avg_bathroom = house_data_df['bathrooms'].mean()
print(f"The average number of bathrooms in a house is {round(avg_bathroom,2)}")

Scatter plots with linear regression and r-squared value (bedrooms vs. price and bathrooms vs. price) 

In [None]:
# Build scatter plot for each data type:
plt.figure(figsize=(9,6))
x_values = house_data_df['bedrooms']
y_values = house_data_df['price']

# Perform a linear regression on bedrooms vs. price:
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)

# Get regression values:
regress_values = x_values * slope + intercept

# Create line equation string:
line_eq = 'y = ' + str(round(slope,2)) + 'x + ' + str(round(intercept,2))

# Create plot:
plt.scatter(x_values,y_values, marker='o', color='royalblue', s=[70], edgecolors='black')
plt.plot(x_values,regress_values,'darkred', linewidth=2)

# Incorporate the other graph properties:
plt.title('bedrooms vs. price', fontsize=20)
plt.ylabel('price', fontsize=16, color='black')
plt.xlabel('bedrooms', fontsize=16, color='black')
plt.annotate(line_eq,(15,8.000000e+04), fontsize=18, color='darkred')
plt.ticklabel_format(style='plain')
#plt.grid(False)
    
# Format the labels on y-axis with dollar sign
current_values = plt.gca().get_yticks()
plt.gca().set_yticklabels(['${:,.0f}'.format(x) for x in current_values])


# Print r-squared value (rvalue is the correlation coefficient r, so square it):
print(f'The r-squared is: {rvalue**2}')

# Save the figure:
# plt.savefig('output_data/Bedrooms vs. Price.png')

# Show plot:
plt.show()
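One detail worth checking when reporting fit quality: `linregress` returns the correlation coefficient r as `rvalue`, so r-squared is that value squared. The sketch below illustrates this on made-up numbers; the `beds` and `prices` arrays are illustrative assumptions, not values from the dataset.

```python
import numpy as np
from scipy.stats import linregress

# Illustrative data only: price rises exactly 100,000 per bedroom
beds = np.array([1, 2, 3, 4, 5], dtype=float)
prices = 100_000.0 * beds + 50_000.0

result = linregress(beds, prices)

# rvalue is the Pearson correlation coefficient r; square it for r-squared
r_squared = result.rvalue ** 2

print(f"slope = {result.slope:,.0f}, intercept = {result.intercept:,.0f}")
print(f"r = {result.rvalue:.3f}, r-squared = {r_squared:.3f}")
```

On real data r-squared is smaller in magnitude than r whenever the fit is imperfect, so the two should not be reported interchangeably.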

In [None]:
# Amount of Bedrooms Histogram
house_data_df["bedrooms"].hist(bins=25)
plt.ylabel('count', fontsize=12, color='black')
plt.xlabel('bedrooms', fontsize=12, color='black')
plt.title("Amount of Bedrooms")

# Save the figure:
plt.savefig('output_data/Amount of Bedrooms Histogram.png')

plt.show()

In [None]:
# Find the average price by number of bedrooms
# (select the price column before aggregating so only that column is averaged)
bedroom_mean = house_data_df.groupby('bedrooms')['price'].mean()
#bedroom_mean

# Change series to dataframe
bedroom_df = bedroom_mean.to_frame(name='price')

# Formatting
#bedroom_df.style.format({"price": "${:,.2f}"})

bedroom_df

In [None]:
# Average Price by Bedrooms Bar Graph
bedroom_df.plot.bar(rot=0)
plt.gcf().axes[0].yaxis.get_major_formatter().set_scientific(False)
#plt.xlim([-0.5, 10.5])
plt.ylabel('average price', fontsize=12, color='black')
plt.xlabel('bedrooms', fontsize=12, color='black')
plt.title("Average Price by Bedrooms")
plt.grid()

# Format the labels on y-axis to currency
current_values = plt.gca().get_yticks()
plt.gca().set_yticklabels(['${:,.0f}'.format(x) for x in current_values])

# Remove legend
plt.gca().get_legend().remove()

# Save the figure:
plt.savefig('output_data/Average Price by Bedrooms Bar Graph.png')

plt.show()

In [None]:
# Build scatter plot for each data type:
plt.figure(figsize=(9,6))
x_values = house_data_df['bathrooms']
y_values = house_data_df['price']

# Perform a linear regression on bathrooms vs. price:
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)

# Get regression values:
regress_values = x_values * slope + intercept

# Create line equation string:
line_eq = 'y = ' + str(round(slope,2)) + 'x + ' + str(round(intercept,2))

# Create plot:
plt.scatter(x_values,y_values, marker='o', color='royalblue', s=[70], edgecolors='black')
plt.plot(x_values,regress_values,'darkred', linewidth=2)

# Incorporate the other graph properties:
plt.title('bathrooms vs. price', fontsize=20)
plt.ylabel('price', fontsize=16, color='black')
plt.xlabel('bathrooms', fontsize=16, color='black')
plt.annotate(line_eq,(2,7.000000e+06), fontsize=18, color='darkred')
#plt.grid(False)

# Format the labels on y-axis with dollar sign
current_values = plt.gca().get_yticks()
plt.gca().set_yticklabels(['${:,.0f}'.format(x) for x in current_values])

# Print r-squared value (rvalue is the correlation coefficient r, so square it):
print(f'The r-squared is: {rvalue**2}')

# Save the figure:
# plt.savefig('output_data/Bathrooms vs. Price.png')

# Show plot:
plt.show()

In [None]:
# Amount of Bathrooms Histogram
house_data_df["bathrooms"].hist(bins=25)
plt.ylabel('count', fontsize=12, color='black')
plt.xlabel('bathrooms', fontsize=12, color='black')
plt.title("Amount of Bathrooms")

# Save the figure:
plt.savefig('output_data/Amount of Bathrooms Histogram.png')

plt.show()


In [None]:
# Find the average price by number of bathrooms
# (select the price column before aggregating so only that column is averaged)
bathroom_mean = house_data_df.groupby('bathrooms')['price'].mean()
#bathroom_mean

# Change series to dataframe
bathroom_df = bathroom_mean.to_frame(name='price')

# Formatting
#bathroom_df.style.format({"price": "${:,.2f}"})

bathroom_df

In [None]:
# Average Price by Bathrooms Bar Graph
bathroom_df.plot.bar(rot=90)
plt.gcf().axes[0].yaxis.get_major_formatter().set_scientific(False)

plt.ylabel('average price', fontsize=12, color='black')
plt.xlabel('bathrooms', fontsize=12, color='black')
plt.title("Average Price by Bathrooms")
plt.grid()

# Format the labels on y-axis to currency
current_values = plt.gca().get_yticks()
plt.gca().set_yticklabels(['${:,.0f}'.format(x) for x in current_values])

# Remove legend
plt.gca().get_legend().remove()

# Save the figure:
plt.savefig('output_data/Average Price by Bathrooms Bar Graph.png')

plt.show()

# 2. What’s the correlation between sqft_living and price, and do homes with more than 6,000 sqft of living space and a waterfront view cost more or less than those without a waterfront view?

Scatter plot with linear regression and r-squared value (sqft_living vs. price)
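As a rough sketch of the correlation step, pandas can compute the Pearson coefficient between two columns directly. The `sample_df` rows below are hypothetical stand-ins for `house_data_df`, used only so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical rows standing in for house_data_df
sample_df = pd.DataFrame({
    "sqft_living": [1000, 1500, 2000, 2500, 3000],
    "price": [300_000, 420_000, 610_000, 700_000, 950_000],
})

# Pearson correlation coefficient between living area and price
r = sample_df["sqft_living"].corr(sample_df["price"])

print(f"r = {r:.3f}, r-squared = {r ** 2:.3f}")
```

For a single predictor, the square of this coefficient matches the r-squared of the fitted regression line.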

Dataframe of homes with sqft_living greater than 6,000 sqft, with price columns for homes with and without a waterfront view

Scatter plots with linear regression and r-squared value (sqft_living greater than 6,000 sqft vs. price, with and without a waterfront view)
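One possible way to set up this comparison is to filter on `sqft_living` and then group by the `waterfront` flag. This is a minimal sketch on hypothetical rows (`sample_df` is an assumption standing in for `house_data_df`), and the median is just one reasonable summary statistic.

```python
import pandas as pd

# Hypothetical rows standing in for house_data_df; only the columns used here
sample_df = pd.DataFrame({
    "sqft_living": [6500, 7200, 6100, 8000, 3000],
    "waterfront": [1, 0, 0, 1, 0],
    "price": [3_000_000, 1_800_000, 1_500_000, 4_000_000, 600_000],
})

# Keep only homes with more than 6,000 sqft of living space
large_homes = sample_df[sample_df["sqft_living"] > 6000]

# Row count and median price for each waterfront flag (0 = no, 1 = yes)
summary = large_homes.groupby("waterfront")["price"].agg(["count", "median"])
print(summary)
```

Reporting the count alongside the median helps show how few homes fall in each group, which matters when one group is very small.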

# 3. Are house sale prices higher in higher income neighborhoods?

Scatter plot with linear regression and r-squared value (zipcode vs. price) 

Dataframe for the 10 most expensive houses by neighborhood zipcode, including the lat, long, id, price, and zipcode columns

Plot markers for the top 10 most expensive houses by neighborhood zipcode, using lat and long, on a map with pins containing the house id, price, and zipcode
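A minimal sketch of the dataframe step, assuming "most expensive" means the highest-priced rows overall. The `sample_df` rows and the `top_n` variable are hypothetical; the per-zipcode median is an added illustration of one way to compare neighborhoods, not part of the original plan.

```python
import pandas as pd

# Hypothetical rows standing in for house_data_df
sample_df = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "price": [750_000, 2_300_000, 1_100_000, 480_000],
    "zipcode": [98004, 98039, 98004, 98178],
    "lat": [47.61, 47.63, 47.62, 47.50],
    "long": [-122.20, -122.23, -122.21, -122.25],
})

# The plan calls for 10 houses; 2 keeps the toy output short
top_n = 2

# nlargest returns the highest-priced rows, sorted from most to least expensive
top_houses = sample_df.nlargest(top_n, "price")[["id", "price", "zipcode", "lat", "long"]]
print(top_houses)

# Median price per zipcode, highest first, as one way to compare neighborhoods
zip_median = sample_df.groupby("zipcode")["price"].median().sort_values(ascending=False)
print(zip_median)
```

The resulting `lat` and `long` columns can then be passed to whichever mapping library is used for the markers.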

# 4. Do houses with a waterfront view or without a waterfront view have more price outliers?

Waterfront vs. Non_waterfront house prices analysis using pie and bar chart 

In [None]:
# Non_waterfront only
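# .copy() gives an independent dataframe, so adding columns later does not trigger SettingWithCopyWarning
non_wf=house_data_df.loc[house_data_df["waterfront"]==0].copy()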
non_wf
non_wf.count()["id"]


In [None]:
# waterfront properties only
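# .copy() gives an independent dataframe, so adding columns later does not trigger SettingWithCopyWarning
wf=house_data_df.loc[house_data_df["waterfront"]==1].copy()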
wf
wf.count()["id"]

Pie chart to show waterfront vs non-waterfront properties

In [None]:
labels = ["Waterfront", "Non-waterfront"]

# The values of each section of the pie chart, taken from the waterfront and non-waterfront dataframes
sizes = [len(wf), len(non_wf)]

# The colors of each section of the pie chart
colors = ["red", "lightblue"]


explode = (0.5, 0)

plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct="%1.1f%%", shadow=True, startangle=150)
plt.savefig('output_data/ waterfront vs nonwaterfront piechart.png')
plt.show()

In [None]:
# The last edge is open-ended (np.inf) so any home above 7,000 sqft still falls in ">6000"
bins = [0,1000,2000,3000,4000,5000,6000,np.inf]
group_label=["<1000", "1000-2000", "2000-3000", "3000-4000","4000-5000","5000-6000",">6000"]
pd.cut(wf["sqft_living"], bins, labels=group_label).head()
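A point worth knowing about `pd.cut`: any value outside the outermost bin edges is left as NaN, so an open-ended top bucket such as ">6000" needs an infinite last edge. The sketch below contrasts the two behaviours on a few made-up square footages (the `sqft` series is illustrative only).

```python
import numpy as np
import pandas as pd

# Made-up living areas; 9000 lies above the 7000 edge
sqft = pd.Series([800, 2500, 6500, 9000])
labels = ["<1000", "1000-2000", "2000-3000", "3000-4000",
          "4000-5000", "5000-6000", ">6000"]

# Closed top edge: 9000 falls outside every bin and is left as NaN
closed = pd.cut(sqft, [0, 1000, 2000, 3000, 4000, 5000, 6000, 7000], labels=labels)

# Open top edge: np.inf lets the last bucket absorb every larger value
open_ended = pd.cut(sqft, [0, 1000, 2000, 3000, 4000, 5000, 6000, np.inf], labels=labels)

print(pd.DataFrame({"sqft_living": sqft, "closed": closed, "open_ended": open_ended}))
```

Rows that end up as NaN are silently dropped by a later `groupby`, so they would be missing from any per-range summary.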

In [None]:
wf["living_sqft_range"]=pd.cut(wf["sqft_living"], bins, labels=group_label)
wfrange=wf.groupby(["living_sqft_range"])
# Select several columns with a list; tuple-style selection on a groupby fails in recent pandas
wfrange_df=wfrange[['price','bedrooms','bathrooms']].median()
print(wfrange_df)

In [None]:
# The last edge is open-ended (np.inf) so any home above 7,000 sqft still falls in ">6000"
bins = [0,1000,2000,3000,4000,5000,6000,np.inf]
group_label=["<1000", "1000-2000", "2000-3000", "3000-4000","4000-5000","5000-6000",">6000"]
pd.cut(non_wf["sqft_living"], bins, labels=group_label).head()

In [None]:
non_wf["living_sqft_range"]=pd.cut(non_wf["sqft_living"], bins, labels=group_label)
non_wfrange=non_wf.groupby(["living_sqft_range"])
# Select several columns with a list; tuple-style selection on a groupby fails in recent pandas
non_wfrange_df=non_wfrange[['price','bedrooms','bathrooms']].median()
print(non_wfrange_df)

In [None]:
wf_vs_non=pd.merge(wfrange_df,non_wfrange_df, how="left", on = ["living_sqft_range"])

wfnon_df = wf_vs_non.rename(columns={"price_x":"price_wf",
                                    "bedrooms_x":"bedrooms_wf",
                                    "bathrooms_x":"bathrooms_wf",
                                    "price_y":"price_nonwf",
                                    "bedrooms_y":"bedrooms_nonwf",
                                    "bathrooms_y": "bathrooms_nonwf"})

wfnon_df

In [None]:
wfnon_df=wfnon_df.reset_index(level=0)

In [None]:
wfnon_df

In [None]:
pandas_bar=wfnon_df
pandas_bar.plot(kind="bar",x="living_sqft_range", y=["price_wf","price_nonwf"],rot=45)

plt.xlabel("Living Sqft Range")
plt.ylabel("Prices (Millions)")
plt.title("Waterfront vs. Non-Waterfront")
plt.savefig('output_data/ waterfront vs nonwf price barchart.png')
plt.show()

Scatter plot with linear regression and r-squared value (waterfront bedrooms vs. price) 

In [None]:
# Build scatter plot for each data type:
plt.figure(figsize=(9,6))
x_values = wf['bedrooms']
y_values = wf['price']

# Perform a linear regression 
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)

# Get regression values:
regress_values = x_values * slope + intercept

# Create line equation string:
line_eq = 'y = ' + str(round(slope,2)) + 'x + ' + str(round(intercept,2))

# Create plot:
plt.scatter(x_values,y_values, marker='o', color='royalblue', s=[70], edgecolors='black')
plt.plot(x_values,regress_values,'darkred', linewidth=2)

# Incorporate the other graph properties:
plt.title('Waterfront bedrooms vs. prices', fontsize=20)
plt.ylabel('price (in millions)', fontsize=16, color='black')
plt.xlabel('bedrooms', fontsize=16, color='black')
plt.annotate(line_eq,(3,4000000), fontsize=18, color='darkred')
#plt.grid(False)


# Print r-squared value (rvalue is the correlation coefficient r, so square it):
print(f'The r-squared is: {rvalue**2}')

# Save the figure:
plt.savefig('output_data/Waterfront Bedrooms vs. Price.png')

# Show plot:
plt.show()

Boxplot (waterfront & price)

In [None]:
view_df=wf[["price"]]

flierprops = dict(marker='o', markerfacecolor='r', markersize=8, markeredgecolor='black')

view_df.boxplot(flierprops=flierprops)
plt.ylabel("Price (in Millions)")

plt.xticks([1],['waterfront'])
plt.ylim([0,6000000])
plt.savefig('output_data/ Waterfront vs. Price boxplot.png')
plt.show()
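To put a number on the outlier question, one option is to count the points a boxplot would draw as fliers. By default matplotlib marks values more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule per group; the helper `count_iqr_outliers` and the `sample_df` rows are hypothetical, introduced only for illustration.

```python
import pandas as pd

def count_iqr_outliers(prices: pd.Series) -> int:
    """Count values lying more than 1.5 * IQR beyond the first or third quartile."""
    q1 = prices.quantile(0.25)
    q3 = prices.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((prices < lower) | (prices > upper)).sum())

# Hypothetical prices standing in for house_data_df
sample_df = pd.DataFrame({
    "waterfront": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "price": [1.0e6, 1.1e6, 1.2e6, 1.3e6, 9.0e6,
              4.0e5, 4.5e5, 5.0e5, 5.5e5, 6.0e5],
})

# Apply the rule separately to waterfront (1) and non-waterfront (0) homes
outlier_counts = sample_df.groupby("waterfront")["price"].apply(count_iqr_outliers)
print(outlier_counts)
```

Because the two groups differ greatly in size, comparing the share of outliers in each group may be more informative than comparing raw counts.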

# 5. Do newly built homes cost more than older homes, and in which zipcodes are most of the newly built homes located?

Scatter plot with linear regression and r-squared value (yr_built vs. price) 

Dataframe for the 6 most recently built houses, including the lat, long, id, price, and yr_built columns

Plot markers for the 6 most recently built houses, using lat and long, on a map with pins containing the house id, price, and yr_built
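A minimal sketch of how the newest homes and their zipcodes might be pulled out. The `sample_df` rows are hypothetical, and `recent_cutoff` is an assumed threshold for "newly built" that the notebook itself does not define.

```python
import pandas as pd

# Hypothetical rows standing in for house_data_df
sample_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "price": [500_000, 650_000, 700_000, 450_000, 900_000],
    "yr_built": [1955, 2014, 2015, 1990, 2014],
    "zipcode": [98103, 98052, 98052, 98103, 98033],
})

# The plan calls for the 6 newest houses; 3 keeps the toy output short
newest = sample_df.nlargest(3, "yr_built")[["id", "price", "yr_built", "zipcode"]]
print(newest)

# Assumed cutoff year for "newly built"; adjust to suit the analysis
recent_cutoff = 2010

# Count recently built homes per zipcode, most common first
recent_by_zip = sample_df.loc[sample_df["yr_built"] >= recent_cutoff, "zipcode"].value_counts()
print(recent_by_zip)
```

Many houses share the same `yr_built`, so a cutoff year with `value_counts` may answer the zipcode question more directly than a fixed top-6 list.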