# Princeton House Prices and Ethnicity
By Brandon Feder and Atticus Wang

# Introduction

Houses in Princeton vary widely in size, location, age, and many other characteristics, and their prices span a correspondingly wide range. But how do the characteristics of a house determine its price? This is our research question: **"How do factors such as the number of bedrooms, number of bathrooms, location, and year built affect the price of a house?"**

Beatrice Bloom, a Princeton Residential Specialist, provides many resources about the Princeton housing market, including a table of houses sold in Princeton since 2011. This data can be found [here](https://www.realestate-princeton.com/market-analysis/princeton-pending/). We intend to use this data to answer our question.

We also hope to use [US census data](https://www.census.gov/data.html) about the income, ethnicity, and age of people in different areas of Princeton to help in our prediction.

Based on our own experience, we predict that neighborhood/address and year built have a significant impact on the price of a house. In addition, we believe that the ethnicity and income of the residents of the neighborhood in which a house is located will be good predictors of its price.

# Analysis Plan

### Variables
The response variable in our analysis will be the price of a house in dollars.

The predictors will include location, number of bedrooms, number of full bathrooms, number of half bathrooms, style, year built, lot size, and previous selling price.

Other relevant variables include the number of days on the market and demographic data about the part of town in which a house is located (such as age, race/ethnicity, and income).

### Analysis Plan

First, we will tidy the Princeton real estate market dataset and extract US census data on age, race/ethnicity, income, and other variables for the Princeton area at the census-block level. We will then analyze the correlation between house prices and resident race and ethnicity, and potentially other correlations.
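As a rough illustration of the correlation step, the sketch below computes a Pearson correlation between a block-level census variable and the mean sale price per block. All values are randomly generated, and the `blocks` data frame and its column names (`block_id`, `median_income`, `mean_price`) are hypothetical placeholders, since the census data has not been obtained yet. Any other block-level census variable could take the place of `median_income`.

```r
set.seed(2)  # make the made-up data reproducible

# Made-up census-block summaries; the column names are hypothetical.
blocks <- data.frame(
  block_id      = sprintf("B%02d", 1:30),
  median_income = round(runif(30, 60000, 250000))
)

# Made-up mean sale price per block, loosely tied to income for illustration only.
blocks$mean_price <- 4 * blocks$median_income + rnorm(30, sd = 100000)

# Pearson correlation between the census variable and the mean sale price.
r <- cor(blocks$median_income, blocks$mean_price)
print(r)

# cor.test() reports a confidence interval and p-value for the same quantity.
ct <- cor.test(blocks$median_income, blocks$mean_price)
print(ct$conf.int)
```

With real data, `blocks` would come from joining the per-block census table to house prices averaged by block.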

Next, we will build a linear model on part of the dataset, with house price as the response variable. Because there are many potential predictors, we will use R's `step()` function to select the best model. Then, for the rest of the dataset, we will predict house prices using the chosen predictors and compare our predictions with the actual prices. Finally, we will study where and why the model's predictions differ from the real data, and whether house prices show temporal trends.
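The modeling plan could look roughly like the sketch below. It runs on synthetic data so that it is self-contained: the `houses` data frame, its coefficients, the 70/30 train/test split, backward selection, and RMSE as the comparison metric are all assumptions made for illustration, not choices fixed by the plan above.

```r
set.seed(1)  # make the random data and split reproducible

# Synthetic stand-in for the cleaned housing data (all values are made up).
n <- 200
houses <- data.frame(
  Bed.Rooms  = sample(1:6, n, replace = TRUE),
  Full.Baths = sample(1:4, n, replace = TRUE),
  Half.Baths = sample(0:2, n, replace = TRUE),
  Year.Built = sample(1900:2020, n, replace = TRUE)
)
houses$Price <- 100000 + 120000 * houses$Bed.Rooms +
  80000 * houses$Full.Baths + rnorm(n, sd = 50000)

# Hold out 30% of the rows for testing (the split ratio is an assumption).
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- houses[train_idx, ]
test  <- houses[-train_idx, ]

# Start from the full model and let step() drop predictors by AIC.
full_model <- lm(Price ~ ., data = train)
best_model <- step(full_model, direction = "backward", trace = 0)

# Compare predictions with the held-out prices using root mean squared error.
pred <- predict(best_model, newdata = test)
rmse <- sqrt(mean((test$Price - pred)^2))
print(formula(best_model))
print(rmse)
```

Inspecting `test$Price - pred` row by row is one way to start looking at where the model and the real prices disagree.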

### Preliminary Analysis


#### Load required libraries

In [4]:
install.packages("tidyverse")
library(tidyverse)

Installing package into ‘/mnt/MainStorage/bfeder/R/x86_64-pc-linux-gnu-library/4.0’
(as ‘lib’ is unspecified)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.7
✔ tidyr   1.1.4     ✔ stringr 1.4.0
✔ readr   2.1.1     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



#### Load raw housing data

In [5]:
house <- read.csv("./data/pton-market-data.csv")

#### Format price column

In [6]:
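# Strip the leading "$", the trailing ".00", and the thousands separators, then convert to integer
house <- house %>%
  mutate(Price = strtoi(str_replace_all(str_sub(Sold.Price, 2, -4), ",", ""), base = 10L))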
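To illustrate what this conversion does, here is a small sketch on made-up price strings in the same `$320,000.00` format seen in the data. It uses base R's `gsub()` and `as.numeric()` rather than the `stringr` calls in the cell above. This is an alternative approach, not the one used in this notebook, and it does not rely on every value ending in exactly `.00`. The helper name `parse_price` is hypothetical.

```r
# Hypothetical helper: convert strings such as "$320,000.00" to numeric dollars.
parse_price <- function(x) {
  # Drop the dollar sign and the thousands separators, then convert to numeric.
  as.numeric(gsub("[$,]", "", x))
}

# Made-up sample values in the same format as the Sold.Price column.
sample_prices <- c("$320,000.00", "$1,250,000.00", NA)
parsed <- parse_price(sample_prices)
print(parsed)
```

Missing values stay `NA`, so later summaries can still use `na.rm = TRUE`.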

#### Calculate top 10 most expensive neighborhoods (on average)

In [7]:
house %>%
  group_by(Neighborhood) %>%
  summarise(meanPrice = mean(Price, na.rm = TRUE)) %>%
  arrange(desc(meanPrice)) %>%
  head(10)

| Neighborhood | meanPrice |
|---|---|
| `<chr>` | `<dbl>` |
| The Preserve | 2150000.0 |
| Carnegie Lake | 1645562.5 |
| Battelfield Area | 1350950.0 |
| Institute | 1338176.6 |
| Pretty Brook Area | 1312639.5 |
| princeton Ridge | 970000.0 |
| Western Section | 897657.4 |
| The Glen | 883113.5 |
| Hun Area | 862042.5 |
| Institute Area | 831696.7 |
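A group mean can be dominated by a single sale when a group contains very few houses. The sketch below, on made-up data, reports the number of sales next to each mean and keeps only groups with at least two priced sales. The `sales` data frame, the placeholder neighborhood names, and the threshold of two are assumptions for illustration and are not part of the analysis above.

```r
library(dplyr)

# Made-up sales; the neighborhood names are placeholders.
sales <- data.frame(
  Neighborhood = c("A", "A", "A", "B", "C", "C", "C"),
  Price        = c(500000, 700000, 600000, 2000000, 900000, 1100000, NA)
)

summary_tbl <- sales %>%
  group_by(Neighborhood) %>%
  summarise(
    meanPrice = mean(Price, na.rm = TRUE),  # mean over non-missing prices
    n         = sum(!is.na(Price))          # number of priced sales
  ) %>%
  filter(n >= 2) %>%                        # drop groups with a single priced sale
  arrange(desc(meanPrice))

print(summary_tbl)
```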


#### Calculate top 10 most expensive streets (on average)

In [8]:
house %>%
  group_by(Address) %>%
  summarise(meanPrice = mean(Price, na.rm = TRUE)) %>%
  arrange(desc(meanPrice)) %>%
  head(10)

| Address | meanPrice |
|---|---|
| `<chr>` | `<dbl>` |
| Garrett Ln | 2695000 |
| Pheasant Hil Rd | 2610000 |
| Libary Pl | 2476938 |
| Fredrick Ct | 2290000 |
| Bogart Ct | 2213125 |
| Cradle Rock Rd | 2138000 |
| Morven Pl | 2042000 |
| Grasmere Way | 2031250 |
| Running Cedar Rd | 1967458 |
| Province Line | 1950000 |


# Data
Here is a summary of the house price data. We have not yet been able to obtain the US census data due to an issue with our API key.

In [10]:
dplyr::glimpse(house)

Rows: 3,331
Columns: 16
$ X                         <chr> "49-F", "44-H", "218", "12", "93-95", "58", …
$ Address                   <chr> "Palmer Sq", "Nassau St", "Birch Ave", "Birc…
$ Neighborhood              <chr> "Princeton Center", "Princeton Center", "Pri…
$ Bed.Rooms                 <int> 0, 0, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3,…
$ Full.Baths                <int> 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2,…
$ Half.Baths                <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,…
$ Style                     <chr> "Flat", "Flat", "Twin", "Twin", "Bungalow", …
$ Year.Built                <chr> "1932", "1932", "1929", NA, "1940", "1900", …
$ Lot.Size                  <chr> NA, NA, NA, "0.04", "0.07", "0.08", "0.11", …
$ Original.Price            <chr> "$320,000.00", "$369,000.00", "$

# References
Beatrice Bloom website: https://www.realestate-princeton.com/market-analysis/princeton-pending/

CRAN `tidycensus` package: https://walker-data.com/tidycensus/

US census data website: https://www.census.gov/data.html