In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [19]:
# Read the space-delimited file; it has no header row, so column names are supplied here.
housing <- read_delim(
  "housing.csv",
  delim = " ",
  col_names = c("crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black", "lstat", "medv")
)
head(housing, 5)

Parsed with column specification:
cols(
  crim = col_character(),
  zn = col_character(),
  indus = col_character(),
  chas = col_character(),
  nox = col_character(),
  rm = col_character(),
  age = col_character(),
  dis = col_character(),
  rad = col_character(),
  tax = col_character(),
  ptratio = col_character(),
  black = col_character(),
  lstat = col_character(),
  medv = col_character()
)



crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2
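
Every column above was parsed as `<chr>`, so the values are text rather than numbers and cannot yet be used in correlations or regressions. Below is a minimal sketch of one way to convert them, written in base R and using the first two preview rows. The names `housing_chr` and `housing_num` are illustrative and do not appear in the notebook.

```r
# First two rows of the preview above, stored as text like the parsed data.
# Only three of the fourteen columns are shown to keep the example short.
housing_chr <- data.frame(
  crim = c("0.00632", "0.02731"),
  rm   = c("6.575", "6.421"),
  medv = c("24.0", "21.6"),
  stringsAsFactors = FALSE
)

# Convert every column from character to numeric, keeping the column names.
housing_num <- housing_chr
housing_num[] <- lapply(housing_chr, as.numeric)

# Each column should now be reported as "numeric".
sapply(housing_num, class)
```

On the real data the same idea could be applied to `housing` before any modelling. In tidyverse style this is often written as `mutate(across(everything(), as.numeric))`.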


In [22]:
# Drop chas and nox, which are excluded from this analysis.
housing_del_col <- select(housing, -c(chas, nox))

head(housing_del_col, 5)

crim,zn,indus,rm,age,dis,rad,tax,ptratio,black,lstat,medv
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
0.00632,18.0,2.31,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
0.02731,0.0,7.07,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
0.02729,0.0,7.07,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
0.03237,0.0,2.18,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
0.06905,0.0,2.18,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


**Boston Housing Analysis**

Relevant Background Information:

The Boston Housing Dataset describes residential areas in and around Boston, Massachusetts, using variables thought to affect housing prices there. The information was collected by the U.S. Census Service.

Below are all dataset columns:
- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000 sq. ft.
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (1 if tract bounds river; 0 otherwise); dropped from this analysis
- nox: nitric oxides concentration (parts per 10 million); dropped from this analysis
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted distances to five Boston employment centres
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per 10,000 dollars
- ptratio: pupil-teacher ratio by town
- black: 1000(Bk - 0.63)^2, where Bk is the proportion of African-Americans by town
- lstat: % lower status of the population
- medv: median value of owner-occupied homes in thousands of dollars

These columns are the candidate variables whose relationship to housing prices we examine; a correlation with price does not by itself imply causation.
The data set was obtained from the following website: https://www.kaggle.com/c/boston-housing

Question we intend to answer: Given a new house entered into the 1970s dataset, can we predict its valuation from the existing data? The valuation depends on several predictors, and we intend to determine which of them has the strongest correlation with the median house price (in thousands of dollars). From this we can also identify the most significant variable a prospective buyer should consider when purchasing a house in 1970s Boston.


Expected Outcomes:

We expect to develop a fairly accurate model for predicting the valuation of houses in Boston.

Such a model gives buyers ballpark prices for houses in Boston so that they do not overpay, and it could help prevent real estate agents from overcharging their clients. It would also help contractors decide whether to build more houses in a given area.

Strongest likely predictors: *per capita crime rate (crim), average number of rooms per dwelling (rm)*


Methodology:

We will determine which predictors correlate most strongly with medv (median value of owner-occupied homes in thousands of dollars) by fitting linear regressions and comparing their R² values. We will first create a correlation matrix of medv and the other variables (excluding nox and chas), then draw a scatter plot of medv against the variable with the strongest correlation.
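
The steps above can be sketched as follows. This is only an illustration under stated assumptions: it uses the five rows shown in the preview (far too few for real conclusions), three candidate predictors chosen for brevity, and the hypothetical names `preview`, `cor_matrix`, `cors`, `strongest`, and `fit`.

```r
# The five preview rows shown earlier, with three candidate predictors and medv.
preview <- data.frame(
  crim  = c(0.00632, 0.02731, 0.02729, 0.03237, 0.06905),
  rm    = c(6.575, 6.421, 7.185, 6.998, 7.147),
  lstat = c(4.98, 9.14, 4.03, 2.94, 5.33),
  medv  = c(24.0, 21.6, 34.7, 33.4, 36.2)
)

# Correlation matrix of all variables (Pearson by default).
cor_matrix <- cor(preview)

# Correlation of each predictor with medv, leaving out medv itself.
cors <- cor_matrix[rownames(cor_matrix) != "medv", "medv"]

# Predictor with the largest absolute correlation with medv.
strongest <- names(which.max(abs(cors)))

# Simple linear regression of medv on that predictor. With a single predictor,
# R-squared equals the squared correlation.
fit <- lm(reformulate(strongest, response = "medv"), data = preview)
summary(fit)$r.squared
```

With `fit` in hand, the scatter plot could be drawn with `plot(preview[[strongest]], preview$medv)` followed by `abline(fit)`. Ranking by absolute correlation treats strong negative relationships the same as strong positive ones, which matches ranking by R².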

To evaluate the model, we will fit it using only the strongest predictor variable and compare its predicted median values of owner-occupied homes (in thousands of dollars) with the actual values in the test set.
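
A rough sketch of that evaluation is given below. The text does not say how the data are split or which error measure is used, so the random split, the `train_prop` and `seed` parameters, the choice of root mean squared error (RMSE), and the helper name `evaluate_single_predictor` are all assumptions made for illustration. The five preview rows stand in for the full dataset, so the resulting number carries no real meaning.

```r
# The five preview rows shown earlier, used here as stand-in data.
preview <- data.frame(
  crim  = c(0.00632, 0.02731, 0.02729, 0.03237, 0.06905),
  rm    = c(6.575, 6.421, 7.185, 6.998, 7.147),
  lstat = c(4.98, 9.14, 4.03, 2.94, 5.33),
  medv  = c(24.0, 21.6, 34.7, 33.4, 36.2)
)

# Hypothetical helper: hold out part of the data, fit medv ~ predictor on the
# rest, and return the RMSE of the predictions on the held-out rows.
evaluate_single_predictor <- function(data, predictor, train_prop = 0.75, seed = 1) {
  set.seed(seed)  # fixed seed so the split is reproducible
  n <- nrow(data)
  train_idx <- sample(n, size = floor(train_prop * n))
  train <- data[train_idx, , drop = FALSE]
  test  <- data[-train_idx, , drop = FALSE]

  fit <- lm(reformulate(predictor, response = "medv"), data = train)
  predicted <- predict(fit, newdata = test)

  sqrt(mean((test$medv - predicted)^2))
}

# Example call using rm, one of the predictors expected to be strong.
rmse <- evaluate_single_predictor(preview, "rm", train_prop = 0.6)
rmse
```

Because medv is recorded in thousands of dollars, the RMSE is in the same units, which makes it easy to read as a typical prediction error.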
