# Data Analysis Report for Advanced Models: Modeling Housing Prices


In [None]:
# Run this code to load the required packages
suppressMessages(suppressWarnings(suppressPackageStartupMessages({
  library(coursekata)
})))

homes <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRMLCEGy4pTxlu28UBHHVVwXmmdA4vP5Jbd1USFVzpuyVlBcbG_TW65zO5MVtG6MnTN95sfEzD0e4yk/pub?gid=129450771&single=true&output=csv")
#homes_test <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRJFz-5HKKP_1IFmW1kWWE5vA73a4wB-ujgUqpLf7cmbiW-2RFAQeILAqwCLBKlNjU0WfrgxEnT7wv2/pub?gid=977580172&single=true&output=csv")


<div class="alert alert-block alert-info">

## Link to Guide and Rubric

The **Guide** answers the question "What is a data analysis report?" and the **Rubric** can be used to help you write a good data analysis report.
https://docs.google.com/document/d/1Ghk8HJyC0L15lwDZK8Nyrpj-t8tQpvKjUKNSiPMM7us/edit?usp=sharing

</div>

<div class="alert alert-block alert-info">

## Your Goal

A home is often the largest and most expensive purchase a person will make. Yet homes don't come with a price tag, so it is hard to judge their true value. Real estate websites such as Zillow, Redfin, or Trulia often use statistical models to help people estimate how much a home is worth.

In this project, you will develop an advanced model that predicts the sale prices of homes from a training data set called `homes`, which contains 1460 homes. However, your model will be compared against other students' models on `homes_test` (a fresh set of data from 1459 homes). Your goal is to make accurate predictions not just in your training data but also in the testing data. So think hard about whether each variable or polynomial term you add reflects a real feature of the DGP (the data generating process) rather than a quirk of the training data.

## Data

The dataset you will be working with describes 1460 residential home sales in Ames, Iowa from 2006 to 2010, as reported by the Ames City Assessor's Office. Ames is located about 30 miles north of Des Moines (the state capital) and is home to Iowa State University (the largest university in the state). Each row (observation) represents the latest sale of a home, with one row per home in the dataset. Columns represent home features and the sale price (the outcome). More information on the specific variables in this data set can be found in [this document](https://drive.google.com/file/d/1ks61Le4HFRyYN9cYhfWg2J5fgSxvVVmZ/view?usp=sharing).


## Instructions

Your task is to use R to explore variation in the data, model the variation, evaluate your models, and then write up your methods and findings in a report for a real estate website (a client like Zillow or Redfin). Your complete data analysis report will have 5 sections: Introduction, Explore Variation, Model Variation, Evaluate Models, Discussion/Conclusions.

It is up to you to decide what kind of advanced model you would like to make and what explanatory variables you will include in your model. Your goal is to create a model that is able to explain the variation in sale price and thus makes better predictions of sale price. The top 10 models from the class (measured by the F statistic and/or the rationale for the model) will receive extra credit. Good luck!

</div>
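One way to check whether a model will hold up on `homes_test` is to hold out part of the training data and compare prediction error on the rows the model did not see. The sketch below illustrates the idea in base R on simulated data; the data frame `toy`, its columns `area` and `price`, and the helper `rmse()` are hypothetical stand-ins, not part of `homes`.

```r
# Toy data standing in for `homes`; column names and values are hypothetical.
set.seed(1)
n <- 200
toy <- data.frame(area = runif(n, 800, 3000))
toy$price <- 50000 + 80 * toy$area + rnorm(n, sd = 20000)

# Hold out 25% of the rows to mimic an unseen test set.
holdout_rows <- sample(n, size = n / 4)
train <- toy[-holdout_rows, ]
holdout <- toy[holdout_rows, ]

# Fit a simple model and a very flexible polynomial model on the training rows only.
simple_model <- lm(price ~ area, data = train)
flexible_model <- lm(price ~ poly(area, 10), data = train)

# Root mean squared error of a model's predictions on a given data frame.
rmse <- function(model, data) {
  sqrt(mean((data$price - predict(model, newdata = data))^2))
}

# Compare error on the rows used for fitting with error on the held-out rows.
rmse_table <- data.frame(
  model = c("simple", "flexible"),
  train_rmse = c(rmse(simple_model, train), rmse(flexible_model, train)),
  holdout_rmse = c(rmse(simple_model, holdout), rmse(flexible_model, holdout))
)
print(rmse_table)
```

The flexible model can never fit the training rows worse than the simple one, because the simple model is nested inside it. The held-out error is the fairer guide to how each model may perform on new homes.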

In [None]:
str(homes)

## Intro/Overview of the Problem or Question

<div class="alert alert-block alert-info">

The goal of this section is to provide an overview of the context, situation, or problem.

A good introduction section typically includes the following topics (but not necessarily in this order): 

- A description of the question or problem you are investigating and why this question is important

- A description of the data you will use in your investigation, such as:
  - where the data came from 
  - why and how it was collected
  - what cases and variables are included

- Your initial hypothesis (perhaps also stated as a word equation), specifying outcome and explanatory variables, and why you think your hypothesis is plausible

</div>

## Explore Variation

<div class="alert alert-block alert-info">

The goal of this section is to explore variation in your explanatory and outcome variables. That exploration will almost certainly include visual displays of your data.

A good exploring variation section typically includes the following topics (but not necessarily in this order): 

- A description of how you cleaned and prepared your data and why, such as: 
  - filtering cases 
  - handling missing data 
  - recoding or creating new variables

- Visualizations or tables to explore the distributions of relevant variables and hypothesized relationships among variables

- Descriptions of the visualizations or tables, and explanations of how they relate to the hypotheses or research questions

</div>
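The cleaning and exploration steps listed above might look something like the following sketch. It uses base R on simulated data so that it runs on its own; the data frame `toy` and the columns `price`, `area`, and `garage` are hypothetical. It also assumes that a missing `garage` value means the home has no garage, which would need to be checked against the data documentation before doing the same with `homes`.

```r
# Toy data standing in for `homes`; column names and values are hypothetical.
set.seed(2)
toy <- data.frame(
  price = round(rnorm(100, mean = 180000, sd = 40000)),
  area = round(runif(100, 800, 3000)),
  garage = sample(c("Attached", "Detached", NA), 100, replace = TRUE),
  stringsAsFactors = FALSE
)

# Handling missing data: count the missing values in each column.
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# Recoding: treat a missing garage type as its own category (an assumption).
toy$garage_clean <- ifelse(is.na(toy$garage), "None", toy$garage)
print(table(toy$garage_clean))

# Creating a new variable: price per square foot.
toy$price_per_sqft <- toy$price / toy$area

# Distribution of the outcome variable.
hist(toy$price, main = "Distribution of sale price", xlab = "Sale price")

# Hypothesized relationship between one explanatory variable and the outcome.
plot(price ~ area, data = toy, xlab = "Living area", ylab = "Sale price")
```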

## Model Variation 

<div class="alert alert-block alert-info">

The goal of this section is to create a model or models that use explanatory variables to explain some of the variation in your outcome variable.

A good modeling variation section typically includes the following topics (but not necessarily in this order): 

- The best fitting model (or models), expressed in GLM notation, to represent your research question 

- The interpretation of your parameter estimates in the units appropriate to your research question

- A visual display of your model overlaid on the data 

- The creation and interpretation of an ANOVA table to assess how well the model fits the data, and a comparison of the fit of alternative models when applicable 


</div>
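As an illustration of these steps, consider a one-predictor model written in GLM notation as $Y_i = b_0 + b_1X_i + e_i$. The sketch below fits such a model in base R on simulated data, overlays it on a scatterplot, and builds an ANOVA table. The data frame `toy` and the columns `area` and `price` are hypothetical; with the packages loaded in this notebook, a function such as `supernova()` could be used in place of `anova()`.

```r
# Toy data standing in for `homes`; column names and values are hypothetical.
set.seed(3)
toy <- data.frame(area = runif(150, 800, 3000))
toy$price <- 40000 + 90 * toy$area + rnorm(150, sd = 25000)

# Fit the model price_i = b0 + b1 * area_i + e_i.
area_model <- lm(price ~ area, data = toy)

# b0 is the predicted price when area is 0;
# b1 is the predicted change in price for each one-unit increase in area.
print(coef(area_model))

# Visual display of the model overlaid on the data.
plot(price ~ area, data = toy, xlab = "Living area", ylab = "Sale price")
abline(area_model, lwd = 2)

# ANOVA table: partitions total variation into model and error sums of squares.
anova_table <- anova(area_model)
print(anova_table)

# Proportion of total variation explained by the model (PRE, equal to R-squared).
ss_total <- sum(anova_table[["Sum Sq"]])
pre <- anova_table["area", "Sum Sq"] / ss_total
print(pre)
```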

## Evaluate Models 

<div class="alert alert-block alert-info">

The goal of this section is to discuss your model in relation to other plausible models of the DGP.

A good evaluating models section typically includes the following topics (but not necessarily in this order): 

- The construction and interpretation of a confidence interval in relation to your research question 

- An evaluation of your model(s) against the empty model using p-values or confidence intervals, and a rationale for which model you opt to retain 

- An evaluation of your model(s) against other, simpler models using p-values or confidence intervals, and a rationale for which model you opt to retain 


</div>
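The comparisons described above can be sketched in base R as follows. This is a minimal example on simulated data, assuming a hypothetical data frame `toy` with columns `area`, `age`, and `price`; none of these names come from `homes`.

```r
# Toy data standing in for `homes`; column names and values are hypothetical.
set.seed(4)
toy <- data.frame(area = runif(150, 800, 3000), age = runif(150, 0, 80))
toy$price <- 60000 + 85 * toy$area - 500 * toy$age + rnorm(150, sd = 25000)

# Three nested models: the empty model, a simpler model, and the fuller model.
empty_model <- lm(price ~ 1, data = toy)
area_model <- lm(price ~ area, data = toy)
full_model <- lm(price ~ area + age, data = toy)

# 95% confidence intervals for the parameters of the fuller model.
# An interval that excludes 0 suggests the DGP parameter is unlikely to be 0.
ci <- confint(full_model, level = 0.95)
print(ci)

# F test of the fuller model against the empty model.
vs_empty <- anova(empty_model, full_model)
print(vs_empty)

# F test of the fuller model against the simpler model:
# does adding age reduce error beyond what area alone achieves?
vs_simpler <- anova(area_model, full_model)
print(vs_simpler)
```

Each `anova()` call with two nested models reports an F statistic and p-value for the extra terms in the larger model, which can support the rationale for retaining one model over the other.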

## Conclusions 

<div class="alert alert-block alert-info">

The goal of this section is to help your audience understand what can be learned from your data analysis.

A good conclusion section typically includes the following topics (but not necessarily in this order): 

- A summary of what you did, what you found, and how it relates to the motivating question 

- A discussion of the implications of the results, what they mean for the audience or the world, and possible limitations of the findings  



</div>

<div class="alert alert-block alert-info">

## After you are done...

Go through this document and delete all the cells with blue text boxes (the instructions). You will then be left with a data analysis report.

</div>