# <center>Project (Ch. 1-6): Modeling Variation</center> 

In [None]:
# Load the CourseKata library
suppressPackageStartupMessages({
    library(coursekata)
})

# Rename and reorder categories for "Year" in StudentSurvey
StudentSurvey$Year <- factor(as.numeric(StudentSurvey$Year), levels = c(1:5), labels = c("", "9th", "10th", "11th", "12th"))

# Load the NFL games Data 

nfl_games <- read.csv("https://docs.google.com/spreadsheets/d/1AB80wVIbwt1XE6lb3YI8-cmOydWDg8aF2_d9hh0UQ_k/pub?gid=1027337637&single=true&output=csv")

# Load the World data 

World <- read.csv("https://docs.google.com/spreadsheets/d/1yLn2kQ-1yL3p4VuMmK8ruFBmIWEDp5dlJszl2tKE79k/pub?gid=0&single=true&output=csv")

<div class="alert alert-block alert-info">

## Link to Guide and Rubric
The **Guide** answers the question "What is a data analysis report?" and the **Rubric** can be used to help you write a good data analysis report. 

You can access both the guide and the rubric at this link: https://bit.ly/data-analysis-report-guide

</div>

<div class="alert alert-block alert-info">

## Data

Three data frames are preloaded in this notebook using the code above. Explore the data frames and choose one that interests you for this assignment.

Option 1: `World`

- This data frame has data from the countries of the world. It contains 218 observations on 33 variables such as population, life expectancy, GDP, health, economy, and happiness.
- [Click here](https://docs.google.com/document/d/1otfwo54jp-DNu2HRBVSQ6lO5562P-xPNQJrR_st6KNk/pub) for a description of all the variables.

Option 2: `nfl_games`

- This data frame has data on the results of all NFL games from the 1999 to 2023 seasons. It contains 6707 observations on 29 variables such as game date, points scored, game times, temperature, and stadium.
- [Click here](https://docs.google.com/document/d/1WA-CCYO38RhO0x257pZS62r3m13cKawMzQls4U9y5EE/pub) for a description of all the variables.

Option 3: `StudentSurvey`

- This data frame has data from an in-class survey given to introductory statistics students over several years. It contains 362 observations on 17 variables such as exercise, smoking, GPA, and piercings.
- [Click here](https://www.rdocumentation.org/packages/Lock5Data/versions/2.6/topics/StudentSurvey) for a description of all the variables. 

*NOTE: We have changed the names of the categories in the variable `Year` in the `StudentSurvey` 
data frame from "FirstYear, Sophomore, Junior, Senior" to: "9th, 10th, 11th, 12th"*
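
One way to explore a data frame before choosing is to check its size, its column types, and how much data is missing. The sketch below is a minimal example using only base R. `describe_frame` is a hypothetical helper name, and `demo` is a tiny made-up data frame standing in for the real ones.

```r
# Hypothetical helper: summarize the size, column types, and missing values
# of any data frame.
describe_frame <- function(df) {
  list(
    n_rows    = nrow(df),
    n_cols    = ncol(df),
    col_types = vapply(df, function(col) class(col)[1], character(1)),
    n_missing = colSums(is.na(df))
  )
}

# Made-up stand-in data (NOT from the course data sets), just to show the output.
demo <- data.frame(
  GPA      = c(3.1, NA, 3.8, 2.9),
  Exercise = c(5, 10, NA, 2),
  Year     = factor(c("9th", "10th", "10th", "12th"))
)

overview <- describe_frame(demo)
str(overview)   # print the summary compactly
head(demo, 3)   # peek at the first few rows
```

In this notebook, the same helper could be called as `describe_frame(World)`, `describe_frame(nfl_games)`, or `describe_frame(StudentSurvey)`.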

</div>

<div class="alert alert-block alert-info">

## Instructions

Your task is to use R to explore and model variation in the data with the goal of reporting your findings to a target audience.

The target audience for your report depends on the data set you choose:

- Option 1: `World` --> Audience: Your local policy makers
- Option 2: `nfl_games` --> Audience: Sports historians or enthusiasts
- Option 3: `StudentSurvey` --> Audience: Your school newspaper

Note: A full data analysis report has 5 sections: Introduction, Explore Variation, Model Variation, Evaluate Models, and Discussion/Conclusions. At this stage of the course, your report will only include the **first three sections** (Introduction, Explore Variation, and Model Variation). However, it should still be a standalone report with its own concluding statements.

</div>

## Intro/Overview of the Problem or Question

<div class="alert alert-block alert-info"> 

The goal of this section is to provide an overview of the context, situation, or problem.

A good introduction section typically includes the following topics (but not necessarily in this order): 

- A description of the question or problem you are investigating and why this question is important

- A description of the data you will use in your investigation, such as:
  - where the data came from 
  - why and how it was collected
  - what cases and variables are included

- Your initial hypothesis (perhaps also stated as a word equation), specifying outcome and explanatory variables, and why you think your hypothesis is plausible

</div>

## Explore Variation

<div class="alert alert-block alert-info">

The goal of this section is to explore variation in your explanatory and outcome variables. That exploration will almost certainly include visual displays of your data.

A good Explore Variation section typically includes the following topics (but not necessarily in this order): 

- A description of how you cleaned and prepared your data and why, such as: 
  - filtering cases 
  - handling missing data 
  - recoding or creating new variables

- Visualizations or tables to explore the distributions of relevant variables and hypothesized relationships among variables

- Descriptions of the visualizations or tables, and explanations of how they relate to the hypotheses or research questions
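
The cleaning and exploring steps listed above might look something like the following sketch. It uses only base R and a small made-up data frame (`demo`) in place of the real data. The column names echo `StudentSurvey`, but the values and the cutoff of 5 are invented for illustration.

```r
# Made-up stand-in data (NOT from the course data sets).
demo <- data.frame(
  GPA      = c(3.1, NA, 3.8, 2.9, 3.5, 2.4),
  Exercise = c(5, 10, NA, 2, 12, 1),
  Year     = factor(c("9th", "10th", "10th", "", "12th", "11th"))
)

# Filter cases: drop rows where Year was left blank, then drop the unused level.
clean <- subset(demo, Year != "")
clean$Year <- droplevels(clean$Year)

# Handle missing data: keep only rows with both GPA and Exercise recorded.
clean <- clean[complete.cases(clean[, c("GPA", "Exercise")]), ]

# Create a new variable: a two-group version of Exercise
# (the cutoff of 5 is a hypothetical choice).
clean$ExerciseGroup <- factor(
  ifelse(clean$Exercise > 5, "high", "low"),
  levels = c("low", "high")
)

# Explore distributions and the hypothesized relationship.
table(clean$ExerciseGroup)
hist(clean$GPA, main = "Distribution of GPA", xlab = "GPA")
boxplot(GPA ~ ExerciseGroup, data = clean, ylab = "GPA")
```

Whatever cleaning choices you make, the report should say what was dropped or recoded and why, since those choices change which cases the later models describe.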

</div>

## Model Variation 

<div class="alert alert-block alert-info">

The goal of this section is to create a model or models that use explanatory variables to explain some of the variation in your outcome variable.

A good Model Variation section typically includes the following topics (but not necessarily in this order): 

- The best fitting model (or models), expressed in GLM notation, to represent your research question 

- The interpretation of your parameter estimates in the units appropriate to your research question

- A visual display of your model overlaid on the data 

- The creation and interpretation of an ANOVA table to assess how well the model fits the data, and a comparison of the fit of alternative models when applicable 
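
As a rough sketch of the topics above, the code below fits an empty model, $Y_i = b_0 + e_i$, and a one-predictor model, $Y_i = b_0 + b_1X_i + e_i$, then compares their fit using PRE, computed as $(SS_{Total} - SS_{Error}) / SS_{Total}$. It uses only base R and made-up data, so the numbers mean nothing. In the course environment, `supernova()` would likely be the more familiar way to get the ANOVA table.

```r
# Made-up stand-in data (NOT from the course data sets).
demo <- data.frame(
  GPA      = c(3.1, 3.4, 3.8, 2.9, 3.5, 2.4, 3.0, 3.6),
  Exercise = c(5, 8, 14, 2, 12, 1, 6, 10)
)

# Empty model: GPA_i = b0 + e_i, where b0 is the mean of GPA.
empty_model <- lm(GPA ~ NULL, data = demo)

# Explanatory model: GPA_i = b0 + b1 * Exercise_i + e_i.
exercise_model <- lm(GPA ~ Exercise, data = demo)

# b0 is the predicted GPA when Exercise is 0;
# b1 is the predicted change in GPA per one-unit increase in Exercise.
coef(exercise_model)

# Overlay the model's predictions on the data.
plot(GPA ~ Exercise, data = demo)
abline(exercise_model)

# Sums of squares and PRE (proportional reduction in error).
ss_total <- sum(resid(empty_model)^2)
ss_error <- sum(resid(exercise_model)^2)
pre <- (ss_total - ss_error) / ss_total
pre

# ANOVA table comparing the empty model to the explanatory model.
anova(empty_model, exercise_model)
```

A PRE near 0 would suggest the explanatory variable reduces little of the error left by the empty model, while a PRE near 1 would suggest it reduces most of it.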


</div>

<div class="alert alert-block alert-info">

## After you are done...

Go through this document and delete all the cells with blue text boxes (the instructions). You will then be left with a data analysis report.

</div>

## Evaluate Models 

<div class="alert alert-block alert-warning">

#### COMING SOON

The goal of this section is to discuss your model in relation to other plausible models of the data generating process (DGP).

A good Evaluate Models section typically includes the following topics (but not necessarily in this order): 

- The construction and interpretation of a confidence interval in relation to your research question 

- An evaluation of your model(s) against the empty model using a p-value or confidence intervals, and a rationale for which model you opt to retain 
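
For a preview of what these two steps might involve, here is a minimal base R sketch on made-up data. The course may introduce different functions for this, so treat the calls below as one possible approach and not as the expected method.

```r
# Made-up stand-in data (NOT from the course data sets).
demo <- data.frame(
  GPA      = c(3.1, 3.4, 3.8, 2.9, 3.5, 2.4, 3.0, 3.6),
  Exercise = c(5, 8, 14, 2, 12, 1, 6, 10)
)

empty_model    <- lm(GPA ~ NULL, data = demo)
exercise_model <- lm(GPA ~ Exercise, data = demo)

# 95% confidence interval for each parameter estimate.
ci <- confint(exercise_model, level = 0.95)
ci["Exercise", ]  # lower and upper bounds for b1

# F test of the explanatory model against the empty model.
comparison <- anova(empty_model, exercise_model)
p_value <- comparison[["Pr(>F)"]][2]
p_value
```

If the interval for the slope excludes 0, a DGP in which the explanatory variable has no effect is not among the plausible ones at that confidence level, which is an argument for retaining the more complex model.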


</div>

## Conclusions 

<div class="alert alert-block alert-warning">

#### COMING SOON

The goal of this section is to help your audience understand what can be learned from your data analysis.

A good conclusion section typically includes the following topics (but not necessarily in this order): 

- A summary of what you did, what you found, and how it relates to the motivating question 

- A discussion of the implications of the results, what they mean for the audience or the world, and possible limitations of the findings  


</div>