# Title

## Introduction

Begin by providing some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal.

Clearly state the question you will try to answer with your project. Your question should involve one or more random variables of interest, spread across two or more categories that are interesting to compare. For example, you could consider the annual maxima river flow at two different locations along a river, or perhaps gender diversity at different universities. Of the response variable, identify one location parameter (mean, median, quantile, etc.) and one scale parameter (standard deviation, inter-quartile range, etc.) that would be useful in answering your question. Justify your choices.

UPDATE (Mar 1, 2022): If it doesn’t make sense to infer a scale parameter, you can choose another parameter, or choose a second variable altogether. Ultimately, we’re looking for a comprehensive inference analysis on one parameter spread across 2+ groups (with at least one hypothesis test), plus a bit more (such as an investigation on the variance, a quantile, or a different variable). In total, you should use both bootstrapping and asymptotics somewhere in your report at least once each. Also, your hypothesis test(s) need not be significant: it is perfectly fine to write a report claiming no significant findings (i.e. your p-value is large).

Identify and describe the dataset that will be used to answer the question. Remember, this dataset is allowed to contain more variables than you need – feel free to drop them!

Also, be sure to frame your question/objectives in terms of what is already known in the literature. Be sure to include at least two scientific publications that can help frame your study (you will need to include these in the References section). We have no specific citation style requirements, but be consistent.

- Background information: Why is it important? Cite some relevant literature


- What is your (inferential) research question?


- What is your population of interest?


- How do you collect your data?


- What is your point estimate? (e.g., mean difference between male and female income, proportion of Democratic voters, etc.)


- State your hypothesis if applicable


Health insurance has a widespread impact on the American population: only about 8.3 percent of people had no health insurance at any point during 2021 (Keisler-Starkey & Bunch, 2022). Insurance companies weigh many factors when setting premiums, including age and state and federal laws. Many companies also deduct insurance premiums from employees' pay, so it is important for employees to know the major contributors to their insurance costs and the actions they can take to reduce them (Fontinelle, 2022). For this reason, we decided to explore whether people in the United States who smoke are charged more for health insurance than those who do not, since smoking is one of the few factors affecting insurance costs that people can actually control. Our population of interest is people in the United States who have health insurance policies. To answer our inferential question, we will use the difference in mean yearly insurance charges between smokers and non-smokers as our point estimate. Our null hypothesis is that there is no difference in mean charges, and our alternative hypothesis is that mean charges are greater for smokers. Our data come from the US Health Insurance Dataset (Datta, 2019), which contains 1338 rows recording each person's age, sex, smoking status (yes or no), insurance charges, and other variables.
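
The hypotheses described above can also be written in symbols. The notation is introduced here only for illustration: $\mu_{\text{smoker}}$ and $\mu_{\text{non-smoker}}$ denote the population mean yearly charges in each group, and $\bar{x}_{\text{smoker}}$ and $\bar{x}_{\text{non-smoker}}$ the corresponding sample means.

```latex
\begin{aligned}
H_0 &: \mu_{\text{smoker}} - \mu_{\text{non-smoker}} = 0 \\
H_A &: \mu_{\text{smoker}} - \mu_{\text{non-smoker}} > 0 \\
\text{point estimate} &: \bar{x}_{\text{smoker}} - \bar{x}_{\text{non-smoker}}
\end{aligned}
```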

## Preliminary Results

In [4]:
# Run this cell before continuing.
library(tidyverse)
library(datateachr)
library(repr)
library(digest)
library(infer)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


In this section, you will:

- Demonstrate that the dataset can be read from the web into R.
- Clean and wrangle your data into a tidy format.
- Plot the relevant raw data, tailoring your plot in a way that addresses your question.
- Compute estimates of the parameter you identified across your groups. Present this in a table. If relevant, include these estimates in your plot.

Be sure not to print output that takes up a lot of screen space.


In [5]:
# Read in dataset
insurance <- read_csv("https://raw.githubusercontent.com/Yuji03b/STAT-201-GROUP-1/main/insurance.csv")

# Tidy Data
insurance <- insurance |>
    select(-bmi, -children)

head(insurance)

Rows: 1338 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): sex, smoker, region
dbl (4): age, bmi, children, charges

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


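| age | sex | smoker | region | charges |
| --- | --- | --- | --- | --- |
| `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` |
| 19 | female | yes | southwest | 16884.924 |
| 18 | male | no | southeast | 1725.552 |
| 28 | male | no | southeast | 4449.462 |
| 33 | male | no | northwest | 21984.471 |
| 32 | male | no | northwest | 3866.855 |
| 31 | female | no | southeast | 3756.622 |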


In [6]:
# Describe columns of interest
summary(insurance |> select(smoker, charges))

    smoker             charges     
 Length:1338        Min.   : 1122  
 Class :character   1st Qu.: 4740  
 Mode  :character   Median : 9382  
                    Mean   :13270  
                    3rd Qu.:16640  
                    Max.   :63770  

In [7]:
# Plot raw data (line graphs colored by smoker variable)
# Compute estimates and plot raw data with these estimates included
# Mean and standard deviation grouped by smoker
# Can maybe use boxplots here
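
The steps listed in the comments above could be sketched roughly as follows. This is only one possible approach: it uses boxplots rather than line graphs, and the helper names `summarise_charges()` and `plot_charges()` are hypothetical, not part of any package. Both assume a data frame with a `smoker` column ("yes"/"no") and a numeric `charges` column, as in `insurance`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: group size, mean, and standard deviation of charges
# for each smoking status.
summarise_charges <- function(data) {
  data |>
    group_by(smoker) |>
    summarise(
      n = n(),
      mean_charges = mean(charges),
      sd_charges = sd(charges),
      .groups = "drop"
    )
}

# Hypothetical helper: boxplot of charges by smoking status, with each
# group's mean overlaid as a red point.
plot_charges <- function(data) {
  ggplot(data, aes(x = smoker, y = charges, fill = smoker)) +
    geom_boxplot(alpha = 0.6) +
    stat_summary(fun = mean, geom = "point", colour = "red", size = 3) +
    labs(
      x = "Smoker",
      y = "Yearly insurance charges",
      title = "Insurance charges by smoking status"
    ) +
    theme(legend.position = "none")
}
```

Calling `summarise_charges(insurance)` would give the table of estimates, and `plot_charges(insurance)` the plot with those means included.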

## Methods Plan

The previous sections will carry over to your final report (you’ll be allowed to improve them based on feedback you get). Begin this Methods section with a brief description of “the good things” about this report – specifically, in what ways is this report trustworthy?

Continue by explaining why the plot(s) and estimates that you produced are not enough to give to a stakeholder, and what you should provide in addition to address this gap. Make sure your plans include at least one hypothesis test and one confidence interval. If possible, compare both the bootstrapping and asymptotics methods.
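
As a rough illustration of comparing bootstrapping with asymptotics, a confidence interval for the difference in mean charges (smokers minus non-smokers) might be computed both ways as sketched below. The function names `bootstrap_ci()` and `asymptotic_ci()` are hypothetical, and the choice of a percentile bootstrap interval and a Welch two-sample t interval is an assumption rather than a requirement of the project.

```r
library(dplyr)
library(infer)

# Hypothetical helper: percentile bootstrap CI for
# mean(charges | smoker = "yes") - mean(charges | smoker = "no").
# Assumes `data` has columns `smoker` ("yes"/"no") and numeric `charges`.
bootstrap_ci <- function(data, reps = 1000, level = 0.95) {
  data |>
    mutate(smoker = factor(smoker, levels = c("yes", "no"))) |>
    specify(charges ~ smoker) |>
    generate(reps = reps, type = "bootstrap") |>
    calculate(stat = "diff in means", order = c("yes", "no")) |>
    get_confidence_interval(level = level, type = "percentile")
}

# Hypothetical helper: asymptotic (Welch two-sample t) CI for the same
# difference. Putting "yes" first in the factor levels makes t.test()
# report mean(yes) - mean(no).
asymptotic_ci <- function(data, level = 0.95) {
  data <- mutate(data, smoker = factor(smoker, levels = c("yes", "no")))
  ci <- t.test(charges ~ smoker, data = data, conf.level = level)$conf.int
  tibble(lower_ci = ci[1], upper_ci = ci[2])
}
```

With the data loaded earlier, `bootstrap_ci(insurance)` and `asymptotic_ci(insurance)` each return a one-row tibble with `lower_ci` and `upper_ci`, so the two intervals can be placed side by side.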

Finish this section by reflecting on how your final report might play out:

- What do you expect to find?
- What impact could such findings have?
- What future questions could this lead to?

- What is your point estimate?
- How do you quantify the errors of your estimates?
- How do you establish statistical significance of your findings?
- What do you expect to find?
- What are the potential challenges/drawbacks?
- How do you make sure your analysis is reproducible?
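
One way the questions about statistical significance and reproducibility might be addressed is sketched below, as an assumption about how the analysis could proceed rather than a settled plan. It tests the null hypothesis of no difference in mean charges against the one-sided alternative that smokers are charged more, using a simulation-based permutation test from `infer` and an asymptotic Welch t-test. The function names `permutation_p_value()` and `welch_p_value()` are hypothetical.

```r
library(dplyr)
library(infer)

# Hypothetical helper: one-sided permutation test of
# H0: no difference in mean charges vs HA: smokers' mean is greater.
# Assumes `data` has columns `smoker` ("yes"/"no") and numeric `charges`.
permutation_p_value <- function(data, reps = 1000) {
  data <- mutate(data, smoker = factor(smoker, levels = c("yes", "no")))

  # Observed difference in means (smokers minus non-smokers).
  obs_diff <- data |>
    specify(charges ~ smoker) |>
    calculate(stat = "diff in means", order = c("yes", "no"))

  # Null distribution obtained by shuffling smoking status across rows.
  data |>
    specify(charges ~ smoker) |>
    hypothesize(null = "independence") |>
    generate(reps = reps, type = "permute") |>
    calculate(stat = "diff in means", order = c("yes", "no")) |>
    get_p_value(obs_stat = obs_diff, direction = "greater")
}

# Hypothetical helper: asymptotic counterpart, a one-sided Welch t-test.
# With "yes" as the first factor level, alternative = "greater" tests
# mean(yes) - mean(no) > 0.
welch_p_value <- function(data) {
  data <- mutate(data, smoker = factor(smoker, levels = c("yes", "no")))
  t.test(charges ~ smoker, data = data, alternative = "greater")$p.value
}
```

Calling `set.seed()` with a fixed value before `permutation_p_value(insurance)` makes the simulated p-value repeatable from run to run; `welch_p_value(insurance)` involves no randomness.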

## References

Datta, A. (2019). <i>US Health Insurance Dataset</i> [Data set]. Kaggle.
    https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset

Keisler-Starkey, K., & Bunch, L. N. (2022, September 13). <i>Health Insurance Coverage in the United States: 2021</i>. United States Census Bureau.
    https://www.census.gov/library/publications/2022/demo/p60-278.html#:~:text=Highlights,8.6%20percent%20or%2028.3%20million

Fontinelle, A. (2022, March 2). <i>How Much Does Health Insurance Cost?</i> Investopedia.
    https://www.investopedia.com/how-much-does-health-insurance-cost-4774184