#STAT 201 Project Proposal

##Introduction

There has been considerable debate regarding the effects of alcohol use on academic performance, as measured by GPA or academic average. While it is challenging to establish a direct causal link between the two, researchers have studied intermediate measures that relate to academic performance. Research published in the American Economist suggests that drinking in high school is positively correlated with absenteeism (Austin, 2012). Furthermore, regarding the relationship between absenteeism and academic performance, findings published in AERA Open indicate that "overall absences are negatively associated with academic achievement" (Klein et al., 2022).

Taking these findings into account, our group has set out to answer the following research question: does the level of alcohol consumption among secondary school students affect the number of school absences? To conduct this research, we will use a dataset from Kaggle that records students' academic and socioeconomic attributes across two Portuguese secondary schools (https://raw.githubusercontent.com/riddhibattu/STAT201/main/Maths.csv). The dataset includes the following variables that we will use in our research (Chauhan, 2022):
- Student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
- Workday alcohol consumption (Dalc)
- Weekend alcohol consumption (Walc)
- Number of school absences (absences)



In this case, the number of school absences will serve as our response variable, and the levels of workday and weekend alcohol consumption, each rated on a scale from 1 (very low) to 5 (very high), will serve as our explanatory variables. We will use the mean of our response variable as the location parameter and its standard deviation as the scale parameter. Because the data cover only two schools in Portugal, and in keeping with the requirements of the assignment, our group will use bootstrap resampling to approximate the sampling distribution of our estimate, treating the sample as a stand-in for the population of secondary school students in Portugal. Finally, we will conduct a hypothesis test to answer the research question above. Our null hypothesis is that there is no relationship between alcohol consumption among secondary school students and the number of school absences. Our alternative hypothesis, as suggested by the findings published in the American Economist (Austin, 2012), is that higher alcohol consumption among secondary school students is associated with more school absences.
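One possible way to write these hypotheses formally is in terms of mean absences. The split of students into a lower-consumption and a higher-consumption group, and the symbols $\mu_{\text{low}}$ and $\mu_{\text{high}}$, are assumptions made here for illustration; the exact grouping is not fixed by the text above.

$$
H_0: \mu_{\text{high}} - \mu_{\text{low}} = 0
\qquad \text{versus} \qquad
H_A: \mu_{\text{high}} - \mu_{\text{low}} > 0
$$

Here $\mu_{\text{low}}$ and $\mu_{\text{high}}$ denote the mean number of school absences among students in the lower and higher alcohol consumption groups, respectively. The one-sided alternative reflects the expectation that higher consumption goes with more absences.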

##Preliminary Results

We read the dataset into R by loading the required libraries and calling read_csv. We then clean the data by selecting the columns we are interested in and dropping any rows with missing values (NA). We also rename column Dalc to workday_ac and column Walc to weekend_ac to make the data clearer.

In [1]:
library(cowplot)
library(datateachr)
library(digest)
library(infer)
library(repr)
library(tidyverse)
library(dplyr)

alcohol_data <- read_csv("https://raw.githubusercontent.com/riddhibattu/STAT201/main/Maths.csv") 

head(alcohol_data)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Rows: 395 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
dbl (16): age, Medu, Fedu, traveltime, studytime, failure

school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,⋯,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
GP,F,18,U,GT3,A,4,4,at_home,teacher,⋯,4,3,4,1,1,3,6,5,6,6
GP,F,17,U,GT3,T,1,1,at_home,other,⋯,5,3,3,1,1,3,4,5,5,6
GP,F,15,U,LE3,T,1,1,at_home,other,⋯,4,3,2,2,3,3,10,7,8,10
GP,F,15,U,GT3,T,4,2,health,services,⋯,3,2,2,1,1,5,2,15,14,15
GP,F,16,U,GT3,T,3,3,other,other,⋯,4,3,2,1,2,5,4,6,10,10
GP,M,16,U,LE3,T,4,3,services,other,⋯,5,4,2,1,2,5,10,15,15,15


In [4]:
# Select the columns we will be working with, renaming Dalc and Walc,
# then drop any rows with missing values
alcohol_data_clean <- alcohol_data %>%
    select(school, workday_ac = Dalc, weekend_ac = Walc, absences) %>%
    drop_na()
head(alcohol_data_clean)

school,workday_ac,weekend_ac,absences
<chr>,<dbl>,<dbl>,<dbl>
GP,1,1,6
GP,1,1,4
GP,2,3,10
GP,1,1,2
GP,1,2,4
GP,1,2,10
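
As a rough sketch of the kind of summary that could follow from the cleaned data, the snippet below computes the count, mean, and standard deviation of absences for each alcohol consumption level. The helper name `summarise_absences` is hypothetical, and the data frame `preview` holds only the six rows shown in the preview above so that the example is self-contained; the actual analysis would pass in `alcohol_data_clean` instead. Base R is used here to avoid extra dependencies, though an equivalent `group_by()` and `summarise()` pipeline would work as well.

```r
# The six rows shown in the preview of the cleaned data (illustration only;
# the real analysis would use the full alcohol_data_clean data frame).
preview <- data.frame(
  school     = rep("GP", 6),
  workday_ac = c(1, 1, 2, 1, 1, 1),
  weekend_ac = c(1, 1, 3, 1, 2, 2),
  absences   = c(6, 4, 10, 2, 4, 10)
)

# Hypothetical helper: summarise absences for each level of the
# consumption column named in level_col.
summarise_absences <- function(df, level_col) {
  # Split the absences into one vector per consumption level
  groups <- split(df$absences, df[[level_col]])
  data.frame(
    level         = as.numeric(names(groups)),
    n             = as.integer(lengths(groups)),
    mean_absences = unname(vapply(groups, mean, numeric(1))),
    # sd() is NA for a level with a single student
    sd_absences   = unname(vapply(groups, sd, numeric(1))),
    row.names     = NULL
  )
}

absence_summary <- summarise_absences(preview, "weekend_ac")
print(absence_summary)
```

With only six rows these numbers say nothing about the research question; the point is the shape of the summary table that the full dataset would produce.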


##Methods

The dataset, which we accessed through Kaggle (Chauhan, 2022), originates from the UCI Machine Learning Repository, a trustworthy source for datasets. In this project, we will use both bootstrap and asymptotic methods. First, we will use bootstrap resampling to build the null distribution, based on the absences of students who report minimal alcohol consumption. Then, we will use an asymptotic approach to calculate a 95% confidence interval for the mean. An asymptotic approach is appropriate because our sample is large enough, the observations are taken independently, and the sample is treated as random.
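
The two approaches described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the final analysis: the function names `asymptotic_ci` and `bootstrap_means` are hypothetical, the normal quantile from `qnorm` is one common choice for an asymptotic interval, and the vector `absences_preview` contains only the six absence counts shown in the data preview, which is far too few for the asymptotic approximation to be trusted. The real analysis would use the full `absences` column.

```r
# Hypothetical helper: asymptotic (CLT-based) confidence interval for a mean,
# computed as mean +/- z * sd / sqrt(n).
asymptotic_ci <- function(x, level = 0.95) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)                 # estimated standard error of the mean
  z  <- qnorm(1 - (1 - level) / 2)      # normal critical value
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

# Hypothetical helper: bootstrap distribution of the sample mean, obtained by
# resampling the data with replacement at the original sample size.
bootstrap_means <- function(x, reps = 1000) {
  replicate(reps, mean(sample(x, size = length(x), replace = TRUE)))
}

set.seed(201)  # fixed seed so the resampling is reproducible

# Absence counts from the six preview rows (illustration only)
absences_preview <- c(6, 4, 10, 2, 4, 10)

ci      <- asymptotic_ci(absences_preview)
boot    <- bootstrap_means(absences_preview, reps = 1000)
boot_ci <- quantile(boot, c(0.025, 0.975))  # percentile bootstrap interval

print(ci)
print(boot_ci)
```

Comparing the asymptotic interval with the percentile interval from the bootstrap distribution is one way to check whether the two methods agree on the full dataset.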

In this project, we expect the hypothesis test to show whether there is a relationship between alcohol consumption and student attendance. The purpose of this project is to help students find out whether such a relationship exists. If it does, students can refer to the results when planning a better work-life balance. In the future, we could test whether student attendance is related to academic success, to further investigate the relationship between drinking and students' academic performance.



##References

Austin, W. A. (2012). The effects of alcohol use on high school absenteeism. 
    American Economist, 57(2), 238-252. 
    Retrieved from https://www.proquest.com/scholarly-journals/effects-alcohol-use-on-high-school-absenteeism/docview/1113789486/se-2

Chauhan, A. (2022, September 15). Alcohol effects on study. Kaggle. 
    Retrieved November 4, 2022, from https://www.kaggle.com/datasets/whenamancodes/alcohol-effects-on-study 

Klein, M., Sosu, E. M., & Dare, S. (2022). School absenteeism and academic achievement: Does the reason for absence matter? 
    AERA Open, 8. https://doi.org/10.1177/23328584211071115

