In [1]:
library(tidyverse)
library(repr)
library(tidymodels)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

“package ‘tidymodels’ was built under R version 4.0.2”
── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──

 Title: Alcohol Consumption and Regions: An Investigation

Intro

Alcohol consumption is often part of social activities as a form of bonding and a way to facilitate socialization; moreover, it can be a part of one's culture. For example, there are "wet cultures", found in European countries around the Mediterranean, where alcohol is common in everyday life and "consumed with meals", and where rates of abstinence are lower. There are also "dry cultures", such as Canada, where alcohol is less common and rates of abstinence are higher (Bloomfield et al., 2003). Differences in alcohol consumption between countries have been shown in previous studies. For example, one study found that German respondents' alcohol consumption was twice as high as that of US respondents (Bloomfield et al., 2003).

It is important to note that excessive consumption of alcohol is linked to many negative outcomes, including crime, road incidents, and disease (Ritchie et al., 2018). Alcohol can cause addiction, and its effects can persist for hours after drinking (Babor et al., 2010). A European longitudinal study also found that job loss was positively associated with hazardous drinking over a span of 6 years (Bosque-Prous et al., 2015). Thus, alcohol consumption has a substantial impact on populations.

Our analysis uses the "Happiness and Alcohol Consumption" dataset, found on Kaggle.com and collected by Marcos Pessotto, and seeks to answer the following question: Which regions are associated with a given level of alcohol consumption (beer, wine and spirit per capita)? From this dataset, we will use the variables “Region”, “Beer_PerCapita” and “Wine_PerCapita”. We hypothesize that the higher a country's beer and wine consumption per capita, the more likely it is to be located in the Western Europe region, as Western Europe borders the Mediterranean, where "wet culture" is prevalent (Bloomfield et al., 2003).

To begin analyzing our data, we first loaded all the packages we needed using the 'library' function.

First, we downloaded the dataset from Kaggle. Second, we made the file available to the Jupyter notebook we are working in. Next, we set the seed so that any randomly generated sequence of numbers is reproducible, and we read in the data using the 'read.csv' function, pointing it at the file's download URL. Below is the data that was read.

In [None]:
set.seed(20)

url <- "https://drive.google.com/uc?export=download&id=1jjBVsxq8p2lxJlHJ37QiwPDmb-sK_V6L"

alcohol_data <- read.csv(url)

alcohol_data
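To illustrate why the seed is set before any random step, the sketch below draws a random sample twice under the same seed and checks that the two draws match. The helper name `draw_sample` is hypothetical and used for illustration only.

```r
# Hypothetical helper: reset the seed, then draw 5 numbers from 1 to 100.
draw_sample <- function(seed) {
  set.seed(seed)
  sample(1:100, size = 5)
}

first_draw  <- draw_sample(20)
second_draw <- draw_sample(20)

# Same seed, same sequence of "random" numbers.
identical(first_draw, second_draw)
```

Without the call to `set.seed`, the two draws would generally differ, and so would any later random split of the data.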

A quick look at the data reveals that there are 9 variables. The dataset is already tidy: each row is one observation, namely one country together with its region and measures such as HappinessScore, HDI, and GDP per capita.
For our specific question, we did not need all the columns in this dataset, so we selected Region, Beer_PerCapita and Wine_PerCapita.

We grouped the dataset by region to list which regions are present, since we want to colour and label the points by region when plotting.

In [None]:
data_grouped <- alcohol_data %>%
                group_by(Region) %>%
                summarize()
data_grouped

We then selected the columns that we would use as our predictors, as well as the target variable.

In [None]:
data_selected2 <- alcohol_data %>%
                 select(Region, Beer_PerCapita, Wine_PerCapita)

data_selected2

Next, we split our data into a training set (75%) and a testing set (25%), stored in two separate data frames. The training set is shown below.

In [None]:
alcoholdata_split <- initial_split(data_selected2, prop = 0.75, strata = Region)
alcohol_train <- training(alcoholdata_split)
alcohol_test <- testing(alcoholdata_split)

alcohol_train
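As a rough check of what a stratified 75/25 split does, the sketch below runs `initial_split` on toy data rather than on the alcohol dataset. The data frame `toy` and its values are made up for illustration.

```r
library(rsample)  # part of tidymodels

set.seed(20)

# Toy data (not the alcohol dataset): two regions with 100 rows each.
toy <- data.frame(
  Region = rep(c("A", "B"), each = 100),
  Beer_PerCapita = runif(200, min = 0, max = 400)
)

toy_split <- initial_split(toy, prop = 0.75, strata = Region)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Every row lands in exactly one of the two sets.
nrow(toy_train) + nrow(toy_test) == nrow(toy)

# Stratifying keeps the class proportions similar in both sets.
table(toy_train$Region) / nrow(toy_train)
table(toy_test$Region) / nrow(toy_test)
```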

In [None]:
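# Relabel the two Asian regions in the training data as a single "Asia" class
Region_asia <- alcohol_train %>%
                filter(Region %in% c("Southeastern Asia", "Eastern Asia")) %>%
                mutate(Region = "Asia")

# Joining on all shared columns appends the relabelled rows to the training data
full_join(Region_asia, alcohol_train, by = c("Region", "Beer_PerCapita", "Wine_PerCapita"))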

We then grouped the training data by region and counted the number of observations in each region.

In [None]:
data_summary <- alcohol_train %>%
            group_by(Region) %>%
            summarize(n = n())
data_summary

The bar graph below shows a visualization of this table.

In [None]:
data_summary_graph <- data_summary %>%
                    ggplot(aes(x = n, y = Region)) + 
                    geom_bar(stat = "identity") + 
                    xlab("Number of beer and wine per capita observations") + 
                    ylab("Region") + 
                    ggtitle("Figure 1: Region vs Number of beer and wine per capita observations")

data_summary_graph

We created the scatterplot below to compare our selected predictor variables, beer per capita and wine per capita.

In [None]:
alcohol_training_plot <- alcohol_train %>%
                        ggplot(aes(x = Wine_PerCapita, y = Beer_PerCapita, colour = Region)) +
                        geom_point() +
                        labs(x = "Wine Per Capita (L)", y = "Beer Per Capita (L)", colour = "Region") +
                        ggtitle("Figure 2: Beer Per Capita vs Wine Per Capita")
alcohol_training_plot

First, to balance the classes in the training and testing data, we combined the Asian regions with Australia into one group and the regions of the Americas into another, and we selected these two combined groups as the classes to classify.
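One way such a combination could be written is sketched below on toy rows, using `case_when` to relabel regions and `filter` to keep the combined classes. This is an assumption about the intended step, not the notebook's actual code: only "Southeastern Asia" and "Eastern Asia" appear in the code above, and the other region labels, the combined class names, and all values are hypothetical.

```r
library(dplyr)

# Toy rows (made-up values). Only "Southeastern Asia" and "Eastern Asia"
# appear in the notebook's code; the other labels are assumed.
toy <- tibble(
  Region = c("Eastern Asia", "Southeastern Asia", "Australia and New Zealand",
             "North America", "Latin America and Caribbean", "Western Europe"),
  Beer_PerCapita = c(80, 60, 250, 240, 150, 200),
  Wine_PerCapita = c(10, 5, 200, 90, 40, 220)
)

toy_combined <- toy %>%
  mutate(Region = case_when(
    Region %in% c("Eastern Asia", "Southeastern Asia",
                  "Australia and New Zealand") ~ "Asia and Australia",
    Region %in% c("North America",
                  "Latin America and Caribbean") ~ "Americas",
    TRUE ~ Region
  )) %>%
  # Keep only the two combined classes for classification.
  filter(Region %in% c("Asia and Australia", "Americas")) %>%
  mutate(Region = as.factor(Region))

count(toy_combined, Region)
```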

In [None]:
data_combined <- data_selected2 %>%
            
                

We will now use cross-validation to choose the best K. To do so, we create a recipe for preprocessing the data and a model specification for K-nearest neighbors classification.

In [None]:
alcohol_vfold <- vfold_cv(alcohol_train, v = 10, strata = Region)
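To make the fold structure concrete, the sketch below runs `vfold_cv` on toy data (not the alcohol dataset) and inspects how many rows each fold fits on and holds out. The data frame `toy` is made up, and stratification is left out to keep the row counts easy to follow.

```r
library(rsample)  # part of tidymodels

set.seed(20)

# Toy data: 100 rows, so each of the 10 folds holds out 10 rows.
toy <- data.frame(id = 1:100, x = rnorm(100))

toy_vfold <- vfold_cv(toy, v = 10)

# Size of the analysis (fit) and assessment (validation) part of each fold.
analysis_sizes   <- sapply(toy_vfold$splits, function(s) nrow(analysis(s)))
assessment_sizes <- sapply(toy_vfold$splits, function(s) nrow(assessment(s)))

analysis_sizes
assessment_sizes

# Across all folds, each row is held out exactly once.
held_out_ids <- unlist(lapply(toy_vfold$splits, function(s) assessment(s)$id))
all(sort(held_out_ids) == 1:100)
```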

We created the recipe using the training data.

In [None]:
alco_recipe <- recipe(Region ~ Beer_PerCapita + Wine_PerCapita, data = alcohol_train) %>%
                step_scale(all_predictors()) %>%
                step_center(all_predictors())
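To show what the scaling and centering steps do, the sketch below applies the same two steps to toy data and checks that each predictor ends up with mean 0 and standard deviation 1. The data frame `toy` and its values are made up for illustration.

```r
library(dplyr)    # for the pipe
library(recipes)  # part of tidymodels

# Toy data with made-up values; the columns mirror those used in the recipe.
toy <- data.frame(
  Region = factor(c("A", "A", "B", "B", "B")),
  Beer_PerCapita = c(100, 150, 200, 250, 300),
  Wine_PerCapita = c(10, 20, 30, 40, 50)
)

toy_recipe <- recipe(Region ~ Beer_PerCapita + Wine_PerCapita, data = toy) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

# prep() estimates the scaling and centering values from the data,
# and bake() applies them.
toy_scaled <- toy_recipe %>%
  prep() %>%
  bake(new_data = toy)

# Each predictor should now have mean 0 and standard deviation 1.
colMeans(toy_scaled[, c("Beer_PerCapita", "Wine_PerCapita")])
sapply(toy_scaled[, c("Beer_PerCapita", "Wine_PerCapita")], sd)
```

Standardizing matters for K-nearest neighbors because the method relies on distances, so a predictor measured on a larger scale would otherwise dominate.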

Next, we created the model specification, using the tune() function so that the number of neighbors can be tuned.

In [None]:
alco_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
            set_engine("kknn") %>%
            set_mode("classification")
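As an illustration of the idea behind this specification, the sketch below hand-rolls a K-nearest neighbors classifier in base R, where each of the k closest training points gets one equal vote (the "rectangular" weighting). It is a simplified sketch, not the kknn implementation; the helper `knn_predict` and all data values are hypothetical.

```r
# Hypothetical helper: classify one new point by majority vote among its
# k nearest training points (unweighted votes, as with "rectangular").
knn_predict <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from the new point to every training row.
  diffs <- sweep(as.matrix(train_x), 2, new_x)
  distances <- sqrt(rowSums(diffs^2))

  # Labels of the k closest rows; each gets one equal vote.
  # On a tie, the alphabetically first label is returned.
  nearest <- order(distances)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy, already-standardized predictors (made-up values).
train_x <- data.frame(
  Beer_PerCapita = c(-1.0, -0.8, -0.9, 1.0, 0.9, 1.1),
  Wine_PerCapita = c(-1.1, -0.9, -1.0, 0.8, 1.0, 1.2)
)
train_y <- c("Region A", "Region A", "Region A",
             "Region B", "Region B", "Region B")

knn_predict(train_x, train_y, new_x = c(0.7, 0.9), k = 3)
```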


Now we put everything in a workflow and run cross-validation over a grid of K values from 1 to 96 in steps of 5, which we created using a tibble.

In [None]:
alco_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

alco_vals

# Combine the recipe and model, tune K over the grid, and collect the metrics
alco_workflow <- workflow() %>%
                add_recipe(alco_recipe) %>%
                add_model(alco_spec) %>%
                tune_grid(resamples = alcohol_vfold, grid = alco_vals) %>%
                collect_metrics()

alco_workflow
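Once the metrics are collected, one way to pick K is to keep the accuracy rows and take the value of `neighbors` with the highest mean. The sketch below does this on a made-up table shaped like the output of `collect_metrics()`; the numbers are placeholders, not results from this analysis.

```r
library(dplyr)

# Made-up metrics in the shape returned by collect_metrics();
# the numbers are placeholders, not results from this analysis.
toy_metrics <- tibble(
  neighbors = rep(c(1, 6, 11), each = 2),
  .metric   = rep(c("accuracy", "roc_auc"), times = 3),
  mean      = c(0.50, 0.60, 0.58, 0.66, 0.55, 0.64)
)

# Keep the accuracy rows and take the K with the highest mean accuracy.
best_k <- toy_metrics %>%
  filter(.metric == "accuracy") %>%
  arrange(desc(mean)) %>%
  slice(1) %>%
  pull(neighbors)

best_k
```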
