# STAT 301 Final Report 
#### group members and date

In [2]:
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(httr))
suppressPackageStartupMessages(library(utils))
suppressPackageStartupMessages(library(readr))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(MASS))
suppressPackageStartupMessages(library(car))
suppressPackageStartupMessages(library(infer))
suppressPackageStartupMessages(library(cowplot))
suppressPackageStartupMessages(library(broom))
suppressPackageStartupMessages(library(GGally))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))

## Introduction

- Provide relevant background information on the topic and motivate the question you are about to address.
- Formulate one or two questions for investigation and detail the dataset that will be utilized to address these questions, indicating if the primary goal is inference or prediction.
- Make sure that the question(s) can be answered with the data available.
- Align your question/objectives with the existing literature.
- To contextualize your study, include a minimum of two scientific publications (these should be listed in the References section).

## Methods and Results

### Data

- Read the data into R using reproducible code (i.e., from an open source and not a local directory in your server or computer)
- Include a citation of its source
- Include any information you have about data collection (e.g., observational vs experimental)
- Describe the variables as done in assignment 1
- If (absolutely) needed, indicate which variables will be pre-selected (or dropped) and provide a clear justification of your selection. Unless needed, you should use data-driven methods to select variables.

### Exploratory Data Analysis (EDA)

- Clean and wrangle your data into a tidy format.
- Include 2 effective and creative visualizations 
- Explore the association of some potential explanatory variables with the response (use colours, point types, point size and/or faceting to include more variables)
- Highlight potential problems (e.g., multicollinearity or outliers)
- Transform some variables if needed and include a clear explanation (e.g. log-transformation may be useful when outliers are present)
- Plot the relevant raw data, tailoring your plot to address your question.
- Make sure to explore the association of the explanatory variables with the response.
- Include any summary tables that are relevant to your analysis (e.g., summarize the number of observations in each group and indicate whether NAs exist).
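
The multicollinearity check mentioned in the list above can be sketched with a correlation matrix and variance inflation factors (VIFs). This is a minimal illustration on simulated data, not the placement dataset: the data frame `sim` and all of its values are hypothetical, and the VIF is computed by hand in base R (regressing each predictor on the others) rather than through `car::vif`.

```r
# Simulated predictors (hypothetical values, for illustration only)
set.seed(301)
n <- 200
cgpa <- rnorm(n, mean = 7.5, sd = 0.6)
sim <- data.frame(
  CGPA = cgpa,
  AptitudeTestScore = 80 + 8 * (cgpa - 7.5) + rnorm(n, sd = 3), # built to correlate with CGPA
  SoftSkillsRating = rnorm(n, mean = 4.3, sd = 0.4)              # generated independently
)

# Pairwise correlations between the numeric predictors
cor_mat <- round(cor(sim), 2)
print(cor_mat)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all remaining predictors
vif_manual <- sapply(names(sim), function(v) {
  fit <- lm(reformulate(setdiff(names(sim), v), response = v), data = sim)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif_manual, 2))
```

A VIF close to 1 suggests a predictor is nearly uncorrelated with the others; larger values flag predictors that may deserve a closer look before model fitting.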

In [3]:
# load data
url <- "https://raw.githubusercontent.com/anniew02/stat301-final-project/main/placementdata.csv" 
raw_data <- read.csv(url)
head(raw_data, 5)

# check for null values in dataset (there are none)
anyNA(raw_data)

| | StudentID | CGPA | Internships | Projects | Workshops.Certifications | AptitudeTestScore | SoftSkillsRating | ExtracurricularActivities | PlacementTraining | SSC_Marks | HSC_Marks | PlacementStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<chr>` |
| 1 | 1 | 7.5 | 1 | 1 | 1 | 65 | 4.4 | No | No | 61 | 79 | NotPlaced |
| 2 | 2 | 8.9 | 0 | 3 | 2 | 90 | 4.0 | Yes | Yes | 78 | 82 | Placed |
| 3 | 3 | 7.3 | 1 | 2 | 2 | 82 | 4.8 | Yes | No | 79 | 80 | NotPlaced |
| 4 | 4 | 7.5 | 1 | 1 | 2 | 85 | 4.4 | Yes | Yes | 81 | 80 | Placed |
| 5 | 5 | 8.3 | 1 | 2 | 2 | 86 | 4.5 | Yes | Yes | 74 | 88 | Placed |


In [4]:
data <- raw_data %>% 
    dplyr::select(-StudentID) %>%
    mutate(across(c(PlacementStatus, ExtracurricularActivities, PlacementTraining), as.factor)) %>%
    rename("WorkshopsCertifications" = "Workshops.Certifications")
head(data, 5)

| | CGPA | Internships | Projects | WorkshopsCertifications | AptitudeTestScore | SoftSkillsRating | ExtracurricularActivities | PlacementTraining | SSC_Marks | HSC_Marks | PlacementStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<fct>` | `<fct>` | `<int>` | `<int>` | `<fct>` |
| 1 | 7.5 | 1 | 1 | 1 | 65 | 4.4 | No | No | 61 | 79 | NotPlaced |
| 2 | 8.9 | 0 | 3 | 2 | 90 | 4.0 | Yes | Yes | 78 | 82 | Placed |
| 3 | 7.3 | 1 | 2 | 2 | 82 | 4.8 | Yes | No | 79 | 80 | NotPlaced |
| 4 | 7.5 | 1 | 1 | 2 | 85 | 4.4 | Yes | Yes | 81 | 80 | Placed |
| 5 | 8.3 | 1 | 2 | 2 | 86 | 4.5 | Yes | Yes | 74 | 88 | Placed |
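
The summary tables requested in the EDA list (group counts, NA checks, group means) could be produced along the following lines. This is only a sketch: to stay self-contained it rebuilds the five rows printed above in a hypothetical data frame named `toy` with a subset of the columns; in the report itself the same calls would presumably be applied to the full `data` object.

```r
# The five rows shown in the output above (subset of columns);
# `toy` is an illustrative stand-in for the full `data` object
toy <- data.frame(
  CGPA = c(7.5, 8.9, 7.3, 7.5, 8.3),
  AptitudeTestScore = c(65, 90, 82, 85, 86),
  PlacementTraining = factor(c("No", "Yes", "No", "Yes", "Yes")),
  PlacementStatus = factor(c("NotPlaced", "Placed", "NotPlaced", "Placed", "Placed"))
)

# Number of observations in each response class
class_counts <- table(toy$PlacementStatus)
print(class_counts)

# Number of missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Mean of the numeric variables within each placement group
group_means <- aggregate(
  cbind(CGPA, AptitudeTestScore) ~ PlacementStatus,
  data = toy,
  FUN = mean
)
print(group_means)
```

With only five rows these numbers carry no statistical meaning; the point is the shape of the tables, which scales directly to the full dataset.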


### Methods: Plan

- Describe in written English the methods you used to perform your analysis from beginning to end, and narrate the code that does the analysis.
- Provide a detailed justification of the method(s) used. The analysis must be based primarily on methods learned in the class (other methods can be used for comparison).
- Make sure that the analysis answers the question posed and that the proposed method is appropriate for the characteristics of the data.
- If variable selection methods are used, justify the method used and explain what data will be used.
- If various models will be compared, explain how you will select a final one.
- If included, describe the “Feature Selection” process and how and why you chose the covariates of your final model.
- Make sure to interpret/explain the results you obtain. 
- If inference is the aim of your project, a detailed interpretation of your fitted model is required, as well as a discussion of relevant quantities (e.g., which coefficients are statistically significant, how the significant coefficients are interpreted, and how well the model fits the data).
- A careful model assessment must be conducted.
- Include no more than 3 visualizations and/or tables to support your points. Ensure your tables and/or figures are labelled with a figure/table number and readable fonts.
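
Because `PlacementStatus` is binary, one candidate method for this plan is logistic regression. The sketch below is an assumption about how such a fit and its interpretation might look, not the analysis of this report: it uses simulated data (`sim`, generated from hypothetical coefficients), base R's `glm`, and Wald intervals from `confint.default`.

```r
# Simulated data (hypothetical; not the placement dataset)
set.seed(301)
n <- 500
sim <- data.frame(
  CGPA = rnorm(n, mean = 7.7, sd = 0.6),
  PlacementTraining = factor(sample(c("No", "Yes"), n, replace = TRUE))
)

# Hypothetical "true" model, used only to generate the toy response
eta <- -16 + 2 * sim$CGPA + 1 * (sim$PlacementTraining == "Yes")
sim$PlacementStatus <- factor(
  ifelse(runif(n) < plogis(eta), "Placed", "NotPlaced"),
  levels = c("NotPlaced", "Placed") # first level is the reference (failure)
)

# Logistic regression: models the log-odds of being "Placed"
fit <- glm(PlacementStatus ~ CGPA + PlacementTraining,
           data = sim, family = binomial)

# Odds ratios with 95% Wald confidence intervals
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
print(round(or_table, 3))

# In-sample classification accuracy at a 0.5 probability cutoff
pred <- ifelse(fitted(fit) > 0.5, "Placed", "NotPlaced")
accuracy <- mean(pred == sim$PlacementStatus)
print(accuracy)
```

An odds ratio above 1 with an interval that excludes 1 indicates a positive, statistically significant association with placement. In-sample accuracy is optimistic, so a held-out test set or cross-validation would give a fairer model assessment.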

## Discussion

- Interpret the results you obtained in the previous section with respect to the main question/goal of your project.
- Summarize what you found and the implications/impact of your findings.
- If relevant, discuss whether your results were what you expected to find.
- Discuss how your model could be improved.
- Discuss future questions/research this study could lead to.

## References

- At least two citations of literature relevant to the project.
- The citation format is your choice – just be consistent.
- Make sure to cite the source of your data as well.