# Exploratory Data Analysis and Visualization

## Objective

This notebook has three objectives:
1. Load the data from the web.
2. Clean and wrangle the data into a tidy format.
3. Propose a visualization that helps address the question and explore the data:
    * propose a high-quality plot, or a set of plots of the same kind;
    * explain why it is relevant to addressing the question or exploring the data.

In [1]:
# >>> Load Dependencies <<< #
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [2]:
# >>> Define paths to repository and data folder <<< #
repo_path  <- "https://raw.githubusercontent.com/fuminaba/STAT-301"
data_parent_path <- "/fuminaba-project/Data"

# >>> Load Training and Testing Data <<< #
data.train <- paste0(repo_path, data_parent_path, "/adult.data") %>% 
    read_csv(col_names = FALSE)
# skip = 1 drops the first line of adult.test before parsing
data.test  <- paste0(repo_path, data_parent_path, "/adult.test") %>% 
    read_csv(col_names = FALSE, skip = 1)

# >>> Define feature names and rename columns <<< #
feature_names <- c('age', 'workclass', 'fnlwgt', 'education',
                   'education-num', 'marital-status', 'occupation',
                   'relationship', 'race', 'sex', 'capital-gain', 
                   'capital-loss', 'hours-per-week', 'native-country',
                   'income')

names(data.train) <- feature_names
names(data.test) <- feature_names

# >>> For EDA, we will combine train and test data <<< #
data.all <- rbind(data.train, data.test)

Rows: 32561 Columns: 15
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (9): X2, X4, X6, X7, X8, X9, X10, X14, X15
dbl (6): X1, X3, X5, X11, X12, X13

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 16281 Columns: 15
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (9): X2, X4, X6, X7, X8, X9, X10, X14, X15
dbl (6): X1, X3, X5, X11, X12, X13

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALS
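
The second objective (cleaning the data into a tidy format) could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual cleaning step. It assumes that missing values are coded as `"?"` and that income labels in the test file carry a trailing period, as is commonly described for the Adult data; both should be checked against `data.all` before relying on them. The helper `clean_adult()` and the toy rows are hypothetical.

```r
library(dplyr)
library(stringr)

# Toy rows standing in for data.all (hypothetical values, not the real data)
toy <- tibble::tibble(
  age = c(39, 50, 38),
  workclass = c("State-gov", "?", "Private"),
  `native-country` = c("United-States", "United-States", "?"),
  income = c("<=50K", ">50K.", "<=50K.")
)

# Hypothetical helper: one possible cleaning pass
clean_adult <- function(df) {
  df %>%
    # Make column names syntactic: "native-country" -> "native_country"
    rename_with(~ str_replace_all(.x, "-", "_")) %>%
    # Assumption: "?" marks a missing value in character columns
    mutate(across(where(is.character), ~ na_if(.x, "?"))) %>%
    # Assumption: drop a trailing period so income labels agree across files
    mutate(income = str_remove(income, "\\.$"))
}

toy_clean <- clean_adult(toy)
```

On the real data the call would be `clean_adult(data.all)`; renaming first means later steps can refer to columns such as `native_country` without backticks.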

In [3]:
DataExplorer::create_report(data = data.all)



processing file: report.rmd




output file: D:/Fumi/STAT-301/2-EDA-and-Visualization/report.knit.md


Output created: report.html
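
The full report is convenient, but individual pieces of it can also be produced directly in the notebook. The sketch below uses `introduce()` and `plot_missing()` from DataExplorer on a small toy data frame; the toy values are hypothetical and only stand in for `data.all`.

```r
library(DataExplorer)

# Toy rows standing in for data.all (hypothetical values, not the real data)
toy <- data.frame(
  age = c(39, 50, 38, 53),
  workclass = c("State-gov", NA, "Private", "Private"),
  income = c("<=50K", ">50K", "<=50K", "<=50K"),
  stringsAsFactors = FALSE
)

# One-row summary: number of rows and columns, missing values, and so on
overview <- introduce(toy)

# Share of missing values per column, drawn as a bar chart
missing_plot <- plot_missing(toy)

# Not run here: passing the response column asks the report to include
# response-specific sections as well
# create_report(data = data.all, y = "income")
```

Calling the pieces one at a time keeps the output inside the notebook, which may be preferable when only one or two of the report's sections are needed.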


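For the third objective, one candidate visualization is a proportional bar chart of `income` within each level of a categorical feature. The sketch below assumes the question of interest concerns how income relates to the other features, which this section does not state explicitly. The toy rows are hypothetical, and `education` is only one possible choice of feature.

```r
library(ggplot2)

# Toy rows standing in for the cleaned data (hypothetical values)
toy <- data.frame(
  education = c("Bachelors", "Bachelors", "HS-grad", "HS-grad", "HS-grad", "Masters"),
  income = c(">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"),
  stringsAsFactors = FALSE
)

# position = "fill" rescales each bar to 1, so bars compare income shares
# rather than group sizes
income_by_education <- ggplot(toy, aes(y = education, fill = income)) +
  geom_bar(position = "fill") +
  scale_x_continuous(labels = scales::percent) +
  labs(
    x = "Share of respondents",
    y = "Education",
    fill = "Income",
    title = "Income bracket by education level (toy data)"
  )
```

A filled bar chart is relevant here because group sizes can differ widely between categories; normalizing each bar makes the income split comparable across levels, and the same plot can be repeated for other categorical features to form a set of plots of the same kind.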