/
Coffey_Lukas_Project_EDA.Rmd
executable file
·153 lines (132 loc) · 5.65 KB
/
Coffey_Lukas_Project_EDA.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
---
title: "Data Project EDA - CMDA 3654 SU19"
author: "Lukas Coffey"
date: "7/26/2019"
output:
pdf_document: default
html_document: default
geometry: margin = 0.5in
header-includes:
- \usepackage{wrapfig}
- \sffamily
- \usepackage{booktabs}
- \usepackage{longtable}
- \usepackage{array}
- \usepackage{multirow}
- \usepackage{wrapfig}
- \usepackage{float}
- \usepackage{colortbl}
- \usepackage{pdflscape}
- \usepackage{tabu}
---
```{r setup, include=FALSE, warning=FALSE, error=FALSE}
knitr::opts_chunk$set(echo = TRUE, tidy.opts=list(width.cutoff=80),
tidy=TRUE,
fig.path = "./figures/", # Store all figures here in relative path (make the folder first)
fig.align = "center",
fig.width = 7,
fig.height = 7,
message = FALSE, # Turn off load messages
warning = FALSE # Turn off warnings
)
getwd()
setwd("~/Documents/School/VT/CMDA3654/DataProject/")
```
```{r packages, include=FALSE}
library(tidyverse)
library(tidyselect)
library(readxl)
library(lubridate)
```
# Note
I was unsure whether or not to include my code, so I did include it here.
```{r dataio, results='hide', warning=FALSE, error=FALSE}
# Sales Data
orders <- read_csv("SampleSuperstoreData/Orders-Table1.csv")
returns <- read_csv("SampleSuperstoreData/Returns-Table2.csv")
# This is official census data put into comma-delimited format, you can find it here: https://gist.github.com/erichurst/7882666
zipcodes <- read_csv("ExtraData/zipcodes.txt",
col_types = list(
"Postal Code" = col_number(),
"Latitude" = col_double(),
"Longitude" = col_double()
),
col_names = c("Postal Code", "Latitude", "Longitude"),
skip = 1
)
order_data <- left_join(orders, returns)
order_data$Returned[is.na(order_data$Returned)] <- "No"
unique(order_data$Returned) # Successful join and clean of Returned column
# Transform to factors
order_data <- order_data %>%
mutate_each(funs(as.factor), c(5,8:11,13,15:16))
# Split date into years/months
date_col_names <- c("OrderMonth", "OrderDay", "OrderYear")
order_data <- order_data %>%
separate(`Order Date`,
into = date_col_names,
sep = "/",
remove = F,
convert = T)
# Same for shipping date
ship_col_names <- c("ShipMonth", "ShipDay", "ShipYear")
order_data <- order_data %>%
separate(`Ship Date`,
into = ship_col_names,
sep = "/",
remove = F,
convert = T)
# Convert OrderDate and ShipDate to date type
order_data <- order_data %>%
mutate_at(.funs = as.Date,
format = "%m/%d/%y",
.vars = c("Order Date", "Ship Date"))
# Join zip code data
order_data <- left_join(order_data, zipcodes)
# Remove excess columns
order_data <- order_data %>%
select(-OrderDay, -OrderYear, -ShipDay, -ShipYear, -ShipMonth)
```
# Orders Per Month
```{r months}
ggplot(order_data,
aes(x = as.factor(OrderMonth),
fill = Category)) +
geom_bar() +
labs(title = "Orders per Month, by Product Category",
y = "Orders Per Month",
x = "Month of the Year")
```
For this visual, I wanted to see which months of the year produced the most orders. I also wanted to know which categories were doing the best duing these months. We can see a clear upward trend at the end each year, and then a "restart" at the beginning of each year. Order volume is fairly consistent throughout the middle 4-5 months. This end of year spike would most likely correspond to Christmas sales, but I find it interesting that the proportion of categories does not vary much during those months.
# Orders Over Time
```{r years}
ggplot(order_data,
aes(x = `Order Date`,
fill = Region)) +
geom_histogram(bins = 28,
position = "fill") +
labs(title = "Proportion of Orders over Time, by Region",
y = "Proportion of Orders",
x = "Year")
```
Here I'm trying to visualize the proportion of orders going to each region. We don't see a sloping trend in one direction or another, but we do see consistency in order volume. This visual would lead us to believe that we should focus our shipping efforts in the East and West regions mainly, since they hold the largest proportion of orders. This also means that we could push our marketing and sales efforts harder in the Central and South regions; however, this might be difficult since the central U.S. is much less densely populated.
# Orders by Geographic Location
```{r map}
ggplot(na.omit(order_data),
aes(y = Latitude,
x = Longitude,
color = Category)) +
geom_point(aes(size = Sales,
shape = Returned)) +
labs(title = "Coordinates of Sales Region, by Product Category",
y = "Latitude",
x = "Longitude",
size = "Sales Amount (dollars)",
color = "Product Category",
shape = "Item Returned?") +
theme(legend.position = "bottom",
legend.direction = "vertical")
```
In order to make this plot, I found a data set of zip codes and their corresponding latitude/longitude, and joined it by the postal code column. When looking at this plot, we can see a bulk of orders going to the Northeastern U.S. Most of these orders appear to be in the Office Supplies category.
The biggest sales amounts come from orders in what look like big cities scattered across the U.S. I also joined the other datasheet of return data, mapped by Order ID so we can see where the most returns are coming from.
This may be too much data/variables for one plot, but I was curious to see if there were any trends that would not be visible otherwise.