# Demo 01 - Basic Analysis

The first level of exploratory data analysis we might want to perform is a basic analysis of our available data.  This includes summary statistics, evaluating data structure, and performing growth analysis.

To do this, we will use some built-in R functions as well as functionality available to us in the `tidyverse` package.

In [None]:
if(!require(tidyverse)) {
    install.packages("tidyverse", repos = "http://cran.us.r-project.org")
    library(tidyverse)
}

if(!require(odbc)) {
    install.packages("odbc", repos = "http://cran.us.r-project.org")
    library(odbc)
}

# ggplot2 is installed and attached with the tidyverse, so this explicit call is redundant but harmless.
library(ggplot2)

## Data Retrieval

In this first section, we will retrieve data for each major table in our data set.  Then we will run summary statistics on each.

In [None]:
conn <- DBI::dbConnect(odbc::odbc(), 
                      Driver = "SQL Server", 
                      Server = "localhost", 
                      Database = "ForensicAccounting", 
                      Trusted_Connection = "True")

In [None]:
buses <- DBI::dbGetQuery(conn, "SELECT BusID, DateFirstInService, DateRetired FROM dbo.Bus;")
employees <- DBI::dbGetQuery(conn, "SELECT EmployeeID, FirstName, LastName FROM dbo.Employee;")
expenseCategories <- DBI::dbGetQuery(conn, "SELECT ExpenseCategoryID, ExpenseCategory FROM dbo.ExpenseCategory;")
vendors <- DBI::dbGetQuery(conn, "SELECT VendorID, VendorName FROM dbo.Vendor;")
vendorExpenseCategories <- DBI::dbGetQuery(conn, "SELECT
	vec.VendorID,
	vec.ExpenseCategoryID,
	v.VendorName,
	ec.ExpenseCategory
FROM dbo.VendorExpenseCategory vec
	INNER JOIN dbo.Vendor v
		ON vec.VendorID = v.VendorID
	INNER JOIN dbo.ExpenseCategory ec
		ON vec.ExpenseCategoryID = ec.ExpenseCategoryID;")

### Buses

In [None]:
str(buses)

The first thing we want to do is clean up the dates.  Then we can get an idea of how long the buses have been in service.

In [None]:
buses$DateFirstInService <- lubridate::ymd(buses$DateFirstInService)
buses$DateRetired <- lubridate::ymd(buses$DateRetired)

In [None]:
summary(buses)
head(buses)

sum(is.na(buses$DateRetired))

There are 700 buses in our total inventory.  344 are still in service as of 2019.
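
The cells above clean up the dates but stop short of computing how long each bus has been in service. A minimal sketch of that calculation follows, assuming a small hypothetical data frame (`toyBuses`) in place of the real `buses` table and an assumed cutoff date (`asOfDate`) for buses that have not yet been retired.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows standing in for dbo.Bus; the real data comes from the query above.
toyBuses <- tibble(
    BusID = 1:3,
    DateFirstInService = ymd(c("2011-01-01", "2014-01-01", "2016-01-01")),
    DateRetired = ymd(c("2015-01-01", NA, NA))
)

# Assumed cutoff date for buses that are still in service.
asOfDate <- ymd("2019-01-01")

serviceLength <- toyBuses %>%
    mutate(
        # A missing retirement date means the bus is still active.
        StillInService = is.na(DateRetired),
        # Measure active buses up to the cutoff date instead.
        EndDate = coalesce(DateRetired, asOfDate),
        YearsInService = time_length(interval(DateFirstInService, EndDate), "years")
    )

serviceLength
```

Running `summary(serviceLength$YearsInService)` on the result would then give the spread of service lengths across the fleet.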

### Employees

We have 12 employees in total.

In [None]:
employees

### Expense Categories

We have 28 expense categories.  Each one has its own rough price, but we don't store any of that information directly in the database because different vendors offer different prices depending on market circumstances.

In [None]:
expenseCategories %>% arrange(ExpenseCategoryID)

### Vendors

There are 15 vendors.  Each vendor has its own specialties.

In [None]:
vendors %>% arrange(VendorID)

### Vendor Expense Categories

This is a listing of which vendors offer which services.

In [None]:
vendorExpenseCategories %>% arrange(VendorID, ExpenseCategoryID)

We can easily see how many different categories each vendor offers.

In [None]:
vendorExpenseCategories %>%
    group_by(VendorID, VendorName) %>%
    summarize(n = n())

We can also see that there are a few sole-source suppliers: these are the expense categories where `n` is 1 in the output below.  In an audit, we might investigate why only one vendor supplies each of these categories.

In [None]:
vendorExpenseCategories %>%
    group_by(ExpenseCategoryID, ExpenseCategory) %>%
    summarize(n = n())
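
Rather than scanning the counts by eye, the sole-source categories can be filtered out directly. The following is a minimal sketch using a hypothetical stand-in (`toyVendorExpenseCategories`) with the same columns as `vendorExpenseCategories`; the vendor and category names are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for vendorExpenseCategories; names are made up.
toyVendorExpenseCategories <- tibble(
    VendorID = c(1, 2, 2, 3),
    VendorName = c("Vendor A", "Vendor B", "Vendor B", "Vendor C"),
    ExpenseCategoryID = c(10, 10, 11, 12),
    ExpenseCategory = c("Tires", "Tires", "Glass", "Paint")
)

soleSource <- toyVendorExpenseCategories %>%
    # Attach the number of vendors per category to every row.
    add_count(ExpenseCategoryID, name = "NumberOfVendors") %>%
    # Keep only categories served by exactly one vendor.
    filter(NumberOfVendors == 1) %>%
    select(ExpenseCategoryID, ExpenseCategory, VendorID, VendorName) %>%
    arrange(ExpenseCategoryID)

soleSource
```

Using `add_count()` instead of `summarize()` keeps the vendor columns, so the result shows which vendor holds each sole-source category.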

## Growth Analysis

In this section, we want to see how things have changed over time.

### Active Buses

The first thing we will look at is how many active buses the agency has at its disposal each year.  To keep things simple, buses in this data set enter and leave service only once a year, so a snapshot taken on the first day of each calendar year is enough.

In [None]:
activeBuses <- DBI::dbGetQuery(conn, "SELECT
	c.CalendarYear,
	COUNT(*) AS NumberOfBuses
FROM dbo.Bus b
	INNER JOIN dbo.Calendar c
		ON b.DateFirstInService <= c.Date
		AND ISNULL(b.DateRetired, '2018-12-31') >= c.Date
WHERE
	c.CalendarDayOfYear = 1
	AND c.CalendarYear >= 2011
	AND c.CalendarYear < 2019
GROUP BY
	c.CalendarYear
ORDER BY
	c.CalendarYear;")

In [None]:
options(repr.plot.width=6, repr.plot.height=4) 
ggplot(activeBuses, aes(x = CalendarYear, y = NumberOfBuses)) +
    geom_point() +
    geom_line() +
    labs(x = "Calendar Year", y = "Number of Buses", title = "Number of Buses by Year") +
    ylim(0, 500) +
    theme_minimal()

We see a steady increase in the number of buses by year.  The number of buses is likely the biggest driver for our expenses, so we'd expect to see similar growth over time in expenses.
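
The first-of-year snapshot logic in the SQL query can also be reproduced in R, which is handy for spot-checking the counts. This is a sketch on a hypothetical `toyBuses` data frame, not the real table; the helper `countActiveOnJan1` is an illustrative name, and a missing `DateRetired` is treated as "still active", which matches the query's `ISNULL` default for the years it covers.

```r
# Hypothetical rows standing in for dbo.Bus.
toyBuses <- data.frame(
    BusID = 1:4,
    DateFirstInService = as.Date(c("2011-01-01", "2011-01-01", "2012-01-01", "2013-01-01")),
    DateRetired = as.Date(c(NA, "2012-01-01", NA, NA))
)

# Illustrative helper: count buses in service on January 1 of the given year.
countActiveOnJan1 <- function(year, buses) {
    snapshotDate <- as.Date(paste0(year, "-01-01"))
    isActive <- buses$DateFirstInService <= snapshotDate &
        (is.na(buses$DateRetired) | buses$DateRetired >= snapshotDate)
    sum(isActive)
}

toyActiveBuses <- data.frame(CalendarYear = 2011:2013)
toyActiveBuses$NumberOfBuses <- vapply(
    toyActiveBuses$CalendarYear,
    countActiveOnJan1,
    integer(1),
    buses = toyBuses
)

toyActiveBuses
```

Note that a bus retired on January 1 still counts as active for that year's snapshot, because the comparison is inclusive, just as in the SQL.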

### Invoices Per Year

The next thing we want to look at is the number of invoices per year our staff handle.

In [None]:
invoicesPerYear <- DBI::dbGetQuery(conn, "SELECT
	c.CalendarYear,
	COUNT(*) AS NumberOfInvoices
FROM dbo.LineItem li
	INNER JOIN dbo.Calendar c
		ON li.LineItemDate = c.Date
GROUP BY
	c.CalendarYear
ORDER BY
	c.CalendarYear;")

In [None]:
ggplot(invoicesPerYear, aes(x = CalendarYear, y = NumberOfInvoices)) +
    geom_point() +
    geom_line() +
    labs(x = "Calendar Year", y = "Number of Invoices", title = "Number of Invoices by Year") +
    theme_minimal()

We can see steady growth through most of the time frame but a huge spike in 2018.  A jump that large is out of line with the earlier trend and deserves a closer look.
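
One way to put a number on a spike like this is the year-over-year percent change. The sketch below uses made-up counts in a hypothetical `toyInvoicesPerYear` data frame; with the real data, `invoicesPerYear` would take its place.

```r
library(dplyr)

# Made-up counts for illustration only; these are not the real invoice numbers.
toyInvoicesPerYear <- tibble(
    CalendarYear = 2015:2018,
    NumberOfInvoices = c(100, 110, 121, 242)
)

invoiceGrowth <- toyInvoicesPerYear %>%
    arrange(CalendarYear) %>%
    mutate(
        # Previous year's count; NA for the first year in the series.
        PriorYear = lag(NumberOfInvoices),
        PercentChange = 100 * (NumberOfInvoices - PriorYear) / PriorYear
    )

invoiceGrowth
```

In this toy series the first two changes are 10% each and the last is 100%, which is the kind of break from trend that the plot shows visually.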

### Expenditures Per Year

Having seen a spike in invoices, it is also worth reviewing how much money we spend per year.

In [None]:
expendituresPerYear <- DBI::dbGetQuery(conn, "SELECT
	c.CalendarYear,
	SUM(li.Amount) AS TotalInvoicedAmount
FROM dbo.LineItem li
	INNER JOIN dbo.Calendar c
		ON li.LineItemDate = c.Date
GROUP BY
	c.CalendarYear
ORDER BY
	c.CalendarYear;")

In [None]:
ggplot(expendituresPerYear, aes(x = CalendarYear, y = TotalInvoicedAmount)) +
    geom_point() +
    geom_line() +
    scale_y_continuous(labels = scales::dollar) +
    labs(x = "Calendar Year", y = "Total Invoiced Amount", title = "Total Invoiced Amount by Year") +
    theme_minimal()

This looks suspicious.  We spent a little under \$1 million in 2017 and over \$2 million in 2018.  Yes, there are more buses in the fleet in 2018, but fleet growth was steady, so a jump of that size stands out.  We'll need to do more research and get back to it.
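
To separate fleet growth from spending growth, one option is to normalize expenditures by the number of active buses. The following is a rough sketch with invented figures; `toyExpenditures` and `toyActiveBuses` are hypothetical stand-ins for `expendituresPerYear` and `activeBuses`, which share the `CalendarYear` column.

```r
library(dplyr)

# Invented figures for illustration only; these are not results from the database.
toyExpenditures <- tibble(
    CalendarYear = c(2017, 2018),
    TotalInvoicedAmount = c(900000, 2000000)
)
toyActiveBuses <- tibble(
    CalendarYear = c(2017, 2018),
    NumberOfBuses = c(300, 400)
)

amountPerBus <- toyExpenditures %>%
    inner_join(toyActiveBuses, by = "CalendarYear") %>%
    # Spending per active bus removes the effect of a larger fleet.
    mutate(AmountPerBus = TotalInvoicedAmount / NumberOfBuses)

amountPerBus
```

If the per-bus amount also jumps sharply, as it does in this toy example, fleet growth alone does not explain the increase.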