In [None]:
customer_df = read.csv('Wholesale_customers_data.csv')

In [None]:
customer_df$Channel <- NULL
customer_df$Region <- NULL

In [None]:
dim(customer_df)

In the next few notebooks, we will do some unsupervised exploration of the `customer` table in our database.

> What does a data scientist do? PCA on the `customer` table. - Joshua Cook

# Basic Stats

In [None]:
install.packages('moments')

In [None]:
library(moments)

In [None]:
skewness(customer_df)

In [None]:
stats = data.frame(feature=colnames(customer_df))
stats['mean_'] = sapply(customer_df, mean)
stats['sd_'] = sapply(customer_df, sd)
stats['skewness_'] = sapply(customer_df, skewness)
stats
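
For reference, the moment-based skewness reported above can be sketched in base R as the third central moment divided by the second central moment raised to the power 3/2. The helper name `skew_by_hand` and the toy vectors are made up for illustration; this is a sketch of the formula, not the `moments` package source.

```r
# Hypothetical helper: moment-based skewness, m3 / m2^(3/2),
# where m2 and m3 are central moments computed with divisor n.
# Assumption for this sketch: missing values are simply dropped.
skew_by_hand <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m2 <- mean(centered^2)
  m3 <- mean(centered^3)
  m3 / m2^(3/2)
}

symmetric    <- c(1, 2, 3, 4, 5)   # toy data, symmetric about 3
right_skewed <- c(1, 1, 1, 2, 10)  # toy data with a long right tail

skew_by_hand(symmetric)     # zero for a symmetric vector
skew_by_hand(right_skewed)  # positive for a right tail
```

A large positive value in the `skewness_` column therefore points to a feature with a long right tail, where a handful of very large values pull the mean upward.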

# Sampling the Dataset 

In this notebook, we begin to explore the `customer` table by sampling the table. First, let's sample three random points and examine them. 

In [None]:
library(dplyr, warn.conflicts = FALSE)

In [None]:
set.seed(42)

In [None]:
# Named sample_rows to avoid masking base R's sample() function
sample_rows = sample_n(customer_df, 3)

In [None]:
sample_rows

In [None]:
stats

# Sampling for a Statistical Description

We are able to take the mean and standard deviation of the data, but what if we want to visualize it? 

Of course, this dataset is small, but we might want techniques that work even when the dataset is very large.

Let's start by looking at about 1% of the data, which is 5 of the 440 rows.

In [None]:
sample_1pct_1 = sample_n(customer_df, 5)

In [None]:
colMeans(sample_1pct_1)

### How does this compare to the actual mean?

In [None]:
colMeans(sample_1pct_1) - stats$mean_

Let's think about this in terms of the standard deviations.

In [None]:
(colMeans(sample_1pct_1) - stats$mean_)/stats$sd_
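
The subtract-then-divide step above can be wrapped in a small helper so that every new sample is scored the same way. This is a minimal sketch: it uses base R's `sample.int` rather than `dplyr::sample_n`, the function name `standardized_gap` is hypothetical, and `toy_df` is a synthetic stand-in for `customer_df`.

```r
# Hypothetical helper: draw n rows without replacement and report how far
# each column's sample mean sits from the full-data mean, in units of the
# full-data standard deviation.
standardized_gap <- function(data, n) {
  rows <- sample.int(nrow(data), n)  # row indices, no replacement
  sample_mean <- colMeans(data[rows, , drop = FALSE])
  full_mean   <- colMeans(data)
  full_sd     <- sapply(data, sd)
  (sample_mean - full_mean) / full_sd
}

# Synthetic stand-in for customer_df: 440 rows, right-skewed columns.
set.seed(42)
toy_df <- data.frame(Fresh = rexp(440, rate = 1 / 12000),
                     Milk  = rexp(440, rate = 1 / 5800))

gap <- standardized_gap(toy_df, 5)
gap
```

Values near 0 mean the sample mean is close to the full-data mean; values near 1 or -1 mean it is off by about one standard deviation.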

### Let's try it again

In [None]:
# A second, independent ~1% sample (still 5 rows, despite "2pct" in the name)
sample_2pct_1 = sample_n(customer_df, 5)

In [None]:
colMeans(sample_2pct_1)

### How does this compare to the actual mean?

In [None]:
colMeans(sample_2pct_1) - stats$mean_

Let's think about this in terms of the standard deviations.

In [None]:
(colMeans(sample_2pct_1) - stats$mean_)/stats$sd_

### How does it do?

### Repeatedly Sample

Let's draw 10 independent samples and average their means.

In [None]:
sample_means = colMeans(sample_n(customer_df, 5))

In [None]:
# Running average: after step i, sample_means is the mean of i + 1 sample means
for (i in 1:9) {
    sample_means = (sample_means*(i) + colMeans(sample_n(customer_df, 5)))/(i+1)
}

In [None]:
sample_means

In [None]:
(sample_means-stats$mean_)/stats$sd_

And 50 times.

In [None]:
sample_means = colMeans(sample_n(customer_df, 5))

In [None]:
for (i in 1:49) {
    sample_means = (sample_means*(i) + colMeans(sample_n(customer_df, 5)))/(i+1)
}

In [None]:
(sample_means-stats$mean_)/stats$sd_

And 100 times.

In [None]:
sample_means = colMeans(sample_n(customer_df, 5))

for (i in 1:99) {
    sample_means = (sample_means*(i) + colMeans(sample_n(customer_df, 5)))/(i+1)
}

(sample_means-stats$mean_)/stats$sd_
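
The update used in the loops above, `(sample_means*i + new)/(i+1)`, is a running average: after each step it holds the mean of every sample mean drawn so far. A minimal sketch of that equivalence follows, assuming a synthetic `toy_df` in place of `customer_df` and base R's `sample.int` in place of `sample_n`.

```r
# Synthetic stand-in for customer_df: 440 rows, right-skewed columns.
set.seed(42)
toy_df <- data.frame(Fresh = rexp(440, rate = 1 / 12000),
                     Milk  = rexp(440, rate = 1 / 5800))

# Ten sample means, each from 5 rows drawn without replacement.
draws <- lapply(1:10, function(i) colMeans(toy_df[sample.int(440, 5), ]))

# Running update, in the same form as the loops above.
running <- draws[[1]]
for (i in 1:9) {
  running <- (running * i + draws[[i + 1]]) / (i + 1)
}

# Averaging all ten sample means at once gives the same result.
all_at_once <- Reduce(`+`, draws) / length(draws)
all.equal(running, all_at_once)
```

The running form is convenient when the individual sample means should not be kept in memory, which matters more as the number of repetitions grows.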

### What do we notice?

### Take a larger sample

The results differ noticeably from one run to the next, which makes sense: each sample is only about 1% of the data!

What if we take a sample of 10% of the data (44 of the 440 rows)?

In [None]:
sample_means = colMeans(sample_n(customer_df, 44))

(sample_means-stats$mean_)/stats$sd_
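
One way to reason about the effect of sample size: for a simple random sample, the standard deviation of the sample mean is roughly `sd / sqrt(n)`, so the gap measured in standard deviations is typically on the order of `1 / sqrt(n)`. The sketch below evaluates that for the two sample sizes used here. It is an approximation that ignores the finite-population correction for sampling without replacement.

```r
# Approximate spread of a sample mean, in units of the data's standard
# deviation, for the sample sizes used in this notebook.
# Assumption: simple random sampling; the finite-population correction
# for drawing without replacement is ignored.
n <- c(5, 44)
typical_gap <- 1 / sqrt(n)
names(typical_gap) <- paste0("n=", n)
round(typical_gap, 3)
```

Under that approximation, a 5-row sample is typically off by about 0.45 standard deviations and a 44-row sample by about 0.15.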

### Is this sample good enough for plotting?

For a discussion of the common rule of thumb that 30 is a "large enough" sample size, see:

https://stats.stackexchange.com/questions/2541/is-there-a-reference-that-suggest-using-30-as-a-large-enough-sample-size

In [None]:
pairs(sample_n(customer_df, 44))