In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline

In [None]:
customer_df = pd.read_csv('../data/Wholesale_customers_data.csv')
customer_df.drop(['Channel', 'Region'], axis=1, inplace=True)

In [None]:
customer_df.shape

In the next few notebooks, we will do some unsupervised exploration of the `customer` table in our database.

> What does a data scientist do? PCA on the `customer` table. - Joshua Cook

# Basic Stats

In [None]:
from scipy.stats import skew

In [None]:
skew(customer_df)

In [None]:
import random

# Draw 2 distinct values from 0..9 (sampling without replacement)
random.sample(range(10), 2)

In [None]:
stats = customer_df.describe().T
stats['skew'] = skew(customer_df)
stats

# Sampling the Dataset 

In this notebook, we begin to explore the `customer` table by sampling it. First, let's draw three random rows and examine them.

In [None]:
# Seed NumPy's global RNG, which DataFrame.sample uses when random_state is not given
np.random.seed(42)

In [None]:
sample = customer_df.sample(3)

In [None]:
sample

In [None]:
stats

# Sampling for a Statistical Description

We are able to take the mean and standard deviation of the data, but what if we want to visualize it? 

Of course, this dataset is small, but we might want techniques that work even when the dataset is very large.

Let's start by looking at about 1% of the data, which here is 5 rows.

In [None]:
sample_1pct_1 = customer_df.sample(5)

In [None]:
sample_1pct_1.mean()
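
The cell above hard-codes the row count. As a minimal sketch, `DataFrame.sample` also accepts a `frac` argument, so the sample size follows the table size. The `toy_df` below is a hypothetical stand-in for `customer_df` (synthetic values, illustrative column names, and an assumed 440 rows), not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for customer_df: 440 rows of skewed, positive values.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame(
    rng.gamma(shape=2.0, scale=3000.0, size=(440, 3)),
    columns=["Fresh", "Milk", "Grocery"],  # illustrative names only
)

# frac asks for a share of the rows; pandas rounds frac * len(df) to a row count.
sample_1pct = toy_df.sample(frac=0.01, random_state=42)
sample_10pct = toy_df.sample(frac=0.10, random_state=42)

print(len(sample_1pct), len(sample_10pct))
```

With 440 rows, `frac=0.01` rounds down to 4 rows, one fewer than the 5 used in this notebook, so passing an explicit `n` is still the better choice when an exact count matters.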

### How does this compare to the actual mean?

In [None]:
sample_1pct_1.mean() - stats['mean']

Let's think about this in terms of the standard deviations.

In [None]:
(sample_1pct_1.mean() - stats['mean'])/stats['std']
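
To judge whether these standardized differences are large or small, a rough yardstick helps. The mean of `n` independent draws varies by about `std / sqrt(n)`, so in units of the standard deviation the typical error is about `1 / sqrt(n)`. The sketch below applies that to a hypothetical `toy_df` (synthetic stand-in data, not the real table), treating the rows as independent draws and ignoring the finite-population correction.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for customer_df (synthetic, illustrative column names).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame(
    rng.gamma(shape=2.0, scale=3000.0, size=(440, 3)),
    columns=["Fresh", "Milk", "Grocery"],
)

n_rows = 5
sample = toy_df.sample(n_rows, random_state=42)

# Error of the sample mean, in units of each column's standard deviation.
z = (sample.mean() - toy_df.mean()) / toy_df.std()

# Yardstick: typical size of that error for a sample of n_rows rows.
expected_scale = 1 / np.sqrt(n_rows)

print(z.round(2))
print(round(expected_scale, 2))
```

For a 5-row sample the yardstick is about 0.45, so differences of a few tenths of a standard deviation are what we should expect rather than a sign that something went wrong.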

### How does it do?


### Let's try it again

In [None]:
sample_1pct_2 = customer_df.sample(5)

In [None]:
sample_1pct_2.mean() - stats['mean']

In [None]:
(sample_1pct_2.mean() - stats['mean'])/stats['std']

### How does it do?

### Repeatedly Sample

Let's do it 10 times.

In [None]:
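# Collect the column means of 10 independent 5-row samples.
sample_means = []
for _ in range(10):
    sample_means.append(customer_df.sample(5).mean())

# Average the sample means, then express the error in standard deviations.
sample_means = np.array(sample_means)
(sample_means.mean(axis=0)-stats['mean'])/stats['std']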

And 50 times.

In [None]:
sample_means = []
for _ in range(50):
    sample_means.append(customer_df.sample(5).mean())

sample_means = np.array(sample_means)
(sample_means.mean(axis=0)-stats['mean'])/stats['std']

And 100 times.

In [None]:
sample_means = []
for _ in range(100):
    sample_means.append(customer_df.sample(5).mean())

sample_means = np.array(sample_means)
(sample_means.mean(axis=0)-stats['mean'])/stats['std']
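
The three cells above differ only in the number of repetitions, so the loop can be wrapped in a function. This is one possible refactoring, not the notebook's own code: `mean_of_means_z` and `toy_df` are hypothetical names, and the data is a synthetic stand-in for `customer_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for customer_df (synthetic, illustrative column names).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame(
    rng.gamma(shape=2.0, scale=3000.0, size=(440, 3)),
    columns=["Fresh", "Milk", "Grocery"],
)

def mean_of_means_z(df, n_rows, n_repeats, seed=0):
    """Average the means of n_repeats samples; return the error in std units."""
    rs = np.random.RandomState(seed)  # one RNG, advanced by every draw
    means = pd.DataFrame(
        [df.sample(n_rows, random_state=rs).mean() for _ in range(n_repeats)]
    )
    return (means.mean() - df.mean()) / df.std()

# One column per repetition count, mirroring the 10 / 50 / 100 cells above.
results = pd.DataFrame(
    {n_repeats: mean_of_means_z(toy_df, 5, n_repeats) for n_repeats in (10, 50, 100)}
)
print(results.round(3))
```

Building a `DataFrame` from the list of `Series` keeps the column labels, so the subtraction aligns by name rather than by position.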

### What do we notice?

### Take a larger sample

The small samples give totally different results from one draw to the next, which makes sense: each one is only about 1% of the data!

What if we take a sample of 10% of the data?

In [None]:
sample_10pct_1 = customer_df.sample(44)
(sample_10pct_1.mean() - stats['mean'])/stats['std']
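
A single 10% sample is only one draw, so it is worth checking how much the sample mean moves across many draws at each size. The sketch below compares 5-row and 44-row samples on a hypothetical `toy_df` (synthetic stand-in data); `spread_of_sample_means` is an illustrative helper, and the 200 repetitions are an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for customer_df (synthetic, illustrative column names).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame(
    rng.gamma(shape=2.0, scale=3000.0, size=(440, 3)),
    columns=["Fresh", "Milk", "Grocery"],
)

def spread_of_sample_means(df, n_rows, n_repeats=200, seed=0):
    """Std of the sample mean across repeated draws, in units of the column std."""
    rs = np.random.RandomState(seed)
    means = pd.DataFrame(
        [df.sample(n_rows, random_state=rs).mean() for _ in range(n_repeats)]
    )
    return means.std() / df.std()

spread_5 = spread_of_sample_means(toy_df, 5)
spread_44 = spread_of_sample_means(toy_df, 44)

print(pd.DataFrame({"n=5": spread_5, "n=44": spread_44}).round(2))
```

The 44-row column should come out noticeably smaller than the 5-row column, which is the sense in which a 10% sample is more trustworthy than a 1% sample.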

### Is this sample good enough for plotting?

For background on the rule of thumb that 30 is a large enough sample size, see https://stats.stackexchange.com/questions/2541/is-there-a-reference-that-suggest-using-30-as-a-large-enough-sample-size

In [None]:
sns.pairplot(sample_10pct_1, kind='reg')