# Introduction

This notebook is just here to try making up some ice cream shop data.

One thing I do want to generate is a dataset that involves causal inference concepts.

In [None]:
from bayes_tutorial.data import load_baseball
import pandas as pd
from pyprojroot import here
import namegenerator
from faker.providers import company

In [None]:
data = pd.read_csv(here() / 'data/baseballdb/core/Batting.csv')

In [None]:
from faker import Faker

f = Faker()

In [None]:
import janitor
import numpy as np

starter_data = (
    data
    .query("yearID == 2016")
    .select_columns(["playerID", "AB", "H"])
    .rename_columns(
        {
            "playerID":"shopname",
            "AB": "num_customers",  # this is the column that matters the most
            "H": "num_likes"        # this one isn't as important, because I will be generating data.
        }
    )
    .transform_column("shopname", lambda dummy : namegenerator.gen())
    .transform_column("shopname", lambda x: " ".join(x.split("-")))
    .transform_column("shopname", lambda x: x.capitalize())
    .join_apply(lambda x: x["num_likes"] / x["num_customers"] if x["num_customers"] > 0 else np.nan, "fraction_likes")
)

What we are going to do now is generate values of $p$ using a presumed hierarchical model.
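
As a rough sketch, the model that the generating code at the end of this notebook implements can be written down as follows. The symbols are my own notation, not something fixed elsewhere in the notebook: $i$ indexes shops, $c[i]$ is the owner index of shop $i$, $n_i$ is its number of customers, and $y_i$ is its number of likes.

$$
\begin{aligned}
\operatorname{logit}(p_i) &\sim \mathrm{Normal}\left(\mu_{c[i]},\ \sigma_{c[i]}\right) \\
y_i &\sim \mathrm{Binomial}\left(n_i,\ p_i\right)
\end{aligned}
$$

Each owner gets its own $\mu$ and $\sigma$ in logit space, so shops under the same owner share a central tendency but still vary individually.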

Firstly, a shop's likes and dislikes could be correlated with its parent chain.
(Some chains are run well, while others are not.)

The distribution of shops per company is such that most of them are independent and locally-owned businesses, while a few are large chains.

In [None]:
from scipy.stats import poisson

num_chain_held_stores = poisson(50).rvs(8)

In [None]:
# Was generated from the previous cell in one particular run
num_chain_held_stores = [55, 48, 48, 54, 44, 62, 38, 58]
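
Hard-coding the list pins down one particular run. Another option, sketched here, is to seed the draw so that re-running the cell gives the same numbers. The seed value is an arbitrary assumption, and the draw it produces will not match the hard-coded list above.

```python
from scipy.stats import poisson

# Hypothetical seed; any fixed integer gives a repeatable draw.
SEED = 42

first_draw = poisson(50).rvs(8, random_state=SEED)
second_draw = poisson(50).rvs(8, random_state=SEED)

# Both draws are identical because they share the same seed.
assert (first_draw == second_draw).all()
print(first_draw)
```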

There are 8 "chains", each holding the number of stores listed above. (Some healthy competition going on there!) For the purposes of generating data, there is also a 9th "chain", which is really just a placeholder for the independently-owned shops.

Let's now build the index that maps store to chain (or independent business).

To do this, we will work in two steps:

In [None]:
owner_indices = []
# Firstly, populate chain indices.
for i, n in enumerate(num_chain_held_stores):
    owner_indices.extend([i] * n)

# Secondly, give every independently-owned business the same placeholder index,
# which is one past the last chain index (8 here).
independent_idx = len(num_chain_held_stores)
owner_indices.extend([independent_idx] * (len(starter_data) - sum(num_chain_held_stores)))
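
The two steps above can also be written without an explicit loop. This is a minimal sketch using `np.repeat`; `n_shops` is a made-up total standing in for `len(starter_data)`, so the counts here are illustrative only.

```python
import numpy as np

num_chain_held_stores = [55, 48, 48, 54, 44, 62, 38, 58]
n_shops = 1000  # hypothetical total; the notebook uses len(starter_data)

n_chains = len(num_chain_held_stores)
n_independent = n_shops - sum(num_chain_held_stores)

# Chain k appears num_chain_held_stores[k] times.
chain_part = np.repeat(np.arange(n_chains), num_chain_held_stores)
# All independent shops share the placeholder index n_chains (8 here).
independent_part = np.full(n_independent, n_chains)

owner_indices = np.concatenate([chain_part, independent_part])
```

An array like this can be shuffled with `np.random.default_rng().shuffle(owner_indices)` in place of `random.shuffle`.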

Now, we shuffle them up!

In [None]:
from random import shuffle

shuffle(owner_indices)

In [None]:
starter_data.add_column("owner_idx", owner_indices).shuffle(reset_index=False)

Now, we are going to generate the $p$ for each of the shops.

Firstly, I'm going to start with a hard-coded population parameter. Most of the shops _are_ going to have a generally positive rating at about 0.7.

In [None]:
p_pop = 0.7

We are going to go into logit space because it allows us to _more easily_
reason about "central tendencies".
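
To illustrate why logit space is convenient, here is a small sketch: noise added around `logit(p_pop)` can take any real value, and `expit` always maps it back into the interval (0, 1). The seed and the noise scale of 1.0 are assumptions made only for this example.

```python
import numpy as np
from scipy.special import expit, logit

p_pop = 0.7
mu = logit(p_pop)  # log(0.7 / 0.3), roughly 0.85

rng = np.random.default_rng(0)  # hypothetical seed
# Draw around the central tendency in logit space (scale is assumed).
logit_draws = rng.normal(loc=mu, scale=1.0, size=5)

# Mapping back with expit always yields valid probabilities.
p_draws = expit(logit_draws)
print(p_draws)
```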

In [None]:
from scipy.special import logit, expit

In [None]:
expit(logit(p_pop))

Because there are 8 chains, plus the placeholder group for independent shops, I will generate a $p$ for each of them.

In [None]:
from scipy.stats import beta, norm  # beta is used for the draws below; norm is used later

logit(beta(13, 17).rvs(6))

In [None]:
beta(35, 8).rvs(2)

In [None]:
company_ps = [
    0.48427595,  # chain 0
    0.52588245,  # chain 1
    0.34491850,  # chain 2
    0.30949678,  # chain 3
    0.43965704,  # chain 4
    0.31991239,  # chain 5
    0.80628789,  # chain 6
    0.78982137,  # chain 7
    0.86220633,  # independently-held businesses
]

In [None]:
company_mus = logit(company_ps)
company_mus

In [None]:
from scipy.stats import expon

# scale=1/4 sets the mean to 0.25; a bare positional 1/4 would set `loc` instead.
expon(scale=1/4).rvs(9)

In [None]:
company_sigmas = np.array([
    0.21505173,  # chain 0
    0.60319852,  # chain 1
    0.30978955,  # chain 2
    0.16837932,  # chain 3
    0.14264645,  # chain 4
    0.54077756,  # chain 5
    0.18131425,  # chain 6
    0.16748833,  # chain 7
    1.20746328,  # independently-held businesses
])

Now, we can start drawing numbers!

In [None]:
data_generator = starter_data.add_column("owner_idx", owner_indices)
data_generator.shuffle(reset_index=False)

In [None]:
assert len(company_mus) == len(company_sigmas), f"{len(company_mus)} != {len(company_sigmas)}"

In [None]:
company_logit_p = norm(loc=company_mus, scale=company_sigmas)
company_logit_p

In [None]:
from scipy.stats import binom

(
    data_generator
    .add_column("mus", company_mus[data_generator["owner_idx"]])
    .add_column("sigmas", company_sigmas[data_generator["owner_idx"]])
    .join_apply(lambda x: norm(x["mus"], x["sigmas"]).rvs(), "logit_p")
    .join_apply(lambda x: binom(x["num_customers"], expit(x["logit_p"])).rvs(), "num_favs")
).to_csv(here() / "data/ice_cream_shop.csv")
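
For comparison, the same two-stage draw can be sketched in vectorized form with NumPy's `Generator` instead of building a SciPy distribution per row. The `toy` frame, `toy_mus`, `toy_sigmas`, and the seed are all made up for illustration and are not the notebook's data.

```python
import numpy as np
import pandas as pd
from scipy.special import expit

rng = np.random.default_rng(0)  # hypothetical seed

# Hypothetical stand-in for data_generator: four shops, two owners.
toy = pd.DataFrame({"num_customers": [120, 0, 300, 45], "owner_idx": [0, 1, 1, 0]})
toy_mus = np.array([-0.5, 1.5])     # assumed per-owner means in logit space
toy_sigmas = np.array([0.2, 0.3])   # assumed per-owner standard deviations

idx = toy["owner_idx"].to_numpy()

# Stage 1: one logit-scale draw per shop, centered on its owner's mean.
logit_p = rng.normal(loc=toy_mus[idx], scale=toy_sigmas[idx])
toy["logit_p"] = logit_p

# Stage 2: binomial likes out of each shop's customers.
toy["num_favs"] = rng.binomial(n=toy["num_customers"].to_numpy(), p=expit(logit_p))
print(toy)
```

A quick sanity check on output like this is that `num_favs` never exceeds `num_customers`, and that shops with zero customers get zero likes.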