In [1]:
version = "REPLACE_PACKAGE_VERSION"

## Experiment Design and Analysis
## School of Information, University of Michigan

## Week 1:
1. What are experiments?
2. Experimental Design
3. Lab vs. Field Experiments
4. Online Field Experiments

## Assignment Overview
### The objective of this assignment is to:

- Apply theory of experiment design and knowledge of analysis techniques to real experiment data.


### The total score of this assignment will be 10 points

### Resources:
- StatsModels
    - We recommend using a Python library called [StatsModels](https://www.statsmodels.org/stable/index.html) for data analysis.
    
    
- Optional Reading: [Holt, C. A., & Laury, S. K. (2002). Risk Aversion and Incentive Effects.](https://www.jstor.org/stable/3083270)


- Dataset used in this assignment: Fixed-Price Auction data [download csv file](assets/assignment1_data.csv)
    - Source for dataset: [Chen, Y., et al. Sealed bid auctions with ambiguity: Theory and experiments. (2007).](https://www.sciencedirect.com/science/article/pii/S0022053107000178)

In [2]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.stats.api as sms
from scipy import stats
# You may not need all of the libraries above, and that is OK.
data = pd.read_csv('assets/assignment1_data.csv') #Data for this assignment

In [3]:
pd.set_option('display.max_columns', None)

In [4]:
# View the readme file for this dataset (includes an explanation of the variable names)
!cat assets/assignment1_data_readme.md

# View the first few rows of the csv file
data.head()

### Assignment Topic: Data analysis of a laboratory experiment on first-price auctions

### Background:
The data files come from a laboratory experiment conducted at the University of Michigan.

There are ten experimental sessions, with eight subjects per session. In each session, subjects complete an auction task and a lottery task (Holt and Laury, 2002) in one of two orders.

   - In five of the ten sessions, subjects first complete a lottery task, followed by 30 rounds of auctions.
   - In the other five sessions, subjects first complete 30 rounds of auctions, followed by a lottery task.

At the end of each session, subjects complete a demographics survey.

### Data:

The data has the following variables:
   - treatment: the treatment received by the subject
   - session: the session in which the data was collected in the experiment
   - period: the period in the session (multiple periods per session)
   - subject: unique identifier for each subject
   - disttype: di

Unnamed: 0,treatment,session,period,subject,disttype,highdist,lowdist,group,v,b,highbid,lowbid,buy,buy_yes,buy_no,profit,cumprof,timeb,new,lottery_profit,choice1,choice2,choice3,choice4,choice5,choice6,choice7,choice8,choice9,choice10,error,ra,ra_adj,ra1,ra2,ra3,ra4,ra5,pclab,gender,male,female,ethnic,white,asian,african,hispanic,native,other,age,siblings,personality,optim,pessim,neither,emotions,anger,anxiety,confusion,contentment,fatigue,happiness,irritation,moodswings,withdrawal,major,sdmajor,major1,major2,major3,major4,major5
0,k1_8_exp_lot,061018_1,1,1,Low,0,1,1,48,40,52,40,Did Not Buy,0,1,0,0,8,1,160,1,1,1,1,1,1,2,2,2,2,0,6,6,0,0,1,0,0,1,male,1,0,White;,1.0,0.0,0.0,0.0,0,0,30,1,optimistic,1,0,0,contentment;happiness;,0,0,0,1,0,1,0,0,0,2,Electrical Engineering - Signal Processing (st...,0,1,0,0,0
1,k1_8_exp_lot,061018_1,1,2,High,1,0,4,76,15,51,15,Did Not Buy,0,1,0,0,8,1,160,1,1,1,1,1,1,2,2,2,2,0,6,6,0,0,1,0,0,2,male,1,0,African American;Hispanic;,0.0,0.0,0.5,0.5,0,0,28,0,optimistic,1,0,0,anxiety;happiness;,0,1,0,0,0,1,0,0,0,4,public health,0,0,0,1,0
2,k1_8_exp_lot,061018_1,1,3,High,1,0,3,73,53,53,6,Did Buy,1,0,20,20,12,1,200,1,1,1,1,1,1,1,2,2,2,0,7,7,0,0,0,1,0,3,female,0,1,White;,1.0,0.0,0.0,0.0,0,0,31,0,optimistic,1,0,0,contentment;,0,0,0,1,0,0,0,0,0,5,german and film and video studies,0,0,0,0,1
3,k1_8_exp_lot,061018_1,1,4,High,1,0,4,74,51,51,15,Did Buy,1,0,23,23,17,1,160,1,1,1,1,1,1,2,2,2,2,0,6,6,0,0,1,0,0,4,female,0,1,White;,1.0,0.0,0.0,0.0,0,0,19,1,neither,0,0,1,anxiety;fatigue;happiness;irritation;,0,1,0,0,1,1,1,0,0,5,spanish,0,0,0,0,1
4,k1_8_exp_lot,061018_1,1,5,Low,0,1,1,72,52,52,40,Did Buy,1,0,20,20,-2,1,160,1,1,1,1,1,1,1,2,2,2,0,7,7,0,0,0,1,0,5,male,1,0,Asian / Asian American;,0.0,1.0,0.0,0.0,0,0,27,1,optimistic,1,0,0,confusion;contentment;happiness;mood swings;,0,0,1,1,0,1,0,1,0,2,Engineering,0,1,0,0,0


## Part A (6 points)

Suppose subjects were randomly assigned to two treatment groups. We want to know whether the randomization was properly applied to these groups. In other words, we want to know whether the proportions of participants in each demographic group differ between the two treatments.

1. To determine whether the randomization worked, for each of the two treatments, modify the following ```stats_calculator``` function so that it takes the ```data``` dataframe as input and tabulates the mean, standard deviation, minimum, and maximum of the following variables: female, age, number of siblings, white, asian, african american, hispanic, and other ethnicities. (6 points)

**Round all results to two decimal places. Do not use percentages.**

In [5]:
def stats_calculator(provided_data):
    """
    Write the function so that it fills-in the mean, standard deviation, 
    minimum and maximum of the following variables: Female, Age, 
    Number of siblings, White, Asian, African American, Hispanic, and Other ethnicities.
    
    It should return a dataframe with these calculations based on the partially-completed dataframe below.
    
    """
    
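    # Work on a copy of the argument (not the global `data`) so the function uses whatever dataframe it is given
    data_clone = provided_data.copy()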
    
    mean_list = []
    std_dev_list = []
    maximum_list = []
    minimum_list = []
    
    mean_female = round(data_clone['female'].mean(), 2)
    mean_age = round(data_clone['age'].mean(), 2)
    mean_siblings = round(data_clone['siblings'].mean(), 2)
    mean_asian = round(data_clone['asian'].mean(), 2)
    mean_african = round(data_clone['african'].mean(), 2)
    mean_hispanic = round(data_clone['hispanic'].mean(), 2)
    mean_other = round(data_clone['other'].mean(), 2)
    mean_white = round(data_clone['white'].mean(), 2)
    mean_list.extend((mean_female, mean_age, mean_siblings, mean_white, mean_asian, mean_african, mean_hispanic, mean_other))
    
    
    
    std_female = round(data_clone['female'].std(), 2)
    std_age = round(data_clone['age'].std(), 2)
    std_siblings = round(data_clone['siblings'].std(), 2)
    std_asian = round(data_clone['asian'].std(), 2)
    std_african = round(data_clone['african'].std(), 2)
    std_hispanic = round(data_clone['hispanic'].std(), 2)
    std_other = round(data_clone['other'].std(), 2)
    std_white = round(data_clone['white'].std(), 2)
    
    std_dev_list.extend((std_female, std_age, std_siblings, std_white, std_asian, std_african, std_hispanic, std_other))
    
    max_female = round(data_clone['female'].max(), 2)
    max_age = round(data_clone['age'].max(), 2)
    max_siblings = round(data_clone['siblings'].max(), 2)
    max_asian = round(data_clone['asian'].max(), 2)
    max_african = round(data_clone['african'].max(), 2)
    max_hispanic = round(data_clone['hispanic'].max(), 2)
    max_other = round(data_clone['other'].max(), 2)
    max_white = round(data_clone['white'].max(), 2)
    
    maximum_list.extend((max_female, max_age, max_siblings, max_white, max_asian, max_african, max_hispanic, max_other))
    
    min_female = round(data_clone['female'].min(), 2)
    min_age = round(data_clone['age'].min(), 2)
    min_siblings = round(data_clone['siblings'].min(), 2)
    min_asian = round(data_clone['asian'].min(), 2)
    min_african = round(data_clone['african'].min(), 2)
    min_hispanic = round(data_clone['hispanic'].min(), 2)
    min_other = round(data_clone['other'].min(), 2)
    min_white = round(data_clone['white'].min(), 2)
    
    minimum_list.extend((min_female, min_age, min_siblings, min_white, min_asian, min_african, min_hispanic, min_other))
    
    stats_df = pd.DataFrame(columns=['variable','mean','std. dev.','max','min'])
    variables = ['female','age','siblings','white','asian','african','hispanic','other']
    stats_df['variable'] = variables
    stats_df['mean'] = mean_list
    stats_df['std. dev.'] = std_dev_list
    stats_df['max'] = maximum_list
    stats_df['min'] = minimum_list

#     for variable in stats_df['variable']:
#         stats_df.loc[stats_df['variable']==variable,'mean'] 
        
#         # fill in the rest of the values and be sure you pay attention to the given column and row names!
#         # YOUR CODE HERE
    

    return stats_df
    


Your function should return a dataframe with each of the variables and their completed statistics. Check that it does:

In [6]:
stats_calculator(data)

Unnamed: 0,variable,mean,std. dev.,max,min
0,female,0.64,0.48,1.0,0.0
1,age,22.51,3.49,31.0,18.0
2,siblings,1.64,1.2,5.0,0.0
3,white,0.47,0.5,1.0,0.0
4,asian,0.27,0.44,1.0,0.0
5,african,0.11,0.31,1.0,0.0
6,hispanic,0.07,0.25,1.0,0.0
7,other,0.07,0.27,1.0,0.0
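
The table above pools both treatments. Since the question asks about each treatment separately, one possible way to get a per-treatment breakdown is a `groupby` followed by `agg`. The sketch below runs on a small made-up dataframe (`toy` and its values are invented for illustration and are not taken from the experiment); only the column names mirror the assignment data.

```python
import pandas as pd

# Toy stand-in for the assignment data; all values are invented.
toy = pd.DataFrame({
    'treatment': ['k1_8_exp_lot'] * 4 + ['k1_8_lot_exp'] * 4,
    'female':    [1, 0, 1, 1, 0, 0, 1, 0],
    'age':       [20, 22, 24, 26, 19, 21, 23, 25],
})

# One row per treatment; columns are (variable, statistic) pairs.
per_treatment = (
    toy.groupby('treatment')[['female', 'age']]
       .agg(['mean', 'std', 'min', 'max'])
       .round(2)
)
print(per_treatment)
```

A single cell can then be read with a tuple column key, for example `per_treatment.loc['k1_8_exp_lot', ('age', 'mean')]`. Large gaps between the two rows would be a first hint that the groups are not balanced.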


In [7]:
"""Check that the function above outputs the (rounded) statistics"""
assert stats_calculator(data).iloc[0, 1] == 0.64, "Part A #1 female mean value differs"
assert stats_calculator(data).iloc[1, 2] == 3.49, "Part A #1 age std. dev value differs"

In [8]:
"""Hidden test Part A: Check function above outputs (rounded) statistics"""
# Hidden tests

'Hidden test Part A: Check function above outputs (rounded) statistics'

In [9]:
"""Part A: Check function above outputs (rounded) statistics"""
# Hidden tests

'Part A: Check function above outputs (rounded) statistics'

## Part B (4 points)

We can also use a more objective measure to assess whether our treatment groups were properly randomized.

1. Using a __t-test__ (make sure you use the _correct_ type of t-test) and the ```data``` dataframe again, analyze the differences between the two treatment groups (__k1_8_exp_lot__ and __k1_8_lot_exp__) for the female, age, and hispanic demographic variables by completing the following ```objective_randomization``` function. (4 points)

**Round all results to two decimal places. Do not use percentages.**
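
On the question of which t-test is "correct": the two treatments contain different subjects, so an independent two-sample test is the natural candidate (a paired test would need matched observations). The sketch below, on made-up numbers, shows SciPy's `ttest_ind` in its pooled-variance (Student) and unequal-variance (Welch) forms. The arrays `group_a` and `group_b` are hypothetical and are not the experiment's data.

```python
import numpy as np
from scipy import stats

# Made-up ages for two independent groups; not the experiment's data.
group_a = np.array([20, 22, 24, 26, 21, 23])
group_b = np.array([19, 21, 23, 25, 30, 20])

# Student's t-test: assumes both groups share a common variance (SciPy's default).
student = stats.ttest_ind(group_a, group_b, equal_var=True)

# Welch's t-test: drops the equal-variance assumption.
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(round(student.statistic, 2), round(student.pvalue, 2))
print(round(welch.statistic, 2), round(welch.pvalue, 2))
```

With equal group sizes, as here, the two t-statistics coincide and only the degrees of freedom (and therefore the p-values) differ. When group sizes or variances are clearly unequal, the choice between the two forms matters more.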

In [48]:
from scipy.stats import ttest_ind
def objective_randomization(provided_data):
    """
    
    Complete the function that takes the provided data and runs a t-test on the 
    female, age, and hispanic demographic variables between the two treatments
    and outputs the results in the following partially-completed dataframe.
    Round your results to the nearest hundredth.
    Tip: you can choose to use either the statsmodels stats library or the scipy stats library to calculate the t-statistic and p-value.
    
    """
    
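    # Split the dataframe passed in (not the global `data`) by treatment
    k1_8_exp_lot = provided_data[provided_data['treatment']=='k1_8_exp_lot']
    k1_8_lot_exp = provided_data[provided_data['treatment']=='k1_8_lot_exp']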

    ttest_female = ttest_ind(k1_8_lot_exp['female'], k1_8_exp_lot['female'])
    ttest_age = ttest_ind(k1_8_lot_exp['age'], k1_8_exp_lot['age'])
    ttest_hispanic = ttest_ind(k1_8_lot_exp['hispanic'], k1_8_exp_lot['hispanic'])
    
    
    

    ttest_df = pd.DataFrame(columns=['variable','t-statistic','p-value'])
    variables = ['female','age','hispanic']
    ttest_df['variable'] = variables
    
    t_statistic_list = [round(ttest_female[0], 2), round(ttest_age[0], 2), round(ttest_hispanic[0], 2)]
    p_value_list = [round(ttest_female[1], 2), round(ttest_age[1], 2), round(ttest_hispanic[1], 2)]
    
    
    ttest_df['t-statistic'] = t_statistic_list
    ttest_df['p-value'] = p_value_list
    
#     for variable in ttest_df['variable']:
#         g1 = data[data['treatment'] == 'k1_8_lot_exp'][variable]
#         # complete t-test and fill in values of dataframe
#         # YOUR CODE HERE

    
    return ttest_df
   
    


Your function should return a dataframe with each of the variables and their completed t-statistic and p-value across the treatments. 

Check that it does:

In [49]:
objective_randomization(data)

Unnamed: 0,variable,t-statistic,p-value
0,female,-0.23,0.82
1,age,-1.86,0.07
2,hispanic,-0.88,0.38
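
One caveat worth checking: the data appears to hold one row per subject per auction period, so a subject's demographic values repeat across rows and a row-level test counts each person many times. A hedged alternative is to collapse to one row per subject before testing. The sketch below uses invented data and assumes that `session` plus `subject` identifies a person; that assumption should be checked against the readme before relying on it. The helper `age_test` is hypothetical.

```python
import pandas as pd
from scipy import stats

# Invented panel: 4 subjects, 3 periods each; age is constant within a subject.
toy = pd.DataFrame({
    'treatment': ['k1_8_exp_lot'] * 6 + ['k1_8_lot_exp'] * 6,
    'session':   ['s1'] * 6 + ['s2'] * 6,
    'subject':   [1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2],
    'age':       [20, 20, 20, 24, 24, 24, 21, 21, 21, 27, 27, 27],
})

# Keep one row per person (assumed key: session + subject).
subjects = toy.drop_duplicates(subset=['session', 'subject'])

def age_test(frame):
    """Independent two-sample t-test on age between the two treatments."""
    a = frame.loc[frame['treatment'] == 'k1_8_lot_exp', 'age']
    b = frame.loc[frame['treatment'] == 'k1_8_exp_lot', 'age']
    return stats.ttest_ind(a, b)

row_level = age_test(toy)            # 6 observations per group
subject_level = age_test(subjects)   # 2 observations per group
print(round(row_level.pvalue, 2), round(subject_level.pvalue, 2))
```

In this toy example the row-level test reports a smaller p-value than the subject-level test purely because repeated rows inflate the sample size. Note that the assignment's expected values may be based on the row-level data, so this is a diagnostic rather than a replacement for the graded answer.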


In [50]:
"""Check that the function above outputs the required statistics"""
result = objective_randomization(data)
assert result.iloc[0, 1] == -0.23, "checking the value of the female t-statistic"

In [51]:
"""Part B # 1: Check function above outputs required statistics"""
# Hidden tests

'Part B # 1: Check function above outputs required statistics'