# CSS Lab: Online Experiments
This notebook is a working example of how to analyze online experiments.
The lab uses a social influence experiment modeled after [SDW2006].
You can analyze the data from the original experiment,
or data from your own experiment conducted using the provided oTree module [CSW2016].
The original experiment measured the popularity of songs, but the analysis can be done for any kind of cultural artifact.

## Contents
1. [Setup](#Setup)
    1. [Import Python libraries](#Import-Python-Libraries)
    1. [Load data](#Load-Data)
1. [Experiment](#Experiment)
    1. [Descriptive statistics](#Descriptive-statistics)
    1. [Gini coefficient](#Gini-coefficient)
    1. [Market share](#Market-share)
    1. [Unpredictability](#Unpredictability)
1. [References](#References)

## Setup
### Import Python Libraries
We will use several Python libraries that make it easier to analyze and plot data.

In [None]:
# Initialization
%pylab inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy as sp
import scipy.special as spspec

### Load Data
Now we'll define helper functions to read data from either the original experiment or from oTree's output.
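
As a rough sketch of the reshaping these helpers perform, the example below turns a small wide table (one row per participant, one `dl_<n>` column per song) into one row per world/song pair. The toy values and the three-song layout are assumptions made up for illustration; the real files have 48 song columns plus a timestamp.

```python
import pandas as pd

# Toy stand-in for the raw download table: one row per participant,
# one 0/1 column per song (three songs here instead of 48).
df_raw = pd.DataFrame({
    "user_id":  [1, 2, 3, 4],
    "world_id": [1, 1, 2, 2],
    "dl_1":     [1, 0, 1, 1],
    "dl_2":     [0, 0, 1, 0],
    "dl_3":     [1, 1, 0, 0],
})
dl_cols = ["dl_1", "dl_2", "dl_3"]

# Sum downloads per world, then reshape from wide to long
df_long = (
    df_raw.groupby("world_id")[dl_cols].sum()
    .reset_index()
    .melt(id_vars="world_id", var_name="song_id", value_name="download_count")
)
# Turn the column label "dl_3" into the integer song id 3
df_long["song_id"] = df_long["song_id"].str.replace("dl_", "", regex=False).astype(int)
df_long = df_long.sort_values(["world_id", "song_id"]).reset_index(drop=True)
print(df_long)
```

The helper classes below reach the same layout with explicit loops, which is slower but easier to follow step by step.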

In [None]:
# Class to manipulate data from original SDW2006 experiment
class SDWData(object):
    def __init__(self, path="external/CW", independent_world=9, num_worlds=9, num_songs=48):
        self.path = path
        self.independent_world = independent_world
        self.num_worlds = num_worlds
        self.num_songs = num_songs
        
    # Get a DataFrame with world_id and song_id columns
    def get_world_song(self, world=None):
        df_sdw = self.get_sdw(world)
        return self.sdw_to_world_song(df_sdw)
    
    # Load the relevant data from the original experiment
    def get_sdw(self, world=None):
        # Load data from SDW experiment 1
        # Load all worlds if world is None
        downloads_file = "{path}/musiclab_data/dynamics_downloads_download_w{world}_v{experiment}.txt"
        song_ids = range(1,self.num_songs+1)
        if world is None:
            world_ids = range(1, self.num_worlds+1)
        else:
            world_ids = [world]
        columns = ['user_id', 'world_id'] \
            + ["dl_{i}".format(i=i) for i in song_ids] \
            + ['timestamp']
        df_raw = pd.concat([
            pd.read_csv(
                downloads_file.format(path=self.path, world=w, experiment=1),
                comment="%",
                header=None,
                names=columns
            )
            for w in world_ids])
        return df_raw

    # Convert SDW2006 data to a more usable format
    def sdw_to_world_song(self, df_raw):
        col_world_id = []
        col_song_id = []
        col_count = []
        song_ids = range(1,self.num_songs + 1)
        # Get list of world ids present in df_raw
        world_ids = sorted(set(df_raw["world_id"]))
        for cur_world in world_ids:
            # Filter by world
            df_world = df_raw[df_raw["world_id"] == cur_world]
            for cur_song in song_ids:
                col_world_id.append(cur_world)
                col_song_id.append(cur_song)
                count = df_world["dl_{}".format(cur_song)].sum()
                col_count.append(count)
        df_downloads = pd.DataFrame({
            "world_id": col_world_id,
            "song_id": col_song_id,
            "download_count": col_count,
            "rating_count": 0.0,
            "mean_rating": 0.0
        })
        return df_downloads
    
    # Generate a DataFrame by randomly splitting the independent world into num_worlds worlds
    def sample_independent(self, num_worlds=2):
        df_sdw = self.get_sdw(self.independent_world)
        df_sdw["world_id"] = np.random.randint(0, num_worlds, len(df_sdw.index))
        return self.sdw_to_world_song(df_sdw)

class OTreeData(object):
    def __init__(self, data="data/cultural_market.csv", session=None, independent_world=0, num_worlds=4, num_songs=48):
        self.data = data
        self.session = session
        self.independent_world = independent_world
        self.num_worlds = num_worlds
        self.num_songs = num_songs
        
    # Load oTree data into a data frame
    def get_world_song(self, world=None):
        df_otree = self.get_otree(world)
        df_world_song = self.otree_to_world_song(df_otree)
        return df_world_song
    
    def get_otree(self, world=None):
        # Read csv in oTree format
        # We set low_memory=False so pandas can infer column types
        df_otree = pd.read_csv(self.data, low_memory=False)
        # Remove all but desired session
        df_otree = df_otree[df_otree['session.code'] == self.session]
        if world is not None:
            # The world is stored in the player.world column of the oTree export
            df_otree = df_otree[df_otree['player.world'] == world]
        return df_otree
        
    def otree_to_world_song(self, df_raw):
        # Generate list of songs and worlds
        song_ids = range(self.num_songs)
        # Get list of world ids present in df_raw
        world_ids = sorted(set(df_raw["player.world"]))
        # Count totals for each world/song combination
        col_world_id = []
        col_song_id = []
        col_download_count = []
        col_mean_rating = []
        col_rating_count = []
        for cur_world in world_ids:
            df_world = df_raw[df_raw["player.world"] == cur_world]
            for cur_song in song_ids:
                # Record song and world id
                col_world_id.append(cur_world)
                col_song_id.append(cur_song)
                # Count the number of downloads
                col_download_count.append(df_world["player.download_{}".format(cur_song)].sum())
                # Find the number of ratings and average rating
                rating_label = "player.rating_{}".format(cur_song)
                df_ratings = df_world[df_world[rating_label] > 0]
                col_rating_count.append(len(df_ratings))
                col_mean_rating.append(df_ratings[rating_label].mean())
        df_world_song = pd.DataFrame({
            "world_id": col_world_id,
            "song_id": col_song_id,
            "download_count": col_download_count,
            "rating_count": col_rating_count,
            "mean_rating": col_mean_rating
        })
        return df_world_song
    
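    # Generate a DataFrame by randomly splitting the independent world into num_worlds worlds
    def sample_independent(self, num_worlds=2):
        df_otree = pd.read_csv(self.data, low_memory=False)
        df_otree = df_otree[df_otree['session.code'] == self.session]
        # Keep only participants from the independent world
        df_otree = df_otree[df_otree['player.world'] == self.independent_world].copy()
        # Randomly reassign each participant to one of the new worlds
        df_otree["player.world"] = np.random.randint(0, num_worlds, len(df_otree.index))
        return self.otree_to_world_song(df_otree)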


The next cell will read in the data from the original Salganik, Dodds, and Watts experiment [SDW2006] and display the first few rows.

If you instead want to analyze data from oTree, remove the `#` from the beginning of the second line and enter the path to the CSV file downloaded from oTree and the session code of your experiment.

In [None]:
data = SDWData(path="external/CM", independent_world=9, num_worlds=9, num_songs=48)
#data = OTreeData(data="data/cultural_market.csv", session="3i8pw3kt", independent_world=0)
df_world_song = data.get_world_song()
df_downloads = df_world_song
df_world_song.head()

We also need to choose which quantity to analyze. The options are `download_count` (as in the original experiment) or `mean_rating` (suggested for the oTree experiment).

In [None]:
analysis_column = "download_count"

## Experiment

### Descriptive statistics
First we define some helper functions to calculate statistics about the experiment.
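
As a small illustration of the pooling these helpers rely on, the sketch below combines per-world mean ratings into a single mean by weighting each world's mean by its number of ratings. The `pooled_mean` name and the toy numbers are assumptions made up for this example, not values from either experiment.

```python
import numpy as np

def pooled_mean(means, counts):
    """Combine per-world mean ratings into one mean, weighted by rating counts."""
    means = np.asarray(means, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # Ignore worlds where the song received no ratings (mean is NaN)
    keep = ~np.isnan(means)
    total = counts[keep].sum()
    if total == 0:
        return float("nan")
    return float(np.dot(means[keep], counts[keep]) / total)

# Hypothetical song rated in three worlds; the third world has no ratings
print(pooled_mean([4.0, 2.0, np.nan], [3, 1, 0]))  # (4*3 + 2*1) / 4 = 3.5
```

Weighting by count keeps a world with a single rating from pulling the pooled mean as hard as a world with many ratings.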

In [None]:
# Count downloads for each song
def count_song_stats(df_downloads):
    # Create list of song_id values present in input
    song_ids = sorted(set(df_downloads["song_id"]))
    # Create DataFrame for songs
    df_songs = pd.DataFrame({"song_id":song_ids}) \
        .set_index("song_id")
    df_songs["download_count"] = [
        df_downloads[df_downloads["song_id"] == cur_song]["download_count"].sum()
        for cur_song in df_songs.index]
    # Calculate mean over all worlds
    df_songs["mean_rating"] = np.zeros(len(df_songs.index))
    df_songs["rating_count"] = np.zeros(len(df_songs.index))
    for cur_song in df_songs.index:
        df = df_downloads[df_downloads["song_id"] == cur_song]
        df = df[~np.isnan(df["mean_rating"])]
        total_rating = float(np.dot(df["mean_rating"], df["rating_count"]))
        total_count = float(df["rating_count"].sum())
        try:
            # Store the count for this song only, not the whole column
            df_songs.loc[cur_song, "rating_count"] = total_count
            mean_rating = total_rating / total_count
            df_songs.loc[cur_song, "mean_rating"] = mean_rating
        except ZeroDivisionError:
            df_songs.loc[cur_song, "mean_rating"] = np.nan
    return df_songs

# Count downloads for each world
def count_world_stats(df_downloads):
    # Create list of world_id values present in input
    world_ids = sorted(set(df_downloads["world_id"]))
    # Create DataFrame for worlds
    df_worlds = pd.DataFrame({"world_id":world_ids}) \
        .set_index("world_id")
    # Count downloads for each world
    df_worlds["download_count"] = [
        df_downloads[df_downloads["world_id"] == cur_world]["download_count"].sum()
        for cur_world in df_worlds.index
    ]
    # Calculate mean over all songs
    df_worlds["rating_count"] = np.zeros(len(df_worlds.index))
    df_worlds["mean_rating"] = np.zeros(len(df_worlds.index))
    for cur_world in df_worlds.index:
        df = df_downloads[df_downloads["world_id"] == cur_world]
        df = df[~np.isnan(df["mean_rating"])]
        total_rating = float(np.dot(df["mean_rating"], df["rating_count"]))
        total_count = float(df["rating_count"].sum())
        # Store the count for this world only, not the whole column
        df_worlds.loc[cur_world, "rating_count"] = total_count
        try:
            mean_rating = total_rating / total_count
            df_worlds.loc[cur_world, "mean_rating"] = mean_rating
        except ZeroDivisionError:
            df_worlds.loc[cur_world, "mean_rating"] = np.nan
    return df_worlds

After calculating statistics about the data set, we can use a histogram to visualize how frequent different values are.

* Are all values equally frequent?
* Is there a specific value that seems most frequent?
* How does the frequency of lower values compare to that of higher values?

In [None]:
# Plot histogram of download counts
df_songs = count_song_stats(df_downloads)
df_worlds = count_world_stats(df_downloads)
plt.hist([ x for x in df_songs[analysis_column] if not np.isnan(x)], bins=20)
plt.xlabel(analysis_column)
plt.ylabel("Frequency")

### Gini coefficient
The [Gini coefficient](https://en.wikipedia.org/wiki/Gini_coefficient) is a measure of how unequally a quantity is distributed. A value of 0 corresponds to a completely equal distribution, while a value of 1 corresponds to a single entity having the entire quantity while all others have nothing.

We can use the Gini coefficient to quantify how equally downloads, views, or ratings are distributed between items. 
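
To get a feel for the two extremes, here is a minimal sketch that computes the Gini coefficient from the absolute differences between all pairs of values. The `gini_vectorized` name and the toy inputs are assumptions for illustration only.

```python
import numpy as np

def gini_vectorized(x):
    """Gini coefficient from the absolute differences between all pairs."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]
    # Matrix of |x_i - x_j| for every ordered pair (i, j)
    diffs = np.abs(x[:, None] - x[None, :])
    return diffs.sum() / (2.0 * len(x) * x.sum())

print(gini_vectorized([5, 5, 5, 5]))   # perfectly equal: 0.0
print(gini_vectorized([0, 0, 0, 20]))  # one item has everything: 0.75
```

With this formula and a finite number of items `n`, the most unequal case gives `(n - 1) / n` rather than exactly 1, so the upper bound is only approached as the number of items grows.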

First we define some helper functions.

In [None]:
def gini(x):
    '''Given a list of counts `x`, return the gini coefficient.'''
    x = [xi for xi in x if not np.isnan(xi)]
    n = len(x)
    gini_num = sum([sum([abs(x_i - x_j) for x_j in x]) for x_i in x])
    gini_den = 2.0 * n * sum([x_i for x_i in x])
    return gini_num / gini_den

def get_world_gini(df_world_song):
    world_ids = sorted(set(df_world_song["world_id"]))
    df_worlds = pd.DataFrame({"world_id": world_ids}).set_index("world_id")
    df_worlds["gini"] = [
        gini(df_world_song.loc[df_world_song["world_id"] == cur_world, analysis_column])
        for cur_world in df_worlds.index]
    return df_worlds

The plot below shows the Gini coefficient for each world.

* How does the Gini coefficient of the independent world compare to the social influence worlds?

In [None]:
# Calculate and plot the gini coefficient for each world
df_world_gini = get_world_gini(df_world_song)
plt.bar(df_world_gini.index, df_world_gini["gini"])
plt.xticks(df_world_gini.index, df_world_gini.index)
plt.xlabel("World")
plt.ylabel("Gini coefficient")

### Market share
The market share represents the popularity of an artifact. It can be calculated from several possible quantities, including ratings and downloads.
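
For the download-based case, a market share is a song's downloads divided by all downloads in the same world. The sketch below shows one possible way to compute shares and within-world ranks with a pandas `groupby`; the toy counts are made up for illustration.

```python
import pandas as pd

# Hypothetical download counts for three songs in two worlds
df = pd.DataFrame({
    "world_id":       [0, 0, 0, 1, 1, 1],
    "song_id":        [0, 1, 2, 0, 1, 2],
    "download_count": [6, 3, 1, 2, 2, 4],
})

# Share of each world's downloads that went to each song
world_totals = df.groupby("world_id")["download_count"].transform("sum")
df["market_share"] = df["download_count"] / world_totals

# Rank within each world: 1 is the most popular, ties get the average rank
df["market_rank"] = df.groupby("world_id")["market_share"].rank(ascending=False)
print(df)
```

Shares within each world sum to 1, and the two tied songs in world 1 share rank 2.5.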

The helper functions below calculate the market share as well as the rank of an artifact's market share compared to all others.

In [None]:
# Calculate market shares
def get_market_share(df_world_song):
    '''Return a DataFrame containing song_id, world_id, and market_share columns.'''
    # Create a copy of the input to return
    df = df_world_song.copy()
    # Count the total downloads for each world
    df_worlds = count_world_stats(df)
    world_ids = df_worlds.index
    # Calculate the market share
    if analysis_column == 'mean_rating':
        df["market_share"] = [
            float(df.loc[i, "mean_rating"] * df.loc[i, "rating_count"]) \
                / float(df_worlds.loc[df.loc[i, "world_id"]]["rating_count"])
                / float(df_worlds.loc[df.loc[i, "world_id"]]["mean_rating"])
            for i in df.index]
    elif analysis_column == 'download_count':
        df["market_share"] = [
            float(df.loc[i, "download_count"]) \
                / float(df_worlds.loc[df.loc[i, "world_id"], "download_count"])
            for i in df.index]
    return df

# Calculate market share and rank for each song/world
def get_market_rank(df_world_song):
    # Get market share
    df_world_song["market_share"] = get_market_share(df_world_song)["market_share"]
    # Copy market share, and convert to rank one world at a time
    ranks = []
    for cur_world in sorted(set(df_world_song["world_id"])):
        df = df_world_song[df_world_song["world_id"] == cur_world].copy()
        df["market_rank"] = df["market_share"].rank(ascending=False)
        # Store results for this world in an array
        ranks.append(df)
    # Concatenate results for all worlds
    df_world_song['market_rank'] = pd.concat(ranks)['market_rank']
    # Remove nan entries
    nan_songs = list(df_world_song[np.isnan(df_world_song["market_share"])]["song_id"])
    df = df_world_song
    for cur_song in nan_songs:
        df = df[df["song_id"] != cur_song]
    return df

We plot the market shares (and ranks) of artifacts in social influence worlds as a function of their market shares (and ranks) in the independent world.
* Do you expect the market shares to be correlated between different worlds?
* How would you expect the plot to look if there is no social influence? Strong social influence?
* How do unpopular artifacts compare to moderately and very popular artifacts?

In [None]:
# Get market share and rank
df_market = get_market_rank(df_world_song)

# Create list of dependent worlds
world_ids = sorted(set(df_world_song["world_id"]))
dependent_worlds = [x for x in world_ids if x != data.independent_world]

# Create a figure
plt.figure(figsize=(8,4))
# Plot social influence market share vs independent market share
# Create subplots and use first
plt.subplot(1,2,1)
for cur_world in dependent_worlds:
    plt.plot(
        df_market[df_market["world_id"] == data.independent_world]['market_share'],
        df_market[df_market["world_id"] == cur_world]['market_share'], '.b')
plt.xlabel("Market share (Indep.)")
plt.ylabel("Market share (Social)")
# Plot social rank vs independent rank in second subplot
plt.subplot(1,2,2)
for cur_world in dependent_worlds:
    plt.plot(
        df_market[df_market["world_id"] == data.independent_world]['market_rank'],
        df_market[df_market["world_id"] == cur_world]['market_rank'], '.b')
plt.xlabel("Market rank (Indep.)")
plt.ylabel("Market rank (Social)")
plt.tight_layout()

### Unpredictability
Is the success of cultural artifacts more or less predictable when there is social influence? We can compare a given artifact's popularity in different social influence worlds to determine the unpredictability. For the independent case, we have to randomly divide the independent world into multiple worlds and compare between those.
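
As a minimal sketch of the measure, the unpredictability of one song can be taken as the average absolute difference between its market shares over every pair of worlds. The `song_unpredictability` name and the share values below are assumptions for illustration.

```python
from itertools import combinations

def song_unpredictability(shares):
    """Mean absolute difference in market share over all pairs of worlds."""
    pairs = list(combinations(shares, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# Hypothetical market shares of one song in three social influence worlds
print(song_unpredictability([0.10, 0.10, 0.10]))  # identical outcomes: 0.0
print(song_unpredictability([0.02, 0.10, 0.30]))  # divergent outcomes give a larger value
```

A song that ends up with the same share everywhere scores 0, and the score grows as its outcomes diverge between worlds.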

The helper functions below calculate the unpredictability of artifacts in both social influence and independent worlds.

In [None]:
def find_unpredictability(df_downloads):
    '''Return a DataFrame indexed by song_id with an `unpredictability` column.'''
    # Create the DataFrame and index from the provided download data
    song_ids = sorted(set(df_downloads["song_id"]))
    world_ids = sorted(set(df_downloads["world_id"]))
    df = pd.DataFrame({"song_id": song_ids}) \
        .set_index("song_id")
    # Get market share of each song in each world
    df_market = get_market_share(df_downloads)
    # Calculate and return the unpredictability based on equation in SDW2006
    df["unpredictability"] = [
        sum([
            sum([
                abs(
                    df_market[
                        (df_market["song_id"] == cur_song)
                        & (df_market["world_id"] == world_j)
                    ]["market_share"].sum()
                    - df_market[
                        (df_market["song_id"] == cur_song)
                        & (df_market["world_id"] == world_k)
                    ]["market_share"].sum()
                ) / spspec.comb(len(world_ids), 2)
                for k, world_k in enumerate(world_ids[j+1:])])
            for j, world_j in enumerate(world_ids)])
        for cur_song in df.index]
    return df

def compare_unpredictability(df_world_song):
    # Calculate unpredictability in social influence worlds
    df_songs = count_song_stats(df_world_song)
    df_social = df_world_song[df_world_song["world_id"] != data.independent_world]
    df_songs["unpredictability"] = find_unpredictability(df_social)["unpredictability"]
    # Calculate unpredictability in independent world
    unpredictability = []
    num_iter = 50
    for i in range(num_iter):
        df_indep_dl = data.sample_independent()
        u_i = find_unpredictability(df_indep_dl)["unpredictability"]
        unpredictability.append(u_i) 
    # Average results
    # Elements are pandas Series objects, which can be added to each other
    u = np.sum(unpredictability, axis=0) / float(num_iter)
    # Add to the song DataFrame
    df_songs["unpredictability_indep"] = u
    return df_songs

The plot below visualizes the unpredictability in social influence worlds vs independent worlds.
* Why would social influence have an effect on the predictability of an artifact's success?
* Would you expect social influence to make success more or less predictable?

In [None]:
# Plot the unpredictability for social and independent worlds
df_songs = compare_unpredictability(df_world_song)
u_social = df_songs["unpredictability"].sum() / data.num_songs
u_indep = df_songs["unpredictability_indep"].sum() / data.num_songs
plt.figure(figsize=(6,4))
plt.bar([1, 2], [u_social, u_indep])
plt.xticks([1,2], ["Social", "Independent"])
plt.ylim([0, max([u_social, u_indep])*2])
plt.xlabel("World")
plt.ylabel("Unpredictability")

## References

[SDW2006] Salganik, M. J., Dodds, P. S., & Watts, D. J. (2006). Experimental study of inequality and unpredictability in an artificial cultural market. _Science_, 311(5762), 854-856.

[CSW2016] Chen, D.L., Schonger, M., & Wickens, C. (2016). oTree - An open-source platform for laboratory, online and field experiments. _Journal of Behavioral and Experimental Finance_, 9, 88-97.