# Making fake data

For the sake of this workshop, we'll be using
some fake data from a fake experiment. This Jupyter
notebook contains code to create the data.

Let's say that we're interested in seeing how people bond
(or build rapport) with one another based on different
kinds of conversations. In our fake experiment, we'll bring
two strangers into the lab and ask them to have
a conversation with one another. They'll be randomly
assigned to either have a conversation about a really good thing
that happened to them in the past week ("celebration"
conversation) or a really bad thing that happened to them in 
the past week ("commiseration" conversation). Their conversation
will be audio-recorded. After they have the conversation,
they'll each individually answer a question about how much
they like their partner.

Let's say that this fake experiment is going to test
the similarity of positive emotion words versus negative
emotion words that people said during those
conversations. We want to see whether partners who use
similar negative and/or positive words in these two
different kinds of conversations end up liking
their partner more.

To figure out negative and positive word similarity,
let's say that we transcribed their conversations and
simply counted up the number of times per turn that
each person used negative words and positive words.
We then get each similarity score by correlating the two
partners' per-turn counts: once for negative words and
once for positive words.
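
To make that idea concrete, here's a minimal sketch of how one dyad's negative word similarity could be computed. The per-turn counts are made-up numbers, and we're assuming that the two partners' turns have already been paired up so that the counts line up; the notebook itself doesn't do this step, since it simulates the similarity scores directly.

```python
import numpy as np

# Hypothetical per-turn counts of negative words for the two partners
# in a single dyad (illustrative numbers only, not experiment data).
partner_a_negative = np.array([0, 2, 1, 3, 0, 1])
partner_b_negative = np.array([1, 2, 0, 3, 1, 1])

# Pearson correlation between the partners' per-turn counts.
# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
negative_word_similarity = np.corrcoef(partner_a_negative,
                                       partner_b_negative)[0, 1]

print(round(negative_word_similarity, 2))
```

The same calculation with positive word counts would give `positive_word_similarity`.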

This means our data will look something like this:
* **Dependent variable**: `rapport` (1-5 scale)
* **Independent variables**: 
    * `negative_word_similarity` (-1 to +1; correlation 
        of negative words used by conversation partners)
    * `positive_word_similarity` (-1 to +1; correlation 
        of positive words used by conversation partners)
    * `condition` (conversation type: 0 = "commiseration" conversation;
        1 = "celebration" conversation)

**Created by**: A. Paxton (University of 
California, Berkeley)

**Last modified**: 05 April 2018

***

## Import libraries

In [1]:
import pandas as pd
import numpy as np
import random

## Specify how many dyads we want

How many total dyads do we want?

In [3]:
dyad_n = 40

How many commiseration dyads do we want?

In [4]:
commiseration_dyads = 25

However many we have left over is how many
celebration dyads we have.

In [5]:
celebration_dyads = dyad_n - commiseration_dyads

## Add dyads to dataframe

Create dyad IDs and dataframe for commiseration condition.

In [6]:
commiseration_dyad_IDs = range(1, 
                               commiseration_dyads+1)

In [7]:
commiseration_dyad_df = pd.DataFrame({'dyad_ID': commiseration_dyad_IDs,
                                      'condition': 0})

Create dyad IDs and dataframe for celebration condition.

In [8]:
celebration_dyad_IDs = range(commiseration_dyads+1,
                             dyad_n+1)

In [9]:
celebration_dyad_df = pd.DataFrame({'dyad_ID': celebration_dyad_IDs,
                                   'condition': 1})

## Randomly generate rapport variables

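Let's make sure that rapport tends to be higher in the
commiseration condition than in the celebration condition.
The values below are passed to `random.triangular` as its mode
(the peak of the distribution), so they set the most likely
value rather than the exact mean.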

In [10]:
commiseration_mean = 4

In [11]:
celebration_mean = 2

Now, let's create a separate rapport distribution for each condition.

In [12]:
commiseration_rapport = [round(random.triangular(1,5, commiseration_mean), 0)
                         for dyad in range(0, commiseration_dyads)]

In [13]:
celebration_rapport = [round(random.triangular(1,5, celebration_mean), 0) 
                         for dyad in range(0, celebration_dyads)]

And now we add them to the dataframes.

In [14]:
commiseration_dyad_df['rapport'] = commiseration_rapport

In [15]:
celebration_dyad_df['rapport'] = celebration_rapport
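
As a quick check on what `random.triangular` does with that third argument, here's a small sketch. It assumes the same bounds (1 and 5) and the same peak (4) used for the commiseration condition; the seed and the number of draws are arbitrary choices made only for this illustration.

```python
import random

random.seed(2018)  # arbitrary seed, used only to make this sketch repeatable

low, high = 1, 5
mode = 4  # the value the notebook stores in commiseration_mean

# random.triangular takes (low, high, mode); the distribution peaks at `mode`.
draws = [random.triangular(low, high, mode) for _ in range(100_000)]

sample_mean = sum(draws) / len(draws)

# The mean of a triangular distribution is the average of its three parameters.
theoretical_mean = (low + high + mode) / 3

print(round(sample_mean, 2), round(theoretical_mean, 2))
```

With these settings the average sits below the peak of 4, which is fine for our purposes: we only need commiseration rapport to run higher than celebration rapport. Setting a seed like this at the top of the notebook would also make the fake data identical on every run.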

## Randomly create similarity scores

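Let's create them for the commiseration dyads. Each score is
the dyad's rapport (divided by 5) multiplied by a weight, plus
uniform random noise; anything above 1 is capped at 1.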

In [16]:
commiseration_positive = commiseration_dyad_df['rapport']/5 * .25 + [
        random.random() for dyad in range(0, commiseration_dyads)]

In [17]:
commiseration_positive = [round(score, 2) 
                          if score <= 1 
                          else 1 
                          for score in commiseration_positive ]

In [18]:
commiseration_negative = commiseration_dyad_df['rapport']/5 * .75 + [
            random.random() for dyad in range(0, commiseration_dyads)]

In [19]:
commiseration_negative = [round(score, 2) 
                          if score <= 1 
                          else 1 
                          for score in commiseration_negative ]

And for the celebration participants.

In [20]:
celebration_positive = celebration_dyad_df['rapport']/5 * .5 + [
        random.random() for dyad in range(0, celebration_dyads)]

In [21]:
celebration_positive = [round(score, 2) 
                          if score <= 1 
                          else 1 
                          for score in celebration_positive ]

In [22]:
celebration_negative = celebration_dyad_df['rapport']/5 * .3 + [
        random.random() for dyad in range(0, celebration_dyads)]

In [23]:
celebration_negative = [round(score, 2) 
                          if score <= 1 
                          else 1 
                          for score in celebration_negative ]

Again, we'll add them to the dataframe.

In [24]:
commiseration_dyad_df['negative_word_similarity'] = commiseration_negative

In [25]:
commiseration_dyad_df['positive_word_similarity'] = commiseration_positive

In [26]:
celebration_dyad_df['negative_word_similarity'] = celebration_negative

In [27]:
celebration_dyad_df['positive_word_similarity'] = celebration_positive
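
The four similarity variables above all follow the same recipe, so the steps could be wrapped in one function. Here's a minimal sketch of that idea; `simulate_similarity` and its `weight` argument are hypothetical names that aren't part of the original notebook, and the example rapport values are made up.

```python
import random

def simulate_similarity(rapport_scores, weight):
    """Weight each rapport score (divided by 5), add uniform noise, cap at 1."""
    scores = []
    for rapport in rapport_scores:
        raw_score = rapport / 5 * weight + random.random()
        # Cap at 1 so the score stays a valid correlation, then round.
        scores.append(round(min(raw_score, 1), 2))
    return scores

# Made-up rapport values, using the weight from the commiseration negative scores.
example_rapport = [2.0, 3.0, 4.0, 5.0]
example_scores = simulate_similarity(example_rapport, weight=0.75)

print(example_scores)
```

Note that this recipe never produces a score below 0, even though a real correlation could range from -1 to +1.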

## Create unified dataset

Now that we've created separate subsets for each condition,
we'll combine them to create a single experiment dataframe.

In [28]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
experiment_df = pd.concat([commiseration_dyad_df,
                           celebration_dyad_df]).reset_index(drop=True)

In [29]:
experiment_df

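| | condition | dyad_ID | rapport | negative_word_similarity | positive_word_similarity |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2.0 | 0.54 | 0.16 |
| 1 | 0 | 2 | 3.0 | 1.0 | 0.6 |
| 2 | 0 | 3 | 4.0 | 0.92 | 0.38 |
| 3 | 0 | 4 | 3.0 | 0.93 | 0.95 |
| 4 | 0 | 5 | 2.0 | 0.89 | 0.75 |
| 5 | 0 | 6 | 4.0 | 1.0 | 0.83 |
| 6 | 0 | 7 | 4.0 | 1.0 | 0.3 |
| 7 | 0 | 8 | 3.0 | 1.0 | 0.57 |
| 8 | 0 | 9 | 3.0 | 1.0 | 1.0 |
| 9 | 0 | 10 | 2.0 | 0.56 | 0.96 |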
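
Once the two subsets are combined, a few quick checks can confirm that nothing went wrong along the way. Here's a sketch using small made-up dataframes as stand-ins for the real ones (the `toy_` names and values are assumptions for illustration); the same three checks could be run on `experiment_df`.

```python
import pandas as pd

# Toy stand-ins for the two per-condition dataframes (values are made up).
toy_commiseration_df = pd.DataFrame({'dyad_ID': [1, 2, 3],
                                     'condition': 0,
                                     'rapport': [4.0, 3.0, 4.0]})
toy_celebration_df = pd.DataFrame({'dyad_ID': [4, 5],
                                   'condition': 1,
                                   'rapport': [2.0, 1.0]})

toy_experiment_df = pd.concat([toy_commiseration_df, toy_celebration_df],
                              ignore_index=True)

# How many dyads ended up in each condition?
dyads_per_condition = toy_experiment_df['condition'].value_counts().sort_index()

# Is mean rapport higher for commiseration (0) than for celebration (1)?
mean_rapport = toy_experiment_df.groupby('condition')['rapport'].mean()

# Does each dyad ID appear exactly once?
ids_are_unique = toy_experiment_df['dyad_ID'].is_unique

print(dyads_per_condition)
print(mean_rapport)
print(ids_are_unique)
```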


## Save to file

In [30]:
experiment_df.to_csv('../data/simulated_experiment_data.csv',
                     sep=',',
                     header=True,
                     index=False)
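
To confirm that a file like this reads back cleanly, we could reload it and compare it against what we wrote. Here's a minimal sketch that writes a tiny made-up dataframe to a temporary folder rather than to `../data/`, so it doesn't touch the real file; the dataframe contents and file name are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Tiny made-up dataframe with the same columns as the experiment data.
example_df = pd.DataFrame({
    'condition': [0, 1],
    'dyad_ID': [1, 26],
    'rapport': [4.0, 2.0],
    'negative_word_similarity': [0.92, 0.41],
    'positive_word_similarity': [0.38, 0.77],
})

# Write to a temporary directory, then read the file straight back in.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'example.csv')
    example_df.to_csv(csv_path, index=False)
    reloaded_df = pd.read_csv(csv_path)

# Same columns and same number of rows as what we saved?
print(list(reloaded_df.columns) == list(example_df.columns))
print(len(reloaded_df) == len(example_df))
```

Because we saved with `index=False`, the reloaded dataframe won't pick up an extra `Unnamed: 0` column.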