# Making fake data

For the sake of this workshop, we'll be using
some fake data from a fake experiment. This Jupyter
notebook contains code to create the data.

Let's say that we're interested in seeing how people bond
(or build rapport) with one another based on different
kinds of conversations. In our fake experiment, we'll bring
two strangers into the lab and ask them to have
a conversation with one another. They'll be randomly
assigned to either have a conversation about a really good thing
that happened to them in the past week ("celebration"
conversation) or a really bad thing that happened to them in 
the past week ("commiseration" conversation). 
Their conversation will be audio- and video-recorded.

Let's say that this fake experiment is going to test
the similarity of positive emotion words versus negative
emotion words that people said during those
conversations. We want to see whether partners who use
similar negative and/or positive words in these two
different kinds of conversations are rated as having
higher rapport by expert observers.

To figure out negative and positive word similarity,
let's say that we transcribed their conversations and
simply counted up the number of times per turn that
each person used negative words and positive words.
We then get each similarity score by correlating the two
partners' per-turn counts, separately for negative words
and for positive words.
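
As a rough sketch of that computation, the example below correlates two partners' per-turn counts with `np.corrcoef`. The counts and the `partner_a_negative` / `partner_b_negative` names are made up for illustration and are not part of the simulated dataset.

```python
import numpy as np

# Hypothetical per-turn counts of negative words for each partner
# (illustrative values only)
partner_a_negative = np.array([0, 2, 1, 3, 0, 1])
partner_b_negative = np.array([1, 2, 0, 3, 1, 1])

# Pearson correlation between the two partners' counts;
# np.corrcoef returns a 2x2 matrix, so we take the off-diagonal entry
negative_word_similarity = np.corrcoef(partner_a_negative,
                                       partner_b_negative)[0, 1]
print(round(negative_word_similarity, 2))
```

The same call on the positive-word counts would give the positive similarity score.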

To track rapport, let's say that we recruited 2
expert observers as raters. Let's say that we trained them
to continuously rate rapport from the videos using
a joystick-style method (cf. Sadler, Ethier, Gunn,
Duong, & Woody, 2009), creating a time series of ratings between
0 and 1. We then obtained a single rapport rating for the dyad
by taking the mean of the time series.
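
A minimal sketch of that averaging step is below. The ratings are invented, and averaging the two raters at each time point before averaging over time is an assumption of this sketch, not something the fake experiment specifies.

```python
import numpy as np

# Hypothetical continuous ratings (0 to 1) from two raters for one dyad
rater_1 = np.array([0.2, 0.4, 0.5, 0.7, 0.6])
rater_2 = np.array([0.3, 0.5, 0.5, 0.6, 0.8])

# Average across raters at each time point (assumption of this sketch),
# then average across time to get one rating for the dyad
moment_to_moment = np.mean([rater_1, rater_2], axis=0)
rapport = moment_to_moment.mean()
print(round(rapport, 2))
```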

This means our data will look something like this:
* **Dependent variable**: `rapport` (0 to 1; average of 
    continuous moment-to-moment rating by expert observers)
* **Independent variables**: 
    * `negative_word_similarity` (-1 to +1; correlation 
        of negative words used by conversation partners)
    * `positive_word_similarity` (-1 to +1; correlation 
        of positive words used by conversation partners)
    * `condition` (0 = "commiseration" conversation;
        1 = "celebration" conversation)

**Created by**: A. Paxton (University of 
Connecticut)

**Last modified**: 10 July 2020

***

## Import libraries

When programming, there are a lot of basic functions (or 
specific actions that are applied to data; e.g., addition,
subtraction) that are built into the core of the language. 
To expand the power of programming languages, people often
write and share additional *libraries* or *packages* that
have been developed to incorporate other functions. We
have to `import` each additional package or library before
we can use those functions, which we do below.

We can import entire libraries:

In [1]:
import random

We can assign shorter names to whole libraries
to make typing easier:

In [2]:
import pandas as pd
import numpy as np

And we can even just grab specific functions if we
don't need the whole package:

In [3]:
from sklearn.preprocessing import scale

## Specify how many dyads we want

How many total dyads do we want?

In [4]:
dyad_n = 40

How many commiseration dyads do we want?

In [5]:
commiseration_dyads = 25

However many we have left over is how many
celebration dyads we have.

In [6]:
celebration_dyads = dyad_n - commiseration_dyads

## Add dyads to dataframe

Create dyad IDs and dataframe for commiseration condition.

In [7]:
commiseration_dyad_IDs = range(1, 
                               commiseration_dyads+1)

In [8]:
commiseration_dyad_df = pd.DataFrame({'dyad_ID': commiseration_dyad_IDs,
                                      'condition': 0})

Create dyad IDs and dataframe for celebration condition.

In [9]:
celebration_dyad_IDs = range(commiseration_dyads+1,
                             dyad_n+1)

In [10]:
celebration_dyad_df = pd.DataFrame({'dyad_ID': celebration_dyad_IDs,
                                   'condition': 1})

## Randomly generate rapport variables

Whenever we rely on random number generation in programming, it's
critical to set a seed for our random number generator. This supports
*computational reproducibility*, or the ability to re-run the code
and get precisely the same values. Below, we draw random numbers from
both Python's built-in `random` module and NumPy's `np.random`. The two
keep separate states, so we need to seed both.

In [47]:
random.seed(30)
np.random.seed(30)
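
Python's built-in `random` module and NumPy's `np.random` keep separate generator states, so seeding one does not seed the other. The short check below is only a sketch, with an arbitrary seed and illustrative variable names. It seeds both, draws a value from each, and confirms that re-seeding reproduces the same draws.

```python
import random
import numpy as np

# Seed both generators, then draw one value from each
random.seed(30)
np.random.seed(30)
first_py, first_np = random.random(), np.random.normal(0, 1)

# Re-seeding and drawing again reproduces the same values
random.seed(30)
np.random.seed(30)
second_py, second_np = random.random(), np.random.normal(0, 1)

print(first_py == second_py, first_np == second_np)
```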

Let's make sure that our commiseration mean 
is higher than the celebration mean.

In [56]:
commiseration_mean = .6

In [57]:
commiseration_sigma = .2

In [58]:
celebration_mean = .3

In [59]:
celebration_sigma = .1

Now, let's create a separate rapport distribution for each condition.

In [60]:
commiseration_rapport = [round(np.random.normal(commiseration_mean,
                                                commiseration_sigma), 1)
                         for dyad in range(0, commiseration_dyads)]

In [62]:
celebration_rapport = [round(np.random.normal(celebration_mean,
                                              celebration_sigma), 1)
                       for dyad in range(0, celebration_dyads)]
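
Because a normal distribution is unbounded, individual draws can occasionally fall outside the 0 to 1 range described for `rapport`. One possible way to keep simulated values in range is to draw all values at once and clip them with `np.clip`. This is offered only as a sketch and is not what this notebook does. The `raw_rapport` and `clipped_rapport` names are hypothetical, and the seed is arbitrary.

```python
import numpy as np

np.random.seed(30)  # arbitrary seed for this sketch

commiseration_mean, commiseration_sigma = .6, .2
commiseration_dyads = 25

# Draw one value per dyad in a single vectorized call
raw_rapport = np.random.normal(commiseration_mean,
                               commiseration_sigma,
                               size=commiseration_dyads)

# Force values into [0, 1], then round to one decimal place
clipped_rapport = np.round(np.clip(raw_rapport, 0, 1), 1)
print(clipped_rapport.min(), clipped_rapport.max())
```

Clipping piles any out-of-range draws onto the boundaries, which slightly distorts the distribution; whether that matters depends on what the fake data are used for.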

And now we add them to the dataframes.

In [19]:
commiseration_dyad_df['rapport'] = commiseration_rapport

In [20]:
celebration_dyad_df['rapport'] = celebration_rapport

## Randomly create similarity scores

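Let's create them for the commiseration dyads. Each score is a
small rapport-dependent term plus uniform random noise, so that
similarity is loosely tied to rapport.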

In [21]:
commiseration_positive = commiseration_dyad_df['rapport']/5 * .25 + [
        random.random() for dyad in range(0, commiseration_dyads)]

In [22]:
commiseration_negative = commiseration_dyad_df['rapport']/5 * .75 + [
            random.random() for dyad in range(0, commiseration_dyads)]

And for the celebration dyads.

In [23]:
celebration_positive = celebration_dyad_df['rapport']/5 * .5 + [
        random.random() for dyad in range(0, celebration_dyads)]

In [24]:
celebration_negative = celebration_dyad_df['rapport']/5 * .3 + [
        random.random() for dyad in range(0, celebration_dyads)]
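
The expressions above add a pandas Series to a plain Python list, which pandas handles element by element as long as the lengths match. The toy example below uses made-up rapport values and hypothetical `toy_` names to show the pattern: a small rapport-based term plus uniform noise in [0, 1).

```python
import random
import pandas as pd

random.seed(30)  # arbitrary seed for this sketch

# Made-up rapport values for three dyads
toy_rapport = pd.Series([0.4, 0.9, 0.6])

# Uniform noise in [0, 1), one value per dyad
noise = [random.random() for dyad in range(len(toy_rapport))]

# Element-wise: (rapport / 5 * .75) + noise
toy_negative = toy_rapport / 5 * .75 + noise
print(toy_negative.round(2).tolist())
```

Scores built this way are non-negative whenever rapport is, and they can slightly exceed 1, so they do not span the full -1 to +1 range of a real correlation.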

Again, we'll add them to the dataframe.

In [25]:
commiseration_dyad_df['negative_word_similarity'] = [round(score, 2)
                                                     for score in commiseration_negative]

In [26]:
commiseration_dyad_df['positive_word_similarity'] = [round(score, 2)
                                                     for score in commiseration_positive]

In [27]:
celebration_dyad_df['negative_word_similarity'] = [round(score, 2)
                                                   for score in celebration_negative]

In [28]:
celebration_dyad_df['positive_word_similarity'] = [round(score, 2)
                                                   for score in celebration_positive]

## Create unified dataset

Now that we've created separate subsets for each condition,
we'll combine them to create a single experiment dataframe.

In [29]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
experiment_df = pd.concat([commiseration_dyad_df,
                           celebration_dyad_df]).reset_index(drop=True)

Let's say that we want to scale our positive and negative 
similarity scores, rather than using raw correlations.
We have a function that will help us (`scale`),
and we can get more information on how to use it by executing
this command:

In [30]:
?scale

Signature: scale(X, axis=0, with_mean=True, with_std=True, copy=True)
Docstring:
Standardize a dataset along any axis

Center to the mean and component wise scale to unit variance.

Read more in the :ref:`User Guide <preprocessing_scaler>`.

Parameters
----------
X : {array-like, sparse matrix}
    The data to center and scale.

axis : int (0 by default)
    axis used to compute the means and standard deviations along. If 0,
    independently standardize each feature, otherwise (if 1) standardize
    each sample.

with_mean : boolean, True by default
    If True, center the data before scaling.

with_std : boolean, True by default
    If True, scale the data to unit variance (or equivalently,
    unit standard deviation).

copy : boolean, o

Let's scale the word similarity scores.

In [31]:
experiment_df['negative_word_similarity_scaled'] = scale(experiment_df['negative_word_similarity'],
                                                 axis=0, with_mean=True, with_std=True, copy=True)

In [32]:
experiment_df['positive_word_similarity_scaled'] = scale(experiment_df['positive_word_similarity'],
                                                 axis=0, with_mean=True, with_std=True, copy=True)
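
With these settings, `scale` centers a column on its mean and divides by its standard deviation. A NumPy-only sketch of the same arithmetic is below. It uses made-up scores and assumes the population standard deviation (`ddof=0`); the `toy_` names are hypothetical.

```python
import numpy as np

# Made-up similarity scores
toy_scores = np.array([0.82, 0.70, 0.76, 0.74, 0.94])

# Subtract the mean, then divide by the population standard deviation
toy_scaled = (toy_scores - toy_scores.mean()) / toy_scores.std(ddof=0)

# The result has mean 0 and standard deviation 1 (up to floating-point error)
print(np.isclose(toy_scaled.mean(), 0), np.isclose(toy_scaled.std(), 1))
```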

How do our data look now?

In [33]:
experiment_df

| | dyad_ID | condition | rapport | negative_word_similarity | positive_word_similarity | negative_word_similarity_scaled | positive_word_similarity_scaled |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0.4 | 0.82 | 0.56 | 0.968296 | -0.005719 |
| 1 | 2 | 0 | 0.9 | 0.7 | 0.33 | 0.450721 | -0.7574 |
| 2 | 3 | 0 | 0.6 | 0.76 | 0.06 | 0.709508 | -1.639807 |
| 3 | 4 | 0 | 0.7 | 0.74 | 0.69 | 0.623246 | 0.419144 |
| 4 | 5 | 0 | 0.3 | 0.94 | 0.23 | 1.48587 | -1.084217 |
| 5 | 6 | 0 | 0.5 | 0.19 | 0.28 | -1.74897 | -0.920808 |
| 6 | 7 | 0 | 0.3 | 0.54 | 0.41 | -0.239378 | -0.495946 |
| 7 | 8 | 0 | 0.3 | 0.35 | 0.66 | -1.058871 | 0.321098 |
| 8 | 9 | 0 | 0.7 | 0.93 | 1.02 | 1.442739 | 1.497641 |
| 9 | 10 | 0 | 0.6 | 0.97 | 0.49 | 1.615264 | -0.234492 |


## Save to file

In [34]:
experiment_df.to_csv('../data/simulated_experiment_data.csv',
                     sep=',',
                     header=True,
                     index=False)
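
To check that a saved file round-trips cleanly, one option is to read it back with `pd.read_csv`. The sketch below does this with a tiny made-up dataframe written to a temporary directory, so it does not touch the real data file. The `toy_df` contents and the file name are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Tiny made-up dataframe with the same kinds of columns
toy_df = pd.DataFrame({'dyad_ID': [1, 2],
                       'condition': [0, 1],
                       'rapport': [0.4, 0.3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_experiment_data.csv')
    # Same options as above: comma-separated, header row, no index column
    toy_df.to_csv(toy_path, sep=',', header=True, index=False)
    reloaded_df = pd.read_csv(toy_path)

print(list(reloaded_df.columns))
```

Because `index=False` was used when saving, the reloaded dataframe has no extra index column.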