# Data Wrangling

# Simulate data for a Stroop experiment

## We need to simulate:
- participant IDs
- reaction times
- responses
    - calculate whether or not the responses are correct

## Simulate binary responses
- NumPy is short for "Numerical Python" and deals well with the array math we use in science
- We take a random choice from the list `["incorrect", "correct"]`
- `size=(50,)` sets a sample size of 50, meaning we make a random choice 50 times
- We give it a list of probabilities `p`, one per item in the list, written here as divisions such as `3./4`.  The correct and incorrect probabilities must sum to 1, but could have been written differently (for example `0.25` and `0.75`)
- We want more corrects than incorrects in the congruent condition, so we swap the fractions between congruent and incongruent in order to simulate that
    - I'm just picking numbers here, there's nothing concrete about these fractions
    - I picked 1/4 and 3/4 because they are easy

In [None]:
import numpy as np
congruent_responses = np.random.choice(["incorrect", "correct"], size=(50,), p=[1./4, 3./4])
incongruent_responses = np.random.choice(["incorrect", "correct"], size=(50,), p=[3./4, 1./4])
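
The calls above draw from NumPy's global random state, so the simulated responses change on every run. Here is a minimal sketch of a repeatable variant, assuming NumPy's `default_rng` generator. The seed of 42 and the `simulate_responses` helper are illustrative choices, not part of the original notebook.

```python
import numpy as np

def simulate_responses(rng, p_correct, n_trials=50):
    """Hypothetical helper: draw n_trials labels, 'correct' with probability p_correct."""
    return rng.choice(["incorrect", "correct"], size=n_trials,
                      p=[1 - p_correct, p_correct])

rng = np.random.default_rng(42)  # arbitrary seed, used only for repeatability
congruent_demo = simulate_responses(rng, p_correct=3 / 4)
incongruent_demo = simulate_responses(rng, p_correct=1 / 4)

# Proportion of 'correct' labels in each simulated condition
print((congruent_demo == "correct").mean(), (incongruent_demo == "correct").mean())
```

Re-creating the generator with the same seed reproduces the same draws, which is handy when you want others to see the same simulated data.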

#### Check our data

In [None]:
print("congruent", congruent_responses)
print("incongruent", incongruent_responses)

## Simulate reaction time data

The `random.triangular(low, high, mode)` function generates random numbers from a triangular distribution (not a normal one) in a range between `a` and `b`. The mean of a triangular distribution is `(a + b + mode) / 3`, so passing `3*x - a - b` as the mode makes `x` the mean reaction time.

In [None]:
import random

a = 0.5 # lowest possible reaction time
b = 6 # highest possible reaction time
reaction_time_incongruent = []
reaction_time_congruent = []
for i in range(50):
    x = 4 # target mean reaction time (incongruent)
    reaction_time_incongruent.append(random.triangular(a, b, 3*x - a - b))
    x = 3 # target mean reaction time (congruent)
    reaction_time_congruent.append(random.triangular(a, b, 3*x - a - b))
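
As a quick sanity check on the mode argument `3*x - a - b`: a triangular distribution has mean `(low + high + mode) / 3`, so the average of many draws should land near `x`. The sketch below assumes a large sample of 100,000 draws and an arbitrary seed; the names `target_mean` and `samples` are illustrative.

```python
import random

random.seed(0)  # arbitrary seed so the check is repeatable

a, b = 0.5, 6      # lowest and highest possible reaction time
target_mean = 4    # plays the role of x for the incongruent condition
mode = 3 * target_mean - a - b   # 5.5, which lies inside [a, b]

samples = [random.triangular(a, b, mode) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
print(round(sample_mean, 2))  # should be close to target_mean
```

With only 50 draws, as in the simulation itself, the sample mean will wander further from the target.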

## Put all the data together into a nice looking list
We can make a list of tuples by passing the output of the `zip()` function to `list()`.  The `zip()` function walks through the given lists in parallel and makes one tuple out of the corresponding values at each position.
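
To see what `zip()` does on a small scale, here is a toy sketch with made-up values (the lists `responses` and `reaction_times` are illustrative only):

```python
responses = ["correct", "incorrect", "correct"]   # made-up values
reaction_times = [2.5, 3.7, 3.1]

pairs = list(zip(responses, reaction_times))
print(pairs)  # [('correct', 2.5), ('incorrect', 3.7), ('correct', 3.1)]

# zip() stops at the shortest input, so unequal lengths silently drop items.
short = list(zip(responses, [1, 2]))
print(len(short))  # 2
```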

In [7]:
data_tuples = list(zip(congruent_responses, incongruent_responses, reaction_time_incongruent, reaction_time_congruent))
data_tuples

[('correct', 'incorrect', 2.479999245529081, 2.0050413698653076),
 ('correct', 'incorrect', 3.7360420077741066, 2.8900240212433252),
 ('correct', 'incorrect', 3.702279550921168, 3.6429790235407404),
 ('correct', 'incorrect', 3.940540847175789, 3.194375846994646),
 ('correct', 'incorrect', 5.456142345937917, 3.4881337529731096),
 ('correct', 'incorrect', 3.7571805434745844, 4.423734409990389),
 ('correct', 'correct', 4.061008474591707, 2.1245905187021545),
 ('correct', 'correct', 2.1958800229209023, 2.5232069821399974),
 ('correct', 'incorrect', 5.160250421926483, 3.7955272882494824),
 ('correct', 'incorrect', 5.286178993345807, 3.6213320824598414),
 ('correct', 'incorrect', 3.83787522430487, 3.3124145207799907),
 ('correct', 'incorrect', 3.374761676217749, 3.9270176501552223),
 ('incorrect', 'correct', 5.6711361944031315, 4.757078995780281),
 ('incorrect', 'correct', 3.3220031803379864, 4.682068525841227),
 ('correct', 'incorrect', 5.022698872075701, 2.477330574836642),
 ('correct', 'i

# Pandas DataFrame
A Pandas DataFrame is a Python object that holds a set of data in rows and columns, like a spreadsheet.

In [8]:
import pandas as pd  # it is customary to shorten pandas to pd because we'll be typing it a lot
df = pd.DataFrame(data_tuples)  # we called the DataFrame df, but you can use any name you want... it's a Python object.

## Viewing the data

You can view the whole DataFrame by typing `df` as usual.  But, perhaps your dataset is large and you just want to get a sense of how it's looking.  For that we can use the `head()` method call on our dataframe.  By default it shows the first 5 rows, or you can specify the number of rows you want to see.

In [9]:
df.head()

Unnamed: 0,0,1,2,3
0,correct,incorrect,2.479999,2.005041
1,correct,incorrect,3.736042,2.890024
2,correct,incorrect,3.70228,3.642979
3,correct,incorrect,3.940541,3.194376
4,correct,incorrect,5.456142,3.488134


## Column names

In [10]:
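# Caution: zip() above listed congruent_responses first, so the first two labels here are swapped relative to the simulated data.
df.columns = ["Incongruent Response", "Congruent Response", "Incongruent RT", "Congruent RT"]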

In [12]:
df.head(2)

Unnamed: 0,Incongruent Response,Congruent Response,Incongruent RT,Congruent RT
0,correct,incorrect,2.479999,2.005041
1,correct,incorrect,3.736042,2.890024


`df.columns` holds the column names, so you can print them if you want to see them.

In [13]:
for col in df.columns: 
    print(col)

Incongruent Response
Congruent Response
Incongruent RT
Congruent RT


## Stacking the data 

Many times we need our data to be in long format rather than wide format.  Our data is currently in wide format.  To get it into long format, we can `melt` the data.  This saves us lots of copying and pasting in Excel.

In [14]:
long_df = df.melt(id_vars=['Incongruent Response', 'Congruent Response'],
                  value_vars=['Incongruent RT', 'Congruent RT'],
                  var_name='Condition', value_name='RT')
long_df.head()

Unnamed: 0,Incongruent Response,Congruent Response,Condition,RT
0,correct,incorrect,Incongruent RT,2.479999
1,correct,incorrect,Incongruent RT,3.736042
2,correct,incorrect,Incongruent RT,3.70228
3,correct,incorrect,Incongruent RT,3.940541
4,correct,incorrect,Incongruent RT,5.456142
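
A toy sketch of wide-to-long reshaping may make `melt` easier to follow. It assumes a hypothetical `Participant` column and made-up reaction times; neither is in the dataset above.

```python
import pandas as pd

wide = pd.DataFrame({
    "Participant": [1, 2],            # hypothetical ID column
    "Incongruent RT": [4.1, 3.8],     # made-up values
    "Congruent RT": [3.0, 2.9],
})

# id_vars are repeated on every row; value_vars are stacked into one column
long = wide.melt(id_vars=["Participant"],
                 value_vars=["Incongruent RT", "Congruent RT"],
                 var_name="Condition", value_name="RT")
print(long)
```

The two-row wide table becomes four rows: one row per participant per condition.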


## Indexing data

### `iloc`

Here we use `iloc` to index our data by its location, just like you would do in Excel.  In Excel, you would find, for example, cell `A5`.  In Pandas, we do `df.iloc[4, 0]`.  Don't forget we start counting from 0 in Python, so the 5th row is 4, and the 1st column is 0.

In [15]:
df.iloc[4,0]

'correct'

### `loc`

Here we use `loc` to index our data by its label.  Don't forget we start counting from 0 in Python, so the 8th row has the label 7, and the 1st column is named `'Incongruent Response'`.

In [16]:
df.loc[7,'Incongruent Response']

'correct'

## `index`

In both cases above, we used the `index`, or what we thought was the row number, to find our cells.  Pandas created our index for us, and we just pretended it was the row number.  We could have used another column as the index instead.
If you don't have an index but want one, or if you want to change the index, you can use the method `.set_index()`.
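
Here is a minimal sketch of `.set_index()`, assuming a hypothetical `Participant` column with made-up IDs. Once the index is no longer a run of integers, the difference between label-based `loc` and position-based `iloc` becomes visible.

```python
import pandas as pd

demo = pd.DataFrame({
    "Participant": ["p01", "p02", "p03"],   # hypothetical IDs
    "RT": [2.5, 3.7, 3.1],                  # made-up reaction times
})
demo = demo.set_index("Participant")  # set_index returns a new DataFrame

by_label = demo.loc["p02", "RT"]    # look up by index label and column name
by_position = demo.iloc[1, 0]       # look up by row and column position
print(by_label, by_position)        # both refer to the same cell
```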

## Selecting Data

You can select data by calling the dataframe, and instead of putting the name of one column, you put in a list of column names that you want.  I'm applying the `head()` method here just to keep the output short.  It's not a part of the selection command.

In [17]:
df[['Incongruent RT', 'Congruent RT']].head()

Unnamed: 0,Incongruent RT,Congruent RT
0,2.479999,2.005041
1,3.736042,2.890024
2,3.70228,3.642979
3,3.940541,3.194376
4,5.456142,3.488134


### Selecting data by condition
You can use expressions to select the rows that match a value or satisfy a condition.

In [18]:
df[df["Incongruent Response"] == "correct"].head()

Unnamed: 0,Incongruent Response,Congruent Response,Incongruent RT,Congruent RT
0,correct,incorrect,2.479999,2.005041
1,correct,incorrect,3.736042,2.890024
2,correct,incorrect,3.70228,3.642979
3,correct,incorrect,3.940541,3.194376
4,correct,incorrect,5.456142,3.488134
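
Conditions can also be combined. The sketch below uses a small made-up DataFrame, with illustrative column names `Response` and `RT`, to show that a comparison produces a boolean Series and that two such masks can be joined with `&`.

```python
import pandas as pd

demo = pd.DataFrame({
    "Response": ["correct", "incorrect", "correct", "correct"],
    "RT": [2.5, 3.7, 4.2, 3.1],   # made-up reaction times
})

is_correct = demo["Response"] == "correct"   # boolean Series, one value per row
is_fast = demo["RT"] < 3.5

# Keep only rows where both conditions are True
fast_correct = demo[is_correct & is_fast]
print(fast_correct)
```

Use `|` for "or" and `~` for "not"; Python's plain `and`/`or` keywords do not work element-wise on a Series.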


## Subsetting data

### Subsetting by row

In [19]:
df2 = df[0:3]
df2

Unnamed: 0,Incongruent Response,Congruent Response,Incongruent RT,Congruent RT
0,correct,incorrect,2.479999,2.005041
1,correct,incorrect,3.736042,2.890024
2,correct,incorrect,3.70228,3.642979


### Subsetting by index location

In [20]:
df4 = df.iloc[2:4, 1:4]
df4

Unnamed: 0,Congruent Response,Incongruent RT,Congruent RT
2,incorrect,3.70228,3.642979
3,incorrect,3.940541,3.194376


## Calculate the mean reaction times using the `mean()` method.

In [21]:
print(df['Incongruent RT'].mean())
print(df['Congruent RT'].mean())

3.709285593254048
3.0428483128301345


## Calculate the median reaction times using the `median()` method.

In [22]:
print(df['Incongruent RT'].median())
print(df['Congruent RT'].median())

3.7191607793476376
3.0399235883988607


## Calculate the mode of responses using the `mode()` method.
Here we're calculating the mode of responses instead of reaction times because `mode()` returns the most frequently occurring value.  Our reaction times are all unique values, so `mode()` would just return every value in the original data.  To see the mode method in action, we can use it on the responses, and see what the most frequently occurring response is in each category.

In [23]:
print(df['Incongruent Response'].mode())
print(df['Congruent Response'].mode())

0    correct
dtype: object
0    incorrect
dtype: object
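
A small sketch with made-up values shows both behaviours of `mode()`: a single winner when one value repeats, and every value returned when they all tie.

```python
import pandas as pd

responses = pd.Series(["correct", "correct", "incorrect"])  # made-up labels
rts = pd.Series([2.5, 3.7, 3.1])                            # all unique

print(responses.mode().tolist())   # the single most frequent label
print(rts.mode().tolist())         # every value ties, so all come back, sorted
```

This is also why `mode()` returns a Series rather than a single value: there can be more than one mode.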


## Calculate the standard deviation of the reaction times using the `std()` method.

In [24]:
print(df['Incongruent RT'].std())
print(df['Congruent RT'].std())

1.1520929579550927
0.9069713659329325
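
One detail worth knowing: Pandas' `std()` computes the sample standard deviation (`ddof=1`) by default, while NumPy's `np.std` defaults to the population version (`ddof=0`). A short sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

rts = pd.Series([2.0, 4.0, 4.0, 6.0])   # made-up reaction times

sample_sd = rts.std()                    # Pandas default: ddof=1
population_sd = np.std(rts.to_numpy())   # NumPy default: ddof=0
print(sample_sd, population_sd, rts.std(ddof=0))
```

Passing `ddof=0` to the Pandas method makes the two agree.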


<h2>Functions &amp; Description</h2>
<p>Let us now understand the functions under Descriptive Statistics in Python Pandas. The following table lists the important functions &minus;</p>
<table class="table table-bordered">
<tr>
<th style="text-align:center;">Sr.No.</th>
<th style="text-align:center;">Function</th>
<th style="text-align:center;">Description</th>
</tr>
<tr>
<td style="text-align:center;">1</td>
<td style="text-align:center;">count()</td>
<td>Number of non-null observations</td>
</tr>
<tr>
<td style="text-align:center;">2</td>
<td style="text-align:center;">sum()</td>
<td>Sum of values</td>
</tr>
<tr>
<td style="text-align:center;">3</td>
<td style="text-align:center;">mean()</td>
<td>Mean of Values</td>
</tr>
<tr>
<td style="text-align:center;">4</td>
<td style="text-align:center;">median()</td>
<td>Median of Values</td>
</tr>
<tr>
<td style="text-align:center;">5</td>
<td style="text-align:center;">mode()</td>
<td>Mode of values</td>
</tr>
<tr>
<td style="text-align:center;">6</td>
<td style="text-align:center;">std()</td>
<td>Standard Deviation of the Values</td>
</tr>
<tr>
<td style="text-align:center;">7</td>
<td style="text-align:center;">min()</td>
<td>Minimum Value</td>
</tr>
<tr>
<td style="text-align:center;">8</td>
<td style="text-align:center;">max()</td>
<td>Maximum Value</td>
</tr>
<tr>
<td style="text-align:center;">9</td>
<td style="text-align:center;">abs()</td>
<td>Absolute Value</td>
</tr>
<tr>
<td style="text-align:center;">10</td>
<td style="text-align:center;">prod()</td>
<td>Product of Values</td>
</tr>
<tr>
<td style="text-align:center;">11</td>
<td style="text-align:center;">cumsum()</td>
<td>Cumulative Sum</td>
</tr>
<tr>
<td style="text-align:center;">12</td>
<td style="text-align:center;">cumprod()</td>
<td>Cumulative Product</td>
</tr>
</table>

https://www.tutorialspoint.com/python_pandas/python_pandas_descriptive_statistics.htm

## `describe()` gives a summary of the numerical data in a given dataset

In [25]:
df.describe()

Unnamed: 0,Incongruent RT,Congruent RT
count,50.0,50.0
mean,3.709286,3.042848
std,1.152093,0.906971
min,1.096242,1.01907
25%,2.842027,2.427086
50%,3.719161,3.039924
75%,4.605331,3.622555
max,5.671136,4.832019


### We can use `include=['object']` to see info about columns that contain objects, not numbers.  This includes text, like our response variables

In [26]:
df.describe(include=['object'])

Unnamed: 0,Incongruent Response,Congruent Response
count,50,50
unique,2,2
top,correct,incorrect
freq,35,39


### We can use `include='all'` to see all of that info at once.  `NaN` stands for `not a number`, which is Pandas' N/A value

In [27]:
df.describe(include='all')

Unnamed: 0,Incongruent Response,Congruent Response,Incongruent RT,Congruent RT
count,50,50,50.0,50.0
unique,2,2,,
top,correct,incorrect,,
freq,35,39,,
mean,,,3.709286,3.042848
std,,,1.152093,0.906971
min,,,1.096242,1.01907
25%,,,2.842027,2.427086
50%,,,3.719161,3.039924
75%,,,4.605331,3.622555
max,,,5.671136,4.832019


### Transforming data

Let's say we need to apply a function to our data.  This is often useful for looking at the data on a different scale.  Let's explore a natural log scale.

In [28]:
import numpy as np
df['Log of Incongruent RT'] = df['Incongruent RT'].transform(np.log)
df.head(3)

Unnamed: 0,Incongruent Response,Congruent Response,Incongruent RT,Congruent RT,Log of Incongruent RT
0,correct,incorrect,2.479999,2.005041,0.908258
1,correct,incorrect,3.736042,2.890024,1.318027
2,correct,incorrect,3.70228,3.642979,1.308949


### Another way to do that

In [29]:
df['Log of congruent RT'] = np.log(df['Congruent RT'])
df.head(3)

Unnamed: 0,Incongruent Response,Congruent Response,Incongruent RT,Congruent RT,Log of Incongruent RT,Log of congruent RT
0,correct,incorrect,2.479999,2.005041,0.908258,0.695665
1,correct,incorrect,3.736042,2.890024,1.318027,1.061265
2,correct,incorrect,3.70228,3.642979,1.308949,1.292802


## Calculating across axes using `apply`
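Sometimes you want to calculate across rows, sometimes down columns.  `apply` is the method to do that.

<div class="alert alert-info">
<ul>
    <li>Axis 0 runs down the rows, so the function is applied to each column</li>
    <li>Axis 1 runs across the columns, so the function is applied to each row</li>
</ul>
</div>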



Let's calculate the mean reaction time for each trial.
- First we subset the data

In [30]:
rtData = df.iloc[:, 2:4]  # all rows, the two RT columns (":-1" would drop the last trial)
rtData.head(2)

Unnamed: 0,Incongruent RT,Congruent RT
0,2.479999,2.005041
1,3.736042,2.890024


In [31]:
df['Mean Reaction Time'] = rtData.apply(np.mean, axis=1)
df.head()

Unnamed: 0,Incongruent Response,Congruent Response,Incongruent RT,Congruent RT,Log of Incongruent RT,Log of congruent RT,Mean Reaction Time
0,correct,incorrect,2.479999,2.005041,0.908258,0.695665,2.24252
1,correct,incorrect,3.736042,2.890024,1.318027,1.061265,3.313033
2,correct,incorrect,3.70228,3.642979,1.308949,1.292802,3.672629
3,correct,incorrect,3.940541,3.194376,1.371318,1.161392,3.567458
4,correct,incorrect,5.456142,3.488134,1.696742,1.249367,4.472138
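
To make the axis argument concrete, here is a toy sketch on a two-row DataFrame with made-up values. It uses a small lambda rather than `np.mean`, which is just one way to write the function being applied.

```python
import pandas as pd

demo = pd.DataFrame({"Incongruent RT": [4.0, 3.0], "Congruent RT": [2.0, 1.0]})

column_means = demo.apply(lambda s: s.mean(), axis=0)  # one mean per column
row_means = demo.apply(lambda s: s.mean(), axis=1)     # one mean per row (trial)
print(column_means.tolist(), row_means.tolist())       # [3.5, 1.5] [3.0, 2.0]
```

For a simple mean, the built-in `demo.mean(axis=1)` gives the same per-row result without `apply`.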


## Calculate proportion correct by applying a conditional count method.

### This is like `COUNTIF` in Excel

In [32]:
proportion_correct_incongruent = (df[df["Incongruent Response"] == "correct"].count(axis=0) / 50).iloc[0]
proportion_correct_congruent = (df[df["Congruent Response"] == "correct"].count(axis=0) / 50).iloc[0]
print(proportion_correct_incongruent, proportion_correct_congruent)

0.7 0.22


That was a complicated statement.  Let's break it down.
- `df[df["Incongruent Response"] == "correct"]` is the exact same conditional search we used earlier when selecting data by condition
- we apply the `count()` method on `axis=0`, which counts down the rows and gives one count per column
- So we're counting the number of cells in each column that belong to rows with a correct incongruent response
- `50` is the number of trials.  
    - When we divide the count by the number of trials, we get a proportion.
- The `count()` call returns a Pandas Series with one value per column
    - encapsulate the statement in `()`
    - ask for the first element in the series by position with `.iloc[0]`
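
A shorter route to the same kind of proportion, sketched on made-up data: the comparison gives a boolean Series, and because `True` counts as 1 and `False` as 0, its mean is the proportion of matches. The `Response` column name here is illustrative.

```python
import pandas as pd

demo = pd.DataFrame({"Response": ["correct", "incorrect", "correct", "correct"]})

is_correct = demo["Response"] == "correct"
proportion_correct = is_correct.mean()   # True counts as 1, False as 0
print(proportion_correct)                # 0.75
```

This avoids hard-coding the number of trials, since the mean divides by the number of rows automatically.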