# Analyzing Data in Pandas

# Simulate the data set again

In [2]:
import random
import numpy as np
import pandas as pd

# Simulate responses
# Incongruent trials are mostly incorrect; congruent trials are mostly correct
incongruent_responses = np.random.choice(["incorrect", "correct"], size=(50,), p=[3./4, 1./4])
congruent_responses = np.random.choice(["incorrect", "correct"], size=(50,), p=[1./4, 3./4])

# Simulate reaction times
a = 0.5 # lowest possible reaction time
b = 6 # highest possible reaction time
reaction_time_incongruent = []
reaction_time_congruent = []
for i in range(50):
    x = 4 # target mean reaction time; 3*x - a - b is the mode that gives this mean
    reaction_time_incongruent.append(random.triangular(a, b, 3*x - a - b))
    x = 3 # target mean reaction time
    reaction_time_congruent.append(random.triangular(a, b, 3*x - a - b))

# Compile data (the order inside zip must match the order of the column names)
data_tuples = list(zip(incongruent_responses, congruent_responses, reaction_time_incongruent, reaction_time_congruent))
df = pd.DataFrame(data_tuples, columns=["Incongruent Response", "Congruent Response", "Incongruent RT", "Congruent RT"]) # columns can be defined here too
df.head()

Unnamed: 0,Incongruent Response,Congruent Response,Incongruent RT,Congruent RT
0,incorrect,correct,4.512119,3.401761
1,incorrect,correct,4.630665,3.049635
2,incorrect,correct,3.451311,3.544197
3,incorrect,correct,4.139284,1.553025
4,incorrect,correct,2.565913,3.495347
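The expression `3*x - a - b` in the simulation comes from the fact that a triangular distribution has mean `(low + high + mode) / 3`. The following is a minimal sketch that checks this relationship on a large sample; the names `low`, `high`, `target_mean` and the fixed seed are assumptions made for illustration and are not part of the notebook above.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

low, high = 0.5, 6.0   # same bounds as a and b in the simulation
target_mean = 4.0      # plays the role of x in the simulation

# mean = (low + high + mode) / 3, so mode = 3 * mean - low - high
mode = 3 * target_mean - low - high

# Draw a large sample and compare its mean with the target
samples = [random.triangular(low, high, mode) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)

print("mode:", mode)
print("sample mean:", round(sample_mean, 2))
```

With only 50 draws, as in the notebook, the sample mean will wander further from the target than it does here.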


## Let's simulate a couple more variables here to help us learn data analysis in Python

In [3]:
preference_for_apples = []
preference_for_crabs = []
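a = 1 # lowest preference score
b = 100 # highest preference score
for i in range(50):
    x = 75 # target mean; the implied mode (3*75 - 1 - 100 = 124) lies above b, so some values exceed 100
    preference_for_apples.append(random.triangular(a, b, 3*x - a - b))
    x = 35 # target mean; the implied mode is 3*35 - 1 - 100 = 4
    preference_for_crabs.append(random.triangular(a, b, 3*x - a - b))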

data_tuples = list(zip(preference_for_apples, preference_for_crabs))
df2 = pd.DataFrame(data_tuples, columns=["Apple Preference", "Crab Preference"]) # columns can be defined here too
df2.head()

Unnamed: 0,Apple Preference,Crab Preference
0,78.795446,5.210716
1,64.532171,10.051328
2,49.320005,38.124945
3,103.676105,51.733424
4,86.422666,28.78498


### Concatenate the dataframes

In [4]:
df = pd.concat([df,df2], axis=1, sort=False)
df.head()

Unnamed: 0,Incongruent Response,Congruent Response,Incongruent RT,Congruent RT,Apple Preference,Crab Preference
0,incorrect,correct,4.512119,3.401761,78.795446,5.210716
1,incorrect,correct,4.630665,3.049635,64.532171,10.051328
2,incorrect,correct,3.451311,3.544197,49.320005,38.124945
3,incorrect,correct,4.139284,1.553025,103.676105,51.733424
4,incorrect,correct,2.565913,3.495347,86.422666,28.78498
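`pd.concat` with `axis=1` places dataframes side by side and lines rows up by index label, not by position. The sketch below uses two small hypothetical dataframes (`left` and `right`, not part of the notebook) to show what happens when one of them is missing a row.

```python
import pandas as pd

# Two small frames; `right` has no row with index label 1
left = pd.DataFrame({"rt": [1.2, 3.4, 2.2]})
right = pd.DataFrame({"preference": [70.0, 55.0]}, index=[0, 2])

# axis=1 joins columns; rows are matched on their index labels
combined = pd.concat([left, right], axis=1)
print(combined)
```

In the notebook both dataframes have the default index 0 to 49, so every row finds its partner and no missing values appear.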


## Calculate the mean reaction times using the `mean()` method.

In [5]:
print(df['Incongruent RT'].mean())
print(df['Congruent RT'].mean())

4.101313426830777
3.069708255177009


## Calculate the median reaction times using the `median()` method.

In [6]:
print(df['Incongruent RT'].median())
print(df['Congruent RT'].median())

4.428318064424483
3.056854355099855


## Calculate the mode of the responses using the `mode()` method.
Here we're calculating the mode of the responses instead of the reaction times because `mode()` returns the most frequently occurring value. Our reaction times are all unique values, so `mode()` would just return the original data. To see the method in action, we can use it on the responses and find the most frequently occurring response in each condition.

In [7]:
print(df['Incongruent Response'].mode())
print(df['Congruent Response'].mode())

0    incorrect
dtype: object
0    correct
dtype: object
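As a small illustration of why `mode()` is more informative for responses than for reaction times, the sketch below uses two short hypothetical series (`responses` and `times`). It also shows `value_counts()`, which reports how often every value occurs, not only the most frequent one.

```python
import pandas as pd

responses = pd.Series(["correct", "incorrect", "correct", "correct"])
times = pd.Series([2.1, 3.4, 1.9, 4.0])  # every value is unique

# One clear winner for the categorical data
print(responses.mode().tolist())

# All values tie, so every value comes back (in sorted order)
print(times.mode().tolist())

# Full frequency table for the responses
print(responses.value_counts())
```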


## Calculate the standard deviation of the reaction times using the `std()` method.

In [8]:
print(df['Incongruent RT'].std())
print(df['Congruent RT'].std())

1.0872290950798025
1.26929244589055
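One detail worth knowing: Pandas' `std()` computes the sample standard deviation (`ddof=1`) by default, while NumPy's `np.std` defaults to the population version (`ddof=0`). The sketch below compares the two on a small made-up series named `values`.

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_sd = values.std()               # divides by n - 1 (pandas default)
population_sd = values.std(ddof=0)     # divides by n
numpy_sd = np.std(values.to_numpy())   # NumPy's default also divides by n

print(sample_sd, population_sd, numpy_sd)
```

If results from Pandas and NumPy differ slightly on the same data, this default is usually the reason.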


<h2>Functions &amp; Description</h2>
<p>Let us now look at the functions available for descriptive statistics in Pandas. The following table lists the most important ones &minus;</p>
<table class="table table-bordered">
<tr>
<th style="text-align:center;">Sr.No.</th>
<th style="text-align:center;">Function</th>
<th style="text-align:center;">Description</th>
</tr>
<tr>
<td style="text-align:center;">1</td>
<td style="text-align:center;">count()</td>
<td>Number of non-null observations</td>
</tr>
<tr>
<td style="text-align:center;">2</td>
<td style="text-align:center;">sum()</td>
<td>Sum of values</td>
</tr>
<tr>
<td style="text-align:center;">3</td>
<td style="text-align:center;">mean()</td>
<td>Mean of Values</td>
</tr>
<tr>
<td style="text-align:center;">4</td>
<td style="text-align:center;">median()</td>
<td>Median of Values</td>
</tr>
<tr>
<td style="text-align:center;">5</td>
<td style="text-align:center;">mode()</td>
<td>Mode of values</td>
</tr>
<tr>
<td style="text-align:center;">6</td>
<td style="text-align:center;">std()</td>
<td>Standard Deviation of the Values</td>
</tr>
<tr>
<td style="text-align:center;">7</td>
<td style="text-align:center;">min()</td>
<td>Minimum Value</td>
</tr>
<tr>
<td style="text-align:center;">8</td>
<td style="text-align:center;">max()</td>
<td>Maximum Value</td>
</tr>
<tr>
<td style="text-align:center;">9</td>
<td style="text-align:center;">abs()</td>
<td>Absolute Value</td>
</tr>
<tr>
<td style="text-align:center;">10</td>
<td style="text-align:center;">prod()</td>
<td>Product of Values</td>
</tr>
<tr>
<td style="text-align:center;">11</td>
<td style="text-align:center;">cumsum()</td>
<td>Cumulative Sum</td>
</tr>
<tr>
<td style="text-align:center;">12</td>
<td style="text-align:center;">cumprod()</td>
<td>Cumulative Product</td>
</tr>
</table>

https://www.tutorialspoint.com/python_pandas/python_pandas_descriptive_statistics.htm
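Several of the functions in the table have not been used in this notebook yet. The sketch below applies them to a small made-up series `s` that contains one missing value, to show that these functions skip missing values by default.

```python
import pandas as pd

s = pd.Series([1.0, -2.0, 3.0, None])  # None becomes NaN

print(s.count())             # number of non-null observations
print(s.sum())               # sum, ignoring the missing value
print(s.prod())              # product, ignoring the missing value
print(s.abs().tolist())      # element-wise absolute value
print(s.cumsum().tolist())   # running sum; NaN stays NaN in its own position
print(s.cumprod().tolist())  # running product
```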

## The `describe()` method gives a summary of the numerical data in a given dataset

In [9]:
df.describe()

Unnamed: 0,Incongruent RT,Congruent RT,Apple Preference,Crab Preference
count,50.0,50.0,50.0,50.0
mean,4.101313,3.069708,83.10562,35.02807
std,1.087229,1.269292,22.984023,21.596587
min,1.616654,0.712919,31.786232,5.210716
25%,3.120999,2.087029,65.749024,16.981601
50%,4.428318,3.056854,86.392377,32.501331
75%,5.026026,3.994842,103.632848,51.606026
max,5.694001,5.882009,110.10872,89.265397
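By default `describe()` reports the 25th, 50th and 75th percentiles. If other cut-offs are more useful, the `percentiles` argument accepts a list of fractions between 0 and 1. A minimal sketch on a made-up series named `scores`:

```python
import pandas as pd

scores = pd.Series(range(1, 101), dtype="float64")

# Ask for the 10th and 90th percentiles instead of the quartiles
summary = scores.describe(percentiles=[0.1, 0.9])
print(summary)
```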


### We can use `include=['object']` to see info about columns that contain objects, not numbers.  This includes text, like our response variables

In [10]:
df.describe(include=['object'])

Unnamed: 0,Incongruent Response,Congruent Response
count,50,50
unique,2,2
top,incorrect,correct
freq,34,37


### We can use `include='all'` to see all of that info at once.  `NaN` stands for `not a number`, which is Pandas' marker for missing values

In [11]:
df.describe(include='all')

Unnamed: 0,Incongruent Response,Congruent Response,Incongruent RT,Congruent RT,Apple Preference,Crab Preference
count,50,50,50.0,50.0,50.0,50.0
unique,2,2,,,,
top,incorrect,correct,,,,
freq,34,37,,,,
mean,,,4.101313,3.069708,83.10562,35.02807
std,,,1.087229,1.269292,22.984023,21.596587
min,,,1.616654,0.712919,31.786232,5.210716
25%,,,3.120999,2.087029,65.749024,16.981601
50%,,,4.428318,3.056854,86.392377,32.501331
75%,,,5.026026,3.994842,103.632848,51.606026
max,,,5.694001,5.882009,110.10872,89.265397


# Test for normal distribution
Let's see if our data are normally distributed.

To do that, we'll use the `scipy` module. `scipy` stands for scientific Python.  In its `stats` submodule, there's a function called `normaltest`, which tests the null hypothesis that a sample comes from a normal distribution. `scipy` is compatible with `Pandas`.  If the p-value is less than 0.05, we reject that null hypothesis and treat the distribution as not normal.

In [12]:
import scipy.stats

print(scipy.stats.normaltest(df['Incongruent RT']))
print(scipy.stats.normaltest(df['Congruent RT']))

NormaltestResult(statistic=6.730283817948463, pvalue=0.034557111805740635)
NormaltestResult(statistic=2.3827578526077122, pvalue=0.3038020544594923)
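To get a feel for how `normaltest` behaves, it can help to run it on samples whose shape is known in advance. The sketch below is an illustration only: the sample names, sizes and seed are assumptions, and the samples are generated with NumPy instead of taken from `df`.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

normal_sample = rng.normal(loc=3.0, scale=1.0, size=500)  # bell-shaped
skewed_sample = rng.exponential(scale=1.0, size=500)      # strongly right-skewed

normal_result = scipy.stats.normaltest(normal_sample)
skewed_result = scipy.stats.normaltest(skewed_sample)

# The result object exposes the test statistic and the p-value by name
print("normal sample p-value:", normal_result.pvalue)
print("skewed sample p-value:", skewed_result.pvalue)
```

The skewed sample should give a very small p-value. A p-value above 0.05 for the other sample only means the test found no evidence against normality, not that normality has been proven.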


## Correlation
To test for a correlation in Pandas, you can use the `corr()` method to test for all correlations in a dataset.  This is called a correlation matrix.  It's not particularly useful for our analysis here, but it's something you should learn.

In [13]:
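df.select_dtypes(include='number').corr() # numeric columns only; newer versions of Pandas raise an error if text columns are passed to corr()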

Unnamed: 0,Incongruent RT,Congruent RT,Apple Preference,Crab Preference
Incongruent RT,1.0,0.053828,0.078238,0.272555
Congruent RT,0.053828,1.0,-0.099812,-0.05417
Apple Preference,0.078238,-0.099812,1.0,-0.062034
Crab Preference,0.272555,-0.05417,-0.062034,1.0


If you just want to correlate two variables, you can call the `corr()` method on one column and pass the other column to it as the argument.

In [14]:
df['Incongruent RT'].corr(df['Congruent RT'])

0.05382758297868144

### p-values for correlations
p-value politics aside, you should learn how to calculate them.  We can't do that in Pandas, but we can do it with `scipy.stats.pearsonr`. `scipy.stats.pearsonr` returns a pair of values.  The first value is the r-value or correlation coefficient.  The second value is the p-value.

In [15]:
import scipy.stats

scipy.stats.pearsonr(df['Incongruent RT'], df['Congruent RT'])

(0.05382758297868139, 0.7104432091821069)

The correlation is close to zero (r of about 0.05), and with a p-value of about 0.71 we would not be surprised to see a result like this under our null hypothesis (i.e. that the two variables are not related).
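For contrast, here is a sketch of what the same two calls return when the variables really are related. The data are made up (`x` and `y` are hypothetical, with `y` built from `x` plus noise), so the example only illustrates the mechanics.

```python
import numpy as np
import pandas as pd
import scipy.stats

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable

x = pd.Series(rng.normal(size=200))
y = 0.8 * x + pd.Series(rng.normal(scale=0.5, size=200))  # y depends on x

r_pandas = x.corr(y)                           # coefficient only
r_scipy, p_value = scipy.stats.pearsonr(x, y)  # coefficient and p-value

print(r_pandas, r_scipy, p_value)
```

Both libraries compute the same Pearson coefficient; `scipy` simply adds the p-value.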

# Are responses in the incongruent condition longer than the responses in the congruent condition?

To do that, we'll use a paired sample t-test `scipy.stats.ttest_rel`.  

In [16]:
scipy.stats.ttest_rel(df['Incongruent RT'], df['Congruent RT'])

Ttest_relResult(statistic=4.485574796148902, pvalue=4.3983785411430516e-05)

The t-value of about 4.5 tells us that the mean of the incongruent condition is higher than the mean of the congruent condition, and the p-value of about 4.4e-05 means we would be very surprised to see this result under the null hypothesis that there is no difference in the means.
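A paired t-test is equivalent to a one-sample t-test on the per-participant differences, which is a useful way to remember what it does. The sketch below checks this on made-up data; the arrays, the seed, and the use of `alternative="greater"` for a one-sided test (available in recent SciPy versions) are assumptions for illustration.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(2)  # fixed seed so the sketch is repeatable

congruent = rng.normal(3.0, 1.0, size=50)
incongruent = congruent + rng.normal(1.0, 1.0, size=50)  # slower on average

# Paired test on the two conditions
paired = scipy.stats.ttest_rel(incongruent, congruent)

# Same test expressed as: are the differences centred on zero?
one_sample = scipy.stats.ttest_1samp(incongruent - congruent, 0.0)

# One-sided version: is incongruent specifically longer than congruent?
one_sided = scipy.stats.ttest_rel(incongruent, congruent, alternative="greater")

print(paired.statistic, one_sample.statistic)
print(paired.pvalue, one_sided.pvalue)
```

Because the question in the heading is directional ("longer than"), the one-sided p-value is arguably the closer match; the default call is two-sided.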

## In each condition, is the frequency of "correct" responses different from what we would expect if people were just guessing?

x is the number of correct responses - we already calculated this last time.
n is the sample size (50).
p is the probability of being correct if there was no effect of the letter colour (in principle 1; the code below uses 0.99).

### Congruent

In [17]:
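n_correct_congruent = int((df["Congruent Response"] == "correct").sum()) # the binomial test takes a count of successes, not a proportion
proportion_correct_congruent = n_correct_congruent / 50
print('Proportion correct congruent', proportion_correct_congruent)
scipy.stats.binomtest(n_correct_congruent, n=50, p=0.99).pvalue # binomtest replaces binom_test, which newer versions of SciPy no longer provide

Proportion correct congruent 0.74
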

### Incongruent

In [18]:
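n_correct_incongruent = int((df["Incongruent Response"] == "correct").sum()) # the binomial test takes a count of successes, not a proportion
proportion_correct_incongruent = n_correct_incongruent / 50
print('Proportion correct incongruent', proportion_correct_incongruent)
scipy.stats.binomtest(n_correct_incongruent, n=50, p=0.99).pvalue # binomtest replaces binom_test, which newer versions of SciPy no longer provide

Proportion correct incongruent 0.32
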

In both cases, there is almost a zero probability that the observed responses are the same as we would expect if there was no effect of the colour on the word people read. Had we used p=1, there would be a zero probability.
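The heading of this section mentions guessing. If "guessing" is taken to mean a 50/50 chance of being correct on each trial (an assumption, and a different null hypothesis from the 0.99 used in the cells above), the same kind of test can be sketched as follows. The count of 37 corresponds to the 0.74 proportion reported for the congruent condition.

```python
import scipy.stats

n_trials = 50
n_correct = 37  # a count of successes, i.e. 0.74 * 50

# Two-sided exact binomial test against a chance level of 0.5 (assumed)
result = scipy.stats.binomtest(n_correct, n=n_trials, p=0.5)

print("observed proportion:", result.k / result.n)
print("p-value:", result.pvalue)
```

The first argument must be the number of correct responses, not the proportion.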

# Saving Your Data to a file

There are many formats you can save your data to.  Python doesn't care: it can save in any non-proprietary format (and some proprietary ones too).

You can save:

CSV (comma-separated values)

In [20]:
df.to_csv('simulated_data.csv')

Excel

In [26]:
df.to_excel('simulated_data.xlsx')

You can also save your objects by pickling them.  These objects can be loaded into other Python programs directly.  Whether it's a dataframe or any other Python object, it can be pickled.

In [24]:
df.to_pickle('simulated_data.pkl')
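Saved files can be read back with the matching `read_` functions. The sketch below writes a small made-up dataframe (`df_demo`) to a temporary folder and loads it again; the file names are hypothetical. It also shows where a column called `Unnamed: 0` comes from: `to_csv` writes the index as the first column, and `read_csv` treats it as ordinary data unless told otherwise.

```python
import os
import tempfile

import pandas as pd

df_demo = pd.DataFrame({"Response": ["correct", "incorrect"], "RT": [3.2, 4.1]})

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "demo.csv")
    pkl_path = os.path.join(folder, "demo.pkl")

    df_demo.to_csv(csv_path)
    df_demo.to_pickle(pkl_path)

    naive = pd.read_csv(csv_path)                 # index comes back as 'Unnamed: 0'
    restored = pd.read_csv(csv_path, index_col=0) # first column used as the index
    unpickled = pd.read_pickle(pkl_path)          # exact copy of the original object

print(list(naive.columns))
print(list(restored.columns))
```

Only unpickle files that come from a source you trust, since loading a pickle can run arbitrary code.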