# Data Science in Psychology & Neuroscience (DSPN)

## Lecture 9. Data Wrangling (part 2)

### Date: September 29, 2022

### To-Dos From Last Class:

* Download the data for today's wrangling session (dataset #1) from <a href="https://github.com/hogeveen-lab/DSPN_Fall2022_Git/tree/master/misc_exercises/imitation_inhibition_paradigm">GitHub</a>

### Today:

* Data wrangling in Pandas (stealing heavily from this <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf">Cheatsheet</a>)
    * Last part of last class: Combining data frames
* Wrangle some real data

### Homework

* Download <a href="https://github.com/hogeveen-lab/DSPN_Fall2022_Git/tree/master/assignment_starters/assign3_starter">Assignment #3 starter kit</a>

# Data wrangling in Pandas

## (finishing from last class) 5. Combining Data Sets 

<img src="img/combining_data.png" width="600">

In [12]:
import pandas as pd
import numpy as np

# function to append n random integers (from 25 to 74) to the list x
def random_integer(x,n):
    for i in range(n):
        x.append(np.random.randint(25,75))  # upper bound (75) is exclusive
    return x

# run the function and check the output
x = []
random_integer(x,6)
# print(x)

# putting together some example data frames
df1 = pd.DataFrame({'pid' : ['P1','P2','P3'],
                   'var1' : [x[0],x[1],x[2]]})
# display(df1)
df2 = pd.DataFrame({'pid' : ['P1','P2','P4'],
                   'var2' : [x[3],x[4],x[5]]})
# display(df2)

# Standard merge approach #1
print('Merge LEFT (right --> left; lose unique rows from the RIGHT df)')
merge_left = pd.merge(df1,df2,on='pid',how='left')
display(merge_left)

# Standard merge approach #2
print('Merge RIGHT (left --> right; lose unique rows from the LEFT df)')
merge_right = pd.merge(df1,df2,on='pid',how='right')
display(merge_right)

# Standard merge approach #3
print('Merge INNER (lose unique data from BOTH dfs)')
merge_inner = pd.merge(df1,df2,on='pid',how='inner')
display(merge_inner)

# Standard merge approach #4
print('Merge OUTER (retain ALL the data)')
merge_outer = pd.merge(df1,df2,on='pid',how='outer')
display(merge_outer)

Merge LEFT (right --> left; lose unique rows from the RIGHT df)


| | pid | var1 | var2 |
|---|---|---|---|
| 0 | P1 | 65 | 41.0 |
| 1 | P2 | 54 | 47.0 |
| 2 | P3 | 64 | NaN |


Merge RIGHT (left --> right; lose unique rows from the LEFT df)


| | pid | var1 | var2 |
|---|---|---|---|
| 0 | P1 | 65.0 | 41 |
| 1 | P2 | 54.0 | 47 |
| 2 | P4 | NaN | 48 |


Merge INNER (lose unique data from BOTH dfs)


| | pid | var1 | var2 |
|---|---|---|---|
| 0 | P1 | 65 | 41 |
| 1 | P2 | 54 | 47 |


Merge OUTER (retain ALL the data)


| | pid | var1 | var2 |
|---|---|---|---|
| 0 | P1 | 65.0 | 41.0 |
| 1 | P2 | 54.0 | 47.0 |
| 2 | P3 | 64.0 | NaN |
| 3 | P4 | NaN | 48.0 |
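
One way to check which rows each `how` option keeps or drops is the `indicator` argument of `pd.merge`, which adds a `_merge` column recording whether a row came from the left frame, the right frame, or both. A minimal sketch, using the fixed values from the output above in place of fresh random integers so the result is reproducible:

```python
import pandas as pd

# Fixed values copied from the example output above (instead of new random integers)
df1 = pd.DataFrame({'pid': ['P1', 'P2', 'P3'], 'var1': [65, 54, 64]})
df2 = pd.DataFrame({'pid': ['P1', 'P2', 'P4'], 'var2': [41, 47, 48]})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
merge_outer = pd.merge(df1, df2, on='pid', how='outer', indicator=True)
print(merge_outer)

# Rows that an inner merge would drop (present in only one of the two frames)
unmatched = merge_outer[merge_outer['_merge'] != 'both']
print(unmatched['pid'].tolist())
```

Filtering on `_merge` is a quick sanity check before committing to a left, right, or inner merge on real data.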


# Wrangle some real data

<img src="img/imit_inhib_fileorg.png" width=500>

## Breaking into 8 code chunks

## 1. Import packages

In [14]:
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None # silences pandas' SettingWithCopyWarning; optional
import os
from glob import glob

## 2. Setting paths to the first level data

In [57]:
base_dir = os.getcwd()
# base_dir = '<hard-coded path to the local parent directory on your computer, if your data and code are not adjacent to each other>'
P_file_pattern = 'P*.txt'
# P_file_pattern = '/P*.txt'  # option for Windows users if glob rejects the backslashes that os.path.join inserts
first_dir = os.path.join(base_dir,'misc_exercises/imitation_inhibition_paradigm/data/first')
# first_dir = base_dir + '/first' # Windows option (same backslash issue as above)
second_dir = os.path.join(base_dir,'misc_exercises/imitation_inhibition_paradigm/data/second')
# second_dir = base_dir + '/second' # Windows option (same backslash issue as above)
questionaire_filename = os.path.join(second_dir,'ait_questionnaires.csv')
# questionaire_filename = second_dir + '/ait_questionnaires.csv'  # Windows option (same backslash issue as above)

# use glob to find ALL participant data files and collect them in a list
all_files = glob(os.path.join(first_dir,P_file_pattern))
# all_files = glob(first_dir + P_file_pattern)  # Windows option (pairs with the '/P*.txt' pattern above)
# print(all_files) # should show a list of ALL the first-level P*.txt files; if it doesn't, the path is wrong
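
If the slash direction causes trouble, one alternative (not what the cell above uses) is `pathlib`, which builds paths with the right separator for the current operating system. The sketch below is self-contained: it creates a throwaway folder with empty placeholder files, so the folder and file names are made up for illustration only.

```python
import tempfile
from pathlib import Path

# Build a throwaway directory with empty placeholder files (hypothetical names)
tmp_root = Path(tempfile.mkdtemp())
demo_first_dir = tmp_root / 'data' / 'first'   # '/' joins path parts on any OS
demo_first_dir.mkdir(parents=True)
for name in ['P1.txt', 'P2.txt', 'P10.txt', 'notes.md']:
    (demo_first_dir / name).touch()

# Path.glob matches the pattern inside the folder; sorted() gives a stable order
demo_files = sorted(demo_first_dir.glob('P*.txt'))
print([f.name for f in demo_files])
```

Sorting is optional, but `glob` returns files in arbitrary order, so sorting makes the first file in the list the same on every machine.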

## 3. Load a test subject to make sense of things

In [56]:
# reading in the data
sample_df = pd.read_csv(all_files[0],skiprows=5,sep='\t') # Note: from inspecting the file, we know the first 5 rows aren't needed and the file is tab-separated
print('How many rows in initial loaded data frame:',len(sample_df))

# filter the data down to just the experimental block
sample_df = sample_df[sample_df['Name.1']=='AI_Block']

# filter down to just the key releases
sample_df_releases = sample_df[sample_df['Released']=='Released']
print('How many rows after filtering to key releases:',len(sample_df_releases))

# Final note: If your row counts differ from my output, check which file all_files[0] is.
# Mine was P8. It doesn't matter, though: we're going to loop through them all eventually!

How many rows in initial loaded data frame: 521
How many rows after filtering to key releases: 101
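
To see the loading and filtering logic in isolation, here is a small sketch on a synthetic, in-memory "file". Everything in it is made up (the header lines, the `RT` column, the values); only the column names `Name.1` and `Released` mirror the cell above. A name like `Name.1` is what pandas produces when a file has two columns with the same header, and the sketch assumes that is where it comes from in the real files.

```python
import io
import pandas as pd

# Synthetic stand-in for a participant file: 2 junk lines to skip (the real files have 5),
# tab-separated, with a duplicated 'Name' header. All values are made up.
raw = (
    "header line 1\n"
    "header line 2\n"
    "Name\tName\tReleased\tRT\n"
    "Trial\tPractice\tPressed\t310\n"
    "Trial\tAI_Block\tPressed\t295\n"
    "Trial\tAI_Block\tReleased\t402\n"
    "Trial\tAI_Block\tReleased\t388\n"
)

demo_df = pd.read_csv(io.StringIO(raw), skiprows=2, sep='\t')
print(demo_df.columns.tolist())  # the second 'Name' column is renamed by pandas

# Both filters combined in a single boolean mask
mask = (demo_df['Name.1'] == 'AI_Block') & (demo_df['Released'] == 'Released')
demo_releases = demo_df[mask]
print(len(demo_df), '->', len(demo_releases))
```

Combining the two conditions with `&` gives the same rows as filtering in two steps; the parentheses around each condition are required.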


## 4. Iterate through to load the first level data
###    * Concatenate all together to create one data frame to rule them all
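
A minimal sketch of the loop-then-concatenate pattern, assuming each file holds one participant and the participant ID can be read off the file name. It writes two tiny made-up files to a temporary folder so it runs on its own; the `RT` column and the values are hypothetical, and the real files will also need the `skiprows` and filtering steps from chunk 3.

```python
import tempfile
from pathlib import Path
import pandas as pd

# Write two tiny made-up participant files so the loop has something to read
tmp_dir = Path(tempfile.mkdtemp())
for pid, rts in {'P1': [402, 388], 'P2': [455, 430, 419]}.items():
    pd.DataFrame({'Released': 'Released', 'RT': rts}).to_csv(
        tmp_dir / f'{pid}.txt', sep='\t', index=False)

frames = []
for path in sorted(tmp_dir.glob('P*.txt')):
    sub_df = pd.read_csv(path, sep='\t')
    sub_df['pid'] = path.stem  # participant ID taken from the file name, e.g. 'P1'
    frames.append(sub_df)

# Collect the frames in a list and concatenate once at the end
demo_all = pd.concat(frames, ignore_index=True)
print(demo_all)
```

Appending to a list and calling `pd.concat` once avoids copying a growing data frame on every pass through the loop, and `ignore_index=True` gives the combined frame a clean 0 to n-1 index.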

## 5. Merge with questionnaire data
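
A sketch of what this step could look like, assuming the trial-level data and the questionnaire file share a participant ID column. The column names `pid` and `score` and all values are hypothetical; check the actual headers in `ait_questionnaires.csv` before merging.

```python
import pandas as pd

# Made-up trial-level data: several rows per participant
demo_trials = pd.DataFrame({'pid': ['P1', 'P1', 'P2', 'P2', 'P2'],
                            'RT': [402, 388, 455, 430, 419]})
# Made-up questionnaire data: one row per participant ('score' is a hypothetical column)
demo_quest = pd.DataFrame({'pid': ['P1', 'P2', 'P3'],
                           'score': [12, 30, 21]})

# how='left' keeps every trial; validate='m:1' raises an error if a pid
# appears more than once in the questionnaire frame
demo_merged = pd.merge(demo_trials, demo_quest, on='pid', how='left', validate='m:1')
print(demo_merged)
```

Because the merge is many-to-one, each participant's questionnaire values are repeated on every one of their trial rows, and participants with questionnaire data but no trials (P3 here) are dropped.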

## 6. Write to trial-level allsubjects csv
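
Writing the result out is a single `to_csv` call. The sketch below uses made-up rows and a hypothetical file name in a temporary folder, then reads the file back to confirm the round trip.

```python
import os
import tempfile
import pandas as pd

# Made-up rows standing in for the merged trial-level data
demo_merged = pd.DataFrame({'pid': ['P1', 'P1', 'P2'], 'RT': [402, 388, 455]})

# Hypothetical output file name, written to a temporary folder for this demo
out_path = os.path.join(tempfile.mkdtemp(), 'allsubjects_triallevel.csv')

# index=False leaves out the row index, so reloading does not add an 'Unnamed: 0' column
demo_merged.to_csv(out_path, index=False)
reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
```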

#### Pick up next class...
7. Compute summary measures
8. Save to summary allsubjects csv