# Wave 1 Analysis - 9 Year Old Cohort

### 2 Datasets for Wave 1:
- Cohort dataset (details about each student, e.g. demographics, academic scores, etc)
- Time Use dataset (details of the time use diary completed by students)

## Import Required Packages

In [25]:
import math
import statistics
import pandas as pd
import numpy as np
import warnings
from pandas.errors import SettingWithCopyWarning  # public location since pandas 1.5
from sklearn.preprocessing import MinMaxScaler
from statsmodels.stats.weightstats import ztest

# Silence copy warning & allow all columns and rows to be seen
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)
warnings.simplefilter(action="ignore", category=pd.errors.PerformanceWarning)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

## Import Raw Data - Cohort Dataset

In [26]:
# Specify which columns we want to use (using the data dictionaries)
required_cols = ['ID', 'Wgt_9yr', 'Gross_9yr', 'mathsls', 'mma5ap2']

# Read in data
cohort_data = pd.read_csv('0020-01 GUI Child Cohort Wave 1/0020-01 GUI Child Cohort Wave 1_Data/9 Year Cohort Data/CSV/GUI Data_9YearCohort.csv', index_col= 'ID', usecols=required_cols)

# Check the number of rows and columns
cohort_data.shape

(8568, 4)

## Exploratory Data Analysis - Cohort Dataset

- Check the datatypes of each column
- Describe the data - min, max, mean, stdev, etc
- View the first 10 rows of the data

In [27]:
cohort_data.dtypes

Wgt_9yr      float64
Gross_9yr    float64
mma5ap2        int64
mathsls       object
dtype: object

In [28]:
cohort_data.describe()

        Wgt_9yr  Gross_9yr   mma5ap2
count    8568.0     8568.0    8568.0
mean   0.999993   6.593909  1.514006
std    0.769568   5.074497  0.499833
min    0.200535   1.322316       1.0
25%    0.509936   3.362492       1.0
50%    0.766355   5.053313       2.0
75%    1.230138   8.111474       2.0
max    4.615382  30.433616       2.0


*The output below is omitted from Git because access to the data must be requested from Growing Up in Ireland.*

In [None]:
cohort_data.head(10)

## Clean Raw Cohort Dataset

- Standardise all empty/missing values to NaN (`np.nan`) so that pandas methods such as `.dropna()` and `.isna()` work
- Drop any rows where the maths score is missing
- Cast the maths score to a float
- Create 2 new columns: the scaled maths score (1-100) and the percentile rank of the scaled maths score

In [29]:
# Missing values are stored as whitespace; replace them with NaN
cohort_data = cohort_data.replace(r'^\s*$', np.nan, regex=True)

# Drop any students who are missing a maths score (copy so later column assignments are safe)
cohort_data = cohort_data[cohort_data.mathsls.notna()].copy()

print(f'Shape of dataframe after dropping students missing a maths score: {cohort_data.shape}')

# Convert the maths score from object (string) to float
cohort_data.mathsls = cohort_data.mathsls.astype('float64')

# Scale Maths logit scores to 1-100 using MinMaxScaler
scaler = MinMaxScaler(feature_range=(1,100))

cohort_data['SCALED_MATHSLS'] = scaler.fit_transform(cohort_data[['mathsls']])

# Calculate Percentile rank for students with maths score
cohort_data['MATHSLS_PERCENTILE_RANK'] = cohort_data['SCALED_MATHSLS'].rank(pct=True) * 100


print(f'Shape of dataframe after adding 2 new columns: {cohort_data.shape}')

Shape of dataframe after dropping students missing a maths score: (8449, 4)
Shape of dataframe after adding 2 new columns: (8449, 6)
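The two derived columns can be reproduced by hand, which makes it easier to see what they mean. The following is a minimal sketch on made-up scores (the values and the `scores` name are illustrative assumptions, not GUI data): min-max scaling maps the lowest score to 1 and the highest to 100, and the percentile rank is each student's rank divided by the number of students.

```python
import pandas as pd

# Hypothetical logit scores for five students (illustrative values only)
scores = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0], name='mathsls')

lo, hi = 1, 100  # target range, matching feature_range=(1, 100)

# Min-max scaling: the smallest score maps to 1 and the largest to 100
scaled = lo + (scores - scores.min()) / (scores.max() - scores.min()) * (hi - lo)

# Percentile rank: rank / number of students * 100 (tied scores share the average rank)
percentile = scaled.rank(pct=True) * 100

print(pd.DataFrame({'mathsls': scores, 'scaled': scaled, 'percentile': percentile}))
```

Because min-max scaling preserves the order of the scores, ranking the scaled column gives the same percentiles as ranking the raw `mathsls` column.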


### View first 10 rows after data cleaning

*The output below is omitted from Git because access to the data must be requested from Growing Up in Ireland.*

In [None]:
cohort_data.head(10)

## Import Raw Data - Time Use Dataset

In [30]:
timeuse_data = pd.read_csv('0020-01 GUI Child Cohort Wave 1/0020-01 GUI Child Cohort Wave 1_Data/Time Use Data/GUI Data_9YearCohort_TimeUse.txt', index_col= 'ID')

timeuse_data.shape

(6228, 500)

## Exploratory Data Analysis - Time Use Dataset

- Check the datatypes of each column
- Describe the data - min, max, mean, stdev, etc
- View the first 10 rows of the data

In [31]:
timeuse_data.dtypes

wgttime9yr      float64
grosstime9yr    float64
DiaryDay          int64
DiaryDat         object
diarymonth        int64
weekend           int64
term              int64
t00_1_A1          int64
t00_1_A2         object
t00_1_A3         object
t00_1_A4         object
t00_1_A5         object
t00_2_A1          int64
t00_2_A2         object
t00_2_A3         object
t00_2_A4         object
t00_2_A5         object
t00_3_A1          int64
t00_3_A2         object
t00_3_A3         object
t00_3_A4         object
t00_3_A5         object
t00_4_A1          int64
t00_4_A2         object
t00_4_A3         object
t00_4_A4         object
t00_4_A5         object
t01_1_A1          int64
t01_1_A2         object
t01_1_A3         object
t01_1_A4         object
t01_1_A5         object
t01_2_A1          int64
t01_2_A2         object
t01_2_A3         object
t01_2_A4         object
t01_2_A5         object
t01_3_A1          int64
t01_3_A2         object
t01_3_A3         object
t01_3_A4         object
t01_3_A5        

In [32]:
timeuse_data.describe()

Unnamed: 0,wgttime9yr,grosstime9yr,DiaryDay,diarymonth,weekend,term,t00_1_A1,t00_2_A1,t00_3_A1,t00_4_A1,t01_1_A1,t01_2_A1,t01_3_A1,t01_4_A1,t02_1_A1,t02_2_A1,t02_3_A1,t02_4_A1,t03_1_A1,t03_2_A1,t03_3_A1,t03_4_A1,t04_1_A1,t04_2_A1,t04_3_A1,t04_4_A1,t05_1_A1,t05_2_A1,t05_3_A1,t05_4_A1,t06_1_A1,t06_2_A1,t06_3_A1,t06_4_A1,t07_1_A1,t07_2_A1,t07_3_A1,t07_4_A1,t08_1_A1,t08_2_A1,t08_3_A1,t08_4_A1,t09_1_A1,t09_2_A1,t09_3_A1,t09_4_A1,t10_1_A1,t10_2_A1,t10_3_A1,t10_4_A1,t11_1_A1,t11_2_A1,t11_3_A1,t11_4_A1,t12_1_P1,t12_2_P1,t12_3_P1,t12_4_P1,t01_1_P1,t01_2_P1,t01_3_P1,t01_4_P1,t02_1_P1,t02_2_P1,t02_3_P1,t02_4_P1,t03_1_P1,t03_2_P1,t03_3_P1,t03_4_P1,t04_1_P1,t04_2_P1,t04_3_P1,t04_4_P1,t05_1_P1,t05_2_P1,t05_3_P1,t05_4_P1,t06_1_P1,t06_2_P1,t06_3_P1,t06_4_P1,t07_1_P1,t07_2_P1,t07_3_P1,t07_4_P1,t08_1_P1,t08_2_P1,t08_3_P1,t08_4_P1,t09_1_P1,t09_2_P1,t09_3_P1,t09_4_P1,t10_1_P1,t10_2_P1,t10_3_P1,t10_4_P1,t11_1_P1,t11_2_P1,t11_3_P1,t11_4_P1,t1_1,t1_2,t1_3,t1_4,t1_5,t1_6,t1_7,t1_8,t1_9,t1_10,T2,T4
count,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0,6228.0
mean,1.000003,9.071475,3.706808,8.067759,1.230893,1.140816,1.022479,1.014933,1.010758,1.007065,1.006262,1.006262,1.006262,1.006262,1.005459,1.005459,1.005459,1.00578,1.00578,1.005941,1.00578,1.00578,1.006101,1.00578,1.00578,1.007868,1.006262,1.006262,1.005941,1.008028,1.03115,1.038054,1.090719,1.141137,1.298651,1.589114,2.38632,2.866249,3.69605,4.718529,6.100674,6.703918,6.374438,6.755459,7.394348,7.6649,7.838953,8.094573,8.580925,8.818561,8.87492,9.045601,9.499839,9.582049,9.534522,9.723346,9.921323,10.063744,9.292229,9.18465,9.388407,9.572094,9.766699,9.758671,9.914419,10.33526,10.794958,11.629897,12.6535,12.785806,12.112396,12.64194,13.49422,14.012685,13.416667,13.489724,13.934008,14.145633,12.845055,12.918112,13.569846,14.099069,13.780026,13.990687,14.500803,14.668754,14.253372,14.017662,13.510116,13.056037,9.957771,8.756262,6.762845,5.911207,3.896435,3.295279,2.720456,2.50562,2.130539,2.026975,1.931439,1.914579,1.982338,2.689627,2.584618,2.705202,2.723025,2.694123,2.722543,2.719974,2.707771,2.682241,2.795601,2.197174
std,0.938916,8.517328,1.978714,3.325677,0.421438,0.347859,0.590712,0.484149,0.414968,0.317492,0.311122,0.311122,0.311122,0.311122,0.304617,0.304617,0.304617,0.305664,0.305664,0.305924,0.305664,0.305664,0.306707,0.305664,0.305664,0.336862,0.307488,0.307488,0.305924,0.342297,1.314272,1.351502,1.939164,2.360376,3.315015,4.787059,7.985418,8.841153,9.212667,11.104606,13.639829,13.771542,9.04874,7.837421,9.068813,9.562517,8.686106,8.926671,9.852014,10.235755,9.254928,9.598116,10.872216,11.050515,11.405145,12.109521,12.724777,13.24857,11.44344,11.265841,11.669127,11.948827,11.863024,11.500499,12.383579,13.766795,15.062757,16.399236,17.689842,17.341386,14.585066,15.338887,16.864811,17.670866,16.41383,16.786458,17.88866,18.585463,16.550663,16.816451,17.758125,18.193701,16.467041,16.395777,16.980992,17.254235,16.723208,16.54594,17.248551,17.699731,15.596547,15.228151,14.3293,14.196028,11.341893,10.864347,9.967575,9.84703,9.063565,8.991792,8.777606,8.765255,2.412032,2.155026,2.185195,2.148447,2.13835,2.152538,2.138626,2.140093,2.147004,2.157004,2.35028,2.366523
min,0.161143,1.461798,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,0.467334,4.239396,2.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,3.75,4.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,6.0,5.0,5.0,6.0,7.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,6.0,4.0,6.0,6.0,8.0,9.0,9.0,9.0,9.0,6.0,4.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0
50%,0.698816,6.339274,4.0,9.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,3.0,4.0,5.0,5.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,8.0,8.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,10.0,10.0,9.0,9.0,10.0,11.0,11.0,11.0,12.0,13.0,13.0,13.0,13.0,13.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0
75%,1.164133,10.560377,5.0,11.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,4.0,4.0,5.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.25,9.0,7.0,7.0,7.0,8.0,7.0,7.0,7.0,7.0,8.0,9.0,9.0,9.0,11.0,11.0,13.0,13.0,13.0,13.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,12.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,2.0
max,8.953163,81.218182,7.0,12.0,2.0,2.0,22.0,22.0,22.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,18.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0


*The output below is omitted from Git because access to the data must be requested from Growing Up in Ireland.*

In [None]:
timeuse_data.head()

## Clean Raw Time Use Dataset

- Standardise all empty/missing values to NaN (`np.nan`) so that pandas methods such as `.dropna()` and `.isna()` work


In [33]:
# Missing values are stored as whitespace; replace them with NaN
timeuse_data = timeuse_data.replace(r'^\s*$', np.nan, regex=True)

# Check the % of NaNs in each column, e.g. 50.0 = 50% of values in this column are NaN
timeuse_data.isnull().mean() * 100

wgttime9yr        0.000000
grosstime9yr      0.000000
DiaryDay          0.000000
DiaryDat          0.000000
diarymonth        0.000000
weekend           0.000000
term              0.000000
t00_1_A1          0.000000
t00_1_A2         99.983943
t00_1_A3        100.000000
t00_1_A4        100.000000
t00_1_A5        100.000000
t00_2_A1          0.000000
t00_2_A2         99.983943
t00_2_A3        100.000000
t00_2_A4        100.000000
t00_2_A5        100.000000
t00_3_A1          0.000000
t00_3_A2         99.983943
t00_3_A3        100.000000
t00_3_A4        100.000000
t00_3_A5        100.000000
t00_4_A1          0.000000
t00_4_A2         99.983943
t00_4_A3        100.000000
t00_4_A4        100.000000
t00_4_A5        100.000000
t01_1_A1          0.000000
t01_1_A2         99.983943
t01_1_A3        100.000000
t01_1_A4        100.000000
t01_1_A5        100.000000
t01_2_A1          0.000000
t01_2_A2         99.983943
t01_2_A3        100.000000
t01_2_A4        100.000000
t01_2_A5        100.000000
t

The time use column names are structured as follows:

Sample column name = **t00_1_A1**

Decoded: t00 = hour 00, 1 = block 1 of 4 (i.e. the first 15 minutes of that hour), A1 = activity 1
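The naming scheme above can be decoded programmatically. The following is a rough sketch only: the `SLOT_PATTERN` regex and the `parse_timeuse_col` helper are hypothetical names introduced here, and the pattern is inferred from the column names visible in the outputs above (which show both `A` and `P` suffixes). The sketch keeps the suffix as-is rather than interpreting it.

```python
import re

# Pattern for diary columns such as 't00_1_A1':
# two-digit hour, quarter-hour block (1-4), then a letter and a digit
SLOT_PATTERN = re.compile(r'^t(\d{2})_([1-4])_([A-Z])(\d)$')


def parse_timeuse_col(col):
    """Split a diary column name into its parts, or return None if it does not match."""
    match = SLOT_PATTERN.match(col)
    if match is None:
        return None
    hour, block, letter, number = match.groups()
    return {
        'hour': int(hour),
        'block': int(block),
        'start_minute': (int(block) - 1) * 15,  # block 1 starts at :00, block 4 at :45
        'suffix': letter + number,              # e.g. 'A1', left uninterpreted
    }


print(parse_timeuse_col('t00_1_A1'))
print(parse_timeuse_col('term'))  # not a diary slot, so None
```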



In [34]:
# get all column names that begin with 't'
timeuse_cols = [col for col in timeuse_data if col.startswith('t')]

# specify columns we want to remove
cols_to_remove = ['term', 't1_1', 't1_2', 't1_3', 't1_4', 't1_5', 't1_6', 't1_7', 't1_8', 't1_9', 't1_10']

# Final columns = all columns starting with 't' minus the columns to remove
timeuse_cols = list(set(timeuse_cols) - set(cols_to_remove))

# sort the columns
timeuse_cols.sort()
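An alternative to building the list by prefix and then subtracting exclusions is to select the diary slots by pattern. This is a sketch on an empty toy frame, assuming every diary slot column follows the `tHH_B_XN` shape seen in the outputs above; the `toy` frame and the regex are illustrative, not part of the original analysis.

```python
import pandas as pd

# Toy frame with a mix of diary-slot and non-slot column names (no rows needed)
toy = pd.DataFrame(columns=['wgttime9yr', 'term', 't00_1_A1', 't00_1_A2', 't12_1_P1', 't1_1', 'T2'])

# DataFrame.filter(regex=...) keeps only the columns whose names match the pattern
slot_cols = sorted(toy.filter(regex=r'^t\d{2}_[1-4]_[A-Z]\d$').columns)

print(slot_cols)
```

With this approach `term` and the `t1_*` columns drop out without an explicit exclusion list, at the cost of relying on the naming pattern holding for every slot column.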

## Feature Engineering Time Use Dataset

The following features are added to the time use dataset:
- The number of 15-minute intervals each student spent gaming (using the data dictionary, gaming = activity '13')
- The total number of minutes each student spent gaming (i.e. number of intervals * 15 minutes)
- An indicator as to whether they gamed or not (for later use in filtering and visualisations)

In [35]:
# Gaming (activity 13): work on a copy so the original diary columns are left untouched
timeuse_data_temp_13 = timeuse_data[timeuse_cols].copy()

# Replace every recorded activity other than 'Gaming' (13) with NaN, as only gaming slots are counted
timeuse_data_temp_13[timeuse_data_temp_13 != 13] = np.nan

In [36]:
# Calculate count of 15min slots flagged as gaming per student ID using 'count'
timeuse_data_temp_13['CNT_15M_INTERVALS_COMPGAMES'] = timeuse_data_temp_13.count(axis=1)

# Calculate total time spent by multiplying the count of slots * 15 (as each slot = 15mins)
timeuse_data_temp_13['TOTAL_MINS_COMPGAMES'] = timeuse_data_temp_13.CNT_15M_INTERVALS_COMPGAMES * 15

# Add 2 new columns back to original dataset
timeuse_data = timeuse_data.join(timeuse_data_temp_13[['CNT_15M_INTERVALS_COMPGAMES', 'TOTAL_MINS_COMPGAMES']])

# Create a gaming indicator (1 = Yes, 0 = No)
timeuse_data['COMP_GAMING_IND'] = np.where(timeuse_data['CNT_15M_INTERVALS_COMPGAMES'] > 0, 1, 0)
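One thing worth checking in this step is the datatype of the activity columns. Several of them are `object` columns, and if their codes are stored as strings (an assumption, since the raw data cannot be shown here), a comparison against the integer `13` would not match the string `'13'`. The sketch below uses a made-up three-student diary to show one way to guard against that by coercing to numeric before counting; `toy` and `GAMING_CODE` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy diary: primary activity (A1) stored as integers, secondary (A2) as strings or blanks
toy = pd.DataFrame(
    {
        't00_1_A1': [13, 7, 13],
        't00_1_A2': [' ', '13', ' '],
        't00_2_A1': [13, 7, 1],
        't00_2_A2': [' ', ' ', ' '],
    },
    index=pd.Index([101, 102, 103], name='ID'),
)

GAMING_CODE = 13

# Blank cells become NaN, then every column is coerced to numeric so '13' and 13 compare equal
numeric = toy.replace(r'^\s*$', np.nan, regex=True).apply(pd.to_numeric, errors='coerce')

# Count the cells flagged as gaming per student, then convert to minutes (15 per slot)
features = pd.DataFrame(index=toy.index)
features['CNT_15M_INTERVALS_COMPGAMES'] = numeric.eq(GAMING_CODE).sum(axis=1)
features['TOTAL_MINS_COMPGAMES'] = features['CNT_15M_INTERVALS_COMPGAMES'] * 15
features['COMP_GAMING_IND'] = (features['CNT_15M_INTERVALS_COMPGAMES'] > 0).astype(int)

print(features)
```

Like the cell above, this counts cells rather than distinct time slots, so a slot with gaming recorded as both a primary and a secondary activity would be counted twice.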

## Merge Cohort Dataset & Time Use Dataset

In [37]:
# join both datasets using the common 'ID' column for each student
wave1_merged = pd.merge(cohort_data, timeuse_data, how='left', on='ID')

wave1_merged.shape

(8449, 509)

## After merge, drop any students missing a Time Use diary

In [38]:
# Drop anyone in the cohort data with a maths score who did not complete a time-use diary
wave1_merged = wave1_merged[wave1_merged.DiaryDay.notna()]

print(f'Final number of rows and columns: {wave1_merged.shape}')

Final number of rows and columns: (6159, 509)
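The left merge followed by a filter on a diary column can be made more explicit with the `indicator` option of `merge`, which records where each row came from. A small sketch on made-up frames (the `cohort` and `diary` names and their values are illustrative assumptions):

```python
import pandas as pd

# Three students with a maths score; only students 1 and 3 completed a diary
cohort = pd.DataFrame({'SCALED_MATHSLS': [40.0, 55.0, 70.0]}, index=pd.Index([1, 2, 3], name='ID'))
diary = pd.DataFrame({'DiaryDay': [2, 5]}, index=pd.Index([1, 3], name='ID'))

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge
merged = cohort.merge(diary, how='left', left_index=True, right_index=True, indicator=True)
print(merged['_merge'].value_counts())

# Keep only students found in both datasets
with_diary = merged[merged['_merge'] == 'both'].drop(columns='_merge')

# An inner merge keeps the same students in one step
inner = cohort.merge(diary, how='inner', left_index=True, right_index=True)
print(with_diary.index.equals(inner.index))
```

The left-then-filter form has the advantage of showing how many students were lost at the merge, which is why the intermediate shape is worth printing.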


## Exploring the clean merged dataset 

In [39]:
# See breakdown of gaming vs no gaming
wave1_merged.COMP_GAMING_IND.value_counts()

0.0    3866
1.0    2293
Name: COMP_GAMING_IND, dtype: int64

## Statistical Testing

Compare the scaled maths scores of students who gamed with those of students who didn't game.

**Null hypothesis:** The mean maths score of students who gamed = the mean maths score of students who didn't game.

**Alternative hypothesis:** The mean maths score of students who gamed > the mean maths score of students who didn't game.

In [40]:
# Split dataset into 2 groups, those who gamed and those who didn't
students_who_gamed = wave1_merged[wave1_merged.COMP_GAMING_IND == 1]
students_who_didnt_game = wave1_merged[wave1_merged.COMP_GAMING_IND == 0]

In [41]:
# Perform a one-sided, two-sample z-test (alternative: the gaming group's mean score is larger)
zscore, pvalue = ztest(students_who_gamed.SCALED_MATHSLS, students_who_didnt_game.SCALED_MATHSLS, value=0, alternative='larger')

alpha = 0.05 # significance level (95% confidence = 1 - 0.05)

print('Z Score\t:', zscore)
print('P Value\t:', pvalue)
print(1 - pvalue)

# Reject the null hypothesis only when the p-value falls below alpha
if pvalue < alpha:
    print('Null hypothesis is rejected. \nAlternate hypothesis is accepted!')
else:
    print('Failed to reject the null hypothesis.')

Z Score	: 5.399379714744775
P Value	: 3.343585105950899e-08
0.9999999665641489
Null hypothesis is rejected. 
Alternate hypothesis is accepted!
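To make the test statistic less of a black box, the z-score and one-sided p-value can be computed by hand. This is a sketch using only the standard library and made-up scores; it assumes the pooled-variance form of the two-sample z-test (believed to be the default behaviour of `ztest`), and `pooled_ztest_larger` is a hypothetical helper name.

```python
import math
import statistics


def pooled_ztest_larger(x1, x2):
    """One-sided two-sample z-test (H1: mean of x1 > mean of x2) with pooled variance."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = statistics.fmean(x1), statistics.fmean(x2)

    # Pooled variance: combined squared deviations over combined degrees of freedom
    ss1 = sum((v - m1) ** 2 for v in x1)
    ss2 = sum((v - m2) ** 2 for v in x2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)

    # Standard error of the difference in means, then the z statistic
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    z = (m1 - m2) / se

    # Upper-tail probability of the standard normal: P(Z > z)
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p


# Made-up scores for two small groups (illustrative values only)
gamed = [52, 55, 61, 58, 64]
didnt_game = [50, 49, 53, 47, 51]

ALPHA = 0.05
z, p = pooled_ztest_larger(gamed, didnt_game)
print('z =', z, 'p =', p)
print('Reject H0' if p < ALPHA else 'Fail to reject H0')
```

The decision rule compares the p-value directly with the significance level: a small p-value means a difference this large would be unlikely if the two group means were equal.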


In [42]:
# Add ID back as a column by resetting the index
wave1_merged.reset_index(inplace=True)

# Save clean merged dataset ready for Tableau to import
wave1_merged.to_csv('child_wave_1_merged.csv', index=False)