### __Features__
- **id**: User ID (2471 unique users)
- **event_id**: Individual event; sequential; starting from 1
    - mean: 2070
    - SD: 1590
    - max: 12900
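- **down_time**: time of the key or mouse press, in milliseconds
- **up_time**: time of the key or mouse release, in milliseconds
- **action_time**: duration of the event (`up_time - down_time`)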
- **activity**
    - options
        1. Nonproduction
        2. Input
        3. Remove/Cut
        4. Replace
        5. Paste
        6. Move From [x1, y1] To [x2, y2]
- **down_event**
- **up_event**
    - options
        1. q
        2. Leftclick
        3. .
        4. ,
        5. Backspace
        6. Space
        7. 
- **text_change**
    - options
        1. NoChange
        2. q
        3. 
        4. Replace (ex: qqqqq qqq => qq)
        5. 
- **cursor_position**
- **word_count**




### __Derived Keystroke Features__ (From: [Early prediction of writing quality using keystroke logging](https://doi.org/10.1007/s40593-021-00268-w))

- Features related to timing of pauses
    - **Initial pause time**
    - **Total time**
    - **IKI**
        - *Mean*
        - *Median*
        - *SD*
        - *Max*
    - **IKI within word**
        - *Mean*
        - *SD*
    - **IKI between words**
        - *Mean*
        - *SD*
    - **Time between words**
        - *Mean*
        - *SD*
    - **Time between sentences**
        - *Mean*
        - *SD*
    - **Number of IKI of specific length**
    - **Percentage long pauses between words**
- Features related to revisions
    - **Number of revisions**
    - **Number of leading-edge revisions**
    - **Number of in-text revisions**
    - **Number of backspaces**
    - **Time in single backspacing**
        - *Mean*
        - *SD*
    - **Percentage of characters in final text**
    - **Percentage of characters at leading edge**
- Features related to fluency
    - **Number of characters per burst**
        - *Mean*
        - *SD*
        - *Max*
    - **Number of bursts**
    - **Percentage of R-bursts:** number of revision bursts at leading edge ending in a revision
    - **Percentage of I-bursts:** number of insertion bursts produced away from the leading edge
    - **Percentage of words in P-bursts:** number of words in 'clean' production bursts both initiated and terminated by a long pause (not a revision)
    - **Number of production cycles**
    - **Percentage of linear transitions between words**
    - **Percentage of linear transitions between sentences**
- Features related to verbosity
    - **Total number of keystrokes**
    - **Total number of words**
    - **SD number of keystrokes per 30s**
    - **Slope of the number of keystrokes per 30s**
    - **Entropy of the number of keystrokes per 30s**
    - **Uniformity of the number of keystrokes per 30s**
    - **Local extreme number of keystrokes per 30s**
    - **Distance 30s windows of more than one keystroke**
        - *Mean*
        - *SD*
- Features related to other events
    - **Number of focus shifts to translation or task**
    - **cut/paste/jump events**
        - *Mean*
        - *SD*
    - **Percentage of time spent on other events**
    
Note: IKI = inter-keystroke interval, the time between two consecutive key presses.
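
The verbosity features above are all summaries of one sequence: the number of keystrokes in each 30 s window. The sketch below shows one way to build that sequence and derive SD, slope, and entropy from it. The helper names `keystrokes_per_window` and `window_features` are hypothetical, and the exact definitions (windows anchored at the first keystroke, Shannon entropy in bits over the share of keystrokes per window) are assumptions that may differ from the cited paper.

```python
import numpy as np

def keystrokes_per_window(down_time_ms, window_ms=30_000):
    """Count keystrokes in consecutive fixed-width windows.

    Assumption: windows start at the first keystroke, not at time 0.
    """
    t = np.asarray(down_time_ms, dtype=float)
    window_index = ((t - t.min()) // window_ms).astype(int)
    # bincount also returns 0 for windows without any keystroke
    return np.bincount(window_index)

def window_features(counts):
    """Summarise a sequence of per-window keystroke counts."""
    counts = np.asarray(counts, dtype=float)
    n = len(counts)
    # Least-squares slope of the count against the window index
    slope = np.polyfit(np.arange(n), counts, 1)[0] if n > 1 else 0.0
    # Shannon entropy (bits) of the share of keystrokes per window (assumed definition)
    p = counts[counts > 0] / counts.sum()
    entropy = -(p * np.log2(p)).sum()
    return {
        "sd": float(counts.std(ddof=1)) if n > 1 else 0.0,
        "slope": float(slope),
        "entropy": float(entropy),
    }

# Toy example: key press times in milliseconds for one user
toy_down_times = [0, 1_000, 2_000, 31_000, 61_000, 62_000, 63_000, 64_000]
counts = keystrokes_per_window(toy_down_times)
print(counts)  # [3 1 4]
print(window_features(counts))
```

For the real logs this would be applied per user, for example on `df_train.groupby('id')['down_time']`.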


In [3]:
# Importing packages

# absolutely necessary packages
import numpy as np
import pandas as pd


# temporarily necessary packages
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"

In [55]:
# Reading the data
df_train = pd.read_csv('data/train_logs.csv', 
                 header=0)
df_test = pd.read_csv('data/test_logs.csv', 
                 header=0)

In [57]:
print(df_train.head())



         id  event_id  down_time  up_time  action_time       activity  \
0  001519c8         1       4526     4557           31  Nonproduction   
1  001519c8         2       4558     4962          404  Nonproduction   
2  001519c8         3     106571   106571            0  Nonproduction   
3  001519c8         4     106686   106777           91          Input   
4  001519c8         5     107196   107323          127          Input   

  down_event   up_event text_change  cursor_position  word_count  
0  Leftclick  Leftclick    NoChange                0           0  
1  Leftclick  Leftclick    NoChange                0           0  
2      Shift      Shift    NoChange                0           0  
3          q          q           q                1           1  
4          q          q           q                2           1  


In [66]:
print(df_train['id'].nunique())
print('\n',df_train['activity'].unique())
print('\n',df_train['text_change'].unique())
print('\n',df_train['up_event'].unique())

2471

 ['Nonproduction' 'Input' 'Remove/Cut' 'Replace'
 'Move From [284, 292] To [282, 290]' 'Move From [287, 289] To [285, 287]'
 'Move From [460, 461] To [465, 466]' 'Paste'
 'Move From [905, 1314] To [907, 1316]'
 'Move From [565, 743] To [669, 847]' 'Move From [669, 847] To [565, 743]'
 'Move From [1041, 1121] To [1496, 1576]'
 'Move From [1455, 1557] To [1323, 1425]'
 'Move From [2268, 2275] To [2247, 2254]'
 'Move From [213, 302] To [902, 991]' 'Move From [0, 158] To [234, 392]'
 'Move From [460, 465] To [925, 930]' 'Move From [810, 906] To [816, 912]'
 'Move From [186, 187] To [184, 185]' 'Move From [140, 272] To [299, 431]'
 'Move From [114, 140] To [272, 298]'
 'Move From [1386, 1450] To [1445, 1509]'
 'Move From [442, 524] To [296, 378]' 'Move From [408, 414] To [390, 396]'
 'Move From [1144, 1147] To [1142, 1145]'
 'Move From [218, 220] To [206, 208]' 'Move From [164, 165] To [153, 154]'
 'Move From [623, 632] To [624, 633]'
 'Move From [747, 960] To [1041, 1254]'
 'Move Fro
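
Every `Move From [x1, y1] To [x2, y2]` string is its own category, which is why `unique()` returns so many values for `activity`. A minimal sketch of collapsing them into one `Move` category and keeping the four numbers as separate columns follows. The function name `split_move_activity` and the output column names are hypothetical.

```python
import pandas as pd

# Pattern for the 'Move From [x1, y1] To [x2, y2]' activity strings
MOVE_PATTERN = r"^Move From \[(\d+), (\d+)\] To \[(\d+), (\d+)\]$"

def split_move_activity(activity: pd.Series) -> pd.DataFrame:
    """Return a collapsed activity label plus the four Move coordinates."""
    # Rows that do not match the pattern get NaN in all four groups
    coords = activity.str.extract(MOVE_PATTERN)
    coords.columns = ["x1", "y1", "x2", "y2"]
    # Convert digit strings to numbers, then to a nullable integer dtype
    coords = coords.apply(pd.to_numeric).astype("Int64")
    is_move = coords["x1"].notna()
    # Keep the original label unless the row is a Move event
    coords.insert(0, "activity_type", activity.where(~is_move, "Move"))
    return coords

toy = pd.Series(["Input", "Move From [284, 292] To [282, 290]", "Paste"])
print(split_move_activity(toy))
```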

In [125]:
is_alnum = df_train['text_change'].str.contains('q')
print(is_alnum.head(20))
print(is_alnum.shape)

0     False
1     False
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9     False
10     True
11     True
12     True
13    False
14     True
15     True
16     True
17     True
18     True
19     True
Name: text_change, dtype: bool
(8405898,)


In [128]:
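def word_grouping(df):
    """Label every event with a running word index.

    The index increases at the first character event ('q') that follows an
    event without a character, i.e. at the start of each new word.
    """
    # na=False keeps the mask boolean if text_change has missing values
    is_alnum = df['text_change'].str.contains('q', regex=False, na=False)
    # True where a character event follows a non-character event;
    # fill_value=True means the very first row never starts a new group
    starts_word = is_alnum & ~is_alnum.shift(1, fill_value=True)
    # A cumulative sum avoids the row-by-row loop, whose is_alnum.iloc[i + 1]
    # lookup runs past the end of the Series on the last row
    return starts_word.cumsum()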

In [129]:
def iw_iki(df):
    
    return df

In [130]:
iw_iki(df_train)

KeyError: 'text_change'

In [124]:
def calculate_features(df):
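    """Compute per-user keystroke features from an event log.

    Expects the columns of train_logs.csv and returns a DataFrame with one row per id.
    """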
    # Create a DataFrame to store the features with a single column of IDs
    features = pd.DataFrame({'id': df['id'].unique()})

    
    # Pause-timing calculations
    # Inter-keystroke interval (IKI): time between consecutive key presses of one user.
    # The first event of each user has no predecessor, so its IKI stays NaN and is
    # skipped by the aggregations instead of being counted as a 0 ms pause.
    iki = df.groupby('id')['down_time'].diff()
    iki_stats = (iki.groupby(df['id'])
                    .agg(mean_iki='mean', median_iki='median', std_iki='std', max_iki='max')
                    .reset_index())

    # Add the IKI statistics to the feature table
    features = features.merge(iki_stats, on='id', how='left')

    # Within-word IKI: intervals between consecutive character keystrokes of the same word.
    # The word labels are kept in a separate Series so that the input DataFrame
    # (including its own 'word_count' column) is not modified.
    word_group = pd.Series(word_grouping(df).to_numpy(), index=df.index)

    # Keep only 'Input' events that typed a character (anonymised as 'q'); this drops
    # spaces, punctuation, Backspace presses and Nonproduction events before differencing
    is_char = df['text_change'].str.contains('q', regex=False, na=False)
    keep = (df['activity'] == 'Input') & is_char
    typed = df.loc[keep, ['id', 'down_time']].assign(word_group=word_group[keep])
    typed['iki'] = typed.groupby(['id', 'word_group'])['down_time'].diff()

    # Pool all within-word intervals per user; the first keystroke of each word is NaN
    # and is skipped, so mean and SD describe the same set of intervals
    intra_word_iki = (typed.groupby('id')['iki']
                           .agg(mean_intra_word_iki='mean', std_intra_word_iki='std')
                           .reset_index())

    # Merge these new features into the features DataFrame
    features = features.merge(intra_word_iki, on='id', how='left')



    #intra_word_iki = df.groupby('id') 



    #mean_iki_within_word
    #std_iki_within_word
    
    #mean_iki_between_words
    #std_iki_between_words

    #mean_time_between_words
    #std_time_between_words

    #mean_time_between_sentences
    #std_time_between_sentences
    
    #n_iki_1
    #n_iki_2
    #n_iki_3
    #n_iki_4
    #n_iki_5


    # Revision calculations

    # Fluency calculations

    # Verbosity calculations

    # Non-typing event calculations


    return features




IndentationError: expected an indented block after 'for' statement on line 32 (4021508445.py, line 34)
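
The `n_iki_1` to `n_iki_5` placeholders in `calculate_features` correspond to "Number of IKI of specific length". A minimal sketch of counting pauses per length band follows. The band edges in `IKI_EDGES_MS` are illustrative assumptions, not values taken from the paper, and `count_iki_bins` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Assumed band edges in milliseconds; adjust to the definitions you want to follow
IKI_EDGES_MS = [500, 1000, 1500, 2000, 3000, np.inf]
IKI_LABELS = ["n_iki_1", "n_iki_2", "n_iki_3", "n_iki_4", "n_iki_5"]

def count_iki_bins(df: pd.DataFrame) -> pd.DataFrame:
    """Count, per user, how many IKIs fall in each [low, high) band."""
    # IKI per user; the first event of each user is NaN and matches no band
    iki = df.groupby("id")["down_time"].diff()
    flags = pd.DataFrame({"id": df["id"]})
    for label, low, high in zip(IKI_LABELS, IKI_EDGES_MS[:-1], IKI_EDGES_MS[1:]):
        flags[label] = (iki >= low) & (iki < high)
    # Summing the boolean flags gives the count per band
    return flags.groupby("id", sort=False).sum().reset_index()

toy = pd.DataFrame({
    "id": ["a"] * 5 + ["b"] * 2,
    "down_time": [0, 600, 700, 2300, 6000, 0, 1200],
})
print(count_iki_bins(toy))
```

The result has one row per `id`, so it can be merged into `features` with `features.merge(..., on='id', how='left')` like the other statistics.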

In [105]:
calculate_features(df_train)
text_change = df_train.groupby('id')['text_change']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f615a22fe60>


In [None]:
# Initialize a dictionary of lists to store the features for each ID
features = {
    'id': [],               # user ID
    'std_iki': [],          # standard deviation of IKI
    'pct_pauses': [],       # percentage of long pauses between words
    'le_revisions': [],     # leading-edge revisions
    'mean_sb': [],          # mean time in single backspacing
    'mean_mb': [],          # mean time in multiple backspacing
    'pct_le_chars': [],     # percentage of characters at leading edge
    'pct_r_bursts': [],     # percentage of R-bursts
    'num_prod_cycles': [],  # number of production cycles
    'ent_per_30': [],       # entropy of the number of keystrokes per 30s
    'loc_ext_per_30': [],   # local extreme number of keystrokes per 30s
    'mean_tcpj': [],        # mean time of cut/paste/jump events
    'SD_tcpj': [],          # standard deviation of time of cut/paste/jump events
    'pct_other': [],        # percentage of time spent on other events
}
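
As an illustration of how two of these entries might be filled, the sketch below computes `mean_sb` and `mean_mb` from runs of consecutive Backspace presses. It rests on assumptions that are not confirmed by the source: a "single" backspace is a Backspace press with no Backspace directly before or after it in the same user's log, "multiple" backspacing is any longer run, and "time" is the mean `action_time` of the presses. The helper name `backspace_run_features` is hypothetical.

```python
import pandas as pd

def backspace_run_features(df: pd.DataFrame) -> pd.DataFrame:
    """Mean action_time of isolated vs. consecutive Backspace presses per user."""
    is_bs = df["down_event"].eq("Backspace")
    # A new run starts whenever the Backspace flag or the user changes
    run_id = (is_bs.ne(is_bs.shift()) | df["id"].ne(df["id"].shift())).cumsum()
    bs = df.loc[is_bs, ["id", "action_time"]].assign(run_id=run_id[is_bs])
    # Length of the run each Backspace press belongs to
    bs["run_len"] = bs.groupby("run_id")["action_time"].transform("size")

    single = (bs[bs["run_len"] == 1].groupby("id")["action_time"]
              .mean().rename("mean_sb").reset_index())
    multi = (bs[bs["run_len"] > 1].groupby("id")["action_time"]
             .mean().rename("mean_mb").reset_index())

    # Users without a given kind of run get NaN for that feature
    out = pd.DataFrame({"id": df["id"].unique()})
    return out.merge(single, on="id", how="left").merge(multi, on="id", how="left")

toy = pd.DataFrame({
    "id": ["a", "a", "a", "a", "a", "b", "b"],
    "down_event": ["q", "Backspace", "q", "Backspace", "Backspace", "Backspace", "q"],
    "action_time": [100, 50, 100, 80, 60, 40, 90],
})
print(backspace_run_features(toy))
```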