# Data Preprocessing Notebook

This notebook outlines the data preprocessing steps and checks for the dataset.


## Data Loading

Load the responses and participants CSV files into pandas DataFrames.

We'll print the first few rows of each DataFrame to verify that the import worked; we'll need to join the two to run statistical analyses later.

In [13]:
import pandas as pd

responses_file_path = '../data/responses.csv'
participants_file_path = '../data/participants.csv'

responses = pd.read_csv(responses_file_path)
participants = pd.read_csv(participants_file_path)


In [14]:
responses.head()

Unnamed: 0,id,stage,accuracy,completeness,innovation,difficulty,n_user_messages,n_internet_resources,time_to_complete_sec
0,0,ideation,7,2,1,6.0,0.0,23.0,2075.0
1,1,ideation,8,4,1,4.0,0.0,11.0,727.0
2,2,ideation,8,6,0,7.0,19.0,10.0,4474.0
3,3,ideation,0,4,0,2.0,0.0,12.0,771.0
4,4,ideation,8,4,3,6.0,37.0,6.0,3371.0


In [15]:
participants.head()

Unnamed: 0,id,cohort,assignment,llm_experience
0,0,expert,llm_internet,Used a few times
1,1,expert,internet_only,Use at least once every few weeks
2,2,expert,llm_internet,Never used
3,3,student,llm_internet,Use almost every day
4,4,student,llm_internet,Use almost every day


In [18]:
# Join the data frames
data = pd.merge(responses, participants, on='id', how='inner')
data.head()

Unnamed: 0,id,stage,accuracy,completeness,innovation,difficulty,n_user_messages,n_internet_resources,time_to_complete_sec,cohort,assignment,llm_experience
0,0,ideation,7,2,1,6.0,0.0,23.0,2075.0,expert,llm_internet,Used a few times
1,0,acquisition,8,6,0,4.0,0.0,22.0,3175.0,expert,llm_internet,Used a few times
2,0,magnification,4,3,0,6.0,0.0,17.0,2158.0,expert,llm_internet,Used a few times
3,0,formulation,3,2,0,2.0,0.0,5.0,1151.0,expert,llm_internet,Used a few times
4,0,release,6,2,0,5.0,0.0,23.0,1816.0,expert,llm_internet,Used a few times
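
An inner join silently drops any `id` that appears in only one of the two tables. The following is a minimal sketch of how that could be checked, using pandas' `validate` and `indicator` options on `merge`. The `responses_demo` and `participants_demo` frames are hypothetical toy data, not the real files.

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical values, not the real data).
responses_demo = pd.DataFrame({
    "id": [0, 0, 1, 2],
    "stage": ["ideation", "release", "ideation", "ideation"],
    "accuracy": [7, 6, 8, 8],
})
participants_demo = pd.DataFrame({
    "id": [0, 1, 3],
    "cohort": ["expert", "expert", "student"],
})

# validate="many_to_one" raises MergeError if an id is duplicated in participants.
# indicator=True adds a "_merge" column recording which table each row came from.
merged_demo = pd.merge(
    responses_demo,
    participants_demo,
    on="id",
    how="outer",
    validate="many_to_one",
    indicator=True,
)

# Rows that an inner join would have dropped without warning.
unmatched = merged_demo[merged_demo["_merge"] != "both"]
print(unmatched[["id", "_merge"]])
```

If `unmatched` is empty on the real data, the inner join above loses nothing.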


## Missing Values

Check for any missing values in the dataset.


In [19]:
missing_values = data.isnull().sum()
missing_values

id                      0
stage                   0
accuracy                0
completeness            0
innovation              0
difficulty              5
n_user_messages         3
n_internet_resources    3
time_to_complete_sec    3
cohort                  0
assignment              0
llm_experience          0
dtype: int64

Five responses (rows of the joined table) are missing a self-rated difficulty.

Each of n_user_messages, n_internet_resources, and time_to_complete_sec is missing in three rows.
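
Because `isnull().sum()` counts rows of the joined table, it can help to look at the affected rows directly and to count how many distinct participants they belong to. A minimal sketch on hypothetical toy data (the `demo` frame below is made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps (hypothetical values).
demo = pd.DataFrame({
    "id": [0, 0, 1, 1],
    "stage": ["ideation", "release", "ideation", "release"],
    "difficulty": [6.0, np.nan, 4.0, np.nan],
    "time_to_complete_sec": [2075.0, 1816.0, np.nan, 727.0],
})

# Rows with at least one missing value.
rows_with_missing = demo[demo.isnull().any(axis=1)]

# Number of distinct participants affected, per column.
participants_missing = {
    col: demo.loc[demo[col].isnull(), "id"].nunique()
    for col in ["difficulty", "time_to_complete_sec"]
}
print(rows_with_missing)
print(participants_missing)
```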

## Data Types

Ensure that each column has the correct data type.


In [20]:
data_types = data.dtypes
data_types


id                        int64
stage                    object
accuracy                  int64
completeness              int64
innovation                int64
difficulty              float64
n_user_messages         float64
n_internet_resources    float64
time_to_complete_sec    float64
cohort                   object
assignment               object
llm_experience           object
dtype: object
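
The count-like columns show up as `float64` because a NumPy integer column cannot hold missing values, and the label columns are plain `object` strings. One possible tightening, sketched here on hypothetical toy data, is to use pandas' nullable `Int64` dtype for counts and `category` for labels. Whether that conversion suits the later analyses is an assumption, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with a label column and a count column with a gap.
demo = pd.DataFrame({
    "stage": ["ideation", "release", "ideation"],
    "n_user_messages": [0.0, np.nan, 19.0],
})

# "Int64" (capital I) is the nullable integer dtype; NaN becomes <NA>.
converted = demo.astype({"stage": "category", "n_user_messages": "Int64"})
print(converted.dtypes)
```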

## Outliers

Identify any outliers in the dataset using a statistical rule (values that are more than 3 standard deviations from the mean).


In [21]:
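# Select the numeric columns (this includes the 'id' identifier column)
numeric_cols = data.select_dtypes(include=['float64', 'int64']).columns
data_numeric = data[numeric_cols]
# Absolute z-score per column; mean() and std() skip missing values
z_scores = ((data_numeric - data_numeric.mean()) / data_numeric.std()).abs()
# Count the values more than 3 standard deviations from the column mean
outliers = (z_scores > 3).sum()
outliers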


id                       0
accuracy                 0
completeness             4
innovation               5
difficulty               0
n_user_messages         12
n_internet_resources     9
time_to_complete_sec     9
dtype: int64
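
The counts say how many values exceed the threshold but not which rows they are in. A minimal sketch of pulling out the flagged rows with the same 3-standard-deviation rule follows. The helper name `zscore_outlier_mask` and the toy values are hypothetical.

```python
import pandas as pd

def zscore_outlier_mask(frame, threshold=3.0):
    """Mark values more than `threshold` standard deviations from the column mean."""
    z_scores = (frame - frame.mean()) / frame.std()
    return z_scores.abs() > threshold

# Toy frame (hypothetical values): 19 typical durations and one extreme one.
demo = pd.DataFrame({
    "id": range(20),
    "time_to_complete_sec": [700.0] * 19 + [40000.0],
})

# Leave out the identifier: it is a label, not a measurement.
mask = zscore_outlier_mask(demo.drop(columns="id"))

# Rows where at least one measured column is flagged.
outlier_rows = demo[mask.any(axis=1)]
print(outlier_rows)
```

Note that with very few rows a z-score can never reach 3, so this rule only makes sense on reasonably sized samples.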

## Consistency

Check for consistency in categorical variables, such as the 'stage' column.


In [22]:
unique_stages = data['stage'].unique()
unique_stages


array(['ideation', 'acquisition', 'magnification', 'formulation',
       'release'], dtype=object)
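
The same check can be made explicit by comparing the labels against the expected set of five stages shown above. This sketch uses hypothetical toy data containing a deliberately inconsistent label to show what a failure would look like.

```python
import pandas as pd

# Toy frame (hypothetical values) with one label that differs in case and spacing.
demo = pd.DataFrame({
    "stage": ["ideation", "Ideation ", "release"],
})

# The five stage names taken from the output above.
expected_stages = {"ideation", "acquisition", "magnification", "formulation", "release"}

# Labels present in the data but not in the expected set.
unexpected = set(demo["stage"].unique()) - expected_stages

# Normalise whitespace and case, then check again.
normalised = demo["stage"].str.strip().str.lower()
still_unexpected = set(normalised.unique()) - expected_stages
print(unexpected, still_unexpected)
```

The other categorical columns (`cohort`, `assignment`, `llm_experience`) could be checked the same way once their expected labels are written down.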

## Cleaning the Data

In [23]:
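def clean_dataset(raw_data, cleaned_data_path):
    """Save a copy of raw_data to cleaned_data_path and return that path."""
    cleaned_data = raw_data.copy()

    # Rows with missing values are currently kept; uncomment to drop them
    # cleaned_data = cleaned_data.dropna()

    # Save the cleaned data to a CSV file, without the DataFrame index
    cleaned_data.to_csv(cleaned_data_path, index=False)

    return cleaned_data_path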

cleaned_data_file_path = '../data/cleaned_data.csv' 

# Call the function and obtain the path to the cleaned data file
cleaned_file_path = clean_dataset(data, cleaned_data_file_path)
cleaned_file_path

'../data/cleaned_data.csv'
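
As a quick sanity check, the saved file can be read back and compared with the frame that was written. The sketch below does this on hypothetical toy data in a temporary directory. Writing with `index=False` keeps the index out of the file, so no extra `Unnamed: 0` column appears on reload.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry.
demo = pd.DataFrame({"id": [0, 1, 2], "difficulty": [6.0, np.nan, 4.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_data.csv")
    demo.to_csv(path, index=False)   # same call as in clean_dataset
    reloaded = pd.read_csv(path)

# The reloaded frame should have the same columns, shape, and missing-value count.
same_columns = list(reloaded.columns) == list(demo.columns)
same_shape = reloaded.shape == demo.shape
same_missing = int(reloaded.isnull().sum().sum()) == int(demo.isnull().sum().sum())
print(same_columns, same_shape, same_missing)
```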