# Survey Data Cleaning and Preprocessing

In this notebook, we clean and preprocess survey data using the pandas library. Data cleaning is an essential first step in data analysis because real-world data is often messy: it can contain missing responses, duplicate submissions, and columns stored with the wrong type. Cleaning makes the dataset consistent and ready for further analysis.

In [17]:
import pandas as pd
import numpy as np

# Load data from CSV
df = pd.read_csv("./Data/Stress_Dataset.csv")
df.head()

,Gender,Age,Have you recently experienced stress in your life?,Have you noticed a rapid heartbeat or palpitations?,Have you been dealing with anxiety or tension recently?,Do you face any sleep problems or difficulties falling asleep?,Have you been dealing with anxiety or tension recently?.1,Have you been getting headaches more often than usual?,Do you get irritated easily?,Do you have trouble concentrating on your academic tasks?,...,Are you facing any difficulties with your professors or instructors?,Is your working environment unpleasant or stressful?,Do you struggle to find time for relaxation and leisure activities?,Is your hostel or home environment causing you difficulties?,Do you lack confidence in your academic performance?,Do you lack confidence in your choice of academic subjects?,Academic and extracurricular activities conflicting for you?,Do you attend classes regularly?,Have you gained/lost weight?,Which type of stress do you primarily experience?
0,0,20,3,4,2,5,1,2,1,2,...,3,1,4,1,2,1,3,1,2,Eustress (Positive Stress) - Stress that motiv...
1,0,20,2,3,2,1,1,1,1,4,...,3,2,1,1,3,2,1,4,2,Eustress (Positive Stress) - Stress that motiv...
2,0,20,5,4,2,2,1,3,4,2,...,2,2,2,1,4,1,1,2,1,Eustress (Positive Stress) - Stress that motiv...
3,1,20,3,4,3,2,2,3,4,3,...,1,1,2,1,2,1,1,5,3,Eustress (Positive Stress) - Stress that motiv...
4,0,20,3,3,3,2,2,4,4,4,...,2,3,1,2,2,4,2,2,2,Eustress (Positive Stress) - Stress that motiv...


## 1. Exploring the Dataset

Before cleaning, it is important to explore the dataset. This helps us understand the structure, data types, and possible issues such as missing values or duplicated columns.

In [9]:
# Display basic information about the dataset
df.info()

# Display summary statistics
df.describe(include="all").T

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 843 entries, 0 to 842
Data columns (total 26 columns):
 #   Column                                                                Non-Null Count  Dtype 
---  ------                                                                --------------  ----- 
 0   Gender                                                                843 non-null    int64 
 1   Age                                                                   843 non-null    int64 
 2   Have you recently experienced stress in your life?                    843 non-null    int64 
 3   Have you noticed a rapid heartbeat or palpitations?                   843 non-null    int64 
 4   Have you been dealing with anxiety or tension recently?               843 non-null    int64 
 5   Do you face any sleep problems or difficulties falling asleep?        843 non-null    int64 
 6   Have you been dealing with anxiety or tension recently?.1             843 non-null    int64 
 7   Have you

,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Gender,843.0,,,,0.349941,0.477234,0.0,0.0,0.0,1.0,1.0
Age,843.0,,,,20.071174,5.429502,14.0,19.0,19.0,20.0,100.0
Have you recently experienced stress in your life?,843.0,,,,2.997628,1.134639,1.0,2.0,3.0,4.0,5.0
Have you noticed a rapid heartbeat or palpitations?,843.0,,,,2.755635,1.11865,1.0,2.0,3.0,4.0,5.0
Have you been dealing with anxiety or tension recently?,843.0,,,,2.543298,1.20133,1.0,2.0,2.0,3.0,5.0
Do you face any sleep problems or difficulties falling asleep?,843.0,,,,2.786477,1.266959,1.0,2.0,3.0,4.0,5.0
Have you been dealing with anxiety or tension recently?.1,843.0,,,,2.663108,1.266376,1.0,2.0,2.0,4.0,5.0
Have you been getting headaches more often than usual?,843.0,,,,2.628707,1.266593,1.0,2.0,2.0,4.0,5.0
Do you get irritated easily?,843.0,,,,2.702254,1.314213,1.0,2.0,3.0,4.0,5.0
Do you have trouble concentrating on your academic tasks?,843.0,,,,2.699881,1.313673,1.0,2.0,3.0,4.0,5.0
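
When a CSV header repeats a column name, pandas renames the later copy with a numeric suffix such as `.1`, which is how the second anxiety question above got its name. The following is a minimal sketch of how such columns might be located and compared with their originals. It runs on a hypothetical toy frame (`toy`, `Q1`, `Q2` are invented for illustration), not on the survey data itself.

```python
import re

import pandas as pd

# Hypothetical toy data standing in for the survey; column names are invented.
toy = pd.DataFrame({
    "Q1": [1, 2, 3],
    "Q2": [4, 5, 6],
    "Q2.1": [4, 5, 1],  # repeated question whose answers differ in one row
})

# Columns whose names end in ".<digits>" are candidates for repeated headers.
suffixed = list(toy.columns[toy.columns.str.contains(r"\.\d+$")])

# Compare each candidate with the column it was renamed from.
identical = {}
for col in suffixed:
    base = re.sub(r"\.\d+$", "", col)
    identical[col] = toy[base].equals(toy[col])

print(suffixed)
print(identical)
```

In the summary statistics above, the two anxiety columns have different means (about 2.54 and 2.66), so they are not exact copies. Whether to keep both, drop one, or combine them depends on the questionnaire, which this notebook does not show.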


## 2. Handling Missing Values

Survey data often has missing responses. Pandas provides simple methods to detect and handle missing values. We can choose to either remove rows with missing values or fill them with reasonable replacements.

In [18]:
# Count missing values per column
# (printed explicitly, because a cell only displays its last expression)
print(df.isnull().sum())

# Example: Fill missing Age values with the median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Example: Drop rows where 'Gender' is missing
df = df.dropna(subset=['Gender'])
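
The effect of filling and dropping is easier to see on a frame that actually contains gaps. The sketch below uses a small hypothetical frame (`toy` and its values are invented for illustration) and counts missing values before and after applying the same two steps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Gender and one missing Age.
toy = pd.DataFrame({
    "Gender": [0, 1, np.nan, 1],
    "Age": [19, np.nan, 21, 23],
})

missing_before = toy.isnull().sum()

# Fill missing ages with the median of the observed ages (19, 21, 23).
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Drop rows where Gender is missing; there is no sensible value to fill in.
toy = toy.dropna(subset=["Gender"])

missing_after = toy.isnull().sum()

print(missing_before)
print(missing_after)
print(len(toy), "rows remain")
```

Median filling is used for Age because the median is less affected by extreme values than the mean. Dropping is used for Gender because a guessed category would be fabricated data.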

## 3. Removing Duplicates

Sometimes survey data contains duplicate rows when participants accidentally submit multiple times. Removing duplicates keeps repeated submissions from being counted more than once in the analysis.

In [11]:
# Count fully duplicated rows
# (printed explicitly, because a cell only displays its last expression)
print(df.duplicated().sum())

# Remove duplicates
df = df.drop_duplicates()
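
One caveat applies when a dataset has no respondent identifier: two different people can give identical answers, and `drop_duplicates()` cannot tell them apart from a double submission. The sketch below illustrates the difference on a hypothetical toy frame; the `respondent_id` column is an assumption made for the example and is not shown in the survey data.

```python
import pandas as pd

# Hypothetical toy data: respondent 102 submitted twice,
# and respondent 103 happens to give the same answers as 102.
toy = pd.DataFrame({
    "respondent_id": [101, 102, 102, 103],
    "Age": [19, 20, 20, 20],
    "Q1": [3, 4, 4, 4],
})

# With an identifier, only the true double submission is flagged.
n_exact = toy.duplicated().sum()
deduped = toy.drop_duplicates().reset_index(drop=True)

# Without an identifier, respondent 103 is also flagged as a duplicate.
no_id = toy.drop(columns="respondent_id")
n_without_id = no_id.duplicated().sum()

print(n_exact, n_without_id)
```

With only a few short-scale questions, identical rows from different people are plausible, so it is worth inspecting the flagged rows before removing them.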

## 4. Converting Numeric Codes to Meaningful Labels

Survey data often uses numeric codes to represent categorical values. For example, gender might be coded as 0 and 1 instead of 'Female' and 'Male'. Converting these numeric codes to descriptive labels makes the data more readable and easier to interpret during analysis.

In [19]:
# Convert numeric gender codes to descriptive labels
# (0 = Female, 1 = Male is the coding assumed here; confirm it against the survey codebook)
gender_map = {0: 'Female', 1: 'Male'}
df['Gender'] = df['Gender'].map(gender_map)

# Show the unique values to verify the conversion
print("Unique values in Gender column after conversion:")
print(df['Gender'].unique())

Unique values in Gender column after conversion:
['Female' 'Male']
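
`Series.map` with a dictionary replaces every value that is not a key in the dictionary with `NaN`. That has two practical consequences: unexpected codes are lost without an error, and running the mapping cell a second time wipes the column, because the labels are no longer keys. The sketch below shows both on a hypothetical toy frame; the code `2` is invented to stand for an unexpected value.

```python
import pandas as pd

# Hypothetical toy data; 2 is an unexpected code not covered by the mapping.
toy = pd.DataFrame({"Gender": [0, 1, 0, 2]})

gender_map = {0: "Female", 1: "Male"}  # assumed coding, as in the cell above

# Check for codes that the mapping does not cover before applying it.
unmapped = sorted(int(c) for c in set(toy["Gender"].unique()) - set(gender_map))

# Values missing from the dictionary become NaN.
labels = toy["Gender"].map(gender_map)

# Mapping again finds no matching keys, so every value becomes NaN.
rerun = labels.map(gender_map)

print(unmapped)
print(labels.tolist())
print(rerun.isna().all())
```

Checking `unmapped` first, and assigning the result to a new column rather than overwriting `Gender`, makes the cell safe to re-run.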


In [13]:
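# Check the data type of the Gender column and its values
# (the output below appears to have been captured before the mapping cell above was run,
# which is why it still shows the numeric codes)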
print("Data type of Gender column:", df['Gender'].dtype)
print("\nUnique values in Gender column:")
print(df['Gender'].unique())

Data type of Gender column: int64

Unique values in Gender column:
[0 1]


## 5. Transforming Data Types

It is important that each column has the correct data type. For example, the 'Age' column should be an integer type, and categorical responses such as 'Gender' should be stored with the pandas `category` dtype.

In [6]:
# Convert Age to integer
df['Age'] = df['Age'].astype(int)

# Convert Gender to categorical
df['Gender'] = df['Gender'].astype('category')
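
`astype(int)` raises an error if the column contains missing values or text that cannot be parsed. The following is a sketch of a more tolerant conversion, assuming a hypothetical toy frame in which ages were read as strings. It uses `pd.to_numeric` with `errors="coerce"` and the nullable `Int64` dtype.

```python
import pandas as pd

# Hypothetical toy data: ages read as text, with one unparseable entry.
toy = pd.DataFrame({
    "Age": ["19", "20", "twenty", "21"],
    "Gender": ["Female", "Male", "Female", "Female"],
})

# Unparseable entries become NaN instead of raising an error.
age_numeric = pd.to_numeric(toy["Age"], errors="coerce")
n_bad = age_numeric.isna().sum()

# The nullable Int64 dtype stores integers alongside missing values.
toy["Age"] = age_numeric.astype("Int64")

# The category dtype stores each distinct label once.
toy["Gender"] = toy["Gender"].astype("category")

print(n_bad)
print(toy.dtypes)
```

Counting the coerced entries (`n_bad`) shows how many values were lost, so they can be reviewed rather than discarded unnoticed.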

## 6. A Slightly Complex Challenge

Let us now try a slightly more complex cleaning task. Suppose we want to create a new column called **Stress_Level** based on the answers to three survey questions:
- Have you recently experienced stress in your life?
- Do you face any sleep problems or difficulties falling asleep?
- Do you feel overwhelmed with your academic workload?

We will compute the average of these three columns and use it to classify participants as having **Low** (average of 2 or below), **Medium** (above 2, up to 3.5), or **High** (above 3.5) stress levels.

In [7]:
# Calculate average of selected stress indicators
df['Stress_Score'] = df[[
    'Have you recently experienced stress in your life?',
    'Do you face any sleep problems or difficulties falling asleep?',
    'Do you feel overwhelmed with your academic workload?'
]].mean(axis=1)

# Create categories based on the score.
# pd.cut intervals are right-inclusive by default: (0, 2], (2, 3.5], (3.5, 5]
df['Stress_Level'] = pd.cut(
    df['Stress_Score'],
    bins=[0, 2, 3.5, 5],
    labels=['Low', 'Medium', 'High']
)

df[['Stress_Score', 'Stress_Level']].head()

,Stress_Score,Stress_Level
0,4.333333,High
1,1.333333,Low
2,3.666667,High
3,2.0,Low
4,2.0,Low
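
The output above classifies a score of exactly 2.0 as Low because `pd.cut` treats each bin as open on the left and closed on the right by default. The short sketch below, using invented scores, shows where values on and near the bin edges land, and what happens to values outside the bins.

```python
import pandas as pd

# Invented scores chosen to sit on and around the bin edges.
scores = pd.Series([1.0, 2.0, 2.01, 3.5, 3.67, 5.0])

levels = pd.cut(
    scores,
    bins=[0, 2, 3.5, 5],
    labels=["Low", "Medium", "High"],
)

# Values outside (0, 5] fall in no bin and become NaN.
outside = pd.cut(
    pd.Series([0.0, 5.5]),
    bins=[0, 2, 3.5, 5],
    labels=["Low", "Medium", "High"],
)

print(levels.tolist())
print(outside.isna().all())
```

Since the survey responses range from 1 to 5, every average falls inside the bins. Passing `right=False` would make the bins left-inclusive instead, which would move a score of 2.0 into Medium.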
