# Survey Data Manipulation and Transformation

In this notebook, we will explore how to manipulate and transform survey data using Pandas. Data manipulation is a crucial part of analysis because it allows us to filter, reorganize, and summarize data to reveal insights.

In [None]:
import pandas as pd
import numpy as np

# Load data from CSV
df = pd.read_csv("./Data/Stress_Dataset.csv")
df.head()

## 1. Selecting Specific Columns

We can select individual columns or multiple columns to focus on particular parts of the dataset.

In [None]:
# Select the Age and Gender columns
df[['Age', 'Gender']].head()
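
The bracket style determines what kind of object comes back. The sketch below uses a small made-up table in place of the survey file (the `toy` frame and its values are illustrative assumptions, not taken from the dataset) to show the difference between single brackets, double brackets, and `.loc`.

```python
import pandas as pd

# Toy data standing in for the survey file (values are invented for illustration)
toy = pd.DataFrame({
    "Age": [19, 22, 20, 25],
    "Gender": ["Female", "Male", "Female", "Male"],
})

age_series = toy["Age"]                  # single brackets -> pandas Series
age_frame = toy[["Age"]]                 # double brackets -> one-column DataFrame
subset = toy.loc[:, ["Age", "Gender"]]   # label-based selection of the same columns

print(type(age_series).__name__)
print(type(age_frame).__name__)
```

A Series is convenient for calculations on one column, while a DataFrame keeps the tabular layout for display or further column selection.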

## 2. Filtering Rows Based on Conditions

Filtering allows us to focus on subsets of the data. For example, we can select only participants above a certain age or with a specific gender.

In [None]:
# Select rows where Age is greater than 20
# (display() is used because a notebook cell only renders its last expression)
display(df[df['Age'] > 20].head())

# Select rows where Gender is 'Female'
df[df['Gender'] == 'Female'].head()
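
Conditions can also be combined. The following is a minimal sketch on made-up data (the `toy` frame is an assumption for illustration) showing `&` for "and", `isin()` for matching several values, and `query()` as an alternative spelling of the same filter.

```python
import pandas as pd

# Toy data standing in for the survey file (values are invented for illustration)
toy = pd.DataFrame({
    "Age": [19, 22, 20, 25, 23],
    "Gender": ["Female", "Male", "Female", "Male", "Female"],
})

# Each condition needs its own parentheses; use & for "and" and | for "or"
older_women = toy[(toy["Age"] > 20) & (toy["Gender"] == "Female")]

# isin() keeps rows whose value matches any entry in the list
age_20_or_25 = toy[toy["Age"].isin([20, 25])]

# query() expresses the first filter as a string
older_women_q = toy.query("Age > 20 and Gender == 'Female'")

print(older_women)
```

The parentheses matter: `&` binds more tightly than the comparison operators, so leaving them out typically raises an error.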

## 3. Creating New Columns

We can create new columns from existing data. For example, let us create a column that flags whether a participant is considered 'young' (Age < 21).

In [None]:
# Create a new column 'Is_Young'
df['Is_Young'] = df['Age'] < 21
df[['Age', 'Is_Young']].head()
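
A boolean flag can be extended to labels or to more than two groups. The sketch below, on made-up ages, uses `np.where` for a two-way label and `pd.cut` for age bands. The `Age_Label` and `Age_Band` columns and the bin edges are hypothetical choices for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy ages standing in for the survey file (values are invented for illustration)
toy = pd.DataFrame({"Age": [18, 20, 21, 24, 30]})

# Same flag as in the notebook: True when Age is below 21
toy["Is_Young"] = toy["Age"] < 21

# Two-way text label derived from the flag
toy["Age_Label"] = np.where(toy["Is_Young"], "young", "older")

# Hypothetical age bands; right=False makes the bins [0, 21), [21, 25), [25, 120)
toy["Age_Band"] = pd.cut(
    toy["Age"],
    bins=[0, 21, 25, 120],
    right=False,
    labels=["under 21", "21-24", "25+"],
)

print(toy)
```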

## 4. Sorting Data

Sorting allows us to organize the dataset. For example, we may want to sort participants by Age or by responses to a particular question.

In [None]:
# Sort by Age (ascending)
# (display() is used because a notebook cell only renders its last expression)
display(df.sort_values('Age').head())

# Sort by Age (descending)
df.sort_values('Age', ascending=False).head()
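
Sorting can use several columns at once, each with its own direction. This is a small sketch on made-up data (the `toy` frame is an assumption for illustration). It also shows that `sort_values` returns a new DataFrame and leaves the original order untouched.

```python
import pandas as pd

# Toy data standing in for the survey file (values are invented for illustration)
toy = pd.DataFrame({
    "Age": [22, 19, 22, 19],
    "Gender": ["Male", "Female", "Female", "Male"],
})

# Sort by Gender (A to Z), then by Age (oldest first) within each gender
by_gender_then_age = toy.sort_values(["Gender", "Age"], ascending=[True, False])

# The original frame keeps its row order unless the result is assigned back
print(by_gender_then_age)
print(toy)
```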

## 5. Grouping and Aggregation

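Grouping allows us to calculate summary statistics across categories. For example, we can compare the average age, or the average response to a stress question, between genders. Averaging a question column assumes its responses are coded as numbers.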

In [None]:
# Group by Gender and compute average Age
# (display() is used because a notebook cell only renders its last expression)
display(df.groupby('Gender')['Age'].mean())

# Group by Gender and compute multiple aggregations
df.groupby('Gender').agg({
    'Age': ['mean', 'max', 'min'],
    'Have you recently experienced stress in your life?': 'mean'
})
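
Passing a dictionary with a list of functions produces two-level column names. The sketch below reproduces that on made-up data and flattens the result into single-level names. The `Stress_Q` column is a hypothetical numeric stand-in for a survey question, and the values are invented.

```python
import pandas as pd

# Toy data; Stress_Q is a hypothetical numeric stand-in for a survey question
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Age": [19, 22, 21, 24],
    "Stress_Q": [4, 2, 5, 3],
})

summary = toy.groupby("Gender").agg({
    "Age": ["mean", "max", "min"],
    "Stress_Q": "mean",
})

# Columns are (column, function) pairs; join them into names like "Age_mean"
summary.columns = ["_".join(col) for col in summary.columns]

print(summary)
```

Flat names are often easier to reference later, for example `summary["Age_mean"]`.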

## 6. A Slightly More Complex Challenge

Let us try a more advanced task. Suppose we want to find the **average stress score** (the mean of each participant's responses to three stress-related questions) by **Gender**, and also check how many participants fall into each group. This combines column creation, grouping, and aggregation.

In [None]:
# Compute average stress score for each participant
df['Stress_Score'] = df[[
    'Have you recently experienced stress in your life?',
    'Do you face any sleep problems or difficulties falling asleep?',
    'Do you feel overwhelmed with your academic workload?'
]].mean(axis=1)

# Group by Gender and summarize
df.groupby('Gender').agg(
    Avg_Stress_Score=('Stress_Score', 'mean'),
    Participant_Count=('Gender', 'count')
)
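
Missing answers affect both steps. The sketch below uses made-up data (the `Q1` to `Q3` columns are hypothetical stand-ins for the three questions) to show that `mean(axis=1)` skips missing values by default, and that `'size'` counts every row in a group while `'count'` counts only non-missing values.

```python
import numpy as np
import pandas as pd

# Toy data; Q1-Q3 are hypothetical stand-ins for the three stress questions
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male"],
    "Q1": [4, 5, 2, np.nan],
    "Q2": [3, np.nan, 2, np.nan],
    "Q3": [5, 4, 1, np.nan],
})

# Row-wise mean; NaN answers are skipped, and an all-NaN row gives NaN
toy["Stress_Score"] = toy[["Q1", "Q2", "Q3"]].mean(axis=1)

summary = toy.groupby("Gender").agg(
    Avg_Stress_Score=("Stress_Score", "mean"),
    Participant_Count=("Gender", "size"),      # all rows in the group
    Scored_Count=("Stress_Score", "count"),    # rows with a non-missing score
)

print(summary)
```

When the two counts differ, the group average is based on fewer participants than the group contains.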