<img src="../img/UP Data Science Society Logo 2.png" width=700 />

# [3.1] Introduction to Pandas

**Prepared by**:
- Yuan Labuguen

**Weekly Objectives**:
- Learn to load and manipulate data using Pandas
- Understand basic Pandas data structures and operations
- Master data analysis and transformation techniques with Pandas

## 1. Loading and Reading Data with Pandas

Let's start by importing Pandas and learning different ways to load data.

In [1]:
# Import pandas
import pandas as pd
import numpy as np

# Load data from CSV
df = pd.read_csv("./Data/Stress_Dataset.csv")

# Display the first few rows
print("First few rows of the dataset:")
print(df.head())

# Create a DataFrame from a dictionary
data_dict = {
    'Name': ['John', 'Anna', 'Peter'],
    'Age': [25, 28, 30],
    'City': ['New York', 'Paris', 'London']
}
df_dict = pd.DataFrame(data_dict)
print("\nDataFrame from dictionary:")
print(df_dict)

First few rows of the dataset:
   Gender  Age  Have you recently experienced stress in your life?  \
0       0   20                                                  3    
1       0   20                                                  2    
2       0   20                                                  5    
3       1   20                                                  3    
4       0   20                                                  3    

   Have you noticed a rapid heartbeat or palpitations?  \
0                                                  4     
1                                                  3     
2                                                  4     
3                                                  4     
4                                                  3     

   Have you been dealing with anxiety or tension recently?  \
0                                                  2         
1                                                  2         
2            
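As a supplement to the CSV and dictionary examples above, here is a minimal sketch of two other loading patterns. The in-memory CSV text and the names `csv_text`, `df_small`, and `df_records` are made up for illustration and are not part of the stress dataset.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk
csv_text = """Gender,Age,Score
0,20,3
1,19,2
0,21,5
1,20,4
"""

# read_csv accepts file-like objects; usecols keeps only the named columns
# and nrows limits how many data rows are read
df_small = pd.read_csv(io.StringIO(csv_text), usecols=["Age", "Score"], nrows=3)
print(df_small)

# A DataFrame can also be built from a list of row dictionaries
records = [{"Age": 20, "Score": 3}, {"Age": 19, "Score": 2}]
df_records = pd.DataFrame(records)
print(df_records)
```

Limiting columns and rows at load time can be handy for a quick look at a large file before reading all of it.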

## 2. Understanding DataFrame Basics

Let's explore the basic properties and methods of a Pandas DataFrame.

In [2]:
# Basic DataFrame information
print("DataFrame Info:")
df.info()  # info() prints its report directly and returns None

# Basic statistics of numeric columns
print("\nBasic Statistics:")
print(df.describe())

# DataFrame dimensions
print("\nDataFrame Shape (rows, columns):", df.shape)

# Column names
print("\nColumn Names:", df.columns.tolist())

# Data types of columns
print("\nData Types:")
print(df.dtypes)

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 843 entries, 0 to 842
Data columns (total 26 columns):
 #   Column                                                                Non-Null Count  Dtype 
---  ------                                                                --------------  ----- 
 0   Gender                                                                843 non-null    int64 
 1   Age                                                                   843 non-null    int64 
 2   Have you recently experienced stress in your life?                    843 non-null    int64 
 3   Have you noticed a rapid heartbeat or palpitations?                   843 non-null    int64 
 4   Have you been dealing with anxiety or tension recently?               843 non-null    int64 
 5   Do you face any sleep problems or difficulties falling asleep?        843 non-null    int64 
 6   Have you been dealing with anxiety or tension recently?.1             843 non-null    int6
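The sketch below illustrates two details of the inspection methods used above on a small, made-up DataFrame (`df_demo` is an illustrative name): `describe()` summarizes only numeric columns unless told otherwise, and `info()` prints its report itself rather than returning it.

```python
import pandas as pd

# Small illustrative DataFrame (not the stress dataset)
df_demo = pd.DataFrame({
    "Name": ["John", "Anna", "Peter"],
    "Age": [25, 28, 30],
})

# By default, describe() covers only the numeric columns
numeric_summary = df_demo.describe()
print(numeric_summary.columns.tolist())

# include="all" also summarizes non-numeric columns (count, unique, top, freq)
full_summary = df_demo.describe(include="all")
print(full_summary.columns.tolist())

# info() writes its report to the screen and returns None
result = df_demo.info()
print(result is None)
```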

## 3. Data Selection and Indexing

Learn different ways to select and index data in a DataFrame.

In [3]:
# Select a single column
print("Single column selection:")
print(df.iloc[:, 0])  # First column

# Select multiple columns
print("\nMultiple column selection:")
print(df.iloc[:, [0, 1]])  # First two columns

# Select rows by integer position
print("\nRow selection by index:")
print(df.iloc[0:3])  # First three rows

# Select specific rows and columns
print("\nSpecific rows and columns:")
print(df.iloc[0:3, 0:2])  # First three rows and first two columns

# Using loc for label-based indexing
print("\nLabel-based indexing with column names:")
print(df.loc[:, df.columns[0:2]])  # First two columns by name

Single column selection:
0      0
1      0
2      0
3      1
4      0
      ..
838    0
839    1
840    1
841    0
842    0
Name: Gender, Length: 843, dtype: int64

Multiple column selection:
     Gender  Age
0         0   20
1         0   20
2         0   20
3         1   20
4         0   20
..      ...  ...
838       0   21
839       1   19
840       1   19
841       0   20
842       0   19

[843 rows x 2 columns]

Row selection by index:
   Gender  Age  Have you recently experienced stress in your life?  \
0       0   20                                                  3    
1       0   20                                                  2    
2       0   20                                                  5    

   Have you noticed a rapid heartbeat or palpitations?  \
0                                                  4     
1                                                  3     
2                                                  4     

   Have you been dealing with anxiety or 
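With the default 0-based index, `iloc` and `loc` can look interchangeable. The sketch below uses a small, made-up DataFrame with string row labels (all names here are illustrative) to show where they differ: `iloc` slices by position and excludes the end, while `loc` slices by label and includes it.

```python
import pandas as pd

# Illustrative DataFrame with string labels as the row index
df_demo = pd.DataFrame(
    {"Age": [25, 28, 30], "City": ["New York", "Paris", "London"]},
    index=["john", "anna", "peter"],
)

# Position-based: the end position is excluded, so this is one row
by_position = df_demo.iloc[0:1]

# Label-based: the end label is included, so this is two rows
by_label = df_demo.loc["john":"anna"]

# A single cell addressed by row label and column name
single_value = df_demo.loc["anna", "City"]

print(by_position)
print(by_label)
print(single_value)
```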

## 4. Data Filtering and Boolean Indexing

Learn how to filter data using boolean conditions.

In [5]:
# Simple boolean condition
numeric_cols = df.select_dtypes(include=[np.number]).columns
if len(numeric_cols) > 0:
    first_numeric = numeric_cols[0]
    condition = df[first_numeric] > df[first_numeric].mean()
    print(f"Rows where {first_numeric} is above mean:")
    print(df[condition].head())

# Multiple conditions
if len(numeric_cols) > 1:
    second_numeric = numeric_cols[1]
    condition1 = df[first_numeric] > df[first_numeric].mean()
    condition2 = df[second_numeric] < df[second_numeric].median()
    print(f"\nRows where {first_numeric} is above mean AND {second_numeric} is below median:")
    print(df[condition1 & condition2].head())

# Using isin()
if len(df) > 5:
    values_to_check = df.iloc[0:5, 0].tolist()
    print("\nRows where first column values are in the first 5 values:")
    print(df[df.iloc[:, 0].isin(values_to_check)].head())

Rows where Gender is above mean:
    Gender  Age  Have you recently experienced stress in your life?  \
3        1   20                                                  3    
10       1   19                                                  2    
11       1   20                                                  5    
15       1   19                                                  4    
18       1   18                                                  1    

    Have you noticed a rapid heartbeat or palpitations?  \
3                                                   4     
10                                                  1     
11                                                  3     
15                                                  4     
18                                                  1     

    Have you been dealing with anxiety or tension recently?  \
3                                                   3         
10                                                  2      
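The cell above combines conditions with `&`. As a small sketch on made-up data (the DataFrame and variable names are illustrative), the other common operators are `|` for OR and `~` for NOT. Each comparison needs its own parentheses because `|` and `&` bind more tightly than `<` or `==`.

```python
import pandas as pd

# Illustrative data, not the stress dataset
df_demo = pd.DataFrame({
    "Name": ["John", "Anna", "Peter", "Maria"],
    "Age": [25, 28, 30, 22],
    "City": ["New York", "Paris", "London", "Paris"],
})

# OR: rows that satisfy at least one condition
young_or_paris = df_demo[(df_demo["Age"] < 25) | (df_demo["City"] == "Paris")]

# NOT: invert a boolean mask with ~
not_paris = df_demo[~(df_demo["City"] == "Paris")]

# isin: membership test against a list of allowed values
in_cities = df_demo[df_demo["City"].isin(["Paris", "London"])]

print(young_or_paris)
print(not_paris)
print(in_cities)
```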

## 5. Basic Data Analysis Operations

Explore basic statistical operations and data analysis techniques.

In [7]:
# Basic statistics for numeric columns
numeric_cols = df.select_dtypes(include=[np.number])
categorical_cols = df.select_dtypes(exclude=[np.number])

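# Identify meaningful numeric columns (excluding categorical codes).
# Note: str.contains matches substrings, so any name containing "id"
# (for example "rapid") is excluded as well.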
meaningful_numeric = numeric_cols.columns[~numeric_cols.columns.str.contains('id|gender|category|type', case=False)].tolist()

if meaningful_numeric:
    print("Basic Statistics for Numeric Columns (excluding categorical codes):")
    print("\nMean:")
    print(df[meaningful_numeric].mean())
    print("\nMedian:")
    print(df[meaningful_numeric].median())
    print("\nStandard Deviation:")
    print(df[meaningful_numeric].std())

# Value counts for categorical columns
if not categorical_cols.empty:
    print("\nCategory Distributions:")
    for col in categorical_cols.columns[:2]:  # Show first two categorical columns
        print(f"\nDistribution of {col}:")
        print(df[col].value_counts())
        print(f"Number of unique {col} values:", df[col].nunique())

# Sorting data
if meaningful_numeric:
    sort_col = meaningful_numeric[0]
    print(f"\nTop 5 rows sorted by {sort_col}:")
    print(df.sort_values(by=sort_col, ascending=False).head())

Basic Statistics for Numeric Columns (excluding categorical codes):

Mean:
Age                                                                     20.071174
Have you recently experienced stress in your life?                       2.997628
Have you been dealing with anxiety or tension recently?                  2.543298
Do you face any sleep problems or difficulties falling asleep?           2.786477
Have you been dealing with anxiety or tension recently?.1                2.663108
Have you been getting headaches more often than usual?                   2.628707
Do you get irritated easily?                                             2.702254
Do you have trouble concentrating on your academic tasks?                2.699881
Have you been feeling sadness or low mood?                               2.584816
Have you been experiencing any illness or health issues?                 2.549229
Do you often feel lonely or isolated?                                    2.497034
Do you feel overwhelmed
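The column filter in the cell above relies on `str.contains`, which matches substrings. This explains why the "rapid heartbeat" column is absent from the means: "rapid" contains "id". The sketch below reproduces that on three column names from the dataset and compares it with a word-boundary pattern. That pattern is an assumed alternative for illustration, not what the notebook uses.

```python
import pandas as pd

# Three column names taken from the dataset
cols = pd.Index([
    "Gender",
    "Age",
    "Have you noticed a rapid heartbeat or palpitations?",
])

# Pattern used in the notebook: plain substring matching
pattern = "id|gender|category|type"
substring_hits = cols[cols.str.contains(pattern, case=False)].tolist()

# Assumed alternative: \b restricts matches to whole words
word_pattern = r"\b(?:id|gender|category|type)\b"
word_hits = cols[cols.str.contains(word_pattern, case=False)].tolist()

print(substring_hits)
print(word_hits)
```

Whether the stricter pattern is preferable depends on which columns are meant to be excluded. Listing the excluded names explicitly would be another option.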

## 6. Handling Missing Data

Learn how to identify and handle missing data in your DataFrame.

In [8]:
# Check for missing values
print("Missing values in each column:")
print(df.isnull().sum())

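# Create a sample DataFrame with missing values
df_missing = df.copy()
if len(df) > 0 and len(df.columns) > 0:
    first_col = df_missing.columns[0]
    # Integer columns cannot hold NaN, so cast to float before assigning
    if pd.api.types.is_integer_dtype(df_missing[first_col]):
        df_missing[first_col] = df_missing[first_col].astype(float)
    df_missing.iloc[0:3, 0] = np.nan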

# Different ways to handle missing values
print("\nDropping rows with missing values:")
print(df_missing.dropna().head())

print("\nFilling missing values with mean (numeric columns):")
numeric_cols = df_missing.select_dtypes(include=[np.number])
if not numeric_cols.empty:
    df_filled = df_missing.copy()
    for col in numeric_cols.columns:
        df_filled[col] = df_filled[col].fillna(df_filled[col].mean())
    print(df_filled.head())

Missing values in each column:
Gender                                                                  0
Age                                                                     0
Have you recently experienced stress in your life?                      0
Have you noticed a rapid heartbeat or palpitations?                     0
Have you been dealing with anxiety or tension recently?                 0
Do you face any sleep problems or difficulties falling asleep?          0
Have you been dealing with anxiety or tension recently?.1               0
Have you been getting headaches more often than usual?                  0
Do you get irritated easily?                                            0
Do you have trouble concentrating on your academic tasks?               0
Have you been feeling sadness or low mood?                              0
Have you been experiencing any illness or health issues?                0
Do you often feel lonely or isolated?                                   0
Do you 
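The cell above drops every row with a missing value or fills every numeric column with its mean. As a sketch on made-up data (all names are illustrative), both operations can also be targeted at specific columns: `dropna(subset=...)` looks only at the listed columns, and `fillna` accepts a dictionary of per-column fill values.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value per column
df_demo = pd.DataFrame({
    "Age": [25.0, np.nan, 30.0, 28.0],
    "City": ["New York", "Paris", None, "London"],
})

# Count missing values per column
missing_counts = df_demo.isnull().sum()

# Drop rows only when "Age" is missing; a missing "City" is tolerated
dropped = df_demo.dropna(subset=["Age"])

# Fill each column with its own replacement value
filled = df_demo.fillna({"Age": df_demo["Age"].median(), "City": "Unknown"})

print(missing_counts)
print(dropped)
print(filled)
```

The median is used here only as an example of a fill value. Which statistic is appropriate depends on the data.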

## 7. Grouping and Aggregation

Learn how to group data and perform aggregation operations.

In [16]:
# Basic groupby operations
if len(df.columns) >= 2:
    # Identify meaningful columns for grouping and aggregation
    numeric_cols = df.select_dtypes(include=[np.number])
    categorical_cols = df.select_dtypes(exclude=[np.number])
    
    # Filter out numeric columns that shouldn't be aggregated
    # (substring match on the column name, so "rapid" also matches 'id')
    meaningful_numeric = numeric_cols.columns[~numeric_cols.columns.str.contains('id|gender|category|type', case=False)].tolist()
    
    if len(categorical_cols.columns) > 0 and len(meaningful_numeric) > 0:
        # Single column grouping
        group_col = categorical_cols.columns[0]  # First categorical column
        agg_col = meaningful_numeric[0]  # First meaningful numeric column
        
        print(f"Analysis grouped by {group_col}:")
        grouped_stats = df.groupby(group_col)[meaningful_numeric].agg([
            'count',    # Number of records
            'mean',     # Average
            'std',      # Standard deviation
            'min',      # Minimum value
            'max'       # Maximum value
        ])
        print(grouped_stats)
        
        # Calculate group sizes and percentages
        group_sizes = df[group_col].value_counts()
        group_percentages = df[group_col].value_counts(normalize=True) * 100
        
        print(f"\nDistribution of {group_col}:")
        for group in group_sizes.index:
            print(f"{group}: {group_sizes[group]} records ({group_percentages[group]:.1f}%)")
        
        # Multiple column grouping (only if we have 2 or more categorical columns)
        num_cat_cols = len(categorical_cols.columns)
        if num_cat_cols >= 2:
            print(f"\nFound {num_cat_cols} categorical columns. Performing multi-column grouping:")
            group_cols = list(categorical_cols.columns[:2])  # First two categorical columns
            print(f"Analyzing groups by: {group_cols[0]} and {group_cols[1]}")
            multi_grouped = df.groupby(group_cols)[agg_col].agg([
                'count',
                'mean',
                'std'
            ])
            print(multi_grouped)
        else:
            print(f"\nOnly {num_cat_cols} categorical column(s) found. Need at least 2 for multi-column grouping.")
    else:
        if len(categorical_cols.columns) == 0:
            print("No categorical columns found for grouping.")
        if len(meaningful_numeric) == 0:
            print("No meaningful numeric columns found for aggregation.")
else:
    print("Not enough columns for grouping analysis.")

Analysis grouped by Which type of stress do you primarily experience?:
                                                     Age                       \
                                                   count       mean       std   
Which type of stress do you primarily experience?                               
Distress (Negative Stress) - Stress that causes...    32  20.625000  3.338437   
Eustress (Positive Stress) - Stress that motiva...   768  20.091146  5.579533   
No Stress - Currently experiencing minimal to n...    43  19.302326  3.661478   

                                                             \
                                                   min  max   
Which type of stress do you primarily experience?             
Distress (Negative Stress) - Stress that causes...  17   37   
Eustress (Positive Stress) - Stress that motiva...  14  100   
No Stress - Currently experiencing minimal to n...  14   41   

                                                   Have you rec
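The wide output above comes from applying a list of functions to many columns at once. A more compact pattern is named aggregation, sketched here on made-up data (the DataFrame and the result column names `n`, `mean_age`, and `max_age` are illustrative): each output column is given a name together with the source column and function that produce it.

```python
import pandas as pd

# Illustrative data, not the stress dataset
df_demo = pd.DataFrame({
    "City": ["Paris", "Paris", "London", "London", "London"],
    "Age": [22, 28, 30, 26, 34],
})

# Named aggregation: output_name=(source_column, function)
summary = df_demo.groupby("City").agg(
    n=("Age", "count"),
    mean_age=("Age", "mean"),
    max_age=("Age", "max"),
)
print(summary)

# Share of rows in each group, as a fraction of all rows
shares = df_demo["City"].value_counts(normalize=True)
print(shares)
```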

## 8. Data Manipulation and Transformation

Learn how to manipulate and transform your data using Pandas.

### Note about DataFrame Indices and Display
In Pandas, each row in a DataFrame has an index. By default:
- Indices start at 0 (unlike Excel, which starts at 1)
- The `head()` method shows the first 5 rows by default (indices 0-4 with the default index)
- These numbers on the left are just row labels, not data values
- You can think of them like line numbers or row IDs

When we display DataFrame results using `print(df.head())`, you'll see:
```
   column1  column2  column3
0    value    value    value    <- Row with index 0
1    value    value    value    <- Row with index 1
2    value    value    value    <- Row with index 2
3    value    value    value    <- Row with index 3
4    value    value    value    <- Row with index 4
```
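To illustrate that index values are row labels rather than data, the sketch below filters a small, made-up DataFrame (names are illustrative). The surviving rows keep their original labels until `reset_index(drop=True)` renumbers them from 0.

```python
import pandas as pd

# Illustrative DataFrame with the default 0-based index
df_demo = pd.DataFrame({"Name": ["John", "Anna", "Peter"], "Age": [25, 28, 30]})

# Filtering keeps the original row labels of the surviving rows
older = df_demo[df_demo["Age"] > 26]
print(older.index.tolist())

# reset_index(drop=True) renumbers from 0 and discards the old labels
renumbered = older.reset_index(drop=True)
print(renumbered.index.tolist())

# The index is not one of the data columns
print(df_demo.columns.tolist())
```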

In [None]:
# Let's work with our simple dictionary DataFrame from section 1
data_dict = {
    'Name': ['John', 'Anna', 'Peter'],
    'Age': [25, 28, 30],
    'City': ['New York', 'Paris', 'London']
}
df_simple = pd.DataFrame(data_dict)
print("Original DataFrame:")
print(df_simple)

# 1. Adding a new column by transforming Age
print("\n1. Creating new column by doubling Age:")
df_simple['Age_Doubled'] = df_simple['Age'] * 2
print(df_simple)

# 2. Applying a custom function to Age
print("\n2. Adding 5 years to Age using a custom function:")
df_simple['Age_Plus_5'] = df_simple['Age'].apply(lambda x: x + 5)
print(df_simple)

# 3. Conditional transformation
print("\n3. Creating age category column:")
df_simple['Age_Category'] = df_simple['Age'].apply(
    lambda x: 'Young' if x < 27 else 'Adult')
print(df_simple)

# 4. String manipulation on Name column
print("\n4. Creating column with uppercase names:")
df_simple['Name_Upper'] = df_simple['Name'].str.upper()
print(df_simple)

# Clean up - remove the columns we created
df_simple = df_simple.drop(['Age_Doubled', 'Age_Plus_5', 
                          'Age_Category', 'Name_Upper'], axis=1)
print("\nBack to original DataFrame:")
print(df_simple)

All columns in the dataset:
['Gender', 'Age', 'Have you recently experienced stress in your life?', 'Have you noticed a rapid heartbeat or palpitations?', 'Have you been dealing with anxiety or tension recently?', 'Do you face any sleep problems or difficulties falling asleep?', 'Have you been dealing with anxiety or tension recently?.1', 'Have you been getting headaches more often than usual?', 'Do you get irritated easily?', 'Do you have trouble concentrating on your academic tasks?', 'Have you been feeling sadness or low mood?', 'Have you been experiencing any illness or health issues?', 'Do you often feel lonely or isolated?', 'Do you feel overwhelmed with your academic workload?', 'Are you in competition with your peers, and does it affect you?', 'Do you find that your relationship often causes you stress?', 'Are you facing any difficulties with your professors or instructors?', 'Is your working environment unpleasant or stressful?', 'Do you struggle to find time for relaxation an
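The transformations in section 8 use `apply` with a lambda for the arithmetic and the age category. As a sketch of an alternative (not a replacement for the cell above), the same columns can be produced with vectorized operations. `np.where` is swapped in for the conditional lambda, and `df_demo` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Same toy data as the dictionary example
df_demo = pd.DataFrame({
    "Name": ["John", "Anna", "Peter"],
    "Age": [25, 28, 30],
})

# Arithmetic on a whole column needs no apply()
df_demo["Age_Plus_5"] = df_demo["Age"] + 5

# np.where(condition, value_if_true, value_if_false) works element-wise
df_demo["Age_Category"] = np.where(df_demo["Age"] < 27, "Young", "Adult")

# assign() returns a new DataFrame with the extra column added
df_demo = df_demo.assign(Name_Upper=df_demo["Name"].str.upper())
print(df_demo)
```

On large DataFrames, vectorized operations are usually faster than `apply` with a Python function, though the difference is negligible for three rows.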