# Introduction to Pandas

## What is Pandas?

Pandas is an open-source Python library for data manipulation and analysis. Its two core data structures, **Series** and **DataFrame**, are designed for loading, cleaning, and analyzing structured (tabular) data.

## Why Use Pandas?

- **Data Wrangling**: Simplifies tasks like cleaning, reshaping, and merging datasets.
- **Data Exploration**: Makes it easy to load, filter, and analyze data with intuitive syntax.
- **Built-in Functions**: Provides a rich set of methods for statistical analysis, aggregation, and visualization.
- **Seamless Integration**: Works well with other Python libraries, including NumPy, Matplotlib, and scikit-learn.

## Key Features of Pandas

1. **Data Structures**: 
   - `Series`: One-dimensional labeled array, like a column in a spreadsheet.
   - `DataFrame`: Two-dimensional labeled data structure, similar to a table in a database.

2. **Data Handling**:
   - Loading data from CSV, Excel, or JSON files and from SQL databases.
   - Indexing, filtering, and slicing for easy data access.

3. **Data Analysis**:
   - Grouping and aggregation (`groupby`).
   - Handling missing data (`fillna`, `dropna`).
   - Time-series functionality.

4. **Data Visualization**:
   - Integrates well with libraries like Matplotlib and Seaborn for plotting.
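
As a minimal sketch of the data structures and missing-data methods listed above (the labels and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labeled array.
ages = pd.Series([45, 60, 35], index=["P001", "P002", "P003"], name="Age")

# A DataFrame is a table whose columns share one index.
scores = pd.DataFrame(
    {"Age": [45, 60, 35], "GeneExpression": [1.2, np.nan, 0.8]},
    index=["P001", "P002", "P003"],
)

# dropna removes rows that contain missing values.
dropped = scores.dropna()

# fillna replaces missing values, here with the column mean.
filled = scores.fillna({"GeneExpression": scores["GeneExpression"].mean()})

print(ages["P002"])  # label-based access to a single value
print(dropped)
print(filled)
```

Whether to drop or fill missing values depends on the analysis; filling with the mean is only one common choice.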


# Basic Data Manipulation with Pandas

## Dataset: Patient Data

For this lecture, we will use a dataset that represents patient information in a clinical trial. The dataset includes details about patient demographics, treatment types, and their responses. Here’s a quick preview:

| PatientID | Age | Gender | Treatment      | Response          | GeneExpression |
|-----------|-----|--------|----------------|-------------------|----------------|
| P001      | 45  | Female | Chemotherapy   | Complete Response | 1.2            |
| P002      | 60  | Male   | Immunotherapy  | Partial Response  | 2.5            |
| P003      | 35  | Female | Chemotherapy   | No Response       | 0.8            |
| P004      | 50  | Male   | Targeted Drug  | Complete Response | 3.1            |
| P005      | 40  | Female | Immunotherapy  | No Response       | 1.5            |

---

## Key Operations in Pandas

### 1. Loading the Data

First, we’ll load this dataset into a pandas DataFrame.

```python
import pandas as pd

# Sample data
data = {
    "PatientID": ["P001", "P002", "P003", "P004", "P005"],
    "Age": [45, 60, 35, 50, 40],
    "Gender": ["Female", "Male", "Female", "Male", "Female"],
    "Treatment": ["Chemotherapy", "Immunotherapy", "Chemotherapy", "Targeted Drug", "Immunotherapy"],
    "Response": ["Complete Response", "Partial Response", "No Response", "Complete Response", "No Response"],
    "GeneExpression": [1.2, 2.5, 0.8, 3.1, 1.5]
}

# Create a DataFrame
df = pd.DataFrame(data)
print(df)
```

### 2. Exploring the Data

```python
# Display the first few rows
print(df.head())

# Get a summary of the data (info() prints directly and returns None,
# so it should not be wrapped in print())
df.info()
```

### 3. Filtering and Subsetting

```python
# Filter patients with a "Complete Response"
complete_response = df[df["Response"] == "Complete Response"]
print(complete_response)

# Select specific columns
selected_columns = df[["PatientID", "Treatment", "Response"]]
print(selected_columns)
```
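
Two other selection tools are often useful alongside the filters above. The sketch below rebuilds part of the sample patient table so that it runs on its own; `responders` and `older_ids` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "PatientID": ["P001", "P002", "P003", "P004", "P005"],
    "Age": [45, 60, 35, 50, 40],
    "Response": ["Complete Response", "Partial Response", "No Response",
                 "Complete Response", "No Response"],
})

# isin keeps rows whose value matches any item in the list.
responders = df[df["Response"].isin(["Complete Response", "Partial Response"])]
print(responders)

# .loc filters rows and selects columns in a single step.
older_ids = df.loc[df["Age"] >= 50, ["PatientID", "Age"]]
print(older_ids)
```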

### 4. Adding New Columns

```python
# Add a new column indicating whether the patient is above 50 years old
df["Above50"] = df["Age"] > 50
print(df)
```

### 5. Grouping and Aggregation

```python
# Group by Treatment and calculate the average GeneExpression
grouped_data = df.groupby("Treatment")["GeneExpression"].mean()
print(grouped_data)
```


```python
# Now let's look at a real dataset
import pandas as pd
import kagglehub

# Download latest version
path = kagglehub.dataset_download("arifmia/heart-attack-risk-dataset")

print("Path to dataset files:", path)

df = pd.read_csv(path + "/heart_attack_risk_dataset.csv")
```

```
Path to dataset files: /Users/arman/.cache/kagglehub/datasets/arifmia/heart-attack-risk-dataset/versions/1
```


# Example Problems with Heart Attack Risk Dataset

## Dataset Overview

The Heart Attack Risk Prediction Dataset contains 50,000 rows and 20 features, including demographic, clinical, lifestyle, and diagnostic attributes. The target variable is `Heart_Attack_Risk` (Low, Moderate, High).

---

## Example Problems

### 1. Data Exploration
- Identify the number of rows, columns, and data types of each feature.
- Find the summary statistics for numerical columns (e.g., `Age`, `BMI`, `Cholesterol_Level`).
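
One possible approach, shown on a tiny made-up table rather than the real dataset (column names follow the problem statement; all values are invented):

```python
import pandas as pd

# Small stand-in for the real dataset (values are invented).
heart = pd.DataFrame({
    "Age": [34, 58, 61, 47],
    "BMI": [22.5, 31.0, 28.4, 35.2],
    "Cholesterol_Level": [180.0, 240.0, 210.0, 265.0],
    "Heart_Attack_Risk": ["Low", "High", "Moderate", "High"],
})

# shape is a (rows, columns) tuple; dtypes lists the type of each column.
n_rows, n_cols = heart.shape
print(n_rows, n_cols)
print(heart.dtypes)

# describe() reports count, mean, std, min, quartiles, and max.
summary = heart[["Age", "BMI", "Cholesterol_Level"]].describe()
print(summary)
```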

### 2. Handling Missing Values
- Check for missing values in each column.
- Fill missing values with the mean for numerical features and mode for categorical features.
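
A minimal sketch of this workflow on invented data (the real dataset may or may not contain missing values):

```python
import numpy as np
import pandas as pd

# Small stand-in table with one gap per column (values are invented).
heart = pd.DataFrame({
    "BMI": [22.0, np.nan, 28.0, 34.0],
    "Gender": ["Male", "Female", None, "Female"],
})

# Count missing values in each column.
missing_counts = heart.isna().sum()
print(missing_counts)

# Numerical feature: fill with the column mean.
heart["BMI"] = heart["BMI"].fillna(heart["BMI"].mean())

# Categorical feature: mode() returns a Series (there can be ties),
# so take its first entry.
heart["Gender"] = heart["Gender"].fillna(heart["Gender"].mode().iloc[0])
print(heart)
```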

### 3. Filtering and Subsetting Data
- Extract rows where `Smoking == 1` and `Heart_Attack_Risk == 'High'`.
- Subset individuals aged above 50 with `Family_History == 1`.
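
These filters combine two conditions. A sketch on invented data, assuming `Smoking` and `Family_History` are stored as 0/1 as the problem statement suggests:

```python
import pandas as pd

# Small stand-in table (values are invented).
heart = pd.DataFrame({
    "Age": [45, 62, 55, 38],
    "Smoking": [1, 1, 0, 1],
    "Family_History": [0, 1, 1, 1],
    "Heart_Attack_Risk": ["High", "High", "Low", "Moderate"],
})

# Combine conditions with & (and); each condition needs its own parentheses.
smokers_high = heart[(heart["Smoking"] == 1) & (heart["Heart_Attack_Risk"] == "High")]
print(smokers_high)

older_family = heart[(heart["Age"] > 50) & (heart["Family_History"] == 1)]
print(older_family)
```

Python's `and` does not work element-wise on Series, which is why `&` is used here.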

### 4. Grouping and Aggregation
- Group by `Heart_Attack_Risk` and calculate the mean `BMI`, `Cholesterol_Level`, and `Resting_BP`.
- Group by `Gender` and calculate the percentage of individuals with `Diabetes == 1`.
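
One way to approach both aggregations, sketched on invented data. Because `Diabetes` is assumed to be coded 0/1, its group mean is the fraction of individuals with diabetes.

```python
import pandas as pd

# Small stand-in table (values are invented).
heart = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "Diabetes": [1, 0, 0, 1],
    "BMI": [32.0, 24.0, 30.0, 22.0],
    "Cholesterol_Level": [250.0, 180.0, 230.0, 190.0],
    "Resting_BP": [140.0, 118.0, 136.0, 122.0],
    "Heart_Attack_Risk": ["High", "Low", "High", "Low"],
})

# Mean of several columns per risk category.
risk_means = heart.groupby("Heart_Attack_Risk")[
    ["BMI", "Cholesterol_Level", "Resting_BP"]
].mean()
print(risk_means)

# Mean of a 0/1 column times 100 gives a percentage per group.
diabetes_pct = heart.groupby("Gender")["Diabetes"].mean() * 100
print(diabetes_pct)
```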

### 5. Feature Analysis
- Calculate the proportion of individuals in each `Stress_Level` category.
- Find the most common `Chest_Pain_Type` for `Heart_Attack_Risk == 'High'`.
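
A possible sketch using `value_counts` and `mode`. The category labels below are invented placeholders; the real dataset may encode `Stress_Level` and `Chest_Pain_Type` differently.

```python
import pandas as pd

# Small stand-in table (labels and values are invented).
heart = pd.DataFrame({
    "Stress_Level": ["High", "Low", "High", "Moderate"],
    "Chest_Pain_Type": ["Typical", "Atypical", "Typical", "Non-anginal"],
    "Heart_Attack_Risk": ["High", "Low", "High", "High"],
})

# normalize=True returns proportions instead of raw counts.
stress_share = heart["Stress_Level"].value_counts(normalize=True)
print(stress_share)

# Restrict to high-risk rows, then take the most frequent value.
high_risk = heart.loc[heart["Heart_Attack_Risk"] == "High", "Chest_Pain_Type"]
common_pain = high_risk.mode().iloc[0]
print(common_pain)
```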

### 6. Feature Engineering
- Create a binary `High_BMI` column (`1` if `BMI > 30`, else `0`).
- Create an `Age_Group` column categorizing ages into ranges (e.g., `18-30`, `31-45`, etc.).
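
Both columns can be built without loops. In this sketch the age bins beyond `31-45` are assumptions, since the problem statement leaves them open, and the data is invented.

```python
import pandas as pd

# Small stand-in table (values are invented).
heart = pd.DataFrame({
    "Age": [22, 31, 45, 46, 70],
    "BMI": [24.0, 30.0, 31.5, 28.0, 35.0],
})

# A boolean comparison cast to int gives a 0/1 flag.
heart["High_BMI"] = (heart["BMI"] > 30).astype(int)

# pd.cut assigns each age to a bin; by default each bin includes its
# right edge, so the bins are (17, 30], (30, 45], (45, 60], (60, 120].
bins = [17, 30, 45, 60, 120]
labels = ["18-30", "31-45", "46-60", "61+"]
heart["Age_Group"] = pd.cut(heart["Age"], bins=bins, labels=labels)
print(heart)
```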

### 7. Statistical Analysis
- Calculate the correlation between `BMI` and `Cholesterol_Level`.
- Test whether average `Resting_BP` differs across `Heart_Attack_Risk` categories.
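
A starting point using pandas alone, on invented data. It computes the Pearson correlation and per-group descriptive statistics; it is not a formal hypothesis test, which would need a statistics library and is outside this sketch.

```python
import pandas as pd

# Small stand-in table (values are invented).
heart = pd.DataFrame({
    "BMI": [22.0, 27.5, 31.0, 35.5],
    "Cholesterol_Level": [185.0, 200.0, 240.0, 250.0],
    "Resting_BP": [118.0, 124.0, 138.0, 142.0],
    "Heart_Attack_Risk": ["Low", "Low", "High", "High"],
})

# Series.corr computes the Pearson correlation by default.
r = heart["BMI"].corr(heart["Cholesterol_Level"])
print(r)

# Group means, spreads, and counts are a first look at whether
# Resting_BP differs across risk categories.
bp_by_risk = heart.groupby("Heart_Attack_Risk")["Resting_BP"].agg(["mean", "std", "count"])
print(bp_by_risk)
```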

### 8. Visualization
- Plot the distribution of `Age` for each `Heart_Attack_Risk` category.
- Create a bar chart showing the count of `Heart_Attack_Risk` levels grouped by `Gender`.


Let's spend some time working on these problems.

# Advanced Pandas Functions and Techniques

## Overview

In this section, we will explore advanced pandas functions and techniques that are essential for efficient data manipulation and analysis. These functions help streamline workflows, handle complex data transformations, and optimize performance.

---

## Advanced Pandas Functions

### 1. Pivot Tables
- **Description**: Reshape data and calculate aggregations.
- **Example**:
  ```python
  # Create a pivot table showing average BMI grouped by Gender and Heart_Attack_Risk
  pivot = df.pivot_table(values='BMI', index='Gender', columns='Heart_Attack_Risk', aggfunc='mean')
  print(pivot)
  ```
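
To make the reshaping concrete, here is a self-contained sketch on invented data. It also shows that this pivot table matches a `groupby` followed by `unstack`.

```python
import pandas as pd

# Small stand-in table (values are invented).
heart = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Heart_Attack_Risk": ["High", "Low", "High", "High", "Low"],
    "BMI": [32.0, 24.0, 30.0, 34.0, 22.0],
})

# Rows are Gender, columns are risk levels, cells hold the mean BMI.
pivot = heart.pivot_table(
    values="BMI", index="Gender", columns="Heart_Attack_Risk", aggfunc="mean"
)
print(pivot)

# The same table built with groupby + unstack.
unstacked = heart.groupby(["Gender", "Heart_Attack_Risk"])["BMI"].mean().unstack()
print(unstacked)
```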

### 2. Apply and Map Functions
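- **Description**: `apply` runs a custom function on each value of a Series (or on each row or column of a DataFrame); `map` transforms each value of a Series using a function or a lookup dictionary.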
- **Example**:
  ```python
  # Apply a custom function to calculate BMI categories
  def bmi_category(bmi):
      if bmi < 18.5:
          return 'Underweight'
      elif 18.5 <= bmi < 25:
          return 'Normal weight'
      elif 25 <= bmi < 30:
          return 'Overweight'
      else:
          return 'Obesity'

  df['BMI_Category'] = df['BMI'].apply(bmi_category)
  print(df[['BMI', 'BMI_Category']])
  ```
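
The example above covers `apply`. As a complement, the sketch below shows `map` with a lookup dictionary, plus `pd.cut` as a vectorized alternative for the same BMI categories. The data is invented and the `Smoker` column is a hypothetical name.

```python
import pandas as pd

# Small stand-in table (values are invented).
heart = pd.DataFrame({
    "BMI": [17.0, 22.0, 27.5, 31.0],
    "Smoking": [0, 1, 0, 1],
})

# map with a dictionary replaces each value by its lookup result.
heart["Smoker"] = heart["Smoking"].map({0: "No", 1: "Yes"})

# pd.cut bins all values at once; right=False makes each bin include
# its left edge, matching the < comparisons in bmi_category.
heart["BMI_Category"] = pd.cut(
    heart["BMI"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["Underweight", "Normal weight", "Overweight", "Obesity"],
    right=False,
)
print(heart)
```

On large tables, vectorized tools like `pd.cut` are usually faster than calling a Python function once per row with `apply`.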

### 3. Merging and Joining
- **Description**: Combine multiple DataFrames.
- **Example**:
  ```python
  # Merge patient data with treatment data on PatientID
  # (patient_df and treatment_df are placeholder DataFrames that share a PatientID column)
  merged_df = pd.merge(patient_df, treatment_df, on='PatientID', how='inner')
  print(merged_df)
  ```
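
A runnable version of the merge, with two small hypothetical tables standing in for `patient_df` and `treatment_df`:

```python
import pandas as pd

# Hypothetical tables for illustration (values are invented).
patient_df = pd.DataFrame({
    "PatientID": ["P001", "P002", "P003"],
    "Age": [45, 60, 35],
})
treatment_df = pd.DataFrame({
    "PatientID": ["P001", "P003", "P004"],
    "Treatment": ["Chemotherapy", "Chemotherapy", "Targeted Drug"],
})

# how="inner" keeps only PatientIDs present in both tables.
inner_df = pd.merge(patient_df, treatment_df, on="PatientID", how="inner")
print(inner_df)

# how="left" keeps every patient; missing treatments become NaN.
left_df = pd.merge(patient_df, treatment_df, on="PatientID", how="left")
print(left_df)
```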

### 4. MultiIndex
- **Description**: Work with hierarchical indexing for complex data.
- **Example**:
  ```python
  # Set a MultiIndex using Gender and Heart_Attack_Risk
  multi_df = df.set_index(['Gender', 'Heart_Attack_Risk'])
  print(multi_df)
  ```
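
Once a MultiIndex is set, rows can be selected by one or more index levels. A sketch on invented data:

```python
import pandas as pd

# Small stand-in table (values are invented).
heart = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Heart_Attack_Risk": ["High", "High", "Low", "Low"],
    "BMI": [32.0, 30.0, 22.0, 24.0],
})

# Sorting the index keeps label-based lookups fast and predictable.
multi_df = heart.set_index(["Gender", "Heart_Attack_Risk"]).sort_index()

# Select by a full (Gender, Heart_Attack_Risk) key.
female_high = multi_df.loc[[("Female", "High")]]
print(female_high)

# xs selects on a single level, here every "High" row.
high = multi_df.xs("High", level="Heart_Attack_Risk")
print(high)

# reset_index turns the index levels back into ordinary columns.
flat = multi_df.reset_index()
```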

### 5. Window Functions
- **Description**: Perform rolling or expanding computations.
- **Example**:
  ```python
  # Calculate rolling average of Cholesterol_Level over a window of 3
  df['Rolling_Avg_Cholesterol'] = df['Cholesterol_Level'].rolling(window=3).mean()
  print(df[['Cholesterol_Level', 'Rolling_Avg_Cholesterol']])
  ```
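
The sketch below contrasts rolling and expanding windows on a short invented series. Note that window functions depend on row order, so they are most meaningful when rows are ordered, for example by time.

```python
import pandas as pd

chol = pd.Series([200.0, 220.0, 240.0, 260.0], name="Cholesterol_Level")

# A window of 3 needs 3 values, so the first two results are NaN.
rolling_strict = chol.rolling(window=3).mean()

# min_periods=1 computes a result as soon as one value is available.
rolling_partial = chol.rolling(window=3, min_periods=1).mean()

# expanding() uses every value seen so far (a running mean).
running_mean = chol.expanding().mean()

print(pd.DataFrame({
    "strict": rolling_strict,
    "partial": rolling_partial,
    "expanding": running_mean,
}))
```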

### 6. Exploding Lists
- **Description**: Expand lists stored in a single column into multiple rows.
- **Example**:
  ```python
  # Expand a column with lists of symptoms into separate rows
  # (assumes df has a 'Symptoms' column of comma-separated strings)
  df['Symptoms'] = df['Symptoms'].str.split(',')
  exploded_df = df.explode('Symptoms')
  print(exploded_df)
  ```
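
A self-contained sketch of the same idea, using a hypothetical `visits` table with a comma-separated `Symptoms` column (names and values are invented):

```python
import pandas as pd

# Hypothetical table for illustration.
visits = pd.DataFrame({
    "PatientID": ["P001", "P002"],
    "Symptoms": ["chest pain, fatigue", "dizziness"],
})

# Split each string into a list of symptoms.
visits["Symptoms"] = visits["Symptoms"].str.split(",")

# explode creates one row per list element and repeats the other columns.
exploded = visits.explode("Symptoms")

# Strip stray spaces left by the split and renumber the rows.
exploded["Symptoms"] = exploded["Symptoms"].str.strip()
exploded = exploded.reset_index(drop=True)
print(exploded)
```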

### 7. Query Function
- **Description**: Filter data using an SQL-like syntax.
- **Example**:
  ```python
  # Filter patients with BMI > 30 and Heart_Attack_Risk == 'High'
  filtered_df = df.query("BMI > 30 and Heart_Attack_Risk == 'High'")
  print(filtered_df)
  ```
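
`query` can also reference Python variables with the `@` prefix. The sketch below, on invented data, checks that the query returns the same rows as an equivalent boolean mask; `bmi_cutoff` is an illustrative name.

```python
import pandas as pd

# Small stand-in table (values are invented).
heart = pd.DataFrame({
    "BMI": [32.0, 28.0, 35.0],
    "Heart_Attack_Risk": ["High", "High", "Low"],
})

bmi_cutoff = 30

# @bmi_cutoff refers to the Python variable defined above.
filtered = heart.query("BMI > @bmi_cutoff and Heart_Attack_Risk == 'High'")
print(filtered)

# The same filter written as a boolean mask.
masked = heart[(heart["BMI"] > bmi_cutoff) & (heart["Heart_Attack_Risk"] == "High")]
```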

### 8. Aggregation with Multiple Functions
- **Description**: Compute several summary statistics per group in a single `agg` call.
- **Example**:
  ```python
  # Group by Gender and calculate the mean, max, and min of Cholesterol_Level
  aggregated_df = df.groupby('Gender')['Cholesterol_Level'].agg(['mean', 'max', 'min'])
  print(aggregated_df)
  ```
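
When different columns need different statistics, named aggregation gives each result column an explicit name. A sketch on invented data; the output column names (`mean_chol`, `max_bp`, `n`) are illustrative.

```python
import pandas as pd

# Small stand-in table (values are invented).
heart = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Cholesterol_Level": [240.0, 180.0, 200.0, 220.0],
    "Resting_BP": [140.0, 118.0, 130.0, 126.0],
})

# Each keyword is an output column: name=(input column, function).
summary = heart.groupby("Gender").agg(
    mean_chol=("Cholesterol_Level", "mean"),
    max_bp=("Resting_BP", "max"),
    n=("Cholesterol_Level", "size"),
)
print(summary)
```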

---

## Summary

These advanced pandas functions provide powerful tools for managing, transforming, and analyzing complex datasets. By mastering these techniques, you can tackle real-world data challenges with greater efficiency and confidence.
