# Simple Data Manipulation with Pandas

This notebook demonstrates fundamental data manipulation techniques using Pandas. We'll explore how to select specific columns, filter rows based on conditions, and summarize the filtered results with basic statistics.

## Learning Objectives
- Select single and multiple columns from a DataFrame
- Filter rows using boolean conditions
- Apply multiple filtering conditions
- Use the `isin()` method for filtering
- Understand boolean indexing in Pandas

## Prerequisites
- Basic understanding of Python and Pandas
- Familiarity with DataFrames
- Pandas library installed

## Step 1: Import Required Libraries

First, let's import the Pandas library for data manipulation.

In [1]:
# Import the Python libraries
import pandas as pd

print("Pandas library imported successfully!")
print(f"Pandas version: {pd.__version__}")

Pandas library imported successfully!
Pandas version: 2.3.1


## Step 2: Load the Dataset

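We'll load our CSV file into a DataFrame. If the file doesn't exist, we'll create sample data for demonstration purposes. The outputs shown in this notebook come from the CSV file (1000 rows, 4 columns), so the fallback sample data will produce different results.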

In [2]:
# Load the CSV file into a DataFrame
try:
    df = pd.read_csv("./data/data.csv")
    print("CSV file loaded successfully!")
    print(f"Dataset shape: {df.shape}")
except FileNotFoundError:
    print("CSV file not found. Creating sample data for demonstration.")
    # Create sample data
    df = pd.DataFrame({
        'Name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Davis', 
                 'Eve Wilson', 'Frank Miller', 'Grace Lee', 'Henry Taylor'],
        'Age': [25, 32, 28, 35, 22, 45, 29, 31],
        'City': ['New York', 'Dallas', 'Chicago', 'Dallas', 'New York', 
                 'Houston', 'Dallas', 'Chicago'],
        'Grade': ['F', 'D', 'C', 'B', 'A', 'F', 'D', 'C'],
    })
    print("Sample data created successfully!")

# Display the first few rows to understand our data
print("\nFirst 5 rows of the dataset:")
print(df.head())

CSV file loaded successfully!
Dataset shape: (1000, 4)

First 5 rows of the dataset:
    Name   Age       City Grade
0   Sara  50.0  San Diego     F
1   Emma  57.0  San Diego     C
2  David  29.0   New York     D
3  Chris  53.0   San Jose     D
4   Sara  21.0    Phoenix     A
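
Notice that `Age` is displayed as `50.0` rather than `50`. A likely reason is that the column contains missing values (Step 5 counts 4 of them): Pandas stores a missing number as `NaN`, which is a float, so `read_csv` falls back to `float64`. The sketch below reproduces this with a small hypothetical CSV string (not the tutorial's data file) and shows the optional nullable `Int64` dtype as one way to keep whole numbers.

```python
import io

import pandas as pd

# Hypothetical CSV text with one missing Age value (illustration only)
csv_text = "Name,Age,City\nAlice,25,Dallas\nBob,,Chicago\nCara,31,Dallas\n"
sample = pd.read_csv(io.StringIO(csv_text))

# The missing value becomes NaN, which forces the column to float64
age_dtype_before = sample["Age"].dtype
print(f"Age dtype after loading: {age_dtype_before}")
print(f"Missing ages: {sample['Age'].isna().sum()}")

# Optional: the nullable integer dtype keeps whole numbers and marks gaps as <NA>
sample["Age"] = sample["Age"].astype("Int64")
print(sample)
```

Converting is optional for this notebook; every filter shown later works the same way on the `float64` column.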


## Step 3: Column Selection

Column selection is a fundamental operation in data analysis. We can select single columns or multiple columns based on our analysis needs.

### Single Column Selection
When we select a single column, we get a Pandas Series object.

In [3]:
# Single column selection
ages = df["Age"]
print('Selected "Age" column:')
print(f"Type: {type(ages)}")
print("First 5 values:")
print(ages.head())

# Alternative syntax for single column selection
# ages = df.Age  # Works only when the name is a valid Python identifier
#                # and does not clash with an existing DataFrame attribute

Selected "Age" column:
Type: <class 'pandas.core.series.Series'>
First 5 values:
0    50.0
1    57.0
2    29.0
3    53.0
4    21.0
Name: Age, dtype: float64
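
The bracket style decides the return type, even for one column. As a minimal sketch on a small hypothetical DataFrame (`people` is not the tutorial's dataset), compare single brackets with a one-element list:

```python
import pandas as pd

# Hypothetical three-row DataFrame for illustration
people = pd.DataFrame({"Name": ["Alice", "Bob", "Cara"], "Age": [25.0, 32.0, 28.0]})

age_series = people["Age"]    # single brackets -> Series (1-dimensional)
age_frame = people[["Age"]]   # list with one name -> DataFrame (2-dimensional)

print(type(age_series).__name__, age_series.shape)
print(type(age_frame).__name__, age_frame.shape)
```

Use the list form when later code expects a DataFrame, for example when more columns may be added to the selection.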


### Multiple Column Selection
When we select multiple columns, we get a DataFrame with only the specified columns.

In [4]:
# Multiple columns selection
selected_columns = df[["Name", "City"]]
print('Selected "Name" and "City" columns:')
print(f"Type: {type(selected_columns)}")
print(f"Shape: {selected_columns.shape}")
print("First 5 rows:")
print(selected_columns.head())

Selected "Name" and "City" columns:
Type: <class 'pandas.core.frame.DataFrame'>
Shape: (1000, 2)
First 5 rows:
    Name       City
0   Sara  San Diego
1   Emma  San Diego
2  David   New York
3  Chris   San Jose
4   Sara    Phoenix


## Step 4: Row Filtering

Row filtering allows us to extract specific rows based on conditions. This is essential for data analysis and exploration.

### Basic Filtering
We can filter rows using boolean conditions. This creates a boolean mask that selects only rows where the condition is True.

In [5]:
# Basic filtering: Filtering rows where "Age" is less than 30
young_people = df[df["Age"] < 30]
print('Filtered rows where "Age" < 30:')
print(f"Original dataset size: {len(df)} rows")
print(f"Filtered dataset size: {len(young_people)} rows")
print("Filtered data:")
print(young_people)

Filtered rows where "Age" < 30:
Original dataset size: 1000 rows
Filtered dataset size: 281 rows
Filtered data:
      Name   Age         City Grade
2    David  29.0     New York     D
4     Sara  21.0      Phoenix     A
5    James  22.0      Houston     A
7     Sara  25.0    San Diego     B
11   David  26.0     San Jose     F
..     ...   ...          ...   ...
985   Emma  24.0     San Jose     B
986  James  18.0      Phoenix     C
993  David  26.0  Los Angeles   NaN
995  James  26.0  Los Angeles     F
997  David  27.0     New York   NaN

[281 rows x 4 columns]
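
To see what the mask looks like on its own, here is a minimal sketch on a small hypothetical DataFrame. It also shows a detail that matters for this dataset: a comparison against a missing value (`NaN`) evaluates to `False`, so rows with a missing `Age` are left out of the result.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age (illustration only)
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan"],
    "Age": [25.0, 32.0, np.nan, 29.0],
})

# The comparison returns a boolean Series aligned with the DataFrame's index
mask = people["Age"] < 30
print(mask)

# True counts as 1, so summing the mask gives the number of matching rows
print(f"Matching rows: {mask.sum()}")

# Indexing with the mask keeps only the rows where it is True
print(people[mask])
```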


### Multiple Conditions
We can combine multiple conditions using logical operators:
- `&` for AND operations
- `|` for OR operations
- `~` for NOT operations

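**Important**: When combining conditions, each condition must be wrapped in parentheses, because `&` and `|` bind more tightly than comparison operators such as `<` and `==`. Python's `and`, `or`, and `not` keywords do not work element-wise on a Series.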
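
One way to keep combined filters readable is to store each condition in a named mask first. The sketch below uses a small hypothetical DataFrame and also demonstrates the `~` (NOT) operator; the variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data for illustration
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan"],
    "Age": [25.0, 32.0, 22.0, 41.0],
    "City": ["Dallas", "Dallas", "Chicago", "Houston"],
})

# Each mask is a complete expression, so no extra parentheses are needed later
under_30 = people["Age"] < 30
in_dallas = people["City"] == "Dallas"

# NOT: everyone who is not in Dallas
not_dallas = people[~in_dallas]
print(not_dallas)

# AND combined with NOT: under 30 and not in Dallas
young_not_dallas = people[under_30 & ~in_dallas]
print(young_not_dallas)

# Without parentheses, an inline version such as
#   people["Age"] < 30 & people["City"] == "Dallas"
# is parsed with `30 & people["City"]` evaluated first and fails with an error.
```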

In [6]:
# Filtering with multiple conditions: "Age" < 30 and "City" is "Dallas"
specific_group = df[(df["Age"] < 30) & (df["City"] == "Dallas")]
print('Filtered rows where "Age" < 30 and "City" is "Dallas":')
print(f"Filtered dataset size: {len(specific_group)} rows")
print("Filtered data:")
print(specific_group)

# Let's also try an OR condition
young_or_senior = df[(df["Age"] < 30) | (df["Age"] > 40)]
print('\nFiltered rows where "Age" < 30 OR "Age" > 40:')
print(f"Filtered dataset size: {len(young_or_senior)} rows")
print("First 5 rows:")
print(young_or_senior.head())

Filtered rows where "Age" < 30 and "City" is "Dallas":
Filtered dataset size: 23 rows
Filtered data:
      Name   Age    City Grade
65   James  24.0  Dallas     C
114   John  25.0  Dallas     F
193   John  29.0  Dallas     D
206   Emma  27.0  Dallas     D
342   Emma  27.0  Dallas     B
350   Sara  20.0  Dallas     A
369   Sara  28.0  Dallas     B
373    Bob  24.0  Dallas     D
417  James  26.0  Dallas     D
522   Emma  29.0  Dallas     B
525  Laura  29.0  Dallas     A
566  James  22.0  Dallas     F
587  Chris  24.0  Dallas     A
676   Emma  28.0  Dallas     C
709   Emma  27.0  Dallas     F
754  Chris  19.0  Dallas     D
781  David  20.0  Dallas     D
860  Alice  18.0  Dallas     C
901   John  18.0  Dallas     B
924  James  25.0  Dallas     D
931    Bob  29.0  Dallas     C
966   Mike  19.0  Dallas     A
973   Mike  24.0  Dallas     A

Filtered rows where "Age" < 30 OR "Age" > 40:
Filtered dataset size: 742 rows
First 5 rows:
    Name   Age       City Grade
0   Sara  50.0  San Diego     F
1   Emma  57.0  San Diego     C
2  David  29.0   New York     D
3  Chris  53.0   San Jose     D
4   Sara  21.0    Phoenix     A


### Using the isin() Method
The `isin()` method is useful when you want to filter based on multiple values in a single column. It's equivalent to multiple OR conditions but more readable.

In [7]:
# Filtering using isin(): Filtering rows where "City" is either "New York" or "Dallas"
cities = df[df["City"].isin(["New York", "Dallas"])]
print('Filtered rows where "City" is "New York" or "Dallas":')
print(f"Filtered dataset size: {len(cities)} rows")
print("Filtered data:")
print(cities)

# This is equivalent to:
# cities_alternative = df[(df["City"] == "New York") | (df["City"] == "Dallas")]

Filtered rows where "City" is "New York" or "Dallas":
Filtered dataset size: 195 rows
Filtered data:
      Name   Age      City Grade
2    David  29.0  New York     D
12   David  46.0    Dallas     A
17   David  28.0  New York     F
21    John  38.0  New York     A
24   Laura  40.0  New York     C
..     ...   ...       ...   ...
973   Mike  24.0    Dallas     A
976   Emma  56.0  New York     C
982    Bob  53.0  New York     B
991   Mike  39.0  New York     F
997  David  27.0  New York   NaN

[195 rows x 4 columns]
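
We can check the equivalence with the OR form directly, and negate `isin()` with `~` to exclude values instead. This is a minimal sketch on a small hypothetical DataFrame; note that the negated filter also keeps rows where `City` is missing, because a missing value is not in the list.

```python
import pandas as pd

# Hypothetical data with one missing city (illustration only)
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan"],
    "City": ["New York", "Dallas", "Chicago", None],
})

wanted = ["New York", "Dallas"]

with_isin = people[people["City"].isin(wanted)]
with_or = people[(people["City"] == "New York") | (people["City"] == "Dallas")]

# Both approaches select the same rows
print(f"Same result: {with_isin.equals(with_or)}")

# Negation: every row whose city is NOT in the list (including missing cities)
everything_else = people[~people["City"].isin(wanted)]
print(everything_else)
```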


## Step 5: Advanced Filtering Examples

Let's explore some more advanced filtering techniques that are commonly used in data analysis.

In [8]:
# String filtering: Find people whose names start with specific letters
names_starting_with_a_or_b = df[df["Name"].str.startswith(('A', 'B'))]
print("People whose names start with 'A' or 'B':")
print(names_starting_with_a_or_b[["Name", "Age", "City"]])

# Numerical range filtering
middle_aged = df[(df["Age"] >= 30) & (df["Age"] <= 40)]
print(f"\nPeople aged between 30 and 40 (inclusive): {len(middle_aged)} people")
print(middle_aged[["Name", "Age", "City"]])

# Count missing values in each column
print("\nChecking for missing values:")
print(df.isnull().sum())

People whose names start with 'A' or 'B':
      Name   Age         City
6      Bob  54.0  Los Angeles
13     Bob  31.0  Los Angeles
16   Alice  39.0     San Jose
19   Alice  18.0  Los Angeles
27     Bob  32.0     New York
..     ...   ...          ...
967  Alice  30.0     New York
981  Alice  59.0      Phoenix
982    Bob  53.0     New York
983    Bob  37.0     San Jose
998  Alice  41.0      Houston

[193 rows x 3 columns]

People aged between 30 and 40 (inclusive): 254 people
      Name   Age         City
13     Bob  31.0  Los Angeles
16   Alice  39.0     San Jose
18    Mike  40.0  San Antonio
21    John  38.0     New York
24   Laura  40.0     New York
..     ...   ...          ...
983    Bob  37.0     San Jose
988   Emma  30.0          NaN
989  James  39.0  San Antonio
990  David  38.0  San Antonio
991   Mike  39.0     New York

[254 rows x 3 columns]

Checking for missing values:
Name     0
Age      4
City     6
Grade    8
dtype: int64
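
The cell above only counts missing values. As a minimal sketch on a small hypothetical DataFrame, the same boolean-indexing ideas can select or drop the affected rows. The sketch also shows `between()`, a shorter way to write the inclusive range filter used for the 30 to 40 age group.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps (illustration only)
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan", "Eve"],
    "Age": [25.0, np.nan, 30.0, 40.0, 41.0],
    "City": ["Dallas", "Chicago", None, "Houston", "Dallas"],
})

# Rows with at least one missing value in any column
missing_any = people[people.isna().any(axis=1)]
print(missing_any)

# Rows with no missing values at all
complete = people.dropna()
print(complete)

# Rows where one specific column is present
known_age = people[people["Age"].notna()]
print(known_age)

# Inclusive range filter; same rows as (Age >= 30) & (Age <= 40)
middle_aged = people[people["Age"].between(30, 40)]
print(middle_aged)
```

Whether to drop or keep incomplete rows depends on the analysis; the row at index 988 in the output above, for example, has a valid age but no city.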


## Step 6: Combining Selection and Filtering

We can combine column selection with row filtering for more targeted data analysis.

In [9]:
# Combine filtering and column selection
young_people_info = df[df["Age"] < 30][["Name", "Age"]]
print("Names and ages of people under 30:")
print(young_people_info)

# Alternative approach using query method (more readable for complex conditions)
query_result = df.query("Age < 30 and City == 'Dallas'")[["Name", "Age"]]
print("\nUsing query method - People under 30 in Dallas:")
print(query_result)

Names and ages of people under 30:
      Name   Age
2    David  29.0
4     Sara  21.0
5    James  22.0
7     Sara  25.0
11   David  26.0
..     ...   ...
985   Emma  24.0
986  James  18.0
993  David  26.0
995  James  26.0
997  David  27.0

[281 rows x 2 columns]

Using query method - People under 30 in Dallas:
      Name   Age
65   James  24.0
114   John  25.0
193   John  29.0
206   Emma  27.0
342   Emma  27.0
350   Sara  20.0
369   Sara  28.0
373    Bob  24.0
417  James  26.0
522   Emma  29.0
525  Laura  29.0
566  James  22.0
587  Chris  24.0
676   Emma  28.0
709   Emma  27.0
754  Chris  19.0
781  David  20.0
860  Alice  18.0
901   John  18.0
924  James  25.0
931    Bob  29.0
966   Mike  19.0
973   Mike  24.0
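
The expression `df[mask][columns]` filters first and then selects from the intermediate result. A common alternative is `.loc`, which takes the row condition and the column list in one call. The sketch below uses a small hypothetical DataFrame and also shows how `query()` can reference a Python variable with the `@` prefix; `max_age` is an illustrative name.

```python
import pandas as pd

# Hypothetical data for illustration
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan"],
    "Age": [25.0, 32.0, 22.0, 29.0],
    "City": ["Dallas", "Dallas", "Chicago", "Dallas"],
})

max_age = 30

# Rows and columns selected in a single step
loc_result = people.loc[people["Age"] < max_age, ["Name", "Age"]]
print(loc_result)

# query() can read local variables when they are prefixed with @
query_result = people.query("Age < @max_age and City == 'Dallas'")[["Name", "Age"]]
print(query_result)
```

`.loc` is also the form to prefer when the selected cells will be modified afterwards, since assigning through chained brackets may act on a temporary copy.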


## Step 7: Data Manipulation Summary

Let's create a summary of our data manipulation operations to understand what we've learned.

In [10]:
# Summary statistics
print("=== DATA MANIPULATION SUMMARY ===")
print(f"Original dataset: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"Columns: {list(df.columns)}")

print("\n=== FILTERING RESULTS ===")
print(f"People under 30: {len(df[df['Age'] < 30])} people")
print(f"People in New York or Dallas: {len(df[df['City'].isin(['New York', 'Dallas'])])} people")
print(f"People under 30 in Dallas: {len(df[(df['Age'] < 30) & (df['City'] == 'Dallas')])} people")

print("\n=== AGE DISTRIBUTION ===")
print(f"Youngest person: {df['Age'].min()} years old")
print(f"Oldest person: {df['Age'].max()} years old")
print(f"Average age: {df['Age'].mean():.1f} years")

print("\n=== CITY DISTRIBUTION ===")
print("People per city:")
print(df['City'].value_counts())

=== DATA MANIPULATION SUMMARY ===
Original dataset: 1000 rows, 4 columns
Columns: ['Name', 'Age', 'City', 'Grade']

=== FILTERING RESULTS ===
People under 30: 281 people
People in New York or Dallas: 195 people
People under 30 in Dallas: 23 people

=== AGE DISTRIBUTION ===
Youngest person: 18.0 years old
Oldest person: 59.0 years old
Average age: 38.7 years

=== CITY DISTRIBUTION ===
People per city:
City
Houston         111
Phoenix         106
San Diego       105
Chicago         104
Dallas          101
Philadelphia     98
San Antonio      96
Los Angeles      95
New York         94
San Jose         84
Name: count, dtype: int64
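
The city counts above add up to 994 rather than 1000, because `value_counts()` leaves out missing values by default and 6 rows have no city. A minimal sketch on a small hypothetical DataFrame shows how to include them with `dropna=False`, and how `normalize=True` returns shares instead of counts:

```python
import pandas as pd

# Hypothetical data with one missing city (illustration only)
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan"],
    "City": ["Dallas", "Dallas", "Chicago", None],
})

# Counts including the missing entry
city_counts = people["City"].value_counts(dropna=False)
print(city_counts)

# Shares of the non-missing rows (values sum to 1.0)
city_share = people["City"].value_counts(normalize=True)
print(city_share)
```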


## Summary

In this notebook, we learned essential data manipulation techniques:

### Column Selection
1. **Single column**: `df["column_name"]` returns a Series
2. **Multiple columns**: `df[["col1", "col2"]]` returns a DataFrame

### Row Filtering
1. **Basic filtering**: `df[df["column"] > value]`
2. **Multiple conditions (AND)**: `df[(condition1) & (condition2)]`
3. **OR conditions**: `df[(condition1) | (condition2)]`
4. **isin() method**: `df[df["column"].isin([value1, value2])]`

### Advanced Techniques
- String operations with `.str` accessor
- Query method for readable complex conditions
- Combining selection and filtering
- Summary statistics and value counts

## Best Practices

1. **Always use parentheses** when combining conditions with `&` and `|`
2. **Use `isin()`** for multiple value filtering instead of multiple OR conditions
3. **Combine operations** efficiently to avoid creating unnecessary intermediate DataFrames; for example, `df.loc[mask, columns]` filters rows and selects columns in a single step
4. **Use descriptive variable names** for filtered datasets
5. **Check your results** by examining the shape (`df.shape`) and a sample of rows (`df.head()`) after each filter

## Next Steps

- Learn about data sorting and grouping
- Explore data aggregation methods
- Practice with more complex filtering scenarios
- Study data transformation techniques