# Tutorial 1: Pandas Fundamentals

*Building Core Data Analysis Skills*

**Estimated Time: 2-2.5 hours**

## Learning Objectives
By the end of this tutorial, you will be able to:
- Load and explore datasets using pandas
- Filter and manipulate data effectively
- Create new columns and perform calculations
- Sort and rank data
- Work with missing values
- Perform basic grouping and aggregation
- Navigate pandas documentation to solve problems

---

## Getting Started: Understanding Pandas

Pandas is built around two main data structures:
- **Series**: A one-dimensional array with labels (like a column in a spreadsheet)
- **DataFrame**: A two-dimensional table with rows and columns (like a spreadsheet)
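
To make the distinction concrete, here is a minimal sketch using made-up values. The names `gpa` and `table` are illustrative only and are not part of the tutorial's dataset.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
gpa = pd.Series([3.8, 3.2, 3.9], index=['Alice', 'Bob', 'Charlie'], name='gpa')

# A DataFrame: several columns that share the same row index
table = pd.DataFrame({'gpa': gpa, 'age': [20, 22, 19]})

print(type(gpa).__name__)           # Series
print(type(table).__name__)         # DataFrame
print(type(table['gpa']).__name__)  # Series: each column of a DataFrame is a Series
print(gpa['Bob'])                   # look up a value by its label
```

Selecting one column of a DataFrame gives back a Series, which is why the two structures are usually learned together.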

### Exercise 1: Your First DataFrame

Let's start by creating a simple dataset about students:

In [15]:
import pandas as pd
import numpy as np

# Create a dictionary with student data
student_data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [20, 22, 19, 21, 23],
    'major': ['Biology', 'Physics', 'Chemistry', 'Biology', 'Physics'],
    'gpa': [3.8, 3.2, 3.9, 3.5, 3.7]
}

# Create a DataFrame from the dictionary
students = pd.DataFrame(student_data)

# Display the DataFrame
print(students)

      name  age      major  gpa
0    Alice   20    Biology  3.8
1      Bob   22    Physics  3.2
2  Charlie   19  Chemistry  3.9
3    Diana   21    Biology  3.5
4      Eve   23    Physics  3.7


**Documentation Reference**: [pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)

**Question**: What do you notice about how pandas displays the data? What are the numbers on the left called?

---

## Section 1: Loading and Exploring Real Data

### Exercise 2: Loading Data from a URL

For practice, let's work with the Gapminder dataset:

In [16]:
# The Gapminder dataset URL
url = 'https://raw.githubusercontent.com/kemiolamudzengi/dsci-320-datasets/main/world-data-gapminder.csv'


# Load the data using pd.read_csv()
# parse_dates=['year'] tells pandas to convert the 'year' column to datetime format
# This makes it easier to work with time-based data later
gapminder = pd.read_csv(url, parse_dates=['year'])

# Display the first few rows
# .head() shows the first 5 rows by default - this gives us a quick peek at the data structure
print(gapminder.head())

       country       year  population region     sub_region income_group  \
0  Afghanistan 1800-01-01     3280000   Asia  Southern Asia          Low   
1  Afghanistan 1801-01-01     3280000   Asia  Southern Asia          Low   
2  Afghanistan 1802-01-01     3280000   Asia  Southern Asia          Low   
3  Afghanistan 1803-01-01     3280000   Asia  Southern Asia          Low   
4  Afghanistan 1804-01-01     3280000   Asia  Southern Asia          Low   

   life_expectancy  income  children_per_woman  child_mortality  pop_density  \
0             28.2     603                 7.0            469.0          NaN   
1             28.2     603                 7.0            469.0          NaN   
2             28.2     603                 7.0            469.0          NaN   
3             28.2     603                 7.0            469.0          NaN   
4             28.2     603                 7.0            469.0          NaN   

   co2_per_capita  years_in_school_men  years_in_school_women 

**What's happening here:**
- `pd.read_csv()` reads a CSV file from a URL and creates a DataFrame
- `parse_dates=['year']` automatically converts the year column to a datetime data type
- `gapminder.head()` shows us the first 5 rows so we can see what our data looks like
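
The effect of `parse_dates` can be tried without a network connection. The sketch below reads a tiny in-memory CSV with invented values (the country name and numbers are placeholders, and the date format is an assumption rather than the format of the real file).

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for the real file (values are made up)
csv_text = """country,year,life_expectancy
Atlantis,1800-01-01,30.1
Atlantis,1801-01-01,30.4
"""

# Without parse_dates, 'year' is read as plain text
plain = pd.read_csv(io.StringIO(csv_text))
print(plain['year'].dtype)

# With parse_dates, the same column becomes a datetime column
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['year'])
print(parsed['year'].dtype)

# Datetime columns expose date parts through the .dt accessor
print(parsed['year'].dt.year.tolist())
```

Once a column is a datetime, date parts such as the year can be extracted and rows can be compared against other dates.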

**Documentation Reference**: [pandas.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)

**Think About It**: Why might we want to parse the 'year' column as dates? What advantages does this give us?

### Exercise 3: Getting to Know Your Data

The `.info()` method is incredibly useful for understanding your dataset:

In [17]:
# Use .info() to get information about the gapminder dataset
# This shows us: data types, non-null counts, memory usage
gapminder.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38982 entries, 0 to 38981
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   country                38982 non-null  object        
 1   year                   38982 non-null  datetime64[ns]
 2   population             38982 non-null  int64         
 3   region                 38982 non-null  object        
 4   sub_region             38982 non-null  object        
 5   income_group           38982 non-null  object        
 6   life_expectancy        38982 non-null  float64       
 7   income                 38982 non-null  int64         
 8   children_per_woman     38982 non-null  float64       
 9   child_mortality        38980 non-null  float64       
 10  pop_density            12282 non-null  float64       
 11  co2_per_capita         16285 non-null  float64       
 12  years_in_school_men    8188 non-null   float64       
 13  y

**What `.info()` tells us:**
- **Data types**: Whether columns are numbers, text, dates, etc.
- **Non-null count**: How many values are NOT missing in each column
- **Memory usage**: How much space the dataset takes up
- **Index information**: The range of row numbers

**Try These Too**:

In [18]:
# Find the shape of the dataset (rows, columns)
# .shape is a property (no parentheses) that returns a tuple: (rows, columns)
print("Dataset shape:", gapminder.shape)

# Get the column names as a list
# .columns gives us an Index object, .tolist() converts it to a regular Python list
print("Columns:", gapminder.columns.tolist())

# Get basic statistics for numerical columns
# .describe() automatically calculates count, mean, std, min, quartiles, max
print(gapminder.describe())

Dataset shape: (38982, 14)
Columns: ['country', 'year', 'population', 'region', 'sub_region', 'income_group', 'life_expectancy', 'income', 'children_per_woman', 'child_mortality', 'pop_density', 'co2_per_capita', 'years_in_school_men', 'years_in_school_women']
                                year    population  life_expectancy  \
count                          38982  3.898200e+04     38982.000000   
mean   1909-01-01 02:11:30.410958848  1.422075e+07        43.073468   
min              1800-01-01 00:00:00  1.250000e+04         1.000000   
25%              1854-01-01 00:00:00  5.060000e+05        31.200000   
50%              1909-01-01 00:00:00  2.140000e+06        35.500000   
75%              1964-01-01 00:00:00  6.870000e+06        55.600000   
max              2018-01-01 00:00:00  1.420000e+09        84.200000   
std                              NaN  6.722423e+07        16.219216   

              income  children_per_woman  child_mortality   pop_density  \
count   38982.000000    

**What each method shows:**
- `.shape`: Gives us (number_of_rows, number_of_columns) as a tuple
- `.columns.tolist()`: Shows all column names in a readable list format
- `.describe()`: Provides a statistical summary of the numeric columns; as the output above shows, the datetime `year` column is summarized as well

**Documentation References**: 
- [DataFrame.info](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html)
- [DataFrame.describe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)

**Reflection**: Look at the output of `.info()`. How many rows are in the dataset, and which columns have missing values? (Counting distinct countries needs a different tool, such as `gapminder['country'].nunique()`.)

---

## Section 2: Selecting and Filtering Data

### Exercise 4: Selecting Columns

In [19]:
# Select just the 'country' column (returns a Series)
countries = gapminder['country']
print(type(countries))

# Select multiple columns (returns a DataFrame)
subset = gapminder[['country', 'year', 'life_expectancy']]
print(type(subset))

# Display the first 5 rows of your subset
print(subset.head())

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
       country       year  life_expectancy
0  Afghanistan 1800-01-01             28.2
1  Afghanistan 1801-01-01             28.2
2  Afghanistan 1802-01-01             28.2
3  Afghanistan 1803-01-01             28.2
4  Afghanistan 1804-01-01             28.2


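**Key Insight**: Indexing with a single column name (`df['col']`) returns a Series; indexing with a list of column names (`df[['col1', 'col2']]`, the "double brackets") returns a DataFrame, even when the list holds just one name.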

### Exercise 5: Filtering Rows with Boolean Indexing

Boolean indexing is one of pandas' most powerful features:

In [20]:
# Filter for data from the year 1982
# 'year' is a datetime column, so pandas parses the string '1982' as the date 1982-01-01
# The comparison creates a boolean mask (True where the dates match), and indexing keeps only the True rows
gm_1982 = gapminder[gapminder['year'] == '1982']
print("Rows in 1982:", len(gm_1982))

# Filter for rows with life expectancy greater than 75
# Each row is one country in one year, so len() counts country-year rows, not distinct countries
# The > operator creates another boolean mask
high_life_exp = gapminder[gapminder['life_expectancy'] > 75]
print("Countries with high life expectancy:", len(high_life_exp))

# Combine conditions: Countries in 1982 with life expectancy > 75
# & means "and" - both conditions must be True
# IMPORTANT: Use parentheses around each condition when combining!
combined_filter = gapminder[
    (gapminder['year'] == '1982') & 
    (gapminder['life_expectancy'] > 75)
]
print("High life expectancy countries in 1982:", len(combined_filter))

Rows in 1982: 178
Countries with high life expectancy: 1644
High life expectancy countries in 1982: 10


**How boolean indexing works:**
1. `gapminder['year'] == '1982'` creates a Series of True/False values
2. `gapminder[boolean_series]` returns only rows where the boolean is True
3. `&` combines conditions with "and" logic (both must be True)
4. `|` combines conditions with "or" logic (either can be True)
5. **Always use parentheses** when combining conditions!
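
The rules above can be sketched on a small invented table (the `toy` DataFrame and its values are placeholders). Parentheses matter because Python evaluates `&` and `|` before comparison operators such as `==` and `>`; naming each mask first is another way to keep the logic readable.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'region': ['Europe', 'Asia', 'Europe', 'Africa'],
    'life_expectancy': [82.0, 81.0, 70.0, 60.0],
})

# Each comparison produces a boolean Series (a mask)
in_europe = toy['region'] == 'Europe'
long_lived = toy['life_expectancy'] > 80

both = toy[in_europe & long_lived]         # "and": rows where both masks are True
either = toy[in_europe | long_lived]       # "or": rows where at least one mask is True
neither = toy[~(in_europe | long_lived)]   # "not": ~ flips True and False

print(both['country'].tolist())     # ['A']
print(either['country'].tolist())   # ['A', 'B', 'C']
print(neither['country'].tolist())  # ['D']
```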

**Documentation Reference**: [Boolean indexing](https://pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing)

**Practice Challenge**: Filter for countries that are either in Europe OR have a life expectancy greater than 80.

In [21]:
# Your solution here:
europe_or_high_life = gapminder.query('region == "Europe" or life_expectancy > 80')
europe_or_high_life

Unnamed: 0,country,year,population,region,sub_region,income_group,life_expectancy,income,children_per_woman,child_mortality,pop_density,co2_per_capita,years_in_school_men,years_in_school_women
219,Albania,1800-01-01,410000,Europe,Southern Europe,Upper middle,35.4,667,4.60,375.00,,,,
220,Albania,1801-01-01,412000,Europe,Southern Europe,Upper middle,35.4,667,4.60,375.00,,,,
221,Albania,1802-01-01,413000,Europe,Southern Europe,Upper middle,35.4,667,4.60,375.00,,,,
222,Albania,1803-01-01,414000,Europe,Southern Europe,Upper middle,35.4,667,4.60,375.00,,,,
223,Albania,1804-01-01,416000,Europe,Southern Europe,Upper middle,35.4,667,4.60,375.00,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37006,United Kingdom,2014-01-01,65000000,Europe,Northern Europe,High,80.8,38000,1.88,4.50,269.0,6.46,14.4,14.8
37007,United Kingdom,2015-01-01,65400000,Europe,Northern Europe,High,80.8,38500,1.88,4.40,270.0,,14.6,14.9
37008,United Kingdom,2016-01-01,65800000,Europe,Northern Europe,High,80.9,38900,1.87,4.30,272.0,,,
37009,United Kingdom,2017-01-01,66200000,Europe,Northern Europe,High,81.0,39500,1.87,4.28,274.0,,,


### Exercise 6: Working with Missing Data

In [22]:
# Check which columns have missing values and how many
# .isnull() creates a DataFrame of True/False values (True = missing)
# .sum() counts the True values (treating True as 1, False as 0)
missing_data = gapminder.isnull().sum()
print("Missing values per column:")
print(missing_data)

# Filter the dataset to only include rows where 'co2_per_capita' is NOT null
# .notna() is the opposite of .isnull() - it returns True for non-missing values
co2_data = gapminder[gapminder['co2_per_capita'].notna()]
print(f"Original dataset: {len(gapminder)} rows")
print(f"With CO2 data: {len(co2_data)} rows")

# What years have CO2 data available?
# .unique() returns all distinct values in a column
available_years = co2_data['year'].unique()
print(f"Years with CO2 data: {len(available_years)} years")

Missing values per column:
country                      0
year                         0
population                   0
region                       0
sub_region                   0
income_group                 0
life_expectancy              0
income                       0
children_per_woman           0
child_mortality              2
pop_density              26700
co2_per_capita           22697
years_in_school_men      30794
years_in_school_women    30794
dtype: int64
Original dataset: 38982 rows
With CO2 data: 16285 rows
Years with CO2 data: 215 years


**Understanding missing data methods:**
- `.isnull()`: Returns True where data is missing (NaN, None, etc.)
- `.notna()` or `.notnull()`: Returns True where data exists
- `.sum()` on boolean data: Counts True values (True is treated as 1 and False as 0)
- `.unique()`: Shows all distinct values, useful for checking data ranges
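
Detecting missing values is usually followed by a decision about what to do with them. The sketch below, on an invented table, shows two common options: dropping incomplete rows with `dropna()` and filling gaps with `fillna()`. Filling with the median is an illustrative choice here, not a recommendation for the Gapminder data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'co2_per_capita': [1.5, np.nan, 3.0, np.nan],
    'pop_density': [10.0, 20.0, np.nan, 40.0],
})

# Fraction of missing values per column (the mean of True/False values)
missing_share = toy.isna().mean()
print(missing_share)

# Option 1: drop rows where a specific column is missing
complete_co2 = toy.dropna(subset=['co2_per_capita'])

# Option 2: fill missing values in one column with that column's median
filled = toy.fillna({'pop_density': toy['pop_density'].median()})

print(complete_co2['country'].tolist())  # ['A', 'C']
print(filled['pop_density'].tolist())    # [10.0, 20.0, 20.0, 40.0]
```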

---

## Section 3: Data Manipulation and Transformation

### Exercise 7: Creating New Columns

In [23]:
# Create a new column 'total_co2' by multiplying population and co2_per_capita
# First, create a copy to avoid the SettingWithCopyWarning
co2_data = co2_data.copy()  # This creates an independent copy of the data
# Now we can safely add new columns
co2_data['total_co2'] = co2_data['population'] * co2_data['co2_per_capita']

print("New column created:")
print(co2_data[['country', 'year', 'population', 'co2_per_capita', 'total_co2']].head())

New column created:
         country       year  population  co2_per_capita  total_co2
149  Afghanistan 1949-01-01     7660000         0.00191    14630.6
150  Afghanistan 1950-01-01     7750000         0.01090    84475.0
151  Afghanistan 1951-01-01     7840000         0.01170    91728.0
152  Afghanistan 1952-01-01     7930000         0.01150    91195.0
153  Afghanistan 1953-01-01     8040000         0.01320   106128.0


**What's happening:**
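- `.copy()`: Creates an independent copy, so the new column is added to `co2_data` itself rather than to what might be a view of `gapminder` (the situation that triggers `SettingWithCopyWarning`)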
- `df['new_column'] = calculation`: Creates a new column using existing columns
- Column operations are **vectorized**: the calculation applies to all rows at once

### Exercise 8: Creating Categorical Columns

In [24]:
# Create a categorical column 'life_exp_category' based on life expectancy
# np.where() works like: np.where(condition, value_if_true, value_if_false)
# We can nest multiple np.where() calls for multiple categories
co2_data['life_exp_category'] = np.where(
    co2_data['life_expectancy'] < 60, 'Low',           # If < 60, assign 'Low'
    np.where(co2_data['life_expectancy'] <= 75, 'Medium', 'High')  # Else if <= 75, 'Medium', else 'High'
)

print("Life expectancy categories:")
print(co2_data['life_exp_category'].value_counts())

Life expectancy categories:
life_exp_category
Low       9214
Medium    5739
High      1332
Name: count, dtype: int64


**How np.where() works:**
- `np.where(condition, value_if_true, value_if_false)`: Basic syntax
- You can nest multiple `np.where()` calls for multiple categories
- The conditions are checked in order: first condition, then second, etc.
- `.value_counts()`: Shows how many rows fall into each category
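
Nested `np.where()` calls become hard to read beyond two or three categories. One alternative, shown here as a sketch on invented values, is `np.select()`, which takes a list of conditions and a matching list of labels and uses the first condition that is True for each element.

```python
import numpy as np
import pandas as pd

life = pd.Series([45.0, 60.0, 75.0, 80.0], name='life_expectancy')

# Nested np.where, as in the exercise above
nested = np.where(life < 60, 'Low', np.where(life <= 75, 'Medium', 'High'))

# Same logic with np.select: conditions are tried in order, first match wins
conditions = [life < 60, life <= 75]
labels = ['Low', 'Medium']
selected = np.select(conditions, labels, default='High')

print(nested.tolist())    # ['Low', 'Medium', 'Medium', 'High']
print(selected.tolist())  # ['Low', 'Medium', 'Medium', 'High']
```

Because the first matching condition wins, the second condition only needs `<= 75`; values below 60 have already been labeled `'Low'`.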

**Documentation Reference**: [numpy.where](https://numpy.org/doc/stable/reference/generated/numpy.where.html)

### Exercise 9: Sorting Data

In [25]:
# Sort the 1982 data by life expectancy in descending order
sorted_by_life_exp = gm_1982.sort_values('life_expectancy', ascending=False)
print("Country with highest life expectancy in 1982:")
print(sorted_by_life_exp[['country', 'life_expectancy']].head(1))

# Sort by multiple columns: first by region, then by life expectancy (descending)
multi_sort = gm_1982.sort_values(['region', 'life_expectancy'], ascending=[True, False])
print("\nTop countries by region:")
print(multi_sort[['region', 'country', 'life_expectancy']].head(10))

Country with highest life expectancy in 1982:
      country  life_expectancy
13541  Greece             77.3

Top countries by region:
       region     country  life_expectancy
20111  Africa       Libya             72.0
30185  Africa  Seychelles             70.5
35660  Africa     Tunisia             67.7
22520  Africa   Mauritius             67.0
620    Africa     Algeria             64.4
23615  Africa     Morocco             62.7
4781   Africa    Botswana             62.5
18140  Africa       Kenya             62.1
38945  Africa    Zimbabwe             61.8
9818   Africa    Djibouti             61.0


**Documentation Reference**: [DataFrame.sort_values](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)
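
Sorting orders the rows, while ranking assigns each row a position number. The sketch below uses `Series.rank()` on invented values; the column name `life_exp_rank` and the choice of `method='min'` (tied rows share the best rank) are assumptions made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'life_expectancy': [70.0, 77.3, 70.0, 65.2],
})

# Rank 1 = highest life expectancy; ties share the smallest rank in the tie
toy['life_exp_rank'] = (
    toy['life_expectancy']
    .rank(ascending=False, method='min')
    .astype(int)
)

# Sort by rank, breaking ties alphabetically by country
ranked = toy.sort_values(['life_exp_rank', 'country'])
print(ranked)
```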

---

## Section 4: Advanced Data Operations

### Exercise 10: Finding Top and Bottom Values

In [28]:
# Find the top 10 countries with highest CO2 per capita in the most recent year
recent_year = co2_data['year'].max()
recent_co2 = co2_data[co2_data['year'] == recent_year]

# Find the top 10 using nlargest()
top_10_co2 = recent_co2.nlargest(10, 'co2_per_capita')
print("Top 10 CO2 emitters per capita:")
print(top_10_co2[['country', 'co2_per_capita']])

# Find the bottom 5 countries with lowest life expectancy in 1982
bottom_5_life_exp = gm_1982.nsmallest(5, 'life_expectancy')
print("\nCountries with lowest life expectancy in 1982:")
print(bottom_5_life_exp[['country', 'life_expectancy']])

Top 10 CO2 emitters per capita:
                    country  co2_per_capita
28465                 Qatar            45.4
35473   Trinidad and Tobago            34.2
18610                Kuwait            25.2
2623                Bahrain            23.4
36787  United Arab Emirates            23.3
29560          Saudi Arabia            19.5
20581            Luxembourg            17.4
37225         United States            16.5
1747              Australia            15.4
26275                  Oman            15.4

Countries with lowest life expectancy in 1982:
           country  life_expectancy
182    Afghanistan             43.8
25367        Niger             44.2
26681    Palestine             44.4
11570     Ethiopia             45.0
19454      Lebanon             45.9


**Documentation Reference**: [DataFrame.nlargest](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nlargest.html)
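
In the output above, two countries share the value 15.4. When a tie falls exactly at the cutoff, `nlargest()` keeps only the first tied row by default. The sketch below, on invented values, shows how the `keep` parameter changes that behavior.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'co2_per_capita': [17.4, 15.4, 15.4, 9.0],
})

# Default keep='first': exactly 2 rows, the first tied row wins
top_2 = toy.nlargest(2, 'co2_per_capita')

# keep='all': rows tied with the last kept value are included too
top_2_with_ties = toy.nlargest(2, 'co2_per_capita', keep='all')

print(top_2['country'].tolist())            # ['A', 'B']
print(top_2_with_ties['country'].tolist())  # ['A', 'B', 'C']
```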

### Exercise 11: Basic Grouping and Aggregation

In [None]:
# Calculate average life expectancy by region for 1982
# .groupby('region') groups all rows by their region value
# ['life_expectancy'] selects just that column from each group
# .mean() calculates the average for each group
regional_avg = gm_1982.groupby('region')['life_expectancy'].mean()
print("Average life expectancy by region (1982):")
print(regional_avg.round(1))

# For each region, find the count of countries and average income
# .agg(['count', 'mean']) applies multiple functions to the grouped data
regional_stats = gm_1982.groupby('region')['income'].agg(['count', 'mean'])
print("\nRegional income statistics:")
print(regional_stats.round(0))

# Create a more complex aggregation
# When you pass a dictionary to .agg(), you can apply different functions to different columns
if 'income_group' in gm_1982.columns:
    complex_agg = gm_1982.groupby('income_group').agg({
        'life_expectancy': 'mean',  # Average life expectancy for each income group
        'population': 'sum',        # Total population for each income group
        'country': 'count'          # Number of countries in each income group
    })
    print("\nIncome group analysis:")
    print(complex_agg)

**How groupby works:**
1. **Split**: `.groupby('column')` divides data into groups based on unique values
2. **Apply**: Functions like `.mean()`, `.sum()`, `.count()` are applied to each group
3. **Combine**: Results are combined into a new DataFrame or Series
4. **Multiple functions**: `.agg(['func1', 'func2'])` applies multiple functions
5. **Different functions per column**: `.agg({'col1': 'mean', 'col2': 'sum'})` for flexibility
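
The split, apply, combine steps can be traced on a table small enough to check by hand. This sketch uses invented values and named aggregation, where each output column is written as `new_name=('source_column', 'function')`; the output column names are placeholders.

```python
import pandas as pd

toy = pd.DataFrame({
    'region': ['Europe', 'Europe', 'Asia', 'Asia', 'Asia'],
    'country': ['A', 'B', 'C', 'D', 'E'],
    'income': [30000, 40000, 5000, 15000, 10000],
    'life_expectancy': [80.0, 78.0, 65.0, 75.0, 70.0],
})

summary = toy.groupby('region').agg(
    n_countries=('country', 'nunique'),       # distinct countries per region
    mean_income=('income', 'mean'),           # average income per region
    max_life_exp=('life_expectancy', 'max'),  # best life expectancy per region
)
print(summary)
```

The result has one row per region, with the group labels (`Asia`, `Europe`) as the index.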

**Documentation Reference**: [DataFrame.groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)

---

## Section 5: Working with Time Series Data

### Exercise 12: Date Operations and Filtering

In [29]:
# Extract the year as a number from the datetime column
gapminder['year_num'] = gapminder['year'].dt.year

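# Filter for data from specific years: 1952, 1977, 2002
# Compare against the integer year_num column; passing strings to .isin() on a
# datetime column relies on implicit string-to-date casting, which recent pandas versions warn about
selected_years = [1952, 1977, 2002]
time_series_data = gapminder[gapminder['year_num'].isin(selected_years)]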

print(f"Data from selected years: {len(time_series_data)} rows")
print("Years included:")
print(time_series_data['year'].value_counts().sort_index())

Data from selected years: 534 rows
Years included:
year
1952-01-01    178
1977-01-01    178
2002-01-01    178
Name: count, dtype: int64



**Documentation Reference**: [pandas.Series.isin](https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html)
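
The `.dt` accessor and date comparisons can be explored on a short Series before applying them to the full dataset. The values below are invented for illustration.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ['1952-01-01', '1977-01-01', '2002-01-01', '2003-01-01']
))

# .dt exposes date parts of a datetime Series
years = dates.dt.year

# Membership test on the integer years
in_selection = years.isin([1952, 2002])

# Datetime columns can be compared directly with a Timestamp
after_1970 = dates[dates >= pd.Timestamp('1970-01-01')]

print(years.tolist())         # [1952, 1977, 2002, 2003]
print(in_selection.tolist())  # [True, False, True, False]
print(len(after_1970))        # 3
```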

---

## Section 6: Using Pandas Documentation

### Exercise 13: Documentation Navigation

Learning to use the pandas documentation is crucial for solving new problems:

In [30]:
# Practice finding documentation for these methods:
# 1. How to calculate correlation between two columns
correlation_example = gapminder[['life_expectancy', 'income']].corr()
print("Correlation between life expectancy and income:")
print(correlation_example)

# 2. How to get unique values and their counts
unique_regions = gapminder['region'].value_counts()
print("\nCountries per region:")
print(unique_regions)

# 3. How to calculate rolling averages (you'll need to look this up!)
# Hint: Look for "rolling" in the pandas documentation

Correlation between life expectancy and income:
                 life_expectancy    income
life_expectancy         1.000000  0.588609
income                  0.588609  1.000000

Countries per region:
region
Africa      11388
Asia        10293
Europe       8541
Americas     6789
Oceania      1971
Name: count, dtype: int64


**Key Documentation Resources:**
1. **User Guide**: [pandas.pydata.org/docs/user_guide/](https://pandas.pydata.org/docs/user_guide/)
2. **API Reference**: [pandas.pydata.org/docs/reference/](https://pandas.pydata.org/docs/reference/)
3. **Search Strategy**: Use Ctrl+F to find specific method names

---

## Common Patterns and Best Practices

### Pattern 1: Method Chaining
Instead of creating many intermediate variables, you can chain operations:

In [31]:
# Chain operations together for cleaner code
result = (gapminder
          .query('year_num == 2000')  # Filter on the integer year column created in Exercise 12
          [['country', 'region', 'life_expectancy']]  # Select columns
          .sort_values('life_expectancy', ascending=False)  # Sort
          .head(10))  # Get top 10

print("Top 10 countries by life expectancy in 2000:")
print(result)

Top 10 countries by life expectancy in 2000:
           country    region  life_expectancy
17501        Japan      Asia             81.0
15530      Iceland    Europe             79.9
33707  Switzerland    Europe             79.9
33488       Sweden    Europe             79.7
1733     Australia   Oceania             79.7
17063        Italy    Europe             79.5
32393        Spain    Europe             79.4
30641    Singapore      Asia             79.4
6332        Canada  Americas             79.2
12245       France    Europe             79.1



### Pattern 2: Handling the SettingWithCopyWarning

In [32]:
# When you modify a subset of data, use .copy()
subset = gapminder[gapminder['year'] == '2000'].copy()
subset['new_column'] = subset['life_expectancy'] * 2
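
A quick way to confirm that `.copy()` gives an independent object is to modify the copy and check that the original is unchanged. This sketch uses a small invented table; the column name `doubled` is a placeholder.

```python
import pandas as pd

toy = pd.DataFrame({
    'year': [2000, 2000, 2001],
    'life_expectancy': [70.0, 80.0, 75.0],
})

# Filter, then copy, so later changes affect only the subset
subset = toy[toy['year'] == 2000].copy()
subset['doubled'] = subset['life_expectancy'] * 2

print(subset.columns.tolist())  # ['year', 'life_expectancy', 'doubled']
print(toy.columns.tolist())     # ['year', 'life_expectancy']: original untouched
```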

### Pattern 3: Safe Data Operations

In [33]:
# Always check your data after operations
print("Before operation:", gapminder.shape)
filtered_data = gapminder[gapminder['life_expectancy'] > 50]
print("After filtering:", filtered_data.shape)
print("Percentage retained:", len(filtered_data) / len(gapminder) * 100, "%")

Before operation: (38982, 15)
After filtering: (11672, 15)
Percentage retained: 29.94202452413935 %


---

## Practice Exercises

### Exercise 14: Putting It All Together

Use everything you've learned to answer these questions:

In [69]:
# Question: Which country had the fastest improvement in life expectancy between 1952 and 2007?
# Try doing this by yourself, using what you have learned so far. 
# The solution is at the bottom of the notebook.


# Step 1: Get data for 1952 and 2007
# 'year' is a datetime column, so filter on the integer year_num column from Exercise 12
data_1952 = gapminder.query('year_num == 1952')
data_2007 = gapminder.query('year_num == 2007')

# Step 2: Merge the datasets to compare
improvement_data = data_1952.merge(
    data_2007,
    on = 'country',
    suffixes = ('_1952', '_2007')
)

# Step 3: Calculate improvement
improvement_data['life_exp_improvement'] = improvement_data['life_expectancy_2007'] - improvement_data['life_expectancy_1952']

# Step 4: Find the country with highest improvement
best_row = improvement_data.nlargest(1, 'life_exp_improvement')
print(best_row[['country', 'life_exp_improvement']])

# Equivalent lookup with idxmax(): use the returned index label instead of hard-coding it
best_index = improvement_data['life_exp_improvement'].idxmax()
best_country = improvement_data.loc[best_index, 'country']

print("Country with fastest life expectancy improvement:")
print(best_country)

     country  life_exp_improvement
98  Maldives                  47.3
Country with fastest life expectancy improvement:
Maldives
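
Step 2 relies on `merge()`, which has not appeared earlier in this tutorial. The sketch below shows the same pattern on two invented tables: rows are matched on `country`, columns that exist in both tables receive the given suffixes, and by default only countries present in both tables are kept.

```python
import pandas as pd

early = pd.DataFrame({'country': ['A', 'B', 'C'],
                      'life_expectancy': [40.0, 50.0, 60.0]})
late = pd.DataFrame({'country': ['A', 'B', 'D'],
                     'life_expectancy': [70.0, 65.0, 80.0]})

# Default is an inner join: C and D appear in only one table, so they are dropped
merged = early.merge(late, on='country', suffixes=('_early', '_late'))
merged['gain'] = merged['life_expectancy_late'] - merged['life_expectancy_early']

# idxmax() returns the index label of the largest gain
best = merged.loc[merged['gain'].idxmax(), 'country']

print(merged.columns.tolist())
print(best)  # A
```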


---

## Summary

You've now learned the fundamental pandas skills for data analysis:

### **Core Skills Mastered:**
1. **Loading and inspecting** data with `read_csv()`, `info()`, `describe()`
2. **Selecting and filtering** data with boolean indexing
3. **Creating new columns** with calculations and transformations
4. **Sorting and ranking** data for analysis
5. **Handling missing values** with appropriate strategies
6. **Basic grouping and aggregating** with `groupby()`
7. **Working with dates** and time series data
8. **Using pandas documentation** effectively

### **Key Patterns to Remember:**
- Use `.copy()` when modifying filtered data
- Combine conditions with `&` (and) and `|` (or) in parentheses
- Chain methods for cleaner, more readable code
- Always validate your data after operations

### **Next Steps:**
- Practice these techniques on different datasets
- Explore more advanced pandas functions
- Learn data cleaning techniques for visualization
- Apply these skills in your assignments!

**Congratulations!** You now have a solid foundation in pandas for data analysis. These skills will serve you well throughout the course and in your future data science work.



---

## Creating Your Pandas Cheat Sheet

As you learn pandas, it's crucial to build your own reference guide. A personalized cheat sheet helps you remember syntax and builds confidence for quizzes and projects.

### **Five-Step Process for Building Your Cheat Sheet**

#### **Step 1: Create Your Outline (5 minutes)**
Set up a document (Word, Google Doc, or notebook) with these sections:
- **Data Loading & Exploration**
- **Data Selection & Filtering** 
- **Data Manipulation**
- **Grouping & Aggregation**
- **Common Patterns**

#### **Step 2: Extract Key Syntax (10 minutes)**
Go through each section of this tutorial and write down the **essential syntax**:

```
DATA LOADING & EXPLORATION:
• pd.read_csv(url, parse_dates=['col'])
• df.info() - data types and missing values
• df.describe() - statistical summary
• df.shape - (rows, columns)
• df.columns.tolist() - column names

DATA SELECTION & FILTERING:
• df['column'] - single column (Series)
• df[['col1', 'col2']] - multiple columns (DataFrame) 
• df[df['col'] > value] - boolean filtering
• df[(condition1) & (condition2)] - multiple conditions
• df['col'].isnull() / .notna() - missing value checks
```

#### **Step 3: Add Your Own Examples (10 minutes)**
For each syntax pattern, write a **concrete example** from your own practice:

```
EXAMPLE: Filter high life expectancy countries in 2000
high_life_2000 = gapminder[
    (gapminder['year'] == '2000') & 
    (gapminder['life_expectancy'] > 75)
]
```

#### **Step 4: Note Common Gotchas (5 minutes)**
Record the mistakes you made or almost made:

```
REMEMBER:
• Use .copy() when modifying filtered data
• Always use parentheses with & and | operators
• .shape is a property (no parentheses)
• Use .tolist() to convert columns to readable list
```

#### **Step 5: Test Your Cheat Sheet (5 minutes)**
Pick 2-3 random operations from your cheat sheet and try them on a dataset. If you can't remember how to use them, add more detail to your notes.

### **Pro Tips for Effective Cheat Sheets:**

1. **Keep it concise**: One page per tutorial maximum
2. **Use your own words**: Don't just copy-paste from tutorials
3. **Include error solutions**: Note how you fixed common mistakes
4. **Update regularly**: Add new patterns as you encounter them
5. **Practice from it**: Use your cheat sheet to solve new problems

### **Before Your Next Tutorial:**
Spend 35 minutes total creating your Tutorial 1 cheat sheet. 
This investment will pay off during future assignments when you need to recall syntax quickly.

**Your cheat sheet is your personal pandas survival guide - make it work for you!**

---


## Self-Check Questions

Before moving on, make sure you can answer these questions. **Practice coding the answers** to prepare for your quiz:

### **Basic Concepts (Practice these first)**

1. **What's the difference between selecting columns with `df['col']` vs `df[['col']]`?**
   - Try both on a sample dataset and observe the output types

2. **How do you combine multiple conditions in boolean indexing?**
   - Practice creating filters with both `&` (and) and `|` (or)
   - Remember: always use parentheses around each condition!

3. **What's the purpose of using `.copy()` when creating filtered datasets?**
   - When do you need it and why does pandas give warnings without it?
   
4. **How do you find the top N values in a column?**
   - Pick a column from the dataset and make sure you can keep only the rows with its N largest values

5. **What's the basic pattern for groupby operations?**
   - Describe the split, apply, and combine steps, then write one example of each `.agg()` form

### **Applied Skills (Quiz-level questions)**

6. **Complex Filtering Challenge:**
   Write code to find countries that meet ALL these conditions:
   - Life expectancy greater than 70
   - Population greater than 5 million
   - From either Europe OR the Americas (the `region` column has no separate "North America" value)
   

7. **Groupby Analysis Challenge:**
   For each region, calculate:
   - The count of countries
   - The average life expectancy
   - The country with the highest income (hint: use `.idxmax()`)
   

8. **Missing Data Strategy Challenge:**
   You have a dataset where:
   - 'essential_column' has 5% missing values
   - 'optional_column' has 40% missing values
   - 'analysis_column' has 15% missing values
   
   What strategy would you use for each column and why? Practice implementing your strategy in code.

**If you can confidently answer and code solutions for the three Applied Skills challenges (questions 6-8), you're well-prepared for the quiz!**

In [86]:
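# Complex Filtering Challenge
# The region column contains 'Americas' rather than 'North America' (see the region counts in Exercise 13)
life_g70 = gapminder.query('life_expectancy > 70 and population > 5000000 and region in ["Europe", "Americas"]')

# Groupby Analysis Challenge
summary = (
    gapminder
        .groupby('region')
        .agg(
            country_count = ('country', 'nunique'),
            avg_life_exp  = ('life_expectancy', 'mean')
        )
)
# Row with the highest income within each region (idxmax returns one index label per group)
top_income_by_region = gapminder.loc[
    gapminder.groupby('region')['income'].idxmax(),
    ['region', 'country', 'income']
]
# Highest-income row across the whole dataset (this is the value printed below)
best_index = gapminder['income'].idxmax()
best_country = gapminder.loc[best_index, 'country']
print(best_country)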

# Missing Data Strategy Challenge

# Size of the synthetic dataset
n = 100

# Base DataFrame with random values (no seed is set, so the numbers differ on every run)
df = pd.DataFrame({
    'country': [f"C{i}" for i in range(n)],
    'essential_column': np.random.randn(n),
    'optional_column': np.random.randn(n),
    'analysis_column': np.random.randn(n),
})

# Insert missingness
df.loc[np.random.choice(n, size=5,  replace=False), 'essential_column'] = np.nan   # 5%
df.loc[np.random.choice(n, size=40, replace=False), 'optional_column']  = np.nan   # 40%
df.loc[np.random.choice(n, size=15, replace=False), 'analysis_column'] = np.nan   # 15%

df.head()  # Not displayed here: a cell only shows the value of its last expression

df_clean = df.copy()

# 1. Essential column → keep column, drop rows with missing values
df_clean = df_clean.dropna(subset=['essential_column'])

# 2. Optional column → drop entire column
df_clean = df_clean.drop(columns=['optional_column'])

# 3. Analysis column → keep column, drop rows with missing values
df_clean = df_clean.dropna(subset=['analysis_column'])

# Sanity check: no missing values should remain in the kept columns
assert df_clean.notna().all().all()
df_clean




United Arab Emirates


Unnamed: 0,country,essential_column,analysis_column
0,C0,-0.529730,1.399107
1,C1,0.469486,0.442950
2,C2,-0.630499,1.288744
3,C3,1.333342,2.673950
4,C4,-0.486552,-1.487199
...,...,...,...
93,C93,0.824114,0.435691
95,C95,-0.257243,1.849411
96,C96,-0.213792,0.115578
97,C97,-0.866182,-0.167502


In [37]:
# Solution for Exercise 14

# Step 1: Get data for 1952 and 2007
data_1952 = gapminder[gapminder['year'] == '1952']
data_2007 = gapminder[gapminder['year'] == '2007']

# Step 2: Merge the datasets to compare
improvement_data = data_1952.merge(
    data_2007, 
    on='country', 
    suffixes=('_1952', '_2007')
)

# Step 3: Calculate improvement
improvement_data['life_exp_improvement'] = (
    improvement_data['life_expectancy_2007'] - 
    improvement_data['life_expectancy_1952']
)

# Step 4: Find the country with highest improvement
best_improvement = improvement_data.nlargest(1, 'life_exp_improvement')
print("Country with fastest life expectancy improvement:")
print(best_improvement[['country', 'life_exp_improvement']])

Country with fastest life expectancy improvement:
     country  life_exp_improvement
98  Maldives                  47.3
