# Pandas & Polars: Exercise Results

## 1. Create a DataFrame
- Practice creating a small table of data using both Pandas and Polars. This exercise will help you get comfortable with initializing a DataFrame, explore the syntax for each library, and understand how tabular data is represented in Python using different tools.


In [None]:
# Pandas
import pandas as pd

df_pd = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'score': [85, 92, 78]})

print(df_pd)

In [None]:
# Polars
import polars as pl

df_pl = pl.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'score': [85, 92, 78]})

print(df_pl)

## 2. Read CSV

- Practice loading tabular data from a CSV file using both Pandas and Polars. Reading CSVs is a core skill for data engineers, as CSV is one of the most common formats for raw datasets. In this exercise, you'll learn to import data, specify file paths, and handle basic parameters for real-world data ingestion workflows.


In [None]:
# Pandas
students_pd = pd.read_csv('students.csv')

In [None]:
# Polars
students_pl = pl.read_csv('students.csv')

## 3. Inspect Data

- Understand your dataset by displaying the first few rows, listing all column names, and summarizing its structure. Use these inspection techniques to quickly assess data quality, identify patterns or anomalies, and plan further cleaning or analysis steps.


In [None]:
# Pandas
display(students_pd.head(3))
print("Pandas columns:", list(students_pd.columns))

# Polars
display(students_pl.head(3))
print("Polars columns:", students_pl.columns)

## 4. Filter Data
- Practice selecting subsets of your DataFrame by applying conditional filters. For example, extract rows where the 'score' exceeds a specific value, or where students belong to a particular class. Mastering filtering helps you focus on relevant records, perform targeted analyses, and prepare data for further processing.


In [None]:
# Pandas
print(students_pd[students_pd['score'] > 80])

In [None]:
# Polars (pl.col expressions are the idiomatic way to write filter conditions)
print(students_pl.filter(pl.col('score') > 80))

## 5. Add a Derived Column
- Create a new column in your DataFrame that is calculated from existing columns (for example, a 'passed' column that is True if a student's score is greater than or equal to 60, and False otherwise). This exercise will help you practice transforming raw data into more meaningful features, which is a common task in data preparation and feature engineering.


In [None]:
# Pandas
students_pd = students_pd.assign(passed=students_pd['score'] >= 60)
print(students_pd)


In [None]:
# Polars
students_pl = students_pl.with_columns(
    (pl.col('score') >= 60).alias('passed')
)
print(students_pl)


## 6. Group and Aggregate
- Summarize your data by grouping rows based on a specific column and then computing aggregate metrics (e.g., mean, sum, count) for each group. This exercise will help you practice data summarization, which is essential for uncovering trends and patterns in your datasets.


In [None]:
# Pandas
grouped = students_pd.groupby('passed').agg({'score': ['count', 'mean']})

print(grouped)


In [None]:
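# Polars (recent releases spell this group_by; older ones used groupby)
agg_exprs = [pl.col('score').count().alias('count'), pl.col('score').mean().alias('mean')]
print(students_pl.group_by('passed').agg(agg_exprs))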

## 7. Handle Missing Data
- Detect, analyze, and address missing values in your dataset. Effective handling of nulls is essential for maintaining data quality, enabling accurate downstream analysis, and preventing misleading results. Practice strategies such as identifying missing entries, quantifying their impact, and imputing or removing them as appropriate for your use case.


In [None]:
# Pandas
mean_score = students_pd['score'].mean()
students_pd['score'] = students_pd['score'].fillna(mean_score)


In [None]:
# Polars
students_pl = students_pl.with_columns(
    pl.col('score').fill_null(pl.col('score').mean())
)


## 8. Merge DataFrames
- Combine two DataFrames by joining them on a shared column (key). Merging allows you to integrate related information from different sources, enabling comprehensive analysis and richer datasets.


In [None]:
# Pandas
emails_pd = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com']})
merged_pd = students_pd.merge(emails_pd, on='name')

print(merged_pd)

In [None]:
# Polars
emails_pl = pl.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com']})
merged_pl = students_pl.join(emails_pl, on='name')

print(merged_pl)

## 9. Export DataFrame to CSV

- Save your cleaned and transformed DataFrame to a CSV file. Exporting data allows you to persist results, share datasets with collaborators, and use them in other tools or systems. This step is fundamental for data reproducibility and further analysis outside your current environment.


In [None]:
# Pandas
output_path = 'output.csv'
students_pd.to_csv(output_path, index=False)

print(f"CSV written to {output_path}")


In [None]:
# Polars (this overwrites the output.csv written by the Pandas cell)
students_pl.write_csv('output.csv')


---

### Challenge
- Download a real-world dataset (e.g., from Kaggle or data.gov) and perform exploratory analysis, cleaning, and transformations using both Pandas and Polars.
- Compare the performance of Pandas and Polars for common operations (filtering, grouping, aggregations) on this larger dataset.
- Document your findings and share insights on when you would prefer one library over the other in practical data engineering scenarios.


In [None]:
# Pandas
import pandas as pd
import time
import numpy as np

rows = 1_000_000
data = {'a': np.random.randint(0, 100, rows), 'b': np.random.randint(0, 100, rows)}
# perf_counter is a monotonic, high-resolution clock suited to timing code
start = time.perf_counter()
pd_df = pd.DataFrame(data)
pd_df['c'] = pd_df['a'] + pd_df['b']
pandas_time = time.perf_counter() - start
print(f'Pandas time: {pandas_time:.4f}s')


In [None]:

# Polars
import polars as pl

start = time.perf_counter()
pl_df = pl.DataFrame(data)
pl_df = pl_df.with_columns((pl.col('a') + pl.col('b')).alias('c'))
polars_time = time.perf_counter() - start
print(f'Polars time: {polars_time:.4f}s')



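Polars is often faster on large datasets because it runs operations in parallel across CPU cores, but a single run of a simple column addition is a noisy measurement. Repeat the timing and test heavier operations such as filtering, grouping, and aggregation before drawing conclusions.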
