---
title: "From Python Basics to Pandas"
subtitle: "Master's in Business Data Science, M1"
author: "Roman Jurowetzki"
---

## 🎯 Welcome Back! Let's Build Your Toolkit

You've refreshed your knowledge of Python's fundamentals: variables, lists, dictionaries, and basic functions. Now, it's time to forge those building blocks into a powerful toolkit for data analysis.

In this session, we'll bridge the gap between foundational Python and the world of professional data science. We'll master the logic of **flow control**—the loops and conditionals that let you process data systematically. Then, we'll transition to `pandas`, the single most important library for data manipulation in Python, and see how it supercharges these concepts.

**Our Goal:** By the end of this session, you'll understand how to move from manually processing small collections of data in Python to efficiently analyzing large datasets with pandas.

Let's get started!

In [1]:
# Initial setup: Let's import the libraries we'll need.
# pandas is the industry standard for data manipulation.
# numpy is its powerful companion for numerical operations.
import pandas as pd
import numpy as np

print("Libraries imported successfully!")

Libraries imported successfully!


---

# Part 1: Python Flow Control Mastery (35 mins)

Before we jump into specialized libraries, we need to master how to make Python "think" and "repeat." This is the core of automation. We'll use a small sample of data representing Udemy courses to see these concepts in action.

In [2]:
# A small, Python-native dataset (a list of dictionaries)
# This is how you might receive data from a web API.
sample_courses = [
    {'title': 'Python for Everybody', 'subject': 'Web Development', 'price': 0, 'subscribers': 1500000},
    {'title': 'Machine Learning A-Z', 'subject': 'Business Finance', 'price': 200, 'subscribers': 1200000},
    {'title': 'The Complete Web Developer Course 2.0', 'subject': 'Web Development', 'price': 200, 'subscribers': 1100000},
    {'title': 'Learn Ethical Hacking From Scratch', 'subject': 'Web Development', 'price': 0, 'subscribers': 950000},
    {'title': 'The Complete Financial Analyst Course', 'subject': 'Business Finance', 'price': 150, 'subscribers': 500000}
]

## 1.1 Loops for Data Processing (15 mins)

Loops are your primary tool for performing an action on every item in a collection.

### `for` Loops: The Workhorse

Let's calculate the total number of subscribers for 'Web Development' courses.

In [3]:
# The classic approach with a for loop
total_web_dev_subscribers = 0
for course in sample_courses:
    if course['subject'] == 'Web Development':
        total_web_dev_subscribers += course['subscribers'] # shorthand for total = total + ...

print(f"Total subscribers for Web Development courses: {total_web_dev_subscribers}")

Total subscribers for Web Development courses: 3550000


This works perfectly, but it takes a few lines of code. For simpler tasks, Python offers a more elegant solution.

### List Comprehensions: The Pythonic Way

List comprehensions are a concise way to create lists. They combine the loop and the item creation into a single, readable line.

Let's create a list containing the titles of all free courses.

In [4]:
# A list comprehension to filter and transform data
free_course_titles = [course['title'] for course in sample_courses if course['price'] == 0]

print("Titles of free courses:", free_course_titles)

Titles of free courses: ['Python for Everybody', 'Learn Ethical Hacking From Scratch']


::: {.callout-note}
#### Readability vs. Performance
List comprehensions are often somewhat faster than the equivalent `for` loop because CPython appends each item with a dedicated bytecode instruction instead of looking up and calling `.append()` on every iteration. For simple filtering and transformation, they are the preferred method. For more complex logic, a standard `for` loop is often more readable.
:::
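
To check the performance claim on your own machine, here is a small timing sketch using the standard-library `timeit` module. The `prices` list and both helper functions are made up for illustration, and the timings depend on your machine and Python version, so treat them as indicative only.

```python
import timeit

# Hypothetical data for illustration: 100,000 prices cycling through 0..199
prices = [i % 200 for i in range(100_000)]

def with_loop():
    """Collect the free prices with an explicit for loop and .append()."""
    result = []
    for p in prices:
        if p == 0:
            result.append(p)
    return result

def with_comprehension():
    """Collect the free prices with a list comprehension."""
    return [p for p in prices if p == 0]

# Both approaches must produce exactly the same list
assert with_loop() == with_comprehension()

loop_time = timeit.timeit(with_loop, number=20)
comp_time = timeit.timeit(with_comprehension, number=20)
print(f"for loop:           {loop_time:.3f} s")
print(f"list comprehension: {comp_time:.3f} s")
```

Whichever version wins on your machine, the difference is small compared with the gains from switching to pandas later in this session.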

## 1.2 Conditionals and Logic (10 mins)

Conditionals (`if`, `elif`, `else`) are the "brains" of your code. They allow you to execute different actions based on data, which is essential for filtering, categorizing, and cleaning.

Let's categorize our courses into price tiers.

In [5]:
# Use if/elif/else to create categories
for course in sample_courses:
    title = course['title']
    price = course['price']

    if price == 0:
        category = 'Free'
    elif 0 < price <= 100:
        category = 'Affordable'
    else: # price > 100
        category = 'Premium'
    
    print(f'"{title}" is in the "{category}" category.')

"Python for Everybody" is in the "Free" category.
"Machine Learning A-Z" is in the "Premium" category.
"The Complete Web Developer Course 2.0" is in the "Premium" category.
"Learn Ethical Hacking From Scratch" is in the "Free" category.
"The Complete Financial Analyst Course" is in the "Premium" category.


Notice how we combined a `for` loop with conditional logic to build a small processing pipeline. This pattern is fundamental in data science.
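
The same loop-plus-conditional pattern can also accumulate a result instead of printing one line per item. The sketch below counts how many items fall into each tier with `collections.Counter`. The `prices` list is a made-up example, not taken from the Udemy data.

```python
from collections import Counter

# Hypothetical prices for illustration
prices = [0, 200, 200, 0, 150, 45]

category_counts = Counter()
for price in prices:
    if price == 0:
        category = 'Free'
    elif 0 < price <= 100:
        category = 'Affordable'
    else:
        category = 'Premium'
    category_counts[category] += 1  # tally one more item in this tier

print(dict(category_counts))
```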

## 1.3 Functions for Analysis (10 mins)

Functions allow you to package reusable code. Instead of copying and pasting your logic, you define it once and call it whenever you need it. This makes your code cleaner, less error-prone, and easier to manage.

Let's turn our price categorization logic into a function.

In [7]:
def categorize_price(price):
    """Categorizes a course price into a descriptive tier."""
    if not isinstance(price, (int, float)):
        return 'Invalid Price' # Basic error handling
    
    if price == 0:
        return 'Free'
    elif 0 < price <= 100:
        return 'Affordable'
    else:
        return 'Premium'

# Now we can use our function inside the loop
for course in sample_courses:
    price_category = categorize_price(course['price'])
    print(f'"{course["title"]}" is: {price_category}')

"Python for Everybody" is: Free
"Machine Learning A-Z" is: Premium
"The Complete Web Developer Course 2.0" is: Premium
"Learn Ethical Hacking From Scratch" is: Free
"The Complete Financial Analyst Course" is: Premium

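One benefit of wrapping the logic in a function is that edge cases become easy to probe. The sketch below repeats the function so it runs on its own and calls it with a few boundary and invalid inputs. The test values are chosen for illustration. Note that a negative price falls through to the `else` branch, so you may want an extra check if your data can contain negative prices.

```python
def categorize_price(price):
    """Categorizes a course price into a descriptive tier."""
    if not isinstance(price, (int, float)):
        return 'Invalid Price'
    if price == 0:
        return 'Free'
    elif 0 < price <= 100:
        return 'Affordable'
    else:
        return 'Premium'

print(categorize_price(100))     # upper boundary of the 'Affordable' tier
print(categorize_price(100.01))  # just above the boundary
print(categorize_price('free'))  # strings are rejected
print(categorize_price(None))    # so is None
print(categorize_price(-20))     # negative values land in the else branch
```
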

### Lambda Functions: Quick, Anonymous Functions

Sometimes you need a simple function for a one-time use, like sorting or transforming data. `lambda` functions are perfect for this.
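
As a sketch of the sorting use case, a `lambda` can be passed directly as the `key` argument of `sorted()` or `max()` without ever being given a name. The three course dictionaries below are hypothetical and exist only to keep the example self-contained.

```python
# Hypothetical courses for illustration
courses = [
    {'title': 'Course A', 'price': 200, 'subscribers': 1200},
    {'title': 'Course B', 'price': 0, 'subscribers': 1500},
    {'title': 'Course C', 'price': 150, 'subscribers': 500},
]

# Sort by subscriber count, largest first; the lambda extracts the sort key
by_subscribers = sorted(courses, key=lambda c: c['subscribers'], reverse=True)
titles = [c['title'] for c in by_subscribers]
print(titles)

# Find the course with the highest price * subscribers
most_revenue = max(courses, key=lambda c: c['price'] * c['subscribers'])
print(most_revenue['title'])
```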

Let's calculate the revenue for each course (`price` * `subscribers`).

In [8]:
# A lambda function to calculate revenue
calculate_revenue = lambda price, subscribers: price * subscribers

# Using it in our loop
for course in sample_courses:
    revenue = calculate_revenue(course['price'], course['subscribers'])
    print(f'"{course["title"]}" generated an estimated ${revenue:,}.') # :, formats number with commas

"Python for Everybody" generated an estimated $0.
"Machine Learning A-Z" generated an estimated $240,000,000.
"The Complete Web Developer Course 2.0" generated an estimated $220,000,000.
"Learn Ethical Hacking From Scratch" generated an estimated $0.
"The Complete Financial Analyst Course" generated an estimated $75,000,000.


---

# Part 2: Transition to Pandas (35 mins)

Working with lists of dictionaries is great for learning, but it becomes slow and cumbersome with large datasets. Enter `pandas`. Pandas provides a high-performance, easy-to-use data structure called a **DataFrame**.

Think of a DataFrame as a super-powered spreadsheet or a SQL table, right inside Python.
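
A list of dictionaries converts directly into a DataFrame: each dictionary becomes a row and each key becomes a column. The sketch below uses three of the sample courses from Part 1. The names `courses` and `sample_df` are chosen here for illustration.

```python
import pandas as pd

# Three of the sample courses from Part 1
courses = [
    {'title': 'Python for Everybody', 'subject': 'Web Development', 'price': 0, 'subscribers': 1500000},
    {'title': 'Machine Learning A-Z', 'subject': 'Business Finance', 'price': 200, 'subscribers': 1200000},
    {'title': 'The Complete Financial Analyst Course', 'subject': 'Business Finance', 'price': 150, 'subscribers': 500000},
]

# Each dictionary becomes one row; the keys become the column names
sample_df = pd.DataFrame(courses)
print(sample_df.shape)
print(sample_df.dtypes)

# The conversion also works in the other direction
records = sample_df.to_dict(orient='records')
```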

## 2.1 From Python Collections to DataFrames (20 mins)

Let's load the full Udemy dataset and see the pandas equivalent of what we just did.

In [9]:
# Load the dataset from a URL using pandas
url = 'https://raw.githubusercontent.com/aaubs/ds-master/main/data/udemy_courses_info.csv'
df = pd.read_csv(url)

### First Look: `.head()`, `.info()`, `.describe()`

Pandas gives you powerful tools to inspect your data instantly.

In [10]:
# .head() shows the first 5 rows (like peeking at the top of a spreadsheet)
df.head()

| | course_id | course_title | url | is_paid | price | num_subscribers | num_reviews | num_lectures | level | content_duration | subject |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1006314 | Financial Modeling for Business Analysts and C... | https://www.udemy.com/financial-modeling-for-b... | True | 45 | 2174 | 74.0 | 51.0 | Intermediate Level | 2.5 | Business Finance |
| 1 | 1011058 | How To Maximize Your Profits Trading Options | https://www.udemy.com/how-to-maximize-your-pro... | True | 200 | 1276 | 45.0 | 26.0 | Intermediate Level | 2.0 | Business Finance |
| 2 | 192870 | Trading Penny Stocks: A Guide for All Levels I... | https://www.udemy.com/trading-penny-stocks-a-g... | True | 150 | 9221 | 138.0 | 25.0 | All Levels | 3.0 | Business Finance |
| 3 | 739964 | Investing And Trading For Beginners: Mastering... | https://www.udemy.com/investing-and-trading-fo... | True | 65 | 1540 | 178.0 | 26.0 | Beginner Level | 1.0 | Business Finance |
| 4 | 403100 | Trading Stock Chart Patterns For Immediate Exp... | https://www.udemy.com/trading-chart-patterns-f... | True | 95 | 2917 | 148.0 | 23.0 | All Levels | 2.5 | Business Finance |


In [11]:
# .info() gives a summary of columns, data types, and missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2959 entries, 0 to 2958
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   course_id         2959 non-null   int64  
 1   course_title      2959 non-null   object 
 2   url               2959 non-null   object 
 3   is_paid           2959 non-null   bool   
 4   price             2959 non-null   int64  
 5   num_subscribers   2959 non-null   int64  
 6   num_reviews       2802 non-null   float64
 7   num_lectures      2880 non-null   float64
 8   level             2959 non-null   object 
 9   content_duration  2880 non-null   float64
 10  subject           2959 non-null   object 
dtypes: bool(1), float64(3), int64(3), object(4)
memory usage: 234.2+ KB


In [12]:
# .describe() provides descriptive statistics for all numerical columns
df.describe()

| | course_id | price | num_subscribers | num_reviews | num_lectures | content_duration |
|---|---|---|---|---|---|---|
| count | 2959.0 | 2959.0 | 2959.0 | 2802.0 | 2880.0 | 2880.0 |
| mean | 567870.0 | 63.773234 | 3625.175397 | 186.523555 | 41.042708 | 4.216314 |
| std | 286467.2 | 59.121061 | 10448.724158 | 1064.627194 | 51.21181 | 6.265766 |
| min | 8324.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 350868.0 | 20.0 | 165.0 | 6.0 | 15.0 | 1.5 |
| 50% | 591880.0 | 45.0 | 1029.0 | 23.0 | 26.0 | 2.5 |
| 75% | 806641.0 | 95.0 | 2814.0 | 81.0 | 46.0 | 4.5 |
| max | 1053462.0 | 200.0 | 268923.0 | 27445.0 | 544.0 | 78.5 |


### Column Selection and Filtering with Boolean Masks

Remember how we used a list comprehension to find free courses? In pandas, this is even easier and more powerful using **boolean masking**.

In [13]:
# Step 1: Create a boolean Series (a sequence of True/False values)
is_free = df['price'] == 0
print("Boolean mask (first 5 values):")
print(is_free.head())

# Step 2: Use this mask to filter the DataFrame
free_courses_df = df[is_free]

print("\n--- All Free Courses ---")
free_courses_df.head()

Boolean mask (first 5 values):
0    False
1    False
2    False
3    False
4    False
Name: price, dtype: bool

--- All Free Courses ---


| | course_id | course_title | url | is_paid | price | num_subscribers | num_reviews | num_lectures | level | content_duration | subject |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 76 | 133536 | Stock Market Investing for Beginners | https://www.udemy.com/the-beginners-guide-to-t... | False | 0 | 50855 | 2698.0 | 15.0 | Beginner Level | 1.5 | Business Finance |
| 79 | 265960 | Fundamentals of Forex Trading | https://www.udemy.com/fundamentals-of-forex-tr... | False | 0 | 17160 | 620.0 | 23.0 | All Levels | 1.0 | Business Finance |
| 81 | 923616 | Website Investing 101 - Buying & Selling Onlin... | https://www.udemy.com/cash-flow-website-invest... | False | 0 | 6811 | 151.0 | 51.0 | All Levels | 2.0 | Business Finance |
| 85 | 191854 | Stock Market Foundations | https://www.udemy.com/how-to-invest-in-the-sto... | False | 0 | 19339 | 794.0 | 9.0 | Beginner Level | 2.0 | Business Finance |
| 91 | 151668 | Introduction to Financial Modeling | https://www.udemy.com/financial-modeling-asimp... | False | 0 | 29167 | 1463.0 | 8.0 | Intermediate Level | 1.5 | Business Finance |


::: {.callout-tip}
#### The "Pandas Bridge"
- **Python `for` loop + `if` condition** becomes **Pandas boolean mask `df[condition]`**.
- This is one of the most important concepts in pandas. It is extremely efficient for filtering millions of rows.
:::
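
Masks can be combined with `&` (and), `|` (or), and `~` (not), which is how a multi-condition `if` translates to pandas. Here is a minimal sketch on a made-up four-row DataFrame called `toy`. When you write the comparisons inline, wrap each one in parentheses, because `&` and `|` bind more tightly than `==`.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'subject': ['Web Development', 'Business Finance', 'Web Development', 'Graphic Design'],
    'price': [0, 0, 50, 0],
})

is_free = toy['price'] == 0
is_web = toy['subject'] == 'Web Development'

free_web = toy[is_free & is_web]     # both conditions must hold
free_or_web = toy[is_free | is_web]  # at least one condition holds
not_free = toy[~is_free]             # negate a mask

# Inline form: each comparison needs its own parentheses
same_as_free_web = toy[(toy['price'] == 0) & (toy['subject'] == 'Web Development')]

print(free_web['title'].tolist())
print(not_free['title'].tolist())
```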


## 2.2 Pandas Operations in Action (15 mins)

### Method Chaining

One of the best features of pandas is the ability to "chain" operations together into a clean pipeline.

Let's find the titles of the top 5 free 'Web Development' courses with the most subscribers.

In [14]:
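# Combine both conditions into a single mask with &. Indexing twice in a row
# (df[mask1][mask2]) applies a full-length mask to an already filtered frame,
# and pandas warns that the boolean Series key will be reindexed.
top_free_web_courses = (
    df[(df['price'] == 0) & (df['subject'] == 'Web Development')]  # 1. Filter for free Web Development courses
    .sort_values(by='num_subscribers', ascending=False)            # 2. Sort by subscribers
    .head(5)                                                       # 3. Get the top 5
    ['course_title']                                               # 4. Select only the title column
)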

print(top_free_web_courses)

2252                 Learn HTML5 Programming From Scratch
2422                       Coding for Entrepreneurs Basic
2224    Build Your First Website in 1 Week with HTML5 ...
2064    Web Design for Web Developers: Build Beautiful...
2636    Practical PHP: Master the Basics and Code Dyna...
Name: course_title, dtype: object
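
Another way to keep a filter inside a chain is to pass a `lambda` to `.loc`, so the condition is evaluated on the intermediate result rather than on the original `df`. The sketch below uses a made-up DataFrame `toy` and `.nlargest()` as a shortcut for sort-then-head. It is one possible style, not the only one.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'subject': ['Web Development', 'Web Development', 'Business Finance',
                'Web Development', 'Web Development'],
    'price': [0, 0, 0, 20, 0],
    'num_subscribers': [100, 300, 900, 800, 200],
})

top_free_web = (
    toy
    .loc[lambda d: (d['price'] == 0) & (d['subject'] == 'Web Development')]  # filter inside the chain
    .nlargest(2, 'num_subscribers')                                          # top 2 by subscribers
    ['title']                                                                # keep only the title column
)

print(top_free_web.tolist())
```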




### Basic Aggregations: `.sum()`, `.mean()`, `.value_counts()`

Pandas makes calculating summary statistics trivial.

In [15]:
# How many courses are in each subject?
subject_counts = df['subject'].value_counts()
print("Course counts per subject:")
print(subject_counts)

Course counts per subject:
subject
Web Development        976
Business Finance       968
Musical Instruments    568
Graphic Design         447
Name: count, dtype: int64


In [16]:
# What is the total number of subscribers across all courses?
total_subscribers = df['num_subscribers'].sum()
print(f"\nTotal subscribers: {total_subscribers:,}")


Total subscribers: 10,726,894


### Grouping Data with `.groupby()`

This is the superpower of pandas. `.groupby()` allows you to split your data into groups, apply a function to each group, and combine the results.

Let's generalize our first Python `for` loop, which summed subscribers for 'Web Development' only: calculate the total subscribers for every subject at once.

In [17]:
# Split-Apply-Combine with groupby
subscribers_by_subject = df.groupby('subject')['num_subscribers'].sum().sort_values(ascending=False)

print(subscribers_by_subject)

subject
Web Development        7301548
Business Finance       1738412
Graphic Design          907807
Musical Instruments     779127
Name: num_subscribers, dtype: int64


In one line, pandas did the work of our multi-line Python loop, but on the entire dataset. This is the power of vectorized operations.
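
To make the correspondence concrete, the sketch below computes per-subject totals twice on a tiny made-up list: once with a plain loop and a dictionary, and once with `.groupby()`. The data and variable names are hypothetical.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical data for illustration
courses = [
    {'subject': 'Web Development', 'subscribers': 1500},
    {'subject': 'Business Finance', 'subscribers': 1200},
    {'subject': 'Web Development', 'subscribers': 1100},
    {'subject': 'Business Finance', 'subscribers': 500},
]

# Pure Python: split and combine by hand with a dictionary of running totals
totals = defaultdict(int)
for course in courses:
    totals[course['subject']] += course['subscribers']

# pandas: the same split-apply-combine in one expression
grouped = pd.DataFrame(courses).groupby('subject')['subscribers'].sum()

print(dict(totals))
print(grouped.to_dict())
```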

### Handling Missing Data

Real-world data is messy. `pandas` provides simple tools to find and handle missing values.

In [18]:
# Check for missing values in each column
print("Missing values per column:")
print(df.isnull().sum())

Missing values per column:
course_id             0
course_title          0
url                   0
is_paid               0
price                 0
num_subscribers       0
num_reviews         157
num_lectures         79
level                 0
content_duration     79
subject               0
dtype: int64


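The output shows gaps in three columns: `num_reviews` (157 missing values), `num_lectures` (79), and `content_duration` (79). We can fill them by assigning the result back to the column: `df['column'] = df['column'].fillna('some_value')`. Assigning back is more reliable than calling `.fillna(..., inplace=True)` on a selected column, which may not modify the original DataFrame in recent pandas versions.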
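
Two common ways to handle missing values are dropping the affected rows and filling the gaps with a summary value. The sketch below shows both on a made-up DataFrame `toy`. Filling with the median is an assumption made for illustration, and the right strategy depends on why the values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'num_reviews': [10.0, np.nan, 30.0, np.nan],
    'content_duration': [1.5, 2.0, np.nan, 4.0],
})

print(toy.isnull().sum())

# Option 1: drop rows where a specific column is missing
complete = toy.dropna(subset=['content_duration'])

# Option 2: fill the gaps with the column median, assigning the result back
filled = toy.copy()
filled['num_reviews'] = filled['num_reviews'].fillna(filled['num_reviews'].median())

print(filled)
```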

---

# Part 3: Connecting the Dots (20 mins)

Now let's bring everything together. How can we use our custom Python logic within the powerful pandas framework?

## 3.1 Python Functions + Pandas with `.apply()`

The `.apply()` method lets you run a custom function on every value in a pandas Series (a column). This is the perfect bridge between your Python skills and the pandas library.

Let's use the `categorize_price` function we wrote earlier to create a new column in our DataFrame.

In [19]:
# Reminder: here is our Python function from Part 1
def categorize_price(price):
    """Categorizes a course price into a descriptive tier."""
    if not isinstance(price, (int, float)):
        return 'Invalid Price'
    if price == 0:
        return 'Free'
    elif 0 < price <= 100:
        return 'Affordable'
    else:
        return 'Premium'

# Use .apply() to run this function on the 'price' column
df['price_category'] = df['price'].apply(categorize_price)

# Check the result
print("New 'price_category' column created:")
df[['course_title', 'price', 'price_category']].head()

New 'price_category' column created:


| | course_title | price | price_category |
|---|---|---|---|
| 0 | Financial Modeling for Business Analysts and C... | 45 | Affordable |
| 1 | How To Maximize Your Profits Trading Options | 200 | Premium |
| 2 | Trading Penny Stocks: A Guide for All Levels I... | 150 | Premium |
| 3 | Investing And Trading For Beginners: Mastering... | 65 | Affordable |
| 4 | Trading Stock Chart Patterns For Immediate Exp... | 95 | Affordable |


Now we can use this new categorical column in our analysis! For example, let's see the distribution of categories.

In [21]:
df['price_category'].value_counts()

price_category
Affordable    2183
Premium        544
Free           232
Name: count, dtype: int64
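
`.apply()` calls the Python function once per value, which is flexible but not the fastest option on large columns. For simple range-based tiers like these, `pd.cut` is a vectorized alternative. The sketch below is an illustration on a made-up Series and assumes prices are never negative, as in this dataset where the minimum is 0. With these bins, a negative price would be labelled 'Free'.

```python
import numpy as np
import pandas as pd

# Hypothetical prices for illustration
prices = pd.Series([0, 45, 100, 150, 200], name='price')

# Bins are right-inclusive by default: (-inf, 0], (0, 100], (100, inf)
price_category = pd.cut(
    prices,
    bins=[-np.inf, 0, 100, np.inf],
    labels=['Free', 'Affordable', 'Premium'],
)

print(price_category.tolist())
```

The result is a categorical column, which also remembers the order of the tiers.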

## 3.2 Practical Exercise (10 mins)

Let's solve a business question using the skills we've learned.

**Question:** Which subject has the highest *average number of subscribers* for courses at the 'Beginner Level'?

This requires you to:
1.  Filter the DataFrame for 'Beginner Level' courses.
2.  Group the filtered data by `subject`.
3.  Calculate the `mean` of `num_subscribers` for each group.
4.  Find the subject with the highest average.

In [22]:
# 🧠 Your Turn: try it yourself first, then compare with the solution below.

# 1. Filter for 'Beginner Level' courses
beginner_courses = df[df['level'] == 'Beginner Level']

# 2. Group by subject and calculate the mean of subscribers
avg_subscribers_by_subject = beginner_courses.groupby('subject')['num_subscribers'].mean()

# 3. Sort the results to find the highest
sorted_avg_subscribers = avg_subscribers_by_subject.sort_values(ascending=False)

print("Average subscribers for Beginner Level courses by subject:")
print(sorted_avg_subscribers)

# 4. Get the top result programmatically
top_subject = sorted_avg_subscribers.index[0]
top_avg_value = sorted_avg_subscribers.iloc[0]

print(f'\nAnswer: {top_subject} has the highest average of {top_avg_value:,.0f} subscribers for beginner courses.')

Average subscribers for Beginner Level courses by subject:
subject
Web Development        7863.604651
Business Finance       2330.957529
Musical Instruments    1627.000000
Graphic Design         1310.152941
Name: num_subscribers, dtype: float64

Answer: Web Development has the highest average of 7,864 subscribers for beginner courses.
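
If you only need the top group and not the full ranking, `.idxmax()` and `.max()` give the same answer without sorting. The Series below is typed in by hand using the averages printed above, rounded to one decimal, so that the sketch runs on its own.

```python
import pandas as pd

# Averages copied from the output above (rounded), for illustration
avg_subscribers = pd.Series(
    {
        'Web Development': 7863.6,
        'Business Finance': 2331.0,
        'Musical Instruments': 1627.0,
        'Graphic Design': 1310.2,
    },
    name='num_subscribers',
)

top_subject = avg_subscribers.idxmax()  # index label of the largest value
top_value = avg_subscribers.max()       # the largest value itself

print(f"{top_subject}: {top_value:,.0f}")
```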


---

# Take-Home Assignments

Use the Udemy dataset (`df`) to complete the following exercises. These challenges will solidify your understanding and prepare you for more complex data tasks.

### Assignment 1: Advanced Python Control Flow

Write a Python function called `validate_course_data` that takes a single course dictionary (like one from our `sample_courses` list) as input.
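- It should check if `price` and `subscribers` are non-negative numbers (the key is `num_subscribers` if you build the dictionary from a row of `df`).
- It should check if `title` is a non-empty string (`course_title` for a row of `df`).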
- If all checks pass, it should return `True`.
- If any check fails, it should print an informative error message and return `False`.
- Use a `try-except` block to handle cases where a key might be missing from the dictionary.

### Assignment 2: Pandas Deep Dive

Using the full `df` DataFrame, perform the following:
1.  Find all courses that have "Python" in their title (hint: `df['course_title'].str.contains('Python', case=False)`).
2.  From that result, filter for courses that are **not** free.
3.  Calculate the average `num_reviews` for this specific subset of paid Python courses.
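4.  How many of these paid Python courses were published in 2017? (You'll need to work with the `published_timestamp` column. The `df.info()` output above does not list this column, so check `df.columns` first to confirm it exists in your copy of the data. Hint: convert it to datetime with `pd.to_datetime(df['published_timestamp'])` and then access the year with `.dt.year`.)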
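
The two hints above can be tried out on a toy example first. The DataFrame below is entirely made up, including the timestamp values and their format, so treat it as a sketch of the two techniques and not as a preview of the real data.

```python
import pandas as pd

# Hypothetical data for illustration; the real column format may differ
toy = pd.DataFrame({
    'course_title': ['Learn Python Fast', 'python for Finance', 'Guitar Basics'],
    'published_timestamp': ['2017-01-18', '2016-06-02', '2017-03-09'],
})

# Case-insensitive substring match on a text column
has_python = toy['course_title'].str.contains('Python', case=False)

# Parse the strings as datetimes, then extract the year
published_year = pd.to_datetime(toy['published_timestamp']).dt.year

print(has_python.tolist())
print(published_year.tolist())
```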

### Assignment 3: Integration Project

1.  Write a Python function `get_revenue(row)` that takes a DataFrame row as input and returns the revenue (`price` * `num_subscribers`).
2.  Apply this function to the DataFrame to create a new `revenue` column. Be sure to use `axis=1` in your `.apply()` call to pass rows to your function.
3.  Using `.groupby()`, find the total revenue generated by each `level` ('Beginner Level', 'All Levels', etc.). Which level is the most profitable?

### Assignment 4: Exploratory Data Analysis (EDA)

You are a data analyst at Udemy. Your manager wants to know: **"What factors contribute to a successful course?"**

Perform an exploratory analysis to answer this question. A "successful" course could be defined by high `num_subscribers`, high `num_reviews`, or high `revenue`.

Your task:
1.  Explore the relationships between different variables (e.g., `price` vs. `num_subscribers`, `subject` vs. `num_reviews`).
2.  Use a combination of filtering, grouping, and aggregation to uncover insights.
3.  Summarize your top **3 findings** in a markdown cell. For each finding, provide the pandas code you used to arrive at that conclusion. Your findings should be actionable business recommendations (e.g., "Courses in Web Development have the highest subscriber engagement, suggesting we should invest more in this area.").