# Python for Data Analytics
--- 

**Skill Level:** 
Absolute beginners to intermediate learners

**Career Paths:** 
Data Analyst, Business Analyst, Junior Data Scientist.

**Goal:** 
Equip learners with job-ready skills in Python for data analytics, from programming fundamentals to applied analytics, data wrangling, visualization, and storytelling.

**Tools:** 
Jupyter Notebook, pandas, NumPy, matplotlib, seaborn, and scikit-learn.

**Project Format:** 
Mini-projects, real-world datasets, code reviews

## Phase 1: Python Programming Foundations
***(Weeks 1–4) - Build foundational Python skills for data work.***


### Week 1: Python Basics & Environment Setup
**Topics:**
* Installing Python and Jupyter Notebooks
* Data types, variables, inputs, outputs
* Basic arithmetic and string formatting

**Mini Project:**
* BMI calculator 
* Name Formatter
* Currency Converter

**Project:**
* "Data Entry Sanitizer" – Clean and format text inputs like phone numbers or names.


# Week 1: Python Basics & Environment Setup

## ✅ Learning Objectives
By the end of this week, you will be able to:
- Set up Python and Jupyter notebooks on your system or use Google Colab
- Understand and use basic data types and variables
- Perform arithmetic and format strings
- Write small scripts that take user input and produce output

---

## 🛠️ Setting Up Your Environment

### Option A: Local Setup
- Install Python: [https://www.python.org/downloads/](https://www.python.org/downloads/)
- Install Jupyter via Anaconda: [https://www.anaconda.com/products/distribution](https://www.anaconda.com/products/distribution)

### Option B: Online (No Install Needed)
- Use Google Colab: [https://colab.research.google.com/](https://colab.research.google.com/)

---

## 🧠 Core Python Concepts

### 🟢 Data Types
| Type        | Example     | Description                  |
|-------------|-------------|------------------------------|
| `int`       | `10`        | Whole numbers                |
| `float`     | `3.14`      | Decimal numbers              |
| `str`       | `"hello"`   | Text                         |
| `bool`      | `True`      | Boolean values               |

### 🟢 Variables
```python
name = "Alice"
age = 25
pi = 3.1415
```


### Installing Python and Jupyter Notebooks

To begin Python programming, you need:

Python: Install from https://python.org

**JupyterLab**
```bash
# Install JupyterLab with pip
pip install jupyterlab

# Launch JupyterLab
jupyter lab
```
**Jupyter Notebook**
```bash
# Install the classic Jupyter Notebook
pip install notebook

# Run the notebook
jupyter notebook
```

## Markdown Basics

Markdown is a lightweight markup language that uses characters like # for headings and * for emphasis to format text simply and intuitively.

| Element        | Markdown Syntax                        |
|----------------|----------------------------------------|
| Heading        | `# H1`<br>`## H2`<br>`### H3`          |
| Bold           | `**bold text**`                        |
| Italic         | `*italicized text*`                    |
| Blockquote     | `> blockquote`                         |
| Ordered List   | `1. First item`<br>`2. Second item`<br>`3. Third item` |
| Unordered List | `- First item`<br>`- Second item`<br>`- Third item`   |
| Code           | `` `code` ``                           |
| Horizontal Rule| `---`                                  |
| Link           | `[title](https://www.example.com)`     |
| Image          | `![alt text](image.jpg)`               |

[More information on Markdown syntax](https://www.markdownguide.org/basic-syntax/)

# Data types, variables, inputs, outputs

### Common Data Types

```python
# Integer
age = 25

# Float
height = 5.9

# String
name = "Alice"

# Boolean
is_student = True
```


### Getting User Input & Displaying Output

```python
# Input (input() always returns a string)
user_name = input("Enter your name: ")

# Output
print("Hello,", user_name)
```


# Basic arithmetic and string formatting

### Arithmetic Operators

```python
a = 10
b = 3

print("Addition:", a + b)
print("Subtraction:", a - b)
print("Multiplication:", a * b)
print("Division:", a / b)       # true division, returns a float
print("Modulus:", a % b)        # remainder after division
print("Exponent:", a ** b)
```


### String Formatting

```python
name = "Alex"
age = 30

# Using f-strings
print(f"My name is {name} and I am {age} years old.")
```
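f-strings also accept format specifiers after a colon, which is how the mini projects later in this week round numbers. The sketch below uses made-up values (`price`, `share`, `item` are illustrative names, not part of the course material) to show three common specifiers.

```python
price = 1234.5
share = 0.256
item = "Laptop"

line_price = f"{price:,.2f}"   # thousands separator, two decimals
line_share = f"{share:.1%}"    # percentage with one decimal
line_item = f"{item:<10}|"     # left-aligned in a 10-character field

print(line_price)  # 1,234.50
print(line_share)  # 25.6%
print(line_item)   # Laptop    |
```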


# Mini Project:

### 1. BMI calculator 

```python
# BMI = weight (kg) / height (m)^2

weight = float(input("Enter your weight in kg: "))
height = float(input("Enter your height in meters: "))

bmi = weight / (height ** 2)

print(f"Your BMI is {bmi:.2f}")
```


### 2. Name Formatter

```python
print("Name Formatter")

first = input("Enter first name: ").strip()
last = input("Enter last name: ").strip()

formatted_name = f"{first.capitalize()} {last.upper()}"
print(f"Formatted Name: {formatted_name}")
```

Sample run:

```text
Name Formatter
Enter first name: drani
Enter last name: godfrey
Formatted Name: Drani GODFREY
```


### 3. Currency Converter

```python
usd = float(input("Enter amount in USD: "))
exchange_rate = 75  # Example rate: 1 USD = 75 INR

inr = usd * exchange_rate

print(f"{usd} USD = {inr:.2f} INR")
```


# Project:
"Data Entry Sanitizer": Clean and format text inputs like phone numbers or names.

**Goal**

Write a Python script that takes in raw text input (like names or phone numbers) and cleans/formats them for consistent data entry.

**🧩 Requirements**

* Accept a full name (e.g., " john DOE ") → Output: "John Doe"

* Accept a phone number (e.g., " (123) 456-7890 " or "1234567890") → Output: +1-123-456-7890

* Accept an email address and lowercase it


```python
def clean_name(name):
    # Split on any whitespace, then capitalize each part
    parts = name.strip().split()
    return " ".join([part.capitalize() for part in parts])

def format_phone(phone):
    # Keep digits only, then format 10-digit numbers
    digits = ''.join(filter(str.isdigit, phone))
    if len(digits) == 10:
        return f"+1-{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    return "Invalid number"

def clean_email(email):
    return email.strip().lower()

# User Input
raw_name = input("Enter full name: ")
raw_phone = input("Enter phone number: ")
raw_email = input("Enter email address: ")

# Cleaned Output
print("\n--- Cleaned Entry ---")
print(f"Name: {clean_name(raw_name)}")
print(f"Phone: {format_phone(raw_phone)}")
print(f"Email: {clean_email(raw_email)}")
```

Sample run:

```text
Enter full name: drani godfrey
Enter phone number: 3483943484
Enter email address: go@gmail.com

--- Cleaned Entry ---
Name: Drani Godfrey
Phone: +1-348-394-3484
Email: go@gmail.com
```


```python
# Names (remove extra spaces, capitalize properly)
def sanitize_name(name):
    return name.strip().title()

# Example
raw_name = input("Enter a name: ")
cleaned_name = sanitize_name(raw_name)
print(f"Sanitized Name: {cleaned_name}")
```

Sample run:

```text
Enter a name: drani Godfrey
Sanitized Name: Drani Godfrey
```


```python
# Phone numbers (remove dashes/spaces, add a country code if missing)
def sanitize_phone_number(phone):
    # Remove spaces, dashes, parentheses
    phone = phone.replace(" ", "").replace("-", "").replace("(", "").replace(")", "")

    # Check if starts with country code; if not, add one
    if len(phone) == 10:
        phone = "+256" + phone  # Example: +256 is Uganda's country code
    elif not phone.startswith("+"):
        phone = "+" + phone

    return phone

# Example
raw_phone = input("Enter phone number: ")
clean_phone = sanitize_phone_number(raw_phone)
print(f"Sanitized Phone Number: {clean_phone}")
```

Sample run (note that this version does not reject letters):

```text
Enter phone number: 242342432423r23434r234r234r324
Sanitized Phone Number: +242342432423r23434r234r234r324
```
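The sample run above lets letters pass straight through. One possible stricter variant is sketched below. The function name `sanitize_phone_strict` and the `country_code` parameter are hypothetical, and returning `None` for bad input is an assumption about how you might want to signal failure.

```python
def sanitize_phone_strict(phone, country_code="+256"):
    """Return a normalized number, or None if the input is not plausible."""
    cleaned = phone.strip()
    has_plus = cleaned.startswith("+")
    # Drop the separators people commonly type
    for ch in " -()":
        cleaned = cleaned.replace(ch, "")
    digits = cleaned[1:] if has_plus else cleaned
    # Reject anything that still contains non-digit characters
    if not digits.isdigit():
        return None
    if has_plus:
        return "+" + digits
    if len(digits) == 10:
        return country_code + digits
    return "+" + digits

print(sanitize_phone_strict(" (123) 456-7890 "))
print(sanitize_phone_strict("242342432423r23434r234r234r324"))
```

Returning `None` keeps the "invalid" signal separate from any real phone string, so callers can test it with `if result is None`.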


```python
name = input("Enter your name: ")
phone = input("Enter your phone number: ")

print("\nSanitized Outputs:")
print("Name:", sanitize_name(name))
print("Phone:", sanitize_phone_number(phone))
```


### Week 2: Control Flow & Logic in Data Contexts
**Topics:**
* Conditional logic: if, elif, else
* Loops: for, while, range()
* Boolean expressions and comparison operators
* Simple data validation (email, age, salary)

**Project:** 
* "Survey Quality Checker" Write a script to flag bad entries in a mock survey dataset.


# Week 2: Control Flow & Logic in Data Contexts

## ✅ Learning Objectives
By the end of this week, you will be able to:
- Use `if`, `elif`, and `else` for conditional logic
- Apply `for` and `while` loops to iterate over data
- Use comparison and boolean operators
- Validate simple inputs like email, age, and salary using logic

---

## 🧠 Control Flow Basics

### 🟩 `if`, `elif`, `else`
```python
age = 18

if age >= 18:
    print("You can vote.")
elif age == 17:
    print("Almost there!")
else:
    print("Too young to vote.")
```

### 🟩 Loops

### for Loop

```python
for i in range(5):
    print(i)  # prints 0 through 4
```

### while Loop

```python
count = 0
while count < 5:
    print(count)
    count += 1
```


### Boolean Operators

| Operator | Description      | Example            |
| -------- | ---------------- | ------------------ |
| `==`     | Equal to         | `a == b`           |
| `!=`     | Not equal to     | `a != b`           |
| `>`      | Greater than     | `a > b`            |
| `<`      | Less than        | `a < b`            |
| `>=`     | Greater or equal | `a >= b`           |
| `<=`     | Less or equal    | `a <= b`           |
| `and`    | Logical AND      | `a > 5 and b < 10` |
| `or`     | Logical OR       | `a > 5 or b < 10`  |
| `not`    | Logical NOT      | `not True`         |
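As a small illustration of how the operators in the table combine, the sketch below builds one eligibility flag from several checks. The variable names and thresholds are invented for the example.

```python
age = 34
salary = 52000
email = "sam@example.com"

is_adult = age >= 18
in_band = 30000 <= salary < 70000          # chained comparison
has_contact = "@" in email and not email.endswith("@")

eligible = is_adult and in_band and has_contact
print(eligible)  # True
```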


### Simple Data Validation

### Email Checker

```python
email = input("Enter email: ")

# A rough check only: real email validation is stricter
if "@" in email and "." in email:
    print("Looks like a valid email.")
else:
    print("Invalid email.")
```

### Age Validator

```python
age = int(input("Enter your age: "))

if age <= 0:
    print("Invalid age.")
elif age < 18:
    print("Underage")
else:
    print("Eligible")
```


### 💡 Real-World Relevance

Control flow is essential when filtering, validating, and processing data. It underpins data cleaning, logic-based rules, and flagging inconsistencies in datasets.


---

## 🧪 Practice Exercises

# 🧪 Week 2 – Practice Exercises: Control Flow & Logic

### 1. Check if number is even or odd

```python
num = int(input("Enter a number: "))
if num % 2 == 0:
    print("Even")
else:
    print("Odd")
```

### 2. Grade Checker

```python
score = int(input("Enter your score: "))
if score >= 90:
    print("Grade: A")
elif score >= 75:
    print("Grade: B")
elif score >= 60:
    print("Grade: C")
else:
    print("Grade: F")
```

### 3. Password Length Validator

```python
password = input("Enter password: ")
if len(password) < 8:
    print("Password too short")
else:
    print("Password accepted")
```

### 4. Loop through a list of numbers and print only positives

```python
nums = [10, -5, 0, 22, -3]
for n in nums:
    if n > 0:
        print(n)
```

### 5. Count how many even numbers in a list

```python
nums = [1, 2, 3, 4, 5, 6]
count = 0
for n in nums:
    if n % 2 == 0:
        count += 1
print(f"{count} even numbers")
```

### 6. Ask for input until user types "exit"

```python
while True:
    cmd = input("Type something (or 'exit'): ")
    if cmd == "exit":
        break
```

### 7. Salary filter

```python
salary = float(input("Enter salary: "))
if salary < 30000:
    print("Below threshold")
elif salary < 70000:
    print("Mid-range")
else:
    print("High salary")
```

### 8. Email domain extractor

```python
email = input("Enter email: ")
if "@" in email:
    domain = email.split("@")[1]
    print(f"Domain: {domain}")
else:
    print("Invalid email")
```

---

## 🧪 Main Project: **“Survey Quality Checker”**

### 🎯 Goal
Write a Python script that scans mock survey responses and **flags bad entries** based on:
- Invalid emails
- Age outside range (0–100)
- Salary too low or negative

---

### 📦 Sample Input (Mock Data)

```python
survey_responses = [
    {"email": "john@example.com", "age": 28, "salary": 45000},
    {"email": "noatsymbol.com", "age": 45, "salary": 65000},
    {"email": "sara@domain.com", "age": -2, "salary": 50000},
    {"email": "mike@web.com", "age": 33, "salary": -15000},
]
```


### ✅ Your Task
1. Loop through each dictionary in the list
2. Check for:

    * Valid email (@ and . present)

    * Age between 0 and 100

    * Salary >= 0
3. Print invalid entries with the reason

### 🧩 Starter Code

```python
def is_valid_email(email):
    return "@" in email and "." in email

def is_valid_age(age):
    return 0 <= age <= 100

def is_valid_salary(salary):
    return salary >= 0

# Data
survey_responses = [
    {"email": "john@example.com", "age": 28, "salary": 45000},
    {"email": "noatsymbol.com", "age": 45, "salary": 65000},
    {"email": "sara@domain.com", "age": -2, "salary": 50000},
    {"email": "mike@web.com", "age": 33, "salary": -15000},
]

# Check each entry and collect every reason it fails
for response in survey_responses:
    errors = []
    if not is_valid_email(response["email"]):
        errors.append("Invalid email")
    if not is_valid_age(response["age"]):
        errors.append("Invalid age")
    if not is_valid_salary(response["salary"]):
        errors.append("Invalid salary")

    if errors:
        print(f"Bad entry: {response}")
        for err in errors:
            print(f"  ⚠️ {err}")
```


### 🚀 Stretch Goals

* Collect valid entries into a clean list
* Save invalid entries to a text file
* Build a function `clean_survey_data()` to reuse this logic
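One way the `clean_survey_data()` stretch goal might look is sketched below. The return shape (a pair of lists, with each invalid entry carrying its reasons) and the use of `.get()` defaults for missing keys are assumptions, not requirements of the project.

```python
def clean_survey_data(responses):
    """Split responses into (valid, invalid); invalid items carry reasons."""
    valid, invalid = [], []
    for response in responses:
        errors = []
        email = response.get("email", "")
        if "@" not in email or "." not in email:
            errors.append("Invalid email")
        # A missing age or salary falls back to -1 and is flagged
        if not 0 <= response.get("age", -1) <= 100:
            errors.append("Invalid age")
        if response.get("salary", -1) < 0:
            errors.append("Invalid salary")
        if errors:
            invalid.append({"entry": response, "errors": errors})
        else:
            valid.append(response)
    return valid, invalid

sample = [
    {"email": "john@example.com", "age": 28, "salary": 45000},
    {"email": "noatsymbol.com", "age": 45, "salary": 65000},
    {"email": "sara@domain.com", "age": -2, "salary": 50000},
    {"email": "mike@web.com", "age": 33, "salary": -15000},
]
valid, invalid = clean_survey_data(sample)
print(len(valid), "valid,", len(invalid), "invalid")
```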

### 📦 Week 2 Deliverables

* Lesson notes: `if`, loops, logic

### Week 3: Functions, Reusability & Error Handling
**Key concepts:**

* Defining functions (def)
* Parameters, return values, default arguments
* Built-in vs user-defined functions
* Scope
* Lambda expressions for quick filtering
* Error handling: try/except, assert, basic logging

**Mini Project:**
* Reusable data cleaning and transformation functions

**Project:** 
* "Data Cleaning Toolbox" – Build reusable functions for tasks like removing whitespace, fixing dates, capitalizing names.


# Week 3: Functions, Reusability & Error Handling

## ✅ Learning Objectives
By the end of this week, you will be able to:
- Define and use custom functions
- Use parameters, return values, and default arguments
- Understand scope and lambda functions
- Handle runtime errors using try/except
- Use assertions and basic logging

---

## 🧠 Functions 101

### 🟢 Define a Function
```python
def greet(name):
    return f"Hello, {name}!"
```


### 🟢 Parameters and Return

```python
def add(a, b):
    return a + b

result = add(5, 3)  # 8
```

### 🟢 Default Arguments

```python
def greet(name="Guest"):
    return f"Hello, {name}"
```

### 🔁 Reusability Example

```python
def format_name(name):
    return name.strip().title()

def validate_age(age):
    return 0 <= age <= 120
```


### 🧠 Scope Basics

```python
x = 5

def print_number():
    x = 10      # creates a new local x; the global x is untouched
    print(x)    # local x: 10

print_number()
print(x)        # global x: 5
```


### Lambda Expressions

Used for quick anonymous functions:

```python
double = lambda x: x * 2
print(double(4))  # 8
```


### Common use with filter() or map():

```python
nums = [1, 2, 3, 4]
squared = list(map(lambda x: x**2, nums))
print(squared)  # [1, 4, 9, 16]
```
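The heading mentions `filter()` as well, so here is a short sketch of it next to `map()`, with the equivalent list comprehension for comparison. The sample list is made up.

```python
nums = [1, 2, 3, 4, 5, 6]

# filter() keeps the items for which the function returns True
evens = list(filter(lambda x: x % 2 == 0, nums))

# map() applies the function to every item
evens_squared = list(map(lambda x: x ** 2, evens))

# The same result as a single list comprehension
same_result = [x ** 2 for x in nums if x % 2 == 0]

print(evens)          # [2, 4, 6]
print(evens_squared)  # [4, 16, 36]
```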


### ❗ Error Handling
**🛡️ Try / Except**

```python
try:
    value = int(input("Enter a number: "))
except ValueError:
    print("That's not a number!")
```

### 🧪 Assert Statements

```python
def divide(a, b):
    assert b != 0, "Division by zero!"
    return a / b
```

### 📄 Basic Logging

```python
import logging
logging.basicConfig(level=logging.INFO)
logging.info("Process started")
```


### 💡 Real-World Relevance

Functions are the backbone of clean, reusable code. Error handling is vital when working with unpredictable data (e.g., user input, CSV files, APIs).


---

## 🧪 Practice Exercises

# 🧪 Week 3 – Practice Exercises: Functions & Error Handling

### 1. Function to square a number

```python
def square(n):
    return n ** 2
```

### 2. Greet with default

```python
def greet(name="friend"):
    return f"Hello, {name}"
```

### 3. Convert Celsius to Fahrenheit

```python
def to_fahrenheit(celsius):
    return (celsius * 9/5) + 32
```

### 4. Safe Division with Try/Except

```python
def safe_divide(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        return "Cannot divide by zero"
```

### 5. Capitalize all names in a list

```python
names = ["alice", "BOB", "CharLie"]

def format_names(name_list):
    return [name.title() for name in name_list]
```

### 6. Validate email with assert

```python
def check_email(email):
    assert "@" in email and "." in email, "Invalid email"
```

### 7. Use lambda to triple a number

```python
triple = lambda x: x * 3
```

### 8. Create logger that prints success message

```python
import logging
logging.basicConfig(level=logging.INFO)

def run_task():
    logging.info("Task completed!")
```



---

## 🧪 Mini Project: **Reusable Data Cleaning Functions**

### 🎯 Goal
Create small, reusable functions that clean common types of messy data.

---

### ✅ Tasks
Write and test these functions:

- `clean_whitespace(text)`: removes leading/trailing spaces
- `fix_case(name)`: capitalizes names
- `remove_special_chars(text)`: removes symbols like `@#!`
- `validate_email(email)`: returns True/False

---

### 🧩 Sample Code

```python
import re

def clean_whitespace(text):
    return text.strip()

def fix_case(name):
    return name.title()

def remove_special_chars(text):
    # Keep letters, digits, and whitespace only
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

def validate_email(email):
    return "@" in email and "." in email

# Demo
dirty_name = "  jOhN doE@@ "
clean_name = fix_case(remove_special_chars(clean_whitespace(dirty_name)))

print("Cleaned Name:", clean_name)  # Cleaned Name: John Doe
```


### 🛠️ Main Project: "Data Cleaning Toolbox"
**🎯 Goal**

Build a mini Python library of functions for cleaning user input and textual data.

### ✅ Requirements

1. Create at least 4 functions, e.g.:
    * clean_name()
    * validate_email()
    * clean_date_string()
    * fix_phone_format()
2. Accept raw data from user or dictionary
3. Return a cleaned version or report errors
4. Use try/except where needed
5. Use assert in at least one function

```python
def clean_name(name):
    assert isinstance(name, str), "Name must be a string"
    return name.strip().title()

def validate_email(email):
    return "@" in email and "." in email

def clean_date_string(date_str):
    # Assumes DD/MM/YYYY input and returns YYYY-MM-DD
    try:
        parts = date_str.split("/")
        if len(parts) == 3:
            return f"{parts[2]}-{parts[1].zfill(2)}-{parts[0].zfill(2)}"
        return "Invalid format"
    except Exception:
        return "Error parsing date"

def fix_phone_format(phone):
    digits = ''.join(filter(str.isdigit, phone))
    if len(digits) == 10:
        return f"+1-{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    return "Invalid phone"

# Test
raw = {
    "name": "  aLiCe JoNEs ",
    "email": "ALICE@Example.COM",
    "dob": "12/5/1990",
    "phone": " (123) 456-7890 "
}

cleaned = {
    "name": clean_name(raw["name"]),
    "email": validate_email(raw["email"]),  # True/False, not a cleaned address
    "dob": clean_date_string(raw["dob"]),
    "phone": fix_phone_format(raw["phone"])
}

print(cleaned)
```


### 🚀 Stretch Goals

* Wrap your functions into a Python module
* Handle missing keys with .get()
* Add logging to track processing
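Two of these stretch goals, `.get()` for missing keys and logging, can be combined in a few lines. The sketch below is one possible shape. The function name `clean_record` and the logger name `"toolbox"` are hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("toolbox")  # hypothetical logger name

def clean_record(raw):
    """Clean a raw record, tolerating missing keys."""
    # .get() returns the default instead of raising KeyError
    name = raw.get("name", "")
    email = raw.get("email", "")
    if not name.strip():
        logger.warning("Record has no name: %r", raw)
    cleaned = {"name": name.strip().title(), "email": email.strip().lower()}
    logger.info("Cleaned record for %r", cleaned["name"])
    return cleaned

result = clean_record({"name": "  aLiCe JoNEs "})
print(result)  # {'name': 'Alice Jones', 'email': ''}
```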

### Week 4: Lists, Dictionaries & Working with Nested Data
**Topics:**
* Data structures: list, dict, tuple, set
* Indexing, slicing, updating
* Nested structures and Iteration
* List comprehensions (intro)
  
**Mini Project:**
* Frequency counters
* Filtering and grouping

**Project:** 
* "Mini CRM System" – Store, search, and update mock customer records using nested dictionaries/lists.


# Week 4: Lists, Dictionaries & Nested Data Structures

## ✅ Learning Objectives
By the end of this week, you will be able to:
- Use Python's core data structures: `list`, `dict`, `tuple`, and `set`
- Index, slice, update, and loop over these structures
- Work with nested dictionaries/lists
- Use list comprehensions for quick transformations

---

## 🔢 Lists

### Create and Access
```python
fruits = ["apple", "banana", "cherry"]
print(fruits[1])  # "banana"
```


### Add, Remove, Slice

```python
fruits.append("orange")
fruits.remove("banana")
print(fruits[1:3])  # slice: ['cherry', 'orange']
```

### 🔑 Dictionaries

**Create and Access**

```python
person = {"name": "Alice", "age": 30}
print(person["name"])  # Alice
```

### Add, Update, Delete

```python
person["email"] = "alice@example.com"
person["age"] = 31
del person["email"]
```

### 🧶 Nested Structures

```python
customers = [
    {"name": "Alice", "email": "alice@email.com"},
    {"name": "Bob", "email": "bob@email.com"}
]

for c in customers:
    print(c["name"], c["email"])
```

### 🧠 Tuples & Sets

**Tuple: Immutable**

```python
t = (1, 2, 3)
```

**Set: Unique values**

```python
s = {1, 2, 2, 3}
print(s)  # {1, 2, 3}
```

### ⚡ List Comprehensions

```python
numbers = [1, 2, 3, 4]
squared = [x**2 for x in numbers]
```


### 💡 Real-World Relevance

Nested data is common in real-world datasets (e.g., JSON files, APIs, CRM systems). You'll use these patterns constantly in data cleaning, parsing, and storage tasks.
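To make that concrete, the sketch below walks a small JSON-like record. The `order` structure and its field names are invented for illustration. It shows chained indexing, a loop over a nested list, and `.get()` for a key that may be absent.

```python
order = {
    "id": 101,
    "customer": {"name": "Alice", "email": "alice@email.com"},
    "items": [
        {"product": "Phone", "qty": 2, "price": 699.99},
        {"product": "Case", "qty": 1, "price": 19.99},
    ],
}

# Chained indexing reaches into nested dictionaries
customer_name = order["customer"]["name"]

# Iterate over a list of dictionaries and aggregate
order_total = sum(item["qty"] * item["price"] for item in order["items"])

# .get() avoids a KeyError when a field is missing
city = order["customer"].get("city", "unknown")

print(customer_name, round(order_total, 2), city)
```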


---

## 🧪 Practice Exercises

# 🧪 Week 4 – Practice Exercises: Lists & Dictionaries

### 1. Create a list of 5 cities and print each

```python
cities = ["New York", "Paris", "Tokyo", "Delhi", "Berlin"]
for city in cities:
    print(city)
```

### 2. Get the last 3 elements from a list

```python
nums = [10, 20, 30, 40, 50]
print(nums[-3:])
```

### 3. Create a dictionary of a product

```python
product = {"name": "Laptop", "price": 999.99, "stock": 20}
print(product["price"])
```

### 4. Loop through dictionary keys and values

```python
for key, value in product.items():
    print(f"{key}: {value}")
```

### 5. Count how many times a word appears in a list

```python
words = ["yes", "no", "yes", "maybe", "yes"]
print(words.count("yes"))
```

### 6. Use list comprehension to filter values > 10

```python
nums = [3, 12, 7, 25, 5]
filtered = [x for x in nums if x > 10]
```

### 7. Combine two dictionaries

```python
a = {"x": 1}
b = {"y": 2}
merged = {**a, **b}
```

### 8. Create and access nested dictionary

```python
users = {
    "user1": {"name": "Alice", "email": "alice@email.com"},
    "user2": {"name": "Bob", "email": "bob@email.com"}
}
print(users["user2"]["email"])
```

### 9. Add a new customer to a list of dicts

```python
customers = []
new_customer = {"name": "Charlie", "email": "charlie@email.com"}
customers.append(new_customer)
```

### 10. Create a frequency counter (manual)

```python
nums = [1, 2, 2, 3, 3, 3]
counter = {}
for n in nums:
    counter[n] = counter.get(n, 0) + 1
print(counter)
```



---

## 🧪 Mini Projects

### 🟢 Frequency Counter
```python
text = "banana"
counter = {}

for char in text:
    counter[char] = counter.get(char, 0) + 1

print(counter)
# Output: {'b': 1, 'a': 3, 'n': 2}
```


### 🟢 Filter & Group Names by First Letter

```python
names = ["Alice", "Amanda", "Bob", "Brian", "Cathy"]
grouped = {}

for name in names:
    key = name[0].upper()
    grouped.setdefault(key, []).append(name)

print(grouped)
# Output: {'A': ['Alice', 'Amanda'], 'B': ['Bob', 'Brian'], 'C': ['Cathy']}
```
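Both mini projects can also be written with the standard library's `collections` module. The sketch below is an alternative to the manual dictionaries above, not a replacement for practicing them by hand.

```python
from collections import Counter, defaultdict

# Counter does the frequency counting in one step
letter_counts = Counter("banana")

# defaultdict(list) creates the empty list for each new key automatically
grouped = defaultdict(list)
for name in ["Alice", "Amanda", "Bob", "Brian", "Cathy"]:
    grouped[name[0].upper()].append(name)

print(letter_counts.most_common(1))  # [('a', 3)]
print(dict(grouped))
```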


### 🛠️ Main Project: "Mini CRM System"
**🎯 Goal**

Build a simple command-line customer tracking system using nested dictionaries/lists.

### Features to Implement
* Add a new customer (name, email, phone)
* Search customer by name
* Update customer email or phone
* Delete a customer by name
* List all customers

### 🧩 Starter Code

```python
crm = []

def add_customer(name, email, phone):
    crm.append({"name": name, "email": email, "phone": phone})

def find_customer(name):
    # Case-insensitive match on the name
    for c in crm:
        if c["name"].lower() == name.lower():
            return c
    return None

def update_customer(name, email=None, phone=None):
    c = find_customer(name)
    if c:
        if email:
            c["email"] = email
        if phone:
            c["phone"] = phone
        return True
    return False

def delete_customer(name):
    global crm
    crm = [c for c in crm if c["name"].lower() != name.lower()]

def list_customers():
    for c in crm:
        print(f"{c['name']} - {c['email']} - {c['phone']}")

# Example Usage
add_customer("Alice", "alice@email.com", "1234567890")
add_customer("Bob", "bob@email.com", "9876543210")
list_customers()
```


### 🚀 Stretch Goals

* Add a command-line menu to interact with the system
* Save/load data from a .json file
* Validate email and phone formats
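For the save/load stretch goal, a minimal sketch with the `json` module might look like the following. The helper names `save_crm` and `load_crm` are hypothetical, and the demo writes to the system temp directory rather than a project file.

```python
import json
import os
import tempfile

def save_crm(customers, path):
    """Write the customer list to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(customers, f, indent=2)

def load_crm(path):
    """Read the customer list back; start empty if the file is missing."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return []

demo_path = os.path.join(tempfile.gettempdir(), "crm_demo.json")
records = [{"name": "Alice", "email": "alice@email.com", "phone": "1234567890"}]

save_crm(records, demo_path)
restored = load_crm(demo_path)
print(restored == records)  # True
```

Returning an empty list when the file does not exist lets the program start cleanly on its first run.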

## Phase 2: Data Analytics With Python
**(Weeks 5–10) - Hands-on with Real-World Data Workflows**

### Week 5: File I/O and Data Ingestion
**Topics:**
* Reading/writing: .txt, .csv using open(), csv module
* Working with file paths
* JSON parsing with json module
* Basic exception handling for file errors

**Project:** 
"Sales Data Summary" - Load CSV and calculate totals, averages, handle missing/error rows.


# Week 5: File I/O and Data Ingestion

## ✅ Learning Objectives
By the end of this week, you will be able to:
- Read from and write to `.txt`, `.csv`, and `.json` files
- Use the built-in `open()`, `csv`, and `json` modules
- Handle file paths properly
- Apply basic exception handling for file errors


### 📂 Reading & Writing Text Files

```python
# Write
with open("notes.txt", "w") as f:
    f.write("This is a line of text.")

# Read
with open("notes.txt", "r") as f:
    content = f.read()
    print(content)
```

### 📊 Working with CSV Files

```python
import csv

# Writing CSV
with open("data.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["Name", "Age"])
    writer.writerow(["Alice", 30])

# Reading CSV (every value comes back as a string)
with open("data.csv", "r", newline="") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
```

### 📦 Reading JSON Data

```python
import json

# Save JSON
data = {"name": "Alice", "age": 30}
with open("data.json", "w") as f:
    json.dump(data, f)

# Load JSON
with open("data.json", "r") as f:
    loaded = json.load(f)
    print(loaded["name"])
```


### 🔁 Handling File Paths

```python
import os

folder = "data"
filename = "example.txt"
path = os.path.join(folder, filename)
print(path)  # 'data/example.txt' on macOS/Linux, 'data\example.txt' on Windows
```
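The standard library's `pathlib` module offers an object-based alternative to `os.path.join`. The sketch below builds the same path and reads a few of its parts. It only inspects the path and does not create any files.

```python
from pathlib import Path

# The / operator joins path segments in an OS-independent way
path = Path("data") / "example.txt"

print(path.name)      # example.txt
print(path.suffix)    # .txt
print(path.parent)    # data
print(path.exists())  # whether the file is present on disk
```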


### ⚠️ File Error Handling

```python
try:
    with open("missing.txt", "r") as f:
        content = f.read()
except FileNotFoundError:
    print("File not found!")
```


### 💡 Real-World Relevance

Reading data from files is the first step in nearly every data project. You'll handle messy CSV exports, logs, and JSON from APIs all the time.

### 🧪 Practice Exercises

# 🧪 Week 5 – Practice Exercises: File I/O

### 1. Write and read a short note to a text file

```python
with open("note.txt", "w") as f:
    f.write("Don't forget to check the data.")

with open("note.txt", "r") as f:
    print(f.read())
```

### 2. Append to a file

```python
with open("log.txt", "a") as f:
    f.write("New log entry\n")
```

### 3. Count number of lines in a text file

```python
with open("note.txt") as f:
    lines = f.readlines()
    print(len(lines))
```

### 4. Write a list of names to a CSV file

```python
import csv

names = [["Name"], ["Alice"], ["Bob"]]
with open("names.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(names)
```

### 5. Read and print names from CSV

```python
with open("names.csv", "r", newline="") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row[0])
```

### 6. Save a dictionary as JSON

```python
import json

profile = {"user": "john", "age": 29}
with open("profile.json", "w") as f:
    json.dump(profile, f)
```

### 7. Load and print data from JSON

```python
with open("profile.json") as f:
    data = json.load(f)
    print(data["user"])
```

### 8. Gracefully handle missing file

```python
try:
    with open("ghost.txt") as f:
        print(f.read())
except FileNotFoundError:
    print("File doesn't exist.")
```

### 9. Create a file path from folder + file

```python
import os

path = os.path.join("data", "sales.csv")
```



---

## 🛠️ Main Project: **"Sales Data Summary"**

### 🎯 Goal
Load a CSV file of sales transactions and generate a basic report including:
- Total transactions
- Total revenue
- Average sale
- Flag rows with missing or invalid data

---

### 📦 Example CSV: `sales.csv`



```csv
Date,Product,Quantity,Price
2023-01-01,Phone,2,699.99
2023-01-02,Laptop,1,1099.99
2023-01-03,Tablet,,399.99
2023-01-04,Monitor,1,abc
```


---

### ✅ Your Tasks

- Read the CSV file
- Skip header
- Validate that `Quantity` and `Price` are numeric
- Calculate:
  - `total sales` = sum of `Quantity * Price`
  - `average sale`
- Print invalid rows with reasons

---

### 🧩 Starter Code

```python
import csv

total_sales = 0
valid_rows = 0
invalid_rows = []

with open("sales.csv", "r", newline="") as f:
    reader = csv.DictReader(f)  # uses the header row as dictionary keys
    for row in reader:
        try:
            qty = int(row["Quantity"])
            price = float(row["Price"])
            total_sales += qty * price
            valid_rows += 1
        except (ValueError, TypeError):
            # Missing or non-numeric Quantity/Price
            invalid_rows.append(row)

print(f"Total Valid Transactions: {valid_rows}")
print(f"Total Revenue: ${total_sales:.2f}")
if valid_rows:  # avoid dividing by zero when no row is valid
    print(f"Average Sale: ${total_sales / valid_rows:.2f}")
print("\nInvalid Rows:")
for row in invalid_rows:
    print(row)
```


Expected output:

```text
Total Valid Transactions: 2
Total Revenue: $2499.97
Average Sale: $1249.99

Invalid Rows:
{'Date': '2023-01-03', 'Product': 'Tablet', 'Quantity': '', 'Price': '399.99'}
{'Date': '2023-01-04', 'Product': 'Monitor', 'Quantity': '1', 'Price': 'abc'}
```


### 🚀 Stretch Goals

* Add a date range filter
* Write the summary to a text or JSON report
* Save invalid rows to a separate CSV
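The task list asks for invalid rows "with reasons", and the last stretch goal asks for a separate CSV. The sketch below combines both under some assumptions: the helper `check_row`, the `Reason` column, and the in-memory `io.StringIO` buffers (used here in place of real files) are all illustrative choices.

```python
import csv
import io

def check_row(row):
    """Return a list of reasons why a row is invalid (empty if it is fine)."""
    reasons = []
    try:
        int(row["Quantity"])
    except (ValueError, TypeError):
        reasons.append("Quantity is not a whole number")
    try:
        float(row["Price"])
    except (ValueError, TypeError):
        reasons.append("Price is not numeric")
    return reasons

sample_csv = """Date,Product,Quantity,Price
2023-01-01,Phone,2,699.99
2023-01-03,Tablet,,399.99
2023-01-04,Monitor,1,abc
"""

invalid = []
for row in csv.DictReader(io.StringIO(sample_csv)):
    reasons = check_row(row)
    if reasons:
        invalid.append({**row, "Reason": "; ".join(reasons)})

# Write the flagged rows, with an extra Reason column
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Date", "Product", "Quantity", "Price", "Reason"])
writer.writeheader()
writer.writerows(invalid)
report = out.getvalue()
print(report)
```

To write to disk instead, replace `out` with a file opened via `open("invalid_rows.csv", "w", newline="")`.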

### Week 6: NumPy for Efficient Numerical Computing
**Topics:**
* Why NumPy: performance vs. lists
* Arrays, slicing, reshaping
* Element-wise operations and broadcasting
* Aggregation: sum(), mean(), std()
* Boolean indexing for filtering

**Project:** 
* Analyze a small dataset using NumPy arrays for transformation and filtering


# Week 6: NumPy for Efficient Numerical Computing

## ✅ Learning Objectives
By the end of this week, you will be able to:
- Understand why NumPy is used in data analysis
- Create and manipulate NumPy arrays
- Use slicing, reshaping, and broadcasting
- Perform efficient numerical operations and filtering
- Apply aggregation functions for insights

---

## 🔍 Why NumPy?

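- Lists are flexible but **slow** for numerical tasks, because each element is a separate Python object
- NumPy stores values of a single type in **contiguous memory** and runs its loops in compiled C code, which makes bulk numerical operations much faster
- Base for **pandas**, **scikit-learn**, **TensorFlow**, etc.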

---

## 📦 Creating Arrays

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([[1, 2], [3, 4]])
```


### 📐 Array Shapes & Types

```python
print(a.shape)       # (3,)
print(b.shape)       # (2, 2)
print(a.dtype)       # int64 on most platforms (the default integer type is platform-dependent)
```

### 🔁 Reshaping Arrays

```python
x = np.array([1, 2, 3, 4, 5, 6])
x = x.reshape((2, 3))  # 2 rows, 3 columns
```

### 🔪 Slicing & Indexing

```python
a = np.array([[10, 20], [30, 40]])
print(a[0, 1])     # 20
print(a[:, 1])     # column 1: [20 40]
```


### 🧮 Vectorized Operations

```python
x = np.array([1, 2, 3])
y = np.array([10, 20, 30])
print(x + y)       # [11 22 33]
print(x * 2)       # [2 4 6]
```
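This week's topics also list broadcasting, where NumPy stretches a smaller array across a larger one when their shapes are compatible. The sketch below uses made-up `prices` and `discount` arrays to show a row-wise and a column-wise case.

```python
import numpy as np

prices = np.array([[10.0, 20.0, 30.0],
                   [40.0, 50.0, 60.0]])    # shape (2, 3)
discount = np.array([0.9, 0.8, 0.5])       # shape (3,)

# (2, 3) * (3,): the discount row is applied to every row of prices
discounted = prices * discount

# (2, 3) + (2, 1): each row gets its own offset
row_offset = np.array([[1.0], [2.0]])
shifted = prices + row_offset

print(discounted)
print(shifted)
```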


### 📊 Aggregation

```python
a = np.array([[1, 2], [3, 4]])
print(np.sum(a))         # 10
print(np.mean(a))        # 2.5
print(np.std(a))         # ~1.118 (population standard deviation)
```

### 🎯 Boolean Indexing

```python
x = np.array([5, 10, 15, 20])
mask = x > 10
print(x[mask])           # [15 20]
```


### 💡 Real-World Relevance

NumPy makes bulk computation fast and memory-efficient, which matters for large datasets, feature engineering, matrix math (ML), and statistical analysis.


---

## 🧪 Practice Exercises

# 🧪 Week 6 – Practice Exercises: NumPy Basics

### 1. Import numpy and create 1D array

```python
import numpy as np
a = np.array([1, 2, 3, 4])
```

### 2. Create a 3x2 array of zeros

```python
z = np.zeros((3, 2))
```

### 3. Multiply all elements by 10

```python
a = np.array([1, 2, 3])
print(a * 10)
```

### 4. Reshape a flat array into 2x3

```python
x = np.array([1, 2, 3, 4, 5, 6])
x = x.reshape((2, 3))
```

### 5. Get mean and std of array

```python
x = np.array([10, 20, 30])
print(np.mean(x), np.std(x))
```

### 6. Slice a 2D array to get column 2

```python
a = np.array([[1, 2], [3, 4], [5, 6]])
print(a[:, 1])
```

### 7. Filter values greater than 50

```python
data = np.array([45, 55, 60, 40])
print(data[data > 50])
```

### 8. Create a range of even numbers

```python
evens = np.arange(0, 20, 2)
```

### 9. Check shape and type

```python
x = np.array([[1, 2], [3, 4]])
print(x.shape, x.dtype)
```


### 10. Add two arrays
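
The source leaves this exercise without a solution. One possible answer, assuming two same-shaped arrays with made-up values, is:

```python
import numpy as np

# Two hypothetical arrays of the same shape
x = np.array([1, 2, 3])
y = np.array([10, 20, 30])

# Element-wise addition: each position is added to its counterpart
total = x + y
print(total)  # [11 22 33]
```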


---

## 🛠️ Project: **Analyze a Small Dataset with NumPy**

### 🎯 Goal

Use NumPy to:
- Load numeric dataset (e.g., grades, sales, scores)
- Perform analysis: total, mean, std, filtering
- Print summary insights

---

### 🧩 Dataset: `scores.csv`



```csv
Name,Math,Science,English
Alice,85,90,88
Bob,70,75,72
Charlie,95,98,94
```


---

### ✅ Tasks

- Load CSV (skip headers, use `np.loadtxt()` or `csv`)
- Extract numerical values
- Calculate:
  - Mean score per student
  - Mean score per subject
  - Std deviation overall
- Filter students with any score < 75

---

### 🧩 Example Code

```python
import numpy as np
import csv

# Load manually to skip names
scores = []

with open("scores.csv", "r") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    for row in reader:
        scores.append([int(row[1]), int(row[2]), int(row[3])])

arr = np.array(scores)

# Per-student average
student_means = np.mean(arr, axis=1)
print("Student Averages:", student_means)

# Per-subject average
subject_means = np.mean(arr, axis=0)
print("Subject Averages:", subject_means)

# Overall stats
print("Overall Std Dev:", np.std(arr))

# Filter students with low scores
print("Any score < 75:")
print(arr[np.any(arr < 75, axis=1)])
```


Expected output (values rounded to two decimals; NumPy prints more digits):

```text
Student Averages: [87.67 72.33 95.67]
Subject Averages: [83.33 87.67 84.67]
Overall Std Dev: 9.87
Any score < 75:
[[70 75 72]]
```


### 🚀 Stretch Goals

- Normalize all scores (e.g., to a 0–1 scale)
- Add names and link stats back to each student
- Plot histograms using matplotlib (covered in Week 11)
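
As a starting point for the first two stretch goals, the sketch below uses min-max scaling and a parallel array of names. The values are typed in from `scores.csv` so it runs on its own; the variable names are illustrative, not part of the project spec.

```python
import numpy as np

# Same values as scores.csv, typed in directly so the sketch runs on its own
names = np.array(["Alice", "Bob", "Charlie"])
arr = np.array([[85, 90, 88],
                [70, 75, 72],
                [95, 98, 94]])

# Min-max normalization: the lowest score maps to 0, the highest to 1
normalized = (arr - arr.min()) / (arr.max() - arr.min())

# Link per-student means back to names
student_means = arr.mean(axis=1)
for name, mean in zip(names, student_means):
    print(f"{name}: {mean:.2f}")

# Names of students with any score below 75
low_mask = np.any(arr < 75, axis=1)
print(names[low_mask])  # ['Bob']
```

Keeping names in a separate array works because row `i` of `arr` and element `i` of `names` describe the same student.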

### Week 7: Pandas Fundamentals
**Topics:**
* DataFrame vs Series
* Loading CSV/Excel data
* Exploring datasets: .head(), .info(), .describe()
* Selecting/filtering with .loc[], .iloc[]

**Mini Project:**
* Filter high-value transactions or recent records

**Project:** 
* “HR Analytics Explorer” – Analyze headcount, attrition rate, salary bands


### 📘 Lesson Notes

# Week 7: Pandas Fundamentals

## ✅ Learning Objectives
By the end of this week, you will be able to:
- Load CSV/Excel files into Pandas DataFrames
- Understand the difference between Series and DataFrames
- View and explore datasets (.head(), .info(), .describe())
- Select rows/columns with .loc[], .iloc[], and filters
- Apply basic filters to isolate key records

---

## 📦 What is Pandas?

- High-level data analysis library
- Built on top of NumPy
- Supports table-like data (like Excel or SQL)
- Core object: `DataFrame`

---

## 🧱 DataFrame vs Series

```python
import pandas as pd

# Series: 1D labeled array
s = pd.Series([1, 2, 3])

# DataFrame: 2D table
df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30]
})
```

### 📂 Loading Data

```python
df = pd.read_csv("sales.csv")
# Or, for Excel files (.xlsx needs the openpyxl package):
df = pd.read_excel("data.xlsx")
```

### 🔍 Exploring Data

```python
df.head()       # first 5 rows
df.info()       # column types & non-null counts
df.describe()   # basic stats for numeric columns
```

### 🧭 Selecting Data

```python
# Columns
df["Age"]             # one column -> Series
df[["Name", "Age"]]   # list of columns -> DataFrame

# Rows by index
df.loc[0]           # by label
df.iloc[0]          # by position

# Filter rows
df[df["Age"] > 30]
```
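
The difference between `.loc[]` and `.iloc[]` is hard to see with the default 0, 1, 2 index, where labels and positions coincide. The sketch below uses a hypothetical DataFrame indexed by made-up employee IDs to separate the two.

```python
import pandas as pd

# Hypothetical data with a non-default index (employee IDs)
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Cara"], "Age": [25, 30, 41]},
    index=[101, 102, 103],
)

# .loc uses index labels; .iloc uses integer positions
print(df.loc[101, "Name"])   # Alice
print(df.iloc[0]["Name"])    # Alice (first row by position)

# .loc also accepts a boolean filter plus a column selection
over_28 = df.loc[df["Age"] > 28, ["Name", "Age"]]
print(over_28)
```

Here `df.loc[0]` would raise a `KeyError`, because no row carries the label `0`.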


### 🔁 Common Operations

```python
df["Salary"].mean()                          # average of one column
df["City"].value_counts()                    # rows per distinct value
df.sort_values("Salary", ascending=False)    # highest salary first
```


### 💡 Real-World Relevance

Pandas is your go-to tool for 80% of data wrangling tasks — Excel replacement, data cleaning, quick analysis, and prep for ML.


---

## 🧪 Practice Exercises

**🧪 Week 7 – Practice Exercises: Pandas Basics**

### 1. Import pandas and read a CSV

```python
import pandas as pd
df = pd.read_csv("employees.csv")
```

### 2. View top and bottom rows

```python
print(df.head())
print(df.tail())
```

### 3. Check number of rows and columns

```python
print(df.shape)
```

### 4. Get column names

```python
print(df.columns)
```

### 5. Summary statistics

```python
print(df.describe())
```

### 6. Get all rows where salary > 50,000

```python
df[df["Salary"] > 50000]
```

### 7. Select only Name and Department columns

```python
df[["Name", "Department"]]
```

### 8. Sort by salary descending

```python
df.sort_values("Salary", ascending=False)
```

### 9. Count unique departments

```python
df["Department"].nunique()        # number of distinct departments
df["Department"].value_counts()   # rows per department
```

### 10. Get row by index

```python
df.iloc[2]  # 3rd row (positions start at 0)
```



---

## 🛠️ Project: **HR Analytics Explorer**

### 🎯 Goal

Analyze a company’s employee dataset to answer key HR questions:
- What’s the average salary by department?
- Which departments have the most headcount?
- What's the attrition rate?

---

### 🧩 Sample Dataset: `hr_data.csv`



```csv
EmployeeID,Name,Department,Salary,Status
1,Alice,Sales,70000,Active
2,Bob,Engineering,90000,Active
3,Charlie,Sales,65000,Left
4,Dana,Engineering,85000,Active
5,Eli,HR,60000,Left
6,Faith,Sales,72000,Active
```


---

### ✅ Tasks

1. Load CSV into DataFrame
2. Use `.groupby()` to get:
   - Avg salary by department
   - Headcount by department
   - Count of "Left" vs "Active"
3. Use `.loc[]` to filter all employees who left
4. Sort by salary
5. Plot value counts (optional; plotting is covered in Week 11)

---

### 🧩 Starter Code

```python
import pandas as pd

df = pd.read_csv("hr_data.csv")

# Average salary per department
avg_salary = df.groupby("Department")["Salary"].mean()
print("Average Salary:\n", avg_salary)

# Headcount by department
headcount = df["Department"].value_counts()
print("Headcount:\n", headcount)

# Attrition counts
status_counts = df["Status"].value_counts()
print("Status Counts:\n", status_counts)

# List employees who left
left = df[df["Status"] == "Left"]
print("Employees who left:\n", left[["Name", "Department"]])
```


Abridged output:

```text
Average Salary:
Department
Engineering    87500.0
HR             60000.0
Sales          69000.0

Headcount:
Sales          3
Engineering    2
HR             1

Status Counts:
Active    4
Left      2
```

### 🚀 Stretch Goals

- Create a new column: “Seniority Level” based on salary
- Save summaries to a new Excel or CSV file
- Combine this with data from another department next week

### Week 8: Data Cleaning with Pandas
**Topics:**
* Handling nulls: isna(), fillna()
* String operations: .str.lower(), .str.strip(), .str.replace()
* Date parsing and type conversion
* Renaming columns, dropping duplicates

**Project:** 
* “CRM Export Cleaner” – Clean a messy dataset for sales or marketing team


# Week 8: Data Cleaning with Pandas

## ✅ Learning Objectives
By the end of this week, you will be able to:
- Identify and handle missing or duplicate data
- Clean and standardize string values
- Convert columns to proper data types (e.g., dates, numbers)
- Rename columns, drop irrelevant data
- Prepare raw exports for analysis or business use

---

## ❓ Why Cleaning Matters

- Raw data is messy: typos, empty cells, bad formats
- Cleaning is often estimated to take 60–80% of real-world data work
- Clean data is a precondition for accurate insights

---

## 🧼 Handling Missing Values

```python
df.isna().sum()               # Count nulls per column
df.dropna()                   # Return a copy without rows that contain nulls
df.fillna(0)                  # Return a copy with nulls replaced by 0
df["col"].fillna("Unknown")   # Fill one column (assign the result back to keep it)
```
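
A common beginner surprise is that these methods return a new object instead of changing `df` in place. The sketch below demonstrates this on a small, made-up DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None],
    "Salary": [50000, np.nan, 62000],
})

print(df.isna().sum())            # nulls per column: Name 1, Salary 1

# fillna() returns a new object; df is unchanged until you assign it back
df["Salary"].fillna(0)
print(df["Salary"].isna().sum())  # 1 (still missing)

df["Salary"] = df["Salary"].fillna(0)
print(df["Salary"].isna().sum())  # 0
```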


### 🔠 String Cleaning (with .str)

```python
df["Name"] = df["Name"].str.strip()        # Remove leading/trailing spaces
df["Name"] = df["Name"].str.title()        # Title-case each word
df["Email"] = df["Email"].str.lower()      # Lowercase
df["Phone"] = df["Phone"].str.replace("-", "", regex=False)  # Remove dashes
```

### 🧮 Data Type Conversion

```python
# errors="coerce" turns unparseable values into NaN / NaT instead of raising
df["Salary"] = pd.to_numeric(df["Salary"], errors="coerce")
df["JoinDate"] = pd.to_datetime(df["JoinDate"], errors="coerce")
```

### 🏷️ Renaming & Dropping

```python
df = df.rename(columns={"emp_id": "EmployeeID"})
df = df.drop(columns=["TempNotes"])
```

### 🧯 Dealing with Duplicates

```python
df.duplicated().sum()       # number of fully duplicated rows
df = df.drop_duplicates()   # keep the first occurrence of each
```

### 🧠 Tip: Chain Methods

```python
df["Name"] = df["Name"].str.strip().str.title()
```


### 💡 Real-World Relevance

You'll clean CRM exports, HR spreadsheets, survey data, and scraped data all the time. It’s not glamorous — but it’s mission critical.


---

## 🧪 Practice Exercises

**🧪 Week 8 – Practice Exercises: Cleaning with Pandas**

### 1. Fill null salaries with 0

```python
df["Salary"] = df["Salary"].fillna(0)
```

### 2. Drop all rows with any nulls

```python
df = df.dropna()
```

### 3. Strip spaces from name field

```python
df["Name"] = df["Name"].str.strip()
```

### 4. Convert JoinDate to datetime

```python
df["JoinDate"] = pd.to_datetime(df["JoinDate"])
```

### 5. Capitalize names

```python
df["Name"] = df["Name"].str.title()
```

### 6. Lowercase all emails

```python
df["Email"] = df["Email"].str.lower()
```

### 7. Remove duplicates

```python
df = df.drop_duplicates()
```

### 8. Rename column "emp_id" to "EmployeeID"

```python
df = df.rename(columns={"emp_id": "EmployeeID"})
```

### 9. Remove column "Notes"

```python
df = df.drop(columns=["Notes"])
```

### 10. Fill missing values in "Department" with "Unknown"

```python
df["Department"] = df["Department"].fillna("Unknown")
```



---

## 🛠️ Project: **CRM Export Cleaner**

### 🎯 Goal

Take a messy customer export and clean it up for the marketing/sales team:
- Standardize names, emails, and phone numbers
- Remove duplicates
- Convert types
- Handle missing values

---

### 🧩 Sample Dataset: `crm_data.csv`



```csv
Name,Email,Phone,SignupDate,Status
alice ,ALICE@EXAMPLE.COM, 123-456-7890,2023/01/10,Active
BOB,BOB@EMAIL.COM,1234567890,2023-02-12,Active
alice,Alice@example.com,1234567890,2023-01-10,Active
Charlie,,123.456.7890,,Inactive
```


---

### ✅ Tasks

1. Remove duplicates (same Name + Phone)
2. Strip/standardize Name and Email
3. Convert phone numbers to consistent format (remove punctuation)
4. Convert `SignupDate` to datetime
5. Fill missing SignupDate with default
6. Drop rows with missing emails
7. Final cleaned dataset saved to `clean_crm.csv`

---

### 🧩 Starter Code

```python
import pandas as pd

df = pd.read_csv("crm_data.csv")

# Strip and standardize strings
df["Name"] = df["Name"].str.strip().str.title()
df["Email"] = df["Email"].str.strip().str.lower()
df["Phone"] = df["Phone"].str.replace(r"\D", "", regex=True)

# Remove duplicates (same Name + Phone)
df = df.drop_duplicates(subset=["Name", "Phone"])

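# Convert date (unparseable values become NaT).
# The sample mixes 2023/01/10 and 2023-02-12; on pandas 2.x you may need
# format="mixed" for both styles to parse.
df["SignupDate"] = pd.to_datetime(df["SignupDate"], errors="coerce")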

# Fill missing dates
df["SignupDate"] = df["SignupDate"].fillna(pd.Timestamp("2023-01-01"))

# Drop rows with missing email
df = df.dropna(subset=["Email"])

# Save cleaned version
df.to_csv("clean_crm.csv", index=False)

print(df)
```


### ✅ Sample Output

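With the sample file, the second Alice row is removed as a duplicate and Charlie's row is dropped because it has no email. How the mixed `/` and `-` date formats parse can vary by pandas version.

```text
    Name              Email       Phone SignupDate  Status
0  Alice  alice@example.com  1234567890 2023-01-10  Active
1    Bob      bob@email.com  1234567890 2023-02-12  Active
```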


### 🚀 Stretch Goals

- Validate emails using regex
- Add a “SignupMonth” column
- Save final summary to Excel
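
The first two stretch goals could be approached along these lines. The regex is deliberately simple and only illustrative; real email validation has many more edge cases. The sample rows and the `EmailValid` column name are assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical cleaned rows; the last email is deliberately malformed
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara"],
    "Email": ["alice@example.com", "bob@email.com", "cara-at-example.com"],
    "SignupDate": pd.to_datetime(["2023-01-10", "2023-02-12", "2023-03-05"]),
})

# Deliberately simple pattern: something@something.something
# (an illustration, not a full validator)
pattern = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"
df["EmailValid"] = df["Email"].str.match(pattern)

# Month number taken from the datetime column
df["SignupMonth"] = df["SignupDate"].dt.month

print(df[["Name", "EmailValid", "SignupMonth"]])
```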

### Week 9: Data Transformation & Feature Engineering
**Topics:**
* Creating new columns with apply() & lambda
* Binning & categorization (pd.cut(), qcut())
* Dummy variables (get_dummies)
* Mapping values (e.g., scoring systems)

**Project:**
* Customer segmentation or performance tiering (e.g., "Gold", "Silver", "Bronze")


# Week 9: Data Transformation & Feature Engineering

## ✅ Learning Objectives
By the end of this week, you will be able to:
- Create new columns using `apply()` and `lambda`
- Perform binning and categorization (e.g., age groups, spend tiers)
- Convert categorical variables using dummy/one-hot encoding
- Map and transform column values with dictionaries
- Prepare datasets for downstream modeling or analysis

---

## 🧮 Creating New Columns

```python
# Using arithmetic on whole columns
df["Revenue"] = df["Price"] * df["Quantity"]

# Using string concatenation
df["FullName"] = df["First"] + " " + df["Last"]

# Using apply + lambda (runs the function on each value)
df["Tax"] = df["Price"].apply(lambda x: x * 0.07)
```

### Feature Binning with pd.cut() or pd.qcut()

```python
# Fixed bins: you choose the edges
df["AgeGroup"] = pd.cut(df["Age"], bins=[0, 18, 35, 60, 100], labels=["Teen", "Young Adult", "Adult", "Senior"])

# Quantile-based bins: pandas chooses edges so each bin holds a similar number of rows
df["SpendingTier"] = pd.qcut(df["TotalSpent"], q=3, labels=["Low", "Medium", "High"])
```
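
To see how the two binning functions differ, the sketch below runs both on the `TotalSpent` values from this week's `customers.csv` sample. The fixed bin edges (0, 1000, 2000, 5000) are arbitrary choices made for illustration.

```python
import pandas as pd

# TotalSpent values from the customers.csv sample used in this week's project
spent = pd.Series([1200.50, 450.00, 3000.00, 980.00, 1750.00])

# pd.cut: you choose the edges, so group sizes can be uneven
fixed = pd.cut(spent, bins=[0, 1000, 2000, 5000], labels=["Low", "Medium", "High"])
print(fixed.value_counts().sort_index())

# pd.qcut: pandas chooses edges so groups are as equal in size as possible
quantile = pd.qcut(spent, q=3, labels=["Low", "Medium", "High"])
print(quantile.value_counts().sort_index())
```

With five rows, three equal groups are impossible, so `qcut` produces group sizes of 2, 1, and 2. Note that 1750 lands in "Medium" under the fixed edges but in "High" under the quantile split.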


### 🔁 Mapping Categorical Values

```python
rating_map = {"Poor": 1, "Average": 2, "Good": 3, "Excellent": 4}
df["RatingScore"] = df["Rating"].map(rating_map)   # values missing from the dict become NaN
```

### 🔣 One-Hot Encoding (Dummy Variables)

```python
pd.get_dummies(df["Department"])

# Or include in main DataFrame
df = pd.get_dummies(df, columns=["Department"])
```

### Why Feature Engineering?

- Key step before modeling or deeper analysis
- Allows you to capture patterns (segments, tiers, scores)
- Helps convert messy real-world data into structured features

### 💡 Tip: Avoid modifying original column unless intended

```python
# Don’t overwrite your raw data unless you're sure: write to a new column
df["Normalized"] = df["Score"] / df["Score"].max()
```



---

## 🧪 Practice Exercises

**🧪 Week 9 – Practice Exercises: Feature Engineering**

### 1. Create "Revenue" = Price × Quantity

```python
df["Revenue"] = df["Price"] * df["Quantity"]
```

### 2. Combine First and Last Name into FullName

```python
df["FullName"] = df["First"] + " " + df["Last"]
```

### 3. Calculate DiscountedPrice = Price * 0.9

```python
df["DiscountedPrice"] = df["Price"].apply(lambda x: x * 0.9)
```

### 4. Create Age Groups with cut()

```python
df["AgeGroup"] = pd.cut(df["Age"], [0, 18, 35, 60, 100], labels=["Teen", "Young Adult", "Adult", "Senior"])
```

### 5. Score performance using map()

```python
map_dict = {"Low": 1, "Medium": 2, "High": 3}
df["Score"] = df["Performance"].map(map_dict)
```

### 6. Convert "Category" to dummy variables

```python
pd.get_dummies(df["Category"])
```

### 7. Normalize a column between 0 and 1

```python
# Min-max scaling: the smallest value maps to 0, the largest to 1
col = df["Score"]
df["Normalized"] = (col - col.min()) / (col.max() - col.min())
```

### 8. Add “Spending Tier” using qcut()

```python
df["Tier"] = pd.qcut(df["Spending"], 3, labels=["Low", "Medium", "High"])
```

### 9. Extract domain from email

```python
# .str methods skip missing emails; apply(lambda x: x.split("@")[1]) would raise on NaN
df["Domain"] = df["Email"].str.split("@").str[1]
```

### 10. Create "IsHighValue" if Revenue > 1000

```python
df["IsHighValue"] = df["Revenue"] > 1000
```



---

## 🛠️ Project: **Customer Segmentation & Tiering**

### 🎯 Goal

Transform raw customer purchase data into **meaningful segments**:
- Calculate total and average spend
- Categorize customers into tiers
- Add flags for high-value behavior

---

### 🧩 Sample Dataset: `customers.csv`



```csv
CustomerID,Name,Age,TotalSpent,Transactions
101,Alice,34,1200.50,12
102,Bob,22,450.00,5
103,Charlie,45,3000.00,18
104,Dana,61,980.00,4
105,Eli,28,1750.00,8
```


---

### ✅ Tasks

1. Create `AvgSpend` = TotalSpent / Transactions
2. Categorize customers into:
   - Spending Tiers using `pd.qcut()`
   - Age Groups using `pd.cut()`
3. Create binary flag: `IsHighValue` if TotalSpent > 1000 and Transactions > 10
4. One-hot encode age groups (optional)

---

### 🧩 Starter Code

```python
import pandas as pd

df = pd.read_csv("customers.csv")

# Average Spend
df["AvgSpend"] = df["TotalSpent"] / df["Transactions"]

# Spending Tier (3 quantile-based groups; sizes are as equal as the data allows)
df["SpendingTier"] = pd.qcut(df["TotalSpent"], 3, labels=["Low", "Medium", "High"])

# Age Group
df["AgeGroup"] = pd.cut(df["Age"], bins=[0, 25, 45, 100], labels=["Young", "Middle", "Senior"])

# High Value Flag
df["IsHighValue"] = (df["TotalSpent"] > 1000) & (df["Transactions"] > 10)

print(df)
```


### ✅ Sample Output

Output for the sample file (`AvgSpend` rounded to two decimals):

```text
   CustomerID     Name  Age  TotalSpent  Transactions  AvgSpend SpendingTier AgeGroup  IsHighValue
0         101    Alice   34      1200.5            12    100.04       Medium   Middle         True
1         102      Bob   22       450.0             5     90.00          Low    Young        False
2         103  Charlie   45      3000.0            18    166.67         High   Middle         True
3         104     Dana   61       980.0             4    245.00          Low   Senior        False
4         105      Eli   28      1750.0             8    218.75         High   Middle        False
```


### 🚀 Stretch Goals

- Add normalized “AvgSpend”
- Create cohort labels (e.g., Young-HighSpender)
- Export tier summary by group to Excel or PDF

### Week 10: Aggregation, Grouping, and Pivoting
**Topics:**
* Grouping with .groupby() and .agg()
* Pivot tables with .pivot_table()
* Sorting and filtering summaries
* Multi-level aggregation

**Project:** 
* “Revenue by Product Category” – Analyze revenue trends and generate pivot tables by region/month


# Week 10: Aggregation, Grouping, and Pivoting

## ✅ Learning Objectives
By the end of this week, you will be able to:
- Use `.groupby()` and `.agg()` to summarize data
- Perform multi-level groupings
- Create pivot tables using `.pivot_table()`
- Sort and filter summary tables
- Apply business thinking to summarize KPIs

---

## 🔁 Grouping Data

```python
df.groupby("Department")["Salary"].mean()          # one column, one statistic
df.groupby("Region")[["Sales", "Profit"]].sum()    # several columns at once
```

### 🔧 Aggregation with .agg()

```python
df.groupby("Region").agg({
    "Sales": "sum",
    "Profit": "mean",
    "Discount": "max"
})
```

- Use a dict to assign a different aggregation method to each column
- More flexible than calling `.mean()` or `.sum()` alone

### 🔀 Multi-Level Grouping

```python
df.groupby(["Region", "Category"])["Sales"].sum()
```

- Group by multiple features for a detailed view
- Returns a Series with a MultiIndex (selecting several columns returns a MultiIndex DataFrame)

### 📊 Pivot Tables with .pivot_table()

```python
df.pivot_table(index="Region", columns="Category", values="Sales", aggfunc="sum")
```

- Like Excel pivot tables: reshape and summarize data
- `aggfunc` can be `"sum"`, `"mean"`, `"count"`, etc.

### 📈 Sorting Summaries

```python
summary = df.groupby("Category")["Sales"].sum()
summary.sort_values(ascending=False)
```

Use `.sort_values()` to highlight top performers.

### 🧠 When to Use What?

| Task                          | Method           |
| ----------------------------- | ---------------- |
| Grouping + Summary            | `.groupby()`     |
| Summary Across 2 Dimensions   | `.pivot_table()` |
| Aggregation w/ multiple funcs | `.agg()`         |
| Ranking results               | `.sort_values()` |
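
The methods in the table can be compared side by side on a tiny dataset. The rows below are invented for illustration; `fill_value=0` is an optional `pivot_table` argument that replaces missing combinations with zero.

```python
import pandas as pd

# Hypothetical sales rows
df = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Category": ["A", "B", "A", "A"],
    "Sales": [100, 150, 200, 50],
})

# groupby on two keys returns a Series with a two-level (MultiIndex) index
grouped = df.groupby(["Region", "Category"])["Sales"].sum()
print(grouped)

# pivot_table lays the same totals out as a Region x Category grid;
# fill_value=0 replaces missing combinations (West/B here)
pivot = df.pivot_table(index="Region", columns="Category",
                       values="Sales", aggfunc="sum", fill_value=0)
print(pivot)

# Several aggregations of one column at once
stats = df.groupby("Region")["Sales"].agg(["sum", "mean"])
print(stats)
```

The grouped Series and the pivot table hold the same totals; the pivot is usually easier to read when you want to compare two dimensions at a glance.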


### 💡 Real-World Relevance

Used in:

- Sales analysis by region/month
- HR analytics by department/gender
- Marketing conversion by channel/campaign


---

## 🧪 Practice Exercises

**🧪 Week 10 – Practice Exercises: Grouping & Pivoting**

### 1. Group by Department and get average salary

```python
df.groupby("Department")["Salary"].mean()
```

### 2. Group by Gender and get employee count

```python
df.groupby("Gender")["EmployeeID"].count()
```

### 3. Sum sales by Region and Category

```python
df.groupby(["Region", "Category"])["Sales"].sum()
```

### 4. Aggregate sales (sum) and profit (mean) by Category

```python
df.groupby("Category").agg({
    "Sales": "sum",
    "Profit": "mean"
})
```

### 5. Create a pivot table for total sales by Region and Month

```python
df.pivot_table(index="Region", columns="Month", values="Sales", aggfunc="sum")
```

### 6. Sort departments by total salary spent

```python
df.groupby("Department")["Salary"].sum().sort_values(ascending=False)
```

### 7. Count employees per region and gender

```python
df.groupby(["Region", "Gender"])["EmployeeID"].count()
```

### 8. Create pivot table with average salary per Department and Gender

```python
df.pivot_table(index="Department", columns="Gender", values="Salary", aggfunc="mean")
```

### 9. Show top 3 categories by total revenue

```python
df.groupby("Category")["Revenue"].sum().sort_values(ascending=False).head(3)
```

### 10. Add new column "Revenue" = Price × Quantity, then group by Product

```python
df["Revenue"] = df["Price"] * df["Quantity"]
df.groupby("Product")["Revenue"].sum()
```



---

## 🛠️ Project: **Revenue by Product Category**

### 🎯 Goal

Use grouping and pivoting to analyze **sales performance** across products, categories, and regions.

---

### 🧩 Sample Dataset: `sales_data.csv`



```csv
OrderID,Region,Category,Product,Price,Quantity,OrderDate
1,West,Office Supplies,Pens,1.5,20,2023-01-12
2,East,Furniture,Chair,85,2,2023-01-14
3,South,Technology,Laptop,950,1,2023-02-02
4,West,Furniture,Table,200,1,2023-02-11
5,East,Office Supplies,Stapler,7.5,10,2023-03-05
```


---

### ✅ Tasks

1. Create `Revenue` = Price × Quantity
2. Group by `Category` and `Region` to get:
   - Total Revenue
   - Average Quantity per sale
3. Create pivot table: Revenue by Category vs Region
4. Extract `Month` from `OrderDate` and group by `Month` and `Category`
5. Sort to find top-grossing categories by month

---

### 🧩 Starter Code

```python
import pandas as pd

df = pd.read_csv("sales_data.csv")

# Revenue Column
df["Revenue"] = df["Price"] * df["Quantity"]

# Group by Category + Region
summary = df.groupby(["Category", "Region"]).agg({
    "Revenue": "sum",
    "Quantity": "mean"
})
print(summary)

# Pivot Table
pivot = df.pivot_table(index="Category", columns="Region", values="Revenue", aggfunc="sum")
print(pivot)

# Extract Month
df["Month"] = pd.to_datetime(df["OrderDate"]).dt.month

# Monthly Revenue by Category
monthly = df.groupby(["Month", "Category"])["Revenue"].sum().unstack()
print(monthly)
```


### ✅ Sample Output (Pivot)

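For the five sample rows, the pivot looks like this (`NaN` marks Category/Region pairs with no orders):

```text
Region            East  South   West
Category
Furniture        170.0    NaN  200.0
Office Supplies   75.0    NaN   30.0
Technology         NaN  950.0    NaN
```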


### 🚀 Stretch Goals

- Create a bar chart from the pivot table (using matplotlib)
- Export summaries to Excel (multiple sheets)
- Highlight top revenue category per region
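
For the last stretch goal, one possible approach is `DataFrame.idxmax()` on the pivot table. The sketch rebuilds the sample rows inline so it runs without the CSV file; `top_per_region` is an illustrative name.

```python
import pandas as pd

# Same rows as sales_data.csv, built inline so the sketch runs on its own
df = pd.DataFrame({
    "Region": ["West", "East", "South", "West", "East"],
    "Category": ["Office Supplies", "Furniture", "Technology",
                 "Furniture", "Office Supplies"],
    "Price": [1.5, 85, 950, 200, 7.5],
    "Quantity": [20, 2, 1, 1, 10],
})
df["Revenue"] = df["Price"] * df["Quantity"]

pivot = df.pivot_table(index="Category", columns="Region",
                       values="Revenue", aggfunc="sum")

# idxmax() returns the row label (category) of the largest value per column;
# missing combinations (NaN) are skipped
top_per_region = pivot.idxmax()
print(top_per_region)
```

With this data the top category is Furniture in East and West, and Technology in South.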

## Phase 3: Visualization & Analytical Storytelling
***(Weeks 11–13)-Communicate insights through data visuals***


### Week 11: Matplotlib Basics for Plotting
**Topics:**
* Line, bar, scatter, histogram
* Titles, labels, ticks, legends
* Saving charts as images

**Project:**
* “Revenue Dashboard” – Visualize monthly revenue and product trends


# Week 11: Matplotlib Basics for Plotting

## ✅ Learning Objectives
By the end of this week, you will be able to:
- Create common chart types: line, bar, scatter, histogram
- Label your charts with titles, axes, legends
- Adjust chart size, colors, and styles
- Save plots as image files (e.g. PNG)
- Understand basic chart use-cases in business contexts

---

## 📦 What is Matplotlib?

- Core plotting library in Python
- Powers most data visuals in notebooks
- Great for **custom**, **publication-quality** plots

---

## 📊 Line Plot

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [1000, 1200, 900, 1400]

plt.plot(months, sales)
plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Sales ($)")
plt.show()
```

### 📊 Bar Chart

```python
categories = ["A", "B", "C"]
values = [10, 30, 20]

plt.bar(categories, values, color="skyblue")
plt.title("Category Performance")
plt.show()
```

### 📉 Scatter Plot

```python
x = [1, 2, 3, 4]
y = [2, 4, 1, 3]

plt.scatter(x, y, color="green")
plt.title("Relationship Example")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
```

### 📈 Histogram

```python
ages = [22, 25, 27, 30, 25, 35, 40, 42, 50, 23]
plt.hist(ages, bins=5, color="orange")
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()
```

### 🖼️ Save to File

```python
# Save before calling plt.show(); afterwards the current figure may be empty
plt.savefig("sales_chart.png")
```
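
Saving is less error-prone when you hold an explicit reference to the figure. The sketch below uses Matplotlib's `plt.subplots()` interface for that. The `Agg` backend line is only there so the example also runs on machines without a display; you can drop it inside Jupyter. The `dpi` and `bbox_inches` values are reasonable defaults, not requirements.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [1000, 1200, 900, 1400]

# Keep a handle on the figure so it is clear which chart gets saved
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(months, sales, marker="o")
ax.set_title("Monthly Sales")
ax.set_xlabel("Month")
ax.set_ylabel("Sales ($)")

# Save through the figure object, independent of any later plt.show()
fig.savefig("sales_chart.png", dpi=150, bbox_inches="tight")
plt.close(fig)  # free the figure once it is written
```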


### 💡 Chart Types & When to Use

| Chart     | Use Case                           |
| --------- | ---------------------------------- |
| Line      | Trend over time                    |
| Bar       | Compare values across categories   |
| Scatter   | Relationship between two variables |
| Histogram | Distribution of values             |

### 🎯 Tips for Business Plots

- Label clearly (titles, axes)
- Avoid clutter (keep it simple)
- Use consistent color schemes
- Highlight what matters (e.g., outliers, trends)


---

## 🧪 Practice Exercises

**🧪 Week 11 – Practice Exercises: Matplotlib Basics**

### 1. Plot revenue trend over 6 months

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [1000, 1100, 1200, 900, 1400, 1500]

plt.plot(months, revenue)
plt.title("Monthly Revenue")
plt.xlabel("Month")
plt.ylabel("Revenue ($)")
plt.show()
```

### 2. Bar chart for sales by product

```python
products = ["Pen", "Notebook", "Stapler"]
sales = [120, 90, 60]

plt.bar(products, sales, color="purple")
plt.title("Product Sales")
plt.show()
```

### 3. Scatter plot of Hours Studied vs Score

```python
hours = [1, 2, 3, 4, 5]
scores = [50, 60, 65, 75, 80]

plt.scatter(hours, scores)
plt.title("Study Hours vs Exam Score")
plt.xlabel("Hours Studied")
plt.ylabel("Score")
plt.show()
```

### 4. Histogram of customer ages

```python
ages = [22, 25, 30, 32, 35, 40, 42, 44, 45, 50, 55, 60]

plt.hist(ages, bins=6, color="teal")
plt.title("Customer Age Distribution")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()
```

### 5. Save chart to file

```python
plt.plot([1, 2, 3], [3, 1, 4])
plt.title("Demo Plot")
plt.savefig("demo_plot.png")
```



---

## 🛠️ Project: **Revenue Dashboard (Matplotlib Only)**

### 🎯 Goal

Create a basic **monthly revenue dashboard** showing:
- Revenue trend over time
- Revenue by product category
- Customer volume by region
- Save charts as images

---

### 🧩 Sample Dataset: `revenue_data.csv`



```csv
Month,Category,Region,Revenue,Customers
Jan,Office Supplies,East,1200,35
Jan,Technology,West,2200,50
Feb,Office Supplies,East,900,30
Feb,Technology,West,2500,52
Mar,Office Supplies,East,1100,40
Mar,Technology,West,2800,60
```


---

### ✅ Tasks

1. Load CSV into DataFrame
2. Create:
   - Line chart: Revenue over Months
   - Bar chart: Total Revenue by Category
   - Bar chart: Customers by Region
3. Add labels and titles
4. Save each chart as PNG

---

### 🧩 Starter Code

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("revenue_data.csv")

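# Revenue trend (sort=False keeps Jan, Feb, Mar in file order;
# the default would sort month names alphabetically: Feb, Jan, Mar)
monthly = df.groupby("Month", sort=False)["Revenue"].sum()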
monthly.plot(kind="line", title="Monthly Revenue", xlabel="Month", ylabel="Revenue")
plt.savefig("monthly_revenue.png")
plt.clf()

# Revenue by category
category = df.groupby("Category")["Revenue"].sum()
category.plot(kind="bar", title="Revenue by Category", color="skyblue")
plt.ylabel("Revenue")
plt.savefig("revenue_by_category.png")
plt.clf()

# Customers by region
region = df.groupby("Region")["Customers"].sum()
region.plot(kind="bar", title="Customers by Region", color="green")
plt.ylabel("Customers")
plt.savefig("customers_by_region.png")
```


### ✅ Sample Output (Charts)

- `monthly_revenue.png`
- `revenue_by_category.png`
- `customers_by_region.png`

### Week 12: Seaborn for Statistical Visualization
**Topics:**
* Distribution plots: histplot (replaces the deprecated distplot), boxplot, violinplot
* Relationship plots: scatterplot, pairplot, heatmap
* Themes and color palettes

**Project:** 
* “Product Analytics Visuals” – Visualize sales vs pricing vs customer rating


# Week 12: Seaborn for Statistical Visualization

## ✅ Learning Objectives
By the end of this week, you will be able to:
- Create statistical plots using Seaborn
- Visualize distributions and relationships
- Apply built-in themes and color palettes
- Combine categorical and numeric insights in one chart
- Make better visual decisions using Seaborn's expressiveness

---

## 📦 What is Seaborn?

- Built on top of Matplotlib
- High-level, easy-to-use API
- Better defaults, cleaner charts

```python
import seaborn as sns
import matplotlib.pyplot as plt
```

### 📊 Distribution Plots

```python
# Histogram + KDE
sns.histplot(data=df, x="Age", kde=True)

# Boxplot
sns.boxplot(data=df, x="Department", y="Salary")

# Violin plot
sns.violinplot(data=df, x="Region", y="Revenue")
```

### 🔗 Relationship Plots

```python
# Scatter plot (points only)
sns.scatterplot(data=df, x="Price", y="Rating")

# Scatter plot with a fitted regression line
sns.lmplot(data=df, x="Experience", y="Salary")
```

### Heatmaps (Correlation or Pivot Tables)

```python
# Correlation matrix (numeric_only=True skips text columns, which pandas 2.x would reject)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
```

### 🟢 Categorical Visuals

```python
# Count of categories
sns.countplot(data=df, x="Category")

# Swarm for distribution per group
sns.swarmplot(data=df, x="Category", y="Score")
```

### 🎨 Themes & Styles

```python
sns.set_style("whitegrid")  # options: white, dark, whitegrid, darkgrid, ticks
sns.set_palette("pastel")   # or "deep", "muted", "Set2"
```


### 💡 Seaborn Use Cases

| Visual     | Use Case                                |
| ---------- | --------------------------------------- |
| Boxplot    | Compare distribution across categories  |
| Heatmap    | Show correlation or pivot summary       |
| Scatter/LM | Relationship between numeric variables  |
| Countplot  | Distribution of categories              |
| Violinplot | Mix of box + KDE (good for skewed data) |


### Seaborn vs Matplotlib

| Feature         | Matplotlib     | Seaborn             |
| --------------- | -------------- | ------------------- |
| Flexibility     | High           | Medium (high-level) |
| Aesthetics      | Manual styling | Great by default    |
| Stats Awareness | No             | Yes                 |



---

## 🧪 Practice Exercises

**🧪 Week 12 – Practice Exercises: Seaborn Visuals**

### 1. Plot distribution of ages

```python
sns.histplot(df["Age"], kde=True)
```

### 2. Compare Salary across Departments using Boxplot

```python
sns.boxplot(data=df, x="Department", y="Salary")
```

### 3. Scatterplot of Price vs Rating

```python
sns.scatterplot(data=df, x="Price", y="Rating")
```

### 4. Regression line of Experience vs Salary

```python
sns.lmplot(data=df, x="Experience", y="Salary")
```

### 5. Countplot of Product Categories

```python
sns.countplot(data=df, x="Category")
```

### 6. Heatmap of correlation matrix

```python
sns.heatmap(df.corr(numeric_only=True), annot=True)
```

### 7. Violinplot of Revenue by Region

```python
sns.violinplot(data=df, x="Region", y="Revenue")
```

### 8. Set theme to whitegrid and palette to pastel

```python
sns.set_style("whitegrid")
sns.set_palette("pastel")
```

### 9. Swarmplot of Scores by Group

```python
sns.swarmplot(data=df, x="Group", y="Score")
```

### 10. Save Seaborn chart to file

```python
plot = sns.boxplot(data=df, x="Category", y="Sales")   # returns a Matplotlib Axes
plot.figure.savefig("boxplot_output.png")
```



---

## 🛠️ Project: **Product Analytics Visuals (Seaborn)**

### 🎯 Goal

Visualize product performance and customer behavior using:
- Price vs Rating
- Category distributions
- Revenue vs Discount relationships
- Correlations between metrics

---

### 🧩 Sample Dataset: `product_data.csv`



```csv
Product,Category,Price,Discount,Rating,Revenue,Region
Notebook,Office Supplies,15,0.1,4.2,3000,East
Monitor,Technology,200,0.15,4.7,8000,West
Chair,Furniture,85,0.2,4.0,5000,South
Pen,Office Supplies,1.5,0.05,4.1,1200,East
Desk,Furniture,250,0.25,4.5,9500,North
```


---

### ✅ Tasks

1. Plot scatter of `Price vs Rating`
2. Use `lmplot()` to show regression line (Price vs Rating)
3. Boxplot: `Revenue by Category`
4. Violinplot: `Discount by Region`
5. Countplot: Number of products by Category
6. Correlation heatmap for numeric columns

---

### 🧩 Starter Code

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style("whitegrid")
sns.set_palette("Set2")

df = pd.read_csv("product_data.csv")

# 1. Price vs Rating
sns.scatterplot(data=df, x="Price", y="Rating")
plt.title("Price vs Customer Rating")
plt.show()

# 2. Regression Line (lmplot creates its own figure)
sns.lmplot(data=df, x="Price", y="Rating")
plt.show()

# 3. Revenue by Category
sns.boxplot(data=df, x="Category", y="Revenue")
plt.title("Revenue Distribution by Category")
plt.show()

# 4. Violinplot: Discount by Region
sns.violinplot(data=df, x="Region", y="Discount")
plt.title("Discount Range by Region")
plt.show()

# 5. Count of Products
sns.countplot(data=df, x="Category")
plt.title("Product Count by Category")
plt.show()

# 6. Heatmap of numeric correlation
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()
```


### Sample Output (Charts)

- Price vs Rating (scatter, plus a separate regression plot)
- Revenue by Category (boxplot)
- Discount by Region (violin)
- Count of Products (bar)
- Correlation Heatmap

### 🚀 Stretch Goals

- Create a Seaborn pairplot across Price, Revenue, Rating
- Annotate the highest-rated product
- Save all charts to files
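
The last two stretch goals can be sketched as follows. This is a minimal illustration rather than a reference solution: it inlines the five sample rows so it runs without `product_data.csv`, draws with plain Matplotlib instead of Seaborn, and the output file name `price_vs_rating.png` is a placeholder.

```python
import io

import matplotlib
matplotlib.use("Agg")  # draw off-screen so the script runs without a display; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# The five sample rows from product_data.csv, inlined to keep the sketch self-contained
csv_text = """Product,Category,Price,Discount,Rating,Revenue,Region
Notebook,Office Supplies,15,0.1,4.2,3000,East
Monitor,Technology,200,0.15,4.7,8000,West
Chair,Furniture,85,0.2,4.0,5000,South
Pen,Office Supplies,1.5,0.05,4.1,1200,East
Desk,Furniture,250,0.25,4.5,9500,North
"""
df = pd.read_csv(io.StringIO(csv_text))

# Row holding the highest rating
best = df.loc[df["Rating"].idxmax()]

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df["Price"], df["Rating"])

# Label the best-rated point with an arrow; the text offset is in points
ax.annotate(
    best["Product"],
    xy=(best["Price"], best["Rating"]),
    xytext=(-60, -30),
    textcoords="offset points",
    arrowprops={"arrowstyle": "->"},
)
ax.set_xlabel("Price")
ax.set_ylabel("Rating")
ax.set_title("Price vs Customer Rating")

# Save to disk, then release the figure
fig.savefig("price_vs_rating.png", dpi=150, bbox_inches="tight")
plt.close(fig)
```

Calling `savefig` before `plt.show()` (or instead of it) matters, because some backends clear the figure once it has been shown.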

### Week 13: Exploratory Data Analysis (EDA) Projects
**Topics:**
* Combining pandas + matplotlib/seaborn
* Detect outliers, correlations, missing data
* Business storytelling: crafting a data narrative

**Capstone Project:**
* “EDA Case Study” using the Titanic dataset, an HR dataset, or another Kaggle dataset. Include:
    * Data cleaning
    * Visual exploration
    * Insight summary in Markdown or slides


# Week 13: Exploratory Data Analysis (EDA) Projects

## ✅ Learning Objectives

By the end of this week, you will be able to:
- Perform full EDA on a real dataset
- Clean and prepare messy data
- Visualize relationships, distributions, and patterns
- Detect outliers, nulls, and trends
- Craft an analytical narrative with visual support

---

## 🧭 What is EDA?

EDA is the process of:
- **Understanding** data: structure, types, quality
- **Cleaning** data: handling nulls, outliers, formats
- **Visualizing** distributions, trends, relationships
- **Summarizing** insights for decisions

---

## 📊 Typical EDA Steps

1. **Understand the Dataset**
   - `.head()`, `.info()`, `.describe()`
   - Check shapes, types, ranges

2. **Clean the Data**
   - Handle missing values
   - Fix data types (e.g., dates)
   - Rename columns

3. **Explore Distributions**
   - Histograms, boxplots, violinplots
   - Summary stats (mean, median, outliers)

4. **Explore Relationships**
   - Scatterplots, heatmaps, grouped summaries

5. **Group & Compare**
   - `groupby()`, `pivot_table()`, aggregation

6. **Tell a Story**
   - Structure findings logically
   - Use visuals to support each point
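
As a rough sketch of steps 1, 2, 3, and 5 above, the snippet below walks a tiny made-up table through the same sequence. The column names and values are hypothetical and only stand in for a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset
df = pd.DataFrame({
    "order date": ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-17", "2024-03-01"],
    "region": ["East", "West", "East", "West", "East"],
    "revenue": [120.0, np.nan, 90.0, 300.0, 150.0],
})

# Step 1. Understand: size, dtypes, and missing values per column
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Step 2. Clean: rename, fix types, fill the missing revenue with the median
df = df.rename(columns={"order date": "order_date"})
df["order_date"] = pd.to_datetime(df["order_date"])
df["revenue"] = df["revenue"].fillna(df["revenue"].median())

# Step 3. Explore distributions with summary statistics
print(df["revenue"].describe())

# Step 5. Group and compare
by_region = df.groupby("region")["revenue"].agg(["count", "mean", "sum"])
print(by_region)
```

Filling with the median is only one option; dropping the row or flagging it in a separate column can be more appropriate depending on why the value is missing.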

---

## 🧠 Key Questions to Ask in EDA

- What variables are important? Why?
- What patterns or anomalies exist?
- Are there missing or extreme values?
- How are groups different (e.g., by gender, region)?
- What story can the data tell?
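
One common way to answer the "extreme values" question is the 1.5 × IQR rule (Tukey's rule). The sketch below applies it to a made-up salary column; the numbers are hypothetical, and the 1.5 multiplier is a convention rather than a requirement.

```python
import pandas as pd

# Hypothetical salaries; the last value is deliberately extreme
salary = pd.Series(
    [42000, 45000, 47000, 50000, 52000, 55000, 58000, 250000], name="Salary"
)

q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Flag points lying more than 1.5 * IQR outside the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = salary[(salary < lower) | (salary > upper)]

print(f"Bounds: {lower:.0f} to {upper:.0f}")
print(outliers)
```

A flagged value is a prompt to investigate, not an automatic deletion: it may be a data-entry error or a genuine executive salary.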

---

## 🛠 Tools Used

- Pandas for data loading, cleaning, analysis
- Seaborn/Matplotlib for visuals
- Markdown/Slides for insights



## 🧪 Practice: Mini-EDA on an HR Attrition Dataset

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("hr_data.csv")

# Basic checks
print(df.head())
df.info()  # info() prints its report directly and returns None, so no print() is needed
print(df.describe())

# Null values
print(df.isna().sum())

# Turn 'HireDate' into datetime
df["HireDate"] = pd.to_datetime(df["HireDate"])

# Add tenure
df["Tenure"] = (pd.to_datetime("2025-01-01") - df["HireDate"]).dt.days // 365

# Histogram of Age
sns.histplot(df["Age"], bins=10)
plt.title("Age Distribution")
plt.show()

# Boxplot Salary by Department
sns.boxplot(data=df, x="Department", y="Salary")
plt.title("Salary by Department")
plt.show()

# Countplot of Attrition
sns.countplot(data=df, x="Attrition")
plt.title("Attrition Distribution")
plt.show()

# Correlation
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.title("Correlation Matrix")
plt.show()
```



---

## 🎓 Capstone Project: **EDA Case Study**

> Choose one dataset and conduct a **full EDA**, delivering your findings as a Jupyter Notebook or Markdown report.

---

### 🧩 Choose One Dataset:

| Dataset                        | Description                                |
|-------------------------------|--------------------------------------------|
| 🛒 Retail Sales                | Product, region, time, revenue             |
| 👥 HR Attrition                | Employees, salary, department, attrition   |
| 📊 Kaggle: Titanic             | Survival based on demographics             |
| 🎯 Marketing Campaign Funnel   | Clicks, conversions, revenue               |
| 💬 Survey NPS Sentiment       | Text, scores, feedback sentiment           |

---

### ✅ Requirements

1. **Data Cleaning**
   - Drop/fix nulls
   - Convert data types
   - Rename columns

2. **Feature Engineering**
   - Create derived columns (e.g. tenure, revenue)
   - Bin or categorize (e.g., age groups)

3. **Visualizations**
   - 3–5 high-quality charts using Seaborn/Matplotlib
   - Include:
     - Distribution
     - Relationship
     - Grouped comparison
     - Trend over time (if applicable)

4. **Insight Summary**
   - What did you learn from the data?
   - What’s surprising or useful?
   - Use Markdown or slides to communicate

5. **Final Deliverables**
   - `.ipynb` notebook with code + visuals
   - Markdown summary OR exported slides
   - Optional: GitHub repo or downloadable PDF
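
For requirement 2 (Feature Engineering), a derived column and a binned column might look like the sketch below. The data is made up, the reference date and bin edges are arbitrary choices, and dividing days by 365 is only an approximation of tenure in years.

```python
import pandas as pd

# Hypothetical HR rows; column names follow the Mini-EDA example
df = pd.DataFrame({
    "Age": [23, 31, 38, 45, 52, 60],
    "HireDate": pd.to_datetime(
        ["2022-06-01", "2019-03-15", "2015-09-01", "2010-01-10", "2005-07-20", "1998-11-30"]
    ),
})

# Derived column: approximate whole years of tenure relative to a fixed reference date
reference = pd.Timestamp("2025-01-01")
df["Tenure"] = (reference - df["HireDate"]).dt.days // 365

# Binned column: right=False makes each bin include its left edge, e.g. [30, 40)
bins = [18, 30, 40, 50, 65]
labels = ["18-29", "30-39", "40-49", "50-64"]
df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels, right=False)

print(df[["Age", "AgeGroup", "Tenure"]])
print(df["AgeGroup"].value_counts().sort_index())
```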

---

### 🧩 Starter Template

```python
# Load data
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("your_dataset.csv")

# Initial check
df.info()
df.head()

# Cleaning steps...

# Feature Engineering...

# Visuals
sns.histplot(df["SomeColumn"])
# Add 3–5 relevant plots

# Markdown Summary:
# - Key insights
# - Issues or gaps
# - Business relevance
```


### Stretch Goals

- Use pairplot or FacetGrid for multi-view plots
- Add interactivity with Plotly (optional)
- Compare insights across groups (e.g., attrition by gender + department)
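
A two-way group comparison such as attrition by gender and department can be sketched with `pivot_table`. The rows below are invented for illustration, and the helper column `Left` is a hypothetical name.

```python
import pandas as pd

# Hypothetical employees
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "Department": ["Sales", "Sales", "Sales", "Sales", "IT", "IT", "IT", "IT"],
    "Attrition": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"],
})

# Convert Yes/No to 1/0 so the mean of the column is the attrition rate
df["Left"] = (df["Attrition"] == "Yes").astype(int)

# Rows = department, columns = gender, cells = share of employees who left
rate = df.pivot_table(index="Department", columns="Gender", values="Left", aggfunc="mean")
print(rate)
```

With groups this small the rates are not meaningful; on real data it helps to print the group sizes next to the rates.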

# Phase 4: Advanced Analytics & Projects 
**(Weeks 14–16)-Apply complete workflow and optionally explore ML**


### Week 14: Time Series & Date Analysis
* Using .dt accessor for date components
* Resampling: daily, weekly, monthly
* Rolling averages, time windowing

**Mini-Project:**
* Analyze and visualize website traffic or sales trends over time


# Week 14: Time Series & Date Analysis

## ✅ Learning Objectives

By the end of this week, you will be able to:
- Parse dates and extract time components
- Use `.dt` accessor for time operations
- Perform resampling (daily, weekly, monthly)
- Calculate rolling statistics (moving averages)
- Visualize time-based trends

---

## 🕐 Why Time Series?

- Time is key in sales, finance, traffic, user activity
- Many business questions are **time-dependent**:
  - What’s the monthly growth?
  - Are we improving year-over-year?
  - What’s the 7-day average?
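
Two of these questions can be answered in a few lines once the data has a datetime index. The sketch below uses synthetic daily revenue (100, 101, 102, ...) purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily revenue for the first quarter of 2023
dates = pd.date_range("2023-01-01", "2023-03-31", freq="D")
revenue = pd.Series(np.arange(100, 100 + len(dates)), index=dates, name="Revenue")

# Monthly totals, labelled by the first day of each month
monthly = revenue.resample("MS").sum()

# Month-over-month growth as a fraction (the first month has no predecessor, so it is NaN)
growth = monthly.pct_change()

# Average of the most recent 7 days
last_7_day_avg = revenue.tail(7).mean()

print(monthly)
print(growth.round(3))
print(last_7_day_avg)
```

With at least two years of monthly data, `monthly.pct_change(12)` gives the year-over-year change in the same way.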

---

## 📦 Working with Dates in Pandas

```python
df["OrderDate"] = pd.to_datetime(df["OrderDate"])
```

This ensures the column has a datetime dtype, which plotting and resampling both rely on.
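
Real date columns often contain a few malformed entries. A hedged sketch of one way to handle them is below; the sample strings are invented, and `errors="coerce"` is a choice that silently converts bad values to `NaT`, so it is worth counting them afterwards.

```python
import pandas as pd

# Hypothetical raw column: one impossible date and one non-date string
df = pd.DataFrame({"OrderDate": ["2024-01-15", "2024-02-30", "not a date", "2024-03-01"]})

# errors="coerce" turns values that cannot be parsed into NaT instead of raising an error
df["OrderDate"] = pd.to_datetime(df["OrderDate"], format="%Y-%m-%d", errors="coerce")

print(df["OrderDate"].dtype)
print("Unparseable dates:", df["OrderDate"].isna().sum())
```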

### 🧱 Extracting Date Parts

```python
df["Year"] = df["OrderDate"].dt.year
df["Month"] = df["OrderDate"].dt.month
df["Weekday"] = df["OrderDate"].dt.day_name()
df["Hour"] = df["OrderDate"].dt.hour
```


### 📅 Resampling (Daily, Monthly, etc.)

```python
df.set_index("OrderDate", inplace=True)

# Monthly sales ("M" = month-end; pandas 2.2+ prefers the alias "ME")
monthly = df["Revenue"].resample("M").sum()

# Weekly average
weekly = df["Visits"].resample("W").mean()
```
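
To see what resampling does to the index labels, here is a small self-contained sketch on synthetic data (one visit count per day for four weeks). It sticks to the `"W"` and `"MS"` codes, which behave the same across recent pandas versions.

```python
import pandas as pd

# Synthetic data: 28 days starting on Monday 2024-01-01, with values 1..28
dates = pd.date_range("2024-01-01", periods=28, freq="D")
visits = pd.Series(range(1, 29), index=dates, name="Visits")

# "W" groups into weeks that end on Sunday and labels each group with that Sunday
weekly_total = visits.resample("W").sum()

# "MS" groups by calendar month and labels each group with the first day of the month
month_start_total = visits.resample("MS").sum()

print(weekly_total)
print(month_start_total)
```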


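### Common frequency codes:

- `"D"` – daily
- `"W"` – weekly
- `"M"` – month end (spelled `"ME"` in pandas 2.2 and later)
- `"MS"` – month start
- `"Q"` – quarterly (spelled `"QE"` in pandas 2.2 and later)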

### 🔁 Rolling Averages (Moving Window)

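```python
# Assumes one row per day, so a 7-row window spans 7 days.
# (Applied to the monthly series, window=7 would average 7 months instead.)
df["7-day"] = df["Revenue"].rolling(window=7).mean()
```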


- Helps smooth volatility
- Great for trend detection
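
One detail that often surprises beginners is that a rolling mean starts with missing values. The sketch below, on ten invented daily values, contrasts the default behaviour with `min_periods=1`.

```python
import pandas as pd

# Ten synthetic daily revenue values
dates = pd.date_range("2024-01-01", periods=10, freq="D")
revenue = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], index=dates, name="Revenue")

# Default: the first 6 results are NaN because a full 7-row window is not yet available
strict = revenue.rolling(window=7).mean()

# min_periods=1 averages whatever rows exist so far, so there are no leading NaNs
lenient = revenue.rolling(window=7, min_periods=1).mean()

print(pd.DataFrame({"Revenue": revenue, "strict": strict, "lenient": lenient}))
```

The lenient version looks tidier on a chart, but its first few points are averages over fewer than seven days and are therefore noisier.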

### 📊 Time Series Plots

```python
import matplotlib.pyplot as plt

monthly.plot(title="Monthly Revenue Trend")
plt.ylabel("Revenue")
plt.show()
```


### 💡 Common Use Cases

- Website traffic over time
- Sales performance by week/month
- Employee churn trends
- Rolling KPIs for campaigns


---

## 🧪 Practice Exercises

### 1. Convert column to datetime

```python
df["Date"] = pd.to_datetime(df["Date"])
```


### 2. Extract year, month, weekday

```python
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Weekday"] = df["Date"].dt.day_name()
```


### 3. Set date column as index

```python
df.set_index("Date", inplace=True)
```


### 4. Resample daily and plot

```python
df["Revenue"].resample("D").sum().plot()
```


### 5. Monthly average orders

```python
df["Orders"].resample("M").mean()
```


### 6. Add 7-day moving average

```python
df["Revenue"].rolling(7).mean().plot()
```


### 7. Visualize trends with multiple series

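```python
# Store the moving average as a column first so both series can be plotted together
df["7-day"] = df["Revenue"].rolling(7).mean()
df[["Revenue", "7-day"]].plot()
```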



### 8. Create weekday heatmap

```python
df["Weekday"] = df.index.day_name()
df["Hour"] = df.index.hour
pivot = df.pivot_table(index="Weekday", columns="Hour", values="Sessions", aggfunc="sum")
sns.heatmap(pivot)
```


### 9. Plot total revenue by weekday

```python
df.groupby("Weekday")["Revenue"].sum().plot(kind="bar")
```
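
Note that `groupby` sorts weekday names alphabetically (Friday, Monday, Saturday, ...), which makes the bar chart hard to read. One possible fix, sketched here on synthetic data, is to reindex the result into calendar order before plotting.

```python
import pandas as pd

# Synthetic data: 14 days starting on Monday 2024-01-01, with revenue 1..14
dates = pd.date_range("2024-01-01", periods=14, freq="D")
df = pd.DataFrame({"Revenue": range(1, 15)}, index=dates)
df["Weekday"] = df.index.day_name()

order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# reindex() rearranges the alphabetically sorted result into calendar order
by_weekday = df.groupby("Weekday")["Revenue"].sum().reindex(order)
print(by_weekday)
```

Calling `by_weekday.plot(kind="bar")` on the reordered series then draws the bars from Monday to Sunday.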


### 10. Compare weekly revenue across categories

```python
df.groupby([pd.Grouper(freq="W"), "Category"])["Revenue"].sum().unstack().plot()
```



---

## 🛠️ Mini-Project: **Website or Sales Trends Over Time**

### 🎯 Goal

Analyze and visualize trends over time using:
- Resampling
- Rolling averages
- Date decomposition

---

### 🧩 Sample Dataset: `web_traffic.csv` or `sales_data.csv`



```csv
Date,PageViews,Users,Revenue
2023-01-01,1200,300,2500
2023-01-02,1300,320,2600
2023-01-03,1100,290,2400
```


---

### ✅ Tasks

1. Convert `Date` column to datetime
2. Set `Date` as index
3. Plot:
   - Daily revenue trend
   - 7-day rolling average
4. Resample by month and compare growth
5. Add weekday column and visualize patterns
6. Bonus: Heatmap of traffic by weekday & hour (if hour available)

---

### 🧩 Starter Code

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("web_traffic.csv")
df["Date"] = pd.to_datetime(df["Date"])
df.set_index("Date", inplace=True)

# Daily Revenue
df["Revenue"].plot(title="Daily Revenue")
plt.show()

# 7-Day Rolling Average
df["Revenue"].rolling(7).mean().plot(title="7-Day Revenue Trend")
plt.show()

# Monthly Resample
monthly = df["Revenue"].resample("M").sum()
monthly.plot(kind="bar", title="Monthly Revenue")
plt.show()

# Weekday Analysis
df["Weekday"] = df.index.day_name()
sns.boxplot(data=df, x="Weekday", y="Revenue")
plt.title("Revenue by Weekday")
plt.show()
```


### 🚀 Stretch Goals

- Compare weekday vs weekend behavior
- Add seasonality markers (holidays, promos)
- Save visualizations as PNG
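
The weekday vs weekend comparison could be approached as in the sketch below. The revenue figures are invented, and `DayType` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Synthetic data: two weeks starting on Monday 2024-01-01, with higher weekend revenue
dates = pd.date_range("2024-01-01", periods=14, freq="D")
df = pd.DataFrame({"Revenue": [100] * 5 + [160] * 2 + [110] * 5 + [180] * 2}, index=dates)

# dayofweek counts Monday as 0 and Sunday as 6, so 5 and 6 are the weekend
df["DayType"] = np.where(df.index.dayofweek >= 5, "Weekend", "Weekday")

summary = df.groupby("DayType")["Revenue"].agg(["count", "mean"])
print(summary)
```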

### Week 15: Intro to Predictive Modeling (Optional)
* Linear Regression with scikit-learn
* Model training, evaluation basics
* Avoiding overfitting, test/train split
  
**Mini-Project:** 
* Predict future sales or employee attrition using linear regression


# Week 15: Intro to Predictive Modeling (Linear Regression)

## ✅ Learning Objectives

By the end of this week, you will be able to:
- Understand what a predictive model is
- Build a simple linear regression model using `scikit-learn`
- Split data into training and testing sets
- Evaluate model performance (R², MAE, RMSE)
- Use the model to make predictions

---

## 🔍 What is Predictive Modeling?

> Predictive modeling uses **historical data** to forecast **future outcomes**.

Examples:
- Predict next month’s sales
- Predict employee attrition
- Predict customer ratings or satisfaction

---

## 📦 Linear Regression Overview

- Simplest form of prediction
- Assumes a **linear relationship** between input (X) and output (y)
- Predicts **continuous** values

---

## 🔁 Modeling Workflow

1. **Define X (features) and y (target)**
2. **Split** the data into training and test sets
3. **Train** the model using `.fit()`
4. **Predict** on new data using `.predict()`
5. **Evaluate** the model

---

## 📘 Key Metrics

| Metric | Use |
|--------|-----|
| R²     | How well model explains variance (closer to 1 = better) |
| MAE    | Mean Absolute Error (lower = better) |
| RMSE   | Root Mean Squared Error (lower = better) |
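
To make the three metrics concrete, the sketch below computes them by hand with NumPy on four invented actual/predicted pairs. In practice `sklearn.metrics` provides `r2_score` and `mean_absolute_error`, which should give the same results.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

errors = y_true - y_pred

# MAE: average size of the miss, in the same units as the target
mae = np.mean(np.abs(errors))

# RMSE: like MAE, but squaring penalizes large misses more heavily
rmse = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus (unexplained variation / total variation around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²:   {r2:.3f}")
```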

---


### 💡 Tips for Good Predictions

- Make sure your features (X) are numeric
- Remove or fill missing values
- Avoid using features that "leak" future information
- Always evaluate on a held-out test set: the train/test split is what reveals overfitting
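
The first two tips might look like this in practice. The table is made up (the `Channel` column is hypothetical), and median filling plus one-hot encoding are just one reasonable combination.

```python
import numpy as np
import pandas as pd

# Hypothetical marketing data with one missing value and one text column
df = pd.DataFrame({
    "Ad_Spend": [100.0, 200.0, np.nan, 400.0],
    "Channel": ["Email", "Social", "Email", "Search"],
    "Total_Revenue": [1000.0, 1900.0, 1500.0, 4100.0],
})

# Fill the missing spend with the column median
df["Ad_Spend"] = df["Ad_Spend"].fillna(df["Ad_Spend"].median())

# One-hot encode the text column so that every feature is numeric
X = pd.get_dummies(df[["Ad_Spend", "Channel"]], columns=["Channel"], dtype=int)
y = df["Total_Revenue"]

print(X)
```

In a stricter workflow the median would be computed on the training rows only and then applied to the test rows, so that no information from the test set leaks into preparation.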

### 🤖 scikit-learn Model Summary

| Step        | Function                              |
| ----------- | ------------------------------------- |
| Split data  | `train_test_split()`                  |
| Train model | `model.fit()`                         |
| Predict     | `model.predict()`                     |
| Evaluate    | `r2_score()`, `mean_absolute_error()` |



---

## 🧪 Practice Exercises

### 1. Load dataset and check shape

```python
import pandas as pd

df = pd.read_csv("sales_prediction.csv")
print(df.shape)
print(df.head())
```



### 2. Define feature matrix X and target y

```python
X = df[["Ad_Spend", "Email_Spend", "Social_Spend"]]
y = df["Total_Revenue"]
```



### 3. Train/test split

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
```


### 4. Train model

```python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
```


### 5. Make predictions

```python
y_pred = model.predict(X_test)
```


### 6. Evaluate model

```python
from sklearn.metrics import mean_absolute_error, r2_score

print("MAE:", mean_absolute_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
```


### 7. Visualize actual vs predicted

```python
import matplotlib.pyplot as plt

plt.scatter(y_test, y_pred)
plt.xlabel("Actual Revenue")
plt.ylabel("Predicted Revenue")
plt.title("Actual vs Predicted")
# Dashed diagonal: points on this line are perfect predictions
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "r--")
plt.show()
```



---

## 🛠️ Mini-Project: **Predict Sales or Attrition**

### 🎯 Goal

Use a real dataset to train and evaluate a simple linear regression model.

---

### 🧩 Dataset Ideas

| Dataset         | Predict…                         |
|----------------|-----------------------------------|
| 📊 Marketing    | Revenue from ad spend             |
| 👥 HR           | Salary from years of experience   |
| 🛒 Retail       | Sales from price/discount         |

---

### ✅ Tasks

1. Load and clean your dataset
2. Choose numeric input features (X) and target (y)
3. Split into train and test
4. Train `LinearRegression()` model
5. Evaluate using:
   - R² Score
   - MAE / RMSE
6. Visualize predictions vs actual
7. Bonus: Interpret coefficients
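
For the bonus task, coefficients are easier to read when paired with their feature names. The sketch below uses synthetic, noise-free data with a known relationship (revenue = 50 + 3 × ad spend + 1.5 × email spend) so the recovered values can be checked against the truth; real data will not be this clean.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features with a known, noise-free relationship to the target
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Ad_Spend": rng.uniform(0, 100, size=50),
    "Email_Spend": rng.uniform(0, 100, size=50),
})
y = 50 + 3 * X["Ad_Spend"] + 1.5 * X["Email_Spend"]

model = LinearRegression().fit(X, y)

# Pair each coefficient with its column name
coefs = pd.Series(model.coef_, index=X.columns)
print(coefs.round(2))
print("Intercept:", round(model.intercept_, 2))
```

Each coefficient reads as "the predicted change in the target for a one-unit increase in this feature, holding the other features fixed". Here, one extra unit of ad spend is associated with 3 extra units of revenue.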

---

### 🧩 Starter Code Snippet

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error

df = pd.read_csv("your_data.csv")

# Clean/prep data
X = df[["Feature1", "Feature2"]]
y = df["Target"]

# random_state makes the split reproducible between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)

print("R²:", r2_score(y_test, predictions))
print("MAE:", mean_absolute_error(y_test, predictions))
```


### 🚀 Stretch Goals

- Plot residuals (actual - predicted)
- Standardize features using StandardScaler
- Try polynomial regression for curve-fitting
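
A compact way to try these stretch goals together is a scikit-learn pipeline. The sketch below is one possible arrangement on synthetic curved data (y = 2 + 0.5·x² plus a little noise); the degree, noise level, and seeds are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic curved relationship
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=200).reshape(-1, 1)
y = 2 + 0.5 * x[:, 0] ** 2 + rng.normal(0, 0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

# Straight-line model on standardized input
linear = make_pipeline(StandardScaler(), LinearRegression())

# Same model, but with x and x² as inputs
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)

linear.fit(X_train, y_train)
poly.fit(X_train, y_train)

# For regressors, score() returns R² on the given data
print("Linear R²:", round(linear.score(X_test, y_test), 3))
print("Polynomial R²:", round(poly.score(X_test, y_test), 3))

# Residuals = actual - predicted; they should scatter around zero with no pattern
residuals = y_test - poly.predict(X_test)
print("Mean residual:", round(residuals.mean(), 3))
```

Putting the scaler inside the pipeline means it is fitted on the training data only, which avoids leaking test-set statistics into the model.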

### Week 16: Final Capstone Project
**Choose one real dataset and business problem:**
* Retail Sales Dashboard
* HR Attrition Analysis
* Marketing Funnel Performance
* Financial KPIs Tracker
* Survey Sentiment and NPS Analysis

**Deliverables:**
* Cleaned dataset
* Data transformation notebook
* 3–5 meaningful visualizations
* Insightful summary (Markdown / PowerPoint)
* GitHub repo or downloadable PDF


### 🎓 Purpose

Build and deliver a real-world data project from start to finish, using everything you’ve learned:

- Python foundations
- Data cleaning
- Pandas, NumPy
- Visualization (Matplotlib, Seaborn)
- (Optional) Predictive modeling

### 🧩 Project Options

Choose one dataset and business question:

| Project                       | Dataset Type           | Goal                                      |
| ----------------------------- | ---------------------- | ----------------------------------------- |
| 🛒 **Retail Sales Dashboard** | Product sales (CSV)    | Analyze revenue, seasonality, products    |
| 👥 **HR Attrition Analysis**  | Employee data          | Who is leaving and why?                   |
| 📈 **Marketing Funnel KPIs**  | Campaign tracking      | Track conversions, drop-off points        |
| 💸 **Financial KPI Tracker**  | Revenue & cost trends  | Visualize and forecast KPIs               |
| 💬 **Survey NPS + Sentiment** | Survey + open feedback | Clean text + analyze satisfaction metrics |


### 📁 Final Deliverables

| Component           | Required? | Format                           |
| ------------------- | --------- | -------------------------------- |
| 📄 Cleaned Dataset  | ✅         | CSV or processing code           |
| 🧹 Data Notebook    | ✅         | `.ipynb` (Jupyter/Colab)         |
| 📊 Visuals          | ✅         | Seaborn / Matplotlib charts      |
| 📋 Insight Summary  | ✅         | Markdown or Slides               |
| 🤖 Predictive Model | Optional  | Linear regression (scikit-learn) |
| 💾 Submission       | ✅         | GitHub repo or ZIP folder        |


### ✅ Workflow Checklist

### STEP 1: Load & Explore

- [ ] Read dataset with `pd.read_csv()`
- [ ] Run `.info()` and `.describe()`
- [ ] Check nulls, dtypes, duplicates

### STEP 2: Clean Data

- [ ] Handle missing values (drop or fill)
- [ ] Rename columns
- [ ] Convert dates
- [ ] Create new columns if needed

### STEP 3: Analyze & Transform

- [ ] Aggregations: `.groupby()`, `.agg()`
- [ ] Feature engineering: `.apply()`, `.map()`
- [ ] Sort, filter, slice
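
A short sketch of what STEP 3 can look like is below. The rows are hypothetical (loosely modelled on the retail sample later in this section), and the region-to-team mapping is invented for illustration.

```python
import pandas as pd

# Hypothetical order rows
df = pd.DataFrame({
    "Region": ["West", "South", "West", "East"],
    "UnitsSold": [10, 5, 3, 8],
    "Revenue": [1800.0, 3400.0, 960.0, 640.0],
})

# Aggregation: named aggregation gives each output column a readable name
summary = df.groupby("Region").agg(
    total_revenue=("Revenue", "sum"),
    avg_units=("UnitsSold", "mean"),
    orders=("Revenue", "count"),
)

# Feature engineering: .map() with a dict translates each value into a new label
region_team = {"West": "Team A", "South": "Team B", "East": "Team B"}
df["Team"] = df["Region"].map(region_team)

# Sort and filter: regions ranked by revenue, keeping those above 1000
top = summary.sort_values("total_revenue", ascending=False)
print(top[top["total_revenue"] > 1000])
```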

### STEP 4: Visualize

- [ ] Distribution plots (hist, box, violin)
- [ ] Relationships (scatter, lmplot)
- [ ] Group comparisons (bar, line, heatmap)
- [ ] Time trends (resample, rolling)

### STEP 5 (Optional): Predict

- [ ] Split into X, y
- [ ] Train/test split
- [ ] LinearRegression fit + predict
- [ ] Evaluate with R², MAE

### STEP 6: Communicate

- [ ] Write insights in Markdown or PPT
- [ ] Add titles to all charts
- [ ] Answer business question(s)


### 📊 Example Visuals to Include

| Chart Type  | Shows                                     |
| ----------- | ----------------------------------------- |
| Line Plot   | Trends over time                          |
| Bar Plot    | Group comparisons (e.g., region, segment) |
| Boxplot     | Distribution across categories            |
| Heatmap     | Correlation matrix                        |
| Scatterplot | Relationship between 2 variables          |


### 🧠 Insight Summary Template

You can deliver insights as:

- Markdown cell summary (in your `.ipynb`)
- PowerPoint slides
- README in your GitHub repo

Sample structure:

```markdown
## 📊 Capstone Project: Retail Sales Dashboard

### 🔍 Key Questions
- What regions bring the most revenue?
- Are there seasonal sales trends?
- What product categories perform best?

### 📈 Insights
- Western region has 45% higher avg monthly revenue
- Sales peak during Q4 (holiday season)
- Office Supplies outperform Furniture in margin

### 🛠 Recommendations
- Increase Q4 stock for high-volume items
- Focus marketing on Western & Southern regions
- Optimize Furniture discount strategy

### 🧠 Bonus Insight
- Regression shows 72% of revenue variance explained by ad spend
```


### 🗂️ Submission Options

- GitHub repo with:
  - `notebook.ipynb`
  - `README.md`
  - `charts/` folder (optional)
- OR a ZIP folder with the notebook, dataset, and slides

### 🏁 Grading Criteria (if applicable)

| Area            | Weight |
| --------------- | ------ |
| Data Cleaning   | 25%    |
| Transformations | 20%    |
| Visuals         | 25%    |
| Insights/Story  | 20%    |
| Bonus: Modeling | 10%    |


### 🎉 You're Done!

Once this project is complete, you’ll have a portfolio-ready data project to share with:

- Hiring managers
- Clients
- Stakeholders

### 🛒 Sample Project: Retail Sales Dashboard

### 📁 Dataset: retail_sales.csv

Here’s a sample structure:

```csv
OrderDate,Region,ProductCategory,ProductName,UnitsSold,UnitPrice,Discount,Revenue
2024-01-01,West,Office Supplies,Printer,10,200,0.1,1800
2024-01-02,South,Technology,Laptop,5,800,0.15,3400
2024-01-03,West,Furniture,Desk,3,400,0.2,960
...
```


You can create your own dataset that follows this structure.

### 🎯 Business Questions

You'll answer:

- Which region is most profitable?
- Are seasonal or monthly sales trends evident?
- Which product categories perform best?
- How does discounting affect revenue?

### 🧪 Starter Notebook Template

```python
# ✅ Retail Sales Dashboard – Starter

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load Data
df = pd.read_csv("retail_sales.csv", parse_dates=["OrderDate"])
df.info()
df.head()

# 2. Clean & Engineer Features
df["Month"] = df["OrderDate"].dt.to_period("M")
df["EffectivePrice"] = df["UnitPrice"] * (1 - df["Discount"])
df["TotalRevenue"] = df["UnitsSold"] * df["EffectivePrice"]

# 3. Monthly Sales Trend
monthly = df.groupby("Month")["TotalRevenue"].sum()
monthly.plot(title="Monthly Revenue Trend", figsize=(10, 4))
plt.ylabel("Revenue")
plt.show()

# 4. Revenue by Region
sns.barplot(data=df, x="Region", y="TotalRevenue", estimator=sum)
plt.title("Revenue by Region")
plt.show()

# 5. Category Analysis
sns.boxplot(data=df, x="ProductCategory", y="TotalRevenue")
plt.title("Revenue Distribution by Category")
plt.show()

# 6. Discount vs Revenue Scatter
sns.scatterplot(data=df, x="Discount", y="TotalRevenue", alpha=0.6)
plt.title("Discount Impact on Revenue")
plt.show()
```
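
Before charting, it can be worth checking that the engineered `TotalRevenue` agrees with the `Revenue` column already in the file. The sketch below does this on the three sample rows shown earlier, inlined so it runs without `retail_sales.csv`; it is a sanity check under the assumption that `Revenue` is meant to equal units × price × (1 − discount).

```python
import io

import numpy as np
import pandas as pd

# The three sample rows from retail_sales.csv, inlined to keep the sketch self-contained
csv_text = """OrderDate,Region,ProductCategory,ProductName,UnitsSold,UnitPrice,Discount,Revenue
2024-01-01,West,Office Supplies,Printer,10,200,0.1,1800
2024-01-02,South,Technology,Laptop,5,800,0.15,3400
2024-01-03,West,Furniture,Desk,3,400,0.2,960
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["OrderDate"])

df["TotalRevenue"] = df["UnitsSold"] * df["UnitPrice"] * (1 - df["Discount"])

# Compare with a tolerance, because floating-point results are rarely exactly equal
mismatch = df[~np.isclose(df["TotalRevenue"], df["Revenue"])]
print("Rows where computed and recorded revenue differ:", len(mismatch))

# Revenue per region, largest first
by_region = df.groupby("Region")["TotalRevenue"].sum().sort_values(ascending=False)
print(by_region)
```

Note that the dataset has no cost column, so "most profitable" can only be approximated by revenue unless cost data is added.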


### 📋 Markdown Summary Example

```markdown
## Capstone Project Summary: Retail Sales Dashboard

### 🔍 Business Questions
- What region generates the most revenue?
- How do discounts affect performance?
- Are there product categories with high returns?

### 📈 Key Findings
- The **West region** had 40% more revenue than others.
- **Technology** products outperform in average revenue per unit.
- High discounts (>20%) reduce total revenue below average.

### 💡 Recommendations
- Run promotions in low-performing regions.
- Reduce discounts on high-performing tech items.
- Focus Q4 inventory on Office Supplies & Technology.
```


### 🔍 Summary: How Does It Stack Up?

- The course is rigorous for a self-paced or part-time structure.
- It focuses on hands-on learning rather than theory or slides alone.
- Learners finish with portfolio-ready projects that resemble on-the-job work.
- Python is the primary data language here, whereas many popular certificates (for example, Google's) emphasize spreadsheets, SQL, and BI tools more than programming.