In [1]:
# Import required libraries for data manipulation and fetching online data
import pandas as pd
import numpy as np
# For fetching stock data; students need to install yfinance
# !pip install yfinance  # Uncomment to install if needed
import yfinance as yf

In [2]:
# --- Section 1: Loading Data from Online Sources ---
# Load the Titanic dataset from a GitHub URL, a popular dataset for learning
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df_titanic = pd.read_csv(url)
print("Step 1a: Titanic dataset loaded with", df_titanic.shape[0], "passengers and", df_titanic.shape[1], "columns.")

Step 1a: Titanic dataset loaded with 891 passengers and 12 columns.


In [3]:
# Fetch Apple stock data from 2020 to 2023 using yfinance for time series practice
df_stock = yf.download('AAPL', start='2020-01-01', end='2023-01-01', progress=False)
print("Step 1b: Apple stock data loaded with", df_stock.shape[0], "days and", df_stock.shape[1], "columns.")


Step 1b: Apple stock data loaded with 756 days and 5 columns.


# Understanding Section 1: Loading Data from Online Sources

This section is like opening two treasure chests full of stories! We’re going to grab data from the internet—one about the Titanic passengers and another about Apple stock prices. Let’s see how we bring these treasures into our computer using Pandas to start our learning adventure.

- **Why It Matters**: Loading data is the first step for data analysts to explore real-world tales, like who survived the Titanic or how Apple’s stock moved, skills you’ll use in jobs!
- **What’s Next**: We’ll check out these treasures and clean them up—stay tuned!

# Titanic Dataset Columns Dictionary

Below is a simple dictionary representing the columns of the Titanic dataset, like a quick guide to the passenger information we’re exploring!

- **Example Dictionary for First Row**:
  - `{'PassengerId': 1, 'Survived': 0, 'Pclass': 3, 'Name': 'Braund, Mr. Owen Harris', 'Sex': 'male', 'Age': 22.0, 'SibSp': 1, 'Parch': 0, 'Ticket': 'A/5 21171', 'Fare': 7.2500, 'Cabin': nan, 'Embarked': 'S'}`
    - `PassengerId`: A unique number for each passenger (e.g., 1).
    - `Survived`: Did they survive? (0 = No, 1 = Yes).
    - `Pclass`: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
    - `Name`: Full name of the passenger (e.g., 'Braund, Mr. Owen Harris').
    - `Sex`: Gender of the passenger (e.g., 'male').
    - `Age`: Age in years (e.g., 22.0, some missing).
    - `SibSp`: Number of siblings or spouses aboard (e.g., 1).
    - `Parch`: Number of parents or children aboard (e.g., 0).
    - `Ticket`: Ticket number (e.g., 'A/5 21171').
    - `Fare`: Ticket price in pounds (e.g., 7.2500).
    - `Cabin`: Cabin number (e.g., `NaN` when missing; this is a special "no value" marker, not the text 'NaN').
    - `Embarked`: Port of embarkation (e.g., 'S' for Southampton).

- **What It Means**: 
  - This dictionary is like a passenger’s ID card! Each column tells us something about them, from their ticket class to whether they survived.
  - **Analogy**: Think of it as a checklist a ship captain might use to track everyone on board!
- **Why It’s Useful**: As a data scientist, you’ll use these details to predict survival or analyze trends, a skill for jobs like data analysis in history or safety studies!
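
To see how one row turns into a dictionary like the one above, here is a minimal sketch. It uses a small hand-made table (the `sample` DataFrame is a hypothetical stand-in with only a few of the Titanic columns), so it runs without an internet connection.

```python
import pandas as pd
import numpy as np

# Hypothetical two-passenger sample with a few of the Titanic columns
sample = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Age": [22.0, 38.0],
    "Cabin": [np.nan, "C85"],
})

# Turn the first row into a plain Python dictionary (column name -> value)
first_row = sample.iloc[0].to_dict()
print(first_row)

# A missing cabin is a real missing value, not the text 'NaN'
print(pd.isna(first_row["Cabin"]))
```

On the real data, the same idea would be `df_titanic.iloc[0].to_dict()`.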

# Loading and Representing Apple Stock Data

- **Code**: 
  - `df_stock = yf.download('AAPL', start='2020-01-01', end='2023-01-01', progress=False)`
    - Uses `yfinance` (`yf`) to grab Apple’s stock prices from January 1, 2020, up to (but not including) January 1, 2023, like collecting daily reports from a stock market diary. Markets are closed on New Year’s Day, so the first row is January 2, 2020.
    - `progress=False` keeps it quiet while loading, avoiding distractions.
  - `print("Step 1b: Apple stock data loaded with", df_stock.shape[0], "days and", df_stock.shape[1], "columns.")`
    - Tells us how many days (rows) and details (columns) we got, like counting trading days and their info!

- **What It Means**: 
  - Imagine you’re a treasure hunter collecting a logbook of Apple’s stock prices over three years. This code brings that logbook into our computer to study market trends.
  - **Analogy**: It’s like asking a stock market clerk to hand you a daily record book, and we’ll use it to predict future prices!
- **Dictionary Representation**: Let’s peek at the data as a mini-dictionary for the first day (2020-01-02), using the values this run printed in Step 2d (rounded to cents):
  - Example: 
    - `{'Date': '2020-01-02', 'Close': 72.62, 'High': 72.68, 'Low': 71.37, 'Open': 71.63, 'Volume': 135480400}`
    - This is like a snapshot of the day: closing price ($72.62), highest ($72.68), lowest ($71.37), opening price ($71.63), and trading volume (135,480,400 shares). This run returned five columns with no separate 'Adj Close'; the prices appear to be already adjusted, and other `yfinance` versions may return a different layout.
- **Why It’s Useful**: As a data scientist, you could use this to predict stock trends or advise investors, a hot skill in finance jobs!
- **What’s Next**: Let’s explore these treasures and clean them up to find hidden gems!
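
The stock table printed in Step 2d has a two-level header ('Price' on top, 'Ticker' underneath). The sketch below shows one way to flatten that header before building a one-day dictionary. The `stock_sample` table is a hypothetical stand-in typed in by hand from the first two rows of that output, so no download is needed.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded data: two rows copied from the Step 2d output
columns = pd.MultiIndex.from_product(
    [["Close", "High", "Low", "Open", "Volume"], ["AAPL"]],
    names=["Price", "Ticker"],
)
stock_sample = pd.DataFrame(
    [
        [72.620834, 72.681281, 71.373211, 71.627084, 135480400],
        [71.914818, 72.676447, 71.689957, 71.847118, 146322800],
    ],
    index=pd.to_datetime(["2020-01-02", "2020-01-03"]),
    columns=columns,
)
stock_sample.index.name = "Date"

# Drop the 'Ticker' level so each column has one simple name
flat = stock_sample.droplevel("Ticker", axis=1)

# One trading day as a dictionary (column name -> value)
first_day = flat.iloc[0].to_dict()
print(first_day)
```

Flattening is optional here because only one ticker was downloaded; with several tickers, the second level is what tells the columns apart.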

In [5]:
# --- Section 2: Data Inspection ---
# Display the first 5 rows of the Titanic dataset to understand its structure
pd.set_option('display.width', 1000)
print("\nStep 2a: First 5 rows of Titanic dataset:")
print(df_titanic.head())


Step 2a: First 5 rows of Titanic dataset:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S


In [6]:
# Show DataFrame info (column names, data types, non-null counts)
print("\nStep 2b: Titanic DataFrame Info:")
df_titanic.info()  # info() prints its own report and returns None, so wrapping it in print() would add a stray "None"


Step 2b: Titanic DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [7]:
# Display summary statistics for numerical columns (count, mean, std, etc.)
print("\nStep 2c: Titanic Summary Statistics:")
print(df_titanic.describe())


Step 2c: Titanic Summary Statistics:
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200


In [8]:
# Display the first 5 rows of stock data to see time series structure
print("\nStep 2d: Apple Stock Data (First 5 rows):")
print(df_stock.head())


Step 2d: Apple Stock Data (First 5 rows):
Price           Close       High        Low       Open     Volume
Ticker           AAPL       AAPL       AAPL       AAPL       AAPL
Date                                                             
2020-01-02  72.620834  72.681281  71.373211  71.627084  135480400
2020-01-03  71.914818  72.676447  71.689957  71.847118  146322800
2020-01-06  72.487846  72.526533  70.783248  71.034709  118387200
2020-01-07  72.146942  72.753823  71.926915  72.497529  108872000
2020-01-08  73.307510  73.609745  71.849533  71.849533  132079200


In [41]:
# Interactive: Ask students to input a column to explore
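column_to_check = input("Step 2e: Enter a Titanic column name to see its unique values (e.g., 'Pclass'): ")
if column_to_check in df_titanic.columns:  # Guard against typos, which would otherwise raise a KeyError
    print(f"\nUnique values in {column_to_check}:")
    print(df_titanic[column_to_check].unique())
else:
    print("Column not found. Try 'Pclass', 'Sex', or 'Embarked'.")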


Unique values in Pclass:
[3 1 2]


In [10]:
# --- Section 3: Data Cleaning ---
# Check for missing values in each column of the Titanic dataset
print("\nStep 3a: Missing Values in Titanic Dataset:")
print(df_titanic.isnull().sum())


Step 3a: Missing Values in Titanic Dataset:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


In [11]:
# Fill missing 'Age' values with the median age, which is less affected by extreme ages than the mean
median_age = df_titanic['Age'].median()
df_titanic['Age'] = df_titanic['Age'].fillna(median_age)
print("\nStep 3b: Missing 'Age' filled with median:", median_age)


Step 3b: Missing 'Age' filled with median: 28.0


# Understanding Step 3b: Filling Missing 'Age' Values

This step is like filling in the blanks on a family photo album! We’re fixing missing ages in the Titanic dataset to keep our passenger story complete. Let’s break down the code and explore how we do this, plus other ways to fill gaps.

## What Are We Doing?
- **Definition**: We’re replacing missing age values with a typical age to avoid losing data, helping our analysis stay strong.
- **Why It Matters**: As a data scientist, you’ll clean data like this to predict survival or analyze trends, a key skill for jobs in history or safety research!

## Breaking Down the Code
- **Code**: 
  - `median_age = df_titanic['Age'].median()`
    - Finds the middle age: line everyone up from youngest to oldest and pick the person standing in the center. (The most common age would be the mode, and the average would be the mean.)
  - `df_titanic['Age'] = df_titanic['Age'].fillna(median_age)`
    - Fills all the blank age spots with that middle age, like pasting a photo in every empty album slot.
  - `print("\nStep 3b: Missing 'Age' filled with median:", median_age)`
    - Tells us the middle age used, so we know what filled the gaps!

- **What It Means**: 
  - Imagine you’re organizing a class photo, but some kids forgot their ages. You line up everyone whose age you know, take the age in the middle (here, 28), and write that for the missing ones. This keeps the album full!
  - **Analogy**: It’s like a teacher filling in a lost record with the age of the student in the middle of the line-up!
- **Why It’s Useful**: The median is not pulled around by a few very old or very young passengers, so the filled-in value stays typical. The trade-off is that all 177 filled rows share the same age, which makes the ages look a little less spread out than they really were.

## Other Filling Techniques
- **Mean Filling**: Replaces missing ages with the average age (e.g., 29.7 if that’s the mean). Like asking the whole class for their age total and dividing—simple but can be pulled by extreme values (e.g., a 70-year-old).
- **Mode Filling**: Uses the most common value (e.g., 24 if that is the age that appears most often). Like picking the age most kids share. It is the usual choice for categories such as 'Embarked', but for ages it can over-repeat one value.
- **Forward Fill**: Copies the age from the passenger before (e.g., 25 for the next if the last was 25). Like filling a missing photo with the one before it. It suits ordered data like time logs, not a passenger list where neighboring rows are unrelated.
- **Backward Fill**: Copies the age from the passenger after. Like using the next photo’s age. Also for ordered data.
- **Interpolation**: Guesses missing values along a line between the known values on either side (a straight line by default). Like drawing a line to fill gaps in a growth chart. Fancy, but it needs a pattern along the row order.
- **Why Choose?**: The median is a solid default for the Titanic because a few much older passengers pull the mean (29.7) above the median (28). You’d pick based on your data as a data analyst!
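
To compare the filling techniques above side by side, here is a small sketch on a made-up list of ages (the `ages` values are hypothetical and are not taken from the Titanic data). Each column of the result shows what one technique writes into the two gaps.

```python
import pandas as pd
import numpy as np

# Hypothetical ages with two gaps; the 70 plays the role of an extreme value
ages = pd.Series([20.0, np.nan, 24.0, 24.0, np.nan, 70.0])

filled = pd.DataFrame({
    "original": ages,
    "mean": ages.fillna(ages.mean()),        # average of the known ages
    "median": ages.fillna(ages.median()),    # middle of the known ages
    "mode": ages.fillna(ages.mode()[0]),     # most common known age
    "ffill": ages.ffill(),                   # copy the previous row
    "bfill": ages.bfill(),                   # copy the next row
    "interpolate": ages.interpolate(),       # straight line between neighbors
})
print(filled)
```

Notice how the single 70 drags the mean up to 34.5 while the median stays at 24, which is the reason the notebook prefers the median for 'Age'.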

## What’s Next?
- We’ll clean up more missing spots and get ready to find patterns. Keep exploring this passenger story!
- *Note*: This dataset has historical gaps, which we’ll handle fairly in Week 4. For now, enjoy filling these blanks!

In [12]:
# Fill missing 'Embarked' with the most common port (mode) to maintain consistency
most_common_embarked = df_titanic['Embarked'].mode()[0]
df_titanic['Embarked'] = df_titanic['Embarked'].fillna(most_common_embarked)
print("Step 3c: Missing 'Embarked' filled with mode:", most_common_embarked)

Step 3c: Missing 'Embarked' filled with mode: S


In [13]:
# Drop 'Cabin' column due to excessive missing values, reducing noise
df_titanic = df_titanic.drop('Cabin', axis=1)
print("\nStep 3d: 'Cabin' column dropped due to many missing values.")


Step 3d: 'Cabin' column dropped due to many missing values.


# Understanding Step 3d: Dropping the 'Cabin' Column

This step is like cleaning up a messy drawer! We’re getting rid of the 'Cabin' column in the Titanic dataset because it has too many missing entries, which can confuse our story. Let’s break down the code to see how we tidy things up.

## What Are We Doing?
- **Definition**: We’re removing the 'Cabin' column because it has lots of blank spots, helping our data stay clear and useful.
- **Why It Matters**: As a data scientist, you’ll clean data like this to focus on what helps predict outcomes, like survival rates, a key skill for jobs in research or safety analysis!

## Breaking Down the Code
- **Code**: 
  - `df_titanic = df_titanic.drop('Cabin', axis=1)`
    - Tells Pandas to throw out the 'Cabin' column from our table (`df_titanic`).
    - `axis=1` means we’re dropping a column (not a row), like pulling out a cluttered drawer section.
  - `print("\nStep 3d: 'Cabin' column dropped due to many missing values.")`
    - Lets us know the messy column is gone, like a note saying the drawer is cleaner!

- **What It Means**: 
  - Imagine you’re sorting a passenger list, but the 'Cabin' section has too many blank spots because cabins weren’t recorded. We toss it out to avoid guessing, keeping our focus on solid info like age or fare.
  - **Analogy**: It’s like a librarian throwing away a torn page with missing names to keep the book readable!
- **Why It’s Useful**: Dropping noisy data helps our model predict better, like ignoring bad clues in a mystery game.
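
How many blanks are "too many"? In the Titanic data, 'Cabin' is missing for 687 of 891 passengers (about 77%). The sketch below measures the share of missing values per column and drops any column above a cutoff. The `toy` table and the 50% `threshold` are assumptions chosen for illustration, not a rule from this notebook.

```python
import pandas as pd
import numpy as np

# Hypothetical mini-table with the same kind of gaps as the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Share of missing values per column (the mean of True/False flags)
missing_share = toy.isnull().mean()
print(missing_share)

# Drop every column whose missing share is above the cutoff
threshold = 0.5  # hypothetical cutoff; pick one that suits your data
to_drop = missing_share[missing_share > threshold].index
cleaned = toy.drop(columns=to_drop)
print(list(cleaned.columns))
```

Here 'Age' (25% missing) stays and can be filled, while 'Cabin' (75% missing) is dropped.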

## What’s Next?
- We’ll check if any other gaps need fixing and get ready to find patterns. Keep exploring this cleaner dataset!
- *Note*: This dataset has historical gaps, which we’ll handle fairly in Week 4. For now, enjoy tidying up!

In [14]:
# Verify no missing values remain after cleaning
print("Step 3e: Missing Values After Cleaning:")
print(df_titanic.isnull().sum())

Step 3e: Missing Values After Cleaning:
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64


In [15]:
# Check and remove duplicate rows in the Titanic dataset
print("\nStep 3f: Number of Duplicate Rows:", df_titanic.duplicated().sum())
df_titanic = df_titanic.drop_duplicates()
print("Step 3g: Duplicate rows removed. New shape:", df_titanic.shape)


Step 3f: Number of Duplicate Rows: 0
Step 3g: Duplicate rows removed. New shape: (891, 11)


In [16]:
# Clean stock data: Handle any missing values with forward fill
# (ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions)
df_stock = df_stock.ffill()
print("\nStep 3h: Missing values in stock data filled with forward fill.")


Step 3h: Missing values in stock data filled with forward fill.




In [17]:
# --- Section 4: Data Transformation ---
# Filter passengers older than 30 to focus on a specific group
older_than_30 = df_titanic[df_titanic['Age'] > 30]
print("\nStep 4a: Passengers older than 30 (first 5):")
print(older_than_30.head())


Step 4a: Passengers older than 30 (first 5):
    PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch    Ticket     Fare Embarked
1             2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599  71.2833        C
3             4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0    113803  53.1000        S
4             5         0       3                           Allen, Mr. William Henry    male  35.0      0      0    373450   8.0500        S
6             7         0       1                            McCarthy, Mr. Timothy J    male  54.0      0      0     17463  51.8625        S
11           12         1       1                           Bonnell, Miss. Elizabeth  female  58.0      0      0    113783  26.5500        S


In [18]:
# Sort passengers by 'Fare' in descending order to identify high spenders
sorted_by_fare = df_titanic.sort_values(by='Fare', ascending=False)
print("\nStep 4b: Top 5 passengers by fare:")
print(sorted_by_fare[['Name', 'Fare']].head())


Step 4b: Top 5 passengers by fare:
                                   Name      Fare
258                    Ward, Miss. Anna  512.3292
737              Lesurer, Mr. Gustave J  512.3292
679  Cardeza, Mr. Thomas Drake Martinez  512.3292
88           Fortune, Miss. Mabel Helen  263.0000
27       Fortune, Mr. Charles Alexander  263.0000


In [19]:
# Group by 'Pclass' and calculate mean 'Fare' to analyze class-based pricing
mean_fare_by_class = df_titanic.groupby('Pclass')['Fare'].mean()
print("\nStep 4c: Mean Fare by Pclass:")
print(mean_fare_by_class)


Step 4c: Mean Fare by Pclass:
Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64


In [20]:
# Interactive: Group by a column chosen by students
group_by_col = input("Step 4d: Enter a column to group by (e.g., 'Sex'): ")
if group_by_col in df_titanic.columns:
    group_result = df_titanic.groupby(group_by_col)['Fare'].mean()
    print(f"\nMean Fare by {group_by_col}:")
    print(group_result)
else:
    print("Column not found. Try 'Sex', 'Pclass', or 'Embarked'.")


Mean Fare by Sex:
Sex
female    44.479818
male      25.523893
Name: Fare, dtype: float64


In [21]:
# --- Section 5: Feature Engineering ---
# Create 'FamilySize' feature by summing 'SibSp', 'Parch', and 1 (self)
df_titanic['FamilySize'] = df_titanic['SibSp'] + df_titanic['Parch'] + 1
print("\nStep 5a: Added 'FamilySize' feature. First 5 rows:")
print(df_titanic[['SibSp', 'Parch', 'FamilySize']].head())


Step 5a: Added 'FamilySize' feature. First 5 rows:
   SibSp  Parch  FamilySize
0      1      0           2
1      1      0           2
2      0      0           1
3      1      0           2
4      0      0           1


# Understanding Section 5: Feature Engineering

This section is like being a chef who mixes new ingredients to make a tastier dish! We’re creating a new data piece called 'FamilySize' to help us understand the Titanic passengers better. Let’s break down the code and learn what feature engineering is all about.

## What is Feature Engineering?
- **Definition**: Feature engineering is when we make new data columns from existing ones to give our model better clues for predicting things, like who survived.
- **Simple Analogy**: Imagine you’re baking a cake. You have flour and sugar, but you mix them into batter to make it yummier. Feature engineering is like that—it combines raw data (like family members) into a new, helpful ingredient!
- **Why It Matters**: As a data scientist, you’ll create features like this to improve predictions, a key skill for jobs in research or business!

## Breaking Down the Code
- **Code**: 
  - `# Create 'FamilySize' feature by summing 'SibSp', 'Parch', and 1 (self)`
    - Tells us we’re making a new column called 'FamilySize' by adding up family details.
  - `df_titanic['FamilySize'] = df_titanic['SibSp'] + df_titanic['Parch'] + 1`
    - Adds `SibSp` (siblings/spouses), `Parch` (parents/children), and 1 (the passenger themselves) for each row.
    - Like counting everyone in a family: siblings, parents, and you!
  - `print("\nStep 5a: Added 'FamilySize' feature. First 5 rows:")`
    - Lets us know the new feature is ready and shows the first five examples.
  - `print(df_titanic[['SibSp', 'Parch', 'FamilySize']].head())`
    - Displays a table with `SibSp`, `Parch`, and `FamilySize` for the first five passengers, like a family roll call!

- **What It Means**: 
  - Imagine you’re tracking a passenger’s family on the Titanic. If they have 1 sibling or spouse aboard and 0 parents or children, plus themselves (1), their 'FamilySize' is 2. This new number helps us see if bigger families survived differently.
  - **Analogy**: It’s like a teacher adding up students, their siblings, and parents to figure out class family sizes for a group project!
- **Why It’s Useful**: A bigger 'FamilySize' might mean more support or chaos, which could affect survival. This new clue helps our model guess better!
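
Once a feature like 'FamilySize' exists, a quick way to check whether it carries a useful clue is to group by it. The sketch below does this on a made-up four-passenger table (the `toy` values are hypothetical), so the numbers say nothing about the real Titanic.

```python
import pandas as pd

# Hypothetical passengers: siblings/spouses, parents/children, and survival flag
toy = pd.DataFrame({
    "SibSp": [1, 0, 3, 0],
    "Parch": [0, 0, 2, 0],
    "Survived": [1, 0, 0, 1],
})

# Same recipe as Step 5a: relatives aboard plus the passenger themselves
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Share of survivors for each family size
survival_by_size = toy.groupby("FamilySize")["Survived"].mean()
print(survival_by_size)
```

On the real data, the same `groupby` on `df_titanic` would show whether survival differs across family sizes.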

## What’s Next?
- We’ll create more helpful features and get ready to predict survival. Keep cooking up these data ingredients!
- *Note*: This dataset has historical gaps, which we’ll handle fairly in Week 4. For now, enjoy building new features!

In [22]:
# One-hot encode 'Sex' and 'Embarked' for ML model compatibility
df_titanic = pd.get_dummies(df_titanic, columns=['Sex', 'Embarked'], drop_first=True)
print("\nStep 5b: One-hot encoded 'Sex' and 'Embarked'. New columns:")
print(list(df_titanic.columns))


Step 5b: One-hot encoded 'Sex' and 'Embarked'. New columns:
['PassengerId', 'Survived', 'Pclass', 'Name', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'FamilySize', 'Sex_male', 'Embarked_Q', 'Embarked_S']


# Understanding Step 5b: One-Hot Encoding 'Sex' and 'Embarked'

This step is like turning a messy name tag into clear yes-or-no buttons! We’re transforming 'Sex' and 'Embarked' into a format our machine learning model can understand easily. Let’s break down the code, learn what one-hot encoding is, and practice remembering it with a fun trick.

## What is One-Hot Encoding?
- **Definition**: One-hot encoding changes categories (like 'male' or 'female') into separate yes/no columns (e.g., 1 for yes, 0 for no) so our model can use them to predict things, like who survived the Titanic.
- **Simple Analogy**: Imagine you’re sorting friends into teams by favorite color—red, blue, or green. Instead of writing “blue,” you give each friend a badge: “IsRed? 0,” “IsBlue? 1,” “IsGreen? 0.” This way, the computer sees clear choices, not words!
- **Memory Trick**: Think “One-Hot = One Yes!”—each person or data point gets one “yes” (1) and the rest are “no” (0), like flipping a single light switch on in a row of switches.
- **Why It Matters**: As a data scientist, you’ll use this to prepare data for predictions, a key skill for jobs in marketing or safety analysis!

## Breaking Down the Code
- **Code**: 
  - `# One-hot encode 'Sex' and 'Embarked' for ML model compatibility`
    - Tells us we’re making 'Sex' (male/female) and 'Embarked' (port names) ready for our model.
  - `df_titanic = pd.get_dummies(df_titanic, columns=['Sex', 'Embarked'], drop_first=True)`
    - Uses Pandas (`pd.get_dummies`) to turn 'Sex' and 'Embarked' into new columns.
    - `columns=['Sex', 'Embarked']` picks these two to change.
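    - `drop_first=True` drops the column for the first category of each feature ('female' for 'Sex', 'C' for 'Embarked') because it is redundant: a passenger with 0 in both 'Embarked_Q' and 'Embarked_S' must have boarded at 'C'. It’s like not needing a “Not Red” badge if we have “Red.”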
  - `print("\nStep 5b: One-hot encoded 'Sex' and 'Embarked'. New columns:")`
    - Lets us know the change is done and shows the new column names.
  - `print(list(df_titanic.columns))`
    - Lists all columns now, including the new ones, like checking our updated name tag collection!

- **What It Means**: 
  - Before: 'Sex' might be 'male' or 'female,' and 'Embarked' might be 'S,' 'C,' or 'Q' (ports). After, we get columns like 'Sex_male' (1 if male, 0 if female) and 'Embarked_Q' (1 if Queenstown, 0 if not).
  - **Analogy**: It’s like giving each passenger a set of light switches—one for “male” and one for “Queenstown”—turning on only the right ones to show who they are!
- **Memory Trick Practice**: Say “One-Hot = One Yes!” while imagining switches. Because we used `drop_first=True`, there is no 'Sex_female' or 'Embarked_C' switch here: a female passenger is simply 'Sex_male' = 0, and a Cherbourg passenger has both 'Embarked_Q' and 'Embarked_S' off.
- **Why It’s Useful**: This format helps our model understand categories without mixing them up, improving survival predictions!
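
To see what `drop_first=True` changes, the sketch below encodes a tiny made-up 'Embarked' column both ways (the `ports` table is hypothetical). The `dtype=int` argument is only there so the result prints as 0/1 instead of True/False.

```python
import pandas as pd

# Hypothetical ports for four passengers
ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Full one-hot encoding: one column per port, exactly one 1 per row
full = pd.get_dummies(ports, columns=["Embarked"], dtype=int)
print(full)

# With drop_first=True the first category ('C') loses its column
reduced = pd.get_dummies(ports, columns=["Embarked"], drop_first=True, dtype=int)
print(reduced)
```

In the reduced table, the passenger from 'C' shows up as all zeros, so no information is lost even though one column is gone.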

## What’s Next?
- We’ll create more features to make our model even smarter. Keep flipping those switches to learn more!
- *Note*: This dataset has historical gaps, which we’ll handle fairly in Week 4. For now, enjoy turning data into clear signals!

In [23]:
# Bin 'Age' into categories for better feature representation
bins = [0, 12, 18, 35, 60, 100]
labels = ['Child', 'Teen', 'Young Adult', 'Adult', 'Senior']
df_titanic['AgeGroup'] = pd.cut(df_titanic['Age'], bins=bins, labels=labels)
print("\nStep 5c: Added 'AgeGroup' feature. First 5 rows:")
print(df_titanic[['Age', 'AgeGroup']].head())


Step 5c: Added 'AgeGroup' feature. First 5 rows:
    Age     AgeGroup
0  22.0  Young Adult
1  38.0        Adult
2  26.0  Young Adult
3  35.0  Young Adult
4  35.0  Young Adult


# Understanding Step 5c: Binning 'Age' into Categories

This step is like sorting toys into different boxes by size! We’re grouping passenger ages into categories like 'Child' or 'Adult' to make our Titanic data easier for the model to understand. Let’s break down the code, learn what binning is, and practice remembering it with a fun trick.

## What is Binning?
- **Definition**: Binning puts ages into labeled groups (e.g., 'Teen', 'Adult') instead of using exact numbers, helping our model spot patterns better.
- **Simple Analogy**: Imagine you’re organizing a toy chest. You put small toys in a 'Tiny' box, medium ones in a 'Medium' box, and big ones in a 'Large' box. Binning is like that—it sorts ages into neat groups!
- **Memory Trick**: Think “Bin = Box!”—each bin is a box where we drop ages, and we label the box to remember who fits where. Say “Bin = Box!” a few times to lock it in!
- **Why It Matters**: As a data scientist, you’ll use binning to simplify data for predictions, a key skill for jobs in marketing or safety studies!

## Breaking Down the Code
- **Code**: 
  - `# Bin 'Age' into categories for better feature representation`
    - Tells us we’re grouping 'Age' to help our model see the big picture.
  - `bins = [0, 12, 18, 35, 60, 100]`
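    - Sets up age cutoffs like lines on a ruler. By default `pd.cut` includes the right edge of each group and excludes the left, so the groups are: above 0 up to 12, above 12 up to 18, above 18 up to 35, above 35 up to 60, and above 60 up to 100.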
  - `labels = ['Child', 'Teen', 'Young Adult', 'Adult', 'Senior']`
    - Names each group, like labeling our toy boxes with fun tags!
  - `df_titanic['AgeGroup'] = pd.cut(df_titanic['Age'], bins=bins, labels=labels)`
    - Uses Pandas (`pd.cut`) to sort ages into these groups and adds a new column called 'AgeGroup'.
    - Like putting each passenger’s age into the right box based on the ruler lines.
  - `print("\nStep 5c: Added 'AgeGroup' feature. First 5 rows:")`
    - Lets us know the new group column is ready and shows the first five examples.
  - `print(df_titanic[['Age', 'AgeGroup']].head())`
    - Displays a table with 'Age' and 'AgeGroup' for the first five passengers, like peeking into our sorted boxes!

- **What It Means**: 
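  - Before: Ages were numbers like 22 or 35. Now, a 22-year-old goes into 'Young Adult' and a 40-year-old into 'Adult'. A passenger aged exactly 35 still counts as 'Young Adult', as the output shows, because each group includes its upper edge.
  - **Analogy**: It’s like a teacher sorting students into age teams, with 'Child' for ages up to 12 and 'Teen' for ages above 12 up to 18, making it easier to plan activities!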
- **Memory Trick Practice**: Say “Bin = Box!” and imagine dropping ages into labeled boxes. Check the output to see who’s in each team to build that muscle memory.
- **Why It’s Useful**: Grouping ages helps our model see if kids or seniors survived differently, improving our survival predictions!
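
Ages that sit exactly on a cutoff (12, 18, 35) are where binning surprises people. The sketch below uses the same `bins` and `labels` as Step 5c on a few hand-picked boundary ages (the `ages` values are hypothetical) and compares the default behavior with `right=False`.

```python
import pandas as pd

# Same cutoffs and labels as Step 5c
bins = [0, 12, 18, 35, 60, 100]
labels = ['Child', 'Teen', 'Young Adult', 'Adult', 'Senior']

# Hypothetical ages chosen to sit on or next to the cutoffs
ages = pd.Series([12, 18, 35, 36])

# Default (right=True): each group includes its upper edge
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: each group includes its lower edge instead
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"Age": ages, "default": right_closed, "right=False": left_closed}))
```

With the default, a 12-year-old is a 'Child'; with `right=False`, the same passenger becomes a 'Teen'. Neither is wrong, but it is worth deciding on purpose.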

## What’s Next?
- We’ll create more grouped features to make our model even smarter. Keep sorting those boxes to learn more!
- *Note*: This dataset has historical gaps, which we’ll handle fairly in Week 4. For now, enjoy organizing these ages!

In [24]:
# Interactive: Ask students to suggest a feature to scale
feature_to_scale = input("Step 5d: Enter a numerical feature to scale (e.g., 'Fare'): ")
if feature_to_scale in df_titanic.select_dtypes(include=[np.number]).columns:
    # Min-max scaling: (value - min) / (max - min) maps the column onto the 0-to-1 range
    col_min = df_titanic[feature_to_scale].min()
    col_max = df_titanic[feature_to_scale].max()
    df_titanic[feature_to_scale + '_scaled'] = (df_titanic[feature_to_scale] - col_min) / (col_max - col_min)
    print(f"\nScaled {feature_to_scale} added. First 5 rows:")
    print(df_titanic[[feature_to_scale, feature_to_scale + '_scaled']].head())
else:
    print("Please enter a valid numerical column like 'Fare' or 'Age'.")


Scaled Fare added. First 5 rows:
      Fare  Fare_scaled
0   7.2500     0.014151
1  71.2833     0.139136
2   7.9250     0.015469
3  53.1000     0.103644
4   8.0500     0.015713


# Understanding Step 5d: Scaling a Numerical Feature

This step is like adjusting the volume on a radio so all songs sound balanced! We’re scaling a number like 'Fare' to make our Titanic data fair for the model. Let’s break down the code, learn why, what, and when to scale, and check the results with a fun trick to remember it.

## What is Scaling?
- **Definition**: Scaling changes big or small numbers (e.g., 'Fare' from 0 to about 512) into a range like 0 to 1, so our model treats them equally.
- **Why to Apply**: Many models we’ll meet later, especially ones that measure distances or learn by gradient descent, work better when all numbers share a similar scale. If one feature (e.g., 'Fare' up to about 512) is huge and another (e.g., 'Age' around 30) is small, the big one might drown out the small one!
  - **Analogy**: It’s like mixing ingredients for a cake. If you add 500g of sugar and 30g of salt without adjusting, the sugar overshadows the salt. Scaling balances them, like using teaspoons for both!
- **What It Does**: Turns raw numbers into a 0-to-1 scale where the smallest value becomes 0 and the largest becomes 1.
- **When to Apply**: Use scaling when numbers vary widely (e.g., 'Fare' vs. 'Age') and you’re using models that care about size, like k-nearest neighbors or regression with regularization. Tree-based models usually don’t need it.
  - **Analogy**: Scale when you’re comparing apples and watermelons by weight—turn them into a fair size range so the comparison makes sense!
- **Memory Trick**: Think “Scale = Balance!”—imagine a seesaw leveling out big and small weights. Say “Scale = Balance!” a few times to lock it in!
- **Why It Matters**: As a data scientist, you’ll scale data to improve predictions, a key skill for jobs in finance or marketing!
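
The 0-to-1 scaling described above follows one small formula: `(value - min) / (max - min)`. Here is a minimal sketch of it as a reusable function. The name `min_max_scale` is made up for this example, and returning zeros for a column where every value is the same is an assumption to avoid dividing by zero.

```python
import pandas as pd

def min_max_scale(values: pd.Series) -> pd.Series:
    """Map a numeric Series onto the 0-to-1 range (hypothetical helper)."""
    span = values.max() - values.min()
    if span == 0:
        # Assumption: a constant column carries no spread, so scale it to all zeros
        return pd.Series(0.0, index=values.index)
    return (values - values.min()) / span

# A few fares that appear in this notebook's outputs (minimum, two early rows, maximum)
fares = pd.Series([0.0, 7.25, 71.2833, 512.3292])
scaled = min_max_scale(fares)
print(scaled)
```

The smallest fare becomes 0, the largest becomes 1, and 7.25 lands at about 0.014, matching the `Fare_scaled` value printed for the first passenger.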

## Breaking Down the Code
- **Code**: 
  - `# Interactive: Ask students to suggest a feature to scale`
    - Invites you to pick a number column to adjust, like choosing which ingredient to balance.
  - `feature_to_scale = input("Step 5d: Enter a numerical feature to scale (e.g., 'Fare'): ")`
    - Asks you to type a column name, like 'Fare' or 'Age', to work on.
  - `if feature_to_scale in df_titanic.select_d

In [25]:
# --- Section 6: Time Series Manipulation ---
# Ensure stock data index is datetime for time series operations
df_stock.index = pd.to_datetime(df_stock.index)
print("\nStep 6a: Stock data index converted to datetime.")


Step 6a: Stock data index converted to datetime.


In [26]:
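# Resample stock data to monthly frequency, calculating mean closing price
# ('ME' = month end; it replaces the deprecated 'M' alias in pandas 2.2+, so use 'M' on older versions)
monthly_close = df_stock['Close'].resample('ME').mean()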
print("\nStep 6b: Monthly Average Closing Price (first 5):")
print(monthly_close.head())


Step 6b: Monthly Average Closing Price (first 5):
Ticker           AAPL
Date                 
2020-01-31  75.417396
2020-02-29  75.401414
2020-03-31  63.606265
2020-04-30  66.015845
2020-05-31  75.283144



In [27]:
# Calculate 50-day moving average to smooth stock price trends
df_stock['MA50'] = df_stock['Close'].rolling(window=50).mean()
print("\nStep 6c: 50-Day Moving Average (first 5):")
print(df_stock[['Close', 'MA50']].head())


Step 6c: 50-Day Moving Average (first 5):
Price           Close MA50
Ticker           AAPL     
Date                      
2020-01-02  72.620834  NaN
2020-01-03  71.914818  NaN
2020-01-06  72.487846  NaN
2020-01-07  72.146942  NaN
2020-01-08  73.307510  NaN
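The `NaN` values in the output above appear because a 50-day window needs 50 rows before it can produce its first average. The sketch below shows the same effect on a hypothetical five-day series with a 3-day window, plus the `min_periods` option of `rolling`, which averages whatever rows are available so far. The prices are made up, not real AAPL data.

```python
import pandas as pd

# Hypothetical closing prices on five business days (not real AAPL data)
dates = pd.date_range("2020-01-01", periods=5, freq="B")
close = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0], index=dates)

# Default behaviour: the first window-1 rows are NaN
strict = close.rolling(window=3).mean()

# min_periods=1 averages however many rows are available so far
lenient = close.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"close": close, "strict": strict, "lenient": lenient}))
```

Whether to fill the early rows this way is a judgment call: the first few "lenient" values are averages over fewer days, so they are less smooth than the rest.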


In [28]:
# Interactive: Ask for a window size for moving average
window_size = int(input("Step 6d: Enter a window size for moving average (e.g., 30): "))
df_stock['MA' + str(window_size)] = df_stock['Close'].rolling(window=window_size).mean()
print(f"\n{window_size}-Day Moving Average (first 5):")
print(df_stock[['Close', 'MA' + str(window_size)]].head())


30-Day Moving Average (first 5):
Price           Close MA30
Ticker           AAPL     
Date                      
2020-01-02  72.620834  NaN
2020-01-03  71.914818  NaN
2020-01-06  72.487846  NaN
2020-01-07  72.146942  NaN
2020-01-08  73.307510  NaN


In [29]:
# --- Section 7: Applying Custom Functions ---
# Define a function to categorize fares into 'Cheap', 'Moderate', 'Expensive'
def fare_category(fare):
    if fare < 10:
        return 'Cheap'
    elif fare < 50:
        return 'Moderate'
    else:
        return 'Expensive'
    
# Apply the fare_category function to create a new feature
df_titanic['FareCategory'] = df_titanic['Fare'].apply(fare_category)
print("\nStep 7a: Added 'FareCategory' feature. First 5 rows:")
print(df_titanic[['Fare', 'FareCategory']].head())


Step 7a: Added 'FareCategory' feature. First 5 rows:
      Fare FareCategory
0   7.2500        Cheap
1  71.2833    Expensive
2   7.9250        Cheap
3  53.1000    Expensive
4   8.0500        Cheap


In [30]:
# Interactive: Ask students to define a custom category function
def custom_category(value, threshold1, threshold2, labels):
    if value < threshold1:
        return labels[0]
    elif value < threshold2:
        return labels[1]
    else:
        return labels[2]

col_to_categorize = input("Step 7b: Enter a numerical column to categorize (e.g., 'Age'): ")
if col_to_categorize in df_titanic.select_dtypes(include=[np.number]).columns:
    thresh1 = float(input("Enter first threshold (e.g., 20): "))
    thresh2 = float(input("Enter second threshold (e.g., 40): "))
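    # Strip spaces so that input like 'Young, Middle, Old' yields clean labels
    custom_labels = [label.strip() for label in input("Enter three labels separated by commas (e.g., 'Young,Middle,Old'): ").split(',')]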
    df_titanic[col_to_categorize + '_Custom'] = df_titanic[col_to_categorize].apply(custom_category, args=(thresh1, thresh2, custom_labels))
    print(f"\nCustom categories for {col_to_categorize}:")
    print(df_titanic[[col_to_categorize, col_to_categorize + '_Custom']].head())
else:
    print("Please enter a valid numerical column like 'Age' or 'Fare'.")



Custom categories for Age:
    Age Age_Custom
0  22.0     Middle
1  38.0     Middle
2  26.0     Middle
3  35.0     Middle
4  35.0     Middle
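The two-threshold logic in `custom_category` can also be sketched with pandas' built-in `pd.cut`, which bins a whole column at once. This is an alternative technique, not what the cell above does; the ages, thresholds, and labels here are hypothetical and simply mirror the example prompts.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; thresholds and labels mirror the example prompts above
ages = pd.Series([5.0, 22.0, 38.0, 40.0, 71.0])
thresh1, thresh2 = 20.0, 40.0
labels = ["Young", "Middle", "Old"]

# right=False makes each bin [low, high), matching the `value < threshold` tests
age_band = pd.cut(
    ages,
    bins=[-np.inf, thresh1, thresh2, np.inf],
    labels=labels,
    right=False,
)
print(age_band.tolist())
```

One difference worth knowing: `pd.cut` leaves a missing value as missing, while the `if/elif/else` function sends it to the last label, because comparisons with `NaN` are always false.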


In [31]:
# --- Section 8: Advanced Pandas Operations ---
# Create a MultiIndex DataFrame by setting 'Pclass' and 'Sex_male' as indices
df_titanic_multi = df_titanic.set_index(['Pclass', 'Sex_male'])
print("\nStep 8a: MultiIndex DataFrame (first 5):")
print(df_titanic_multi.head())



Step 8a: MultiIndex DataFrame (first 5):
                 PassengerId  Survived                                               Name   Age  SibSp  Parch            Ticket     Fare  FamilySize  Embarked_Q  Embarked_S     AgeGroup  Fare_scaled FareCategory Age_Custom
Pclass Sex_male                                                                                                                                                                                                               
3      True                1         0                            Braund, Mr. Owen Harris  22.0      1      0         A/5 21171   7.2500           2       False        True  Young Adult     0.014151        Cheap     Middle
1      False               2         1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0      1      0          PC 17599  71.2833           2       False       False        Adult     0.139136    Expensive     Middle
3      False               3         1                            

# Understanding Step 8a: Creating a MultiIndex DataFrame

This step is like organizing a big family reunion chart with two labels! We’re setting up the Titanic data with 'Pclass' and 'Sex_male' as special markers to group passengers in a fancy way. Let’s break down the code, learn what a MultiIndex is, and practice remembering it with a fun trick.

## What is a MultiIndex DataFrame?
- **Definition**: A MultiIndex DataFrame uses two or more columns (like 'Pclass' and 'Sex_male') as labels to organize data into layers, making it easier to find specific groups.
- **Simple Analogy**: Imagine you’re sorting a big box of family photos. You label them by “Family Name” and “Birthday Year” to quickly find, say, “Smith 1990” photos. A MultiIndex is like that—it stacks labels to sort data!
- **Memory Trick**: Think “Multi = Many Labels!”—picture stacking two name tags (e.g., 'Pclass' and 'Sex_male') on each passenger to find them fast. Say “Multi = Many Labels!” a few times to lock it in!
- **Why It Matters**: As a data scientist, you’ll use MultiIndex to analyze groups (e.g., survival by class and gender), a key skill for jobs in research or business!

## Breaking Down the Code
- **Code**: 
  - `# Create a MultiIndex DataFrame by setting 'Pclass' and 'Sex_male' as indices`
    - Tells us we’re making a special table with 'Pclass' (ticket class) and 'Sex_male' (True for male, False for female) as the main organizers.
  - `df_titanic_multi = df_titanic.set_index(['Pclass', 'Sex_male'])`
    - Uses Pandas (`set_index`) to turn 'Pclass' and 'Sex_male' into layered labels for our table (`df_titanic_multi`).
    - Like pinning two labels on each photo: one for class (1, 2, 3) and one for gender (male or not).
  - `print("\nStep 8a: MultiIndex DataFrame (first 5):")`
    - Lets us know the new table is ready and shows the first five examples.
  - `print(df_titanic_multi.head())`
    - Displays the first five rows of the new table, like peeking at the top of our organized photo box!

- **What It Means**: 
  - Before: Data was a flat list. Now, it’s grouped by 'Pclass' (e.g., 3) and 'Sex_male' (e.g., True), so we can quickly find, say, all third-class males.
  - **Analogy**: It’s like a librarian filing books by “Genre” and “Author” on the shelf, making it easy to grab a mystery by Agatha Christie!
- **Memory Trick Practice**: Say “Multi = Many Labels!” and imagine stacking 'Pclass' and 'Sex_male' tags on each passenger. Check the output to see the new layout and build that muscle memory.
- **Why It’s Useful**: This helps us spot patterns, like whether third-class males survived less often, which guides the features we build for our predictions!
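Here is a small sketch of what the layered labels let you do once they are in place. The `toy` table is hypothetical and reuses the lesson's column names; `xs` and `groupby(level=...)` are standard pandas tools for working with index levels.

```python
import pandas as pd

# Hypothetical mini passenger table using the same column names as the lesson
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex_male": [True, False, False, True, True],
    "Survived": [0, 1, 1, 0, 0],
})

# Sorting the index groups matching labels together
toy_multi = toy.set_index(["Pclass", "Sex_male"]).sort_index()

# Pull out every third-class passenger by naming the index level
third_class = toy_multi.xs(3, level="Pclass")
print(third_class)

# Average survival for every (Pclass, Sex_male) combination
survival_rate = toy_multi.groupby(level=["Pclass", "Sex_male"])["Survived"].mean()
print(survival_rate)
```

The second print is the "survival by class and gender" summary in miniature: one number per pair of labels.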

## What’s Next?
- We’ll explore more advanced ways to organize data and get ready for predictions. Keep stacking those labels to learn more!
- *Note*: This dataset has historical gaps, which we’ll handle fairly in Week 4. For now, enjoy organizing with MultiIndex!

In [32]:
# Use vectorized operation to calculate log of 'Fare' (adding 1 to avoid log(0))
df_titanic['LogFare'] = np.log1p(df_titanic['Fare'])
print("\nStep 8b: Log of Fare (first 5):")
print(df_titanic[['Fare', 'LogFare']].head())


Step 8b: Log of Fare (first 5):
      Fare   LogFare
0   7.2500  2.110213
1  71.2833  4.280593
2   7.9250  2.188856
3  53.1000  3.990834
4   8.0500  2.202765


# Understanding Step 8b: Using a Vectorized Log Operation

This step is like turning a loud shout into a soft whisper to make it easier to listen! We’re changing the 'Fare' prices in the Titanic dataset into a smoother form using a log trick, and we’ll do it super fast for all passengers at once. Let’s break it down, learn why and when to use this, and make it fun to remember—even without math skills!

## What is a Vectorized Log Operation?
- **Definition**: A vectorized log operation takes big or small numbers (like ticket prices) and turns them into a gentler scale using a “log” (short for logarithm), doing it for all data at once super quickly.
- **Why to Use**: 
  - **Why**: Big numbers (e.g., $500 fares) can overpower small ones (e.g., $7 fares) in predictions, making our model unfair. Taking the log shrinks the big ones more, balancing them out.
  - **Analogy**: It’s like turning up the volume on a quiet voice and turning down a loud shout so everyone can be heard equally at a party!
- **What It Does**: Changes numbers so the differences feel smaller. For example, with `log1p`, $7 becomes about 2.1 and $500 becomes about 6.2, all in one go.
- **When to Use**: 
  - Use this when numbers vary a lot (e.g., fares from $0 to $500) and you’re predicting something, like survival or house prices, with a model that likes balanced data.
  - **Analogy**: Use it when you’re comparing a tiny puppy’s bark with a big dog’s howl—log it to make both sounds fit on the same listening level!
- **Memory Trick**: Think “Log = Level!”—imagine a volume knob leveling out loud and quiet sounds. Say “Log = Level!” a few times to get it stuck in your head, no math needed!
- **Why It Matters**: As a data scientist, you’ll use this to make data friendly for predictions, a key skill for jobs in finance or travel analysis!

## Breaking Down the Code
- **Code**: 
  - `# Use vectorized operation to calculate log of 'Fare' (adding 1 to avoid log(0))`
    - Tells us we’re smoothing 'Fare' prices with a log, and adding 1 avoids the undefined log(0), which would come out as negative infinity, if a fare is $0.
  - `df_titanic['LogFare'] = np.log1p(df_titanic['Fare'])`
    - Uses NumPy (`np.log1p`) to apply the log to all 'Fare' values at once (vectorized means super fast!).
    - `log1p` adds 1 first (e.g., $7 becomes 8, then logs it) to avoid breaking with $0 fares.
    - Creates a new column 'LogFare' with the smoothed numbers.
  - `print("\nStep 8b: Log of Fare (first 5):")`
    - Lets us know the new column is ready and shows the first five examples.
  - `print(df_titanic[['Fare', 'LogFare']].head())`
    - Displays a table with the original 'Fare' and new 'LogFare' for the first five passengers, like comparing the old shout to the new whisper!

- **What It Means**: 
  - Before: 'Fare' had big jumps (e.g., $7 to $71). Now, 'LogFare' makes them closer (e.g., 2 to 4), so our model won’t favor the big fares.
  - **Analogy**: It’s like a DJ softening a loud song ($71) to match a quiet one ($7) on the same playlist, making the party vibe smooth!
- **Memory Trick Practice**: Say “Log = Level!” and imagine turning a volume knob to balance fares. Check the output to see the new whisper levels.
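The sketch below repeats the `np.log1p` step on a few hypothetical fares and adds `np.expm1`, NumPy's inverse of `log1p`, to show that the original prices can be recovered. The fare values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical fares, including a free ticket to show why the "+1" matters
fares = pd.Series([0.0, 7.25, 71.2833, 512.0])

log_fares = np.log1p(fares)      # same as np.log(1 + fares), applied to every row at once
recovered = np.expm1(log_fares)  # inverse transform: exp(x) - 1

print(pd.DataFrame({"Fare": fares, "LogFare": log_fares, "Recovered": recovered}))
```

A $0 fare maps cleanly to 0 instead of negative infinity, and the round trip matters later if you ever need to turn a logged prediction back into dollars.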

## Explaining the Results

In [33]:
# Optimize memory by converting 'AgeGroup' and 'FareCategory' to category type
df_titanic['AgeGroup'] = df_titanic['AgeGroup'].astype('category')
df_titanic['FareCategory'] = df_titanic['FareCategory'].astype('category')
print("\nStep 8c: Optimized 'AgeGroup' and 'FareCategory' to category type.")


Step 8c: Optimized 'AgeGroup' and 'FareCategory' to category type.


# Understanding Step 8c: Optimizing Memory with Category Type

This step is like packing a suitcase smarter to save space! We’re changing 'AgeGroup' and 'FareCategory' in the Titanic dataset into a special format to use less computer memory, making our work faster. Let’s break down the code, learn why this helps, and practice remembering it with a fun trick.

## What is Memory Optimization with Category Type?
- **Definition**: Converting columns like 'AgeGroup' (e.g., 'Child', 'Adult') or 'FareCategory' (e.g., 'Cheap', 'Expensive') to 'category' type tells the computer to store them efficiently, like using shortcuts instead of writing everything out.
- **Why to Use**: 
  - Saves memory when we have lots of repeated words (e.g., 'Adult' for many passengers), so our program runs quicker.
  - **Analogy**: It’s like using a sticker sheet with 'Child,' 'Adult,' etc., instead of writing those words on every photo label—less ink, same info!
- **What It Does**: Turns text categories into compact integer codes behind the scenes, reducing the memory the DataFrame uses.
- **When to Use**: 
  - Use this when you have columns with a few repeated categories (e.g., 'Sex' or 'AgeGroup') and want to speed up your work, especially for big datasets or models.
  - **Analogy**: Use it when packing for a trip with lots of the same clothes—use tags like 'T-shirt' instead of listing each one!
- **Memory Trick**: Think “Category = Compact!”—imagine shrinking a big word list into tiny stickers. Say “Category = Compact!” a few times to stick it in your mind!
- **Why It Matters**: As a data scientist, you’ll optimize data to handle large projects faster, a key skill for jobs in tech or research!

## Breaking Down the Code
- **Code**: 
  - `# Optimize memory by converting 'AgeGroup' and 'FareCategory' to category type`
    - Tells us we’re making these columns use less space for better performance.
  - `df_titanic['AgeGroup'] = df_titanic['AgeGroup'].astype('category')`
    - Changes 'AgeGroup' (e.g., 'Child', 'Teen') into a category type, like swapping long labels for stickers.
  - `df_titanic['FareCategory'] = df_titanic['FareCategory'].astype('category')`
    - Does the same for 'FareCategory' (e.g., 'Cheap', 'Moderate'), packing it efficiently.
  - `print("\nStep 8c: Optimized 'AgeGroup' and 'FareCategory' to category type.")`
    - Lets us know the change is done, like checking our suitcase is lighter!

- **What It Means**: 
  - Before: 'AgeGroup' and 'FareCategory' were stored as full words for each passenger. Now, they’re coded as shortcuts, saving space.
  - **Analogy**: It’s like a travel agent swapping a heavy book of labels for a small sticker sheet, making your luggage easier to carry!
- **Memory Trick Practice**: Say “Category = Compact!” and imagine sticking 'Child' or 'Cheap' onto a photo with a tiny sticker. You can confirm the savings by comparing `df_titanic.memory_usage(deep=True)` before and after the conversion.
- **Why It’s Useful**: This keeps the DataFrame lean, and operations like grouping can run faster, especially with lots of passengers!
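To see the “Category = Compact!” idea in numbers, this sketch measures a hypothetical column of repeated labels before and after conversion. The exact byte counts depend on your pandas version, so none are quoted here.

```python
import pandas as pd

# Hypothetical column with a few labels repeated many times
labels = pd.Series(["Cheap", "Moderate", "Expensive"] * 1000)

# Bytes used when every row stores its own text
as_text = labels.memory_usage(deep=True)

# Bytes used when rows store small integer codes that point at three shared labels
as_category = labels.astype("category").memory_usage(deep=True)

print(f"text column:     {as_text} bytes")
print(f"category column: {as_category} bytes")
```

The saving grows with the number of rows and shrinks when nearly every value is unique, which is why columns like 'Name' are poor candidates.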

## What’s Next?
- We’ll explore more ways to make our data work better and get ready for predictions. Keep packing smart to learn more!
- *Note*: This dataset has historical gaps, which we’ll handle fairly in Week 4. For now, enjoy making your data lighter!

In [34]:
# --- Section 9: Merging and Joining ---
# Create a small DataFrame with ticket adjustments for merging demonstration
ticket_adjustment = pd.DataFrame({
    'Ticket': df_titanic['Ticket'].unique(),
    'Adjustment': np.random.uniform(-10, 10, size=len(df_titanic['Ticket'].unique()))
})
print("\nStep 9a: Ticket Adjustment DataFrame (first 5):")
print(ticket_adjustment.head())


Step 9a: Ticket Adjustment DataFrame (first 5):
             Ticket  Adjustment
0         A/5 21171    7.015100
1          PC 17599   -7.115407
2  STON/O2. 3101282   -5.268915
3            113803    8.259056
4            373450    7.132332


# Understanding Step 9a: Creating a Ticket Adjustment DataFrame

This step is like adding a bonus gift list to our passenger party plan! We’re making a new table with ticket adjustments to show how we can combine it with our Titanic data later. Let’s break down the code, learn what merging is about, and practice remembering it with a fun trick.

## What is Merging and Joining?
- **Definition**: Merging and joining is like matching two lists—our passenger data and a new adjustment list—to add extra info, making our story richer.
- **Simple Analogy**: Imagine you’re planning a party and have a guest list. You get a second list with gift amounts for each guest. Merging is like pairing the guest list with the gift list to see who gets what!
- **Memory Trick**: Think “Merge = Match!”—picture sticking two friend groups’ name tags together. Say “Merge = Match!” a few times to get it stuck in your head!
- **Why It Matters**: As a data scientist, you’ll merge data to add details (e.g., price changes), a key skill for jobs in travel or finance!

## Breaking Down the Code
- **Code**: 
  - `# Create a small DataFrame with ticket adjustments for merging demonstration`
    - Tells us we’re building a mini-table to practice adding adjustments.
  - `ticket_adjustment = pd.DataFrame({`
    - Starts a new table using Pandas (`pd.DataFrame`).
  - `'Ticket': df_titanic['Ticket'].unique(),`
    - Lists all unique ticket numbers from our passenger data, like collecting every party invitation.
  - `'Adjustment': np.random.uniform(-10, 10, size=len(df_titanic['Ticket'].unique()))`
    - Adds random adjustment numbers between -10 and 10 for each ticket, like picking surprise gift amounts.
    - `np.random.uniform` picks these numbers, and `size` matches the ticket count.
  - `})`
    - Closes the table creation.
  - `print("\nStep 9a: Ticket Adjustment DataFrame (first 5):")`
    - Lets us know the new table is ready and shows the first five rows.
  - `print(ticket_adjustment.head())`
    - Displays the top five entries, like peeking at the first five gift tags!

- **What It Means**: 
  - We made a new list pairing each ticket (e.g., 'A/5 21171') with a random adjustment (e.g., 7.01), like adding a bonus to some party gifts.
  - **Analogy**: It’s like a party planner making a side list of extra treats for each invitation, ready to match it with the main guest list!
- **Memory Trick Practice**: Say “Merge = Match!” and imagine sticking a gift tag next to each invitation. Check the output to see the matches and build that muscle memory.
- **Why It’s Useful**: The adjustments are random, so they carry no real signal about survival. The table exists purely to practice merging before we adjust fares!
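Before matching the real tables, here is a minimal sketch of a left merge on hypothetical data. The `passengers` and `adjustments` tables are made up; `how='left'` and `validate='many_to_one'` are existing options of `pd.merge`.

```python
import pandas as pd

# Hypothetical passengers; two of them share ticket 'T1'
passengers = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Ticket": ["T1", "T1", "T2"],
    "Fare": [10.0, 10.0, 25.0],
})

# One row per ticket, like the ticket_adjustment table (values are made up)
adjustments = pd.DataFrame({"Ticket": ["T1", "T3"], "Adjustment": [2.5, -1.0]})

# how='left' keeps every passenger; validate raises an error if a ticket repeats on the right side
merged = pd.merge(passengers, adjustments, on="Ticket", how="left", validate="many_to_one")
print(merged)
```

Ann and Bob both pick up the 'T1' adjustment, Cy gets `NaN` because 'T2' has no match, and the unmatched 'T3' row is dropped. In the lesson every ticket appears in the adjustment table, so no `NaN` values are expected there.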

## Explaining the Results
- **Output**: `Step 9a: Ticket Adjustment DataFrame (first 5):`
  - `Ticket         Adjustment`
  - `0         A/5 21171    7.015100`
  - `1          PC 17599   -7.115407`
  - `2  STON/O2. 3101282   -5.268915`
  - `3            113803    8.259056`
  - `4            373450    7.132332`
- **What It Means**: 
  - **Ticket**: Unique ticket numbers from the Titanic, like party invitations (e.g., 'A/5 21171').
  - **Adjustment**: Random numbers between -10 and 10, like bonus gifts (e.g., 7.01 means +$7.01, -7.11 means -$7.11).
  - **Analogy**: It’s like seeing a list where 'A/5 21171' gets a $7.01 treat, 'PC 17599' loses $7.11, and so on—ready to add to the party plan!
- **Why It Works**: These adjustments will tweak fares in the next steps. Because they are random, treat the result as merge practice rather than evidence about survival chances.

## What’s Next?
- We’ll match this list with our passenger data and adjust fares. Keep matching those tags to learn more!
- *Note*: This dataset has historical gaps, which we’ll handle fairly in Week 4. For now, enjoy building your extra list!

In [35]:
# Merge ticket adjustments with the main DataFrame
df_titanic = pd.merge(df_titanic, ticket_adjustment, on='Ticket', how='left')
print("\nStep 9b: Merged DataFrame (first 5):")
print(df_titanic[['Ticket', 'Adjustment']].head())


Step 9b: Merged DataFrame (first 5):
             Ticket  Adjustment
0         A/5 21171    7.015100
1          PC 17599   -7.115407
2  STON/O2. 3101282   -5.268915
3            113803    8.259056
4            373450    7.132332


In [36]:
# Calculate adjusted fare by adding the adjustment
df_titanic['AdjustedFare'] = df_titanic['Fare'] + df_titanic['Adjustment']
print("\nStep 9c: Adjusted Fare (first 5):")
print(df_titanic[['Fare', 'Adjustment', 'AdjustedFare']].head())


Step 9c: Adjusted Fare (first 5):
      Fare  Adjustment  AdjustedFare
0   7.2500    7.015100     14.265100
1  71.2833   -7.115407     64.167893
2   7.9250   -5.268915      2.656085
3  53.1000    8.259056     61.359056
4   8.0500    7.132332     15.182332


In [37]:
# Interactive: Ask students to merge with a custom DataFrame
custom_data = input("Step 9d: Enter a column to merge with (e.g., 'Name'), or skip: ")
if custom_data in df_titanic.columns:
    custom_df = pd.DataFrame({custom_data: df_titanic[custom_data].unique(), 'CustomValue': np.random.rand(len(df_titanic[custom_data].unique()))})
    df_titanic = pd.merge(df_titanic, custom_df, on=custom_data, how='left')
    print(f"\nMerged with custom DataFrame on {custom_data}:")
    print(df_titanic[[custom_data, 'CustomValue']].head())
else:
    print("Skipping custom merge or invalid column.")


Merged with custom DataFrame on Name:
                                                Name  CustomValue
0                            Braund, Mr. Owen Harris     0.320761
1  Cumings, Mrs. John Bradley (Florence Briggs Th...     0.333112
2                             Heikkinen, Miss. Laina     0.093483
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)     0.949955
4                           Allen, Mr. William Henry     0.991678


In [39]:
# --- Section 10: Preparing Data for Machine Learning ---
# Standardize 'Age' (z-score), then select numerical, encoded, and categorical features for ML model training
df_titanic['Age_standardized'] = (df_titanic['Age'] - df_titanic['Age'].mean()) / df_titanic['Age'].std()
features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'FamilySize', 'Sex_male', 'Embarked_Q', 'Embarked_S', 'AgeGroup', 'Fare_scaled', 'Age_standardized']
X = df_titanic[features]
y = df_titanic['Survived']
print("\nStep 10a: Features selected for ML:", features)
print("Step 10b: Target variable:", y.name)


Step 10a: Features selected for ML: ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'FamilySize', 'Sex_male', 'Embarked_Q', 'Embarked_S', 'AgeGroup', 'Fare_scaled', 'Age_standardized']
Step 10b: Target variable: Survived
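The `Age_standardized` line in the cell above is a z-score: subtract the mean, then divide by the standard deviation. This sketch applies the same formula to a few hypothetical ages and checks that the result has mean 0 and standard deviation 1.

```python
import pandas as pd

# Hypothetical ages standing in for the 'Age' column
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 54.0])

# Same formula as the lesson; pandas' .std() uses the sample formula (ddof=1) by default
standardized = (ages - ages.mean()) / ages.std()

print(standardized.round(3).tolist())
print("mean:", round(standardized.mean(), 6), "std:", round(standardized.std(), 6))
```

Unlike the 0-to-1 scaling from earlier, standardized values can be negative: a negative number simply means "younger than average".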


In [40]:
# Interactive: Ask students to add or remove a feature
feature_action = input("Step 10c: Add or remove a feature? (add/remove, then feature name, e.g., 'add LogFare'): ").split()
# Expect exactly two words; anything else falls through to the "invalid" branch instead of raising IndexError
action, name = (feature_action[0].lower(), feature_action[1]) if len(feature_action) == 2 else ('', '')
if action == 'add' and name in df_titanic.columns and name not in features:
    features.append(name)
    X = df_titanic[features]
    print(f"\nAdded {name} to features. New features:", features)
elif action == 'remove' and name in features:
    features.remove(name)
    X = df_titanic[features]
    print(f"\nRemoved {name} from features. New features:", features)
else:
    print("Invalid action or feature. Features unchanged:", features)

Invalid action or feature. Features unchanged: ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'FamilySize', 'Sex_male', 'Embarked_Q', 'Embarked_S', 'AgeGroup', 'Fare_scaled', 'Age_standardized']
