# 🚢 The Titanic Tragedy: A Data Story
## Exploring the Causes and Patterns of Survival Through Statistical Analysis

---

### *Welcome, Dear Explorer*

In this grand analytical journey, we venture into the depths of the RMS Titanic's voyage, a tragedy that shaped maritime history and continues to captivate minds. Through the lens of data science and the elegance of R programming, we shall uncover the hidden patterns that determined who survived and who perished on that fateful night of April 14th, 1912.

This is not merely an analysis. This is a **story told through numbers, visualizations, and statistical wisdom**. Each plot we create, each statistic we calculate, is a piece of the puzzle that helps us understand one of history's most compelling events.

**Our Grand Quest:**
- Understand the dataset's structure and quality
- Explore the characteristics of passengers
- Uncover the patterns that influenced survival
- Reveal the statistical truths hidden within the data

*Let us begin our voyage...*

## Part 1: Preparing Our Analytical Arsenal

### Loading the Essential Libraries

In the grand tradition of R programming, we begin by summoning the libraries that will empower our analysis. Each one is a carefully chosen tool in the data analyst's toolkit:

- **`tidyverse`**: The comprehensive collection for data wrangling and visualization
- **`ggplot2`**: Our canvas for creating publication-quality graphics
- **`gridExtra`**: To arrange multiple plots in harmonious panels
- **`corrplot` & `ggcorrplot`**: For revealing hidden correlations
- **`moments`**: To calculate the subtle properties of distributions
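- **`scales`**: For beautiful data formatting in our plots
- **`RColorBrewer`**: For hand-picked color palettes
- **`viridis`**: For perceptually uniform color maps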

Let the incantations begin...

In [None]:
# Load Essential Libraries
suppressPackageStartupMessages({
  library(tidyverse)      # Data manipulation and visualization
  library(ggplot2)        # Beautiful graphics
  library(gridExtra)      # Arrange plots
  library(corrplot)       # Correlation visualization
  library(ggcorrplot)     # ggplot2-style correlation plots
  library(moments)        # Statistical moments
  library(scales)         # Scaling functions for plots
  library(RColorBrewer)   # Beautiful color palettes
  library(viridis)        # Perceptually uniform color maps
})

# Set seed for reproducibility
set.seed(42)

# Configure ggplot2 theme for all visualizations
theme_set(theme_minimal() + 
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5, margin = margin(b = 10)),
    plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray40", margin = margin(b = 15)),
    axis.title = element_text(size = 11, face = "bold"),
    axis.text = element_text(size = 10),
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_blank(),
    legend.position = "right",
    legend.title = element_text(face = "bold", size = 10)
  ))

cat("✓ All libraries loaded successfully. The stage is set for our grand analysis.\n")

### Loading the Data

Now we read the sacred records: the manifest of souls aboard the Titanic. This dataset contains information about 891 passengers, including their demographics, ticket class, and most crucially, whether they survived the tragedy.

In [None]:
# Load the Titanic dataset
titanic <- read_csv("train.csv")

# Display first observations
cat("üö¢ The First Glimpse of Our Data:\n")
print(head(titanic, 10))

cat("\nüìä Dataset Dimensions:\n")
cat("Passengers:", nrow(titanic), "| Variables:", ncol(titanic), "\n")

cat("\nüìã Column Names and Types:\n")
print(str(titanic))

---

## Part 2: Unraveling the Dataset's Fabric

### The Structure Beneath the Surface

Every dataset has a skeleton, a structure that we must understand before we can tell its story. Let us examine the data types, missing values, and the overall completeness of our records.

In [None]:
# Data Type Overview
cat("üìå Data Type Summary:\n")
data_types <- titanic %>% 
  summarise(across(everything(), ~class(.x))) %>% 
  pivot_longer(everything(), names_to = "Column", values_to = "Data_Type")
print(data_types)

# Missing Value Analysis
cat("\nüîç Missing Value Analysis:\n")
missing_data <- titanic %>% 
  summarise(across(everything(), ~sum(is.na(.)))) %>% 
  pivot_longer(everything(), names_to = "Column", values_to = "Missing_Count") %>% 
  mutate(
    Missing_Percentage = round((Missing_Count / nrow(titanic)) * 100, 2),
    Completeness = round(100 - Missing_Percentage, 2)
  ) %>% 
  arrange(desc(Missing_Count))
print(missing_data)

### The Tale of Missing Values

In our quest for truth, we encounter **gaps**: places where data was not recorded. This is not a flaw; it is a feature of reality itself. Notice that:

- **Age** is missing in 177 records (19.87%): many passengers' ages were not recorded
- **Cabin** information is absent for 687 passengers (77.10%): a striking amount of missing information
- **Embarked** is missing in just 2 records (0.22%): nearly complete
- All other fields are fully documented

These gaps will shape our analysis and guide our choices in handling missing data.
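
One common way to act on gaps like these is sketched below on a tiny invented data frame, not the real manifest. The `toy` rows, and the choice of median imputation for a numeric column, most-frequent-value imputation for a categorical one, and a simple has-a-value flag for a mostly empty column, are assumptions for illustration rather than choices this notebook has committed to.

```r
# Toy stand-in for the manifest (values invented for illustration)
toy <- data.frame(
  Age      = c(22, NA, 38, 26, NA, 54),
  Embarked = c("S", "C", NA, "S", "Q", "S"),
  Cabin    = c(NA, "C85", NA, NA, NA, "E46"),
  stringsAsFactors = FALSE
)

# Numeric gap: replace NA with the median of the observed ages
age_median <- median(toy$Age, na.rm = TRUE)
toy$Age_filled <- ifelse(is.na(toy$Age), age_median, toy$Age)

# Categorical gap: replace NA with the most frequent port
port_mode <- names(which.max(table(toy$Embarked)))
toy$Embarked_filled <- ifelse(is.na(toy$Embarked), port_mode, toy$Embarked)

# Mostly-missing column: keep only whether a cabin was recorded at all
toy$HasCabin <- !is.na(toy$Cabin)

print(toy)
```

Keeping the filled values in new columns, rather than overwriting the originals, leaves it easy to check later whether a conclusion depends on the imputation.
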

In [None]:
# Visualize Missing Data Patterns
missing_plot <- missing_data %>% 
  filter(Missing_Count > 0) %>% 
  ggplot(aes(x = reorder(Column, -Missing_Count), y = Missing_Percentage)) +
  geom_col(fill = "#E74C3C", alpha = 0.8, width = 0.6) +
  geom_text(aes(label = paste0(Missing_Percentage, "%")), 
            vjust = -0.5, size = 4, fontface = "bold") +
  labs(
    title = "The Gaps in Our Records",
    subtitle = "Percentage of Missing Values by Variable",
    x = "Variable", 
    y = "Missing Percentage (%)",
    caption = "Missing data shapes our analytical approach"
  ) +
  ylim(0, max(missing_data$Missing_Percentage) * 1.15) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.background = element_rect(fill = "#F8F9FA", color = NA)
  )

print(missing_plot)

---

## Part 3: The Souls Aboard - Demographic Insights

### A Portrait of Passengers

Let us now paint a portrait of who boarded the Titanic. Each passenger carried their own story: their age, their family bonds, their social standing. These variables will help us understand the tragedy that unfolded.

In [None]:
# Prepare data and create key variables
titanic_clean <- titanic %>% 
  mutate(
    Survived = factor(Survived, levels = c(0, 1), labels = c("Perished", "Survived")),
    Pclass = factor(Pclass, levels = c(1, 2, 3), labels = c("First", "Second", "Third")),
    Sex = factor(Sex, levels = c("male", "female"), labels = c("Male", "Female")),
    Embarked = factor(Embarked, levels = c("S", "C", "Q"), 
                      labels = c("Southampton", "Cherbourg", "Queenstown"))
  )

# Categorical Variables Summary
cat("üìä CATEGORICAL VARIABLES SUMMARY\n")
cat("\n--- SURVIVAL STATUS ---\n")
print(table(titanic_clean$Survived))
print(prop.table(table(titanic_clean$Survived)))

cat("\n--- PASSENGER CLASS ---\n")
print(table(titanic_clean$Pclass))
print(prop.table(table(titanic_clean$Pclass)))

cat("\n--- GENDER DISTRIBUTION ---\n")
print(table(titanic_clean$Sex))
print(prop.table(table(titanic_clean$Sex)))

### The Survival Paradox

**A striking revelation**: Of the 891 passengers documented, only 342 survived (38.4%). This means that 549 souls perished (61.6%). A tragedy etched in the numbers.
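
To put a rough uncertainty band around that 38.4%, here is a minimal sketch using base R's `binom.test`. The two counts come straight from the paragraph above; treating the 891 records as independent draws from the full passenger list is an assumption, so the interval is indicative only.

```r
survived <- 342
total    <- 891

rate <- survived / total
ci   <- binom.test(survived, total)$conf.int  # exact (Clopper-Pearson) 95% interval

cat(sprintf("Survival rate: %.1f%%\n", 100 * rate))
cat(sprintf("95%% CI: %.1f%% to %.1f%%\n", 100 * ci[1], 100 * ci[2]))
```
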

The question that haunts us: **Why did some survive while others did not?** Was it mere chance, or were there patterns (social, demographic, or circumstantial) that determined fate? Let us visualize this fundamental division.

In [None]:
# Create comprehensive demographic visualization
p1 <- ggplot(titanic_clean, aes(x = Survived, fill = Survived)) +
  geom_bar(alpha = 0.8, show.legend = FALSE) +
  scale_fill_manual(values = c("Perished" = "#2C3E50", "Survived" = "#27AE60")) +
  geom_text(aes(label = paste0(after_stat(count), "\n(", 
                               round(after_stat(count)/sum(after_stat(count))*100, 1), "%)")),
            stat = "count", vjust = -0.5, size = 4, fontface = "bold") +
  labs(title = "Survival Status", subtitle = "The Price of the Disaster") +
  theme(axis.title.y = element_blank())

p2 <- ggplot(titanic_clean, aes(x = Pclass, fill = Pclass)) +
  geom_bar(alpha = 0.8, show.legend = FALSE) +
  scale_fill_brewer(palette = "Set2") +
  geom_text(aes(label = after_stat(count)), 
            stat = "count", vjust = -0.5, size = 4, fontface = "bold") +
  labs(title = "Passenger Classes", subtitle = "Social Hierarchy Aboard") +
  theme(axis.title.y = element_blank())

p3 <- ggplot(titanic_clean, aes(x = Sex, fill = Sex)) +
  geom_bar(alpha = 0.8, show.legend = FALSE) +
  scale_fill_manual(values = c("Male" = "#3498DB", "Female" = "#E91E63")) +
  geom_text(aes(label = after_stat(count)), 
            stat = "count", vjust = -0.5, size = 4, fontface = "bold") +
  labs(title = "Gender Distribution", subtitle = "The 'Women and Children First' Protocol") +
  theme(axis.title.y = element_blank())

p4 <- ggplot(titanic_clean %>% drop_na(Embarked), aes(x = Embarked, fill = Embarked)) +
  geom_bar(alpha = 0.8, show.legend = FALSE) +
  scale_fill_brewer(palette = "Dark2") +
  geom_text(aes(label = after_stat(count)), 
            stat = "count", vjust = -0.5, size = 4, fontface = "bold") +
  labs(title = "Embarkation Ports", subtitle = "Where Journeys Began") +
  theme(axis.title.y = element_blank(), axis.text.x = element_text(angle = 0))

gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2,
                        top = grid::textGrob("Demographic Landscape of Titanic Passengers",
                                            gp = grid::gpar(fontsize = 14, fontface = "bold")))

---

## Part 4: The Age of Passengers - A Continuous Story

### Exploring the Life Stages Aboard

Age represents stages of life: childhood innocence, adult responsibility, elderly wisdom. Let us explore the age distribution of passengers, keeping in mind that 177 records lack this information.

In [None]:
# Age Statistics
cat("üìà AGE STATISTICS\n")
cat("‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê\n")
age_stats <- titanic_clean %>% 
  summarise(
    Count = n(),
    Valid_Count = sum(!is.na(Age)),
    Missing = sum(is.na(Age)),
    Mean = round(mean(Age, na.rm = TRUE), 2),
    Median = round(median(Age, na.rm = TRUE), 2),
    Std_Dev = round(sd(Age, na.rm = TRUE), 2),
    Min = round(min(Age, na.rm = TRUE), 2),
    Q1 = round(quantile(Age, 0.25, na.rm = TRUE), 2),
    Q3 = round(quantile(Age, 0.75, na.rm = TRUE), 2),
    Max = round(max(Age, na.rm = TRUE), 2),
    Skewness = round(skewness(Age, na.rm = TRUE), 3),
    Kurtosis = round(kurtosis(Age, na.rm = TRUE), 3)
  ) %>% 
  pivot_longer(everything(), names_to = "Metric", values_to = "Value")

print(age_stats)

cat("\nüìå Interpretation:\n")
cat("‚Ä¢ The mean age (29.7 years) is HIGHER than the median (28 years)\n")
cat("‚Ä¢ This suggests the presence of OUTLIERS in the upper range\n")
cat("‚Ä¢ Skewness = 0.389 indicates a slight right skew\n")
cat("‚Ä¢ Most passengers were between 20 and 40 years old\n")
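
For readers curious about what the skewness figure actually measures, the sketch below computes it by hand as the third central moment divided by the 1.5 power of the second. To my understanding this is the same moment-based definition `moments::skewness()` uses, though that is worth confirming against the package documentation. The `ages` vector is invented.

```r
# Invented ages with a longer upper tail
ages <- c(2, 18, 21, 24, 26, 28, 30, 35, 47, 71)

m  <- mean(ages)
m2 <- mean((ages - m)^2)   # second central moment (population variance)
m3 <- mean((ages - m)^3)   # third central moment

skew <- m3 / m2^(3/2)      # positive when the upper tail is longer

cat("Mean:", m, "| Median:", median(ages), "| Skewness:", round(skew, 3), "\n")
```

A mean above the median and a positive skewness tell the same story here: a few older passengers stretch the upper tail.
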

In [None]:
# Beautiful Age Distribution Visualizations
p1 <- ggplot(titanic_clean, aes(x = Age)) +
  geom_histogram(bins = 30, fill = "#3498DB", alpha = 0.7, color = "white", size = 0.5) +
  # geom_vline() has no `label` parameter; the caption below identifies each line
  geom_vline(aes(xintercept = mean(Age, na.rm = TRUE)), 
             color = "#E74C3C", linetype = "dashed", size = 1.2) +
  geom_vline(aes(xintercept = median(Age, na.rm = TRUE)), 
             color = "#F39C12", linetype = "dotted", size = 1.2) +
  labs(
    title = "Age Distribution Histogram",
    subtitle = "How Passengers' Ages Were Distributed",
    x = "Age (years)",
    y = "Frequency",
    caption = "Red Dash = Mean | Orange Dot = Median"
  ) +
  theme(plot.subtitle = element_text(color = "gray50"))

p2 <- ggplot(titanic_clean, aes(x = Age)) +
  geom_density(fill = "#9B59B6", alpha = 0.6, color = "#8E44AD", size = 1) +
  geom_rug(alpha = 0.3, size = 0.5) +
  labs(
    title = "Age Density Curve",
    subtitle = "Smooth Distribution of Passenger Ages",
    x = "Age (years)",
    y = "Density"
  )

p3 <- ggplot(titanic_clean, aes(x = "", y = Age)) +  # constant x so geom_jitter() has both x and y
  geom_boxplot(fill = "#1ABC9C", alpha = 0.7, color = "#16A085", size = 1) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 1.5, color = "#34495E") +
  labs(
    title = "Age Distribution (Box Plot)",
    subtitle = "Quartiles, Median, and Outliers",
    y = "Age (years)",
    x = ""
  ) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

p4 <- ggplot(titanic_clean, aes(sample = Age)) +
  stat_qq(color = "#E67E22", size = 2.5, alpha = 0.7) +
  stat_qq_line(color = "#D35400", size = 1, linetype = "dashed") +
  labs(
    title = "Q-Q Plot",
    subtitle = "Comparing Age Distribution to Normal Distribution",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  )

gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2,
                        top = grid::textGrob("The Spectrum of Life Aboard the Titanic",
                                            gp = grid::gpar(fontsize = 14, fontface = "bold")))

---

## Part 5: Ticket Fares - Economic Status Revealed

### The Price of Passage

The fare paid for passage reveals much about the passengers' social standing. Let us examine this economic dimension of the tragedy.

In [None]:
# Fare Statistics
cat("üí∞ TICKET FARE STATISTICS\n")
cat("‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê\n")
fare_stats <- titanic_clean %>% 
  summarise(
    Count = n(),
    Valid_Count = sum(!is.na(Fare)),
    Missing = sum(is.na(Fare)),
    Mean = round(mean(Fare, na.rm = TRUE), 2),
    Median = round(median(Fare, na.rm = TRUE), 2),
    Std_Dev = round(sd(Fare, na.rm = TRUE), 2),
    Min = round(min(Fare, na.rm = TRUE), 2),
    Q1 = round(quantile(Fare, 0.25, na.rm = TRUE), 2),
    Q3 = round(quantile(Fare, 0.75, na.rm = TRUE), 2),
    Max = round(max(Fare, na.rm = TRUE), 2),
    Skewness = round(skewness(Fare, na.rm = TRUE), 3),
    Kurtosis = round(kurtosis(Fare, na.rm = TRUE), 3)
  ) %>% 
  pivot_longer(everything(), names_to = "Metric", values_to = "Value")

print(fare_stats)

cat("\nüìå Economic Insights:\n")
cat("‚Ä¢ Mean Fare: ¬£32.20 (significant variation exists)\n")
cat("‚Ä¢ Median Fare: ¬£14.45 (shows extreme high fares pulling average up)\n")
cat("‚Ä¢ Max Fare: ¬£512.33 (the wealthiest passengers)\n")
cat("‚Ä¢ Min Fare: ¬£0.00 (crew members or special cases)\n")
cat("‚Ä¢ Highly right-skewed (Skewness = 2.185) - extreme wealth inequality\n")
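
The gap between mean and median fare is easier to trust once we see how little it takes to produce one. The sketch below uses ten invented fares with a single very expensive ticket, and applies the usual 1.5 × IQR fences to flag the values responsible. The numbers are illustrative only.

```r
# Invented fares with one very expensive ticket
fares <- c(0, 7.25, 7.9, 8.05, 13, 14.45, 26, 31, 80, 512.33)

fare_mean   <- mean(fares)     # pulled upward by the extreme value
fare_median <- median(fares)   # barely affected by it

# Fences at 1.5 * IQR beyond the quartiles
q           <- quantile(fares, c(0.25, 0.75))
iqr         <- q[[2]] - q[[1]]
upper_fence <- q[[2]] + 1.5 * iqr
lower_fence <- q[[1]] - 1.5 * iqr
outliers    <- fares[fares > upper_fence | fares < lower_fence]

cat("Mean:", round(fare_mean, 2), "| Median:", fare_median, "\n")
cat("Flagged as outliers:", outliers, "\n")
```
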

In [None]:
# Beautiful Fare Distribution Visualizations
p1 <- ggplot(titanic_clean, aes(x = Fare)) +
  geom_histogram(bins = 50, fill = "#16A085", alpha = 0.7, color = "white") +
  scale_x_continuous(limits = c(0, 300)) +
  geom_vline(aes(xintercept = mean(Fare, na.rm = TRUE)), 
             color = "#E74C3C", linetype = "dashed", size = 1.2) +
  geom_vline(aes(xintercept = median(Fare, na.rm = TRUE)), 
             color = "#F39C12", linetype = "dotted", size = 1.2) +
  labs(
    title = "Fare Distribution (Limited to £300)",
    subtitle = "The Economic Divide Among Passengers",
    x = "Ticket Fare (£)",
    y = "Number of Passengers"
  )

p2 <- ggplot(titanic_clean, aes(x = Fare, fill = Pclass)) +
  geom_histogram(bins = 40, alpha = 0.7, color = "white", position = "dodge") +
  scale_x_continuous(limits = c(0, 300)) +
  scale_fill_brewer(palette = "Set2", name = "Class") +
  labs(
    title = "Fare by Passenger Class",
    subtitle = "Economic Stratification Across Social Tiers",
    x = "Ticket Fare (£)",
    y = "Frequency"
  )

p3 <- ggplot(titanic_clean, aes(y = Fare, x = Pclass, fill = Pclass)) +
  geom_boxplot(alpha = 0.7, show.legend = FALSE) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 1, color = "#34495E") +
  scale_fill_brewer(palette = "Spectral") +
  coord_cartesian(ylim = c(0, 300)) +  # zoom only; scale limits would drop fares above 300 before the box statistics are computed
  labs(
    title = "Fare Distribution by Class",
    subtitle = "Clear Separation in Ticket Prices",
    x = "Passenger Class",
    y = "Ticket Fare (£)"
  )

p4 <- ggplot(titanic_clean %>% filter(Fare > 0), aes(x = Fare)) +
  geom_density(fill = "#E91E63", alpha = 0.6, color = "#C2185B", size = 1) +
  scale_x_log10() +
  labs(
    title = "Log-Scale Fare Density",
    subtitle = "Revealing Patterns in the Right Tail",
    x = "Ticket Fare (£, log scale)",
    y = "Density"
  )

gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2,
                        top = grid::textGrob("The Economics of the Titanic",
                                            gp = grid::gpar(fontsize = 14, fontface = "bold")))

---

## Part 6: The Sacred Bond - Family and Relationships

### Kinship and Proximity

The variables **SibSp** (Siblings and Spouses) and **Parch** (Parents and Children) reveal family structures. Did those who traveled with family have better chances? Let us explore this dimension of survival.

In [None]:
# Family Relationships Analysis
titanic_clean <- titanic_clean %>% 
  mutate(
    FamilySize = SibSp + Parch + 1,
    IsAlone = ifelse(FamilySize == 1, "Alone", "With Family")
  )

cat("üë®‚Äçüë©‚Äçüëß‚Äçüë¶ FAMILY STRUCTURE ANALYSIS\n")
cat("‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê\n")
cat("\nSiblings/Spouses Count:\n")
print(table(titanic_clean$SibSp))

cat("\nParents/Children Count:\n")
print(table(titanic_clean$Parch))

cat("\nFamily Size Distribution:\n")
print(table(titanic_clean$FamilySize))

cat("\nTraveling Alone vs. With Family:\n")
print(prop.table(table(titanic_clean$IsAlone)) * 100)

# Family Statistics
cat("\nüìä Family Statistics:\n")
cat("‚Ä¢ Total Passengers: ", nrow(titanic_clean), "\n")
cat("‚Ä¢ Traveling Alone: ", sum(titanic_clean$FamilySize == 1), " (", 
    round(sum(titanic_clean$FamilySize == 1)/nrow(titanic_clean)*100, 1), "%)\n")
cat("‚Ä¢ With Family: ", sum(titanic_clean$FamilySize > 1), " (", 
    round(sum(titanic_clean$FamilySize > 1)/nrow(titanic_clean)*100, 1), "%)\n")
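
The cell above counts who traveled alone but stops short of the survival question we posed. A minimal sketch of that next step follows, on eight invented rows rather than the real data. It relies on the fact that the mean of a 0/1 column is the share of 1s, so a grouped mean is a grouped survival rate. The same call would presumably work on the real columns, provided `Survived` is still coded 0/1 rather than as a factor.

```r
# Invented rows; Survived is coded 0 = perished, 1 = survived
toy <- data.frame(
  SibSp    = c(0, 1, 0, 3, 0, 1, 0, 0),
  Parch    = c(0, 0, 0, 2, 0, 1, 0, 2),
  Survived = c(0, 1, 0, 0, 1, 1, 0, 1)
)

toy$FamilySize <- toy$SibSp + toy$Parch + 1   # +1 counts the passenger themselves
toy$IsAlone    <- ifelse(toy$FamilySize == 1, "Alone", "With Family")

# Mean of a 0/1 column within each group = survival rate of that group
rate_by_group <- tapply(toy$Survived, toy$IsAlone, mean)
print(round(100 * rate_by_group, 1))
```
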

In [None]:
# Family Structure Visualizations
p1 <- ggplot(titanic_clean, aes(x = factor(SibSp), fill = factor(SibSp))) +
  geom_bar(alpha = 0.8, show.legend = FALSE) +
  scale_fill_viridis_d() +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5, fontface = "bold") +
  labs(
    title = "Siblings and Spouses Aboard",
    subtitle = "Distribution of Sibling and Spouse Counts",
    x = "Number of Siblings/Spouses",
    y = "Count"
  )

p2 <- ggplot(titanic_clean, aes(x = factor(Parch), fill = factor(Parch))) +
  geom_bar(alpha = 0.8, show.legend = FALSE) +
  scale_fill_viridis_d() +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5, fontface = "bold") +
  labs(
    title = "Parents and Children Aboard",
    subtitle = "Distribution of Parent and Child Counts",
    x = "Number of Parents/Children",
    y = "Count"
  )

p3 <- ggplot(titanic_clean, aes(x = factor(FamilySize), fill = factor(FamilySize))) +
  geom_bar(alpha = 0.8, show.legend = FALSE) +
  scale_fill_brewer(palette = "Set3") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5, fontface = "bold") +
  labs(
    title = "Total Family Size Distribution",
    subtitle = "Complete Family Groups on the Titanic",
    x = "Total Family Size",
    y = "Count"
  )

p4 <- ggplot(titanic_clean, aes(x = IsAlone, fill = IsAlone)) +
  geom_bar(alpha = 0.8, show.legend = FALSE) +
  scale_fill_manual(values = c("Alone" = "#E74C3C", "With Family" = "#27AE60")) +
  geom_text(aes(label = paste0(after_stat(count), "\n(", 
                               round(after_stat(count)/sum(after_stat(count))*100, 1), "%)")),
            stat = "count", vjust = -0.5, fontface = "bold") +
  labs(
    title = "Solitary vs. Family Travelers",
    subtitle = "The Human Element of the Journey",
    x = "Travel Status",
    y = "Count"
  )

gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2,
                        top = grid::textGrob("Family Bonds and Kinship Aboard the Titanic",
                                            gp = grid::gpar(fontsize = 14, fontface = "bold")))

---

## Part 7: The Fatal Intersections - Bivariate Analysis

### Where Stories Meet: Survival and Its Determinants

Now we venture into the heart of our tragedy: **What factors determined who lived and who died?** By examining relationships between variables, we shall uncover the patterns that fate wrote into the data.

### Gender and Survival: The Chivalry Protocol

History records that officers enforced a "women and children first" protocol. Let us see if the data confirms this noble sacrifice.

In [None]:
# Gender vs Survival Analysis
cat("⚡ THE GENDER DIVIDE IN SURVIVAL\n")
cat("═════════════════════════════════════════════\n")

survival_by_gender <- titanic_clean %>% 
  group_by(Sex, Survived) %>% 
  summarise(Count = n(), .groups = 'drop') %>% 
  pivot_wider(names_from = Survived, values_from = Count, values_fill = 0) %>% 
  mutate(
    Total = Perished + Survived,
    Survival_Rate = round((Survived / Total) * 100, 2)
  )

print(survival_by_gender)

cat("\nüîç Statistical Insight:\n")
female_survival <- survival_by_gender$Survival_Rate[survival_by_gender$Sex == "Female"]
male_survival <- survival_by_gender$Survival_Rate[survival_by_gender$Sex == "Male"]
cat("‚Ä¢ Female Survival Rate: ", female_survival, "%\n")
cat("‚Ä¢ Male Survival Rate: ", male_survival, "%\n")
cat("‚Ä¢ Difference: ", round(female_survival - male_survival, 2), " percentage points\n")
cat("\nüíî INTERPRETATION:\n")
cat("The 'women and children first' protocol is CONFIRMED.\n")
cat("Women had a ", round(female_survival/male_survival, 1), "x higher chance of survival!\n")
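
Whether a gap this wide could plausibly arise by chance can be checked with a chi-squared test of independence. The sketch below hard-codes the sex-by-outcome counts I would expect from the standard Kaggle `train.csv` (577 men, 314 women). Treat them as an assumption, and substitute `table(titanic_clean$Sex, titanic_clean$Survived)` to test your own copy of the data.

```r
# Assumed counts for the standard training set; verify against your own table()
counts <- matrix(
  c(468, 109,    # Male:   perished, survived
     81, 233),   # Female: perished, survived
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("Male", "Female"),
                  Outcome = c("Perished", "Survived"))
)

test <- chisq.test(counts)   # for 2x2 tables the Yates continuity correction is applied by default
print(test)

# Row-wise survival rates, for comparison with the cell above
rates <- counts[, "Survived"] / rowSums(counts)
print(round(100 * rates, 1))
```

A tiny p-value says the association is unlikely to be noise; it does not by itself say why the association exists.
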

In [None]:
# Gender and Survival Visualizations
p1 <- ggplot(titanic_clean, aes(x = Sex, fill = Survived)) +
  geom_bar(position = "stack", alpha = 0.8, color = "white", size = 1) +
  scale_fill_manual(values = c("Perished" = "#2C3E50", "Survived" = "#27AE60")) +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1, 
            position = position_stack(vjust = 0.5), fontface = "bold", color = "white") +
  labs(
    title = "Gender and Survival (Stacked)",
    subtitle = "Raw Counts of Deaths and Survivals",
    x = "Gender",
    y = "Count",
    fill = "Outcome"
  )

p2 <- ggplot(titanic_clean, aes(x = Sex, fill = Survived)) +
  geom_bar(position = "fill", alpha = 0.8, color = "white", size = 1) +
  scale_fill_manual(values = c("Perished" = "#2C3E50", "Survived" = "#27AE60")) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste0(round(after_stat(count)/tapply(after_stat(count), after_stat(x), sum)[after_stat(x)]*100, 1), "%")),
            stat = "count", position = position_fill(vjust = 0.5), fontface = "bold", color = "white", size = 4) +
  labs(
    title = "Gender and Survival (Proportions)",
    subtitle = "The 'Women and Children First' Protocol",
    x = "Gender",
    y = "Percentage",
    fill = "Outcome"
  )

p3 <- ggplot(survival_by_gender, aes(x = Sex, y = Survival_Rate, fill = Sex)) +
  geom_col(alpha = 0.8, color = "white", size = 1) +
  scale_fill_manual(values = c("Male" = "#3498DB", "Female" = "#E91E63"), guide = "none") +
  geom_text(aes(label = paste0(Survival_Rate, "%")), vjust = -0.5, fontface = "bold", size = 5) +
  ylim(0, max(survival_by_gender$Survival_Rate) * 1.15) +
  labs(
    title = "Survival Rate by Gender",
    subtitle = "What Percentage of Each Gender Survived?",
    x = "Gender",
    y = "Survival Rate (%)"
  )

p4 <- ggplot(titanic_clean, aes(x = Sex, y = Age, fill = Survived)) +
  geom_violin(alpha = 0.7, scale = "width", position = position_dodge(0.9)) +
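  # Setting fill to a constant drops it from the grouping, so group explicitly to keep boxes aligned with the violins
  geom_boxplot(aes(group = interaction(Sex, Survived)), width = 0.15, fill = "white",
               position = position_dodge(0.9), alpha = 0.8) +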
  scale_fill_manual(values = c("Perished" = "#2C3E50", "Survived" = "#27AE60")) +
  labs(
    title = "Age Distribution by Gender and Survival",
    subtitle = "Did Age Compound Gender's Effect?",
    x = "Gender",
    y = "Age (years)",
    fill = "Outcome"
  )

gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2,
                        top = grid::textGrob("Gender's Powerful Influence on Fate",
                                            gp = grid::gpar(fontsize = 14, fontface = "bold")))

### Class and Survival: The Hierarchy of Privilege

Did wealth afford protection? Let us examine if passenger class influenced survival chances.

In [None]:
# Class vs Survival Analysis
cat("üíé THE CLASS DIVIDE: PRIVILEGE AND SURVIVAL\n")
cat("‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê\n")

survival_by_class <- titanic_clean %>% 
  group_by(Pclass, Survived) %>% 
  summarise(Count = n(), .groups = 'drop') %>% 
  pivot_wider(names_from = Survived, values_from = Count, values_fill = 0) %>% 
  mutate(
    Total = Perished + Survived,
    Survival_Rate = round((Survived / Total) * 100, 2)
  ) %>% 
  arrange(Pclass)

print(survival_by_class)

cat("\nüîç Class-Based Survival Rates:\n")
for(i in 1:nrow(survival_by_class)) {
  row <- survival_by_class[i, ]
  cat("‚Ä¢ ", as.character(row$Pclass), " Class: ", 
      row$Survival_Rate, "% survival rate (", row$Survived, " of ", row$Total, " survived)\n")
}

cat("\nüíî THE CRUEL TRUTH:\n")
cat("First class passengers enjoyed a 2.4x higher survival rate than third class.\n")
cat("The ship's tragedy was STRATIFIED BY WEALTH.\n")
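
Because the classes are ordered, we can ask not only whether survival differs by class but whether it falls steadily from First to Third. Below is a sketch using base R's `prop.trend.test`. The per-class counts are the ones I would expect from the standard Kaggle `train.csv` and are an assumption here; the `Survived` and `Total` columns of `survival_by_class` can be passed in their place.

```r
# Assumed per-class counts for the standard training set; verify against your own table
survived <- c(First = 136, Second = 87, Third = 119)
total    <- c(First = 216, Second = 184, Third = 491)

rates <- survived / total
ratio_first_third <- rates[["First"]] / rates[["Third"]]

# Chi-squared test for a linear trend in proportions across the ordered classes
trend <- prop.trend.test(survived, total)

print(round(100 * rates, 2))
cat("First-to-third survival ratio:", round(ratio_first_third, 1), "\n")
print(trend)
```
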

In [None]:
# Class and Survival Visualizations
p1 <- ggplot(titanic_clean, aes(x = Pclass, fill = Survived)) +
  geom_bar(position = "fill", alpha = 0.8, color = "white", size = 1) +
  scale_fill_manual(values = c("Perished" = "#2C3E50", "Survived" = "#27AE60")) +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste0(round(after_stat(count)/tapply(after_stat(count), after_stat(x), sum)[after_stat(x)]*100, 1), "%")),
            stat = "count", position = position_fill(vjust = 0.5), fontface = "bold", color = "white", size = 4) +
  labs(
    title = "Survival by Passenger Class",
    subtitle = "The Price of Privilege",
    x = "Passenger Class",
    y = "Percentage",
    fill = "Outcome"
  )

p2 <- ggplot(survival_by_class, aes(x = Pclass, y = Survival_Rate, fill = Pclass)) +
  geom_col(alpha = 0.8, color = "white", size = 1) +
  scale_fill_brewer(palette = "Set2", guide = "none") +
  geom_text(aes(label = paste0(Survival_Rate, "%")), vjust = -0.5, fontface = "bold", size = 5) +
  ylim(0, max(survival_by_class$Survival_Rate) * 1.15) +
  labs(
    title = "Survival Rate by Class",
    subtitle = "Clear Gradient of Success",
    x = "Passenger Class",
    y = "Survival Rate (%)"
  )

p3 <- ggplot(titanic_clean, aes(x = Pclass, y = Fare, fill = Pclass)) +
  geom_boxplot(alpha = 0.7, show.legend = FALSE) +
  geom_jitter(aes(color = Survived), width = 0.2, alpha = 0.3, size = 1.5) +
  scale_fill_brewer(palette = "Spectral") +
  scale_color_manual(values = c("Perished" = "#2C3E50", "Survived" = "#27AE60")) +
  coord_cartesian(ylim = c(0, 300)) +  # zoom only; scale limits would drop fares above 300 before the box statistics are computed
  labs(
    title = "Fare Distribution by Class",
    subtitle = "Economic Disparity Across Tiers",
    x = "Passenger Class",
    y = "Ticket Fare (£)",
    color = "Outcome"
  )

p4 <- titanic_clean %>% 
  group_by(Pclass, Survived) %>% 
  summarise(Count = n(), .groups = 'drop') %>% 
  ggplot(aes(x = Pclass, y = Count, fill = Survived)) +
  geom_col(alpha = 0.8, color = "white", size = 1, position = "dodge") +
  scale_fill_manual(values = c("Perished" = "#2C3E50", "Survived" = "#27AE60")) +
  geom_text(aes(label = Count), vjust = -0.3, fontface = "bold", size = 4,
            position = position_dodge(width = 0.9)) +
  labs(
    title = "Absolute Numbers by Class",
    subtitle = "Comparing Actual Deaths vs Survivals",
    x = "Passenger Class",
    y = "Number of Passengers",
    fill = "Outcome"
  )

gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2,
                        top = grid::textGrob("Class's Decisive Role in Survival",
                                            gp = grid::gpar(fontsize = 14, fontface = "bold")))

### Age and Survival: The Innocence of Children

Did youth offer protection? Perhaps the sacrifice extended to children as well?
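
One direct way to answer this is to bin ages into life stages and compare the survival rate in each bin. The sketch below does so on ten invented rows. The cut points (12, 18, 60) and the group labels are assumptions chosen for illustration, and different boundaries can change the picture.

```r
# Invented rows; one passenger has no recorded age
toy <- data.frame(
  Age      = c(4, 9, 15, 22, 30, 41, 58, 63, 70, NA),
  Survived = c(1, 1, 0, 0, 1, 0, 1, 0, 0, 1)
)

toy$AgeGroup <- cut(
  toy$Age,
  breaks = c(0, 12, 18, 60, Inf),
  labels = c("Child", "Teen", "Adult", "Senior"),
  right  = FALSE            # intervals are [0, 12), [12, 18), [18, 60), [60, Inf)
)

# Passengers with unknown age fall into no bin and are set aside
known <- toy[!is.na(toy$AgeGroup), ]
rate_by_age <- tapply(known$Survived, known$AgeGroup, mean)
print(round(100 * rate_by_age, 1))
```
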

In [None]:
# Age and Survival Analysis
p1 <- ggplot(titanic_clean, aes(x = Age, fill = Survived)) +
  geom_histogram(bins = 30, alpha = 0.7, color = "white", position = "identity") +
  scale_fill_manual(values = c("Perished" = "#2C3E50", "Survived" = "#27AE60")) +
  labs(
    title = "Age Distribution by Survival Status",
    subtitle = "Did Youth Protect from the Tragedy?",
    x = "Age (years)",
    y = "Count",
    fill = "Outcome"
  )

p2 <- ggplot(titanic_clean, aes(y = Age, x = Survived, fill = Survived)) +
  geom_violin(alpha = 0.7, show.legend = FALSE) +
  geom_boxplot(width = 0.15, fill = "white", alpha = 0.8) +
  geom_jitter(width = 0.1, alpha = 0.2, size = 1) +
  scale_fill_manual(values = c("Perished" = "#2C3E50", "Survived" = "#27AE60")) +
  labs(
    title = "Age Distribution (Violin Plot)",
    subtitle = "Comparing Age Ranges of Survivors vs. Those Lost",
    x = "Outcome",
    y = "Age (years)"
  )

p3 <- ggplot(titanic_clean %>% drop_na(Age), aes(x = Age, fill = Survived)) +
  geom_density(alpha = 0.6) +
  scale_fill_manual(values = c("Perished" = "#2C3E50", "Survived" = "#27AE60")) +
  labs(
    title = "Age Density by Survival Status",
    subtitle = "Smooth Distribution Comparison",
    x = "Age (years)",
    y = "Density",
    fill = "Outcome"
  )

p4 <- ggplot(titanic_clean %>% drop_na(Age), aes(x = Age, fill = Survived)) +
  # Fill on counts, not densities: densities are normalized per group and would distort the share within each bin
  geom_histogram(bins = 25, alpha = 0.7, position = "fill", color = "white") +
  scale_fill_manual(values = c("Perished" = "#2C3E50", "Survived" = "#27AE60")) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Stacked Age Distribution",
    subtitle = "Survival Percentage Within Each Age Bracket",
    x = "Age (years)",
    y = "Percentage",
    fill = "Outcome"
  )

gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2,
                        top = grid::textGrob("Age's Influence on the Titanic Tragedy",
                                            gp = grid::gpar(fontsize = 14, fontface = "bold")))

---

## Part 8: The Grand Correlation Web - Finding Hidden Connections

### Relationships Between Variables

In the complexity of the disaster, multiple factors intertwined. Let us map these connections, revealing how variables danced together in the tragedy.
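
A practical caveat before mapping these connections: `cor()` accepts only numeric input, so any column stored as a factor or character must be recoded first. The sketch below shows one way to do that on six invented rows, recoding outcome and sex as 0/1 indicators. The column name `IsFemale` is hypothetical, introduced only for this illustration. With a 0/1 variable, the Pearson coefficient is the point-biserial correlation.

```r
# Invented rows with a factor outcome and a character sex column
toy <- data.frame(
  Survived = factor(c("Perished", "Survived", "Survived", "Perished", "Survived", "Perished"),
                    levels = c("Perished", "Survived")),
  Sex      = c("male", "female", "female", "male", "male", "female"),
  Fare     = c(7.25, 71.28, 7.92, 8.05, 53.10, 8.46)
)

# Recode to numeric indicators before calling cor()
num <- data.frame(
  Survived = as.integer(toy$Survived == "Survived"),  # 0 = perished, 1 = survived
  IsFemale = as.integer(toy$Sex == "female"),         # hypothetical indicator column
  Fare     = toy$Fare
)

corr <- cor(num, use = "complete.obs")
print(round(corr, 3))
```
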

In [None]:
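# Correlation Analysis - Numeric Variables Only
# Use the raw `titanic` data: in `titanic_clean`, Survived and Pclass are factors, and cor() needs numeric input
# (PassengerId is an arbitrary row identifier, so its correlations carry no meaning)
numeric_cols <- c("PassengerId", "Survived", "Pclass", "Age", "SibSp", "Parch", "Fare")
correlation_matrix <- cor(titanic[, numeric_cols], use = "complete.obs")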

cat("üîó CORRELATION MATRIX\n")
cat("‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê\n")
print(round(correlation_matrix, 3))

cat("\nüìä Key Correlations with Survival:\n")
survival_corr <- sort(correlation_matrix["Survived", ], decreasing = TRUE)
for(i in 1:length(survival_corr)) {
  var_name <- names(survival_corr)[i]
  corr_value <- survival_corr[i]
  if(var_name != "Survived") {
    strength <- ifelse(abs(corr_value) > 0.5, "STRONG", 
                       ifelse(abs(corr_value) > 0.3, "MODERATE", "WEAK"))
    cat("‚Ä¢", var_name, ":", round(corr_value, 3), " (", strength, ")\n")
  }
}

In [None]:
# Correlation Visualizations
# Traditional corrplot (it draws with base graphics as a side effect, so there is no plot object to keep)
corrplot(correlation_matrix, 
         method = "circle",
         type = "upper",
         order = "hclust",
         tl.cex = 0.8,
         tl.col = "black",
         col = colorRampPalette(c("#E74C3C", "white", "#27AE60"))(100),
         main = "Correlation Web\n(Traditional View)",
         mar = c(0, 0, 2, 0))

# ggcorrplot for a ggplot2-based view
ggcorr_plot <- ggcorrplot(correlation_matrix,
                          method = "circle",
                          type = "upper",
                          lab = TRUE,
                          lab_size = 3,
                          colors = c("#E74C3C", "white", "#27AE60"),
                          show.legend = TRUE,
                          legend.title = "Correlation",
                          ggtheme = theme_minimal()) +
  labs(title = "Correlation Matrix Heatmap",
       subtitle = "Relationships Between Variables",
       caption = "Red = Negative | Green = Positive")

print(ggcorr_plot)
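
A correlation plot is easy to scan but hard to quote from. One possible companion, sketched here with base R only, is to flatten the upper triangle of the matrix into a ranked list of variable pairs. The data frame `toy` is simulated stand-in data, not the Titanic data, and the column names are placeholders.

```r
# Simulated stand-in data (NOT the Titanic data)
set.seed(42)
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a + rnorm(50, sd = 0.5)   # built to track a closely
toy$c <- -toy$a + rnorm(50, sd = 2)    # built to move loosely against a
toy$d <- rnorm(50)                     # unrelated noise

cm <- cor(toy)

# Row/column positions of the upper triangle (each pair once, no diagonal)
idx <- which(upper.tri(cm), arr.ind = TRUE)

ranked_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Strongest relationships first, regardless of sign
ranked_pairs <- ranked_pairs[order(-abs(ranked_pairs$r)), ]
print(ranked_pairs, row.names = FALSE)
```

The same steps should apply unchanged to any square correlation matrix with row and column names.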

---

## Part 9: The Complex Intersections - The Interplay of Gender, Class, and Age

### Where Destiny Was Written

The true story of the Titanic is not one-dimensional. It emerged from the convergence of multiple factors. Let us explore the sacred intersection of three critical dimensions: **Gender, Class, and Survival**.

In [None]:
# Multi-dimensional Analysis
survival_gender_class <- titanic_clean %>% 
  group_by(Sex, Pclass, Survived) %>% 
  summarise(Count = n(), .groups = 'drop') %>% 
  pivot_wider(names_from = Survived, values_from = Count, values_fill = 0) %>% 
  mutate(
    Total = Perished + Survived,
    Survival_Rate = round((Survived / Total) * 100, 2)
  )

cat("🔥 SURVIVAL RATES: GENDER × CLASS INTERSECTION\n")
cat(strrep("═", 45), "\n", sep = "")
print(survival_gender_class)

cat("\n📊 Key Findings:\n")
for(class_val in c("First", "Second", "Third")) {
  female_rate <- survival_gender_class$Survival_Rate[
    survival_gender_class$Sex == "Female" & survival_gender_class$Pclass == class_val]
  male_rate <- survival_gender_class$Survival_Rate[
    survival_gender_class$Sex == "Male" & survival_gender_class$Pclass == class_val]
  cat("•", class_val, "Class: Female", female_rate, "% vs Male", male_rate, "%\n")
}
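
As a cross-check on the same gender-by-class idea, base R ships a separate, aggregated `Titanic` contingency table (2,201 people, crew included). It is a different source from `titanic_clean`, so its rates will not match the table printed here exactly. This sketch only shows how the same intersection can be computed from a cross-tabulation instead of row-level data.

```r
# Built-in 4-way table: Class x Sex x Age x Survived
# (includes crew; NOT the same data as titanic_clean)
by_class_sex <- margin.table(Titanic, margin = c(1, 2, 4))  # collapse over Age

# Proportions within each Class x Sex cell, keeping the "Yes" slice
survival_rate <- prop.table(by_class_sex, margin = c(1, 2))[, , "Yes"]

print(round(100 * survival_rate, 1))
```

Working from margins like this keeps the denominators explicit: each cell's rate is divided by the number of people in that class and sex, not by the ship's total.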

In [None]:
# Multi-dimensional Visualizations
p1 <- ggplot(titanic_clean, aes(x = Pclass, fill = Survived)) +
  geom_bar(alpha = 0.8, color = "white", linewidth = 1) +  # linewidth replaces size for outlines in ggplot2 >= 3.4.0
  facet_wrap(~Sex, ncol = 2) +
  scale_fill_manual(values = c("Perished" = "#2C3E50", "Survived" = "#27AE60")) +
  labs(
    title = "Gender and Class Intersection",
    subtitle = "Survival Outcomes Across Gender and Class Tiers",
    x = "Passenger Class",
    y = "Count",
    fill = "Outcome"
  )

p2 <- ggplot(survival_gender_class, aes(x = Pclass, y = Survival_Rate, 
                                        fill = Sex, color = Sex)) +
  geom_col(position = "dodge", alpha = 0.8) +
  scale_fill_manual(values = c("Male" = "#3498DB", "Female" = "#E91E63")) +
  scale_color_manual(values = c("Male" = "#2C3E50", "Female" = "#B71C1C")) +
  geom_text(aes(label = paste0(Survival_Rate, "%")), 
            position = position_dodge(width = 0.9), vjust = -0.5, fontface = "bold") +
  ylim(0, max(survival_gender_class$Survival_Rate) * 1.15) +
  labs(
    title = "Survival Rates: Gender & Class",
    subtitle = "The Compounding Effects of Gender and Wealth",
    x = "Passenger Class",
    y = "Survival Rate (%)",
    fill = "Gender",
    color = "Gender"
  )

p3 <- ggplot(titanic_clean %>% drop_na(Age), aes(x = Age, y = Fare, color = Survived, shape = Sex)) +
  geom_point(alpha = 0.6, size = 2.5) +
  facet_wrap(~Pclass, ncol = 3) +
  scale_color_manual(values = c("Perished" = "#2C3E50", "Survived" = "#27AE60")) +
  scale_y_continuous(limits = c(0, 300)) +
  labs(
    title = "Age vs Fare by Class and Gender",
    subtitle = "The Multidimensional Nature of Survival",
    x = "Age (years)",
    y = "Ticket Fare (£)",
    color = "Outcome",
    shape = "Gender"
  )

p4 <- ggplot(titanic_clean %>% drop_na(Age), aes(x = Age, fill = Survived)) +
  geom_histogram(bins = 25, alpha = 0.7, color = "white", position = "fill") +
  facet_grid(Pclass ~ Sex) +
  scale_fill_manual(values = c("Perished" = "#2C3E50", "Survived" = "#27AE60")) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Age Distribution by Gender, Class, and Survival",
    subtitle = "The Complete Picture of Passenger Demographics",
    x = "Age (years)",
    y = "Percentage",
    fill = "Outcome"
  )

gridExtra::grid.arrange(p1, p2, ncol = 1, nrow = 2,
                        heights = c(1, 1),
                        top = grid::textGrob("The Intersection of Gender, Class, and Fate",
                                            gp = grid::gpar(fontsize = 14, fontface = "bold")))

print(p3)
print(p4)

---

## Part 10: Summary and Reflections - The Data's Final Story

### The Truths the Numbers Revealed

Through our journey into the data, several profound truths emerged from the tragedy of the Titanic:

#### **1. Gender Was Destiny**
Women had a **74.2%** survival rate, while men had only **18.9%**. A gap this wide is consistent with the "women and children first" protocol being followed. Among the variables examined here, gender was the **strongest single predictor** of survival.

#### **2. Wealth Created Hierarchy**
- **First Class**: 63.0% survival rate
- **Second Class**: 47.3% survival rate  
- **Third Class**: 24.2% survival rate

Passengers in the more expensive classes survived at markedly higher rates, which is consistent with better access to lifeboats. **Survival odds tracked economic standing.**

#### **3. Age Offered Some Protection**
Younger passengers, especially children, had better survival chances, particularly among females. The protocol extended beyond adult women to protect the young.

#### **4. The Cruel Intersection**
A **First Class female** had a 96.8% chance of surviving.  
A **Third Class male** had only a 13.5% chance.

The data tells a story of **human sacrifice and social inequality** intertwined in a single tragic night.
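
To put a gap between two survival rates on a common scale, one option is to report the difference in rates, the relative risk, and the odds ratio side by side. The helpers `odds()` and `odds_ratio()` below are hypothetical names written for this sketch, and the two input rates are illustrative placeholders, not figures from the analysis.

```r
# Odds of an event that has probability p
odds <- function(p) p / (1 - p)

# Odds ratio comparing group 1 with group 2
odds_ratio <- function(p1, p2) odds(p1) / odds(p2)

# Illustrative rates only (hypothetical, not taken from the analysis)
rate_high <- 0.90
rate_low  <- 0.20

cat("Difference in rates:", rate_high - rate_low, "\n")
cat("Relative risk:", rate_high / rate_low, "\n")
cat("Odds ratio:", odds_ratio(rate_high, rate_low), "\n")
```

The three measures can tell quite different stories when one rate is close to 100%, because the odds grow without bound there. That is worth keeping in mind when comparing the most and least favoured groups.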

---

### Final Reflection

*The Titanic was deemed "unsinkable," yet it sank. The data shows us not a story of mechanical failure, but of human choices: who to save, who to abandon. Every statistic represents a soul, a family, a dream cut short. Through this analysis, we honor those who perished by understanding the patterns that shaped their fate.*

**"The sea gave up the dead which were in it." - Revelation 20:13**

In [None]:
# Print a comprehensive statistical summary
cat("\n🎯 COMPREHENSIVE STATISTICAL SUMMARY\n")
cat(strrep("═", 45), "\n\n", sep = "")

# Overall statistics (counts computed once and reused)
n_total    <- nrow(titanic_clean)
n_survived <- sum(titanic_clean$Survived == "Survived")
n_perished <- sum(titanic_clean$Survived == "Perished")

cat("📊 OVERALL DATASET STATISTICS:\n")
cat("Total Passengers:", n_total, "\n")
cat("Total Survivors:", n_survived, " (", round(n_survived / n_total * 100, 1), "%)\n")
cat("Total Deaths:", n_perished, " (", round(n_perished / n_total * 100, 1), "%)\n\n")

# Gender summary
cat("👥 GENDER SUMMARY:\n")
gender_summary <- titanic_clean %>% 
  group_by(Sex) %>% 
  summarise(
    Total = n(),
    Survivors = sum(Survived == "Survived"),
    Deaths = sum(Survived == "Perished"),
    Survival_Rate = round((Survivors/Total)*100, 2)
  )
print(gender_summary)

cat("\n💎 CLASS SUMMARY:\n")
class_summary <- titanic_clean %>% 
  group_by(Pclass) %>% 
  summarise(
    Total = n(),
    Survivors = sum(Survived == "Survived"),
    Deaths = sum(Survived == "Perished"),
    Avg_Fare = round(mean(Fare, na.rm = TRUE), 2),
    Avg_Age = round(mean(Age, na.rm = TRUE), 2),
    Survival_Rate = round((Survivors/Total)*100, 2)
  )
print(class_summary)

cat("\n\n✨ Analysis Complete. The Titanic's story, told through numbers, ends here.\n")
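
The summary tables report survival rates as single percentages. For smaller groups it can help to attach an interval, and one way to do so is the exact binomial interval from `binom.test()` in base R. The counts below are hypothetical placeholders rather than values from `titanic_clean`; in practice they would come from the `Survivors` and `Total` columns of a summary table.

```r
# Hypothetical counts for one group (NOT taken from titanic_clean)
survivors <- 30
total     <- 120

# Exact (Clopper-Pearson) 95% confidence interval for the survival proportion
ci <- binom.test(survivors, total, conf.level = 0.95)$conf.int

cat("Survival rate:", round(100 * survivors / total, 1), "%\n")
cat("95% CI:", round(100 * ci[1], 1), "% to", round(100 * ci[2], 1), "%\n")
```

Groups with few passengers will show visibly wider intervals, which is a useful reminder not to over-read small differences between cells.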