# Titanic Dataset Analysis - Pandas Basics

This notebook focuses on basic pandas operations using the Titanic dataset. It's designed for beginners learning data science.

In [None]:
# Import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

## 1. Loading Data

The first step in any data analysis is loading the data. We'll use pandas' `read_csv()` function to load the Titanic dataset.

In [None]:
# Load the Titanic dataset
df = pd.read_csv("Titanic-Dataset.csv")

# 'df' is a common variable name for a pandas DataFrame
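If the CSV file is not at hand, the same call can be tried on an in-memory buffer. The sketch below assumes a tiny made-up sample with a few Titanic-style columns (the rows are invented for illustration, not taken from the dataset). `usecols` and `dtype` are optional `read_csv()` parameters; the name `sample` is just a placeholder.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the values are invented for illustration
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,
"""

# read_csv() accepts a file path or any file-like object
sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Survived", "Pclass", "Sex", "Age"],  # load only these columns
    dtype={"Pclass": "int8"},                      # pick a compact integer type
)

print(sample.shape)
print(sample.dtypes)
```

The empty last field is read as a missing value (`NaN`), which is why `Age` ends up as a floating-point column.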

## 2. Basic DataFrame Exploration

Let's explore the dataset using basic pandas functions.

In [None]:
# View the first 10 rows of the DataFrame
df.head(10)

# head() shows the first n rows (default is 5)

In [None]:
# View the last 5 rows of the DataFrame
df.tail()

# tail() shows the last n rows (default is 5)

In [None]:
# Get information about the DataFrame
df.info()

# info() provides a summary of the DataFrame including:
# - Number of rows and columns
# - Column names and data types
# - Number of non-null values in each column
# - Memory usage

In [None]:
# Get statistical summary of the DataFrame
df.describe()

# describe() summarizes the numeric columns by default:
# - count: number of non-null values
# - mean: average value
# - std: standard deviation
# - min: minimum value
# - 25%, 50% (median), 75%: percentiles
# - max: maximum value
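Text columns are left out of the default summary. As a small sketch on made-up data (the `toy` DataFrame below is hypothetical), passing `include="all"` to `describe()` adds them, reporting `unique`, `top`, and `freq` instead of the numeric statistics.

```python
import pandas as pd

# Hypothetical toy data (invented values)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, None],
})

# Default: numeric columns only
numeric_summary = toy.describe()

# include="all": every column; statistics that do not apply show up as NaN
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```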

## 3. Accessing Data in a DataFrame

There are multiple ways to access data in a pandas DataFrame.

In [None]:
# Get column names
print(df.columns)

In [None]:
# Access a single column (returns a Series)
df['Age'].head()

# A Series is a one-dimensional array with labels

In [None]:
# Access multiple columns (returns a DataFrame)
df[['Name', 'Age', 'Sex']].head()

In [None]:
# Access a specific row by index
df.iloc[0]  # First row

# iloc[] is used for integer-location based indexing

In [None]:
# Access multiple rows
df.iloc[0:5]  # First 5 rows

In [None]:
# Access specific cells (row, column)
df.iloc[0, 3]  # First row, fourth column (Name)

# This returns the name of the first passenger
# df.loc[0, 'Name'] does the same by label and does not depend on column order
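`iloc[]` has a label-based counterpart, `loc[]`. The difference is easiest to see when the index is not 0, 1, 2, ..., so this sketch uses a hypothetical `toy` DataFrame with made-up labels 10, 20, 30.

```python
import pandas as pd

# Hypothetical toy data with a non-default index (invented values)
toy = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cara"], "Age": [29, 41, 35]},
    index=[10, 20, 30],
)

by_position = toy.iloc[0]     # first row, whatever its label is
by_label = toy.loc[10]        # row whose index label is 10
cell = toy.loc[20, "Age"]     # row label 20, column label 'Age'

# iloc slices exclude the end position; loc slices include the end label
first_two_iloc = toy.iloc[0:2]
first_two_loc = toy.loc[10:20]

print(cell)
```

Both slices return the same two rows here, even though one stops before position 2 and the other stops at label 20.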

## 4. Basic Data Analysis

Let's perform some basic analysis on the Titanic dataset.

In [None]:
# Count the number of survivors and non-survivors
df['Survived'].value_counts()

# value_counts() returns the count of each unique value in a Series
# 0 = Did not survive, 1 = Survived

In [None]:
# Calculate the survival rate
survival_rate = df['Survived'].mean() * 100
print(f'Survival rate: {survival_rate:.2f}%')

# Since Survived is 0 or 1, the mean gives us the proportion of 1s (survivors)
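The same proportion can be read from `value_counts()` by passing `normalize=True`. A minimal sketch on a made-up 0/1 column (the values are invented, not taken from the dataset):

```python
import pandas as pd

# Hypothetical 0/1 outcome column (invented values)
survived = pd.Series([0, 1, 1, 0, 0], name="Survived")

# normalize=True returns proportions instead of raw counts
shares = survived.value_counts(normalize=True)

# The share of 1s matches the mean of the column
print(shares.loc[1], survived.mean())
```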

In [None]:
# Count the number of passengers by gender
df['Sex'].value_counts()

In [None]:
# Count the number of passengers by class
df['Pclass'].value_counts()

# Pclass: 1 = 1st class, 2 = 2nd class, 3 = 3rd class

## 5. Filtering Data

Filtering allows us to select rows that meet specific conditions.

In [None]:
# Filter: passengers who survived
survivors = df[df['Survived'] == 1]

# Show the first few survivors
survivors.head()

In [None]:
# Filter: female passengers
females = df[df['Sex'] == 'female']

# Count the number of female passengers
print(f'Number of female passengers: {len(females)}')

In [None]:
# Multiple conditions: female survivors
female_survivors = df[(df['Sex'] == 'female') & (df['Survived'] == 1)]

# Count the number of female survivors
print(f'Number of female survivors: {len(female_survivors)}')

# Note: use & for AND, | for OR, ~ for NOT
# Wrap each condition in parentheses, because & and | bind more tightly than ==
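The OR and NOT operators follow the same pattern as AND. The sketch below uses a hypothetical four-row `toy` DataFrame with invented values, and also shows `isin()`, which is often shorter than chaining several `==` comparisons with `|`.

```python
import pandas as pd

# Hypothetical toy data (invented values)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 2],
    "Survived": [0, 1, 1, 0],
})

# OR: first-class passengers or survivors
first_or_survived = toy[(toy["Pclass"] == 1) | (toy["Survived"] == 1)]

# NOT: everyone who is not in third class
not_third = toy[~(toy["Pclass"] == 3)]

# isin(): keep rows whose value appears in the given list
upper_classes = toy[toy["Pclass"].isin([1, 2])]

print(len(first_or_survived), len(not_third), len(upper_classes))
```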

In [None]:
# Calculate survival rate by gender
male_survival_rate = df[df['Sex'] == 'male']['Survived'].mean() * 100
female_survival_rate = df[df['Sex'] == 'female']['Survived'].mean() * 100

print(f'Male survival rate: {male_survival_rate:.2f}%')
print(f'Female survival rate: {female_survival_rate:.2f}%')

## 6. Handling Missing Values

Missing values are common in real-world datasets. Let's see how to handle them.

In [None]:
# Check for missing values
df.isnull().sum()

# This shows the number of missing values in each column

In [None]:
# Create a copy of the DataFrame for demonstration
df_clean = df.copy()

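# Fill missing Age values with the median age
median_age = df['Age'].median()
df_clean['Age'] = df_clean['Age'].fillna(median_age)

# Fill missing Embarked values with the most common value
most_common_embarked = df['Embarked'].mode()[0]  # mode() returns a Series
df_clean['Embarked'] = df_clean['Embarked'].fillna(most_common_embarked)

# Assign the result back to the column: calling fillna(inplace=True) on a
# selected column raises a warning in recent pandas versions and may not
# update df_clean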

# Check if missing values are handled
df_clean[['Age', 'Embarked']].isnull().sum()
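Filling with a single median is only one option. As a sketch on made-up data (the `toy` DataFrame is hypothetical), here are two alternatives: dropping incomplete rows with `dropna()`, and filling each gap with the median of the passenger's own class. The second one relies on `groupby()`, which is covered in the next section.

```python
import pandas as pd

# Hypothetical toy data (invented values)
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Age": [40.0, None, 20.0, 30.0, None],
})

# Option 1: drop rows where Age is missing
dropped = toy.dropna(subset=["Age"])

# Option 2: fill each gap with the median age of that passenger's class
# transform("median") returns one value per row, aligned with the original index
class_median = toy.groupby("Pclass")["Age"].transform("median")
filled = toy.assign(Age=toy["Age"].fillna(class_median))

print(dropped)
print(filled)
```

Neither option modifies `toy` itself; both return new DataFrames.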

## 7. Grouping and Aggregation

Grouping allows us to split data into groups and apply functions to each group.

In [None]:
# Group by passenger class and calculate survival rate
class_survival = df.groupby('Pclass')['Survived'].mean() * 100
print(class_survival)

# This shows the survival rate for each passenger class

In [None]:
# Group by multiple columns: class and gender
class_gender_survival = df.groupby(['Pclass', 'Sex'])['Survived'].mean() * 100
print(class_gender_survival)

# This shows the survival rate for each combination of class and gender

In [None]:
# Multiple aggregations
age_stats = df.groupby('Pclass')['Age'].agg(['count', 'mean', 'min', 'max'])
print(age_stats)

# This shows various statistics about age for each passenger class
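When several columns need different aggregations, named aggregation keeps the output column names readable. The sketch below uses a hypothetical `toy` DataFrame with invented values; the output names `passengers`, `survival_rate`, and `mean_age` are our own choices. It also shows `unstack()`, which turns a two-level grouped result into a table.

```python
import pandas as pd

# Hypothetical toy data (invented values)
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Sex": ["female", "male", "female", "male"],
    "Survived": [1, 0, 1, 0],
    "Age": [38.0, 54.0, 26.0, 22.0],
})

# Named aggregation: output_column=(input_column, function)
summary = toy.groupby("Pclass").agg(
    passengers=("Survived", "count"),
    survival_rate=("Survived", "mean"),
    mean_age=("Age", "mean"),
)

# unstack() moves the inner index level (Sex) into columns
rate_table = toy.groupby(["Pclass", "Sex"])["Survived"].mean().unstack()

print(summary)
print(rate_table)
```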

## 8. Basic Data Visualization

Pandas integrates with matplotlib to provide basic plotting capabilities.

In [None]:
# Bar chart: survival count by passenger class
survival_by_class = df.groupby('Pclass')['Survived'].sum()
survival_by_class.plot(kind='bar', title='Number of Survivors by Class')
plt.xlabel('Passenger Class')
plt.ylabel('Number of Survivors')
plt.show()

In [None]:
# Pie chart: proportion of passengers by class
df['Pclass'].value_counts().plot(kind='pie', autopct='%1.1f%%', title='Passenger Class Distribution')
plt.ylabel('')  # Hide the ylabel
plt.show()

In [None]:
# Histogram: age distribution
df['Age'].plot(kind='hist', bins=20, title='Age Distribution')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()
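For more control over a plot, we can create the matplotlib figure and axes first and pass the axes to pandas through the `ax` parameter. This is a sketch on a hypothetical `toy` DataFrame with invented values. The `matplotlib.use("Agg")` line is only there so the example also runs outside a notebook; it can be dropped in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data (invented values)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0],
})

rate_by_class = toy.groupby("Pclass")["Survived"].mean() * 100

# Create the figure and axes first, then tell pandas where to draw
fig, ax = plt.subplots(figsize=(6, 4))
rate_by_class.plot(kind="bar", ax=ax, title="Survival Rate by Class")
ax.set_xlabel("Passenger Class")
ax.set_ylabel("Survival Rate (%)")
fig.tight_layout()
```

Working with `ax` directly makes it easier to place several plots in one figure later.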

## 9. Creating New Columns

We can create new columns based on existing data.

In [None]:
# Create a new DataFrame for demonstration
df_new = df.copy()

# Create a column for family size (siblings/spouses + parents/children + 1)
df_new['FamilySize'] = df_new['SibSp'] + df_new['Parch'] + 1

# Create a column for fare per person (an approximation: Fare split evenly across FamilySize)
df_new['FarePerPerson'] = df_new['Fare'] / df_new['FamilySize']

# Display the new columns
df_new[['Name', 'SibSp', 'Parch', 'FamilySize', 'Fare', 'FarePerPerson']].head()
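New columns do not have to be arithmetic. As a sketch on a hypothetical `toy` DataFrame (invented values), the code below builds a boolean column from a comparison and a categorical column with `pd.cut()`. The column names `IsAlone` and `AgeGroup`, the bin edges, and the labels are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data (invented values)
toy = pd.DataFrame({
    "Age": [4.0, 17.0, 30.0, 65.0],
    "SibSp": [1, 0, 0, 1],
    "Parch": [2, 0, 0, 0],
})

# Boolean column: True when the passenger has no relatives aboard
toy["IsAlone"] = (toy["SibSp"] + toy["Parch"]) == 0

# pd.cut() sorts a numeric column into labeled bins (right edge inclusive)
toy["AgeGroup"] = pd.cut(
    toy["Age"],
    bins=[0, 12, 18, 60, 120],
    labels=["child", "teen", "adult", "senior"],
)

print(toy)
```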

## 10. Saving Data

After processing data, we might want to save it for future use.

In [None]:
# Save to CSV
# df_new.to_csv('titanic_processed.csv', index=False)

# Save to Excel (requires an Excel writer package such as openpyxl)
# df_new.to_excel('titanic_processed.xlsx', index=False)

# Note: These lines are commented out to prevent creating files
# Uncomment them if you want to save the files
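To see what `to_csv()` writes without creating a file, we can send the output to an in-memory buffer and read it back. This is a minimal sketch on a hypothetical two-row `toy` DataFrame with invented values.

```python
import io
import pandas as pd

# Hypothetical toy data (invented values)
toy = pd.DataFrame({"Name": ["Ann", "Bob"], "Fare": [7.25, 71.5]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False leaves out the row labels

# Rewind the buffer and read the same text back
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

Without `index=False`, the row labels would be written as an extra unnamed first column and show up again when the file is read back.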

## Summary

In this notebook, we've covered the basics of pandas using the Titanic dataset:

1. Loading data with `read_csv()`
2. Exploring data with `head()`, `tail()`, `info()`, and `describe()`
3. Accessing data using column names and `iloc[]`
4. Basic analysis with `value_counts()` and aggregation functions
5. Filtering data with boolean conditions
6. Handling missing values with `fillna()`
7. Grouping and aggregation with `groupby()`
8. Basic visualization with pandas plotting functions
9. Creating new columns based on existing data
10. Saving data to files

These are the fundamental pandas operations that form the basis of data analysis in Python.