# Population Data Analysis

This notebook analyzes the population dataset, exploring trends in population growth, sex ratios, and population density.

## Key Insights
- **Global Growth**: The global population has shown consistent growth, surpassing 8 billion in recent years.
- **Regional Demographics**: Africa exhibits a younger population structure with a higher percentage of youth (0-14 years) compared to the global average.
- **Sex Ratio Stability**: Globally, the sex ratio remains relatively balanced (~101 males per 100 females), though regional disparities exist.
- **Urbanization Trends**: Population density varies widely between countries and areas, reflecting differences in urbanization and geographic constraints.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
sns.set_theme(style="whitegrid")

## 1. Load and Clean Data

We load the CSV file, skipping the first row, which contains metadata. We then rename the ambiguous columns and convert the `Value` column to a numeric type.

In [None]:
# Load the dataset
file_path = 'data/SYB67_1_202411_Population, Surface Area and Density.csv'
# Read with header=1 (skipping row 0)
df = pd.read_csv(file_path, header=1)

# Display the first few rows
df.head()

In [None]:
# Rename columns
# The header row leaves the second column name empty, so pandas labels it 'Unnamed: 1'.
# Based on data inspection:
# Column 0 ('Region/Country/Area') appears to hold a numeric ID
# Column 1 ('Unnamed: 1') appears to hold the region or country name
df.rename(columns={'Region/Country/Area': 'Region_Code', 'Unnamed: 1': 'Region_Name'}, inplace=True)

# Clean 'Value' column: remove commas and convert to numeric
df['Value'] = df['Value'].astype(str).str.replace(',', '', regex=False)
df['Value'] = pd.to_numeric(df['Value'], errors='coerce')

# Check info
df.info()
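
The effect of `header=1` and of the comma stripping can be checked on a small in-memory sample. The sketch below is illustrative only: the CSV text, its title row, and its values are made up to mimic the layout described above and are not taken from the dataset.

```python
import io

import pandas as pd

# Made-up sample that mimics the described layout: a title row, then a header
# row whose second column name is empty. None of these values come from the file.
sample_csv = io.StringIO(
    'T00,"Example table title",,,\n'
    'Region/Country/Area,,Year,Series,Value\n'
    '1,"Total, all countries or areas",2010,Example series,"1,234.5"\n'
    '2,Africa,2010,Example series,not available\n'
)

sample = pd.read_csv(sample_csv, header=1)
sample = sample.rename(columns={'Region/Country/Area': 'Region_Code',
                                'Unnamed: 1': 'Region_Name'})

# Same cleaning steps as above: strip thousands separators, then coerce.
sample['Value'] = sample['Value'].astype(str).str.replace(',', '', regex=False)
sample['Value'] = pd.to_numeric(sample['Value'], errors='coerce')

print(sample[['Region_Name', 'Value']])
print("Unparseable values:", sample['Value'].isna().sum())
```

Entries that cannot be parsed become `NaN` under `errors='coerce'`, so counting them after the conversion is a quick way to see how many values were lost.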

## 2. Data Exploration

Let's check the available series and regions.

In [None]:
print("Available Series:")
print(df['Series'].unique())
print("\nNumber of unique Regions/Countries:", df['Region_Name'].nunique())
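
Beyond listing the series names, it can help to see which years each series covers. The following is a minimal sketch; `series_coverage` is a hypothetical helper, and the toy frame holds made-up rows that only reuse the `Series` and `Year` column names.

```python
import pandas as pd


def series_coverage(frame: pd.DataFrame) -> pd.DataFrame:
    """Return first year, last year, and row count for each series."""
    return (frame.groupby('Series')['Year']
                 .agg(first_year='min', last_year='max', rows='count')
                 .sort_index())


# Made-up rows for illustration only.
toy = pd.DataFrame({
    'Series': ['A', 'A', 'B'],
    'Year': [2010, 2022, 2015],
    'Region_Name': ['X', 'X', 'Y'],
})
coverage = series_coverage(toy)
print(coverage)
```

With the notebook's data the call would be `series_coverage(df)`.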

## 3. Analysis & Visualization

### 3.1 Global Population Growth
We'll use the "Total, all countries or areas" rows to see the global trend.

In [None]:
# Filter for global total and population estimates
global_pop = df[(df['Region_Name'] == 'Total, all countries or areas') & 
                (df['Series'] == 'Population mid-year estimates (millions)')]

plt.figure(figsize=(10, 6))
sns.lineplot(data=global_pop, x='Year', y='Value', marker='o')
plt.title('Global Population Growth (Millions)')
plt.ylabel('Population (Millions)')
plt.show()
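
To put numbers on the trend, the percentage change between consecutive observations can be computed. This is a sketch under assumptions: `growth_between_observations` is a hypothetical helper, the toy values are made up, and the result is the change between successive data points rather than an annual rate, since the years in the data may not be evenly spaced.

```python
import pandas as pd


def growth_between_observations(frame: pd.DataFrame) -> pd.DataFrame:
    """Add the percentage change in 'Value' relative to the previous observation."""
    ordered = frame.sort_values('Year').reset_index(drop=True)
    return ordered.assign(Growth_Pct=ordered['Value'].pct_change() * 100)


# Made-up values, deliberately out of order to show the sort.
toy = pd.DataFrame({'Year': [2020, 2000, 2010], 'Value': [121.0, 100.0, 110.0]})
growth = growth_between_observations(toy)
print(growth)
```

Applied to the notebook's data, this would be `growth_between_observations(global_pop)`.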

### 3.2 Sex Ratio
Examining the sex ratio (males per 100 females) for major continents/regions in 2022.

In [None]:
# Filter for Sex ratio series
sex_ratio = df[df['Series'] == 'Sex ratio (males per 100 females)']

# Define a list of major regions to compare (adjust based on actual data if needed)
regions_to_compare = ['Africa', 'Northern Africa', 'Sub-Saharan Africa', 
                      'Eastern Africa', 'Middle Africa', 'Southern Africa', 'Western Africa', 
                      'Americas', 'Northern America', 'Latin America & the Caribbean', 
                      'Asia', 'Europe', 'Oceania']

# Filter for year 2022 and selected regions
sex_ratio_2022 = sex_ratio[(sex_ratio['Year'] == 2022) & 
                           (sex_ratio['Region_Name'].isin(regions_to_compare))]

plt.figure(figsize=(12, 6))
sns.barplot(data=sex_ratio_2022, x='Region_Name', y='Value')
plt.axhline(100, color='red', linestyle='--', label='Equal Ratio (100)')
plt.xticks(rotation=45, ha='right')
plt.title('Sex Ratio (Males per 100 Females) by Region in 2022')
plt.ylabel('Males per 100 Females')
plt.legend()
plt.tight_layout()
plt.show()
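
Because `isin` silently ignores names that do not occur in the data, a misspelled or differently labelled region simply disappears from the chart. A small check like the one below can surface that; `missing_regions` is a hypothetical helper and the toy frame is made up.

```python
import pandas as pd


def missing_regions(frame: pd.DataFrame, wanted: list) -> list:
    """Return the requested region names that do not occur in 'Region_Name'."""
    present = set(frame['Region_Name'].dropna().unique())
    return [name for name in wanted if name not in present]


# Made-up frame for illustration only.
toy = pd.DataFrame({'Region_Name': ['Africa', 'Asia', 'Europe']})
missing = missing_regions(toy, ['Africa', 'Oceania', 'Asia'])
print(missing)
```

In the notebook, `missing_regions(sex_ratio, regions_to_compare)` would list any entries that need adjusting.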

### 3.3 Population Density
Top 15 most densely populated countries/areas in 2022.

In [None]:
# Filter for Population density
density = df[df['Series'] == 'Population density']

# Filter for year 2022
density_2022 = density[density['Year'] == 2022].sort_values('Value', ascending=False)

# Exclude the world aggregate. Regional aggregates (e.g. continents) are not filtered here;
# the top of a density ranking is normally taken by small countries or areas such as
# Monaco or Singapore. Inspect the result and drop any aggregate that does appear.
# The chart below shows the top 15 after removing the world total.
density_2022 = density_2022[density_2022['Region_Name'] != 'Total, all countries or areas']

plt.figure(figsize=(12, 6))
sns.barplot(data=density_2022.head(15), x='Value', y='Region_Name')
plt.title('Top 15 Most Densely Populated Areas (2022)')
plt.xlabel('Population Density (per km2)')
plt.ylabel('Country/Area')
plt.tight_layout()
plt.show()
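
If regional aggregates do show up in the ranking, one option is to drop them by name. The sketch below assumes the aggregate names are known in advance; `drop_named_aggregates` is a hypothetical helper, and the toy rows and values are made up.

```python
import pandas as pd


def drop_named_aggregates(frame: pd.DataFrame, aggregate_names) -> pd.DataFrame:
    """Return rows whose 'Region_Name' is not in the given aggregate names."""
    return frame[~frame['Region_Name'].isin(list(aggregate_names))]


# Made-up rows for illustration only.
toy = pd.DataFrame({
    'Region_Name': ['Total, all countries or areas', 'Asia', 'Area A', 'Area B'],
    'Value': [60.0, 150.0, 900.0, 300.0],
})
aggregates = ['Total, all countries or areas', 'Asia']
ranked = (drop_named_aggregates(toy, aggregates)
          .sort_values('Value', ascending=False))
print(ranked)
```

In the notebook, the `regions_to_compare` list from section 3.2 plus the world total could serve as a starting set of names. Whether that list covers every aggregate in the file has not been verified here.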