# 🧹 Working with Tables: Meet the Hippos

Today, we’ll work with a messy real-world dataset (it includes hippos!).
We’ll learn how to:
- Read data into a table using `pandas`
- Inspect the data and spot common formatting issues
- Clean up names, numbers, and categories
- Create a tidy DataFrame ready for analysis

## 📥 Load the Data

In [None]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/ggkuhnle/data-analysis/main/data/messy_hippos.csv')  # update this if hosted elsewhere
df

## 🔍 What’s Going On Here?

In [None]:
# Take a quick look at column names, data types, and non-null counts
df.info()
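
`df.info()` reports data types and non-null counts, but it does not reveal stray spaces or inconsistent labels. Here is a minimal sketch of two extra checks. It uses a small made-up table (`toy` and its values are illustrative assumptions, not rows from `messy_hippos.csv`):

```python
import pandas as pd

# Hypothetical toy data standing in for the real file
toy = pd.DataFrame({
    ' Name ': ['  hilda', 'HUGO ', 'Hilda'],
    'species': ['Hippo', 'hippos ', 'hippo'],
})

# Printing the names as a list puts each one in quotes,
# so leading and trailing spaces become visible
column_names = list(toy.columns)
print(column_names)

# Counting distinct labels exposes values that differ only
# by case, spelling, or whitespace
species_counts = toy['species'].value_counts(dropna=False)
print(species_counts)
```

On the real data, the same two calls can be run on `df` instead of `toy`.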

## ✨ Clean Up the Columns

In [None]:
# Strip extra spaces from column names
df.columns = df.columns.str.strip()
# Rename columns to one consistent style ('Name' is already fine, so it needs no entry)
df = df.rename(columns={'Weight (kg)': 'Weight_kg', 'height_cm': 'Height_cm'})
df
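
One thing to watch: by default, `rename()` silently skips any key it cannot find among the columns, so a typo in the mapping goes unnoticed. The sketch below shows one possible way to check for that. The `toy` column names and the `mapping` and `not_found` variables are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical empty table with messy column names
toy = pd.DataFrame(columns=[' Name', 'Weight (kg) ', 'height_cm'])
toy.columns = toy.columns.str.strip()

mapping = {'Weight (kg)': 'Weight_kg', 'height_cm': 'Height_cm'}

# List any mapping keys that do not match an existing column
not_found = sorted(set(mapping) - set(toy.columns))
print(not_found)

toy = toy.rename(columns=mapping)
new_names = list(toy.columns)
print(new_names)
```

An empty `not_found` list means every rename was applied.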

## 🧼 Clean Individual Columns

In [None]:
# Remove extra spaces in names and standardise casing (e.g. 'HUGO ' becomes 'Hugo')
df['Name'] = df['Name'].str.strip().str.capitalize()

# Standardise species names: trim spaces, lowercase, and make 'hippos' singular
df['species'] = df['species'].str.strip().str.lower().str.replace('hippos', 'hippo')

# Clean Weight column: remove commas, strip, convert to numeric
df['Weight_kg'] = pd.to_numeric(df['Weight_kg'].str.replace(',', '').str.strip(), errors='coerce')

# Clean Height column: convert to numeric; values that cannot be parsed become NaN
df['Height_cm'] = pd.to_numeric(df['Height_cm'], errors='coerce')

# Standardise habitat names
df['habitat'] = df['habitat'].str.strip().str.capitalize()

df
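
To see what the weight clean-up does, here is a small sketch on made-up values (the `raw` series is an assumption, not data from the file). With `errors='coerce'`, anything that cannot be read as a number becomes `NaN` instead of raising an error:

```python
import pandas as pd

# Hypothetical raw weights: a thousands separator, padding,
# a text placeholder, and a missing entry
raw = pd.Series(['1,500', ' 1480 ', 'unknown', None])

# Same steps as above: drop commas, trim spaces, convert to numbers
cleaned = pd.to_numeric(
    raw.str.replace(',', '', regex=False).str.strip(),
    errors='coerce',
)
print(cleaned)
```

Here `'1,500'` becomes `1500.0`, while `'unknown'` and the missing entry both end up as `NaN`. This is why it is worth counting missing values after the conversion.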

## ❓ What’s Missing or Strange?

In [None]:
df.isna().sum()
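
The counts say how many values are missing in each column, but not which rows they are in. As a rough sketch, the rows with gaps can be pulled out like this (the `toy` table and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned table with two gaps
toy = pd.DataFrame({
    'Name': ['Hilda', 'Hugo', 'Hana'],
    'Weight_kg': [1500.0, np.nan, 1480.0],
    'Height_cm': [150.0, 148.0, np.nan],
})

# Keep only the rows where at least one column is missing
rows_with_gaps = toy[toy.isna().any(axis=1)]
print(rows_with_gaps)
```

Looking at these rows helps you decide whether a value is truly unknown or was lost during cleaning.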

## 🧾 Summary of the Cleaned Dataset

In [None]:
df.describe(include='all')

## ✅ Summary – What You’ve Learned Today
- Read a CSV file into a `pandas` DataFrame
- Spotted and fixed common issues: extra spaces, inconsistent labels, strange number formats
- Used `.str.strip()`, `.str.replace()`, and `pd.to_numeric()` to clean columns
- Checked for missing values with `.isna()`

Next time, we’ll learn how to group and summarise data – like how much a typical hippo weighs, or what the average height is in each habitat! 🦛