# 🎓 Mini-Project – Explore Your Own Dataset

This is your final session! You’ll use everything you’ve learned so far to analyse and present a small dataset.
You can use the hippo data provided or, if you prefer, bring your own dataset.

**What to do:**
- Load and explore the data
- Clean or tidy it (if needed)
- Summarise and describe key features
- Compare groups or test a simple hypothesis
- Create at least one visualisation
- Write a short conclusion

## 🐾 Step 1 – Load the Data

In [2]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/ggkuhnle/data-analysis/main/data/hippos_cleaned.csv')  # or upload your own
df.head()

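|   | Name   | Species | Weight_kg | Height_cm | Habitat | Sex    |
|---|--------|---------|-----------|-----------|---------|--------|
| 0 | Hugo   | hippo   | 1500      | 150       | River   | Male   |
| 1 | Fiona  | hippo   | 1400      | 150       | River   | Female |
| 2 | George | hippo   | 1450      | 160       | Lake    | Male   |
| 3 | Gloria | hippo   | 1350      | 145       | Lake    | Female |
| 4 | Fred   | hippo   | 1600      | 155       | River   | Male   |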


## 🧹 Step 2 – Clean or Prepare the Data

In [3]:
# Check column types and non-null counts, then count missing values per column
df.info()
df.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Name       8 non-null      object
 1   Species    8 non-null      object
 2   Weight_kg  8 non-null      int64 
 3   Height_cm  8 non-null      int64 
 4   Habitat    8 non-null      object
 5   Sex        8 non-null      object
dtypes: int64(2), object(4)
memory usage: 512.0+ bytes


Name         0
Species      0
Weight_kg    0
Height_cm    0
Habitat      0
Sex          0
dtype: int64
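
The cell above covers column types and missing values. Inconsistent entries (for example `male` vs `Male`) need a separate look. A minimal sketch of how that could be done is below. It assumes a small stand-in DataFrame built from only the five rows shown in the Step 1 output, so it is not the full dataset; in the notebook you would run the same checks on your own `df`.

```python
import pandas as pd

# Stand-in data: only the five rows shown in the Step 1 output (not the full dataset)
df = pd.DataFrame({
    "Name": ["Hugo", "Fiona", "George", "Gloria", "Fred"],
    "Species": ["hippo"] * 5,
    "Weight_kg": [1500, 1400, 1450, 1350, 1600],
    "Height_cm": [150, 150, 160, 145, 155],
    "Habitat": ["River", "River", "Lake", "Lake", "River"],
    "Sex": ["Male", "Female", "Male", "Female", "Male"],
})

# Count the distinct labels in each text column; typos or mixed case
# would show up as unexpected extra categories
category_counts = {col: df[col].value_counts() for col in ["Species", "Habitat", "Sex"]}
for col, counts in category_counts.items():
    print(counts, "\n")

# Count exact duplicate rows
n_duplicates = int(df.duplicated().sum())

# Sanity check: weights and heights should be positive numbers
n_implausible = int(((df["Weight_kg"] <= 0) | (df["Height_cm"] <= 0)).sum())

print("Duplicate rows:", n_duplicates)
print("Implausible measurements:", n_implausible)
```

If a column does contain mixed spellings, `df['Sex'].str.strip().str.capitalize()` is one way to tidy it before grouping.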

## 📊 Step 3 – Describe and Summarise

In [4]:
# Mean weight and height for each sex
df.groupby('Sex')[['Weight_kg', 'Height_cm']].mean()

| Sex    | Weight_kg | Height_cm |
|--------|-----------|-----------|
| Female | 1455.0    | 151.5     |
| Male   | 1507.5    | 153.5     |
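
Group means are easier to judge alongside the group sizes and spread. The sketch below shows one possible way to get overall summary statistics and a per-group count, mean and standard deviation. It assumes a stand-in DataFrame built from only the five rows shown in the Step 1 output, so its numbers will differ from the eight-row results above.

```python
import pandas as pd

# Stand-in data: only the five rows shown in the Step 1 output (not the full dataset)
df = pd.DataFrame({
    "Name": ["Hugo", "Fiona", "George", "Gloria", "Fred"],
    "Weight_kg": [1500, 1400, 1450, 1350, 1600],
    "Height_cm": [150, 150, 160, 145, 155],
    "Habitat": ["River", "River", "Lake", "Lake", "River"],
    "Sex": ["Male", "Female", "Male", "Female", "Male"],
})

# Overall summary: count, mean, std, min, quartiles and max for the numeric columns
overall = df[["Weight_kg", "Height_cm"]].describe()
print(overall)

# Per-group summary: how many animals, their mean weight, and the spread
by_sex = df.groupby("Sex")["Weight_kg"].agg(["count", "mean", "std"])
print(by_sex)
```

With groups this small, the `count` column is a useful reminder of how little data each mean rests on.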


## 🧪 Step 4 – Compare Groups

In [None]:
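from scipy import stats

# Split the weights into one Series per sex
male = df[df['Sex'] == 'Male']['Weight_kg']
female = df[df['Sex'] == 'Female']['Weight_kg']

# Independent two-sample t-test (assumes equal variances by default)
# Returns the t statistic and the p-value
stats.ttest_ind(male, female)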
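
`ttest_ind` assumes equal variances in both groups unless told otherwise. As a hedged alternative, the sketch below runs Welch's t-test (`equal_var=False`), which drops that assumption, and unpacks the result for reporting. The weights are only the five values shown in the Step 1 output, and the 0.05 threshold is a conventional choice rather than anything set by the course.

```python
from scipy import stats

# Stand-in data: weights from the five rows shown in the Step 1 output only
male = [1500, 1450, 1600]
female = [1400, 1350]

# Welch's t-test: does not assume the two groups have equal variances
result = stats.ttest_ind(male, female, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

alpha = 0.05  # conventional significance threshold (an assumption here)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

if p_value < alpha:
    print("The difference in mean weight is statistically significant.")
else:
    print("No statistically significant difference in mean weight.")
```

With so few animals per group, either outcome should be read cautiously: a small sample has little power to detect a real difference.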

## 📈 Step 5 – Visualise

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(data=df, x='Sex', y='Height_cm')
plt.title('Height by Sex')
plt.show()

## 📝 Step 6 – Conclude

In [None]:
# Write your observations here as a comment:
# - What did you notice?
# - Any surprising results?
# - What would you explore next if you had more time?

## ✅ Optional Challenges
- Try a regression model: `Weight_kg ~ Height_cm`
- Compare a third variable (e.g. habitat)
- Turn your analysis into a short presentation
- Try using a different dataset (NDNS/FFQ/etc.)
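
As a possible starting point for the first two challenges, the sketch below fits a simple linear regression of weight on height with `scipy.stats.linregress` and compares mean weight by habitat. It assumes a stand-in DataFrame built from only the five rows shown in the Step 1 output, so treat the numbers as illustrative rather than as results for the full dataset.

```python
import pandas as pd
from scipy import stats

# Stand-in data: only the five rows shown in the Step 1 output (not the full dataset)
df = pd.DataFrame({
    "Weight_kg": [1500, 1400, 1450, 1350, 1600],
    "Height_cm": [150, 150, 160, 145, 155],
    "Habitat": ["River", "River", "Lake", "Lake", "River"],
})

# Simple linear regression: Weight_kg ~ Height_cm
fit = stats.linregress(df["Height_cm"], df["Weight_kg"])
print(f"slope = {fit.slope:.2f} kg per cm")
print(f"intercept = {fit.intercept:.1f} kg")
print(f"R squared = {fit.rvalue ** 2:.2f}, p = {fit.pvalue:.3f}")

# Third variable: mean weight for each habitat
habitat_means = df.groupby("Habitat")["Weight_kg"].mean()
print(habitat_means)
```

The slope is the estimated change in weight (kg) for each extra centimetre of height; `statsmodels` would be another option if you want the formula-style `Weight_kg ~ Height_cm` syntax.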

## 🦛 Congratulations!
You’ve completed the Python for Data Analysis course. 🎉

You now know how to:
- Write Python code
- Clean and reshape messy data
- Summarise, plot, and analyse datasets
- Understand and test statistical differences

*Python is like a river – you’ve learned to swim in it. Now go explore!* 🐾