<a href="https://colab.research.google.com/github/aaniaahh/DataScience-2025/blob/main/Completed/06-Working_with_Data/03_loading_and_exploring_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📂 Notebook 03: Loading and Exploring Data
Welcome to the real world — where data lives in messy CSVs and your job is to make sense of it.

In this notebook, you’ll:
* Load a CSV file using `pd.read_csv()`
* Use `.head()`, `.tail()`, `.info()`, and `.describe()` to explore your data
* Identify potential issues (missing values, bad types, oddball rows)

## Let’s get our hands dirty.

In [None]:
import pandas as pd

## 📥 Load a CSV
The cell below loads seaborn's built-in Titanic dataset so you have something to test with. To work with your own data, pass a real CSV path or URL to `pd.read_csv()` instead.

In [None]:
# Example with seaborn's Titanic dataset
import seaborn as sns
df = sns.load_dataset("titanic")
df.head()
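The Titanic cell uses a seaborn loader, so here is a minimal sketch of `pd.read_csv()` itself. To stay runnable without a file on disk, it reads from an in-memory string through `io.StringIO`. The CSV content and the name `sample_df` are made up for illustration; with real data you would pass a file path or URL in place of the `StringIO` object.

```python
import io

import pandas as pd

# Hypothetical CSV content, standing in for a file on disk.
csv_text = """name,age,city
Ana,34,Lisbon
Ben,,Oslo
Cy,29,
"""

# read_csv accepts a path, a URL, or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.head())
print(sample_df.isnull().sum())  # empty fields are read as NaN
```

Note that the empty fields in the `age` and `city` columns come back as missing values, which is exactly the kind of thing the rest of this notebook helps you spot.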

## 📊 Peek at the Data

In [None]:
# First and last few rows
print("First 5 rows:")
print(df.head())

print("\nLast 5 rows:")
print(df.tail())

## 🧠 Understand Structure

In [None]:
# Dimensions and columns
print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
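Shape and column names tell you how big the table is, but "bad types" only show up when you look at dtypes. The sketch below is one possible way to spot and repair a numeric column that was read as text. The frame `raw` and its values are invented for the example, and `pd.to_numeric(..., errors="coerce")` is just one reasonable fix, not the only one.

```python
import pandas as pd

# Hypothetical data: 'price' holds text because of one unparseable entry.
raw = pd.DataFrame({"item": ["a", "b", "c"], "price": ["10.5", "n/a", "7"]})
print(raw.dtypes)  # 'price' is not numeric yet

# errors="coerce" turns entries that cannot be parsed into NaN instead of raising.
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")
print(raw.dtypes)  # 'price' is now a float column
print(raw)
```

Coercing trades an error for a missing value, so it is worth counting the NaNs afterwards to see how many entries were dropped this way.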

## 🧼 Data Types & Missing Values

In [None]:
# Data types and null counts
print("\nInfo:")
df.info()

# How many missing values per column?
print("\nMissing values per column:")
print(df.isnull().sum())
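Raw counts are hard to judge without knowing the number of rows, so a percentage per column is often easier to read. This is a small sketch on an invented frame called `toy`; the column names only loosely mimic the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": [None, None, "C", None],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

# isnull() gives a True/False frame; each column's mean is its fraction of True values.
missing_pct = toy.isnull().mean() * 100

# Worst offenders first.
print(missing_pct.sort_values(ascending=False))
```

The same one-liner works on any DataFrame, including `df` from the cells above.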

## 📈 Summary Statistics

In [None]:
df.describe(include="all")
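With `include="all"`, numeric and text columns share one table, so many cells end up as NaN (a text column has no mean, a numeric column has no `top`). If that gets hard to read, one option is to summarize the two groups separately. The frame `mixed` below is made up for illustration.

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
mixed = pd.DataFrame({
    "age": [22, 38, 26, 35],
    "town": ["S", "C", "S", "S"],
})

# Default describe(): numeric columns only (count, mean, std, min, quartiles, max).
numeric_summary = mixed.describe()

# Everything that is not numeric: count, unique, top (most common value), freq.
text_summary = mixed.describe(exclude="number")

print(numeric_summary)
print(text_summary)
```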

## ✏️ Renaming Columns (Optional but fun)

In [None]:
# Rename 'sex' to 'gender' for clarity
df.rename(columns={"sex": "gender"}, inplace=True)
df.head(2)
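`inplace=True` changes `df` itself, which can surprise you if you re-run the cell or still need the old name. A common alternative is to let `rename` return a new DataFrame. The frame `people` here is a tiny stand-in invented for the example.

```python
import pandas as pd

# Hypothetical two-row frame.
people = pd.DataFrame({"sex": ["male", "female"], "age": [22, 38]})

# Without inplace=True, rename returns a new DataFrame and leaves 'people' alone.
renamed = people.rename(columns={"sex": "gender"})

print(people.columns.tolist())   # original names are unchanged
print(renamed.columns.tolist())  # new frame carries the new name
```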

## 🔍 Your Turn
1. Load a dataset of your choice (`pd.read_csv()` or `sns.load_dataset()`)
2. Print the first and last 5 rows.
3. Show `.info()` and `.describe()` results.
4. Print the column names. Rename one of them.

🎯 **Bonus**: What percentage of rows have any missing values?

HINT: `df.isnull().any(axis=1).mean() * 100` gives the percentage of rows with at least one NaN.

In [1]:
import pandas as pd
import seaborn as sns

# 1. Load a dataset of your choice
df = sns.load_dataset("penguins")

# 2. Print the first and last 5 rows
print("First 5 rows:")
print(df.head(), "\n")

print("Last 5 rows:")
print(df.tail(), "\n")

# 3. Show .info() and .describe() results
print("DataFrame Info:")
df.info()  # info() prints its own report and returns None, so don't wrap it in print()
print()

print("Descriptive Statistics:")
print(df.describe(), "\n")

# 4. Print column names and rename one
print("Original Columns:", df.columns.tolist(), "\n")

# Rename a column (for fun 😈)
df.rename(columns={"species": "bird_type"}, inplace=True)
print("Renamed Columns:", df.columns.tolist(), "\n")

# BONUS: Percentage of rows with any missing values
missing_percent = df.isnull().any(axis=1).mean() * 100
print(f"Percentage of rows with missing values: {missing_percent:.2f}%")


First 5 rows:
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  
0       3750.0    Male  
1       3800.0  Female  
2       3250.0  Female  
3          NaN     NaN  
4       3450.0  Female   

Last 5 rows:
    species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
339  Gentoo  Biscoe             NaN            NaN                NaN   
340  Gentoo  Biscoe            46.8           14.3              215.0   
341  Gentoo  Biscoe            50.4           15.7              222.0   
342  Gentoo  Biscoe            45.2           14.8              212.0   
343  Gentoo