# Week 5 – Day 2 – ExerciseXP
## Fundamentals of Data Analysis – Full Submission

This notebook contains **Exercises 1 to 6** completed in a single notebook, as requested.

### Datasets used:
- **Sleep:** Time Americans Spend Sleeping  
- **Mental Health (general):** clean_dataset.csv  
- **Mental Health (depression):** Mental health Depression disorder Data.csv  
- **Credit Card Approvals:** crx.csv  

All explanations are written in **student language**, directly inside the notebook.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)

SLEEP_PATH = r"""/mnt/data/Time Americans Spend Sleeping.csv"""
MENTAL_GENERAL_PATH = r"""/mnt/data/clean_dataset.csv"""
MENTAL_DEPRESSION_PATH = r"""/mnt/data/Mental health Depression disorder Data.csv"""
CREDIT_PATH = r"""/mnt/data/crx.csv"""

print("Dataset paths (files are loaded in Exercise 2):")
print(SLEEP_PATH)
print(MENTAL_GENERAL_PATH)
print(MENTAL_DEPRESSION_PATH)
print(CREDIT_PATH)


# 🌟 Exercise 1 – Introduction to Data Analysis

### What is data analysis?
Data analysis is the process of collecting, cleaning, exploring, and interpreting data in order to extract useful information.  
It helps transform raw numbers or text into insights that can be understood and used for decisions.

### Why is data analysis important today?
Huge amounts of data are generated every day by apps, healthcare systems, financial services, and social networks.
Data analysis helps reduce uncertainty, detect patterns, and support decisions based on evidence rather than intuition.

### Three application areas of data analysis
1. **Healthcare** – understanding disease trends, improving treatments, and planning resources.
2. **Finance** – credit approval, fraud detection, and risk assessment.
3. **Marketing / Business** – customer behavior analysis, segmentation, and performance measurement.

**Conclusion:** Data analysis plays a key role in transforming data into knowledge and guiding better decisions.


# 🌟 Exercise 2 – Dataset Loading and Initial Analysis

In [None]:
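def load_csv(path):
    try:
        df = pd.read_csv(path)
    except pd.errors.ParserError:
        return pd.read_csv(path, sep=';')
    # A semicolon-delimited file can parse without error into one column.
    if df.shape[1] == 1 and ';' in str(df.columns[0]):
        df = pd.read_csv(path, sep=';')
    return df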

sleep_df = load_csv(SLEEP_PATH)
mental_general_df = load_csv(MENTAL_GENERAL_PATH)
mental_depression_df = load_csv(MENTAL_DEPRESSION_PATH)
credit_df = load_csv(CREDIT_PATH)

sleep_df.head(), mental_general_df.head(), mental_depression_df.head(), credit_df.head()
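
The fallback in `load_csv` chooses between two separators. As an alternative sketch, pandas can try to infer the delimiter itself when `sep=None` is combined with the Python parsing engine. The helper name `sniff_csv` and the two in-memory samples below are made up for illustration; they are not taken from the notebook's datasets.

```python
import io
import pandas as pd

# Two tiny made-up CSV samples (illustrative only): one comma-separated,
# one semicolon-separated.
comma_text = "age,hours\n25,7.5\n40,6.8\n"
semicolon_text = "age;hours\n25;7.5\n40;6.8\n"

def sniff_csv(source):
    # sep=None asks pandas to detect the delimiter automatically;
    # this only works with the (slower) Python engine.
    return pd.read_csv(source, sep=None, engine="python")

df_comma = sniff_csv(io.StringIO(comma_text))
df_semicolon = sniff_csv(io.StringIO(semicolon_text))

print(df_comma.columns.tolist())
print(df_semicolon.columns.tolist())
```

Both samples should end up with the same two columns. Detection is a guess, so it is still worth checking `df.shape` after loading a real file.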


In [None]:
def dataset_summary(df, name):
    print("\n" + "="*80)
    print(name)
    print("="*80)
    print("Shape:", df.shape)
    print("Columns:", list(df.columns))
    print("\nData types:")
    print(df.dtypes)
    print("\nMissing values (top 10):")
    print(df.isna().sum().sort_values(ascending=False).head(10))

dataset_summary(sleep_df, "Sleep Dataset")
dataset_summary(mental_general_df, "Mental Health – General Dataset")
dataset_summary(mental_depression_df, "Mental Health – Depression Dataset")
dataset_summary(credit_df, "Credit Card Approvals Dataset")
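
One caveat about the missing-value counts: `isna()` only counts values that pandas already recognizes as missing. If a file marks missing entries with a placeholder such as `?` (an assumption here, not something verified for these files), the count stays at zero and the column is read as text. A minimal sketch with a made-up sample:

```python
import io
import pandas as pd

# Made-up sample (illustrative only) where "?" stands for a missing value.
raw = "income,approved\n1200,yes\n?,no\n900,yes\n"

# Default read: "?" is kept as ordinary text, so nothing counts as missing.
as_is = pd.read_csv(io.StringIO(raw))

# Telling pandas that "?" means missing turns it into NaN.
with_na = pd.read_csv(io.StringIO(raw), na_values="?")

print("Missing (default):", as_is["income"].isna().sum())
print("Missing (na_values='?'):", with_na["income"].isna().sum())
print("Numeric after fix:", pd.api.types.is_numeric_dtype(with_na["income"]))
```

If the summary shows a numeric-looking column with a text dtype, a hidden placeholder like this is one possible cause.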


# 🌟 Exercise 3 – Identifying Data Types

In [None]:
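def classify_column(series):
    # Heuristic: a numeric column with at most 10 distinct values is treated
    # as coded categories. Small samples can trigger this by accident, so
    # the result should be reviewed by hand.
    if pd.api.types.is_numeric_dtype(series):
        if series.nunique() <= 10:
            return "Qualitative (coded)", "Numeric values representing categories"
        return "Quantitative", "Numeric values with measurable magnitude"
    return "Qualitative", "Categorical or text-based values"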

def classify_dataset(df):
    rows = []
    for col in df.columns:
        t, reason = classify_column(df[col])
        rows.append({
            "Column": col,
            "Data type": str(df[col].dtype),
            "Classification": t,
            "Reason": reason
        })
    return pd.DataFrame(rows)

sleep_types = classify_dataset(sleep_df)
mental_general_types = classify_dataset(mental_general_df)
mental_depression_types = classify_dataset(mental_depression_df)
credit_types = classify_dataset(credit_df)

sleep_types, mental_general_types, mental_depression_types, credit_types


# 🌟 Exercise 4 – Exploring Data Types with the Iris Dataset

In [None]:
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
iris_df = iris.frame
iris_df.head()


In [None]:
iris_types = classify_dataset(iris_df)

# Correct conceptual classification for target column
iris_types.loc[iris_types["Column"] == "target", "Classification"] = "Qualitative (label)"
iris_types.loc[iris_types["Column"] == "target", "Reason"] = "Encoded species category"

iris_types


**Explanation:**  
- Sepal and petal measurements are quantitative because they represent real physical measurements.  
- The `target` column is qualitative because it represents flower species categories.
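
To make the second point concrete, here is a small sketch that turns coded labels into a real categorical column. The species names follow the order of `iris.target_names` in scikit-learn; the short list of codes is hand-written and only stands in for `iris_df["target"]`.

```python
import pandas as pd

# Species names in the order scikit-learn reports them for codes 0, 1, 2.
species_names = ["setosa", "versicolor", "virginica"]

# A few hand-written codes standing in for iris_df["target"].
codes = pd.Series([0, 0, 1, 2, 2, 1], name="target")

# Attach labels so pandas treats the column as categories, not numbers.
species = pd.Categorical.from_codes(codes, categories=species_names)
labeled = pd.Series(species, name="species")

# Counting categories is meaningful.
print(labeled.value_counts())

# A mean of the codes can be computed, but it has no real interpretation.
print("Mean of codes:", codes.mean())
```

Once the column is categorical, pandas reports it with dtype `category`, so it is no longer mistaken for a measurement.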


# 🌟 Exercise 5 – Basic Data Analysis

In [None]:
col = "sepal length (cm)"

print("Mean:", iris_df[col].mean())
print("Median:", iris_df[col].median())
print("Mode:", iris_df[col].mode().tolist())
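
The three statistics above answer slightly different questions. The sketch below uses made-up values (not from the Iris data) with one deliberate outlier to show how the mean moves while the median and mode stay put.

```python
import pandas as pd

# Made-up measurements (illustrative only); the last value is an outlier.
values = pd.Series([6.5, 7.0, 7.0, 7.5, 8.0, 14.0])

mean_value = values.mean()           # pulled upward by the outlier
median_value = values.median()       # middle of the sorted values
mode_values = values.mode().tolist() # most frequent value(s); can be several

print("Mean:", round(mean_value, 2))
print("Median:", median_value)
print("Mode:", mode_values)
```

When the mean and median are close, as they usually are for a roughly symmetric column, either one is a reasonable summary; a large gap suggests skew or outliers.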


In [None]:
plt.hist(iris_df[col])
plt.title("Iris Dataset – Sepal Length Distribution")
plt.xlabel(col)
plt.ylabel("Count")
plt.show()


# 🌟 Exercise 6 – Basic Observation Skills

In [None]:
# Sleep dataset observations
print("Sleep dataset observations:")
print("- Sleep duration can be analyzed across age or gender groups.")
print("- Time-related columns allow trend analysis over years.")

# Depression dataset observations
print("\nDepression dataset observations:")
print("- Prevalence rates can be compared across countries or years.")
print("- Trend analysis can show how depression rates evolve over time.")


In [None]:
# Example: simple group comparison on depression dataset (if possible)
numeric_cols = [c for c in mental_depression_df.columns if pd.api.types.is_numeric_dtype(mental_depression_df[c])]
cat_cols = [c for c in mental_depression_df.columns if not pd.api.types.is_numeric_dtype(mental_depression_df[c])]

if numeric_cols and cat_cols:
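    # Take the first column of each kind; check that these are meaningful
    # choices (not an ID-like column) before interpreting the chart.
    metric = numeric_cols[0]
    group = cat_cols[0]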

    summary = (
        mental_depression_df[[group, metric]]
        .dropna()
        .groupby(group)[metric]
        .mean()
        .sort_values(ascending=False)
        .head(10)
    )

    summary.plot(kind="bar", title=f"Mean {metric} by {group}")
    plt.show()
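
The trend observation can be sketched in a similar way. The column names `Entity`, `Year`, and `Depression (%)` and all values below are hypothetical and made up for illustration; the real column names should be checked with `mental_depression_df.columns` first.

```python
import pandas as pd

# Hypothetical column names and made-up values, for illustration only.
toy = pd.DataFrame({
    "Entity": ["A", "A", "B", "B"],
    "Year": [2000, 2010, 2000, 2010],
    "Depression (%)": [3.0, 3.4, 4.1, 3.9],
})

# One row per year and one column per entity: the shape a line plot needs.
trend = toy.pivot_table(
    index="Year", columns="Entity", values="Depression (%)", aggfunc="mean"
)

# Change between the first and last year for each entity.
change = (trend.iloc[-1] - trend.iloc[0]).round(2)

print(trend)
print(change)
```

Calling `trend.plot()` on a table like this would draw one line per entity, which makes rising and falling rates easy to compare.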
