# Notebook 01 — Data Collection (Kaggle → Local)

**Goal:**  
Get the IBM HR Attrition dataset from Kaggle into this project.  
Save a fast, clean copy so the next notebooks load it quickly.

**What I will do:**  
1. Read Kaggle credentials from the environment (no secrets in code).  
2. Install the Kaggle CLI if needed.  
3. Download the dataset into `data/raw/` and unzip it.  
4. Load the CSV, show a quick summary (shape, columns, memory).  
5. Save a Parquet version to `data/processed/hr_attrition.parquet`.

**Why this matters:**  
- Proves I can collect data from an **endpoint** (Kaggle) inside a notebook.  
- Keeps raw vs processed data separate and reproducible.
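
One way to back up the reproducibility point is to record a checksum of the raw file, so later runs can confirm they started from the same bytes. This is a minimal sketch using only the standard library. The helper name `sha256_of_file` is hypothetical and not part of the original notebook.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file (hypothetical helper)."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Read fixed-size chunks so a large file never sits fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest could be printed next to the dataset summary and compared by eye, or by an assertion, the next time the raw file is downloaded.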

**Inputs:**  
- Kaggle dataset: `pavansubhasht/ibm-hr-analytics-attrition-dataset`  
- (Optional) Existing file: `data/raw/WA_Fn-UseC_-HR-Employee-Attrition.csv`

**Outputs:**  
- `data/processed/hr_attrition.parquet`
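
Because the raw CSV is listed as an optional existing input, the download step can be skipped when the file is already present. The sketch below shows one possible check. `find_existing_csv` and `EXPECTED_CSV` are hypothetical names, and the file name is copied from the Inputs list.

```python
from pathlib import Path

# File name taken from the Inputs list; the helper itself is an assumption.
EXPECTED_CSV = "WA_Fn-UseC_-HR-Employee-Attrition.csv"


def find_existing_csv(raw_dir, filename=EXPECTED_CSV):
    """Return the raw CSV path if it exists and is non-empty, else None."""
    candidate = Path(raw_dir) / filename
    if candidate.is_file() and candidate.stat().st_size > 0:
        return candidate
    return None
```

In the notebook, a call such as `find_existing_csv("../data/raw")` returning `None` would be the signal to run the download cell.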

In [1]:
# Setup Kaggle API credentials (read from the environment, never hard-coded)
import os
import json

# Assumes KAGGLE_USERNAME and KAGGLE_KEY are exported before the notebook starts.
kaggle_credentials = {
    "username": os.environ["KAGGLE_USERNAME"],
    "key": os.environ["KAGGLE_KEY"],
}

kaggle_dir = os.path.expanduser("~/.kaggle")
os.makedirs(kaggle_dir, exist_ok=True)

kaggle_json_path = os.path.join(kaggle_dir, "kaggle.json")
with open(kaggle_json_path, "w") as f:
    json.dump(kaggle_credentials, f)
os.chmod(kaggle_json_path, 0o600)  # owner read/write only
print("Kaggle credentials configured")

Kaggle credentials configured


In [2]:
!pip install kaggle
!mkdir -p ../data/raw
# Download into the same folder the loading cell reads from (../data/raw).
!kaggle datasets download -d pavansubhasht/ibm-hr-analytics-attrition-dataset -p ../data/raw --unzip
print("✅ Downloaded into ../data/raw")

Collecting kaggle
  Downloading kaggle-1.7.4.5-py3-none-any.whl.metadata (16 kB)
Collecting python-slugify (from kaggle)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode (from kaggle)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Collecting tqdm (from kaggle)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Downloading kaggle-1.7.4.5-py3-none-any.whl (181 kB)
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
Installing collected packages: text-unidecode, tqdm, python-slugify, kaggle
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4/4[0m [kaggle]2m3/4[0m [kaggle]
[1A[2KSuccessfully installed kaggle-1.7.4.5 python-slugify-8.0.4 text-unidecode-1.3 tqdm-4.67.1
Dataset URL: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset
License(s): Db
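
Shell-escaped commands (`!kaggle ...`) do not raise a Python exception when they fail, so a later `print` still runs even if nothing was downloaded. A hedged alternative is to run the same CLI command through `subprocess` and raise on a non-zero exit code. The function names below are hypothetical. The `-d`, `-p`, and `--unzip` flags are the ones already used in the download cell.

```python
import subprocess
from pathlib import Path


def build_download_command(dataset, dest):
    """Build the Kaggle CLI argument list used by the download cell."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", str(dest), "--unzip"]


def download_dataset(dataset, dest):
    """Create the destination folder, run the CLI, and raise if it fails."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # check=True raises subprocess.CalledProcessError on a non-zero exit code.
    subprocess.run(build_download_command(dataset, dest), check=True)
    return dest
```

A call such as `download_dataset("pavansubhasht/ibm-hr-analytics-attrition-dataset", "../data/raw")` would then stop the notebook at the failing step instead of at the later CSV lookup.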

In [3]:
import pandas as pd
import numpy as np

In [4]:
# Load the CSV, show a tiny summary so we know the data looks right.
import pandas as pd
from pathlib import Path

RAW_DIR = Path("../data/raw")  # must match the folder used in the download step
candidates = sorted(RAW_DIR.glob("*.csv"))  # sorted so the choice is deterministic
if not candidates:
    raise FileNotFoundError(f"No CSV found under {RAW_DIR}. Check the download step.")

CSV_PATH = candidates[0]  # first CSV in alphabetical order
df = pd.read_csv(CSV_PATH, low_memory=False)

print("✅ Loaded dataset")
print(f"Shape (rows, cols): {df.shape}")
print("First 8 columns:", df.columns[:8].tolist())
display(df.head())  # shows a small preview

print("\nData Types:")
print(df.dtypes)

mem_mb = df.memory_usage(deep=True).sum() / 1024**2
print(f"\nApprox. memory usage: {mem_mb:.1f} MB")

✅ Loaded dataset
Shape (rows, cols): (1470, 35)
First 8 columns: ['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome', 'Education', 'EducationField']


Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2



Data Types:
Age                          int64
Attrition                   object
BusinessTravel              object
DailyRate                    int64
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EmployeeCount                int64
EmployeeNumber               int64
EnvironmentSatisfaction      int64
Gender                      object
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
MonthlyIncome                int64
MonthlyRate                  int64
NumCompaniesWorked           int64
Over18                      object
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StandardHours                int64
StockOptionLevel             int64
TotalWo

In [5]:
# Save a fast, clean copy for the next notebooks and Streamlit pages.
from pathlib import Path

PROCESSED_DIR = Path("../data/processed")
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
PARQUET_PATH = PROCESSED_DIR / "hr_attrition.parquet"
df.to_parquet(PARQUET_PATH, index=False)
print(f"✅ Saved Parquet → {PARQUET_PATH}")

✅ Saved Parquet → ../data/processed/hr_attrition.parquet
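
A cheap sanity check on the saved file, without loading it, is to look for the `PAR1` magic bytes that the Parquet format places at both the start and the end of a file. The sketch below uses only the standard library. `looks_like_parquet` is a hypothetical helper name, and passing this check does not prove the contents are correct.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at the head and tail of a Parquet file


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    path = Path(path)
    # A valid file holds at least head magic, a 4-byte footer length, and tail magic.
    if not path.is_file() or path.stat().st_size < 12:
        return False
    with path.open("rb") as fh:
        head = fh.read(4)
        fh.seek(-4, 2)  # 2 == os.SEEK_END: jump to the last four bytes
        tail = fh.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

Calling `looks_like_parquet(PARQUET_PATH)` right after the save would catch an empty or truncated write before the next notebook tries to read the file.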
