> Before running this notebook, follow these instructions from the README:

### Getting Started

1. Clone the repository (bash terminal)

```bash
git clone https://github.com/angelakberry/beauty_wizard.git
cd beauty_wizard
```
2. Set up environment and database (bash terminal)

```bash
chmod +x setup_beautywiz.sh
./setup_beautywiz.sh
```

3. Open the Jupyter Notebook
- Run all cells in `bbbbbbbbbBeautyWizard_Capstone.ipynb`

# Beauty Wizard: Cosmetic Ingredient Transparency & Risk Indicators

## 1. Introduction

This capstone project synthesizes multiple exploratory and pipeline notebooks into a single, end-to-end analysis aligned with data analysis capstone requirements. The objective is to examine cosmetic product formulations, ingredient usage patterns, and regulatory risk signals by integrating retail product data, ingredient hazard sources, and government chemical reporting into a relational SQLite database.

**Core Questions**

* Which ingredients are most prevalent across cosmetic products?
* How complex are typical cosmetic formulations?
* Do higher-priced or higher-ranked products differ in ingredient diversity or risk indicators?
* Which ingredients appear most frequently in regulatory or hazard datasets?

---



## 2. Imports & Libraries

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
from PIL import Image
import re
import seaborn as sns
import sqlite3

plt.rcParams['figure.figsize'] = (10,6)

## 3. Data Sources

* Sephora Skincare Product Ingredients (Kaggle CSV) (cosmetic_p.csv)
* BeautyFeeds Skincare & Haircare Dataset (BeautyFeeds.csv)
* California Safe Cosmetics Program (CSCP) Open Data (cscpopendata.csv)

Product and ingredient data from these sources are consolidated into a master Ingredients table, which serves as the central join point for product composition and regulatory hazard data sourced from California Chemicals in Cosmetics.

In [None]:
# Import data
DATA = Path("data")

cosmetic = pd.read_csv(DATA / "cosmetic_p.csv")
beauty = pd.read_csv(DATA / "BeautyFeeds.csv")
cscp = pd.read_csv(DATA / "cscpopendata.csv")

## 4. Data Analysis
### 4a. Cleaning & Standardization
"Normalization" defined:
- Standardized text (case, whitespace, characters)
- Consistent column names (e.g., ChemicalName becomes chemicalname; spaces become underscores)
- Consistent missing-value handling
- Consistent data types
- Reusable, universal dataset logic

In [None]:
def normalize(name):
    if pd.isna(name):
        return None
    name = re.sub(r"\s+", " ", name.strip())
    return name.lower()

In [None]:
# Text normalization (for any and all string columns)
def normalize_text(value):
    if pd.isna(value):
        return None
    value = str(value).strip()
    value = re.sub(r"\s+", " ", value)
    return value.lower()

In [None]:
# Column name normalization
def normalize_columns(df):
    df = df.copy()
    df.columns = (
        df.columns
        .str.strip()
        .str.lower()
        .str.replace(" ", "_")
    )
    return df

In [None]:
# Generic dataframe text normalization
def normalize_text_columns(df, columns):
    df = df.copy()
    for col in columns:
        if col in df.columns:
            df[col] = df[col].apply(normalize_text)
    return df
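
As a quick sanity check, the helpers can be exercised on a small invented DataFrame. The sketch below redefines compact versions of `normalize_text` and `normalize_columns` so it runs on its own; the rows and column names are made up for illustration and do not come from the project data.

```python
import re
import pandas as pd

def normalize_text(value):
    """Trim, collapse internal whitespace, and lowercase; missing values become None."""
    if pd.isna(value):
        return None
    return re.sub(r"\s+", " ", str(value).strip()).lower()

def normalize_columns(df):
    """Return a copy with stripped, lowercased column names (spaces become underscores)."""
    df = df.copy()
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    return df

# Invented rows that mimic messy source data (assumption, not project data)
toy = pd.DataFrame({
    " Brand ": ["  DRUNK   Elephant", None],
    "Product Type": ["Moisturizer ", "CLEANSER"],
    "ChemicalName": ["Titanium  Dioxide ", "RETINOL"],
})

toy = normalize_columns(toy)
for col in ["brand", "product_type", "chemicalname"]:
    toy[col] = toy[col].apply(normalize_text)

print(toy.columns.tolist())
print(toy.iloc[0].tolist())
```

Note that camel-case names such as `ChemicalName` are only lowercased, which is why later cells refer to `chemicalname` and `chemicalcount`.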

In [None]:
# Coerce price to numeric, invalid values become NaN
cosmetic['price'] = pd.to_numeric(cosmetic['price'], errors='coerce')

# Inspect missing values introduced by coercion
cosmetic['price'].isna().sum()
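
To see what coercion does, here is a minimal sketch on an invented price column. The values are assumptions chosen to show how formatted or placeholder strings turn into NaN, and how to separate values that were already missing from those lost to coercion.

```python
import pandas as pd

# Invented raw values: two clean numbers, one formatted string, one placeholder, one missing
raw_prices = pd.Series(["38", "42.5", "$68", "N/A", None])

prices = pd.to_numeric(raw_prices, errors="coerce")

already_missing = int(raw_prices.isna().sum())
missing_after = int(prices.isna().sum())
introduced_by_coercion = missing_after - already_missing

print(prices.tolist())
print(f"missing before: {already_missing}, after: {missing_after}, "
      f"introduced by coercion: {introduced_by_coercion}")
```

Comparing the counts before and after makes it easier to tell whether a cleanup step (for example, stripping currency symbols) is needed before coercing.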

### 4b. Apply normalization to each dataset:

In [None]:
# Cosmetic product dataset
cosmetic = normalize_columns(cosmetic)

# Rename product category labels for clearer semantics
cosmetic.rename(columns={'label': 'product_type'}, inplace=True)
cosmetic.rename(columns={'name': 'product_name'}, inplace=True)

# Text normalization
cosmetic = normalize_text_columns(
    cosmetic,
    ['brand', 'product_name', 'product_type', 'ingredients']
)

# Remove rows missing key evaluation metrics
cosmetic_cleaned = cosmetic.dropna(subset=['rank'])

In [None]:
cosmetic.columns

In [None]:
# BeautyFeeds dataset
beauty = normalize_columns(beauty)

# Rename product category labels for clearer semantics
beauty.rename(columns={'type': 'product_type'}, inplace=True)
beauty.rename(columns={'name': 'product_name'}, inplace=True)

# Text normalization
beauty = normalize_text_columns(
    beauty,
    ['ingredients', 'brand', 'product_name']  # 'name' was renamed to 'product_name' above
)

In [None]:
beauty.columns

In [None]:
cscp.columns

In [None]:
# Drop unused index column
if 'Unnamed: 0' in cscp.columns:
    cscp.drop(columns=['Unnamed: 0'], inplace=True)
    
# CSCP chemical reports dataset
cscp = normalize_columns(cscp)

# Rename product category labels for clearer semantics
cscp.rename(columns={'type': 'product_type'}, inplace=True)
cscp.rename(columns={'name': 'product_name'}, inplace=True)

# Text normalization
cscp = normalize_text_columns(
    cscp,
    ['chemicalname']
)

In [None]:
cscp.head()

### 4c. Ingredient (granular) normalization

In [None]:
def clean_ingredient_column(df, ingredient_col):
    """
    Standardize ingredient text and return a dataframe
    with cleaned and tokenized ingredient lists.
    """
    df = df.copy()

    # Ensure string type; fill missing values first so they do not become the literal token "None"
    df[ingredient_col] = df[ingredient_col].fillna('').astype(str)

    # Remove special characters (preserve commas)
    df['clean_ingredients'] = df[ingredient_col].apply(
        lambda x: re.sub(r'[^a-zA-Z0-9,\s]', '', x)
    )

    # Split into list and strip whitespace
    df['ingredient_list'] = df['clean_ingredients'].str.split(',')
    df['ingredient_list'] = df['ingredient_list'].apply(
        lambda lst: [item.strip() for item in lst if item.strip()]
    )

    # Normalize each ingredient token
    df['ingredient_list_norm'] = df['ingredient_list'].apply(
        lambda lst: [normalize(i) for i in lst]
    )

    return df

# Ingredient-specific logic
cosmetic = clean_ingredient_column(cosmetic, 'ingredients')
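
The tokenization steps above can be traced on a single invented ingredient string. This sketch uses a hypothetical helper, `tokenize_ingredients`, that applies the same regex, comma split, and normalization in one pass. It also surfaces a caveat: chemical names that contain commas or hyphens are altered by this approach.

```python
import re

def tokenize_ingredients(text):
    """Hypothetical one-pass version of the cleaning steps: strip special
    characters (keeping commas), split on commas, then normalize each token."""
    cleaned = re.sub(r"[^a-zA-Z0-9,\s]", "", text)
    return [
        re.sub(r"\s+", " ", token.strip()).lower()
        for token in cleaned.split(",")
        if token.strip()
    ]

# Invented ingredient string for illustration
sample = "Water (Aqua), Butylene Glycol, 1,2-Hexanediol, Tocopherol (Vitamin E)"
tokens = tokenize_ingredients(sample)
print(tokens)
# → ['water aqua', 'butylene glycol', '1', '2hexanediol', 'tocopherol vitamin e']
```

Parenthetical synonyms are merged into the token ("water aqua"), and "1,2-Hexanediol" is split into two fragments, so ingredient counts for such names should be read with care.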

### 4d. Missing Data Handling

* Product data `(cosmetic)`:

In [None]:
# Coerce price to numeric; values that cannot be parsed become NaN
cosmetic['price'] = pd.to_numeric(cosmetic['price'], errors='coerce')

# Missing value inspection
cosmetic.isna().sum()

# Fill missing categorical values
cosmetic['brand'] = cosmetic['brand'].fillna('Unknown')

# Remove rows missing key evaluation metrics, creating rank-valid subset
cosmetic_cleaned = cosmetic.dropna(subset=['rank'])

# Data cleaning check:
print(cosmetic.shape)
print(cosmetic_cleaned.shape)

* Ingredients data `(beauty)`:

In [None]:
# Missing value inspection
beauty.isna().sum()

# Coerce numeric values
beauty['price'] = pd.to_numeric(beauty['price'], errors='coerce')

# Fill missing categorical values that actually exist
beauty['brand'] = beauty['brand'].fillna('Unknown')
beauty['product_type'] = beauty['product_type'].fillna('Unknown')

# Require ingredient text
beauty_cleaned = beauty.dropna(subset=['ingredients'])

# Data cleaning check:
print(beauty.shape)
print(beauty_cleaned.shape)

* Chemical reporting data `(cscp)`:

In [None]:
# Missing value inspection
cscp.isna().sum()

# Coerce report count to numeric
cscp['chemicalcount'] = pd.to_numeric(cscp['chemicalcount'], errors='coerce')

# Data cleaning check:
cscp.shape

* Inspected missing values across all datasets using dataset-specific strategies.
* Coerced numeric fields, labeled missing categorical values as Unknown, and dropped rows only when key metrics were missing.
* Verified dataset sizes before and after cleaning to avoid over-filtering and preserve integrity.

### 4e. Outlier Handling

In [None]:
# IQR method to flag extreme values:
def detect_outliers_iqr(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return series[(series < lower) | (series > upper)]
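
A small worked example of the IQR rule, using invented prices. The function is repeated so the block runs on its own; the index-based flag at the end is one possible way to mark rows without dropping them.

```python
import pandas as pd

def detect_outliers_iqr(series):
    """Return values lying more than 1.5 * IQR below Q1 or above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]

# Invented prices: five typical values and one extreme value
toy_prices = pd.Series([10, 11, 12, 12, 13, 100])

# Q1 = 11.25, Q3 = 12.75, IQR = 1.5, so the fences are 9.0 and 15.0
outliers = detect_outliers_iqr(toy_prices)
print(outliers.tolist())  # → [100]

# Flag rows by index rather than removing them
is_outlier = toy_prices.index.isin(outliers.index)
print(is_outlier.tolist())
```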

In [None]:
# Outliers by price:
price_outliers = detect_outliers_iqr(cosmetic["price"])
price_outliers.count()

In [None]:
# Flag outliers instead of dropping them

cosmetic["price_outlier"] = cosmetic["price"].isin(price_outliers)

In [None]:
# Visualization to help with decision making
plt.figure(figsize=(12, 6))
cosmetic.boxplot(column="price")
plt.title("Price Distribution with Outliers")
plt.show()

* Extreme values were flagged for transparency rather than removed indiscriminately.
* Price outliers were identified using the IQR rule and confirmed with visualizations.
* Outliers are retained in the EDA and noted where appropriate to preserve real-world variability.

## 5. Exploratory Data Analysis (EDA)
This section provides high-level context on product pricing, rankings, and ingredient usage using the raw cosmetic dataset. These exploratory views help frame later database-driven analyses but are not used directly for conclusions.

In [None]:
# Derived columns for EDA
cosmetic['num_ingredients'] = cosmetic['ingredient_list'].apply(len)
cosmetic_exploded = cosmetic.explode('ingredient_list')

### Ingredient Frequency:

In [None]:
# Get top 20 ingredients by frequency
top_ingredients = (
    cosmetic_exploded['ingredient_list']
    .dropna()
    .value_counts()
    .head(20)
    .reset_index()
)

top_ingredients.columns = ['ingredient', 'count']

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(
    data=top_ingredients,
    x='count',
    y='ingredient'
)

# Styling
plt.title('Top 20 Most Common Ingredients', fontsize=16, weight='bold')
plt.xlabel('Count', fontsize=12)
plt.ylabel('Ingredient', fontsize=12)
plt.grid(axis='x', linestyle='--', alpha=0.6)
sns.despine()
plt.tight_layout()
plt.show()


* Glycerin, not water, is the most frequently listed ingredient in this dataset
* A relatively small number of ingredients dominates usage
* At the aggregate level, this produces a long-tail distribution of rarely used ingredients
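
One way to quantify the "few dominant ingredients, long tail" pattern is to measure how much of all ingredient usage the top-k ingredients cover and how many ingredients appear only once. The sketch below uses invented ingredient lists, not the project data.

```python
import pandas as pd

# Invented ingredient lists, one list per product
products = pd.Series([
    ["glycerin", "water", "phenoxyethanol"],
    ["glycerin", "water", "niacinamide"],
    ["glycerin", "squalane"],
    ["water", "glycerin", "retinol"],
])

# Occurrences per ingredient, sorted from most to least frequent
counts = products.explode().value_counts()

# Cumulative share of all occurrences covered by the top-k ingredients
cumulative_share = (counts / counts.sum()).cumsum()
top2_share = cumulative_share.iloc[1]

# Ingredients that appear in exactly one product form the long tail
singletons = int((counts == 1).sum())

print(f"top 2 ingredients cover {top2_share:.0%} of occurrences")
print(f"{singletons} of {len(counts)} ingredients appear only once")
```

On the real data, the same calculation applied to `cosmetic_exploded['ingredient_list']` would give a single number to cite alongside the bar chart.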

### Ingredient count by product type:

In [None]:
# Sort product types by mean number of ingredients
mean_ingredients = (
    cosmetic.groupby('product_type')['num_ingredients']
            .mean()
            .sort_values()
)

ordered_product_types = mean_ingredients.index

plt.figure(figsize=(12, 6))
sns.boxplot(
    data=cosmetic,
    x='product_type',
    y='num_ingredients',
    order=ordered_product_types,
    legend=False,
    linewidth=1.5,
    fliersize=8
)

plt.title('Ingredient Count by Product Type', fontsize=16, weight='bold')
plt.xlabel('Product Type', fontsize=12)
plt.ylabel('Number of Ingredients', fontsize=12)
plt.xticks(rotation=15)
plt.grid(axis='y', linestyle='--', alpha=0.7)
sns.despine()
plt.tight_layout()
plt.show()


* On average, products in this dataset contain between 20 and 40 ingredients, reflecting moderate formulation complexity
* Summary table:

In [None]:
ingredient_by_type = (
    cosmetic.groupby('product_type')['num_ingredients']
            .agg(['count', 'mean', 'median'])
            .round(1)
            .sort_values('mean')
)

ingredient_by_type

### Product price distribution:

In [None]:
# Price distribution (exclude missing or zero prices)
prices = cosmetic['price'].dropna()
prices = prices[prices > 0]

plt.figure(figsize=(12, 6))
plt.hist(prices, bins=30, edgecolor='white')
plt.title('Distribution of Product Prices', fontsize=16, weight='bold')
plt.xlabel('Price (USD)', fontsize=12)
plt.ylabel('Number of Products', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.6)
sns.despine()
plt.tight_layout()
plt.show()

* The right-skewed distribution shows most products clustered in the lower price ranges (around $38).
* A small number of luxury-priced outliers raises the average.
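
The effect of a few high prices on the average can be checked by comparing the mean with the median and looking at the sample skewness. This is a minimal sketch on invented prices, so the numbers are illustrative only.

```python
import pandas as pd

# Invented prices: mostly moderate values plus two luxury-priced items
prices = pd.Series([18, 24, 30, 35, 38, 42, 48, 55, 180, 370])

mean_price = prices.mean()      # pulled upward by the two expensive items
median_price = prices.median()  # robust to extreme values
skewness = prices.skew()        # positive for a right-skewed distribution

print(f"mean: {mean_price:.2f}, median: {median_price:.2f}, skew: {skewness:.2f}")
```

A mean well above the median, together with positive skewness, is the numeric counterpart of the long right tail seen in the histogram.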

### Product Price and Ranking:

In [None]:
# Price vs. rank
plt.figure(figsize=(12, 6))
plt.scatter(cosmetic['price'], cosmetic['rank'], alpha=0.4)
plt.title('Price vs. Product Rank')
plt.xlabel('Price')
plt.ylabel('Rank')
sns.despine()
plt.show()

* No strong linear relationship between price and ranking suggests that higher-priced products do not consistently receive better rankings.
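
A scatter plot can be backed by a correlation coefficient. The sketch below computes Pearson correlation and a rank-based (Spearman) correlation on invented price and rank values. Spearman is obtained here by correlating the ranks of each column, which avoids extra dependencies.

```python
import pandas as pd

# Invented values for illustration only
df = pd.DataFrame({
    "price": [12, 25, 38, 52, 68, 110, 250],
    "rank": [4.1, 4.5, 3.9, 4.4, 4.0, 4.3, 4.2],
})

# Pearson measures linear association
pearson = df["price"].corr(df["rank"])

# Spearman measures monotonic association: Pearson correlation of the ranks
spearman = df["price"].rank().corr(df["rank"].rank())

print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

Applied to `cosmetic[['price', 'rank']].dropna()`, values near zero would support the reading of the scatter plot, while a clearly nonzero Spearman value would suggest a monotonic trend that a linear view can miss.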

## 6. Database Design & Creation

This project uses SQLite to enforce relational integrity and enable SQL-based analysis.

**Tables**

* Products
* Ingredients
* ProductIngredients (many-to-many)
* IngredientHazards
* ChemicalReports



In [None]:
img = Image.open("schema/beauty_wizard_ERD.png")
plt.figure(figsize=(12, 6))
plt.imshow(img)
plt.axis("off")
plt.show()
plt.close()

In [None]:
# Database config
PROJECT_ROOT = Path.cwd()
DB_PATH = PROJECT_ROOT / "BeautyWiz.db"

conn = sqlite3.connect(DB_PATH)

In [None]:
# Enforce foreign keys
conn.execute("PRAGMA foreign_keys = ON;")

# Creates cursor object which ties to db
cursor = conn.cursor()
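
SQLite applies the `foreign_keys` pragma per connection, so any connection opened later needs the same statement before enforcement takes effect. The sketch below demonstrates enforcement on an in-memory database. The table definitions are a simplified assumption for illustration and are not the project's actual schema.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("PRAGMA foreign_keys = ON;")

# Simplified, assumed schema: a junction table referencing two parent tables
conn_demo.executescript("""
CREATE TABLE Products (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE Ingredients (ingredient_id INTEGER PRIMARY KEY, ingredient_name TEXT);
CREATE TABLE ProductIngredients (
    product_id INTEGER NOT NULL REFERENCES Products(product_id),
    ingredient_id INTEGER NOT NULL REFERENCES Ingredients(ingredient_id),
    PRIMARY KEY (product_id, ingredient_id)
);
""")

# Confirm the pragma is active on this connection (1 = on)
fk_enabled = conn_demo.execute("PRAGMA foreign_keys;").fetchone()[0]

# Inserting a link to nonexistent parent rows should be rejected
try:
    conn_demo.execute("INSERT INTO ProductIngredients VALUES (1, 99);")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(f"foreign_keys = {fk_enabled}, orphan insert rejected: {rejected}")
conn_demo.close()
```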

In [None]:
# Data import: reload the raw CSVs (this replaces the cleaned DataFrames from Section 4)
DATA = Path("data")

cosmetic = pd.read_csv(DATA / "cosmetic_p.csv")
beauty = pd.read_csv(DATA / "BeautyFeeds.csv")
cscp = pd.read_csv(DATA / "cscpopendata.csv")

In [None]:
# Validation
print("Cosmetic shape:", cosmetic.shape)
print("BeautyFeeds shape:", beauty.shape)
print("CSCP shape:", cscp.shape)

In [None]:
cosmetic.head()

## 7. Advanced SQL Querying



Examples include:

* Multi-table joins across products, ingredients, and hazards
* Aggregations with HAVING clauses
* Subqueries identifying high-risk ingredients used in top-ranked products



**Hazard & Regulatory Signals**

### Hazard Coverage

* Percentage of ingredients with hazard data
* Average hazard score per product

### Regulatory Reporting

* Ingredients appearing in CSCP reports
* Report frequency and discontinuation flags

Derived Metric:

* **Product Safety Indicator** = count of flagged ingredients per product
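
The derived metric and the hazard coverage percentage can be sketched in pandas as follows. The product-ingredient pairs and the set of flagged ingredient IDs are invented. In the project these would come from the `ProductIngredients` table and the hazard or regulatory tables.

```python
import pandas as pd

# Invented product-ingredient links (assumption, not project data)
product_ingredients = pd.DataFrame({
    "product_id": [1, 1, 1, 2, 2, 3],
    "ingredient_id": [10, 11, 12, 10, 13, 13],
})

# Hypothetical set of ingredient IDs that have a hazard or regulatory record
flagged_ids = {11, 12}

product_ingredients["is_flagged"] = product_ingredients["ingredient_id"].isin(flagged_ids)

# Product Safety Indicator: count of flagged ingredients per product
safety_indicator = (
    product_ingredients.groupby("product_id")["is_flagged"]
    .sum()
    .astype(int)
    .rename("flagged_ingredient_count")
)

# Hazard coverage: share of distinct ingredients that have a flag
all_ids = set(product_ingredients["ingredient_id"])
coverage_pct = 100 * len(all_ids & flagged_ids) / len(all_ids)

print(safety_indicator.to_dict())
print(f"hazard coverage: {coverage_pct:.0f}% of ingredients")
```

Products with no flagged ingredients keep a count of zero rather than disappearing from the result, which matters when the indicator is later joined back to the product table.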

---



In [None]:
# SQL-Derived Metrics: Ingredient Counts
ingredient_counts = pd.read_sql("""
SELECT product_id, COUNT(*) AS ingredient_count
FROM ProductIngredients
GROUP BY product_id
""", conn)

### Query 1: Ingredient complexity by brand

Which brands tend to use more complex formulations (more ingredients per product)?

In [None]:
# Average ingredients per product by brand

conn = sqlite3.connect(DB_PATH)

query_brand_complexity = '''
WITH per_product AS (
    -- Count ingredients for each product first, so shared ingredients
    -- within a brand are not collapsed into one
    SELECT product_id, COUNT(DISTINCT ingredient_id) AS ingredient_count
    FROM ProductIngredients
    GROUP BY product_id
)
SELECT
    p.brand,
    AVG(pp.ingredient_count) AS avg_ingredients_per_product,
    COUNT(*) AS product_count
FROM Products p
JOIN per_product pp
    ON p.product_id = pp.product_id
GROUP BY p.brand
HAVING COUNT(*) >= 5
ORDER BY avg_ingredients_per_product DESC;
'''

df_brand = pd.read_sql(query_brand_complexity, conn)
df_brand.shape

In [None]:
# Q1 Visualization

df_brand.sort_values("avg_ingredients_per_product").plot(
    kind="barh",
    x="brand",
    y="avg_ingredients_per_product",
    legend=False
)
plt.title("Average Ingredients per Product by Brand")
plt.xlabel("Average Number of Ingredients")
plt.ylabel("Brand")
plt.show()

* This analysis compares average formulation complexity across brands by measuring the number of unique ingredients used per product. Brands with higher averages tend to produce more complex formulations.
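
Two definitions of "average ingredients per product" can give different numbers when products within a brand share ingredients. The sketch below contrasts them on an in-memory SQLite database with invented rows. The table and column names mirror those used in the queries above, but the data is made up.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE Products (product_id INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE ProductIngredients (product_id INTEGER, ingredient_id INTEGER);
INSERT INTO Products VALUES (1, 'brand_a'), (2, 'brand_a'), (3, 'brand_b');
INSERT INTO ProductIngredients VALUES
    (1, 10), (1, 11), (1, 12),
    (2, 10), (2, 11), (2, 13),
    (3, 10);
""")

# Definition A: count ingredients per product, then average within each brand
per_product_avg = dict(demo.execute("""
WITH per_product AS (
    SELECT product_id, COUNT(DISTINCT ingredient_id) AS ingredient_count
    FROM ProductIngredients
    GROUP BY product_id
)
SELECT p.brand, AVG(pp.ingredient_count)
FROM Products p
JOIN per_product pp ON p.product_id = pp.product_id
GROUP BY p.brand;
""").fetchall())

# Definition B: distinct ingredients in the brand's pool divided by its product count
pooled_ratio = dict(demo.execute("""
SELECT p.brand,
       COUNT(DISTINCT pi.ingredient_id) * 1.0 / COUNT(DISTINCT p.product_id)
FROM Products p
JOIN ProductIngredients pi ON p.product_id = pi.product_id
GROUP BY p.brand;
""").fetchall())

print(per_product_avg)  # → {'brand_a': 3.0, 'brand_b': 1.0}
print(pooled_ratio)     # → {'brand_a': 2.0, 'brand_b': 1.0}
demo.close()
```

Both products of `brand_a` contain three ingredients, so the per-product average is 3.0. The pooled ratio is lower because the two shared ingredients are counted only once across the brand.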

### Query 2: Most Widely Used Ingredients Across Products

Which ingredients appear most frequently across cosmetic products?

In [None]:
# Ingredient prevalence

conn = sqlite3.connect(DB_PATH)

query_ingredients = '''
SELECT
    i.ingredient_name,
    COUNT(DISTINCT pi.product_id) AS product_count
FROM Ingredients i
JOIN ProductIngredients pi
    ON i.ingredient_id = pi.ingredient_id
GROUP BY i.ingredient_name
HAVING COUNT(DISTINCT pi.product_id) >= 20
ORDER BY product_count DESC
LIMIT 25;
'''

df_top_ingredients = pd.read_sql(query_ingredients, conn)
df_top_ingredients.shape

In [None]:
# Q2 Visualization

df_top_ingredients.plot(
    kind="bar",
    x="ingredient_name",
    y="product_count",
    legend=False
)
plt.title("Most Widely Used Ingredients Across Products")
plt.xlabel("Ingredient")
plt.ylabel("Number of Products")
plt.xticks(rotation=75, ha="right")
plt.tight_layout()
plt.show()

This query identifies ingredients that appear most frequently across cosmetic products, highlighting formulation staples that dominate product compositions.

### Query 3: Products with regulatory history

This analysis links cosmetic products to regulatory reporting records to identify which products and brands contain ingredients with documented regulatory histories. Results are presented at two levels:
* Product-level detail to support ingredient transparency
* Brand-level aggregate to assess overall regulatory exposure

### Product-level regulatory exposure:

This query shows individual products that contain ingredients with documented regulatory reporting or discontinuation history.

In [None]:
# Products with chemical reporting records

conn = sqlite3.connect(DB_PATH)

query_top_products_regulated_ingredients = """
SELECT
    p.brand,
    p.product_name AS product,
    COUNT(DISTINCT pi.ingredient_id) AS regulated_ingredient_count
FROM Products p
JOIN ProductIngredients pi
    ON p.product_id = pi.product_id
JOIN ChemicalReports cr
    ON pi.ingredient_id = cr.ingredient_id
GROUP BY p.product_id
ORDER BY regulated_ingredient_count DESC
LIMIT 10;
"""

df_top_products = pd.read_sql(query_top_products_regulated_ingredients, conn)
df_top_products

* Products with regulatory history are typically linked to a small set of widely used ingredients rather than rare or niche compounds, highlighting the importance of monitoring common formulation components.

### Brand-level regulatory exposure:

To assess broader exposure patterns, the following query aggregates regulatory activity at the brand level.

In [None]:
# Brands with chemical reporting records
query_brand_exposure = """
WITH ingredient_reports AS (
    SELECT
        ingredient_id,
        COUNT(DISTINCT report_id) AS regulatory_event_count
    FROM ChemicalReports
    GROUP BY ingredient_id
)
SELECT
    p.brand,
    COUNT(DISTINCT ir.ingredient_id) AS regulated_ingredient_count,
    SUM(ir.regulatory_event_count) AS total_regulatory_events
FROM ingredient_reports ir
JOIN ProductIngredients pi
    ON ir.ingredient_id = pi.ingredient_id
JOIN Products p
    ON pi.product_id = p.product_id
WHERE ir.regulatory_event_count > 0
GROUP BY p.brand
ORDER BY total_regulatory_events DESC;
"""

df_brand_hazards = pd.read_sql(query_brand_exposure, conn)
df_brand_hazards.head(10)

### Brand hazard summary:

In [None]:
brand_hazard_summary = (
    df_brand_hazards
    .sort_values('total_regulatory_events', ascending=False)
)

brand_hazard_summary.head(10)

* Regulatory exposure is unevenly distributed across brands. A small number of brands account for a disproportionate share of reported ingredients, suggesting concentrated regulatory risk rather than uniform exposure across the market.

## 10. Key Findings

* A small subset of ingredients dominates cosmetic formulations
* Most products contain a long tail of low-frequency ingredients
* Regulatory and hazard reporting is concentrated among relatively few chemicals
* Ingredient diversity does not strongly correlate with price

---



## 11. Limitations

* Ingredient presence does not imply concentration or exposure level
* Hazard scores are source-dependent and not definitive safety measures
* Dataset coverage varies by brand and product category

---



## 12. Conclusion & Next Steps

This project demonstrates an end-to-end data analysis workflow including ETL, database design, SQL analysis, Python EDA, and professional documentation. Future extensions could include:

* Automated data refresh
* API-driven product lookups
* Integration of consumer review sentiment

---



## 13. Reproducibility Notes

* All work performed via command line Git commits
* No file uploader used after initial dataset acquisition
* Notebook structured for portfolio and PDF export (if needed)
