# Beauty Wizard: Cosmetic Ingredient Transparency & Risk Indicators

## 1. Introduction

This capstone project synthesizes multiple exploratory and pipeline notebooks into a single, end-to-end analysis aligned with data analysis capstone requirements. The objective is to examine cosmetic product formulations, ingredient usage patterns, and regulatory risk signals by integrating retail product data, ingredient hazard sources, and government chemical reporting into a relational SQLite database.

**Core Questions**

* Which ingredients are most prevalent across cosmetic products?
* How complex are typical cosmetic formulations?
* Do higher-priced or higher-ranked products differ in ingredient diversity or risk indicators?
* Which ingredients appear most frequently in regulatory or hazard datasets?

---



## 2. Data Sources

* Sephora Skincare Product Ingredients (Kaggle CSV)
* BeautyFeeds Skincare & Haircare Dataset
* California Safe Cosmetics Program (CSCP) Open Data

These sources are combined to exceed minimum row and column requirements and allow meaningful relational joins.

---



## 3. Environment & Libraries



In [None]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

## 4. Database Design & Creation

This project uses SQLite to enforce relational integrity and enable SQL-based analysis.

**Tables**

* Products
* Ingredients
* ProductIngredients (many-to-many)
* IngredientHazards
* ChemicalReports

```python
conn = sqlite3.connect("BeautyWiz.db")
conn.execute("PRAGMA foreign_keys = ON;")
```

(Full schema creation code included here.)
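As a rough illustration of what the schema creation might look like, the sketch below builds the five tables listed above in an in-memory database. Only the table names come from this notebook; every column name, type, and constraint (for example `product_rank`, `hazard_score`, `discontinued`) is an assumption made for the example, and `demo_conn` is a hypothetical stand-in for `conn`.

```python
import sqlite3

# Hypothetical schema: table names match the list above, but all
# column names, types, and constraints are assumptions for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS Products (
    product_id   INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    brand        TEXT,
    price        REAL,
    product_rank REAL
);
CREATE TABLE IF NOT EXISTS Ingredients (
    ingredient_id INTEGER PRIMARY KEY,
    name          TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS ProductIngredients (
    product_id    INTEGER NOT NULL REFERENCES Products(product_id),
    ingredient_id INTEGER NOT NULL REFERENCES Ingredients(ingredient_id),
    PRIMARY KEY (product_id, ingredient_id)
);
CREATE TABLE IF NOT EXISTS IngredientHazards (
    ingredient_id INTEGER NOT NULL REFERENCES Ingredients(ingredient_id),
    source        TEXT,
    hazard_score  REAL
);
CREATE TABLE IF NOT EXISTS ChemicalReports (
    report_id     INTEGER PRIMARY KEY,
    ingredient_id INTEGER REFERENCES Ingredients(ingredient_id),
    reported_on   TEXT,
    discontinued  INTEGER
);
"""

# In-memory database so the sketch does not touch BeautyWiz.db
demo_conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is set on the connection
demo_conn.execute("PRAGMA foreign_keys = ON;")
demo_conn.executescript(SCHEMA)
```

The composite primary key on `ProductIngredients` is one way to keep the many-to-many link table free of duplicate product and ingredient pairs.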

---



## 5. Data Cleaning & Standardization

### Ingredient Normalization

* Lowercasing
* Whitespace normalization
* Consistent matching across datasets

```python
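def normalize(name):
    """Return a lowercased, whitespace-collapsed ingredient name, or None if missing."""
    if pd.isna(name):
        return None
    # Cast to str so stray numeric values do not break .strip()
    name = re.sub(r"\s+", " ", str(name).strip())
    return name.lower()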
```

### Missing Data Handling

* Numeric: retained or median-imputed where appropriate
* Categorical: labeled as `Unknown`
* Relational keys: dropped if integrity could not be preserved
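The three rules above could be applied along these lines. This is a minimal sketch on a toy DataFrame: the column names (`product_id`, `brand`, `price`) and the helper `handle_missing` are assumptions for illustration, not the notebook's actual code.

```python
import numpy as np
import pandas as pd

# Toy frame; column names are illustrative assumptions
products = pd.DataFrame({
    "product_id": [1, 2, None, 4],
    "brand": ["A", None, "C", "D"],
    "price": [10.0, np.nan, 30.0, 50.0],
})

def handle_missing(df):
    """Apply the three missing-data rules to a copy of df (hypothetical helper)."""
    # Relational keys: drop rows whose key is missing, since they cannot be joined
    out = df.dropna(subset=["product_id"]).copy()
    out["product_id"] = out["product_id"].astype(int)
    # Numeric: median-impute (here computed on the rows that survive the key check)
    out["price"] = out["price"].fillna(out["price"].median())
    # Categorical: label missing values explicitly
    out["brand"] = out["brand"].fillna("Unknown")
    return out.reset_index(drop=True)

clean_products = handle_missing(products)
```

Dropping key-less rows before imputing keeps rows that will never be loaded from influencing the median.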

---



## 6. ETL Pipeline

Steps:

1. Insert products
2. Build master ingredient list
3. Create product–ingredient links
4. Load hazard data
5. Load chemical reporting data

Each step is implemented as a reusable Python function and committed incrementally.
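To make steps 2 and 3 concrete, here is a small sketch of what two such reusable functions might look like. The function names `load_ingredients` and `link_product`, the column names, and the use of `INSERT OR IGNORE` to make reruns idempotent are all assumptions for illustration; the demo runs against an in-memory database with a pared-down schema.

```python
import sqlite3

def load_ingredients(conn, names):
    """Insert unique ingredient names; return a name -> ingredient_id map."""
    unique = sorted({n for n in names if n})
    conn.executemany(
        "INSERT OR IGNORE INTO Ingredients (name) VALUES (?)",
        [(n,) for n in unique],
    )
    conn.commit()  # commit after each step so a later failure keeps earlier work
    return dict(conn.execute("SELECT name, ingredient_id FROM Ingredients").fetchall())

def link_product(conn, product_id, names, id_map):
    """Create product-ingredient links, skipping names missing from id_map."""
    conn.executemany(
        "INSERT OR IGNORE INTO ProductIngredients (product_id, ingredient_id) "
        "VALUES (?, ?)",
        [(product_id, id_map[n]) for n in set(names) if n in id_map],
    )
    conn.commit()

# Demo on an in-memory database with a minimal, assumed schema
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE Products (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Ingredients (ingredient_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE ProductIngredients (
    product_id    INTEGER REFERENCES Products(product_id),
    ingredient_id INTEGER REFERENCES Ingredients(ingredient_id),
    PRIMARY KEY (product_id, ingredient_id)
);
""")
demo.execute("INSERT INTO Products VALUES (1, 'Demo Cream')")
id_map = load_ingredients(demo, ["water", "glycerin", "water"])
link_product(demo, 1, ["water", "glycerin"], id_map)
link_product(demo, 1, ["water", "glycerin"], id_map)  # rerun adds no duplicates
```

Relying on the `UNIQUE` and composite primary key constraints together with `INSERT OR IGNORE` means a step can be rerun safely after a partial failure.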

---



## 7. Exploratory Data Analysis (EDA)

### Ingredient Frequency

* Top 20 most common ingredients
* Long-tail distribution of rarely used ingredients

### Product Formulation Complexity

* Distribution of ingredient counts per product
* Comparison across product types and brands

### Price & Rank Analysis

* Price vs. ingredient count
* Rank vs. formulation complexity

```python
ingredient_counts = pd.read_sql("""
SELECT product_id, COUNT(*) AS ingredient_count
FROM ProductIngredients
GROUP BY product_id
""", conn)
```
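The frequency and price comparisons listed in this section could be computed along the following lines. This is a sketch on toy data in an in-memory database: the column names (`name`, `price`) and the sample rows are assumptions, and Pearson correlation is used only as one simple choice of association measure.

```python
import sqlite3
import pandas as pd

# Toy data in an in-memory database; schema and rows are assumptions
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE Products (product_id INTEGER PRIMARY KEY, price REAL);
CREATE TABLE Ingredients (ingredient_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ProductIngredients (product_id INTEGER, ingredient_id INTEGER);
INSERT INTO Products VALUES (1, 10.0), (2, 20.0), (3, 30.0);
INSERT INTO Ingredients VALUES (1, 'water'), (2, 'glycerin'), (3, 'niacinamide');
INSERT INTO ProductIngredients VALUES (1, 1), (1, 2), (2, 1), (3, 1), (3, 2), (3, 3);
""")

# Top 20 ingredients by the number of distinct products that contain them
top_ingredients = pd.read_sql("""
SELECT i.name, COUNT(DISTINCT pi.product_id) AS product_count
FROM ProductIngredients AS pi
JOIN Ingredients AS i ON i.ingredient_id = pi.ingredient_id
GROUP BY i.ingredient_id, i.name
ORDER BY product_count DESC, i.name
LIMIT 20
""", demo)

# Ingredient count per product alongside price
price_vs_count = pd.read_sql("""
SELECT p.product_id, p.price, COUNT(pi.ingredient_id) AS ingredient_count
FROM Products AS p
JOIN ProductIngredients AS pi ON pi.product_id = p.product_id
GROUP BY p.product_id, p.price
""", demo)

# Pearson correlation between price and formulation complexity
pearson_r = price_vs_count["price"].corr(price_vs_count["ingredient_count"])
```

On real data, a scatter plot of `price` against `ingredient_count` is worth inspecting next to the coefficient, since a single number can hide non-linear patterns.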

---



## 8. Hazard & Regulatory Signals

### Hazard Coverage

* Percentage of ingredients with hazard data
* Average hazard score per product

### Regulatory Reporting

* Ingredients appearing in CSCP reports
* Report frequency and discontinuation flags

Derived Metric:

* **Product Safety Indicator** = count of flagged ingredients per product
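One way to compute this indicator is sketched below. It assumes that "flagged" means the ingredient appears in either `IngredientHazards` or `ChemicalReports`; that definition, the column names, and the toy rows are assumptions made for the example rather than the notebook's confirmed logic.

```python
import sqlite3

# Toy data in an in-memory database; schema and rows are assumptions
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE Products (product_id INTEGER PRIMARY KEY);
CREATE TABLE ProductIngredients (product_id INTEGER, ingredient_id INTEGER);
CREATE TABLE IngredientHazards (ingredient_id INTEGER, hazard_score REAL);
CREATE TABLE ChemicalReports (report_id INTEGER PRIMARY KEY, ingredient_id INTEGER);
INSERT INTO Products VALUES (1), (2), (3);
INSERT INTO ProductIngredients VALUES (1, 1), (1, 2), (1, 3), (2, 1), (3, 3);
INSERT INTO IngredientHazards VALUES (2, 6.0);
INSERT INTO ChemicalReports VALUES (1, 2), (2, 3);
""")

# Assumed definition: an ingredient is flagged if it has a hazard row or a report row.
# LEFT JOINs keep products with no flagged ingredients, which get a count of 0.
SAFETY_QUERY = """
SELECT p.product_id,
       COUNT(DISTINCT f.ingredient_id) AS flagged_ingredients
FROM Products AS p
LEFT JOIN ProductIngredients AS pi ON pi.product_id = p.product_id
LEFT JOIN (
    SELECT ingredient_id FROM IngredientHazards
    UNION
    SELECT ingredient_id FROM ChemicalReports
) AS f ON f.ingredient_id = pi.ingredient_id
GROUP BY p.product_id
ORDER BY p.product_id
"""

safety_indicator = dict(demo.execute(SAFETY_QUERY).fetchall())
```

`COUNT(DISTINCT ...)` keeps an ingredient that appears in both sources from being counted twice for the same product.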

---



## 9. Advanced SQL Queries

Examples include:

* Multi-table joins across products, ingredients, and hazards
* Aggregations with HAVING clauses
* Subqueries identifying high-risk ingredients used in top-ranked products
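The query below sketches how these three patterns might combine in a single statement. The thresholds (`product_rank >= 4.5`, `hazard_score >= 7`, at least two products), the column names, and the placeholder ingredient names are all assumptions chosen for the toy data, not values taken from the project.

```python
import sqlite3

# Toy data in an in-memory database; schema, rows, and thresholds are assumptions
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE Products (product_id INTEGER PRIMARY KEY, product_rank REAL);
CREATE TABLE Ingredients (ingredient_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ProductIngredients (product_id INTEGER, ingredient_id INTEGER);
CREATE TABLE IngredientHazards (ingredient_id INTEGER, hazard_score REAL);
INSERT INTO Products VALUES (1, 4.8), (2, 4.6), (3, 3.0);
INSERT INTO Ingredients VALUES
    (1, 'ingredient_a'), (2, 'ingredient_b'), (3, 'ingredient_c');
INSERT INTO ProductIngredients VALUES (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 3);
INSERT INTO IngredientHazards VALUES (1, 1.0), (2, 8.0), (3, 9.0);
""")

# High-hazard ingredients that appear in at least two top-ranked products
HIGH_RISK_QUERY = """
SELECT i.name, COUNT(DISTINCT pi.product_id) AS top_product_count
FROM Ingredients AS i
JOIN ProductIngredients AS pi ON pi.ingredient_id = i.ingredient_id
WHERE pi.product_id IN (
          SELECT product_id FROM Products WHERE product_rank >= 4.5
      )
  AND i.ingredient_id IN (
          SELECT ingredient_id FROM IngredientHazards WHERE hazard_score >= 7
      )
GROUP BY i.ingredient_id, i.name
HAVING COUNT(DISTINCT pi.product_id) >= 2
ORDER BY top_product_count DESC
"""

high_risk = demo.execute(HIGH_RISK_QUERY).fetchall()
```

In this toy data, `ingredient_c` has a high hazard score but appears in only one top-ranked product, so the `HAVING` clause filters it out.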

---



## 10. Key Findings

* A small subset of ingredients dominates cosmetic formulations
* Most products contain a long tail of low-frequency ingredients
* Regulatory and hazard reporting is concentrated among relatively few chemicals
* Ingredient diversity does not strongly correlate with price

---



## 11. Limitations

* Ingredient presence does not imply concentration or exposure level
* Hazard scores are source-dependent and not definitive safety measures
* Dataset coverage varies by brand and product category

---



## 12. Conclusion & Next Steps

This project demonstrates an end-to-end data analysis workflow including ETL, database design, SQL analysis, Python EDA, and professional documentation. Future extensions could include:

* Automated data refresh
* API-driven product lookups
* Integration of consumer review sentiment

---



## 13. Reproducibility Notes

* All work was committed to Git from the command line
* No file uploader was used after the initial dataset acquisition
* The notebook is structured for portfolio use and PDF export
