# Hazardous Ingredients in Cosmetics Project

This project uses data from the California Department of Public Health (CDPH), which maintains a database of personal beauty products that contain potentially hazardous ingredients. All products are self-reported by the manufacturers or companies, and reporting is required if the company:

   * Has annual aggregate sales of cosmetic products of one million dollars or more, and
   * Has sold cosmetic products in California on or after January 1, 2007.
   
Additional data from Kaggle was added so that products without a hazardous ingredient are also represented in the project.

## Goals:
   * Identify features that could indicate whether a product contains a hazardous ingredient.
   * Build a model that can accurately predict if a product contains a hazardous ingredient.

## Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import warnings
warnings.filterwarnings("ignore")

import seaborn as sns
from scipy import stats


from sklearn.model_selection import train_test_split
import sklearn.preprocessing
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

import category_encoders as ce
import prepare as p

## Acquire
* Data acquired from California Department of Public Health (CDPH) and Kaggle.com
* Two separate dataframes are acquired before cleaning
    * DF1, the data from Kaggle, has 11 columns and 1472 rows before cleaning
    * DF2, the data from CDPH, has 22 columns and 114635 rows before cleaning
* Each row is a single product
* Each column contains information about the product

## Prepare
DF1 (Kaggle):
* Duplicates removed
* Ingredients extracted from single cell
* Target column, has_hazard_ingredient, created by comparing each product's ingredients to the hazard ingredient list.
* Unused columns dropped
* Product types renamed to match CDPH data.
    * Only Skincare products and Sunscreen products
* Columns renamed to match CDPH data.
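
As a rough illustration of how the target column could be derived, the sketch below splits a comma-separated ingredient cell and flags a product when any ingredient appears in a hazard list. The column names (`name`, `ingredients`), the sample rows, the short `hazard_list`, and the helper `has_hazard` are all assumptions made for this example; the actual logic lives in `prepare.py` and may differ.

```python
import pandas as pd

# Hypothetical hazard list; the project compares against its own hazard ingredient list.
hazard_list = {"titanium dioxide", "mica", "butylated hydroxyanisole"}

# Hypothetical Kaggle-style rows with all ingredients stored in a single cell.
demo_products = pd.DataFrame({
    "name": ["Daily Sunscreen", "Gentle Moisturizer"],
    "ingredients": [
        "Water, Titanium Dioxide, Glycerin",
        "Water, Glycerin, Shea Butter",
    ],
})

def has_hazard(ingredient_cell, hazards):
    """Return True if any comma-separated ingredient is in the hazard set."""
    items = {item.strip().lower() for item in ingredient_cell.split(",")}
    return not items.isdisjoint(hazards)

# Flag each product by checking its ingredient cell against the hazard list.
demo_products["has_hazard_ingredient"] = demo_products["ingredients"].apply(
    lambda cell: has_hazard(cell, hazard_list)
)
```

Exact matching on lowercased names is the simplest choice; real ingredient lists often need extra normalization (synonyms, parenthetical notes) before comparison.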

DF2 (CDPH):
* Duplicates removed
* Target column, has_hazard_ingredient, created
* Unused columns dropped
* Columns renamed

Final dataframe:
* Combined Kaggle and CDPH dataframes
* All product types other than Skincare and Sunscreen removed:
    * These two types were the only ones that contained products with non-hazardous ingredients
* Split the data into train, validate, and test sets in a 50/30/20 split, stratified on has_hazard_ingredient.
* 'Brand' and 'Type' features encoded:
    * LeaveOneOut encoding used to address dimensionality issues, with sigma = 0.5 to reduce overfitting.
        * Target excluded when encoding the test data
* Final dataframe has 4 columns and 7193 rows.
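
To make the encoding step more concrete, here is a minimal pandas sketch of the idea behind leave-one-out target encoding. It is a simplified re-implementation for illustration, not the `category_encoders` code the project uses: the function name `leave_one_out_encode`, the demo data, and the handling of single-row and unseen categories (falling back to the global mean) are assumptions. The noise is assumed to be multiplicative Gaussian noise with mean 1 and standard deviation `sigma`, applied to training rows only.

```python
import numpy as np
import pandas as pd

def leave_one_out_encode(train, test, col, target, sigma=0.5, seed=0):
    """Simplified leave-one-out target encoding (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    global_mean = train[target].mean()

    # Per-category sum and count of the target, computed on train only.
    stats = train.groupby(col)[target].agg(["sum", "count"])
    cat_sum = train[col].map(stats["sum"])
    cat_count = train[col].map(stats["count"])

    # Train rows: category mean with the row's own target left out.
    # Categories seen only once have no other rows, so use the global mean.
    others = (cat_count - 1).where(cat_count > 1)
    loo = ((cat_sum - train[target]) / others).fillna(global_mean)

    # Assumed noise model: multiply train encodings by N(1, sigma) noise.
    train_encoded = loo * rng.normal(1.0, sigma, size=len(train))

    # Test rows: plain category mean from train; the test target is never used.
    cat_mean = stats["sum"] / stats["count"]
    test_encoded = test[col].map(cat_mean).fillna(global_mean)
    return train_encoded, test_encoded

# Hypothetical demo data.
demo_train = pd.DataFrame({
    "Brand": ["A", "A", "A", "B", "B", "C"],
    "has_hazard_ingredient": [1, 0, 1, 1, 1, 0],
})
demo_test = pd.DataFrame({"Brand": ["A", "B", "D"]})

train_enc, test_enc = leave_one_out_encode(
    demo_train, demo_test, "Brand", "has_hazard_ingredient", sigma=0.5
)
```

Each categorical column becomes a single numeric column, which is why this approach keeps the feature count small even when there are many brands.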

In [2]:
# acquire
df1, df2 = p.get_cosmetic_data()

## Before we clean it up, let's see some interesting numbers from the CDPH database:

In [3]:
# top 5 most frequently reported hazardous chemicals
df2.ChemicalName.value_counts().sort_values(ascending=False).head(5)

Titanium dioxide                                                                                       93480
Silica, crystalline (airborne particles of respirable size)                                             2817
Retinol/retinyl esters, when in daily dosages in excess of 10,000 IU, or 3,000 retinol equivalents.     2154
Mica                                                                                                    1919
Butylated hydroxyanisole                                                                                1888
Name: ChemicalName, dtype: int64

In [4]:
#list of 10 companies with the most products with hazardous ingredients
df2.CompanyName.value_counts().sort_values(ascending=False).head(10)

L'Oreal USA                            5747
S+                                     5165
Coty                                   5162
Revlon Consumer Product Corporation    4341
Bare Escentuals Beauty, Inc.           3828
The Procter & Gamble Company           3535
NYX Los Angeles, Inc.                  3227
Charlotte Tilbury Beauty Ltd           2770
Tarte Cosmetics                        2497
Victoria's Secret Beauty               2219
Name: CompanyName, dtype: int64

In [5]:
#list of 9 brands with the most products with hazardous ingredients
df2.BrandName.value_counts().sort_values(ascending=False).head(9)

SEPHORA                     3394
NYX                         3227
bareMinerals                3158
Charlotte Tilbury           2453
Revlon                      2335
NARS                        2185
Victoria's Secret Beauty    2106
tarte                       2101
Sally Hansen                1834
Name: BrandName, dtype: int64

In [6]:
# clean the first df
df1, df2 = p.prep_df1(df1,df2)

#clean the second df
df2 = p.prep_df2(df2)

#clean the final df
df = p.final_prep(df1,df2)

In [8]:
#split the data
train, validate, test = p.train_validate_test_split(df, 'has_hazard_ingredient', seed=611)
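
For readers without access to `prepare.py`, the following is a minimal sketch of how a stratified 50/30/20 split could be done with scikit-learn. The function `sketch_train_validate_test_split` and the demo dataframe are hypothetical stand-ins; the project's own `p.train_validate_test_split` may be implemented differently.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def sketch_train_validate_test_split(df, target, seed=611):
    """Split df into 50% train, 30% validate, 20% test, stratified on target."""
    # First hold out 20% of all rows as the test set.
    train_validate, test = train_test_split(
        df, test_size=0.20, random_state=seed, stratify=df[target]
    )
    # 30% of the full data is 37.5% of the remaining 80%.
    train, validate = train_test_split(
        train_validate,
        test_size=0.375,
        random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test

# Hypothetical demo data: 100 rows, 60% flagged as hazardous.
demo_df = pd.DataFrame({
    "Brand": ["x"] * 100,
    "has_hazard_ingredient": [1] * 60 + [0] * 40,
})
demo_train, demo_validate, demo_test = sketch_train_validate_test_split(
    demo_df, "has_hazard_ingredient"
)
print(len(demo_train), len(demo_validate), len(demo_test))
```

Stratifying both splits keeps the share of hazardous products roughly equal across the three sets, which matters when the classes are imbalanced.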

## Explore