# Chemicals In Cosmetics
Etienne Batiste <br>
CS 2316 <br>
Fall 2020

# Introduction
I chose the topic of chemicals in cosmetics because I am interested in the safety of cosmetic ingredients. I want to learn how many products currently being sold contain potentially hazardous chemicals. I expect to learn about the safety of specific ingredients and to trace potentially hazardous chemicals back to the brands that sell them.

**Please do not run any cells. My code extracted data from live sources that have since changed, so running it will produce error output. Thank you!**

See the Insights section for results.

## Data Cleaning and Parsing of the Chemicals in Cosmetics File from HealthData.gov



In [None]:
import csv
import pandas as pd

# Positions of the columns kept from the raw file
KEEP_COLUMNS = [1, 5, 6, 8, 10, 14, 15, 16, 17]
# Placeholder text for blank cells: column 6 is the brand, column 17 the discontinued date
BLANK_FILL = {6: "Unknown", 17: "None"}

def data_parser(filename):
    with open(filename, "r", encoding="utf8") as fin:
        readerList = [line for line in csv.reader(fin)]

    header = [readerList[0][i] for i in KEEP_COLUMNS]
    newList = []
    for line in readerList[1:]:
        # Strip whitespace from every kept cell; fill blank brand and date cells
        newList.append([BLANK_FILL[i] if line[i] == '' and i in BLANK_FILL else line[i].strip()
                        for i in KEEP_COLUMNS])

    df = pd.DataFrame(data=newList, columns=header)
    # Drop duplicate rows and write the cleaned file (the index is saved as the first column)
    df.drop_duplicates(ignore_index=True).to_csv('chemincos.csv')


############ Function Call ############
data_parser("cscpopendata.csv")



## Web Collection Parsing from cosmeticsinfo.org


In [None]:
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://cosmeticsinfo.org"

def get_soup(url):
    # Download a page and parse its HTML
    req = requests.get(url)
    return BeautifulSoup(req.content, 'html.parser')

def ingredient_names(soup):
    # Each ingredient on a listing page is a link inside a "field-content" element
    return [ingredient.find("a").text for ingredient in soup.find_all(class_="field-content")]

def web_parser1(url):
    soup = get_soup(url)
    urlblock = soup.find(class_="view-content")
    hrefs = [a.get('href') for a in urlblock.find_all('a')]  # all hrefs for each letter's page

    bigList = []
    for href in hrefs:
        soup = get_soup(BASE_URL + href)
        bigList += ingredient_names(soup)
        pageblock = soup.find(class_="pager")
        if pageblock is not None:
            # hrefs for each additional page of this letter
            for num in [a.get('href') for a in pageblock.find_all('a')]:
                bigList += ingredient_names(get_soup(BASE_URL + num))

    with open("chemicals.txt", "w") as fout:
        for r in bigList:
            fout.write(r + "\n")

############ Function Call ############
web_parser1("https://cosmeticsinfo.org/ingredient-alphabetical/1")


## Data Sources

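*   Dataset Source: <a href="https://healthdata.gov/dataset/chemicals-cosmetics">https://healthdata.gov/dataset/chemicals-cosmetics</a> (114,636 rows by 22 columns, 30.4 MB)
*   Web Collection Source: <a href="https://cosmeticsinfo.org/ingredient-alphabetical/1">https://cosmeticsinfo.org/ingredient-alphabetical/1</a>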



# Data Analysis

## Insights

In [None]:
import csv
import pandas as pd

def insight1():
    with open('chemincos.csv', "r", encoding="utf8") as fin:
        readerList = [line for line in csv.reader(fin)]

    # For each chemical name (column 6), count rows with and without a discontinued date (column 9)
    chemicalDict = {}
    for line in readerList[1:]:
        counts = chemicalDict.setdefault(line[6], {"NumDiscontinued": 0, "NumStillSold": 0})
        if line[9] == 'None':  # no discontinued date, so the product is still sold
            counts["NumStillSold"] += 1
        else:
            counts["NumDiscontinued"] += 1

    df = pd.DataFrame(data=chemicalDict)
    df = df.transpose().sort_values(by="NumStillSold", ascending=False)
    df["PercentDiscontinued(%)"] = ((df["NumDiscontinued"] / (df["NumDiscontinued"] + df["NumStillSold"])) * 100).round(2)

    return df


############ Function Call ############
insight1()

| Chemical | NumDiscontinued | NumStillSold | PercentDiscontinued(%) |
|---|---|---|---|
| Titanium dioxide | 4191 | 29947 | 12.28 |
| Silica, crystalline (airborne particles of respirable size) | 56 | 1324 | 4.06 |
| Carbon black | 35 | 694 | 4.80 |
| Cocamide diethanolamine | 346 | 633 | 35.34 |
| Retinyl palmitate | 19 | 607 | 3.04 |
| ... | ... | ... | ... |
| N-Nitrosodimethylamine | 3 | 1 | 75.00 |
| Dichloromethane (Methylene chloride) | 6 | 0 | 100.00 |
| Goldenseal root powder | 2 | 0 | 100.00 |
| Diethanolamides of the fatty acids of coconut oil | 1 | 0 | 100.00 |


### Insight 1 Explanation

This insight shows, for each chemical, how many products containing it are still on the market and how many have been discontinued. The percent discontinued column puts those counts in perspective by giving the share of products with that chemical that have been discontinued. This suggests that the higher the percent discontinued, the more potentially hazardous the chemical. The table is sorted by number of products still sold, so the chemicals at the top are the most prevalent in cosmetics today and therefore the most important to examine. Chemicals further down the table are rarely or never sold, which can also indicate danger.
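As a quick check on the percent discontinued column, the arithmetic can be reproduced with plain Python. This is a minimal sketch: the helper name `percent_discontinued` is hypothetical and not part of the notebook, and the inputs are taken from the table above.

```python
def percent_discontinued(num_discontinued, num_still_sold):
    """Share of products with a chemical that were discontinued, as a percentage."""
    total = num_discontinued + num_still_sold
    if total == 0:
        return 0.0  # assumption: treat a chemical with no products as 0%
    return round(num_discontinued / total * 100, 2)


# Counts copied from the insight 1 table
titanium_dioxide = percent_discontinued(4191, 29947)   # 4191 / 34138
cocamide_dea = percent_discontinued(346, 633)          # 346 / 979
print(titanium_dioxide, cocamide_dea)
```

Both values agree with the table (12.28 and 35.34), which confirms the column is discontinued products divided by all products for that chemical.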

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

def insight2():
    insight = insight1()
    mostcommon = list(insight.index)[0:13]
    with open("chemicals.txt", 'r') as fin:
        chemList = fin.read().split('\n')

    # Keep site ingredients whose name contains, or is contained in, a common chemical name
    newList = set()
    for mcchemical in mostcommon:
        for chemical in chemList:
            if mcchemical.lower() in chemical.lower() or chemical.lower() in mcchemical.lower():
                newList.add(chemical)

    newList.remove('Retinol and Retinyl Palmitate')
    newList.remove('')  # the empty string is a substring of every name

    newDict = {}
    for chemical in newList:
        # Build the ingredient page URL from the ingredient name
        link = "https://cosmeticsinfo.org/ingredient/" + chemical.replace(" ","-").replace("(","").replace(")","").replace(",","").replace("/","")
        req = requests.get(link)
        soup = BeautifulSoup(req.content, 'html.parser')
        newDict[chemical] = {}
        paras = soup.find_all("p")

        newDict[chemical]["Definition"] = paras[1].text.replace("u00a0","").replace("u2019s","")
        if soup.find_all(class_="field__item") != [] and chemical != "Titanium Dioxide" and chemical != "Silica":
            safety = soup.find_all(class_="field__item")[1]
            newDict[chemical]["SafetyInfo"] = safety.find("p").text.replace("u00a0","").replace("u2019s","")
        elif chemical == "Titanium Dioxide":
            # This page keeps its safety text in a fixed paragraph position
            newDict[chemical]["SafetyInfo"] = paras[9].text.replace("u00a0","").replace("u2019s","")
        elif chemical == "Silica":
            newDict[chemical]["SafetyInfo"] = paras[5].text.replace("u00a0","").replace("u2019s","")
        else:
            newDict[chemical]["SafetyInfo"] = "Use Link"

        # Flag "yes" when an approval keyword appears without its negated form
        safetyinfo = newDict[chemical]["SafetyInfo"].lower()
        if ("approved" in safetyinfo and "not approved" not in safetyinfo) or ("permitted" in safetyinfo and "not permitted" not in safetyinfo) or ("allowed" in safetyinfo and "not allowed" not in safetyinfo):
            newDict[chemical]["Safe?"] = "yes"
        else:
            newDict[chemical]["Safe?"] = "Use Link"

        newDict[chemical]["Link"] = link

    df = pd.DataFrame(data=newDict)
    df = df.transpose()

    return df


############ Function Call ############
insight2()

| Chemical | Definition | SafetyInfo | Safe? | Link |
|---|---|---|---|---|
| Diethanolamine | Triethanolamine (TEA), Diethanolamine (DEA) an... | The U.S. Food and Drug Administration (FDA) in... | yes | https://cosmeticsinfo.org/ingredient/Diethanol... |
| Ethanolamine | Triethanolamine (TEA), Diethanolamine (DEA) an... | The U.S. Food and Drug Administration (FDA) in... | yes | https://cosmeticsinfo.org/ingredient/Ethanolamine |
| Retinyl Palmitate | Retinol is the primary naturally occurring for... | Retinol and retinyl palmitate are produced by ... | Use Link | https://cosmeticsinfo.org/ingredient/Retinyl-P... |
| Titanium Dioxide | Titanium dioxide is a naturally occurring mine... | FDA lists titanium dioxide as a color additive... | yes | https://cosmeticsinfo.org/ingredient/Titanium-... |
| Cocamide DEA | Cocamide DEA, Lauramide DEA, Linoleamide DEA a... | Cocamide DEA, Lauramide DEA, Linoleamide DEA a... | Use Link | https://cosmeticsinfo.org/ingredient/Cocamide-DEA |
| Mica | Mica is a naturally occurring group of silicat... | The Food and Drug Administration (FDA) lists M... | Use Link | https://cosmeticsinfo.org/ingredient/Mica |
| Triethanolamine | Triethanolamine (TEA), Diethanolamine (DEA) an... | The U.S. Food and Drug Administration (FDA) in... | yes | https://cosmeticsinfo.org/ingredient/Triethano... |
| Retinol | Retinol is the primary naturally occurring for... | Retinol and retinyl palmitate are produced by ... | Use Link | https://cosmeticsinfo.org/ingredient/Retinol |
| Silica | Silica, also called silicone dioxide, and Hydr... | The Food and Drug Administration (FDA) permits... | yes | https://cosmeticsinfo.org/ingredient/Silica |
| Talc | Talc, also known as French chalk, is powdered ... | The safety of Talc has been assessed by the Co... | yes | https://cosmeticsinfo.org/ingredient/Talc |


### Insight 2 Explanation

This insight connects the cleaned "chemicals in cosmetics" file to the web. It takes the 13 most common chemicals found in products still being sold (from insight 1) and uses their names to find matching ingredients on the cosmeticsinfo.org website. For each match it returns a definition, the safety information, and a quick safety flag based on whether keywords such as "approved", "allowed", and "permitted" appear in the safety information. When the flag reads "Use Link", read the safety information or follow the link.
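The two text rules described above, name matching and the keyword flag, can be tried in isolation without any web requests. The sketch below is a standalone illustration: the helper names `names_match` and `safety_flag` are hypothetical, and the sample sentences are made up for the example rather than taken from cosmeticsinfo.org.

```python
KEYWORDS = ("approved", "permitted", "allowed")


def names_match(dataset_name, site_name):
    """True when either name contains the other, ignoring case."""
    a, b = dataset_name.lower(), site_name.lower()
    return a in b or b in a


def safety_flag(safety_text):
    """Return "yes" if an approval keyword appears without its negated form."""
    text = safety_text.lower()
    for word in KEYWORDS:
        if word in text and ("not " + word) not in text:
            return "yes"
    return "Use Link"


# Name matching between the dataset spelling and the website spelling
print(names_match("Titanium dioxide", "Titanium Dioxide"))
print(names_match("Carbon black", "Mica"))

# Keyword flag on made-up safety sentences
print(safety_flag("This color additive is approved for use in cosmetics."))
print(safety_flag("This ingredient is not approved for use in cosmetics."))
```

Two limits of this approach are worth keeping in mind. An empty website name matches every chemical, which is why the notebook removes `''` from the matches. The flag also only recognizes the exact keyword forms, so text that says "permits" rather than "permitted" would fall through to "Use Link".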

In [None]:
import csv
import pandas as pd

def insight3():
    insight = insight1()
    mostcommon = list(insight.index)[0:8]

    with open('chemincos.csv', "r", encoding="utf8") as fin:
        readerList = [line for line in csv.reader(fin)]

    # Count, per brand (column 3), the rows for each of the most common chemicals (column 6).
    # Chemical names are truncated to 21 characters for use as column labels.
    brandDict = {}
    for line in readerList[1:]:
        counts = brandDict.setdefault(line[3], {chem[:21]: 0 for chem in mostcommon})
        if line[6] in mostcommon:
            counts[line[6][:21]] += 1

    df = pd.DataFrame(data=brandDict)
    df = df.transpose()
    df["Total_Products"] = df.sum(axis=1)
    df = df.sort_values(by="Total_Products", ascending=False)
    # Keep only brands with at least one nonzero count and show the top five
    return df[df.astype('bool').mean(axis=1) > 0].head()

############ Function Call ############
insight3()

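| Brand | Titanium dioxide | Silica, crystalline ( | Carbon black | Cocamide diethanolami | Retinyl palmitate | Retinol/retinyl ester | Talc | Mica | Total_Products |
|---|---|---|---|---|---|---|---|---|---|
| The Body Shop | 1005 | 0 | 0 | 69 | 0 | 0 | 0 | 0 | 1074 |
| Revlon | 1038 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 1054 |
| Gelish | 775 | 169 | 81 | 0 | 0 | 0 | 0 | 0 | 1025 |
| Anastasia Beverly Hills | 877 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 877 |
| Entity | 581 | 203 | 65 | 0 | 0 | 0 | 0 | 0 | 849 |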


### Insight 3 Explanation

This insight counts, for each brand, the products that contain each of the most common chemicals. A Total_Products column sums those counts per brand, and the table is sorted by that column in descending order. This shows which brands use which chemicals and which brands have the most products containing them. It helps identify brands whose ingredients deserve a closer look: a brand high on the list sells many products containing one or more of these chemicals.
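The counting behind this table can be sketched with the standard library alone. This is an illustration under stated assumptions: `count_by_brand` is a hypothetical helper, and the brand names and rows are invented for the example rather than drawn from the dataset.

```python
from collections import Counter, defaultdict


def count_by_brand(rows, chemicals):
    """Count products per brand for each chemical of interest, largest total first.

    rows is an iterable of (brand, chemical) pairs. Brands with no matching
    chemical are left out, similar to the nonzero filter in insight3.
    """
    counts = defaultdict(Counter)
    for brand, chemical in rows:
        if chemical in chemicals:
            counts[brand][chemical] += 1
    totals = {brand: sum(c.values()) for brand, c in counts.items()}
    ranked = sorted(totals, key=totals.get, reverse=True)
    return [(brand, dict(counts[brand]), totals[brand]) for brand in ranked]


# Invented sample rows: (brand, chemical)
sample_rows = [
    ("BrandA", "Titanium dioxide"),
    ("BrandA", "Talc"),
    ("BrandB", "Titanium dioxide"),
    ("BrandC", "Water"),
    ("BrandA", "Titanium dioxide"),
]
ranking = count_by_brand(sample_rows, {"Titanium dioxide", "Talc"})
for brand, per_chemical, total in ranking:
    print(brand, per_chemical, total)
```

A single pass over the rows keeps the work proportional to the number of rows, rather than rescanning the file once per brand and chemical.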

## Data Visualization

In [None]:
import pandas as pd
import plotly.express as px

def visual1():
    data = pd.read_csv('chemincos.csv')
    data = data[['PrimaryCategory', 'ChemicalName']]
    # Count rows per category, largest first
    by_category = data.groupby('PrimaryCategory').count().sort_values(by="ChemicalName", ascending=False)
    categories = by_category.reset_index()
    fig = px.bar(
        categories,
        x='PrimaryCategory',
        y='ChemicalName',
        labels={'PrimaryCategory': 'Product Category', 'ChemicalName': 'Products with Chemicals'},
        title='Number of Products with Chemicals by Category',
    )
    fig.show()

############ Function Call ############
visual1()

### Visualization Explanation

This visualization shows the number of products that contain chemicals by product category. It reveals that makeup products overwhelmingly contain chemicals compared to, for example, baby products.
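The numbers behind the bars are simple per-category counts. A rough standard-library equivalent is sketched below; `products_per_category` is a hypothetical helper, and the category labels and rows are invented for the example.

```python
from collections import Counter


def products_per_category(rows):
    """Count rows per category, skipping rows without a chemical name.

    rows is an iterable of (category, chemical) pairs. The result is a list of
    (category, count) pairs ordered from most to least common.
    """
    counts = Counter(category for category, chemical in rows if chemical)
    return counts.most_common()


# Invented sample rows: (category, chemical)
sample = [
    ("Makeup", "Titanium dioxide"),
    ("Makeup", "Carbon black"),
    ("Baby products", "Titanium dioxide"),
    ("Makeup", "Mica"),
]
category_counts = products_per_category(sample)
print(category_counts)
```

Each row counts once, so a product that lists several chemicals contributes several times to its category's bar.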

# Video Presentation
*   Video Presentation Link: <a href="https://bluejeans.com/s/YCn3Euq8Uhh/">https://bluejeans.com/s/YCn3Euq8Uhh/</a>


