# ECHA: cosmetics and fragrances

Here's what we need to find out:
- Of all the substances registered under REACH, how many are used exclusively in cosmetics (i.e. in product category 28 and/or 39 only)?
- What are the EC number and registered tonnage band of each of these substances?
- Which of these substances have an ECHA decision associated with them?
- What are the date, status and web address of each of these decisions?

# Initial list of PC 28 and 39 substances

- Used ECHA advanced search for chemicals on 1 Apr 2021
- Under 'Uses and exposure'
    - Selected 'Consumer Uses'
    - Selected categories
        - 'PC 28' perfumes, fragrances
        - 'PC 39' cosmetics, personal care products
    - Selected 'OR'
- Returned 5,821 results
- Downloaded as CSV

In [12]:
import pandas as pd
import re

In [2]:
search = pd.read_csv('search-export-28-39-1-apr-2021.csv', sep='\t', skiprows=3)

In [3]:
# Drop the last column of the export.
search.drop(columns=[search.columns[-1]], inplace=True)

In [4]:
search.head()

Unnamed: 0,Substance Name,EC Number,CAS Number,Substance Information Page,Brief Profile Page,Substance Regulatory Obligations Page
0,"''amyl nitrite'', mixed isomers",203-770-8,110-46-3,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,https://echa.europa.eu/legislation-obligation/...
1,((Methylethylene)bis(oxy))dipropanol,246-466-0,24800-44-0,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,https://echa.europa.eu/legislation-obligation/...
2,(+)-bornan-2-one,207-355-2,464-49-3,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,https://echa.europa.eu/legislation-obligation/...
3,(+)-Butyl lactate,252-036-3,34451-19-9,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,
4,(+)-L-arginine hydrochloride,214-275-1,1119-34-2,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,


In [5]:
search.drop(961, inplace=True)

In [6]:
search.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5820 entries, 0 to 5820
Data columns (total 6 columns):
 #   Column                                 Non-Null Count  Dtype 
---  ------                                 --------------  ----- 
 0   Substance Name                         5820 non-null   object
 1   EC Number                              5820 non-null   object
 2   CAS Number                             5820 non-null   object
 3   Substance Information Page             5820 non-null   object
 4   Brief Profile Page                     5820 non-null   object
 5   Substance Regulatory Obligations Page  1285 non-null   object
dtypes: object(6)
memory usage: 318.3+ KB


# Scraping for full uses and exposure data

In [63]:
import requests

from bs4 import BeautifulSoup

In [8]:
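import os

# Read the proxy API key from the environment instead of hard-coding it.
# (CRAWLERA_API_KEY is an assumed variable name, not part of the proxy's API.)
API_KEY = os.environ["CRAWLERA_API_KEY"]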

proxy_host = 'proxy.crawlera.com'
proxy_port = '8011'
proxy_auth = f'{API_KEY}:'


proxies = {"https": f"https://{proxy_auth}@{proxy_host}:{proxy_port}/",
           "http": f"http://{proxy_auth}@{proxy_host}:{proxy_port}/"}

Format: `r = requests.get(url, proxies=proxies, verify=False)`

See: https://docs.zyte.com/smart-proxy-manager-integrations.html#zyte-proxy-requests

In [9]:
urls = search[['EC Number', 'Brief Profile Page']]

## Exporting URLs into 6 CSV files

The proxy allows only one thread at a time, so splitting the URLs across files for parallel scraping is no longer necessary.

In [10]:
files = ['in' + str(n) + '.csv' for n in range(1, 7)]

In [14]:
# Write the URLs out in chunks of 1,000 rows; the last file gets the remainder.
# iloc slices that run past the end are truncated, so no special case is needed.
for n, file in zip(range(0, 5800, 1000), files):
    urls.iloc[n : n + 1000, :].to_csv(file, header=False, index=False)
    

In [15]:
!ls *.csv

cosmetics-substances-export.csv      in5.csv
dossier-evaluation-status-export.csv in6.csv
in1.csv                              non-cosmetics-substances-export.csv
in2.csv                              search-export-28-39-1-apr-2021.csv
in3.csv                              test_in.csv
in4.csv


In [16]:
!cat in1.csv | head -n 5

203-770-8,https://echa.europa.eu/brief-profile/-/briefprofile/100.003.429
246-466-0,https://echa.europa.eu/brief-profile/-/briefprofile/100.042.227
207-355-2,https://echa.europa.eu/brief-profile/-/briefprofile/100.006.688
252-036-3,https://echa.europa.eu/brief-profile/-/briefprofile/100.047.291
214-275-1,https://echa.europa.eu/brief-profile/-/briefprofile/100.012.978


## Setting up subprocesses

In [17]:
in_files = files
out_files = ['out' + str(n) + '.json' for n in range(1, 7)]
log_files = ['log' + str(n) + '.txt' for n in range(1, 7)]

In [18]:
import subprocess

In [20]:
subprocess.run(['./scrape.py', 'test_in.csv', 'test_out.json', 'test_log.txt', '0'])

CompletedProcess(args=['./scrape.py', 'test_in.csv', 'test_out.json', 'test_log.txt', '0'], returncode=0)

In [21]:
!cat test_log.txt


DEBUG:root:At row 0: AttributeError 'NoneType' object has no attribute 'find_all'
CRITICAL:root:Website not responding. Stopped at line 1
DEBUG:root:line 0: AttributeError 'NoneType' object has no attribute 'find_all'
CRITICAL:root:line 1: Website not responding
DEBUG:root:line 0: AttributeError 'NoneType' object has no attribute 'find_all'
DEBUG:root:line 2: AttributeError 'NoneType' object has no attribute 'find_all'
DEBUG:root:line 0: AttributeError 'NoneType' object has no attribute 'find_all'
DEBUG:root:line 2: AttributeError 'NoneType' object has no attribute 'find_all'
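The log above shows two kinds of failure: a lookup that returned `None` (hence the `AttributeError` on `find_all`) and the site not responding. `scrape.py` itself is not shown here, so the following is only a hypothetical sketch of how a single page fetch could be retried before giving up. The helper name `with_retries` and its parameters are assumptions, not part of the original script.

```python
import time


def with_retries(fetch, attempts=3, delay=1.0):
    """Call fetch() up to `attempts` times; return its first non-None result, else None."""
    for attempt in range(1, attempts + 1):
        try:
            result = fetch()
        except (AttributeError, OSError):
            # AttributeError: an element lookup returned None.
            # OSError: network-level failures (requests' exceptions derive from it).
            result = None
        if result is not None:
            return result
        if attempt < attempts:
            # Wait a little longer after each failed attempt.
            time.sleep(delay * attempt)
    return None
```

Here `fetch` would be a zero-argument callable that downloads and parses one brief profile page.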


# Cleaning the JSON files

In [1]:
import json

In [7]:
# Use a context manager so the file handle is closed after loading.
with open('out_commas.json') as f:
    out = json.load(f)

In [8]:
len(out.keys())

5735

In [10]:
keys = list(out.keys())

In [11]:
out[keys[0]]

{'general': ['This substance is registered under the REACH Regulation and is manufactured in and / or imported to the European Economic Area, at ≥ 10 to < 100 per annum.',
  'This substance is used by consumers, in formulation or re-packing, at industrial sites and in manufacturing.'],
 'consumer uses': ['This substance is used in the following products: leather treatment products and washing & cleaning products.',
  'Other release to the environment of this substance is likely to occur from: indoor use (e.g. machine wash liquids/detergents, automotive care products, paints and coating or adhesives, fragrances and air fresheners).'],
 'article service life': ['ECHA has no public registered data on the use of this substance in activities or processes at the workplace.',
  'ECHA has no public registered data on the routes by which this substance is most likely to be released to the environment.',
  'ECHA has no public registered data indicating whether or into which articles the substanc
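The first 'general' sentence carries the registered tonnage band, which is one of the things we need to find out. A minimal sketch of pulling it out with a regular expression follows. The pattern is an assumption based only on the sample sentence above, and other profiles may phrase the band differently.

```python
import re

# Assumed pattern: "at ≥ <number>" optionally followed by "to < <number>".
TONNAGE = re.compile(r'at (≥\s*[\d ]+(?:\s*to\s*<\s*[\d ]+)?)')


def tonnage_band(general_sentences):
    """Return the first tonnage band found in a list of sentences, or None."""
    for sentence in general_sentences:
        m = TONNAGE.search(sentence)
        if m:
            return m.group(1).strip()
    return None
```

With the data loaded above, this could be called as `tonnage_band(out[keys[0]]['general'])`.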

In [13]:
df = pd.read_json('out_commas.json')

In [15]:
df.T.head()

Unnamed: 0,general,consumer uses,article service life,widespread uses by professional workers,formulation or re-packing,uses at industrial sites,manufacture,biocidal uses
203-770-8,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[ECHA has no public registered data indicating...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,
246-466-0,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[ECHA has no public registered data indicating...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,
207-355-2,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[This substance is used in the following produ...,[This substance is used in the following produ...,[This substance is used in the following produ...,[This substance is used in the following activ...,
252-036-3,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[This substance is used in the following produ...,[This substance is used in the following produ...,[This substance is used in the following produ...,[This substance is used in the following activ...,
214-275-1,[This substance is registered under the REACH ...,[ECHA has no public registered data indicating...,[ECHA has no public registered data on the use...,[This substance is used in the following produ...,[ECHA has no public registered data indicating...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,


In [50]:
uses = df.T['consumer uses']

In [51]:
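# First sentence of 'consumer uses' for the first four substances.
# (Series.iteritems() was removed in pandas 2.0; tolist() gives the same values.)
test_uses = uses[:4].map(lambda x: x[0]).tolist()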

In [52]:
x = test_uses[0]
print(x)

This substance is used in the following products: leather treatment products and washing & cleaning products.


In [53]:
import re

In [55]:
pat = re.compile(r'[a-zA-Z ]+: (.+)')

In [56]:
m = re.match(pat, test_uses[0])

In [57]:
m

<re.Match object; span=(0, 109), match='This substance is used in the following products:>

In [60]:
test_uses2 = [re.match(pat, x).group(1).rstrip('.') for x in test_uses]

In [61]:
test_uses2

['leather treatment products and washing & cleaning products',
 'lubricants and greases and anti-freeze products',
 'perfumes and fragrances, cosmetics and personal care products, air care products, biocides (e.g. disinfectants, pest control products), polishes and waxes, washing & cleaning products, fuels and inks and toners',
 'plant protection products']
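The list comprehension above works for these four rows because each sentence contains a colon. Sentences such as 'ECHA has no public registered data ...' may not, in which case `re.match` returns `None` and `.group(1)` raises. A sketch of a tolerant version follows, using the same pattern. The function name `extract_products` is hypothetical.

```python
import re

# Same pattern as above: any run of letters and spaces, a colon, then the product list.
PRODUCTS = re.compile(r'[a-zA-Z ]+: (.+)')


def extract_products(sentence):
    """Return the text after the colon without the trailing full stop, or None if no match."""
    m = PRODUCTS.match(sentence)
    return m.group(1).rstrip('.') if m else None
```

Rows that come back as `None` can then be filtered out or inspected separately before mapping to product categories.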

In [80]:
with open('product_categories.txt', 'r') as f:
    # One category per line, so split into lines rather than keeping a single string.
    pc = f.read().splitlines()

In [73]:
pat = re.compile(r'(PC [0-9abc]{1,2}): (.+)')

In [74]:
product_categories = {}
for x in pc:
    m = re.match(pat, x)
    # Skip blank or malformed lines that do not look like "PC nn: name".
    if m:
        product_categories[m.group(2).lower()] = m.group(1)

In [75]:
product_categories.items()

dict_items([('other', 'PC 0'), ('adhesives, sealants', 'PC 1'), ('adsorbents', 'PC 2'), ('air care products', 'PC 3'), ('anti-freeze and de-icing products', 'PC 4'), ('base metals and alloys', 'PC 7'), ('biocidal products (e.g. disinfectants, pest control)', 'PC 8'), ('coatings and paints, thinners, paint removes', 'PC 9a'), ('fillers, putties, plasters, modelling clay', 'PC 9b'), ('finger paints', 'PC 9c'), ('explosives', 'PC 11'), ('fertilisers', 'PC 12'), ('fuels', 'PC 13'), ('metal surface treatment products', 'PC 14'), ('non-metal-surface treatment products', 'PC 15'), ('heat transfer fluids', 'PC 16'), ('hydraulic fluids', 'PC 17'), ('ink and toners', 'PC 18'), ('intermediate', 'PC 19'), ('products such as ph-regulators, flocculants, precipitants, neutralisation agents', 'PC 20'), ('laboratory chemicals', 'PC 21'), ('leather treatment products', 'PC 23'), ('lubricants, greases, release products', 'PC 24'), ('metal working fluids', 'PC 25'), ('paper and board treatment products', 

In [83]:
product_categories['washing and cleaning products']

'PC 35'

In [86]:
complications = [k for k in product_categories.keys()
        if ('and' in k) or (',' in k)]

In [88]:
vals = list(product_categories.values())
vals

['PC 0',
 'PC 1',
 'PC 2',
 'PC 3',
 'PC 4',
 'PC 7',
 'PC 8',
 'PC 9a',
 'PC 9b',
 'PC 9c',
 'PC 11',
 'PC 12',
 'PC 13',
 'PC 14',
 'PC 15',
 'PC 16',
 'PC 17',
 'PC 18',
 'PC 19',
 'PC 20',
 'PC 21',
 'PC 23',
 'PC 24',
 'PC 25',
 'PC 26',
 'PC 27',
 'PC 28',
 'PC 29',
 'PC 30',
 'PC 31',
 'PC 32',
 'PC 33',
 'PC 34',
 'PC 35',
 'PC 36',
 'PC 37',
 'PC 38',
 'PC 39',
 'PC 40',
 'PC 41',
 'PC 42']

In [89]:
template = pd.Series(data=len(vals) * [0], index=vals)

In [91]:
product_categories['washing & cleaning products'] = product_categories['washing and cleaning products']

In [94]:
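def get_pc(x):
    """Return a 0/1 Series over PC codes, flagging each category whose name occurs in x."""
    result = template.copy()
    # Plain substring matching: a category is flagged when its name appears in the text.
    for name, code in product_categories.items():
        if name in x:
            result[code] = 1
    return result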

In [95]:
get_pc(test_uses2[0])

PC 0     0
PC 1     0
PC 2     0
PC 3     0
PC 4     0
PC 7     0
PC 8     0
PC 9a    0
PC 9b    0
PC 9c    0
PC 11    0
PC 12    0
PC 13    0
PC 14    0
PC 15    0
PC 16    0
PC 17    0
PC 18    0
PC 19    0
PC 20    0
PC 21    0
PC 23    1
PC 24    0
PC 25    0
PC 26    0
PC 27    0
PC 28    0
PC 29    0
PC 30    0
PC 31    0
PC 32    0
PC 33    0
PC 34    0
PC 35    1
PC 36    0
PC 37    0
PC 38    0
PC 39    0
PC 40    0
PC 41    0
PC 42    0
dtype: int64
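Once each uses string is mapped to PC codes, the first question (which substances are used only in PC 28 and/or PC 39) reduces to a subset test. The sketch below uses plain Python sets rather than the indicator Series above. The phrase map is a small hypothetical sample whose phrases are taken from the example strings earlier in this notebook, and a real run would need every phrase that appears on the brief profile pages.

```python
# Hypothetical phrase-to-code map; the wording follows the brief profile text,
# which differs from the official category names (e.g. "and" instead of a comma).
PHRASE_TO_PC = {
    'perfumes and fragrances': 'PC 28',
    'cosmetics and personal care products': 'PC 39',
    'air care products': 'PC 3',
    'leather treatment products': 'PC 23',
    'washing & cleaning products': 'PC 35',
}
COSMETICS = {'PC 28', 'PC 39'}


def pc_codes(text):
    """Return the set of PC codes whose phrase occurs in the text."""
    return {code for phrase, code in PHRASE_TO_PC.items() if phrase in text}


def is_cosmetics_only(text):
    """True if at least one code was found and all of them are PC 28 or PC 39."""
    codes = pc_codes(text)
    return bool(codes) and codes <= COSMETICS
```

A phrase missing from the map is silently ignored, so a substance could be wrongly classed as cosmetics-only. Checking for unmapped phrases first would guard against that.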