# ECHA: cosmetics and fragrances

Here's what we need to find out:
- Of all the substances registered under REACH, how many are used exclusively in cosmetics (i.e. in product category 28 and/or 39 only)?
- What are the EC identification number and registered tonnage band of each of these substances?
- Which of these substances have an ECHA decision associated with them?
- What are the date, status and web address of each of these decisions?

# Initial list of PC 28 and 39 substances

- Used ECHA advanced search for chemicals on 1 Apr 2021
- Under 'Uses and exposure'
    - Selected 'Consumer Uses'
    - Selected categories
        - 'PC 28' perfumes, fragrances
        - 'PC 39' cosmetics, personal care products
    - Selected 'OR'
- Returned 5,821 results
- Downloaded as CSV

In [1]:
import pandas as pd
import re

In [2]:
echa_df = pd.read_csv('../data/search-export-28-39-1-apr-2021.csv', sep='\t', skiprows=3)

In [3]:
echa_df.drop(columns=[echa_df.columns.to_list()[-1]], inplace=True)

In [4]:
echa_df.head()

Unnamed: 0,Substance Name,EC Number,CAS Number,Substance Information Page,Brief Profile Page,Substance Regulatory Obligations Page
0,"''amyl nitrite'', mixed isomers",203-770-8,110-46-3,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,https://echa.europa.eu/legislation-obligation/...
1,((Methylethylene)bis(oxy))dipropanol,246-466-0,24800-44-0,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,https://echa.europa.eu/legislation-obligation/...
2,(+)-bornan-2-one,207-355-2,464-49-3,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,https://echa.europa.eu/legislation-obligation/...
3,(+)-Butyl lactate,252-036-3,34451-19-9,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,
4,(+)-L-arginine hydrochloride,214-275-1,1119-34-2,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,


In [5]:
echa_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5821 entries, 0 to 5820
Data columns (total 6 columns):
 #   Column                                 Non-Null Count  Dtype 
---  ------                                 --------------  ----- 
 0   Substance Name                         5821 non-null   object
 1   EC Number                              5821 non-null   object
 2   CAS Number                             5821 non-null   object
 3   Substance Information Page             5821 non-null   object
 4   Brief Profile Page                     5820 non-null   object
 5   Substance Regulatory Obligations Page  1285 non-null   object
dtypes: object(6)
memory usage: 273.0+ KB


In [6]:
echa_df[echa_df['EC Number'] == '-']

Unnamed: 0,Substance Name,EC Number,CAS Number,Substance Information Page,Brief Profile Page,Substance Regulatory Obligations Page
524,"1-ol, [[4-[(3-aminophenyl)amino]-6-chloro-1,3,...",-,-,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,
776,"2-((E)-(4-((4-(1,3-dioxo-1,3-dihydro-2H-hetere...",-,-,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,
825,2-[(8-amino-7-{[4-substituted-2-sulfonatopheny...,-,-,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,
1081,"2-propanol and 2-butanol production, distn. re...",-,-,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,
1089,"2-Propenoic acid, 2-hydroxyethyl ester, reacti...",-,-,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,
...,...,...,...,...,...,...
5590,TexFRon AG,-,-,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,
5595,Thiazol Blau,-,-,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,
5673,Tris(2-hydroxyethyl)ammonium salts of tall-oil...,-,-,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,
5769,Z-109,-,-,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,


# Cleaning the JSON files

In [7]:
import json

In [8]:
# use a context manager so the file handle is closed after loading
with open('../data/out_commas.json') as f:
    scrape = json.load(f)

In [9]:
len(scrape.keys())

5735

In [10]:
keys = list(scrape.keys())

In [11]:
scrape_df = pd.read_json('../data/out_commas.json')

In [12]:
scrape_df.T.head()

Unnamed: 0,general,consumer uses,article service life,widespread uses by professional workers,formulation or re-packing,uses at industrial sites,manufacture,biocidal uses
203-770-8,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[ECHA has no public registered data indicating...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,
246-466-0,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[ECHA has no public registered data indicating...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,
207-355-2,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[This substance is used in the following produ...,[This substance is used in the following produ...,[This substance is used in the following produ...,[This substance is used in the following activ...,
252-036-3,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[This substance is used in the following produ...,[This substance is used in the following produ...,[This substance is used in the following produ...,[This substance is used in the following activ...,
214-275-1,[This substance is registered under the REACH ...,[ECHA has no public registered data indicating...,[ECHA has no public registered data on the use...,[This substance is used in the following produ...,[ECHA has no public registered data indicating...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,


In [13]:
sum(1 for x in scrape_df.columns.to_list() if x == '-')

1

## Investigating consumer uses

In [14]:
uses = scrape_df.T['consumer uses']

In [15]:
# only keep first paragraph of uses
uses = uses.map(lambda x: x[0])

In [16]:
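def strip_prefix(x):
    """Drop the lead-in up to the first ': ' and any trailing full stop."""
    pat = re.compile(r'[a-zA-Z ]+: (.+)')
    if m := pat.match(x):
        return m.group(1).rstrip('.')
    # sentences without a 'lead-in: list' shape (e.g. "no public data") become ""
    return ""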

In [17]:
uses = uses.map(strip_prefix)

In [18]:
uses[:5]

203-770-8    leather treatment products and washing & clean...
246-466-0      lubricants and greases and anti-freeze products
207-355-2    perfumes and fragrances, cosmetics and persona...
252-036-3                            plant protection products
214-275-1                                                     
Name: consumer uses, dtype: object

## Product categories

In [19]:
with open('../product_categories.txt', 'r') as f:
    pc = f.read()

In [20]:
pat = re.compile(r'(PC [0-9abc]{1,2}): (.+)')

In [21]:
product_categories = {}
for m in re.finditer(pat, pc):
    product_categories[m.group(2).lower()] = m.group(1)

In [22]:
keys, vals = zip(*product_categories.items())
keys = list(keys)
vals = list(vals)

In [23]:
template = pd.Series(data=len(vals) * [0], index=vals)

In [24]:
product_categories['washing & cleaning products'] = product_categories['washing and cleaning products']

In [25]:
def get_pc_vec(x):
    result = template.copy()
    for k in keys:
        if k in x:
            result[product_categories[k]] = 1
    return result

In [26]:
def get_pc_list(x):
    result = []
    for k in keys:
        if k in x:
            result.append(k)
    return result

In [27]:
for n in range(5):
    print(get_pc_list(uses.iloc[n]))  # positional access; the index holds EC numbers

['leather treatment products']
[]
['air care products', 'fuels']
['plant protection products']
[]


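There are formatting differences between the "official list" and the categories scraped from the ECHA website, so plain substring matching misses categories. In the output above, the second row ('lubricants and greases and anti-freeze products') matches nothing at all.

One option is to remove all punctuation (except "-") and the word "and", and then compare each key to each row.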
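
The normalisation idea above can be sketched as follows. This is a minimal, standalone illustration rather than the approach the rest of the notebook uses: `normalise`, `find_categories` and the two-entry `sample_categories` dictionary are hypothetical names, and the example row is an assumed spelling of a scraped "consumer uses" string.

```python
import re

def normalise(text):
    """Lower-case, drop punctuation except '-', and drop the word 'and'."""
    words = re.findall(r"[a-z0-9-]+", text.lower())
    return [w for w in words if w != "and"]

def find_categories(row, categories):
    """Return the codes whose normalised name occurs as a run of words in row."""
    row_words = normalise(row)
    found = []
    for name, code in categories.items():
        key_words = normalise(name)
        n = len(key_words)
        # slide a window of n words over the row and compare with the key
        if any(row_words[i:i + n] == key_words
               for i in range(len(row_words) - n + 1)):
            found.append(code)
    return found

# Hypothetical excerpt of the official list (names as in the search form above)
sample_categories = {
    "perfumes, fragrances": "PC 28",
    "cosmetics, personal care products": "PC 39",
}

# Assumed spelling of a scraped "consumer uses" string
row = "perfumes and fragrances, cosmetics and personal care products"
print(find_categories(row, sample_categories))
```

Comparing runs of words rather than raw substrings means "perfumes, fragrances" and "perfumes and fragrances" are treated as the same category.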

## Filtering for PC 28 and 39

In [28]:
peta_keys = keys[37].split(', ')
peta_keys.extend(keys[26].split(', '))

In [29]:
x = peta_keys.pop(1)

In [30]:
peta_keys.extend(x.split(' '))

In [31]:
peta_keys

['cosmetics', 'perfumes', 'fragrances', 'personal', 'care', 'products']

In [32]:
pat = re.compile(r'[A-Za-z-]+')
re.findall(pat, uses.iloc[0])

['leather', 'treatment', 'products', 'and', 'washing', 'cleaning', 'products']

In [33]:
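def match2839(x):
    """True only if every word in x belongs to PC 28 or PC 39."""
    pat = re.compile(r'[A-Za-z-]+')
    # "and" is a connective, not a category word, so it must not cause a rejection
    words = [m for m in re.findall(pat, x) if m != 'and']
    # an empty string (no consumer-use data) is not a match
    return bool(words) and all(m in peta_keys for m in words)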

In [34]:
filt = uses.map(match2839)

In [35]:
filt2 = uses == ""

In [36]:
def any2839(x):
    pat = re.compile(r'[A-Za-z-]+')
    for m in re.findall(pat, x):
        if m in peta_keys[:-2]:
            return True
    return False

In [37]:
filt3 = uses.map(any2839)

In [38]:
sum(filt3)

2085

In [39]:
uses[filt3]

207-355-2    perfumes and fragrances, cosmetics and persona...
239-387-8    air care products, biocides (e.g. disinfectant...
218-691-4    air care products, biocides (e.g. disinfectant...
201-766-0    cosmetics and personal care products, washing ...
944-817-9    biocides (e.g. disinfectants, pest control pro...
                                   ...                        
200-412-2    cosmetics and personal care products and pharm...
224-403-8    pharmaceuticals and cosmetics and personal car...
276-333-2    air care products, biocides (e.g. disinfectant...
230-636-6                 cosmetics and personal care products
412-050-4    air care products, biocides (e.g. disinfectant...
Name: consumer uses, Length: 2085, dtype: object

## Cosmetic only products

In [40]:
filt4 = uses == 'cosmetics and personal care products'

In [41]:
uses[filt4]

813-180-8    cosmetics and personal care products
700-097-6    cosmetics and personal care products
700-185-4    cosmetics and personal care products
242-893-1    cosmetics and personal care products
204-815-4    cosmetics and personal care products
                             ...                 
261-674-1    cosmetics and personal care products
240-178-9    cosmetics and personal care products
244-955-3    cosmetics and personal care products
947-887-9    cosmetics and personal care products
230-636-6    cosmetics and personal care products
Name: consumer uses, Length: 471, dtype: object

In [42]:
sum(filt4)

471

In [43]:
# copy so that adding columns later does not trigger SettingWithCopyWarning
cosmetics = scrape_df.T[filt4].copy()

In [44]:
cosmetics.head()

Unnamed: 0,general,consumer uses,article service life,widespread uses by professional workers,formulation or re-packing,uses at industrial sites,manufacture,biocidal uses
813-180-8,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[This substance is used in the following produ...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,
700-097-6,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[This substance is used in the following produ...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,
700-185-4,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[This substance is used in the following produ...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,
242-893-1,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[This substance is used in the following produ...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,
204-815-4,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[This substance is used in the following produ...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,


In [45]:
cosmetics.iloc[0, 0]

['This substance is registered under the REACH Regulation and is manufactured in and / or imported to the European Economic Area, at ≥ 1 to < 10 per annum.',
 'This substance is used by consumers, in formulation or re-packing and in manufacturing.']

In [46]:
def tonnage(x):
    pat = re.compile(r'at (.+?) per annum.?')
    if m := re.search(pat, x):
        return m.group(1)
    else:
        return ""

In [47]:
tonnage(cosmetics.iloc[0, 0][0])

'≥ 1 to < 10'

In [48]:
cosmetics['tonnage'] = cosmetics['general'].map(lambda x: x[0]).map(tonnage)

In [49]:
cosmetics.head()

Unnamed: 0,general,consumer uses,article service life,widespread uses by professional workers,formulation or re-packing,uses at industrial sites,manufacture,biocidal uses,tonnage
813-180-8,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[This substance is used in the following produ...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,,≥ 1 to < 10
700-097-6,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[This substance is used in the following produ...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,,≥ 1 to < 10
700-185-4,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[This substance is used in the following produ...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,,≥ 1 to < 10 tonnes
242-893-1,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[This substance is used in the following produ...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,,≥ 10 to < 100
204-815-4,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[This substance is used in the following produ...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,,≥ 100 to < 1 000 tonnes


In [50]:
no_tonnage_filt = cosmetics['tonnage'] == ""

In [51]:
cosmetics.info()

<class 'pandas.core.frame.DataFrame'>
Index: 471 entries, 813-180-8 to 230-636-6
Data columns (total 9 columns):
 #   Column                                   Non-Null Count  Dtype 
---  ------                                   --------------  ----- 
 0   general                                  471 non-null    object
 1   consumer uses                            471 non-null    object
 2   article service life                     471 non-null    object
 3   widespread uses by professional workers  471 non-null    object
 4   formulation or re-packing                471 non-null    object
 5   uses at industrial sites                 471 non-null    object
 6   manufacture                              471 non-null    object
 7   biocidal uses                            6 non-null      object
 8   tonnage                                  471 non-null    object
dtypes: object(9)
memory usage: 36.8+ KB


## Investigating article service life

In [52]:
cosmetics['article service life'].iloc[0]

['ECHA has no public registered data on the use of this substance in activities or processes at the workplace.',
 'ECHA has no public registered data on the routes by which this substance is most likely to be released to the environment.',
 'ECHA has no public registered data indicating whether or into which articles the substance might have been processed.']

In [53]:
import numpy as np

In [54]:
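def no_public_data(x):
    # leave missing values (NaN padding from lists of uneven length) untouched
    if not isinstance(x, str):
        return x
    # treat ECHA's "no data" boilerplate as missing
    if x.startswith('ECHA has no public registered data'):
        return np.nan
    return x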

In [55]:
asl = cosmetics['article service life']

In [56]:
asl = (asl.transform(lambda x: pd.Series(x))
          .applymap(no_public_data)
          .dropna(how='all'))

In [57]:
asl.iloc[1, 0]

'This substance is used in the following activities or processes at workplace: transfer of chemicals, mixing in open batch processes, transfer of substance into small containers and production of mixtures or articles by tabletting, compression, extrusion or pelletisation.'

## Investigating widespread use by professional workers

In [58]:
wubpw = cosmetics['widespread uses by professional workers']

In [59]:
wubpw = wubpw.transform(lambda x: pd.Series(x))

In [60]:
wubpw = wubpw.applymap(no_public_data).dropna(how='all')

In [61]:
wubpw.info()

<class 'pandas.core.frame.DataFrame'>
Index: 253 entries, 250-342-1 to 947-887-9
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       219 non-null    object
 1   1       61 non-null     object
 2   2       206 non-null    object
 3   3       253 non-null    object
dtypes: object(4)
memory usage: 9.9+ KB


In [62]:
for n in range(5):
    print(wubpw.iloc[n, 2], end='\n\n')

This substance is used in the following activities or processes at workplace: roller or brushing applications and hand mixing with intimate contact only with personal protective equipment available.

This substance is used in the following activities or processes at workplace: transfer of chemicals, closed processes with no likelihood of exposure, closed, continuous processes with occasional controlled exposure, closed batch processing in synthesis or formulation, mixing in open batch processes, transfer of substance into small containers, production of mixtures or articles by tabletting, compression, extrusion or pelletisation and laboratory work.

This substance is used in the following activities or processes at workplace: hand mixing with intimate contact only with personal protective equipment available.

nan

This substance is used in the following activities or processes at workplace: mixing in open batch processes.



## Investigating uses at industrial sites

In [63]:
industrial = cosmetics['uses at industrial sites']

In [64]:
industrial = (industrial.transform(lambda x: pd.Series(x))
              .applymap(no_public_data))[0]

In [65]:
industrial

813-180-8                                                  NaN
700-097-6                                                  NaN
700-185-4                                                  NaN
242-893-1                                                  NaN
204-815-4                                                  NaN
                                   ...                        
261-674-1                                                  NaN
240-178-9    This substance is used in the following produc...
244-955-3                                                  NaN
947-887-9                                                  NaN
230-636-6                                                  NaN
Name: 0, Length: 471, dtype: object

In [66]:
filt1 = industrial == 'This substance is used in the following products: cosmetics and personal care products.'
filt2 = industrial.isna()

In [67]:
cos_no_industry = cosmetics[filt1 | filt2]

In [68]:
cos_no_industry.head()

Unnamed: 0,general,consumer uses,article service life,widespread uses by professional workers,formulation or re-packing,uses at industrial sites,manufacture,biocidal uses,tonnage
813-180-8,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[This substance is used in the following produ...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,,≥ 1 to < 10
700-097-6,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[This substance is used in the following produ...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,,≥ 1 to < 10
700-185-4,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[This substance is used in the following produ...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,,≥ 1 to < 10 tonnes
242-893-1,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[This substance is used in the following produ...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,,≥ 10 to < 100
204-815-4,[This substance is registered under the REACH ...,[This substance is used in the following produ...,[ECHA has no public registered data on the use...,[ECHA has no public registered data indicating...,[This substance is used in the following produ...,[ECHA has no public registered data indicating...,[This substance is used in the following activ...,,≥ 100 to < 1 000 tonnes


In [69]:
cos_no_industry.loc['-', :]

general                                    [This substance is registered under the REACH ...
consumer uses                              [This substance is used in the following produ...
article service life                       [ECHA has no public registered data on the use...
widespread uses by professional workers    [This substance is used in the following produ...
formulation or re-packing                  [This substance is used in the following produ...
uses at industrial sites                   [ECHA has no public registered data indicating...
manufacture                                [ECHA has no public registered data on the use...
biocidal uses                                                                            NaN
tonnage                                                                          ≥ 1 to < 10
Name: -, dtype: object

# Merging and exporting

- `cos_no_industry` holds the cosmetics-only substances with no other industrial use.
    - `tonnage` is kept as reported; the unit "tonnes" is sometimes missing.
    - `tonnage` is the only column we keep from `cos_no_industry`, since the other columns were only needed to filter out substances with industrial uses.
- Substances used only in perfumes and fragrances may be missing, but this is probably fewer than 30 items (not verified).
- Merge with `echa_df`.
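
Because the reported tonnage bands differ only in whether "tonnes" is present, they could be brought to one spelling before export. The following is a minimal sketch and not part of the original workflow; `normalise_tonnage` and `tonnage_bounds` are hypothetical helper names.

```python
import re

def normalise_tonnage(band):
    """Strip a trailing 'tonnes' and collapse whitespace so equal bands compare equal."""
    band = re.sub(r"\s*tonnes$", "", band.strip())
    return re.sub(r"\s+", " ", band)

def tonnage_bounds(band):
    """Return the numeric bounds of a band, e.g. (100, 1000)."""
    numbers = re.findall(r"\d[\d\s]*", normalise_tonnage(band))
    # thousands are separated by spaces in the scraped text, so remove them
    return tuple(int(re.sub(r"\s", "", n)) for n in numbers)

print(normalise_tonnage("≥ 100 to < 1 000 tonnes"))
print(tonnage_bounds("≥ 100 to < 1 000 tonnes"))
```

If applied, something like `cosmetics['tonnage'].map(normalise_tonnage)` would make the column consistent; the numeric bounds are only useful if the bands need to be sorted or compared.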

In [70]:
filt = echa_df['EC Number'] == '-'

In [71]:
final = pd.merge(echa_df[~filt], cos_no_industry[['tonnage']], left_on='EC Number', right_index=True)

In [72]:
final.head()

Unnamed: 0,Substance Name,EC Number,CAS Number,Substance Information Page,Brief Profile Page,Substance Regulatory Obligations Page,tonnage
23,(10R)-10-hydroxyoctadecanoic acid,813-180-8,5856-32-6,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,,≥ 1 to < 10
33,"(1R,2S,5R)-5-methyl-2-(propan-2-yl)-N-[2-(pyri...",700-097-6,847565-09-7,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,,≥ 1 to < 10
34,"(1R,2S,5R)-5-methyl-2-(propan-2-yl)cyclohexyl ...",700-185-4,1122460-01-8,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,,≥ 1 to < 10 tonnes
48,(2-hydroxy-3-sulphopropyl)dimethyl[3-[(1-oxodo...,242-893-1,19223-55-3,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,,≥ 10 to < 100
49,(2-hydroxyethyl)ammonium mercaptoacetate,204-815-4,126-97-6,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,,≥ 100 to < 1 000 tonnes


In [73]:
final = final.reset_index(drop=True)
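
The inner merge above silently drops any EC number that appears in only one of the two tables. One way to see what was dropped is an outer merge with `indicator=True`. This is a sketch on toy data: `search` and `tonnage` are hypothetical stand-ins for `echa_df` and `cos_no_industry[['tonnage']]`, and their rows are illustrative only.

```python
import pandas as pd

# Hypothetical stand-ins for echa_df and cos_no_industry[['tonnage']]
search = pd.DataFrame({"EC Number": ["813-180-8", "700-097-6", "203-770-8"]})
tonnage = pd.DataFrame(
    {"tonnage": ["≥ 1 to < 10", "≥ 1 to < 10", "≥ 10 to < 100"]},
    index=["813-180-8", "700-097-6", "242-893-1"],
)

# Move the EC numbers out of the index so both frames share a key column
tonnage = tonnage.rename_axis("EC Number").reset_index()

# The _merge column records whether each row came from left, right or both
checked = pd.merge(search, tonnage, on="EC Number", how="outer", indicator=True)
unmatched = checked.loc[checked["_merge"] == "right_only", "EC Number"].tolist()
print(unmatched)
```

Rows marked `right_only` are substances that passed the cosmetics filter but have no counterpart in the search export, which is what would happen to the entry indexed by `'-'`.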

In [74]:
# the context manager saves and closes the workbook on exit
with pd.ExcelWriter('cosmetics_6apr2021_1211pm.xlsx') as writer:
    final.to_excel(writer, sheet_name='Sheet1')

# Adding dossier info

In [75]:
df = pd.read_csv('../data/dossier-evaluation-status-export.csv', sep='\t', skiprows=16)
df = df.drop(columns=['Unnamed: 13'])

In [76]:
df.head()

Unnamed: 0,Substance name,Description,EC / List no,CAS no,Decision type,Scope,Status,Decision date,Decision's deadline(s),Decision,Appeal information,Dossier url,Latest update
0,Tetramethylammonium hydroxide,,200-882-9,75-59-2,TPE,Testing Proposal,Information requested,16/03/2021,23/09/2021,,,https://www.echa.europa.eu/web/guest/registrat...,18/03/2021
1,N-isopropyl-N'-phenyl-p-phenylenediamine,,202-969-7,101-72-4,CCH,Comprehensive,Information requested,16/03/2021,21/06/2023,,,https://www.echa.europa.eu/web/guest/registrat...,18/03/2021
2,"Fatty acids, C16-C18 and C18-unsatd., ME ester...",,605-143-8,158318-67-3,TPE,Testing Proposal,,,,,,https://www.echa.europa.eu/web/guest/registrat...,18/03/2021
3,Camphene,,201-234-8,79-92-5,TPE,Testing Proposal,Ongoing,,,,,https://www.echa.europa.eu/web/guest/registrat...,18/03/2021
4,Camphene,,201-234-8,79-92-5,CCH,Comprehensive,Ongoing,,,,,https://www.echa.europa.eu/web/guest/registrat...,18/03/2021


In [77]:
df.columns

Index(['Substance name', 'Description', 'EC / List no', 'CAS no',
       'Decision type', 'Scope', 'Status', 'Decision date',
       'Decision's deadline(s)', 'Decision', 'Appeal information ',
       'Dossier url', 'Latest update'],
      dtype='object')

In [78]:
dossier = pd.merge(final, df, left_on='EC Number', right_on='EC / List no')

In [79]:
dossier.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 80 entries, 0 to 79
Data columns (total 20 columns):
 #   Column                                 Non-Null Count  Dtype 
---  ------                                 --------------  ----- 
 0   Substance Name                         80 non-null     object
 1   EC Number                              80 non-null     object
 2   CAS Number                             80 non-null     object
 3   Substance Information Page             80 non-null     object
 4   Brief Profile Page                     80 non-null     object
 5   Substance Regulatory Obligations Page  17 non-null     object
 6   tonnage                                80 non-null     object
 7   Substance name                         80 non-null     object
 8   Description                            1 non-null      object
 9   EC / List no                           80 non-null     object
 10  CAS no                                 80 non-null     object
 11  Decision type        

In [80]:
dossier.head()

Unnamed: 0,Substance Name,EC Number,CAS Number,Substance Information Page,Brief Profile Page,Substance Regulatory Obligations Page,tonnage,Substance name,Description,EC / List no,CAS no,Decision type,Scope,Status,Decision date,Decision's deadline(s),Decision,Appeal information,Dossier url,Latest update
0,"(1R,2S,5R)-5-methyl-2-(propan-2-yl)-N-[2-(pyri...",700-097-6,847565-09-7,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,,≥ 1 to < 10,"(1R,2S,5R)-5-methyl-2-(propan-2-yl)-N-[2-(pyri...",,700-097-6,847565-09-7,CCH,Targeted,Concluded,,,,,https://www.echa.europa.eu/web/guest/registrat...,13/10/2018
1,(2-hydroxyethyl)ammonium mercaptoacetate,204-815-4,126-97-6,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,,≥ 100 to < 1 000 tonnes,(2-hydroxyethyl)ammonium mercaptoacetate,,204-815-4,126-97-6,TPE,Testing Proposal,Concluded,,,,,https://www.echa.europa.eu/web/guest/registrat...,09/03/2020
2,(2-hydroxyethyl)ammonium mercaptoacetate,204-815-4,126-97-6,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,,≥ 100 to < 1 000 tonnes,(2-hydroxyethyl)ammonium mercaptoacetate,,204-815-4,126-97-6,CCH,Comprehensive,Information requested,07/12/2018,"16/12/2019""14/06/2022",https://www.echa.europa.eu/documents/10162/e49...,,https://www.echa.europa.eu/web/guest/registrat...,02/04/2019
3,"(3R,6R)-3,6-dimethyl-1,4-dioxane-2,5-dione",603-436-5,13076-17-0,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,,≥ 1 000 to < 10 000 tonnes,"(3R,6R)-3,6-dimethyl-1,4-dioxane-2,5-dione",,603-436-5,13076-17-0,CCH,Targeted,Concluded,,,,,https://www.echa.europa.eu/web/guest/registrat...,13/10/2018
4,"(3R,6R)-3,6-dimethyl-1,4-dioxane-2,5-dione",603-436-5,13076-17-0,https://echa.europa.eu/substance-information/-...,https://echa.europa.eu/brief-profile/-/briefpr...,,≥ 1 000 to < 10 000 tonnes,"(3R,6R)-3,6-dimethyl-1,4-dioxane-2,5-dione",,603-436-5,13076-17-0,CCH,Targeted,Concluded,27/07/2015,03/02/2016,https://www.echa.europa.eu/documents/10162/119...,,https://www.echa.europa.eu/web/guest/registrat...,04/10/2018
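
The date columns in this export are day-first strings, and the `Decision's deadline(s)` cell in row 2 above holds two dates run together. A minimal sketch for pulling every date out of such a cell, assuming all dates follow the dd/mm/yyyy shape seen in the output (`parse_dates` is a hypothetical helper):

```python
import re
from datetime import date, datetime

def parse_dates(cell):
    """Return every dd/mm/yyyy date found in a cell as a list of date objects."""
    # empty cells arrive as NaN (a float), so anything non-string yields no dates
    if not isinstance(cell, str):
        return []
    found = re.findall(r"\d{2}/\d{2}/\d{4}", cell)
    return [datetime.strptime(d, "%d/%m/%Y").date() for d in found]

print(parse_dates('16/12/2019""14/06/2022'))
print(parse_dates(float("nan")))
```

Returning a list keeps both deadlines; taking `max(...)` of a non-empty list would give the latest one.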


# Recent dossier info

In [82]:
dossier_new = pd.read_excel('../evaluation_report_2020_en.xlsx')

In [83]:
dossier_new.head()

Unnamed: 0,Public Name,EC Number,CAS Number,Evaluation Type,Status,Concluded on,Information requested,Deadline(s) for data
0,Cysteine hydrochloride,200-157-7,52-89-1,CCH,Concluded,2020-06-23 00:00:00,no further info requested,
1,Nicotine sulphate,200-606-7,65-30-5,CCH,Follow-up,2020-05-29 00:00:00,1. In vitro gene mutation study in bacteria (A...,2020-12-07 00:00:00
2,"1,1-difluoroethane",200-866-1,75-37-6,CCH,Concluded,2020-02-07 00:00:00,no further info requested,
3,"3a,4,7,7a-tetrahydro-4,7-methanoindene",201-052-9,77-73-6,CCH & TPE,Wait for follow-up,2020-05-26 00:00:00,Under CCH:\n1. Sub-chronic toxicity study (90-...,31/08/2021 & 31/08/2022
4,Pentaerithrityl tetranitrate,201-084-3,78-11-5,CCH,Concluded,2020-12-07 00:00:00,no further info requested,


In [84]:
dossier_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 434 entries, 0 to 433
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Public Name            434 non-null    object
 1   EC Number              434 non-null    object
 2   CAS Number             434 non-null    object
 3   Evaluation Type        434 non-null    object
 4   Status                 434 non-null    object
 5   Concluded on           434 non-null    object
 6   Information requested  434 non-null    object
 7   Deadline(s) for data   309 non-null    object
dtypes: object(8)
memory usage: 27.2+ KB


In [85]:
pat = re.compile(r'\d{1,3}-\d{1,3}-\d{1,3}')
sum(dossier_new['EC Number'].str.match(pat))

434
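
The pattern above only checks the start of each string and accepts one to three digits per group. EC and list numbers have the shape NNN-NNN-R, where R is a check digit: the weighted sum of the first six digits (weights 1 to 6) modulo 11. A stricter check could look like the sketch below; `is_valid_ec` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re

EC_PATTERN = re.compile(r"(\d{3})-(\d{3})-(\d)")

def is_valid_ec(number):
    """True if number has the NNN-NNN-R shape and a correct check digit."""
    m = EC_PATTERN.fullmatch(number)
    if m is None:
        return False
    digits = [int(c) for c in m.group(1) + m.group(2)]
    # weight the six digits 1..6, sum them, and take the remainder mod 11
    checksum = sum(pos * d for pos, d in enumerate(digits, start=1)) % 11
    return checksum == int(m.group(3))

print(is_valid_ec("203-770-8"))  # first EC number in the search export
print(is_valid_ec("-"))          # placeholder used when no EC number exists
```

With pandas this could be applied as `dossier_new['EC Number'].map(is_valid_ec).sum()`.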

In [86]:
merge_new = pd.merge(dossier_new, cos_no_industry[['tonnage']], left_on='EC Number', right_index=True)

In [87]:
merge_new

Unnamed: 0,Public Name,EC Number,CAS Number,Evaluation Type,Status,Concluded on,Information requested,Deadline(s) for data,tonnage
50,(2-hydroxyethyl)ammonium mercaptoacetate,204-815-4,126-97-6,TPE,Concluded,2020-03-05 00:00:00,no further info requested,,≥ 100 to < 1 000 tonnes
135,Sodium 2-sulphonatoethyl laurate,230-949-8,7381-01-3,CCH,Wait for follow-up,2020-05-27 00:00:00,1. In vitro gene mutation study in bacteria (A...,2023-12-04 00:00:00,≥ 10 000 to < 100 000 tonnes
211,"Mercaptoacetic acid, monoester with propane-1,...",250-264-8,30618-84-9,CCH,Wait for follow-up,2020-08-05 00:00:00,1. In vitro gene mutation study in bacteria (A...,2021-05-13 00:00:00,≥ 1 to < 10
264,"Fatty acids, C12-18 and C18-unsatd., 2-sulfoet...",287-024-7,85408-62-4,CCH,Wait for follow-up,2020-05-27 00:00:00,1. In vitro gene mutation study in bacteria (A...,2023-09-04 00:00:00,≥ 10 000 to < 100 000
267,"Butanedioic acid, sulfo-, 1-C12-18-alkyl ester...",290-836-4,90268-36-3,CCH,Wait for follow-up,2020-03-26 00:00:00,"1. Sub-chronic toxicity study (90-day), oral r...",2023-01-03 00:00:00,≥ 100 to < 1 000 tonnes
313,"Alcohols, C16-18 and C18-unsatd., ethoxylated,...",500-345-1,157627-95-7,CCH,Wait for follow-up,2020-12-17 00:00:00,1. In vitro gene mutation study in bacteria (A...,2023-03-24 00:00:00,≥ 100 to < 1 000 tonnes
344,Olaflur,911-915-8,NS,CCH,Concluded,2020-10-15 00:00:00,no further info requested,,≥ 100 to < 1 000


In [88]:
merge_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 50 to 344
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Public Name            7 non-null      object
 1   EC Number              7 non-null      object
 2   CAS Number             7 non-null      object
 3   Evaluation Type        7 non-null      object
 4   Status                 7 non-null      object
 5   Concluded on           7 non-null      object
 6   Information requested  7 non-null      object
 7   Deadline(s) for data   5 non-null      object
 8   tonnage                7 non-null      object
dtypes: object(9)
memory usage: 560.0+ bytes


In [89]:
# the context manager saves and closes the workbook on exit
with pd.ExcelWriter('cosmetics_eval_report_7apr2021.xlsx') as writer:
    merge_new.to_excel(writer, sheet_name='Sheet1')

# 2018 - 2021 dossier info

In [90]:
df = pd.read_csv('../data/dossier-evaluation-status-export.csv', sep='\t', skiprows=16)

df = df.drop(columns=['Unnamed: 13'])