# Let's remind ourselves how web scraping works.

## This series looks like a decent resource:
https://www.scrapehero.com/a-beginners-guide-to-web-scraping-part-1-the-basics/
The actual scraping content doesn't start until part 2, though.

### Except I'm going to use the [requests](https://3.python-requests.org/) library instead of urllib.
So maybe this series is a better resource: https://www.digitalocean.com/community/tutorials/how-to-work-with-web-data-using-requests-and-beautiful-soup-with-python-3


Let's start simple, with a single long-ass page of ingredients:
https://www.vrg.org/ingredients/index.php

This looks short enough that we can write these data into a csv or json file directly.
Then we can import that into a database at a later time.

In [None]:
import requests

In [1]:
from bs4 import BeautifulSoup

In [2]:
import re
import pandas as pd

In [202]:
from typing import List, Dict

In [20]:
url = 'http://www.vrg.org/ingredients'

In [21]:
r = requests.get(url)

In [22]:
r.status_code

406

In [11]:
r

<Response [406]>

In [12]:
# Oh. Umm... let's look that up
"The server has found a resource matching the Request-URI, but not one that satisfies the conditions identified by the Accept and Accept-Encoding request headers. Unless it was a HEAD request, the response should include an entity containing a list of resource characteristics and locations from which the user or user agent can choose the one most appropriate. The entity format is specified by the media type given in the Content-Type header field. Depending upon the format and the capabilities of the user agent, selection of the most appropriate choice may be performed automatically."



Found [this stack overflow post](https://stackoverflow.com/questions/38624561/error-in-open-connectionx-rb-http-error-406)

If we use curl and specify the user-agent, we should be able to move forward

In [40]:
# Okay, this is an issue with the user-agent.
# I'll copy the info from my web browser and see if that works.

In [31]:
r = requests.get('https://www.vrg.org/ingredients/index.php', 
                 headers={'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"})


In [32]:
r

<Response [200]>

In [33]:
r.status_code

200
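The 406 went away once we sent a browser-like User-Agent, so it may be worth wrapping that in a small helper. This is only a sketch: `fetch_html` and its `get` parameter are hypothetical names, and `get` is assumed to behave like `requests.get` (it is passed in so the helper can be tried without touching the network).

```python
from typing import Callable


def fetch_html(url: str, get: Callable, user_agent: str, timeout: float = 30.0) -> str:
    """Fetch a page with an explicit User-Agent and return its text.

    `get` is assumed to behave like `requests.get`: it accepts `headers` and
    `timeout` keyword arguments and returns an object that has
    `raise_for_status()` and `.text`.
    """
    response = get(url, headers={"User-Agent": user_agent}, timeout=timeout)
    # Raise on 4xx/5xx (such as the 406 above) instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With requests imported, the call would look like `fetch_html('https://www.vrg.org/ingredients/index.php', requests.get, user_agent=...)`, using the same browser string as above.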

In [None]:
# 200 means we have some data.

In [35]:
len(r.text)

348291

In [41]:
# Let's look at a snippet of the full text
r.text[10000:12000]

'p>\n\n<p><b><i>More on Definitions</i></b><br />\nIt is a tedious undertaking to classify the sources of food ingredients for these five reasons:</p>\n<ol>\n\n<li>Ingredients can be composed of multiple parts where each part may be derived from a different source. The common preservative, sodium benzoate, is an example. It contains both mineral (sodium) and synthetic (benzoate) parts. In these cases, both (or all, if more than two are present) sources are listed.</li>\n\n<li>Processing aids, used during the commercial processing of an ingredient, may be unknown or vary from manufacturer to manufacturer. A common example is cattle bone char used to decolorize cane sugar. Consumers can inquire about processing aids when in doubt. In many cases, manufacturers do not have to list processing aids on food labels. Only careful research may reveal their presence. Manufacturers may call them &quot;proprietary.&quot;</li>\n\n<li>&quot;<i>Synthetic</i>&quot; ingredients may contain components de

In [38]:
# Now let's read this into Beautiful Soup

In [39]:
soup = BeautifulSoup(r.text, 'html.parser')

### Now let's go to the webpage and poke at the formatting and what tags we need to mark.
Looking at the source, it looks like the ingredient names are stored in `<h2>` tags.
The sub-categories are stored in `<strong>` tags, with the text for that sub-category following the tag.
Finally, the vegan/vegetarian/non-vegetarian sourcing is stored in `<em>` tags.

## Looking at this formatting, it actually makes more sense to write a manual parser for the text we're interested in!
Because the data is all jumbled into one big page, it will be easier to parse just the ingredients section.
Then we can read the data into a pandas dataframe, which we can use to output to a csv, or json, or whatever.

In [55]:
# Let's use the strong tag to find all the sub-categories on the page
set(soup.find_all('strong'))

{<strong>Additional Information:</strong>,
 <strong>Additional Information</strong>,
 <strong>Additional information:</strong>,
 <strong>Also known as:</strong>,
 <strong>Also known as</strong>,
 <strong>Also known</strong>,
 <strong>Alternate Names</strong>,
 <strong>Alternate names for dicalcium phosphate</strong>,
 <strong>Alternate names for monocalcium phosphate</strong>,
 <strong>Alternate names for tricalcium phosphate</strong>,
 <strong>Alternate names:</strong>,
 <strong>Alternate names</strong>,
 <strong>Alternative name</strong>,
 <strong>Alternative names</strong>,
 <strong>Commercial Source</strong>,
 <strong>Commercial Sources</strong>,
 <strong>Commercial source</strong>,
 <strong>Common Examples</strong>,
 <strong>Definition added:</strong>,
 <strong>Definition</strong>,
 <strong>Entry Added Updated</strong>,
 <strong>Entry Updated</strong>,
 <strong>Entry added</strong>,
 <strong>Entry updated</strong>,
 <strong>Example</strong>,
 <strong>Examples:</strong>,
 <strong>E

Oh...there are inconsistencies. That's to be expected of what appears to be an ongoing manual update process done by a poor non-profit. (Thank you interns!)
There are few enough categories that we can classify these manually.
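Many of the variants above differ only in capitalisation or a trailing colon, so normalising the labels first would shrink the manual list. A minimal sketch, where `normalize_label` is a hypothetical helper (it is not used in the cells below):

```python
def normalize_label(label: str) -> str:
    """Lower-case a <strong> label and drop surrounding whitespace and a trailing colon."""
    return label.strip().rstrip(':').strip().lower()


labels = [
    'Additional Information:',
    'Additional Information',
    'Additional information:',
    'Commercial Source',
    'Commercial source',
]

# Five raw labels collapse to two normalised ones.
unique = sorted({normalize_label(label) for label in labels})
print(unique)  # ['additional information', 'commercial source']
```

Variants such as 'Commercial Sources' or 'Also known' would still need a manual mapping, so this reduces the list rather than replacing it.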

In [269]:
# Store both the column name we'll use, and the possible tags that will map to that column.

strong_categories = {
    'additional_information': ['Additional Information:',
                               'Additional Information',
                               'Additional information:'],
    'also_known_as': ['Also known as:',
                      'Also known as',
                      'Also known'],
    'alternate_name': ['Alternate Names',
                       'Alternate names for dicalcium phosphate',
                       'Alternate names for tricalcium phosphate',
                       'Alternate names:',
                       'Alternate names',
                       'Alternative name',
                       'Alternative names'],
    'commercial_source': ['Commercial Source',
                          'Commercial Sources',
                          'Commercial source'],
    'common_example': ['Common Examples'],
    'definition': ['Definition'],
    'entry_updated': ['Definition added:',
                      'Entry Added Updated',
                      'Entry Updated',
                      'Entry added',
                      'Entry updated'],
    'example': ['Example',
                'Examples:',
                'Examples'],
    'exists_in': ['Exists in', 'Exists',
                  'Found in'],
    'manufacturer': ['Major Manufacturer', 'Major Manufacturers',
                     'Manufacturers:', 'Manufacturers'],
    'naturally_present_in': ['Naturally present in',
                             'Naturally present'],
    'use': ['Used as a', 'Used as', 'Used for', 'Used in', 'Used on'],
    'more_information': ['More information:'],
}

# "<strong>Ingredients:</strong>" and "Further information" were left out because they're not part of the run-down of ingredients.
# "More information:" links to other entries.
# Note: 'Alternate names for monocalcium phosphate' shows up in the set above but isn't mapped here yet.

# This may be better as a class than a dict...but now I'm committed.


In [270]:
# I built that dict in the wrong direction for lookups, so invert it: label -> column name.

strong_cat_inv = {}

for k, v in strong_categories.items():
    for list_item in v:
        strong_cat_inv[list_item] = k



In [265]:
strong_cat_inv

{'Additional Information:': 'additional_information',
 'Additional Information': 'additional_information',
 'Additional information:': 'additional_information',
 'Also known as:': 'also_known_as',
 'Also known as': 'also_known_as',
 'Also known': 'also_known_as',
 'Alternate Names': 'alternate_name',
 'Alternate names for dicalcium phosphate': 'alternate_name',
 'Alternate names for tricalcium phosphate': 'alternate_name',
 'Alternate names:': 'alternate_name',
 'Alternate names': 'alternate_name',
 'Alternative name': 'alternate_name',
 'Alternative names': 'alternate_name',
 'Commercial Source': 'commercial_source',
 'Commercial Sources': 'commercial_source',
 'Commercial source': 'commercial_source',
 'Common Examples': 'common_example',
 'Definition': 'definition',
 'Definition added:': 'entry_updated',
 'Entry Added Updated': 'entry_updated',
 'Entry Updated': 'entry_updated',
 'Entry added': 'entry_updated',
 'Entry updated': 'entry_updated',
 'Example': 'example',
 'Examples:': 

In [56]:
soup.find_all('h2')[:10]

[<h2 id="acesulfamek">acesulfame K</h2>,
 <h2 id="acetic_acid">acetic acid</h2>,
 <h2 id="acidcasein">acid casein</h2>,
 <h2 id="acidulant">acidulant</h2>,
 <h2 id="acrylicacid">acrylic acid</h2>,
 <h2 id="activatedcarbon">activated carbon</h2>,
 <h2 id="adipicacid">adipic acid</h2>,
 <h2 id="agar">agar</h2>,
 <h2 id="agaragar">agar-agar</h2>,
 <h2 id="alanine">alanine</h2>]

In [57]:
soup.find_all('strong')[:10]

[<strong>Also known as</strong>,
 <strong>Commercial source</strong>,
 <strong>Used in</strong>,
 <strong>Definition</strong>,
 <strong>Commercial source</strong>,
 <strong>Exists in</strong>,
 <strong>Used in</strong>,
 <strong>Definition</strong>,
 <strong>Commercial source</strong>,
 <strong>Used in</strong>]

In [58]:
soup.find_all('em')[:10]

[<em>Vegan</em>,
 <em>Vegan</em>,
 <em>Vegetarian</em>,
 <em>Typically Vegetarian</em>,
 <em>Vegan</em>,
 <em>May be Non-Vegetarian</em>,
 <em>May be Non-Vegetarian</em>,
 <em>Vegan</em>,
 <em>Typically Vegetarian</em>,
 <em>Vegetarian</em>]

In [None]:
# Categories, from the source of the webpage
<li><a href="#vegetarian">Vegetarian</a></li>
<li><a href="#vegan"><span class="ingredient">Vegan</span></a></li>
<li><a href="#non_vegetarian"><span class="ingredient">Non-Vegetarian</span></a></li>
<li><a href="#typically_vegetarian"><span class="ingredient">Typically Vegetarian</span></a></li>
<li><a href="#typically_vegan"><span class="ingredient">Typically Vegan</span></a></li>
<li><a href="#maybe_non_vegetarian"><span class="ingredient">May be Non-Vegetarian</span></a></li>
<li><a href="#typically_non_vegetarian"><span class="ingredient">Typically Non-Vegetarian</span></a></li>
</ul>
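Since the page lists its own categories, we could check each `<em>` value against that list and flag anything unexpected (for example, `<em>` used for plain emphasis). A small sketch; `VEG_CATEGORIES` and `is_known_category` are hypothetical names, and the set is copied from the list above.

```python
# The seven categories listed in the page source above.
VEG_CATEGORIES = {
    'Vegetarian',
    'Vegan',
    'Non-Vegetarian',
    'Typically Vegetarian',
    'Typically Vegan',
    'May be Non-Vegetarian',
    'Typically Non-Vegetarian',
}


def is_known_category(text: str) -> bool:
    """Return True if the text (ignoring surrounding whitespace) is a listed category."""
    return text.strip() in VEG_CATEGORIES


print(is_known_category('Typically Vegan'))  # True
print(is_known_category('Vegn'))             # False
```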

In [59]:
# Okay, so we need to pull out just the text we're interested in.
r.text.find('<h2 id="acesulfamek"')

20607

In [71]:
r.text.find('Additional information about zein:')

250240

In [130]:
r.text[20600:20700]

'\n\n\n\n\t\t\t<h2 id="acesulfamek">acesulfame&nbsp;K</h2>\n\t\t\t<strong>Also known as</strong>: acesulfame pot'

In [73]:
r.text[250240:250270]

'Additional information about z'

In [137]:
raw_html = r.text[20607:250241]

In [140]:
raw_html[-500:]

'\n<!--<em>Typically Vegan</em>-->\n<br />\n\t\t\t<em>Typically Vegan</em>\n<br />\n\t\t\t<strong>Entry added</strong>: August 2014<br />\n<p class="smaller">\n<a href="#copyright">Copyright Information</a></p>\n<p><a href="#top">top</a></p>\n\n\t\t\t<h2 id="zein">zein</h2>\n\t\t\t<strong>Commercial source</strong>: vegetable.<br />\n\t\t\t<strong>Used in</strong>: nuts, grain products, confections.<br />\n\t\t\t<strong>Definition</strong>: A corn protein which functions as a coating or glaze.<br />\n\t\t\t<em>Vegan</em>\n\n<br />\nA'

In [160]:
len(raw_html)

229634
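The offsets 20607 and 250241 will shift whenever the page changes, so it may be safer to slice on the markers themselves. A sketch with a hypothetical `slice_between` helper; unlike the slice above, it stops right before the end marker rather than one character into it.

```python
def slice_between(text: str, start_marker: str, end_marker: str) -> str:
    """Return text from start_marker (inclusive) up to end_marker (exclusive)."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    # Search for the end marker only after the start marker.
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return text[start:end]


sample = 'header <h2 id="a">a</h2> body Additional information about zein: tail'
print(slice_between(sample, '<h2 id="a"', 'Additional information about zein:'))
```

On the real page this would be called as `slice_between(r.text, '<h2 id="acesulfamek"', 'Additional information about zein:')`.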

### We'll now start to parse through the raw_html.

In [169]:
# Replace \t with ''
raw_html_2 = raw_html.replace('\t', '')

# Split on \n into list
text_list_1 = raw_html_2.split('\n')

In [170]:
len(text_list_1)

3681

In [211]:
text_list_1[:500]

['<h2 id="acesulfamek">acesulfame&nbsp;K</h2>',
 '<strong>Also known as</strong>: acesulfame potassium, Sunette.<br />',
 '<strong>Commercial source</strong>: synthetic<br />',
 '<strong>Used in</strong>: dry beverage mixes, canned fruit, chewing gum.<br />',
 '<strong>Definition</strong>: A low-calorie sweetener.<br />',
 '<em>Vegan</em>',
 '',
 '<p class="smaller">',
 '<a href="#copyright">Copyright Information</a></p>',
 '<p><a href="#top">top</a></p>',
 '',
 '<h2 id="acetic_acid">acetic&nbsp;acid</h2>',
 '<strong>Commercial source</strong>: vegetable<br />',
 '<strong>Exists in</strong>: many fruits and plants, in milk, and in synthetic form.<br />',
 '<strong>Used in</strong>: catsup, mayonnaise, and pickles.<br />',
 '<strong>Definition</strong>: Common preservative and flavoring agent which is the principal ingredient of vinegar.<br />',
 '<em>Vegan</em>',
 '',
 '<!--<p class="smaller"><a href="#copyright">Copyright Information</a></p>-->',
 '<p class="smaller">',
 '<a href="#co

In [165]:
BeautifulSoup('acesulfame&nbsp;K', 'html.parser')

acesulfame K
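One thing to watch: decoding `&nbsp;` gives a non-breaking space (U+00A0), which prints like a space but doesn't compare equal to one, and `BeautifulSoup(...)` returns a soup object rather than a plain string. A sketch of an alternative using the standard library's `html.unescape` instead of Beautiful Soup; `clean_name` is a hypothetical helper.

```python
import html


def clean_name(raw: str) -> str:
    """Decode HTML entities and turn non-breaking spaces into plain spaces."""
    return html.unescape(raw).replace('\xa0', ' ').strip()


print(clean_name('acesulfame&nbsp;K'))                    # acesulfame K
print(clean_name('acesulfame&nbsp;K') == 'acesulfame K')  # True
```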

In [None]:
# Let's write a quick function to parse this list, and we'll iterate through.
# Let's write this out into a csv file.
# Use strong_cat_inv to map the <strong> entries

# Hmm, there's a "Product information" section as well (not handled here yet).

def parse_line(s: str):
    """Return a (column_name, value) tuple for lines we care about, else None."""
    if s.startswith('<h2'):
        # Capture everything between the first '>' and the last '<'
        ingredient = re.search('>(.+)<', s).group(1)
        return ('ingredient', ingredient)

    elif s.startswith('<strong>'):
        # The label sits between the <strong> tags (non-greedy, so it stops at the first </strong>)
        strong_category = re.search('<strong>(.+?)</strong>', s).group(1)
        # The value follows "</strong>: " and runs up to the last '<'
        strong_value = re.search('</strong>: (.+)<', s).group(1)
        # Look up the category name to assign to column
        strong_cat_name = strong_cat_inv[strong_category]
        return (strong_cat_name, strong_value)

    elif s.startswith('<em>'):
        v_category = re.search('>(.+)<', s).group(1)
        return ('veg_category', v_category)

    return None

In [None]:
# Hmm... we don't just want to go line by line. 
# We want to go from the <h2> line with the ingredient
# Then read in the other things
# And stop when we see <p class="smaller">

# We could try... building up a dictionary?
# key = h2 ingredient name, values are the different categories?
# Or...one dictionary, with name as 
# 
# Keep track of ingredient name as we go.

In [233]:
# def parse_text(text_list: List):
#     d = {}
#     for line in text_list:
#         if line.startswith('<h2'):
#             ingredient = re.search('>.+<', line).group(0)[1:-1]
            
#         elif line.startswith('<strong>'):
#             # REad between > and <
#             strong_category = re.search('>.+</', line).group(0)[1:-2]
#             print(f"{strong_category}")
#             strong_value = re.search('>:.+<', line).group(0)[2:-1]
#             # Look up the category name to assign to column
#             strong_cat_name = strong_cat_inv[strong_category]
            
#             d[ingredient] = {strong_cat_name: strong_value}

#         elif line.startswith('<em>'):
#             v_category = re.search('>.+<', line).group(0)[1:-1]
#             d[ingredient] = {'veg_category': v_category}
            
#         elif line.startswith('<br /><br /><u>Product information'):
#             d[ingredient] = {'product_information': line.split('</u>: ')[1]}
            
#     return d

In [234]:
# Dictionary of INGredients
ding = parse_text(text_list_1)

Also known as
Commercial source
Used in
Definition
Commercial source
Exists in
Used in
Definition
Commercial source
Used in
Definition
Commercial source
Examples
Used in
Definition
Also known as
Commercial source
Used in
Definition
Commercial source
Used in
Definition
Also known as
Commercial source
Exists in
Used in
Definition
Also known as
Commercial source
Used in
Definition
Commercial source
Exists in
Used in
Definition
Commercial source
Used in
Definition
Commercial source
Examples
Used in
Definition
Commercial source
Used in
Definition
Commercial source
Used in
Definition
Also known as
Commercial source
Definition
Commercial source
Examples
Used in
Definition
Commercial source
Used in
Definition
Also known as
Commercial source
Used in
Definition
Also known as
Commercial source
Examples
Used in
Used for
Definition
Commercial source
Exists in
Examples
Used in
Definition
Also known as
Commercial source
Used in
Definition
Commercial source
Exists in
Used in
Definition
Commercial sour

KeyError: 'Also known as</strong>: <i>n'
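That KeyError is what greedy matching produces: `.+` runs to the last `</` on the line, and this line has an `<i>n</i>` after the label. A sketch comparing the greedy pattern from `parse_text` with a non-greedy capture group, using the same line that is tested further down:

```python
import re

line = '<strong>Also known as</strong>: <i>n</i>-butyric acid, butanoic acid.<br />'

# Greedy: '.+' swallows everything up to the LAST '</' on the line.
greedy = re.search(r'>.+</', line).group(0)[1:-2]

# Non-greedy: '.+?' stops at the FIRST '</strong>'.
lazy = re.search(r'<strong>(.+?)</strong>', line).group(1)

print(repr(greedy))  # 'Also known as</strong>: <i>n'
print(repr(lazy))    # 'Also known as'
```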

In [283]:
text_list = text_list_1

In [284]:
len(text_list)

3681

In [314]:
d = {}
dd={}
for i, line in enumerate(text_list):
    if line.startswith('<h2'):
        ing = re.search('>.+<', line).group(0)[1:-1]
        ingredient = BeautifulSoup(ing, 'html.parser')
        
    elif line.startswith('<strong>'):
        # Hmm, this doesn't seem to be working...I think it might be overwriting the dictionary entry?
        try:
            # Read between > and <
            strong_category = re.search('strong>.+</strong', line).group(0)[7:-8]
            # read the value of the category
            strong_value = re.search('/strong>:.+<', line).group(0)[10:-1]
            # Look up the category name to assign to column
            strong_cat_name = strong_cat_inv[strong_category]
#             print(strong_cat_name)
            
            # I thought this would work...?
            d[ingredient] = {strong_cat_name: strong_value}
        except:
            print(f"Skipping category for row {i}")

    elif line.startswith('<em>'):
        v_category = re.search('>.+</', line).group(0)[1:-2]
        d[ingredient] = {'veg_category': v_category}

    elif line.startswith('<br /><br /><u>Product information'):
        try:
            d[ingredient] = {'product_information': line.split('</u>: ')[1]}
        except:
            print(f"Skipping product info for entry {i}")

#     print(i)

Skipping category for row 524
Skipping category for row 567
Skipping product info for entry 729
Skipping category for row 1665
Skipping category for row 2048
Skipping category for row 2574


In [313]:
d

{acesulfame K: {'veg_category': 'Vegan'},
 acetic acid: {'veg_category': 'Vegan'},
 acid casein: {'veg_category': 'Vegetarian'},
 acidulant: {'veg_category': 'Typically Vegetarian'},
 acrylic acid: {'veg_category': 'Vegan'},
 activated carbon: {'veg_category': 'May be Non-Vegetarian'},
 adipic acid: {'product_information': 'DuPont Chemicals, a manufacturer of adipic acid, reports that oleic acid derived from animal fat is used as a defoaming agent in the production of adipic acid. The oleic acid is present in the final product at a few parts per million. An alternative to this part of the process is thought to be possible but there are no plans to use it.'},
 agar: {'veg_category': 'Vegan'},
 alanine: {'veg_category': 'Typically Vegetarian'},
 albumen: {'veg_category': 'Vegetarian'},
 albumin: {'veg_category': 'Vegan'},
 alginic acid: {'veg_category': 'Vegan'},
 alum: {'veg_category': 'Vegan'},
 amino acid: {'veg_category': 'Typically Vegetarian'},
 amylase: {'veg_category': 'Typically
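Each ingredient above ends up with a single key because `d[ingredient] = {...}` replaces the whole inner dict every time, so only the last field seen survives. A sketch of one way to accumulate fields instead; `build_records` is a hypothetical helper, it uses `html.unescape` rather than Beautiful Soup so the keys are plain strings, and it skips the "Product information" lines for brevity.

```python
import html
import re
from typing import Dict, List, Optional


def build_records(lines: List[str], label_to_column: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Group <strong>/<em> lines under the most recent <h2> ingredient."""
    records: Dict[str, Dict[str, str]] = {}
    current: Optional[str] = None
    for line in lines:
        if line.startswith('<h2'):
            match = re.search(r'>(.+)</h2>', line)
            if match:
                # Decode entities and normalise the non-breaking space.
                current = html.unescape(match.group(1)).replace('\xa0', ' ')
                records.setdefault(current, {})
        elif current is None:
            continue  # nothing to attach fields to yet
        elif line.startswith('<strong>'):
            # The colon may sit inside or outside the <strong> tag.
            match = re.search(r'<strong>(.+?)</strong>:?\s*(.*)', line)
            if not match:
                continue
            column = label_to_column.get(match.group(1))
            if column is None:
                continue  # unmapped label
            value = re.sub(r'<br\s*/?>\s*$', '', match.group(2)).strip()
            # Update the inner dict in place instead of replacing it.
            records[current][column] = value
        elif line.startswith('<em>'):
            match = re.search(r'<em>(.+?)</em>', line)
            if match:
                records[current]['veg_category'] = match.group(1)
    return records
```

It would be called as `build_records(text_list, strong_cat_inv)`. Two caveats: labels that map to the same column (such as 'Used in' and 'Used for') still overwrite each other, and inline tags like `<i>` are left in the values.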

In [304]:
line = '<strong>Also known as</strong>: acesulfame potassium, Sunette.<br />'
# Read between > and <
strong_category = re.search('strong>.+</strong', line).group(0)[7:-8]
strong_value = re.search('/strong>:.+<', line).group(0)[10:-1]
# Look up the category name to assign to column
strong_cat_name = strong_cat_inv[strong_category]



In [305]:
strong_category

'Also known as'

In [306]:
strong_value

'acesulfame potassium, Sunette.'

In [300]:
strong_cat_name

'also_known_as'

In [249]:
# re.search('/strong>:.+<', '<strong>Also known as</strong>: <i>n</i>-butyric acid, butanoic acid.<br />')

<re.Match object; span=(22, 70), match='/strong>: <i>n</i>-butyric acid, butanoic acid.<'>

In [281]:
df = pd.DataFrame(d)

In [282]:
df.head()

Unnamed: 0,[acesulfame K],[acetic acid],[acid casein],[acidulant],[acrylic acid],[activated carbon],[adipic acid],[agar],[alanine],[albumen],...,"[vitamin D, [3]]",[vitamin E],[wax],[wheat gluten],[whey],[wine],[xanthan gum],[yeast food],[Zeaxanthin],[zein]
veg_category,Vegan,Vegan,Vegetarian,Typically Vegetarian,Vegan,May be Non-Vegetarian,,Vegan,Typically Vegetarian,Vegetarian,...,,Vegan,Typically Vegetarian,Vegan,Typically Vegetarian,May be Non-Vegetarian,Vegan,Typically Vegan,,Vegan
product_information,,,,,,,"DuPont Chemicals, a manufacturer of adipic aci...",,,,...,,,,,,,,,,
entry_updated,,,,,,,,,,,...,March 2014,,,,,,,,August 2014,
definition,,,,,,,,,,,...,,,,,,,,,,
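The frame above has one column per ingredient because `pd.DataFrame` treats the outer keys of a dict of dicts as columns; for a CSV we want one row per ingredient. A sketch using the standard library's `csv` module rather than pandas, with a hypothetical `records_to_csv` helper and made-up sample records:

```python
import csv
import io
from typing import Dict


def records_to_csv(records: Dict[str, Dict[str, str]]) -> str:
    """Serialise {ingredient: {column: value}} as CSV text, one row per ingredient."""
    # Collect every column that appears in any record.
    columns = sorted({col for fields in records.values() for col in fields})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=['ingredient'] + columns, lineterminator='\n')
    writer.writeheader()
    for name, fields in records.items():
        # Missing columns are written as empty cells.
        writer.writerow({'ingredient': name, **fields})
    return buffer.getvalue()


sample = {
    'acesulfame K': {'veg_category': 'Vegan', 'commercial_source': 'synthetic'},
    'zein': {'veg_category': 'Vegan'},
}
print(records_to_csv(sample))
```

Staying in pandas, `pd.DataFrame.from_dict(d, orient='index')` gives the same one-row-per-ingredient layout.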


In [204]:
import pandas as pd

In [207]:
pd.DataFrame(data)

Unnamed: 0,acetic_acid,another_acid
commercial,vegetale,XX
exists_in,many...,Y
veg_cat,Vegan,Vegn


In [None]:
'<h2 id="acetic_acid">acetic&nbsp;acid</h2>',
 '<strong>Commercial source</strong>: vegetable<br />',
 '<strong>Exists in</strong>: many fruits and plants, in milk, and in synthetic form.<br />',
 '<strong>Used in</strong>: catsup, mayonnaise, and pickles.<br />',
 '<strong>Definition</strong>: Common preservative and flavoring agent which is the principal ingredient of vinegar.<br />',
 '<em>Vegan</em>',

In [None]:
strong_categories

In [177]:
'asdfs'.startswith('s')

False

In [189]:
m = re.search('>.+<', '<h2 id="acidulant">acidulant</h2>')

In [200]:
m.group(0)[1:-1]

'acidulant'

In [195]:
m.string

<function Match.expand(template)>

In [201]:
re.search('>.+<', '<h2 id="acidulant">acidulant</h2>').group(0)[1:-1]

'acidulant'

In [4]:
# Sample file for determining formatting
test_file = '/Volumes/ja2/vegan/vegan_parser/test_data/tocopherol.html'

In [7]:
# read in the text file
with open('/Volumes/ja2/vegan/vegan_parser/test_data/tocopherol.html', 'r') as f:
    toco_text = f.read()
    

In [21]:
toco_text[-300:]

'static.zdassets.com/ekr/snippet.js?key=ddae71b0-53e4-4646-9859-d51edea50265"> </script>\n<script type="text/javascript">\n  zE(\'webWidget\', \'hide\');\n</script>\n<script type="text/javascript" src="https://s7.addthis.com/js/300/addthis_widget.js#pubid=ra-4e31767a3587c42e" async></script>\n</body>\n</html>\n'

In [46]:
test_soup = BeautifulSoup(toco_text, 'html.parser')

In [24]:
ptag = test_soup.p

In [25]:
ptag

<p class="searches hidden"><span>16,031,296</span> searches since Oct. 27, 2014</p>

In [49]:
# Get the name of the chemical
test_soup.find('h2', attrs={'class': 'chemical-name text-block'}).text

'TOCOPHEROL'

In [36]:
# Okay, so this is the Beautiful Soup call to get the concerns text.
test_soup.find('p', attrs={'class': "chemical-info chemical-concerns-text"}).text

'Contamination concerns (high)'

In [44]:
# And to find the data score URL
test_soup.find('div', attrs={'class': "chemical-score float-r"}).find('img')

<img class="squircle" src="/skindeep/squircle/show.svg?score=1&amp;score_min=1"/>
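The score itself sits in the query string of the image URL, so it can be read with `urllib.parse`. A sketch with a hypothetical `score_from_src` helper; it assumes the `src` value has already had `&amp;` decoded to `&`, which is what I'd expect from reading the attribute with `img['src']`.

```python
from urllib.parse import parse_qs, urlparse


def score_from_src(src: str) -> int:
    """Read the `score` query parameter out of a squircle image URL."""
    query = parse_qs(urlparse(src).query)
    return int(query['score'][0])


print(score_from_src('/skindeep/squircle/show.svg?score=1&score_min=1'))  # 1
```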

In [42]:
# And to find data availability
test_soup.find('div', attrs={'class': "chemical-score float-r"}).text  # Hmm...we could just parse this out by hand.

'\nScore:\n\nData: Fair\n'
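Parsing that text by hand can be a one-line regex. A sketch with a hypothetical `data_availability` helper, assuming the value after "Data:" is always a single word as it is here:

```python
import re
from typing import Optional


def data_availability(block_text: str) -> Optional[str]:
    """Return the word following 'Data:' in the score block, or None if absent."""
    match = re.search(r'Data:\s*(\w+)', block_text)
    return match.group(1) if match else None


print(data_availability('\nScore:\n\nData: Fair\n'))  # Fair
```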

In [26]:
p_list = test_soup.find_all('p')  # class="chemical-score float-r"')

In [27]:
p_list

[<p class="searches hidden"><span>16,031,296</span> searches since Oct. 27, 2014</p>,
 <p>Menu</p>,
 <p class="links">
 <a href="/skindeep/contents/about-page">About SkinDeep<sup>®</sup>/Methodology</a>
 | <a href="https://www.ewg.org/disclaimer">Legal Disclaimer</a><br/>
 <a href="https://www.ewg.org/skindeep/build_your_own_report">Build Your Own Report</a>
 | <a href="/skindeep/contents/faq">FAQ</a>
 </p>,
 <p class="has_elements">Sun</p>,
 <p class="has_elements">Skin</p>,
 <p class="has_elements">Hair</p>,
 <p class="has_elements">Nails</p>,
 <p class="has_elements">Makeup</p>,
 <p class="has_elements">Fragrance</p>,
 <p class="has_elements">Babies</p>,
 <p class="has_elements">Oral Care</p>,
 <p class="has_elements">Men</p>,
 <p>
 <a href="/skindeep/browse/ewg_verified">EWG VERIFIED™</a>
 </p>,
 <p>image source: <a href="https://pubchem.ncbi.nlm.nih.gov/compound/14985">PubChem</a></p>,
 <p>Score:</p>,
 <p>Data: <span>Fair</span></p>,
 <p class="chemical-info chemical-concerns-text

In [29]:
for i, elem in enumerate(p_list):
    if i == 1:
        # 'class' is an attribute, not a child tag, so use .get() rather than .find()
        print(elem.get('class'))


None
