# Accessing Links and Extracting Recall Information

The dataset below was extracted with Selenium and BeautifulSoup. The code that produced it is available [here](https://github.com/aleivaar94/Project-CFIA-Food-Recalls-Web-Scrapping-Selenium-BeautifulSoup/blob/master/CFIA-Food-Recall-Web-Scrapping-Selenium-BS4.ipynb).

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

pd.set_option('display.max_colwidth', None) # None shows full column contents (-1 is deprecated)

recalls = pd.read_csv('C:\\Users\\jorge\\DataScience\\CFIA-Food-Recall-Web-Scrapping\\cfia_recalls_v2.csv', encoding='utf-16')
recalls.head()



Unnamed: 0.1,Unnamed: 0,title,link,date_scrapped
0,0,One Ocean brand Sliced Smoked Wild Sockeye Salmon recalled due to Listeria monocytogenes,https://healthycanadians.gc.ca//recall-alert-rappel-avis/inspection/2021/74995r-eng.php,"February 14, 2021"
1,1,Pastene brand Green Olives Sliced recalled due to container integrity defects,https://healthycanadians.gc.ca//recall-alert-rappel-avis/inspection/2021/75001r-eng.php,"February 14, 2021"
2,2,Unauthorized hand sanitizers and hard surface disinfectants sold by Protegel Quebec Inc. and Hangel Canada Inc. may pose health risks,https://healthycanadians.gc.ca//recall-alert-rappel-avis/hc-sc/2021/74953a-eng.php,"February 14, 2021"
3,3,Casa Italia brand Soppressata Piccante Salami recalled due to possible spoilage,https://healthycanadians.gc.ca//recall-alert-rappel-avis/inspection/2021/74975r-eng.php,"February 14, 2021"
4,4,Obiji brand Palm Oil recalled due to Sudan IV,https://healthycanadians.gc.ca//recall-alert-rappel-avis/inspection/2021/74949r-eng.php,"February 14, 2021"


# Extract Recall Details (Lists + Error)

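This code extracts details for every link, including non-food recalls. It reads each field by its position among the page's `dd` elements, which matches the layout of food recall pages. Non-food pages are laid out differently, so their values can land in the wrong columns, and any field that cannot be found is recorded as 'Error'.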

In [18]:
names = []
dates = []
notifications = []
sub_types = []
hazards = []
hazards_class = []
sources_recalls = []
companies = []
distributions = []
channels = []
ref_nums = []


for i in recalls['link']:
    page_source = requests.get(i)
    soup = BeautifulSoup(page_source.content, 'html.parser')
    details = soup.find_all('dd', class_ = 'width45 paddingNone')

    # Recall details
    try:
        name = soup.find(id= 'cn-cont').text.strip()
    except:
        name = 'Error'

    try:
        date = details[0].text.strip()
    except:
        date = 'Error'

    try:
        notification = details[1].text.strip()
    except:
        notification = 'Error'

    try:
        sub_type = details[2].text.strip()
    except:
        sub_type = 'Error'

    try:
        hazard = details[3].text.strip()
    except:
        hazard = 'Error'

    try:
        hazard_class = details[4].text.strip()
    except:
        hazard_class = 'Error'

    try:
        source_recall = details[5].text.strip()
    except:
        source_recall = 'Error'

    try:
        company = details[6].text.strip()
    except:
        company = 'Error'

    try:
        distribution = details[7].text.strip()
    except:
        distribution = 'Error'

    try:
        channel = details[8].text.strip()
    except:
        channel = 'Error'

    try:
        ref_num = details[9].text.strip()
    except:
        ref_num = 'Error'

    names.append(name)
    dates.append(date)
    notifications.append(notification)
    sub_types.append(sub_type)
    hazards.append(hazard)
    hazards_class.append(hazard_class)
    sources_recalls.append(source_recall)
    companies.append(company)
    distributions.append(distribution)
    channels.append(channel)
    ref_nums.append(ref_num)

# Create a DataFrame out of our lists

recall_details = pd.DataFrame({
    'name': names,
    'date': dates,
    'notification': notifications,
    'sub_type': sub_types,
    'hazard': hazards,
    'hazard_class': hazards_class,
    'source_recall': sources_recalls,
    'company': companies,
    'distribution': distributions,
    'channel': channels,
    'ref_num': ref_nums,
    'date_scrapped': 'February 15, 2021'
})
recall_details.head()

Unnamed: 0,name,date,notification,sub_type,hazard,hazard_class,source_recall,company,distribution,channel,ref_num,date_scrapped
0,One Ocean brand Sliced Smoked Wild Sockeye Salmon recalled due to Listeria monocytogenes,"February 10, 2021",Recall,Updated Food Recall Warning,Microbiological - Listeria,Class 1,Canadian Food Inspection Agency,Orca Specialty Foods Ltd.,British Columbia,Consumer,14232,"February 15, 2021"
1,Pastene brand Green Olives Sliced recalled due to container integrity defects,"February 8, 2021",Recall,Notification,Other,Class 3,Canadian Food Inspection Agency,Pastene Enterprises ULC,"New Brunswick, Nova Scotia, Ontario, Quebec",Retail,14227,"February 15, 2021"
2,Unauthorized hand sanitizers and hard surface disinfectants sold by Protegel Quebec Inc. and Hangel Canada Inc. may pose health risks,"February 5, 2021","February 5, 2021",Advisory,"Affects children, pregnant or breast feeding women, Affects children, pregnant or breast feeding women, Affects children, pregnant or breast feeding women",Health Canada,Unauthorized products,General Public,RA-74953,Error,Error,"February 15, 2021"
3,Casa Italia brand Soppressata Piccante Salami recalled due to possible spoilage,"February 3, 2021",Recall,Notification,Microbiological - Non harmful (Quality/Spoilage),Class 3,Canadian Food Inspection Agency,SYSCO Toronto,Ontario,Hotel/Restaurant/Institutional,14222,"February 15, 2021"
4,Obiji brand Palm Oil recalled due to Sudan IV,"February 1, 2021",Recall,Notification,Chemical,Class 2,Canadian Food Inspection Agency,Crestar Healthcare Group Ltd.,"Alberta, British Columbia",Retail,14220,"February 15, 2021"
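
The eleven `try`/`except` blocks above all follow one pattern, so they could be folded into a single helper. The following is a minimal sketch, not the notebook's code: `text_or_error` is a hypothetical name, and `SimpleNamespace` objects stand in for the BeautifulSoup tags so the snippet runs without fetching a page. It catches only `IndexError` and `AttributeError`, the failures expected when a field is missing.

```python
from types import SimpleNamespace


def text_or_error(elements, index, default='Error'):
    """Return the stripped text of elements[index], or `default` if it is missing."""
    try:
        return elements[index].text.strip()
    except (IndexError, AttributeError):
        # IndexError: fewer elements than expected; AttributeError: element is None
        return default


# Stand-ins for the tags returned by soup.find_all (assumption for this demo)
details = [
    SimpleNamespace(text=' February 10, 2021 '),
    SimpleNamespace(text='Recall'),
]

# Only three fields are listed here to keep the example short
fields = ['date', 'notification', 'sub_type']
row = {field: text_or_error(details, position) for position, field in enumerate(fields)}
print(row)
# {'date': 'February 10, 2021', 'notification': 'Recall', 'sub_type': 'Error'}
```

With a helper like this, the loop body would shrink to one call per field while producing the same 'Error' placeholders.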


# Extract Recall Details (Dictionaries)

This code extracts only food recalls, because only food recall pages satisfy every parsing condition below: if the name or any of the ten detail fields is missing, the `try` block fails and the page is skipped.
The linked notebook shows this code being executed: [here](https://github.com/aleivaar94/Project-CFIA-Food-Recalls-Web-Scrapping-Selenium-BeautifulSoup/blob/master/CFIA-Food-Recall-Web-Scrapping-Selenium-BS4.ipynb).

In [None]:
# Separate name so the `recalls` DataFrame loaded above is not overwritten
recall_rows = []

# `driver` (Selenium WebDriver) and `df` (DataFrame with a 'link' column) are not defined above
for link in df['link']:
    driver.get(link)
    driver.implicitly_wait(2)
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'html.parser')
    details = soup.find_all('dd', class_ = 'width45 paddingNone')
    name = soup.find(id= 'cn-cont')
    try:
        dictionary = {
            'name': name.text.strip(),
            'date': details[0].text.strip(),
            'type': details[1].text.strip(),
            'sub_type': details[2].text.strip(),
            'hazard': details[3].text.strip(),
            'hazard_class': details[4].text.strip(),
            'source_of_recall': details[5].text.strip(),
            'company': details[6].text.strip(),
            'distribution': details[7].text.strip(),
            'channel': details[8].text.strip(),
            'ref_num': details[9].text.strip()
        }
        recall_rows.append(dictionary)
    except (AttributeError, IndexError):
        # name is None or fewer than ten detail fields were found: skip this page
        pass

recall_details = pd.DataFrame(recall_rows)
recall_details.head()
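
The cell above keeps a page only when every field parses. A minimal sketch of that all-or-nothing rule follows, written with an explicit length check instead of a `try` block. `build_record`, `FIELDS` and `make_tag` are hypothetical names, and `SimpleNamespace` objects stand in for BeautifulSoup tags so the example runs offline; the field values are copied from the output shown earlier.

```python
from types import SimpleNamespace

# Field names in the order the notebook reads them from the page
FIELDS = ['date', 'type', 'sub_type', 'hazard', 'hazard_class',
          'source_of_recall', 'company', 'distribution', 'channel', 'ref_num']


def build_record(name, details):
    """Return a dict for one page, or None if the name or any field is missing."""
    if name is None or len(details) < len(FIELDS):
        return None
    record = {'name': name.text.strip()}
    for field, element in zip(FIELDS, details):
        record[field] = element.text.strip()
    return record


def make_tag(text):
    """Stand-in for a BeautifulSoup tag (assumption for this demo)."""
    return SimpleNamespace(text=text)


food_page = [make_tag(value) for value in [
    'February 1, 2021', 'Recall', 'Notification', 'Chemical', 'Class 2',
    'Canadian Food Inspection Agency', 'Crestar Healthcare Group Ltd.',
    'Alberta, British Columbia', 'Retail', '14220',
]]
short_page = food_page[:8]  # a page exposing only eight fields is rejected

title = make_tag('Obiji brand Palm Oil recalled due to Sudan IV')
records = [build_record(title, page) for page in (food_page, short_page)]
records = [record for record in records if record is not None]
print(len(records))  # 1
```

The explicit check makes the skip condition visible, whereas a broad `except` would also hide unrelated errors.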

# Extract Recalled Products

In [8]:
products = []

for link in recalls['link']:
    page_source = requests.get(link)
    soup = BeautifulSoup(page_source.content, 'html.parser')

    # Case 1: table with the classes 'table table-bordered table-condensed'
    try:
        table = soup.find('table', class_ = 'table table-bordered table-condensed')
        affected_products = table.find_all('td')
        for cell in affected_products:
            products.append(cell.text)
    except AttributeError:
        # soup.find returned None: this page has no such table
        pass

    # Case 2: table with the class 'margin-top-small'
    try:
        table = soup.find('table', class_ = 'margin-top-small')
        affected_products = table.find_all('td')
        for cell in affected_products:
            products.append(cell.text.strip())
    except AttributeError:
        # soup.find returned None: this page has no such table
        pass
    
# Append products to a DataFrame
recalled_products = pd.DataFrame(products, columns=['recalled_products'])
recalled_products.head()

Unnamed: 0,recalled_products
0,One Ocean
1,Sliced Smoked Wild Sockeye Salmon
2,300 g
3,6 25984 00005 3
4,11253
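
Because every `td` is appended to one flat list, the attributes of a product end up stacked in a single column. One way to recover one row per product is to group the cells in fixed-size chunks. The sketch below assumes each product table has exactly five columns; the column names (`brand`, `product`, `size`, `upc`, `codes`) are illustrative guesses rather than headers taken from the CFIA pages, and the second product is a placeholder added only to show the grouping.

```python
def chunk_cells(cells, width):
    """Split a flat list of cell texts into rows of `width` cells each."""
    if len(cells) % width != 0:
        raise ValueError(f'{len(cells)} cells cannot be split into rows of {width}')
    return [cells[start:start + width] for start in range(0, len(cells), width)]


# Assumed column names, one per cell in a table row
COLUMNS = ['brand', 'product', 'size', 'upc', 'codes']

cells = [
    # First product: values from the output above
    'One Ocean', 'Sliced Smoked Wild Sockeye Salmon', '300 g', '6 25984 00005 3', '11253',
    # Second product: placeholder values, not real data
    'Brand B', 'Product B', '100 g', '0 00000 00000 0', '00000',
]

rows = [dict(zip(COLUMNS, row)) for row in chunk_cells(cells, len(COLUMNS))]
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` would then give one row per product. Tables with a different number of columns would need their own width, which is why the helper raises an error instead of silently misaligning the cells.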


# Save DataFrames to .csv file

In [19]:
recall_details.to_csv('C:\\Users\\jorge\\DataScience\\CFIA-Food-Recall-Web-Scrapping\\cfia_recalls_details_v3.csv', encoding='utf-16')

In [9]:
recalled_products.to_csv('C:\\Users\\jorge\\DataScience\\CFIA-Food-Recall-Web-Scrapping\\cfia_recalls_products_v3.csv', encoding='utf-16')