# The Problem

The owner of a cooking-recipe website wants to migrate the recipe data to a new website, but the original site was built without any way to export that data.

The client wants a Python developer who can scrape all the recipe data from the site's 9 listing pages and save the following fields to a CSV file:


URL, publication date, image URL, headline, sub-headline, ingredients list and directions.

The site that needs to be scraped is:

https://renfroepecan.com/blogs/recipes?page=1

## Loading required libraries

In [1]:
from bs4 import BeautifulSoup
from datetime import datetime

import time
import requests
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

## Defining headers and functions

In [2]:
headers = { 
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36', 
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
'Accept-Language' : 'en-US,en;q=0.5',
'Accept-Encoding' : 'gzip', 
'DNT' : '1', # Do Not Track Request Header 
'Connection' : 'close'
}

def get_page(url):
    page = requests.get(url, headers = headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    return soup
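
`get_page` parses whatever the server returns, including error pages, and waits indefinitely if the server stalls. A minimal sketch of a stricter variant follows, assuming a 10-second timeout is acceptable; the name `get_page_checked` and its `http` parameter are hypothetical and not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup

def get_page_checked(url, headers=None, timeout=10, http=requests):
    """Fetch a page and parse it, failing loudly on HTTP errors.

    `http` is any object with a requests-style `get` method (the `requests`
    module by default, or a `requests.Session`). It is an assumption of this
    sketch that makes the function easy to exercise without network access.
    """
    response = http.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises an exception on 4xx/5xx responses
    return BeautifulSoup(response.text, 'html.parser')
```

It can be called like `get_page`, for example `get_page_checked(url, headers=headers)`, so a failed request stops the run instead of silently producing an empty or misleading row.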

## Getting all recipe URLs from each listing page

In [3]:
recipes_urls = [] # list to store all recipe URLs

for page in range(1,10): # loop through all 9 pages
    
    # getting the pages HTML data
    soup = get_page('https://renfroepecan.com/blogs/recipes?page='+str(page))

    # getting all divs that contain a recipe url
    alpha = soup.find_all('div',{'class':'sixteen columns'})[1].find_all('div', {'class':'one-third column alpha article'})

    middle = soup.find_all('div',{'class':'sixteen columns'})[1].find_all('div', {'class':'one-third column article'})

    omega = soup.find_all('div',{'class':'sixteen columns'})[1].find_all('div', {'class':'one-third column omega article'})

    # passing all into a list
    cols = [alpha, middle, omega]

    for col in cols: # loop through each column of data

        for article in col: # loop through each article div in the alpha, middle and omega columns

            # appending each recipe URL to the recipes_urls list
            recipes_urls.append('https://renfroepecan.com'+article.find('a', href = True)['href'])
    
    time.sleep(1) # sleeping for 1 second to avoid overloading the server
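
The three column lookups differ only in the `alpha` and `omega` classes, so they could be collapsed into a single search on the shared `article` class. The sketch below illustrates this on a small hand-written HTML sample that only imitates the structure the loop relies on; the helper name `extract_recipe_urls` and the `example-recipe` link are hypothetical.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'https://renfroepecan.com'

def extract_recipe_urls(soup, base_url=BASE_URL):
    """Return the absolute recipe URLs found in a listing page, without duplicates."""
    urls = []
    # class_='article' matches every div whose class list contains "article",
    # which covers the alpha, middle and omega columns in one pass
    for card in soup.find_all('div', class_='article'):
        link = card.find('a', href=True)
        if link is None:
            continue  # skip cards that have no link
        url = urljoin(base_url, link['href'])
        if url not in urls:
            urls.append(url)
    return urls

# Hand-written sample imitating the listing markup (an assumption, not real site HTML)
sample_html = '''
<div class="sixteen columns">
  <div class="one-third column alpha article"><a href="/blogs/recipes/sand-tarts">Sand Tarts</a></div>
  <div class="one-third column article"><a href="/blogs/recipes/example-recipe">Example</a></div>
  <div class="one-third column omega article"><a href="/blogs/recipes/sand-tarts">Sand Tarts</a></div>
</div>
'''
sample_urls = extract_recipe_urls(BeautifulSoup(sample_html, 'html.parser'))
```

Unlike the loop above, this returns the links in document order rather than column by column. That makes no difference to the final CSV, because the dataset is sorted by publication date later on.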

## Getting all data from each recipe url

In [4]:
# creating an empty data frame to store all scraped data
recipes_dataset = pd.DataFrame(columns = ['url','publi_date','image_url',
                                          'headline','sub_headline',
                                          'ingredients1','ingredients2',
                                          'directions'])

for p in recipes_urls[0:97]: # looping through each url
    
    soup_recipe = get_page(p) # getting the HTML page
    
    # extracting publication date
    publi_date = soup_recipe.body.find_all('p',{'class':'blog_meta'})[0].text.replace('\n','')
    
    # extracting image url
    image_url = str(soup_recipe.find_all('div',{'class':'parallax'})[0]).split()[7].replace('src="//','').replace('"/>','')
    
    # extracting recipe headline
    headline = soup_recipe.body.find_all('div',{'class':'shogun-heading-component'})[0:2][0].text.replace('\n','').replace('          ','')
    
    # extracting recipe sub-headline
    sub_headline = soup_recipe.body.find_all('div',{'class':'shogun-heading-component'})[0:2][1].text.replace('\n','').replace('          ','')
    
    # extracting the rich-text blocks once; they hold the ingredients and the directions
    text_blocks = soup_recipe.body.find_all('div',{'class':'shg-rich-text shg-theme-text-content'})
    
    # defaults, so a page with an unexpected layout does not reuse the previous recipe's values
    ingredients1, ingredients2, directions = '', '', ''
    
    # extracting recipe ingredients and directions
    if len(text_blocks)==2: # one ingredients block followed by the directions
        
        ingredients1 = text_blocks[0].text.replace('\n','')
        directions = text_blocks[1].text.replace('\n','')
        
    elif len(text_blocks)>2: # two ingredients blocks followed by the directions
        
        ingredients1 = text_blocks[0].text.replace('\n','')
        ingredients2 = text_blocks[1].text.replace('\n','')
        directions = text_blocks[2].text.replace('\n','')
    
    # storing the scraped data as a new row of our dataframe
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    new_row = pd.DataFrame([{'url':p,
                             'publi_date':publi_date,
                             'image_url':image_url,
                             'headline':headline,
                             'sub_headline':sub_headline,
                             'ingredients1':ingredients1,
                             'ingredients2':ingredients2,
                             'directions':directions}])
    recipes_dataset = pd.concat([recipes_dataset, new_row], ignore_index = True)
    
    #time.sleep(2) # sleeping for 2 seconds to not over request the server
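
The image URL above is taken from a fixed position in the whitespace-split HTML string, which breaks as soon as the tag's attributes change order. A possible alternative is to read the `src` attribute directly and to map the rich-text blocks to fields in one small helper. The sketch below assumes the image is an `<img>` tag inside the `parallax` div; the function names and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup

def extract_image_url(soup):
    """Return the image URL inside the 'parallax' div, or '' if there is none."""
    container = soup.find('div', class_='parallax')
    image = container.find('img', src=True) if container else None
    if image is None:
        return ''
    src = image['src']
    # protocol-relative URLs ("//host/path") get an explicit scheme
    return 'https:' + src if src.startswith('//') else src

def split_text_blocks(texts):
    """Map the rich-text blocks to (ingredients1, ingredients2, directions)."""
    if len(texts) == 2:
        return texts[0], '', texts[1]
    if len(texts) > 2:
        return texts[0], texts[1], texts[2]
    return '', '', ''  # unexpected layout: leave all fields empty

# Hand-written sample imitating a recipe page (an assumption, not real site HTML)
sample_html = '''
<div class="parallax"><img alt="Sand Tarts" src="//cdn.shopify.com/example/sand-tarts.jpg"/></div>
<div class="shg-rich-text shg-theme-text-content">1/2 cup butter</div>
<div class="shg-rich-text shg-theme-text-content">Preheat oven.</div>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_texts = [block.get_text().replace('\n', '')
                for block in sample_soup.find_all('div', class_='shg-rich-text')]
sample_image = extract_image_url(sample_soup)
sample_fields = split_text_blocks(sample_texts)
```

Note that this sketch keeps the `https:` scheme in the image URL, whereas the cell above stores the URL without any scheme.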

In [5]:
# inspecting the raw publication dates
recipes_dataset['publi_date']

0      January 05, 2022
1     December 16, 2021
2     December 08, 2021
3     December 02, 2021
4     December 20, 2021
            ...        
92    November 03, 2020
93      August 22, 2020
94      August 19, 2020
95      August 16, 2020
96      August 15, 2020
Name: publi_date, Length: 97, dtype: object

In [6]:
# removing the comma from each publication date
recipes_dataset['publi_date'] = recipes_dataset['publi_date'].str.replace(',','')

In [7]:
# converting our string dates to datetime format
recipes_dataset['publi_date'] = recipes_dataset['publi_date'].apply(lambda x: datetime.strptime(x,'%B %d %Y'))
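
The comma removal and the `strptime` conversion could also be done in a single step, because the comma can be part of the format string. A minimal sketch on two sample values copied from the column output shown earlier (the variable names are illustrative only):

```python
import pandas as pd

# Two raw values in the same "Month DD, YYYY" layout as the scraped column
sample_dates = pd.Series(['January 05, 2022', 'August 15, 2020'])

# The comma is written into the format, so no prior cleanup is needed
parsed_dates = pd.to_datetime(sample_dates, format='%B %d, %Y')

# ISO-formatted strings, only to make the result easy to inspect
iso_dates = parsed_dates.dt.strftime('%Y-%m-%d').tolist()
```

Applied to the raw scraped column, this would be `pd.to_datetime(recipes_dataset['publi_date'], format='%B %d, %Y')`.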

In [8]:
recipes_dataset = recipes_dataset.sort_values(by = 'publi_date', ascending = False)

In [9]:
recipes_dataset.head()

Unnamed: 0,url,publi_date,image_url,headline,sub_headline,ingredients1,ingredients2,directions
0,https://renfroepecan.com/blogs/recipes/salmon-...,2022-01-05,cdn.shopify.com/s/files/1/0011/1702/8412/artic...,Salmon with Pecan Honey Glaze,Make your New Year's Resolution easy! This hea...,• 1 cup pecan halves or pieces• 1 cup honey• 3...,,1. Preheat oven to 400°F.2. Spread pecans even...
4,https://renfroepecan.com/blogs/recipes/whopper...,2021-12-20,cdn.shopify.com/s/files/1/0011/1702/8412/artic...,Whopper Cookies,All of your favorites in one cookie! This whop...,• 1 cup butter or oleo• 1-1/4 cups brown sugar...,,1. Preheat oven to 350°F.2. Melt butter. Combi...
8,https://renfroepecan.com/blogs/recipes/six-lay...,2021-12-17,cdn.shopify.com/s/files/1/0011/1702/8412/artic...,Six Layer Cookie Bars,Six layers of delectable sweetness! This uniqu...,"• 1/2 cup margarine, melted• 8 oz. swiss choco...",,1. Preheat oven to 350°F (325°F if using a gla...
1,https://renfroepecan.com/blogs/recipes/sand-tarts,2021-12-16,cdn.shopify.com/s/files/1/0011/1702/8412/artic...,Sand Tarts,Sand tarts are a special butter pecan cookie r...,"• 1/2 cup butter, room temperature• 2 tbsp. su...",,1. Preheat oven to 325°F.2. Cream butter. Add ...
5,https://renfroepecan.com/blogs/recipes/praline...,2021-12-09,cdn.shopify.com/s/files/1/0011/1702/8412/artic...,Praline Shortbread Cookies,This pecan shortbread cookie recipe is a class...,"• 1 cup butter, softened• 1-1/2 cups flour• 3/...",,1. Preheat oven to 325°F.2. Cream together but...


In [None]:
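# index = False keeps the row index out of the file (otherwise it is read back as an "Unnamed: 0" column)
recipes_dataset.to_csv('recipes.csv', index = False)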