<h2>Caveat</h2>
Websites often change the format of their pages, so the examples below may not always work. If they don't, examine the HTML content of the page and rework the examples accordingly. Most browsers let you view the HTML source (look for a "page source" option), though you may first have to turn on developer mode in your browser preferences. For example, on Chrome you need to click the "developer mode" check box under Extensions in the Preferences/Options menu.

<h3>Import necessary modules</h3>

In [1]:
import requests
from bs4 import BeautifulSoup

<h3>The http request response cycle</h3>

In [2]:
url = "http://www.epicurious.com/search/Tofu Chili"
response = requests.get(url)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")

Success


In [3]:
keywords = input("Please enter the things you want to see in a recipe")
url = "http://www.epicurious.com/search/" + keywords
response = requests.get(url)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")
# Lamb chops

Please enter the things you want to see in a recipeLamb chops
Success
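
The keywords above contain a space. `requests` normally percent-encodes such characters before sending the request, but the encoding can also be made explicit. The following is a minimal sketch using the standard library's `urllib.parse.quote`; `build_search_url` and `BASE_URL` are hypothetical names, not part of the original notebook.

```python
from urllib.parse import quote

# Same search prefix as in the cells above
BASE_URL = "http://www.epicurious.com/search/"

def build_search_url(keywords):
    """Return the search URL with the keywords percent-encoded."""
    # quote() turns spaces into %20 and escapes other unsafe characters
    return BASE_URL + quote(keywords.strip())

print(build_search_url("Lamb chops"))
# prints http://www.epicurious.com/search/Lamb%20chops
```

The resulting string can be passed to `requests.get` exactly as `url` is in the cells above.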


In [5]:
response.content



<h3>Set up the BeautifulSoup object</h3>

In [9]:
results_page = BeautifulSoup(response.content,'lxml')
# type(results_page)
print(results_page.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="app-id=312101965" name="apple-itunes-app"/>
  <title>
   Search | Epicurious.com
  </title>
  <link href="//assets.adobedtm.com" rel="dns-prefetch"/>
  <link href="https://www.google-analytics.com" rel="dns-prefetch"/>
  <link href="//tpc.googlesyndication.com" rel="dns-prefetch"/>
  <link href="//static.parsely.com" rel="dns-prefetch"/>
  <link href="//condenast.demdex.net" rel="dns-prefetch"/>
  <link href="//capture.condenastdigital.com" rel="dns-prefetch"/>
  <link href="//pixel.condenastdigital.com" rel="dns-prefetch"/>
  <link href="//use.typekit.net" rel="dns-prefetch"/>
  <link href="//fonts.typekit.net" rel="dns-prefetch"/>
  <link href="//p.typekit.net" rel="dns-prefetch"/>
  <link href="//assets.epicurious.com" rel="dns-prefetch"/>
  <link href="//ad.doubleclick.net" rel="dns-prefetch"/>
  <link href="//pagead2.googlesyndication.com" rel="dns-prefetch"/>
  <link href="//z.moatads.com" rel="dns-prefetch

<h3>BS4 functions</h3>

<h4>find_all finds all instances of a specified tag</h4>
<h4>returns a result_set (a list)</h4>

In [19]:
all_a_tags = results_page.find_all('a')
print(type(all_a_tags))

<class 'bs4.element.ResultSet'>


In [20]:
all_a_tags

[<a data-reactid="5" href="/" itemprop="url" title="Epicurious">Epicurious</a>,
 <a data-reactid="72" href="/recipes/food/views/indian-spice-marinated-and-grilled-lamb-chops">Spice-Marinated and Grilled Lamb Chops</a>,
 <a class="photo-link" data-reactid="89" href="/recipes/food/views/indian-spice-marinated-and-grilled-lamb-chops"><div class="photo-wrap" data-reactid="90"><div class="component-lazy pending" data-component="Lazy" data-reactid="91"></div></div></a>,
 <a class="view-complete-item" data-reactid="92" href="/recipes/food/views/indian-spice-marinated-and-grilled-lamb-chops" itemprop="url" title="Spice-Marinated and Grilled Lamb Chops"><!-- react-text: 93 -->View “<!-- /react-text --><!-- react-text: 94 -->Spice-Marinated and Grilled Lamb Chops<!-- /react-text --><!-- react-text: 95 -->”<!-- /react-text --></a>,
 <a class="view-complete-item" data-reactid="97" href="/recipes/food/views/indian-spice-marinated-and-grilled-lamb-chops">View Recipe</a>,
 <a class="show-quick-view" 
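
Because live pages change, it can help to try `find_all` on a small, fixed piece of HTML first. The sketch below uses a made-up snippet and Python's built-in `html.parser` (the notebook itself uses `lxml`); it only illustrates that the result behaves like a list and is empty when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet, used only for illustration
html = """
<div>
  <a href="/one">First</a>
  <a href="/two">Second</a>
  <p>No link here</p>
</div>
"""

# html.parser ships with Python; the notebook's cells use lxml instead
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")               # ResultSet, behaves like a list
hrefs = [a.get("href") for a in links]   # collect the href of every link
missing = soup.find_all("table")         # no matches gives an empty ResultSet

print(len(links), hrefs, len(missing))
```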

<h4>find finds the first instance of a specified tag</h4>
<h4>returns a bs4 element</h4>


In [None]:
div_tag = results_page.find('div')
print(div_tag)

In [None]:
type(div_tag)


<h4>bs4 functions can be recursively applied on elements</h4>

In [23]:
div_tag.find('a')

<a data-reactid="5" href="/" itemprop="url" title="Epicurious">Epicurious</a>
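
As a small offline illustration of the nested search above, the sketch below applies `find` to an element returned by an earlier `find`. The HTML is made up, and `html.parser` is used in place of `lxml` so that no extra parser is needed. It also shows that `find` returns `None` when nothing matches, which is worth checking before chaining further calls.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet, used only for illustration
html = '<div id="nav"><span><a href="/">Home</a></span></div><p>text</p>'
soup = BeautifulSoup(html, "html.parser")

div_tag = soup.find("div")          # first <div> in the document
inner_a = div_tag.find("a")         # searches all descendants of that div
no_table = div_tag.find("table")    # find returns None when nothing matches

print(inner_a.get_text(), no_table)
```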

<h4>Both find and find_all can be qualified by attribute filters</h4>
<li>using attribute=value keyword arguments
<li>using a dictionary of attribute names and values

In [16]:
#When using this method and looking for 'class' use 'class_' 
#(because class is a reserved word in python)
#Note that we get a list back because find_all returns a list
results_page.find_all('article',class_="recipe-content-card")

[<article class="recipe-content-card" data-has-quickview="false" data-index="0" data-reactid="68" itemscope="" itemtype="https://schema.org/Recipe"><header class="summary" data-reactid="69"><strong class="tag" data-reactid="70">recipe</strong><h4 class="hed" data-reactid="71" data-truncate="3" itemprop="name"><a data-reactid="72" href="/recipes/food/views/indian-spice-marinated-and-grilled-lamb-chops">Spice-Marinated and Grilled Lamb Chops</a></h4><p class="dek" data-reactid="73" data-truncate="1">You don’t need a roaring-hot grill for this lamb chops recipe. Grilling them over moderate heat will allow some of the fat to soften and render.</p><dl class="recipes-ratings-summary" data-reactid="74" data-reviews-count="0" data-reviews-rating="0" itemprop="aggregateRating" itemscope="" itemtype="https://schema.org/AggregateRating"><dt class="rating-label" data-reactid="75">Average user rating</dt><span class="reviews-count-container" data-reactid="76"><dd class="rating" data-rating="unrated

In [14]:
#Since we're using a string as the key, 
#the fact that class is a reserved word is not a problem
#We get an element back because find returns an element
article = results_page.find('article',{'class':'recipe-content-card'})
print(article.prettify())

<article class="recipe-content-card" data-has-quickview="false" data-index="0" data-reactid="68" itemscope="" itemtype="https://schema.org/Recipe">
 <header class="summary" data-reactid="69">
  <strong class="tag" data-reactid="70">
   recipe
  </strong>
  <h4 class="hed" data-reactid="71" data-truncate="3" itemprop="name">
   <a data-reactid="72" href="/recipes/food/views/indian-spice-marinated-and-grilled-lamb-chops">
    Spice-Marinated and Grilled Lamb Chops
   </a>
  </h4>
  <p class="dek" data-reactid="73" data-truncate="1">
   You don’t need a roaring-hot grill for this lamb chops recipe. Grilling them over moderate heat will allow some of the fat to soften and render.
  </p>
  <dl class="recipes-ratings-summary" data-reactid="74" data-reviews-count="0" data-reviews-rating="0" itemprop="aggregateRating" itemscope="" itemtype="https://schema.org/AggregateRating">
   <dt class="rating-label" data-reactid="75">
    Average user rating
   </dt>
   <span class="reviews-count-containe
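
The two ways of qualifying a search can be compared side by side on a small example. This is a sketch on made-up HTML that reuses the `recipe-content-card` class name from the output above; `html.parser` stands in for `lxml`. It also illustrates one case where the dictionary form is required: attribute names containing a hyphen cannot be written as Python keyword arguments.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet; only the class name is taken from the notebook output
html = """
<article class="recipe-content-card featured" data-index="0">A</article>
<article class="recipe-content-card" data-index="1">B</article>
<article class="other-card" data-index="2">C</article>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword form: class_ matches any element that has this class among its classes
by_keyword = soup.find_all("article", class_="recipe-content-card")
# Dictionary form: the same filter, written with a string key
by_dict = soup.find_all("article", {"class": "recipe-content-card"})
# Hyphenated attribute names only work in the dictionary form
second = soup.find("article", {"data-index": "1"})

print([a.get_text() for a in by_keyword])
print([a.get_text() for a in by_dict])
print(second.get_text())
```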

<h4>get_text() returns the text (the content) enclosed in a tag, with the markup removed.</h4>
<li>returns a string

In [17]:
article.get_text()

'recipeSpice-Marinated and Grilled Lamb ChopsYou don’t need a roaring-hot grill for this lamb chops recipe. Grilling them over moderate heat will allow some of the fat to soften and render.Average user rating0/4Reviews0Percentage of reviewers who will make this recipe again0%View “Spice-Marinated and Grilled Lamb Chops”View RecipeQuick viewCompare Recipe'
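
In the output above, the text of adjacent tags runs together ("recipeSpice-Marinated..."). `get_text` accepts a `separator` string and a `strip` flag that can make the result easier to read. The sketch below shows this on a made-up fragment, parsed with `html.parser` rather than `lxml`.

```python
from bs4 import BeautifulSoup

# Made-up fragment loosely shaped like one search result
html = ('<article><strong>recipe</strong>'
        '<h4><a href="/x">Lamb Chops</a></h4>'
        '<p> Grill over moderate heat. </p></article>')
soup = BeautifulSoup(html, "html.parser")
article = soup.find("article")

joined = article.get_text()                           # pieces concatenated as-is
spaced = article.get_text(separator=" ", strip=True)  # pieces stripped, then joined with a space

print(repr(joined))
print(repr(spaced))
```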

<h4>get returns the value of a tag attribute</h4>
<li>returns a string

In [18]:
recipe_tag = article
recipe_link = recipe_tag.find('a')
print("a tag:",recipe_link)
link_url = recipe_link.get('href')
print("link url:",link_url)
print(type(link_url))

a tag: <a data-reactid="72" href="/recipes/food/views/indian-spice-marinated-and-grilled-lamb-chops">Spice-Marinated and Grilled Lamb Chops</a>
link url: /recipes/food/views/indian-spice-marinated-and-grilled-lamb-chops
<class 'str'>
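
The `href` returned above is relative to the site root, so it has to be combined with the site address before it can be requested. One way to do that, sketched here with the standard library's `urljoin`, also leaves already-absolute links untouched. `absolute_url` and `BASE_URL` are hypothetical names introduced for this example.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.epicurious.com"

def absolute_url(href):
    """Resolve a link found on the site against the site address."""
    # Relative paths are attached to BASE_URL; absolute URLs are returned unchanged
    return urljoin(BASE_URL, href)

print(absolute_url("/recipes/food/views/indian-spice-marinated-and-grilled-lamb-chops"))
print(absolute_url("https://example.com/elsewhere"))
```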


<h1>A function that returns a list containing recipe names, recipe descriptions (if any) and recipe urls</h1>

In [28]:
def get_recipes(keywords):
    """Return a list of (name, link, description) tuples, or None on failure."""
    recipe_list = list()
    import requests
    from bs4 import BeautifulSoup
    url = "http://www.epicurious.com/search/" + keywords
    response = requests.get(url)
    if response.status_code != 200:
        return None
    try:
        results_page = BeautifulSoup(response.content,'lxml')
        recipes = results_page.find_all('article',class_="recipe-content-card")
        for recipe in recipes:
            recipe_link = "http://www.epicurious.com" + recipe.find('a').get('href')
            recipe_name = recipe.find('a').get_text()
            try:
                recipe_description = recipe.find('p',class_='dek').get_text()
            except AttributeError:
                # find returned None: this recipe has no description
                recipe_description = ''
            recipe_list.append((recipe_name,recipe_link,recipe_description))
        return recipe_list
    except Exception:
        # The page layout was not what we expected
        return None

In [48]:
# recipe_name, recipe_link, recipe_description

get_recipes("Tofu chili")

[('Spicy Lemongrass Tofu',
  'http://www.epicurious.com/recipes/food/views/spicy-lemongrass-tofu-233844',
  "Dau hu xa ot\nEditor's note: The recipe and introductory text below are excerpted from Pleasures of the Vietnamese Table by Mai Pham and are part of our story on Lunar New Year.\nWhile traveling on a train one time to the coastal town of Nha Trang, I sat next to an elderly nun. Over the course of our bumpy eight-hour ride, she shared stories of life at the temple and the difficult years after the end of the war when the Communist government cracked down on religious factions. Toward the end of our chat, she pulled out a bag of food she'd prepared for the trip. It was tofu that had been cooked in chilies, lemongrass and la lot, an aromatic leaf also known as pepper leaf. When she gave me a taste, I knew immediately that I had to learn how to make it. This is my rendition of that fabulous dish. Make sure to pat the tofu dry before marinating it and use very fresh lemongrass. I alw

In [39]:
get_recipes('Nothing')

[]
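
Since the site layout may change, it can be useful to separate the parsing step from the network request so that the parsing can be tried on saved or hand-written HTML. The following is a sketch under that assumption, not the notebook's own function: `parse_recipes` is a hypothetical helper, the sample HTML is made up (it only reuses the class names seen above), and `html.parser` replaces `lxml`.

```python
from bs4 import BeautifulSoup

def parse_recipes(html, base_url="http://www.epicurious.com"):
    """Extract (name, link, description) tuples from search-result HTML."""
    page = BeautifulSoup(html, "html.parser")
    recipes = []
    for card in page.find_all("article", class_="recipe-content-card"):
        link_tag = card.find("a")
        if link_tag is None or link_tag.get("href") is None:
            continue  # skip cards without a usable link
        dek = card.find("p", class_="dek")
        # A missing description becomes an empty string
        description = dek.get_text() if dek is not None else ""
        recipes.append((link_tag.get_text(), base_url + link_tag.get("href"), description))
    return recipes

# Made-up sample mirroring the class names seen in the notebook output
sample = """
<article class="recipe-content-card"><h4><a href="/recipes/a">Recipe A</a></h4><p class="dek">Short note.</p></article>
<article class="recipe-content-card"><h4><a href="/recipes/b">Recipe B</a></h4></article>
"""
print(parse_recipes(sample))
```

Passing `response.content` from a successful request to such a function would give the same kind of list as `get_recipes`, with explicit `None` checks in place of the `try`/`except` blocks.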

<h2>Let's write a function that</h2>
<h3>given a recipe link</h3>
<h3>returns a dictionary containing the ingredients and preparation instructions</h3>

In [40]:
recipe_link = "http://www.epicurious.com" + '/recipes/food/views/spicy-lemongrass-tofu-233844'

In [41]:
def get_recipe_info(recipe_link):
    """Return a dict with 'ingredients' and 'preparation' lists (empty dict on failure)."""
    recipe_dict = dict()
    import requests
    from bs4 import BeautifulSoup
    try:
        response = requests.get(recipe_link)
        if response.status_code != 200:
            return recipe_dict
        result_page = BeautifulSoup(response.content,'lxml')
        ingredient_list = list()
        prep_steps_list = list()
        for ingredient in result_page.find_all('li',class_='ingredient'):
            ingredient_list.append(ingredient.get_text())
        for prep_step in result_page.find_all('li',class_='preparation-step'):
            prep_steps_list.append(prep_step.get_text().strip())
        recipe_dict['ingredients'] = ingredient_list
        recipe_dict['preparation'] = prep_steps_list
        return recipe_dict
    except Exception:
        # Request or parsing failed: return whatever was collected (an empty dict)
        return recipe_dict
        

In [42]:
get_recipe_info(recipe_link)

{'ingredients': ['2 lemongrass stalks, outer layers peeled, bottom white part thinly sliced and finely chopped (about 1/4 cup)',
  '1 1/2 tablespoons soy sauce',
  '2 teaspoons chopped Thai bird chilies or another fresh chili',
  '1/2 teaspoon dried chili flakes',
  '1 teaspoon ground turmeric',
  '2 teaspoons sugar',
  '1/2 teaspoon salt',
  '12 ounces tofu, drained, patted dry and cut into 3/4-inch cubes',
  '4 tablespoons vegetable oil',
  '1/2 yellow onion, cut into 1/8-inch slices',
  '2 shallots, thinly sliced',
  '1 teaspoon minced garlic',
  '4 tablespoons chopped roasted peanuts',
  '10 la lot, or pepper leaves, shredded, or 2/3 cup loosely packed Asian basil leaves'],
 'preparation': ['1. Combine the lemongrass, soy sauce, chilies, chili flakes, turmeric, sugar and salt in a bowl. Add the tofu cubes and turn to coat them evenly. Marinate for 30 minutes.',
  '2. Heat half of the oil in a 12-inch nonstick skillet over moderately high heat. Add the onion, shallot and garlic and 

In [43]:
len(get_recipe_info(recipe_link))

2

<h2>Construct a list of dictionaries for all recipes</h2>

In [34]:
def get_all_recipes(keywords):
    results = list()
    # get_recipes returns None when the request or parsing fails
    all_recipes = get_recipes(keywords) or []
    for recipe in all_recipes:
        recipe_dict = get_recipe_info(recipe[1])
        recipe_dict['name'] = recipe[0]
        recipe_dict['description'] = recipe[2]
        results.append(recipe_dict)
    return results

In [21]:
get_all_recipes("Tofu chili")

[{'description': "Dau hu xa ot\nEditor's note: The recipe and introductory text below are excerpted from Pleasures of the Vietnamese Table by Mai Pham and are part of our story on Lunar New Year.\nWhile traveling on a train one time to the coastal town of Nha Trang, I sat next to an elderly nun. Over the course of our bumpy eight-hour ride, she shared stories of life at the temple and the difficult years after the end of the war when the Communist government cracked down on religious factions. Toward the end of our chat, she pulled out a bag of food she'd prepared for the trip. It was tofu that had been cooked in chilies, lemongrass and la lot, an aromatic leaf also known as pepper leaf. When she gave me a taste, I knew immediately that I had to learn how to make it. This is my rendition of that fabulous dish. Make sure to pat the tofu dry before marinating it and use very fresh lemongrass. I always love serving this to friends who think tofu dishes are bland.",
  'ingredients': ['2 le

In [49]:
get_all_recipes("Tofu chili")[0].keys()

dict_keys(['ingredients', 'preparation', 'name', 'description'])

In [50]:
get_all_recipes('Nothing')

[]

<h1>Logging in to a web server</h1>

<h2>Get username and password</h2>
<li>Best to store in a file for reuse
<li>You will need to set up your own login and password and place them in a file called wikidata.txt
<li>Line one of the file should contain your username
<li>Line two your password

In [7]:
with open('wikidata.txt') as f:
    contents = f.read().split('\n')
    username = contents[0]
    password = contents[1]
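
The cell above assumes the file has exactly the expected two lines. A slightly more defensive variant is sketched below: it strips stray whitespace and fails with a clear message if a line is missing. `read_credentials` is a hypothetical helper name, and the demonstration writes placeholder values to a temporary file rather than touching a real `wikidata.txt`.

```python
import os
import tempfile

def read_credentials(path):
    """Return (username, password) from the first two lines of a text file."""
    with open(path, encoding="utf-8") as f:
        # splitlines() drops the line endings; strip() removes stray spaces
        lines = [line.strip() for line in f.read().splitlines()]
    if len(lines) < 2 or not lines[0] or not lines[1]:
        raise ValueError("expected username on line 1 and password on line 2")
    return lines[0], lines[1]

# Demonstration with placeholder values in a temporary file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "wikidata.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("example_user \nexample_password\n")
    username, password = read_credentials(demo_path)

print(username)
```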


<h3>Construct an object that contains the data to be sent to the login page</h3>

In [11]:

payload = {
    'wpName': username,
    'wpPassword': password,
    'wploginattempt': 'Log in',
    'wpEditToken': "+\\",
#    'wpEditToken': "+\",    
    'title': "Special:UserLogin",
    'authAction': "login",
    'force': "",
    'wpForceHttps': "1",
    'wpFromhttp': "1",
    #'wpLoginToken': '', #We need to read this from the page
    }

<h3>Get the value of the login token</h3>

In [9]:
def get_login_token(response):
    soup = BeautifulSoup(response.text, 'lxml')
    token = soup.find('input',{'name':"wpLoginToken"}).get('value')
    return token
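
The token lookup can be tried offline on a small form. The sketch below is a variant that takes an HTML string and returns `None` when the field is missing, since `find` returns `None` in that case and calling `.get` on it would raise an `AttributeError`. `find_login_token` is a hypothetical name, the form and token value are made up, and `html.parser` is used in place of `lxml`. The attribute dictionary is needed here because `name` as a keyword argument would be read as the tag name to search for.

```python
from bs4 import BeautifulSoup

def find_login_token(html):
    """Return the wpLoginToken value from login-page HTML, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Use the dictionary form: the keyword 'name' would mean the tag name
    token_input = soup.find("input", {"name": "wpLoginToken"})
    if token_input is None:
        return None
    return token_input.get("value")

# Made-up form for illustration; the real login page is much larger
sample_form = '<form><input type="hidden" name="wpLoginToken" value="abc123"/></form>'
print(find_login_token(sample_form))
print(find_login_token("<form></form>"))
```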


<h3>Set up a session, log in, and get data</h3>

In [10]:
import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    response = s.get('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page')
    payload['wpLoginToken'] = get_login_token(response)
    #Send the login request
    response_post = s.post('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login',
                           data=payload)
    #Get another page and check if we’re still logged in
    response = s.get('https://en.wikipedia.org/wiki/Special:Watchlist')
    data = BeautifulSoup(response.content,'lxml')
    print(data.find('div',class_='mw-changeslist').get_text())

23 February 2018
(User creation log); 09:37 . . User account Explorewo (talk | contribs) was created ‎


