<h2>Caveat</h2>
Web sites often change the format of their pages, so the examples below may not always work. If they fail, examine the HTML content of the page and rework the examples to match. Most browsers let you view the HTML source (look for a "page source" option), though you might have to turn on developer mode in your browser preferences. For example, on Chrome you need to click the "developer mode" check box under Extensions in the Preferences/Options menu.

Feel free to use the links below to navigate the notebook:
- [**STEP 0**](#import): Import necessary modules (requests and BeautifulSoup)
- [**STEP 1**](#step1): Check the HTTP Request-Response cycle (**response.status_code == 200**)
    - [Model 0](#model0): Set up the BeautifulSoup object (**results_page**)
- [**STEP 2**](#step2): BS4 functions
    - [Model 1](#model1): **find_all()**
    - [Model 2](#model2): **find()**
    - [Model 3](#model3): find() and find_all() qualification by CSS selectors
    - [Model 4](#model4): __get_text()__ returns the marked up text (the content) enclosed in a tag
    - [Model 5](#model5): __get()__ returns the __value__ of a tag attribute
- [**STEP 3**](#step3): Dealing with tags (Example)
    - [Model 6](#model6): Build a new function (Example 1)
    - [Model 7](#model7): Construct a list of dictionaries
- [**STEP 4**](#step4): Logging in to a web server
    - [Model 8](#model8): Get username and password
    - [Model 9](#model9): Construct an object that contains the data to be sent to the login page
    - [Model 10](#model10): Get the value of the login token
    - [Model 11](#model11): Set up a session, log in, and get data

<a id='import'></a>

<h3>Import necessary modules</h3>

In [5]:
import requests
from bs4 import BeautifulSoup

<a id='step1'></a>
<h3>The HTTP Request-Response cycle</h3>

In [2]:
url = "http://www.epicurious.com/search/Tofu Chili"
response = requests.get(url)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")

Success


In [3]:
keywords = input("Please enter the things you want to see in a recipe")
url = "http://www.epicurious.com/search/" + keywords
response = requests.get(url)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")

Please enter the things you want to see in a recipe tofu chili
Success
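Keywords typed by a user can contain spaces or other characters that are not valid in a URL path. `requests` generally encodes these for you, but encoding them explicitly makes the final URL predictable. A minimal sketch, assuming a helper named `build_search_url` (a hypothetical name, not part of the original notebook):

```python
from urllib.parse import quote

BASE_SEARCH_URL = "http://www.epicurious.com/search/"

def build_search_url(keywords):
    """Return the search URL for the given keywords (hypothetical helper)."""
    # Drop stray leading/trailing whitespace, then percent-encode the rest.
    return BASE_SEARCH_URL + quote(keywords.strip())

print(build_search_url(" tofu chili"))
# http://www.epicurious.com/search/tofu%20chili
```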


<a id='model0'></a>
<h3>Set up the BeautifulSoup object</h3>

In [4]:
results_page = BeautifulSoup(response.content,'lxml')
print(results_page.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="app-id=312101965" name="apple-itunes-app"/>
  <title>
   Search | Epicurious.com
  </title>
  <link href="//assets.adobedtm.com" rel="dns-prefetch"/>
  <link href="https://www.google-analytics.com" rel="dns-prefetch"/>
  <link href="//tpc.googlesyndication.com" rel="dns-prefetch"/>
  <link href="//static.parsely.com" rel="dns-prefetch"/>
  <link href="//condenast.demdex.net" rel="dns-prefetch"/>
  <link href="//capture.condenastdigital.com" rel="dns-prefetch"/>
  <link href="//pixel.condenastdigital.com" rel="dns-prefetch"/>
  <link href="//use.typekit.net" rel="dns-prefetch"/>
  <link href="//fonts.typekit.net" rel="dns-prefetch"/>
  <link href="//p.typekit.net" rel="dns-prefetch"/>
  <link href="//assets.epicurious.com" rel="dns-prefetch"/>
  <link href="//ad.doubleclick.net" rel="dns-prefetch"/>
  <link href="//pagead2.googlesyndication.com" rel="dns-prefetch"/>
  <link href="//z.moatads.com" rel="dns-prefetch

<a id='step2'></a>
<h3>BS4 functions</h3>

<a id='model1'></a>
<h4>find_all</h4> finds all instances of a specified tag and returns a **ResultSet (a list)**

In [5]:
all_a_tags = results_page.find_all('a')
print(type(all_a_tags))

<class 'bs4.element.ResultSet'>


In [6]:
all_a_tags

[<a data-reactid="5" href="/" itemprop="url" title="Epicurious">Epicurious</a>,
 <a data-reactid="72" href="/recipes/food/views/spicy-lemongrass-tofu-233844">Spicy Lemongrass Tofu</a>,
 <a class="photo-link" data-reactid="89" href="/recipes/food/views/spicy-lemongrass-tofu-233844"><div class="photo-wrap" data-reactid="90"><div class="component-lazy pending" data-component="Lazy" data-reactid="91"></div></div></a>,
 <a class="view-complete-item" data-reactid="92" href="/recipes/food/views/spicy-lemongrass-tofu-233844" itemprop="url" title="Spicy Lemongrass Tofu"><!-- react-text: 93 -->View “<!-- /react-text --><!-- react-text: 94 -->Spicy Lemongrass Tofu<!-- /react-text --><!-- react-text: 95 -->”<!-- /react-text --></a>,
 <a class="view-complete-item" data-reactid="97" href="/recipes/food/views/spicy-lemongrass-tofu-233844">View Recipe</a>,
 <a class="show-quick-view" data-reactid="99" href="/recipes/food/views/spicy-lemongrass-tofu-233844" title="Spicy Lemongrass Tofu">Quick view</a>,
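Because a `ResultSet` behaves like a list, the usual list operations apply to it. The sketch below uses a small made-up HTML snippet (not the live page) and the built-in `html.parser`, so it runs without a network connection:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; it is not taken from the live site.
html = """
<ul>
  <li><a href="/recipes/one">Recipe One</a></li>
  <li><a href="/recipes/two">Recipe Two</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")
print(len(links))                        # number of matches
print(links[0].get_text())               # indexing works as with a list
hrefs = [a.get("href") for a in links]   # so does iteration
print(hrefs)
print(len(soup.find_all("table")))       # no match gives an empty ResultSet
```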

<a id='model2'></a>
<h4>find</h4> finds the <u>first</u> instance of a specified tag and returns a **bs4 element**

In [7]:
div_tag = results_page.find('div')
print(div_tag)

<div class="header-wrapper" data-reactid="2"><div class="header" data-reactid="3" role="banner"><h2 data-reactid="4" itemtype="https://schema.org/Organization"><a data-reactid="5" href="/" itemprop="url" title="Epicurious">Epicurious</a></h2><div class="search-form-container" data-reactid="6"><form action="/search/" autocomplete="off" data-reactid="7" method="get" role="search"><fieldset data-reactid="8"><button class="submit" data-reactid="9" type="submit">search</button><input autocomplete="off" data-reactid="10" maxlength="120" name="terms" placeholder="Find a Recipe" type="text" value=" tofu chili"/><button class="filter mobile" data-reactid="11">filters</button><button class="filter tablet" data-reactid="12">filter results</button></fieldset></form><div class="ingredient-filters" data-reactid="13"><h3 data-reactid="14">Include/Exclude Ingredients</h3><form class="include-ingredients" data-reactid="15"><fieldset data-reactid="16"><input aria-label="Include ingredients" data-reactid

In [8]:
type(div_tag)


bs4.element.Tag

In [9]:
type(results_page)

bs4.BeautifulSoup

In [10]:
print(div_tag.prettify())

<div class="header-wrapper" data-reactid="2">
 <div class="header" data-reactid="3" role="banner">
  <h2 data-reactid="4" itemtype="https://schema.org/Organization">
   <a data-reactid="5" href="/" itemprop="url" title="Epicurious">
    Epicurious
   </a>
  </h2>
  <div class="search-form-container" data-reactid="6">
   <form action="/search/" autocomplete="off" data-reactid="7" method="get" role="search">
    <fieldset data-reactid="8">
     <button class="submit" data-reactid="9" type="submit">
      search
     </button>
     <input autocomplete="off" data-reactid="10" maxlength="120" name="terms" placeholder="Find a Recipe" type="text" value=" tofu chili"/>
     <button class="filter mobile" data-reactid="11">
      filters
     </button>
     <button class="filter tablet" data-reactid="12">
      filter results
     </button>
    </fieldset>
   </form>
   <div class="ingredient-filters" data-reactid="13">
    <h3 data-reactid="14">
     Include/Exclude Ingredients
    </h3>
    <f

In [11]:
div_tag.find('div').find('div').find('div')

<div class="ingredient-filters" data-reactid="13"><h3 data-reactid="14">Include/Exclude Ingredients</h3><form class="include-ingredients" data-reactid="15"><fieldset data-reactid="16"><input aria-label="Include ingredients" data-reactid="17" placeholder="Include ingredients:" type="text"/><button data-reactid="18">include</button></fieldset></form><form class="exclude-ingredients" data-reactid="19"><fieldset data-reactid="20"><input aria-label="Exclude ingredients" data-reactid="21" placeholder="Exclude ingredients:" type="text"/><button data-reactid="22">exclude</button></fieldset></form></div>

<h4>bs4 functions can be recursively applied on elements</h4>

In [12]:
div_tag.find('a')

<a data-reactid="5" href="/" itemprop="url" title="Epicurious">Epicurious</a>
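One thing to watch for when chaining `find` calls: if nothing matches, `find` returns `None`, and calling `find` on `None` raises an `AttributeError`. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = '<div id="outer"><div id="inner"><a href="/">Home</a></div></div>'
soup = BeautifulSoup(html, "html.parser")

first_div = soup.find("div")
print(first_div.get("id"))                     # the first div in document order

# find() applied to a Tag searches only that tag's descendants.
inner_link = first_div.find("div").find("a")
print(inner_link.get_text())

missing = soup.find("table")                   # there is no <table> in this markup
print(missing is None)

# Guard before chaining; otherwise None.find(...) raises AttributeError.
cell = missing.find("td") if missing is not None else None
```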

<a id='model3'></a>
Both __find__ and __find_all__ can be qualified by attribute values such as a CSS class
<li>using selector=value
<li>using a dictionary

In [13]:
#When using this method and looking for 'class' use 'class_' (because class is a reserved word in python)
#Note that we get a list back because find_all returns a list
results_page.find_all('article',class_="recipe-content-card")

[<article class="recipe-content-card" data-has-quickview="false" data-index="0" data-reactid="68" itemscope="" itemtype="https://schema.org/Recipe"><header class="summary" data-reactid="69"><strong class="tag" data-reactid="70">recipe</strong><h4 class="hed" data-reactid="71" data-truncate="3" itemprop="name"><a data-reactid="72" href="/recipes/food/views/spicy-lemongrass-tofu-233844">Spicy Lemongrass Tofu</a></h4><p class="dek" data-reactid="73" data-truncate="1">Dau hu xa ot
 Editor's note: The recipe and introductory text below are excerpted from Pleasures of the Vietnamese Table by Mai Pham and are part of our story on Lunar New Year.
 While traveling on a train one time to the coastal town of Nha Trang, I sat next to an elderly nun. Over the course of our bumpy eight-hour ride, she shared stories of life at the temple and the difficult years after the end of the war when the Communist government cracked down on religious factions. Toward the end of our chat, she pulled out a bag o

In [14]:
#Since we're using a string as the key, the fact that class is a reserved word is not a problem
#We get an element back because find returns an element
results_page.find('article',{'class':'recipe-content-card'})

<article class="recipe-content-card" data-has-quickview="false" data-index="0" data-reactid="68" itemscope="" itemtype="https://schema.org/Recipe"><header class="summary" data-reactid="69"><strong class="tag" data-reactid="70">recipe</strong><h4 class="hed" data-reactid="71" data-truncate="3" itemprop="name"><a data-reactid="72" href="/recipes/food/views/spicy-lemongrass-tofu-233844">Spicy Lemongrass Tofu</a></h4><p class="dek" data-reactid="73" data-truncate="1">Dau hu xa ot
Editor's note: The recipe and introductory text below are excerpted from Pleasures of the Vietnamese Table by Mai Pham and are part of our story on Lunar New Year.
While traveling on a train one time to the coastal town of Nha Trang, I sat next to an elderly nun. Over the course of our bumpy eight-hour ride, she shared stories of life at the temple and the difficult years after the end of the war when the Communist government cracked down on religious factions. Toward the end of our chat, she pulled out a bag of f
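The keyword form and the dictionary form are two spellings of the same attribute filter. If you prefer real CSS selector syntax, Beautiful Soup also offers `select()` (and `select_one()` for the first match). A short sketch on made-up markup that reuses the class name from above:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = """
<article class="recipe-content-card" data-index="0"><a href="/r/1">First</a></article>
<article class="recipe-content-card" data-index="1"><a href="/r/2">Second</a></article>
<article class="ad-card"><a href="/ad">Ad</a></article>
"""
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all("article", class_="recipe-content-card")
by_dict = soup.find_all("article", {"class": "recipe-content-card"})
by_selector = soup.select("article.recipe-content-card")
print(len(by_keyword), len(by_dict), len(by_selector))

# A dictionary also works for attributes that are not valid Python names.
second = soup.find("article", {"data-index": "1"})
print(second.find("a").get_text())
```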

<a id='model4'></a>
__get_text()__ returns the marked up text (the content) enclosed in a tag.
<li>returns a string

In [15]:
results_page.find('article',{'class':'recipe-content-card'}).get_text()

"recipeSpicy Lemongrass TofuDau hu xa ot\nEditor's note: The recipe and introductory text below are excerpted from Pleasures of the Vietnamese Table by Mai Pham and are part of our story on Lunar New Year.\nWhile traveling on a train one time to the coastal town of Nha Trang, I sat next to an elderly nun. Over the course of our bumpy eight-hour ride, she shared stories of life at the temple and the difficult years after the end of the war when the Communist government cracked down on religious factions. Toward the end of our chat, she pulled out a bag of food she'd prepared for the trip. It was tofu that had been cooked in chilies, lemongrass and la lot, an aromatic leaf also known as pepper leaf. When she gave me a taste, I knew immediately that I had to learn how to make it. This is my rendition of that fabulous dish. Make sure to pat the tofu dry before marinating it and use very fresh lemongrass. I always love serving this to friends who think tofu dishes are bland.Average user rat

In [16]:
print(results_page.find('article',{'class':'recipe-content-card'}).get_text())

recipeSpicy Lemongrass TofuDau hu xa ot
Editor's note: The recipe and introductory text below are excerpted from Pleasures of the Vietnamese Table by Mai Pham and are part of our story on Lunar New Year.
While traveling on a train one time to the coastal town of Nha Trang, I sat next to an elderly nun. Over the course of our bumpy eight-hour ride, she shared stories of life at the temple and the difficult years after the end of the war when the Communist government cracked down on religious factions. Toward the end of our chat, she pulled out a bag of food she'd prepared for the trip. It was tofu that had been cooked in chilies, lemongrass and la lot, an aromatic leaf also known as pepper leaf. When she gave me a taste, I knew immediately that I had to learn how to make it. This is my rendition of that fabulous dish. Make sure to pat the tofu dry before marinating it and use very fresh lemongrass. I always love serving this to friends who think tofu dishes are bland.Average user rating
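Notice in the output above that text from neighboring tags runs together ("recipeSpicy Lemongrass TofuDau hu xa ot"). `get_text()` accepts `separator` and `strip` arguments that help with this. A small sketch on made-up markup shaped like the recipe card:

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure of a recipe card.
html = (
    '<article><strong>recipe</strong>'
    '<h4><a href="/r/1">Spicy Lemongrass Tofu</a></h4>'
    '<p class="dek">Dau hu xa ot</p></article>'
)
card = BeautifulSoup(html, "html.parser").find("article")

print(card.get_text())
# recipeSpicy Lemongrass TofuDau hu xa ot
print(card.get_text(separator=" | ", strip=True))
# recipe | Spicy Lemongrass Tofu | Dau hu xa ot
```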

<a id='model5'></a>
__get()__ returns the __value__ of a tag attribute
<li>returns a string

In [17]:
recipe_tag = results_page.find('article',{'class':'recipe-content-card'})
recipe_link = recipe_tag.find('a')
print("a tag:",recipe_link)
link_url = recipe_link.get('href')
print("link url:",link_url)
print(type(link_url))

a tag: <a data-reactid="72" href="/recipes/food/views/spicy-lemongrass-tofu-233844">Spicy Lemongrass Tofu</a>
link url: /recipes/food/views/spicy-lemongrass-tofu-233844
<class 'str'>
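Two details are worth knowing here. First, `get()` returns `None` when the attribute is missing, whereas indexing with `tag['name']` raises a `KeyError`. Second, the `href` above is relative, and `urllib.parse.urljoin` from the standard library is one way to make it absolute. A sketch on a single made-up tag:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

html = '<a href="/recipes/food/views/spicy-lemongrass-tofu-233844">Spicy Lemongrass Tofu</a>'
link = BeautifulSoup(html, "html.parser").find("a")

print(link.get("href"))     # value of an attribute that exists
print(link.get("title"))    # missing attribute: get() returns None
# link["title"] would raise KeyError instead

# urljoin turns the relative href into an absolute URL.
absolute = urljoin("http://www.epicurious.com", link.get("href"))
print(absolute)
```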


<a id='step3'></a>
<h1>A function that returns a list containing recipe names, recipe descriptions (if any) and recipe urls</h1>

In [18]:
def get_recipes(keywords):
    recipe_list = list()
    import requests
    from bs4 import BeautifulSoup
    url = "http://www.epicurious.com/search/" + keywords
    response = requests.get(url)
    if not response.status_code == 200:
        return None
    try:
        results_page = BeautifulSoup(response.content,'lxml')
        recipes = results_page.find_all('article',class_="recipe-content-card")
        for recipe in recipes:
#            recipe_link = "http://www.epicurious.com" + recipe.find('a').get('href')
#            recipe_name = recipe.find('a').get_text()
#            try:
#                recipe_description = recipe.find('p',class_='dek').get_text()
#            except:
#                recipe_description = ''
#            recipe_list.append((recipe_name,recipe_link,recipe_description))
            recipe_list.append(recipes)
        return recipe_list
    except:
        return None

In [19]:
get_recipes("Tofu chili")

[[<article class="recipe-content-card" data-has-quickview="false" data-index="0" data-reactid="68" itemscope="" itemtype="https://schema.org/Recipe"><header class="summary" data-reactid="69"><strong class="tag" data-reactid="70">recipe</strong><h4 class="hed" data-reactid="71" data-truncate="3" itemprop="name"><a data-reactid="72" href="/recipes/food/views/spicy-lemongrass-tofu-233844">Spicy Lemongrass Tofu</a></h4><p class="dek" data-reactid="73" data-truncate="1">Dau hu xa ot
  Editor's note: The recipe and introductory text below are excerpted from Pleasures of the Vietnamese Table by Mai Pham and are part of our story on Lunar New Year.
  While traveling on a train one time to the coastal town of Nha Trang, I sat next to an elderly nun. Over the course of our bumpy eight-hour ride, she shared stories of life at the temple and the difficult years after the end of the war when the Communist government cracked down on religious factions. Toward the end of our chat, she pulled out a ba

We can see that the description sits inside a __paragraph__ tag with __class = dek__, so we can always get the description by looking for a __paragraph tag__ with the __class = dek__. We also see, as we saw before, that the link to the recipe detail page is inside an __anchor tag__ (an __a-tag__), which contains both the link and the name of the recipe.

So by finding the first anchor tag in our recipe-content-card article and the first paragraph tag with class dek, we can get the _name, the link, and the description_. Let's add these three things to our function.

In [20]:
def get_recipes(keywords):
    recipe_list = list()
    import requests
    from bs4 import BeautifulSoup
    url = "http://www.epicurious.com/search/" + keywords
    response = requests.get(url)
    if not response.status_code == 200:
        return None
    try:
        results_page = BeautifulSoup(response.content,'lxml')
        recipes = results_page.find_all('article',class_="recipe-content-card")
        for recipe in recipes:
            recipe_link = "http://www.epicurious.com" + recipe.find('a').get('href')
            recipe_name = recipe.find('a').get_text()
            try:
                recipe_description = recipe.find('p',class_='dek').get_text()
            except AttributeError:
                # find() returned None: this card has no description paragraph
                recipe_description = ''
            recipe_list.append((recipe_name,recipe_link,recipe_description))
        return recipe_list
    except:
        return None

In [21]:
get_recipes("Tofu chili")

[('Spicy Lemongrass Tofu',
  'http://www.epicurious.com/recipes/food/views/spicy-lemongrass-tofu-233844',
  "Dau hu xa ot\nEditor's note: The recipe and introductory text below are excerpted from Pleasures of the Vietnamese Table by Mai Pham and are part of our story on Lunar New Year.\nWhile traveling on a train one time to the coastal town of Nha Trang, I sat next to an elderly nun. Over the course of our bumpy eight-hour ride, she shared stories of life at the temple and the difficult years after the end of the war when the Communist government cracked down on religious factions. Toward the end of our chat, she pulled out a bag of food she'd prepared for the trip. It was tofu that had been cooked in chilies, lemongrass and la lot, an aromatic leaf also known as pepper leaf. When she gave me a taste, I knew immediately that I had to learn how to make it. This is my rendition of that fabulous dish. Make sure to pat the tofu dry before marinating it and use very fresh lemongrass. I alw

In [22]:
get_recipes('Nothing')

[]
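Since web pages change, it can help to keep the parsing separate from the download, so the parsing can be checked against saved or hand-written HTML. The sketch below is one possible arrangement, not the notebook's own method: `parse_recipes` is a hypothetical helper, and the sample markup is made up to mimic the class names used above.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "http://www.epicurious.com"

def parse_recipes(html):
    """Return (name, link, description) tuples from search-result HTML."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for card in soup.find_all("article", class_="recipe-content-card"):
        link_tag = card.find("a")
        if link_tag is None:          # skip cards without a link
            continue
        dek = card.find("p", class_="dek")
        description = dek.get_text() if dek is not None else ""
        link = urljoin(BASE_URL, link_tag.get("href"))
        found.append((link_tag.get_text(), link, description))
    return found

# Made-up sample markup; the second card has no description.
sample = """
<article class="recipe-content-card"><h4><a href="/recipes/a">Recipe A</a></h4><p class="dek">Tasty.</p></article>
<article class="recipe-content-card"><h4><a href="/recipes/b">Recipe B</a></h4></article>
"""
for row in parse_recipes(sample):
    print(row)
```

With this split, the function that calls `requests.get` only needs to pass `response.content` to the parser.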

<a id='model6'></a>
<h2>Let's write a new function!</h2>

Given a recipe link, the function returns a dictionary containing the ingredients and preparation instructions.

In [23]:
recipe_link = "http://www.epicurious.com" + '/recipes/food/views/spicy-lemongrass-tofu-233844'

In [24]:
def get_recipe_info(recipe_link):
    recipe_dict = dict()
    import requests
    from bs4 import BeautifulSoup
    try:
        response = requests.get(recipe_link)
        if not response.status_code == 200:
            return recipe_dict
        result_page = BeautifulSoup(response.content,'lxml')
        ingredient_list = list()
        prep_steps_list = list()
        for ingredient in result_page.find_all('li',class_='ingredient'):
            ingredient_list.append(ingredient.get_text())
        for prep_step in result_page.find_all('li',class_='preparation-step'):
            prep_steps_list.append(prep_step.get_text().strip())
        recipe_dict['ingredients'] = ingredient_list
        recipe_dict['preparation'] = prep_steps_list
        return recipe_dict
    except:
        return recipe_dict
        

In [25]:
get_recipe_info(recipe_link)

{'ingredients': ['2 lemongrass stalks, outer layers peeled, bottom white part thinly sliced and finely chopped (about 1/4 cup)',
  '1 1/2 tablespoons soy sauce',
  '2 teaspoons chopped Thai bird chilies or another fresh chili',
  '1/2 teaspoon dried chili flakes',
  '1 teaspoon ground turmeric',
  '2 teaspoons sugar',
  '1/2 teaspoon salt',
  '12 ounces tofu, drained, patted dry and cut into 3/4-inch cubes',
  '4 tablespoons vegetable oil',
  '1/2 yellow onion, cut into 1/8-inch slices',
  '2 shallots, thinly sliced',
  '1 teaspoon minced garlic',
  '4 tablespoons chopped roasted peanuts',
  '10 la lot, or pepper leaves, shredded, or 2/3 cup loosely packed Asian basil leaves'],
 'preparation': ['1. Combine the lemongrass, soy sauce, chilies, chili flakes, turmeric, sugar and salt in a bowl. Add the tofu cubes and turn to coat them evenly. Marinate for 30 minutes.',
  '2. Heat half of the oil in a 12-inch nonstick skillet over moderately high heat. Add the onion, shallot and garlic and 

<a id='model7'></a>
<h2>Construct a list of dictionaries for all recipes</h2>

In [26]:
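def get_all_recipes(keywords):
    results = list()
    # get_recipes returns None on a failed request, so fall back to an empty list
    all_recipes = get_recipes(keywords) or []
    for recipe in all_recipes:
        # Each recipe is a (name, link, description) tuple
        recipe_dict = get_recipe_info(recipe[1])
        recipe_dict['name'] = recipe[0]
        recipe_dict['description'] = recipe[2]
        results.append(recipe_dict)
    return results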

In [27]:
get_all_recipes("Tofu chili")

[{'description': "Dau hu xa ot\nEditor's note: The recipe and introductory text below are excerpted from Pleasures of the Vietnamese Table by Mai Pham and are part of our story on Lunar New Year.\nWhile traveling on a train one time to the coastal town of Nha Trang, I sat next to an elderly nun. Over the course of our bumpy eight-hour ride, she shared stories of life at the temple and the difficult years after the end of the war when the Communist government cracked down on religious factions. Toward the end of our chat, she pulled out a bag of food she'd prepared for the trip. It was tofu that had been cooked in chilies, lemongrass and la lot, an aromatic leaf also known as pepper leaf. When she gave me a taste, I knew immediately that I had to learn how to make it. This is my rendition of that fabulous dish. Make sure to pat the tofu dry before marinating it and use very fresh lemongrass. I always love serving this to friends who think tofu dishes are bland.",
  'ingredients': ['2 le

<a id='step4'></a>
<h1>Logging in to a web server</h1>

<a id='model8'></a>
<h2>Get username and password</h2>
<li>Best to store in a file for reuse
<li>You will need to set up your own login and password and place them in a file called wikidata.txt
<li>Line one of the file should contain your username
<li>Line two your password

In [28]:
with open('wikidata.txt') as f:
    contents = f.read().split('\n')
    username = contents[0]
    password = contents[1]
print(username,password)

VSerpak NewPeriod
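Printing the password is handy for a quick check, but it leaves the password in the notebook output. A possible alternative, assuming the same two-line file layout: the helper name `read_credentials` and the placeholder values below are hypothetical, and the demo writes a throwaway file so that it does not touch your real `wikidata.txt`.

```python
from pathlib import Path
import tempfile

def read_credentials(path):
    """Read the username (line 1) and password (line 2) from a text file."""
    lines = Path(path).read_text().splitlines()
    if len(lines) < 2:
        raise ValueError("expected a username line and a password line")
    return lines[0].strip(), lines[1].strip()

# Demonstration with a temporary file and placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "wikidata.txt"
    demo_file.write_text("example_user\nexample_password\n")
    demo_user, demo_password = read_credentials(demo_file)

# Show the username but mask the password.
print(demo_user, "*" * len(demo_password))
```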


<a id='model9'></a>
<h3>Construct an object that contains the data to be sent to the login page</h3>

In [29]:

payload = {
    'wpName': username,
    'wpPassword': password,
    'wploginattempt': 'Log in',
    'wpEditToken': "+\\",
    'title': "Special:UserLogin",
    'authAction': "login",
    'force': "",
    'wpForceHttps': "1",
    'wpFromhttp': "1",
    #'wpLoginToken': '', # We need to read this from the page
    }

<a id='model10'></a>
<h3>Get the value of the login token</h3>

In [30]:
def get_login_token(response):
    soup = BeautifulSoup(response.text, 'lxml')
    token = soup.find('input',{'name':"wpLoginToken"}).get('value')
    return token
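If the login page changes and the hidden field is no longer there, `find` returns `None` and the `.get('value')` call fails with an `AttributeError`. The variant below is a sketch that returns `None` in that case; `extract_login_token` is a hypothetical name, and the form and token value are made-up placeholders rather than real Wikipedia markup.

```python
from bs4 import BeautifulSoup

def extract_login_token(html):
    """Return the wpLoginToken value from login-page HTML, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    field = soup.find("input", {"name": "wpLoginToken"})
    return field.get("value") if field is not None else None

# Made-up form; the token value is a placeholder, not a real token.
sample_form = """
<form>
  <input name="wpName" type="text"/>
  <input name="wpLoginToken" type="hidden" value="placeholder-token"/>
</form>
"""
print(extract_login_token(sample_form))
print(extract_login_token("<form></form>"))
```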


<a id='model11'></a>
<h3>Set up a session, log in, and get data</h3>

In [31]:
with requests.session() as s:
    response = s.get('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page')
    payload['wpLoginToken'] = get_login_token(response)
    #Send the login request
    response_post = s.post('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login',
                           data=payload)
    #Get another page and check if we’re still logged in
    response = s.get('https://en.wikipedia.org/wiki/Special:Watchlist')
    data = BeautifulSoup(response.content,'lxml')
    print(data.find('div',class_='mw-changeslist').get_text())

25 February 2018
(User creation log); 14:50 . . User account VSerpak (talk | contribs) was created ‎
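If the login did not succeed, the watchlist page will probably not contain the `mw-changeslist` div, and the final `print` above would fail with an `AttributeError`. The sketch below checks for the div first. It assumes that a missing div means we are not logged in, which may not hold if the page layout changes; `watchlist_text` is a hypothetical helper and both sample pages are made up.

```python
from bs4 import BeautifulSoup

def watchlist_text(html):
    """Return the text of the change list, or None if it is not on the page."""
    soup = BeautifulSoup(html, "html.parser")
    changes = soup.find("div", class_="mw-changeslist")
    if changes is None:
        return None
    return changes.get_text(separator=" ", strip=True)

# Made-up pages standing in for a logged-in and a logged-out response.
logged_in_page = '<div class="mw-changeslist"><h4>25 February 2018</h4><ul><li>User account created</li></ul></div>'
logged_out_page = '<div class="login-form">Please log in</div>'

print(watchlist_text(logged_in_page))
print(watchlist_text(logged_out_page))
```

Inside the session block you could pass `response.content` to this helper and treat `None` as a sign that the login needs another look.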


