# Coffee Availability By Roaster - Web Scraping Project
## Using BeautifulSoup to gather information about coffees for sale from select roasters

The goal of this project is to build a DataFrame of information compiled from selected U.S. coffee roaster websites, so that the available options can be compared side by side.

Once the following information is in a pandas DataFrame, coffees can be searched and filtered by origin, price, and other attributes.

 - Information I hope to include:
     + Roaster Name
     + Coffee Name
     + Country/Countries of Origin
     + Description
     + Tasting Notes
     + Variety
     + Process
     + Size Options
     + Avg Price Per Pound or Ounce
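
As a rough sketch of the kind of filtering this table should allow, the snippet below selects coffees whose origin mentions Ethiopia. The three rows are abbreviated from the Onyx results gathered later in this notebook; the name `df_example` is made up for illustration.

```python
import pandas as pd

# Three rows abbreviated from the Onyx results later in this notebook
df_example = pd.DataFrame({
    'Name': ['Southern Weather', 'Tropical Weather', 'Monarch'],
    'Origin': ['Colombia, Ethiopia', 'Ethiopia', None],
    'Process': ['Washed', 'Natural, Washed', 'Washed, Natural'],
})

# na=False treats a missing origin as a non-match instead of propagating NaN
ethiopian = df_example[df_example['Origin'].str.contains('Ethiopia', na=False)]
print(ethiopian['Name'].tolist())
```

Substring matching is a simple starting point; splitting the origin string into a list of countries would allow exact matches.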

First, import the necessary libraries:

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

A few coffee roasters to start with:

In [2]:
onyx_url = 'https://onyxcoffeelab.com/collections/coffee'
equator_url = 'https://www.equatorcoffees.com/collections/coffees'
ruby_url = 'https://rubycoffeeroasters.com/collections/coffee'

This first function compiles a list of product URLs from one of the above websites; each URL points to the page for a single coffee.

In [3]:
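def gather_coffee_links(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # href=True skips anchors without an href, which would otherwise raise KeyError
    coffee_links = soup.find_all('a', href=True)

    # Prefix each href with the collection URL (assumes hrefs are site-relative paths)
    links = []
    for link in coffee_links:
        links.append(url + link["href"])

    # Keep only links that point to product pages
    coffee_link_list = []
    for link in links:
        if '/product' in link:
            coffee_link_list.append(link)

    return coffee_link_list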
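
One possible refinement, sketched here with only the standard library: `urllib.parse.urljoin` resolves each href against the page URL, so relative and absolute links both come out well-formed, and a set drops repeated links. The helper name `normalize_product_links` and the sample hrefs are hypothetical and not taken from any roaster's site; the filter uses `'/products/'`, which is slightly stricter than the `'/product'` check above.

```python
from urllib.parse import urljoin

def normalize_product_links(base_url, hrefs):
    """Resolve hrefs against base_url, keep product pages, drop duplicates."""
    seen = set()
    product_links = []
    for href in hrefs:
        full_url = urljoin(base_url, href)  # handles relative and absolute hrefs
        if '/products/' in full_url and full_url not in seen:
            seen.add(full_url)
            product_links.append(full_url)
    return product_links

# Made-up hrefs for illustration only
sample_hrefs = [
    '/products/geometry',
    '/pages/about',
    '/products/geometry',
    'https://onyxcoffeelab.com/products/monarch',
]
links = normalize_product_links('https://onyxcoffeelab.com/collections/coffee', sample_hrefs)
print(links)
```

Note that `urljoin` resolves a path starting with `/` against the site root, so the resulting URLs differ from the concatenated ones produced by `gather_coffee_links`.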

First, I want to gather information from Onyx Coffee Lab in Arkansas.

In [4]:
onyx_links = gather_coffee_links(onyx_url)

With a list of pages that include the information about each coffee, I can begin to extract the desired data. The function below scrapes a page for coffee attributes:

In [5]:
def get_coffee_attrs(url):
    """Scrape one product page and return its attributes as a dict."""
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    coffee_data = {}
    # find() returns None when a tag is missing, so .text raises AttributeError
    try:
        name = soup.find('h1').text.strip()
    except AttributeError:
        name = None
    try:
        coffee_description = soup.find('div', {'class': 'main-blurb'}).getText().strip()
    except AttributeError:
        coffee_description = None
    # Each 'a-feature' div holds a label/value pair such as "Process:" / "Washed"
    for div in soup.find_all('div', {'class': 'a-feature'}):
        label = div.find('div', {'class': 'label'})
        label_value = div.find('div', {'class': 'value'})
        if label is not None and label_value is not None:
            coffee_data[label.text] = label_value.text
    coffee_data.update({'Name': name, 'Description': coffee_description})
    return coffee_data

At the moment, I'm having difficulty extracting the price. While I experiment, I have moved sizes and prices into a separate function, shown below:

In [6]:
def get_size_and_price():
    # Hard-coded product page while the price extraction is still experimental
    r = requests.get('https://onyxcoffeelab.com/products/southern-weather?variant=31862699917410')
    soup = BeautifulSoup(r.text, 'html.parser')
    coffee_data = {}
    try:
        size = soup.find('div', {'class':'size-select option-type'}).text.strip()
        sizes = size.split('\n')
    except AttributeError:
        # find() returned None; define sizes so the update() below cannot raise NameError
        sizes = None
    # select() returns an empty list when nothing matches, so no try/except is needed
    price = soup.select('div.price.variant-price')
    coffee_data.update({'Sizes Available' : sizes, 'Prices': price})
    print(coffee_data)

In [7]:
get_size_and_price()

{'Sizes Available': ['4oz', '10oz', '2lbs', '5lbs'], 'Prices': [<div class="price variant-price">
                    
                    $-.-
                    
                </div>]}


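As the output above shows, the price element holds only a placeholder ($-.-). On the website the price changes when you select a different size, which suggests the value is filled in by JavaScript after the page loads rather than being present in the HTML that requests downloads. That would explain why find and find_all return no usable value or an empty list.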
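
Once real prices can be scraped, the size labels above could be turned into a price per ounce. The following is a minimal sketch using only the standard library; the helper names (`size_to_ounces`, `parse_price`, `price_per_ounce`) are hypothetical, and the dollar amount in the usage line is an invented figure, not an actual Onyx price.

```python
import re

OUNCES_PER_POUND = 16

def size_to_ounces(size):
    """Convert a label such as '10oz' or '2lbs' to ounces; None if unrecognized."""
    match = re.fullmatch(r'\s*(\d+(?:\.\d+)?)\s*(oz|lbs?)\s*', size, flags=re.IGNORECASE)
    if match is None:
        return None
    amount = float(match.group(1))
    unit = match.group(2).lower()
    return amount if unit == 'oz' else amount * OUNCES_PER_POUND

def parse_price(text):
    """Return the dollar amount in text as a float, or None for a placeholder like '$-.-'."""
    match = re.search(r'\$\s*(\d+(?:\.\d{1,2})?)', text)
    return float(match.group(1)) if match else None

def price_per_ounce(size, price_text):
    """Combine the two helpers; None if either the size or the price is unusable."""
    ounces = size_to_ounces(size)
    price = parse_price(price_text)
    if ounces is None or price is None:
        return None
    return price / ounces

# Invented price, for illustration only
print(price_per_ounce('10oz', '$20.00'))
# The placeholder scraped above yields None rather than a misleading number
print(price_per_ounce('10oz', '$-.-'))
```

Returning None for the placeholder keeps unparsed prices visible as missing values instead of silently becoming zero.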

Moving forward with what does work: we can collect coffee attribute data from all the Onyx coffee links:

In [8]:
coffee_data = []
for link in onyx_links:
    coffee_data.append(get_coffee_attrs(link))

This data is now converted into a pandas DataFrame for easier manipulation:

In [9]:
df_coffee_info = pd.DataFrame(coffee_data)
df_coffee_info

| | Variety: | Process: | Cup: | Caffeine: | Name | Description | Origin: | Elevation: |
|---|---|---|---|---|---|---|---|---|
| 0 | Wazuka (Uji), Kyoto, Japan | Shaded 21 days, Processed to tencha, Stone milled | Almond, Pomelo, White Sugar, Brisk Freshness, ... | mg | Uji Matcha | This exceptional matcha powder comes from a sm... | | |
| 1 | | | | | 404 | | | |
| 2 | | | | | | | | |
| 3 | | | | | Finest coffee in the worldevery month for the ... | | | |
| 4 | | Washed | Milk Chocolate, Plum, Candied Walnuts, Juicy &... | | Southern Weather | Southern Weather embodies everything we love a... | Colombia, Ethiopia | 1850 |
| 5 | | Washed | Berries, Stone Fruit, Earl Grey, Honeysuckle, ... | | Geometry | Geometry has been defined as "describing space... | Colombia, Ethiopia | 1950 - 2100 |
| 6 | Colombia, Ethiopia | Washed, Natural | Dark Chocolate, Molasses, Red Wine, Dried Berr... | | Monarch | Monarch is our most developed roast that conve... | | 1800 |
| 7 | | Natural, Washed | Mixed Berries, Sweet Tea, Raw Honey, Plum | | Tropical Weather | Tropical Weather is a seasonal blend that cele... | Ethiopia | 1900 |
| 8 | Colombia, Ethiopia | Washed, Raised-Bed Dreid | Brown Sugar, Cocoa, Silky, Floral, Peach | | Power Nap | OK, so you need a quick burst of energy, but y... | | 1950 - 2000 |
| 9 | | Washed, Patio Dried | Cocoa, Dates, Brown Sugar, Stone Fruit, Creamy | | Cold Brew | This coffee is intentionally sourced and roast... | Colombia, Ethiopia | 1850 |


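A few quick steps to tidy up the table: drop the first four rows, which are not coffees, remove the Caffeine column, and strip the colons from the column names.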

In [10]:
df_coffee_info = df_coffee_info.drop([0,1,2,3], axis=0).reset_index(drop = True)
df_coffee_info = df_coffee_info.drop(['Caffeine:'], axis=1)
df_coffee_info.columns = df_coffee_info.columns.str.replace(':','')
df_coffee_info.head()

| | Variety | Process | Cup | Name | Description | Origin | Elevation |
|---|---|---|---|---|---|---|---|
| 0 | | Washed | Milk Chocolate, Plum, Candied Walnuts, Juicy &... | Southern Weather | Southern Weather embodies everything we love a... | Colombia, Ethiopia | 1850 |
| 1 | | Washed | Berries, Stone Fruit, Earl Grey, Honeysuckle, ... | Geometry | Geometry has been defined as "describing space... | Colombia, Ethiopia | 1950 - 2100 |
| 2 | Colombia, Ethiopia | Washed, Natural | Dark Chocolate, Molasses, Red Wine, Dried Berr... | Monarch | Monarch is our most developed roast that conve... | | 1800 |
| 3 | | Natural, Washed | Mixed Berries, Sweet Tea, Raw Honey, Plum | Tropical Weather | Tropical Weather is a seasonal blend that cele... | Ethiopia | 1900 |
| 4 | Colombia, Ethiopia | Washed, Raised-Bed Dreid | Brown Sugar, Cocoa, Silky, Floral, Peach | Power Nap | OK, so you need a quick burst of energy, but y... | | 1950 - 2000 |
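
Dropping rows by position works for this run but would break if the site's link order changed. One alternative, sketched below, is to filter the scraped dicts before building the DataFrame by requiring a field that coffees are expected to have. Using `'Elevation:'` as that field is an assumption based on the output above, where only the non-coffee rows lack it; the helper name `keep_coffee_records` is hypothetical.

```python
def keep_coffee_records(records, required_key='Elevation:'):
    """Keep only dicts with a non-empty value for required_key."""
    return [rec for rec in records if rec.get(required_key)]

# Abbreviated records modeled on the scraped output above
records = [
    {'Name': 'Uji Matcha', 'Caffeine:': 'mg'},
    {'Name': '404'},
    {'Name': 'Southern Weather', 'Process:': 'Washed', 'Elevation:': '1850'},
    {'Name': 'Geometry', 'Process:': 'Washed', 'Elevation:': '1950 - 2100'},
]
coffees = keep_coffee_records(records)
print([rec['Name'] for rec in coffees])
```

The filtered list can then be passed to `pd.DataFrame` in the same way as the unfiltered `coffee_data` list.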


Eventually, I want to gather data from several roasters and concatenate the tables, so each row needs to carry the roaster name, which here is Onyx.

In [13]:
df_coffee_info.insert(0, 'Roaster', 'Onyx')
df_coffee_info

| | Roaster | Variety | Process | Cup | Name | Description | Origin | Elevation |
|---|---|---|---|---|---|---|---|---|
| 0 | Onyx | | Washed | Milk Chocolate, Plum, Candied Walnuts, Juicy &... | Southern Weather | Southern Weather embodies everything we love a... | Colombia, Ethiopia | 1850 |
| 1 | Onyx | | Washed | Berries, Stone Fruit, Earl Grey, Honeysuckle, ... | Geometry | Geometry has been defined as "describing space... | Colombia, Ethiopia | 1950 - 2100 |
| 2 | Onyx | Colombia, Ethiopia | Washed, Natural | Dark Chocolate, Molasses, Red Wine, Dried Berr... | Monarch | Monarch is our most developed roast that conve... | | 1800 |
| 3 | Onyx | | Natural, Washed | Mixed Berries, Sweet Tea, Raw Honey, Plum | Tropical Weather | Tropical Weather is a seasonal blend that cele... | Ethiopia | 1900 |
| 4 | Onyx | Colombia, Ethiopia | Washed, Raised-Bed Dreid | Brown Sugar, Cocoa, Silky, Floral, Peach | Power Nap | OK, so you need a quick burst of energy, but y... | | 1950 - 2000 |
| 5 | Onyx | | Washed, Patio Dried | Cocoa, Dates, Brown Sugar, Stone Fruit, Creamy | Cold Brew | This coffee is intentionally sourced and roast... | Colombia, Ethiopia | 1850 |
| 6 | Onyx | Catuai, Caturra | Honey, Patio Dried | Apple Cider, Cherry, Cacao Nib, Hibiscus | Silverstein | Prepare to embark on a sensory journey that ha... | | 1450 |
| 7 | Onyx | Java | Koji Inoculated Natural, Raised Bed Dried | Raspberry, Watermelon Candy, Winey, Mango | Colombia El Vergel Java Koji | This is the coffee that has sparked debates, n... | | 1550 |
| 8 | Onyx | Gesha | Washed, Raised-Bed Dried | Lemon, Black Tea, Orange Blossom, Honey | Colombia Wilder Lasso Citric Gesha | This silky and refined Gesha gets its citric p... | | 1900 MASL |
| 9 | Onyx | Red Bourbon | Natural, Raised-Bed Dried | Dried Cherry, Milk Chocolate, Nectarine, Black... | Burundi Long Miles Gaharo Natural | This natural processed coffee comes to us from... | | 1950 |
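
With the Roaster column in place, tables from different roasters could later be stacked with `pd.concat`. The sketch below assumes a second table whose columns only partly overlap with the Onyx one; `df_other`, "Other Roaster", and "Example Blend" are placeholders, not scraped data.

```python
import pandas as pd

# One abbreviated Onyx row from the table above
df_onyx = pd.DataFrame({
    'Roaster': ['Onyx'],
    'Name': ['Geometry'],
    'Elevation': ['1950 - 2100'],
})

# Hypothetical second roaster table with a different column set
df_other = pd.DataFrame({
    'Roaster': ['Other Roaster'],
    'Name': ['Example Blend'],
    'Process': ['Washed'],
})

# Columns missing from one table are filled with NaN in the combined result
combined = pd.concat([df_onyx, df_other], ignore_index=True, sort=False)
print(combined)
```

Because `pd.concat` aligns on column names, it helps to standardize the names for each roaster (as done above by stripping the colons) before combining.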


In [26]:
df_coffee_info.groupby('Elevation').agg({'Process':'unique', 'Variety':'unique'})

| Elevation | Process | Variety |
|---|---|---|
| 1400 Meters | [Natural, Raised Bed Dried] | [Parainema] |
| 1450 | [Honey, Patio Dried, Natural, Raised-Bed Dried] | [Catuai, Caturra, Ethiopia Heirloom] |
| 1450 Meters | [Natural, Raised-Bed Dried] | [Catuai, Caturra] |
| 1550 | [Koji Inoculated Natural, Raised Bed Dried] | [Java] |
| 1650 MASL | [Washed, Raised-Bed Dried] | [Typica] |
| 1700 MASL | [Lactic Washed, Raised Bed Dried] | [Caturra, Castillo, Colombia] |
| 1750 | [Washed, Raised Bed Dried] | [Pink Bourbon] |
| 1800 | [Washed, Natural, Washed, Raised-Bed Dried] | [Colombia, Ethiopia, SL28, SL34, Ruiru 11] |
| 1800 MASL | [Washed, Raised-Bed Dried] | [Pink Bourbon] |
| 1850 | [Washed, Washed, Patio Dried, Washed, Raised-B... | [nan, Guatemala] |
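
The grouping above splits what look like the same elevations into separate groups ("1450" and "1450 Meters", "1800" and "1800 MASL") because the values are free-form strings. A possible cleanup step, sketched with the standard library, is to pull out the numbers and store a low/high pair. This assumes every figure is in meters, which the source does not state for the unlabeled values; `parse_elevation` is a hypothetical helper.

```python
import re

def parse_elevation(value):
    """Return (low, high) from strings like '1850', '1900 MASL', or '1950 - 2100'."""
    if not isinstance(value, str):
        return None  # covers NaN and other missing values
    numbers = [int(n) for n in re.findall(r'\d+', value)]
    if not numbers:
        return None
    return (min(numbers), max(numbers))

# Values taken from the Elevation column shown above
for raw in ['1450', '1450 Meters', '1900 MASL', '1950 - 2100']:
    print(raw, '->', parse_elevation(raw))
```

Applied with something like `df_coffee_info['Elevation'].map(parse_elevation)`, this would let "1450" and "1450 Meters" fall into the same group.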
