# **WEB SCRAPING FROM HEHEMART'S WEBSITE** *with BeautifulSoup*
---

### Introduction:

This project is an opportunity to apply my web scraping skills to extract public data from [Hehe Mart](https://shop.mart.rw/categories-en-3/groceries/vegetables).

The objective is to explore the discounts the store offers. The scope may grow as the project develops.

I also checked, since it's important, whether scraping HeheMart would violate their privacy policy. Fortunately, it doesn't.

---

## Import Packages

In [85]:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import pandas as pd
import re

## Parsing HTML to BeautifulSoup object

In [86]:
# link to vegetables page in hehemart's website
my_url = 'https://shop.mart.rw/categories-en-3/groceries/vegetables'

# opening up a connection and grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")

## Exploring the HTML Structure and Extraction

I inspected the HTML page in the browser and noticed that each product listed on the page lives within a div element with the class "ty-column4".

In [87]:
# Extracting all containers of products in the webpage
containers = page_soup.findAll("div",{"class":"ty-column4"})

len(containers)

52

So the page has 52 matching div elements (count displayed above). My guess is that the number of products is slightly lower, because some of these div elements may not contain a product.
We'll investigate this later.

---

I'm interested in the following details of each product:
* Product Name
* Price
* Currency
* Unit

---
I decided to test the extraction on the first container to see if it works.

What I noticed is that there are two span elements, each with the class name "ty-price-num". The first holds the price and the second holds the currency.

In [97]:
# test with first container (product_name, price, currency, unit)

container = containers[0]

product_name = container.find("div",{"class":"ty-grid-list__item-name"}).bdi.a.text
price = container.findAll("span","ty-price-num")[0].string
curr = container.findAll("span","ty-price-num")[1].string
unit = container.findAll("span","unit-price-line1")[0].p.string

print("Product Name:", product_name)
print("Price:", price)
print("Currency:", curr)
print("Unit:", unit)

Product Name: Tomatoes (Open Field) (kg)
Price: 1,690
Currency: RWF
Unit: / Kg


---
So far, so good.

A few points though:
* Checking the webpage, not all products have units, so we'd have to catch those exceptions. *An example is the **Amaranth (Dodo)** product. It has no unit.*
* Also, each unit starts with a slash and a whitespace. I'd probably strip these in pandas with a regex to keep just the unit. E.g. "/ Kg" --> "Kg"
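
The unit clean-up described in the last point could look roughly like this in pandas. This is only a sketch: the column names `unit` and `unit_clean` are assumptions, and apart from "/ Kg" the sample values are made up for illustration rather than scraped from the page.

```python
import pandas as pd

# Illustrative values only: "/ Kg" appears in the output above,
# "/ Bunch" and the missing value are made-up stand-ins.
df = pd.DataFrame({"unit": ["/ Kg", "/ Bunch", None]})

# Capture everything after the leading slash and any whitespace.
# Rows with no unit stay missing (NaN) instead of raising an error.
df["unit_clean"] = df["unit"].str.extract(r"/\s*(.+)", expand=False).str.strip()

print(df)
```

Doing this after extraction keeps the scraping step simple: the raw unit text is stored as found, and the tidy version is derived from it.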

---

The following function returns the indices of the containers where looking up the unit raises an IndexError.

In [98]:
# function to return the indices of divs in containers that have no unit span
def where_errors(container_list):
    counter_list = []

    for counter, container in enumerate(container_list):
        try:
            container.findAll("span","unit-price-line1")[0]
        except IndexError:
            counter_list.append(counter)

    return counter_list

In [100]:
index_of_errors = where_errors(containers)
print(index_of_errors)

[39, 47, 49, 50, 51]


---

Now that we know where these errors are, let's find their product names so we can investigate the issue on the webpage.

In [91]:
# look up the product names of the containers that raised errors

for i in index_of_errors:
    try:
        container_e = containers[i]
        product_name = container_e.find("div",{"class":"ty-grid-list__item-name"}).bdi.a.string

        print(product_name, "-", i)

    except AttributeError:
        print("div tag index", i, "still has an error")

Amaranth (Dodo) - 39
Green Box - 47
Exotic Box - 49
div tag index 50 still has an error
div tag index 51 still has an error


## Removing empty divs

After inspecting the webpage, we see that 39, 47 and 49 are divs of products that don't have a unit. For these, we'd have to extract the unit as blank.
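
One way to extract a blank unit for such products is a small helper that checks for the span before reading it. The sketch below runs on a hypothetical HTML fragment modelled on the class names seen in this notebook (the real page markup may differ), and `extract_unit` and `sample_containers` are names I made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the first div has a unit span, the second has none.
html = """
<div class="ty-column4">
  <span class="unit-price-line1"><p>/ Kg</p></span>
</div>
<div class="ty-column4">
  <a>Product without a unit</a>
</div>
"""

def extract_unit(container):
    """Return the unit text, or an empty string when the product has none."""
    unit_span = container.find("span", "unit-price-line1")
    if unit_span is None or unit_span.p is None:
        return ""
    return unit_span.p.get_text(strip=True)

sample_containers = BeautifulSoup(html, "html.parser").find_all("div", "ty-column4")
print([extract_unit(c) for c in sample_containers])
```

Checking for None avoids a try/except around every lookup and makes the "no unit" case explicit.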

The divs at 50 and 51, though, don't contain anything at all. Let's confirm this.

In [92]:
print(containers[50])
print(containers[51])

<div class="ty-column4"></div>
<div class="ty-column4"></div>


Now let's remove the empty containers from the list. The length of a container is its number of direct children, so any container that's not empty has a length greater than 0. That's the criterion we'll use.

In [93]:
containers_clean = [s for s in containers if len(s) > 0]

print("We now have", len(containers_clean), "products from the webpage we can start extracting from")

We now have 50 products from the webpage we can start extracting from
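
A caveat on the length criterion: a div that holds only whitespace still has one child (a text node), so it would pass the check. That didn't matter here because the two empty divs printed as truly empty, but a stricter filter is easy to sketch. The fragment below is hypothetical and only meant to show the difference between the two checks.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one empty div, one whitespace-only div, one with a product link.
html = (
    '<div class="ty-column4"></div>'
    '<div class="ty-column4"> </div>'
    '<div class="ty-column4"><a>Tomatoes</a></div>'
)
divs = BeautifulSoup(html, "html.parser").find_all("div", "ty-column4")

# len(tag) counts direct children, so the whitespace-only div has length 1.
by_len = [d for d in divs if len(d) > 0]

# Stricter check: keep a div only if it contains at least one child tag.
by_child_tag = [d for d in divs if d.find(True) is not None]

print(len(by_len), len(by_child_tag))
```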


## Write extracted data to a csv file

In [94]:
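# create the csv file to dump the data in
filename = 'products_hehe.csv'
f = open(filename, 'w', encoding='utf-8')

# write the header row for the csv file
headers = "product_name,price,currency,unit\n"
f.write(headers)

In [95]:
for container in containers_clean:
    product_name = container.find("div",{"class":"ty-grid-list__item-name"}).bdi.a.text
    prices = container.findAll("span","ty-price-num")
    price, curr = prices[0].string, prices[1].string

    # products without a unit get a blank value
    units = container.findAll("span","unit-price-line1")
    unit = units[0].p.text if units else ""

    # write to csv; commas inside values are replaced or dropped to keep the columns aligned
    f.write(product_name.replace(",","|") + "," + price.replace(",","") + "," + curr + "," + unit + "\n")

f.close()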
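
As an alternative to joining the fields by hand, Python's built-in csv module quotes values that contain commas (such as the price "1,690"), so nothing has to be replaced. The sketch below is an assumption about how the final step could look, not the finished pipeline: `write_products` is a made-up helper, the first row reuses the values printed earlier in this notebook, and the second row is invented to show a comma in the name and a blank unit.

```python
import csv
import io

# Illustrative rows: the first reuses values printed earlier in this notebook,
# the second is a made-up product with a comma in its name and no unit.
rows = [
    {"product_name": "Tomatoes (Open Field) (kg)", "price": "1,690", "currency": "RWF", "unit": "/ Kg"},
    {"product_name": "Example Box, Large", "price": "2,000", "currency": "RWF", "unit": ""},
]

def write_products(rows, fileobj):
    """Write product dicts as CSV, letting the csv module quote commas."""
    writer = csv.DictWriter(fileobj, fieldnames=["product_name", "price", "currency", "unit"])
    writer.writeheader()
    writer.writerows(rows)

# In-memory stand-in for open("products_hehe.csv", "w", newline="")
buffer = io.StringIO()
write_products(rows, buffer)
print(buffer.getvalue())
```

Writing to a file-like object also makes the function easy to check without touching the disk.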