# **WEB SCRAPING NOTEBOOK**

# **Food Image Classification for the course** ***Foundations of Deep Learning***
**Professors:** 

Paolo Napoletano 

Marco Buzzelli

**Tutor:**

Mirko Agarla

# NOTEBOOK FOR DATA SCRAPING IN ORDER TO CREATE REJECTION CLASS

> What does it do?

This notebook provides a way to download images from https://commons.wikimedia.org by scraping, based on a custom choice of image categories. In total, 254 images were scraped across four categories: *landscapes*, *architecture*, *clouds*, and *entertainment*.

Before the scraping took place, https://commons.wikimedia.org/robots.txt was checked to see whether scraping these pages was disallowed. Nothing in it indicated that the specific files could not be downloaded or scraped.
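A check like this one can also be scripted with the standard-library `urllib.robotparser`. The sketch below parses a robots.txt body passed in as text, so it runs offline. The sample rules, the `is_allowed` helper, and the `my-course-scraper` user agent are made up for illustration; they are not the actual content of the Wikimedia file.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# HYPOTHETICAL rules, for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /w/
Allow: /wiki/
"""

print(is_allowed(SAMPLE_ROBOTS, "my-course-scraper",
                 "https://commons.wikimedia.org/wiki/Category:Landscapes"))
print(is_allowed(SAMPLE_ROBOTS, "my-course-scraper",
                 "https://commons.wikimedia.org/w/index.php"))
```

To check the live file instead, `RobotFileParser` also offers `set_url()` and `read()`, which fetch the rules over the network.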

Furthermore, the images are usually available under a Creative Commons (CC) licence (https://en.wikipedia.org/wiki/Creative_Commons_license), such as Creative Commons Attribution-ShareAlike 4.0 International, or they come from https://unsplash.com (licence: https://unsplash.com/it/licenza), so they are freely available to everyone for almost any purpose.

> What is the goal?

The goal of this notebook is to collect enough images to create an additional class, the *rejection_class*, to be used when training the model. This way, the model should also be able to recognize inputs that are not food.

In [None]:
# # REMOVE PREVIOUS FOLDER IF EXISTS
# !rm -rf rejection_class/

In [None]:
# IMPORTING PACKAGES
import os
import time
import urllib.request
from os import path
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs4

In [None]:
# Link Google Drive account
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
# PATH WHERE THE NEW DATA WILL BE SAVED
DIR_ORIGINAL_DATA = r"/content/gdrive/MyDrive/Data_Science_2020-2022/Secondo_anno_Secondo_Semestre/FoDL_Project/Project_Example_Food/ExampleFoodImageDataset/rejection_class"
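A minimal sketch of preparing the output folder before any download starts, assuming `os.makedirs` with `exist_ok=True` is acceptable here. `ensure_dir` is a hypothetical helper name, and the demo uses a temporary directory instead of the Google Drive path so that it runs anywhere.

```python
import os
import tempfile


def ensure_dir(directory: str) -> str:
    """Create directory (and any missing parents); do nothing if it exists."""
    os.makedirs(directory, exist_ok=True)
    return directory


# Demo on a temporary folder (NOT the real Google Drive path)
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_dir(os.path.join(tmp, "ExampleFoodImageDataset", "rejection_class"))
    ensure_dir(target)  # calling it a second time is harmless
    print(os.path.isdir(target))
```

Unlike `os.mkdir`, this also creates missing parent folders, which matters when the dataset folder itself has not been created yet.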

In [None]:
# BASE URL FOR CATEGORY
BASE_URL_CATEGORY_LANDSCAPES = r"https://commons.wikimedia.org/wiki/Category:Landscapes"
BASE_URL_CATEGORY_ARCHITECTURE = r"https://commons.wikimedia.org/wiki/Category:Architecture"
BASE_URL_CATEGORY_CLOUDS = r"https://commons.wikimedia.org/wiki/Category:Clouds_in_Nottuln"
BASE_URL_CATEGORY_ENTERTAINMENT = r"https://commons.wikimedia.org/wiki/Category:Aerial_photographs_of_events"

# BASE URL FOR WIKIMEDIA
BASE_URL_WIKIMEDIA = r"https://commons.wikimedia.org"

In [None]:
def get_download_path(base_url):
  """
    Collect the links to the individual image pages of a category.

    Given the URL of a single Wikimedia Commons category page, this function
    returns the list of file-page URLs found on it. Each of these URLs is
    later passed to the scrape_images function.
    
    From https://www.simplilearn.com/tutorials/python-tutorial/python-internet-access-using-urllib-request-and-urlopen#:~:text=The%20urlopen()%20function%20provides,objects%20that%20perform%20these%20services.:
    
    The urlopen() function provides a fairly simple interface. 
    It is capable of retrieving URLs with a variety of protocols. 
    It also has a little more complicated interface for dealing with typical 
    scenarios, such as basic authentication, cookies, and proxies. 
    Handlers and openers are objects that perform these services.
  """

  # FETCH THE BASE URL
  # NOTE: base_url MUST BE A SINGLE URL STRING, NOT A LIST OF URLS.
  # IDEA: SAVE THE RESULTS IN A LIST AND THEN APPLY THE FINAL FUNCTION
  # TO THAT LIST, RATHER THAN CALLING IT FROM INSIDE THIS FUNCTION.
  html = urlopen(base_url)

  # APPLY bs4
  bs4_apply = bs4(html, 'html.parser')
  # FIND ALL ELEMENTS WITH THE CLASS galleryfilename galleryfilename-truncate
  link_images = bs4_apply.find_all('a', {"class": "galleryfilename galleryfilename-truncate"})

  # CREATE AN EMPTY LIST TO STORE ALL THE LINKS
  save_href_list = []

  for el in link_images:
    get_ref = el.get('href')
    save_href_list.append(BASE_URL_WIKIMEDIA + get_ref)

  print(save_href_list)

  return save_href_list
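
To illustrate what `get_download_path` extracts, here is a dependency-free sketch that uses the standard-library `html.parser` instead of BeautifulSoup, and `urllib.parse.urljoin` instead of string concatenation. The `SAMPLE_HTML` fragment is invented for the example and only imitates the `galleryfilename` anchors; the real Wikimedia markup may differ. `GalleryLinkParser` and `extract_gallery_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://commons.wikimedia.org"


class GalleryLinkParser(HTMLParser):
    """Collect absolute URLs of <a> tags carrying the 'galleryfilename' class."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if "galleryfilename" in classes and href:
            # urljoin turns the relative href into an absolute URL
            self.links.append(urljoin(BASE_URL, href))


def extract_gallery_links(page_html):
    parser = GalleryLinkParser()
    parser.feed(page_html)
    return parser.links


# INVENTED fragment that imitates a category page
SAMPLE_HTML = """
<a id="top"></a>
<a href="/wiki/File:Fallcreek_Oct_2008.jpg" class="galleryfilename galleryfilename-truncate">Fallcreek Oct 2008.jpg</a>
<a href="/wiki/Main_Page">Main page</a>
<a href="/wiki/File:In_Bangor.JPG" class="galleryfilename galleryfilename-truncate">In Bangor.JPG</a>
"""

print(extract_gallery_links(SAMPLE_HTML))
```

Only the two anchors with the gallery class are kept; the anchor without an `href` and the navigation link are ignored.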

In [None]:
# CREATE THE LISTS THAT STORE ALL THE IMAGE LINKS
landscapes = get_download_path(BASE_URL_CATEGORY_LANDSCAPES)
architecture = get_download_path(BASE_URL_CATEGORY_ARCHITECTURE)
clouds = get_download_path(BASE_URL_CATEGORY_CLOUDS)
entertainment = get_download_path(BASE_URL_CATEGORY_ENTERTAINMENT)

['https://commons.wikimedia.org/wiki/File:Crest-_Yosemite%3F_(17a34fd143a7446ea27c2138f52c2e16).jpg', 'https://commons.wikimedia.org/wiki/File:Fallcreek_Oct_2008.jpg', 'https://commons.wikimedia.org/wiki/File:Galway,_Irlanda.jpg', 'https://commons.wikimedia.org/wiki/File:IMG-20210214-WA0300.jpg', 'https://commons.wikimedia.org/wiki/File:In_Bangor.JPG', 'https://commons.wikimedia.org/wiki/File:Joe_Mania_2016-08-01_(Unsplash_PFLchrsv9jY).jpg', 'https://commons.wikimedia.org/wiki/File:Layakharo_Evining_sunset_scenery.jpg', 'https://commons.wikimedia.org/wiki/File:Mountain_Glacier_Landscape_(from_AvoPix).jpg', 'https://commons.wikimedia.org/wiki/File:%D0%AF%D1%8F_412a.jpg']
['https://commons.wikimedia.org/wiki/File:%22Grecia._Le_radici_della_Civilt%C3%A0_Europea%22_book_by_Pino_Musi.jpg', 'https://commons.wikimedia.org/wiki/File:%22Grecia._Le_radici_della_Civilt%C3%A0_Europea%22_photobook_by_Pino_Musi.jpg', 'https://commons.wikimedia.org/wiki/File:01_EJHNMC_by_Husos_03b_01_PRuiz.jpg', 'htt

# Create the function that fetches the correct link and downloads the image.

In [None]:
def scrape_images(download_path):
  """
  Given the URL of a Wikimedia Commons file page (download_path),
  this function finds the URL of the image itself and downloads
  the image into DIR_ORIGINAL_DATA on Google Drive.
  It returns the result of urlretrieve, or None if the download fails.
  """
  try:
    # WAIT BETWEEN REQUESTS TO AVOID OVERLOADING THE SERVER
    time.sleep(5)
    print("Download path is:", download_path)

    # GET THE FILE NAME IN ORDER TO USE IT LATER TO SAVE THE FILE
    # (THE NAME STAYS PERCENT-ENCODED, EXACTLY AS IT APPEARS IN THE URL)
    current_link = str(download_path)
    file_name = current_link.split("File:")[1]

    # COLLECT THE href OF EVERY <a> ELEMENT ON THE FILE PAGE
    html = urlopen(download_path)
    bs4_apply = bs4(html, 'html.parser')
    images = bs4_apply.find_all('a')

    save_list = []
    for el in images:
      save_list.append(el.get('href'))

    # SKIP THE FIRST ANCHOR ON THE PAGE
    save_list = save_list[1:]

    # KEEP ONLY ABSOLUTE LINKS; SKIP ANCHORS WITHOUT AN href (None)
    correct_url_image = []
    for s in save_list:
      if s is not None and 'https' in s:
        correct_url_image.append(s)

    # CREATE THE REJECTION_CLASS FOLDER IF IT DOES NOT EXIST ON GOOGLE DRIVE
    os.makedirs(DIR_ORIGINAL_DATA, exist_ok=True)

    # SAVE TO GOOGLE DRIVE; THE FIRST https LINK IS ASSUMED TO BE THE IMAGE
    return urllib.request.urlretrieve(correct_url_image[0], f"{DIR_ORIGINAL_DATA}/{file_name}")
  except Exception as error:
    print("Error, image not saved. Continue with the next one.", error)
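
The link-selection step inside `scrape_images` can be isolated into a small pure function, which makes it testable without network access. The sketch below skips anchors that have no `href` (BeautifulSoup returns `None` for those) and prefers links on `upload.wikimedia.org`, the host that serves the media files. `pick_image_url` and the sample list are hypothetical, and the preference rule is an assumption rather than what the notebook does (it takes the first link containing `https`).

```python
from typing import Iterable, Optional

MEDIA_HOST = "https://upload.wikimedia.org/"


def pick_image_url(hrefs: Iterable[Optional[str]]) -> Optional[str]:
    """Return the most likely image URL from a list of href values."""
    # Drop missing hrefs and relative links
    candidates = [h for h in hrefs if h and h.startswith("https")]
    # ASSUMPTION: a link on the media host is the image file itself
    for href in candidates:
        if href.startswith(MEDIA_HOST):
            return href
    # Fallback: first absolute link, or None if there is none
    return candidates[0] if candidates else None


# HYPOTHETICAL href values; the upload path is a placeholder
sample_hrefs = [
    None,
    "#mw-head",
    "/wiki/Main_Page",
    "https://creativecommons.org/licenses/by-sa/4.0",
    "https://upload.wikimedia.org/wikipedia/commons/x/xx/Example.jpg",
]

print(pick_image_url(sample_hrefs))
```

Returning `None` instead of raising lets the caller decide how to report a page on which no image link was found.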

In [None]:
# TEST ON HOW TO DOWNLOAD AN IMAGE
# scrape_images("https://commons.wikimedia.org/wiki/File:Crest-_Yosemite%3F_(17a34fd143a7446ea27c2138f52c2e16).jpg")
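
The test URL above contains the percent-encoded sequence `%3F`, and `scrape_images` keeps such sequences in the saved file name. If decoded names are preferred, a sketch like the following could be used; `file_name_from_url` and the `UNSAFE_CHARS` set are assumptions for illustration, not part of the notebook.

```python
from urllib.parse import unquote, urlparse

# Characters replaced because some file systems reject them (ASSUMED set)
UNSAFE_CHARS = '<>:"/\\|?*'


def file_name_from_url(file_page_url: str) -> str:
    """Derive a decoded, file-system-friendly name from a File: page URL."""
    url_path = urlparse(file_page_url).path
    encoded_name = url_path.split("File:", 1)[1]
    decoded_name = unquote(encoded_name)
    return "".join("_" if ch in UNSAFE_CHARS else ch for ch in decoded_name)


print(file_name_from_url(
    "https://commons.wikimedia.org/wiki/File:Crest-_Yosemite%3F_(17a34fd143a7446ea27c2138f52c2e16).jpg"
))
print(file_name_from_url("https://commons.wikimedia.org/wiki/File:Galway,_Irlanda.jpg"))
```

Names without encoded characters, such as the second one, pass through unchanged.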

# Loop through the lists, get the links, and download the images
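
The four loops in the cells that follow share the same shape, so they could be folded into one helper. This is a sketch only: `download_all` is a hypothetical name, the download function is passed in as a parameter so the example can run with a stand-in instead of `scrape_images`, and the per-category counts are an addition that the notebook does not compute.

```python
import time
from typing import Callable, Dict, List


def download_all(
    links_by_category: Dict[str, List[str]],
    download: Callable[[str], object],
    delay_seconds: float = 0.0,
) -> Dict[str, int]:
    """Call download on every link and count the successes per category."""
    saved = {}
    for category, links in links_by_category.items():
        count = 0
        for link in links:
            # A result of None is treated as a failed download
            if download(link) is not None:
                count += 1
            time.sleep(delay_seconds)
        saved[category] = count
    return saved


def fake_download(link):
    """Stand-in for scrape_images: 'succeeds' only for lowercase .jpg links."""
    return ("local-file", None) if link.endswith(".jpg") else None


demo_links = {
    "landscapes": [
        "https://commons.wikimedia.org/wiki/File:Fallcreek_Oct_2008.jpg",
        "https://commons.wikimedia.org/wiki/File:In_Bangor.JPG",
    ],
    "clouds": [],
}

print(download_all(demo_links, fake_download))
```

With the real function, the call would look like `download_all({"landscapes": landscapes, ...}, scrape_images)`; the extra delay can stay at zero because `scrape_images` already sleeps between requests.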

In [None]:
# Looping through the LANDSCAPES links
for each_link_land in landscapes:
  scrape_images(each_link_land)

Download path is: https://commons.wikimedia.org/wiki/File:Crest-_Yosemite%3F_(17a34fd143a7446ea27c2138f52c2e16).jpg
Download path is: https://commons.wikimedia.org/wiki/File:Fallcreek_Oct_2008.jpg
Download path is: https://commons.wikimedia.org/wiki/File:Galway,_Irlanda.jpg
Download path is: https://commons.wikimedia.org/wiki/File:IMG-20210214-WA0300.jpg
Download path is: https://commons.wikimedia.org/wiki/File:In_Bangor.JPG
Download path is: https://commons.wikimedia.org/wiki/File:Joe_Mania_2016-08-01_(Unsplash_PFLchrsv9jY).jpg
Download path is: https://commons.wikimedia.org/wiki/File:Layakharo_Evining_sunset_scenery.jpg
Download path is: https://commons.wikimedia.org/wiki/File:Mountain_Glacier_Landscape_(from_AvoPix).jpg
Download path is: https://commons.wikimedia.org/wiki/File:%D0%AF%D1%8F_412a.jpg


In [None]:
# Looping through the ARCHITECTURE links
for each_link_architecture in architecture:
  scrape_images(each_link_architecture)

Download path is: https://commons.wikimedia.org/wiki/File:%22Grecia._Le_radici_della_Civilt%C3%A0_Europea%22_book_by_Pino_Musi.jpg
Download path is: https://commons.wikimedia.org/wiki/File:%22Grecia._Le_radici_della_Civilt%C3%A0_Europea%22_photobook_by_Pino_Musi.jpg
Download path is: https://commons.wikimedia.org/wiki/File:01_EJHNMC_by_Husos_03b_01_PRuiz.jpg
Download path is: https://commons.wikimedia.org/wiki/File:01_FREYMING-MERLEBACH-web.jpg
Download path is: https://commons.wikimedia.org/wiki/File:03_EJHNMC_by_Husos_09_MSalinas.jpg
Download path is: https://commons.wikimedia.org/wiki/File:04_Dispersion_by_Husos.jpg
Download path is: https://commons.wikimedia.org/wiki/File:04_UVA_by_Husos_FOTOMONT.jpg
Download path is: https://commons.wikimedia.org/wiki/File:05_Do_It_by_Husos.jpg
Download path is: https://commons.wikimedia.org/wiki/File:05_VCAL_by_Husos.jpg
Download path is: https://commons.wikimedia.org/wiki/File:06_VCAL_by_Husos.jpg
Download path is: https://commons.wikimedia.org/

In [None]:
# Looping through the CLOUDS links
for each_link_clouds in clouds:
  scrape_images(each_link_clouds)

Download path is: https://commons.wikimedia.org/wiki/File:Nottuln,_Appelh%C3%BClsen,_Haus_Klein-Schonebeck_--_2015_--_01132.jpg
Download path is: https://commons.wikimedia.org/wiki/File:Nottuln,_Appelh%C3%BClsen,_Haus_Klein-Schonebeck_--_2015_--_01133.jpg
Download path is: https://commons.wikimedia.org/wiki/File:Nottuln,_Appelh%C3%BClsen,_Haus_Klein-Schonebeck_--_2015_--_01135.jpg
Download path is: https://commons.wikimedia.org/wiki/File:Nottuln,_Appelh%C3%BClsen,_Kriegerged%C3%A4chtniskapelle_--_2015_--_5459.jpg
Download path is: https://commons.wikimedia.org/wiki/File:Nottuln,_Appelh%C3%BClsen,_Pfarramt_--_2015_--_5472.jpg
Download path is: https://commons.wikimedia.org/wiki/File:Nottuln,_Appelh%C3%BClsen,_Schulze-Frenkings-Hof_--_2015_--_5455.jpg
Download path is: https://commons.wikimedia.org/wiki/File:Nottuln,_Appelh%C3%BClsen,_Schulze-Frenkings-Hof_--_2015_--_5476.jpg
Download path is: https://commons.wikimedia.org/wiki/File:Nottuln,_Appelh%C3%BClsen,_Schulze-Frenkings-Hof_--_201

In [None]:
# Looping through the ENTERTAINMENT links
for each_link_entertainment in entertainment:
  scrape_images(each_link_entertainment)

Download path is: https://commons.wikimedia.org/wiki/File:1979_Festival_Aerial_Image.jpg
Download path is: https://commons.wikimedia.org/wiki/File:1979_Festival_main.jpg
Download path is: https://commons.wikimedia.org/wiki/File:1979_Festival_parking_and_camping.jpg
Download path is: https://commons.wikimedia.org/wiki/File:20120621-FS-UNK-0002.jpg
Download path is: https://commons.wikimedia.org/wiki/File:Aerial_view_of_Al_Ghadha.jpg
Download path is: https://commons.wikimedia.org/wiki/File:After_the_game_(4058587551).jpg
Download path is: https://commons.wikimedia.org/wiki/File:Al_Ghadha_Desert_Festival.jpg
Download path is: https://commons.wikimedia.org/wiki/File:Al_Ghadha_Festival,_Unaizah.jpg
Download path is: https://commons.wikimedia.org/wiki/File:Bonn_Rheinkultur.JPG
Download path is: https://commons.wikimedia.org/wiki/File:Chaos_Communication_Camp_2015_aerial.jpg
Download path is: https://commons.wikimedia.org/wiki/File:D%C3%BClmen,_Wiese_am_Kapellenweg_--_2014_--_8020.jpg
Downlo
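
To confirm how many images actually landed in the folder (the introduction reports 254), a short count by file extension can help. This is a sketch under assumptions: `count_images` is a hypothetical helper, the extension list is a guess at what the categories contain, and the demo runs on a temporary folder rather than on `DIR_ORIGINAL_DATA`.

```python
import os
import tempfile

# ASSUMED set of image extensions; extend it if other formats were downloaded
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def count_images(directory):
    """Count the files in directory whose extension looks like an image."""
    return sum(
        1
        for name in os.listdir(directory)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )


# Demo on a temporary folder with three image-like files and one text file
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpg", "b.JPG", "c.png", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    print(count_images(tmp))
```

The comparison is case-insensitive, so names such as `In_Bangor.JPG` are counted as well.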




