# 0. NOTEBOOK EXPLANATION

> What does it do?

This notebook provides a way to download images from https://commons.wikimedia.org by scraping, based on a manual choice of specific image categories. In total, 48 images are scraped across three categories: *landscapes*, *clouds*, and *entertainment*.
Before scraping, https://commons.wikimedia.org/robots.txt was checked to make sure that no rule was being violated. Nothing in it indicated that these specific files could not be downloaded or scraped.
Furthermore, the images are available either under a Creative Commons (CC) licence, such as Creative Commons Attribution-ShareAlike 4.0 International, or they come from https://unsplash.com (licence: https://unsplash.com/it/licenza), so they are freely available for everyone to use.
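
The robots.txt check described above could also be done programmatically. The following is a minimal sketch using the standard-library `urllib.robotparser`. The rules in `SAMPLE_RULES` and the helper `is_allowed` are hypothetical and only illustrate the mechanism; they are not the actual content of the Wikimedia Commons robots.txt file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only, NOT the real Wikimedia robots.txt.
SAMPLE_RULES = """\
User-agent: *
Disallow: /w/
Allow: /wiki/
""".splitlines()


def is_allowed(url, rules, user_agent="*"):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules)  # parse rules that are already in memory (no network access)
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://commons.wikimedia.org/wiki/Category:Landscapes", SAMPLE_RULES))
print(is_allowed("https://commons.wikimedia.org/w/index.php", SAMPLE_RULES))
```

To check the live file instead, `RobotFileParser.set_url()` followed by `read()` can replace the call to `parse()`.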

> What is the goal?

The goal of this notebook is to collect a set of images for an additional class, the *rejection_class*, to be used when training the model. This should allow the classifier to also recognise images that do not belong to any food category, food being the main topic of this project.


In [41]:
# # REMOVE PREVIOUS FOLDER IF EXISTS
# !rm -rf rejection_class/

In [42]:
# IMPORTING PACKAGES
import os
import time
from os import path
import urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs4

# from PIL import Image
# from pathlib import Path

In [43]:
# Link Google Drive account
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [44]:
# PATH WHERE THE NEW DATA IS SAVED
DIR_ORIGINAL_DATA = r"/content/gdrive/MyDrive/Data_Science_2020-2022/Secondo_anno_Secondo_Semestre/FoDL_Project/Project_Example_Food/ExampleFoodImageDataset/rejection_class"

In [45]:
# BASE URL FOR CATEGORY
BASE_URL_CATEGORY_LANDSCAPES = r"https://commons.wikimedia.org/wiki/Category:Landscapes"
# BASE_URL_CATEGORY_ARCHITECTURE = r"https://commons.wikimedia.org/wiki/Category:Architecture"
BASE_URL_CATEGORY_CLOUDS = r"https://commons.wikimedia.org/wiki/Category:Clouds_in_Nottuln"
BASE_URL_CATEGORY_ENTERTAINMENT = r"https://commons.wikimedia.org/wiki/Category:Aerial_photographs_of_events"

# BASE URL FOR WIKIMEDIA
BASE_URL_WIKIMEDIA = r"https://commons.wikimedia.org"

In [46]:
def get_download_path(base_url):
  """
  Collect the file-page links listed on a Wikimedia Commons category page.

  base_url must be a single URL string, not a list of URLs. The returned
  list of links is meant to be passed, one link at a time, to the
  scrape_images function.

  urlopen() is used to fetch the page. As described at
  https://www.simplilearn.com/tutorials/python-tutorial/python-internet-access-using-urllib-request-and-urlopen
  it provides a fairly simple interface, can retrieve URLs with a variety
  of protocols, and relies on handlers and openers for typical scenarios
  such as basic authentication, cookies, and proxies.
  """

  # FETCH THE BASE URL
  # NOTE: THIS NEEDS A SINGLE URL STRING, NOT A LIST OF URLS.
  # IDEA: SAVE THE RESULTS IN A LIST AND THEN APPLY THE FINAL FUNCTION
  # STARTING FROM THAT LIST RATHER THAN FROM THIS FUNCTION.
  html = urlopen(base_url)

  # APPLY bs4
  bs4_apply = bs4(html, 'html.parser')
  # FIND ALL ELEMENTS WITH THE CLASS galleryfilename galleryfilename-truncate
  link_images = bs4_apply.find_all('a', {"class": "galleryfilename galleryfilename-truncate"})

  # CREATE AN EMPTY LIST WHERE TO SAVE ALL THE LINKS
  save_href_list = []

  # THE href VALUES ARE RELATIVE, SO PREPEND THE WIKIMEDIA BASE URL
  for el in link_images:
    get_ref = el.get('href')
    save_href_list.append(BASE_URL_WIKIMEDIA + get_ref)

  print(save_href_list)

  return save_href_list
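
The link extraction above can be tried without network access. The sketch below swaps BeautifulSoup for the standard-library `html.parser.HTMLParser` so that it runs with no extra packages, and joins relative links with `urljoin` instead of string concatenation. `GalleryLinkParser` and `SAMPLE_HTML` are hypothetical names introduced for illustration; the sample markup is a simplified assumption about how a category page looks, not a copy of a real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class GalleryLinkParser(HTMLParser):
    """Collect absolute URLs of <a> tags that carry the galleryfilename class."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        # Keep only gallery links that actually have an href.
        if "galleryfilename" in classes and href:
            self.links.append(urljoin(self.base_url, href))


# Simplified, made-up markup for illustration only.
SAMPLE_HTML = """
<a class="galleryfilename galleryfilename-truncate" href="/wiki/File:Fallcreek_Oct_2008.jpg">Fallcreek</a>
<a class="other" href="/wiki/Main_Page">Main page</a>
"""

parser = GalleryLinkParser("https://commons.wikimedia.org")
parser.feed(SAMPLE_HTML)
print(parser.links)
```

Using `urljoin` gives the same result as concatenation for links that start with `/`, and it also copes with links that are already absolute.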

In [47]:
# CREATE LISTS WHERE TO STORE ALL THE LINKS FOR THE IMAGES
landscapes = get_download_path(BASE_URL_CATEGORY_LANDSCAPES)
# architecture = get_download_path(BASE_URL_CATEGORY_ARCHITECTURE)
clouds = get_download_path(BASE_URL_CATEGORY_CLOUDS)
entertainment = get_download_path(BASE_URL_CATEGORY_ENTERTAINMENT)

['https://commons.wikimedia.org/wiki/File:Crest-_Yosemite%3F_(17a34fd143a7446ea27c2138f52c2e16).jpg', 'https://commons.wikimedia.org/wiki/File:Fallcreek_Oct_2008.jpg', 'https://commons.wikimedia.org/wiki/File:IMG-20210214-WA0300.jpg', 'https://commons.wikimedia.org/wiki/File:In_Bangor.JPG', 'https://commons.wikimedia.org/wiki/File:Joe_Mania_2016-08-01_(Unsplash_PFLchrsv9jY).jpg', 'https://commons.wikimedia.org/wiki/File:Layakharo_Evining_sunset_scenery.jpg', 'https://commons.wikimedia.org/wiki/File:Mountain_Glacier_Landscape_(from_AvoPix).jpg', 'https://commons.wikimedia.org/wiki/File:Port_Waikato_Township.jpg', 'https://commons.wikimedia.org/wiki/File:%D0%AF%D1%8F_412a.jpg']
['https://commons.wikimedia.org/wiki/File:Nottuln,_Appelh%C3%BClsen,_Haus_Klein-Schonebeck_--_2015_--_01132.jpg', 'https://commons.wikimedia.org/wiki/File:Nottuln,_Appelh%C3%BClsen,_Haus_Klein-Schonebeck_--_2015_--_01133.jpg', 'https://commons.wikimedia.org/wiki/File:Nottuln,_Appelh%C3%BClsen,_Haus_Klein-Schonebec

# Create the functions to fetch the correct link and to download the images.

In [48]:
def scrape_images(download_path):
  """
  Given the URL of a Wikimedia Commons file page, find the URL of the
  image itself and download it into DIR_ORIGINAL_DATA (on Google Drive).
  Returns the result of urlretrieve, or None if the download fails.
  """
  try:
    # PAUSE BETWEEN REQUESTS SO THE SERVER IS NOT OVERLOADED
    time.sleep(5)
    print("Download path is:", download_path)

    # GET THE FILE NAME IN ORDER TO USE IT LATER TO SAVE THE FILE
    current_link = str(download_path)
    file_name = current_link.split("File:")[1]

    html = urlopen(download_path)
    bs4_apply = bs4(html, 'html.parser')
    images = bs4_apply.find_all('a')

    # COLLECT THE href OF EVERY LINK ON THE PAGE, SKIPPING THE FIRST ONE
    save_list = [el.get('href') for el in images][1:]

    # KEEP ONLY ABSOLUTE LINKS; SKIP <a> TAGS WITHOUT AN href (None)
    correct_url_image = [s for s in save_list if s and 'https' in s]
    # print(correct_url_image[0])

    # CREATE THE REJECTION_CLASS FOLDER ON GOOGLE DRIVE IF IT DOES NOT EXIST
    # (TO SAVE LOCALLY INSTEAD, USE '/content/rejection_class' AS THE FOLDER)
    os.makedirs(DIR_ORIGINAL_DATA, exist_ok=True)

    # THE FIRST ABSOLUTE LINK IS ASSUMED TO POINT TO THE IMAGE FILE
    # SAVE TO GOOGLE DRIVE
    return urllib.request.urlretrieve(correct_url_image[0], f"{DIR_ORIGINAL_DATA}/{file_name}")
  except Exception as error:
    print("Error, image not saved. Continue with the next one.", error)
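
The file name taken from the part of the link after `File:` is still percent-encoded (for example `%3F` stands for `?`). As a possible refinement, and not something the notebook itself does, the name could be decoded and cleaned before saving. `local_file_name` is a hypothetical helper, and the set of replaced characters is an assumption about what is unsafe on common file systems.

```python
import re
from urllib.parse import unquote


def local_file_name(file_page_url):
    """Derive a readable, file-system-friendly name from a Commons file-page URL."""
    # Take everything after the first "File:" marker.
    encoded_name = file_page_url.split("File:", 1)[1]
    # Turn percent-escapes such as %3F back into characters.
    decoded_name = unquote(encoded_name)
    # Replace characters that are commonly rejected in file names (assumed set).
    return re.sub(r'[\\/:*?"<>|]', "_", decoded_name)


print(local_file_name(
    "https://commons.wikimedia.org/wiki/File:Crest-_Yosemite%3F_(17a34fd143a7446ea27c2138f52c2e16).jpg"
))
```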

In [49]:
# TEST ON HOW TO DOWNLOAD AN IMAGE
# scrape_images("https://commons.wikimedia.org/wiki/File:Crest-_Yosemite%3F_(17a34fd143a7446ea27c2138f52c2e16).jpg")

# Loop through the lists, get the links and download the images

In [50]:
# Looping through the LANDSCAPES links
for each_link_land in landscapes:
  scrape_images(each_link_land)

Download path is: https://commons.wikimedia.org/wiki/File:Crest-_Yosemite%3F_(17a34fd143a7446ea27c2138f52c2e16).jpg
https://upload.wikimedia.org/wikipedia/commons/0/0f/Crest-_Yosemite%3F_%2817a34fd143a7446ea27c2138f52c2e16%29.jpg
Download path is: https://commons.wikimedia.org/wiki/File:Fallcreek_Oct_2008.jpg
https://upload.wikimedia.org/wikipedia/commons/0/07/Fallcreek_Oct_2008.jpg
Download path is: https://commons.wikimedia.org/wiki/File:IMG-20210214-WA0300.jpg
https://upload.wikimedia.org/wikipedia/commons/2/2f/IMG-20210214-WA0300.jpg
Download path is: https://commons.wikimedia.org/wiki/File:In_Bangor.JPG
https://upload.wikimedia.org/wikipedia/commons/e/ee/In_Bangor.JPG
Download path is: https://commons.wikimedia.org/wiki/File:Joe_Mania_2016-08-01_(Unsplash_PFLchrsv9jY).jpg
https://upload.wikimedia.org/wikipedia/commons/6/62/Joe_Mania_2016-08-01_%28Unsplash_PFLchrsv9jY%29.jpg
Download path is: https://commons.wikimedia.org/wiki/File:Layakharo_Evining_sunset_scenery.jpg
https://uploa

In [52]:
# Looping through the CLOUDS links
for each_link_clouds in clouds:
  scrape_images(each_link_clouds)

Download path is: https://commons.wikimedia.org/wiki/File:Nottuln,_Appelh%C3%BClsen,_Haus_Klein-Schonebeck_--_2015_--_01132.jpg
https://upload.wikimedia.org/wikipedia/commons/6/60/Nottuln%2C_Appelh%C3%BClsen%2C_Haus_Klein-Schonebeck_--_2015_--_01132.jpg
Download path is: https://commons.wikimedia.org/wiki/File:Nottuln,_Appelh%C3%BClsen,_Haus_Klein-Schonebeck_--_2015_--_01133.jpg
https://upload.wikimedia.org/wikipedia/commons/6/65/Nottuln%2C_Appelh%C3%BClsen%2C_Haus_Klein-Schonebeck_--_2015_--_01133.jpg
Download path is: https://commons.wikimedia.org/wiki/File:Nottuln,_Appelh%C3%BClsen,_Haus_Klein-Schonebeck_--_2015_--_01135.jpg
https://upload.wikimedia.org/wikipedia/commons/0/0f/Nottuln%2C_Appelh%C3%BClsen%2C_Haus_Klein-Schonebeck_--_2015_--_01135.jpg
Download path is: https://commons.wikimedia.org/wiki/File:Nottuln,_Appelh%C3%BClsen,_Kriegerged%C3%A4chtniskapelle_--_2015_--_5459.jpg
https://upload.wikimedia.org/wikipedia/commons/2/27/Nottuln%2C_Appelh%C3%BClsen%2C_Kriegerged%C3%A4chtn

In [53]:
# Looping through the ENTERTAINMENT links
for each_link_entertainment in entertainment:
  scrape_images(each_link_entertainment)

Download path is: https://commons.wikimedia.org/wiki/File:1979_Festival_Aerial_Image.jpg
https://upload.wikimedia.org/wikipedia/commons/e/ec/1979_Festival_Aerial_Image.jpg
Download path is: https://commons.wikimedia.org/wiki/File:1979_Festival_main.jpg
https://upload.wikimedia.org/wikipedia/commons/8/82/1979_Festival_main.jpg
Download path is: https://commons.wikimedia.org/wiki/File:1979_Festival_parking_and_camping.jpg
https://upload.wikimedia.org/wikipedia/commons/a/a6/1979_Festival_parking_and_camping.jpg
Download path is: https://commons.wikimedia.org/wiki/File:20120621-FS-UNK-0002.jpg
https://upload.wikimedia.org/wikipedia/commons/d/d9/20120621-FS-UNK-0002.jpg
Download path is: https://commons.wikimedia.org/wiki/File:Aerial_view_of_Al_Ghadha.jpg
https://upload.wikimedia.org/wikipedia/commons/a/a5/Aerial_view_of_Al_Ghadha.jpg
Download path is: https://commons.wikimedia.org/wiki/File:After_the_game_(4058587551).jpg
https://upload.wikimedia.org/wikipedia/commons/9/9e/After_the_game_%
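
The three loops above follow the same pattern, so they could be merged into a single pass that also counts how many downloads succeeded. This is only a sketch: `download_all` is a hypothetical helper, and it assumes the download function returns `None` when an image could not be saved, as `scrape_images` does after printing its error message.

```python
from itertools import chain


def download_all(link_lists, download):
    """Call download on every link in every list and count the outcomes."""
    saved, failed = 0, 0
    for link in chain.from_iterable(link_lists):
        # A None result is treated as a failed download (assumption).
        if download(link) is None:
            failed += 1
        else:
            saved += 1
    return saved, failed


# Stand-in download function for demonstration; it never touches the network.
def fake_download(link):
    return None if "broken" in link else (link, None)


print(download_all([["a.jpg", "broken.jpg"], ["b.jpg"]], fake_download))
```

In the notebook this would be called as `download_all([landscapes, clouds, entertainment], scrape_images)`.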

In [54]:
# # CONVERT THE REJECTION_CLASS FOLDER TO ZIP IN ORDER TO DOWNLOAD IT
# !zip -r /content/rejection_class.zip /content/rejection_class

In [55]:
# # DOWNLOAD THE REJECTION_CLASS FOLDER TO YOUR LOCAL COMPUTER
# from google.colab import files
# files.download("/content/rejection_class.zip")