# Google Arts & Culture - Case study using CRISP-DM

#### Authors: Manuel Alejandro Aponte, Cristian Beltran, Maria Paula Peña

This notebook scrapes the Google Arts & Culture website.

## Objectives
The objectives of this notebook are:

* Download images using webscraping.
* Download image metadata.
* Store all the information in CSV files.

## Prerequisites

* Familiarity with Python
* Latest version of ChromeDriver (the WebDriver for Google Chrome). Source: https://chromedriver.chromium.org/
* Install the required Python packages.
* Use a VPN (recommended)

## Background 
This notebook belongs to the Google Arts & Culture case study using CRISP-DM, which includes processes such as web scraping, exploratory data analysis, ML classifiers and dashboards.

In [None]:
#Package installation
!pip install pandas
!pip install selenium

In [10]:
#Import packages
import pandas as pd
import numpy as np
import time
from selenium import webdriver
from pprint import pprint
from concurrent.futures import ThreadPoolExecutor
from time import sleep

# Configuration

In [11]:
#Color categories to be extracted
COLORS = ["WHITE","PINK","YELLOW","PURPLE","BLUE","TEAL","GREEN","ORANGE","RED","BROWN","BLACK"] 

#URL Target
URL= "https://artsandculture.google.com/color"

#Location of driver
executable_path = r'chromedriver.exe'

#Number of scrolls on each color page
SCROLL_DOWN = 40 
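
The configuration values above can be combined into the address of each color page. This is a minimal sketch, assuming the site accepts the color as a `col` query parameter (the same pattern `Manager.open` uses); `build_color_url` is a hypothetical helper that is not part of this notebook, and the color list is shortened for illustration.

```python
from urllib.parse import urlencode

URL = "https://artsandculture.google.com/color"
COLORS = ["WHITE", "PINK", "YELLOW"]  # shortened list, for illustration only


def build_color_url(color, base_url=URL):
    """Return the color page URL for one category (hypothetical helper)."""
    # Assumption: the page reads the color from the `col` query parameter.
    return f"{base_url}?{urlencode({'col': color})}"


for color in COLORS:
    print(build_color_url(color))
```

Building the address from `URL` keeps the target in one place, so a change to the base address only has to be made in the configuration cell.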

# Utils

In [12]:
class Manager:
    """
    Object to manage the web session and simulate interactions.

    Attributes
    ----------
    driver : webdriver
        Selenium driver
        
    Methods
    -------
    open
        Open a page of images by color.
    scroll_down
        Scroll down the current session.
    js_details
        Get details of current picture.
    js_pages
        Get pages links in session.
    js_images
        Get image links in session.
    mergeJS
        Merge JS scripts into one.
        

    """
    def __init__(self, driver):
        self.driver = driver
        
    def open(self,color):
        """Open the page that contains images of this color.

        Parameters
        ----------
        color : str
            Color category to open, e.g. "RED".
        """
        target_page = f'https://artsandculture.google.com/color?col={color}'
        self.driver.get(target_page)
        
    def scroll_down(self,scrolls):
        """Scroll down the page so that more images load dynamically.

        Parameters
        ----------
        scrolls : int
            Number of scrolls to execute.
        """
        for j in range(0, scrolls):
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(8)  
            
    def js_details(self):
        """Get all details of the picture page as a JSON string.

        Returns
        -------
        str
            Details of the picture page as a JSON string.
        """
        
        script = '''
        var details = [...document.querySelector(".ve9nKb").querySelectorAll('li')];
        var info = details.reduce((acc,item)=>{
            const text = item.textContent;
            const chunks = text.split(':');
            const key = chunks[0];
            // Rejoin with ':' so that values containing colons keep them.
            const value = chunks.slice(1).join(":");
            return {...acc,[key]:value}
        },{})
        return JSON.stringify(info);
        '''
        details = self.driver.execute_script(script)
        return details
     
    def js_pages(self):
        """Get the links to all picture pages.

        Returns
        -------
        list(str)
            Links to each picture page in the current session.
        """
        
        scripts=[
            "var containers = [...document.querySelectorAll('.DuHQbc')];",
            'var img_elements = containers.map(contain => contain.firstElementChild);',
            'return img_elements.map(a => a.href)'
        ]
            
        script = self.mergeJS(scripts) 
        pages = self.driver.execute_script(script)
        return pages
        
    
    def js_images(self):
        """Get the links to all picture images.

        Returns
        -------
        list(str)
            Links to each picture image in the current session.
        """
        
        scripts=[
            "var containers = [...document.querySelectorAll('.DuHQbc')];",
            'var img_elements= containers.map(contain => contain.firstElementChild);',
            'return img_elements.map(a =>window.getComputedStyle(a, false).backgroundImage)'
        ]
        script = self.mergeJS(scripts) 
        urls_raw = self.driver.execute_script(script)
        urls = list(filter(lambda url:url!='none',urls_raw))
        links = list(map(lambda url:url.split('"')[1],urls))
        return links

    
    def mergeJS(self,scripts):
        """Combine a list of js scripts into one

        Parameters
        ----------
        scripts : list(str)
            List of related scripts

        Returns
        -------
        str
            Scripts combined     
        """
        return ''.join(scripts)

class Storage:
    """
    Store and export the data collected during web scraping.
        
    Methods
    -------
    add
        Add a new record.
    export
        Export all stored data.

    """
    def __init__(self):
        self.storage = []
            
    def add(self, url="NULL",data="NULL",category="NULL"):
        """Add a new record to the storage.

        Parameters
        ----------
        url : str
            URL of the picture image.
        data : str
            Data related to the picture.
        category : str
            Color category.
        """
        register = {
            'url': str(url),
            'data': str(data),
            'category': str(category)
        }
        self.storage.append(register)
    
    def export(self,name):
        """Export all stored data to a CSV file.
        
        NOTE: The filename should have the .csv extension.

        Parameters
        ----------
        name : str
            Name of the export file.
        """
        df = pd.DataFrame(self.storage)
        df.to_csv(name, index = False)
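
Each exported row keeps the picture details as a JSON string in the `data` column, so they have to be decoded before analysis. The sketch below shows one way to expand that column into separate fields using only the standard library. The sample rows are made up, and `parse_details` is a hypothetical helper. The `"None"` cell mimics what `Storage.add` would store if no details were returned, since it applies `str()` to its arguments.

```python
import csv
import io
import json

# Made-up content that mimics one exported file (columns: url, data, category).
sample_csv = io.StringIO()
writer = csv.DictWriter(sample_csv, fieldnames=["url", "data", "category"])
writer.writeheader()
writer.writerow({
    "url": "https://example.com/a.jpg",
    "data": json.dumps({"Title": " Example A", "Creator": " Unknown"}),
    "category": "RED",
})
writer.writerow({"url": "https://example.com/b.jpg", "data": "None", "category": "BLUE"})
sample_csv.seek(0)


def parse_details(raw):
    """Decode one `data` cell; return an empty dict when it is not a JSON object."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        return {}
    return parsed if isinstance(parsed, dict) else {}


# Replace the raw `data` string with one key per detail field.
records = []
for row in csv.DictReader(sample_csv):
    details = parse_details(row.pop("data"))
    records.append({**row, **details})

for record in records:
    print(record)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(records)`; rows without details simply get missing values in the detail columns.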


# Webscraping

In [15]:
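def create_driver():
    #Selenium 4 receives the driver location through a Service object;
    #the executable_path keyword of webdriver.Chrome is deprecated and removed in recent releases.
    from selenium.webdriver.chrome.service import Service
    driver = webdriver.Chrome(service=Service(executable_path=executable_path))
    return driver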

def webscraping(color):
    #Create instances.
    driver = create_driver()
    manager = Manager(driver)
    storage = Storage()
    
    #Open Page.
    manager.open(color)
    print('Scraping Color:', color)
    
    #Scroll page until the end.
    time.sleep(3)
    manager.scroll_down(SCROLL_DOWN)
    time.sleep(7)
    
    #Get picture pages and image links
    pages =manager.js_pages()
    image_links= manager.js_images()
    
    #Show state of current page
    length_items = len(pages)    
    print('Number of elements:', length_items, color)
    
    #For each picture, get its details.
    for i,(page,img_link) in enumerate(zip(pages,image_links)):
        #Show progress
        if(i%100==0):
            print(color)
            print('   Progress:', i,' of ',length_items)
        
        #Open the picture page, then get and store its information.
        driver.get(page)
        time.sleep(0.2)
        details = manager.js_details()
        storage.add(img_link,details,color) 
    #Export data
    filename = color+'.csv'
    storage.export(filename)

    #Close the browser session opened for this color.
    driver.quit()
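
The function above scrolls a fixed `SCROLL_DOWN` number of times with a fixed wait. A possible alternative, sketched here as an assumption rather than as the method used in this notebook, is to stop as soon as the page height no longer grows. `scroll_until_stable` is a hypothetical helper; it only relies on the driver's `execute_script` method, and the `pause` value is a guess that may need tuning, because a slow connection can make the page look finished too early.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_scrolls=40):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls that were performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for done in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more images
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was loaded, so the end of the page was reached.
            return done + 1
        last_height = new_height
    return max_scrolls
```

With a real session this could be called as `scroll_until_stable(driver, pause=8, max_scrolls=SCROLL_DOWN)`, which keeps the current limits as an upper bound while skipping unnecessary waits on short pages.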
    

# Parallel WebScraping

In [14]:
if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=None) as exe: 
        print('Running')
        #Consume the iterator so that exceptions raised in the workers surface here.
        result = list(exe.map(webscraping,COLORS))
    print('End')

Running


  driver = webdriver.Chrome(executable_path = executable_path)


Scraping Color: PURPLE
Scraping Color: WHITE
Scraping Color: BLACK
Scraping Color: PINK
Scraping Color: BLUE
Scraping Color: GREEN
Scraping Color: ORANGE
Scraping Color: BROWN
Scraping Color: TEAL
Scraping Color: RED
Scraping Color: YELLOW
missing: 30
Number of elements: 75 PURPLE
PURPLE
   Progress: 0  of  75
missing: 30
Number of elements: 75 WHITE
WHITE
   Progress: 0  of  75
missing: 30
Number of elements: 75 BLACK
BLACK
   Progress: 0  of  75
missing: 30
Number of elements: 75 PINK
PINK
   Progress: 0  of  75
missing: 30
Number of elements: 75 BLUE
BLUE
   Progress: 0  of  75
missing: 30
Number of elements: 75 GREEN
GREEN
   Progress: 0  of  75
missing: 30
Number of elements: 75 ORANGE
ORANGE
   Progress: 0  of  75
missing: 30
Number of elements: 75 BROWN
BROWN
   Progress: 0  of  75
missing: 30
Number of elements: 75 TEAL
TEAL
   Progress: 0  of  75
missing: 30
Number of elements: 75 RED
RED
   Progress: 0  of  75
missing: 30
Number of elements: 75 YELLOW
YELLOW
   Progress: 0  o