# Webscraping
This notebook scrapes the Google Arts & Culture page using Selenium.

The following items are scraped:
1) picture files
2) picture color
3) picture page

Example: 

![picture](../references/picture.jpg)

- color: WHITE

- image: https://lh3.googleusercontent.com/ci/AC_FhM_TtHnpV4uxifY2CR4N_7aK3dNuIQaNCvCLymW8qlht_f4w6RxNpwLJxcGe94_hKVdNJvfrHMQ=w218-c-h218-rw-v1

- page: https://artsandculture.google.com/asset/the-magpie/rQGnadHwK8lSmg

NOTE: This project is for educational purposes only.

# Scraping

This notebook scrapes the images from the [Google Arts & Culture](https://artsandculture.google.com/) page. The images are grouped into 11 colors, which serve as the target variable for the rest of the project.

Multiprocessing (`ProcessPoolExecutor`) is used to make scraping faster :)

Scraping consists of 2 parts:

1) Extract the URL and the color of every image.
2) Download the images to the file system.
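
The fan-out over colors can be sketched as follows. This is a minimal illustration, not the notebook's scraper: `fake_scrape` is a hypothetical stand-in for the real per-color Selenium job, and `ThreadPoolExecutor` is swapped in for the `ProcessPoolExecutor` used later only so the sketch runs in any cell without pickling concerns (both classes expose the same `map` interface).

```python
from concurrent.futures import ThreadPoolExecutor

COLORS = ["WHITE", "PINK", "YELLOW"]  # subset of the 11 colors, for illustration


def fake_scrape(color: str) -> list[dict]:
    """Hypothetical stand-in: returns two fake rows for the given color."""
    return [
        {"image": f"img-{color}-{i}", "page": f"page-{color}-{i}", "color": color}
        for i in range(2)
    ]


def scrape_all(colors: list[str]) -> list[dict]:
    """Run one job per color and flatten the per-color results into one list."""
    with ThreadPoolExecutor() as executor:
        # map() yields results in the same order as the input colors
        per_color = executor.map(fake_scrape, colors)
        return [row for rows in per_color for row in rows]


rows = scrape_all(COLORS)
```

Because `map` preserves input order, the combined rows stay grouped by color even though the jobs run concurrently.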

# Notebook config

In [80]:
%load_ext autoreload

%autoreload 3

%reload_ext autoreload

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Download webdriver
For manual installation, see `https://chromedriver.chromium.org/downloads`. The command below fetches the Windows (win32) build of ChromeDriver 108.0.5359.71.

In [81]:
!wget https://chromedriver.storage.googleapis.com/108.0.5359.71/chromedriver_win32.zip -O chromedriver.zip && unzip chromedriver.zip

--2023-01-08 22:03:59--  https://chromedriver.storage.googleapis.com/108.0.5359.71/chromedriver_win32.zip
Resolving chromedriver.storage.googleapis.com (chromedriver.storage.googleapis.com)... 142.250.78.16, 2800:3f0:4005:407::2010
Connecting to chromedriver.storage.googleapis.com (chromedriver.storage.googleapis.com)|142.250.78.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6904173 (6,6M) [application/zip]
Saving to: ‘chromedriver.zip’


2023-01-08 22:04:00 (10,9 MB/s) - ‘chromedriver.zip’ saved [6904173/6904173]

Archive:  chromedriver.zip
  inflating: chromedriver.exe        


# Packages

In [5]:
import pyprojroot
from pathlib import Path
import time
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
# Local packages
from src.scraping.scraping_pictures import ScrapingPictures
from src.scraping.get_img import getImage


# Paths

In [78]:
root_path = pyprojroot.here()
path_driver = (root_path / "notebooks" / "chromedriver.exe").relative_to("/")
data_raw_folder = root_path / "data" / "raw"
data_interim_folder = root_path / "data" / "interim"
data_processed_folder = root_path / "data" / "processed"

# Scraping

In [23]:
page_sections = ["WHITE","PINK","YELLOW","PURPLE","BLUE","TEAL","GREEN","ORANGE","RED","BROWN","BLACK"]  # Color sections to scrape
scroll_down_times = 120  # Number of scroll-downs used to load more images

In [24]:
def scrapePictures(color: str) -> pd.DataFrame:
    """
    Scrape the pictures of the given color section from the website.
    """
    scraper = ScrapingPictures(path_driver, color)
    scraper.open()
    scraper.scroll_down(scroll_down_times)
    time.sleep(8)  # fixed pause, presumably so the last thumbnails can load before scraping
    df = scraper.scrape()
    return df
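
The internals of `ScrapingPictures` are not shown in this notebook. As a rough sketch of what a scroll step such as `scroll_down` could look like, assuming the page loads more thumbnails when the window reaches the bottom: `scroll_to_load` and its `pause` parameter are hypothetical names, and `driver` is any object exposing Selenium's `execute_script` method.

```python
import time


def scroll_to_load(driver, max_scrolls: int, pause: float = 0.5) -> int:
    """Scroll to the bottom up to `max_scrolls` times.

    Stops early once the page height no longer grows, which suggests that
    no further content is being loaded. Returns the number of scrolls done.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content a moment to appear
        scrolls += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # page stopped growing
        last_height = new_height
    return scrolls
```

Stopping when the height stalls avoids spending all 120 scrolls on a section that has already been fully loaded.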

In [19]:
def parallelScrapePictures() -> pd.DataFrame:
    """
    Scrape all the pictures from the website using multiprocessing
    """
    with ProcessPoolExecutor() as executor:
        scrape_df_list = executor.map(scrapePictures, page_sections)
        scrape_df = pd.concat(list(scrape_df_list), ignore_index=True)
        return scrape_df

In [26]:
t1 = time.perf_counter()
df = parallelScrapePictures() 
t2 = time.perf_counter()
print(f'Finished in {t2-t1} seconds')

Open page section YELLOW
Open page section BLUE
Open page section PINK
Open page section TEAL
Open page section BLACK
Open page section GREEN
Open page section RED
Open page section BROWN
Open page section ORANGE
Open page section PURPLE
Open page section WHITE
Finish scrolling down
Finish scrolling down
Finish scrolling down
Finish scrolling down
Finish scrolling down
Finish scrolling down
Finish scrolling down
Finish scrolling down
Finish scrolling down
Finish scrolling down
Finish scrolling down

Finished in 373.1641500099995 seconds


In [27]:
len(df)

16378

In [30]:
df.sample(15)

| | image | page | color |
|---|---|---|---|
| 9526 | url("https://lh3.googleusercontent.com/ci/AC_F... | https://artsandculture.google.com/asset/escrav... | BROWN |
| 6055 | url("https://lh3.googleusercontent.com/ci/AC_F... | https://artsandculture.google.com/asset/landsc... | GREEN |
| 6703 | url("https://lh3.googleusercontent.com/ci/AC_F... | https://artsandculture.google.com/asset/fenest... | GREEN |
| 10053 | url("https://lh3.googleusercontent.com/ci/AC_F... | https://artsandculture.google.com/asset/pierro... | BROWN |
| 3718 | url("https://lh3.googleusercontent.com/ci/AC_F... | https://artsandculture.google.com/asset/hostag... | BLUE |
| 12819 | url("https://lh3.googleusercontent.com/ci/AC_F... | https://artsandculture.google.com/asset/a-stor... | BLACK |
| 2353 | url("https://lh3.googleusercontent.com/ci/AC_F... | https://artsandculture.google.com/asset/white-... | PURPLE |
| 1649 | url("https://lh3.googleusercontent.com/ci/AC_F... | https://artsandculture.google.com/asset/powley... | YELLOW |
| 15130 | url("https://lh3.googleusercontent.com/ci/AC_F... | https://artsandculture.google.com/asset/self-p... | BLACK |
| 16313 | url("https://lh3.googleusercontent.com/ci/AC_F... | https://artsandculture.google.com/asset/gassed... | BLACK |


# Export Picture Data

In [31]:
df.to_csv(data_raw_folder/'pictures.csv',index=False)

# Download images

In [67]:
df_pictures = pd.read_csv(data_raw_folder/'pictures.csv')
df_pictures.head()


| | image | page | color |
|---|---|---|---|
| 0 | url("https://lh3.googleusercontent.com/ci/AC_F... | https://artsandculture.google.com/asset/the-ma... | WHITE |
| 1 | url("https://lh3.googleusercontent.com/ci/AC_F... | https://artsandculture.google.com/asset/sympho... | WHITE |
| 2 | url("https://lh3.googleusercontent.com/ci/AC_F... | https://artsandculture.google.com/asset/the-cr... | WHITE |
| 3 | url("https://lh3.googleusercontent.com/ci/AC_F... | https://artsandculture.google.com/asset/the-cr... | WHITE |
| 4 | url("https://lh3.googleusercontent.com/ci/AC_F... | https://artsandculture.google.com/asset/portra... | WHITE |


In [68]:
(df_pictures.image == 'none' ).sum()

67

67 rows have a missing image URL (`'none'`), possibly because the scraper did not wait for that content to load.
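
If missing values do come from reading a thumbnail before it has loaded, one possible mitigation is to poll until the value is no longer `'none'`. The helper below is a generic, standard-library sketch under that assumption; `wait_for_value` and `read_value` are hypothetical names and are not part of `ScrapingPictures`. Selenium's own `WebDriverWait(...).until(...)` serves a similar purpose inside a real scraper.

```python
import time


def wait_for_value(read_value, missing="none", timeout=5.0, interval=0.1):
    """Call `read_value()` repeatedly until it returns something other than
    `missing`, or until `timeout` seconds have passed.

    Returns the last value read, which is still `missing` on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = read_value()
        if value != missing:
            return value
        if time.monotonic() >= deadline:
            return missing
        time.sleep(interval)
```

In a scraper, `read_value` would be a small function that reads the element's background-image value; rows that still return `'none'` after the timeout can then be dropped as in the next cell.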

In [69]:
df_pictures = df_pictures[df_pictures.image != 'none'].copy()  # copy so later column assignments do not trigger SettingWithCopyWarning

In [70]:
len(df_pictures)

16311

In [71]:
df_pictures['image'] = df_pictures['image'].str.split("\"",expand=True)[1]
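
The split above takes the text between the first pair of double quotes in values such as `url("https://...")`. As an alternative sketch, a regular expression can make the intent explicit and also tolerate single-quoted or unquoted values. `extract_css_url` is a hypothetical helper, not part of the project code, and the URL in the usage lines is a placeholder.

```python
import re

# Matches url("..."), url('...') and url(...), capturing the inner URL
CSS_URL = re.compile(r'url\(\s*["\']?(.*?)["\']?\s*\)')


def extract_css_url(value: str):
    """Return the URL inside a CSS url(...) value, or None if there is none."""
    match = CSS_URL.search(value)
    return match.group(1) if match else None


quoted = extract_css_url('url("https://example.com/a.jpg")')
unquoted = extract_css_url("url(https://example.com/b.jpg)")
missing = extract_css_url("none")
```

Returning `None` for values like `'none'` means the missing rows could be filtered after extraction instead of before it.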

In [72]:
df_pictures['index']= df_pictures.index

Create an index column to identify each image.

In [20]:
image_paths = df_pictures['image'].values.tolist()
index_list = df_pictures['index'].values.tolist()

In [34]:
def downloadImg(url: str, index: int) -> bool:
    """
    Download the image at `url` and save it as `<index>.jpg`.
    Returns True on success and False on failure.
    """
    filename = f'{index}.jpg'
    fullpath = data_processed_folder/'img'/filename
    try:
        getImage(url, fullpath)
        return True
    except Exception as exc:  # a bare except would also swallow KeyboardInterrupt
        print(f'Failed to download {url}: {exc}')
        return False

def parallelDownloadImg(image_paths,index_list) -> None:
    """
    Download images in parallel.
    """
    with ProcessPoolExecutor() as executor:
        executor.map(downloadImg, image_paths,index_list)
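
The local helper `getImage` is imported from `src.scraping.get_img` and its implementation is not shown here. A minimal standard-library sketch of a function with the same role, assuming it only needs to stream the response body to disk, might look like this (`get_image` and its `timeout` parameter are hypothetical):

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def get_image(url: str, destination: Path, timeout: float = 10.0) -> Path:
    """Fetch `url` and write the raw bytes to `destination`."""
    destination = Path(destination)
    # Create the target folder (e.g. data/processed/img) if it does not exist
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url, timeout=timeout) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)  # stream to disk without loading it all in memory
    return destination
```

Any exception raised by `urlopen` propagates to the caller, which matches how `downloadImg` wraps the call in a `try` block.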


In [None]:
parallelDownloadImg(image_paths,index_list)

# Export pictures data + images location

In [73]:
df_pictures['filename'] = df_pictures['index'].astype('str')+'.jpg'

In [79]:
df_pictures.to_csv(data_interim_folder/'pictures.csv',index=False)

# Data Integrity

In [51]:
print(df_pictures['image'].nunique())
print(df_pictures['page'].nunique())

16311
16311


Every value in the image and page columns is unique (16311 each, equal to the number of rows), so the data contains no duplicate images or pages.
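
If the unique counts ever fall below the row count, it helps to see which values repeat. A small sketch using only the standard library, where `find_duplicates` is a hypothetical helper and the sample values are placeholders:

```python
from collections import Counter


def find_duplicates(values) -> dict:
    """Return a mapping of each repeated value to its number of occurrences."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}


clean = find_duplicates(["page-a", "page-b", "page-c"])
repeated = find_duplicates(["page-a", "page-b", "page-a"])
```

Here it could be called on a column converted to a list, for example `find_duplicates(df_pictures['page'].tolist())`.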

In [56]:
!ls ../data/processed/img | wc -l

16311


The number of images on the file system matches the number of rows in the dataframe.
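
Matching counts do not show which files are missing if a download failed. As a sketch of a stricter check, the expected filenames can be compared with the directory contents as sets. `compare_with_disk` is a hypothetical helper, and the directory is passed in as a parameter rather than assumed.

```python
from pathlib import Path


def compare_with_disk(expected_filenames, image_dir):
    """Return (missing, unexpected): files expected but absent from
    `image_dir`, and files present there but not expected."""
    expected = set(expected_filenames)
    on_disk = {p.name for p in Path(image_dir).iterdir() if p.is_file()}
    return sorted(expected - on_disk), sorted(on_disk - expected)
```

With this notebook's variables, a call could look like `compare_with_disk(df_pictures['filename'], data_processed_folder / 'img')`; two empty lists would confirm that every row has its image on disk.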