# Google Arts & Culture - Case Study Using CRISP-DM

#### Authors: Manuel Alejandro Aponte, Cristian Beltran, Maria Paula Peña

This notebook scrapes artwork images and their metadata from the Google Arts & Culture website.

## Objectives
The objectives of this notebook are:

* Download images using webscraping.
* Download images metadata.
* Store all information in a dataset (CSV files).

## Prerequisites

* Familiarity with Python
* Latest version of ChromeDriver (the Chrome WebDriver). Source: https://chromedriver.chromium.org/
* Required Python packages installed
* A VPN (recommended)

## Background 
This notebook is part of the Google Arts & Culture case study using CRISP-DM, which covers web scraping, exploratory data analysis, ML classifiers, and dashboards.

In [1]:
#Import project packages
import sys
sys.path.append('../src/')
from Scraper import Scraper
from Parallel import parallel
#Import packages
import pandas as pd
import numpy as np
import requests
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
import json

# Global Settings

In [2]:
#Colors of sections to be webscraped
COLORS = ["WHITE","PINK","YELLOW","PURPLE","BLUE","TEAL","GREEN","ORANGE","RED","BROWN","BLACK"] 

#Webdriver path
DRIVER_PATH = r"../chromedriver.exe"

#Folder of data resources
DATA_RAW_FOLDER = "../data/raw"
DATA_PROCESSED_FOLDER = "../data/processed"
DATA_FINAL_FOLDER = "../data/final"
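
The scraping and download cells write into these folders, so they need to exist before the notebook runs. A minimal sketch, assuming the folder layout above plus the `img` subfolder that the image download step writes to; the helper name `ensure_folders` is hypothetical:

```python
from pathlib import Path

def ensure_folders(*folders):
    """Create each folder (and any missing parents) if it does not exist yet."""
    created = []
    for folder in folders:
        path = Path(folder)
        path.mkdir(parents=True, exist_ok=True)  # no error if already present
        created.append(path)
    return created

# Possible usage with the notebook's settings (commented out to avoid side effects):
# ensure_folders(DATA_RAW_FOLDER, DATA_PROCESSED_FOLDER, f"{DATA_FINAL_FOLDER}/img")
```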


# Webscraping Attributes Extraction (Phase 1)

In [None]:
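def webscraping(color: str):
    """Scrape one color section of the Google Arts & Culture web page.

    Parameters
    ----------
    color : str
        The color of the target page (one of COLORS).

    Returns
    -------
    list of dict
        One record per artwork with the keys 'url', 'data', and 'category'.
    """
    scraper = Scraper(DRIVER_PATH, color)
    scraper.open()
    data = scraper.scraping_data()
    return data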
    
    
webscraping_data = parallel(webscraping,COLORS)
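
`parallel` comes from the project's `src/` folder and its implementation is not shown in this notebook. A minimal sketch of a comparable helper, assuming it applies a function to every item of a list concurrently and returns the results in input order; `parallel_map`, `max_workers`, and `fake_scrape` are hypothetical names, and a thread pool is only one possible choice:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(func, items, max_workers=4):
    """Apply func to each item using a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the order of `items`,
        # regardless of which task finishes first.
        return list(executor.map(func, items))

def fake_scrape(color):
    # Stand-in for the real scraper: one small made-up record per color.
    return [{"category": color, "url": f"https://example.org/{color.lower()}"}]

results = parallel_map(fake_scrape, ["WHITE", "PINK"])
print(results[0])
```

Order preservation matters here because the following cells treat `webscraping_data[0]` as the WHITE collection and `webscraping_data[1]` as the PINK collection.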

Inspect a sample of the scraped data

In [4]:
#Print WHITE collection
print(webscraping_data[0][0:2])
#Print PINK collection
print(webscraping_data[1][0:2])

[{'url': 'https://lh3.googleusercontent.com/ci/AC_FhM_oasJFGZVi8jye_ImxogO_y6DHA6Ha2nK85qkgEdZiJd5ku_wyJ6AOyfEOKIW-rEsmFZB5ng=w313-c-h313-fcrop64=1,00000d37ffff8ce4-rw-v1', 'data': '{"Title":" Symphony in White, No. 1 The White Girl","Creator":" James McNeill Whistler","Date Created":" 1862","External Link":"  For more information about this and thousands of other works of art in the NGA collection, please visit\xa0http//www.nga.gov/","Medium":" oil on canvas","Object Credit":" Harris Whittemore Collection","Dimensions":" overall 213 x 107.9 cm (83 7/8 x 42 1/2 in.)\\u000b\\u000bframed 244.2 x 136.5 x 8.3 cm (96 1/8 x 53 3/4 x 3 1/4 in.)","Classification":" Painting","Artist School":" American","Artist Nationality":" American","Artist Details":" American, 1834 - 1903"}', 'category': 'WHITE'}, {'url': 'https://lh3.googleusercontent.com/ci/AC_FhM9f2_H_Vib6NbUnODNBmKAF4_nxQENdrmuCgD_qbON7A8MhrGE8artJsUQA-1s0VCBUPmhAkQ=w313-c-h313-fcrop64=1,00000b77ffffdd97-rw-v1', 'data': '{"Title":" The 

Merge the per-color collections into a single list

In [5]:
flatten_data = [item for sublist in webscraping_data for item in sublist]
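
The list comprehension concatenates the per-color lists in order. An equivalent sketch using `itertools.chain.from_iterable`, shown on made-up records:

```python
from itertools import chain

# Two toy "collections", mimicking the nested structure returned per color.
nested = [
    [{"category": "WHITE", "url": "a"}, {"category": "WHITE", "url": "b"}],
    [{"category": "PINK", "url": "c"}],
]

flat = list(chain.from_iterable(nested))  # one flat list, original order kept
print(len(flat))  # 3
```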

In [6]:
flatten_data[0:2]

[{'url': 'https://lh3.googleusercontent.com/ci/AC_FhM_oasJFGZVi8jye_ImxogO_y6DHA6Ha2nK85qkgEdZiJd5ku_wyJ6AOyfEOKIW-rEsmFZB5ng=w313-c-h313-fcrop64=1,00000d37ffff8ce4-rw-v1',
  'data': '{"Title":" Symphony in White, No. 1 The White Girl","Creator":" James McNeill Whistler","Date Created":" 1862","External Link":"  For more information about this and thousands of other works of art in the NGA collection, please visit\xa0http//www.nga.gov/","Medium":" oil on canvas","Object Credit":" Harris Whittemore Collection","Dimensions":" overall 213 x 107.9 cm (83 7/8 x 42 1/2 in.)\\u000b\\u000bframed 244.2 x 136.5 x 8.3 cm (96 1/8 x 53 3/4 x 3 1/4 in.)","Classification":" Painting","Artist School":" American","Artist Nationality":" American","Artist Details":" American, 1834 - 1903"}',
  'category': 'WHITE'},
 {'url': 'https://lh3.googleusercontent.com/ci/AC_FhM9f2_H_Vib6NbUnODNBmKAF4_nxQENdrmuCgD_qbON7A8MhrGE8artJsUQA-1s0VCBUPmhAkQ=w313-c-h313-fcrop64=1,00000b77ffffdd97-rw-v1',
  'data': '{"Title"

In [7]:
df = pd.DataFrame(flatten_data)
df['index'] = df.index
df.to_csv(f'{DATA_RAW_FOLDER}/pictures_original.csv', index = False)  

# Webscraping Image Extraction (Phase 2)

Load the image URL and its identifying key (index) for each record

In [8]:
   
df = pd.read_csv(f'{DATA_RAW_FOLDER}/pictures_original.csv')
df = df[['index','url']]   
data = list(df.itertuples(index=False, name=None)) # Convert data into tuples

In [9]:
data[0:3]

[(0,
  'https://lh3.googleusercontent.com/ci/AC_FhM_oasJFGZVi8jye_ImxogO_y6DHA6Ha2nK85qkgEdZiJd5ku_wyJ6AOyfEOKIW-rEsmFZB5ng=w313-c-h313-fcrop64=1,00000d37ffff8ce4-rw-v1'),
 (1,
  'https://lh3.googleusercontent.com/ci/AC_FhM9f2_H_Vib6NbUnODNBmKAF4_nxQENdrmuCgD_qbON7A8MhrGE8artJsUQA-1s0VCBUPmhAkQ=w313-c-h313-fcrop64=1,00000b77ffffdd97-rw-v1'),
 (2,
  'https://lh3.googleusercontent.com/ci/AC_FhM8sQxk2zXS1WRTp6PoUOIrEiuW3JY-zFPb4AKLNM27TZCeQACe19Vxssvw_ssBA4nOnScwyDRc=w313-c-h313-rw-v1')]

Download the image files in parallel

In [None]:
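def get_image(data):
    """Download one image and save it as <index>.jpg.

    `data` is an (index, url) tuple; returns (index, filename).
    """
    image_id, link = data  # Avoid shadowing the built-in `id`
    path = f'{DATA_FINAL_FOLDER}/img/{image_id}.jpg'
    img = requests.get(link).content
    with open(path, 'wb') as handler:
        handler.write(img)
    return (image_id, f'{image_id}.jpg')

image_files = parallel(get_image, data)
print(image_files[0:3])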
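
Network downloads can stall or fail, and `get_image` has no handling for either case. A sketch of a more defensive variant, assuming failures should be recorded rather than abort the whole run; `save_image` and `fetch_bytes` are hypothetical names, and the standard library's `urllib` is used here only to keep the example self-contained (the notebook itself uses `requests`):

```python
import os
import urllib.request

def fetch_bytes(url, timeout=30):
    """Return the response body; urlopen raises on HTTP errors and timeouts."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()

def save_image(item, folder, fetch=fetch_bytes):
    """Download one (index, url) pair; return (index, filename or None)."""
    index, url = item
    filename = f"{index}.jpg"
    try:
        content = fetch(url)
    except OSError:
        # urllib.error.URLError and socket timeouts are OSError subclasses.
        return (index, None)
    with open(os.path.join(folder, filename), "wb") as handler:
        handler.write(content)
    return (index, filename)
```

Records that come back with a `None` filename can then be retried or filtered out before the results are exported.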

Convert the results into a DataFrame and export it

In [12]:
df_images = pd.DataFrame(image_files, columns=['index','filename']) #Parse data into df
df_images.to_csv(f'{DATA_PROCESSED_FOLDER}/picture_files.csv', index = False)  # Export

# Transform

* Filtering, cleansing, de-duplicating, validating, and authenticating the data.
* Performing calculations, translations, or summarizations based on the raw data. This can include changing row and column headers for consistency, converting currencies or other units of measurement, editing text strings, and more.
* Ensuring data quality.
* Formatting the data into tables or joined tables to match the schema of the target data warehouse.

### Check Data integrity

In [13]:
df_original = pd.read_csv(f'{DATA_RAW_FOLDER}/pictures_original.csv')
df_images = pd.read_csv(f'{DATA_PROCESSED_FOLDER}/picture_files.csv')
df = df_original.merge(df_images, on='index')
print(df.head())

                                                 url  \
0  https://lh3.googleusercontent.com/ci/AC_FhM_oa...   
1  https://lh3.googleusercontent.com/ci/AC_FhM9f2...   
2  https://lh3.googleusercontent.com/ci/AC_FhM8sQ...   
3  https://lh3.googleusercontent.com/ci/AC_FhM9aa...   
4  https://lh3.googleusercontent.com/ci/AC_FhM-xJ...   

                                                data category  index filename  
0  {"Title":" Symphony in White, No. 1 The White ...    WHITE      0    0.jpg  
1  {"Title":" The Cradle","Date Created":" 1872",...    WHITE      1    1.jpg  
2  {"Title":" The Magpie","Date Created":" 1868 -...    WHITE      2    2.jpg  
3  {"Title":" Summer evening on Skagen Sønderstra...    WHITE      3    3.jpg  
4  {"Title":" Composition with red, yellow and bl...    WHITE      4    4.jpg  


In [14]:
print('Final Data Shape:',df.shape)
print('Original Data Shape:',df_original.shape)
print('Image Files Data Shape:',df_images.shape)

Final Data Shape: (10287, 5)
Original Data Shape: (10287, 4)
Image Files Data Shape: (10287, 2)
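
Matching shapes confirm that the row counts agree, but not that every index found a partner. A sketch of a stricter check on toy data, using the `validate` and `indicator` options of `DataFrame.merge`; the two small frames are made up for illustration:

```python
import pandas as pd

originals = pd.DataFrame({"index": [0, 1, 2], "url": ["a", "b", "c"]})
files = pd.DataFrame({"index": [0, 1], "filename": ["0.jpg", "1.jpg"]})

merged = originals.merge(
    files,
    on="index",
    how="outer",            # keep rows that exist on only one side
    validate="one_to_one",  # raise MergeError if an index repeats on either side
    indicator=True,         # add a _merge column: both / left_only / right_only
)

# Rows without a partner on the other side point to missing downloads or metadata.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["index", "_merge"]])
```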


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10287 entries, 0 to 10286
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   url       10287 non-null  object
 1   data      10287 non-null  object
 2   category  10287 non-null  object
 3   index     10287 non-null  int64 
 4   filename  10287 non-null  object
dtypes: int64(1), object(4)
memory usage: 482.2+ KB


In [16]:
print('Duplicated index values:', df['index'].duplicated().any())
print('Duplicated URLs:', df['url'].duplicated().any())

Duplicated index values: False
Duplicated URLs: False


In [17]:
df.isnull().sum()

url         0
data        0
category    0
index       0
filename    0
dtype: int64

### Parse the data column into a DataFrame

In [18]:

data_string = df['data'].tolist()
data_json = [json.loads(string_json) for string_json in data_string]  # Parse each JSON string into a dict
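
The scraped values carry leading spaces (for example `" James McNeill Whistler"`). A sketch of parsing one record and trimming its values, assuming the values are strings as in the samples shown; `parse_record` is a hypothetical helper:

```python
import json

def parse_record(raw):
    """Parse a JSON object string and strip surrounding whitespace from string values."""
    record = json.loads(raw)
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }

sample = '{"Title":" The Cradle","Date Created":" 1872"}'
print(parse_record(sample))  # {'Title': 'The Cradle', 'Date Created': '1872'}
```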

In [19]:
data_df = json_normalize(data_json)
data_df.head(3)


Unnamed: 0,Title,Creator,Date Created,External Link,Medium,Object Credit,Dimensions,Classification,Artist School,Artist Nationality,...,Additional Artist Death Date,Additional Artist Birth Date,Alternate Titles,Scientist,Catalogue Reference,Additional Artist Details,"Painter, printmaker",Artist Death Place,Maker,Frame
0,"Symphony in White, No. 1 The White Girl",James McNeill Whistler,1862,For more information about this and thousand...,oil on canvas,Harris Whittemore Collection,overall 213 x 107.9 cm (83 7/8 x 42 1/2 in.)...,Painting,American,American,...,,,,,,,,,,
1,The Cradle,,1872,http//www.musee-orsay.fr/en/collections/work...,,,,,,,...,,,,,,,,,,
2,The Magpie,,1868 - 1869,http//www.musee-orsay.fr/en/collections/work...,,,,,,,...,,,,,,,,,,


In [20]:
full_df = pd.concat([df,data_df ],axis=1)
full_df = full_df.drop(['data'],axis=1)
full_df.head(3)

Unnamed: 0,url,category,index,filename,Title,Creator,Date Created,External Link,Medium,Object Credit,...,Additional Artist Death Date,Additional Artist Birth Date,Alternate Titles,Scientist,Catalogue Reference,Additional Artist Details,"Painter, printmaker",Artist Death Place,Maker,Frame
0,https://lh3.googleusercontent.com/ci/AC_FhM_oa...,WHITE,0,0.jpg,"Symphony in White, No. 1 The White Girl",James McNeill Whistler,1862,For more information about this and thousand...,oil on canvas,Harris Whittemore Collection,...,,,,,,,,,,
1,https://lh3.googleusercontent.com/ci/AC_FhM9f2...,WHITE,1,1.jpg,The Cradle,,1872,http//www.musee-orsay.fr/en/collections/work...,,,...,,,,,,,,,,
2,https://lh3.googleusercontent.com/ci/AC_FhM8sQ...,WHITE,2,2.jpg,The Magpie,,1868 - 1869,http//www.musee-orsay.fr/en/collections/work...,,,...,,,,,,,,,,


Check data integrity again

In [21]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10287 entries, 0 to 10286
Columns: 501 entries, url to Frame
dtypes: int64(1), object(500)
memory usage: 39.4+ MB
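
With 501 columns, many metadata fields are likely filled for only a small share of the artworks. A sketch for measuring how well each column is populated and keeping only the denser ones, on toy data; the 0.5 threshold is an arbitrary assumption:

```python
import pandas as pd

toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Creator": ["x", None, "y", "z"],
    "Frame": [None, None, None, "gilt"],
})

fill_ratio = toy.notna().mean()  # share of non-null values per column
dense_columns = fill_ratio[fill_ratio >= 0.5].index.tolist()

print(fill_ratio.sort_values(ascending=False))
print(dense_columns)
```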


# Export all information 

In [None]:
full_df.to_csv(f'{DATA_PROCESSED_FOLDER}/picture_data.csv', index = False)