# Google Arts & Culture - Case study using CRISP-DM

#### Authors: Manuel Alejandro Aponte, Cristian Beltran, Maria Paula Peña

This notebook scrapes the Google Arts & Culture website.

## Objectives
The objectives of this notebook are:

* Download images using web scraping.
* Download the image metadata.
* Store all the information in CSV files.

## Prerequisites

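* Familiarity with Python.
* Latest version of ChromeDriver (the WebDriver for Google Chrome), matching the installed Chrome version. Source: https://chromedriver.chromium.org/
* The Python packages imported in this notebook, installed.
* A VPN (recommended).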

## Background 
This notebook is part of the Google Arts & Culture case study, which follows CRISP-DM and includes steps such as web scraping, exploratory data analysis, ML classifiers, and dashboards.

In [1]:
#Import project packages

from src.Scraper.Scraper import Scraper
from src.Parallel.Parallel import parallel
#Import packages
import pandas as pd
import numpy as np
import requests
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated
import json

# Global Settings

In [None]:
#Colors of sections to be webscraped
COLORS = ["WHITE","PINK","YELLOW","PURPLE","BLUE","TEAL","GREEN","ORANGE","RED","BROWN","BLACK"] 

#Webdriver path
DRIVER_PATH = r"../chromedriver.exe"

#Folder of data resources
DATA_RAW_FOLDER = "../data/raw"
DATA_PROCESSED_FOLDER = "../data/processed"
DATA_FINAL_FOLDER = "../data/final"


# Webscraping Attributes Extraction (Phase 1)

In [None]:
def webscraping(color:str):
    """Scrape the Google Arts & Culture page for one color section.

    Parameters
    ----------
    color : str
        The color of the target page.

    Returns
    -------
    The records collected by `Scraper.scraping_data()` for that color.
    """
    scraper = Scraper(DRIVER_PATH,color)
    scraper.open()
    data = scraper.scraping_data()
    return data
    
    
webscraping_data = parallel(webscraping,COLORS)
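
The `parallel` helper comes from the project's `src.Parallel.Parallel` module, and its implementation is not shown in this notebook. As a rough illustration only, the sketch below shows one way such a helper could behave, assuming it applies a function to every item of a list and returns the results in input order. The names `parallel_map_sketch` and `fake_scrape` are hypothetical and are not part of the project.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def parallel_map_sketch(func: Callable[[T], R], items: Iterable[T], max_workers: int = 4) -> List[R]:
    """Hypothetical stand-in for the project's `parallel` helper.

    Runs `func` on every item in a thread pool and returns the results
    in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so results[i] belongs to items[i].
        return list(pool.map(func, items))


# Toy usage: a fake "scraper" that returns two records per color.
def fake_scrape(color: str) -> list:
    return [{"color": color, "n": 1}, {"color": color, "n": 2}]


results = parallel_map_sketch(fake_scrape, ["WHITE", "PINK"])
print(results[0][0:2])
```

Threads suit this kind of work because scraping and downloading spend most of their time waiting on the network. The real helper may use processes or a different pool size.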

Inspect some of the data

In [None]:
#Print WHITE collection
print(webscraping_data[0][0:2])
#Print PINK collection
print(webscraping_data[1][0:2])

Merge the collections into one list

In [None]:
flatten_data = [item for sublist in webscraping_data for item in sublist]

In [None]:
flatten_data[0:2]
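
The flattening step above turns a list of per-color lists into a single list of records. A small sketch on invented toy data shows the same operation with `itertools.chain.from_iterable`, which is an equivalent alternative to the nested list comprehension. The `nested` and `flat` names and the sample records are assumptions for illustration.

```python
from itertools import chain

# Toy stand-in for webscraping_data: one sublist per color, possibly empty.
nested = [[{"color": "WHITE"}, {"color": "WHITE"}], [{"color": "PINK"}], []]

# chain.from_iterable walks the sublists in order and yields their items.
flat = list(chain.from_iterable(nested))
print(len(flat))
```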

In [None]:
df = pd.DataFrame(flatten_data)
df['index'] = df.index
df.to_csv(f'{DATA_RAW_FOLDER}/pictures_original.csv', index = False)  

# Webscraping Image Extraction (Phase 2)

Get the image URL and the identifying key (index)

In [None]:
   
df = pd.read_csv(f'{DATA_RAW_FOLDER}/pictures_original.csv')
df = df[['index','url']]   
data = list(df.itertuples(index=False, name=None)) # Convert data into tuples

In [None]:
data[0:3]

Download the files in parallel

In [None]:
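def get_image(data):
    """Download one image and save it as <index>.jpg.

    `data` is an (index, url) tuple, as built in the previous cell.
    """
    image_id, link = data  # avoid shadowing the built-in `id`
    path = f'{DATA_FINAL_FOLDER}/img/{image_id}.jpg'
    response = requests.get(link, timeout=30)  # do not hang forever on a stalled request
    response.raise_for_status()  # do not save an HTTP error page as an image
    with open(path, 'wb') as handler:
        handler.write(response.content)
    return (image_id, f'{image_id}.jpg')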

image_files = parallel(get_image,data) 
print(image_files[0:3])
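
Every download is saved with a `.jpg` extension, whatever the server actually returned. One possible safeguard, sketched below, is to look at the first bytes of the response and guess the real format before writing the file. The helper `sniff_image_extension` is hypothetical and not part of the project; it only covers a few common formats.

```python
from typing import Optional

# Leading "magic" bytes of common image formats.
_SIGNATURES = {
    b"\xff\xd8\xff": "jpg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_extension(content: bytes) -> Optional[str]:
    """Return a file extension guessed from the leading bytes, or None."""
    for signature, extension in _SIGNATURES.items():
        if content.startswith(signature):
            return extension
    # WebP files are RIFF containers with "WEBP" at byte offset 8.
    if content[:4] == b"RIFF" and content[8:12] == b"WEBP":
        return "webp"
    return None


# Toy usage with hand-made byte strings instead of real downloads.
print(sniff_image_extension(b"\xff\xd8\xff\xe0" + b"\x00" * 10))
print(sniff_image_extension(b"<html>Not an image</html>"))
```

A `None` result would indicate that the response was probably not an image, for example an HTML error page, and that the row deserves a second look.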

Convert the results into a DataFrame and export it

In [None]:
df_images = pd.DataFrame(image_files, columns=['index','filename']) #Parse data into df
df_images.to_csv(f'{DATA_PROCESSED_FOLDER}/picture_files.csv', index = False)  # Export

# Transform

* Filtering, cleansing, de-duplicating, validating, and authenticating the data.
* Performing calculations, translations, or summarizations based on the raw data. This can include changing row and column headers for consistency, converting currencies or other units of measurement, editing text strings, and more.
* Ensuring data quality.
* Formatting the data into tables or joined tables to match the schema of the target data warehouse.
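
As a rough illustration of the kinds of steps listed above, the sketch below filters, de-duplicates, and tidies a small invented table with pandas. The `toy` frame, its column names, and its values are assumptions made for the example and do not come from the scraped data.

```python
import pandas as pd

# Invented sample: a duplicated URL, a missing URL, and untidy headers/text.
toy = pd.DataFrame(
    {
        "index": [0, 1, 2, 3],
        "url": [
            "https://example.com/a.jpg",
            "https://example.com/a.jpg",
            None,
            "https://example.com/b.jpg",
        ],
        "Title ": [" Sunset", "Sunset ", "Untitled", "River"],
    }
)

cleaned = (
    toy.rename(columns=lambda name: name.strip().lower())    # consistent headers
    .dropna(subset=["url"])                                  # filter rows with no URL
    .drop_duplicates(subset=["url"])                         # keep the first row per URL
    .assign(title=lambda frame: frame["title"].str.strip())  # tidy text strings
    .reset_index(drop=True)
)
print(cleaned)
```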

### Check data integrity

In [None]:
df_original = pd.read_csv(f'{DATA_RAW_FOLDER}/pictures_original.csv')
df_images = pd.read_csv(f'{DATA_PROCESSED_FOLDER}/picture_files.csv')
df = df_original.merge(df_images, on='index')  # inner join on the shared key
print(df.head())

In [None]:
print('Final Data Shape:',df.shape)
print('Original Data Shape:',df_original.shape)
print('Image Files Data Shape:',df_images.shape)
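
Comparing shapes shows whether rows were lost in the merge, but not which ones. A possible complement, sketched here on invented toy frames, is an outer merge with `indicator=True`, which labels each row as present on the left, the right, or both sides. The `left`, `right`, and `audit` names are assumptions for the example.

```python
import pandas as pd

# Toy stand-ins: `left` plays df_original, `right` plays df_images.
left = pd.DataFrame({"index": [0, 1, 2], "url": ["u0", "u1", "u2"]})
right = pd.DataFrame({"index": [0, 2, 3], "filename": ["0.jpg", "2.jpg", "3.jpg"]})

# validate="one_to_one" raises if a key is repeated on either side.
audit = left.merge(right, on="index", how="outer", indicator=True, validate="one_to_one")

# Rows scraped but never downloaded, and files with no metadata row.
missing_images = audit.loc[audit["_merge"] == "left_only", "index"].tolist()
orphan_files = audit.loc[audit["_merge"] == "right_only", "index"].tolist()
print(audit["_merge"].value_counts())
```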

In [None]:
df.info()

In [None]:
print('Duplicated index values:', df['index'].duplicated().any())
print('Duplicated URLs:', df['url'].duplicated().any())

In [None]:
df.isnull().sum()

### Parse the data column into a DataFrame

In [None]:

# Parse each JSON string in the `data` column into a Python dict
data_json = [json.loads(string_json) for string_json in df['data']]

In [None]:
data_df = json_normalize(data_json)
data_df.head(3)
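
The structure of the `data` column is not shown in this notebook, so the sketch below uses invented records to illustrate what `json_normalize` does: nested objects become flat columns with dotted names, and keys missing from a record become `NaN`. The field names `title`, `creator.name`, and `creator.born` are assumptions.

```python
import json

import pandas as pd

# Invented JSON strings standing in for values of the `data` column.
raw = [
    '{"title": "Sunset", "creator": {"name": "A. Painter", "born": 1850}}',
    '{"title": "River", "creator": {"name": "B. Artist"}}',
]

records = [json.loads(text) for text in raw]

# Nested keys are flattened into "parent.child" column names.
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```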

In [None]:
full_df = pd.concat([df,data_df ],axis=1)
full_df = full_df.drop(['data'],axis=1)
full_df.head(3)
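
`pd.concat(..., axis=1)` matches rows by index label, not by position. In this notebook both frames should carry a default 0..n-1 index (one comes from a merge, the other from `json_normalize`), so the rows line up. The toy sketch below, with invented frames, shows what could go wrong if one side had been filtered or re-indexed first, and how `reset_index(drop=True)` avoids it.

```python
import pandas as pd

# `base` has a non-default index, as it might after filtering rows.
base = pd.DataFrame({"url": ["u0", "u1", "u2"]}, index=[10, 11, 12])
parsed = pd.DataFrame({"title": ["a", "b", "c"]})  # default index 0, 1, 2

# Labels do not overlap, so concat produces the union of both indexes.
misaligned = pd.concat([base, parsed], axis=1)

# Resetting the index restores positional alignment.
aligned = pd.concat([base.reset_index(drop=True), parsed], axis=1)
print(misaligned.shape, aligned.shape)
```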

Check data integrity again

In [None]:
full_df.info()

# Export all information 

In [None]:
full_df.to_csv(f'{DATA_PROCESSED_FOLDER}/picture_data.csv', index = False)