# Google Arts & Culture - Case Study Using CRISP-DM

#### Authors: Manuel Alejandro Aponte, Cristian Beltran, Maria Paula Peña

This notebook scrapes the Google Arts & Culture website.

## Objectives
The objectives of this notebook are:

* Download images using web scraping.
* Download the image metadata.
* Store all the information in CSV files.

## Prerequisites

* Familiarity with Python
* Latest version of ChromeDriver (source: https://chromedriver.chromium.org/)
* The required Python packages installed
* A VPN (recommended)

## Background
This notebook is part of the Google Arts & Culture case study using CRISP-DM, which includes steps such as web scraping, exploratory data analysis, ML classifiers, and dashboards.

In [None]:
#Import project packages
import sys
sys.path.append('../src/')
from Scraper import Scraper
from Parallel import parallel
#Import packages
from typing import List
import pandas as pd
import numpy as np
import time
from selenium import webdriver
from pprint import pprint
from concurrent.futures import ThreadPoolExecutor
from time import sleep
import requests

# Global Settings

In [None]:
#Colors of sections to be webscraped
COLORS = ["WHITE","PINK","YELLOW","PURPLE","BLUE","TEAL","GREEN","ORANGE","RED","BROWN","BLACK"] 
DRIVER_PATH = r"../chromedriver.exe"
DATA_RAW_FOLDER = "../data/raw"
DATA_PROCESSED_FOLDER = "../data/processed"
DATA_FINAL_FOLDER = "../data/final"


# Webscraping (Extraction Phase 1)

In [None]:


def webscraping(color:str):
    """Scrape one color section of the Google Arts & Culture web page.

    Parameters
    ----------
    color : str
        The color of the target page.
    """
    scraper = Scraper(DRIVER_PATH,color)
    scraper.open()
    data = scraper.exec()
    return data
    
    
webscraping_data = parallel(webscraping,COLORS)
print(webscraping_data[:5])
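
The `parallel` helper comes from the project's `Parallel` module, whose code is not shown in this notebook. As a rough illustration of the idea only, a thread-based version could look like the sketch below. `parallel_map` and its `max_workers` parameter are assumptions made for this example, and the project's helper may behave differently (for instance, in how it combines the per-task results).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def parallel_map(func: Callable[[T], R], items: Iterable[T], max_workers: int = 8) -> List[R]:
    """Apply func to every item using a thread pool, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the inputs,
        # even if the tasks finish in a different order
        return list(executor.map(func, items))
```

Threads suit this workload because scraping and downloading spend most of their time waiting on the network rather than on the CPU.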

In [None]:
df = pd.DataFrame(webscraping_data)
df['index'] = df.index
df.to_csv(f'{DATA_RAW_FOLDER}/pictures_original.csv', index = False)  

# Webscraping (Extraction Phase 2)

Get the image URL and the identifying key (index).

In [None]:
   
df = pd.read_csv(f'{DATA_RAW_FOLDER}/pictures_original.csv')  # read_csv is a pandas function, not a DataFrame method
df = df[['index','url']]
data = list(df.itertuples(index=False, name=None)) # Convert data into tuples

Download the image files in parallel.

In [None]:
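def get_image(data):
    """Download one image and save it as <index>.jpg in the raw image folder."""
    image_id, link = data  # each item is an (index, url) tuple
    path = f'{DATA_RAW_FOLDER}/img/{image_id}.jpg'
    img = requests.get(link).content
    with open(path, 'wb') as handler:
        handler.write(img)
    return (image_id, f'{image_id}.jpg')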

data_processed = parallel(get_image,data) 
print(data_processed[0:5])

Convert the data into a DataFrame and export it.

In [None]:
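# Name the key column 'index' so it matches pictures_original.csv in the later merge
df_processed = pd.DataFrame(data_processed, columns=['index','filename'])
# Export the processed frame (not the earlier index/url frame)
df_processed.to_csv(f'{DATA_PROCESSED_FOLDER}/pictures_images.csv', index = False)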

# Transform

* Filtering, cleansing, de-duplicating, validating, and authenticating the data.
* Performing calculations, translations, or summarizations based on the raw data. This can include changing row and column headers for consistency, converting currencies or other units of measurement, editing text strings, and more.
* Ensuring data quality.
* Formatting the data into tables or joined tables to match the schema of the target data warehouse.
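
A few of the steps above (consistent column headers, de-duplication, and basic validation) can be sketched with pandas as follows. This is only an illustration on made-up rows: `clean_pictures` is a hypothetical helper, and the column names `index` and `url` are assumed to match the scraped data.

```python
import pandas as pd


def clean_pictures(df: pd.DataFrame, key: str = "index", required=("url",)) -> pd.DataFrame:
    """Normalize headers, drop duplicate keys, and drop rows missing required fields."""
    # Consistent headers: strip whitespace and lower-case every column name
    cleaned = df.rename(columns=lambda name: name.strip().lower())
    # De-duplicate on the key column, keeping the first occurrence
    cleaned = cleaned.drop_duplicates(subset=[key], keep="first")
    # Validate: rows without a required value cannot be downloaded later
    cleaned = cleaned.dropna(subset=list(required))
    return cleaned.reset_index(drop=True)


# Made-up sample rows, only to show the effect of each step
sample = pd.DataFrame({
    "index": [0, 1, 1, 2],
    " URL ": ["https://example.com/a", "https://example.com/b",
              "https://example.com/b", None],
})
print(clean_pictures(sample))
```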

In [None]:
# read_csv is a pandas function, so call it on pd rather than on a DataFrame
df_original = pd.read_csv(f'{DATA_RAW_FOLDER}/pictures_original.csv')
df_images = pd.read_csv(f'{DATA_PROCESSED_FOLDER}/pictures_images.csv')
df = df_original.merge(df_images, on='index')
print(df.head())

In [None]:
print(df.shape)
print(df_original.shape)
print(df_images.shape)
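
Comparing shapes shows whether rows were lost in the merge, but not which side they came from. One possible complement, sketched below under the assumption that both frames share a unique `index` column, uses the `validate` and `indicator` options of `DataFrame.merge`. `merge_with_report` is a hypothetical helper name.

```python
import pandas as pd


def merge_with_report(left: pd.DataFrame, right: pd.DataFrame, key: str = "index"):
    """Merge on key and report how many rows matched on each side."""
    merged = left.merge(
        right,
        on=key,
        how="outer",            # keep unmatched rows so they can be counted
        validate="one_to_one",  # raises MergeError if a key repeats on either side
        indicator=True,         # adds a _merge column: left_only, right_only, both
    )
    report = merged["_merge"].value_counts().to_dict()
    matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
    return matched.reset_index(drop=True), report
```

A non-zero `left_only` count would point to scraped rows whose image was never downloaded.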

In [None]:
df.info()

In [None]:
df.duplicated().value_counts() 

In [None]:
df.isnull().sum()

In [None]:
#TODO: MERGE DF INTO ONE 
#     PARSE DATA COLUMN 

In [None]:
from tqdm.notebook import tqdm
import time
for i in tqdm(range(10)):
    time.sleep(1)