## Web Scraping LEVEL 2

### 2.3.- Extraction across two horizontal dimensions on IGN with Scrapy

**OBJECTIVES:**
    - Extract information about Articles, Reviews, and Videos from IGN
    - Learn to extract different types of information at the same time
    - Learn to perform vertical and horizontal extraction using rules
    - Learn to perform extraction with two dimensions of horizontality

In [1]:
import numpy as np
import pandas as pd
from scrapy.item import Field
from scrapy.item import Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
# scrapy.loader.processors is deprecated since Scrapy 2.2; the processors live in itemloaders
from itemloaders.processors import MapCompose
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader

In [2]:
class Articulo(Item):
    titulo = Field()
    contenido = Field()
    
class Review(Item):
    titulo = Field()
    calificacion = Field()

class Video(Item):
    titulo=Field()
    fecha_publicacion = Field()
    
class IGNCrawler(CrawlSpider):
    name = 'IGN'
    custom_settings = {
        # The setting name is USER_AGENT (with an underscore); 'USER AGENT' is not a Scrapy setting
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        'CLOSESPIDER_PAGECOUNT':30
    }
    
    allowed_domains = ['latam.ign.com']
    download_delay = 1
    start_urls = ['https://latam.ign.com/se/?model=article%2Cvideo&order_by=-date&q=ps4']
    
    rules = (
        # Horizontality by type of information
        Rule(
            LinkExtractor(
                allow=r'type='
            ), follow=True
        ),
        # Horizontality by pagination
        Rule(
            LinkExtractor(
                allow=r'&page=\d+'
            ), follow=True
        ),
        # Horizontality by content type
        # REVIEWS
        Rule(
            LinkExtractor(
                allow=r'/review/'
            ), follow=True, callback='parse_review'
        ),
        # VIDEOS
        Rule(
            LinkExtractor(
                allow=r'/video/'
            ), follow=True, callback='parse_video'
        ),
        # ARTICLES
        Rule(
            LinkExtractor(
                allow=r'/news/'
            ), follow=True, callback='parse_news'
        ),
    )
    
    def parse_news(self, response):
        # Article detail page: title and body text
        item = ItemLoader(item=Articulo(), response=response)
        item.add_xpath('titulo', '//h1/text()')
        item.add_xpath('contenido', '//div[@id="id_text"]//*/text()')

        yield item.load_item()

    def parse_review(self, response):
        # Review detail page: title and score
        item = ItemLoader(item=Review(), response=response)
        item.add_xpath('titulo', '//h1/text()')
        item.add_xpath('calificacion', '//span[@class="side-wrapper side-wrapper hexagon-content"]/text()')

        yield item.load_item()

    def parse_video(self, response):
        # Video detail page: title and publication date
        item = ItemLoader(item=Video(), response=response)
        item.add_xpath('titulo', '//h1/text()')
        item.add_xpath('fecha_publicacion', '//span[@class="publish-date"]/text()')

        yield item.load_item()
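
To see how the `allow` patterns in `rules` split the links between the listing pages (type filter, pagination) and the detail pages (review, video, news), they can be tried against a few URLs with the standard `re` module. This is only a sketch: the labels and the helper names `matching_rules` and `first_matching_rule` are illustrative, the last sample URL is hypothetical (no paginated URL appears in the crawl log), and the real `LinkExtractor` also applies domain and file-extension filters that are not modeled here. The other sample URLs are taken from this notebook's crawl log.

```python
import re

# Patterns copied from the `rules` tuple, in the same order.
# The labels on the left are illustrative names, not Scrapy identifiers.
RULE_PATTERNS = [
    ("type_filter", r"type="),
    ("pagination", r"&page=\d+"),
    ("review", r"/review/"),
    ("video", r"/video/"),
    ("news", r"/news/"),
]


def matching_rules(url):
    """Return the label of every pattern found anywhere in the URL."""
    return [label for label, pattern in RULE_PATTERNS if re.search(pattern, url)]


def first_matching_rule(url):
    """Return the first matching label, mirroring CrawlSpider's rule order."""
    matches = matching_rules(url)
    return matches[0] if matches else None


SAMPLE_URLS = [
    "https://latam.ign.com/se/?type=video&q=ps4&order_by=-date",
    "https://latam.ign.com/mlb-the-show-20/69016/review/review-mlb-the-show-20",
    "https://latam.ign.com/spider-man-ps4/53180/video/review-marvels-spider-man",
    "https://latam.ign.com/playstation-5-1/73630/news/playstation-5-como-funciona-el-remote-play-en-playstation-4",
    # Hypothetical paginated listing URL, used only for illustration.
    "https://latam.ign.com/se/?type=video&q=ps4&order_by=-date&page=2",
]

if __name__ == "__main__":
    for url in SAMPLE_URLS:
        print(first_matching_rule(url), matching_rules(url), url)
```

When a link matches more than one rule, `CrawlSpider` uses the first rule in the order they are defined, so a paginated listing URL that also contains `type=` would be handled by the type-filter rule. Both rules only follow links, so the outcome is the same here.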

In [3]:
#from scrapy.spiders import Spider
#from scrapy.crawler import CrawlerProcess

#if __name__ == "__main__":  # Code that runs when you click RUN
#    process = CrawlerProcess()
#    process.crawl(IGNCrawler)  # Name of my Spider class
#    process.start()

In [4]:
from Crawlerfunctions import *
crawlerapp(IGNCrawler)

2020-12-30 17:41:31 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2020-12-30 17:41:31 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.6 (default, Jan  8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.18362-SP0
2020-12-30 17:41:31 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-12-30 17:41:31 [scrapy.crawler] INFO: Overridden settings:
{'CLOSESPIDER_PAGECOUNT': 30}
2020-12-30 17:41:31 [scrapy.extensions.telnet] INFO: Telnet Password: 3d231f6cb002c19f
2020-12-30 17:41:31 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.logstats.LogStats']
2020-12-30 17:41:31 [scrapy.middleware] INFO: Enabled downloader middlew

2020-12-30 17:41:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://latam.ign.com/playstation-5-1/73505/video/playstation-5-vs-playstation-4-pro-comparando-los-tiempos-de-carga>
{'fecha_publicacion': ['6 de Noviembre de 2020'],
 'titulo': ['PlayStation 5 vs. PlayStation 4 Pro: comparando los tiempos de '
            'carga']}
2020-12-30 17:41:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://latam.ign.com/spider-man-ps4/73578/video/los-primeros-22-minutos-de-spider-man-miles-morales-en-ps5-gameplay-en-4k> (referer: https://latam.ign.com/se/?model=article%2Cvideo&order_by=-date&q=ps4)
2020-12-30 17:41:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://latam.ign.com/spider-man-ps4/73578/video/los-primeros-22-minutos-de-spider-man-miles-morales-en-ps5-gameplay-en-4k>
{'fecha_publicacion': ['11 de Noviembre de 2020'],
 'titulo': ['Los primeros 22 minutos de Spider-Man: Miles Morales en PS5 - '
            'Gameplay en 4K']}
2020-12-30 17:41:45 [scrapy.core.engine] DEB

2020-12-30 17:41:48 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'it.ign.com': <GET https://it.ign.com/ps5/175513/news/ps5-brutte-notizie-per-i-pc-gamer-la-nuova-console-non-supportera-la-risoluzione-a-1440p>
2020-12-30 17:41:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://latam.ign.com/playstation-4-gaming-hardware/73509/news/playstation-5-con-su-proximo-lanzamiento-ps4-esta-bajando-de-precio-y-estas-son-las-mejores-ofertas> (referer: https://latam.ign.com/se/?model=article%2Cvideo&order_by=-date&q=ps4)
2020-12-30 17:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://latam.ign.com/playstation-4-gaming-hardware/73509/news/playstation-5-con-su-proximo-lanzamiento-ps4-esta-bajando-de-precio-y-estas-son-las-mejores-ofertas>
{'contenido': ['PlayStation 5 llegará al mercado la próxima semana, pero si '
               'resulta que no compraron la PlayStation 4 y están considerando '
               'hacerlo ahora, les podemos decir que es un buen m

2020-12-30 17:41:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://latam.ign.com/call-of-duty-black-ops-cold-war/73652/news/call-of-duty-black-ops-cold-war-usuarios-de-ps5-podrian-estar-jugando-la-version-de-ps4-por-error> (referer: https://latam.ign.com/se/?model=article%2Cvideo&order_by=-date&q=ps4)
2020-12-30 17:41:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://latam.ign.com/call-of-duty-black-ops-cold-war/73652/news/call-of-duty-black-ops-cold-war-usuarios-de-ps5-podrian-estar-jugando-la-version-de-ps4-por-error>
{'contenido': ['Para aquellos que pudieron asegurar una PlayStation 5 en su '
               'lanzamiento y han estado jugando ',
               'Call of Duty: Black Ops Cold War',
               ', existe la posibilidad de que hayan estado jugando la versión '
               'de PS4 por accidente.',
               'De acuerdo al reporte de Eurogamer, este problema parece estar '
               'afectando a ciertas personas que han comprado y descargado 

2020-12-30 17:41:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://latam.ign.com/playstation-5-playstation-5/72541/video/god-of-war-ragnarok-teaser-trailer> (referer: https://latam.ign.com/se/?type=video&q=ps4&order_by=-date)
2020-12-30 17:41:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://latam.ign.com/playstation-5-playstation-5/72541/video/god-of-war-ragnarok-teaser-trailer>
{'fecha_publicacion': ['16 de Septiembre de 2020'],
 'titulo': ['God of War: Ragnarok - Teaser Tráiler']}
2020-12-30 17:41:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://latam.ign.com/mlb-the-show-20/69016/review/review-mlb-the-show-20> (referer: https://latam.ign.com/se/?type=review&q=ps4&order_by=-date)
2020-12-30 17:41:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://latam.ign.com/mlb-the-show-20/69016/review/review-mlb-the-show-20>
{'calificacion': ['8.5', '9', '7', '8', '8'],
 'titulo': ['MLB The Show 20 - Review']}
2020-12-30 17:42:00 [scrapy.core.engine] DEBUG: Crawle

2020-12-30 17:42:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://latam.ign.com/playstation-5-1/70288/video/trailer-de-gameplay-de-deathloop-ps5> (referer: https://latam.ign.com/playstation-5-1/73454/video/marvels-spider-man-comparacion-ps4-2018-vs-remasterizacion-para-ps5-2020)
2020-12-30 17:42:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://latam.ign.com/playstation-5-1/70288/video/trailer-de-gameplay-de-deathloop-ps5>
{'fecha_publicacion': ['11 de Junio de 2020'],
 'titulo': ['Trailer de gameplay de Deathloop | PS5']}
2020-12-30 17:42:04 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://latam.ign.com/playstation-5-1/73505/video/playstation-5-vs-playstation-4-pro-comparando-los-tiempos-de-carga> from <GET https://latam.ign.com/playstation-5-1/73505/video/www.instagram.com/ignlatam>
2020-12-30 17:42:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://latam.ign.com/spider-man-ps4/53180/video/review-marvels-spider-man> (referer: ht

2020-12-30 17:42:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://latam.ign.com/playstation-4-gaming-hardware/72834/news/ps-plus-estos-son-los-juegos-gratuitos-de-octubre-2020?utm_source=recirc> (referer: https://latam.ign.com/playstation-4-gaming-hardware/73509/news/playstation-5-con-su-proximo-lanzamiento-ps4-esta-bajando-de-precio-y-estas-son-las-mejores-ofertas)
2020-12-30 17:42:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://latam.ign.com/playstation-4-gaming-hardware/72834/news/ps-plus-estos-son-los-juegos-gratuitos-de-octubre-2020?utm_source=recirc>
{'contenido': ['N',
               'Need for Speed: Payback y Vampyr son los juegos gratuitos de '
               'PlayStation Plus para octubre de 2020.',
               'Desde el martes 6 de octubre y hasta el lunes 2 de noviembre, '
               'podrás adquirir gratis el juego de carreras y el RPG como '
               'parte de tu suscripción a PS Plus.',
               '\n',
               'Need for Speed: 

2020-12-30 17:42:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://latam.ign.com/playstation-5-1/73899/news/playstation-5-circulan-rumores-sobre-una-version-lite-mas-economica-de-la-consola?utm_source=recirc> (referer: https://latam.ign.com/playstation-5-1/73630/news/playstation-5-como-funciona-el-remote-play-en-playstation-4)
2020-12-30 17:42:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://latam.ign.com/playstation-5-1/73899/news/playstation-5-circulan-rumores-sobre-una-version-lite-mas-economica-de-la-consola?utm_source=recirc>
{'contenido': ['En esta nueva generación de consolas Microsoft puso disponible '
               'dos versiones de su consola Xbox, la Series X con mayor '
               'potencia y la Series S con especificaciones y precio reducido. '
               'Por su parte, Sony no siguió esa misma estrategia con su ',
               'PlayStation 5',
               ' ya que aunque hay dos versiones, no hay tanta diferencia en '
               'su prec

2020-12-30 17:42:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://latam.ign.com/playstation-5-playstation-5/65703/video/godfall-trailer-de-presentacion> (referer: https://latam.ign.com/playstation-5-playstation-5/72541/video/god-of-war-ragnarok-teaser-trailer)
2020-12-30 17:42:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://latam.ign.com/playstation-5-playstation-5/65703/video/godfall-trailer-de-presentacion>
{'fecha_publicacion': ['13 de Diciembre de 2019'],
 'titulo': ['Godfall - Tráiler de presentación']}
2020-12-30 17:42:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://latam.ign.com/playstation-5-1/72696/news/playstation-5-sony-confirmo-que-si-puedes-usar-los-discos-de-ps4?utm_source=recirc> (referer: https://latam.ign.com/mlb-the-show-20/69016/review/review-mlb-the-show-20)
2020-12-30 17:42:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://latam.ign.com/playstation-5-1/72696/news/playstation-5-sony-confirmo-que-si-puedes-usar-los-discos-de-ps4?u

2020-12-30 17:42:26 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://latam.ign.com/spider-man-ps4/53180/video/review-marvels-spider-man> from <GET https://latam.ign.com/spider-man-ps4/53180/video/www.instagram.com/ignlatam>
2020-12-30 17:42:28 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://latam.ign.com/forza-horizon-4/50671/video/creando-el-mclaren-senna-en-forza-horizon-4-ign-first> from <GET https://latam.ign.com/forza-horizon-4/50671/video/www.instagram.com/ignlatam>
2020-12-30 17:42:28 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 19305,
 'downloader/request_count': 46,
 'downloader/request_method_count/GET': 46,
 'downloader/response_bytes': 967393,
 'downloader/response_count': 46,
 'downloader/response_status_count/200': 40,
 'downloader/response_status_count/301': 6,
 'dupefilter/filtered': 492,
 'elapsed_time_seconds': 56.55867,
 'finish_reason': 'closespider_pagecount',
 'fi
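
The items in the log above keep every text node as a separate list entry: `contenido` contains whitespace-only fragments such as `'\n'`, and `calificacion` holds several values for a single review. A minimal cleaning sketch in plain Python follows. The helper names `drop_blank`, `join_text`, and `first_score` are hypothetical, the sample fragments are shortened from the log, and treating the first `calificacion` value as the main score is an assumption that the notebook does not confirm.

```python
import re


def drop_blank(fragments):
    """Remove fragments that contain only whitespace (for example '\n')."""
    return [fragment for fragment in fragments if fragment.strip()]


def join_text(fragments):
    """Concatenate the text fragments and collapse runs of whitespace."""
    text = "".join(drop_blank(fragments))
    return re.sub(r"\s+", " ", text).strip()


def first_score(values):
    """Return the first value as a float, or None when the list is empty.

    Assumption: the first extracted value is the main review score.
    """
    return float(values[0]) if values else None


if __name__ == "__main__":
    # Fragments shortened from a 'contenido' value in the log.
    contenido = [
        "han estado jugando ",
        "Call of Duty: Black Ops Cold War",
        ", existe la posibilidad",
        "\n",
    ]
    # 'calificacion' value exactly as it appears in the log.
    calificacion = ["8.5", "9", "7", "8", "8"]

    print(join_text(contenido))
    print(first_score(calificacion))
```

Functions like these could be applied to the items after the crawl, or adapted into item loader processors. Note that concatenating fragments without a separator can glue separate paragraphs together, so a different separator may suit other pages better.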