### Fase 2: Extracción de Detalles de Películas con Selenium
Utiliza Selenium para obtener información adicional de las películas listadas previamente.

**Información requerida:**
- Calificación de IMDB.
- Dirección (director o directores).
- Guionistas.
- Argumento.
- Duración (en minutos).

______________________________

In [1]:
import pandas as pd
import pickle
from time import sleep
from tqdm import tqdm

from bs4 import BeautifulSoup

# Importar librerías para automatización de navegadores web con Selenium
# -----------------------------------------------------------------------
from selenium import webdriver  # Selenium es una herramienta para automatizar la interacción con navegadores web.
from webdriver_manager.chrome import ChromeDriverManager  # ChromeDriverManager gestiona la instalación del controlador de Chrome.
from selenium.webdriver.common.keys import Keys  # Keys es útil para simular eventos de teclado en Selenium.
from selenium.webdriver.support.ui import Select  # Select se utiliza para interactuar con elementos <select> en páginas web.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

In [18]:
df_pelis = pd.read_pickle("../datos/df_pelis.pkl")
df_pelis.head(2)

Unnamed: 0,id,title,genre,type,release_year,release_month
0,tt0059325,Jahrgang 45,Drama,Movie,1990,10.0
1,tt0059900,"Wenn du groß bist, lieber Adam",Drama,Movie,1990,10.0


Hay varias películas repetidas ya que tienen varios géneros, vamos a agruparlas:

In [105]:
df_pelis.drop(columns=["genre"]).duplicated().sum()

np.int64(5533)

In [108]:
df_agrupado = df_pelis.groupby("id").agg(
    title=("title", "min"), 
    genre=("genre", lambda texts: " ".join(texts)),
    type=("type", "min"), 
    release_year=("release_year", "max"), 
    release_month=("release_month", "max")
).reset_index()

df_agrupado.head()

Unnamed: 0,id,title,genre,type,release_year,release_month
0,tt0011216,La fête espagnole,Drama,Movie,2019,1.0
1,tt0011801,Tötet nicht mehr,Action,Movie,2019,
2,tt0015724,Dama de noche,Drama Mystery Romance Thriller,Movie,1993,3.0
3,tt0016906,Frivolinas,Comedy,Movie,2014,10.0
4,tt0035423,Kate & Leopold,Comedy Fantasy Romance,Movie,2001,2.0


Vamos a coger una pequeña muestra para llevar a cabo esta fase pues por cada película se estiman unos 10 segundos para la busqueda de los parámetros requeridos.

In [158]:
df_sample = df_agrupado.sample(300)
df_sample.head(1)

Unnamed: 0,id,title,genre,type,release_year,release_month
17484,tt1254322,The Girl King,Romance,Movie,2015,12.0


In [159]:
df_sample.to_pickle("../datos/df_sample_fase2.pkl")

Cargamos el df que hemos almacenamos y comenzamos la busqueda para dichas películas:

In [166]:
df_sample = pd.read_pickle("../datos/df_sample_fase2.pkl")
df_sample.shape

(300, 6)

In [167]:
lista_final = []

for i in tqdm(range(df_sample.shape[0])):

    id = df_agrupado.loc[i, "id"]

    driver = webdriver.Chrome()
    url = f"https://www.imdb.com/es-es/title/{id}/"
    driver.get(url)
    driver.maximize_window()

    html_contenido = driver.page_source
    sleep(2)

    sopa = BeautifulSoup(html_contenido, "html.parser")

    # Encontramos el argumento
    try:
        argumento = sopa.find("span", {"class":"sc-3ac15c8d-1 gkeSEi"}).text
    except:
        argumento = None

    # Buscamos los directores y guionistas
    try:
        contenido = sopa.findAll("ul", {"class":"ipc-inline-list ipc-inline-list--show-dividers ipc-inline-list--inline ipc-metadata-list-item__list-content baseAlt"})
        directores = contenido[0].findAll("a")
        guionistas = contenido[1].findAll("a")


        lista_directores = [director.text for director in directores]
        lista_guionistas = [guion.text for guion in guionistas]
    except:
        lista_directores = []
        lista_guionistas = []

    # Buscamos la nota
    try:
        nota_txt = sopa.find("span", {"class":"sc-d541859f-1 imUuxf"}).text
        nota = float(nota_txt.replace(",", "."))
    except:
        nota = None

    # Buscamos la duración
    try:
        hora = sopa.find("ul", {"class":"ipc-inline-list ipc-inline-list--show-dividers sc-ec65ba05-2 joVhBE baseAlt"})
        lista_hora = hora.findAll("li")[1].text.replace("min", "").split("h")
        if len(lista_hora)>1:
            hora=int(lista_hora[0].strip())
            minutos = int(lista_hora[1].strip())
            minutos_totales = hora*60+minutos
        else:
            minutos_totales = int(lista_hora[0].strip())
    except:
        minutos_totales = None

    tupla_final = (id, nota, lista_directores, lista_guionistas, argumento, minutos_totales)
    lista_final.append(tupla_final)

    driver.quit()

    if i%100==0:
        with open('../datos/selenium_pelis.plk', 'wb') as f:
            pickle.dump(lista_final, f)


 68%|██████▊   | 204/300 [41:31<15:59,  9.99s/it] Exception ignored in: <function Service.__del__ at 0x000001F741405EE0>
Traceback (most recent call last):
  File "C:\Users\Elena\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\selenium\webdriver\common\service.py", line 189, in __del__
    self.stop()
  File "C:\Users\Elena\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\selenium\webdriver\common\service.py", line 146, in stop
    self.send_remote_shutdown_command()
  File "C:\Users\Elena\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\selenium\webdriver\common\service.py", line 126, in send_remote_shutdown_command
    request.urlopen(f"{self.service_url}/shutdown")
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.2544.0_x64__qbz5n2

In [3]:
with open('../datos/selenium_pelis.plk', 'rb') as f:
    data = pickle.load(f)

data

[('tt0011216',
  6.7,
  ['Germaine Dulac'],
  ['Louis Delluc'],
  'Coveted by two different men, a woman turns to a third man instead.',
  67),
 ('tt0011801',
  None,
  ['Lupu Pick'],
  ['Gerhard Lamprecht', 'Lupu Pick'],
  "In Russia's revolution, a violinist 's son dies,the father is jailed for assaulting the governor. Years later, the violinist's daughter wants to marry the son of the man who sentenced her father.A tragic fate awaits her new husband.",
  None),
 ('tt0015724',
  6.3,
  ['Eva López Sánchez'],
  ['Eva López Sánchez', 'David Martin del Campo'],
  "Bruno, a novelist without luck, responds to the distress call of his love, Sofía, who finds herself in desolate Veracruz and doesn't know what to do with the corpse of her boyfriend. Bruno proposes to Sofía to get rid of the corpse.",
  102),
 ('tt0016906',
  5.3,
  ['Arturo Carballo'],
  ['José López Alonso', 'Juan Belmonte', 'María Caballé'],
  "Basada en espectáculos de variedades de los años 20 tales como 'Arco Iris', 'La 