## Web scraping IMDb

Download the relevant information and save the top 250 movies to a CSV file using web scraping. Wrap everything in a script.

Extract:
* Title
* Year
* Runtime
* Position
* Rating

In [28]:
# If the request returns a 403, you can try:
# pip install fake-useragent
# from fake_useragent import UserAgent
# ua = UserAgent()
# headers = {'User-Agent': ua.random}
# response = requests.get(url, headers=headers)
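
If installing `fake-useragent` is not an option, a fixed browser-like User-Agent may be enough to avoid the 403. The sketch below is only an illustration using the standard library; `build_request` and `DEFAULT_UA` are hypothetical names that do not appear in the notebook, and IMDb may still reject the request.

```python
from urllib.request import Request

# Hypothetical browser-like User-Agent string (an assumption, not a required value).
DEFAULT_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

def build_request(url, user_agent=DEFAULT_UA):
    """Return a urllib Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

req = build_request("https://www.imdb.com/es-es/chart/top/")
# urllib.request.urlopen(req).read() would perform the actual download.
```

The same header dictionary can be passed to `requests.get(url, headers=...)` as in the hint above.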

In [29]:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import json

In [30]:
url = "https://www.imdb.com/es-es/chart/top/"
# response = requests.get(url)
# response

In [31]:
from fake_useragent import UserAgent
ua = UserAgent()
headers = {'User-Agent': ua.random}
response = requests.get(url, headers=headers)
response

<Response [200]>

In [32]:
html = response.content
print(html)



In [33]:
soup = bs(html, 'html.parser')
# print(soup)

In [34]:
soup.title

<title>Las 250 mejores películas de IMDb</title>

In [35]:
movies_str = soup.find("script", type="application/ld+json")
movies_json = json.loads(movies_str.string)
movies_json

{'@type': 'ItemList',
 'itemListElement': [{'@type': 'ListItem',
   'item': {'@type': 'Movie',
    'url': 'https://www.imdb.com/es-es/title/tt0111161/',
    'name': 'The Shawshank Redemption',
    'alternateName': 'Cadena perpetua',
    'description': 'Andy Dufresne es encarcelado por matar a su esposa y al amante de esta. Tras una dura adaptación, intenta mejorar las condiciones de la prisión y dar esperanza a sus compañeros.',
    'image': 'https://m.media-amazon.com/images/M/MV5BMTA1MjE0Nzk4MDleQTJeQWpwZ15BbWU4MDA0NjIxMjAx._V1_.jpg',
    'aggregateRating': {'@type': 'AggregateRating',
     'bestRating': 10,
     'worstRating': 1,
     'ratingValue': 9.3,
     'ratingCount': 3045590},
    'contentRating': '13',
    'genre': 'Drama',
    'duration': 'PT2H22M'}},
  {'@type': 'ListItem',
   'item': {'@type': 'Movie',
    'url': 'https://www.imdb.com/es-es/title/tt0068646/',
    'name': 'The Godfather',
    'alternateName': 'El padrino',
    'description': 'El envejecido patriarca de una
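
The `duration` field in this JSON-LD uses ISO 8601 notation (`PT2H22M`). If minutes are easier to work with in the CSV, a small converter could look like the sketch below. `iso_duration_to_minutes` is a hypothetical helper, and it assumes only hours, minutes and seconds ever appear.

```python
import re

# Matches strings such as "PT2H22M", "PT45M" or "PT3H".
_DURATION_RE = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")

def iso_duration_to_minutes(value):
    """Convert an ISO 8601 duration like 'PT2H22M' to whole minutes.

    Returns None when the value cannot be parsed (for example "N/A").
    """
    match = _DURATION_RE.match(value or "")
    if not match or not any(match.groups()):
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes + seconds // 60

print(iso_duration_to_minutes("PT2H22M"))  # 142
```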

In [36]:
soup.find_all("img", alt=True)[1]


<img alt="Morgan Freeman and Tim Robbins in Cadena perpetua (1994)" class="ipc-image" loading="lazy" sizes="50vw, (min-width: 480px) 34vw, (min-width: 600px) 26vw, (min-width: 1024px) 16vw, (min-width: 1280px) 16vw" src="https://m.media-amazon.com/images/M/MV5BMTA1MjE0Nzk4MDleQTJeQWpwZ15BbWU4MDA0NjIxMjAx._V1_QL75_UY207_CR6,0,140,207_.jpg" srcset="https://m.media-amazon.com/images/M/MV5BMTA1MjE0Nzk4MDleQTJeQWpwZ15BbWU4MDA0NjIxMjAx._V1_QL75_UY207_CR6,0,140,207_.jpg 140w, https://m.media-amazon.com/images/M/MV5BMTA1MjE0Nzk4MDleQTJeQWpwZ15BbWU4MDA0NjIxMjAx._V1_QL75_UY311_CR8,0,210,311_.jpg 210w, https://m.media-amazon.com/images/M/MV5BMTA1MjE0Nzk4MDleQTJeQWpwZ15BbWU4MDA0NjIxMjAx._V1_QL75_UY414_CR11,0,280,414_.jpg 280w" width="140"/>

In [None]:
import re
# re.search(r"\((\d{4})\)",soup.find_all("img", alt=True)[1]['alt']).group().strip("()")
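
The regular expression above can be wrapped in a small function that returns nothing when the alt text has no year. This is a minimal sketch; `year_from_alt` is a hypothetical name, and it assumes the year always appears as four digits in parentheses.

```python
import re

def year_from_alt(alt_text):
    """Return the four-digit year found in parentheses, or None if absent."""
    match = re.search(r"\((\d{4})\)", alt_text)
    return int(match.group(1)) if match else None

alt = "Morgan Freeman and Tim Robbins in Cadena perpetua (1994)"
print(year_from_alt(alt))  # 1994
```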

In [None]:
img_tags = soup.find_all("img", alt=True)
img_tags

In [None]:
peliculas = {"titulo":[], "anio": [], "duracion": [], "posicion": [], "rating":[]}

In [None]:
for i, entry in enumerate(movies_json["itemListElement"], start=1):
    movie = entry["item"]
    
    title = movie.get("alternateName") or movie.get("name")
    rating = movie.get("aggregateRating", {}).get("ratingValue", "N/A")
    duration = movie.get("duration", "N/A")
    url = movie.get("url", "N/A")

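    # Attempt to read the year from each poster's alt text, e.g. "... (1994)":
    # if img_tags[i].has_attr('alt'):
    #     match = re.search(r"\((\d{4})\)", img_tags[i]['alt'])
    #     if match:
    #         year = match.group(1)
    #
    # This version does not work: only 28 movies have an alt attribute,
    # so much for that idea. Record the year as missing instead.
    year = "N/A"
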

    peliculas['titulo'].append(title)
    peliculas['anio'].append(year)
    peliculas['duracion'].append(duration)
    peliculas['posicion'].append(i)
    peliculas['rating'].append(rating)
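
For the script version, the same extraction can be written as a function that returns one dictionary per movie. The sketch below is an illustration rather than the notebook's method: `records_from_itemlist` is a hypothetical name, `sample` is a cut-down copy of the JSON-LD shown earlier, and it assumes the list order equals the chart position.

```python
def records_from_itemlist(data):
    """Flatten a JSON-LD ItemList into one dict per movie."""
    records = []
    for position, entry in enumerate(data.get("itemListElement", []), start=1):
        movie = entry.get("item", {})
        records.append({
            "posicion": position,
            # Prefer the localized title, fall back to the original one.
            "titulo": movie.get("alternateName") or movie.get("name"),
            "duracion": movie.get("duration"),
            "rating": movie.get("aggregateRating", {}).get("ratingValue"),
        })
    return records

# Cut-down sample; the second entry deliberately lacks rating and duration.
sample = {
    "@type": "ItemList",
    "itemListElement": [
        {"item": {"name": "The Shawshank Redemption",
                  "alternateName": "Cadena perpetua",
                  "aggregateRating": {"ratingValue": 9.3},
                  "duration": "PT2H22M"}},
        {"item": {"name": "The Godfather", "alternateName": "El padrino"}},
    ],
}

records = records_from_itemlist(sample)
```

Missing fields come back as `None`, which pandas writes as empty cells instead of the string "N/A".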

In [None]:
peliculas

In [None]:
df = pd.DataFrame(peliculas)
df.to_csv("peliculas_imdb_anio_rgx.csv")

In [None]:
df

In [37]:
# Alternative approach

soup.find_all('h3', class_ = "ipc-title__text")[1:-1]       # only returns 25 entries because of lazy loading

[<h3 class="ipc-title__text ipc-title__text--reduced">1. Cadena perpetua</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">2. El padrino</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">3. El caballero oscuro</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">4. El padrino parte II</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">5. 12 hombres sin piedad</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">6. El señor de los anillos: El retorno del rey</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">7. La lista de Schindler</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">8. Pulp Fiction</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">9. El señor de los anillos: La comunidad del anillo</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">10. El bueno, el feo y el malo</h3>,
 <h3 class="ipc-title__text ipc-title__text--reduced">11. Forrest Gump</h3>,
 <h3 class="ipc-title__text ipc-title__text--red

In [42]:
list_rnk = []
list_anio = []
list_duracion = []
list_edad = []
list_nombre = []

for x in soup.find_all('h3', class_ = "ipc-title__text")[1:-1]:
    # Split only on the first ". " so titles that contain ". " stay intact
    rank, name = x.get_text().split('. ', 1)
    list_rnk.append(rank)
    list_nombre.append(name)
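
Splitting a heading such as `1. Cadena perpetua` into rank and title is easy to get subtly wrong when the title itself contains `. `. A minimal sketch of a safer split follows; `split_rank_title` is a hypothetical helper, and the second example string is made up to show the edge case.

```python
def split_rank_title(text):
    """Split '10. Some title' into (10, 'Some title') at the first '. ' only."""
    rank, sep, title = text.partition(". ")
    if not sep or not rank.isdigit():
        raise ValueError(f"unexpected heading format: {text!r}")
    return int(rank), title

print(split_rank_title("10. El bueno, el feo y el malo"))
# Made-up heading whose title contains '. ' itself:
print(split_rank_title("7. Vol. 2"))
```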

In [44]:
soup.find_all("div", class_="sc-4b408797-7 fUdAcX cli-title-metadata")

[<div class="sc-4b408797-7 fUdAcX cli-title-metadata"><span class="sc-4b408797-8 iurwGb cli-title-metadata-item">1994</span><span class="sc-4b408797-8 iurwGb cli-title-metadata-item">2h 22m</span><span class="sc-4b408797-8 iurwGb cli-title-metadata-item">13</span></div>,
 <div class="sc-4b408797-7 fUdAcX cli-title-metadata"><span class="sc-4b408797-8 iurwGb cli-title-metadata-item">1972</span><span class="sc-4b408797-8 iurwGb cli-title-metadata-item">2h 55m</span><span class="sc-4b408797-8 iurwGb cli-title-metadata-item">18</span></div>,
 <div class="sc-4b408797-7 fUdAcX cli-title-metadata"><span class="sc-4b408797-8 iurwGb cli-title-metadata-item">2008</span><span class="sc-4b408797-8 iurwGb cli-title-metadata-item">2h 32m</span><span class="sc-4b408797-8 iurwGb cli-title-metadata-item">13</span></div>,
 <div class="sc-4b408797-7 fUdAcX cli-title-metadata"><span class="sc-4b408797-8 iurwGb cli-title-metadata-item">1974</span><span class="sc-4b408797-8 iurwGb cli-title-metadata-item">3

In [45]:
for y in soup.find_all("span", class_="sc-4b408797-8 iurwGb cli-title-metadata-item"):
    # print(y.get_text())

    if len(y.text) == 4: 
        list_anio.append(y.text)
    if len(y.text) > 4:
        list_duracion.append(y.text)
    if len(y.text) > 0 and len(y.text) <= 2:
        list_edad.append(y.text)

    

In [46]:
print(list_anio)
print(list_duracion)
print(list_edad)

['1994', '1972', '2008', '1974', '1957', '2003', '1993', '1994', '2001', '1966', '1994', '2002', '1999', '2010', '1980', '1999', '1990', '2014', '1975', '1995', '1946', '1991', '1954', '1998', '2002']
['2h 22m', '2h 55m', '2h 32m', '3h 22m', '1h 36m', '3h 21m', '3h 15m', '2h 34m', '2h 58m', '3h 2m', '2h 22m', '2h 59m', '2h 19m', '2h 28m', '2h 4m', '2h 16m', '2h 25m', '2h 49m', '2h 13m', '2h 7m', '2h 10m', '1h 58m', '3h 27m', '2h 49m', '2h 10m']
['13', '18', '13', '18', 'A', '13', '13', '18', '12', '14', 'A', '13', '18', '12', 'A', '18', '18', '12', '18', '18', 'A', '18', 'A', '13', '18']
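
Classifying the metadata by string length works for these 25 rows, but a runtime such as `45m` or a longer age label would land in the wrong list, and a movie with a missing field would shift every later row. One possible alternative, sketched under the assumption that each `cli-title-metadata` div is processed on its own, is to classify the span texts of a single movie by pattern. `parse_metadata` is a hypothetical helper.

```python
import re

YEAR_RE = re.compile(r"^\d{4}$")
# Runtimes such as "2h 22m", "3h" or "45m".
DURATION_RE = re.compile(r"^(?:\d+h)?\s*(?:\d+m)?$")

def parse_metadata(span_texts):
    """Map the span texts of one movie to year, runtime and age rating.

    Fields that are absent stay as None, so rows remain aligned.
    """
    row = {"anio": None, "duracion": None, "edad": None}
    for text in span_texts:
        text = text.strip()
        if not text:
            continue
        if YEAR_RE.match(text):
            row["anio"] = text
        elif DURATION_RE.match(text):
            row["duracion"] = text
        else:
            row["edad"] = text
    return row

print(parse_metadata(["1994", "2h 22m", "13"]))
```

With BeautifulSoup, the input list could be built per div with something like `[s.get_text() for s in div.find_all("span")]`.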


In [47]:
list_rating = []
for z in soup.find_all("span", class_ = 'ipc-rating-star--rating'):
    list_rating.append(z.text)
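
The rating is stored here as raw text. If the page renders it with a decimal comma, which the `es-es` locale makes likely but which is an assumption, converting it to a float before saving avoids ambiguous values in the CSV. `parse_rating` is a hypothetical helper.

```python
import re

def parse_rating(text):
    """Return the first number in the text as a float, accepting ',' or '.'."""
    match = re.search(r"\d+(?:[.,]\d+)?", text)
    if not match:
        return None
    return float(match.group().replace(",", "."))

print(parse_rating("9,3"))  # 9.3
```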


In [48]:
df = pd.DataFrame({
    "Ranking":list_rnk,
    "Titulo":list_nombre,
    "Anio": list_anio,
    "Duracion": list_duracion,
    "Edad": list_edad,
    "Rating": list_rating
})

In [51]:
df.to_csv('alt_movies.csv', index=False)

In [50]:
df

Unnamed: 0,Ranking,Titulo,Anio,Duracion,Edad,Rating
0,1,Cadena perpetua,1994,2h 22m,13,93
1,2,El padrino,1972,2h 55m,18,92
2,3,El caballero oscuro,2008,2h 32m,13,90
3,4,El padrino parte II,1974,3h 22m,18,90
4,5,12 hombres sin piedad,1957,1h 36m,A,90
5,6,El señor de los anillos: El retorno del rey,2003,3h 21m,13,90
6,7,La lista de Schindler,1993,3h 15m,13,90
7,8,Pulp Fiction,1994,2h 34m,18,89
8,9,El señor de los anillos: La comunidad del anillo,2001,2h 58m,12,89
9,10,"El bueno, el feo y el malo",1966,3h 2m,14,88


## JSON

In [52]:
import json

In [54]:
soup.find_all("script", type="application/ld+json")

[<script type="application/ld+json">{"@type":"ItemList","itemListElement":[{"@type":"ListItem","item":{"@type":"Movie","url":"https://www.imdb.com/es-es/title/tt0111161/","name":"The Shawshank Redemption","alternateName":"Cadena perpetua","description":"Andy Dufresne es encarcelado por matar a su esposa y al amante de esta. Tras una dura adaptación, intenta mejorar las condiciones de la prisión y dar esperanza a sus compañeros.","image":"https://m.media-amazon.com/images/M/MV5BMTA1MjE0Nzk4MDleQTJeQWpwZ15BbWU4MDA0NjIxMjAx._V1_.jpg","aggregateRating":{"@type":"AggregateRating","bestRating":10,"worstRating":1,"ratingValue":9.3,"ratingCount":3045590},"contentRating":"13","genre":"Drama","duration":"PT2H22M"}},{"@type":"ListItem","item":{"@type":"Movie","url":"https://www.imdb.com/es-es/title/tt0068646/","name":"The Godfather","alternateName":"El padrino","description":"El envejecido patriarca de una dinastía del crimen organizado en la ciudad de Nueva York de la posguerra transfiere el con

In [None]:
data = json.loads(soup.find("script", type="application/ld+json").text) # .text takes only the tag's contents
data

{'@type': 'ItemList',
 'itemListElement': [{'@type': 'ListItem',
   'item': {'@type': 'Movie',
    'url': 'https://www.imdb.com/es-es/title/tt0111161/',
    'name': 'The Shawshank Redemption',
    'alternateName': 'Cadena perpetua',
    'description': 'Andy Dufresne es encarcelado por matar a su esposa y al amante de esta. Tras una dura adaptación, intenta mejorar las condiciones de la prisión y dar esperanza a sus compañeros.',
    'image': 'https://m.media-amazon.com/images/M/MV5BMTA1MjE0Nzk4MDleQTJeQWpwZ15BbWU4MDA0NjIxMjAx._V1_.jpg',
    'aggregateRating': {'@type': 'AggregateRating',
     'bestRating': 10,
     'worstRating': 1,
     'ratingValue': 9.3,
     'ratingCount': 3045590},
    'contentRating': '13',
    'genre': 'Drama',
    'duration': 'PT2H22M'}},
  {'@type': 'ListItem',
   'item': {'@type': 'Movie',
    'url': 'https://www.imdb.com/es-es/title/tt0068646/',
    'name': 'The Godfather',
    'alternateName': 'El padrino',
    'description': 'El envejecido patriarca de una

In [59]:
data['itemListElement'][0]['item']['alternateName']

'Cadena perpetua'

In [60]:
# Instead of enumerate, a simpler option is a manual counter: i = 1, then i += 1
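
Both ways of numbering the positions give the same result, as this small comparison shows. The `titles` list is a three-item sample taken from the output above.

```python
titles = ["Cadena perpetua", "El padrino", "El caballero oscuro"]

# Manual counter: start at 1 and increment on every iteration.
positions_manual = []
i = 1
for title in titles:
    positions_manual.append((i, title))
    i += 1

# enumerate with start=1 produces the same (position, title) pairs.
positions_enum = list(enumerate(titles, start=1))

print(positions_manual == positions_enum)  # True
```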