# Building the AnimePlanet dataframe, with each anime's tags

The following code was obtained partly with the help of ChatGPT, after a considerable amount of
time spent on prompts and errors, in order to build the desired dataframe. It took real effort, but
the result was satisfactory.

Objective: obtain the AnimePlanet dataframe, with the tags of each anime, and save it to a .csv
file for later use.

In [11]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from random import randint
import os
import ast

## Initial dataframe with the anime and their AnimePlanet URLs, used later to fetch each anime's tags

In [4]:
# Function that collects the anime titles and their URLs.
def scrape_anime_data(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36 Edg/94.0.992.50",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "TE": "Trailers",
    }

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")

    # Was the request successful?
    print(response.status_code)

    # Find the anime cards.
    anime_cards = soup.find_all("li", class_="card")

    # Initialize lists to store data.
    anime_names = []
    anime_urls = []

    # Extraction. Append only when both the title and the link are present,
    # so the two lists always end up with the same length.
    for card in anime_cards:
        name_element = card.find("h3", class_="cardName")
        url_element = card.find("a", class_="tooltip")
        if name_element and url_element:
            anime_names.append(name_element.get_text(strip=True))
            anime_urls.append("https://www.anime-planet.com" + url_element["href"])

    # Build a DataFrame from the data.
    anime_df = pd.DataFrame({"Anime": anime_names, "URL": anime_urls})
    print("Scraped anime data from page " + url)
    return anime_df


# Base page.
base_url = "https://www.anime-planet.com/anime/all?page="
# Pages: there are 667 as of today; the user should adjust this as they see fit.
total_pages = 667

# Initialize the DataFrame.
all_anime_df = pd.DataFrame()

# Visit each page and concatenate the data.
for page_number in range(1, total_pages + 1):
    page_url = base_url + str(page_number)
    anime_data = scrape_anime_data(page_url)
    all_anime_df = pd.concat([all_anime_df, anime_data], ignore_index=True)
    # Random wait of between 1 and 3 seconds.
    time.sleep(randint(1, 3))

200
Scraped anime data from page https://www.anime-planet.com/anime/all?page=1
200
Scraped anime data from page https://www.anime-planet.com/anime/all?page=2
200
Scraped anime data from page https://www.anime-planet.com/anime/all?page=3
200
Scraped anime data from page https://www.anime-planet.com/anime/all?page=4
200
Scraped anime data from page https://www.anime-planet.com/anime/all?page=5
200
Scraped anime data from page https://www.anime-planet.com/anime/all?page=6
200
Scraped anime data from page https://www.anime-planet.com/anime/all?page=7
200
Scraped anime data from page https://www.anime-planet.com/anime/all?page=8
200
Scraped anime data from page https://www.anime-planet.com/anime/all?page=9
200
Scraped anime data from page https://www.anime-planet.com/anime/all?page=10
200
Scraped anime data from page https://www.anime-planet.com/anime/all?page=11
200
Scraped anime data from page https://www.anime-planet.com/anime/all?page=12
200
Scraped anime data from page https://www.anim
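
The page count in the loop is hard-coded and has to be updated by hand. A minimal sketch of an alternative follows, assuming that a page past the end of the listing yields no cards (this has not been checked against the site): keep requesting pages until one comes back empty. `collect_pages`, `fetch_page` and `max_pages` are hypothetical names; in this notebook `fetch_page` would wrap `scrape_anime_data` and return its rows, while here a fake fetcher stands in for the network.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def collect_pages(fetch_page: Callable[[int], List[Row]], max_pages: int = 1000) -> List[Row]:
    """Collect rows page by page until a page returns nothing.

    fetch_page is a hypothetical callable: it takes a 1-based page number
    and returns the rows found on that page (an empty list when there are none).
    max_pages is a safety cap so the loop always terminates.
    """
    rows: List[Row] = []
    for page_number in range(1, max_pages + 1):
        page_rows = fetch_page(page_number)
        if not page_rows:
            break  # Assumption: an empty page means we are past the last one.
        rows.extend(page_rows)
    return rows


# Stand-in for the real scraper: three pages with two fake entries each.
def fake_fetch(page_number: int) -> List[Row]:
    if page_number > 3:
        return []
    return [
        {"Anime": f"Title {page_number}-{i}", "URL": f"/anime/title-{page_number}-{i}"}
        for i in range(2)
    ]


all_rows = collect_pages(fake_fetch)
print(len(all_rows))
```

The cap in `max_pages` is a design choice: if the empty-page assumption turns out to be wrong, the loop still stops.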

Check that everything is correct.

In [6]:
all_anime_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23334 entries, 0 to 23333
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Anime   23334 non-null  object
 1   URL     23334 non-null  object
dtypes: object(2)
memory usage: 364.7+ KB


PERFECT.

In [9]:
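# In the meantime, save the DataFrame to a csv file in the data folder, just in case.
# Create the folder first so that to_csv does not fail if it is missing.
os.makedirs("data", exist_ok=True)
all_anime_df.to_csv(os.path.join("data", "tags_anime_csv"), index=False)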

-------------------------------------------------------------------------------------------------------------------

Tag scraper, which starts from the previous dataframe and uses the URL of each anime.

In [4]:
# Load the previous dataframe.

all_anime_df = pd.read_csv(os.path.join("data", "tags_anime_csv"))

In [40]:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36 Edg/94.0.992.50",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "TE": "Trailers",
}

# Add a new column for the tags.
all_anime_df["Tags"] = ""

for index, row in all_anime_df.iterrows():
    anime_url = row["URL"]
    anime_title = row["Anime"]

    # Fetch the tags for each anime.
    
    anime_response = requests.get(anime_url, headers=headers) # type: ignore
    anime_soup = BeautifulSoup(anime_response.content, "html.parser")
    meta_tags = anime_soup.find_all("meta", {"property": "video:tag"})
    tags = [tag["content"] for tag in meta_tags]

    # Report the request status and show the anime title.
    
    print(f"Scraped tags for {anime_title} from {anime_url}: status code {anime_response.status_code}")

    # Random wait of between 1 and 3 seconds.

    time.sleep(randint(1, 3))

    # Update the corresponding row in the DataFrame.
    
    all_anime_df.at[index, "Tags"] = tags

Scraped tags for Attack on Titan The Final Season: The Final Chapters from https://www.anime-planet.com/anime/attack-on-titan-the-final-season-the-final-chapters: status code 200
Scraped tags for Fullmetal Alchemist: Brotherhood from https://www.anime-planet.com/anime/fullmetal-alchemist-brotherhood: status code 200
Scraped tags for Fruits Basket the Final Season from https://www.anime-planet.com/anime/fruits-basket-the-final-season: status code 200


KeyboardInterrupt: 
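
The run above was stopped by a `KeyboardInterrupt` after three requests, and with more than 23,000 pages to fetch an interruption is likely. A minimal sketch of a resumable loop follows: it skips rows that already have tags and writes a checkpoint every few rows. `fill_tags`, `fetch_tags`, `checkpoint_path` and `every` are hypothetical names, the fake fetcher stands in for the real HTTP request, and storing the tags as a `|`-joined string is an assumption made here so that they survive a CSV round trip (it assumes no tag contains `|`).

```python
import os
import tempfile

import pandas as pd


def fill_tags(df, fetch_tags, checkpoint_path, every=2):
    """Fill the "Tags" column in place, skipping rows that are already done.

    fetch_tags is a hypothetical callable that maps a URL to a list of tags.
    Tags are stored as a "|"-joined string; an empty string means "not scraped yet".
    Returns the number of rows fetched during this call.
    """
    done = 0
    for index, row in df.iterrows():
        if row["Tags"]:
            continue  # Already scraped in an earlier run.
        df.at[index, "Tags"] = "|".join(fetch_tags(row["URL"]))
        done += 1
        if done % every == 0:
            df.to_csv(checkpoint_path, index=False)  # Periodic checkpoint.
    df.to_csv(checkpoint_path, index=False)  # Final save.
    return done


df = pd.DataFrame(
    {
        "Anime": ["A", "B", "C"],
        "URL": ["/anime/a", "/anime/b", "/anime/c"],
        "Tags": ["Action|Drama", "", ""],  # Row 0 was scraped before the interruption.
    }
)

calls = []


def fake_fetch_tags(url):
    calls.append(url)  # Record which URLs were actually requested.
    return ["Comedy"]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tags_checkpoint.csv")
    fetched = fill_tags(df, fake_fetch_tags, path)
    # keep_default_na=False keeps empty cells as "" instead of NaN on reload.
    reloaded = pd.read_csv(path, keep_default_na=False)

print(fetched, calls)
```

To resume after an interruption, the checkpoint would be reloaded with `keep_default_na=False` and passed to `fill_tags` again; only the rows that are still empty get requested.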

In [41]:
all_anime_df

Unnamed: 0,Anime,URL,Tags
0,Attack on Titan The Final Season: The Final Ch...,https://www.anime-planet.com/anime/attack-on-t...,"[Action, Drama, Fantasy, Shounen, Dark Fantasy..."
1,Fullmetal Alchemist: Brotherhood,https://www.anime-planet.com/anime/fullmetal-a...,"[Action, Adventure, Drama, Fantasy, Mystery, S..."
2,Fruits Basket the Final Season,https://www.anime-planet.com/anime/fruits-bask...,
3,Demon Slayer: Kimetsu no Yaiba - Entertainment...,https://www.anime-planet.com/anime/demon-slaye...,
4,Jujutsu Kaisen 2nd Season,https://www.anime-planet.com/anime/jujutsu-kai...,
...,...,...,...
23329,The Mad Capsule Markets: Pulse,https://www.anime-planet.com/anime/the-mad-cap...,
23330,Kyoufu no Hiruyasumi,https://www.anime-planet.com/anime/kyoufu-no-h...,
23331,Xing Xueyuan III: Pandora Mijing,https://www.anime-planet.com/anime/xing-xueyua...,
23332,Xiyou Xin Chuan,https://www.anime-planet.com/anime/xiyou-xin-c...,


In [42]:
# Select Fullmetal Alchemist: Brotherhood to check that the tags were
# added correctly.


all_anime_df[all_anime_df["Anime"] == "Fullmetal Alchemist: Brotherhood"]

# Now view its tags, printed in full.

all_anime_df[all_anime_df["Anime"] == "Fullmetal Alchemist: Brotherhood"]["Tags"].values[0]

['Action',
 'Adventure',
 'Drama',
 'Fantasy',
 'Mystery',
 'Shounen',
 'Conspiracy',
 'Death of a Loved One',
 'Military',
 'Siblings',
 'Animal Abuse',
 'Mature Themes',
 'Violence',
 'Based on a Manga',
 'Domestic Abuse']

In [57]:
all_anime_df

Unnamed: 0,Anime,URL,Tags
0,Attack on Titan The Final Season: The Final Ch...,https://www.anime-planet.com/anime/attack-on-t...,"[Action, Drama, Fantasy, Shounen, Dark Fantasy..."
1,Fullmetal Alchemist: Brotherhood,https://www.anime-planet.com/anime/fullmetal-a...,"[Action, Adventure, Drama, Fantasy, Mystery, S..."
2,Fruits Basket the Final Season,https://www.anime-planet.com/anime/fruits-bask...,
3,Demon Slayer: Kimetsu no Yaiba - Entertainment...,https://www.anime-planet.com/anime/demon-slaye...,
4,Jujutsu Kaisen 2nd Season,https://www.anime-planet.com/anime/jujutsu-kai...,
...,...,...,...
23329,The Mad Capsule Markets: Pulse,https://www.anime-planet.com/anime/the-mad-cap...,
23330,Kyoufu no Hiruyasumi,https://www.anime-planet.com/anime/kyoufu-no-h...,
23331,Xing Xueyuan III: Pandora Mijing,https://www.anime-planet.com/anime/xing-xueyua...,
23332,Xiyou Xin Chuan,https://www.anime-planet.com/anime/xiyou-xin-c...,


In [44]:
all_anime_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23334 entries, 0 to 23333
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Anime   23334 non-null  object
 1   URL     23334 non-null  object
 2   Tags    23334 non-null  object
dtypes: object(3)
memory usage: 547.0+ KB


In [60]:
import ast


all_tags = set(
    tag.strip() if isinstance(tags_str, str) else tag
    for tags_str in all_anime_df["Tags"]
    for tag in (tags_str.strip("[]").split(",") if isinstance(tags_str, str) else tags_str)
)
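
The cell above imports `ast` but splits the string by hand. When `Tags` holds the text form of a list, as happens after a CSV round trip, that split leaves the quote characters attached to each tag and would also break a tag that contains a comma. A minimal sketch of a parser built on `ast.literal_eval` follows; `parse_tags` is a hypothetical helper and the sample values are illustrative, not scraped data.

```python
import ast


def parse_tags(value):
    """Return a list of tag strings from a list, its text form, or an empty cell."""
    if isinstance(value, list):
        return value
    if not isinstance(value, str) or not value.strip():
        return []  # Empty string, NaN, or None: no tags were scraped.
    parsed = ast.literal_eval(value)  # Safely evaluates "['Action', 'Drama']".
    return [str(tag) for tag in parsed]


samples = [["Action", "Drama"], "['Action', 'Death of a Loved One']", "", None]
parsed_samples = [parse_tags(sample) for sample in samples]

# Unique tags across all samples, sorted for a stable order.
unique_tags = sorted({tag for tags in parsed_samples for tag in tags})
print(unique_tags)
```

Applied to the notebook, this would be `all_anime_df["Tags"].apply(parse_tags)` before collecting the unique tags.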


In [65]:
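# Get all the unique tags in the DataFrame.
# Note: a row whose Tags cell is an empty string contributes the empty tag "",
# which appears to be the source of the unnamed column in the output.
all_tags = set(
    tag.strip() if isinstance(tags_str, str) else tag
    for tags_str in all_anime_df["Tags"]
    for tag in (
        tags_str.strip("[]").split(",") if isinstance(tags_str, str) else tags_str
    )
)

# Create one column per tag in the DataFrame.
# Note: when x is a string, `tag in x` is a substring test, not a list lookup.
for tag in all_tags:
    all_anime_df[tag] = all_anime_df["Tags"].apply(lambda x: 1 if tag in x else 0)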

In [66]:
all_anime_df

Unnamed: 0,Anime,URL,Tags,Unnamed: 4,Adventure,Dark Fantasy,Mature Themes,War,Conspiracy,Violence,...,Animal Abuse,Military,Action,Explicit Violence,Death of a Loved One,Siblings,Shounen,Fantasy,Suicide,Based on a Manga
0,Attack on Titan The Final Season: The Final Ch...,https://www.anime-planet.com/anime/attack-on-t...,"[Action, Drama, Fantasy, Shounen, Dark Fantasy...",0,0,1,1,1,0,0,...,0,1,1,1,0,0,1,1,1,1
1,Fullmetal Alchemist: Brotherhood,https://www.anime-planet.com/anime/fullmetal-a...,"[Action, Adventure, Drama, Fantasy, Mystery, S...",0,1,0,1,0,1,1,...,1,1,1,0,1,1,1,1,0,1
2,Fruits Basket the Final Season,https://www.anime-planet.com/anime/fruits-bask...,,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Demon Slayer: Kimetsu no Yaiba - Entertainment...,https://www.anime-planet.com/anime/demon-slaye...,,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Jujutsu Kaisen 2nd Season,https://www.anime-planet.com/anime/jujutsu-kai...,,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23329,The Mad Capsule Markets: Pulse,https://www.anime-planet.com/anime/the-mad-cap...,,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23330,Kyoufu no Hiruyasumi,https://www.anime-planet.com/anime/kyoufu-no-h...,,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23331,Xing Xueyuan III: Pandora Mijing,https://www.anime-planet.com/anime/xing-xueyua...,,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23332,Xiyou Xin Chuan,https://www.anime-planet.com/anime/xiyou-xin-c...,,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
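
In the table above, the rows that were never scraped hold an empty string, which appears to be what produces the unnamed column set to 1 for them. A minimal sketch of an alternative encoding follows, assuming every cell in `Tags` is already a Python list (empty when nothing was scraped). It uses `Series.explode` and `pd.get_dummies`; the sample frame is illustrative, not scraped data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Anime": ["A", "B", "C"],
        "Tags": [["Action", "Drama"], ["Action", "War"], []],
    }
)

# One row per (anime, tag) pair; empty lists become NaN and are dropped.
exploded = df["Tags"].explode().dropna()

# One indicator column per tag, then collapse back to one row per anime.
indicators = pd.get_dummies(exploded).groupby(level=0).max().astype(int)

# Anime without tags are missing after dropna; restore them as all zeros.
indicators = indicators.reindex(df.index, fill_value=0)

result = df.join(indicators)
print(result)
```

Because the indicators are built from exact tag values, there is no empty-named column and no substring matching between tags.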


In [30]:
all_tags

[]