**Building the database**

This database was built by integrating two main sources of film data: the Kaggle dataset "Full TMDB Movies Dataset 2024 (1M Movies)", which contains 1,000,000 movies, and the "Top 10000 Popular Movies Dataset" from the same platform. The two were combined in Python with `pd.merge`, using each movie's original title (`original_title`) as the key and an inner join, so that only records present in both sets were kept. Columns with the same name in both sources received the suffixes `_1M` and `_10k` to identify their origin. A subset of variables describing the technical characteristics, commercial performance, and critical reception of each production was then selected.

In [None]:
import pandas as pd
import numpy as np

In [1]:
# Hugging Face was used to host the original Kaggle files

url_one_million = "https://huggingface.co/datasets/cataxmonger7/one_million/resolve/main/one_million.csv"
url_ten_thousand = "https://huggingface.co/datasets/cataxmonger7/ten_thousand/resolve/main/ten_thousand.csv"

# load both datasets

one_million = pd.read_csv(
    url_one_million, 
    on_bad_lines='skip',  
    engine='python',     
    encoding='utf-8'     
)

ten_thousand = pd.read_csv(
    url_ten_thousand,
    on_bad_lines='skip',
    engine='python',
    encoding='utf-8'
)
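
Because `on_bad_lines='skip'` drops malformed rows silently, it can be useful to know how many were lost. The following is a minimal sketch, assuming pandas 1.4 or later, where `on_bad_lines` also accepts a callable when `engine='python'`. The inline CSV, the `skipped` list, and the `record_bad_line` helper are illustrative stand-ins and are not part of the original notebook.

```python
import io
import pandas as pd

# Toy CSV standing in for a real file: the third data line has an extra field.
raw = (
    "original_title,runtime\n"
    "Alien,117\n"
    "Heat,170\n"
    "Broken,90,extra\n"
    "Jaws,124\n"
)

skipped = []  # hypothetical collector for the malformed rows


def record_bad_line(bad_line):
    """Store the malformed row and return None so that pandas skips it."""
    skipped.append(bad_line)
    return None


df = pd.read_csv(
    io.StringIO(raw),
    engine="python",              # callables are only supported by this engine
    on_bad_lines=record_bad_line,
)

print(f"Rows loaded: {len(df)}, rows skipped: {len(skipped)}")
```

The same callable could be passed to the two `pd.read_csv` calls above to report how many lines of each file were discarded.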

In [8]:
# merge the two datasets
merged = pd.merge(
    one_million,
    ten_thousand,
    on='original_title',
    how='inner',
    suffixes=('_1M', '_10k')
)

# list every available column
print("All columns after the merge:")
for i, col in enumerate(merged.columns):
    print(f"{i}: {col}")

# look for columns that exist in both sources (suffixed `_1M` / `_10k`)
for key in ['vote', 'popularity', 'runtime', 'revenue', 'original_language']:
    print(f"\nColumns containing '{key}':")
    print([col for col in merged.columns if key in col])

All columns after the merge:
0: id_1M
1: title
2: vote_average_1M
3: vote_count_1M
4: status
5: release_date_1M
6: revenue_1M
7: runtime_1M
8: adult
9: backdrop_path
10: budget
11: homepage
12: imdb_id
13: original_language_1M
14: original_title
15: overview_1M
16: popularity_1M
17: poster_path
18: tagline_1M
19: genres
20: production_companies
21: production_countries
22: spoken_languages
23: keywords
24: Unnamed: 0
25: id_10k
26: original_language_10k
27: popularity_10k
28: release_date_10k
29: vote_average_10k
30: vote_count_10k
31: genre
32: overview_10k
33: revenue_10k
34: runtime_10k
35: tagline_10k

Columns containing 'vote':
['vote_average_1M', 'vote_count_1M', 'vote_average_10k', 'vote_count_10k']

Columns containing 'popularity':
['popularity_1M', 'popularity_10k']

Columns containing 'runtime':
['runtime_1M', 'runtime_10k']

Columns containing 'revenue':
['revenue_1M', 'revenue_10k']

Columns containing 'original_language':
['original_language_1M', 'original_language_10k']
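
Movie titles are not necessarily unique (remakes and re-releases can share a name), so joining on `original_title` alone may pair one row with several different films and multiply the row count. The sketch below uses invented data to show the effect and one way to detect it. The frames `left` and `right` are hypothetical, and `validate="one_to_one"` is shown only as an optional check, not as something the original notebook used.

```python
import pandas as pd

# Invented data: "Hamlet" appears twice on each side.
left = pd.DataFrame({
    "original_title": ["Hamlet", "Hamlet", "Alien"],
    "runtime": [242, 135, 117],
})
right = pd.DataFrame({
    "original_title": ["Hamlet", "Hamlet", "Heat"],
    "popularity": [9.1, 4.3, 30.2],
})

# Inner join on the title: each left "Hamlet" pairs with each right "Hamlet".
merged_toy = pd.merge(left, right, on="original_title", how="inner")
print(len(merged_toy))  # 2 x 2 combinations for "Hamlet"

# Count the rows whose key is repeated within one table.
n_dup_left = int(left["original_title"].duplicated(keep=False).sum())
print(f"Left rows with a repeated title: {n_dup_left}")

# Optionally ask pandas to reject non-unique keys.
try:
    pd.merge(left, right, on="original_title", how="inner", validate="one_to_one")
    rejected = False
except pd.errors.MergeError:
    rejected = True
print(f"one_to_one validation rejected the merge: {rejected}")
```

Running the same `duplicated` check on `one_million` and `ten_thousand` would indicate how many of the merged rows come from repeated titles.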

In [9]:
# select variables
movies = merged[
    [
        "original_title",
        "vote_average_1M",        
        "vote_count_1M",         
        "revenue_1M",            
        "budget",                 
        "popularity_1M",       
        "runtime_1M",            
        "genre",                 
        "original_language_1M",   
        "production_countries"    
    ]
]

# rename columns
movies = movies.rename(columns={
    "original_title": "title",
    "vote_average_1M": "vote_average",
    "vote_count_1M": "vote_count", 
    "revenue_1M": "revenue",
    "popularity_1M": "popularity", 
    "runtime_1M": "runtime",
    "original_language_1M": "language",
    "production_countries": "country"
})

In [10]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27863 entries, 0 to 27862
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         27655 non-null  object 
 1   vote_average  27863 non-null  float64
 2   vote_count    27863 non-null  int64  
 3   revenue       27863 non-null  int64  
 4   budget        27863 non-null  int64  
 5   popularity    27863 non-null  float64
 6   runtime       27863 non-null  int64  
 7   genre         27655 non-null  object 
 8   language      27863 non-null  object 
 9   country       19004 non-null  object 
dtypes: float64(2), int64(4), object(4)
memory usage: 2.1+ MB
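
The report above shows 208 rows (27863 minus 27655) with a missing `title`, which is the column that served as the merge key. Unlike SQL joins, pandas matches null keys against each other during a merge, so such rows can survive an inner join. The following sketch on invented data illustrates the behavior and one possible remedy, dropping rows with a missing key before merging. The frames are hypothetical, and whether this step is appropriate for the real data is an assumption to verify.

```python
import numpy as np
import pandas as pd

# Invented data with missing titles on both sides.
left = pd.DataFrame({
    "original_title": ["Alien", np.nan, np.nan],
    "runtime": [117, 90, 95],
})
right = pd.DataFrame({
    "original_title": ["Alien", np.nan],
    "popularity": [30.2, 1.5],
})

# Null keys are matched with each other, so the untitled rows are kept.
with_nulls = pd.merge(left, right, on="original_title", how="inner")
print(len(with_nulls), int(with_nulls["original_title"].isna().sum()))

# Dropping rows without a key first leaves only genuine title matches.
clean = pd.merge(
    left.dropna(subset=["original_title"]),
    right.dropna(subset=["original_title"]),
    on="original_title",
    how="inner",
)
print(len(clean))
```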


In [14]:
movies_muestra = movies.sample(n=5000, random_state=261125).reset_index(drop=True)
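
Fixing `random_state` is what makes the 5,000-row sample reproducible: the same seed applied to the same table selects the same rows on every run. A small sketch on an invented table (the `toy` frame and its columns are hypothetical) illustrates this, along with the fact that `sample` draws without replacement by default.

```python
import pandas as pd

# Invented table standing in for `movies`.
toy = pd.DataFrame({
    "title": [f"movie_{i}" for i in range(100)],
    "runtime": range(100),
})

# Same seed, same input: the two samples are identical.
first = toy.sample(n=10, random_state=261125).reset_index(drop=True)
second = toy.sample(n=10, random_state=261125).reset_index(drop=True)
print(first.equals(second))

# Sampling is without replacement by default, so no row is repeated.
print(first["title"].is_unique)
```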

In [15]:
movies_muestra.to_excel('movies_muestra.xlsx', index=False)

In [16]:
print(f"Sample size: {len(movies_muestra)}")
movies_muestra.info()  # info() prints its own report and returns None

Sample size: 5000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         4965 non-null   object 
 1   vote_average  5000 non-null   float64
 2   vote_count    5000 non-null   int64  
 3   revenue       5000 non-null   int64  
 4   budget        5000 non-null   int64  
 5   popularity    5000 non-null   float64
 6   runtime       5000 non-null   int64  
 7   genre         4965 non-null   object 
 8   language      5000 non-null   object 
 9   country       3393 non-null   object 
dtypes: float64(2), int64(4), object(4)
memory usage: 390.8+ KB
