# Questions to answer

What are the most-watched movies in general?

What genres does each age group prefer?

What are the top 5 movies by average rating?


In [24]:
import findspark
findspark.init('/home/gerardo-rodriguez/spark-4.0.0-bin-hadoop3')

In [25]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Analysis').getOrCreate()

In [21]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType, LongType

In [90]:
schema = StructType([
    StructField('user_id', LongType(), True),
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True),
    StructField('country', StringType(), True),
    StructField('film_id', StringType(), True),
    StructField('title', StringType(), True),
    StructField('genre', StringType(), True),
    StructField('duration', StringType(), True),
    StructField('rating', DoubleType(), True)
])

In [92]:
df = spark.read.csv('../ratings_netlfix/', schema=schema)

In [93]:
df.printSchema()

root
 |-- user_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- country: string (nullable = true)
 |-- film_id: string (nullable = true)
 |-- title: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- rating: double (nullable = true)



In [94]:
df.show()

+-------+----------------+---+---------+-------+--------------------+--------------------+---------+-------------------+
|user_id|            name|age|  country|film_id|               title|               genre| duration|             rating|
+-------+----------------+---+---------+-------+--------------------+--------------------+---------+-------------------+
|   6542|  Chris Martinez| 60|   France|  s2416|  Queen of the South|Crime TV Shows, T...|4 Seasons| 2.8309745132788167|
|  15674|     Chris Smith| 25|      USA|  s2974|            Hakkunde|Comedies, Dramas,...|   93 min| 1.5831828240851442|
|  14225|   Michael Davis| 55|   Brazil|  s4524|          Blood Pact|Crime TV Shows, I...| 1 Season|  4.484233958046046|
|  11181|    Sarah Miller| 56|    Japan|  s6249|      Basic Instinct|Classic Movies, T...|  128 min| 3.7227980266817786|
|  12561|Michael Martinez| 51|    India|  s2322|George Lopez: We'...|     Stand-Up Comedy|   52 min| 1.0716140892266441|
|   2833| Michael Johnson| 34|  

# Question 1 - Part 1

- What are the most-watched movies in general?

In [99]:
view = df.groupby(['film_id', 'title']).count().orderBy('count', ascending=False)
view.show(5, truncate=False)

+-------+---------------------------------+-----+
|film_id|title                            |count|
+-------+---------------------------------+-----+
|s8499  |The Score                        |83   |
|s7653  |On Yoga The Architecture of Peace|80   |
|s3861  |Crime Diaries: Night Out         |80   |
|s4003  |Bombairiya                       |79   |
|s2177  |Rogue Warfare: The Hunt          |78   |
+-------+---------------------------------+-----+
only showing top 5 rows


## Top 5 Most Watched Movies

| Ranking | Title | Views |
|---------|-------|-------|
| 1 | The Score | 83 |
| 2 | On Yoga The Architecture of Peace | 80 |
| 3 | Crime Diaries: Night Out | 80 |
| 4 | Bombairiya | 79 |
| 5 | Rogue Warfare: The Hunt | 78 |


# Question 2 - Part 1

- What genres does each age group prefer?

In [100]:
df.select('age').describe().show()

+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|            100000|
|   mean|           46.2936|
| stddev|19.522113875533606|
|    min|                13|
|    max|                80|
+-------+------------------+



In [157]:
from pyspark.sql.functions import col, when, lit, concat, count, row_number, mean, avg
from pyspark.sql import Window

In [107]:
df_rg_age = df.withColumn(
    'range_age',
    when((col('age') >= 10) & (col('age') <= 20), lit('10-20'))
    .when((col('age') >= 21) & (col('age') <= 30), lit('21-30'))
    .when((col('age') >= 31) & (col('age') <= 40), lit('31-40'))
    .when((col('age') >= 41) & (col('age') <= 50), lit('41-50'))
    .when(col('age') > 50, lit('50+'))
    .otherwise(lit('Other Range'))
)
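
The `when` chain above can be mirrored in plain Python to check the bucket boundaries without a Spark session. This is only a sketch: `age_range` is a hypothetical helper that does not appear in the notebook, and it assumes that a null age falls through to `otherwise`, as it would in Spark. Note that the `'50+'` label is assigned to ages strictly greater than 50, because 50 itself belongs to `'41-50'`.

```python
def age_range(age):
    """Return the label that the `range_age` column would hold for `age`."""
    if age is None:
        # Assumption: a null age matches no `when` branch and hits `otherwise`
        return 'Other Range'
    if 10 <= age <= 20:
        return '10-20'
    if 21 <= age <= 30:
        return '21-30'
    if 31 <= age <= 40:
        return '31-40'
    if 41 <= age <= 50:
        return '41-50'
    if age > 50:
        return '50+'
    return 'Other Range'


# Boundary ages, including the observed minimum (13) and maximum (80)
for a in (9, 13, 20, 21, 50, 51, 80):
    print(a, age_range(a))
```

Since the observed ages run from 13 to 80, the `'Other Range'` label should not appear in this dataset unless some ages are null.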

In [123]:
df_group = df_rg_age.groupby(['range_age', 'genre']).agg(
    count(col('genre')).alias('count')
)

In [121]:
df_group.show(truncate=False)

+---------+-----------------------------------------------------------+-----+
|range_age|genre                                                      |count|
+---------+-----------------------------------------------------------+-----+
|21-30    |Dramas, International Movies                               |638  |
|10-20    |Comedies, Dramas, Independent Movies                       |141  |
|50+      |Dramas, International Movies, Thrillers                    |518  |
|31-40    |International TV Shows, Korean TV Shows, TV Comedies       |17   |
|50+      |Classic Movies, Documentaries                              |65   |
|41-50    |Comedies, Independent Movies                               |48   |
|10-20    |Comedies, International Movies, Thrillers                  |6    |
|41-50    |British TV Shows, Reality TV                               |11   |
|21-30    |British TV Shows, International TV Shows, Romantic TV Shows|8    |
|21-30    |Classic Movies, Dramas                               

In [133]:
window_spec = Window.partitionBy('range_age').orderBy(col('count').desc())

In [135]:
df_max = df_group.withColumn(
    'rank', row_number().over(window_spec)
).filter(
    col('rank') == 1
).drop(
    'rank'
)
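
`row_number()` gives every row in a partition a distinct number, so if two genres tie for the highest count within an age range, only one of them passes the `rank == 1` filter, and Spark does not guarantee which one. Spark's `rank()` function would keep both. The plain-Python sketch below only illustrates that difference: the two helpers are illustrative reimplementations, not the Spark functions, and the input is the list of view counts from the Question 1 output, which contains a tie at 80.

```python
def row_number(values):
    """Descending position 1..n; equal values still get distinct numbers.

    Ties are broken by input order here. In Spark the tie order is not
    guaranteed unless the window's orderBy includes a tie-breaking column.
    """
    order = sorted(range(len(values)), key=lambda i: -values[i])
    numbers = [0] * len(values)
    for position, i in enumerate(order, start=1):
        numbers[i] = position
    return numbers


def rank(values):
    """Descending rank; equal values share a rank and leave a gap after it."""
    return [1 + sum(1 for other in values if other > v) for v in values]


counts = [83, 80, 80, 79, 78]
print(row_number(counts))  # [1, 2, 3, 4, 5]
print(rank(counts))        # [1, 2, 2, 4, 5]
```

If keeping tied genres matters, one option is to swap `row_number()` for `rank()` in the cell above; another is to add a second ordering column to the window so the choice is at least reproducible.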

In [137]:
df_max.show(truncate=False)

+---------+----------------------------+-----+
|range_age|genre                       |count|
+---------+----------------------------+-----+
|10-20    |Stand-Up Comedy             |486  |
|21-30    |Dramas, International Movies|638  |
|31-40    |Documentaries               |601  |
|41-50    |Documentaries               |615  |
|50+      |Documentaries               |1810 |
+---------+----------------------------+-----+



## Preferred Genres by Age Range

|range_age|genre                       |count|
|---------|----------------------------|-----|
|10-20    |Stand-Up Comedy             |486  |
|21-30    |Dramas, International Movies|638  |
|31-40    |Documentaries               |601  |
|41-50    |Documentaries               |615  |
|50+      |Documentaries               |1810 |


# Question 3 - Part 1

- Top 5 movies by average rating.

In [159]:
df_mean = df.groupBy(['film_id','title']).agg(
    avg(col('rating')).alias('avg_rating')
)

### Option 1

In [161]:
window_mean = Window.orderBy(col('avg_rating').desc())

In [162]:
df_mean.withColumn('rank', row_number().over(window_mean)).show(5, truncate=False)

25/08/25 12:50:45 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


+-------+---------------------------+-----------------+----+
|film_id|title                      |avg_rating       |rank|
+-------+---------------------------+-----------------+----+
|s76    |You vs. Wild: Out Cold     |4.988773823533204|1   |
|s7103  |Issaq                      |4.985297088139854|2   |
|s8783  |Yoga Hosers                |4.984790297648168|3   |
|s7815  |R.L. Stine's Mostly Ghostly|4.984177396971178|4   |
|s7976  |Secrets of Scotland Yard   |4.983490175728225|5   |
+-------+---------------------------+-----------------+----+
only showing top 5 rows



### Option 2

In [163]:
df_mean.orderBy(
    col('avg_rating').desc()
).show(5, truncate=False)

+-------+---------------------------+-----------------+
|film_id|title                      |avg_rating       |
+-------+---------------------------+-----------------+
|s76    |You vs. Wild: Out Cold     |4.988773823533204|
|s7103  |Issaq                      |4.985297088139854|
|s8783  |Yoga Hosers                |4.984790297648168|
|s7815  |R.L. Stine's Mostly Ghostly|4.984177396971178|
|s7976  |Secrets of Scotland Yard   |4.983490175728225|
+-------+---------------------------+-----------------+
only showing top 5 rows
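
All five averages are within 0.02 of 5.0. One possible explanation, which is an assumption because the notebook does not show how many ratings each title received, is that these titles have very few ratings, so a single high score dominates the average. The plain-Python sketch below shows the same group-and-average logic with an optional minimum-count filter. The function `top_by_average`, its `min_ratings` parameter, and the sample data are all hypothetical and are not part of the notebook.

```python
from collections import defaultdict


def top_by_average(ratings, n=5, min_ratings=1):
    """Return up to `n` (title, average, count) tuples, highest average first.

    `ratings` is an iterable of (title, rating) pairs. Titles with fewer
    than `min_ratings` ratings are skipped.
    """
    totals = defaultdict(lambda: [0.0, 0])  # title -> [sum of ratings, count]
    for title, rating in ratings:
        totals[title][0] += rating
        totals[title][1] += 1
    averages = [
        (title, total / count, count)
        for title, (total, count) in totals.items()
        if count >= min_ratings
    ]
    return sorted(averages, key=lambda row: row[1], reverse=True)[:n]


# Hypothetical sample: 'A' has one perfect rating, 'B' has three good ones
sample = [('A', 5.0), ('B', 4.0), ('B', 4.5), ('B', 5.0), ('C', 3.0), ('C', 2.0)]
print(top_by_average(sample, n=2))                 # 'A' ranks first
print(top_by_average(sample, n=2, min_ratings=2))  # 'A' is filtered out
```

In the Spark version, the equivalent would be to add a `count` aggregate next to `avg` in `df_mean` and filter on it before ordering; whether a threshold is appropriate, and what value to use, depends on how the ratings are distributed.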


## Top 5 Movies by Average Rating

|film_id|title                      |avg_rating       |
|-------|---------------------------|-----------------|
|s76    |You vs. Wild: Out Cold     |4.988773823533204|
|s7103  |Issaq                      |4.985297088139854|
|s8783  |Yoga Hosers                |4.984790297648168|
|s7815  |R.L. Stine's Mostly Ghostly|4.984177396971178|
|s7976  |Secrets of Scotland Yard   |4.983490175728225|

