In [1]:
import findspark
findspark.init('/home/gerardo-rodriguez/spark-4.0.0-bin-hadoop3')

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ml').getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/25 12:59:39 WARN Utils: Your hostname, Lanz-Lenovo, resolves to a loopback address: 127.0.1.1; using 192.168.1.145 instead (on interface wlp2s0)
25/08/25 12:59:39 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/08/25 12:59:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/08/25 12:59:41 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/08/25 12:59:41 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
25/08/25 12:59:41 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.


In [4]:
df = spark.read.csv('../Create_ratings/netflix_users.csv', header=True, inferSchema=True)

In [5]:
df.printSchema()

root
 |-- User_ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Country: string (nullable = true)
 |-- Subscription_Type: string (nullable = true)
 |-- Watch_Time_Hours: double (nullable = true)
 |-- Favorite_Genre: string (nullable = true)
 |-- Last_Login: date (nullable = true)



# Create the Churn Column

In [13]:
from pyspark.sql.functions import col, datediff, current_date, when

In [29]:
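# Label a user as churned (1) when they watched < 300 hours and last logged in > 60 days ago.
is_churned = (col('Watch_Time_Hours') < 300) & (datediff(current_date(), col('Last_Login')) > 60)
df_churn = df.withColumn('churn', when(is_churned, 1).otherwise(0))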

In [34]:
df_ml = df_churn.withColumn('days_login', datediff(current_date(), col('Last_Login')))
df_ml.show(5)

+-------+--------------+---+-------+-----------------+----------------+--------------+----------+-----+----------+
|User_ID|          Name|Age|Country|Subscription_Type|Watch_Time_Hours|Favorite_Genre|Last_Login|churn|days_login|
+-------+--------------+---+-------+-----------------+----------------+--------------+----------+-----+----------+
|      1|James Martinez| 18| France|          Premium|           80.26|         Drama|2024-05-12|    1|       470|
|      2|   John Miller| 23|    USA|          Premium|          321.75|        Sci-Fi|2025-02-05|    0|       201|
|      3|    Emma Davis| 60|     UK|            Basic|           35.89|        Comedy|2025-01-24|    1|       213|
|      4|   Emma Miller| 44|    USA|          Premium|          261.56|   Documentary|2024-03-25|    1|       518|
|      5|    Jane Smith| 68|    USA|         Standard|           909.3|         Drama|2025-01-14|    0|       223|
+-------+--------------+---+-------+-----------------+----------------+--------------+----------+-----+----------+
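
The labeling rule above can be checked by hand on the five rows shown. The following is a minimal plain-Python sketch, not the Spark code itself. It assumes the notebook ran on 2025-08-25, which is inferred from the log timestamps and the `days_login` values. The names `REFERENCE_DATE`, `days_since_login`, and `churn_label` are illustrative and do not appear in the notebook.

```python
from datetime import date

# Assumption: the notebook ran on 2025-08-25 (inferred from the log timestamps
# and the days_login values); current_date() would return this at run time.
REFERENCE_DATE = date(2025, 8, 25)

def days_since_login(last_login, today=REFERENCE_DATE):
    """Plain-Python counterpart of datediff(current_date(), Last_Login)."""
    return (today - last_login).days

def churn_label(watch_time_hours, last_login, today=REFERENCE_DATE):
    """Return 1 when the user watched < 300 hours AND has been away > 60 days."""
    inactive = days_since_login(last_login, today) > 60
    return int(watch_time_hours < 300 and inactive)

# (Watch_Time_Hours, Last_Login) for the five users shown above.
sample_users = [
    (80.26, date(2024, 5, 12)),
    (321.75, date(2025, 2, 5)),
    (35.89, date(2025, 1, 24)),
    (261.56, date(2024, 3, 25)),
    (909.3, date(2025, 1, 14)),
]

results = [(churn_label(h, d), days_since_login(d)) for h, d in sample_users]
for row in results:
    print(row)  # (churn, days_login)
```

Because the rule depends on the current date, rerunning the notebook on a later day yields different `days_login` values and possibly different labels. Pinning the reference date, as this sketch does, is one way to keep the labels reproducible.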

# Classification Model for churn

In [62]:
from pyspark.ml.classification import (LogisticRegression, RandomForestClassifier)
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler

In [35]:
indexer = StringIndexer(inputCols=['Subscription_Type', 'Country'], outputCols=['subscription_indexed', 'country_indexed'])
indexed = indexer.fit(df_ml)

In [36]:
df_indexed = indexed.transform(df_ml)

In [37]:
df_indexed.columns

['User_ID',
 'Name',
 'Age',
 'Country',
 'Subscription_Type',
 'Watch_Time_Hours',
 'Favorite_Genre',
 'Last_Login',
 'churn',
 'days_login',
 'subscription_indexed',
 'country_indexed']
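
To make the new `subscription_indexed` and `country_indexed` columns easier to interpret, here is a minimal plain-Python sketch of how `StringIndexer` assigns indices under its default ordering: the most frequent label gets 0.0, and ties are broken alphabetically. The helper `string_index` and the sample values are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

def string_index(values):
    """Map each distinct string to a float index, most frequent first.

    Mirrors the default frequency-descending ordering; ties are broken
    alphabetically.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Hypothetical sample, not taken from netflix_users.csv.
plans = ['Premium', 'Premium', 'Basic', 'Premium', 'Standard', 'Basic']
mapping = string_index(plans)
print(mapping)
```

The resulting numbers are category codes, not magnitudes. A linear model such as logistic regression will still treat them as ordered quantities, so one-hot encoding the indexed columns (for example with `OneHotEncoder` from `pyspark.ml.feature`) may be worth considering.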

## Feature Selection

For churn prediction, the following variables were selected for their potential relevance to user behavior. The rationale given for each one is a working hypothesis, not a result measured on this dataset.

Age (Age): Younger users, being more familiar with technology and more inclined to try alternative streaming platforms, may be more likely to leave. Older users tend to keep more stable habits, which may translate into lower churn.

Watch time (Watch_Time_Hours): High watch time is a strong indicator of engagement. Users who find content they like and spend many hours watching it are less likely to churn.

Days since last login (days_login): Extended inactivity is a direct signal of disengagement. Users who have not logged in recently are more likely to cancel their subscription.

Subscription type (Subscription_Type): Users on higher-tier plans (e.g., Premium) often share the account with family or friends. Because cancelling then involves several people, churn is less likely. Basic plans, usually for individual use, tend to be more volatile.

Country (Country): Economic conditions in each country affect users' ability to pay. In regions where a dollar-denominated price is relatively expensive, users may see the subscription as a non-essential expense, which increases churn risk.

Note that the churn label was built directly from Watch_Time_Hours and Last_Login, so Watch_Time_Hours and days_login determine the label by construction. High scores from any model that uses these two features are therefore expected.

In [50]:
assembler = VectorAssembler(inputCols=['Age',
 'Watch_Time_Hours',
 'days_login',
 'subscription_indexed',
 'country_indexed'], outputCol='features')

In [52]:
df_assembled = assembler.transform(df_indexed)

In [55]:
df_assembled.printSchema()

root
 |-- User_ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Country: string (nullable = true)
 |-- Subscription_Type: string (nullable = true)
 |-- Watch_Time_Hours: double (nullable = true)
 |-- Favorite_Genre: string (nullable = true)
 |-- Last_Login: date (nullable = true)
 |-- churn: integer (nullable = false)
 |-- days_login: integer (nullable = true)
 |-- subscription_indexed: double (nullable = false)
 |-- country_indexed: double (nullable = false)
 |-- features: vector (nullable = true)



In [56]:
train_data, test_data = df_assembled.randomSplit([0.8, 0.2], seed=1)
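
`randomSplit` assigns rows to splits at random according to the weights, so the resulting sizes are only approximately 80/20. The sketch below illustrates the idea in plain Python. It is a conceptual model, not Spark's actual implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split with probability proportional to its weight."""
    total = float(sum(weights))
    upper_bounds = []
    running = 0.0
    for w in weights:
        running += w / total  # weights are normalized to sum to 1
        upper_bounds.append(running)

    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(upper_bounds):
            if draw < upper:
                splits[i].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound.
            splits[-1].append(row)
    return splits

rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=1)
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one split, but the counts vary around the requested proportions. Checking `train_data.count()` and `test_data.count()` after the split confirms the actual sizes.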

### Logistic Regression

In [58]:
logistic = LogisticRegression(featuresCol='features', labelCol='churn')
model = logistic.fit(train_data)

In [59]:
prediction = model.transform(test_data)

In [61]:
prediction.select(['churn', 'prediction']).show()

+-----+----------+
|churn|prediction|
+-----+----------+
|    0|       0.0|
|    1|       1.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    1|       1.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    1|       1.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    1|       1.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
+-----+----------+
only showing top 20 rows


In [63]:
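# BinaryClassificationEvaluator defaults to metricName='areaUnderROC', so this value is AUC, not accuracy.
eva_log = BinaryClassificationEvaluator(labelCol='churn')
auc_log = eva_log.evaluate(prediction)

In [64]:
print("Evaluation Logistic regression")
auc_log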

Evaluation Logistic regression


0.999999415457866
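
The value above is the area under the ROC curve, the default metric of `BinaryClassificationEvaluator`. One way to read it: AUC is the probability that a randomly chosen churned user receives a higher score than a randomly chosen non-churned user. The plain-Python sketch below computes AUC from that pairwise definition. The function name and the toy scores are made up for illustration, and Spark computes the metric differently, from the ROC curve.

```python
def area_under_roc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example with hypothetical scores, not model output from this notebook.
toy_labels = [0, 1, 0, 0, 1]
toy_scores = [0.1, 0.9, 0.2, 0.4, 0.35]
toy_auc = area_under_roc(toy_labels, toy_scores)
print(toy_auc)
```

An AUC this close to 1.0 is less surprising than it looks: the churn label is a deterministic function of `Watch_Time_Hours` and `Last_Login`, and both are available to the model through the feature vector.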

### Random Forest

In [69]:
rf = RandomForestClassifier(featuresCol='features', labelCol='churn', maxDepth=20, numTrees=200, seed=1)
model_rf = rf.fit(train_data)

25/08/25 19:18:15 WARN DAGScheduler: Broadcasting large task binary with size 1056.2 KiB

In [70]:
predict_rf = model_rf.transform(test_data)
predict_rf.select(['churn', 'prediction']).show()

+-----+----------+
|churn|prediction|
+-----+----------+
|    0|       0.0|
|    1|       1.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    1|       1.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    1|       1.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    1|       1.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
+-----+----------+
only showing top 20 rows


25/08/25 19:18:28 WARN DAGScheduler: Broadcasting large task binary with size 4.5 MiB


In [71]:
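# metricName='accuracy' is the share of correct predictions; it is not directly comparable to the AUC reported for logistic regression.
eva_rf = MulticlassClassificationEvaluator(labelCol='churn', metricName='accuracy')
acc_rf = eva_rf.evaluate(predict_rf)

25/08/25 19:18:29 WARN DAGScheduler: Broadcasting large task binary with size 4.5 MiB


In [72]:
print("Evaluation Random Forest")
acc_rf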

Evaluation Random Forest


0.9954064309966048
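
Accuracy is easier to judge next to a trivial baseline, because the classes are imbalanced. The plain-Python sketch below uses only the 20 label and prediction pairs printed above and compares them with a baseline that always predicts "no churn". The `accuracy` helper is illustrative, and 20 rows are far too few to describe the full test set.

```python
def accuracy(labels, predictions):
    """Share of predictions that match the true label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# The 20 rows displayed in the prediction output above.
shown_labels = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
shown_predictions = list(shown_labels)  # every displayed prediction matched

always_zero = [0] * len(shown_labels)   # baseline: never predict churn

model_acc = accuracy(shown_labels, shown_predictions)
baseline_acc = accuracy(shown_labels, always_zero)
print(model_acc, baseline_acc)
```

On these rows the baseline already reaches 0.8, so an accuracy figure is best read against the churn rate. Evaluating both models with the same metric would also make the logistic regression and random forest results directly comparable.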

# Linear Model

In [85]:
df_show = spark.read.csv('../Create_ratings/netflix_titles.csv', header=True, inferSchema=True)
df_show = df_show.select(['show_id',
                     'type',
                     'title',
                     'director',
                     'cast',
                     'country',
                     'date_added',
                     'release_year',
                     'rating',
                     'duration',
                     'listed_in',
                     'description'])

In [82]:
df_show.printSchema()

root
 |-- show_id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- title: string (nullable = true)
 |-- director: string (nullable = true)
 |-- cast: string (nullable = true)
 |-- country: string (nullable = true)
 |-- date_added: string (nullable = true)
 |-- release_year: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- listed_in: string (nullable = true)
 |-- description: string (nullable = true)



In [88]:
df_show = df_show.na.drop()
df_show.show()

+-------+-------+--------------------+-------------------+--------------------+--------------------+------------------+------------+------+---------+--------------------+--------------------+
|show_id|   type|               title|           director|                cast|             country|        date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+--------------------+-------------------+--------------------+--------------------+------------------+------------+------+---------+--------------------+--------------------+
|     s8|  Movie|             Sankofa|       Haile Gerima|Kofi Ghanaba, Oya...|United States, Gh...|September 24, 2021|        1993| TV-MA|  125 min|Dramas, Independe...|On a photo shoot ...|
|     s9|TV Show|The Great British...|    Andy Devonshire|Mel Giedroyc, Sue...|      United Kingdom|September 24, 2021|        2021| TV-14|9 Seasons|British TV Shows,...|A talented batch ...|
|    s10|  Movie|        The Starling|  

In [89]:
df_show.columns

['show_id',
 'type',
 'title',
 'director',
 'cast',
 'country',
 'date_added',
 'release_year',
 'rating',
 'duration',
 'listed_in',
 'description']

In [91]:
df_train = df_show.select([
 'type',
 'director',
 'cast',
 'release_year',
 'duration',
 'listed_in',
 'rating'])

df_train.printSchema()

root
 |-- type: string (nullable = true)
 |-- director: string (nullable = true)
 |-- cast: string (nullable = true)
 |-- release_year: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- listed_in: string (nullable = true)
 |-- rating: string (nullable = true)

