[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fisamz/Repositorio_MCDAA/blob/main/Tarea3/Tarea3.ipynb)

# Assignment 3: Data Manipulation with DataFrames in PySpark
**Student:** Fisam Zavala  
**Dataset:** Soccer match results and bookmaker odds.  
**Source:** [European Soccer Database](https://www.kaggle.com/datasets/hugomathien/soccer)


In [1]:
#%pip install pyspark
import pyspark
pyspark.__version__

'4.1.1'


In this exercise, the PySpark DataFrame API was used to load and manipulate the `match.csv` dataset.
Column-level transformations with `withColumn` created the derived variables `total_goals`, `goal_diff`, `is_home_win`, and `season_year`.

Filters were then applied to select subsets of interest (for example, matches with five or more total goals), and descriptive statistics were computed with `describe()` and aggregations with `agg()`.

Finally, matches were grouped by season with `groupBy().agg()` to compute averages and counts, illustrating the typical use of DataFrames for SQL-style analysis in Spark.


In [None]:
# (Optional / reference) Download from Kaggle:
# !pip install kaggle
# !kaggle datasets download -d hugomathien/soccer
# !unzip soccer.zip -d data/

#import sqlite3, pandas as pd

#conn = sqlite3.connect("../data/database.sqlite")

#df_match = pd.read_sql("SELECT * FROM Match", conn)
#df_match.to_csv("../data/match.csv", index=False)

#conn.close()


In [1]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Tarea3_DataFrames")
    .master("local[*]")
    .getOrCreate()
)

spark.sparkContext.setLogLevel("ERROR")


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/02/20 11:38:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
26/02/20 11:38:31 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [2]:
from pyspark.sql import functions as F

df_match = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("../data/match.csv")
)

print("Rows:", df_match.count())
df_match.printSchema()
df_match.show(5, truncate=False)


Rows: 27383
root
 |-- id: string (nullable = true)
 |-- country_id: string (nullable = true)
 |-- league_id: integer (nullable = true)
 |-- season: string (nullable = true)
 |-- stage: integer (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- match_api_id: integer (nullable = true)
 |-- home_team_api_id: integer (nullable = true)
 |-- away_team_api_id: integer (nullable = true)
 |-- home_team_goal: integer (nullable = true)
 |-- away_team_goal: integer (nullable = true)
 |-- home_player_X1: double (nullable = true)
 |-- home_player_X2: double (nullable = true)
 |-- home_player_X3: double (nullable = true)
 |-- home_player_X4: double (nullable = true)
 |-- home_player_X5: double (nullable = true)
 |-- home_player_X6: double (nullable = true)
 |-- home_player_X7: double (nullable = true)
 |-- home_player_X8: double (nullable = true)
 |-- home_player_X9: double (nullable = true)
 |-- home_player_X10: double (nullable = true)
 |-- home_player_X11: double (nullable = true)
 |--
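
A note on the schema above: `id` and `country_id` are inferred as `string`, which usually means at least one row holds non-numeric text in those columns. One possible cause, stated here as an assumption rather than a diagnosis, is that some text fields in `match.csv` contain quoted line breaks, which a line-oriented reader splits into extra records. The sketch below uses only Python's standard `csv` module and made-up sample data to show the effect; the column names `id` and `notes` are hypothetical.

```python
import csv
import io

# Hypothetical sample: the second record has a line break inside a quoted field.
rows = [
    ["id", "notes"],
    ["1", "plain text"],
    ["2", "first line\nsecond line"],
]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()

# Naive approach: treat every physical line as one record.
physical_lines = text.splitlines()

# Quote-aware approach: csv.reader keeps the quoted line break inside the field.
parsed = list(csv.reader(io.StringIO(text, newline="")))

print("physical lines:", len(physical_lines))
print("parsed records:", len(parsed))
```

If this turns out to apply to `match.csv`, Spark's CSV reader has a `multiLine` option, and an `escape` option that can be set to `"` for files that double their quotes. Whether either is needed here has not been verified.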

In [3]:
df_match2 = (
    df_match
    .withColumn("total_goals", F.col("home_team_goal") + F.col("away_team_goal"))
    .withColumn("goal_diff", F.col("home_team_goal") - F.col("away_team_goal"))
    .withColumn("is_home_win", (F.col("home_team_goal") > F.col("away_team_goal")).cast("int"))
    .withColumn("season_year", F.substring(F.col("season"), 1, 4).cast("int"))
)

df_match2.select(
    "season", "season_year", "date",
    "home_team_goal", "away_team_goal",
    "total_goals", "goal_diff", "is_home_win"
).show(10, truncate=False)


+---------+-----------+-------------------+--------------+--------------+-----------+---------+-----------+
|season   |season_year|date               |home_team_goal|away_team_goal|total_goals|goal_diff|is_home_win|
+---------+-----------+-------------------+--------------+--------------+-----------+---------+-----------+
|2008/2009|2008       |2008-08-17 00:00:00|1             |1             |2          |0        |0          |
|2008/2009|2008       |2008-08-16 00:00:00|0             |0             |0          |0        |0          |
|2008/2009|2008       |2008-08-16 00:00:00|0             |3             |3          |-3       |0          |
|2008/2009|2008       |2008-08-17 00:00:00|5             |0             |5          |5        |1          |
|2008/2009|2008       |2008-08-16 00:00:00|1             |3             |4          |-2       |0          |
|2008/2009|2008       |2008-09-24 00:00:00|1             |1             |2          |0        |0          |
|2008/2009|2008       |2008-
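
To make the row-level logic of the four `withColumn` expressions concrete, here is a plain-Python sketch (no Spark required) that reproduces the derived values for three of the rows shown above. The helper name `derive` is hypothetical. Note that Spark's `substring` is 1-based, so `substring(season, 1, 4)` corresponds to the Python slice `season[:4]`.

```python
def derive(season, home_goals, away_goals):
    """Plain-Python equivalent of the four derived columns for one match."""
    return {
        "total_goals": home_goals + away_goals,
        "goal_diff": home_goals - away_goals,
        # Boolean cast to int: 1 for a home win, 0 for a draw or an away win.
        "is_home_win": int(home_goals > away_goals),
        # First four characters of "2008/2009" give the starting year.
        "season_year": int(season[:4]),
    }


# (season, home_team_goal, away_team_goal) taken from the output above.
sample = [("2008/2009", 1, 1), ("2008/2009", 0, 3), ("2008/2009", 5, 0)]
for season, home, away in sample:
    print(derive(season, home, away))
```

One difference worth keeping in mind: in Spark, a null goal value propagates to null in the derived columns, whereas this sketch would raise a `TypeError` on `None`.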

In [4]:
df_high_scoring = (
    df_match2
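    # The explicit null check is redundant (a null comparison already drops the row), but it documents intent.
    .filter(F.col("total_goals").isNotNull())
    .filter(F.col("total_goals") >= 5)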
    .select("date", "season", "home_team_goal", "away_team_goal", "total_goals")
    .orderBy(F.col("total_goals").desc())
)

df_high_scoring.show(10, truncate=False)


+-------------------+---------+--------------+--------------+-----------+
|date               |season   |home_team_goal|away_team_goal|total_goals|
+-------------------+---------+--------------+--------------+-----------+
|2015-12-20 00:00:00|2015/2016|10            |2             |12         |
|2010-05-05 00:00:00|2009/2010|6             |6             |12         |
|2013-03-30 00:00:00|2012/2013|9             |2             |11         |
|2009-11-22 00:00:00|2009/2010|9             |1             |10         |
|2010-10-24 00:00:00|2010/2011|10            |0             |10         |
|2011-08-28 00:00:00|2011/2012|8             |2             |10         |
|2011-11-06 00:00:00|2011/2012|6             |4             |10         |
|2009-11-08 00:00:00|2009/2010|5             |5             |10         |
|2012-12-29 00:00:00|2012/2013|7             |3             |10         |
|2013-10-30 00:00:00|2013/2014|7             |3             |10         |
+-------------------+---------+-------

In [5]:
df_match2.select("home_team_goal", "away_team_goal", "total_goals", "goal_diff").describe().show()

df_match2.agg(
    F.count("*").alias("n_rows"),
    F.avg("total_goals").alias("avg_total_goals"),
    F.min("total_goals").alias("min_total_goals"),
    F.max("total_goals").alias("max_total_goals")
).show()


+-------+------------------+------------------+------------------+-------------------+
|summary|    home_team_goal|    away_team_goal|       total_goals|          goal_diff|
+-------+------------------+------------------+------------------+-------------------+
|  count|             25979|             25979|             25979|              25979|
|   mean|1.5445937103044767|1.1609376804341969|2.7055313907386735|0.38365602987027986|
| stddev|1.2971582225804408|1.1421103393870686| 1.672456180315952| 1.7824032339537441|
|    min|                 0|                 0|                 0|                 -9|
|    max|                10|                 9|                12|                 10|
+-------+------------------+------------------+------------------+-------------------+

+------+------------------+---------------+---------------+
|n_rows|   avg_total_goals|min_total_goals|max_total_goals|
+------+------------------+---------------+---------------+
| 27383|2.7055313907386735|         
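
The two outputs above can be cross-checked with a little arithmetic. `describe()` counts only non-null values (25979), while `count("*")` counts every row (27383), so the difference is the number of rows with no goal data. Because the mean is linear, the mean of `total_goals` should also equal the sum of the home and away means, and the mean of `goal_diff` their difference. The sketch below only copies figures from the output; nothing in it is newly measured.

```python
import math

# Figures copied from the describe() and agg() output above.
n_rows = 27383
n_non_null = 25979
mean_home = 1.5445937103044767
mean_away = 1.1609376804341969
mean_total = 2.7055313907386735
mean_diff = 0.38365602987027986

# Rows counted by count("*") but skipped by describe() have null goal columns.
rows_without_goals = n_rows - n_non_null
print("rows with null goal columns:", rows_without_goals)

# Linearity of the mean: E[h + a] = E[h] + E[a] and E[h - a] = E[h] - E[a].
sum_matches = math.isclose(mean_home + mean_away, mean_total, rel_tol=1e-9)
diff_matches = math.isclose(mean_home - mean_away, mean_diff, rel_tol=1e-9)
print("mean(total) == mean(home) + mean(away):", sum_matches)
print("mean(diff)  == mean(home) - mean(away):", diff_matches)
```

The rows without goal data are worth inspecting before drawing conclusions from the full-table row count, since they do not contribute to any of the averages.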

In [6]:
df_by_season = (
    df_match2
    .filter(F.col("season_year").isNotNull())
    .groupBy("season_year")
    .agg(
        F.avg("total_goals").alias("avg_goals"),
        F.count("*").alias("num_matches"),
        F.avg("is_home_win").alias("home_win_rate")  # mean of a 0/1 indicator = rate
    )
    .orderBy("season_year")
)

df_by_season.show(20, truncate=False)


+-----------+------------------+-----------+------------------+
|season_year|avg_goals         |num_matches|home_win_rate     |
+-----------+------------------+-----------+------------------+
|2008       |2.607336139506915 |3326       |0.47083583884546  |
|2009       |2.6724458204334365|3230       |0.4743034055727554|
|2010       |2.6837423312883435|3260       |0.4662576687116564|
|2011       |2.7164596273291925|3220       |0.4652173913043478|
|2012       |2.7726993865030676|3260       |0.4429447852760736|
|2013       |2.766820580474934 |3032       |0.4630606860158311|
|2014       |2.6757894736842105|3325       |0.4493233082706767|
|2015       |2.7546602525556225|3326       |0.4386650631389056|
+-----------+------------------+-----------+------------------+
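
As a sanity check on the grouped result, the per-season averages can be recombined into overall figures by weighting each season by its number of matches. The sketch below does this in plain Python with the values copied from the table above; the variable names are illustrative.

```python
# (season_year, avg_goals, num_matches, home_win_rate) copied from the table above.
seasons = [
    (2008, 2.607336139506915, 3326, 0.47083583884546),
    (2009, 2.6724458204334365, 3230, 0.4743034055727554),
    (2010, 2.6837423312883435, 3260, 0.4662576687116564),
    (2011, 2.7164596273291925, 3220, 0.4652173913043478),
    (2012, 2.7726993865030676, 3260, 0.4429447852760736),
    (2013, 2.766820580474934, 3032, 0.4630606860158311),
    (2014, 2.6757894736842105, 3325, 0.4493233082706767),
    (2015, 2.7546602525556225, 3326, 0.4386650631389056),
]

total_matches = sum(n for _, _, n, _ in seasons)

# A weighted mean of group means recovers the mean over all matches.
overall_avg_goals = sum(avg * n for _, avg, n, _ in seasons) / total_matches

# is_home_win is 0 or 1, so its mean is the share of home wins.
overall_home_win_rate = sum(rate * n for _, _, n, rate in seasons) / total_matches

print("matches:", total_matches)
print("overall avg goals:", round(overall_avg_goals, 4))
print("overall home win rate:", round(overall_home_win_rate, 4))
```

The match total equals the non-null count reported by `describe()` (25979), and the weighted average agrees with the overall `avg_total_goals` of about 2.7055, which suggests the grouping did not drop or duplicate any match with goal data.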

