# Premier League Data Analysis

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("premier_league_data.csv")

The dataset comes from the fantasy football game _Biwenger_ and contains information about all Premier League players in the 2024-2025 season.  
The loaded file has **505 players** and **9 columns** (per the `df.info()` output), capturing both real-life performance and market value metrics.

## Feature Glossary  
- **Equipo**: The club the player belongs to.  
- **Jugador**: Name of the player (not used directly in models, only as identifier).  
- **Posición**: Role on the pitch (e.g., goalkeeper, defender, midfielder, forward).  
- **Puntos**: Total points scored by the player (performance measure).  
- **Precio**: Market price of the player in the fantasy game, driven by supply and demand: the price rises when many users buy the player and falls when many sell.  
- **PJ (Partidos Jugados)**: Number of matches played.  
- **Casa**: Number of matches played at home.  
- **Fuera**: Number of matches played away.  
- **Media**: Average points per match.
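
The glossary implies two relationships that can be checked directly: `Casa + Fuera` should equal `PJ`, and `Media` should be close to `Puntos / PJ`. The sketch below tests both on the five rows shown by `df.head()`, typed in by hand so the snippet runs without the CSV. The helper name `check_glossary`, the tolerance of 0.01 (chosen because `Media` appears rounded to two decimals), and the treatment of players with `PJ == 0` as having `Media` 0 are assumptions, not something the dataset documents.

```python
import pandas as pd

# The five rows displayed by df.head(), entered by hand for a self-contained example.
sample = pd.DataFrame({
    "Jugador": ["Salah", "Cole Palmer", "Mbeumo", "Bruno Fernandes", "Matheus Cunha"],
    "Puntos": [416, 340, 321, 309, 299],
    "PJ": [38, 37, 38, 36, 33],
    "Casa": [19, 19, 19, 18, 16],
    "Fuera": [19, 18, 19, 18, 17],
    "Media": [10.95, 9.19, 8.45, 8.58, 9.06],
})


def check_glossary(df, tol=0.01):
    """Return two boolean Series: Casa + Fuera == PJ, and Media ~= Puntos / PJ."""
    matches_ok = (df["Casa"] + df["Fuera"]) == df["PJ"]
    played = df["PJ"] > 0
    # Assumption: players with no matches are expected to have Media == 0.
    expected_media = (df["Puntos"] / df["PJ"].where(played)).fillna(0.0)
    media_ok = (df["Media"] - expected_media).abs() <= tol
    return matches_ok, media_ok


matches_ok, media_ok = check_glossary(sample)
print(matches_ok.all(), media_ok.all())
```

On the full data, the same call with `df` in place of `sample` would flag any rows where the glossary definitions do not hold.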

In [3]:
df.head()

|   | Equipo | Jugador | Posición | Puntos | Precio | PJ | Casa | Fuera | Media |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Liverpool | Salah | Delantero | 416 | 18120000 | 38 | 19 | 19 | 10.95 |
| 1 | Chelsea | Cole Palmer | Centrocampista | 340 | 17520000 | 37 | 19 | 18 | 9.19 |
| 2 | Brentford FC | Mbeumo | Delantero | 321 | 15670000 | 38 | 19 | 19 | 8.45 |
| 3 | Manchester United | Bruno Fernandes | Centrocampista | 309 | 15490000 | 36 | 18 | 18 | 8.58 |
| 4 | Wolverhampton | Matheus Cunha | Delantero | 299 | 14050000 | 33 | 16 | 17 | 9.06 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 505 entries, 0 to 504
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Equipo    505 non-null    object 
 1   Jugador   505 non-null    object 
 2   Posición  505 non-null    object 
 3   Puntos    505 non-null    int64  
 4   Precio    505 non-null    int64  
 5   PJ        505 non-null    int64  
 6   Casa      505 non-null    int64  
 7   Fuera     505 non-null    int64  
 8   Media     505 non-null    float64
dtypes: float64(1), int64(5), object(3)
memory usage: 35.6+ KB


In [5]:
df.describe()

|   | Puntos | Precio | PJ | Casa | Fuera | Media |
|---|---|---|---|---|---|---|
| count | 505.0 | 505.0 | 505.0 | 505.0 | 505.0 | 505.0 |
| mean | 91.586139 | 2798990.0 | 18.833663 | 9.441584 | 9.392079 | 3.661525 |
| std | 80.605872 | 2925560.0 | 13.552809 | 6.86354 | 6.809244 | 2.16614 |
| min | 0.0 | 290000.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 13.0 | 800000.0 | 4.0 | 2.0 | 2.0 | 2.52 |
| 50% | 80.0 | 1900000.0 | 20.0 | 10.0 | 10.0 | 4.04 |
| 75% | 153.0 | 3510000.0 | 32.0 | 16.0 | 16.0 | 5.08 |
| max | 416.0 | 18120000.0 | 38.0 | 19.0 | 19.0 | 10.95 |


## Research Questions and Methods

### Question 1  
**Is it possible to estimate a player’s price based on their position, team, and performance (points)?**  
- **Possible features**: `Posición`, `Equipo`, `Puntos`, `PJ`.  
- **Target**: `Precio` (continuous).  
- **Candidate algorithms**:  
  - Linear Regression (baseline).  
  - Random Forest Regressor (if nonlinear effects are relevant).  
- **Useful insights / conclusions**:  
  - Helps understand which factors most influence player prices.  
  - Can highlight if price is mostly performance-driven or influenced by team/position.  
  - May help anticipate price movements before a gameweek.  
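
A minimal sketch of the two regressors for Question 1, assuming scikit-learn is available (the notebook itself only imports pandas). The categorical columns are one-hot encoded and the numeric ones passed through unchanged. The team column is referenced as `Equipo`, the name shown by `df.info()`. The names `CATEGORICAL`, `NUMERIC`, and `build_price_model` are illustrative, and the data is just the five rows from `df.head()`, so the fitted numbers mean nothing beyond showing that the pipeline runs.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

CATEGORICAL = ["Posición", "Equipo"]
NUMERIC = ["Puntos", "PJ"]


def build_price_model(regressor):
    """One-hot encode the categorical columns, pass numeric ones through, then regress."""
    encode = ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL)],
        remainder="passthrough",
    )
    return make_pipeline(encode, regressor)


# The five rows displayed by df.head(); replace with the full df in practice.
sample = pd.DataFrame({
    "Equipo": ["Liverpool", "Chelsea", "Brentford FC", "Manchester United", "Wolverhampton"],
    "Posición": ["Delantero", "Centrocampista", "Delantero", "Centrocampista", "Delantero"],
    "Puntos": [416, 340, 321, 309, 299],
    "PJ": [38, 37, 38, 36, 33],
    "Precio": [18120000, 17520000, 15670000, 15490000, 14050000],
})

X = sample[CATEGORICAL + NUMERIC]
y = sample["Precio"]

models = {
    "linear": build_price_model(LinearRegression()),
    "forest": build_price_model(RandomForestRegressor(n_estimators=100, random_state=0)),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)
    # Training-set predictions only; on the full data, evaluate on held-out rows.
    predictions[name] = model.predict(X)
```

With all 505 rows, a train/test split or cross-validation would be needed before comparing the two models; fitting and predicting on the same rows, as done here, overstates accuracy.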

---

### Question 2  
**Is it possible to discover natural clusters of players without using position labels?**  
- **Possible features**: `Precio`, `Puntos`, `Media`, `PJ`.  
- **Target**: None (unsupervised).  
- **Candidate algorithms**:  
  - K-Means clustering (with feature scaling).  
    - Use the **elbow method** to determine the optimal number of clusters.  
  - Hierarchical clustering (suitable for small datasets).  
- **Useful insights / conclusions**:  
  - Can reveal hidden groupings of players based on performance and price.  
  - Helps identify under- or over-valued players relative to their cluster.  
  - Can inform transfer or selection strategies in fantasy football.  
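
The scaling and elbow steps can be sketched as follows, again assuming scikit-learn. The features are standardized so that `Precio` (in the millions) does not dominate the distance computation, and K-Means is then fitted for several values of k while recording the inertia (within-cluster sum of squares). The helper name `elbow_inertias` and the choices `n_init=10` and `random_state=0` are illustrative. The data is the five rows from `df.head()`, which limits k to at most 4 here.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

FEATURES = ["Precio", "Puntos", "Media", "PJ"]


def elbow_inertias(df, k_values):
    """Standardize FEATURES and return {k: inertia} for each k in k_values."""
    scaled = StandardScaler().fit_transform(df[FEATURES])
    inertias = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
        inertias[k] = km.inertia_
    return inertias


# The five rows displayed by df.head(); replace with the full df in practice.
sample = pd.DataFrame({
    "Precio": [18120000, 17520000, 15670000, 15490000, 14050000],
    "Puntos": [416, 340, 321, 309, 299],
    "Media": [10.95, 9.19, 8.45, 8.58, 9.06],
    "PJ": [38, 37, 38, 36, 33],
})

inertias = elbow_inertias(sample, k_values=range(1, 5))
for k, inertia in inertias.items():
    print(k, round(inertia, 3))
```

On the full dataset, a wider range such as `range(1, 11)` could be used, and the "elbow" is the k after which the inertia stops dropping sharply when plotted against k.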

---

### Question 3  
**Is price (with other features, excluding points) a good indicator of whether a player performs above or below average?**  
- **Possible features**: `Precio`, `Equipo`, `PJ`, `Casa`, `Fuera` (excluding `Puntos`).  
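- **Target**: Binary label = 1 if the player's `Puntos` is above the mean of `Puntos` across all players, else 0 (a row-wise `Puntos > Media` test would compare a season total with a per-match average).  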
- **Candidate algorithms**:  
  - Logistic Regression (to assess statistical significance).  
  - Random Forest Classifier (to capture nonlinear relationships).  
- **Useful insights / conclusions**:  
  - Determines if fantasy users’ buying/selling behavior (reflected in price) predicts actual performance.  
  - Can reveal if high-priced players are truly better performers or just popular.  
  - Helps validate the efficiency of the fantasy market.  
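
A minimal sketch of the classification setup, assuming scikit-learn and assuming that "above average" means a player's `Puntos` is above the mean of `Puntos` over all players in the frame (rather than a row-wise comparison with the per-match `Media` column). Only the numeric features are used to keep the example short; `Equipo` could be added with one-hot encoding. The names `NUMERIC` and `make_target` are illustrative, and the data is the five rows from `df.head()`, so the mean used here is the mean of those five rows, not of the full dataset.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

NUMERIC = ["Precio", "PJ", "Casa", "Fuera"]


def make_target(df):
    """Assumed label: 1 if Puntos is above the mean of Puntos in df, else 0."""
    return (df["Puntos"] > df["Puntos"].mean()).astype(int)


# The five rows displayed by df.head(); replace with the full df in practice.
sample = pd.DataFrame({
    "Puntos": [416, 340, 321, 309, 299],
    "Precio": [18120000, 17520000, 15670000, 15490000, 14050000],
    "PJ": [38, 37, 38, 36, 33],
    "Casa": [19, 19, 19, 18, 16],
    "Fuera": [19, 18, 19, 18, 17],
})

y = make_target(sample)

# Puntos is used only to build the label and is excluded from the features.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(sample[NUMERIC], y)
predicted = clf.predict(sample[NUMERIC])
```

Swapping `LogisticRegression()` for `RandomForestClassifier(random_state=0)` from `sklearn.ensemble` gives the nonlinear variant. Note that `PJ` equals `Casa + Fuera` in the rows shown, so the three columns are collinear, which matters when interpreting logistic regression coefficients.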
