In [2]:
import pandas as pd

<h1 style="text-align:center; font-size:28px;"> PLAYER PERFORMANCE MODEL </h1>

The player performance prediction model aims to predict a specific aspect of a player's performance in an NFL game. In this case, we chose to predict the rushing yards for running backs (RBs). Here's a detailed breakdown of how the model works and the data it uses:

### Data Used
- **Dataset**: The model utilizes data from "PlayerStats.csv", which contains detailed statistics for individual NFL players.
- **Target Variable**: The model predicts 'rushing_yards', which is a numerical variable indicating the total rushing yards gained by a player in a game.
- **Feature**: For simplicity, the model currently uses only one feature, 'avg_rushing_yards', which is each player's mean rushing yards per game, computed over that player's games in the dataset.

### Model Workflow
1. **Data Filtering**: 
   - The dataset is filtered to include only running backs (RBs), as the rushing yards are most relevant to this position.

2. **Feature Engineering**:
   - **Average Rushing Yards**: For each player, their average rushing yards per game is calculated. This is done by grouping the data by player name and then computing the mean of the 'rushing_yards' for each player. This feature is assumed to be indicative of a player's typical performance.

3. **Data Preparation**:
   - The feature ('avg_rushing_yards') and the target variable ('rushing_yards') are extracted from the dataset.
   - The data is then split into training and testing sets (an 80-20 split, with a fixed random seed for reproducibility) to train the model and evaluate its performance on unseen data.

4. **Model Selection and Training**:
   - A **Random Forest Regressor** is chosen. Random Forest is a versatile and robust machine learning algorithm that can handle nonlinear relationships and interactions between features. It works by building multiple decision trees on bootstrapped samples of the training data and averaging their predictions.
   - The model is trained on the training set using the 'avg_rushing_yards' as the input feature and 'rushing_yards' as the output to predict.

5. **Model Prediction and Evaluation**:
   - The model makes predictions on the test set.
   - The performance is evaluated using the Mean Squared Error (MSE), the average squared difference between the predicted and actual values. Because the errors are squared, MSE is expressed in squared yards and penalizes large misses more heavily. A lower MSE indicates a better fit of the model to the data.
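
As a concrete illustration of the metric, the sketch below computes MSE by hand and checks it against `sklearn.metrics.mean_squared_error`. The yardage values are made up for the example and are not taken from PlayerStats.csv.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted rushing yards for four games (illustrative values only)
y_true = np.array([50.0, 80.0, 20.0, 100.0])
y_hat = np.array([60.0, 70.0, 30.0, 90.0])

# MSE is the mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)

# RMSE is in the same units as the target (yards), which is often easier to interpret
rmse = np.sqrt(mse_manual)

print(mse_manual, mse_sklearn, rmse)
```

Every prediction here is off by 10 yards, so each squared error is 100 and the RMSE comes back to 10 yards.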

### Assumptions and Limitations
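- The model assumes that a player's historical average rushing yards is a strong indicator of their future performance. While this is a reasonable starting point, player performance can be influenced by many other factors. As implemented, the average is also computed over every game in the dataset, including the game being predicted, so the feature partly contains the target and the reported MSE is likely optimistic.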
- Currently, the model only uses one feature. Including more features like the opposing team's defense quality, player fitness, recent form, etc., could significantly enhance the model's accuracy.
- The model does not account for team dynamics or specific game contexts, which can also impact player performance.
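
One way to make the "historical average" assumption literal is to compute each player's average from earlier games only, so that the game being predicted never contributes to its own feature. The sketch below uses a small made-up table; the `week` column, the `prior_avg_rushing_yards` name, and the values are assumptions for illustration and may not match the columns in PlayerStats.csv.

```python
import pandas as pd

# Toy data; 'week' is an assumed column that orders each player's games
toy = pd.DataFrame({
    "player_display_name": ["A", "A", "A", "B", "B"],
    "week": [1, 2, 3, 1, 2],
    "rushing_yards": [40, 60, 80, 100, 50],
})

# Order games chronologically within each player
toy = toy.sort_values(["player_display_name", "week"]).reset_index(drop=True)

# Expanding mean of prior games only: shift(1) removes the current game
# from its own average before the running mean is taken
toy["prior_avg_rushing_yards"] = (
    toy.groupby("player_display_name")["rushing_yards"]
       .transform(lambda s: s.shift(1).expanding().mean())
)

print(toy)
```

A player's first game has no history, so the feature is NaN there; those rows would need to be dropped or imputed before training.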

### Potential Improvements
- **Incorporate More Features**: Adding more relevant features can help the model capture the complexity of player performance better.
- **Model Complexity and Tuning**: Experimenting with different models or tuning the hyperparameters of the Random Forest could yield better results.
- **Cross-Validation**: Implementing cross-validation would help in better assessing the model's performance and generalizability.
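
A minimal sketch of the cross-validation idea, using scikit-learn's `cross_val_score` on synthetic data that stands in for the 'avg_rushing_yards' feature and the 'rushing_yards' target. The data-generating numbers and the names `X_demo` and `y_demo` are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)

# Synthetic stand-in: one feature and a noisy target
X_demo = rng.uniform(20, 100, size=(200, 1))
y_demo = X_demo[:, 0] + rng.normal(0, 25, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn scorers follow "higher is better", so MSE is returned negated
neg_mse = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error")
cv_mse = -neg_mse

print("MSE per fold:", cv_mse)
print("Mean MSE:", cv_mse.mean())
```

With per-game rows, the same player can appear in both the training and validation folds; grouping folds by player (for example with `GroupKFold`) is one option if that matters for the evaluation.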

In summary, this model is a basic approach to predict player performance, focusing on a single key statistic. It serves as a foundation upon which more complex and accurate predictive models can be built by incorporating a broader range of features and employing more advanced modeling techniques.

In [3]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load the PlayerStats.csv dataset
player_stats_df = pd.read_csv('data/PlayerStats.csv')

# For demonstration, predict the rushing yards for running backs (RB)
# Filter dataset for RBs; .copy() creates an independent DataFrame so the
# column assignment below does not trigger SettingWithCopyWarning
rb_stats = player_stats_df[player_stats_df['position'] == 'RB'].copy()

# Feature Engineering
# Average rushing yards per game for each player, broadcast back to every row
rb_stats['avg_rushing_yards'] = (
    rb_stats.groupby('player_display_name')['rushing_yards'].transform('mean')
)

# Selecting features and target
features = ['avg_rushing_yards']
X = rb_stats[features]
y = rb_stats['rushing_yards']

# Splitting the dataset into training and testing sets (80-20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model Training using Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Making predictions on the test set
y_pred = rf_model.predict(X_test)

# Evaluating the model
mse = mean_squared_error(y_test, y_pred)

# Output the model's performance
print("Mean Squared Error:", mse)

# Note: Adjust the file path in the pd.read_csv() function as per your file location.

<h1 style="text-align:center; font-size:28px;"> GAME MODEL </h1>

The game outcome prediction model is designed to predict the result of an NFL game, specifically whether the home team wins or not. This model employs logistic regression, a common method for binary classification tasks. Here's a detailed explanation of how this model works and the data it uses:

### Data Used
- **Dataset**: The model uses data from "Games.csv", which contains detailed information about various NFL games.
- **Target Variable**: The model predicts a binary outcome (`home_win`), indicating whether the home team wins (1) or not (0).
- **Features**: In this implementation, two basic features are used: the average score of the home team in its home games (`home_team_avg_score`) and the average score of the away team in its away games (`away_team_avg_score`), each computed across all games in the dataset.

### Model Workflow
1. **Feature Engineering**:
   - **Average Scores Calculation**: The average scores for each team when playing at home and away are computed. This is done by grouping the games by `home_team` and averaging `home_score`, then grouping by `away_team` and averaging `away_score`.
   - **New Features**: These average scores are then mapped to each game in the dataset, creating new features that represent the typical scoring performance of the home and away teams.

2. **Data Preparation**:
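   - The new features (`home_team_avg_score` and `away_team_avg_score`) and the target variable (`home_win`) are extracted from the dataset.
   - The dataset is split into training and testing sets (80-20) to evaluate the model's performance on unseen data.
   - The features are standardized with `StandardScaler`, which is fit on the training set only and then applied to the test set.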

3. **Model Selection and Training**:
   - A **Logistic Regression** model is used. Logistic regression is suitable for binary classification tasks and predicts the probability of each class.
   - The model is trained on the training set using the two features as input.

4. **Model Prediction and Evaluation**:
   - The model makes predictions on the test set.
   - The performance is evaluated using accuracy (the proportion of correctly predicted outcomes) and a classification report, which includes precision, recall, and F1-score for each class.
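
To illustrate the probability output mentioned in step 3 and the evaluation in step 4, the sketch below fits a logistic regression on synthetic data. The feature values, the assumed link between scoring edge and winning, and all variable names are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for home_team_avg_score and away_team_avg_score
home_avg = rng.normal(24, 4, size=400)
away_avg = rng.normal(21, 4, size=400)
X_demo = np.column_stack([home_avg, away_avg])

# Assumed relationship: a larger scoring edge makes a home win more likely
p_home = 1 / (1 + np.exp(-(home_avg - away_avg) / 4))
y_demo = (rng.uniform(size=400) < p_home).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline keeps scaling and model fitting together,
# so the scaler is only ever fit on training data
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)  # one column per class, ordered as in clf.classes_
pred = clf.predict(X_te)         # class with the higher predicted probability

print("Accuracy:", accuracy_score(y_te, pred))
print(confusion_matrix(y_te, pred))
```

Each row of `proba` sums to 1, and `predict` returns the class with the higher probability, which for two classes is equivalent to a 0.5 threshold on the home-win probability.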

### Assumptions and Limitations
- The model assumes that the average scoring performance of a team is a significant predictor of winning a game. This is a simplified assumption, as game outcomes can be influenced by numerous factors.
- The model only uses two features, which might not capture the complexity of factors affecting game outcomes (e.g., player injuries, weather conditions, team form).
- Logistic regression is a linear model and may not capture complex relationships as effectively as more sophisticated models.

### Potential Improvements
- **Incorporating More Features**: Including additional features such as head-to-head statistics, player availability, weather conditions, etc., could improve the model's predictive power.
- **Advanced Models**: More complex models such as Random Forests or Gradient Boosting Machines can handle non-linear relationships better and are worth exploring.
- **Hyperparameter Tuning**: scikit-learn's `LogisticRegression` applies L2 regularization by default, so tuning its strength (the `C` parameter) or trying a different penalty may improve performance.
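
A hedged sketch of the tuning idea: `GridSearchCV` over the regularization strength `C` of `LogisticRegression`. Synthetic data from `make_classification` stands in for the two game features, and the grid values are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two informative features, mirroring the two-feature setup (synthetic data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Smaller C means stronger L2 regularization
param_grid = {"logreg__C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validated search, scored by accuracy
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best C:", search.best_params_["logreg__C"])
print("Best CV accuracy:", search.best_score_)
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the cross-validated scores are not influenced by the held-out fold.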

In essence, this game outcome prediction model serves as a foundational approach, demonstrating how logistic regression can be applied to predict binary outcomes in sports analytics. Improving its predictive accuracy will require richer features than two team-level scoring averages.

In [None]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler

# Load the dataset
games_df = pd.read_csv('/path/to/Games.csv')

# Feature Engineering
# Calculate average scores for each team at home and away
avg_home_scores = games_df.groupby('home_team')['home_score'].mean().to_dict()
avg_away_scores = games_df.groupby('away_team')['away_score'].mean().to_dict()

# Create new features in the dataframe for average scores
games_df['home_team_avg_score'] = games_df['home_team'].map(avg_home_scores)
games_df['away_team_avg_score'] = games_df['away_team'].map(avg_away_scores)

# Target variable - Home team wins (1) or not (0)
games_df['home_win'] = (games_df['home_score'] > games_df['away_score']).astype(int)

# Selecting features and target for the model
features = ['home_team_avg_score', 'away_team_avg_score']
X = games_df[features]
y = games_df['home_win']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardizing the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Model Training using Logistic Regression
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)

# Making predictions on the test set
y_pred = logistic_model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

# Output the model's performance
print("Accuracy:", accuracy)
print("Classification Report:")
print(report)

# Note: Adjust the file path in the pd.read_csv() function as per your file location.
