In [7]:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import sqlite3

In [5]:
# Load data from CSV files in data/ directory

teams_df = pd.read_csv('data/Teams_OCT_10_2025.csv')
games_df = pd.read_csv('data/Games_OCT_10_2025.csv')
player_stats_df = pd.read_csv('data/all_passing_rushing_receiving.csv')
rosters_df = pd.read_csv('data/Rosters_OCT_10_2025.csv')

# Save to root CSVs for compatibility with rest of notebook
teams_df.to_csv('Teams.csv', index=False)
games_df.to_csv('Games.csv', index=False)
player_stats_df.to_csv('PlayerStats.csv', index=False)
rosters_df.to_csv('Rosters.csv', index=False)

print("Loaded data from data/ directory and saved to root CSVs")
print(f"Player stats shape: {player_stats_df.shape}")
print(f"Available columns: {list(player_stats_df.columns)}")

Loaded data from data/ directory and saved to root CSVs
Player stats shape: (45316, 27)
Available columns: ['player', 'player_id', 'team', 'pass_cmp', 'pass_att', 'pass_yds', 'pass_td', 'pass_int', 'pass_sacked', 'pass_sacked_yds', 'pass_long', 'pass_rating', 'rush_att', 'rush_yds', 'rush_td', 'rush_long', 'targets', 'rec', 'rec_yds', 'rec_td', 'rec_long', 'fumbles', 'fumbles_lost', 'game_id', 'opponent_team', 'home', 'position']


# XGBoost

XGBoost is an implementation of gradient-boosted decision trees, and XGBoost models dominate many Kaggle competitions.

In this algorithm, decision trees are built sequentially. Each new tree is fit to the errors (more precisely, the gradient of the loss) left by the trees before it, so the examples the current ensemble predicts badly have the most influence on the next tree. The individual trees are weak predictors on their own; their outputs are summed to give a stronger, more precise model. XGBoost can handle regression, classification, ranking, and user-defined prediction problems.

##### https://www.geeksforgeeks.org/xgboost/
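
To make the sequential idea concrete, here is a minimal sketch of boosting for squared-error regression, using one-split "stumps" on a single feature. It is a teaching toy rather than XGBoost's actual algorithm (which also uses regularization and second-order gradient information), and the data and the helper names `fit_stump` and `boost` are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the remaining errors."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the residual is the negative gradient of the loss
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(stump(x) for stump in stumps)


def mean_squared_error(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 1, 1, 3, 3, 4]

baseline = lambda x: sum(ys) / len(ys)  # predict the mean everywhere
model = boost(xs, ys)

print(f"Baseline training MSE: {mean_squared_error(baseline, xs, ys):.3f}")
print(f"Boosted training MSE:  {mean_squared_error(model, xs, ys):.3f}")
```

Each round only has to correct what the previous rounds got wrong, which is why many shallow trees can add up to an accurate model.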


### Bayesian Inference
When I talk about inference models, I’m usually talking about Bayesian inference. Bayesian inference allows us to use prior information to estimate our target. It’s very rare that you have literally no clue about what you’re trying to estimate. Bayesian inference allows us to create a “weakly informative” prior. I know that a player’s passing touchdowns per game will be positive. I know it will be less than something absurd, like 20 touchdowns per game. I can use this sort of information when crafting a model.

Let’s say we wanted to estimate Lamar Jackson’s passing touchdown output in week 3 of the 2019 season. The statistics classes that I took growing up didn’t have a great approach to this problem. One method might be to average his production in weeks one and two and use that as a guess. Or, we could use last season’s average. It’s clear that both of these approaches are flawed. In weeks one and two, he averaged 3.5 passing touchdowns. As those familiar with football know, 3.5 passing touchdowns per game is completely unprecedented in a low-volume passing offense. He also played a very weak Miami Dolphins team, which inflated his numbers. Last year, he averaged 0.86 touchdowns per game in the regular season. Even with a bigger sample size, that’s also unsatisfactory. He’s young, and he’s probably made some sort of improvement during the offseason. The team also committed to building around his skillset with personnel.

It’s easy to say his passing touchdown output going forward would be somewhere between 0.86 and 3.5. It’s not easy to say exactly where between 0.86 and 3.5 it would settle. That’s where Bayesian inference can help us.
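
One standard way to blend the two numbers is a Gamma-Poisson conjugate update, sketched below. This is an illustration and not necessarily the model the source article uses. The prior is centered on last season's 0.86 touchdowns per game, and the choice to weight that prior like 16 games of evidence (`prior_games=16`) is an assumption; the observed data are the 7 touchdowns implied by a 3.5 average over two games.

```python
def gamma_poisson_update(prior_mean, prior_games, observed_total, observed_games):
    """Posterior mean rate for a Poisson count with a Gamma prior.

    The prior is Gamma(alpha, beta) with alpha = prior_mean * prior_games and
    beta = prior_games, so it has mean `prior_mean` and carries the weight of
    `prior_games` games of evidence.
    """
    alpha = prior_mean * prior_games
    beta = prior_games
    # Conjugate update: observed touchdowns add to alpha, observed games add to beta
    posterior_alpha = alpha + observed_total
    posterior_beta = beta + observed_games
    return posterior_alpha / posterior_beta


posterior_mean = gamma_poisson_update(
    prior_mean=0.86,    # last season's touchdowns per game
    prior_games=16,     # assumed weight of the prior
    observed_total=7,   # 3.5 per game over weeks one and two
    observed_games=2,
)
print(f"Posterior mean passing TDs per game: {posterior_mean:.2f}")
```

With these inputs the estimate comes out at about 1.15 touchdowns per game, between the two naive guesses and closer to the larger sample. A smaller `prior_games` would let the two hot weeks pull the estimate up faster.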

### Poisson Distributions
Poisson distributions are ideal for our touchdown prop estimation. If you are unfamiliar with them, you can think of them as a good and simple way to model counts over a fixed period of time. For example, let’s say I drink around 0.86 coffees per day. My daily coffee intake could then be modeled by a Poisson distribution with mean 0.86. Unlike with a Normal distribution, I don’t need to know the standard deviation or any other parameter, because a Poisson distribution’s variance is equal to its mean.
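
As a quick illustration, the probabilities for a Poisson count with mean 0.86 can be computed directly from the formula P(k) = e^(-mean) * mean^k / k!. The helper name `poisson_pmf` is just for this sketch.

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing exactly k events when the expected count is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


mean = 0.86
for k in range(4):
    print(f"P({k} events) = {poisson_pmf(k, mean):.3f}")

# Probability of at least one event is the complement of observing zero
p_at_least_one = 1 - poisson_pmf(0, mean)
print(f"P(at least one) = {p_at_least_one:.3f}")
```

For a touchdown prop, `p_at_least_one` is the number to compare against the odds on an "anytime touchdown" style bet; with a mean of 0.86 it is roughly 58%.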

https://towardsdatascience.com/create-your-own-nfl-touchdown-props-with-python-b3896f19a588


### https://www.reddit.com/r/algobetting/comments/1di9jvh/which_machine_learning_model_for_my_use_case/

No papers or anything I'd suggest, but fundamentally what you want to end up with is the probability you win your bet, that's what your target needs to be and what the model needs to output. So if your bet is on event A happening, then your model needs to be able to produce a probability for event A happening.

u/playful_match_9556 This is the key. Took me a bit to figure out but projecting accurate probabilities is the best thing you can do, then find lines where you're +EV
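
To make "+EV" concrete, here is a small sketch that converts American odds to an implied probability and computes the expected profit of a bet given the model's probability. The function names and the example numbers (a 45% model probability against +150 odds) are hypothetical.

```python
def american_to_decimal(odds):
    """Convert American odds (e.g. +150 or -110) to decimal odds."""
    if odds > 0:
        return 1 + odds / 100
    return 1 + 100 / abs(odds)


def implied_probability(odds):
    """Break-even win probability implied by the odds (includes the book's margin)."""
    return 1 / american_to_decimal(odds)


def expected_value(model_probability, odds, stake=1.0):
    """Expected profit of a bet, given the model's win probability."""
    profit_if_win = stake * (american_to_decimal(odds) - 1)
    return model_probability * profit_if_win - (1 - model_probability) * stake


# Hypothetical example: the model says 45%, the book offers +150
odds = 150
model_probability = 0.45
ev = expected_value(model_probability, odds)
print(f"Implied probability: {implied_probability(odds):.3f}")
print(f"Expected value per unit staked: {ev:+.3f}")
```

Here the line implies a 40% chance, the model says 45%, and the bet is worth +0.125 units per unit staked. The same 45% against -110 odds would be a losing bet, which is why the model's probabilities need to be accurate rather than just on the right side of 50%.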

### Example Process for Estimating Player Data (w/o Bayesian/Poisson):
Let’s walk through how you might estimate these inputs:

1. **CARRIES**: Look at the player’s average carries over the past 5 games, adjust for the opponent’s run defense strength.
   - **Estimate**: 10 carries.

2. **RUSHING YARDS**: Combine the player’s average yards per carry with the estimated number of carries and adjust for the opponent's defensive strength.
   - **Estimate**: 50 yards.

3. **RECEPTIONS**: Consider the player’s recent targets, their role in the passing game, and how often the opposing defense allows receptions to similar players.
   - **Estimate**: 3 receptions.

4. **TARGETS**: Look at the player’s target share in recent games and adjust based on expected game flow.
   - **Estimate**: 5 targets.

5. **RECEIVING YARDS**: Use the player’s average yards per reception, multiplied by the estimated number of receptions.
   - **Estimate**: 25 yards.
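
The "adjust for the opponent" step in the list above could be sketched as a simple ratio: scale the player's average by how much the opponent allows relative to the league. Everything here is an assumption for illustration, including the function name and the per-game figures; only the 10-carry and 50-yard estimates come from the list.

```python
def adjust_for_opponent(player_average, opponent_allowed_per_game, league_average_allowed):
    """Scale a player's average by how generous the opponent is relative to the league."""
    return player_average * (opponent_allowed_per_game / league_average_allowed)


# Hypothetical inputs: 10 estimated carries at 4.0 yards per carry,
# against a defense allowing 125 rushing yards per game in a league averaging 100
estimated_carries = 10
yards_per_carry = 4.0
unadjusted_yards = estimated_carries * yards_per_carry

rushing_yards_estimate = adjust_for_opponent(
    unadjusted_yards,
    opponent_allowed_per_game=125,
    league_average_allowed=100,
)
print(f"Estimated rushing yards: {rushing_yards_estimate:.0f}")
```

A ratio above 1 means a softer-than-average defense and pushes the estimate up; a ratio below 1 pulls it down.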

### Automating the Estimation Process:
If you have historical data, you can create a simple function or model to automate these estimates. For example:

```python
def estimate_player_stats(df, player_id, opponent_team=None, recent_games=5):
    """Average a player's stat lines over their most recent games.

    Assumes `df` is sorted chronologically and uses the PlayerStats.csv column
    names. Pass `opponent_team` to restrict the history to games against that team.
    """
    stat_cols = ['rush_att', 'rush_yds', 'rec', 'targets', 'rec_yds']

    # Filter once instead of repeating the same mask for every statistic
    history = df[df['player_id'] == player_id]
    if opponent_team is not None:
        history = history[history['opponent_team'] == opponent_team]

    # Mean of each stat over the last `recent_games` rows, as a plain dict
    return history.tail(recent_games)[stat_cols].mean().to_dict()

# Example usage (the player ID is a placeholder):
player_data = estimate_player_stats(df, player_id=123)
```

### Final Prediction:
Once you have estimated these values, you can use them as inputs to your trained model:

```python
# The model expects the same columns, in the same order, that it was trained on
feature_order = ['rush_att', 'rush_yds', 'rec', 'targets', 'rec_yds', 'opponent_defense_strength']
player_data['opponent_defense_strength'] = 0.6  # placeholder matchup value
player_df = pd.DataFrame([player_data])[feature_order]
touchdown_probability = xgb_model.predict_proba(player_df)[:, 1]
print(f"Probability of the player scoring a touchdown: {touchdown_probability[0]:.4f}")
```

This process allows you to generate reasonable estimates for the inputs, enabling the model to provide predictions even when actual game data isn’t available yet.

---

# Feature Engineering

***Creating your target and performing a train-test split should always be your first two steps, no matter what kind of machine learning model you may be running.***

In [8]:
# Create defense power rankings

# Step 1: Extract Data from the Database
conn = sqlite3.connect('nfl.db')

query = """
SELECT team_name, SUM(touchdowns_allowed) AS total_touchdowns_allowed, SUM(yards_allowed) AS total_yards_allowed
FROM TeamStats
GROUP BY team_name
"""

team_defense_stats = pd.read_sql_query(query, conn)
conn.close()

# Step 2: Calculate Defense Strength
team_defense_stats['defense_strength'] = (
    0.5 * team_defense_stats['total_touchdowns_allowed'] +
    0.5 * team_defense_stats['total_yards_allowed']
)

# Step 3: Normalize Defense Strength
team_defense_stats['defense_strength'] = (
    (team_defense_stats['defense_strength'] - team_defense_stats['defense_strength'].min()) /
    (team_defense_stats['defense_strength'].max() - team_defense_stats['defense_strength'].min())
)

# No inversion is needed: after min-max scaling, the defense that allowed the least
# already scores 0, so lower values indicate stronger defense

# Step 4: Save to CSV
output_df = team_defense_stats[['team_name', 'defense_strength']]
output_df.to_csv('defense_strength.csv', index=False)

print("Opponent defense strength values saved to 'defense_strength.csv'")


DatabaseError: Execution failed on sql '
SELECT team_name, SUM(touchdowns_allowed) AS total_touchdowns_allowed, SUM(yards_allowed) AS total_yards_allowed
FROM TeamStats
GROUP BY team_name
': no such table: TeamStats

In [9]:
#v2 w/ weights

import pandas as pd

# Load PlayerStats from CSV
player_stats_df = pd.read_csv('PlayerStats.csv')

# This dataset has no season column, so every game is used for the defense calculation
print("Available columns:", player_stats_df.columns.tolist())
filtered_df = player_stats_df.copy()

# Calculate team defense stats using actual column names
team_defense_stats = filtered_df.groupby('opponent_team').agg({
    'rush_td': 'sum',
    'pass_td': 'sum', 
    'rec_td': 'sum',
    'rush_yds': 'sum',
    'rec_yds': 'sum'
}).reset_index()

# Calculate total touchdowns and yards allowed.
# rec_td is the receiver's credit for the same score as the passer's pass_td,
# so pass_td is left out of the sum to avoid counting passing touchdowns twice.
team_defense_stats['total_touchdowns_allowed'] = team_defense_stats['rush_td'] + team_defense_stats['rec_td']
team_defense_stats['total_yards_allowed'] = team_defense_stats['rush_yds'] + team_defense_stats['rec_yds']

# Keep only needed columns and rename for consistency
team_defense_stats = team_defense_stats[['opponent_team', 'total_touchdowns_allowed', 'total_yards_allowed']].rename(columns={'opponent_team': 'team_name'})

# Step 2: Normalize and Weight Metrics

# Assign weights to touchdowns and yards allowed
weights = {
    'total_touchdowns_allowed': 0.5,
    'total_yards_allowed': 0.5
}

# Min-max normalize each metric first so the much larger yardage totals
# do not swamp the touchdown counts in the weighted sum
metrics = team_defense_stats[list(weights)]
normalized = (metrics - metrics.min()) / (metrics.max() - metrics.min())

# Calculate a composite defensive strength index from the normalized metrics
team_defense_stats['composite_defense_strength'] = sum(
    weight * normalized[column] for column, weight in weights.items()
)

# Rescale the composite score to [0, 1] (min-max normalization).
# No inversion: the defense that allowed the least scores 0,
# so lower scores indicate stronger defenses.
composite = team_defense_stats['composite_defense_strength']
team_defense_stats['defense_strength'] = (composite - composite.min()) / (composite.max() - composite.min())

# Step 3: Save to CSV
output_df = team_defense_stats[['team_name', 'defense_strength']]
output_df.to_csv('defense_strength.csv', index=False)

print("Opponent defense strength values saved to 'defense_strength.csv'")

# Optionally, open the file (`open` is a macOS command)
!open defense_strength.csv

Available columns: ['player', 'player_id', 'team', 'pass_cmp', 'pass_att', 'pass_yds', 'pass_td', 'pass_int', 'pass_sacked', 'pass_sacked_yds', 'pass_long', 'pass_rating', 'rush_att', 'rush_yds', 'rush_td', 'rush_long', 'targets', 'rec', 'rec_yds', 'rec_td', 'rec_long', 'fumbles', 'fumbles_lost', 'game_id', 'opponent_team', 'home', 'position']
Opponent defense strength values saved to 'defense_strength.csv'
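
One thing to watch when combining touchdowns and yards in a single index is scale. The sketch below uses made-up totals for three hypothetical teams to show how a weighted sum of raw totals is driven almost entirely by yards, and how min-max scaling each metric before weighting changes the ranking.

```python
# Hypothetical season totals allowed by three defenses
teams = {
    'A': {'touchdowns': 30, 'yards': 5000},
    'B': {'touchdowns': 50, 'yards': 4900},
    'C': {'touchdowns': 40, 'yards': 5200},
}


def min_max(values):
    """Rescale a {name: value} mapping to the [0, 1] range."""
    low, high = min(values.values()), max(values.values())
    return {name: (value - low) / (high - low) for name, value in values.items()}


# Weighted sum of the raw totals: yards are ~100x larger, so they dominate
raw = {name: 0.5 * s['touchdowns'] + 0.5 * s['yards'] for name, s in teams.items()}

# Scale each metric to [0, 1] first, then apply the same weights
td_scaled = min_max({name: s['touchdowns'] for name, s in teams.items()})
yd_scaled = min_max({name: s['yards'] for name, s in teams.items()})
scaled = {name: 0.5 * td_scaled[name] + 0.5 * yd_scaled[name] for name in teams}

# Lower score = less allowed = stronger defense
best_raw = min(raw, key=raw.get)
best_scaled = min(scaled, key=scaled.get)
print(f"Strongest defense using raw totals:    {best_raw}")
print(f"Strongest defense using scaled metrics: {best_scaled}")
```

With raw totals, team B ranks first because it allowed 100 fewer yards, even though it allowed the most touchdowns. After scaling, team A ranks first, which is what equal 0.5/0.5 weights are meant to express.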


In [10]:
# Merge opponent defense strength to PlayerStats.csv table 

import pandas as pd

# Paths to CSV files
defense_strength_path = 'defense_strength.csv'
player_stats_path = 'PlayerStats.csv'
output_path = 'PlayerStats.csv'

# Load the CSV files
player_stats_df = pd.read_csv(player_stats_path)
defense_strength_df = pd.read_csv(defense_strength_path)

# 'opponent_team' is already a column in PlayerStats.csv, so it does not need to be derived

# Drop any defense strength column left over from an earlier run so that
# re-running this cell does not create duplicate columns
player_stats_df = player_stats_df.drop(columns=['opponent_defense_strength'], errors='ignore')

# Merge the player_stats_df with the defense strengths based on the opponent_team
player_stats_df = pd.merge(
    player_stats_df,
    defense_strength_df,
    how='left',
    left_on='opponent_team',
    right_on='team_name'
)

# Rename the defense strength column for clarity
player_stats_df = player_stats_df.rename(columns={'defense_strength': 'opponent_defense_strength'})

# Drop the extra column after merging
player_stats_df = player_stats_df.drop(columns=['team_name'])

# Save the updated DataFrame back to a CSV file
player_stats_df.to_csv(output_path, index=False)

# Optionally, display the first few rows to verify
print(player_stats_df.head())

## XGBoost

In [None]:
# Load data from CSV and filter for skill positions

df = pd.read_csv('PlayerStats.csv')
df = df[df['position'].isin(['RB', 'WR', 'TE'])]
print(f"Loaded {len(df)} records for RB, WR, TE players")


In [None]:
# Create the target variable using actual column names
# Combine rush_td and rec_td to create a binary outcome scored_touchdown (1 if the player scored a touchdown, 0 if not)

df['scored_touchdown'] = ((df['rush_td'] > 0) | (df['rec_td'] > 0)).astype(int)

In [None]:
# Select features including opponent defense strength
features = ['rush_att', 'rush_yds', 'rec', 'targets', 'rec_yds', 'opponent_defense_strength']
X = df[features].fillna(0)  # Fill NaN values with 0
y = df['scored_touchdown']

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Initialize the model
xgb_model = xgb.XGBClassifier(eval_metric='logloss')
# xgb_model = xgb.XGBClassifier(eval_metric='logloss', n_estimators=100)

In [None]:
# Train the model
xgb_model.fit(X_train, y_train)

In [None]:
# Make predictions
y_pred = xgb_model.predict(X_test)

In [None]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"XGBoost Accuracy: {accuracy:.4f}")
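
Accuracy only checks which side of 0.5 each prediction falls on. Since the goal here is an accurate probability, it can help to also score the probabilities themselves, for example with the Brier score or log loss. The sketch below uses made-up labels and probabilities (not output from this notebook's model) to show two sets of predictions with identical accuracy but very different probability quality.

```python
import math


def accuracy(y_true, y_prob, threshold=0.5):
    """Share of predictions on the correct side of the threshold."""
    return sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob)) / len(y_true)


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the outcomes (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Hypothetical outcomes and two hypothetical sets of predicted probabilities
y_true = [1, 0, 1, 0]
confident = [0.9, 0.2, 0.6, 0.4]
hedged = [0.51, 0.49, 0.51, 0.49]

for name, probs in [("confident", confident), ("hedged", hedged)]:
    print(f"{name}: accuracy={accuracy(y_true, probs):.2f}, "
          f"brier={brier_score(y_true, probs):.4f}, log_loss={log_loss(y_true, probs):.4f}")
```

Both sets of predictions are 100% accurate, but the Brier score and log loss separate them clearly. For the model above, the same comparison would use `xgb_model.predict_proba(X_test)[:, 1]` in place of the made-up probabilities.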

In [None]:
# 1. Feature Importance
# XGBoost provides a way to access the importance of each feature in the model, which tells you how much each feature
# contributes to the prediction.
import matplotlib.pyplot as plt

# Plot feature importance
xgb.plot_importance(xgb_model)
plt.show()

# Alternatively, you can print the feature importance scores directly
# ('weight' counts how many times each feature is used in a split)
feature_importance = xgb_model.get_booster().get_score(importance_type='weight')
print("Feature Importance:")
for feature, importance in feature_importance.items():
    print(f"{feature}: {importance}")


In [None]:
# 2. Confusion Matrix
# A confusion matrix gives you a detailed breakdown of correct and incorrect predictions. It helps you understand where
# the model is making mistakes.

# Generate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Generate a detailed classification report
class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)


In [None]:
# 3. Logloss Value
# To see the logloss value, which is the evaluation metric you've used, you can print the evaluation results during training.

# Initialize the model with eval_metric as a parameter
xgb_model = xgb.XGBClassifier(eval_metric="logloss")

# Define the evaluation set
eval_set = [(X_train, y_train), (X_test, y_test)]

# Train the model with evaluation monitoring
xgb_model.fit(X_train, y_train, eval_set=eval_set, verbose=True)

# After training, you can access the evaluation results
evals_result = xgb_model.evals_result()

# Print the evaluation results
print("Evaluation Results:")
print(evals_result)

In [None]:
# Adjust params ^

# Initialize the model with adjusted parameters
xgb_model = xgb.XGBClassifier(
    eval_metric='logloss',
    # n_estimators=200,  # More boosting rounds
    # learning_rate=0.05,  # Smaller learning rate for better performance
    max_depth=4,  # Depth of the trees
    # subsample=0.8,  # Subsampling to reduce overfitting
    # colsample_bytree=0.8  # Feature subsampling
)

# Train the model
xgb_model.fit(X_train, y_train)

# Make predictions and evaluate the model as before
y_pred = xgb_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"XGBoost Accuracy: {accuracy:.4f}")


In [None]:
# All together w/ Bayesian Inference

def predict_touchdown_probability(player_name, opponent_defense_strength):
    # Use the stat columns the XGBoost model was trained on, in the same order
    stat_cols = ['rush_att', 'rush_yds', 'rec', 'targets', 'rec_yds']
    model_features = stat_cols + ['opponent_defense_strength']

    # Historical rows for the player, taken from the DataFrame loaded from PlayerStats.csv above
    # (assumes the 'player' column holds the player's display name)
    historical_data = df.loc[df['player'] == player_name, stat_cols]

    if historical_data.empty:
        raise ValueError(f"No data found for player: {player_name}")

    # Prior: the player's historical mean for each statistic
    prior_means = historical_data.mean()

    # Adjust the means based on the opponent's defense strength
    likelihood_means = prior_means * opponent_defense_strength

    # Blend the prior and the matchup-adjusted means with equal weight.
    # This is a simple heuristic average, not a full Bayesian posterior.
    posterior_means = (prior_means + likelihood_means) / 2

    # Build a one-row DataFrame with the model's feature columns
    player_df = posterior_means.to_frame().T
    player_df['opponent_defense_strength'] = opponent_defense_strength
    player_df = player_df[model_features]

    # Use the trained model to predict the probability of scoring a touchdown
    touchdown_probability = xgb_model.predict_proba(player_df)[:, 1]  # Probability of class 1 (scored a touchdown)

    # Print the result
    print(f"Probability of {player_name} scoring a touchdown: {touchdown_probability[0]:.4f}")

# Example usage:
# predict_touchdown_probability(player_name="Travis Kelce", opponent_defense_strength=0) # lower opponent_defense_strength better defense
# predict_touchdown_probability(player_name="Travis Kelce", opponent_defense_strength=1) # lower opponent_defense_strength better defense
predict_touchdown_probability(player_name="Travis Kelce", opponent_defense_strength=0.6) # lower opponent_defense_strength better defense

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

conn = sqlite3.connect('nfl.db')
df = pd.read_sql_query("SELECT * FROM PlayerStats WHERE position IN ('RB', 'WR', 'TE');", conn)
conn.close()

# Combine rushing_tds and receiving_tds to create a binary outcome scored_touchdown (1 if the player scored a touchdown, 0 if not)
df['scored_touchdown'] = df[['rushing_tds', 'receiving_tds']].sum(axis=1).apply(lambda x: 1 if x > 0 else 0)

features = ['carries', 'rushing_yards', 'receptions', 'targets', 'receiving_yards']
X = df[features]
y = df['scored_touchdown']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Accuracy: {accuracy:.4f}")

In [None]:
# w opponent defense

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load player data from the CSV file
df = pd.read_csv('PlayerStats.csv')

# Combine rush_td and rec_td (the column names in PlayerStats.csv) to create a binary outcome
# scored_touchdown (1 if the player scored a touchdown, 0 if not)
df['scored_touchdown'] = ((df['rush_td'] > 0) | (df['rec_td'] > 0)).astype(int)

# Select features including opponent defense strength
features = ['rush_att', 'rush_yds', 'rec', 'targets', 'rec_yds', 'opponent_defense_strength']
X = df[features].fillna(0)  # Fill NaN values with 0
y = df['scored_touchdown']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Accuracy: {accuracy:.4f}")

# Function to predict touchdown probability
def predict_touchdown_probability(player_name, opponent_defense_strength):
    # Look up the player's rows (assumes the 'player' column holds the player's name)
    player_data = df[df['player'] == player_name].copy()

    if player_data.empty:
        raise ValueError(f"No data found for player: {player_name}")

    # Add the provided opponent defense strength to the player's data
    player_data.loc[:, 'opponent_defense_strength'] = opponent_defense_strength

    # Prepare a single row for prediction: the player's average historical stat line
    X_player = player_data[features].fillna(0).mean().to_frame().T

    # Use the model to predict the probability
    probability = rf_model.predict_proba(X_player)[:, 1]  # Probability of scoring a touchdown

    # Print the result
    print(f"Probability of {player_name} scoring a touchdown against an opponent with defense strength {opponent_defense_strength}: {probability[0]:.4f}")

# Example usage:
predict_touchdown_probability(player_name="Travis Kelce", opponent_defense_strength=0.6)


## Logistic Regression

In [None]:
# w/o opponent defense

import pandas as pd
import sqlite3
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Connect to the database and load data
conn = sqlite3.connect('nfl.db')
df = pd.read_sql_query("SELECT * FROM PlayerStats WHERE position IN ('RB', 'WR', 'TE');", conn)
conn.close()

# Create the target variable
df['scored_touchdown'] = df[['rushing_tds', 'receiving_tds']].sum(axis=1).apply(lambda x: 1 if x > 0 else 0)

# Select features
features = ['carries', 'rushing_yards', 'receptions', 'targets', 'receiving_yards']
X = df[features]
y = df['scored_touchdown']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Logistic Regression model
logreg_model = LogisticRegression(random_state=42)
logreg_model.fit(X_train, y_train)

# Evaluate the model
y_pred = logreg_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Accuracy: {accuracy:.4f}")

# Function to predict touchdown probability
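def predict_touchdown_probability(player_name, opponent_defense_strength):
    # Query the player's data from the database.
    # opponent_defense_strength is accepted for a consistent signature but is not used by this model.
    conn = sqlite3.connect('nfl.db')
    # Bind the name as a parameter instead of formatting it into the SQL,
    # so names containing quotes cannot break the query
    query = """
    SELECT carries, rushing_yards, receptions, targets, receiving_yards
    FROM PlayerStats
    WHERE player_display_name = ?
    """
    player_data = pd.read_sql_query(query, conn, params=(player_name,))
    conn.close()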

    if player_data.empty:
        raise ValueError(f"No data found for player: {player_name}")

    # Use the model to predict the probability
    X_player = player_data[features]
    probability = logreg_model.predict_proba(X_player)[:, 1]  # Probability of scoring a touchdown

    # Print the result
    print(f"Probability of {player_name} scoring a touchdown: {probability[0]:.4f}")

# Example usage:
predict_touchdown_probability(player_name="Travis Kelce", opponent_defense_strength=0.8)


In [None]:
# w opponent defense

import pandas as pd
import sqlite3
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Connect to the database and load data
# conn = sqlite3.connect('nfl.db')
# df = pd.read_sql_query("SELECT * FROM PlayerStats WHERE position IN ('RB', 'WR', 'TE');", conn)
# conn.close()
df = pd.read_csv('PlayerStats.csv')

# Create the target variable using the column names in PlayerStats.csv
df['scored_touchdown'] = ((df['rush_td'] > 0) | (df['rec_td'] > 0)).astype(int)

# Select features including opponent defense strength
features = ['rush_att', 'rush_yds', 'rec', 'targets', 'rec_yds', 'opponent_defense_strength']
X = df[features].fillna(0)  # Logistic regression cannot handle NaN values
y = df['scored_touchdown']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Logistic Regression model
logreg_model = LogisticRegression(random_state=42)
logreg_model.fit(X_train, y_train)

# Evaluate the model
y_pred = logreg_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Accuracy: {accuracy:.4f}")

# Function to predict touchdown probability
def predict_touchdown_probability(player_name, opponent_defense_strength):
    # Look up the player's rows (assumes the 'player' column holds the player's name)
    player_data = df[df['player'] == player_name].copy()

    if player_data.empty:
        raise ValueError(f"No data found for player: {player_name}")

    # Add the provided opponent defense strength to the player's data
    player_data.loc[:, 'opponent_defense_strength'] = opponent_defense_strength

    # Prepare a single row for prediction: the player's average historical stat line
    X_player = player_data[features].fillna(0).mean().to_frame().T

    # Use the model to predict the probability
    probability = logreg_model.predict_proba(X_player)[:, 1]  # Probability of scoring a touchdown

    # Print the result
    print(f"Probability of {player_name} scoring a touchdown against an opponent with defense strength {opponent_defense_strength}: {probability[0]:.4f}")

# Example usage:
predict_touchdown_probability(player_name="Travis Kelce", opponent_defense_strength=0.33)

In [11]:
# Predict for an upcoming 2025 game

def predict_2025_touchdown(player_name, opponent_team, opponent_defense_strength, 
                          estimated_carries=0, estimated_rushing_yards=0, 
                          estimated_receptions=3, estimated_targets=5, estimated_receiving_yards=25):
    
    # Create prediction data, keyed by the column names the XGBoost model was trained on
    # (the XGBoost training cells above must be run first so that xgb_model exists)
    player_data = {
        'rush_att': estimated_carries,
        'rush_yds': estimated_rushing_yards,
        'rec': estimated_receptions,
        'targets': estimated_targets,
        'rec_yds': estimated_receiving_yards,
        'opponent_defense_strength': opponent_defense_strength
    }
    
    player_df = pd.DataFrame([player_data])
    
    # Get prediction from XGBoost model
    xgb_prob = xgb_model.predict_proba(player_df)[:, 1][0]
    
    print(f"Touchdown Prediction for {player_name} vs {opponent_team}: {xgb_prob:.4f} ({xgb_prob*100:.1f}%)")
    
    return xgb_prob

# Example predictions for 2025 season
print("=== 2025 Season Touchdown Predictions ===")
predict_2025_touchdown("Travis Kelce", "BUF", 0.3, estimated_receptions=4, estimated_targets=6, estimated_receiving_yards=40)


=== 2025 Season Touchdown Predictions ===


NameError: name 'xgb_model' is not defined