
# Analyzing Comeback Situations: Does Shooting More Threes Help?
## Dylan Sloan

### Introduction
        A lot of teams and coaches wonder what the data says about these end of game situations. This project intends to answer the question of if a team needs a comeback, does it actually matter whether they shoot threes or twos?

### Data Gathering

In [3]:
import pandas as pd
import zipfile
import os
import numpy as np

path = './NCAA_D1_PBP_2010_22'
zipped_files = [f for f in os.listdir(path) if f.endswith('.zip')]

all_dataframes = []

for zf in zipped_files:
    with zipfile.ZipFile(os.path.join(path, zf), 'r') as zip_ref:
        zip_ref.extractall(path)
        csv_file = zf.replace('.zip', '')
        df = pd.read_csv(os.path.join(path, csv_file))
        all_dataframes.append(df)

combined_dataframe = pd.concat(all_dataframes, ignore_index=True)


### Defining a Comeback
        I decided to define a comeback as any team who was down 7 or more points within 5 minutes to go. This made it a three possession game.

In [4]:
potential_comebacks = combined_dataframe[
    (combined_dataframe['secs_left_reg'] <= 300) &
    (abs(combined_dataframe['home_score'] - combined_dataframe['away_score']) >= 7)
]

### Filtering Final Scores

In [5]:
final_scores = combined_dataframe.groupby('GameID').last()

### Down and Won Data
        The down and won data is all the games where a team was in a potential comeback situation (down 7 with 5 minutes or less to go) and ended up winning the game. There were 1193 games where the team made a comeback and won.

In [6]:
merged_df = pd.merge(potential_comebacks, final_scores, on='GameID', suffixes=('', '_final'))
actual_comebacks = merged_df[
    ((merged_df['home_score'] > merged_df['away_score']) & (merged_df['home_score_final'] < merged_df['away_score_final'])) |
    ((merged_df['home_score'] < merged_df['away_score']) & (merged_df['home_score_final'] > merged_df['away_score_final']))
]

comeback_game_ids = actual_comebacks['GameID'].unique()

down_and_won_df = combined_dataframe[
    (combined_dataframe['GameID'].isin(comeback_game_ids)) &
    (combined_dataframe['secs_left_reg'] <= 300)
]

down_and_won_df.sort_values(by=['GameID', 'secs_left_reg'], ascending=[True, False])

print(down_and_won_df.head())

           GameID        Date  Season        home_team  \
107926  400588645  2014-11-18    2015  Stetson Hatters   
107927  400588645  2014-11-18    2015  Stetson Hatters   
107928  400588645  2014-11-18    2015  Stetson Hatters   
107929  400588645  2014-11-18    2015  Stetson Hatters   
107930  400588645  2014-11-18    2015  Stetson Hatters   

                             away_team  \
107926  Florida International Panthers   
107927  Florida International Panthers   
107928  Florida International Panthers   
107929  Florida International Panthers   
107930  Florida International Panthers   

                                              play_desc  home_score  \
107926  Kyle Sikora made Layup. Assisted by Brian Pegg.        49.0   
107927                      Tashwan Desir missed Layup.        49.0   
107928                    Brian Pegg Defensive Rebound.        49.0   
107929          Divine Myles missed Three Point Jumper.        49.0   
107930                 Kentwan Smith Offens

In [10]:
unique_game_ids = down_and_won_df['GameID'].nunique()
print("The number of unique GameID's is:", unique_game_ids)


The number of unique GameID's is: 1193


### Down but Lost Data
        The down and lost data is all the games where a team was in a potential comeback situation (down 7 with 5 minutes or less to go) and ended up losing the game. There were 48,826 games where a team could've made a comeback but lost.

In [8]:
merged_df = pd.merge(potential_comebacks, final_scores, on='GameID', suffixes=('', '_final'))

down_but_lost = merged_df[
    ((merged_df['home_score'] > merged_df['away_score']) & (merged_df['home_score_final'] >= merged_df['away_score_final'])) |
    ((merged_df['home_score'] < merged_df['away_score']) & (merged_df['home_score_final'] <= merged_df['away_score_final']))
]

down_but_lost_game_ids = down_but_lost['GameID'].unique()

down_but_lost_df = combined_dataframe[
    (combined_dataframe['GameID'].isin(down_but_lost_game_ids)) &
    (combined_dataframe['secs_left_reg'] <= 300)
]

down_but_lost_df = down_but_lost_df.sort_values(by=['GameID', 'secs_left_reg'], ascending=[True, False])

print(down_but_lost_df.head())

             GameID        Date  Season                home_team  \
16717518  293130025  2009-11-09    2010  California Golden Bears   
16717519  293130025  2009-11-09    2010  California Golden Bears   
16717520  293130025  2009-11-09    2010  California Golden Bears   
16717521  293130025  2009-11-09    2010  California Golden Bears   
16717522  293130025  2009-11-09    2010  California Golden Bears   

                    away_team                                 play_desc  \
16717518  Murray State Racers  Jorge Gutierrez missed Two Point Jumper.   
16717519  Murray State Racers                       B.J. Jenkins Block.   
16717520  Murray State Racers            Tony Easley Defensive Rebound.   
16717521  Murray State Racers               Foul on Patrick Christopher   
16717522  Murray State Racers                Ivan Aska made Free Throw.   

          home_score  away_score  half  secs_left_half  secs_left_reg  \
16717518        66.0        53.0   2.0           293.0          293

In [10]:
unique_game_ids2 = down_but_lost_df['GameID'].nunique()
print("The number of unique GameID's is:", unique_game_ids2)

The number of unique GameID's is: 48826


### Identifying which Teams Won and Lost

In [15]:
import numpy as np
final_scores = down_and_won_df.groupby('GameID').last()

final_scores['winner'] = np.where(final_scores['home_score'] > final_scores['away_score'], final_scores['home_team'], final_scores['away_team'])

down_and_won_df = down_and_won_df.merge(final_scores[['winner']], on='GameID', how='left')

print(down_and_won_df.head())

      GameID        Date  Season        home_team  \
0  400588645  2014-11-18    2015  Stetson Hatters   
1  400588645  2014-11-18    2015  Stetson Hatters   
2  400588645  2014-11-18    2015  Stetson Hatters   
3  400588645  2014-11-18    2015  Stetson Hatters   
4  400588645  2014-11-18    2015  Stetson Hatters   

                        away_team  \
0  Florida International Panthers   
1  Florida International Panthers   
2  Florida International Panthers   
3  Florida International Panthers   
4  Florida International Panthers   

                                         play_desc  home_score  away_score  \
0  Kyle Sikora made Layup. Assisted by Brian Pegg.        49.0        45.0   
1                      Tashwan Desir missed Layup.        49.0        45.0   
2                    Brian Pegg Defensive Rebound.        49.0        45.0   
3          Divine Myles missed Three Point Jumper.        49.0        45.0   
4                 Kentwan Smith Offensive Rebound.        49.0      

In [16]:
import numpy as np
import pandas as pd  

final_scores = down_but_lost_df.groupby('GameID').last().reset_index()

final_scores['loser'] = np.where(final_scores['home_score'] > final_scores['away_score'], final_scores['away_team'], final_scores['home_team'])

down_but_lost_df = pd.merge(down_but_lost_df, final_scores[['GameID', 'loser']], on='GameID', how='left')

print(down_but_lost_df.head())

      GameID        Date  Season                home_team  \
0  293130025  2009-11-09    2010  California Golden Bears   
1  293130025  2009-11-09    2010  California Golden Bears   
2  293130025  2009-11-09    2010  California Golden Bears   
3  293130025  2009-11-09    2010  California Golden Bears   
4  293130025  2009-11-09    2010  California Golden Bears   

             away_team                                 play_desc  home_score  \
0  Murray State Racers  Jorge Gutierrez missed Two Point Jumper.        66.0   
1  Murray State Racers                       B.J. Jenkins Block.        66.0   
2  Murray State Racers            Tony Easley Defensive Rebound.        66.0   
3  Murray State Racers               Foul on Patrick Christopher        66.0   
4  Murray State Racers                Ivan Aska made Free Throw.        66.0   

   away_score  half  secs_left_half  secs_left_reg                play_team  \
0        53.0   2.0           293.0          293.0  California Golden Bea

### Filtering the Three Pointers Shot

In [17]:
down_and_won_df1 = down_and_won_df[(down_and_won_df['play_desc'].str.contains("Three Point", case=False, na=False)) & 
                          (down_and_won_df['play_team'] == down_and_won_df['winner'])]

down_but_lost_df1 = down_but_lost_df[(down_but_lost_df['play_desc'].str.contains("Three Point", case=False, na=False)) & 
                          (down_but_lost_df['play_team'] == down_but_lost_df['loser'])]


### Counting the Three Pointers Shot

In [18]:
down_and_won_df1 = down_and_won_df1.copy()
down_and_won_df1.loc[:, 'three_point_shots'] = down_and_won_df1.groupby('GameID')['GameID'].transform('count')

down_but_lost_df1 = down_but_lost_df1.copy()
down_but_lost_df1.loc[:, 'three_point_shots'] = down_but_lost_df1.groupby('GameID')['GameID'].transform('count')


### Filtering for Testing

In [19]:
down_and_won_df2 = down_and_won_df1.groupby('GameID').first().reset_index()
down_but_lost_df2 = down_but_lost_df1.groupby('GameID').first().reset_index()
down_and_won_df3=down_and_won_df2[['GameID','three_point_shots']]
down_but_lost_df3=down_but_lost_df2[['GameID','three_point_shots']]

### Finding the Games where there was 0 Three Point Shots
        When I filtered the games with three point shots I got rid of the games with 0 three point shots. This code adds those games back in the dataset.

In [20]:
unique_gameid_down_and_won_df1 = set(down_and_won_df1['GameID'].unique())
unique_gameid_down_and_won_df = set(down_and_won_df['GameID'].unique())
not_common_gameid_in_df1 = unique_gameid_down_and_won_df1 - unique_gameid_down_and_won_df
not_common_gameid_in_df = unique_gameid_down_and_won_df - unique_gameid_down_and_won_df1
rows_not_common_in_df1 = down_and_won_df1[down_and_won_df1['GameID'].isin(not_common_gameid_in_df1)]
rows_not_common_in_df = down_and_won_df[down_and_won_df['GameID'].isin(not_common_gameid_in_df)]


In [21]:
unique_gameid_down_but_lost_df1 = set(down_but_lost_df1['GameID'].unique())
unique_gameid_down_but_lost_df = set(down_but_lost_df['GameID'].unique())
not_common_gameid_in_lost_df1 = unique_gameid_down_but_lost_df1 - unique_gameid_down_but_lost_df
not_common_gameid_in_lost_df = unique_gameid_down_but_lost_df - unique_gameid_down_but_lost_df1
rows_not_common_in_lost_df1 = down_but_lost_df1[down_but_lost_df1['GameID'].isin(not_common_gameid_in_lost_df1)]
rows_not_common_in_lost_df = down_but_lost_df[down_but_lost_df['GameID'].isin(not_common_gameid_in_lost_df)]


In [22]:
unique_game_ids = down_and_won_df[~down_and_won_df['GameID'].isin(down_and_won_df1['GameID'])]['GameID'].unique()
unique_game_ids_list = list(unique_game_ids)
unique_game_ids_win_df = pd.DataFrame({'GameID': unique_game_ids_list})
unique_game_ids_win_df['three_point_shots'] = 0



In [23]:
unique_game_ids_lost = down_but_lost_df[~down_but_lost_df['GameID'].isin(down_but_lost_df1['GameID'])]['GameID'].unique()
unique_game_ids_lost_list = list(unique_game_ids_lost)
unique_game_ids_loss_df = pd.DataFrame({'GameID': unique_game_ids_list})
unique_game_ids_loss_df['three_point_shots'] = 0



### Filtering for Testing

In [24]:
down_and_won_df2 = down_and_won_df1.groupby('GameID').first().reset_index()
down_but_lost_df2 = down_but_lost_df1.groupby('GameID').first().reset_index()
down_and_won_df3=down_and_won_df2[['GameID','three_point_shots']]
down_but_lost_df3=down_but_lost_df2[['GameID','three_point_shots']]

In [25]:
combined_df_win = pd.concat([down_and_won_df3, unique_game_ids_win_df], ignore_index=True)
combined_df_loss = pd.concat([down_but_lost_df3, unique_game_ids_loss_df], ignore_index=True)

In [26]:
win_and_comeback = combined_df_win['three_point_shots']
loss_and_comeback = combined_df_loss['three_point_shots']

### Conclusion
        The mean of the threes shot in games where the team came back and won the game was 3.97 and the mean of the threes shot in the games where the team had a chance to comeback but lost is 3.78. To find out whether the threes shot in these two groups was statistically different, it is neccesary to do a t-test. From the t-tests shown below it can be concluded that these two groups are statistically different and that it is better to shoot more threes when in a comeback situation.

In [27]:
win_and_comeback.mean()

3.9715004191114835

In [28]:
loss_and_comeback.mean()

3.781711164277699

In [29]:
from scipy import stats
t_stat, p_value = stats.ttest_ind(win_and_comeback,loss_and_comeback, equal_var=False)
print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")

alpha = 0.05  
if p_value < alpha:
    print("We reject the null hypothesis: the means of the two groups are statistically different.")
else:
    print("We fail to reject the null hypothesis: there is not enough evidence to say that the means are different.")

t-statistic: 2.509614379987808
p-value: 0.01221424400525684
We reject the null hypothesis: the means of the two groups are statistically different.


In [30]:
from scipy import stats
t_stat, p_value = stats.ttest_ind(win_and_comeback,loss_and_comeback, equal_var=False)
one_tailed_p_value = p_value / 2
print(f"t-statistic: {t_stat}")
print(f"One-tailed p-value: {one_tailed_p_value}")
alpha = 0.05  
if one_tailed_p_value < alpha:
    if t_stat > 0:
        print("We reject the null hypothesis: the mean of comebacks that won is statistically greater than that of comebacks that lost.")
    else:
        print("We reject the null hypothesis: the mean of comebacks that lost is statistically greater than that of comebacks that won.")
else:
    print("We fail to reject the null hypothesis: there is not enough evidence to say that one mean is greater than the other.")

t-statistic: 2.509614379987808
One-tailed p-value: 0.00610712200262842
We reject the null hypothesis: the mean of comebacks that won is statistically greater than that of comebacks that lost.


In [31]:
from scipy import stats
t_stat, p_value = stats.ttest_ind(win_and_comeback,loss_and_comeback, equal_var=False)
one_tailed_p_value = p_value / 2
print(f"t-statistic: {t_stat}")
print(f"One-tailed p-value: {one_tailed_p_value}")
alpha = 0.05  
if one_tailed_p_value < alpha:
    if t_stat < 0:
        print("We reject the null hypothesis: the mean of comebacks that won is statistically greater than that of comebacks that lost.")
    else:
        print("We reject the null hypothesis: the mean of comebacks that lost is statistically greater than that of comebacks that won.")
else:
    print("We fail to reject the null hypothesis: there is not enough evidence to say that one mean is less than the other.")

t-statistic: 2.509614379987808
One-tailed p-value: 0.00610712200262842
We reject the null hypothesis: the mean of comebacks that lost is statistically greater than that of comebacks that won.


### References
            I got the play by play data from the github of Julian Zapatahall. I really appreciate him sharing this data.

In [32]:
!git clone https://github.com/julianzapatahall/NCAA_D1_PBP_2010_22.git

fatal: destination path 'NCAA_D1_PBP_2010_22' already exists and is not an empty directory.
