# Imputations with KNN
Welcome to the Jupyter Notebook for the Milwaukee Bucks Hackathon! In this Jupyter Notebook, we work on imputations with regards to the missing values in the Accounts Dataset. We would like to fill in the missing values that are found, especially within `BasketballPropensity` and `DistanceToArena`.

In [1]:
# all the necessary imports

# importing pandas
import pandas as pd

# importing the IPython library
from IPython.display import display

In [2]:
# convert the excel sheets into a .csv file
account_xlsx  = "BucksAssignment/Hackathon Prompt 1 Data/Prompt1AccountLevel.xlsx"
account_csv = "BucksDatasets/AccountLevel.csv"

game_xlsx = "BucksAssignment/Hackathon Prompt 1 Data/Prompt1GameLevel.xlsx"
game_csv = "BucksDatasets/GameLevel.csv"

seat_xlsx = "BucksAssignment/Hackathon Prompt 1 Data/Prompt1SeatLevel.xlsx"
seat_csv = "BucksDatasets/SeatLevel.csv"

# reading the excel file
account_df = pd.read_excel(account_xlsx, engine='openpyxl')
game_df = pd.read_excel(game_xlsx, engine="openpyxl")
seat_df = pd.read_excel(seat_xlsx, engine="openpyxl")

# saving as a csv file
account_df.to_csv(account_csv, index=False)
game_df.to_csv(game_csv, index=False)
seat_df.to_csv(seat_csv, index=False)

In [3]:
# getting a summary of the account data frames
print("Using .info() on the Account Dataset")
display(account_df.info())

print("Using .describe() on the Account Dataset")
display(account_df.describe())

Using .info() on the Account Dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44211 entries, 0 to 44210
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Season                 44211 non-null  int64  
 1   AccountNumber          44211 non-null  int64  
 2   SingleGameTickets      44211 non-null  int64  
 3   PartialPlanTickets     44211 non-null  int64  
 4   GroupTickets           44211 non-null  int64  
 5   STM                    44211 non-null  int64  
 6   AvgSpend               44211 non-null  float64
 7   GamesAttended          44211 non-null  int64  
 8   FanSegment             44211 non-null  object 
 9   DistanceToArena        41088 non-null  float64
 10  BasketballPropensity   37214 non-null  float64
 11  SocialMediaEngagement  44211 non-null  object 
dtypes: float64(3), int64(7), object(2)
memory usage: 4.0+ MB


None

Using .describe() on the Account Dataset


Unnamed: 0,Season,AccountNumber,SingleGameTickets,PartialPlanTickets,GroupTickets,STM,AvgSpend,GamesAttended,DistanceToArena,BasketballPropensity
count,44211.0,44211.0,44211.0,44211.0,44211.0,44211.0,44211.0,44211.0,41088.0,37214.0
mean,2023.645631,21116.251815,1.987582,0.836082,2.476669,0.049648,81.139832,1.262966,143.870668,689.229027
std,0.478327,12508.767388,15.080973,5.177241,178.657845,0.21722,94.742229,2.323687,329.215154,235.621148
min,2023.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,125.0
25%,2023.0,10169.5,0.0,0.0,0.0,0.0,30.0,1.0,8.0,481.0
50%,2024.0,20923.0,2.0,0.0,0.0,0.0,62.0,1.0,30.0,719.0
75%,2024.0,31975.5,3.0,0.0,0.0,0.0,100.0,1.0,87.0,923.0
max,2024.0,43028.0,3120.0,120.0,37200.0,1.0,3297.0,41.0,4240.0,993.0


In [4]:
# getting a summary of the game level data frames
print("Using .info() on the Game Level Dataset")
display(game_df.info())

print("Using .describe() on the Game Level Dataset")
display(game_df.describe())

Using .info() on the Game Level Dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82 entries, 0 to 81
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Game      82 non-null     object
 1   Giveaway  19 non-null     object
dtypes: object(2)
memory usage: 1.4+ KB


None

Using .describe() on the Game Level Dataset


Unnamed: 0,Game,Giveaway
count,82,19
unique,82,13
top,2023-10-26 Philadelphia 76ers,Cap
freq,1,5


In [5]:
# getting a summary of the seat level data frames
print("Using .info() on the Seat Level Dataset")
display(seat_df.info())

print("Using .describe() on the Seat Level Dataset")
display(seat_df.describe())

Using .info() on the Seat Level Dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 493884 entries, 0 to 493883
Data columns (total 5 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   Season         493884 non-null  int64         
 1   AccountNumber  493884 non-null  int64         
 2   Game           493884 non-null  object        
 3   GameDate       493884 non-null  datetime64[ns]
 4   GameTier       493884 non-null  object        
dtypes: datetime64[ns](1), int64(2), object(2)
memory usage: 18.8+ MB


None

Using .describe() on the Seat Level Dataset


Unnamed: 0,Season,AccountNumber,GameDate
count,493884.0,493884.0,493884
mean,2023.848608,20674.040811,2024-11-20 02:32:03.988628224
min,2023.0,1.0,2023-10-26 00:00:00
25%,2024.0,15971.0,2024-11-12 00:00:00
50%,2024.0,20103.0,2025-01-02 00:00:00
75%,2024.0,26676.0,2025-02-20 00:00:00
max,2024.0,43028.0,2025-04-13 00:00:00
std,0.358431,9970.546369,


In [6]:
# checking the number of null values in the account data frames
print("Any null values in Account: ", account_df.isnull().values.any())

# number of null values in the data frame
print("Number of Null Values in Account: \n", account_df.isnull().sum())


Any null values in Account:  True
Number of Null Values in Account: 
 Season                      0
AccountNumber               0
SingleGameTickets           0
PartialPlanTickets          0
GroupTickets                0
STM                         0
AvgSpend                    0
GamesAttended               0
FanSegment                  0
DistanceToArena          3123
BasketballPropensity     6997
SocialMediaEngagement       0
dtype: int64


In [7]:
# checking the number of null values in the game level data frames
print("Any null values in Game Level: ", game_df.isnull().values.any())

# number of null values in the data frame
print("Number of Null Values in Game Level:\n ", game_df.isnull().sum())


Any null values in Game Level:  True
Number of Null Values in Game Level:
  Game         0
Giveaway    63
dtype: int64


In [8]:
# checking the number of null values in the seat level data frames
print("Any null values in Seat Level: ", seat_df.isnull().values.any())

# number of null values in the data frame
print("Number of Null Values in Seat Level: \n ",seat_df.isnull().sum())

Any null values in Seat Level:  False
Number of Null Values in Seat Level: 
  Season           0
AccountNumber    0
Game             0
GameDate         0
GameTier         0
dtype: int64


In [9]:
# viewing the final datasets
display(account_df.head(5))
display(game_df.head(5))
display(seat_df.head(5))

Unnamed: 0,Season,AccountNumber,SingleGameTickets,PartialPlanTickets,GroupTickets,STM,AvgSpend,GamesAttended,FanSegment,DistanceToArena,BasketballPropensity,SocialMediaEngagement
0,2023,1,0,0,0,0,467.0,0,F,12.0,872.0,Low
1,2023,2,2,0,0,0,116.0,1,A,47.0,485.0,Low
2,2023,3,3,0,0,0,107.0,1,B,6.0,896.0,Low
3,2023,4,0,0,3,0,27.0,1,C,3.0,467.0,High
4,2023,5,0,0,2,0,14.0,1,A,4.0,582.0,Medium


Unnamed: 0,Game,Giveaway
0,2023-10-26 Philadelphia 76ers,
1,2023-10-29 Atlanta Hawks,Cap
2,2023-10-30 Miami Heat,
3,2023-11-03 New York Knicks,
4,2023-11-08 Detroit Pistons,Lunch Bag


Unnamed: 0,Season,AccountNumber,Game,GameDate,GameTier
0,2023,1,2024-01-24 Cleveland Cavaliers,2024-01-24,D
1,2023,1,2024-01-24 Cleveland Cavaliers,2024-01-24,D
2,2023,1,2024-01-24 Cleveland Cavaliers,2024-01-24,D
3,2023,1,2024-01-24 Cleveland Cavaliers,2024-01-24,D
4,2023,1,2024-01-24 Cleveland Cavaliers,2024-01-24,D


Now, that we have analyzed the datasets, and looked at the Data Frames, we can see that there is some missing values from `DistanceToArena` and `BasketballPropensity`. We decide to deal with the missing values by using a KNN model to predict the missing values and fill them  in.

In [10]:
# mapping the categorical values in SocialMediaEngagement
mapping = {'Low': 1, 'Medium': 2, 'High': 3}
account_df['SocialMediaEngagement'] = account_df['SocialMediaEngagement'].map(mapping)

In [11]:
# creating a new column in the data set
account_df['BasketballPropenstiyFill'] = account_df['BasketballPropensity']
account_df['DistanceToArenaFill'] = account_df['DistanceToArena']

# importing the knn 
from sklearn.neighbors import KNeighborsRegressor

In [12]:
# new data frame with no missing values from basketball propensity and distance to arena
data_drop_propensity_na = account_df[account_df['BasketballPropensity'].notna()]
data_drop_distance_na = account_df[account_df['DistanceToArena'].notna()]

# new data frame with missing values from basketball propensity and distance to arena
data_propensity_na = account_df[account_df['BasketballPropensity'].isna()]
data_distance_na = account_df[account_df['DistanceToArena'].isna()]

# length cross-check for basketball propensity
bucks_propensity_length = len(account_df['BasketballPropensity'])
drop_propensity_length = len(data_drop_propensity_na)
missing_propensity_length = len(data_propensity_na)

# length cross check for distance to arena
bucks_distance_length = len(account_df['DistanceToArena'])
drop_distance_length = len(data_drop_distance_na)
missing_distance_length = len(data_distance_na)

# using a if condition to check if the lengths are equal for basketball propensity
if bucks_propensity_length == (drop_propensity_length + missing_propensity_length):
    print("Valid!")
else:
    print("Not Valid!")

# using a if condition to check if the lengths are equal for distance to arena
if bucks_distance_length == (drop_distance_length + missing_distance_length):
    print("Valid!")
else:
    print("Not Valid!")

# viewing the data frame with no missing values for basketball propensity
print("No Missing Values for Basketball Propensity")
display(data_drop_propensity_na.head(3))
print("\n")

# viewing the data frame with no missing values for distance to arena
print("No Missing Values for Distance to Arena")
display(data_drop_distance_na.head(3))
print("\n")

# viewing the data frame with missing values for basketball propensity
print("Missing Values for Basketball Propensity")
display(data_propensity_na.head(3))
print("\n")

# viewing the data frame with missing values for distance to arena
print("Missing Values for Distane to Arena")
display(data_distance_na.head(3))
print("\n")

# displaying the account data frame
print("Displaying the Data Frame for Account")
account_df

Valid!
Valid!
No Missing Values for Basketball Propensity


Unnamed: 0,Season,AccountNumber,SingleGameTickets,PartialPlanTickets,GroupTickets,STM,AvgSpend,GamesAttended,FanSegment,DistanceToArena,BasketballPropensity,SocialMediaEngagement,BasketballPropenstiyFill,DistanceToArenaFill
0,2023,1,0,0,0,0,467.0,0,F,12.0,872.0,1,872.0,12.0
1,2023,2,2,0,0,0,116.0,1,A,47.0,485.0,1,485.0,47.0
2,2023,3,3,0,0,0,107.0,1,B,6.0,896.0,1,896.0,6.0




No Missing Values for Distance to Arena


Unnamed: 0,Season,AccountNumber,SingleGameTickets,PartialPlanTickets,GroupTickets,STM,AvgSpend,GamesAttended,FanSegment,DistanceToArena,BasketballPropensity,SocialMediaEngagement,BasketballPropenstiyFill,DistanceToArenaFill
0,2023,1,0,0,0,0,467.0,0,F,12.0,872.0,1,872.0,12.0
1,2023,2,2,0,0,0,116.0,1,A,47.0,485.0,1,485.0,47.0
2,2023,3,3,0,0,0,107.0,1,B,6.0,896.0,1,896.0,6.0




Missing Values for Basketball Propensity


Unnamed: 0,Season,AccountNumber,SingleGameTickets,PartialPlanTickets,GroupTickets,STM,AvgSpend,GamesAttended,FanSegment,DistanceToArena,BasketballPropensity,SocialMediaEngagement,BasketballPropenstiyFill,DistanceToArenaFill
8,2023,9,0,0,2,0,24.5,1,Limited Data,158.0,,2,,158.0
17,2023,18,4,0,0,0,3.6,3,A,6.0,,3,,6.0
70,2023,71,0,0,2,0,112.0,1,Limited Data,,,2,,




Missing Values for Distane to Arena


Unnamed: 0,Season,AccountNumber,SingleGameTickets,PartialPlanTickets,GroupTickets,STM,AvgSpend,GamesAttended,FanSegment,DistanceToArena,BasketballPropensity,SocialMediaEngagement,BasketballPropenstiyFill,DistanceToArenaFill
70,2023,71,0,0,2,0,112.0,1,Limited Data,,,2,,
131,2023,132,1,0,0,0,15.0,1,Limited Data,,,1,,
275,2023,276,0,0,3,0,77.0,1,Limited Data,,,2,,




Displaying the Data Frame for Account


Unnamed: 0,Season,AccountNumber,SingleGameTickets,PartialPlanTickets,GroupTickets,STM,AvgSpend,GamesAttended,FanSegment,DistanceToArena,BasketballPropensity,SocialMediaEngagement,BasketballPropenstiyFill,DistanceToArenaFill
0,2023,1,0,0,0,0,467.00,0,F,12.0,872.0,1,872.0,12.0
1,2023,2,2,0,0,0,116.00,1,A,47.0,485.0,1,485.0,47.0
2,2023,3,3,0,0,0,107.00,1,B,6.0,896.0,1,896.0,6.0
3,2023,4,0,0,3,0,27.00,1,C,3.0,467.0,3,467.0,3.0
4,2023,5,0,0,2,0,14.00,1,A,4.0,582.0,2,582.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44206,2024,43025,2,0,0,0,2.00,1,A,26.0,290.0,3,290.0,26.0
44207,2024,43026,0,0,3,0,6.34,1,D,6.0,266.0,2,266.0,6.0
44208,2024,43027,0,0,6,0,41.00,1,Limited Data,9.0,392.0,3,392.0,9.0
44209,2024,43028,2,0,0,0,68.00,1,A,6.0,898.0,3,898.0,6.0


In [13]:
# the training dataset and the response variable for basketball propensity
x_train_propensity = data_drop_propensity_na[['AvgSpend', 'SingleGameTickets', 'STM', 'GamesAttended', 
                                              'SocialMediaEngagement']]
y_train_propensity = data_drop_propensity_na['BasketballPropensity']

# the feature to predict for basketball propensity
x_predict_propensity = data_propensity_na[['AvgSpend', 'SingleGameTickets', 'STM', 'GamesAttended', 
                                           'SocialMediaEngagement']]

# the training dataset and the response variable for distance to arena
x_train_distance =data_drop_distance_na[['AvgSpend', 'SingleGameTickets', 'STM', 'GamesAttended', 
                                              'SocialMediaEngagement']]
y_train_distance = data_drop_distance_na['DistanceToArena']

# the feature to predict for distance to arena
x_predict_distance = data_distance_na[['AvgSpend', 'SingleGameTickets', 'STM', 'GamesAttended', 
                                           'SocialMediaEngagement']]

# displaying the data frame
account_df

Unnamed: 0,Season,AccountNumber,SingleGameTickets,PartialPlanTickets,GroupTickets,STM,AvgSpend,GamesAttended,FanSegment,DistanceToArena,BasketballPropensity,SocialMediaEngagement,BasketballPropenstiyFill,DistanceToArenaFill
0,2023,1,0,0,0,0,467.00,0,F,12.0,872.0,1,872.0,12.0
1,2023,2,2,0,0,0,116.00,1,A,47.0,485.0,1,485.0,47.0
2,2023,3,3,0,0,0,107.00,1,B,6.0,896.0,1,896.0,6.0
3,2023,4,0,0,3,0,27.00,1,C,3.0,467.0,3,467.0,3.0
4,2023,5,0,0,2,0,14.00,1,A,4.0,582.0,2,582.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44206,2024,43025,2,0,0,0,2.00,1,A,26.0,290.0,3,290.0,26.0
44207,2024,43026,0,0,3,0,6.34,1,D,6.0,266.0,2,266.0,6.0
44208,2024,43027,0,0,6,0,41.00,1,Limited Data,9.0,392.0,3,392.0,9.0
44209,2024,43028,2,0,0,0,68.00,1,A,6.0,898.0,3,898.0,6.0


In [14]:
# instiating the KNN model with 5 neighbors
bucks_KNN_propensity = KNeighborsRegressor(n_neighbors=5)
bucks_KNN_distance = KNeighborsRegressor(n_neighbors=5)

# fitting the model with the training data
bucks_KNN_propensity.fit(x_train_propensity, y_train_propensity)
bucks_KNN_distance.fit(x_train_distance,y_train_distance)

# performing the predictions
propensity_predict = bucks_KNN_propensity.predict(x_predict_propensity)
distance_predict = bucks_KNN_propensity.predict(x_predict_distance)

# printing out the results
print(propensity_predict)
print("Type: ", type(propensity_predict), "\n")

print(distance_predict)
print("Type: ",type(distance_predict), "\n" )

#displaying the account data frame
account_df

[466.2 731.8 518.8 ... 791.6 738.2 670.2]
Type:  <class 'numpy.ndarray'> 

[518.8 804.4 593.4 ... 877.6 720.6 692.8]
Type:  <class 'numpy.ndarray'> 



Unnamed: 0,Season,AccountNumber,SingleGameTickets,PartialPlanTickets,GroupTickets,STM,AvgSpend,GamesAttended,FanSegment,DistanceToArena,BasketballPropensity,SocialMediaEngagement,BasketballPropenstiyFill,DistanceToArenaFill
0,2023,1,0,0,0,0,467.00,0,F,12.0,872.0,1,872.0,12.0
1,2023,2,2,0,0,0,116.00,1,A,47.0,485.0,1,485.0,47.0
2,2023,3,3,0,0,0,107.00,1,B,6.0,896.0,1,896.0,6.0
3,2023,4,0,0,3,0,27.00,1,C,3.0,467.0,3,467.0,3.0
4,2023,5,0,0,2,0,14.00,1,A,4.0,582.0,2,582.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44206,2024,43025,2,0,0,0,2.00,1,A,26.0,290.0,3,290.0,26.0
44207,2024,43026,0,0,3,0,6.34,1,D,6.0,266.0,2,266.0,6.0
44208,2024,43027,0,0,6,0,41.00,1,Limited Data,9.0,392.0,3,392.0,9.0
44209,2024,43028,2,0,0,0,68.00,1,A,6.0,898.0,3,898.0,6.0


In [15]:
# finding the total number of elements including missing values
total_propensity_knn = len(account_df['BasketballPropenstiyFill'])
print("Total is: ", total_propensity_knn)

total_distance_knn = len(account_df['DistanceToArenaFill'])
print("Total is: ", total_distance_knn)

missing_values_knn_propensity = account_df['BasketballPropenstiyFill'].isna().sum()
print("Missing Values: ", missing_values_knn_propensity)

missing_values_knn_distance = account_df['DistanceToArenaFill'].isna().sum()
print("Missing Values: ", missing_values_knn_distance)

# finding the length of the numpy array for propensity
total_values_propensity = propensity_predict.size
print("Total Elements of Numpy Array: ", total_values_propensity)

# finding the length of the numpy array for distance
total_values_distance = distance_predict.size
print("Total Elements of Numpy Array: ", total_values_distance)

# viewing the data frame
account_df

Total is:  44211
Total is:  44211
Missing Values:  6997
Missing Values:  3123
Total Elements of Numpy Array:  6997
Total Elements of Numpy Array:  3123


Unnamed: 0,Season,AccountNumber,SingleGameTickets,PartialPlanTickets,GroupTickets,STM,AvgSpend,GamesAttended,FanSegment,DistanceToArena,BasketballPropensity,SocialMediaEngagement,BasketballPropenstiyFill,DistanceToArenaFill
0,2023,1,0,0,0,0,467.00,0,F,12.0,872.0,1,872.0,12.0
1,2023,2,2,0,0,0,116.00,1,A,47.0,485.0,1,485.0,47.0
2,2023,3,3,0,0,0,107.00,1,B,6.0,896.0,1,896.0,6.0
3,2023,4,0,0,3,0,27.00,1,C,3.0,467.0,3,467.0,3.0
4,2023,5,0,0,2,0,14.00,1,A,4.0,582.0,2,582.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44206,2024,43025,2,0,0,0,2.00,1,A,26.0,290.0,3,290.0,26.0
44207,2024,43026,0,0,3,0,6.34,1,D,6.0,266.0,2,266.0,6.0
44208,2024,43027,0,0,6,0,41.00,1,Limited Data,9.0,392.0,3,392.0,9.0
44209,2024,43028,2,0,0,0,68.00,1,A,6.0,898.0,3,898.0,6.0


In [16]:
# boolean masking for missing values in 'BasketballPropenstiyFill'
missing_age_mask_propensity = account_df['BasketballPropenstiyFill'].isna()

# boolean masking for missing values in 'DistancePropensityFill'
missing_age_mask_distance = account_df['DistanceToArenaFill'].isna()

# filling in missing values using the numpy array for propensity
account_df.loc[missing_age_mask_propensity, 'BasketballPropenstiyFill'] = propensity_predict

# filling in missing values using the numpy array for distance
account_df.loc[missing_age_mask_distance, 'DistanceToArenaFill'] = distance_predict

# checking the dataset again for missing values
new_missing_value_propensity = account_df['BasketballPropenstiyFill'].isna().sum()
print("Missing Values Propensity: ", new_missing_value_propensity)

# checking the dataset again for missin values
new_missing_value_distance = account_df['DistanceToArenaFill'].isna().sum()
print("Missing Values Distance: ", new_missing_value_distance)

# viewing the new data frame
account_df

Missing Values Propensity:  0
Missing Values Distance:  0


Unnamed: 0,Season,AccountNumber,SingleGameTickets,PartialPlanTickets,GroupTickets,STM,AvgSpend,GamesAttended,FanSegment,DistanceToArena,BasketballPropensity,SocialMediaEngagement,BasketballPropenstiyFill,DistanceToArenaFill
0,2023,1,0,0,0,0,467.00,0,F,12.0,872.0,1,872.0,12.0
1,2023,2,2,0,0,0,116.00,1,A,47.0,485.0,1,485.0,47.0
2,2023,3,3,0,0,0,107.00,1,B,6.0,896.0,1,896.0,6.0
3,2023,4,0,0,3,0,27.00,1,C,3.0,467.0,3,467.0,3.0
4,2023,5,0,0,2,0,14.00,1,A,4.0,582.0,2,582.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44206,2024,43025,2,0,0,0,2.00,1,A,26.0,290.0,3,290.0,26.0
44207,2024,43026,0,0,3,0,6.34,1,D,6.0,266.0,2,266.0,6.0
44208,2024,43027,0,0,6,0,41.00,1,Limited Data,9.0,392.0,3,392.0,9.0
44209,2024,43028,2,0,0,0,68.00,1,A,6.0,898.0,3,898.0,6.0
