# Imputations with KNN
Welcome to the Jupyter Notebook for the Milwaukee Bucks Hackathon! In this notebook, we impute the missing values in the Accounts Dataset, in particular those in `BasketballPropensity` and `DistanceToArena`.

In [1]:
# all the necessary imports

# importing pandas
import pandas as pd

# importing the IPython library
from IPython.display import display

In [None]:
# convert each Excel workbook into a .csv file
account_xlsx  = r"C:\GitHub\BucksHackathon25\BucksAssignment\Hackathon Prompt 1 Data\Prompt1AccountLevel.xlsx"
account_csv = r"C:\GitHub\BucksHackathon25\BucksDatasets\AccountLevel.csv"

game_xlsx = r"C:\GitHub\BucksHackathon25\BucksAssignment\Hackathon Prompt 1 Data\Prompt1GameLevel.xlsx"
game_csv = r"C:\GitHub\BucksHackathon25\BucksDatasets\GameLevel.csv"

seat_xlsx = r"C:\GitHub\BucksHackathon25\BucksAssignment\Hackathon Prompt 1 Data\Prompt1SeatLevel.xlsx"
seat_csv = r"C:\GitHub\BucksHackathon25\BucksDatasets\SeatLevel.csv"

# reading the excel file
account_df = pd.read_excel(account_xlsx, engine='openpyxl')
game_df = pd.read_excel(game_xlsx, engine="openpyxl")
seat_df = pd.read_excel(seat_xlsx, engine="openpyxl")

# saving as a csv file
account_df.to_csv(account_csv, index=False)
game_df.to_csv(game_csv, index=False)
seat_df.to_csv(seat_csv, index=False)

In [None]:
# getting a summary of the account data frames
print("Using .info() on the Account Dataset")
# .info() prints its summary directly and returns None, so no display() is needed
account_df.info()

print("Using .describe() on the Account Dataset")
display(account_df.describe())

In [None]:
# getting a summary of the game level data frames
print("Using .info() on the Game Level Dataset")
# .info() prints its summary directly and returns None, so no display() is needed
game_df.info()

print("Using .describe() on the Game Level Dataset")
display(game_df.describe())

In [None]:
# getting a summary of the seat level data frames
print("Using .info() on the Seat Level Dataset")
# .info() prints its summary directly and returns None, so no display() is needed
seat_df.info()

print("Using .describe() on the Seat Level Dataset")
display(seat_df.describe())

In [None]:
# checking the number of null values in the account data frames
print("Any null values in Account: ", account_df.isnull().values.any())

# number of null values in the data frame
print("Number of Null Values in Account: \n", account_df.isnull().sum())


In [None]:
# checking the number of null values in the game level data frames
print("Any null values in Game Level: ", game_df.isnull().values.any())

# number of null values in the data frame
print("Number of Null Values in Game Level:\n ", game_df.isnull().sum())


In [None]:
# checking the number of null values in the seat level data frames
print("Any null values in Seat Level: ", seat_df.isnull().values.any())

# number of null values in the data frame
print("Number of Null Values in Seat Level: \n", seat_df.isnull().sum())

In [None]:
# viewing the final datasets
display(account_df.head(5))
display(game_df.head(5))
display(seat_df.head(5))

Now that we have examined the datasets, we can see that `DistanceToArena` and `BasketballPropensity` contain missing values. We handle them by training a KNN model on the rows where each column is present and using its predictions to fill in the rows where it is missing.
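
As a minimal sketch of the idea, on made-up data rather than the Accounts Dataset: rows where the target is present train a `KNeighborsRegressor`, and its predictions fill the rows where the target is missing. The column names `feature_a`, `feature_b`, and `target` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# toy data (hypothetical values, not taken from the Accounts Dataset)
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
    "feature_b": [1.0, 1.5, 2.0, 8.0, 9.0, 9.5],
    "target": [10.0, 12.0, np.nan, 50.0, np.nan, 54.0],
})
features = ["feature_a", "feature_b"]

# split into rows with a known target and rows with a missing target
known = toy[toy["target"].notna()]
unknown = toy[toy["target"].isna()]

# train on the known rows, then predict the missing ones
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(known[features], known["target"])

# keep the original column and write the predictions into a copy
toy["target_fill"] = toy["target"]
toy.loc[unknown.index, "target_fill"] = knn.predict(unknown[features])
print(toy)
```

Each filled value is the mean target of the two nearest rows with a known target, so the original `target` column stays untouched and the imputed values live in `target_fill`.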

In [None]:
# mapping the categorical values in SocialMediaEngagement
mapping = {'Low': 1, 'Medium': 2, 'High': 3}
account_df['SocialMediaEngagement'] = account_df['SocialMediaEngagement'].map(mapping)
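
One thing worth checking after this step: `Series.map` with a dictionary returns `NaN` for any value that is not a key, and `KNeighborsRegressor` will not accept `NaN` in its features. The sketch below uses a made-up Series (the lowercase `'medium'` and the `None` entry are assumptions for illustration, not values observed in the dataset) to show how unmapped entries can be spotted.

```python
import pandas as pd

mapping = {'Low': 1, 'Medium': 2, 'High': 3}

# hypothetical raw values, including one misspelled label and one missing entry
levels = pd.Series(['Low', 'High', 'medium', None])

# values that are not keys of the mapping become NaN
encoded = levels.map(mapping)

# list the original entries that did not map to a number
unmapped = levels[encoded.isna()]
print("Unmapped entries:", unmapped.tolist())
print("Number unmapped: ", int(encoded.isna().sum()))
```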

In [None]:
# creating a new column in the data set
account_df['BasketballPropenstiyFill'] = account_df['BasketballPropensity']
account_df['DistanceToArenaFill'] = account_df['DistanceToArena']

# importing the KNN regressor
from sklearn.neighbors import KNeighborsRegressor

In [None]:
# new data frame with no missing values from basketball propensity and distance to arena
data_drop_propensity_na = account_df[account_df['BasketballPropensity'].notna()]
data_drop_distance_na = account_df[account_df['DistanceToArena'].notna()]

# new data frame with missing values from basketball propensity and distance to arena
data_propensity_na = account_df[account_df['BasketballPropensity'].isna()]
data_distance_na = account_df[account_df['DistanceToArena'].isna()]

# length cross-check for basketball propensity
bucks_propensity_length = len(account_df['BasketballPropensity'])
drop_propensity_length = len(data_drop_propensity_na)
missing_propensity_length = len(data_propensity_na)

# length cross check for distance to arena
bucks_distance_length = len(account_df['DistanceToArena'])
drop_distance_length = len(data_drop_distance_na)
missing_distance_length = len(data_distance_na)

# using an if condition to check that the two subsets add up to the full length for basketball propensity
if bucks_propensity_length == (drop_propensity_length + missing_propensity_length):
    print("Valid!")
else:
    print("Not Valid!")

# using an if condition to check that the two subsets add up to the full length for distance to arena
if bucks_distance_length == (drop_distance_length + missing_distance_length):
    print("Valid!")
else:
    print("Not Valid!")

# viewing the data frame with no missing values for basketball propensity
print("No Missing Values for Basketball Propensity")
display(data_drop_propensity_na.head(3))
print("\n")

# viewing the data frame with no missing values for distance to arena
print("No Missing Values for Distance to Arena")
display(data_drop_distance_na.head(3))
print("\n")

# viewing the data frame with missing values for basketball propensity
print("Missing Values for Basketball Propensity")
display(data_propensity_na.head(3))
print("\n")

# viewing the data frame with missing values for distance to arena
print("Missing Values for Distance to Arena")
display(data_distance_na.head(3))
print("\n")

# displaying the account data frame
print("Displaying the Data Frame for Account")
account_df

In [None]:
# the training dataset and the response variable for basketball propensity
x_train_propensity = data_drop_propensity_na[['AvgSpend', 'SingleGameTickets', 'STM', 'GamesAttended', 
                                              'SocialMediaEngagement']]
y_train_propensity = data_drop_propensity_na['BasketballPropensity']

# the feature to predict for basketball propensity
x_predict_propensity = data_propensity_na[['AvgSpend', 'SingleGameTickets', 'STM', 'GamesAttended', 
                                           'SocialMediaEngagement']]

# the training dataset and the response variable for distance to arena
x_train_distance = data_drop_distance_na[['AvgSpend', 'SingleGameTickets', 'STM', 'GamesAttended', 
                                          'SocialMediaEngagement']]
y_train_distance = data_drop_distance_na['DistanceToArena']

# the feature to predict for distance to arena
x_predict_distance = data_distance_na[['AvgSpend', 'SingleGameTickets', 'STM', 'GamesAttended', 
                                           'SocialMediaEngagement']]

# displaying the data frame
account_df

In [None]:
# instantiating one KNN model per target column, each with 5 neighbors
bucks_KNN_propensity = KNeighborsRegressor(n_neighbors=5)
bucks_KNN_distance = KNeighborsRegressor(n_neighbors=5)

# fitting the model with the training data
bucks_KNN_propensity.fit(x_train_propensity, y_train_propensity)
bucks_KNN_distance.fit(x_train_distance, y_train_distance)

# performing the predictions
propensity_predict = bucks_KNN_propensity.predict(x_predict_propensity)
# the distance predictions must come from the model fitted on DistanceToArena
distance_predict = bucks_KNN_distance.predict(x_predict_distance)

# printing out the results
print(propensity_predict)
print("Type: ", type(propensity_predict), "\n")

print(distance_predict)
print("Type: ",type(distance_predict), "\n" )

#displaying the account data frame
account_df
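
KNN measures closeness with raw feature distances, so a feature with a much larger numeric range can dominate the neighbor search. One common option, which this notebook does not apply, is to standardize the features before fitting. The following is a sketch on synthetic data only; the names `spend` and `engagement` and all values are assumptions for illustration, and whether scaling improves the imputation for the Accounts Dataset has not been tested here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# synthetic features: one with a large range, one with a small range
spend = rng.uniform(0, 5000, size=50)
engagement = rng.integers(1, 4, size=50)  # values 1, 2, or 3
X = np.column_stack([spend, engagement])

# synthetic target that depends on the small-range feature
y = engagement * 0.2 + rng.normal(0, 0.01, size=50)

# the scaler is fitted on the training features and reused at prediction time
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
scaled_knn.fit(X, y)

pred = scaled_knn.predict(X[:3])
print(pred)
```

Wrapping both steps in one pipeline means the same scaling learned during `fit` is applied automatically inside `predict`.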

In [None]:
# finding the total number of elements including missing values
total_propensity_knn = len(account_df['BasketballPropenstiyFill'])
print("Total is: ", total_propensity_knn)

total_distance_knn = len(account_df['DistanceToArenaFill'])
print("Total is: ", total_distance_knn)

missing_values_knn_propensity = account_df['BasketballPropenstiyFill'].isna().sum()
print("Missing Values: ", missing_values_knn_propensity)

missing_values_knn_distance = account_df['DistanceToArenaFill'].isna().sum()
print("Missing Values: ", missing_values_knn_distance)

# finding the length of the numpy array for propensity
total_values_propensity = propensity_predict.size
print("Total Elements of Numpy Array: ", total_values_propensity)

# finding the length of the numpy array for distance
total_values_distance = distance_predict.size
print("Total Elements of Numpy Array: ", total_values_distance)

# viewing the data frame
account_df

In [None]:
# boolean mask for missing values in 'BasketballPropenstiyFill'
missing_mask_propensity = account_df['BasketballPropenstiyFill'].isna()

# boolean mask for missing values in 'DistanceToArenaFill'
missing_mask_distance = account_df['DistanceToArenaFill'].isna()

# filling in missing values using the numpy array for propensity
account_df.loc[missing_mask_propensity, 'BasketballPropenstiyFill'] = propensity_predict

# filling in missing values using the numpy array for distance
account_df.loc[missing_mask_distance, 'DistanceToArenaFill'] = distance_predict

# checking the dataset again for missing values
new_missing_value_propensity = account_df['BasketballPropenstiyFill'].isna().sum()
print("Missing Values Propensity: ", new_missing_value_propensity)

# checking the dataset again for missing values
new_missing_value_distance = account_df['DistanceToArenaFill'].isna().sum()
print("Missing Values Distance: ", new_missing_value_distance)

# viewing the new data frame
account_df

### Conversion to `.csv`
Lastly, we save the data frame, including the two filled columns, as a `.csv` file.

In [None]:
# converting the data frame into a .csv file
account_df.to_csv('accounts_knn.csv', index=False)