
# **Athlete Forecast**

This project simplifies the college football recruitment process by analyzing a high school player's inputted statistics: height, weight, rating, and stars. Easy Recruit outputs existing collegiate players who had similar attributes at the time of their own college recruitment.

For this portion of our project, we will be focusing on the Wide Receiver (WR) position.

**High School Recruits Statistics Dataset**

This dataset was exported from the "College Football Data" website ("https://collegefootballdata.com/exporter/recruiting/players?year=2020&classification=HighSchool&position=WR"). We merged the HS recruitment datasets from 2019-2023 to gather the 4 years of recruiting that make up the current 2023 NCAA players.
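
The merge itself is not shown in this notebook. A minimal sketch of how per-year exports could be stacked with pandas, assuming every year's export shares the same columns. The two tiny in-memory frames, the player names, and the `Year` column are hypothetical stand-ins for the real CSV files:

```python
import pandas as pd

# Hypothetical stand-ins for the per-year exports; the real files would be
# loaded with pd.read_csv, one call per recruiting year.
yearly_frames = {
    2019: pd.DataFrame({"Full Name": ["Player A"], "Height": [72.0],
                        "Weight": [185.0], "Stars": [3], "Rating": [0.85]}),
    2020: pd.DataFrame({"Full Name": ["Player B"], "Height": [74.0],
                        "Weight": [200.0], "Stars": [4], "Rating": [0.91]}),
}

# Tag each frame with its recruiting year, then stack them into one table.
merged = pd.concat(
    [df.assign(Year=year) for year, df in yearly_frames.items()],
    ignore_index=True,
)
print(merged)
```

Using `ignore_index=True` gives the merged table a fresh 0..n-1 index, which matters later when row positions are used to look players up.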

The most important attributes provided for each athlete are:

1.   Height
2.   Weight
3.   Stars (1-5)
4.   Rating

In [None]:
import pandas as pd
import numpy as np

In [None]:
url = 'https://raw.githubusercontent.com/jeslyn-guo/Hacklytics2024/main/HSRecruitStats.csv'
hsDf = pd.read_csv(url)
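#Drop columns that are not used in the comparison
hsDf = hsDf.drop(columns=['Name','Unnamed: 6','Unnamed: 11','Id','AthleteId'])
#Drop the rows at index positions 1973 through 2475
hsDf = hsDf.drop(hsDf.index[1973:2476])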

# **How Our Application Works**


1.   A college recruiter inputs the height, weight, stars, and rating of a high school prospect.
2.   Easy Recruit compares these 4 criteria against a dataset of ~2000 Wide Receivers from the 2023 season.
3.   Easy Recruit outputs the collegiate players who most resembled the prospect, ranked in two ways:
*   with height & weight as the stronger factor
*   with stars & rating as the stronger factor
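
The steps above can be condensed into one function. This is only a sketch under stated assumptions: `closest_players` and its parameters are hypothetical names, the toy data is the six rows visible in this notebook's printed array, and the weight vectors are illustrative multipliers rather than a tuned choice.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def closest_players(dataset, prospect, weights, k=3):
    """Return the row indices of the k dataset rows closest to the prospect."""
    # Standardize dataset and prospect together so they share one scale.
    stacked = np.vstack([dataset, prospect])
    scaled = StandardScaler().fit_transform(stacked)
    # Multiply each column by its weight; larger weights matter more.
    weighted = scaled * np.asarray(weights, dtype=float)
    # Distance from the prospect (last row) to every dataset row.
    distances = np.linalg.norm(weighted[:-1] - weighted[-1], axis=1)
    return np.argsort(distances)[:k]

# Columns: height (in), weight (lb), stars, rating.
toy = np.array([
    [69, 172, 5, 0.9971],
    [74, 210, 5, 0.9895],
    [72, 170, 5, 0.9872],
    [76, 180, 1, 0.6996],
    [72, 165, 3, 0.8499],
    [71, 175, 3, 0.8104],
])
prospect = np.array([73, 200, 4, 0.951])

by_physique = closest_players(toy, prospect, weights=[8, 8, 2, 2])
by_skill = closest_players(toy, prospect, weights=[1, 1, 3, 6])
print(by_physique, by_skill)
```

Because the weights are applied to a fresh array inside each call, the two rankings cannot interfere with each other.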



In [None]:
#Transform variables needed for calculations into arrays
stat_columns = ['Full Name','Height','Weight','Stars','Rating']
array_selected = hsDf[stat_columns[1:5]]
array_selected_with_name = hsDf[stat_columns]
hs_stats_array = array_selected.to_numpy()
hs_stats_array_with_name = array_selected_with_name.to_numpy()
print(hs_stats_array)

[[ 69.     172.       5.       0.9971]
 [ 74.     210.       5.       0.9895]
 [ 72.     170.       5.       0.9872]
 ...
 [ 76.     180.       1.       0.6996]
 [ 72.     165.       3.       0.8499]
 [ 71.     175.       3.       0.8104]]


**Using a Scaler to normalize the data**

To prepare the data for comparison, we standardize the height, weight, stars, and rating values so that each feature is on the same scale and contributes equally to the distance. We do so with the StandardScaler class from the sklearn.preprocessing package and its fit_transform method.

For comparison, view the printed arrays above and below.
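
As a quick check of what the scaler does, the sketch below compares `StandardScaler` against the manual computation (value minus column mean, divided by column standard deviation). The three sample rows are taken from the printed array above; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three rows from the printed array: height, weight, stars, rating.
sample = np.array([
    [69, 172, 5, 0.9971],
    [76, 180, 1, 0.6996],
    [72, 165, 3, 0.8499],
])

scaled = StandardScaler().fit_transform(sample)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for std().
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and standard deviation 1, so a one-inch difference in height no longer gets drowned out by a ten-pound difference in weight.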

In [None]:
#Normalize values in array (height, weight, stars, rating)
from sklearn.preprocessing import StandardScaler

#Create a Scaler object
scaler = StandardScaler()

#Standardize to proceed in Euclidean Distance calculations
standardized_data = scaler.fit_transform(array_selected)
print(standardized_data)

[[-1.66690101 -0.68391064  3.35462523  2.96745979]
 [ 0.50170253  2.02449397  3.35462523  2.82000864]
 [-0.36573889 -0.82645825  3.35462523  2.77538527]
 ...
 [ 1.36914394 -0.11372019 -3.31910865 -2.80447665]
 [-0.36573889 -1.18282727  0.01775829  0.11156384]
 [-0.79945959 -0.47008922  0.01775829 -0.6547941 ]]


# **Euclidean Distance Calculations**

We find the collegiate athletes that most resemble a prospective athlete by computing the Euclidean distance between the prospect and every player in the dataset.

We first give the categories different weights. This caters to a recruiter's preference for athletes with either a more similar physique (height & weight) or more similar stars and ratings.
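
Written as a formula, multiplying column $j$ of the standardized data by a multiplier $w_j$ before taking the Euclidean norm gives a weighted distance between the prospect $p$ and a dataset player $q$. The symbols $z$, $w$, $p$, and $q$ are notation introduced here for illustration:

$$
d_w(p, q) = \sqrt{\sum_{j=1}^{4} w_j^{2}\,\left(z_{p,j} - z_{q,j}\right)^{2}}
$$

Here $z_{p,j}$ is the standardized value of feature $j$ (height, weight, stars, rating). Since the multiplier is squared, a column scaled by 3 counts 9 times as much in the squared distance as an unscaled column, and a column scaled by 6 counts 36 times as much.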

In [None]:
#Player Data Input
appendedArr = np.array([73,200,4,.951])

heightIndex = 0
weightIndex = 1
starIndex = 2
ratingIndex = 3

#Append input data to dataset to normalize values
result = np.vstack([array_selected, appendedArr])
normalizedRes = scaler.fit_transform(result)

#Use independent copies so the two weightings below do not affect each other
skillRes = normalizedRes.copy()
heightWeightRes = normalizedRes.copy()

#Skill Calculations
#weight the star data by 3 and the rating data by 6
skillRes[:, starIndex] *= 3
skillRes[:, ratingIndex] *= 6

#split off the input player (last row) from the dataset rows
normalizedSkill = skillRes[-1]
skillRes = skillRes[:-1]

Skill_distances = np.linalg.norm(normalizedSkill - skillRes, axis=1)
skill_indeces = np.argsort(Skill_distances)[:10]

#Height and Weight Calculations
heightWeightRes[:, heightIndex] *= 8
heightWeightRes[:, weightIndex] *= 8
heightWeightRes[:, starIndex] *= 2
heightWeightRes[:, ratingIndex] *= 2

#split off the input player (last row) from the dataset rows
normalizedHeightWeight = heightWeightRes[-1]
heightWeightRes = heightWeightRes[:-1]

HW_distances = np.linalg.norm(normalizedHeightWeight - heightWeightRes, axis=1)
HW_indeces = np.argsort(HW_distances)[:10]

print(skill_indeces)
#names_by_skill = hs_stats_array_with_name[skill_indeces][0]
#print(names_by_skill)

print(HW_indeces)
#names_by_HW = hs_stats_array_with_name[HW_indeces][0]
#print(names_by_HW)

[1493  600  602  935  934  603  601   18 1489   15]
[1493  602  603  600  593   23  604  948 1500  929]
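
To turn index arrays like these into names, the row indices and the column index need to go in the same subscript. An expression of the form `arr[idx][0]`, as in the commented-out lines above, returns the first selected row instead of the name column. A small sketch with a hypothetical roster:

```python
import numpy as np

# Hypothetical roster: name, height, weight.
roster = np.array([
    ["Player A", 69, 172],
    ["Player B", 74, 210],
    ["Player C", 72, 170],
], dtype=object)

top = np.array([2, 0])  # indices as returned by np.argsort(...)[:k]

first_row = roster[top][0]  # the whole row for index 2
names = roster[top, 0]      # the name column for every selected row
print(first_row)
print(names)
```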


In [None]:
ESPN = 'https://raw.githubusercontent.com/jeslyn-guo/Hacklytics2024/main/ESPN%20WR%20Stats.csv'
ESPNdf = pd.read_csv(ESPN)
ESPNdf = ESPNdf.drop(columns=['First Name', 'Last name'])

#Closest player by stars & rating who also appears in the ESPN stats
for index in skill_indeces:
  statsPlayer = hs_stats_array_with_name[index][0]
  player_stats_skill = ESPNdf[ESPNdf['Name'] == statsPlayer]
  if player_stats_skill.empty:
    print(statsPlayer + ' not in csv, moving onto next')
    continue
  print(player_stats_skill)
  name = player_stats_skill.iloc[0,1]
  receivingYD = player_stats_skill.iloc[0,3]  #YDS column
  TD = player_stats_skill.iloc[0,6]
  print(receivingYD)
  print(name)
  print(TD)
  break
else:
  #for/else: runs only when none of the candidates was found in the csv
  statsPlayer = None
  print('Your inputted player data did not resemble any collegiate players with regards to rating & stars.')

#Closest player by height & weight who also appears in the ESPN stats
for index in HW_indeces:
  hWPlayer = hs_stats_array_with_name[index][0]
  player_stats_HW = ESPNdf[ESPNdf['Name'] == hWPlayer]
  if player_stats_HW.empty:
    print(hWPlayer + ' not in csv, moving onto next')
    continue
  break
else:
  hWPlayer = None
  print('Your inputted player data did not resemble any collegiate players with regards to height & weight.')

#Report only the matches that were actually found
if statsPlayer is not None and statsPlayer == hWPlayer:
  print(hWPlayer + ' is a match, based upon similarities in all categories.')
  print(player_stats_skill)
else:
  if statsPlayer is not None:
    print(statsPlayer + ' is a match, based upon similarities in stars and ratings.')
    print(player_stats_skill)
  if hWPlayer is not None:
    print(hWPlayer + ' is a match, based upon similarities in height and weight.')
    print(player_stats_HW)

David Bell not in csv, moving onto next
          RK         Name  REC YDS   AVG   LNG   TD
1047  1042.0  Destyn Hill  6.0  87  14.5  30.0  0.0
87
Destyn Hill
0.0
David Bell not in csv, moving onto next
Destyn Hill is a match, based upon similarities in stars and ratings.
          RK         Name  REC YDS   AVG   LNG   TD
1047  1042.0  Destyn Hill  6.0  87  14.5  30.0  0.0
Jalil Farooq is a match, based upon similarities in height and weight.
        RK          Name   REC  YDS   AVG   LNG   TD
112  113.0  Jalil Farooq  45.0  694  15.4  49.0  2.0
