<a href="https://colab.research.google.com/github/dathomas1/NcaaTool/blob/master/HS_Football_Recruits.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Purpose
Identify trends in what makes a 5-star high school football recruit. I will consider height, weight, and hometown. I will also track which colleges recruits attend and who makes it to the NFL, with links to their Pro-Football-Reference pages.

## Import Data

In [87]:
import pandas as pd

# Range of recruit years
RECRUIT_YEAR_START = 2000
RECRUIT_YEAR_END = 2019

# URL template for the high school recruit data on GitHub ({0} is the recruit year)
url = "https://raw.githubusercontent.com/dathomas1/NcaaTool/master/CSV/HighSchool%20-%20{0}%20Player%20Rankings.csv"

def get_recruit_filenames(start, end, url, skip=()):
  # Build one CSV URL per recruit year, leaving out any year listed in skip
  filenames = []
  for year in range(start, end + 1):
    if year in skip:
      continue
    filenames.append(url.format(year))

  return filenames

def dataframe_from_filenames(filenames):
  dataframes = []
  for f in filenames:
    dataframe = pd.read_csv(f)
    dataframe['source'] = f
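    # dropna returns a new frame, so keep the result (rows need at least 4 non-null values)
    dataframe = dataframe.dropna(thresh=4)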
    dataframes.append(dataframe)

  # Concatenate once, after every file has been read
  combined_dataframe = pd.concat(dataframes, axis=0, ignore_index=True)

  return combined_dataframe

hs_filenames = get_recruit_filenames(RECRUIT_YEAR_START, RECRUIT_YEAR_END, url)

# Recruit data frame
df = dataframe_from_filenames(hs_filenames)



In [85]:
# View Data so far
df.head()

Unnamed: 0,ranking,name,highSchool,position,height,weight,stars,rating,college,source
0,1,D.J. Williams,"De La Salle (Concord, CA)",ILB,6-2,235,5.0,0.9998,Miami,https://raw.githubusercontent.com/dathomas1/Nc...
1,2,Brock Berlin,"Evangel Christian Academy (Shreveport, LA)",PRO,6-2,190,5.0,0.9998,Florida,https://raw.githubusercontent.com/dathomas1/Nc...
2,3,Charles Rogers,"Saginaw (Saginaw, MI)",WR,6-4,195,5.0,0.9988,Michigan State,https://raw.githubusercontent.com/dathomas1/Nc...
3,4,Travis Johnson,"Notre Dame (Sherman Oaks, CA)",SDE,6-4,265,5.0,0.9982,Florida State,https://raw.githubusercontent.com/dathomas1/Nc...
4,5,Marcus Houston,"Thomas Jefferson (Denver, CO)",RB,6-0,208,5.0,0.998,Colorado,https://raw.githubusercontent.com/dathomas1/Nc...


In [34]:
# Output dataframe so far
df.to_csv('hs_recruits.csv')

## Clean Data
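While importing, I noticed that the 2018 CSV shifts every value one column over. This happens because the 2018 file does not wrap string values in quotation marks, so the comma inside a high school entry such as "Cartersville (Cartersville, GA)" is read as a column separator.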

In [75]:
df.query("ranking == 'Trevor Lawrence'")

Unnamed: 0,ranking,name,highSchool,position,height,weight,stars,rating,college,source
43051,Trevor Lawrence,Cartersville (Cartersville,GA),PRO,6-6,208,5.0,0.9999,Clemson,https://raw.githubusercontent.com/dathomas1/Nc...
46540,Trevor Lawrence,Whitehouse (Whitehouse,TX),SDE,6-5,245,2.0,0.7497,uncommitted,https://raw.githubusercontent.com/dathomas1/Nc...
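
The shift can be reproduced on a tiny in-memory example. The sketch below is only an illustration: the two CSV strings are made-up stand-ins for the real files, and the names quoted_csv and unquoted_csv are hypothetical. When a data row has one more field than the header, pandas treats the first field as the index, which would explain why the player name ends up under ranking.

```python
import io
import pandas as pd

# Made-up one-row samples; the real files have more columns
quoted_csv = (
  'ranking,name,highSchool,position\n'
  '1,Trevor Lawrence,"Cartersville (Cartersville, GA)",PRO\n'
)
unquoted_csv = (
  'ranking,name,highSchool,position\n'
  '1,Trevor Lawrence,Cartersville (Cartersville, GA),PRO\n'
)

# Quoted: the comma inside highSchool stays part of the value
quoted_df = pd.read_csv(io.StringIO(quoted_csv))

# Unquoted: the row has five fields for four headers, so pandas
# uses the first field as the index and every value moves over
unquoted_df = pd.read_csv(io.StringIO(unquoted_csv))

print(quoted_df)
print(unquoted_df)
```

In the unquoted frame, the ranking column holds the player name and highSchool holds only the state fragment, matching the query result above.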


I will create a new dataframe without 2018 for now, clean the 2018 data, and then recombine the two.

In [80]:
url_2018 = "https://raw.githubusercontent.com/dathomas1/NcaaTool/master/CSV/HighSchool%20-%202018%20Player%20Rankings.csv"
recruits_without_2018 = df[df.source != url_2018]
print("Original Import Shape: ", df.shape)
print("Without 2018 Shape: ", recruits_without_2018.shape)

Original Import Shape:  (51113, 10)
Without 2018 Shape:  (47221, 10)


Import the 2018 CSV with an extra state column to absorb the shifted value, then display it.

In [86]:
col_names_2018 = ["ranking", "name", "highSchool", "state", "position", "height","weight","stars","rating","college"]
df_2018 = pd.read_csv(url_2018, header=0, names=col_names_2018)
df_2018["source"] = url_2018
df_2018.head()

Unnamed: 0,ranking,name,highSchool,state,position,height,weight,stars,rating,college,source
0,1.0,Trevor Lawrence,Cartersville (Cartersville,GA),PRO,6-6,208,5.0,0.9999,Clemson,https://raw.githubusercontent.com/dathomas1/Nc...
1,2.0,Justin Fields,Harrison (Kennesaw,GA),DUAL,6-3,221,5.0,0.9998,Georgia,https://raw.githubusercontent.com/dathomas1/Nc...
2,3.0,Xavier Thomas,IMG Academy (Bradenton,FL),SDE,6-3,260,5.0,0.9988,Clemson,https://raw.githubusercontent.com/dathomas1/Nc...
3,4.0,Eyabi Anoma,St. Frances Academy (Baltimore,MD),WDE,6-5,235,5.0,0.9987,Alabama,https://raw.githubusercontent.com/dathomas1/Nc...
4,5.0,Micah Parsons,Harrisburg (Harrisburg,PA),WDE,6-3,235,5.0,0.9982,Penn State,https://raw.githubusercontent.com/dathomas1/Nc...
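
One possible way to finish the cleanup and recombine is sketched below. It is not necessarily the approach the notebook takes: merge_school_and_state is a hypothetical helper, the two small frames are made-up stand-ins for df_2018 and recruits_without_2018, and the sketch assumes every 2018 row was split at the comma inside highSchool.

```python
import pandas as pd

def merge_school_and_state(frame):
  # Hypothetical helper: rejoin "School (City" and "ST)" into one highSchool value.
  # Assumes every row was split at the comma inside highSchool.
  fixed = frame.copy()
  fixed["highSchool"] = (
    fixed["highSchool"].str.strip() + ", " + fixed["state"].str.strip()
  )
  return fixed.drop(columns="state")

# Made-up stand-in for df_2018 (only a few of the real columns)
sample_2018 = pd.DataFrame({
  "ranking": [1.0],
  "name": ["Trevor Lawrence"],
  "highSchool": ["Cartersville (Cartersville"],
  "state": ["GA)"],
  "position": ["PRO"],
})

# Made-up stand-in for recruits_without_2018
sample_other_years = pd.DataFrame({
  "ranking": [1.0],
  "name": ["D.J. Williams"],
  "highSchool": ["De La Salle (Concord, CA)"],
  "position": ["ILB"],
})

# Once state is dropped the columns line up again, so concat stacks the rows
recombined = pd.concat(
  [sample_other_years, merge_school_and_state(sample_2018)],
  ignore_index=True,
)
print(recombined)
```

Applied to the notebook's own frames, the same call would take recruits_without_2018 and merge_school_and_state(df_2018). Column types such as ranking may still need converting afterwards.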
