# Technical Report
**Brianna Lytle**
**DSI-CC9_LA**


# Problem Statement
Basketball players are assigned one of five position labels that describe their role on the court. Regardless of the label, every player is expected to shoot and to contribute on both offense and defense at some point during a game.

Using unsupervised learning, my goal is to identify the types of players that are in the NBA. I will also create a recommendation system so that anyone can find players similar to the one they have in mind.

# Gathering Data
## Rough Draft Data
In the beginning, data was scraped from:

- Basketball-reference.com
- Nbaminer.com
- fivethirtyeight.com

When I started this project, I felt basketball-reference was the most useful site to scrape because of the amount of information available (multiple seasons, salaries, different stat measurements, etc.). I scraped the player pages for the 2019-2020 and 2018-2019 seasons. I included the earlier season because only a limited number of games had been played in the current one. I used stats per 100 team possessions rather than per game so that stats wouldn't be inflated for more popular players. I also scraped the salary of each player on a 2019-2020 roster.
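As a rough illustration of why per-100-possession rates are less sensitive to playing time than raw or per-game numbers, here is a minimal sketch of the scaling. The function name and the numbers are hypothetical, and basketball-reference's exact possession estimate is not reproduced here.

```python
def per_100_possessions(stat_total: float, team_possessions: float) -> float:
    """Scale a raw counting stat to a rate per 100 team possessions."""
    if team_possessions <= 0:
        raise ValueError("team_possessions must be positive")
    return 100.0 * stat_total / team_possessions


# Hypothetical totals: a starter and a reserve who score at the same rate
# while on the floor, despite very different raw point totals.
starter_rate = per_100_possessions(stat_total=1500, team_possessions=5000)
reserve_rate = per_100_possessions(stat_total=450, team_possessions=1500)
print(starter_rate, reserve_rate)
```

Both players end up with the same rate, which is the property that makes this scaling useful when comparing players with different minutes.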

Nbaminer.com contains a wide range of player statistics that weren't available on basketball-reference, including shot distance measurements and players' shot types. The site was last updated for the 2018-2019 season.

Fivethirtyeight.com rates players with its own “Raptor Rating”. According to the site’s glossary, the Raptor rating turns individual players into team talent estimates. Raptor scores take a player’s playing time into account rather than focusing on the player’s rankings in the NBA. They are constantly updated and are only available for the current season.

My rough draft data covers the current season and the previous season. The 2019-2020 season data is from basketball-reference and fivethirtyeight.com; the 2018-2019 season data is from basketball-reference and nbaminer.com.

## Rough Draft Data Cleaning
I merged the scraped dataframes together on the “Player” column. Here are the problems I faced while cleaning the data:

### Mismatching names
In a perfect world, the “Player” columns of the dataframes would merge perfectly. However, there were many mismatches in the names. Here are a few examples:

- Names that contain periods (e.g., J.J. Redick vs JJ Redick)
- Names that contain Sr. or Jr. (e.g., Kevin Porter Jr. vs Kevin Porter)
- Names that contain a numeral suffix (e.g., Kevin Knox II vs Kevin Knox)
- Names that contain accented characters (e.g., Dennis Schröder vs Dennis Schroder)

![mismatchnames](media/mismatchnames.png)
    
    
At the time, the best way for me to clean this data was manually. 
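The cleaning for this report was done by hand, but the four mismatch patterns listed above could in principle be handled by a normalization key that both dataframes are merged on. The following is only a sketch using the standard library; the function name and the suffix list are assumptions, and a real roster would need a check for two different players collapsing to the same key.

```python
import re
import unicodedata

# Suffixes treated as noise when matching names across sites (assumed list).
_SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def normalize_player_name(name: str) -> str:
    """Return a lowercase, accent-free, suffix-free key for joining on name."""
    # Decompose accented letters (e.g. "ö" -> "o" + combining mark), drop marks.
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove periods so "J.J." and "JJ" agree, then lowercase.
    cleaned = no_accents.replace(".", "").lower()
    # Replace anything that is not a letter, space, apostrophe, or hyphen.
    cleaned = re.sub(r"[^a-z\s'-]", " ", cleaned)
    # Drop generational suffixes such as "Jr." or "II".
    tokens = [tok for tok in cleaned.split() if tok not in _SUFFIXES]
    return " ".join(tokens)


pairs = [
    ("J.J. Redick", "JJ Redick"),
    ("Kevin Porter Jr.", "Kevin Porter"),
    ("Kevin Knox II", "Kevin Knox"),
    ("Dennis Schröder", "Dennis Schroder"),
]
for left, right in pairs:
    print(normalize_player_name(left) == normalize_player_name(right))
```

Each dataframe would get a new column holding this key, and the merge would use that column instead of the raw “Player” text.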

### Overaccounting for players
Player salaries were scraped separately from player statistics. I scraped salaries by going into each team page and grabbing that dataframe. This is because basketball-reference does not update their salary list. There were many players (e.g., J.R. Smith) who were waived during this past season or the preseason, but their names were still on the team salary list.

I resolved this issue by doing my own research to confirm whether each player was still on a team.

### Players with no season statistics
**Players with no current season stats:** Some players did not have 2019-2020 season statistics due to injury (e.g., Victor Oladipo) or because they had yet to be signed to a team (e.g., Andre Iguodala). Many players had no NBA statistics for the 2018-2020 seasons because they were recently drafted out of college.

- Example case (Victor Oladipo, who has not played in 2019-2020):
    - I sorted the 2018-2019 statistics of all NBA players to obtain the quartiles. Based on Oladipo’s statistics, I identified which quartile he fell in. I then took the other players in that quartile and looked at their 2019-2020 statistics. Finally, I imputed the mean of those 2019-2020 statistics as Oladipo’s numbers for the year.

**Recently drafted players:** All recently drafted players have a profile on basketball-reference. However, the data for their most recent season is not as detailed as the data for NBA players. Only basic measures were available, such as rebounds, steals, turnovers, and points. I attempted the same strategy as above.

- Example case (Zion Williamson, who has no NBA stats but has college stats for 2018-2019): I took the quartiles of the 2018-2019 NBA season data and compared them to Zion’s available college statistics. I found the quartile where Zion’s statistics fell and used the average of the players in it to fill in the measures that weren’t available on his college stats page.

However, I realized this placed players like Zion Williamson among seasoned NBA players such as Kemba Walker and DeMar DeRozan. It is possible for rookies to have great breakout seasons, but it is unlikely that all of them do. In addition, many rookies start their NBA careers on two-way contracts, which means less playing time and less opportunity to accumulate NBA statistics.
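The quartile-based imputation described above can be sketched for a single statistic as follows. This is an illustration under stated assumptions, not the code used for the report: the function names, the dictionary layout, and all player names and values are hypothetical, and the same procedure would be repeated for each statistic.

```python
from statistics import mean, quantiles


def quartile_index(value, cut_points):
    """Return 0-3: the number of quartile cut points the value exceeds."""
    return sum(value > cut for cut in cut_points)


def impute_from_quartile_peers(player, prev_season, curr_season):
    """Estimate a player's missing current-season stat from quartile peers.

    prev_season and curr_season map player name -> value of one statistic.
    The player must appear in prev_season and is assumed absent from curr_season.
    """
    # Three cut points that split the previous season's values into quartiles.
    cuts = quantiles(list(prev_season.values()), n=4)
    target_quartile = quartile_index(prev_season[player], cuts)
    # Peers: same previous-season quartile, and they have current-season data.
    peers = [
        name
        for name, value in prev_season.items()
        if name != player
        and name in curr_season
        and quartile_index(value, cuts) == target_quartile
    ]
    if not peers:
        raise ValueError("no peers with current-season data in this quartile")
    return mean(curr_season[name] for name in peers)


# Hypothetical points-per-game values for illustration only.
prev_ppg = {"A": 6, "B": 8, "C": 10, "D": 12, "E": 14, "F": 16,
            "G": 18, "H": 20, "I": 22, "J": 24, "K": 26, "Injured": 19}
curr_ppg = {"A": 7.0, "B": 8.5, "C": 9.0, "D": 13.0, "E": 15.0, "F": 15.5,
            "G": 17.0, "H": 21.0, "I": 23.0, "J": 22.0, "K": 27.0}

estimate = impute_from_quartile_peers("Injured", prev_ppg, curr_ppg)
print(estimate)
```

Because every player in a quartile receives the same peer average, this kind of imputation flattens individual differences, which is consistent with the problem noted above for rookies.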

Due to a time crunch, I decided instead to scrape the stats.nba.com site using Selenium. I scraped the following datasets:
- Traditional
- Advanced
- Miscellaneous 
- Scoring
- Usage
- Opponent
- Defense

## Official Model Data 
The final dataset consists of 460 players and 81 measurements from:
- stats.nba.com player measures
- FiveThirtyEight (538) Raptor ratings
- Basketball-reference current season salaries


# Flask App Recommendation System
There are four different recommendation systems: Offense, Defense, Shooting, and Overall.
Each of these recommendation systems was built on the features that fall into its category. The Offense recommendation system is built around features such as field goal %, assists, and points. The Defense recommendation system is built on several features relating to rebounds, blocks, and steals. The Shooting recommendation system takes into account how a player scores points. These features include the percentage of shots that were assisted vs unassisted, where on the court the shot was taken, and what kinds of shots the player attempts. The Overall recommendation system takes into account features from each of the other categories along with ratings such as VORP scores and the amount of Fantasy points a player has earned so far.

The Flask app will have an input field where a user can type in a player, and the result will show the three closest players depending on which “button” the user clicks. The defense button will return the top three players with a defensive style similar to the input player's. The offense button will return the top three players with a similar offensive style. The shooting button will return the top three players with similar shooting styles. The overall button will return the top three players that are similar overall to the input player.
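The lookup behind each button can be sketched as a nearest-neighbor search over the features of the chosen category. The report does not state which similarity measure is used, so the sketch below assumes cosine similarity over z-scored features. The feature names, the `FEATURE_GROUPS` mapping, and the player rows are all hypothetical.

```python
import math

# Hypothetical feature groups; the real app uses many more features per category.
FEATURE_GROUPS = {
    "offense": ["fg_pct", "assists", "points"],
    "defense": ["rebounds", "blocks", "steals"],
}


def _standardize(rows, features):
    """Z-score each feature across players so no stat dominates by scale."""
    scaled = {name: [] for name in rows}
    for feature in features:
        values = [stats[feature] for stats in rows.values()]
        mu = sum(values) / len(values)
        sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values)) or 1.0
        for name, stats in rows.items():
            scaled[name].append((stats[feature] - mu) / sigma)
    return scaled


def _cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(player, category, rows, top_n=3):
    """Return the top_n players most similar to `player` for one category."""
    vectors = _standardize(rows, FEATURE_GROUPS[category])
    target = vectors[player]
    scores = [
        (other, _cosine(target, vec))
        for other, vec in vectors.items()
        if other != player
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in scores[:top_n]]


# Made-up stat lines for illustration only.
players = {
    "Guard A": {"fg_pct": 0.45, "assists": 8.0, "points": 25.0,
                "rebounds": 4.0, "blocks": 0.3, "steals": 1.8},
    "Guard B": {"fg_pct": 0.44, "assists": 7.5, "points": 24.0,
                "rebounds": 4.5, "blocks": 0.4, "steals": 1.6},
    "Center A": {"fg_pct": 0.60, "assists": 1.5, "points": 14.0,
                 "rebounds": 12.0, "blocks": 2.2, "steals": 0.6},
    "Center B": {"fg_pct": 0.58, "assists": 2.0, "points": 15.0,
                 "rebounds": 11.0, "blocks": 2.0, "steals": 0.7},
    "Wing A": {"fg_pct": 0.50, "assists": 4.0, "points": 19.0,
               "rebounds": 7.0, "blocks": 0.9, "steals": 1.2},
}

print(recommend("Guard A", "offense", players))
print(recommend("Center A", "defense", players))
```

In the app, each button would call a function like `recommend` with its own category, and the returned names would be rendered on the results page.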

After the Flask app recommendation system is in place and styled, I will attempt to bring in the salary of each player and display it with the results. This way we can find a similar player and compare their salary ranges.

![flaskapp](media/flaskapp.png)

# Next Steps
- Create a function to automatically clean data.
- Gather more data (stats related + social media related)
- Practice Tableau to visualize different clusters
- Build a database for NBA Player information from different sources 
- Display more player information in the Flask app