# NBA Hall of Fame or Ball of Lame (First Milestone)
Name: Cobi Toeun <br>
Email: u1230512@utah.edu -- cobitoeun@gmail.com <br>
UID: u1230512 <br>

# Project Description:

What makes an NBA player one of the best in a season? What makes an NBA player one of the greatest in league history? <br>

To answer these questions, I am going to use *Python* to scrape the stats of current (2021-2022 season) NBA players and predict each player's qualifications and probability of becoming a future NBA Hall of Famer.
<br>

To determine which players would be considered all-time greats, I am going to look at previous Hall of Famers:

- The *Hall of Fame* honors players who have shown exceptional skill at basketball, all-time great coaches, referees, and other major contributors to the sport. A player or staff member can be considered a hall of famer for many different reasons; however, I would like to determine an NBA player's chances through stats, awards, and accomplishments.


# Project Objectives:
**Questions to consider:**

- What makes a player likely to be inducted into the hall of fame?
- Do championships alone determine a player's greatness, or can a player be considered an all-time great without a championship?
- Can a player's Hall of Fame chances increase or decrease after each season?
- What are the most common variables among current hall of famers? Which provide the strongest correlations?
- Can I create a model and algorithm to predict whether a player is likely to be chosen as a hall of famer based only on current stats?

**Benefits (What to learn and accomplish):**

- Use Python libraries, including pandas (dataframes), BeautifulSoup, and matplotlib, to gather and present results scraped from [Basketball-Reference.com](https://www.basketball-reference.com/).
- Gain better knowledge of Python web scraping and learn which methods work best in which scenarios.
- Find the bare minimum requirements to qualify as a Hall of Famer.
- Create a model and algorithm to predict a player's chance of being a hall of famer, based on current hall of famers.

# Data Description:
All data has been web scraped from [Basketball-Reference.com](https://www.basketball-reference.com/).
To avoid overloading the site with repeated requests, I have downloaded every page I need to complete analysis and predictions. Each page can be found in the *Webpages* folder.
<br>
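
The download-once approach described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: `cache_page`, `fake_fetch`, and the file name are hypothetical, and the stand-in fetcher replaces a real HTTP request so the example runs offline.

```python
import os
import tempfile
import time


def cache_page(url, dest_path, fetch, delay_seconds=0.0):
    """Return the HTML for `url`, downloading it only if no local copy exists."""
    if os.path.exists(dest_path):
        with open(dest_path, encoding="utf-8") as f:
            return f.read()

    html = fetch(url)          # a request is made only on a cache miss
    time.sleep(delay_seconds)  # optional pause between live requests
    folder = os.path.dirname(dest_path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(dest_path, "w", encoding="utf-8") as f:
        f.write(html)
    return html


# Stand-in fetcher (hypothetical) that records how often it is called.
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<html><body>placeholder</body></html>"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Webpages", "jordami01.html")
    url = "https://www.basketball-reference.com/players/j/jordami01.html"
    first = cache_page(url, path, fake_fetch)
    second = cache_page(url, path, fake_fetch)  # read from disk, no new request
```

Passing the fetcher as an argument keeps the caching logic separate from the networking, so a real request function can be swapped in without changing anything else.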

**Before computing Hall of Fame predictions, this is the data to search and extract:**
- Career points, rebounds, assists, blocks, and steals (per game and percentage)
- Averages (FG%, eFG%, FG3%, FT%, PER, WS)
- All-stars, MVPs, Championships, DPOYs, All-NBA, etc. (awards and accomplishments)
<br>
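
Of the averages listed above, eFG% is the least self-explanatory: it adjusts field goal percentage so that a made three-pointer counts as 1.5 made shots. A short sketch of the formula, using a made-up stat line rather than real player data:

```python
def effective_fg_pct(fg_made, three_made, fg_attempted):
    """eFG% = (FGM + 0.5 * 3PM) / FGA; returns 0.0 when there are no attempts."""
    if fg_attempted == 0:
        return 0.0
    return (fg_made + 0.5 * three_made) / fg_attempted


# Hypothetical line: 10 made field goals (2 of them threes) on 20 attempts.
plain_fg = 10 / 20
example = effective_fg_pct(fg_made=10, three_made=2, fg_attempted=20)
```

Here the plain FG% is 0.50 while the eFG% is 0.55, reflecting the extra value of the two threes.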

**Here is an example of Michael Jordan's career stats and accomplishments:**
![mj_stats](images/mj_stats.png)
<br>

Using my browser's built-in inspect function to locate the relevant elements, I can extract MJ's awards using BeautifulSoup and a custom function: <br>
![image-4.png](images/inspect_stats.png)
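
To give a rough idea of what such an extraction function does, here is a minimal sketch. It is not the custom function from the notebook: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup so it runs without extra installs, and both the HTML snippet and the `bling` list id are assumptions about the page layout.

```python
import re
from html.parser import HTMLParser


class AwardParser(HTMLParser):
    """Collect the text of every <li> inside the <ul> with the given id."""

    def __init__(self, list_id):
        super().__init__()
        self.list_id = list_id
        self.in_list = False
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul" and dict(attrs).get("id") == self.list_id:
            self.in_list = True
        elif tag == "li" and self.in_list:
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_list = False
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data


def count_awards(html, list_id="bling"):
    """Turn entries like '6x NBA Champ' into {'NBA Champ': 6}."""
    parser = AwardParser(list_id)
    parser.feed(html)
    awards = {}
    for text in parser.items:
        text = text.strip()
        match = re.match(r"(\d+)x\s+(.+)", text)
        if match:
            awards[match.group(2)] = int(match.group(1))
        elif text:
            awards[text] = 1  # entries without a count are single honors
    return awards


# Assumed markup, written by hand for illustration.
sample_html = """
<ul id="bling">
  <li><a>Hall of Fame</a></li>
  <li><a>14x All Star</a></li>
  <li><a>6x NBA Champ</a></li>
  <li><a>5x MVP</a></li>
</ul>
"""
awards = count_awards(sample_html)
```

With BeautifulSoup, the same list could be located with `soup.find("ul", id="bling")` and the regular expression applied to each item's text.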

After categorizing all the extracted data, I converted each dataframe to a csv file so I don't have to re-run the code each time I want to access it. CSV files can be accessed in the *csv_files* folder. <br><br>

**All acquired and cleaned data will be present in *WebScraping-DataCleaning.ipynb***

<br>

**Webscraped data have been extracted from these pages:**
- [Basketball-Reference Homepage](https://www.basketball-reference.com/)
- [NBA Player Directory](https://www.basketball-reference.com/players/)
- [Michael Jordan Career Stats](https://www.basketball-reference.com/players/j/jordami01.html) (we'll use MJ's page as an example, but I've scraped from all NBA players)

# Ethical Consideration:
**Legal Issues?:** <br>

Web scraping is generally considered acceptable when the data is publicly available on the internet. However, some kinds of data are protected by regulations, so we have to be careful when scraping personal data, intellectual property, or confidential data. <br>

Some sites are also likely to ban you if you scrape them too much or too quickly. So far, I haven't had any problems with Basketball-Reference. To be safe, downloading the webpages is the best bet if you are trying to scrape large amounts of data. <br> <br>

**According to [clause 5](https://www.sports-reference.com/termsofuse.html) of Sports Reference:**

- Subject to the terms of this Agreement, you are granted a limited, personal, non-exclusive, non-sublicensable, non-assignable, non-transferable, and revocable license to access and use the Site and Content.
- This means that you should not create websites or tools based on data you scrape from Sports Reference or any of our sites.
- **Ultimately, I am permitted to scrape this website only for personal use, and I won't use my findings or tools to sell any information or to create a personal website.** <br>

My project won’t harm the website in any way. More information on data usage can be found [here](https://www.sports-reference.com/data_use.html)

# Methods:
**Methods:**
- Web scraping (scrape using beautiful soup)
- Regex (regular expressions)
- String formatting (format strings so I'll be able to extract multiple different links)
- Loops and Conditional Statements (used in custom functions)
- Algorithms and Data Structures (use dictionaries, sets, lists to store data) (algorithms for extraction)
- Dataframes and CSV files (store extracted data as a table) (csv to use through analysis)
- Regression, SVM Models, Decision Trees (predict accuracy of model) (use model to give probability of hof chances)
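
As an illustration of the string formatting item above, link templates can be filled in with a letter or a player id. This is a sketch under assumptions: the function names are hypothetical, and the URL pattern is inferred from the Michael Jordan link (`players/j/jordami01.html`).

```python
import string

DIRECTORY_URL = "https://www.basketball-reference.com/players/{}/"
PLAYER_URL = "https://www.basketball-reference.com/players/{}/{}.html"


def directory_urls(letters=string.ascii_lowercase):
    """One directory link per last-name initial."""
    return [DIRECTORY_URL.format(letter) for letter in letters]


def player_url(player_id):
    """Player pages sit under the first letter of the id (assumed pattern)."""
    return PLAYER_URL.format(player_id[0], player_id)


all_directories = directory_urls()
mj_page = player_url("jordami01")
```

Keeping the templates as constants means a change in the site's URL layout only needs to be fixed in one place.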

**Data Visualizations:**
- Correlation Matrix (Check which variables best correlate with current Hall of Fame players)
- Heat Maps (Visualize correlation matrix)
- Scatter Plot Matrices
- Graphs (Line, Bar, Plot) (showcase stats, awards among hof and non-hof players)
<br> <br>
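
For reference, the correlation matrix behind a heat map is a table of pairwise Pearson correlations. The sketch below computes it in plain Python to show the arithmetic. The column names and values are made up, and in a pandas workflow the same result would come from `DataFrame.corr()`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values for five hypothetical players (hof: 1 = inducted).
data = {
    "all_star": [0, 1, 3, 8, 12],
    "ppg": [6.0, 11.5, 15.0, 22.0, 27.5],
    "hof": [0, 0, 0, 1, 1],
}
matrix = correlation_matrix(data)
hof_ranking = sorted(
    (name for name in data if name != "hof"),
    key=lambda name: abs(matrix[name]["hof"]),
    reverse=True,
)
```

Sorting the variables by the absolute value of their correlation with `hof` gives a quick ranking of which stats track Hall of Fame status most closely in the sample.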

**The methods and visualizations are described in more detail in the *results* section.** <br> <br>

### Here are some examples of methods and visualizations I have used:

![code](images/code.png)

![dataframe](images/dataframe.png)

![image-4.png](images/corr_graph.png)

![image.png](images/heatmap.png)

# Preliminary Results and Upcoming Milestones:
**Steps Completed:**
- **Setup notebooks and directories:**
    - I have used a custom function and the os library to create directories so I can store downloaded webpages and csv files. Webpages are stored in *Webpages/* and csv files are stored in *csv_files/*. <br>
- **Downloaded webpages:**
    - To avoid repeatedly sending requests to Basketball-Reference, I have downloaded all the pages I need and stored them in the *Webpages/* folder. All individual player pages are in Webpages/Player Director/{}-players ({} being their last initial). <br>
- **Written algorithms to extract data:**
    - To avoid gathering the data manually, I have written a few algorithms to extract elements from HTML. Each algorithm is a function and runs correctly, though scraping takes a while. I will optimise the code later, after I have finished data analysis. <br>
- **Web scraped and cleaned data:**
    - Using the pages I have downloaded, I have used my algorithms (as stated above) to extract the necessary data. Once the data was extracted, I stored it in a dataframe. From there, I cleaned out unnecessary information using pandas. <br>
- **Converted dataframes to csv files (stored csv files to folder):**
    - So you and I don't have to re-run my web scraping algorithms, I have converted each dataframe to a csv file. Each csv file is stored in the *csv_files/* folder. So far, I have a csv file for all_players, retired_players, hof_players, and active_players. I am likely to add more in the future. <br>
- **Started with Data Analysis:**
    - After creating each csv file, I started trying to determine which stats and awards correlate most strongly with hall of fame status. Along with that, I have used the describe function to determine the minimum qualifications for hall of famers. More things to be added are stated below. <br>
<br>
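
The directory setup and csv export steps above can be sketched roughly as follows. The helper names and sample rows are hypothetical, and the standard library's `csv` module stands in for pandas (`DataFrame.to_csv` and `pandas.read_csv`) so the example runs without extra installs.

```python
import csv
import os
import tempfile


def make_dirs(base, names=("Webpages", "csv_files")):
    """Create the storage folders under `base` and return their paths."""
    paths = {}
    for name in names:
        path = os.path.join(base, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths[name] = path
    return paths


def save_rows(rows, path):
    """Write a list of dicts to a csv file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def load_rows(path):
    """Read the csv back as a list of dicts (all values come back as strings)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Made-up rows standing in for a scraped dataframe.
rows = [
    {"player": "Player A", "all_star": 3, "hof": 0},
    {"player": "Player B", "all_star": 10, "hof": 1},
]

with tempfile.TemporaryDirectory() as base:
    dirs = make_dirs(base)
    csv_path = os.path.join(dirs["csv_files"], "hof_players.csv")
    save_rows(rows, csv_path)
    reloaded = load_rows(csv_path)
```

One difference worth noting: the `csv` module returns every value as a string, whereas `pandas.read_csv` infers numeric column types when loading.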

**Steps To Complete:**
- Find bare minimum to qualify for hof:
    - Using the describe function, I am going to find the minimum stats among current hofs. Active players and retired non-hofs who meet these minimums are labeled eligible; those who don't are labeled ineligible. <br>
- Create data visualizations to present findings:
    - I will use bar and scatter plots to showcase my findings for average seasons played for hof and retired players. The visual addition will make it easier to determine requirements. Along with plots, I want to add more heat maps and matrices to showcase correlations. <br>
- Create predictive models using regression, svm, or decision trees (k-means probably):
    - As in the homework, I am going to build regression, svm, and decision tree models to help predict the probability of being inducted into the hof. I will then determine which method gives the best accuracy and use it to make predictions. Findings will be showcased through data visualizations. <br>
- Write screen-cast script:
    - To keep things organized, once I am finished with writing the code and implementing data analysis and machine learning, I am going to write a short script for the video. I will closely follow this script when giving a voice over. <br>
- Record and edit video:
    - Once the script is finished, I will record the video from my PC. If I have enough time, I will try to make the video interesting by adding elements when editing. Rather than having a boring descriptive video, I would like to add game highlights or interviews so the viewer can have a good time. (video will be around 2-3 minutes) <br>
- Format and Submission:
    - Before submission, I will run all programs and reformat any files as necessary.
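
The eligibility check planned in the first step above could look roughly like this. It is a sketch with made-up numbers and hypothetical column names, and it assumes the threshold for each stat is simply the lowest value among current Hall of Famers (the `min` row that the describe function reports).

```python
def minimum_requirements(hof_rows, stats):
    """Lowest value of each stat among current Hall of Famers."""
    return {stat: min(row[stat] for row in hof_rows) for stat in stats}


def is_eligible(player, minimums):
    """A player is eligible only if every stat meets its minimum."""
    return all(player[stat] >= floor for stat, floor in minimums.items())


# Made-up Hall of Fame rows, not real players.
hof_rows = [
    {"seasons": 13, "ppg": 18.2, "all_star": 5},
    {"seasons": 15, "ppg": 25.1, "all_star": 11},
    {"seasons": 10, "ppg": 20.4, "all_star": 7},
]
minimums = minimum_requirements(hof_rows, ["seasons", "ppg", "all_star"])

candidates = {
    "Player A": {"seasons": 12, "ppg": 21.0, "all_star": 6},
    "Player B": {"seasons": 14, "ppg": 9.3, "all_star": 0},
}
status = {
    name: "eligible" if is_eligible(stats, minimums) else "ineligible"
    for name, stats in candidates.items()
}
```

Because each minimum is taken independently, the combined threshold may be lower than the stat line of any single Hall of Famer, so it works as a loose first filter rather than a prediction.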

# Peer Feedback:
**Group providing feedback:**<br>
Hate Crime Data Analysis: *Marni Epstein, Aaron Tang, (third member not present)*
<br><br>
**Suggestions and feedback:** <br>
- **New title:** NBA Hall of Fame or Ball of Lame <br> 
    - Be more descriptive about what I am gathering data for. Is it for the MLB or NFL Hall of Fame? Adding *NBA* to (NBA) Hall of Fame or Ball of Lame makes it easier to understand that I am working on data that relates to the NBA.

- **Methods to add:** Consider predictive models; Decision tree and classification model <br>
    - I am going to manually decide which factors play a part in qualifying players for the hof, and give points based on average stats and achievements. However, to incorporate machine learning and make it easier to find probabilities, it was recommended that I include predictive models so the computer can perform the calculations for me.
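
To show what a predictive model adds over hand-assigned points, here is a minimal sketch of logistic regression trained by gradient descent in plain Python. It is an illustration only: the data is made up, the feature scaling constants (15 All-Star selections, 30 points per game) are arbitrary assumptions, and in the actual project a library model would take the place of this hand-written loop.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """Estimated probability that a player with features `x` is a hall of famer."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def train_logistic(features, labels, lr=0.5, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = predict_proba(x, weights, bias) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def scale(all_star, ppg):
    """Bring both features into a similar 0-1 range (assumed constants)."""
    return [all_star / 15, ppg / 30]


# Made-up training data: (All-Star selections, points per game, hof label).
raw = [(0, 8, 0), (1, 12, 0), (2, 14, 0), (9, 22, 1), (12, 26, 1), (14, 28, 1)]
X = [scale(a, p) for a, p, _ in raw]
y = [label for _, _, label in raw]

weights, bias = train_logistic(X, y)
likely = predict_proba(scale(12, 26), weights, bias)
unlikely = predict_proba(scale(0, 8), weights, bias)
```

Instead of a yes/no rule, the model returns a probability between 0 and 1 for each player, and the learned weights indicate how much each stat contributes.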
    
The feedback that I was given was very brief but helpful. Both peers asked me great questions about how I was going to scrape and present the data. Since I am doing this project alone, it was nice to have suggestions since, unlike other groups, I wouldn't otherwise go through the process of discussing solutions or debating.

# Project Summary:
The main idea is to web scrape the stats of current (2021-2022 season) NBA players and determine if they qualify as future hall of famers. A player or staff member can be considered a hall of famer for many different reasons; however, I would like to determine a player's chances through stats, awards, and accomplishments.

This project would challenge me to create an algorithm and predictive model that find the stats and awards most likely to earn a spot in the hall of fame. Just because a player has 10 championships doesn’t mean they are an all-time great. They could be a benchwarmer who averages only 1 point in 5 minutes per game. This is an idea that I'd love to research and explore.

[Basketball-Reference.com](https://www.basketball-reference.com/) is a great site for viewing current and former players' stats, awards, and accomplishments. The data is clean and easy to navigate, which is helpful for me.
<br>

**Project Summary (Milestones):**
- Web scrape and extract data I need to complete predictions
- I can get an overview of the data using the provided describe function.
- To visualize the data, I will create various graphs to determine total numbers of teams, players, and hall of famers
- From there I can compute the correlation matrix and visualize it with a heat map to determine which variables correlate strongly with hall of fame status.
- To further my analysis, I can visualize the correlations by making a scatterplot matrix.
- Overall, I will interpret what I see through various descriptions.
- After gathering all the data and analyzing, I can go ahead and answer all the questions I have asked and finalize with a conclusion.

# Assessment : On track? (**Y**/N)
Yes, this project is on track and I plan to complete data analysis and predictions in the next week or two. The script and video will be completed right after that and should take about two days.