The purpose of this project is to take data from different sources otherwise known as extract, then transform the data by dealing with missing values, choosing data to include in the final table, renaming columns, and dropping duplicate data, and finally loading the transformed data into a relational database. The project also identified NFL player arrest records and their demographic player records pulled from player profile records. As a bonus, we are also asking: "Which college did the most players with arrest records during the period 2000 -2017 come from?"
We analyzed all the available arrest records from Kaggle NFL Arrests 2000 - 2017. A record of reported NFL Arrests with details about Crime, Team and Player.
- Web Scraping: 'html' from https://www.kaggle.com/patrickmurphy/nfl-arrests
- pull: 'json' from https://www.kaggle.com/zynicide/nfl-football-player-stats#profiles_1512362725.022629.json
- Create dataframe from NFL Arrests csv file
- Choose columns wanted from original dataframe
- Rename columns
- Create dataframe from .json file nfl profiles
- Clean nfl arrests by saving only the 1st instance of a player name and removing duplicate entries.
- Clean nfl profiles. Duplicate players removed from the profile section
- All na files filled with not applicable
As part of the transformation of the data, the columns of each new created dataframe was evaluated and columns that were relevant to the project was selected from the dataframes. Next, some columns were renamed to give them more descriptive labeling. After manipulating columns, each dataframe was evaluated to find what type of data was included in the database, the shape of the data meaning the number of rows and columns in the database, missing data, and duplicate data.
After evaluating all the aspects mentioned above, the nfl_profiles_df dataframe was found to have missing data in the "college","draft_team", and "hometown" columns. The missing data in those columns would be filled with "not available", "not drafted", and "not available" respectively due to the desire to not eliminate any rows of data from the dataframe. Each dataframe was found to have duplicate data in the player column. The duplicates were dropped except for the first instance of the player name. The reason for this choice is due to the fact that we were addressing only one instance of an NFL player's arrests and not taking into account multiple arrests for the same individual.
nfl_arrests_df = nfl_arrests_df.rename(columns={"DATE": "offense_date", "NAME": "player", "CASE": "outcome", "CATEGORY": "charge"})
After the data was cleaned, the next step in the ETL process is loading. For this project a relational database was chosen; specifically PostgreSQL. The reason for choosing a relational database is due to the fact that each table is related to players in the NFL. One table is named "nfl_arrests" and the other "nfl_profiles". Each table can be queried seperately; however as will be explaind later, the tables can be joined where the player's names match in both tables. Joining the tables in this way gives the most robust data than either of the tables do seperately.
In this project SQL Alchemy was used to connect to PostgreSQL; so in essence SQL Alchemy is a ORM that will be connected to PostgreSQL through the use of python and Jupyter Notebooks.
- Connect the orm to the local database with the code below:
- Add the 2 panda dataframes to the local database with the code below:
After the tables were loaded into PostgreSQL a query was performed using SQL Alchemy to combine the data of each table in a format that would give much more interesting data regarding the NFL players in the dataframe. The code for running the query is below:
query = "select nfl_arrests.player, nfl_arrests.offense_date, nfl_arrests.outcome, nfl_arrests.charge, nfl_profiles.hometown,nfl_profiles.college, nfl_profiles.draft_team From nfl_arrests Inner Join nfl_profiles ON nfl_arrests.player = nfl_profiles.player"
Please note that you must use your personal password for SQL Alchemy as my password has been redacted for privacy