# Data Gathering

## Introduction

In this project, I used a variety of data sets on injuries and injury prevention in sports, including both record and text data. The data sets came from academic articles, statistical institutes, APIs, data.world, and Kaggle. The Kaggle data sets were either built by users from sports websites such as those of the NFL and NBA, or released for competitions hosted by professional sports leagues.

While I have a lot of data sets, I didn't use all of them in my main analysis. Because injuries in sports don't gain a ton of media attention unless they are big, career-ending injuries, I wanted to provide a little more context for the injuries that do occur before diving into what we can do to prevent them. My data sets focus on three main sports: football (also called soccer, but I will use "football" throughout this project), basketball, and American football. I used three sports in my exploratory data analysis to see how the nature of injuries differs between sports.

The data set used for my main analysis is about injury prevention factors. More about this data set can be found below and throughout the other data-related tabs on my website. I chose to focus on it because it is the one that can answer the questions in my introduction, which guide the course of this project.

# Record Data

### Injury Prevention Factors Data Set

The record data that I selected for this project came from an article entitled ["Perceptions of football players regarding injury risk factors and prevention strategies"](https://datadryad.org/stash/dataset/doi:10.5061/dryad.t5r56). This data set focuses on injuries in football (soccer) players.

Below is the raw data.

In [2]:
#| code-fold: true 
import pandas as pd

file_path = "../../../../data/00-raw-data/Data_Injury_Prevention.csv"
raw_injury_prevention = pd.read_csv(file_path)

raw_injury_prevention.head()

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32,Unnamed: 33,Unnamed: 34,Unnamed: 35,Unnamed: 36,Unnamed: 37,Unnamed: 38,Unnamed: 39,Table 1
ID,Age,Height,Mass,Team,Position,Years of Football Experience,"Previous Injuries (1=yes, 0=no)",Number of Injuries,"Ankle Injuries (1= yes, 0=no)",Number of Ankle Injuries,"Severe_Ankle_Injuries (1=yes, 0=no)","Noncontact_Ankle_Injuries ((1=yes, 0=no)","Knee Injuries (1=yes, 0=no)",Number of Knee Injuries,"Severe_Knee_Injuries(1=yes, 0=no)","Noncontact_Knee_Injuries(1=yes, 0=no)","Thigh_Injuries(1=yes, 0=no)",Number of Thigh Injuries,"Severe_Thigh_Injuries (1=yes, 0=no)","Noncontact_Thigh_Injuries(1=yes, 0=no)","Risk Factor Condition (1=yes, 0=no)","Risk Factor Coordination (1=yes, 0=no)","Risk Factor Muscle Impairments (1=yes, 0=no)","Risk Factor Fatigue (1=yes, 0=no)","Risk Factor Previous Injury(1=yes, 0=no)","Risk Factor Attentiveness (1=yes, 0=no)","Risk Factor Other Player (1=yes, 0=no)","Risk Factor Equipment(1=yes, 0=no)","Risk Factor Climatic Condition (1=yes, 0=no)","Risk Factor Diet (1=yes, 0=no)",Importance Injury Prevention,"Knowledgeability (1=yes, 2=no)","Prevention Measure Stretching (1=yes, 0=no)","Prevention Measure Warm Up (1=yes, 0=no)","Prevention Measure Specific Strength Exercises (1=yes, 0=no)","Prevention Measure Bracing (1=yes, 0=no)","Prevention Measure Taping (1=yes, 0=no)","Prevention Measure Shoe Insoles (1=yes, 0=no)","Prevention Measure Face Masks (1=yes, 0=no)","Prevention Measure Medical Corset (1=yes, 0=no)"
146.00,19.00,173.00,67.60,1.00,3.00,1.00,1.00,6.00,1.00,3.00,0.00,1.00,0.00,0.00,0.00,0.00,1.00,3.00,0.00,1.00,0.00,0.00,0.00,1.00,0.00,0.00,0.00,1.00,0.00,0.00,2.00,1.00,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00
155.00,22.00,179.50,71.00,1.00,3.00,1.00,1.00,2.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,2.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,1.00,1.00,1.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00
160.00,22.00,175.50,71.80,1.00,3.00,1.00,1.00,7.00,1.00,4.00,1.00,1.00,0.00,0.00,0.00,0.00,1.00,3.00,1.00,1.00,0.00,0.00,0.00,1.00,0.00,0.00,1.00,0.00,0.00,0.00,1.00,1.00,1.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00
164.00,23.00,190.00,80.50,1.00,4.00,1.00,1.00,1.00,0.00,0.00,0.00,0.00,1.00,1.00,1.00,0.00,0.00,0.00,0.00,0.00,1.00,1.00,1.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,1.00,1.00,1.00,1.00,0.00,0.00,0.00,0.00,0.00


I got a little lucky with this data set, as it is very comprehensive. As you can see in the table above, the data set covers which injury prevention methods the athletes used, which risk factors they reported, and what types of injuries they sustained, as well as their age, height, mass, position, and years of football experience. While this data set did not require a ton of cleaning, it still required some, which is done in the [data cleaning section](../data_cleaning/data_cleaning.ipynb) of my website.
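In the raw preview above, the real column names sit in the second row, beneath a mostly empty row ending in "Table 1". One way to handle that when loading might look like the sketch below. It uses a tiny made-up stand-in for the file (three columns only) and is not necessarily how the cleaning section does it.

```python
import io
import pandas as pd

# Made-up stand-in for the raw file: a junk first row ("Table 1")
# sits above the real header row, as in the preview above.
raw_text = (
    ",,Table 1\n"
    "ID,Age,Height\n"
    "146.00,19.00,173.00\n"
    "155.00,22.00,179.50\n"
)

# header=1 tells pandas to take column names from the second line
# (row index 1) and to ignore everything above it.
header_fixed = pd.read_csv(io.StringIO(raw_text), header=1)

print(header_fixed.columns.tolist())
```

The same `header=1` argument could be passed when reading the real CSV path, assuming the junk row is always exactly one line.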

### Injuries from Different Activities

This data set breaks down how many injuries per age group come from different types of activities. While this data isn't directly related to professional sports, it helps provide a basic understanding of which sports or physical activities produce the most injuries. This data source comes from the [Insurance Information Institute](https://www.iii.org/fact-statistic/facts-statistics-sports-injuries).

The raw data is below. As with the previous data set, the cleaned data and the cleaning process can be found on the [data cleaning tab](../data_cleaning/data_cleaning.ipynb).

In [3]:
#| code-fold: true 

file_path = "../../../../data/00-raw-data/sports_injuries_data.csv"
injury_causes = pd.read_csv(file_path)

injury_causes.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X195,X196,X197,X198,X199,X200,X201,X202,X203,X204
0,"Number of injuries by age \n \nSport, activity...",,,,Number of injuries by age,Number of injuries by age,Number of injuries by age,,"Sport, activity or equipment",Injuries (1),...,2635.0,3261.0,572.0,"Nonpowder guns, BB'S, pellets",11603.0,519.0,3286.0,3443.0,3869.0,487.0
1,,,,Number of injuries by age,Number of injuries by age,Number of injuries by age,,,,,...,,,,,,,,,,
2,"Sport, activity or equipment",Injuries (1),Younger than 5,5 to 14,15 to 24,25 to 64,65 and older,,,,...,,,,,,,,,,
3,"Exercise, exercise equipment",445642,6662,36769,91013,229640,81558,,,,...,,,,,,,,,,
4,Bicycles and accessories,405411,13297,91089,50863,195030,55132,,,,...,,,,,,,,,,


This data required a good bit of cleaning, but provided information about injuries by activity per age group.

### NBA Injury Data 

The next data set that I decided to explore is an NBA injury data set. It records all of the injury-related roster moves made by each NBA team in every season between the 2010 and 2020 seasons. The raw data contains five features: the date of the transaction, team name, acquired players, relinquished players, and notes about the injury. I thought this data source would be useful for gathering information about what kinds of injuries are most common and how many injuries each team had per season. The raw data comes from this [Kaggle data set](https://www.kaggle.com/datasets/ghopkins/nba-injuries-2010-2018/code), whose author scraped it from the [Pro Sports Transactions](http://www.prosportstransactions.com/) website.

In [4]:
#| code-fold: true

file_path = "../../../../data/00-raw-data/injuries_2010-2020.csv"
nba_injuries = pd.read_csv(file_path)
nba_injuries.head()

Unnamed: 0,Date,Team,Acquired,Relinquished,Notes
0,2010-10-03,Bulls,,Carlos Boozer,fractured bone in right pinky finger (out inde...
1,2010-10-06,Pistons,,Jonas Jerebko,torn right Achilles tendon (out indefinitely)
2,2010-10-06,Pistons,,Terrico White,broken fifth metatarsal in right foot (out ind...
3,2010-10-08,Blazers,,Jeff Ayres,torn ACL in right knee (out indefinitely)
4,2010-10-08,Nets,,Troy Murphy,strained lower back (out indefinitely)


This data also required a good bit of cleaning, but provides valuable information about what types of injuries were most common and how severe each injury was.
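As a rough illustration of how injuries per team per season could be counted from this table, here is a minimal sketch. The rows are made up (only the column names match the preview above), and two rules are assumptions rather than anything stated by the data source: a name in `Relinquished` is treated as one injury, and a season is taken to start in October.

```python
import io
import pandas as pd

# Made-up rows that reuse the column names from the raw data preview.
csv_text = """Date,Team,Acquired,Relinquished,Notes
2010-10-03,Team A,,Player 1,sprained ankle
2010-10-06,Team B,,Player 2,sore knee
2011-01-15,Team B,,Player 3,strained back
2011-01-20,Team B,Player 2,,returned to lineup
2011-11-02,Team A,,Player 1,sore knee
"""
moves = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Assumption: a name in "Relinquished" means a player went out injured;
# rows with only "Acquired" filled are returns and are dropped.
injured = moves.dropna(subset=["Relinquished"]).copy()

# Assumption: seasons start in October, so January-September dates
# belong to the season that began in the previous calendar year.
before_october = (injured["Date"].dt.month < 10).astype(int)
injured["SeasonStart"] = injured["Date"].dt.year - before_october

injuries_per_team = (
    injured.groupby(["SeasonStart", "Team"]).size().reset_index(name="Injuries")
)
print(injuries_per_team)
```

The October cutoff is only a convenience and would need adjusting for seasons that started late or ran long.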

### NFL Concussion Data 

This data source is a list of all the concussions and head injuries in the NFL during the 2012-2014 seasons. The data set has information about the time of the injury within the season, the severity of the injury, and the aftermath of the injury. I chose this data set to get a look at another sport's injury data. Concussions are among the most common and most serious injuries in the NFL, which is why I chose to focus only on them. This information helps provide injury context for the NFL seasons. The data set comes from [data.world](https://data.world/alice-c/nfl).

The raw data is below.

In [5]:
#| code-fold: true

file_path = "../../../../data/00-raw-data/Concussion Injuries 2012-2014.csv"
concussion_data = pd.read_csv(file_path)
concussion_data.head()

Unnamed: 0,ID,Player,Team,Game,Date,Opposing Team,Position,Pre-Season Injury?,Winning Team?,Week of Injury,Season,Weeks Injured,Games Missed,Unknown Injury?,Reported Injury Type,Total Snaps,Play Time After Injury,Average Playtime Before Injury
0,Aldrick Robinson - Washington Redskins vs. Tam...,Aldrick Robinson,Washington Redskins,Washington Redskins vs. Tampa Bay Buccaneers (...,30/09/2012,Tampa Bay Buccaneers,Wide Receiver,No,Yes,4,2012/2013,1,1.0,No,Head,0,14 downs,37.00 downs
1,D.J. Fluker - Tennessee Titans vs. San Diego C...,D.J. Fluker,San Diego Chargers,Tennessee Titans vs. San Diego Chargers (22/9/...,22/09/2013,Tennessee Titans,Offensive Tackle,No,No,3,2013/2014,1,1.0,No,Concussion,0,78 downs,73.50 downs
2,Marquise Goodwin - Houston Texans vs. Buffalo ...,Marquise Goodwin,Buffalo Bills,Houston Texans vs. Buffalo Bills (28/9/2014),28/09/2014,Houston Texans,Wide Receiver,No,No,4,2014/2015,1,1.0,No,Concussion,0,25 downs,17.50 downs
3,Bryan Stork - New England Patriots vs. Buffalo...,Bryan Stork,New England Patriots,New England Patriots vs. Buffalo Bills (12/10/...,12/10/2014,Buffalo Bills,Center,No,Yes,6,2014/2015,1,1.0,No,Head,0,82 downs,41.50 downs
4,Lorenzo Booker - Chicago Bears vs. Indianapoli...,Lorenzo Booker,Chicago Bears,Chicago Bears vs. Indianapolis Colts (9/9/2012),9/09/2012,Indianapolis Colts,Running Back,Yes,Yes,1,2012/2013,0,,No,Head,0,Did not return from injury,


This data set includes information about the team that was playing, the date of the game, the player's position, the type of injury sustained, the team's performance that game and season, as well as the duration of the recovery time. 
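Two things stand out in the raw preview: the dates are written day-first, and the play-time columns mix numbers with text such as "14 downs" or "Did not return from injury". The sketch below shows one possible way to make those columns usable. It works on a small hand-typed frame in the same format, and the new column names ending in "(downs)" are hypothetical.

```python
import pandas as pd

# Hand-typed sample in the same format as the raw preview.
concussion_sample = pd.DataFrame({
    "Date": ["30/09/2012", "22/09/2013", "12/10/2014"],
    "Play Time After Injury": ["14 downs", "78 downs", "Did not return from injury"],
    "Average Playtime Before Injury": ["37.00 downs", "73.50 downs", None],
})

# An explicit day/month/year format keeps 12/10/2014 as 12 October,
# not 10 December.
concussion_sample["Date"] = pd.to_datetime(
    concussion_sample["Date"], format="%d/%m/%Y"
)

# Pull the leading number out of strings like "14 downs"; entries with
# no number (or missing values) become NaN.
for col in ["Play Time After Injury", "Average Playtime Before Injury"]:
    number = concussion_sample[col].str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    concussion_sample[col + " (downs)"] = pd.to_numeric(number)

print(concussion_sample.dtypes)
```

Treating "Did not return from injury" as missing is a choice; depending on the analysis, it might be better kept as its own flag.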

### NFL Game Injury & Condition Data Sets (2 Data Sets)

The final record data I selected describes NFL injuries and the circumstances in which each injury occurred, such as the type of injury, the type of stadium, the field type, and the temperature. I found this data useful for seeing how weather and game circumstances can influence injuries. It came from a [Kaggle competition](https://www.kaggle.com/competitions/nfl-playing-surface-analytics/data) hosted by the NFL in 2019.

In [6]:
#| code-fold: true

file_path = "../../../../data/00-raw-data/InjuryRecord.csv"
injury_record = pd.read_csv(file_path)
injury_record.head()

Unnamed: 0,PlayerKey,GameID,PlayKey,BodyPart,Surface,DM_M1,DM_M7,DM_M28,DM_M42
0,39873,39873-4,39873-4-32,Knee,Synthetic,1,1,1,1
1,46074,46074-7,46074-7-26,Knee,Natural,1,1,0,0
2,36557,36557-1,36557-1-70,Ankle,Synthetic,1,1,1,1
3,46646,46646-3,46646-3-30,Ankle,Natural,1,0,0,0
4,43532,43532-5,43532-5-69,Ankle,Synthetic,1,1,1,1


I combined this data set with another regarding weather and other conditions of injuries in the NFL. This data set contains similar information to the previous set but also includes more information about the weather and the player. This data also comes from the NFL Analytics [Kaggle competition](https://www.kaggle.com/competitions/nfl-playing-surface-analytics/data).

In [3]:
#| code-fold: true 
file_path = "../../../../data/00-raw-data/PlayList.csv"
play_list = pd.read_csv(file_path)
play_list.head()

Unnamed: 0,PlayerKey,GameID,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,PositionGroup
0,26624,26624-1,26624-1-1,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,1,QB,QB
1,26624,26624-1,26624-1-2,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,2,QB,QB
2,26624,26624-1,26624-1-3,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,3,QB,QB
3,26624,26624-1,26624-1-4,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,4,QB,QB
4,26624,26624-1,26624-1-5,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,5,QB,QB
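Both tables share the `PlayerKey`, `GameID`, and `PlayKey` columns, so one plausible way to combine them is a left merge that attaches the play conditions to each injury. The sketch below uses small made-up frames with the same column names; the exact join used in this project may differ.

```python
import pandas as pd

# Made-up rows that reuse column names from the two previews above.
injury_sample = pd.DataFrame({
    "PlayerKey": [39873, 46074],
    "GameID": ["39873-4", "46074-7"],
    "PlayKey": ["39873-4-32", "46074-7-26"],
    "BodyPart": ["Knee", "Knee"],
    "Surface": ["Synthetic", "Natural"],
})
plays_sample = pd.DataFrame({
    "PlayerKey": [39873, 39873, 46074],
    "GameID": ["39873-4", "39873-4", "46074-7"],
    "PlayKey": ["39873-4-31", "39873-4-32", "46074-7-26"],
    "StadiumType": ["Indoor", "Indoor", "Outdoor"],
    "Temperature": [70, 70, 55],
    "Weather": ["Controlled", "Controlled", "Cloudy"],
})

# how="left" keeps every injury row, even if its play is missing
# from the play list; non-injury plays are left out.
combined = injury_sample.merge(
    plays_sample, on=["PlayerKey", "GameID", "PlayKey"], how="left"
)
print(combined)
```

With a left merge, injuries without a matching play would still appear, with empty condition columns, which makes gaps easy to spot.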


# Text Data

### Injury News 

In terms of text data, I didn't select many sources as text data wasn't as conducive to the type of analysis I was trying to do. However, I did use it for some exploratory data analysis tasks as well as clustering. 

The first data set I used was from [NewsAPI](https://newsapi.org/docs/get-started). I wanted to explore what words were used to describe injuries sustained in football (soccer) within the news to see how injuries are talked about in popular culture. 
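A request to NewsAPI's `everything` endpoint could be put together roughly as in the sketch below. The search phrase, the helper names, and the choice to keep only each article's title and description are all assumptions for illustration. The network call is defined but not run here, since it needs a personal API key.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

NEWSAPI_URL = "https://newsapi.org/v2/everything"

def build_request_url(query, api_key, language="en", page_size=100):
    """Build the request URL for a keyword search."""
    params = {
        "q": query,
        "language": language,
        "pageSize": page_size,
        "apiKey": api_key,
    }
    return f"{NEWSAPI_URL}?{urlencode(params)}"

def fetch_articles(query, api_key):
    """Call the API and return the decoded JSON (needs a real key)."""
    with urlopen(build_request_url(query, api_key)) as response:
        return json.load(response)

def articles_to_text(payload):
    """Join each article's title and description into one string."""
    texts = []
    for article in payload.get("articles", []):
        title = article.get("title") or ""
        description = article.get("description") or ""
        texts.append(f"{title} {description}".strip())
    return texts

# Made-up payload in the shape the API returns, to show the parsing step.
example_payload = {
    "status": "ok",
    "articles": [
        {"title": "Striker sidelined", "description": "Hamstring strain rules forward out."},
        {"title": "Keeper returns", "description": None},
    ],
}
print(articles_to_text(example_payload))
```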

### Injury Prevention Text Data 

The other text data set used in my project is one that I created to correspond with the injury prevention factors data set (the first one discussed on this page). Using the definition of each injury prevention method, I built a data frame with a column containing the definitions of the methods that each person used. I did this for my Naive Bayes section, to see whether I would get different results using text data versus record data.
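The idea can be sketched as follows. The definitions here are placeholders written for illustration, not the ones used in the project, and the shortened column names are an assumption; the real columns carry longer labels such as "Prevention Measure Stretching (1=yes, 0=no)".

```python
import pandas as pd

# Placeholder definitions (hypothetical wording, for illustration only).
definitions = {
    "Stretching": "lengthening muscles to improve flexibility",
    "Warm Up": "light activity before play to prepare the body",
    "Taping": "wrapping a joint with tape for support",
}

# Made-up players with 1/0 flags for each prevention measure.
players = pd.DataFrame({
    "ID": [146, 155],
    "Stretching": [1, 1],
    "Warm Up": [0, 1],
    "Taping": [1, 0],
})

def describe_measures(row):
    """Join the definitions of every measure this player reported using."""
    used = [text for name, text in definitions.items() if row[name] == 1]
    return " ".join(used)

players["prevention_text"] = players.apply(describe_measures, axis=1)
print(players[["ID", "prevention_text"]])
```

Each player ends up with one text field that a bag-of-words model such as Naive Bayes could consume.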