## Introduction 

Data found on the internet is often not in the format we need in order to analyze it properly. Other times the data set is incomplete, with missing data points or inconsistent formatting, or is structured in a way that does not allow for smooth analysis. That is where data cleaning comes in. Cleaning our "dirty" data helps ensure that the data is ready to be analyzed and that the results of the analysis are accurate and reliable.

Common data cleaning tasks for tabular data include deciding how to deal with missing values, duplicate rows or entries, incorrect values, and special characters. They also include accounting for inconsistent casing (inconsistent use of upper- and lowercase letters), spelling errors, other human errors, inconsistent units, and more.

We will know that a tabular data set is clean when every column is a variable, every row is an observation, and every cell is a value.
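A few of the tabular tasks listed above (inconsistent casing, missing values, duplicate rows) can be sketched in pandas as follows. This is only a minimal illustration: the toy data and the helper name `basic_clean` are made up for the example and do not come from any data set in this project.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
raw = pd.DataFrame({
    "player": ["Ann", "ann ", "Bob", "Cara", "Cara"],
    "sport": ["Soccer", "soccer", "FOOTBALL", "Basketball", "Basketball"],
    "injuries": [2, 2, None, 1, 1],
})

def basic_clean(df: pd.DataFrame, text_cols: list) -> pd.DataFrame:
    """Normalize casing and whitespace, then drop missing and duplicate rows."""
    out = df.copy()
    for col in text_cols:
        # "Ann" and "ann " should count as the same value
        out[col] = out[col].str.strip().str.lower()
    # Dropping is the simplest strategy; imputing is another option
    out = out.dropna().drop_duplicates()
    return out.reset_index(drop=True)

cleaned = basic_clean(raw, text_cols=["player", "sport"])
print(cleaned)
```

Whether to drop or fill missing values depends on the analysis; dropping is used here only because it is the shortest to show.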

When we clean text data, we often use Natural Language Processing (NLP) techniques. Common cleaning tasks include removing punctuation, dealing with special characters, tokenization, vectorization, stop word removal, lemmatization or stemming, spell checking, removing URLs, and more.
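Several of these steps can be sketched with the Python standard library alone. The stop-word set below is a tiny hand-picked assumption (a real project would more likely use a fuller list such as the one provided by NLTK), and `clean_text` is a hypothetical helper name.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption, not a standard list)
STOP_WORDS = {"a", "an", "the", "and", "or", "to", "of", "in", "is", "with"}

# Matches http(s) links and bare www. links
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def clean_text(text: str) -> list:
    """Lower-case, strip URLs and punctuation, tokenize, remove stop words."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    # Delete every ASCII punctuation character
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = clean_text("Rapinoe's career ends with an injury: see https://example.com/story!")
print(tokens)
```

URLs are removed before punctuation on purpose; stripping punctuation first would break the links into fragments that the URL pattern no longer matches.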

To help us clean data, we can use R packages such as the tidyverse collection, which includes dplyr, forcats, and tidyr. For Python, we can use packages such as pandas and NLTK, along with regular expressions (RegEx). I use both R and Python to clean my data, and I rely on pandas quite a bit in the process.

## Record Data Cleaning

The process of cleaning my record data varied from data set to data set. Some were very simple to clean, and some required more time and effort. It all depended on the state of the data set when I found it and on what I wanted to do with it.

### Injury Prevention Factors

This data set was relatively clean to begin with, so it did not need much cleaning.

Raw Data:

In [4]:
#| code-fold: true
import pandas as pd

file_path = "../../../../data/00-raw-data/Data_Injury_Prevention.csv"
raw_injury_prevention = pd.read_csv(file_path)

raw_injury_prevention.head()

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32,Unnamed: 33,Unnamed: 34,Unnamed: 35,Unnamed: 36,Unnamed: 37,Unnamed: 38,Unnamed: 39,Table 1
ID,Age,Height,Mass,Team,Position,Years of Football Experience,"Previous Injuries (1=yes, 0=no)",Number of Injuries,"Ankle Injuries (1= yes, 0=no)",Number of Ankle Injuries,"Severe_Ankle_Injuries (1=yes, 0=no)","Noncontact_Ankle_Injuries ((1=yes, 0=no)","Knee Injuries (1=yes, 0=no)",Number of Knee Injuries,"Severe_Knee_Injuries(1=yes, 0=no)","Noncontact_Knee_Injuries(1=yes, 0=no)","Thigh_Injuries(1=yes, 0=no)",Number of Thigh Injuries,"Severe_Thigh_Injuries (1=yes, 0=no)","Noncontact_Thigh_Injuries(1=yes, 0=no)","Risk Factor Condition (1=yes, 0=no)","Risk Factor Coordination (1=yes, 0=no)","Risk Factor Muscle Impairments (1=yes, 0=no)","Risk Factor Fatigue (1=yes, 0=no)","Risk Factor Previous Injury(1=yes, 0=no)","Risk Factor Attentiveness (1=yes, 0=no)","Risk Factor Other Player (1=yes, 0=no)","Risk Factor Equipment(1=yes, 0=no)","Risk Factor Climatic Condition (1=yes, 0=no)","Risk Factor Diet (1=yes, 0=no)",Importance Injury Prevention,"Knowledgeability (1=yes, 2=no)","Prevention Measure Stretching (1=yes, 0=no)","Prevention Measure Warm Up (1=yes, 0=no)","Prevention Measure Specific Strength Exercises (1=yes, 0=no)","Prevention Measure Bracing (1=yes, 0=no)","Prevention Measure Taping (1=yes, 0=no)","Prevention Measure Shoe Insoles (1=yes, 0=no)","Prevention Measure Face Masks (1=yes, 0=no)","Prevention Measure Medical Corset (1=yes, 0=no)"
146.00,19.00,173.00,67.60,1.00,3.00,1.00,1.00,6.00,1.00,3.00,0.00,1.00,0.00,0.00,0.00,0.00,1.00,3.00,0.00,1.00,0.00,0.00,0.00,1.00,0.00,0.00,0.00,1.00,0.00,0.00,2.00,1.00,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00
155.00,22.00,179.50,71.00,1.00,3.00,1.00,1.00,2.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,2.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,1.00,1.00,1.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00
160.00,22.00,175.50,71.80,1.00,3.00,1.00,1.00,7.00,1.00,4.00,1.00,1.00,0.00,0.00,0.00,0.00,1.00,3.00,1.00,1.00,0.00,0.00,0.00,1.00,0.00,0.00,1.00,0.00,0.00,0.00,1.00,1.00,1.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00
164.00,23.00,190.00,80.50,1.00,4.00,1.00,1.00,1.00,0.00,0.00,0.00,0.00,1.00,1.00,1.00,0.00,0.00,0.00,0.00,0.00,1.00,1.00,1.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,1.00,1.00,1.00,1.00,0.00,0.00,0.00,0.00,0.00


The cleaning process mainly involved cleaning the column names and ensuring that the format of the data in each column was consistent. I did not subset the data any further, so that I would keep my options open when I moved on to my analysis.
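These steps could look roughly like the following in pandas. This is an illustrative sketch on a two-row toy frame, not the code used for the project; the helper `clean_column_name`, the regular expression, and the 1/0 to yes/no mapping are assumptions inferred from the raw and cleaned previews.

```python
import re
import pandas as pd

# Toy frame reusing three of the raw column names shown above
raw = pd.DataFrame({
    "Age": [19.0, 22.0],
    "Previous Injuries (1=yes, 0=no)": [1.0, 0.0],
    "Noncontact_Ankle_Injuries ((1=yes, 0=no)": [1.0, 0.0],
})

def clean_column_name(name: str) -> str:
    """Drop a trailing '(1=yes, 0=no)'-style legend from a column name."""
    return re.sub(r"\s*\(+\s*1\s*=\s*yes.*$", "", name).strip()

# Note which columns carried a yes/no legend before the names are cleaned
binary_cols = [clean_column_name(c) for c in raw.columns
               if "1=yes" in c.replace(" ", "")]

cleaned = raw.rename(columns=clean_column_name)
for col in binary_cols:
    cleaned[col] = cleaned[col].map({1.0: "yes", 0.0: "no"})

# Whole-number columns read in as floats can be stored as integers
cleaned["Age"] = cleaned["Age"].astype(int)
print(cleaned)
```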

I cleaned this data set using R. The code can be found [here](https://github.com/anly501/dsan-5000-project-rennyd123/blob/main/codes/01-data-gathering/data_gathering%26cleaning.Rmd), and the final cleaned data set can be found below.

In [3]:
#| code-fold: true
import pandas as pd
file_path = "../../../../data/01-modified-data/injury_prevention_data_soccer.csv"
injury_prevention_factors = pd.read_csv(file_path)
injury_prevention_factors.head()

Unnamed: 0,ID,Age,Height,Mass,Team,Position,Years of Football Experience,Previous Injuries,Number of Injuries,Ankle Injuries,...,Importance Injury Prevention,Knowledgeability,Prevention Measure Stretching,Prevention Measure Warm Up,Prevention Measure Specific Strength Exercises,Prevention Measure Bracing,Prevention Measure Taping,Prevention Measure Shoe Insoles,Prevention Measure Face Masks,Prevention Measure Medical Corset
0,146,19,173.0,67.6,1,3,1,yes,6,yes,...,2,1,yes,no,yes,no,no,no,no,no
1,155,22,179.5,71.0,1,3,1,yes,2,no,...,1,1,yes,yes,no,no,no,no,no,no
2,160,22,175.5,71.8,1,3,1,yes,7,yes,...,1,1,yes,no,no,no,no,yes,no,no
3,164,23,190.0,80.5,1,4,1,yes,1,no,...,1,1,yes,yes,yes,no,no,no,no,no
4,145,19,173.5,68.7,1,3,1,yes,2,yes,...,1,2,yes,yes,no,no,yes,no,no,no


### Injuries from Different Activities Data Set

This data set required a little more cleaning. The formatting of the data set was very messy and not conducive to analysis at the beginning, so my main goals were to ensure that each age group was a column, and each activity was a different row.  

In [7]:
#| code-fold: true 
file_path = "../../../../data/00-raw-data/sports_injuries_data.csv"
injury_causes_dirty = pd.read_csv(file_path)
injury_causes_dirty.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X195,X196,X197,X198,X199,X200,X201,X202,X203,X204
0,"Number of injuries by age \n \nSport, activity...",,,,Number of injuries by age,Number of injuries by age,Number of injuries by age,,"Sport, activity or equipment",Injuries (1),...,2635.0,3261.0,572.0,"Nonpowder guns, BB'S, pellets",11603.0,519.0,3286.0,3443.0,3869.0,487.0
1,,,,Number of injuries by age,Number of injuries by age,Number of injuries by age,,,,,...,,,,,,,,,,
2,"Sport, activity or equipment",Injuries (1),Younger than 5,5 to 14,15 to 24,25 to 64,65 and older,,,,...,,,,,,,,,,
3,"Exercise, exercise equipment",445642,6662,36769,91013,229640,81558,,,,...,,,,,,,,,,
4,Bicycles and accessories,405411,13297,91089,50863,195030,55132,,,,...,,,,,,,,,,


To turn the raw data above into the cleaned data set below, I first read the web page with the read_html function in R, extracted the table from it, and wrote it to a CSV file. I then removed the first row of the data frame, which allowed me to redefine the column names and fixed the format of the data. After that, I removed the new first row and selected the first seven columns to end up with the data set below. The cleaning was done in R, and the code can be seen [here](https://github.com/anly501/dsan-5000-project-rennyd123/blob/main/codes/01-data-gathering/data_gathering%26cleaning.Rmd).
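For readers who prefer Python, the header-promotion step can be approximated in pandas. The sketch below works on a small made-up frame shaped like the scraped table; `promote_header` is a hypothetical helper, and the row and column positions are assumptions for the toy data rather than the exact positions in the real file.

```python
import pandas as pd

# Toy frame mimicking a scraped table whose real header sits in a data row
scraped = pd.DataFrame([
    ["Number of injuries by age", None, None, None],
    ["Sport, activity or equipment", "Injuries (1)", "Younger than 5", None],
    ["Basketball", "313924", "1216", None],
    ["Football", "265747", "581", None],
])

def promote_header(df: pd.DataFrame, header_row: int, n_cols: int) -> pd.DataFrame:
    """Use row `header_row` as column names and keep the first `n_cols` columns."""
    body = df.iloc[header_row + 1:, :n_cols].copy()
    body.columns = df.iloc[header_row, :n_cols].tolist()
    return body.reset_index(drop=True)

tidy = promote_header(scraped, header_row=1, n_cols=3)

# Counts arrive as text, so convert every column except the label column
count_cols = list(tidy.columns[1:])
tidy[count_cols] = tidy[count_cols].apply(pd.to_numeric)
print(tidy)
```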

In [6]:
#| code-fold: true 
file_path = "../../../../data/01-modified-data/sports_injury_data.csv"
injury_causes = pd.read_csv(file_path)
injury_causes.head()

Unnamed: 0,"Sport, activity or equipment",Injuries (1),Younger than 5,5 to 14,15 to 24,25 to 64,65 and older
0,"Exercise, exercise equipment",445642,6662,36769,91013,229640,81558
1,Bicycles and accessories,405411,13297,91089,50863,195030,55132
2,Basketball,313924,1216,109696,143773,57413,1825
3,Football,265747,581,145499,100760,18527,381
4,"ATV's, mopeds, minibikes, etc.",242347,3688,42069,61065,122941,12584


### NBA Injury Data 

In its raw form, the NBA data set is not extremely dirty, but it still needs a fair amount of work. As seen in the Acquired column, there are many missing values, and the formatting of the Notes column is very inconsistent. I cleaned this data set using Python, and my code can be found [here](../../../../codes/01-data-gathering/data_gathering&cleaning.ipynb).

Raw data: 

In [13]:
#| code-fold: true 

file_path = "../../../../data/00-raw-data/injuries_2010-2020.csv"
nba_injuries = pd.read_csv(file_path)
nba_injuries.head()

Unnamed: 0,Date,Team,Acquired,Relinquished,Notes
0,2010-10-03,Bulls,,Carlos Boozer,fractured bone in right pinky finger (out inde...
1,2010-10-06,Pistons,,Jonas Jerebko,torn right Achilles tendon (out indefinitely)
2,2010-10-06,Pistons,,Terrico White,broken fifth metatarsal in right foot (out ind...
3,2010-10-08,Blazers,,Jeff Ayres,torn ACL in right knee (out indefinitely)
4,2010-10-08,Nets,,Troy Murphy,strained lower back (out indefinitely)


The Acquired and Relinquished columns are related: when one player is relinquished, another player is acquired. Since I am not interested in acquired players, I dropped that column. I also extracted the information in parentheses in the Notes column, which describes the nature of the injury and how long the player would be out of the game, and pulled it into a separate column, InjuryStatus. Finally, I made sure that the formatting of the remaining columns was consistent.
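A minimal pandas sketch of these steps is shown below, using two rows copied from the raw preview. The regular expression and the date conversion are assumptions about one reasonable way to do this, not necessarily what the linked notebook does.

```python
import pandas as pd

# Two rows copied from the raw preview above
raw = pd.DataFrame({
    "Date": ["2010-10-06", "2010-10-08"],
    "Team": ["Pistons", "Nets"],
    "Acquired": [None, None],
    "Relinquished": ["Jonas Jerebko", "Troy Murphy"],
    "Notes": [
        "torn right Achilles tendon (out indefinitely)",
        "strained lower back (out indefinitely)",
    ],
})

cleaned = raw.drop(columns=["Acquired"])

# Capture the text inside the first pair of parentheses;
# notes without parentheses would get a missing value
cleaned["InjuryStatus"] = cleaned["Notes"].str.extract(r"\(([^)]*)\)", expand=False)

# Optional: parse dates so the column has one consistent type
cleaned["Date"] = pd.to_datetime(cleaned["Date"])
print(cleaned)
```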

In [14]:
#| code-fold: true
file_path = "../../../../data/01-modified-data/basketball_injury_data.csv"
nba_injuries_cleaned = pd.read_csv(file_path)
nba_injuries_cleaned.head()

Unnamed: 0,Date,Team,Relinquished,Notes,InjuryStatus
0,2010-10-03,Bulls,Carlos Boozer,fractured bone in right pinky finger (out inde...,out indefinitely
1,2010-10-06,Pistons,Jonas Jerebko,torn right Achilles tendon (out indefinitely),out indefinitely
2,2010-10-06,Pistons,Terrico White,broken fifth metatarsal in right foot (out ind...,out indefinitely
3,2010-10-08,Blazers,Jeff Ayres,torn ACL in right knee (out indefinitely),out indefinitely
4,2010-10-08,Nets,Troy Murphy,strained lower back (out indefinitely),out indefinitely


### NFL Concussion Data

This data set was relatively clean to begin with and was already suitable for my desired analysis. To clean it, I dropped all unnecessary columns and made the format of the columns consistent by turning all numbers into floats. I also decided to drop all rows with NaN values, because the data missing from those rows meant that the story of the injury did not make sense. For example, in one row a player had a pre-season injury and therefore did not miss any weeks, yet had missing values for play time both before and after the injury. I did this using Python, and my code can be found [here](../../../../codes/01-data-gathering/data_gathering&cleaning.ipynb).
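The three steps (drop columns, cast numbers to floats, drop incomplete rows) can be sketched in pandas as follows. The toy rows are modeled on the raw preview below, but the `ID` values and the choice of which columns to keep are made up for the example.

```python
import pandas as pd

# Toy rows modeled on the raw preview; the ID values are placeholders
raw = pd.DataFrame({
    "ID": ["row-1", "row-2"],
    "Player": ["Aldrick Robinson", "Lorenzo Booker"],
    "Week of Injury": [4, 1],
    "Weeks Injured": [1, 0],
    "Games Missed": [1.0, None],
    "Total Snaps": [0, 0],
})

numeric_cols = ["Week of Injury", "Weeks Injured", "Games Missed", "Total Snaps"]

cleaned = raw.drop(columns=["ID"])                        # drop unneeded columns
cleaned[numeric_cols] = cleaned[numeric_cols].astype(float)  # consistent numeric type
cleaned = cleaned.dropna().reset_index(drop=True)         # drop incomplete rows
print(cleaned)
```

Dropping every row with a missing value is a blunt tool; it is reasonable here only because the incomplete rows were judged to be unreliable as a whole.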

Raw data:

In [18]:
#| code-fold: true 

file_path = "../../../../data/00-raw-data/Concussion Injuries 2012-2014.csv"
concussions_dirty = pd.read_csv(file_path)
concussions_dirty.head()

Unnamed: 0,ID,Player,Team,Game,Date,Opposing Team,Position,Pre-Season Injury?,Winning Team?,Week of Injury,Season,Weeks Injured,Games Missed,Unknown Injury?,Reported Injury Type,Total Snaps,Play Time After Injury,Average Playtime Before Injury
0,Aldrick Robinson - Washington Redskins vs. Tam...,Aldrick Robinson,Washington Redskins,Washington Redskins vs. Tampa Bay Buccaneers (...,30/09/2012,Tampa Bay Buccaneers,Wide Receiver,No,Yes,4,2012/2013,1,1.0,No,Head,0,14 downs,37.00 downs
1,D.J. Fluker - Tennessee Titans vs. San Diego C...,D.J. Fluker,San Diego Chargers,Tennessee Titans vs. San Diego Chargers (22/9/...,22/09/2013,Tennessee Titans,Offensive Tackle,No,No,3,2013/2014,1,1.0,No,Concussion,0,78 downs,73.50 downs
2,Marquise Goodwin - Houston Texans vs. Buffalo ...,Marquise Goodwin,Buffalo Bills,Houston Texans vs. Buffalo Bills (28/9/2014),28/09/2014,Houston Texans,Wide Receiver,No,No,4,2014/2015,1,1.0,No,Concussion,0,25 downs,17.50 downs
3,Bryan Stork - New England Patriots vs. Buffalo...,Bryan Stork,New England Patriots,New England Patriots vs. Buffalo Bills (12/10/...,12/10/2014,Buffalo Bills,Center,No,Yes,6,2014/2015,1,1.0,No,Head,0,82 downs,41.50 downs
4,Lorenzo Booker - Chicago Bears vs. Indianapoli...,Lorenzo Booker,Chicago Bears,Chicago Bears vs. Indianapolis Colts (9/9/2012),9/09/2012,Indianapolis Colts,Running Back,Yes,Yes,1,2012/2013,0,,No,Head,0,Did not return from injury,


The data set below is the one I used for exploratory data analysis.

Cleaned data:

In [19]:
#| code-fold: true 

file_path = "../../../../data/01-modified-data/nfl_concussions.csv"
concussions_cleaned = pd.read_csv(file_path)
concussions_cleaned.head()

Unnamed: 0,Player,Team,Date,Opposing Team,Position,Pre-Season Injury?,Winning Team?,Week of Injury,Season,Weeks Injured,Games Missed,Unknown Injury?,Reported Injury Type,Total Snaps,Play Time After Injury,Average Playtime Before Injury
0,Aldrick Robinson,Washington Redskins,30/09/2012,Tampa Bay Buccaneers,Wide Receiver,No,Yes,4.0,2012/2013,1.0,1.0,No,Head,0.0,14.0 downs,37.00 downs
1,D.J. Fluker,San Diego Chargers,22/09/2013,Tennessee Titans,Offensive Tackle,No,No,3.0,2013/2014,1.0,1.0,No,Concussion,0.0,78.0 downs,73.50 downs
2,Marquise Goodwin,Buffalo Bills,28/09/2014,Houston Texans,Wide Receiver,No,No,4.0,2014/2015,1.0,1.0,No,Concussion,0.0,25.0 downs,17.50 downs
3,Bryan Stork,New England Patriots,12/10/2014,Buffalo Bills,Center,No,Yes,6.0,2014/2015,1.0,1.0,No,Head,0.0,82.0 downs,41.50 downs
4,Daniel Kilgore,San Francisco 49ers,29/10/2012,Arizona Cardinals,Guard,No,Yes,8.0,2012/2013,1.0,0.0,No,Concussion,1.0,8.0 downs,14.43 downs


### NFL Game Injury Data 

My NFL game injury data set was composed of two separate data sets, which you can see below. The first data set mainly contains information about the injuries that occurred and the situations they occurred in. The second data set contains more specific information about the field, the weather conditions of the game, and the player. Combined, these two data sets help to paint a coherent picture of injuries in the NFL. I cleaned both of these data sets using Python, and the code can be found [here](../../../../../../codes/01-data-gathering/data_gathering&cleaning.ipynb).

The raw data can be seen here: 

In [23]:
#| code-fold: true
file_path = "../../../../data/00-raw-data/InjuryRecord.csv"
injury_record = pd.read_csv(file_path)

print("INJURY RECORD DATA SET:")
injury_record.head()

INJURY RECORD DATA SET:


Unnamed: 0,PlayerKey,GameID,PlayKey,BodyPart,Surface,DM_M1,DM_M7,DM_M28,DM_M42
0,39873,39873-4,39873-4-32,Knee,Synthetic,1,1,1,1
1,46074,46074-7,46074-7-26,Knee,Natural,1,1,0,0
2,36557,36557-1,36557-1-70,Ankle,Synthetic,1,1,1,1
3,46646,46646-3,46646-3-30,Ankle,Natural,1,0,0,0
4,43532,43532-5,43532-5-69,Ankle,Synthetic,1,1,1,1


In [24]:
#| code-fold: true
file_path = "../../../../data/00-raw-data/PlayList.csv"
play_list = pd.read_csv(file_path)
print("PLAY LIST DATA SET:")
play_list.head()

PLAY LIST DATA SET:


Unnamed: 0,PlayerKey,GameID,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,PositionGroup
0,26624,26624-1,26624-1-1,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,1,QB,QB
1,26624,26624-1,26624-1-2,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,2,QB,QB
2,26624,26624-1,26624-1-3,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,3,QB,QB
3,26624,26624-1,26624-1-4,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,4,QB,QB
4,26624,26624-1,26624-1-5,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,5,QB,QB


To make my final data set, I first cleaned each set individually. For the first one, I simply had to drop the irrelevant columns, which I deemed to be every column starting with DM as well as the GameID and PlayKey columns. For the second data set, I also only had to drop excess columns. I then merged the two data sets into one larger set. From there, I cleaned up the stadium type column by reconciling inconsistent spellings, dropped missing values, and dropped duplicate rows.
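The pipeline described above can be sketched in pandas roughly as follows. The toy rows, the merge key (`PlayerKey`), and the spelling map `STADIUM_MAP` are all assumptions made for illustration; the real spellings and the exact merge used in the project are in the linked notebook.

```python
import pandas as pd

# Toy stand-ins for the two raw tables (values partly made up)
injury_record = pd.DataFrame({
    "PlayerKey": [39873, 46074],
    "GameID": ["39873-4", "46074-7"],
    "PlayKey": ["39873-4-32", "46074-7-26"],
    "BodyPart": ["Knee", "Knee"],
    "DM_M1": [1, 1],
    "DM_M7": [1, 1],
})
play_list = pd.DataFrame({
    "PlayerKey": [39873, 39873, 39873, 46074],
    "RosterPosition": ["Linebacker", "Linebacker", "Linebacker", "Safety"],
    "StadiumType": ["Indoors", "Indoors", "Outdoor", None],
    "FieldType": ["Synthetic", "Synthetic", "Natural", "Natural"],
})

# Hypothetical map from inconsistent spellings to one canonical label
STADIUM_MAP = {"indoors": "indoor", "outdoors": "outdoor", "oudoor": "outdoor"}

# Drop every DM* column plus the game and play identifiers
drop_cols = [c for c in injury_record.columns if c.startswith("DM")]
injuries = injury_record.drop(columns=drop_cols + ["GameID", "PlayKey"])
plays = play_list[["PlayerKey", "RosterPosition", "StadiumType", "FieldType"]]

# Assumed merge key: PlayerKey
merged = injuries.merge(plays, on="PlayerKey", how="inner")
normalized = merged["StadiumType"].str.strip().str.lower()
merged["StadiumType"] = normalized.replace(STADIUM_MAP)
merged = merged.dropna().drop_duplicates().reset_index(drop=True)
print(merged)
```

Normalizing the spellings before calling `drop_duplicates` matters: rows that differ only in how the stadium type was spelled would otherwise survive as separate records.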

The final data set can be found here:

In [25]:
#| code-fold: true 
file_path = "../../../../data/01-modified-data/nfl_injuries.csv"
nfl_injuries = pd.read_csv(file_path)
nfl_injuries.head()

Unnamed: 0,PlayerKey,BodyPart,RosterPosition,StadiumType,FieldType,Temperature,Weather
0,39873,Knee,Linebacker,indoor,Synthetic,85,Mostly Cloudy
1,39873,Knee,Linebacker,outdoor,Natural,82,Sunny
2,39873,Knee,Linebacker,indoor,Synthetic,84,Cloudy
3,39873,Knee,Linebacker,retractable roof,Synthetic,78,Partly Cloudy
4,39873,Knee,Linebacker,outdoor,Natural,80,Cloudy


## Text Data Cleaning 

### News Data Cleaning

My news data came from the NewsAPI. To get this data clean, I first had to extract the data from the API; then I could clean up the punctuation, convert the data to a data frame, rename the columns, and create a combined title-and-description field for each article. I also removed stop words to make the data better suited for analysis. It is important to note that this data set will be different every time my cleaning code is run, because the news changes from day to day. I last updated this data set on December 4, 2023, the day of writing. My cleaning code can be found [here](../../../../codes/01-data-gathering/data_gathering&cleaning.ipynb).
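The step of turning API results into a data frame can be sketched as below. The article dictionary is a hypothetical example shaped like an entry in a NewsAPI `articles` list (the description is a placeholder), `articles_to_frame` is an invented helper name, and the sketch only lower-cases and combines fields; it does not reproduce the punctuation or stop-word handling in the linked notebook.

```python
import pandas as pd

# Hypothetical article shaped like an entry in a NewsAPI "articles" list
articles = [
    {
        "source": {"id": "espn", "name": "ESPN"},
        "author": "Cesar Hernandez",
        "title": "Rapinoe's career ends with injury in title game",
        "description": "Placeholder description, not the real one.",
        "publishedAt": "2023-11-12T01:57:26Z",
    }
]

COLUMNS = ["source", "author", "publish_date", "combined_t&d"]

def articles_to_frame(items: list) -> pd.DataFrame:
    """Flatten article dicts into the four columns used in the cleaned data."""
    rows = []
    for art in items:
        # Any field may be missing or None, so fall back to empty values
        source_name = (art.get("source") or {}).get("name") or ""
        title = art.get("title") or ""
        description = art.get("description") or ""
        rows.append({
            "source": source_name.lower(),
            "author": (art.get("author") or "").lower(),
            "publish_date": art.get("publishedAt"),
            "combined_t&d": f"{title} {description}".strip().lower(),
        })
    return pd.DataFrame(rows, columns=COLUMNS)

news = articles_to_frame(articles)
print(news)
```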

The cleaned data set can be found here:

In [27]:
#| code-fold: true

file_path = "../../../../codes/01-data-gathering/cleaned_news_data.csv"
cleaned_news = pd.read_csv(file_path)
cleaned_news.head()

Unnamed: 0,source,author,publish_date,combined_t&d
0,cnet,kevin lynch,2023-11-07T14:45:04Z,watch champions league soccer: livestream boru...
1,espn,cesar hernandez,2023-11-12T01:57:26Z,rapinoe's career ends with injury in title gam...
2,cnet,kevin lynch,2023-11-25T09:30:05Z,man city vs liverpool livestream: how to watch...
3,cnet,adam oram,2023-11-25T12:00:04Z,newcastle vs chelsea livestream: how to watch ...
4,cnet,kevin lynch,2023-11-07T15:45:05Z,watch champions league soccer: livestream shak...


### Injury Prevention Text Data 

My last data set is one that I created as a text version of my injury prevention factors data set. I researched and wrote a definition for each injury prevention method and built a data frame from them. The original data set and the text data set that I created can both be found below, along with the code that builds the text data set.

In [28]:
#| code-fold: true
file_path = "../../../../data/01-modified-data/injury_prevention_data_soccer.csv"
soccer_injury = pd.read_csv(file_path)
print("ORIGINAL INJURY PREVENTION FACTORS DATA SET:")
soccer_injury.head()

ORIGINAL INJURY PREVENTION FACTORS DATA SET:


Unnamed: 0,ID,Age,Height,Mass,Team,Position,Years of Football Experience,Previous Injuries,Number of Injuries,Ankle Injuries,...,Importance Injury Prevention,Knowledgeability,Prevention Measure Stretching,Prevention Measure Warm Up,Prevention Measure Specific Strength Exercises,Prevention Measure Bracing,Prevention Measure Taping,Prevention Measure Shoe Insoles,Prevention Measure Face Masks,Prevention Measure Medical Corset
0,146,19,173.0,67.6,1,3,1,yes,6,yes,...,2,1,yes,no,yes,no,no,no,no,no
1,155,22,179.5,71.0,1,3,1,yes,2,no,...,1,1,yes,yes,no,no,no,no,no,no
2,160,22,175.5,71.8,1,3,1,yes,7,yes,...,1,1,yes,no,no,no,no,yes,no,no
3,164,23,190.0,80.5,1,4,1,yes,1,no,...,1,1,yes,yes,yes,no,no,no,no,no
4,145,19,173.5,68.7,1,3,1,yes,2,yes,...,1,2,yes,yes,no,no,yes,no,no,no


In [35]:
#| code-fold: true

import pandas as pd

file_path = "../../../../data/01-modified-data/injury_prevention_data_soccer.csv"
soccer_injury = pd.read_csv(file_path)
soccer_injury.columns = soccer_injury.columns.str.strip()
soccer_injury = soccer_injury.replace({"yes":1, "no":0})

columns_keep = ['Prevention Measure Stretching', 'Prevention Measure Warm Up',
       'Prevention Measure Specific Strength Exercises',
       'Prevention Measure Bracing', 'Prevention Measure Taping',
       'Prevention Measure Shoe Insoles', 'Prevention Measure Face Masks',
       'Prevention Measure Medical Corset', "Number of Injuries"]
df_features = soccer_injury[columns_keep]
df_features.columns

prevention_stretching = "Athletes can help protect themselves by preparing before and after a game or practice session by warming up muscles and then stretching. Exercises can include forward lunges, side lunges, standing quad stretch, seated straddle lotus, seated side straddle, seated toe touch, and the knees to chest stretch. Hold each stretch for 20 seconds. "
prevention_warm_up = "Warming up involves increasing the body's core temperature, heart rate, respiratory rate, and the body's muscle temperatures. By increasing muscle temperature, muscles become more loose and pliable, and by increasing heart rate and respiratory rate, blood flow increases which helps to increase delivery of oxygen and nutrients to muscles. Warm up exercises include dynamic stretches, light bike riding, light jogging, jumping rope, etc. "
prevention_specific_strength = "Specific strength exercises can refer to a wide range of exercises, from which a few are selected based on each athlete. These exercises can range from muscle group specific weight-lifting exercises, to physical therapy-like exercises. These exercises depend on each athlete and are hard to define. "
prevention_bracing = "Basic braces provide general support and compression to specific areas of the body. More complex braces can do the same things, as well as promote healing, restrict movement where necessary, take weight off of an injury, etc. "
prevention_taping = "Taping can be used to reduce the range of motion at a joint and decrease swelling, which in turn can alleviate pain and prevent further injury. "
prevention_shoe_insoles = "Orthotics can improve the alignment of an athlete's feet, ankles, knees, hips and back which can help prevent injuries. They can also absorb shock from the impact of running to reduce stress on the athlete's joints and tissues. "
prevention_face_masks = "Athletic face masks can be used to protect against maxillary, nasal, zygomatic and orbital injuries. These are worn in sports where a face injury could possibly occur. "
prevention_medical_corset = "A medical corset is a corset that can be worn to help an athlete stabilize their spine after a fracture or surgery. It will remind the athlete to not move in certain directions or to move more slowly to prevent causing further injury. "

prevention_definitions = {
    "Prevention Measure Stretching": prevention_stretching,
    "Prevention Measure Warm Up": prevention_warm_up,
    'Prevention Measure Specific Strength Exercises': prevention_specific_strength,
    'Prevention Measure Bracing': prevention_bracing, 
    'Prevention Measure Taping': prevention_taping,
    'Prevention Measure Shoe Insoles': prevention_shoe_insoles, 
    'Prevention Measure Face Masks': prevention_face_masks,
    'Prevention Measure Medical Corset': prevention_medical_corset
}

cols_keep = ['Number of Injuries', 'Prevention Measure Stretching', 'Prevention Measure Warm Up',
       'Prevention Measure Specific Strength Exercises',
       'Prevention Measure Bracing', 'Prevention Measure Taping',
       'Prevention Measure Shoe Insoles', 'Prevention Measure Face Masks',
       'Prevention Measure Medical Corset']
prevention_cols = ['Prevention Measure Stretching', 'Prevention Measure Warm Up',
       'Prevention Measure Specific Strength Exercises',
       'Prevention Measure Bracing', 'Prevention Measure Taping',
       'Prevention Measure Shoe Insoles', 'Prevention Measure Face Masks',
       'Prevention Measure Medical Corset']


# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
si_subset = soccer_injury[cols_keep].copy()
si_subset_prevention = si_subset[prevention_cols]
si_subset_prevention.head()

si_subset["Prevention Measures Definitions"] = ""

# Append the definition of every prevention measure a player uses (value 1).
# Looping over prevention_cols avoids touching the new definitions column itself.
for col in prevention_cols:
    si_subset["Prevention Measures Definitions"] += si_subset[col].apply(
        lambda x: prevention_definitions[col] if x == 1 else ""
    )

print("INJURY PREVENTION FACTORS TEXT DATA DATA SET:")
si_subset.head()

INJURY PREVENTION FACTORS TEXT DATA DATA SET:


Unnamed: 0,Number of Injuries,Prevention Measure Stretching,Prevention Measure Warm Up,Prevention Measure Specific Strength Exercises,Prevention Measure Bracing,Prevention Measure Taping,Prevention Measure Shoe Insoles,Prevention Measure Face Masks,Prevention Measure Medical Corset,Prevention Measures Definitions
0,6,1,0,1,0,0,0,0,0,Athletes can help protect themselves by prepar...
1,2,1,1,0,0,0,0,0,0,Athletes can help protect themselves by prepar...
2,7,1,0,0,0,0,1,0,0,Athletes can help protect themselves by prepar...
3,1,1,1,1,0,0,0,0,0,Athletes can help protect themselves by prepar...
4,2,1,1,0,0,1,0,0,0,Athletes can help protect themselves by prepar...
