<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-Import-Packages" data-toc-modified-id="1.-Import-Packages-1">1. Import Packages</a></span></li><li><span><a href="#2.-Read-in-Data" data-toc-modified-id="2.-Read-in-Data-2">2. Read in Data</a></span></li><li><span><a href="#3.-Data-Cleaning" data-toc-modified-id="3.-Data-Cleaning-3">3. Data Cleaning</a></span><ul class="toc-item"><li><span><a href="#Filling-Nulls" data-toc-modified-id="Filling-Nulls-3.1">Filling Nulls</a></span></li><li><span><a href="#Data-Types" data-toc-modified-id="Data-Types-3.2">Data Types</a></span></li><li><span><a href="#Creating-New/Dropping-Columns" data-toc-modified-id="Creating-New/Dropping-Columns-3.3">Creating New/Dropping Columns</a></span></li><li><span><a href="#Editing-Player-Names" data-toc-modified-id="Editing-Player-Names-3.4">Editing Player Names</a></span></li><li><span><a href="#Removing-Unnecessary-Rows" data-toc-modified-id="Removing-Unnecessary-Rows-3.5">Removing Unnecessary Rows</a></span></li></ul></li><li><span><a href="#4.-Renaming-Columns" data-toc-modified-id="4.-Renaming-Columns-4">4. Renaming Columns</a></span></li><li><span><a href="#5.-Saving-Clean-File-to-CSV" data-toc-modified-id="5.-Saving-Clean-File-to-CSV-5">5. Saving Clean File to CSV</a></span></li></ul></div>

# 1. Import Packages

In [1]:
import pandas as pd
import numpy as np

# 2. Read in Data

Our data set on NBA injuries comes from Kaggle (https://www.kaggle.com/ghopkins/nba-injuries-2010-2018). Each row is an injury-related announcement, listed with its date and the player's team at the time.

In [2]:
inj = pd.read_csv('../data/2010-2018_NBA_injuries.csv')

# 3. Data Cleaning

In [3]:
inj.head()

Unnamed: 0,Date,Team,Acquired,Relinquised,Notes
0,2010-10-03,Bulls,,Carlos Boozer,fractured bone in right pinky finger (out inde...
1,2010-10-06,Pistons,,Jonas Jerebko,torn right Acchilles tendon (out indefinitely)
2,2010-10-06,Pistons,,Terrico White,broken fifth metatarsal in right foot (out ind...
3,2010-10-08,Blazers,,Jeff Pendergraph / Jeff Ayres,torn ACL in right knee (out indefinitely)
4,2010-10-08,Nets,,Troy Murphy,strained lower back (out indefinitely)


In [4]:
inj.tail()

Unnamed: 0,Date,Team,Acquired,Relinquised,Notes
9778,2018-05-22,Warriors,,Andre Iguodala,bruised left leg (DTD)
9779,2018-05-25,Rockets,,Chris Paul,strained right hamstring (out for season)
9780,2018-05-26,Cavaliers,,Kevin Love,concussion (DTD)
9781,2018-05-31,Cavaliers,Kevin Love,,returned to lineup
9782,2018-06-06,Warriors,Andre Iguodala,,returned to lineup


## Filling Nulls

In [5]:
inj.isnull().sum()

Date              0
Team              3
Acquired       8194
Relinquised    1589
Notes             0
dtype: int64

All of the nulls are in the team column and the two player-name columns (Acquired and Relinquised), so we can fill them with empty strings.

In [6]:
inj.fillna('', inplace = True)
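As a minimal sketch of the same idea, the fill can be restricted to the text columns so that any other dtype is left alone. The toy frame and the `text_cols` list below are made up for illustration and are not part of the Kaggle file.

```python
import pandas as pd

# Toy frame (made-up rows) with the same null pattern as the real data
toy = pd.DataFrame({
    'Team': ['Bulls', None],
    'Acquired': [None, 'Kevin Love'],
    'Relinquised': ['Carlos Boozer', None],
})

# Fill only the text columns with empty strings
text_cols = ['Team', 'Acquired', 'Relinquised']
toy[text_cols] = toy[text_cols].fillna('')

# Count any nulls that are left after the fill
remaining_nulls = int(toy.isnull().sum().sum())
```

If `remaining_nulls` is zero, every gap was in one of the listed columns.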

## Data Types

In [7]:
inj.dtypes

Date           object
Team           object
Acquired       object
Relinquised    object
Notes          object
dtype: object

We want to convert the dates to datetime objects, and the other four columns should all be strings.

In [8]:
inj['Date'] = pd.to_datetime(inj['Date'])
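If we were unsure whether every date string is well formed, one option is to pass `errors='coerce'` and count the values that fail to parse. This is a sketch on made-up strings, not a check that was run on the real file.

```python
import pandas as pd

# Made-up date strings; the last one is deliberately malformed
dates = pd.Series(['2010-10-03', '2018-06-06', 'not a date'])

# errors='coerce' turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(dates, errors='coerce')

# Number of values that could not be parsed
n_bad = int(parsed.isna().sum())
```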

Next we want to convert the remaining four columns to string.

In [9]:
# cast each text column to str in one pass
for col in ['Team', 'Acquired', 'Relinquised', 'Notes']:
    inj[col] = inj[col].astype(str)

## Creating New/Dropping Columns

We noticed that each row has only one player, in either the Acquired or the Relinquised column. We want to combine these into one player column and then delete the two old columns.

In [10]:
inj['name'] = inj['Acquired'] + inj['Relinquised']

In [11]:
# we will also drop the team column as we don't require that
inj.drop(columns = ['Acquired', 'Relinquised', 'Team'], inplace = True)
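The string concatenation only yields a clean name if each row has exactly one of the two columns filled. A minimal sketch of how that assumption could be checked before the source columns are dropped, using a made-up toy frame:

```python
import pandas as pd

# Toy frame (made-up rows): nulls already replaced by empty strings
toy = pd.DataFrame({
    'Acquired': ['', 'Kevin Love'],
    'Relinquised': ['Carlos Boozer', ''],
})

# True where exactly one of the two columns holds a name (exclusive or)
has_acquired = toy['Acquired'] != ''
has_relinquished = toy['Relinquised'] != ''
exactly_one = has_acquired ^ has_relinquished

# Safe to concatenate only when every row passes the check
toy['name'] = toy['Acquired'] + toy['Relinquised']
```

Rows where `exactly_one` is False would need a manual look, since they either name two players or none.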

## Editing Player Names

In [12]:
inj['name'] = inj['name'].map(lambda i : i.lower().replace(
            ' ', '').replace(
            '.', '').replace(
            '-', '').replace(
            "'", "").replace(
            '/', '').replace(
            '(william)', ''))   # necessary due to one player
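The same normalization can be sketched as a small named function, which is easier to reuse on another data set whose names need to match. `normalize_name` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
def normalize_name(raw: str) -> str:
    """Lowercase a player name and strip the same tokens as the cell above."""
    name = raw.lower()
    # Same tokens, in the same order, as the chained .replace() calls
    for token in (' ', '.', '-', "'", '/', '(william)'):
        name = name.replace(token, '')
    return name


example_single = normalize_name('Carlos Boozer')
example_alias = normalize_name('Jeff Pendergraph / Jeff Ayres')
```

It could be applied with `inj['name'].map(normalize_name)`. Note that a player listed under two names, as in the second example, ends up with both names joined together.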

## Removing Unnecessary Rows

This data comes from a set of announcements related to injuries. In our initial viewing of the data, we saw that some of the rows are not relevant for our purposes and should be removed from the set.

We can drop rows where the note is "returned to lineup". We can also drop rows that give flu or illness as the reason for not playing. Finally, we want to clear out several other notes that are not relevant, such as coaches returning to a team.

We also want to clear out any rows where a player did not play due to rest. In some cases this could mean that the team is managing a player's particular needs without naming an injury, but our understanding is that teams often plan these rest days ahead of time regardless of the player's current physical state. Therefore, we will remove these rows. However, we may want to look at how teams handle rest days later, so we will save these rows to a new CSV file before removing them from the main injuries set.

In [13]:
# rest csv file
inj[(inj['Notes'] == 'rest (DTD)') | (inj['Notes'] == 'rest (DNP)')].to_csv('../data/rest.csv')
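An equivalent selection can be written with `Series.isin`, which scales better if more rest-related notes turn up. This is a sketch on a made-up toy frame; it writes the CSV to a string rather than to `../data/rest.csv`, and `index=False` is our choice here, not what the cell above does.

```python
import pandas as pd

# Toy frame (made-up rows) with two rest notes and one injury note
toy = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'Notes': ['rest (DTD)', 'concussion (DTD)', 'rest (DNP)'],
})

# Keep only the rows whose note is one of the rest notes
rest_notes = ['rest (DTD)', 'rest (DNP)']
rest = toy[toy['Notes'].isin(rest_notes)]

# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = rest.to_csv(index=False)
```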

In [14]:
# notes to drop; every entry needs a trailing comma, otherwise
# Python joins adjacent string literals into a single string
remove_string = ['returned to lineup',
                 'illness (DNP)',
                 'illness (DTD)',
                 'flu (DNP)',
                 'flu (DTD)',
                 'returned as head coach',
                 'returned to lineup (CBC)',
                 'activated from IL',
                 'head coach returned to team',
                 'returned to team as head coach',
                 'headache (DNP)',
                 'migraine headache (DNP)',
                 'ill (DNP)',
                 'ill (DTD)',
                 'rest (DTD)',
                 'rest (DNP)',
                 'personal reasons (DNP)',
                 'personal reasons (DTD)',
                 'DNP',
                 'blood clots (out for season)',
                 'thrombocytopenia (blood disorder) (DTD)',
                 'upper respiratory infection (DNP)',
                 'upper respiratory infection (DTD)',
                 'upper respiratory illness (DTD)',
                 'upper respiratory illness (DNP)',
                 'upper respiratory illness (DTD) (CBC Y)',
                 'illness / upper respiratory infection (DTD)'
                ]
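One pitfall with long literal lists like this one: Python joins adjacent string literals when a comma is missing, which shortens the list without raising an error. A minimal sketch with made-up list names, showing how a simple length check might catch it:

```python
# Intended list of three notes
expected = ['ill (DNP)', 'ill (DTD)', 'rest (DTD)']

# Same list with the first comma missing: the first two literals are joined
buggy = ['ill (DNP)'
         'ill (DTD)',
         'rest (DTD)']

# The lengths differ, which exposes the missing comma
lengths = (len(expected), len(buggy))
```

Comparing the list length against the number of notes we meant to include is a cheap guard before filtering.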

In [15]:
for phrase in remove_string:
    inj = inj[inj['Notes'] != phrase]
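The loop filters the frame once per phrase. The same exact-match filter can be sketched in a single pass with `Series.isin`, shown here on a made-up toy frame and a shortened, hypothetical `remove_notes` list:

```python
import pandas as pd

# Toy frame (made-up rows) mixing relevant and irrelevant notes
toy = pd.DataFrame({'Notes': [
    'returned to lineup',
    'concussion (DTD)',
    'flu (DNP)',
    'torn ACL in right knee (out indefinitely)',
]})

# Shortened stand-in for the full list of notes to drop
remove_notes = ['returned to lineup', 'flu (DNP)']

# ~ negates the mask, so only notes NOT in the list are kept
kept = toy[~toy['Notes'].isin(remove_notes)]
```

Both approaches match whole notes only, so a note such as "flu (DNP) (CBC)" would survive unless it is listed explicitly.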

# 4. Renaming Columns

In [16]:
inj.rename(columns = {'Date' : 'date', 'Notes' : 'notes'}, inplace = True)

# 5. Saving Clean File to CSV

In [17]:
inj.to_csv('../data/injuries_clean.csv')
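CSV files do not store dtypes, so the date column comes back as plain strings unless it is parsed again on load. A minimal round-trip sketch, using an in-memory buffer and made-up rows in place of `../data/injuries_clean.csv`:

```python
import io

import pandas as pd

# Toy cleaned frame (made-up rows) with the final column names
clean = pd.DataFrame({
    'date': pd.to_datetime(['2010-10-03', '2018-05-26']),
    'notes': ['strained lower back (out indefinitely)', 'concussion (DTD)'],
    'name': ['troymurphy', 'kevinlove'],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype when reading back
reloaded = pd.read_csv(buffer, parse_dates=['date'])
```

Writing with `index=False` here is our assumption; the cell above keeps the index, which would then appear as an extra unnamed column on reload.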