# Hockey fights

David Singer, the gentleman who runs [hockeyfights dot com](http://www.hockeyfights.com/), was kind enough to provide us with a cut of the data powering his website for us to use in training sessions. Thanks, David!

This data lives here: `../data/hockey-fights.xlsx`. Every row in the data is one fight.

![fight fight fight](../img/hockey.gif "fight fight fight")

First, import pandas, then use the `read_excel()` function to load the data into a dataframe.

In [None]:
# import pandas


In [None]:
# read in the excel file, specifying the data sheet to load


In [None]:
# check the output with `head()`


### Check out the data

In [None]:
# check the output with `info()`


In [None]:
# check earliest date with `min()`


In [None]:
# check latest date with `max()`


In [None]:
# check home team values with `unique()`


In [None]:
# check away team values with `unique()`


In [None]:
# etc ...

### Come up with a list of questions

- Which player was involved in the most fights?
- Average number of fights per game?
- What was the longest fight?

... what else?

### Q: Which player was involved in the most fights?

This one will be a little tricky because of how the data is structured -- a player could be fighting either as the home or away player, so there's not an obvious column to group or pivot on. There are a couple of strategies we could use to answer this question. Here's what we're going to do:

- Use the [`concat()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) function to stack the values in each player ID column into one Series. (We're using player ID instead of name to avoid the "John Smith" problem, or I guess the "Graham MacKenzie" problem.)
- Use `value_counts()` to get a count
- Grab the player ID with the most fights by getting the first ([0]) element in the `index` list for the Series returned by `value_counts()`
- Go back to the original data frame and filter for that ID, then use [`iloc`](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iloc.html) to get a single fight record with the player's name, team, etc.
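
The steps above can be sketched on a tiny made-up data frame. Everything in it is an assumption for illustration: the column names (`away_player_id`, `away_player`, `home_player_id`, `home_player`), the IDs and the player names are hypothetical, so swap in whatever `info()` shows for the real data.

```python
import pandas as pd

# Hypothetical stand-in for the fights data; column names and values
# are invented for illustration and will differ in the real spreadsheet
fights = pd.DataFrame({
    "away_player_id": [101, 102, 103, 101],
    "away_player": ["Player A", "Player B", "Player C", "Player A"],
    "home_player_id": [102, 101, 104, 103],
    "home_player": ["Player B", "Player A", "Player D", "Player C"],
})

# stack both ID columns into a single Series
all_ids = pd.concat([fights["away_player_id"], fights["home_player_id"]])

# count fights per ID; value_counts() sorts from most to fewest
counts = all_ids.value_counts()

# the first entry in the index is the ID with the most fights
top_id = counts.index[0]

# filter for fights where that player was the away player, take the first
as_away = fights[fights["away_player_id"] == top_id]
first_fight = as_away.iloc[0]

print(first_fight["away_player"], counts.loc[top_id])
```

One thing to watch for: if the top player never fought as the away player, the filtered frame is empty and `iloc[0]` raises an `IndexError`. In that case, filter on the home player ID column instead.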

In [None]:
# use `concat()` to get a list of all player IDs


In [None]:
# use `value_counts()` to get counts of each ID, then `index` to grab the first one


In [None]:
# filter to get records where that player was involved in fights as away player
# and use `iloc` to get the first one


In [None]:
# print that player's name and team name


### Q: Average number of fights per game?

This one will be pretty easy. We need two numbers: the total number of fights -- which is the same as asking how many records are in our data frame -- and the total number of games, which just means counting the number of unique games in our data.

To get the number of records in our data frame, we shall use the `shape` attribute, which returns a [tuple](https://www.tutorialspoint.com/python/python_tuples.htm) with two things: the number of rows (the first thing) and the number of columns (the second thing). You can access items in a tuple just like you'd access items in a list: With square brackets `[]` and the index number of the thing you're trying to get.

To get the number of unique games, we're going to use the [`nunique()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.nunique.html) method to get the number of unique game IDs.

(How did I know about the `nunique()` method? I didn't until I Googled "pandas count unique values series.")
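
Here's a minimal sketch of that calculation on made-up data. The `game_id` column name is an assumption; use whatever the game ID column is called in the real data frame.

```python
import pandas as pd

# Hypothetical data: six fights spread across three games
fights = pd.DataFrame({"game_id": [1, 1, 2, 3, 3, 3]})

# shape is a (rows, columns) tuple, so [0] is the number of fights
fight_count = fights.shape[0]

# number of distinct game IDs
game_count = fights["game_id"].nunique()

# fights divided by games
avg_fights_per_game = fight_count / game_count
print(avg_fights_per_game)
```

Because every row in the data is a fight, games without any fights never show up. The result is the average number of fights per game among games that had at least one fight.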

In [None]:
# use `shape` to grab the record count
# (note: could also use `len`)


In [None]:
# save record count as a variable


In [None]:
# use `nunique()` to get the number of unique game IDs


In [None]:
# average is: number of fights divided by number of games


### Q: What was the longest fight?

We have fight duration as a mixture of minutes and seconds, so we first need to convert everything to seconds: `(minutes * 60) + seconds`. We'll create a new column, `fight_duration`, for this. Then it's just a matter of sorting by that new column in descending order and using [`.iloc[0]`](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iloc.html) again to grab the first record.
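
A rough sketch of this approach, assuming the minutes and seconds live in two separate numeric columns. The names `fight_minutes` and `fight_seconds` (and the `fight_id` column) are hypothetical; if the real data stores the duration differently, the conversion step will need to change.

```python
import pandas as pd

# Hypothetical data; column names and values are invented for illustration
fights = pd.DataFrame({
    "fight_id": [1, 2, 3],
    "fight_minutes": [0, 1, 0],
    "fight_seconds": [45, 5, 58],
})

# new column: total fight time in seconds
fights["fight_duration"] = fights["fight_minutes"] * 60 + fights["fight_seconds"]

# sort longest to shortest and grab the first record
longest = fights.sort_values("fight_duration", ascending=False).iloc[0]
print(longest)
```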

In [None]:
# add a new column that calculates the fight time in seconds

# sort by that new column descending and grab the first one
