# Prework

Let's create some fake data to practice what we've learned so far about pandas. We'll create a dataframe that simulates some NBA games.

To get started, let's import numpy and pandas into this notebook:

In [1]:
import numpy as np
import pandas as pd

Now, we need to create some matchups. We'll use the current list of 30 NBA teams below and create matchups that pair every team with each of the other 29 teams once.

In [2]:
teams = ['Atlanta Hawks', 'Boston Celtics', 'Brooklyn Nets', 'Charlotte Hornets', 'Chicago Bulls', 'Cleveland Cavaliers', 'Dallas Mavericks', 'Denver Nuggets', 'Detroit Pistons', 'Golden State Warriors', 
         'Houston Rockets', 'Indiana Pacers', 'Los Angeles Clippers', 'Los Angeles Lakers', 'Memphis Grizzlies', 'Miami Heat', 'Milwaukee Bucks', 'Minnesota Timberwolves', 'New Orleans Pelicans', 
         'New York Knicks', 'Oklahoma City Thunder', 'Orlando Magic', 'Philadelphia 76ers', 'Phoenix Suns', 'Portland Trail Blazers', 'Sacramento Kings', 'San Antonio Spurs', 'Toronto Raptors', 
         'Utah Jazz', 'Washington Wizards']

We'll use a nested for loop to create a list of matchups. Each matchup will be a dictionary containing the team and its opponent. Your output should look like this:

```python
matchups = [{'team': 'Atlanta Hawks', 'opponent': 'Boston Celtics'}, {'team': 'Atlanta Hawks', 'opponent': 'Brooklyn Nets'}, ...]
```

NOTE: Make sure no team plays themselves.

In [3]:
teams

['Atlanta Hawks',
 'Boston Celtics',
 'Brooklyn Nets',
 'Charlotte Hornets',
 'Chicago Bulls',
 'Cleveland Cavaliers',
 'Dallas Mavericks',
 'Denver Nuggets',
 'Detroit Pistons',
 'Golden State Warriors',
 'Houston Rockets',
 'Indiana Pacers',
 'Los Angeles Clippers',
 'Los Angeles Lakers',
 'Memphis Grizzlies',
 'Miami Heat',
 'Milwaukee Bucks',
 'Minnesota Timberwolves',
 'New Orleans Pelicans',
 'New York Knicks',
 'Oklahoma City Thunder',
 'Orlando Magic',
 'Philadelphia 76ers',
 'Phoenix Suns',
 'Portland Trail Blazers',
 'Sacramento Kings',
 'San Antonio Spurs',
 'Toronto Raptors',
 'Utah Jazz',
 'Washington Wizards']

In [12]:
matchups = []
for team in teams:
    for opponent in teams:
        if team == opponent:
            continue
        game = {}
        game['team'] = team
        game['opponent'] = opponent
        matchups.append(game)
                
matchups

[{'opponent': 'Boston Celtics', 'team': 'Atlanta Hawks'},
 {'opponent': 'Brooklyn Nets', 'team': 'Atlanta Hawks'},
 {'opponent': 'Charlotte Hornets', 'team': 'Atlanta Hawks'},
 {'opponent': 'Chicago Bulls', 'team': 'Atlanta Hawks'},
 {'opponent': 'Cleveland Cavaliers', 'team': 'Atlanta Hawks'},
 {'opponent': 'Dallas Mavericks', 'team': 'Atlanta Hawks'},
 {'opponent': 'Denver Nuggets', 'team': 'Atlanta Hawks'},
 {'opponent': 'Detroit Pistons', 'team': 'Atlanta Hawks'},
 {'opponent': 'Golden State Warriors', 'team': 'Atlanta Hawks'},
 {'opponent': 'Houston Rockets', 'team': 'Atlanta Hawks'},
 {'opponent': 'Indiana Pacers', 'team': 'Atlanta Hawks'},
 {'opponent': 'Los Angeles Clippers', 'team': 'Atlanta Hawks'},
 {'opponent': 'Los Angeles Lakers', 'team': 'Atlanta Hawks'},
 {'opponent': 'Memphis Grizzlies', 'team': 'Atlanta Hawks'},
 {'opponent': 'Miami Heat', 'team': 'Atlanta Hawks'},
 {'opponent': 'Milwaukee Bucks', 'team': 'Atlanta Hawks'},
 {'opponent': 'Minnesota Timberwolves', 'team

Now let's iterate through our matchups and create a few data points:

1. The team's score
2. The opponent's score
3. Whether or not the game was home or away.

We'll also use numpy to randomly generate these values.

Our matchups will look like this when we're done:

```python
matchups = [
    {
        'opponent': 'Boston Celtics',
        'opponent_score': 93,
        'team': 'Atlanta Hawks',
        'team_score': 104,
        'location': 'H'
    },
    ...
]
```

In [15]:
for game in matchups:
    teamscore, oppscore = np.random.choice(range(95, 115), size=2, replace=False)
    location = np.random.choice(['H', 'A'])
    game['team_score'] = teamscore
    game['opponent_score'] = oppscore
    game['location'] = location
    

matchups

[{'location': 'A',
  'opponent': 'Boston Celtics',
  'opponent_score': 107,
  'team': 'Atlanta Hawks',
  'team_score': 114},
 {'location': 'H',
  'opponent': 'Brooklyn Nets',
  'opponent_score': 107,
  'team': 'Atlanta Hawks',
  'team_score': 109},
 {'location': 'H',
  'opponent': 'Charlotte Hornets',
  'opponent_score': 109,
  'team': 'Atlanta Hawks',
  'team_score': 106},
 {'location': 'H',
  'opponent': 'Chicago Bulls',
  'opponent_score': 95,
  'team': 'Atlanta Hawks',
  'team_score': 107},
 {'location': 'A',
  'opponent': 'Cleveland Cavaliers',
  'opponent_score': 114,
  'team': 'Atlanta Hawks',
  'team_score': 101},
 {'location': 'A',
  'opponent': 'Dallas Mavericks',
  'opponent_score': 114,
  'team': 'Atlanta Hawks',
  'team_score': 95},
 {'location': 'H',
  'opponent': 'Denver Nuggets',
  'opponent_score': 98,
  'team': 'Atlanta Hawks',
  'team_score': 111},
 {'location': 'H',
  'opponent': 'Detroit Pistons',
  'opponent_score': 98,
  'team': 'Atlanta Hawks',
  'team_score': 1

Now we can use our list of dictionaries to create a pandas dataframe.

In [17]:
df = pd.DataFrame(matchups)

In [18]:
df.head()

| | location | opponent | opponent_score | team | team_score |
|---|---|---|---|---|---|
| 0 | A | Boston Celtics | 107 | Atlanta Hawks | 114 |
| 1 | H | Brooklyn Nets | 107 | Atlanta Hawks | 109 |
| 2 | H | Charlotte Hornets | 109 | Atlanta Hawks | 106 |
| 3 | H | Chicago Bulls | 95 | Atlanta Hawks | 107 |
| 4 | A | Cleveland Cavaliers | 114 | Atlanta Hawks | 101 |


In [19]:
df.tail()

| | location | opponent | opponent_score | team | team_score |
|---|---|---|---|---|---|
| 865 | H | Portland Trail Blazers | 96 | Washington Wizards | 100 |
| 866 | A | Sacramento Kings | 98 | Washington Wizards | 105 |
| 867 | A | San Antonio Spurs | 105 | Washington Wizards | 110 |
| 868 | A | Toronto Raptors | 97 | Washington Wizards | 107 |
| 869 | H | Utah Jazz | 102 | Washington Wizards | 97 |


# Feature extraction in pandas

In data science, the features you choose to create or omit can be just as important as which machine learning model you choose to use.

Today, we're going to cover feature engineering in pandas.

## Broadcasting

If you recall, broadcasting enables us to perform mathematical operations across a vector without having to write a for loop.

For practice, create a numpy array of numbers 1 - 10:

In [None]:
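# Broadcasting on a dataframe column: adds 100 to every team score.
# The column is named 'team_score' (with an underscore).
df['team_score'] + 100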

Now, use broadcasting to double each number in the array:
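One way to work through the two practice steps above, as a minimal sketch (the variable names `numbers` and `doubled` are only illustrative):

```python
import numpy as np

# Create an array holding the integers 1 through 10.
numbers = np.arange(1, 11)

# Broadcasting multiplies every element by 2 without an explicit loop.
doubled = numbers * 2
print(doubled)
```

The scalar `2` is stretched across the whole array, so each element is doubled in a single vectorized operation.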

Because pandas is built on numpy, we can create new features using broadcasting. With our NBA dataframe, let's create a `win` column which will be `True` or `False`, depending on whether or not the team's score is higher than their opponent's.

In [23]:
df['win'] = df['team_score'] > df['opponent_score']
df.head()

| | location | opponent | opponent_score | team | team_score | win |
|---|---|---|---|---|---|---|
| 0 | A | Boston Celtics | 107 | Atlanta Hawks | 114 | True |
| 1 | H | Brooklyn Nets | 107 | Atlanta Hawks | 109 | True |
| 2 | H | Charlotte Hornets | 109 | Atlanta Hawks | 106 | False |
| 3 | H | Chicago Bulls | 95 | Atlanta Hawks | 107 | True |
| 4 | A | Cleveland Cavaliers | 114 | Atlanta Hawks | 101 | False |


In machine learning, we need 1's and 0's instead of booleans, so let's change the win column's datatype to `int`:

In [None]:
df['win'] = df['win'].astype(int)

In [None]:
df.head()

## Broadcasting practice

Use broadcasting to create a new column called `spread` (the point spread), which is the difference between the team's score and their opponent's.

For example, if the team's score is 90 and their opponent's is 99, then the point spread is -9.

In [None]:
# Column names use underscores: 'team_score' and 'opponent_score'.
df['spread'] = df['team_score'] - df['opponent_score']

# Mapping

[Basketball Reference](http://www.basketball-reference.com/) is a fantastic site for NBA statistics. We might want to scrape this site down the road, so it's a good idea to know how to navigate to a particular team's page. Each team has a unique slug that is used in their urls. 

For example, Atlanta Hawks' url is http://www.basketball-reference.com/teams/ATL/, which means their slug is ATL. 

Below is a dictionary that **maps** each team to their respective slug (hence the name of this section). We'll use this dictionary to add a couple of columns to our data frame.

In [None]:
slug_dict = {'Atlanta Hawks':'ATL', 'Brooklyn Nets':'BRK', 'Boston Celtics':'BOS', 'Charlotte Hornets':'CHO', 'Chicago Bulls':'CHI', 'Cleveland Cavaliers':'CLE', 'Dallas Mavericks':'DAL', 'Denver Nuggets':'DEN', 'Detroit Pistons':'DET', 'Golden State Warriors':'GSW', 'Houston Rockets':'HOU', 'Indiana Pacers':'IND', 'Los Angeles Clippers':'LAC', 'Los Angeles Lakers':'LAL', 'Memphis Grizzlies':'MEM', 'Miami Heat':'MIA', 'Milwaukee Bucks':'MIL', 'Minnesota Timberwolves':'MIN', 'New Orleans Pelicans':'NOP', 'New York Knicks':'NYK', 'Oklahoma City Thunder':'OKC', 'Orlando Magic':'ORL', 'Philadelphia 76ers':'PHI', 'Phoenix Suns':'PHO', 'Portland Trail Blazers':'POR', 'Sacramento Kings':'SAC', 'San Antonio Spurs':'SAS', 'Toronto Raptors':'TOR', 'Utah Jazz':'UTA', 'Washington Wizards':'WAS'}
slug_dict

We'll use pandas' `map` method along with our dictionary to create a `'team_slug'` column:
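A minimal sketch of that call, using a tiny stand-in dataframe and a three-team subset of the slug dictionary so the snippet runs on its own (`sample` and `slugs` are hypothetical names; in the notebook you would call `map` on `df` with `slug_dict`):

```python
import pandas as pd

# Small stand-in for the NBA dataframe and slug dictionary built above.
sample = pd.DataFrame({
    'team': ['Atlanta Hawks', 'Boston Celtics', 'Chicago Bulls'],
    'opponent': ['Boston Celtics', 'Chicago Bulls', 'Atlanta Hawks'],
})
slugs = {'Atlanta Hawks': 'ATL', 'Boston Celtics': 'BOS', 'Chicago Bulls': 'CHI'}

# map looks up each team name in the dictionary and returns the matching slug.
sample['team_slug'] = sample['team'].map(slugs)
print(sample)
```

Any team name that is missing from the dictionary comes back as `NaN`, which makes typos in the keys easy to spot.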

## Mapping practice

Using `slug_dict`, create a new column for the opponent's slug:
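One possible solution, sketched on a small stand-in dataframe (`sample` and `slugs` are hypothetical names standing in for `df` and `slug_dict`):

```python
import pandas as pd

# Small stand-in for the NBA dataframe and slug dictionary.
sample = pd.DataFrame({
    'team': ['Atlanta Hawks', 'Boston Celtics'],
    'opponent': ['Boston Celtics', 'Atlanta Hawks'],
})
slugs = {'Atlanta Hawks': 'ATL', 'Boston Celtics': 'BOS'}

# Same pattern as the team column, applied to the opponent column.
sample['opponent_slug'] = sample['opponent'].map(slugs)
print(sample)
```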

# Apply

Pandas allows us to use functions to transform our data. Typically this is done in two steps:

1. Create the function you will use to transform your data frame
2. Use the `apply` method to run this function across your data frame.

In our NBA example, let's change our slug columns to be the full url for each team/opponent. 

First, let's create a function that accepts a slug and returns the full Basketball Reference url for that slug:
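A sketch of such a function, assuming the URL pattern shown earlier for the Atlanta Hawks holds for every team (the name `slug_to_url` is hypothetical):

```python
def slug_to_url(slug):
    """Return the Basketball Reference team page for a slug such as 'ATL'."""
    # Assumes every team page follows the /teams/<SLUG>/ pattern.
    return 'http://www.basketball-reference.com/teams/{}/'.format(slug)

print(slug_to_url('ATL'))
```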

Now let's use this function to change our `team_slug` column to be the full url:
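As a rough illustration, here is `apply` on a single column of a small stand-in dataframe (`sample` and `slug_to_url` are hypothetical names, redefined here so the snippet runs on its own):

```python
import pandas as pd

def slug_to_url(slug):
    # Hypothetical helper: builds the team page URL from a slug.
    return 'http://www.basketball-reference.com/teams/{}/'.format(slug)

sample = pd.DataFrame({'team_slug': ['ATL', 'BOS'],
                       'opponent_slug': ['BOS', 'ATL']})

# apply calls slug_to_url once per value and collects the results in a new Series.
sample['team_slug'] = sample['team_slug'].apply(slug_to_url)
print(sample['team_slug'].tolist())
```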

Now do the same for `opponent_slug`:
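The opponent column follows the same pattern. A self-contained sketch, again with hypothetical names `sample` and `slug_to_url`:

```python
import pandas as pd

def slug_to_url(slug):
    # Hypothetical helper: builds the team page URL from a slug.
    return 'http://www.basketball-reference.com/teams/{}/'.format(slug)

sample = pd.DataFrame({'opponent_slug': ['BOS', 'ATL']})

# Overwrite the slug column with the full URL for each opponent.
sample['opponent_slug'] = sample['opponent_slug'].apply(slug_to_url)
print(sample['opponent_slug'].tolist())
```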

### Cleanup: 

Because our columns are now the full url, and not just the slug, it makes sense for us to change the names of our columns:
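A minimal sketch of the rename, assuming the new names `team_url` and `opponent_url` (the original does not specify them, so treat these as placeholders):

```python
import pandas as pd

sample = pd.DataFrame({
    'team_slug': ['http://www.basketball-reference.com/teams/ATL/'],
    'opponent_slug': ['http://www.basketball-reference.com/teams/BOS/'],
})

# rename returns a new dataframe; columns not listed in the dict keep their names.
sample = sample.rename(columns={'team_slug': 'team_url',
                                'opponent_slug': 'opponent_url'})
print(sample.columns.tolist())
```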

Not every win in the NBA is the same. An inferior team might be favored simply because they're playing at home. Some [basketball statistics](https://en.wikipedia.org/wiki/Rating_Percentage_Index#Basketball_formula) account for this by reducing the value of a home win (and increasing the value of an away win). 

We'll create a new column called `'adjusted_win'`, which will be 0.6 wins if they won at home and 1.4 wins if they won on the road.

We'll use pandas' `apply` method to create this new column.

First, create a function that accepts an individual row as a parameter. 
- If the game was at home, we'll multiply the win column by 0.6
- If the game was played on the road, we'll multiply the win column by 1.4
- NOTE: If the win column is zero, then our result will be zero
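The rules in the list above could be written as follows. This is a sketch that assumes `location` is `'H'` for home and `'A'` for away and that `win` has already been converted to 0 or 1; the function name `adjusted_win` is hypothetical:

```python
import pandas as pd

def adjusted_win(row):
    """Weight a win by location: 0.6 at home, 1.4 on the road."""
    if row['location'] == 'H':
        return row['win'] * 0.6
    # Anything other than 'H' is treated as an away game.
    return row['win'] * 1.4

# Try the function on two hand-built rows.
home_win = pd.Series({'location': 'H', 'win': 1})
away_loss = pd.Series({'location': 'A', 'win': 0})
print(adjusted_win(home_win), adjusted_win(away_loss))
```

Because the function multiplies by `win`, a loss yields zero regardless of location.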

Now we'll use the `apply` method, along with our function to create the adjusted_win column. 

Note: We'll have to make a slight change to our `apply` method since we're dealing with multiple columns
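Assuming the "slight change" refers to the `axis=1` argument, which tells `apply` to pass each row (rather than each column) to the function, a sketch on a small stand-in dataframe might look like this (`sample` and `adjusted_win` are hypothetical names):

```python
import pandas as pd

def adjusted_win(row):
    # 0.6 for a home win, 1.4 for a road win, 0 for any loss.
    if row['location'] == 'H':
        return row['win'] * 0.6
    return row['win'] * 1.4

sample = pd.DataFrame({
    'location': ['H', 'A', 'H', 'A'],
    'win': [1, 1, 0, 0],
})

# axis=1 hands the function one row at a time, so it can read several columns.
sample['adjusted_win'] = sample.apply(adjusted_win, axis=1)
print(sample)
```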

# Dummies (AKA One Hot Encoding)

We might want to incorporate the game's location in our machine learning model. There's just one problem: we need numerical values, but we have strings.

Thankfully pandas has a method for converting categorical data into numerical data. We'll use `get_dummies` to create numerical columns from our `location` column:
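A minimal sketch on a stand-in dataframe (`sample` is a hypothetical name). Passing `dtype=int` is an optional choice made here so the new columns hold 1's and 0's on any recent pandas version:

```python
import pandas as pd

sample = pd.DataFrame({'location': ['H', 'A', 'H']})

# One new column per distinct value in location: 'A' and 'H'.
dummies = pd.get_dummies(sample['location'], dtype=int)

# Attach the dummy columns to the original dataframe.
sample = pd.concat([sample, dummies], axis=1)
print(sample)
```

Each row has a 1 in the column matching its location and a 0 in the other.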

## Dummies practice

Create dummy columns from the `team` and `opponent` columns:
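One possible approach, sketched on a stand-in dataframe (`sample` and `encoded` are hypothetical names). Passing `columns=` to `get_dummies` replaces the listed columns with dummy columns prefixed by the original column name:

```python
import pandas as pd

sample = pd.DataFrame({
    'team': ['Atlanta Hawks', 'Boston Celtics'],
    'opponent': ['Boston Celtics', 'Atlanta Hawks'],
    'win': [1, 0],
})

# Encode both columns at once; other columns such as win are left untouched.
encoded = pd.get_dummies(sample, columns=['team', 'opponent'], dtype=int)
print(encoded.columns.tolist())
```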