# Feature engineering in Pandas

## Loading/Exploring the data

Load the iris.csv file from this repo into a pandas dataframe. Take a minute to familiarize yourself with the data.

## Import Pandas

Import the `pandas` library as `pd`

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
%matplotlib inline

Read the `../data/iris.csv` dataset into an object named `iris`

In [None]:
iris = pd.read_csv('../data/iris.csv')
iris.head()

How many different species are in this dataset?

In [None]:
iris['species'].nunique()

What are their names?

In [None]:
iris['species'].unique()

How many samples are there per species?

<details><summary>Hint</summary>Use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html"><code>.value_counts()</code></a> method</details>

In [None]:
iris['species'].value_counts()

## Feature Engineering

Create a new column called `'sepal_ratio'` which is equal to sepal width / sepal length

In [None]:
# Assign the ratio directly so it becomes a new column of iris
iris['sepal_ratio'] = iris['sepal width (cm)'] / iris['sepal length (cm)']
iris.head()


Create a similar column called `'petal_ratio'`: petal width / petal length

In [None]:
# Assign the ratio directly so it becomes a new column of iris
iris['petal_ratio'] = iris['petal width (cm)'] / iris['petal length (cm)']
iris.head()

Create 4 columns that correspond to `sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, and `petal width (cm)`, only in inches.

In [None]:
def toinches(value):
    # Convert centimeters to inches (1 in = 2.54 cm)
    return value / 2.54

# Add an '(in)' column for each '(cm)' measurement column
for col in ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']:
    iris[col.replace('(cm)', '(in)')] = iris[col].apply(toinches)

iris.head()
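
Dividing a pandas column by a scalar is applied element-wise, so the same conversion can also be written without `.apply()`. The following is a minimal sketch on a small made-up dataframe; the name `demo` and its values are illustrative and do not come from iris.csv.

```python
import pandas as pd

# Hypothetical two-row stand-in for the iris measurements
demo = pd.DataFrame({
    'sepal length (cm)': [5.08, 2.54],
    'petal width (cm)': [0.254, 1.27],
})

CM_PER_INCH = 2.54
cm_cols = [c for c in demo.columns if c.endswith('(cm)')]

# Divide every centimeter column at once, then rename '(cm)' to '(in)'
inches = (demo[cm_cols] / CM_PER_INCH).rename(
    columns=lambda c: c.replace('(cm)', '(in)')
)

# Attach the new inch columns next to the original ones
demo = pd.concat([demo, inches], axis=1)
print(demo.columns.tolist())
```

On large dataframes the vectorized form is usually faster than calling a Python function once per value.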

## Apply

Create a column called `'encoded_species'`:
- 0 for setosa
- 1 for versicolor
- 2 for virginica


<details><summary>Hint 1</summary>
Create a dictionary using the species as keys and the numbers 0-2 for values
</details>

<details><summary>Hint 2</summary>
    Use the dictionary in hint 1 with the <code>.apply()</code> method to create the new column
</details>

In [None]:
def lettersfornumbers(value):
    if value == 'setosa':
        return 0
    if value == 'versicolor':
        return 1
    if value == 'virginica':
        return 2
iris['encoded_species'] = iris['species'].apply(lettersfornumbers)
iris.head()
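
The hints above suggest a dictionary instead of a chain of `if` statements. A minimal sketch of that approach follows, using a short made-up series of labels (the `species` series here is a stand-in, not the real iris column).

```python
import pandas as pd

# Hypothetical sample of species labels
species = pd.Series(['setosa', 'virginica', 'versicolor', 'setosa'])

# Species names as keys, the numbers 0-2 as values
species_codes = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

# .map() looks each value up in the dictionary; unknown labels become NaN
encoded_map = species.map(species_codes)

# .apply() with a dictionary lookup gives the same result for known labels
encoded_apply = species.apply(lambda name: species_codes[name])

print(encoded_map.tolist())
```

Keeping the mapping in one dictionary makes it easy to add a species or reuse the encoding elsewhere.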


## March Madness

Let's switch the dataset to something different from flowers: March Madness!

Read in the dataset `../data/ncaa-seeds.csv` to an object named `seeds`.

This dataframe simulates the games that will occur in the first round of the [NCAA basketball tournament](http://www.sportingnews.com/au/ncaa-basketball/news/ncaa-tournament-2017-march-madness-bracket-schedule-matchups-print-a-bracket/1r6cau9sb1xj4131zzhay2dj5g). In the first row, you should see the following:

| team_seed | opponent_seed |
|-----------|---------------|
| 01N       | 16N           |

In [2]:
seeds = pd.read_csv('../data/ncaa-seeds.csv')
seeds.head()

Unnamed: 0,team_seed,opponent_seed
0,01N,16N
1,02N,15N
2,03N,14N
3,04N,13N
4,05N,12N


In `team_seed`, the `01` is the team's seed and the `N` is its division (North). The first row says that the 1st seed in the North division will play the 16th seed in the same division.

Using the `.apply()` method, create the following new columns:
- `team_division`
- `opponent_division`

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division |
|-----------|---------------|---------------|-------------------|
| 01N       | 16N           | N             | N                 |


In [3]:
def getdivision(value):
    # Return the first division letter (N, S, W or E) found in the seed string
    for item in value:
        if item in 'NSWE':
            return item

seeds['team_division'] = seeds['team_seed'].apply(getdivision)
seeds['opponent_division'] = seeds['opponent_seed'].apply(getdivision)
seeds.head()


Unnamed: 0,team_seed,opponent_seed,team_division,opponent_division
0,01N,16N,N,N
1,02N,15N,N,N
2,03N,14N,N,N
3,04N,13N,N,N
4,05N,12N,N,N
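
If the division letter is always the last character of the seed string (an assumption that holds for the rows shown above), the pandas `.str` accessor can pull it out without a custom function. A minimal sketch on a made-up two-row dataframe:

```python
import pandas as pd

# Hypothetical sample in the same format as the seeds data
demo = pd.DataFrame({
    'team_seed': ['01N', '08S'],
    'opponent_seed': ['16N', '09S'],
})

# .str[-1] takes the last character of every string in the column
demo['team_division'] = demo['team_seed'].str[-1]
demo['opponent_division'] = demo['opponent_seed'].str[-1]

print(demo)
```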


Now that you have the divisions, change the `team_seed` and `opponent_seed` columns to just be the numbers.

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division |
|-----------|---------------|---------------|-------------------|
| 1         | 16            | N             | N                 |

In [4]:
def removeletters(value):
    # Seeds are zero-padded to two digits (e.g. '01N'), so once a division
    # letter is found the first two characters hold the seed number.
    for char in value:
        if char in 'NSWE':
            return int(value[0:2])

# The function above only works when the seed is exactly two digits long.
# An equivalent list comprehension for a single column:
# seeds['team_seed'] = [int(x[0:2]) for x in seeds['team_seed']]
seeds['team_seed'] = seeds['team_seed'].apply(removeletters)
seeds['opponent_seed'] = seeds['opponent_seed'].apply(removeletters)
seeds.head()


Unnamed: 0,team_seed,opponent_seed,team_division,opponent_division
0,1,16,N,N
1,2,15,N,N
2,3,14,N,N
3,4,13,N,N
4,5,12,N,N
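
Slicing the first two characters depends on the zero padding. One possible alternative, sketched here on a made-up series, is to capture the leading digits with a regular expression so the number of digits no longer matters.

```python
import pandas as pd

# Hypothetical seed strings in the same format as the dataset
seed_strings = pd.Series(['01N', '16N', '08S'])

# Capture the leading run of digits, then convert the text to integers
seed_numbers = seed_strings.str.extract(r'^(\d+)', expand=False).astype(int)

print(seed_numbers.tolist())
```

With `expand=False` and a single capture group, `.str.extract()` returns a Series rather than a dataframe.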


Create a new column called `seed_delta`, which is the team's seed minus their opponent's seed.

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division | seed_delta |
|-----------|---------------|---------------|-------------------|------------|
| 1         | 16            | N             | N                 | -15        |

<br>
<details><summary>Did you get an error?</summary>
team_seed and opponent_seed need to be numerical columns in order for you to perform mathematical operations on them.
</details>

In [5]:
seeds['seed_delta'] = seeds['team_seed'] - seeds['opponent_seed']
seeds.head()

Unnamed: 0,team_seed,opponent_seed,team_division,opponent_division,seed_delta
0,1,16,N,N,-15
1,2,15,N,N,-13
2,3,14,N,N,-11
3,4,13,N,N,-9
4,5,12,N,N,-7
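
As the hint above notes, the subtraction needs numeric columns. A minimal sketch of checking the dtypes and converting with `pd.to_numeric` follows; the `demo` dataframe is made up and deliberately stores the seeds as strings.

```python
import pandas as pd

# Hypothetical seeds stored as text, as they would be after string slicing
demo = pd.DataFrame({'team_seed': ['1', '2'], 'opponent_seed': ['16', '15']})

# Check whether each column is numeric before doing arithmetic
was_numeric = [pd.api.types.is_numeric_dtype(demo[c]) for c in demo.columns]

# Convert every column from text to numbers
numeric = demo.apply(pd.to_numeric)
numeric['seed_delta'] = numeric['team_seed'] - numeric['opponent_seed']

print(numeric)
```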


## Get Dummies

Using the pandas `get_dummies` method, create a new dataframe with 4 columns from `team_division`.

NOTE: Be sure to use 'team_division' as your prefix.

The first row of your result should look as follows:

| team_seed | opponent_seed | opponent_division | seed_delta | team_division_E | team_division_N | team_division_S | team_division_W |
|-----------|---------------|-------------------|------------|-----------------|-----------------|-----------------|-----------------|
| 1         | 16            | N                 | -15        | 0               | 1               | 0               | 0               |

In [6]:
nd = pd.get_dummies(seeds, prefix=['team_division'], columns = ['team_division'])
nd.head()

Unnamed: 0,team_seed,opponent_seed,opponent_division,seed_delta,team_division_E,team_division_N,team_division_S,team_division_W
0,1,16,N,-15,0,1,0,0
1,2,15,N,-13,0,1,0,0
2,3,14,N,-11,0,1,0,0
3,4,13,N,-9,0,1,0,0
4,5,12,N,-7,0,1,0,0
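
Depending on the pandas version, `get_dummies` may return `True`/`False` instead of the 0/1 values shown above (pandas 2.0 and later default to boolean columns). A minimal sketch, on a made-up dataframe, of requesting integers explicitly with the `dtype` parameter:

```python
import pandas as pd

# Hypothetical dataframe with one row per division
demo = pd.DataFrame({
    'team_seed': [1, 1, 1, 1],
    'team_division': ['N', 'S', 'E', 'W'],
})

# dtype=int requests 0/1 integers regardless of the pandas default
dummies = pd.get_dummies(
    demo, prefix=['team_division'], columns=['team_division'], dtype=int
)

print(dummies.columns.tolist())
```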


In machine learning, it's common to drop one of the columns and use it as the baseline. Drop `'team_division_E'`, and append the remaining three columns to your original NCAA dataframe.

The first row of your result should look as follows:

| team_seed | opponent_seed | opponent_division | seed_delta | team_division_N | team_division_S | team_division_W |
|-----------|---------------|-------------------|------------|-----------------|-----------------|-----------------|
| 1         | 16            | N                 | -15        | 1               | 0               | 0               |

In [7]:
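dropTDN = nd.drop(['team_division_E'], axis=1)
# Merge on the shared columns to append the dummy columns to the original dataframe.
# Keep every row here; only the preview below is limited to five rows.
FinalDaFr = seeds.merge(dropTDN)
FinalDaFr.head()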
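
Another way to append the dummy columns, sketched here on a made-up dataframe, is to build them from the single column and join them back with `pd.concat`. This relies on both frames sharing the same index, which holds when the dummies are derived directly from the original dataframe.

```python
import pandas as pd

# Hypothetical dataframe with one row per division
demo = pd.DataFrame({
    'team_seed': [1, 2, 3, 4],
    'team_division': ['N', 'S', 'E', 'W'],
})

# Build the dummy columns from the single column, then drop the baseline
division_dummies = pd.get_dummies(demo['team_division'], prefix='team_division', dtype=int)
division_dummies = division_dummies.drop(columns=['team_division_E'])

# Both frames share the same index, so concat lines the rows up one-to-one
combined = pd.concat([demo, division_dummies], axis=1)

print(combined)
```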

Repeat the previous two steps for opponent_division.

The first row of your result should look as follows:

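| team_seed | opponent_seed | seed_delta | team_division_N | team_division_S | team_division_W | opponent_division_N | opponent_division_S | opponent_division_W |
|-----------|---------------|------------|-----------------|-----------------|-----------------|---------------------|---------------------|---------------------|
| 1         | 16            | -15        | 1               | 0               | 0               | 1                   | 0                   | 0                   |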

In [8]:
nd = pd.get_dummies(seeds, prefix=['opponent_division'], columns=['opponent_division'])
dropODE = nd.drop(['opponent_division_E'], axis=1)
# Merge on the shared columns to append the opponent dummy columns
FinalDaFr.merge(dropODE).head()

Unnamed: 0,team_seed,opponent_seed,team_division,opponent_division,seed_delta,team_division_N,team_division_S,team_division_W,opponent_division_N,opponent_division_S,opponent_division_W
0,1,16,N,N,-15,1,0,0,1,0,0
1,2,15,N,N,-13,1,0,0,1,0,0
2,3,14,N,N,-11,1,0,0,1,0,0
3,4,13,N,N,-9,1,0,0,1,0,0
4,5,12,N,N,-7,1,0,0,1,0,0
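
The create-then-drop steps can also be combined. The sketch below, on a made-up dataframe, encodes both division columns in one `get_dummies` call and uses `drop_first=True` to remove the first level of each. With these labels that level is `E`, since it sorts first alphabetically.

```python
import pandas as pd

# Hypothetical dataframe with one row per division
demo = pd.DataFrame({
    'team_division': ['N', 'S', 'E', 'W'],
    'opponent_division': ['N', 'S', 'E', 'W'],
})

# Encode both columns at once; the column names are used as prefixes by default
encoded = pd.get_dummies(
    demo,
    columns=['team_division', 'opponent_division'],
    drop_first=True,
    dtype=int,
)

print(encoded.columns.tolist())
```

Unlike the merge above, this replaces the original `team_division` and `opponent_division` columns rather than keeping them alongside the dummies.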
