## Loading/Exploring the data

Load the `iris.csv` file into a pandas DataFrame. Take a minute to familiarize yourself with the data.

## Import Pandas

Import the `pandas` library as `pd`

In [1]:
import pandas as pd

Read the `iris.csv` dataset into an object named `iris`

In [2]:
# Stored as df (rather than iris) for the rest of this notebook
df = pd.read_csv("iris 1.csv")
df


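|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|-----|-------------------|------------------|-------------------|------------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |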

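To get familiar with a freshly loaded frame, a few inspection calls usually go a long way. The sketch below is only illustrative: it builds a small stand-in frame (`sample`, a hypothetical name) from the first three rows shown above instead of reading the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: the first three rows shown above.
csv_text = """sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)       # (number of rows, number of columns)
print(sample.dtypes)      # data type of each column
print(sample.head(2))     # first two rows
print(sample.describe())  # summary statistics for the numeric columns
```

The same calls work unchanged on the full frame loaded from the CSV file.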

How many different species are in this dataset?

In [3]:
x=df['species'].nunique()
x

3

What are their names?

In [20]:

y = df['species'].unique()
y


array(['setosa', 'versicolor', 'virginica'], dtype=object)

How many samples are there per species?

<details><summary>Hint</summary>Use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html"><code>.value_counts()</code></a> method</details>

In [21]:
p=df.value_counts("species")
p


species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
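
The hint refers to the Series method, which can also report proportions instead of raw counts. A minimal sketch, using a small made-up Series (`species` here is a hypothetical stand-in, not the real column):

```python
import pandas as pd

# Hypothetical stand-in for df['species'].
species = pd.Series(["setosa", "setosa", "versicolor", "virginica"], name="species")

counts = species.value_counts()                # occurrences of each label
shares = species.value_counts(normalize=True)  # fraction of rows per label

print(counts)
print(shares)
```

On a balanced dataset such as the one above, every share would come out equal.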

## Feature Engineering

Create a new column called `'sepal_ratio'` which is equal to sepal width / sepal length

In [22]:
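df['sepal_ratio'] = df['sepal width (cm)'] / df['sepal length (cm)']
df


|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | sepal_ratio |
|-----|-------------------|------------------|-------------------|------------------|---------|-------------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.686275 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.612245 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.680851 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.673913 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.720000 |
| ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 0.447761 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 0.396825 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 0.461538 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 0.548387 |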


Create a similar column called `'petal_ratio'`: petal width / petal length

In [23]:
df['petal_ratio']=df['petal width (cm)']/df['petal length (cm)']
df


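|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | sepal_ratio | petal_ratio |
|-----|-------------------|------------------|-------------------|------------------|---------|-------------|-------------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.686275 | 0.142857 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.612245 | 0.142857 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.680851 | 0.153846 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.673913 | 0.133333 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.720000 | 0.142857 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 0.447761 | 0.442308 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 0.396825 | 0.380000 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 0.461538 | 0.384615 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 0.548387 | 0.425926 |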

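Both ratio columns can also be created in one step with `DataFrame.assign`, which returns a new frame and leaves the original untouched. This is only a sketch on a two-row stand-in frame (`flowers` and `with_ratios` are hypothetical names) whose values are taken from the first rows above.

```python
import pandas as pd

# Hypothetical two-row stand-in for the iris frame.
flowers = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9],
    "sepal width (cm)": [3.5, 3.0],
    "petal length (cm)": [1.4, 1.4],
    "petal width (cm)": [0.2, 0.2],
})

# assign() returns a copy with the extra columns; `flowers` itself is unchanged.
with_ratios = flowers.assign(
    sepal_ratio=flowers["sepal width (cm)"] / flowers["sepal length (cm)"],
    petal_ratio=flowers["petal width (cm)"] / flowers["petal length (cm)"],
)

print(with_ratios[["sepal_ratio", "petal_ratio"]])
```

Using keyword arguments for the column names also makes a misspelled name easier to spot than a string key does.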

Create 4 new columns that express `sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, and `petal width (cm)` in inches instead of centimeters.

In [8]:
# 1 cm is approximately 0.393701 inches
CM_TO_INCHES = 0.393701
df['sepal length (inches)'] = df['sepal length (cm)'] * CM_TO_INCHES
df['sepal width (inches)'] = df['sepal width (cm)'] * CM_TO_INCHES
df['petal length (inches)'] = df['petal length (cm)'] * CM_TO_INCHES
df['petal width (inches)'] = df['petal width (cm)'] * CM_TO_INCHES
print(df[['sepal length (inches)', 'sepal width (inches)', 'petal length (inches)', 'petal width (inches)']])

     sepal length (inches)  sepal width (inches)  petal length (inches)  \
0                 2.007875              1.377954               0.551181   
1                 1.929135              1.181103               0.551181   
2                 1.850395              1.259843               0.511811   
3                 1.811025              1.220473               0.590552   
4                 1.968505              1.417324               0.551181   
..                     ...                   ...                    ...   
145               2.637797              1.181103               2.047245   
146               2.480316              0.984253               1.968505   
147               2.559057              1.181103               2.047245   
148               2.440946              1.338583               2.125985   
149               2.322836              1.181103               2.007875   

     petal width (inches)  
0                0.078740  
1                0.078740  
2              
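
The four conversions above repeat the same arithmetic, so they could plausibly be written as a loop over the centimeter columns. The sketch below assumes a small stand-in frame (`flowers`, a hypothetical name) and divides by 2.54, the exact number of centimeters per inch, which gives nearly the same values as multiplying by 0.393701.

```python
import pandas as pd

CM_PER_INCH = 2.54  # exact by definition of the inch

# Hypothetical two-row stand-in for the iris frame.
flowers = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9],
    "sepal width (cm)": [3.5, 3.0],
    "petal length (cm)": [1.4, 1.4],
    "petal width (cm)": [0.2, 0.2],
})

# Collect the centimeter columns first so the loop does not see the new ones.
cm_columns = [col for col in flowers.columns if col.endswith("(cm)")]

for col in cm_columns:
    inch_col = col.replace("(cm)", "(inches)")
    flowers[inch_col] = flowers[col] / CM_PER_INCH

print(flowers.filter(like="(inches)"))
```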

## Apply

Create a column called `'encoded_species'`:
- 0 for setosa
- 1 for versicolor
- 2 for virginica


<details><summary>Hint 1</summary>Create a dictionary using the species as keys and the numbers 0-2 for values</details>

<details><summary>Hint 2</summary>Use the dictionary in hint 1 with the <code>.apply()</code> method to create the new column</details>


In [22]:
species_dict = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
df['encoded_species']=df['species'].apply(lambda x: species_dict[x])
df

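|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | sepal_ratio | petal_ratio | sepal length (inches) | sepal width (inches) | petal length (inches) | petal width (inches) | encoded_species |
|-----|-------------------|------------------|-------------------|------------------|---------|-------------|-------------|-----------------------|----------------------|-----------------------|----------------------|-----------------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.686275 | 0.142857 | 2.007875 | 1.377954 | 0.551181 | 0.078740 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.612245 | 0.142857 | 1.929135 | 1.181103 | 0.551181 | 0.078740 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.680851 | 0.153846 | 1.850395 | 1.259843 | 0.511811 | 0.078740 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.673913 | 0.133333 | 1.811025 | 1.220473 | 0.590552 | 0.078740 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.720000 | 0.142857 | 1.968505 | 1.417324 | 0.551181 | 0.078740 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 0.447761 | 0.442308 | 2.637797 | 1.181103 | 2.047245 | 0.905512 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 0.396825 | 0.380000 | 2.480316 | 0.984253 | 1.968505 | 0.748032 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 0.461538 | 0.384615 | 2.559057 | 1.181103 | 2.047245 | 0.787402 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 0.548387 | 0.425926 | 2.440946 | 1.338583 | 2.125985 | 0.905512 | 2 |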

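A dictionary lookup like the one above can also be written with `Series.map`, which accepts the dictionary directly. The two differ on labels missing from the dictionary: `.apply(lambda x: species_dict[x])` raises a `KeyError`, while `.map(species_dict)` yields `NaN`. The sketch below uses a made-up Series with one unknown label to show this.

```python
import pandas as pd

species_dict = {"setosa": 0, "versicolor": 1, "virginica": 2}

# Hypothetical stand-in for df['species'], with one label not in the dictionary.
species = pd.Series(["setosa", "virginica", "versicolor", "unknown"])

# map() looks each value up in the dictionary; unmatched values become NaN.
encoded = species.map(species_dict)

print(encoded)
print(encoded.isna().sum(), "label(s) could not be encoded")
```

Because of the `NaN`, the result here is a float column. With only the three known species, the mapped column would stay integer-typed.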

## March Madness

Let's switch from flowers to a different dataset: March Madness!

Read the dataset `ncaa-seeds.csv` into an object named `seeds`.

This dataframe simulates the games that will occur in the first round of the [NCAA basketball tournament](http://www.sportingnews.com/au/ncaa-basketball/news/ncaa-tournament-2017-march-madness-bracket-schedule-matchups-print-a-bracket/1r6cau9sb1xj4131zzhay2dj5g). In the first row, you should see the following:

| team_seed | opponent_seed |
|-----------|---------------|
| 01N       | 16N           |

In [5]:
seeds=pd.read_csv("ncaa-seeds 1.csv")
print(seeds)

   team_seed opponent_seed
0        01N           16N
1        02N           15N
2        03N           14N
3        04N           13N
4        05N           12N
5        06N           11N
6        07N           10N
7        08N           09N
8        01S           16S
9        02S           15S
10       03S           14S
11       04S           13S
12       05S           12S
13       06S           11S
14       07S           10S
15       08S           09S
16       01E           16E
17       02E           15E
18       03E           14E
19       04E           13E
20       05E           12E
21       06E           11E
22       07E           10E
23       08E           09E
24       01W           16W
25       02W           15W
26       03W           14W
27       04W           13W
28       05W           12W
29       06W           11W
30       07W           10W
31       08W           09W


In `team_seed`, `01` is the team's seed and `N` is its division (North). The first row says that the 1st seed in the North division will play the 16th seed in the same division.

Using the `.apply()` method, create the following new columns:
- `team_division`
- `opponent_division`

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division |
|-----------|---------------|---------------|-------------------|
| 01N       | 16N           | N             | N                 |


In [6]:
seeds["team_division"] = seeds['team_seed'].apply(lambda x: x[-1])
seeds["opponent_division"] = seeds['opponent_seed'].apply(lambda x: x[-1])
seeds.head(1)

|     | team_seed | opponent_seed | team_division | opponent_division |
|-----|-----------|---------------|---------------|-------------------|
| 0   | 01N       | 16N           | N             | N                 |


Now that you have the divisions, change the `team_seed` and `opponent_seed` columns to just be the numbers.

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division |
|-----------|---------------|---------------|-------------------|
| 1         | 16            | N             | N                 |

In [8]:
seeds['team_seed'] = seeds['team_seed'].map(lambda x: int(x[:-1]))
seeds['opponent_seed'] = seeds['opponent_seed'].map(lambda x: int(x[:-1]))
print(seeds.head(1))

   team_seed  opponent_seed team_division opponent_division
0          1             16             N                 N
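
The exercise asks for `.apply()`, but the same split can also be sketched with the vectorized `.str` accessor, which slices every string in a column at once. The frame below (`seeds_sample`, a hypothetical name) is a two-row stand-in for the real data.

```python
import pandas as pd

# Hypothetical two-row stand-in for the seeds frame.
seeds_sample = pd.DataFrame({
    "team_seed": ["01N", "08S"],
    "opponent_seed": ["16N", "09S"],
})

for col in ["team_seed", "opponent_seed"]:
    prefix = col.replace("_seed", "")  # "team" or "opponent"
    # The last character is the division letter; take it before overwriting the seed.
    seeds_sample[f"{prefix}_division"] = seeds_sample[col].str[-1]
    # Everything before the last character is the seed number.
    seeds_sample[col] = seeds_sample[col].str[:-1].astype(int)

print(seeds_sample)
```

The order inside the loop matters: once the seed column has been converted to integers, the division letter is no longer available.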


Create a new column called `seed_delta`, which is the team's seed minus their opponent's seed.

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division | seed_delta |
|-----------|---------------|---------------|-------------------|------------|
| 1         | 16            | N             | N                 | -15        |

<br>
<details><summary>Did you get an error?</summary>
<code>team_seed</code> and <code>opponent_seed</code> need to be numerical columns in order for you to perform mathematical operations on them.
</details>

In [18]:
seeds['seed_delta'] = seeds['team_seed'] - seeds['opponent_seed']
print(seeds.head(1))

   team_seed  opponent_seed team_division opponent_division  seed_delta
0          1             16             N                 N         -15
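
Once the seeds are numeric, simple sanity checks become possible. In the full table printed earlier, every first-round pairing has seeds that add up to 17 and both teams share a division. The sketch below checks those two properties on a three-row stand-in frame (`matchups`, a hypothetical name).

```python
import pandas as pd

# Hypothetical three-row stand-in for the cleaned seeds frame.
matchups = pd.DataFrame({
    "team_seed": [1, 2, 8],
    "opponent_seed": [16, 15, 9],
    "team_division": ["N", "N", "S"],
    "opponent_division": ["N", "N", "S"],
})

matchups["seed_delta"] = matchups["team_seed"] - matchups["opponent_seed"]

# Each first-round pairing should have seeds that sum to 17.
seeds_sum_to_17 = (matchups["team_seed"] + matchups["opponent_seed"]).eq(17).all()

# Both teams in a pairing should belong to the same division.
same_division = matchups["team_division"].eq(matchups["opponent_division"]).all()

print(matchups)
print("Seeds sum to 17:", seeds_sum_to_17)
print("Same division:", same_division)
```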
