# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Intro to Pandas 2
Week 2 | Day 2

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Use the .loc indexer to slice by label
- Set and reset DataFrame Indices
- Use .isin()
- Perform boolean indexing on dataframes
- Perform math functions on pandas Series

## Recap of yesterday

### Importing pandas:
```python
import pandas as pd
```

### Reading in a csv: 

```python
df = pd.read_csv('some_file.csv')  # placeholder path
```
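
As a quick illustration of what `pd.read_csv` expects, here is a minimal sketch that reads from an in-memory string instead of a file. The CSV text and column names are made up for this example.

```python
import io
import pandas as pd

# hypothetical CSV content, held in memory so no file on disk is needed
csv_text = "ones,tens\n1,10\n2,20\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
```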

### Viewing the head and tail

```python
df.head()
df.tail()
```

### Viewing the columns and the index
```python
df.columns
df.index
```

### Data type info
```python
df.info()
```

### Summary statistics
```python
df.describe()
df['some_column_name'].max()
df['some_column_name'].min()
```

### Uniques
```python
df['some_column_name'].nunique()
df['some_column_name'].unique()
```

### Creating a histogram
```python
df['some_column_name'].hist()
```

## Slicing with .iloc

In [1]:
# import pandas
import pandas as pd

# create a dictionary
numbs = {'ones': [1, 2, 3, 4], 'tens': [10, 20, 30, 40], 'hundos': [100, 200, 300, 400]}

# pass dictionary into DataFrame and set the columns to keep the order
df = pd.DataFrame(numbs, columns=['ones', 'tens', 'hundos'])


In [2]:
df

|   | ones | tens | hundos |
|---|---|---|---|
| 0 | 1 | 10 | 100 |
| 1 | 2 | 20 | 200 |
| 2 | 3 | 30 | 300 |
| 3 | 4 | 40 | 400 |


### How do we get the first column?

In [3]:
# solution

df.iloc[:, 0]

0    1
1    2
2    3
3    4
Name: ones, dtype: int64

### How do we turn it into a DataFrame?
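
One possible answer, sketched on the same small DataFrame: a scalar column position returns a Series, while a list of positions returns a DataFrame. The variable names here are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'ones': [1, 2, 3, 4],
                   'tens': [10, 20, 30, 40],
                   'hundos': [100, 200, 300, 400]})

# a scalar column position returns a Series
as_series = df.iloc[:, 0]

# a list of positions (even a single one) returns a DataFrame
as_frame = df.iloc[:, [0]]

# to_frame() converts an existing Series into a one-column DataFrame
also_frame = as_series.to_frame()

print(type(as_series).__name__, type(as_frame).__name__)
```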

## How do we get the first 3 rows and the first 2 columns?

### Notice the end of each slice is exclusive: we get rows 0, 1, 2 and columns 0, 1
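
The selection described above can be sketched as follows, rebuilding the same small DataFrame so the snippet runs on its own. The name `subset` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ones': [1, 2, 3, 4],
                   'tens': [10, 20, 30, 40],
                   'hundos': [100, 200, 300, 400]})

# rows at positions 0, 1, 2 and columns at positions 0, 1
# (the stop value of each slice is not included)
subset = df.iloc[:3, :2]
print(subset)
```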

### One more. How do we get all rows and columns 0 and 2?

In [6]:
df.iloc[:, [0,2]]

|   | ones | hundos |
|---|---|---|
| 0 | 1 | 100 |
| 1 | 2 | 200 |
| 2 | 3 | 300 |
| 3 | 4 | 400 |


### Can we get only columns with an 'o' in them with iloc?

In [7]:
df.iloc[:, [x for x in df.columns if 'o' in x]]

TypeError: cannot perform reduce with flexible type

## No.
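
`.iloc` only accepts integer positions, so passing column names fails. If you do want to stay with `.iloc`, one workaround (a sketch, not the only approach) is to collect the positions of the matching columns first:

```python
import pandas as pd

df = pd.DataFrame({'ones': [1, 2, 3, 4],
                   'tens': [10, 20, 30, 40],
                   'hundos': [100, 200, 300, 400]})

# collect the integer positions of the columns whose name contains 'o'
positions = [i for i, col in enumerate(df.columns) if 'o' in col]

# .iloc is happy with a list of integer positions
subset = df.iloc[:, positions]
print(positions)
```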

## Introducing .loc

In [8]:
df.loc[:, [x for x in df.columns if 'o' in x]]

|   | ones | hundos |
|---|---|---|
| 0 | 1 | 100 |
| 1 | 2 | 200 |
| 2 | 3 | 300 |
| 3 | 4 | 400 |


### .loc slices by label: the row labels here happen to be integers, the column labels are names

In [9]:
df.loc[:3, ['ones', 'tens']]

|   | ones | tens |
|---|---|---|
| 0 | 1 | 10 |
| 1 | 2 | 20 |
| 2 | 3 | 30 |
| 3 | 4 | 40 |


### Notice that it is inclusive! It includes the last item listed
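
To make the difference concrete, here is a small side-by-side sketch on the same DataFrame. The names `by_label` and `by_position` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'ones': [1, 2, 3, 4],
                   'tens': [10, 20, 30, 40],
                   'hundos': [100, 200, 300, 400]})

# .loc slices by label and includes the end label: rows 0, 1 and 2
by_label = df.loc[:2, ['ones', 'tens']]

# .iloc slices by position and excludes the stop position: rows 0 and 1
by_position = df.iloc[:2, :2]

print(len(by_label), len(by_position))
```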

### Notice we can't pass integer column positions to .loc the way we do with .iloc

In [10]:
df.loc[:2, 1]

TypeError: cannot do label indexing on <class 'pandas.indexes.base.Index'> with these indexers [1] of <type 'int'>
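
The column labels in this DataFrame are strings, so `.loc` needs a column name rather than a position. A minimal sketch of two ways to ask for the second column by label:

```python
import pandas as pd

df = pd.DataFrame({'ones': [1, 2, 3, 4],
                   'tens': [10, 20, 30, 40],
                   'hundos': [100, 200, 300, 400]})

# name the column directly
by_name = df.loc[:2, 'tens']

# or look the label up from its position, then pass the label to .loc
by_lookup = df.loc[:2, df.columns[1]]

print(by_name.tolist())
```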

## Let's change the index

In [171]:
from string import ascii_lowercase

# change the index to be lower case letters
df.index = list(ascii_lowercase[:len(df)])

In [172]:
df

|   | ones | tens | hundos |
|---|---|---|---|
| a | 1 | 10 | 100 |
| b | 2 | 20 | 200 |
| c | 3 | 30 | 300 |
| d | 4 | 40 | 400 |


## Now we'll use .loc to slice with a named index

In [173]:
# list of index labels
df.loc[['a','c'], :]

|   | ones | tens | hundos |
|---|---|---|---|
| a | 1 | 10 | 100 |
| c | 3 | 30 | 300 |


In [174]:
# slicing with named indices
df.loc['b':, :]

|   | ones | tens | hundos |
|---|---|---|---|
| b | 2 | 20 | 200 |
| c | 3 | 30 | 300 |
| d | 4 | 40 | 400 |
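
Label slices work on both axes and include both endpoints. A short sketch with the lettered index (the name `middle` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'ones': [1, 2, 3, 4],
                   'tens': [10, 20, 30, 40],
                   'hundos': [100, 200, 300, 400]},
                  index=list('abcd'))

# rows 'b' through 'c' and columns 'ones' through 'tens', endpoints included
middle = df.loc['b':'c', 'ones':'tens']
print(middle)
```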


In [175]:
states = {
        'AK': 'Alaska',
        'AL': 'Alabama',
        'AR': 'Arkansas',
        'AZ': 'Arizona',
        'CA': 'California',
        'CO': 'Colorado',
        'CT': 'Connecticut',
        'DE': 'Delaware',
        'FL': 'Florida',
        'GA': 'Georgia',
        'HI': 'Hawaii',
        'IA': 'Iowa',
        'ID': 'Idaho',
        'IL': 'Illinois',
        'IN': 'Indiana',
        'KS': 'Kansas',
        'KY': 'Kentucky',
        'LA': 'Louisiana',
        'MA': 'Massachusetts',
        'MD': 'Maryland',
        'ME': 'Maine',
        'MI': 'Michigan',
        'MN': 'Minnesota',
        'MO': 'Missouri',
        'MS': 'Mississippi',
        'MT': 'Montana',
        'NC': 'North Carolina',
        'ND': 'North Dakota',
        'NE': 'Nebraska',
        'NH': 'New Hampshire',
        'NJ': 'New Jersey',
        'NM': 'New Mexico',
        'NV': 'Nevada',
        'NY': 'New York',
        'OH': 'Ohio',
        'OK': 'Oklahoma',
        'OR': 'Oregon',
        'PA': 'Pennsylvania',
        'RI': 'Rhode Island',
        'SC': 'South Carolina',
        'SD': 'South Dakota',
        'TN': 'Tennessee',
        'TX': 'Texas',
        'UT': 'Utah',
        'VA': 'Virginia',
        'VT': 'Vermont',
        'WA': 'Washington',
        'WI': 'Wisconsin',
        'WV': 'West Virginia',
        'WY': 'Wyoming'
}

In [176]:
# build the DataFrame column by column so name_length keeps an integer dtype
state_df = pd.DataFrame({'abbreviation': list(states.keys()),
                         'name': list(states.values()),
                         'name_length': [len(x) for x in states.values()]})

## DataFrame of state names

In [177]:
# what is the index currently?
state_df

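|   | abbreviation | name | name_length |
|---|---|---|---|
| 0 | WA | Washington | 10 |
| 1 | WI | Wisconsin | 9 |
| 2 | WV | West Virginia | 13 |
| 3 | FL | Florida | 7 |
| 4 | WY | Wyoming | 7 |
| 5 | NH | New Hampshire | 13 |
| 6 | NJ | New Jersey | 10 |
| 7 | NM | New Mexico | 10 |
| 8 | NC | North Carolina | 14 |
| 9 | ND | North Dakota | 12 |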


## Let's change the index

We have numbers as the index, let's make the index the state's abbreviation

In [178]:
state_df.set_index('abbreviation')

| abbreviation | name | name_length |
|---|---|---|
| WA | Washington | 10 |
| WI | Wisconsin | 9 |
| WV | West Virginia | 13 |
| FL | Florida | 7 |
| WY | Wyoming | 7 |
| NH | New Hampshire | 13 |
| NJ | New Jersey | 10 |
| NM | New Mexico | 10 |
| NC | North Carolina | 14 |
| ND | North Dakota | 12 |


## Notice that set_index returned a new DataFrame; state_df itself did not change

In [179]:
state_df

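|   | abbreviation | name | name_length |
|---|---|---|---|
| 0 | WA | Washington | 10 |
| 1 | WI | Wisconsin | 9 |
| 2 | WV | West Virginia | 13 |
| 3 | FL | Florida | 7 |
| 4 | WY | Wyoming | 7 |
| 5 | NH | New Hampshire | 13 |
| 6 | NJ | New Jersey | 10 |
| 7 | NM | New Mexico | 10 |
| 8 | NC | North Carolina | 14 |
| 9 | ND | North Dakota | 12 |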


## We have to assign the result to a variable or use inplace=True

In [180]:
state_df.set_index('abbreviation', inplace=True)
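
As an alternative to `inplace=True`, the result can be assigned to a new name. A minimal sketch, assuming a three-row stand-in for `state_df`; the name `indexed_df` is hypothetical.

```python
import pandas as pd

# small stand-in for state_df, for illustration only
state_df = pd.DataFrame({'abbreviation': ['WA', 'WI', 'WV'],
                         'name': ['Washington', 'Wisconsin', 'West Virginia'],
                         'name_length': [10, 9, 13]})

# set_index returns a new DataFrame; the original keeps its numeric index
indexed_df = state_df.set_index('abbreviation')

print(indexed_df.loc['WI', 'name'])
```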

## Now we can see the changes 'stuck'

In [181]:
state_df

| abbreviation | name | name_length |
|---|---|---|
| WA | Washington | 10 |
| WI | Wisconsin | 9 |
| WV | West Virginia | 13 |
| FL | Florida | 7 |
| WY | Wyoming | 7 |
| NH | New Hampshire | 13 |
| NJ | New Jersey | 10 |
| NM | New Mexico | 10 |
| NC | North Carolina | 14 |
| ND | North Dakota | 12 |


## What if we want to go back?

In [182]:
state_df

| abbreviation | name | name_length |
|---|---|---|
| WA | Washington | 10 |
| WI | Wisconsin | 9 |
| WV | West Virginia | 13 |
| FL | Florida | 7 |
| WY | Wyoming | 7 |
| NH | New Hampshire | 13 |
| NJ | New Jersey | 10 |
| NM | New Mexico | 10 |
| NC | North Carolina | 14 |
| ND | North Dakota | 12 |


## Need to reset it!

In [183]:
state_df.reset_index(inplace=True)
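
Like `set_index`, `reset_index` returns a new DataFrame unless `inplace=True` is passed. By default it moves the old index back into a column; with `drop=True` the old index is discarded instead. A sketch on a three-row stand-in (the variable names are hypothetical):

```python
import pandas as pd

# small stand-in for the indexed state_df, for illustration only
indexed_df = pd.DataFrame({'abbreviation': ['WA', 'WI', 'WV'],
                           'name': ['Washington', 'Wisconsin', 'West Virginia'],
                           'name_length': [10, 9, 13]}).set_index('abbreviation')

# default: the old index becomes a regular column again
restored = indexed_df.reset_index()

# drop=True: the old index is thrown away
dropped = indexed_df.reset_index(drop=True)

print(list(restored.columns), list(dropped.columns))
```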

In [184]:
state_df

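|   | abbreviation | name | name_length |
|---|---|---|---|
| 0 | WA | Washington | 10 |
| 1 | WI | Wisconsin | 9 |
| 2 | WV | West Virginia | 13 |
| 3 | FL | Florida | 7 |
| 4 | WY | Wyoming | 7 |
| 5 | NH | New Hampshire | 13 |
| 6 | NJ | New Jersey | 10 |
| 7 | NM | New Mexico | 10 |
| 8 | NC | North Carolina | 14 |
| 9 | ND | North Dakota | 12 |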


## Exercise

Using state_df:
- set the index to the state name and save the result as new_state
- use the .loc indexer to select all the states that begin with the letter 'N'
- reset the index back to a zero-based index, and do so in place

## Using .isin()

In [188]:
state_df

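|   | abbreviation | name | name_length |
|---|---|---|---|
| 0 | WA | Washington | 10 |
| 1 | WI | Wisconsin | 9 |
| 2 | WV | West Virginia | 13 |
| 3 | FL | Florida | 7 |
| 4 | WY | Wyoming | 7 |
| 5 | NH | New Hampshire | 13 |
| 6 | NJ | New Jersey | 10 |
| 7 | NM | New Mexico | 10 |
| 8 | NC | North Carolina | 14 |
| 9 | ND | North Dakota | 12 |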


## We get a boolean (True/False) Series back

In [189]:
states_with_direction = ['North Dakota', 'North Carolina',
                         'South Carolina', 'South Dakota',
                         'West Virginia']

state_df['name'].isin(states_with_direction)

0     False
1     False
2      True
3     False
4     False
5     False
6     False
7     False
8      True
9      True
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21     True
22    False
23    False
24     True
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44    False
45    False
46    False
47    False
48    False
49    False
Name: name, dtype: bool

## Wrapping that in state_df[ ] gives us only the True rows

In [190]:
state_df[state_df['name'].isin(states_with_direction)]

|   | abbreviation | name | name_length |
|---|---|---|---|
| 2 | WV | West Virginia | 13 |
| 8 | NC | North Carolina | 14 |
| 9 | ND | North Dakota | 12 |
| 21 | SC | South Carolina | 14 |
| 24 | SD | South Dakota | 12 |
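
Because the result of `.isin()` is an ordinary boolean Series, it can be stored, counted, and inverted. A minimal sketch on a four-row stand-in for `state_df`; the names `mask`, `n_matches` and `others` are illustrative.

```python
import pandas as pd

# small stand-in for state_df, for illustration only
state_df = pd.DataFrame({'name': ['Washington', 'West Virginia',
                                  'North Dakota', 'Florida']})
states_with_direction = ['North Dakota', 'West Virginia', 'South Dakota']

# store the boolean Series so it can be reused
mask = state_df['name'].isin(states_with_direction)

# True counts as 1, so sum() gives the number of matching rows
n_matches = int(mask.sum())

# ~ flips the mask: rows whose name is NOT in the list
others = state_df[~mask]

print(n_matches, others['name'].tolist())
```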


## Exercise

Using the state_df DataFrame:
- use .isin() to select only those rows that have a name_length of 10 or 12 characters
- use another .isin() with a list comprehension to select only the rows whose abbreviation begins with an 'N' or an 'S'

## Further into Boolean indexing

In [193]:
state_df['name_length'] > 10

0     False
1     False
2      True
3     False
4     False
5      True
6     False
7     False
8      True
9      True
10    False
11    False
12     True
13    False
14    False
15    False
16    False
17     True
18    False
19    False
20    False
21     True
22    False
23    False
24     True
25    False
26    False
27    False
28    False
29    False
30     True
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43     True
44    False
45    False
46    False
47    False
48    False
49     True
Name: name_length, dtype: bool

## Again, we can wrap that

In [194]:
# gives us only the rows that have a name_length greater than 10
state_df[state_df['name_length']>10]

|   | abbreviation | name | name_length |
|---|---|---|---|
| 2 | WV | West Virginia | 13 |
| 5 | NH | New Hampshire | 13 |
| 8 | NC | North Carolina | 14 |
| 9 | ND | North Dakota | 12 |
| 12 | RI | Rhode Island | 12 |
| 17 | CT | Connecticut | 11 |
| 21 | SC | South Carolina | 14 |
| 24 | SD | South Dakota | 12 |
| 30 | PA | Pennsylvania | 12 |
| 43 | MA | Massachusetts | 13 |


## Another example

In [195]:
state_df[state_df['name'].str.contains('South')]

|   | abbreviation | name | name_length |
|---|---|---|---|
| 21 | SC | South Carolina | 14 |
| 24 | SD | South Dakota | 12 |
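
One detail worth knowing: `.str.contains()` is case-sensitive by default. A short sketch on a three-row stand-in, using the `case=False` argument to ignore case (the variable names are illustrative):

```python
import pandas as pd

# small stand-in for state_df, for illustration only
state_df = pd.DataFrame({'name': ['South Carolina', 'South Dakota',
                                  'Massachusetts']})

# lower-case 'south' does not match 'South' by default
exact = state_df['name'].str.contains('south')

# case=False makes the match case-insensitive
any_case = state_df['name'].str.contains('south', case=False)

print(exact.tolist(), any_case.tolist())
```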


## Let's use '&' (and) to combine requirements; each condition gets its own parentheses

In [196]:
state_df[(state_df['name_length'] > 12)
         & (state_df['name'].str.contains('South'))]

|   | abbreviation | name | name_length |
|---|---|---|---|
| 21 | SC | South Carolina | 14 |


## Let's use '|' (or) to match either condition

In [197]:
state_df[(state_df['name'].str.contains('North'))
         | (state_df['name'].str.contains('South'))]

|   | abbreviation | name | name_length |
|---|---|---|---|
| 8 | NC | North Carolina | 14 |
| 9 | ND | North Dakota | 12 |
| 21 | SC | South Carolina | 14 |
| 24 | SD | South Dakota | 12 |
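
Why `&` and `|` rather than Python's `and` and `or`? The keywords try to reduce each whole Series to a single True/False, which pandas refuses to do. A sketch on a four-row stand-in for `state_df` (the variable names are illustrative):

```python
import pandas as pd

# small stand-in for state_df, for illustration only
state_df = pd.DataFrame({'name': ['South Carolina', 'South Dakota',
                                  'West Virginia', 'Florida'],
                         'name_length': [14, 12, 13, 7]})

long_name = state_df['name_length'] > 12
has_south = state_df['name'].str.contains('South')

# 'and' asks for the truth value of a whole Series, which raises ValueError
error_message = None
try:
    state_df[long_name and has_south]
except ValueError as err:
    error_message = str(err)

# & and | compare the two Series element by element
both = state_df[long_name & has_south]
either = state_df[long_name | has_south]

print(both['name'].tolist(), either['name'].tolist())
```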


## Exercise

Using the state_df DataFrame:
- use Boolean indexing to select all states with a 'y' in their name
- extend that code with a second condition so it only returns states that have 10 or fewer characters in their name

## Math with pandas columns

In [200]:
state_df

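|   | abbreviation | name | name_length |
|---|---|---|---|
| 0 | WA | Washington | 10 |
| 1 | WI | Wisconsin | 9 |
| 2 | WV | West Virginia | 13 |
| 3 | FL | Florida | 7 |
| 4 | WY | Wyoming | 7 |
| 5 | NH | New Hampshire | 13 |
| 6 | NJ | New Jersey | 10 |
| 7 | NM | New Mexico | 10 |
| 8 | NC | North Carolina | 14 |
| 9 | ND | North Dakota | 12 |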


## Let's add a new column that is 100x name_length

In [201]:
tmp_df = state_df.copy()

tmp_df['name_length_x100'] = tmp_df['name_length'] * 100

tmp_df

|   | abbreviation | name | name_length | name_length_x100 |
|---|---|---|---|---|
| 0 | WA | Washington | 10 | 1000 |
| 1 | WI | Wisconsin | 9 | 900 |
| 2 | WV | West Virginia | 13 | 1300 |
| 3 | FL | Florida | 7 | 700 |
| 4 | WY | Wyoming | 7 | 700 |
| 5 | NH | New Hampshire | 13 | 1300 |
| 6 | NJ | New Jersey | 10 | 1000 |
| 7 | NM | New Mexico | 10 | 1000 |
| 8 | NC | North Carolina | 14 | 1400 |
| 9 | ND | North Dakota | 12 | 1200 |


## Let's add two columns together

In [202]:
tmp_df['name_added_cols'] = tmp_df['name_length'] + tmp_df['name_length_x100']

tmp_df

|   | abbreviation | name | name_length | name_length_x100 | name_added_cols |
|---|---|---|---|---|---|
| 0 | WA | Washington | 10 | 1000 | 1010 |
| 1 | WI | Wisconsin | 9 | 900 | 909 |
| 2 | WV | West Virginia | 13 | 1300 | 1313 |
| 3 | FL | Florida | 7 | 700 | 707 |
| 4 | WY | Wyoming | 7 | 700 | 707 |
| 5 | NH | New Hampshire | 13 | 1300 | 1313 |
| 6 | NJ | New Jersey | 10 | 1000 | 1010 |
| 7 | NM | New Mexico | 10 | 1000 | 1010 |
| 8 | NC | North Carolina | 14 | 1400 | 1414 |
| 9 | ND | North Dakota | 12 | 1200 | 1212 |
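
Arithmetic on a Series is applied element by element, and aggregations such as `sum()` and `mean()` reduce it to a single number. A minimal sketch on a three-value stand-in for the name_length column (the variable names are illustrative):

```python
import pandas as pd

# three name lengths, labeled by state abbreviation, for illustration only
lengths = pd.Series([10, 9, 13], index=['WA', 'WI', 'WV'])

# operators act on every element at once
halved = lengths / 2

# aggregations collapse the Series to one value
total = lengths.sum()
average = lengths.mean()

# a scalar result can be combined with the Series again
centered = lengths - average

print(halved.tolist(), total)
```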


## Exercise

Using the state_df DataFrame again:
- Save temp_df as a copy of state_df
- Double the name_length column by adding it to itself
- Double the doubled column you created by multiplying it by 2

## Conclusion

In this lecture we covered:
- How to use the .loc indexer to slice and how it differs from .iloc
- How to set and reset DataFrame Indices
- How to use .isin()
- How to perform boolean indexing on dataframes
- How to combine conditions using '|' and '&'
- How to perform math on pandas Series