# Filtering and Iterating DataFrames

Two common tasks with DataFrames are selecting a subset of the data and iterating over the rows or columns.

The pandas DataFrame reference has a section listing the many methods available for indexing and iteration:
https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#indexing-iteration

Some examples follow.


### Setup

In [3]:
import pandas as pd
import pprint
from pathlib import Path
# the numpy package for datatypes
import numpy as np

# capture the path to the data directory
dataDirectory = Path.home() / 'LAS792/data'

# read the CSV, treating 'M' as a missing value and reading Fur and Temper as floats
dogsDf = pd.read_csv(dataDirectory / "DOGGYDATA.csv",
                     dtype={'Fur': np.float64, 'Temper': np.float64},
                     na_values=['M'])
dogsDf

Unnamed: 0,Name,Species,Gender,Weight,Fur,Temper,Bites
0,102,2,BOY,8,,2.0,9
1,103,1,GIRL,31,11.0,3.0,14
2,104,1,BOY,26,12.0,1.0,3
3,105,2,BOY,14,9.0,3.0,15
4,106,1,GIRL,64,7.0,3.0,16
5,107,2,GIRL,15,3.0,,10
6,108,1,GIRL,9,17.0,2.0,11
7,109,1,BOY,38,4.0,1.0,2
8,110,2,GIRL,12,14.0,2.0,12
9,111,1,BOY,41,2.0,3.0,17


In [18]:
dogsDf.head()

Unnamed: 0,Name,Species,Gender,Weight,Fur,Temper,Bites
0,102,2,BOY,8,,2.0,9
1,103,1,GIRL,31,11.0,3.0,14
2,104,1,BOY,26,12.0,1.0,3
3,105,2,BOY,14,9.0,3.0,15
4,106,1,GIRL,64,7.0,3.0,16


### Selecting a column

In [4]:
# with a [] operator
SpeciesSeries = dogsDf["Species"]
print("SpeciesSeries is a ", type(SpeciesSeries))
SpeciesSeries

SpeciesSeries is a  <class 'pandas.core.series.Series'>


0     2
1     1
2     1
3     2
4     1
5     2
6     1
7     1
8     2
9     1
10    1
11    2
Name: Species, dtype: int64

### Using a variable for the column name

In [5]:
# with a [] operator
wantedVar = "Species" 
SpeciesSeries = dogsDf[wantedVar]
print("SpeciesSeries is a ", type(SpeciesSeries))
SpeciesSeries

SpeciesSeries is a  <class 'pandas.core.series.Series'>


0     2
1     1
2     1
3     2
4     1
5     2
6     1
7     1
8     2
9     1
10    1
11    2
Name: Species, dtype: int64

In [7]:
# the column name can also arrive as a function argument
def nonsense(var):
    SpeciesSeries = dogsDf[var]
    print("SpeciesSeries is a ", type(SpeciesSeries))

nonsense("Species")

SpeciesSeries is a  <class 'pandas.core.series.Series'>


In [6]:
# with dot notation
SpeciesSeries = dogsDf.Species
print("SpeciesSeries is a ", type(SpeciesSeries))
SpeciesSeries

SpeciesSeries is a  <class 'pandas.core.series.Series'>


0     2
1     1
2     1
3     2
4     1
5     2
6     1
7     1
8     2
9     1
10    1
11    2
Name: Species, dtype: int64
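
Dot notation is convenient, but it only works when the column name is a valid Python identifier that does not collide with an existing DataFrame attribute or method. The sketch below illustrates this on a small made-up frame; `petsDf` and its column names are assumptions for illustration and are not part of DOGGYDATA.csv.

```python
import pandas as pd

# Hypothetical frame used only for illustration.
petsDf = pd.DataFrame({"Species": [1, 2],
                       "count": [5, 7],
                       "fur length": [3.0, 4.5]})

# Works: "Species" is a valid identifier and not a DataFrame attribute.
speciesByDot = petsDf.Species

# "count" collides with the DataFrame.count method,
# so dot access returns the method rather than the column.
countByDot = petsDf.count
countByBracket = petsDf["count"]

# "fur length" contains a space, so only the [] operator can reach it.
furByBracket = petsDf["fur length"]

print(type(speciesByDot), callable(countByDot), type(countByBracket))
```

The [] operator works in every one of these cases, which makes it the safer choice when column names come from a file.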

### Selecting a cell

In [8]:
dogsDf.at[3,'Name']

105

### The key passed to .at is actually a tuple

In [9]:
selectingTuple = (3,'Name')
dogsDf.at[selectingTuple]

105
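
As a small sketch of how `.at` can be used beyond reading, the example below reads a cell by label, reads the same cell by position with `.iat`, and then assigns a new value through `.at`. The frame `smallDf` and its index labels are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with non-default row labels.
smallDf = pd.DataFrame({"Name": [102, 103, 104],
                        "Gender": ["BOY", "GIRL", "BOY"]},
                       index=[10, 20, 30])

labelKey = (20, "Gender")

# .at looks the cell up by row label and column label.
valueByLabel = smallDf.at[labelKey]

# .iat looks the cell up by integer position (second row, second column).
valueByPosition = smallDf.iat[1, 1]

# .at can also be used to assign a single cell.
smallDf.at[labelKey] = "BOY"

print(valueByLabel, valueByPosition, smallDf.at[20, "Gender"])
```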

### Selecting a single row

In [10]:
print("type of dogsDf.loc[3] is ", type(dogsDf.loc[3]))
dogsDf.loc[3]

type of dogsDf.loc[3] is  <class 'pandas.core.series.Series'>


Name       105
Species      2
Gender     BOY
Weight      14
Fur        9.0
Temper     3.0
Bites       15
Name: 3, dtype: object

In [11]:
print(dogsDf.loc[1])

Name        103
Species       1
Gender     GIRL
Weight       31
Fur        11.0
Temper      3.0
Bites        14
Name: 1, dtype: object

### Selecting an enumerated set of rows
Note that 3, 5, and 7 are index labels, not integer positions.

In [13]:
print( "the type of  dogsDf.loc[[3,5,7]] is ", type(dogsDf.loc[[3,5,7]]))
dogsDf.loc[[3,5,7]]

the type of  dogsDf.loc[[3,5,7]] is  <class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Name,Species,Gender,Weight,Fur,Temper,Bites
3,105,2,BOY,14,9.0,3.0,15
5,107,2,GIRL,15,3.0,,10
7,109,1,BOY,38,4.0,1.0,2
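
With the default index, labels and positions happen to coincide, so the difference is easy to miss. A minimal sketch on a made-up frame (`shuffledDf` is an assumption for illustration) whose labels are out of order makes it visible:

```python
import pandas as pd

# Hypothetical frame whose row labels do not match the row positions.
shuffledDf = pd.DataFrame({"Name": [102, 103, 104, 105]},
                          index=[3, 1, 0, 2])

# .loc selects the rows whose labels are 3 and 0.
byLabel = shuffledDf.loc[[3, 0]]

# .iloc selects the fourth and first rows by position.
byPosition = shuffledDf.iloc[[3, 0]]

print(byLabel["Name"].tolist())     # [102, 104]
print(byPosition["Name"].tolist())  # [105, 102]
```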


### Selecting an enumerated set of columns
The example below selects all rows with a bare colon (:) and three columns with a list of column names.

In [12]:
dogsDf.loc[:,["Name", "Gender", "Fur"]]

Unnamed: 0,Name,Gender,Fur
0,102,BOY,
1,103,GIRL,11.0
2,104,BOY,12.0
3,105,BOY,9.0
4,106,GIRL,7.0
5,107,GIRL,3.0
6,108,GIRL,17.0
7,109,BOY,4.0
8,110,GIRL,14.0
9,111,BOY,2.0


### Selecting a range of columns
Note that the last *label* in the range **is** included

In [None]:
dogsDf.loc[:,"Name":"Fur"]

### Selecting from a column to the end

In [None]:
dogsDf.loc[:,"Species":]

### Selecting from the beginning up to a column

In [None]:
dogsDf.loc[:,:"Fur"]

### Selecting a subset of the rows by position
Note that the last *position* **is not** included

In [None]:
dogsDf.iloc[2:5,2:6]
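
The inclusive and exclusive slice endpoints can be compared side by side. This is a minimal sketch on a made-up frame (`colsDf` and its values are assumptions for illustration), not the dogs data:

```python
import pandas as pd

# Hypothetical frame with three rows and four columns.
colsDf = pd.DataFrame([[1, 2, 3, 4],
                       [5, 6, 7, 8],
                       [9, 10, 11, 12]],
                      columns=["Name", "Species", "Gender", "Weight"])

# Label slice: the end label "Gender" is included.
labelSlice = colsDf.loc[:, "Name":"Gender"]

# Position slice: the end position 2 is excluded.
positionSlice = colsDf.iloc[:, 0:2]

# The same rule applies to rows.
rowLabelSlice = colsDf.loc[0:1]      # rows labelled 0 and 1
rowPositionSlice = colsDf.iloc[0:1]  # only the row at position 0

print(list(labelSlice.columns), list(positionSlice.columns))
print(len(rowLabelSlice), len(rowPositionSlice))
```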

## Iterators
The built-in **next** function retrieves the next item from an iterator. The built-in **enumerate** function returns an iterator that yields (index, value) tuples for the items of a list. In this example next is called on the enumerate iterator until it is exhausted; the third call raises StopIteration.

In [None]:
listEnum = enumerate(['First','Second'])
print(next(listEnum))
print(next(listEnum))
print(next(listEnum))

In [None]:
for s in enumerate(['First','Second']):
    print(s)
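
A for loop handles StopIteration automatically. When calling next by hand, one way to avoid the exception is to catch it or to pass a default value as the second argument. A minimal sketch (the variable names are illustrative):

```python
listEnum = enumerate(["First", "Second"])

# Drain the iterator manually, stopping when it is exhausted.
collected = []
while True:
    try:
        collected.append(next(listEnum))
    except StopIteration:
        break

# With a default, next returns it instead of raising on an exhausted iterator.
exhaustedValue = next(listEnum, None)

print(collected)
print(exhaustedValue)
```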

### Iterating across rows with iterrows

In [19]:
for ixRow, row in dogsDf.iterrows():
  print("   row is a ", type(row), " gender is ", row.Gender)

   row is a  <class 'pandas.core.series.Series'>  gender is  BOY
   row is a  <class 'pandas.core.series.Series'>  gender is  GIRL
   row is a  <class 'pandas.core.series.Series'>  gender is  BOY
   row is a  <class 'pandas.core.series.Series'>  gender is  BOY
   row is a  <class 'pandas.core.series.Series'>  gender is  GIRL
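
Because iterrows packs each row into a single Series, the values in a row share one dtype, so an integer column can come back as a float when another column holds floats. The sketch below shows this on a made-up two-column frame (`mixedDf` is an assumption for illustration) and compares it with itertuples:

```python
import pandas as pd

# Hypothetical frame with one integer column and one float column.
mixedDf = pd.DataFrame({"Weight": [8, 31], "Fur": [1.5, 11.0]})

# iterrows yields (label, Series) pairs; take the first one.
firstLabel, firstRow = next(mixedDf.iterrows())

# itertuples yields one namedtuple per row, built column by column.
firstTuple = next(mixedDf.itertuples(index=False))

print("column dtype:      ", mixedDf["Weight"].dtype)
print("row Series dtype:  ", firstRow.dtype)
print("value via iterrows:", repr(firstRow["Weight"]))
print("value via tuples:  ", repr(firstTuple.Weight))
```

If the original column types matter inside the loop, itertuples is usually the safer choice.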


### Iterating across rows with itertuples

In [None]:
for row in dogsDf.itertuples():
  pprint.pprint(row)  
  print("   row is a ", type(row), " gender is ", row.Gender)

### Without the Index

In [None]:
# exclude the Index element of the tuple. What if there is a column named Index?
for row in dogsDf.itertuples(index=False):
  pprint.pprint(row)  
  print("   row is a ", type(row), " gender is ", row.Gender)
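
One way to explore the question of a column named Index is to inspect the field names of the generated tuples. The sketch below uses a made-up frame (`clashDf` is an assumption for illustration). With the index included, the column's field name clashes with the index field and is typically replaced by a positional name; with index=False there is no clash.

```python
import pandas as pd

# Hypothetical frame that has a column literally named "Index".
clashDf = pd.DataFrame({"Index": [7, 8], "Gender": ["BOY", "GIRL"]})

# Field names when the row index is included in each tuple.
withIndexFields = next(clashDf.itertuples())._fields

# Field names when the row index is left out.
withoutIndexFields = next(clashDf.itertuples(index=False))._fields

print(withIndexFields)
print(withoutIndexFields)
```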

### Using next explicitly

In [None]:
tuples = dogsDf.itertuples()
t1 = next(tuples)
t2 = next(tuples)
print(t1, '\n\n', t2)