**Some Pandas**

We use the pandas package to deal with flat/rectangular datasets.
Pandas calls these data frames (a term that comes from R).
We'll only introduce a few things one can do with these for now.

- read a .csv file as a pandas data frame
- apply a function to every row to produce a pandas *Series*
- create a new data frame by selecting rows satisfying some criterion


In [8]:
import pandas as pd
df = pd.read_csv("mortgage_data.csv")
print(df.shape)  # (number of rows, number of columns)
df.tail(7)       # the last 7 rows

(9864, 5)


      location  princ  irate  cscore       result
9857  suburban    507   7.25     641  non-default
9858  suburban    809   7.00     764  non-default
9859  suburban    769   7.75     586  non-default
9860  suburban    451   7.25     684  non-default
9861  suburban    410   7.00     702  non-default
9862  suburban    851   7.00     774  non-default
9863  suburban    260   7.50     657      default
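
As a minimal sketch of the same reading step that runs without the data file, the cell below feeds pd.read_csv a small in-memory CSV. The name toy and the three rows are made up for illustration; only the column names follow the data frame above.

```python
import io
import pandas as pd

# Hypothetical stand-in for a small CSV file; the values are made up.
csv_text = """location,princ,irate,cscore,result
urban,300,6.5,720,non-default
suburban,574,7.25,715,default
suburban,410,7.0,702,non-default
"""

# read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # column types inferred by pandas
print(toy.tail(2))  # the last two rows
```

Numeric columns are inferred as integer or float types, while text columns such as location are stored as strings (object dtype in older pandas versions).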


We can get a row of a pandas data frame using the iloc indexer (note the square brackets: it is not called like a function). Its argument is the zero-based position of the row, so df.iloc[5] is the sixth row.

In [10]:
row=df.iloc[5]
print(type(row))
print(row)

<class 'pandas.core.series.Series'>
location    suburban
princ            574
irate           7.25
cscore           715
result       default
Name: 5, dtype: object
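
In the data frame above, row positions and row labels happen to coincide, which hides the difference between iloc (position) and loc (label). The sketch below uses a made-up data frame, toy, whose labels differ from its positions; the values are hypothetical.

```python
import pandas as pd

# Made-up data; the index labels (10, 20, 30) deliberately differ from positions.
toy = pd.DataFrame(
    {"princ": [300, 574, 410], "cscore": [720, 715, 702]},
    index=[10, 20, 30],
)

by_position = toy.iloc[1]  # second row, counting from 0
by_label = toy.loc[20]     # row whose index label is 20

print(by_position.equals(by_label))  # both refer to the same row
print(by_position.name)              # the row's label is kept as the Series name
```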


The resulting pandas Series can be thought of as name-value pairs.
The names (the first column of the printed output) form the *index* of the Series; here they are the column names of the data frame.
We can access a value by its name using a dot.

In [11]:
row.location

'suburban'

Square brackets work as well, and this is important because sometimes our indices are strings with spaces in them.

In [15]:
row["location"]

'suburban'
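
A short sketch of where dot access stops working, using a made-up Series (toy_row and its entries are hypothetical). Besides names containing spaces, a name that matches an existing Series method, such as count, is also only reachable with square brackets.

```python
import pandas as pd

# Hypothetical row: "credit score" contains a space,
# and "count" is also the name of a Series method.
toy_row = pd.Series({"location": "suburban", "credit score": 715, "count": 3})

print(toy_row.location)         # dot access works for a simple name
print(toy_row["credit score"])  # brackets are required for a name with a space
print(toy_row["count"])         # brackets return the stored value
print(callable(toy_row.count))  # the dot gives the Series.count method instead
```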

We can have a function that takes a row as an argument and returns a value. 

In [16]:
def f(row):
    return row.irate > 7.2
f(df.iloc[131])

True

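We can apply a function to every row of a data frame to produce a pandas Series. Passing axis=1 tells apply to hand the function one row at a time; the default, axis=0, would pass it one column at a time.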

In [18]:
s=df.apply(f,axis=1)
print(s)
print(type(s))

0       False
1        True
2        True
3        True
4        True
        ...  
9859     True
9860     True
9861    False
9862    False
9863     True
Length: 9864, dtype: bool
<class 'pandas.core.series.Series'>
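
For a simple comparison like this one, pandas can also compare a whole column at once, without apply. This is an alternative technique, not the one used in the cell above; the sketch checks on a made-up data frame (toy, with hypothetical values) that both routes give the same Boolean values.

```python
import pandas as pd

# Made-up interest rates for illustration.
toy = pd.DataFrame({"irate": [7.0, 7.25, 7.5, 6.75]})

def f(row):
    return row.irate > 7.2

s_apply = toy.apply(f, axis=1)  # row by row, as in the cell above
s_vector = toy["irate"] > 7.2   # whole column compared in one step

print(s_vector)
print((s_apply == s_vector).all())  # do the two results agree everywhere?
```

The column-wise form is usually faster on large data frames, while apply with axis=1 remains useful when the test needs arbitrary Python logic on each row.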


The function can test any column, including string-valued ones.

In [11]:
def f(row):
    return row.location == "urban"
df.apply(f, axis=1)

0       False
1       False
2       False
3       False
4       False
        ...  
9859    False
9860    False
9861    False
9862    False
9863    False
Length: 9864, dtype: bool

A test can also check membership in a list of values.

In [19]:
def f(row):
    return row.location in ["urban", "suburban"]
df.apply(f, axis=1)

0       True
1       True
2       True
3       True
4       True
        ... 
9859    True
9860    True
9861    True
9862    True
9863    True
Length: 9864, dtype: bool
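
The membership test can also be written column-wise with the Series method isin. The sketch below is an alternative to the apply call above, shown on a made-up data frame; the value "rural" is hypothetical and is included only so that some entries come out False.

```python
import pandas as pd

# Made-up locations; "rural" is a hypothetical value added for contrast.
toy = pd.DataFrame({"location": ["urban", "rural", "suburban", "rural"]})

# True where the location is one of the listed values.
in_town = toy["location"].isin(["urban", "suburban"])

print(in_town)
print(in_town.sum())  # True counts as 1, so the sum is the number of matches
```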

Finally, we can use a Boolean Series with one entry per row of our data frame to
produce a new data frame containing only the rows for which the entry is True.
The ~ operator negates the Series, which selects the remaining rows instead.

In [21]:
def f(row):
    return row.cscore > 750
res = df.apply(f, axis=1)  # True/False Series, one entry per row
print(res)
dfnew1 = df.loc[res]   # rows of df where res is True
dfnew2 = df.loc[~res]  # rows of df where res is False
print(dfnew1.shape)
print(dfnew2.shape)

print(dfnew1)

0       False
1       False
2       False
3       False
4       False
        ...  
9859    False
9860    False
9861    False
9862     True
9863    False
Length: 9864, dtype: bool
(617, 5)
(9247, 5)
      location  princ  irate  cscore       result
10    suburban    494   7.25     753  non-default
26    suburban    293   7.00     799  non-default
40       urban    580   6.75     816  non-default
43    suburban    588   7.00     758      default
48    suburban    329   6.50     786  non-default
...        ...    ...    ...     ...          ...
9825     urban    268   6.50     823  non-default
9828  suburban    822   7.00     766      default
9855  suburban    377   7.00     792      default
9858  suburban    809   7.00     764  non-default
9862  suburban    851   7.00     774  non-default

[617 rows x 5 columns]
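
Boolean Series can be combined before selecting rows. The following is a minimal sketch on a made-up data frame (toy and its values are hypothetical): & keeps rows where both conditions hold, and ~ negates a condition.

```python
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({
    "location": ["urban", "suburban", "suburban", "urban"],
    "cscore": [816, 715, 774, 640],
})

high = toy["cscore"] > 750                 # Boolean Series, one entry per row
suburban = toy["location"] == "suburban"   # a second Boolean Series

high_rows = toy[high]        # shorthand for toy.loc[high]
both = toy[high & suburban]  # rows where both conditions hold
rest = toy[~high]            # rows where the first condition fails

print(high_rows)
print(both)
print(len(high_rows) + len(rest) == len(toy))  # the two pieces cover every row
```

As in the output above, the selected rows keep their original index labels, so the new data frame's index is generally not 0, 1, 2, and so on.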
