This notebook focuses on using DataFrames in Pandas.
# Data Frames

In [112]:
import pandas as pd
import numpy as np

First, let us read in some data. There are many ways to do this, but we will work with CSV data:

In [113]:
data = pd.read_csv('us-ca-la_city-building_fires_2014_2015.csv')

Using some high-level commands, we can find out more about our data.

In [114]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7082 entries, 0 to 7081
Data columns (total 13 columns):
index                     7082 non-null int64
group                     7082 non-null object
inc_date                  7082 non-null int64
incnum                    7082 non-null int64
exp                       7082 non-null int64
aidtype                   7082 non-null object
inc_type                  7082 non-null int64
struc_type                1982 non-null float64
property_use              7082 non-null object
fire_sprd                 1982 non-null float64
civilian_casualties       7082 non-null int64
firefighter_casualties    7082 non-null int64
incident_number           7082 non-null int64
dtypes: float64(2), int64(8), object(3)
memory usage: 719.3+ KB


This tells us that we are working with a DataFrame (the structure Pandas uses to hold the file in memory). It tells us there are 7082
entries and that Pandas numbers the rows from 0 to N-1. We can also see that both struc_type and fire_sprd have missing values: each has only 1982 non-null entries out of 7082. Although the two counts match, we cannot assume that the same rows are populated in both columns.
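
One way to test whether two columns are missing on the same rows is to compare their null masks. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from the fire data) in which both columns have the same non-null count but different populated rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the fire data
toy = pd.DataFrame({
    "struc_type": [1.0, np.nan, np.nan, 2.0],
    "fire_sprd": [4.0, np.nan, 3.0, np.nan],
})

# Boolean masks: True where a value is missing
struc_missing = toy["struc_type"].isna()
sprd_missing = toy["fire_sprd"].isna()

# True only if both columns are missing on exactly the same rows
same_rows = bool((struc_missing == sprd_missing).all())

# Number of rows where exactly one of the two columns is missing
mismatched = int((struc_missing ^ sprd_missing).sum())

print(same_rows, mismatched)
```

Applied to the real data, the same comparison on `data['struc_type']` and `data['fire_sprd']` would tell us whether the two columns can be treated as populated together.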

In [115]:
data.dtypes

index                       int64
group                      object
inc_date                    int64
incnum                      int64
exp                         int64
aidtype                    object
inc_type                    int64
struc_type                float64
property_use               object
fire_sprd                 float64
civilian_casualties         int64
firefighter_casualties      int64
incident_number             int64
dtype: object

We can also quickly get some statistics on the data columns. Here we will look at just firefighter casualties. Recall from the series demos how to
select values by index; here we will use the column header in the same way.

In [116]:
data['firefighter_casualties'].describe()

count    7082.000000
mean        0.006072
std         0.129494
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         8.000000
Name: firefighter_casualties, dtype: float64
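
Selecting with a single column header returns a Series, while a list of headers returns a DataFrame, so several columns can be summarized at once. This is a minimal sketch on a hypothetical `toy` frame with invented values.

```python
import pandas as pd

# Hypothetical toy frame with two of the casualty columns
toy = pd.DataFrame({
    "civilian_casualties": [0, 0, 4, 0],
    "firefighter_casualties": [0, 2, 1, 0],
})

# A single header gives a Series
one = toy["firefighter_casualties"]

# A list of headers gives a DataFrame
several = toy[["civilian_casualties", "firefighter_casualties"]]

# describe() then produces one column of statistics per selected column
summary = several.describe()
print(summary.loc[["mean", "max"]])
```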

We can also view parts of the dataset if we choose, similar to the Linux head and tail commands.

In [117]:
data.head()

Unnamed: 0,index,group,inc_date,incnum,exp,aidtype,inc_type,struc_type,property_use,fire_sprd,civilian_casualties,firefighter_casualties,incident_number
0,69,inc_sf2014q1.csv,20140101,519,0,N,118,,962,,0,0,201401010519
1,72,inc_sf2014q1.csv,20140101,562,0,N,118,,962,,0,0,201401010562
2,73,inc_sf2014q1.csv,20140101,583,0,N,118,,963,,0,0,201401010583
3,77,inc_sf2014q1.csv,20140101,620,0,N,113,,419,,0,0,201401010620
4,56,inc_sf2014q1.csv,20140101,422,0,N,111,1.0,419,4.0,0,0,201401010422


In [118]:
data.tail()

Unnamed: 0,index,group,inc_date,incnum,exp,aidtype,inc_type,struc_type,property_use,fire_sprd,civilian_casualties,firefighter_casualties,incident_number
7077,18029,inc_sf2014q2.csv,20140421,1375,0,N,118,,963,,0,0,201404211375
7078,18031,inc_sf2014q2.csv,20140421,1415,0,N,118,,963,,0,0,201404211415
7079,18032,inc_sf2014q2.csv,20140421,1415,1,N,118,,963,,0,0,201404211415
7080,18044,inc_sf2014q2.csv,20140422,110,0,N,113,,419,,0,0,201404220110
7081,115511,inc_sf20151105.csv,20151105,1178,0,N,118,,960,,0,0,201511051178


In [119]:
data[5051:5054]

Unnamed: 0,index,group,inc_date,incnum,exp,aidtype,inc_type,struc_type,property_use,fire_sprd,civilian_casualties,firefighter_casualties,incident_number
5051,79823,inc_sf2015q2.csv,20150412,999,0,N,113,,429,,0,0,201504120999
5052,79834,inc_sf2015q2.csv,20150412,1083,0,N,118,,215,,0,0,201504121083
5053,79836,inc_sf2015q2.csv,20150412,1135,0,N,118,,963,,0,0,201504121135
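
The slice in the cell above selects rows by position and leaves out the end point, like a Python list slice. As a hedged sketch on a hypothetical `toy` frame, the same selection can be written explicitly with `iloc`, while `loc` slices by label and includes the end point.

```python
import pandas as pd

# Hypothetical toy frame with the default 0..N-1 row index
toy = pd.DataFrame({"incnum": [519, 562, 583, 620, 422]})

# Plain slice: rows at positions 1 and 2 (end point excluded)
by_slice = toy[1:3]

# The same rows, with the positional intent made explicit
by_iloc = toy.iloc[1:3]

# Label-based slice: rows labeled 1, 2 and 3 (end point included)
by_loc = toy.loc[1:3]

print(len(by_slice), len(by_iloc), len(by_loc))
```

Because the default index labels equal the positions, the difference only shows up at the end point here.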


Think of a DataFrame as a set of Series that share an index, with one Series per column header. Say we want to view just the first five rows of the
incident date:

In [120]:
data['inc_date'].head()

0    20140101
1    20140101
2    20140101
3    20140101
4    20140101
Name: inc_date, dtype: int64

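There are multiple ways to index/reference data from a DataFrame. For example, a boolean condition on a column keeps only the rows where it holds; here we take the first four incidents with firefighter casualties: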

In [121]:
data[data.firefighter_casualties>0].head(4)

Unnamed: 0,index,group,inc_date,incnum,exp,aidtype,inc_type,struc_type,property_use,fire_sprd,civilian_casualties,firefighter_casualties,incident_number
314,4106,inc_sf2014q1.csv,20140123,844,0,N,111,1.0,419,5.0,0,2,201401230844
370,5080,inc_sf2014q1.csv,20140129,1423,0,N,111,1.0,419,4.0,0,1,201401291423
1264,19600,inc_sf2014q2.csv,20140501,678,0,N,111,1.0,429,2.0,0,1,201405010678
1506,23684,inc_sf2014q2.csv,20140523,1041,0,N,111,1.0,419,4.0,0,2,201405231041
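
The filter above relies on a boolean Series (a mask) with one True/False value per row. The sketch below, on a hypothetical `toy` frame, shows that attribute access and bracket access build the same mask, and that `loc` can apply a mask and pick columns in one step.

```python
import pandas as pd

# Hypothetical toy frame with invented values
toy = pd.DataFrame({
    "inc_date": [20140123, 20140129, 20140501, 20140523],
    "civilian_casualties": [0, 0, 4, 0],
    "firefighter_casualties": [2, 0, 1, 0],
})

# A comparison on a column yields a boolean Series
mask = toy["firefighter_casualties"] > 0

# Attribute access and bracket access select the same rows
attr_style = toy[toy.firefighter_casualties > 0]
bracket_style = toy[mask]

# loc takes the mask for rows and a list of headers for columns
with_cols = toy.loc[mask, ["inc_date", "firefighter_casualties"]]
print(with_cols)
```

Bracket access works for any column name, whereas attribute access only works when the name is a valid Python identifier that does not clash with a DataFrame attribute.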


Say we want the cases where we have both civilian and firefighter casualties.

In [122]:
data[(data.firefighter_casualties>0) & (data['civilian_casualties']>0)]

Unnamed: 0,index,group,inc_date,incnum,exp,aidtype,inc_type,struc_type,property_use,fire_sprd,civilian_casualties,firefighter_casualties,incident_number
5518,93708,inc_sf2015q3.csv,20150704,378,0,N,111,1.0,419,5.0,4,1,201507040378
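
When combining conditions as in the cell above, each comparison needs its own parentheses and the element-wise operators `&` (and) and `|` (or) replace Python's `and` and `or`. The following sketch uses a hypothetical `toy` frame and also shows `DataFrame.query` as an alternative spelling.

```python
import pandas as pd

# Hypothetical toy frame with invented values
toy = pd.DataFrame({
    "civilian_casualties": [0, 0, 4, 0],
    "firefighter_casualties": [2, 0, 1, 0],
})

ff = toy["firefighter_casualties"] > 0
civ = toy["civilian_casualties"] > 0

# Rows where both conditions hold
both = toy[ff & civ]

# Rows where at least one condition holds
either = toy[ff | civ]

# The same "both" filter written as a query string
both_q = toy.query("firefighter_casualties > 0 and civilian_casualties > 0")

print(len(both), len(either))
```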


We will now use this data to determine the total number of casualties and the number of fires where fire spread was reported. We are going to convert some of these columns to strings so that we can slice and dice this DataFrame to suit our needs.

In [129]:
data['aidtype'] = data.aidtype.apply(str)
data['struc_type'] = data.struc_type.apply(str)
data['inc_type'] = data.inc_type.apply(str)

In [130]:
data.dtypes

index                       int64
group                      object
inc_date                    int64
incnum                      int64
exp                         int64
aidtype                    object
inc_type                   object
struc_type                 object
property_use               object
fire_sprd                 float64
civilian_casualties         int64
firefighter_casualties      int64
incident_number             int64
dtype: object
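
One thing to watch when converting with `apply(str)`: a missing value in a float column becomes the text `'nan'`, which Pandas no longer counts as missing, and `1.0` becomes `'1.0'` rather than `'1'`. The sketch below illustrates this on a hypothetical `toy` frame and shows `map` with `na_action="ignore"` as one possible way to keep the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: a float column with one missing value
toy = pd.DataFrame({"struc_type": [1.0, np.nan]})

# Same conversion as in the notebook cell
as_text = toy["struc_type"].apply(str)

# The missing value is now the string 'nan', so nothing is reported as missing
missing_after = int(as_text.isna().sum())
nan_strings = int((as_text == "nan").sum())

# Converting only the non-missing values leaves the gap in place
kept_missing = toy["struc_type"].map(str, na_action="ignore")

print(as_text.tolist(), missing_after, nan_strings)
```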

Another way to do this would have been to set the column types when we read in the data:
data = pd.read_csv('us-ca-la_city-building_fires_2014_2015.csv', dtype={'aidtype': str, 'struc_type': str, 'inc_type': str})

In [172]:
# Read three columns as strings, then replace every missing value with 0
data2 = (
    pd.read_csv('us-ca-la_city-building_fires_2014_2015.csv',
                dtype={'aidtype': str, 'struc_type': str, 'inc_type': str})
    .fillna(0)
)

In [173]:
data2.tail()

Unnamed: 0,index,group,inc_date,incnum,exp,aidtype,inc_type,struc_type,property_use,fire_sprd,civilian_casualties,firefighter_casualties,incident_number
7077,18029,inc_sf2014q2.csv,20140421,1375,0,N,118,0,963,0.0,0,0,201404211375
7078,18031,inc_sf2014q2.csv,20140421,1415,0,N,118,0,963,0.0,0,0,201404211415
7079,18032,inc_sf2014q2.csv,20140421,1415,1,N,118,0,963,0.0,0,0,201404211415
7080,18044,inc_sf2014q2.csv,20140422,110,0,N,113,0,419,0.0,0,0,201404220110
7081,115511,inc_sf20151105.csv,20151105,1178,0,N,118,0,960,0.0,0,0,201511051178
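
To tie this back to the goal stated earlier (total casualties and the number of fires with reported spread), here is a minimal sketch on an in-memory CSV. The `csv_text` contents are invented for illustration, and filling only `fire_sprd` rather than every column is an assumption of this sketch, not what the notebook cell does. Note that the spread count has to be taken before filling, because afterwards every row has a value.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the real file
csv_text = """inc_type,struc_type,fire_sprd,civilian_casualties,firefighter_casualties
111,1,4,0,2
118,,,0,0
111,1,5,4,1
"""

# Read the code-like columns as strings, as in the notebook
raw = pd.read_csv(io.StringIO(csv_text),
                  dtype={"inc_type": str, "struc_type": str})

# Count rows with a reported fire spread before any filling
spread_reported = int(raw["fire_sprd"].notna().sum())

# Total casualties across both casualty columns
total_casualties = int(raw["civilian_casualties"].sum()
                       + raw["firefighter_casualties"].sum())

# Fill only the numeric column; a dict limits fillna to the named columns
filled = raw.fillna({"fire_sprd": 0})

print(spread_reported, total_casualties)
```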
