# Grouping

Chapter 9 of Python for Data Analysis demonstrates a variety of methods for analyzing data through aggregation and grouping operations, and those are the focus of this session.  Our overall goal is Exploratory Data Analysis: looking at and probing the data to learn what patterns it contains.  This often surfaces problems in the data and reveals relationships worth exploring in more depth.


For this exercise, we will use 2010 US Census data for all census tracts in Kentucky.  The starting point used for this data was American FactFinder, which the Census Bureau retired in 2020 in favor of data.census.gov.  Its original address was:

https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml

The Census Bureau produces several products, the most famous of which is the decennial census, which as its name implies, is a true Census.  There are a limited number of data fields for the whole population (or at least those that the Census workers are able to reach, which is pretty good).  The Census has other products that are surveys, such as the American Community Survey (ACS) which asks more detailed questions on an annual basis from a 1% sample of households.  

In this case, we are working with file DP-1: Profile of General Population and Housing Characteristics: 2010, which is also a part of something called Summary File 1 (SF-1).  It is for the 100% sample.  A data dictionary for selected fields is below.  

In [1]:
import pandas as pd

# skip the second row, which contains descriptions
sf1 = pd.read_csv('data/DEC_10_SF1_combined.csv', skiprows=[1])

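# split 'Census Tract 9701, Adair County, Kentucky' into tract, county and state
# (splitting on ',' alone leaves a leading space on the county and state values)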
sf1['tract'] = sf1['GEO.display-label'].apply(lambda x: x.split(sep=',')[0])
sf1['county'] = sf1['GEO.display-label'].apply(lambda x: x.split(sep=',')[1])
sf1['state'] = sf1['GEO.display-label'].apply(lambda x: x.split(sep=',')[2])

# compute some derived fields
sf1['pct_rent'] = sf1['H4-D004'] / sf1['H4-D001'] * 100
sf1['pct_black'] = sf1['P3-D003'] / sf1['P3-D001'] * 100
sf1['pct_asian'] = sf1['P3-D005'] / sf1['P3-D001'] * 100
sf1['pct_white'] = sf1['P3-D002'] / sf1['P3-D001'] * 100
sf1['pct_hisp'] = sf1['P4-D003'] / sf1['P4-D001'] * 100
sf1['pct_vacant'] = sf1['H5-D001'] / sf1['H1-D001'] * 100
sf1[:5]

Unnamed: 0,GEO.id,GEO.id2,GEO.display-label,P1-D001,P3-D001,P3-D002,P3-D003,P3-D004,P3-D005,P3-D006,...,H5-D008,tract,county,state,pct_rent,pct_black,pct_asian,pct_white,pct_hisp,pct_vacant
0,1400000US21001970100,21001970100,"Census Tract 9701, Adair County, Kentucky",1727,1727,1683,14,1,0,1,...,60,Census Tract 9701,Adair County,Kentucky,17.411402,0.810654,0.0,97.452229,2.142444,16.794872
1,1400000US21001970200,21001970200,"Census Tract 9702, Adair County, Kentucky",1722,1722,1635,41,5,0,0,...,35,Census Tract 9702,Adair County,Kentucky,19.252874,2.380952,0.0,94.947735,2.61324,25.16129
2,1400000US21001970300,21001970300,"Census Tract 9703, Adair County, Kentucky",3016,3016,2944,6,11,8,0,...,106,Census Tract 9703,Adair County,Kentucky,20.521173,0.198939,0.265252,97.612732,1.856764,15.19337
3,1400000US21001970401,21001970401,"Census Tract 9704.01, Adair County, Kentucky",4070,4070,3716,237,1,16,1,...,109,Census Tract 9704.01,Adair County,Kentucky,37.215909,5.823096,0.39312,91.302211,1.547912,10.795743
4,1400000US21001970402,21001970402,"Census Tract 9704.02, Adair County, Kentucky",4261,4261,3950,180,16,16,3,...,70,Census Tract 9704.02,Adair County,Kentucky,30.911681,4.22436,0.375499,92.701244,1.900962,11.642542


## Groupby and Aggregation Operations

Groupby is a powerful method in pandas that follows the split-apply-combine approach to data.  As shown in Figure 9-1 in the context of a sum operation, the data is first split into groups that share the same key values.  Then an operation, in this case a sum, is applied to each group.  Then the results are combined.

The built-in aggregation methods available for groupby operations include:
* count
* sum
* mean
* median
* std, var
* min, max
* first, last

You can also apply your own functions as aggregation methods.

![Groupby Operations](groupby.png "Groupby")
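
The three steps in the figure can be spelled out by hand on a small example. The sketch below uses a made-up five-row table (the `toy` DataFrame and the `value_range` function are illustrative, not part of the Census data) and shows that the manual split, apply and combine steps give the same answer as a single groupby call. It also shows one way to pass a custom function as the aggregation.

```python
import pandas as pd

# Hypothetical toy data: three tracts in county A, two in county B.
toy = pd.DataFrame({
    'county': ['A', 'A', 'A', 'B', 'B'],
    'population': [100, 200, 300, 50, 150],
})

# Split: collect the rows that share each key value.
pieces = {key: group['population'] for key, group in toy.groupby('county')}

# Apply: run the same operation on every piece.
sums = {key: values.sum() for key, values in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(sums, name='population')

# groupby performs all three steps in one expression.
built_in = toy.groupby('county')['population'].sum()

# A custom aggregation is any function that reduces a Series to one value.
def value_range(s):
    return s.max() - s.min()

spread = toy.groupby('county')['population'].agg(value_range)
print(manual, built_in, spread, sep='\n')
```

Iterating over a groupby object yields (key, group) pairs, which is what makes the manual split possible.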

Let's apply this approach to computing total population in each county in our dataset.  We can do this in two steps to help explain what is happening.  First we create a groupby object, using the county names to gather all the census tracts in sf1 into groups that share the same county.

In [2]:
grouped = sf1['P1-D001'].groupby(sf1['county'])
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001EEF1FE69E8>

Now that we have this grouping object that represents the **split** part of the workflow in the figure above, we can **apply** operations and **combine** the results using methods like sum:

In [3]:
grouped.sum()

county
 Adair County       18656
 Allen County       19956
 Anderson County    21421
 Ballard County      8249
 Barren County      42173
                    ...  
 Wayne County       20813
 Webster County     13621
 Whitley County     35637
 Wolfe County        7355
 Woodford County    24939
Name: P1-D001, Length: 120, dtype: int64

In [4]:
grouped.count()

county
 Adair County        7
 Allen County        6
 Anderson County     5
 Ballard County      3
 Barren County      10
                    ..
 Wayne County        5
 Webster County      4
 Whitley County      8
 Wolfe County        2
 Woodford County     8
Name: P1-D001, Length: 120, dtype: int64

We might want to capture the result in a DataFrame if we want to use it in other processing, like merging the results to the original DataFrame.

In [5]:
county_pop = sf1['P1-D001'].groupby(sf1['county']).sum().to_frame(name='county_population')
county_pop

Unnamed: 0_level_0,county_population
county,Unnamed: 1_level_1
Adair County,18656
Allen County,19956
Anderson County,21421
Ballard County,8249
Barren County,42173
...,...
Wayne County,20813
Webster County,13621
Whitley County,35637
Wolfe County,7355


Here we merge the county total population with sf1 and create a new DataFrame.

In [6]:
sf2 = pd.merge(sf1,county_pop, left_on='county', right_index=True)
sf2[:5]

Unnamed: 0,GEO.id,GEO.id2,GEO.display-label,P1-D001,P3-D001,P3-D002,P3-D003,P3-D004,P3-D005,P3-D006,...,tract,county,state,pct_rent,pct_black,pct_asian,pct_white,pct_hisp,pct_vacant,county_population
0,1400000US21001970100,21001970100,"Census Tract 9701, Adair County, Kentucky",1727,1727,1683,14,1,0,1,...,Census Tract 9701,Adair County,Kentucky,17.411402,0.810654,0.0,97.452229,2.142444,16.794872,18656
1,1400000US21001970200,21001970200,"Census Tract 9702, Adair County, Kentucky",1722,1722,1635,41,5,0,0,...,Census Tract 9702,Adair County,Kentucky,19.252874,2.380952,0.0,94.947735,2.61324,25.16129,18656
2,1400000US21001970300,21001970300,"Census Tract 9703, Adair County, Kentucky",3016,3016,2944,6,11,8,0,...,Census Tract 9703,Adair County,Kentucky,20.521173,0.198939,0.265252,97.612732,1.856764,15.19337,18656
3,1400000US21001970401,21001970401,"Census Tract 9704.01, Adair County, Kentucky",4070,4070,3716,237,1,16,1,...,Census Tract 9704.01,Adair County,Kentucky,37.215909,5.823096,0.39312,91.302211,1.547912,10.795743,18656
4,1400000US21001970402,21001970402,"Census Tract 9704.02, Adair County, Kentucky",4261,4261,3950,180,16,16,3,...,Census Tract 9704.02,Adair County,Kentucky,30.911681,4.22436,0.375499,92.701244,1.900962,11.642542,18656
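
As a possible shortcut for the aggregate-then-merge pattern, `transform('sum')` returns a result that is already aligned to the original rows, so no merge is needed. The sketch below compares the two routes on made-up data; the `toy` frame and the `share_of_county` column are illustrative assumptions, and only the column name `P1-D001` is borrowed from the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for sf1.
toy = pd.DataFrame({
    'county': ['A', 'A', 'B', 'B', 'B'],
    'P1-D001': [100, 300, 50, 150, 300],
})

# Route 1: aggregate to one row per county, then merge back on the key.
county_pop = (toy.groupby('county')['P1-D001'].sum()
                 .to_frame(name='county_population'))
merged = pd.merge(toy, county_pop, left_on='county', right_index=True)

# Route 2: transform broadcasts each group's total back to its rows.
toy['county_population'] = toy.groupby('county')['P1-D001'].transform('sum')

# With the county total on every row, each tract's share follows directly.
toy['share_of_county'] = toy['P1-D001'] / toy['county_population'] * 100
print(toy)
```

The merge route is still the better fit when the county-level table is needed on its own, for example to join it to other county data.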


## Transforming Data with Groupby

In some cases you may want to apply a function to your data by group and get back a result with the same shape as the input, rather than one value per group.  An example is normalizing a column by its group mean.  Say we wanted to subtract each county's mean from the value for every census tract in that county.  We could write a function to subtract the mean from each value, and then use the transform operation to apply this to each group:

In [7]:
def demean(arr):
    return arr - arr.mean()

Now we can apply this transformation to columns in our DataFrame.  As examples, let's 'demean' the pct_black and pct_rent columns, subtracting the county-wide mean of these values from the tract-specific values, so that the result is transformed to have a mean of zero within each county.

To check the results, we print the means per county, then the original values for the first 5 rows, then the transformed results.  The transformed results we should be able to calculate by subtracting the appropriate county mean from the tract value.

In [8]:
normalized = sf1[['pct_black', 'pct_rent']].groupby(sf1['county']).transform(demean)
print(sf1[['pct_black', 'pct_rent']].groupby(sf1['county']).mean())
print(sf1[['county','pct_black', 'pct_rent']][:5])
print(normalized[:5])

                  pct_black   pct_rent
county                                
 Adair County      2.104384  22.886621
 Allen County      0.716792  22.312658
 Anderson County   1.419188  20.908477
 Ballard County    2.690326  21.350937
 Barren County     3.385357  29.982273
...                     ...        ...
 Wayne County      1.288175  25.683905
 Webster County    3.825768  22.732870
 Whitley County    0.519283  31.299399
 Wolfe County      0.131572  27.972188
 Woodford County   4.802279  29.615644

[120 rows x 2 columns]
          county  pct_black   pct_rent
0   Adair County   0.810654  17.411402
1   Adair County   2.380952  19.252874
2   Adair County   0.198939  20.521173
3   Adair County   5.823096  37.215909
4   Adair County   4.224360  30.911681
   pct_black   pct_rent
0  -1.293729  -5.475218
1   0.276569  -3.633747
2  -1.905445  -2.365448
3   3.718712  14.329289
4   2.119977   8.025060


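We can merge these transformed results onto the original DataFrame and check the means of the original variables and the transformed ones.  Because both frames contain pct_black and pct_rent, the merge appends the suffixes _x (original) and _y (transformed).  The means of the transformed columns should be zero up to floating-point rounding error.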

In [9]:
sf2 = pd.merge(sf1,normalized, left_index=True, right_index=True)

sf2.groupby('county')[['pct_black_x', 'pct_black_y', 'pct_rent_x', 'pct_rent_y']].mean()

Unnamed: 0_level_0,pct_black_x,pct_black_y,pct_rent_x,pct_rent_y
county,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Adair County,2.104384,-2.854859e-16,22.886621,-2.030122e-15
Allen County,0.716792,-5.551115e-17,22.312658,-3.256654e-15
Anderson County,1.419188,8.881784e-17,20.908477,7.105427e-16
Ballard County,2.690326,1.480297e-16,21.350937,0.000000e+00
Barren County,3.385357,0.000000e+00,29.982273,4.618528e-15
...,...,...,...,...
Wayne County,1.288175,-4.440892e-17,25.683905,7.105427e-16
Webster County,3.825768,0.000000e+00,22.732870,0.000000e+00
Whitley County,0.519283,-2.775558e-17,31.299399,-8.881784e-16
Wolfe County,0.131572,6.938894e-18,27.972188,0.000000e+00
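
The same check can be written without a merge by assigning the transformed values to a new column, which also avoids the _x and _y suffixes. This is a minimal sketch on made-up data (the `toy` frame and the `pct_rent_demeaned` column name are assumptions for illustration). It uses `np.allclose` to test for zero within floating-point tolerance instead of reading the exponents by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three tracts in county A, two in county B.
toy = pd.DataFrame({
    'county': ['A', 'A', 'A', 'B', 'B'],
    'pct_rent': [10.0, 20.0, 30.0, 40.0, 60.0],
})

def demean(arr):
    return arr - arr.mean()

# transform keeps the original row index, so direct assignment lines up.
toy['pct_rent_demeaned'] = toy.groupby('county')['pct_rent'].transform(demean)

# Every county mean of the demeaned column should be numerically zero.
group_means = toy.groupby('county')['pct_rent_demeaned'].mean()
all_close_to_zero = np.allclose(group_means, 0.0)
print(toy)
print(all_close_to_zero)
```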


Apply is a method we have learned previously, which allows us to apply a function along the rows or columns of a DataFrame.  We can also combine apply with groupby to apply a function to each group separately.  For example, the function 'top' sorts a DataFrame by one column and selects the top n rows from it.  We provide defaults for the number of rows and for the column used for the selection:

In [10]:
def top(df, n=2, column='pct_rent'):
    return df.sort_values(by=column, ascending=False).head(n)

Calling this on the full dataset with n=3 (overriding the default of 2) and column='pct_rent' (the same as the default, passed explicitly here), we get the top 3 tracts in the state in terms of percentage rental.

In [11]:
top(sf1, n=3, column='pct_rent')

Unnamed: 0,GEO.id,GEO.id2,GEO.display-label,P1-D001,P3-D001,P3-D002,P3-D003,P3-D004,P3-D005,P3-D006,...,H5-D008,tract,county,state,pct_rent,pct_black,pct_asian,pct_white,pct_hisp,pct_vacant
255,1400000US21061980100,21061980100,"Census Tract 9801, Edmonson County, Kentucky",8,8,8,0,0,0,0,...,0,Census Tract 9801,Edmonson County,Kentucky,100.0,0.0,0.0,100.0,0.0,0.0
508,1400000US21111003000,21111003000,"Census Tract 30, Jefferson County, Kentucky",3565,3565,258,3187,15,1,16,...,7,Census Tract 30,Jefferson County,Kentucky,99.768697,89.396914,0.02805,7.237027,1.290323,7.883523
200,1400000US21047201501,21047201501,"Census Tract 2015.01, Christian County, Kentucky",5315,5315,3580,982,64,47,81,...,8,Census Tract 2015.01,Christian County,Kentucky,99.698568,18.476011,0.88429,67.356538,14.995296,7.072829


Below we apply this with groupby and use the defaults for n and column.  The function is applied within each county and the results are concatenated, producing the top 2 tracts by pct_rent for each county in the state.

In [12]:
sf1.groupby('county').apply(top)

Unnamed: 0_level_0,Unnamed: 1_level_0,GEO.id,GEO.id2,GEO.display-label,P1-D001,P3-D001,P3-D002,P3-D003,P3-D004,P3-D005,P3-D006,...,H5-D008,tract,county,state,pct_rent,pct_black,pct_asian,pct_white,pct_hisp,pct_vacant
county,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Adair County,3,1400000US21001970401,21001970401,"Census Tract 9704.01, Adair County, Kentucky",4070,4070,3716,237,1,16,1,...,109,Census Tract 9704.01,Adair County,Kentucky,37.215909,5.823096,0.393120,91.302211,1.547912,10.795743
Adair County,4,1400000US21001970402,21001970402,"Census Tract 9704.02, Adair County, Kentucky",4261,4261,3950,180,16,16,3,...,70,Census Tract 9704.02,Adair County,Kentucky,30.911681,4.224360,0.375499,92.701244,1.900962,11.642542
Allen County,9,1400000US21003920300,21003920300,"Census Tract 9203, Allen County, Kentucky",4685,4685,4473,83,19,12,0,...,86,Census Tract 9203,Allen County,Kentucky,34.676186,1.771612,0.256137,95.474920,1.173959,9.254975
Allen County,10,1400000US21003920400,21003920400,"Census Tract 9204, Allen County, Kentucky",4492,4492,4363,42,12,2,0,...,103,Census Tract 9204,Allen County,Kentucky,27.267668,0.934996,0.044524,97.128228,1.892253,14.346997
Anderson County,14,1400000US21005950201,21005950201,"Census Tract 9502.01, Anderson County, Kentucky",5372,5372,5120,81,12,50,2,...,57,Census Tract 9502.01,Anderson County,Kentucky,35.815441,1.507818,0.930752,95.309010,1.470588,7.598143
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Whitley County,1098,1400000US21235920200,21235920200,"Census Tract 9202, Whitley County, Kentucky",2808,2808,2728,6,6,29,0,...,70,Census Tract 9202,Whitley County,Kentucky,42.869342,0.213675,1.032764,97.150997,1.068376,11.085801
Wolfe County,1106,1400000US21237930200,21237930200,"Census Tract 9302, Wolfe County, Kentucky",4082,4082,4032,7,7,2,0,...,147,Census Tract 9302,Wolfe County,Kentucky,33.196961,0.171485,0.048996,98.775110,0.636943,16.577279
Wolfe County,1105,1400000US21237930100,21237930100,"Census Tract 9301, Wolfe County, Kentucky",3273,3273,3233,3,14,1,0,...,136,Census Tract 9301,Wolfe County,Kentucky,22.747415,0.091659,0.030553,98.777880,0.458295,15.848353
Woodford County,1107,1400000US21239050103,21239050103,"Census Tract 501.03, Woodford County, Kentucky",3144,3144,2412,406,10,8,0,...,32,Census Tract 501.03,Woodford County,Kentucky,57.563369,12.913486,0.254453,76.717557,13.104326,8.320840


Here we pass arguments to the function to set n and the column to select the top value from.

In [13]:
sf1.groupby('county').apply(top, n=1, column='P1-D001')

Unnamed: 0_level_0,Unnamed: 1_level_0,GEO.id,GEO.id2,GEO.display-label,P1-D001,P3-D001,P3-D002,P3-D003,P3-D004,P3-D005,P3-D006,...,H5-D008,tract,county,state,pct_rent,pct_black,pct_asian,pct_white,pct_hisp,pct_vacant
county,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Adair County,4,1400000US21001970402,21001970402,"Census Tract 9704.02, Adair County, Kentucky",4261,4261,3950,180,16,16,3,...,70,Census Tract 9704.02,Adair County,Kentucky,30.911681,4.224360,0.375499,92.701244,1.900962,11.642542
Allen County,9,1400000US21003920300,21003920300,"Census Tract 9203, Allen County, Kentucky",4685,4685,4473,83,19,12,0,...,86,Census Tract 9203,Allen County,Kentucky,34.676186,1.771612,0.256137,95.474920,1.173959,9.254975
Anderson County,13,1400000US21005950100,21005950100,"Census Tract 9501, Anderson County, Kentucky",8164,8164,7647,292,5,38,3,...,103,Census Tract 9501,Anderson County,Kentucky,26.186684,3.576678,0.465458,93.667320,1.641352,8.081991
Ballard County,18,1400000US21007950100,21007950100,"Census Tract 9501, Ballard County, Kentucky",4259,4259,4024,145,8,11,0,...,132,Census Tract 9501,Ballard County,Kentucky,22.140011,3.404555,0.258277,94.482273,0.986147,11.664153
Barren County,26,1400000US21009950600,21009950600,"Census Tract 9506, Barren County, Kentucky",5937,5937,5331,299,11,40,10,...,95,Census Tract 9506,Barren County,Kentucky,44.142343,5.036214,0.673741,89.792825,2.779181,11.500354
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wayne County,1089,1400000US21231920200,21231920200,"Census Tract 9202, Wayne County, Kentucky",6121,6121,5709,150,24,20,2,...,85,Census Tract 9202,Wayne County,Kentucky,37.323391,2.450580,0.326744,93.269074,4.296684,9.452736
Webster County,1093,1400000US21233960100,21233960100,"Census Tract 9601, Webster County, Kentucky",4584,4584,4203,22,5,18,33,...,101,Census Tract 9601,Webster County,Kentucky,22.176227,0.479930,0.392670,91.688482,9.642234,9.475375
Whitley County,1099,1400000US21235920300,21235920300,"Census Tract 9203, Whitley County, Kentucky",6189,6189,6078,8,28,17,2,...,108,Census Tract 9203,Whitley County,Kentucky,23.053892,0.129262,0.274681,98.206495,0.727096,9.590101
Wolfe County,1106,1400000US21237930200,21237930200,"Census Tract 9302, Wolfe County, Kentucky",4082,4082,4032,7,7,2,0,...,147,Census Tract 9302,Wolfe County,Kentucky,33.196961,0.171485,0.048996,98.775110,0.636943,16.577279
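
A top-n-per-group selection can also be written without a custom function, by sorting once and then taking the head of each group. The sketch below is one possible alternative on made-up data (the `toy` frame and its values are illustrative). Unlike groupby with apply, it keeps the original flat row index rather than building a (county, row) MultiIndex.

```python
import pandas as pd

# Hypothetical toy data: three tracts in each of two counties.
toy = pd.DataFrame({
    'county': ['A', 'A', 'A', 'B', 'B', 'B'],
    'tract': ['a1', 'a2', 'a3', 'b1', 'b2', 'b3'],
    'pct_rent': [15.0, 40.0, 25.0, 60.0, 10.0, 35.0],
})

top2 = (
    toy.sort_values('pct_rent', ascending=False)  # highest values first
       .groupby('county')
       .head(2)                                   # first 2 rows of each group
       .sort_values(['county', 'pct_rent'], ascending=[True, False])
)
print(top2)
```

The final sort is only there to present the rows county by county, since `head` returns them in the order of the sorted frame.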


## Exploring Correlations in the Data

Pandas provides simple ways of computing correlation coefficients among the columns in your DataFrame.  If you use corr() on a full DataFrame, it produces a correlation table covering every pair of numeric columns, which is hard to navigate and mostly made up of pairs you are not interested in.

In [14]:
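# sf1 contains string columns; pandas 2.0+ raises an error unless they are excluded
# (the numeric_only argument requires pandas 1.5 or later)
sf1.corr(numeric_only=True)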

Unnamed: 0,GEO.id2,P1-D001,P3-D001,P3-D002,P3-D003,P3-D004,P3-D005,P3-D006,P3-D007,P3-D008,...,H5-D005,H5-D006,H5-D007,H5-D008,pct_rent,pct_black,pct_asian,pct_white,pct_hisp,pct_vacant
GEO.id2,1.0,0.049408,0.049408,0.063873,-0.024147,0.00196,-0.048639,-0.031018,-0.020234,-0.04501,...,0.07767,0.076772,0.010594,0.119563,-0.06182,-0.03333,-0.05869,0.045403,-0.022668,0.043711
P1-D001,0.049408,1.0,1.0,0.929708,0.150379,0.479878,0.324468,0.192667,0.24064,0.581058,...,0.308361,0.012925,-0.023479,0.202435,-0.06252,-0.070055,0.136213,0.043546,0.058258,-0.300937
P3-D001,0.049408,1.0,1.0,0.929708,0.150379,0.479878,0.324468,0.192667,0.24064,0.581058,...,0.308361,0.012925,-0.023479,0.202435,-0.06252,-0.070055,0.136213,0.043546,0.058258,-0.300937
P3-D002,0.063873,0.929708,0.929708,1.0,-0.213918,0.363714,0.209791,0.095243,0.057796,0.351332,...,0.376236,0.064984,0.014118,0.236156,-0.257852,-0.391277,0.031737,0.382284,-0.115163,-0.289514
P3-D003,-0.024147,0.150379,0.150379,-0.213918,1.0,0.252784,0.158195,0.171271,0.325919,0.497941,...,-0.145718,-0.120118,-0.089171,-0.003757,0.464467,0.899663,0.132016,-0.899986,0.302933,0.011119
P3-D004,0.00196,0.479878,0.479878,0.363714,0.252784,1.0,0.131467,0.463373,0.299747,0.565977,...,0.057503,0.000613,-0.035805,0.090925,0.184507,0.114453,0.060946,-0.170683,0.299351,-0.015777
P3-D005,-0.048639,0.324468,0.324468,0.209791,0.158195,0.131467,1.0,0.176404,0.160199,0.399051,...,-0.082495,-0.075145,-0.105767,-0.302912,0.163496,0.054982,0.923845,-0.172315,0.17843,-0.222993
P3-D006,-0.031018,0.192667,0.192667,0.095243,0.171271,0.463373,0.176404,1.0,0.281407,0.523912,...,-0.06952,-0.049845,-0.029479,-0.106864,0.255751,0.099153,0.155861,-0.174929,0.339602,-0.039314
P3-D007,-0.020234,0.24064,0.24064,0.057796,0.325919,0.299747,0.160199,0.281407,1.0,0.504998,...,-0.108742,-0.086525,-0.01721,-0.124005,0.305578,0.192686,0.134676,-0.334419,0.843366,-0.086896
P3-D008,-0.04501,0.581058,0.581058,0.351332,0.497941,0.565977,0.399051,0.523912,0.504998,1.0,...,-0.029011,-0.118407,-0.08846,-0.130512,0.425299,0.307009,0.306124,-0.413328,0.466302,-0.216524


It is easy to compute correlation coefficients for a subset of columns.

In [15]:
sf1[['pct_rent', 'pct_vacant']].corr()

Unnamed: 0,pct_rent,pct_vacant
pct_rent,1.0,0.085586
pct_vacant,0.085586,1.0


And this method can be combined with groupby to compute correlation tables by group.

In [16]:
sf1.groupby('county')[['pct_rent', 'pct_vacant']].corr()

Unnamed: 0_level_0,Unnamed: 1_level_0,pct_rent,pct_vacant
county,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Adair County,pct_rent,1.000000,-0.613912
Adair County,pct_vacant,-0.613912,1.000000
Allen County,pct_rent,1.000000,-0.620421
Allen County,pct_vacant,-0.620421,1.000000
Anderson County,pct_rent,1.000000,-0.534183
...,...,...,...
Whitley County,pct_vacant,0.483214,1.000000
Wolfe County,pct_rent,1.000000,1.000000
Wolfe County,pct_vacant,1.000000,1.000000
Woodford County,pct_rent,1.000000,0.154547
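
When only one pair of columns is of interest, a single coefficient per group is easier to read than a stacked 2x2 table. The sketch below is one way to get that, with the group size reported next to it. The data are made up (the `toy` frame and the `summary` table are illustrative). The group size matters because a group with only two rows always produces a correlation of exactly 1 or -1, which is why Wolfe County, with its two tracts, shows 1.000000 above.

```python
import pandas as pd

# Hypothetical toy data: four tracts in county A, only two in county B.
toy = pd.DataFrame({
    'county': ['A', 'A', 'A', 'A', 'B', 'B'],
    'pct_rent': [10.0, 20.0, 30.0, 40.0, 22.0, 33.0],
    'pct_vacant': [12.0, 9.0, 11.0, 6.0, 15.0, 16.0],
})

cols = ['pct_rent', 'pct_vacant']

# One Pearson coefficient per county.
r = toy.groupby('county')[cols].apply(
    lambda g: g['pct_rent'].corr(g['pct_vacant'])
)

# Number of tracts behind each coefficient.
n_tracts = toy.groupby('county').size()

summary = pd.DataFrame({'r': r, 'n_tracts': n_tracts})
print(summary)
```

Coefficients based on very few tracts are best treated as uninformative rather than as strong relationships.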


## Your turn to practice:

Count the number of census tracts per county.

Calculate total households per county.

Calculate percent renters by county. (Careful not to calculate the mean percent rental across tracts in a county)

Calculate percent vacant by county.

Calculate mean, min and max vacancy rate (at the tract level) by county.

Calculate the 90th percentile of vacancy rate (at the tract level) by county.

### Count the number of census tracts per county.

In [17]:
sf1['tract'].groupby(sf1['county']).count()

county
 Adair County        7
 Allen County        6
 Anderson County     5
 Ballard County      3
 Barren County      10
                    ..
 Wayne County        5
 Webster County      4
 Whitley County      8
 Wolfe County        2
 Woodford County     8
Name: tract, Length: 120, dtype: int64

### Calculate total households per county.

In [18]:
sf1['H1-D001'].groupby(sf1['county']).sum()

county
 Adair County        8568
 Allen County        9307
 Anderson County     9127
 Ballard County      3885
 Barren County      19188
                    ...  
 Wayne County       10942
 Webster County      5936
 Whitley County     15166
 Wolfe County        3660
 Woodford County    10711
Name: H1-D001, Length: 120, dtype: int64

### Calculate percent renters by county. (Careful not to calculate the mean percent rental across tracts in a county)

In [19]:
# sum renter-occupied units (H4-D004) and all occupied units (H4-D001) by county
prent = sf1['H4-D004'].groupby(sf1['county']).sum().to_frame('renters')
hhoccupied = sf1['H4-D001'].groupby(sf1['county']).sum().to_frame('occupiedHH')
hhlds = pd.merge(prent, hhoccupied, how='inner', left_index=True, right_index=True)

# divide the county totals; despite its name, this column is the county-wide
# percent rental (a ratio of sums), not a mean of the tract percentages
hhlds['mean_percent_rental'] = hhlds['renters'] / hhlds['occupiedHH'] * 100
hhlds

Unnamed: 0_level_0,renters,occupiedHH,mean_percent_rental
county,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Adair County,1860,7285,25.531915
Allen County,1898,7848,24.184506
Anderson County,2067,8369,24.698291
Ballard County,727,3397,21.401236
Barren County,5449,16999,32.054827
...,...,...,...
Wayne County,2386,8646,27.596576
Webster County,1245,5272,23.615326
Whitley County,4121,13575,30.357274
Wolfe County,876,3065,28.580750


In [20]:
hhoccupied

Unnamed: 0_level_0,occupiedHH
county,Unnamed: 1_level_1
Adair County,7285
Allen County,7848
Anderson County,8369
Ballard County,3397
Barren County,16999
...,...
Wayne County,8646
Webster County,5272
Whitley County,13575
Wolfe County,3065
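
The warning in the exercise is easiest to see with numbers. The sketch below uses a made-up county with one small, mostly rented tract and one large, mostly owned tract (all names and values are illustrative). Averaging the tract percentages gives every tract equal weight, while dividing the county totals weights each tract by its number of occupied units.

```python
import pandas as pd

# Hypothetical county with two very differently sized tracts.
toy = pd.DataFrame({
    'county': ['A', 'A'],
    'renters': [90, 100],
    'occupied': [100, 1000],
})
toy['pct_rent'] = toy['renters'] / toy['occupied'] * 100   # 90% and 10%

# Mean of tract percentages: (90 + 10) / 2.
mean_of_tract_pcts = toy.groupby('county')['pct_rent'].mean()

# County percentage: total renters over total occupied units, 190 / 1100.
totals = toy.groupby('county')[['renters', 'occupied']].sum()
county_pct = totals['renters'] / totals['occupied'] * 100

print(mean_of_tract_pcts)
print(county_pct)
```

Here the mean of the tract percentages is 50, while the county as a whole is only about 17 percent renter occupied.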


### Calculate percent vacant by county.

In [21]:
# sum vacant units (H3-D003) and total housing units (H3-D001) by county
vacant = sf1['H3-D003'].groupby(sf1['county']).sum().to_frame('vacanthh')
total = sf1['H3-D001'].groupby(sf1['county']).sum().to_frame('totalhh')
vdf = pd.merge(vacant, total, how='inner', left_index=True, right_index=True)

# county-wide percent vacant, computed from the county totals
vdf['percent_vacant'] = vdf['vacanthh'] / vdf['totalhh'] * 100
vdf

Unnamed: 0_level_0,vacanthh,totalhh,percent_vacant
county,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Adair County,1283,8568,14.974323
Allen County,1459,9307,15.676373
Anderson County,758,9127,8.305029
Ballard County,488,3885,12.561133
Barren County,2189,19188,11.408172
...,...,...,...
Wayne County,2296,10942,20.983367
Webster County,664,5936,11.185984
Whitley County,1591,15166,10.490571
Wolfe County,595,3660,16.256831


In [22]:
vdf['county_name'] = vdf.index
vdf

Unnamed: 0_level_0,vacanthh,totalhh,percent_vacant,county_name
county,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Adair County,1283,8568,14.974323,Adair County
Allen County,1459,9307,15.676373,Allen County
Anderson County,758,9127,8.305029,Anderson County
Ballard County,488,3885,12.561133,Ballard County
Barren County,2189,19188,11.408172,Barren County
...,...,...,...,...
Wayne County,2296,10942,20.983367,Wayne County
Webster County,664,5936,11.185984,Webster County
Whitley County,1591,15166,10.490571,Whitley County
Wolfe County,595,3660,16.256831,Wolfe County


### Calculate mean, min and max vacancy rate (at the tract level) by county

In [23]:
import statistics
def minimum(df, column = 'vacancy_rate'):
    return min(df[column])

def maximum(df, column = 'vacancy_rate'):
    return max(df[column])

def meanval(df, column = 'vacancy_rate'):
    return statistics.mean(df[column])

In [24]:
vr = sf1.groupby('county').apply(lambda x: x['H3-D003']/x['H3-D001']).to_frame('vacancy_rate').reset_index()
vr

Unnamed: 0,county,level_1,vacancy_rate
0,Adair County,0,0.167949
1,Adair County,1,0.251613
2,Adair County,2,0.151934
3,Adair County,3,0.107957
4,Adair County,4,0.116425
...,...,...,...
1110,Woodford County,1110,0.077959
1111,Woodford County,1111,0.059516
1112,Woodford County,1112,0.098894
1113,Woodford County,1113,0.149028


In [25]:
vrtmin = vr.groupby('county').apply(minimum).to_frame('min_vacancy_rate')
vrtmin

Unnamed: 0_level_0,min_vacancy_rate
county,Unnamed: 1_level_1
Adair County,0.107957
Allen County,0.092550
Anderson County,0.054923
Ballard County,0.116084
Barren County,0.074578
...,...
Wayne County,0.094527
Webster County,0.094754
Whitley County,0.083780
Wolfe County,0.158484


In [26]:
vrtmax = vr.groupby('county').apply(maximum).to_frame('max_vacancy_rate')
vrtmax

Unnamed: 0_level_0,max_vacancy_rate
county,Unnamed: 1_level_1
Adair County,0.251613
Allen County,0.275591
Anderson County,0.134520
Ballard County,0.146486
Barren County,0.229752
...,...
Wayne County,0.412289
Webster County,0.135246
Whitley County,0.139728
Wolfe County,0.165773


In [27]:
vrtmean = vr.groupby('county').apply(meanval).to_frame('mean_vacancy_rate')
vrtmean

Unnamed: 0_level_0,mean_vacancy_rate
county,Unnamed: 1_level_1
Adair County,0.158683
Allen County,0.158085
Anderson County,0.089273
Ballard County,0.126404
Barren County,0.114448
...,...
Wayne County,0.184769
Webster County,0.110198
Whitley County,0.109061
Wolfe County,0.162128
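
The three statistics can also be computed in a single call by passing a list of built-in aggregation names to `agg`, which returns one column per statistic. This is a sketch on made-up data (the `toy` frame is illustrative, standing in for the `vr` table above).

```python
import pandas as pd

# Hypothetical tract-level vacancy rates for two counties.
toy = pd.DataFrame({
    'county': ['A', 'A', 'A', 'B', 'B'],
    'vacancy_rate': [0.10, 0.20, 0.30, 0.05, 0.15],
})

# One row per county, one column per statistic.
stats = toy.groupby('county')['vacancy_rate'].agg(['mean', 'min', 'max'])
print(stats)
```

Because the result is already one DataFrame, there is nothing to merge afterwards.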


### Calculate the 90th percentile of vacancy rate (at the tract level) by county.

In [28]:
import numpy as np

def percentile90th(df, column = 'vacancy_rate'):
    return np.percentile(df[column], 90)

In [29]:
vrt90p = vr.groupby('county').apply(percentile90th).to_frame('90thpercentile_vacancy_rate')
vrt90p

Unnamed: 0_level_0,90thpercentile_vacancy_rate
county,Unnamed: 1_level_1
Adair County,0.206497
Allen County,0.223243
Anderson County,0.120760
Ballard County,0.140517
Barren County,0.159855
...,...
Wayne County,0.314346
Webster County,0.126642
Whitley County,0.127064
Wolfe County,0.165044
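
Grouped Series also have a built-in `quantile` method, which can stand in for the custom percentile function. The sketch below, on made-up data (the `toy` frame is illustrative), compares it with `np.percentile`. Both use linear interpolation by default, so the results should agree.

```python
import numpy as np
import pandas as pd

# Hypothetical tract-level vacancy rates: five tracts in A, three in B.
toy = pd.DataFrame({
    'county': ['A'] * 5 + ['B'] * 3,
    'vacancy_rate': [0.10, 0.20, 0.30, 0.40, 0.50, 0.05, 0.10, 0.15],
})

# Built-in grouped quantile; 0.9 is the 90th percentile.
p90 = toy.groupby('county')['vacancy_rate'].quantile(0.9)

# The same statistic through np.percentile, for comparison.
p90_numpy = toy.groupby('county')['vacancy_rate'].apply(
    lambda s: np.percentile(s, 90)
)
print(p90)
```

Note that `quantile` takes a fraction between 0 and 1, while `np.percentile` takes a value between 0 and 100.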


## Some review:

I've included in the data folder a shapefile with the Census geographies.  Can you use it to: 

1. Calculate the population density of each Census tract and county in Kentucky. 
2. Make a choropleth showing the population density.  

In [36]:
import geopandas as gpd
shp = gpd.read_file('data/gz_2010_21_140_00_500k.shp')
shp

Unnamed: 0,GEO_ID,STATE,COUNTY,TRACT,NAME,LSAD,CENSUSAREA,geometry
0,1400000US21035010700,21,035,010700,107,Tract,74.333,"POLYGON ((-88.43011 36.50090, -88.45016 36.501..."
1,1400000US21037050100,21,037,050100,501,Tract,0.205,"POLYGON ((-84.50608 39.09508, -84.50206 39.096..."
2,1400000US21037050400,21,037,050400,504,Tract,0.186,"POLYGON ((-84.48395 39.09684, -84.48151 39.095..."
3,1400000US21037052100,21,037,052100,521,Tract,0.433,"POLYGON ((-84.49355 39.10272, -84.48774 39.110..."
4,1400000US21037052200,21,037,052200,522,Tract,0.452,"POLYGON ((-84.46286 39.09985, -84.46695 39.097..."
...,...,...,...,...,...,...,...,...
1110,1400000US21239050105,21,239,050105,501.05,Tract,1.928,"POLYGON ((-84.74664 38.03602, -84.74513 38.037..."
1111,1400000US21239050106,21,239,050106,501.06,Tract,5.583,"POLYGON ((-84.72462 38.03148, -84.72424 38.035..."
1112,1400000US21239050107,21,239,050107,501.07,Tract,4.958,"POLYGON ((-84.71406 38.05093, -84.70282 38.049..."
1113,1400000US21239050200,21,239,050200,502,Tract,74.749,"POLYGON ((-84.72344 38.02521, -84.72288 38.022..."


In [47]:
census = pd.merge(sf1, shp, how ='inner', left_on = 'GEO.id', right_on = 'GEO_ID')

In [55]:
# take a copy so that new columns can be added later without a SettingWithCopyWarning
census = census[['GEO.id','tract','county','P1-D001','CENSUSAREA']].copy()
census

Unnamed: 0,GEO.id,tract,county,P1-D001,CENSUSAREA
0,1400000US21001970100,Census Tract 9701,Adair County,1727,67.730
1,1400000US21001970200,Census Tract 9702,Adair County,1722,40.270
2,1400000US21001970300,Census Tract 9703,Adair County,3016,73.943
3,1400000US21001970401,Census Tract 9704.01,Adair County,4070,40.684
4,1400000US21001970402,Census Tract 9704.02,Adair County,4261,25.877
...,...,...,...,...,...
1110,1400000US21239050106,Census Tract 501.06,Woodford County,3261,5.583
1111,1400000US21239050107,Census Tract 501.07,Woodford County,3757,4.958
1112,1400000US21239050200,Census Tract 502,Woodford County,3533,74.749
1113,1400000US21239050300,Census Tract 503,Woodford County,1899,58.799


In [56]:
# population density of each tract: people per unit of CENSUSAREA
# (CENSUSAREA is assumed here to be in square miles)
census['tractdensity'] = census['P1-D001'] / census['CENSUSAREA']
census

In [72]:
# county totals of population and area
population = census['P1-D001'].groupby(census['county']).sum().to_frame('pop')
area = census['CENSUSAREA'].groupby(census['county']).sum().to_frame('area')
dfk = pd.merge(population, area, left_index=True, right_index=True)

# county density is total population over total area
dfk['popdensity'] = dfk['pop'] / dfk['area']
dfk

Unnamed: 0_level_0,pop,area,popdensity
county,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Adair County,18656,405.283,46.032032
Allen County,19956,344.337,57.954852
Anderson County,21421,201.832,106.132823
Ballard County,8249,246.659,33.442931
Barren County,42173,487.540,86.501620
...,...,...,...
Wayne County,20813,458.172,45.426172
Webster County,13621,331.943,41.034153
Whitley County,35637,437.830,81.394605
Wolfe County,7355,222.170,33.105280
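
The county totals can also be gathered in one step with named aggregation, where each keyword names an output column and gives the input column and the function to apply. The sketch below uses made-up data (the `toy` frame and the `mean_tract_density` column are illustrative, and only the column names `P1-D001` and `CENSUSAREA` are borrowed from the notebook). It also computes the mean of the tract densities, to show that this is not the same quantity as the county density.

```python
import pandas as pd

# Hypothetical tracts: county A has one small dense tract and one large sparse one.
toy = pd.DataFrame({
    'county': ['A', 'A', 'B'],
    'P1-D001': [1000, 3000, 500],
    'CENSUSAREA': [1.0, 99.0, 25.0],
})
toy['tractdensity'] = toy['P1-D001'] / toy['CENSUSAREA']

# Named aggregation: output_column=(input_column, function).
county = toy.groupby('county').agg(
    pop=('P1-D001', 'sum'),
    area=('CENSUSAREA', 'sum'),
    mean_tract_density=('tractdensity', 'mean'),
)

# County density comes from the totals, not from averaging tract densities.
county['popdensity'] = county['pop'] / county['area']
print(county)
```

For county A the density from the totals is 40 people per unit area, while the mean of the two tract densities is above 500, because the small dense tract counts as much as the large sparse one.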
