To start sampling America by county, we need to generate some data points (latitude and longitude pairs) to use. In this notebook, we generate a list of latitudes and longitudes by randomly sampling each county with the Latin Square technique. We are given a latitude and longitude in the center of each county and the land area of each county from the Gazetteer dataset, which you can download <a href='http://people.bu.edu/balawson/csv/congress.csv'>here</a>. I use pyDOE to implement the sampling, and you can read more about Python Design Of Experiments [here](http://pythonhosted.org/pyDOE/randomized.html). I'll be using Pandas to manage the data.

In [13]:
import pandas as pd
import random
from pyDOE import lhs
import math

Here's the link to Wikipedia's page on [Latin Hypercube sampling](https://en.wikipedia.org/wiki/Latin_hypercube_sampling). In an oversimplified way, I like to think of it as "Sudoku": every row and column has exactly one sample (orthogonal sampling additionally spreads the samples evenly across the sub-blocks). I would like to use orthogonal Latin hypercube sampling, but this isn't implemented (yet) in the pyDOE library. This would be a great open source project if you are looking for one. We will be using general Latin hypercube sampling to randomly sample points in a county. The output of this algorithm will be values in the range [0, 1] that we can use as ratios away from the center of the county.
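To make the "one sample per row and column" idea concrete, here is a minimal sketch of a centered Latin hypercube written with only the standard library. It is an illustration of the idea, not pyDOE's implementation, and the name `centered_lhs` is made up for this example.

```python
import random

def centered_lhs(n_dims, n_samples, rng=random):
    """Minimal centered Latin hypercube: one point per interval in every dimension."""
    # Midpoints of n_samples equal-width intervals on [0, 1]
    centers = [(i + 0.5) / n_samples for i in range(n_samples)]
    columns = []
    for _ in range(n_dims):
        column = centers[:]   # each dimension uses every interval exactly once
        rng.shuffle(column)   # independent random order per dimension
        columns.append(column)
    # Transpose the columns into a list of points
    return [list(point) for point in zip(*columns)]

points = centered_lhs(2, 5)
for p in points:
    print(p)
```

Sorting either coordinate of the result gives back the same set of interval midpoints, which is the Sudoku-like property described above.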

In [14]:
#original data source: https://www.census.gov/geo/maps-data/data/gazetteer2014.html
df = pd.read_csv("csv/congress.csv", index_col=0) #DataFrame.from_csv was removed in newer pandas
states = df.index
df.index = df.GEOID #because current index is just state abbreviation 
df['states'] = states
#grab the first row and look at it - what info do we want to grab? ALAND, INTPTLAT, and INTPTLONG
df.iloc[0:1]

| GEOID (index) | GEOID | ANSICODE | NAME | ALAND | AWATER | ALAND_SQMI | AWATER_SQMI | INTPTLAT | INTPTLONG | states |
|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | 1001 | 161526 | Autauga County | 1539584444 | 25773561 | 594.437 | 9.951 | 32.53216 | -86.646469 | AL |


In [15]:
a = lhs(2, samples=30, criterion='center')
a

array([[ 0.51666667,  0.38333333],
       [ 0.55      ,  0.81666667],
       [ 0.91666667,  0.25      ],
       [ 0.75      ,  0.15      ],
       [ 0.38333333,  0.21666667],
       [ 0.21666667,  0.68333333],
       [ 0.08333333,  0.11666667],
       [ 0.28333333,  0.01666667],
       [ 0.65      ,  0.91666667],
       [ 0.48333333,  0.88333333],
       [ 0.98333333,  0.51666667],
       [ 0.95      ,  0.05      ],
       [ 0.31666667,  0.35      ],
       [ 0.71666667,  0.71666667],
       [ 0.18333333,  0.78333333],
       [ 0.81666667,  0.61666667],
       [ 0.85      ,  0.48333333],
       [ 0.45      ,  0.75      ],
       [ 0.41666667,  0.65      ],
       [ 0.25      ,  0.55      ],
       [ 0.01666667,  0.45      ],
       [ 0.15      ,  0.85      ],
       [ 0.68333333,  0.95      ],
       [ 0.35      ,  0.18333333],
       [ 0.78333333,  0.28333333],
       [ 0.61666667,  0.58333333],
       [ 0.05      ,  0.08333333],
       [ 0.11666667,  0.31666667],
       [ 0.88333333,

In [16]:
#renormalize values between -1 and 1 in order to plot on all four quadrants
b = (a-0.5)*2
b

array([[ 0.03333333, -0.23333333],
       [ 0.1       ,  0.63333333],
       [ 0.83333333, -0.5       ],
       [ 0.5       , -0.7       ],
       [-0.23333333, -0.56666667],
       [-0.56666667,  0.36666667],
       [-0.83333333, -0.76666667],
       [-0.43333333, -0.96666667],
       [ 0.3       ,  0.83333333],
       [-0.03333333,  0.76666667],
       [ 0.96666667,  0.03333333],
       [ 0.9       , -0.9       ],
       [-0.36666667, -0.3       ],
       [ 0.43333333,  0.43333333],
       [-0.63333333,  0.56666667],
       [ 0.63333333,  0.23333333],
       [ 0.7       , -0.03333333],
       [-0.1       ,  0.5       ],
       [-0.16666667,  0.3       ],
       [-0.5       ,  0.1       ],
       [-0.96666667, -0.1       ],
       [-0.7       ,  0.7       ],
       [ 0.36666667,  0.9       ],
       [-0.3       , -0.63333333],
       [ 0.56666667, -0.43333333],
       [ 0.23333333,  0.16666667],
       [-0.9       , -0.83333333],
       [-0.76666667, -0.36666667],
       [ 0.76666667,

In [17]:
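#http://gis.stackexchange.com/questions/2951/algorithm-for-offsetting-a-latitude-longitude-by-some-amount-of-meters
def meters_to_degs(x, y):
    #takes offsets in meters in the x- (east) and y- (north) directions
    #returns a tuple of changes in degrees: (change in latitude, change in longitude)
    #this method is referred to as 'quick and dirty' and is not suggested for life-dependent applications or long distances
    #caution: the linked answer divides by the cosine of the latitude (in radians);
    #math.cos(y) below receives the y offset in meters instead, so the longitude value can come out with the wrong size or sign
    return (y/111111.0, x/(111111.0 * math.cos(y)))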
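
The linked answer scales the east-west offset by the cosine of the latitude at the reference point. A minimal sketch of that version follows; the function name `meters_to_degs_at` and the extra `lat_deg` parameter are assumptions made for this example and are not part of the notebook.

```python
import math

METERS_PER_DEGREE = 111111.0  # rough length of one degree of latitude

def meters_to_degs_at(x, y, lat_deg):
    """Convert east (x) and north (y) offsets in meters to (dlat, dlon) in degrees.

    lat_deg is the latitude of the reference point in degrees (assumed parameter).
    """
    dlat = y / METERS_PER_DEGREE
    # One degree of longitude shrinks by cos(latitude) away from the equator
    dlon = x / (METERS_PER_DEGREE * math.cos(math.radians(lat_deg)))
    return (dlat, dlon)

# Half-side of a square with Autauga County's land area, at its center latitude
d = math.sqrt(1539584444) / 2
print(meters_to_degs_at(d, d, 32.53216))
```

Because the cosine of any latitude between -90 and 90 degrees is non-negative, both values keep the sign of the input offsets.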

In [18]:
def get_max_distances(land_area):
    #assuming counties are square: half the side is shorter than the radius of a circle
    #with the same area, so fewer points land near or outside the boundary
    side = math.sqrt(land_area)
    r = side/2
    return r

In [19]:
def get_max_distances_circle(land_area):
    #assuming counties are circles (which they are not, but shapes are hard)
    r_2 = land_area/math.pi
    r = math.sqrt(r_2)
    return r
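
As a quick check on how much the choice of shape matters, the sketch below compares the two distances for the first county's land area (ALAND, in square meters). The helper names are made up for this example; the formulas are the same as in the two functions above.

```python
import math

def half_side_of_square(land_area):
    # Distance from the center to an edge of a square with this area
    return math.sqrt(land_area) / 2

def radius_of_circle(land_area):
    # Radius of a circle with this area
    return math.sqrt(land_area / math.pi)

land_area = 1539584444  # ALAND of Autauga County, in square meters
square = half_side_of_square(land_area)
circle = radius_of_circle(land_area)
print(square, circle, circle / square)
```

The ratio is 2/sqrt(pi) for any area, so the circle assumption always reaches about 13% farther from the center than the square assumption.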

In [20]:
def get_degree_ranges(land_area):
    d = get_max_distances(land_area)
    return (meters_to_degs(d, d))

In [21]:
#let's test the functions I wrote using the first entry in the csv
x = df.iloc[0].ALAND #scalar land area of the first row
al = get_degree_ranges(x)
al

(0.17656910076302565, -0.1987309136349029)

The general idea here is to find the maximum distance from the center in the x and y directions and then find random samples within that boundary. Since counties are irregular in shape, we will have to do some cleaning.
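
The idea can be sketched in a few lines: a coefficient between -1 and 1 in each direction picks a fraction of the maximum offset and shifts the center by that amount. The function name `offset_point` and the range values are hypothetical and only illustrate the arithmetic.

```python
def offset_point(center, degree_ranges, coefficient):
    """Shift a (lat, lon) center by a fraction of the maximum offsets.

    center:        (lat, lon) in degrees
    degree_ranges: (max_dlat, max_dlon) in degrees
    coefficient:   (c_lat, c_lon), each in [-1, 1]
    """
    lat = center[0] + coefficient[0] * degree_ranges[0]
    lon = center[1] + coefficient[1] * degree_ranges[1]
    return (lat, lon)

center = (32.53216, -86.646469)  # interior point of Autauga County
ranges = (0.18, 0.21)            # illustrative values only, not computed from the data
for coeff in [(0.0, 0.0), (1.0, 1.0), (-0.5, 0.25)]:
    print(coeff, offset_point(center, ranges, coeff))
```

A coefficient of (0, 0) returns the center itself, and (1, 1) or (-1, -1) lands on a corner of the bounding box.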

In [22]:
#let's try to look at the info that's interesting at the moment (1001 corresponds to the first GEOID/index value)
df["INTPTLONG"][1001]

KeyError: 'INTPTLONG'

In [None]:
#why the error message? try this one:
df["INTPTLONG                                                                                                               "][1001]

In [None]:
#quick fix
df.columns = [x.strip() for x in df.columns]
df["INTPTLONG"][1001]

In [None]:
def sampler(row, val):
    #row is one row of the dataframe (one county)
    #val is the index of the row in b (the Latin Square samples rescaled to [-1, 1]) used for this sample
    latin_square_coefficient = b[val]
    multiplier = get_degree_ranges(row.ALAND)
    center = [row.INTPTLAT, row.INTPTLONG]
    return latin_square_coefficient*multiplier + center

In [None]:
#constants
num_of_samples = 30 #samples per county

#regenerate the coefficients that sampler() reads from the global b, rescaled to [-1, 1]
b = lhs(2, samples=num_of_samples, criterion='center')
b = (b-0.5)*2

Let's test out the sampling function using the first row of the dataframe.

In [None]:
print("latitude\tlongitude")
for x in range(num_of_samples):
    sample = sampler(df.loc[1001], x)
    print("{0}\t{1}".format(sample[0], sample[1]))

Sweet, so now we have 30 points generated for our target county. We should stop at this point and check that our methodology is correct, that is, make sure these points are actually inside the correct county. We can visualize this on a map.

In [None]:
import folium, json

row = df.loc[1001]
center = [row.INTPTLAT, row.INTPTLONG]
name = row.NAME

#you can download this file here: (http://catalog.civicdashboards.com/dataset/1c992edf-5ec7-456b-8191-c73a33bb79e1/resource/af46d2c0-5f84-42ae-85a1-d2ab7d46d9a7/download/ee2d088c0afb441cb8eaf57a8d279de6temp.geojson)
#it contains the boundaries for all the counties in AL
with open('al.geojson') as f:
    countylines = json.load(f)

multi = []
for x in countylines['features']:
    if name in x['properties']['name']:
        for y in x['geometry']['coordinates'][0][0]:
            multi.append([y[1], y[0]])

#sample code from the Folium tutorial at (https://github.com/python-visualization/folium)
map_osm = folium.Map(location=[center[0], center[1]])

#add the county boundary as a red line
folium.PolyLine(multi, color='#FF0000', weight=5).add_to(map_osm)

#loop over the sample points and add a marker for each
for x in range(num_of_samples):
    sample = sampler(df.loc[1001], x)
    folium.Marker([sample[0], sample[1]], popup='Sample Number: {0}'.format(x)).add_to(map_osm)

map_osm.save('osm.html')


Okay, let's see how we did for this [one example county](osm.html). Note: our sampling is based on a random algorithm, so everyone *should* have different plots. TODO: do this programmatically for every county. To be able to defend a project, you should be able to show to what degree this sampling is accurate.
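
One way to approach that TODO is a ray-casting point-in-polygon test against the boundary vertices, reporting the fraction of samples that fall inside. The sketch below is a suggestion rather than part of the original workflow: the function names are made up, and it treats latitude and longitude as flat coordinates, which is an approximation that should be reasonable at county scale.

```python
def point_in_polygon(lat, lon, polygon):
    """Ray-casting test: True if (lat, lon) falls inside the polygon.

    polygon is a list of [lat, lon] vertices, like the `multi` list above.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Does this edge straddle the point's latitude?
        if (lat1 > lat) != (lat2 > lat):
            # Longitude where the edge crosses that latitude
            lon_cross = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < lon_cross:
                inside = not inside
    return inside

def fraction_inside(points, polygon):
    # Share of (lat, lon) points that land inside the polygon
    hits = sum(1 for lat, lon in points if point_in_polygon(lat, lon, polygon))
    return hits / float(len(points))

square = [[0, 0], [0, 2], [2, 2], [2, 0]]
print(fraction_inside([(1, 1), (3, 1)], square))
```

Running `fraction_inside` on each county's samples and its boundary would give one accuracy number per county.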

In [None]:
#blank dataframe
samples = pd.DataFrame(columns=[x for x in range(num_of_samples)])

In [None]:
for idx, row in df.iterrows():
    for x in range(num_of_samples):
        sample = sampler(row, x)
        samples.loc[idx, x] = (sample[0], sample[1])

In [None]:
samples.info()

In [None]:
print("latitude\tlongitude")
for x in samples.loc[1001]:
    print("{0}\t{1}".format(x[0], x[1]))

In [None]:
#get the state abbreviation for later use
samples['states'] = df.states

In [None]:
samples.to_csv('csv/samples.csv')

Now that we have the samples, we can query those locations and begin to ask questions. Below, I will show you how to work with a few corporate APIs to pursue some questions.
