# How to geocode Lat/Longs in a pandas DataFrame

In this example, we'll be using the `censusgeocode` package to enrich a dataset that already contains a `long`itude and a `lat`itude column.


In [1]:
import pandas as pd
import censusgeocode as cg
from rich import print  # Not absolutely necessary, just makes things easier to read.


We used mockaroo.com to create a fake dataset (`censusgeocode_fakedata.csv`) with random company names and a longitude and latitude for each. You can see the schema (and get more data) here: https://mockaroo.com/37155290


In [2]:
fake_data = pd.read_csv("censusgeocode_fakedata.csv")
fake_data.head()


|   | location_name | lat | long |
|---|---|---|---|
| 0 | Photojam | 36.262315 | -116.6851 |
| 1 | Wikibox | 37.867693 | -117.947864 |
| 2 | Babbleblab | 37.307165 | -119.260149 |
| 3 | Lajo | 37.232432 | -120.565067 |
| 4 | Yacero | 38.342407 | -121.587746 |


## Using `censusgeocode.coordinates()`

We will be using the `.coordinates()` function, which takes an `x` (longitude) and a `y` (latitude) value and returns a `CensusResult` object which, for our purposes, is just a dictionary.


In [3]:
result = cg.coordinates(x=-122, y=37)
print(result)


As you can see, `result` is a Python dictionary with information about various geographies (state, county, census tract, etc.). Each of these geographies contains a `list` holding a single result, which is itself a dictionary.


In [4]:
print(result.keys())


The lookup below returns a list with a single element.


In [5]:
print(result["Counties"])


You can access the dictionary by grabbing the list's first element (index 0).


In [6]:
print(result["Counties"][0])


That element is itself a dictionary.


In [7]:
print(result["Counties"][0].keys())


You can access each of these values the way you would any value in a dictionary (with square brackets).


In [8]:
print(result["Counties"][0]["NAME"])


That's how you would manually get the county name for a lat/long pair. In this case, using `cg.coordinates()`, we found out that the point `(37, -122)` on a map is somewhere in Santa Cruz County, California, and it is in congressional district...


In [9]:
result["116th Congressional Districts"][0]["NAME"]


'Congressional District 20'
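
The chained lookups above assume that every geography is present and contains at least one match. The following is a minimal sketch of a more defensive lookup, assuming the result behaves like a plain dictionary as described above. The `first_value` helper and the `mock_result` data are hypothetical and hand-built for illustration; they are not part of `censusgeocode`, and real responses contain more geographies and fields.

```python
# Hand-built stand-in for a geocoding result (an assumption, not real API output).
mock_result = {
    "Counties": [{"NAME": "Santa Cruz County"}],
    "116th Congressional Districts": [],  # simulates a geography with no match
}


def first_value(result, layer, field="NAME"):
    """Return result[layer][0][field], or None if any step is missing."""
    matches = result.get(layer) or []
    if not matches:
        return None
    return matches[0].get(field)


print(first_value(mock_result, "Counties"))
print(first_value(mock_result, "116th Congressional Districts"))
print(first_value(mock_result, "States"))
```

Returning `None` instead of raising a `KeyError` or `IndexError` means one unmatched point does not interrupt a longer run, and pandas will show the gap as a missing value.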

---

What we want to do now is repeat this lookup `for` each row in our dataset.

In other words, we `iter`ate over our dataset's `rows`.


In [10]:
for (index, row) in fake_data.head(2).iterrows():
    print(f"index {index} contains row:\n{row}")


We won't be needing the `index` part, but now we can access each `row`'s `lat` and `long` values like this:


In [11]:
for (index, row) in fake_data.head(2).iterrows():
    print(row["lat"], row["long"])
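
Since the `index` is not needed here, `DataFrame.itertuples(index=False)` is an alternative to `iterrows()` that skips it entirely. This is only a sketch on a two-row stand-in DataFrame (the `sample` name is made up for illustration, with values copied from the table at the top of the notebook); the rest of this notebook keeps using `iterrows()`.

```python
import pandas as pd

# Two rows copied from the sample data shown earlier.
sample = pd.DataFrame(
    {
        "location_name": ["Photojam", "Wikibox"],
        "lat": [36.262315, 37.867693],
        "long": [-116.6851, -117.947864],
    }
)

pairs = []
for row in sample.itertuples(index=False):
    # Each row is a namedtuple, so columns are read as attributes.
    pairs.append((row.long, row.lat))

print(pairs)
```

Attribute access only works for column names that are valid Python identifiers, which `lat` and `long` are.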


With that, we can now geocode our rows fairly easily.


In [12]:
list_of_results = []
for (index, row) in fake_data.head(2).iterrows():
    result = cg.coordinates(x=row["long"], y=row["lat"])
    list_of_results.append(result)


In [13]:
print(f"There's {len(list_of_results)} results.")
print(list_of_results)


But the data we are receiving for each geocoded row is a nested dictionary, and that's difficult to work with. It'd be best if we had it in tabular form. The best way to do that is to create a second dictionary for each row with only the information we want to extract from the larger `result` dictionary.

A dictionary is the best object to hold this data because it works very nicely with pandas DataFrames. Each key in a dictionary is assumed to be a column name and each value is the corresponding cell value. If you have a list of dictionaries, pandas assumes each is a row in the DataFrame you're trying to construct.

It will even handle missing columns nicely!


In [14]:
# a quick example
example_row_1 = {
    "name": "Clifford",
    "last_name": "Smith",
    "stage_name": "Method Man",
}

example_row_2 = {
    "name": "Gary",
    "middle_name": "Earl",
    "last_name": "Grice",
    "stage_name": "GZA",
}

list_of_example_rows = [example_row_1, example_row_2]

pd.DataFrame(list_of_example_rows)


|   | name | last_name | stage_name | middle_name |
|---|---|---|---|---|
| 0 | Clifford | Smith | Method Man | NaN |
| 1 | Gary | Grice | GZA | Earl |


Notice how row 0 has a `NaN` (null) value for `middle_name`.


---

Now, just like we did earlier with county and congressional district, we can create a new dataframe with the results we get from geocoding each row in our dataset.


In [15]:
fake_data.head()


|   | location_name | lat | long |
|---|---|---|---|
| 0 | Photojam | 36.262315 | -116.6851 |
| 1 | Wikibox | 37.867693 | -117.947864 |
| 2 | Babbleblab | 37.307165 | -119.260149 |
| 3 | Lajo | 37.232432 | -120.565067 |
| 4 | Yacero | 38.342407 | -121.587746 |


This will take some time (around 3 minutes) because each row takes a second or two to geocode and we have 100 rows.


In [16]:
list_of_results = []

for (index, row) in fake_data.iterrows():
    # we first create an empty dictionary to store our data
    result_data = {}

    # we can add data from our original dataset to this dictionary too
    # if you have a larger dataset you might want to include a few columns
    # that you can use to join the results dataframe. For example,
    # in the resource gaps analysis we included a `location_id`.
    result_data["location_name"] = row["location_name"]
    result_data["long"] = row["long"]
    result_data["lat"] = row["lat"]

    result = cg.coordinates(x=row["long"], y=row["lat"])

    # this is the same code we used earlier in the notebook
    result_data["county"] = result["Counties"][0]["NAME"]
    result_data["congressional_district"] = result["116th Congressional Districts"][0][
        "NAME"
    ]

    list_of_results.append(result_data)
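
The loop above makes one network request per row, so a single failed request would stop the whole run. Below is a minimal sketch of one way to guard against that, assuming a failed request raises an exception. The `geocode_with_retry` helper and the `flaky_lookup` stand-in are hypothetical and exist only so the example runs offline; the exact exceptions `cg.coordinates` may raise are not assumed, which is why the `except` clause is deliberately broad.

```python
import time


def geocode_with_retry(lookup, x, y, attempts=3, wait_seconds=1.0):
    """Call lookup(x=..., y=...) up to `attempts` times; return None if all fail."""
    for attempt in range(1, attempts + 1):
        try:
            return lookup(x=x, y=y)
        except Exception as error:  # broad on purpose: exception types are not assumed
            print(f"attempt {attempt} failed for ({y}, {x}): {error}")
            if attempt < attempts:
                time.sleep(wait_seconds)
    return None


# Stand-in for a geocoding function: fails once, then succeeds.
calls = {"count": 0}


def flaky_lookup(x, y):
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated network failure")
    return {"Counties": [{"NAME": "Example County"}]}


result = geocode_with_retry(flaky_lookup, x=-122, y=37, wait_seconds=0)
print(result["Counties"][0]["NAME"])
```

In the loop above, `cg.coordinates` would be passed as `lookup`, and rows whose result comes back as `None` could be skipped or stored with empty values.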


In [17]:
geocoded_data = pd.DataFrame(list_of_results)
geocoded_data


|   | location_name | long | lat | county | congressional_district |
|---|---|---|---|---|---|
| 0 | Photojam | -116.685100 | 36.262315 | Inyo County | Congressional District 8 |
| 1 | Wikibox | -117.947864 | 37.867693 | Esmeralda County | Congressional District 4 |
| 2 | Babbleblab | -119.260149 | 37.307165 | Fresno County | Congressional District 4 |
| 3 | Lajo | -120.565067 | 37.232432 | Merced County | Congressional District 16 |
| 4 | Yacero | -121.587746 | 38.342407 | Yolo County | Congressional District 3 |
| ... | ... | ... | ... | ... | ... |
| 95 | Teklist | -117.159930 | 38.554249 | Nye County | Congressional District 4 |
| 96 | Meetz | -119.468842 | 37.244560 | Madera County | Congressional District 4 |
| 97 | Gigabox | -117.768981 | 37.304711 | Inyo County | Congressional District 8 |
| 98 | Meemm | -117.385647 | 38.344826 | Nye County | Congressional District 4 |
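
The comments in the geocoding loop mention carrying over a column that can be used to join the results back to the original data. Here is a minimal sketch of that join using `DataFrame.merge`. The two small DataFrames are stand-ins, and `location_id` is a hypothetical key column that the fake dataset in this notebook does not contain.

```python
import pandas as pd

# Small stand-ins for the original and geocoded tables; `location_id` is a hypothetical key.
original = pd.DataFrame(
    {
        "location_id": [1, 2, 3],
        "location_name": ["Photojam", "Wikibox", "Babbleblab"],
    }
)
geocoded = pd.DataFrame(
    {
        "location_id": [1, 2],
        "county": ["Inyo County", "Esmeralda County"],
    }
)

# A left join keeps every original row; rows without a geocoding result get NaN.
combined = original.merge(geocoded, on="location_id", how="left")
print(combined)
```

A left join is used so that rows that could not be geocoded stay visible in the combined table instead of silently disappearing.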
