# Geospatial Analysis

The goal of this notebook is to find counties that are geographically close to each other and check whether their microbusiness densities are correlated.

In [1]:
import os
import pandas as pd

os.environ['PROJECT_ROOT'] = '.'

# load train dataset
traindf = pd.read_csv('io/dataset/train.csv')
# get sorted unique cfips from train dataset
traincfips = traindf[['cfips']].drop_duplicates().sort_values(by=['cfips']).reset_index(drop=True)


## Check the integrity of coordinates dataset

Dataset location: `io/customdata/cfips_coordinates.csv`

In [2]:
# load coordinates dataset
coorddf = pd.read_csv('io/customdata/cfips_coordinates.csv')
# sort by cfips (uniqueness is verified in the next cell)
coordcfips = coorddf[['cfips']].sort_values(by=['cfips']).reset_index(drop=True)

Check if

* coordcfips has only unique cfips

In [3]:
print("Length of train cfips: ", len(traincfips))
print("Length of coordinates cfips: ", len(coordcfips))
assert len(coordcfips) == len(coordcfips['cfips'].unique()), "There is a duplicate cfips in coordcfips"
print("[ok] All cfips in coordinates dataset are unique!")

Length of train cfips:  3135
Length of coordinates cfips:  3221
[ok] All cfips in coordinates dataset are unique!
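
If this assertion ever fails, it helps to see which codes are duplicated. Below is a minimal sketch on a toy frame; the values are invented for illustration, and only the `cfips` column name comes from the dataset above.

```python
import pandas as pd

# toy stand-in for coorddf; the values are invented for illustration
toy = pd.DataFrame({"cfips": [1001, 1003, 1003, 1005]})

# True only when every cfips value occurs exactly once
all_unique = toy["cfips"].is_unique

# keep=False marks every copy of a repeated value, not just the later ones
dupes = toy[toy["cfips"].duplicated(keep=False)]

print("unique:", all_unique)
print("duplicated cfips:", dupes["cfips"].tolist())
```

On the real data, `coorddf` would take the place of `toy`.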


Check if

* coordcfips has all the cfips from train data
* the cfips are sorted in the same order in both datasets (ignoring cfips that appear only in the coordinates dataset)

In [4]:
foundlen = 0
offset = 0

# for each row in traincfips (unique train data cfips)
for i in range(len(traincfips)):
    # scan forward from the current position; the coordinates dataset may
    # contain extra cfips that are absent from the train data
    # (the range is empty when i+offset runs past the end, so no IndexError)
    for j in range(i+offset, len(coordcfips)):
        if traincfips.iloc[i].item() == coordcfips.iloc[j].item():
            # remember how many extra coordinate rows have been skipped so far
            offset = j - i
            foundlen += 1
            break

print("Total cfips in train data: ", len(traincfips))
print("Found length: ", foundlen)
assert foundlen == len(traincfips), "There are missing cfips in the coordinates dataset!! (io/customdata/cfips_coordinates.csv)"
print("[ok] All training cfips are in coordinates dataset")
print("[ok] Both datasets are sorted correctly")

Total cfips in train data:  3135
Found length:  3135
[ok] All training cfips are in coordinates dataset
[ok] Both datasets are sorted correctly
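
The same two properties can also be checked without an explicit loop. The sketch below uses toy frames in place of `traincfips` and `coordcfips` (values invented for illustration). It relies on `Series.isin` for membership and `Series.is_monotonic_increasing` for ordering, which is a different technique from the scan above rather than a copy of it.

```python
import pandas as pd

# toy stand-ins for traincfips and coordcfips; values are invented
train_toy = pd.DataFrame({"cfips": [1001, 1005, 1009]})
coord_toy = pd.DataFrame({"cfips": [1001, 1003, 1005, 1007, 1009]})

# membership: train cfips that do not appear in the coordinates data
in_coords = train_toy["cfips"].isin(coord_toy["cfips"])
missing = train_toy.loc[~in_coords, "cfips"]

# ordering: if both columns are ascending, the shared cfips necessarily
# appear in the same relative order in both datasets
same_order = (
    train_toy["cfips"].is_monotonic_increasing
    and coord_toy["cfips"].is_monotonic_increasing
)

print("missing cfips:", missing.tolist())
print("same order:", same_order)
```

Unlike the loop, this variant reports exactly which cfips are missing, which makes a failed check easier to debug.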


## Visualize the counties

In order to check whether the counties are close to each other, we can visualize them on a map.

Let's divide the map into a 10x10 grid and count the number of counties in each grid cell.

TODO: Group latitude and longitude by 10x10 grid
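
One possible way to approach this TODO is sketched below. It uses `pd.cut` to split latitude and longitude into 10 equal-width bins each. The column names `lat` and `lng` and all coordinate values are assumptions made for illustration; the actual columns in `cfips_coordinates.csv` may differ.

```python
import pandas as pd

# toy stand-in for coorddf; `lat`/`lng` column names and all values are assumed
toy = pd.DataFrame({
    "cfips": [1001, 1003, 6037, 36061, 48201],
    "lat": [32.5, 30.7, 34.3, 40.8, 29.8],
    "lng": [-86.6, -87.7, -118.2, -74.0, -95.4],
})

n_bins = 10

# an integer `bins` splits the observed range into equal-width intervals;
# labels=False returns the bin index (0 .. n_bins-1) instead of an interval
toy["lat_bin"] = pd.cut(toy["lat"], bins=n_bins, labels=False)
toy["lng_bin"] = pd.cut(toy["lng"], bins=n_bins, labels=False)

# count counties per cell and lay the counts out as a full 10x10 table,
# filling cells that contain no county with 0
grid = (
    toy.groupby(["lat_bin", "lng_bin"])
    .size()
    .unstack(fill_value=0)
    .reindex(index=range(n_bins), columns=range(n_bins), fill_value=0)
)

print(grid)
```

Note that equal-width bins in degrees do not cover equal areas on the ground, so this is only a rough proximity grouping.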