# Gourmet Meals Business -- SQL + Neo4j Project (Part 3.1)

Authors: **Brodie Deb, Carolyn Dunlap, and Ethan Moody**

Date: **December 2022**

### Business Case

You have been working at AGM as a data engineer for a couple of years. As a data engineer, you have had a great working relationship with the data science team, and they have been very impressed with your work and your suggestions. The data science team has several positions approved, but has been unsuccessful at filling them.  

In light of your successful work as a data engineer, your knowledge of the company and the business, your great working relationship with the data science team, and your studying to become a data scientist, the data science team has asked you to join their team as a data scientist. Likewise, there are a couple of other data engineers in a similar situation that they have also asked to join the data science team. 

The data science team is looking to upgrade its skill set in terms of awareness of new technologies.   

Specifically, the data science team has no experience with NoSQL databases, and they want your team of new data scientists to give a quick overview of Neo4j, MongoDB, and Redis for the entire data science team, focusing on:

- Business examples which would involve **Neo4j**, **MongoDB**, and **Redis**
- How **Neo4j**, **MongoDB**, and **Redis** can be used to solve those business examples
- Why a relational database would not be a good fit for those business examples

A couple of days after you joined the data science team, the AGM executives met with the data science team in a special off site meeting and discussed their vision for the future of the company. They let the data science team know that for their vision to be successful, they will be heavily relying on the data science team. The data science team is very anxious to demonstrate to the executives that they have the knowledge and skills to have a key role in implementing their vision for the future. The data science team has asked that your business case examples be directly related to the vision of the future of the company.  

The executives relayed that all proofs of concept will take place in the CA Bay Area, since the Berkeley store is the original store and has the oldest and largest customer base. They have also been in talks with BART regarding public transportation.

AGM executives' vision of the future of the company includes the following:

- Adding more pickup locations
- Using public transportation to transport deliveries
    - BART could avoid gridlock traffic
    - Special cars or even special trains could be used
- Using delivery drones
- Using delivery robots
- Hybrid combinations:
    - Adding pickup locations at BART stations would have the following advantages:
        - Lots of potential customers passing through the BART stations each day
        - Allow the use of BART to transport deliveries
    - Using BART for transportation would have the following advantages:
        - Traditional delivery trucks could pick up deliveries at the BART station and:
            - Run a local delivery route
            - Deliver a truck load to another local pickup location
        - Delivery drones could pick up deliveries at the BART station and deliver them locally
        - Delivery robots could pick up deliveries at the BART station and deliver them locally
        
You and the data science team have been tasked with exploring these options using SQL, Neo4j, and external research as needed.

## Overview of Neo4j, MongoDB, and Redis Business Examples

**Neo4j (Primary Focus)**
- Business example #1: Identify viable ways to make the delivery process faster and cheaper, which should in turn improve customer satisfaction. Several options have already been discussed: adding delivery stops around stations and using robots or drones.
    - How this would be achieved: model the stops as graph nodes and the travel times between them as weighted relationships. Dijkstra's algorithm can then be used to calculate the shortest (fastest) paths between stops. The nodes could also be ranked from most to least important based on population and popularity (measured by sales in the zip codes around each station).
- Business example #2: Power product search on the store's online platform and support graph-based recommender systems.
- Business example #3: Analyze team performance to see which employees do better in which teams.
- Business example #4: Track sales by area to assess which stores are the major influencers, then pool more resources around those geographical areas.
- Business example #5: Detect fraud in e-commerce with graph-based anomaly detection algorithms.
- Business example #6: Support customer-facing social media platforms with graph databases.
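The shortest-path idea in business example #1 can be sketched in plain Python before any graph database is involved. This is a minimal illustration of Dijkstra's algorithm, not the project's Neo4j implementation; the stop names and travel times in `travel_minutes` are hypothetical values chosen only for the example.

```python
import heapq

# Hypothetical travel times in minutes between stops; the names and weights
# are illustrative assumptions, not values taken from the AGM or BART data.
travel_minutes = {
    "Berkeley Store": {"Downtown Berkeley": 5, "Ashby": 9},
    "Downtown Berkeley": {"Berkeley Store": 5, "Ashby": 3, "19th Street": 10},
    "Ashby": {"Berkeley Store": 9, "Downtown Berkeley": 3, "19th Street": 8},
    "19th Street": {"Downtown Berkeley": 10, "Ashby": 8, "12th Street": 2},
    "12th Street": {"19th Street": 2},
}


def shortest_times(graph, source):
    """Return the minimum total travel time from source to every reachable node."""
    best = {source: 0}
    queue = [(0, source)]
    while queue:
        elapsed, node = heapq.heappop(queue)
        if elapsed > best.get(node, float("inf")):
            continue  # stale queue entry; a faster route was already found
        for neighbor, minutes in graph[node].items():
            candidate = elapsed + minutes
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return best


times_from_store = shortest_times(travel_minutes, "Berkeley Store")
print(times_from_store)
```

In a graph database the same weights would live on relationships between station nodes, so the data model and the algorithm line up directly.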

**MongoDB (Secondary Focus)**
- Business Example #1: As AGM continues to grow its customer base, they might find it helpful to store their sales, products, customers, and stores data in multiple different POVs, so that different functional groups within the business (e.g., product teams, marketing teams, sales teams, strategy teams, etc.) could use these POVs for quick and efficient business analytics.
    - How Mongo could solve this scenario: Using a document database (like Mongo) could help AGM organize all of their data in a way that’s convenient for analytics and business intelligence. The data could be loaded/organized within JSON or JSON-like files, converted to specific POVs, and then employees of AGM could simply pick which POV they need/want to use for analytical purposes (e.g., using a products POV to determine product performance across their customer base, or using a customer POV to quickly see which customer characteristics correlate most strongly with sales, etc.).
    - Why relational database wouldn’t work: A relational database would likely duplicate all the data in these POVs across multiple tables and would require AGM’s employees to query many tables to tie it all together. While querying could technically work to address this situation, it might not be the most efficient or streamlined way to navigate the data and mine it for insights or business recommendations, especially if AGM continues to grow/expand to more store locations, increase its product offerings, or drive more customer sales (all of which could eventually make the size of these data tables “humongous” and vastly increase the run-times for queries).
- Business Example #2: To evaluate customer satisfaction/experience (e.g., cNPS) and continue building its brand, AGM may want to start collecting customer feedback through reviews on sites like Google, YouTube, Yelp, or its own company website. This feedback would likely take the form of unstructured text or audio/visual data.
    - How Mongo could solve this scenario: AGM could use a document database (like Mongo) to capture customer feedback in full without truncating/abbreviating it or forcing it into a specific format/data type, which a relational database might require. Mongo could provide AGM with a lot of flexibility around how this data is stored and accessed for business insights (e.g., sentiment analytics, customer outreach/promotional programs, etc.).
    - Why relational database wouldn’t work: A relational database would commit this data to a particular format up front, which could be especially problematic if AGM wanted to store video transcripts or paragraph-length customer reviews in a data field with character length constraints. A relational database would also require some type of primary key to be defined within the data, which may not make sense for unstructured data types (e.g., text reviews, audio recordings, etc.).
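To make the flexibility argument concrete, the sketch below shows two review records with different shapes stored side by side as JSON-like documents. It uses only Python dictionaries and the `json` module rather than a MongoDB client, and every field name and value is a hypothetical assumption.

```python
import json

# Hypothetical review documents; field names and values are illustrative only.
reviews = [
    {
        "review_id": 1,
        "source": "Yelp",
        "rating": 5,
        "text": "The meals arrived on time and were still warm. " * 20,
    },
    {
        "review_id": 2,
        "source": "YouTube",
        "video": {"duration_seconds": 312, "transcript": "Unboxing this week's order"},
        "tags": ["unboxing", "delivery"],
    },
]

# Each document serializes on its own; no shared column list or length limit applies
serialized = [json.dumps(review) for review in reviews]

for review, payload in zip(reviews, serialized):
    print(sorted(review.keys()), len(payload), "characters")
```

The two documents share only `review_id` and `source`, which is the kind of variation a fixed relational schema would have to anticipate up front.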

**Redis (Secondary Focus)**
- Business Example #1: Should AGM's customer base expand to require the development of an app or website login, Redis will be useful for storing and managing customer profiles.
    - Redis could be used to store customer IDs (or their chosen username) with their password and other profile information.
    - A relational database would not work as well because it would take longer to retrieve the data from storage (vs from memory in Redis), which would lead to a lag on the user end.
- Business Example #2: Along with a website login, AGM could also enable delivery tracking within the website/app.
    - Redis could also be used to help customers track their deliveries, by keying on AGM delivery driver/transporter IDs and frequently updating the GPS coordinates stored under each ID. AGM customers could then view their delivery in real time each time they look up that ID.
    - A relational database would also not work well in this case because the time it would take to retrieve this information and relay it to the customer would cause the location of the delivery to lag behind where the delivery actually is.
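The delivery-tracking pattern boils down to overwriting one value per transporter ID and reading it back by key. The sketch below mimics that pattern with a small dictionary-backed class; `InMemoryStore` is a hypothetical stand-in written for this example, not a Redis client, and the transporter ID and coordinates are made up.

```python
class InMemoryStore:
    """Dictionary-backed stand-in for a key-value store (not a Redis client)."""

    def __init__(self):
        self._values = {}

    def set(self, key, value):
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)


def record_position(store, transporter_id, latitude, longitude):
    """Overwrite the latest known position for one transporter."""
    store.set(f"delivery:{transporter_id}", (latitude, longitude))


def latest_position(store, transporter_id):
    """Return the most recent position, or None if nothing was recorded."""
    return store.get(f"delivery:{transporter_id}")


store = InMemoryStore()
record_position(store, "driver-17", 37.8703, -122.2595)
record_position(store, "driver-17", 37.8531, -122.2700)  # newer update replaces the old one
print(latest_position(store, "driver-17"))
```

Because each lookup touches a single key held in memory, the read path stays short, which is the property the bullets above rely on.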

## Included Modules and Packages

In [1]:
import csv
import math
import numpy as np
import pandas as pd
import psycopg2
import json
import gmaps
import gmaps.geojson_geometries
from geographiclib.geodesic import Geodesic

## Additional Setup Code

In [2]:
# Function to run a select query and return rows in a pandas dataframe
# Note: pandas formats all numeric values from postgres as float

def my_select_query_pandas(query, rollback_before_flag, rollback_after_flag):
    "Function to run a select query and return rows in a pandas dataframe"
    
    if rollback_before_flag:
        connection.rollback()
    
    df = pd.read_sql_query(query, connection)
    
    if rollback_after_flag:
        connection.rollback()
    
    # Fix any float columns that really should be integers
    
    for column in df:
    
        if df[column].dtype == "float64":

            fraction_flag = False

            for value in df[column].values:
                
                if not np.isnan(value):
                    if value - math.floor(value) != 0:
                        fraction_flag = True

            if not fraction_flag:
                df[column] = df[column].astype('Int64')
    
    return(df)

In [3]:
# Function to read a csv file and print a set number of rows

def my_read_csv_file(file_name, limit):
    "Read the csv file and print only the first 'limit' rows"
    
    # Use a context manager so the file is closed once reading finishes
    with open(file_name, "r", newline = "") as csv_file:
    
        csv_data = csv.reader(csv_file)
    
        i = 0
    
        for row in csv_data:
            i += 1
            if i <= limit:
                print(row)
            
    print("\nPrinted ", min(limit, i), "lines of ", i, "total lines.")

In [4]:
# Function to calculate a box on a map, given a point and miles

def my_calculate_box(point, miles):
    "Given a point and miles, calculate the box in form left, right, top, bottom"
    
    geod = Geodesic.WGS84

    kilometers = miles * 1.60934
    meters = kilometers * 1000

    g = geod.Direct(point[0], point[1], 270, meters)
    left = (g["lat2"], g["lon2"])

    g = geod.Direct(point[0], point[1], 90, meters)
    right = (g["lat2"], g["lon2"])

    g = geod.Direct(point[0], point[1], 0, meters)
    top = (g["lat2"], g["lon2"])

    g = geod.Direct(point[0], point[1], 180, meters)
    bottom = (g["lat2"], g["lon2"])
    
    return(left, right, top, bottom)

In [5]:
# Set up connection to postgres
# Note: All connection inputs below have been removed for protection
connection = psycopg2.connect(
    user = "",
    password = "",
    host = "",
    port = "",
    database = ""
)

In [6]:
cursor = connection.cursor()

## List CA BART Station Locations - Step 1

In [7]:
# Query drops the table housing BART station data, if it already exists

connection.rollback()

query = """

drop table if exists stations;

"""

cursor.execute(query)

connection.commit()

## List CA BART Station Locations - Step 2

In [8]:
# Query creates a table to house BART station data from the stations.csv file

connection.rollback()

query = """

create table stations (
  station varchar(32)
, latitude numeric(9,6)
, longitude numeric(9,6)
, transfer_time numeric(3)
, primary key (station)
)

;

"""

cursor.execute(query)

connection.commit()

## List CA BART Station Locations - Step 3

In [9]:
# Read and display a portion of the BART station data from the stations.csv file

my_read_csv_file("stations.csv", limit = 10)

['station', 'latitude', 'longitude', 'transfer_time']
['12th Street', '37.803608', '-122.272006', '282']
['16th Street Mission', '37.764847', '-122.420042', '287']
['19th Street', '37.807869', '-122.26898', '67']
['24th Street Mission', '37.752', '-122.4187', '277']
['Antioch', '37.996281', '-121.783404', '0']
['Ashby', '37.853068', '-122.269957', '299']
['Balboa Park', '37.721667', '-122.4475', '48']
['Bay Fair', '37.697', '-122.1265', '63']
['Berryessa', '37.368361', '-121.874655', '288']

Printed  10 lines of  51 total lines.


## List CA BART Station Locations - Step 4

In [10]:
# Load BART station data from the stations.csv file into a database

connection.rollback()

query = """

copy stations
from '/user/projects/code/stations.csv' delimiter ',' NULL '' csv header;

"""

cursor.execute(query)

connection.commit()

## List CA BART Station Locations - Step 5

In [11]:
# Query returns the latitude and longitude coordinates for all BART stations in CA

rollback_before_flag = True
rollback_after_flag = True

query = """

select
  t1_stations.station
, t1_stations.latitude
, t1_stations.longitude

from stations as t1_stations

order by
  t1_stations.station

"""

df_bartlocs = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

df_bartlocs

Unnamed: 0,station,latitude,longitude
0,12th Street,37.803608,-122.272006
1,16th Street Mission,37.764847,-122.420042
2,19th Street,37.807869,-122.26898
3,24th Street Mission,37.752,-122.4187
4,Antioch,37.996281,-121.783404
5,Ashby,37.853068,-122.269957
6,Balboa Park,37.721667,-122.4475
7,Bay Fair,37.697,-122.1265
8,Berryessa,37.368361,-121.874655
9,Castro Valley,37.690748,-122.075679


## Show Total Number of AGM Customers

In [12]:
# Query returns total number of customers

rollback_before_flag = True
rollback_after_flag = True

query = """

select
  count(t1_customers.customer_id) as total_number_of_customers

from customers as t1_customers

;

"""

df_totalc = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

df_totalc

Unnamed: 0,total_number_of_customers
0,31082


## Show Total Number of AGM Customers by Store

In [13]:
# Query returns total number of customers by store

rollback_before_flag = True
rollback_after_flag = True

query = """

select
  t1_stores.city as store_name 
, count(t2_customers.customer_id) as total_number_of_customers

from stores as t1_stores

join customers as t2_customers
on t1_stores.store_id = t2_customers.closest_store_id

group by
  store_name
  
order by
  store_name

;

"""

df_storec = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

df_storec

Unnamed: 0,store_name,total_number_of_customers
0,Berkeley,8138
1,Dallas,6359
2,Miami,5725
3,Nashville,3646
4,Seattle,7214


## Show Ratio of AGM Customers to Population by Zip Code

In [14]:
# Query returns the percentage of customers per population by zip code

rollback_before_flag = True
rollback_after_flag = True

query = """

select
  t1_customers.zip as zip
, round((count(t1_customers.customer_id)/t2_zipcodes.population)*100,3) as percentage_customers_per_population

from customers as t1_customers

join zip_codes as t2_zipcodes
on t1_customers.zip = t2_zipcodes.zip

group by
  t1_customers.zip
, t2_zipcodes.zip
  
order by
  (count(t1_customers.customer_id)/t2_zipcodes.population) desc

;

"""

df_zipcpratio = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

df_zipcpratio

Unnamed: 0,zip,percentage_customers_per_population
0,98164,1.290
1,98050,1.087
2,33109,1.053
3,94613,1.045
4,37240,1.028
...,...,...
545,33033,0.002
546,75067,0.001
547,75035,0.001
548,94565,0.001


## Show Total AGM Customers and Population by Zip Code

In [15]:
# Query returns the total number of customers and population by zip code, along with additional geographic details

rollback_before_flag = True
rollback_after_flag = True

query = """

select
  t1_customers.zip as zip
, t2_zipcodes.latitude as lat
, t2_zipcodes.longitude as long
, t2_zipcodes.city as z_city
, t1_customers.city as c_city
, t2_zipcodes.state as z_state
, t1_customers.state as c_state
, t2_zipcodes.population as population
, t2_zipcodes.area as area
, t2_zipcodes.density as density
, t2_zipcodes.time_zone as timezone
, count(t1_customers.customer_id) as total_customers

from customers as t1_customers

join zip_codes as t2_zipcodes
on t1_customers.zip = t2_zipcodes.zip

where
  t1_customers.state in ('CA')

group by
  t1_customers.zip
, t2_zipcodes.latitude
, t2_zipcodes.longitude
, t2_zipcodes.city
, t1_customers.city
, t2_zipcodes.state
, t1_customers.state
, t2_zipcodes.population
, t2_zipcodes.area
, t2_zipcodes.density
, t2_zipcodes.time_zone
  
order by
  t1_customers.zip

;

"""

df_zippc = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

df_zippc

Unnamed: 0,zip,lat,long,z_city,c_city,z_state,c_state,population,area,density,timezone,total_customers
0,94002,37.5135,-122.2991,Belmont,Belmont,CA,CA,27202,5.9244,4591.53,America/Los_Angeles,4
1,94005,37.6887,-122.4080,Brisbane,Brisbane,CA,CA,4692,4.8168,974.09,America/Los_Angeles,41
2,94010,37.5693,-122.3653,Burlingame,Burlingame,CA,CA,42730,12.4205,3440.28,America/Los_Angeles,7
3,94014,37.6909,-122.4475,Daly City,Daly City,CA,CA,49515,6.6361,7461.50,America/Los_Angeles,41
4,94015,37.6812,-122.4805,Daly City,Daly City,CA,CA,64887,6.1247,10594.35,America/Los_Angeles,4
...,...,...,...,...,...,...,...,...,...,...,...,...
139,94963,38.0138,-122.6703,San Geronimo,San Geronimo,CA,CA,498,2.3307,213.67,America/Los_Angeles,5
140,94964,37.9431,-122.4918,San Quentin,San Quentin,CA,CA,3418,0.2608,13104.83,America/Los_Angeles,34
141,94965,37.8499,-122.5236,Sausalito,Sausalito,CA,CA,11408,14.2131,802.64,America/Los_Angeles,39
142,94970,37.9145,-122.6469,Stinson Beach,Stinson Beach,CA,CA,689,7.0006,98.42,America/Los_Angeles,7


## Show AGM Customers and Population by Zip Code - Flat Table

In [16]:
# Query returns a flat table of customers by zip code, along with additional geographic details

rollback_before_flag = True
rollback_after_flag = True

query = """

select
  t1_customers.customer_id as customer_id
, t1_customers.zip as zip
, t2_zipcodes.latitude as lat
, t2_zipcodes.longitude as long
, t2_zipcodes.city as z_city
, t1_customers.city as c_city
, t2_zipcodes.state as z_state
, t1_customers.state as c_state
, t2_zipcodes.population as population
, t2_zipcodes.area as area
, t2_zipcodes.density as density
, t2_zipcodes.time_zone as timezone

from customers as t1_customers

join zip_codes as t2_zipcodes
on t1_customers.zip = t2_zipcodes.zip

where
  t1_customers.state in ('CA')
  
order by
  t1_customers.zip

;

"""

df_zippc_flat = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

df_zippc_flat

Unnamed: 0,customer_id,zip,lat,long,z_city,c_city,z_state,c_state,population,area,density,timezone
0,8103,94002,37.5135,-122.2991,Belmont,Belmont,CA,CA,27202,5.9244,4591.53,America/Los_Angeles
1,8100,94002,37.5135,-122.2991,Belmont,Belmont,CA,CA,27202,5.9244,4591.53,America/Los_Angeles
2,8101,94002,37.5135,-122.2991,Belmont,Belmont,CA,CA,27202,5.9244,4591.53,America/Los_Angeles
3,8102,94002,37.5135,-122.2991,Belmont,Belmont,CA,CA,27202,5.9244,4591.53,America/Los_Angeles
4,7158,94005,37.6887,-122.4080,Brisbane,Brisbane,CA,CA,4692,4.8168,974.09,America/Los_Angeles
...,...,...,...,...,...,...,...,...,...,...,...,...
8133,8098,94973,38.0127,-122.6397,Woodacre,Woodacre,CA,CA,1342,3.7411,358.71,America/Los_Angeles
8134,8099,94973,38.0127,-122.6397,Woodacre,Woodacre,CA,CA,1342,3.7411,358.71,America/Los_Angeles
8135,8094,94973,38.0127,-122.6397,Woodacre,Woodacre,CA,CA,1342,3.7411,358.71,America/Los_Angeles
8136,8093,94973,38.0127,-122.6397,Woodacre,Woodacre,CA,CA,1342,3.7411,358.71,America/Los_Angeles


## Apply Random Variance to Customer Latitude and Longitude Coordinates

The next step applies random variance ("noise") to each latitude and longitude coordinate in the dataframe generated above to create random points within each zip code. This step is needed because the customer addresses in the *customers* table are fake (due to privacy and ethical concerns). The customer cities, states, and zips in that table are all legitimate, but the latitude and longitude associated with each customer's zip code are the coordinates of the *center* of that zip code.

In [17]:
# Generate random variance to apply to latitude points and set seed to maintain consistency

np.random.seed(0)

lat_var_array = np.random.uniform(size = (len(df_zippc_flat), 1), low = -1, high = 1) / 100


# Generate random variance to apply to longitude points and set seed to maintain consistency

np.random.seed(1)

long_var_array = np.random.uniform(size = (len(df_zippc_flat), 1), low = -1, high = 1) / 100


# Update dataframe to show latitude/longitude random variance and adjusted coordinates ("random points")

df_zippc_flat.insert(len(df_zippc_flat.columns), "lat_var", lat_var_array)

df_zippc_flat.insert(len(df_zippc_flat.columns), "long_var", long_var_array)

df_zippc_flat.insert(len(df_zippc_flat.columns), "lat_rp", round(df_zippc_flat["lat"] + df_zippc_flat["lat_var"], 4))

df_zippc_flat.insert(len(df_zippc_flat.columns), "long_rp", round(df_zippc_flat["long"] + df_zippc_flat["long_var"], 4))

df_zippc_flat

Unnamed: 0,customer_id,zip,lat,long,z_city,c_city,z_state,c_state,population,area,density,timezone,lat_var,long_var,lat_rp,long_rp
0,8103,94002,37.5135,-122.2991,Belmont,Belmont,CA,CA,27202,5.9244,4591.53,America/Los_Angeles,0.000976,-0.001660,37.5145,-122.3008
1,8100,94002,37.5135,-122.2991,Belmont,Belmont,CA,CA,27202,5.9244,4591.53,America/Los_Angeles,0.004304,0.004406,37.5178,-122.2947
2,8101,94002,37.5135,-122.2991,Belmont,Belmont,CA,CA,27202,5.9244,4591.53,America/Los_Angeles,0.002055,-0.009998,37.5156,-122.3091
3,8102,94002,37.5135,-122.2991,Belmont,Belmont,CA,CA,27202,5.9244,4591.53,America/Los_Angeles,0.000898,-0.003953,37.5144,-122.3031
4,7158,94005,37.6887,-122.4080,Brisbane,Brisbane,CA,CA,4692,4.8168,974.09,America/Los_Angeles,-0.001527,-0.007065,37.6872,-122.4151
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8133,8098,94973,38.0127,-122.6397,Woodacre,Woodacre,CA,CA,1342,3.7411,358.71,America/Los_Angeles,0.007847,0.002289,38.0205,-122.6374
8134,8099,94973,38.0127,-122.6397,Woodacre,Woodacre,CA,CA,1342,3.7411,358.71,America/Los_Angeles,-0.000206,-0.000152,38.0125,-122.6399
8135,8094,94973,38.0127,-122.6397,Woodacre,Woodacre,CA,CA,1342,3.7411,358.71,America/Los_Angeles,-0.001842,-0.001762,38.0109,-122.6415
8136,8093,94973,38.0127,-122.6397,Woodacre,Woodacre,CA,CA,1342,3.7411,358.71,America/Los_Angeles,0.000955,0.000314,38.0137,-122.6394
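For a sense of scale, the offsets above are drawn uniformly from ±0.01 degrees. The sketch below converts that maximum offset into approximate miles, assuming roughly 69 miles per degree of latitude and a cosine correction for longitude; both are rough approximations, and `jitter_extent_miles` is a hypothetical helper.

```python
import math

MILES_PER_DEGREE_LATITUDE = 69.0  # rough average; an approximation


def jitter_extent_miles(latitude_degrees, max_offset_degrees=0.01):
    """Return the (north-south, east-west) reach in miles of the largest offset."""
    north_south = max_offset_degrees * MILES_PER_DEGREE_LATITUDE
    # A degree of longitude shrinks with the cosine of the latitude
    east_west = north_south * math.cos(math.radians(latitude_degrees))
    return north_south, east_west


# Belmont (zip 94002) centroid latitude from the dataframe above
north_south_miles, east_west_miles = jitter_extent_miles(37.5135)
print(round(north_south_miles, 2), round(east_west_miles, 2))
```

Since the same offset range is used for every zip code regardless of its area, points in very small zip codes may land slightly outside the true boundary.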


## Connect to Google Maps Using API Key

In [18]:
# Open the gmap_api_key file, read the API key, and pass it to gmaps.configure

with open("path-to-api-key", "r") as f: # Note: Actual path edited out for privacy
    my_api_key = f.read().strip()

gmaps.configure(api_key = my_api_key)

## Create Basic Map Centered on UC Berkeley (UCB) Sather Gate

This map is for reference only - it is not used for any analysis and is simply to prove that the code below and API connection above function correctly.

In [19]:
# Use exact latitude and longitude coordinates for Sather Gate

sather_gate_berkeley = (37.870260430419115, -122.25950168579497)

gmaps.figure(center = sather_gate_berkeley, zoom_level = 9)

Figure(layout=FigureLayout(height='420px'))

## Limit Customers and Population Flat Dataframe to Randomized lat/long Coordinates

This step extracts only the randomized latitude and longitude coordinates for each AGM customer from the customer and population dataframe generated above so it can be applied as a "layer" within a mapping visualization.

In [20]:
# Drop all columns except lat_rp and long_rp to show customers by randomized latitude and longitude coordinates

df_zippc_flat_cmap = df_zippc_flat[["lat_rp", "long_rp"]]

df_zippc_flat_cmap

Unnamed: 0,lat_rp,long_rp
0,37.5145,-122.3008
1,37.5178,-122.2947
2,37.5156,-122.3091
3,37.5144,-122.3031
4,37.6872,-122.4151
...,...,...
8133,38.0205,-122.6374
8134,38.0125,-122.6399
8135,38.0109,-122.6415
8136,38.0137,-122.6394


## Create Box Centered on UCB Sather Gate - Step 1

This step calculates the latitude and longitude coordinates that form an X-mile box around UC Berkeley's Sather Gate. Displaying this box is helpful for giving a sense of scale to the maps below and for determining whether certain population/customer centers fall inside or outside the box. The code below uses X = 5, since we assume that most people would consider a distance of 5 miles or less between two locations to be a "quick and easy" trip and a distance of more than 5 miles to be worth more consideration.

In [21]:
# Calculate latitude and longitude coordinates positioned within X miles of a defined point (Sather Gate at UCB)

left, right, top, bottom = my_calculate_box(sather_gate_berkeley, 5)

print(left, right, top, bottom)

(37.8702249126905, -122.35095496602143) (37.8702249126905, -122.16804840556851) (37.9427566793502, -122.25950168579497) (37.79776328649186, -122.25950168579497)
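As a rough cross-check on the geodesic result above, the same box can be approximated on a spherical Earth using only the `math` module. This is a simplified sketch rather than a replacement for `geographiclib`; `approximate_box` and the mean-radius constant are assumptions introduced here.

```python
import math

EARTH_RADIUS_METERS = 6371000.0  # mean Earth radius; a spherical simplification
METERS_PER_MILE = 1609.34


def approximate_box(point, miles):
    """Approximate the (left, right, top, bottom) points on a spherical Earth."""
    latitude, longitude = point
    # Angle subtended at the Earth's center by the requested distance
    delta_latitude = math.degrees(miles * METERS_PER_MILE / EARTH_RADIUS_METERS)
    # Meridians converge toward the poles, so longitude needs a cosine correction
    delta_longitude = delta_latitude / math.cos(math.radians(latitude))
    left = (latitude, longitude - delta_longitude)
    right = (latitude, longitude + delta_longitude)
    top = (latitude + delta_latitude, longitude)
    bottom = (latitude - delta_latitude, longitude)
    return left, right, top, bottom


sather_gate = (37.870260430419115, -122.25950168579497)
approx_left, approx_right, approx_top, approx_bottom = approximate_box(sather_gate, 5)
print(approx_left, approx_right, approx_top, approx_bottom)
```

The approximation holds the latitude fixed for the left and right points, whereas the geodesic calculation lets it drift slightly, so small differences from the printed values are expected.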


## Create Box Centered on UCB Sather Gate - Step 2

This step is purely for reference - it shows all zip codes in the zip_codes table whose center coordinates fall within the box calculated above.

In [22]:
# Find all zip codes that fall within the box defined above

connection.rollback()

# Pass the box bounds as query parameters instead of concatenating strings
query = """
select zip from zip_codes
where latitude >= %s
  and latitude <= %s
  and longitude >= %s
  and longitude <= %s
order by 1
"""

cursor.execute(query, (bottom[0], top[0], left[1], right[1]))

# Fetch the results before ending the transaction
rows = cursor.fetchall()

connection.rollback()

for row in rows:
    print(row[0])

94530
94563
94602
94607
94608
94609
94610
94611
94612
94618
94702
94703
94704
94705
94706
94707
94708
94709
94710
94720
94804
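The bounding-box filter used for this lookup can be tried without a Postgres connection. The sketch below uses an in-memory SQLite database purely as a stand-in, with bound parameters for the box edges; the two rows are hypothetical test points, and SQLite uses `?` placeholders where psycopg2 uses `%s`.

```python
import sqlite3

# Box edges taken from the corner coordinates calculated in this notebook
bottom_lat, top_lat = 37.79776328649186, 37.9427566793502
left_lon, right_lon = -122.35095496602143, -122.16804840556851

# In-memory SQLite table standing in for the Postgres zip_codes table
conn = sqlite3.connect(":memory:")
conn.execute("create table zip_codes (zip text primary key, latitude real, longitude real)")
conn.executemany(
    "insert into zip_codes values (?, ?, ?)",
    [
        ("inside", 37.87, -122.26),       # hypothetical point within the box
        ("outside", 37.5135, -122.2991),  # hypothetical point south of the box
    ],
)

query = """
select zip from zip_codes
where latitude between ? and ?
  and longitude between ? and ?
order by 1
"""

zips_in_box = [row[0] for row in conn.execute(query, (bottom_lat, top_lat, left_lon, right_lon))]
conn.close()
print(zips_in_box)
```

Binding the bounds as parameters keeps the SQL text constant and avoids formatting floats into the query string by hand.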


## Create Box Centered on UCB Sather Gate - Step 3

This step displays the corners of the box calculated above in terms of latitude and longitude coordinates.

In [23]:
# Find the corners of the box

corners = [(top[0], left[1]), (top[0], right[1]), (bottom[0], right[1]), (bottom[0], left[1])]

corners

[(37.9427566793502, -122.35095496602143),
 (37.9427566793502, -122.16804840556851),
 (37.79776328649186, -122.16804840556851),
 (37.79776328649186, -122.35095496602143)]

## Create Box Centered on UCB Sather Gate - Step 4

This step draws the box on a map so it can be displayed.

In [24]:
# Draw the outline of the box on a map

figure_layout = {
    "width": "500px",
    "height": "500px",
    "border": "1px solid black",
    "padding": "1px"
}

box = gmaps.figure(
    center = sather_gate_berkeley,
    map_type = "HYBRID",
    zoom_level = 8,
    layout = figure_layout
)

lines = []

lines.append(
    gmaps.Line(
        start = corners[0],
        end = corners[1],
        stroke_color = "blue",
        stroke_weight = 7
    )
)
lines.append(
    gmaps.Line(
        start = corners[1],
        end = corners[2],
        stroke_color = "blue",
        stroke_weight = 7
    )
)
lines.append(
    gmaps.Line(
        start = corners[2],
        end = corners[3],
        stroke_color = "blue",
        stroke_weight = 7
    )
)
lines.append(
    gmaps.Line(
        start = corners[3],
        end = corners[0],
        stroke_color = "blue",
        stroke_weight = 7
    )
)

drawing_layer = gmaps.drawing_layer(features = lines)

box.add_layer(drawing_layer)

box

Figure(layout=FigureLayout(border='1px solid black', height='500px', padding='1px', width='500px'))

## Create Heatmap Showing the Number of AGM Customers in each Zip Code

In [25]:
# Show hybrid map showing lowest/highest customer population points in green/red

hm_c = gmaps.figure(
    center = sather_gate_berkeley,
    map_type = "HYBRID",
    zoom_level = 8.75,
    layout = figure_layout
)

heatmap_layer = gmaps.heatmap_layer(df_zippc_flat_cmap)

hm_c.add_layer(heatmap_layer)

hm_c

Figure(layout=FigureLayout(border='1px solid black', height='500px', padding='1px', width='500px'))

## Create Heatmap Showing the Number of AGM Customers in each Zip Code + BART Locations

In [26]:
# Show hybrid map showing lowest/highest customer population points in green/red, plus BART station markers/scale box

figure_layout = {
    "width": "500px",
    "height": "500px",
    "border": "1px solid black",
    "padding": "1px"
}

hm_c_bart = gmaps.figure(
    center = sather_gate_berkeley,
    map_type = "HYBRID",
    zoom_level = 8.75,
    layout = figure_layout
)

heatmap_layer = gmaps.heatmap_layer(df_zippc_flat_cmap)

df_bart_stations = df_bartlocs[["latitude", "longitude"]]

symbol_layer1 = gmaps.symbol_layer(
    df_bart_stations,
    hover_text = "BART Station",
    fill_color = "white",
    stroke_color = "black",
    scale = 3
)

# symbol_layer2 = gmaps.symbol_layer(
#     df_bart_stations,
#     #fill_color = "white",
#     fill_color = None,
#     fill_opacity = 0.1,
#     stroke_color = "white",
#     stroke_opacity = 0.1,
#     scale = 20
# )

drawing_layer = gmaps.drawing_layer(features = lines)

hm_c_bart.add_layer(heatmap_layer)

hm_c_bart.add_layer(symbol_layer1)

# hm_c_bart.add_layer(symbol_layer2)

hm_c_bart.add_layer(drawing_layer)

hm_c_bart

Figure(layout=FigureLayout(border='1px solid black', height='500px', padding='1px', width='500px'))

## Show CA Population Density - Step 1

In [27]:
# Query drops the table housing CA population sample data, if it already exists

connection.rollback()

query = """

drop table if exists popsample;

"""

cursor.execute(query)

connection.commit()

## Show CA Population Density - Step 2

In [28]:
# Query creates a table to house CA population sample data from the CA_Bay_Area_Population_Sample_Table.csv file

connection.rollback()

query = """

create table popsample (
  row varchar(32)
, latitude numeric(9,6)
, longitude numeric(9,6)
, primary key (row)
)

;

"""

cursor.execute(query)

connection.commit()

## Show CA Population Density - Step 3

Note: This CSV file was created in the following way:
* Export CA population data for Bay Area zip codes from the *zip_codes* table into an Excel file
* Calculate the percentage share of the population located within each of these zip codes
* Create a "stratified sample" of 1,000 lat/long pairs where the strata are the zip code population proportions
    * Example: If zip code X has 2% of the population, then its lat/long coordinates make up 2% of the 1,000 records
* Save this table of lat/long pairs as a CSV file and import into notebook
* Relevant files are saved here:
    * /user/projects/code/CA_Bay_Area_Population_Sample_Generator.xlsm (uses a macro)
    * /user/projects/code/CA_Bay_Area_Population_Sample_Table.csv
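The Excel macro itself is not shown in this notebook, so the sketch below is only one plausible way to do the proportional allocation described above, using the largest-remainder method so the counts always sum to the requested sample size. The function name `allocate_sample` is hypothetical; the three populations are taken from the zip code output earlier in the notebook, and the sample size of 10 is chosen to keep the arithmetic easy to follow.

```python
def allocate_sample(populations, sample_size):
    """Split sample_size across zip codes in proportion to population."""
    total = sum(populations.values())
    exact = {z: p * sample_size / total for z, p in populations.items()}
    # Start from the whole-number part of each exact share
    allocation = {z: int(share) for z, share in exact.items()}
    leftover = sample_size - sum(allocation.values())
    # Hand the remaining records to the largest fractional remainders
    by_remainder = sorted(exact, key=lambda z: exact[z] - allocation[z], reverse=True)
    for z in by_remainder[:leftover]:
        allocation[z] += 1
    return allocation


populations = {"94002": 27202, "94005": 4692, "94010": 42730}
allocation = allocate_sample(populations, 10)
print(allocation)
```

Each zip code's centroid would then be repeated as many times as its allocated count to build the table of lat/long pairs.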

In [29]:
# Read and display a portion of the CA population sample data from the CA_Bay_Area_Population_Sample_Table.csv file

my_read_csv_file("CA_Bay_Area_Population_Sample_Table.csv", limit = 10)

['\ufeffrow', 'lat', 'long']
['1', '37.5135', '-122.2991']
['2', '37.5135', '-122.2991']
['3', '37.5135', '-122.2991']
['4', '37.5135', '-122.2991']
['5', '37.5135', '-122.2991']
['6', '37.5135', '-122.2991']
['7', '37.5135', '-122.2991']
['8', '37.6887', '-122.408']
['9', '37.5693', '-122.3653']

Printed  10 lines of  1001 total lines.
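The `\ufeff` prefix on the first header field above is a UTF-8 byte order mark (BOM), which spreadsheet tools often write at the start of exported CSV files. The sketch below writes a small temporary file with a BOM and shows that opening it with `encoding="utf-8-sig"` strips the mark; the file contents are a made-up two-line sample.

```python
import csv
import os
import tempfile

# Write a small CSV that starts with a UTF-8 byte order mark
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8-sig", newline=""
) as handle:
    handle.write("row,lat,long\n1,37.5135,-122.2991\n")
    path = handle.name

# Plain utf-8 keeps the BOM as part of the first field
with open(path, "r", encoding="utf-8", newline="") as csv_file:
    header_with_bom = next(csv.reader(csv_file))

# utf-8-sig removes the BOM while decoding
with open(path, "r", encoding="utf-8-sig", newline="") as csv_file:
    header_clean = next(csv.reader(csv_file))

os.remove(path)
print(header_with_bom)
print(header_clean)
```

Stripping the BOM matters mainly when the header names are used as keys, for example with `csv.DictReader` or `pd.read_csv`.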


## Show CA Population Density - Step 4

In [30]:
# Load CA population sample data from the CA_Bay_Area_Population_Sample_Table.csv file into a database

connection.rollback()

query = """

copy popsample
from '/user/projects/code/CA_Bay_Area_Population_Sample_Table.csv' delimiter ',' NULL '' csv header;

"""

cursor.execute(query)

connection.commit()

## Show CA Population Density - Step 5

In [31]:
# Query returns the latitude and longitude coordinates for a stratified sample (N = 1,000) of the CA Bay Area population

rollback_before_flag = True
rollback_after_flag = True

query = """

select
  t1_popsample.row
, t1_popsample.latitude
, t1_popsample.longitude

from popsample as t1_popsample

"""

df_poplocs = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

df_poplocs

| | row | latitude | longitude |
|---|---|---|---|
| 0 | 1 | 37.5135 | -122.2991 |
| 1 | 2 | 37.5135 | -122.2991 |
| 2 | 3 | 37.5135 | -122.2991 |
| 3 | 4 | 37.5135 | -122.2991 |
| 4 | 5 | 37.5135 | -122.2991 |
| ... | ... | ... | ... |
| 995 | 996 | 37.9958 | -122.5778 |
| 996 | 997 | 37.9431 | -122.4918 |
| 997 | 998 | 37.8499 | -122.5236 |
| 998 | 999 | 37.8499 | -122.5236 |


## Apply Random Variance to Population Latitude and Longitude Coordinates

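The next step adds uniform random noise (up to ±0.01 degrees) to each latitude and longitude coordinate in the dataframe generated above, scattering the sampled points around each zip code's coordinates to approximate random locations within that zip code. This step is needed because the zip code coordinates represent the *center* of each zip code, so every sampled record for a given zip code would otherwise plot at exactly the same location.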

In [32]:
# Generate random noise to apply to latitude points and set seed so results are reproducible

np.random.seed(2)

lat_var_array = np.random.uniform(size = (len(df_poplocs), 1), low = -1, high = 1) / 100


# Generate random noise to apply to longitude points and set seed so results are reproducible

np.random.seed(3)

long_var_array = np.random.uniform(size = (len(df_poplocs), 1), low = -1, high = 1) / 100


# Update dataframe to show latitude/longitude random variance and adjusted coordinates ("random points")

df_poplocs.insert(len(df_poplocs.columns), "lat_var", lat_var_array)

df_poplocs.insert(len(df_poplocs.columns), "long_var", long_var_array)

df_poplocs.insert(len(df_poplocs.columns), "lat_rp", round(df_poplocs["latitude"] + df_poplocs["lat_var"], 4))

df_poplocs.insert(len(df_poplocs.columns), "long_rp", round(df_poplocs["longitude"] + df_poplocs["long_var"], 4))

df_poplocs

| | row | latitude | longitude | lat_var | long_var | lat_rp | long_rp |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 37.5135 | -122.2991 | -0.001280 | 0.001016 | 37.5122 | -122.2981 |
| 1 | 2 | 37.5135 | -122.2991 | -0.009481 | 0.004163 | 37.5040 | -122.2949 |
| 2 | 3 | 37.5135 | -122.2991 | 0.000993 | -0.004182 | 37.5145 | -122.3033 |
| 3 | 4 | 37.5135 | -122.2991 | -0.001294 | 0.000217 | 37.5122 | -122.2989 |
| 4 | 5 | 37.5135 | -122.2991 | -0.001593 | 0.007859 | 37.5119 | -122.2912 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 996 | 37.9958 | -122.5778 | 0.001970 | -0.005190 | 37.9978 | -122.5830 |
| 996 | 997 | 37.9431 | -122.4918 | -0.002822 | 0.001602 | 37.9403 | -122.4902 |
| 997 | 998 | 37.8499 | -122.5236 | 0.003608 | -0.002586 | 37.8535 | -122.5262 |
| 998 | 999 | 37.8499 | -122.5236 | 0.007064 | -0.009297 | 37.8570 | -122.5329 |
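
To give a sense of scale for the offsets in `lat_var` and `long_var`, the sketch below converts the maximum offset of 0.01 degrees into approximate ground distances. It is an illustration only: the helper name `jitter_extent_km`, the figure of roughly 111 km per degree of latitude (a spherical-Earth approximation), and the representative Bay Area latitude of 37.8 degrees are all assumptions introduced here, not values from the notebook.

```python
import math

KM_PER_DEG_LAT = 111.0  # approximate length of one degree of latitude (assumption)


def jitter_extent_km(max_offset_deg, latitude_deg):
    """Approximate north-south and east-west distance covered by an offset in degrees."""
    north_south = max_offset_deg * KM_PER_DEG_LAT
    # A degree of longitude shrinks with the cosine of the latitude
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


# 0.01 degrees is the largest offset produced by uniform(-1, 1) / 100
ns_km, ew_km = jitter_extent_km(0.01, 37.8)
print(f"Max offset: about {ns_km:.2f} km north-south, {ew_km:.2f} km east-west")
```

Under these assumptions each point moves by at most about 1.1 km north-south and a little under 0.9 km east-west, so the scatter is the same size for every zip code regardless of the zip code's actual area.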


## Create Heatmap Showing Population Density in Each Zip Code

In [33]:
# Display a hybrid map with a heatmap of the sampled CA population points (low density in green, high density in red)

hm_p = gmaps.figure(
    center = sather_gate_berkeley,
    map_type = "HYBRID",
    zoom_level = 8.75,
    layout = figure_layout
)

heatmap_layer = gmaps.heatmap_layer(df_poplocs[["lat_rp", "long_rp"]])

hm_p.add_layer(heatmap_layer)

hm_p

Figure(layout=FigureLayout(border='1px solid black', height='500px', padding='1px', width='500px'))

## Create Heatmap Showing Population Density in Each Zip Code + BART Locations

In [34]:
# Display a hybrid map with a heatmap of the sampled CA population points (low density in green, high density in red), plus BART station markers and a scale box

hm_p_bart = gmaps.figure(
    center = sather_gate_berkeley,
    map_type = "HYBRID",
    zoom_level = 8.75,
    layout = figure_layout
)

heatmap_layer = gmaps.heatmap_layer(df_poplocs[["lat_rp", "long_rp"]])

df_bart_stations = df_bartlocs[["latitude", "longitude"]]

symbol_layer1 = gmaps.symbol_layer(
    df_bart_stations,
    hover_text = "BART Station",
    fill_color = "white",
    stroke_color = "black",
    scale = 3
)

# symbol_layer2 = gmaps.symbol_layer(
#     df_bart_stations,
#     #fill_color = "white",
#     fill_color = None,
#     fill_opacity = 0.1,
#     stroke_color = "white",
#     stroke_opacity = 0.1,
#     scale = 20
# )

drawing_layer = gmaps.drawing_layer(features = lines)

hm_p_bart.add_layer(heatmap_layer)

hm_p_bart.add_layer(symbol_layer1)

# hm_p_bart.add_layer(symbol_layer2)

hm_p_bart.add_layer(drawing_layer)

hm_p_bart

Figure(layout=FigureLayout(border='1px solid black', height='500px', padding='1px', width='500px'))