# Quick exploration for lessons learned / data prep

**Reduce memory at every step**
1. Start with GeoParquet (which R can read too) instead of shapefiles.
2. Explore the columns and see which can be downcast to smaller dtypes.
   - `pointid` holds only positive numbers, so an unsigned integer works. Its values go up to about 8M at most if we combine all regions, which fits comfortably in `uint32`: https://towardsdatascience.com/reducing-memory-usage-in-pandas-with-smaller-datatypes-b527635830af
   - `Point_ID` is the string version of `pointid`, and strings take more memory to store. Ignore this column while wrangling and bring it back in at the end.
   - `grid_code` is stored as float, but can be stored as an integer too.
3. For geospatial operations (buffer, spatial join, distance), keep as few columns as possible.

**Chunk to stay within local memory limits**
<br>4. Use the existing regional batches: process one region at a time and do the spatial operations per chunk.

**Use arrays to vectorize multiplying by a scalar and adding**
<br>5. Read in the partitioned parquets with distances and use arrays to apply the decay and aggregation.
<br>6. Bring back the original dataset with all the columns and merge the results in.
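Steps 5 and 6 could look roughly like the sketch below. It is an illustration only: the exponential decay form, the `DECAY_RATE` value, and the column names `uid`, `distance_m`, and `score` are assumptions, since the notes above do not specify them, and the small frames stand in for the real partitioned parquets.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one partition of the distance parquet:
# each row pairs a point (uid) with a nearby feature and their distance.
distances = pd.DataFrame({
    "uid": [0, 0, 1, 2, 2, 2],
    "distance_m": [100.0, 400.0, 250.0, 50.0, 800.0, 1200.0],
})

DECAY_RATE = 0.001  # assumed value (per meter), for illustration only

# Vectorized decay: one array operation instead of a Python loop over rows.
dist = distances["distance_m"].to_numpy()
weights = np.exp(-DECAY_RATE * dist)

# Aggregate per point: np.bincount sums the weights for each integer uid.
uid = distances["uid"].to_numpy()
scores = np.bincount(uid, weights=weights)
results = pd.DataFrame({"uid": np.arange(len(scores)), "score": scores})

# Step 6: bring back the full dataset (including the string column that was
# set aside during wrangling) and merge the results in.
full = pd.DataFrame({
    "uid": [0, 1, 2, 3],
    "Point_ID": ["id_1", "id_2", "id_3", "id_4"],
})
merged = full.merge(results, on="uid", how="left")
print(merged)
```

A left merge keeps points that had no nearby features; their `score` comes back as NaN and can be filled with 0 if that suits the analysis.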

In [1]:
import geopandas as gpd
import pandas as pd

from prep import GCS_FILE_PATH

files = ["Mojave_POIs", "SoCal_POIs"]
CRS = "EPSG:3857"




In [2]:
mojave = gpd.read_parquet(f"{GCS_FILE_PATH}{files[0]}.parquet")
socal = gpd.read_parquet(f"{GCS_FILE_PATH}{files[1]}.parquet")

In [3]:
mojave.shape, socal.shape

((791670, 4), (2644392, 4))

In [4]:
mojave.dtypes, socal.dtypes

(pointid         int64
 grid_code     float64
 Point_ID       object
 geometry     geometry
 dtype: object,
 pointid         int64
 grid_code     float64
 Point_ID       object
 geometry     geometry
 dtype: object)

In [5]:
mojave.crs.to_epsg(), socal.crs.to_epsg()

(3857, 3857)

In [6]:
socal.memory_usage()

Index             128
pointid      21155136
grid_code    21155136
Point_ID     21155136
geometry     21155136
dtype: int64

In [7]:
socal.astype({
    "pointid": "uint32",
    "grid_code": "int16",
    # signed vs. unsigned at the same bit width uses the same amount of memory
}).memory_usage()

Index             128
pointid      10577568
grid_code     5288784
Point_ID     21155136
geometry     21155136
dtype: int64
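One caveat when reading these numbers: for `object` columns such as `Point_ID`, `memory_usage()` counts only the 8-byte pointer per row (2,644,392 × 8 = 21,155,136), not the strings themselves. Passing `deep=True` includes the string contents. A minimal sketch on synthetic data (the frame below is made up for illustration, not the real dataset):

```python
import pandas as pd

# Synthetic stand-in: an integer id plus its string version stored as object.
df = pd.DataFrame({"pointid": range(1, 1001)})
df["Point_ID"] = pd.Series([f"id_{i}" for i in df["pointid"]], dtype=object)

shallow = df.memory_usage()          # object column: pointer size only
deep = df.memory_usage(deep=True)    # object column: pointers plus string data

print(shallow["Point_ID"], deep["Point_ID"])
```

On the real table, the `deep=True` figure for `Point_ID` should show more clearly why the string column is worth setting aside during wrangling.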

In [8]:
print(f"point id max: {socal.pointid.max()}")
print(f"grid code max: {socal.grid_code.max()}")

point id max: 3936983
grid code max: 259.0
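These maxima can be checked against the range of each candidate dtype before casting. The helper `fits` below is a hypothetical name introduced for this sketch. It also assumes `grid_code` has no missing or fractional values, which the outputs above do not show explicitly and which would need checking before a float-to-integer cast.

```python
import numpy as np

def fits(value, dtype):
    """Return True if an integer value is representable in the given integer dtype."""
    info = np.iinfo(dtype)
    return info.min <= value <= info.max

# Maxima printed above for the SoCal region.
pointid_max = 3_936_983
grid_code_max = 259

print(fits(pointid_max, np.uint32))   # True
print(fits(grid_code_max, np.int16))  # True
print(fits(grid_code_max, np.uint8))  # False: uint8 tops out at 255
```

This is why `grid_code` needs a 16-bit type rather than an 8-bit one.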


Is `pointid` unique?

No. `pointid` restarts at 1 in each region, so create a unique identifier. Here, just use the index, because we're going to concatenate the regions. Also store the region, in case we want to use it.

In [9]:
mojave.pointid.describe()

count    7.916700e+05
mean     8.843604e+05
std      3.618008e+05
min      1.000000e+00
25%      6.290412e+05
50%      9.215615e+05
75%      1.195664e+06
max      1.412400e+06
Name: pointid, dtype: float64

In [10]:
socal.pointid.describe()

count    2.644392e+06
mean     1.750431e+06
std      1.060644e+06
min      1.000000e+00
25%      8.759248e+05
50%      1.631116e+06
75%      2.592089e+06
max      3.936983e+06
Name: pointid, dtype: float64

In [11]:
mojave[mojave.pointid==1]

   pointid  grid_code Point_ID                           geometry
0        1        0.0     id_1  POINT (-13177590.802 4510549.039)


In [12]:
socal[socal.pointid==1]

   pointid  grid_code Point_ID                           geometry
0        1        0.0     id_1  POINT (-13519417.193 4285824.176)
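Since `pointid == 1` exists in both regions, a unique identifier can come from the index after concatenating. The sketch below uses tiny made-up frames in place of the real regional tables, and the column names `region` and `uid` are assumptions for illustration. `pd.concat` should accept GeoDataFrames in the same way, but that is not shown here.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the two regional tables.
mojave_small = pd.DataFrame({"pointid": [1, 2, 3], "grid_code": [0, 5, 7]})
socal_small = pd.DataFrame({"pointid": [1, 2], "grid_code": [0, 9]})

# Tag each table with its region, then stack them with a fresh 0..n-1 index.
combined = pd.concat(
    [
        mojave_small.assign(region="Mojave"),
        socal_small.assign(region="SoCal"),
    ],
    ignore_index=True,
)

# The new index is unique across regions, so use it as the identifier.
combined["uid"] = combined.index.astype("uint32")

print(combined["pointid"].is_unique)  # False: ids repeat across regions
print(combined["uid"].is_unique)      # True
```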
