# Quick exploration for lessons learned / data prep
**Reduce memory at every step**
1. start with geoparquet (works with R too) instead of shapefile
2. explore columns and see which can be downgraded
   - `pointid` is never negative, so it can be stored as an unsigned integer. The number of digits matters too: values go up to 8M at most if we combine all regions, which is too large for `uint16` (max 65,535) but fits easily in `uint32`. Reference: https://towardsdatascience.com/reducing-memory-usage-in-pandas-with-smaller-datatypes-b527635830af
   - `Point_ID` is the string version of `pointid`, and strings take more memory to store. Let's ignore this column while we're wrangling and bring it back in at the end.
   - `grid_code` is read in as float, but it can be stored as an integer too
3. for geospatial operations (buffer, spatial join, distance), keep as few columns as possible
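
Steps 2 and 3 above can be sketched on a toy table. This is a minimal illustration, not the notebook's actual workflow: the values are made up, plain pandas stands in for geopandas, and the names `point_id_str`, `working`, and `restored` are hypothetical.

```python
import pandas as pd

# Toy stand-in for one region's table; the values are invented for illustration.
df = pd.DataFrame({
    "pointid": [1, 2, 3],
    "Point_ID": ["id_1", "id_2", "id_3"],
    "grid_code": [10.0, 25.0, 7.0],
})

# Step 2: downcast the numeric columns to smaller dtypes.
slim = df.astype({"pointid": "uint32", "grid_code": "int16"})

# Set the string column aside while wrangling (it stays aligned on the index).
point_id_str = slim["Point_ID"]
working = slim.drop(columns="Point_ID")

# Step 3 would run buffers / spatial joins on `working` only.

# At the end, bring the string column back by joining on the index.
restored = working.join(point_id_str)

# Compare total bytes before and after slimming down.
before = df.memory_usage(deep=True).sum()
after = working.memory_usage(deep=True).sum()
```

Joining on the index only works if the rows keep their original index through the wrangling steps, which is an assumption of this sketch.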

In [None]:
import geopandas as gpd
import pandas as pd

GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/py_crow_flies/"

files = ["Mojave_POIs", "SoCal_POIs"]
CRS = "EPSG:3857"

In [None]:
mojave = gpd.read_parquet(f"{GCS_FILE_PATH}{files[0]}.parquet")
socal = gpd.read_parquet(f"{GCS_FILE_PATH}{files[1]}.parquet")

In [None]:
mojave.shape, socal.shape

In [None]:
mojave.dtypes, socal.dtypes

In [None]:
mojave.crs.to_epsg(), socal.crs.to_epsg()

In [None]:
socal.memory_usage()

In [None]:
socal.astype({
    "pointid": "uint32",
    "grid_code": "int16",
    # using uint vs. int of the same width doesn't appear to change memory usage
}).memory_usage()

In [None]:
print(f"point id max: {socal.pointid.max()}")
print(f"grid code max: {socal.grid_code.max()}")
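
Before downcasting, it can help to confirm that the observed range actually fits the target dtype. The sketch below uses `np.iinfo` for that check; the helper name `fits` is hypothetical, and the 8M figure is the rough upper bound for `pointid` mentioned at the top of this notebook, not a measured value.

```python
import numpy as np

def fits(min_value, max_value, dtype):
    """Return True if [min_value, max_value] lies within the integer dtype's range."""
    info = np.iinfo(dtype)
    return info.min <= min_value and max_value <= info.max

# Approximate upper bound for pointid across all regions (assumption from the notes).
approx_pointid_max = 8_000_000

pointid_fits_uint16 = fits(0, approx_pointid_max, np.uint16)
pointid_fits_uint32 = fits(0, approx_pointid_max, np.uint32)
```

The same check could be applied to `grid_code` with `socal.grid_code.min()` and `socal.grid_code.max()` before choosing `int16`.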

Is `pointid` unique?

No. Create a unique identifier instead. Here, just use the index, because we're going to concatenate the regions anyway. Also store the region, in case we want to use it.

In [None]:
mojave.pointid.describe()

In [None]:
socal.pointid.describe()

In [None]:
mojave[mojave.pointid==1]

In [None]:
socal[socal.pointid==1]
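
Since `pointid` restarts in each region, one way to build the identifier described above is to tag each table with its region, concatenate, and use the fresh index. This is a minimal sketch on made-up toy tables; the column name `poi_index` and the region labels are hypothetical, and plain pandas frames stand in for the GeoDataFrames.

```python
import pandas as pd

# Toy stand-ins for the two regions; pointid repeats across them.
mojave_toy = pd.DataFrame({"pointid": [1, 2], "grid_code": [5, 9]})
socal_toy = pd.DataFrame({"pointid": [1, 2, 3], "grid_code": [4, 8, 6]})

# Tag each table with its region, then stack them with a fresh 0..n-1 index.
combined = pd.concat(
    [mojave_toy.assign(region="Mojave"), socal_toy.assign(region="SoCal")],
    ignore_index=True,
)

# Use the new index as the unique identifier, stored compactly.
combined["poi_index"] = combined.index.to_numpy().astype("uint32")
```

`pd.concat` also accepts GeoDataFrames, so the same pattern should carry over to `mojave` and `socal` as long as both share a CRS.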