# Overstory Technical Delivery Lead Take-Home Exercise

As a Technical Delivery Lead, you will work with customer data frequently. This data often contains errors and discrepancies. To make sure that we ingest only correct data, we have written a set of functions that validate the data before it is imported into our system.

In this exercise we have provided you with two such functions: `validate_client_lines()` and `validate_client_poles()`. We also provide you with a lines file and a poles file, such as you might have received from a customer. These represent electrical infrastructure pylons/poles ("poles") and the "lines" or spans/cables between them. Currently both validation functions are failing.

We would like you to:

1) Write code to fix the data so that the validation functions pass

2) Compute some statistics on the validated data, namely:

*   Total length (in feet) of all single-phase spans, and total length (in feet) of all three-phase spans
*   Total number of spans grouped by `phasingType`

3) Visualise the data so that you could discuss the customer's infrastructure network with a colleague outside the tech team.

Please explain your working as you would in a notebook that a colleague might need to use in the future.

Please save a copy of this notebook or download it as ipynb before starting. Add all your code under **The exercise** section. When you're done, zip your final notebook and any other relevant files and upload the archive using the link provided to you by email.


# Setup dependencies

In [2]:
# Install dependencies
!pip install -q pandera geopandas



In [3]:
import logging
from typing import Union, Optional

import geopandas as gpd
import pandas as pd
import pandera as pa
from pandera import Field, SchemaModel, check, dataframe_check
from pandera.typing import Series


In [4]:
logger = logging.getLogger(__name__)

# Dataset schemas

In [5]:
class ClientLines(SchemaModel):
    """
    Client Lines dataset
    """

    geometry: Series[gpd.array.GeometryDtype]
    level3: Series[str] = Field(unique=True, description="Unique line ID")
    level1: Optional[Series[str]] = Field(nullable=True)
    level2: Optional[Series[str]] = Field(nullable=True)
    phasingType: Optional[Series[str]] = Field(
        isin=["single-phase", "two-phase", "three-phase"], nullable=True
    )

    class Config:
        name = "Client lines"
        description = "Cleaned client lines dataset"
        unique_column_names = True

    @check("geometry", name="geometry_is_valid")
    def geometry_is_valid(cls, geom: Series[gpd.array.GeometryDtype]) -> Series[bool]:
        return geom.is_valid

    @check("geometry", name="geometry_is_linestring")
    def geometry_is_linestring(
        cls, geom: Series[gpd.array.GeometryDtype]
    ) -> Series[bool]:
        return geom.geom_type == "LineString"

    @dataframe_check
    def dataframe_in_utm(cls, gdf: gpd.GeoDataFrame) -> Series[bool]:
        """Ensure dataframe CRS is in UTM"""
        return gdf.estimate_utm_crs() == gdf.crs
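
The `dataframe_in_utm` check compares the frame's CRS with the UTM zone that geopandas estimates for its extent. If a dataset arrives in a geographic CRS such as EPSG:4326, one way to satisfy the check is to reproject it into the estimated zone. The sketch below uses a made-up span (the coordinates and the `X1` ID are hypothetical) and is only an illustration of the idea, not a statement about what is wrong with the provided files.

```python
import geopandas as gpd
from shapely.geometry import LineString

# Hypothetical span in longitude/latitude (EPSG:4326); coordinates are made up.
lines_wgs84 = gpd.GeoDataFrame(
    {"level3": ["X1"]},
    geometry=[LineString([(-104.940, 39.730), (-104.939, 39.730)])],
    crs="EPSG:4326",
)

# Estimate the UTM zone that covers the data and reproject into it.
utm_crs = lines_wgs84.estimate_utm_crs()
lines_utm = lines_wgs84.to_crs(utm_crs)

# Same comparison as the dataframe_in_utm check in the schema.
in_utm = lines_utm.estimate_utm_crs() == lines_utm.crs
print(lines_utm.crs.to_epsg(), in_utm)
```

After reprojection, `geometry.length` is expressed in metres, which is what the minimum span length check in the validation function relies on.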


In [6]:
class ClientPoles(SchemaModel):
    """
    Client Poles dataset
    """

    geometry: Series[gpd.array.GeometryDtype]
    poleID: Series[str] = Field(unique=True, nullable=True)
    level1: Optional[Series[str]]
    level2: Optional[Series[str]] = Field(nullable=True)
    heightInFt: Optional[Series[float]] = Field(
        ge=0,
        le=500,
        description="Height of pole in feet",
        nullable=True,
    )

    class Config:
        name = "Client poles"
        description = "Cleaned client poles dataset"
        unique_column_names = True

    @check("geometry", name="geometry_is_point")
    def geometry_is_point(cls, geom: Series[gpd.array.GeometryDtype]) -> Series[bool]:
        return geom.geom_type == "Point"

    @dataframe_check
    def dataframe_in_utm(cls, df: pd.DataFrame):
        """Ensure dataframe CRS is in UTM"""
        if isinstance(df, gpd.GeoDataFrame):
            return df.estimate_utm_crs() == df.crs
        return True
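
Before calling the validator, it can help to see which rows would violate the `ClientPoles` constraints (`poleID` unique, `heightInFt` between 0 and 500). The following is a minimal pandas sketch on a hypothetical table; the IDs and heights are invented for illustration and do not come from the provided poles file.

```python
import pandas as pd

# Hypothetical poles table; IDs and heights are made up for illustration.
poles_raw = pd.DataFrame(
    {
        "poleID": ["00001", "00002", "00002", "00003"],
        "heightInFt": ["35", "35.0", "35.0", "5000"],
    }
)

# Coerce heights to float; unparseable values become NaN, which the schema allows.
heights = pd.to_numeric(poles_raw["heightInFt"], errors="coerce")

# Rows that would fail the schema: repeated IDs or heights outside [0, 500].
duplicate_id = poles_raw["poleID"].duplicated(keep=False)
bad_height = heights.notna() & ~heights.between(0, 500)

report = poles_raw.assign(duplicate_id=duplicate_id, bad_height=bad_height)
print(report)
```

How to resolve flagged rows (drop, correct, or query the customer) is a judgement call that the sketch deliberately leaves open.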


# Validation functions

In [7]:
MIN_LINE_LENGTH_IN_M = 2  # Default minimum span length

def validate(gdf: gpd.GeoDataFrame, schema):
    try:
        return schema.validate(gdf, lazy=True)
    except pa.errors.SchemaErrors as err:
        logger.error(err.failure_cases)

def validate_client_poles(gdf: gpd.GeoDataFrame):
    try:
        gdf['heightInFt'] = gdf['heightInFt'].astype('float64')
        return ClientPoles.validate(gdf, lazy=True)
    except pa.errors.SchemaErrors as err:
        logger.error(err.failure_cases)
        assert False, "Validation failed."


def validate_client_lines(
    gdf: gpd.GeoDataFrame, min_line_length_in_m: float = MIN_LINE_LENGTH_IN_M
):

    for col in gdf:
        gdf = gdf[gdf[col] != 'N/A']

    try:
        geometry_column = pa.Column(
            gpd.array.GeometryDtype,
            name="geometry",
            checks=pa.Check(
                lambda x: x.length > min_line_length_in_m,
                error="Line should meet minimum line length.",
                name="geometry_min_length",
            ),
        )
        geometry_column.validate(gdf, lazy=True)
        return ClientLines.validate(gdf, lazy=True)
    except pa.errors.SchemaErrors as err:
        logger.error(err.failure_cases)
        assert False, "Validation failed."
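
The loop at the top of `validate_client_lines()` removes every row that contains the literal string `'N/A'` in any column, so the frame that is validated (and returned) can be shorter than the one passed in, while the caller's own frame is left unchanged. A small pandas sketch of that behaviour, using hypothetical rows:

```python
import pandas as pd

# Hypothetical rows; only the first one holds 'N/A'.
spans = pd.DataFrame(
    {
        "level3": ["EA100", "EA101", "EA102"],
        "phasingType": ["N/A", "three-phase", "single-phase"],
    }
)

# Same loop as in validate_client_lines: drop a row if any column equals 'N/A'.
filtered = spans
for col in filtered:
    filtered = filtered[filtered[col] != "N/A"]

# Equivalent single-step form, shown for comparison.
mask = (spans == "N/A").any(axis=1)
same = filtered.equals(spans[~mask])
print(filtered)
```

Because the original index is preserved, the filtered frame starts at index 1 here, which matches the pattern seen in the validated lines output further down.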

# The exercise
(Note: the file URLs were updated on March 7th 2023, but the file content and exercise remain the same.)

In [8]:
lines = gpd.read_file('https://storage.googleapis.com/overstory-customer-test/take_home_exercise/demo_lines.geojson')
print(lines.head())

validate_client_lines(lines)

  level1    level2 level3  phasingType  pointA  pointB  lineHeightInFt  \
0   AOI2  East12th  EA100          N/A     206   402.0              35   
1   AOI2  East12th  EA101  three-phase     206   207.0              35   
2   AOI2  East12th  EA102  three-phase     207   209.0              35   
3   AOI2  East12th  EA103  three-phase     209   211.0              35   
4   AOI2  East12th  EA104  three-phase     211   213.0              35   

                                            geometry  
0  LINESTRING (505131.847 4398320.697, 505188.789...  
1  LINESTRING (505188.789 4398319.902, 505229.010...  
2  LINESTRING (505229.010 4398319.412, 505229.638...  
3  LINESTRING (505229.638 4398265.080, 505229.506...  
4  LINESTRING (505229.506 4398235.136, 505229.746...  


|     | level1 | level2 | level3 | phasingType | pointA | pointB | lineHeightInFt | geometry |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | AOI2 | East12th | EA101 | three-phase | 206 | 207.0 | 35 | LINESTRING (505188.789 4398319.902, 505229.010... |
| 2 | AOI2 | East12th | EA102 | three-phase | 207 | 209.0 | 35 | LINESTRING (505229.010 4398319.412, 505229.638... |
| 3 | AOI2 | East12th | EA103 | three-phase | 209 | 211.0 | 35 | LINESTRING (505229.638 4398265.080, 505229.506... |
| 4 | AOI2 | East12th | EA104 | three-phase | 211 | 213.0 | 35 | LINESTRING (505229.506 4398235.136, 505229.746... |
| 5 | AOI2 | East12th | EA105 | three-phase | 213 | 406.0 | 35 | LINESTRING (505229.746 4398199.608, 505229.932... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 266 | AOI2 | Elm | E63 | single-phase | 64 | 65.0 | 35 | LINESTRING (506204.459 4400170.652, 506206.135... |
| 267 | AOI2 | Elm | E64 | single-phase | 65 | 66.0 | 35 | LINESTRING (506206.135 4400139.905, 506205.529... |
| 268 | AOI2 | Elm | E65 | single-phase | 66 | 67.0 | 35 | LINESTRING (506205.529 4400103.968, 506205.298... |
| 269 | AOI2 | Elm | E66 | single-phase | 67 | 68.0 | 35 | LINESTRING (506205.298 4400075.244, 506204.820... |


In [10]:
poles = gpd.read_file('https://storage.googleapis.com/overstory-customer-test/take_home_exercise/demo_poles.geojson')

# Your code goes here

validate_client_poles(poles)

|     | level1 | heightInFt | poleID | geometry |
| --- | --- | --- | --- | --- |
| 0 | AOI2 | 35.0 | 00001 | POINT (506059.544 4400335.804) |
| 1 | AOI2 | 35.0 | 00002 | POINT (506106.465 4400335.842) |
| 2 | AOI2 | 35.0 | 00003 | POINT (506160.698 4400335.642) |
| 3 | AOI2 | 35.0 | 00004 | POINT (506205.816 4400337.059) |
| 4 | AOI2 | 35.0 | 00005 | POINT (506261.851 4400336.553) |
| ... | ... | ... | ... | ... |
| 271 | AOI2 | 35.0 | 00409 | POINT (505718.950 4398905.436) |
| 272 | AOI2 | 35.0 | 00410 | POINT (506113.903 4398890.557) |
| 273 | AOI2 | 35.0 | 00411 | POINT (506212.631 4398889.235) |
| 274 | AOI2 | 35.0 | 00412 | POINT (506611.235 4398691.823) |


In [11]:
# Statistics and visualisation go here
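
For the visualisation step, one simple option is a single map with spans coloured by `phasingType` and poles drawn on top, which tends to be readable for a non-technical audience. The sketch below uses tiny made-up geometries (the coordinates, pole IDs and the EPSG:32613 CRS are assumptions for the example); with the real data, the validated `lines` and `poles` frames would take their place.

```python
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import LineString, Point

# Hypothetical spans and poles in a metre-based CRS (assumed for this example).
spans = gpd.GeoDataFrame(
    {"phasingType": ["single-phase", "three-phase"]},
    geometry=[LineString([(0, 0), (50, 0)]), LineString([(50, 0), (50, 40)])],
    crs="EPSG:32613",
)
pole_points = gpd.GeoDataFrame(
    {"poleID": ["a", "b", "c"]},
    geometry=[Point(0, 0), Point(50, 0), Point(50, 40)],
    crs="EPSG:32613",
)

fig, ax = plt.subplots(figsize=(8, 8))
# Colour spans by phasing type and add a legend for the categories.
spans.plot(ax=ax, column="phasingType", legend=True, linewidth=2)
# Draw poles above the lines so they stay visible.
pole_points.plot(ax=ax, color="black", markersize=12, zorder=3)
ax.set_title("Poles and spans by phasing type")
ax.set_xlabel("Easting (m)")
ax.set_ylabel("Northing (m)")
```

A static map like this is easy to paste into a slide or email; an interactive map would be an alternative when the colleague needs to zoom into individual streets.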

In [57]:
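def get_phase(gdf: gpd.GeoDataFrame):
    """Sum `lineHeightInFt` per phasing type for single- and three-phase spans.

    Note: this sums the line-height attribute, not the geometric span length.
    """
    phase_1_3 = gdf[gdf['phasingType'].isin(['single-phase', 'three-phase'])]
    grouped = phase_1_3.groupby('phasingType')['lineHeightInFt'].agg(['sum'])
    grouped.reset_index(inplace=True)
    return grouped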

In [58]:
def get_span(gdf: gpd.GeoDataFrame):
    """Count rows with a non-null `lineHeightInFt` per phasing type (single- and three-phase only)."""
    phase_1_3 = gdf[gdf['phasingType'].isin(['single-phase', 'three-phase'])]
    grouped = phase_1_3.groupby('phasingType')[['lineHeightInFt']].count()
    grouped.reset_index(inplace=True)
    grouped.rename(columns={'lineHeightInFt': 'count'}, inplace=True)
    return grouped

In [59]:
get_phase(lines)

|     | phasingType | sum |
| --- | --- | --- |
| 0 | single-phase | 2555 |
| 1 | three-phase | 6580 |
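
The figures above are sums of the `lineHeightInFt` column. The exercise asks for total span length, which can instead be derived from the geometry: in a UTM CRS, `geometry.length` is in metres and can be converted to feet. The following is a minimal sketch on made-up geometries, assuming the international foot (0.3048 m); the `lengthInFt` column name and the EPSG:32613 CRS are choices made for this example, and whether the customer expects US survey feet is not stated in the source.

```python
import geopandas as gpd
from shapely.geometry import LineString

FEET_PER_METRE = 1 / 0.3048  # international foot (an assumption)

# Hypothetical spans in a metre-based CRS; geometries are made up.
spans = gpd.GeoDataFrame(
    {"phasingType": ["single-phase", "three-phase", "three-phase", "two-phase"]},
    geometry=[
        LineString([(0, 0), (30.48, 0)]),
        LineString([(0, 0), (0, 60.96)]),
        LineString([(0, 0), (30.48, 0)]),
        LineString([(0, 0), (10, 0)]),
    ],
    crs="EPSG:32613",
)

# Planar length in metres, converted to feet.
spans["lengthInFt"] = spans.geometry.length * FEET_PER_METRE

# Total length per phasing type, restricted to single- and three-phase spans.
selected = spans[spans["phasingType"].isin(["single-phase", "three-phase"])]
total_ft = selected.groupby("phasingType")["lengthInFt"].sum()
print(total_ft)
```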


In [60]:
get_span(lines)

|     | phasingType | count |
| --- | --- | --- |
| 0 | single-phase | 73 |
| 1 | three-phase | 188 |
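
`get_span` only reports single- and three-phase spans. Since the task asks for the number of spans grouped by `phasingType`, a count over every category, including missing values, may be closer to what is wanted. A minimal pandas sketch on a hypothetical column (the values and the `unknown` label are made up for the example):

```python
import pandas as pd

# Hypothetical phasingType column, including one missing value.
phasing = pd.Series(
    ["single-phase", "three-phase", "three-phase", "two-phase", None],
    name="phasingType",
)

# Label missing values explicitly so they show up in the counts.
counts = phasing.fillna("unknown").value_counts()
print(counts)
```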


In [82]:
import pandas as pd

In [83]:
df = pd.DataFrame({'A':[1,2,3,7,8,9],'B':[2,4,7,8,9,11]})

In [84]:
df2 = df.copy()

In [85]:
x = df['A']
y = df2['B']

In [86]:
x

0    1
1    2
2    3
3    7
4    8
5    9
Name: A, dtype: int64

In [87]:
y

0     2
1     4
2     7
3     8
4     9
5    11
Name: B, dtype: int64

In [88]:
x = set(x)

In [89]:
y = set(y)

In [90]:
x

{1, 2, 3, 7, 8, 9}

In [91]:
y

{2, 4, 7, 8, 9, 11}

In [93]:
p = x - y

In [94]:
q = y - x

In [95]:
list(p) + list(q)

[1, 3, 11, 4]
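
The last few cells build the elements that appear in exactly one of the two sets by concatenating `x - y` and `y - x`. Python's built-in symmetric difference gives the same elements in one step; sorting the result makes the order deterministic, since set iteration order is not something to rely on.

```python
x = {1, 2, 3, 7, 8, 9}
y = {2, 4, 7, 8, 9, 11}

# Elements in exactly one of the two sets; equivalent to x.symmetric_difference(y).
only_in_one = x ^ y

result = sorted(only_in_one)
print(result)  # [1, 3, 4, 11]
```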