## Heuristic optimizations
Several tweaks are needed to get the final product working as desired.

[Validation of pipeline](#section00)

[Simple penalty function](#section0)

[Exclusion of areas in Noord](#section1)

In [2]:
import sys
import geopandas as gpd
import pandas as pd
import numpy as np
import shapely

sys.path.append("../")

from Code.helper_functions import initial_loading, analyze_candidate_solution,\
add_shortest_distances_to_all_households, calculate_weighted_distance, calculate_penalties
from Code.loading_data import distance_matrix_with_counts, load_api_data,\
create_all_households, create_aansluitingen

<a id='section00'></a>
## Validation of pipeline

The initial pipeline had many small problems, so every part of it had to be validated:
- number of households per bag --> fixed
- total households --> fixed
- average walking distance update --> fixed
- check the total number of households --> fixed
- exclude households with more than 1000 meters walking distance --> fixed (optional parameter)


- locations
- Exclusion of some rural areas (Ransdorp, Durgerdam etc.)
- Penalties
- Search for rural areas and exclude those possibly


In [3]:
rel_poi_df = pd.read_csv('../Data/postgres_db/info_pois.csv', index_col='Unnamed: 0')
rel_poi_df.head()
# rel_poi_df = rel_poi_df[['bk_afv_rel_nodes_poi','s1_afv_nodes', 's1_afv_poi', 'cluster_x', 'cluster_y', 'type', 'bag']]

Unnamed: 0,s1_afv_rel_nodes_poi,bk_afv_rel_nodes_poi,s1_afv_nodes,s1_afv_poi,mf_insert_datetime,mf_update_datetime,mf_row_hash,mf_deleted_ind,mf_run_id,cluster_x,cluster_y,type,bag
0,283,119757.000~488187.000~votpand_cluster~36310001...,483489,131232,2020-04-01 13:57:12,2020-04-01 13:57:12,6d65409c-1e4a-6959-f449-5a69231228ea,False,357,119757,488187,votpand_cluster,363100012072349
1,285,128330.000~485171.000~votpand_cluster~36310001...,483491,67104,2020-04-01 13:57:12,2020-04-01 13:57:12,67abef92-9627-5588-5a01-9f45e979fe43,False,357,128330,485171,votpand_cluster,363100012159111
2,286,116781.500~485130.000~votpand_cluster~36310001...,483492,14571,2020-04-01 13:57:12,2020-04-01 13:57:12,c0614f09-3b81-f752-9d43-e51e95a7deb7,False,357,116782,485130,votpand_cluster,363100012142426
3,288,121814.966~490814.811~afval_cluster~121813.172...,483494,91163,2020-04-01 13:57:12,2020-04-01 13:57:12,ea822bb4-51b1-e17c-4db1-6e6a4d6c0cdc,False,357,121815,490815,afval_cluster,121813.172|490814.873
4,291,126721.457~481159.344~votpand_cluster~36310001...,483497,53386,2020-04-01 13:57:12,2020-04-01 13:57:12,1d8a8fea-8c46-4d04-c57d-49e59879c7f0,False,357,126721,481159,votpand_cluster,363100012127611


In [4]:
# rel_poi_df['bag'].value_counts()
# rel_poi_df[rel_poi_df['bag'] == '363100012241555']
# df = pd.read_csv('../Data/households_per_cluster.csv')
# df[df['ligtin_bag_pnd_identificatie'] == '0363100012241555']
info_poi = pd.read_csv('../Data/postgres_db/afv_poi.csv')
temp = rel_poi_df.groupby('bag').first().reset_index().rename(columns={0:'bag'})

In [5]:
inpt_dfob = pd.read_csv('../Data/postgres_db/addresses_per_cluster.csv')
inpt_dis = pd.read_csv('../Data/postgres_db/distance_matrix.csv')

In [6]:
df_afstandn2 = distance_matrix_with_counts(inpt_dfob=inpt_dfob,
                                                       inpt_poi=rel_poi_df,
                                                       inpt_dis=inpt_dis,
                                                       get_data=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2['ligtin_bag_pnd_identificatie'] = \
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  verblijfsobjecten['bag'] = verblijfsobjecten.loc[:, 'bag']\
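
The warnings above are raised when a value is assigned to a DataFrame that was itself produced by slicing another DataFrame. A minimal sketch of the usual remedy, taking an explicit copy before assigning, is shown below. The toy frame and its values are illustrative assumptions and do not reflect the actual internals of `distance_matrix_with_counts`.

```python
import pandas as pd

# Toy stand-in for the real data; values and columns are illustrative only.
df = pd.DataFrame({
    "bag": ["363100012072349", "363100012159111", "363100012142426"],
    "type": ["votpand_cluster", "votpand_cluster", "afval_cluster"],
})

# An explicit copy makes the subset an independent DataFrame, so the
# assignment below is unambiguous and does not trigger the warning.
subset = df[df["type"] == "votpand_cluster"].copy()
subset["bag"] = "0" + subset["bag"]

print(subset["bag"].tolist())
print(df["bag"].tolist())  # the original frame is left untouched
```

The alternative suggested by the warning text, `df.loc[row_indexer, col_indexer] = value`, applies when the intent is to modify the original frame rather than a subset.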


In [7]:
api_df = load_api_data(subsectie=None)

In [None]:
# automatic
all_households, rel_poi_df, joined, df_afstandn2 = initial_loading()

Do you want to use addresses instead of clusters?True
What stadsdeel do you want to make as a subsection(optional parameter)?
What is the maximum amount of containers in a cluster that is considered to be useful?8
Where to get db files(local/online)?local
DB relation POIs loaded
distance matrix loaded
API data loaded


In [None]:
joined_cluster_distance = joined.set_index('s1_afv_nodes')\
    .join(df_afstandn2.set_index('van_s1_afv_nodes')).reset_index()\
    .rename(columns={'index':'van_s1_afv_nodes'})

In [None]:
good_result = add_shortest_distances_to_all_households(all_households,
                                                      joined_cluster_distance,
                                                      use_count=True)
good_result['rest_afstand'].isna().sum()

In [None]:
poi_cols = ['poi_rest', 'poi_plastic', 'poi_papier', 'poi_glas', 'poi_textiel']
dist_cols = ['rest_afstand', 'plastic_afstand', 'papier_afstand',
             'glas_afstand', 'textiel_afstand']

good_result['count'] = good_result['count'].fillna(0)
# Sentinel values: 999 for a missing POI, 2000 meters for a missing distance
good_result[poi_cols] = good_result[poi_cols].fillna(999)
good_result[dist_cols] = good_result[dist_cols].fillna(2000)

# Households that do not use a container get no rest distance and no rest POI
good_result.loc[~good_result['uses_container'], 'rest_afstand'] = np.nan
good_result.loc[~good_result['uses_container'], 'poi_rest'] = np.nan
# good_result

In [None]:
aansluitingen = create_aansluitingen(good_result, joined_cluster_distance, use_count=True)

In [None]:
(good_result['rest_afstand'] * good_result['count']).sum()/good_result['count'].sum()
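
The expression above is a household-weighted mean: each address contributes its distance multiplied by the number of households living there. A small sketch on made-up numbers (the toy frame is an assumption, only the column names follow the notebook) shows how this differs from the plain mean.

```python
import numpy as np
import pandas as pd

# Toy data: three addresses with a distance to the nearest rest container
# and the number of households ("count") at each address.
toy = pd.DataFrame({"rest_afstand": [50.0, 100.0, 300.0],
                    "count": [10, 5, 1]})

# Same formula as in the cell above.
manual = (toy["rest_afstand"] * toy["count"]).sum() / toy["count"].sum()

# Equivalent NumPy call when there are no missing values.
via_numpy = np.average(toy["rest_afstand"], weights=toy["count"])

# Plain mean over addresses, ignoring how many households live at each.
unweighted = toy["rest_afstand"].mean()

print(manual, via_numpy, unweighted)
```

One thing to keep in mind: if `rest_afstand` contains NaN, the pandas expression skips those rows in the numerator but still includes their households in the denominator, whereas `np.average` returns NaN.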

In [None]:
def calculate_weighted_distance(good_result, use_count=False, w_rest=0.61,
                                w_plas=0.089, w_papi=0.16, w_glas=0.11,
                                w_text=0.025, return_all=False):
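    """
    Calculate the weighted average walking distance of an input dataframe.

    Computes the mean walking distance per fraction and combines these means
    into a single score using the weight assigned to each fraction. This is
    part of the score function.

    Input:
    good_result: dataframe containing the distance from every household to
    its nearest container per fraction
    use_count: if True, weigh each row by its number of households ('count')
    return_all: if True, return the per-fraction means instead of the score

    Output:
    float representing the weighted average distance, or a tuple of the
    per-fraction means (rest, plastic, papier, glas, textiel) if return_all
    """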
    if not use_count:
        rest_mean = good_result['rest_afstand'].mean()
        papier_mean = good_result['papier_afstand'].mean()
        glas_mean = good_result['glas_afstand'].mean()
        plastic_mean = good_result['plastic_afstand'].mean()
        textiel_mean = good_result['textiel_afstand'].mean()
    else:
        rest_mean = (good_result['rest_afstand'] * good_result['count'])\
            .sum() / good_result['count'].sum()
        papier_mean = (good_result['papier_afstand'] * good_result['count'])\
            .sum() / good_result['count'].sum()
        plastic_mean = (good_result['plastic_afstand'] * good_result['count'])\
            .sum() / good_result['count'].sum()
        glas_mean = (good_result['glas_afstand'] * good_result['count'])\
            .sum() / good_result['count'].sum()
        textiel_mean = (good_result['textiel_afstand'] * good_result['count'])\
            .sum() / good_result['count'].sum()

    if return_all:  # Return all individual mean distances (for analysis)
        return rest_mean, plastic_mean, papier_mean, glas_mean, textiel_mean

    # Multiply mean distance per fraction with its relative importance
    score = w_rest * rest_mean + w_plas * plastic_mean + w_papi * papier_mean + \
        w_glas * glas_mean + w_text * textiel_mean
    return score

In [None]:
avg_distance = calculate_weighted_distance(good_result,
                                           use_count=True,
                                           return_all=True)
avg_distance


# Address POIs that have residential dwellings but lack more than one
# fraction within 1 km (a poi_* value of 999 marks a missing fraction)
poi_cols = ['poi_rest', 'poi_plastic', 'poi_papier', 'poi_glas', 'poi_textiel']
n_missing = (good_result[poi_cols] == 999.0).sum(axis=1)
has_households = good_result['count'] > 0
bad_ones = good_result[(n_missing > 1) & has_households]
good_result_clean = good_result[(n_missing < 2) & has_households]

avg_distance2 = calculate_weighted_distance(good_result_clean,
                                            use_count=True,
                                            return_all=True)
avg_distance2, avg_distance
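
The split into `bad_ones` and `good_result_clean` relies on counting, per row, how many fractions carry the sentinel value 999. The sketch below reproduces that logic on a made-up frame (the values are assumptions, the column names follow the notebook) to show which rows end up where.

```python
import pandas as pd

POI_COLS = ["poi_rest", "poi_plastic", "poi_papier", "poi_glas", "poi_textiel"]

# Toy rows: POI ids per fraction (999 = no container found) and household count.
toy = pd.DataFrame(
    [[12, 40, 41, 999, 999, 3],      # two fractions missing, 3 households
     [12, 40, 41, 42, 999, 7],       # one fraction missing, 7 households
     [999, 999, 999, 999, 999, 0]],  # everything missing, no households
    columns=POI_COLS + ["count"],
)

# Number of fractions per row that carry the sentinel value.
n_missing = (toy[POI_COLS] == 999).sum(axis=1)

is_bad = (n_missing > 1) & (toy["count"] > 0)
is_clean = (n_missing < 2) & (toy["count"] > 0)

print(n_missing.tolist(), is_bad.tolist(), is_clean.tolist())
```

Rows without households fall into neither group, since both masks require `count > 0`.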

In [None]:
def calculate_simple_penalties(good_result, aansluitingen, use_count=True,
                               w_rest=0.61, w_plas=0.089, w_papi=0.16, w_glas=0.11,
                               w_text=0.025, use_weight=True, return_all=True):
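    """
    Return a simplified version of the penalties.

    Counts the households (or rows, if use_count is False) whose walking
    distance exceeds the threshold for a fraction, plus the households
    connected to a cluster in excess of its capacity for that fraction.
    Returns the weighted total if use_weight is True (followed by the ten
    individual penalties if return_all is True), else the unweighted sum.
    """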
    if use_count:
        penalty1 = good_result[good_result['rest_afstand'] > 100]['count'].sum()
        penalty2 = good_result[good_result['plastic_afstand'] > 150]['count'].sum()
        penalty3 = good_result[good_result['papier_afstand'] > 150]['count'].sum()
        penalty4 = good_result[good_result['glas_afstand'] > 150]['count'].sum()
        penalty5 = good_result[good_result['textiel_afstand'] > 150]['count'].sum()
    
    else:
        penalty1 = good_result[good_result['rest_afstand'] > 100].shape[0]
        penalty2 = good_result[good_result['plastic_afstand'] > 150].shape[0]
        penalty3 = good_result[good_result['papier_afstand'] > 150].shape[0]
        penalty4 = good_result[good_result['glas_afstand'] > 150].shape[0]
        penalty5 = good_result[good_result['textiel_afstand'] > 300].shape[0]

    temp = (aansluitingen['poi_rest'] - aansluitingen['rest'] * 100)
    penalty6 = temp[temp > 0].sum()
    temp = (aansluitingen['poi_plastic'] - aansluitingen['plastic'] * 200)
    penalty7 = temp[temp > 0].sum()
    temp = (aansluitingen['poi_papier'] - aansluitingen['papier'] * 200)
    penalty8 = temp[temp > 0].sum()
    temp = (aansluitingen['poi_glas'] - aansluitingen['glas'] * 200)
    penalty9 = temp[temp > 0].sum()
    temp = (aansluitingen['poi_textiel'] - aansluitingen['textiel'] * 750)
    penalty10 = temp[temp > 0].sum()

    if use_weight:
        total = w_rest * (penalty1 + penalty6) + w_plas * (penalty2 + penalty7) +\
            w_papi * (penalty3 + penalty8) + w_glas * (penalty4 + penalty9) +\
            w_text * (penalty5 + penalty10)
        if return_all:
            return total, penalty1, penalty2, penalty3, penalty4, penalty5,\
                penalty6, penalty7, penalty8, penalty9, penalty10
        else:
            return total

    return penalty1+penalty2+penalty3+penalty4+penalty5+penalty6+penalty7 + \
        penalty8+penalty9+penalty10

def calculate_penalties(good_result, aansluitingen, use_count=False,
                        w_rest=0.61, w_plas=0.089, w_papi=0.16, w_glas=0.11,
                        w_text=0.025, return_all=False):
    """
    Calculate the penalties based on the described policies.

    This function calculates all penalties associated with the candidate
    solution. For each fraction it measures by how much the constraints are
    violated and applies the weight of that fraction. The `fractions` dict
    holds, per fraction, the maximum allowed walking distance to a container,
    the maximum number of households connected to a container, and the
    weight.

    Input:
    good_result: dataframe containing, per address or address POI, the
    distance to the nearest container for all fractions.
    aansluitingen: dataframe containing, for all clusters, the number of
    containers per fraction, the number of people using these containers and
    the percentage of occupancy compared to the norm.

    Output:
    The sum of all penalties as a single float, or the list of individual
    penalties if return_all is True
    """
    # Create dict containing max_dist, max_connection and weight per fraction
    fractions = {'rest': [100, 100, w_rest], 'plastic': [150, 200, w_plas],
                 'papier': [150, 200, w_papi], 'glas': [150, 200, w_glas],
                 'textiel': [200, 750, w_text]}
    MAX_PERC = 100  # To prevent magic numbers
    NORMAL = 1500  # Normalization factor to balance both types of penalties
    penalties = []  # This list will store all penalties
    for k, v in fractions.items():  # Per fraction
        # Filter data for all entries that violate maximal walking distance
        dist_pen = good_result[good_result[f'{k}_afstand'] > v[0]]
        # Filter data for all containers having too many households attached
        conn_pen = aansluitingen[aansluitingen[f'{k}_perc'] > MAX_PERC]
        if not use_count:
            # Average amount of meters over maximum limit
            penalties.append((dist_pen[f'{k}_afstand'] - v[0]).sum() /
                             good_result.shape[0] * v[2])
            # Ratio of amount of households over threshold (with normalization)
            penalties.append((conn_pen[f'poi_{k}'] - (conn_pen[k] * v[1]))
                             .sum() / good_result['count'].sum() * v[2] *
                             NORMAL)
        else:  # Use count column to weigh according to households
            penalties.append(((dist_pen[f'{k}_afstand'] - v[0]) *
                              dist_pen['count']).sum()/good_result['count']
                             .sum() * v[2])
            penalties.append((conn_pen[f'poi_{k}'] - (conn_pen[k] * v[1]))
                             .sum() / good_result['count'].sum() * v[2] *
                             NORMAL)
    if return_all:  # Return all contributing factors
        return penalties
    else:  # Return only the sum of the penalties
        return sum(penalties)

In [None]:
calculate_penalties(good_result, aansluitingen, return_all=True)
# calculate_simple_penalties(good_result, aansluitingen, use_count=True, use_weight=True, return_all=True)

# good_result[good_result['plastic_afstand'] > 150]['count'].sum()
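
To make the distance part of `calculate_penalties` concrete, the sketch below recomputes it for a single fraction on made-up data, following the `use_count=True` branch. The helper name `distance_penalty` and the toy values are assumptions for illustration; the threshold of 100 meters and the weight of 0.61 are the values used for rest above.

```python
import pandas as pd

def distance_penalty(df, fraction, max_dist, weight):
    """Hypothetical helper: household-weighted excess distance for one fraction."""
    col = f"{fraction}_afstand"
    # Keep only rows that violate the maximum walking distance.
    too_far = df[df[col] > max_dist]
    # Meters over the limit, weighted by the number of households per row.
    excess = ((too_far[col] - max_dist) * too_far["count"]).sum()
    # Average over all households, then apply the fraction weight.
    return excess / df["count"].sum() * weight

toy = pd.DataFrame({"rest_afstand": [80.0, 120.0, 200.0],
                    "count": [4, 2, 1]})

penalty = distance_penalty(toy, "rest", max_dist=100, weight=0.61)
print(round(penalty, 2))
```

Here two households are 20 meters over the limit and one household is 100 meters over, giving 140 excess meters spread over 7 households, which the weight then scales down.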

### Find areas to exclude

In [None]:
from bokeh.plotting import figure, show
from bokeh.models import GeoJSONDataSource, ColumnDataSource, HoverTool

# Addresses with households where one of these fractions is over 750 m away
far_away = good_result[((good_result['rest_afstand'] > 750) |
                        (good_result['plastic_afstand'] > 750) |
                        (good_result['papier_afstand'] > 750) |
                        (good_result['glas_afstand'] > 750)) &
                       (good_result['count'] > 0)]
long = pd.concat([far_away, bad_ones])

load = gpd.read_file('../data/Inzameling_huisvuil_100220.shp')
street_map_clean = load[load['aanbiedwij'] ==
                              "Breng uw restafval  naar een container voor restafval."]
bewoonde_wijken = gpd.read_file('../data/woonbrt10_region.shp')
wijken = gpd.read_file('../data/brtk2010_ind2005_region.shp')

geosource = GeoJSONDataSource(geojson=street_map_clean.to_json())
# `buurten` is not defined in this notebook and geosource2 is not used below
# geosource2 = GeoJSONDataSource(geojson=buurten.to_json())
geosource3 = GeoJSONDataSource(geojson=bewoonde_wijken.to_json())
geosource4 = GeoJSONDataSource(geojson=wijken.to_json())

source = ColumnDataSource(data=long)
source2 = ColumnDataSource(data=wijken)

# Tooltips for the address points (household count)
TOOLTIPS2 = [
            ("index", "$index"),
            ("(x,y)", "($x, $y)"),
            ("Count", "@count")]

# Tooltips for the wijken polygons (name)
TOOLTIPS = [
            ("index", "$index"),
            ("(x,y)", "($x, $y)"),
            ("Naam", "@BCNAAM")]

p = figure(match_aspect=True)
p.patches('xs', 'ys', source=geosource, fill_color='red', alpha=0.1, line_color=None)
p.patches('xs', 'ys', source=geosource3, fill_color='green', alpha=0.1, line_color=None)
r0 = p.patches('xs', 'ys', source=geosource4, fill_color=None, alpha=0.1)
p.add_tools(HoverTool(renderers=[r0], tooltips=TOOLTIPS))
r1 = p.circle(x='cluster_x', y='cluster_y', source=source, radius=10)
p.add_tools(HoverTool(renderers=[r1], tooltips=TOOLTIPS2))

show(p)

# long.plot(x='cluster_x', y='cluster_y', kind='scatter', s=0.1)

In [None]:
bad_ones

<a id='section1'></a>
### Exclusion of areas in Noord

In [None]:
# Create dataframe noord holding only households in Noord
# (copy so that adding a column below does not assign to a slice)
noord = all_households[all_households['in_neigborhood']].copy()

# Check which neighborhoods have no clusters
joined[joined['stadsdeel'] == 'N']['wijk'].value_counts()
joined_clean = joined[joined['stadsdeel'] == 'N']
joined_clean = joined_clean[joined_clean['wijk'] != 'N73']

# Load shapefile of buurten to exclude N73, since N64 is in an inhabited area
shapefile = gpd.read_file('../data/bc2010zw_region.shp')
Waterland = shapefile.iloc[69]['geometry']
Waterland

# Create column to check whether addresses are in N73
noord['in_n73'] = noord.apply(
    lambda row: shapely.geometry.Point(row['cluster_x'], row['cluster_y'])
    .within(Waterland), axis=1)
noord_clean = noord[~noord['in_n73']]

In [None]:
joined_cluster_distance, good_result_rich, aansluitingen, avg_distance, penalties = \
    analyze_candidate_solution(joined_clean, noord_clean, rel_poi_df,
                               df_afstandn2, clean=False, use_count=True)

In [None]:
joined_cluster_distance, good_result_rich, aansluitingen, avg_distance, penalties = \
    analyze_candidate_solution(joined, noord, rel_poi_df,
                               df_afstandn2, clean=False, use_count=True)

Excluding Landelijk Noord appears to have only a minor effect on the score for this area. This is probably because households without a container within reach were assigned a NaN value, which is excluded from the calculation. In the new implementation, this NaN is replaced with a distance of 2000 meters.
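
The effect of NaN handling on an average can be illustrated with a few made-up distances (the values are assumptions, not results from the data). pandas skips NaN when computing a mean, so unreachable households silently drop out unless a sentinel distance is filled in first.

```python
import numpy as np
import pandas as pd

# Toy distances in meters; NaN stands for a household without a container in reach.
distances = pd.Series([80.0, 120.0, np.nan, 100.0])

# pandas ignores NaN by default, so only three households are averaged.
mean_skipping_nan = distances.mean()

# Filling the gap with 2000 meters makes the unreachable household count.
mean_with_fill = distances.fillna(2000).mean()

print(mean_skipping_nan, mean_with_fill)
```

With the fill value in place, removing such households from the input changes the average noticeably, which is the behavior one would expect from excluding a poorly served area.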

In [None]:
good_result_rich[good_result_rich['rest_afstand'].isna()]