## Heuristic optimizations
Several tweaks are needed to get the final product working as desired.

[Validation of pipeline](#section00)

[Simple penalty function](#section0)

[Exclusion of areas in Noord](#section1)

In [100]:
import sys
import geopandas as gpd
import pandas as pd
import numpy as np
import shapely

sys.path.append("../")

from Code.helper_functions import initial_loading, analyze_candidate_solution,\
add_shortest_distances_to_all_households, calculate_weighted_distance
from Code.loading_data import distance_matrix_with_counts, load_api_data,\
create_all_households, create_aansluitingen

<a id='section00'></a>
## Validation of pipeline

The initial pipeline had many small problems, so every part of it needed to be validated:
- number of households per bag
- total households
- locations
- exclusion of some rural areas (Ransdorp, Durgerdam etc.)
- penalties
- check the total number of households
- search for rural areas and possibly exclude them
- exclude households that have more than 1000 meters walking distance


In [None]:
rel_poi_df = pd.read_csv('../Data/postgres_db/info_pois.csv', index_col='Unnamed: 0')
rel_poi_df.head()
# rel_poi_df = rel_poi_df[['bk_afv_rel_nodes_poi','s1_afv_nodes', 's1_afv_poi', 'cluster_x', 'cluster_y', 'type', 'bag']]

In [None]:
# rel_poi_df['bag'].value_counts()
# rel_poi_df[rel_poi_df['bag'] == '363100012241555']
# df = pd.read_csv('../Data/households_per_cluster.csv')
# df[df['ligtin_bag_pnd_identificatie'] == '0363100012241555']
info_poi = pd.read_csv('../Data/postgres_db/afv_poi.csv')
temp = rel_poi_df.groupby('bag').first().reset_index().rename(columns={0:'bag'})

In [None]:
inpt_dfob = pd.read_csv('../Data/postgres_db/addresses_per_cluster.csv')
inpt_dis = pd.read_csv('../Data/postgres_db/distance_matrix.csv')

In [None]:
df_afstandn2 = distance_matrix_with_counts(inpt_dfob=inpt_dfob,
                                                       inpt_poi=rel_poi_df,
                                                       inpt_dis=inpt_dis,
                                                       get_data=False)

In [None]:
api_df = load_api_data(subsectie=None)

In [2]:
# automatic
all_households, rel_poi_df, joined, df_afstandn2 = initial_loading()

Do you want to use addresses instead of clusters?True
What stadsdeel do you want to make as a subsection(optional parameter)?
What is the maximum amount of containers in a cluster that is considered to be useful?8
Where to get db files(local/online)?local
DB relation POIs loaded
446641
446641


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2['ligtin_bag_pnd_identificatie'] = \
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  verblijfsobjecten['bag'] = verblijfsobjecten.loc[:, 'bag']\


446257.0
Index(['van_s1_afv_nodes', 'naar_s1_afv_nodes', 'afstand', 'count'], dtype='object')
distance matrix loaded
API data loaded
Table all households created
API and DB joined
containers per cluster determined
move_rest determined


### Find areas to exclude

In [6]:
joined_cluster_distance = joined.set_index('s1_afv_nodes')\
    .join(df_afstandn2.set_index('van_s1_afv_nodes')).reset_index()\
    .rename(columns={'index':'van_s1_afv_nodes'})

In [87]:
good_result = add_shortest_distances_to_all_households(all_households,
                                                      joined_cluster_distance,
                                                      use_count=True)
good_result['rest_afstand'].isna().sum()

47368

In [108]:
good_result['count'] = good_result['count'].fillna(0)

# Sentinel values: POI id 999 and a distance of 2000 m mark fractions that
# have no container within reach
poi_cols = ['poi_rest', 'poi_plastic', 'poi_papier', 'poi_glas', 'poi_textiel']
afstand_cols = ['rest_afstand', 'plastic_afstand', 'papier_afstand',
                'glas_afstand', 'textiel_afstand']
good_result[poi_cols] = good_result[poi_cols].fillna(999)
good_result[afstand_cols] = good_result[afstand_cols].fillna(2000)


good_result.loc[~good_result['uses_container'],
                        'rest_afstand'] = np.nan
good_result.loc[~good_result['uses_container'], 'poi_rest'] = np.nan
# good_result

In [109]:
aansluitingen = create_aansluitingen(good_result, joined_cluster_distance, use_count=True)

In [124]:
(good_result['rest_afstand'] * good_result['count']).sum()/good_result['count'].sum()

62.22752463667192

In [125]:
def calculate_weighted_distance(good_result, use_count=False, w_rest=0.61,
                                w_plas=0.089, w_papi=0.16, w_glas=0.11,
                                w_text=0.025, return_all=False):
    """
    Calculate the weighted distance of an input dataframe.

    Calculate the weighted average walking distance as part of the score
    function. It calculates the mean distance per fraction and combines these
    into a single score using the weights assigned to each fraction.

    Input:
    dataframe containing the distance from all households to their nearest
    container per fraction (known as good_result)

    Output:
    float representing the weighted average distance, or a tuple with the
    mean distance per fraction if return_all is True
    """
    if not use_count:
        rest_mean = good_result['rest_afstand'].mean()
        papier_mean = good_result['papier_afstand'].mean()
        glas_mean = good_result['glas_afstand'].mean()
        plastic_mean = good_result['plastic_afstand'].mean()
        textiel_mean = good_result['textiel_afstand'].mean()
    else:
        rest_mean = (good_result['rest_afstand'] * good_result['count'])\
            .sum() / good_result['count'].sum()
        papier_mean = (good_result['papier_afstand'] * good_result['count'])\
            .sum() / good_result['count'].sum()
        plastic_mean = (good_result['plastic_afstand'] * good_result['count'])\
            .sum() / good_result['count'].sum()
        glas_mean = (good_result['glas_afstand'] * good_result['count'])\
            .sum() / good_result['count'].sum()
        textiel_mean = (good_result['textiel_afstand'] * good_result['count'])\
            .sum() / good_result['count'].sum()

    if return_all:  # Return all individual mean distances (for analysis)
        return rest_mean, papier_mean, glas_mean, plastic_mean, textiel_mean

    # Multiply mean distance per fraction with its relative importance
    score = w_rest * rest_mean + w_plas * plastic_mean + w_papi * papier_mean + \
        w_glas * glas_mean + w_text * textiel_mean
    return score
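
The count-weighted score computed by the function above can be summarized in formula form. The symbols are notation introduced here for illustration and do not appear in the code: $d_{i,f}$ is the distance from address $i$ to its nearest container of fraction $f$, $c_i$ is the `count` of address $i$, and $w_f$ is the weight of fraction $f$.

$$
\bar{d}_f = \frac{\sum_i c_i \, d_{i,f}}{\sum_i c_i}, \qquad
\text{score} = \sum_{f} w_f \, \bar{d}_f
$$

With `use_count=False`, $\bar{d}_f$ is the plain mean of $d_{i,f}$ over all addresses.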

In [155]:
avg_distance = calculate_weighted_distance(good_result,
                                           use_count=True,
                                           return_all=True)
avg_distance


# Address POIs that have more than one fraction without a container within
# 1 km, yet do contain dwellings with a residential function (woonfunctie)
n_missing = (good_result[['poi_rest', 'poi_plastic', 'poi_papier',
                          'poi_glas', 'poi_textiel']] == 999.0).sum(axis=1)
has_households = good_result['count'] > 0
bad_ones = good_result[(n_missing > 1) & has_households]
good_result_clean = good_result[(n_missing < 2) & has_households]

avg_distance2 = calculate_weighted_distance(good_result_clean,
                                            use_count=True,
                                            return_all=True)
avg_distance2, avg_distance

((62.162276628423925,
  159.7853990460581,
  184.09729861226975,
  235.24120648621596,
  471.00252453554214),
 (62.22752463667192,
  160.15394380460344,
  184.51844078494068,
  236.27765400525217,
  471.93563609079337))
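
The filter used above keeps an address only if it has households and at most one fraction carries the sentinel POI value 999. As a minimal sketch of that logic on made-up data (the toy rows below are hypothetical, only the column names come from the notebook):

```python
import pandas as pd

poi_cols = ["poi_rest", "poi_plastic", "poi_papier", "poi_glas", "poi_textiel"]

# Hypothetical toy rows: POI ids per fraction plus the household count
toy = pd.DataFrame(
    [
        [1, 2, 3, 4, 5, 3],            # all fractions reachable
        [1, 999, 3, 4, 5, 2],          # one fraction missing
        [999, 999, 3, 4, 5, 4],        # two fractions missing
        [999, 999, 999, 999, 999, 0],  # no households at this address
    ],
    columns=poi_cols + ["count"],
)

# Number of fractions per address that carry the sentinel value 999
n_missing = (toy[poi_cols] == 999).sum(axis=1)
has_households = toy["count"] > 0

bad = toy[(n_missing > 1) & has_households]
clean = toy[(n_missing < 2) & has_households]

print(bad.index.tolist(), clean.index.tolist())
```

Addresses without households end up in neither selection, which matches the `count > 0` condition in both filters.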

<a id='section1'></a>
### Exclusion of areas in North

In [None]:
# Create dataframe noord holding only households in Noord
# (.copy() so the new column below is not set on a slice of all_households)
noord = all_households[all_households['in_neigborhood']].copy()

# Check which neighborhoods have no clusters
joined[joined['stadsdeel'] == 'N']['wijk'].value_counts()
joined_clean = joined[joined['stadsdeel'] == 'N']
joined_clean = joined_clean[joined_clean['wijk'] != 'N73']

# Load shapefile of buurten to exclude N73, since N64 is in an inhabited area
shapefile = gpd.read_file('../data/bc2010zw_region.shp')
Waterland = shapefile.iloc[69]['geometry']
Waterland

# Create column to check if addresses are in N73
noord['in_n73'] = noord.apply(
    lambda row: shapely.geometry.Point(row['cluster_x'],
                                       row['cluster_y']).within(Waterland),
    axis=1)
noord_clean = noord[~noord['in_n73']]
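
The point-in-polygon test used in the cell above can be tried in isolation. The sketch below uses a made-up square instead of the Waterland geometry and hypothetical coordinates, so it only illustrates the `Point.within` check, not the actual area boundaries:

```python
import pandas as pd
from shapely.geometry import Point, Polygon

# Hypothetical square region standing in for the real buurt geometry
region = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])

# Hypothetical cluster coordinates
toy = pd.DataFrame({
    "cluster_x": [5.0, 15.0, 1.0],
    "cluster_y": [5.0, 5.0, 9.0],
})

# Flag every address whose point lies inside the region
toy["in_region"] = [
    Point(x, y).within(region)
    for x, y in zip(toy["cluster_x"], toy["cluster_y"])
]

# Keep only the addresses outside the region
outside = toy[~toy["in_region"]]
print(toy["in_region"].tolist(), len(outside))
```

Iterating over the two coordinate columns with `zip` avoids the per-row overhead of `DataFrame.apply`, which may matter for the full set of addresses.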

In [None]:
joined_cluster_distance, good_result_rich, aansluitingen, avg_distance, penalties = analyze_candidate_solution(joined_clean, noord_clean, rel_poi_df, df_afstandn2, clean=False, use_count=True)

In [None]:
joined_cluster_distance, good_result_rich, aansluitingen, avg_distance, penalties = analyze_candidate_solution(joined, noord, rel_poi_df, df_afstandn2, clean=False, use_count=True)

Excluding Landelijk Noord appears to have only a minor effect on the score of this area. This is probably because households without a container within reach are assigned a NaN value, which is excluded from the calculation. In the new implementation, this NaN is replaced with a distance of 2000 meters.
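
To make the effect of the NaN handling concrete, here is a minimal sketch on made-up data (three hypothetical addresses, one of them without a container in reach). It compares the mean with NaN left in place to the mean after filling in the 2000 meter penalty, for both the plain and the count-weighted form:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the last address has no container within reach
toy = pd.DataFrame({
    "rest_afstand": [50.0, 150.0, np.nan],
    "count": [10, 10, 10],
})

# Series.mean() skips NaN, so the unreachable address does not affect the mean
mean_skipping_nan = toy["rest_afstand"].mean()

# Replacing NaN with a 2000 m penalty makes the unreachable address count
mean_with_penalty = toy["rest_afstand"].fillna(2000).mean()

# In the count-weighted form, sum() also skips NaN in the numerator, while
# the denominator still includes the count of the unreachable address
weighted_with_nan = (
    (toy["rest_afstand"] * toy["count"]).sum() / toy["count"].sum()
)
weighted_with_penalty = (
    (toy["rest_afstand"].fillna(2000) * toy["count"]).sum()
    / toy["count"].sum()
)

print(mean_skipping_nan, mean_with_penalty)
print(weighted_with_nan, weighted_with_penalty)
```

On this toy data the count-weighted mean with NaN left in place comes out lower than the plain mean, because the unreachable address still adds its count to the denominator. Whether that matters for the real data depends on how many such addresses remain after the fill.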

In [None]:
good_result_rich[good_result_rich['rest_afstand'].isna()]