<img src="img/header.png" style="width:100%">

<h1 style="font-size: 2.5em; background-color: #DC7D2D; padding: 1em">Problem Definition</h1>

When I worked on the final for the first of the machine learning courses (Introduction to Supervised Learning), I obtained some surprising results. My goal was to build a supervised learning model that could predict parents' satisfaction with their kindergartens in the Oslo area of Norway. I focused on three key quality indicators for kindergartens: the quality of toys, food and outdoor areas. However, after training three different models, I found that none of them could predict the quality of outdoor areas with any accuracy (see figure below). In fact, the Lasso model did worse than simply predicting the average score for all kindergartens (that is, it received a negative R-squared score). This stands in stark contrast to the results for food quality, which most of the models could predict with reasonable accuracy (R-squared of about 0.4 for the best model). The aim of this notebook is to use visualizations to explore why the supervised learning models were so much better at predicting the quality of the food than the quality of the outdoor areas.
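
To make the negative R-squared concrete, here is a minimal sketch with made-up numbers (not the survey data). R-squared compares a model's squared error with the squared error of always predicting the mean, so a model whose predictions miss by more than the mean does ends up below zero. The function name `r_squared` and the toy arrays are assumptions for illustration only.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot, where SS_tot is the error of predicting the mean."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up scores for four kindergartens (illustration only)
y_true = np.array([4.0, 4.2, 3.8, 4.4])

# Baseline: predict the average score for every kindergarten
mean_baseline = np.full_like(y_true, y_true.mean())

# Hypothetical model whose predictions tend to move in the wrong direction
poor_model = np.array([4.4, 3.8, 4.2, 4.0])

print(r_squared(y_true, mean_baseline))  # 0.0: same error as the mean
print(r_squared(y_true, poor_model))     # negative: worse than the mean
```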

In [1]:
# Importing libraries

import pandas as pd
import numpy as np
import requests
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import warnings
import altair as alt

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')

<h1 style="font-size: 2.5em; background-color: #DC7D2D; padding: 1em">Importing data</h1>

* The data is taken from the Norwegian Parent Survey for kindergartens. The Parent Survey for kindergartens is a survey that gives parents an opportunity to state how they feel about the kindergarten services. The survey allows the kindergartens and their owners to know how satisfied parents are with the services that the kindergartens provide.

* The survey is carried out every year and includes parents all across the country. However, in the dataset for this project, we only use data for 2022, and we only include kindergartens in counties that belong to the Oslo Urban Area. That is, we only include kindergartens that are in the capital of Norway or the surrounding area.

* The survey consists of 30 questions that allow the parents to express their opinion about different aspects of the kindergarten services. However, in our dataset, we only include three questions: how the parents perceive (1) the outdoors playing area, (2) the toys available to the kids and (3) the quality of the food served.

* The survey data is published as aggregate numbers for each kindergarten (and municipality, county), and their score on each question is therefore an average of what the parents thought about the kindergarten's services.

* The survey also includes some background information about the kindergartens such as whether they are public or private, their size (measured in number of kids) and location.

* The data can be found following this <a href="https://www.udir.no/tall-og-forskning/statistikk/statistikk-barnehage/foreldreundersokelsen-i-barnehager--resultater-etter-fylke/?rapportsideKode=BHG_Fuba_Fylk&filtre=AldergruppeID(-10)_BarnehageenhetID(-12)_BarnehagestoerrelsegruppeID(-10)_KjoennID(-10)_KommunalitetID(-10)_SpoersmaalID(-31_-30_-29_-28_-27_-26_-25_-24_-23)_TidID(202212)_VisAntallBesvart(0)&radsti=F!(1)_(*)_(1.*)">link</a>

In [2]:
df = pd.read_csv("data/kindergarten.csv", encoding="utf16", sep="\t")

In [3]:
df

Unnamed: 0,BarnehageenhetNivaa,Nasjonaltkode,Fylkekode,Kommunekode,Organisasjonsnummer,Nasjonalt,Fylke,Kommune,Barnehageenhetnavn,2022.1 - 25 barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Utearealer.Snittsvar,2022.1 - 25 barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Leker og utstyr.Snittsvar,2022.1 - 25 barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Mattilbudet.Snittsvar,2022.1 - 25 barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Utearealer.Snittsvar,2022.1 - 25 barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Leker og utstyr.Snittsvar,2022.1 - 25 barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Mattilbudet.Snittsvar,2022.26 - 50 barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Utearealer.Snittsvar,2022.26 - 50 barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Leker og utstyr.Snittsvar,2022.26 - 50 barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Mattilbudet.Snittsvar,2022.26 - 50 barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Utearealer.Snittsvar,2022.26 - 50 barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Leker og utstyr.Snittsvar,2022.26 - 50 barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Mattilbudet.Snittsvar,2022.51 - 75 barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Utearealer.Snittsvar,2022.51 - 75 barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Leker og utstyr.Snittsvar,2022.51 - 75 barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Mattilbudet.Snittsvar,2022.51 - 75 barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Utearealer.Snittsvar,2022.51 - 75 barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Leker og utstyr.Snittsvar,2022.51 - 75 barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Mattilbudet.Snittsvar,2022.76 + barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Utearealer.Snittsvar,2022.76 + barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Leker og utstyr.Snittsvar,2022.76 + barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Mattilbudet.Snittsvar,2022.76 + barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Utearealer.Snittsvar,2022.76 + barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Leker og utstyr.Snittsvar,2022.76 + barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Mattilbudet.Snittsvar
0,1,I,,,I,Hele landet,Alle fylker,Alle kommuner,Alle barnehager,41,42,39,43,44,44,40,41,37,42,43,42,40,41,36,42,43,41,40,41,36,42,43,42
1,2,I,3.0,,03,Hele landet,Oslo,Alle kommuner,Alle barnehager,40,41,36,41,44,43,38,41,36,42,43,41,39,41,34,40,43,41,40,41,34,40,43,42
2,3,I,3.0,30112.0,030112,Hele landet,Oslo,Alna,Alle barnehager,43,44,44,43,43,36,37,39,33,44,43,43,37,39,37,38,41,40,41,41,36,38,41,39
3,4,I,3.0,30112.0,996797864,Hele landet,Oslo,Alna,Barneslottet barnehage,,,,,,,,,,,,,,,,,,,43,41,37,,,
4,4,I,3.0,30112.0,973111965,Hele landet,Oslo,Alna,Fresesarmeens barnehager Teisentopp,,,,,,,,,,,,,,,,,,,,,,38,40,39
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
865,4,I,30.0,3027.0,987067276,Hele landet,Viken,Rælingen,Smestadtoppen barnehage,,,,,,,,,,,,,,,,,,,41,41,36,,,
866,4,I,30.0,3027.0,872215492,Hele landet,Viken,Rælingen,Tangen barnehage SA,,,,,,,,,,43,45,34,,,,,,,,,,,,
867,4,I,30.0,3027.0,991298207,Hele landet,Viken,Rælingen,Tomter Fus barnehage AS,,,,,,,,,,,,,,,,42,42,38,,,,,,
868,4,I,30.0,3027.0,988860298,Hele landet,Viken,Rælingen,Torva barnehage,,,,,,,,,,,,,33,38,30,,,,,,,,,


<h1 style="font-size: 2.5em; background-color: #DC7D2D; padding: 1em">Data cleaning</h1>

The dataset comes in a form that requires a considerable amount of data wrangling to make it ready for any type of analysis. In this section, we will carry out the following operations:

* Remove unnecessary aggregates
* Translate Norwegian terms to English
* Compress sparse data into a smaller number of columns
* Change European decimal notation with ',' to American notation with '.'

In [4]:
# Remove all rows that are aggregates for city districts, the municipality, the county or the country.
# This is done by keeping only rows where BarnehageenhetNivaa == 4 (individual kindergartens)

mask = df["BarnehageenhetNivaa"] == 4
df = df[mask].copy()  # copy so that later column assignments do not act on a view

In [5]:
transformation_dict = {
       'Organisasjonsnummer': 'Company registration number',
       'Kommune': 'Borough of Oslo',
       'Barnehageenhetnavn': 'Kindergarden Name',
       '2022.1 - 25 barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Utearealer.Snittsvar': '1-25_Public_Outdoors',
       '2022.1 - 25 barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Leker og utstyr.Snittsvar': '1-25_Public_Toys',
       '2022.1 - 25 barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Mattilbudet.Snittsvar': '1-25_Public_Food',
       '2022.1 - 25 barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Utearealer.Snittsvar': '1-25_Private_Outdoors',
       '2022.1 - 25 barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Leker og utstyr.Snittsvar': '1-25_Private_Toys',
       '2022.1 - 25 barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Mattilbudet.Snittsvar': '1-25_Private_Food',
       '2022.26 - 50 barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Utearealer.Snittsvar': '26-50_Public_Outdoors',
       '2022.26 - 50 barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Leker og utstyr.Snittsvar': '26-50_Public_Toys',
       '2022.26 - 50 barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Mattilbudet.Snittsvar': '26-50_Public_Food',
       '2022.26 - 50 barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Utearealer.Snittsvar': '26-50_Privat_Outdoors',
       '2022.26 - 50 barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Leker og utstyr.Snittsvar': '26-50_Privat_Toys',
       '2022.26 - 50 barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Mattilbudet.Snittsvar': '26-50_Privat_Food',
       '2022.51 - 75 barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Utearealer.Snittsvar': '51-75_Public_Outdoors',
       '2022.51 - 75 barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Leker og utstyr.Snittsvar': '51-75_Public_Toys',
       '2022.51 - 75 barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Mattilbudet.Snittsvar': '51-75_Public_Food',
       '2022.51 - 75 barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Utearealer.Snittsvar': '51-75_Private_Outdoors',
       '2022.51 - 75 barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Leker og utstyr.Snittsvar': '51-75_Private_Toys',
       '2022.51 - 75 barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Mattilbudet.Snittsvar': '51-75_Private_Food',
       '2022.76 + barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Utearealer.Snittsvar': '76+_Public_Outdoors',
       '2022.76 + barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Leker og utstyr.Snittsvar': '76+_Public_Toys',
       '2022.76 + barn.Kommunal.Alle aldersgrupper.Ute- og innemiljø.Mattilbudet.Snittsvar': '76+_Public_Food',
       '2022.76 + barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Utearealer.Snittsvar': '76+_Private_Outdoors',
       '2022.76 + barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Leker og utstyr.Snittsvar': '76+_Private_Toys',
       '2022.76 + barn.Privat.Alle aldersgrupper.Ute- og innemiljø.Mattilbudet.Snittsvar': '76+_Private_Food'
}

In [6]:
df = df.rename(columns=transformation_dict)

In [7]:
df.columns

Index(['BarnehageenhetNivaa', 'Nasjonaltkode', 'Fylkekode', 'Kommunekode',
       'Company registration number', 'Nasjonalt', 'Fylke', 'Borough of Oslo',
       'Kindergarden Name', '1-25_Public_Outdoors', '1-25_Public_Toys',
       '1-25_Public_Food', '1-25_Private_Outdoors', '1-25_Private_Toys',
       '1-25_Private_Food', '26-50_Public_Outdoors', '26-50_Public_Toys',
       '26-50_Public_Food', '26-50_Privat_Outdoors', '26-50_Privat_Toys',
       '26-50_Privat_Food', '51-75_Public_Outdoors', '51-75_Public_Toys',
       '51-75_Public_Food', '51-75_Private_Outdoors', '51-75_Private_Toys',
       '51-75_Private_Food', '76+_Public_Outdoors', '76+_Public_Toys',
       '76+_Public_Food', '76+_Private_Outdoors', '76+_Private_Toys',
       '76+_Private_Food'],
      dtype='object')

In [8]:
# Remove unnecessary columns

df = df[['Company registration number', 'Borough of Oslo','Kindergarden Name', 
    '1-25_Public_Outdoors', '1-25_Public_Toys','1-25_Public_Food', '1-25_Private_Outdoors', '1-25_Private_Toys','1-25_Private_Food', 
    '26-50_Public_Outdoors', '26-50_Public_Toys','26-50_Public_Food', '26-50_Privat_Outdoors', '26-50_Privat_Toys', '26-50_Privat_Food', 
    '51-75_Public_Outdoors', '51-75_Public_Toys','51-75_Public_Food', '51-75_Private_Outdoors', '51-75_Private_Toys','51-75_Private_Food', 
    '76+_Public_Outdoors', '76+_Public_Toys','76+_Public_Food', '76+_Private_Outdoors', '76+_Private_Toys','76+_Private_Food']]

In [9]:
df

Unnamed: 0,Company registration number,Borough of Oslo,Kindergarden Name,1-25_Public_Outdoors,1-25_Public_Toys,1-25_Public_Food,1-25_Private_Outdoors,1-25_Private_Toys,1-25_Private_Food,26-50_Public_Outdoors,26-50_Public_Toys,26-50_Public_Food,26-50_Privat_Outdoors,26-50_Privat_Toys,26-50_Privat_Food,51-75_Public_Outdoors,51-75_Public_Toys,51-75_Public_Food,51-75_Private_Outdoors,51-75_Private_Toys,51-75_Private_Food,76+_Public_Outdoors,76+_Public_Toys,76+_Public_Food,76+_Private_Outdoors,76+_Private_Toys,76+_Private_Food
3,996797864,Alna,Barneslottet barnehage,,,,,,,,,,,,,,,,,,,43,41,37,,,
4,973111965,Alna,Fresesarmeens barnehager Teisentopp,,,,,,,,,,,,,,,,,,,,,,38,40,39
5,975317161,Alna,Frydenlund barnehage,,,,,,,,,,,,,39,39,33,,,,,,,,,
6,975317145,Alna,Furustien barnehage,,,,,,,,,,,,,37,39,39,,,,,,,,,
7,893765832,Alna,Gransbakken barnehage,,,,,,,,,,,,,,,,,,,39,40,32,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
865,987067276,Rælingen,Smestadtoppen barnehage,,,,,,,,,,,,,,,,,,,41,41,36,,,
866,872215492,Rælingen,Tangen barnehage SA,,,,,,,,,,43,45,34,,,,,,,,,,,,
867,991298207,Rælingen,Tomter Fus barnehage AS,,,,,,,,,,,,,,,,42,42,38,,,,,,
868,988860298,Rælingen,Torva barnehage,,,,,,,,,,,,,33,38,30,,,,,,,,,


In [10]:
# Functions to compress the sparse score columns into a few dense columns

def public_vs_private(rows):
    """Return 'private' if any of the private-ownership columns holds a value, else 'public'."""
    return 'private' if any(pd.notnull(element) for element in rows) else 'public'


def size_of_kindergarden(rows):
    """Return the size bracket of the column that holds a value.

    The input columns must be ordered as (public, private) pairs,
    one pair per size bracket, from smallest to largest.
    """
    size_labels = ['1-25', '26-50', '51-75', '76+']
    size = None
    for number, element in enumerate(rows):
        if pd.notnull(element):
            label = size_labels[number // 2]
            if size is not None and size != label:
                raise Exception("Kindergarden can only have one size. Check input columns!")
            size = label
    return size


def score_on_category(rows):
    """Return the single non-null score among the columns, or None if there is none."""
    list_of_elements = [element for element in rows if pd.notnull(element)]
    if len(list_of_elements) > 1:
        raise Exception("Kindergardens can only have one score for each category. Check input columns!")
    # Return None instead of failing with an IndexError when no score is present
    return list_of_elements[0] if list_of_elements else None

In [11]:
df['Ownership'] = df[[
    '1-25_Private_Outdoors', '1-25_Private_Toys','1-25_Private_Food', 
    '26-50_Privat_Outdoors', '26-50_Privat_Toys', '26-50_Privat_Food', 
    '51-75_Private_Outdoors', '51-75_Private_Toys','51-75_Private_Food', 
    '76+_Private_Outdoors', '76+_Private_Toys','76+_Private_Food'
]].apply(public_vs_private, axis=1)

In [12]:
df['Size'] = df[[
    '1-25_Public_Outdoors', '1-25_Private_Outdoors',
    '26-50_Public_Outdoors', '26-50_Privat_Outdoors',
    '51-75_Public_Outdoors', '51-75_Private_Outdoors',
    '76+_Public_Outdoors', '76+_Private_Outdoors'
]].apply(size_of_kindergarden, axis=1)

In [13]:
df['Outdoors_score'] = df[[
    '1-25_Public_Outdoors','1-25_Private_Outdoors', '26-50_Public_Outdoors', '26-50_Privat_Outdoors', 
    '51-75_Public_Outdoors','51-75_Private_Outdoors', '76+_Public_Outdoors','76+_Private_Outdoors'
]].apply(score_on_category, axis=1)

In [14]:
df['Toys_score'] =  df[[
    '1-25_Public_Toys', '1-25_Private_Toys', '26-50_Public_Toys', '26-50_Privat_Toys', 
    '51-75_Public_Toys', '51-75_Private_Toys', '76+_Public_Toys', '76+_Private_Toys'
]].apply(score_on_category, axis=1)

In [15]:
df['Food_score'] = df[[
    '1-25_Public_Food','1-25_Private_Food','26-50_Public_Food','26-50_Privat_Food',
    '51-75_Public_Food','51-75_Private_Food','76+_Public_Food','76+_Private_Food'  
]].apply(score_on_category, axis=1)
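
The `apply` calls above scan each row in Python. As an alternative sketch (not the approach used in this notebook), the same "first non-null value per row" compression can be expressed with pandas' `bfill` along the column axis. The helper name `first_valid_per_row` and the toy frame are assumptions made for illustration. Unlike `score_on_category`, this version does not raise an error when a row holds more than one value.

```python
import numpy as np
import pandas as pd


def first_valid_per_row(frame):
    """Return, for each row, the first non-null value across the columns."""
    # Back-filling along axis=1 pulls the first non-null value into the first column
    return frame.bfill(axis=1).iloc[:, 0]


# Toy frame that mimics the sparse layout: one score per row, spread over columns
toy = pd.DataFrame({
    '1-25_Public_Food': ['3,9', np.nan, np.nan],
    '26-50_Public_Food': [np.nan, np.nan, '3,4'],
    '51-75_Public_Food': [np.nan, '4,1', np.nan],
})

toy['Food_score'] = first_valid_per_row(toy)
print(toy['Food_score'].tolist())  # ['3,9', '4,1', '3,4']
```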

In [16]:
df.columns

Index(['Company registration number', 'Borough of Oslo', 'Kindergarden Name',
       '1-25_Public_Outdoors', '1-25_Public_Toys', '1-25_Public_Food',
       '1-25_Private_Outdoors', '1-25_Private_Toys', '1-25_Private_Food',
       '26-50_Public_Outdoors', '26-50_Public_Toys', '26-50_Public_Food',
       '26-50_Privat_Outdoors', '26-50_Privat_Toys', '26-50_Privat_Food',
       '51-75_Public_Outdoors', '51-75_Public_Toys', '51-75_Public_Food',
       '51-75_Private_Outdoors', '51-75_Private_Toys', '51-75_Private_Food',
       '76+_Public_Outdoors', '76+_Public_Toys', '76+_Public_Food',
       '76+_Private_Outdoors', '76+_Private_Toys', '76+_Private_Food',
       'Ownership', 'Size', 'Outdoors_score', 'Toys_score', 'Food_score'],
      dtype='object')

In [17]:
# Removing unnecessary columns
df = df[['Company registration number', 'Kindergarden Name','Outdoors_score', 'Toys_score', 'Food_score', 
         'Borough of Oslo', 'Ownership', 'Size', 
    ]]

In [18]:
# Change European decimal notation with ',' to American notation with '.'
# '*' is a non-numeric placeholder in the source data and is treated as missing

for column in ['Outdoors_score', 'Toys_score', 'Food_score']:
    df[column] = df[column].replace('*', np.nan)
    df[column] = df[column].replace(',', '.', regex=True)
    df[column] = df[column].astype(float)
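
The conversion above cleans the columns after loading. As an alternative sketch, `pd.read_csv` can handle both conventions at read time through its `decimal` and `na_values` parameters. The tab-separated toy text below is made up for illustration, and this has not been tested against the actual survey file.

```python
import io

import pandas as pd

# Made-up tab-separated data using decimal commas and '*' as a placeholder
raw = "Name\tFood_score\nA barnehage\t3,7\nB barnehage\t*\nC barnehage\t4,1\n"

toy = pd.read_csv(io.StringIO(raw), sep="\t", decimal=",", na_values="*")

print(toy["Food_score"].dtype)     # float64
print(toy["Food_score"].tolist())  # [3.7, nan, 4.1]
```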

In [19]:
df

Unnamed: 0,Company registration number,Kindergarden Name,Outdoors_score,Toys_score,Food_score,Borough of Oslo,Ownership,Size
3,996797864,Barneslottet barnehage,4.3,4.1,3.7,Alna,public,76+
4,973111965,Fresesarmeens barnehager Teisentopp,3.8,4.0,3.9,Alna,private,76+
5,975317161,Frydenlund barnehage,3.9,3.9,3.3,Alna,public,51-75
6,975317145,Furustien barnehage,3.7,3.9,3.9,Alna,public,51-75
7,893765832,Gransbakken barnehage,3.9,4.0,3.2,Alna,public,76+
...,...,...,...,...,...,...,...,...
865,987067276,Smestadtoppen barnehage,4.1,4.1,3.6,Rælingen,public,76+
866,872215492,Tangen barnehage SA,4.3,4.5,3.4,Rælingen,private,26-50
867,991298207,Tomter Fus barnehage AS,4.2,4.2,3.8,Rælingen,private,51-75
868,988860298,Torva barnehage,3.3,3.8,3.0,Rælingen,public,51-75


<h1 style="font-size: 2.5em; background-color: #DC7D2D; padding: 1em">Low-fidelity Prototypes</h1>

# Simple Bivariate Plots - Outdoors vs. Food Score

In [20]:
outdoors_geo = alt.Chart(df).mark_circle().encode(
    x = alt.X("Borough of Oslo", title="District in Oslo Region"),
    y = alt.Y("Outdoors_score", title='Outdoors Area Score')
)

food_geo = alt.Chart(df).mark_circle().encode(
    x = alt.X("Borough of Oslo", title="District in Oslo Region"),
    y = alt.Y("Food_score", title='Food Quality Score')
)

geo = outdoors_geo | food_geo

# geo.save('charts/chart_1.png')

geo

In [21]:
outdoors_size = alt.Chart(df).mark_circle().encode(
    x = alt.X("Size", title="Size of Kindergarten"),
    y = alt.Y("Outdoors_score", title='Outdoors Area Score')
)

food_size = alt.Chart(df).mark_circle().encode(
    x = alt.X("Size", title="Size of Kindergarten"),
    y = alt.Y("Food_score", title='Food Quality Score')
)

size = outdoors_size | food_size
size

In [22]:
outdoors_own = alt.Chart(df).mark_circle().encode(
    x = "Ownership",
    y = alt.Y("Outdoors_score", title='Outdoors Area Score')
)

food_own = alt.Chart(df).mark_circle().encode(
    x = "Ownership",
    y = alt.Y("Food_score", title='Food Quality Score')
)

own = outdoors_own | food_own 
own

# Bivariate Plots with Mean - Outdoors vs. Food Score

In [23]:
# Calculate mean for each borough
mean_outdoors = df.groupby('Borough of Oslo')['Outdoors_score'].mean().reset_index()
mean_food = df.groupby('Borough of Oslo')['Food_score'].mean().reset_index()

# Merge the mean data back to get the order of boroughs
mean_outdoors['Order'] = mean_outdoors['Borough of Oslo'].apply(lambda x: df[df['Borough of Oslo'] == x].index[0])
mean_food['Order'] = mean_food['Borough of Oslo'].apply(lambda x: df[df['Borough of Oslo'] == x].index[0])

# Sort by the order to ensure the line goes through the boroughs in the correct order
mean_outdoors = mean_outdoors.sort_values('Order')
mean_food = mean_food.sort_values('Order')

# Outdoors chart with mean line across boroughs
outdoors_geo = alt.Chart(df).mark_circle().encode(
    x = alt.X("Borough of Oslo", title="District in Oslo Region"),
    y = alt.Y("Outdoors_score", title='Outdoors Area Score')
).properties(
    title='Outdoors Area Score by District in Oslo Region'
) + alt.Chart(mean_outdoors).mark_line(color='red').encode(
    x='Borough of Oslo',
    y='Outdoors_score'
)

# Food chart with mean line across boroughs
food_geo = alt.Chart(df).mark_circle().encode(
    x = alt.X("Borough of Oslo", title="District in Oslo Region"),
    y = alt.Y("Food_score", title='Food Quality Score')
).properties(
    title='Food Quality Score by District in Oslo Region'
) + alt.Chart(mean_food).mark_line(color='red').encode(
    x='Borough of Oslo',
    y='Food_score'
)

# Combine the charts
geo = outdoors_geo | food_geo

# Save the chart if needed
# geo.save('charts/chart_1.png')

geo


In [24]:
# Calculate mean for each size
mean_outdoors_size = df.groupby('Size')['Outdoors_score'].mean().reset_index()
mean_food_size = df.groupby('Size')['Food_score'].mean().reset_index()

# Sort by size to ensure the line goes through the sizes in the correct order
mean_outdoors_size = mean_outdoors_size.sort_values('Size')
mean_food_size = mean_food_size.sort_values('Size')

# Outdoors chart with mean line across sizes
outdoors_size = alt.Chart(df).mark_circle().encode(
    x = alt.X("Size", title="Size of Kindergarten"),
    y = alt.Y("Outdoors_score", title='Outdoors Area Score')
).properties(
    title='Outdoors Area Score by Size of Kindergarten'
) + alt.Chart(mean_outdoors_size).mark_line(color='red').encode(
    x='Size',
    y='Outdoors_score'
)

# Food chart with mean line across sizes
food_size = alt.Chart(df).mark_circle().encode(
    x = alt.X("Size", title="Size of Kindergarten"),
    y = alt.Y("Food_score", title='Food Quality Score')
).properties(
    title='Food Quality Score by Size of Kindergarten'
) + alt.Chart(mean_food_size).mark_line(color='red').encode(
    x='Size',
    y='Food_score'
)

# Combine the charts
size = outdoors_size | food_size

# Save the chart if needed
# size.save('charts/chart_1.png')

size


In [25]:
# Calculate mean for each ownership category
mean_outdoors_own = df.groupby('Ownership')['Outdoors_score'].mean().reset_index()
mean_food_own = df.groupby('Ownership')['Food_score'].mean().reset_index()

# Sort by ownership to ensure the line goes through the ownership categories in the correct order
mean_outdoors_own = mean_outdoors_own.sort_values('Ownership')
mean_food_own = mean_food_own.sort_values('Ownership')

# Outdoors chart with mean line across ownership categories
outdoors_own = alt.Chart(df).mark_circle().encode(
    x = alt.X("Ownership", title="Ownership"),
    y = alt.Y("Outdoors_score", title='Outdoors Area Score')
).properties(
    title='Outdoors Area Score by Ownership'
) + alt.Chart(mean_outdoors_own).mark_line(color='red').encode(
    x='Ownership',
    y='Outdoors_score'
)

# Food chart with mean line across ownership categories
food_own = alt.Chart(df).mark_circle().encode(
    x = alt.X("Ownership", title="Ownership"),
    y = alt.Y("Food_score", title='Food Quality Score')
).properties(
    title='Food Quality Score by Ownership'
) + alt.Chart(mean_food_own).mark_line(color='red').encode(
    x='Ownership',
    y='Food_score'
)

# Combine the charts
own = outdoors_own | food_own

# Save the chart if needed
# own.save('charts/chart_1.png')

own

# Advanced Bivariate Plots - Outdoors vs. Food Score

In [26]:
# Calculate mean for each borough
mean_outdoors = df.groupby('Borough of Oslo')['Outdoors_score'].mean().reset_index()
mean_food = df.groupby('Borough of Oslo')['Food_score'].mean().reset_index()

# Merge the mean data back to get the order of boroughs
mean_outdoors['Order'] = mean_outdoors['Borough of Oslo'].apply(lambda x: df[df['Borough of Oslo'] == x].index[0])
mean_food['Order'] = mean_food['Borough of Oslo'].apply(lambda x: df[df['Borough of Oslo'] == x].index[0])

# Sort by the order to ensure the line goes through the boroughs in the correct order
mean_outdoors = mean_outdoors.sort_values('Order')
mean_food = mean_food.sort_values('Order')

# Outdoors chart with mean line across boroughs
outdoors_geo = alt.Chart(df).mark_boxplot().encode(
    x=alt.X("Borough of Oslo", title="District in Oslo Region"),
    y=alt.Y("Outdoors_score", title='Outdoors Area Score')
).properties(
    title='Outdoors Area Score by District in Oslo Region'
) + alt.Chart(mean_outdoors).mark_line(color='red').encode(
    x='Borough of Oslo',
    y='Outdoors_score'
)

# Food chart with mean line across boroughs
food_geo = alt.Chart(df).mark_boxplot().encode(
    x=alt.X("Borough of Oslo", title="District in Oslo Region"),
    y=alt.Y("Food_score", title='Food Quality Score')
).properties(
    title='Food Quality Score by District in Oslo Region'
) + alt.Chart(mean_food).mark_line(color='red').encode(
    x='Borough of Oslo',
    y='Food_score'
)

# Combine the charts
geo = outdoors_geo | food_geo

# Save the chart if needed
# geo.save('charts/chart_1.png')

geo


In [27]:
# Calculate mean for each size
mean_outdoors_size = df.groupby('Size')['Outdoors_score'].mean().reset_index()
mean_food_size = df.groupby('Size')['Food_score'].mean().reset_index()

# Sort by size to ensure the line goes through the sizes in the correct order
mean_outdoors_size = mean_outdoors_size.sort_values('Size')
mean_food_size = mean_food_size.sort_values('Size')

# Outdoors chart with mean line across sizes
outdoors_size = alt.Chart(df).mark_boxplot().encode(
    x = alt.X("Size", title="Size of Kindergarten"),
    y = alt.Y("Outdoors_score", title='Outdoors Area Score')
).properties(
    title='Outdoors Area Score by Size of Kindergarten'
) + alt.Chart(mean_outdoors_size).mark_line(color='red').encode(
    x='Size',
    y='Outdoors_score'
)

# Food chart with mean line across sizes
food_size = alt.Chart(df).mark_boxplot().encode(
    x = alt.X("Size", title="Size of Kindergarten"),
    y = alt.Y("Food_score", title='Food Quality Score')
).properties(
    title='Food Quality Score by Size of Kindergarten'
) + alt.Chart(mean_food_size).mark_line(color='red').encode(
    x='Size',
    y='Food_score'
)

# Combine the charts
size = outdoors_size | food_size

# Save the chart if needed
# size.save('charts/chart_1.png')

size


In [28]:
# Calculate mean for each ownership category
mean_outdoors_own = df.groupby('Ownership')['Outdoors_score'].mean().reset_index()
mean_food_own = df.groupby('Ownership')['Food_score'].mean().reset_index()

# Sort by ownership to ensure the line goes through the ownership categories in the correct order
mean_outdoors_own = mean_outdoors_own.sort_values('Ownership')
mean_food_own = mean_food_own.sort_values('Ownership')

# Outdoors chart with mean line across ownership categories
outdoors_own = alt.Chart(df).mark_boxplot().encode(
    x = alt.X("Ownership", title="Ownership"),
    y = alt.Y("Outdoors_score", title='Outdoors Area Score')
).properties(
    title='Outdoors Area Score by Ownership'
) + alt.Chart(mean_outdoors_own).mark_line(color='red').encode(
    x='Ownership',
    y='Outdoors_score'
)

# Food chart with mean line across ownership categories
food_own = alt.Chart(df).mark_boxplot().encode(
    x = alt.X("Ownership", title="Ownership"),
    y = alt.Y("Food_score", title='Food Quality Score')
).properties(
    title='Food Quality Score by Ownership'
) + alt.Chart(mean_food_own).mark_line(color='red').encode(
    x='Ownership',
    y='Food_score'
)

# Combine the charts
own = outdoors_own | food_own

# Save the chart if needed
# own.save('charts/chart_1.png')

own
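
Boxplots show how much the groups overlap, which is what limits how well a categorical feature can predict a score. One hedged way to put a number on that overlap is the share of a score's variance that is captured by the group means. The sketch below uses made-up data, and the function name `variance_explained` is an assumption for illustration; it is not part of the original analysis.

```python
import pandas as pd


def variance_explained(data, category, score):
    """Share of the variance in `score` that is captured by the per-category means."""
    valid = data[[category, score]].dropna()
    overall_mean = valid[score].mean()
    group_mean = valid.groupby(category)[score].transform("mean")
    ss_between = ((group_mean - overall_mean) ** 2).sum()
    ss_total = ((valid[score] - overall_mean) ** 2).sum()
    return ss_between / ss_total


# Made-up scores (illustration only)
toy = pd.DataFrame({
    "Ownership": ["public", "public", "private", "private"],
    "Food_score": [3.0, 3.2, 4.0, 4.2],      # groups clearly separated
    "Outdoors_score": [3.0, 4.2, 3.2, 4.0],  # groups overlap completely
})

food_share = variance_explained(toy, "Ownership", "Food_score")
outdoors_share = variance_explained(toy, "Ownership", "Outdoors_score")

print(food_share)      # close to 1: ownership separates the food scores
print(outdoors_share)  # close to 0: ownership says little about outdoors scores
```

A value near zero for a feature means that knowing the group barely improves on predicting the overall mean, which is the situation in which a model's R-squared stays low.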


# Multivariate Plots - Outdoors vs. Food Score

In [29]:
a = alt.Chart(df).mark_circle().encode(
    x = alt.X("Borough of Oslo", title="District in Oslo Region"),
    y = "Outdoors_score",
    color='Ownership'
)

b = alt.Chart(df).mark_circle().encode(
    x = alt.X("Borough of Oslo", title="District in Oslo Region"),
    y = "Food_score",
    color='Ownership'
)

c = a | b
c

In [30]:
a = alt.Chart(df).mark_circle().encode(
    x = "Size",
    y = "Outdoors_score",
    color='Ownership'
)

b = alt.Chart(df).mark_circle().encode(
    x = "Size",
    y = "Food_score",
    color='Ownership'
)

c = a | b
c

# Multivariate Plots with Mean - Outdoors vs. Food Score

In [31]:
# Calculate mean for each borough
mean_outdoors = df.groupby('Borough of Oslo')['Outdoors_score'].mean().reset_index()
mean_food = df.groupby('Borough of Oslo')['Food_score'].mean().reset_index()

# Sort by the order of boroughs to ensure the line goes through the boroughs in the correct order
mean_outdoors['Order'] = mean_outdoors['Borough of Oslo'].apply(lambda x: df[df['Borough of Oslo'] == x].index[0])
mean_food['Order'] = mean_food['Borough of Oslo'].apply(lambda x: df[df['Borough of Oslo'] == x].index[0])
mean_outdoors = mean_outdoors.sort_values('Order')
mean_food = mean_food.sort_values('Order')

# Outdoors chart with mean line across boroughs
a = alt.Chart(df).mark_circle().encode(
    x = alt.X("Borough of Oslo", title="District in Oslo Region"),
    y = "Outdoors_score",
    color='Ownership'
) + alt.Chart(mean_outdoors).mark_line(color='red').encode(
    x='Borough of Oslo',
    y='Outdoors_score'
)

# Food chart with mean line across boroughs
b = alt.Chart(df).mark_circle().encode(
    x = alt.X("Borough of Oslo", title="District in Oslo Region"),
    y = "Food_score",
    color='Ownership'
) + alt.Chart(mean_food).mark_line(color='red').encode(
    x='Borough of Oslo',
    y='Food_score'
)

# Combine the charts
c = a | b

# Save the chart if needed
# c.save('charts/chart_1.png')

c


In [32]:
# Calculate mean Outdoors_score for each Size category
mean_outdoors_size = df.groupby('Size')['Outdoors_score'].mean().reset_index()

# Base chart with circles
outdoors_size = alt.Chart(df).mark_circle().encode(
    x = alt.X("Size", title="Size of Kindergarten"),
    y = alt.Y("Outdoors_score", title='Outdoors Area Score'),
    color = 'Ownership'
).properties(
    title='Outdoors Area Score by Size of Kindergarten'
)

# Mean line
mean_line = alt.Chart(mean_outdoors_size).mark_line(color='red').encode(
    x = 'Size',
    y = 'Outdoors_score'
)

# Combine the charts
combined_chart1 = outdoors_size + mean_line

# Calculate mean Food_score for each Size category
mean_food_size = df.groupby('Size')['Food_score'].mean().reset_index()

# Base chart with circles
food_size = alt.Chart(df).mark_circle().encode(
    x = alt.X("Size", title="Size of Kindergarten"),
    y = alt.Y("Food_score", title='Food Quality Score'),
    color = 'Ownership'
).properties(
    title='Food Quality Score by Size of Kindergarten'
)

# Mean line
mean_line = alt.Chart(mean_food_size).mark_line(color='red').encode(
    x = 'Size',
    y = 'Food_score'
)

# Combine the charts
combined_chart2 = food_size + mean_line

# Display the chart
combined_chart1 | combined_chart2


<h1 style="font-size: 2.5em; background-color: #DC7D2D; padding: 1em">Too complex</h1>

In [33]:
# Calculate mean Outdoors_score for each Size category
mean_outdoors_size = df.groupby('Size')['Outdoors_score'].mean().reset_index()

# Base chart with circles
outdoors_size = alt.Chart(df).mark_circle().encode(
    x = alt.X("Size", title="Size of Kindergarten"),
    y = alt.Y("Outdoors_score", title='Outdoors Area Score'),
    color = alt.Color('Ownership', title='Ownership')
).properties(
    title='Outdoors Area Score by Size of Kindergarten'
)

# Mean line
mean_line = alt.Chart(mean_outdoors_size).mark_line(color='#FF542E').encode(
    x = 'Size',
    y = 'Outdoors_score'
)

# Combine the charts
combined_chart = outdoors_size + mean_line

# Display the chart
combined_chart

In [34]:
# Calculate mean Outdoors_score for each Size category
mean_outdoors_size = df.groupby('Size')['Outdoors_score'].mean().reset_index()

# Base chart with circles
outdoors_size = alt.Chart(df).mark_circle().encode(
    x = alt.X("Size", title="Size of Kindergarten"),
    y = alt.Y("Outdoors_score", title='Outdoors Area Score'),
    color = alt.Color('Ownership', title='Ownership')
).properties(
    title='Outdoors Area Score by Size of Kindergarten'
)

# Mean line
mean_line = alt.Chart(mean_outdoors_size).mark_line(color='#EE3233', strokeWidth=3).encode(  # previously '#FF542E'
    x = 'Size',
    y = 'Outdoors_score'
)

# Combine the charts
combined_chart = outdoors_size + mean_line

# Display the chart
combined_chart


In [35]:
alt.Chart(df).mark_circle().encode(
    x = "Size",
    y = "Food_score",
    color='Ownership'
)

In [36]:
# Calculate mean Food_score for each Size category
mean_food_size = df.groupby('Size')['Food_score'].mean().reset_index()

# Base chart with circles
food_size = alt.Chart(df).mark_circle().encode(
    x = alt.X("Size", title="Size of Kindergarten"),
    y = alt.Y("Food_score", title='Food Quality Score'),
    color = 'Ownership'
).properties(
    title='Food Quality Score by Size of Kindergarten'
)

# Mean line
mean_line = alt.Chart(mean_food_size).mark_line(color='#EE3233', strokeWidth=3).encode(
    x = 'Size',
    y = 'Food_score'
)

# Combine the charts
combined_chart = food_size + mean_line

# Display the chart
combined_chart




In [37]:
# Calculate mean Food_score for each Size category and Ownership
mean_food_size_ownership = df.groupby(['Size', 'Ownership'])['Food_score'].mean().reset_index()

# Calculate overall mean Food_score for each Size category
mean_food_size = df.groupby('Size')['Food_score'].mean().reset_index()

# Base chart with circles
food_size = alt.Chart(df).mark_circle().encode(
    x=alt.X("Size", title="Size of Kindergarten"),
    y=alt.Y("Food_score", title='Food Quality Score'),
    color='Ownership'
).properties(
    title='Food Quality Score by Size of Kindergarten',
    width=120  # Set the width of the chart
)

# Mean lines for each Ownership category
#mean_lines_ownership = alt.Chart(mean_food_size_ownership).mark_line(strokeWidth=3, strokeDash=[10, 1]).encode(
#    x='Size',
#    y='Food_score',
#    color='Ownership'
#)

# Overall mean line in red
mean_line_overall = alt.Chart(mean_food_size).mark_line(color='red', strokeWidth=3).encode(
    x='Size',
    y='Food_score'
)

# Combine the charts
combined_chart = food_size + mean_line_overall #+ mean_lines_ownership

# Display the chart
combined_chart



In [38]:
alt.Chart(df).mark_point().encode(
    x = "Borough of Oslo",
    y = "Food_score",
    shape='Ownership',
    color='Size'
)

In [39]:
alt.Chart(df).mark_point().encode(
    x = "Borough of Oslo",
    y = "Food_score",
    color='Ownership',
    shape='Size'
)

In [40]:
# Calculate mean Food_score for each Size category and Ownership
mean_food_size_ownership = df.groupby(['Size', 'Ownership'])['Food_score'].mean().reset_index()

# Base chart with circles
food_size = alt.Chart(df).mark_circle().encode(
    x=alt.X("Size", title="Size of Kindergarten"),
    y=alt.Y("Food_score", title='Food Quality Score'),
    color='Ownership'
).properties(
    title='Food Quality Score by Size of Kindergarten'
)

# Mean lines for each Ownership category
mean_lines = alt.Chart(mean_food_size_ownership).mark_line(strokeWidth=3).encode(
    x='Size',
    y='Food_score',
    color='Ownership'
)

# Combine the charts
combined_chart = food_size + mean_lines

# Display the chart
combined_chart


In [41]:
alt.Chart(df).mark_circle().encode(
    x = "Borough of Oslo",
    y = "Food_score",
    color='Size'
)

<h1 style="font-size: 2.5em; background-color: #DC7D2D; padding: 1em">Evaluation</h1>