# Preparation of Population by brick

### Sweden Population by brick 2022.xlsx
This notebook prepares the population data from `Sweden Population by brick 2022.xlsx`.

In [1]:
# Load required packages
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import os

## Define helper function

In [1]:
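def get_county_to_region_map_dict(column):
    """
    Map each Swedish county council to the BC or melanoma regions it contains.

    Relies on the global data frame `mapping`, which must already contain
    the columns `county_council` and `column` (`sweden_bc` or `sweden_me`).
    """
    # Start with an empty list of regions for every county council
    dict_name = {county: [] for county in sorted(mapping['county_council'].unique())}

    # Add each region once to the list of its county council, in order of first appearance
    for county, region in zip(mapping['county_council'], mapping[column]):
        if region not in dict_name[county]:
            dict_name[county].append(region)

    return dict_name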
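
The same county-to-region lookup can also be built with a `groupby`, without relying on a global data frame. The following is a minimal sketch, not the notebook's own method; `toy_mapping` and `county_to_region_map` are hypothetical names, and the rows are a small illustrative subset.

```python
import pandas as pd

# Toy stand-in for the merged `mapping` frame (illustrative rows only)
toy_mapping = pd.DataFrame({
    "county_council": ["Uppsala", "Uppsala", "Gävleborg", "Gävleborg"],
    "sweden_bc": ["Uppsala", "Uppsala", "Uppsala", "Gävleborg-Gävle"],
})

def county_to_region_map(frame, column):
    """Return {county: [unique regions in order of first appearance]}."""
    grouped = frame.groupby("county_council", sort=True)[column]
    return {county: list(values.unique()) for county, values in grouped}

toy_bc_map = county_to_region_map(toy_mapping, "sweden_bc")
print(toy_bc_map)
```

Passing the frame as an argument makes the function independent of the order in which the notebook cells are run.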

## Load data

In [3]:
# Read in data frame
population_by_brick = pd.read_excel("../../0_raw_data/novartis_data/Sweden Population by brick 2022.xlsx")

# Look at entire data frame
population_by_brick

Unnamed: 0,Country,County Council,Brick,Population
0,SE,,,10219002
1,SE,Unknown,,0
2,SE,Unknown,99 Unknown,0
3,SE,Stockholm,,2339287
4,SE,Stockholm,73 Stockholm-V,457418
...,...,...,...,...
97,SE,Blekinge,23 Karlshamn,63256
98,SE,Jämtland,,130191
99,SE,Jämtland,63 Östersund,130191
100,SE,Gotland,,59191


## Preparatory steps

In [4]:
# Rename all columns
population_by_brick = population_by_brick.rename(columns = {"Country": "country", "County Council": "county_council", 
                                                            "Brick": "brick", "Population": "population"})
population_by_brick

Unnamed: 0,country,county_council,brick,population
0,SE,,,10219002
1,SE,Unknown,,0
2,SE,Unknown,99 Unknown,0
3,SE,Stockholm,,2339287
4,SE,Stockholm,73 Stockholm-V,457418
...,...,...,...,...
97,SE,Blekinge,23 Karlshamn,63256
98,SE,Jämtland,,130191
99,SE,Jämtland,63 Östersund,130191
100,SE,Gotland,,59191
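
The rows with an empty `brick` look like county-level subtotals (for Jämtland, the subtotal 130191 equals the population of its single brick, 63 Östersund). Assuming that reading is correct, a consistency check could be sketched as follows. The frame `toy` is a hypothetical stand-in, and the "Hypothetical county" rows are invented solely to show a mismatch.

```python
import pandas as pd

# Toy frame: Jämtland values come from the table above,
# the "Hypothetical county" rows are made up to trigger a mismatch
toy = pd.DataFrame({
    "county_council": ["Jämtland", "Jämtland",
                       "Hypothetical county", "Hypothetical county", "Hypothetical county"],
    "brick": [None, "63 Östersund", None, "A1 Hypothetical", "A2 Hypothetical"],
    "population": [130191, 130191, 100, 60, 30],
})

# Subtotal rows have no brick; brick rows are summed per county
subtotals = toy[toy["brick"].isna()].set_index("county_council")["population"]
brick_sums = toy.dropna(subset=["brick"]).groupby("county_council")["population"].sum()

# Counties whose subtotal differs from the sum of their bricks
mismatch = subtotals[subtotals != brick_sums.reindex(subtotals.index)]
print(mismatch.index.tolist())
```

On the real data, the country-level row (empty county council and empty brick) would need to be excluded before running such a check.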


### Beginning of insertion: Is an analysis at the county level possible?

In [5]:
# Read in data frame 
route0 = "../processed_data"
mapping = pd.read_pickle(f"{route0}/mapping.pkl")

# Look at data frame
mapping

Unnamed: 0,brick,sweden_bc,sweden_me
0,02 Norrtälje,Stockholm,Stockholm ONCO
1,04 Uppsala,Uppsala,Uppsala ONCO
2,03 Enköping,Uppsala,Uppsala ONCO
3,05 Nyköping,Sörmland-Eskilstuna,Sörmland-Eskilstuna ONCO
4,06 Katrineholm,Sörmland-Eskilstuna,Sörmland-Eskilstuna ONCO
...,...,...,...
73,85 Kungälv,Västra Götaland-Göteborg,Västra Götaland-Göteborg ONCO
74,86 Lerum/Alingsås,Västra Götaland-Alingsås,Västra Götaland-SÄS ONCO
75,91 Malmö,Skåne-Lund,Skåne ONCO
76,92 Lund,Skåne-Lund,Skåne ONCO


In [6]:
mapping = pd.merge(mapping, population_by_brick, on='brick', how='left')
mapping

Unnamed: 0,brick,sweden_bc,sweden_me,country,county_council,population
0,02 Norrtälje,Stockholm,Stockholm ONCO,SE,Stockholm,61689
1,04 Uppsala,Uppsala,Uppsala ONCO,SE,Uppsala,287170
2,03 Enköping,Uppsala,Uppsala ONCO,SE,Uppsala,58281
3,05 Nyköping,Sörmland-Eskilstuna,Sörmland-Eskilstuna ONCO,SE,Södermanland,92517
4,06 Katrineholm,Sörmland-Eskilstuna,Sörmland-Eskilstuna ONCO,SE,Södermanland,60313
...,...,...,...,...,...,...
73,85 Kungälv,Västra Götaland-Göteborg,Västra Götaland-Göteborg ONCO,SE,Västra Götaland,88901
74,86 Lerum/Alingsås,Västra Götaland-Alingsås,Västra Götaland-SÄS ONCO,SE,Västra Götaland,125696
75,91 Malmö,Skåne-Lund,Skåne ONCO,SE,Skåne,338230
76,92 Lund,Skåne-Lund,Skåne ONCO,SE,Skåne,222013
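
A left merge silently keeps bricks that have no population entry and fills them with `NaN`. One way to surface such rows is the `indicator` option of `pd.merge`. This is a small sketch on toy data; the brick "00 Hypothetical" and its region are invented for illustration.

```python
import pandas as pd

# Toy frames; "00 Hypothetical" is an invented brick without population data
toy_mapping = pd.DataFrame({
    "brick": ["02 Norrtälje", "04 Uppsala", "00 Hypothetical"],
    "sweden_bc": ["Stockholm", "Uppsala", "Hypothetical region"],
})
toy_population = pd.DataFrame({
    "brick": ["02 Norrtälje", "04 Uppsala"],
    "population": [61689, 287170],
})

# indicator=True adds a `_merge` column telling where each row was found
merged = pd.merge(toy_mapping, toy_population, on="brick", how="left", indicator=True)

# Bricks present in the mapping but missing from the population data
unmatched = merged.loc[merged["_merge"] == "left_only", "brick"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every brick in the mapping received a population value.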


We see that an analysis at the county level is not possible, because a BC or melanoma region does not always belong to exactly one county.

Examples:
* BC region Uppsala is mapped to both counties Uppsala and Gävleborg
* BC region Västra Götaland-Göteborg is mapped to both counties Västra Götaland and Halland
* Melanoma region Västra Götaland-Göteborg ONCO is mapped to both counties Halland and Västra Götaland

This means that the patient and sales data, which are given in terms of the BC and melanoma regions, cannot be aggregated by county: a region that spans several counties has no unique county to be assigned to, so the mapping from regions to counties is not a well-defined function.

We therefore conduct our further analysis on the basis of the Swedish BC and melanoma regions from `sweden_bc` and `sweden_me`.

This decision is also supported by the fact that Novartis bases its sales forecasting and financial planning on the BC and melanoma regions. It is therefore desirable to later produce sales forecasts for the BC and melanoma regions rather than for the counties.
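
The check described above, whether any region is linked to more than one county, can be sketched with a `groupby` and `nunique`. The frame `toy` is a hypothetical excerpt of the merged mapping, not the full data.

```python
import pandas as pd

# Toy excerpt: BC region "Uppsala" appears in two counties
toy = pd.DataFrame({
    "county_council": ["Uppsala", "Gävleborg", "Gävleborg", "Stockholm"],
    "sweden_bc": ["Uppsala", "Uppsala", "Gävleborg-Gävle", "Stockholm"],
})

# Number of distinct counties each region is linked to
counties_per_region = toy.groupby("sweden_bc")["county_council"].nunique()

# Regions that cannot be assigned to a single county
ambiguous_regions = counties_per_region[counties_per_region > 1].index.tolist()
print(ambiguous_regions)
```

Any region in `ambiguous_regions` prevents a clean aggregation of region-level figures to counties.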

In [9]:
# BC regions for each county council
bc_map_dict = get_county_to_region_map_dict('sweden_bc')
bc_map_dict

{'Blekinge': ['Blekinge'],
 'Dalarna': ['Dalarna'],
 'Gotland': ['Stockholm-Gotland'],
 'Gävleborg': ['Uppsala', 'Gävleborg-Gävle'],
 'Halland': ['Halland-Halmstad',
  'Halland-Varberg-Falkenberg',
  'Västra Götaland-Göteborg'],
 'Jämtland': ['Jämtland'],
 'Jönköping': ['Jönköping-Jönköping',
  'Jönköping-Nässjö-Eksjö',
  'Jönköping-Värnamo'],
 'Kalmar': ['Kalmar'],
 'Kronoberg': ['Kronoberg-Ljungby', 'Kronoberg-Växjö'],
 'Norrbotten': ['Norrbotten-Sunderbyn'],
 'Skåne': ['Skåne-Kristianstad', 'Skåne-Helsingborg-Landskrona', 'Skåne-Lund'],
 'Stockholm': ['Stockholm'],
 'Södermanland': ['Sörmland-Eskilstuna'],
 'Uppsala': ['Uppsala'],
 'Värmland': ['Värmland-Karlstad'],
 'Västerbotten': ['Västerbotten-Umeå', 'Västerbotten-Skellefteå'],
 'Västernorrland': ['Västernorrland-Sundsvall', 'Västernorrland-Örnsköldsvik'],
 'Västmanland': ['Västmanland-Västerås'],
 'Västra Götaland': ['Västra Götaland-Uddevalla',
  'Västra Götaland-Borås',
  'Västra Götaland-Lidköping',
  'Västra Götaland-Skövde

In [10]:
# Melanoma regions for each county council
me_map_dict = get_county_to_region_map_dict('sweden_me')
me_map_dict

{'Blekinge': ['Blekinge ONCO'],
 'Dalarna': ['Dalarna ONCO'],
 'Gotland': ['Stockholm-Gotland ONCO'],
 'Gävleborg': ['Gävleborg-Gävle ONCO'],
 'Halland': ['Halland-Halmstad ONCO', 'Västra Götaland-Göteborg ONCO'],
 'Jämtland': ['Jämtland ONCO'],
 'Jönköping': ['Jönköping ONCO'],
 'Kalmar': ['Kalmar ONCO'],
 'Kronoberg': ['Kronoberg-Växjö ONCO'],
 'Norrbotten': ['Norrbotten-Sunderbyn ONCO'],
 'Skåne': ['Skåne ONCO'],
 'Stockholm': ['Stockholm ONCO'],
 'Södermanland': ['Sörmland-Eskilstuna ONCO'],
 'Uppsala': ['Uppsala ONCO'],
 'Värmland': ['Värmland-Karlstad ONCO'],
 'Västerbotten': ['Västerbotten-Umeå ONCO'],
 'Västernorrland': ['Västernorrland-Sundsvall ONCO'],
 'Västmanland': ['Västmanland-Västerås ONCO'],
 'Västra Götaland': ['Västra Götaland-Göteborg ONCO',
  'Västra Götaland-SÄS ONCO'],
 'Örebro': ['Örebro-Örebro ONCO'],
 'Östergötland': ['Östergötland-Linköping ONCO']}

### End of insertion

In [11]:
# Drop irrelevant columns
population_by_brick.drop(["country", "county_council"], axis = 1, inplace = True)

In [12]:
population_by_brick

Unnamed: 0,brick,population
0,,10219002
1,,0
2,99 Unknown,0
3,,2339287
4,73 Stockholm-V,457418
...,...,...
97,23 Karlshamn,63256
98,,130191
99,63 Östersund,130191
100,,59191


In [13]:
# Cast to appropriate data type
population_by_brick["brick"] = population_by_brick["brick"].astype('category')
population_by_brick["population"] = population_by_brick["population"].astype('float')

In [14]:
# Remove rows where brick is NaN
population_by_brick = population_by_brick.dropna(subset = ['brick']).reset_index(drop=True)

In [15]:
population_by_brick

Unnamed: 0,brick,population
0,99 Unknown,0.0
1,73 Stockholm-V,457418.0
2,72 Stockholm-NV,404184.0
3,75 Stockholm-S,641403.0
4,76 Stockholm-SV,414460.0
...,...,...
74,15 Ljungby,38790.0
75,22 Karlskrona,96295.0
76,23 Karlshamn,63256.0
77,63 Östersund,130191.0


In [16]:
# Remove row with brick == '99 Unknown'
population_by_brick = population_by_brick[population_by_brick.brick != '99 Unknown'].reset_index(drop=True)

In [17]:
population_by_brick

Unnamed: 0,brick,population
0,73 Stockholm-V,457418.0
1,72 Stockholm-NV,404184.0
2,75 Stockholm-S,641403.0
3,76 Stockholm-SV,414460.0
4,71 Stockholm-NO,281467.0
...,...,...
73,15 Ljungby,38790.0
74,22 Karlskrona,96295.0
75,23 Karlshamn,63256.0
76,63 Östersund,130191.0
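
Because `brick` was cast to a categorical type before filtering, the removed value can remain listed as an unused category even though no row carries it anymore. A minimal sketch of how this could be checked and cleaned up, using a hypothetical three-row frame `toy`:

```python
import pandas as pd

toy = pd.DataFrame({
    "brick": ["99 Unknown", "73 Stockholm-V", "63 Östersund"],
    "population": [0.0, 457418.0, 130191.0],
})
toy["brick"] = toy["brick"].astype("category")

# Filter the row out, as in the cell above
toy = toy[toy["brick"] != "99 Unknown"].reset_index(drop=True)

# The category is still registered although no row uses it
listed_before = bool("99 Unknown" in toy["brick"].cat.categories)

# Drop categories that no longer occur in the data
toy["brick"] = toy["brick"].cat.remove_unused_categories()
listed_after = bool("99 Unknown" in toy["brick"].cat.categories)
print(listed_before, listed_after)
```

Whether the leftover category matters depends on later steps, for example group-bys that include unobserved categories.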


We see that the population data is only given for 2022. 

In [18]:
# Save the prepared data frame
route0 = "../processed_data"

# Create the output directory (including parents) if it does not exist yet
os.makedirs(route0, exist_ok=True)
    
print("saving file corresponding to population_by_brick.pkl")
population_by_brick.to_pickle(f"{route0}/population_by_brick.pkl")
pd.read_pickle(f"{route0}/population_by_brick.pkl")

saving file corresponding to population_by_brick.pkl


Unnamed: 0,brick,population
0,73 Stockholm-V,457418.0
1,72 Stockholm-NV,404184.0
2,75 Stockholm-S,641403.0
3,76 Stockholm-SV,414460.0
4,71 Stockholm-NO,281467.0
...,...,...
73,15 Ljungby,38790.0
74,22 Karlskrona,96295.0
75,23 Karlshamn,63256.0
76,63 Östersund,130191.0
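
Instead of comparing the reloaded frame by eye, the round trip can be verified programmatically. The sketch below writes to a temporary directory rather than to `../processed_data`; the one-row frame `toy` is a hypothetical stand-in for `population_by_brick`.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"brick": ["73 Stockholm-V"], "population": [457418.0]})
toy["brick"] = toy["brick"].astype("category")

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "processed_data")
    os.makedirs(out_dir, exist_ok=True)

    out_path = os.path.join(out_dir, "population_by_brick.pkl")
    toy.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

# True if values and dtypes survived the round trip unchanged
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```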
