In [32]:
import pandas as pd

# SIMD

Scottish Index of Multiple Deprivation (SIMD) 2020v2

Data: http://simd.scot

Technical Notes: https://www.gov.scot/binaries/content/documents/govscot/publications/statistics/2020/09/simd-2020-technical-notes/documents/simd-2020-technical-notes/simd-2020-technical-notes/govscot%3Adocument/SIMD%2B2020%2Btechnical%2Bnotes.pdf

In [33]:
df_simd_init = pd.read_csv("data/simd_2020.csv", index_col=0)

In [34]:
df_simd_init.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6976 entries, 0 to 6975
Data columns (total 49 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Data_Zone                        6976 non-null   object 
 1   Intermediate_Zone                6976 non-null   object 
 2   Council_area                     6976 non-null   object 
 3   Total_population                 6976 non-null   int64  
 4   Working_Age_population           6976 non-null   int64  
 5   SIMD2020v2_Rank                  6976 non-null   int64  
 6   SIMD_2020v2_Percentile           6976 non-null   int64  
 7   SIMD2020v2_Vigintile             6976 non-null   int64  
 8   SIMD2020v2_Decile                6976 non-null   int64  
 9   SIMD2020v2_Quintile              6976 non-null   int64  
 10  SIMD2020v2_Income_Domain_Rank    6976 non-null   float64
 11  SIMD2020_Employment_Domain_Rank  6976 non-null   float64
 12  SIMD2020_Health_Doma

We can immediately drop some columns that will not be useful to us:

- Intermediate_Zone
- Council_area
- any rate columns where we also have the count (as we will be including population)
- any ranking columns, as these should not be predictive
- the percentile, vigintile, decile and quintile columns

In [35]:
cols_to_drop = [
    "Intermediate_Zone", "Council_area", "SIMD2020v2_Rank", "SIMD_2020v2_Percentile",
    "SIMD2020v2_Vigintile", "SIMD2020v2_Decile", "SIMD2020v2_Quintile",
]
# All rank columns, including the per-domain ranks.
cols_to_drop += df_simd_init.columns[df_simd_init.columns.str.contains("_Rank")].to_list()
# Rates for which a matching count column is kept.
cols_to_drop += ["income_rate", "employment_rate", "crime_rate", "overcrowded_rate"]

df_simd = df_simd_init.drop(cols_to_drop, axis=1)

## 1 Missing Data

The SIMD technical notes state that "*" marks a statistic for which there was no relevant population, so a division by zero occurred in the calculation. Later we will aggregate the individual data zones into electoral wards, and can exclude these data zones from the weighted average calculation. If any remain after that, we will deal with them at that stage.

In [44]:
(df_simd=="*").sum()

Data_Zone                   0
Total_population            0
Working_Age_population      0
income_count                0
employment_count            0
CIF                         0
ALCOHOL                     0
DRUG                        0
SMR                         0
DEPRESS                     0
LBWT                        0
EMERG                       0
Attendance                565
Attainment                178
no_qualifications           0
not_participating           0
University                  0
crime_count               500
overcrowded_count           0
nocentralheating_count      0
nocentralheating_rate       0
drive_petrol                0
drive_GP                    0
drive_post                  0
drive_primary               0
drive_retail                0
drive_secondary             0
PT_GP                       0
PT_post                     0
PT_retail                   0
broadband                   0
dtype: int64
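
As a rough sketch of how the "*" entries could be excluded from a population-weighted average, the example below uses a small toy frame rather than `df_simd`. The `ward` column, the `weighted_mean` helper and all values are invented for illustration, and weighting by `Total_population` is an assumption; the actual ward lookup and choice of weights are decided later.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_simd; ward labels and values are invented for illustration.
toy = pd.DataFrame({
    "ward": ["A", "A", "B", "B"],            # hypothetical ward assignment
    "Total_population": [100, 300, 100, 300],
    "Attendance": ["80", "*", "90", "70"],   # "*" marks a division by zero
})

# Blank out the "*" markers, then convert the remaining strings to numbers.
attendance = toy["Attendance"].mask(toy["Attendance"] == "*")
toy["Attendance"] = pd.to_numeric(attendance)


def weighted_mean(group, value_col, weight_col="Total_population"):
    """Population-weighted mean over the rows where value_col is present."""
    valid = group[value_col].notna()
    if not valid.any():
        # Every data zone in the ward is missing: leave it for later handling.
        return np.nan
    return np.average(group.loc[valid, value_col],
                      weights=group.loc[valid, weight_col])


ward_attendance = {
    ward: weighted_mean(group, "Attendance")
    for ward, group in toy.groupby("ward")
}
print(ward_attendance)
```

Here ward A is averaged over its single valid data zone, while ward B uses both of its data zones weighted by population. A ward in which every data zone is missing would come out as NaN, which matches the plan to deal with any leftovers at the aggregation stage.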

Separately, we have some missing data, and it is not clear why.

In [36]:
df_simd.isnull().sum()

Data_Zone                  0
Total_population           0
Working_Age_population     0
income_count               0
employment_count           0
CIF                        3
ALCOHOL                    2
DRUG                       2
SMR                        2
DEPRESS                    1
LBWT                       1
EMERG                      2
Attendance                 2
Attainment                11
no_qualifications          0
not_participating          3
University                 2
crime_count                0
overcrowded_count          0
nocentralheating_count     0
nocentralheating_rate      0
drive_petrol               0
drive_GP                   0
drive_post                 0
drive_primary              0
drive_retail               0
drive_secondary            0
PT_GP                      0
PT_post                    0
PT_retail                  0
broadband                  2
dtype: int64
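
One way to investigate these gaps is to pull out the affected rows and see whether the missing values cluster in a handful of data zones. The sketch below shows the idea on a toy frame; the identifiers and values are invented, and whether the real gaps line up with anything (for example very small populations) is only a hypothesis to check, not something the data above establishes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_simd; identifiers and values are invented for illustration.
toy = pd.DataFrame({
    "Data_Zone": ["Z1", "Z2", "Z3"],
    "Total_population": [0, 850, 620],
    "CIF": [np.nan, 95.0, 110.0],
    "broadband": [np.nan, 3.0, np.nan],
})

# Keep only the rows that have at least one missing value.
rows_with_nan = toy[toy.isnull().any(axis=1)]

# Count the missing values per affected row to see whether they cluster.
nan_per_row = rows_with_nan.isnull().sum(axis=1)

print(rows_with_nan)
print(nan_per_row)
```

Applied to `df_simd`, the same two expressions would show which data zones account for the counts above and whether they share any obvious feature.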