# Religion

In this notebook, we look at the relationship between the non-response rate for the sexual orientation (SO) question and the percentage of non-English speakers in our local authorities (LAs), and add religious diversity as a third variable.

* We will use the Shannon index (SI), which we apply here as a measure of religious diversity within a community.
* We will calculate the SI for each of our LAs.
* A higher Shannon index indicates greater religious diversity.

This will allow us to make some inferences about the religious diversity among our 331 LAs.


## Import libraries

In [1]:
# used to manipulate dataframes
import pandas as pd

# used to create visualisations
import matplotlib.pyplot as plt

# used to calculate shannon index
import numpy as np

# used to create interactive visualisations
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import ColorBar, BasicTicker, PrintfTickFormatter
from bokeh.models import LinearColorMapper

## Read-in data

We will read in the original religion dataset.

In [2]:
rel = pd.read_excel('../Data/religion_so.xlsx')

In [3]:
# Let's take a brief look..

rel.head()

Unnamed: 0,Lower tier local authorities Code,Lower tier local authorities,Sexual orientation (6 categories) Code,Sexual orientation (6 categories),Religion (10 categories) Code,Religion (10 categories),Observation
0,E06000001,Hartlepool,-8,Does not apply,-8,Does not apply,0
1,E06000001,Hartlepool,-8,Does not apply,1,No religion,0
2,E06000001,Hartlepool,-8,Does not apply,2,Christian,0
3,E06000001,Hartlepool,-8,Does not apply,3,Buddhist,0
4,E06000001,Hartlepool,-8,Does not apply,4,Hindu,0


## Cleaning data

In [6]:
# Let's rename our columns and give them less clunky names

rel.rename(columns={'Lower tier local authorities Code': 'LA_code', 'Lower tier local authorities': 'LA_name', 'Sexual orientation (6 categories) Code': 'SO_code', 'Sexual orientation (6 categories)': 'SO_categories', 'Religion (10 categories) Code': 'Religion_code', 'Religion (10 categories)': 'Religion_categories'}, inplace=True)

In [8]:
rel.head(20)

Unnamed: 0,LA_code,LA_name,SO_code,SO_categories,Religion_code,Religion_categories,Observation
0,E06000001,Hartlepool,-8,Does not apply,-8,Does not apply,0
1,E06000001,Hartlepool,-8,Does not apply,1,No religion,0
2,E06000001,Hartlepool,-8,Does not apply,2,Christian,0
3,E06000001,Hartlepool,-8,Does not apply,3,Buddhist,0
4,E06000001,Hartlepool,-8,Does not apply,4,Hindu,0
5,E06000001,Hartlepool,-8,Does not apply,5,Jewish,0
6,E06000001,Hartlepool,-8,Does not apply,6,Muslim,0
7,E06000001,Hartlepool,-8,Does not apply,7,Sikh,0
8,E06000001,Hartlepool,-8,Does not apply,8,Other religion,0
9,E06000001,Hartlepool,-8,Does not apply,9,Not answered,0


In [9]:
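# Subset data - keep only rows where the SO question was not answered (SO_code == 5),
# and exclude the 'Does not apply' (-8) and 'Not answered' (9) religion categories.
# .copy() gives us an independent dataframe, so adding columns to it later is safe.

non_resp = rel[(rel.SO_code == 5) & (rel.Religion_code != -8) & (rel.Religion_code != 9)].copy()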

In [10]:
non_resp.head()

Unnamed: 0,LA_code,LA_name,SO_code,SO_categories,Religion_code,Religion_categories,Observation
51,E06000001,Hartlepool,5,Not answered,1,No religion,1139
52,E06000001,Hartlepool,5,Not answered,2,Christian,1651
53,E06000001,Hartlepool,5,Not answered,3,Buddhist,14
54,E06000001,Hartlepool,5,Not answered,4,Hindu,23
55,E06000001,Hartlepool,5,Not answered,5,Jewish,4


# Analysis

Now we're going to create the Shannon index. These are the steps:

1) Calculate the proportion of each religion within each LA
2) For each religion, multiply its proportion by the natural logarithm of that proportion and negate the result
3) Sum those values for each LA

Then we will reduce the dataset so that it just lists the Shannon index for each unique LA.
Finally, we'll create a colormap object which will allow us to set the colour mapping of datapoints.
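As a rough sketch of the three steps above, the function below computes the index for a single LA from its category counts. The function name `shannon_index` and the example counts are made up for illustration and are not taken from our dataset.

```python
import numpy as np


def shannon_index(counts):
    """Return -sum(p * ln(p)) for a sequence of category counts."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        # No observations at all: treat diversity as zero
        return 0.0
    p = counts / total          # step 1: proportions
    p = p[p > 0]                # empty categories contribute nothing
    return float(-(p * np.log(p)).sum())  # steps 2 and 3


# Hypothetical counts for one LA across four religion categories
example_counts = [50, 30, 15, 5]
h = shannon_index(example_counts)
```

With k equally sized categories the index reaches its maximum of ln(k), and it drops to 0 when everyone falls into a single category.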

## Calculations

In [11]:
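# Create a placeholder column to hold the proportion of each religion within each LA

non_resp['Proportions'] = ''

for i in non_resp.LA_name.unique():
    
    # Rows belonging to the current LA
    b = non_resp[non_resp.LA_name == i]
    
    # Share of each religion among this LA's non-respondents
    prop = b.Observation / b.Observation.sum()
    
    # Write the proportions back, rounded to 4 decimal places
    non_resp.loc[b.index, 'Proportions'] = round(prop, 4)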

    



In [12]:
non_resp.head()

Unnamed: 0,LA_code,LA_name,SO_code,SO_categories,Religion_code,Religion_categories,Observation,Proportions
51,E06000001,Hartlepool,5,Not answered,1,No religion,1139,0.3854
52,E06000001,Hartlepool,5,Not answered,2,Christian,1651,0.5587
53,E06000001,Hartlepool,5,Not answered,3,Buddhist,14,0.0047
54,E06000001,Hartlepool,5,Not answered,4,Hindu,23,0.0078
55,E06000001,Hartlepool,5,Not answered,5,Jewish,4,0.0014
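The per-LA loop could probably also be written as a single grouped operation. The sketch below uses a small made-up dataframe (`toy`) that borrows our column names; it is one possible alternative, not the code that produced the output above.

```python
import pandas as pd

# Hypothetical toy data using the notebook's column names
toy = pd.DataFrame({
    'LA_name': ['A', 'A', 'A', 'B', 'B'],
    'Observation': [60, 30, 10, 25, 75],
})

# transform('sum') returns the LA total repeated on every row of that LA,
# so a plain division gives each row's share of its own LA
la_totals = toy.groupby('LA_name')['Observation'].transform('sum')
toy['Proportions'] = toy['Observation'] / la_totals
```

Because the result is numeric from the start, this version would not need a later conversion with `pd.to_numeric`.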


In [13]:
# Converted proportions column to numeric type

non_resp['Proportions'] = pd.to_numeric(non_resp['Proportions'])


# Calculate the intermediate value used in the Shannon index
# Each religion contributes -p * ln(p), where p is its proportion; rows with a zero proportion contribute 0

non_resp['Calc'] = np.where(non_resp['Proportions'] > 0, - non_resp['Proportions'] * np.log(non_resp['Proportions']), 0)





In [14]:
# Create a placeholder column titled Shannon_idx (NaN keeps the column numeric)

non_resp['Shannon_idx'] = np.nan

for i in non_resp.LA_code.unique():
    
    b = non_resp[non_resp.LA_code == i]
    
    # the Shannon index for this LA is the sum of its values in the Calc column
    summed = b.Calc.sum()
    
    non_resp.loc[b.index, 'Shannon_idx'] = summed



In [15]:
# Success.

non_resp

Unnamed: 0,LA_code,LA_name,SO_code,SO_categories,Religion_code,Religion_categories,Observation,Proportions,Calc,Shannon_idx
51,E06000001,Hartlepool,5,Not answered,1,No religion,1139,0.3854,0.367469,0.931876
52,E06000001,Hartlepool,5,Not answered,2,Christian,1651,0.5587,0.325243,0.931876
53,E06000001,Hartlepool,5,Not answered,3,Buddhist,14,0.0047,0.025193,0.931876
54,E06000001,Hartlepool,5,Not answered,4,Hindu,23,0.0078,0.037858,0.931876
55,E06000001,Hartlepool,5,Not answered,5,Jewish,4,0.0014,0.009200,0.931876
...,...,...,...,...,...,...,...,...,...,...
19854,W06000024,Merthyr Tydfil,5,Not answered,4,Hindu,7,0.0036,0.020257,0.842286
19855,W06000024,Merthyr Tydfil,5,Not answered,5,Jewish,1,0.0005,0.003800,0.842286
19856,W06000024,Merthyr Tydfil,5,Not answered,6,Muslim,23,0.0119,0.052731,0.842286
19857,W06000024,Merthyr Tydfil,5,Not answered,7,Sikh,4,0.0021,0.012948,0.842286


In [16]:
# Reduce dataset so that it just contains each unique LA and its SI

unique_shannon_df = non_resp[['LA_name', 'Shannon_idx']].drop_duplicates(subset=['LA_name'])

# Let's take a look at our top 5 most diverse LAs

unique_shannon_df.sort_values(by = 'Shannon_idx', ascending = False).head()

Unnamed: 0,LA_name,Shannon_idx
17631,Hounslow,1.625498
17571,Hillingdon,1.6235
16731,Barnet,1.613314
17091,Ealing,1.613012
951,Leicester,1.586275
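For comparison, the summing and de-duplication steps could be collapsed into one `groupby`. This is only a sketch on made-up proportions (`toy`), and the `p.where(p > 0, 1.0)` trick for avoiding `log(0)` is our own suggestion rather than something used in the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical proportions for two LAs
toy = pd.DataFrame({
    'LA_name': ['A', 'A', 'B', 'B'],
    'Proportions': [0.5, 0.5, 1.0, 0.0],
})

p = toy['Proportions']
# Swap zeros for 1 before taking the log: ln(1) = 0, so those rows
# contribute 0 and numpy never has to evaluate log(0)
toy['Calc'] = -p * np.log(p.where(p > 0, 1.0))

# One row per LA, holding the sum of its Calc values
shannon_by_la = (
    toy.groupby('LA_name', as_index=False)['Calc']
       .sum()
       .rename(columns={'Calc': 'Shannon_idx'})
)
```

Here LA "A" is split evenly between two categories, so its index is ln(2), while LA "B" has a single category and an index of 0.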


In [17]:
# Sort LA_name by alphabetical order and reset the index

unique_shannon_df = unique_shannon_df.sort_values(by = 'LA_name').reset_index(drop = True)

In [18]:
unique_shannon_df.head()

Unnamed: 0,LA_name,Shannon_idx
0,Adur,0.891429
1,Allerdale,0.778939
2,Amber Valley,0.827071
3,Arun,0.864957
4,Ashfield,0.89143


In [19]:
# Before we map the Shannon index, we apply min-max normalisation
# This rescales the values to run from 0 (least diverse LA) to 1 (most diverse LA), which makes LAs easier to compare on the colour scale

unique_shannon_df['Normalised'] =  (unique_shannon_df['Shannon_idx'] - unique_shannon_df['Shannon_idx'].min()) / (unique_shannon_df['Shannon_idx'].max() - unique_shannon_df['Shannon_idx'].min())
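To make the rescaling concrete, here is a minimal sketch of the same min-max formula as a helper. The name `min_max` and the three example values are hypothetical, and the guard for a zero range is an extra assumption that the cell above does not need.

```python
import pandas as pd


def min_max(series):
    """Rescale a numeric Series so its minimum maps to 0 and its maximum to 1."""
    span = series.max() - series.min()
    if span == 0:
        # All values identical: avoid dividing by zero and return zeros
        return series * 0.0
    return (series - series.min()) / span


# Hypothetical Shannon index values for three LAs
toy_idx = pd.Series([0.8, 1.0, 1.6])
scaled = min_max(toy_idx)
```

The rescaled values keep their ordering and relative spacing, so the ranking of LAs is unchanged; only the units differ.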


## Color mapping in Bokeh

Bokeh is the interactive visualisation library that we will be using, so we'll have to create our colour mapping object using its functions.

In [20]:
# Create the colour map object in Bokeh
# Viridis256 chosen because it's good at representing continuous variables

color_map = LinearColorMapper(palette="Viridis256", low=unique_shannon_df.Normalised.min(), high=unique_shannon_df.Normalised.max())

# Success. 

color_map

## Read-in merged dataframe

This merged_df was created in the 'Main_Lang_NR_SO.ipynb' notebook, and contains the modified main language dataset, along with additional columns on 'region' and 'urban-rural'.

In [21]:
merged_df = pd.read_csv('../Data/lang_rural_region_so.csv')

In [22]:
# Let's check it out

merged_df.head()

Unnamed: 0,LA_name,Observation,Non_Eng_Percentages,NR_rate,region,Urb_Rur
0,Adur,1971,3.14,6.47,South East,Predominantly Urban
1,Allerdale,1073,1.15,6.18,North West,Predominantly Rural
2,Amber Valley,1850,1.51,6.77,East Midlands,Predominantly Urban
3,Arun,9469,5.89,7.09,South East,Predominantly Urban
4,Ashfield,3944,3.22,6.77,East Midlands,Predominantly Urban


In [23]:
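# Cool. All we need to do is add our Normalised column to this dataset!
# Note: this assignment aligns on the row index, so it relies on both dataframes
# listing the same LAs, sorted by LA_name, with a fresh 0..n-1 index
# The new column is called Shannon_idx but holds the normalised (0-1) values

merged_df['Shannon_idx'] = unique_shannon_df['Normalised']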

In [24]:
merged_df.head()

Unnamed: 0,LA_name,Observation,Non_Eng_Percentages,NR_rate,region,Urb_Rur,Shannon_idx
0,Adur,1971,3.14,6.47,South East,Predominantly Urban,0.176281
1,Allerdale,1073,1.15,6.18,North West,Predominantly Rural,0.050053
2,Amber Valley,1850,1.51,6.77,East Midlands,Predominantly Urban,0.104063
3,Arun,9469,5.89,7.09,South East,Predominantly Urban,0.146576
4,Ashfield,3944,3.22,6.77,East Midlands,Predominantly Urban,0.176282
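If the two dataframes were ever in a different order, joining on the LA name would be a safer way to attach the column. The sketch below uses small illustrative frames (`left`, `right`) with made-up normalised values; it shows one possible approach, not what was run above.

```python
import pandas as pd

# Illustrative frames; note the rows are deliberately in different orders
left = pd.DataFrame({
    'LA_name': ['Adur', 'Arun', 'Allerdale'],
    'NR_rate': [6.47, 7.09, 6.18],
})
right = pd.DataFrame({
    'LA_name': ['Adur', 'Allerdale', 'Arun'],
    'Normalised': [0.18, 0.05, 0.15],  # hypothetical values
})

# Match rows by name rather than by position; validate raises an error
# if an LA name appears more than once on either side
joined = left.merge(
    right.rename(columns={'Normalised': 'Shannon_idx'}),
    on='LA_name',
    how='left',
    validate='one_to_one',
)

# Any LA without a match would show up here as a missing value
unmatched = joined['Shannon_idx'].isna().sum()
```

Checking `unmatched` afterwards is a quick way to spot LA names that are spelled differently in the two datasets.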


# Visualisation - Bokeh

Let's now create a standalone Bokeh plot and colour our data points by their Shannon index.

In [25]:

# Bokeh has a hover tool, allowing you to hover over data points to reveal info
# To configure the tool, we must set our tooltips argument...

# We simply define a list of tuples which refer to column values in our merged_df 

tool = [
    ("index", "$index"),
    ("(x,y)", "(@Non_Eng_Percentages, @NR_rate)"),
    ("name","@LA_name"),
    ("Shannon_idx", "@Shannon_idx")
]


# Create graph figure, set title and x and y labels

p2 = figure(title="Relationship between Non-response Rate and Non-English Speakers", x_axis_label="Percentage of Non-English Speakers", y_axis_label="Non-response Rate", tooltips = tool)


# Create scatterplot for x and y column

p2.scatter("Non_Eng_Percentages", "NR_rate", source = merged_df, fill_alpha = 0.5, size = 10,  color={'field': 'Shannon_idx', 'transform': color_map})


# Create colour bar and set the color_mapper parameter to 'color_map' which we produced earlier

color_bar = ColorBar(color_mapper=color_map,
                     title='Shannon Index',
                     ticker=BasicTicker(desired_num_ticks=5),
                     formatter=PrintfTickFormatter(format='%.2f'))

# Add the color bar to the plot
p2.add_layout(color_bar, 'right')

# Display output 

output_notebook()
show(p2)

# Outputs

Now that we've added the Shannon index column to our language dataset, let's save it to a CSV. Our next aim is to create an interactive Bokeh plot for SO with drop-down menus that let users switch between colouring data points by:

* region
* urban-rural classification
* Shannon index

In [26]:
merged_df.to_csv('../Data/final_lang_so.csv')