<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Part-1" data-toc-modified-id="Part-1-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Part 1</a></span><ul class="toc-item"><li><span><a href="#Aim" data-toc-modified-id="Aim-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Aim</a></span></li><li><span><a href="#Import-libraries" data-toc-modified-id="Import-libraries-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Import libraries</a></span></li><li><span><a href="#Sexual-orientation" data-toc-modified-id="Sexual-orientation-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Sexual orientation</a></span></li><li><span><a href="#Main-language" data-toc-modified-id="Main-language-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Main language</a></span></li></ul></li><li><span><a href="#Part-2" data-toc-modified-id="Part-2-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Part 2</a></span><ul class="toc-item"><li><span><a href="#Region-and-urban-rural-classification" data-toc-modified-id="Region-and-urban-rural-classification-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Region and urban-rural classification</a></span></li><li><span><a href="#Bokeh-trial-visualisation" data-toc-modified-id="Bokeh-trial-visualisation-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Bokeh trial visualisation</a></span></li></ul></li><li><span><a href="#Outputs" data-toc-modified-id="Outputs-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Outputs</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

# Part 1

## Aim

The eventual aim is to create some interactive scatterplots which explore:

* the relationship between non-response (for the sexual orientation question) and the % of non-English speakers in our LAs
* the relationship between religious group % and contribution to non-response rates in LAs

In this notebook, we'll just be focusing on the first bullet point. 

Let's get started.

## Import libraries

In [1]:
# used to manipulate dataframes
import pandas as pd

# used to create visualisations
import seaborn as sns
import matplotlib.pyplot as plt

# used to create interactive visualisations
from bokeh.plotting import figure, show
from bokeh.io import output_notebook

# used to apply scatterplot labels at selective distances 
from sklearn.neighbors import NearestNeighbors

# used to calculate correlation
from scipy.stats import pearsonr



## Sexual orientation

### Read-in sexual orientation data

First we will import the sexual orientation dataset which details SO responses by local authority.

We are importing cleaned data: column names have been shortened and underscored where necessary.


In [2]:
# Let's read in SO data

df = pd.read_csv('/Users/loucap/Documents/GitWork/InteractiveGender/Data/so_renamed.csv')

### Overview

In [3]:
# Let's check it out

df.head()

Unnamed: 0,LA_code,LA_name,SO_code,SO_categories,Observation
0,E06000001,Hartlepool,-8,Does not apply,0
1,E06000001,Hartlepool,1,Straight or Heterosexual,68070
2,E06000001,Hartlepool,2,Gay or Lesbian,1121
3,E06000001,Hartlepool,3,Bisexual,784
4,E06000001,Hartlepool,4,All other sexual orientations,157


In [4]:
# Some more info...

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1986 entries, 0 to 1985
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   LA_code        1986 non-null   object
 1   LA_name        1986 non-null   object
 2   SO_code        1986 non-null   int64 
 3   SO_categories  1986 non-null   object
 4   Observation    1986 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 77.7+ KB


In [5]:
# Let's take a look at our SO codes

df.SO_code.unique()

array([-8,  1,  2,  3,  4,  5])

In [6]:
# Let's see what they refer to...

df.SO_categories.unique()

array(['Does not apply', 'Straight or Heterosexual', 'Gay or Lesbian',
       'Bisexual', 'All other sexual orientations', 'Not answered'],
      dtype=object)

In [7]:
# We are working with 331 local authorities

df.LA_name.nunique()

331

In [8]:
# You can see we have our counts for each category for a specific LA

df.head(10)

Unnamed: 0,LA_code,LA_name,SO_code,SO_categories,Observation
0,E06000001,Hartlepool,-8,Does not apply,0
1,E06000001,Hartlepool,1,Straight or Heterosexual,68070
2,E06000001,Hartlepool,2,Gay or Lesbian,1121
3,E06000001,Hartlepool,3,Bisexual,784
4,E06000001,Hartlepool,4,All other sexual orientations,157
5,E06000001,Hartlepool,5,Not answered,4554
6,E06000002,Middlesbrough,-8,Does not apply,0
7,E06000002,Middlesbrough,1,Straight or Heterosexual,102027
8,E06000002,Middlesbrough,2,Gay or Lesbian,1787
9,E06000002,Middlesbrough,3,Bisexual,1385


### Data pre-processing

Calculating SO response percentages.

In [9]:

# Let's calculate each category's percentage within its local authority, then subset our df
# I've named this new column NR_rate, since we only keep the non-response rows below

# Total number of observations per LA, repeated on every row of that LA
la_totals = df.groupby('LA_code')['Observation'].transform('sum')

# Dividing by the LA total keeps NR_rate numeric (no empty-string placeholder column needed)
df['NR_rate'] = (df['Observation'] / la_totals * 100).round(2)

# Now let's subset our df so that we're just left with our non-response rates (SO_code 5 = 'Not answered')
df = df[df['SO_code'] == 5].drop(columns=['LA_code', 'SO_code', 'Observation', 'SO_categories'])

# Sort df alphabetically by LA_name and reset index

df = df.sort_values(by = 'LA_name').reset_index(drop = True)
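
The per-LA percentage logic can be sanity-checked on a small made-up table. The sketch below uses `groupby(...).transform('sum')` to divide each count by its LA total; the LA codes and counts are invented for illustration and are not census figures.

```python
import pandas as pd

# Toy data: two hypothetical LAs with made-up counts (not real census figures)
toy = pd.DataFrame({
    'LA_code': ['A', 'A', 'A', 'B', 'B', 'B'],
    'SO_code': [1, 2, 5, 1, 2, 5],
    'Observation': [90, 5, 5, 160, 20, 20],
})

# Total count per LA, repeated on every row of that LA
la_totals = toy.groupby('LA_code')['Observation'].transform('sum')

# Share of each category within its LA, as a percentage
toy['rate'] = (toy['Observation'] / la_totals * 100).round(2)

# Keep only the "Not answered" rows (SO_code 5)
toy_nr = toy[toy['SO_code'] == 5].reset_index(drop=True)
print(toy_nr)
```

A quick check worth doing on the real data too: the percentages within each LA should sum to roughly 100.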

In [10]:
# Success!

df

Unnamed: 0,LA_name,NR_rate
0,Adur,6.47
1,Allerdale,6.18
2,Amber Valley,6.77
3,Arun,7.09
4,Ashfield,6.77
...,...,...
326,Wrexham,7.88
327,Wychavon,6.15
328,Wyre,6.1
329,Wyre Forest,6.91


## Main language

Okay, so we have our non-response rate for each LA, but we also need the % of non-English speakers.

Let's start by importing the main language dataset, which classifies residents by their main language.

### Read-in data

### Cleaning data and pre-processing

This step requires us to read-in the main language dataset and then clean the columns and calculate our Non-English percentages.

Luckily for us, we have already completed this step in the previous [Main_Lang_NR_GI.ipynb](./Main_Lang_NR_GI.ipynb) notebook, so we'll just read in that data!

In [None]:
non_eng_sum = pd.read_csv('../Data/non_eng_sum.csv')

In [None]:
# Let's take a look..
# Nice, we have our Non-English percentages for each LA

non_eng_sum

### Merge datasets

All that's left for us to do now is merge non_eng_sum, which holds our non-English percentages, with df, which holds our non-response rates for each LA.

In [None]:
# Merge non_eng_sum and df together

merged_df = non_eng_sum.merge(df, on = ['LA_name'])

In [None]:
merged_df.head()
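
The default `merge` is an inner join, so any LA whose name is spelled differently in the two files would drop out without warning. A minimal sketch of how one might check for this, using hypothetical stand-in frames (the `Non_Eng_Percentages` values are made up) and the `indicator=True` option of `DataFrame.merge`:

```python
import pandas as pd

# Hypothetical frames standing in for non_eng_sum and df (language values are made up)
left = pd.DataFrame({'LA_name': ['Adur', 'Arun', 'Wyre'],
                     'Non_Eng_Percentages': [2.1, 3.4, 1.2]})
right = pd.DataFrame({'LA_name': ['Adur', 'Arun', 'Wrexham'],
                      'NR_rate': [6.47, 7.09, 7.88]})

# An outer merge with indicator=True adds a _merge column showing where each row came from
check = left.merge(right, on='LA_name', how='outer', indicator=True)

# Rows that are not 'both' would be silently dropped by the default inner merge
unmatched = check.loc[check['_merge'] != 'both', 'LA_name'].tolist()
print(unmatched)
```

If the list is empty for the real data, every LA appears in both datasets and the inner merge loses nothing.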

### Data processing

Awesome. Now we can move on to our data processing, where we can extract some useful information and insights from our 2 variables. We're going to make a simple scatterplot showing the relationship between non-response rate and % of non-English speakers in our LAs.

#### Scatterplot visualisation

In [None]:
import matplotlib.pyplot as plt

# Set the size of the seaborn plot

sns.set(rc={'figure.figsize':(11.7,8.27)})

# Now we can visualise the relationship between SO NR and % of Non-Eng speakers

plt.scatter(merged_df['Non_Eng_Percentages'], merged_df['NR_rate'])

# We can use a bit of Machine Learning to selectively apply LA labels

# Find nearest neighbors
X = merged_df[['Non_Eng_Percentages', 'NR_rate']].values
nbrs = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = nbrs.kneighbors(X)
min_distance = 0.3

for i, row in merged_df.iterrows():
    if distances[i][1] >= min_distance:
        plt.annotate(row['LA_name'], (row['Non_Eng_Percentages'], row['NR_rate']), fontsize=8, alpha=0.6)

plt.xlabel('Percentage of Non-English Speakers')
plt.ylabel('Non-response Rate')
plt.title('Relationship between Non-response Rate and Non-English Speakers')
plt.show()
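
With `n_neighbors=2` and the same points used for fitting and querying, the first neighbour of each point is generally the point itself (distance 0), so `distances[i][1]` is the distance to the closest *other* point. The sketch below reproduces that idea with plain NumPy rather than scikit-learn, on made-up points, to show which points would receive a label.

```python
import numpy as np

# Made-up (x, y) points: the first two sit close together, the third is isolated
points = np.array([[1.0, 6.0],
                   [1.1, 6.0],
                   [5.0, 9.0]])

# Pairwise Euclidean distances between all points
diffs = points[:, None, :] - points[None, :, :]
pairwise = np.sqrt((diffs ** 2).sum(axis=2))

# Ignore each point's zero distance to itself
np.fill_diagonal(pairwise, np.inf)

# Distance from each point to its nearest other point
nearest = pairwise.min(axis=1)

# Only points at least min_distance away from everything else get a label
min_distance = 0.3
label_mask = nearest >= min_distance
print(label_mask)
```

Crowded points stay unlabelled and isolated ones are named, which keeps the plot readable. Note that the threshold is measured in raw axis units, so it treats a 0.3 gap on either axis the same way.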

### Correlation

In [None]:
# Calculate the Pearson correlation coefficient and the p-value

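# Both columns come from merged_df so that each LA's two values are paired correctly
correlation, p_value = pearsonr(merged_df['Non_Eng_Percentages'], merged_df['NR_rate'].astype(float))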

print("Correlation:", correlation)
print("P-value:", p_value)
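
For readers who want to see what the coefficient measures, here is a small sketch that computes Pearson's r by hand and compares it with NumPy's `corrcoef`. The values are made up for illustration and say nothing about the census data.

```python
import numpy as np

# Made-up paired values, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson's r: co-variation of x and y, scaled by the spread of each variable
x_centred = x - x.mean()
y_centred = y - y.mean()
r_manual = (x_centred * y_centred).sum() / np.sqrt(
    (x_centred ** 2).sum() * (y_centred ** 2).sum()
)

# NumPy's correlation matrix holds the same value off the diagonal
r_numpy = np.corrcoef(x, y)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))
```

Because the calculation pairs values element by element, both inputs must have the same length and the same LA order.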

# Part 2

Okay, so that's all well and good, but it'd be nice to explore this relationship further.

Therefore, I will:

* color-code each data point by region
* color-code each data point by urban-rural classification

## Region and urban-rural classification

We will classify each Local Authority by region and urban-rural classification, and add this info as an additional column to our merged_df. Again, we are lucky that we've already done this in the previous notebook mentioned above. So, we'll just read in the dataset!

### Read-in data

In [None]:
merged_2 = pd.read_csv('../Data/lang_rural_region_gi.csv')

In [None]:
# Great. What we'll do now is pinch the region and Urb_Rur columns
# We'll simply add them to our merged_df!

merged_2.head()

In [None]:
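# Note: this copies the columns by index label (here, row position), so it relies on
# merged_2 listing the same LAs in the same order as merged_df
merged_df['region'] = merged_2['region']
merged_df['Urb_Rur'] = merged_2['Urb_Rur']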

In [None]:
# Success.

merged_df.head()
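
Copying columns across by position only works when both dataframes list the LAs in the same order. A more defensive sketch, assuming `merged_2` also carries an `LA_name` column (not confirmed here), joins on the name instead. The frames below are hypothetical stand-ins and the `Urb_Rur` labels are illustrative.

```python
import pandas as pd

# Hypothetical stand-ins for merged_df and merged_2
base = pd.DataFrame({'LA_name': ['Adur', 'Arun', 'Wyre'],
                     'NR_rate': [6.47, 7.09, 6.10]})
lookup = pd.DataFrame({'LA_name': ['Wyre', 'Adur', 'Arun'],  # deliberately in a different order
                       'region': ['North West', 'South East', 'South East'],
                       'Urb_Rur': ['Urban', 'Urban', 'Rural']})

# Joining on the key keeps each LA with its own region, whatever the row order;
# validate raises an error if an LA name appears more than once on either side
joined = base.merge(lookup[['LA_name', 'region', 'Urb_Rur']],
                    on='LA_name', how='left', validate='one_to_one')
print(joined)
```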

### Data processing

Now that we have our finished dataset, we'll first make a scatterplot in seaborn, then make a standalone interactive scatterplot using Bokeh.

#### Scatterplot - seaborn

In [None]:
import seaborn as sns

# Set the size of the seaborn plot

sns.set(rc={'figure.figsize':(11.7,8.27)})

# Now we can visualise the relationship between SO NR and % of Non-Eng speakers
# Setting hue to region colours each data point by its corresponding region

ax = sns.scatterplot(data=merged_df, x='Non_Eng_Percentages', y='NR_rate', hue='region')

# We can use a bit of Machine Learning to selectively apply LA labels

from sklearn.neighbors import NearestNeighbors

# Find nearest neighbors

X = merged_df[['Non_Eng_Percentages', 'NR_rate']].values
nbrs = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = nbrs.kneighbors(X)
min_distance = 0.3

for i, row in merged_df.iterrows():
    if distances[i][1] >= min_distance:
        ax.annotate(row['LA_name'], (row['Non_Eng_Percentages'], row['NR_rate']), fontsize=8, alpha=0.6)

# Set x,y, and title labels

plt.xlabel('Percentage of Non-English Speakers')
plt.ylabel('Non-response Rate')
plt.title('Relationship between Non-response Rate and Non-English Speakers')

# Display output

plt.show()


## Bokeh trial visualisation

In [None]:
# Let's see if we can make a standalone Bokeh plot for some of this data

from bokeh.io import output_notebook
from bokeh.palettes import Category10
from bokeh.plotting import figure, show

# Bokeh has a hover tool, allowing you to hover over data points to reveal info
# To configure the tool, we must set the tooltips argument...

# We simply define a list of tuples which refer to column values in our merged_df 

tool = [
    ("index", "$index"),
    ("(x,y)", "(@Non_Eng_Percentages, @NR_rate)"),
    ("name","@LA_name"),
]

# Create graph figure, set title and x and y labels

p1 = figure(title="Relationship between NR rate and Non-English Speakers", x_axis_label="Percentage of Non-English Speakers", y_axis_label= "Non-response rate", tooltips = tool)

# To colour each data point by region we first loop over each unique region and its colour
for region, color in zip(merged_df.region.unique(), Category10[10]):
#     Subset dataframe by region for each unique region
    b = merged_df[merged_df.region == region]
#     Each dp within that region is then plotted with its data and specific colour
    p1.circle(x = 'Non_Eng_Percentages', y = 'NR_rate', size = 10, alpha = 0.5, color = color, legend_label = region, muted_color = color, muted_alpha = 0.1, source = b)


# Set location of legend

p1.legend.location = "bottom_right"

# Make it so that clicking a legend entry hides that region's data points
p1.legend.click_policy="hide"

# Set legend title 
p1.legend.title = "Regions"

# Display output
output_notebook()
show(p1)
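
`output_notebook()` renders the plot inline. To get a page that opens outside the notebook, one option is `bokeh.embed.file_html`, which returns the plot as an HTML string. The sketch below uses a made-up dataframe in place of `merged_df`, and the file name in the comment is a placeholder.

```python
import pandas as pd
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

# Made-up data standing in for merged_df
toy = pd.DataFrame({'Non_Eng_Percentages': [1.2, 3.4, 8.9],
                    'NR_rate': [6.1, 7.1, 8.5],
                    'LA_name': ['LA one', 'LA two', 'LA three']})

# Same ingredients as above: a figure with tooltips and one scatter glyph
p = figure(title="Toy example", tooltips=[("name", "@LA_name")])
p.scatter(x='Non_Eng_Percentages', y='NR_rate', size=10, source=toy)

# Render the plot as a complete HTML document (BokehJS is loaded from the CDN)
html = file_html(p, CDN, "Toy example")

# Writing the string to disk gives a standalone page (placeholder file name):
# with open('toy_plot.html', 'w') as f:
#     f.write(html)
```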


# Outputs

In [None]:
merged_df.to_csv('../Data/lang_rural_region_so.csv', index = False)

# Conclusion

We have now explored the relationship between non-response rate and the % of non-English speakers, coloured our data points by region, and added each LA's urban-rural classification to the dataset. If you're interested to see how we coloured these same data points by their Shannon index (a measure of religious diversity within each LA) then please proceed to our [Religion_1_SO.ipynb](./Religion_1_SO.ipynb) notebook.