
# Load up libraries and packages

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
import numpy as np



# Read-in Main language dataset

In [2]:
lang = pd.read_csv('../Data/Language_2021.csv')

# Cleaning dataset

In [3]:
lang.rename(columns={'Lower Tier Local Authorities Code':'LA_code', 'Lower Tier Local Authorities':'LA_name', 'Main language (detailed) (95 categories) Code': 'ML_code', 'Main language (detailed) (95 categories)': 'ML_categories'}, inplace=True)

In [4]:
# First, let's remove our 'Does not apply' response category 

lang = lang[lang['ML_code'] != -8]

In [5]:
lang2 = lang.copy()

# Exploring %s of languages in individual LAs

Now we can make queries such as...

Which local authorities have the highest % of Arabic speakers?

In [6]:


# Let's calculate the language %'s for individual LAs

lang2['Percentages'] = lang2.groupby('LA_code')['Observation'].transform(lambda x: x / x.sum() * 100)
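To make the per-LA percentage step concrete, here is a minimal sketch on toy data. The values are made up for illustration; only the column names follow the notebook. It uses `transform('sum')` instead of a lambda, which should give the same result.

```python
import pandas as pd

# Toy data (hypothetical values), using the same column names as lang2
toy = pd.DataFrame({
    'LA_code': ['A', 'A', 'B', 'B'],
    'ML_categories': ['English', 'Arabic', 'English', 'Arabic'],
    'Observation': [90, 10, 60, 40],
})

# transform('sum') returns each LA's total, repeated on every row of that LA
la_totals = toy.groupby('LA_code')['Observation'].transform('sum')

# Divide each row by its own LA's total and scale to a percentage
toy['Percentages'] = toy['Observation'] / la_totals * 100

print(toy)
```

Within each LA the percentages add up to 100, which is a quick sanity check worth running on the real dataset too.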


In [7]:
# Let's subset by chosen language

a = lang2[lang2.ML_categories == 'Arabic']

In [8]:
# Now let's sort the dataset by highest Percentage

a.sort_values(by = 'Percentages', ascending = False).head()

Unnamed: 0,LA_code,LA_name,ML_code,ML_categories,Observation,Percentages
29303,E09000033,Westminster,43,Arabic,7446,3.742197
26643,E09000005,Brent,43,Arabic,10618,3.239417
27023,E09000009,Ealing,43,Arabic,8775,2.476323
28068,E09000020,Kensington and Chelsea,43,Arabic,3196,2.288677
23033,E08000003,Manchester,43,Arabic,10425,1.961493
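The subset-then-sort steps above can be wrapped in a small helper so the same query works for any language. This is only a sketch: the function name `top_las_for_language` is hypothetical, and the toy values are invented.

```python
import pandas as pd

def top_las_for_language(df, language, n=5):
    """Return the n rows with the highest 'Percentages' for one language."""
    subset = df[df['ML_categories'] == language]
    return subset.nlargest(n, 'Percentages')

# Toy data (hypothetical values) standing in for lang2
toy = pd.DataFrame({
    'LA_name': ['Alpha', 'Beta', 'Gamma', 'Alpha'],
    'ML_categories': ['Arabic', 'Arabic', 'Arabic', 'Polish'],
    'Percentages': [1.5, 3.2, 0.4, 2.0],
})

result = top_las_for_language(toy, 'Arabic', n=2)
print(result)
```

With the real data, a call such as `top_las_for_language(lang2, 'Arabic')` should give the same rows as the two cells above.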


# Exploring main non-English languages

Now we can consider the top languages other than English  

In [9]:
# Let's create a dataset of just non-English languages

non_eng = lang2[lang2['ML_code'] != 1]

In [10]:
# We need to group our Main language categories by their observation total
# e.g. How many Arabic speakers altogether?

ML_totals = non_eng.groupby('ML_categories')['Observation'].sum().reset_index()

In [11]:
# Looking good

ML_totals.head()

Unnamed: 0,ML_categories,Observation
0,African language: Afrikaans,7501
1,African language: Akan,19753
2,African language: Amharic,10827
3,African language: Any other African language,14763
4,African language: Any other Nigerian language,4439


In [12]:
# Now we'll create a new column 'Percentages' which takes the observation total for each language and divides it by the overall total for main languages (inc English)

ML_totals['Percentages'] = ML_totals['Observation']/ lang2.Observation.sum() * 100

# Top 10 main Non-English languages

This covers England and Wales.

In [13]:
# Now we'll use the same method that we did in the previous example

ML_totals.sort_values(by = 'Percentages', ascending = False).head(10)

Unnamed: 0,ML_categories,Observation,Percentages
46,Other European language (EU): Polish,611831,1.060287
47,Other European language (EU): Romanian,471952,0.81788
80,South Asian language: Panjabi,290745,0.503853
84,South Asian language: Urdu,269863,0.467665
67,Portuguese,224720,0.389434
85,Spanish,215058,0.37269
15,Arabic,204001,0.353528
73,South Asian language: Bengali (with Sylheti an...,199491,0.345713
74,South Asian language: Gujarati,188963,0.327468
42,Other European language (EU): Italian,159986,0.277252


# Top 10 main Non-English languages (excluding Wales)

What if we just wanted to focus on all of England's local authorities? We could do that by reading in our region dataset (which classifies local authorities by region), and then filtering Welsh local authorities out of our lang dataset.

## Read-in region dataset

In [14]:
region = pd.read_csv('../Data/Local_Authority__to_Region.csv')

## Use zip() and dictionary comprehension to fill in region column

It's a bit of a faff considering we only want to identify the Welsh LAs, but hey, maybe this regional info will come in handy for future analysis! Maybe we could consider looking at the top 10 non-English languages by region?

In [15]:
# Create a key-value dictionary using zip() and a dictionary comprehension
key_value_dict = {key: value for key, value in zip(region['LAD22NM'], region['RGN22NM'])}

# non_eng was created by filtering lang2, so take an explicit copy before adding
# a column; this avoids pandas' SettingWithCopyWarning
non_eng = non_eng.copy()

# Create a new column titled 'region' and set it to empty
non_eng['region'] = ''

for key, value in key_value_dict.items():
    # Create a boolean series that is True where 'LA_name' matches the current key
    matching_rows = non_eng['LA_name'] == key
    # Use .loc to select those rows, then set their 'region' to the value for this key
    non_eng.loc[matching_rows, 'region'] = value

# Manual matching for those that couldn't be filled in
non_eng.loc[non_eng['LA_name'] == 'Herefordshire', 'region'] = 'West Midlands'
non_eng.loc[non_eng['LA_name'] == 'Kingston upon Hull', 'region'] = 'Yorkshire and The Humber'
non_eng.loc[non_eng['LA_name'] == 'Bristol', 'region'] = 'South West'

# The rest of the LA_names that weren't filled in all belong to the Wales region
# So we subset the dataframe so we only have those rows where the region column is empty
b = non_eng[non_eng.region == '']

# Then we create a list from those unique values
la_names = b.LA_name.unique().tolist()

# We iterate through each value in the list
for i in la_names:
    # Again, we use the same method: create a boolean series that is True where 'LA_name' matches i
    matching_rows = non_eng['LA_name'] == i
    # Use .loc to select those rows, then set their 'region' to 'Wales'
    non_eng.loc[matching_rows, 'region'] = 'Wales'


In [16]:
# We have successfully added region values to our non_eng df

non_eng.head()

Unnamed: 0,LA_code,LA_name,ML_code,ML_categories,Observation,Percentages,region
2,E06000001,Hartlepool,2,Welsh or Cymraeg (in England only),4,0.004473,North East
3,E06000001,Hartlepool,3,Other UK language: Gaelic (Irish),0,0.0,North East
4,E06000001,Hartlepool,4,Other UK language: Gaelic (Scottish),0,0.0,North East
5,E06000001,Hartlepool,5,Other UK language: Manx Gaelic,0,0.0,North East
6,E06000001,Hartlepool,6,Other UK language: Gaelic (Not otherwise speci...,0,0.0,North East
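The loop-based fill used above can also be written with `Series.map`, which looks up every `LA_name` in one call. The sketch below uses toy data with invented rows; it is an alternative to the notebook's approach, not the method the notebook actually ran. It also carries over the notebook's assumption that any LA left unmatched is Welsh, which only holds once the manually matched English names have been handled.

```python
import pandas as pd

# Toy lookup and data (hypothetical rows); column names follow the notebook
region_toy = pd.DataFrame({
    'LAD22NM': ['Hartlepool', 'Leeds'],
    'RGN22NM': ['North East', 'Yorkshire and The Humber'],
})
langs_toy = pd.DataFrame({'LA_name': ['Hartlepool', 'Leeds', 'Cardiff', 'Hartlepool']})

# Build the LA -> region dictionary with zip(), as in the notebook
la_to_region = dict(zip(region_toy['LAD22NM'], region_toy['RGN22NM']))

# map() looks up each LA_name; names missing from the dictionary become NaN
langs_toy['region'] = langs_toy['LA_name'].map(la_to_region)

# Assumption carried over from the notebook: whatever is still unmatched is Welsh
langs_toy['region'] = langs_toy['region'].fillna('Wales')

print(langs_toy)
```

Inspecting the unmatched names before the `fillna` step is what reveals cases like 'Herefordshire' or 'Bristol', where the two datasets spell an English LA differently.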


In [17]:
# Let's also add this region column to our lang2 dataset (which includes English)

# Step 1: Create a unique mapping
unique_mapping = non_eng[['LA_name', 'region']].drop_duplicates()

# Step 2: Map this unique mapping to lang2
lang2['region'] = lang2['LA_name'].map(unique_mapping.set_index('LA_name')['region'])


In [18]:
lang2.head()

Unnamed: 0,LA_code,LA_name,ML_code,ML_categories,Observation,Percentages,region
1,E06000001,Hartlepool,1,English (English or Welsh in Wales),87544,97.90313,North East
2,E06000001,Hartlepool,2,Welsh or Cymraeg (in England only),4,0.004473,North East
3,E06000001,Hartlepool,3,Other UK language: Gaelic (Irish),0,0.0,North East
4,E06000001,Hartlepool,4,Other UK language: Gaelic (Scottish),0,0.0,North East
5,E06000001,Hartlepool,5,Other UK language: Manx Gaelic,0,0.0,North East
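The mapping step above relies on each `LA_name` being paired with exactly one region. A minimal sketch of how that could be checked before mapping, using toy rows (invented values) in place of `non_eng`:

```python
import pandas as pd

# Toy stand-in for non_eng (hypothetical rows)
toy = pd.DataFrame({
    'LA_name': ['Hartlepool', 'Hartlepool', 'Cardiff'],
    'region': ['North East', 'North East', 'Wales'],
})

# One row per distinct (LA_name, region) pair
unique_mapping = toy[['LA_name', 'region']].drop_duplicates()

# If an LA_name were paired with two different regions it would appear twice here
has_one_region_each = unique_mapping['LA_name'].is_unique
print(has_one_region_each)

# Safe to use as a lookup once the check passes
la_to_region = unique_mapping.set_index('LA_name')['region']
```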


In [19]:
# Now let's exclude Welsh LAs from our non_eng and lang2 datasets
# Then we can start calculating percentages for top non_eng languages across England

exc_wales_non = non_eng[non_eng['region'] != 'Wales']

exc_wales_lang = lang2[lang2['region'] != 'Wales']

In [20]:
# Now we can group our main language categories by observation and sum each of them up
# e.g. this will give us the total number of Arabic speakers across all LAs in England

ML_totals = exc_wales_non.groupby('ML_categories')['Observation'].sum().reset_index()

In [21]:
# Create a new column titled percentages
# The total for each language is divided by the total number of English and non-English speakers (Wales excluded)


ML_totals['Percentages'] = ML_totals['Observation']/ exc_wales_lang.Observation.sum() * 100

# Top 10 main Non-English languages - exc Wales

In [22]:
# Then we sort by percentages in descending order, as usual

ML_totals.sort_values(by = 'Percentages', ascending = False)[:10]

Unnamed: 0,ML_categories,Observation,Percentages
46,Other European language (EU): Polish,590968,1.080655
47,Other European language (EU): Romanian,465933,0.852013
80,South Asian language: Panjabi,288658,0.527845
84,South Asian language: Urdu,267624,0.489382
67,Portuguese,221076,0.404264
85,Spanish,211994,0.387656
15,Arabic,195483,0.357464
73,South Asian language: Bengali (with Sylheti an...,194820,0.356251
74,South Asian language: Gujarati,188212,0.344168
42,Other European language (EU): Italian,157623,0.288232


# Top 10 Main Languages (by region)

In [23]:
grp = lang2.groupby(['region', 'ML_categories'])['Observation'].sum().reset_index()


In [24]:
grp.head()

Unnamed: 0,region,ML_categories,Observation
0,East Midlands,African language: Afrikaans,382
1,East Midlands,African language: Akan,1264
2,East Midlands,African language: Amharic,313
3,East Midlands,African language: Any other African language,1046
4,East Midlands,African language: Any other Nigerian language,326


In [25]:
grp['Percentages'] = grp.groupby('region')['Observation'].transform(lambda x: x / x.sum() * 100)

In [26]:
top_10_per_region = grp.groupby('region').apply(lambda x: x.nlargest(10, 'Percentages')).reset_index(drop=True)


In [27]:
top_10_per_region.head()

Unnamed: 0,region,ML_categories,Observation,Percentages
0,East Midlands,English (English or Welsh in Wales),4340516,91.707946
1,East Midlands,Other European language (EU): Polish,70965,1.499373
2,East Midlands,South Asian language: Gujarati,53596,1.132395
3,East Midlands,Other European language (EU): Romanian,45380,0.958805
4,East Midlands,South Asian language: Panjabi,24684,0.521532
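The top-N-per-region step can also be done without `groupby().apply()`, by sorting first and then taking the head of each group. This is a sketch on toy data (invented values), shown with N = 2 to keep the output short:

```python
import pandas as pd

# Toy data (hypothetical values) standing in for grp
toy = pd.DataFrame({
    'region': ['North', 'North', 'North', 'South', 'South', 'South'],
    'ML_categories': ['English', 'Polish', 'Urdu', 'English', 'Arabic', 'Polish'],
    'Percentages': [90.0, 6.0, 4.0, 80.0, 12.0, 8.0],
})

top_2 = (
    toy.sort_values(['region', 'Percentages'], ascending=[True, False])
       .groupby('region')
       .head(2)                  # first 2 rows of each region after sorting
       .reset_index(drop=True)
)

print(top_2)
```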


In [28]:
grp2 = non_eng.groupby(['region', 'ML_categories'])['Observation'].sum().reset_index()


In [29]:
grp2

Unnamed: 0,region,ML_categories,Observation
0,East Midlands,African language: Afrikaans,382
1,East Midlands,African language: Akan,1264
2,East Midlands,African language: Amharic,313
3,East Midlands,African language: Any other African language,1046
4,East Midlands,African language: Any other Nigerian language,326
...,...,...,...
925,Yorkshire and The Humber,West or Central Asian language: Any other West...,448
926,Yorkshire and The Humber,West or Central Asian language: Hebrew,117
927,Yorkshire and The Humber,West or Central Asian language: Kurdish,9986
928,Yorkshire and The Humber,West or Central Asian language: Pashto,5489


In [30]:
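# Note: the denominator is the total across all of lang2 (every region, English included),
# so each value is a language's share of all observations, not its share within its own region
grp2['Percentages'] = grp2.groupby('region')['Observation'].transform(lambda x: x / lang2.Observation.sum() * 100)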

In [31]:
top_10_per_region = grp2.groupby('region').apply(lambda x: x.nlargest(10, 'Percentages')).reset_index(drop=True)

In [32]:
top_10_per_region

Unnamed: 0,region,ML_categories,Observation,Percentages
0,East Midlands,Other European language (EU): Polish,70965,0.122980
1,East Midlands,South Asian language: Gujarati,53596,0.092880
2,East Midlands,Other European language (EU): Romanian,45380,0.078642
3,East Midlands,South Asian language: Panjabi,24684,0.042777
4,East Midlands,Other European language (EU): Lithuanian,15319,0.026547
...,...,...,...,...
95,Yorkshire and The Humber,East Asian language: All other Chinese,11088,0.019215
96,Yorkshire and The Humber,Other European language (EU): Slovak,10639,0.018437
97,Yorkshire and The Humber,West or Central Asian language: Kurdish,9986,0.017305
98,Yorkshire and The Humber,Other European language (EU): Lithuanian,8834,0.015309
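The percentages in this table are much smaller than the within-region figures shown earlier because the denominator differs. The sketch below contrasts the two choices on toy data. The values and the column names `pct_of_overall` and `pct_of_region` are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'ML_categories': ['English', 'Polish', 'English', 'Polish'],
    'Observation': [80, 20, 190, 10],
})

non_english = toy[toy['ML_categories'] != 'English'].copy()

# Denominator 1: everyone in the dataset, all regions and languages
overall_total = toy['Observation'].sum()

# Denominator 2: everyone in the same region
region_totals = toy.groupby('region')['Observation'].sum()

non_english['pct_of_overall'] = non_english['Observation'] / overall_total * 100
non_english['pct_of_region'] = (
    non_english['Observation'] / non_english['region'].map(region_totals) * 100
)

print(non_english)
```

Both columns rank languages identically within a region, so the top 10 lists are unchanged. Only the scale of the numbers, and how they should be described, differs.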


# Top 10 Main Languages (by region - except Wales)

In [33]:
grp = exc_wales_lang.groupby(['region', 'ML_categories'])['Observation'].sum().reset_index()


In [34]:
grp2 = exc_wales_non.groupby(['region', 'ML_categories'])['Observation'].sum().reset_index()


In [35]:
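# Note: grp.Observation.sum() is the total for all of England (English included, Wales excluded),
# so each value is a share of all observations in England, not a share within the region
grp2['Percentages'] = grp2.groupby('region')['Observation'].transform(lambda x: x / grp.Observation.sum() * 100)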

In [36]:
top_10_per_region = grp2.groupby('region').apply(lambda x: x.nlargest(10, 'Percentages')).reset_index(drop=True)

In [37]:
top_10_per_region

Unnamed: 0,region,ML_categories,Observation,Percentages
0,East Midlands,Other European language (EU): Polish,70965,0.129768
1,East Midlands,South Asian language: Gujarati,53596,0.098007
2,East Midlands,Other European language (EU): Romanian,45380,0.082983
3,East Midlands,South Asian language: Panjabi,24684,0.045138
4,East Midlands,Other European language (EU): Lithuanian,15319,0.028013
...,...,...,...,...
85,Yorkshire and The Humber,East Asian language: All other Chinese,11088,0.020276
86,Yorkshire and The Humber,Other European language (EU): Slovak,10639,0.019455
87,Yorkshire and The Humber,West or Central Asian language: Kurdish,9986,0.018261
88,Yorkshire and The Humber,Other European language (EU): Lithuanian,8834,0.016154


In [42]:
# This cell reproduces the result above in a single pass, starting from lang2

# Step 1: Filter out Wales as a region from original lang dataset
lang_without_wales = lang2[lang2['region'] != 'Wales']

# Step 2: Filter out English as a language from this new dataset
filtered_data = lang_without_wales[lang_without_wales['ML_code'] != 1]

# Total observations across England (Wales excluded)
total_obs_without_wales = lang_without_wales['Observation'].sum()

# Step 3: Group by region and language, then calculate percentages
grouped_data = filtered_data.groupby(['region', 'ML_categories'])['Observation'].sum().reset_index()
grouped_data['Percentages'] = (grouped_data['Observation'] / total_obs_without_wales) * 100

# Step 4: Sort each region's languages by the percentages in descending order and get the top 10 for each region
top_10_languages_by_region = grouped_data.groupby('region').apply(lambda x: x.nlargest(10, 'Percentages')).reset_index(drop=True)

top_10_languages_by_region


Unnamed: 0,region,ML_categories,Observation,Percentages
0,East Midlands,Other European language (EU): Polish,70965,0.129768
1,East Midlands,South Asian language: Gujarati,53596,0.098007
2,East Midlands,Other European language (EU): Romanian,45380,0.082983
3,East Midlands,South Asian language: Panjabi,24684,0.045138
4,East Midlands,Other European language (EU): Lithuanian,15319,0.028013
...,...,...,...,...
85,Yorkshire and The Humber,East Asian language: All other Chinese,11088,0.020276
86,Yorkshire and The Humber,Other European language (EU): Slovak,10639,0.019455
87,Yorkshire and The Humber,West or Central Asian language: Kurdish,9986,0.018261
88,Yorkshire and The Humber,Other European language (EU): Lithuanian,8834,0.016154
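The four steps in the last cell can be collected into one reusable function. This is a sketch only: the function name, its parameters and the toy values are hypothetical, and it assumes, as in the notebook, that `ML_code` 1 is the English category.

```python
import pandas as pd

def top_languages_by_region(df, n=10, exclude_regions=(), exclude_codes=(1,)):
    """Top n languages per region, as a percentage of all kept observations."""
    # Step 1: drop unwanted regions (e.g. Wales)
    kept = df[~df['region'].isin(exclude_regions)]
    total = kept['Observation'].sum()

    # Step 2: drop unwanted language codes (by default code 1, English)
    filtered = kept[~kept['ML_code'].isin(exclude_codes)]

    # Step 3: total per region and language, then percentage of the overall total
    grouped = filtered.groupby(['region', 'ML_categories'], as_index=False)['Observation'].sum()
    grouped['Percentages'] = grouped['Observation'] / total * 100

    # Step 4: sort within each region and keep the first n rows
    return (
        grouped.sort_values(['region', 'Percentages'], ascending=[True, False])
               .groupby('region')
               .head(n)
               .reset_index(drop=True)
    )

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'region': ['North', 'North', 'North', 'Wales', 'Wales'],
    'ML_code': [1, 46, 84, 1, 46],
    'ML_categories': ['English', 'Polish', 'Urdu', 'English', 'Polish'],
    'Observation': [70, 20, 10, 90, 10],
})

result = top_languages_by_region(toy, n=1, exclude_regions=['Wales'])
print(result)
```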
