## Analysis of Urban and Rural Areas

One of our research questions concerned the differences between urban and rural areas, so we first needed a way to classify each entry as either a developed urban area or a low-population-density rural area. Following the case study by Cybera on comparing rural and municipal internet speeds ("Towards a clear understanding of rural internet"), we can check whether the entry (tile) was assigned a PCCLASS (Population Centre Type).

Given this, we assume that any entry/tile with a null value in the PCCLASS column lies outside a population centre and can be treated as a rural area. The other values for the Population Centre Type are:
- 2 = Small Population Centre (population between 1,000 and 29,999)
- 3 = Medium Population Centre (population between 30,000 and 99,999)
- 4 = Large urban Population Centre (population of 100,000 or more)
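
The classification rule above can be sketched as a small helper. This is a minimal sketch, assuming PCCLASS holds the numeric codes 2, 3, and 4 or a missing value; the helper name `label_pcclass`, the label strings, and the toy values are hypothetical and not taken from `Final.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical readable labels for the PCCLASS codes listed above
PCCLASS_LABELS = {
    2: "Small Population Centre",
    3: "Medium Population Centre",
    4: "Large Urban Population Centre",
}

def label_pcclass(pcclass: pd.Series, rural_label: str = "Rural (no PCCLASS)") -> pd.Series:
    """Map PCCLASS codes to labels; missing codes are assumed to be rural tiles."""
    return pcclass.map(PCCLASS_LABELS).fillna(rural_label)

# Toy values for illustration only
toy = pd.Series([2.0, 4.0, np.nan, 3.0], name="PCCLASS")
labels = label_pcclass(toy)
print(labels.tolist())
```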

Before analyzing the tiles/entries without a PCCLASS, we can look at how the Population Centres perform against the target download speed of 50 Mbps.

In [None]:
# Getting libraries ready
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import geopandas as gpd
import os

In [None]:
# Change the directory to access the Final.csv
# os.chdir('/home/jovyan/data')
os.chdir('../data/hackathon')

In [None]:
# Loading the dataset
finalData = pd.read_csv('Final.csv')
finalData.head()

In [None]:
# Let's consider only the entries in which the PCCLASS is defined.
finalDataPC = finalData[finalData['PCCLASS'].notnull()]

In [None]:
finalDataPC.tail()

In [None]:
# Boxplot of the average download speed for each Population Centre Type
import seaborn as sns

# sns.set() resets the style, so set the figure size first and the grid style second
sns.set(rc={'figure.figsize': (11.7, 8.27)})
sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})

# Pass showfliers=False to sns.boxplot to hide the outliers
g = sns.boxplot(data=finalDataPC, x="avg_d_kbps", y="PCCLASS", orient='h')

plt.xlabel('Avg. Download in kbps')
plt.ylabel('Population Centre Class')
plt.yticks([0, 1, 2], ['Small P. Centre', 'Medium P. Centre', 'Large P. Centre'])
plt.title('Average Download Speed in 2019 for the defined Population Centres')
plt.axvline(50000, linestyle='--')  # 50 Mbps target, expressed in kbps
plt.show()

As expected, the larger the Population Centre, the larger the share of its tiles/entries that surpass the 50 Mbps download target. For Small Population Centres, even the lower quartile falls below 50 Mbps.
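
The reading of the boxplot can be cross-checked numerically. The following is a minimal sketch, assuming the columns `PCCLASS` and `avg_d_kbps` used in this notebook; the function name `summarize_by_class`, the output column names, and the toy values are hypothetical and only illustrate the calculation.

```python
import pandas as pd

TARGET_KBPS = 50_000  # 50 Mbps expressed in kbps

def summarize_by_class(df: pd.DataFrame) -> pd.DataFrame:
    """Quartiles of avg_d_kbps and share of tiles at or above the target, per PCCLASS."""
    quartiles = df.groupby("PCCLASS")["avg_d_kbps"].quantile([0.25, 0.5, 0.75]).unstack()
    quartiles.columns = ["q1_kbps", "median_kbps", "q3_kbps"]
    # Mean of a boolean series is the fraction of True values
    meets_target = df["avg_d_kbps"].ge(TARGET_KBPS)
    quartiles["share_meeting_target"] = meets_target.groupby(df["PCCLASS"]).mean()
    return quartiles

# Toy values for illustration only, not taken from Final.csv
toy = pd.DataFrame({
    "PCCLASS": [2, 2, 2, 2, 4, 4, 4, 4],
    "avg_d_kbps": [20_000, 40_000, 60_000, 80_000, 60_000, 80_000, 100_000, 120_000],
})
summary = summarize_by_class(toy)
print(summary)
```

On the real data, the call would be `summarize_by_class(finalDataPC)`; a `q1_kbps` value below 50,000 for class 2 would confirm what the boxplot shows.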

Another way to examine these centres is to plot the population density against the internet download speed, coloured by PCCLASS, in a scatterplot:

In [None]:
# Scatter plot of the average download speed against the population density of each Population Centre

# sns.set() resets the style, so set the figure size first and the grid style second
sns.set(rc={'figure.figsize': (11.7, 8.27)})
sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})

g = sns.scatterplot(data=finalDataPC, x="POP_DENSITY", y="avg_d_kbps",
                    hue="PCCLASS", palette="deep")

plt.xlabel('Population Density in 2019')
plt.ylabel('Avg. Download in kbps')
plt.legend(title='Population Centre Class', loc='upper right')
plt.title('Average Download Speed vs. Population Density in 2019 for the defined Population Centres')
plt.axhline(50000, linestyle='--')  # 50 Mbps target, expressed in kbps
plt.show()

Although most of the tiles associated with a Small Population Centre exceed the target, it is still the most common PCCLASS value among the tiles/entries that did not reach 50 Mbps.
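
Both parts of this observation can be checked with simple counts: how the under-target tiles are distributed across classes, and what fraction of each class falls under the target. This is a minimal sketch on toy values that are illustrative only; the variable names are hypothetical, while `PCCLASS` and `avg_d_kbps` are the columns used above.

```python
import pandas as pd

TARGET_KBPS = 50_000  # 50 Mbps expressed in kbps

# Toy values for illustration only, not taken from Final.csv
toy = pd.DataFrame({
    "PCCLASS": [2, 2, 2, 2, 2, 3, 3, 4, 4, 4],
    "avg_d_kbps": [30_000, 45_000, 70_000, 90_000, 110_000,
                   40_000, 80_000, 95_000, 120_000, 150_000],
})

# Number of under-target tiles in each class
below = toy[toy["avg_d_kbps"] < TARGET_KBPS]
below_counts = below["PCCLASS"].value_counts()

# Fraction of each class that is under target (classes with no such tiles get 0)
share_below_within_class = (below_counts / toy["PCCLASS"].value_counts()).fillna(0)

print(below_counts)
print(share_below_within_class)
```

In this toy example, class 2 contributes the most under-target tiles even though the majority of its tiles are above the target, which is the same pattern described for the scatterplot.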

Given that some Small Population Centres fall short of the target, we can now analyze how the tiles with no defined PCCLASS, which we treat as rural areas, perform in the same internet speed tests. To include these entries in the plot, the null values in the PCCLASS column were replaced with "Outside".

In [None]:
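# Replace the NaN values in the PCCLASS column with the label "Outside".
# Assigning the result back avoids calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the DataFrame.
finalData['PCCLASS'] = finalData['PCCLASS'].fillna("Outside")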

In [None]:
# Boxplot of the average download speed for all areas, including tiles outside Population Centres

# sns.set() resets the style, so set the figure size first and the grid style second
sns.set(rc={'figure.figsize': (11.7, 8.27)})
sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})

# Pass showfliers=False to sns.boxplot to hide the outliers
g = sns.boxplot(data=finalData, x="avg_d_kbps", y="PCCLASS", orient='h')

plt.xlabel('Avg. Download in kbps')
plt.ylabel('Population Centre Class')
plt.title('Average Download Speed in 2019 for all areas - Population Centres vs. Rural/Outside areas')
plt.axvline(50000, linestyle='--')  # 50 Mbps target, expressed in kbps
plt.show()

As expected when treating these entries as rural areas, the tiles/entries without a PCCLASS show clearly lower performance in the internet speed tests. Most of them did not reach the 50 Mbps target in 2019. More context would be needed to understand why some of them show such high download speeds; possible explanations include research centres located far from an urban area, or differences between the provinces in which the tiles are located.
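
One way to start gathering that context is to isolate the tiles without a PCCLASS that still reach the target, so they can be inspected individually. This is a minimal sketch on toy values that are illustrative only; the variable names are hypothetical, and it assumes rural tiles are identified by a missing PCCLASS (on the relabelled data, the equivalent filter would compare against "Outside").

```python
import numpy as np
import pandas as pd

TARGET_KBPS = 50_000  # 50 Mbps expressed in kbps

# Toy values for illustration only, not taken from Final.csv
toy = pd.DataFrame({
    "PCCLASS": [np.nan, np.nan, np.nan, np.nan, 2.0, 4.0],
    "avg_d_kbps": [8_000, 15_000, 30_000, 140_000, 60_000, 110_000],
})

# Tiles without a PCCLASS, assumed to be rural
outside = toy[toy["PCCLASS"].isna()]

# Rural tiles at or above the target, fastest first, for manual follow-up
fast_outside = outside[outside["avg_d_kbps"] >= TARGET_KBPS].sort_values(
    "avg_d_kbps", ascending=False
)
share_fast = len(fast_outside) / len(outside)

print(fast_outside)
print(f"Share of rural tiles meeting the target: {share_fast:.0%}")
```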

If the CRTC wants to target areas with low internet speeds, the previous visualizations suggest that the Population Centre Type of an area is a useful indicator of how well that area is performing, with tiles outside any population centre being the most likely to fall short of 50 Mbps.