To avoid issues, please install the Python packages in the next code cell. Additionally, please install Python, pip installer, and Jupyter Notebook when accessing this notebook.

There might be need to manualy open your Jupyter Notebook. You can use this local network: http://127.0.0.1:8050/

In [None]:
# Install the packages
# Uncomment the line below
# pip install pandas dash plotly scipy jupyter-dash

# Import the packages
import pandas as pd
import numpy as np
import webbrowser

import dash
from dash import dcc, html, callback_context
from dash.dependencies import Input, Output

import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode

from scipy.stats import linregress

init_notebook_mode(connected=True)

# Load the dataset
data_file = "CMSC 205 data.csv"
data = pd.read_csv(data_file, encoding='utf-8', encoding_errors='replace')

# Filter valid rows
# Negros Region is removed as the provinces of Negros Occidental and Negros Oriental 
# were separated and became Negros Island Region in May 2015. In 2017 this was cancelled, and they reverted to Western Visayas and Central Visayas, 
# respectively. Hence, it is not included in the interpretation of the dataset.

data_filtered = data[
    (data['admin1_name'] != "Negros Island Region (NIR)") &
    (data['admin1_name'] != "#adm1+name") &
    (data['admin1_name'].notna()) &
    (data['admin1_name'] != "")
]

print(data_filtered.head())

In [2]:
# Gender Gap in Literacy

# Ensure literacy_male and literacy_female are numeric
data_filtered['literacy_male'] = pd.to_numeric(data_filtered['literacy_male'], errors='coerce')
data_filtered['literacy_female'] = pd.to_numeric(data_filtered['literacy_female'], errors='coerce')

# Fill missing values with 0
data_filtered['literacy_male'].fillna(0, inplace=True)
data_filtered['literacy_female'].fillna(0, inplace=True)

# Check again after conversion
print("Unique values in literacy_male after conversion:", data['literacy_male'].unique())
print("Unique values in literacy_female after conversion:", data['literacy_female'].unique())

# Filter valid rows for gender gap analysis
data_filtered_gender = data_filtered.copy()

# Calculate gender gap
data_filtered_gender['gender_gap'] = data_filtered_gender['literacy_male'] - data_filtered_gender['literacy_female']

# Add a color column for visualization
data_filtered_gender['color'] = data_filtered_gender['gender_gap'].apply(lambda x: 'blue' if x > 0 else 'pink')

# Summarize gender gap
gender_gap_summary = data_filtered_gender.groupby('admin1_name', as_index=False).agg({
    'gender_gap': 'sum',
    'color': 'first'
})

# Plot the chart
fig = px.bar(
    gender_gap_summary,
    x='admin1_name',
    y='gender_gap',
    title='Sum of Gender Gap in Literacy by Region',
    labels={'admin1_name': 'Region', 'gender_gap': 'Gender Gap (Male - Female)'},
    color='color',
    color_discrete_map={'blue': 'blue', 'pink': 'pink'}
)

# Update the chart
fig.update_layout(xaxis_tickangle=45, template='plotly_white')

# Open chart in another tab
fig.write_html("gender_gap_chart.html")
webbrowser.open("gender_gap_chart.html")

Unique values in literacy_male after conversion: ['#population+m+pct+literate+age10up' '0.99' '0.962' '0.989' '0.973'
 '0.974' '0.957' '0.946' '0.982' '0.958' '0.966' '0.941' '0.991' '0.964'
 '0.803' '0.96' nan]
Unique values in literacy_female after conversion: ['#population+f+pct+literate+age10up' '0.989' '0.968' '0.99' '0.977'
 '0.972' '0.945' '0.984' '0.967' '0.969' '0.938' '0.991' '0.958' '0.802'
 '0.98' '0.96' nan]




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inpl

True

In [3]:
# Dynamics of Population Density and Literacy

# Ensure numeric conversion in the relevant columns
data_filtered['pop_total'] = pd.to_numeric(data_filtered['pop_total'], errors='coerce')
data_filtered['literacy_all'] = pd.to_numeric(data_filtered['literacy_all'], errors='coerce')

# Handle missing values
data_filtered['pop_total'].fillna(0, inplace=True)
data_filtered['literacy_all'].fillna(0, inplace=True)

# Drop rows with missing or invalid values for regression analysis
data_filtered_pop = data_filtered.dropna(subset=['pop_total', 'literacy_all']).copy()

# Perform regression analysis
slope, intercept, r_value, p_value, std_err = linregress(
    data_filtered_pop['pop_total'], data_filtered_pop['literacy_all']
)

# Add predicted values
data_filtered_pop['predicted_literacy'] = slope * data_filtered_pop['pop_total'] + intercept
r_squared = r_value**2

# Determine significance note based on p-value
significance_note = "(Significant)" if p_value < 0.05 else "(Not Significant)"

# Plot the scatter with regression line
fig = px.scatter(
    data_filtered_pop,
    x='pop_total',
    y='literacy_all',
    title=f'Dynamics of Population Density and Literacy {significance_note}',
    labels={'pop_total': 'Population Total', 'literacy_all': 'Literacy Rate (%)'},
    opacity=0.7,
    hover_name='admin1_name'
)

fig.add_trace(
    go.Scatter(
        x=data_filtered_pop['pop_total'],
        y=data_filtered_pop['predicted_literacy'],
        mode='lines',
        name='Regression Line',
        line=dict(color='red'),
        hovertemplate=(
            'Predicted Literacy: %{y:.2f}<br>'
            f'R²: {r_squared:.2f}<br>'
            f'P-value: {p_value:.4f} ({p_value * 100:.2f}%)'
        )
    )
)

# Update the chart
fig.update_layout(template='plotly_white')

# O
fig.write_html("population_literacy_regression.html")
webbrowser.open("population_literacy_regression.html")




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inpl

True

In [4]:
# Effects of Language Diversity on Literacy

# Ensure numeric conversion for the relevant columns
data_filtered['number_of_named_languages'] = pd.to_numeric(data_filtered['number_of_named_languages'], errors='coerce')
data_filtered['literacy_all'] = pd.to_numeric(data_filtered['literacy_all'], errors='coerce')

# Handle missing values by dropping rows with NaN in required columns
data_filtered_lang = data_filtered.dropna(subset=['number_of_named_languages', 'literacy_all']).copy()

# Perform regression analysis
slope, intercept, r_value, p_value, std_err = linregress(
    data_filtered_lang['number_of_named_languages'], data_filtered_lang['literacy_all']
)

# Add predicted values for the regression line
data_filtered_lang['predicted_literacy'] = slope * data_filtered_lang['number_of_named_languages'] + intercept

# Calculate R-squared value and determine significance
r_squared = r_value**2
significance_note = "(Significant)" if p_value < 0.05 else "(Not Significant)"

# Create scatter plot with regression line
fig = px.scatter(
    data_filtered_lang,
    x='number_of_named_languages',
    y='literacy_all',
    title=f'Effects of Language Diversity on Literacy {significance_note}',
    labels={'number_of_named_languages': 'Number of Named Languages', 'literacy_all': 'Literacy Rate (%)'},
    opacity=0.7,
    hover_name='admin1_name'  # Display region names on hover
)

# Add regression line to the plot
fig.add_trace(
    go.Scatter(
        x=data_filtered_lang['number_of_named_languages'],
        y=data_filtered_lang['predicted_literacy'],
        mode='lines',
        name='Regression Line',
        line=dict(color='red'),
        hovertemplate=(
            'Predicted Literacy: %{y:.2f}<br>'
            f'R²: {r_squared:.2f}<br>'
            f'P-value: {p_value:.4f} ({p_value * 100:.2f}%)'
        )
    )
)

# Update the chart
fig.update_layout(template='plotly_white')

# Open chart in another tab
fig.write_html("language_diversity_literacy_regression.html")
webbrowser.open("language_diversity_literacy_regression.html")



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



True

In [5]:
# Ensure numeric conversion for the relevant columns
data_filtered['main_language_share'] = pd.to_numeric(data_filtered['main_language_share'], errors='coerce')
data_filtered['literacy_all'] = pd.to_numeric(data_filtered['literacy_all'], errors='coerce')

# Handle missing values by dropping rows with NaN in required columns
data_filtered_language = data_filtered.dropna(subset=['main_language_share', 'literacy_all']).copy()

# Perform regression analysis
slope, intercept, r_value, p_value, std_err = linregress(
    data_filtered_language['main_language_share'], data_filtered_language['literacy_all']
)

# Add predicted values for the regression line
data_filtered_language['predicted_literacy'] = slope * data_filtered_language['main_language_share'] + intercept

# Calculate R-squared value and determine significance
r_squared = r_value**2
significance_note = "(Significant)" if p_value < 0.05 else "(Not Significant)"

# Create scatter plot with regression line
fig = px.scatter(
    data_filtered_language,
    x='main_language_share',
    y='literacy_all',
    title=f'Dominant Language Share and Literacy Rate {significance_note}',
    labels={'main_language_share': 'Dominant Language Share (%)', 'literacy_all': 'Literacy Rate (%)'},
    opacity=0.7,
    hover_name='admin1_name',  # Display region names on hover
    custom_data=['main_language']  # Include main language in hover data
)

# Add regression line to the plot
fig.add_trace(
    go.Scatter(
        x=data_filtered_language['main_language_share'],
        y=data_filtered_language['predicted_literacy'],
        mode='lines',
        name='Regression Line',
        line=dict(color='red'),
        hovertemplate=(
            'Predicted Literacy: %{y:.2f}<br>'
            f'R²: {r_squared:.2f}<br>'
            f'P-value: {p_value:.4f} ({p_value * 100:.2f}%)'
        )
    )
)

# Customize hover text for scatter points
fig.update_traces(
    hovertemplate=(
        "Region: %{hovertext}<br>"
        "Main Language: %{customdata[0]}<br>"
        "Literacy Rate: %{y:.2f}<br>"
        "Dominant Language Share: %{x:.2f}%"
    ),
    selector=dict(mode='markers')
)

# Update the chart
fig.update_layout(template='plotly_white')

# Open the chart in another layout
fig.write_html("dominant_language_share_literacy_regression.html")
webbrowser.open("dominant_language_share_literacy_regression.html")




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



True