# RQ1: Education and Unemployment in Ireland

This notebook investigates the relationship between third-level education
and unemployment rates, with a focus on gender differences.

## Imports
This section imports the Python libraries required for the analysis.
- **pandas** is used for loading and manipulating the dataset
- **numpy** is used for numerical operations
- **matplotlib** is used for visualisation
- **scipy.stats** is used to compute correlation coefficients
- **scikit-learn** and **statsmodels** are used to fit the linear regressions


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr


import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm


In [None]:
%pip install seaborn scikit-learn statsmodels

In [None]:
%pip install numpy

In [None]:
%pip install matplotlib

In [None]:
%pip install scipy

## Loading the dataset

The dataset is loaded from a CSV file and inspected to understand its structure: column names, data types, and non-null counts, which reveal any missing values.

In [None]:
df = pd.read_csv("C:/Users/ryano/OneDrive/Documents/MastersS3/Python/DeprivationIndex.csv")

# Only the last expression in a cell is displayed, so print each check explicitly
print(df.shape)
print(df.head())
df.info()

## Selecting Variables

This section keeps only the variables relevant to RQ1 and RQ1.1.
Column names are renamed to shorter labels.

In [None]:

cols = [
    "ED Name",
    "County",
    "Proportion at Third Level Education 2016 %",
    "Unemployment Rate - Male",
    "Unemployment Rate - Female"
]

d = df[cols].copy()

d = d.rename(columns={
    "Proportion at Third Level Education 2016 %": "third_level_pct",
    "Unemployment Rate - Male": "unemp_male",
    "Unemployment Rate - Female": "unemp_female"
})


## Data Cleaning

The selected variables are converted to numeric types; values that cannot be parsed become missing.
Rows with missing values in any key variable are then removed, so that plots do not break and statistics are not computed on incomplete records.

In [None]:

for c in ["third_level_pct", "unemp_male", "unemp_female"]:
    d[c] = pd.to_numeric(d[c], errors="coerce")

print(d.isna().sum())  # missing values per column, shown before rows are dropped
d = d.dropna(subset=["third_level_pct", "unemp_male", "unemp_female"])
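
To illustrate what `errors="coerce"` does, here is a small sketch on made-up values. The `toy` frame is hypothetical and not drawn from the dataset: non-numeric entries become `NaN`, and `dropna` then removes the affected rows.

```python
import pandas as pd

# Hypothetical toy data, not taken from DeprivationIndex.csv
toy = pd.DataFrame({
    "third_level_pct": ["25.1", "n/a", "40.3"],
    "unemp_male": ["12.0", "9.5", ""],
})

# Non-numeric strings ("n/a", "") are coerced to NaN instead of raising an error
for c in toy.columns:
    toy[c] = pd.to_numeric(toy[c], errors="coerce")

missing_per_column = toy.isna().sum()
cleaned = toy.dropna(subset=["third_level_pct", "unemp_male"])

print(missing_per_column)
print(cleaned)
```

Only the first row survives, because each of the other two rows has one value that could not be parsed.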


## Overall Unemployment Measure

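Because the dataset provides unemployment rates separately for men and women,
an overall unemployment rate is approximated by taking the unweighted mean of the two, which allows the main relationship in RQ1 to be analysed. This mean ignores differences in the size of the male and female labour forces, so it is a proxy rather than the true overall rate.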

In [None]:
d["unemp_overall"] = (d["unemp_male"] + d["unemp_female"]) / 2
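
The simple average above gives the male and female rates equal weight. As a hedged illustration of how that can differ from a labour-force-weighted rate, the sketch below uses invented numbers. The `lf_male` and `lf_female` columns are hypothetical and are not known to exist in the dataset.

```python
import pandas as pd

# Hypothetical single area; the labour-force counts are invented for illustration
toy = pd.DataFrame({
    "unemp_male": [10.0],
    "unemp_female": [20.0],
    "lf_male": [300],    # assumed male labour force
    "lf_female": [100],  # assumed female labour force
})

# Unweighted mean, as used in this notebook
simple_mean = (toy["unemp_male"] + toy["unemp_female"]) / 2

# Labour-force-weighted mean, only possible if such counts were available
weighted_mean = (
    toy["unemp_male"] * toy["lf_male"] + toy["unemp_female"] * toy["lf_female"]
) / (toy["lf_male"] + toy["lf_female"])

print(simple_mean.iloc[0], weighted_mean.iloc[0])
```

The two measures agree only when the male and female labour forces are the same size or the two rates are equal.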

## Descriptive Statistics

Descriptive statistics are used to examine the distribution and range of the education
and unemployment variables before further analysis.

In [None]:
d[["third_level_pct", "unemp_male", "unemp_female", "unemp_overall"]].describe()

## Distribution of Key Variables

Histograms are used to visualise how third-level education and unemployment rates
are distributed across the areas in the dataset.

In [None]:
d[["third_level_pct", "unemp_overall"]].hist(bins=30)
plt.show()

## RQ1: Education and Overall Unemployment

This section explores the relationship between the proportion of individuals
with third-level education and the overall unemployment rate using both
visualisation and correlation analysis.

In [None]:
plt.figure()
plt.scatter(d["third_level_pct"], d["unemp_overall"], alpha=0.5)
plt.xlabel("Third-level education (%)")
plt.ylabel("Unemployment rate (overall, %)") 
plt.title("RQ1: Education vs Unemployment (Overall)")
plt.show()

In [None]:
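# 2-D inputs, kept for the regression cells further down
x = d[["third_level_pct"]]
y = d[["unemp_overall"]]

# Pass 1-D Series so each function returns a single coefficient and p-value
pear_r, pear_p = pearsonr(d["third_level_pct"], d["unemp_overall"])
spear_r, spear_p = spearmanr(d["third_level_pct"], d["unemp_overall"])

pear_r, pear_p, spear_r, spear_p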
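
Pearson's r measures linear association, while Spearman's rho measures monotonic association based on ranks. The sketch below uses made-up data (the `x_toy` and `y_toy` arrays are illustrative only) to show how the two can diverge when a relationship is monotonic but curved.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Made-up data: perfectly monotonic but non-linear
x_toy = np.arange(1, 11, dtype=float)
y_toy = 100.0 / x_toy  # strictly decreasing as x_toy grows

pear_r_toy, _ = pearsonr(x_toy, y_toy)
spear_r_toy, _ = spearmanr(x_toy, y_toy)

print(f"Pearson r = {pear_r_toy:.3f}, Spearman rho = {spear_r_toy:.3f}")
```

Here the ranks are perfectly reversed, so Spearman's rho reaches -1, whereas Pearson's r is weaker because the points do not lie on a straight line. A large gap between the two coefficients on the real data would hint at a non-linear relationship.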

## RQ1.1: Gender Differences

To examine whether the relationship between education and unemployment differs
by gender, we repeat the analysis separately for male and female unemployment rates.

In [None]:
# Male
plt.figure()
plt.scatter(d["third_level_pct"], d["unemp_male"], alpha=0.5)
plt.xlabel("Third-level education (%)")
plt.ylabel("Male unemployment (%)")
plt.title("RQ1.1: Education vs Male Unemployment")
plt.show()

# Female
plt.figure()
plt.scatter(d["third_level_pct"], d["unemp_female"], alpha=0.5)
plt.xlabel("Third-level education (%)")
plt.ylabel("Female unemployment (%)")
plt.title("RQ1.1: Education vs Female Unemployment")
plt.show()

In [None]:
r_m, p_m = pearsonr(d["third_level_pct"], d["unemp_male"])
r_f, p_f = pearsonr(d["third_level_pct"], d["unemp_female"])

(r_m, p_m), (r_f, p_f)
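
Comparing `r_m` and `r_f` by eye does not show whether the gap is larger than sampling noise. One possible check, which is not part of the original analysis, is a percentile bootstrap of the difference between the two correlations. The sketch below is a minimal version under that assumption; `bootstrap_corr_diff` is a hypothetical helper and the data are synthetic stand-ins.

```python
import numpy as np

def bootstrap_corr_diff(x, y1, y2, n_boot=2000, seed=0):
    """Percentile bootstrap for corr(x, y1) - corr(x, y2), resampling rows jointly."""
    x, y1, y2 = (np.asarray(a, dtype=float) for a in (x, y1, y2))
    rng = np.random.default_rng(seed)
    n = len(x)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample areas with replacement
        r1 = np.corrcoef(x[idx], y1[idx])[0, 1]
        r2 = np.corrcoef(x[idx], y2[idx])[0, 1]
        diffs[i] = r1 - r2
    observed = np.corrcoef(x, y1)[0, 1] - np.corrcoef(x, y2)[0, 1]
    low, high = np.percentile(diffs, [2.5, 97.5])
    return observed, low, high

# Synthetic stand-in data, not the notebook's dataset
rng = np.random.default_rng(1)
edu = rng.uniform(5, 60, size=300)
male = 25 - 0.30 * edu + rng.normal(0, 3, size=300)
female = 18 - 0.15 * edu + rng.normal(0, 3, size=300)

observed, low, high = bootstrap_corr_diff(edu, male, female)
print(f"r_m - r_f = {observed:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

With the notebook's data the call would be `bootstrap_corr_diff(d["third_level_pct"], d["unemp_male"], d["unemp_female"])`. An interval that excludes zero would suggest the two correlations genuinely differ.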

## Visualising Trends

A helper function adds a fitted linear trend line to each scatterplot, which makes it easier to compare the strength and direction of the relationships.

In [None]:
def scatter_with_line(x, y, xlabel, ylabel, title):
    plt.figure()
    plt.scatter(x, y, alpha=0.5)

    # line of best fit
    m, b = np.polyfit(x, y, 1)
    xs = np.linspace(x.min(), x.max(), 100)
    plt.plot(xs, m*xs + b)

    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

scatter_with_line(d["third_level_pct"], d["unemp_overall"],
                  "Third-level education (%)", "Unemployment (overall, %)",
                  "Education vs Unemployment (Overall)")

scatter_with_line(d["third_level_pct"], d["unemp_male"],
                  "Third-level education (%)", "Male unemployment (%)",
                  "Education vs Male Unemployment")

scatter_with_line(d["third_level_pct"], d["unemp_female"],
                  "Third-level education (%)", "Female unemployment (%)",
                  "Education vs Female Unemployment")

## Regression

A linear regression is fitted to quantify the relationship between third-level education and unemployment.

In [None]:
# Create and fit the model (x is a one-column DataFrame, y is the overall rate)
model = LinearRegression()
model.fit(x, y)

# Results: np.ravel() yields scalars whether y is 1-D or a one-column DataFrame
intercept = np.ravel(model.intercept_)[0]
slope = np.ravel(model.coef_)[0]
r_squared = model.score(x, y)

print("Intercept:", intercept)
print("Slope:", slope)
print("R-squared:", r_squared)

# Predict
y_pred = model.predict(x)

# Plot the data and regression line
plt.figure()
plt.scatter(x, y, alpha=0.5)
plt.plot(x, y_pred, color="red")
plt.xlabel("Third-Level Education (%)")
plt.ylabel("Unemployment rate (overall, %)")
plt.title("Regression of Unemployment on Third-Level Education by Small Area")
plt.show()

# statsmodels does not add an intercept automatically, so add a constant column
ols_model = sm.OLS(y, sm.add_constant(x)).fit()

print(ols_model.summary())



In [None]:
x_sm = sm.add_constant(x)
model = sm.OLS(y, x_sm).fit()

summary_table = pd.DataFrame({
    "Estimate": [
        model.params["const"],
        model.params["third_level_pct"],
    ],
    "Std. Error": [
        model.bse["const"],
        model.bse["third_level_pct"]
    ]
}, index=[
    "Intercept",
    "third_level_pct"
])

# Add R-squared
summary_table.loc["R-squared", "Estimate"] = model.rsquared
summary_table.loc["R-squared", "Std. Error"] = np.nan

print(summary_table)
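
The `sm.add_constant` step matters because an OLS fit without a constant column forces the line through the origin. The NumPy-only sketch below, on made-up data with hypothetical names `x_toy` and `y_toy`, compares the two fits.

```python
import numpy as np

# Made-up data with a clear non-zero intercept: y = 20 - 0.2 * x
x_toy = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y_toy = 20.0 - 0.2 * x_toy

# Without a constant column: slope only, line forced through (0, 0)
slope_no_const = np.linalg.lstsq(x_toy[:, None], y_toy, rcond=None)[0][0]

# With a constant column: intercept and slope are both estimated
X = np.column_stack([np.ones_like(x_toy), x_toy])
intercept, slope = np.linalg.lstsq(X, y_toy, rcond=None)[0]

print("no constant:", slope_no_const)
print("with constant:", intercept, slope)
```

In this toy case the fit without a constant even reports a positive slope, although the data were generated with a negative one. For that reason the regression estimates should be read from a model that includes the constant.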