## Religion
### Pew Research Center - Global Religious Diversity

In [17]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

This is a cross-sectional dataset with one row per country (232 records). It contains no missing values.

In [18]:
rdi_df = pd.read_csv('../datasets/processed/religion/pew-research-center-religion-diversity/religious-diversity-index.csv', header=0)
print(f"Records: {len(rdi_df)}")

print(rdi_df.info())

rdi_df.describe()


Records: 232
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232 entries, 0 to 231
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   country       232 non-null    object 
 1   rdi           232 non-null    float64
 2   christian     232 non-null    float64
 3   muslim        232 non-null    float64
 4   unaffiliated  232 non-null    float64
 5   hindu         232 non-null    float64
 6   buddhist      232 non-null    float64
 7   folk          232 non-null    float64
 8   other         232 non-null    float64
 9   jewish        232 non-null    float64
 10  population    232 non-null    int64  
dtypes: float64(9), int64(1), object(1)
memory usage: 20.1+ KB
None


Unnamed: 0,rdi,christian,muslim,unaffiliated,hindu,buddhist,folk,other,jewish,population
count,232.0,232.0,232.0,232.0,232.0,232.0,232.0,232.0,232.0,232.0
mean,2.905172,0.605638,0.224228,0.078996,0.019582,0.03444,0.026414,0.006112,0.004004,29723490.0
std,2.16898,0.374656,0.359616,0.126438,0.090157,0.139082,0.074266,0.017553,0.04965,123545000.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10000.0
25%,1.075,0.143,0.001,0.007,0.0,0.0,0.0,0.0,0.0,450000.0
50%,2.4,0.807,0.018,0.0295,0.0,0.0,0.0035,0.002,0.0,5065000.0
75%,4.625,0.922,0.24175,0.1,0.002,0.00225,0.01625,0.006,0.0,19210000.0
max,9.0,1.0,0.999,0.764,0.807,0.969,0.589,0.162,0.756,1341340000.0
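
The "no missing values" statement, and the expectation that each country's religion shares add up to roughly 1, can both be checked directly. The following is a minimal sketch on three rows copied from the dataset (Afghanistan, Albania, Andorra); `sample_df`, `tolerance`, and the 0.01 threshold are illustrative assumptions, and in the notebook the same checks would run on `rdi_df`.

```python
import pandas as pd

religions_columns = ['christian', 'muslim', 'unaffiliated', 'hindu',
                     'buddhist', 'folk', 'other', 'jewish']

# Three rows copied from the dataset; `sample_df` stands in for rdi_df here.
sample_df = pd.DataFrame(
    [
        ['Afghanistan', 0.001, 0.997, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        ['Albania', 0.18, 0.803, 0.014, 0.0, 0.0, 0.0, 0.002, 0.0],
        ['Andorra', 0.895, 0.008, 0.088, 0.005, 0.0, 0.0, 0.001, 0.003],
    ],
    columns=['country'] + religions_columns,
)

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = sample_df.isna().sum()

# Shares are fractions, so each row should sum to about 1 (up to rounding).
share_totals = sample_df[religions_columns].sum(axis=1)
tolerance = 0.01  # assumed rounding tolerance, not taken from the source
rows_off = sample_df[(share_totals - 1.0).abs() > tolerance]

print(missing_per_column)
print(share_totals)
print(f"Rows whose shares do not sum to ~1: {len(rows_off)}")
```

Rows flagged in `rows_off` would be worth inspecting before building any predictors on top of the share columns.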


For each country, I'll identify the dominant religion (the one with the largest share of the population) and create one boolean indicator column per religion.

In [19]:
religions_columns = ['christian', 'muslim', 'unaffiliated', 'hindu', 'buddhist', 'folk', 'other', 'jewish']
rdi_df['dominant_religion'] = rdi_df[religions_columns].idxmax(axis=1)

for religion in religions_columns:
    maj_column_name = f'maj_{religion}'
    rdi_df[maj_column_name] = (rdi_df['dominant_religion'] == religion).astype(int)


rdi_df.head()





Unnamed: 0,country,rdi,christian,muslim,unaffiliated,hindu,buddhist,folk,other,jewish,population,dominant_religion,maj_christian,maj_muslim,maj_unaffiliated,maj_hindu,maj_buddhist,maj_folk,maj_other,maj_jewish
0,Afghanistan,0.1,0.001,0.997,0.0,0.0,0.0,0.0,0.0,0.0,31410000,muslim,0,1,0,0,0,0,0,0
1,Albania,3.7,0.18,0.803,0.014,0.0,0.0,0.0,0.002,0.0,3200000,muslim,0,1,0,0,0,0,0,0
2,Algeria,0.5,0.002,0.979,0.018,0.0,0.0,0.0,0.0,0.0,35470000,muslim,0,1,0,0,0,0,0,0
3,American Samoa,0.4,0.983,0.0,0.007,0.0,0.003,0.004,0.003,0.0,70000,christian,1,0,0,0,0,0,0,0
4,Andorra,2.2,0.895,0.008,0.088,0.005,0.0,0.0,0.001,0.003,80000,christian,1,0,0,0,0,0,0,0
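
Note that `idxmax` picks the religion with the largest share, which is a plurality and not necessarily an absolute majority. The sketch below shows one way to flag the difference. The third row is a hypothetical country invented for illustration, and `top_share` and `is_absolute_majority` are names introduced here, not columns of the original dataset.

```python
import pandas as pd

religions_columns = ['christian', 'muslim', 'unaffiliated', 'hindu',
                     'buddhist', 'folk', 'other', 'jewish']

sample_df = pd.DataFrame(
    [
        ['Afghanistan', 0.001, 0.997, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        ['Andorra', 0.895, 0.008, 0.088, 0.005, 0.0, 0.0, 0.001, 0.003],
        # Hypothetical row: the largest group is below 50%.
        ['Hypothetical', 0.40, 0.35, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0],
    ],
    columns=['country'] + religions_columns,
)

# Label of the column holding the largest share in each row.
sample_df['dominant_religion'] = sample_df[religions_columns].idxmax(axis=1)

# Size of that largest share, and whether it exceeds one half.
sample_df['top_share'] = sample_df[religions_columns].max(axis=1)
sample_df['is_absolute_majority'] = sample_df['top_share'] > 0.5

print(sample_df[['country', 'dominant_religion', 'top_share', 'is_absolute_majority']])
```

If two religions had exactly the same top share, `idxmax` would return the first one in column order, so the order of `religions_columns` acts as a tie-breaker.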


In [20]:
maj_column_names = [f'maj_{religion}' for religion in religions_columns]

# The mean of each 0/1 indicator column is the fraction of countries where that religion is dominant
religion_percentages = rdi_df[maj_column_names].mean()

religion_percentages

maj_christian       0.689655
maj_muslim          0.215517
maj_unaffiliated    0.030172
maj_hindu           0.012931
maj_buddhist        0.034483
maj_folk            0.012931
maj_other           0.000000
maj_jewish          0.004310
dtype: float64
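
Taking the mean of a 0/1 column is equivalent to counting categories and normalizing. A small sketch on made-up labels (the `dominant` series below is toy data, not the real dataset) shows both routes giving the same fractions:

```python
import pandas as pd

# Toy labels for illustration only.
dominant = pd.Series(['christian', 'christian', 'muslim', 'christian'],
                     name='dominant_religion')

# Route 1: normalized counts per category.
shares_from_counts = dominant.value_counts(normalize=True)

# Route 2: 0/1 indicator columns, then the column means.
indicators = pd.get_dummies(dominant, prefix='maj').astype(int)
shares_from_means = indicators.mean()

print(shares_from_counts)
print(shares_from_means)
```

One difference is that `value_counts` and `get_dummies` only report categories that actually occur, whereas the explicit loop over `religions_columns` also produces columns such as `maj_other` whose mean is 0.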

I'll fit a linear regression to see whether religious diversity is associated with which religion is dominant in a country.



In [21]:
# Create a linear regression model
X = rdi_df[maj_column_names]
y = rdi_df['rdi']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary())



                            OLS Regression Results                            
Dep. Variable:                    rdi   R-squared:                       0.194
Model:                            OLS   Adj. R-squared:                  0.173
Method:                 Least Squares   F-statistic:                     9.028
Date:                Wed, 16 Apr 2025   Prob (F-statistic):           7.60e-09
Time:                        08:20:54   Log-Likelihood:                -483.30
No. Observations:                 232   AIC:                             980.6
Df Residuals:                     225   BIC:                             1005.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const                4.0736      0.345  



With an R-squared of 0.194, this linear regression is not a good model for predicting the religious diversity index, which makes sense given that the only predictors are the boolean majority columns.

Still it shows that:
- Countries with a Christian or Muslim majority have lower religious diversity, and this is statistically significant.
- Countries with an unaffiliated or folk-religion majority have higher religious diversity.
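
One caveat when reading these coefficients: the summary reports `Df Model: 6` even though eight indicator columns and a constant were passed to `sm.OLS`. A likely explanation is that the design matrix is rank deficient, because `maj_other` is 0 for every country and the remaining indicators sum to the constant column. The sketch below illustrates this on hypothetical toy data using only NumPy and pandas; the category list, the `reference` choice, and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels; 'other' never occurs, mirroring maj_other above.
dominant = pd.Series(['christian', 'christian', 'muslim',
                      'muslim', 'folk', 'christian'])
categories = ['christian', 'muslim', 'folk', 'other']

# One 0/1 indicator per category plus a constant column.
full = pd.DataFrame({f'maj_{c}': (dominant == c).astype(int) for c in categories})
full.insert(0, 'const', 1.0)

# Rank is lower than the number of columns: one column is all zeros and
# the other indicators add up to the constant.
rank_full = np.linalg.matrix_rank(full.to_numpy(dtype=float))

# Drop indicators that are never 1, then drop one reference category.
indicator_cols = [c for c in full.columns if c != 'const']
nonempty = [c for c in indicator_cols if full[c].sum() > 0]
reference = 'maj_christian'  # assumed baseline; any non-empty category works
kept = [c for c in nonempty if c != reference]
reduced = full[['const'] + kept]
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy(dtype=float))

print(f"full:    {full.shape[1]} columns, rank {rank_full}")
print(f"reduced: {reduced.shape[1]} columns, rank {rank_reduced}")
```

With a full-rank matrix like `reduced`, each remaining coefficient can be read as the difference in mean RDI relative to the reference category, and the constant as the mean RDI of the reference category itself.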

I'll save an extended version of the dataset that includes the dominant religion and the boolean columns.

In [22]:
rdi_df.to_csv('../datasets/processed/religion/pew-research-center-religion-diversity/religious-diversity-index-extended.csv', index=False)