## **Dataset 2** 

This dataset lists the membership and church counts of religious denominations in 1952, organized by county and state. Each attribute gives the number of members or the number of churches for a specific denomination.

#### Import libraries

In [1]:
import plotly.express as px
from plotly.subplots import make_subplots  
import plotly.graph_objects as go
import pandas as pd
import geopandas as gpd
import numpy as np
import folium
from folium.features import GeoJsonTooltip
from IPython.display import display, Image
import gc

#### Insert dataset

In [2]:
csv_file = 'datasets/1952.csv'
religion = pd.read_csv(csv_file)

### **Task 4** : Find the 3 most extreme counties with respect to the distribution of their churches across religions.

First, let's create the dataset we will use to answer this task. We need a DataFrame that holds the name of every county and the number of churches of each denomination in that county (the columns whose names end in `_C`).

In [3]:
selected_columns = ['CNAME'] + [col for col in religion.columns if col.endswith('_C')]
churches = religion[selected_columns]
churches.head() 

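| | CNAME | SDA_C | AOG_C | ABC_C | SBC_C | COB_C | COGT_C | COGI_C | CGC_C | NAZRN_C | ... | RECH_C | SOCBR_C | NSAC_C | UNCHU_C | UBC_C | UCHRC_C | UCA_C | VEDS_C | VLNTR_C | CGP_C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Hale, AL | 0 | 0 | 0 | 14 | 0 | 0 | 4 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Henry, AL | 0 | 3 | 0 | 21 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Houston, AL | 2 | 10 | 0 | 42 | 0 | 1 | 0 | 3 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Jackson, AL | 1 | 0 | 0 | 58 | 0 | 5 | 0 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Jefferson, AL | 4 | 15 | 0 | 232 | 0 | 29 | 18 | 19 | 8 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 11 |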
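As a quick check of the column selection, the sketch below applies the same `endswith('_C')` filter to a tiny hand-made frame. The `_M` column names and all values are assumptions made up for illustration; the output above only shows the `_C` columns.

```python
import pandas as pd

# Toy frame: the *_M columns and every value here are illustrative assumptions.
toy = pd.DataFrame({
    'CNAME': ['A, XX', 'B, YY'],
    'SDA_M': [120, 0],
    'SDA_C': [2, 0],
    'SBC_M': [900, 300],
    'SBC_C': [14, 5],
})

# Keep the county name plus every column whose name ends in '_C'.
selected = ['CNAME'] + [col for col in toy.columns if col.endswith('_C')]
toy_churches = toy[selected]

print(toy_churches.columns.tolist())
```

Only `CNAME` and the two `_C` columns survive the filter, in their original order.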


In [4]:
difChurches = churches.copy()  # work on a copy so that churches stays unchanged
difChurches['DIFCHURCHES'] = (difChurches != 0).sum(axis=1)  # per county, count the entries that are not zero
sorted_difChurches = difChurches.sort_values(by='DIFCHURCHES', ascending=False)  # sort the counties from the highest count to the lowest
print('The counties are presented below, in descending order of the number of distinct denominations that have at least one church there.')
sorted_difChurches[['CNAME', 'DIFCHURCHES']].head(3)

The counties are presented below, in descending order of the number of distinct denominations that have at least one church there.


| | CNAME | DIFCHURCHES |
|---|---|---|
| 172 | Los Angeles, CA | 70 |
| 574 | Cook, IL | 66 |
| 1276 | Wayne, MI | 61 |


In [5]:
bar1 = sorted_difChurches[['CNAME', 'DIFCHURCHES']].head(3)
fig = px.bar(bar1, x='CNAME', y='DIFCHURCHES', labels={'DIFCHURCHES': 'Number of distinct denominations'})
fig.update_layout(title='Distinct Denominations with Churches: Top 3 Counties', xaxis_title='Counties', yaxis_title='Number of distinct denominations')
fig.update_xaxes(tickangle=45)  # Rotate x-axis labels for better readability
fig.update_traces(marker_color='green') 
fig.show()

The table and the bar chart show that Los Angeles (CA) has churches from the widest range of denominations, with a count of 70. Cook (IL) ranks second with 66, and Wayne (MI) is third with 61.
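One detail worth checking in the count: `difChurches != 0` is evaluated on every column, including `CNAME`, and a string never equals 0, so the name column appears to add one to each row's total. The ranking is unaffected, since every county receives the same offset. A minimal sketch on made-up data, restricting the comparison to the `_C` columns:

```python
import pandas as pd

# Toy data; all names and values are illustrative assumptions.
toy = pd.DataFrame({
    # Stored as plain Python strings (object dtype) for this illustration.
    'CNAME': pd.Series(['A, XX', 'B, YY', 'C, ZZ'], dtype=object),
    'SDA_C': [2, 0, 0],
    'SBC_C': [14, 5, 0],
    'AOG_C': [1, 0, 0],
})

count_cols = [c for c in toy.columns if c.endswith('_C')]

# Comparison over the whole frame: the CNAME column is always "not zero".
with_name = (toy != 0).sum(axis=1)

# Comparison over the church-count columns only.
numeric_only = (toy[count_cols] != 0).sum(axis=1)

print(with_name.tolist())     # [4, 2, 1]
print(numeric_only.tolist())  # [3, 1, 0]
```

The third toy county has no churches at all, yet the whole-frame comparison still reports 1 for it.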

#### Second Approach : Find Standard deviation for number of churches of each religion within each county


In the second method, we compute, for each county, the standard deviation of the church counts across denominations. Unlike the first approach, which only checks whether a denomination is present, this one takes into account how many churches each denomination has.

In [6]:
stDevChurches = churches.copy()
stDevChurches['STD_DEV'] = stDevChurches[churches.columns[1:]].std(axis=1)
sorted_stDevChurches = stDevChurches.sort_values(by='STD_DEV', ascending=False)
sorted_stDevChurches[['CNAME', 'STD_DEV']].head(3)

| | CNAME | STD_DEV |
|---|---|---|
| 574 | Cook, IL | 44.53237 |
| 172 | Los Angeles, CA | 38.327505 |
| 2209 | Allegheny, PA | 28.252443 |


In [7]:
bar2 = sorted_stDevChurches[['CNAME', 'STD_DEV']].head(3)
fig = px.bar(bar2, x='CNAME', y='STD_DEV', labels={'STD_DEV': 'St. dev. of church counts'})
fig.update_layout(title='Standard Deviation of Church Counts across Denominations: Top 3 Counties', xaxis_title='Counties', yaxis_title='St. dev. of church counts')
fig.update_xaxes(tickangle=45)  
fig.update_traces(marker_color='green') 
fig.show()

The table and the bar plot of the per-county standard deviation show that Cook County (IL) has the highest value, 44.53. Los Angeles County (CA) follows with 38.33, and Allegheny County (PA) is third with 28.25. A high value indicates that church counts differ widely from one denomination to another within the county.
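For reference, pandas' `DataFrame.std` computes the sample standard deviation (`ddof=1`) by default, and with `axis=1` it is taken across the columns of each row. A small sketch with made-up counts, checked against NumPy:

```python
import numpy as np
import pandas as pd

# Toy data; all names and values are illustrative assumptions.
toy = pd.DataFrame({
    'CNAME': ['A, XX', 'B, YY'],
    'SDA_C': [2, 5],
    'SBC_C': [14, 5],
    'AOG_C': [2, 5],
})

count_cols = toy.columns[1:]  # every column except CNAME

# Row-wise sample standard deviation (pandas default ddof=1).
row_std = toy[count_cols].std(axis=1)

# The same quantity computed with NumPy, with ddof=1 set explicitly.
manual = np.std(toy[count_cols].to_numpy(), axis=1, ddof=1)

print(row_std.round(3).tolist())  # [6.928, 0.0]
print(np.allclose(row_std, manual))  # True
```

County A, where one denomination dominates, gets a large value. County B, where every denomination has the same number of churches, gets 0.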

From the two approaches above we can make some observations:
- Los Angeles (CA) has the highest count of distinct denominations with churches (70), but it ranks second by standard deviation (38.33).

- Cook (IL) ranks second by count of distinct denominations (66) and first by standard deviation (44.53), meaning its church counts vary the most from one denomination to another.

- Wayne (MI) ranks third by count of distinct denominations (61), whereas Allegheny (PA) ranks third by standard deviation (28.25). The two measures therefore select the same two counties for the top positions, in different order, but differ on the third.
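Because the standard deviation is computed on raw counts, it grows with the overall number of churches in a county, which may partly explain why large counties lead both rankings. The sketch below uses made-up counts for two hypothetical counties with identical denominational shares but a tenfold difference in size, then repeats the calculation on shares. Normalizing by the row total is an illustrative alternative, not something the analysis above does.

```python
import pandas as pd

# Two hypothetical counties with the same mix of denominations;
# 'Large, YY' simply has ten times as many churches. All values are made up.
toy = pd.DataFrame(
    {'SDA_C': [1, 10], 'SBC_C': [7, 70], 'AOG_C': [2, 20]},
    index=['Small, XX', 'Large, YY'],
)

# Standard deviation of the raw counts: scales with county size.
raw_std = toy.std(axis=1)

# Convert each row to shares of that county's total, then take the std.
shares = toy.div(toy.sum(axis=1), axis=0)
share_std = shares.std(axis=1)

print(raw_std.round(3).to_dict())
print(share_std.round(3).to_dict())
```

On raw counts the larger county's standard deviation is ten times the smaller one's, while on shares the two values coincide.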