# Setup


## Install Required Libraries

If your task needs Python libraries that are not already installed, install them as shown below.

In [1]:
!pip install pycountry

Collecting pycountry
  Downloading pycountry-24.6.1-py3-none-any.whl.metadata (12 kB)
Downloading pycountry-24.6.1-py3-none-any.whl (6.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 39.4 MB/s eta 0:00:00
Installing collected packages: pycountry
Successfully installed pycountry-24.6.1


## Import Required Libraries

In [2]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import pycountry as pc
import plotly.offline as pyo
import plotly.express as px
import folium
import statsmodels.api as sm

---


# Select the Dataset


In [3]:
df = pd.read_csv('/kaggle/input/happiness-world/2016.csv')
# Preview the first five rows of the dataset
df.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137
3,Norway,Western Europe,4,7.498,7.421,7.575,1.57744,1.1269,0.79579,0.59609,0.35776,0.37895,2.66465
4,Finland,Western Europe,5,7.413,7.351,7.475,1.40598,1.13464,0.81091,0.57104,0.41004,0.25492,2.82596


## Data Cleaning

In [4]:
# Get the summary statistics for numeric columns
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
print("Summary statistics for numeric columns:")
for col in numeric_cols:
    print(f"{col}:")
    print(df[col].describe())


Summary statistics for numeric columns:
Happiness Rank:
count    157.000000
mean      78.980892
std       45.466030
min        1.000000
25%       40.000000
50%       79.000000
75%      118.000000
max      157.000000
Name: Happiness Rank, dtype: float64
Happiness Score:
count    157.000000
mean       5.382185
std        1.141674
min        2.905000
25%        4.404000
50%        5.314000
75%        6.269000
max        7.526000
Name: Happiness Score, dtype: float64
Lower Confidence Interval:
count    153.000000
mean       5.268641
std        1.151503
min        2.732000
25%        4.322000
50%        5.226000
75%        6.128000
max        7.460000
Name: Lower Confidence Interval, dtype: float64
Family:
count    157.000000
mean       0.793621
std        0.266706
min        0.000000
25%        0.641840
50%        0.841420
75%        1.021520
max        1.183260
Name: Family, dtype: float64
Trust (Government Corruption):
count    157.000000
mean       0.137624
std        0.111038
min      

In [5]:
# Get the summary statistics for object columns
object_cols = df.select_dtypes(include=['object']).columns
print("\nSummary statistics for object columns:")
for col in object_cols:
    print(f"{col}:")
    print(df[col].describe())


Summary statistics for object columns:
Country:
count         157
unique        157
top       Denmark
freq            1
Name: Country, dtype: object
Region:
count                    157
unique                    10
top       Sub-Saharan Africa
freq                      38
Name: Region, dtype: object
Upper Confidence Interval:
count       155
unique      152
top       5.923
freq          2
Name: Upper Confidence Interval, dtype: object
Economy (GDP per Capita):
count         156
unique        156
top       1.44178
freq            1
Name: Economy (GDP per Capita), dtype: object
Health (Life Expectancy):
count         155
unique        154
top       0.62994
freq            2
Name: Health (Life Expectancy), dtype: object
Freedom:
count         157
unique        157
top       0.57941
freq            1
Name: Freedom, dtype: object


In [6]:
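# Flag every column that is not numeric. Country and Region are expected to be text;
# the other flagged columns hold numbers that were read as text and need conversion.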
for col in df.columns:
    if df[col].dtype not in ['int64', 'float64']:
        print(f"Column '{col}' has incorrect data type. Expected int64 or float64, got {df[col].dtype}")

Column 'Country' has incorrect data type. Expected int64 or float64, got object
Column 'Region' has incorrect data type. Expected int64 or float64, got object
Column 'Upper Confidence Interval' has incorrect data type. Expected int64 or float64, got object
Column 'Economy (GDP per Capita)' has incorrect data type. Expected int64 or float64, got object
Column 'Health (Life Expectancy)' has incorrect data type. Expected int64 or float64, got object
Column 'Freedom' has incorrect data type. Expected int64 or float64, got object


In [7]:
# Strip leading and trailing whitespace from every text column,
# then confirm that no value still differs from its stripped form
for col in df.columns:
    if df[col].dtype not in ['int64','float64']:
        print('columns', col)
        df[col] = df[col].str.strip()
        has_trailing_spaces = df[col] != df[col].str.strip()
        print(has_trailing_spaces)

columns Country
0      False
1      False
2      False
3      False
4      False
       ...  
152    False
153    False
154    False
155    False
156    False
Name: Country, Length: 157, dtype: bool
columns Region
0      False
1      False
2      False
3      False
4      False
       ...  
152    False
153    False
154    False
155    False
156    False
Name: Region, Length: 157, dtype: bool
columns Upper Confidence Interval
0      False
1      False
2      False
3      False
4      False
       ...  
152    False
153    False
154    False
155    False
156    False
Name: Upper Confidence Interval, Length: 157, dtype: bool
columns Economy (GDP per Capita)
0      False
1      False
2      False
3      False
4      False
       ...  
152    False
153    False
154    False
155    False
156    False
Name: Economy (GDP per Capita), Length: 157, dtype: bool
columns Health (Life Expectancy)
0      False
1      False
2      False
3      False
4      False
       ...  
152    False
153    False
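
Because the cell above strips each column before comparing, the printed flags can only confirm that no padding remains afterwards. To learn how many values were padded in the first place, the comparison has to run before the strip. A minimal sketch of that order, using a small made-up frame rather than the happiness data (the sample values and the `padded_counts` name are illustrative assumptions):

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be df from read_csv
sample = pd.DataFrame({
    "Country": ["Denmark ", " Iceland", "Norway"],
    "Freedom": ["0.57941", "0.56624 ", None],
})

padded_counts = {}
for col in sample.columns:
    if pd.api.types.is_numeric_dtype(sample[col]):
        continue  # only text columns can carry stray whitespace
    values = sample[col].dropna()
    # Count values that change when stripped, i.e. had leading/trailing whitespace
    padded_counts[col] = int((values != values.str.strip()).sum())
    # Strip only after counting
    sample[col] = sample[col].str.strip()

print(padded_counts)
```

Missing values are dropped before the comparison because `NaN != NaN` evaluates to `True` and would otherwise be counted as padded.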

In [8]:
# Count the missing values in each column
print(df.isnull().sum())

Country                          0
Region                           0
Happiness Rank                   0
Happiness Score                  0
Lower Confidence Interval        4
Upper Confidence Interval        2
Economy (GDP per Capita)         1
Family                           0
Health (Life Expectancy)         2
Freedom                          0
Trust (Government Corruption)    0
Generosity                       0
Dystopia Residual                0
dtype: int64


In [9]:
# Select rows with any null values
rows_with_nulls = df[df.isnull().any(axis=1)]
#print(rows_with_nulls)
print(rows_with_nulls[['Country','Lower Confidence Interval','Upper Confidence Interval','Economy (GDP per Capita)','Health (Life Expectancy)']])

           Country  Lower Confidence Interval Upper Confidence Interval  \
9           Sweden                      7.227                       NaN   
12   United States                      7.020                     7.188   
14     Puerto Rico                        NaN                     7.284   
36           Spain                        NaN                     6.434   
41         Bahrain                      6.128                     6.308   
44        Slovakia                      5.996                      6.16   
63            Peru                        NaN                     5.839   
69        Paraguay                      5.453                       NaN   
110   Sierra Leone                        NaN                     4.765   

    Economy (GDP per Capita) Health (Life Expectancy)  
9                    1.45181                  0.83121  
12                   1.50796                      NaN  
14                   1.35943                  0.77758  
36                   1.34

In [10]:
# Convert the columns that were read as text to a numeric type.
# 'Upper Confidence Interval', 'Economy (GDP per Capita)', 'Health (Life Expectancy)' and 'Freedom' were loaded as object.
cols = ['Upper Confidence Interval', 'Economy (GDP per Capita)','Health (Life Expectancy)','Freedom']
for i in cols:
    print(i,df[i].dtype)
    # Values that cannot be parsed as numbers become NaN
    df[i] = pd.to_numeric(df[i], errors='coerce')
    print('after change',i,df[i].dtype)

Upper Confidence Interval object
after change Upper Confidence Interval float64
Economy (GDP per Capita) object
after change Economy (GDP per Capita) float64
Health (Life Expectancy) object
after change Health (Life Expectancy) float64
Freedom object
after change Freedom float64
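
With `errors='coerce'`, any entry that cannot be parsed is silently turned into NaN, so it can be worth listing those entries before they are lost. The sketch below shows one way to do that on a made-up column (the values, including the `"n/a"` entry, are illustrative assumptions and not taken from the dataset):

```python
import pandas as pd

# Hypothetical column stored as object dtype, as in the notebook:
# numbers kept as text, one unparseable entry, one genuinely missing value
raw = pd.Series(
    ["1.44178", "1.52733", "n/a", None, "1.57744"],
    name="Economy (GDP per Capita)",
    dtype=object,
)

converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but became NaN during conversion
newly_missing = raw[raw.notna() & converted.isna()]
print(newly_missing.tolist())
```

If `newly_missing` is empty, the conversion only changed the column's type; otherwise the listed values explain why pandas read the column as text.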


In [11]:
# Replace the missing values identified above with the mean of each column
df['Lower Confidence Interval'] = df['Lower Confidence Interval'].fillna(df['Lower Confidence Interval'].mean())
df['Upper Confidence Interval'] = df['Upper Confidence Interval'].fillna(df['Upper Confidence Interval'].mean())
df['Economy (GDP per Capita)'] = df['Economy (GDP per Capita)'].fillna(df['Economy (GDP per Capita)'].mean())
df['Health (Life Expectancy)'] = df['Health (Life Expectancy)'].fillna(df['Health (Life Expectancy)'].mean())
df['Freedom'] = df['Freedom'].fillna(df['Freedom'].mean())

In [12]:
# Select rows with any null values again to check
rows_with_nulls = df[df.isnull().any(axis=1)]
print(rows_with_nulls)
print(rows_with_nulls[['Country','Lower Confidence Interval','Upper Confidence Interval','Economy (GDP per Capita)','Health (Life Expectancy)']])

Empty DataFrame
Columns: [Country, Region, Happiness Rank, Happiness Score, Lower Confidence Interval, Upper Confidence Interval, Economy (GDP per Capita), Family, Health (Life Expectancy), Freedom, Trust (Government Corruption), Generosity, Dystopia Residual]
Index: []
Empty DataFrame
Columns: [Country, Lower Confidence Interval, Upper Confidence Interval, Economy (GDP per Capita), Health (Life Expectancy)]
Index: []


In [13]:
# Print the columns with missing values - confirmation
print(df.isnull().sum())

Country                          0
Region                           0
Happiness Rank                   0
Happiness Score                  0
Lower Confidence Interval        0
Upper Confidence Interval        0
Economy (GDP per Capita)         0
Family                           0
Health (Life Expectancy)         0
Freedom                          0
Trust (Government Corruption)    0
Generosity                       0
Dystopia Residual                0
dtype: int64


## Exploratory Data Analysis

In [14]:
# Identify the 10 countries with the highest GDP per capita and plot their GDP per capita
# and Healthy Life Expectancy in a Plotly bar chart named fig1.

top_10_countries = df.nlargest(10, 'Economy (GDP per Capita)')

# Melt the DataFrame into long format so the two parameters share one value column
df_melted = top_10_countries.melt(id_vars='Country', 
                                    value_vars=['Economy (GDP per Capita)', 'Health (Life Expectancy)'], 
                                    var_name='Parameter', 
                                    value_name='Value')

print(df_melted)

# Combined bar chart to show GDP per capita and Healthy Life Expectancy of the top 10 countries
fig1 = px.bar(df_melted, 
                      x='Country', 
                      y='Value', 
                      color='Parameter', 
                      title='Top 10 Countries by GDP per Capita, with Healthy Life Expectancy',
                      width=800, 
                      height=600)
fig1.show()


                 Country                 Parameter     Value
0                  Qatar  Economy (GDP per Capita)  1.824270
1             Luxembourg  Economy (GDP per Capita)  1.697520
2              Singapore  Economy (GDP per Capita)  1.645550
3                 Kuwait  Economy (GDP per Capita)  1.617140
4                 Norway  Economy (GDP per Capita)  1.577440
5   United Arab Emirates  Economy (GDP per Capita)  1.573520
6            Switzerland  Economy (GDP per Capita)  1.527330
7              Hong Kong  Economy (GDP per Capita)  1.510700
8          United States  Economy (GDP per Capita)  1.507960
9           Saudi Arabia  Economy (GDP per Capita)  1.489530
10                 Qatar  Health (Life Expectancy)  0.717230
11            Luxembourg  Health (Life Expectancy)  0.845420
12             Singapore  Health (Life Expectancy)  0.947190
13                Kuwait  Health (Life Expectancy)  0.635690
14                Norway  Health (Life Expectancy)  0.795790
15  United Arab Emirates

In [15]:
# Correlation heatmap
# Select the specified variables
df_heat = df[['Economy (GDP per Capita)', 'Health (Life Expectancy)', 'Freedom', 
         'Trust (Government Corruption)', 'Generosity', 'Happiness Score']]

# Calculate the correlation matrix
corr_matrix = df_heat.corr()

# Create a heatmap
fig2 = px.imshow(corr_matrix, 
                text_auto=True, 
                color_continuous_scale='Viridis', 
                title='Correlation Heatmap',
                width=800,  # Adjust width
                height=600)  # Adjust height

fig2.show()

In [16]:
# Scatter plot showing the effect of GDP per capita on Happiness Score across regions
# First compute the correlation between GDP and happiness, then fit a regression

#print(df)
df_corr = df[['Economy (GDP per Capita)', 'Happiness Score']].corr()
df_corr
# Perform regression analysis
X = df['Economy (GDP per Capita)']
y = df['Happiness Score']
z = df['Region']
print(z)

# Add a constant to the independent variable
X = sm.add_constant(X)

# Fit the regression model
model = sm.OLS(y, X).fit()

# Print the summary of the regression
print(model.summary())

#Visualization

fig3 = px.scatter(df, x='Economy (GDP per Capita)', y='Happiness Score', color=z, 
                 title='Effect of GDP on Happiness Across Regions',
                 labels={'Economy (GDP per Capita)': 'GDP per Capita', 'Happiness Score': 'Happiness Score'})
fig3.show()




0                       Western Europe
1                       Western Europe
2                       Western Europe
3                       Western Europe
4                       Western Europe
                    ...               
152                 Sub-Saharan Africa
153                      Southern Asia
154                 Sub-Saharan Africa
155    Middle East and Northern Africa
156                 Sub-Saharan Africa
Name: Region, Length: 157, dtype: object
                            OLS Regression Results                            
Dep. Variable:        Happiness Score   R-squared:                       0.624
Model:                            OLS   Adj. R-squared:                  0.621
Method:                 Least Squares   F-statistic:                     256.7
Date:                Mon, 04 Nov 2024   Prob (F-statistic):           1.07e-34
Time:                        15:05:46   Log-Likelihood:                -166.39
No. Observations:                 157   AIC:            
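
For a regression with a single predictor, R-squared equals the square of the Pearson correlation between the two variables, so the 0.624 reported above corresponds to a correlation of magnitude roughly 0.79 between GDP per capita and happiness score. A small sketch of that identity on synthetic data (the numbers are generated, not drawn from the dataset), using NumPy's least-squares line fit as a stand-in for statsmodels:

```python
import numpy as np

# Synthetic stand-ins for GDP per capita (x) and happiness score (y)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.8, size=157)
y = 3.0 + 2.0 * x + rng.normal(0.0, 0.7, size=157)

# Fit y = slope * x + intercept by ordinary least squares
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# R-squared: share of the variance in y explained by the fitted line
r_squared = 1.0 - residuals.var() / y.var()

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# The two printed values agree up to floating-point error
print(round(r_squared, 6), round(r ** 2, 6))
```

The identity holds only for simple regression with an intercept; once more predictors are added, R-squared no longer reduces to a single squared correlation.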

In [17]:
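# Pie chart of the regions. count() tallies rows, so the 'Happiness Score' column
# of df_pie holds the number of countries in each region, not a score.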

df_pie = df.groupby('Region')['Happiness Score'].count().reset_index()
df_pie

# Visualization - Pie chart
fig4 = px.pie(df_pie, values='Happiness Score',names='Region',
             title='Happiness Score by Regions',width=800, height=600)
fig4.show()
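
Since `count()` tallies rows, the pie chart above shows how many countries fall in each region. If the aim is to compare happiness levels between regions, one option is to average the score per region instead. A sketch on a tiny made-up frame (the values are illustrative; only the column names follow the dataset):

```python
import pandas as pd

# Hypothetical sample with the same column names as the notebook's df
sample = pd.DataFrame({
    "Region": ["Western Europe", "Western Europe", "Sub-Saharan Africa", "Southern Asia"],
    "Happiness Score": [7.5, 6.5, 4.0, 4.5],
})

# Mean score per region, sorted from the highest mean to the lowest
region_means = (
    sample.groupby("Region")["Happiness Score"]
    .mean()
    .sort_values(ascending=False)
    .reset_index()
)
print(region_means)
```

A frame like `region_means` probably suits a bar chart, for example `px.bar(region_means, x='Region', y='Happiness Score')`, better than a pie, because mean scores are not parts of a whole.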

In [18]:
# Create a Plotly map (fig5) showing GDP per capita by country, with Healthy Life Expectancy in the tooltip
fig5 = px.choropleth(
    df,
    locations='Country',  # Column holding the country names
    locationmode='country names',  # Specify that we are using country names
    color='Economy (GDP per Capita)',
    hover_name='Country',
    hover_data={'Health (Life Expectancy)': True},  # Show Healthy Life Expectancy in tooltip
    color_continuous_scale=px.colors.sequential.Plasma,
    title='GDP per Capita and Healthy Life Expectancy by Country',
    labels={'Economy (GDP per Capita)': 'Economy (GDP per Capita)'}
)


# Show the map
fig5.show()

## Saving all Plots to a HTML File (Dashboard.html)

In [19]:
# Save the figures to individual HTML files and to a single dashboard file
import plotly.offline as pyo

figures = [fig1, fig2, fig3, fig4, fig5]

# Save each figure to its own HTML file (figure1.html ... figure5.html)
for i, figure in enumerate(figures, start=1):
    pyo.plot(figure, filename=f'figure{i}.html', auto_open=False)

# Combine the five figures into one HTML page. Each figure is embedded as a
# <div> fragment (full_html=False) so the page does not contain nested <html>
# documents, and plotly.js is included once, with the first figure.
with open('Dashboard.html', 'w', encoding='utf-8') as combined_file:
    combined_file.write('<html><head><title>Combined Figures</title></head><body>')
    for i, figure in enumerate(figures):
        combined_file.write(figure.to_html(full_html=False, include_plotlyjs=(i == 0)))
    combined_file.write('</body></html>')

print("Figures saved to Dashboard.html")


Figures saved to Dashboard.html


In [20]:
print(df.columns)

Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Lower Confidence Interval', 'Upper Confidence Interval',
       'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
       'Freedom', 'Trust (Government Corruption)', 'Generosity',
       'Dystopia Residual'],
      dtype='object')


In [21]:
# Create subplots
fig = make_subplots(
    rows=3, cols=2,
    specs=[[{"type": "bar"}, {"type": "scatter"}],
           [{"type": "heatmap"}, {"type": "pie"}],
           [{"type": "scattergeo"}, None]],
    subplot_titles=("Bar Chart: GDP & Life Expectancy", "Scatter Plot: GDP vs Happiness Score",
                    "Heatmap: Correlation", "Pie Chart: Happiness by Region", "Map: GDP & Life Expectancy")
)

# Section 1: Bar Chart - GDP Capita and Health Life expectancy of top 10 countries.
fig.add_trace(go.Bar(
    x=top_10_countries['Country'],
    y=top_10_countries['Economy (GDP per Capita)'],
    name='GDP per Capita',
    marker_color='blue'
), row=1, col=1)

fig.add_trace(go.Bar(
    x=top_10_countries['Country'],
    y=top_10_countries['Health (Life Expectancy)'],
    name='Healthy Life Expectancy',
    marker_color='orange'
), row=1, col=1)


# Section 2: Heatmap

fig.add_trace(go.Heatmap(
    z=corr_matrix.values,  # Use the values of the correlation matrix
    x=corr_matrix.columns,  # Use the column names for x-axis
    y=corr_matrix.index,     # Use the index names for y-axis
    colorscale='Viridis',
    colorbar=dict(
        title='Color Scale',  # Title for the colorbar
        x=-0.3,  # Adjust x position (negative value moves it to the left)
        y=0.5,   # Adjust y position (0.5 centers it vertically)
        xanchor='right',  # Anchor the colorbar to the right
        yanchor='middle'  # Anchor the colorbar to the middle
    )
), row=2, col=1)

# Section 3: Scatter Plot
fig.add_trace(go.Scatter(
    x=df['Economy (GDP per Capita)'],
    y=df['Happiness Score'],
    mode='markers',
    marker=dict(size=10, color='blue', line=dict(width=1)),
    text=df['Country'],
    name='GDP vs Happiness'
), row=1, col=2)



# Section 4: Pie Chart

fig.add_trace(go.Pie(
    labels=df_pie['Region'],
    values=df_pie['Happiness Score'],
    name='Happiness Score by Region'
), row=2, col=2)

# Scattergeo expects ISO 3166-1 alpha-3 codes in `locations` by default,
# so the country names in df are translated to codes with pycountry.
# Create a DataFrame with country names and ISO codes
countries = [(country.name, country.alpha_3) for country in pc.countries]
country_df = pd.DataFrame(countries, columns=['Country Name', 'Country Code'])

# Example: Display the DataFrame
#print(country_df[country_df['Country Name'] == 'Denmark'])

# List to hold country codes
locations = []

#Get country code
for country in df['Country']:
    if country in country_df['Country Name'].values:  # Check if country is in the list of country names
        # Get the corresponding country code
        country_code = country_df[country_df['Country Name'] == country]['Country Code'].values[0]
        locations.append(country_code)
        #print(country_code)
        #print(f'Country: {country}, Country Code: {country_code}')
    else:
        locations.append(None)  # Append None if no country code is found
        #print('No country code for the country:', country)

# Section 5: Map

text=df['Economy (GDP per Capita)']
#print(locations)
fig.add_trace(go.Scattergeo(
    locations=locations,
    text=text,
    #customdata = df,
    #text=[f"GDP: ${gdp}<br>Life Expectancy: {life} years" for gdp, life in zip(df['Economy (GDP per Capita)'], df['Health (Life Expectancy)'])],
    #hoverinfo=df['Country'],
    textposition='top center',  # Adjust text position
    marker=dict(
        size=10,
        color=df['Economy (GDP per Capita)'],  # Color based on GDP
        colorscale='Viridis',  # Choose a colorscale
        colorbar=dict(
            title='GDP/Capita',
            x=1.8,  # x position in paper coordinates (values above 1 sit to the right of the plots)
            y=0.15,  # y position in paper coordinates (must lie between -2 and 3); 0.15 sits near the bottom row
            xanchor='right',  # Anchor the colorbar by its right edge
            yanchor='middle'  # Anchor the colorbar by its vertical middle
        ),
        line=dict(width=0.5, color='black')
    )
), row=3, col=1)


# Update layout
fig.update_layout(
    title_text='World Happiness Report',
    geo=dict(
        scope='world',  # Adjust the scope as needed
        showland=True
    ),
    font=dict(size=10),  # Adjust font size
    #width=1000,  # Overall figure width
    height=600,   # Overall figure height
    showlegend=True,
    margin=dict(l=50, r=50, t=50, b=10)  # Adjust bottom margin
)

# Save to HTML file
fig.write_html("world_happiness_report.html")

# Writing the file again with plotly.offline was an attempt to avoid stray HTML tags
# at the bottom of the HTML file, but it did not help.
pyo.plot(fig, filename='world_happiness_report.html', auto_open=False)

# Show the figure
fig.show()
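
The lookup loop in the cell above scans `country_df` once for every country. A dictionary built once and applied with `Series.map` does the same job in a single pass and makes the unmatched names easy to list. A minimal sketch, with a small hand-written mapping standing in for the pycountry table (the sample names and the `name_to_code` variable are illustrative assumptions):

```python
import pandas as pd

# Stand-in for {country.name: country.alpha_3 for country in pc.countries}
name_to_code = {"Denmark": "DNK", "Iceland": "ISL", "Russian Federation": "RUS"}

# Hypothetical country column; in the notebook this would be df['Country']
countries = pd.Series(["Denmark", "Iceland", "Russia"], name="Country")

# map() returns NaN wherever a name has no entry in the dictionary
codes = countries.map(name_to_code)

# Names that need manual handling because the spellings differ
unmatched = countries[codes.isna()].tolist()
print(unmatched)
```

Countries whose dataset spelling differs from the ISO name end up without a code and do not appear on the map, so listing them is a useful check. pycountry also provides `pc.countries.lookup()`, which accepts several name forms and codes and may resolve some of these cases.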



## Narrative for the World Happiness Report Dashboard

**Title:** "World Happiness Report Dashboard"


**Introduction:**
The World Happiness Report is a widely recognized measure of happiness and well-being. This dashboard analyzes the 2016 data through five views: a bar chart of GDP per capita and healthy life expectancy for the top 10 countries, a correlation heatmap, a scatter plot of the effect of GDP per capita on happiness score across regions, a pie chart by region, and a map of GDP per capita with healthy life expectancy as a tooltip.


**Section 1: Bar Chart**

A bar chart will display the GDP per capita and healthy life expectancy of the 10 countries with the highest GDP per capita. The x-axis lists the countries and the y-axis shows the two measures in different colors: blue for GDP per capita and orange for healthy life expectancy. The chart gives a snapshot of economic output and health in these nations.


**Section 2: Correlation Heatmap**
A heatmap will be used to visualize the pairwise correlations among GDP per capita, healthy life expectancy, freedom, trust (government corruption), generosity, and happiness score. The color of each cell encodes the Pearson correlation coefficient (r), which indicates the strength and direction of the relationship between two variables.

**Section 3: Scatter Plot - GDP per Capita vs. Happiness Score**
A scatter plot will be used to identify the effect of GDP per capita on happiness score in various regions. The plot will display the relationship between the two variables, with GDP per capita on the x-axis and happiness score on the y-axis.

**Section 4: Pie Chart - Happiness Score by Region**
A pie chart will be used to present the regions. Because the underlying data counts the countries in each region, each slice shows the share of the 157 countries that belongs to that region rather than an average happiness score.

**Section 5: Map with GDP per Capita and Healthy Life Expectancy**
A map will be displayed to show the GDP per capita of countries and include their corresponding healthy life expectancy as a tooltip. The map will be interactive, allowing users to hover over each country to view its GDP per capita and healthy life expectancy.

**Interactive Elements:**

* Hover-over tooltip: shows each country's GDP per capita and healthy life expectancy
* Zoom and pan: users can scroll to zoom in on a specific country and drag to pan the map


## Authors


[Abhishek Gagneja](https://www.linkedin.com/in/abhishek-gagneja-23051987/)


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-12-10|0.1|Abhishek Gagneja|Initial Draft created|


Copyright © 2023 IBM Corporation. All rights reserved.
