**Name**: Dewyani Deshmukh  
**UID**: 2021300024 
**Class**: BE Comps

# Dataset

You can view the dataset from [this link](https://www.kaggle.com/datasets/nishanthsalian/socioeconomic-country-profiles).


# Description
This dataset contains various socio-economic indicators for a selection of countries around the world. It includes data on demographics, economy, employment, health, education, infrastructure, environment, and quality of life. The dataset offers a broad overview of each country's socio-economic situation around 2017, allowing for comparisons across countries and regions.

# Metadata

| Variable | Description | Data Type |
|---|---|---|
| country | Name of the country | String |
| Region | Geographical region the country belongs to | String |
| Surface area (km2) | Total surface area of the country in square kilometers | Numeric |
| Population in thousands (2017) | Total population of the country in thousands, as of 2017 | Numeric |
| Population density (per km2, 2017) | Number of people per square kilometer of land area, as of 2017 | Numeric |
| Sex ratio (m per 100 f, 2017) | Number of males per 100 females, as of 2017 | Numeric |
| GDP: Gross domestic product (million current US$) | Total value of goods and services produced within the country in a year, expressed in millions of US dollars | Numeric |
| GDP growth rate (annual %, const. 2005 prices) | Annual percentage change in GDP, adjusted for inflation using 2005 as the base year | Numeric |
| GDP per capita (current US$) | GDP divided by the total population, expressed in US dollars | Numeric |
| Economy: Agriculture (% of GVA) | Contribution of agriculture to the Gross Value Added (GVA), expressed as a percentage | Numeric |
| Economy: Industry (% of GVA) | Contribution of industry to the GVA, expressed as a percentage | Numeric |
| Economy: Services and other activity (% of GVA) | Contribution of services and other economic activities to the GVA, expressed as a percentage | Numeric |
| Employment: Agriculture (% of employed) | Percentage of the employed population working in agriculture | Numeric |
| Employment: Industry (% of employed) | Percentage of the employed population working in industry | Numeric |
| Employment: Services (% of employed) | Percentage of the employed population working in services | Numeric |
| Unemployment (% of labour force) | Percentage of the labor force that is unemployed | Numeric |
| Labour force participation (female/male pop. %) | Percentage of the female and male population that is part of the labor force | Numeric |
| Agricultural production index (2004-2006=100) | Index measuring the volume of agricultural production, with 2004-2006 as the base period | Numeric |
| Food production index (2004-2006=100) | Index measuring the volume of food production, with 2004-2006 as the base period | Numeric |
| International trade: Exports (million US$) | Total value of goods and services exported by the country in millions of US dollars | Numeric |
| International trade: Imports (million US$) | Total value of goods and services imported by the country in millions of US dollars | Numeric |
| International trade: Balance (million US$) | Difference between exports and imports, expressed in millions of US dollars | Numeric |


In [2]:
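import numpy as np                    # used by the polynomial fit below
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go     # used by the nonlinear, waterfall and funnel charts
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS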

In [4]:
df = pd.read_csv("socio_economic.csv",index_col=0)
df

Unnamed: 0,country,Region,Surface area (km2),Population in thousands (2017),"Population density (per km2, 2017)","Sex ratio (m per 100 f, 2017)",GDP: Gross domestic product (million current US$),"GDP growth rate (annual %, const. 2005 prices)",GDP per capita (current US$),Economy: Agriculture (% of GVA),...,"Inflation, consumer prices (annual %)","Life expectancy at birth, female (years)","Life expectancy at birth, male (years)","Life expectancy at birth, total (years)",Military expenditure (% of GDP),"Population, female","Population, male",Tax revenue (% of GDP),"Taxes on income, profits and capital gains (% of revenue)",Urban population (% of total population)_y
0,Argentina,SouthAmerica,2780400,44271,16.2,95.9,632343,2.4,14564.5,6.0,...,,79.726,72.924,76.372000,0.856138,22572521.0,21472290.0,10.955501,12.929913,91.749
1,Australia,Oceania,7692060,24451,3.2,99.3,1230859,2.4,51352.2,2.5,...,1.948647,84.600,80.500,82.500000,2.007966,12349632.0,12252228.0,21.915859,64.110306,85.904
2,Austria,WesternEurope,83871,8736,106.0,96.2,376967,1.0,44117.7,1.3,...,2.081269,84.000,79.400,81.643902,0.756179,4478340.0,4319226.0,25.355237,27.024073,58.094
3,Belarus,EasternEurope,207600,9468,46.7,87.0,54609,-3.9,5750.8,7.5,...,6.031837,79.200,69.300,74.129268,1.162417,5077542.0,4420722.0,13.019006,2.933101,78.134
4,Belgium,WesternEurope,30528,11429,377.5,97.3,455107,1.5,40277.8,0.7,...,2.125971,83.900,79.200,81.492683,0.910371,5766141.0,5609017.0,23.399721,33.727746,97.961
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61,United Arab Emirates,WesternAsia,83600,9400,112.4,262.4,370296,3.8,40438.8,0.7,...,1.966826,79.008,76.966,77.647000,,2891723.0,6595480.0,0.066457,,86.248
62,United Kingdom,NorthernEurope,242495,66182,273.6,97.4,2858003,2.2,44162.4,0.7,...,2.557756,83.100,79.500,81.256098,1.771094,33464674.0,32594185.0,25.523504,33.227524,83.143
63,United States of America,NorthernAmerica,9833517,324460,35.5,98.0,18036648,2.6,56053.8,1.0,...,2.130110,81.100,76.100,78.539024,3.109010,164193686.0,160791853.0,11.764267,49.600037,82.058
64,Venezuela (Bolivarian Republic of),SouthAmerica,912050,31977,36.3,99.0,344331,-6.2,11068.9,5.3,...,,76.194,68.523,72.246000,0.487844,14843348.0,14547061.0,,,88.183


# Word Chart

In [None]:
# Word Cloud for Country Names
text = " ".join(df['country'].astype(str).values)
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Country Names Word Cloud')
plt.show()

![image.png](attachment:image.png)

# Box and Whisker Plot

In [None]:
# Box Plot: GDP per Capita by Region
fig_box = px.box(df, x='Region', y='GDP per capita (current US$)', title='GDP per Capita by Region')
fig_box.show()

![image.png](attachment:image.png)

## Observation:
   - Western Europe and Northern America regions have the highest median GDP per capita.
   - Sub-Saharan Africa and Central Asia show the lowest GDP per capita among all regions.
   - There are outliers in some regions, like Northern Europe, indicating significant variation within those regions.
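
The medians and outliers read off the box plot can also be checked numerically. The following is a minimal sketch, assuming the common 1.5 × IQR rule for outliers; the helper name `iqr_outliers` and the toy data are hypothetical and not part of the notebook, and Plotly's own quartile computation may differ slightly from pandas' default interpolation.

```python
import pandas as pd

def iqr_outliers(frame, group_col, value_col, k=1.5):
    """Return rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR] within their group."""
    grouped = frame.groupby(group_col)[value_col]
    # Broadcast each group's quartiles back onto the rows of that group
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    mask = (frame[value_col] < q1 - k * iqr) | (frame[value_col] > q3 + k * iqr)
    return frame.loc[mask, [group_col, value_col]]

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    "Region": ["A"] * 5 + ["B"] * 5,
    "value": [10, 11, 12, 13, 100, 20, 21, 22, 23, 24],
})
outliers = iqr_outliers(toy, "Region", "value")      # only the row with value 100
medians = toy.groupby("Region")["value"].median()    # A: 12, B: 22
print(outliers)
print(medians)
```

On the real data the call would presumably be `iqr_outliers(df, 'Region', 'GDP per capita (current US$)')`.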


# Violin Plot


In [None]:
# Violin Plot: Fertility Rate by Region
fig_violin = px.violin(df, x='Region', y='Fertility rate, total (live births per woman)', title='Fertility Rate by Region (Violin)')
fig_violin.show()


![image.png](attachment:image.png)

## Observation:
   - Sub-Saharan Africa has the highest fertility rates, with some countries above 5 live births per woman.
   - Western Europe, Northern America, and Eastern Asia have low fertility rates, clustering around 1.5 to 2 live births per woman.
   - The distributions in most regions are relatively narrow, indicating less variability in fertility rates within those regions.


# Linear Regression

In [None]:
# Linear Regression: Fertility Rate vs Life Expectancy
fig_linear_2 = px.scatter(df, x='Fertility rate, total (live births per woman)', y='Life expectancy at birth, total (years)',
                          trendline='ols', title='Linear Regression: Fertility Rate vs Life Expectancy')
fig_linear_2.show()


![image.png](attachment:image.png)

## Observation:
   - There is a negative correlation between fertility rate and life expectancy.
   - As the fertility rate increases, the life expectancy tends to decrease.
   - Most of the data points are concentrated at low fertility rates and high life expectancy.
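
The strength of the negative relationship can be quantified rather than judged by eye. Below is a minimal sketch that computes the least-squares slope and the Pearson correlation with NumPy; the helper name `linear_fit_summary` and the toy data are hypothetical, and this is an independent check, not the computation Plotly performs internally for `trendline='ols'`.

```python
import numpy as np
import pandas as pd

def linear_fit_summary(frame, x_col, y_col):
    """Return slope, intercept and Pearson r for y ~ x, ignoring rows with missing values."""
    clean = frame[[x_col, y_col]].dropna()
    x = clean[x_col].to_numpy(dtype=float)
    y = clean[y_col].to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # degree-1 least-squares fit
    r = np.corrcoef(x, y)[0, 1]              # Pearson correlation coefficient
    return {"slope": slope, "intercept": intercept, "r": r, "n": len(clean)}

# Hypothetical toy data: y = 11 - 2x, plus one row with a missing x that is dropped
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, None], "y": [9.0, 7.0, 5.0, 3.0, 1.0]})
summary = linear_fit_summary(toy, "x", "y")
print(summary)
```

A negative `slope` together with an `r` close to -1 would support the observation above; an `r` near 0 would not, even if the fitted line slopes downward.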

# NonLinear Regression


In [None]:
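# Drop rows with missing values first, since np.polyfit cannot handle NaN
pair = df[['Population density (per km2, 2017)', 'GDP per capita (current US$)']].dropna()
x = pair['Population density (per km2, 2017)']
y = pair['GDP per capita (current US$)']

# Fit a polynomial regression of degree 2
coefficients = np.polyfit(x, y, 2)
polynomial = np.poly1d(coefficients)
x_fit = np.linspace(x.min(), x.max(), 500)
y_fit = polynomial(x_fit)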

# Scatter plot with the original data
fig_nonlinear = go.Figure()

fig_nonlinear.add_trace(go.Scatter(x=x, y=y, mode='markers', name='Data'))

# Add the polynomial regression line
fig_nonlinear.add_trace(go.Scatter(x=x_fit, y=y_fit, mode='lines', name='Polynomial Fit'))

fig_nonlinear.update_layout(title='Nonlinear Regression: GDP per Capita vs Population Density (Polynomial Fit)',
                            xaxis_title='Population density (per km2, 2017)',
                            yaxis_title='GDP per capita (current US$)')
fig_nonlinear.show()

![image.png](attachment:image.png)

## Observation

The scatter plot with a degree-2 polynomial fit suggests a nonlinear relationship between GDP per capita and population density. In the fitted curve, GDP per capita rises only slightly at low densities and more steeply at higher densities. This is consistent with the idea that sparse populations limit economic activity while dense ones support it, but the data are a single-year cross-section and a quadratic fit can be driven by a small number of very dense countries, so the curve should not be read as evidence of a density threshold that causes higher income.
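
One way to check whether the quadratic curve describes the data better than a straight line is to compare the coefficient of determination (R²) of both fits. The sketch below is an illustration under stated assumptions: the helper `polyfit_r2` and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np

def polyfit_r2(x, y, degree):
    """Fit a polynomial of the given degree and return its in-sample R^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))       # ignore pairs with missing values
    x, y = x[keep], y[keep]
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    ss_res = np.sum(residuals ** 2)           # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)      # total variation
    return 1.0 - ss_res / ss_tot

# Hypothetical toy data following y = x^2 exactly
x_toy = [0, 1, 2, 3, 4, 5]
y_toy = [0, 1, 4, 9, 16, 25]
r2_linear = polyfit_r2(x_toy, y_toy, 1)
r2_quadratic = polyfit_r2(x_toy, y_toy, 2)
print(r2_linear, r2_quadratic)
```

A higher-degree polynomial never fits worse in-sample than a lower-degree one, so a slightly higher R² for the quadratic is not by itself evidence of a nonlinear relationship; only a clear gap is informative.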

# 3D Plot

In [None]:
# 3D Scatter Plot: Population Growth Rate, GDP Growth Rate, and Life Expectancy
fig_3d_scatter_4 = px.scatter_3d(df, x='Population growth rate (average annual %)', 
                                 y='GDP growth rate (annual %, const. 2005 prices)', 
                                 z='Life expectancy at birth, total (years)', 
                                 color='Region',  # Color by Region
                                 size='Population in thousands (2017)',  # Size by Population
                                 title='3D Scatter Plot: Population Growth Rate, GDP Growth Rate, and Life Expectancy')
fig_3d_scatter_4.show()


![image.png](attachment:image.png)

## Observation
The 3D scatter plot of Population Growth Rate, GDP Growth Rate, and Life Expectancy shows some clustering of data points by region, which suggests that these development indicators are associated with geographical location. Notably, countries with higher life expectancy tend to have lower population growth rates and, less clearly, more moderate GDP growth rates.

# Jitter Plot


In [None]:
# Jitter Plot: Unemployment by Region
fig_jitter_unemployment = px.strip(df, x='Region', y='Unemployment (% of labour force)',
                                   title='Jitter Plot: Unemployment by Region')
fig_jitter_unemployment.show()

![image.png](attachment:image.png)

## Observation
The jitter plot showcasing unemployment rates by region provides a clear overview of regional disparities in employment levels. Southern Africa stands out with the highest unemployment rate, while regions like Eastern Europe, Southern Europe, and Western Asia show considerable variations in unemployment, suggesting diverse economic conditions within these regions.  Oceania and South America demonstrate relatively tight clustering of unemployment data points, indicating a more consistent employment landscape across the countries in these regions. 
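
The terms "considerable variation" and "tight clustering" can be backed by a per-region summary of spread. The following is a minimal sketch using a pandas group-by; the helper name `spread_by_group` and the toy data are hypothetical.

```python
import pandas as pd

def spread_by_group(frame, group_col, value_col):
    """Summarize count, min, max, standard deviation and range of a value per group."""
    summary = frame.groupby(group_col)[value_col].agg(["count", "min", "max", "std"])
    summary["range"] = summary["max"] - summary["min"]
    # Widest spread first
    return summary.sort_values("range", ascending=False)

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    "Region": ["A", "A", "A", "B", "B", "B"],
    "rate": [2.0, 10.0, 25.0, 5.0, 5.5, 6.0],
})
spread = spread_by_group(toy, "Region", "rate")
print(spread)
```

The `count` column matters when reading the result: a region with only two or three countries will look tightly clustered simply because it has few points.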


# Line Plot

In [None]:
# Line Plot for GDP Growth Rate
fig_line = px.line(df, x='country', y='GDP growth rate (annual %, const. 2005 prices)', title='GDP Growth Rate by Country')
fig_line.show()

![image.png](attachment:image.png)

## Observation

The line chart of GDP Growth Rate by Country shows considerable differences in economic performance across nations. Because the x-axis lists countries rather than points in time, the sharp peaks and troughs reflect gaps between countries that happen to be adjacent on the axis, not volatility over a period. Japan stands out as a pronounced spike relative to its neighbors on the axis.

# Area Chart

In [None]:
# Area Plot for Urban Population Growth Rate
fig_area = px.area(df, x='country', y='Urban population growth rate (average annual %)', title='Urban Population Growth Rate by Country')
fig_area.show()


![image.png](attachment:image.png)

## Observation

The area chart of Urban Population Growth Rate by Country shows positive urban population growth in most countries, although the pace varies significantly. The x-axis lists countries rather than years, so the chart compares countries and does not show a trend over time. Countries like Kazakhstan and Turkmenistan exhibit rapid urban population growth, potentially linked to economic development and migration patterns. Conversely, some European countries, such as Bulgaria and Romania, show slower urban population growth, which may reflect established urban centers and lower rural-to-urban migration.


# Waterfall Chart

In [None]:
# Prepare data for the waterfall chart
regions = df['Region'].unique()
co2_emissions = [df[df['Region'] == region]['CO2 emission estimates (million tons/tons per capita)'].mean() for region in regions]

# Create a waterfall chart
fig_waterfall_co2 = go.Figure(go.Waterfall(
    x=regions,
    y=co2_emissions,
    name='CO2 Emissions by Region',
    connector=dict(line=dict(color='rgb(63, 63, 63)', width=2))
))

fig_waterfall_co2.update_layout(title='Waterfall Chart: CO2 Emissions by Region',
                                xaxis_title='Region',
                                yaxis_title='CO2 Emissions (million tons/tons per capita)')
fig_waterfall_co2.show()


![image.png](attachment:image.png)

## Observation

The waterfall chart of mean CO2 emission estimates by region highlights the stark differences in emission levels across geographical areas. Northern America has the highest mean value, followed closely by Eastern Asia, while Southern Asia and Sub-Saharan Africa show significantly lower levels. Note that a waterfall chart stacks each region's mean on top of the previous ones, so the final level is the sum of the regional means; it is not a global per capita figure and should not be interpreted as one. The column name also combines two units (million tons and tons per capita), so the unit of the plotted values should be checked before comparing regions.

# Donut Chart

In [None]:
# Donut Chart for Employment Distribution (Agriculture, Industry, Services)
employment_distribution = df[['Employment: Agriculture (% of employed)', 'Employment: Industry (% of employed)', 'Employment: Services (% of employed)']].mean()
employment_labels = ['Agriculture', 'Industry', 'Services']
fig_donut = px.pie(values=employment_distribution, names=employment_labels, hole=0.4, title='Average Employment Distribution')
fig_donut.show()

![image.png](attachment:image.png)

## Observation

The donut chart of average employment distribution across sectors shows a dominant service sector, accounting for 65.4% of employment. Industry holds a significant share at 24.1%, while agriculture represents a smaller portion at 10.6%. These figures are unweighted means of the country-level percentages, so each country in the dataset counts equally regardless of its population. They suggest that, in a typical country in this dataset, employment is concentrated in services, with industry playing a substantial role and agriculture a less prominent one.
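
Averaging country-level percentages with `.mean()` gives a small country the same weight as a large one. If the goal is a share that is closer to a global figure, a weighted mean is one option. The sketch below is an illustration only: the helper `weighted_mean` and the toy data are hypothetical, and using total population as the weight is an assumption standing in for the employed population, which would be the more appropriate weight.

```python
import numpy as np
import pandas as pd

def weighted_mean(frame, value_col, weight_col):
    """Mean of value_col weighted by weight_col, skipping rows with missing data."""
    clean = frame[[value_col, weight_col]].dropna()
    return float(np.average(clean[value_col], weights=clean[weight_col]))

# Hypothetical toy data: a small country with 80% and a large one with 40% in services
toy = pd.DataFrame({
    "services_pct": [80.0, 40.0],
    "population": [1000, 9000],
})
unweighted = toy["services_pct"].mean()                       # 60.0
weighted = weighted_mean(toy, "services_pct", "population")   # 44.0
print(unweighted, weighted)
```

On the real data the weight column would presumably be `'Population in thousands (2017)'`, applied to each of the three employment columns in turn.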

# TreeMap

In [None]:
# Treemap for Surface Area by Region
fig_treemap = px.treemap(df, path=['Region', 'country'], values='Surface area (km2)', title='Surface Area by Region and Country')
fig_treemap.show()

![image.png](attachment:image.png)

## Observation

The treemap visualizing surface area by region and country clearly demonstrates the vastness of certain regions and nations. Russia stands out as the country with the largest surface area, followed by Canada, the United States, and China. The relative sizes of the rectangles within each region illustrate the distribution of land area among countries, revealing the dominance of a few large countries in terms of geographical size. 

# Funnel Chart

In [None]:
stages = [
    'Total Population',
    'Urban Population',
    'Rural Population'
]
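# Keep only countries that have both a population and an urban share
pop = df[['Population in thousands (2017)', 'Urban population (% of total population)_x']].dropna()
total_population = pop['Population in thousands (2017)'].sum()
# Weight each country's urban share by its population; an unweighted mean of the
# percentages would treat small and large countries alike
urban_population = (pop['Urban population (% of total population)_x'] * pop['Population in thousands (2017)']).sum() / 100
values = [
    total_population,                     # Total Population
    urban_population,                     # Urban Population
    total_population - urban_population   # Rural Population
]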

# Create a funnel chart
fig_funnel_population = go.Figure(go.Funnel(
    y=stages,
    x=values,
    name='Population Funnel'
))

fig_funnel_population.update_layout(title='Funnel Chart: Population Growth',
                                     xaxis_title='Population',
                                     yaxis_title='Stage')
fig_funnel_population.show()

![image.png](attachment:image.png)

## Observation

The funnel chart breaks the total population of the countries in the dataset into urban and rural segments. The widest section represents the total population, followed by a narrower segment for the urban population and a smaller base for the rural population. The chart shows that more people in these countries live in urban areas than in rural ones. Because it is a single-year snapshot, it cannot show whether the urban share is growing.
