# World Happiness Report - Data Analysis Project (Review)

Welcome to this multi-day data analysis project! In this lab, you'll practice data analysis skills using the **World Happiness Report** dataset (https://www.kaggle.com/datasets/unsdsn/world-happiness). You'll go through the entire data analysis workflow: loading the data, cleaning it, exploring it, analyzing correlations, visualizing results, and summarizing insights.

This project is designed for beginners, so we'll guide you with step-by-step instructions and hints. However, try to think critically and attempt each step on your own before revealing any provided solutions. The goal is to help you become comfortable with the data analysis process and with using Python libraries like pandas, matplotlib, and seaborn.

## Step 1: Load and Inspect the Dataset

First, let's load the World Happiness Report dataset and do some initial inspection of the data.
The dataset can be downloaded from Kaggle. Make sure you have the CSV file in your working directory.

**Tasks:**
- Import the necessary libraries (pandas, matplotlib, seaborn).
- Load the dataset into a pandas DataFrame. (Use `pd.read_csv()` and the file name of the dataset.)
- Display the first few rows of the DataFrame to verify it loaded correctly (`DataFrame.head()`).
- Display the shape of the DataFrame (number of rows and columns) to understand the dataset size.
- (Optional) Display basic info about the DataFrame (`DataFrame.info()`) to see the data types and non-null counts.


In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
# Note: replace 'PATH/TO/DATAFILE.csv' with the actual filename (e.g., 'world_happiness.csv' or '2019.csv').
# For example, if the file is named 'world_happiness.csv':
# happiness_df = pd.read_csv('world_happiness.csv')
happiness_df = pd.read_csv('PATH/TO/DATAFILE.csv')

# Inspect the first few rows
happiness_df.head()

In [None]:
# Check the size of the DataFrame (rows, columns)
# (A notebook cell only auto-displays its last expression, so print the shape explicitly.)
print(happiness_df.shape)

# (Optional) view DataFrame info
happiness_df.info()

## Step 2: Data Cleaning

Now that the data is loaded, it's time to clean it if necessary. Data cleaning ensures that our dataset is ready for analysis (no missing or inconsistent values, etc.).

**Tasks:**
- Check for missing values in the dataset.
  - Use functions like `isnull().sum()` to see if any columns have missing entries.
- If there are missing values, decide how to handle them. (For example, you might drop rows or fill them with an appropriate value.)
- Check for duplicate rows in the dataset and remove them if any.
- Make sure the columns have appropriate data types (for example, numerical columns should be `int` or `float`, not strings).
- (Optional) If the dataset contains multiple years or if you have separate data files per year, you might combine them into one DataFrame and add a `Year` column. (This is an advanced step; only do this if you want to analyze trends over time.)
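
The cleaning tasks above can be sketched on a tiny, made-up table. The values, the `"n/a"` placeholder, and the names `raw`, `missing_per_column`, and `cleaned` are assumptions for illustration only and are not taken from the real dataset. The sketch uses `pd.to_numeric` with `errors="coerce"` to turn a text column into numbers, which converts unparseable entries into missing values that can then be counted and dropped.

```python
import pandas as pd

# Hypothetical toy data: scores stored as strings, one bad entry, one duplicate row
raw = pd.DataFrame({
    "Country": ["A", "B", "C", "C"],
    "Happiness Score": ["7.1", "n/a", "5.2", "5.2"],
})

# Convert the text column to numbers; entries that cannot be parsed become NaN
raw["Happiness Score"] = pd.to_numeric(raw["Happiness Score"], errors="coerce")

# Count missing values per column after the conversion
missing_per_column = raw.isnull().sum()
print(missing_per_column)

# Drop rows with missing values, then drop exact duplicate rows
cleaned = raw.dropna().drop_duplicates().reset_index(drop=True)
print(cleaned)
print(cleaned.dtypes)
```

Dropping rows is only one option; filling with a mean or median (`fillna`) may suit other cases better.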

**Hint:** The World Happiness Report data is usually pretty clean. But it's always good to verify! If there's a column like `Country or region` and another like `Country` in different files, you may want to rename columns for consistency before combining data.
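
As a minimal sketch of the optional multi-year step, the example below builds two small in-memory tables that imitate yearly files with inconsistent column names, renames the columns to match, adds a `Year` column, and stacks them with `pd.concat`. The data values and the variable names (`df_2016`, `df_2019`, `combined`) are hypothetical; with real files you would load each one with `pd.read_csv()` and check its actual column names first.

```python
import pandas as pd

# Hypothetical stand-ins for two yearly files with different column names
df_2016 = pd.DataFrame({"Country": ["A", "B"], "Happiness Score": [7.0, 5.5]})
df_2019 = pd.DataFrame({"Country or region": ["A", "B"], "Score": [7.2, 5.4]})

# Rename the second table's columns so both tables share the same names
df_2019 = df_2019.rename(
    columns={"Country or region": "Country", "Score": "Happiness Score"}
)

# Tag each table with its year before combining
df_2016["Year"] = 2016
df_2019["Year"] = 2019

# Stack the rows into one DataFrame with a fresh 0..n-1 index
combined = pd.concat([df_2016, df_2019], ignore_index=True)
print(combined)
```

Renaming before concatenating matters: if the names differ, `pd.concat` keeps both columns and fills the gaps with missing values.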

In [None]:
# Check for missing values in each column
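# (Printed explicitly because only the last expression in a cell is displayed automatically.)
print(happiness_df.isnull().sum())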

# If missing values are present, handle them (e.g., drop or fill).
# Example: to drop rows with any missing values, you could do:
# happiness_df = happiness_df.dropna()

# Check for duplicate rows
duplicate_count = happiness_df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")
# If duplicates exist, you can remove them:
# happiness_df = happiness_df.drop_duplicates()

# (Optional) If needed, convert data types or rename columns for consistency.
# For example, if a column name has spaces or inconsistent naming across years, you might:
# happiness_df.rename(columns={'Country or region': 'Country'}, inplace=True)

## Step 3: Exploratory Data Analysis (EDA)

Now, let's explore the data to understand the distributions and relationships within it. In this step, you'll compute summary statistics and answer some initial questions about the data.

**Tasks:**
- Calculate basic statistical measures for the numeric columns (e.g., using `DataFrame.describe()`). This gives you the mean, min, max, etc., for each factor.
- Identify the range of happiness scores. What is the minimum and maximum happiness score? Which countries have these scores?
- Find out how many unique countries (and, if applicable, regions) are in the dataset.
- Determine the happiest and least happy countries in the data (top 5 and bottom 5 by happiness score).
- (If your data spans multiple years) figure out how many years are covered and which year had the highest average happiness score.

Try to answer these questions using pandas operations. For example, you might sort the DataFrame by the happiness score, or use functions like `idxmax()` to find the index of the max value in a column.
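
For the counting and ranking questions, one possible approach is sketched below on made-up data. The country labels, regions, and scores are hypothetical, and `toy_df` is an illustrative name. It uses `nunique()` to count distinct values, `value_counts()` to count rows per category, and `nlargest()` / `nsmallest()` as an alternative to sorting followed by `head()`.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy_df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Region": ["North", "North", "South", "South"],
    "Happiness Score": [7.1, 6.4, 5.2, 4.0],
})

# Number of distinct countries and regions
n_countries = toy_df["Country"].nunique()
n_regions = toy_df["Region"].nunique()
print(f"Countries: {n_countries}, Regions: {n_regions}")

# Number of countries in each region
countries_per_region = toy_df["Region"].value_counts()
print(countries_per_region)

# Rows with the highest and lowest scores
top2 = toy_df.nlargest(2, "Happiness Score")
bottom2 = toy_df.nsmallest(2, "Happiness Score")
print(top2[["Country", "Happiness Score"]])
print(bottom2[["Country", "Happiness Score"]])
```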

In [None]:
# Summary statistics for numeric columns
happiness_df.describe()

In [None]:
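# Note: column names differ between yearly files (e.g., 2019.csv uses 'Score' and 'Country or region').
# Adjust 'Happiness Score' and 'Country' below to match the columns in your data.
# Find the country with the maximum happiness score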
max_score = happiness_df['Happiness Score'].max()
max_score_country = happiness_df.loc[happiness_df['Happiness Score'].idxmax(), 'Country']
print(f"Highest Happiness Score: {max_score} (Country: {max_score_country})")

# Find the country with the minimum happiness score
min_score = happiness_df['Happiness Score'].min()
min_score_country = happiness_df.loc[happiness_df['Happiness Score'].idxmin(), 'Country']
print(f"Lowest Happiness Score: {min_score} (Country: {min_score_country})")

In [None]:
# List of top 5 happiest countries
happiest_countries = happiness_df.sort_values('Happiness Score', ascending=False).head(5)
print("Top 5 happiest countries:")
print(happiest_countries[['Country', 'Happiness Score']])

# List of bottom 5 least happy countries
least_happy_countries = happiness_df.sort_values('Happiness Score', ascending=True).head(5)
print("\nBottom 5 least happy countries:")
print(least_happy_countries[['Country', 'Happiness Score']])

In [None]:
# (Optional) If a 'Region' column exists, you can explore average happiness by region:
if 'Region' in happiness_df.columns:
    region_group = happiness_df.groupby('Region')['Happiness Score'].mean().sort_values(ascending=False)
    print("\nAverage Happiness Score by Region:\n", region_group)

# (Optional) If data includes multiple years, you can explore trends over years:
if 'Year' in happiness_df.columns:
    year_group = happiness_df.groupby('Year')['Happiness Score'].mean().sort_values(ascending=False)
    print("\nAverage Happiness Score by Year:\n", year_group)

## Step 4: Correlation Analysis

Next, let's examine the relationships between different factors and happiness. We'll calculate the correlation between the happiness score and other variables (GDP, social support, etc.), as well as correlations among all numerical factors.

**Tasks:**
- Compute the correlation matrix for the numerical columns in the dataset (pandas `DataFrame.corr()` can be used).
- Identify which factors have the strongest positive or negative correlation with the happiness score. Consider what these correlations mean.
- Based on the correlations, which factors seem most important in contributing to happiness?
- Create a heatmap using seaborn to visualize the correlation matrix. This will make it easier to see which variables are strongly or weakly correlated.

**Hint:** A correlation value ranges from -1 to 1. Values close to 1 indicate a strong positive correlation (when one goes up, the other goes up), whereas values close to -1 indicate a strong negative correlation (when one goes up, the other goes down). Values near 0 indicate little or no linear relationship.
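
To make these ranges concrete, here is a small sketch with made-up numbers (the column names and values are hypothetical). One column rises exactly in step with `x`, one falls exactly in step with it, and one follows a symmetric pattern that has no linear trend, so their Pearson correlations with `x` come out at 1, -1, and 0 respectively.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["rises_with_x"] = 2 * toy["x"] + 1    # perfect positive linear relationship
toy["falls_with_x"] = 10 - 3 * toy["x"]   # perfect negative linear relationship
toy["no_linear_trend"] = [2, 0, 1, 0, 2]  # symmetric pattern, no linear trend

# Pearson correlation of every column with x (x itself removed)
corr_with_x = toy.corr()["x"].drop("x")
print(corr_with_x)
```

Note that the last column is clearly related to `x` in a non-linear way, which is why a correlation near 0 only rules out a linear relationship.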

In [None]:
# Compute correlation matrix for numerical columns
corr_matrix = happiness_df.corr(numeric_only=True)
corr_matrix

In [None]:
# Identify the correlations of each factor with the Happiness Score
# (excluding the score's correlation with itself, which is always 1)
score_corr = corr_matrix['Happiness Score'].drop('Happiness Score').sort_values(ascending=False)
print("Correlation of each factor with Happiness Score:\n", score_corr, "\n")

# TODO: Visualize the correlation matrix with a heatmap for easier interpretation.
# HINT: You can use sns.heatmap(corr_matrix, annot=True, cmap='YlGnBu') to create a heatmap.
# (Don't forget to plt.show() to display it.)

## Step 5: Data Visualization

Now it's time to create some visualizations to further explore the data and present findings. Visualizing data can reveal patterns that aren't obvious from tables alone.

**Tasks:**
- Create a bar chart showing the top 10 happiest countries and their happiness scores.
- (Optional) Create a bar chart for the bottom 10 (least happy) countries.
- Plot a histogram or distribution plot of the happiness scores to see how scores are distributed across countries.
- Create a scatter plot to examine the relationship between a single factor and the happiness score. For example, plot GDP per capita vs. Happiness Score, or Social Support vs. Happiness Score.
- (Optional) If a region column is available, create a chart (like a bar plot) showing the average happiness score for each region.

Use matplotlib or seaborn for these plots. Remember to add labels and titles to make the charts understandable.

**Hint:** For the bar charts, you can sort the data by score and use `sns.barplot`. For the scatter plot, try `sns.scatterplot`. Use `plt.hist` or `sns.histplot` for the distribution of scores. Always label your axes and add a title!

In [None]:
# Bar chart of top 10 happiest countries
top10 = happiness_df.sort_values('Happiness Score', ascending=False).head(10)

plt.figure(figsize=(8,5))
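# Note: seaborn 0.13+ warns when `palette` is used without `hue`;
# on those versions, add hue='Country', legend=False to get the same colors without the warning.
sns.barplot(x='Happiness Score', y='Country', data=top10, palette='Blues_r')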
plt.title('Top 10 Happiest Countries')
plt.xlabel('Happiness Score')
plt.ylabel('Country')
plt.show()

# TODO: Similarly, you can create a bar chart for the bottom 10 countries.


In [None]:
# Histogram of Happiness Scores
plt.figure(figsize=(6,4))
plt.hist(happiness_df['Happiness Score'], bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of Happiness Scores')
plt.xlabel('Happiness Score')
plt.ylabel('Number of Countries')
plt.show()

In [None]:
# Scatter plot: GDP per Capita vs Happiness Score (as an example)
# Replace 'Economy (GDP per Capita)' with the exact GDP column name in your dataset.
plt.figure(figsize=(6,4))
sns.scatterplot(x='Economy (GDP per Capita)', y='Happiness Score', data=happiness_df)
plt.title('GDP per Capita vs Happiness Score')
plt.xlabel('Economy (GDP per Capita)')
plt.ylabel('Happiness Score')
plt.show()

# TODO: Try other scatter plots:
# e.g., Social support vs Happiness Score, Life Expectancy vs Happiness Score, etc.
# You can also use sns.regplot to add a regression line for trend.


## Step 6: Interpret and Report Findings

You've now explored the data through various analyses and visualizations. The final step is to summarize your findings.

**Tasks:**
- Write a short summary of the insights you've gained from the analysis. Consider the following:
  - Which countries are the happiest and which are the least happy? Are there any common characteristics among the top or bottom countries?
  - What factors seem most strongly associated with happiness (from the correlation analysis)? Do richer countries tend to be happier? How about the influence of life expectancy, freedom, etc.?
  - If you looked at multiple years, have happiness scores changed over time or remained relatively stable? Are there any notable trends or events that might explain the changes?
  - Were there any surprising findings, or does the data largely fit expectations?
- Make sure to reference your figures in your discussion (for example, "As shown in the scatter plot of GDP vs Happiness, there is a positive correlation between wealth and happiness..."). Include the charts or results from earlier as needed to support your points.

This summary can be written in Markdown below, as if you were writing a report. Be clear and concise, and imagine you are explaining your findings to someone who is not familiar with the data. Use bullet points or bold text for emphasis where appropriate.

*After completing your report, you have finished the project! Feel free to explore further or try additional questions with the data.*