# **Data Science Exam: The World Happiness Report Analysis**  
**Total Points: 100**

**[Dataset Link](https://drive.google.com/drive/folders/1mQB00JAQtHwgCf5CJp9j2KCEhcWDkeor?usp=drive_link)**

---

## **1. Introduction & Context**

The pursuit of happiness is a fundamental human goal. The **World Happiness Report** is a landmark global survey that ranks countries based on how happy their citizens perceive themselves to be. Understanding the drivers of happiness is essential for policymakers, economists, and social scientists.

In this exam, you will step into the role of a **Data Analyst/Scientist** at a global think tank. You have been tasked with analyzing **five years of World Happiness Report data (2015–2019)** to uncover insights into the factors that contribute to national happiness and how these trends evolve over time.

Your mission is to **clean, merge, analyze, and interpret the data**, then present your findings in a comprehensive analytical report.

---

## **2. The Dataset**

You are provided with **five datasets**, one for each year:  
**2015, 2016, 2017, 2018, and 2019.**

Each dataset contains indicators such as:

- Happiness Score and Rank  
- Economy (GDP per Capita)  
- Social Support  
- Healthy Life Expectancy  
- Freedom to Make Life Choices  
- Generosity  
- Perceptions of Corruption  

**Important Note:**  
Column names and structures vary across years.

- 2015–2017 share similar column names.  
- 2018–2019 use different naming conventions.

You must carefully examine and standardize them before combining.

---

## **3. Tasks & Questions (100 Points)**

---

### **Part A: Data Preparation & Merging (20 Points)**

---

### **A1. Data Loading & Inspection (5 pts)**

Load the datasets for **2015, 2016, 2017, 2018, and 2019** into separate Pandas DataFrames.

For each dataset:  
- Display the first few rows (`.head()`)  
- Inspect data structure (`.info()`)  
- Show summary statistics (`.describe()`)  
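
As a starting point, the inspection step can be wrapped in a small helper and applied to each year in a loop. This is only a sketch: the helper name `inspect`, the file name `2015.csv`, and the sample values are assumptions, and an in-memory CSV stands in for the real file so the snippet runs anywhere.

```python
import io
import pandas as pd

def inspect(df: pd.DataFrame, label: str) -> None:
    """Print the three standard first-look views for one yearly DataFrame."""
    print(f"--- {label} ---")
    print(df.head())
    df.info()              # info() prints dtypes and non-null counts itself
    print(df.describe())

# Stand-in for one yearly file; the values are made up for illustration.
# With the real data this would be pd.read_csv("2015.csv") (file name assumed).
sample_csv = io.StringIO(
    "Country,Happiness Score,Economy (GDP per Capita)\n"
    "Aland,7.5,1.40\n"
    "Borduria,5.1,0.90\n"
    "Cascadia,3.9,0.35\n"
)
dfs = {2015: pd.read_csv(sample_csv)}  # real use: one entry per year, 2015 to 2019

for year, df in dfs.items():
    inspect(df, str(year))
```

Keeping the frames in a dictionary keyed by year makes the later standardization steps easy to apply in a loop.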

---

### **A2. Data Standardization (15 pts)**  
Create a unified dataset called **`df_combined`**.

#### **Step 1: Column Renaming**
- Standardize column names across all years.  
- Use clear, consistent **snake_case** naming, e.g.:
  - `gdp_per_capita`
  - `social_support`
  - `healthy_life_expectancy`
- Use the mappings for **2015–2017** as a guide and create equivalent mappings for **2018 and 2019**.
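
One way to organize the renaming is a mapping dictionary per naming convention. The raw header strings below (for example `Family` for social support in the earlier files) are assumptions about the source files, so confirm them against `df.columns` before relying on them. The helper `standardize` is hypothetical and the toy frames contain made-up values.

```python
import pandas as pd

# Assumed raw headers; verify against each file's actual columns.
RENAME_2015_2017 = {
    "Country": "country",
    "Happiness Score": "happiness_score",
    "Family": "social_support",
}
RENAME_2018_2019 = {
    "Country or region": "country",
    "Score": "happiness_score",
    "Social support": "social_support",
}

def standardize(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Rename known columns and report any column the mapping did not cover."""
    out = df.rename(columns=mapping)
    unmapped = [c for c in out.columns if c not in set(mapping.values())]
    if unmapped:
        print("unmapped columns:", unmapped)
    return out

# Toy frames with made-up values, one per naming convention.
raw_2015 = pd.DataFrame({"Country": ["Aland"], "Happiness Score": [7.5], "Family": [1.3]})
raw_2018 = pd.DataFrame({"Country or region": ["Aland"], "Score": [7.6], "Social support": [1.4]})

df_2015 = standardize(raw_2015, RENAME_2015_2017)
df_2018 = standardize(raw_2018, RENAME_2018_2019)
```

Printing the unmapped columns makes it obvious when a year uses a header the mapping has not yet accounted for.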

#### **Step 2: Region Handling**
- The **2018 and 2019** datasets do *not* include a `region` column.  
- Create a master region lookup using **2015 or 2016** data.  
- Map region values to the 2018 and 2019 datasets.
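
The lookup-and-map step could be sketched as follows, assuming the columns have already been renamed to `country` and `region`. The country and region names are placeholders, not real data.

```python
import pandas as pd

# Toy frames with placeholder names and made-up scores.
df_2015 = pd.DataFrame({"country": ["Aland", "Borduria"], "region": ["North", "East"]})
df_2018 = pd.DataFrame({
    "country": ["Aland", "Borduria", "Cascadia"],
    "happiness_score": [7.4, 5.0, 4.1],
})

# Master lookup: one region per country, taken from a year that has the column.
region_lookup = df_2015.drop_duplicates("country").set_index("country")["region"]

# Series.map leaves NaN wherever the country is absent from the lookup.
df_2018["region"] = df_2018["country"].map(region_lookup)

# Countries that still need a region assigned by hand.
unmatched = df_2018.loc[df_2018["region"].isna(), "country"].tolist()
print(unmatched)
```

Countries that are new or spelled differently in later years will not match the lookup, so it is worth listing them explicitly rather than silently carrying `NaN` regions forward.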

#### **Step 3: Missing Columns**
- The columns `dystopia_residual` and `standard_error` are missing in **2018 and 2019**.
- Add these columns with `NaN` values to maintain uniform structure.

#### **Step 4: Add a Year Column**
- Before merging, add a **`year`** column to each DataFrame indicating the dataset's year.

#### **Step 5: Final Merge**
- Vertically concatenate all five DataFrames into one:  
  **`df_combined`**
- Reset and clean the index.
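
Steps 3 to 5 can be combined in a single loop. The sketch below assumes the frames are already renamed and held in a dictionary keyed by year; the two toy frames and their values are made up.

```python
import numpy as np
import pandas as pd

# Toy frames: the earlier year has the two extra columns, the later one does not.
frames = {
    2015: pd.DataFrame({"country": ["Aland"], "happiness_score": [7.5],
                        "standard_error": [0.03], "dystopia_residual": [2.1]}),
    2018: pd.DataFrame({"country": ["Aland"], "happiness_score": [7.6]}),
}

prepared = []
for year, df in frames.items():
    df = df.copy()
    # Step 3: add any missing column as NaN so every frame has the same schema.
    for col in ("dystopia_residual", "standard_error"):
        if col not in df.columns:
            df[col] = np.nan
    # Step 4: tag each row with its source year.
    df["year"] = year
    prepared.append(df)

# Step 5: stack vertically; ignore_index=True produces a clean 0..n-1 index.
df_combined = pd.concat(prepared, ignore_index=True)
print(df_combined)
```

`pd.concat` aligns columns by name, so differing column order between years is not a problem once the names match.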

---

## **Part B: Questions**

**Answer ALL questions. Provide code, visualizations, and explanations.**

> _Notes for submission_: include the Python code cells that produce each result, any plots (inline), and short written interpretations (2–5 sentences) after each result. Use `pandas`, `numpy`, `matplotlib` / `seaborn`, and `scikit-learn` where appropriate.

---

### 1. Data Quality Assessment (10 pts)
Perform exploratory data checks and report findings.

- Tasks:
  - Show dataset **shape** (`.shape`) and **columns**.
  - Report **missing values** per column (absolute count and percentage).
  - Detect and display **duplicate** rows (and drop them if appropriate).
  - Show **summary statistics** (`.describe()` for numerics; value counts for important categorical cols).
- Deliverables:
  - Code cell(s) that run the checks.
  - A short written summary answering: *What quality problems did you observe?* (e.g., inconsistent column names, missing region in 2018/2019, outliers, differing scales, duplicates).
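
A minimal sketch of the missing-value and duplicate checks is shown below. The toy `df_combined` is made up; run the same calls on your real combined frame.

```python
import pandas as pd

# Toy combined frame: rows 0 and 1 are exact duplicates, one region is missing.
df_combined = pd.DataFrame({
    "country": ["Aland", "Aland", "Borduria", "Cascadia"],
    "year": [2015, 2015, 2015, 2018],
    "region": ["North", "North", "East", None],
    "happiness_score": [7.5, 7.5, 5.1, 4.0],
})

print(df_combined.shape, list(df_combined.columns))

# Missing values per column: absolute count and percentage of rows.
missing = pd.DataFrame({
    "n_missing": df_combined.isna().sum(),
    "pct_missing": (df_combined.isna().mean() * 100).round(1),
})
print(missing)

# keep=False marks every member of a duplicate group, not just the repeats.
dupes = df_combined[df_combined.duplicated(keep=False)]
print(dupes)

df_clean = df_combined.drop_duplicates().reset_index(drop=True)
```

Whether dropping duplicates is appropriate depends on why they occur, so inspect `dupes` before calling `drop_duplicates`.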

---

### 2. Region Distribution Visualization (5 pts)
Plot the number of countries per region across all years and interpret.

- Tasks:
  - Create a bar chart showing **count of unique countries per region** (aggregate across all years or show per-year facets).
  - Optionally show a stacked bar or heatmap of counts by year × region.
- Deliverables:
  - Plot(s) and 2–3 sentence interpretation: *What does this tell you about the geographical distribution of countries in the dataset?*
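
The aggregate version of the chart might be built as sketched below, using `nunique` so a country that appears in several years is counted once. The data is a toy example with placeholder names.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data: Aland and Cascadia each appear in two years.
df_combined = pd.DataFrame({
    "country": ["Aland", "Aland", "Borduria", "Cascadia", "Cascadia"],
    "region": ["North", "North", "North", "East", "East"],
    "year": [2015, 2016, 2015, 2015, 2016],
})

# Unique countries per region across all years, largest first.
counts = (
    df_combined.groupby("region")["country"]
    .nunique()
    .sort_values(ascending=False)
)

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(counts.index.tolist(), counts.to_numpy())
ax.set_xlabel("Region")
ax.set_ylabel("Unique countries")
ax.set_title("Countries per region, all years")
fig.tight_layout()
```

For the optional year-by-region view, grouping by both `year` and `region` before calling `nunique` gives a table that can feed a heatmap or stacked bars.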

---

### 3. Trend Analysis of Happiness Score (10 pts)
Analyze trends 2015–2019 for the top 5 happiest countries each year.

- Tasks:
  - For each year, determine the **top 5 happiest countries** (by `happiness_score`).
  - Create a line plot showing **happiness score trends** (2015–2019) for these countries (one plot with years on x-axis).
  - Identify countries with **consistent improvement** (monotonic increase or mostly increasing) or **consistent decline**.
- Deliverables:
  - Trend plot and a short list of countries with improvement/decline and brief explanation.
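
One possible way to pick the top countries and test for a monotonic trend is sketched here. The toy data covers three years and uses `TOP_N = 2` to stay small; the exam asks for five. Treating "consistent" as strictly monotonic is an assumption, and you may prefer a looser rule.

```python
import pandas as pd

df_combined = pd.DataFrame({
    "country": ["Aland", "Borduria", "Cascadia"] * 3,
    "year": [2015] * 3 + [2016] * 3 + [2017] * 3,
    "happiness_score": [7.5, 7.0, 5.0, 7.6, 6.9, 5.2, 7.7, 6.8, 5.1],
})
TOP_N = 2  # the exam asks for 5; 2 keeps the toy example readable

# Top-N rows per year, then the union of countries that ever made the list.
top_by_year = (
    df_combined.sort_values(["year", "happiness_score"], ascending=[True, False])
    .groupby("year")
    .head(TOP_N)
)
top_countries = top_by_year["country"].unique()

# Years as rows, countries as columns: convenient for plotting and differencing.
wide = (
    df_combined[df_combined["country"].isin(top_countries)]
    .pivot(index="year", columns="country", values="happiness_score")
)

# Year-over-year changes; the first row is all NaN and is dropped.
diffs = wide.diff().dropna(how="all")
always_up = (diffs > 0).all()
always_down = (diffs < 0).all()
improving = always_up[always_up].index.tolist()
declining = always_down[always_down].index.tolist()
print(improving, declining)
```

Calling `wide.plot(marker="o")` on the pivoted table gives the requested line chart with years on the x-axis.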

---

### 4. Correlation Analysis (8 pts)
Compute correlations and interpret.

- Tasks:
  - Compute correlation matrix (Pearson) between:
    - `happiness_score`, `gdp_per_capita`, `social_support`, `freedom`, `healthy_life_expectancy`, `generosity`, `perceptions_of_corruption` (or `corruption`).
  - Visualize with a heatmap and/or pairwise scatter plots for top associations.
- Deliverables:
  - Correlation matrix, heatmap, and 2–3 sentence answer: *Which factors appear most strongly associated with happiness?*
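
The correlation step could look like the sketch below. It uses synthetic data in which happiness is constructed to depend on GDP only, purely so the snippet is runnable; the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: happiness is built from GDP plus noise.
rng = np.random.default_rng(0)
n = 200
gdp = rng.normal(1.0, 0.4, n)
df = pd.DataFrame({
    "gdp_per_capita": gdp,
    "generosity": rng.normal(0.2, 0.1, n),
})
df["happiness_score"] = 3.0 + 2.0 * gdp + rng.normal(0, 0.3, n)

cols = ["happiness_score", "gdp_per_capita", "generosity"]
corr = df[cols].corr(method="pearson")

# Rank predictors by absolute correlation with the target.
ranked = (
    corr["happiness_score"]
    .drop("happiness_score")
    .abs()
    .sort_values(ascending=False)
)
print(corr.round(2))
print(ranked)
```

Passing `corr` to `seaborn.heatmap(corr, annot=True)` is a common way to produce the heatmap.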

---

### 5. Build a Regression Model (12 pts)
Predict `happiness_score` using the listed predictors.

- Predictors: `gdp_per_capita`, `social_support`, `healthy_life_expectancy` (or `health`), `freedom`, `generosity`, `perceptions_of_corruption` (or `corruption`).
- Tasks:
  - Split data into train/test (e.g., 80/20) or use k-fold CV.
  - Fit a linear regression model (and optionally a regularized model such as Ridge).
  - Report:
    - Model coefficients (with mapped variable names)
    - R² on test set (and train if desired)
    - Short interpretation of coefficient signs / magnitudes
- Deliverables:
  - Code, coefficient table, R² value(s), and interpretation (3–6 sentences).
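
A minimal sketch of the split, fit, and report workflow with `scikit-learn` follows. The data is synthetic, generated with known coefficients so the result can be sanity-checked; only three of the listed predictors are included to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic predictors and target (illustrative values, not real data).
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "gdp_per_capita": rng.normal(1.0, 0.4, n),
    "social_support": rng.normal(1.0, 0.3, n),
    "freedom": rng.normal(0.4, 0.15, n),
})
y = (2.0 + 1.5 * X["gdp_per_capita"] + 1.0 * X["social_support"]
     + 0.5 * X["freedom"] + rng.normal(0, 0.2, n))

# 80/20 split; a fixed random_state keeps the result reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# Coefficients labeled with their variable names.
coef_table = pd.Series(model.coef_, index=X.columns, name="coefficient")
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
print(coef_table)
print(f"R2 train={r2_train:.3f}  test={r2_test:.3f}")
```

On the real data, rows with missing predictor values must be dropped or imputed before fitting, since `LinearRegression` does not accept `NaN`.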

---

### 6. Regional Comparison (8 pts)
Analyze regional average happiness per year.

- Tasks:
  - Compute **average happiness score by region and year**.
  - Plot a line chart (region on color/hue) showing regional trends 2015–2019.
  - Answer: *Which region improved the most? Which region declined?*
- Deliverables:
  - Table of regional averages, trend plot, and concise conclusions.
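
The regional table and the improvement ranking can be sketched as follows. Measuring improvement as the last year minus the first year is one assumed definition; a fitted slope per region would be an alternative. The toy values are made up.

```python
import pandas as pd

df_combined = pd.DataFrame({
    "region": ["North", "North", "East", "East"] * 2,
    "year": [2015] * 4 + [2019] * 4,
    "happiness_score": [7.0, 7.4, 5.0, 5.4, 7.1, 7.5, 4.8, 5.0],
})

# Regions as rows, years as columns.
regional = (
    df_combined.groupby(["region", "year"])["happiness_score"]
    .mean()
    .unstack("year")
)

# Change from the first to the last year (assumed definition of improvement).
change = (regional[2019] - regional[2015]).sort_values()
most_improved = change.idxmax()
most_declined = change.idxmin()
print(regional)
print(most_improved, most_declined)
```

Transposing the table with `regional.T.plot(marker="o")` yields one line per region with years on the x-axis.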

---

### 7. Identify Key Drivers of Happiness (Ranking) (7 pts)
Rank predictor variables by importance.

- Tasks:
  - Provide a ranking using **(a)** absolute correlation with happiness, and **(b)** a model-based feature importance (e.g., coefficient magnitudes from a standardized linear model or feature importances from a RandomForest).
  - Compare rankings across years (e.g., show top 3 drivers per year).
- Deliverables:
  - Two ranking lists (correlation & model-based) and a 2–3 sentence conclusion: *Which factors consistently influence happiness?*
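
Both rankings could be produced as in the sketch below. Standardizing the predictors with `StandardScaler` puts them on a common scale so that coefficient magnitudes are comparable. The data is synthetic, with generosity deliberately given no effect.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration: generosity has no effect by construction.
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "gdp_per_capita": rng.normal(1.0, 0.4, n),
    "social_support": rng.normal(1.0, 0.3, n),
    "generosity": rng.normal(0.2, 0.1, n),
})
y = (2.0 + 2.0 * X["gdp_per_capita"] + 1.0 * X["social_support"]
     + rng.normal(0, 0.2, n))

# (a) Ranking by absolute Pearson correlation with the target.
corr_rank = X.corrwith(y).abs().sort_values(ascending=False)

# (b) Ranking by absolute coefficient of a linear model on standardized inputs.
X_std = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_std, y)
coef_rank = pd.Series(np.abs(model.coef_), index=X.columns).sort_values(ascending=False)

print(corr_rank)
print(coef_rank)
```

To compare drivers across years, the same two steps can be run on each year's subset of `df_combined` and the top three names collected per year.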

---

### 8. Country-Level Case Study (6 pts)
Pick one country (e.g., Finland or Denmark) and analyze year-by-year changes.

- Tasks:
  - Plot a small multiples chart (one subplot per indicator) showing that country’s indicators from 2015–2019: `happiness_score`, `gdp_per_capita`, `social_support`, `healthy_life_expectancy`, `freedom`, `generosity`, `corruption`.
  - Provide a short narrative explaining what likely drove changes in happiness for that country.
- Deliverables:
  - Plots and a 3–5 sentence case study.
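
A small multiples layout can be built with one subplot per indicator, as sketched below. The country name and values are placeholders, and only three indicators are shown to keep the example short.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder country with made-up values.
df_combined = pd.DataFrame({
    "country": ["Aland"] * 3,
    "year": [2015, 2016, 2017],
    "happiness_score": [7.5, 7.6, 7.7],
    "gdp_per_capita": [1.30, 1.32, 1.35],
    "social_support": [1.40, 1.41, 1.45],
})
indicators = ["happiness_score", "gdp_per_capita", "social_support"]

country_df = df_combined[df_combined["country"] == "Aland"].sort_values("year")

# One panel per indicator, sharing the year axis.
fig, axes = plt.subplots(1, len(indicators), figsize=(9, 2.5), sharex=True)
for ax, col in zip(axes, indicators):
    ax.plot(country_df["year"], country_df[col], marker="o")
    ax.set_title(col)
    ax.set_xlabel("year")
fig.tight_layout()
```

Each panel keeps its own y-axis because the indicators are on different scales.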

---

### 9. Detect Outliers (7 pts)
Find and comment on notable outliers.

- Tasks:
  - Identify countries with:
    - unusually **high GDP but low happiness**
    - unusually **high happiness but low GDP**
  - Use scatter plots with annotations to highlight these points.
- Deliverables:
  - Scatter plot(s), list of outlier countries, and possible explanations (2–4 sentences each).
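
One possible way to flag such countries is to fit a simple line of happiness on GDP and look for large residuals. The threshold of 1.5 standard deviations is an arbitrary assumption, and the ten toy points are made up, with two planted outliers.

```python
import numpy as np
import pandas as pd

# Toy data: countries I and J are planted to sit far from the trend.
df = pd.DataFrame({
    "country": list("ABCDEFGHIJ"),
    "gdp_per_capita": [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 0.3, 1.5],
    "happiness_score": [3.5, 4.0, 4.4, 5.0, 5.5, 6.0, 6.4, 7.0, 6.8, 3.6],
})

# Least-squares line: happiness as a function of GDP.
slope, intercept = np.polyfit(df["gdp_per_capita"], df["happiness_score"], deg=1)
df["residual"] = df["happiness_score"] - (intercept + slope * df["gdp_per_capita"])

# Scale residuals by their standard deviation to get a rough z-score.
df["z"] = df["residual"] / df["residual"].std()

THRESHOLD = 1.5  # assumed cutoff; tune it for the real data
happy_but_poor = df.loc[df["z"] > THRESHOLD, "country"].tolist()
rich_but_unhappy = df.loc[df["z"] < -THRESHOLD, "country"].tolist()
print(happy_but_poor, rich_but_unhappy)
```

Strong outliers also pull the fitted line toward themselves, so a quantile-based rule or a robust fit may separate them more cleanly on real data.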

---

### 10. Create a Final Combined Dashboard/Table (7 pts)
Produce a summary table for each year.

- Tasks:
  - For each year (2015–2019), produce:
    - **Top 5 happiest countries** (with scores)
    - **Bottom 5 countries** (the least happy, with scores)
    - **Regional averages** (mean happiness per region)
  - Combine into one summary DataFrame or multi-sheet Excel (optional).
- Deliverables:
  - Final combined table (displayed in notebook) and brief commentary (2–4 sentences).
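
The three pieces of the summary can be derived from one sorted frame, as in this sketch. It uses `N = 2` on a toy frame with made-up values; the exam asks for five.

```python
import pandas as pd

df_combined = pd.DataFrame({
    "country": ["A", "B", "C", "D"] * 2,
    "region": ["North", "North", "East", "East"] * 2,
    "year": [2015] * 4 + [2016] * 4,
    "happiness_score": [7.5, 6.0, 5.0, 4.0, 7.4, 6.2, 4.1, 4.9],
})
N = 2  # the exam asks for 5

# Sort once: within each year, highest score first.
ranked = df_combined.sort_values(["year", "happiness_score"], ascending=[True, False])

# head/tail per group give the top and bottom N rows of each year.
top = ranked.groupby("year").head(N).assign(group="top")
bottom = ranked.groupby("year").tail(N).assign(group="bottom")
extremes = pd.concat([top, bottom])[["year", "group", "country", "happiness_score"]]

# Mean happiness per region, one row per year.
regional = (
    df_combined.groupby(["year", "region"])["happiness_score"]
    .mean()
    .unstack("region")
)
print(extremes)
print(regional)
```

For the optional Excel output, each table can be written to its own sheet with `DataFrame.to_excel` inside a `pd.ExcelWriter` context, which requires an Excel engine such as `openpyxl` to be installed.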

---

## **Section C: Final Deliverable (Insight Report)**

Submit a **2–4 page** analytical report (PDF or Markdown + exported PDF) summarizing your findings. The report should be self-contained and targeted to a non-technical policy audience.

### Report Structure (suggested)

1. **Introduction**
   - Brief description of the World Happiness Report data and the question(s) you aimed to answer.

2. **Methods**
   - Summarize cleaning steps: renaming, region mapping, handling missing values, concatenation.
   - Briefly describe analysis methods: exploratory checks, correlation, regression model, visualizations.

3. **Key Insights**
   - Top drivers of happiness (short bullets)
   - Regional winners & losers (who improved, who declined)
   - How global happiness changed from 2015–2019 (summary)
   - Any surprising outliers or anomalies

4. **Conclusion & Recommendations**
   - Practical recommendations for policymakers (which areas to invest in, e.g., health, social support, anti-corruption)
   - Prioritized actions based on results

5. **Appendix**
   - Include code snippets (key cells), important plots, and summary tables. (Keep full code in notebook; include representative snippets in appendix.)

---

### Submission checklist
- Notebook with runnable code and inline plots.
- `df_combined` DataFrame saved and used throughout the analysis.
- Final insight report (2–4 pages) as a Markdown or PDF file attached/linked.
- All figures labeled and axes titled; use readable font sizes.

---

**Good luck — and remember to comment your code and explain the reasoning behind each analytic choice.**


**1. INTRODUCTION**

We analyzed World Happiness Report data from 2015 to 2019. Our aim was to understand what drives national happiness, how happiness compares across regions, and how global happiness changed over the period.

The primary questions we answered are:
- Which factors or predictors most strongly influence a country's happiness score?
- Which regions are performing well or poorly, and how has that changed over time?
- Which countries are doing well and which are doing poorly?
- Do countries follow a similar pattern, or are there outliers?


**2. METHODS**

**2.1 Data Cleaning**

- Column Renaming: We started by renaming the columns for clarity and consistency across all datasets.
- Region Handling/Mapping: