# Homework 3: Collaborative Data Wrangling & EDA

### Team Members: Gagandeep Kaur & Anibely Torres
### Date: September 15, 2025


## Git and GitHub Collaboration 🤝

This section outlines the collaborative process for our team. We've set up a shared GitHub repository and are using Git branches to manage our contributions.

-   **Team Repository**: https://github.com/atp-dotcom/DSEintro/tree/main/HW3
-   **Our Branches**: Each of us has created a separate branch (`Gagan_DA` and `HW3`) to work on our respective parts of the assignment.
-   **Committing and Pull Requests**: We've made multiple commits with descriptive messages and will be submitting pull requests to merge our work into the main branch.


## Exploratory Data Analysis (EDA) 🔎

This notebook documents our exploratory analysis of the `co2-gdp-pop-growth.csv` dataset. Our goal is to clean the data and perform a preliminary investigation into the relationship between economic growth and carbon emissions.

### Step 1: Data Cleaning and Preparation

#### Import Libraries & Load Data

We start by importing the necessary libraries. Pandas is essential for data manipulation, NumPy for numerical operations, and Matplotlib and Seaborn for visualizations. We're importing all of them at the beginning so they're ready to use throughout the notebook.

In [33]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [34]:
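# Load the raw dataset from the 'data' folder one level above this notebook
csv_path = '../data/co2-gdp-pop-growth.csv'

try:
    df = pd.read_csv(csv_path)
except FileNotFoundError:
    # Report the path that was actually tried, then re-raise so that later
    # cells do not run against an undefined `df`.
    print(f"Error: '{csv_path}' not found. Please ensure the file is in the 'data' folder.")
    raise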
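
A relative path such as `'../data/...'` depends on the directory the notebook is started from. The sketch below shows one possible way to make a missing file easier to diagnose. The helper name `load_dataset` and the throwaway demo file are assumptions made for illustration and are not part of our notebook.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_dataset(csv_path):
    """Read a CSV file, failing loudly if it is missing (hypothetical helper)."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        # Show the resolved absolute path so a wrong working directory is easy to spot.
        raise FileNotFoundError(f"Dataset not found at {csv_path.resolve()}")
    return pd.read_csv(csv_path)


# Demonstration with a temporary file; in the notebook the argument would be
# '../data/co2-gdp-pop-growth.csv'.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("Entity,Code,Year\nAfghanistan,AFG,1961\n")
    demo_df = load_dataset(demo_path)

    try:
        load_dataset(Path(tmp) / "missing.csv")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True

print(demo_df)
print("Missing file detected:", missing_raised)
```

Raising instead of only printing stops the notebook at the failing cell, so the real cause is not hidden behind a later `NameError` on `df`.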


In [35]:
print("First 5 rows of the raw data:")
df.head()

First 5 rows of the raw data:


Unnamed: 0,Entity,Code,Year,Population growth (annual %),GDP growth (annual %),Annual CO₂ emissions growth (%)
0,Afghanistan,AFG,1961,1.962239,,18.58318
1,Afghanistan,AFG,1962,2.044523,,40.300896
2,Afghanistan,AFG,1963,2.105208,,2.634644
3,Afghanistan,AFG,1964,2.161195,,18.651236
4,Afghanistan,AFG,1965,2.233709,,20.078205


In [36]:
# We'll get an initial look at the data to understand its structure, data types, and any missing values.
print("Initial Dataset Information:")
df.info()

Initial Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27645 entries, 0 to 27644
Data columns (total 6 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Entity                           27645 non-null  object 
 1   Code                             23674 non-null  object 
 2   Year                             27645 non-null  int64  
 3   Population growth (annual %)     14458 non-null  float64
 4   GDP growth (annual %)            11811 non-null  float64
 5   Annual CO₂ emissions growth (%)  26002 non-null  float64
dtypes: float64(3), int64(1), object(2)
memory usage: 1.3+ MB
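
The non-null counts above imply a lot of missing data: `GDP growth (annual %)` has 11,811 values out of 27,645 rows, so roughly 57% of its entries are missing. A per-column summary makes this easier to read than `df.info()`. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical); in the notebook the same two expressions could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset, used only to illustrate the idea.
toy = pd.DataFrame({
    "Entity": ["A", "A", "B", "B"],
    "Code": ["AAA", "AAA", None, None],
    "gdp_growth": [np.nan, 2.0, 1.5, np.nan],
    "co2_growth": [3.0, 4.0, np.nan, 1.0],
})

# isna() marks missing cells; sum() counts them and mean() gives their share.
missing_summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(1),
})

print(missing_summary)
```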


In [37]:
df = df.rename(columns={
    "Population growth (annual %)": "population_growth",
    "GDP growth (annual %)": "gdp_growth",
    "Annual CO₂ emissions growth (%)": "co2_growth"
})
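
One caveat with `rename`: by default, keys that do not match an existing column are silently ignored, so a typo or a character-encoding mismatch in a long column name (for example the subscript in CO₂) leaves the column unchanged without any warning. The sketch below, on a hypothetical one-column frame, shows how `errors="raise"` could be used to surface such a mismatch.

```python
import pandas as pd

# Hypothetical frame with one of the original column names.
toy = pd.DataFrame({"GDP growth (annual %)": [1.0, 2.0]})

# The key below is missing a space, so by default nothing is renamed.
unchanged = toy.rename(columns={"GDP growth (annual%)": "gdp_growth"})

# With errors="raise", the same mismatch becomes a KeyError.
try:
    toy.rename(columns={"GDP growth (annual%)": "gdp_growth"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

# A correct key renames the column as intended.
renamed = toy.rename(columns={"GDP growth (annual %)": "gdp_growth"}, errors="raise")

print(list(unchanged.columns))
print(typo_detected)
print(list(renamed.columns))
```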

In [38]:
# Drop every row that has a missing value in any column.
# This keeps only complete observations, but it also discards rows with no
# `Code` and years where GDP or population growth was not reported.
df = df.dropna()
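
Calling `dropna()` without arguments removes a row if any column is missing, including `Code`. If only the analysis variables matter, the `subset` parameter restricts the check to those columns. The comparison below is a sketch on made-up rows (`toy`, "Region X", and the numbers are hypothetical) and is not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one entity without a code, one row without GDP data.
toy = pd.DataFrame({
    "Entity": ["Region X", "Afghanistan", "Afghanistan"],
    "Code": [None, "AFG", "AFG"],
    "gdp_growth": [3.0, np.nan, 8.8],
    "co2_growth": [1.2, 2.1, 16.3],
})

# Default behaviour: any missing value removes the row.
all_cols = toy.dropna()

# Only require the two variables used in the analysis.
analysis_cols = toy.dropna(subset=["gdp_growth", "co2_growth"])

print(len(all_cols), "row(s) kept with dropna()")
print(len(analysis_cols), "row(s) kept with dropna(subset=...)")
```

Which variant is appropriate depends on whether entities without a `Code` should be part of the analysis.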

In [39]:
df

Unnamed: 0,Entity,Code,Year,population_growth,gdp_growth,co2_growth
40,Afghanistan,AFG,2001,0.762005,-9.431974,2.098131
41,Afghanistan,AFG,2002,5.252029,28.600000,25.432373
42,Afghanistan,AFG,2003,6.145194,8.832278,16.301846
43,Afghanistan,AFG,2004,3.575835,1.414118,-20.669056
44,Afghanistan,AFG,2005,3.519217,11.229714,52.718650
...,...,...,...,...,...,...
27583,Zimbabwe,ZWE,2019,1.563533,-6.332447,-8.410621
27584,Zimbabwe,ZWE,2020,1.659353,-7.816951,-17.231369
27585,Zimbabwe,ZWE,2021,1.726011,8.468017,20.120394
27586,Zimbabwe,ZWE,2022,1.706209,6.139263,2.168930


In [40]:
# reset the index so it starts at 0 again
df = df.reset_index(drop=True)
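
The `drop=True` argument matters here: without it, `reset_index` keeps the old labels as a new column named `index`. A small illustration on a hypothetical frame whose index starts at 40, as ours does after dropping rows:

```python
import pandas as pd

# Hypothetical frame with a leftover, non-contiguous index.
toy = pd.DataFrame({"Year": [2001, 2002, 2003]}, index=[40, 41, 42])

with_drop = toy.reset_index(drop=True)   # old labels are discarded
without_drop = toy.reset_index()         # old labels become an 'index' column

print(list(with_drop.index), list(with_drop.columns))
print(list(without_drop.index), list(without_drop.columns))
```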

In [41]:
df

Unnamed: 0,Entity,Code,Year,population_growth,gdp_growth,co2_growth
0,Afghanistan,AFG,2001,0.762005,-9.431974,2.098131
1,Afghanistan,AFG,2002,5.252029,28.600000,25.432373
2,Afghanistan,AFG,2003,6.145194,8.832278,16.301846
3,Afghanistan,AFG,2004,3.575835,1.414118,-20.669056
4,Afghanistan,AFG,2005,3.519217,11.229714,52.718650
...,...,...,...,...,...,...
10589,Zimbabwe,ZWE,2019,1.563533,-6.332447,-8.410621
10590,Zimbabwe,ZWE,2020,1.659353,-7.816951,-17.231369
10591,Zimbabwe,ZWE,2021,1.726011,8.468017,20.120394
10592,Zimbabwe,ZWE,2022,1.706209,6.139263,2.168930


In [42]:
# save cleaned dataset
df.to_csv("../data/cleaned_co2_gdp.csv", index=False)
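
A quick way to confirm that the export worked is to read the file back and compare it with the frame in memory. The sketch below does this on a two-row hypothetical frame written to a temporary directory (the frame `cleaned` and the file names are assumptions). It also shows why `index=False` is used: if the index is written, it comes back as an extra `Unnamed: 0` column.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned dataset.
cleaned = pd.DataFrame({
    "Entity": ["Afghanistan", "Afghanistan"],
    "Code": ["AFG", "AFG"],
    "Year": [2001, 2002],
    "gdp_growth": [-9.431974, 28.6],
})

with tempfile.TemporaryDirectory() as tmp:
    # Export without the index, as in the notebook, and read it back.
    out_path = Path(tmp) / "cleaned_demo.csv"
    cleaned.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

    # Export with the index to show the extra column it produces.
    idx_path = Path(tmp) / "cleaned_with_index.csv"
    cleaned.to_csv(idx_path)
    reloaded_with_index = pd.read_csv(idx_path)

# Raises an AssertionError if the round trip changed the data.
pd.testing.assert_frame_equal(reloaded, cleaned)

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```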