# **Understanding the Business Context**

## **Business Context**
Greenhouse gas (GHG) emissions have a profound impact on climate change, affecting global temperatures, sea levels, and weather patterns. Governments, policymakers, and businesses rely on emissions data to formulate policies, set sustainability goals, and monitor progress toward climate commitments.

## **Business Need**
There is a growing demand for data-driven insights into emissions trends, sector-wise contributions, and the effectiveness of sustainability policies. This project will help answer:

✔ How have global emissions evolved over time?

✔ Which sectors contribute the most?

✔ How can we forecast future emissions trends?

✔ What actionable insights can be derived to reduce emissions?

✔ Which countries or regions have contributed the most to emissions over time?

✔ How effective have policy changes been in reducing emissions?

## **Business Process Behind the Request**
This type of data is typically used by:

Policymakers → For setting climate goals and regulations

Corporations → For sustainability reporting and reducing carbon footprints

Researchers & NGOs → For analyzing environmental impact and tracking global trends

# 🌍 Greenhouse Gas Emissions Analysis & Modeling

## **📌 Step 1: Business Understanding & Defining Goals**  
### **1.1 Business Problem**  
We aim to analyze historical greenhouse gas (GHG) emissions to:  
- **Track emission trends** over time.  
- **Evaluate policy impact** on emissions.  
- **Predict future emission levels** using machine learning.  
- **Support decision-making** for sustainability initiatives.  

### **1.2 Intended Audience**  
👥 The following groups will use this analysis:  
- **Government & Policymakers** – for emission regulations.  
- **Businesses & Corporations** – for ESG compliance.  
- **Sustainability Analysts & NGOs** – to assess policy effectiveness.  
- **Researchers & Academics** – for climate change studies.  

### **1.3 Deliverables**  
📌 Our final outputs will include:  
✅ **Power BI, Tableau, StoryMaps Dashboards**  
✅ **Summary Reports** (Google Docs, Notion)  
✅ **Predictive Models** (Python, Machine Learning)  
✅ **Infographics & Presentations**  

### **1.4 Key Performance Indicators (KPIs)**  
📊 We will measure success using:  
- **Total CO2 emissions per country/sector/year**  
- **Percentage change in emissions over time**  
- **Per capita emissions** (CO2 per person)  
- **Emissions intensity** (CO2 per unit of GDP)  
- **Forecast accuracy for emissions prediction models**  
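
The percentage-change, per capita, and intensity KPIs above reduce to simple column arithmetic. The sketch below uses invented numbers for a single hypothetical country; the column names `pct_change`, `per_capita_t`, and `intensity_kg_per_usd` are illustrative and do not come from the dataset.

```python
import pandas as pd

# Toy numbers for one hypothetical country; not taken from the dataset.
kpi = pd.DataFrame({
    "Year": [2019, 2020, 2021],
    "Emissions": [100.0, 90.0, 99.0],      # MtCO2e
    "Population": [50e6, 50e6, 50e6],      # persons
    "GDP": [1.0e12, 0.9e12, 1.1e12],       # current US$
})

# Percentage change in emissions versus the previous year
kpi["pct_change"] = kpi["Emissions"].pct_change() * 100

# Per capita emissions in tonnes CO2e per person (1 Mt = 1e6 t)
kpi["per_capita_t"] = kpi["Emissions"] * 1e6 / kpi["Population"]

# Emissions intensity in kg CO2e per US$ of GDP (1 Mt = 1e9 kg)
kpi["intensity_kg_per_usd"] = kpi["Emissions"] * 1e9 / kpi["GDP"]

print(kpi.round(3))
```

The unit conversions matter: emissions are reported in millions of tonnes, so they are scaled before dividing by population or GDP.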

### **1.5 Business Processes Behind the Request**  
🔎 The need for this analysis comes from:  
- **Paris Agreement & Net-Zero Goals 2050**  
- **Government GHG accounting regulations**  
- **Business carbon footprint tracking for ESG reporting**  

# **Understanding the Data**

## **Data Source**

Climate Watch (Historical GHG Emissions Data, 1990-2021)

## **📊 Nature of the Data**

Structured dataset containing greenhouse gas emissions across different years, regions, and sectors

Includes CO₂, CH₄, N₂O, and other GHG emissions data

Covers various economic sectors (Energy, Transport, Industry, Agriculture, etc.)


In [1]:
import pandas as pd  

# Load Data
file_path = "GHG_Emissions.csv"  
df = pd.read_csv(file_path)  

# Display first 5 rows
df.head()

Unnamed: 0,ISO,Country,Data source,Sector,Gas,Unit,2021,2020,2019,2018,...,1999,1998,1997,1996,1995,1994,1993,1992,1991,1990
0,WORLD,World,Climate Watch,Total including LUCF,All GHG,MtCO₂e,49553.48,47463.17,49843.57,49482.1,...,35281.42,35278.92,35739.15,34372.44,33982.29,33191.77,32924.03,32787.52,32880.8,32735.02
1,WORLD,World,Climate Watch,Total excluding LUCF,All GHG,MtCO₂e,48209.5,46066.32,48046.6,47960.07,...,33346.2,33152.61,32942.62,32545.41,31949.87,31164.96,30896.88,30760.8,30854.08,30708.31
2,WORLD,World,Climate Watch,Energy,All GHG,MtCO₂e,37407.79,35450.68,37613.05,37635.61,...,25466.4,25336.58,25177.64,24756.95,24245.42,23592.68,23503.34,23380.55,23505.38,23364.85
3,WORLD,World,Climate Watch,Total including LUCF,CO2,MtCO₂e,36693.41,34819.59,37046.27,36865.29,...,25227.59,25154.68,25546.76,24434.73,24086.41,23457.62,23311.16,23167.98,23217.45,23060.38
4,WORLD,World,Climate Watch,Total excluding LUCF,CO2,MtCO₂e,35540.43,33646.42,35575.61,35569.42,...,23553.91,23384.67,23220.76,22798.57,22317.43,21694.25,21547.45,21404.71,21454.18,21297.11


## **Features in the Data:**

1. Country/Region – Specifies whether the data represents the world or a specific country
2. Data Source – The provider of the record ("Climate Watch" in the sample rows)
3. Sector – The economic or industrial sector contributing to emissions, such as:
* Energy – Emissions from energy production and consumption.
* Electricity/Heat – Emissions from power generation.
* Total (Including/Excluding LUCF) – Overall emissions, with or without Land Use Change and Forestry (LUCF) contributions.
4. Gas Type – The type of greenhouse gas being measured, such as:
* All GHG (all greenhouse gases combined)
* CO₂ (Carbon Dioxide)
* Methane (CH₄)
* Nitrous Oxide (N₂O)
5. Unit of Measurement – in Million Metric Tons of CO₂ equivalent (MtCO₂e).
6. Yearly Emission Data – Numbers indicating emissions levels across different years from 1990 to 2021.

In [2]:
# Check dataset structure
print("\nDataset Info:")
print(df.info())


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10928 entries, 0 to 10927
Data columns (total 38 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   ISO          10928 non-null  object 
 1   Country      10928 non-null  object 
 2   Data source  10928 non-null  object 
 3   Sector       10928 non-null  object 
 4   Gas          10928 non-null  object 
 5   Unit         10928 non-null  object 
 6   2021         10923 non-null  float64
 7   2020         10924 non-null  float64
 8   2019         10923 non-null  float64
 9   2018         10924 non-null  float64
 10  2017         10924 non-null  float64
 11  2016         10926 non-null  float64
 12  2015         10924 non-null  float64
 13  2014         10924 non-null  float64
 14  2013         10924 non-null  float64
 15  2012         10918 non-null  float64
 16  2011         10893 non-null  float64
 17  2010         10891 non-null  float64
 18  2009         10889 non-null  fl

In [3]:
# Summary statistics
print("\nSummary Statistics:")
print(df.describe())


Summary Statistics:
               2021          2020          2019          2018          2017  \
count  10923.000000  10924.000000  10923.000000  10924.000000  10924.000000   
mean      69.249856     66.097739     69.420656     69.219737     67.762867   
std     1056.653203   1004.748301   1054.656929   1051.496841   1025.740185   
min     -647.850000   -647.850000   -647.850000   -647.910000   -647.910000   
25%        0.010000      0.010000      0.010000      0.010000      0.010000   
50%        0.420000      0.400000      0.450000      0.425000      0.420000   
75%        6.230000      6.110000      6.250000      6.172500      6.012500   
max    49553.480000  47463.170000  49843.570000  49482.100000  48341.290000   

               2016          2015          2014          2013          2012  \
count  10926.000000  10924.000000  10924.000000  10924.000000  10918.000000   
mean      66.792576     66.123134     66.241657     65.750736     64.882562   
std     1010.350805   1003.799

In [4]:
# Check for missing values
print("\n Missing Values Per Column:")
print(df.isnull().sum())


 Missing Values Per Column:
ISO              0
Country          0
Data source      0
Sector           0
Gas              0
Unit             0
2021             5
2020             4
2019             5
2018             4
2017             4
2016             2
2015             4
2014             4
2013             4
2012            10
2011            35
2010            37
2009            39
2008            41
2007            40
2006            40
2005            40
2004            65
2003            65
2002            65
2001            67
2000            67
1999           139
1998           139
1997           139
1996           140
1995           142
1994           165
1993           167
1992           179
1991           231
1990           284
dtype: int64


Observations from Data Exploration
1.	Dataset Size: 10,928 rows × 38 columns.
2.	Missing Values:
* Every year column has some missing values. The counts are small for 2012-2021 (2 to 10 per year) and grow toward the start of the series, reaching 284 in 1990.
3.	Outliers:
* Some emissions values are negative (e.g., -647.85, -707.98). These could be errors, or they could be genuine net removals.
4.	Data Structure:
* Each row contains ISO, Country, Data source, Sector, Gas, Unit, and yearly emissions data.
5.	Unit Standardization Needed:
* Need to check that all values are in million metric tons of CO₂-equivalent (MtCO₂e).
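
Two of the observations above (unit consistency and negative values) can be checked directly. The following is a minimal sketch on an invented three-row stand-in for the wide table; the sector label and all numbers are made up for illustration, and `year_cols` is a helper name not used elsewhere in the notebook.

```python
import pandas as pd

# Tiny stand-in for the wide emissions table; values are invented.
sample = pd.DataFrame({
    "Sector": ["Energy", "Land-Use Change and Forestry", "Energy"],
    "Unit": ["MtCO₂e", "MtCO₂e", "MtCO₂e"],
    "2021": [10.0, -5.0, 3.0],
    "2020": [9.0, -4.0, None],
})
year_cols = [c for c in sample.columns if c.isdigit()]

# 1) Unit check: a single distinct value means no conversion is needed
units = sample["Unit"].unique()
print("Units found:", units)

# 2) Count negative cells per sector to see where they come from
negatives_by_sector = (
    (sample[year_cols] < 0).sum(axis=1).groupby(sample["Sector"]).sum()
)
print(negatives_by_sector)
```

If the negatives cluster in a single sector, that is a hint they may be meaningful values rather than entry errors.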

## **Handling Missing Values**

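We will fill missing numeric values using forward fill (ffill()), on the assumption that emissions usually follow a trend over time. Note that on this wide table ffill() runs down each year column, so a gap is filled from the row above it rather than from the neighboring year of the same series.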

In [5]:
# Forward-fill missing values (df.ffill replaces the deprecated fillna(method='ffill'))
df.ffill(inplace=True)

# Check again for missing values
print("\nMissing Values After Cleaning:")
print(df.isnull().sum())


Missing Values After Cleaning:
ISO            0
Country        0
Data source    0
Sector         0
Gas            0
Unit           0
2021           0
2020           0
2019           0
2018           0
2017           0
2016           0
2015           0
2014           0
2013           0
2012           0
2011           0
2010           0
2009           0
2008           0
2007           0
2006           0
2005           0
2004           0
2003           0
2002           0
2001           0
2000           0
1999           0
1998           0
1997           0
1996           0
1995           0
1994           0
1993           0
1992           0
1991           0
1990           0
dtype: int64
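
For comparison, a fill that follows each series through time has to run along the year columns instead of down the rows. The sketch below is an alternative under stated assumptions, not the method used in the cell above: it works on an invented two-row table whose columns run from newest to oldest year, as in the dataset, and the names `wide`, `filled`, and `year_cols` are illustrative.

```python
import pandas as pd

# Invented wide table; columns run from newest to oldest year.
wide = pd.DataFrame({
    "Country": ["A", "B"],
    "2021": [5.0, 50.0],
    "2020": [None, 40.0],
    "2019": [3.0, None],
    "2018": [2.0, 20.0],
})
year_cols = ["2021", "2020", "2019", "2018"]

# Because the columns are in descending year order, back-filling along
# axis=1 copies the value of the older year (to the right) into the gap,
# which is the "carry the previous year forward" idea.
filled = wide.copy()
filled[year_cols] = wide[year_cols].bfill(axis=1)
print(filled)
```

Gaps at the very start of a series (the oldest years) have no earlier value to copy and would stay missing with this approach.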




## **Remove Negative & Unrealistic Values**
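* Negative emissions could be data entry errors, carbon capture projects, or genuine net removals (for example in land use and forestry, LUCF).
* For this analysis we replace every negative value in the year columns with NaN and forward-fill again.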

In [6]:
# Replace all negative values in the year columns (2021-1990) with NaN
year_cols = df.columns[6:]
df[year_cols] = df[year_cols].mask(df[year_cols] < 0)

# Forward-fill again
df.ffill(inplace=True)

# Check if negatives still exist
print("\n📌 Min Values in Each Year Column After Cleaning:")
print(df.iloc[:, 6:].min())


📌 Min Values in Each Year Column After Cleaning:
2021    0.0
2020    0.0
2019    0.0
2018    0.0
2017    0.0
2016    0.0
2015    0.0
2014    0.0
2013    0.0
2012    0.0
2011    0.0
2010    0.0
2009    0.0
2008    0.0
2007    0.0
2006    0.0
2005    0.0
2004    0.0
2003    0.0
2002    0.0
2001    0.0
2000    0.0
1999    0.0
1998    0.0
1997    0.0
1996    0.0
1995    0.0
1994    0.0
1993    0.0
1992    0.0
1991    0.0
1990    0.0
dtype: float64




## Standardize Column Names & Data Types
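* Column renaming (lowercase, underscores instead of spaces) is left disabled, because later cells refer to the original names such as 'Data source'.
* Ensure numeric data types are correct.

In [7]:
# Optional rename, kept commented out so the original column names stay valid
#df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')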

# Convert year columns to float
for col in df.columns[6:]:
    df[col] = pd.to_numeric(df[col], errors='coerce')

print("\n Data Types After Cleaning:")
print(df.dtypes)


 Data Types After Cleaning:
ISO             object
Country         object
Data source     object
Sector          object
Gas             object
Unit            object
2021           float64
2020           float64
2019           float64
2018           float64
2017           float64
2016           float64
2015           float64
2014           float64
2013           float64
2012           float64
2011           float64
2010           float64
2009           float64
2008           float64
2007           float64
2006           float64
2005           float64
2004           float64
2003           float64
2002           float64
2001           float64
2000           float64
1999           float64
1998           float64
1997           float64
1996           float64
1995           float64
1994           float64
1993           float64
1992           float64
1991           float64
1990           float64
dtype: object


## Remove Duplicate Rows

In [8]:
# Check for duplicates
print("\n Duplicate Rows Before Cleaning:", df.duplicated().sum())

# Drop duplicates
df.drop_duplicates(inplace=True)

print("\n Duplicate Rows After Cleaning:", df.duplicated().sum())


 Duplicate Rows Before Cleaning: 0

 Duplicate Rows After Cleaning: 0


## Save the Cleaned Data

In [9]:
# Reshape the cleaned dataset to long format, then save it
# Identify fixed columns (non-year columns)
fixed_columns = ['ISO', 'Country', 'Data source', 'Sector', 'Gas', 'Unit']

# Convert wide format (years as columns) → long format
df_long = df.melt(id_vars=fixed_columns, var_name='Year', value_name='Emissions')

# Convert Year column to integer
df_long['Year'] = pd.to_numeric(df_long['Year'], errors='coerce')

# Keep only relevant years (1990-2021)
df_long = df_long[(df_long['Year'] >= 1990) & (df_long['Year'] <= 2021)]

# Save the reshaped dataset, then confirm
df_long.to_csv("cleaned_emissions.csv", index=False)

print("\nGHG Emissions dataset has been cleaned and reshaped to long format and saved as 'cleaned_emissions.csv'.")


GHG Emissions dataset has been cleaned and reshaped to long format and saved as 'cleaned_emissions.csv'.
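
A quick way to gain confidence in a wide-to-long reshape like the one above is to check the row count and pivot back. This is a small sketch on invented data; the frame names `wide`, `long`, and `back` are illustrative and separate from the notebook's `df` and `df_long`.

```python
import pandas as pd

# Invented two-row wide table with two year columns.
wide = pd.DataFrame({
    "ISO": ["AAA", "BBB"],
    "Sector": ["Energy", "Energy"],
    "2021": [2.0, 4.0],
    "2020": [1.0, 3.0],
})
id_cols = ["ISO", "Sector"]

long = wide.melt(id_vars=id_cols, var_name="Year", value_name="Emissions")
long["Year"] = long["Year"].astype(int)

# Each wide row should yield exactly one long row per year column
n_years = len(wide.columns) - len(id_cols)
assert len(long) == len(wide) * n_years

# Pivoting back should reproduce the original values
back = long.pivot(index="ISO", columns="Year", values="Emissions")
print(back)
```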


## **Population and GDP Datasets**

In [11]:
# Load Population Data
pop_file = "Population.csv"  
gdp_file = "GDP.csv" 

In [12]:
population = pd.read_csv(pop_file)
gdp = pd.read_csv(gdp_file)

In [13]:
population.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,Unnamed: 68
0,Aruba,ABW,"Population, total",SP.POP.TOTL,54922.0,55578.0,56320.0,57002.0,57619.0,58190.0,...,107906.0,108727.0,108735.0,108908.0,109203.0,108587.0,107700.0,107310.0,107359.0,
1,Africa Eastern and Southern,AFE,"Population, total",SP.POP.TOTL,130072080.0,133534923.0,137171659.0,140945536.0,144904094.0,149033472.0,...,607123269.0,623369401.0,640058741.0,657801085.0,675950189.0,694446100.0,713090928.0,731821393.0,750503764.0,
2,Afghanistan,AFG,"Population, total",SP.POP.TOTL,9035043.0,9214083.0,9404406.0,9604487.0,9814318.0,10036008.0,...,33831764.0,34700612.0,35688935.0,36743039.0,37856121.0,39068979.0,40000412.0,40578842.0,41454761.0,
3,Africa Western and Central,AFW,"Population, total",SP.POP.TOTL,97630925.0,99706674.0,101854756.0,104089175.0,106388440.0,108772632.0,...,418127845.0,429454743.0,440882906.0,452195915.0,463365429.0,474569351.0,485920997.0,497387180.0,509398589.0,
4,Angola,AGO,"Population, total",SP.POP.TOTL,5231654.0,5301583.0,5354310.0,5408320.0,5464187.0,5521981.0,...,28157798.0,29183070.0,30234839.0,31297155.0,32375632.0,33451132.0,34532429.0,35635029.0,36749906.0,


In [14]:
gdp.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,Unnamed: 68
0,Aruba,ABW,GDP (current US$),NY.GDP.MKTP.CD,,,,,,,...,2962907000.0,2983635000.0,3092429000.0,3276184000.0,3395799000.0,2481857000.0,2929447000.0,3279344000.0,3648573000.0,
1,Africa Eastern and Southern,AFE,GDP (current US$),NY.GDP.MKTP.CD,24210630000.0,24963980000.0,27078800000.0,31775750000.0,30285790000.0,33813170000.0,...,898277800000.0,828942800000.0,972998900000.0,1012306000000.0,1009721000000.0,933391800000.0,1085745000000.0,1191423000000.0,1245472000000.0,
2,Afghanistan,AFG,GDP (current US$),NY.GDP.MKTP.CD,,,,,,,...,19134220000.0,18116570000.0,18753460000.0,18053220000.0,18799440000.0,19955930000.0,14260000000.0,14497240000.0,17233050000.0,
3,Africa Western and Central,AFW,GDP (current US$),NY.GDP.MKTP.CD,11904950000.0,12707880000.0,13630760000.0,14469090000.0,15803760000.0,16921090000.0,...,771766900000.0,694361000000.0,687849200000.0,770495000000.0,826483800000.0,789801700000.0,849312400000.0,883973900000.0,799106000000.0,
4,Angola,AGO,GDP (current US$),NY.GDP.MKTP.CD,,,,,,,...,90496420000.0,52761620000.0,73690150000.0,79450690000.0,70897960000.0,48501560000.0,66505130000.0,104399700000.0,84824650000.0,


Key Points from the Data:
1.	Columns:
* Country Name: Full country name
* Country Code: ISO-3 country code
* Indicator Name: Specifies what the data represents (e.g., “Population, total”, “GDP (current US$)”)
* Indicator Code: Unique identifier for each type of data (e.g., SP.POP.TOTL for population, NY.GDP.MKTP.CD for GDP)
* Years (1960-2023): Each column represents the population or GDP value for that year.
2.	Missing Data:
* Some countries have missing values (NaN) for certain years, especially for GDP.
* Unnamed: 68 is an empty extra column (all 266 values are missing). It is an artifact of the CSV format and can be removed.
3.	We only need data from 1990 to 2021, so we will filter out unnecessary years (1960-1989 & 2022-2023).


Before filtering and reshaping the data, we should perform a data quality check to identify and handle:

✅ Null or missing values

✅ Improper values (e.g., negative population, GDP values of 0, etc.)

✅ Duplicate country-year entries

By doing this before saving the cleaned dataset, we avoid creating multiple intermediate files and ensure clean data from the start.
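
The three checks listed above can be collected into one small report. The sketch below assumes a World Bank style wide table and runs on invented rows; `wb`, `year_cols`, and `report` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Invented stand-in for a wide population or GDP table.
wb = pd.DataFrame({
    "Country Name": ["A", "B", "B"],
    "Country Code": ["AAA", "BBB", "BBB"],
    "1990": [100.0, -5.0, -5.0],
    "1991": [None, 0.0, 0.0],
})
year_cols = [c for c in wb.columns if c.isdigit()]

report = {
    "missing": int(wb[year_cols].isna().sum().sum()),
    "negative": int((wb[year_cols] < 0).sum().sum()),
    "zero": int((wb[year_cols] == 0).sum().sum()),
    # In wide format one row per country is expected, so a repeated
    # country code signals duplicate country-year entries.
    "duplicate_codes": int(wb.duplicated(subset=["Country Code"]).sum()),
}
print(report)
```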

In [15]:
population.describe()

Unnamed: 0,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,Unnamed: 68
count,264.0,264.0,264.0,264.0,264.0,264.0,264.0,264.0,264.0,264.0,...,265.0,265.0,265.0,265.0,265.0,265.0,265.0,265.0,265.0,0.0
mean,115448000.0,117053800.0,119216000.0,121887800.0,124583500.0,127311100.0,130158100.0,133012700.0,135942500.0,138970500.0,...,300490000.0,304332900.0,308172500.0,311927200.0,315608000.0,319203600.0,322354600.0,325523500.0,328788600.0,
std,362652300.0,367166000.0,373830300.0,382460800.0,391139700.0,399925600.0,409187000.0,418436100.0,427950700.0,437823900.0,...,938856500.0,949990600.0,961125900.0,971920900.0,982403400.0,992459900.0,1001248000.0,1009781000.0,1018589000.0,
min,2715.0,2970.0,3264.0,3584.0,3922.0,4282.0,4664.0,5071.0,5500.0,5631.0,...,10954.0,10930.0,10869.0,10751.0,10581.0,10399.0,10194.0,9992.0,9816.0,
25%,515202.8,525523.0,536301.8,547587.5,559363.8,567575.0,571169.5,577952.5,582517.0,586118.5,...,1786457.0,1777568.0,1791019.0,1797086.0,1788891.0,1790152.0,1786080.0,1803545.0,1827816.0,
50%,3659633.0,3747132.0,3831900.0,3919710.0,4010150.0,4102976.0,4198738.0,4297792.0,4396290.0,4503420.0,...,10358080.0,10325450.0,10259150.0,10283820.0,10423380.0,10697860.0,10505770.0,10486940.0,10644850.0,
75%,26862930.0,27613260.0,28373020.0,29154480.0,29952230.0,30759210.0,31475160.0,32039460.0,32470570.0,32771490.0,...,60730580.0,60627500.0,60536710.0,60421760.0,59729080.0,60972800.0,62830410.0,64711820.0,66617610.0,
max,3021529000.0,3062769000.0,3117373000.0,3184063000.0,3251253000.0,3318998000.0,3389087000.0,3459014000.0,3530702000.0,3604812000.0,...,7441472000.0,7528523000.0,7614114000.0,7696495000.0,7776892000.0,7856139000.0,7921184000.0,7989982000.0,8061876000.0,


In [16]:
gdp.describe()

Unnamed: 0,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,Unnamed: 68
count,151.0,154.0,157.0,157.0,157.0,163.0,164.0,167.0,168.0,168.0,...,259.0,258.0,258.0,258.0,258.0,257.0,257.0,254.0,242.0,0.0
mean,67861710000.0,69822510000.0,73274920000.0,79082180000.0,86803280000.0,91334820000.0,97933280000.0,101497100000.0,108679400000.0,120348000000.0,...,2382043000000.0,2424134000000.0,2597268000000.0,2765285000000.0,2807353000000.0,2736278000000.0,3141029000000.0,3307885000000.0,3618640000000.0,
std,201925400000.0,211601600000.0,225726900000.0,242731000000.0,265797600000.0,284477300000.0,308838600000.0,325787600000.0,351586000000.0,387603200000.0,...,8174322000000.0,8328694000000.0,8865044000000.0,9440413000000.0,9580867000000.0,9368340000000.0,10696430000000.0,11144710000000.0,11904010000000.0,
min,12012020.0,11592020.0,12541640.0,12833300.0,13416630.0,13593930.0,14469080.0,15835110.0,14600000.0,15850000.0,...,36811940.0,41629060.0,45276600.0,48015260.0,54123200.0,51746590.0,60196410.0,59065980.0,62280310.0,
25%,507924100.0,500733800.0,574091100.0,586294900.0,582816400.0,595657200.0,645066700.0,631123400.0,621190600.0,657467000.0,...,8766202000.0,8620784000.0,9193745000.0,9928840000.0,10111660000.0,9516738000.0,10071350000.0,12463610000.0,15431700000.0,
50%,3359404000.0,3330233000.0,3308913000.0,3988462000.0,4016794000.0,3817227000.0,4153527000.0,3532700000.0,4529031000.0,5087251000.0,...,48717500000.0,48065650000.0,52770010000.0,56097190000.0,57752480000.0,53668640000.0,61529280000.0,69704150000.0,83837520000.0,
75%,33250710000.0,32829770000.0,31841620000.0,36572880000.0,33198810000.0,34649100000.0,37201170000.0,36580900000.0,38513400000.0,43343340000.0,...,497362500000.0,504231500000.0,534152100000.0,549144100000.0,542164100000.0,545147600000.0,637186900000.0,679903100000.0,1021922000000.0,
max,1371947000000.0,1445951000000.0,1550598000000.0,1669570000000.0,1830168000000.0,1994298000000.0,2161754000000.0,2293944000000.0,2478900000000.0,2738144000000.0,...,75472470000000.0,76702550000000.0,81712040000000.0,86884840000000.0,88149850000000.0,85763010000000.0,97848300000000.0,101770900000000.0,106171700000000.0,


In [17]:
### 1️⃣ Check for Missing Values ###
print("\n📌 Checking Missing Values in Population Data:")
print(population.isnull().sum())

print("\n📌 Checking Missing Values in GDP Data:")
print(gdp.isnull().sum())


📌 Checking Missing Values in Population Data:
Country Name        0
Country Code        0
Indicator Name      0
Indicator Code      0
1960                2
                 ... 
2020                1
2021                1
2022                1
2023                1
Unnamed: 68       266
Length: 69, dtype: int64

📌 Checking Missing Values in GDP Data:
Country Name        0
Country Code        0
Indicator Name      0
Indicator Code      0
1960              115
                 ... 
2020                9
2021                9
2022               12
2023               24
Unnamed: 68       266
Length: 69, dtype: int64


Both datasets have missing values for some years, GDP more so than population. Instead of dropping those rows, we use the forward-fill method (ffill()) to propagate the last valid value forward. This is a reasonable approach because population and GDP typically follow a trend over time, so the previous year’s value is a usable estimate when data is missing. Negative values, which would be invalid for population or GDP, are first replaced with NaN and then filled the same way, as in the emissions dataset.

In [18]:
### 2️⃣ Filter Years & Reshape ###
# Drop unnecessary columns (keeping only 1990-2021)
cols_to_keep = ['Country Name', 'Country Code'] + [str(year) for year in range(1990, 2022)]
population = population[cols_to_keep]
gdp = gdp[cols_to_keep]

# Convert Wide Format to Long Format
population_long = population.melt(id_vars=['Country Name', 'Country Code'], var_name='Year', value_name='Population')
gdp_long = gdp.melt(id_vars=['Country Name', 'Country Code'], var_name='Year', value_name='GDP')

# Convert 'Year' to integer
population_long['Year'] = population_long['Year'].astype(int)
gdp_long['Year'] = gdp_long['Year'].astype(int)

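### 3️⃣ Handle Negative & Missing Values (Same as Emissions Dataset) ###
# Replace negative values with NaN
population_long['Population'] = population_long['Population'].mask(population_long['Population'] < 0)
gdp_long['GDP'] = gdp_long['GDP'].mask(gdp_long['GDP'] < 0)

# After melt, rows are ordered by year and then country, so sort by country and year
# to make forward fill run along each country's own timeline
population_long = population_long.sort_values(['Country Code', 'Year'])
gdp_long = gdp_long.sort_values(['Country Code', 'Year'])

# Forward-fill within each country (gaps before a country's first value stay NaN)
population_long['Population'] = population_long.groupby('Country Code')['Population'].ffill()
gdp_long['GDP'] = gdp_long.groupby('Country Code')['GDP'].ffill()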

# Check if negatives still exist
print("\n📌 Min Population Value After Cleaning:", population_long['Population'].min())
print("\n📌 Min GDP Value After Cleaning:", gdp_long['GDP'].min())

### 4️⃣ Check for Duplicate Entries ###
print("\n📌 Checking for Duplicate Country-Year Entries in Population Data:")
print(population_long.duplicated(subset=['Country Code', 'Year']).sum())

print("\n📌 Checking for Duplicate Country-Year Entries in GDP Data:")
print(gdp_long.duplicated(subset=['Country Code', 'Year']).sum())

# Drop duplicate records (if any)
population_long.drop_duplicates(subset=['Country Code', 'Year'], inplace=True)
gdp_long.drop_duplicates(subset=['Country Code', 'Year'], inplace=True)

### 5️⃣ Rename Columns for Consistency ###
population_long.rename(columns={'Country Name': 'Country', 'Country Code': 'ISO'}, inplace=True)
gdp_long.rename(columns={'Country Name': 'Country', 'Country Code': 'ISO'}, inplace=True)

### 6️⃣ Save the Cleaned & Filtered Data ###
population_long.to_csv("filtered_population.csv", index=False)
gdp_long.to_csv("filtered_gdp.csv", index=False)

print("\n✅ Data Cleaning Completed. Filtered Population & GDP Data Saved for 1990-2021.")


📌 Min Population Value After Cleaning: 8798.0

📌 Min GDP Value After Cleaning: 9542900.90136505

📌 Checking for Duplicate Country-Year Entries in Population Data:
0

📌 Checking for Duplicate Country-Year Entries in GDP Data:
0

✅ Data Cleaning Completed. Filtered Population & GDP Data Saved for 1990-2021.
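
Renaming the key columns to `ISO` and `Country` makes the population and GDP tables joinable with the long emissions table on `ISO` and `Year`. The sketch below shows one way such a join could look, on invented rows. It assumes that aggregate codes may differ between sources (here the emissions table uses `WORLD` while the other table uses a different code for the world total), which is worth verifying against the real files.

```python
import pandas as pd

# Invented rows; the mismatch between "WORLD" and "WLD" is an assumption
# used to illustrate how unmatched keys can be detected.
emis = pd.DataFrame({
    "ISO": ["AAA", "WORLD"],
    "Year": [2021, 2021],
    "Emissions": [10.0, 500.0],   # MtCO2e
})
pop = pd.DataFrame({
    "ISO": ["AAA", "WLD"],
    "Year": [2021, 2021],
    "Population": [5e6, 1e9],
})

# validate guards against duplicate country-year keys on the right side;
# indicator adds a "_merge" column showing which rows found a match.
merged = emis.merge(pop, on=["ISO", "Year"], how="left",
                    validate="many_to_one", indicator=True)

unmatched = merged.loc[merged["_merge"] == "left_only", "ISO"].unique()
print("ISO codes without population data:", unmatched)

# Per capita emissions in tonnes CO2e per person (NaN where unmatched)
merged["per_capita_t"] = merged["Emissions"] * 1e6 / merged["Population"]
print(merged[["ISO", "Year", "per_capita_t"]])
```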


