# Data Cleaning | CO2 Emissions

#### This notebook cleans the CO2 Emissions file.
#### The file was downloaded in CSV format from https://data.worldbank.org/indicator/EN.ATM.CO2E.PC?view=map on 17/11/21 at 11:50AM.

## Import Libraries

In [22]:
import pandas as pd
import numpy as np

## Import CSV

In [50]:
# There is a problem with the headers in the csv file, so the following read skips the first 3 rows.
    # (If you open the csv file in a text editor first, you'll see the problem.)
# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0.

global_co2_emissions = pd.read_csv('emissions_data.csv', skiprows=3, on_bad_lines='skip')

In [24]:
# View the df.
global_co2_emissions # or global_co2_emissions.head() if you want the first few rows

,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,Unnamed: 65
0,Aruba,ABW,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,204.631696,208.837879,226.081890,214.785217,207.626699,185.213644,...,,,,,,,,,,
1,Africa Eastern and Southern,AFE,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.906060,0.922474,0.930816,0.940570,0.996033,1.047280,...,1.021646,1.031833,1.041145,0.987393,0.971016,0.959978,0.933541,,,
2,Afghanistan,AFG,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.046057,0.053589,0.073721,0.074161,0.086174,0.101285,...,0.335351,0.263716,0.234037,0.232176,0.208857,0.203328,0.200151,,,
3,Africa Western and Central,AFW,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.090880,0.095283,0.096612,0.112376,0.133258,0.184803,...,0.490867,0.504655,0.507671,0.480743,0.472959,0.476438,0.515544,,,
4,Angola,AGO,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.100835,0.082204,0.210533,0.202739,0.213562,0.205891,...,1.204799,1.261542,1.285365,1.260921,1.227703,1.034317,0.887380,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
261,Kosovo,XKX,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,,,,,,,,,,
262,"Yemen, Rep.",YEM,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.011038,0.013599,0.012729,0.014518,0.017550,0.017926,...,0.804146,1.047834,1.034330,0.536269,0.400468,0.361418,0.326682,,,
263,South Africa,ZAF,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,5.727223,5.832621,5.887168,5.961337,6.332343,6.616545,...,8.076633,8.137333,8.213158,7.671202,7.564451,7.632729,7.496645,,,
264,Zambia,ZMB,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,0.946606,1.096875,...,0.277909,0.284058,0.311693,0.319282,0.341615,0.414748,0.446065,,,


### What do we notice about the dataframe so far?

1. There are a lot of nulls in the rows. Why?
    - Perhaps recording hadn't started yet.
    - Perhaps the country didn't exist yet.
    - Perhaps the country no longer exists.
    - Perhaps the World Bank did not receive the data.
2. There is an unnamed column at the end (65), presumably empty.
3. There are 266 rows and 66 columns.
4. Country Code should be our unique primary key, but we should double check that every row has one and that none repeat.
5. It looks like some countries have no CO2 emissions data at all. Should we include them anyway?
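Point 4 can be checked directly. The following is a minimal sketch on a toy dataframe (the `toy` frame and its three rows are illustrative stand-ins, not the real file); the same two checks should carry over to `global_co2_emissions['Country Code']`.

```python
import pandas as pd

# Toy stand-in for the real dataframe (illustrative rows only).
toy = pd.DataFrame({
    "Country Name": ["Aruba", "Angola", "Zambia"],
    "Country Code": ["ABW", "AGO", "ZMB"],
})

codes = toy["Country Code"]

# A usable key has no missing values and no repeats.
has_missing_codes = bool(codes.isna().any())
has_duplicate_codes = bool(codes.duplicated().any())

print("missing codes:", has_missing_codes)
print("duplicate codes:", has_duplicate_codes)
```

If either flag comes back `True` on the real data, Country Code cannot safely be used as a join key until those rows are dealt with.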

### Counting the nulls

In [25]:
total_empty = [] # This will hold any cols completely empty

for col in global_co2_emissions:
    nulls = global_co2_emissions[col].isna().sum()
    if nulls == 0:
        continue # Nothing to report for fully populated cols
    elif nulls < len(global_co2_emissions.index):
        print('Nulls in: ', col, ': ', nulls)
    else:
        total_empty.append(col) # This will place any cols completely empty in the 'total_empty' list.
        print('Nulls in: ', col, ': ', nulls, '   <----   This col is completely empty')

Nulls in:  1960 :  63
Nulls in:  1961 :  62
Nulls in:  1962 :  60
Nulls in:  1963 :  59
Nulls in:  1964 :  53
Nulls in:  1965 :  53
Nulls in:  1966 :  53
Nulls in:  1967 :  53
Nulls in:  1968 :  53
Nulls in:  1969 :  53
Nulls in:  1970 :  52
Nulls in:  1971 :  51
Nulls in:  1972 :  50
Nulls in:  1973 :  50
Nulls in:  1974 :  50
Nulls in:  1975 :  50
Nulls in:  1976 :  50
Nulls in:  1977 :  50
Nulls in:  1978 :  50
Nulls in:  1979 :  50
Nulls in:  1980 :  50
Nulls in:  1981 :  50
Nulls in:  1982 :  50
Nulls in:  1983 :  50
Nulls in:  1984 :  50
Nulls in:  1985 :  50
Nulls in:  1986 :  50
Nulls in:  1987 :  50
Nulls in:  1988 :  50
Nulls in:  1989 :  50
Nulls in:  1990 :  28
Nulls in:  1991 :  27
Nulls in:  1992 :  28
Nulls in:  1993 :  28
Nulls in:  1994 :  28
Nulls in:  1995 :  27
Nulls in:  1996 :  27
Nulls in:  1997 :  27
Nulls in:  1998 :  27
Nulls in:  1999 :  27
Nulls in:  2000 :  27
Nulls in:  2001 :  27
Nulls in:  2002 :  27
Nulls in:  2003 :  27
Nulls in:  2004 :  27
Nulls in: 

Some of these columns are completely empty. Do we really need them?

In [26]:
# Dropping the empty columns
global_co2_emissions = global_co2_emissions.drop(columns=total_empty)
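
As an alternative to the loop, pandas can count nulls for every column at once and drop the fully empty ones in a single call. This is a sketch on a toy frame (the `toy` data and the names `null_counts`, `partly_empty`, `fully_empty` and `trimmed` are made up for illustration), not a replacement for the cells above.

```python
import numpy as np
import pandas as pd

# Toy frame: one complete column, one partly empty, one completely empty.
toy = pd.DataFrame({
    "Country Code": ["ABW", "AGO", "ZMB"],
    "1960": [1.0, np.nan, 0.5],
    "2019": [np.nan, np.nan, np.nan],
})

# Nulls per column, computed in one pass.
null_counts = toy.isna().sum()

# Columns with some, but not all, values missing.
partly_empty = null_counts[(null_counts > 0) & (null_counts < len(toy))]

# Columns where every value is missing.
fully_empty = list(toy.columns[toy.isna().all()])

# Drop the completely empty columns.
trimmed = toy.dropna(axis="columns", how="all")

print(partly_empty)
print(fully_empty)
print(list(trimmed.columns))
```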

Which countries have absolutely no entries?
Do we have a column for every year?

In [27]:
# If 1960 is our first year and 2018 our last, do we have columns for every year in between?

datayears = range(1960, 2019) # probably will reuse this

expectedcols = len(datayears) + 4 # One col per year from 1960 to 2018, plus the first 4 columns we can see
result = 'missing cols' if expectedcols != len(global_co2_emissions.columns) else 'not missing cols'
print(result)

not missing cols


In [28]:
# Countries that have no co2 emissions data, and countries that are missing only some years.

no_data = []
partial_data = []

for i in range(len(global_co2_emissions.index)):
    row = global_co2_emissions.iloc[i]
    empties_in_row = row.isnull().sum()
    if empties_in_row == len(datayears): # Every one of the 59 year columns is empty
        no_data.append([empties_in_row, row['Country Name']])
    elif empties_in_row > 0: # Some, but not all, year columns are empty
        partial_data.append([empties_in_row, row['Country Name']])
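
The same split can be done without a Python loop by counting nulls across the year columns only. The sketch below uses a toy frame and hypothetical names (`toy`, `year_cols`, `missing_per_row`); on the real dataframe the year columns would be the ones after the first four.

```python
import numpy as np
import pandas as pd

# Toy frame: one complete row, one fully empty row, one partly empty row.
toy = pd.DataFrame({
    "Country Name": ["Aruba", "Not classified", "Zambia"],
    "1960": [1.0, np.nan, np.nan],
    "1961": [1.2, np.nan, 0.9],
})
year_cols = ["1960", "1961"]

# Number of missing years in each row.
missing_per_row = toy[year_cols].isna().sum(axis=1)

# Rows with every year missing, and rows with only some years missing.
all_missing = missing_per_row == len(year_cols)
some_missing = (missing_per_row > 0) & ~all_missing

no_data_names = toy.loc[all_missing, "Country Name"].tolist()
partial_names = toy.loc[some_missing, "Country Name"].tolist()

print(no_data_names)
print(partial_names)
```

Restricting the count to the year columns also avoids relying on the descriptive columns never being null.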

In [29]:
# We can see that there is a row called 'Not classified' with no data at all. It might as well go. 👋🏻
# The rest of the countries will stay for now: it's not clear what the best thing to do is, but at least they are identified.
no_data

global_co2_emissions = global_co2_emissions[global_co2_emissions['Country Name'] != 'Not classified']

# There should be 265 rows now.
print(len(global_co2_emissions.index))

265


Are there any rows that are duplicates?

In [30]:
# No 😊

global_co2_emissions.duplicated().sum()

0

In [31]:
# Are there any wildly strange numbers in there? 
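
One possible way to approach this question, sketched on a toy frame built from four rows of the preview above (the names `toy`, `year_cols`, `negatives` and `largest` are illustrative): check for negative values, which would be data errors, and rank countries by their largest recorded value. In the preview, Aruba's 1960 figure of about 204 metric tons per capita is far above the other rows shown, so it is the kind of value this check should surface.

```python
import pandas as pd

# Four rows copied from the preview of the dataframe, first two years only.
toy = pd.DataFrame({
    "Country Name": ["Aruba", "Afghanistan", "Angola", "South Africa"],
    "1960": [204.631696, 0.046057, 0.100835, 5.727223],
    "1961": [208.837879, 0.053589, 0.082204, 5.832621],
})
year_cols = ["1960", "1961"]

# Per capita emissions below zero would indicate a data error.
negatives = int((toy[year_cols] < 0).sum().sum())

# Largest value recorded for each country, biggest first.
row_max = toy.set_index("Country Name")[year_cols].max(axis=1)
largest = row_max.sort_values(ascending=False)

print("negative values:", negatives)
print(largest.head(3))
```

Whether a value is actually wrong is a separate question; this only lists candidates worth checking against the source.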

### Melt Table
##### (Reverse Pivot? Unpivot?)

This will make it easier to join with other data.
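
To picture what melting does, here is a small sketch on a toy wide table (two countries, two years; the short value name `CO2` is only for illustration). Each year column becomes rows, so the long table has one row per country per year.

```python
import pandas as pd

# Wide format: one row per country, one column per year.
wide = pd.DataFrame({
    "Country Code": ["AFG", "ZAF"],
    "1960": [0.046057, 5.727223],
    "1961": [0.053589, 5.832621],
})

# Long format: one row per (country, year) pair.
long = wide.melt(id_vars=["Country Code"], var_name="Year", value_name="CO2")

print(long)
```

Two countries times two years gives four rows, which is the same arithmetic used to predict the row count of the real melted table.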

In [41]:
# Before we melt the table it's a good idea to work out how many rows the melted table should have.

no_years = len(datayears) # This is how many years we have data on
# And 265 countries (rows)
expected_rows = no_years * len(global_co2_emissions.index)
expected_rows

15635


In [37]:
global_co2_emissions_melted = pd.melt(global_co2_emissions, 
            id_vars=["Country Name", "Country Code"],
            value_vars=list(global_co2_emissions.columns[4:]),
            var_name='Year', 
            value_name='CO2 emissions (metric tons per capita)')

In [38]:
global_co2_emissions_melted

,Country Name,Country Code,Year,CO2 emissions (metric tons per capita)
0,Aruba,ABW,1960,204.631696
1,Africa Eastern and Southern,AFE,1960,0.906060
2,Afghanistan,AFG,1960,0.046057
3,Africa Western and Central,AFW,1960,0.090880
4,Angola,AGO,1960,0.100835
...,...,...,...,...
15630,Kosovo,XKX,2018,
15631,"Yemen, Rep.",YEM,2018,0.326682
15632,South Africa,ZAF,2018,7.496645
15633,Zambia,ZMB,2018,0.446065


In [47]:
# Double checking row lengths
if len(global_co2_emissions_melted.index) == expected_rows:
    print('all good 😊') 
else:
    print('oops 🤯') 

all good 😊
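
Two further checks may be useful before joining with other data. This is a sketch under the assumption that the `Year` values are strings after melting, since they started out as column labels read from the csv. The toy frame `long` and the name `pairs_unique` are illustrative.

```python
import pandas as pd

# Toy melted table; Year is stored as text, as it would be after melting column labels.
long = pd.DataFrame({
    "Country Code": ["AFG", "AFG", "ZAF", "ZAF"],
    "Year": ["1960", "1961", "1960", "1961"],
    "CO2": [0.046057, 0.053589, 5.727223, 5.832621],
})

# Store the year as an integer so it matches numeric year columns in other tables.
long["Year"] = long["Year"].astype(int)

# Each (country, year) pair should appear exactly once.
pairs_unique = not long.duplicated(subset=["Country Code", "Year"]).any()

print(long.dtypes)
print("pairs unique:", pairs_unique)
```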


In [49]:
# Order by country, then year.
global_co2_emissions_melted = global_co2_emissions_melted.sort_values(by=['Country Name', 'Year'])
global_co2_emissions_melted

,Country Name,Country Code,Year,CO2 emissions (metric tons per capita)
2,Afghanistan,AFG,1960,0.046057
267,Afghanistan,AFG,1961,0.053589
532,Afghanistan,AFG,1962,0.073721
797,Afghanistan,AFG,1963,0.074161
1062,Afghanistan,AFG,1964,0.086174
...,...,...,...,...
14574,Zimbabwe,ZWE,2014,0.894256
14839,Zimbabwe,ZWE,2015,0.897598
15104,Zimbabwe,ZWE,2016,0.783303
15369,Zimbabwe,ZWE,2017,0.718570
