`Carbon Credits - done, could reduce some columns`

Predictors description:
- Carbon credits are part of the C1.1.2 section of the CDP report
- The main distinction is whether a credit is originated or purchased: originated credits result from the company's own carbon-saving initiatives, while purchased credits serve to reduce its GHG footprint

Main operations:
- Conversion to firm-year: rows are aggregated by id and year, with summary statistics (sum, count) for the numerical variables
- Categorical variables are one-hot encoded (without dropping the first category) and the resulting dummies are summed per firm-year. If a company has two credits of a given category in a year, that category gets a value of two. The final categorical columns are therefore counts, not binary indicators.
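
The two operations above can be sketched on toy data. This is a minimal illustration, not the notebook's actual pipeline: the `purpose` and `num_credits` columns and all values are hypothetical.

```python
import pandas as pd

# Toy data: hypothetical values, not taken from the CDP file
toy = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "year": [2020, 2020, 2021, 2020],
    "purpose": ["voluntary", "voluntary", "compliance", "compliance"],
    "num_credits": [10.0, 5.0, 7.0, 3.0],
})

# One-hot encode without dropping the first category
dummies = pd.get_dummies(toy["purpose"], prefix="purpose", dtype=int)

# Summing the dummies per firm-year turns them into category counts
counts = dummies.groupby([toy["id"], toy["year"]]).sum()

# Sum and row count for the numerical variable
numeric = toy.groupby(["id", "year"])["num_credits"].agg(["sum", "count"])

# One row per firm-year: numerical summaries plus category counts
firm_year = numeric.join(counts).reset_index()
print(firm_year)
```

Firm 1 has two "voluntary" rows in 2020, so `purpose_voluntary` is 2 for that firm-year rather than 1.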

Alternatives: 
- When dealing with origination and purchase, should the sign of the number of credits be changed?
- Note that some companies do not report the number of credits they purchase. Because missing values are filled with zero before aggregation, the sum for such firm-years equals zero; the `_missing` indicator columns distinguish this from a genuinely reported zero.
- To reduce the dimensionality of the data, consider removing these columns: `cdp_credits_cancelled_clean_missing`, `cdp_purpose_clean_compliance`, `cdp_purpose_clean_other`, `cdp_purpose_clean_voluntary offsetting`, `cdp_purpose_clean_missing`
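
For the sign question, one possible convention is sketched below on toy data. The convention itself (purchases positive, originations negative) is an assumption for illustration, and `signed_credits` and `net_credits` are hypothetical column names; the notebook does not apply this step.

```python
import numpy as np
import pandas as pd

# Toy data using the notebook's column names; values are hypothetical
toy = pd.DataFrame({
    "id": [1, 1, 2],
    "year": [2020, 2020, 2020],
    "cdp_orig_or_purchase_clean": [
        "credit purchase", "credit origination", "credit purchase",
    ],
    "cdp_num_credits_clean": [100.0, 30.0, 50.0],
})

# Assumed convention: purchases count as +, originations as -
sign = np.where(
    toy["cdp_orig_or_purchase_clean"] == "credit origination", -1.0, 1.0
)
toy["signed_credits"] = sign * toy["cdp_num_credits_clean"]

# Net position per firm-year
net = (
    toy.groupby(["id", "year"])["signed_credits"]
    .sum()
    .reset_index(name="net_credits")
)
print(net)
```

A signed total can be negative, so a `log1p` transform could not be applied to it directly; that would need to be weighed before adopting this option.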

Joining process with id, year, isin:
- Not all companies report carbon credits, so the join must start from the mapping and keep all of its rows.
- During the join, an indicator flags the rows that have no carbon-credit record; the missing values the join creates for those rows are then filled with zero.
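
A minimal sketch of such a row-preserving join, assuming toy `mapping` and `credits` frames with hypothetical values. It uses `merge(..., indicator=True)` as one way to build the flag, which is not necessarily how the cells further down do it.

```python
import pandas as pd

# Toy frames: hypothetical values
mapping = pd.DataFrame({
    "id": [1, 1, 2],
    "year": [2020, 2021, 2020],
    "isin": ["AAA", "AAA", "BBB"],
})
credits = pd.DataFrame({
    "id": [1],
    "year": [2020],
    "cdp_num_credits_clean_sum": [15.0],
})

# Left join keeps every mapping row; _merge records where each row came from
merged = mapping.merge(credits, on=["id", "year"], how="left", indicator=True)

# 1 when the firm-year has no carbon-credit record
merged["absent_cdp_carbon_credits_full"] = (
    merged["_merge"] == "left_only"
).astype(int)
merged = merged.drop(columns="_merge")

# Fill only the feature columns, so identifiers such as isin are left alone
feature_cols = ["cdp_num_credits_clean_sum"]
merged[feature_cols] = merged[feature_cols].fillna(0)
print(merged)
```

Deriving the flag from `_merge` rather than from NaNs keeps it independent of any missing values already present in the mapping.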

In [682]:
import pandas as pd
import seaborn as sns
import numpy as np

In [683]:
# read the file into a dataframe
df = pd.read_stata('../../data/CDP Cleaning/cleaned outputs/cdp_carbon_credits_full.dta')
df.shape

(15348, 16)

In [684]:
df.head()

Unnamed: 0,id,year,Q1,Q2,Q4,Q5,Q6,Q7,Q8,cdp_orig_or_purchase_clean,cdp_proj_type_clean,cdp_verification_clean,cdp_num_credits_clean,cdp_num_credits_riskadj_clean,cdp_credits_cancelled_clean,cdp_purpose_clean
0,53,2015.0,credit purchase,biomass energy,vcs (verified carbon standard),41.0,41.0,yes,voluntary offsetting,credit purchase,biomass energy,vcs (verified carbon standard),41.0,41.0,yes,voluntary offsetting
1,53,2016.0,credit purchase,biomass energy,vcs (verified carbon standard),40.0,40.0,yes,voluntary offsetting,credit purchase,biomass energy,vcs (verified carbon standard),40.0,40.0,yes,voluntary offsetting
2,64,2014.0,credit purchase,other: emission credit brokerage,not yet verified,3.0,0.0,yes,compliance,credit purchase,other,not yet verified,3.0,0.0,yes,compliance
3,79,2021.0,credit purchase,forests,gold standard,366.0,,yes,voluntary offsetting,credit purchase,forests,gold standard,366.0,,yes,voluntary offsetting
4,79,2021.0,credit purchase,wind,gold standard,18492.0,,yes,voluntary offsetting,credit purchase,wind,gold standard,18492.0,,yes,voluntary offsetting


In [685]:
# count number of unique id, year pairs
df.groupby(['id', 'year']).size().shape

(5784,)

`Multiple credit purchases per company`
- on average a company reports 2.65 credit entries (rows) per year
- an indicator should signal whether a company has purchased credits or not

In [686]:
df.groupby(['id', 'year'])['year'].count().mean()

2.6535269709543567

In [687]:
# are there any duplicates?
df[df.duplicated(['id', 'year'], keep=False)]

Unnamed: 0,id,year,Q1,Q2,Q4,Q5,Q6,Q7,Q8,cdp_orig_or_purchase_clean,cdp_proj_type_clean,cdp_verification_clean,cdp_num_credits_clean,cdp_num_credits_riskadj_clean,cdp_credits_cancelled_clean,cdp_purpose_clean
3,79,2021.0,credit purchase,forests,gold standard,366.0,,yes,voluntary offsetting,credit purchase,forests,gold standard,366.0,,yes,voluntary offsetting
4,79,2021.0,credit purchase,wind,gold standard,18492.0,,yes,voluntary offsetting,credit purchase,wind,gold standard,18492.0,,yes,voluntary offsetting
5,79,2021.0,credit purchase,biomass energy,gold standard,1282.0,0.0,yes,voluntary offsetting,credit purchase,biomass energy,gold standard,1282.0,0.0,yes,voluntary offsetting
6,79,2022.0,credit purchase,forests,vcs (verified carbon standard),1380.0,1380.0,yes,voluntary offsetting,credit purchase,forests,vcs (verified carbon standard),1380.0,1380.0,yes,voluntary offsetting
7,79,2022.0,credit purchase,forests,vcs (verified carbon standard),3620.0,3620.0,yes,voluntary offsetting,credit purchase,forests,vcs (verified carbon standard),3620.0,3620.0,yes,voluntary offsetting
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15341,895062,2022.0,credit purchase,transport,acr (american carbon registry),803.0,803.0,yes,voluntary offsetting,credit purchase,transport,acr (american carbon registry),803.0,803.0,yes,voluntary offsetting
15343,895218,2022.0,credit purchase,co2 usage,gold standard,2438.0,2438.0,not relevant,voluntary offsetting,credit purchase,co2 usage,gold standard,2438.0,2438.0,,voluntary offsetting
15344,895218,2022.0,credit purchase,forests,"ccbs (developed by the climate, community and ...",4753.0,4753.0,not relevant,voluntary offsetting,credit purchase,forests,"ccbs (developed by the climate, community and ...",4753.0,4753.0,,voluntary offsetting
15345,895218,2022.0,credit purchase,forests,gold standard,1638.0,1638.0,not relevant,voluntary offsetting,credit purchase,forests,gold standard,1638.0,1638.0,,voluntary offsetting


In [688]:
# remove leading and trailing spaces from all values
df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

In [689]:
# if a value is an empty string, replace it with NaN
df = df.replace(r'^\s*$', pd.NA, regex=True)

Dropping the repeated columns (namely Q1, Q2, ...), since they are already in the dataframe under proper names

In [690]:
df.drop(columns=['Q1', 'Q2', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8'], inplace=True)

In [691]:
df.head()

Unnamed: 0,id,year,cdp_orig_or_purchase_clean,cdp_proj_type_clean,cdp_verification_clean,cdp_num_credits_clean,cdp_num_credits_riskadj_clean,cdp_credits_cancelled_clean,cdp_purpose_clean
0,53,2015.0,credit purchase,biomass energy,vcs (verified carbon standard),41.0,41.0,yes,voluntary offsetting
1,53,2016.0,credit purchase,biomass energy,vcs (verified carbon standard),40.0,40.0,yes,voluntary offsetting
2,64,2014.0,credit purchase,other,not yet verified,3.0,0.0,yes,compliance
3,79,2021.0,credit purchase,forests,gold standard,366.0,,yes,voluntary offsetting
4,79,2021.0,credit purchase,wind,gold standard,18492.0,,yes,voluntary offsetting


In [692]:
# print the unique values for each column
for col in df.columns:
    print(col)
    print(df[col].unique())
    print('\n')

id
[    53     64     79 ... 895100 895218 895519]


year
[2015. 2016. 2014. 2021. 2022. 2010. 2011. 2012. 2013. 2017. 2018. 2019.
 2020.]


cdp_orig_or_purchase_clean
['credit purchase' 'credit origination']


cdp_proj_type_clean
['biomass energy' 'other' 'forests' 'wind' 'methane avoidance'
 'landfill gas' <NA> 'energy efficiency' 'hydro' 'fossil fuel switch'
 'cement' 'agriculture' 'n20' 'transport' 'co2 usage' 'coal mine/bed ch4'
 'solar' 'energy distribution' 'fugitive' 'pfcs and sf6' 'hfcs' 'n2o'
 'tidal']


cdp_verification_clean
['vcs (verified carbon standard)' 'not yet verified' 'gold standard' <NA>
 'cdm' 'cdm (clean development mechanism)' 'ji (joint implementation)'
 'ccbs (developed by the climate, community and biodiversity alliance, ccba)'
 'other' 'vcs' 'vcs (voluntary carbon standard)' 'ver+' 'car'
 'emissions reduction fund of the australian government'
 'ver+ (tÜv sÜd standard)' 'acr (american carbon registry)' 'plan vivo'
 'ji' 'car (the climate action reserve)'
 '

In [693]:
# number of nas for each column
df.isna().sum()

id                                  0
year                                0
cdp_orig_or_purchase_clean          0
cdp_proj_type_clean               711
cdp_verification_clean            253
cdp_num_credits_clean             669
cdp_num_credits_riskadj_clean    3208
cdp_credits_cancelled_clean      3417
cdp_purpose_clean                 814
dtype: int64

In [694]:
df.dtypes

id                                 int32
year                             float32
cdp_orig_or_purchase_clean        object
cdp_proj_type_clean               object
cdp_verification_clean            object
cdp_num_credits_clean            float32
cdp_num_credits_riskadj_clean    float32
cdp_credits_cancelled_clean       object
cdp_purpose_clean                 object
dtype: object

In [695]:
# convert year and id to nullable Int32
df['year'] = df['year'].astype('Int32')
df['id'] = df['id'].astype('Int32')

# convert all object columns to category
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].astype('category')

In [696]:
df.head()

Unnamed: 0,id,year,cdp_orig_or_purchase_clean,cdp_proj_type_clean,cdp_verification_clean,cdp_num_credits_clean,cdp_num_credits_riskadj_clean,cdp_credits_cancelled_clean,cdp_purpose_clean
0,53,2015,credit purchase,biomass energy,vcs (verified carbon standard),41.0,41.0,yes,voluntary offsetting
1,53,2016,credit purchase,biomass energy,vcs (verified carbon standard),40.0,40.0,yes,voluntary offsetting
2,64,2014,credit purchase,other,not yet verified,3.0,0.0,yes,compliance
3,79,2021,credit purchase,forests,gold standard,366.0,,yes,voluntary offsetting
4,79,2021,credit purchase,wind,gold standard,18492.0,,yes,voluntary offsetting


In [697]:
df.dtypes

id                                  Int32
year                                Int32
cdp_orig_or_purchase_clean       category
cdp_proj_type_clean              category
cdp_verification_clean           category
cdp_num_credits_clean             float32
cdp_num_credits_riskadj_clean     float32
cdp_credits_cancelled_clean      category
cdp_purpose_clean                category
dtype: object

In [698]:
summary = pd.DataFrame(df.dtypes)
summary['number unique'] = df.nunique()
summary['isna'] = df.isna().sum()

In [699]:
summary

Unnamed: 0,0,number unique,isna
id,Int32,1661,0
year,Int32,13,0
cdp_orig_or_purchase_clean,category,2,0
cdp_proj_type_clean,category,22,711
cdp_verification_clean,category,20,253
cdp_num_credits_clean,float32,7988,669
cdp_num_credits_riskadj_clean,float32,6328,3208
cdp_credits_cancelled_clean,category,2,3417
cdp_purpose_clean,category,3,814


In [700]:
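# for each category column, add an explicit 'missing' category and use it for NaNs
for col in df.columns:
    if df[col].dtype.name == 'category':
        # assign back instead of calling fillna(inplace=True) on a column selection,
        # which newer pandas versions warn about
        df[col] = df[col].cat.add_categories('missing').fillna('missing')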

In [701]:
df

Unnamed: 0,id,year,cdp_orig_or_purchase_clean,cdp_proj_type_clean,cdp_verification_clean,cdp_num_credits_clean,cdp_num_credits_riskadj_clean,cdp_credits_cancelled_clean,cdp_purpose_clean
0,53,2015,credit purchase,biomass energy,vcs (verified carbon standard),41.0,41.0,yes,voluntary offsetting
1,53,2016,credit purchase,biomass energy,vcs (verified carbon standard),40.0,40.0,yes,voluntary offsetting
2,64,2014,credit purchase,other,not yet verified,3.0,0.0,yes,compliance
3,79,2021,credit purchase,forests,gold standard,366.0,,yes,voluntary offsetting
4,79,2021,credit purchase,wind,gold standard,18492.0,,yes,voluntary offsetting
...,...,...,...,...,...,...,...,...,...
15343,895218,2022,credit purchase,co2 usage,gold standard,2438.0,2438.0,missing,voluntary offsetting
15344,895218,2022,credit purchase,forests,"ccbs (developed by the climate, community and ...",4753.0,4753.0,missing,voluntary offsetting
15345,895218,2022,credit purchase,forests,gold standard,1638.0,1638.0,missing,voluntary offsetting
15346,895218,2022,credit purchase,other,other,185.0,185.0,missing,voluntary offsetting


In [702]:
# for each float32 column, create a 0/1 indicator for missing values, then fill the missing values with zero
for col in df.columns:
    if df[col].dtype.name == 'float32':
        df[col + '_missing'] = df[col].isna().astype(int)
        # assign back instead of calling fillna(inplace=True) on a column selection
        df[col] = df[col].fillna(0)

In [703]:
df.dtypes

id                                          Int32
year                                        Int32
cdp_orig_or_purchase_clean               category
cdp_proj_type_clean                      category
cdp_verification_clean                   category
cdp_num_credits_clean                     float32
cdp_num_credits_riskadj_clean             float32
cdp_credits_cancelled_clean              category
cdp_purpose_clean                        category
cdp_num_credits_clean_missing               int64
cdp_num_credits_riskadj_clean_missing       int64
dtype: object

In [704]:
df.shape

(15348, 11)

In [705]:
# remove unused categories for each category column
for col in df.columns:
    if df[col].dtype.name == 'category':
        df[col] = df[col].cat.remove_unused_categories()

In [706]:
# print the unique values for each column
for col in df.columns:
    print(col)
    print(df[col].unique())
    print('\n')

id
<IntegerArray>
[    53,     64,     79,     87,     97,    119,    135,    138,    143,
    154,
 ...
 893445, 893453, 893701, 893848, 893859, 894762, 895062, 895100, 895218,
 895519]
Length: 1661, dtype: Int32


year
<IntegerArray>
[2015, 2016, 2014, 2021, 2022, 2010, 2011, 2012, 2013, 2017, 2018, 2019, 2020]
Length: 13, dtype: Int32


cdp_orig_or_purchase_clean
['credit purchase', 'credit origination']
Categories (2, object): ['credit origination', 'credit purchase']


cdp_proj_type_clean
['biomass energy', 'other', 'forests', 'wind', 'methane avoidance', ..., 'fugitive', 'pfcs and sf6', 'hfcs', 'n2o', 'tidal']
Length: 23
Categories (23, object): ['agriculture', 'biomass energy', 'cement', 'co2 usage', ..., 'tidal', 'transport', 'wind', 'missing']


cdp_verification_clean
['vcs (verified carbon standard)', 'not yet verified', 'gold standard', 'missing', 'cdm', ..., 'plan vivo', 'ji', 'car (the climate action reserve)', 'ccbs - climate, community & biodiversity stan..., 'ver+ (tv s

In [707]:
# drop cdp_verification_clean and cdp_proj_type_clean
df.drop(columns=['cdp_verification_clean', 'cdp_proj_type_clean'], inplace=True)

## Data Transformation Summary

This notebook section prepares the data for predictive modeling, with a focus on training an LSTM model. The dataset comes from carbon credit records (purchases and originations) reported by firms across years. The prediction target is `real_ghg_change`, the actual change in greenhouse gas emissions.

### Transformations applied are as follows:

1. **Dummy Variables Creation**: For each categorical attribute, dummy variables are generated, one per category (no category is dropped). At row level each dummy marks the presence (1) or absence (0) of a category; after aggregation it becomes the number of credit entries in that category for each firm-year.

2. **Numerical Summary Statistics**: The sum and count are calculated for the number of carbon credits (`cdp_num_credits_clean`) and the sum for the risk-adjusted number (`cdp_num_credits_riskadj_clean`), together with the sum of each missingness indicator. These aggregates summarize credit activity per firm per year.

3. **Grouping and Aggregation**: The dataset is grouped by `id` and `year`, and the dummy and numerical attributes are aggregated accordingly. This step ensures a single row per firm-year, the structure required for LSTM and other time-series methods.

4. **Final Dataset Assembly**: The aggregated dummy counts are joined to the numerical summary statistics to form the final dataset. The index is then reset, so `id` and `year` are ordinary columns with one row per firm-year, which suits panel data analysis and subsequent modeling.

The resulting DataFrame is a feature matrix that combines the category counts with the aggregated numerical data, ready for use in models that predict changes in greenhouse gas emissions.

## Rationale for Retaining Missingness Indicators

For each credit-number column (`cdp_num_credits_clean` and `cdp_num_credits_riskadj_clean`), missing values are filled with zero and a 0/1 `_missing` indicator is created before aggregation. Summed per firm-year, the indicator gives the count of missing entries. Because the fill happens before grouping, `cdp_num_credits_clean_count` equals the total number of rows in the firm-year, so the number of reported entries is that count minus the corresponding `_missing_sum`. Keeping these columns preserves information on data completeness after aggregation: they distinguish a zero sum caused by unreported values from a genuinely reported zero, and they record the original number of observations per firm-year. This is valuable for predictive modeling, as it lets the model account for data completeness and for patterns in missingness, which may improve predictions.
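
How a per-row missingness flag aggregates to firm-year level can be sketched on toy data. The values are hypothetical, and the output names `credits_sum`, `n_rows`, `n_missing` and `n_reported` are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Toy firm-year with three credit rows, one of them unreported
toy = pd.DataFrame({
    "id": [1, 1, 1],
    "year": [2020, 2020, 2020],
    "cdp_num_credits_clean": [10.0, np.nan, 5.0],
})

# Flag missing values first, then fill them with zero
toy["cdp_num_credits_clean_missing"] = (
    toy["cdp_num_credits_clean"].isna().astype(int)
)
toy["cdp_num_credits_clean"] = toy["cdp_num_credits_clean"].fillna(0)

agg = toy.groupby(["id", "year"]).agg(
    credits_sum=("cdp_num_credits_clean", "sum"),
    # after the fill there are no NaNs left, so count equals the number of rows
    n_rows=("cdp_num_credits_clean", "count"),
    n_missing=("cdp_num_credits_clean_missing", "sum"),
).reset_index()

# Reported entries can be recovered from the two counts
agg["n_reported"] = agg["n_rows"] - agg["n_missing"]
print(agg)
```

The same arithmetic applies to the risk-adjusted column.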


In [708]:
df

Unnamed: 0,id,year,cdp_orig_or_purchase_clean,cdp_num_credits_clean,cdp_num_credits_riskadj_clean,cdp_credits_cancelled_clean,cdp_purpose_clean,cdp_num_credits_clean_missing,cdp_num_credits_riskadj_clean_missing
0,53,2015,credit purchase,41.0,41.0,yes,voluntary offsetting,0,0
1,53,2016,credit purchase,40.0,40.0,yes,voluntary offsetting,0,0
2,64,2014,credit purchase,3.0,0.0,yes,compliance,0,0
3,79,2021,credit purchase,366.0,0.0,yes,voluntary offsetting,0,1
4,79,2021,credit purchase,18492.0,0.0,yes,voluntary offsetting,0,1
...,...,...,...,...,...,...,...,...,...
15343,895218,2022,credit purchase,2438.0,2438.0,missing,voluntary offsetting,0,0
15344,895218,2022,credit purchase,4753.0,4753.0,missing,voluntary offsetting,0,0
15345,895218,2022,credit purchase,1638.0,1638.0,missing,voluntary offsetting,0,0
15346,895218,2022,credit purchase,185.0,185.0,missing,voluntary offsetting,0,0


In [709]:
# Create binary variables for each categorical column
categorical_columns = ['cdp_orig_or_purchase_clean', 'cdp_credits_cancelled_clean','cdp_purpose_clean']
df_dummies = pd.get_dummies(df[categorical_columns])

# Define the aggregations for numerical columns
aggregations = {
    'cdp_num_credits_clean': ['sum', 'count'],
    'cdp_num_credits_riskadj_clean': ['sum'],
    'cdp_num_credits_clean_missing': ['sum'],
    'cdp_num_credits_riskadj_clean_missing': ['sum'],
}

# Group by 'id' and 'year', then aggregate
df_grouped = df.groupby(['id', 'year']).agg(aggregations)

# Flatten the MultiIndex for columns created by groupby aggregation
df_grouped.columns = ['_'.join(col).strip() for col in df_grouped.columns.values]

# Group the binary variables by 'id' and 'year', and sum them to get the count of each category
df_dummies_grouped = df_dummies.groupby([df['id'], df['year']]).sum()

# Join the category counts to the grouped numerical data
# join aligns on the (id, year) index, so no prior sorting is needed
df_final = df_grouped.join(df_dummies_grouped, how='left').reset_index()


# Now df_final is ready for further analysis or modeling

In [710]:
categorical_columns

['cdp_orig_or_purchase_clean',
 'cdp_credits_cancelled_clean',
 'cdp_purpose_clean']

In [711]:
# check unique id, year pairs
df_final.groupby(['id', 'year']).size().shape

(5784,)

In [712]:
df_final.shape

(5784, 16)

In [713]:
df_final.isna().sum()

id                                               0
year                                             0
cdp_num_credits_clean_sum                        0
cdp_num_credits_clean_count                      0
cdp_num_credits_riskadj_clean_sum                0
cdp_num_credits_clean_missing_sum                0
cdp_num_credits_riskadj_clean_missing_sum        0
cdp_orig_or_purchase_clean_credit origination    0
cdp_orig_or_purchase_clean_credit purchase       0
cdp_credits_cancelled_clean_no                   0
cdp_credits_cancelled_clean_yes                  0
cdp_credits_cancelled_clean_missing              0
cdp_purpose_clean_compliance                     0
cdp_purpose_clean_other                          0
cdp_purpose_clean_voluntary offsetting           0
cdp_purpose_clean_missing                        0
dtype: int64

In [714]:
df_final.head()

Unnamed: 0,id,year,cdp_num_credits_clean_sum,cdp_num_credits_clean_count,cdp_num_credits_riskadj_clean_sum,cdp_num_credits_clean_missing_sum,cdp_num_credits_riskadj_clean_missing_sum,cdp_orig_or_purchase_clean_credit origination,cdp_orig_or_purchase_clean_credit purchase,cdp_credits_cancelled_clean_no,cdp_credits_cancelled_clean_yes,cdp_credits_cancelled_clean_missing,cdp_purpose_clean_compliance,cdp_purpose_clean_other,cdp_purpose_clean_voluntary offsetting,cdp_purpose_clean_missing
0,53,2015,41.0,1,41.0,0,0,0,1,0,1,0,0,0,1,0
1,53,2016,40.0,1,40.0,0,0,0,1,0,1,0,0,0,1,0
2,64,2014,3.0,1,0.0,0,0,0,1,0,1,0,1,0,0,0
3,79,2021,20140.0,3,0.0,0,2,0,3,0,3,0,0,0,3,0
4,79,2022,20438.0,5,20438.0,0,0,0,5,0,5,0,0,0,5,0


**I am taking the log1p of the numerical variables as they look skewed.**
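
A small sketch of why `log1p` is a convenient choice here: it computes log(1 + x), so the zeros produced by filling missing values stay at zero, and `expm1` undoes the transform. The sample values below are for illustration only.

```python
import numpy as np

# Sample credit sums, including a zero
credits = np.array([0.0, 41.0, 20140.0])

logged = np.log1p(credits)    # log(1 + x), defined at x = 0
restored = np.expm1(logged)   # inverse transform: exp(y) - 1

print(logged)
print(restored)
```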

In [715]:
# log1p all the cdp_num_credits_clean
df_final['cdp_num_credits_clean_sum'] = np.log1p(df_final['cdp_num_credits_clean_sum'])

# log1p all the cdp_num_credits_riskadj_clean
df_final['cdp_num_credits_riskadj_clean_sum'] = np.log1p(df_final['cdp_num_credits_riskadj_clean_sum'])

In [716]:
df_final.describe()

Unnamed: 0,id,year,cdp_num_credits_clean_sum,cdp_num_credits_clean_count,cdp_num_credits_riskadj_clean_sum,cdp_num_credits_clean_missing_sum,cdp_num_credits_riskadj_clean_missing_sum,cdp_orig_or_purchase_clean_credit origination,cdp_orig_or_purchase_clean_credit purchase,cdp_credits_cancelled_clean_no,cdp_credits_cancelled_clean_yes,cdp_credits_cancelled_clean_missing,cdp_purpose_clean_compliance,cdp_purpose_clean_other,cdp_purpose_clean_voluntary offsetting,cdp_purpose_clean_missing
count,5784.0,5784.0,5784.0,5784.0,5784.0,5784.0,5784.0,5784.0,5784.0,5784.0,5784.0,5784.0,5784.0,5784.0,5784.0,5784.0
mean,45941.104772,2017.308956,8.906804,2.653527,6.695766,0.115664,0.554633,0.692254,1.961272,0.689834,1.372925,0.590768,0.602006,0.142808,1.767981,0.140733
std,154398.856698,3.777004,3.674026,4.433919,4.982852,0.881695,2.592329,2.588707,3.559463,2.20345,3.157095,2.706342,3.834322,0.778631,2.534173,0.718983
min,53.0,2010.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5634.0,2014.0,6.980308,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
50%,14526.0,2018.0,9.403383,1.0,7.881182,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
75%,21537.0,2021.0,11.456654,3.0,10.747437,0.0,0.0,1.0,2.0,1.0,1.0,1.0,0.0,0.0,2.0,0.0
max,895519.0,2022.0,18.290958,98.0,18.290958,28.0,72.0,97.0,57.0,54.0,68.0,96.0,98.0,20.0,42.0,11.0


In [717]:
# save the df final to csv in the processed data folder
df_final.to_csv('../../data/processed/cdp_carbon_credits_full_processed.csv', index=False)

In [718]:
df_final.shape

(5784, 16)

In [719]:
# jetblue sanity check - seems to be fine
df_final.loc[df_final['id'] == 9759]

Unnamed: 0,id,year,cdp_num_credits_clean_sum,cdp_num_credits_clean_count,cdp_num_credits_riskadj_clean_sum,cdp_num_credits_clean_missing_sum,cdp_num_credits_riskadj_clean_missing_sum,cdp_orig_or_purchase_clean_credit origination,cdp_orig_or_purchase_clean_credit purchase,cdp_credits_cancelled_clean_no,cdp_credits_cancelled_clean_yes,cdp_credits_cancelled_clean_missing,cdp_purpose_clean_compliance,cdp_purpose_clean_other,cdp_purpose_clean_voluntary offsetting,cdp_purpose_clean_missing
2114,9759,2018,4.194341,1,4.194341,0,0,0,1,0,0,1,0,0,1,0
2115,9759,2019,4.194341,1,4.194341,0,0,0,1,0,1,0,0,0,1,0
2116,9759,2020,12.524282,2,0.0,0,2,0,2,0,2,0,0,0,2,0
2117,9759,2021,14.13612,1,0.0,0,1,0,1,0,0,1,0,0,1,0
2118,9759,2022,0.0,1,0.0,1,1,0,1,0,1,0,0,0,1,0


# Test Join with Id Year Isin Mapping

In [720]:
# read the mapping 
mapping = pd.read_csv("../../data/processed/id_year_isin_mapping.csv", index_col=['id', 'year'])

In [721]:
mapping.shape

(24302, 1)

In [722]:
df_final.set_index(['id','year'], inplace=True)

In [723]:
df_mapped = mapping.join(df_final, on=['id', 'year']).reset_index()

In [724]:
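# flag rows with no carbon-credit record; df_final has no NaNs of its own,
# so a NaN in one of its columns can only come from the join
df_mapped['absent_cdp_carbon_credits_full'] = df_mapped['cdp_num_credits_clean_count'].isna().astype(int)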

In [725]:
df_mapped.replace(np.nan, 0, inplace=True)

In [726]:
df_mapped.sample(10)

Unnamed: 0,id,year,isin,cdp_num_credits_clean_sum,cdp_num_credits_clean_count,cdp_num_credits_riskadj_clean_sum,cdp_num_credits_clean_missing_sum,cdp_num_credits_riskadj_clean_missing_sum,cdp_orig_or_purchase_clean_credit origination,cdp_orig_or_purchase_clean_credit purchase,cdp_credits_cancelled_clean_no,cdp_credits_cancelled_clean_yes,cdp_credits_cancelled_clean_missing,cdp_purpose_clean_compliance,cdp_purpose_clean_other,cdp_purpose_clean_voluntary offsetting,cdp_purpose_clean_missing,absent_cdp_carbon_credits_full
1079,820,2012.0,GB0000456144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
11728,14360,2010.0,US7010941042,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4971,5169,2016.0,US2774321002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
653,569,2010.0,FI0009013114,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
17142,21122,2011.0,TRECOLA00011,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
22560,53669,2022.0,AU000000SCG8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1384,1097,2012.0,GB0000608009,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
22893,58619,2017.0,ES0105066007,7.791936,1.0,7.791936,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0
22182,50099,2019.0,AU000000VRL0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
8276,9843,2019.0,US48020Q1076,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


In [727]:
df_mapped.shape

(24302, 18)