# **Data Preprocessing**

## Load the Libraries

First, we load the libraries needed for data cleaning.

In [2]:
import pandas as pd

## Load the Data

In [3]:
df = pd.read_csv('climate_disasters.csv')
df.head()

Unnamed: 0,ObjectId,Country,ISO2,ISO3,Indicator,Unit,Source,CTS Code,CTS Name,CTS Full Descriptor,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,1,"Afghanistan, Islamic Rep. of",AF,AFG,"Climate related disasters frequency, Number of...",Number of,"The Emergency Events Database (EM-DAT) , Centr...",ECCD,Climate Related Disasters Frequency,"Environment, Climate Change, Climate Indicator...",...,,,,,,1.0,,,1.0,
1,2,"Afghanistan, Islamic Rep. of",AF,AFG,"Climate related disasters frequency, Number of...",Number of,"The Emergency Events Database (EM-DAT) , Centr...",ECCD,Climate Related Disasters Frequency,"Environment, Climate Change, Climate Indicator...",...,,,,,,,,,,
2,3,"Afghanistan, Islamic Rep. of",AF,AFG,"Climate related disasters frequency, Number of...",Number of,"The Emergency Events Database (EM-DAT) , Centr...",ECCD,Climate Related Disasters Frequency,"Environment, Climate Change, Climate Indicator...",...,4.0,2.0,1.0,4.0,1.0,3.0,6.0,5.0,2.0,5.0
3,4,"Afghanistan, Islamic Rep. of",AF,AFG,"Climate related disasters frequency, Number of...",Number of,"The Emergency Events Database (EM-DAT) , Centr...",ECCD,Climate Related Disasters Frequency,"Environment, Climate Change, Climate Indicator...",...,1.0,,4.0,,2.0,1.0,1.0,1.0,1.0,1.0
4,5,"Afghanistan, Islamic Rep. of",AF,AFG,"Climate related disasters frequency, Number of...",Number of,"The Emergency Events Database (EM-DAT) , Centr...",ECCD,Climate Related Disasters Frequency,"Environment, Climate Change, Climate Indicator...",...,,1.0,,,2.0,,,1.0,,


## Data Cleaning
As the preview shows, there are some columns that we do not need, so we drop them.

In [4]:
df = df.drop(['ISO2','ISO3','Unit','Source','CTS Code','CTS Name','CTS Full Descriptor'], axis = 1)


In [5]:
print(df.columns)

Index(['ObjectId', 'Country', 'Indicator', '1980', '1981', '1982', '1983',
       '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992',
       '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001',
       '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
       '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019',
       '2020', '2021', '2022'],
      dtype='object')
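A hard-coded drop list fails with a `KeyError` if one of the named columns is absent, for example in a later export of the file. The sketch below shows one way to make that step more tolerant. It uses a small invented frame (`toy`) and a shortened `unwanted` list as stand-ins, so it is an illustration rather than the notebook's actual code.

```python
import pandas as pd

# Toy frame standing in for the real data (values invented for illustration)
toy = pd.DataFrame({
    "Country": ["A"],
    "ISO2": ["AA"],
    "Unit": ["Number of"],
    "1980": [1.0],
})

# Hypothetical, shortened drop list; "ISO3" is deliberately absent from toy
unwanted = ["ISO2", "ISO3", "Unit"]

# Report which requested columns are not present before dropping
missing = [c for c in unwanted if c not in toy.columns]
print("Not found:", missing)

# errors="ignore" skips labels that do not exist instead of raising KeyError
trimmed = toy.drop(columns=unwanted, errors="ignore")
print(list(trimmed.columns))
```

Printing the missing labels first keeps the tolerance from hiding a genuine typo in the drop list.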


Then, we delete the rows whose Indicator column contains TOTAL, because we don't need those rows.

In [6]:
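# Flag rows whose indicator mentions TOTAL (case-insensitive; missing values count as no match)
rows_to_delete = df['Indicator'].str.contains('TOTAL', case=False, na=False)
count = rows_to_delete.sum()
df = df.drop(df[rows_to_delete].index)

print("Number of rows to delete:", count)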

Number of rows to delete: 215
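The same filtering can be written as a boolean-mask selection instead of a drop by index. The following is a minimal sketch on an invented frame (`toy`), assuming the indicator text may occasionally be missing; it is not the notebook's own cell.

```python
import pandas as pd

# Toy frame standing in for the real data (values invented for illustration)
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Indicator": [
        "Number of Disasters: Flood",
        "Number of Disasters: TOTAL",
        None,
        "Number of Disasters: Storm",
    ],
})

# na=False makes a missing indicator count as "no match" rather than NaN,
# so the mask stays purely boolean
is_total = toy["Indicator"].str.contains("TOTAL", case=False, na=False)

# Keep every row where the mask is False
filtered = toy[~is_total]

print("Rows removed:", int(is_total.sum()))
print("Rows kept:", len(filtered))
```

Selecting with `~is_total` avoids the intermediate index lookup and does not depend on the index labels being unique.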


Next, we check the dataset for duplicate rows.

In [8]:
duplicates = df.duplicated()
duplicate_count = duplicates.sum()
print(f"Number of duplicate rows: {duplicate_count}")

Number of duplicate rows: 0
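`df.duplicated()` compares entire rows, so a column that is unique per row (ObjectId looks sequential in the preview, although that is an assumption) would keep any repeated country and indicator pair from being flagged. The sketch below, on an invented frame (`toy`), contrasts the full-row check with a check restricted to a subset of key columns.

```python
import pandas as pd

# Toy frame: rows 0 and 1 describe the same country/indicator pair
# but carry different ObjectId values (all values invented)
toy = pd.DataFrame({
    "ObjectId": [1, 2, 3],
    "Country": ["A", "A", "B"],
    "Indicator": ["Flood", "Flood", "Flood"],
    "1980": [1.0, 1.0, 2.0],
})

# Full-row comparison: ObjectId differs, so nothing is flagged
full_row_dupes = int(toy.duplicated().sum())

# Key-based comparison: only Country and Indicator are compared
key_dupes = int(toy.duplicated(subset=["Country", "Indicator"]).sum())

print("Full-row duplicates:", full_row_dupes)
print("Country/Indicator duplicates:", key_dupes)
```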


After that, we check if there are any missing values in the dataset.

In [None]:
print(df.isnull().sum())
 

The year columns, 1980 to 2022, contain missing values. We decided to fill these missing values with 0.

In [None]:
df.iloc[:, 3:] = df.iloc[:, 3:].fillna(0)

After that, there are no missing values.

In [None]:
print(df.isnull().sum())
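Selecting the year columns by position (`iloc[:, 3:]`) depends on the column order staying fixed. As a hedged alternative, the sketch below picks them by name on an invented frame (`toy`), assuming every year column has an all-digit label, and then confirms that nothing is left missing.

```python
import pandas as pd

# Toy frame standing in for the real data (values invented for illustration)
toy = pd.DataFrame({
    "ObjectId": [1, 2],
    "Country": ["A", "B"],
    "Indicator": ["Flood", "Storm"],
    "1980": [1.0, None],
    "1981": [None, 3.0],
})

# Assumption: year columns are exactly the ones whose labels are all digits
year_cols = [c for c in toy.columns if c.isdigit()]

# Fill only those columns, leaving the text columns untouched
toy[year_cols] = toy[year_cols].fillna(0)

# Total number of missing cells across the whole frame
remaining = int(toy.isnull().sum().sum())
print("Year columns:", year_cols)
print("Missing values left:", remaining)
```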

Next, we rename the Indicator column to Disaster Type and shorten its values to the name of the disaster type.

In [None]:
# Rename the column, then map each long indicator string to its short label
df = df.rename(columns={'Indicator': 'Disaster Type'})
prefix = 'Climate related disasters frequency, Number of Disasters: '
disaster_types = ['Drought', 'Extreme temperature', 'Flood', 'Landslide', 'Storm', 'Wildfire']
# Assigning the result back avoids chained in-place replacement on a column
df['Disaster Type'] = df['Disaster Type'].replace({prefix + t: t for t in disaster_types})
df
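Because every label shares the same leading text, the relabelling can also be sketched as stripping that prefix in one call. The example below works on an invented frame (`toy`) and lists the resulting labels, which makes any unexpected category easy to spot. Unlike an exact-match replacement, it shortens every value that contains the prefix.

```python
import pandas as pd

# Shared leading text of the indicator labels, as used in the notebook
prefix = "Climate related disasters frequency, Number of Disasters: "

# Toy frame standing in for the real data (rows invented for illustration)
toy = pd.DataFrame({
    "Disaster Type": [prefix + "Flood", prefix + "Storm", prefix + "Flood"],
})

# regex=False treats the prefix as literal text, not as a pattern
toy["Disaster Type"] = toy["Disaster Type"].str.replace(prefix, "", regex=False)

# Sorted unique labels, useful as a quick sanity check
labels = sorted(toy["Disaster Type"].unique())
print(labels)
```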

When the data cleaning is done, we save the dataset as cleaned_disasters.csv.

In [None]:
df.to_csv('cleaned_disasters.csv', index=False)
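A quick way to gain confidence in the saved file is to read it back and compare it with the frame in memory. The sketch below does this with an invented frame (`toy`) and a temporary file named `cleaned_toy.csv`, both hypothetical, so it does not touch cleaned_disasters.csv.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned data (values invented for illustration)
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Disaster Type": ["Flood", "Storm"],
    "1980": [1.0, 0.0],
})

# Write to a temporary directory and read the file straight back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_toy.csv")
    toy.to_csv(path, index=False)  # index=False keeps the row index out of the file
    reloaded = pd.read_csv(path)

# Compare structure and check that no missing values appeared
same_columns = list(reloaded.columns) == list(toy.columns)
same_shape = reloaded.shape == toy.shape
no_missing = int(reloaded.isnull().sum().sum()) == 0
print(same_columns, same_shape, no_missing)
```

With `index=False`, the reloaded frame has no extra unnamed index column, which is why the column lists match.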