Tasks in this notebook:  
* Importing datasets for data cleaning.  
* Saving the clean datasets to work with them later.

In [31]:
# Importing libraries
import pandas as pd
import numpy as np

In [32]:
#Importing datasets
data_co2 = pd.read_csv('/Users/anna/data/climate-change/datasets/carbondioxide.csv')
data_temp = pd.read_csv('/Users/anna/data/climate-change/datasets/GlobalTemperatures.csv')
data_sea = pd.read_csv('/Users/anna/data/climate-change/datasets/seaice.csv')

## Carbon dioxide

First of all, we have a look at the data to see the information we have and its main characteristics.

In [33]:
data_co2.head()

Unnamed: 0,Year,Month,Decimal Date,Carbon Dioxide (ppm),Seasonally Adjusted CO2 (ppm),Carbon Dioxide Fit (ppm),Seasonally Adjusted CO2 Fit (ppm)
0,1958,1,1958.0411,,,,
1,1958,2,1958.126,,,,
2,1958,3,1958.2027,315.69,314.42,316.18,314.89
3,1958,4,1958.2877,317.45,315.15,317.3,314.98
4,1958,5,1958.3699,317.5,314.73,317.83,315.06


In [34]:
data_co2.dtypes

Year                                   int64
Month                                  int64
Decimal Date                         float64
Carbon Dioxide (ppm)                 float64
Seasonally Adjusted CO2 (ppm)        float64
Carbon Dioxide Fit (ppm)             float64
Seasonally Adjusted CO2 Fit (ppm)    float64
dtype: object

In [35]:
data_co2.isna().sum()

Year                                  0
Month                                 0
Decimal Date                          0
Carbon Dioxide (ppm)                 17
Seasonally Adjusted CO2 (ppm)        17
Carbon Dioxide Fit (ppm)             13
Seasonally Adjusted CO2 Fit (ppm)    13
dtype: int64

In [36]:
data_co2.shape

(720, 7)

The column types all look correct.  
There are a few null values; since there are fewer than 20 out of 720 rows, we will simply drop them.  
Of the four CO2 columns, we will use only the last one, 'Seasonally Adjusted CO2 Fit (ppm)'.
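
Before dropping rows, it can help to express the missing values as a share of the dataset rather than as a raw count. The following is a minimal sketch, assuming a small hand-built frame that copies the first five rows shown above; the names `sample`, `nan_share` and `cleaned` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Illustrative sample: the first five rows of data_co2 shown above,
# restricted to the columns kept for the analysis.
sample = pd.DataFrame({
    'Year': [1958, 1958, 1958, 1958, 1958],
    'Month': [1, 2, 3, 4, 5],
    'Seasonally Adjusted CO2 Fit (ppm)': [np.nan, np.nan, 314.89, 314.98, 315.06],
})

# Fraction of missing values per column:
# isna() yields booleans and mean() averages them.
nan_share = sample.isna().mean()
print(nan_share)

# Rows that survive dropna() on this sample.
cleaned = sample.dropna().reset_index(drop=True)
print(len(sample), '->', len(cleaned))
```

Applied to the full `data_co2`, the same expression would report 13 missing values out of 720 rows for the selected column, which is about 1.8%.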

In [37]:
# Renaming columns and selecting the ones we want
data_co2.rename(columns={'Seasonally Adjusted CO2 Fit (ppm)':'CO2'}, inplace=True)
co2 = data_co2[['Year', 'Month', 'CO2']]
co2.head()

Unnamed: 0,Year,Month,CO2
0,1958,1,
1,1958,2,
2,1958,3,314.89
3,1958,4,314.98
4,1958,5,315.06


In [38]:
#Dropping NaNs
co2 = co2.dropna().reset_index(drop=True)
co2.head()

Unnamed: 0,Year,Month,CO2
0,1958,3,314.89
1,1958,4,314.98
2,1958,5,315.06
3,1958,6,315.14
4,1958,7,315.21


In [39]:
#Saving clean dataset
co2.to_csv('/Users/anna/data/climate-change/datasets/clean_co2.csv', index=False)

## Temperature

We repeat the same checks for this dataset: a first look at the data, the column types, and the NaN counts.
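
Since the same checks are run on every dataset, they could be gathered in one small helper. This is only a sketch: the `summarize` function is hypothetical and not part of the notebook, and the sample frame copies a few values from the first rows of `data_temp` shown below.

```python
import numpy as np
import pandas as pd

def summarize(df):
    """Return one row per column with its dtype and number of NaNs."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_missing': df.isna().sum(),
    })

# Illustrative sample: three columns from the first rows of data_temp.
sample = pd.DataFrame({
    'dt': ['1750-01-01', '1750-02-01', '1750-03-01'],
    'LandAverageTemperature': [3.034, 3.083, 5.626],
    'LandAndOceanAverageTemperature': [np.nan, np.nan, np.nan],
})

report = summarize(sample)
print(sample.shape)
print(report)
```

The helper only reads the frame, so it can be called before and after cleaning to compare the two states.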

In [40]:
data_temp.head()

Unnamed: 0,dt,LandAverageTemperature,LandAverageTemperatureUncertainty,LandMaxTemperature,LandMaxTemperatureUncertainty,LandMinTemperature,LandMinTemperatureUncertainty,LandAndOceanAverageTemperature,LandAndOceanAverageTemperatureUncertainty
0,1750-01-01,3.034,3.574,,,,,,
1,1750-02-01,3.083,3.702,,,,,,
2,1750-03-01,5.626,3.076,,,,,,
3,1750-04-01,8.49,2.451,,,,,,
4,1750-05-01,11.573,2.072,,,,,,


In [41]:
data_temp.shape

(3192, 9)

In [42]:
# Selecting columns we need and renaming
# Note: the renamed 'Year' column holds full dates (YYYY-MM-DD), not only the year
temp = data_temp.rename(columns = {'dt':'Year', 'LandAndOceanAverageTemperature':'AvgTemp'})
temp = temp[['Year', 'AvgTemp']]
temp.head()

Unnamed: 0,Year,AvgTemp
0,1750-01-01,
1,1750-02-01,
2,1750-03-01,
3,1750-04-01,
4,1750-05-01,


In [43]:
#Dropping NaNs
temp = temp.dropna().reset_index(drop=True)
temp.head()

Unnamed: 0,Year,AvgTemp
0,1850-01-01,12.833
1,1850-02-01,13.588
2,1850-03-01,14.043
3,1850-04-01,14.667
4,1850-05-01,15.507


In [44]:
temp.isna().sum()

Year       0
AvgTemp    0
dtype: int64

In [45]:
# Checking the Year dtype before converting it
temp.dtypes

Year        object
AvgTemp    float64
dtype: object

In [46]:
# Converting Year from string to datetime
temp['Year'] = pd.to_datetime(temp['Year'])
temp.dtypes

Year       datetime64[ns]
AvgTemp           float64
dtype: object
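
Once the column is a datetime, its components are available through the `.dt` accessor. The sketch below assumes that a later step might want separate integer year and month columns, as in the CO2 and sea ice data; that layout is an assumption, not something this notebook does, and `sample`, `dates` and `by_month` are illustrative names.

```python
import pandas as pd

# Illustrative sample: first rows of the cleaned temperature data shown above.
sample = pd.DataFrame({
    'Year': ['1850-01-01', '1850-02-01', '1850-03-01'],
    'AvgTemp': [12.833, 13.588, 14.043],
})

# An explicit format makes the parser fail loudly on unexpected strings.
dates = pd.to_datetime(sample['Year'], format='%Y-%m-%d')

# Hypothetical layout with separate integer Year and Month columns.
by_month = pd.DataFrame({
    'Year': dates.dt.year,
    'Month': dates.dt.month,
    'AvgTemp': sample['AvgTemp'],
})
print(by_month)
```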

In [47]:
temp.head()

Unnamed: 0,Year,AvgTemp
0,1850-01-01,12.833
1,1850-02-01,13.588
2,1850-03-01,14.043
3,1850-04-01,14.667
4,1850-05-01,15.507


In [48]:
temp.dtypes

Year       datetime64[ns]
AvgTemp           float64
dtype: object

In [49]:
#Saving clean dataset
temp.to_csv('/Users/anna/data/climate-change/datasets/clean_temp.csv', index=False)
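
One thing to keep in mind when working with the saved file later: CSV stores dates as plain text, so the datetime dtype does not survive the round trip unless the reader asks for it. A minimal sketch, using an in-memory buffer in place of the real file path and a two-row illustrative sample:

```python
import io
import pandas as pd

# Illustrative sample with an already converted datetime column.
sample = pd.DataFrame({
    'Year': pd.to_datetime(['1850-01-01', '1850-02-01']),
    'AvgTemp': [12.833, 13.588],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Reading back without parse_dates leaves the dates as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates asks read_csv to convert the named column to datetime.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['Year'])

print(plain.dtypes)
print(parsed.dtypes)
```

When loading `clean_temp.csv` in another notebook, passing `parse_dates=['Year']` to `pd.read_csv` should avoid repeating the conversion step.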

## Sea ice

We repeat the same checks for the sea ice data: a first look at the data, the column types, and the NaN counts.

In [50]:
data_sea.head()

Unnamed: 0,Year,Month,Day,Extent,Missing,Source Data,hemisphere
0,1978,10,26,10.231,0.0,['ftp://sidads.colorado.edu/pub/DATASETS/nsidc...,north
1,1978,10,28,10.42,0.0,['ftp://sidads.colorado.edu/pub/DATASETS/nsidc...,north
2,1978,10,30,10.557,0.0,['ftp://sidads.colorado.edu/pub/DATASETS/nsidc...,north
3,1978,11,1,10.67,0.0,['ftp://sidads.colorado.edu/pub/DATASETS/nsidc...,north
4,1978,11,3,10.777,0.0,['ftp://sidads.colorado.edu/pub/DATASETS/nsidc...,north


In [51]:
data_sea.shape

(24908, 7)

In [52]:
# Dropping columns we don't want
data_sea.drop(['Source Data', 'Day', 'Missing'], axis=1, inplace=True) 

In [53]:
data_sea.dtypes

Year            int64
Month           int64
Extent        float64
hemisphere     object
dtype: object

Types are correct.

In [54]:
data_sea.isna().sum()

Year          0
Month         0
Extent        0
hemisphere    0
dtype: int64

In [55]:
data_sea.head()

Unnamed: 0,Year,Month,Extent,hemisphere
0,1978,10,10.231,north
1,1978,10,10.42,north
2,1978,10,10.557,north
3,1978,11,10.67,north
4,1978,11,10.777,north


In [56]:
#Saving clean dataset
data_sea.to_csv('/Users/anna/data/climate-change/datasets/clean_ice.csv', index=False)