# Exploratory Analysis - Climate Change dataset

The main focus of this notebook is the exploratory analysis and cleaning of the dataset.
Here is the <a href='https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data/data'>dataset</a>.

1. Cleaning the dataset
2. Feature Engineering 
3. Visualization 

First, let's choose the dataset that we want to work on: 
- GlobalLandTemperaturesByCountry.csv
- GlobalTemperatures.csv

## Cleaning the dataset 
Let's import some libraries first.

In [2]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

### Explore the dataset 

Before importing the dataset into Python, I opened the CSV in Excel, checked it for anomalies in accented characters, and corrected anything that had been imported incorrectly.
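
A similar check for accent problems can be done in pandas instead of Excel. This is a minimal sketch, not the procedure used above: the sample names (including the deliberately garbled `CuraÃ§ao`) are made up to stand in for the `Country` column, and searching for `Ã` is only a rough heuristic for UTF-8 text that was decoded with the wrong encoding.

```python
import pandas as pd

# Hypothetical sample standing in for the 'Country' column of the CSV.
countries = pd.Series(["Åland", "Côte D'Ivoire", "Denmark", "CuraÃ§ao"])

# Names containing any non-ASCII character, listed for manual review.
non_ascii = countries[~countries.map(str.isascii)]
print(non_ascii.tolist())

# Rough heuristic: 'Ã' often appears when UTF-8 bytes are read as Latin-1.
suspect = countries[countries.str.contains("Ã", regex=False)]
print(suspect.tolist())
```

Names that land in `suspect` are the ones worth fixing by hand or by re-reading the file with a different `encoding` argument to `pd.read_csv`.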

In [6]:
DataByCountry = pd.read_csv('./Data/GlobalLandTemperaturesByCountry.csv')
DataByCountry.head(10)

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,Country
0,1743-11-01,4.384,2.294,Aland
1,1743-12-01,,,Aland
2,1744-01-01,,,Aland
3,1744-02-01,,,Aland
4,1744-03-01,,,Aland
5,1744-04-01,1.53,4.68,Aland
6,1744-05-01,6.702,1.789,Aland
7,1744-06-01,11.609,1.577,Aland
8,1744-07-01,15.342,1.41,Aland
9,1744-08-01,,,Aland


In [7]:
DataByCountry.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 577462 entries, 0 to 577461
Data columns (total 4 columns):
dt                               577462 non-null object
AverageTemperature               544811 non-null float64
AverageTemperatureUncertainty    545550 non-null float64
Country                          577462 non-null object
dtypes: float64(2), object(2)
memory usage: 17.6+ MB


### Convert the first column 'dt' to a datetime type

In [16]:
DataByCountry['dt'] = DataByCountry['dt'].astype('datetime64[ns]')

In [17]:
DataByCountry.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 577462 entries, 0 to 577461
Data columns (total 4 columns):
dt                               577462 non-null datetime64[ns]
AverageTemperature               544811 non-null float64
AverageTemperatureUncertainty    545550 non-null float64
Country                          577462 non-null object
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 17.6+ MB
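
The same conversion can also be written with `pd.to_datetime`, which allows an explicit format and turns unparseable strings into `NaT` instead of raising an error. This is a sketch on a small made-up frame (the `"not a date"` row is an assumption added to show the behaviour), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical sample with one malformed date string.
sample = pd.DataFrame({"dt": ["1743-11-01", "1743-12-01", "not a date"]})

# errors="coerce" converts values that do not match the format into NaT.
sample["dt"] = pd.to_datetime(sample["dt"], format="%Y-%m-%d", errors="coerce")

n_bad = int(sample["dt"].isna().sum())
print(sample["dt"].dtype, "with", n_bad, "unparseable value(s)")
```

Counting the `NaT` values afterwards is a quick way to see whether any dates in the file were malformed.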


### Add a column for Year and Month

In [22]:
DataByCountry['Year']= pd.DatetimeIndex(DataByCountry['dt']).year
DataByCountry['Month'] = pd.DatetimeIndex(DataByCountry['dt']).month

In [23]:
DataByCountry.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,Country,Year,Month
0,1743-11-01,4.384,2.294,Aland,1743,11
1,1743-12-01,,,Aland,1743,12
2,1744-01-01,,,Aland,1744,1
3,1744-02-01,,,Aland,1744,2
4,1744-03-01,,,Aland,1744,3
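
As an alternative to wrapping the column in `pd.DatetimeIndex`, the `.dt` accessor on a datetime column gives the same year and month values. A minimal sketch on made-up rows:

```python
import pandas as pd

# Hypothetical sample standing in for DataByCountry after the datetime conversion.
sample = pd.DataFrame({"dt": pd.to_datetime(["1743-11-01", "1744-01-01"])})

# The .dt accessor exposes datetime components of each row.
sample["Year"] = sample["dt"].dt.year
sample["Month"] = sample["dt"].dt.month
print(sample)
```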


### Check for missing values 

In [26]:
DataByCountry['AverageTemperature'].isnull().values.sum()

32651

In [27]:
DataByCountry['AverageTemperatureUncertainty'].isnull().values.sum()

31912

In [30]:
DataByCountry['Country'].isnull().values.sum()
DataByCountry['dt'].isnull().values.sum()

0
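
The per-column checks above can also be done in one call, since `DataFrame.isnull().sum()` returns the number of missing values for every column at once. A small sketch with made-up rows in the same layout as the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same columns as DataByCountry.
sample = pd.DataFrame({
    "dt": pd.to_datetime(["1743-11-01", "1743-12-01", "1744-01-01"]),
    "AverageTemperature": [4.384, np.nan, np.nan],
    "AverageTemperatureUncertainty": [2.294, np.nan, 1.5],
    "Country": ["Aland", "Aland", "Aland"],
})

# One missing-value count per column.
missing = sample.isnull().sum()
print(missing)
```

This also avoids the situation in the last cell, where a notebook only displays the result of the final expression.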

In [31]:
# Keep only the rows that have a finite value for AverageTemperature
# (the filter does not check AverageTemperatureUncertainty explicitly)
DataByCountry_clean = DataByCountry[np.isfinite(DataByCountry['AverageTemperature'])]

In [33]:
DataByCountry_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 544811 entries, 0 to 577460
Data columns (total 6 columns):
dt                               544811 non-null datetime64[ns]
AverageTemperature               544811 non-null float64
AverageTemperatureUncertainty    544811 non-null float64
Country                          544811 non-null object
Year                             544811 non-null int64
Month                            544811 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(2), object(1)
memory usage: 29.1+ MB


Conclusion: I decided to keep only the rows without NaN values. If the AverageTemperature is missing, there is no point in keeping the row.
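
The same filtering can be expressed with `dropna`, which makes the intent explicit. This is a sketch on made-up rows, assuming only `AverageTemperature` decides whether a row is kept; `reset_index(drop=True)` is an optional extra that renumbers the rows after the drop.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one row missing its temperature.
sample = pd.DataFrame({
    "AverageTemperature": [4.384, np.nan, 1.53],
    "AverageTemperatureUncertainty": [2.294, np.nan, 4.68],
    "Country": ["Aland", "Aland", "Aland"],
})

# Drop rows where AverageTemperature is NaN, then renumber the index.
clean = sample.dropna(subset=["AverageTemperature"]).reset_index(drop=True)
print(len(sample) - len(clean), "row(s) dropped")
```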

### Check how many years we have available 

In [42]:
years = DataByCountry_clean['Year'].unique()

In [47]:
# Number of rows available for each year
freq_years = DataByCountry_clean['Year'].value_counts()

In [55]:
pd.set_option('display.max_rows', 500)
print(freq_years)

1970    2904
1965    2904
1978    2904
1977    2904
1976    2904
1975    2904
1974    2904
1973    2904
1972    2904
1971    2904
2012    2904
1969    2904
1968    2904
1967    2904
1966    2904
1964    2904
1980    2904
1963    2904
1962    2904
1961    2904
1960    2904
1959    2904
1958    2904
1957    2904
1956    2904
1955    2904
1954    2904
1953    2904
1952    2904
1951    2904
1979    2904
1981    2904
1997    2904
2011    2904
2009    2904
2008    2904
2007    2904
2006    2904
2005    2904
2004    2904
2003    2904
2002    2904
2001    2904
2000    2904
1999    2904
1998    2904
1982    2904
1996    2904
1995    2904
1994    2904
1993    2904
1992    2904
1991    2904
1990    2904
1989    2904
1988    2904
1987    2904
1986    2904
1985    2904
1984    2904
1983    2904
2010    2904
1950    2902
1949    2900
1948    2900
1927    2880
1915    2880
1925    2880
1924    2880
1923    2880
1922    2880
1919    2880
1918    2880
1917    2880
1916    2880
1914    2880
1903    2880
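
Rather than reading the long printout by eye, a threshold can be applied to the per-year counts directly. This is only a sketch: the counts below are invented and the cut-off of 2000 rows per year is an assumed value for illustration.

```python
import pandas as pd

# Hypothetical per-year row counts standing in for freq_years.
freq_years = pd.Series({1743: 40, 1849: 1800, 1850: 2100,
                        1851: 2150, 1852: 1900, 1853: 2200})

threshold = 2000  # assumed cut-off for a "complete enough" year

# Sort chronologically, then keep the years above the threshold.
by_year = freq_years.sort_index()
well_covered = by_year[by_year > threshold]

first_year = int(well_covered.index.min())
print("First year above the threshold:", first_year)
```

Sorting with `sort_index()` also makes the printed counts easier to scan, since `value_counts()` orders them by frequency rather than by year.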

From 1850 onward the data seem to be more complete (more than 2000 data points per year), so let's start the analysis from the 1850'...