# An Analysis of Climate Change and Global Warming Since 1750

## 1. Background
<br>Some call climate change the biggest threat of our age, while others dismiss it as a myth based on dodgy science. Global climate change has already had observable effects on the environment: glaciers have shrunk, ice on rivers and lakes is breaking up earlier, plant and animal ranges have shifted, and trees are flowering sooner. Effects that scientists predicted in the past are now occurring: loss of sea ice, accelerated sea level rise, and longer, more intense heat waves.<br />
<br>Global climate is projected to continue to change over this century and beyond, and temperatures are expected to keep changing with it.<br />
<br>An analysis of global climate and temperature data should give us a good understanding of how the climate is changing and how serious the effects of global warming are. It may also reveal hidden patterns, trends, or relationships between temperature and climate change.<br />

## 2. Potential clients
<br>Two types of potential clients could be interested in the findings of this project. The first is the government, which has to be concerned about the influence of climate change. Environmental change induced by human activities affects the climate greatly, so the government has to take the necessary actions to reduce this effect before it is too late. For example, it could limit a factory's daily waste gas emissions to reduce the damage to the atmosphere. The second type is environmental and climate protection organizations, which may use the findings in their appeals for protecting the environment.<br/>

## 3. Datasets used, exploration and wrangling
<br>The data comes from the Kaggle dataset "Climate Change: Earth Surface Temperature Data", which consists of 5 sub-datasets. Besides the global temperature record, they are organized into 4 categories: cities, major cities, states, and countries. In this project, I used 4 of these sub-datasets.<br/>

### 3.1 Initial data exploration
<br> Each dataset is provided as a CSV file, so I can import them directly into IPython as pandas data frames.<br/>
<br> I checked the first five rows of each dataset. The global temperatures file (GlobalTemperatures.csv) contains average global temperatures from 1750 to 2015. The other four datasets record temperatures in the different categories and, in addition to the temperature data, also include geographic information.<br/>

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
import plotly.offline as py
py.init_notebook_mode(connected=True)
%matplotlib inline

In [8]:
global_temp = pd.read_csv('GlobalTemperatures.csv')
global_temp.head()

,dt,LandAverageTemperature,LandAverageTemperatureUncertainty,LandMaxTemperature,LandMaxTemperatureUncertainty,LandMinTemperature,LandMinTemperatureUncertainty,LandAndOceanAverageTemperature,LandAndOceanAverageTemperatureUncertainty
0,1750-01-01,3.034,3.574,,,,,,
1,1750-02-01,3.083,3.702,,,,,,
2,1750-03-01,5.626,3.076,,,,,,
3,1750-04-01,8.49,2.451,,,,,,
4,1750-05-01,11.573,2.072,,,,,,


In [4]:
global_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3192 entries, 0 to 3191
Data columns (total 9 columns):
dt                                           3192 non-null object
LandAverageTemperature                       3180 non-null float64
LandAverageTemperatureUncertainty            3180 non-null float64
LandMaxTemperature                           1992 non-null float64
LandMaxTemperatureUncertainty                1992 non-null float64
LandMinTemperature                           1992 non-null float64
LandMinTemperatureUncertainty                1992 non-null float64
LandAndOceanAverageTemperature               1992 non-null float64
LandAndOceanAverageTemperatureUncertainty    1992 non-null float64
dtypes: float64(8), object(1)
memory usage: 224.5+ KB


In [5]:
MCity_temp = pd.read_csv('GlobalLandTemperaturesByMajorCity.csv')
MCity_temp.head()

,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1849-01-01,26.704,1.435,Abidjan,Côte D'Ivoire,5.63N,3.23W
1,1849-02-01,27.434,1.362,Abidjan,Côte D'Ivoire,5.63N,3.23W
2,1849-03-01,28.101,1.612,Abidjan,Côte D'Ivoire,5.63N,3.23W
3,1849-04-01,26.14,1.387,Abidjan,Côte D'Ivoire,5.63N,3.23W
4,1849-05-01,25.427,1.2,Abidjan,Côte D'Ivoire,5.63N,3.23W


In [6]:
MCity_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 239177 entries, 0 to 239176
Data columns (total 7 columns):
dt                               239177 non-null object
AverageTemperature               228175 non-null float64
AverageTemperatureUncertainty    228175 non-null float64
City                             239177 non-null object
Country                          239177 non-null object
Latitude                         239177 non-null object
Longitude                        239177 non-null object
dtypes: float64(2), object(5)
memory usage: 12.8+ MB


### 3.2 Checking for missing values
<br> I also checked whether there were any missing values in these dataframes. There are many missing values at the beginning of the dataframes, presumably because early records were lost in transmission. Fortunately, the missing values are a relatively small fraction of the whole dataset, and we are not interested in the earliest data, so I could drop the rows with missing values without affecting the final result too much.<br/>
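<br> One way to quantify how much data is missing before dropping rows is to compute the share of nulls per column. The sketch below is only an illustration: the sample frame contains made-up values (not rows from the Kaggle files), and `missing_fraction` is a hypothetical helper name.<br/>

```python
import numpy as np
import pandas as pd

def missing_fraction(df):
    """Return the fraction of missing values in each column (hypothetical helper)."""
    return df.isnull().mean()

# Made-up sample using column names from GlobalTemperatures.csv
sample = pd.DataFrame({
    'dt': ['1750-01-01', '1750-02-01', '1750-03-01', '1750-04-01'],
    'LandAverageTemperature': [3.0, np.nan, 5.5, 8.5],
    'LandMaxTemperature': [np.nan, np.nan, np.nan, 13.0],
})

fractions = missing_fraction(sample)
print(fractions)

# Keep only the rows where the column of interest is present
cleaned = sample.dropna(subset=['LandAverageTemperature']).reset_index(drop=True)
print(cleaned)
```

<br> Looking at the per-column fractions first makes it easier to judge whether dropping rows is acceptable for the columns that are actually used.<br/>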

In [9]:
global_temp_dt = global_temp[['dt','LandAverageTemperature','LandAverageTemperatureUncertainty']]
global_temp_dt = global_temp_dt[global_temp_dt.LandAverageTemperature.notnull()]
global_temp_dt = global_temp_dt.reset_index(drop=True)
global_temp_dt.head()

,dt,LandAverageTemperature,LandAverageTemperatureUncertainty
0,1750-01-01,3.034,3.574
1,1750-02-01,3.083,3.702
2,1750-03-01,5.626,3.076
3,1750-04-01,8.49,2.451
4,1750-05-01,11.573,2.072


In [14]:
Country_temp = pd.read_csv('GlobalLandTemperaturesByCountry.csv')
Country_temp = Country_temp[Country_temp.AverageTemperature.notnull()]
Country_temp = Country_temp.reset_index(drop=True)
Country_temp.head()

,dt,AverageTemperature,AverageTemperatureUncertainty,Country
0,1743-11-01,4.384,2.294,Åland
1,1744-04-01,1.53,4.68,Åland
2,1744-05-01,6.702,1.789,Åland
3,1744-06-01,11.609,1.577,Åland
4,1744-07-01,15.342,1.41,Åland


### 3.3 Data wrangling
<br> In this project, some columns are not necessary for the analysis and can be dropped, such as the "Uncertainty" columns. The datasets also contain geographic information stored as strings, which I have to convert to floats. In addition, the index of each dataset has to be reset to the location name, or to a date-time index where necessary.<br/>
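<br> As a minimal sketch of the date-time index step, the example below drops the uncertainty column, parses the `dt` strings, and uses them as the index. The sample values are made up, and the yearly mean at the end is only an assumed example of what a date-time index makes convenient.<br/>

```python
import pandas as pd

# Made-up sample using column names from GlobalTemperatures.csv
sample = pd.DataFrame({
    'dt': ['1750-01-01', '1750-02-01', '1751-01-01'],
    'LandAverageTemperature': [3.0, 3.1, 2.5],
    'LandAverageTemperatureUncertainty': [3.5, 3.7, 3.2],
})

# Drop the column that is not needed for the analysis
indexed = sample.drop(columns=['LandAverageTemperatureUncertainty'])

# Parse the date strings and use them as the index
indexed['dt'] = pd.to_datetime(indexed['dt'])
indexed = indexed.set_index('dt')

# With a DatetimeIndex, rows can be grouped by calendar year
yearly_mean = indexed.groupby(indexed.index.year)['LandAverageTemperature'].mean()
print(yearly_mean)
```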

In [10]:
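def convert(tude):
    """Convert a coordinate string such as '5.63N' to a signed float."""
    # North and east are positive; south and west are negative
    multi = 1 if tude[-1] in ['N','E'] else -1
    return multi * float(tude[:-1])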
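<br> To make the sign convention concrete, here is a small worked example. It repeats the `convert` function so the block runs on its own. The Abidjan coordinates come from the major-city file shown earlier; the second row of the sample frame is made up for illustration.<br/>

```python
import pandas as pd

def convert(tude):
    """Convert a coordinate string such as '5.63N' to a signed float."""
    multi = 1 if tude[-1] in ['N', 'E'] else -1
    return multi * float(tude[:-1])

# Coordinates of Abidjan as they appear in the major-city file
print(convert('5.63N'))  # north latitudes stay positive
print(convert('3.23W'))  # west longitudes become negative

# The same function applied to whole columns; the second row is made up
coords = pd.DataFrame({
    'Latitude': ['5.63N', '23.31S'],
    'Longitude': ['3.23W', '72.52E'],
})
coords['Latitude'] = coords['Latitude'].apply(convert)
coords['Longitude'] = coords['Longitude'].apply(convert)
print(coords)
```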

In [11]:
MCity_dt = MCity_temp.groupby(['City'])
MCity_mean = MCity_dt.AverageTemperature.mean()
MCity_Lat = MCity_dt.Latitude.first()
MCity_new_dt = pd.DataFrame(MCity_mean)
# Convert latitude strings such as '5.63N' to signed floats, column-wise
MCity_new_dt['Latitude'] = MCity_Lat.apply(convert)
MCity_new_dt.head()

City,AverageTemperature,Latitude
Abidjan,26.163737,5.63
Addis Abeba,17.525073,8.84
Ahmadabad,26.529853,23.31
Aleppo,17.370587,36.17
Alexandria,20.312617,31.35
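<br> The per-city summary can also be sketched in a single `groupby(...).agg(...)` call using named aggregation, which assumes pandas 0.25 or later. This is an alternative formulation rather than the code used in this project, and the sample rows are made up.<br/>

```python
import pandas as pd

def convert(tude):
    """Convert a coordinate string such as '5.63N' to a signed float."""
    multi = 1 if tude[-1] in ['N', 'E'] else -1
    return multi * float(tude[:-1])

# Made-up rows in the layout of GlobalLandTemperaturesByMajorCity.csv
sample = pd.DataFrame({
    'City': ['Abidjan', 'Abidjan', 'Aleppo', 'Aleppo'],
    'AverageTemperature': [26.0, 28.0, 10.0, 20.0],
    'Latitude': ['5.63N', '5.63N', '36.17N', '36.17N'],
})

# One pass: mean temperature and first latitude per city
summary = sample.groupby('City').agg(
    AverageTemperature=('AverageTemperature', 'mean'),
    Latitude=('Latitude', 'first'),
)

# Convert the latitude strings to signed floats
summary['Latitude'] = summary['Latitude'].apply(convert)
print(summary)
```

<br> Note that grouping on the city name alone merges cities that share a name, so grouping on both `City` and `Country` may be worth considering.<br/>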


In [12]:
City_temp = pd.read_csv('GlobalLandTemperaturesByCity.csv')
City_temp = City_temp[City_temp.AverageTemperature.notnull()]
City_temp = City_temp.reset_index(drop=True)
City_dt = City_temp.groupby(['City'])
City_mean = City_dt.AverageTemperature.mean()
City_Lat = City_dt.Latitude.first()
City_new_dt = pd.DataFrame(City_mean)
City_new_dt['Latitude'] = pd.Series(City_Lat,index = City_new_dt.index)
City_new_dt = City_new_dt[City_new_dt.Latitude.notnull()].copy()
# Convert latitude strings such as '42.59N' to signed floats, column-wise
City_new_dt['Latitude'] = City_new_dt['Latitude'].apply(convert)
City_new_dt.head()

City,AverageTemperature,Latitude
A Coruña,13.147277,42.59
Aachen,8.825173,50.63
Aalborg,7.695135,57.05
Aba,26.612824,5.63
Abadan,25.034749,29.74
