# Final Project: Temperature Analysis of Major Cities and Countries Over Years
## Name: Xuechun Sun (xs2254) - Work Solo
- Data Source: Kaggle Website (https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data)
    - Dataset Descriptions:
        1. GlobalLandTemperaturesByMajorCity.csv
            - Land temperature data for major cities worldwide, from the first year (1700) to the last year (2012)
        2. GlobalLandTemperaturesByCountry.csv
            - Land temperature data for each country, from the first year (1700) to the last year (2012)
            
- Project Description and Outline of the Code Structure
    - Part 1
        -  Data loading and Data cleaning
    - Part 2
        -  Data visualization using the interactive plotting tool Plotly
        -  Temperature time series plot for five main countries: United States, United Kingdom, Australia, China, India. The timeline runs from 1840 to 2012.
        -  Top ten major cities by temperature growth rate, with their temperatures in the first and last years.
    - Part 3 
        - Run Principal Component Analysis (PCA) for global major cities
        - Visualize PCA plot for 1st and 2nd Principal Components
    - Part 4
        - K-means Clustering
        - Visualize cluster group plot based on PCA plot
        
- Instructions for running the code
    - Install the plotly package (in your terminal, enter: conda install -c https://conda.anaconda.org/plotly plotly)
    - Get an API key from plotly (https://plot.ly/python/getting-started/)
    - Change the paths of the data files in the cell that loads the csv files
    - Run the code cell by cell
    
- Dependencies
    - data files
        - GlobalLandTemperaturesByMajorCity.csv
        - GlobalLandTemperaturesByCountry.csv
    - modules
        - numpy, pandas, plotly.plotly, plotly.graph_objs, plotly.tools, sklearn.decomposition, sklearn.cluster

- Any problems encountered
    - No
    
- Evaluation of Python for my task
    - The data processing and visualization parts were straightforward.
    - The modeling part was less satisfying, because Python seemed to offer fewer modeling functions. For modeling, I think R might be better, because R has a wide variety of model packages that users can choose from to meet different needs.


# Part 1
- Loading the data and filling the gaps
- Data cleaning and processing

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#Change these paths to the location of the csv files on your machine
#'dt' is parsed as dates and used as the index of both tables
global_city_temp = pd.read_csv('/Users/sun93/Documents/programming_lang_python/project/GlobalLandTemperatures/GlobalLandTemperaturesByMajorCity.csv',
                                index_col='dt', parse_dates=True)
country_temp = pd.read_csv('/Users/sun93/Documents/programming_lang_python/project/GlobalLandTemperatures/GlobalLandTemperaturesByCountry.csv',
                                index_col='dt', parse_dates=True)

#fill each missing value with the last valid value above it
global_city_temp = global_city_temp.ffill()
country_temp = country_temp.ffill()

country_temp.head()

dt,AverageTemperature,AverageTemperatureUncertainty,Country
1743-11-01,4.384,2.294,Åland
1743-12-01,4.384,2.294,Åland
1744-01-01,4.384,2.294,Åland
1744-02-01,4.384,2.294,Åland
1744-03-01,4.384,2.294,Åland


In [2]:
global_city_temp.head()

dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
1849-01-01,26.704,1.435,Abidjan,Côte D'Ivoire,5.63N,3.23W
1849-02-01,27.434,1.362,Abidjan,Côte D'Ivoire,5.63N,3.23W
1849-03-01,28.101,1.612,Abidjan,Côte D'Ivoire,5.63N,3.23W
1849-04-01,26.14,1.387,Abidjan,Côte D'Ivoire,5.63N,3.23W
1849-05-01,25.427,1.2,Abidjan,Côte D'Ivoire,5.63N,3.23W
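
The forward fill above runs over the whole table, so a gap at the start of one country's or city's record is filled with the last value of the record before it. The sketch below uses a small made-up table (the column names match the dataset, the values are hypothetical) to compare this with filling inside each group. The grouped version is only one possible alternative and is not what this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: country B starts with a missing value.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "AverageTemperature": [1.0, np.nan, 3.0, np.nan, 5.0],
})

# Fill over the whole column: B's first row inherits A's last value (3.0).
whole_fill = toy["AverageTemperature"].ffill()

# Fill within each country: B's first row stays missing.
group_fill = toy.groupby("Country")["AverageTemperature"].ffill()

print(pd.DataFrame({"whole": whole_fill, "by_country": group_fill}))
```

Rows that stay missing after a grouped fill would still need to be dropped or filled another way before averaging.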


# Part 2
## Data visualization
(To install Plotly, enter in a Mac terminal: conda install -c https://conda.anaconda.org/plotly plotly)
- Time series plot of five main countries: United States, China, India, Australia, United Kingdom

In [3]:
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.tools as tls

#change to your username and api_key
tls.set_credentials_file(username='XuechunSun', api_key='bf4ku25v0j')

# Create traces
#temp_Year = country_temp.groupby(country_temp.index.year)[country_temp.Country == 'United States']['AverageTemperature'].mean()
temp_us = country_temp[country_temp['Country'] == 'United States']
temp_us = temp_us[temp_us.index.year >= 1840]
temp_us_Year = temp_us.groupby(temp_us.index.year)['AverageTemperature'].mean()

temp_ch = country_temp[country_temp['Country'] == 'China']
temp_ch = temp_ch[temp_ch.index.year >= 1840]
temp_ch_Year = temp_ch.groupby(temp_ch.index.year)['AverageTemperature'].mean()

temp_india = country_temp[country_temp['Country'] == 'India']
temp_india = temp_india[temp_india.index.year >= 1840]
temp_india_Year = temp_india.groupby(temp_india.index.year)['AverageTemperature'].mean()

temp_aus = country_temp[country_temp['Country'] == 'Australia']
temp_aus = temp_aus[temp_aus.index.year >= 1840]
temp_aus_Year = temp_aus.groupby(temp_aus.index.year)['AverageTemperature'].mean()

temp_uk = country_temp[country_temp['Country'] == 'United Kingdom']
temp_uk = temp_uk[temp_uk.index.year >= 1840]
temp_uk_Year = temp_uk.groupby(temp_uk.index.year)['AverageTemperature'].mean()

trace0 = go.Scatter(
    x = temp_us_Year.index.tolist(),
    y = temp_us_Year.tolist(),
    mode = 'lines',
    name = 'United States'
)

trace1 = go.Scatter(
    x = temp_ch_Year.index.tolist(),
    y = temp_ch_Year.tolist(),
    mode = 'lines',
    name = 'China'
)

trace2 = go.Scatter(
    x = temp_india_Year.index.tolist(),
    y = temp_india_Year.tolist(),
    mode = 'lines',
    name = 'India'
)

trace3 = go.Scatter(
    x = temp_aus_Year.index.tolist(),
    y = temp_aus_Year.tolist(),
    mode = 'lines',
    name = 'Australia'
)

trace4 = go.Scatter(
    x = temp_uk_Year.index.tolist(),
    y = temp_uk_Year.tolist(),
    mode = 'lines',
    name = 'United Kingdom'
)
data = [trace0, trace1, trace2, trace3, trace4]
layout = go.Layout(
    title='Temperature Time Series plot for Five Main Countries',
    legend=dict(
        y=0.5,
        traceorder='reversed',
        font=dict(
            size=16
        )
    ),
    xaxis=dict(
        title='Time',
        titlefont=dict(
            size=18
        )
    ),
    yaxis=dict(
        title='Temperature',
        titlefont=dict(
            size=18
        )
    )
)
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='Temperature Time Series plot for Five Main Countries')
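
The five per-country blocks above all follow the same pattern: select one country, keep the years from 1840 onward, and average by year. As a sketch, that pattern could be wrapped in one helper. The function name `yearly_mean` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def yearly_mean(df, country, start_year=1840):
    """Annual mean temperature for one country from start_year onward."""
    sub = df[(df["Country"] == country) & (df.index.year >= start_year)]
    return sub.groupby(sub.index.year)["AverageTemperature"].mean()

# Hypothetical toy table with the same layout as country_temp.
toy = pd.DataFrame(
    {"Country": ["India", "India", "India", "China"],
     "AverageTemperature": [20.0, 24.0, 26.0, 5.0]},
    index=pd.to_datetime(["1839-06-01", "1840-01-01", "1840-07-01", "1840-01-01"]),
)
toy.index.name = "dt"

india = yearly_mean(toy, "India")  # the 1839 row is excluded by the year filter
print(india)
```

With the real data, each trace could then be built from `yearly_mean(country_temp, name)` inside a loop over the five country names.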

- Average temperature growth rate plot for the major cities

In [4]:
#Data Cleaning
#calculate growth rate for major city
temp_majorcity_growthrate = {}
temp_majorcity_begin_end = {}
for i in global_city_temp.City.unique().tolist():
    temp_city_i = global_city_temp[global_city_temp['City'] == i]
    temp_city_i = temp_city_i[temp_city_i.index.year >= 1840]
    temp_city_i_year = temp_city_i.groupby(temp_city_i.index.year)['AverageTemperature'].mean()
    #growth rate: relative change between the first and last yearly means
    temp_city_i_year_list = temp_city_i_year.tolist()
    temp_city_i_year_gr = (temp_city_i_year_list[-1] - temp_city_i_year_list[0])/temp_city_i_year_list[0]
    #write dict (growth rate rounded to 4 decimals, temperatures to 2)
    temp_majorcity_growthrate[i] = float("{0:.4f}".format(temp_city_i_year_gr))
    temp_majorcity_begin_end[i] = [float("{0:.2f}".format(temp_city_i_year_list[0])),
                                   float("{0:.2f}".format(temp_city_i_year_list[-1]))]
                                   

#sort temperature growth rate and get top ten major city
top_10city_gr_temp = sorted(temp_majorcity_growthrate, key=temp_majorcity_growthrate.get, reverse=True)[:10]  
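
The growth rate computed above is the relative change (last yearly mean minus first yearly mean, divided by the first yearly mean). The values look like degrees Celsius (for example about 26.7 for Abidjan), so the same absolute warming gives very different rates depending on the baseline, and a negative baseline flips the sign. The sketch below illustrates this with made-up numbers; `growth_rate` is a hypothetical helper, not a function from the notebook.

```python
def growth_rate(yearly_means):
    """Relative change between the first and last yearly mean."""
    first, last = yearly_means[0], yearly_means[-1]
    return (last - first) / first

# Each made-up series warms by exactly 1 degree.
warm_city = growth_rate([20.0, 20.5, 21.0])   # baseline of 20 degrees
cool_city = growth_rate([1.0, 1.5, 2.0])      # baseline of 1 degree
cold_city = growth_rate([-2.0, -1.5, -1.0])   # negative baseline

print(warm_city, cool_city, cold_city)
```

This is why a ranking by growth rate tends to favor cities with a first-year mean close to zero. Ranking by the absolute difference would be one alternative.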

In [6]:
temp_list_top10_begin = [temp_majorcity_begin_end[i][0] for i in top_10city_gr_temp]
temp_list_top10_end = [temp_majorcity_begin_end[i][1] for i in top_10city_gr_temp]

trace0 = go.Bar(
    x= top_10city_gr_temp,
    y= temp_list_top10_begin,
    name='Begin Temp(1840)',
    marker=dict(
        color='rgb(49,130,189)'
    )
)
trace1 = go.Bar(
    x= top_10city_gr_temp,
    y= temp_list_top10_end,
    name='End Temp(2012)',
    marker=dict(
        color='rgb(204,204,204)',
    )
)

data = [trace0, trace1]
layout = go.Layout(
    barmode='group',
    title='Temp of Top Ten Major Cities in Temperature Growth Rate From 1840 to 2012',
    legend=dict(
        y=0.5,
        traceorder='reversed',
        font=dict(
            size=16
        )
    ),
    xaxis=dict(
        title='Major City',
        titlefont=dict(
            size=18
        ),
        tickangle=-45
    ),
    yaxis=dict(
        title='Temperature',
        titlefont=dict(
            size=18
        )
    )
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Temp of Top Ten Major Cities in Temperature Growth Rate From 1840 to 2012')

In [7]:
temp_list_growth_rate = [temp_majorcity_growthrate[i] for i in top_10city_gr_temp]

data = [go.Bar(
    x= temp_list_growth_rate,
    y= top_10city_gr_temp,
    orientation = 'h',
    marker=dict(
        color='rgb(220,100,100)'
    )
)]
layout = go.Layout(
    title='Top Ten Major Cities in Temperature Growth Rate From 1840 to 2012',
    legend=dict(
        y=0.5,
        traceorder='reversed',
        font=dict(
            size=16
        )
    ),
    xaxis=dict(
        title='Growth Rate',
        titlefont=dict(
            size=18
        ),
        tickangle=-45
    ),
    yaxis=dict(
        title='Major City',
        titlefont=dict(
            size=18
        )
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Top Ten Major Cities in Temperature Growth Rate From 1840 to 2012')

# Part 3
## Principal Component Analysis (PCA) 
- Apply Principal Component Analysis to the major cities to reduce the number of dimensions. Generally speaking, if two points are close in the PCA plot, the corresponding cities may have similar features.
- The generated dataset has six variables: city name (character), growth rate (numeric), and the temperature differences over 1850 to 1900, 1900 to 1950, 1950 to 2000, and 2000 to 2010 (all numeric). The last five variables are used as input to the PCA model.

In [8]:
#data cleaning
pca_list = []
for i in range(len(global_city_temp.City.unique())):
    city_name = global_city_temp.City.unique()[i]
    temp_city_i = global_city_temp[global_city_temp['City'] == city_name]
    
    temp_city_i = temp_city_i[temp_city_i.index.year >= 1850]
    temp_city_i_1850_1900 = temp_city_i[temp_city_i.index.year < 1900]
    temp_city_i_1850_1900 = temp_city_i_1850_1900.groupby(temp_city_i_1850_1900.index.year)['AverageTemperature'].mean()
    temp_city_i_1850_1900_list = temp_city_i_1850_1900.tolist()
    temp_city_i_1850_1900_diff = temp_city_i_1850_1900_list[len(temp_city_i_1850_1900_list) - 1] - temp_city_i_1850_1900_list[0]
    
    temp_city_i = temp_city_i[temp_city_i.index.year >= 1900 ]
    temp_city_i_1900_1950 = temp_city_i[temp_city_i.index.year < 1950]
    temp_city_i_1900_1950 = temp_city_i_1900_1950.groupby(temp_city_i_1900_1950.index.year)['AverageTemperature'].mean()
    temp_city_i_1900_1950_list = temp_city_i_1900_1950.tolist()
    temp_city_i_1900_1950_diff = temp_city_i_1900_1950_list[len(temp_city_i_1900_1950_list) - 1] - temp_city_i_1900_1950_list[0]
    
    temp_city_i = temp_city_i[temp_city_i.index.year >= 1950]
    temp_city_i_1950_2000 = temp_city_i[temp_city_i.index.year < 2000]
    temp_city_i_1950_2000 = temp_city_i_1950_2000.groupby(temp_city_i_1950_2000.index.year)['AverageTemperature'].mean()
    temp_city_i_1950_2000_list = temp_city_i_1950_2000.tolist()
    temp_city_i_1950_2000_diff = temp_city_i_1950_2000_list[len(temp_city_i_1950_2000_list) - 1] - temp_city_i_1950_2000_list[0]
    
    temp_city_i = temp_city_i[temp_city_i.index.year >= 2000]
    temp_city_i_2000_2010 = temp_city_i[temp_city_i.index.year < 2010]
    temp_city_i_2000_2010 = temp_city_i_2000_2010.groupby(temp_city_i_2000_2010.index.year)['AverageTemperature'].mean()
    temp_city_i_2000_2010_list = temp_city_i_2000_2010.tolist()
    temp_city_i_2000_2010_diff = temp_city_i_2000_2010_list[len(temp_city_i_2000_2010_list) - 1] - temp_city_i_2000_2010_list[0]
    

    pca_city = [global_city_temp.City.unique()[i], temp_majorcity_growthrate[global_city_temp.City.unique()[i]],
                temp_city_i_1850_1900_diff, temp_city_i_1900_1950_diff,temp_city_i_1950_2000_diff,
                temp_city_i_2000_2010_diff]
    pca_list.append(pca_city)

#stack the per-city rows into a 100 x 6 array; because each row mixes the
#city name with numbers, NumPy stores every entry as a string
pca_matrix = np.array(pca_list)

In [9]:
from sklearn.decomposition import PCA as sklearnPCA

#run PCA model on the five numeric columns (converted from strings to floats)
city_pca = sklearnPCA(n_components=2)
city_pca_transf = city_pca.fit_transform(pca_matrix[0:100,1:6].astype(float))
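
To show roughly what the `fit_transform` call above computes, the sketch below reproduces the same kind of projection with plain NumPy instead of scikit-learn: it centers the columns, takes a singular value decomposition, and keeps the first two components. The data is random stand-in data, not the project's feature matrix, and the signs of the components may differ from scikit-learn's output.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # stand-in for a 100 x 5 feature matrix

# Center each column, then decompose the centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Scores on the first two principal components (one row per city).
scores = U[:, :2] * S[:2]

# Share of the total variance captured by each component.
explained_ratio = S**2 / np.sum(S**2)
print(explained_ratio[:2].sum())
```

In scikit-learn, the corresponding shares are available from the fitted model as `explained_variance_ratio_`, which helps judge how much information a two-dimensional plot retains.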

In [10]:
#find the row index of each top 10 city in pca_matrix
top10_index = [int(np.where(pca_matrix[:,0] == city)[0][0]) for city in top_10city_gr_temp]

#indices obtained in the original run:
#[65, 80, 34, 82, 48, 19, 64, 92, 86, 5]

#draw PCA plot based on PC
trace1 = go.Scatter(
    x = city_pca_transf[:,0],
    y = city_pca_transf[:,1],
    mode='markers',
    name='Major City',
    text= pca_matrix[0:100,0],
    textposition='top'
)
trace2 = go.Scatter(
    x = city_pca_transf[top10_index,0],
    y = city_pca_transf[top10_index,1],
    mode='markers',
    name='Top 10 Cities',
    text= top_10city_gr_temp,
    textposition='top',
    marker = dict(
        size = 10,
        color = 'red',
    )
)
data = [trace1,trace2]
layout = dict(
    title = 'PCA Plots of 100 major cities',
    xaxis=dict(
        title='1st Principal Component',
        titlefont=dict(
            size=18
        )
    ),
    yaxis=dict(
        title='2nd Principal Component',
        titlefont=dict(
            size=18
        )
    )
)

fig = dict(data=data, layout=layout)
py.iplot(fig, filename='PCA Plots of 100 major cities')

Each point in the plot represents a city. The red points mark the ten cities with the largest temperature growth rates.

# Part 4
## K-means Clustering

In [11]:
from sklearn import cluster

#fit a K-means model with 3 clusters on the five numeric columns
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(pca_matrix[0:100,1:6].astype(float))

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

In [12]:
# set up labels of kmeans
k_means_labels = k_means.labels_.tolist()

# append the cluster label as a 7th column; pca_matrix holds strings,
# so the labels are converted to strings as well
k_mean_data = np.concatenate((pca_matrix, k_means.labels_.astype(str).reshape(100,1)), 1)

# rows of each cluster
cluster0 = k_mean_data[k_mean_data[:,6] == '0',:]
cluster1 = k_mean_data[k_mean_data[:,6] == '1',:]
cluster2 = k_mean_data[k_mean_data[:,6] == '2',:]

# row indices of the cities in each cluster
label = np.array(range(100))
cluster0_label = label[k_mean_data[:,6] == '0'].tolist()
cluster1_label = label[k_mean_data[:,6] == '1'].tolist()
cluster2_label = label[k_mean_data[:,6] == '2'].tolist()


In [13]:
#draw cluster plot based on PCA plot

trace0 = go.Scatter(
    x = city_pca_transf[cluster0_label,0],
    y = city_pca_transf[cluster0_label,1],
    mode='markers',
    name='Cluster 0',
    text= pca_matrix[cluster0_label,0],
    textposition='top',
    marker = dict(
        color = 'blue',
    )
)
trace1 = go.Scatter(
    x = city_pca_transf[cluster1_label,0],
    y = city_pca_transf[cluster1_label,1],
    mode='markers',
    name='Cluster 1',
    text= pca_matrix[cluster1_label,0],
    textposition='top',
    marker = dict(
        color = 'red',
    )
)
trace2 = go.Scatter(
    x = city_pca_transf[cluster2_label,0],
    y = city_pca_transf[cluster2_label,1],
    mode='markers',
    name='Cluster 2',
    text= pca_matrix[cluster2_label,0],
    textposition='top',
    marker = dict(
        color = 'green',
    )
)

data = [trace0,trace1,trace2]
layout = dict(
    title = 'Cluster Groups of 100 major cities',
    xaxis=dict(
        title='1st Principal Component',
        titlefont=dict(
            size=18
        )
    ),
    yaxis=dict(
        title='2nd Principal Component',
        titlefont=dict(
            size=18
        )
    )
)

fig = dict(data=data, layout=layout)
py.iplot(fig, filename='Cluster Groups of 100 major cities')

- From the cluster groups plot above, we can see that cities in the same group are very close to each other in the PCA plot.
- Cities in the same cluster group have similar growth rates and similar temperature increases over the chosen time periods. The reasons behind this may vary. For example, they may lie at similar latitudes, or several of them may have been large, developing cities during the years of interest. Population and climate type may also influence temperature in cities.
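
One way to check the first observation numerically, rather than by eye, is to compare the average distance of points to their own cluster centroid with the average distance to the overall centroid in the two-dimensional PCA space. The sketch below uses two synthetic blobs in place of the real PCA scores and cluster labels; `mean_dist_to_centroid` is a hypothetical helper.

```python
import numpy as np

def mean_dist_to_centroid(points):
    """Average Euclidean distance from each point to the group's centroid."""
    centroid = points.mean(axis=0)
    return float(np.linalg.norm(points - centroid, axis=1).mean())

# Synthetic stand-ins for the PCA scores and the K-means labels.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                 rng.normal(5, 0.5, size=(50, 2))])
labels = np.repeat([0, 1], 50)

within = np.mean([mean_dist_to_centroid(pts[labels == k]) for k in np.unique(labels)])
overall = mean_dist_to_centroid(pts)
print(within, overall)
```

With the notebook's data, `pts` would be `city_pca_transf` and `labels` would be `k_means.labels_`. A within-cluster value well below the overall value would support the visual impression.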