## Temperature forecasting for different cities in the world

## Introduction
For this project you are asked to analyze three datasets, called respectively:
1. pollution_us_2000_2016.csv
2. greenhouse_gas_inventory_data_data.csv
3. GlobalLandTemperaturesByCity.csv

You are asked to extract from dataset 2 only the entries for the United States (for which the other datasets also provide data) and to perform the following tasks:
- measure how pollution and temperature form clusters that trace the highly populated cities in the world
- analyze the correlation between pollution data and temperature change
- predict the yearly temperature change of a given city over a given time period, using the <b>ARIMA model</b> for <b>time series forecasting</b>, which combines autoregressive (AR) terms, differencing (the "integrated" part), and moving-average (MA) terms
- (OPTIONAL) rank the 5 US cities that will have the highest temperature change


### TASK 1: Cluster Analysis
Use K-means or DBSCAN to perform the cluster analysis, and create a new dataset in which each city is associated with its identified cluster.
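
A minimal sketch of this step, assuming K-means from scikit-learn and a per-city table of average temperature and average NO2. The city names and values are made up for illustration, and `features` and `n_clusters=2` are assumptions rather than choices fixed by the assignment.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up per-city averages (illustrative values only, not taken from the datasets)
cities = pd.DataFrame({
    "City": ["city_a", "city_b", "city_c", "city_d", "city_e", "city_f"],
    "AverageTemperature": [24.0, 25.5, 23.1, 9.0, 8.2, 10.4],
    "NO2 Mean": [30.0, 28.5, 32.0, 10.0, 12.5, 9.0],
})

features = ["AverageTemperature", "NO2 Mean"]

# Standardize so that temperature and pollution contribute on the same scale
scaled = StandardScaler().fit_transform(cities[features])

# n_clusters=2 is an assumption for this toy data; tune it on the real data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)

# New dataset: each city together with the cluster it was assigned to
clustered = cities.assign(Cluster=kmeans.fit_predict(scaled))
print(clustered)
```

DBSCAN could be swapped in through `sklearn.cluster.DBSCAN`, in which case the number of clusters is not chosen in advance.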

### TASK 2: Correlation Analysis

You measure the correlation between:
- temperature and latitude
- temperature and pollution
- temperature change (difference between the average temperature measured over the last 3 years and the previous temperature) and pollution
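
A short sketch of how these correlations might be computed with pandas. It assumes latitude is stored as text such as `32.95N`, as in the temperature dataset; `parse_latitude` and `LatitudeDeg` are hypothetical names, and all values are made up for illustration.

```python
import pandas as pd

def parse_latitude(text: str) -> float:
    """Convert a string such as '32.95N' or '12.50S' to signed degrees."""
    value, hemisphere = float(text[:-1]), text[-1].upper()
    return value if hemisphere == "N" else -value

# Made-up rows in the layout of the merged data (illustrative values only)
df = pd.DataFrame({
    "Latitude": ["25.00N", "30.00N", "35.00N", "40.00N", "45.00N"],
    "AverageTemperature": [25.0, 21.0, 16.5, 12.0, 8.0],
    "NO2 Mean": [12.0, 18.0, 15.0, 22.0, 20.0],
})
df["LatitudeDeg"] = df["Latitude"].map(parse_latitude)

# Pearson correlation (the pandas default) between numeric columns
corr_lat = df["AverageTemperature"].corr(df["LatitudeDeg"])
corr_matrix = df[["AverageTemperature", "LatitudeDeg", "NO2 Mean"]].corr()
print(corr_lat)
print(corr_matrix)
```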


### TASK 3: Predicting the Temperature of a Given City across a Specified Time Period
After reading the data in the temperature data set, for each city cluster, before applying the ARIMA model you perform the following steps:

- EDA
- data cleaning and preprocessing (converting the 'dt' (date) column to DateTime format, removing NaN values)
- feature selection
- make the time series stationary
- check for stationarity by calculating the Augmented Dickey-Fuller (ADF) test statistic
- identify the (p, q) order of the ARIMA model using the ACF and PACF (partial autocorrelation) plots

Then:

- fit the ARIMA model using the calculated p, q values
- calculate the MSE with respect to the true temperature measurements to estimate the performance of the model


NOTE: ARIMA models need the data to be stationary i.e. the data must not exhibit trend and/or seasonality. To identify and remove trend and seasonality, we can use
- seasonal decomposition
- differencing

In [111]:
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# the old statsmodels.tsa.arima_model.ARIMA is no longer available in current statsmodels releases
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
import ipywidgets as widgets
import seaborn as sns


## My tasks 
1. Rename things 

## SECTION 1: Cluster Analysis

---
Loading the data in CSV format as Pandas DataFrame 
---
1. Cleaning the data by dropping NaN values
2. Setting the date as the index of the three DataFrames
3. Selecting the relevant features we will consider

In [112]:
# read the CSV file containing the pollution data
df_pollution = pd.read_csv("./Project_3/data/data-project3/pollution_us_2000_2016.csv")
# read the CSV file containing temperature data into a DataFrame
df_temp = pd.read_csv("./Project_3/data/data-project3/GlobalLandTemperaturesByCity.csv")
# read the CSV file containing the greenhouse gas inventory data
df_greenhouse = pd.read_csv("./Project_3/data/data-project3/greenhouse_gas_inventory_data_data(1).csv")

In [113]:
# 1. Cleaning the data by dropping the NaN values
df_pollution.dropna(inplace=True)
df_temp.dropna(inplace=True)
df_greenhouse.dropna(inplace=True)

In [114]:
# Dropping duplicates 
df_pollution.drop_duplicates(subset=None, keep='first', inplace=True)
df_temp.drop_duplicates(subset=None, keep='first', inplace=True)
df_greenhouse.drop_duplicates(subset=None, keep='first', inplace=True)

In [115]:
print(df_pollution.columns)
print(df_temp.columns)
print(df_greenhouse.columns)

Index(['Unnamed: 0', 'State Code', 'County Code', 'Site Num', 'Address',
       'State', 'County', 'City', 'Date Local', 'NO2 Units', 'NO2 Mean',
       'NO2 1st Max Value', 'NO2 1st Max Hour', 'NO2 AQI', 'O3 Units',
       'O3 Mean', 'O3 1st Max Value', 'O3 1st Max Hour', 'O3 AQI', 'SO2 Units',
       'SO2 Mean', 'SO2 1st Max Value', 'SO2 1st Max Hour', 'SO2 AQI',
       'CO Units', 'CO Mean', 'CO 1st Max Value', 'CO 1st Max Hour', 'CO AQI'],
      dtype='object')
Index(['dt', 'AverageTemperature', 'AverageTemperatureUncertainty', 'City',
       'Country', 'Latitude', 'Longitude'],
      dtype='object')
Index(['country_or_area', 'year', 'value', 'category'], dtype='object')


In [116]:
# Setting the date as an index of the three dataframes 
df_pollution.set_index("Date Local", inplace=True)
df_temp.set_index("dt", inplace=True)
df_greenhouse.set_index("year", inplace=True)

In [117]:
df_temp

dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1744-04-01,5.788,3.624,Århus,Denmark,57.05N,10.33E
1744-05-01,10.644,1.283,Århus,Denmark,57.05N,10.33E
1744-06-01,14.051,1.347,Århus,Denmark,57.05N,10.33E
1744-07-01,16.082,1.396,Århus,Denmark,57.05N,10.33E
...,...,...,...,...,...,...
2013-04-01,7.710,0.182,Zwolle,Netherlands,52.24N,5.26E
2013-05-01,11.464,0.236,Zwolle,Netherlands,52.24N,5.26E
2013-06-01,15.043,0.261,Zwolle,Netherlands,52.24N,5.26E
2013-07-01,18.775,0.193,Zwolle,Netherlands,52.24N,5.26E


In [118]:
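# Changing the type of the index to datetime
df_pollution.index = pd.to_datetime(df_pollution.index)
df_temp.index = pd.to_datetime(df_temp.index)
# 'year' holds integers such as 1990; parse them as years, otherwise pd.to_datetime reads them as nanoseconds since 1970
df_greenhouse.index = pd.to_datetime(df_greenhouse.index.astype(str), format="%Y")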

In [119]:
mask = df_temp["Country"] == "United States"
df_temp = df_temp[mask]

In [120]:
# Reassign instead of dropping in place: df_temp is a filtered slice, and an
# in-place drop on it triggers pandas' SettingWithCopyWarning
df_temp = df_temp.drop(columns="Country")


In [121]:
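# Yearly aggregation is left disabled for now: the df_pollution line raises
# "TypeError: agg function failed [how->mean,dtype->object]" because the frame
# still contains string columns (e.g. 'Address', 'NO2 Units'). Keep only the
# numeric columns, or pass numeric_only=True to .mean(), before re-enabling it.
# df_temp = df_temp.groupby(['City', 'Latitude', 'Longitude']).resample('YE').mean()
# df_pollution = df_pollution.groupby('City').resample('YE').mean()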
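
The `TypeError: agg function failed` for this cell comes from averaging string columns. A possible way around it, sketched on made-up data, is to restrict the mean to numeric columns. Note that this example swaps `resample('YE')` for grouping on the calendar year of the index, which gives per-city yearly means without relying on a specific frequency alias; `toy`, `years`, and `yearly` are hypothetical names.

```python
import pandas as pd

# Made-up daily readings in the layout of df_pollution (illustrative values only)
toy = pd.DataFrame(
    {
        "City": ["city_a", "city_a", "city_a", "city_b", "city_b"],
        "NO2 Units": ["Parts per billion"] * 5,  # string column that breaks a plain .mean()
        "NO2 Mean": [10.0, 20.0, 30.0, 4.0, 6.0],
    },
    index=pd.to_datetime(
        ["2000-01-01", "2000-06-01", "2001-01-01", "2000-03-01", "2000-09-01"]
    ),
)

# Group by city and by the calendar year of the DatetimeIndex
years = toy.index.year.rename("Year")
yearly = toy.groupby(["City", years]).mean(numeric_only=True)
print(yearly)
```

The result is indexed by `(City, Year)`, so later cells that expect a `City` column or a plain `DatetimeIndex` would need a `reset_index()` first.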

In [122]:
df_pollution

Date Local,Unnamed: 0,State Code,County Code,Site Num,Address,State,County,City,NO2 Units,NO2 Mean,...,SO2 Units,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Units,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI
2000-01-01,1,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,Parts per billion,19.041667,...,Parts per billion,3.000000,9.0,21,13.0,Parts per million,0.878947,2.2,23,25.0
2000-01-02,5,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,Parts per billion,22.958333,...,Parts per billion,1.958333,3.0,22,4.0,Parts per million,1.066667,2.3,0,26.0
2000-01-03,9,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,Parts per billion,38.125000,...,Parts per billion,5.250000,11.0,19,16.0,Parts per million,1.762500,2.5,8,28.0
2000-01-04,13,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,Parts per billion,40.260870,...,Parts per billion,7.083333,16.0,8,23.0,Parts per million,1.829167,3.0,23,34.0
2000-01-05,17,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,Parts per billion,48.450000,...,Parts per billion,8.708333,15.0,7,21.0,Parts per million,2.700000,3.7,2,42.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2016-03-27,24585,56,21,100,NCore - North Cheyenne Soccer Complex,Wyoming,Laramie,Not in a city,Parts per billion,4.277273,...,Parts per billion,-0.095238,0.0,0,0.0,Parts per million,0.100000,0.1,0,1.0
2016-03-28,24589,56,21,100,NCore - North Cheyenne Soccer Complex,Wyoming,Laramie,Not in a city,Parts per billion,8.317391,...,Parts per billion,0.117391,0.5,7,0.0,Parts per million,0.100000,0.1,0,1.0
2016-03-29,24593,56,21,100,NCore - North Cheyenne Soccer Complex,Wyoming,Laramie,Not in a city,Parts per billion,2.564706,...,Parts per billion,0.143750,0.7,8,0.0,Parts per million,0.006667,0.1,0,1.0
2016-03-30,24597,56,21,100,NCore - North Cheyenne Soccer Complex,Wyoming,Laramie,Not in a city,Parts per billion,1.083333,...,Parts per billion,0.016667,0.1,0,0.0,Parts per million,0.091667,0.1,2,1.0


In [123]:
df_pollution = df_pollution[["City", "NO2 Mean", "O3 Mean", "SO2 Mean", "CO Mean"]]
df_temp = df_temp[["AverageTemperature", "City", "Latitude", "Longitude"]]
df_greenhouse = df_greenhouse[["country_or_area", "value"]]

In [124]:
cutoff_date = pd.to_datetime("2000-01-01")
cutoff_date

Timestamp('2000-01-01 00:00:00')

In [125]:
df_pollution = df_pollution[df_pollution.index >= cutoff_date]
df_temp = df_temp[df_temp.index >= cutoff_date]
# df_greenhouse = df_greenhouse[df_greenhouse.index >= cutoff_date]

In [127]:
df_temp

dt,AverageTemperature,City,Latitude,Longitude
2000-01-01,8.039,Abilene,32.95N,100.53W
2000-02-01,11.908,Abilene,32.95N,100.53W
2000-03-01,14.423,Abilene,32.95N,100.53W
2000-04-01,18.274,Abilene,32.95N,100.53W
2000-05-01,25.358,Abilene,32.95N,100.53W
...,...,...,...,...
2013-05-01,15.544,Yonkers,40.99N,74.56W
2013-06-01,20.892,Yonkers,40.99N,74.56W
2013-07-01,24.722,Yonkers,40.99N,74.56W
2013-08-01,21.001,Yonkers,40.99N,74.56W


## SECTION 2: Correlation Analysis

## SECTION 3: ARIMA model for temperature forecasting

In [None]:
# Although p and q can be read off the ACF and PACF plots for a single city, the choice should be automated.
# (OPTIONAL) Grid-search over p and q and keep the ARIMA model with the lowest AIC (BIC is recorded for comparison).
# city_df is the temperature series of the selected city; it must be defined before running this cell.
from statsmodels.tsa.arima.model import ARIMA  # current statsmodels API

MAX_ORDER = 3  # ASSUMED placeholder for the original "#": set the upper bound after looking at the PACF plot
d = 0          # differencing order; 0 assumes city_df has already been made stationary

p_range = q_range = range(MAX_ORDER + 1)  # candidate orders 0..MAX_ORDER inclusive

aic_values = []
bic_values = []
pq_values = []

for p in p_range:
    for q in q_range:
        try:
            results = ARIMA(city_df, order=(p, d, q)).fit()
        except (ValueError, np.linalg.LinAlgError):
            continue  # skip orders for which the fit fails
        aic_values.append(results.aic)
        bic_values.append(results.bic)
        pq_values.append((p, q))

best_pq = pq_values[aic_values.index(min(aic_values))]  # (p, q) corresponding to the lowest AIC score
print("(p,q) corresponding to lowest AIC score: ", best_pq)

In [None]:
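# fitting an ARIMA model with the chosen p, d, q values and calculating the mean squared error
from sklearn.metrics import mean_squared_error

arima_model = ARIMA(city_df, order=(best_pq[0], 0, best_pq[1])).fit()
predictions = arima_model.predict(start=0, end=len(city_df)-1)

# in-sample MSE of the fitted values against the observed temperatures
mse = mean_squared_error(city_df, predictions)
print("MSE:", mse)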



## Conclusion

write here the report for the project