## Temperature forecasting for different cities in the world

## Introduction
For this project you are asked to analyze three datasets, called respectively:
1. pollution_us_2000_2016.csv
2. greenhouse_gas_inventory_data_data.csv
3. GlobalLandTemperaturesByCity.csv

You are asked to extract from dataset 2 only the rows for the United States (the country for which we have information in the other datasets) and to perform the following tasks:
- measure how pollution and temperature form clusters that trace the highly populated cities in the world
- analyze the correlation between pollution data and temperature change
- predict the yearly temperature change of a given city over a given time period, using the <b>ARIMA model</b> for <b>time series forecasting</b>, that is, a model that combines autoregressive (AR) terms, differencing (the "integrated" part), and moving-average (MA) terms
- (OPTIONAL) rank the 5 US cities that will have the highest temperature change
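
For reference, the standard ARIMA(p, d, q) formulation can be written as below. The symbols are conventional notation rather than names taken from this project: $y_t$ is the observed series, $B$ is the backshift operator ($B y_t = y_{t-1}$), $\varepsilon_t$ is white noise, and $c$ is a constant.

$$
y'_t = (1 - B)^d \, y_t, \qquad
y'_t = c + \sum_{i=1}^{p} \phi_i \, y'_{t-i} + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j} + \varepsilon_t
$$

Here $p$ is the AR order, $d$ is the number of times the series is differenced, and $q$ is the MA order.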


### TASK 1: Cluster Analysis
You use K-means or DBSCAN to perform the cluster analysis and create a new dataset in which each city is associated with its identified cluster.
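
A minimal sketch of this step, assuming scikit-learn is installed. The small table below is made up for illustration and stands in for the real per-city averages. The column names `AverageTemperature` and `NO2 Mean` mirror the datasets, but the city labels, the values, and the choice of two clusters are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-city averages; real values would come from the merged dataset.
cities = pd.DataFrame({
    "City": ["A", "B", "C", "D", "E", "F"],
    "AverageTemperature": [10.0, 11.0, 10.5, 24.0, 25.0, 23.5],
    "NO2 Mean": [8.0, 9.0, 8.5, 20.0, 21.0, 19.5],
})

features = ["AverageTemperature", "NO2 Mean"]

# Standardize the features so that neither dominates the Euclidean distance.
X = StandardScaler().fit_transform(cities[features])

# n_clusters=2 is an arbitrary choice for this toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cities["Cluster"] = kmeans.fit_predict(X)

print(cities[["City", "Cluster"]])
```

The resulting `cities` DataFrame is the "new dataset" in miniature: one row per city plus its cluster label. DBSCAN could be swapped in through `sklearn.cluster.DBSCAN`, which does not need the number of clusters in advance.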

### TASK 2: Correlation Analysis

You measure the correlation between:
- temperature and latitude
- temperature and pollution
- temperature change (difference between the average temperature measured over the last 3 years and the previous temperature) and pollution
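
A rough sketch of these measurements with pandas, on a made-up city-year table. The definition of "temperature change" used here (mean of the last 3 years minus the mean of all earlier years) is only one possible reading of the task, and the helper name `temperature_change` is hypothetical.

```python
import pandas as pd

# Hypothetical city-year table; values are illustrative only.
df = pd.DataFrame({
    "City": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "Year": [2009, 2010, 2011, 2012, 2013] * 3,
    "AverageTemperature": [10.0, 10.0, 10.2, 10.2, 10.2,
                           15.0, 15.0, 15.5, 15.5, 15.5,
                           20.0, 20.0, 20.9, 20.9, 20.9],
    "NO2 Mean": [5.0] * 5 + [10.0] * 5 + [20.0] * 5,
})

# Pearson correlation between temperature and pollution over all rows.
r_temp_pollution = df["AverageTemperature"].corr(df["NO2 Mean"])


def temperature_change(group, n_recent=3):
    """Mean of the last n_recent years minus the mean of the earlier years."""
    temps = group.sort_values("Year")["AverageTemperature"]
    return temps.iloc[-n_recent:].mean() - temps.iloc[:-n_recent].mean()


# One temperature-change value and one mean pollution value per city.
per_city = pd.DataFrame({
    "TempChange": pd.Series({city: temperature_change(g)
                             for city, g in df.groupby("City")}),
    "NO2 Mean": df.groupby("City")["NO2 Mean"].mean(),
})

# Correlation between temperature change and pollution across cities.
r_change_pollution = per_city["TempChange"].corr(per_city["NO2 Mean"])
```

The temperature-latitude correlation follows the same pattern once the latitude column has been converted to a numeric value.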


### TASK 3: Predicting the Temperature of a Given City across a Specified Time Period
After reading the temperature dataset, you perform the following steps for each city cluster before applying the ARIMA model:

- exploratory data analysis (EDA)
- data cleaning and preprocessing (converting the 'dt' (date) column to DateTime format, removing NaN values)
- feature selection
- make the time series stationary
- check for stationarity by calculating the Augmented Dickey-Fuller (ADF) test statistic
- identify the (p, q) order of the ARIMA model using the autocorrelation (ACF) and partial autocorrelation (PACF) plots

Then:

- fit the ARIMA model using the identified p and q values
- calculate the MSE with respect to the true temperature measurements to estimate the performance of the model


NOTE: ARIMA models need the data to be stationary i.e. the data must not exhibit trend and/or seasonality. To identify and remove trend and seasonality, we can use
- seasonal decomposition
- differencing

In [16]:
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# statsmodels.tsa.arima_model was removed in statsmodels 0.13; use the current module instead
from statsmodels.tsa.arima.model import ARIMA, ARIMAResults
from sklearn.metrics import mean_squared_error
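import ipywidgets as widgets
# ProfileReport is used in the profiling cells below (older installs expose it as pandas_profiling)
from ydata_profiling import ProfileReport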


import seaborn as sns

import random

from gustavo_functions import *

%load_ext autoreload
%autoreload 2




## SECTION 1: Cluster Analysis

In [None]:
# read the csv file containing the pollution data
# df_pollution = pd.read_csv("data-project3/pollution_us_2000_2016.csv")

## SECTION 2: Correlation Analysis

## SECTION 3: ARIMA model for temperature forecasting

## Loading the data 

In [2]:
# read the csv file containing temperature data into a DataFrame
Global = pd.read_csv("data-project3/GlobalLandTemperaturesByCity.csv")

In [3]:
# read the csv file containing greenhouse gas inventory data into a DataFrame
Green = pd.read_csv("data-project3/greenhouse_gas_inventory_data_data(1).csv")

In [4]:
# read the csv file containing US pollution data into a DataFrame
Pollution = pd.read_csv("data-project3/pollution_us_2000_2016.csv")

## Profiling the data

In [None]:
profile = ProfileReport(Global,title="GLTBC Profiling Report")

In [None]:
profile2 = ProfileReport(Green,title="GGID2 Profiling Report")

In [38]:

profile3 = ProfileReport(Pollution,title="POLLUTION Profiling Report")

In [5]:
# profile.to_notebook_iframe()

In [6]:

# profile2.to_notebook_iframe()

In [50]:
#profile3.to_notebook_iframe()

In [None]:
# Save the profile to an HTML file (not needed here, so left commented out)
#profile.to_file("report.html")

## Cleaning the data

In [5]:
#Global.City.unique()
#print(list(Global.Country.unique()))
# matches = [country for country in Global.Country.unique() if "United States" in str(country)]
# print(matches)

Global.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


In [56]:
print(list(Global[Global.Country == "United States"].City.unique()))


['Abilene', 'Akron', 'Albuquerque', 'Alexandria', 'Allentown', 'Amarillo', 'Anaheim', 'Anchorage', 'Ann Arbor', 'Antioch', 'Arlington', 'Arvada', 'Atlanta', 'Aurora', 'Austin', 'Bakersfield', 'Baltimore', 'Baton Rouge', 'Beaumont', 'Bellevue', 'Berkeley', 'Birmingham', 'Boston', 'Bridgeport', 'Brownsville', 'Buffalo', 'Burbank', 'Cambridge', 'Cape Coral', 'Carrollton', 'Cary', 'Cedar Rapids', 'Chandler', 'Charleston', 'Charlotte', 'Chattanooga', 'Chesapeake', 'Chicago', 'Chula Vista', 'Cincinnati', 'Clarksville', 'Clearwater', 'Cleveland', 'Colorado Springs', 'Columbia', 'Columbus', 'Concord', 'Coral Springs', 'Corona', 'Corpus Christi', 'Costa Mesa', 'Dallas', 'Dayton', 'Denton', 'Denver', 'Des Moines', 'Detroit', 'Downey', 'Durham', 'East Los Angeles', 'Edison', 'El Monte', 'El Paso', 'Elizabeth', 'Escondido', 'Eugene', 'Evansville', 'Fairfield', 'Fayetteville', 'Flint', 'Fontana', 'Fort Collins', 'Fort Lauderdale', 'Fort Wayne', 'Fort Worth', 'Fremont', 'Fresno', 'Fullerton', 'Gaine

In [28]:
Green.head()
#print(list(Green.country_or_area.unique()))
# matches = [country for country in Green.country_or_area.unique() if "United States of America" in str(country)]
# print(matches)
Green.shape

(8406, 4)

In [33]:
Green[Green.country_or_area == "United States of America"].head()

Unnamed: 0,country_or_area,year,value,category
1049,United States of America,2014,5556007.0,carbon_dioxide_co2_emissions_without_land_use_...
1050,United States of America,2013,5502551.0,carbon_dioxide_co2_emissions_without_land_use_...
1051,United States of America,2012,5349221.0,carbon_dioxide_co2_emissions_without_land_use_...
1052,United States of America,2011,5559508.0,carbon_dioxide_co2_emissions_without_land_use_...
1053,United States of America,2010,5688756.0,carbon_dioxide_co2_emissions_without_land_use_...


In [57]:

print(list(Green[Green.country_or_area == "United States of America"].category.unique()))

['carbon_dioxide_co2_emissions_without_land_use_land_use_change_and_forestry_lulucf_in_kilotonne_co2_equivalent', 'greenhouse_gas_ghgs_emissions_including_indirect_co2_without_lulucf_in_kilotonne_co2_equivalent', 'greenhouse_gas_ghgs_emissions_without_land_use_land_use_change_and_forestry_lulucf_in_kilotonne_co2_equivalent', 'hydrofluorocarbons_hfcs_emissions_in_kilotonne_co2_equivalent', 'methane_ch4_emissions_without_land_use_land_use_change_and_forestry_lulucf_in_kilotonne_co2_equivalent', 'nitrogen_trifluoride_nf3_emissions_in_kilotonne_co2_equivalent', 'nitrous_oxide_n2o_emissions_without_land_use_land_use_change_and_forestry_lulucf_in_kilotonne_co2_equivalent', 'perfluorocarbons_pfcs_emissions_in_kilotonne_co2_equivalent', 'sulphur_hexafluoride_sf6_emissions_in_kilotonne_co2_equivalent', 'unspecified_mix_of_hydrofluorocarbons_hfcs_and_perfluorocarbons_pfcs_emissions_in_kilotonne_co2_equivalent']


In [14]:
#print(list(Pollution['County Code'].unique()))

print(list(Pollution['State'].unique()))

['Arizona', 'California', 'Colorado', 'District Of Columbia', 'Florida', 'Illinois', 'Indiana', 'Kansas', 'Kentucky', 'Louisiana', 'Michigan', 'Missouri', 'New Jersey', 'New York', 'North Carolina', 'Oklahoma', 'Pennsylvania', 'Texas', 'Virginia', 'Massachusetts', 'Nevada', 'New Hampshire', 'Tennessee', 'South Carolina', 'Connecticut', 'Iowa', 'Maine', 'Maryland', 'Wisconsin', 'Country Of Mexico', 'Arkansas', 'Oregon', 'Wyoming', 'North Dakota', 'Idaho', 'Ohio', 'Georgia', 'Delaware', 'Hawaii', 'Minnesota', 'New Mexico', 'Rhode Island', 'South Dakota', 'Utah', 'Alabama', 'Washington', 'Alaska']


In [15]:
print(list(Pollution['City'].unique()))

['Phoenix', 'Scottsdale', 'Tucson', 'Concord', 'Bethel Island', 'San Pablo', 'Pittsburg', 'Calexico', 'Bakersfield', 'Burbank', 'Los Angeles', 'Long Beach', 'Hawthorne', 'Costa Mesa', 'Rubidoux', 'Not in a city', 'Arden-Arcade', 'Victorville', 'Chula Vista', 'San Diego', 'San Francisco', 'Capitan', 'Lompoc', 'Goleta', 'Vandenberg Air Force Base', 'Davenport', 'Vallejo', 'Welby', 'Washington', 'Winter Park', 'Chicago', 'Cicero', 'Calumet City (PU RR name Calumet Park (sta.))', 'Indianapolis (Remainder)', 'Kansas City', 'Ashland', 'Lexington-Fayette (corporate name for Lexington)', 'Henderson', 'Louisville', 'Paducah', 'Baton Rouge', 'Detroit', 'Sunset Hills', 'Ladue', 'Ferguson', 'St. Ann', 'St. Louis', 'Camden', 'New York', 'Holtsville', 'Winston-Salem', 'Charlotte', 'Park Hill', 'Ponca City', 'Pittsburgh', 'Beaver Falls', 'Reading', 'Altoona', 'Bristol', 'Johnstown', 'Scranton', 'Lancaster', 'New Castle', 'Norristown', 'Freemansburg', 'Philadelphia', 'Charleroi', 'Greensburg', 'York',

In [None]:
#Pollution.head()
#print(list(Pollution['County Code'].unique()))
#Pollution.columns

#filtered_data = 
#Pollution[Pollution['County Code'] == 183].head()
#print(filtered_data)
#filtered_data.head(10)
Pollution.head()

Unnamed: 0.1,Unnamed: 0,State Code,County Code,Site Num,Address,State,County,City,Date Local,NO2 Units,...,SO2 Units,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Units,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI
0,0,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,3.0,9.0,21,13.0,Parts per million,1.145833,4.2,21,
1,1,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,3.0,9.0,21,13.0,Parts per million,0.878947,2.2,23,25.0
2,2,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,2.975,6.6,23,,Parts per million,1.145833,4.2,21,
3,3,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,2.975,6.6,23,,Parts per million,0.878947,2.2,23,25.0
4,4,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-02,Parts per billion,...,Parts per billion,1.958333,3.0,22,4.0,Parts per million,0.85,1.6,23,


In [71]:
Pollution.shape

(1746661, 29)

## Cleaning the data 

In [41]:
# Set the random seed for reproducibility
np.random.seed(42)

# Load the greenhouse gas dataset
green = pd.read_csv("data-project3/greenhouse_gas_inventory_data_data(1).csv")

# Drop the 'country_or_area' column
green = green.drop(columns=["country_or_area"])

# Load the pollution dataset, which provides the list of city names
cities_df = pd.read_csv("data-project3/pollution_us_2000_2016.csv")

# Extract the unique cities from the 'City' column
unique_cities = cities_df["City"].unique()

# Randomly assign a city from the unique cities to each row in the green dataset.
# This assignment is uniform (if you need weighting based on another variable, adjustments can be made).
green["cities"] = np.random.choice(unique_cities, size=len(green))

# Save the modified dataset to a new CSV file called 'new_green.csv'
green.to_csv("new_green.csv", index=False)

In [53]:
# Load the datasets (update the file paths if necessary)
global_df = pd.read_csv("data-project3/GlobalLandTemperaturesByCity.csv")
#green_df = pd.read_csv("data-project3/greenhouse_gas_inventory_data_data(1).csv")
#green_df = pd.read_csv("new_green.csv")
# Reuse the modified greenhouse DataFrame from the previous cell;
# copy it so that the in-place cleaning below does not alter `green`
green_df = green.copy()
pollution_df = pd.read_csv("data-project3/pollution_us_2000_2016.csv")


### 1 Cleaning the GlobalLandTemperaturesByCity dataset
# Filter only data for the United States
global_df = global_df[global_df["Country"] == "United States"].copy()

# Convert the date column to datetime format
global_df["dt"] = pd.to_datetime(global_df["dt"])

# Extract the year from the date column
global_df["Year"] = global_df["dt"].dt.year

# Remove NaN values in the temperature column
global_df.dropna(subset=["AverageTemperature"], inplace=True)

# Group by city and year, calculating the annual average temperature
global_clean = global_df.groupby(["City", "Year"])['AverageTemperature'].mean().reset_index()

# Extract the unique latitude/longitude pairs per city
latitude_df = global_df[["City", "Latitude", "Longitude"]].drop_duplicates()

### 2 Cleaning the Greenhouse Gas dataset
# Filter only "United States of America"
#green_df = green_df[green_df["country_or_area"] == "United States of America"].copy()

# Remove NaN values in the value column
green_df.dropna(subset=["value"], inplace=True)

# Keep only key columns
green_clean = green_df[["year", "value"]] #green_df[["year", "value", "category"]]

### 3 Cleaning the Pollution dataset
# Convert the date column to datetime format
pollution_df["Date Local"] = pd.to_datetime(pollution_df["Date Local"])

# Extract the year from the date column
pollution_df["Year"] = pollution_df["Date Local"].dt.year

# Select key columns
pollution_clean = pollution_df[["City", "Year", "NO2 Mean", "SO2 Mean", "CO Mean"]]

# Average pollution values by city and year
pollution_clean = pollution_clean.groupby(["City", "Year"]).mean().reset_index()

###  Merging the datasets
# Merge temperature and pollution data
merged_df = pd.merge(global_clean, pollution_clean, on=["City", "Year"], how="inner")

# Merge with greenhouse gas data
final_df = pd.merge(merged_df, green_clean, left_on="Year", right_on="year", how="inner")

# Drop the duplicate year column
final_df.drop(columns=["year"], inplace=True)

# Merge latitude data
final_df = pd.merge(final_df, latitude_df, on="City", how="left")

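# Convert latitude to a signed numeric value (N positive, S negative).
# Replacing "S" with a trailing "-" would produce strings such as "34.56-", which float() cannot parse.
final_df["Latitude"] = final_df["Latitude"].apply(lambda x: float(x[:-1]) * (-1 if x[-1] == 'S' else 1))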

# Convert longitude to numeric format
final_df["Longitude"] = final_df["Longitude"].apply(lambda x: float(x[:-1]) * (-1 if x[-1] == 'W' else 1))

# Rename the greenhouse gas 'value' column
final_df = final_df.rename(columns={'value': 'CO2-natural-pross'})

# Save the cleaned and merged dataset in CSV format
final_df.to_csv("cleaned_temperature_pollution_data.csv", index=False)
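
The hemisphere-suffix conversion applied to the coordinate columns can also be written as one small helper that handles all four suffixes. This is a sketch only; the function name `parse_coordinate` is hypothetical and not part of the notebook.

```python
def parse_coordinate(text):
    """Convert a string such as '34.56N' or '107.03W' to a signed float."""
    value, hemisphere = float(text[:-1]), text[-1].upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unexpected hemisphere suffix in {text!r}")
    # South and West are negative by convention.
    return -value if hemisphere in ("S", "W") else value


lat = parse_coordinate("34.56N")
lon = parse_coordinate("107.03W")
print(lat, lon)
```

It could be applied column-wise, for example with `final_df["Latitude"].apply(parse_coordinate)`, before the coordinates are used in the correlation or clustering steps.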

In [54]:
final_df.shape

(163059, 9)

In [55]:
final_df.head()

Unnamed: 0,City,Year,AverageTemperature,NO2 Mean,SO2 Mean,CO Mean,CO2-natural-pross,Latitude,Longitude
0,Albuquerque,2011,11.7585,13.406795,0.404034,0.205522,403705.528314,34.56,-107.03
1,Albuquerque,2011,11.7585,13.406795,0.404034,0.205522,70327.167044,34.56,-107.03
2,Albuquerque,2011,11.7585,13.406795,0.404034,0.205522,61128.400593,34.56,-107.03
3,Albuquerque,2011,11.7585,13.406795,0.404034,0.205522,104945.713771,34.56,-107.03
4,Albuquerque,2011,11.7585,13.406795,0.404034,0.205522,53020.981116,34.56,-107.03
