## Temperature forecasting for different cities in the world

## Introduction
For this project you are asked to analyze three datasets, called respectively:
1. pollution_us_2000_2016.csv
2. greenhouse_gas_inventory_data_data.csv
3. GlobalLandTemperaturesByCity.csv

You are asked to extract from dataset 2 only the records for the United States (the country for which we have information in the other datasets) and to perform the following tasks:
- measure how pollution and temperature form clusters that trace the highly populated cities in the world
- analyze the correlation between pollution data and temperature change
- predict the yearly temperature change of a given city over a given time period using the <b>ARIMA model</b> for <b>time series forecasting</b>, that is, a model that combines autoregressive (AR) terms, differencing (the "integrated" part), and moving-average (MA) terms
- (OPTIONAL) rank the 5 cities that will have the highest temperature change in the US


### TASK 1: Cluster Analysis
You use K-means or DBSCAN to perform the cluster analysis, and create a new dataset in which each city is associated with its identified cluster.
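As a rough illustration of this task, the sketch below clusters a handful of cities with K-means and attaches the cluster label to each city. The city names and feature values are made up for the example, and the choice of two clusters and of `StandardScaler` is an assumption, not something the assignment prescribes.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy per-city features (made-up values, for illustration only)
cities = pd.DataFrame({
    "City": ["A", "B", "C", "D", "E", "F"],
    "AverageTemperature": [10.0, 11.0, 10.5, 25.0, 24.0, 26.0],
    "NO2 Mean": [30.0, 32.0, 29.0, 8.0, 9.0, 7.5],
})

features = ["AverageTemperature", "NO2 Mean"]

# Standardize so that no single feature dominates the Euclidean distance
X = StandardScaler().fit_transform(cities[features])

# n_clusters=2 is a placeholder; in practice it could be chosen with
# the elbow method or a silhouette score
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cities["cluster"] = kmeans.fit_predict(X)

print(cities[["City", "cluster"]])
```

DBSCAN could be swapped in through `sklearn.cluster.DBSCAN`, which does not need the number of clusters up front but does need a neighbourhood radius (`eps`) and a minimum number of samples.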

### TASK 2: Correlation Analysis

You measure the correlation between:
- temperature and latitude
- temperature and pollution
- temperature change (difference between the average temperature measured over the last 3 years and the previous temperature) and pollution
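One possible reading of "temperature change" is the mean temperature of the last 3 years minus the mean of all earlier years. The sketch below computes it per city on made-up data and correlates it with a pollution column. The helper name `temperature_change` and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy yearly data for three hypothetical cities (made-up values)
df = pd.DataFrame({
    "City": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "Year": list(range(2010, 2015)) * 3,
    "AverageTemperature": [10.0, 10.0, 10.5, 10.5, 10.5,
                           15.0, 15.0, 16.0, 16.0, 16.0,
                           20.0, 20.0, 22.0, 22.0, 22.0],
    "NO2 Mean": [5.0] * 5 + [10.0] * 5 + [20.0] * 5,
})


def temperature_change(group, last_n=3):
    """Mean of the last `last_n` years minus the mean of all earlier years."""
    ordered = group.sort_values("Year")["AverageTemperature"]
    return ordered.iloc[-last_n:].mean() - ordered.iloc[:-last_n].mean()


# Build one row per city: its temperature change and its mean pollution level
rows = []
for city, group in df.groupby("City"):
    rows.append({
        "City": city,
        "temp_change": temperature_change(group),
        "NO2 Mean": group["NO2 Mean"].mean(),
    })
per_city = pd.DataFrame(rows)

# Pearson correlation between temperature change and pollution
corr = per_city["temp_change"].corr(per_city["NO2 Mean"])
print(per_city)
print(f"Pearson correlation: {corr:.3f}")
```

The same `.corr()` call applies to the other pairs listed above, for example `AverageTemperature` against `Latitude`.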


### TASK 3: Predicting the Temperature of a Given City across a Specified Time Period
After reading the temperature dataset, and before applying the ARIMA model, you perform the following steps for each city cluster:

- exploratory data analysis (EDA)
- data cleaning and preprocessing (converting the 'dt' (date) column to DateTime format, removing NaN values)
- feature selection
- make the time series stationary
- check for stationarity by calculating the Augmented Dickey-Fuller (ADF) test statistic
- identify the (p, q) order of the ARIMA model using the autocorrelation (ACF) and partial autocorrelation (PACF) plots

Then:

- fit the ARIMA model using the identified p, q values
- calculate the MSE with respect to the true temperature measurements to estimate the performance of the model


NOTE: ARIMA models need the data to be stationary, i.e. the data must not exhibit trend and/or seasonality. To identify and remove trend and seasonality, we can use:
- seasonal decomposition
- differencing

In [60]:
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
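# statsmodels.tsa.arima_model was removed in statsmodels 0.13;
# ARIMA is now provided by statsmodels.tsa.arima.model
from statsmodels.tsa.arima.model import ARIMA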
from sklearn.metrics import mean_squared_error
import ipywidgets as widgets


import seaborn as sns

import random

from gustavo_functions import *

# %load_ext autoreload
# %autoreload 2


In [62]:
hello_gustavo()

Hello 2 Gustavo!


## SECTION 1: Cluster Analysis

In [None]:
# read the csv file containing the pollution data
# df_pollution = pd.read_csv("cata/pollution_us_2000_2016.csv")

## SECTION 2: Correlation Analysis

## SECTION 3: ARIMA model for temperature forecasting

## Cleaning the data 

In [41]:
# Set the random seed for reproducibility
np.random.seed(42)

# Load the greenhouse gas inventory dataset
green = pd.read_csv("data-project3/greenhouse_gas_inventory_data_data(1).csv")

# Drop the 'country_or_area' column
green = green.drop(columns=["country_or_area"])

# Load the US pollution dataset, used here only for its list of cities
cities_df = pd.read_csv("data-project3/pollution_us_2000_2016.csv")

# Extract the unique cities from the 'City' column
unique_cities = cities_df["City"].unique()

# Randomly assign a city from the unique cities to each row in the green dataset.
# The assignment is uniform and synthetic: the greenhouse gas data is recorded
# per country and carries no city information of its own.
green["cities"] = np.random.choice(unique_cities, size=len(green))

# Save the modified dataset to a new CSV file called 'new_green.csv'
green.to_csv("new_green.csv", index=False)

In [53]:
# Load the datasets (make sure to update the file paths if necessary)
global_df = pd.read_csv("data-project3/GlobalLandTemperaturesByCity.csv")
#green_df = pd.read_csv("data-project3/greenhouse_gas_inventory_data_data(1).csv")
#green_df = pd.read_csv("new_green.csv")
green_df = green
pollution_df = pd.read_csv("data-project3/pollution_us_2000_2016.csv")


### 1 Cleaning the GlobalLandTemperaturesByCity dataset
# Filter only data for the United States
global_df = global_df[global_df["Country"] == "United States"].copy()

# Convert the date column to datetime format
global_df["dt"] = pd.to_datetime(global_df["dt"])

# Extract the year from the date column
global_df["Year"] = global_df["dt"].dt.year

# Remove NaN values in the temperature column
global_df.dropna(subset=["AverageTemperature"], inplace=True)

# Group by city and year, calculating the annual average temperature
global_clean = global_df.groupby(["City", "Year"])['AverageTemperature'].mean().reset_index()

# Extract the unique latitude/longitude pairs per city
latitude_df = global_df[["City", "Latitude","Longitude"]].drop_duplicates()

### 2 Cleaning the Greenhouse Gas dataset
# Filter only "United States of America"
#green_df = green_df[green_df["country_or_area"] == "United States of America"].copy()

# Remove NaN values in the value column
green_df.dropna(subset=["value"], inplace=True)

# Keep only key columns
green_clean = green_df[["year", "value"]] #green_df[["year", "value", "category"]]

### 3 Cleaning the Pollution dataset
# Convert the date column to datetime format
pollution_df["Date Local"] = pd.to_datetime(pollution_df["Date Local"])

# Extract the year from the date column
pollution_df["Year"] = pollution_df["Date Local"].dt.year

# Select key columns
pollution_clean = pollution_df[["City", "Year", "NO2 Mean", "SO2 Mean", "CO Mean"]]

# Average pollution values by city and year
pollution_clean = pollution_clean.groupby(["City", "Year"]).mean().reset_index()

###  Merging the datasets
# Merge temperature and pollution data
merged_df = pd.merge(global_clean, pollution_clean, on=["City", "Year"], how="inner")

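# Merge with greenhouse gas data on year. The greenhouse table has many rows
# per year, so each city-year row is repeated once per greenhouse record for that year.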
final_df = pd.merge(merged_df, green_clean, left_on="Year", right_on="year", how="inner")

# Drop the duplicate year column
final_df.drop(columns=["year"], inplace=True)

# Merge latitude data
final_df = pd.merge(final_df, latitude_df, on="City", how="left")

# Convert latitude values to numeric format (southern latitudes become negative)
final_df["Latitude"] = final_df["Latitude"].apply(lambda x: float(x[:-1]) * (-1 if x[-1] == 'S' else 1))

# Convert longitude to numeric format
final_df["Longitude"] = final_df["Longitude"].apply(lambda x: float(x[:-1]) * (-1 if x[-1] == 'W' else 1))

# Rename the greenhouse gas value column
final_df = final_df.rename(columns={'value': 'CO2-natural-pross'})

# Save the cleaned and merged dataset in CSV format
final_df.to_csv("cleaned_temperature_pollution_data.csv", index=False)

In [54]:
final_df.shape

(163059, 9)

In [55]:
final_df.head()

| | City | Year | AverageTemperature | NO2 Mean | SO2 Mean | CO Mean | CO2-natural-pross | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Albuquerque | 2011 | 11.7585 | 13.406795 | 0.404034 | 0.205522 | 403705.528314 | 34.56 | -107.03 |
| 1 | Albuquerque | 2011 | 11.7585 | 13.406795 | 0.404034 | 0.205522 | 70327.167044 | 34.56 | -107.03 |
| 2 | Albuquerque | 2011 | 11.7585 | 13.406795 | 0.404034 | 0.205522 | 61128.400593 | 34.56 | -107.03 |
| 3 | Albuquerque | 2011 | 11.7585 | 13.406795 | 0.404034 | 0.205522 | 104945.713771 | 34.56 | -107.03 |
| 4 | Albuquerque | 2011 | 11.7585 | 13.406795 | 0.404034 | 0.205522 | 53020.981116 | 34.56 | -107.03 |


## Profiling the data

In [57]:

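# Assumption: ProfileReport comes from ydata-profiling (formerly pandas-profiling);
# it is not imported explicitly above, so it may otherwise arrive via gustavo_functions.
from ydata_profiling import ProfileReport

profile = ProfileReport(final_df, title="cleaned_temperature_pollution_data")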

In [58]:
profile.to_notebook_iframe()
# To save the profile to an HTML file (not needed in this case):
#profile.to_file("report.html")
