# Predicting weather in the next hour using raw data

This notebook demonstrates how we can use vector search for time series forecasting on climate data with Pinecone.
We use the [Jena Climate dataset](https://www.kaggle.com/stytch16/jena-climate-2009-2016) for this example. The Jena Climate dataset contains quantities such as air temperature, atmospheric pressure, humidity, and wind direction, recorded every 10 minutes over several years.

In a tabular dataset like this, every row can be seen as a feature vector uniquely identified by its timestamp. Given a query vector for a certain time, we can run a similarity search over these vectors to find the most similar past observation, then use what followed that observation to predict the weather for the next hour. This is a very simple way to extract embeddings, and we want to see how far a basic similarity search like this can take us. The steps below show how to do this with Pinecone.
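
To make the idea concrete before bringing in Pinecone, here is a minimal sketch of nearest-neighbor forecasting using plain NumPy. The toy `history` values and the helper names `cosine_similarity` and `predict_next` are assumptions made for illustration; they are not part of the notebook or of the Pinecone client.

```python
import numpy as np

# Toy history: each row is one hourly observation
# (temperature in degC, relative humidity in %). Values are made up.
history = np.array([
    [10.0, 80.0],
    [12.0, 75.0],
    [15.0, 60.0],
    [14.0, 65.0],
    [11.0, 78.0],
])

def cosine_similarity(matrix: np.ndarray, vector: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of `matrix` and `vector`."""
    return (matrix @ vector) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector))

def predict_next(history: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Return the observation that followed the most similar historical row."""
    # The last row has no successor, so it cannot be used as a match.
    scores = cosine_similarity(history[:-1], query)
    best = int(np.argmax(scores))
    return history[best + 1]

query = np.array([12.1, 74.9])
prediction = predict_next(history, query)
print(prediction)
```

The query is closest in direction to the second row, so the sketch returns the third row as its forecast. The rest of the notebook follows the same pattern, with Pinecone doing the similarity search.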



### Install Pinecone




In [None]:
!pip install -qU \
    pinecone-client==3.1.0 \
    matplotlib==3.2.2 \
    tensorflow==2.9.2 \
    scikit-learn==1.0.2 \
    pandas==1.3.5 \
    tqdm \
    pinecone-notebooks==0.1.1

You can get your Pinecone API Key [here](https://www.pinecone.io/start/) if you don't have one.

In [None]:
import os

# initialize connection to pinecone (or get API key at app.pinecone.io)
if not os.environ.get("PINECONE_API_KEY"):
    from pinecone_notebooks.colab import Authenticate
    Authenticate()

In [None]:
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'

# configure client
pc = Pinecone(api_key=api_key)

Now we set up our index specification, which lets us define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [None]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

### Import other dependencies

In [None]:
import matplotlib as mpl
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from typing import List
import itertools

mpl.rcParams['figure.figsize'] = (20, 16)
mpl.rcParams['axes.grid'] = False

### Load the dataset

In [None]:
zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip',
    fname='jena_climate_2009_2016.csv.zip',
    extract=True)
csv_path, _ = os.path.splitext(zip_path)

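Load the data into a dataframe and keep one reading per hour. The raw file has one row every 10 minutes, so we take every sixth row, starting at index 5.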

In [None]:
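original_data_for_insert = pd.read_csv(csv_path)
# keep every 6th row (one per hour); copy so the column assignment below
# modifies an independent dataframe rather than a slice of the original
original_data_for_insert = original_data_for_insert[5::6].copy()

original_data_for_insert['Date Time'] = pd.to_datetime(original_data_for_insert['Date Time'], format='%d.%m.%Y %H:%M:%S')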
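
The effect of the `[5::6]` slice can be checked on synthetic timestamps. This is a small sketch that assumes, as in the Jena file, that the first 10-minute reading falls at 00:10; the dataframe `df` and its `value` column are made up for illustration.

```python
import pandas as pd

# Synthetic 10-minute timestamps starting at 00:10 (assumed start time).
stamps = pd.date_range('2009-01-01 00:10:00', periods=18, freq='10min')
df = pd.DataFrame({'Date Time': stamps, 'value': range(18)})

# Start at position 5 and take every 6th row: one reading per hour.
hourly = df[5::6]
print(hourly['Date Time'].tolist())
```

Under that assumption, positions 5, 11, and 17 land exactly on the hour, which is why the slice starts at 5 and not at 0.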

Split data into data that is going to be inserted into Pinecone, and data that is going to be used for querying.

In [None]:
n = len(original_data_for_insert)
train_data = original_data_for_insert[:int(n*0.9)]
test_data = original_data_for_insert[int(n*0.9):]


Let's see what the data looks like.

In [None]:
train_data.head()

Prepare data for upload. We use the date and time of each row as its vector ID, so that every match returned by a query can be traced back to a timestamp.

In [None]:
items_to_upload = []
for row in train_data.values.tolist():
    key = str(row[0])
    values = row[1:]
    items_to_upload.append((key, values))
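
As a quick illustration of the shape of these items, the sketch below builds `(id, values)` tuples from a two-row dataframe. The dataframe `frame` and its readings are made up for this example. Note that `str()` of a pandas timestamp yields the `%Y-%m-%d %H:%M:%S` format, which is the format the IDs are parsed with later on.

```python
from datetime import datetime
import pandas as pd

# Tiny stand-in for the training data (illustrative values only).
frame = pd.DataFrame({
    'Date Time': pd.to_datetime(['2009-01-01 01:00:00', '2009-01-01 02:00:00']),
    'T (degC)': [-8.05, -8.88],
    'rh (%)': [94.4, 93.2],
})

# First column becomes the string ID, the remaining columns the vector.
items = [(str(row[0]), row[1:]) for row in frame.values.tolist()]
print(items[0])

# The ID can be parsed back into a datetime with the matching format string.
parsed = datetime.strptime(items[0][0], '%Y-%m-%d %H:%M:%S')
```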

Prepare data that is going to be queried.
Here we create two lists - one with dates that are going to be queried and the other one with vectors.



In [None]:
query_dates = []
query_data = []
for row in test_data.values.tolist():
    query_dates.append(str(row[0]))
    query_data.append(row[1:])

### Setting up an index

In [None]:
# Pick a name for the new index
index_name = 'time-series-weather'

In [None]:
import time

# check if index already exists (it shouldn't if this is first time)
if index_name not in pc.list_indexes().names():
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=14,
        metric='cosine',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
# view index stats
index.describe_index_stats()

In [None]:
# Upload items
def chunks(iterable, batch_size=100):
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

for batch in chunks(items_to_upload, 500):
    index.upsert(vectors=batch)
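
To see how the `chunks` helper splits its input, it can be run on a small range without touching the index. The helper is repeated here so the snippet runs on its own; the batch size of 3 is chosen only for the demonstration.

```python
import itertools

def chunks(iterable, batch_size=100):
    """Yield successive tuples of at most `batch_size` items."""
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

batches = list(chunks(range(7), batch_size=3))
print(batches)  # [(0, 1, 2), (3, 4, 5), (6,)]
```

The last batch is simply shorter when the input length is not a multiple of the batch size, so no items are dropped.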

In [None]:
# Check the index size to confirm the data was upserted properly
index.describe_index_stats()

In [None]:
from tqdm.auto import tqdm

# Query items
all_query_results = []
for xq in tqdm(query_data):
    res = index.query(vector=xq, top_k=1)
    all_query_results.append(res)

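Here we create a function for getting predictions from Pinecone. For each query vector, we take the most similar vector in the index and read the value recorded one hour after that vector's timestamp. That value is our prediction, and the value recorded one hour after the query's own timestamp is the ground truth.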

In [None]:
from typing import Tuple

def get_predictions(feature: str) -> Tuple[List, List]:
    """Return (true_values, predicted_values) for `feature`, one hour ahead."""
    true_values = []
    predicted_values = []

    for test_date, qr in zip(query_dates, all_query_results):
        # the ID of the closest match is its timestamp
        similar_date = qr.matches[0].id
        hour_from_original = datetime.strptime(str(test_date), '%Y-%m-%d %H:%M:%S') + timedelta(hours=1)
        hour_from_similar = datetime.strptime(similar_date, '%Y-%m-%d %H:%M:%S') + timedelta(hours=1)

        # readings one hour after the query and one hour after its match
        true_next = original_data_for_insert.loc[original_data_for_insert['Date Time'] == hour_from_original][feature].tolist()
        predicted_next = original_data_for_insert.loc[original_data_for_insert['Date Time'] == hour_from_similar][feature].tolist()

        # skip pairs where either next-hour reading is missing
        if true_next and predicted_next:
            true_values.append(true_next[0])
            predicted_values.append(predicted_next[0])
    return true_values, predicted_values
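
The next-hour lookup at the heart of this function can be tried on a toy dataframe. This is a sketch only: the helper `value_one_hour_after` and the readings in `data` are hypothetical, and the three-row frame deliberately has a gap at 03:00 to show what happens when the following hour is missing.

```python
from datetime import datetime, timedelta
import pandas as pd

# Toy hourly data with a gap: there is no row for 03:00.
data = pd.DataFrame({
    'Date Time': pd.to_datetime([
        '2009-01-01 01:00:00',
        '2009-01-01 02:00:00',
        '2009-01-01 04:00:00',
    ]),
    'T (degC)': [-8.0, -8.5, -7.0],
})

def value_one_hour_after(frame: pd.DataFrame, timestamp: str, feature: str):
    """Return `feature` one hour after `timestamp`, or None if that row is missing."""
    target = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S') + timedelta(hours=1)
    matches = frame.loc[frame['Date Time'] == target, feature].tolist()
    return matches[0] if matches else None

found = value_one_hour_after(data, '2009-01-01 01:00:00', 'T (degC)')
missing = value_one_hour_after(data, '2009-01-01 02:00:00', 'T (degC)')
print(found, missing)
```

When the following hour is absent, the lookup comes back empty, which is why the function above only keeps pairs where both lookups succeed.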


In [None]:
def plot_results(predicted_values: List, true_values: List):
    x_list = range(0, len(predicted_values))
    plt.plot(x_list[:200], predicted_values[:200], label='forecast')
    plt.plot(x_list[:200], true_values[:200], label='true')
    plt.legend()
    plt.show()

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

def print_results(true_values: List, predicted_values: List):
    print(f'MSE: {mean_squared_error(true_values, predicted_values)}')
    print(f'RMSE: {mean_squared_error(true_values, predicted_values, squared=False)}')
    print(f'MAE: {mean_absolute_error(true_values, predicted_values)}')
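
For reference, the three metrics can be computed by hand with NumPy. The arrays below are made-up numbers chosen so the arithmetic is easy to follow; they are not results from the dataset.

```python
import numpy as np

true_values = np.array([1.0, 2.0, 3.0, 4.0])
predicted_values = np.array([1.0, 3.0, 3.0, 2.0])

errors = predicted_values - true_values  # [0, 1, 0, -2]

mse = np.mean(errors ** 2)     # (0 + 1 + 0 + 4) / 4 = 1.25
rmse = np.sqrt(mse)            # square root of the MSE
mae = np.mean(np.abs(errors))  # (0 + 1 + 0 + 2) / 4 = 0.75

print(f'MSE: {mse}, RMSE: {rmse}, MAE: {mae}')
```

RMSE and MAE are in the same units as the feature itself, which makes them easier to interpret than MSE. RMSE penalizes large individual errors more heavily than MAE does.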

### Results

To evaluate our results, we plot the predicted and true values for each of the 14 features and print the error metrics.

In [None]:
for feature in original_data_for_insert.columns[1:]:
    print(f'Analyzing predictions for {feature}')
    true_values, predicted_values = get_predictions(feature)
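    # pass by keyword so the 'forecast' and 'true' labels match the right series
    plot_results(predicted_values=predicted_values, true_values=true_values)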
    print_results(true_values, predicted_values)

### Summary

From the plots above, we can see that the method predicts quite accurately for features like VPdef, VPmax, and rh (%), roughly accurately for features like H2OC and rho, and poorly for the wind features wd, max. wv, and wv. Given how simple the approach is, and that it involves no feature engineering, it does pretty well in some spots!

We could likely improve these predictions with more complex methods like LSTMs, which are better suited to sequential data like this.

### Delete the Index

Once we no longer need the index, we can delete it.


*Note: Index deletion is permanent*

In [None]:
pc.delete_index(index_name)