In [1]:
import seaborn as sns
sns.set()

In [2]:
import numpy as np
import pandas as pd
import datetime as dt
from static_grader import grader

# Time Series Data: Predict Temperature

Time series prediction presents challenges that differ from those of other machine-learning problems.  As with many other classes of problems, these predictions share a number of common features.


## A note on scoring

It **is** possible to score >1 on these questions. A score above 1 means that you've beaten our reference model: we compare our model's score on a test set to your score on a test set. See how high you can go!


## Fetch the data:

In [3]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'train.v2.csv.gz'

The data can be loaded into pandas easily:

In [4]:
df = pd.read_csv('train.v2.csv.gz')
df.head()

Unnamed: 0,station,time,temp,dew_point,pressure,wind_speed,wind_direction,precip_hour,weather_codes
0,PHX,2010-01-01 00:51,62.06,15.98,1024.9,3.0,20.0,M,M
1,PHX,2010-01-01 01:51,60.08,17.96,1025.3,4.0,50.0,M,M
2,PHX,2010-01-01 02:51,59.0,17.96,1025.6,4.0,30.0,M,M
3,PHX,2010-01-01 03:51,53.96,21.92,1026.0,0.0,0.0,M,M
4,PHX,2010-01-01 04:51,55.94,17.06,1026.2,5.0,40.0,M,M


In [5]:
len(df)

392136

In [6]:
# Step 1: Identify the flag value for the 'temp' column
flag_value = 'M'

# Step 2: Keep only the rows with a valid temperature, then renumber the index.
# Chaining reset_index avoids modifying a filtered slice in place.
df_filtered = df[df['temp'] != flag_value].reset_index(drop=True)

len(df_filtered)

392020

The `station` column indicates the city.  The `time` is measured in UTC.  Both `temp` and `dew_point` are measured in degrees Fahrenheit.  The `wind_speed` is in knots, and the `precip_hour` measures the hourly precipitation in inches.

Missing values are indicated by a flag value.  Remove rows without valid temperature measurements.  You may also want to change some data types. (But keep in mind that the data provided by the grader will have the same data types as `pd.read_csv` provided.)
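
A minimal sketch of the cleanup described above, assuming the flag value is the string `'M'` seen in the sample rows. The tiny frame `toy` is made up for illustration and is not part of the course data.

```python
import pandas as pd

# Hypothetical miniature of the weather data; 'M' marks a missing value.
toy = pd.DataFrame({
    'station': ['PHX', 'PHX', 'NYC'],
    'time': ['2010-01-01 00:51', '2010-01-01 01:51', '2010-01-01 00:51'],
    'temp': ['62.06', 'M', '33.98'],
})

# Coerce the flag (and anything else non-numeric) to NaN, then drop those rows.
toy['temp'] = pd.to_numeric(toy['temp'], errors='coerce')
clean = toy.dropna(subset=['temp']).reset_index(drop=True)

print(clean.dtypes['temp'], len(clean))
```

Converting `temp` to a numeric type this way is one option; filtering on the flag string, as done in the cell above, removes the same rows but leaves the column's data type unchanged.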

We will focus on using the temporal elements to predict the temperature.


# Questions


For each question, build a model to predict the temperature in a given city at a given time.  You will be given a DataFrame, as we got from `pd.read_csv`.  (As you can imagine, the temperature values will be nonsensical in the DataFrame you are given.)  Return a collection of predicted temperatures, one for each incoming row in the DataFrame.  

## One-city model

As you may have noticed, the data contains rows for multiple cities.  We'll deal with all of them soon, but for this first question, we'll focus on only the data from New York (`"NYC"`).  Start by isolating only those rows.

In [7]:
NYC_filtered = df_filtered[df_filtered['station'] == 'NYC']

In [8]:
NYC_filtered.head()

Unnamed: 0,station,time,temp,dew_point,pressure,wind_speed,wind_direction,precip_hour,weather_codes
314275,NYC,2010-01-01 00:51,33.98,30.92,1017.7,3.0,40.0,M,BR
314276,NYC,2010-01-01 01:51,33.98,30.92,1017.5,0.0,0.0,0.01,-SN BR
314277,NYC,2010-01-01 02:51,33.98,30.92,1016.8,0.0,0.0,0.03,UP BR
314278,NYC,2010-01-01 03:51,33.1,32.0,1016.5,5.0,60.0,0.02,-SN BR
314279,NYC,2010-01-01 04:51,33.08,30.92,1015.8,0.0,0.0,0.01,-SN BR


Seasonal features are nice because they are relatively safe to extrapolate into the future. There are two ways to handle seasonality.  

The simplest (and perhaps most robust) is to have a set of indicator variables. That is, make the assumption that the temperature at any given time is a function of only the month of the year and the hour of the day, and use that to predict the temperature value.

**Question**: Should month be a continuous or categorical variable?  (Recall that [one-hot encoding](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) is useful to deal with categorical variables.)
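
To make the question concrete: a continuous month forces a linear model to fit a single slope from January to December, whereas one-hot encoding gives every month its own column and therefore its own offset. The sketch below only illustrates the encoding; the three sample months are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

months = np.array([[1], [6], [12]])  # January, June, December

# Declare all twelve months up front so the output always has 12 columns.
encoder = OneHotEncoder(categories=[list(range(1, 13))])

# The encoder returns a sparse matrix by default; densify it for inspection.
encoded = encoder.fit_transform(months).toarray()

print(encoded.shape)           # one row per sample, one column per month
print(encoded.argmax(axis=1))  # position of the single 1 in each row
```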

Build a model to predict the temperature for a given hour in a given month in New York.

In [9]:
train_data = NYC_filtered[:int(len(NYC_filtered)*0.8)]
valid_data = NYC_filtered[int(len(NYC_filtered)*0.8):]

In [10]:
type(pd.to_datetime(df_filtered['time'][1]))

pandas._libs.tslibs.timestamps.Timestamp

In [11]:
from sklearn import base
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class MonthTime(base.BaseEstimator, base.TransformerMixin):
    """Extract month-of-year and hour-of-day features from the 'time' column."""

    def fit(self, X, y=None):
        # Since we don't use X for fitting in this transformer, we can ignore it.
        return self

    def transform(self, X):
        X=X.copy()
        
        # Convert the 'time' column to datetime objects
        X['time'] = pd.to_datetime(X['time'])
        
        # Extract month of the year from datetime
        X['month'] = X['time'].dt.month
        
        # Extract hour of the day from datetime
        X['hour'] = X['time'].dt.hour
        
        # Drop the original 'time' column
        X = X.drop('time', axis=1)
        
        return X[['month', 'hour']]


model_drift = Pipeline([('drift', MonthTime()),
                    ('onehot', OneHotEncoder(drop='first')),
                    ('regressor', LinearRegression())])

#train_data = train_data.drop('temp', axis=1)

model_drift.fit(train_data, train_data.temp)


The grader will provide a DataFrame in the same format as `pd.read_csv` provided.  All of the temperature data will be redacted.  As long as your model accepts DataFrame input, you should be able to run the grader line below as-is.  If your model is expecting a different input, you will need to write an adapter function.
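
As an illustration of what such an adapter might look like, the sketch below assumes a model that was trained on a plain `[month, hour]` array instead of a DataFrame. The names `to_month_hour`, `array_model`, and `predict_adapter` are hypothetical, and the two-row frame is made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def to_month_hour(df):
    """Turn the raw 'time' strings into a plain [month, hour] array."""
    times = pd.to_datetime(df['time'])
    return np.column_stack([times.dt.month, times.dt.hour])

# Hypothetical rows in the same layout that pd.read_csv produced.
raw = pd.DataFrame({'time': ['2010-01-01 00:51', '2010-07-01 12:51'],
                    'temp': [33.98, 80.06]})

# A model trained on arrays rather than DataFrames.
array_model = LinearRegression().fit(to_month_hour(raw), raw['temp'])

def predict_adapter(df):
    """Accept a DataFrame and hand the model the array it expects."""
    return array_model.predict(to_month_hour(df))

print(predict_adapter(raw))
```

In that situation, `predict_adapter` rather than `array_model.predict` would be the function passed to `grader.score`.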

In [12]:
grader.score('ts__one_city_model', model_drift.predict)

Your score: 0.9971


## Per-city model

Now we want to extend this same model to handle all of the cities in our data set.  Rather than adding features to the existing model to handle this, we'll just make a new copy of the model for each city.

If your model is a single class, then this is easy: you can just instantiate your class once per city.  But it's more likely your model was a particular instance of a Pipeline.  If that's the case, make a **factory function** that returns a new copy of that Pipeline each time it's called.

In [13]:
class MonthTime(base.BaseEstimator, base.TransformerMixin):
    """Extract month-of-year and hour-of-day features from the 'time' column."""

    def fit(self, X, y=None):
        # Since we don't use X for fitting in this transformer, we can ignore it.
        return self

    def transform(self, X):
        X = X.copy()
        # Convert the 'time' column to datetime objects
        X['time'] = pd.to_datetime(X['time'])
        
        # Extract month of the year from datetime
        X['month'] = X['time'].dt.month
        
        # Extract hour of the day from datetime
        X['hour'] = X['time'].dt.hour
        
        # Drop the original 'time' column
        X = X.drop('time', axis=1)
        
        return X[['month', 'hour']]

In [14]:
def season_factory(city_name):
    # Define the pipeline components
    pipeline = Pipeline([
        ('drift', MonthTime()),
        ('onehot', OneHotEncoder(drop='first')),
        ('regressor', LinearRegression())
    ])
    return pipeline

Calling this function should give a new copy of the Pipeline.  If we train that new copy on the New York data, it should give us the same model as before.  (You might check this by submitting such a model to the previous `grader.score` call.)
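
A quick way to convince yourself that a factory really returns independent copies is to fit one of them and check that the other is still unfitted. The sketch below uses a hypothetical `toy_factory` with a simplified pipeline and made-up numbers, not the `season_factory` defined above.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def toy_factory():
    # Hypothetical stand-in for season_factory: build everything fresh per call.
    return Pipeline([('onehot', OneHotEncoder(handle_unknown='ignore')),
                     ('regressor', LinearRegression())])

first, second = toy_factory(), toy_factory()

# Fit only the first pipeline; the second must remain untouched.
first.fit([[1], [2], [3]], [10.0, 20.0, 30.0])

print(first is second)
print(hasattr(first.named_steps['regressor'], 'coef_'),
      hasattr(second.named_steps['regressor'], 'coef_'))
```

Sharing one Pipeline object across cities would instead mean that each `fit` call overwrites the previous city's coefficients.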

While we could manually call this function for each city in our dataset, let's build a "group-by" estimator that does this for us.  This estimator should take a column name and a factory function as arguments.  The `fit` method will group the incoming data by that column, and for each group it will call the factory to create a new instance to be trained on that group.  Then, the `predict` method should look up the corresponding model for each row and perform a predict using that model.

from sklearn import base

class GroupbyEstimator(base.BaseEstimator, base.RegressorMixin):
    
    def __init__(self, column, estimator_factory):
        # column is the value to group by; estimator_factory can be
        # called to produce estimators
        self.column = column
        self.estimator_factory = estimator_factory
    
    def fit(self, X, y):
        # Create an estimator and fit it with the portion in each group
        return self

    def predict(self, X):
        # Call the appropriate predict method for each row of X
        return ...

In [15]:
from sklearn import base

class GroupByEstimator(base.BaseEstimator, base.RegressorMixin):
    
    def __init__(self, column, pipeline_factory):
        self.column = column
        self.pipeline_factory = pipeline_factory
        self.estimators = {}  # Dictionary to hold the trained estimators for each group
    
    def fit(self, X, y):
        groups = X.groupby(self.column)
        
        for group_value, group_indices in groups.groups.items():
            # Create a new instance of the pipeline using the provided factory
            pipeline = self.pipeline_factory(group_value)
            
            # Get the corresponding data and target values for this group
            group_X = X.loc[group_indices]
            group_y = y[group_indices]
            
            # Fit the pipeline with the group data
            pipeline.fit(group_X, group_y)
            
            # Store the trained pipeline for this group
            self.estimators[group_value] = pipeline
        
        return self
    
    def predict(self, X):
        predictions = []
        
        for _, row in X.iterrows():
            group_value = row[self.column]
            
            if group_value in self.estimators:
                pipeline = self.estimators[group_value]
                # Predict with this group's pipeline; the row must be turned back
                # into a one-row DataFrame before it can be passed to .predict.
                # TODO: this predicts one row at a time, which is slow; predicting
                # a whole group at once would be much faster.
                prediction = pipeline.predict(row.to_frame().transpose())[0]
                predictions.append(prediction)
            else:
                # If no pipeline is available for the group, predict 0
                predictions.append(0)
        
        return predictions

In [16]:
x = df_filtered.iloc[1]

In [17]:
x.to_frame().transpose()

Unnamed: 0,station,time,temp,dew_point,pressure,wind_speed,wind_direction,precip_hour,weather_codes
1,PHX,2010-01-01 01:51,60.08,17.96,1025.3,4.0,50.0,M,M


Now, we should be able to build an equivalent model for each city:

In [18]:
train_data = df_filtered[:int(len(df_filtered)*0.8)]

In [20]:
groupby_estimator = GroupByEstimator('station', season_factory)
groupby_estimator.fit(df_filtered, df_filtered.temp)

# Assuming df_filtered is a DataFrame with similar structure as your input data
#predictions = groupby_estimator.predict(df_filtered)

Again, as long as this model accepts a DataFrame as input, you should be able to pass the `predict` method to the grader.

In [21]:
grader.score('ts__month_hour_model', groupby_estimator.predict)

Your score: 1.0000


## Handling data in arbitrary order

Submit the same model again to the following scorer:

In [None]:
grader.score('ts__shuffled_model', groupby_estimator.predict)

If you passed, congratulations: you avoided a common pitfall!  Move on to the next question.

If your model suddenly performed worse, here is why: in the previous question, we provided each city's rows in contiguous groups.  In this question, the rows are all shuffled together.  If you predict for one group at a time and simply append those grouped predictions to form the final output, the predictions will be in the wrong order.

There are two ways to fix this:
1. Predict for each row individually.  This is straightforward, but very, _very_ slow.
2. Predict for each group, and then reorder the predictions to match the input order.  A common way to do this is to attach the index of the feature matrix to the predictions, and then order the full prediction series by the index of the feature matrix.

Once you've fixed your `GroupByEstimator.predict` method, resubmit to this question.
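
The second approach can be sketched as follows. This is one possible implementation, not the reference solution: `ConstantModel` and `predict_by_group` are hypothetical names, and the sketch writes each group's predictions back by row position, so it does not depend on the index being sorted or unique.

```python
import numpy as np
import pandas as pd

class ConstantModel:
    """Hypothetical stand-in for a fitted per-city pipeline."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value)

def predict_by_group(estimators, column, X):
    """Predict group by group, writing results back in input row order."""
    # Rows whose group has no estimator keep the default prediction of 0.
    predictions = np.zeros(len(X))
    # .indices maps each group value to the positions of its rows in X.
    for group_value, positions in X.groupby(column).indices.items():
        if group_value in estimators:
            model = estimators[group_value]
            predictions[positions] = model.predict(X.iloc[positions])
    return predictions

estimators = {'NYC': ConstantModel(30.0), 'PHX': ConstantModel(60.0)}

# Shuffled rows with a deliberately unsorted, non-unique index.
shuffled = pd.DataFrame({'station': ['PHX', 'NYC', 'PHX', 'NYC']},
                        index=[7, 2, 7, 0])

print(predict_by_group(estimators, 'station', shuffled))
```

Each model is called once per city instead of once per row, which is where the speedup over row-by-row prediction comes from.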

## Fourier model

Let's consider another way to deal with the seasonal terms.  Since we know that temperature is roughly sinusoidal, we know that a reasonable model might be

$$ y_t = y_0 \sin\left(2\pi\frac{t - t_0}{T}\right) + \epsilon $$

where $y_0$ and $t_0$ are parameters to be learned and $T$ is the period (one year for seasonal variation, one day for daily variation, and so on).  While this is linear in $y_0$, it is not linear in $t_0$. However, we know from Fourier analysis that the above is equivalent to

$$ y_t = A \sin\left(2\pi\frac{t}{T}\right) + B \cos\left(2\pi\frac{t}{T}\right) + \epsilon $$

which is linear in $A$ and $B$.
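
To see why the two forms agree, apply the angle-subtraction identity $\sin(a - b) = \sin a \cos b - \cos a \sin b$ to the first form:

$$ y_0 \sin\left(2\pi\frac{t - t_0}{T}\right) = \underbrace{y_0 \cos\left(2\pi\frac{t_0}{T}\right)}_{A} \sin\left(2\pi\frac{t}{T}\right) + \underbrace{\left(-y_0 \sin\left(2\pi\frac{t_0}{T}\right)\right)}_{B} \cos\left(2\pi\frac{t}{T}\right) $$

If the original parameters are needed after fitting, they can be recovered from $y_0 = \sqrt{A^2 + B^2}$ and $\tan\left(2\pi t_0 / T\right) = -B/A$.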

Create a model containing sinusoidal terms on one or more time scales, and fit it to the data using a linear regression.  Build a `fourier_factory` function that will return instances of this model.

In [None]:
def fourier_factory():
    return ...
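
One possible shape for such a factory is sketched below. It is not the reference solution: the `FourierFeatures` transformer, the choice of a daily and a yearly period, the time origin, and the name `fourier_factory_sketch` are all assumptions made for illustration. The check at the end uses synthetic data, not the course data.

```python
import numpy as np
import pandas as pd
from sklearn import base
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class FourierFeatures(base.BaseEstimator, base.TransformerMixin):
    """Hypothetical transformer: sin/cos terms for each period (in hours)."""

    def __init__(self, periods=(24.0, 24.0 * 365.25)):
        self.periods = periods

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Hours elapsed since an arbitrary fixed origin.
        times = pd.to_datetime(X['time'])
        t = ((times - pd.Timestamp('2010-01-01')) / pd.Timedelta(hours=1)).to_numpy()
        columns = []
        for T in self.periods:
            columns.append(np.sin(2 * np.pi * t / T))
            columns.append(np.cos(2 * np.pi * t / T))
        return np.column_stack(columns)

def fourier_factory_sketch(city_name=None):
    # city_name is accepted only so the sketch can be called the way
    # GroupByEstimator calls its factory; it is not used.
    return Pipeline([('fourier', FourierFeatures()),
                     ('regressor', LinearRegression())])

# Synthetic check: a pure daily sine wave should be recovered almost exactly.
hours = pd.date_range('2010-01-01', periods=24 * 30, freq=pd.Timedelta(hours=1))
frame = pd.DataFrame({'time': hours.strftime('%Y-%m-%d %H:%M')})
elapsed = np.arange(len(hours))
frame['temp'] = 50 + 10 * np.sin(2 * np.pi * (elapsed - 9) / 24)

model = fourier_factory_sketch().fit(frame, frame['temp'])
error = np.abs(model.predict(frame) - frame['temp']).max()
print(error < 1e-6)
```

Because the phase shift is absorbed into the sine and cosine coefficients, the linear regression never has to learn $t_0$ directly.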

A general `GroupByEstimator` should be able to take the new factory function and build a model for each city.

In [None]:
fourier_model = ...

Submit this model to the grader.

In [22]:
grader.score('ts__fourier_model', groupby_estimator.predict)

Your score: 0.9939


*Copyright &copy; 2022 Pragmatic Institute. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.*