# Yosemite Village yearly weather

## Part 0 - Data Preprocessing 
Temperature is cyclical, not only on a 24 hour basis but also on a yearly basis. Convert the dataset into a richer format whereby the day of the year is also captured. For example the time “20150212 1605”, can be converted into (43, 965) because the 12th of February is the 43rd day of the year, and 16:05 is the 965th minute of the day.

This data covers 6 years, so split the data into a training set of the first 5 years, and a testing set of the 6th year.

In [3]:
import numpy as np
import pandas as pd
from datetime import datetime
from datetime import date


In [4]:
years = range(2011, 2017) 
files = ['CRNS0101-05-%d-CA_Yosemite_Village_12_W.txt' % y for y in years]

usecols = [1, 2, 8] #[UTC_time, UTC date, Temperature]
tr = [np.loadtxt(f, usecols=usecols) for f in files]
ts = np.loadtxt('CRNS0101-05-2016-CA_Yosemite_Village_12_W.txt', usecols=usecols)

In [3]:
## TEST SET ## 

#convert dates from YYYYMMDD format to days since start of year
ts_year = '20160101'
ts_year_start = datetime.strptime(ts_year, '%Y%m%d')
ts_date=pd.to_datetime(ts[:,0], format='%Y%m%d')
ts_date.to_numpy()
ts_days = ts_date - ts_year_start 

#convert time of day from HHMM to minutes since start of day 
hour_mins = np.divmod(ts[:,1],np.full((105408,), 100)) #two rows for hour and minutes each
ts_minutes = (hour_mins[0]*60)+hour_mins[1] #add them together to get new array of time 
ts_minutes = ts_minutes.astype(int)


## TRAIN SET ## 

tr_years = ['20110101','20120101','20130101','20140101','20150101']
train_dates = [] #list for arrays of days for each year 
train_minutes = [] #list for array of day minutes for each year
for i in range(len(tr_years)):
    #convert date 
    tr_year_start = datetime.strptime(tr_years[i], '%Y%m%d')
    year_data = tr[i]
    year_data=pd.to_datetime(year_data[:,0], format='%Y%m%d')
    year_data.to_numpy()
    year_data = year_data - tr_year_start 
    train_dates.append(year_data)
    #convert time
    h_n_m = np.divmod(ts[:,1],np.full((105408,), 100))
    train_minutes.append((h_n_m[0]*60)+h_n_m[1])

    
#merge everything into the same arrays 


### Part 1 - Applying Radial Basis Functions 
Cover each input dimension with a list of radial basis functions. This turns the pair of inputs into a much richer representation, mapping (d,t) into (Φ₁(d), Φ₂(t)). Experiment with different numbers of radial basis functions and different widths of the radial basis function in different dimensions.

### Part 2 - Build Linear Parameter Model 
Using this new representation, build a linear parameter model that captures both seasonal variations and daily variations.

## Part 3 - Visualization
- Create two plots, one showing the time-of-day contribution, and one showing the time-of-year contribution.
- (Optional) Make a 3D plot showing temperature as a function of (day, time). Make sure to label your axes!

## Part 4 - Evaluation 
Using R², quantify how your model performs on the testing data if you:
- Train with just the daily component of the model
- Train with just the yearly component of the model
- Train with the full model.

In [6]:
### PCW ### 


import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics.pairwise import rbf_kernel #get radial basis function kernel 
from sklearn.linear_model import Ridge
import matplotlib.pyplot as plt
import numpy as np

years = range(2011, 2017)
files = ['CRNS0101-05-%d-CA_Yosemite_Village_12_W.txt' % y for y in years]


usecols = [1, 2, 8] #[WBANNO (station number), UTC Date, Latitude]

data = [np.loadtxt(f, usecols=usecols) for f in files] #load data with relevant columns
#vstack() function is used to stack the sequence of input arrays vertically to make a single array. 
data = np.vstack(data) 

print(data)

# Map data from HHmm to an integer
data[:, 1] = np.floor_divide(data[:, 1], 100) * 60 + np.mod(data[:, 1], 100)
valid = data[:, 2] > -1000 

x_train = data[valid, 1].reshape(-1, 1) #utc time in minutes 
y_train = data[valid, 2] #latitude

import random 
sigma = 0.1
alp = 0.0001

number_of_rows = x_train.shape[0]
random_indices = np.random.choice(number_of_rows, size=1000, replace=True)
w = np.random.sample(size =1000)

x_train = x_train.reshape(-1,1)
y_train = y_train.reshape(-1,1)
x_train = x_train[random_indices, :]
y_train = y_train[random_indices, :]

print(max(x_train), max(y_train))
print(min(x_train), min(y_train))
print(len(x_train),len(y_train))

rbf = rbf_kernel(y_train, x_train, gamma=1.0/sigma)
regression = Ridge(alpha=alp, fit_intercept=False)
regression.fit(rbf, y_train)

print("Score on training data = ", regression.score(rbf, y_train))
all_rbf = np.linspace(-3.0, 5.0, 1000).reshape(-1, 1)

# New representation:
expanded_rbf = rbf_kernel(all_rbf, y_train, gamma=1 / sig)
all_y = regression.predict(expanded_rbf)

print("all_x.shape", all_rbf.shape)
print("expanded_x.shape", expanded_rbf.shape)
print("all_y.shape", all_y.shape)

# Show that the predictions tend to zero far away from inputs
plt.figure()
plt.plot(all_rbf, all_y)
#plt.scatter(x_train, weights)

# Zoom in and see how well predictions fit the data
zoom_ind = (all_rbf > x_train.min()) & (all_rbf < x_train.max())
plt.figure()
print(len(zoom_ind))

#plt.plot(all_rbf[zoom_ind], all_y[zoom_ind])
plt.scatter(x_train, y_train)
plt.show()

[[ 2.0110101e+07  5.0000000e+00 -6.4000000e+00]
 [ 2.0110101e+07  1.0000000e+01 -6.5000000e+00]
 [ 2.0110101e+07  1.5000000e+01 -6.5000000e+00]
 ...
 [ 2.0161231e+07  2.3500000e+03  0.0000000e+00]
 [ 2.0161231e+07  2.3550000e+03 -1.0000000e-01]
 [ 2.0170101e+07  0.0000000e+00 -1.0000000e-01]]
[1435.] [28.6]
[0.] [-9.4]
1000 1000
Score on training data =  -1.492280899833423


NameError: name 'sig' is not defined