### Model Exercises

The end result of this exercise should be a Jupyter notebook named model.

Use saas.csv, log data from API usage, or store_item_sales as the data source. This notebook works with GlobalLandTemperaturesbyState.csv instead.

In [5]:
import numpy as np
import pandas as pd

#working with dates/tsa
from datetime import datetime
import statsmodels.api as sm
from statsmodels.tsa.api import Holt

# evaluate performance using rmse
from sklearn.metrics import mean_squared_error
from math import sqrt

#for visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from pandas.plotting import register_matplotlib_converters

# turn off warning boxes for presentation purposes
import warnings
warnings.filterwarnings("ignore")

In [3]:
# read csv to a dataframe
df = pd.read_csv('GlobalLandTemperaturesbyState.csv')

#look at initial rows, transposed
df.head().T
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 645675 entries, 0 to 645674
Data columns (total 5 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   dt                             645675 non-null  object 
 1   AverageTemperature             620027 non-null  float64
 2   AverageTemperatureUncertainty  620027 non-null  float64
 3   State                          645675 non-null  object 
 4   Country                        645675 non-null  object 
dtypes: float64(2), object(3)
memory usage: 24.6+ MB


In [6]:
# keep only Texas rows, then re-check shape and nulls
df = df[df.State == 'Texas']
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2325 entries, 549727 to 552051
Data columns (total 5 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   dt                             2325 non-null   object 
 1   AverageTemperature             2325 non-null   float64
 2   AverageTemperatureUncertainty  2325 non-null   float64
 3   State                          2325 non-null   object 
 4   Country                        2325 non-null   object 
dtypes: float64(2), object(3)
memory usage: 109.0+ KB


#### Takeaways for Prepping Data: 
1. convert dt to datetime
2. set the date as the index
3. sort values by date
4. keep only Texas data
5. drop unnecessary columns (State and Country)


In [8]:
def wrangle_txtemps():
    
    '''
    This function reads the Global Land Temperatures by State data from csv 
    into a data frame. It removes data for all other states except Texas, renames 
    the dt column to "Date", converts Date to datetime object, sorts it by date and 
    sets Date as the index of the dataframe. Unnecessary columns of State and
    Country are removed and a dataframe with only Texas data is returned.
    '''
    
    # read csv to a dataframe
    df = pd.read_csv('GlobalLandTemperaturesbyState.csv')
    
    # keep only Texas data
    df = df[df.State == 'Texas']
    
    #rename dt column to Date
    df.rename(columns = {'dt': 'Date'}, inplace = True)
    
    # convert Date to datetime format
    df['Date'] = pd.to_datetime(df.Date)
    
    # sort values by date
    df = df.sort_values('Date')
    
    # set index
    df = df.set_index('Date')

    # drop unnecessary columns of State and Country
    df = df.drop(columns=['State', 'Country'])
    
    return df
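
The same preparation steps can be tried without the csv file on disk. The sketch below is an illustration only: `prep_state_temps` and `raw_csv` are hypothetical names, the two Texas rows reuse values shown in this notebook, and the Ohio row is made up to show the filter working.

```python
import io
import pandas as pd

# a few rows in the same column layout as the csv (the Ohio row is made up)
raw_csv = io.StringIO(
    "dt,AverageTemperature,AverageTemperatureUncertainty,State,Country\n"
    "1820-02-01,9.081,2.873,Texas,United States\n"
    "1820-01-01,4.489,3.369,Texas,United States\n"
    "1820-01-01,-1.0,3.0,Ohio,United States\n"
)

def prep_state_temps(raw, state):
    """Filter to one state, index by date, and drop the label columns."""
    return (
        raw[raw.State == state]
        .rename(columns={'dt': 'Date'})
        .assign(Date=lambda d: pd.to_datetime(d.Date))
        .sort_values('Date')
        .set_index('Date')
        .drop(columns=['State', 'Country'])
    )

demo = prep_state_temps(pd.read_csv(raw_csv), 'Texas')
print(demo)
```

Taking the frame as an argument, rather than reading the file inside the function, makes the steps easy to check on a handful of rows.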
    


In [11]:
df = wrangle_txtemps()
df.head(2)

            AverageTemperature  AverageTemperatureUncertainty
Date
1820-01-01               4.489                          3.369
1820-02-01               9.081                          2.873


In [12]:
# check data shape, dtypes, and nulls
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2325 entries, 1820-01-01 to 2013-09-01
Data columns (total 2 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   AverageTemperature             2325 non-null   float64
 1   AverageTemperatureUncertainty  2325 non-null   float64
dtypes: float64(2)
memory usage: 54.5 KB


In [7]:
# ensure all Date entries are unique (no duplicates); Date is the index here,
# so reset it to a column before counting
df.reset_index().value_counts('Date')

Date
2013-09-01    1
1886-06-01    1
1884-10-01    1
1884-09-01    1
1884-08-01    1
             ..
1949-02-01    1
1949-01-01    1
1948-12-01    1
1948-11-01    1
1820-01-01    1
Length: 2325, dtype: int64
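
A more compact check is possible once `Date` is the index. The sketch below uses a small made-up monthly frame (`demo`, not the Texas data) and assumes month-start timestamps like those in the output above. It tests for duplicate dates and also lists any months missing between the first and last date.

```python
import pandas as pd

# hypothetical stand-in for the wrangled frame: six month-start dates, one removed
idx = pd.date_range('2000-01-01', periods=6, freq='MS').delete(3)
demo = pd.DataFrame({'AverageTemperature': [4.5, 9.1, 13.2, 22.8, 26.0]}, index=idx)
demo.index.name = 'Date'

# True when no date appears more than once
no_duplicates = demo.index.is_unique

# every month-start between the first and last date, minus the ones present
expected = pd.date_range(demo.index.min(), demo.index.max(), freq='MS')
missing_months = expected.difference(demo.index)

print(no_duplicates)
print(list(missing_months))
```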

#### 1. Split data (train/validate/test) and resample by any period, except daily, and aggregate using the sum.

In [14]:
# train size will be 50% of the total
train_size = int(len(df) * .5)
train_size

1162

In [15]:
# validate will be 30% of total entries
validate_size = int(len(df) * .3)
validate_size

697

In [16]:
# test size will be the remaining rows
test_size = int(len(df) - train_size - validate_size)
test_size

466

In [18]:
# validate will start at 1162 and go through 1162+697 (train + validate)
validate_end_index = train_size + validate_size
validate_end_index

1859

In [19]:
# use the above values to split our data

# train will go from 0 to 1161
train = df[: train_size]

# validate will go from 1162 to 1858
validate = df[train_size:validate_end_index]

# test will include 1859 to the end
test = df[validate_end_index:]
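
The percentage-based split above can be wrapped in a helper so the boundaries are computed in one place. This is a sketch, not part of the original notebook: `split_by_position` and the synthetic `demo` frame are hypothetical. The resampling step uses a yearly mean on the assumption that summing monthly temperatures would not be meaningful; the sum named in the exercise prompt suits sales-type data better.

```python
import numpy as np
import pandas as pd

def split_by_position(frame, train_frac=0.5, validate_frac=0.3):
    """Split a date-sorted frame into train/validate/test by row position."""
    train_end = int(len(frame) * train_frac)
    validate_end = train_end + int(len(frame) * validate_frac)
    return frame[:train_end], frame[train_end:validate_end], frame[validate_end:]

# synthetic monthly series standing in for the Texas temperatures (2325 rows)
dates = pd.date_range('1820-01-01', periods=2325, freq='MS')
demo = pd.DataFrame({'AverageTemperature': np.linspace(0.0, 30.0, 2325)}, index=dates)

train, validate, test = split_by_position(demo)

# the three pieces should cover every row exactly once, in time order
sizes = (len(train), len(validate), len(test))
covers_all = sum(sizes) == len(demo)
in_order = train.index.max() < validate.index.min() < test.index.min()

# resample the train split to yearly means (assumption: mean, not sum, for temperatures)
train_yearly = train.resample('YS').mean()

print(sizes, covers_all, in_order)
```

Splitting by position rather than at random keeps every validate and test row later in time than the training rows, which is what a forecasting evaluation needs.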

#### 2. Forecast, plot, and evaluate using each of the four parametric methods we discussed:
- Simple Average
- Moving Average
- Holt's Linear Trend Model
- Previous cycle (previous year, month, etc.; the choice is up to you)
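
As a starting point for the four methods listed above, here is a minimal sketch on a synthetic monthly series rather than the Texas data. All names (`rmse`, `holt_linear_forecast`, `series`) are hypothetical and the smoothing values are arbitrary. Holt's method is written out by hand in NumPy to show the level and trend recursion; in the notebook itself the imported statsmodels `Holt` class would normally be used instead.

```python
import numpy as np
import pandas as pd

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def holt_linear_forecast(values, steps, alpha=0.1, beta=0.1):
    """Hand-rolled Holt linear trend: smooth a level and a trend, then extrapolate."""
    level, trend = values[0], values[1] - values[0]
    for y in values[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + trend * np.arange(1, steps + 1)

# synthetic monthly series: yearly cycle plus a slow upward drift
dates = pd.date_range('2000-01-01', periods=240, freq='MS')
months = np.arange(240)
series = pd.Series(18 + 0.01 * months + 10 * np.sin(2 * np.pi * months / 12), index=dates)

train, validate = series[:180], series[180:]
horizon = len(validate)

predictions = {
    # simple average: every future value is the mean of train
    'simple_average': np.full(horizon, train.mean()),
    # moving average: every future value is the mean of the last 12 months of train
    'moving_average_12': np.full(horizon, train.rolling(12).mean().iloc[-1]),
    # Holt's linear trend, extrapolated over the validate horizon
    'holt_linear': holt_linear_forecast(train.to_numpy(), horizon),
    # previous cycle: repeat the last 12 months of train across the horizon
    'previous_year': np.tile(train.to_numpy()[-12:], horizon // 12 + 1)[:horizon],
}

# score each method against validate and list them from best to worst
scores = {name: rmse(validate, pred) for name, pred in predictions.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f'{name}: {score:.2f}')
```

Collecting the predictions in one dictionary keeps the evaluation loop identical for every method, so adding another model only means adding another entry.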