<a href="https://colab.research.google.com/github/allakoala/data_science/blob/main/colab_notebooks/HW_Time_Series.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#HW: https://docs.google.com/document/d/1pMJgIkYXNGxT6ZMpvQuyz-qY6o9QByAf/edit

Previous hw: https://colab.research.google.com/drive/1r-J7BcwY1E-deeWoe1CaDoGvK4s8LCif#scrollTo=qcAWHBpVZ_LH



1.   Goal: Predict the bike renting studied in previous homework using Time Series techniques.
2.   Data: The same data as for homework on Advanced regression
3.   Target: `cnt` - Number of bikes rented per hour.
4.   Endogenous variables: All variables engineered from target and time
5.   Exogenous variables: Other than endogenous (e.g., something about weather)
6.   Test sample: last month of data
7.   Metrics: MAE




##EDA using Time Series Analysis techniques:
1. According to the TS count plot, the count of total rental bikes plotted through time makes a double bell-like graph, showing the same overall shape, and an increase in bike rental in 2012. So the time series shows a general tendency of rental increasing in the first two seasons then decreases in the last two seasons. The time series also shows a repeating short-term cycle through the two years.

2. According to the correlation matrix:
  * We observe a highly positive correlation between 'temp' and 'atemp' and between 'casual' and 'registered'.
  * 'Windspeed' displays an insignificant contribution to the count.
  * Hence, we will drop a few unnecessary columns later.

3. The seasonal component confirms our hypothesis, and shows up even a more interesting seasonal monthly cycle - which could be removed

4. Based on these results, we can conclude that the "cnt" time series appears to be stationary. Stationary time series exhibit consistent statistical properties over time, which is desirable for many time series analysis techniques:

    *   ADF Statistic: -6.822918711895098 - The ADF statistic is significantly below the critical values, indicating strong evidence against the null hypothesis of non-stationarity. This suggests that the "cnt" series is likely stationary.
    *   p-value: 1.9808626277977946e-09 - The extremely low p-value (close to zero) further supports the rejection of the null hypothesis. It suggests that the "cnt" series is unlikely to be non-stationary.
    Critical Values:
    *   The ADF statistic is lower than all critical values at the 1%, 5%, and 10% significance levels. This provides additional evidence that the "cnt" series is stationary.

5. The seasonal sub series plot can be more informative when redrawn with seasonal box plots. The box plot displays both central tendency and dispersion
within the seasonal data over a batch of time units.

6. To determine if the deseasonal_cnt series is stationary, should be performed a stationarity test. One commonly used test is the Augmented Dickey-Fuller (ADF) test. Overall, based on the ADF test results, we can conclude that the "deseasonal_cnt" series is stationary. This indicates that the seasonal component has been effectively removed, and the series exhibits no significant trends or patterns over time:

    *   To make the time series stationary, we’ll be using Differencing. Differencing is a process of subtracting each data point in the series from its successor.
    *   ADF Statistic: The ADF statistic value of -7.106123757516585 indicates a significant level of stationarity in the "deseasonal_cnt" series. A more negative ADF statistic suggests a stronger rejection of the null hypothesis of non-stationarity.
    *   p-value: The p-value of 4.0494746104502425e-10 is extremely small, indicating strong evidence against the null hypothesis of non-stationarity. The small p-value suggests that the "deseasonal_cnt" series is likely stationary.
    *   Critical Values: The ADF test also provides critical values at different significance levels (1%, 5%, and 10%) to compare with the ADF statistic. In this case, all critical values are more negative than the ADF statistic, further supporting the rejection of the null hypothesis. These critical values serve as thresholds for determining the stationarity of the series.


In [None]:
import pandas as pd
import numpy as np

from google.colab import drive
drive.mount('/content/drive')
#path of the file to read
url = "/content/drive/MyDrive/Colab Notebooks/data/regression advanced/hour.csv"

#read the file into a variable
data = pd.read_csv(url, sep=',')

#examine the data
print(data.head())

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller

# Create the df
df = pd.DataFrame(data)

# Convert 'dteday' column to datetime type
df['dteday'] = pd.to_datetime(df['dteday'])

# Set 'dteday' as the DataFrame index
df.set_index('dteday', inplace=True)

# Sort the data by 'dteday' column
df = df.sort_values('dteday')

# Smooth the 'cnt' column using a 10-hour moving average
df['cnt_smoothed'] = df['cnt'].rolling(window=10, center=True, min_periods=1).mean()

# Print the dataframe with desired columns
print(df[['cnt', 'cnt_smoothed']])

# Perform time series decomposition
result = seasonal_decompose(df['cnt'], model='additive', period=1) # 365*24

# Plot the time series data
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['cnt'])
plt.xlabel('Date')
plt.ylabel('Count')
plt.title('Time Series of Count')
plt.show()

plt.figure(figsize=(12, 6))
plt.plot(df.index, df['cnt_smoothed'])
plt.xlabel('Date')
plt.ylabel('Count')
plt.title('Smoothed Time Series of Count')
plt.show()

plt.figure(figsize=(10, 6))
plt.plot(df.index, df['cnt'], label='Original')
plt.plot(df.index, df['cnt_smoothed'], label='Smoothed')
plt.xlabel('Date')
plt.ylabel('Count')
plt.title('Smoothed Time Series')
plt.legend()
plt.show()

In [None]:
# Plot the decomposition components
plt.figure(figsize=(12, 12))
plt.subplot(4, 1, 1)
plt.plot(result.trend)
plt.ylabel('Trend')
plt.title('Decomposition: Trend')

plt.subplot(4, 1, 2)
plt.plot(result.seasonal)
plt.ylabel('Seasonal')
plt.title('Decomposition: Seasonal')

plt.subplot(4, 1, 3)
plt.plot(result.resid)
plt.ylabel('Residual')
plt.title('Decomposition: Residual')

plt.subplot(4, 1, 4)
plt.plot(result.observed)
plt.xlabel('Date')
plt.ylabel('Observed')
plt.title('Decomposition: Observed')

plt.tight_layout()
plt.show()

# Perform correlation analysis
correlation_matrix = df.corr()
plt.figure(figsize=(10, 8))
plt.imshow(correlation_matrix, cmap='coolwarm', interpolation='nearest')
plt.colorbar()
plt.xticks(range(len(correlation_matrix.columns)), correlation_matrix.columns, rotation=90)
plt.yticks(range(len(correlation_matrix.columns)), correlation_matrix.columns)
plt.title('Correlation Matrix')
plt.show()

# Perform stationarity test
def adf_test(series):
    result = adfuller(series)
    print('ADF Statistic:', result[0])
    print('p-value:', result[1])
    print('Critical Values:')
    for key, value in result[4].items():
        print(f'   {key}: {value}')

print('ADF Test for "cnt":')
adf_test(df['cnt'])

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller

# Generate seasonal sub-series plots
seasons = df.groupby(df['season'])['cnt'].apply(list)

plt.figure(figsize=(12, 8))
for season, values in seasons.items():
    plt.plot(values, label=f'Season {season}')
plt.xlabel('Time Units')
plt.ylabel('Count')
plt.title('Seasonal Sub-Series Plot')
plt.legend()
plt.show()

# Generate seasonal box plots using seaborn
plt.figure(figsize=(12, 8))
df['season_name'] = df['season'].map({1: 'Spring', 2: 'Summer', 3: 'Fall', 4: 'Winter'})
sns.boxplot(x=df['season_name'], y=df['cnt'])
plt.xlabel('Season')
plt.ylabel('Count')
plt.title('Seasonal Box Plots')
plt.show()

In [None]:
# Get the deseasonalized component
deseasonal_cnt = df['cnt'] - result.seasonal

# Plot the deseasonalized time series
plt.figure(figsize=(12, 6))
plt.plot(df.index, deseasonal_cnt)
plt.xlabel('Date')
plt.ylabel('Deseasonalized Count')
plt.title('Deseasonalized Time Series of Count')
plt.show()

# Print the first few rows of the deseasonalized time series
print(deseasonal_cnt.head())

In [None]:
from statsmodels.tsa.stattools import adfuller

# Perform stationarity test
def adf_test(series):
    result = adfuller(series)
    print('ADF Statistic:', result[0])
    print('p-value:', result[1])
    print('Critical Values:')
    for key, value in result[4].items():
        print(f'   {key}: {value}')
    return result

# Perform ADF test on the deseasonal_cnt series
print('ADF Test for "deseasonal_cnt":')
adf_result = adf_test(deseasonal_cnt)

# Make the series stationary
if adf_result[1] > 0.05:
    stationary_cnt = deseasonal_cnt.diff().dropna()  # Take the difference to make it stationary
else:
    stationary_cnt = deseasonal_cnt  # The series is already stationary

# Add stationary_cnt and deseasonal_cnt to df dataset
df['stationary_cnt'] = stationary_cnt
df['deseasonal_cnt'] = deseasonal_cnt

# Plot the stationary time series
plt.figure(figsize=(12, 6))
plt.plot(df.index, stationary_cnt)
plt.xlabel('Date')
plt.ylabel('Stationary Count')
plt.title('Stationary Time Series of Count')
plt.show()

# Print the first few rows of the stationary time series
print(stationary_cnt.head())

##Baseline model using Linear Regression keeping in data only endogenous feature:

**Baseline Model using Linear Regression (keeping only endogenous feature):**

- Mean Squared Error (MSE): 1.55e-26
- R-squared: 1.0
- Mean Absolute Error (MAE): 9.16e-13

The baseline model demonstrates outstanding performance. The extremely low MSE indicates that the model's predictions closely align with the actual values, exhibiting minimal error. The perfect R-squared score suggests that the model explains all the variance in the target variable using the selected endogenous features. Additionally, the very low MAE confirms that, on average, the model's predictions are remarkably close to the true values, with minimal absolute deviation.

**Baseline Model Improvement using Results of Target Decomposition and Exogenous Features:**

- Mean Squared Error (MSE): 5.48e-27
- R-squared: 1.0
- Mean Absolute Error (MAE): 5.51e-14

The improved model showcases a significant enhancement over the baseline model. The even lower MSE indicates a further reduction in prediction errors, making the model even more accurate. The perfect R-squared score demonstrates that the model captures all the variance in the target variable by utilizing both the endogenous and exogenous features. Moreover, the extremely low MAE affirms that the model's predictions are exceptionally close to the true values, exhibiting outstanding precision.

In conclusion, both the baseline model and the improved model exhibit exceptional performance. They provide highly accurate predictions with minimal errors. The inclusion of exogenous features and leveraging the results of target decomposition have significantly enhanced the models' predictive capabilities. These findings establish the reliability and precision of the models for making accurate predictions on the given dataset.




In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Select endogenous features
endogenous_features = ['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'casual', 'registered'] # delete target features

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df[endogenous_features], df['stationary_cnt'], test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Calculate evaluation metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

# Print the evaluation metrics
print("Baseline model using Linear Regression keeping only endogenous feature:")
print("Mean Squared Error:", mse)
print("R-squared:", r2)
print("Mean Absolute Error:", mae)

###Baseline model improvement using results of target decomposition and exogenous features


In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Extract the trend component from decomposition
df['trend'] = result.trend

# Add exogenous features
exogenous_features = ['weathersit', 'temp', 'atemp', 'hum', 'windspeed']

# Combine endogenous and exogenous features
features = exogenous_features + ['trend']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df[features], df['stationary_cnt'], test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Calculate evaluation metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

# Print the evaluation metrics
print("Baseline model improvement using results of target decomposition and exogenous features:")
print("Mean Squared Error:", mse)
print("R-squared:", r2)
print("Mean Absolute Error:", mae)

##Train ARIMA model:
1. Mean Absolute Error (MAE): The MAE value of 56.88 indicates that, on average, the ARIMA model's predictions deviate from the actual values by approximately 56.88 units. Lower MAE values suggest better predictive accuracy, so this value could be considered moderately high.

2. AR Coefficients: The AR coefficients [-0.041, -0.152, 0.586, -0.103, -0.181, 0.042, -0.181, 0.099, -0.051, -0.051, -0.090] represent the autoregressive components of the ARIMA model. These coefficients determine the influence of previous observations on the current value. The coefficients can be interpreted as follows: for each unit increase in the corresponding lagged observation, the current value is expected to change by the coefficient value.

3. MA Coefficients: The MA coefficients [-0.531, -0.017, -0.839, 0.504, -0.079] represent the moving average components of the ARIMA model. These coefficients capture the dependency between the residual errors and the lagged forecast errors. Similar to the AR coefficients, these values indicate the impact of past errors on the current prediction.

4. Residual Analysis: The summary statistics of the residuals reveal important characteristics of the model's performance. The mean residual value of 0.082 indicates a slight positive bias, suggesting that, on average, the model tends to underestimate the true values. The standard deviation of 138.50 reflects the dispersion of the residuals around the mean, indicating a significant amount of variability. The minimum and maximum residuals are -516.22 and 707.06, respectively, suggesting occasional large prediction errors. The quartiles provide insights into the distribution of residuals, with the median (50th percentile) at -21.55 and the interquartile range (25th to 75th percentile) ranging from -81.44 to 65.89.

5. In conclusion, the refitted ARIMA model shows a moderate Mean Absolute Error, suggesting that it captures a significant portion of the underlying patterns in the data :

    * Mean Absolute Error (MAE): The MAE of 55.372 suggests that, on average, the refitted ARIMA model's predictions deviate by approximately 55.37 units from the actual values. Lower MAE values indicate better model accuracy.
    * AR Coefficients: The AR coefficients represent the weights assigned to the past observations in the autoregressive component of the model. The values [-0.017, -0.182, 0.577, -0.078, -0.204, 0.056, -0.175, 0.098, -0.031, -0.068, -0.056] indicate the magnitude and direction of the influence of each lagged observation on the current value. Positive coefficients suggest a positive relationship, while negative coefficients imply an inverse relationship.
    * MA Coefficients: The MA coefficients represent the weights assigned to the past forecast errors in the moving average component of the model. The values [-0.556, 0.058, -0.877, 0.482, -0.081] indicate the impact of the forecast errors at different lags on the current value. These coefficients capture any remaining dependency in the errors.
    * Residual Analysis: The summary statistics of the residuals indicate that they have a mean close to zero (0.101) with a standard deviation of 138.52. The minimum and maximum residuals are -519.32 and 683.52, respectively. The interquartile range (IQR) spans from -82.01 to 66.39, suggesting that most of the residuals lie within this range.
    * AIC and BIC: The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are used to compare models based on their goodness of fit and complexity. The initial model has an AIC of 220,747.06 and a BIC of 220,879.03, while the refitted model has a slightly higher AIC of 220,764.62 and BIC of 220,896.59. Lower AIC and BIC values indicate a better trade-off between model fit and complexity.

In [None]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Select the relevant columns for the time series analysis
ts_data = df[['stationary_cnt']]
ts_data.index = pd.to_datetime(df.index)

# Perform autocorrelation and partial autocorrelation analysis
plot_acf(ts_data['stationary_cnt'], lags=50)
plt.show()

plot_pacf(ts_data['stationary_cnt'], lags=50)
plt.show()


In [None]:
# Determine the ARIMA coefficients based on the analysis
p = 11  # Number of AR terms determined from the PACF plot
d = 1  # Number of differences (0 if the data is already stationary)
q = 5  # Number of MA terms determined from the ACF plot

# Set n as the length of the last month of data
last_month = ts_data.index[-1].to_period('M').month
second_last_month = ts_data.index[-2].to_period('M').month
n = last_month - second_last_month + 1

# Split the data into training and test sets
train_data = ts_data.iloc[:-n]
test_data = ts_data.iloc[-n:]

# Fit ARIMA model
model = ARIMA(train_data, order=(p, d, q))
model_fit = model.fit()

# Make predictions
predictions = model_fit.predict(start=test_data.index[0], end=test_data.index[-1])

# Calculate MAE
mae = mean_absolute_error(test_data['stationary_cnt'], predictions)

# Analyze residuals
residuals = model_fit.resid

# Explanation of ARIMA coefficients
ar_coef = model_fit.arparams
ma_coef = model_fit.maparams

# Print the results
print("ARIMA model:")

print(f"Mean Absolute Error (MAE): {mae}")
print(f"AR Coefficients: {ar_coef}")
print(f"MA Coefficients: {ma_coef}")
print("Residual Analysis:")
print(residuals.describe())

In [None]:
import matplotlib.pyplot as plt
import statsmodels.graphics.tsaplots as tsaplots

# Residual analysis
print("Residual Analysis:")
print(residuals.describe())

# Plot ACF and PACF of the residuals
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
tsaplots.plot_acf(residuals, lags=20, ax=axes[0])
tsaplots.plot_pacf(residuals, lags=20, ax=axes[1])
plt.tight_layout()
plt.show()

# Refit the model using the entire dataset
model_refit = ARIMA(ts_data, order=(p, d, q))
model_fit_refit = model_refit.fit()

# Make predictions on the entire dataset
predictions_refit = model_fit_refit.predict(start=test_data.index[0], end=test_data.index[-1])

# Calculate MAE for the refit model
mae_refit = mean_absolute_error(test_data['stationary_cnt'], predictions_refit)

# Print the results
print("Refitted ARIMA model:")
print(f"Mean Absolute Error (MAE): {mae_refit}")
print(f"AR Coefficients: {model_fit_refit.arparams}")
print(f"MA Coefficients: {model_fit_refit.maparams}")
print("Residual Analysis:")
print(model_fit_refit.resid.describe())

# Compare fit criteria such as AIC or BIC
print(f"AIC: {model_fit.aic}")
print(f"BIC: {model_fit.bic}")
print(f"Refitted AIC: {model_fit_refit.aic}")
print(f"Refitted BIC: {model_fit_refit.bic}")

##Prophet and SARIMAX.

In [None]:
from prophet import Prophet
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Preprocess the data for Prophet
prophet_df = df.copy()
prophet_df.reset_index(inplace=True)
prophet_df = prophet_df[['dteday', 'cnt']].copy()
prophet_df.columns = ['ds', 'y']
prophet_df['ds'] = pd.to_datetime(prophet_df['ds'])

# Fit Prophet model
prophet_model = Prophet(daily_seasonality=False)  # Disabling daily seasonality # add own seasonality
prophet_model.fit(prophet_df)

# Make future predictions with Prophet
future = prophet_model.make_future_dataframe(periods=365)  # Extend the dataframe by 365 days
prophet_forecast = prophet_model.predict(future)

# Preprocess the data for SARIMAX
sarimax_df = prophet_df.copy()
sarimax_df.rename(columns={'dteday': 'ds', 'cnt': 'y'}, inplace=True)
sarimax_df.set_index('ds', inplace=True)

# Fit SARIMAX model
sarimax_model = SARIMAX(sarimax_df, order=(1, 0, 1), seasonal_order=(1, 1, 1, 24)) # seasonality - weekly, 24*365
sarimax_model_fit = sarimax_model.fit()

# Make predictions with SARIMAX
sarimax_forecast = sarimax_model_fit.forecast(steps=365)

# Print the forecasted values
print("Prophet Forecast:")
print(prophet_forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())

print("\nSARIMAX Forecast:")
print(sarimax_forecast.tail())


In [None]:
# Plotting Prophet Forecast
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(prophet_forecast['ds'], prophet_forecast['yhat'], label='Forecast')
ax.fill_between(prophet_forecast['ds'], prophet_forecast['yhat_lower'], prophet_forecast['yhat_upper'], alpha=0.3, color='gray', label='Confidence Interval')
ax.scatter(prophet_df['ds'], prophet_df['y'], color='red', label='Actual')
ax.legend()
ax.set_xlabel('Date')
ax.set_ylabel('Count')
ax.set_title('Prophet Forecast')
plt.xticks(rotation=45)
plt.show()

# Plotting SARIMAX Forecast
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(sarimax_forecast.index, sarimax_forecast.values, label='Forecast')
ax.scatter(sarimax_df.index, sarimax_df['y'].values, color='red', label='Actual')
ax.legend()
ax.set_xlabel('Date')
ax.set_ylabel('Count')
ax.set_title('SARIMAX Forecast')
plt.xticks(rotation=45)
plt.show()

## Advanced regression model with the time-based features engineered for linear model
1. Mean Absolute Error (MAE): The MAE value of 104.92 indicates the average absolute difference between the predicted values and the actual values. A lower MAE indicates better model performance, so the model's predictions are relatively close to the actual values on average.

2. Coefficients: The coefficients represent the estimated impact of each feature on the target variable (cnt). Some key observations include:

3. The features with the highest positive coefficients are 'atemp' (200.13), 'temp' (103.26), and 'yr' (82.80). These variables have a positive relationship with the target variable, meaning an increase in these features tends to correspond to an increase in the bike count.
The features with the highest negative coefficients are 'hum' (-200.40) and 'weathersit' (-3.29). These variables have a negative relationship with the target variable, indicating that an increase in these features is associated with a decrease in the bike count.

4. Feature Importance: The feature importance indicates the relative importance of each feature in the model. The importance values can help prioritize feature selection or further analysis. Key observations include: The features 'mnth' and 'month' have the highest importance values of 2.73e+13. This suggests that the month of the year is highly influential in predicting the bike count.
Other important features include 'hum' (200.40), 'atemp' (200.13), and 'temp' (103.26), indicating the significance of weather-related variables.
Prediction for New Data: The predicted bike count for new data is 177.73. This prediction can be used as an estimate for the bike count given the input variables provided.

5. Overall, the model provides insights into the relationship between the features and the bike count. It suggests that factors such as temperature, humidity, month, and weather conditions play a significant role in predicting the bike count.



In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Engineer time-based features
data['dteday'] = pd.to_datetime(data['dteday'])  # Convert the 'dteday' column to datetime
data['hour'] = data['dteday'].dt.hour  # Extract the hour from the datetime
data['month'] = data['dteday'].dt.month  # Extract the month from the datetime
data['weekday'] = data['dteday'].dt.weekday  # Extract the weekday from the datetime

# Define the features and target variable
features = ['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'temp',
            'atemp', 'hum', 'windspeed', 'hour', 'month']  # Add or modify features as needed
target = 'cnt'  # Replace 'cnt' with the actual target variable name

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data[features], data[target], test_size=0.2, random_state=42)

# Initialize and fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model using Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error (MAE):", mae)

# Additional analysis or model refinement can be performed here

# Example: Coefficient interpretation
coefficients = pd.DataFrame({'Feature': features, 'Coefficient': model.coef_})
print(coefficients)

# Example: Feature importance analysis
importance = pd.DataFrame({'Feature': features, 'Importance': np.abs(model.coef_)})
importance.sort_values(by='Importance', ascending=False, inplace=True)
print(importance)

# Example: Model prediction on new data
new_data = pd.DataFrame({'season': [2], 'yr': [0], 'mnth': [6], 'hr': [8], 'holiday': [0], 'weekday': [2],
                         'workingday': [1], 'weathersit': [1], 'temp': [0.7], 'atemp': [0.65], 'hum': [0.6],
                         'windspeed': [0.25], 'hour': [8], 'month': [6]})
new_prediction = model.predict(new_data)
print("Prediction for new data:", new_prediction)


#Compare obtained results with ones obtained using Advanced Regression techniques in the previous homework where it is possible.