# 1. Introduction
This notebook primarily deals with regression analysis of the dataset.
## 1.2. About this dataset

For each dataset, the fields are:
* Time: timestamp of each record
* Ls: solar longitude (in degrees)
* LT: local time
* Tsurf: surface temperature (in Kelvin)
* Psurf: surface pressure (in Pascals)
* CO2ice: Surface carbon dioxide ice (in kg per metre squared)
* cloud: water ice column (in opacity)
* vapour: water vapour column (in kg per metre squared)
* u_wind: Zonal wind (west-east) (metres per second)
* v_wind: Meridional wind (north-south) (metres per second)
* dust: dust column (in opacity)
* temp: atmospheric temperature at a height of about 2.5 km (in Kelvin)

## 1.3. About this kernel
The purpose of this notebook is to explore this dataset and apply basic deep learning techniques in order to predict pressure given a set of atmospheric variables.

__Limitations__
No optimisation for memory, runtime, or readability.

# 2. Preprocessing the data

## 2.1. Defining Features and Labels
Machine learning algorithms operate on _features_ to predict _labels_.

* A __feature__ is an attribute of the system that affects the output.
Features act as "inputs" to the model.
Ideally, features are _independent_ variables.
* A __label__ is the value being predicted.
Labels act as "outputs" of the model.
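As a minimal sketch of this split, the snippet below separates a small table into a feature matrix and a label vector. The three rows are copied from the sample output shown later in this notebook, and the names `toy`, `label_name`, and `feature_names` are hypothetical.

```python
import pandas as pd

# Three sample rows (two features plus the label) for illustration only.
toy = pd.DataFrame({
    "Tsurf": [264.042, 274.736, 265.939],
    "temp": [179.686, 174.502, 173.429],
    "Psurf": [721.113, 705.090, 700.691],
})

label_name = "Psurf"  # the value being predicted
feature_names = [c for c in toy.columns if c != label_name]

X = toy[feature_names]  # inputs to the model
y = toy[label_name]     # output of the model

print(X.shape, y.shape)
```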

### 2.1.1. Features
At every timestamp within each day, there are values for all other variables.
No other variables impact the values of time or date.
Therefore __date__ and __time of day__ are _independent_ variables.

Temperature, pressure, and humidity do not directly affect one another significantly, but since they are all properties which describe the local atmosphere, they do not vary independently from one another.
Similarly, all three of these variables have a strong relationship to time of day.
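One way to check how strongly such variables move together is a correlation matrix. The sketch below is only illustrative: it uses the five sample rows printed later in this notebook, which are far too few to draw conclusions from, and the name `sample` is hypothetical. Pearson correlation also captures linear relationships only, so a cyclic dependence on time of day may not show up clearly.

```python
import pandas as pd

# Five sample rows copied from the notebook output, for illustration only.
sample = pd.DataFrame({
    "Tsurf": [264.042, 274.736, 265.939, 238.624, 213.634],
    "Psurf": [721.113, 705.090, 700.691, 697.252, 717.146],
    "temp": [179.686, 174.502, 173.429, 173.556, 174.789],
})

# Pairwise Pearson correlation between the columns.
corr = sample.corr()
print(corr.round(2))
```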

CO2ice is always 0 in this dataset, so it carries no information and is excluded.
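A column that never changes can be found automatically rather than by inspection. The following is a small sketch on made-up rows, assuming a column counts as constant when it holds a single distinct value; `toy` and `constant_cols` are hypothetical names.

```python
import pandas as pd

# Made-up rows: CO2ice is constant, Tsurf is not.
toy = pd.DataFrame({
    "Tsurf": [264.042, 274.736, 265.939],
    "CO2ice": [0.0, 0.0, 0.0],
})

# A column with one distinct value cannot help a model tell rows apart.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]
toy = toy.drop(columns=constant_cols)

print(constant_cols)
```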

Therefore we consider the following variables to be _features_ for the machine learning algorithm:
* Tsurf: surface temperature (in Kelvin)
* cloud: water ice column (in opacity)
* vapour: water vapour column (in kg per metre squared)
* u_wind: Zonal wind (west-east) (metres per second)
* v_wind: Meridional wind (north-south) (metres per second)
* dust: dust column (in opacity)
* temp: atmospheric temperature at a height of about 2.5 km (in Kelvin)

Further exploration of the dataset may modify this list, but for now this is our best guess.

### 2.1.2. Labels
The goal is to model surface pressure (Psurf) based on the available features.

## Import/install useful libraries

In [4]:
!pip install statsmodels
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler



## Begin with code snippets

In [24]:
# Import data
df = pd.read_csv('data/insight_openmars_training_time.csv')

# Drop columns that are not used as features
df = df.drop(columns=['Ls', 'LT', 'CO2ice'])

# Move the target column (Psurf) to position 1, directly after Time
target = df.pop('Psurf')
df.insert(1, 'Psurf', target)

# Parse the Time column as datetimes
df['Time'] = pd.to_datetime(df['Time'])

print(df.head())

df.describe()

df.info()

                 Time    Psurf    Tsurf  cloud  vapour  u_wind  v_wind   dust  \
0 1998-07-15 21:23:39  721.113  264.042  0.092   0.027  -7.451   8.604  0.428   
1 1998-07-15 23:26:53  705.090  274.736  0.145   0.026  -7.053   4.934  0.427   
2 1998-07-16 01:30:07  700.691  265.939  0.105   0.026  -6.825  -0.063  0.427   
3 1998-07-16 03:33:21  697.252  238.624  0.134   0.025  -5.373  -4.048  0.426   
4 1998-07-16 05:36:35  717.146  213.634  0.139   0.026  -3.899  -3.133  0.426   

      temp  
0  179.686  
1  174.502  
2  173.429  
3  173.556  
4  174.789  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72196 entries, 0 to 72195
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Time    72196 non-null  datetime64[ns]
 1   Psurf   72196 non-null  float64       
 2   Tsurf   72196 non-null  float64       
 3   cloud   72196 non-null  float64       
 4   vapour  72196 non-null  float64       
 5   u_wind  72196 

In [34]:
# Select the outcome variable (Psurf) and the predictor variables
predictors = ['Tsurf', 'cloud', 'vapour', 'u_wind', 'v_wind', 'dust', 'temp']
df2 = df[['Psurf'] + predictors]
df2.head()

     Psurf    Tsurf  cloud  vapour  u_wind  v_wind   dust     temp
0  721.113  264.042  0.092   0.027  -7.451   8.604  0.428  179.686
1  705.090  274.736  0.145   0.026  -7.053   4.934  0.427  174.502
2  700.691  265.939  0.105   0.026  -6.825  -0.063  0.427  173.429
3  697.252  238.624  0.134   0.025  -5.373  -4.048  0.426  173.556
4  717.146  213.634  0.139   0.026  -3.899  -3.133  0.426  174.789


In [35]:
# Separate the predictor variables (X) from the outcome variable (y)
X = df2[predictors]
y = df2['Psurf']

# Add a constant column to the predictors to represent the intercept (beta_0)
X = sm.add_constant(X)
X.iloc[:5, :5]

   const    Tsurf  cloud  vapour  u_wind
0    1.0  264.042  0.092   0.027  -7.451
1    1.0  274.736  0.145   0.026  -7.053
2    1.0  265.939  0.105   0.026  -6.825
3    1.0  238.624  0.134   0.025  -5.373
4    1.0  213.634  0.139   0.026  -3.899
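To illustrate why the constant column matters, here is a minimal NumPy sketch, not the notebook's own method: it prepends a column of ones to synthetic predictors and solves the least-squares problem directly. The data, the coefficients 5, 2 and -3, and all variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))

# Noise-free synthetic target with intercept 5 and slopes 2 and -3.
y = 5.0 + 2.0 * x[:, 0] - 3.0 * x[:, 1]

# Prepend a column of ones, the same layout sm.add_constant produces by default.
X_const = np.column_stack([np.ones(len(x)), x])

# Ordinary least squares: the first coefficient is the intercept.
beta, *_ = np.linalg.lstsq(X_const, y, rcond=None)
print(beta)  # approximately [5, 2, -3]
```

Without the column of ones, the fitted line would be forced through the origin and the intercept could not be estimated.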


In [36]:
# (1) select a significance value
alpha = 0.05

# (2) Fit the model
model = sm.OLS(y, X).fit()

# (3) evaluate the coefficients' p-values
model.summary()

Dep. Variable:                Psurf   R-squared:                   0.156
Model:                          OLS   Adj. R-squared:              0.156
Method:               Least Squares   F-statistic:                1911.0
Date:              Thu, 24 Feb 2022   Prob (F-statistic):            0.0
Time:                      13:13:10   Log-Likelihood:          -375780.0
No. Observations:             72196   AIC:                      751600.0
Df Residuals:                 72188   BIC:                      751700.0
Df Model:                         7
Covariance Type:          nonrobust

                coef     std err           t       P>|t|      [0.025      0.975]
const       816.4173       3.095     263.772       0.000     810.351     822.484
Tsurf        -0.1693       0.006     -27.282       0.000      -0.181      -0.157
cloud       -15.6108       2.986      -5.227       0.000     -21.464      -9.758
vapour    -1227.0244      36.202     -33.894       0.000   -1297.979   -1156.069
u_wind        3.5723       0.053      66.968       0.000       3.468       3.677
v_wind       -0.5472       0.037     -14.761       0.000      -0.620      -0.475
dust          8.8425       0.474      18.644       0.000       7.913       9.772
temp         -0.1692       0.017     -10.190       0.000      -0.202      -0.137

Omnibus:                   4710.039   Durbin-Watson:               0.127
Prob(Omnibus):                  0.0   Jarque-Bera (JB):         3629.105
Skew:                        -0.458   Prob(JB):                      0.0
Kurtosis:                     2.394   Cond. No.                  63900.0
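Step (3) can also be done programmatically. The following is a minimal sketch that compares the p-values from the summary with `alpha`. The dictionary is typed in by hand from the table (values as displayed, rounded to three decimals) and is only for illustration; on a fitted statsmodels result the same values are available as `model.pvalues`.

```python
alpha = 0.05

# p-values copied from the P>|t| column of the summary table.
p_values = {
    "const": 0.000,
    "Tsurf": 0.000,
    "cloud": 0.000,
    "vapour": 0.000,
    "u_wind": 0.000,
    "v_wind": 0.000,
    "dust": 0.000,
    "temp": 0.000,
}

# Keep the coefficients whose p-value is below the significance level.
significant = [name for name, p in p_values.items() if p < alpha]
print(significant)
```

Every coefficient passes the 0.05 threshold, yet R-squared is only 0.156, so statistical significance here does not imply that the model explains much of the variation in Psurf.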


In [37]:
# Drop the const column: unlike statsmodels, scikit-learn's LinearRegression fits an intercept by default
X = X.drop('const', axis=1)

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)


In [38]:
# Instantiate the regressor class
regressor = LinearRegression()

# Build the model by fitting the regressor to the training data
regressor.fit(X_train, y_train)

# Make predictions for the test set
prediction = regressor.predict(X_test)

# Evaluate the prediction accuracy of the model
# (LinearRegression.score returns R^2, the coefficient of determination)
from sklearn.metrics import mean_absolute_error, median_absolute_error
print("The Explained Variance: %.2f" % regressor.score(X_test, y_test))
print("The Mean Absolute Error: %.2f" % mean_absolute_error(y_test, prediction))
print("The Median Absolute Error: %.2f" % median_absolute_error(y_test, prediction))


The Explained Variance: 0.17
The Mean Absolute Error: 36.35
The Median Absolute Error: 33.70
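For reference, the three metrics printed above can be written out by hand. This is a sketch on made-up numbers, not the notebook's data; the helper names are hypothetical, and the first function computes the R^2 value that `regressor.score` reports.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def mean_abs_error(y_true, y_pred):
    """Average size of the errors, in the units of y."""
    return np.mean(np.abs(y_true - y_pred))

def median_abs_error(y_true, y_pred):
    """Middle value of the absolute errors; less sensitive to outliers."""
    return np.median(np.abs(y_true - y_pred))

# Made-up pressures and predictions.
y_true = np.array([700.0, 710.0, 720.0, 730.0])
y_pred = np.array([705.0, 705.0, 725.0, 720.0])

print(r_squared(y_true, y_pred))         # 0.65
print(mean_abs_error(y_true, y_pred))    # 6.25
print(median_abs_error(y_true, y_pred))  # 5.0
```

Both absolute-error metrics are expressed in the units of the target, so for the unscaled data above a mean absolute error of 36.35 is in the same units as Psurf.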


In [39]:
# Work on an independent copy so the assignments below do not raise SettingWithCopyWarning
df2 = df2.copy()

# First pass: fit one MinMaxScaler per column and scale each column to [-1, 1]
scalers = {}
for i in df2.columns:
    scaler = MinMaxScaler(feature_range=(-1, 1))
    s_s = scaler.fit_transform(df2[i].values.reshape(-1, 1))
    scalers['scaler_' + i] = scaler
    df2[i] = s_s.ravel()

# Second pass: re-apply each fitted scaler to the already-scaled column.
# Every column is therefore transformed twice; Psurf, whose raw range is far
# wider than 2, ends up in a much narrower band than [-1, 1].
for i in df2.columns:
    scaler = scalers['scaler_' + i]
    s_s = scaler.transform(df2[i].values.reshape(-1, 1))
    df2[i] = s_s.ravel()
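The scaling cell above applies each scaler in two loops. As a point of comparison, here is a minimal NumPy sketch of the min-max formula applied once, inverted, and applied a second time. The five Psurf values come from the sample output earlier in the notebook; `fit_minmax` is a hypothetical helper, not part of scikit-learn. With the real `MinMaxScaler`, `inverse_transform` plays the role of the inverse step.

```python
import numpy as np

def fit_minmax(values, lo=-1.0, hi=1.0):
    """Return (scale, offset) such that values * scale + offset lies in [lo, hi]."""
    v_min, v_max = values.min(), values.max()
    scale = (hi - lo) / (v_max - v_min)
    offset = lo - v_min * scale
    return scale, offset

# Five sample Psurf values from the notebook output.
psurf = np.array([721.113, 705.090, 700.691, 697.252, 717.146])

scale, offset = fit_minmax(psurf)
scaled = psurf * scale + offset        # one pass: values now span [-1, 1]
restored = (scaled - offset) / scale   # inverse: back to the original units
twice = scaled * scale + offset        # second pass on already-scaled values

print(scaled.min(), scaled.max())
print(twice.max() - twice.min())       # much smaller than 2
```

Applying the transform a second time shrinks the spread of Psurf well below the intended [-1, 1] range, which also shrinks any absolute error computed on it.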

In [40]:
X = df2[predictors]
y = df2['Psurf']
# X = X.drop('const', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)


In [41]:
regressor = LinearRegression()

# Build the model by fitting the regressor to the scaled training data
regressor.fit(X_train, y_train)

# Make predictions for the test set
prediction = regressor.predict(X_test)

# Evaluate the prediction accuracy of the model
# (score returns R^2; the absolute errors are in the units of the scaled Psurf column)
from sklearn.metrics import mean_absolute_error, median_absolute_error
print("The Explained Variance: %.2f" % regressor.score(X_test, y_test))
print("The Mean Absolute Error: %.2f" % mean_absolute_error(y_test, prediction))
print("The Median Absolute Error: %.2f" % median_absolute_error(y_test, prediction))

The Explained Variance: 0.17
The Mean Absolute Error: 0.00
The Median Absolute Error: 0.00
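The score is the same as before scaling while the absolute errors drop to 0.00. The sketch below suggests why, under the assumption that scaling is an affine map of the target: a least-squares fit adapts to the new units, so R^2 is unchanged, whereas the mean absolute error shrinks by the scale factor. The data are synthetic and the helper `fit_and_score` is hypothetical; unlike the notebook, it scores on the same data it fits.

```python
import numpy as np

def fit_and_score(X, y):
    """Fit least squares with an intercept; return in-sample (R^2, MAE)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = A @ beta
    r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
    mae = np.mean(np.abs(y - pred))
    return r2, mae

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 700.0 + X @ np.array([4.0, -2.0, 1.0]) + rng.normal(scale=5.0, size=300)

k = 0.01  # made-up scale factor standing in for the scaler
r2_raw, mae_raw = fit_and_score(X, y)
r2_scaled, mae_scaled = fit_and_score(X, k * y - 7.0)

print(r2_raw, r2_scaled)          # equal up to rounding
print(mae_raw * k, mae_scaled)    # equal up to rounding
```

An error of 0.00 on the scaled target therefore does not mean the fit improved; to compare with the earlier 36.35, the predictions would need to be mapped back to the original units first.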
