## Granger Causality and Inverse Probability Weighting
**Learning Objectives**
* Imputing missing data using semi-supervised learning with a forecast model. Just as with stock markets, unexpected events may occur. Although such events may not be fully captured by the noise terms of the intended ARIMA model, the model is still used for imputation so that protocol buffers types within messages may be used.
* Granger Causality Implementation
* Making causal claims - difference in expected outcomes - using inverse probability weighting

In [3]:
# !pip3 install matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
from pandas.plotting import autocorrelation_plot

## Handling Missing Values in Time Series
* As with any missing values, one can either (1) impute the data or (2) remove the affected records.
* Here, imputation was performed in RStudio with a Kalman filter.
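
The imputation itself was done in R. As a rough Python illustration only, the sketch below runs a local-level (random walk plus noise) Kalman filter and fills each missing observation with the filter's one-step prediction. The function name `kalmanImpute` and the variances `processVar` and `obsVar` are assumptions made for this example, not the settings used in RStudio, and no smoothing pass is applied.

```python
import numpy as np

def kalmanImpute(series, processVar=1.0, obsVar=1.0):
    """Fill NaNs with the filtered state of a local-level model (illustrative only)."""
    values = np.asarray(series, dtype=float)
    filled = values.copy()
    observed = np.flatnonzero(~np.isnan(values))
    if observed.size == 0:
        return filled  # nothing to anchor the filter on
    first = observed[0]
    state, stateVar = values[first], obsVar
    for t in range(first + 1, len(values)):
        stateVar += processVar  # predict step: uncertainty grows
        if np.isnan(values[t]):
            filled[t] = state  # missing value: use the prediction, skip the update
        else:
            gain = stateVar / (stateVar + obsVar)
            state = state + gain * (values[t] - state)
            stateVar = (1.0 - gain) * stateVar
    return filled

print(kalmanImpute([1.0, 2.0, np.nan, 4.0]))
```

Observed values are left untouched, and leading NaNs (before the first observation) stay missing in this sketch.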

In [5]:
# added again for convenience
import numpy as np
import pandas as pd

In [8]:
updatedPortScanDf = pd.read_csv("./knockoff/newPortData.csv")

In [9]:
updatedPortScanDf = updatedPortScanDf.drop(columns=['X','Unnamed: 0'])

In [10]:
updatedPortScanDf.head()

| | frame_info_time | frame_info_time_epoch | frame_info_number | frame_info_len | frame_info_cap_len | ip_id | ip_flags | ip_flags_df | ip_ttl | ip_checksum | ... | tcp_flags_reset | tcp_flags_push | tcp_flags_ack | tcp_flags_urg | tcp_flags_cwr | tcp_window_size | tcp_checksum | tcp_urgent_pointer | tcp_options_mss_val | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1574312000.0 | 7 | 54 | 54 | 46834 | 0 | 0 | 247 | 12832 | ... | 0 | 0 | 0 | 0 | 0 | 1024 | 61355 | 0 | 17055.139055 | 1 |
| 1 | 2481 | 1574312000.0 | 12 | 551 | 66 | 3793 | 16384 | 1 | 56 | 13960 | ... | 0 | 1 | 1 | 0 | 0 | 252 | 60933 | 0 | 16923.056473 | 1 |
| 2 | 2556 | 1574312000.0 | 14 | 94 | 94 | 0 | 16384 | 1 | 59 | 12472 | ... | 0 | 0 | 1 | 0 | 0 | 4677 | 18651 | 0 | 16738.10983 | 1 |
| 3 | 3197 | 1574312000.0 | 15 | 68 | 66 | 8559 | 16384 | 1 | 55 | 64767 | ... | 0 | 1 | 1 | 0 | 0 | 115 | 7590 | 0 | 16507.53589 | 1 |
| 4 | 3273 | 1574312000.0 | 16 | 54 | 54 | 54321 | 0 | 0 | 244 | 40274 | ... | 0 | 0 | 0 | 0 | 0 | 65535 | 18914 | 0 | 16237.596014 | 1 |


## Notions of Causality

The earliest notion of causality for time-series data is Granger causality, which compares how well a stochastic process $Y$ can be predicted from all the information in the "universe" with how well it can be predicted once the information in $X$ is removed. We assume that $X$ and $Y$ are stationary stochastic processes.

If removing $X$ reduces the predictive power regarding $Y$, then $X$ contains unique information regarding $Y$, so $X$ Granger-causes $Y$.

$U_i = (U_{i_1},\ldots,U_{i_{\infty}})$ contains all the information available up to time $i$. $\sigma^2 (Y_i|U_i)$ is the variance of the prediction error when predicting $Y_i$ from $U_i$ at time $i$. $\sigma^2 (Y_i|U_i \setminus X_i)$ is the same quantity when $X_i$ is excluded from the information used to predict $Y_i$.

If $\sigma^2 (Y_i|U_i) < \sigma^2 (Y_i|U_i \setminus X_i)$, then $X$ Granger-causes $Y$.
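
The variance comparison can be sketched numerically. The example below is only an illustration on synthetic data: the "universe" is reduced to one lag of $Y$ and one lag of $X$, the prediction is a least-squares fit, and the helper name `residualVariance` and all coefficients are assumptions made for this sketch.

```python
import numpy as np

def residualVariance(target, predictors):
    """Variance of least-squares residuals of target on predictors plus an intercept."""
    design = np.column_stack([np.ones(len(target)), predictors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.var(target - design @ coef)

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = np.zeros(n)
for i in range(1, n):
    # y depends on its own past and on the past of x
    y[i] = 0.5 * y[i - 1] + 0.8 * x[i - 1] + rng.normal(scale=0.5)

yNow, yPast, xPast = y[1:], y[:-1], x[:-1]
varWithX = residualVariance(yNow, np.column_stack([yPast, xPast]))
varWithoutX = residualVariance(yNow, yPast)
print(varWithX, varWithoutX)
```

Because the past of `x` enters the simulated `y`, dropping it leaves a larger residual variance, which is the inequality stated above.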

**Feedback** occurs between $X$ and $Y$ when $X$ Granger-causes $Y$ and $Y$ Granger-causes $X$.

**In practice, we only have access to a limited set of observed time series rather than the full universe $U$, so we can only conclude that $X$ Granger-causes $Y$ with respect to that observed set.**

**Instantaneous causality** occurs between two stochastic processes if, at time $i$, adding $X_i$ improves the prediction of $Y_i$.

If $\sigma^2(Y_i|U_i \cup \{X_i\}) < \sigma^2(Y_i|U_i)$, then there is instantaneous causality between $X$ and $Y$.

## Inverse Probability Weighting (IPW)

IPW compensates for underrepresented and oversampled groups by weighting each individual by the inverse of the probability of being in their group. As a result, individuals in underrepresented groups are weighted more heavily than those in oversampled groups.
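
As a minimal sketch of this weighting, the example below simulates a binary treatment whose probability depends on a binary confounder and compares a naive difference in mean outcomes with the inverse-probability-weighted one. The data-generating values (a true effect of 2.0, treatment probabilities of 0.8 and 0.2) and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
z = rng.integers(0, 2, size=n)                # binary confounder
pTreat = np.where(z == 1, 0.8, 0.2)           # treatment is more likely when z == 1
t = rng.binomial(1, pTreat)
y = 2.0 * t + 3.0 * z + rng.normal(size=n)    # true effect of t on y is 2.0

# Estimate P(T=1 | Z=z) by the treated fraction within each level of z
propensity = np.array([t[z == level].mean() for level in (0, 1)])[z]
# Each individual is weighted by the inverse probability of the group they ended up in
weights = np.where(t == 1, 1.0 / propensity, 1.0 / (1.0 - propensity))

naive = y[t == 1].mean() - y[t == 0].mean()
ipw = (np.average(y[t == 1], weights=weights[t == 1])
       - np.average(y[t == 0], weights=weights[t == 0]))
print(naive, ipw)
```

The naive difference mixes the treatment effect with the effect of `z`, while the weighted difference in expected outcomes should land close to the simulated effect.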

The notion of a pseudo-population occurs with 

The causal estimand is **identifiable** if we have exchangeability  

One can also look to root cause detection in anomalous time series, where the anomaly originates from the anomalous behavior of one of the continuous variables. Another approach is to identify the causal structure directly with graph search.

**SGS/PC Algorithms**

In [12]:
updatedNumericPortScanDf = updatedPortScanDf.drop(columns=["ip_src","ip_dst"])

In [13]:
# here we use an "adjacency list" representation as a mapping of a feature to every other feature.
# (1) This is the initialization of the fully, densely connected graph
connectedGraph = dict()
numericCols = updatedNumericPortScanDf.columns
colsSet = set(numericCols)
for col in numericCols:
    connectedGraph[col] = colsSet.difference(set([col]))

In [14]:
# (2) edge elimination occurs when edges are conditionally independent - 
# test for conditional independence can be under PC Algorithm that 
# assumes 
# (3) identifying unshielded colliders - for pairs of variables connected
# through a third variable, test conditional independence on third variable.
# 
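
Step (3) can be sketched on the same adjacency representation. The helper names `unshieldedTriples` and `findColliders`, and the `independentGiven` callback standing in for a conditional independence test, are hypothetical and only illustrate the idea. A full PC implementation tracks the separating sets found during edge elimination instead.

```python
from itertools import combinations

def unshieldedTriples(graph):
    """Return (x, z, y) where x - z and z - y are edges but x and y are not adjacent."""
    triples = []
    for z, neighbors in graph.items():
        for x, y in combinations(sorted(neighbors), 2):
            if y not in graph[x]:
                triples.append((x, z, y))
    return triples

def findColliders(graph, independentGiven):
    """Keep triples where x and y are NOT independent given z, suggesting x -> z <- y."""
    return [(x, z, y) for x, z, y in unshieldedTriples(graph)
            if not independentGiven(x, y, z)]

# toy skeleton: a - c - b, with no edge between a and b
toyGraph = {"a": {"c"}, "b": {"c"}, "c": {"a", "b"}}
print(unshieldedTriples(toyGraph))  # [('a', 'c', 'b')]
```

In the notebook's setting, `independentGiven` could wrap a conditional independence test such as the one used for edge elimination.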

In [17]:
# !pip3 install statsmodels
from statsmodels.tsa.stattools import grangercausalitytests

### Granger Causality F-test
**With Granger causality, we can conduct an F-test** that compares a full model (lags of both $X$ and $Y$) against a reduced model (lags of $Y$ only).
If the null hypothesis is rejected, we say that some stochastic process $X$ Granger-causes $Y$. This
provides insight into how well the previous values of
one time series can predict the other.

**In these analyses, we assume that the cause precedes the effect and that the cause has
unique information about the future values of the effect.**
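
To make the full-versus-reduced comparison concrete, here is a minimal NumPy sketch of the F statistic built from the two sums of squared residuals. It is an illustration under stated assumptions, not the statsmodels implementation: the helper names `lagMatrix`, `ssr`, and `grangerF` and the synthetic data are hypothetical.

```python
import numpy as np

def lagMatrix(series, lags):
    """Columns are the series shifted by 1..lags steps, aligned with series[lags:]."""
    n = len(series)
    return np.column_stack([series[lags - k : n - k] for k in range(1, lags + 1)])

def ssr(target, design):
    """Sum of squared residuals of a least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return float(resid @ resid)

def grangerF(y, x, lags):
    """F statistic for H0: lagged x adds nothing to a regression of y on its own lags."""
    target = y[lags:]
    reduced = np.column_stack([np.ones(len(target)), lagMatrix(y, lags)])
    full = np.column_stack([reduced, lagMatrix(x, lags)])
    ssrReduced, ssrFull = ssr(target, reduced), ssr(target, full)
    dfNum = lags                            # number of restrictions (lags of x)
    dfDenom = len(target) - full.shape[1]   # observations minus full-model parameters
    fStat = ((ssrReduced - ssrFull) / dfNum) / (ssrFull / dfDenom)
    return fStat, dfNum, dfDenom

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.zeros(500)
y[1:] = 0.8 * x[:-1] + rng.normal(scale=0.5, size=499)  # x leads y by one step
print(grangerF(y, x, lags=2))
```

A large F statistic means the lags of `x` reduce the residual sum of squares by more than chance would explain, which is the evidence used to reject the null hypothesis.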

In [18]:
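# statsmodels tests whether the second column (ip_ttl) Granger-causes
# the first column (tcp_options_mss_val); [10] tests lag 10 only
grangercausalitytests(updatedPortScanDf[["tcp_options_mss_val","ip_ttl"]], [10])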


Granger Causality
number of lags (no zero) 10
ssr based F test:         F=1.5329  , p=0.1205  , df_denom=284670, df_num=10
ssr based chi2 test:   chi2=15.3298 , p=0.1205  , df=10
likelihood ratio test: chi2=15.3294 , p=0.1205  , df=10
parameter F test:         F=1.5329  , p=0.1205  , df_denom=284670, df_num=10


{10: ({'ssr_ftest': (1.5328688653925866, 0.12053669995882672, 284670.0, 10),
   'ssr_chi2test': (15.329819445585446, 0.12049190196308561, 10),
   'lrtest': (15.329406723380089, 0.12050582193001443, 10),
   'params_ftest': (1.532868865099186, 0.12053670006170704, 284670.0, 10.0)},
  [<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x12cf93be0>,
   <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x12cf92560>,
   array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
           0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
           0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
           0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
           0., 0., 0., 0., 0.],
          [0., 0., 0
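
The returned dictionary maps each lag to a tuple of test statistics and fitted models, and the second entry of each statistics tuple is the p-value. As a small sketch of how to read it, the snippet below copies the four test results from the output above into a literal and checks each p-value against a 0.05 threshold. The helper name `summarizePValues`, the `alpha` parameter, and the empty list standing in for the fitted models are assumptions made for illustration.

```python
# test statistics copied from the output above; fitted models replaced by an empty list
results = {10: ({'ssr_ftest': (1.5328688653925866, 0.12053669995882672, 284670.0, 10),
                 'ssr_chi2test': (15.329819445585446, 0.12049190196308561, 10),
                 'lrtest': (15.329406723380089, 0.12050582193001443, 10),
                 'params_ftest': (1.532868865099186, 0.12053670006170704, 284670.0, 10.0)},
                [])}

def summarizePValues(results, alpha=0.05):
    """Return (lag, test name, p-value, rejected?) for every test in the result dict."""
    rows = []
    for lag, (tests, _models) in results.items():
        for name, stats in tests.items():
            pValue = stats[1]
            rows.append((lag, name, pValue, pValue < alpha))
    return rows

for row in summarizePValues(results):
    print(row)
```

All four p-values are about 0.12, so at the 0.05 level the null hypothesis that `ip_ttl` does not Granger-cause `tcp_options_mss_val` at lag 10 is not rejected.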

In [22]:
# !pip3 install fcit
# we determine if P(y|x, z) = P(y|z)
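
To illustrate what it means for $P(y|x, z) = P(y|z)$ to hold, the sketch below uses a simple linear stand-in, partial correlation, on synthetic data in which `x` and `y` are related only through `z`. This is not how `fcit` works, since `fcit` is a nonparametric test. The helper names `residualize` and `partialCorrelation` are hypothetical.

```python
import numpy as np

def residualize(v, z):
    """Remove the part of v that is linearly explained by z (plus an intercept)."""
    design = np.column_stack([np.ones(len(v)), z])
    coef, *_ = np.linalg.lstsq(design, v, rcond=None)
    return v - design @ coef

def partialCorrelation(x, y, z):
    """Correlation between x and y after regressing z out of both."""
    return np.corrcoef(residualize(x, z), residualize(y, z))[0, 1]

rng = np.random.default_rng(2)
z = rng.normal(size=5000)
x = z + rng.normal(scale=0.5, size=5000)  # x depends on z only
y = z + rng.normal(scale=0.5, size=5000)  # y depends on z only

marginal = np.corrcoef(x, y)[0, 1]
partial = partialCorrelation(x, y, z)
print(marginal, partial)
```

Here `x` and `y` look strongly related marginally, but the relationship largely disappears once `z` is accounted for, which is the pattern a conditional independence test is meant to detect.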

In [24]:
from fcit import fcit
pvalLenMSS = fcit.test(np.transpose(np.matrix(updatedPortScanDf["frame_info_len"])),
                       np.transpose(np.matrix(updatedPortScanDf["tcp_options_mss_val"])))



In [26]:
pvalLenMSS
# fcit tests the null hypothesis that x is independent of y.
# The p-value is low: if x were independent of y, observations at least as extreme
# as these two vectors would be very unlikely.
# We reject the null and may consider that x is not independent of y.
# another consideration for

0.010841559929812611

In [25]:
dfGraph = updatedPortScanDf.sample(10)

In [27]:
fcit.test(np.transpose(np.matrix(dfGraph["frame_info_len"])),
          np.transpose(np.matrix(dfGraph["tcp_options_mss_val"])))



0.9996722776383665

In [28]:
# there are dependencies from previous iterations - sequential
def eliminateEdges(dfGraph,connectedGraph):
    def toColumn(col):
        # fcit expects 2-D inputs of shape (n_samples, n_features)
        return np.transpose(np.matrix(dfGraph[col]))
    def reject(node,neighbor,third=None):
        # True when independence of node and neighbor (given third, if any) is rejected
        if third is None:
            third = np.empty([dfGraph[node].shape[0], 0])
        print(node,neighbor)
        pval = fcit.test(toColumn(node), toColumn(neighbor), third)
        return pval < 0.05
    def removeEdge(node,neighbor):
        connectedGraph[node].discard(neighbor)
        connectedGraph[neighbor].discard(node)
    for node in connectedGraph:
        # iterate over snapshots because edges are removed inside the loops
        for neighbor in list(connectedGraph[node]):
            if not reject(node,neighbor):
                # marginally independent: drop the edge
                removeEdge(node,neighbor)
                continue
            for third in list(connectedGraph[neighbor]):
                # drop the edge once some third variable makes the pair
                # conditionally independent (independence is NOT rejected)
                if third != node and not reject(node,neighbor,toColumn(third)):
                    removeEdge(node,neighbor)
                    break
    return connectedGraph

In [29]:
#newGraph = eliminateEdges(dfGraph,connectedGraph)
