In [None]:
from drawdata import draw_scatter
import math as m
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys

# Manually create a new dataset
draw_scatter()

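# T statistic
**Usage**: When we have samples from two populations and want to compare their statistical properties, with no information available beyond the samples themselves.

**Formula**: $$T(\hat{\beta}) = \frac{\hat{\beta} - \bar{\beta}}{s.e.(\hat{\beta})}$$

Here $\hat{\beta}$ is the estimate, $\bar{\beta}$ is the value assumed under the null hypothesis, and $s.e.(\hat{\beta})$ is the standard error of the estimate.

**Description**: Under the null hypothesis, the T statistic follows a t distribution.

**Assumptions**: We assume that $X$ is normally distributed.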
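
As a small illustration of the formula in the two-sample case, the sketch below takes $\hat{\beta}$ to be the difference of the sample means, the hypothesized value to be 0, and the standard error to be $\sqrt{s_1^2/n_1 + s_2^2/n_2}$ (the unequal-variance, or Welch, form). The function name `welch_t` and the toy arrays are assumptions made for this example and are not part of the notebook's data.

```python
import numpy as np

def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for mean(a) - mean(b)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each sample mean (ddof=1 gives the sample variance)
    va = a.var(ddof=1) / len(a)
    vb = b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

# Toy samples (hypothetical): means 3 and 4, both with sample variance 2.5
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t_stat, dof)
```

A value of $|t|$ that is large relative to the t distribution with `dof` degrees of freedom suggests that the two means differ.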


## Example
Check whether the two samples have the same mean.

In [None]:
X1 = pd.read_csv("mean-diff.csv")["x"]
X2 = pd.read_csv("mean-same.csv")["x"]

_=plt.hist(X1,30)
_=plt.hist(X2,30)

In [None]:
mu1 = X1.mean()
mu2 = X2.mean()
print(f"mu1={mu1}, mu2={mu2}")
# Sample variances (pandas uses ddof=1 by default)
s1 = X1.var()
s2 = X2.var()
n1 = len(X1)
n2 = len(X2)

# Two-sample t statistic with unequal variances
(mu2 - mu1) / m.sqrt(s1/n1 + s2/n2)

Let's calculate the covariance matrix for this dataset. We expect the data to have a positive Pearson coefficient because of its linear shape.


In [8]:
X1

0      231.980812
1      256.444591
2      208.471721
3      241.257488
4      203.468106
          ...    
610    203.113253
611    217.476162
612    239.832359
613    195.012965
614    240.972234
Name: x, Length: 615, dtype: float64

In [7]:
# X1 is a Series holding only the "x" column, so reload the full table.
# Assumes "mean-diff.csv" also contains a "y" column.
df = pd.read_csv("mean-diff.csv")
X = df[["x", "y"]].to_numpy()
corr = np.corrcoef(X, rowvar=False)
cov = np.cov(X, rowvar=False)
pd.DataFrame(corr, columns=["X","Y"], index=["X", "Y"])

In [None]:
pd.DataFrame(cov, columns=["X","Y"], index=["X", "Y"])

As expected, we see a very large Pearson correlation coefficient between the two variables, because the Pearson coefficient measures the linear association between variables.
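
The link between the two matrices can be checked numerically: the Pearson coefficient is the covariance divided by the product of the two standard deviations. The sketch below uses synthetic, roughly linear data, which is an assumption since the drawn dataset is not reproduced here; the names `data`, `cov_demo`, and `corr_from_cov` are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, roughly linear data standing in for the drawn scatter
x = rng.normal(250.0, 20.0, size=500)
y = 1.5 * x + rng.normal(0.0, 10.0, size=500)
data = np.column_stack([x, y])

cov_demo = np.cov(data, rowvar=False)
std = np.sqrt(np.diag(cov_demo))
# Pearson correlation: normalise each covariance by the two standard deviations
corr_from_cov = cov_demo / np.outer(std, std)

print(corr_from_cov[0, 1])
print(np.allclose(corr_from_cov, np.corrcoef(data, rowvar=False)))
```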

## Outlier detection

In lay terms, outliers are data points that are abnormal and do not fall within the normal data-generating regime. More precisely, an outlier is a rare event $E$ that has a very low probability of occurring. Mathematically speaking, it has a low value of the probability density function $p(x)$. 
To classify an event as an outlier we first need to define a threshold parameter $\alpha$, the boundary probability that defines what counts as an outlier. 
Outliers are then the events that satisfy $p(x) < \alpha$. 

In [6]:
def points_inside_radius(df, x, y, radius):
    # Euclidean distance of every row from the point (x, y)
    distances = np.sqrt((df["x"] - x)**2 + (df["y"] - y)**2)
    # Keep only the rows strictly inside the circle
    return df[distances < radius]

x=300
y=300
r=10
inside_radius = points_inside_radius(df, x, y, r)
# First point found near the centre; raises IndexError if the circle is empty
p = inside_radius.iloc[0]

In [None]:
_, ax = plt.subplots()
ax = df.plot.scatter(x="x", y="y", title="X vs. Y", ax=ax)

plt.annotate('X', 
             xy=(p["x"], p["y"]),
             color="red", 
             fontsize=20)

plt.annotate('X1', 
             xy=(p["x"]+100, p["y"]+100),
             color="red", 
             fontsize=15)


plt.annotate('X2', 
             xy=(p["x"]-60, p["y"]+130),
             color="red", 
             fontsize=15)


The plot above shows the points X1 and X2, which are approximately equidistant from the centre point X. The main difference between them is that one lies in a region that is sparse in data, so to the human eye it appears to be an outlier. We could check this by estimating the probability density function for this dataset, evaluating $p(x_2)$, and verifying that it is smaller than $\alpha$.
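
A minimal sketch of that check, assuming a synthetic linear-shaped cloud in place of the drawn data and using scikit-learn's `KernelDensity`. The two query points, the bandwidth of 20, and the threshold `alpha = 1e-6` are hypothetical values chosen for illustration only.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Synthetic linear-shaped cloud centred at the origin (stand-in for the drawn data)
x = rng.normal(0.0, 100.0, size=500)
y = x + rng.normal(0.0, 10.0, size=500)
cloud = np.column_stack([x, y])

# Two query points at the same distance from the centre:
# one along the cloud, one in an empty region
queries = np.array([[100.0, 100.0], [-100.0, 100.0]])

kde_demo = KernelDensity(kernel="gaussian", bandwidth=20.0).fit(cloud)
density = np.exp(kde_demo.score_samples(queries))  # score_samples returns log-density

alpha = 1e-6  # hypothetical threshold on the density
is_outlier = density < alpha
print(density, is_outlier)
```

Both points are equally far from the centre, yet only the one in the sparse region falls below the threshold, which matches the visual impression.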

## Density estimation

### Kernel Density Estimation - KDE
KDE has two parameters:
* Basis Function (kernel)
* Smoothing Parameter

KDE places a kernel function at each data point and sums the responses. The formula is the following: 

$f(x) = \frac{1}{N} \sum_{i=1}^{N}{K(x-x_i)}$

The first parameter of KDE is the choice of the kernel type. 
The kernel function should have the following properties:
* $K(x) \ge 0$ 
* $K(x) = K(-x)$
* $K'(x) \le 0$ for $x > 0$

The kernel represents our assumption about how dense the $\epsilon$-neighbourhood around each data point is.   

The second choice we need to make is the kernel bandwidth $h$. This parameter controls the width of the kernel function around each data point. With the bandwidth written explicitly, the one-dimensional estimator becomes $f(x) = \frac{1}{Nh} \sum_{i=1}^{N}{K\left(\frac{x-x_i}{h}\right)}$.  
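
The sum above can be written out directly. The sketch below uses a Gaussian kernel in one dimension and scales the kernel by the bandwidth, i.e. $\frac{1}{Nh}\sum_i K\left(\frac{x-x_i}{h}\right)$. The function names, the synthetic sample, and `h=0.4` are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(u):
    # Standard normal density: non-negative, symmetric, decreasing in |u|
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde_1d(grid, samples, h):
    """Evaluate f(x) = 1/(N h) * sum_i K((x - x_i) / h) at every grid point."""
    u = (grid[:, None] - samples[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=200)   # hypothetical 1-D sample
grid = np.linspace(-6.0, 6.0, 1201)

density = kde_1d(grid, samples, h=0.4)
# A valid density should integrate to roughly 1 (simple Riemann sum)
area = density.sum() * (grid[1] - grid[0])
print(area)
```

A smaller `h` gives a spikier estimate that follows individual points, while a larger `h` gives a smoother one.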

**Silverman's rule of thumb**: a rule for choosing $h$ automatically from the data
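
One commonly quoted form of the rule for a one-dimensional Gaussian kernel is $h = 0.9 \, \min(\hat{\sigma}, \mathrm{IQR}/1.34) \, n^{-1/5}$. The sketch below implements that form; the function name `silverman_bandwidth` and the synthetic sample are assumptions for illustration.

```python
import numpy as np

def silverman_bandwidth(samples):
    """Rule-of-thumb bandwidth for a 1-D Gaussian kernel."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    sigma = samples.std(ddof=1)
    q75, q25 = np.percentile(samples, [75, 25])
    # Use the smaller of the two spread estimates to be robust to heavy tails
    spread = min(sigma, (q75 - q25) / 1.34)
    return 0.9 * spread * n ** (-1 / 5)

rng = np.random.default_rng(0)
h = silverman_bandwidth(rng.normal(0.0, 1.0, size=1000))  # hypothetical sample
print(h)
```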

**Weighting the data**
Adding weights to the kernel responses allows us not to treat every response the same. We are introducing the   

**Boundary conditions** 

In [None]:
from sklearn.neighbors import KernelDensity
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X)
np.exp(kde.score_samples(X))

#sample = sample.reshape((len(sample), 1))
#model.fit(sample)

Centering the dataset around 0 by subtracting the mean from the dataset.

In [None]:
dim_mu = X.mean(0)
print(dim_mu)
centered_X = X-dim_mu

_, ax = plt.subplots()
pd.DataFrame(centered_X, columns=["x", "y"]).plot.scatter(x="x", y="y", ax=ax)
df.plot.scatter(x="x", y="y", ax=ax, color="red")


In [None]:
C_ = np.linalg.inv(cov)
decor_X = centered_X @ C_ 

pd.DataFrame(decor_X, columns=["x", "y"]).plot.scatter(x="x", y="y")
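
If the aim of this step is to decorrelate the centred data, note that multiplying by the inverse covariance $C^{-1}$ leaves the result with covariance $C^{-1}$ rather than the identity. One possible alternative, sketched here on synthetic data, applies an inverse square root of $C$ obtained from a Cholesky factor. The variable names and the data are assumptions, and this is not necessarily what the notebook intends.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated 2-D data (stand-in for the drawn dataset)
x = rng.normal(0.0, 50.0, size=400)
y = 0.8 * x + rng.normal(0.0, 15.0, size=400)
data = np.column_stack([x, y])

centered = data - data.mean(axis=0)
C = np.cov(centered, rowvar=False)

# C = L @ L.T, so solving L z = x for each row gives covariance L^-1 C L^-T = I
L = np.linalg.cholesky(C)
whitened = np.linalg.solve(L, centered.T).T

print(np.round(np.cov(whitened, rowvar=False), 6))
```

After this transform the two coordinates have unit variance and zero covariance, so a scatter plot of `whitened` should look roughly circular.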


In [None]:
C_ = np.linalg.inv(cov)
decor_X = centered_X @ C_ 
# N x N matrix; its diagonal holds the squared Mahalanobis distance of each point
result = decor_X @ centered_X.T
result.shape
#pd.DataFrame(result, columns=["x", "y"]).plot.scatter(x="x", y="y")
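
The diagonal of `result` holds $x_i^\top C^{-1} x_i$ for each centred point, which is the squared Mahalanobis distance. A sketch that computes only that diagonal, avoiding the full $N \times N$ matrix, is given below on synthetic data. The variable names and the cut-off of 3 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated 2-D data (stand-in for the drawn dataset)
x = rng.normal(0.0, 50.0, size=400)
y = 0.8 * x + rng.normal(0.0, 15.0, size=400)
data = np.column_stack([x, y])

centered = data - data.mean(axis=0)
C_inv = np.linalg.inv(np.cov(centered, rowvar=False))

# Row-wise quadratic form x_i^T C^-1 x_i, i.e. diag(centered @ C_inv @ centered.T)
d2 = np.einsum("ij,jk,ik->i", centered, C_inv, centered)
mahalanobis = np.sqrt(d2)

# Hypothetical cut-off: flag points more than 3 Mahalanobis units from the centre
suspects = np.flatnonzero(mahalanobis > 3.0)
print(mahalanobis[:5], suspects)
```

Unlike the plain Euclidean distance, this distance accounts for the shape of the cloud, so a point lying off the main direction of the data scores higher than an equally distant point lying along it.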