# Researcher using Penguin dataset

### Install the lomas-client library

In [1]:
!pip install lomas-client

Collecting lomas-client
  Downloading lomas_client-0.3.5.tar.gz (14 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting lomas-core==0.3.5 (from lomas-client)
  Downloading lomas_core-0.3.5.tar.gz (10 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: lomas-client, lomas-core
  Building wheel for lomas-client (pyproject.toml) ... done
  Created wheel for lomas-client: filename=lomas_client-0.3.5-py3-none-any.whl size=15887 sha256=c3a35eebe63bf0966ea87ac3c81da525529fff050a32681ae51fa4de45c5cfcc
  Stored in directory: /root/.cache/pip/wheels/19/9a/c5/c00a710a3877b2c3584f36105082b34ded1a1c3491fb4deddc
  Building wheel for lomas-core (pyproject.toml) ... done
  Created wheel for lomas

### Prepare code to interact with the platform

In [2]:
from lomas_client import Client

In [3]:
APP_URL = "http://lomas_server"
USER_NAME = "Dr. Antartica"
DATASET_NAME = "PENGUIN"
client = Client(url=APP_URL, user_name = USER_NAME, dataset_name = DATASET_NAME)

The client is now ready to send queries to the platform.

### Understand the available data

##### Metadata

In [4]:
penguin_metadata = client.get_dataset_metadata()
penguin_metadata

{'max_ids': 1,
 'rows': 344,
 'row_privacy': True,
 'censor_dims': False,
 'columns': {'species': {'private_id': False,
   'nullable': False,
   'max_partition_length': None,
   'max_influenced_partitions': None,
   'max_partition_contributions': None,
   'type': 'string',
   'cardinality': 3,
   'categories': ['Adelie', 'Chinstrap', 'Gentoo']},
  'island': {'private_id': False,
   'nullable': False,
   'max_partition_length': None,
   'max_influenced_partitions': None,
   'max_partition_contributions': None,
   'type': 'string',
   'cardinality': 3,
   'categories': ['Torgersen', 'Biscoe', 'Dream']},
  'bill_length_mm': {'private_id': False,
   'nullable': False,
   'max_partition_length': None,
   'max_influenced_partitions': None,
   'max_partition_contributions': None,
   'type': 'float',
   'precision': 64,
   'lower': 30.0,
   'upper': 65.0},
  'bill_depth_mm': {'private_id': False,
   'nullable': False,
   'max_partition_length': None,
   'max_influenced_partitions': None,
   'm

In [5]:
penguin_metadata["columns"]["bill_length_mm"]

{'private_id': False,
 'nullable': False,
 'max_partition_length': None,
 'max_influenced_partitions': None,
 'max_partition_contributions': None,
 'type': 'float',
 'precision': 64,
 'lower': 30.0,
 'upper': 65.0}

In [6]:
penguin_metadata["columns"]["flipper_length_mm"]

{'private_id': False,
 'nullable': False,
 'max_partition_length': None,
 'max_influenced_partitions': None,
 'max_partition_contributions': None,
 'type': 'float',
 'precision': 64,
 'lower': 150.0,
 'upper': 250.0}
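
The `lower` and `upper` fields are the values that later feed the `bounds_X` and `bounds_y` arguments of the regression model. As a minimal sketch, assuming the metadata keeps the dictionary shape shown above, the bounds of every float column can be collected in one pass. The helper name `numeric_bounds` is hypothetical, and the example dictionary copies only a few fields from the outputs above.

```python
def numeric_bounds(metadata):
    """Return {column: (lower, upper)} for every float column in the metadata."""
    return {
        name: (col["lower"], col["upper"])
        for name, col in metadata["columns"].items()
        if col.get("type") == "float"
    }


# Hand-copied subset of the metadata printed above (not fetched from the server).
example_metadata = {
    "columns": {
        "species": {"type": "string", "cardinality": 3},
        "bill_length_mm": {"type": "float", "lower": 30.0, "upper": 65.0},
        "flipper_length_mm": {"type": "float", "lower": 150.0, "upper": 250.0},
    }
}

bounds = numeric_bounds(example_metadata)
print(bounds)
```

String columns such as `species` carry no `lower`/`upper` keys, which is why the sketch filters on `type` first.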

##### Dummy dataset (RANDOM DATA)

In [7]:
df_dummy = client.get_dummy_dataset()
df_dummy.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Dream | 61.800324 | 20.774048 | 227.899635 | 3509.636957 | FEMALE |
| 1 | Gentoo | Torgersen | 54.48975 | 22.718264 | 163.455221 | 6592.209478 | FEMALE |
| 2 | Chinstrap | Dream | 39.305449 | 18.007412 | 203.606804 | 5906.470177 | FEMALE |
| 3 | Chinstrap | Torgersen | 63.921173 | 14.438975 | 201.422287 | 2552.942055 | FEMALE |
| 4 | Chinstrap | Dream | 57.256282 | 13.139363 | 235.757214 | 6985.173289 | MALE |


### Scientific question

Question: Are bill length and flipper length correlated?

- $H_0$: Bill length and flipper length ARE NOT correlated.
- $H_1$: Bill length and flipper length ARE correlated.

Model $Y = \alpha + \beta x$ where:
- x is the bill length,
- Y is the flipper length,
- $\alpha$ and $\beta$ are unknown.

Under $H_0$, the slope is zero: $\beta = 0$.
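
To make the roles of $\alpha$ and $\beta$ concrete, here is a minimal, non-private sketch of the ordinary least-squares estimates for this model. It is only an illustration: the function name `least_squares_fit` and the toy values are made up, and the notebook itself obtains its estimates from a differentially private regression through Lomas rather than from this formula.

```python
def least_squares_fit(x, y):
    """Ordinary least-squares estimates (alpha, beta) for y = alpha + beta * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    beta = s_xy / s_xx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Toy values, NOT penguin data: y is exactly 1 + 2 * x.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [3.0, 5.0, 7.0, 9.0]

alpha_hat, beta_hat = least_squares_fit(toy_x, toy_y)
print(alpha_hat, beta_hat)
```

A slope estimate close to zero would be compatible with $H_0$; the test in the following sections quantifies how far from zero it has to be.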

### Plan

Dr. Antartica will compute $t_{score} = \frac{\beta_{estimate} - \beta_0}{SE}$,

where $\beta_{estimate}$ is the slope estimated on the real data, $\beta_0 = 0$ is the slope under $H_0$, and $SE$ is the standard error of the least-squares estimator of $\beta$.

Therefore, she needs the number of penguins, the variance of the flipper length, the variance of the bill length (for $SE$) and $\beta_{estimate}$.

She will compute $t_{critical}$ for a two-sided test at the 95% confidence level (significance level $\alpha = 5\%$).

Then, 
- If $|t_{score}| > t_{critical}$, reject the null hypothesis ($H_0$).
- If $|t_{score}| \leq t_{critical}$, fail to reject the null hypothesis.
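
The decision rule above can be sketched as a small helper. The function name `reject_null` and the numeric values are illustrative assumptions, not results from the platform.

```python
def reject_null(t_score, t_critical):
    """Two-sided rule: reject H0 when |t_score| exceeds t_critical."""
    return abs(t_score) > t_critical


# Illustrative values only.
decision_small = reject_null(1.5, 2.0)      # |1.5| <= 2.0, fail to reject
decision_negative = reject_null(-2.5, 2.0)  # |-2.5| > 2.0, reject
print(decision_small, decision_negative)
```

Taking the absolute value matters because a strongly negative slope is as much evidence against $H_0$ as a strongly positive one.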

### Query Lomas to get parameters

In [8]:
from sklearn.pipeline import Pipeline
from diffprivlib import models
from scipy.stats import t
import pandas as pd

#### Compute $\beta_{estimate}$

In [9]:
bill_length_meta = penguin_metadata['columns']['bill_length_mm']
flipper_length_meta = penguin_metadata['columns']['flipper_length_mm']

pipeline = Pipeline([
    (
        'lr', 
        models.LinearRegression(
            epsilon = 4.0, 
            bounds_X=(bill_length_meta['lower'], bill_length_meta['upper']), 
            bounds_y=(flipper_length_meta['lower'], flipper_length_meta['upper'])
        )
    ),
])

In [10]:
TEST_SIZE = 0.3

response = client.diffprivlib.query(
    pipeline = pipeline,
    feature_columns = ['bill_length_mm'],
    target_columns = ['flipper_length_mm'],
    test_size = TEST_SIZE
)

In [11]:
model = response.result.model.steps[0][1]
model

In [12]:
alpha_estimate = model.intercept_
alpha_estimate

113.67254423500745

In [13]:
beta_estimate = model.coef_[0]
beta_estimate

2.008158974688773

#### Compute SE (standard error of the slope)

$SE = \sqrt{\frac{RSS}{N_{test} - 2}} \cdot \frac{1}{\sqrt{Var(flipperlength)}}$.

From the documentation, the model score is $score = 1 - \frac{RSS}{TSS}$, where $RSS$ is the residual sum of squares and $TSS$ is the total sum of squares.

Rewriting, we have $RSS = (1 - score) \cdot TSS$, with $TSS = N_{test} \cdot Var(flipperlength)$ and $N_{test} = N_{tot} \cdot TEST\_SIZE$.

We need:
- the model score (computed on the test set)
- the variance of the flipper length
- the total number of penguins (we know the test set size is 30% of the total size)
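
The chain from these three quantities to $SE$ can be sketched as one function, using the square roots found in the notebook's own computation cell. The function name `slope_standard_error` and the round input values are hypothetical; they are not the values returned by the server.

```python
import math


def slope_standard_error(score, var_y, n_total, test_size):
    """SE following this section: score -> RSS -> SE."""
    n_test = n_total * test_size   # size of the test set
    tss = n_test * var_y           # total sum of squares
    rss = (1.0 - score) * tss      # residual sum of squares
    return math.sqrt(rss / (n_test - 2)) / math.sqrt(var_y)


# Hypothetical round inputs, chosen only to illustrate the arithmetic.
se_example = slope_standard_error(score=0.29, var_y=560.0, n_total=340, test_size=0.3)
print(se_example)
```

Substituting $TSS$ and $RSS$ into the expression shows that the variance cancels, so with these definitions $SE = \sqrt{(1 - score) \cdot \frac{N_{test}}{N_{test} - 2}}$.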

In [25]:
score  = response.result.score
score

0.2915483855298694

In [14]:
query = "SELECT \
        STD(flipper_length_mm) AS std_flipper_length, \
        COUNT(flipper_length_mm) AS nb_penguin \
        FROM df"

In [15]:
sql_response = client.smartnoise_sql.query(query = query, epsilon = 1.5, delta = 1e-4)
res = sql_response.result.df
res

Unnamed: 0,std_flipper_length,nb_penguin
0,23.658498,339


In [16]:
var_flipper_length = res["std_flipper_length"][0]**2
N_TOT = res["nb_penguin"][0]

In [17]:
N_TEST = N_TOT * TEST_SIZE
TSS = N_TEST * var_flipper_length
RSS = (1 - score) * TSS
SE = (RSS / (N_TEST - 2))**0.5 / (var_flipper_length**0.5)
SE

0.8500960426689488

#### Compute $t_{score}$

In [18]:
t_score = (beta_estimate - 0)/SE
t_score

1.79220213282784

#### Compute $t_{critical}$

In [22]:
alpha = 0.05
dof = N_TEST - 2  # degrees of freedom: test-set size minus the two fitted parameters
t_critical = t.ppf(1 - alpha / 2, dof)
t_critical

1.9840446252174464

### Conclusion

Test if $|t_{score}| > t_{critical}$.

In [23]:
if abs(t_score) > t_critical:  # two-sided test: compare the absolute value
    print("Result: t_score > t_critical")
    print("We reject the null hypothesis: there is a correlation between bill length and flipper length.")
else:
    print("Result: t_score <= t_critical")
    print("We fail to reject the null hypothesis.")

Result: t_score <= t_critical
We fail to reject the null hypothesis.
