# EDA Functions Demo
This notebook shows how to use the functions in `eda_functions.py`.

Make sure to import the functions.

In [30]:
import pandas as pd
from eda_functions import merge_with_target, split_data

Read in the dataframes as usual.

In [31]:
df_comments = pd.read_csv('../data/toxicity_annotated_comments.tsv', sep='\t', dtype={'rev_id': int})
df_comments.head(3)

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split
0,2232,This:NEWLINE_TOKEN:One can make an analogy in ...,2002,True,article,random,train
1,4216,`NEWLINE_TOKENNEWLINE_TOKEN:Clarification for ...,2002,True,user,random,train
2,8953,Elected or Electoral? JHK,2002,False,article,random,test


In [32]:
df_annotations = pd.read_csv('../data/toxicity_annotations.tsv', sep='\t', dtype={'rev_id': int})
df_annotations.head(3)

Unnamed: 0,rev_id,worker_id,toxicity,toxicity_score
0,2232,723,0,0.0
1,2232,4000,0,0.0
2,2232,3989,0,1.0


Use the function `merge_with_target` to build the dataframe we need to start modeling: one row per `rev_id`, with the `comment` text and a `target` label.
<br>You **must** specify the name of the target column in `df_annotations`, as in the example below.

In [33]:
df_merged = merge_with_target(df_comments, df_annotations,
                              target_col_name='toxicity',
                              threshold=0.5)
df_merged.head()

rev_id,comment,target
2232,This:NEWLINE_TOKEN:One can make an analogy in ...,0
4216,`NEWLINE_TOKENNEWLINE_TOKEN:Clarification for ...,0
8953,Elected or Electoral? JHK,0
26547,`This is such a fun entry. DevotchkaNEWLINE_...,0
28959,Please relate the ozone hole to increases in c...,0
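The source of `merge_with_target` is not shown in this notebook, so the following is only a minimal sketch of how such a function could work. It assumes that the per-annotator labels are averaged for each `rev_id` and that a comment is labeled positive when that average is strictly above `threshold`. The name `merge_with_target_sketch` and the toy dataframes are hypothetical, and the real function may aggregate or threshold differently.

```python
import pandas as pd


def merge_with_target_sketch(df_comments, df_annotations,
                             target_col_name, threshold=0.5):
    """Hypothetical stand-in for merge_with_target (assumed behavior)."""
    # Assumption: average the annotators' labels for each comment.
    mean_label = df_annotations.groupby('rev_id')[target_col_name].mean()
    # Assumption: label is 1 when the average is strictly above the threshold.
    target = (mean_label > threshold).astype(int).rename('target')
    # Keep only the comment text, indexed by rev_id, and attach the label.
    comments = df_comments.set_index('rev_id')[['comment']]
    return comments.join(target, how='inner')


# Toy data: three annotators per comment.
toy_comments = pd.DataFrame({
    'rev_id': [1, 2],
    'comment': ['a polite remark', 'a rude remark']})
toy_annotations = pd.DataFrame({
    'rev_id': [1, 1, 1, 2, 2, 2],
    'toxicity': [0, 0, 1, 1, 1, 0]})

toy_merged = merge_with_target_sketch(
    toy_comments, toy_annotations, target_col_name='toxicity')
print(toy_merged)
```

In the toy data, one of three annotators flags the first comment and two of three flag the second, so under these assumptions only the second comment receives a positive label.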


Use the function `split_data` to create training and testing splits.
<br>The function operates on the output of `merge_with_target`.

In [34]:
X_train, X_test, y_train, y_test = split_data(
    df_merged,
    pct_positive=0.5,
    test_size=5_000,
    train_size=20_000,
    random_state=42)

As shown below, we now have training and testing datasets of the specified sizes. The training dataset is split 50/50 by class, while the testing dataset retains the original class proportions (about 89% class 0 and 11% class 1 here).

In [35]:
pd.DataFrame({
    f'Train (n={y_train.shape[0]})': y_train.value_counts(normalize=True),
    f'Test (n={y_test.shape[0]})': y_test.value_counts(normalize=True)})

Unnamed: 0,Train (n=20000),Test (n=5000)
0,0.5,0.8892
1,0.5,0.1108
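For readers curious how a split with these properties can be produced, here is a minimal sketch. It is not the implementation in `eda_functions.py`. It assumes the test set is drawn first as a plain random sample (so it keeps the original class proportions) and the training set is then drawn from the remaining rows, class by class, to reach `pct_positive`. The name `split_data_sketch` and the toy dataframe are hypothetical, and the sketch assumes a unique index and `comment` and `target` columns as in `df_merged`.

```python
import pandas as pd


def split_data_sketch(df, pct_positive, test_size, train_size,
                      random_state=None):
    """Hypothetical stand-in for split_data (assumed behavior)."""
    # Draw the test set first so it keeps the original class proportions.
    test = df.sample(n=test_size, random_state=random_state)
    # Exclude test rows so the two sets never overlap (needs a unique index).
    remaining = df.drop(test.index)

    # Number of positive and negative rows needed for the training set.
    n_pos = int(round(train_size * pct_positive))
    n_neg = train_size - n_pos
    pos = remaining[remaining['target'] == 1].sample(
        n=n_pos, random_state=random_state)
    neg = remaining[remaining['target'] == 0].sample(
        n=n_neg, random_state=random_state)

    # Shuffle so the classes are not grouped together.
    train = pd.concat([pos, neg]).sample(frac=1, random_state=random_state)
    return train['comment'], test['comment'], train['target'], test['target']


# Toy data: 1,000 comments, 20% positive.
toy = pd.DataFrame(
    {'comment': [f'comment {i}' for i in range(1000)],
     'target': [1 if i % 5 == 0 else 0 for i in range(1000)]},
    index=pd.RangeIndex(1000, name='rev_id'))

X_tr, X_te, y_tr, y_te = split_data_sketch(
    toy, pct_positive=0.5, test_size=200, train_size=100, random_state=42)
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Balancing only the training set is a common choice for imbalanced problems: the model sees enough positive examples to learn from, while metrics computed on the test set still reflect the class mix found in the original data.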
