### Extract data & train/test sets 
* pretrain data: all black + white + combinations
    * Could supplement with more black/white cards from the internet, and other joke sources (reddit, jester), but that is overkill. 
    * Main goal is just to "warm up" a language/sentence model, generically. 

* train/test sets: Split by game/`fake_round_id`
    * Train data: Drop duplicates
    * Keep all positive (`won` = 1) cases, and keep every occurrence of a pair by default

* IGNORE pick 2 cases for now (overcomplicates evaluation) ; but keep them for the pretraining

Each row of the file represents a white card that was given to a player in a round and how it performed. It also contains information about the round like how long it took and what black card was dealt.
So columns that describe general round data will be exactly the same for all ten rows of that round. These columns are: fake_round_id, round_completion_seconds, round_skipped, black_card_text, and black_card_pick_num.
But columns that describe a property of the white card will change over the ten rows of that round. These columns are: white_card_text, won, and winning_index.


The columns:
* fake_round_id: a unique integer for every round. This is to help you separate one round from another to find which 10 cards were presented in every round.
* round_completion_seconds: the number of seconds it took for the user to pick a winner after being presented the cards for that round. You'll probably want to filter out users that decide too quickly by setting some minimum threshold for this. Some of these numbers can be insanely high as well since the user just left their browser open for hours or possibly days before picking a winner.
* round_skipped: we have a red button in the top right of the lab page that says "No Good Plays!". If a user clicks this button, round_skipped will be "True" and no white card will be marked as a winner.
* black_card_text: self explanatory.
* black_card_pick_num: The number of white cards that the black card requires. Generally 1, but will be 2 for black cards like "That's right, I killed _____. How, you ask? _____."
* white_card_text: self explanatory.
* won: True if this white card won the round.
* winning_index: This is for storing the order that the winning card(s) were picked in. If the white card didn't win, this will just contain a null value (NULL). If the card won and the black card was a normal "Pick 1" this will always be 0. But if the black card was a "Pick 2", one of the winners will be 0 and the other will be 1. This is how you figure out which blank in pick 2 black cards was meant to be filled by which winning white card. The first blank is filled by the 0 and the second one by the 1.
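
The `winning_index` rule above (0 fills the first blank, 1 fills the second) can be sketched as follows. This is a minimal illustration, not code from this notebook: the helper name `fill_pick2` and the two white cards are made up, and it assumes blanks are runs of two or more underscores.

```python
import re

BLANK = re.compile(r"_{2,}")  # assumption: a blank is a run of 2+ underscores

def fill_pick2(black_card_text, winners):
    """Fill blanks left to right with the winning cards, ordered by winning_index.

    `winners` is a list of (white_card_text, winning_index) tuples for one round.
    """
    text = black_card_text
    for white_text, _ in sorted(winners, key=lambda pair: pair[1]):
        punchline = re.sub(r"\.$", "", white_text)  # drop one trailing period
        # count=1 fills only the first blank that is still empty
        text = BLANK.sub(lambda m: punchline, text, count=1)
    return text

black = "That's right, I killed _____. How, you ask? _____."
winners = [("A windmill full of corpses.", 1), ("Grandma.", 0)]  # hypothetical cards
filled = fill_pick2(black, winners)
print(filled)
```

Sorting by `winning_index` first means the order of the rows within a round does not matter.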

* Sampling whole groups (for the test set):

* https://stackoverflow.com/questions/44007496/random-sampling-with-pandas-data-frame-disjoint-groups
    ```
    from sklearn.model_selection import GroupShuffleSplit

    # Initialize the GroupShuffleSplit.
    gss = GroupShuffleSplit(n_splits=1, test_size=0.5)

    # Get the indexers for the split.
    idx1, idx2 = next(gss.split(df, groups=df.ids))

    # Get the split DataFrames.
    df1, df2 = df.iloc[idx1], df.iloc[idx2]
    ```

* https://hippocampus-garden.com/pandas_group_sample/
    ```
    sampled_users = np.random.choice(df["user_id"].unique(), 100)
    df_sampled = df.groupby('user_id').filter(lambda x: x["user_id"].values[0] in sampled_users)
    df_sampled["user_id"].nunique()
    ```

In [1]:
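import math  # used later for math.ceil when sizing the test split
import numpy as np  # used later for np.random.choice
import pandas as pd
import seaborn as sns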
pd.set_option("display.precision", 4)
pd.set_option('display.width', 1024)
pd.set_option('display.max_colwidth', None)
%matplotlib inline 
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split,GridSearchCV, cross_val_score, cross_val_predict
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, roc_auc_score  
# https://stackoverflow.com/questions/53784971/how-to-disable-convergencewarning-using-sklearn
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)
from sklearn.model_selection import GroupShuffleSplit

ord_enc = OrdinalEncoder(handle_unknown="use_encoded_value",unknown_value=-1)

In [2]:
# from sentence_transformers import SentenceTransformer, util ## versions error? 

In [4]:
LOAD_CSV = False  ## True: load raw data from csv instead of parquet (which has had some cols dropped)
MIN_PAIR_FREQ = 1
SAVE_TRAIN_DATA = False

LOWERCASE_DATA = False
DROP_DOUBLES = True ## drop white cards with pick 2 options ; and their blacks. (TODO: analyze interesting pairs)
DROP_SKIPPED = True ## ignore rounds where skipped

DATA_PATH = #REDACTED

Load data

* Note there are some near-duplicate cards (e.g. the football players example below). Could use MinHash - https://github.com/chrisjmccormick/MinHash/blob/master/runMinHashExample.py , LSH/shingles: https://onestopdataanalysis.com/lsh/  (broken library?) , https://github.com/zyocum/dedup/blob/master/dedup.py ,

`Datasketch` https://www.learndatasci.com/tutorials/building-recommendation-engine-locality-sensitive-hashing-lsh-python/ ,
https://stackoverflow.com/questions/25114338/approximate-string-matching-using-lsh , 

SKLearn friendly?: http://ethen8181.github.io/machine-learning/recsys/content_based/lsh_text.html

`dedupe` library - https://github.com/dedupeio/dedupe

* Paraphrase mining using sentence bert - https://www.sbert.net/examples/applications/paraphrase-mining/README.html  (may be too semantic)

* Example: "10 football players with erections *barreling* towards you at full speed." vs. "10 football players with erections **barrelling** towards you at full speed."
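
As a rough illustration of catching near-duplicates like the one above, here is a small sketch using the standard library's `difflib.SequenceMatcher`. Note that this is a simpler stand-in, not MinHash/LSH or any of the libraries linked above; the helper name `near_duplicates` and the 0.95 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicates(cards, threshold=0.95):
    """Return (card_a, card_b, ratio) for every pair whose similarity >= threshold."""
    # Light normalization: lowercase, collapse whitespace, drop trailing periods.
    normalized = [" ".join(card.lower().split()).rstrip(".") for card in cards]
    pairs = []
    for i, j in combinations(range(len(cards)), 2):
        ratio = SequenceMatcher(None, normalized[i], normalized[j]).ratio()
        if ratio >= threshold:
            pairs.append((cards[i], cards[j], round(ratio, 3)))
    return pairs

cards = [
    "10 football players with erections barreling towards you at full speed.",
    "10 football players with erections barrelling towards you at full speed.",
    "Being fat from noodles.",
]
dupes = near_duplicates(cards)
for card_a, card_b, ratio in dupes:
    print(ratio, "|", card_a, "|", card_b)
```

The comparison is all-pairs (quadratic), which is probably acceptable for a couple of thousand unique white cards; hashing approaches such as MinHash matter once the card set gets much larger.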

Task modelling:
* Ranking = most accurate
* Regression/classification: can be done just over single cards, or between pairs. 
    * if pairwise: can use sentencebert/SNLI models (which assume pairs of sentences) - https://github.com/UKPLab/sentence-transformers
    
    
Examples:

* https://www.sbert.net/examples/training/sts/README.html#training-data
* https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli.py
* https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli_v2.py


* Keras example (not sentenceBert): https://keras.io/examples/nlp/semantic_similarity_with_bert/


* Interpretability tool : https://github.com/cdpierse/transformers-interpret

In [5]:
%%time
#### Original processing of data, before I saved it as a parquet for faster subsequent loading: 
if LOAD_CSV:
    df = pd.read_csv(DATA_PATH)
    ## pad newlines and tabs with surrounding spaces (they are kept, not removed)
    df["white_card_text"] = df["white_card_text"].replace("\n", " \n ",regex=True).replace("\t", " \t ",regex=True) # .replace("\r", "")
    df["black_card_text"] = df["black_card_text"].replace("\n", " \n ",regex=True).replace("\t", " \t ",regex=True) # .replace("\r", "")

    ## add number within group - may not actually be the read order shown on screen!! 
    df['ID_index'] = df.groupby('fake_round_id').cumcount() + 1

    print(df[['black_card_text', 'black_card_pick_num', 'white_card_text',"fake_round_id"]].nunique())
    print(df.describe())
    
    ## save to disk in compressed, binary format
    df.drop(["winning_index"],axis=1).to_parquet('cah_lab_data.parquet')

else:  # load data saved in above code previously
    df = pd.read_parquet('cah_lab_data.parquet')
#     df= pd.read_feather('cah_lab_data.feather')

if DROP_DOUBLES:
    df = df.loc[df["black_card_pick_num"]==1]
if DROP_SKIPPED:
    df = df.loc[~df["round_skipped"]]
    
# added:
df["won"] = df["won"].astype(int)

df

CPU times: total: 1.91 s
Wall time: 1.32 s


Unnamed: 0,fake_round_id,round_completion_seconds,round_skipped,black_card_text,black_card_pick_num,white_card_text,won,ID_index
0,1,24,False,"Hi MTV! My name is Kendra, I live in Malibu, I'm into _____, and I love to have a good time.",1,Going inside at some point because of the mosquitoes.,0,1
1,1,24,False,"Hi MTV! My name is Kendra, I live in Malibu, I'm into _____, and I love to have a good time.",1,Being fat from noodles.,0,2
2,1,24,False,"Hi MTV! My name is Kendra, I live in Malibu, I'm into _____, and I love to have a good time.",1,Letting this loser eat me out.,0,3
3,1,24,False,"Hi MTV! My name is Kendra, I live in Malibu, I'm into _____, and I love to have a good time.",1,That chicken from Popeyes.®,0,4
4,1,24,False,"Hi MTV! My name is Kendra, I live in Malibu, I'm into _____, and I love to have a good time.",1,A sorry excuse for a father.,0,5
...,...,...,...,...,...,...,...,...
2989545,298955,7613,False,Oh my god! _____ killed Kenny!,1,Breastfeeding a ten-year-old.,0,6
2989546,298955,7613,False,Oh my god! _____ killed Kenny!,1,Happy daddies with happy sandals.,0,7
2989547,298955,7613,False,Oh my god! _____ killed Kenny!,1,Jerking off to a 10-second RealMedia clip.,0,8
2989548,298955,7613,False,Oh my god! _____ killed Kenny!,1,Getting naked and watching Nickelodeon.,0,9


In [6]:
df.describe()

Unnamed: 0,fake_round_id,round_completion_seconds,black_card_pick_num,won,ID_index
count,2446400.0,2446400.0,2446360.0,2446360.0,2446400.0
mean,149170.0,89.592,1.0,0.1,5.5
std,86119.0,3100.8,0.0,0.3,2.8723
min,1.0,2.0,1.0,0.0,1.0
25%,74948.0,10.0,1.0,0.0,3.0
50%,148580.0,17.0,1.0,0.0,5.5
75%,223510.0,27.0,1.0,0.0,8.0
max,298960.0,702100.0,1.0,1.0,10.0


* Weak but clear (not extreme) bias in favor of the "first" cards. Or is it toward cards in the middle of the screen?

In [7]:
df.corrwith(df["won"])

fake_round_id               1.1587e-19
round_completion_seconds    1.6692e-19
round_skipped                      NaN
black_card_pick_num                NaN
won                         1.0000e+00
ID_index                   -1.2425e-02
dtype: float64

In [8]:
df.groupby(["ID_index"])["won"].mean()

ID_index
1     0.1035
2     0.1077
3     0.1042
4     0.1027
5     0.0995
6     0.0987
7     0.0962
8     0.0957
9     0.0949
10    0.0967
Name: won, dtype: float64

#### normalize \____ 's 
* Remember to do this if evaluating
* May not be needed!
    * We also do this for consistency when doing find and replace , and for cases where there's no ___ in the text

In [None]:
df["has_"] = df["black_card_text"].str.contains("_{2,}")
### filter out cases with 2 blanks to fill ? 

df['black_card_text'] = df['black_card_text'].str.replace("_{2,}","__",regex=True) ## regex=True is required: recent pandas defaults to literal matching
## https://stackoverflow.com/questions/47696401/replace-character-of-column-value-with-string-from-another-column-in-pandas


### Pick 2s
* Keep only subset for free text, and self join is slow.... 
* Keep only sample of combinations, as otherwise we get huuuge # combinations

In [10]:
# df["black_card_pick_num"].describe() ## max 2, min 1

In [11]:
df_doubles = df.loc[df["black_card_pick_num"]>1].filter(['black_card_text', 'white_card_text',"won","fake_round_id"],axis=1).drop_duplicates().copy()

In [12]:
df = df.loc[df["black_card_pick_num"]<2] ## keep non doubles only

# df[["id_black","id_white"]] = ord_enc.fit_transform(df[["black_card_text","white_card_text"]])
df["id_white"] = ord_enc.fit_transform(df["white_card_text"].values.reshape(-1, 1))

In [13]:
if not DROP_DOUBLES:
    df_doubles["black_card_text"] = df_doubles["black_card_text"].str.replace("(PICK 2)","",case=False,regex=False)

    df_doubles["black_card_text"] = df_doubles["black_card_text"].str.replace("__.","__",case=False,regex=False)
    df_doubles.sort_values(['black_card_text',"won"],inplace=True,ascending=False)
    # display(df_doubles)
    print(df_doubles.nunique())
    print(df_doubles.drop_duplicates(['black_card_text', 'white_card_text']).shape[0],"unique double combinations")
    print("Rows:",df_doubles.shape[0])

    ### biased sample - winning combinations + some others (otherwise, we have too many possibles..)
    df_doubles_1 = df_doubles.loc[df_doubles["won"]==True]
    df_doubles_2 = df_doubles.loc[df_doubles["won"]==False].groupby(["black_card_text","fake_round_id"]).sample(2)
    df_doubles = pd.concat([df_doubles_1,df_doubles_2]).drop_duplicates(['black_card_text', 'white_card_text'])
    print("after biased sample: Rows:",df_doubles.shape[0])
    df_doubles.set_index(["black_card_text","fake_round_id"],inplace=True)

    # ## keep first, last occ per all the white cards - biased sample, 
    # df_doubles = pd.concat([df_doubles.drop_duplicates(keep="first"),
    #                         df_doubles.drop_duplicates(subset=["white_card_text"], keep="last")
    #                        ]).drop_duplicates()
    # df_doubles = df_doubles.drop_duplicates(subset=["white_card_text"],keep="first")

    ### could filter for mirror images ? 
    ### https://stackoverflow.com/questions/24676705/pandas-drop-duplicates-if-reverse-is-present-between-two-columns
    # df['check_string'] = df.apply(lambda row: ''.join(sorted([row['InteractorA'], row['InteractorB']])), axis=1)

    # df_doubles.drop(columns="won",errors="ignore",inplace=True)
    print("Rows (after dropping):",df_doubles.shape[0])
    df_doubles

In [14]:
%%time
if not DROP_DOUBLES:
    ### slow merge + check string
    # df_doubles = df_doubles.merge(df_doubles,on=["black_card_text"])
    ## self join
    df_doubles = df_doubles.join(df_doubles["white_card_text"],lsuffix="_x",rsuffix="_y")
    df_doubles = df_doubles.loc[df_doubles["white_card_text_x"]!=df_doubles["white_card_text_y"]]
    # ## drop mirror images (maybe relevant, but too redundant)
    df_doubles['check_string'] = df_doubles.apply(lambda row: ''.join(sorted([row['white_card_text_x'], row['white_card_text_y']])), axis=1)

    df_doubles = df_doubles.reset_index().drop_duplicates(subset=["check_string"])
    df_doubles =df_doubles.sort_values(['black_card_text',"won"],ascending=False).drop(columns=["won","check_string","fake_round_id"],errors="ignore").drop_duplicates()
    print(df_doubles.shape) ## 200 million by default


    ## NOTE: a `%%time` cell magic is only valid on the first line of a cell, so it is commented out here
    # %%time
    ## remove . at end of text
    df_doubles["white_card_text_x"] = df_doubles["white_card_text_x"].str.replace(r"\.$", '',regex=True)
    df_doubles['text'] = df_doubles.apply(lambda x:x['black_card_text'].replace("__", x['white_card_text_x'],1), axis=1)
    df_doubles['text'] = df_doubles.apply(lambda x:x['text'].replace("__", x['white_card_text_y'],1), axis=1)
    df_doubles['text'] = df_doubles['text'].str.replace("..",".",regex=False) ## fix double dots. Still leaves extra punct.. 
    # df_doubles = df_doubles[["black_card_text",'text']].drop_duplicates()
    print(df_doubles.shape)
    df_doubles

CPU times: total: 0 ns
Wall time: 0 ns


## df with all texts from data
white, black and combinations

In [15]:
df_text = pd.concat([df["black_card_text"],
                       df["white_card_text"],
#                     df_doubles['text'],df_doubles['black_card_text']
                    ]).drop_duplicates()

Fill in "punchlines" into blanks (___) (or add at end of text for cards without ___)

* NOTE: Previously I had tried a model with black and white text as 2 prompts (which meant the model needed to learn to fill in blanks); and a model of just the filled in answer (1 col). 
* Unclear what is "easier" to learn/model

* Note this doesn't handle cases with double blanks (it will just input the same punchline twice)
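
The filling rule described above can be sketched as a small standalone function. This mirrors the intent of the pandas cell that follows but is a plain-Python illustration; the name `fill_blank` is made up for this sketch.

```python
import re

BLANK = re.compile(r"_{2,}")  # a blank is a run of two or more underscores

def fill_blank(black_card_text, white_card_text):
    """Insert the white card into the blank, or append it if there is no blank."""
    punchline = re.sub(r"\.$", "", white_card_text)  # remove "." from end of text
    if BLANK.search(black_card_text):
        # Every blank gets the same punchline, so "pick 2" cards are not handled properly.
        return BLANK.sub(lambda m: punchline, black_card_text)
    return f"{black_card_text} {punchline}"

with_blank = fill_blank("Oh my god! _____ killed Kenny!", "Breastfeeding a ten-year-old.")
no_blank = fill_blank("What's about to take this dance floor to the next level?", "Crab.")
print(with_blank)
print(no_blank)
```

Using a function as the replacement argument avoids surprises if a white card ever contains a backslash, which `re.sub` would otherwise treat as an escape.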

In [16]:
df["has_"] = df["black_card_text"].str.contains("_{2,}",regex=True)

df['black_card_text'] = df['black_card_text'].str.replace("_{2,}","__",regex=True)
## https://stackoverflow.com/questions/47696401/replace-character-of-column-value-with-string-from-another-column-in-pandas
## https://datascience.stackexchange.com/questions/39345/how-to-replace-a-part-string-value-of-a-column-using-another-column
# Remove characters from one column based on string of another column

## remove "." from end of text
df["white_card_text"] = df["white_card_text"].str.replace(r"\.$", '',regex=True)
df['text'] = df.apply(lambda x:x['black_card_text'].replace("__", x['white_card_text']), axis=1)

## add answer at end of text, in cases where no ___ in black card
df.loc[(df["has_"]==False),"text"] = df["text"] +" " + df["white_card_text"] #+ "."

# removed: 
# df['text'] = df['text'].str.replace("..",".",regex=False) ## fix double dots. Still leaves extra punct.. 

display(df)

Unnamed: 0,fake_round_id,round_completion_seconds,round_skipped,black_card_text,black_card_pick_num,white_card_text,won,ID_index,has_,id_white,text
0,1,24,False,"Hi MTV! My name is Kendra, I live in Malibu, I'm into __, and I love to have a good time.",1,Going inside at some point because of the mosquitoes,0,1,True,927.0,"Hi MTV! My name is Kendra, I live in Malibu, I'm into Going inside at some point because of the mosquitoes, and I love to have a good time."
1,1,24,False,"Hi MTV! My name is Kendra, I live in Malibu, I'm into __, and I love to have a good time.",1,Being fat from noodles,0,2,True,428.0,"Hi MTV! My name is Kendra, I live in Malibu, I'm into Being fat from noodles, and I love to have a good time."
2,1,24,False,"Hi MTV! My name is Kendra, I live in Malibu, I'm into __, and I love to have a good time.",1,Letting this loser eat me out,0,3,True,1134.0,"Hi MTV! My name is Kendra, I live in Malibu, I'm into Letting this loser eat me out, and I love to have a good time."
3,1,24,False,"Hi MTV! My name is Kendra, I live in Malibu, I'm into __, and I love to have a good time.",1,That chicken from Popeyes.®,0,4,True,1719.0,"Hi MTV! My name is Kendra, I live in Malibu, I'm into That chicken from Popeyes.®, and I love to have a good time."
4,1,24,False,"Hi MTV! My name is Kendra, I live in Malibu, I'm into __, and I love to have a good time.",1,A sorry excuse for a father,0,5,True,231.0,"Hi MTV! My name is Kendra, I live in Malibu, I'm into A sorry excuse for a father, and I love to have a good time."
...,...,...,...,...,...,...,...,...,...,...,...
2989545,298955,7613,False,Oh my god! __ killed Kenny!,1,Breastfeeding a ten-year-old,0,6,True,495.0,Oh my god! Breastfeeding a ten-year-old killed Kenny!
2989546,298955,7613,False,Oh my god! __ killed Kenny!,1,Happy daddies with happy sandals,0,7,True,957.0,Oh my god! Happy daddies with happy sandals killed Kenny!
2989547,298955,7613,False,Oh my god! __ killed Kenny!,1,Jerking off to a 10-second RealMedia clip,0,8,True,1077.0,Oh my god! Jerking off to a 10-second RealMedia clip killed Kenny!
2989548,298955,7613,False,Oh my god! __ killed Kenny!,1,Getting naked and watching Nickelodeon,0,9,True,889.0,Oh my god! Getting naked and watching Nickelodeon killed Kenny!


* Add these to df_text as well

Save df_text to disk

In [17]:
df_text = pd.concat([df_text,df['text']]).drop_duplicates()
df_text = df_text.sample(frac=1)
df_text

279558                                                      Next from J.K. Rowling: Harry Potter and the Chamber of Nicolas Cage.
1628804                                                          WHOOO! God damn I love Crying and shitting and eating spaghetti!
1011134                                                                       LSD + Getting shot by the police = really bad time.
187356                                                       Feeling so grateful! #amazing #mylife #J.D. Power and his associates
1519224                                                         As king, how will I keep the peasants in line? An Oedipus complex
                                                                    ...                                                          
897116                                                 What really killed the dinosaurs? A juicy lil' booty going poot-poot-pooty
601515                                   Do the Dew® with our most extreme flavor yet! Get

In [18]:
# df_text.to_csv("df_text.csv.gz",index=False,compression="gzip")

##### Fit prior odds / encoder
* Use targetEncoder or WOE
* Do IPW either here, or later on data subset? 
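
One simple form of target encoding is a smoothed per-card win rate (an m-estimate), sketched below in plain Python. This is an assumption about what such a prior could look like, not the encoder used in this notebook, and it is not WOE; the function name and the `weight` value are made up for the example. The default `prior` of 0.1 matches the overall mean of `won`.

```python
from collections import defaultdict

def smoothed_win_rates(rows, prior=0.1, weight=20):
    """Per-card win rate shrunk toward `prior`.

    rows: iterable of (white_card_text, won) with won in {0, 1}.
    weight: pseudo-count; larger values pull rarely seen cards closer to the prior.
    """
    wins = defaultdict(int)
    counts = defaultdict(int)
    for white, won in rows:
        wins[white] += won
        counts[white] += 1
    return {w: (wins[w] + prior * weight) / (counts[w] + weight) for w in counts}

rows = [("Crab", 1), ("Crab", 0), ("Elon Musk", 0)]
rates = smoothed_win_rates(rows, prior=0.1, weight=2)
print(rates)
```

To avoid leaking labels, the rates should be fitted on training rounds only and then looked up for the test rounds, with unseen cards falling back to the prior.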

In [19]:
assert df["won"].isna().sum()==0

In [21]:
df["won"].mean()

0.1

In [22]:
# ## for now, fit on just white cards, instead of black, white  (For black, we could just use mean_skipped as prior and multiply)
# ## seemed to work directly on text, originally? 
# df["prior_white"] = cat_encoder.fit_transform(X= df["white_card_text"].values# df["id_white"].values# pd.factorize(df.white_card_text)[0],#df["white_card_text"],
#                                               ,y=df["won"].astype(int).values,groups=df["fake_round_id"].values)

# # df["prior_white"] = cat_encoder.transform(df["white_card_text"]) ## catboost encoder - doesn't have min 0 (for those with 0 cases)
# ## rates for the whites (per unique white card)
# df.drop_duplicates(["white_card_text"])["prior_white"].describe(percentiles=[]).round(3)

# ## why nan values ??

### check white prior as baseline model

* ~21% win rate by always picking the most popular answer, on the FILTERED data
* On the _full_ ~3 million rows (including different outcomes for the same pair, i.e. no filtering by # picks), _it does no better than random!_

In [23]:
df.groupby(["ID_index"])["won"].mean()

ID_index
1     0.1035
2     0.1077
3     0.1042
4     0.1027
5     0.0995
6     0.0987
7     0.0962
8     0.0957
9     0.0949
10    0.0967
Name: won, dtype: float64

### get min count/cooccurences of sentence pairs
* Filter train + TEST for min 2 occurrences
* Filter games/rounds where positive was removed

In [24]:
df["pair_count"] = df.groupby("text")["won"].transform("count") ## can be used to filter sentences occurring less than k times
df["sum_won"] = df.groupby("text")["won"].transform("sum")

In [25]:
print(df.shape[0])
df = df.loc[df["pair_count"]>=MIN_PAIR_FREQ]
print(df.shape[0])

2446360
2446360


### remove groups where positive was filtered out
( If `MIN_PAIR_FREQ` is 1, then this does nothing, all cards/games/jokes will be kept)

In [26]:
if MIN_PAIR_FREQ>1:
    print("Rounds before:",df["fake_round_id"].nunique())
    any_pos = df.groupby("fake_round_id")["won"].transform("max")>0
    df = df.loc[any_pos]
    print("Rounds After (filtering those with no pick):",df["fake_round_id"].nunique())
    
    ### filter round with too few cards left (we can have without all 10, but ensure no "1" only cases.... )
     ## clear drop between 8,9 (17K),10 (6k) games
    round_cards_count_mask = df.groupby("fake_round_id")["won"].transform("count")>=8
    df = df[round_cards_count_mask==True]
    print(df.shape[0],"rows after round level filter")
    print(df["fake_round_id"].nunique(), "rounds")

In [27]:
print(df.nunique())

fake_round_id               244636
round_completion_seconds      1665
round_skipped                    1
black_card_text                581
black_card_pick_num              1
white_card_text               2128
won                              2
ID_index                        10
has_                             2
id_white                      2128
text                        784974
pair_count                      15
sum_won                          9
dtype: int64


In [28]:
df[df["sum_won"]>0]["sum_won"].describe().round(2)

count    756722.00
mean          1.31
std           0.61
min           1.00
25%           1.00
50%           1.00
75%           1.00
max           8.00
Name: sum_won, dtype: float64

In [29]:
df.nunique()

fake_round_id               244636
round_completion_seconds      1665
round_skipped                    1
black_card_text                581
black_card_pick_num              1
white_card_text               2128
won                              2
ID_index                        10
has_                             2
id_white                      2128
text                        784974
pair_count                      15
sum_won                          9
dtype: int64

In [30]:
print("Possible jokes:",(df.black_card_text.nunique()*df.white_card_text.nunique()))
100*df.text.nunique() / (df.black_card_text.nunique()*df.white_card_text.nunique())

Possible jokes: 1236368


63.49032003416458

## Split train/test by groups
* Later-  filter train data by label (but not the test)
* split by round_id / games

In [31]:
df = df.filter(['fake_round_id', 'black_card_text', 'white_card_text',
                'won', 'text', "sum_won"],axis=1)
df = df.sort_values(["fake_round_id"],ascending=False) # , 'won'
df["won"] = df["won"].astype(int)
print("mean won:",df["won"].mean())
print(df.shape[0])

mean won: 0.1
2446360


## Alt: split by unique cards
* unique / novel white, black card combinations
* Randomly select

In [32]:
all_blacks = df["black_card_text"].unique()
# print(len(all_blacks))
# test_blacks = np.random.choice(all_blacks,size=int(len(all_blacks)*0.05), replace=False)

all_whites = df["white_card_text"].unique()

test_whites = np.random.choice(all_whites,size=int(len(all_whites)*0.2), replace=False)
print("all_whites",len(all_whites),"test whites:",len(test_whites))

all_whites 2128 test whites: 425


In [33]:
card_mask = (df["white_card_text"].isin(test_whites))
df_test_cards = df.loc[card_mask].copy()
df_train_cards = df.loc[~card_mask].copy()

print("train rows",df_train_cards.shape[0])
print("test rows",df_test_cards.shape[0])
print("df_train_cards mean won",df_train_cards.won.mean())
print("df_test_cards mean won",df_test_cards.won.mean())

train rows 1956418
test rows 489942
df_train_cards mean won 0.09971028686098779
df_test_cards mean won 0.1011568716297031


In [34]:
if SAVE_TRAIN_DATA:
    df_train_cards.to_parquet("cah_train_cardsplit_games.parquet")
    df_test_cards.to_parquet("cah_test_cardsplit_games.parquet")

### "Normal" round-level train/test split:

* ALT - below - take last k rounds

In [52]:
df["fake_round_id"].nunique()

244636

In [53]:
df = df.sample(frac=1).sort_values("fake_round_id")

In [54]:
df.shape[0] # 2,446,360 rows, across 244,636 rounds after filtering (see nunique above)

2446360

In [55]:
test_size = df.shape[0]//5 ## 20% 
## round to nearest 10
test_size = int(math.ceil(test_size / 10.0)) * 10

df_test = df.tail(test_size)
print(df_test.shape[0])
display(df_test.head(12))

489280


Unnamed: 0,fake_round_id,black_card_text,white_card_text,won,text,sum_won
2387656,238766,What's about to take this dance floor to the next level?,Crippling social anxiety,0,What's about to take this dance floor to the next level? Crippling social anxiety,0
2387654,238766,What's about to take this dance floor to the next level?,Slowly releasing a huge fart over the course of two minutes,0,What's about to take this dance floor to the next level? Slowly releasing a huge fart over the course of two minutes,1
2387653,238766,What's about to take this dance floor to the next level?,Crab,1,What's about to take this dance floor to the next level? Crab,1
2387650,238766,What's about to take this dance floor to the next level?,Trimming the poop out of Chewbacca's butt hair,0,What's about to take this dance floor to the next level? Trimming the poop out of Chewbacca's butt hair,0
2387652,238766,What's about to take this dance floor to the next level?,Subduing a grizzly bear and making her your wife,0,What's about to take this dance floor to the next level? Subduing a grizzly bear and making her your wife,0
2387657,238766,What's about to take this dance floor to the next level?,The NRA,0,What's about to take this dance floor to the next level? The NRA,1
2387658,238766,What's about to take this dance floor to the next level?,Fucking a corpse back to life,0,What's about to take this dance floor to the next level? Fucking a corpse back to life,0
2387659,238766,What's about to take this dance floor to the next level?,Denying climate change,0,What's about to take this dance floor to the next level? Denying climate change,0
2387651,238766,What's about to take this dance floor to the next level?,Fisting,0,What's about to take this dance floor to the next level? Fisting,0
2387655,238766,What's about to take this dance floor to the next level?,Elon Musk,0,What's about to take this dance floor to the next level? Elon Musk,0


#### subset df to train

In [56]:
df_train = df.iloc[:-test_size]
df_train.shape[0]

1957080
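
The `tail(test_size)` split above relies on every round having exactly 10 rows. A sketch of the same "last rounds as test" idea that selects by round id instead, and so never cuts a round in half, might look like this. It is plain Python over a list of dicts for illustration; `split_last_rounds` is a hypothetical helper, not part of the notebook.

```python
def split_last_rounds(rows, test_fraction=0.2):
    """Hold out the rounds with the highest fake_round_id values as the test set."""
    round_ids = sorted({row["fake_round_id"] for row in rows})
    n_test = int(len(round_ids) * test_fraction)
    test_ids = set(round_ids[len(round_ids) - n_test:])
    train = [row for row in rows if row["fake_round_id"] not in test_ids]
    test = [row for row in rows if row["fake_round_id"] in test_ids]
    return train, test

# Toy data: 10 rounds of 10 cards, except round 3 which has only 8 cards left.
rows = [
    {"fake_round_id": rid, "won": int(i == 0)}
    for rid in range(1, 11)
    for i in range(8 if rid == 3 else 10)
]
train_rows, test_rows = split_last_rounds(rows)
train_ids = {row["fake_round_id"] for row in train_rows}
test_ids = {row["fake_round_id"] for row in test_rows}
print(len(train_rows), len(test_rows), train_ids & test_ids)
```

This matters mainly when `MIN_PAIR_FREQ` is above 1, since rounds can then have fewer than 10 cards left.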

In [57]:
## # Initialize the GroupShuffleSplit.
# gss = GroupShuffleSplit(n_splits=1, test_size=0.2,random_state=42)

# # Get the indexers for the split.
# idx1, idx2 = next(gss.split(df, groups=df["fake_round_id"].values))

# # Get the split DataFrames.
# df, df_test = df.iloc[idx1], df.iloc[idx2]

# print("df (train)",df.shape[0])
# print(df["fake_round_id"].nunique())
# print("test")
# print("df_test",df_test.shape[0])
# print(df_test["fake_round_id"].nunique())

# df_train = df

### save to disk

In [94]:
if SAVE_TRAIN_DATA:
#     df.to_parquet("cah_train_min3_games.parquet")
#     df_test.to_parquet("cah_test_min3_games.parquet")
#     df.to_parquet("cah_train_games.parquet")
    
    df_train.to_parquet("cah_train_games.parquet")
    df_test.to_parquet("cah_test_games.parquet")

#### Whites prior: 
* ~20.7% accuracy @1 on the test set (chance is 10%)
* 23% after min 3 filtering

In [59]:
# df_white_prior = df.groupby(["white_card_text"], as_index=False)["won"].mean().rename(columns={"won":"white_prior"}).set_index("white_card_text")
df_white_prior = df_train.groupby(["white_card_text"], as_index=False)["won"].mean().rename(columns={"won":"white_prior"}).set_index("white_card_text")
df_test = df_test.join(df_white_prior,on="white_card_text",how="left")
prior = df_test["white_prior"].mean()
# df_test["white_prior"] = df_test["white_prior"].fillna(prior)
df_test["white_prior"] = df_test["white_prior"].fillna(0.1)
print(df_test.shape[0],"test shape after join")
print("Prior Acc @1:",df_test.sort_values("white_prior",ascending=False).groupby("fake_round_id").head(1)["won"].mean())
## alt: 
print("alt dedup Prior Acc @1:",df_test.sort_values("white_prior",ascending=False).drop_duplicates(subset=["fake_round_id"])["won"].mean())
# print("Prior Acc @1:",df_test.sort_values("white_prior",ascending=False).groupby("fake_round_id").head(1).groupby("fake_round_id")["won"].max().mean()) ## res identical to above
print("Prior Acc @2:",df_test.sort_values("white_prior",ascending=False).groupby("fake_round_id").head(2).groupby("fake_round_id")["won"].max().mean())
print("Prior Acc @3:",df_test.sort_values("white_prior",ascending=False).groupby("fake_round_id").head(3).groupby("fake_round_id")["won"].max().mean())

489280 test shape after join
Prior Acc @1: 0.2072841726618705
alt dedup Prior Acc @1: 0.2072841726618705
Prior Acc @2: 0.36032537606278614
Prior Acc @3: 0.48606115107913667


### keep 1s (for train) ?
* MAY want to filter - e.g. min # cases of it being picked (+- keeping that for use for sample weight)
* Keep all data for now, (to allow for grouping by rounds if we want, since this is a ranking problem) 

* temp

In [35]:
df_tr = df[['fake_round_id',"black_card_text","white_card_text","won"]].sort_values("won",ascending=False).copy()
df_tr =df_tr.drop_duplicates(subset=['fake_round_id',"black_card_text","white_card_text"])
df_tr["white_counts"] = df_tr.groupby("white_card_text")["won"].transform("size")
print(df_tr.nunique())
df_tr

fake_round_id      195708
black_card_text       581
white_card_text      2128
won                     2
white_counts          213
dtype: int64


Unnamed: 0,fake_round_id,black_card_text,white_card_text,won,white_counts
2989504,298951,"Every Tuesday, I purchase a box of donuts. I sit on the toilet. I eat the donuts. I remember __, and I cry.",Slowly easing down onto a cucumber,1,981
1798560,179857,Ain't it nifty? Barb and Bob hit 50! So get off your ass and raise a glass to 50 years of __.,Ejaculating a quart of hollandaise sauce,1,1012
1798439,179844,"When asked about the biggest threat facing the nation, 60% of Americans said __.",Being sad and horny,1,1014
1798441,179845,"Kids, I don't need drugs to get high. I'm high on __.",Dumpster juice,1,1021
1798459,179846,Click Here for __!!!,Everything,1,1038
...,...,...,...,...,...
1794108,179411,And what did you bring for show and tell?,A fuck-ton of almonds,0,1062
1794107,179411,And what did you bring for show and tell?,Nazis,0,1006
1794106,179411,And what did you bring for show and tell?,German dungeon porn,0,1011
1794105,179411,And what did you bring for show and tell?,A toxic family environment,0,1025


In [36]:
df_tr["white_counts"].describe().round(1)

count    1957076.0
mean         997.9
std           67.0
min           10.0
25%          978.0
50%         1001.0
75%         1023.0
max         1969.0
Name: white_counts, dtype: float64

## Sentence transformers model(s)
* Sentence-BERT (NLI? semantic similarity?)
https://www.sbert.net/docs/training/overview.html
https://www.sbert.net/docs/quickstart.html  , https://github.com/UKPLab/sentence-transformers
* Load data from disk ? 

* list of pretrained sentencebert models: (All are >300 M! slow download) https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0

* `stsb-distilroberta-base-v2` - 305M sized model - semantic similarity
* `nli-distilroberta-base-v2` - NLI 
* `average_word_embeddings_glove.6B.300d`, `average_word_embeddings_glove.840B.300d` - glove/w2v embeddings

In [37]:
import math  # used for warmup_steps in the training cell

import torch
from torch.utils.data import DataLoader
import torch.utils.data as data_utils
## https://stackoverflow.com/questions/50307707/convert-pandas-dataframe-to-pytorch-tensor
from transformers import AutoTokenizer, AutoModel
## example from : https://www.sbert.net/docs/training/overview.html
from sentence_transformers import SentenceTransformer, SentencesDataset, InputExample, losses, evaluation, util

torch.cuda.is_available()


In [38]:
## pretrained weights - download via the program often fails, easier to download externally from ftp
# model = SentenceTransformer('stsb-distilroberta-base-v2') ## was distilbert-base-nli-mean-tokens

# model = SentenceTransformer("nli-distilroberta-base-v2",device="cuda")
# model = SentenceTransformer("paraphrase-MiniLM-L12-v2") ## "paraphrase-MiniLM-L12-v2"
model = SentenceTransformer("all-MiniLM-L12-v2")
# model = SentenceTransformer("./stsb-distilroberta-base-v2",device="cuda")

# model = SentenceTransformer("./nli-mpnet-base-v2",device="cuda")

In [39]:
model.max_seq_length

128

In [40]:
# len(model.tokenize(["ad and "])["input_ids"])

In [41]:
model.get_sentence_embedding_dimension() # 384 for all-MiniLM-L12-v2 (was 768 with the distilroberta models)

384

#### redo above, but with real data


In [43]:
num_epochs = 2
TEST_SIZE = df.shape[0]//8 # hold out 1/8 of the rows (was 4500)
print(TEST_SIZE)

244635


In [44]:
df.shape

(1957080, 6)

In [45]:
df.drop_duplicates(['black_card_text', 'white_card_text']).shape[0]

746559

In [None]:
## list() gives DataLoader a plain positional sequence; labels are cast to float, as CosineSimilarityLoss expects
train_examples = list(df.iloc[TEST_SIZE:].apply(lambda row: InputExample(texts=[row["white_card_text"], row["black_card_text"]],
                                                                         label=float(row["picks"])),axis=1)) # picks
test_examples =  list(df.iloc[0:TEST_SIZE].apply(lambda row: InputExample(texts=[row["white_card_text"], row["black_card_text"]], 
                                                                          label=float(row["picks"])),axis=1)) # picks


In [None]:
# dev_eval = evaluation.BinaryClassificationEvaluator(sentences1=df.iloc[0:2500,0],
#     sentences2= df.iloc[0:2500,1],
#     labels=df.iloc[0:2500,2],show_progress_bar=True,batch_size=256)
dev_eval = evaluation.BinaryClassificationEvaluator(sentences1=list(df.iloc[0:TEST_SIZE,0].values),
    sentences2= list(df.iloc[0:TEST_SIZE,1].values),
    labels=list(df.iloc[0:TEST_SIZE,2].values),show_progress_bar=True,batch_size=256)

In [None]:
## https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli.py 

## evaluator example 
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=256)


In [None]:
%%time
# train_loss = losses.CosineSimilarityLoss(model) # ORIG

# train_loss = losses.SoftmaxLoss(model,
#     sentence_embedding_dimension= model.get_sentence_embedding_dimension(),
#     num_labels= 2,
#     concatenation_sent_rep= True,
#     concatenation_sent_difference = True,
#     concatenation_sent_multiplication = True) # ALT - for classification https://www.sbert.net/docs/package_reference/losses.html#softmaxloss

train_loss = losses.CosineSimilarityLoss(model)
#Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], 
          epochs=num_epochs, 
#           warmup_steps=100, # 100
          warmup_steps = math.ceil(len(train_dataloader) * 0.05*3), # 0.05*3 = 15% of one epoch's steps for warm-up
          evaluator=dev_eval,
          use_amp=True,
          optimizer_params= {'lr': 3e-04}, # 2e-05
          output_path="./output",
          save_best_model = False,
         )
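
For reference, the warm-up length implied by the settings above can be worked out by hand. This is a sketch under two assumptions: the training set is every row after the first `TEST_SIZE` rows (as in the `train_examples` cell), and the `DataLoader` keeps its default `drop_last=False`, so its length is the row count divided by the batch size, rounded up.

```python
import math

n_rows = 1957080          # df.shape[0] reported earlier in the notebook
test_size = n_rows // 8   # 244635, same as TEST_SIZE
batch_size = 256
num_epochs = 2

n_train = n_rows - test_size
# Number of batches per epoch (last partial batch is kept by default)
steps_per_epoch = math.ceil(n_train / batch_size)
# Same expression as in model.fit: 0.05*3 = 15% of one epoch
warmup_steps = math.ceil(steps_per_epoch * 0.05 * 3)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, warmup_steps, round(warmup_steps / total_steps, 3))
# 6690 1004 0.075
```

So the warm-up covers about 15% of the first epoch, or roughly 7.5% of all optimizer steps over the two epochs.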

In [None]:
model.evaluate(dev_eval)

#### Evaluate model on original data, game by game.
* Note: should use a proper test set, disjoint from the training rounds
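
A round-disjoint split can be sketched by sampling `fake_round_id` values instead of rows, so that all cards from one round land on the same side. The helper name `split_round_ids`, the `seed`, and the default `test_fraction` (1/8, mirroring `TEST_SIZE`) are illustrative assumptions.

```python
import random

def split_round_ids(round_ids, test_fraction=0.125, seed=0):
    """Split distinct round ids into disjoint (train_ids, test_ids) sets."""
    unique_ids = sorted(set(round_ids))
    rng = random.Random(seed)   # fixed seed keeps the split reproducible
    rng.shuffle(unique_ids)
    n_test = max(1, int(len(unique_ids) * test_fraction))
    test_ids = set(unique_ids[:n_test])
    train_ids = set(unique_ids[n_test:])
    return train_ids, test_ids

# Toy input: 80 rounds with 10 rows each, like one id per dealt white card.
toy_round_ids = [r for r in range(80) for _ in range(10)]
train_ids, test_ids = split_round_ids(toy_round_ids)
print(len(train_ids), len(test_ids))  # 70 10
```

In the notebook this would translate to selecting rows with `df["fake_round_id"].isin(test_ids)` for the test frame and the remaining rows for training.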

In [None]:
# df = pd.read_parquet('cah_lab_data.parquet')
df_test = df_sample.loc[df_sample["black_card_pick_num"]==1].sort_values("fake_round_id")[["fake_round_id","black_card_text","white_card_text","won"]].copy()

df_test = df_test.head(4800)
df_test

* If the model only uses the black/white card texts, the embeddings could be computed once per unique card: ~1000x less compute...
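
The idea of encoding each unique card only once can be sketched as a small cache around the encoder. The helper `encode_unique` and the stand-in `fake_encode` are hypothetical; with the real model, `encode_fn` would be something like `lambda t: model.encode(t, batch_size=256)`.

```python
def encode_unique(texts, encode_fn):
    """Encode each distinct text once and map the results back to input order."""
    unique_texts = list(dict.fromkeys(texts))  # de-duplicate, keep first-seen order
    vectors = encode_fn(unique_texts)
    lookup = dict(zip(unique_texts, vectors))
    return [lookup[t] for t in texts]

# Stand-in encoder that records how many texts it was asked to encode.
encoded_batch_sizes = []

def fake_encode(batch):
    encoded_batch_sizes.append(len(batch))
    return [[float(len(text))] for text in batch]  # 1-d dummy "embedding"

toy_texts = ["card a", "card bb", "card a", "card a", "card bb"]
vectors = encode_unique(toy_texts, fake_encode)
print(encoded_batch_sizes)  # [2]
print(vectors)  # [[6.0], [7.0], [6.0], [6.0], [7.0]]
```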

In [None]:
%%time
## https://www.sbert.net/docs/usage/semantic_textual_similarity.html

sentences1 = df_test["black_card_text"].values
sentences2 = df_test["white_card_text"].values
embeddings1 = model.encode(sentences1,batch_size=256,normalize_embeddings=False, convert_to_tensor=True)

embeddings2 = model.encode(sentences2,batch_size=256,normalize_embeddings=False, convert_to_tensor=True)

In [None]:
from sentence_transformers import util

# Compute cosine similarities (full N x N matrix)
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)

# Row i pairs sentences1[i] with sentences2[i], so only the diagonal is needed
scores = [float(cosine_scores[i][i]) for i in range(len(sentences1))]
df_test["cos_scores"] = scores

# Ascending sort: head(1) picks the LOWEST-scored card per round (sanity check)
df_test.sort_values(["fake_round_id","cos_scores"],inplace=True,ascending=True)
print("accuracy of lowest-scored card:",df_test.groupby("fake_round_id").head(1)["won"].mean())

# Descending sort: head(1) picks the highest-scored card per round, i.e. the actual top-1
df_test.sort_values(["fake_round_id","cos_scores"],inplace=True,ascending=False)
print("top1 accuracy (highest cosine score):",df_test.groupby("fake_round_id").head(1)["won"].mean())
df_test