# Connections Solver Notebook

Author: Eric Nunes

This notebook is a playground for experimenting with different procedures, algorithms, and approaches to solving Connections-style puzzles.

## Load Dataset

I built a bot to collect the full game archive from the New York Times (all previous game data is stored in one of their APIs). The dataset is published on Kaggle, with a mirror on Hugging Face; the cell below reads the Hugging Face mirror.

Source: https://www.kaggle.com/datasets/eric27n/the-new-york-times-connections

In [None]:
import numpy as np
import pandas as pd

import gzip
import json
import random
import re
from collections import Counter

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize
from transformers import BertTokenizer, BertModel
import torch
import gensim.downloader as api
from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors

### NYT Connections

In [None]:
df = pd.read_csv("hf://datasets/eric27n/NYT-Connections/Connections_Data.csv")
df['Word'] = df['Word'].fillna("NA")
df['Word'] = df['Word'].str.lower()
df['Group Name'] = df['Group Name'].str.lower()
grouped = df.groupby('Game ID')
result = []

for game_id, group in grouped:
  words = group['Word'].tolist()
  group_by_name = group.groupby('Group Name')
  solution = []
  
  for group_name, sub_group in group_by_name:
    group_words = sub_group['Word'].tolist()
    reason = sub_group['Group Name'].iloc[0]
    solution.append({'words': group_words, 'reason': reason})

  result.append({'words': words, 'solution': {'groups': solution}})

ds = result
ds_len = len(ds)
print(len(ds), ds[0])

### Only Connect (Datathon)

#### Input Data

In [None]:
input_data = """
["Cannon","Santa","Fury","Jonas"]	 People named ‘Nick'
["Love","Zip","Duck","Goose-egg"] 	Terms for 'zero'
["Haddock","Ahab","Hook","Sparrow"] 	Fictional captains
["Ash","Lime","Willow","Cork"]	 Trees

["Bush","Dole","Gore","Carter"]	US presidential losers
["Mexico","Hampshire","Jersey","York"]	'New' US states
["Glandular","Scarlet","Yellow","Trench"]	Types of fever
["Eastern","Central","Mountain","Pacific"]	USA Time Zones

["Manhattan","Sidecar","Margarita","Gibson"]	Cocktails
["Puzzle","Business","Nuts","Suit"]	Monkey ____
["Fruit","Carrot","Sheet","Marble"]	Cakes
["Dough","Bands","Scratch","Bread"]	  Slang for money

["Magpie","Crane","Turkey","Black"]	Birds
["Passion","Kiwi","Jack","Grape"]	Fruit
["Gold","Yellow","Jungle","Scarlet"]	Fever
["Hawk","Cruise","Trident","Patriot"]	Missiles

["Anvil","Hammer","Stirrup","Tympanum"]	Ear parts
["Square","Cube","Irrational","Perfect"]	Types of numbers
["Presto","Largo","Piano","Grave"]	Musical directions
["Altering","Triangle","Relating","Integral"]	Anagrams of each other

["Queue","Pea","Jay","Sea"]	Sound like letters of the alphabet
["Lime","Emerald","Jade","Olive"]	Shades of green
["Warner","Wright","Chemical","Moss"]	____ Brothers
["Cardinal","Burgundy","Venetian","Rust"]	Shades of red

["Coffee","Terrier","Yew","Rover"]	Irish ____
["Boxer","Whiskey","Mau Mau","Easter"]	Rebellions
["Chekov","Kirk","McCoy","Scott"]	'Star Trek' characters
["Echo","Explorer","Voyager","Sputnik"]	Spacecraft

["Foxtrot","Tango","Tap","Morris"]	Dances
["Clover","Napoleon","Major","Boxer"]	'Animal Farm' characters
["Naan","Hannah","Level","Peep"]	Palindromes
["Honey","Mason","Bumble","Mining"]	Varieties of bee

["Rabbit","Knife","Hammer","Ass"]	Jack____
["Question","Pot","Gun","Employee"]	Things you can 'fire'
["Pencil","Bubble","Tulip","Hobble"]	Styles of skirt
["Cruiser","Mountain","Tandem","BMX"]	Types of bicycles

["Stag","Tom","Bull","Buck"]	Male animals
["Crumb","Column","Debris","Crochet"]	End with a silent letter
["Big Bang","Chaos","String","Game"]	Scientific theories
["Coming","Hand","Childhood","Guess"]	Second ____

["Whip","Smile","Joke","Bottle"]	Things you can crack
["Baby","Concert","Ballroom","Parlour"]	Sizes of grand pianos
["Con","Fleece","Rook","Screw"]	Synonyms for swindle
["Precious","Love","Petal","Honey"]	Terms of endearment

["Base","Heavy","Scrap","Noble"]	____ metal
["Mace","Saffron","Nigella","Ginger"]	Spices
["Screw","Spit","Fast","Curve"]	Baseball pitches
["Python","Logo","Perl","C"]	Computer programming languages

["Marks","Last legs","Knees","Feet"]	On your ____
["Jackson","Austin","Carson","Pierre"]	US state capitals and male first names
["Dewey","Louie","Daisy","Donald"]	Disney ducks
["Control","Escape","Insert","Function"]	PC computer keys

["Coal","Pipe","Scarf","Carrot"]	Accessories for a snowman
["Jester","Cow","Sleigh","Church"]	Things that have bells
["Diamond","Hound","Oath","Bath"]	Blood ____
["Priest","Bishop","Pastor","Cantor"]	Religious offices

["Power","Harmonic","Fourier","Telescoping"]	Classes of mathematical series
["Okay","Eighty","Tepee","Excess"]	Pronounced as two letters
["Jellyfish","Snake","Algae","Mushroom"]	Things with poisonous varieties
["Grape","Ivy","Clematis","Wisteria"]	Types of vines

["Scar","Hades","Stromboli","Ursula"]	Disney animated villains
["Doctor","Hazel","Hunt","Craft"]	Witch ____
["Doll","Bird","Dame","Sheila"]	Nicknames for women
["Fighting","Grass","Dragon","Psychic"]	Types of Pokémon

["Key","Ribbon","Platen","Carriage"]	Parts of a typewriter
["Sheffield United","Razor","Helicopter","Ice skate"]	Blades
["Bottle","Suicide","Dumb","Legally"]	____ blonde
["Coin","Spider's web","Cricket ball","Good yarn"]	Things that are spun

["Honey","Cox","Nit","Curry"]	Types of comb
["Vega","Betelgeuse","Pollux","Altair"]	Stars
["10","Pitch","Tense","Number"]	Perfect ____
["Blood","Tyre","Iron","Shotgun"]	Things that can be pumped

["Sacrifice","Pin","Fork","Discovered attack"]	Chess tactics
["Rocking","Carver","Windsor","Tub"]	Chairs
["Brilliant","Emerald","Baguette","Pear"]	Cuts of diamond
["Swing","Arch","Beam","Pontoon"]	Bridges

["Worm","Trojan horse","Virus","Keylogger"]	Malware
["Fandango","Thunderbolt","Beelzebub","Galileo"]	Lyrics in 'Bohemian Rhapsody'
["Torpedo","Hoagie","Poor boy","Submarine"]	US sandwiches
["Viper","Maverick","Hollywood","Iceman"]	Pilots in 'Top Gun'

["Rotary","Fight","Mile High","Yacht"]	Famous clubs
["Dragon","Fiona","Puss in Boots","Harold"]	Characters in Shrek films
["Leech","Fluke","Tick","Chigger"]	Parasites
["Wrecking","Medicine","Masked","Crystal"]	____ Ball

["Bow","Plough","Hook","Drogue"]	Types of anchor
["Tangled","Mulan","Cinderella","Fantasia"]	Disney animations
["Police","Steam","Slide","Penny"]	Types of whistle
["Dakota","Shields","Korea","Atlantic"]	North and South ____

["Beard","Bath","Shrewd","Harem"]	Mammals plus one letter
["Air","Street","Ice","Field"]	Types of hockey
["World","64","Land","Bros."]	'Super Mario' games
["Radar","Table","Sea","Weather"]	Under the ____

["Garter","Confetti","Bouquet","Petals"]	Thrown at weddings
["Zuckerberg","Filo","Wales","Page"]	Internet entrepreneurs
["Jet","Bill","Raven","Saint"]	Members of American football teams
["Won","Tu","Fore","Ait"]	Number homophones

["Boycott","Walkout","Picket","Rally"]	Types of protest
["Demolition","Medicine","Masked","Crystal"]	____ ball
["Red","Tiananmen","Times","Trafalgar"]	World-famous city squares
["Brian","Stewie","Cleveland","Meg"]	'Family Guy' characters

["Snorlax","Jigglypuff","Bulbasaur","Ditto"]	Pokémon characters
["Rare","Well done","Medium","Blue"]	Temperatures for cooking beef
["Shelf","Drift","Plate","Crust"]	Continental ____
["Dubrovnik","Split","Rijeka","Zadar"]	Cities in Croatia

["Achtung Baby","Zooropa","War","Rattle and Hum"]	U2 albums
["Yosemite","Redwood","Joshua Tree","Death Valley"]	National Parks in California
["Eden","Babylon","Versailles","Monet"]	Famous gardens
["Angry Birds","Vine","Instagram","Fruit Ninja"]	Popular apps

["Iron Man","Daredevil","Hulk","Blade"]	Marvel Comics superheroes
["Determined","Honest","Sure","Frank"]	To be ____
["Castle","Gambit","Stalemate","En passant"]	Chess terms
["Corral","Assemble","Muster","Convene"]	Gather people together

["Colorado","Missouri","Green","Arkansas"]	Major US rivers
["Lapel","Cuff","Breast","Pocket"]	Parts of a jacket
["Ross","Monk","Elephant","Harbour"]	Species of seal
["Brown","Columbia","Cornell","Yale"]	Ivy League establishments

["Merida","Pocahontas","Jasmine","Snow White"]	Disney Princesses
["Yellow","Sparks","Viva La Vida","Paradise"]	Coldplay songs
["Thick","Raw","Altogether","Cut"]	In the ____
["Saratoga","Iowa","Missouri","Duffy"]	US ships of World War II

["Radiation","Altitude","Morning","Motion"]	Forms of sickness
["Turnover","Strudel","Cobbler","Crumble"]	Apple dishes
["Off the Wall","Invincible","Thriller","Dangerous"]	Michael Jackson albums
["Pigwidgeon","Hermes","Hedwig","Errol"]	Harry Potter owls

["Bloody","Virgin","Typhoid","Hail"]	___ Mary
["Rare","Medium","Well done","Blue"]	Meat cooking times
["Nobel","Carver","Japp","Franklin"]	Chemists
["Malik","Styles","Horan","Payne"]	One Direction

["Cat","Sleep","Moon","Cake"]	____walk
["Midtown","Hell's Kitchen","NoHo","Chelsea"]	Manhattan
["Orache","Escarole","Kale","Bok Choy"]	Leafy green vegetables	
["Use","Wise","Pease","Seize"]	Homophones of letter plurals

["Time","Skin","Face","Breath"]	Things that can be saved	
["Fairy","Puck","Imp","Sprite"]	Supernatural creatures	
["Neymar","Silva","Sócrates","Vinicius"]	Brazilian international footballers	
["Dream","Sniper","Hustle","Pie"]	American ___

["Debris","Coup","Autumn","Crumb"]	End with a silent letter
["Clooney","Pitt","Cheadle","Mac"]	'Ocean's' actors
["Windward","Virgin","Canary","Faroe"]	Island groups
["Overwatch","Destiny","Halo","Doom"]	First-person shooters

["Fingerboard","Neck","Bridge","Pickup"]	Parts of an electric guitar
["Blurred Lines","Get Lucky","Happy","Angel"]	Pharrell Williams songs
["Frontier","Countdown","Fantasy","Score"]	Final  ____
["Note","Micra","Cube","Atlas"]	Nissan car models

["Patriot","Charger","Cowboy","Saint"]	NFL Teams
["Tucker","Barbie","Yakka","Sheila"]	Australian slang terms
["Grip","Stock","Butt","Barrel"]	Parts of a rifle
["Flick","Bowie","Stanley","Cheese"]	Knives

["Wraith","Phantom","Cullinan","Silver Ghost"]	Rolls Royce models
["Trouble","Franco","Neuron","Boulevard"]	Contain currencies
["Corpus Christi","Amarillo","Houston","Austin"]	Texan towns
["1977","Toaster","Valencia","Rio de Janeiro"]	Instagram filters

["Reeve","Cavill","Cain","Routh"]	Played Superman
["False","Robin","Neighbor","Knight"]	____hood
["Bewilder","Puzzle","Stump","Baffle"]	Words meaning to perplex
["National","Piano","Cru","Canyon"]	Grand ____

["Indian","Triumph","Ducati","Aprilia"]	Motorcycle brands
["Grisham","Slaughter","King","Child"]	Thriller authors
["Crown","Amalgam","Calculus","Root"]	Dentistry terms
["Cookie","Mason","Bell","Leyden"]	Types of jar

["Cottonmouth","Rattle","Corn","Rat"]	Snakes
["Money","Clip","Trail","Towns"]	Paper _____
["Not","Night","Nave","Nob"]	Lost their initial k
["Deal","Livers","Nit","Snug tent"]	Anagrams of metals

["Constantine","The Matrix","Speed","John Wick"]	Keanu Reeves films
["Sea lion","Dugong","Narwhal","Manatee"]	Marine mammals
["Fu Manchu","Handlebar","Pencil","Walrus"]	 Mustaches
["Saddle","Pedal","Seat post","Fork"]	Parts of a bicycle

["The Big Short","Half Nelson","Fracture","La La Land"]	Ryan Gosling films
["Bet","Ante","Flutter","Wager"]	Gamble
["Say You Will","Tusk","Mirage","Penguin"]	Fleetwood Mac albums
["Rager","Blue Sky","Brightside","Morale"]	Songs called 'Mr ____'

["Cave","Performance","Pop","Kinetic"]	Types of art	
["United","Delta","Frontier","Spirit"]	US airlines	
["Milk","Puzzle","Measuring","Claret"]	Jugs	
["Boston","Nantucket","Quincy","Springfield"]	Places in Massachusetts	

["Lakeland","Kerry Blue","Border","Yorkshire"]	Terriers	
["Knuckle","Wedding","Flag","Trash"]	White ____	
["Drummer","Maid","Partridge","Swan"]	Twelve Days of Christmas	
["Sesame","Palm","Sunflower","Argan"]	Edible oils	

["Headon","Judd","Fleetwood","Bonham"]	Drummers	
["Work","Place","Wall","Power"]	Fire____	
["Photograph","Chance","Bow","Back seat"]	Things you can take	
["Rock","Pine","Evert","Jericho"]	Famous people named Chris	

["Junk","Royal","Chain","Fan"]	____ mail	
["Troy","Chad","Gabriella","Taylor"]	'High School Musical' characters	
["Barnacles","Peso","Kwazii","Inkling"]	Octonauts	
["Golf","Cricket","Rugby","Polo"]	Ball games	

["up!","Amarok","Polo","Jetta"]	Volkswagens	
["Tightrope","Dog","Walk","Line"]	Walk the ____	
["Safety","Cornerback","Tackle","Center"]	Positions in American football	
["Marker","Ballpoint","Fountain","Dip"]	Types of pen	

["Piquet","Hill","Scheckter","Ascari"]	Formula One champions	
["Vietnam","New Mexico","Soviet Union","Catalonia"]	Red and gold flags	
["Marshall","Skye","Rubble","Chase"]	'PAW Patrol' characters	
["Coriolanus","Haymitch","Peeta","Katniss"]	'The Hunger Games' characters	

["Rose","Winks","Kane","Walker-Peters"]	Tottenham Hotspur players	
["Fury","Whyte","Haye","Chisora"]	British heavyweight boxers	
["Case","Egg","Joke","Knuckles"]	Things you crack	
["Snare","Crash","Tom","Ride"]	Parts of a drum kit	

["Iron","Stan","Grandpa","Ben"]	Fictional animated uncles	
["Wisdom","False","Milk","Eye"]	____ teeth
["Phoenix","Stone","Prince","Fire"]	Last words of Harry Potter book titles
["Labour","Coach","Rescue","Birthday"]	Types of party

["George Bush","Dulles","Douglas","Sky Harbor"]	US airports
["Barnum","Memphis","Van Helsing","Valjean"]	Played by Hugh Jackman
["Grape","Rango","Wood","Brasco"]	Johnny Depp title roles
["Juliet","Antigone","Brunnhilde","Javert"]	Fictional suicides

["Sonora","Chihuahua","Yukatán","Tabasco"]	Mexican states
["Asuncion","Paramaribo","Lima","Quito"]	South American capitals
["Boxster","Panamera","Cayman","Cayenne"]	Porsche vehicles
["Vervet","Woolly","Spider","Proboscis"]	Monkeys

["Cavendish","Froom","Kenny","Varnish"]	British cyclists
["Stout","Bitter","Mild","Bock"]	Beer
["Meade","Grant","Sherman","Beauregard"]	US Civil War generals
["Sirloin","Skirt","T Bone","Picanha"]	Steaks

["Bonsai","Honcho","Sake","Emoji"]	Japanese words
["Strong nuclear","Electromagnetic","Elastic","Friction"]	Forces
["Lumbers","Jib","Rush","Judged"]	Books of the Bible with a letter changed
["Poppadom","Roti","Bhaji","Pakora"]	Indian food

["Amarillo","Rosa","Verde","Azul"]	Spanish colours
["Auror","Moan","Tian","Els"]	Disney princesses minus final "a"
["Rain","Joaquin","Summer","Liberty"]	Phoenix siblings
["World","Food","Sperm","River"]	____ Bank

["Wisdom","False","Milk","Eye"]	____ teeth
["Phoenix","Stone","Prince","Fire"]	Last words of Harry Potter book titles
["Labour","Coach","Rescue","Birthday"]	Types of party
["George Bush","Dulles","Douglas","Sky Harbor"]	US airports

["Barnum","Memphis","Van Helsing","Valjean"]	Played by Hugh Jackman
["Grape","Rango","Wood","Brasco"]	Johnny Depp title roles
["Juliet","Antigone","Brunnhilde","Javert"]	Fictional suicides
["Sonora","Chihuahua","Yukatán","Tabasco"]	Mexican states

["Asuncion","Paramaribo","Lima","Quito"]	South American capitals
["Boxster","Panamera","Cayman","Cayenne"]	Porsche vehicles
["Vervet","Woolly","Spider","Proboscis"]	Monkeys
["Cavendish","Froom","Kenny","Varnish"]	British cyclists

["Stout","Bitter","Mild","Bock"]	Beer
["Meade","Grant","Sherman","Beauregard"]	US Civil War generals
["Sirloin","Skirt","T Bone","Picanha"]	Steaks
["Bonsai","Honcho","Sake","Emoji"]	Japanese words

["Strong nuclear","Electromagnetic","Elastic","Friction"]	Forces
["Lumbers","Jib","Rush","Judged"]	Books of the Bible with a letter changed
["Poppadom","Roti","Bhaji","Pakora"]	Indian food
["Amarillo","Rosa","Verde","Azul"]	Spanish colours

["Auror","Moan","Tian","Els"]	Disney princesses minus final "a"
["Rain","Joaquin","Summer","Liberty"]	Phoenix siblings
["World","Food","Sperm","River"]	____ Bank
["Rush Hour","Vacation","Confessions","House Party"]	Comedy film series

["Peking","The Fat","Rubber","Milkshake"]	____ Duck
["Canyon","Gorge","Vale","Dene"]	Valleys
["Cram","Wolf","Scoff","Bolt"]	Eat quickly
["Lance","Neil","Alexander","Stretch"]	Armstrongs

["Tax","Extend","Push","Challenge"]	Ask for effort
["Surfer","Lining","Service","Birch"]	Silver ____
["Torr","Atmosphere","PSI","Pascal"]	Units of pressure
["Po","Tiber","Arno","Passer"]	Italian rivers

["Heat","Magic","Thunder","Celtics"]	US basketball teams
["God","Art","Dogs","Masters"]	____ of War
["Hunt","Fish","Boat","Drive"]	Need a license
["Marquez","Rossi","Lorenzo","Stoner"]	MotoGP champions

["Cello","Balalaika","Tambura","Harp"]	String instruments
["The Lovers","Wheel of Fortune","Justice","Strength"]	Tarot cards
["Subsistence","Arable","Intensive","Dairy"]	Types of farming
["Coley","Char","Cod","Cusk"]	Fish

["Chad","Cameroon","Comoros","Cabo Verde"]	African countries
["Tito","Castro","Pot","Stoph"]	Communist leaders
["Pole","Shooting","Dwarf","Evening"]	____ star
["Hobbit","Elf","Orc","Warg"]	Tolkien creatures

["Victorinox","Swatch","UBS","Schindler"]	Swiss companies
["Intertwine","Origin","Decider","Whale"]	End in alcoholic drinks
["Gneiss","Marble","Schist","Slate"]	Types of rock
["Rock","Dundee","Clip","Tears"]	Crocodile ____

["Cosmopolitan","Vesper","Caipiroska","Kamikaze"]	Vodka cocktails
["Peaches","Crash","Scrat","Manny"]	'Ice Age' characters
["Spotify","Warthog","Molecule","Zither"]	Start with skin blemishes
["Abraham","Christmas","Damien","Brown"]	Father ____

["Major","Peel","Eden","Heath"]	British Prime Ministers
["Marathi","Malayalam","Telugu","Odia"]	Indian languages
["Heimdall","Fenrir","Loki","Odin"]	Norse mythology
["Young","Bean","Lock","Combs"]	Seans

["Villanelle","Ode","Haiku","Sonnet"]	Poetic forms
["Vishnu","Kali","Ganesha","Brahma"]	Hindu deities
["Shiva","Richard Parker","Tigger","Hobbes"]	Fictional tigers
["Villa","Basilica","Forum","Circus"]	Roman buildings

["Number","Time","Meridian","Suspect"]	Prime ____
["Ratatouille","Coco","Brave","Onward"]	Pixar films
["Tan","Sec","Cosec","Cot"]	Trigonometric functions
["Nemo","Mrs Puff","Klaus","Flounder"]	Animated fish

["Smollett","Ahab","Haddock","Handy"]	Fictional sea captains
["Orzoi","Oxer","Eagle","Asset"]	Dogs missing an initial "B"
["Beer","Mothering","Anthrax","Mantissa"]	Start with insects
["Malinga","Sangakkara","Mathews","Arnold"]	Sri Lankan cricketers

["Dog","Slide","Pea","Tin"]	Types of whistle
["Fat","Lanca","Ferrar","Pagan"]	Italian car manufacturers missing "i"
["Corfu","Kos","Samos","Santorini"]	Greek islands
["Io","Callisto","Ganymede","Kale"]	Moons of Jupiter

["Scott","Poehler","Plaza","Ansari"]	'Parks and Recreation' actors
["Tuba","Sousaphone","Shofar","Cornet"]	Brass instruments
["Benning","Wayne","Sumter","Lauderdale"]	Fort ____ in US
["Smother","Carbuncle","Season","Flaunt"]	Ends with a relative

["Safari","Clock","Compass","Notes"]	Default iPhone apps
["Carbonara","Tintoretto","Neonate","Leadsom"]	Begin with an element
["X","Plum","Branestawm","Layton"]	Professor ____
["United","Rangers","Athletic","Orient"]	London football club suffixes

["Operation","Battleship","Twister","Mastermind"]	Games
["Tempest","Whirlwind","Cyclone","Hurricane"]	Violent winds
["Goose","Nature","Tongue","Shipton"]	Mother ____
["Winston","Bollo","Grodd","Donkey Kong"]	Fictional gorillas

["Aggregate","Replay","Draw","Leg"]	Cup football terminology
["Mark","Catch","Twig","Spot"]	Synonyms for "notice"
["Droopy","Brain","Astro","Tramp"]	Cartoon dogs
["Dill","Sage","Basil","Bay"]	Aromatic herbs

["Hand","Dial","Quartz","Crystal"]	Parts of a clock
["George Washington","Brooklyn","Broadway","Alexander Hamilton"]	Bridges in New York
["Hawker","Embraer","Bombardier","Airbus"]	Aeroplane brands
["Donald","Bentina","Scrooge","Louie"]	DuckTales characters

["Beady","Katie","Emmy","Esso"]	Letter + letter sounds
["Olympique","Bayern","Borussia","Real"]	Champions League winner prefixes
["Methuselah","Hannibal","Odin","Pope Benedict"]	Anthony Hopkins roles
["Shudder","Jerk","Spasm","Tic"]	Twitch

["Man","Card","Coel","Brain"]	-iac
["Drive","Fracture","First Man","The Notebook"]	Ryan Gosling films
["Diamonds","We Found Love","Umbrella","Stay"]	Rihanna songs
["Tue","Thu","Fri","Sat"]	Days of the week

["Fenugreek","Turmeric","Mace","Sumac"]	Spices
["Flash","Porcupine","Wilhelm","Soak"]	End in trees
["Crimson","Garnet","Maroon","Burgundy"]	Shades of red
["Creed","Rocky","Ali","Million Dollar Baby"]	Boxing films

["Flail","Mace","Dagger","Estoc"]	Historical weapons
["Dragon Ball","Naruto","One Piece","Bleach"]	Japanese manga series
["Peanuts","Archie","Dilbert","Blondie"]	Comic strips
["Pink","Pegasus","Stickleback","Tier"]	Begin with "fasten" words

["Azure","Navy","Cyan","Teal"]	Blues
["Wee","Dinky","Minute","Slight"]	Little
["Ray","Collateral","Baby Driver","Annie"]	Jamie Foxx films
["Dictionary","Amen","Scotch","Pooh"]	Famous corners

["Peacock","Red Admiral","Grayling","Brimstone"]	Butterflies
["Huahua","Lean","Chester","Nook"]	Chi-
["Scarab","Cockchafer","Firefly","Weevil"]	Beetles
["Older","Faith","Outside","Mothers Pride"]	George Michael songs

["Shave","Pare","Peel","Trim"]	Reduce
["Perceval","Canning","Major","Baldwin"]	Prime Ministers
["Dawn","Break","Darko","Dancing"]	Last word of Patrick Swayze film titles
["Hopscotch","Statues","Dodgeball","Tag"]	Playground games

["Jerry","Mrs Frisby","Tag","Mickey"]	Fictional mice
["Pouco","Modicum","Peu","Wenig"]	A little in different languages
["Goya","Miró","Gris","Varo"]	Spanish painters
["Trap","Web","Net","Noose"]	Things you can get caught in

["Kettle","Skillet","Griddle","Marmite"]	Types of cookware
["Disclose","Blab","Reveal","Divulge"]	Give out information
["Been","Leak","Beat","P"]	Sound a bit like vegetables
["Lukewarm","Johnson","Market","Mattress"]	Begin with male names

["Holidays","Decorations","Greetings","Lights"]	Christmas ____
["Date","Srichaphan","Bhupathi","Osaka"]	Asian tennis players
["Other Half","Inamorata","Consort","Partner"]	Significant others
["Applause","Drinks","Golf","Gunfire"]	Things that come in rounds

["Transporter","Beetle","Polo","Fox"]	Volkswagens
["Panier","Undined","Anal duck","Newton Gill"]	Anagrams of New Zealand cities
["Doozy","Corker","Crackerjack","Knockout"]	Excellent things
["Bus","Carer","Hooer","Haring"]	US presidents missing 4th letter

["Mamie","Nancy","Lady Bird","Bess"]	First ladies of the US
["Greit","Paroled","Penarth","Lino"]	Anagrams of big cats
["Play","Drive","Analytics","Authenticator"]	Google services
["Knows","Waste","Tow","Heal"]	Sounds a bit like a body part

["Pupil","Coltrane","Cube","Lambert"]	Begin with young animals
["Waltz","Malek","Bean","Wiseman"]	Bond villains
["Inner","Atlantic","Radio","Garden"]	____ City
["Kite","Eagle","Hobby","Owl"]	Birds of prey

["Station","Off","Book","Ground"]	Play____
["Pimple","Freckle","Mole","Spot"]	Facial features
["Morpheus","Tank","Trinity","Switch"]	Characters in 'The Matrix'
["Peterhouse","Clare","Magdalene","Darwin"]	Cambridge colleges

["Cézanne","Catania","Truth","Dilemma"]	End with a girl's name
["Wave","Voice","Storm","News"]	Things that break
["Inc","Vase","Crino","Stream"]	-line
["Admiral","Adidas","Nike","Puma"]	Sportswear brands

["Grunfeld","Reti","Sicilian","Budapest"]	Chess openings
["Guard","Breather","Off","Ulcer"]	Mouth ____
["Barley","Rye","Sorghum","Fonio"]	Grains
["Peso","Real","Bolívar","Boliviano"]	South American currencies

["Gammon","Speck","Bacon","Presunto"]	Pork products
["Single","Greenback","Buck","Simoleon"]	$1
["Steer","Drake","Drone","Bull"]	Male animals
["Dram","Yen","Yuan","Kip"]	Asian currencies

["Fight","Horn","Finch","Doze"]	Bull____
["Snooze","Slumber","Siesta","Nap"]	Sleep
["Brush","Trace","Snip","Share"]	Hurry, when you remove the first letter
["Bit","Nip","Smidge","Iota"]	Just a little

["Calvados","Tequila","Slivovitz","Krupnik"]	Alcoholic drinks
["Boxer","Pomeranian","Pointer","Akita"]	Breeds of dog
["Pomelo","Grapefruit","Yuzu","Clementine"]	Citrus fruits
["Turquoise","Dalmatian","Skeleton","Amalfi"]	Coasts

["Leviathan","Kraken","Umibozu","Jormungandr"]	Mythical sea monsters
["Ghost","Chipotle","Poblano","Jalapeno"]	Peppers	
["Xeno","Neo","Krypto","Rado"]	Noble gases missing their last letter	
["Pewterer","Cryptologist","Navel","Chancellor"]	Things you would find in a church	

["Yeoh","Obama","Visage","Kwan"]	Michelles	
["Bottom","Strange","Top","Charm"]	Types of quark	
["Akron","Dayton","Columbus","Kent"]	Cities in Ohio
["Black","Streatham","Clash","Thiago Silva"]	Dave songs

["Winner","Conqueror","Vanquisher","Champion"]	Winner
["Cabriole","Fondu","Arabesque","Coupé"]	Ballet terms
["Thyroid","Pituitary","Hypothalamus","Adrenal"]	Endocrine glands
["Minerva","Juno","Flora","Fortuna"]	Roman goddesses

["Angers","Gap","Nice","Nancy"]	French towns/cities
["Fifty","Inductance","Luxembourg","Learner"]	Can be represented by L
["Stone","Broom","Skip","Third"]	Curling terminology
["Hot","Fit","Foxy","Dreamy"]	Slang for attractive

["Blackness","Glamis","Balmoral","Tantallon"]	Scottish castles
["Pongo","Mowgli","Simba","Wart"]	Disney protagonists
["Skeleton","Smart","Dimple","Chip"]	Types of key
["Me","Me More","Kiss","Me Thru The Phone"]	'Kiss ____' songs

["Boom","Hike","Upswing","Swell"]	Increases
["Bitcoin","Dogecoin","Ethereum","Terra"]	Cryptocurrencies
["Mears","Charles","Davies","Parlour"]	Rays
["Doozy","Corker","Crackerjack","Knockout"]	Excellent things
"""

#### Parsing Code

In [None]:
def parse_connections_data(input_data):
    games = input_data.strip().split('\n\n')
    all_game_data = []

    for game in games:
        lines = [line.strip() for line in game.strip().split('\n') if line.strip()]
        game_entries = []
        all_words = []

        if len(lines) != 4:
            print(lines, "\n", game)
            raise ValueError("Each game must contain exactly 16 words across 4 groups.")

        for line in lines:
            parts = line.split('\t')
            if len(parts) != 2:
                continue

            words_str, reason = parts
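            # Strip surrounding whitespace first; otherwise a trailing space leaves the closing ']' attached to the last word
            words = [word.strip(' "').lower() for word in words_str.strip().strip('[]').split(",")]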
            all_words.extend(words)
            game_entries.append({'words': words, 'reason': reason.strip()})

        if len(all_words) != 16:
            print(all_words, game)
            raise ValueError("Each game must have exactly 16 words.")
        
        random.shuffle(all_words)

        game_data = {
            'words': all_words,
            'solution': {
                'groups': game_entries
            }
        }

        all_game_data.append(game_data)

    return all_game_data

In [None]:
ds = parse_connections_data(input_data)
ds_len = len(ds)
print(ds[0])
print(ds_len)
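
Each bracketed word list in `input_data` also appears to be a valid JSON array, so a single row could alternatively be parsed with `json.loads` instead of manual string stripping. The sketch below shows this for one row. `parse_line` is a hypothetical helper introduced for illustration; it is not part of the notebook.

```python
import json

def parse_line(line):
    """Split one 'words<TAB>reason' row into a lowercase word list and its reason."""
    words_str, reason = line.strip().split('\t')
    # The bracketed list is treated as JSON here (an assumption about the input format)
    words = [word.strip().lower() for word in json.loads(words_str.strip())]
    return {'words': words, 'reason': reason.strip()}

# One row copied from the input data, including the stray space before the tab
row = parse_line('["Love","Zip","Duck","Goose-egg"] \tTerms for \'zero\'')
print(row)
```

Parsing the list as JSON would also keep a word intact if it ever contained a comma, which a plain `split(",")` would not.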

### Only Connect (DB)

In [None]:
import csv
import random
import json

# Read the CSV data
with open('wall_groups.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    rows = list(reader)

# Process the data
ds = []
for i in range(0, len(rows), 4):
    game_data = rows[i:i+4]
    words = []
    solution = {'groups': []}
    
    for entry in game_data:
        clues = json.loads(entry['clues'])
        clues = [clue.lower() for clue in clues]
        words.extend(clues)
        solution['groups'].append({
            'words': clues,
            'reason': entry['connection']
        })
    
    # Skip walls that are incomplete or repeat a clue
    if len(set(words)) != 16:
        continue
    
    random.shuffle(words)
    ds.append({
        'words': words,
        'solution': solution
    })

ds_len = len(ds)
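
The loader above expects each CSV row to carry a JSON-encoded list in `clues` and a label in `connection`. As a rough illustration of that shape, the sketch below writes two rows to an in-memory buffer and reads them back. The sample rows are made up for the example; only the two column names come from the cell above.

```python
import csv
import io
import json

# Hypothetical rows; only the 'clues' and 'connection' column names come from the loader
sample_rows = [
    (['Ash', 'Lime', 'Willow', 'Cork'], 'Trees'),
    (['Love', 'Zip', 'Duck', 'Goose-egg'], "Terms for 'zero'"),
]

# Write the rows as CSV, storing each clue list as a JSON string
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['clues', 'connection'])
writer.writeheader()
for clues, connection in sample_rows:
    writer.writerow({'clues': json.dumps(clues), 'connection': connection})

# Read them back the same way the loader does
buffer.seek(0)
rows = list(csv.DictReader(buffer))
groups = [
    {'words': [clue.lower() for clue in json.loads(row['clues'])], 'reason': row['connection']}
    for row in rows
]
print(groups[0])
```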

## Word2Vec

Note: One of the embedding models used below is ConceptNet Numberbatch. Download the English-only compressed text file (`numberbatch-en-19.08.txt.gz`) from https://github.com/commonsense/conceptnet-numberbatch?tab=readme-ov-file

In [None]:
import shutil

gzipped_file_path = 'numberbatch-en-19.08.txt.gz'
decompressed_file_path = 'numberbatch-en-19.08.txt'

# Decompress in chunks rather than reading the whole file into memory
with gzip.open(gzipped_file_path, 'rt', encoding='utf-8') as f_in:
    with open(decompressed_file_path, 'w', encoding='utf-8') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Import different models
model_google = api.load('word2vec-google-news-300')
model_glove = api.load('glove-wiki-gigaword-300')
model_wiki = api.load('fasttext-wiki-news-subwords-300')
model_conceptnet = KeyedVectors.load_word2vec_format(decompressed_file_path, binary=False)

# Test: find similar words
print(f"GOOGLE NEWS: {model_google.most_similar('seattle')}")
print(f"GLOVE: {model_glove.most_similar('seattle')}")
print(f"WIKI: {model_wiki.most_similar('seattle')}")
print(f"CONCEPTNET: {model_conceptnet.most_similar('seattle')}")

In [None]:
print(model_google.similarity('squat', 'pushup'))

In [None]:
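# Preprocess multi-word expressions (e.g. 'New York', 'push-up')
def preprocess_word(word, model):
  # Join the parts with underscores, e.g. 'new york' -> 'new_york'
  mwe = re.sub(r'[-\s]', '_', word.lower())

  # Fall back to the concatenated form (e.g. 'pushup') if the underscored form is out of vocabulary
  if mwe not in model:
    mwe = re.sub(r'_', '', mwe)

  return mwe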
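
To see the fallback in action without loading an embedding model, the sketch below runs the same function against a plain Python set. A set is only a stand-in here: it supports the `in` check that the function relies on. The vocabulary `toy_vocab` is made up for the example.

```python
import re

def preprocess_word(word, model):
    # Same logic as the notebook cell: underscores first, concatenation as a fallback
    mwe = re.sub(r'[-\s]', '_', word.lower())
    if mwe not in model:
        mwe = re.sub(r'_', '', mwe)
    return mwe

# Hypothetical vocabulary standing in for a loaded embedding model
toy_vocab = {'new_york', 'pushup', 'seattle'}

print(preprocess_word('New York', toy_vocab))  # new_york (underscored form is in the vocabulary)
print(preprocess_word('push-up', toy_vocab))   # pushup (falls back to the concatenated form)
print(preprocess_word('Mau Mau', toy_vocab))   # maumau (neither form is known; concatenation is returned)
```

Note that the last case returns a string that is still out of vocabulary, so callers need their own `in model` check before asking for a similarity.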

In [None]:
# Greedy guess: seed with the most similar pair in ds[i]['words'], then grow the group to four words
def guess(model, words):
  originals = list(words)
  words = [preprocess_word(word, model) for word in originals]
  n = len(words)

  # Pairwise similarities; any pair with an out-of-vocabulary word scores 0
  similarity_matrix = np.zeros((n, n))
  for i, word1 in enumerate(words):
    for j, word2 in enumerate(words):
      if i != j and word1 in model and word2 in model:
        similarity_matrix[i, j] = model.similarity(word1, word2)

  # Seed: the pair of distinct positions with the highest similarity
  i, j = np.unravel_index(np.argmax(similarity_matrix), similarity_matrix.shape)
  if similarity_matrix[i, j] <= 0:
    return None
  chosen = [int(i), int(j)]

  # Repeatedly add the remaining word with the highest average similarity to the chosen words
  while len(chosen) < 4:
    remaining = [k for k in range(n) if k not in chosen]
    averages = [similarity_matrix[k, chosen].mean() for k in remaining]
    chosen.append(remaining[int(np.argmax(averages))])

  # Return the original spellings so the guess can be compared with the solution words
  return [originals[k] for k in chosen]

In [None]:
from itertools import combinations
import numpy as np
import pandas as pd

def compute_similarity_matrix(model, words):
    words = [preprocess_word(word, model) for word in words]
    words = [word for word in words if word in model]
    
    similarity_matrix = {}
    for i, word1 in enumerate(words):
        for j, word2 in enumerate(words):
            if i < j:  # Avoid redundant computations
                similarity_matrix[(word1, word2)] = model.similarity(word1, word2)
    return similarity_matrix

# Extract words from ds[i]['words'] with fallback guesses
# similarity_matrix: precomputed similarity matrix

def guess_best_combination(model, words, similarity_matrix=None, lives=4):
    if len(words) == 4:
        # Only one grouping is possible, so every remaining attempt is the same guess
        return [list(words) for _ in range(lives)]
    words = [preprocess_word(word, model) for word in words]
    words = [word for word in words if word in model]

    if len(words) < 4 or lives < 1:
        return None

    if similarity_matrix is None:
        similarity_matrix = compute_similarity_matrix(model, words)

    all_combinations = list(combinations(words, 4))
    scored_combinations = []

    for combination in all_combinations:
        similarities = []
        for i, word1 in enumerate(combination):
            for j, word2 in enumerate(combination):
                if i < j:
                    similarities.append(similarity_matrix.get((word1, word2), similarity_matrix.get((word2, word1), 0)))

        average_similarity = np.mean(similarities)
        scored_combinations.append((combination, average_similarity))

    # Sort combinations by average similarity in descending order
    scored_combinations.sort(key=lambda x: x[1], reverse=True)

    # Return up to four attempts in descending order of similarity
    top_guesses = [list(comb[0]) for comb in scored_combinations[:lives]]
    return top_guesses
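
With 16 words there are C(16, 4) = 1820 combinations to score, and each one needs 6 pairwise similarities. A minimal sketch of the same ranking idea, assuming the word vectors are already stacked in a NumPy array: build the cosine-similarity matrix once and read each combination's pairs out of it. `score_combinations` and the toy vectors are hypothetical names for illustration, not part of the solver above.

```python
from itertools import combinations

import numpy as np


def score_combinations(vectors, group_size=4):
    """Rank index tuples by mean pairwise cosine similarity, best first."""
    # Normalize rows so a dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T
    n_pairs = group_size * (group_size - 1) / 2

    scored = []
    for combo in combinations(range(len(vectors)), group_size):
        idx = np.array(combo)
        block = sim[np.ix_(idx, idx)]
        # Off-diagonal entries count every pair twice, so halve the sum
        pair_sum = (block.sum() - np.trace(block)) / 2
        scored.append((combo, pair_sum / n_pairs))

    scored.sort(key=lambda x: x[1], reverse=True)
    return scored


# Toy data (hypothetical): two tight clusters of four vectors each
rng = np.random.default_rng(0)
base_a = np.array([1.0, 0.0, 0.0])
base_b = np.array([0.0, 1.0, 0.0])
toy_vectors = np.vstack(
    [base_a + 0.01 * rng.normal(size=3) for _ in range(4)]
    + [base_b + 0.01 * rng.normal(size=3) for _ in range(4)]
)

ranked = score_combinations(toy_vectors)
print(len(ranked), ranked[0])
```

The top-ranked tuple should be one of the two planted clusters, since any mixed group includes cross-cluster pairs with similarity near zero.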

In [None]:
def eval_round(guess_list, solution):
  right_count = [0, 0, 0, 0]
  for final_word in guess_list:
    for idx, group in enumerate(solution['groups']):
      if final_word in group['words']:
        right_count[idx] += 1
  return max(right_count)
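
A quick sanity check of the scoring idea on a toy puzzle: the score is the largest number of guessed words that land in the same solution group, so 4 means a correct group and 3 means "one away". This is a sketch only; `max_group_overlap` is a hypothetical stand-in that counts the same way as `eval_round`, and the toy solution reuses a few lowercased Only Connect words in the same `{'groups': [...]}` layout as `ds`.

```python
def max_group_overlap(guess_words, solution):
    """Largest number of guessed words that share one solution group."""
    return max(
        sum(word in group['words'] for word in guess_words)
        for group in solution['groups']
    )


# Toy solution (hypothetical), shaped like ds[i]['solution']
toy_solution = {'groups': [
    {'words': ['ash', 'lime', 'willow', 'cork'], 'reason': 'trees'},
    {'words': ['love', 'zip', 'duck', 'goose-egg'], 'reason': "terms for 'zero'"},
]}

perfect = max_group_overlap(['ash', 'lime', 'willow', 'cork'], toy_solution)
one_away = max_group_overlap(['ash', 'lime', 'willow', 'duck'], toy_solution)
split = max_group_overlap(['ash', 'lime', 'zip', 'duck'], toy_solution)
print(perfect, one_away, split)
```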

In [None]:
models = [model_google, model_glove, model_wiki, model_conceptnet]
model_names = ["Google News", "Glove", "Wikipedia", "ConceptNet"]
correct_idx = []
for idx, model in enumerate(models):
  print(f"======== {model_names[idx]} ========")
  right_list = []
  one_away_when = []
  for i in range(ds_len):
    guess_list = guess(model, ds[i]['words'])
    if guess_list is not None:
      score = eval_round(guess_list, ds[i]['solution'])
      right_list.append(score)
      if score == 4 and i not in correct_idx:
        correct_idx.append(i)

  print(f"AVERAGE SCORE: {sum(right_list) / len(right_list)}")
  for i in range(1, 5):
    print(f"{i}: {right_list.count(i)}")
  print()
print(f"Number of Games with At Least One Good First Guess: {len(correct_idx)} / {ds_len}")

In [None]:
def calculate_score(num_correct, strikes):
    # Define multipliers and penalties
    multipliers = [1, 2, 3, 3]
    penalties = [1.0, 0.9, 0.75, 0.5, 0.25]

    # Ensure the number of correct groups is within the valid range
    if num_correct > 4:
        num_correct = 4

    # Calculate the total score
    total_score = 0
    for i in range(num_correct):
        total_score += 1 * multipliers[i] * penalties[strikes]

    return np.round(total_score, 2)

# Example usage
num_correct_1 = 4
num_correct_2 = 4
num_correct_3 = 2

strikes_1 = 0
strikes_2 = 1
strikes_3 = 2

print("All Correct with 0 strikes:", calculate_score(num_correct_1, strikes_1))  # Output: 9.0
print("All Correct with 1 strike:", calculate_score(num_correct_2, strikes_2))   # Output: 8.1
print("2 Correct Groups - 2 strikes:", calculate_score(num_correct_3, strikes_3)) # Output: 2.25

In [None]:
models = [model_google, model_glove, model_wiki, model_conceptnet]
model_names = ["Google News", "Glove", "Wikipedia", "ConceptNet"]
correct_idx = []
multiplier = {4: 1.0, 3: 0.9, 2: 0.75, 1: 0.5, 0: 0.25}
for idx, model in enumerate(models):
  print(f"======== {model_names[idx]} ========")
  right_list = []
  correct_guesses = []
  total_scores = []
  one_away_when = []
  for i in range(ds_len):
    #print("I:", i)
    lives = 4
    correct_count = 0
    total_score = 0
    options = ds[i]['words']
    while lives > 0 and len(options) > 0:
      #print("LEN:", len(options))
      guess_list = guess_best_combination(model, options, lives=lives)
      #print("GUESS:", guess_list)
      if guess_list is None:
        lives -= 1
        continue
      if guess_list is not None:
        # Named `candidate` to avoid shadowing the guess() function used above
        for candidate in guess_list:
          score = eval_round(candidate, ds[i]['solution'])
          if score == 4:
            correct_count += 1
            right_list.append(score)
            options = [item for item in options if item not in candidate]
            if len(options) == 4:
              correct_count += 1
              options = []
            break
          lives -= 1
          if candidate == guess_list[-1] or lives == 0:
            right_list.append(score)
            break
    correct_guesses.append(correct_count)
    total_scores.append(calculate_score(correct_count, 4 - lives))
    if correct_count == 4 and i not in correct_idx:
      correct_idx.append(i)

  print(f"AVERAGE SCORE: {sum(correct_guesses) / len(correct_guesses)}")
  for i in range(0, 5):
    print(f"{i}: {correct_guesses.count(i)}")
  print(f"Average Total Score: {sum(total_scores) / len(total_scores)} (Total: {sum(total_scores)})")
  print()
print(f"Number of Games with At Least One Complete Solve: {len(correct_idx)} / {ds_len}")

In [None]:
import requests

def get_associations_from_conceptnet(word):
    url = f"http://api.conceptnet.io/c/en/{word.lower()}"
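    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Fail loudly on HTTP errors instead of parsing an error page
    data = response.json()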
    return [edge['end']['label'] for edge in data['edges'] if 'Rel' in edge['rel']['label']]

# Example usage
associations = get_associations_from_conceptnet('devil')
print(associations)

In [None]:
from sklearn.cluster import KMeans

def cluster_words(model, words, num_clusters=4):
    word_vectors = [model[word] for word in words if word in model]
    kmeans = KMeans(n_clusters=num_clusters)
    kmeans.fit(word_vectors)

    clusters = {i: [] for i in range(num_clusters)}
    for i, word in enumerate(words):
        if word in model:
            cluster_label = kmeans.predict([model[word]])[0]
            clusters[cluster_label].append(word)

    return clusters

In [None]:
def average_group_similarity(model, group, candidate):
    similarities = [model.similarity(candidate, word) for word in group if candidate in model and word in model]
    return sum(similarities) / len(similarities) if similarities else 0

In [None]:
print(average_group_similarity(model_google, ['strange', 'down', 'charm'], 'up'))

In [None]:
def ensemble_similarity(models, word1, word2):
    scores = [model.similarity(word1, word2) for model in models if word1 in model and word2 in model]
    return sum(scores) / len(scores) if scores else 0

In [None]:
%pip install spacy==3.5.0

In [None]:
import spacy

# Requires the model to be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
def get_named_entities(words):
    doc = nlp(" ".join(words))
    return {ent.text: ent.label_ for ent in doc.ents}

In [None]:
print(ensemble_similarity(models, 'red-sox', 'yankees'))

## Transformers (BERT)

In [None]:
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer, util

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert_model = AutoModel.from_pretrained('bert-base-uncased')
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
# Function to encode a word or phrase and get its embedding
def get_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert_model(**inputs)
    # Use the last hidden state and mean pooling to get the sentence embedding
    embeddings = outputs.last_hidden_state
    mean_embedding = embeddings.mean(dim=1)  # Mean pooling
    return mean_embedding.squeeze()

# Function to calculate cosine similarity between two 1-D tensors
# (named so it does not clash with sklearn's cosine_similarity imported at the top)
def tensor_cosine_similarity(vec1, vec2):
    return torch.nn.functional.cosine_similarity(vec1.unsqueeze(0), vec2.unsqueeze(0)).item()

# Updated function to find the best combination of words
def guess_best_combination_bert(words):
    word_embeddings = {}
    for word in words:
        try:
            # Obtain embeddings for each word/phrase
            word_embeddings[word] = get_embedding(word)
        except Exception as e:
            print(f"Could not process {word}: {e}")

    if len(word_embeddings) < 4:
        raise ValueError("Not enough valid words to form a group of 4.")

    best_combination = None
    highest_average_similarity = -1

    for combination in combinations(word_embeddings.keys(), 4):
        similarities = []
        for i, word1 in enumerate(combination):
            for j, word2 in enumerate(combination):
                if i < j:
                    sim = tensor_cosine_similarity(word_embeddings[word1], word_embeddings[word2])
                    similarities.append(sim)
        
        average_similarity = np.mean(similarities)
        
        if average_similarity > highest_average_similarity:
            highest_average_similarity = average_similarity
            best_combination = combination
            print(highest_average_similarity, best_combination)

    return list(best_combination) if best_combination else []

# print(ds[0]['words'])
result = guess_best_combination_bert(["california", "wisconsin", "texas", "new york", "toaster", "bloom", "gone", "cheese"])
print("Best combination:", result)

In [None]:
def get_word_embedding(word):
    inputs = tokenizer(word, return_tensors='pt')
    with torch.no_grad():
        outputs = bert_model(**inputs)
    embeddings = outputs.last_hidden_state
    mean_embedding = embeddings.mean(dim=1).squeeze(0)
    return mean_embedding

In [None]:
def calculate_similarity(embedding1, embedding2):
    # Cosine similarity between two embedding tensors, returned as a Python float
    return torch.nn.functional.cosine_similarity(embedding1.reshape(1, -1), embedding2.reshape(1, -1)).item()

def find_most_similar_pair(word_embeddings):
    max_similarity = -1
    most_similar_pair = (None, None)
    words = list(word_embeddings.keys())

    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            sim = calculate_similarity(word_embeddings[words[i]], word_embeddings[words[j]])
            if sim > max_similarity:
                max_similarity = sim
                most_similar_pair = (words[i], words[j])

    return most_similar_pair

def expand_group(selected_words, remaining_words, word_embeddings):
    max_similarity = -1
    best_word = None

    for word in remaining_words:
        # Calculate average similarity between the current word and the selected words
        similarities = [calculate_similarity(word_embeddings[word], word_embeddings[sw]) for sw in selected_words]
        avg_similarity = np.mean(similarities)

        if avg_similarity > max_similarity:
            max_similarity = avg_similarity
            best_word = word

    return best_word

In [None]:
def bert_guess(words):
    # Generate embeddings for each word
    word_embeddings = {word: get_word_embedding(word) for word in words}

    # Seed the group with the two words that are most similar to each other
    word1, word2 = find_most_similar_pair(word_embeddings)
    selected_words = [word1, word2]

    # Greedily add the remaining word most similar, on average, to the current selection
    remaining_words = [w for w in words if w not in selected_words]
    for _ in range(2):  # Repeat twice to find 3rd and 4th words
        best_word = expand_group(selected_words, remaining_words, word_embeddings)
        selected_words.append(best_word)
        remaining_words.remove(best_word)

    return selected_words

def eval_round(words, solution):
    right_count = [0, 0, 0, 0]
    for final_word in words:
        for idx, group in enumerate(solution['groups']):
            if final_word in group['words']:
                right_count[idx] += 1
    return max(right_count)

In [None]:
right_list = []
for i in range(ds_len):
    if i > 0 and i % 100 == 0:
        print(f"Game {i}")
    if i == 300:
        continue
    words = ds[i]['words']
    soln = ds[i]['solution']
    optimal_guess = bert_guess(words)
    score = eval_round(optimal_guess, soln)
    right_list.append(score)

print(f"AVERAGE SCORE: {sum(right_list) / len(right_list)}")
print("========GAMES BY MAX NUM RIGHT ENTRIES========")
for i in range(1, 5):
    print(f"{i}: {right_list.count(i)}")

In [None]:
def guess_best_combination_sbert(model, words):
    # Generate embeddings for all words
    word_embeddings = {word: model.encode(word, convert_to_tensor=True) for word in words}
    
    if len(word_embeddings) < 4:
        return None
    
    top_combinations = []

    for combination in combinations(word_embeddings.keys(), 4):
        similarities = []
        for i, word1 in enumerate(combination):
            for j, word2 in enumerate(combination):
                if i < j:
                    sim = util.pytorch_cos_sim(word_embeddings[word1], word_embeddings[word2]).item()
                    similarities.append(sim)
        
        average_similarity = np.mean(similarities)
        
        top_combinations.append((combination, average_similarity))

    top_combinations.sort(key=lambda x: x[1], reverse=True)
    return [list(combo[0]) for combo in top_combinations[:4]]

# Example usage:
result = guess_best_combination_sbert(sbert_model, ds[0]['words'])
for candidate in result:
    print("Best combinations:", candidate)

In [None]:
print(ds[300]['words'])
print(sbert_model.encode("🐑"))
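
The BERT loop earlier skips game 300 by index, and the cell above looks at that game next to an emoji encoding. Assuming the skip is about non-ASCII entries (that is my reading, not something the notebook confirms), a small filter could flag such games instead of hard-coding an index. `non_ascii_words` and `toy_games` are hypothetical names for illustration.

```python
def non_ascii_words(words):
    """Return the entries that contain any non-ASCII character (e.g. emoji)."""
    return [word for word in words if not word.isascii()]


# Toy word lists (hypothetical, not taken from the dataset)
toy_games = [
    ["ash", "lime", "willow", "cork"],
    ["🐑", "duck", "goose-egg", "zip"],
]

# Indices of games with at least one non-ASCII entry
flagged = [i for i, words in enumerate(toy_games) if non_ascii_words(words)]
print(flagged)
```

On the real data this would be `[i for i in range(ds_len) if non_ascii_words(ds[i]['words'])]`, which also catches accented words, so the result is worth eyeballing before skipping anything.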

In [None]:
right_list = []
correct_idx = []
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')
for i in range(ds_len):
  if i % 100 == 0:
    print(f"Game {i}")
  guess_list = guess_best_combination_sbert(sbert_model, ds[i]['words'])
  best_guess = []
  if guess_list is not None:
    # Named `candidate` to avoid shadowing the guess() function defined earlier
    for candidate in guess_list:
      score = eval_round(candidate, ds[i]['solution'])
      best_guess.append(score)
      if score == 4:
        right_list.append(score)
        if i not in correct_idx:
          correct_idx.append(i)
        break
      elif candidate == guess_list[-1]:
        right_list.append(max(best_guess))
        break

print(f"AVERAGE SCORE: {sum(right_list) / len(right_list)}")
for i in range(1, 5):
  print(f"{i}: {right_list.count(i)}")
print()
print(f"Number of Games with At Least One Good First Guess: {len(correct_idx)} / {ds_len}")

...work in progress?

In [None]:
%pip install datasets

In [None]:
from datasets import load_dataset

wiki_ds = load_dataset("wikimedia/structured-wikipedia", "20240916.en")
print(wiki_ds)