HOW TO RUN NOTEBOOK: This notebook can be run in order. The final cell is used for getting the coherence score for different numbers of components. If you choose to run this cell, then you can go back and rerun the code starting from the cell containing the code:
svd = TruncatedSVD(n_components=20, random_state=42)  # Number of topics
X_lsa = svd.fit_transform(X)
This will give you a different set of player comparisons and topics

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np

In [2]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


In [3]:
df = pd.read_csv('all_draft_classes.csv')

overviews = df['Overview']

In [4]:
import re

df['Strengths'] = \
df['Strengths'].map(lambda x: re.sub('[,\.!?]', '', x))

df['Weaknesses'] = \
df['Weaknesses'].map(lambda x: re.sub('[,\.!?]', '', x))

df['Overview'] = \
df['Overview'].map(lambda x: re.sub('[,\.!?]', '', x))

df['Strengths'] = 'STRENGTHS: ' + df['Strengths'].astype(str)
df['Weaknesses'] = 'WEAKNESSES: ' + df['Weaknesses'].astype(str)
df.head()

Unnamed: 0,Year,Name,Overview,Strengths,Weaknesses,Label
0,2025,Cam Ward,Gunslinger with good size a big arm and the mo...,STRENGTHS: Recognizes pre-snap pressure and ca...,WEAKNESSES: Too willing to work out of structu...,
1,2025,Shedeur Sanders,Any perceptions that Sanders is a product of H...,STRENGTHS: Plays with confidence and composure...,WEAKNESSES: Spacing and clearly defined route ...,
2,2025,Jaxson Dart,Three-year SEC starter who saw improvement in ...,STRENGTHS: Gets across the full field of progr...,WEAKNESSES: Deep zone coverages slowed his mom...,
3,2025,Jalen Milroe,Milroe is an explosive athlete who is very cap...,STRENGTHS: Unflinching when he delivers throws...,WEAKNESSES: Threw five touchdowns and 10 inter...,
4,2025,Will Howard,Howard brings outstanding size and toughness t...,STRENGTHS: Outstanding size and toughness insi...,WEAKNESSES: Very gradual in his setup and rele...,


In [5]:
stop_words = stopwords.words('english')
lemmatizer = WordNetLemmatizer()
df['composite'] = df['Overview'] + df['Strengths'] + df['Weaknesses']
# Tokenize the sentence
overviews = []
for doc in df['composite']:
    filtered_sentence = [w for w in doc.split() if not w.lower() in stop_words]
    filtered_sentence = [lemmatizer.lemmatize(word) for word in filtered_sentence]
    overviews.append(" ".join(filtered_sentence).lower())


In [6]:
#transform documents into tf-idf matrix
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(overviews)

In [7]:
#run SVD
svd = TruncatedSVD(n_components=16, random_state=42) 
X_lsa = svd.fit_transform(X)

In [8]:
print("Original TF-IDF matrix shape:", X.shape)
print("LSA reduced matrix shape:", X_lsa.shape)
print("LSA components:\n", svd.components_)

Original TF-IDF matrix shape: (179, 4444)
LSA reduced matrix shape: (179, 16)
LSA components:
 [[ 0.00053463  0.00053463  0.02521744 ...  0.04968638  0.00197938
   0.00532543]
 [-0.00171739 -0.00171739 -0.01795803 ...  0.07010186 -0.00492206
   0.01011557]
 [ 0.00373974  0.00373974  0.07827389 ... -0.01910766  0.01542337
  -0.00491781]
 ...
 [-0.00031578 -0.00031578  0.01043718 ... -0.02311465  0.00871292
  -0.00720029]
 [-0.00041496 -0.00041496  0.01531823 ... -0.01640533 -0.00337343
   0.00393199]
 [ 0.00059597  0.00059597 -0.01302152 ...  0.02975666 -0.00250666
   0.00908784]]


In [None]:
#Examine what words are in each topic
terms = vectorizer.get_feature_names_out()  # Get the terms (words) from TF-IDF vectorizer
topic_words = []
for i, component in enumerate(svd.components_):
    top_terms_idx = component.argsort()[-20:][::-1]  
    top_terms = [terms[idx] for idx in top_terms_idx]  # Get the actual terms
    topic_words.append(top_terms)  # Store the top terms for each topic
    print(f"Component {i+1}: {', '.join(top_terms)}")


Component 1: throw, pocket, ball, play, arm, accuracy, field, make, time, good, deep, nfl, game, quarterback, he, pressure, average, ability, get, talent
Component 2: he, throw, talent, placement, average, platform, safety, play, anticipation, timing, 2020, zone, off, tape, poise, reader, below, pocket, eye, willing
Component 3: yard, target, percent, span, hand, inch, bowl, season, touchdown, middle, run, 10, attempt, trying, body, all, school, sideline, last, interception
Component 4: football, could, smart, percentage, game, duress, action, intangible, heavily, require, highly, respected, tends, dink, dunk, completion, outstanding, works, wave, to
Component 5: team, play, wilson, luck, 2020, he, athlete, passer, rg3, first, bowl, well, span, inch, game, state, intangible, one, athletic, also
Component 6: luck, play, rg3, make, able, percent, ability, athletic, likely, 2014, nfl, year, foles, allen, quarterback, harnish, learn, ease, talented, simply
Component 7: improve, open, snap,

In [10]:
#Generate player comparisons
from sklearn.metrics.pairwise import cosine_similarity
similar_players = []
for i, v1 in enumerate(X_lsa):
    most_similar = 0
    player = None
    for j, v2 in enumerate(X_lsa):
        if i != j:
            similarity = cosine_similarity(v1.reshape(1, -1), v2.reshape(1, -1))
            if similarity[0][0] > most_similar:
                most_similar = similarity[0][0]
                player = df.iloc[j]['Name']
    similar_players.append(player)
    print(df.iloc[i]['Name'], 'is most similar to', player, 'with similarity:', most_similar)

Cam Ward is most similar to Drake Maye with similarity: 0.8354728769682168
Shedeur Sanders is most similar to Desmond Ridder with similarity: 0.8487506574094439
Jaxson Dart is most similar to Ryan Finley with similarity: 0.8333668444408866
Jalen Milroe is most similar to Matt Barkley with similarity: 0.8033066881552291
Will Howard is most similar to Riley Leonard with similarity: 0.9370398780903564
Kyle McCord is most similar to Bailey Zappe with similarity: 0.8571613167590146
Tyler Shough is most similar to Jordan Travis with similarity: 0.8154626340561049
Quinn Ewers is most similar to Ryan Finley with similarity: 0.806327247659282
Dillon Gabriel is most similar to Cam Ward with similarity: 0.7999205418508317
Riley Leonard is most similar to Will Howard with similarity: 0.9370398780903564
Caleb Williams is most similar to Drake Maye with similarity: 0.8447698093499749
Jayden Daniels is most similar to Kyle Trask with similarity: 0.8952585861168113
Drake Maye is most similar to Caleb 

In [19]:
#coherence score
from gensim.models import CoherenceModel
import gensim.corpora as corpora

def get_Cv(model, df_columnm):
  topics = model.components_

  n_top_words = 20
  texts = [[word for word in doc.split()] for doc in df_columnm]

  # create the dictionary
  dictionary = corpora.Dictionary(texts)
  # Create a gensim dictionary from the word count matrix

  feature_names = [dictionary[i] for i in range(len(dictionary))]
  # Get the top words for each topic from the components_ attribute
  top_words = []
  for topic in topics:
      top_words.append([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])

  coherence_model = CoherenceModel(topics=top_words, texts=texts, dictionary=dictionary, coherence='c_v')
  coherence = coherence_model.get_coherence()
  return coherence

coherence = get_Cv(svd, df['composite'])
print(coherence)

0.7512626614658426


In [12]:
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary
from nltk.corpus import stopwords
import nltk
#u_mass coherence score

texts = [doc.split() for doc in overviews]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
coherence_model = CoherenceModel(
    topics=topic_words,
    corpus=corpus,
    dictionary=dictionary,
    coherence='u_mass'  # Other options: 'u_mass', 'c_uci', 'c_npmi'
)

coherence_score = coherence_model.get_coherence()
print(f"Coherence Score: {coherence_score:.4f}")

Coherence Score: -3.6574


In [13]:
#NOTE used for determining optimal number of components
optimal_components = 0
for i in range(15, 25):
    svd = TruncatedSVD(n_components=i)  # Number of topics
    X_lsa = svd.fit_transform(X)
    cv = get_Cv(svd, df['composite'])
    print('coherence', cv, 'components', i)
    if cv > coherence:
        coherence = cv
        optimal_components = i

['2024', '25', '39%', 'Able', 'Alters', 'Arm', 'Bobs', 'Can', 'Capable', 'Completed', 'Displays', 'Drop-down', 'Focus', 'Football', 'Georgia', 'Gets', 'Gunslinger', 'He', 'Inconsistent', 'Like', 'Mobility', 'NFL', 'Needs', 'Pocket', 'Pro', 'QB-hunters', 'Rears', 'Recognizes', 'Release', 'Seam', 'Struggled', 'Sudden', 'Tech’s', 'Too', 'Unjustifiable', 'Ward', 'While', 'With', 'a', 'ability', 'accentuate', 'accuracy', 'against', 'aggressive', 'all', 'and', 'arm', 'around', 'attacking', 'average', 'away', 'back', 'backward', 'ball', 'become', 'before', 'better', 'big', 'blanket', 'bolt', 'breaks', 'bucket', 'but', 'can', 'cannot', 'cause', 'class', 'combo', 'consistent', 'contractSTRENGTHS:', 'coordinator', 'could', 'cover', 'coverage', 'coverages', 'creates', 'cross-field', 'cuts', 'decision-making', 'defender’s', 'delivery', 'develop', 'develops', 'discipline', 'disguised', 'doesn’t', 'down', 'draft', 'dual-threat', 'efficiency', 'elements', 'erratic', 'excited', 'extend', 'eye', 'eyes'

KeyboardInterrupt: 