# TDDE16 Text Mining Individual Project
* By Axel Strid (axest556)
* Linköping University
* Course taken during HT2 2025

# Analyzing NHL Interview Data Using Text Mining Techniques

In [5]:
# Note for later troubleshooting of llama-cpp-python performance:

# If loading a Llama model appears to freeze, the Intel CPU is probably the bottleneck.
# This machine has no powerful GPU, so try disabling GPU offloading by setting
# n_gpu_layers=0:

# from llama_cpp import Llama
# llm = Llama(model_path="path/to/model.gguf", n_gpu_layers=0)

## Step 0: Setup Environment
To set up your Python environment:
1. Create a virtual environment (venv)
2. Activate the virtual environment
3. Install the required packages in `requirements.txt` using pip
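
Once the steps above are done, a quick sanity check can confirm that the packages are visible from the notebook kernel. The sketch below is only an illustration: the `missing_packages` helper and the `required` list are assumptions, not part of the project, and the list should be adjusted to the import names that correspond to the entries in `requirements.txt`.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in the active environment."""
    return [name for name in module_names if find_spec(name) is None]


# Hypothetical list: replace with the import names matching requirements.txt.
required = ["kagglehub", "pandas"]

missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, the notebook kernel is most likely not running inside the activated virtual environment.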

## Step 1: Load the data

This project uses interview data stored in a Kaggle dataset. To load the data, run the cell below:

In [11]:
# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Set the path to the file you'd like to load
file_path = "interview_data.csv"

# Load the latest version
df = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "dtamming/national-hockey-league-interviews",
  file_path,
  # Provide any additional arguments, such as
  # sql_query or pandas_kwargs. See the
  # documentation for more information:
  # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)


## Step 2: Extract Dataset Information

**The dataset contains interviews from NHL Stanley Cup Finals.**

#### From the author of the dataset on Kaggle:
"This dataset was scraped from http://www.asapsports.com/, using the code in this repository. I designed the webscraping code to account for most of the variance in the website's formatting, but some webpages with formatting that differed significantly were ignored. While manually inspecting random rows of the dataset I did not notice any glaring errors in the transcripts, but I cannot guarantee that there aren't any."

#### Attributes:
- `RowId`: A unique row identifier.
- `team1` and `team2`: The two teams in the Stanley Cup Final. Whether a team is team1 or team2 has no meaning: it's determined by the order of their listing on the website.
- `date`: The date of the interview.
- `name`: The name of the person being interviewed.
- `job`: Takes values **"player"**, **"coach"**, and **"other"**. If they are a player or coach at the time of the interview they are assigned accordingly. Otherwise they are assigned "other". Most of the people in the "other" category are general managers, league officials, and commentators. Some of these values were assigned automatically based on their title in a transcript (e.g. "Coach Mike Babcock"), and others were assigned manually. A possible source of error is the fact that I did not manually inspect names that appeared only once.
- `text`: The interview transcript. Interviewer questions were not collected, so all of the speech comes from the interviewee. Responses to questions are separated by periods. These periods will be the only punctuation in the text. Note that a likely source of error in this column is a failure to ignore an interviewer's questions.
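
Since the `text` description says that responses are separated by periods, a transcript can in principle be split into individual responses. The sketch below is a minimal illustration under that assumption; `split_responses` is a hypothetical helper and the sample string is invented, not taken from the dataset. It is worth checking how often the separator actually occurs in the real transcripts before relying on it.

```python
def split_responses(text):
    """Split a transcript into individual responses on periods, dropping empty parts."""
    return [part.strip() for part in text.split(".") if part.strip()]


# Invented sample, used only to show the behaviour of the helper.
sample = "we played a solid first period. the power play needs work."
responses = split_responses(sample)
print(len(responses), responses)
```

On the full dataset the same helper could be applied per row, for example with `df['text'].apply(split_responses)`.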


**Below are some useful insights about the dataset:**

In [13]:
# Display the first and last 5 records
print("First 5 records:\n", df.head()) # first 5 records
print("\nLast 5 records:\n", df.tail())  # last 5 records

First 5 records:
    RowId       team1      team2        date              name     job  \
0      0  blackhawks  lightning  2015-06-02       stan bowman   other   
1      1  blackhawks  lightning  2015-06-02     steve yzerman   other   
2      2  blackhawks  lightning  2015-06-03  antoine vermette  player   
3      3  blackhawks  lightning  2015-06-03  joel quenneville   coach   
4      4  blackhawks  lightning  2015-06-03        jon cooper   coach   

                                                text  
0  well we're very fortunate to have the players ...  
1  no we didn't really set a timeline on it i wou...  
2  that's a good question i don't recall specific...  
3  yeah we got better as the game went on i thoug...  
4  i don't know i think the way i'd look at the g...  

Last 5 records:
       RowId  team1   team2        date               name     job  \
2091   2091  stars  devils  2000-06-10        mike modano  player   
2092   2092  stars  devils  2000-06-10  richard matvichuk

In [22]:
# Print one full record for inspection
print("\nSample record:\n", df.iloc[0])

# Display one full text record for inspection
print("\nFull text of first record:\n", df.iloc[0]['text'])


Sample record:
 RowId                                                    0
team1                                           blackhawks
team2                                            lightning
date                                            2015-06-02
name                                           stan bowman
job                                                  other
text     well we're very fortunate to have the players ...
Name: 0, dtype: object

Full text of first record:
 well we're very fortunate to have the players that we do here i look back at when this all started sort of signaled when rocky took over the franchise the changes he made gave us some momentum and excitement we had a good year leading into that year but rocky came onboard and sort of changed the whole mentality of the organization he brought john mcdonough onboard from that point on we kind of felt like we were getting closer and closer obviously the season was when we finally broke through those players were the

In [19]:
# Display unique years in the dataset
unique_years = df['date'].str.extract(r'(\d{4})')[0].unique()
print("\nUnique years in the dataset:\n", unique_years)


Unique years in the dataset:
 ['2015' '2019' '2013' '2011' '2018' '1998' '2001' '2004' '2010' '2002'
 '2012' '2003' '2006' '2009' '2017' '2014' '1997' '1999' '2007' '2016'
 '2000']


In [23]:
# Total number of records
total_records = len(df)
print("\nTotal number of records in the dataset:", total_records)

# Number of interviews per year
interviews_per_year = df['date'].str.extract(r'(\d{4})')[0].value_counts().sort_index()
print("\nNumber of interviews per year:\n", interviews_per_year)

# Average number of interviews per year
average_interviews_per_year = interviews_per_year.mean()
print("\nAverage number of interviews per year:", average_interviews_per_year)


Total number of records in the dataset: 2096

Number of interviews per year:
 0
1997     90
1998    118
1999    140
2000    111
2001    160
2002    124
2003    116
2004     89
2006    152
2007    130
2009    132
2010    147
2011    146
2012    100
2013     61
2014     43
2015     55
2016     57
2017     74
2018     18
2019     33
Name: count, dtype: int64

Average number of interviews per year: 99.80952380952381
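
The per-year counts above skip some calendar years, which is easy to miss when the index only lists years that are present. A possible way to expose such gaps is to parse the dates and reindex the counts over the full year range. This is a sketch, not the method used in this notebook: the `interviews_per_year` helper is hypothetical, the sample dates are invented, and it assumes every `date` value follows the `YYYY-MM-DD` format seen in the records displayed earlier.

```python
import pandas as pd


def interviews_per_year(dates):
    """Count rows per calendar year, including years with zero rows."""
    # Assumes ISO-formatted date strings (YYYY-MM-DD).
    years = pd.to_datetime(dates, format="%Y-%m-%d").dt.year
    counts = years.value_counts().sort_index()
    full_range = range(int(counts.index.min()), int(counts.index.max()) + 1)
    # Years without any rows get a count of 0 instead of being left out.
    return counts.reindex(full_range, fill_value=0)


# Invented sample dates, used only to demonstrate the gap detection.
sample = pd.Series(["2004-06-01", "2006-06-05", "2006-06-07"])
counts = interviews_per_year(sample)
missing_years = counts[counts == 0].index.tolist()
print("Years without interviews:", missing_years)
```

Calling `interviews_per_year(df['date'])` on the real data should list the years that are absent between the first and last interview.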


In [32]:
# Total number of interviewees:
total_interviewees = df['name'].nunique()
print("\nTotal number of unique interviewees in the dataset:", total_interviewees)

# Total number of teams:
total_teams1 = df['team1'].unique()
total_teams2 = df['team2'].unique()
total_teams = set(total_teams1).union(set(total_teams2))
print("\nTotal number of unique teams in the dataset:", len(total_teams))
print("Teams:", total_teams)

# Number of interviews per job category:
interviews_per_job = df['job'].value_counts()
print("\nNumber of interviews per job category:\n", interviews_per_job)


Total number of unique interviewees in the dataset: 470

Total number of unique teams in the dataset: 24
Teams: {'penguins', 'oilers', 'red wings', 'lightning', 'golden knights', 'avalanche', 'devils', 'ducks', 'blues', 'sabres', 'blackhawks', 'bruins', 'stars', 'mighty ducks', 'hurricanes', 'rangers', 'sharks', 'kings', 'canucks', 'predators', 'flyers', 'senators', 'flames', 'capitals'}

Number of interviews per job category:
 job
player    1498
coach      515
other       83
Name: count, dtype: int64
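
The job categories are clearly imbalanced, and expressing the counts as shares makes that easier to see. The sketch below simply reuses the counts printed above; on the DataFrame itself, `df['job'].value_counts(normalize=True)` would give the same proportions.

```python
# Counts copied from the output above.
job_counts = {"player": 1498, "coach": 515, "other": 83}

total = sum(job_counts.values())
# Share of each category in percent, rounded to one decimal.
shares = {job: round(100 * n / total, 1) for job, n in job_counts.items()}
print(shares)
```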


## Key Takeaways from Data Inspection
- Total number of interviews: **2096**
- Interviews cover **21 distinct years** between **1997 and 2019**; 2005 and 2008 do not appear in the data
- Interviews per year range from **18 to 160**, with an average of about **100 interviews per year**
- Total unique interviewees: **470**
- Total unique teams represented: **24**
- Number of interviews per job category:
  - **Player**: 1498
  - **Coach**: 515
  - **Other**: 83  


## Step 3: Start Preprocessing