<a href="https://colab.research.google.com/github/azholl/lis5693/blob/main/lab-2/lab-2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install spacy



In [2]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [3]:
import spacy

In [9]:
nlp = spacy.load("en_core_web_sm")

**TASK 1: Load and read the raw transcript file from the lab-1 GitHub repo**

In [10]:
import requests

url = "https://raw.githubusercontent.com/azholl/lis5693/refs/heads/main/lab-1/transcript.txt"
response = requests.get(url)
response.raise_for_status()
text = response.text

**TASK 2: Find the number of characters in the transcript and print the first 300**

In [11]:
print("Number of characters:", len(text))

print(text[:300])

Number of characters: 48700
Hey everybody, welcome to the next commentary. Today we're playing some Mel mid into a Yona. Uh Mel got changed
recently with her W reflect like losing the invulnerability and a lot of other changes too. And her win rate dropped so
hard to the point that she got hot fixed. So she did get buffed alre


**TASK 3: Perform sentence segmentation using the blank pipeline**

In [12]:
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp(text)

sentences = list(doc.sents)

print("Number of sentences:", len(sentences))

print("\nFirst 5 sentences:")
for sent in sentences[:5]:
    print(sent)

Number of sentences: 898

First 5 sentences:
Hey everybody, welcome to the next commentary.
Today we're playing some Mel mid into a Yona.
Uh Mel got changed
recently with her W reflect like losing the invulnerability and a lot of other changes too.
And her win rate dropped so
hard to the point that she got hot fixed.
So she did get buffed already because hot fix is like a live patch
basically.


**TASK 4: Perform word count and token analysis**

In [13]:
words = [token.text.lower() for token in doc if token.is_alpha]

print("Total words:", len(words))
print("Unique words:", len(set(words)))

Total words: 9684
Unique words: 1149


**TASK 5: Find most frequent words**

In [14]:
from collections import Counter

word_freq = Counter(words)

print("Top 10 most frequent words:")
for word, count in word_freq.most_common(10):
    print(word, count)



Top 10 most frequent words:
i 511
to 311
that 271
like 222
the 197
just 188
it 181
of 162
a 146
and 139


It is not surprising that "I" is the most frequent word: the video is a video game streamer's commentary on how he is playing the game and why he makes each decision, so he says "I am just going to" or some variation of it constantly.
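One way to look past the function words in this list is to filter them out before ranking. The sketch below reuses the counts printed above. The `STOP_WORDS` set is a small hand-picked assumption for illustration, not spaCy's built-in list.

```python
from collections import Counter

# Counts copied from the top-10 output above.
top_counts = Counter({
    "i": 511, "to": 311, "that": 271, "like": 222, "the": 197,
    "just": 188, "it": 181, "of": 162, "a": 146, "and": 139,
})

# Hypothetical, hand-picked stop-word set (an assumption for this sketch).
STOP_WORDS = {"i", "to", "that", "the", "it", "of", "a", "and"}

# Keep only the words that are not in the stop-word set.
content_counts = Counter(
    {word: count for word, count in top_counts.items() if word not in STOP_WORDS}
)

for word, count in content_counts.most_common():
    print(word, count)
```

With this particular stop list only "like" and "just" remain. In the notebook itself, spaCy's `token.is_stop` attribute could play the same role, though its stop list may differ from the one assumed here.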

**TASK 6: Run full spaCy pipeline**

In [15]:
nlp2 = spacy.load("en_core_web_sm")

doc2 = nlp2(text)

print("Named Entities:")
for ent in doc2.ents:
    print(ent.text, ent.label_)

Named Entities:
Today DATE
Mel PERSON
Mel PERSON
AP ORG
Mel PERSON
Mel PERSON
TP ORG
Yona PERSON
TP ORG
TP ORG
Nice GPE
Jungler PERSON
Graves PERSON
Mel PERSON
seven CARDINAL
seven CARDINAL
nine CARDINAL
fed ORG
Ludens PRODUCT
Luden GPE
Ivvern NORP
Nollas PERSON
Yona PERSON
Graves PERSON
Nollis PERSON
second ORDINAL
Ezreal PERSON
Dude PERSON
Ivvern NORP
Rakan PERSON
6 CARDINAL
Cosmic Tribe PERSON
Horizon Focus PERSON
CDR ORG
Storm
Surge ORG
Storm Surge PERSON
1v CARDINAL
Bork PERSON
22 CARDINAL
three CARDINAL
nine CARDINAL
up to nine CARDINAL
CC ORG
Draven GPE
two CARDINAL
AoE ORG
Cosmic Drive PERSON
Ezreal PRODUCT
Rift PERSON
Ivvern NORP
Raton PERSON
Draven GPE
Dude PERSON
30 CARDINAL
Needless ORG
Rift Herald PERSON
Garen PERSON
five CARDINAL
Garen PERSON
Yona PERSON
Zahen PERSON
Garren PERSON
Garen PERSON
100 CARDINAL
65 CARDINAL
Ezreal PERSON
firstly ORDINAL
Dang PERSON
14 CARDINAL
Yona PERSON
500 CARDINAL
500 CARDINAL
Raon PERSON
Yona PERSON
Yaso PERSON
Riven PERSON
Yon PERSON
two 

There were about 250 named entities, including ORG, PERSON, CARDINAL, PERCENT, WORK_OF_ART, NORP, DATE, GPE, and PRODUCT.
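The entity total does not have to be counted by hand: `doc2.ents` is a sequence, so `len(doc2.ents)` returns it directly. The sketch below goes one step further and tallies entities per label. To stay runnable without the model, it works on plain `(text, label)` pairs; `summarize_entities` and `sample_pairs` are illustrative names, and the sample holds only the first eight entities from the output above.

```python
from collections import Counter

def summarize_entities(pairs):
    """Return the total number of entities and a per-label tally.

    pairs: iterable of (entity_text, entity_label) tuples.
    """
    pairs = list(pairs)
    by_label = Counter(label for _, label in pairs)
    return len(pairs), by_label

# In the notebook, the pairs would come from the parsed document:
#   pairs = [(ent.text, ent.label_) for ent in doc2.ents]
# Here: the first eight entities from the output above, as sample data.
sample_pairs = [
    ("Today", "DATE"), ("Mel", "PERSON"), ("Mel", "PERSON"), ("AP", "ORG"),
    ("Mel", "PERSON"), ("Mel", "PERSON"), ("TP", "ORG"), ("Yona", "PERSON"),
]

total, by_label = summarize_entities(sample_pairs)
print("Total entities:", total)
for label, count in by_label.most_common():
    print(label, count)
```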

**TASK 7: Using PhraseMatcher**

In [16]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp2.vocab, attr="LOWER")

phrases = ["mel", "draven", "jungler"]

# make_doc only tokenizes, which is all the LOWER attribute needs
patterns = [nlp2.make_doc(p) for p in phrases]

matcher.add("TECH_TERMS", patterns)

matches = matcher(doc2)

print("Matches found:")
for match_id, start, end in matches:
    print(doc2[start:end])
    print("Sentence:", doc2[start].sent)


Matches found:
Mel
Sentence: Today we're playing some Mel mid into a Yona.
Mel
Sentence: Uh Mel got changed
recently with her W reflect like losing the invulnerability and a lot of other changes too.
Mel
Sentence: We just want to be doing uh auto attack weaving while using our abilities just like regular Mel really.
Mel
Sentence: I mean, I feel
like this guy might be killable if I landed everything because Mel still is all about landing
your auto attacks and abilities because your abilities uh cause your auto attacks to be empowered to do this auto
attack weave where it does these bonus hits and you apply this stacking execute that shows over their head.
Jungler
Sentence: Jungler has not been part of anything so far, but he's
doing his best.

Mel
Sentence: I don't know if maybe previously the Mel before these changes could also
clear at level seven, but being able to clear back wave at level seven is kind of nice.

Draven
Sentence: I mean, Draven's plenty fed as
well, I guess, but I am

I chose these phrases because the video is about a character named Mel, so I wanted to see how often she was mentioned. I also included Draven, a character he was playing against, and jungler, one of the prominent roles in the game.
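To put a number on how often each phrase comes up, a quick cross-check is to count whole-word, case-insensitive occurrences in the raw text. Note that this sketch uses a regular expression rather than `PhraseMatcher`, so counts could differ where spaCy tokenizes differently. `count_mentions` is an illustrative helper, and `sample_text` is an abridged excerpt of the matched sentences shown above.

```python
import re
from collections import Counter

def count_mentions(text, phrases):
    """Count whole-word, case-insensitive occurrences of each phrase."""
    counts = Counter()
    for phrase in phrases:
        # \b keeps "mel" from matching inside a longer word.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        counts[phrase] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

# Abridged excerpt for illustration; in the notebook, pass `text` instead.
sample_text = (
    "Today we're playing some Mel mid into a Yona. "
    "Uh Mel got changed recently with her W reflect. "
    "Jungler has not been part of anything so far. "
    "I mean, Draven's plenty fed as well, I guess."
)

mention_counts = count_mentions(sample_text, ["mel", "draven", "jungler"])
for phrase, count in mention_counts.most_common():
    print(phrase, count)
```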

**TASK 8: Reflection**

***What went well?*** Because I had worked through the introduction to spaCy notebook that we were assigned before this lab, I wasn't lost at any point. The code worked as intended.

***What did not go well?*** I chose this YouTube video simply because it was what I was watching at the time. I didn't consider that it used YouTube auto-captions, which made the transcript data very messy. Names of characters and items that came up frequently were spelled differently almost every time. As a result, there were over 250 named entities (which I counted by hand because I couldn't figure out how to make the model tell me how many there were), and many of them were repeats of the same word with a different spelling. It wasn't necessarily a challenge, but it made me think more about preprocessing when mining large amounts of text that is likely to be messy, and about how important it is to be critical of my preprocessing decisions.
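One possible preprocessing step for the inconsistent spellings is to group names that are nearly identical. The sketch below does this with `difflib.SequenceMatcher` from the standard library. The helper `group_spelling_variants`, the 0.8 threshold, and the repeat counts in `sample_names` are all assumptions for illustration; only the spellings themselves come from the entity output.

```python
from collections import Counter
from difflib import SequenceMatcher

def group_spelling_variants(names, threshold=0.8):
    """Greedily group names whose similarity ratio meets the threshold.

    The most frequent spelling is visited first, so it becomes the
    canonical key for its group.
    """
    counts = Counter(names)
    groups = {}  # canonical spelling -> list of spellings in that group
    for name, _ in counts.most_common():
        for canonical in groups:
            ratio = SequenceMatcher(None, name.lower(), canonical.lower()).ratio()
            if ratio >= threshold:
                groups[canonical].append(name)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups[name] = [name]
    return groups

# Spellings taken from the entity output; the repeat counts are made up.
sample_names = [
    "Yona", "Yona", "Yona", "Yon", "Yaso",
    "Garen", "Garen", "Garen", "Garren",
    "Nollas", "Nollis", "Mel", "Mel",
]

groups = group_spelling_variants(sample_names)
for canonical, variants in groups.items():
    print(canonical, "<-", variants)
```

String similarity cannot tell a caption error from a genuinely different name, so the groups would still need a manual check before any merging.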