### In almost any spoken-language setting, it helps to know how often stop words occur, since they can strongly affect the quality of the text. Identifying which words/tokens contribute little to the semantic meaning of the text lets us remove them later for conciseness.

Make sure to install the necessary dependencies.

In [1]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


Imports

In [2]:
import pandas as pd
import spacy

Read the current CSV file with the named entity recognition results into a DataFrame.

In [3]:
df = pd.read_csv('../OUTPUT/Named_Entity_Recognition.csv')

Use the en_core_web_sm English pipeline. The spaCy package includes a list of words that it recognizes as stop words, and we can check each row of text against it. Each transcript/row of text is broken down into individual words/tokens, and each token's is_stop boolean indicates whether it belongs to that set of stop words. Every stop word found is stored as a (text, character index) tuple, and the tuples for a row are collected into a list.

In [6]:
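# Load spaCy's small English pipeline; each token it produces exposes an is_stop flag
nlp = spacy.load('en_core_web_sm')
df['Stop Words'] = ''

for index, row in df.iterrows():
    doc = nlp(row['Text'])
    # Collect (token text, character offset) for every stop word in this row
    stop_words_found = [(token.text, token.idx) for token in doc if token.is_stop]
    df.at[index, 'Stop Words'] = stop_words_found

df.head()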

Unnamed: 0,FileName,Text,Entities,Stop Words
0,2KEI29IfOp4.txt,You guys welcome back a cheat day Jared has re...,"[('Jared', 'ORG'), ('Thrones', 'ORG'), ('Atlan...","[(You, 0), (back, 17), (a, 22), (has, 40), (to..."
1,--aOisk7Hf8.txt,How low guys it's me. Hello. Welcome back to a...,"[('Halloween', 'DATE'), ('Halloween', 'DATE'),...","[(How, 0), (it, 13), ('s, 15), (me, 18), (back..."
2,2omuOarg2hE.txt,"Hi guys, I have been real lazy I could do a 15...","[('15 minute', 'TIME'), ('Animal Crossing', 'P...","[(I, 9), (have, 11), (been, 16), (I, 31), (cou..."
3,2wfWK2Z9A58.txt,"Gentlemen, I'm here today to meet an old frien...","[('today', 'DATE'), ('Kevin', 'PERSON'), ('One...","[(I, 11), ('m, 12), (here, 15), (to, 26), (an,..."
4,2uaGw1D-X0Y.txt,Is that a little music? No. Is that me is now ...,"[('today', 'DATE'), ('Wednesday upload day', '...","[(Is, 0), (that, 3), (a, 8), (No, 24), (Is, 28..."
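
The introduction mentions stop word frequency and later removal, so here is a minimal sketch of both steps for a single row. It is not part of the original notebook: the helper names `stop_word_counts` and `remove_stop_words` are hypothetical, and the sample text is a shortened version of the first row above. It assumes the stop words are available as a list of (text, character index) tuples, as produced by the previous cell.

```python
from collections import Counter


def stop_word_counts(stop_words_found):
    """Count occurrences of each stop word in one row, ignoring case."""
    return Counter(word.lower() for word, _ in stop_words_found)


def remove_stop_words(text, stop_words_found):
    """Cut each (word, start) span out of the text, then collapse extra spaces."""
    pieces = []
    cursor = 0
    for word, start in sorted(stop_words_found, key=lambda pair: pair[1]):
        pieces.append(text[cursor:start])   # keep the text before this stop word
        cursor = start + len(word)          # skip over the stop word itself
    pieces.append(text[cursor:])            # keep whatever follows the last one
    return " ".join("".join(pieces).split())


# Hypothetical sample row (shortened from the first transcript shown above)
sample_text = "You guys welcome back a cheat day"
sample_stop_words = [("You", 0), ("back", 17), ("a", 22)]

print(stop_word_counts(sample_stop_words))
print(remove_stop_words(sample_text, sample_stop_words))  # guys welcome cheat day
```

Counting per row keeps the sketch simple; summing the per-row `Counter` objects would give corpus-wide frequencies if those are needed.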


Save the final CSV file

In [5]:
df.to_csv('../OUTPUT/results.csv', index=False)
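
One caveat when reusing the saved file: CSV stores each cell as text, so the list of tuples in the Stop Words column is written out as its string representation. A minimal sketch of turning such a cell back into Python objects with the standard library is shown below; the `cell` value is a hypothetical example of what one saved cell might look like.

```python
import ast

# Hypothetical cell value as it would appear after a CSV round trip
cell = "[('You', 0), ('back', 17), ('a', 22)]"

# literal_eval safely parses Python literals (lists, tuples, strings, numbers)
parsed = ast.literal_eval(cell)

print(type(parsed[0]).__name__, parsed[0])
```

When reading the file back with pandas, passing `converters={'Stop Words': ast.literal_eval}` to `pd.read_csv` should apply the same parsing to the whole column.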