## Preprocessing

Using one article as an example:

* Tokenization
* Stop-word removal
* Lemmatization in place of stemming (spaCy provides a lemmatizer but no stemmer)
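
As a rough illustration, the three steps can be sketched on a single sentence with no dependencies. This is a toy version, not the method used in this notebook (the cells below rely on spaCy): the `STOP_WORDS` set, the `LEMMAS` table, and the `preprocess` helper are all hypothetical and only cover this one sentence.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"the", "are", "with", "how", "has", "his", "and"}
LEMMAS = {"jaguars": "jaguar", "begun": "begin"}

def preprocess(text):
    # 1. Tokenize: lowercase, then extract runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # 2. Stop-word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3. Lemmatization by table lookup; unknown words pass through unchanged.
    return [LEMMAS.get(t, t) for t in tokens]

sample = ("The Jacksonville Jaguars are pleased with how "
          "Leonard Fournette has begun his offseason")
result = preprocess(sample)
print(result)
```

A real lemmatizer replaces the lookup table with a vocabulary-wide model or rule set, which is the role spaCy plays in the cells that follow.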

In [2]:
import pandas as pd
import spacy

In [3]:
data = pd.read_csv("ESPN_football.csv")

In [4]:
data

Unnamed: 0,author,class,data-id,sport,teamname,timestamp,url,summary,text,headline
0,Michael DiRocco,story-link,26094253,nfl,buffalo-bills,4h,http://www.espn.com/nfl/story/_/id/26094253/ja...,"JACKSONVILLE, Fla. -- The Jacksonville Jaguars...","JACKSONVILLE, Fla. -- The Jacksonville Jaguars...",Jags GM: Fournette 'in a good spot' after meeting
1,Mike Rodak,story-link,buffalo-bills-32911,nfl,buffalo-bills,2d,http://espn.com/blog/buffalo-bills/post/_/id/3...,The cash-flush Buffalo Bills could be among th...,The cash-flush Buffalo Bills could be among th...,Bills' focus on homegrown talent could temper ...
2,Mike Rodak,story-link,buffalo-bills-32889,nfl,buffalo-bills,6d,http://espn.com/blog/buffalo-bills/post/_/id/3...,The Buffalo Bills will be under pressure over ...,The Buffalo Bills will be under pressure over ...,Why Bills should consider trading sacks leader...
3,"Michael C. Wright, Greg Wyshynski, Mike Rodak ...",story-link,25760086,nfl,buffalo-bills,6d,http://www.espn.com/espn/story/_/id/25760086/b...,93-year-old Pete Anton has spent decades worki...,93-year-old Pete Anton has spent decades worki...,Behind-the-scenes game-day jobs you never knew...
4,ESPN.com,story-link,25998951,nfl,buffalo-bills,9d,http://www.espn.com/nfl/story/_/id/25998951/ho...,The five quarterbacks drafted in the first rou...,The five quarterbacks drafted in the first rou...,How the NFL's worst quarterbacks can improve i...
5,Mike Rodak,story-link,26026509,nfl,buffalo-bills,9d,http://www.espn.com/nfl/story/_/id/26026509/bi...,Mina Kimes and Pablo S. Torre react to Bills Q...,Mina Kimes and Pablo S. Torre react to Bills Q...,'Trash' talk: Bills QB's autograph jabs at Ramsey
6,Mike Rodak,story-link,26005150,nfl,buffalo-bills,12d,http://www.espn.com/nfl/story/_/id/26005150/bi...,The Buffalo Bills have released tight end Char...,The Buffalo Bills have released tight end Char...,Bills release TE Clay after career-worst season
7,Jeremy Willis,story-link,25995553,nfl,buffalo-bills,13d,http://www.espn.com/nfl/story/_/id/25995553/nf...,Can't you feel the love around the NFL?,Can't you feel the love around the NFL? It's V...,The best cheesy Valentines from around the NFL
8,Mike Rodak,story-link,25980647,nfl,buffalo-bills,15d,http://www.espn.com/nfl/story/_/id/25980647/bi...,The Buffalo Bills on Tuesday signed free-agent...,The Buffalo Bills on Tuesday signed free-agent...,Bills sign ex-Jet Long to bolster offensive line
9,Mike Rodak,story-link,buffalo-bills-32874,nfl,buffalo-bills,21d,http://espn.com/blog/buffalo-bills/post/_/id/3...,"Five weeks after the Bills' 2018 season ended,...","Five weeks after the Bills' 2018 season ended,...",Why Bills' Lorenzo Alexander embraces the Buff...


In [5]:
data.text.get(0)

'JACKSONVILLE, Fla. -- The Jacksonville Jaguars are pleased with how Leonard Fournette has begun his offseason and are eager to see how the third-year player responds to a disappointing -- and at times concerning -- 2018. General manager Dave Caldwell and coach Doug Marrone each praised Fournette at the NFL scouting combine on Wednesday morning. "I think Leonard\'s in a good spot," Caldwell said. "I know a lot was made out of the end-of-the-season stuff, but he seems like he\'s in a good phase. He\'s working out. I know he\'s taking his nutrition and his workout seriously. I think he\'s in a good spot, so we\'ll see when he comes in April with the rest of the veterans and in the OTAs." Fournette\'s second NFL season didn\'t go nearly as well as his first, when he rushed for 1,040 yards and nine touchdowns to help the Jaguars win the AFC South and reach the AFC Championship game. Instead, in 2018, he missed seven games because of injuries, was suspended for another, had several off-fiel

In [23]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [25]:
# remove stopwords (from NLTK), punctuation, pronouns
# lemmatize and tokenize
punctuations = '!"#$%&\'()*+,-/:;<=>?@[\\]^_`{|}~'
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
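stopword_set = set(stopwords)  # set membership is faster than scanning a list
def cleanup_text(text):
    """Lemmatize one article, then drop pronouns, stop words, and punctuation."""
    # The parser and NER are not needed for lemmas, so they are disabled for speed.
    doc = nlp(text, disable=['parser', 'ner'])
    # In spaCy v2, pronouns are lemmatized to the placeholder '-PRON-'.
    tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
    # Note: `tok not in punctuations` is a substring test on a string, so empty
    # tokens are dropped while multi-character tokens such as '--' are kept.
    tokens = [tok for tok in tokens if tok not in stopword_set and tok not in punctuations]
    # Return a plain string so that Series.apply yields a Series of strings.
    return ' '.join(tokens)
data['text_cleaned'] = data['text'].apply(cleanup_text)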

In [26]:
data

Unnamed: 0,author,class,data-id,sport,teamname,timestamp,url,summary,text,headline,text_cleaned
0,Michael DiRocco,story-link,26094253,nfl,buffalo-bills,4h,http://www.espn.com/nfl/story/_/id/26094253/ja...,"JACKSONVILLE, Fla. -- The Jacksonville Jaguars...","JACKSONVILLE, Fla. -- The Jacksonville Jaguars...",Jags GM: Fournette 'in a good spot' after meeting,jacksonville florida -- jacksonville jaguar pl...
1,Mike Rodak,story-link,buffalo-bills-32911,nfl,buffalo-bills,2d,http://espn.com/blog/buffalo-bills/post/_/id/3...,The cash-flush Buffalo Bills could be among th...,The cash-flush Buffalo Bills could be among th...,Bills' focus on homegrown talent could temper ...,cash flush buffalo bills could among active te...
2,Mike Rodak,story-link,buffalo-bills-32889,nfl,buffalo-bills,6d,http://espn.com/blog/buffalo-bills/post/_/id/3...,The Buffalo Bills will be under pressure over ...,The Buffalo Bills will be under pressure over ...,Why Bills should consider trading sacks leader...,buffalo bills pressure upcoming two month impr...
3,"Michael C. Wright, Greg Wyshynski, Mike Rodak ...",story-link,25760086,nfl,buffalo-bills,6d,http://www.espn.com/espn/story/_/id/25760086/b...,93-year-old Pete Anton has spent decades worki...,93-year-old Pete Anton has spent decades worki...,Behind-the-scenes game-day jobs you never knew...,93-year old pete anton spend decade work spurs...
4,ESPN.com,story-link,25998951,nfl,buffalo-bills,9d,http://www.espn.com/nfl/story/_/id/25998951/ho...,The five quarterbacks drafted in the first rou...,The five quarterbacks drafted in the first rou...,How the NFL's worst quarterbacks can improve i...,five quarterback draft first round last april ...
5,Mike Rodak,story-link,26026509,nfl,buffalo-bills,9d,http://www.espn.com/nfl/story/_/id/26026509/bi...,Mina Kimes and Pablo S. Torre react to Bills Q...,Mina Kimes and Pablo S. Torre react to Bills Q...,'Trash' talk: Bills QB's autograph jabs at Ramsey,mina kimes pablo s. torre react bill qb josh a...
6,Mike Rodak,story-link,26005150,nfl,buffalo-bills,12d,http://www.espn.com/nfl/story/_/id/26005150/bi...,The Buffalo Bills have released tight end Char...,The Buffalo Bills have released tight end Char...,Bills release TE Clay after career-worst season,buffalo bills release tight end charles clay t...
7,Jeremy Willis,story-link,25995553,nfl,buffalo-bills,13d,http://www.espn.com/nfl/story/_/id/25995553/nf...,Can't you feel the love around the NFL?,Can't you feel the love around the NFL? It's V...,The best cheesy Valentines from around the NFL,feel love around nfl valentine 's day course n...
8,Mike Rodak,story-link,25980647,nfl,buffalo-bills,15d,http://www.espn.com/nfl/story/_/id/25980647/bi...,The Buffalo Bills on Tuesday signed free-agent...,The Buffalo Bills on Tuesday signed free-agent...,Bills sign ex-Jet Long to bolster offensive line,buffalo bills tuesday sign free agent offensiv...
9,Mike Rodak,story-link,buffalo-bills-32874,nfl,buffalo-bills,21d,http://espn.com/blog/buffalo-bills/post/_/id/3...,"Five weeks after the Bills' 2018 season ended,...","Five weeks after the Bills' 2018 season ended,...",Why Bills' Lorenzo Alexander embraces the Buff...,five week bills 2018 season end buffalo bring ...


In [29]:
data.text_cleaned.get(0)

"jacksonville florida -- jacksonville jaguar pleased leonard fournette begin offseason eager see third year player respond disappointing -- time concern -- 2018 . general manager dave caldwell coach doug marrone praise fournette nfl scout combine wednesday morning . think leonard good spot caldwell say . know lot make end season stuff seem like good phase . work . know take nutrition workout seriously . think good spot see come april rest veteran ota . fournette 's second nfl season go nearly well first rush 1,040 yard nine touchdown help jaguars win afc south reach afc championship game . instead 2018 miss seven game injury suspend another several field issue let get shape . jaguars need quarterback also could use tight end line help another receiver two draft april . could big change new england offseason . jet meanwhile money spend everyone still chase pats . season end sour note executive vp football operation tom coughlin publicly criticize fournette inactive foot injury t.j. yeld

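The cleaned text above still contains `--` and `.`. One likely reason is that `tok not in punctuations` checks whether the token is a substring of the punctuation string, not whether the token is made up of punctuation characters; the string also contains no `.`. The sketch below contrasts the two checks on a handful of tokens. `is_punct_chars` is a hypothetical alternative built on the standard library's `string.punctuation`, not something the notebook uses.

```python
import string

# The punctuation string from the cleanup cell (note that it has no '.').
punctuations = '!"#$%&\'()*+,-/:;<=>?@[\\]^_`{|}~'

def is_punct_substring(tok):
    # What `tok in punctuations` does: a substring test on the string.
    # The empty string is a substring of every string, so '' counts as a match.
    return tok in punctuations

PUNCT_CHARS = frozenset(string.punctuation)

def is_punct_chars(tok):
    # Hypothetical alternative: non-empty and every character is punctuation.
    return bool(tok) and all(ch in PUNCT_CHARS for ch in tok)

tokens = ["jacksonville", "--", ".", ",", "'s", "1,040", ""]
kept_substring = [t for t in tokens if not is_punct_substring(t)]
kept_chars = [t for t in tokens if t and not is_punct_chars(t)]
print(kept_substring)
print(kept_chars)
```

Whether sentence-final periods should be kept depends on the downstream task, so the character-based check is an option rather than a required fix.
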
In [30]:
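def tokenize(cleaned_text):
    """Split an already-cleaned article into a list of token strings."""
    # make_doc runs only the tokenizer; the tagger, parser, and NER are not needed here.
    doc = nlp.make_doc(cleaned_text)
    # Store plain strings rather than spaCy Token objects, which keep the whole Doc alive.
    return [token.text for token in doc]

def add_tokenized(row):
    # Called once per row via DataFrame.apply(axis=1), so `row` is a Series.
    row['tokenized'] = tokenize(row['text_cleaned'])
    return row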

In [31]:
data = data.apply(add_tokenized, axis = 1)

In [32]:
data

Unnamed: 0,author,class,data-id,sport,teamname,timestamp,url,summary,text,headline,text_cleaned,tokenized
0,Michael DiRocco,story-link,26094253,nfl,buffalo-bills,4h,http://www.espn.com/nfl/story/_/id/26094253/ja...,"JACKSONVILLE, Fla. -- The Jacksonville Jaguars...","JACKSONVILLE, Fla. -- The Jacksonville Jaguars...",Jags GM: Fournette 'in a good spot' after meeting,jacksonville florida -- jacksonville jaguar pl...,"[jacksonville, florida, --, jacksonville, jagu..."
1,Mike Rodak,story-link,buffalo-bills-32911,nfl,buffalo-bills,2d,http://espn.com/blog/buffalo-bills/post/_/id/3...,The cash-flush Buffalo Bills could be among th...,The cash-flush Buffalo Bills could be among th...,Bills' focus on homegrown talent could temper ...,cash flush buffalo bills could among active te...,"[cash, flush, buffalo, bills, could, among, ac..."
2,Mike Rodak,story-link,buffalo-bills-32889,nfl,buffalo-bills,6d,http://espn.com/blog/buffalo-bills/post/_/id/3...,The Buffalo Bills will be under pressure over ...,The Buffalo Bills will be under pressure over ...,Why Bills should consider trading sacks leader...,buffalo bills pressure upcoming two month impr...,"[buffalo, bills, pressure, upcoming, two, mont..."
3,"Michael C. Wright, Greg Wyshynski, Mike Rodak ...",story-link,25760086,nfl,buffalo-bills,6d,http://www.espn.com/espn/story/_/id/25760086/b...,93-year-old Pete Anton has spent decades worki...,93-year-old Pete Anton has spent decades worki...,Behind-the-scenes game-day jobs you never knew...,93-year old pete anton spend decade work spurs...,"[93-year, old, pete, anton, spend, decade, wor..."
4,ESPN.com,story-link,25998951,nfl,buffalo-bills,9d,http://www.espn.com/nfl/story/_/id/25998951/ho...,The five quarterbacks drafted in the first rou...,The five quarterbacks drafted in the first rou...,How the NFL's worst quarterbacks can improve i...,five quarterback draft first round last april ...,"[five, quarterback, draft, first, round, last,..."
5,Mike Rodak,story-link,26026509,nfl,buffalo-bills,9d,http://www.espn.com/nfl/story/_/id/26026509/bi...,Mina Kimes and Pablo S. Torre react to Bills Q...,Mina Kimes and Pablo S. Torre react to Bills Q...,'Trash' talk: Bills QB's autograph jabs at Ramsey,mina kimes pablo s. torre react bill qb josh a...,"[mina, kimes, pablo, s., torre, react, bill, q..."
6,Mike Rodak,story-link,26005150,nfl,buffalo-bills,12d,http://www.espn.com/nfl/story/_/id/26005150/bi...,The Buffalo Bills have released tight end Char...,The Buffalo Bills have released tight end Char...,Bills release TE Clay after career-worst season,buffalo bills release tight end charles clay t...,"[buffalo, bills, release, tight, end, charles,..."
7,Jeremy Willis,story-link,25995553,nfl,buffalo-bills,13d,http://www.espn.com/nfl/story/_/id/25995553/nf...,Can't you feel the love around the NFL?,Can't you feel the love around the NFL? It's V...,The best cheesy Valentines from around the NFL,feel love around nfl valentine 's day course n...,"[feel, love, around, nfl, valentine, 's, day, ..."
8,Mike Rodak,story-link,25980647,nfl,buffalo-bills,15d,http://www.espn.com/nfl/story/_/id/25980647/bi...,The Buffalo Bills on Tuesday signed free-agent...,The Buffalo Bills on Tuesday signed free-agent...,Bills sign ex-Jet Long to bolster offensive line,buffalo bills tuesday sign free agent offensiv...,"[buffalo, bills, tuesday, sign, free, agent, o..."
9,Mike Rodak,story-link,buffalo-bills-32874,nfl,buffalo-bills,21d,http://espn.com/blog/buffalo-bills/post/_/id/3...,"Five weeks after the Bills' 2018 season ended,...","Five weeks after the Bills' 2018 season ended,...",Why Bills' Lorenzo Alexander embraces the Buff...,five week bills 2018 season end buffalo bring ...,"[five, week, bills, 2018, season, end, buffalo..."
