## Following this tutorial: [Automated Keyword Extraction from Articles using NLP](https://medium.com/analytics-vidhya/automated-keyword-extraction-from-articles-using-nlp-bfd864f41b34)

#### next steps:
* manual keywords
* determining what keywords are
look at nouns, get rid of stop words, see what pops out!

In [1]:
# set up libraries we'll need
import pandas as pd
import re  # the standard library module covers every re.sub call used below
import string

In [2]:
# import and preview dataset
# dataset from https://www.ire.org/events-and-training/conferences/nicar-2019
dataset = pd.read_csv('data/car19guide.csv')
dataset.head()

Unnamed: 0,event_id,name,clean_description,location_room,start_date_clean,start_time,end_time,pre_reg_flag,paid_flag,laptop_flag,speakers_cleaned,session_type,keywords,skill_level,session_title
0,4178,(Generally) painless collaboration with the gr...,Traditional reporters and editors often view t...,Salon A&B,2019-03-09,2019-03-09 15:30:00,2019-03-09 16:30:00,False,False,False,"Ryann Grochowski Jones, ProPublica (moderator)...",Panel,,General interest,(Generally) painless collaboration with the gr...
1,4162,25th CAR: What a ride it's been!,Buckle up for a fast-paced ride through 25 yea...,Salon C&D,2019-03-08,2019-03-08 15:30:00,2019-03-08 16:30:00,False,False,False,"Doug Haddix, IRE/NICAR; Shawn McIntosh, Atlant...",Panel,,General interest,25th CAR: What a ride it's been!
2,4189,50 databases to request right now,Get your FOI templates ready to roll. In this ...,Salon D,2019-03-09,2019-03-09 14:15:00,2019-03-09 15:15:00,False,False,False,"Mark Walker, The New York Times; Kate Martin, ...",Panel,,General interest,50 databases to request right now
3,4198,A conversation with James B. Steele: Insights ...,This special session features the wit and wisd...,Salon A&B,2019-03-09,2019-03-09 10:15:00,2019-03-09 11:15:00,False,False,False,"Sarah Cohen, ASU Cronkite School of Journalism...",Panel,,,A conversation with James B. Steele: Insights ...
4,4301,Adding a text editor to your CAR toolkit,A good text editor is an essential tool for da...,Salon A&B,2019-03-10,2019-03-10 10:15:00,2019-03-10 11:15:00,False,False,False,"Agustin Armendariz, The New York Times",Demo,,Intermediate,Adding a text editor to your CAR toolkit


In [3]:
# create new dataset with only the fields we want
subset = dataset.loc[:, ['name','clean_description']] # 'session_title' seems to be same as name
# add a column with the year
subset['conference_year'] = '2019'
subset.head()

Unnamed: 0,name,clean_description,conference_year
0,(Generally) painless collaboration with the gr...,Traditional reporters and editors often view t...,2019
1,25th CAR: What a ride it's been!,Buckle up for a fast-paced ride through 25 yea...,2019
2,50 databases to request right now,Get your FOI templates ready to roll. In this ...,2019
3,A conversation with James B. Steele: Insights ...,This special session features the wit and wisd...,2019
4,Adding a text editor to your CAR toolkit,A good text editor is an essential tool for da...,2019


In [4]:
# through manual analysis of the schedule, I found some rows that are not classes or are duplicate sessions
# their names contain strings we can filter out by joining them with the regex "or" operator (|)
# thanks to https://stackoverflow.com/questions/11350770/select-by-partial-string-from-a-pandas-dataframe

filter_out = ['registration', 'sales', 'repeat']
filtered_subset = subset[~subset['name'].str.contains('|'.join(filter_out))]
# reset indices
filtered_subset = filtered_subset.reset_index(drop=True)
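
The `'|'.join(filter_out)` call builds one regular expression in which `|` means "or", so a row is dropped when its name contains any of the three strings. A minimal sketch of the same idea in plain Python follows. The session names are made up for illustration, and the `re.IGNORECASE` variant is an assumption added here to show how capitalized forms such as "Registration" could also be caught; the notebook's filter is case-sensitive.

```python
import re

filter_out = ['registration', 'sales', 'repeat']

# Joining with '|' yields one pattern that matches any of the terms.
pattern = '|'.join(re.escape(term) for term in filter_out)

# Hypothetical session names, not taken from the dataset.
names = [
    'Conference registration',
    'Intro to SQL',
    'Intro to SQL (repeat)',
    'Registration desk open',
]

# Case-sensitive match, as in the notebook's filter.
kept_case_sensitive = [n for n in names if not re.search(pattern, n)]

# Case-insensitive variant (an assumption, not the notebook's behavior).
kept_case_insensitive = [
    n for n in names if not re.search(pattern, n, flags=re.IGNORECASE)
]

print(pattern)
print(kept_case_sensitive)
print(kept_case_insensitive)
```

In pandas, passing `case=False` to `str.contains` should give the case-insensitive behavior.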

In [5]:
# preliminary text exploration
# fetch word count for each description
pd.options.mode.chained_assignment = None # silence pandas' SettingWithCopyWarning

filtered_subset['word_count'] = filtered_subset['clean_description'].apply(lambda x: len(str(x).split(" ")))
filtered_subset[['clean_description','word_count']].head()

Unnamed: 0,clean_description,word_count
0,Traditional reporters and editors often view t...,68
1,Buckle up for a fast-paced ride through 25 yea...,99
2,Get your FOI templates ready to roll. In this ...,56
3,This special session features the wit and wisd...,194
4,A good text editor is an essential tool for da...,49
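
One caveat about the word counts above: `split(" ")` breaks on every single space, so doubled spaces produce empty strings that are counted as words. A small sketch of the difference, using a made-up string rather than a real description:

```python
# Hypothetical text with a double space between the first two words.
text = "Learn  to clean data"

# split(" ") breaks on every single space, so the double space
# yields an empty string in the result.
by_space = text.split(" ")

# split() with no argument breaks on any run of whitespace.
by_whitespace = text.split()

print(len(by_space), by_space)
print(len(by_whitespace), by_whitespace)
```

If the descriptions contain irregular spacing, `split()` is likely the safer choice for counting words.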


In [6]:
# descriptive statistics of word counts
filtered_subset.word_count.describe()

count    225.000000
mean      81.973333
std       37.831383
min        3.000000
25%       58.000000
50%       75.000000
75%       98.000000
max      210.000000
Name: word_count, dtype: float64

In [7]:
# copy descriptions to new column for pre-processing
filtered_subset['preproc_desc'] = filtered_subset['clean_description']

# # make every word in descriptions lowercase
# filtered_subset['preproc_desc'] = filtered_subset['preproc_desc'].apply(lambda x: x.lower())
# # remove punctuation before looking for common/uncommon words because adjacent punctuation changes words
# filtered_subset['preproc_desc'] = filtered_subset['preproc_desc'].apply(lambda x: x.translate(str.maketrans('','',string.punctuation)))
# filtered_subset['preproc_desc'].head()

In [8]:
# identify common words
# could be used for custom stop word list
freq = pd.Series(' '.join(filtered_subset['preproc_desc']).split()).value_counts()[:50]
freq
# only domain-specific words that we might want to keep out of stoplist are 'data', 'learn', 'stories'

and        756
to         653
the        512
a          385
of         335
data       258
you        246
for        245
is         241
in         232
with       213
will       207
how        197
your       180
This       153
session    140
that       136
this       123
on         118
be         117
can        108
or         105
are         95
good        91
who         79
have        73
as          70
from        67
an          65
some        64
for:        64
use         61
using       60
into        58
about       57
it          56
at          56
We'll       51
but         48
In          46
learn       46
more        45
—           44
what        43
also        43
their       42
tools       41
get         41
stories     39
data.       39
dtype: int64
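
The list above counts `This` and `this` separately, and likewise `data` and `data.`, because the text is split on whitespace with no normalization. A rough sketch of lowercasing and stripping surrounding punctuation before counting is shown here. It uses `collections.Counter` instead of pandas, and the two sentences are invented stand-ins for the `preproc_desc` column.

```python
import string
from collections import Counter

# Hypothetical descriptions standing in for the 'preproc_desc' column.
descriptions = [
    "This session is good for: anyone who works with data.",
    "In this session we'll clean data.",
]

tokens = " ".join(descriptions).split()
raw_counts = Counter(tokens)

# Lowercase each token and strip punctuation from its ends only, so
# internal apostrophes and hyphens (e.g. "we'll") are preserved.
normalized = (tok.lower().strip(string.punctuation) for tok in tokens)
norm_counts = Counter(tok for tok in normalized if tok)

print(raw_counts["This"], raw_counts["this"], raw_counts["data."])
print(norm_counts["this"], norm_counts["data"], norm_counts["for"])
```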

In [9]:
# identify uncommon words
# may show what cleaning is still needed
unfreq = pd.Series(' '.join(filtered_subset['preproc_desc']).split()).value_counts()[-20:]
unfreq

re-structure              1
rebuilding                1
request?                  1
Pivot                     1
(bit.ly/DarkMoneyData)    1
world.                    1
nation's                  1
PyCAR                     1
super-handy               1
Customize                 1
onto                      1
animations                1
hoaxes                    1
All                       1
happen,                   1
Star                      1
agency                    1
conventions,              1
watching                  1
strength?                 1
dtype: int64

In [10]:
# libraries for text-preprocessing

# download these the first time you run this
#import nltk
#nltk.download('stopwords')
#nltk.download('wordnet')

from nltk.corpus import stopwords
# stemming normalizes text by removing suffixes
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
# lemmatisation reduces each word to its dictionary form (lemma)
from nltk.stem.wordnet import WordNetLemmatizer

In [11]:
# create a set of stop words (plus custom stop words if we want)
stop_words = set(stopwords.words("english"))
# custom stop words to add; empty for now
new_words = []
stop_words = stop_words.union(new_words)
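
Since `new_words` is empty, the union leaves the set unchanged for now. As a sketch of how the custom list could work, the example below adds two words and protects the three domain words noted earlier (`data`, `learn`, `stories`). The base set is a tiny stand-in for NLTK's English list so the snippet runs without any downloads, and the added words are illustrative guesses, not a tested stop list.

```python
# Tiny stand-in for NLTK's English stop word list (assumption: the real
# list is much longer and comes from stopwords.words("english")).
base_stop_words = {"and", "to", "the", "a", "of", "in", "this"}

# Hypothetical custom additions: frequent words that say little about
# what a session covers.
new_words = ["session", "good"]

# Domain words to protect from removal, even if a list contains them.
keep_words = {"data", "learn", "stories"}

stop_words = base_stop_words.union(new_words) - keep_words

tokens = "learn to clean the data in this session".split()
filtered = [tok for tok in tokens if tok not in stop_words]
print(filtered)
```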

In [12]:
# with the stopwords, clean and normalize the corpus
lem = WordNetLemmatizer()
corpus = []
for desc in filtered_subset['preproc_desc']: # iterate over rows; don't hard code number of rows!
    # replace punctuation, digits and other non-letters with spaces
    text = re.sub('[^a-zA-Z]', ' ', desc)

    # convert to lowercase
    text = text.lower()

    # remove tags (no effect here: '<' and '>' were already replaced above)
    text = re.sub("</?.*?>", " <> ", text)

    # remove special characters and digits
    text = re.sub("(\\d|\\W)+", " ", text)

    # convert to list from string
    text = text.split()

    # lemmatise each word that is not a stop word
    # (the tutorial also creates a PorterStemmer here, but it is never used)
    text = [lem.lemmatize(word) for word in text if word not in stop_words]
    text = " ".join(text)
    corpus.append(text)
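
The cleaning steps above can also be collected into one function, which makes them easier to test on a single string. This is only a sketch under a few assumptions: lemmatization is left out so it runs without NLTK data, `clean_text` and `demo_stop_words` are hypothetical names, and the sample sentence is made up.

```python
import re

def clean_text(text, stop_words):
    """Lowercase, keep letters only, and drop stop words (no lemmatization)."""
    # Replace every non-letter character with a space.
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    text = text.lower()
    # split() with no argument also collapses the runs of spaces left behind.
    return ' '.join(word for word in text.split() if word not in stop_words)

# Tiny stand-in stop word set (assumption; the notebook uses NLTK's list).
demo_stop_words = {'with', 'by'}

cleaned = clean_text("Clean 3 messy datasets with OpenRefine, step-by-step!",
                     demo_stop_words)
print(cleaned)
```

A function like this could then be applied to the whole column with `filtered_subset['preproc_desc'].apply(...)` instead of an explicit loop.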

In [13]:
# view an example corpus item
print(corpus[1])
print()
print(corpus[5])

buckle fast paced ride year data journalism told people drove car mainstream investigative reporting hear pivotal moment bizarre twist befuddled bureaucrat know hit featuring special guest expected guest speaker include crina boros center investigative journalism sarah cohen asu walter cronkite school journalism steve doig asu walter cronkite school journalism jaimi dowdell reuters mark horvit university missouri brant houston university illinois clarence jones independent journalist jennifer lafleur investigative reporting workshop james b steele independent journalist

much openrefine clustering faceting feature session deep dive grel openrefine expression language equivalent excel formula thorough introduction grel syntax review common function explore clean dataset function covered session include replace split concatenate string comparison cell cross join multiple project together foreach session good people familiar openrefine least excel experience introduction openrefine check 

In [42]:
# # testing if "r" shows up
# sub = ' r '
# print('\n'.join(s for s in corpus if sub in s))




In [24]:
# get word counts for every single word in the corpus
# https://www.geeksforgeeks.org/python-count-occurrences-of-each-word-in-given-text-file-using-dictionary/

# create an empty dictionary
counts = dict()

for desc in corpus: 
    # remove leading spaces
    desc = desc.strip() 
  
    # split the line into words 
    words = desc.split(" ") 
    
    for word in words:
        # Check if the word is already in dictionary 
        if word in counts: 
            # Increment count of word by 1 
            counts[word] = counts[word] + 1
        else: 
            # Add the word to dictionary with count 1 
            counts[word] = 1
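
The same tally can be written more compactly with `collections.Counter` from the standard library. A minimal sketch, using three invented strings in place of `corpus`:

```python
from collections import Counter

# Hypothetical cleaned descriptions standing in for `corpus`.
demo_corpus = [
    "clean data openrefine",
    "data journalism story",
    "map data story",
]

# Counter tallies each word; update() adds the counts from each description.
demo_counts = Counter()
for desc in demo_corpus:
    demo_counts.update(desc.split())

# most_common() returns (word, count) pairs from most to least frequent.
print(len(demo_counts))
print(demo_counts.most_common(2))
```

`Counter` is a `dict` subclass, so code that reads `counts[word]` keeps working.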

In [35]:
# sort dictionary by word count values
import collections

sorted_counts_list = sorted(counts.items(), key=lambda x: x[1])
sorted_counts = collections.OrderedDict(sorted_counts_list)

In [36]:
# print dictionary of word counts
# https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value

print("There are", len(sorted_counts), "unique words in this corpus.")
for key, value in sorted_counts.items():
    print(key, ":", value)

desk : 1
unapproachable : 1
shoot : 1
stack : 1
magically : 1
bouncing : 1
head : 1
revamp : 1
reputation : 1
perception : 1
buckle : 1
told : 1
drove : 1
mainstream : 1
pivotal : 1
bizarre : 1
twist : 1
befuddled : 1
bureaucrat : 1
hit : 1
featuring : 1
crina : 1
boros : 1
steve : 1
doig : 1
jaimi : 1
dowdell : 1
reuters : 1
mark : 1
horvit : 1
missouri : 1
brant : 1
houston : 1
illinois : 1
clarence : 1
jennifer : 1
lafleur : 1
geek : 1
overlooked : 1
appreciated : 1
wit : 1
accomplished : 1
inquirer : 1
largest : 1
magazine : 1
slew : 1
inc : 1
vanity : 1
fair : 1
arizona : 1
ran : 1
shared : 1
medal : 1
allowing : 1
inspect : 1
tidy : 1
seamlessly : 1
cm : 1
clustering : 1
equivalent : 1
thorough : 1
covered : 1
concatenate : 1
foreach : 1
least : 1
questioning : 1
sub : 1
neat : 1
unnatural : 1
occur : 1
incident : 1
lengthy : 1
frustrating : 1
intimidate : 1
appeal : 1
freedom : 1
terrorist : 1
jason : 1
leopold : 1
attorney : 1
katie : 1
townsend : 1
successfully : 1
cleaned : 1

display : 1
geographically : 1
beautiful : 1
sf : 1
leaflet : 1
utilize : 1
angelica : 1
mckinley : 1
lena : 1
groeger : 1
usability : 1
alignment : 1
responsiveness : 1
thumb : 1
typography : 1
accessibility : 1
inclusiveness : 1
ambiguity : 1
poor : 1
unity : 1
clarity : 1
convinced : 1
higher : 1
ups : 1
instructor : 1
excelled : 1
handling : 1
carving : 1
guidance : 1
effectively : 1
ken : 1
armstrong : 1
conducting : 1
capitalizing : 1
tough : 1
silence : 1
sincerity : 1
briefly : 1
mysterious : 1
emotion : 1
mind : 1
heart : 1
surrounding : 1
pure : 1
hooey : 1
outstanding : 1
interviewer : 1
writer : 1
depth : 1
half : 1
supervising : 1
atlanta : 1
journal : 1
constitution : 1
deputy : 1
shawn : 1
mcintosh : 1
kimbriell : 1
kelly : 1
diary : 1
checking : 1
untangling : 1
avoiding : 1
turnoff : 1
plenty : 1
requires : 1
broadcast : 1
challenging : 1
cut : 1
buy : 1
marketing : 1
promotion : 1
accuracy : 1
measuring : 1
success : 1
please : 1
gathered : 1
talked : 1
christian : 1


script : 2
shapefiles : 2
exploratory : 2
thinking : 2
designing : 2
core : 2
leader : 2
hurt : 2
venture : 2
feeling : 2
error : 2
speak : 2
dealing : 2
coverage : 2
pulling : 2
store : 2
automate : 2
lawyer : 2
require : 2
almost : 2
app : 2
cloud : 2
readily : 2
mapbox : 2
multilingual : 2
sourcing : 2
openelections : 2
general : 2
freely : 2
introduced : 2
tackling : 2
tricky : 2
unfamiliar : 2
embedded : 2
newport : 2
beach : 2
mass : 2
reception : 2
postgis : 2
seems : 2
likely : 2
manipulate : 2
behavior : 2
spread : 2
retains : 2
selection : 2
json : 2
geared : 2
certain : 2
dice : 2
grouping : 2
jump : 2
matplotlib : 2
exporting : 2
confidence : 2
observation : 2
measure : 2
logistic : 2
regex : 2
survive : 2
trauma : 2
sharing : 2
ground : 2
reliable : 2
actually : 2
total : 2
ny : 2
running : 2
foreign : 2
member : 2
brainstorm : 2
live : 2
major : 2
starting : 2
instead : 2
socrata : 2
tiplines : 2
improvement : 2
legal : 2
growth : 2
tipsheet : 2
intended : 2
datawrapper :

medium : 15
interactive : 15
website : 15
driven : 15
challenge : 15
manager : 15
technique : 15
resource : 15
including : 15
knowledge : 15
write : 15
idea : 16
panel : 16
ready : 16
explore : 16
clean : 16
query : 16
go : 16
look : 16
library : 16
command : 16
line : 17
year : 17
text : 17
common : 17
writing : 17
tableau : 17
file : 17
start : 17
google : 17
attendee : 17
investigation : 17
training : 17
open : 18
simple : 18
provided : 18
life : 18
nicar : 18
prerequisite : 18
intermediate : 18
ire : 19
question : 19
information : 19
chart : 19
software : 19
datasets : 19
anyone : 19
javascript : 19
learning : 19
public : 20
example : 20
function : 20
join : 20
familiar : 20
programming : 20
editor : 21
discus : 21
set : 21
build : 21
investigative : 22
know : 22
record : 22
analyze : 22
free : 22
powerful : 22
online : 22
share : 22
preregistration : 22
seating : 22
web : 23
best : 23
limited : 23
team : 24
comfortable : 24
map : 24
news : 24
create : 26
reporter : 27
project : 27
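
Many words in this printout appear only once or twice, so a frequency threshold may help narrow the list before picking keywords. The sketch below uses a handful of counts copied from the output above. The threshold of 2 is an arbitrary choice for illustration, and `demo_counts` is a hypothetical stand-in for `counts`.

```python
from collections import Counter

# A few word counts copied from the printout, standing in for `counts`.
demo_counts = {"desk": 1, "buckle": 1, "script": 2, "json": 2,
               "map": 24, "project": 27}

# How many distinct words occur exactly n times.
frequency_of_counts = Counter(demo_counts.values())

# Keep only words at or above a threshold (2 is an arbitrary choice).
min_count = 2
frequent = {word: n for word, n in demo_counts.items() if n >= min_count}

print(sorted(frequency_of_counts.items()))
print(frequent)
```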