## Following this tutorial: [Automated Keyword Extraction from Articles using NLP](https://medium.com/analytics-vidhya/automated-keyword-extraction-from-articles-using-nlp-bfd864f41b34)

#### next steps:
* manual keywords
* determining what keywords are: look at nouns, get rid of stop words, see what pops out!

In [65]:
# set up libraries we'll need
import pandas as pd
import re # the standard-library module covers every pattern used below
import string

In [4]:
# import and preview dataset
# dataset from https://www.ire.org/events-and-training/conferences/nicar-2019
dataset = pd.read_csv('data/car19guide.csv')
dataset.head()

Unnamed: 0,event_id,name,clean_description,location_room,start_date_clean,start_time,end_time,pre_reg_flag,paid_flag,laptop_flag,speakers_cleaned,session_type,keywords,skill_level,session_title
0,4178,(Generally) painless collaboration with the gr...,Traditional reporters and editors often view t...,Salon A&B,2019-03-09,2019-03-09 15:30:00,2019-03-09 16:30:00,False,False,False,"Ryann Grochowski Jones, ProPublica (moderator)...",Panel,,General interest,(Generally) painless collaboration with the gr...
1,4162,25th CAR: What a ride it's been!,Buckle up for a fast-paced ride through 25 yea...,Salon C&D,2019-03-08,2019-03-08 15:30:00,2019-03-08 16:30:00,False,False,False,"Doug Haddix, IRE/NICAR; Shawn McIntosh, Atlant...",Panel,,General interest,25th CAR: What a ride it's been!
2,4189,50 databases to request right now,Get your FOI templates ready to roll. In this ...,Salon D,2019-03-09,2019-03-09 14:15:00,2019-03-09 15:15:00,False,False,False,"Mark Walker, The New York Times; Kate Martin, ...",Panel,,General interest,50 databases to request right now
3,4198,A conversation with James B. Steele: Insights ...,This special session features the wit and wisd...,Salon A&B,2019-03-09,2019-03-09 10:15:00,2019-03-09 11:15:00,False,False,False,"Sarah Cohen, ASU Cronkite School of Journalism...",Panel,,,A conversation with James B. Steele: Insights ...
4,4301,Adding a text editor to your CAR toolkit,A good text editor is an essential tool for da...,Salon A&B,2019-03-10,2019-03-10 10:15:00,2019-03-10 11:15:00,False,False,False,"Agustin Armendariz, The New York Times",Demo,,Intermediate,Adding a text editor to your CAR toolkit


In [16]:
# create new dataset with only the fields we want
subset = dataset.loc[:, ['name','clean_description']].copy() # 'session_title' seems to be the same as 'name'
# add a column with the year
subset['conference_year'] = '2019'
subset.head()

Unnamed: 0,name,clean_description,conference_year
0,(Generally) painless collaboration with the gr...,Traditional reporters and editors often view t...,2019
1,25th CAR: What a ride it's been!,Buckle up for a fast-paced ride through 25 yea...,2019
2,50 databases to request right now,Get your FOI templates ready to roll. In this ...,2019
3,A conversation with James B. Steele: Insights ...,This special session features the wit and wisd...,2019
4,Adding a text editor to your CAR toolkit,A good text editor is an essential tool for da...,2019


In [122]:
# a manual review of the schedule turned up rows that are not classes or are repeat sessions
# their names contain strings we can filter out by joining the terms with a pipe (regex OR)
# thanks to https://stackoverflow.com/questions/11350770/select-by-partial-string-from-a-pandas-dataframe

filter_out = ['registration', 'sales', 'repeat']
filtered_subset = subset[~subset['name'].str.contains('|'.join(filter_out))]
# reset indices
filtered_subset = filtered_subset.reset_index(drop=True)
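
Note on the filter above: `'|'.join(filter_out)` builds a regular-expression alternation, which `str.contains` treats as a pattern by default. Below is a minimal sketch using only the standard `re` module and made-up session names (hypothetical, not taken from the dataset). It suggests two things to watch for: the match is case-sensitive unless told otherwise, and terms with regex metacharacters are safer when escaped first.

```python
import re

filter_out = ['registration', 'sales', 'repeat']

# joining with a pipe builds the alternation 'registration|sales|repeat'
pattern = '|'.join(re.escape(term) for term in filter_out)

# hypothetical session names, for illustration only
names = [
    'Registration',
    'Conference sales',
    'Intro to SQL (repeat)',
    'Intro to SQL',
]

# case-sensitive search: 'Registration' slips through because of the capital R
kept_sensitive = [n for n in names if not re.search(pattern, n)]

# case-insensitive search also drops 'Registration'
kept_insensitive = [n for n in names
                    if not re.search(pattern, n, flags=re.IGNORECASE)]

print(kept_sensitive)
print(kept_insensitive)
```

In pandas, the matching options would be `case=False` (ignore case) and `na=False` (treat missing names as non-matches) on `str.contains`. Whether they are needed depends on how the names are capitalized in the real data.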

In [125]:
# preliminary text exploration
# fetch word count for each description
pd.options.mode.chained_assignment = None # silence pandas' SettingWithCopyWarning

filtered_subset['word_count'] = filtered_subset['clean_description'].apply(lambda x: len(str(x).split(" ")))
filtered_subset[['clean_description','word_count']].head()

Unnamed: 0,clean_description,word_count
0,Traditional reporters and editors often view t...,68
1,Buckle up for a fast-paced ride through 25 yea...,99
2,Get your FOI templates ready to roll. In this ...,56
3,This special session features the wit and wisd...,194
4,A good text editor is an essential tool for da...,49
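
Note on the word counts: `split(" ")` splits on single spaces only, so double spaces and trailing spaces add empty "words", and a missing description becomes the string `'nan'` and counts as one word. Here is a small sketch of a more forgiving counter. The helper name `word_count` and the sample string are hypothetical.

```python
def word_count(text):
    """Count whitespace-separated tokens; a missing value counts as zero words."""
    if not isinstance(text, str):
        return 0
    # split() with no argument collapses runs of spaces, tabs and newlines
    return len(text.split())

# hypothetical description with a double space, a newline and a trailing space
sample = "Get your FOI  templates ready.\nBring a laptop. "

naive = len(str(sample).split(" "))   # approach used in the cell above
robust = word_count(sample)

print(naive, robust)
print(len(str(float('nan')).split(" ")), word_count(float('nan')))
```

It could be applied with `filtered_subset['clean_description'].apply(word_count)`. The differences only matter if the descriptions actually contain irregular whitespace or missing values.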


In [53]:
# descriptive statistics of word counts
filtered_subset.word_count.describe()

count    225.000000
mean      81.973333
std       37.831383
min        3.000000
25%       58.000000
50%       75.000000
75%       98.000000
max      210.000000
Name: word_count, dtype: float64

In [126]:
# copy descriptions to new column for pre-processing
filtered_subset['preproc_desc'] = filtered_subset['clean_description']

# # make every word in descriptions lowercase
# filtered_subset['preproc_desc'] = filtered_subset['preproc_desc'].apply(lambda x: x.lower())
# # remove punctuation before looking for common/uncommon words because adjacent punctuation changes words
# filtered_subset['preproc_desc'] = filtered_subset['preproc_desc'].apply(lambda x: x.translate(str.maketrans('','',string.punctuation)))
# filtered_subset['preproc_desc'].head()

In [77]:
# identify common words
# could be used for custom stop word list
freq = pd.Series(' '.join(filtered_subset['preproc_desc']).split()).value_counts()[:50]
freq
# only domain-specific words that we might want to keep out of stoplist are 'data', 'learn', 'stories'

and        774
to         662
the        548
a          397
of         337
data       336
for        313
you        298
this       279
in         279
is         243
how        224
with       218
will       207
your       184
session    174
that       140
on         120
be         117
or         112
can        110
are        100
good        91
well        86
who         81
it          77
from        75
as          74
have        74
some        74
what        73
learn       73
use         67
an          67
using       66
stories     61
we          60
but         60
into        59
about       57
at          57
class       55
people      55
we’ll       52
tools       49
more        49
if          47
get         46
their       44
—           44
dtype: int64

In [79]:
# identify uncommon words
# inform cleaning needed?
unfreq = pd.Series(' '.join(filtered_subset['preproc_desc']).split()).value_counts()[-20:]
unfreq

internship             1
closer                 1
marketingpromotions    1
digitalfirst           1
lesserknown            1
mckinley               1
nittygritty            1
designers              1
platform               1
markets                1
defraud                1
publishable            1
crosstabulations       1
secondary              1
recruiting             1
character              1
analytics              1
parental               1
stops                  1
hours                  1
dtype: int64
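
Tokens such as `marketingpromotions`, `digitalfirst` and `lesserknown` in the list above look like what happens when punctuation is deleted outright, as in the commented-out `translate` call. That is an assumption: the original phrases are not shown, so the sentence below is made up. This sketch compares deleting punctuation with replacing it by a space.

```python
import string

# hypothetical sentence; the real descriptions may be worded differently
text = "A digital-first, lesser-known tool for marketing/promotions."

# deleting punctuation fuses the words on either side of a hyphen or slash
delete_table = str.maketrans('', '', string.punctuation)
deleted = text.lower().translate(delete_table)

# mapping each punctuation character to a space keeps the words apart
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
spaced = ' '.join(text.lower().translate(space_table).split())

print(deleted)
print(spaced)
```

`string.punctuation` only covers ASCII characters, which would explain why curly apostrophes (as in `we’ll`) and long dashes survive in the frequency list.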

In [80]:
# libraries for text pre-processing
import nltk

# download these the first time you run this
#nltk.download('stopwords')
#nltk.download('wordnet')

from nltk.corpus import stopwords
# stemming normalizes text by removing suffixes
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
# lemmatisation maps each word to its dictionary root (lemma)
from nltk.stem.wordnet import WordNetLemmatizer

In [85]:
# create a set of stop words (plus custom stop words if we want)
stop_words = set(stopwords.words("english"))
# custom stop words to add
new_words = []
stop_words = stop_words.union(new_words)
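
One possible way to fill `new_words` is to take the frequent words from the earlier frequency output and leave out the domain words we want to keep ('data', 'learn', 'stories'). This is a sketch only. The counts are a handful copied from that output, and `base_stop_words` is a tiny stand-in for NLTK's English list so the example runs without any downloads.

```python
# a few counts copied from the frequency output earlier in the notebook
top_counts = {'and': 774, 'to': 662, 'the': 548, 'data': 336,
              'session': 174, 'learn': 73, 'stories': 61, 'class': 55}

# domain words to keep as keyword candidates
keep = {'data', 'learn', 'stories'}

# stand-in for set(stopwords.words("english")); an assumption for this sketch
base_stop_words = {'and', 'to', 'the'}

# frequent words that are neither protected nor already stop words
new_words = [word for word in top_counts
             if word not in keep and word not in base_stop_words]

stop_words = base_stop_words.union(new_words)

print(new_words)
print(sorted(stop_words))
```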

In [129]:
# with the stopwords, clean and normalize the corpus
# the stemmer is created but unused; only lemmatisation is applied below
ps = PorterStemmer()
lem = WordNetLemmatizer()

corpus = []
for desc in filtered_subset['preproc_desc'].fillna(''): # iterate over rows, don't hard code their number!
    # replace punctuation, digits and every other non-letter with a space
    text = re.sub('[^a-zA-Z]', ' ', desc)

    # convert to lowercase
    text = text.lower()

    # remove tags (no effect here: '<' and '>' were already replaced above)
    text = re.sub("</?.*?>", " <> ", text)

    # remove special characters and digits
    text = re.sub("(\\d|\\W)+", " ", text)

    # convert to list from string
    text = text.split()

    # lemmatise every word that is not a stop word
    text = [lem.lemmatize(word) for word in text if word not in stop_words]
    text = " ".join(text)
    corpus.append(text)
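
The cleaning steps in the loop could also be wrapped in a function, which makes them easier to try on a single string. A rough sketch follows, with assumptions: `clean_text` is a hypothetical helper, the stop-word set is a tiny stand-in, the sample sentence is made up, and the lemmatizer is passed in as an optional callable so the example runs without NLTK data.

```python
import re

def clean_text(text, stop_words, lemmatize=None):
    """Keep letters only, lowercase, drop stop words, optionally lemmatize."""
    if not isinstance(text, str):
        return ''
    # replace every non-letter (punctuation, digits) with a space
    text = re.sub('[^a-zA-Z]', ' ', text).lower()
    words = [word for word in text.split() if word not in stop_words]
    if lemmatize is not None:
        words = [lemmatize(word) for word in words]
    return ' '.join(words)

# small stand-in stop list and a made-up description
stop_words = {'to', 'with', 's'}
sample = "Learn to clean 3 messy datasets with OpenRefine's clustering features!"

cleaned = clean_text(sample, stop_words)
print(cleaned)
```

With NLTK set up, passing `lemmatize=WordNetLemmatizer().lemmatize` should give the same kind of result as the loop, e.g. `corpus = [clean_text(d, stop_words, lem.lemmatize) for d in filtered_subset['preproc_desc']]`.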

In [136]:
# view an example corpus item
print(corpus[1])
print()
print(corpus[5])

buckle fast paced ride year data journalism told people drove car mainstream investigative reporting hear pivotal moment bizarre twist befuddled bureaucrat know hit featuring special guest expected guest speaker include crina boros center investigative journalism sarah cohen asu walter cronkite school journalism steve doig asu walter cronkite school journalism jaimi dowdell reuters mark horvit university missouri brant houston university illinois clarence jones independent journalist jennifer lafleur investigative reporting workshop james b steele independent journalist

much openrefine clustering faceting feature session deep dive grel openrefine expression language equivalent excel formula thorough introduction grel syntax review common function explore clean dataset function covered session include replace split concatenate string comparison cell cross join multiple project together foreach session good people familiar openrefine least excel experience introduction openrefine check 

In [141]:
sub = ' r '
print('\n'.join(s for s in corpus if sub in s))

skill level intermediate learn use tidyverse collection r package help make data journalism efficient stronger fun learn import clean analyze plot data story used package like dplyr tidyr readr ggplot tibble purr would like learn work together class preregistration required seating limited laptop provided training workshop prerequisite comfortable working r rstudio also familiar basic data analysis
researcher stanford university collected examined record million local police stop city using programming language r learn analyze local policing data find pattern story session good people worked data r want learn analyze police data
journalist pretty much coding experience love excel working spreadsheet attended ire nicar year feel ready make leap might little intimidated code trouble finding time learn new skill let talk challenge learning programming language whether python r sql overcome two self taught coder share journey lesson learned along way
skill level intermediate charles minshe
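
One caveat about searching for `' r '`: the substring needs a space on both sides, so an item where `r` is the very first or very last token is missed. A sketch with a word-boundary regex is below. The `docs` strings are short made-up fragments modeled on the output above, not real corpus items.

```python
import re

# made-up fragments, for illustration only
docs = [
    'r package help make data journalism efficient',   # 'r' is the first token
    'comfortable working r rstudio',                    # 'r' in the middle
    'learn programming language whether python r',      # 'r' is the last token
    'researcher collected record',                      # no standalone 'r'
]

sub = ' r '
substring_hits = [d for d in docs if sub in d]

# \b matches at word boundaries, including the start and end of the string
token_hits = [d for d in docs if re.search(r'\br\b', d)]

print(len(substring_hits), len(token_hits))
```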

In [None]:
# get word counts for every single word (no slice, so nothing is cut off)
every_count = pd.Series(' '.join(filtered_subset['preproc_desc']).split()).value_counts()
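
The cell above counts words in the raw `preproc_desc` text. If the goal is counts over the cleaned text, `collections.Counter` over `corpus` is one option. This is a sketch, and the two `corpus` items here are shortened stand-ins rather than the real list.

```python
from collections import Counter

# shortened stand-ins for cleaned corpus items
corpus = [
    'learn analyze local policing data find pattern story',
    'learn import clean analyze plot data story',
]

# count every word across all cleaned descriptions
every_count = Counter(word for doc in corpus for word in doc.split())

print(every_count['data'], every_count['plot'])
print(len(every_count))   # number of distinct words
```

`every_count.most_common(50)` would then give a top-50 list comparable to the earlier frequency table.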