# Clean Text & Add Column of POS Tags

To obtain features from our articles dataset, we need to vectorize the text. Before we can do that, we should clean the text: remove punctuation, convert the text to lowercase, expand contractions, and lemmatize words. We will also use POS (part-of-speech) tag counts as features in our predictive modeling, so we add a column in which each word has been replaced with its POS tag. This column can then be passed through sklearn's `CountVectorizer` to obtain POS tag counts efficiently.

## Import modules

In [1]:
import pandas as pd
import re
from nltk.help import upenn_tagset
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer
# import other modules...

## Load data

In [2]:
articles = pd.read_csv('./articles_df.csv', index_col = 0)
articles.head()

Unnamed: 0,news_source,pub_date,title,text
0,The Gateway Pundit,2018-07-14,REPORT House Conservatives Prepare to Impeach ...,House GOP lawmakers are preparing to push to i...
1,oann,2018-03-24,French policeman who took place of hostage die...,PARIS (Reuters) A gendarme who was shot three...
2,New York Daily News,2018-03-24,Attorney for Roy Moore accuser was offered 10G...,"An attorney for Leigh Corfman, a woman who acc..."
3,Sputnik,2018-03-23,Martin Vizcarra is New Peruvian President Afte...,Martin Vizcarra is sworn in as Peruvian presid...
4,oann,2018-04-02,Oil falls 2 percent on Russia output rise pote...,NEW YORK (Reuters) Oil fell by more than 2 pe...


In [3]:
len(articles)

498

## Remove rows with no text

In [4]:
# get rid of any blank articles
n_blank = articles['text'].isnull().sum() # number of rows with missing text
if n_blank == 0:
    print('There are no blank articles in our dataset.')
else:
    print(f'There are {n_blank} blank articles in our dataset.')
    articles = articles[articles['text'].notnull()].reset_index(drop = True)
    print('The blank articles have now been removed.')

There are no blank articles in our dataset.


In [5]:
# check for any strings with no word characters
no_words = []
for i in range(len(articles)):
    if re.match(r'^[\W_]+$', articles['text'][i]): # raw string so \W is not treated as a string escape
        no_words.append(i)

if len(no_words) == 0:
    print('There are no rows with empty strings in our dataset.')
else:
    print(f'There are {len(no_words)} rows with empty strings in our dataset.')

There are no rows with empty strings in our dataset.
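
The pattern used in the check above can be exercised on a few literal strings. This is a minimal sketch using only `re`; the sample strings are made up for illustration. Note that `+` requires at least one character, so a truly empty string is not flagged by this pattern and would need a separate check.

```python
import re

# same pattern as in the cell above, written as a raw string
no_word_chars = re.compile(r'^[\W_]+$')

# hypothetical inputs: punctuation only, whitespace only, underscores only,
# a string containing letters, and an empty string
samples = ['...', ' \n ', '___', 'ok.', '']

flagged = [s for s in samples if no_word_chars.match(s)]
print(flagged)
```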


## Clean text

In [6]:
# view example article
articles['text'][0]

'House GOP lawmakers are preparing to push to impeach Deputy Attorney General Rod Rosenstein as soon as Monday, according to three conservative Capitol Hill sources.\n\nFreedom Caucus Chairman Mark Meadows previously drafted the impeachment documents, however nothing has been filed yet.\n\nGOP Congressmen are fed up with Rosensteins continued stonewalling of their probe of the FBIs and DOJs corruption, Spygate and Russiagate during the 2016 election.\n\nHouse conservatives are preparing a new push to oust Deputy Attorney General Rod Rosenstein, according to three conservative Capitol Hill sources  putting the finishing touches on an impeachment filing even as Rosenstein announced the indictment of 12 Russian intelligence officers for interfering in the 2016 election. House Freedom Caucus Chairman Mark Meadows, in fact, had the impeachment document on the floor of the House at the very moment that Rosenstein spoke to reporters and TV cameras Friday. Conservative sources say they could f

In [7]:
# Things we might want to remove from the text:
# - political rep. tags, e.g. (D-NY) for Democrat, New York
# - email addresses asking for tips or comments
# - internet domains
# - do we want to keep punctuation? (for instance, maybe "!" is used more often in fake news articles)
#
# Other considerations:
# - remove stopwords
# - lemmatize words

# to convert contractions into full words (before tokenization)
contractions = {
    "ain't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he had",
    "he'd've": "he would have",
    "he'll": "he will",
    "he'll've": "he will have",
    "he's": "he is",
    "how'd": "how did",
    "how'll": "how will",
    "how's": "how is",
    "I'd": "I would",
    "I'd've": "I would have",
    "I'll": "I will",
    "I'll've": "I will have",
    "I'm": "I am",
    "I've": "I have",
    "isn't": "is not",
    "it'd": "it had",
    "it'd've": "it would have",
    "it'll": "it will",
    "it'll've": "it will have",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "might've": "might have",
    "mightn't": "might not",
    "mightn't've": "might not have",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not",
    "oughtn't've": "ought not have",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "shan't've": "shall not have",
    "she'd": "she had",
    "she'd've": "she would have",
    "she'll": "she will",
    "she'll've": "she will have",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "so've": "so have",
    "so's": "so is",
    "that'd": "that would",
    "that'd've": "that would have",
    "that's": "that is",
    "there'd": "there would",
    "there'd've": "there would have",
    "there's": "there is",
    "they'd": "they would",
    "they'd've": "they would have",
    "they'll": "they will",
    "they'll've": "they will have",
    "they're": "they are",
    "they've": "they have",
    "to've": "to have",
    "wasn't": "was not",
    "we'd": "we had",
    "we'd've": "we would have",
    "we'll": "we will",
    "we'll've": "we will have",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what'll've": "what will have",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where is",
    "where've": "where have",
    "who'll": "who will",
    "who'll've": "who will have",
    "who's": "who is",
    "who've": "who have",
    "why's": "why is",
    "why've": "why have",
    "will've": "will have",
    "won't": "will not",
    "won't've": "will not have",
    "would've": "would have",
    "wouldn't": "would not",
    "wouldn't've": "would not have",
    "y'all": "you all",
    "y'all'd": "you all would",
    "y'all'd've": "you all would have",
    "y'all're": "you all are",
    "y'all've": "you all have",
    "you'd": "you had",
    "you'd've": "you would have",
    "you'll": "you will",
    "you'll've": "you will have",
    "you're": "you are",
    "you've": "you have",
    "'s": " is" # for instances such as "Caitlin's going to run this code."
}

# to map nltk POS tags to wordnet POS tags
tag_dict = {
    'J': wordnet.ADJ,
    'N': wordnet.NOUN,
    'V': wordnet.VERB,
    'R': wordnet.ADV
}

# function to convert nltk POS tags to wordnet-compatible POS tags
def convert_pos_wordnet(tag):
    tag_abbr = tag[0].upper()

    # returns None for tags that have no wordnet equivalent
    return tag_dict.get(tag_abbr)

# Set list of "valid" tags such that when normalizing text, all words tagged with PoS = coordinating conjunction,
# cardinal digit, determiner, existential there, preposition/subordinating conjunction, list marker, predeterminer,
# possessive ending, personal pronoun, possessive pronoun, to, or interjection are dropped.
valid_tags_abbr = 'FJMNRVW'
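
To see what the first-letter filter keeps, here is a small sketch that applies `valid_tags_abbr` to a handful of Penn Treebank tags. The tag list is an illustrative subset chosen for this example, not the full tagset.

```python
valid_tags_abbr = 'FJMNRVW'

# illustrative subset of Penn Treebank tags
sample_tags = ['NN', 'NNP', 'VBZ', 'JJ', 'RB', 'MD', 'FW', 'WP',
               'CC', 'CD', 'DT', 'EX', 'IN', 'PRP', 'PRP$', 'TO', 'UH']

# a tag is kept when its first letter appears in valid_tags_abbr
kept = [t for t in sample_tags if t[0] in valid_tags_abbr]
dropped = [t for t in sample_tags if t[0] not in valid_tags_abbr]

print('kept:   ', kept)
print('dropped:', dropped)
```

Because the test looks at the first letter only, `RP` (particle) passes along with the adverb tags, and `WDT` and `WRB` pass along with `WP`.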

In [8]:
# define function to clean text
def clean_str(text, lemmatize = True):
    # to drop any internet domains, email addresses, or political rep. tags
    text = re.sub(r'(https?://)?\w+@?\w+(\.\w+)+|\([DRI]-[A-Z]{2}\)', '', text)

    # iterate over entries in contractions dictionary
    for el in list(contractions.keys()):
        if el in text.lower(): # check if the ith contraction is in the string
            text = re.sub(el, contractions[el], text, flags = re.IGNORECASE) # expand contraction

    words = word_tokenize(text)
    clean_words = []
    lemmatizer = WordNetLemmatizer() # create once rather than once per word

    for word in words:
        # note: each word is tagged on its own, so the tagger sees no sentence context
        PoS_tag = pos_tag([word])[0][1]
        word = re.sub(r'[_-]', '', word)

        # drop words with fewer than 2 characters; drop any punctuation "words"; drop words not in
        # approved set of PoS tags (defined above)
        if (len(word) > 1) and (re.match(r'^\w+$', word)) and (PoS_tag[0].upper() in valid_tags_abbr):

            if lemmatize:
                if PoS_tag[0].upper() in 'JNVR':
                    word = lemmatizer.lemmatize(word, convert_pos_wordnet(PoS_tag))
                else:
                    word = lemmatizer.lemmatize(word)

            clean_words.append(word)
    clean_text = ' '.join(clean_words)
    
    return clean_text
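
The first substitution in `clean_str()` is easier to reason about when run on its own. The sketch below applies the same pattern to a few made-up sentences (the names, address, and URL are placeholders) so the effect on political tags, email addresses, and domains is visible.

```python
import re

# same pattern as in clean_str() above
drop_pattern = re.compile(r'(https?://)?\w+@?\w+(\.\w+)+|\([DRI]-[A-Z]{2}\)')

# made-up sentences for illustration
samples = [
    'Sen. Chuck Schumer (D-NY) spoke.',
    'Send tips to tips@example.com today.',
    'Read more at https://www.example.com now.',
    'Oil fell 12.5 percent.',
]

cleaned = [drop_pattern.sub('', s) for s in samples]
for s in cleaned:
    print(repr(s))
```

One thing this makes visible: the first alternative matches any run of two or more word characters followed by a period and more word characters, so a decimal such as `12.5` is removed along with domains. Whether that matters depends on how numbers are treated later; here cardinal numbers are dropped by the POS filter anyway.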

In [9]:
# apply clean_str() function to all articles
articles['clean_txt'] = articles['text'].map(clean_str)

In [10]:
# check to see if there are any rows where the clean text column is null
if len(articles[articles['clean_txt'].notnull()]) == len(articles):
    print('There are no rows where the clean text column is null.')
else:
    k = 0
    for i in range(len(articles['clean_txt'])):
        if type(articles['clean_txt'][i]) != str:
            k += 1
    print(f'There are {k} rows where the clean text column is null.')
    
    # drop articles with null clean text
    articles = articles[articles['clean_txt'].notnull()].reset_index(drop = True)
    print('These rows have now been removed from the dataframe.')

There are no rows where the clean text column is null.


## POS tag text

In [11]:
# all PoS tags
upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [12]:
# define function to replace text with POS tags
def str_to_pos(text):
    # expand contractions
    for el in list(contractions.keys()):
        if el.lower() in text.lower():
            text = re.sub(el, contractions[el], text, flags = re.IGNORECASE)

    words = word_tokenize(text)
    tags = []
    for word in words:
        tag = pos_tag([word])[0][1]
        tags.append(tag)
    pos_text = ' '.join(tags)
    
    return pos_text
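
The contraction loop used in `clean_str()` and `str_to_pos()` applies the dictionary entries one at a time in insertion order, so a shorter key such as "can't" is substituted before a longer key that contains it ("can't've"), leaving "cannot've" behind. One possible alternative, sketched here with a four-entry stand-in for the full `contractions` dictionary, is to build a single alternation sorted longest-first and expand everything in one pass. The names `contractions_demo`, `lookup`, and `expand_contractions` are hypothetical and not part of the notebook.

```python
import re

# small stand-in for the full `contractions` dictionary defined earlier
contractions_demo = {
    "can't": "cannot",
    "can't've": "cannot have",
    "I'll": "I will",
    "what's": "what is",
}

# lowercase the keys so case-insensitive matches can be looked up directly
lookup = {k.lower(): v for k, v in contractions_demo.items()}

# longest keys first so "can't've" is tried before "can't"
keys = sorted(lookup, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(k) for k in keys), flags=re.IGNORECASE)

def expand_contractions(text):
    # replace every contraction in a single pass over the text
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)

result = expand_contractions("What's wrong? They can't've known, but I'll ask.")
print(result)
```

As with the original loop, the replacement text follows the casing in the dictionary, so a sentence-initial "What's" becomes "what is".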

In [13]:
# test function
testText = 'NLP is my favorite class! What\'s yours?'
str_to_pos(testText)

'NN VBZ PRP$ NN NN . WP VBZ NNS .'

In [14]:
cv_test = CountVectorizer(lowercase = False, token_pattern = r'\w\w+\$?') # raw string; \$? keeps tags such as PRP$
cv_test.fit_transform([str_to_pos(testText)]).toarray()

array([[3, 1, 1, 2, 1]])

In [15]:
cv_test.vocabulary_
# NOTE: The punctuation POS tag has been dropped, but this is okay because we will count punctuation use
# separately.

{'NN': 0, 'VBZ': 3, 'PRP$': 2, 'WP': 4, 'NNS': 1}
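
`CountVectorizer` tokenizes by finding every match of `token_pattern`, so the effect of the custom pattern can be reproduced with `re.findall`. The sketch below compares it with sklearn's default pattern, `(?u)\b\w\w+\b`, on the tag string from the test above. It uses only the standard library and does not call sklearn itself.

```python
import re
from collections import Counter

# tag string produced by str_to_pos(testText) above
pos_text = 'NN VBZ PRP$ NN NN . WP VBZ NNS .'

# custom pattern from the notebook vs. sklearn's default token pattern
custom_tokens = re.findall(r'\w\w+\$?', pos_text)
default_tokens = re.findall(r'(?u)\b\w\w+\b', pos_text)

custom_counts = Counter(custom_tokens)
print(sorted(custom_counts.items()))
print('PRP$' in custom_tokens, 'PRP$' in default_tokens)
```

With the default pattern the trailing `$` is lost, so possessive pronouns (`PRP$`) would be counted together with personal pronouns (`PRP`). The optional `\$?` keeps them separate.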

In [16]:
# create new column in dataframe with each cell = text replaced with POS tags
articles['POS_tags'] = articles['text'].map(str_to_pos)

## Save updated dataframe

In [17]:
# save over existing articles csv file
articles.to_csv('articles_df.csv')