# Preparing text for Natural Language Processing
Text wrangling is 80% of the battle when you try to topic model, extract phrases, or otherwise throw computer code at words. Today we'll try some standard techniques on scikit-learn's 20newsgroups data and see how they turn out.

In [1]:
import pandas as pd
import re

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sklearn.datasets import fetch_20newsgroups

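# One-time setup if the NLTK data is missing (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt'):
# import nltk; nltk.download('stopwords'); nltk.download('wordnet'); nltk.download('punkt')
stop_words = stopwords.words("english") # load the English stop words (the "words to ignore" list)
# Load the training posts into a one-column DataFrame; the column is named 0
df = pd.DataFrame(pd.Series(fetch_20newsgroups(subset='train').data))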

### Example entry from 20newsgroups

In [2]:
'''
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----
'''
# Print how Python sees the string; it may help you decide where to split the data
s = df[0][0]
s

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

### How do you separate signal from noise?

The first thing you should do is decide what is important to your task, and what isn't. After that, you can see if there are underlying structures you can use to eliminate swaths of your text at once. 
<br>
In our case, we only care about the comments: the body of each post, which follows the header block after the first blank line.

In [3]:
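def extractComments(x):
    ''' INPUT: a string
        OUTPUT: the right side of the string after splitting it
            on the first double line break, or the whole string
            if it contains no double line break
    '''
    parts = x.split('\n\n',1)
    # Guard against posts with no blank line, where parts has only one element
    return parts[1] if len(parts) > 1 else x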

df['comments'] = df[0].apply(extractComments).astype(str)
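
If the header fields turn out to matter later (the subject line, for instance), the same blank-line structure can be used to keep them instead of throwing them away. The following is only a sketch: `split_post` and the `sample` string are made up for illustration, and it ignores multi-line (folded) headers, which real newsgroup posts can contain.

```python
def split_post(raw):
    """Split a raw post into (headers, body) on the first blank line."""
    head, sep, body = raw.partition("\n\n")
    if not sep:
        # No blank line found: treat the whole string as the body.
        return {}, raw
    headers = {}
    for line in head.splitlines():
        key, colon, value = line.partition(":")
        if colon:  # skip lines that are not "Key: value" pairs
            headers[key.strip()] = value.strip()
    return headers, body


# Hypothetical post, shortened from the example entry
sample = (
    "From: someone@example.com\n"
    "Subject: WHAT car is this!?\n"
    "Lines: 15\n"
    "\n"
    " I was wondering about this car.\n\nThanks,\n- IL\n"
)
headers, body = split_post(sample)
print(headers["Subject"])
print(body)
```

`str.partition` always returns three parts, so a post without a blank line falls through to the "body only" branch rather than raising an error.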

Now that we've broken out the comments from everything else, we can strip out email addresses, links, and non-words using regular expressions.

In [4]:
def scrubString(x):
    ''' INPUT: a string
        OUTPUT: a string that has had email addresses and links removed, then non-letters,
            then English stop words, with the remaining words lemmatized
            This will produce a blank string if it only consisted of links, numbers, etc
    '''
    lemmatizer = WordNetLemmatizer()
    
    # Raw strings (r"...") keep Python from treating \S and \s as string escapes
    x = re.sub(r"\S*@\S*\s?","",x) #Remove email addresses
    # Remove hashtags, HTML entities, mentions, hyperlinks, standalone "RT", and any other non-alphanumeric characters
    x = re.sub(r"#\S+|&\S+|@\S+|https?:\S+|\bRT\b|[^A-Za-z0-9]+",' ', x)
    x = re.sub(r"&\S*|@\S+|https?:\S+",' ', x) #Safety net; the line above already removes these characters
    x = re.sub(r"[^A-Za-z']+",' ',x) #Keep only letters (apostrophes were already removed above)

    if len(x)==0:
        return ''
    
    tokens = word_tokenize(x) # Convert the string into tokens
    
    # Lemmatize the words, and only keep non-stop words (the stop word list is lowercase)
    tokens = [lemmatizer.lemmatize(word).strip() for word in tokens if word.lower() not in stop_words]
    
    if len(tokens)==0:
        return ''
    
    return ' '.join(map(str,tokens))

df['cleaned'] = df['comments'].apply(scrubString)
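
To see what the substitutions do, it can help to run them one at a time on a short string. This sketch uses a simplified subset of the patterns from `scrubString` and a made-up `sample` sentence; it needs only the standard library, so it skips tokenizing, stop word removal, and lemmatizing.

```python
import re

# Hypothetical input, not taken from the dataset
sample = "Mail bob@example.com or see http://example.com/cars for the 2-door Bricklin!"

# Step 1: drop anything that looks like an email address (plus one trailing space)
no_email = re.sub(r"\S*@\S*\s?", "", sample)

# Step 2: replace hyperlinks with a space
no_links = re.sub(r"https?:\S+", " ", no_email)

# Step 3: collapse every run of non-letters into a single space
letters_only = re.sub(r"[^A-Za-z']+", " ", no_links).strip()

print(no_email)
print(no_links)
print(letters_only)
```

The order matters: once punctuation is stripped, an address or URL is just a handful of ordinary-looking words and can no longer be recognized.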

The text in the "cleaned" column is now ready for preliminary analysis.
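
One simple form of preliminary analysis is a word frequency count. The sketch below assumes the cleaned text is a collection of space-separated strings; `cleaned_docs` is a hypothetical stand-in for the values of `df['cleaned']`, not real output from the dataset.

```python
from collections import Counter

# Hypothetical cleaned documents, standing in for df['cleaned']
cleaned_docs = [
    "wondering anyone could enlighten car saw",
    "car called Bricklin door really small",
    "front bumper separate rest body car",
]

# Count every word across all documents, ignoring case
counts = Counter(word.lower() for doc in cleaned_docs for word in doc.split())

# Show the most frequent words first
print(counts.most_common(3))
```

If the top of the list is dominated by words that carry no meaning for your task, that is a hint to extend the stop word list and clean again.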

In [6]:
# If you want to save the cleaned text for use in another project, uncomment & run the following code
# df['cleaned'].to_csv('cleaned_20newsgroups.csv',index=False)