# Students Do: Crude Stopwords
For this activity, create a function that takes an article and returns a list of words with stopwords and non-letter characters removed. After reviewing the results, define your own stopwords to add to the NLTK default set.

In [1]:
# Imports
from nltk.corpus import reuters, stopwords
from nltk.tokenize import word_tokenize
import re

# Code to download corpora
import nltk
nltk.download('reuters')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package reuters to /Users/lharris/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /Users/lharris/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lharris/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# Store an article to work with
crude_article = reuters.raw(fileids=reuters.fileids(categories='crude')[2])

In [3]:
# Print the stored article
print(crude_article)

TURKEY CALLS FOR DIALOGUE TO SOLVE DISPUTE
  Turkey said today its disputes with
  Greece, including rights on the continental shelf in the Aegean
  Sea, should be solved through negotiations.
      A Foreign Ministry statement said the latest crisis between
  the two NATO members stemmed from the continental shelf dispute
  and an agreement on this issue would effect the security,
  economy and other rights of both countries.
      "As the issue is basicly political, a solution can only be
  found by bilateral negotiations," the statement said. Greece has
  repeatedly said the issue was legal and could be solved at the
  International Court of Justice.
      The two countries approached armed confrontation last month
  after Greece announced it planned oil exploration work in the
  Aegean and Turkey said it would also search for oil.
      A face-off was averted when Turkey confined its research to
  territorrial waters. "The latest crises created an historic
  opportunity to solve th

In [4]:
# Complete the `clean_text` function
def clean_text(article):
    
    # Define a set of stopwords using `stopwords.words()`
    sw = set(stopwords.words('english'))

    # Compile a regex that matches any character other than a letter or a space
    regex = re.compile("[^a-zA-Z ]")

    # Remove every matching character from the article
    re_clean = regex.sub('', article)

    # Apply `word_tokenize` to the regex-scrubbed text
    words = word_tokenize(re_clean)

    # Create list of lower-case words that are not in the stopword set
    output = [word.lower() for word in words if word.lower() not in sw]
    
    # Return the final list
    return output
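
To see what the regex step does before tokenization, here is a minimal sketch using only the standard library. The sample sentence is adapted from the article, and the space-substitution variant is an assumption shown for comparison, not part of the activity's solution.

```python
import re

# Same pattern as in `clean_text`: anything that is not a letter or a space
regex = re.compile("[^a-zA-Z ]")

# Sample sentence adapted from the article (illustrative only)
sample = "A face-off was averted; Turkey's research was confined."

# Deleting the matches fuses the pieces around hyphens and apostrophes
deleted = regex.sub('', sample)
print(deleted)

# Hypothetical alternative: replace matches with a space so the pieces stay separate
spaced = regex.sub(' ', sample)
print(spaced.split())
```

Deleting the matches yields `faceoff` and `Turkeys`, which is why `'faceoff'` and `'turkeys'` show up in the printed results. Substituting a space keeps the pieces apart but leaves fragments such as `s` that may need further filtering.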

In [5]:
# Pass article into `clean_text` and store the result
result = clean_text(crude_article)

In [6]:
# Print out unique words
print(set(result))

{'work', 'would', 'crisis', 'political', 'statement', 'effect', 'papandreou', 'averted', 'issue', 'ambassador', 'prime', 'created', 'court', 'dispute', 'basicly', 'two', 'bilateral', 'countries', 'month', 'repeatedly', 'waters', 'reply', 'foreign', 'could', 'meet', 'week', 'oil', 'disputes', 'armed', 'legal', 'negotiations', 'stemmed', 'also', 'ozal', 'sea', 'shelf', 'justice', 'research', 'nazmi', 'announced', 'opportunity', 'disclosed', 'agreement', 'akiman', 'athens', 'solution', 'message', 'confrontation', 'exploration', 'turkey', 'continental', 'dialogue', 'search', 'confined', 'greece', 'economy', 'approached', 'said', 'members', 'rights', 'found', 'solve', 'andreas', 'sent', 'international', 'including', 'security', 'turkeys', 'greek', 'faceoff', 'aegean', 'planned', 'due', 'today', 'minister', 'nato', 'calls', 'turkish', 'contents', 'last', 'latest', 'historic', 'crises', 'turgut', 'ministry', 'territorrial', 'solved'}
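
One way to pick candidates for a custom stopword list is to rank the cleaned tokens by frequency, since frequent, low-information words (such as reporting verbs) tend to rise to the top. The sketch below uses `collections.Counter` on a small hand-written token list that stands in for `result`; the tokens and their counts are illustrative, not the actual article counts.

```python
from collections import Counter

# Hypothetical sample standing in for the `result` list returned by `clean_text`
tokens = ['turkey', 'said', 'greece', 'said', 'issue', 'oil',
          'said', 'turkey', 'issue', 'greece', 'turkey', 'said']

# Count occurrences and list the two most frequent tokens first
counts = Counter(tokens)
top_words = counts.most_common(2)
print(top_words)
```

With the real data, `Counter(result).most_common(10)` would give a similar ranking to scan by eye.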


In [7]:
# Second iteration, with custom stopwords
def clean_text(article):
    
    # Define a set of stopwords using `stopwords.words()`
    sw = set(stopwords.words('english'))
    
    # Create custom stopwords
    # Note: 'basically' never matches here because the article spells it 'basicly'
    sw_addons = {'said', 'sent', 'found', 'including', 'today', 'announced', 'week', 'basically', 'also'}

    # Compile a regex that matches any character other than a letter or a space
    regex = re.compile("[^a-zA-Z ]")

    # Remove every matching character from the article
    re_clean = regex.sub('', article)

    # Apply `word_tokenize` to the regex-scrubbed text
    words = word_tokenize(re_clean)

    # Combine both stopword sets once, then keep lower-case words that are not in the combined set
    all_sw = sw.union(sw_addons)
    output = [word.lower() for word in words if word.lower() not in all_sw]
    
    # Return the final list
    return output
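
Redefining `clean_text` for every new stopword list gets repetitive. A possible refactoring, sketched below, passes the extra stopwords in as an argument. To keep the sketch runnable without downloading NLTK data, it uses a tiny hand-written base set and `str.split()` as stand-ins for `stopwords.words('english')` and `word_tokenize`; the names `BASE_STOPWORDS`, `NON_LETTERS`, `clean_text_with` and `extra_stopwords` are hypothetical.

```python
import re

# Stand-in for set(stopwords.words('english')); deliberately tiny
BASE_STOPWORDS = {'the', 'a', 'it', 'its', 'with', 'and', 'for'}

# Compiled once at module level: matches anything that is not a letter or a space
NON_LETTERS = re.compile("[^a-zA-Z ]")


def clean_text_with(article, extra_stopwords=frozenset()):
    """Return lower-case words from `article` that are not stopwords."""
    # Merge the base and custom stopwords once, before filtering
    all_sw = BASE_STOPWORDS | set(extra_stopwords)
    # Strip non-letter characters, then split on whitespace (stand-in for word_tokenize)
    words = NON_LETTERS.sub('', article).split()
    return [w.lower() for w in words if w.lower() not in all_sw]


sentence = "Turkey said it would also search for oil."
first_pass = clean_text_with(sentence)
second_pass = clean_text_with(sentence, extra_stopwords={'said', 'also'})
print(first_pass)
print(second_pass)
```

Keeping the custom words outside the function makes it easy to try several lists against the same article without editing the function body.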

In [8]:
# Pass article into `clean_text` and examine new results
result2 = clean_text(crude_article)
print(set(result2))

{'work', 'would', 'crisis', 'political', 'statement', 'effect', 'papandreou', 'averted', 'issue', 'ambassador', 'prime', 'created', 'court', 'dispute', 'basicly', 'two', 'bilateral', 'countries', 'month', 'repeatedly', 'waters', 'reply', 'foreign', 'could', 'meet', 'oil', 'disputes', 'armed', 'legal', 'negotiations', 'stemmed', 'ozal', 'sea', 'shelf', 'justice', 'research', 'nazmi', 'opportunity', 'disclosed', 'agreement', 'akiman', 'athens', 'solution', 'message', 'confrontation', 'exploration', 'turkey', 'continental', 'dialogue', 'search', 'confined', 'greece', 'economy', 'approached', 'members', 'rights', 'solve', 'andreas', 'international', 'security', 'turkeys', 'greek', 'faceoff', 'aegean', 'planned', 'due', 'minister', 'nato', 'calls', 'turkish', 'contents', 'last', 'latest', 'historic', 'crises', 'turgut', 'ministry', 'territorrial', 'solved'}
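
A quick way to confirm which words the custom stopwords removed is a set difference between the two results. The sketch uses small subsets of the two printed outputs rather than the full `result` and `result2` lists, so it runs on its own; the variable names `before`, `after` and `removed` are illustrative.

```python
# Subsets of the words printed for the first and second iterations
before = {'said', 'sent', 'found', 'basicly', 'oil', 'turkey', 'greece'}
after = {'basicly', 'oil', 'turkey', 'greece'}

# Words present before the custom stopwords were applied but absent afterwards
removed = before - after
print(sorted(removed))
```

With the real data, the equivalent check would be `set(result) - set(result2)`.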
