In [2]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

# Exercises
The end result of this exercise should be a file named `prepare.py` that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.



1. Define a function named `basic_clean`. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.


In [3]:
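def basic_clean(string):
    '''
    Take in a string and apply basic text cleaning:
    - lowercase everything
    - normalize unicode characters
    - remove anything that is not a letter, number, whitespace, or a single quote
    '''
    # Lowercase first so the regex below only needs to allow a-z
    article = string.lower()
    # Decompose accented characters (NFKD), then drop every non-ASCII
    # character, which includes the separated combining marks
    article = unicodedata.normalize('NFKD', article)\
        .encode('ascii', 'ignore')\
        .decode('utf-8')
    # Remove anything that is not a letter, digit, single quote, or whitespace
    article = re.sub(r"[^a-z0-9'\s]", '', article)
    return article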
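
The unicode normalization step is the least obvious part of `basic_clean`, so here is a minimal sketch of it in isolation. The helper name `strip_accents` is hypothetical and only used for illustration; it relies on nothing beyond the standard library's `unicodedata` module.

```python
import unicodedata

def strip_accents(text):
    """Decompose accented characters, then drop anything outside ASCII."""
    # NFKD splits a character such as 'ő' into a base letter plus a combining mark
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with errors='ignore' silently discards the combining marks
    return decomposed.encode('ascii', 'ignore').decode('utf-8')

# 'ő' becomes 'o' (U+006F) followed by U+030B (combining double acute accent)
print([hex(ord(ch)) for ch in unicodedata.normalize('NFKD', 'ő')])  # ['0x6f', '0x30b']
print(strip_accents('Erdős Pólya Erdös'))  # Erdos Polya Erdos
```

Note that the `'ignore'` error handler drops every non-ASCII character, not only accents, so text in non-Latin scripts would be removed entirely by this approach.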

In [4]:
original = "Paul Erdős and George Pólya are influential Hungarian mathematicians who contributed \
a lot to the field. Erdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), \
but is often incorrectly written as Erdos or Erdös either by mistake or out of typographical necessity"
original

"Paul Erdős and George Pólya are influential Hungarian mathematicians who contributed a lot to the field. Erdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), but is often incorrectly written as Erdos or Erdös either by mistake or out of typographical necessity"

In [5]:
clean_test = basic_clean(original)
clean_test

"paul erdos and george polya are influential hungarian mathematicians who contributed a lot to the field erdos's name contains the hungarian letter 'o' 'o' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity"

2. Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.

In [6]:
def tokenize(string):
    '''
    Take in a string and tokenize all the words in it.
    Returns a single string with the tokens separated by spaces.
    '''
    # Create the tokenizer
    tokenizer = ToktokTokenizer()

    # Tokenize; return_str=True returns a string instead of a list of tokens
    article = tokenizer.tokenize(string, return_str=True)

    return article

In [7]:
tokenize_test = tokenize(clean_test)
tokenize_test

"paul erdos and george polya are influential hungarian mathematicians who contributed a lot to the field erdos ' s name contains the hungarian letter ' o ' ' o ' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity"

3. Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.
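
One possible sketch of `stem`, assuming NLTK's `PorterStemmer` is an acceptable stemmer (the exercise does not name one) and that the incoming text has already been cleaned and tokenized, so splitting on whitespace is enough:

```python
from nltk.stem import PorterStemmer

def stem(text):
    """Return the text with every whitespace-separated word replaced by its stem."""
    # Assumption: the Porter stemmer; other NLTK stemmers expose the same .stem() method
    ps = PorterStemmer()
    stems = [ps.stem(word) for word in text.split()]
    # Join back into one string so the output has the same shape as the input
    return ' '.join(stems)

print(stem("calls called calling"))  # call call call
```

Stems are not always dictionary words; stemming only strips suffixes by rule.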

4. Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.
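
A minimal sketch of `lemmatize`, assuming NLTK's `WordNetLemmatizer`, which needs the WordNet corpus to be downloaded once with `nltk.download('wordnet')`. The optional `lemmatizer` parameter is not part of the exercise; it is a hypothetical addition that lets any object with a `.lemmatize(word)` method be swapped in.

```python
from nltk.stem import WordNetLemmatizer

def lemmatize(text, lemmatizer=None):
    """Return the text with every whitespace-separated word replaced by its lemma."""
    if lemmatizer is None:
        # Default: WordNet-based lemmatizer (requires nltk.download('wordnet'))
        lemmatizer = WordNetLemmatizer()
    # Hypothetical extension point: any object with a .lemmatize(word) method works
    lemmas = [lemmatizer.lemmatize(word) for word in text.split()]
    return ' '.join(lemmas)
```

Unlike stemming, lemmatization maps each word to a dictionary form, so the output stays readable at the cost of a slower lookup.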

5. Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, `extra_words` and `exclude_words`. These parameters should define any additional stop words to include, and any words that we don't want to remove.
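
A sketch of `remove_stopwords` under a few assumptions: the base list is NLTK's English stopword list (requires `nltk.download('stopwords')`), and the text is already lowercased and tokenized. The `base_words` parameter is hypothetical and not requested by the exercise; it only makes the base list replaceable.

```python
from nltk.corpus import stopwords

def remove_stopwords(text, extra_words=None, exclude_words=None, base_words=None):
    """Return the text with stopwords removed.

    extra_words: additional words to treat as stopwords.
    exclude_words: words to keep even if they are in the stopword list.
    base_words: hypothetical override for the base stopword list.
    """
    if base_words is None:
        # Default: NLTK's English list (requires nltk.download('stopwords'))
        base_words = stopwords.words('english')
    # Add the extra words first, then take out the ones we want to keep
    stopword_set = (set(base_words) | set(extra_words or [])) - set(exclude_words or [])
    kept = [word for word in text.split() if word not in stopword_set]
    return ' '.join(kept)
```

Defaulting the optional parameters to `None` rather than `[]` avoids sharing one mutable list across calls.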



6. Use your data from the acquire exercises to produce a dataframe of the news articles. Name the dataframe `news_df`.

7. Make another dataframe for the Codeup blog posts. Name the dataframe `codeup_df`.

8. For each dataframe, produce the following columns:
- `title` to hold the title
- `original` to hold the original article/post content
- `clean` to hold the normalized and tokenized original with the stopwords removed.
- `stemmed` to hold the stemmed version of the cleaned data.
- `lemmatized` to hold the lemmatized version of the cleaned data.
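
The column layout above could be produced along these lines. This is only a sketch: the helper name `prepare_df` is hypothetical, and it assumes the acquired data is a list of dictionaries with `title` and `content` keys, which the source does not confirm. The cleaning, stemming, and lemmatizing steps are passed in as functions so the same helper serves both `news_df` and `codeup_df`.

```python
import pandas as pd

def prepare_df(articles, clean_fn, stem_fn, lemmatize_fn, content_key='content'):
    """Build a dataframe with title, original, clean, stemmed, and lemmatized columns."""
    # Assumption: each article is a dict with 'title' and content_key entries
    df = pd.DataFrame(articles)
    df = df.rename(columns={content_key: 'original'})
    # clean_fn is expected to normalize, tokenize, and remove stopwords
    df['clean'] = df['original'].apply(clean_fn)
    # Stemming and lemmatization both start from the cleaned text
    df['stemmed'] = df['clean'].apply(stem_fn)
    df['lemmatized'] = df['clean'].apply(lemmatize_fn)
    return df[['title', 'original', 'clean', 'stemmed', 'lemmatized']]
```

For example, `clean_fn` might be `lambda text: remove_stopwords(tokenize(basic_clean(text)))`, with `stem` and `lemmatize` passed as the other two functions.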