## Prepare Exercise

### Imports

In [1]:
import re
import pandas as pd
from requests import get
from bs4 import BeautifulSoup
import os

import unicodedata
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import acquire

### Lesson Practice

In [2]:
original = acquire.get_article_text()
print(original)

The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in Glassdoor’s #1 Best Job in America.
Data Science is a method of providing actionable intelligence from data. The data revolution has hit San Antonio, resulting in an explosion in Data Scientist positions across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen UTSA invest $70 M for a Cybersecurity Center and School of Data Science. We built a program to specifically meet the growing demands of this industry.
Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Students will work with real

### Lowercase

In [3]:
# Convert all letters to lowercase
article = original.lower()
print(article)

the rumors are true! the time has arrived. codeup has officially opened applications to our new data science career accelerator, with only 25 seats available! this immersive program is one of a kind in san antonio, and will help you land a job in glassdoor’s #1 best job in america.
data science is a method of providing actionable intelligence from data. the data revolution has hit san antonio, resulting in an explosion in data scientist positions across companies like usaa, accenture, booz allen hamilton, and heb. we’ve even seen utsa invest $70 m for a cybersecurity center and school of data science. we built a program to specifically meet the growing demands of this industry.
our program will be 18 weeks long, full-time, hands-on, and project-based. our curriculum development and instruction is led by senior data scientist, maggie giust, who has worked at heb, capital group, and rackspace, along with input from dozens of practitioners and hiring partners. students will work with real

### Remove Accented Characters

In [4]:
# Remove accented characters
article = unicodedata.normalize('NFKD', article)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')

print(article)

the rumors are true! the time has arrived. codeup has officially opened applications to our new data science career accelerator, with only 25 seats available! this immersive program is one of a kind in san antonio, and will help you land a job in glassdoors #1 best job in america.
data science is a method of providing actionable intelligence from data. the data revolution has hit san antonio, resulting in an explosion in data scientist positions across companies like usaa, accenture, booz allen hamilton, and heb. weve even seen utsa invest $70 m for a cybersecurity center and school of data science. we built a program to specifically meet the growing demands of this industry.
our program will be 18 weeks long, full-time, hands-on, and project-based. our curriculum development and instruction is led by senior data scientist, maggie giust, who has worked at heb, capital group, and rackspace, along with input from dozens of practitioners and hiring partners. students will work with real d
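
The choice of normalization form matters for the step above. The following is a minimal sketch, using hypothetical sample strings and a hypothetical helper named `to_ascii`, of why `'NFKD'` keeps the base letter of an accented character while `'NFKC'` loses it. It also shows why the curly apostrophe in "glassdoor’s" disappears from the output.

```python
import unicodedata

def to_ascii(text, form):
    # Normalize with the given Unicode form, then drop every non-ASCII character.
    return unicodedata.normalize(form, text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

# Hypothetical sample text (not from the article), written with escapes
# so the accented letters are unambiguously the precomposed code points.
sample = 'caf\u00e9 r\u00e9sum\u00e9'

# NFKD splits each accented letter into a base letter plus a combining mark;
# only the combining mark is dropped by the ASCII encode.
nfkd_result = to_ascii(sample, 'NFKD')

# NFKC leaves the accented letters composed, so the whole letter is dropped.
nfkc_result = to_ascii(sample, 'NFKC')

# The right single quotation mark (U+2019) has no ASCII decomposition,
# so it is removed under either form.
apostrophe_result = to_ascii('we\u2019ve', 'NFKD')

print(nfkd_result)        # cafe resume
print(nfkc_result)        # caf rsum
print(apostrophe_result)  # weve
```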

### Remove Special Characters

In [5]:
# Remove special characters
article = re.sub(r"[^a-z0-9'\s]", '', article)
print(article)

the rumors are true the time has arrived codeup has officially opened applications to our new data science career accelerator with only 25 seats available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america
data science is a method of providing actionable intelligence from data the data revolution has hit san antonio resulting in an explosion in data scientist positions across companies like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demands of this industry
our program will be 18 weeks long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who has worked at heb capital group and rackspace along with input from dozens of practitioners and hiring partners students will work with real data sets realistic problems a
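
Because the pattern deletes characters instead of replacing them with a space, hyphenated words are fused together ("full-time" becomes "fulltime"). A small sketch on a hypothetical sample string shows this. It also shows one possible variation, not used in this notebook, that turns hyphens into spaces first.

```python
import re

# Hypothetical sample string, already lowercased.
sample = "full-time, hands-on, and project-based; it's 18 weeks!"

# Same pattern as the cell above: keep a-z, 0-9, single quotes, and whitespace.
removed = re.sub(r"[^a-z0-9'\s]", '', sample)

# Variation (an assumption, not the notebook's approach): replace hyphens
# with spaces before removing the remaining special characters.
spaced = re.sub(r"[^a-z0-9'\s]", '', sample.replace('-', ' '))

print(removed)  # fulltime handson and projectbased it's 18 weeks
print(spaced)   # full time hands on and project based it's 18 weeks
```

Which behavior is preferable depends on whether "fulltime" or the separate tokens "full" and "time" are more useful downstream.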

### Tokenization

In [6]:
# Tokenization
# The process of breaking text into discrete units (tokens), such as words and punctuation marks

tokenizer = nltk.tokenize.ToktokTokenizer()

print(tokenizer.tokenize(original, return_str=True))

The rumors are true ! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator , with only 25 seats available ! This immersive program is one of a kind in San Antonio , and will help you land a job in Glassdoor ’ s #1 Best Job in America.
Data Science is a method of providing actionable intelligence from data. The data revolution has hit San Antonio , resulting in an explosion in Data Scientist positions across companies like USAA , Accenture , Booz Allen Hamilton , and HEB. We ’ ve even seen UTSA invest $ 70 M for a Cybersecurity Center and School of Data Science. We built a program to specifically meet the growing demands of this industry.
Our program will be 18 weeks long , full-time , hands-on , and project-based. Our curriculum development and instruction is led by Senior Data Scientist , Maggie Giust , who has worked at HEB , Capital Group , and Rackspace , along with input from dozens of practitioners and hiring partners. Student

### Stemming

In [7]:
# Stemming 
ps = nltk.porter.PorterStemmer()

ps.stem('call'), ps.stem('called'), ps.stem('calling')

('call', 'call', 'call')

In [8]:
stems = [ps.stem(word) for word in article.split()]
article_stemmed = ' '.join(stems)
print(article_stemmed)

the rumor are true the time ha arriv codeup ha offici open applic to our new data scienc career acceler with onli 25 seat avail thi immers program is one of a kind in san antonio and will help you land a job in glassdoor 1 best job in america data scienc is a method of provid action intellig from data the data revolut ha hit san antonio result in an explos in data scientist posit across compani like usaa accentur booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecur center and school of data scienc we built a program to specif meet the grow demand of thi industri our program will be 18 week long fulltim handson and projectbas our curriculum develop and instruct is led by senior data scientist maggi giust who ha work at heb capit group and rackspac along with input from dozen of practition and hire partner student will work with real data set realist problem and the entir data scienc pipelin from collect to deploy they will receiv profession develop train in resum

In [9]:
pd.Series(stems).value_counts().head(10)

data      13
and       13
to         9
a          8
in         8
scienc     7
our        7
learn      6
the        6
of         6
dtype: int64

### Lemmatization

In [10]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/dbojado/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [11]:
# Lemmatization
wnl = nltk.stem.WordNetLemmatizer()

for word in 'study studies'.split():
    print('stem:', ps.stem(word), '-- lemma:', wnl.lemmatize(word))

stem: studi -- lemma: study
stem: studi -- lemma: study


In [12]:
lemmas = [wnl.lemmatize(word) for word in article.split()]
article_lemmatized = ' '.join(lemmas)

print(article_lemmatized)

the rumor are true the time ha arrived codeup ha officially opened application to our new data science career accelerator with only 25 seat available this immersive program is one of a kind in san antonio and will help you land a job in glassdoors 1 best job in america data science is a method of providing actionable intelligence from data the data revolution ha hit san antonio resulting in an explosion in data scientist position across company like usaa accenture booz allen hamilton and heb weve even seen utsa invest 70 m for a cybersecurity center and school of data science we built a program to specifically meet the growing demand of this industry our program will be 18 week long fulltime handson and projectbased our curriculum development and instruction is led by senior data scientist maggie giust who ha worked at heb capital group and rackspace along with input from dozen of practitioner and hiring partner student will work with real data set realistic problem and the entire data

In [13]:
pd.Series(lemmas).value_counts()[:10]

and        13
data       13
to          9
in          8
a           8
our         7
science     7
will        6
with        6
of          6
dtype: int64

### Removing Stopwords

In [14]:
# Removing stopwords
stopword_list = stopwords.words('english')

stopword_list.remove('no')
stopword_list.remove('not')

stopword_list[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [15]:
words = article.split()
filtered_words = [w for w in words if w not in stopword_list]

print('Removed {} stopwords'.format(len(words) - len(filtered_words)))
print('---')

article_without_stopwords = ' '.join(filtered_words)

print(article_without_stopwords)

Removed 122 stopwords
---
rumors true time arrived codeup officially opened applications new data science career accelerator 25 seats available immersive program one kind san antonio help land job glassdoors 1 best job america data science method providing actionable intelligence data data revolution hit san antonio resulting explosion data scientist positions across companies like usaa accenture booz allen hamilton heb weve even seen utsa invest 70 cybersecurity center school data science built program specifically meet growing demands industry program 18 weeks long fulltime handson projectbased curriculum development instruction led senior data scientist maggie giust worked heb capital group rackspace along input dozens practitioners hiring partners students work real data sets realistic problems entire data science pipeline collection deployment receive professional development training resume writing interviewing continuing education prepare smooth transition workforce focus applie

#### 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

In [16]:
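def basic_clean(string):
    '''
    This function lowercases everything, normalizes unicode characters, and
    removes anything that is not a letter, number, whitespace, or a single quote.
    '''
    # Lowercase first so the character class below only needs a-z.
    string = string.lower()
    # NFKD decomposes accented characters so the ASCII encode keeps the base letter.
    string = unicodedata.normalize('NFKD', string)\
             .encode('ascii', 'ignore')\
             .decode('utf-8', 'ignore')
    # Keep letters, digits, single quotes, and whitespace; drop everything else.
    string = re.sub(r"[^a-z0-9'\s]", '', string)
    return string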

#### 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [17]:
def tokenize(string):
    '''
    This function takes in a string and
    returns a tokenized string.
    '''
    # Create tokenizer.
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    # Use tokenizer
    string = tokenizer.tokenize(string, return_str=True)
    
    return string

#### 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [18]:
def stem(string):
    '''
    This function takes in a string and
    returns a string with words stemmed.
    '''
    # Create porter stemmer.
    ps = nltk.porter.PorterStemmer()
    
    # Use the stemmer to stem each word in the list of words we created by using split.
    stems = [ps.stem(word) for word in string.split()]
    
    # Join our lists of words into a string again and assign to a variable.
    string = ' '.join(stems)
    
    return string

#### 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [19]:
def lemmatize(string):
    '''
    This function takes in a string and
    returns a string with words lemmatized.
    '''
    # Create the lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Use the lemmatizer on each word in the list of words we created by using split.
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    # Join our list of words into a string again and assign to a variable.
    string = ' '.join(lemmas)
    
    return string

#### 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

#### This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [20]:
def remove_stopwords(string, extra_words=None, exclude_words=None):
    '''
    This function takes in a string and optional extra_words and exclude_words
    lists (both default to None) and returns the string with stopwords removed.
    '''
    # Create stopword_list.
    stopword_list = stopwords.words('english')

    # Add extra_words to the stopwords, then drop any exclude_words from them.
    stopword_set = set(stopword_list) | set(extra_words or [])
    stopword_set -= set(exclude_words or [])

    # Split words in string.
    words = string.split()

    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_set]

    # Join words in the list back into strings and assign to a variable.
    string_without_stopwords = ' '.join(filtered_words)

    return string_without_stopwords
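
The intended meaning of the two optional parameters can be checked without nltk. This is a minimal sketch that assumes a tiny, hypothetical stopword set (`BASE_STOPWORDS`) and two hypothetical helpers (`build_stopword_set`, `filter_words`). Here `extra_words` are treated as additional stopwords, and `exclude_words` are words that should survive even though they are normally stopwords.

```python
# Hypothetical, tiny stopword set for illustration only; the notebook
# uses nltk's stopwords.words('english') instead.
BASE_STOPWORDS = {'the', 'a', 'is', 'not', 'and'}

def build_stopword_set(base, extra_words=None, exclude_words=None):
    # extra_words are added to the stopwords; exclude_words are taken back out.
    return (set(base) | set(extra_words or [])) - set(exclude_words or [])

def filter_words(text, stopword_set):
    # Keep only the words that are not in the stopword set.
    return ' '.join(word for word in text.split() if word not in stopword_set)

stopword_set = build_stopword_set(
    BASE_STOPWORDS,
    extra_words=['codeup'],   # treat 'codeup' as a stopword too
    exclude_words=['not'],    # keep 'not' in the text
)
result = filter_words('the codeup program is not a bootcamp and more', stopword_set)
print(result)  # program not bootcamp more
```

Using `None` as the default instead of `[]` avoids sharing one mutable list across calls.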

#### 6. Use your data from the acquire exercises to produce a dataframe of the news articles. Name the dataframe news_df.

#### 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

#### 8. For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.
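
One way to lay out these columns is sketched below. It assumes each article arrives as a dict with `title` and `content` keys, which is an assumption about the acquire output and not something this notebook confirms. The helper `build_records` and the stand-in functions are hypothetical. In the notebook itself, `clean_fn` could be something like `lambda s: remove_stopwords(tokenize(basic_clean(s)))`, with `stem` and `lemmatize` as the other two arguments, and the resulting list could be passed to `pd.DataFrame`.

```python
def build_records(articles, clean_fn, stem_fn, lemmatize_fn):
    # articles: list of dicts with 'title' and 'content' keys (an assumed shape).
    records = []
    for article in articles:
        clean = clean_fn(article['content'])
        records.append({
            'title': article['title'],
            'original': article['content'],
            'clean': clean,
            # Both derived columns start from the cleaned text, not the original.
            'stemmed': stem_fn(clean),
            'lemmatized': lemmatize_fn(clean),
        })
    return records

# Stand-in functions so the sketch runs without nltk; swap in the real pipeline.
demo = build_records(
    [{'title': 'Hello', 'content': 'Hello, World!'}],
    clean_fn=lambda s: s.lower().replace(',', '').replace('!', ''),
    stem_fn=lambda s: s,
    lemmatize_fn=lambda s: s,
)
print(demo[0]['clean'])  # hello world
```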

#### 9. Ask yourself:

- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?    

- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?    

- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?  