# Prepare Exercises

Exercises

The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will be defining some functions to prepare textual data. These functions should apply equally well to both the codeup blog articles and the news articles that were previously acquired.

In [194]:
import pandas as pd
import numpy as np
import unicodedata
import re
from bs4 import BeautifulSoup
import acquire as a
import requests

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords



# 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

* Lowercase everything
* Normalize unicode characters
* Replace anything that is not a letter, number, whitespace or a single quote.

In [18]:
def basic_clean(input_string):
    '''
    basice_clean function takes in a string and performs the following cleaning: lowercase, normalize characters
    and replaces anything that ia a letter , number, whitespace or sigle quote
    returns clean_string
    '''
    # takes original string and lowercase the string
    clean_string = input_string.lower()
    
    # normalized the string
    clean_string = unicodedata.normalize('NFKD', clean_string).encode('ascii','ignore').decode('utf-8')
    
    # remove anything that is not a through z, a number, a single quote, or whites
    clean_string = re.sub(r"[^a-z0-9'\s]", '', clean_string)
    
    return clean_string

In [19]:
sample_string = '1-034-@#$.32ksk|llkm fsadpfo-3-ljf &*^...hi mom...?/\|'

In [23]:
sample_text = "Hey Amazon - my package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first PLEASE FIX ASAP! @AmazonHelp"

In [20]:
basic_clean(sample_string)

'103432kskllkm fsadpfo3ljf hi mom'

In [24]:
basic_clean(sample_text)

'hey amazon  my package never arrived httpswwwamazoncomgpcssorderhistoryrefnavordersfirst please fix asap amazonhelp'

# 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [135]:
def tokenize(input_string,return_str = True):
    '''
    tokenize takes in a string and passes throug basic_clean function then tokenize all the words in the string
    returns token_string
    '''

    # create tokenizer object
    tokenizer = nltk.tokenize.ToktokTokenizer()
     
    # apply token to string    
    token_string  = tokenizer.tokenize(input_string, return_str=return_str)

    
    return token_string

In [136]:
tokenize(sample_string,False)

['1-034-@#$',
 '.32ksk',
 '&#124;',
 'llkm',
 'fsadpfo-3-ljf',
 '&*^',
 '...',
 'hi',
 'mom',
 '...',
 '?/\\',
 '&#124;']

In [137]:
tokenize(sample_text)

'Hey Amazon - my package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first PLEASE FIX ASAP ! @AmazonHelp'

# 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [131]:
def stem(input_string):
    '''
    stem takes in a string 
    returns stem_string a stem version of string
    '''
    # create stemming object
    ps = nltk.porter.PorterStemmer()
    # stemming string
    stem_string = [ps.stem(word) for word in input_string.split()]
    # join stemmed string
    stem_string = ' '.join(stem_string)
    
    return stem_string

In [132]:
stem(sample_string)

'1-034-@#$.32ksk|llkm fsadpfo-3-ljf &*^...hi mom...?/\\|'

In [105]:
stem(sample_text)

'hey amazon my packag never arriv httpswwwamazoncomgpcssorderhistoryrefnavordersfirst pleas fix asap amazonhelp'

# 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [133]:
def lemmatize(input_string):
    '''
    lemmatize takes in a string 
    returns lemmas_string a lemmatize version of the string.
    '''
    
    # create object
    wnl = nltk.stem.WordNetLemmatizer()
    
    # apply lemmatizer to string
    lemmas_string = [wnl.lemmatize(word) for word in input_string.split()]
    lemmas_string = " ".join(lemmas_string)
    
    return lemmas_string
    
    

In [134]:
lemmatize(sample_string)

'1-034-@#$.32ksk|llkm fsadpfo-3-ljf &*^...hi mom...?/\\|'

In [46]:
lemmatize(sample_text)

'hey amazon my package never arrived httpswwwamazoncomgpcssorderhistoryrefnavordersfirst please fix asap amazonhelp'

# 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [156]:
def remove_stopwords(input_string, extra_words = [],exclude_words = []):
    '''
    remove_stopwords takes in a string, optional extra_words as a list and exclude_words as a list
    parameters with default  empty lists and returna string.
    '''
    
    # ceate stopwords list
    stopwords_list = stopwords.words('english')
    
    # take out some words
    stopwords_list = set(stopwords_list)-set(exclude_words)
    
    # words to be added
    stopwords_list = stopwords_list.union(set(exclude_words))
    
    # split our document by spaces
    words = input_string.split()
    
    # this is the stopwords applied(taken out of) the original text
    new_string = [word for word in input_string.split() if word not in stopwords_list]
    # join together
    new_string = ' '.join(new_string)

    return new_string

In [157]:
remove_stopwords(sample_string)

'1-034-@#$.32ksk|llkm fsadpfo-3-ljf &*^...hi mom...?/\\|'

In [158]:
remove_stopwords(sample_text)

'Hey Amazon - package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first PLEASE FIX ASAP! @AmazonHelp'

# 6. Use your data from the acquire to produce a dataframe of the news articles. Name the dataframe news_df.

In [161]:
news_df = pd.DataFrame(a.inshort_info())

In [162]:
news_df

Unnamed: 0,title,content,category
0,"Moscow-Goa flight gets bomb threat, makes emer...",A Goa-bound flight from Russia's Moscow made a...,national
1,"Joshimath divided into 3 zones, govt says most...","Uttarakhand's Joshimath, where a majority of b...",national
2,Which states have reported COVID-19 variant XB...,"One new case of COVID-19 variant XBB.1.5, whic...",national
3,I decided to wear t-shirt till I shiver after ...,Congress leader Rahul Gandhi on Monday told re...,national
4,"2 children charred to death, 4 other siblings ...",At least two children were charred to death an...,national
...,...,...,...
292,Weakening rupee could force us to raise domest...,Mercedes-Benz India Managing Director Santosh ...,automobile
293,Volkswagen's India sales grow by 85% to 1.01 l...,Volkswagen's (VW) sales in India grew by 85.48...,automobile
294,India becomes world's 3rd largest auto market ...,India surpassed Japan to become the third-larg...,automobile
295,Tesla reports record deliveries of 1.3 million...,Tesla on Monday reported that it delivered rec...,automobile


# 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [199]:
# start building our function:
# first step: grab the article links:
def get_blog_urls(base_url, header={'User-Agent': 'hamsandwich'}):
    soup = BeautifulSoup(requests.get(url, headers=header).content)
    return [link['href'] for link in soup.select('a.more-link')]

In [200]:
def get_blog_content(base_url):
    blog_links = get_blog_urls(base_url)
    all_blogs = []
    for blog in blog_links:
        blog_soup = soupify(
            get(blog,
                headers=header).content)
        blog_content = {'title': blog_soup.select_one(
            'h1.entry-title').text,
        'content': blog_soup.select_one(
            'div.entry-content').text.strip()}
        all_blogs.append(blog_content)
    return all_blogs

In [201]:
url = 'https://codeup.com/blog/'

In [202]:
get_blog_urls(url)

['https://codeup.com/data-science/become-a-data-scientist/',
 'https://codeup.com/employers/hiring-tech-talent/',
 'https://codeup.com/cloud-administration/cap-funding-options/',
 'https://codeup.com/dallas-info/it-professionals-dallas/',
 'https://codeup.com/codeup-news/codeup-voted-1-technical-school-in-dfw/',
 'https://codeup.com/tips-for-prospective-students/financing/codeups-scholarships/']

In [203]:
codeup_df = pd.DataFrame(codeup_df)

In [204]:
codeup_df

Unnamed: 0,title,content
0,Become a Data Scientist in 6 Months!,Are you feeling unfulfilled in your work but w...


# 8. For each dataframe, produce the following columns:

* title to hold the title
* original to hold the original article/post content
* clean to hold the normalized and tokenized original with the stopwords removed.
* stemmed to hold the stemmed version of the cleaned data.
* lemmatized to hold the lemmatized version of the cleaned data.

In [168]:
news_df.rename(columns={'content':'original'}, inplace= True)

In [171]:
df =news_df

In [176]:
df['clean'] = df['original'].apply(basic_clean).apply(tokenize).apply(remove_stopwords)

In [177]:
df['stemmed'] = df['clean'].apply(stem)

In [179]:
df['lemmatized'] = df['clean'].apply(lemmatize)

In [180]:
df

Unnamed: 0,title,original,category,clean,stemmed,lemmatized
0,"Moscow-Goa flight gets bomb threat, makes emer...",A Goa-bound flight from Russia's Moscow made a...,national,goabound flight russia ' moscow made emergency...,goabound flight russia ' moscow made emerg lan...,goabound flight russia ' moscow made emergency...
1,"Joshimath divided into 3 zones, govt says most...","Uttarakhand's Joshimath, where a majority of b...",national,uttarakhand ' joshimath majority buildings dev...,uttarakhand ' joshimath major build develop cr...,uttarakhand ' joshimath majority building deve...
2,Which states have reported COVID-19 variant XB...,"One new case of COVID-19 variant XBB.1.5, whic...",national,one new case covid19 variant xbb15 responsible...,one new case covid19 variant xbb15 respons ris...,one new case covid19 variant xbb15 responsible...
3,I decided to wear t-shirt till I shiver after ...,Congress leader Rahul Gandhi on Monday told re...,national,congress leader rahul gandhi monday told repor...,congress leader rahul gandhi monday told repor...,congress leader rahul gandhi monday told repor...
4,"2 children charred to death, 4 other siblings ...",At least two children were charred to death an...,national,least two children charred death four others s...,least two children char death four other susta...,least two child charred death four others sust...
...,...,...,...,...,...,...
292,Weakening rupee could force us to raise domest...,Mercedes-Benz India Managing Director Santosh ...,automobile,mercedesbenz india managing director santosh i...,mercedesbenz india manag director santosh iyer...,mercedesbenz india managing director santosh i...
293,Volkswagen's India sales grow by 85% to 1.01 l...,Volkswagen's (VW) sales in India grew by 85.48...,automobile,volkswagen ' vw sales india grew 8548 101270 u...,volkswagen ' vw sale india grew 8548 101270 un...,volkswagen ' vw sale india grew 8548 101270 un...
294,India becomes world's 3rd largest auto market ...,India surpassed Japan to become the third-larg...,automobile,india surpassed japan become thirdlargest auto...,india surpass japan becom thirdlargest auto ma...,india surpassed japan become thirdlargest auto...
295,Tesla reports record deliveries of 1.3 million...,Tesla on Monday reported that it delivered rec...,automobile,tesla monday reported delivered record 13 mill...,tesla monday report deliv record 13 million ve...,tesla monday reported delivered record 13 mill...


# 9 Ask yourself:

* If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
* If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
* If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?