# Exercises

The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

#### 1) Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

>Lowercase everything

>Normalize unicode characters

>Remove anything that is not a letter, number, whitespace, or a single quote (replace it with an empty string).

#### 2) Define a function named tokenize. It should take in a string and tokenize all the words in the string.

#### 3) Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

#### 4) Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

#### 5) Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

>This function should define two optional parameters, extra_words and exclude_words. extra_words should hold any additional stop words to include, and exclude_words should hold any words that we don't want to remove.

#### 6) Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe news_df.

#### 7) Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

#### 8) For each dataframe, produce the following columns:

>title to hold the title

>original to hold the original article/post content

>clean to hold the normalized and tokenized original with the stopwords removed.

>stemmed to hold the stemmed version of the cleaned data.

>lemmatized to hold the lemmatized version of the cleaned data.

#### 9) Ask yourself:

>If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?

>If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?

>If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
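For a rough sense of the scale behind these three questions, the sizes can be put on a common footing. The sketch below converts each corpus size to megabytes, assuming binary units (1 MB = 1024 KB, 1 TB = 1024 × 1024 MB). The `to_megabytes` helper is hypothetical and only illustrates the arithmetic.

```python
# Hypothetical helper: convert a size to megabytes (binary units assumed).
UNITS_IN_MB = {"KB": 1 / 1024, "MB": 1, "TB": 1024 * 1024}

def to_megabytes(size, unit):
    """Return the size expressed in megabytes."""
    return size * UNITS_IN_MB[unit]

# The three corpus sizes from the questions above
corpora = [(493, "KB"), (25, "MB"), (200, "TB")]

for size, unit in corpora:
    print(f"{size} {unit} = {to_megabytes(size, unit):,.2f} MB")
```

Under these assumptions the largest corpus is roughly two hundred million megabytes, so any extra per-megabyte work is multiplied accordingly. Stemming is usually the cheaper of the two transformations, which is worth weighing at that scale.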

In [50]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd
import acquire

In [51]:
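# Uncomment and run once to download the NLTK data used below (for example the stopwords and wordnet corpora)
# nltk.download('all')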

In [52]:
df = acquire.codeup_blogs()
df.head()

Unnamed: 0,title,date_published,content
0,Is a Career in Tech Recession-Proof?,"Aug 12, 2022",\n\n\n\n\n\nGiven the current economic climate...
1,Codeup Honored as SABJ Diversity and Inclusion...,"Oct 7, 2022",\nCodeup has been named the 2022 Diversity and...
2,How Can I Finance My Career Transition?,"Sep 29, 2022",\nDeciding to transition into a tech career is...
3,Tips for Women Beginning a Career in Tech,"Sep 23, 2022","\nCodeup strongly values diversity, and inclus..."
4,2022 SABJ C-Suite Award Winner: Stephen Noteboom,"Sep 9, 2022","\nCodeup’s Chief Operating Officer, Stephen No..."


In [53]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   title           6 non-null      object
 1   date_published  6 non-null      object
 2   content         6 non-null      object
dtypes: object(3)
memory usage: 272.0+ bytes


In [54]:
business_df = acquire.cut_from_one('business')
business_df.head()

Unnamed: 0,title,content,category
0,"Infosys' attrition drops to 27.1%, net employe...",Infosys on Thursday reported a 1.3% QoQ drop i...,business
1,We do not support dual employment: Infosys on ...,Infosys CEO Salil Parekh spoke on the moonligh...,business
2,Musk sells $1 million worth of 'Burnt Hair' pe...,The world's richest person Elon Musk earned a ...,business
3,Mukesh Ambani visits Kedarnath & Badrinath shr...,Reliance Industries Chairman Mukesh Ambani on ...,business
4,"IKEA lays off 10,000 employees after halting R...",Swedish ready-to-assemble furniture retailer I...,business


In [55]:
business_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     25 non-null     object
 1   content   25 non-null     object
 2   category  25 non-null     object
dtypes: object(3)
memory usage: 728.0+ bytes


In [56]:
# Keep an untouched copy of the news data to build news_df from later
business_df_2 = business_df.copy()

# 1) Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

>Lowercase everything

>Normalize unicode characters

>Remove anything that is not a letter, number, whitespace, or a single quote (replace it with an empty string).

In [57]:
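# Define a function to clean text data
def basic_clean(string):
    
    # Lowercase everything
    string = string.lower()
    
    # Decompose accented characters (NFKD), then drop anything that is not ASCII
    string = unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    
    # Remove everything except lowercase letters, digits, whitespace, and single quotes
    string = re.sub(r"[^a-z0-9\s']", '', string)
    
    return string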

In [58]:
business_df['basic_clean'] = business_df['content'].apply(basic_clean)
business_df.head()

Unnamed: 0,title,content,category,basic_clean
0,"Infosys' attrition drops to 27.1%, net employe...",Infosys on Thursday reported a 1.3% QoQ drop i...,business,infosys on thursday reported a 13 qoq drop in ...
1,We do not support dual employment: Infosys on ...,Infosys CEO Salil Parekh spoke on the moonligh...,business,infosys ceo salil parekh spoke on the moonligh...
2,Musk sells $1 million worth of 'Burnt Hair' pe...,The world's richest person Elon Musk earned a ...,business,the world's richest person elon musk earned a ...
3,Mukesh Ambani visits Kedarnath & Badrinath shr...,Reliance Industries Chairman Mukesh Ambani on ...,business,reliance industries chairman mukesh ambani on ...
4,"IKEA lays off 10,000 employees after halting R...",Swedish ready-to-assemble furniture retailer I...,business,swedish readytoassemble furniture retailer ike...
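The effect of each cleaning step is easier to see on a single short string. The sketch below repeats the function so that it runs on its own; the sample sentence is made up for illustration. Note that the decimal point is removed along with the other punctuation, which is why `1.3%` appears as `13` in the output above.

```python
import re
import unicodedata

def basic_clean(string):
    # Lowercase, strip accents via NFKD + ASCII encoding,
    # then keep only a-z, 0-9, whitespace, and single quotes
    string = string.lower()
    string = unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return re.sub(r"[^a-z0-9\s']", '', string)

# Made-up sample sentence; \u00e9 is an accented "e"
sample = "Caf\u00e9's revenue rose 1.3%!"
cleaned = basic_clean(sample)
print(cleaned)  # cafe's revenue rose 13
```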


# 2) Define a function named tokenize. It should take in a string and tokenize all the words in the string.



In [59]:
def tokenize(string):
    
    # Create the tokenizer
    tokenizer = ToktokTokenizer()
    
    # return_str=True returns the tokens joined by spaces instead of a list
    return tokenizer.tokenize(string, return_str=True)

In [60]:
business_df['tokenize'] = business_df['content'].apply(tokenize)
business_df.head()

Unnamed: 0,title,content,category,basic_clean,tokenize
0,"Infosys' attrition drops to 27.1%, net employe...",Infosys on Thursday reported a 1.3% QoQ drop i...,business,infosys on thursday reported a 13 qoq drop in ...,Infosys on Thursday reported a 1.3 % QoQ drop ...
1,We do not support dual employment: Infosys on ...,Infosys CEO Salil Parekh spoke on the moonligh...,business,infosys ceo salil parekh spoke on the moonligh...,Infosys CEO Salil Parekh spoke on the moonligh...
2,Musk sells $1 million worth of 'Burnt Hair' pe...,The world's richest person Elon Musk earned a ...,business,the world's richest person elon musk earned a ...,The world ' s richest person Elon Musk earned ...
3,Mukesh Ambani visits Kedarnath & Badrinath shr...,Reliance Industries Chairman Mukesh Ambani on ...,business,reliance industries chairman mukesh ambani on ...,Reliance Industries Chairman Mukesh Ambani on ...
4,"IKEA lays off 10,000 employees after halting R...",Swedish ready-to-assemble furniture retailer I...,business,swedish readytoassemble furniture retailer ike...,Swedish ready-to-assemble furniture retailer I...


# 3) Define a function named stem. It should accept some text and return the text after applying stemming to all the words.



In [61]:
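def stem(string):
    
    # Create the Porter stemmer
    ps = nltk.stem.PorterStemmer()
    
    # Stem each whitespace-separated word
    stems = [ps.stem(word) for word in string.split()]
    
    # Join the stems back into a single string
    stemmed_string = ' '.join(stems)
    
    return stemmed_string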

In [62]:
business_df['stem'] = business_df['content'].apply(stem)
business_df.head()

Unnamed: 0,title,content,category,basic_clean,tokenize,stem
0,"Infosys' attrition drops to 27.1%, net employe...",Infosys on Thursday reported a 1.3% QoQ drop i...,business,infosys on thursday reported a 13 qoq drop in ...,Infosys on Thursday reported a 1.3 % QoQ drop ...,infosi on thursday report a 1.3% qoq drop in i...
1,We do not support dual employment: Infosys on ...,Infosys CEO Salil Parekh spoke on the moonligh...,business,infosys ceo salil parekh spoke on the moonligh...,Infosys CEO Salil Parekh spoke on the moonligh...,infosi ceo salil parekh spoke on the moonlight...
2,Musk sells $1 million worth of 'Burnt Hair' pe...,The world's richest person Elon Musk earned a ...,business,the world's richest person elon musk earned a ...,The world ' s richest person Elon Musk earned ...,the world' richest person elon musk earn a mil...
3,Mukesh Ambani visits Kedarnath & Badrinath shr...,Reliance Industries Chairman Mukesh Ambani on ...,business,reliance industries chairman mukesh ambani on ...,Reliance Industries Chairman Mukesh Ambani on ...,relianc industri chairman mukesh ambani on thu...
4,"IKEA lays off 10,000 employees after halting R...",Swedish ready-to-assemble furniture retailer I...,business,swedish readytoassemble furniture retailer ike...,Swedish ready-to-assemble furniture retailer I...,swedish ready-to-assembl furnitur retail ikea ...


# 4) Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.



In [63]:
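def lemmatize(string):
    
    # Create the WordNet lemmatizer
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Lemmatize each whitespace-separated word (treated as a noun, since no part of speech is passed)
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    # Join the lemmas back into a single string
    lemmatized_string = ' '.join(lemmas)
    
    return lemmatized_string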

In [64]:
business_df['lemmatize'] = business_df['content'].apply(lemmatize)
business_df.head()

Unnamed: 0,title,content,category,basic_clean,tokenize,stem,lemmatize
0,"Infosys' attrition drops to 27.1%, net employe...",Infosys on Thursday reported a 1.3% QoQ drop i...,business,infosys on thursday reported a 13 qoq drop in ...,Infosys on Thursday reported a 1.3 % QoQ drop ...,infosi on thursday report a 1.3% qoq drop in i...,Infosys on Thursday reported a 1.3% QoQ drop i...
1,We do not support dual employment: Infosys on ...,Infosys CEO Salil Parekh spoke on the moonligh...,business,infosys ceo salil parekh spoke on the moonligh...,Infosys CEO Salil Parekh spoke on the moonligh...,infosi ceo salil parekh spoke on the moonlight...,Infosys CEO Salil Parekh spoke on the moonligh...
2,Musk sells $1 million worth of 'Burnt Hair' pe...,The world's richest person Elon Musk earned a ...,business,the world's richest person elon musk earned a ...,The world ' s richest person Elon Musk earned ...,the world' richest person elon musk earn a mil...,The world's richest person Elon Musk earned a ...
3,Mukesh Ambani visits Kedarnath & Badrinath shr...,Reliance Industries Chairman Mukesh Ambani on ...,business,reliance industries chairman mukesh ambani on ...,Reliance Industries Chairman Mukesh Ambani on ...,relianc industri chairman mukesh ambani on thu...,Reliance Industries Chairman Mukesh Ambani on ...
4,"IKEA lays off 10,000 employees after halting R...",Swedish ready-to-assemble furniture retailer I...,business,swedish readytoassemble furniture retailer ike...,Swedish ready-to-assemble furniture retailer I...,swedish ready-to-assembl furnitur retail ikea ...,Swedish ready-to-assemble furniture retailer I...


# 5) Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

>This function should define two optional parameters, extra_words and exclude_words. extra_words should hold any additional stop words to include, and exclude_words should hold any words that we don't want to remove.



In [65]:
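def remove_stopwords(string, extra_words=None, exclude_words=None):
    
    # Start from NLTK's English stopword list
    stopword_list = stopwords.words('english')
    
    # extra_words are additional stopwords that should also be removed from the text
    if extra_words:
        
        stopword_list = stopword_list + list(extra_words)
        
    # exclude_words should stay in the text, so drop them from the stopword list
    if exclude_words:
        
        stopword_list = [word for word in stopword_list if word not in exclude_words]
            
    words = string.split()
    
    # Keep only the words that are not stopwords (the comparison is case-sensitive)
    filtered_words = [word for word in words if word not in stopword_list]
    
    filtered_string = ' '.join(filtered_words)
    
    return filtered_string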

In [66]:
business_df['remove_stopwords'] = business_df['content'].apply(remove_stopwords)

business_df.head()

Unnamed: 0,title,content,category,basic_clean,tokenize,stem,lemmatize,remove_stopwords
0,"Infosys' attrition drops to 27.1%, net employe...",Infosys on Thursday reported a 1.3% QoQ drop i...,business,infosys on thursday reported a 13 qoq drop in ...,Infosys on Thursday reported a 1.3 % QoQ drop ...,infosi on thursday report a 1.3% qoq drop in i...,Infosys on Thursday reported a 1.3% QoQ drop i...,Infosys Thursday reported 1.3% QoQ drop volunt...
1,We do not support dual employment: Infosys on ...,Infosys CEO Salil Parekh spoke on the moonligh...,business,infosys ceo salil parekh spoke on the moonligh...,Infosys CEO Salil Parekh spoke on the moonligh...,infosi ceo salil parekh spoke on the moonlight...,Infosys CEO Salil Parekh spoke on the moonligh...,Infosys CEO Salil Parekh spoke moonlighting de...
2,Musk sells $1 million worth of 'Burnt Hair' pe...,The world's richest person Elon Musk earned a ...,business,the world's richest person elon musk earned a ...,The world ' s richest person Elon Musk earned ...,the world' richest person elon musk earn a mil...,The world's richest person Elon Musk earned a ...,The world's richest person Elon Musk earned mi...
3,Mukesh Ambani visits Kedarnath & Badrinath shr...,Reliance Industries Chairman Mukesh Ambani on ...,business,reliance industries chairman mukesh ambani on ...,Reliance Industries Chairman Mukesh Ambani on ...,relianc industri chairman mukesh ambani on thu...,Reliance Industries Chairman Mukesh Ambani on ...,Reliance Industries Chairman Mukesh Ambani Thu...
4,"IKEA lays off 10,000 employees after halting R...",Swedish ready-to-assemble furniture retailer I...,business,swedish readytoassemble furniture retailer ike...,Swedish ready-to-assemble furniture retailer I...,swedish ready-to-assembl furnitur retail ikea ...,Swedish ready-to-assemble furniture retailer I...,Swedish ready-to-assemble furniture retailer I...
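The two optional parameters pull in opposite directions: `extra_words` grows the stopword list and `exclude_words` shrinks it. The sketch below shows that behavior under a simplifying assumption: it uses a tiny hard-coded stopword list instead of NLTK's so that it runs without any downloads. The names `BASE_STOPWORDS` and `remove_stopwords_sketch` are hypothetical.

```python
# Tiny stand-in stopword list (an assumption for illustration;
# the notebook itself uses NLTK's English list)
BASE_STOPWORDS = ['the', 'a', 'on', 'in', 'of', 'to', 'not']

def remove_stopwords_sketch(string, extra_words=None, exclude_words=None):
    stopword_set = set(BASE_STOPWORDS)
    # extra_words: additional words to treat as stopwords
    stopword_set |= set(extra_words or [])
    # exclude_words: words to keep even though they are in the base list
    stopword_set -= set(exclude_words or [])
    return ' '.join(word for word in string.split() if word not in stopword_set)

text = "infosys does not support dual employment on the side"

default = remove_stopwords_sketch(text)
custom = remove_stopwords_sketch(text, extra_words=['infosys'], exclude_words=['not'])

print(default)  # infosys does support dual employment side
print(custom)   # does not support dual employment side
```

Using sets here means that excluding a word that is not in the base list is harmless, whereas `list.remove` would raise a `ValueError` in that case.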


# 6) Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe news_df.




In [67]:
news_df = business_df_2
news_df.head()

Unnamed: 0,title,content,category
0,"Infosys' attrition drops to 27.1%, net employe...",Infosys on Thursday reported a 1.3% QoQ drop i...,business
1,We do not support dual employment: Infosys on ...,Infosys CEO Salil Parekh spoke on the moonligh...,business
2,Musk sells $1 million worth of 'Burnt Hair' pe...,The world's richest person Elon Musk earned a ...,business
3,Mukesh Ambani visits Kedarnath & Badrinath shr...,Reliance Industries Chairman Mukesh Ambani on ...,business
4,"IKEA lays off 10,000 employees after halting R...",Swedish ready-to-assemble furniture retailer I...,business


# 7) Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.



In [68]:
codeup_df = df
codeup_df.head()

# 8) For each dataframe, produce the following columns:

>title to hold the title

>original to hold the original article/post content

>clean to hold the normalized and tokenized original with the stopwords removed.

>stemmed to hold the stemmed version of the cleaned data.

>lemmatized to hold the lemmatized version of the cleaned data.

In [72]:
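# Chain the cleaning steps: normalize, tokenize, then drop stopwords
codeup_df['clean'] = codeup_df['content'].apply(basic_clean).apply(tokenize).apply(remove_stopwords)
# Note: stem and lemmatize run on the raw content here; the exercise asks for the cleaned text (the 'clean' column)
codeup_df['stemmed'] = codeup_df['content'].apply(stem)
codeup_df['lemmatized'] = codeup_df['content'].apply(lemmatize)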
codeup_df.head()

Unnamed: 0,title,date_published,content,clean,stemmed,lemmatized
0,Is a Career in Tech Recession-Proof?,"Aug 12, 2022",\n\n\n\n\n\nGiven the current economic climate...,given current economic climate many economists...,"given the current econom climate, mani economi...","Given the current economic climate, many econo..."
1,Codeup Honored as SABJ Diversity and Inclusion...,"Oct 7, 2022",\nCodeup has been named the 2022 Diversity and...,codeup named 2022 diversity inclusion award wi...,codeup ha been name the 2022 divers and inclus...,Codeup ha been named the 2022 Diversity and In...
2,How Can I Finance My Career Transition?,"Sep 29, 2022",\nDeciding to transition into a tech career is...,deciding transition tech career big step signi...,decid to transit into a tech career is a big s...,Deciding to transition into a tech career is a...
3,Tips for Women Beginning a Career in Tech,"Sep 23, 2022","\nCodeup strongly values diversity, and inclus...",codeup strongly values diversity inclusion hon...,"codeup strongli valu diversity, and inclusion....","Codeup strongly value diversity, and inclusion..."
4,2022 SABJ C-Suite Award Winner: Stephen Noteboom,"Sep 9, 2022","\nCodeup’s Chief Operating Officer, Stephen No...",codeups chief operating officer stephen notebo...,"codeup’ chief oper officer, stephen noteboom h...","Codeup’s Chief Operating Officer, Stephen Note..."


In [73]:
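# Chain the cleaning steps: normalize, tokenize, then drop stopwords
news_df['clean'] = news_df['content'].apply(basic_clean).apply(tokenize).apply(remove_stopwords)
# Note: stem and lemmatize run on the raw content here; the exercise asks for the cleaned text (the 'clean' column)
news_df['stemmed'] = news_df['content'].apply(stem)
news_df['lemmatized'] = news_df['content'].apply(lemmatize)
news_df.head()

Unnamed: 0,title,content,category,clean,stemmed,lemmatized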
0,"Infosys' attrition drops to 27.1%, net employe...",Infosys on Thursday reported a 1.3% QoQ drop i...,business,infosys thursday reported 13 qoq drop voluntar...,infosi on thursday report a 1.3% qoq drop in i...,Infosys on Thursday reported a 1.3% QoQ drop i...
1,We do not support dual employment: Infosys on ...,Infosys CEO Salil Parekh spoke on the moonligh...,business,infosys ceo salil parekh spoke moonlighting de...,infosi ceo salil parekh spoke on the moonlight...,Infosys CEO Salil Parekh spoke on the moonligh...
2,Musk sells $1 million worth of 'Burnt Hair' pe...,The world's richest person Elon Musk earned a ...,business,world ' richest person elon musk earned millio...,the world' richest person elon musk earn a mil...,The world's richest person Elon Musk earned a ...
3,Mukesh Ambani visits Kedarnath & Badrinath shr...,Reliance Industries Chairman Mukesh Ambani on ...,business,reliance industries chairman mukesh ambani thu...,relianc industri chairman mukesh ambani on thu...,Reliance Industries Chairman Mukesh Ambani on ...
4,"IKEA lays off 10,000 employees after halting R...",Swedish ready-to-assemble furniture retailer I...,business,swedish readytoassemble furniture retailer ike...,swedish ready-to-assembl furnitur retail ikea ...,Swedish ready-to-assemble furniture retailer I...
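The columns requested in exercise 8 can also be assembled in one place. This is a minimal sketch, assuming each article is a dict with `title` and `content` keys and that the cleaning, stemming, and lemmatizing functions are passed in as callables. `prepare_records` and the toy stand-in functions are hypothetical, chosen so the example runs without NLTK or pandas. Here `stemmed` and `lemmatized` are derived from `clean`, as the exercise asks.

```python
def prepare_records(records, clean_fn, stem_fn, lemmatize_fn):
    """Return new dicts with title, original, clean, stemmed, and lemmatized keys."""
    prepared = []
    for record in records:
        # Clean once, then derive the stemmed and lemmatized text from the cleaned text
        clean = clean_fn(record['content'])
        prepared.append({
            'title': record['title'],
            'original': record['content'],
            'clean': clean,
            'stemmed': stem_fn(clean),
            'lemmatized': lemmatize_fn(clean),
        })
    return prepared

# Toy stand-ins (not real NLP): swap in the notebook's functions in practice
toy_clean = lambda text: text.lower().strip()
toy_stem = lambda text: ' '.join(word[:4] for word in text.split())
toy_lemmatize = lambda text: ' '.join(word.rstrip('s') for word in text.split())

articles = [{'title': 'Demo', 'content': ' Codeup values Diversity '}]
result = prepare_records(articles, toy_clean, toy_stem, toy_lemmatize)
print(result[0])
```

With the dataframes in this notebook, the equivalent would be to rename `content` to `original` and to build `stemmed` and `lemmatized` with `.apply` on the `clean` column.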
