# Exercises

The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will define functions to prepare text data. These functions should work equally well on the Codeup blog articles and the news articles that were acquired previously.


In [3]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire
import prepare

In [4]:
codeup_df = prepare.create_prepared_blog_df()


In [5]:
codeup_df.head()


Unnamed: 0,url,title,date_published,original,clean,stemmed,lemmatized
0,https://codeup.com/featured/what-jobs-can-you-...,What Jobs Can You Get After a Coding Bootcamp?...,"Jul 14, 2022",Have you been considering a career in Cloud Ad...,considering career cloud administration idea j...,consid career cloud administr idea job titl po...,considering career cloud administration idea j...
1,https://codeup.com/data-science/jobs-after-a-c...,What Jobs Can You Get After a Coding Bootcamp?...,"Jul 7, 2022",If you are interested in embarking on a career...,interested embarking career tech youre probabl...,interest embark career tech probabl wonder new...,interested embarking career tech youre probabl...
2,https://codeup.com/workshops/san-antonio/in-pe...,In-Person Workshop: Learn to Code – JavaScript...,"Jul 6, 2022",Join us for our live in-person JavaScript cras...,join us live inperson javascript crash course ...,join us live inperson javascript crash cours d...,join u live inperson javascript crash course d...
3,https://codeup.com/workshops/in-person-worksho...,In-Person Workshop: Learn to Code – Python on ...,"Jun 20, 2022","According to LinkedIn, the “#1 Most Promising ...",according linkedin 1 promising job data scienc...,accord linkedin 1 promis job data scienc one m...,according linkedin 1 promising job data scienc...
4,https://codeup.com/workshops/dallas/free-javas...,Free JavaScript Workshop at Codeup Dallas on 6...,"Jun 19, 2022",Event Info: \nLocation – Codeup Dallas\nTime –...,event info location codeup dallas time 6 pm co...,event info locat codeup dalla time 6 pm come l...,event info location codeup dallas time 6 pm co...


In [6]:
news_df = prepare.create_prepared_news_df()


Importing from csv


In [7]:
news_df.lemmatized[3]


'businessman anand mahindra took twitter praise pv sindhu singapore open sharing tweet doordarshan sport old image sindhu mahindra wrote thats facial expression expression soul fighter corenever getting demoralised slump teaching u rise'

In [8]:
df = acquire.get_blog_articles(True)


In [9]:
df = pd.DataFrame(df)


In [10]:
original = df.original[0]


In [11]:
article = original.lower()


### 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:
- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.



In [27]:
def basic_clean(article: str) -> str:
    """ Performs basic cleaning of a text string, article, by switching all letters to lowercase,
    normalizing unicode characters, and removing everything that is not a letter, number,
    whitespace, or single quote."""
    # Convert text to lowercase
    article = article.lower()

    # Remove accented characters. NFKD normalization splits each accented character
    # into a base character plus combining marks. Encoding to ASCII with 'ignore'
    # drops every non-ASCII character, and decode turns the bytes back into a string.
    article = (
        unicodedata.normalize('NFKD', article)
        .encode('ascii', 'ignore')
        .decode('utf-8', 'ignore')
    )

    # Remove anything that is not a through z, a number, a single quote, or whitespace
    article = re.sub(r"[^a-z0-9'\s]", '', article)

    return article

In [28]:
print(prepare.basic_clean(article))


have you been considering a career in cloud administration but have no idea what your job title or potential salary could be continue reading below to find out
in this miniseries we will take each of our programs here at codeup data science web development and cloud administration and outline respectively potential job titles as well as entrylevel salaries lets discuss cloud administration
program overview
at codeup we offer a 15week cloud administration program which was derived from our previous two programs systems engineering and cyber cloud we combined the best of both and blended handson practical knowledge with skilled instructors to create the cloud administration program
upon completing this program youll have the opportunity to take on two exams for certifications amazon web services aws cloud practitioner and aws solutions architect associate 
potential jobs
according to a cloud guru with an aws certification youll be equipped with the knowledge and experience to secure a jo
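
To see what each step of `basic_clean` contributes, the sketch below applies the same three steps to a short made-up string and keeps the intermediate results. The sample sentence and the helper name `basic_clean_steps` are illustrative assumptions, not part of prepare.py.

```python
import re
import unicodedata


def basic_clean_steps(text: str) -> dict:
    """Return the intermediate result of each cleaning step (hypothetical helper)."""
    lowered = text.lower()
    # NFKD splits accented letters into a base letter plus combining marks;
    # the ASCII encode with 'ignore' then drops every non-ASCII character.
    ascii_only = (
        unicodedata.normalize("NFKD", lowered)
        .encode("ascii", "ignore")
        .decode("utf-8", "ignore")
    )
    # Keep only lowercase letters, digits, straight single quotes, and whitespace.
    cleaned = re.sub(r"[^a-z0-9'\s]", "", ascii_only)
    return {"lowered": lowered, "ascii_only": ascii_only, "cleaned": cleaned}


# Made-up sample: a curly apostrophe, a straight apostrophe, an accent, and a hyphen.
steps = basic_clean_steps("Café: You’ll love it, don't wait 15-weeks!")
for name, value in steps.items():
    print(f"{name}: {value}")
```

The curly apostrophe is not an ASCII character, so it disappears in the encode step (`you’ll` becomes `youll`, as in the output above), while a straight single quote survives the regex.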

### 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.



In [29]:
def tokenize(article:str):
    """ Takes in a string, article, and tokenizes all words """
    
    tokenizer = nltk.tokenize.ToktokTokenizer()

    return tokenizer.tokenize(article, return_str=True)

In [30]:
tokenizer = nltk.tokenize.ToktokTokenizer()

print(tokenizer.tokenize(original, return_str=True))

Have you been considering a career in Cloud Administration , but have no idea what your job title or potential salary could be ? Continue reading below to find out ! 
In this mini-series , we will take each of our programs here at Codeup : Data Science , Web Development , and Cloud Administration , and outline respectively potential job titles , as well as entry-level salaries.* Let ’ s discuss Cloud Administration.
Program Overview
At Codeup , we offer a 15-week Cloud Administration program , which was derived from our previous two programs : Systems Engineering and Cyber Cloud. We combined the best of both and blended hands-on practical knowledge with skilled instructors to create the Cloud Administration program.
Upon completing this program , you ’ ll have the opportunity to take on two exams for certifications : Amazon Web Services ( AWS ) Cloud Practitioner and AWS Solutions Architect Associate. 
Potential Jobs
According to A Cloud Guru , with an AWS Certification you ’ ll be equ
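
The output above shows the effect of tokenizing: punctuation is split off from the words it was attached to, and `return_str=True` joins the tokens back into one space-separated string. As a rough illustration of that idea without NLTK, the sketch below uses a simple regular expression. It is a stand-in, not the Toktok algorithm, and `simple_tokenize` is a hypothetical name; its output will differ from `ToktokTokenizer` on many inputs.

```python
import re

# A word (allowing internal hyphens or straight apostrophes) or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def simple_tokenize(text: str, return_str: bool = False):
    """Split text into word and punctuation tokens (regex stand-in, not Toktok)."""
    tokens = TOKEN_PATTERN.findall(text)
    return " ".join(tokens) if return_str else tokens


sentence = "At Codeup, we offer a 15-week program!"
print(simple_tokenize(sentence))
print(simple_tokenize(sentence, return_str=True))
```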

### 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [31]:
def stem(article: str):
    """ Takes in a string, article, and returns text after applying stemming using Porter method """
    
    ps = nltk.stem.PorterStemmer()

    stems = [ps.stem(word) for word in article.split()]
    article_stemmed = ' '.join(stems)
    
    return article_stemmed

In [14]:
stems = prepare.stem(original)


In [15]:
stems

'have you been consid a career in cloud administration, but have no idea what your job titl or potenti salari could be? continu read below to find out! in thi mini-series, we will take each of our program here at codeup: data science, web development, and cloud administration, and outlin respect potenti job titles, as well as entry-level salaries.* let’ discuss cloud administration. program overview at codeup, we offer a 15-week cloud administr program, which wa deriv from our previou two programs: system engin and cyber cloud. we combin the best of both and blend hands-on practic knowledg with skill instructor to creat the cloud administr program. upon complet thi program, you’ll have the opportun to take on two exam for certifications: amazon web servic (aws) cloud practition and aw solut architect associate. potenti job accord to a cloud guru, with an aw certif you’ll be equip with the knowledg and experi to secur a job as the following: 1. cloud architect as a cloud architect, you 
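
In the output above, `administration,` keeps its full form while `administration` elsewhere becomes `administr`. The function splits on whitespace, so punctuation stays attached to the word and hides the suffix from the stemmer. The sketch below reproduces the effect with a deliberately tiny suffix stripper; `toy_stem` and its suffix list are illustrative assumptions and are far simpler than the Porter algorithm.

```python
def toy_stem(word: str) -> str:
    """Strip one of a few common suffixes (toy example, not the Porter stemmer)."""
    for suffix in ("ation", "ing", "s"):
        # Only strip when at least three characters of stem would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_text(text: str) -> str:
    """Apply toy_stem to each whitespace-separated word."""
    return " ".join(toy_stem(word) for word in text.split())


raw = "cloud administration, and programs: reading"
print(stem_text(raw))

# Stripping punctuation first exposes the suffixes to the stemmer.
cleaned = raw.replace(",", "").replace(":", "")
print(stem_text(cleaned))
```

This is one reason to run `basic_clean` on the text before stemming it.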

### 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.



In [32]:
def lemmatize(article: str):
    """ Accepts string as argument, article, and returns text after applying lemmatization to each word """
    
    wnl = nltk.stem.WordNetLemmatizer()
        
    lemmas = [wnl.lemmatize(word) for word in article.split()]
    article_lemmatized = ' '.join(lemmas)

    return article_lemmatized

In [16]:
prepare.lemmatize(original)


'Have you been considering a career in Cloud Administration, but have no idea what your job title or potential salary could be? Continue reading below to find out! In this mini-series, we will take each of our program here at Codeup: Data Science, Web Development, and Cloud Administration, and outline respectively potential job titles, a well a entry-level salaries.* Let’s discus Cloud Administration. Program Overview At Codeup, we offer a 15-week Cloud Administration program, which wa derived from our previous two programs: Systems Engineering and Cyber Cloud. We combined the best of both and blended hands-on practical knowledge with skilled instructor to create the Cloud Administration program. Upon completing this program, you’ll have the opportunity to take on two exam for certifications: Amazon Web Services (AWS) Cloud Practitioner and AWS Solutions Architect Associate. Potential Jobs According to A Cloud Guru, with an AWS Certification you’ll be equipped with the knowledge and ex

In [17]:
original


'Have you been considering a career in Cloud Administration, but have no idea what your job title or potential salary could be? Continue reading below to find out!\nIn this mini-series, we will take each of our programs here at Codeup: Data Science, Web Development, and Cloud Administration, and outline respectively potential job titles, as well as entry-level salaries.*\xa0Let’s discuss Cloud Administration.\nProgram Overview\nAt Codeup, we offer a 15-week Cloud Administration program, which was derived from our previous two programs: Systems Engineering and Cyber Cloud. We combined the best of both and blended hands-on practical knowledge with skilled instructors to create the Cloud Administration program.\nUpon completing this program, you’ll have the opportunity to take on two exams for certifications: Amazon Web Services (AWS) Cloud Practitioner and AWS Solutions Architect Associate.\xa0\nPotential Jobs\nAccording to A Cloud Guru, with an AWS Certification you’ll be equipped with 

### 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.
    
This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.



In [33]:
def remove_stopwords(article: str, extra_words=None, exclude_words=None):
    """ Accepts string (article) as argument and returns text after removing all the stopwords.
    extra_words: optional list of additional stop words (these words will be removed from the article)
    exclude_words: optional list of words we do not want to remove. These words are removed from the stopwords list and will remain in article """

    # Start from NLTK's English stopwords, add the extra words, then drop the exclusions.
    # Both parameters default to None so the function can be called with the text alone.
    stopword_set = set(stopwords.words('english'))
    stopword_set |= set(extra_words or [])
    stopword_set -= set(exclude_words or [])

    # Matching is case-sensitive and the NLTK list is lowercase,
    # so capitalized words are only removed if the text was lowercased first.
    words = article.split()
    filtered_words = [w for w in words if w not in stopword_set]

    article_without_stopwords = ' '.join(filtered_words)

    return article_without_stopwords

In [18]:
prepare.remove_stopwords(original, extra_words = ['Taryn', 'Month','chat'], exclude_words= ['for','We'])


'Have considering career Cloud Administration, idea job title potential salary could be? Continue reading find out! In mini-series, take programs Codeup: Data Science, Web Development, Cloud Administration, outline respectively potential job titles, well entry-level salaries.* Let’s discuss Cloud Administration. Program Overview At Codeup, offer 15-week Cloud Administration program, derived previous two programs: Systems Engineering Cyber Cloud. We combined best blended hands-on practical knowledge skilled instructors create Cloud Administration program. Upon completing program, you’ll opportunity take two exams for certifications: Amazon Web Services (AWS) Cloud Practitioner AWS Solutions Architect Associate. Potential Jobs According A Cloud Guru, AWS Certification you’ll equipped knowledge experience secure job following: 1. Cloud Architect As Cloud Architect, double IT specialist responsible for organization’s cloud infrastructure. This includes system monitoring, computing strategy
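
In the result above, `Have`, `In`, and `We` survive even though their lowercase forms are stopwords, and passing `'We'` in `exclude_words` changes nothing. Membership is tested case-sensitively, and NLTK's English stopword list is lowercase. A minimal sketch of that behavior follows, using a small hand-picked stopword set instead of the NLTK list; the set and the function name `filter_stopwords` are assumptions for illustration.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
SAMPLE_STOPWORDS = {"have", "you", "been", "a", "in", "we", "for", "no", "but"}


def filter_stopwords(text, extra_words=None, exclude_words=None):
    """Remove stopwords from whitespace-separated text (case-sensitive)."""
    stopword_set = (SAMPLE_STOPWORDS | set(extra_words or [])) - set(exclude_words or [])
    return " ".join(word for word in text.split() if word not in stopword_set)


text = "Have you been considering a career but have no idea"
print(filter_stopwords(text))          # capitalized "Have" is kept
print(filter_stopwords(text.lower()))  # lowercasing first removes it
print(filter_stopwords(text.lower(), extra_words=["career"], exclude_words=["no"]))
```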


### 6. Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe news_df.


In [20]:
news_df = acquire.get_news_articles()


Importing from csv


In [21]:
news_df = pd.DataFrame(news_df)


In [22]:
news_df


Unnamed: 0,title,author,datetime,category,original
0,Rupee closes at an all-time low of 79.98 again...,Ridham Gambhir,2022-07-18T11:00:17.000Z,business,The rupee on Monday hit a fresh record low as ...
1,Rupee hits record low of 79.97 against US dollar,Ridham Gambhir,2022-07-18T10:00:15.000Z,business,The rupee hit a record low of 79.97 against th...
2,"BCCI had ₹40 cr in bank when I joined & ₹47,68...",Ridham Gambhir,2022-07-17T06:35:36.000Z,business,"In an Instagram post, Lalit Modi asserted that..."
3,A fighter to the core: Mahindra praises PV Sin...,Ridham Gambhir,2022-07-17T08:17:31.000Z,business,Businessman Anand Mahindra took to Twitter to ...
4,RBI is of the view that cryptocurrencies shoul...,Hiral Goyal,2022-07-18T07:55:13.000Z,business,The Reserve Bank of India (RBI) has recommende...
...,...,...,...,...,...
95,Grace & style of Dhanush is something to behol...,Amartya Sharma,2022-07-18T10:27:56.000Z,entertainment,"Regé-Jean Page, speaking about Dhanush in 'The..."
96,"Want to show people I'm more than simple, inno...",Amartya Sharma,2022-07-18T11:08:29.000Z,entertainment,Actress Janhvi Kapoor has said 'Good Luck Jerr...
97,"Love you SRK for rehearsing with me, not throw...",Ria Kapoor,2022-07-18T11:20:55.000Z,entertainment,Actress Kashmera Shah took to Instagram to sha...
98,Didn't realise we were making memories: Juhi o...,Kriti Kambiri,2022-07-18T14:25:06.000Z,entertainment,Actress Juhi Chawla shared a video montage of ...



### 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.


### 8. For each dataframe, produce the following columns:
- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.



In [49]:
def prepare_df(df, original, extra_words=None, exclude_words=None):
    """Adds columns for cleaned, stemmed, and lemmatized data in dataframe.
    original: name of the column that holds the raw article text """
    # Use None defaults instead of mutable list defaults
    extra_words = extra_words or []
    exclude_words = exclude_words or []

    # Create cleaned data column from the column named by the original argument
    df['clean'] = df[original].apply(basic_clean).apply(tokenize).apply(remove_stopwords,
                                                       extra_words = extra_words,
                                                       exclude_words = exclude_words)
    
    # Create stemmed column with stemmed version of cleaned data
    df['stemmed'] = df.clean.apply(tokenize).apply(stem).apply(remove_stopwords,
                                                       extra_words = extra_words,
                                                       exclude_words = exclude_words)

    # Create lemmatized column with lemmatized version of cleaned data
    df['lemmatized'] = df.clean.apply(tokenize).apply(lemmatize).apply(remove_stopwords,
                                                       extra_words = extra_words,
                                                       exclude_words = exclude_words)
    
    return df[['title', original, 'clean', 'stemmed', 'lemmatized']]

In [50]:
prepare_df(df, 'original', extra_words = ['ha'], exclude_words = ['no'])

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,What Jobs Can You Get After a Coding Bootcamp?...,Have you been considering a career in Cloud Ad...,considering career cloud administration no ide...,consid career cloud administr no idea job titl...,considering career cloud administration no ide...
1,What Jobs Can You Get After a Coding Bootcamp?...,If you are interested in embarking on a career...,interested embarking career tech youre probabl...,interest embark career tech probabl wonder new...,interested embarking career tech youre probabl...
2,In-Person Workshop: Learn to Code – JavaScript...,Join us for our live in-person JavaScript cras...,join us live inperson javascript crash course ...,join us live inperson javascript crash cours d...,join u live inperson javascript crash course d...
3,In-Person Workshop: Learn to Code – Python on ...,"According to LinkedIn, the “#1 Most Promising ...",according linkedin 1 promising job data scienc...,accord linkedin 1 promis job data scienc one m...,according linkedin 1 promising job data scienc...
4,Free JavaScript Workshop at Codeup Dallas on 6...,Event Info: \nLocation – Codeup Dallas\nTime –...,event info location codeup dallas time 6 pm co...,event info locat codeup dalla time 6 pm come l...,event info location codeup dallas time 6 pm co...
5,Is Our Cloud Administration Program Right for ...,Changing careers can be scary. The first thing...,changing careers scary first thing may asking ...,chang career scari first thing may ask begin l...,changing career scary first thing may asking b...
6,"PRIDE in Tech Panel\nJun 5, 2022 | Dallas, San...","In celebration of PRIDE month, join our Codeup...",celebration pride month join codeup alumni lgb...,celebr pride month join codeup alumni lgbtqia ...,celebration pride month join codeup alumnus lg...
7,Inclusion at Codeup During Pride Month (and Al...,Happy Pride Month! Pride Month is a dedicated ...,happy pride month pride month dedicated time c...,happi pride month pride month dedic time celeb...,happy pride month pride month dedicated time c...
8,"Mental Health First Aid Training\nMay 31, 2022...","As a student of Codeup, going through a massiv...",student codeup going massive career transition...,student codeup go massiv career transit mental...,student codeup going massive career transition...
9,Codeup Dallas: How to Succeed at a Coding Boot...,This event is the perfect opportunity for peop...,event perfect opportunity people wondering exp...,event perfect opportun peopl wonder expect cod...,event perfect opportunity people wondering exp...


### 9. Ask yourself:
- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
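
One way to ground these answers is to time both functions on a small sample and extrapolate to the corpus size. The sketch below is a rough back-of-the-envelope helper. `seconds_per_megabyte` and `estimated_hours` are hypothetical names, the stand-in `str.upper` workload should be swapped for `prepare.stem` or `prepare.lemmatize`, and the linear extrapolation assumes runtime grows in proportion to text size.

```python
import time


def seconds_per_megabyte(func, sample: str, repeats: int = 3) -> float:
    """Time func on a sample string and return the best observed seconds per MB."""
    sample_mb = len(sample.encode("utf-8")) / 1_000_000
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(sample)
        timings.append(time.perf_counter() - start)
    return min(timings) / sample_mb


def estimated_hours(rate_seconds_per_mb: float, corpus_mb: float) -> float:
    """Linearly extrapolate a per-megabyte rate to a full corpus, in hours."""
    return rate_seconds_per_mb * corpus_mb / 3600


# Stand-in workload; replace str.upper with prepare.stem or prepare.lemmatize.
sample_text = "cloud administration program " * 10_000
rate = seconds_per_megabyte(str.upper, sample_text)

# Corpus sizes from the questions above, in decimal megabytes.
for label, size_mb in [("493KB", 0.493), ("25MB", 25), ("200TB", 200_000_000)]:
    print(f"{label}: ~{estimated_hours(rate, size_mb):.6f} hours")
```

Running this once with each function gives two rates to compare, which makes the trade-off between speed and output quality concrete at each corpus size.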