# NLP Prepare Exercises

In [1]:
# general imports
import numpy as np
import pandas as pd

# text and file handling
import unicodedata
import re
import json
import os

# NLP
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

# local modules
from acquire import (get_blog_articles_data, 
                     get_news_articles_data)
import prepare as p

In [2]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /Users/donq/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/donq/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### **Goal:** 
- The end result of this exercise should be a file named prepare.py that defines the requested functions. 
- In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

### **Acquisition**

Load both datasets up front:

In [3]:
codeup_df = get_blog_articles_data()
news_df = get_news_articles_data()

### **Preparation**

##### 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:
   - Lowercase everything
   - Normalize unicode characters
   - Remove (replace with an empty string) any character that is not a letter, number, whitespace, or a single quote.

In [4]:
text = "I need texting to Test some thing's out jump jumped jumping! have has had"

In [5]:
text = p.basic_clean(text)
text

'i need texting to test some things out jump jumped jumping have has had'
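The contents of prepare.py are not shown in this notebook, so the following is only a minimal sketch of how the three steps listed above could be written with the standard library. The name `basic_clean_sketch` is hypothetical. Note that this sketch keeps single quotes, as the exercise asks, whereas the output above shows `thing's` becoming `things`, so the author's version apparently strips them.

```python
import re
import unicodedata


def basic_clean_sketch(text: str) -> str:
    """Hypothetical stand-in for p.basic_clean; not the author's code."""
    # 1. Lowercase everything.
    text = text.lower()
    # 2. Normalize unicode: decompose accented characters, then drop
    #    whatever cannot be represented in ASCII.
    text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("utf-8")
    )
    # 3. Remove anything that is not a letter, digit, single quote or whitespace.
    return re.sub(r"[^a-z0-9'\s]", "", text)


cleaned = basic_clean_sketch("Caf\u00e9 D\u00c9J\u00c0 vu! It's 100%")
```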

##### 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.



In [6]:
tokenized = p.tokenize(text)

In [7]:
tokenized


'i need texting to test some things out jump jumped jumping have has had'
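The output above is a string rather than a list, which is consistent with a tokenizer that returns its tokens joined by spaces. A minimal sketch, assuming the `ToktokTokenizer` imported at the top of the notebook is what does the work (the name `tokenize_sketch` is hypothetical, and prepare.py may differ):

```python
from nltk.tokenize.toktok import ToktokTokenizer


def tokenize_sketch(text: str) -> str:
    """Hypothetical stand-in for p.tokenize; not the author's code."""
    tokenizer = ToktokTokenizer()
    # return_str=True gives back one space-separated string instead of a list.
    return tokenizer.tokenize(text, return_str=True)


tokens = tokenize_sketch("i need texting to test")
```

For text that is already lowercase with no punctuation, as here, the tokenized string is identical to the input.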

##### 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.



In [8]:
stemmed = p.stem(tokenized)
stemmed

'i need text to test some thing out jump jump jump have ha had'
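The stems above (`jumped` and `jumping` to `jump`, `has` to `ha`) match what a Porter stemmer produces. A minimal sketch, assuming NLTK's `PorterStemmer` and whitespace-separated input; the name `stem_sketch` is hypothetical and prepare.py may be written differently:

```python
from nltk.stem import PorterStemmer


def stem_sketch(text: str) -> str:
    """Hypothetical stand-in for p.stem; not the author's code."""
    stemmer = PorterStemmer()
    # Stem each whitespace-separated word and rejoin into one string.
    return " ".join(stemmer.stem(word) for word in text.split())


stems = stem_sketch("jump jumped jumping")
```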

##### 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.



In [9]:
lemmatized = p.lemmatize(tokenized)
lemmatized

'i need texting to test some thing out jump jump jump have have have'
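A minimal sketch of a lemmatize function follows. The name `lemmatize_sketch` and the `lemma_fn` parameter are hypothetical; the parameter lets the function be tried without the WordNet corpus downloaded. By default it falls back to NLTK's `WordNetLemmatizer`, which treats every word as a noun unless a part of speech is passed. The output above reduces `jumped` and `jumping` to `jump`, so prepare.py evidently does more than the bare default call; that detail is not shown in the notebook.

```python
from typing import Callable, Optional


def lemmatize_sketch(
    text: str, lemma_fn: Optional[Callable[[str], str]] = None
) -> str:
    """Hypothetical stand-in for p.lemmatize; not the author's code."""
    if lemma_fn is None:
        # Imported lazily: this path needs nltk.download('wordnet').
        from nltk.stem import WordNetLemmatizer

        lemma_fn = WordNetLemmatizer().lemmatize
    # Lemmatize each whitespace-separated word and rejoin into one string.
    return " ".join(lemma_fn(word) for word in text.split())


# Usage with a tiny hand-written lookup instead of WordNet.
lookup = {"jumped": "jump", "has": "have"}
lemmas = lemmatize_sketch("jump jumped has", lambda w: lookup.get(w, w))
```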

##### 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.



In [10]:
p.remove_stopwords(lemmatized)

'need texting test thing jump jump jump'
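A minimal sketch of stopword removal, assuming NLTK's English stopword list as the default. The name `remove_stopwords_sketch` and the `stopword_list`, `extra_words` and `exclude_words` parameters are hypothetical additions for illustration; the notebook only shows the function being called with a single argument.

```python
from typing import Iterable, Optional


def remove_stopwords_sketch(
    text: str,
    stopword_list: Optional[Iterable[str]] = None,
    extra_words: Iterable[str] = (),
    exclude_words: Iterable[str] = (),
) -> str:
    """Hypothetical stand-in for p.remove_stopwords; not the author's code."""
    if stopword_list is None:
        # Imported lazily: this path needs nltk.download('stopwords').
        from nltk.corpus import stopwords

        stopword_list = stopwords.words("english")
    # Add any extra stopwords, then drop any words the caller wants to keep.
    stops = (set(stopword_list) | set(extra_words)) - set(exclude_words)
    return " ".join(word for word in text.split() if word not in stops)


filtered = remove_stopwords_sketch("i need to test", stopword_list=["i", "to"])
```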

##### 6. Use your data from the acquire step to produce a dataframe of the news articles. Name the dataframe news_df.



In [11]:
# see top of notebook
news_df.head()

| Unnamed: 0 | title | content | category |
|---|---|---|---|
| 0 | WhatsApp responds to int'l calls scam, announc... | WhatsApp has ramped up its AI and machine lear... | national |
| 1 | Beyoncé wears colour-changing dress during co... | Singer Beyoncé wore a colour-changing dress d... | national |
| 2 | Gauahar Khan, Zaid Darbar blessed with a baby boy | Actress Gauahar Khan and her husband Zaid Darb... | national |
| 3 | Complaint filed against Prabhas, Kriti Sanon's... | A complaint has been filed against Prabhas and... | national |
| 4 | Yuzvendra Chahal creates history, takes most w... | RR leg-spinner Yuzvendra Chahal has created hi... | national |


##### 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.



In [12]:
# see top of notebook
codeup_df.head()

| Unnamed: 0 | title | content |
|---|---|---|
| 0 | Women in tech: Panelist Spotlight – Magdalena ... | \nCodeup is hosting a Women in Tech Panel in h... |
| 1 | Women in tech: Panelist Spotlight – Rachel Rob... | \nCodeup is hosting a Women in Tech Panel in h... |
| 2 | Women in Tech: Panelist Spotlight – Sarah Mellor | \nCodeup is hosting a Women in Tech Panel in ... |
| 3 | Women in Tech: Panelist Spotlight – Madeleine ... | \nCodeup is hosting a Women in Tech Panel in h... |
| 4 | Black Excellence in Tech: Panelist Spotlight –... | \n\nCodeup is hosting a Black Excellence in Te... |


In [13]:
print(codeup_df.loc[0,'content'])


Codeup is hosting a Women in Tech Panel in honor of Women’s History Month on March 29th, 2023! To further celebrate, we’d like to spotlight each of our panelists leading up to the discussion to learn a bit about their respective experiences as women in the tech industry!

Meet Magdalena!
Magdalena Rahn is a current Codeup student in a Data Science cohort in San Antonio, Texas. She has a professional background in cross-cultural communications, international business development, the wine industry and journalism. After serving in the US Navy, she decided to complement her professional skill set by attending the Data Science program at Codeup; she is set to graduate in March 2023. Magdalena is fluent in French, Bulgarian, Chinese-Mandarin, Spanish and Italian.
We asked Magdalena how Codeup impacted her career, and she replied “Codeup has provided a solid foundation in analytical processes, programming and data science methods, and it’s been an encouragement to have such supportive instr

In [14]:
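# The dataframe display shows each string's repr, so the leading newline
# appears there as the literal escape sequence \n. print() renders it as an
# actual line break, which is the blank line at the top of the printed output.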

##### 8. For each dataframe, produce the following columns:

  - title to hold the title
  - original to hold the original article/post content
  - clean to hold the normalized and tokenized original with the stopwords removed.
  - stemmed to hold the stemmed version of the cleaned data.
  - lemmatized to hold the lemmatized version of the cleaned data.

In [15]:
codeup_df = p.make_comparative_df(codeup_df)


In [16]:
news_df = p.make_comparative_df(news_df)
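The body of `make_comparative_df` lives in prepare.py and is not shown here. As a rough sketch of how the five requested columns could be assembled, the hypothetical `make_comparative_df_sketch` below takes the cleaning, stemming and lemmatizing steps as arguments, and assumes the input has `title` and `content` columns as both dataframes above do.

```python
from typing import Callable

import pandas as pd


def make_comparative_df_sketch(
    df: pd.DataFrame,
    clean_fn: Callable[[str], str],
    stem_fn: Callable[[str], str],
    lemmatize_fn: Callable[[str], str],
    content_col: str = "content",
) -> pd.DataFrame:
    """Hypothetical stand-in for p.make_comparative_df; not the author's code."""
    out = pd.DataFrame({"title": df["title"], "original": df[content_col]})
    # clean: normalized and tokenized text with stopwords removed.
    out["clean"] = out["original"].apply(clean_fn)
    # stemmed and lemmatized are both derived from the cleaned text.
    out["stemmed"] = out["clean"].apply(stem_fn)
    out["lemmatized"] = out["clean"].apply(lemmatize_fn)
    return out


# Toy usage with placeholder steps standing in for the real prepare functions.
toy = pd.DataFrame({"title": ["T"], "content": ["Hello World"]})
result = make_comparative_df_sketch(toy, str.lower, lambda s: s[:3], str.upper)
```

In practice `clean_fn` would chain the earlier steps, for example `lambda s: p.remove_stopwords(p.tokenize(p.basic_clean(s)))`, with `p.stem` and `p.lemmatize` passed for the other two.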

##### 9. Ask yourself:

  - If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
  - If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
  - If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?


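For the first two, probably lemmatized: at 493KB or 25MB the extra processing time is negligible, and lemmatization keeps real dictionary words (in the example above, "has" becomes "have" rather than "ha"). For the 200TB corpus billed by the megabyte, probably stemmed, since stemming is cheaper to compute, unless you have the budget and the extra accuracy seems worth the cost.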
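One way to put numbers behind that choice is to time both approaches on a sample of your own corpus before committing. The helper below is a hypothetical sketch using only the standard library; it measures any word-level function, and the `str.lower` call is just a placeholder baseline.

```python
import time
from typing import Callable, Sequence


def seconds_per_word(
    fn: Callable[[str], str], words: Sequence[str], repeats: int = 3
) -> float:
    """Best-of-N average time, in seconds, to apply fn to one word."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for word in words:
            fn(word)
        best = min(best, time.perf_counter() - start)
    return best / len(words)


sample = ["jump", "jumped", "jumping"] * 1000
baseline = seconds_per_word(str.lower, sample)
```

Passing `PorterStemmer().stem` and `WordNetLemmatizer().lemmatize` in place of `str.lower` gives per-word costs that can be scaled up to the size of the full corpus.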