---
# NLP Prepare Exercises
---

The end result of this exercise should be a file named `prepare.py` that defines the requested functions.

In this exercise we will define functions to prepare textual data. These functions should work equally well on the Codeup blog articles and the news articles that were previously acquired.

---

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

import acquire as a

---
## 1.

Define a function named `basic_clean`. It should take in a string and apply some basic text cleaning to it:
- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

In [2]:
def basic_clean(some_string):
    some_string = some_string.lower()
    some_string = unicodedata.normalize('NFKD', some_string).encode('ascii', 'ignore').decode('utf-8')
    some_string = re.sub(r"[^a-z0-9'\s]", '', some_string)
    return some_string
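
To make the three cleaning steps concrete, here is a minimal standalone sketch that applies them one at a time, using only the standard library. The sample sentence is made up for illustration. Note that a curly apostrophe (`’`) has no ASCII equivalent, so the `encode('ascii', 'ignore')` step drops it, which is why words like `youre` appear in the cleaned output later in this notebook.

```python
import re
import unicodedata

# Made-up sample text: accents, punctuation, a straight and a curly apostrophe.
text = "Café déjà vu: it's 100% naïve, and you’re done!"

# Step 1: lowercase everything.
lowered = text.lower()

# Step 2: decompose accented characters (NFKD), then drop anything non-ASCII.
ascii_only = (
    unicodedata.normalize("NFKD", lowered)
    .encode("ascii", "ignore")
    .decode("utf-8")
)

# Step 3: remove everything except letters, digits, single quotes, whitespace.
cleaned = re.sub(r"[^a-z0-9'\s]", "", ascii_only)

print(cleaned)  # cafe deja vu it's 100 naive and youre done
```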

---
## 2.

Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.

In [3]:
some_string = 'This is an example sentence'

In [4]:
def tokenize(some_string):
    tokenizer = nltk.tokenize.ToktokTokenizer()
    some_string = tokenizer.tokenize(some_string, return_str = True)
    return some_string
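
As a quick check of what the tokenizer does to punctuation, the sketch below runs `ToktokTokenizer` on a short, made-up phrase. The `clean` column produced later in this notebook contains stray `'` tokens (for example `china ' top`), which suggests the tokenizer separates apostrophes from the surrounding word; printing the result is the easiest way to confirm that on your own installation.

```python
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()

# Made-up phrase containing an apostrophe.
text = "china's top hot pot chain"

# return_str=True returns one space-separated string instead of a list.
tokenized = tokenizer.tokenize(text, return_str=True)
tokens = tokenized.split()

print(tokenized)
print(tokens)
```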

---
## 3.

Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.

In [5]:
def stem(some_string):
    # PorterStemmer is documented under nltk.stem
    ps = nltk.stem.PorterStemmer()
    stems = [ps.stem(word) for word in some_string.split()]
    some_string_stemmed = ' '.join(stems)
    return some_string_stemmed
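
The sketch below applies the Porter stemmer to a handful of words taken from the cleaned news text shown later in this notebook. It illustrates that stems are truncated forms, not necessarily dictionary words.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Words taken from the cleaned news text shown later in this notebook.
text = "department expenditure comes finance ministry"

# Stem each whitespace-separated word and join the stems back together.
stemmed = " ".join(ps.stem(word) for word in text.split())

print(stemmed)  # depart expenditur come financ ministri
```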

---
## 4.

Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.

In [6]:
def lemmatize(some_string):
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in some_string.split()]
    some_string_lemmatized = ' '.join(lemmas)
    return some_string_lemmatized

---
## 5.

Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords.
- This function should define two optional parameters, `extra_words` and `exclude_words`. These parameters should define any additional stop words to include, and any words that we *don't* want to remove.

In [7]:
def remove_stopwords(some_string, extra_words=None, exclude_words=None):
    # use None instead of mutable default arguments
    extra_words = extra_words or []
    exclude_words = exclude_words or []
    # start from the standard English stopwords
    stopword_set = set(stopwords.words('english'))
    # add the extra stopwords, then drop the words we want to keep
    stopword_set = stopword_set.union(extra_words).difference(exclude_words)
    words = some_string.split()
    filtered_words = [word for word in words if word not in stopword_set]
    some_string_without_stopwords = ' '.join(filtered_words)
    return some_string_without_stopwords
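
To show how the two optional parameters are meant to behave, here is a small self-contained sketch. It assumes a tiny hypothetical stopword set (`BASE_STOPWORDS`) in place of the NLTK English list so that it runs without downloading the corpus; the function name `remove_stopwords_sketch` is likewise illustrative only.

```python
# Hypothetical stand-in for stopwords.words('english'); not the real NLTK list.
BASE_STOPWORDS = {"a", "the", "is", "not", "to", "and"}


def remove_stopwords_sketch(text, extra_words=None, exclude_words=None):
    """Drop stopwords, honoring extra words to remove and words to keep."""
    stopword_set = set(BASE_STOPWORDS)
    stopword_set |= set(extra_words or [])    # additional words to remove
    stopword_set -= set(exclude_words or [])  # words we do not want removed
    return " ".join(w for w in text.split() if w not in stopword_set)


text = "the model is not ready to ship"

default_result = remove_stopwords_sketch(text)
extra_result = remove_stopwords_sketch(text, extra_words=["model"])
exclude_result = remove_stopwords_sketch(text, exclude_words=["not"])

print(default_result)  # model ready ship
print(extra_result)    # ready ship
print(exclude_result)  # model not ready ship
```

Keeping a word such as `not` via `exclude_words` can matter when negation carries meaning in the text.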

---
## 6.

Use the data from your acquire module to produce a dataframe of the news articles. Name the dataframe `news_df`.

In [8]:
topics = [
    'business',
    'sports',
    'technology',
    'entertainment'
]

In [9]:
news_df = a.get_articles(topics, 'https://inshorts.com/en/read/', 'Codeup Data Science Germain Cohort')

In [10]:
news_df.head()

Unnamed: 0,title,content,topic
0,Refer friends & get a chance to win Bitcoin wo...,CoinSwitch Kuber has launched 'CSK Referral Le...,business
1,China's new COVID-19 outbreak wipes $4 billion...,China's top hot pot chain has lost $4 billion ...,business
2,"Wow, 13 years ago: Musk on old video from when...",Tesla CEO and the world's richest person Elon ...,business
3,Shiba Inu jumps 40% to record high after anony...,Meme-based cryptocurrency Shiba Inu (SHIB) jum...,business
4,"Clear Air India dues, purchase tickets in cash...","The Department of Expenditure, which comes und...",business


---
## 7.

Make another dataframe for the Codeup blog posts. Name the dataframe `codeup_df`.

In [11]:
urls = [
    'https://codeup.com/data-science/why-you-should-become-a-data-scientist/',
    'https://codeup.com/data-science/math-in-data-science/',
    'https://codeup.com/data-science/transition-into-data-science/',
    'https://codeup.com/data-science/data-science-career/',
    'https://codeup.com/data-science/what-is-python/'
]

In [12]:
codeup_df = a.get_blog_articles(urls, 'Codeup Data Science Germain Cohort')

In [13]:
codeup_df.head()

Unnamed: 0,title,content
0,Why You Should Become a Data Scientist,"What do you look for in a career? Chances are,..."
1,What are the Math and Stats Principles You Nee...,"Coming into our Data Science program, you will..."
2,What is the Transition into Data Science Like?,Alumni Katy Salts and Brandi Reger joined us a...
3,What Data Science Career is For You?,If you’re struggling to see yourself as a data...
4,What is Python?,If you’ve been digging around our website or r...


---
## 8.

For each dataframe, produce the following columns:
- `title` to hold the title
- `original` to hold the original article/post content
- `clean` to hold the normalized and tokenized original with the stopwords removed.
- `stemmed` to hold the stemmed version of the cleaned data.
- `lemmatized` to hold the lemmatized version of the cleaned data.

In [14]:
def prep_nlp_data(df):
    df = df.rename(columns={'content' : 'original'})
    df['clean'] = df.original.apply(basic_clean)
    df['clean'] = df.clean.apply(tokenize)
    df['clean'] = df.clean.apply(remove_stopwords)
    df['stemmed'] = df.clean.apply(stem)
    df['lemmatized'] = df.clean.apply(lemmatize)
    return df
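
One detail worth noting: `rename` returns a new dataframe, so a function written this way leaves the dataframe passed in untouched. The sketch below demonstrates this with a toy dataframe and a hypothetical stand-in (`prep_sketch`) that only lowercases the text in place of the full clean, tokenize, and stopword steps.

```python
import pandas as pd


def prep_sketch(df):
    # rename returns a new DataFrame, so the caller's frame is not modified
    df = df.rename(columns={"content": "original"})
    # hypothetical stand-in for basic_clean -> tokenize -> remove_stopwords
    df["clean"] = df["original"].str.lower()
    return df


# Toy data, made up for illustration.
raw = pd.DataFrame({"title": ["Example"], "content": ["Some TEXT here"]})
prepared = prep_sketch(raw)

print(list(raw.columns))       # ['title', 'content']
print(list(prepared.columns))  # ['title', 'original', 'clean']
```

To keep the prepared columns, assign the result to a name, for example `news_df = prep_nlp_data(news_df)`.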

In [15]:
prep_nlp_data(news_df).head()

Unnamed: 0,title,original,topic,clean,stemmed,lemmatized
0,Refer friends & get a chance to win Bitcoin wo...,CoinSwitch Kuber has launched 'CSK Referral Le...,business,coinswitch kuber launched ' csk referral leagu...,coinswitch kuber launch ' csk referr leagu ' c...,coinswitch kuber launched ' csk referral leagu...
1,China's new COVID-19 outbreak wipes $4 billion...,China's top hot pot chain has lost $4 billion ...,business,china ' top hot pot chain lost 4 billion marke...,china ' top hot pot chain lost 4 billion marke...,china ' top hot pot chain lost 4 billion marke...
2,"Wow, 13 years ago: Musk on old video from when...",Tesla CEO and the world's richest person Elon ...,business,tesla ceo world ' richest person elon musk twe...,tesla ceo world ' richest person elon musk twe...,tesla ceo world ' richest person elon musk twe...
3,Shiba Inu jumps 40% to record high after anony...,Meme-based cryptocurrency Shiba Inu (SHIB) jum...,business,memebased cryptocurrency shiba inu shib jumped...,memebas cryptocurr shiba inu shib jump 40 hit ...,memebased cryptocurrency shiba inu shib jumped...
4,"Clear Air India dues, purchase tickets in cash...","The Department of Expenditure, which comes und...",business,department expenditure comes finance ministry ...,depart expenditur come financ ministri wednesd...,department expenditure come finance ministry w...


In [16]:
prep_nlp_data(codeup_df).head()

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Why You Should Become a Data Scientist,"What do you look for in a career? Chances are,...",look career chances youre looking way make use...,look career chanc your look way make use parti...,look career chance youre looking way make use ...
1,What are the Math and Stats Principles You Nee...,"Coming into our Data Science program, you will...",coming data science program need know math sta...,come data scienc program need know math stat h...,coming data science program need know math sta...
2,What is the Transition into Data Science Like?,Alumni Katy Salts and Brandi Reger joined us a...,alumni katy salts brandi reger joined us publi...,alumni kati salt brandi reger join us public p...,alumnus katy salt brandi reger joined u public...
3,What Data Science Career is For You?,If you’re struggling to see yourself as a data...,youre struggling see data science professional...,your struggl see data scienc profession may fi...,youre struggling see data science professional...
4,What is Python?,If you’ve been digging around our website or r...,youve digging around website researching tech ...,youv dig around websit research tech tool may ...,youve digging around website researching tech ...


---
## 9.

Ask yourself:
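- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
    - Lemmatized. The corpus is small enough that the slower lemmatizer is affordable, and it returns real dictionary words.
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
    - Stemmed. Lemmatization time grows with the corpus, and stemming is generally the cheaper approximation.
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
    - Stemmed. At this scale, cost dominates, so the cheaper rule-based approach is the practical choice.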