# <Span style = 'color: #008B8B'>Data Preparation</Span>

In [1]:
# pandas
import pandas as pd
import os
import acquire as a
from time import strftime


# unicode, regex, json for text digestion
import unicodedata
import re
import json

# nltk: natural language toolkit -> tokenization, stopwords
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords




import warnings
warnings.filterwarnings('ignore')

## <Span style = 'color: #008B8B'>Exercises</Span>
The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will be defining some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

#### 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Replace anything that is not a letter, number, whitespace or a single quote.

In [2]:
# note: the ascii encode / utf-8 decode step below can be turned on or off

# we will define a basic_clean function for a single document (one string)
def basic_clean(string):
    '''
    This function takes in a string and returns it lowercased, with unicode
    characters normalized to ascii and anything that is not a letter, number,
    whitespace or single quote removed.
    '''
    # normalize our data into standard NFKD unicode, encode it to ascii
    # (dropping any character with no ascii equivalent), then decode the
    # resulting bytes back into a python string
    string = unicodedata.normalize('NFKD', string)\
             .encode('ascii', 'ignore')\
             .decode('utf-8', 'ignore')
    # utilize our regex substitution to remove our undesirable characters, then lowercase
    # (\w already covers digits, and it also keeps underscores)
    string = re.sub(r"[^\w0-9'\s]", '', string).lower()
    return string
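
To make the normalization step easier to follow, here is a small standalone sketch of the same three operations, using only the standard library. The first sample is the visible start of news row 8 from the output further down; the `'Café déjà vu'` sample is a hypothetical string added only to show what happens to accented letters.

```python
import re
import unicodedata

# Visible start of the content of news row 8 (the display truncates the rest)
text = "Union Cabinet approved a ₹32,500-crore budget"

# NFKD decomposition, then drop anything with no ASCII equivalent (here, the rupee sign)
ascii_text = (unicodedata.normalize('NFKD', text)
              .encode('ascii', 'ignore')
              .decode('utf-8', 'ignore'))

# Remove punctuation with the same pattern basic_clean uses, then lowercase
cleaned = re.sub(r"[^\w0-9'\s]", '', ascii_text).lower()

# Hypothetical extra sample: accented letters keep their base letter
accented = (unicodedata.normalize('NFKD', 'Café déjà vu')
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore'))

print(ascii_text)  # Union Cabinet approved a 32,500-crore budget
print(cleaned)     # union cabinet approved a 32500crore budget
print(accented)    # Cafe deja vu
```

This is why tokens such as 32500crore appear in the clean column later on: the currency symbol, the comma and the hyphen are all removed without leaving a space behind.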
    

#### 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [3]:
def tokenize(string):
    '''
    This function takes in a string and
    returns a tokenized string.
    '''
    # make our tokenizer, taken from nltk's ToktokTokenizer
    tokenizer = nltk.tokenize.ToktokTokenizer()
    # apply our tokenizer's tokenization to the string being input, ensure it returns a string
    string = tokenizer.tokenize(string, return_str = True)
    
    return string

#### 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [4]:
def stem(string):
    '''
    This function takes in a string and
    returns a string with words stemmed.
    '''
    # create our stemming object
    ps = nltk.stem.PorterStemmer()
    # use a list comprehension => stem each word for each word inside of the entire document,
    # split on whitespace, which is the default for str.split()
    stems = [ps.stem(word) for word in string.split()]
    # glue it back together with spaces, as it was before
    string = ' '.join(stems)
    
    return string

#### 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [5]:

def lemmatize(string):
    '''
    This function takes in a string and
    returns a string with words lemmatized.
    '''
    # create our lemmatizer object
    wnl = nltk.stem.WordNetLemmatizer()
    # use a list comprehension to lemmatize each word
    # string.split() => output a list of every token inside of the document
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    # glue the lemmas back together with single spaces
    string = ' '.join(lemmas)
    #return the altered document
    return string

#### 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [6]:
def remove_stopwords(string, extra_words = None, exclude_words = None):
    '''
    This function takes in a string, plus optional extra_words and exclude_words
    lists (both default to no words), and returns the string with stopwords removed.
    '''
    # assign our stopwords from nltk into stopword_list
    stopword_list = stopwords.words('english')
    # utilizing set casting, i will remove any excluded stopwords
    stopword_set = set(stopword_list) - set(exclude_words or [])
    # add in any extra words to my stopwords set using a union
    stopword_set = stopword_set.union(set(extra_words or []))
    # split our document on whitespace
    words = string.split()
    # keep every word in our document that is not in our stopwords
    filtered_words = [word for word in words if word not in stopword_set]
    # glue it back together with spaces
    string_without_stopwords = ' '.join(filtered_words)
    # return the document back
    return string_without_stopwords
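
The effect of the two optional parameters can be sketched without nltk. The example below is only an illustration: `base_stopwords` is a hypothetical miniature list standing in for the nltk English stopwords, `filter_words` is a hypothetical helper that applies the same set logic, and the sample sentence is made up.

```python
# Hypothetical miniature stopword list, used only for illustration
base_stopwords = ['a', 'has', 'no', 'the', 'with']

def filter_words(text, extra_words=None, exclude_words=None):
    """Drop stopwords from text after adjusting the base list."""
    # exclude_words are taken out of the stopword set, so they stay in the text
    stopword_set = set(base_stopwords) - set(exclude_words or [])
    # extra_words are added to the stopword set, so they are removed from the text
    stopword_set |= set(extra_words or [])
    return ' '.join(word for word in text.split() if word not in stopword_set)

# Hypothetical sample sentence
text = 'the glitch let people with no money withdraw cash'

default_result = filter_words(text)
keep_no_result = filter_words(text, exclude_words=['no'])
drop_glitch_result = filter_words(text, extra_words=['glitch'], exclude_words=['no'])

print(default_result)      # glitch let people money withdraw cash
print(keep_no_result)      # glitch let people no money withdraw cash
print(drop_glitch_result)  # let people no money withdraw cash
```

Keeping a negation such as no through exclude_words matters here because removing it reverses the meaning of the sentence.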

#### 6. Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

In [7]:
news_df = a.get_news_articles_data()
news_df

Unnamed: 0,title,content,category
0,Antfin transfers 10.3% stake to Paytm chief Vi...,Antfin (Netherlands) Holding BV has transferre...,business
1,"Nepal asks India for rice, sugar to avert poss...",Nepal government has requested India to facili...,business
2,GQG Partners buys 8.1% stake in Adani Power fo...,Investment firm GQG Partners bought an 8.1% st...,business
3,"USDA cuts rice trade forecast for 2023, 2024 p...",US Department of Agriculture (USDA) lowered th...,business
4,Hyundai to buy General Motors' Talegaon plant ...,Hyundai Motor India signed an asset purchase a...,business
5,Nifty50 firms chiefs' average remuneration dro...,Combined remuneration for the heads of the Nif...,business
6,H&M probes labour abuses at Myanmar garment fa...,H&M is investigating 20 alleged instances of l...,business
7,Irish bank glitch lets people with no money wi...,A glitch in Bank of Ireland's app allowed cust...,business
8,"Union Cabinet approves ₹32,500 crore for seven...","Union Cabinet approved a ₹32,500-crore budget ...",business
9,Netherlands slips into recession,The Dutch economy has entered a recession as i...,business


#### 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [8]:
codeup_df = a.blog_soup()
codeup_df

Unnamed: 0,title,content
0,Spotlight on APIDA Voices: Celebrating Heritag...,May is traditionally known as Asian American a...
1,Women in tech: Panelist Spotlight – Magdalena ...,Women in tech: Panelist Spotlight – Magdalena ...
2,Women in tech: Panelist Spotlight – Rachel Rob...,Women in tech: Panelist Spotlight – Rachel Rob...
3,Women in Tech: Panelist Spotlight – Sarah Mellor,Women in tech: Panelist Spotlight – Sarah Mell...
4,Women in Tech: Panelist Spotlight – Madeleine ...,Women in tech: Panelist Spotlight – Madeleine ...
5,Black Excellence in Tech: Panelist Spotlight –...,Black excellence in tech: Panelist Spotlight –...


#### 8. For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.

In [9]:
news_df.rename(columns={'content': 'original'}, inplace=True)
codeup_df.rename(columns={'content': 'original'}, inplace=True)

In [10]:
news_df

Unnamed: 0,title,original,category
0,Antfin transfers 10.3% stake to Paytm chief Vi...,Antfin (Netherlands) Holding BV has transferre...,business
1,"Nepal asks India for rice, sugar to avert poss...",Nepal government has requested India to facili...,business
2,GQG Partners buys 8.1% stake in Adani Power fo...,Investment firm GQG Partners bought an 8.1% st...,business
3,"USDA cuts rice trade forecast for 2023, 2024 p...",US Department of Agriculture (USDA) lowered th...,business
4,Hyundai to buy General Motors' Talegaon plant ...,Hyundai Motor India signed an asset purchase a...,business
5,Nifty50 firms chiefs' average remuneration dro...,Combined remuneration for the heads of the Nif...,business
6,H&M probes labour abuses at Myanmar garment fa...,H&M is investigating 20 alleged instances of l...,business
7,Irish bank glitch lets people with no money wi...,A glitch in Bank of Ireland's app allowed cust...,business
8,"Union Cabinet approves ₹32,500 crore for seven...","Union Cabinet approved a ₹32,500-crore budget ...",business
9,Netherlands slips into recession,The Dutch economy has entered a recession as i...,business


In [11]:
codeup_df

Unnamed: 0,title,original
0,Spotlight on APIDA Voices: Celebrating Heritag...,May is traditionally known as Asian American a...
1,Women in tech: Panelist Spotlight – Magdalena ...,Women in tech: Panelist Spotlight – Magdalena ...
2,Women in tech: Panelist Spotlight – Rachel Rob...,Women in tech: Panelist Spotlight – Rachel Rob...
3,Women in Tech: Panelist Spotlight – Sarah Mellor,Women in tech: Panelist Spotlight – Sarah Mell...
4,Women in Tech: Panelist Spotlight – Madeleine ...,Women in tech: Panelist Spotlight – Madeleine ...
5,Black Excellence in Tech: Panelist Spotlight –...,Black excellence in tech: Panelist Spotlight –...


In [12]:
def prep_article_data(df, column, extra_words=[], exclude_words=[]):
    '''
    This function takes in a df and the string name of a text column, with the
    option to pass lists for extra_words and exclude_words, and returns a df with
    the article title, the original text, the cleaned text (normalized and
    tokenized, with stopwords removed), the stemmed text and the lemmatized text.
    Note: the clean, stemmed and lemmatized columns are also added to the df passed in.
    '''
    df['clean'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords,
                                  extra_words=extra_words,
                                  exclude_words=exclude_words)
    
    df['stemmed'] = df['clean'].apply(stem)
    
    df['lemmatized'] = df['clean'].apply(lemmatize)
    
    return df[['title', column,'clean', 'stemmed', 'lemmatized']]

In [13]:
prep_article_data(news_df, 'original', extra_words = ['ha'], exclude_words = ['no'])

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Antfin transfers 10.3% stake to Paytm chief Vi...,Antfin (Netherlands) Holding BV has transferre...,antfin netherlands holding bv transferred 103 ...,antfin netherland hold bv transfer 103 stake o...,antfin netherlands holding bv transferred 103 ...
1,"Nepal asks India for rice, sugar to avert poss...",Nepal government has requested India to facili...,nepal government requested india facilitate su...,nepal govern request india facilit suppli padd...,nepal government requested india facilitate su...
2,GQG Partners buys 8.1% stake in Adani Power fo...,Investment firm GQG Partners bought an 8.1% st...,investment firm gqg partners bought 81 stake a...,invest firm gqg partner bought 81 stake adani ...,investment firm gqg partner bought 81 stake ad...
3,"USDA cuts rice trade forecast for 2023, 2024 p...",US Department of Agriculture (USDA) lowered th...,us department agriculture usda lowered global ...,us depart agricultur usda lower global rice tr...,u department agriculture usda lowered global r...
4,Hyundai to buy General Motors' Talegaon plant ...,Hyundai Motor India signed an asset purchase a...,hyundai motor india signed asset purchase agre...,hyundai motor india sign asset purchas agreeme...,hyundai motor india signed asset purchase agre...
5,Nifty50 firms chiefs' average remuneration dro...,Combined remuneration for the heads of the Nif...,combined remuneration heads nifty50 companies ...,combin remuner head nifty50 compani fell 25 10...,combined remuneration head nifty50 company fel...
6,H&M probes labour abuses at Myanmar garment fa...,H&M is investigating 20 alleged instances of l...,hm investigating 20 alleged instances labour a...,hm investig 20 alleg instanc labour abus myanm...,hm investigating 20 alleged instance labour ab...
7,Irish bank glitch lets people with no money wi...,A glitch in Bank of Ireland's app allowed cust...,glitch bank ireland ' app allowed customers lo...,glitch bank ireland ' app allow custom low bal...,glitch bank ireland ' app allowed customer low...
8,"Union Cabinet approves ₹32,500 crore for seven...","Union Cabinet approved a ₹32,500-crore budget ...",union cabinet approved 32500crore budget seven...,union cabinet approv 32500crore budget seven r...,union cabinet approved 32500crore budget seven...
9,Netherlands slips into recession,The Dutch economy has entered a recession as i...,dutch economy entered recession shrank 03 quar...,dutch economi enter recess shrank 03 quarterli...,dutch economy entered recession shrank 03 quar...
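
The difference between the stemmed and lemmatized columns is easier to see token by token. The snippet below is a small illustration that lines up the first five tokens of news row 5 from the output above. The strings are copied from the displayed output, not recomputed, so it runs without nltk.

```python
# First five tokens of news row 5, copied from the displayed output (not recomputed)
clean = "combined remuneration heads nifty50 companies".split()
stemmed = "combin remuner head nifty50 compani".split()
lemmatized = "combined remuneration head nifty50 company".split()

rows = list(zip(clean, stemmed, lemmatized))

# Print the three versions of each token side by side
print(f"{'clean':<14}{'stemmed':<14}{'lemmatized'}")
for original, stemmed_token, lemma in rows:
    print(f"{original:<14}{stemmed_token:<14}{lemma}")

# Tokens that each method altered
changed_by_stemming = [o for o, s, _ in rows if o != s]
changed_by_lemmatizing = [o for o, _, l in rows if o != l]

print(changed_by_stemming)     # ['combined', 'remuneration', 'heads', 'companies']
print(changed_by_lemmatizing)  # ['heads', 'companies']
```

In this sample the stemmer alters four of the five tokens and leaves non-words such as compani, while the lemmatizer alters only the two plurals and returns dictionary words.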


In [14]:
prep_article_data(codeup_df, 'original', extra_words = ['ha'], exclude_words = ['no']).head()

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Spotlight on APIDA Voices: Celebrating Heritag...,May is traditionally known as Asian American a...,may traditionally known asian american pacific...,may tradit known asian american pacif island a...,may traditionally known asian american pacific...
1,Women in tech: Panelist Spotlight – Magdalena ...,Women in tech: Panelist Spotlight – Magdalena ...,women tech panelist spotlight magdalena rahn c...,women tech panelist spotlight magdalena rahn c...,woman tech panelist spotlight magdalena rahn c...
2,Women in tech: Panelist Spotlight – Rachel Rob...,Women in tech: Panelist Spotlight – Rachel Rob...,women tech panelist spotlight rachel robbinsma...,women tech panelist spotlight rachel robbinsma...,woman tech panelist spotlight rachel robbinsma...
3,Women in Tech: Panelist Spotlight – Sarah Mellor,Women in tech: Panelist Spotlight – Sarah Mell...,women tech panelist spotlight sarah mellor cod...,women tech panelist spotlight sarah mellor cod...,woman tech panelist spotlight sarah mellor cod...
4,Women in Tech: Panelist Spotlight – Madeleine ...,Women in tech: Panelist Spotlight – Madeleine ...,women tech panelist spotlight madeleine capper...,women tech panelist spotlight madelein capper ...,woman tech panelist spotlight madeleine capper...


#### 9. Ask yourself:

##### If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
- It would not matter much whether the text is stemmed or lemmatized, because the corpus is very small and either approach runs quickly.

##### If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
For a 25MB corpus, the choice between stemmed and lemmatized text depends on what the analysis needs. Stemming is faster but can produce non-words; lemmatization returns real dictionary words but is slower. Corpus size affects computation time, but 25MB is still modest (about the attachment limit of a typical email), so the decision should focus on the task's requirements and the trade-off between speed and linguistic accuracy.

##### If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
With a massive 200TB corpus and per-megabyte charges for hosted computational resources, the choice comes down to a cost trade-off: quick but sometimes inaccurate normalization (stemming) versus slower but more accurate normalization (lemmatization). Stemming is faster and uses fewer resources but may sacrifice linguistic accuracy. Lemmatization gives more accurate results but is slower and more resource-intensive. Weigh the analysis needs, accuracy requirements, available computational resources and costs when making the decision. A hybrid approach could also be considered.
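
Because this question involves being billed per megabyte, one rough way to compare the options is to measure how many bytes each version of the text occupies. The sketch below does this for the first six tokens of news row 9, copied from the output above. A sample this small only demonstrates the method; a real estimate would have to be measured on a representative slice of the corpus.

```python
# First six tokens of news row 9, copied from the displayed output (not recomputed)
samples = {
    'clean': "dutch economy entered recession shrank 03",
    'stemmed': "dutch economi enter recess shrank 03",
    'lemmatized': "dutch economy entered recession shrank 03",
}

# Size of each version in bytes when stored as UTF-8
sizes = {name: len(text.encode('utf-8')) for name, text in samples.items()}

for name, size in sizes.items():
    share = size / sizes['clean']
    print(f"{name:<11}{size:>3} bytes ({share:.0%} of clean)")
```

For this fragment the stemmed text is shorter than the clean text, while the lemmatized text is the same length.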