## Foundations of Text Data Preparation for NLP

### Project Overview
This project delves into the fundamental aspects of preparing text data for natural language processing (NLP) models. It encompasses various techniques including word and sentence tokenization, lexical diversity, and text manipulation within DataFrames. The notebook provides a comprehensive guide on opening and processing .txt files, handling parts of speech, named entity recognition, and text preprocessing techniques such as stemming and lemmatizing. By exploring these foundational methods, the project equips users with essential skills for effectively managing and analyzing text data in NLP applications.

### Part 1: Tokenization

This notebook focuses on tokenizing text. It uses `nltk` to tokenize the words and sentences of a given document. In general, tokenizing a text means splitting it into chunks; here, the chunks are individual "words" and "sentences". These chunks do not necessarily reflect proper grammatical structure or meaning; they result from splitting on white space, periods, and other punctuation.
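
To make the idea of "splitting into chunks" concrete, here is a minimal sketch that uses only regular expressions from the standard library. It is a rough stand-in for illustration, not the algorithm behind `nltk`'s tokenizers; the function names `simple_word_tokens` and `simple_sentences` are hypothetical, and the sample sentences are taken from the *Principia* passage used in this notebook.

```python
import re

def simple_word_tokens(text):
    # A token is either a run of word characters or a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sentences(text):
    # Split on whitespace that follows sentence-ending punctuation
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = (
    "To practical mechanics all the manual arts belong, "
    "from which mechanics took its name. "
    "But the errors are not in the art, but in the artificers."
)

print(simple_word_tokens(sample)[:10])
print(simple_sentences(sample))
```

A rule this simple would break on abbreviations such as "Dr." or on contractions, which is one reason to prefer a dedicated tokenizer for real text.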

#### Outline

- Word Tokenizing a String
- Sentence Tokenization of a string
- Unique Words with `set`
- Counts of words
- Lexical Diversity
- Text in a DataFrame

#### The Data

The notebook will use both a single piece of text in the form of a lead paragraph from Isaac Newton's *Principia* and a dataset including text data from [kaggle](https://www.kaggle.com/datasets/sankha1998/emotion?select=Emotion%28sad%29.csv) related to classifying WhatsApp status. 

In [37]:
principia = '''
Since the ancients (as we are told by Pappus), made great account of the science of mechanics in the investigation of natural things; and the moderns, laying aside substantial forms and occult qualities, have endeavoured to subject the phænomena of nature to the laws of mathematics, I have in this treatise cultivated mathematics so far as it regards philosophy. The ancients considered mechanics in a twofold respect; as rational, which proceeds accurately by demonstration: and practical. To practical mechanics all the manual arts belong, from which mechanics took its name. But as artificers do not work with perfect accuracy, it comes to pass that mechanics is so distinguished from geometry, that what is perfectly accurate is called geometrical, what is less so, is called mechanical. But the errors are not in the art, but in the artificers. He that works with less accuracy is an imperfect mechanic; and if any could work with perfect accuracy, he would be the most perfect mechanic of all; for the description if right lines and circles, upon which geometry is founded, belongs to mechanics. Geometry does not teach us to draw these lines, but requires them to be drawn; for it requires that the learner should first be taught to describe these accurately, before he enters upon geometry; then it shows how by these operations problems may be solved. To describe right lines and circles are problems, but not geometrical problems. The solution of these problems is required from mechanics; and by geometry the use of them, when so solved, is shown; and it is the glory of geometry that from those few principles, brought from without, it is able to produce so many things. Therefore geometry is founded in mechanical practice, and is nothing but that part of universal mechanics which accurately proposes and demonstrates the art of measuring. 
But since the manual arts are chiefly conversant in the moving of bodies, it comes to pass that geometry is commonly referred to their magnitudes, and mechanics to their motion. In this sense rational mechanics will be the science of motions resulting from any forces whatsoever, and of the forces required to produce any motions, accurately proposed and demonstrated. This part of mechanics was cultivated by the ancients in the five powers which relate to manual arts, who considered gravity (it not being a manual power), no otherwise than as it moved weights by those powers. Our design not respecting arts, but philosophy, and our subject not manual but natural powers, we consider chiefly those things which relate to gravity, levity, elastic force, the resistance of fluids, and the like forces, whether attractive or impulsive; and therefore we offer this work as the mathematical principles if philosophy; for all the difficulty of philosophy seems to consist in this—from the phænomena of motions to investigate the forces of nature, and then from these forces to demonstrate the other phænomena; and to this end the general propositions in the first and second book are directed. In the third book we give an example of this in the explication of the System of the World; for by the propositions mathematically demonstrated in the former books, we in the third derive from the celestial phenomena the forces of gravity with which bodies tend to the sun and the several planets. Then from these forces, by other propositions which are also mathematical, we deduce the motions of the planets, the comets, the moon, and the sea. 
I wish we could derive the rest of the phænomena of nature by the same kind of reasoning from mechanical principles; for I am induced by many reasons to suspect that they may all depend upon certain forces by which the particles of bodies, by some causes hitherto unknown, are either mutually impelled towards each other, and cohere in regular figures, or are repelled and recede from each other; which forces being unknown, philosophers have hitherto attempted the search of nature in vain; but I hope the principles here laid down will afford some light either to this or some truer method of philosophy.'''

In [38]:
print(principia)


Since the ancients (as we are told by Pappus), made great account of the science of mechanics in the investigation of natural things; and the moderns, laying aside substantial forms and occult qualities, have endeavoured to subject the phænomena of nature to the laws of mathematics, I have in this treatise cultivated mathematics so far as it regards philosophy. The ancients considered mechanics in a twofold respect; as rational, which proceeds accurately by demonstration: and practical. To practical mechanics all the manual arts belong, from which mechanics took its name. But as artificers do not work with perfect accuracy, it comes to pass that mechanics is so distinguished from geometry, that what is perfectly accurate is called geometrical, what is less so, is called mechanical. But the errors are not in the art, but in the artificers. He that works with less accuracy is an imperfect mechanic; and if any could work with perfect accuracy, he would be the most perfect mechanic of all

In [39]:
import nltk
import pandas as pd

In [40]:
from nltk import word_tokenize, sent_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\agnek\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Word Tokenizing a String

Task: use `word_tokenize` to split the string `principia` into individual tokens.

In [41]:
    
# Split the string
task1 = word_tokenize(principia)

# Check the result
print(type(task1))
print(task1[:5])

<class 'list'>
['Since', 'the', 'ancients', '(', 'as']



### Sentence Tokenization of a string

Task: rather than breaking a text apart into individual tokens or "words", split based on sentences using `sent_tokenize`. 

In [42]:
# Split based on sentences
task2 = sent_tokenize(principia)


# Check the result
print(type(task2))
print(task2[:2])

<class 'list'>
['\nSince the ancients (as we are told by Pappus), made great account of the science of mechanics in the investigation of natural things; and the moderns, laying aside substantial forms and occult qualities, have endeavoured to subject the phænomena of nature to the laws of mathematics, I have in this treatise cultivated mathematics so far as it regards philosophy.', 'The ancients considered mechanics in a twofold respect; as rational, which proceeds accurately by demonstration: and practical.']



### Unique Words with `set`

Task: the tokenization does not yield unique words.  To create a collection of unique words, use the `set` function along with `word_tokenize` to create a mathematical set object of the words from the principia.

In [43]:

# Create a mathematical set object of the words
task3 = set(word_tokenize(principia))

# Check the result
print(type(task3))
print(task3)

<class 'set'>
{'required', 'mathematical', 'it', 'depend', 'weights', 'their', 'laid', 'hope', 'we', 'manual', 'seems', 'practical', 'unknown', 'demonstrate', 'design', 'few', 'what', 'who', 'here', 'able', 'In', 'our', 'investigation', 'example', 'twofold', 'Our', 'perfectly', 'circles', 'otherwise', 'when', 'than', 'tend', 'could', 'elastic', 'attempted', 'commonly', 'endeavoured', 'truer', 'mutually', 'mechanical', 'I', 'demonstrated', 'He', 'shown', 'Since', 'them', 'levity', 'recede', 'this', 'proposes', 'took', 'great', 'describe', 'forces', 'teach', 'offer', 'accuracy', 'moderns', 'laws', 'System', 'reasons', 'World', 'gravity', 'phenomena', 'particles', 'us', 'ancients', 'demonstrates', 'consist', 'cohere', 'solved', 'five', 'regards', 'does', 'any', 'upon', 'brought', 'towards', ';', 'do', 'respecting', 'so', 'books', 'investigate', 'kind', 'causes', 'an', 'learner', 'bodies', 'should', 'was', 'propositions', 'search', 'no', 'that', 'give', 'by', 'without', 'many', 'Therefore'


### Counts of words

Task: determine the number of words in the principia text using `word_tokenize` and the `len` function.  

In [44]:
# Determine the number of words
task4 = len(word_tokenize(principia))

# Check the result
print(type(task4))
print(task4)

<class 'int'>
766
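
Because `word_tokenize` emits punctuation marks as separate tokens (see the `'('` in the earlier output), a total like the one above counts punctuation as well as words. As a hedged sketch, `collections.Counter` can show how the total breaks down per token. The `tokens` list below is written out by hand from the opening of the tokenized text printed in this notebook, so the block runs without `nltk`.

```python
from collections import Counter

# First 27 tokens of the passage, copied from the notebook's tokenized output
tokens = [
    'Since', 'the', 'ancients', '(', 'as', 'we', 'are', 'told', 'by',
    'Pappus', ')', ',', 'made', 'great', 'account', 'of', 'the', 'science',
    'of', 'mechanics', 'in', 'the', 'investigation', 'of', 'natural',
    'things', ';',
]

counts = Counter(tokens)

print(len(tokens))            # total tokens, punctuation included
print(len(counts))            # distinct tokens
print(counts.most_common(2))  # the two most frequent tokens
```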



### Lexical Diversity

Task: the lexical diversity of a text is the ratio of unique words to total words. Compute the lexical diversity of the principia text. Use the `len` function to count the unique words and the total words.


In [45]:
# Divide the number of unique tokens by the total number of tokens (both counted with `len`)
task5 = len(set(word_tokenize(principia)))/len(word_tokenize(principia))

# Check the result
print(type(task5))
print(task5)
print(task1)
print(task2)

<class 'float'>
0.370757180156658
['Since', 'the', 'ancients', '(', 'as', 'we', 'are', 'told', 'by', 'Pappus', ')', ',', 'made', 'great', 'account', 'of', 'the', 'science', 'of', 'mechanics', 'in', 'the', 'investigation', 'of', 'natural', 'things', ';', 'and', 'the', 'moderns', ',', 'laying', 'aside', 'substantial', 'forms', 'and', 'occult', 'qualities', ',', 'have', 'endeavoured', 'to', 'subject', 'the', 'phænomena', 'of', 'nature', 'to', 'the', 'laws', 'of', 'mathematics', ',', 'I', 'have', 'in', 'this', 'treatise', 'cultivated', 'mathematics', 'so', 'far', 'as', 'it', 'regards', 'philosophy', '.', 'The', 'ancients', 'considered', 'mechanics', 'in', 'a', 'twofold', 'respect', ';', 'as', 'rational', ',', 'which', 'proceeds', 'accurately', 'by', 'demonstration', ':', 'and', 'practical', '.', 'To', 'practical', 'mechanics', 'all', 'the', 'manual', 'arts', 'belong', ',', 'from', 'which', 'mechanics', 'took', 'its', 'name', '.', 'But', 'as', 'artificers', 'do', 'not', 'work', 'with', 'per
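
The ratio above is computed on raw tokens, so punctuation marks count as "words" and capitalized forms such as `'But'` and `'but'` count as different words. The sketch below shows how the figure can shift once tokens are lowercased and punctuation is dropped. The helper name `lexical_diversity` is hypothetical, and the token list is written out by hand for one sentence of the passage rather than produced by `nltk`.

```python
def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens; 0.0 for an empty list."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Hand-written tokens for: "But the errors are not in the art, but in the artificers."
tokens = ['But', 'the', 'errors', 'are', 'not', 'in', 'the', 'art', ',',
          'but', 'in', 'the', 'artificers', '.']

# Keep alphabetic tokens only and fold case before comparing
words = [t.lower() for t in tokens if t.isalpha()]

print(lexical_diversity(tokens))  # raw tokens: 11 distinct out of 14
print(lexical_diversity(words))   # normalized words: 8 distinct out of 12
```

Whether to normalize depends on the question being asked; the raw ratio is what the task in this notebook specifies.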


### Text in a DataFrame

To this point, we have been dealing with a block of text. How do you work with multiple lines of text in a DataFrame? You can use `set` to determine the number of unique words (as above), but this will only provide a result PER ITEM, not for the entire DataFrame. 

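Task: to determine the total number of words in a DataFrame, first apply `word_tokenize` to the column with the `.apply` method, then sum the resulting column to concatenate the per-row token lists into a single list of all (non-unique) words. Use `len` on that list to count the words in the `content` feature of `happy_df`, loaded below. Note that the cell below wraps the list in `set` before calling `len`, so the number it prints is the count of unique tokens rather than the total.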

In [47]:
# Import dataset
happy_df = pd.read_csv(r'C:\Users\agnek\OneDrive\Documents\Data\Emotion(happy).csv')
happy_df.head()

Unnamed: 0,content,sentiment
0,Wants to know how the hell I can remember word...,happy
1,Love is a long sweet dream & marriage is an al...,happy
2,The world could be amazing when you are slight...,happy
3,My secret talent is getting tired without doin...,happy
4,"Khatarnaak Whatsapp Status Ever… Can\’t talk, ...",happy


In [48]:
# Tokenize each row, concatenate the token lists, and count the unique tokens (because of `set`)
task6 = len(set(happy_df['content'].apply(word_tokenize).sum()))

print(type(task6))
print(task6)

<class 'int'>
2097
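
The distinction between total words, unique words across the whole column, and unique words per row can be sketched with plain Python lists. Everything here is illustrative: `rows` holds hypothetical status messages standing in for the `content` column, and `str.split` is a placeholder for `word_tokenize`. `itertools.chain` flattens the per-row lists, which avoids the repeated list copying that concatenating lists one by one involves.

```python
from itertools import chain

# Hypothetical status messages standing in for a DataFrame text column
rows = [
    "love is a long sweet dream",
    "the world could be amazing",
    "love the world",
]

# str.split is a placeholder tokenizer; swap in word_tokenize when NLTK is available
tokenized = [row.split() for row in rows]

# Flatten the per-row token lists into one list
all_tokens = list(chain.from_iterable(tokenized))

total_words = len(all_tokens)                                 # every token, repeats included
unique_words = len(set(all_tokens))                           # distinct tokens across all rows
unique_per_row = [len(set(tokens)) for tokens in tokenized]   # distinct tokens within each row

print(total_words, unique_words, unique_per_row)
```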


### Part 2: Named Entities

This part of the notebook focuses on extracting named entities from text.  The named entities will be extracted using the `nltk` library.  It will read in the full text of Newton's *Principia* and identify the entities labeled as places.  

#### Outline

- Opening a `.txt` file
- Tokenizing the text
- Part of Speech Tags 
- Named Entities
- Removing People
- Removing stopwords

In [49]:
import nltk
from nltk import word_tokenize
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\agnek\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\agnek\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\agnek\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\agnek\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\agnek\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

#### Opening a `.txt` file

Task: use the `open` function to open the text file containing Isaac Newton's *Principia* at the filepath given below. Use the `readlines()` method to assign the text, as a list of lines, to the variable `principia`.

In [50]:
filepath = r'C:\Users\agnek\OneDrive\Documents\Data\Philosophiae_Naturalis_Principia_Mathematica.txt'

In [51]:
# Use the open function to open the text file 
with open(filepath) as f:
    principia = f.readlines()

# Check the result
print(type(principia))

<class 'list'>



### Tokenizing the text

Task: using the `principia` variable from problem 1, join the lines into one string with `' '.join()` and pass the result to `word_tokenize` to create a list of tokens named `tokens`.

In [52]:
# Create a list of tokens named `tokens`
tokens = word_tokenize(' '.join(principia))


# Check the result
print(type(tokens))
print(tokens[:5])

<class 'list'>
['Philosophiae', 'Naturalis', 'Principia', 'Mathematica', 'Isaacus']



### Part of Speech Tags 

Task: use the `pos_tag` function to create the part of speech tagged corpus of the principia text.  Assign the tagged text to the variable `words_pos`.

In [53]:
# Create the part of speech tagged corpus of the principia text
words_pos = nltk.pos_tag(tokens)

# Check the result
print(type(words_pos))
print(words_pos[:5])

<class 'list'>
[('Philosophiae', 'NNP'), ('Naturalis', 'NNP'), ('Principia', 'NNP'), ('Mathematica', 'NNP'), ('Isaacus', 'NNP')]



### Named Entities

Task: use the tagged words in `words_pos` to create a list of tuples in the form (word, entity type) if the word has a named entity label.  Assign these tuples to the list `named_entities`.

In [54]:
# Create a list of tuples in the form (word, entity type) 
named_entities = []
for word in nltk.ne_chunk(words_pos):
    if hasattr(word, 'label'):
        named_entities.append((' '.join(c[0] for c in word.leaves()), word.label()))

# Check the result
print(type(named_entities))
print(named_entities[:5])

<class 'list'>
[('Philosophiae', 'GSP'), ('Naturalis Principia Mathematica Isaacus Newtonus', 'PERSON'), ('Wikisource', 'GPE'), ('INDEX Tituli', 'ORGANIZATION'), ('Auctoris', 'GPE')]



### Removing People

Task: filter the `named_entities` list to keep only the entities labeled `GPE` (geopolitical entity), and create a list of these words, lowercased, as `places`.

In [55]:
# Create a list of these words lowercased as places
places = [i[0].lower() for i in named_entities if i[1] == 'GPE']

# Check the result
print(type(places))
print(places[:5])

<class 'list'>
['wikisource', 'auctoris', 'umbilico', 'orbibus', 'orbibus']
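
Several of the first entries look like Latin vocabulary rather than place names, presumably because the tagger and chunker are applied here to a Latin text. One quick way to inspect what was labeled is to count how often each entry occurs. The sketch below uses `collections.Counter` on the five entries printed above, copied by hand so the block runs on its own.

```python
from collections import Counter

# The first five entries of `places`, copied from the output above
places = ['wikisource', 'auctoris', 'umbilico', 'orbibus', 'orbibus']

place_counts = Counter(places)

# Most frequent entry first; useful for spotting repeated false positives
print(place_counts.most_common(1))
print(sorted(place_counts))
```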



### Removing stopwords

Task: remove all stopwords from the list `places`.  Assign the remaining words as a list to `no_stops`.

In [56]:
from nltk.corpus import stopwords

In [57]:
# Remove all stopwords; build the stopword set once for fast membership tests
english_stops = set(stopwords.words('english'))
no_stops = [i for i in places if i not in english_stops]

# Check the result
print(type(no_stops))
print(no_stops)

<class 'list'>
['wikisource', 'auctoris', 'umbilico', 'orbibus', 'orbibus', 'superficiebus', 'mediis', 'fluida']


### Part 3: Stemming and Lemmatization

In this part, the notebook stems and lemmatizes text in order to normalize it. It applies the stemmer and lemmatizer to a basic list of words and then turns to data in a DataFrame, writing a function that applies stemming to a column of text data. The data is the WhatsApp status dataset from kaggle, and the notebook focuses on the `content` feature.

- Stemming a list of words
- Lemmatizing a list of words
- A function for stemming
- Using the stemmer on a DataFrame

In [58]:
import pandas as pd
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import word_tokenize
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\agnek\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\agnek\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\agnek\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#### The Data

The text data again comes from [kaggle](https://www.kaggle.com/datasets/sankha1998/emotion?select=Emotion%28sad%29.csv) and is related to classifying WhatsApp status. We load in only the "angry" sentiment below.


In [60]:
angry = pd.read_csv(r'C:\Users\agnek\OneDrive\Documents\Data\Emotion(angry).csv')

In [61]:
angry.head()

Unnamed: 0,content,sentiment
0,"Sometimes I’m not angry, I’m hurt and there’s ...",angry
1,Not available for busy people☺,angry
2,I do not exist to impress the world. I exist t...,angry
3,Everything is getting expensive except some pe...,angry
4,My phone screen is brighter than my future 🙁,angry



### Stemming a list of words

Task: use `PorterStemmer` to stem the different variations on the word "compute" in the list `C` below.  Assign your results to the list `stemmed_words`. 

In [62]:
C = ['computer', 'computing', 'computed', 'computes', 'computation', 'compute']

In [63]:
# Stem the different variations on the word "compute" 
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(w) for w in C]

# Check the result
print(type(stemmed_words))
print(stemmed_words)

<class 'list'>
['comput', 'comput', 'comput', 'comput', 'comput', 'comput']



### Lemmatizing a list of words

Task: use `WordNetLemmatizer` to lemmatize the different variations on the word "compute" in the list `C` defined above.  Assign your results to the list `lemmatized_words`. 

In [64]:
# Use WordNetLemmatizer to lemmatize the different variations on the word "compute"
lemma = WordNetLemmatizer()
lemmatized_words = [lemma.lemmatize(w) for w in C]

# Check the result
print(type(lemmatized_words))
print(lemmatized_words)

<class 'list'>
['computer', 'computing', 'computed', 'computes', 'computation', 'compute']
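
The words come back unchanged because `WordNetLemmatizer.lemmatize` takes an optional `pos` argument that defaults to `'n'` (noun), so verb forms such as "computing" are looked up as nouns. The toy sketch below illustrates why the part of speech matters. It is not WordNet: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical, and the table holds only the handful of entries needed for this list. In NLTK itself, the corresponding call would be `lemma.lemmatize(w, pos='v')`.

```python
# Hypothetical lookup table keyed by (word, part of speech); not WordNet data
TOY_LEMMAS = {
    ('computing', 'v'): 'compute',
    ('computed', 'v'): 'compute',
    ('computes', 'v'): 'compute',
}

def toy_lemmatize(word, pos='n'):
    # Fall back to the word itself when the (word, pos) pair is unknown
    return TOY_LEMMAS.get((word, pos), word)

C = ['computer', 'computing', 'computed', 'computes', 'computation', 'compute']

print([toy_lemmatize(w) for w in C])           # treated as nouns: unchanged
print([toy_lemmatize(w, pos='v') for w in C])  # treated as verbs: inflections collapse
```

Unlike the stemmer's output `'comput'`, the lemmas here are dictionary words, and "computer" and "computation" stay distinct from the verb.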


### A function for stemming

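Task: use `PorterStemmer` to create a function called `stemmer`. This function should take in a string of text and return a string of stemmed text. Note that you will need to tokenize the text before stemming and should join the stemmed tokens back into a single string. Note also that defining this function rebinds the name `stemmer`, which held the `PorterStemmer` instance in the earlier cell.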


In [65]:
# Define the function
def stemmer(text):
    '''
    This function takes in a string of text and returns
    a string of stemmed text.
    
    Arguments
    ---------
    text: str
        string of text to be stemmed
        
    Returns
    -------
    str
       string of stemmed words from the text input
    '''
    
    
    stem = PorterStemmer()
    return ' '.join([stem.stem(w) for w in word_tokenize(text)])


# Check the result
text = 'The computer did not compute the answers correctly.'
print(text)
print(stemmer(text))

The computer did not compute the answers correctly.
the comput did not comput the answer correctli .


### Using the stemmer on a DataFrame

Task: apply the function `stemmer` to the `content` feature of the DataFrame `angry` using the `.apply` method. Assign the resulting Series to `stemmed_content`.


In [36]:
# Apply the function `stemmer` to the `content` feature of the DataFrame `angry`
stemmed_content = angry['content'].apply(stemmer)

# Check the result
print(type(stemmed_content))
print(stemmed_content.head())

<class 'pandas.core.series.Series'>
0    sometim i ’ m not angri , i ’ m hurt and there...
1                           not avail for busi people☺
2    i do not exist to impress the world . i exist ...
3    everyth is get expens except some peopl , they...
4          my phone screen is brighter than my futur 🙁
Name: content, dtype: object
