# 1 **Introduction**

Throughout this notebook,
* Our goal is to explore the fundamentals of Natural Language Processing (NLP) and equip you with essential techniques for working with text data.
* We will dive into key concepts such as **tokenization, stemming, stop-word removal, and more**, which are crucial for preprocessing textual information.
* Our ultimate aim is to showcase the practical application of these techniques by guiding you through a simple sentiment analysis task.
* By the end of this notebook, you will have a solid understanding of NLP basics and be able to apply them in real-world scenarios.
* You will find code examples written as functions that follow the **Google code style format**.
 * This formatting style follows a set of guidelines to enhance code readability, maintainability, and consistency.

Here is an example of our final task: **Sentiment Analysis** with **Natural Language Processing (NLP)**.

![image.png](attachment:3a0c9530-b4c9-48f1-a7be-749ee3646daf.png)

**I just heard someone say we should explain NLP first, and this is their reaction 😡
OK, don't be angry, let's start from scratch :)**

**Natural Language Processing:**
> is a branch of **Artificial Intelligence** that deals with bridging the gap between machines and humans, so that machines can understand people in their natural language.

> NLP aims to enable computers to **understand and generate human language** in a way that is both meaningful and useful.

> NLP has a wide range of applications, including chatbots, virtual assistants, information retrieval, sentiment analysis, language translation, content generation, and many more.

![image.png](attachment:1d5cfb2d-d626-4daa-b832-73b5341746eb.png)

**Now OK? ok**

So let's start with our simple task, to make everything as clear as possible as we explain the subfields and techniques that work together to process and analyze natural language.

# 2 **Preprocessing Text Data**

![bad-data-cleansing-actvities.jpg](attachment:b75f729b-5160-48e8-a15d-654e8f8fe9e3.jpg)

Preprocessing text refers to the **series of steps and techniques applied to raw text data** in order to **clean, normalize, and transform it** into a format that is more suitable for analysis, modeling, or further processing. Preprocessing is an essential step in text mining, natural language processing (NLP), and various other text-related tasks.
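
To make "clean, normalize, and transform" concrete, here is a minimal sketch that uses only the Python standard library. The function name `normalize_text` is a hypothetical helper invented for this illustration; it is not part of NLTK, and real pipelines usually need more careful rules.

```python
import re
import string


def normalize_text(text):
    """Lowercases text, removes ASCII punctuation, and collapses whitespace."""
    text = text.lower()  # Normalize case
    text = text.translate(str.maketrans("", "", string.punctuation))  # Drop punctuation
    text = re.sub(r"\s+", " ", text).strip()  # Collapse runs of whitespace
    return text


print(normalize_text("  Hello,   how are YOU?  "))  # prints: hello how are you
```

Note that removing punctuation this way also deletes apostrophes, so "couldn't" becomes "couldnt". Whether that is acceptable depends on the task.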

____________________________________________________________________________________________________________________________________________________________________________________________________


Skip this cell now so as not to confuse yourself

In [None]:
import nltk

nltk.download('punkt')  # Tokenizer models used by word_tokenize
nltk.download('stopwords')  # Stop-word lists used for stop-word removal
nltk.download('averaged_perceptron_tagger')  # Model used for POS tagging
nltk.download('maxent_ne_chunker')  # Model used by NLTK's NER chunker
nltk.download('vader_lexicon')  # Lexicon used for sentiment analysis
nltk.download('words')  # Word list required by the NER chunker
# Note: newer NLTK releases may ask for differently named resources (for example 'punkt_tab');
# if a LookupError appears, download the resource name it reports.

____________________________________________________________________________________________________________________________________________________________________________________________________


## 2.1 **Tokenization**

Tokenization is the process of **breaking down text into smaller units called tokens.** In NLP, tokens are usually words, but they can also be phrases or other meaningful units of text. Tokenization is an important step in text preprocessing because it helps to reduce the complexity of the text data and makes it easier to process.

![image.png](attachment:5e7ed2c9-66ec-4a14-92b8-d2f53bc6089e.png)

> Here is an example of how we can tokenize a sample text using the **nltk library**:

In [None]:
from nltk.tokenize import word_tokenize


def tokenize_text(text):
    """
    Tokenizes the given text into a list of words.

    Args:
        text (str): The input text to be tokenized.

    Returns:
        list: A list of tokens (words and punctuation marks) extracted from the text.
    """
    tokens = word_tokenize(text)  # Tokenize the text into words
    return tokens


if __name__ == '__main__':
    text = "Hello, how are you? did you know that programmers program with programming languages?"
    tokens = tokenize_text(text)
    print(tokens)


**Do you see how the text is broken down into smaller units (words, punctuation marks, ...)?**

OK, remember the last output and let's continue... :)
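
To see what a tokenizer does under the hood, here is a rough sketch built on a single regular expression from the standard library. `simple_tokenize` is a hypothetical name used only for this illustration; it is much cruder than NLTK's `word_tokenize` (for example, the two treat contractions such as "don't" differently).

```python
import re


def simple_tokenize(text):
    """Splits text into runs of word characters and single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one non-word, non-space character
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Hello, how are you?"))
# prints: ['Hello', ',', 'how', 'are', 'you', '?']
```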

____________________________________________________________________________________________________________________________________________________________________________________________________


## 2.2 **Stemming/Lemmatization**

Stemming is the process of **reducing words to their root or stem form**. In NLP, stemming is used **to reduce the number of unique words in the text data and to group together words that share a stem**. There are several algorithms for stemming, such as the **Porter stemming algorithm and the Snowball stemming algorithm**.

Lemmatization usually refers to **doing things properly with the use of a vocabulary and morphological analysis of words**. Unlike stemming, which applies simple rules to remove suffixes from words, lemmatization considers the word's context and **applies morphological analysis** to produce the dictionary form (the lemma).

![image.png](attachment:b57ac298-c546-47f7-9b1e-22021c256f03.png)

>  Here is an example of how we can perform stemming on the sample text using the **Porter stemming algorithm:**

In [None]:
from nltk.stem import PorterStemmer


def stem_tokens(tokens):
    """
    Stems the given tokens using PorterStemmer.

    Args:
        tokens (list): A list of tokens (words).

    Returns:
        list: A list of stemmed tokens.
    """
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens


if __name__ == '__main__':
    text = "Hello, how are you? did you know that programmers program with programming languages?"
    tokens = tokenize_text(text)
    stemmed_tokens = stem_tokens(tokens)
    print(stemmed_tokens)

**Do you see how the words are reduced to their root or stem form?**
> e.g. programming --> program

OK, Say Nice work and let's continue.. :)
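
The difference between rule-based stemming and vocabulary-based lemmatization can be sketched with two toy functions. Everything here (`toy_stem`, `toy_lemmatize`, the suffix list, and the tiny lemma dictionary) is invented for illustration and is far simpler than the Porter algorithm or a real lemmatizer.

```python
# Suffixes are tried in order, so longer ones come first
SUFFIXES = ("ers", "ing", "ed", "er", "s")

# A hand-written, deliberately tiny vocabulary of irregular forms
TOY_LEMMAS = {"are": "be", "is": "be", "did": "do"}


def toy_stem(word):
    """Strips one common suffix using fixed rules, with no vocabulary."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undo consonant doubling, e.g. "programm" -> "program"
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


def toy_lemmatize(word):
    """Looks the word up in a vocabulary and falls back to the lowercased word."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)


for word in ["programming", "programmers", "are", "did"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The suffix rules handle "programming" but cannot map "are" to "be"; that mapping needs a vocabulary, which is the idea behind lemmatization.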

____________________________________________________________________________________________________________________________________________________________________________________________________


## 2.3 **Stop-word Removal**

Stop-word removal is the process of **removing common words that do not carry much meaning**, such as **"the", "and", and "a"**. In NLP, stop-word removal is used to **reduce the noise in the text data and to focus on the important words**. 

>  Here is an example of how we can remove stop words from the sample text:

In [None]:
from nltk.corpus import stopwords


def remove_stopwords(tokens):
    """
    Removes the stop words from the given tokens.

    Args:
        tokens (list): A list of tokens (words).

    Returns:
        list: A list of tokens with stop words removed.
    """

    stop_words = set(stopwords.words('english'))  # English stop words; a set gives fast lookups

    filtered_tokens = [token for token in tokens if token not in stop_words]
    return filtered_tokens


if __name__ == '__main__':
    text = "Hello, how are you? did you know that programmers program with programming languages?"
    tokens = tokenize_text(text)
    stemmed_tokens = stem_tokens(tokens)
    filtered_tokens = remove_stopwords(stemmed_tokens)
    print(filtered_tokens)

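**As you can see, common low-information words such as "how", "are", and "you" are removed.** Punctuation tokens are not stop words, so they remain in the output.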
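
Stop-word lists are typically lowercase, so a capitalized token such as "The" only matches after lowercasing. The sketch below shows one way to handle that and to drop punctuation tokens in the same pass. The small `TOY_STOP_WORDS` set and the function name are assumptions made for this example; in practice you would pass in NLTK's list.

```python
import string

# A tiny hand-written stop-word set, for illustration only
TOY_STOP_WORDS = {"a", "an", "and", "are", "did", "how", "that", "the", "with", "you"}


def remove_stopwords_and_punct(tokens, stop_words=TOY_STOP_WORDS):
    """Drops stop words (case-insensitively) and punctuation-only tokens."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue  # Stop word, regardless of capitalization
        if all(ch in string.punctuation for ch in token):
            continue  # Token consists only of punctuation
        kept.append(token)
    return kept


tokens = ["Hello", ",", "how", "are", "you", "?", "The", "programmers", "program"]
print(remove_stopwords_and_punct(tokens))
# prints: ['Hello', 'programmers', 'program']
```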

____________________________________________________________________________________________________________________________________________________________________________________________________

## 2.4 **Part-of-Speech (POS) Tagging**

POS Tagging is the process of **assigning grammatical tags to words in a sentence**, such as **nouns, verbs, adjectives**, and more. It helps to analyze the syntactic structure and understand the grammatical relationships between words in a text.

![0_V635bzjWK2n1jBsd.png](attachment:e2188885-69e9-43d0-b61f-bc04af00fb83.png) ![image.png](attachment:2d2ffd00-d56c-46d7-aa09-66a2f88f8183.png)

> Here is an example of how we can perform POS tagging using the nltk library:

In [None]:
import nltk

def pos_tagging(tokens):
    """
    Performs POS tagging on the given tokens.

    Args:
        tokens (list): A list of tokens (words).

    Returns:
        list: A list of tuples containing the word and its corresponding POS tag.
    """
    pos_tags = nltk.pos_tag(tokens)  # Perform POS tagging
    return pos_tags

if __name__ == '__main__':
    text = "Hello, how are you? did you know that programmers program with programming languages?"
    tokens = tokenize_text(text)
    pos_tags = pos_tagging(tokens)
    print(pos_tags)


Don't forget to say **nice work** before continuing 😡
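
Once tokens are tagged, the `(word, tag)` tuples are easy to filter and count. The sketch below works on a hand-written list in the shape that `nltk.pos_tag` returns (Penn Treebank tags, where noun tags start with "NN"). The sample data is illustrative and is not actual tagger output, and `words_with_tag_prefix` is a hypothetical helper.

```python
from collections import Counter

# Hand-written (word, tag) pairs; illustrative only, not real tagger output
tagged_tokens = [
    ("programmers", "NNS"),
    ("write", "VBP"),
    ("clean", "JJ"),
    ("code", "NN"),
]


def words_with_tag_prefix(tagged, prefix):
    """Returns the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tagged_tokens, "NN")  # NN, NNS, NNP, NNPS
tag_counts = Counter(tag for _, tag in tagged_tokens)

print(nouns)  # prints: ['programmers', 'code']
print(tag_counts)
```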

____________________________________________________________________________________________________________________________________________________________________________________________________


## 2.5 **Named Entity Recognition (NER)**

NER is the process of **identifying and classifying named entities in text**, such as **person names, organizations, locations**, and more. It plays a crucial role in extracting meaningful information from unstructured text data.

![image.png](attachment:f3e8b7ca-d0c9-4879-be96-e05171a1abcf.png)

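> Here is an example of how we can perform Named Entity Recognition using the **spaCy library** (the `en_core_web_sm` model must be installed first, for example with `python -m spacy download en_core_web_sm`):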

In [None]:
import spacy


def named_entity_recognition(text):
    """Performs named entity recognition on the given text using spaCy.

    Args:
        text (str): The input text to process.

    Returns:
        list: A list of tuples representing recognized named entities and their labels.
    """
    # Load the spaCy English language model
    nlp = spacy.load('en_core_web_sm')

    # Process the text
    doc = nlp(text)

    # Extract named entities and their labels
    entities = [(entity.text, entity.label_) for entity in doc.ents]
    return entities


def main():
    text = "Apple Inc. was founded by Steve Jobs in California."

    # Perform named entity recognition on the text
    entities = named_entity_recognition(text)

    # Print the recognized entities and their labels
    for entity, label in entities:
        print("Entity:", entity)
        print("Label:", label)
        print()


if __name__ == '__main__':
    main()


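The labels in the output come from the spaCy model. Some common ones are:

* PERSON: Person
* ORG: Organization
* GPE: Geo-Political Entity (countries, cities, states)
* LOC: Location that is not a geo-political entity (mountain ranges, bodies of water)

Other NER tools and datasets use different tag sets, for example PER for persons, MISC for entities that don't fall into other categories, and O for tokens that are not part of a named entity.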
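
A common next step is to group the recognized entities by label. The sketch below works on a hand-written list of `(text, label)` tuples in the shape returned by `named_entity_recognition`; the sample data is illustrative rather than real model output, and `group_entities_by_label` is a hypothetical helper.

```python
from collections import defaultdict


def group_entities_by_label(entities):
    """Groups (text, label) tuples into a dict mapping label -> list of texts."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


# Hand-written sample in the (text, label) shape; not actual spaCy output
sample_entities = [
    ("Apple Inc.", "ORG"),
    ("Steve Jobs", "PERSON"),
    ("California", "GPE"),
    ("Microsoft", "ORG"),
]

print(group_entities_by_label(sample_entities))
# prints: {'ORG': ['Apple Inc.', 'Microsoft'], 'PERSON': ['Steve Jobs'], 'GPE': ['California']}
```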

____________________________________________________________________________________________________________________________________________________________________________________________________


# 3 **Sentiment Analysis**

Sentiment analysis is a subfield of Natural Language Processing (NLP) that focuses on **determining the sentiment or emotional tone expressed in a piece of text.** It involves analyzing and interpreting the subjective information conveyed by words and phrases to classify the overall sentiment as **positive, negative, or neutral.** Sentiment analysis has gained significant popularity due to its wide range of applications, such as understanding customer feedback, monitoring social media sentiment, and analyzing product reviews.

In [None]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

def sentiment_analysis(text):
    """
    Performs Sentiment Analysis on the given text.

    Args:
        text (str): The input text.

    Returns:
        dict: A dictionary containing the polarity scores for positive, negative, neutral, and compound sentiments.
    """

    sid = SentimentIntensityAnalyzer()
    sentiment_scores = sid.polarity_scores(text)
    return sentiment_scores


if __name__ == '__main__':
    text = "I loved the movie. It was fantastic!"
    sentiment_scores = sentiment_analysis(text)
    print(sentiment_scores)
    text = "I hate the movie. It was bored!"
    sentiment_scores = sentiment_analysis(text)
    print(sentiment_scores)


> The output contains **four** scores. The values quoted below are for the second text, "I hate the movie. It was bored!":

* The **"neg"** score is the proportion of the text that is negative. In this case, it is 0.604, which indicates that the text has a relatively high level of negativity.
* The **"neu"** score is the proportion of the text that is neutral. In this case, it is 0.396, so a sizeable part of the text carries no sentiment.
* The **"pos"** score is the proportion of the text that is positive. In this case, it is 0.0, which indicates that the text has no positivity.
* The **"compound"** score is a normalized score that ranges from -1 (most negative) to +1 (most positive). In this case, it is -0.7263, which indicates that the overall sentiment of the text is negative.
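
To turn the compound score into one of the three classes, a threshold is usually applied. The sketch below uses ±0.05, a commonly used convention for VADER; the function name `label_from_compound` and the default threshold are choices made for this example, and you may want to tune the threshold on your own data.

```python
def label_from_compound(compound, threshold=0.05):
    """Maps a compound score in [-1, 1] to 'positive', 'negative', or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


print(label_from_compound(-0.7263))  # prints: negative
print(label_from_compound(0.0))      # prints: neutral
```

You could apply it to the result of `sentiment_analysis` with `label_from_compound(sentiment_scores["compound"])`.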

____________________________________________________________________________________________________________________________________________________________________________________________________


**Finally :)**

# 4 **Our Simple Combination Task**

The code below demonstrates practical text preprocessing and sentiment analysis using Python's NLTK library. It shows how to prepare text data for analysis and extract sentiment scores. The **preprocess_text function** applies lowercase conversion, special character and punctuation removal, tokenization, and stemming. The **perform_sentiment_analysis function** uses the SentimentIntensityAnalyzer class to score each preprocessed review. The example analyzes a small set of movie reviews and prints their sentiment scores.

In [None]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

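def preprocess_text(text):
    """
    Lowercases, strips punctuation, tokenizes, and stems the given text.

    Args:
        text (str): The raw input text.

    Returns:
        str: The stemmed tokens joined into a single space-separated string.
    """
    # Convert text to lowercase
    text = text.lower()

    # Remove special characters and punctuation
    text = ''.join(c for c in text if c.isalnum() or c.isspace())

    # Tokenization
    tokens = word_tokenize(text)

    # Stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return ' '.join(stemmed_tokens)


def perform_sentiment_analysis(reviews):
    """
    Computes VADER polarity scores for each preprocessed review.

    Args:
        reviews (list): A list of review strings.

    Returns:
        list: One dictionary of polarity scores per review.
    """
    nltk.download('vader_lexicon')  # Download the required resource

    sid = SentimentIntensityAnalyzer()
    sentiment_scores = []

    for review in reviews:
        # Preprocess the review
        processed_review = preprocess_text(review)

        # Note: VADER looks words up in a lexicon, so stemmed or de-punctuated
        # text may score differently from the original review.
        sentiment_scores.append(sid.polarity_scores(processed_review))

    return sentiment_scores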

if __name__ == '__main__':
    # Generate sample movie reviews
    reviews = [
        "The movie was absolutely fantastic! I loved every minute of it.",
        "I found the plot quite confusing and the acting subpar.",
        "The cinematography was stunning, and the soundtrack added to the overall experience.",
        "I couldn't stop laughing throughout the entire movie. It was hilarious!",
        "The movie was a complete disappointment. I regret watching it."
    ]
    
    # Perform sentiment analysis on the movie reviews
    sentiment_scores = perform_sentiment_analysis(reviews)
    
    # Print the sentiment scores
    for i, review in enumerate(reviews):
        print(f"Review {i+1}: {review}")
        print(f"Sentiment Score: {sentiment_scores[i]}")
        print()


# 5 **Conclusion**

In conclusion, this notebook has presented an introductory overview of text preprocessing techniques and sentiment analysis using Python and the NLTK library. We have explored key steps such as tokenization, stemming, and stop-word removal, which are crucial for cleaning and normalizing text data, and we have looked briefly at POS tagging and named entity recognition. Additionally, we have demonstrated how to perform sentiment analysis, enabling us to classify the sentiment expressed in textual content.

By following the examples and understanding the concepts presented in this notebook, you now have a good foundation for working with text data in natural language processing tasks. Text preprocessing techniques help clean and standardize the data, while sentiment analysis allows for extracting valuable insights regarding sentiment.

**As you continue your journey in NLP, it is essential to experiment with different datasets, explore advanced techniques, and adapt the methods to your specific use cases. Remember, text preprocessing is often a crucial initial step in any NLP project, as it lays the foundation for accurate and meaningful analysis.**

If you found this notebook helpful and informative, I kindly ask you to support it by **VOTING** for it.
It would be greatly appreciated.

Thank you for joining me on this NLP journey.