In [None]:
# NLP, or Natural Language Processing, is a field of computer science focused on enabling computers to understand, interpret,
# and respond to human language in a way that is both meaningful and useful. 
# It combines elements of linguistics (the study of language) and artificial intelligence.

# In simple terms, NLP allows computers to:

# Understand Text and Speech: This includes tasks like reading and listening to what humans say or write.
# Analyze and Interpret: Breaking down the language to understand its structure and meaning, such as identifying the subject, verb, and object in a sentence.
# Generate Responses: Producing human-like text or speech in response to inputs, like chatbots answering questions.
# Translate Languages: Converting text from one language to another, like translating a Spanish sentence into English.

# Examples of NLP in action include virtual assistants like Siri or Alexa, translation services like Google Translate,
# and even grammar checking tools like Grammarly.
# These applications use NLP to interact with human language in ways that feel natural and intuitive.

In [1]:
# Natural Language Processing (or NLP) is applying Machine Learning models to text and language. 
# Teaching machines to understand what is said in spoken and written word is the focus of Natural Language Processing.
# Whenever you dictate something into your iPhone / Android device that is then converted to text, that’s an NLP algorithm in action.

# You can also use NLP on a text review to predict whether the review is a good one or a bad one.
# You can use NLP on an article to predict which category it belongs to when segmenting a set of articles.
# You can use NLP on a book to predict the genre of the book. And it can go further:
# you can use NLP to build a machine translator or a speech recognition system,
# and in that last example you use classification algorithms to classify language.
# Speaking of classification algorithms, many classic NLP algorithms are classification models.
# They include Logistic Regression, Naive Bayes, CART (a model based on decision trees),
# Maximum Entropy (closely related to Logistic Regression), and Hidden Markov Models (models based on Markov processes).

# A very well-known model in NLP is the Bag of Words model. 
# It is used to preprocess the texts we want to classify, before fitting the classification algorithms on the observations containing those texts.

# In this part, you will understand and learn how to:

# Clean texts to prepare them for the Machine Learning models,
# Create a Bag of Words model,
# Apply Machine Learning models onto this Bag of Words model.

In [2]:
# Natural Language Processing (NLP) encompasses a variety of tasks and techniques to process and analyze human language. Here are some key types of NLP:

# Tokenization:
# Word Tokenization: Splitting text into individual words.
# Sentence Tokenization: Splitting text into individual sentences.

# Part-of-Speech Tagging (POS Tagging): Identifying and labeling the parts of speech (nouns, verbs, adjectives, etc.) in a sentence.
# Named Entity Recognition (NER): Identifying and classifying proper nouns in text (e.g., names of people, organizations, locations).
# Sentiment Analysis: Determining the emotional tone behind a series of words, used to understand attitudes, opinions, and emotions.
# Machine Translation: Translating text from one language to another, such as from English to Spanish.
# Text Classification: Assigning categories to text, such as spam detection in emails or topic categorization in news articles.
# Speech Recognition: Converting spoken language into written text.
# Language Generation: Creating coherent text that is meaningful and contextually relevant, such as chatbots generating responses.
# Coreference Resolution: Determining when different words refer to the same entity in a text, like recognizing that "he" and "John" refer to the same person.
# Text Summarization: Condensing long pieces of text into shorter summaries while preserving key information.
# Word Sense Disambiguation: Determining which meaning of a word is used in a context, such as distinguishing between "bank" (financial institution) and "bank" (side of a river).
# Question Answering: Building systems that can answer questions posed in natural language.
# Topic Modeling: Identifying topics present in a collection of documents, often used in discovering underlying themes in large datasets.
# Information Retrieval: Extracting relevant information from large datasets, such as search engines retrieving relevant web pages based on user queries.

# Each of these types of NLP involves specific techniques and algorithms to process and analyze language,
# contributing to the broader goal of making human-computer interaction more intuitive and effective.
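
# As a rough illustration of the first task in the list, here is a minimal sketch of
# sentence and word tokenization using only the standard library. The regular
# expressions are simplifying assumptions (they do not handle abbreviations,
# numbers, or hyphenated words); dedicated tokenizers are more careful.

```python
import re

text = "I love natural language processing. Language processing is fun."

# Sentence tokenization: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Word tokenization: keep runs of letters (apostrophes allowed inside words)
words_per_sentence = [re.findall(r"[A-Za-z']+", s) for s in sentences]

print(sentences)
print(words_per_sentence)
```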

In [3]:
# Bag of words
# The "Bag of Words" (BoW) model is a fundamental concept in natural language processing and text analysis.
# It simplifies text by converting it into a collection of words (or tokens) 
# while disregarding grammar and word order but keeping the multiplicity of words.
# Here's how it works:
# Tokenization: Split the text into individual words or tokens.
# Example: "I love natural language processing" → ["I", "love", "natural", "language", "processing"]
# Vocabulary Creation: Create a list of all unique words (vocabulary) from the entire text or corpus.
# Suppose we have another sentence: "Language processing is fun"
# Combined vocabulary (ignoring case, so "Language" and "language" count as one word): ["I", "love", "natural", "language", "processing", "is", "fun"]
# Vectorization: Represent each document or sentence as a vector based on the vocabulary. Each element in the vector corresponds to the count (or presence) of a word from the vocabulary in the document.
# Example sentence 1: "I love natural language processing"
# Vector: [1, 1, 1, 1, 1, 0, 0]
# Example sentence 2: "Language processing is fun"
# Vector: [0, 0, 0, 1, 1, 1, 1]
# In the vectors above:
# The first position corresponds to the word "I"
# The second to "love"
# The third to "natural"
# The fourth to "language"
# The fifth to "processing"
# The sixth to "is"
# The seventh to "fun"
# Key Characteristics of Bag of Words
# Order Ignored: The model does not consider the order of words. "I love" and "love I" would be represented the same way.
# Frequency Count: The simplest version uses word counts,
# but variations might use binary presence (0 or 1) or other measures like term frequency-inverse document frequency (TF-IDF).
# Simplicity: It is easy to understand and implement, making it a popular choice for initial text analysis.
# Use Cases
# Text Classification: Categorizing text into predefined categories, like spam detection in emails.
# Information Retrieval: Search engines retrieving relevant documents based on query terms.
# Sentiment Analysis: Determining the sentiment of a piece of text by analyzing word frequencies associated with positive or negative sentiments.
# Limitations
# Loss of Context: Since the order of words is ignored, contextual or syntactic information is lost.
# High Dimensionality: Large vocabularies can result in very high-dimensional vectors, 
# making computation more intensive and possibly leading to sparse data issues.
# Semantic Ambiguity: A word with multiple meanings (polysemy) is mapped to a single feature, while different words with similar meanings (synonymy) are treated as unrelated features.
# Despite these limitations, Bag of Words is a foundational technique that provides a baseline for more complex models 
# and is often used as a starting point for text processing tasks.
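
# The three steps described above (tokenization, vocabulary creation, vectorization)
# can be sketched in plain Python. This is only an illustration: the helper name
# tokenize is hypothetical, and lowercasing every token (so "I" becomes "i") is an
# assumption made here so that "Language" and "language" share one vocabulary entry.

```python
def tokenize(text):
    """Split on whitespace, strip periods from word edges, and lowercase."""
    return [word.strip(".").lower() for word in text.split()]

documents = [
    "I love natural language processing",
    "Language processing is fun",
]

# Vocabulary creation: unique tokens in order of first appearance
vocabulary = []
for doc in documents:
    for token in tokenize(doc):
        if token not in vocabulary:
            vocabulary.append(token)

# Vectorization: one count per vocabulary word for each document
vectors = [[tokenize(doc).count(word) for word in vocabulary] for doc in documents]

print(vocabulary)
for vector in vectors:
    print(vector)
```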

In [1]:
# What does "order ignored" mean?
# When we say "order ignored" in the context of the Bag of Words (BoW) model, 
# it means that the sequence or arrangement of words in the original text is not taken into account.
# The model treats the text as a "bag" of words where the presence or frequency of words is considered,
# but the specific order in which the words appear is not.

# Here's a simple example to illustrate this:

# Original Sentences
# "I love natural language processing."
# "Language processing is fun."
# Tokenization and Vocabulary Creation
# Let's tokenize these sentences and create a combined vocabulary:

# Sentence 1: ["I", "love", "natural", "language", "processing"]
# Sentence 2: ["Language", "processing", "is", "fun"]
# Combined vocabulary (unique words, ignoring case): ["I", "love", "natural", "language", "processing", "is", "fun"]

# Vectorization
# Each sentence is converted into a vector based on the presence or frequency of words in the vocabulary.

# For Sentence 1: "I love natural language processing"

# Vocabulary: ["I", "love", "natural", "language", "processing", "is", "fun"]
# Vector: [1, 1, 1, 1, 1, 0, 0]
# For Sentence 2: "Language processing is fun"

# Vocabulary: ["I", "love", "natural", "language", "processing", "is", "fun"]
# Vector: [0, 0, 0, 1, 1, 1, 1]
# Explanation of "Order Ignored"
# In the BoW model, the vectors only indicate the presence or count of each word from the vocabulary in the sentences.
# The original order of words in the sentences is completely disregarded. For example:

# The model does not distinguish between "I love natural language processing" and "natural language processing I love";
# both would be represented by the same vector: [1, 1, 1, 1, 1, 0, 0].
# Similarly, "Language processing is fun" and "is fun processing language" both would be represented by the vector: [0, 0, 0, 1, 1, 1, 1].
# This "order ignored" approach means the BoW model does not capture any information about the syntax,
# grammar, or meaning derived from the sequence of words.
# It only captures the frequency or presence of individual words within the text.
# This simplicity makes the BoW model easy to implement but limits its ability to understand the deeper context or nuanced meanings in the text.
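
# A quick way to see this is to compare word counts for reordered sentences.
# The sketch below uses collections.Counter as the "bag"; the helper name bag and
# the two food sentences are made up for illustration.

```python
from collections import Counter

def bag(text):
    """Return the word counts of a sentence, ignoring case and order."""
    return Counter(text.lower().split())

original = bag("I love natural language processing")
shuffled = bag("natural language processing I love")
other = bag("Language processing is fun")

print(original == shuffled)  # same words, different order
print(original == other)     # different words

# Two sentences with opposite meanings can still share one bag
praise = bag("the food was good not bad")
complaint = bag("the food was bad not good")
print(praise == complaint)
```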

In [2]:
# On what basis are 0 and 1 chosen, and what do they represent?

# In the Bag of Words (BoW) model, the elements of the vector (0s and 1s, or counts) represent 
# whether a particular word from the vocabulary is present in the text or how many times it appears. Here’s a detailed explanation:

# Binary Representation (0 and 1)
# In the binary version of the BoW model:

# 1 indicates the presence of a word in the document.
# 0 indicates the absence of a word in the document.
# Example
# Consider the sentences:

# "I love natural language processing."
# "Language processing is fun."
# Step-by-Step Process:
# Tokenization:

# Sentence 1: ["I", "love", "natural", "language", "processing"]
# Sentence 2: ["Language", "processing", "is", "fun"]
# Vocabulary Creation:
# Combined vocabulary (unique words): ["I", "love", "natural", "language", "processing", "is", "fun"]

# Binary Vectorization:

# For Sentence 1: "I love natural language processing"
# Vector: [1, 1, 1, 1, 1, 0, 0]
# "I" is present: 1
# "love" is present: 1
# "natural" is present: 1
# "language" is present: 1
# "processing" is present: 1
# "is" is absent: 0
# "fun" is absent: 0
# For Sentence 2: "Language processing is fun"
# Vector: [0, 0, 0, 1, 1, 1, 1]
# "I" is absent: 0
# "love" is absent: 0
# "natural" is absent: 0
# "language" is present: 1
# "processing" is present: 1
# "is" is present: 1
# "fun" is present: 1
# Count-Based Representation
# In the count-based version of the BoW model:

# The vector elements represent the frequency of each word in the document.
# Example
# Using the same sentences:

# Count-Based Vectorization:
# For Sentence 1: "I love natural language processing"
# Vector: [1, 1, 1, 1, 1, 0, 0]
# "I" appears once: 1
# "love" appears once: 1
# "natural" appears once: 1
# "language" appears once: 1
# "processing" appears once: 1
# "is" appears zero times: 0
# "fun" appears zero times: 0
# For Sentence 2: "Language processing is fun"
# Vector: [0, 0, 0, 1, 1, 1, 1]
# "I" appears zero times: 0
# "love" appears zero times: 0
# "natural" appears zero times: 0
# "language" appears once: 1
# "processing" appears once: 1
# "is" appears once: 1
# "fun" appears once: 1
# Summary
# Binary Representation (0 and 1): Each vector element is either 0 or 1, indicating the absence or presence of a word in the text.
# Count-Based Representation: Each vector element indicates the frequency of the corresponding word from the vocabulary in the text.
# Both methods ignore the order of words and focus solely on the presence or frequency of words to represent the text.


In [3]:
# What does "unique words" mean, and what happens if a word is repeated in the text?
# In the context of the Bag of Words (BoW) model, "unique words" refer to the distinct words that appear in the entire set of documents or corpus. 
# Each unique word becomes a feature in the vocabulary, which is then used to create the vectors representing each document.

# Handling Repeated Words
# If a word is repeated within a text, how it is handled depends on whether you are using a binary representation or a count-based representation:

# Binary Representation:

# Each word in the vocabulary is represented as either 0 or 1, indicating the absence or presence of the word in the document.
# Repeated words do not change the value from 1 to a higher number; they are simply noted as present (1).
# Example:

# Sentence: "I love love natural language processing."
# Vocabulary: ["I", "love", "natural", "language", "processing"]
# Vector: [1, 1, 1, 1, 1]
# Even though "love" is repeated, the vector just indicates that "love" is present (1).

# Count-Based Representation:

# Each word in the vocabulary is represented by the number of times it appears in the document.
# Repeated words increase the count for that word in the vector.
# Example:

# Sentence: "I love love natural language processing."
# Vocabulary: ["I", "love", "natural", "language", "processing"]
# Vector: [1, 2, 1, 1, 1]
# In this case, "love" appears twice, so its count in the vector is 2.

# Detailed Steps
# Tokenization:

# Sentence: "I love love natural language processing."
# Tokens: ["I", "love", "love", "natural", "language", "processing"]
# Vocabulary Creation:

# Identify unique words from all documents (or the entire corpus).
# Suppose another sentence is "Language processing is fun."
# Combined Vocabulary: ["I", "love", "natural", "language", "processing", "is", "fun"]
# Vectorization:

# For the sentence "I love love natural language processing":

# Binary Vector: [1, 1, 1, 1, 1, 0, 0]
# Count-Based Vector: [1, 2, 1, 1, 1, 0, 0]
# For the sentence "Language processing is fun":

# Binary Vector: [0, 0, 0, 1, 1, 1, 1]
# Count-Based Vector: [0, 0, 0, 1, 1, 1, 1]
# Summary
# Unique Words: The distinct words from all documents create the vocabulary.
# Repeated Words:
# In binary representation, repetition does not affect the vector beyond indicating presence (1).
# In count-based representation, repetition increases the count for that word in the vector.
# Both representations help in converting text into a structured form that can be easily processed by machine learning algorithms.
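
# The difference between the two representations can be checked with a short sketch.
# The function names count_vector and binary_vector are hypothetical, and the
# vocabulary is written in lowercase on the assumption that text is lowercased first.

```python
from collections import Counter

vocabulary = ["i", "love", "natural", "language", "processing", "is", "fun"]

def count_vector(text, vocab):
    """Number of times each vocabulary word occurs in the text."""
    counts = Counter(text.lower().replace(".", "").split())
    return [counts[word] for word in vocab]

def binary_vector(text, vocab):
    """1 if the vocabulary word occurs at least once, otherwise 0."""
    return [1 if count > 0 else 0 for count in count_vector(text, vocab)]

sentence = "I love love natural language processing."
counts = count_vector(sentence, vocabulary)
presence = binary_vector(sentence, vocabulary)

print(counts)    # "love" is counted twice
print(presence)  # "love" is only marked as present
```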

In [5]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [6]:
# delimiter = '\t': Specifies that the file uses tab characters to separate fields.
# quoting = 3: Tells pandas not to interpret any quotes as special, treating them as regular characters.
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
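
# The value 3 comes from Python's csv module: pandas accepts the csv.QUOTE_*
# constants for quoting, and csv.QUOTE_NONE equals 3. The sketch below uses the
# standard-library csv reader on a made-up two-line sample (not the real file)
# to show quotes being kept as ordinary characters.

```python
import csv
import io

print(csv.QUOTE_NONE)  # the integer passed as quoting = 3

# Hypothetical tab-separated sample with quote characters inside a review
sample = 'Review\tLiked\n"Great" food, they said.\t1\n'

rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE))
for row in rows:
    print(row)
```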

In [10]:
dataset

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
...,...,...
995,I think food should have flavor and texture an...,0
996,Appetite instantly gone.,0
997,Overall I was not impressed and would not go b...,0
998,"The whole experience was underwhelming, and I ...",0


In [None]:
# re: The re module provides support for regular expressions, which allow you to search for and manipulate strings.
# import re
# nltk: The nltk (Natural Language Toolkit) is a suite of libraries and programs for symbolic and statistical natural language processing.
# import nltk


In [None]:
# This line downloads the list of stopwords.
# Stopwords are common words (like "the", "is", "in") that are often removed from text data because they don't carry significant meaning.
# nltk.download('stopwords')

In [None]:
# Import Stopwords and PorterStemmer
# from nltk.corpus import stopwords
# from nltk.stem.porter import PorterStemmer
# stopwords: This module provides the list of stopwords.
# PorterStemmer: A stemming algorithm that reduces words to their stem, which is not always a dictionary word.
# For example, "running" becomes "run" and "tasty" becomes "tasti".


In [None]:
# Initialize an Empty List for the Corpus:
# corpus = []
# corpus: This will store the cleaned reviews.

In [None]:
# Process Each Review:
# for i in range(0, 1000):
# This loop iterates over the first 1000 reviews in the dataset.

In [None]:
# Text Cleaning:
# review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
# re.sub('[^a-zA-Z]', ' ', dataset['Review'][i]): This uses a regular expression to replace every character that is not a letter (a-z or A-Z) with a space.
# This removes numbers, punctuation, and special characters.

In [None]:
# Convert to Lowercase:
# review = review.lower()
# This converts all characters in the review to lowercase.

In [None]:
# Tokenization:
# review = review.split()
# This splits the review into a list of words (tokens).
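
# Applied to the first review of the dataset, the three steps so far (keep letters
# only, lowercase, split) can be traced as follows. This is a standalone sketch
# that hard-codes the review text instead of reading it from the dataset.

```python
import re

raw = "Wow... Loved this place."

# Every non-letter (dots and spaces included) becomes a single space
letters_only = re.sub('[^a-zA-Z]', ' ', raw)

lowered = letters_only.lower()

# split() with no argument collapses the runs of spaces left by re.sub
tokens = lowered.split()

print(repr(letters_only))
print(tokens)
```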

In [None]:
# Remove Stopwords and Apply Stemming:
# ps = PorterStemmer()
# all_stopwords = stopwords.words('english')
# all_stopwords.remove('not')
# review = [ps.stem(word) for word in review if not word in set(all_stopwords)]
# ps = PorterStemmer(): Initializes the PorterStemmer.
# all_stopwords = stopwords.words('english'): Retrieves the list of English stopwords.
# all_stopwords.remove('not'): Removes "not" from the list of stopwords to keep negation words.
# review = [ps.stem(word) for word in review if not word in set(all_stopwords)]: This list comprehension iterates over each word in the tokenized review,
# keeps it only if it is not a stopword, and stems the words that remain.
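
# Why "not" is taken out of the stopword list can be seen on a short review.
# The sketch below uses a tiny, hypothetical stopword list as a stand-in for
# NLTK's English list and leaves out stemming, so it runs without any download.

```python
# Hypothetical stand-in for stopwords.words('english')
demo_stopwords = ["the", "is", "was", "and", "not", "this"]

tokens = "crust is not good".split()

# With "not" treated as a stopword, the negation is lost
stopword_set = set(demo_stopwords)
negation_lost = [word for word in tokens if word not in stopword_set]

# Removing "not" from the list first keeps the negation
demo_stopwords.remove("not")
stopword_set = set(demo_stopwords)
negation_kept = [word for word in tokens if word not in stopword_set]

print(negation_lost)
print(negation_kept)
```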

In [None]:
# Join the Words Back into a String:
# review = ' '.join(review)
# This joins the list of words back into a single string, with words separated by spaces.

In [None]:
# Add the Cleaned Review to the Corpus:
# corpus.append(review)
# This appends the cleaned and processed review to the corpus list.

In [11]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Build the stemmer and the stopword set once, outside the loop
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')  # keep negations such as "not good"
stopword_set = set(all_stopwords)

corpus = []
for i in range(0, 1000):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])  # letters only
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stopword_set]
    review = ' '.join(review)
    corpus.append(review)

[nltk_data] Downloading package stopwords to C:\Users\Umair
[nltk_data]     Jadoon\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [12]:
print(corpus)

['wow love place', 'crust not good', 'not tasti textur nasti', 'stop late may bank holiday rick steve recommend love', 'select menu great price', 'get angri want damn pho', 'honeslti tast fresh', 'potato like rubber could tell made ahead time kept warmer', 'fri great', 'great touch', 'servic prompt', 'would not go back', 'cashier care ever say still end wayyy overpr', 'tri cape cod ravoli chicken cranberri mmmm', 'disgust pretti sure human hair', 'shock sign indic cash', 'highli recommend', 'waitress littl slow servic', 'place not worth time let alon vega', 'not like', 'burritto blah', 'food amaz', 'servic also cute', 'could care less interior beauti', 'perform', 'right red velvet cake ohhh stuff good', 'never brought salad ask', 'hole wall great mexican street taco friendli staff', 'took hour get food tabl restaur food luke warm sever run around like total overwhelm', 'worst salmon sashimi', 'also combo like burger fri beer decent deal', 'like final blow', 'found place accid could not

In [None]:
# Importing CountVectorizer:
# from sklearn.feature_extraction.text import CountVectorizer
# This imports the CountVectorizer class from the sklearn.feature_extraction.text module,
# which is used to convert a collection of text documents into a matrix of token counts.

In [None]:
# Initializing CountVectorizer with max_features:
# cv = CountVectorizer(max_features = 1500)
# max_features = 1500: This parameter limits the number of features (tokens) to 1500. It keeps only the 1500 most frequent words in the corpus,
# which helps in reducing the dimensionality and computational complexity of the model.

In [None]:
# Fitting and Transforming the Corpus:
# X = cv.fit_transform(corpus).toarray()
# cv.fit_transform(corpus): This fits the CountVectorizer to the corpus and transforms the text data into a numerical feature matrix.
# .toarray(): Converts the sparse matrix returned by fit_transform into a dense numpy array.
# Result:

# X is a 2D numpy array where each row represents a document (review) and each column represents a feature (word).
# The values are the counts of each word in each document.
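
# The idea behind max_features can be sketched in plain Python: count every word
# across the corpus, keep the most frequent ones, and build one count row per
# document. This is an illustration of the concept, not scikit-learn's
# implementation. The five reviews are taken from the cleaned corpus, and
# max_features is set to 2 (instead of 1500) only to keep the output small.

```python
from collections import Counter

corpus = [
    "wow love place",
    "crust not good",
    "not tasti textur nasti",
    "would not go back",
    "place not worth time let alon vega",
]
max_features = 2

# Total frequency of each word over the whole corpus
totals = Counter(word for doc in corpus for word in doc.split())

# Keep the most frequent words, listed alphabetically as the matrix columns
vocabulary = sorted(word for word, _ in totals.most_common(max_features))

# One row per review, one column per kept word
X = [[doc.split().count(word) for word in vocabulary] for doc in corpus]

print(vocabulary)
for row in X:
    print(row)
```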

In [None]:
# Extracting the Target Variable:
# y = dataset.iloc[:, -1].values
# dataset.iloc[:, -1]: This selects the last column of the dataset,
# assuming it is the target variable (e.g., sentiment labels, such as positive or negative).
# .values: Converts the selected column into a numpy array.
# Result:

# y is a 1D numpy array containing the target labels for the reviews.

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, -1].values

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

In [16]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

In [17]:
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[1 0]
 [1 0]
 [1 0]
 [0 0]
 [0 0]
 [1 0]
 [1 1]
 [1 0]
 [1 0]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 1]
 [1 1]
 [1 0]
 [1 0]
 [0 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 0]
 [0 0]
 [1 0]
 [1 1]
 [1 1]
 [1 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 0]
 [1 0]
 [0 0]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [1 0]
 [1 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 0]
 [1 1]
 [0 1]
 [0 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [1 0]
 [1 1]
 [1 0]
 [1 1]
 [1 1]
 [1 0]
 [0 1]
 [1 1]
 [1 1]
 [1 0]
 [0 1]
 [1 0]
 [1 1]
 [1 1]
 [0 0]
 [0 1]
 [0 1]
 [1 1]
 [0 0]
 [1 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [1 0]
 [1 1]
 [0 0]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [0 1]
 [1 1]
 [1 1]

In [18]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[55 42]
 [12 91]]


0.73
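
# The accuracy can be recomputed by hand from the confusion matrix printed above.
# In scikit-learn's confusion_matrix, rows are the true labels and columns are the
# predicted labels, so the four cells are TN, FP, FN, TP for the labels 0 and 1.
# Precision, recall, and F1 are extra metrics added here for illustration; they
# are not computed in the notebook itself.

```python
cm = [[55, 42],
      [12, 91]]

tn, fp = cm[0]  # true label 0: predicted 0, predicted 1
fn, tp = cm[1]  # true label 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total       # share of correct predictions
precision = tp / (tp + fp)         # of predicted "liked", how many were liked
recall = tp / (tp + fn)            # of truly "liked", how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.2f}")
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"f1        = {f1:.3f}")
```

# The 42 false positives against 12 false negatives show that this classifier
# errs mostly by labelling negative reviews as positive.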