**Task: Develop a Natural Language Processing (NLP) Model for Sentiment Analysis**

**Description:**
You are tasked with building a machine learning model for sentiment analysis. The model should classify text (for example, product reviews or social media posts) as positive, negative, or neutral, and will be used to automatically assess the sentiment of customer reviews for a product.

**Steps and Responsibilities:**

1. **Data Collection:** Gather a dataset of text samples with labeled sentiment. This dataset should include a variety of text sources and cover different domains relevant to the product.

2. **Data Preprocessing:** Clean and preprocess the text data. This may involve tasks such as tokenization, removing stop words, stemming or lemmatization, and handling any missing data.

3. **Feature Engineering:** Convert the text data into numerical features that can be used as input for the machine learning model. Common techniques include TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings like Word2Vec or GloVe.

4. **Model Selection:** Choose an appropriate machine learning or deep learning model for sentiment analysis. Common choices include logistic regression, support vector machines, recurrent neural networks (RNNs), or transformer models like BERT.

5. **Model Training:** Train the selected model using the preprocessed data. This involves splitting the data into training and validation sets, setting hyperparameters, and training the model to minimize the loss function.

6. **Evaluation:** Assess the model's performance using appropriate metrics like accuracy, precision, recall, F1-score, or ROC-AUC. Fine-tune the model and hyperparameters as needed.

7. **Testing and Deployment:** After achieving a satisfactory level of performance, test the model on an independent test dataset to ensure generalization. Once validated, deploy the model to a production environment, which may involve creating APIs or integrating it into a larger system.

8. **Monitoring and Maintenance:** Continuously monitor the model's performance in the production environment, as well as the data it encounters. Implement regular model retraining to account for concept drift and changing data distributions.

9. **Documentation:** Maintain documentation of the entire process, including data sources, preprocessing steps, model architecture, hyperparameters, and evaluation results.

10. **Scaling and Optimization:** As the application grows, consider optimization techniques like distributed computing, parallel processing, and model compression to improve efficiency and scalability.

11. **Feedback Loop:** Gather user feedback on the model's predictions and iteratively improve it based on real-world performance.

This task showcases a common responsibility of an AI and ML engineer, which is to develop, deploy, and maintain machine learning models for specific applications. The specific tasks and tools used may vary depending on the project and its requirements.
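
As a small illustration of steps 3 to 6, the sketch below chains TF-IDF features and a Naive Bayes classifier in a scikit-learn `Pipeline`, holds out a validation split, and compares two values of the smoothing hyperparameter `alpha`. The toy corpus, the variable names, and the two `alpha` candidates are assumptions made for illustration; they are not part of the task description.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 0 = negative, 1 = positive.
toy_texts = [
    "great food and friendly staff",
    "terrible service and cold food",
    "loved the place, great value",
    "awful experience, rude staff",
    "great coffee, friendly barista",
    "terrible coffee, awful wait",
    "friendly staff and great value",
    "rude waiter and cold coffee",
]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Hold out 25% of the samples for validation, keeping the class balance.
X_tr, X_val, y_tr, y_val = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=0
)

# Try each candidate alpha and record its validation accuracy.
val_accuracy = {}
for alpha in (0.1, 1.0):
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("nb", MultinomialNB(alpha=alpha)),
    ])
    pipeline.fit(X_tr, y_tr)
    val_accuracy[alpha] = accuracy_score(y_val, pipeline.predict(X_val))

best_alpha = max(val_accuracy, key=val_accuracy.get)
print(val_accuracy, best_alpha)
```

Keeping the vectorizer and the classifier in one pipeline means the vocabulary is always learned from the training split only, which avoids leaking validation text into the features.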




In [None]:
!pip install datasets

In [None]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

In [None]:
dataset["train"]["text"][1000]

In [None]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from nltk.stem import WordNetLemmatizer
from tqdm import tqdm
import nltk

In [None]:
# Convert the train and test splits (columns 'label' and 'text') to DataFrames
df_training = dataset['train'].to_pandas()
df_test = dataset['test'].to_pandas()
print(df_training.count())
print(df_test.count())

In [None]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # required by word_tokenize in newer NLTK releases
nltk.download('wordnet')

In [None]:
# Text preprocessing
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Tokenize, lowercase, keep alphabetic tokens only, lemmatize, and drop stop words
def preprocess_text_with_progress(text):
    words = word_tokenize(text)
    words = [word.lower() for word in words if word.isalpha()]
    words = [lemmatizer.lemmatize(word) for word in words]
    words = [word for word in words if word not in stop_words]
    return ' '.join(words)


tqdm.pandas()

In [None]:
df_training['text'] = df_training['text'].progress_apply(preprocess_text_with_progress)

In [None]:
df_test['text'] = df_test['text'].progress_apply(preprocess_text_with_progress)

In [None]:
# Feature extraction (TF-IDF)
tfidf_vectorizer = TfidfVectorizer()

In [None]:
# Fit and transform on training data
X_train = tfidf_vectorizer.fit_transform(df_training['text'])
y_train = df_training['label']

# Transform the test data using the same vectorizer
X_test = tfidf_vectorizer.transform(df_test['text'])
y_test = df_test['label']
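
To make the fit/transform distinction concrete, here is a rough pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings, as far as I understand them: raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The toy corpus and helper names are hypothetical, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter

# Hypothetical toy "training" corpus, already preprocessed.
toy_docs = ["good food good price", "bad food", "good service"]
tokenized = [doc.split() for doc in toy_docs]

# "fit": learn the vocabulary and idf weights from the training documents only.
vocabulary = sorted({term for doc in tokenized for term in doc})
n_docs = len(tokenized)
doc_freq = Counter(term for doc in tokenized for term in set(doc))
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}


def tfidf_vector(tokens):
    """'transform': weight term counts by idf, then L2-normalize the row."""
    counts = Counter(tokens)
    # Terms outside the learned vocabulary are silently ignored.
    weights = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]


train_vectors = [tfidf_vector(doc) for doc in tokenized]
# "pizza" was never seen during fit, so only "good" contributes.
unseen_vector = tfidf_vector("good pizza".split())
print(vocabulary)
print(unseen_vector)
```

This is why the test data is passed through `transform` rather than `fit_transform`: the test rows must be expressed in the vocabulary and idf weights learned from the training data.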

In [None]:
# Build and train the model (Naive Bayes)
model = MultinomialNB()
model.fit(X_train, y_train)

In [None]:
# Make predictions
y_pred = model.predict(X_test)

In [None]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(report)
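
For readers who want to see where the per-class numbers in `classification_report` come from, the following is a minimal sketch that computes precision, recall, and F1 by hand. The function name `per_class_scores` and the toy labels are hypothetical.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from raw counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_label, pred_label in zip(y_true, y_pred):
        if true_label == pred_label:
            tp[true_label] += 1
        else:
            fp[pred_label] += 1  # predicted this class, but it was wrong
            fn[true_label] += 1  # missed an instance of the true class

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_scores = per_class_scores(toy_true, toy_pred)
for label, (precision, recall, f1) in toy_scores.items():
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With five star-rating classes, accuracy alone can hide which ratings are being confused, so the per-class view is usually the more informative one.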

In [None]:
# Sample text to classify
# neutral
#input_text = "Overall, the product meets my expectations. It does what it's supposed to do, but I wouldn't say it's outstanding. It's an average product with no major complaints. It gets the job done."
# positive
#input_text = "I'm absolutely thrilled with this product! It exceeded my expectations in every way. The quality is top-notch, and it's incredibly easy to use. I've been using it for a while now, and it has made my life so much better. I highly recommend it to anyone looking for a reliable and efficient solution."
# negative
input_text = "I'm really disappointed with this product. It didn't work as advertised, and I encountered numerous issues from the moment I started using it. The quality is subpar, and it's a waste of money. I regret purchasing it and wouldn't recommend it to anyone."

# Preprocess the input text
input_text = preprocess_text_with_progress(input_text)

# Vectorize the input text using the same TF-IDF vectorizer
input_vector = tfidf_vectorizer.transform([input_text])

# Make predictions
predicted_label = model.predict(input_vector)[0]

# Map the predicted label (0-4) to a sentiment: 3-4 positive, 2 neutral, 0-1 negative
if predicted_label > 2:
    print("Positive Sentiment")
elif predicted_label == 2:
    print("Neutral Sentiment")
else:
    print("Negative Sentiment")

print(predicted_label)
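
Since the same label-to-sentiment mapping is needed wherever a prediction is made, it could be factored into a small helper. This is only a sketch: the name `label_to_sentiment` is hypothetical, and it assumes the `yelp_review_full` labels run from 0 to 4, corresponding to one to five stars.

```python
def label_to_sentiment(label: int) -> str:
    """Map a 0-4 star-rating label to a coarse sentiment string."""
    if not 0 <= label <= 4:
        raise ValueError(f"expected a label between 0 and 4, got {label}")
    if label > 2:
        return "Positive Sentiment"
    if label == 2:
        return "Neutral Sentiment"
    return "Negative Sentiment"


# Show the mapping for every possible label.
for star_label in range(5):
    print(star_label, label_to_sentiment(star_label))
```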

In [None]:
!mkdir -p /content/sentiment_scoring

In [None]:
import pickle

# Save the trained scikit-learn model (the fitted TF-IDF vectorizer is not included in this file)
with open('/content/sentiment_scoring/sentiment-scoring-nltk-sklearn-naivebayes.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)
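
Note that the pickle above stores only the classifier, so anyone loading it in a fresh session would also need the fitted TF-IDF vectorizer to turn text into features. One possible way to handle this, sketched below on a toy model, is to pickle both objects together. The dictionary keys, the file name `sentiment_bundle.pkl`, and the toy data are assumptions for illustration, not what the notebook uploads.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the preprocessed reviews.
toy_texts = ["great product", "terrible product", "great value", "terrible quality"]
toy_labels = [1, 0, 1, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# Save the vectorizer and the model in one file so they cannot get out of sync.
bundle_path = os.path.join(tempfile.mkdtemp(), "sentiment_bundle.pkl")
with open(bundle_path, "wb") as bundle_file:
    pickle.dump({"vectorizer": toy_vectorizer, "model": toy_model}, bundle_file)

# Later, possibly in a new session: restore both objects and predict.
with open(bundle_path, "rb") as bundle_file:
    restored = pickle.load(bundle_file)

sample = ["great quality"]
original_pred = toy_model.predict(toy_vectorizer.transform(sample))[0]
restored_pred = restored["model"].predict(restored["vectorizer"].transform(sample))[0]
print(original_pred, restored_pred)
```

Pickle files can execute arbitrary code when loaded, so they should only be loaded from sources you trust.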

In [None]:
!pip install huggingface_hub

In [None]:
!huggingface-cli login

In [None]:
!huggingface-cli upload faizalnf1800/sentiment-scoring-nltk-sklearn-naivebayes /content/sentiment_scoring/sentiment-scoring-nltk-sklearn-naivebayes.pkl sentiment-scoring-nltk-sklearn-naivebayes.pkl

Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
nltk_sentiment_classification_model.pkl: 100% 14.4M/14.4M [00:00<00:00, 23.1MB/s]
https://huggingface.co/faizalnf1800/nltk_sentiment_classification/blob/main/nltk_sentiment_classification_model.pkl


In [None]:
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="faizalnf1800/sentiment-scoring-nltk-sklearn-naivebayes", filename="sentiment-scoring-nltk-sklearn-naivebayes.pkl", local_dir="/content")

'/content/nltk_sentiment_classification_model.pkl'

In [None]:
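import pickle

# Load the model from the file downloaded in the previous cell
with open('/content/sentiment-scoring-nltk-sklearn-naivebayes.pkl', 'rb') as model_file:
    pickled_model = pickle.load(model_file)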

# Sample text to classify
# neutral
#input_text = "Overall, the product meets my expectations. It does what it's supposed to do, but I wouldn't say it's outstanding. It's an average product with no major complaints. It gets the job done."
# positive
#input_text = "I'm absolutely thrilled with this product! It exceeded my expectations in every way. The quality is top-notch, and it's incredibly easy to use. I've been using it for a while now, and it has made my life so much better. I highly recommend it to anyone looking for a reliable and efficient solution."
# negative
input_text = "I'm really disappointed with this product. It didn't work as advertised, and I encountered numerous issues from the moment I started using it. The quality is subpar, and it's a waste of money. I regret purchasing it and wouldn't recommend it to anyone."


# Preprocess the input text
input_text = preprocess_text_with_progress(input_text)

# Vectorize the input text using the same TF-IDF vectorizer
input_vector = tfidf_vectorizer.transform([input_text])
# Make predictions
predicted_label = pickled_model.predict(input_vector)[0]

# Map the predicted label (0-4) to a sentiment: 3-4 positive, 2 neutral, 0-1 negative
if predicted_label > 2:
    print("Positive Sentiment")
elif predicted_label == 2:
    print("Neutral Sentiment")
else:
    print("Negative Sentiment")

print(predicted_label)

Negative Sentiment
0
