### From Survey Responses to Model Building

In this complete example, we generate a small set of synthetic survey responses for a new car called CarX, score them with the VADER sentiment analyzer, and derive sentiment labels from the compound scores. We then create a DataFrame to store the responses, sentiment scores, and sentiment labels.

Next, we apply CountVectorizer (bag-of-words) and TfidfVectorizer to numerically encode the responses and check the result in each case. We split the data into training and testing sets, oversample the training data with SMOTE to remove class imbalance, and then fit a Gaussian Naive Bayes classifier on the training set. The trained models are used to predict sentiment labels for the test set in each case.

Finally, we evaluate each model's performance by calculating the accuracy and the confusion matrix.

Feel free to modify the code according to your specific needs or further explore other machine learning algorithms and techniques for sentiment analysis.

In [1]:
# Import required modules

import numpy as np
import pandas as pd

import nltk

from imblearn.over_sampling import SMOTE # For oversampling

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.naive_bayes import GaussianNB,MultinomialNB,BernoulliNB
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.metrics import accuracy_score, classification_report,confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import resample
from sklearn.ensemble import RandomForestClassifier

from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer

In [2]:
# Load the English stop-word list from NLTK
# nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

In [3]:
# Initialize the sentiment analyzer
# nltk.download('vader_lexicon')

sid = SentimentIntensityAnalyzer()

In [4]:
# Market Research Survey responses about the car

synthetic_responses = [
    "I love the sleek design of the car.",
    "The car has excellent fuel efficiency.",
    "The safety features in the car are top-notch.",
    "The car's performance on the road is outstanding.",
    "The interior of the car is spacious and comfortable.",
    "I'm impressed with the advanced technology features in the car.",
    "The car offers great value for the price.",
    "The handling and maneuverability of the car are superb.",
    "The car's infotainment system is user-friendly and intuitive.",
    "I appreciate the ample storage space in the car.",
    "The car's acceleration is impressive.",
    "The sound system in the car provides excellent audio quality.",
    "The car's exterior design is eye-catching.",
    "I find the car to be reliable and dependable.",
    "The car offers a smooth and comfortable ride.",
    "The car's braking system is efficient and responsive.",
    "The car's suspension provides a comfortable driving experience.",
    "I like the variety of color options available for the car.",
    "The car's maintenance costs are reasonable.",
    "The car's warranty coverage is comprehensive.",
    "The car's headlights offer excellent visibility at night.",
    "The car's seats are ergonomic and supportive.",
    "The car's handling in different weather conditions is impressive.",
    "The car's fuel economy exceeds my expectations.",
    "The car's safety ratings are reassuring.",
    "The car's technology integration with smartphones is seamless.",
    "I appreciate the car's spacious trunk capacity.",
    "The car's design reflects a modern and stylish look.",
    "The car's navigation system is accurate and reliable.",
    "The car's interior materials are of high quality.",
    "The car's climate control system provides optimal comfort.",
    "The car's engine power is impressive.",
    "I enjoy the panoramic sunroof in the car.",
    "The car's audio system offers immersive sound quality.",
    "The car's exterior color options are appealing.",
    "The car's transmission provides smooth gear shifts.",
    "The car's fuel efficiency allows for long drives without frequent refueling.",
    "The car's resale value is competitive.",
    "The car's seat comfort makes long trips enjoyable.",
    "The car's entertainment options cater to all passengers.",
    "The car's build quality feels sturdy and durable.",
    "The car's technology features enhance the driving experience.",
    "The car's suspension absorbs road bumps effectively.",
    "The car's interior lighting creates a pleasant ambiance.",
    "The car's handling in tight spaces is effortless.",
    "I appreciate the car's safety assist features, such as blind-spot monitoring and lane-keeping assist.",
    "The car's fuel tank capacity allows for extended driving range.",
    "The car's acceleration from 0 to 60 mph is impressive.",
    "The car's seating configuration offers flexibility for passengers and cargo.",
    "The car's exterior design stands out from other vehicles on the road.",
    "The car's engine noise is minimal during acceleration.",
    "The car's parking assist system makes parking hassle-free.",
    "The car's dashboard layout is intuitive and easy to navigate.",
    "The car's high-quality materials give it a luxurious feel.",
    "The car's safety features provide peace of mind.",
    "The car's suspension system offers a smooth and comfortable ride.",
    "I appreciate the car's fuel-saving start-stop feature.",
    "The car's responsive steering enhances the driving experience.",
    "The car's smartphone integration allows for seamless connectivity.",
    "The car's seating configuration offers no flexibility for passengers and cargo.",
    "The car's exterior design is old fashioned and boring.",
    "The car's engine noise is horrible during acceleration.",
    "The car's parking assist system is complicated",
    "The car's dashboard layout is not intuitive and difficult to navigate.",
    "The car's high-quality materials make it very expensive.",
    "The car's safety features creates many doubts.",
    "The car's suspension system offers is not smooth.",
    "I appreciate the car's fuel-saving start-stop feature but it is irritating.",
    "The car's steering is not responsive.",
    "The car's smartphone integration works erratically."
]


In [5]:
# Perform sentiment analysis on the responses
sentiment_scores = []
for response in synthetic_responses:
    sentiment_score = sid.polarity_scores(response)
    sentiment_scores.append(sentiment_score)
    
len(sentiment_scores)

70
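The compound value returned by `polarity_scores` is a normalized score in the range -1 to 1. As a minimal sketch of that normalization, assuming the formula and the default `alpha = 15` used in the VADER reference implementation, the snippet below squashes a summed word valence into that range. The input value 3.2 is an assumption chosen only for illustration; the lexicon lookup and the rule-based adjustments VADER applies before this step are not reproduced here.

```python
import math


def normalize(valence_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded summed valence into the range [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clip to the valid range as a safeguard
    return max(-1.0, min(1.0, score))


# 3.2 is an assumed summed valence, used only for illustration
print(round(normalize(3.2), 3))    # 0.637
print(normalize(0.0))              # 0.0
print(round(normalize(-3.2), 3))   # -0.637
```

A sentence with no scored words therefore ends up with a compound score of exactly 0.0, which is what several of the responses in this dataset receive.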

In [6]:
# Create a DataFrame to store the responses and sentiment scores
data = pd.DataFrame({'Response': synthetic_responses, 
                     'Positive': [score['pos'] for score in sentiment_scores],
                     'Negative': [score['neg'] for score in sentiment_scores],
                     'Neutral': [score['neu'] for score in sentiment_scores],
                      'Compound': [score['compound'] for score in sentiment_scores]})

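# Preprocess the responses: drop stop words, then convert to lower case.
# Note: the stop-word check runs before lowercasing, so capitalized words such as "The" are kept,
# and NLTK's list includes negations such as "not" and "no", which are removed.
data['Response'] = data['Response'].apply(lambda words: ' '.join(word.lower() for word in words.split() if word not in stop_words))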

print(data.tail())

                                             Response  Positive  Negative  \
65     the car's safety features creates many doubts.     0.441     0.198   
66         the car's suspension system offers smooth.     0.000     0.000   
67  i appreciate car's fuel-saving start-stop feat...     0.134     0.289   
68                     the car's steering responsive.     0.000     0.297   
69  the car's smartphone integration works erratic...     0.000     0.000   

    Neutral  Compound  
65    0.360    0.4019  
66    1.000    0.0000  
67    0.578   -0.4854  
68    0.703   -0.2755  
69    1.000    0.0000  
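The output above shows a side effect of stop-word removal: "The car's steering is not responsive." becomes "the car's steering responsive.", so the negation is lost before vectorization. The following is a minimal sketch of one possible alternative that lowercases first and keeps negation words. The `STOP_WORDS` and `NEGATIONS` sets and the `preprocess` helper are hypothetical stand-ins, not part of the notebook, and the tiny stop-word list replaces NLTK's full list only to keep the example self-contained.

```python
import string

# Hypothetical, deliberately tiny stop-word list (the notebook uses NLTK's English list)
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in", "not", "no"}
# Hypothetical set of negation words to protect from removal
NEGATIONS = {"not", "no", "never"}


def preprocess(text: str, keep_negations: bool = True) -> str:
    """Lowercase first, then drop stop words, optionally keeping negations."""
    kept = []
    for word in text.lower().split():
        bare = word.strip(string.punctuation)  # compare without surrounding punctuation
        if bare in STOP_WORDS and not (keep_negations and bare in NEGATIONS):
            continue
        kept.append(word)
    return " ".join(kept)


print(preprocess("The car's steering is not responsive."))
# car's steering not responsive.
print(preprocess("The car's steering is not responsive.", keep_negations=False))
# car's steering responsive.
```

Whether keeping negations helps the classifier on this dataset is not tested here; the sketch only shows how the text differs.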


In [7]:
# Create the Sentiment label column from the compound score: > 0.5 is positive, 0.25 to 0.5 is neutral, < 0.25 is negative

df = data.copy() # make a copy of the original dataframe

df.loc[df.Compound > 0.5, 'Sentiment'] = "positive"
df.loc[df.Compound <= 0.5, 'Sentiment'] = "neutral"
df.loc[df.Compound < 0.25, 'Sentiment'] = "negative"
print(df.head(10))

                                            Response  Positive  Negative  \
0                           i love sleek design car.     0.412       0.0   
1                 the car excellent fuel efficiency.     0.608       0.0   
2                 the safety features car top-notch.     0.286       0.0   
3            the car's performance road outstanding.     0.364       0.0   
4             the interior car spacious comfortable.     0.292       0.0   
5    i'm impressed advanced technology features car.     0.389       0.0   
6                  the car offers great value price.     0.520       0.0   
7           the handling maneuverability car superb.     0.339       0.0   
8  the car's infotainment system user-friendly in...     0.000       0.0   
9              i appreciate ample storage space car.     0.278       0.0   

   Neutral  Compound Sentiment  
0    0.588    0.6369  positive  
1    0.392    0.7351  positive  
2    0.714    0.4215   neutral  
3    0.636    0.6124  positive 
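The three `.loc` assignments in the cell above overwrite each other in sequence, so the effective rule is easier to see as a single function. The sketch below is an equivalent restatement under that reading; `label_from_compound` is a hypothetical helper name, and the sample scores are taken from the outputs shown in this notebook plus the boundary values 0.5 and 0.25.

```python
def label_from_compound(compound: float) -> str:
    """Map a VADER compound score to a label using the notebook's cut-offs."""
    if compound > 0.5:
        return "positive"
    if compound >= 0.25:
        return "neutral"
    return "negative"


for score in (0.6369, 0.5, 0.4215, 0.25, 0.0, -0.4854):
    print(score, label_from_compound(score))
```

Note that a compound score of 0.0, which VADER returns when it finds no scored words, falls into the negative class under these cut-offs.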

In [8]:
# Check if the dataset is balanced

print(f"Counts of each class:\n{df['Sentiment'].value_counts()}")

Counts of each class:
negative    30
positive    23
neutral     17
Name: Sentiment, dtype: int64


In [9]:
# Separate y

y = df['Sentiment']

#### Count Vectorizer 

In [10]:
# Feature extraction using CountVectorizer (the bag-of-words method)

vectorizer = CountVectorizer()
vectorizer.fit(df['Response'])
X = vectorizer.transform(df['Response'])

# Convert to matrix form
X = X.toarray()


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.20, random_state=42)

# Oversample using smote
print('Original dataset shape %s' % Counter(y_train))

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)

print('Resampled (SMOTE) dataset shape %s' % Counter(y_res))

# Train a Gaussian Naive Bayes model
gnb1 = GaussianNB()
gnb1.fit(X_res, y_res)

# Check training accuracy
print(f"\nTraining Accuracy: {accuracy_score(y_res,gnb1.predict(X_res))*100:.2f}%\n\n")

# Predict sentiment labels for the test set
y_pred = gnb1.predict(X_test)

# Print the model's performance
# Note: y_pred is passed first, so the confusion matrix rows are predicted labels and the columns are true labels
print(f"Test Accuracy: {accuracy_score(y_pred,y_test)*100:.2f}%\n")
print(f"Confusion Matrix:\n{confusion_matrix(y_pred,y_test)}\n")

Original dataset shape Counter({'negative': 25, 'neutral': 16, 'positive': 15})
Resampled (SMOTE) dataset shape Counter({'negative': 25, 'positive': 25, 'neutral': 25})

Training Accuracy: 88.00%


Test Accuracy: 50.00%

Confusion Matrix:
[[2 0 3]
 [1 1 1]
 [2 0 4]]
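SMOTE balances the classes by creating synthetic minority samples rather than duplicating existing ones. As a rough sketch of the core idea only, the snippet below interpolates between a minority point and one of its neighbours. The nearest-neighbour search that SMOTE performs is omitted, and the function name and the two vectors are hypothetical.

```python
import random


def smote_like_sample(point, neighbour, rng):
    """Return a synthetic point on the segment between point and neighbour."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [p + lam * (n - p) for p, n in zip(point, neighbour)]


rng = random.Random(42)
minority_point = [1.0, 0.0, 2.0]       # hypothetical count vector
minority_neighbour = [3.0, 1.0, 2.0]   # hypothetical neighbour from the same class

synthetic = smote_like_sample(minority_point, minority_neighbour, rng)
print(synthetic)
```

One consequence for bag-of-words features is that the interpolated vectors are generally not whole-number counts, so the resampled training set no longer corresponds to real sentences.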



#### TFIDF Vectorizer
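As background for this section, here is a minimal sketch of how TF-IDF weights can be computed by hand. It assumes the smoothed IDF formula and L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults; the three-document corpus is hypothetical, and tokenization is reduced to a whitespace split.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, already preprocessed
docs = [
    "car excellent fuel efficiency",
    "car steering responsive",
    "car engine noise horrible",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}


def tfidf_vector(tokens):
    """Weight raw term counts by IDF, then scale the vector to unit length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm else raw


vec = tfidf_vector(tokenized[0])
print(dict(zip(vocab, (round(w, 3) for w in vec))))
```

Because "car" appears in every document, it receives the lowest weight, while terms that occur in a single response are weighted more heavily.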

In [11]:
# Feature extraction using TfidfVectorizer

tfidf = TfidfVectorizer()

y1 = y.copy() # Create a copy of y

# Fit the vectorizer on the responses
tfidf.fit(df['Response'])

# Transform the response column
X1 = tfidf.transform(df['Response'])

# Convert to matrix form
X1 = X1.toarray()

# Split the data into training and testing sets
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=.20, random_state=42)

# Oversample the training set using SMOTE
print('Original dataset shape %s' % Counter(y1_train))

sm = SMOTE(random_state=42)
X1_res, y1_res = sm.fit_resample(X1_train, y1_train)

print('Resampled (SMOTE) dataset shape %s' % Counter(y1_res))

# Instantiate a new Gaussian Naive Bayes model
gnb2 = GaussianNB()

# Train model
gnb2.fit(X1_res,y1_res)

# Check training accuracy
print(f"\nTraining Accuracy: {accuracy_score(y1_res,gnb2.predict(X1_res))*100:.2f}%\n\n")

# Predict on test set and check accuracy
pred = gnb2.predict(X1_test)

# Print the model's performance
print(f"Test Accuracy: {accuracy_score(pred,y1_test)*100:.2f}%\n")
print(f"Confusion Matrix:\n{confusion_matrix(pred,y1_test)}\n")

Original dataset shape Counter({'negative': 25, 'neutral': 16, 'positive': 15})
Resampled (SMOTE) dataset shape Counter({'negative': 25, 'positive': 25, 'neutral': 25})

Training Accuracy: 96.00%


Test Accuracy: 42.86%

Confusion Matrix:
[[2 0 4]
 [1 1 1]
 [2 0 3]]
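Accuracy alone hides how each class fares, so it can help to read the two confusion matrices directly. The sketch below copies them from the outputs above and computes accuracy and per-class recall. It assumes rows are predicted labels and columns are true labels, which follows from the cells calling `confusion_matrix` with the predictions as the first argument, and that the labels are in alphabetical order. The helper names are hypothetical.

```python
# Confusion matrices copied from the CountVectorizer and TF-IDF outputs.
# Rows are predicted labels, columns are true labels (predictions were passed first).
labels = ["negative", "neutral", "positive"]
cm_count = [[2, 0, 3], [1, 1, 1], [2, 0, 4]]
cm_tfidf = [[2, 0, 4], [1, 1, 1], [2, 0, 3]]


def accuracy(cm):
    """Share of test samples on the diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def recall_per_class(cm):
    """Correct predictions for a class divided by its true count (column sum)."""
    result = {}
    for j, label in enumerate(labels):
        column_sum = sum(row[j] for row in cm)
        result[label] = cm[j][j] / column_sum
    return result


print(f"{accuracy(cm_count):.2%}", recall_per_class(cm_count))
print(f"{accuracy(cm_tfidf):.2%}", recall_per_class(cm_tfidf))
```

With only 14 test samples, one of which is neutral, these figures are very coarse, and the gap between the high training accuracy and the test accuracy points to overfitting.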



#### Build a pipeline from response to sentiment prediction

In [12]:
# Function to predict on a response using both vectorizers and the corresponding models

def predict_sentiment(response):
    
    count_vec = vectorizer.transform(response)
    count_vec = count_vec.toarray()
    
    tfidf_response = tfidf.transform(response)
    tfidf_response = tfidf_response.toarray()
    
    prediction_countvec = gnb1.predict(count_vec)
    prediction_tfidf = gnb2.predict(tfidf_response)
    
    return prediction_countvec,prediction_tfidf
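The function above relies on the globally fitted `vectorizer`, `tfidf`, `gnb1` and `gnb2` objects. A possible alternative, sketched below, bundles a vectorizer and a classifier into one scikit-learn `Pipeline` so that raw text goes in and labels come out. This is not the notebook's method: it swaps in `MultinomialNB` for `GaussianNB` because it accepts the sparse matrix directly, it leaves out the SMOTE step, and the training texts and labels are hypothetical toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the survey responses and labels
train_texts = [
    "car excellent fuel efficiency",
    "car design superb",
    "car engine noise horrible",
    "car steering works poorly",
]
train_labels = ["positive", "positive", "negative", "negative"]

# The pipeline fits the vectorizer and the classifier together, so new text is
# always transformed with the vocabulary learned from the training data
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["car excellent", "car horrible"]))
```

Fitting the pipeline on the training split only would also keep test-set vocabulary out of the vectorizer, which the cells above do not do because they fit on the full DataFrame before splitting.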
                             

In [13]:
# New data to predict upon
resp = ["The car is horrible",'The car smartphone integration works poorly',"The car is excellent","the car is ordinary"]

# Convert to dataframe
new_input = pd.DataFrame({"response":resp})

# Do the same text preprocessing as on the training data
new_input['response'] = new_input['response'].apply(lambda words: ' '.join(word.lower() for word in words.split() if word not in stop_words))

new_input

                                      response
0                             the car horrible
1  the car smartphone integration works poorly
2                            the car excellent
3                                 car ordinary


In [14]:
# Predict on the new inputs

results = predict_sentiment(new_input['response'])
print(f"Countvectorizer Based: {results[0]}\n\nTFIDF based: {results[1]}\n\nCorrect answer should be: {['negative','negative','positive','neutral']}")

Countvectorizer Based: ['negative' 'negative' 'positive' 'neutral']

TFIDF based: ['negative' 'negative' 'positive' 'positive']

Correct answer should be: ['negative', 'negative', 'positive', 'neutral']
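To make the comparison explicit, the short sketch below counts how many of the four predictions printed above match the expected labels for each vectorizer. The lists are copied from that output.

```python
expected = ["negative", "negative", "positive", "neutral"]
predictions = {
    "CountVectorizer": ["negative", "negative", "positive", "neutral"],
    "TF-IDF": ["negative", "negative", "positive", "positive"],
}

for name, predicted in predictions.items():
    hits = sum(p == e for p, e in zip(predicted, expected))
    print(f"{name}: {hits}/{len(expected)} match")
# CountVectorizer: 4/4 match
# TF-IDF: 3/4 match
```

Four hand-written sentences are far too few to rank the two models, so this is a sanity check rather than an evaluation.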
