In [1]:
# Sentiment Classification using Traditional NLP (TF-IDF + Logistic Regression)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 1. Example dataset (10 sentences)
sentences = [
    "I love this product, it works great!",
    "This is the worst experience I’ve ever had.",
    "Absolutely fantastic! Highly recommend it.",
    "I hate how slow and buggy this is.",
    "The movie was amazing and full of surprises.",
    "Terrible service, I will never come back.",
    "The food was delicious and well presented.",
    "I’m disappointed, the quality was poor.",
    "Excellent performance, I’m really impressed.",
    "Not worth the money, very bad quality."
]

# Corresponding labels (1 = positive, 0 = negative)
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# 2. Create TF-IDF + Logistic Regression pipeline
model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# 3. Train the model
model.fit(sentences, labels)

# 4. Test on new examples
test_sentences = [
    "I really enjoyed this!",
    "That was awful and boring."
]

predictions = model.predict(test_sentences)

# 5. Show results
for text, pred in zip(test_sentences, predictions):
    sentiment = "Positive 😊" if pred == 1 else "Negative 😠"
    print(f"'{text}' → {sentiment}")


'I really enjoyed this!' → Positive 😊
'That was awful and boring.' → Positive 😊


The roles of `TfidfVectorizer` and `LogisticRegression` in the code above are as follows:

**`TfidfVectorizer`:**

*   **Job:** `TfidfVectorizer` converts the text data (the sentences) into a numerical representation that a machine learning model can work with. It does this by computing a TF-IDF score for each word in each document (here, each sentence).
*   **TF-IDF:** TF-IDF stands for Term Frequency-Inverse Document Frequency.
    *   **Term Frequency (TF):** Measures how often a word appears in a document.
    *   **Inverse Document Frequency (IDF):** Measures how rare a word is across all documents. Words that appear in many documents (like "the" or "a") get a lower IDF, while words that appear in only a few documents get a higher IDF.
*   **Output:** A sparse matrix in which each row represents a sentence and each column represents a unique word from the vocabulary learned during fitting. Each value is the TF-IDF score of that word in that sentence, so the matrix captures how important each word is to a sentence relative to the whole dataset.

**`LogisticRegression`:**

*   **Job:** `LogisticRegression` is a linear model used here for binary classification (positive or negative sentiment). It takes the TF-IDF matrix as input and learns to predict the probability that a sentence belongs to each class.
*   **How it uses the `TfidfVectorizer` output:** Each column of the TF-IDF matrix (one word's scores) is an input feature. The model learns one weight per word feature; the sign and magnitude of that weight determine how strongly the word pushes a prediction toward positive or negative sentiment.

**In summary:**

`TfidfVectorizer` transforms the raw text into numerical features that reflect the importance of words in each sentence. These features are fed into `LogisticRegression`, which learns a decision boundary and uses it to classify new sentences as positive or negative. The `make_pipeline` function chains the two steps into a single model that takes raw text as input and outputs sentiment predictions.
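
In the output above, "That was awful and boring." is labeled Positive. One likely reason is that the model can only use words it saw during training. The sketch below refits the vectorizer on the same ten training sentences and checks which test words are in the learned vocabulary. It is a diagnostic aid rather than a full explanation of the prediction.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same ten training sentences as in the first cell
sentences = [
    "I love this product, it works great!",
    "This is the worst experience I’ve ever had.",
    "Absolutely fantastic! Highly recommend it.",
    "I hate how slow and buggy this is.",
    "The movie was amazing and full of surprises.",
    "Terrible service, I will never come back.",
    "The food was delicious and well presented.",
    "I’m disappointed, the quality was poor.",
    "Excellent performance, I’m really impressed.",
    "Not worth the money, very bad quality.",
]

vectorizer = TfidfVectorizer().fit(sentences)
analyzer = vectorizer.build_analyzer()  # same lowercasing and tokenization as fit
vocab = vectorizer.vocabulary_

coverage = {}
for text in ["I really enjoyed this!", "That was awful and boring."]:
    tokens = analyzer(text)
    known = [t for t in tokens if t in vocab]
    unknown = [t for t in tokens if t not in vocab]
    coverage[text] = (known, unknown)
    print(f"{text!r}: known={known}, unknown={unknown}")
```

The sentiment-bearing words "awful" and "boring" never occur in the training data, so they contribute nothing. The only known words are "was" and "and", each of which appears in two positive training sentences and one negative one. A larger and more varied training set is the usual remedy.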

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Example data
texts = [
    "I love this phone", "I hate this product",
    "Absolutely amazing", "Very bad experience",
    "Fantastic quality", "Terrible item"
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = Positive, 0 = Negative

# Create pipeline
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Test
test_sentences = ["This is awesome!", "Worst ever!"]

for sentence in test_sentences:
    prediction = model.predict([sentence])[0]
    sentiment = "Positive" if prediction == 1 else "Negative"
    print(f"'{sentence}' → {sentiment}")

'This is awesome!' → Positive
'Worst ever!' → Negative
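
Both predictions match the expected sentiment, but with six training phrases it is worth checking how confident the model is. The sketch below rebuilds the same pipeline and prints, for each test sentence, the words found in the training vocabulary and the class probabilities from `predict_proba` (columns follow `classes_`, so negative comes first).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Same training data as in the cell above
texts = [
    "I love this phone", "I hate this product",
    "Absolutely amazing", "Very bad experience",
    "Fantastic quality", "Terrible item",
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = Positive, 0 = Negative

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

count_vectorizer = model.named_steps["countvectorizer"]
analyzer = count_vectorizer.build_analyzer()
vocab = count_vectorizer.vocabulary_

results = {}
for sentence in ["This is awesome!", "Worst ever!"]:
    known = [t for t in analyzer(sentence) if t in vocab]
    p_negative, p_positive = (float(p) for p in model.predict_proba([sentence])[0])
    results[sentence] = (known, p_negative, p_positive)
    print(f"{sentence!r}: known={known}, "
          f"P(negative)={p_negative:.3f}, P(positive)={p_positive:.3f}")
```

Neither "awesome" nor "worst" is in the training vocabulary. "Worst ever!" contains no known words at all, so the model falls back to the class priors, which are equal here. "This is awesome!" is decided only by the word "this". The outputs above therefore say little about how well the model generalizes.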


In [6]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
import numpy as np

# Example data
texts = [
    "I love this movie", "This is awful",
    "Amazing story", "Very bad acting",
    "Fantastic direction", "Terrible sound"
]
labels = [1, 0, 1, 0, 1, 0]

# Tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
X = tokenizer.texts_to_sequences(texts)
X = pad_sequences(X, maxlen=5)
y = np.array(labels) # Convert labels to a numpy array

# Model
model = Sequential([
    # +1 because index 0 is reserved for padding; input_length is deprecated in Keras 3 and not needed
    Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=8),
    LSTM(16),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=20, verbose=0)

# Test
test_text = "I really enjoyed this film"
test = tokenizer.texts_to_sequences([test_text])
test = pad_sequences(test, maxlen=5)
prediction_probability = model.predict(test)[0][0]  # Get the single probability value
sentiment = "Positive" if prediction_probability > 0.5 else "Negative"
print(f"'{test_text}' → {sentiment} (Probability: {prediction_probability:.4f})")



1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 321ms/step
'I really enjoyed this film' → Positive (Probability: 0.5030)
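
A probability of 0.5030 is barely above the 0.5 threshold. One plausible contributor is that a Keras `Tokenizer` created without `oov_token` silently drops words it did not see in `fit_on_texts`. The sketch below approximates the tokenizer's default lowercasing and whitespace splitting in plain Python, so it runs without TensorFlow. It is an approximation for these punctuation-free texts, not the Keras implementation.

```python
# Same training texts as in the cell above
train_texts = [
    "I love this movie", "This is awful",
    "Amazing story", "Very bad acting",
    "Fantastic direction", "Terrible sound",
]

# Approximate the tokenizer's vocabulary: lowercase, split on whitespace
vocabulary = {word for text in train_texts for word in text.lower().split()}

test_text = "I really enjoyed this film"
tokens = test_text.lower().split()

kept = [t for t in tokens if t in vocabulary]         # words the model actually sees
dropped = [t for t in tokens if t not in vocabulary]  # words removed before the model

print(f"kept: {kept}")
print(f"dropped: {dropped}")
```

Only "i" and "this" survive tokenization, and neither carries sentiment, so the model has little to go on. Passing `oov_token` to `Tokenizer` makes unseen words visible as a shared placeholder index, though more training data is what would let the model learn from words like "enjoyed".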
