## Text Classification

## Part A: Practice Lab Exercises

### Program 1: Basic Text Classification using scikit-learn

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = [
    "I love this movie",
    "This film was excellent",
    "I hate this movie",
    "This film was terrible",
    "Amazing acting and story",
    "Worst movie ever",
]

labels = [
    "positive",
    "positive",
    "negative",
    "negative",
    "positive",
    "negative"
]

model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB())
])

model.fit(texts, labels)

text_sentences = [
    "The movie was amazing",
    "I hated the film"
]

predictions = model.predict(text_sentences)

for sentence, label in zip(text_sentences, predictions):
    print(f"Text: {sentence} => Predicted class: {label}")

Text: The movie was amazing => Predicted class: positive
Text: I hated the film => Predicted class: negative
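
To make the two predictions above less opaque, the following is a minimal pure-Python sketch of the arithmetic that a bag-of-words Naive Bayes classifier performs. It is not scikit-learn's implementation: the helper names `tokenize`, `train_nb`, and `predict_nb` are hypothetical, the tokenizer only approximates `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters), and it assumes add-one smoothing (`alpha=1.0`) with class priors taken from label frequencies.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Approximation of CountVectorizer's defaults: lowercase, keep tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def train_nb(texts, labels, alpha=1.0):
    # Count how often each word occurs in each class, and how many documents each class has.
    word_counts = defaultdict(Counter)
    doc_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab, alpha

def predict_nb(model, text):
    word_counts, doc_counts, vocab, alpha = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        # Start from the log prior, then add one smoothed log likelihood per known word.
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                numerator = word_counts[label][word] + alpha
                denominator = total_words + alpha * len(vocab)
                score += math.log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

texts = [
    "I love this movie", "This film was excellent", "I hate this movie",
    "This film was terrible", "Amazing acting and story", "Worst movie ever",
]
labels = ["positive", "positive", "negative", "negative", "positive", "negative"]

nb_model = train_nb(texts, labels)
results = [predict_nb(nb_model, s) for s in ["The movie was amazing", "I hated the film"]]
print(results)
```

On this toy data the sketch picks the same classes as the pipeline. The second sentence is decided by the single known word "film", so the margin is very small and should not be read as evidence that the model understands "hated".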


### Program 2: Probabilistic Text Classification using scikit-learn

In [8]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report

categories = [
    "rec.sport.baseball",
    "sci.space",
    "talk.politics.misc"
]

train_data = fetch_20newsgroups(subset="train", categories=categories)

test_data = fetch_20newsgroups(subset="test", categories=categories)

In [None]:
model = Pipeline([
    ("vectorizer", TfidfVectorizer(stop_words="english")),
    ("classifier", MultinomialNB())
])

model.fit(train_data.data, train_data.target)

predictions = model.predict(test_data.data)

accuracy = accuracy_score(test_data.target, predictions)

print("Accuracy:", accuracy)
print("Classification Report:\n")
print(classification_report(test_data.target, predictions, target_names=train_data.target_names))

sample_texts = [
    "The spacecraft was launched into orbit",
    "The baseball team won the championship"
]

sample_predictions = model.predict(sample_texts)

print("Custom predictions:")
for text, label in zip(sample_texts, sample_predictions):
    print(f"Text: {text}")
    print(f"Predicted category: {train_data.target_names[label]}\n")

Accuracy: 0.9564032697547684
Classification Report:

                    precision    recall  f1-score   support

rec.sport.baseball       0.97      0.99      0.98       397
         sci.space       0.93      0.98      0.96       394
talk.politics.misc       0.98      0.88      0.93       310

          accuracy                           0.96      1101
         macro avg       0.96      0.95      0.95      1101
      weighted avg       0.96      0.96      0.96      1101

Custom predictions:
Text: The spacecraft was launched into orbit
Predicted category: sci.space

Text: The baseball team won the championship
Predicted category: rec.sport.baseball
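
Since this program is framed as probabilistic classification, it can help to look at the class probabilities behind each prediction rather than only the winning label. The sketch below uses `predict_proba` on the same kind of pipeline. The four-sentence corpus and the names `toy_texts`, `toy_labels`, and `toy_model` are assumptions made so the example runs without downloading the newsgroup data; they are not part of the exercise.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus (an assumption, standing in for the newsgroup posts).
toy_texts = [
    "The rocket reached orbit around the moon",
    "NASA launched a new satellite",
    "The pitcher threw a perfect game",
    "The team won the baseball championship",
]
toy_labels = ["sci.space", "sci.space", "rec.sport.baseball", "rec.sport.baseball"]

toy_model = Pipeline([
    ("vectorizer", TfidfVectorizer(stop_words="english")),
    ("classifier", MultinomialNB()),
])
toy_model.fit(toy_texts, toy_labels)

queries = [
    "The spacecraft was launched into orbit",
    "The baseball team won the championship",
]

# One row per query, one column per class, in the order given by classes_.
probabilities = toy_model.predict_proba(queries)
class_names = toy_model.named_steps["classifier"].classes_

for text, row in zip(queries, probabilities):
    print(f"Text: {text}")
    for name, p in zip(class_names, row):
        print(f"  P({name}) = {p:.3f}")
```

Each row sums to 1. With a corpus this small the probabilities are only illustrative; on the full dataset the same call can be made on `model` from the cell above.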



## Part B: Try It Yourself Exercises

1. Design and implement a text classification system that automatically classifies movie reviews into positive and negative categories, using the IMDb movie review benchmark dataset.

Command to download IMDb dataset (using cURL):

```
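#!/bin/bash
# Download the archive, then extract it into the directory the notebook reads from.
curl -L -o ./imdb-dataset-of-50k-movie-reviews.zip \
  https://www.kaggle.com/api/v1/datasets/download/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
unzip -o ./imdb-dataset-of-50k-movie-reviews.zip -d ./imdb-dataset-of-50k-movie-reviews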
```

In [17]:
import pandas as pd

file_path = "./imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv"
df = pd.read_csv(file_path)
print(df.head())

from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report

model = Pipeline([
    ("vectorizer", TfidfVectorizer(stop_words="english")),
    ("classifier", MultinomialNB())
])

model.fit(train_data["review"], train_data["sentiment"])

predictions = model.predict(test_data["review"])

accuracy = accuracy_score(test_data["sentiment"], predictions)


print("Accuracy:", accuracy)
print("Classification Report:\n")
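# The labels are already strings, so the report names its rows from them in sorted order.
# Passing unique() as target_names could mislabel rows, because its order is not sorted.
print(classification_report(test_data["sentiment"], predictions))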

sample_texts = [
    "This movie is an utter waste of time. Absolutely terrible plot and poor acting.",
    "It's quite rare to see movies of this calibre being made. Once in a generation kind of acting and plot."
]

sample_predictions = model.predict(sample_texts)

print("Custom predictions:")
for text, label in zip(sample_texts, sample_predictions):
    print(f"Text: {text}")
    print(f"Predicted category: {label}\n")

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
Accuracy: 0.8652
Classification Report:

              precision    recall  f1-score   support

    negative       0.86      0.88      0.87      4961
    positive       0.88      0.85      0.86      5039

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000

Custom predictions:
Text: This movie is an utter waste of time. Absolutely terrible plot and poor acting.
Predicted category: negative

Text: It's quite rare to see movies of this calibre being made. Once in a generation kind of acting and plot.
Predicte
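
The `df.head()` output above shows that the raw reviews contain HTML line breaks such as `<br /><br />`, which the vectorizer would otherwise tokenize as the word "br". A minimal cleaning sketch follows, using only the standard library. The helper name `strip_html` is hypothetical, and the regular expression is a rough heuristic for simple tags rather than a full HTML parser.

```python
import html
import re

def strip_html(text):
    # Replace anything that looks like an HTML tag with a space.
    without_tags = re.sub(r"<[^>]+>", " ", text)
    # Decode entities such as &amp; into their plain characters.
    decoded = html.unescape(without_tags)
    # Collapse runs of whitespace left behind by the removed tags.
    return " ".join(decoded.split())

cleaned = strip_html("A wonderful little production. <br /><br />The filming technique")
print(cleaned)
```

One way to apply it, assuming the DataFrame from the cell above, would be `df["review"] = df["review"].map(strip_html)` before the train/test split. Whether this changes accuracy noticeably has not been measured here.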

2. Design and implement a text classification system that automatically classifies text messages into spam or not spam, using the SMS spam collection dataset.

Command to download SMS Spam Collection dataset (using cURL):
```
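#!/bin/bash
# Download the archive, then extract it into the directory the notebook reads from.
curl -L -o ./sms-spam-collection-dataset.zip \
  https://www.kaggle.com/api/v1/datasets/download/uciml/sms-spam-collection-dataset
unzip -o ./sms-spam-collection-dataset.zip -d ./sms-spam-collection-dataset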
```

In [28]:
import pandas as pd

file_path = "./sms-spam-collection-dataset/spam.csv"
df = pd.read_csv(file_path, encoding="latin-1")  # this file is not UTF-8 encoded, so the encoding must be given explicitly
print(df.head())

from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report

model = Pipeline([
    ("vectorizer", TfidfVectorizer(stop_words="english")),
    ("classifier", MultinomialNB())
])

model.fit(train_data["v2"], train_data["v1"])

predictions = model.predict(test_data["v2"])

accuracy = accuracy_score(test_data["v1"], predictions)


print("Accuracy:", accuracy)
print("Classification Report:\n")
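# The labels are already strings, so the report names its rows from them in sorted order.
# Passing unique() as target_names could mislabel rows, because its order is not sorted.
print(classification_report(test_data["v1"], predictions))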

sample_texts = [
    "Hey John, are we still on for the meeting tomorrow?",
    "As an esteemed customer, I am pleased to announce that following recent review of your Mob No. you are awarded with a £10 Prize, call 09063231539"
]

sample_predictions = model.predict(sample_texts)

print("Custom predictions:")
for text, label in zip(sample_texts, sample_predictions):
    print(f"Text: {text}")
    print(f"Predicted category: {label}\n")

     v1  ... Unnamed: 4
0   ham  ...        NaN
1   ham  ...        NaN
2  spam  ...        NaN
3   ham  ...        NaN
4   ham  ...        NaN

[5 rows x 5 columns]
Accuracy: 0.9668161434977578
Classification Report:

              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       965
        spam       1.00      0.75      0.86       150

    accuracy                           0.97      1115
   macro avg       0.98      0.88      0.92      1115
weighted avg       0.97      0.97      0.96      1115

Custom predictions:
Text: Hey John, are we still on for the meeting tomorrow?
Predicted category: ham

Text: As an esteemed customer, I am pleased to announce that following recent review of your Mob No. you are awarded with a £10 Prize, call 09063231539
Predicted category: spam
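
The report above shows perfect precision but only 0.75 recall for the spam class, so it is worth being clear about how those columns are computed. The sketch below derives precision, recall, and F1 from true positive, false positive, and false negative counts. The function name `precision_recall_f1` and the eight toy labels are assumptions chosen to reproduce the same pattern (no false alarms, some missed spam); they are not drawn from the dataset.

```python
def precision_recall_f1(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    # Count the three outcomes that matter for the chosen positive class.
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Precision: of the messages flagged as positive, how many really were.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the truly positive messages, how many were flagged.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (hypothetical): one of four spam messages is missed, no ham is flagged.
y_true = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="spam")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=1.00 recall=0.75 f1=0.86
```

Read this way, the spam row says the classifier never flags a legitimate message but lets roughly a quarter of spam through, which the overall accuracy of 0.97 hides because ham dominates the test set.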

