## **Quiz-2**

Perform the following tasks on the provided Reviews Dataset.
* Drop words that are not alphabetic.
* Tokenize the sentences.
* Perform lemmatization.
* Vectorize using bigrams and trigrams.
* Apply the Random Forest algorithm with 150 trees.
* Evaluate the overall accuracy of the model and the class-wise precision.

**Load Dataset and Basic Text Cleaning**

In [8]:
import pandas as pd
import re

data = pd.read_csv('reviews_dataset.csv')

print(data.head())

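def clean_text(text):
    # Delete every character that is not a letter or whitespace,
    # then collapse runs of whitespace into single spaces.
    return ' '.join(re.sub(r'[^a-zA-Z\s]', '', str(text)).split())

data['cleaned_text'] = data['news'].apply(clean_text)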

print(data[['news', 'cleaned_text']].head())


                                                news      type
0  China had role in Yukos split-up\n \n China le...  business
1  Oil rebounds from weather effect\n \n Oil pric...  business
2  Indonesia 'declines debt freeze'\n \n Indonesi...  business
3  $1m payoff for former Shell boss\n \n Shell is...  business
4  US bank in $515m SEC settlement\n \n Five Bank...  business
                                                news  \
0  China had role in Yukos split-up\n \n China le...   
1  Oil rebounds from weather effect\n \n Oil pric...   
2  Indonesia 'declines debt freeze'\n \n Indonesi...   
3  $1m payoff for former Shell boss\n \n Shell is...   
4  US bank in $515m SEC settlement\n \n Five Bank...   

                                        cleaned_text  
0  China had role in Yukos splitup China lent Rus...  
1  Oil rebounds from weather effect Oil prices re...  
2  Indonesia declines debt freeze Indonesia no lo...  
3  m payoff for former Shell boss Shell is to pay...  
4  US bank
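
The output above shows fragments such as `m payoff` and `splitup`: the regex deletes non-letter characters but keeps whatever letters remain in the word. A minimal sketch of the difference between that behaviour and dropping the whole word is given below. `strip_non_letters` mirrors `clean_text`; `drop_non_alpha_words` is a hypothetical alternative, not part of the original notebook.

```python
import re

def strip_non_letters(text):
    # Same approach as clean_text: delete every non-letter character.
    return ' '.join(re.sub(r'[^a-zA-Z\s]', '', str(text)).split())

def drop_non_alpha_words(text):
    # Hypothetical alternative: discard any whitespace-separated word
    # that contains at least one non-letter character.
    return ' '.join(word for word in str(text).split() if word.isalpha())

sample = "$1m payoff for former Shell boss"
stripped = strip_non_letters(sample)
dropped = drop_non_alpha_words(sample)
print(stripped)  # m payoff for former Shell boss
print(dropped)   # payoff for former Shell boss
```

The trade-off is that the second variant also discards words with attached punctuation (for example `boss.`), so it would normally be applied after tokenization rather than on raw text.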

**Preprocessing**

In [12]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer


# Download the tokenizer and WordNet resources (no-ops if already present).
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('all')  # superset of the packages above; a large download

lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Lowercase, tokenize, keep alphabetic tokens only, and lemmatize
    # each token (WordNetLemmatizer treats words as nouns by default).
    tokens = word_tokenize(text.lower())
    clean_tokens = [lemmatizer.lemmatize(word) for word in tokens if word.isalpha()]
    return ' '.join(clean_tokens)


data['processed_text'] = data['cleaned_text'].apply(preprocess_text)

print(data[['news', 'processed_text']].head())


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package aver

                                                news  \
0  China had role in Yukos split-up\n \n China le...   
1  Oil rebounds from weather effect\n \n Oil pric...   
2  Indonesia 'declines debt freeze'\n \n Indonesi...   
3  $1m payoff for former Shell boss\n \n Shell is...   
4  US bank in $515m SEC settlement\n \n Five Bank...   

                                      processed_text  
0  china had role in yukos splitup china lent rus...  
1  oil rebound from weather effect oil price reco...  
2  indonesia decline debt freeze indonesia no lon...  
3  m payoff for former shell bos shell is to pay ...  
4  u bank in m sec settlement five bank of americ...  


**Vectorization using Bigrams and Trigrams**

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(2, 3))  # Bigram and trigram
X = vectorizer.fit_transform(data['processed_text'])

print(f"Feature matrix shape: {X.shape}")


Feature matrix shape: (2225, 897973)
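
To make the roughly 900,000 columns more concrete, the sketch below enumerates the bigrams and trigrams of one short token sequence in plain Python. `word_ngrams` is a hypothetical helper written for illustration; it is not how `CountVectorizer` is implemented, and it skips the vectorizer's default tokenization, which drops single-character tokens before building n-grams.

```python
def word_ngrams(tokens, n_min=2, n_max=3):
    # Slide a window of every size from n_min to n_max over the tokens.
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

tokens = "oil rebound from weather effect".split()
grams = word_ngrams(tokens)
print(grams)
# 4 bigrams followed by 3 trigrams
```

Each distinct n-gram across all documents becomes one column, which is why the feature count grows so quickly when unigrams are excluded and longer windows are included.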


**Splitting the Data and Setting Up Labels**

In [15]:
from sklearn.model_selection import train_test_split

# The labels are the article categories in the 'type' column, not the text itself.
y = data['type']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set: {X_train.shape}, Test set: {X_test.shape}")


Training set: (1780, 897973), Test set: (445, 897973)


**Train Random Forest Classifier**

We'll train the model with 150 trees and evaluate its performance.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report


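# Random forest with 150 trees; random_state fixes the seed for reproducibility.
rf_model = RandomForestClassifier(n_estimators=150, random_state=42)
rf_model.fit(X_train, y_train)

# Predict the labels of the held-out test set.
y_pred = rf_model.predict(X_test)

# Overall accuracy: fraction of test samples classified correctly.
accuracy = accuracy_score(y_test, y_pred)
print(f"Overall Accuracy: {accuracy}")

# The report lists precision, recall, and F1-score for each class.
print("Classification Report:\n")
print(classification_report(y_test, y_pred))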
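
For reference, class-wise precision is the number of correct predictions of a class divided by the total number of predictions of that class. The sketch below computes it by hand on toy labels. The helper `per_class_precision` and the demo labels are hypothetical and are not results from this dataset; in scikit-learn the same per-class values can be obtained with `precision_score(y_test, y_pred, average=None)`.

```python
from collections import Counter

def per_class_precision(y_true, y_pred):
    # precision(c) = correct predictions of c / all predictions of c
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / predicted[label] for label in predicted}

# Toy labels for illustration only.
y_true_demo = ['business', 'sport', 'business', 'tech', 'sport']
y_pred_demo = ['business', 'business', 'business', 'tech', 'sport']
precision = per_class_precision(y_true_demo, y_pred_demo)
print(precision)
```

Classes that are never predicted are simply absent from the returned dictionary in this sketch, since their precision is undefined.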
