## Installing and Importing Packages

In [1]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.




In [2]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [50]:
import pandas as pd
pd.set_option('display.max_colwidth', 100)
import numpy as np
import re
import string 

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

from sklearn.model_selection import train_test_split , GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier , GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score , classification_report

In [4]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Reading the Data

In [5]:
imdb_prelim = pd.read_excel("IMDB_dataset.xlsx")
imdb_prelim.head()

Unnamed: 0,review,sentiment
0,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air...",positive
1,"Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a nobl...",positive
2,I sure would like to see a resurrection of a up dated Seahunt series with the tech they have tod...,positive
3,"This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 o...",negative
4,Encouraged by the positive comments about this film on here I was looking forward to watching th...,negative


In [6]:
imdb_prelim['review'].loc[23]

"It had all the clichÃ©s of movies of this type and no substance. The plot went nowhere and at the end of the movie I felt like a sucker for watching it. The production was good; however, the script and acting were B-movie quality. The casting was poor because there were good actors mixed in with crumby actors. The good actors didn't hold their own nor did they lift up the others. <br /><br />This movie is not worthy of more words, but I will say more to meet the minimum requirement of ten lines. James Wood and Cuba Gooding, Jr. play caricatures of themselves in other movies. <br /><br />If you are looking for mindless entertainment, I still wouldn't recommend this movie."

In [7]:
imdb_prelim.describe()

Unnamed: 0,review,sentiment
count,25000,25000
unique,24898,2
top,"When i got this movie free from my job, along with three other similar movies.. I watched then w...",positive
freq,3,12500


In [8]:
imdb_prelim.sentiment.value_counts()

positive    12500
negative    12500
Name: sentiment, dtype: int64

#### The dataset is balanced: 12,500 positive and 12,500 negative reviews.

## Data Preprocessing

### Converting 'review' to Lowercase

In [9]:
imdb_prelim['review'] = imdb_prelim['review'].str.lower()
imdb_prelim.head()

Unnamed: 0,review,sentiment
0,"i thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air...",positive
1,"probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a nobl...",positive
2,i sure would like to see a resurrection of a up dated seahunt series with the tech they have tod...,positive
3,"this show was an amazing, fresh & innovative idea in the 70's when it first aired. the first 7 o...",negative
4,encouraged by the positive comments about this film on here i was looking forward to watching th...,negative


### Removing HTML tags

In [10]:
def remove_html_tags(text):
    # Non-greedy match, so each tag (e.g. <br />) is removed on its own
    rmv = re.compile(r'<.*?>')
    return rmv.sub('', text)

In [11]:
imdb_prelim['review'] = imdb_prelim['review'].apply(remove_html_tags)
imdb_prelim['review'].loc[:24]

0     i thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air...
1     probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a nobl...
2     i sure would like to see a resurrection of a up dated seahunt series with the tech they have tod...
3     this show was an amazing, fresh & innovative idea in the 70's when it first aired. the first 7 o...
4     encouraged by the positive comments about this film on here i was looking forward to watching th...
5     phil the alien is one of those quirky films where the humour is based around the oddness of ever...
6     i saw this movie when i was about 12 when it came out. i recall the scariest scene was the big b...
7     so im not a big fan of boll's work but then again not many are. i enjoyed his movie postal (mayb...
8     this a fantastic movie of three prisoners who become famous. one of the actors is george clooney...
9     this movie made it into one of my top 10
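As a quick check of the tag pattern, the sketch below runs the same non-greedy regex on a short made-up string (the sample text is illustrative, not taken from the dataset). It also shows a hypothetical variant, `remove_html_tags_spaced`, that substitutes a space instead of an empty string, which may help keep the words on either side of a `<br />` from fusing together.

```python
import re

TAG_RE = re.compile(r'<.*?>')  # non-greedy: matches one tag at a time

def remove_html_tags(text):
    """Delete every HTML tag, as in the cell above."""
    return TAG_RE.sub('', text)

def remove_html_tags_spaced(text):
    """Hypothetical variant: replace each tag with a space, then collapse whitespace."""
    return ' '.join(TAG_RE.sub(' ', text).split())

sample = "no substance.<br /><br />the plot went nowhere"
print(remove_html_tags(sample))
print(remove_html_tags_spaced(sample))
```

The first call joins "substance." and "the" with no space between them; the second keeps them apart, at the cost of normalizing all whitespace in the review.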

### Removing URLs

In [12]:
def remove_url(text):
    # Raw string so \S and \. reach the regex engine without an invalid-escape warning
    re_url = re.compile(r'https?://\S+|www\.\S+')
    return re_url.sub('', text)
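A minimal sketch of how this pattern behaves, using a made-up sentence with placeholder URLs (the `example.com` addresses are assumptions for illustration only). The pattern is compiled once at module level here, which is a small design choice rather than a change in behavior.

```python
import re

# Matches http:// or https:// links, or bare www. links, up to the next whitespace
URL_RE = re.compile(r'https?://\S+|www\.\S+')

def remove_url(text):
    """Delete anything that looks like a URL."""
    return URL_RE.sub('', text)

sample = "see https://example.com/review for details or www.example.org today"
cleaned = remove_url(sample)
print(repr(cleaned))
```

Each removed URL leaves its surrounding spaces behind, so the result contains double spaces; the later `\W+` split treats a run of spaces as a single separator, so this does not add extra tokens.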

In [13]:
imdb_prelim['review'] = imdb_prelim['review'].apply(remove_url)
imdb_prelim['review'].loc[2336]

"a message movie, but a rather good one. outstanding cast, top to bottom. interesting in that bette davis's plot line is essentially back story! the extremely negative reviews (name throwing at the screenplay/playwright, associating this somehow with extremely negative comments about 'angles in america', etc. etc.) object to the movie being too preachy about germany in wwii. gosh, that is just a bit too sophisticated an understanding of morality for me.theatrical and movie-making, and acting styles vary over time and of course 70 years later this particular movie would not be made in this way. yes casablanca is a better movie (i guess), but although made in the same year and both having nazis in them, casablanca is primarily a love story. the love story in this movie takes second seat to the spy plot--more of a thriller. both have a rather large number of somewhat cheesy accents and wonderful character actors. the children are a bit tedious and could have been edited"

### Removing Punctuation

In [14]:
def remove_punct(text):
    re_punct = "".join([char for char in text if char not in string.punctuation])
    return re_punct

In [15]:
imdb_prelim['review'] = imdb_prelim['review'].apply(remove_punct)
imdb_prelim.head()

Unnamed: 0,review,sentiment
0,i thought this was a wonderful way to spend time on a too hot summer weekend sitting in the air ...,positive
1,probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble c...,positive
2,i sure would like to see a resurrection of a up dated seahunt series with the tech they have tod...,positive
3,this show was an amazing fresh innovative idea in the 70s when it first aired the first 7 or 8 ...,negative
4,encouraged by the positive comments about this film on here i was looking forward to watching th...,negative
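The character-by-character filter above can also be written with `str.translate`, which is usually faster on long strings. This is an alternative sketch, not the notebook's method; `remove_punct_translate` is a hypothetical name, and the sample sentence is made up.

```python
import string

# Translation table that maps every ASCII punctuation character to None (deleted)
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punct_translate(text):
    """Drop all characters listed in string.punctuation."""
    return text.translate(PUNCT_TABLE)

sample = "probably my all-time favorite movie, a story..."
print(remove_punct_translate(sample))
```

As in the output above, deleting the hyphen turns "all-time" into "alltime". Only ASCII punctuation is covered by `string.punctuation`, so other symbols pass through unchanged.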


### Tokenization

In [16]:
def tokenize(text):
    # Split on runs of non-word characters; the raw string avoids an invalid-escape warning
    tokens = re.split(r'\W+', text)
    return tokens

In [17]:
imdb_prelim['review_token'] = imdb_prelim['review'].apply(tokenize)
imdb_prelim.head()

Unnamed: 0,review,sentiment,review_token
0,i thought this was a wonderful way to spend time on a too hot summer weekend sitting in the air ...,positive,"[i, thought, this, was, a, wonderful, way, to, spend, time, on, a, too, hot, summer, weekend, si..."
1,probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble c...,positive,"[probably, my, alltime, favorite, movie, a, story, of, selflessness, sacrifice, and, dedication,..."
2,i sure would like to see a resurrection of a up dated seahunt series with the tech they have tod...,positive,"[i, sure, would, like, to, see, a, resurrection, of, a, up, dated, seahunt, series, with, the, t..."
3,this show was an amazing fresh innovative idea in the 70s when it first aired the first 7 or 8 ...,negative,"[this, show, was, an, amazing, fresh, innovative, idea, in, the, 70s, when, it, first, aired, th..."
4,encouraged by the positive comments about this film on here i was looking forward to watching th...,negative,"[encouraged, by, the, positive, comments, about, this, film, on, here, i, was, looking, forward,..."
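One detail of `re.split` worth seeing in isolation: when the text starts or ends with a separator, the result contains an empty string. The sketch below shows this on a made-up string and contrasts it with `re.findall`, used here in a hypothetical helper `tokenize_nonempty` that returns only the word tokens.

```python
import re

def tokenize(text):
    """Split on runs of non-word characters, as in the cell above."""
    return re.split(r'\W+', text)

def tokenize_nonempty(text):
    """Hypothetical alternative: collect runs of word characters instead of splitting."""
    return re.findall(r'\w+', text)

sample = "great movie "  # trailing space, e.g. left behind by punctuation removal
print(tokenize(sample))
print(tokenize_nonempty(sample))
```

The empty token is harmless for reading the data, but it can end up as a feature if the tokens are passed to a vectorizer unfiltered.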


### Removing Stopwords

In [18]:
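# Note: this import rebinds the name `stopwords` (a list in the first import cell) to the corpus reader
from nltk.corpus import stopwords
# A set makes the `word not in stopwords_english` check much faster than a list
stopwords_english = set(stopwords.words('english'))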

In [19]:
def remove_stopwords(text):
    re_stp = [word for word in text if word not in stopwords_english]
    return re_stp

In [20]:
imdb_prelim['review_stopwords'] = imdb_prelim['review_token'].apply(remove_stopwords)
imdb_prelim.head()

Unnamed: 0,review,sentiment,review_token,review_stopwords
0,i thought this was a wonderful way to spend time on a too hot summer weekend sitting in the air ...,positive,"[i, thought, this, was, a, wonderful, way, to, spend, time, on, a, too, hot, summer, weekend, si...","[thought, wonderful, way, spend, time, hot, summer, weekend, sitting, air, conditioned, theater,..."
1,probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble c...,positive,"[probably, my, alltime, favorite, movie, a, story, of, selflessness, sacrifice, and, dedication,...","[probably, alltime, favorite, movie, story, selflessness, sacrifice, dedication, noble, cause, p..."
2,i sure would like to see a resurrection of a up dated seahunt series with the tech they have tod...,positive,"[i, sure, would, like, to, see, a, resurrection, of, a, up, dated, seahunt, series, with, the, t...","[sure, would, like, see, resurrection, dated, seahunt, series, tech, today, would, bring, back, ..."
3,this show was an amazing fresh innovative idea in the 70s when it first aired the first 7 or 8 ...,negative,"[this, show, was, an, amazing, fresh, innovative, idea, in, the, 70s, when, it, first, aired, th...","[show, amazing, fresh, innovative, idea, 70s, first, aired, first, 7, 8, years, brilliant, thing..."
4,encouraged by the positive comments about this film on here i was looking forward to watching th...,negative,"[encouraged, by, the, positive, comments, about, this, film, on, here, i, was, looking, forward,...","[encouraged, positive, comments, film, looking, forward, watching, film, bad, mistake, ive, seen..."
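The filtering step can be illustrated without downloading the NLTK corpus. The sketch below uses a tiny hand-written stopword set (`STOPWORDS_SAMPLE` is an assumption for illustration, not the NLTK list) and a membership test against a set, which takes roughly constant time per token.

```python
# Hypothetical, deliberately tiny stand-in for nltk's English stopword list
STOPWORDS_SAMPLE = frozenset({"i", "this", "was", "a", "to", "on", "too"})

def remove_stopwords(tokens, stopword_set=STOPWORDS_SAMPLE):
    """Keep only the tokens that are not in the stopword set."""
    return [word for word in tokens if word not in stopword_set]

tokens = ["i", "thought", "this", "was", "a", "wonderful", "way"]
print(remove_stopwords(tokens))
```

With the real NLTK list, the same function can be called as `remove_stopwords(tokens, set(stopwords.words('english')))`.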


### Stemming

In [21]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
def perform_stemming(text):
    new_text = [ps.stem(word) for word in text]
    return ' '.join(new_text)

In [22]:
imdb_prelim['review_stemmed'] = imdb_prelim['review_stopwords'].apply(perform_stemming)
imdb_prelim.head()

Unnamed: 0,review,sentiment,review_token,review_stopwords,review_stemmed
0,i thought this was a wonderful way to spend time on a too hot summer weekend sitting in the air ...,positive,"[i, thought, this, was, a, wonderful, way, to, spend, time, on, a, too, hot, summer, weekend, si...","[thought, wonderful, way, spend, time, hot, summer, weekend, sitting, air, conditioned, theater,...",thought wonder way spend time hot summer weekend sit air condit theater watch lightheart comedi ...
1,probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble c...,positive,"[probably, my, alltime, favorite, movie, a, story, of, selflessness, sacrifice, and, dedication,...","[probably, alltime, favorite, movie, story, selflessness, sacrifice, dedication, noble, cause, p...",probabl alltim favorit movi stori selfless sacrific dedic nobl caus preachi bore never get old d...
2,i sure would like to see a resurrection of a up dated seahunt series with the tech they have tod...,positive,"[i, sure, would, like, to, see, a, resurrection, of, a, up, dated, seahunt, series, with, the, t...","[sure, would, like, see, resurrection, dated, seahunt, series, tech, today, would, bring, back, ...",sure would like see resurrect date seahunt seri tech today would bring back kid excit mei grew b...
3,this show was an amazing fresh innovative idea in the 70s when it first aired the first 7 or 8 ...,negative,"[this, show, was, an, amazing, fresh, innovative, idea, in, the, 70s, when, it, first, aired, th...","[show, amazing, fresh, innovative, idea, 70s, first, aired, first, 7, 8, years, brilliant, thing...",show amaz fresh innov idea 70 first air first 7 8 year brilliant thing drop 1990 show realli fun...
4,encouraged by the positive comments about this film on here i was looking forward to watching th...,negative,"[encouraged, by, the, positive, comments, about, this, film, on, here, i, was, looking, forward,...","[encouraged, positive, comments, film, looking, forward, watching, film, bad, mistake, ive, seen...",encourag posit comment film look forward watch film bad mistak ive seen 950 film truli one worst...


## TF-IDF Vectorization

### Read and Clean the dataset

In [23]:
imdb = pd.read_excel("IMDB_dataset.xlsx")
imdb.columns = ['review', 'sentiment']
imdb.head()

Unnamed: 0,review,sentiment
0,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air...",positive
1,"Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a nobl...",positive
2,I sure would like to see a resurrection of a up dated Seahunt series with the tech they have tod...,positive
3,"This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 o...",negative
4,Encouraged by the positive comments about this film on here I was looking forward to watching th...,negative


### Create features for review length and the fraction of characters that are punctuation

In [24]:
def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)

imdb['body_len'] = imdb['review'].apply(lambda x: len(x) - x.count(" "))
imdb['punct%'] = imdb['review'].apply(lambda x: count_punct(x))

imdb.head()

Unnamed: 0,review,sentiment,body_len,punct%
0,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air...",positive,761,0.053
1,"Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a nobl...",positive,538,0.052
2,I sure would like to see a resurrection of a up dated Seahunt series with the tech they have tod...,positive,577,0.021
3,"This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 o...",negative,761,0.043
4,Encouraged by the positive comments about this film on here I was looking forward to watching th...,negative,552,0.056
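The two features can be checked by hand on a short string. The sketch below repeats the `count_punct` logic and adds a guard for text that has no non-space characters, where the original division would raise `ZeroDivisionError`; the guard and the sample strings are assumptions added for illustration.

```python
import string

def count_punct(text):
    """Fraction of non-space characters that are punctuation, rounded to 3 places."""
    non_space = len(text) - text.count(" ")
    if non_space == 0:  # empty or all-space text: avoid dividing by zero
        return 0.0
    count = sum(1 for char in text if char in string.punctuation)
    return round(count / non_space, 3)

sample = "Wow, great!"
body_len = len(sample) - sample.count(" ")
print(body_len, count_punct(sample))
```

For the sample, 2 of the 10 non-space characters are punctuation. The value is a fraction between 0 and 1, not a percentage, despite the `punct%` column name.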


### Removing Punctuation, Tokenizing, Removing Stopwords, and Stemming

In [25]:
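def clean_text(text):
    # Lowercase and drop punctuation, character by character
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    # Drop English stopwords and stem the remaining tokens
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text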

### Apply TfidfVectorizer

In [26]:
X = imdb['review']
y = imdb['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, stratify = y, random_state = 0)

In [27]:
import nltk

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

# Fit the vectorizer on the training reviews only (top 5,000 terms), so the test set does not influence the vocabulary or IDF weights
tfidf_vect_sample = TfidfVectorizer(analyzer=clean_text, max_features=5000)
X_tfidf_sample = tfidf_vect_sample.fit_transform(X_train)

# Reuse the fitted vocabulary for both sets (X_train_tfidf holds the same values as X_tfidf_sample)
X_train_tfidf = tfidf_vect_sample.transform(X_train)
X_test_tfidf = tfidf_vect_sample.transform(X_test)

print(X_tfidf_sample.shape)
print(tfidf_vect_sample.get_feature_names_out())


(12500, 5000)
['' '0' '1' ... 'zone' 'â' 'ã']
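To make the weights in the matrix less opaque, here is a small pure-Python sketch of the TF-IDF calculation. It assumes scikit-learn's default settings (raw term counts, smoothed IDF computed as `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row); the function name `tfidf` and the two toy documents are illustrative, and this is not the library's implementation.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    rows = []
    for doc in docs:
        raw = {term: count * idf[term] for term, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # L2-normalize so each document vector has unit length
        rows.append({term: w / norm for term, w in raw.items()} if norm else {})
    return rows

docs = [["great", "movi"], ["bad", "movi"]]
weights = tfidf(docs)
print(weights[0])
```

In the first document, "great" gets a larger weight than "movi" because "movi" appears in both documents and therefore has a lower IDF.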


In [28]:
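# Caution: imdb['body_len'] and imdb['punct%'] cover all 25,000 reviews in their original order,
# while X_tfidf_sample holds only the 12,500 training reviews in shuffled order, so this
# concat pairs each TF-IDF row with the metadata of a different review.
# X_features is not used by the models below.
X_features = pd.concat([imdb['body_len'], imdb['punct%'], pd.DataFrame(X_tfidf_sample.toarray())], axis=1)
X_features.head()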

Unnamed: 0,body_len,punct%,0,1,2,3,4,5,6,7,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
0,761,0.053,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,538,0.052,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,577,0.021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,761,0.043,0.114026,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,552,0.056,0.080035,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.073237,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
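The `pd.concat` call above joins on index labels: `imdb['body_len']` carries the original labels 0 to 24999, whereas the DataFrame built from `X_tfidf_sample` is relabelled 0 to 12499 in the shuffled order of `X_train`. One way to line the rows up is to select the metadata with the training index and reset it before concatenating. The sketch below shows the idea on a four-row toy frame; the values and the names `train_index` and `tfidf_rows` are hypothetical stand-ins for `X_train.index` and the dense TF-IDF matrix.

```python
import pandas as pd

# Toy stand-in for imdb: four reviews with the two hand-made features
imdb = pd.DataFrame({"body_len": [10, 20, 30, 40],
                     "punct%": [0.1, 0.2, 0.3, 0.4]})

# Shuffled labels of the training rows, playing the role of X_train.index
train_index = pd.Index([2, 0])

# One TF-IDF row per training review, in the same order as train_index
tfidf_rows = pd.DataFrame([[0.5, 0.0], [0.0, 0.7]])

# Pick the matching metadata rows, then drop the old labels so positions line up
meta = imdb.loc[train_index, ["body_len", "punct%"]].reset_index(drop=True)
X_features = pd.concat([meta, tfidf_rows], axis=1)
print(X_features)
```

Here the first feature row belongs to original review 2 and the second to review 0, matching the order of the TF-IDF rows.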


### Vectorizers output sparse matrices

In [29]:
X_tfidf_df = pd.DataFrame(X_tfidf_sample.toarray())
X_tfidf_df.columns = tfidf_vect_sample.get_feature_names_out()
X_tfidf_df

Unnamed: 0,Unnamed: 1,0,1,10,100,1000,1010,10br,11,110,...,your,youth,youv,z,zani,zero,zombi,zone,â,ã
0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.114026,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.080035,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.073237,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12495,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12496,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12497,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12498,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Using GridSearchCV on Random Forest

In [31]:
rf = RandomForestClassifier()
param = {'n_estimators': [10, 25, 50, 75],
        'max_depth': [30, 60, 90, None]}

gs = GridSearchCV(rf, param, cv=5, n_jobs=-1)
gs_fit = gs.fit(X_train_tfidf, y_train)
pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
11,44.361397,1.30284,0.2156,0.022383,90.0,75,"{'max_depth': 90, 'n_estimators': 75}",0.84,0.838,0.8288,0.8364,0.8396,0.83656,0.004084,1
15,33.314024,2.11726,0.124393,0.02632,,75,"{'max_depth': None, 'n_estimators': 75}",0.8412,0.8308,0.8372,0.8432,0.83,0.83648,0.005333,2
7,39.858996,1.379299,0.2074,0.011942,60.0,75,"{'max_depth': 60, 'n_estimators': 75}",0.8324,0.8288,0.8352,0.8388,0.8272,0.83248,0.004214,3
3,25.121795,0.409968,0.202002,0.024637,30.0,75,"{'max_depth': 30, 'n_estimators': 75}",0.8368,0.8316,0.8288,0.8408,0.8192,0.83144,0.007391,4
14,28.502211,1.318316,0.114993,0.008842,,50,"{'max_depth': None, 'n_estimators': 50}",0.8312,0.828,0.8344,0.8312,0.8212,0.8292,0.004483,5


## Using GridSearchCV on Gradient Boosting Classifier

In [33]:
gb = GradientBoostingClassifier()
param = {
    'n_estimators': [5, 10, 20, 30],
    'max_depth': [5,10,15],
    'learning_rate': [0.1]
}

gs2 = GridSearchCV(gb, param, cv=5, n_jobs=-1)
gs2_fit = gs2.fit(X_train_tfidf, y_train)
pd.DataFrame(gs2_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
7,224.562999,1.870217,0.023601,0.001745,0.1,10,30,"{'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 30}",0.7924,0.7884,0.7976,0.798,0.7952,0.79432,0.003572,1
11,302.054423,2.518665,0.017989,0.003787,0.1,15,30,"{'learning_rate': 0.1, 'max_depth': 15, 'n_estimators': 30}",0.7932,0.7892,0.7952,0.794,0.7944,0.7932,0.002101,2
6,150.498599,1.635302,0.021,0.003406,0.1,10,20,"{'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 20}",0.7744,0.7728,0.774,0.7792,0.7708,0.77424,0.002778,3
10,238.498,5.127184,0.018398,0.002727,0.1,15,20,"{'learning_rate': 0.1, 'max_depth': 15, 'n_estimators': 20}",0.7748,0.778,0.7704,0.7772,0.7704,0.77416,0.003246,4
3,93.235799,0.695044,0.02,0.002608,0.1,5,30,"{'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 30}",0.7696,0.7748,0.764,0.7764,0.7752,0.772,0.00463,5


In [36]:
# Random Forest with GridSearchCV
best_rf_params = gs_fit.best_params_
best_rf_model = RandomForestClassifier(**best_rf_params)
best_rf_model.fit(X_train_tfidf, y_train)
y_pred_rf = best_rf_model.predict(X_test_tfidf)

# Evaluation metrics for Random Forest
print("Random Forest Evaluation Metrics:")
print(classification_report(y_test, y_pred_rf))
print(f"Best Parameters: {best_rf_params}")

# Gradient Boosting with GridSearchCV
best_gb_params = gs2_fit.best_params_
best_gb_model = GradientBoostingClassifier(**best_gb_params)
best_gb_model.fit(X_train_tfidf, y_train)
y_pred_gb = best_gb_model.predict(X_test_tfidf)

# Evaluation metrics for Gradient Boosting
print("\nGradient Boosting Evaluation Metrics:")
print(classification_report(y_test, y_pred_gb))
print(f"Best Parameters: {best_gb_params}")


Random Forest Evaluation Metrics:
              precision    recall  f1-score   support

    negative       0.83      0.85      0.84      6250
    positive       0.85      0.83      0.84      6250

    accuracy                           0.84     12500
   macro avg       0.84      0.84      0.84     12500
weighted avg       0.84      0.84      0.84     12500

Best Parameters: {'max_depth': 90, 'n_estimators': 75}

Gradient Boosting Evaluation Metrics:
              precision    recall  f1-score   support

    negative       0.82      0.77      0.80      6250
    positive       0.78      0.83      0.81      6250

    accuracy                           0.80     12500
   macro avg       0.80      0.80      0.80     12500
weighted avg       0.80      0.80      0.80     12500

Best Parameters: {'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 30}
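`GridSearchCV` refits the best parameter combination on the whole training set by default (`refit=True`) and exposes that model as `best_estimator_`, so fitting a new classifier from `best_params_` is optional. The sketch below illustrates this on synthetic data from `make_classification`; the data, the small grid, and `random_state=0` are assumptions chosen so the example runs quickly, not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Small synthetic binary classification problem (illustrative only)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

param = {"n_estimators": [5, 10], "max_depth": [3, None]}
gs = GridSearchCV(RandomForestClassifier(random_state=0), param, cv=3)
gs.fit(X_train, y_train)

# Already refit on all of X_train with the best parameters
best_rf_model = gs.best_estimator_
y_pred = best_rf_model.predict(X_test)
print(gs.best_params_, (y_pred == y_test).mean())
```

Reusing `best_estimator_` also avoids a second fit, and fixing `random_state` keeps repeated runs comparable, since an unseeded forest gives slightly different scores each time it is trained.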


In [46]:
# Best parameters for Random Forest
best_rf_params = gs_fit.best_params_

# Create and fit the Random Forest model with the best parameters
best_rf_model = RandomForestClassifier(**best_rf_params)
best_rf_model.fit(X_train_tfidf, y_train)  

# Evaluate the Random Forest model on the test set
rf_predictions = best_rf_model.predict(X_test_tfidf)  
rf_accuracy = accuracy_score(y_test, rf_predictions) * 100
rf_precision = precision_score(y_test, rf_predictions, pos_label='positive') * 100
rf_recall = recall_score(y_test, rf_predictions, pos_label='positive') * 100
rf_f1 = f1_score(y_test, rf_predictions, pos_label='positive') * 100

# Print Random Forest Evaluation Metrics
print("Random Forest Evaluation Metrics:")
print(f"Best Parameters: {best_rf_params}")
print(f"Accuracy: {rf_accuracy:.2f}%")
print(f"Precision: {rf_precision:.2f}%")
print(f"Recall: {rf_recall:.2f}%")
print(f"F1 Score: {rf_f1:.2f}%")


Random Forest Evaluation Metrics:
Best Parameters: {'max_depth': 90, 'n_estimators': 75}
Accuracy: 83.92%
Precision: 84.32%
Recall: 83.34%
F1 Score: 83.83%


In [47]:
# Best parameters for Gradient Boosting
best_gb_params = gs2_fit.best_params_

# Create and fit the Gradient Boosting model with the best parameters
best_gb_model = GradientBoostingClassifier(**best_gb_params)
best_gb_model.fit(X_train_tfidf, y_train) 

# Evaluate the Gradient Boosting model on the test set
gb_predictions = best_gb_model.predict(X_test_tfidf)  
gb_accuracy = accuracy_score(y_test, gb_predictions) * 100  
gb_precision = precision_score(y_test, gb_predictions, pos_label='positive') * 100
gb_recall = recall_score(y_test, gb_predictions, pos_label='positive') * 100
gb_f1 = f1_score(y_test, gb_predictions, pos_label='positive') * 100

# Print Gradient Boosting Evaluation Metrics
print("\nGradient Boosting Evaluation Metrics:")
print(f"Best Parameters: {best_gb_params}")
print(f"Accuracy: {gb_accuracy:.2f}%")
print(f"Precision: {gb_precision:.2f}%")
print(f"Recall: {gb_recall:.2f}%")
print(f"F1 Score: {gb_f1:.2f}%")



Gradient Boosting Evaluation Metrics:
Best Parameters: {'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 30}
Accuracy: 80.31%
Precision: 78.48%
Recall: 83.54%
F1 Score: 80.93%


# Analysis
### Accuracy:

- Random Forest: 83.92%
- Gradient Boosting: 80.31%

**Analysis:** Random Forest is about 3.6 percentage points more accurate on the 12,500-review test set, so it classifies more reviews correctly overall than Gradient Boosting.

### Precision:

- Random Forest: 84.32%
- Gradient Boosting: 78.48%

**Analysis:** Random Forest's precision is about 5.8 percentage points higher: when it predicts a positive sentiment, that prediction is more likely to be correct than one from Gradient Boosting.

### Recall:

- Random Forest: 83.34%
- Gradient Boosting: 83.54%

**Analysis:** Recall is nearly identical. Gradient Boosting captures a slightly larger share of the truly positive reviews (by 0.2 percentage points), so the two models are effectively tied on this metric.

### F1 Score:

- Random Forest: 83.83%
- Gradient Boosting: 80.93%

**Analysis:** Random Forest's F1 score is about 2.9 percentage points higher, reflecting a better balance between precision and recall. Gradient Boosting's F1 score is pulled down by its lower precision.
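The F1 score is the harmonic mean of precision and recall, so the reported F1 values can be cross-checked from the precision and recall figures above. A minimal sketch (the helper name `f1_from` is hypothetical):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall for the positive class, as reported above
reported = {
    "Random Forest": (0.8432, 0.8334),
    "Gradient Boosting": (0.7848, 0.8354),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_from(p, r) * 100:.2f}%")
```

Both results agree with the reported F1 scores of 83.83% and 80.93%.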


# Conclusion 

Across the four metrics, Random Forest is the better-performing model in this experiment. It has higher accuracy, precision, and F1 score than Gradient Boosting, while recall is essentially tied (Gradient Boosting is ahead by 0.2 percentage points). Based on these evaluation metrics, the Random Forest model is recommended.