

Task 1: Text Classification using Naive Bayes

The task is to classify text data into different categories using a Naive Bayes classifier. Naive Bayes is a popular choice for text classification tasks due to its simplicity, efficiency, and effectiveness, especially when dealing with high-dimensional data like text.

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
import string
import pandas as pd

nltk.download('punkt')
nltk.download('stopwords')
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Data Loading: The training and test data are loaded from CSV files.


In [2]:
train_data = pd.read_csv("train.csv")


In [24]:
test_data = pd.read_csv("test.csv")

print(train_data.shape)
print(test_data.shape)

(2838, 7)
(728, 6)


In [3]:
# Keep the two text columns that will be preprocessed
text = train_data[['name', 'review']].copy()

text

Unnamed: 0,name,review
0,Splendid Pig,Experienced the Splendid Pig pop-up on a trip ...
1,Subway,"Ok, not a great location. But - The girl behin..."
2,Firehouse Subs,Soooo excited for this location and it did not...
3,Port Orleans Brewing,Just opened but feels like a hit. Big open spa...
4,New Orleans Cake Café & Bakery,In love with this hole in the wall. Def must g...
...,...,...
2833,Bar Frances,Picky friend and I went the other night and lo...
2834,Chef D'Z Cafe,Food is real good you get alot of food for the...
2835,Eyes on Canal,Doctor Jackson is great. As her practice grows...
2836,Slidershak Food Truck,"Tucked beneath the highway, slider shack is us..."


In [4]:
pd.DataFrame(test_data)

Unnamed: 0,ID,name,latitude,longitude,mean_checkin_time,review
0,478,McAlister's Deli,29.949670,-90.064408,16.0,I at there twice in the last week. The Deli i...
1,3490,The Soup Garden,29.922519,-90.091088,20.0,The soup is good however it's not $8.50 good. ...
2,2705,The Station Coffee Shop & Bakery,29.978549,-90.102926,17.0,"Friendly, personable service. Kept a quick flo..."
3,1973,Key West Hat Company,29.955036,-90.064357,17.0,I love hats. I probably have 20 of them. But I...
4,369,Lengua Madre,29.938117,-90.070331,0.0,This is exactly what I want in a dining experi...
...,...,...,...,...,...,...
723,1643,Soul's Seafood Market,29.937463,-90.090468,19.5,"It's near my home, so I should frequent it, ri..."
724,2326,Vincent's Italian Cuisine,29.941856,-90.132306,2.0,My husband and I just had dinner at Vincent's ...
725,3215,Igor's Lounge & Game Room,29.934535,-90.080520,7.0,"This is my local watering hole, so I may be a ..."
726,159,McDonald's,30.004244,-90.036407,14.0,NASTY NASTY NASTY!!!!!! Steer clear. I've bee...


Text Preprocessing: The train_data undergoes several preprocessing steps:
Tokenization: Splitting the train_data into individual tokens (words).
Lowercasing: Converting all tokens to lowercase to ensure consistency.
Removing Punctuation: Eliminating punctuation marks from the tokens.
Removing Stopwords: Filtering out common stopwords like "the" and "is".
Stemming (Optional): Reducing words to their root form using stemming.
Vectorization: The preprocessed text data is converted into a Bag-of-Words (BoW) representation using CountVectorizer.
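
The steps above can be sketched on a single made-up sentence as follows. This is a minimal illustration, not the notebook's exact code: it assumes a simple regular-expression tokenizer and a tiny hand-picked stopword set (`TOY_STOPWORDS`, hypothetical) in place of NLTK's `word_tokenize` and `stopwords`, and it skips stemming, so it runs without any downloads.

```python
import re
from collections import Counter

# Hypothetical stand-in for NLTK's English stopword list.
TOY_STOPWORDS = {"the", "is", "a", "and", "it", "this", "in", "with"}

def preprocess(text):
    """Tokenize, lowercase, drop punctuation, and remove stopwords."""
    # \w+ keeps runs of letters/digits, which also discards punctuation.
    tokens = re.findall(r"\w+", text.lower())
    return [tok for tok in tokens if tok not in TOY_STOPWORDS]

def bag_of_words(tokens):
    """Count how often each token occurs (one row of a BoW matrix)."""
    return Counter(tokens)

tokens = preprocess("In love with this hole in the wall. The biscuit is huge!")
counts = bag_of_words(tokens)
print(tokens)
print(counts)
```

Each distinct token becomes one column of the Bag-of-Words matrix, and the count is the value stored in that column for the review.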

In [5]:
# Step 1: Tokenization
for column in text:
    train_data[column] = train_data[column].apply(lambda x: word_tokenize(str(x)))

# Step 2: Lowercasing
for column in text:
    train_data[column] = [[token.lower() for token in sentence] for sentence in train_data[column]]

# Step 3: Removing punctuation (tokens that become empty are dropped)
table = str.maketrans('', '', string.punctuation)
for column in text:
    train_data[column] = train_data[column].apply(
        lambda tokens: [token.translate(table) for token in tokens if token.translate(table) != ''])

# Step 4: Removing stopwords
stop_words = set(stopwords.words('english'))
for column in text:
    train_data[column] = [[token for token in sentence if token not in stop_words] for sentence in train_data[column]]

# Step 5: Stemming (optional)
stemmer = PorterStemmer()
for column in text:
    train_data[column] = [[stemmer.stem(token) for token in sentence] for sentence in train_data[column]]

# Step 6 (vectorization) is carried out in the next cell
train_data



Unnamed: 0,ID,name,latitude,longitude,mean_checkin_time,review,category
0,1457,"[splendid, pig]",29.937778,-90.081429,17.0,"[experienc, splendid, pig, popup, trip, nola, ...",Restaurants
1,3526,[subway],29.932236,-90.004271,18.0,"[ok, great, locat, girl, behind, counter, actu...",Restaurants
2,1891,"[firehous, sub]",29.920835,-90.012189,18.0,"[soooo, excit, locat, disappoint, fire, lol, s...",Restaurants
3,3384,"[port, orlean, brew]",29.917005,-90.098272,19.0,"[open, feel, like, hit, big, open, space, list...",Nightlife
4,1297,"[new, orlean, cake, café, bakeri]",29.963752,-90.052964,16.0,"[love, hole, wall, def, must, get, biscuit, huge]",Restaurants
...,...,...,...,...,...,...,...
2833,1681,"[bar, franc]",29.935199,-90.104948,17.0,"[picki, friend, went, night, love, food, great...",Restaurants
2834,1855,"[chef, z, cafe]",29.963803,-90.071958,18.0,"[food, real, good, get, alot, food, price, bad...",Restaurants
2835,3281,"[eye, canal]",29.975829,-90.102318,19.0,"[doctor, jackson, great, practic, grow, make, ...",Shopping
2836,1181,"[slidershak, food, truck]",29.921968,-90.107394,13.0,"[tuck, beneath, highway, slider, shack, usual,...",Restaurants


In [6]:
train_data['name'] = train_data['name'].apply(lambda x:  ''.join(x))
train_data['review'] = train_data['review'].apply(lambda x:  ''.join(x))

vectorizer1 = CountVectorizer()
vectorizer2 = CountVectorizer()
name_vectorized = vectorizer1.fit_transform(train_data['name'])
review_vectorized = vectorizer2.fit_transform(train_data['review'])


name_df = pd.DataFrame(name_vectorized.toarray(), columns=vectorizer1.get_feature_names_out())
review_df = pd.DataFrame(review_vectorized.toarray(), columns=vectorizer2.get_feature_names_out())

df_vectorized = pd.concat([train_data.drop(columns = ['name','review']), name_df, review_df], axis=1)
df_vectorized[df_vectorized['1000fig']==2]

Unnamed: 0,ID,latitude,longitude,mean_checkin_time,category,1000fig,12barfultonstreet,14parishjamaicanrestaur,14parishoak,21stamendlalouisian,...,yettripizzameatbalsubterrifmeatbaltenderfreshbreadnicecrispigotlunchspecialincludsubbagchipsoftdrinkcame878floorspend67bucknowadaycrappifastfoodjointrealfoodless9buckdefinitgobackplacetrimenuitem,yukundercookfrioverlisaltifrioystermove,yumlovesatsumaneverlocatareathoughtgivetrifriendliaccommodstaffcleanspaceyummifoodbreakfastcantgowrongbaconeggcheesbiscuitonequalmfeelgenerhiplocatnicecontemporarintneworleancharmcouldrestaurtrendicitintfeeloverlispecialsensdefinitworthtriareasureback,yummintanythmenuwouldntorderexpectwaittimechickencookfresh,yummipizzarealliperfectwent5pmthursdayfolktimeordercustompizzapepperonisundritomatobasilgreatservicfriendliaccomodlikeambiencfunwatchmagazinstreetshopperstroll,yummitapacoolcrowdamazdjwishcouldgoeveriweekendtransportsfnolawellwishcouldvarioureasonnexttimegomimimakesuredjsoulsistaspindancassnightlongheartmimi,yummitridishexceptkatsutastilovesaucpairflavorfavoritbeefdishwelluniqutakepoboyalsokoreansoftdrinkhighlirecommendgrapeemployenicefriendliregardlesslinepatientanswerquestiononestarportionsmallknowkoreanfoodneworleanpricifoodidestincontinutradit,yuppifinicelyadornbarhomehothollerfrattypladidotedrinkoverprshameneighborhoodawesomgontparkperlilotnturinoutsidcaughttape,zanderownerfineestablishgangslingbestpizzatownhiddengemshookyelpcommunknowlovecometrislicepreorderfavoritlaterprotipgetsitereservpreorderpicktimeclassiccheespepperonimushroomjalapeñobambinospecialfavoritpicturstuffface,zerointentstopplacegentlemanreadpaperfirehydranttstopdiscussadordoghappenregularlimustadmitbartendstartlureduncantreatvodkastartluredrinkdognapfloorgotleast5treatdealbartenddresspurpllsushirtredbandannaaroundneckredchuckredsallijessraphaelstyleglassentertaineveryonsarcasmevenaskanythshopbagwouldlikeputfrigdrinksweetstaywatchnbagamediscussbasketbalcelticfansitnexteveryonfriendliserioustalkalmosteveripersonplac
eriothonestliwouldstayhourdidntwalkwayhomewaybartendaskreturncrawfishboilsaturdayjazzfesttoldcouldntmakefirstsaturdaywouldtricomefridaylatercutheytakedamnnoteseenexttimeahahahahahreturnblastpleasnotecashalthoughdoorwindowopenbitsmokeysideluckilismokeriseduncanlayfloorotherwisdontthinkwouldvfeltcomfortbreathair
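
The very long column names in the output above appear to come from joining each token list with an empty string: every review collapses into one unbroken string, which `CountVectorizer` then treats as a single word. A minimal sketch of the difference, using a whitespace split as a rough stand-in for the vectorizer's tokenizer (an assumption made to keep the example dependency-free):

```python
from collections import Counter

# Tokens of one preprocessed review, as shown in the table above.
tokens = ["love", "hole", "wall", "def", "must", "get", "biscuit", "huge"]

fused = "".join(tokens)    # 'loveholewalldefmustgetbiscuithuge'
spaced = " ".join(tokens)  # 'love hole wall def must get biscuit huge'

# Splitting on whitespace approximates how a bag-of-words vectorizer finds word boundaries.
fused_counts = Counter(fused.split())
spaced_counts = Counter(spaced.split())

print(len(fused_counts))   # one feature: the whole review
print(len(spaced_counts))  # one feature per distinct token
```

With space-joined text, reviews that share words also share columns, which is what lets the classifier generalize from one review to another.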


In [7]:
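import numpy as np

# Separate the features from the target column
X = df_vectorized.drop(columns='category')
# MultinomialNB rejects negative feature values, so take absolute values (this flips the sign of longitude)
X = np.abs(X)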

In [8]:
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
label_encoder = LabelEncoder()
# vectorizer_y = CountVectorizer()
# Encode the target variable
Y = label_encoder.fit_transform(train_data['category'])

type(Y)

numpy.ndarray

Solution Implementation:
Train-Test Split: The training data is split 70-30 into training and validation sets.
Model Selection: Grid search with 10-fold cross-validation is set up over the smoothing parameter alpha of Multinomial Naive Bayes; the grid used here contains the single value 0.55.
Model Training: The Multinomial Naive Bayes model with the selected alpha is trained on the training split.
Model Evaluation: The model's accuracy is measured on the held-out validation split using accuracy_score from scikit-learn.

In [9]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

X_train



Unnamed: 0,ID,latitude,longitude,mean_checkin_time,1000fig,12barfultonstreet,14parishjamaicanrestaur,14parishoak,21stamendlalouisian,2ndlinetourexperi,...,yettripizzameatbalsubterrifmeatbaltenderfreshbreadnicecrispigotlunchspecialincludsubbagchipsoftdrinkcame878floorspend67bucknowadaycrappifastfoodjointrealfoodless9buckdefinitgobackplacetrimenuitem,yukundercookfrioverlisaltifrioystermove,yumlovesatsumaneverlocatareathoughtgivetrifriendliaccommodstaffcleanspaceyummifoodbreakfastcantgowrongbaconeggcheesbiscuitonequalmfeelgenerhiplocatnicecontemporarintneworleancharmcouldrestaurtrendicitintfeeloverlispecialsensdefinitworthtriareasureback,yummintanythmenuwouldntorderexpectwaittimechickencookfresh,yummipizzarealliperfectwent5pmthursdayfolktimeordercustompizzapepperonisundritomatobasilgreatservicfriendliaccomodlikeambiencfunwatchmagazinstreetshopperstroll,yummitapacoolcrowdamazdjwishcouldgoeveriweekendtransportsfnolawellwishcouldvarioureasonnexttimegomimimakesuredjsoulsistaspindancassnightlongheartmimi,yummitridishexceptkatsutastilovesaucpairflavorfavoritbeefdishwelluniqutakepoboyalsokoreansoftdrinkhighlirecommendgrapeemployenicefriendliregardlesslinepatientanswerquestiononestarportionsmallknowkoreanfoodneworleanpricifoodidestincontinutradit,yuppifinicelyadornbarhomehothollerfrattypladidotedrinkoverprshameneighborhoodawesomgontparkperlilotnturinoutsidcaughttape,zanderownerfineestablishgangslingbestpizzatownhiddengemshookyelpcommunknowlovecometrislicepreorderfavoritlaterprotipgetsitereservpreorderpicktimeclassiccheespepperonimushroomjalapeñobambinospecialfavoritpicturstuffface,zerointentstopplacegentlemanreadpaperfirehydranttstopdiscussadordoghappenregularlimustadmitbartendstartlureduncantreatvodkastartluredrinkdognapfloorgotleast5treatdealbartenddresspurpllsushirtredbandannaaroundneckredchuckredsallijessraphaelstyleglassentertaineveryonsarcasmevenaskanythshopbagwouldlikeputfrigdrinksweetstaywatchnbagamediscussbasketbalcelticfansitnexteveryonfriendliserioustalkalmosteverip
ersonplaceriothonestliwouldstayhourdidntwalkwayhomewaybartendaskreturncrawfishboilsaturdayjazzfesttoldcouldntmakefirstsaturdaywouldtricomefridaylatercutheytakedamnnoteseenexttimeahahahahahreturnblastpleasnotecashalthoughdoorwindowopenbitsmokeysideluckilismokeriseduncanlayfloorotherwisdontthinkwouldvfeltcomfortbreathair
2092,1273,29.945841,90.071842,18.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1567,279,29.976821,90.099142,16.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1550,1415,29.957382,90.067295,3.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2785,504,29.977635,90.067838,18.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1936,2493,29.956175,90.064937,18.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1638,2731,29.935051,90.108954,19.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1095,2215,29.955075,90.068514,5.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1130,330,29.963616,90.057872,17.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1294,2939,29.946974,90.052519,17.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameters grid for MultinomialNB
param_grid = {'alpha': [0.55]}

# Perform grid search to find the best hyperparameters
grid_search = GridSearchCV(MultinomialNB(), param_grid, cv=10)
grid_search.fit(X_train, Y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

Best Hyperparameters: {'alpha': 0.55}
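
The `alpha` parameter is the additive (Laplace/Lidstone) smoothing term in Multinomial Naive Bayes: each word's class-conditional probability is estimated as (count + alpha) / (total + alpha × vocabulary size). A small sketch of that formula with made-up counts follows; the function name `smoothed_prob` and all numbers are illustrative only.

```python
def smoothed_prob(word_count, class_total, vocab_size, alpha):
    """Smoothed estimate of P(word | class) for Multinomial Naive Bayes."""
    return (word_count + alpha) / (class_total + alpha * vocab_size)

# Made-up counts: a class with 1000 tokens in total and a vocabulary of 5000 words.
for alpha in (0.1, 0.55, 1.0, 10.0):
    p_unseen = smoothed_prob(0, 1000, 5000, alpha)   # word never seen in the class
    p_seen = smoothed_prob(20, 1000, 5000, alpha)    # word seen 20 times in the class
    print(f"alpha={alpha}: unseen={p_unseen:.6f}, seen={p_seen:.6f}")
```

Larger values of alpha pull seen and unseen words toward the same probability. A grid with several candidate values, rather than the single value 0.55, would let the cross-validation actually compare them.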



In [11]:
best_mnb_model = MultinomialNB(alpha= 0.55)


# Train the model with the best hyperparameters
best_mnb_model.fit(X_train, Y_train)

# Make predictions on the testing data
y_pred_best = best_mnb_model.predict(X_test)


In [13]:
from sklearn import metrics
metrics.accuracy_score(Y_test,y_pred_best)


0.528169014084507
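
An accuracy of about 0.53 is easier to interpret next to a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent category. A minimal sketch with made-up labels follows; the label list is hypothetical, and in the notebook the input would be `Y_test`.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical validation labels, not taken from the dataset.
labels = ["Restaurants"] * 6 + ["Nightlife"] * 2 + ["Shopping"] * 2
print(majority_baseline_accuracy(labels))
```

If the model's accuracy is not clearly above this baseline, the features are adding little beyond the class frequencies.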

In [37]:
# Preprocess the test data with the same steps as the training data

# Step 1: Tokenization
for column in text:
    test_data[column] = test_data[column].apply(lambda x: word_tokenize(str(x)))

# Step 2: Lowercasing
for column in text:
    test_data[column] = [[token.lower() for token in sentence] for sentence in test_data[column]]

# Step 3: Removing punctuation (tokens that become empty are dropped)
table = str.maketrans('', '', string.punctuation)
for column in text:
    test_data[column] = test_data[column].apply(
        lambda tokens: [token.translate(table) for token in tokens if token.translate(table) != ''])

# Step 4: Removing stopwords
stop_words = set(stopwords.words('english'))
for column in text:
    test_data[column] = [[token for token in sentence if token not in stop_words] for sentence in test_data[column]]

# Step 5: Stemming (optional)
stemmer = PorterStemmer()
for column in text:
    test_data[column] = [[stemmer.stem(token) for token in sentence] for sentence in test_data[column]]

test_data


Unnamed: 0,ID,name,latitude,longitude,mean_checkin_time,review
0,478,"[mcalist, deli]",29.949670,-90.064408,16.0,"[twice, last, week, deli, locat, harrah, casin..."
1,3490,"[soup, garden]",29.922519,-90.091088,20.0,"[soup, good, howev, 850, good, would, understa..."
2,2705,"[station, coff, shop, bakeri]",29.978549,-90.102926,17.0,"[friendli, person, servic, kept, quick, flow, ..."
3,1973,"[key, west, hat, compani]",29.955036,-90.064357,17.0,"[love, hat, probabl, 20, still, love, buy, col..."
4,369,"[lengua, madr]",29.938117,-90.070331,0.0,"[exactli, want, dine, experi, great, food, jou..."
...,...,...,...,...,...,...
723,1643,"[soul, seafood, market]",29.937463,-90.090468,19.5,"[near, home, frequent, right, well, may, chang..."
724,2326,"[vincent, italian, cuisin]",29.941856,-90.132306,2.0,"[husband, dinner, vincent, tonight, great, exp..."
725,3215,"[igor, loung, game, room]",29.934535,-90.080520,7.0,"[local, water, hole, may, bit, bia, come, sinc..."
726,159,[mcdonald],30.004244,-90.036407,14.0,"[nasti, nasti, nasti, steer, clear, told, trou..."


In [39]:
test_data['name'] = test_data['name'].apply(lambda x:  ''.join(x))
test_data['review'] = test_data['review'].apply(lambda x:  ''.join(x))

vectorizer1 = CountVectorizer()
vectorizer2 = CountVectorizer()
name1_vectorized = vectorizer1.fit_transform(test_data['name'])
review1_vectorized = vectorizer2.fit_transform(test_data['review'])


name_df_1 = pd.DataFrame(name1_vectorized.toarray(), columns=vectorizer1.get_feature_names_out())
review_df_1 = pd.DataFrame(review1_vectorized.toarray(), columns=vectorizer2.get_feature_names_out())

test_vectorized = pd.concat([test_data.drop(columns = ['name','review']), name_df_1, review_df_1], axis=1)

test_vectorized

Unnamed: 0,ID,latitude,longitude,mean_checkin_time,1135decatur,13monaghan,3090nola,3southerngirl,8blockkitchenbar,acorn,...,wowsaidseegetparkhahangaroundparklotlikeeveryonelinsidstorefoooooorgetidkgorenovstoreclosursmhboxuponboxemptishelvhalffillshelvlitermessnexttimepasslol,wowseriouwhelmservpersonmenulaterdishdropdetailevensimplenjoywaitercouldgivegoodrecommendregardentrchoicstartduckbaseyelpreviewbestdishnightfiletcookwelluninvblandasksteaksaucredsnapperdrithreeglasswinelatertwodesertsignservermaitrfinalstopaskmealfinalchancinquirflavorsorbetdisappointgem,wowvisitnolalastweektooktruliawesominformsteamboatnatcheztripmarch30canttellinterestpleasantplupeoplwellorgannicejojoreservuswindowseatcouldbelievkindexcelservicnicerecommend,writereviewduegoodservichusbandreceivgoldmanstaffthifirsttimevisitneworleanlabrowstorenoticlotgreatpriceimpressworkthankmrgoldmanloritafrost,yearcomewhenevwantoriginpiecjewelrifriendjealoualwaygofriendcokenengagringstillloveringmuchdayfiancpropochosequalitisizediamondlightroomgetlotcomplimentringcantthankkenteamenough,yefeelamazfavoritplacefavoritfolkpickultimfoodfestexperisecondwouldnolajazzheritagfestmakefestspectacularthoughtneveraskfamilifestfoodifestheldmiddlnolafreefilllocalmusicfloodlocalfooddrinkletfacearoundwhatchawannakindagoodtimefoodmusttreyyencrawfishlobstersauchaydelkingcakebreadpudwalkercochondelaitpoboylorettapralinbeignetactirmathomabrassholicrouzancouldntforgetwolfmanshortgetamaztime,yelpwantgivestarplacentworthcrap00000starpizzahutorderonlindeliverieasirightnopemesswholeorderforgotwingcallmangerbarelistenofferhelpsendcorrectordertakebacknopeneverhappensentdriverbackgetwrongorderstillwingrefurefundneverbroughtrightorderntusehardearnmoneyreallineedsomeoncarejobneverorderplacecomplainprobablspitfoodntwasttimemoneygopizzahutjudgperezbestmanagkarmabitchwaitfuckbitchniceday,yeyeyeloveloveloveohyescorpionbowlpiñacoladaneverdrunkfrishrimptacojelloshotsgummitikicocktailouncdistilspiritcocktailconvertgummibearpineapplrumoneorderdel
iciservfriendlibarmixerinteriorfuninterprettikibartwogirlfriendfantastevensayyeyeyeloveloveloveperfectgroupfriendromantdateeverythrightntmissliveneworleanvisitfrenchquartertikibarsmallenoughkeepintimambiencbigenoughpartifriendbeautifulnoncommercitikitoltecawouldbecomoneregularhangoutneworleanenjoy,yummihappiseegoodlunchspotexchangcenterbuildvisittwiceordergetthumbquichfreshmadetastisalmonbltservryeaccordmenuaskmultigrainabsolutdelicioptionfreshhealthitasti,yummiloveindianfoodloveneworleanfoodheavenspiciflavorsamosalentilsoupfavoritlookforwardtripoboywinetastthursdaynicetouchback
0,478,29.949670,-90.064408,16.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3490,29.922519,-90.091088,20.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2705,29.978549,-90.102926,17.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1973,29.955036,-90.064357,17.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,369,29.938117,-90.070331,0.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
723,1643,29.937463,-90.090468,19.5,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
724,2326,29.941856,-90.132306,2.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
725,3215,29.934535,-90.080520,7.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
726,159,30.004244,-90.036407,14.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [40]:
# Make predictions on the test data
y_pred_test = best_mnb_model.predict(test_vectorized)

# Decode the predicted labels (if needed)
# predicted_labels_test = label_encoder.inverse_transform(y_pred_test)


ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- 1135decatur
- 13monaghan
- 1sttimeorderparticulardominontrealiz2windowsidecamelooklikeonewell2blackyoungladintevenacknowledglookrightcarntsmilentraihandanythelwait5minutfinalyoungblackmalecamewindowacknowledgretrievpizzafavoritfire2ladihiremen
- 2011boughthugepaintdonefencsalvagkatrinashiphomefloridayearlaterpaintstartpeelfirstcontactowneramenartistrepairplanworkwherebiownerrequestartiststayhomeartistrepairpaintreluctantliagrwouldmeantripmainresidfloridaaccommodordergetpaintrepairtimepaintstartdisappeareitherflakecompletabsorbfencshortlivalternplanwherebiownerwouldpickpaintwaysarasotamiamitakebackneworleanshipbackuslongtimenegotinotifiwouldshippaintideawouldwouldgowouldartistrepairwouldpayartistreapplipaintdisappearwantbuypaintdisappearpaintcostenormamountfrustratinconvenimoneystayaway
- 255karaokspotreasoncomesnagsmallroomrockfriendratedecentconsidbasicneedcover20hrroomratedrinkmunchprettieasifoodmehmeanbadgreathitmissdependserversometimchillsometimreallipushirecentturnoffmovesystemnewcomputsystemliterclckleerbleerelecngtotalbuzzkilllikewearmitteneverythmuchharderfindsongwasttimerealliwishgobackcontrolinstalkeyboardsometh
- ...
Feature names seen at fit time, yet now missing:
- 1000fig
- 100satisfiservicreceivdialonebroussardofficstaffpleasantworkpolitworktimeappointtechniciansuperfriendliworkdiligfixproblemwouldrecommenddialonebroussardeveryon
- 10sandwichmayb2tablespooncoleslawthickwhitebreadwheatbreadchoicbread1smallplasticcontainbarbqsaucmayb1teaspoonnearlienoughsweetgoodwrapsandwichtookhomegetdecentbreadbarbqsauchotsauceitherwayoverprgoodalsopickslawpicklcouldntgetrealli10gotkid
- 12barfultonstreet
- 14parishjamaicanrestaur
- ...


In [None]:
import pandas as pd

# Assumes 'predicted_labels_test' holds the decoded category predicted for each row of test_data

# Create a DataFrame with 'ID' and 'category' columns
df_result = pd.DataFrame({'ID': test_data['ID'], 'category': predicted_labels_test})

# Save the DataFrame to a CSV file
df_result.to_csv('predicted_labels1.csv', index=False)

print("Predicted labels saved to predicted_labels1.csv")


# ------------------------------------------------------------------------------------------------------------

Task 2: Text and Numerical Data Integration and Classification
Motivation:
In this task, the review text is combined with numerical data (latitude, longitude, mean check-in time) for classification. The motivation is to use both textual and numerical features to improve classification performance.

Data Representation and Preprocessing:
Text Preprocessing: As in Task 1, the review text is tokenized, lowercased, and stripped of stopwords; tokens containing non-alphabetic characters (punctuation, digits) are dropped. Unlike Task 1, no stemming is applied here.
Numerical Data Scaling: Numerical features are scaled to the range [0, 1] using MinMaxScaler, because Multinomial Naive Bayes requires non-negative feature values and longitude is negative.
Data Integration: Text and numerical features are combined into a single feature matrix using hstack from scipy.sparse.
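
Min-max scaling maps each numerical column to [0, 1] via (x - min) / (max - min), which is what makes the negative longitudes usable with Multinomial Naive Bayes. A dependency-free sketch of the idea follows; `fit_min_max` and `apply_min_max` are hypothetical helpers, not scikit-learn functions, and the longitudes are a few values copied from the tables above. The minimum and maximum are taken from the training values only and then reused for the test values, with clipping so that out-of-range test values stay non-negative.

```python
def fit_min_max(values):
    """Return the (min, max) learned from the training values."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Scale values with a previously learned range, clipped to [0, 1]."""
    span = hi - lo
    return [min(1.0, max(0.0, (v - lo) / span)) for v in values]

train_longitudes = [-90.081429, -90.004271, -90.012189, -90.098272]
test_longitudes = [-90.064408, -90.132306]  # the second lies outside the training range

lo, hi = fit_min_max(train_longitudes)
scaled_train = apply_min_max(train_longitudes, lo, hi)
scaled_test = apply_min_max(test_longitudes, lo, hi)
print(scaled_train)
print(scaled_test)
```

Reusing the training range for the test data keeps the two feature matrices on the same scale, in the same way that reusing the fitted vectorizer keeps their columns aligned.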

In [None]:

from scipy.sparse import hstack

def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [token.lower() for token in tokens if token.isalpha()]  # Lowercase and keep alphabetic tokens only
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]  # Remove stopwords
    return ' '.join(tokens)

# Reload the raw training data: Task 1 overwrote train_data['review'] in place.
# Assumption: `data` was meant to be a fresh copy of train.csv.
data = pd.read_csv("train.csv")

# Apply preprocessing to the review column
data['processed_reviews'] = data['review'].astype(str).apply(preprocess_text)

# Vectorize the text data
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(data['processed_reviews'])
data

Model Selection: Similar to Task 1, grid search is used to find the best hyperparameters for the Multinomial Naive Bayes model.
Model Training: The Multinomial Naive Bayes model is trained on the integrated text and numerical feature matrix.
Model Evaluation: The model's accuracy is evaluated on a holdout validation set using accuracy_score.

In [None]:
from sklearn.preprocessing import MinMaxScaler

numerical_data = train_data[['latitude', 'longitude', 'mean_checkin_time']]

# Scale numerical data to [0, 1] so that every feature is non-negative
scaler = MinMaxScaler()
X_numerical = scaler.fit_transform(numerical_data)

# Convert scaled data to DataFrame for easier manipulation
X_numerical_df = pd.DataFrame(X_numerical, columns=numerical_data.columns)

In [None]:
from scipy.sparse import hstack

# Combine the text features and the scaled numerical features;
# the NumPy array is passed directly and the result is stored as CSR so rows can be indexed
X_1 = hstack([X_text, X_numerical]).tocsr()

In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
Y_1 = label_encoder.fit_transform(data['category'])


Train-Test Split: As in Task 1, the data is split 70-30 into training and validation sets.

In [None]:
X_train_1, X_test_1, Y_train_1, Y_test_1 = train_test_split(X_1, Y_1, test_size=0.3, random_state=42)

Evaluation Procedure:
Grid Search Cross-Validation: Grid search with 5-fold cross-validation on the training split selects alpha from the candidates 0.1, 0.5, 1.0, and 10.0.
Model Training and Validation: The best model is trained on the training split and evaluated on the validation split.

In [None]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameters grid for MultinomialNB
param_grid = {'alpha': [0.1, 0.5, 1.0, 10.0]}

# Perform grid search to find the best hyperparameters
grid_search = GridSearchCV(MultinomialNB(), param_grid, cv=5)
grid_search.fit(X_train_1, Y_train_1)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)




Training/Validation Results:
The best value of alpha for the Multinomial Naive Bayes model is selected by grid search cross-validation.
The selected model is then scored on the validation set with accuracy_score; no output is recorded for these cells, so no accuracy value is reported here.

In [None]:
# Use the alpha selected by the grid search above
best_mnb_model_1 = MultinomialNB(alpha=best_params['alpha'])

# Train the model with the best hyperparameters
best_mnb_model_1.fit(X_train_1, Y_train_1)

# # Make predictions on the testing data
y_pred_best_1 = best_mnb_model_1.predict(X_test_1)
# print(y_pred_best_1.shape)
# print(Y_test_1.shape)

In [None]:
metrics.accuracy_score(Y_test_1,y_pred_best_1)


In [None]:
# preprocess_text is reused from the training cell above

# Reload the raw test data: Task 1 overwrote test_data['review'] in place
test_data = pd.read_csv("test.csv")
test_data['processed_reviews'] = test_data['review'].astype(str).apply(preprocess_text)

# Reuse the vectorizer fitted on the training reviews so the test columns match the model's features
X_test_text = vectorizer.transform(test_data['processed_reviews'])


In [None]:
numerical_test_data = test_data[['latitude', 'longitude', 'mean_checkin_time']]

# Reuse the scaler fitted on the training data; clip so that test values outside the training range stay in [0, 1]
X_numerical_test = scaler.transform(numerical_test_data).clip(0, 1)

# Convert scaled data to DataFrame for easier manipulation
X_numerical_df_test = pd.DataFrame(X_numerical_test, columns=numerical_data.columns)

In [None]:
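# Combine the test text features with the scaled numerical features (NumPy array), stored as CSR
X_test_combined = hstack([X_test_text, X_numerical_test]).tocsr()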


In [None]:
y_pred_test_2 = best_mnb_model_1.predict(X_test_combined)



predicted_labels_test = label_encoder.inverse_transform(y_pred_test_2)