## Classification of museum-related tweets in New York City

### This notebook describes two classification experiments with differently balanced datasets and different featuresets

### Experiment 1 contains a dataset with a "naturally" occurring class distribution (8%-92%) based on data downloaded from the Twitter API, with a random baseline performance of 85%
### Experiment 2 contains a dataset balanced 50%-50%, with a random baseline performance of 50%

### Both experiments also compare performance based on two different featuresets: a list of unigram word cooccurrences (synonym/semantic sets) and a dictionary list of museum names, abbreviations and tags

In [14]:
# The classification methodology is largely based on NLTK Chapter 6: http://www.nltk.org/book/ch06.html
# Other reference links are quoted throughout the notebook

In [15]:
# Importing necessary packages

import re
import nltk
import random
from nltk.classify import SklearnClassifier
from sklearn.svm import SVC
from nltk.corpus import stopwords

## Experiment 1 
### Datasets: text_1 and label_1

### Total = 7,033 tweets, museum-labeled = 583 tweets (8%)
### Random baseline performance: (0.92 x 0.92) + (0.08 x 0.08) = 0.8464 + 0.0064 = 0.8528, or about 0.85

##### Reference link: https://machinelearningmastery.com/dont-use-random-guessing-as-your-baseline-classifier/
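
The baseline arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function names `random_guess_baseline` and `majority_baseline` are illustrative and do not appear in the notebook. It also computes the majority-class baseline (always predicting "non-museum"), which is a stricter point of comparison on imbalanced data.

```python
# Sketch: expected accuracy of two simple baselines for a binary task.
# Function names are illustrative, not taken from the notebook.

def random_guess_baseline(p_positive):
    """Expected accuracy when labels are guessed at random in proportion to class frequency."""
    p_negative = 1 - p_positive
    return p_positive ** 2 + p_negative ** 2


def majority_baseline(p_positive):
    """Accuracy of always predicting the more frequent class."""
    return max(p_positive, 1 - p_positive)


total, museum = 7033, 583   # counts reported for Experiment 1
p = museum / total          # share of museum-labeled tweets

print(round(random_guess_baseline(0.08), 4))  # rounded proportions, as in the heading above
print(round(random_guess_baseline(p), 3))     # exact proportions
print(round(majority_baseline(p), 3))         # always answering "non-museum"
```

With the exact counts, the majority-class baseline is higher than the random-guess baseline, so it is worth keeping both numbers in mind when reading the accuracies below.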

## Data import and processing

### Input text file: text_1.txt

In [16]:
# Importing content file with .readlines()
# (the with-block closes the file once the lines have been read)

file = input("Please input your file: ")
with open(file, 'r', encoding="ISO-8859-1") as f:
    text = f.readlines()

Please input your file: text_1.txt


In [17]:
# Removing URL links from the content

text_list = []
for i in text:
    y = re.sub(r"http\S+", "", i)
    text_list.append(y)

print(text_list[1])

# Reference link: https://stackoverflow.com/questions/3094659/editing-elements-in-a-list-in-python
# Reference link: https://stackoverflow.com/questions/24399820/expression-to-remove-url-links-from-twitter-tweet/24399874

I'm at American Media in New York, NY 
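
Tweets in this dataset also carry HTML entities such as `&amp;`, which the URL pattern does not touch. The following sketch shows one way the cleaning steps could be combined in a single helper. The name `clean_tweet` and the sample URL are assumptions made for illustration; the notebook itself only removes URLs at this point and strips newlines in a later cell.

```python
import html
import re


def clean_tweet(raw):
    """Remove URLs, decode HTML entities, and collapse whitespace (illustrative helper)."""
    no_urls = re.sub(r"http\S+", "", raw)   # same pattern as the cell above
    decoded = html.unescape(no_urls)        # turns '&amp;' into '&'
    return " ".join(decoded.split())        # collapses spaces and drops the trailing newline


# Hypothetical sample line in the same shape as the input file
sample = "Just posted a photo @ Statue of Liberty &amp; Ellis Island Events https://example.com/x\n"
print(clean_tweet(sample))
```

Decoding entities before tokenizing keeps tokens like `amp;` out of the feature space.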



In [18]:
# Checking the raw string format using repr()

# Reference link: https://bytes.com/topic/python/answers/670554-how-print-raw-string-data-variable

for i in text_list:
    print(repr(i))

'Just posted a photo @ Statue of Liberty &amp; Ellis Island Events \n'
"I'm at American Media in New York, NY \n"
'West Coast Memorial in Battery Park. Dedicated to the 4000+ servicemen who? \n'
'May is here. That?s no bull #1010wins \n'
'"#Repost colorofnyc with get_repost\n'
'Thank you colorofnyc for the feature and? \n'
'#MayDay March &amp; rally in #BatteryPark to #WallStreet live @PIX11News with @CynthiaNixon marching with. \n'
'About to begin #mayday rally and march to Wall Street #1010wins \n'
'#Simple#FoodForThought#Balance????# @ Stone Street \n'
'"Post-Avenger\n'
"Me and my girls ??? @ Murphy's Tavern NYC \n"
'To the village now. (@ MTA Subway - South Ferry (1) - @nyctsubwayscoop in New York, NY) \n'
"crown jewels uh'da island #TourDeStatenIsland \n"
'#nyc #nodaysoff #nopainnogain #clangingandbanging #nyc #nycfit #z3r0fitness? \n'
'#BroadwayMall #Art #broadwaymallart kathyruttenberg #InDreamsAwake #FirstOne #CloudyDay? \n'
'"#Repost via streetartmuraltours \n'
'?\n'
'One of t

"I'm at Citicorp - @citibank in Long Island City, NY \n"
"I'm at @LAGourmetNy in Long Island City, NY \n"
"I'm at MTA Subway - Court Square (E/G/M/7) - @nyctsubwayscoop in Long Island City, NY \n"
'"Best #skylight in town\n'
'#jamesturrell #ps1 #moma \n'
'@momaps1 #timelapse #timelapsevideo #museum @? \n'
'See our latest #NewYork, NY #job and click to apply: Senior Programming Specialist - New Local Entertainment Networ? \n'
'3 years ago I was here. I"ll be back soon. #travel #photography #memorie? \n'
'"Professional #professional\n'
'.\n'
'#riseagainsthunger #hungersucks #stophungernow? \n'
'Vue sympa \n'
'"Professional #professional\n'
'.\n'
'#riseagainsthunger #hungersucks #stophungernow? \n'
'Corporate vibes ? #WorkFlow \n'
'NY ?? en Top Of The Rock NYC \n'
'Accident cleared in #Queens on The B.Q.E. E Leg NB between Exit 44 and Grand Central Pky, stop and go traffic back to Exit 43 #traffic\n'
'Closed due to accident in #Queens on The B.Q.E. E Leg NB between Exit 44 and Grand Centr

"Dining out with a dietician doesn't have to be a drag ? @bnutritious and I? \n"
'Only 5 Wind-Up Variations remain. Jam with us next on Friday... #seeMoreplays? \n'
'Part 2 @mtfmusicals @joespub #songsbyus #womenwriters #transwriters? \n'
'Part 1 @mtfmusicals @joespub #songsbyus #womenwriters #transwriters? \n'
'Talents include brunching and gazing at the menu as if I didn?t already study? \n'
"It's gonna be a kiki tonight! The rain is gone, so come make it rain on? \n"
'? @ Jake?s Steakhouse - Bronx \n'
'Construction on Broadway has started! #makebroadwaysafe @kevindaloia @TransitErwin @AmeryAmril @TransAlt \n'
'#WeAreCityHawks     #721MFamily \n'
'Voluptuous. @ Wave Hill \n'
'Jessica at the spa @ Wave Hill \n'
'my babies ? \n'
'wedding sunset ? #kelseyandkaushal #theagrawalees @ Wave Hill \n'
'Watercolor of Wave Hill aquatic garden @ Wave Hill \n'
'Sunbathing ??????at Wave Hills on #earthday - #skyexuberance ? @ Wave Hill \n'
'Spring is near \n'
'grateful for friendship! sharing the 

'@Biblops OH MY GOD IS IT THE ENCYCRAWPEDIA\n'
'@2AvSagas ?What actually causes congestion in Midtown?? Too many cars. Done.\n'
'@ReeseTrece !!!!!!!!!!!!!!!\n'
'@Slim_Luvva lmao fuck they know ?? how old is he? who are his parents. i need all of the story lol\n'
'?Tanice? ???????????. ALW OF NEW YORK IS MINE! \n'
'@MADly_INsane ?\n'
'Ramon Allones by AJ Fernÿndez #AJFCigars @ Club Macanudo \n'
"I'm at Morton Williams in New York, NY \n"
'@Shook_Jones Not what, why?\n'
'@thecierranicole me and nyc ?\n'
'@WomenforWomen and the story of one woman that started a program within the organization to empower herself and her? \n'
'Pulitzer Prize winning photojournalist @lynseyaddario speaking about her experiences around the world bringing atte? \n'
'Excited to be at the @WomenforWomen Luncheon, honoring incredible and courageous women around the world!!? \n'
'Spicy High Quality Belgian Dark Chocolate? Yes please. 4 Heat? \n'
'Amazing NY corner... so close to Central Park a beautiful park. @ Tr

'I work so hard! I really needed this time off! #chubiiline #bullychasers? \n'
'Enjoy Miles of Smiles on #WorldLaughterDay &amp; every Sunday at Noon with @MagicAtConey @coneyislandusa! #Brooklyn? \n'
'Just your average punk rock mom ?Sy? #missconeyisland #filmmaking #actorslife? \n'
'Big shoutout to the Family and Crew of denoswonderwheelpark ! You guys had? \n'
'Well this act was pretty intense! #RBM denoswonderwheelpark #WonderWheel? \n'
'The Amazing Flesh of Chaim Soutine en The Jewish? \n'
'#fuckputin? @ Consulate-General of Russia in New York City \n'
'Loved seeing the work of Marc Camille Chaimovicz @thejewishmuseum . (thanks? \n'
'Just posted a photo @ Guggenheim Museum \n'
'Cherry blossoms at the Reservoir #timeoutnewyork #nycprime_ladies? \n'
'Laure, subjugu?e. \n'
'?\x98Take my breath away\x98? \n'
'Inspiring exhibition by Vietnamese Danish artist Danh Vo (b. 1975) at? \n'
'Ovnis. em Guggenheim Museum \n'
'May the 4th be with you....always ??? #MayThe4thBeWithYou #StarWars? 

"We're #hiring! Click to apply: Java Developer -  #NettempsJobs #OpenSource #LongIslandCity, NY #Job #Jobs #CareerArc\n"
"We're #hiring! Read about our latest #job opening here: Property Management Summer Internship -? \n"
'Join the Sizewise team! See our latest #job opening here:  #MedicalDevices #NewYork, NY #Hiring #CareerArc\n'
"Want to work at Amtrak? We're #hiring in #LongIslandCity, NY! Click for details:  \n"
"I'm at Long Island City, NY in Long Island City, NY \n"
"We're #hiring! Click to apply: Billing Clerk -  #Clerical #NewYork, NY #Job #Jobs #CareerArc\n"
'This #job might be a great fit for you: Bookkeeper / Accountant -  #Accounting #NewYork, NY #Hiring #CareerArc\n'
'Can you recommend anyone for this #job? Route Delivery Driver NON CDL -  #SupplyChain #NewYork, NY #Hiring #CareerArc\n'
'#NightWatch rockcenternyc @ Top Of The Rock NYC \n'
'Best view of New York at midnight??????#llausÿstarriba #3rdhoneymoon? \n'
'?Top of the Rock? #new#york#city#rocefellercenter ?? @ Top 

'#Repost @amyhau_nyc\n'
'Beautiful and ethereal #clouds by #miyaando on view @noguchimuseum? \n'
'#Repost @doreenremen\n'
'Following the #MiyaAndo #cloud and celebrating the elemental at the? \n'
'@noguchimuseum #cloud #sculpture installed #newyork @ The Noguchi Museum \n'
'Forecast: #CLOUDS. Two enigmatic works by @miyaando are now abiding? \n'
'Just posted a photo @ Milk &amp; Cream Cereal Bar \n'
'JOIN US! #sitdownshutupandeat #lamelanyc #pullthestring #italianfoodporn #littleitalynyc #italianfood #mulberryst? \n'
'La?Mela?s amazing mixed seafood over linguini. Fresh clams, mussels, shrimp, and calamari in our homemade red sauce? \n'
'Double the fun with two wrappers Maduro &amp; Connecticut ! #cigar #cigars #mulberrystreet? \n'
'Downtown Day - Part Three! #sniffaspringfling2018 la_mela_nyc davidscottecker @talfoto? \n'
'Go out there and crack open today?s cheese wheel folks\n'
'#HappyMonday #TheKingOfCheese? \n'
'I love little italy ???? \n'
'Hoy inici\x9b algo genial en New York l

'Proud to be trendsetters- did this as part of #awardwinning #employeeengagement campaign: vacay photo sharing? \n'
'?Engage with employees on the weekend by keeping content light and more social.? -@LeanoraMinai, @DukeU ??? \n'
'.@LeanoraMinai takes a journalistic approach to internal comms through the @WorkingatDuke platform, connecting and? \n'
'Packed house to hear @itsaallman talk about the USA Today Network brand. ?Add something to the story that nobody el? \n'
'Kicking off #RaganContent Summit for Corporate Communicators via @RaganComms, w/@gina__rossi #content #marketing? \n'
'@realDonaldTrump  \n'
"I'm at Sony Public Plaza in New York, NY \n"
'?? which way to the weekend? ???\n'
'let @billyconahan show you de wae ??\n'
'#trexphoto? \n'
"Tart berry, light crisp apple, dry and delightful. - Drinking a Perronelle's Blush by @Aspall @ Hard Cider Revival? \n"
'Ramp restrictions in #NewYork on The FDR Dr SB at The Brooklyn BR, stop and go traffic back to The Manhattan Br, delay of 5

'Beach &amp; park #coneyisland #nyc lunapark em Coney Island \n'
'SUMMER 18? #photoshoot #familia #actor #newwork #coneyisland @ Coney? \n'
'?So I packed my napsack, got on a train looked at a map and decided I wanted? \n'
"Go with the flow but not on the water cause it's cold? \n"
'#beach #body #art #blacklivesmatter #purpleheart ?Beach Body Bingo? A? \n'
'Just posted a photo @ Coney Island USA \n'
'#funmoment #coneyisland #sandyscordo #nyc #beach #friendship #mylife #ilny @? \n'
'Just posted a photo @ Coney Island USA \n'
'Happy #cincodemayo ?? ? at La Casa de Trump, Trump Tower Grill. Having a Taco? \n'
'Miami Florida May 18th come get ya funny on. Guaranteed good time. My hommie? \n'
'Good morning NYC! #tiffanyandco #nyc #theeabpov #theeabproject #bitchisanomad? \n'
'#tiffany #newnew #collection #paperflowers #tiffanyblue #breakfastattiffanys? \n'
'Make America chingasatuputamadre @realdonaldtrump great again!!!! en Trump? \n'
'Awesome launch @tiffanyandco last night that my talent

'#LYFEINCORPORATED ?? @IAMANITABAKER \n'
'#LYFEINCORPORATED ? @IAMANITABAKER \n'
'@BuckSexton 10/4\n'
'@sarasidnerCNN HUH? \n'
'Huh? \n'
'@50cent ? \n'
'@GOYOCQT ae it was worth the reach!  Not enough passion for the music 8 \n'
'Huh? \n'
'#LYFEINCORPORATED ?? \n'
'Cafe Lalo?s scaffold is beau! #iPhonePortrait \n'
'Pumpkin &amp; red lentil soup. Bomb!! (at @PeacefoodCafe in New York, NY)  \n'
'Mommy Crush Monday?s #MCM\n'
'This past Saturday, Simba and I went to? \n'
'?A mother is she a person who can take the place of all others but? \n'
'You got Instagram? (at @CafeLalo in New York, NY w/ @mario_abarcac)  \n'
"I'm at Cafe Lalo in New York, NY \n"
'Drinking a Krombacher Pils by @krombacherbeer at @fredsnyc ? \n'
'We are awesome #hookem #amwrica #texas #beerisgood @ George Keeley NYC \n'
'#madeit @ George Keeley NYC \n'
'Pre Thacker?s Day celebration #Pushin4Tre (@ Flor de Mayo in New York, NY) \n'
'Best breakfast we?ve had here so far! @jasonytuarte \n'
"I'm at Hale &amp; Hearty - @ha

'Ma and her #MothersDay margarita! (@ Tacuba Mexican Cantina - @tacubanyc in Astoria, NY)  \n'
'Technicolor Camera on display at the Museum of Moving Image. \n'
'Look who I got to meet today! #Muppets #momi @ Museum of the Moving? \n'
'On @supremecourt88 30th Birthday, we play with puppets!? \n'
'Key takeaways after visiting the Museum of the Moving Image \n'
'Drinking a Baby Elephant by @RushingDuck at @sunswickastoria ? \n'
'The power table. \n'
'Thank you @nywift for featuring ?The Kung Fu Master? and showcasing? \n'
'Had an awesome bday visiting the Jim Henson exhibit in NYC ???? @? \n'
'my church. \n'
'Danh Vo - ?Take my breath away? #guggenheim #nyc #usa? \n'
'?Off Centered. #photography #geometric #patterns #guggenheim? \n'
"I've been fascinated by this place since I was a kid. It was about time to visit it again \n"
'You spin round and round ?? #guggenheim #museum #nyc #holiday @? \n'
"I'm at Solomon R @Guggenheim Museum in New York, NY \n"
'Just posted a photo @ Solomon R. Gug

'@FoxNews @TomiLahren Omg #TomiLahren you?re so right. Why didn?t they think of that and stay in their towns with vi? \n'
'I may have purchased a few items to do my bit @realdonaldtrump? \n'
'???????????????????? \n'
'Powdered brows before &amp; after by Master Tech #EyeDesignNadia ?? Our? \n'
'A building almost as great as me, #MissShitholeCountry? \n'
'I have never been more embarrassed of my heritage.  \n'
'The douche has been revealed @ASchlossbergLaw  make sure he knows that nyc is a small replica of the USA.  \n'
'@EWErickson If ignorance is bliss, @EWErickson is the happiest man alive. ObamaCare isn?t Constitutional because of? \n'
'At the start of 2018 Facebook claimed to have 2bn user accounts. In Q1 it shuttered 583m accounts. Basically 30% of? \n'
'?Thank you @CPTalentMgmt  and @HouseCasting ! Always so much fun at 450 W 15th ! ? \n'
'iPhone 8+ \n'
'8 PM \n'
"I'm at Aladdin @ New Amsterdam Theatre in New York, NY  \n"
'Sat down with the radiant @idinamenzel this morning to t

'#mothersday w/ #mom\n'
'Happy Mother?s Day to you? \n'
'MOOD because this weekend was amazing got to do all the things I? \n'
'Just posted a photo @ Upper West Side \n'
'Balenciaga and a fresco from my county: Lleida. Seeing one of my favorite dresses of all time framed by art that sp? \n'
'Clean up time.\n'
'#EverySportSkill\n'
'#Muscleup\n'
'#strictmuscleup? \n'
'Went to Coney island today, it was only shut \n'
'@rihanna @rihanna consider this our first colab.\n'
'@\n'
'?They said there were two fathers. One above, one below. They lied. There was only ever the devil.And when you look? \n'
'Gor Getting there \n'
'June 3rd @VINAI joins the @MagicCarousel family! Tickets on Sale Now!  #brooklyn #nyc #SundayFunday #newyork #mcs \n'
'#fab #done #goodvibes #fabnews #have a #fabweek @madonna #nyc #my? \n'
'If only I could live in the moment with as much skill as you.? \n'
'Special Crepes Dark Chocolate?&amp; ?\n'
'? ????? #cheesy #eater? \n'
'Morning in nyc.\n'
'#mondaymood #goodmorning #p

'Freelancing may seem luxurious to some, but in reality, it?s no? \n'
'Love when friends stop by.. @rollesgracie from @kasaigrappling? \n'
'? ??? ??? ???#?? #supermoonbakery #nyc \n'
'Liz Luisada ?Find Your Way? opens this Friday May 18. Opening? \n'
'Rick !!! #magnumpi @ Russ &amp; Daughters Cafe \n'
'#AnArtistADay at #TheClemente:\n'
'Leading up to #os18, check here to view a different participating artist each day!? \n'
'Multi camo columns for May ???? @ Lower East Side \n'
'New week means new gear! Short sleeved #boxing #nyc hoodies back in? \n'
'#bwoodknows @warriors\n'
'#212 ?? #510\n'
'a #bwood collaboration with? \n'
'???? \n'
'#YesterdaysShoot #SexyFilmCrew Lol, hire us for your next? \n'
'Catch me, Tommy, and Our Special Guest Next Weekend @ The Delancey !? \n'
'#WDYM? Beauty &amp; Brains\n'
'Is One a Week Away!\n'
'Come learn, shop, drink? \n'
'#Repost cuddyseason\n'
'$ingle 4 $ingle?? 8 Singles 7 Battles 1? \n'
"I'm at MTA Subway - Delancey St/Essex St (F/J/M/Z) - @nyctsubw

'Simple: to achieve it... you already have to BELIEVE it, your daily? \n'
'#Anothermiracle, Thanks to Gary Millier I was able to serve as a? \n'
"Drake's Virginia Black Whiskey will be featured at the 4th Annual? \n"
'#unpluggedmondays #rwp #djmasai #livemusic #livedj #everymonday @? \n'
'#handsdown #onedollar #newyork #pizza #diet #perfection ? @ Central? \n'
"You know you've built a great family when @beth__egan and? \n"
'city nights. noites da cidade. #harlem #nyc em Central Harlem \n'
"Dinner (@ Lou's in New York, NY)  \n"
'Cryptocurrency, Blockchain and Building the Future event (@ Ludlow House in New York, NY)  \n'
'#catsandnails #rainbowcamo #giovannidipesce ??? @ Lower East Side \n'
'It?s summer for the next couple hours. @ Lower East Side \n'
'!?!\n'
'¨\xad¨ @ Lower East Side \n'
'One of the best candles (and signature scents) that I?ve ever? \n'
'Just two girls talking about body hair and wellness addiction while? \n'
'pretty in pastels ?#wheninrome #travels #sustainable #sel

In [19]:
# Removing new line symbol

text_ready = [i.strip() for i in text_list]
print(repr(text_ready[1]))

"I'm at American Media in New York, NY"


In [20]:
# Creating a list of tokens

text_ready_list = []
for i in text_ready:
    y = i.split()
    text_ready_list.append(y)
    
print(text_ready_list[1])

["I'm", 'at', 'American', 'Media', 'in', 'New', 'York,', 'NY']
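
Splitting on whitespace keeps case and punctuation, so 'York,' and 'york' count as different tokens, and lowercase feature words such as 'museum' will not match 'Museum'. A possible alternative, sketched here under the assumption that case-insensitive matching is wanted, is a small regex tokenizer. The names `TOKEN_PATTERN` and `tokenize` are hypothetical and not part of the notebook.

```python
import re

# Keeps an optional leading '@' or '#', then word characters,
# with an optional apostrophe part so "I'm" stays one token.
TOKEN_PATTERN = re.compile(r"[@#]?\w+(?:'\w+)?")


def tokenize(text):
    """Lowercase the text and return word, hashtag, and mention tokens."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("I'm at American Media in New York, NY"))
print(tokenize("#jamesturrell #ps1 #moma @momaps1"))
```

If this tokenizer were used, any dictionary of handles or names would need to be lowercased in the same way before matching.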


### Input labels file: label_1.txt

In [21]:
# Importing file with associated labels

file = input("Please input your file: ")
with open(file, 'r', encoding="ISO-8859-1") as f:
    labels = f.readlines()

Please input your file: label_1.txt


In [22]:
# Removing new line symbol

labels_ready = [i.strip() for i in labels]
print(labels_ready[1])

non-museum


In [23]:
# Combining two lists with content and labels to get a data structure similar to a Reuters Corpus in NLTK

list_text_labels = []
for i, y in zip(text_ready_list, labels_ready):
    list_text_labels.append((i,y))

# Reference link: https://stackoverflow.com/questions/1919044/is-there-a-better-way-to-iterate-over-two-lists-getting-one-element-from-each-l
# Reference link: https://stackoverflow.com/questions/6304808/how-to-pass-tuple-as-argument-in-python
# Reference link: https://stackoverflow.com/questions/19560044/how-to-concatenate-element-wise-two-lists-in-python

In [24]:
# Checking the resulting data structure of one item (tokenized content and label)

print(list_text_labels[1])

(["I'm", 'at', 'American', 'Media', 'in', 'New', 'York,', 'NY'], 'non-museum')
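
`zip()` stops at the shorter of its two inputs without raising an error, so a missing line in either file would shift or drop labels silently. The sketch below adds a length check and reports the class distribution of the paired data. The helper names and the toy data are assumptions for illustration only.

```python
from collections import Counter


def pair_with_labels(token_lists, labels):
    """Pair each tokenized tweet with its label, refusing mismatched lengths."""
    if len(token_lists) != len(labels):
        raise ValueError(
            "got {} tweets but {} labels".format(len(token_lists), len(labels))
        )
    return list(zip(token_lists, labels))


def class_distribution(pairs):
    """Return the share of each label in a list of (tokens, label) pairs."""
    counts = Counter(label for _, label in pairs)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy data standing in for text_ready_list and labels_ready
toy_tokens = [["a"], ["b"], ["c"], ["d"]]
toy_labels = ["non-museum", "museum", "non-museum", "non-museum"]
toy_pairs = pair_with_labels(toy_tokens, toy_labels)
print(class_distribution(toy_pairs))
```

Running the same check on the real lists would confirm the 8%-92% split stated at the top of the experiment.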


## 1.1 Featureset: Synonym/semantic word cooccurrences
### Creating a unigram (synonym/semantic) feature set based on cooccurrences of the words "museum" and "gallery" in the Wortschatz corpora portal: http://corpora.uni-leipzig.de/en/res?word=museum&corpusId=eng-za_web_2014

In [25]:
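# Note: 'art museum' is a two-word phrase; document_features() compares features against
# single whitespace-separated tokens, so this feature is always False
word_features = ['museum', 'art museum', 'art', 'visit', 'collection', 'exhibition', 'display', 'cultural', 'museums', 'historical', 'curator', 'gallery', 'photo', 'artist','artists','studio']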

In [26]:
# Creating featuresets (based on NLTK Chapter 6 Document Classification: http://www.nltk.org/book/ch06.html)

def document_features(document): 
    document_words = set(document) 
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in list_text_labels]

# Shuffling featuresets
random.shuffle(featuresets)

# Creating training and testing sets
train_set, test_set = featuresets[100:], featuresets[:100]

# Training a Naive Bayes Classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [27]:
# Checking for accuracy
print(nltk.classify.accuracy(classifier, test_set)) 

0.92
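
The split above shuffles without a seed and holds out 100 items, so the test set contains only a handful of museum tweets and the numbers change from run to run. One possible alternative, shown here as a sketch rather than as the notebook's method, is a seeded split that keeps the class ratio in both parts. The function name `stratified_split` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labeled_items, test_fraction=0.2, seed=0):
    """Split (features, label) pairs so each label keeps its share in train and test."""
    rng = random.Random(seed)            # local generator, reproducible across runs
    by_label = defaultdict(list)
    for item in labeled_items:
        by_label[item[1]].append(item)

    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])

    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy data with the same 8%-92% ratio as Experiment 1
toy = [(i, "museum") for i in range(8)] + [(i, "non-museum") for i in range(92)]
toy_train, toy_test = stratified_split(toy, test_fraction=0.25, seed=42)
print(len(toy_train), len(toy_test))
print(sum(1 for _, label in toy_test if label == "museum"))
```

The same function could be called on `featuresets` in place of the slice-based split.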


In [28]:
# Checking for most informative features
classifier.show_most_informative_features(5)

Most Informative Features
    contains(exhibition) = True           museum : non-mu =      4.9 : 1.0
    contains(collection) = True           museum : non-mu =      3.7 : 1.0
         contains(visit) = True           museum : non-mu =      1.8 : 1.0
         contains(photo) = True           museum : non-mu =      1.7 : 1.0
           contains(art) = True           museum : non-mu =      1.3 : 1.0


### Naive Bayes classifier is compared with Decision Tree, Maximum Entropy and SVC

In [29]:
# Comparing with the accuracy results of Decision Tree classifier
decision_tree_classifier = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(decision_tree_classifier, test_set))

0.92


In [30]:
# Comparing with accuracy results of Maximum Entropy
maximum_entropy_classifier = nltk.MaxentClassifier.train(train_set,max_iter=1)
print(nltk.classify.accuracy(maximum_entropy_classifier, test_set))

  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.917
         Final          -0.16497        0.917
0.92


In [31]:
# Checking for most informative features based on Maximum Entropy
maximum_entropy_classifier.show_most_informative_features(5)

  -0.177 contains(artist)==True and label is 'museum'
  -0.163 contains(exhibition)==False and label is 'museum'
  -0.163 contains(photo)==False and label is 'museum'
  -0.162 contains(cultural)==False and label is 'museum'
  -0.162 contains(museums)==False and label is 'museum'


In [32]:
# Comparing with accuracy results of SVC classifier
svm_classifier = SklearnClassifier(SVC(), sparse=False).train(train_set)
print(nltk.classify.accuracy(svm_classifier, test_set))

0.92
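
All four classifiers report 0.92 on the 100-tweet test set, which is close to what a model that always answers "non-museum" would score on data with an 8%-92% split. Accuracy alone therefore cannot show whether any museum tweets are being found. The sketch below computes precision and recall for the "museum" class; the function name and the toy labels are assumptions for illustration.

```python
def precision_recall(gold, predicted, positive="museum"):
    """Return (precision, recall) for the positive class; 0.0 when undefined."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy case: a classifier that never predicts "museum" on an 8%-92% test set
toy_gold = ["museum"] * 8 + ["non-museum"] * 92
toy_pred = ["non-museum"] * 100

toy_accuracy = sum(g == p for g, p in zip(toy_gold, toy_pred)) / len(toy_gold)
print(toy_accuracy)
print(precision_recall(toy_gold, toy_pred))
```

With the notebook's objects, the inputs could be built as `[label for _, label in test_set]` and `[classifier.classify(features) for features, _ in test_set]`.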


## 1.2 Featureset: Dictionary list of museum names, abbreviations and tags

### 1.2.1 Unigram model: creating a list where museum names are tokenized (e.g. ['American', 'Academy', 'of', 'Arts', 'and', 'Letters'])

In [33]:
# Importing a dictionary.txt file with museum names, abbreviations and tags with .read()

file = input("Please input your file: ")
with open(file, 'r', encoding="ISO-8859-1") as f:
    dictionary = f.read()

Please input your file: dictionary.txt


In [34]:
# Tokenizing the document

dictionary_processed = dictionary.split()
print(dictionary_processed)

['Alexander', 'Hamilton', 'U.S.', 'Custom', 'House', 'Alice', 'Austen', 'House', 'Museum', 'Alice', 'Austen', 'House', '@iAliceAusten', 'American', 'Academy', 'of', 'Arts', 'and', 'Letters', 'American', 'Academy', 'of', 'Arts', '&', 'Sciences', '@americanacad', 'American', 'Folk', 'Art', 'Museum', 'Folk', 'Art', 'Museum', '#folkartmuseum', '@FolkArtMuseum', 'American', 'Immigration', 'History', 'Center', 'American', 'Family', 'Immigration', 'History', 'Center', '@EllisIsland', 'American', 'Museum', 'of', 'Natural', 'History', 'American', 'Museum', 'of', 'Natural', 'History', '@AMNH', 'American', 'Numismatic', 'Society', 'American', 'Numismatic', '@ANSCoins', 'Americas', 'Society', 'Americas', 'Society', 'Art', '@Visual_ArtsAS', 'Anne', 'Frank', 'Center', 'USA', 'Anne', 'Frank', 'Center', '@AnneFrankCenter', 'Asia', 'Society', 'Asia', 'Society', '@AsiaSociety', 'Audubon', 'Terrace', 'NYC', 'Audubon', '@NYCAudubon', 'Bartow-Pell', 'Mansion', 'The', 'Bartow', 'Pell', 'Mansion', 'Museum', 

In [35]:
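# Removing stopwords
# (the word list is stored under a new name so that the imported `stopwords`
# module is not overwritten and the cell can be re-run)

stop_words = set(stopwords.words('english'))
dictionary_processed_ready = []
for i in dictionary_processed:
    if i not in stop_words:
        dictionary_processed_ready.append(i)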

print(dictionary_processed_ready)

['Alexander', 'Hamilton', 'U.S.', 'Custom', 'House', 'Alice', 'Austen', 'House', 'Museum', 'Alice', 'Austen', 'House', '@iAliceAusten', 'American', 'Academy', 'Arts', 'Letters', 'American', 'Academy', 'Arts', '&', 'Sciences', '@americanacad', 'American', 'Folk', 'Art', 'Museum', 'Folk', 'Art', 'Museum', '#folkartmuseum', '@FolkArtMuseum', 'American', 'Immigration', 'History', 'Center', 'American', 'Family', 'Immigration', 'History', 'Center', '@EllisIsland', 'American', 'Museum', 'Natural', 'History', 'American', 'Museum', 'Natural', 'History', '@AMNH', 'American', 'Numismatic', 'Society', 'American', 'Numismatic', '@ANSCoins', 'Americas', 'Society', 'Americas', 'Society', 'Art', '@Visual_ArtsAS', 'Anne', 'Frank', 'Center', 'USA', 'Anne', 'Frank', 'Center', '@AnneFrankCenter', 'Asia', 'Society', 'Asia', 'Society', '@AsiaSociety', 'Audubon', 'Terrace', 'NYC', 'Audubon', '@NYCAudubon', 'Bartow-Pell', 'Mansion', 'The', 'Bartow', 'Pell', 'Mansion', 'Museum', '(BPMM)', '@Bartow_Pell', 'Bown
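
NLTK's English stopword list is all lowercase, so capitalized tokens such as 'The' pass through the filter above, as the printed output shows. The sketch below compares tokens case-insensitively. `STOP_WORDS` is a small stand-in set used so the example runs without NLTK data, and `remove_stopwords` is a hypothetical helper name.

```python
# Stand-in for the NLTK stopword list, for illustration only
STOP_WORDS = {"the", "of", "and", "for"}


def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stopword, keeping the original casing otherwise."""
    return [t for t in tokens if t.lower() not in stop_words]


toy_tokens = ["The", "Bartow", "Pell", "Mansion", "Museum", "of", "the", "Arts"]
print(remove_stopwords(toy_tokens))
```

This matters later in the notebook, where 'The' and 'For' appear among the Maximum Entropy classifier's most informative features.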

In [36]:
# Removing the '/' and '&' symbols and the single digits 1-9
symbols = ['/','1','2','3','4','5','6','7','8','9','&']

dictionary_unigrams = []
for i in dictionary_processed_ready:
    if i not in symbols:
        dictionary_unigrams.append(i)

print(dictionary_unigrams)

['Alexander', 'Hamilton', 'U.S.', 'Custom', 'House', 'Alice', 'Austen', 'House', 'Museum', 'Alice', 'Austen', 'House', '@iAliceAusten', 'American', 'Academy', 'Arts', 'Letters', 'American', 'Academy', 'Arts', 'Sciences', '@americanacad', 'American', 'Folk', 'Art', 'Museum', 'Folk', 'Art', 'Museum', '#folkartmuseum', '@FolkArtMuseum', 'American', 'Immigration', 'History', 'Center', 'American', 'Family', 'Immigration', 'History', 'Center', '@EllisIsland', 'American', 'Museum', 'Natural', 'History', 'American', 'Museum', 'Natural', 'History', '@AMNH', 'American', 'Numismatic', 'Society', 'American', 'Numismatic', '@ANSCoins', 'Americas', 'Society', 'Americas', 'Society', 'Art', '@Visual_ArtsAS', 'Anne', 'Frank', 'Center', 'USA', 'Anne', 'Frank', 'Center', '@AnneFrankCenter', 'Asia', 'Society', 'Asia', 'Society', '@AsiaSociety', 'Audubon', 'Terrace', 'NYC', 'Audubon', '@NYCAudubon', 'Bartow-Pell', 'Mansion', 'The', 'Bartow', 'Pell', 'Mansion', 'Museum', '(BPMM)', '@Bartow_Pell', 'Bowne', '
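
The `symbols` list above covers the digits 1 to 9 but not 0 or multi-digit tokens, and the resulting list still repeats words such as 'American'. A more general filter could look like the sketch below. Note that it is not equivalent to the cell above: it also removes any all-digit token and drops duplicates. The helper name is hypothetical.

```python
def clean_dictionary_tokens(tokens, symbols=("/", "&")):
    """Drop listed symbols and all-digit tokens, then remove duplicates in first-seen order."""
    kept = [t for t in tokens if t not in symbols and not t.isdigit()]
    return list(dict.fromkeys(kept))   # dict keys preserve insertion order


toy_tokens = ["Alice", "Austen", "House", "/", "9", "11", "&", "Alice", "0"]
print(clean_dictionary_tokens(toy_tokens))
```

Removing duplicates does not change the features, because the feature dictionary is keyed by word, but it avoids repeated work for every tweet.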

In [37]:
# Creating a new featureset and training a Naive Bayes classifier

def document_features(document): 
    document_words = set(document) 
    features = {}
    for word in dictionary_unigrams:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in list_text_labels]
random.shuffle(featuresets)
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set)) 

# Reference link: http://www.nltk.org/book/ch06.html

0.89


In [38]:
# Checking for most informative features
classifier.show_most_informative_features(5)

Most Informative Features
     contains(Schomburg) = True           museum : non-mu =     18.4 : 1.0
      contains(National) = True           museum : non-mu =     18.4 : 1.0
    contains(Technology) = True           museum : non-mu =     17.3 : 1.0
     contains(Institute) = True           museum : non-mu =     14.2 : 1.0
 contains(@studiomuseum) = True           museum : non-mu =     11.0 : 1.0


In [39]:
# Comparing with accuracy results of Decision Tree classifier
decision_tree_classifier = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(decision_tree_classifier, test_set))

0.91


In [40]:
# Comparing with accuracy results of Maximum Entropy classifier
maximum_entropy_classifier = nltk.MaxentClassifier.train(train_set,max_iter=1)
print(nltk.classify.accuracy(maximum_entropy_classifier, test_set))

  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.917
         Final          -0.68133        0.918
0.93


In [41]:
# Checking for most informative features of Maximum Entropy classifier
maximum_entropy_classifier.show_most_informative_features(5)

   0.246 contains(+)==True and label is 'non-museum'
   0.246 contains(The)==True and label is 'non-museum'
   0.246 contains(Chelsea)==True and label is 'non-museum'
   0.246 contains(Street)==True and label is 'non-museum'
   0.246 contains(For)==True and label is 'non-museum'


In [42]:
# Comparing with accuracy results of SVC classifier
svm_classifier = SklearnClassifier(SVC(), sparse=False).train(train_set)
print(nltk.classify.accuracy(svm_classifier, test_set))

0.93


### 1.2.2 Entity model: creating a list where museum names are preserved as one token (e.g. ['American Academy of Arts and Letters'])

In [43]:
# Importing a dictionary.txt file with names, abbreviations and tags with .readlines()

file = input("Please input your file: ")
with open(file, 'r', encoding="ISO-8859-1") as f:
    dictionary = f.readlines()

Please input your file: dictionary.txt


In [44]:
# Removing new line symbol

dictionary_ready = [i.strip() for i in dictionary]
print(dictionary_ready)

['Alexander Hamilton U.S. Custom House', 'Alice Austen House Museum\tAlice Austen House\t\t@iAliceAusten', 'American Academy of Arts and Letters\tAmerican Academy of Arts & Sciences\t\t@americanacad', 'American Folk Art Museum\tFolk Art Museum\t#folkartmuseum\t@FolkArtMuseum', 'American Immigration History Center\tAmerican Family Immigration History Center\t\t@EllisIsland', 'American Museum of Natural History\tAmerican Museum of Natural History\t\t@AMNH', 'American Numismatic Society\tAmerican Numismatic\t\t@ANSCoins', 'Americas Society\tAmericas Society Art \t\t@Visual_ArtsAS', 'Anne Frank Center USA\tAnne Frank Center\t\t@AnneFrankCenter', 'Asia Society\tAsia Society\t\t@AsiaSociety', 'Audubon Terrace\tNYC Audubon\t\t@NYCAudubon', 'Bartow-Pell Mansion\tThe Bartow Pell Mansion Museum (BPMM)\t\t@Bartow_Pell', 'Bowne House\tBowne House\t\t@BowneHouse1', 'Bronx Historical Society & Musem\tBronx Historical So\t\t@BronxHistory', 'Bronx Museum of the Arts (BXMA)\tBronx Museum\t\t@BronxMuseu

In [45]:
# Splitting each line on the tab symbol

dictionary_ready_list = []
for i in dictionary_ready:
    y = i.split('\t')
    dictionary_ready_list.append(y)
    
print(dictionary_ready_list)

[['Alexander Hamilton U.S. Custom House'], ['Alice Austen House Museum', 'Alice Austen House', '', '@iAliceAusten'], ['American Academy of Arts and Letters', 'American Academy of Arts & Sciences', '', '@americanacad'], ['American Folk Art Museum', 'Folk Art Museum', '#folkartmuseum', '@FolkArtMuseum'], ['American Immigration History Center', 'American Family Immigration History Center', '', '@EllisIsland'], ['American Museum of Natural History', 'American Museum of Natural History', '', '@AMNH'], ['American Numismatic Society', 'American Numismatic', '', '@ANSCoins'], ['Americas Society', 'Americas Society Art ', '', '@Visual_ArtsAS'], ['Anne Frank Center USA', 'Anne Frank Center', '', '@AnneFrankCenter'], ['Asia Society', 'Asia Society', '', '@AsiaSociety'], ['Audubon Terrace', 'NYC Audubon', '', '@NYCAudubon'], ['Bartow-Pell Mansion', 'The Bartow Pell Mansion Museum (BPMM)', '', '@Bartow_Pell'], ['Bowne House', 'Bowne House', '', '@BowneHouse1'], ['Bronx Historical Society & Musem', 

In [46]:
# Combining list of lists into one

# Reference link: https://stackoverflow.com/questions/716477/join-list-of-lists-in-python

dictionary_entity = [i for y in dictionary_ready_list for i in y]
print(dictionary_entity)

['Alexander Hamilton U.S. Custom House', 'Alice Austen House Museum', 'Alice Austen House', '', '@iAliceAusten', 'American Academy of Arts and Letters', 'American Academy of Arts & Sciences', '', '@americanacad', 'American Folk Art Museum', 'Folk Art Museum', '#folkartmuseum', '@FolkArtMuseum', 'American Immigration History Center', 'American Family Immigration History Center', '', '@EllisIsland', 'American Museum of Natural History', 'American Museum of Natural History', '', '@AMNH', 'American Numismatic Society', 'American Numismatic', '', '@ANSCoins', 'Americas Society', 'Americas Society Art ', '', '@Visual_ArtsAS', 'Anne Frank Center USA', 'Anne Frank Center', '', '@AnneFrankCenter', 'Asia Society', 'Asia Society', '', '@AsiaSociety', 'Audubon Terrace', 'NYC Audubon', '', '@NYCAudubon', 'Bartow-Pell Mansion', 'The Bartow Pell Mansion Museum (BPMM)', '', '@Bartow_Pell', 'Bowne House', 'Bowne House', '', '@BowneHouse1', 'Bronx Historical Society & Musem', 'Bronx Historical So', '', 
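
The flattened list still contains empty strings from empty tab-separated columns, entries with trailing spaces (for example 'Americas Society Art '), and repeated names. A small clean-up step could be sketched as follows; `clean_entities` is a hypothetical helper, not part of the notebook.

```python
def clean_entities(entities):
    """Strip whitespace, drop empty entries, and remove duplicates in first-seen order."""
    stripped = (e.strip() for e in entities)
    return list(dict.fromkeys(e for e in stripped if e))


toy_entities = ["Asia Society", "Asia Society", "", "@AsiaSociety", "Americas Society Art "]
print(clean_entities(toy_entities))
```

Without this step, the empty string becomes a feature named `contains()`.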

In [47]:
# Creating a new featureset and training a Naive Bayes classifier

def document_features(document): 
    document_words = set(document) 
    features = {}
    for word in dictionary_entity:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in list_text_labels]
random.shuffle(featuresets)
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set)) 

# Reference link: http://www.nltk.org/book/ch06.html

0.89


In [48]:
# Showing most informative features

classifier.show_most_informative_features(5)

Most Informative Features
 contains(@studiomuseum) = True           museum : non-mu =     11.1 : 1.0
   contains(@Guggenheim) = True           museum : non-mu =      6.7 : 1.0
     contains(Cloisters) = True           museum : non-mu =      5.8 : 1.0
contains(@brooklynmuseum) = True           museum : non-mu =      3.0 : 1.0
contains(@MorganLibrary) = True           museum : non-mu =      2.2 : 1.0


In [49]:
# Comparing with accuracy results of Decision Tree classifier
decision_tree_classifier = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(decision_tree_classifier, test_set))

0.89


In [50]:
# Comparing with accuracy results of Maximum Entropy classifier (training capped at one iteration via max_iter=1)
maximum_entropy_classifier = nltk.MaxentClassifier.train(train_set,max_iter=1)
print(nltk.classify.accuracy(maximum_entropy_classifier, test_set))

  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.917
         Final          -0.68519        0.917
0.89


In [51]:
# Checking most informative features of Maximum Entropy classifier

maximum_entropy_classifier.show_most_informative_features(5)

   0.020 contains(@AsiaSociety)==True and label is 'non-museum'
   0.020 contains(@EllisIsland)==True and label is 'non-museum'
   0.020 contains(@MJHnews)==True and label is 'non-museum'
   0.020 contains(@BronxMuseum)==True and label is 'non-museum'
   0.020 contains(@SchomburgCenter)==True and label is 'non-museum'


In [52]:
# Comparing with accuracy results of SVC classifier

svm_classifier = SklearnClassifier(SVC(), sparse=False).train(train_set)
print(nltk.classify.accuracy(svm_classifier, test_set))

0.89
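
All four classifiers report the same accuracy on this test set. One possible explanation is that each of them predicts the majority label for every tweet, which accuracy alone cannot reveal. The following is a minimal sketch of a check, not part of the original notebook: `label_counts` is a hypothetical helper, and the toy classifier and toy test set are made up for illustration.

```python
from collections import Counter


def label_counts(classify, test_set):
    """Count gold labels and predicted labels over (features, label) pairs."""
    gold = Counter(label for _, label in test_set)
    predicted = Counter(classify(features) for features, _ in test_set)
    return gold, predicted


def always_majority(features):
    """Toy stand-in for a trained classifier that always answers 'non-museum'."""
    return 'non-museum'


# Made-up test set with the same size as the notebook's (100 items).
toy_test_set = [({}, 'non-museum')] * 89 + [({}, 'museum')] * 11

gold, predicted = label_counts(always_majority, toy_test_set)
print(gold)
print(predicted)
```

With the notebook's own objects, the call would be `label_counts(classifier.classify, test_set)`. If the predicted counts contain only one label, the reported accuracy simply mirrors the class distribution of the test set.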


## Experiment 2
### Datasets: text_2 and label_2

### Total = 1,166 tweets, museum-labeled = 583 tweets (50%)
### Random baseline performance: 50%
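
The random baseline quoted for each experiment is the expected accuracy of guessing labels in proportion to the class frequencies, that is, the sum of the squared class proportions. A minimal sketch of that arithmetic follows. The function names are hypothetical, and the majority-class baseline is added here only for comparison; it is not reported in the notebook.

```python
def random_baseline(proportions):
    """Expected accuracy when guessing each label with its class frequency."""
    return sum(p * p for p in proportions)


def majority_baseline(proportions):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(proportions)


# Class proportions taken from the tweet counts stated for each experiment.
experiment_1 = [583 / 7033, 1 - 583 / 7033]
experiment_2 = [583 / 1166, 1 - 583 / 1166]

for name, proportions in [('Experiment 1', experiment_1), ('Experiment 2', experiment_2)]:
    print(name,
          round(random_baseline(proportions), 3),
          round(majority_baseline(proportions), 3))
```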

## Data import and processing

### Input text file: text_2.txt

In [53]:
# Importing text file with .readlines()

file = input("Please input your file: ")
text = open(file,'r',encoding = "ISO-8859-1").readlines()

Please input your file: text_2.txt


In [54]:
# Removing URL links

text_list = []
for i in text:
    y = re.sub(r"http\S+", "", i)
    text_list.append(y)

print(text_list[1])

# Reference link: https://stackoverflow.com/questions/3094659/editing-elements-in-a-list-in-python
# Reference link: https://stackoverflow.com/questions/24399820/expression-to-remove-url-links-from-twitter-tweet/24399874

Calvin Klein aka @FamaLamTam and his lovely lady aka @_missshaniqua_  on #TheKleinSyndicate -? 



In [55]:
# Checking the raw string with repr()

# Reference link: https://bytes.com/topic/python/answers/670554-how-print-raw-string-data-variable
for i in text_list:
    print(repr(i))

'Intrepid ?? #weekendvibes #intrepidseaairandspacemuseum #what_i_saw_in_nyc #intrepidmuseum? \n'
'Calvin Klein aka @FamaLamTam and his lovely lady aka @_missshaniqua_  on #TheKleinSyndicate -? \n'
'To all the beach trips i missed cause of school ??? @ Cute Kids \n'
'? @ New York Hall of Science \n'
'#keepgoing | ?? #planetfitness planetfitness @ Planet Fitness \n'
'@Misfit_VAR 8:30 plis (sino apago el internet) ???\n'
'Oded Halahmy "Family (Study)" 1978. @ Bronx Museum of the Arts \n'
'After Being told to go on the wrong train by a local, I?ve finally? \n'
'Aggies everywhere -- NY \n'
"I'm at The @MorganLibrary &amp; Museum in New York, NY \n"
'\n'
"I'm at Central Park - Conservatory Garden Center Fountain in New York, NY  \n"
'sO gOOd #adrianpiper @ MoMA The Museum of Modern Art \n'
'Happy Mother?s Day at Wave Hill? \n'
'...so today\'s #ITweetMuseums choice: The Morgan Library &amp; Museum for both "#PeterHujar: Speed of Life"? \n'
'Subliminal messaging. @ Republic Records \n'
"I'm at

'VEN Y ALMUERZA CON CALIDAD Y ALTURA A PRECIO DE LUNCH,!!TODO DESDE $6.95 A $11.95 EL MEJOR LUNCH? \n'
'MoMA, New York City, USA | 2015\n'
'??MONDAYS HAPPY HOUR ALL NIGHT!! ALL HOUSE DRINKS?? 2x1@ the hottest? \n'
'Lash it girlfriend!! #becomeashewinker #2129441850 #lashextensions? \n'
'"Angles\n'
'#nyc #nycphotographer #highcontrast #architecture @ Upper West Side \n'
'#curiousshapes #springishere @ Conservatory Garden \n'
'Freedom Copper @ Statue Of Liberty And Ellis Island Immigration Museum \n'
'Throwback to #Dayhab day at the #newyorktransitmuseum: a rare? \n'
'@MsCharlotteWWE ?\n'
'Uptown, Monday night...? @ Inwood Bar and Grill \n'
'Thank you @brooklynmuseum for hosting @DavidBowieReal exhibit \n'
'Patti! (@ New-York Historical Society Museum &amp; Library in New York, NY) \n'
'Coffee &amp; Cream Soft Serve on a scoop of Chocotorta tastes even better than it? \n'
'History Refused to Die will open metmuseum on May 22nd and will? \n'
'Journalism is a profession that cannot see its

In [56]:
# Stripping leading and trailing whitespace, including the newline character

text_ready = [i.strip() for i in text_list]
print(repr(text_ready[1]))

'Calvin Klein aka @FamaLamTam and his lovely lady aka @_missshaniqua_  on #TheKleinSyndicate -?'


In [57]:
# Tokenizing

text_ready_list = []
for i in text_ready:
    y = i.split()
    text_ready_list.append(y)
    
print(text_ready_list[1])

['Calvin', 'Klein', 'aka', '@FamaLamTam', 'and', 'his', 'lovely', 'lady', 'aka', '@_missshaniqua_', 'on', '#TheKleinSyndicate', '-?']


### Input labels file: label_2.txt

In [58]:
# Importing a file with associated labels with .readlines()

file = input("Please input your file: ")
labels = open(file,'r',encoding = "ISO-8859-1").readlines()

Please input your file: label_2.txt


In [59]:
# Removing newline symbol

labels_ready = [i.strip() for i in labels]
print(labels_ready[1])

non-museum


In [60]:
# Combining content and labels lists into one to create a data structure similar to a Reuters Corpus in NLTK

list_text_labels = []
for i, y in zip(text_ready_list, labels_ready):
    list_text_labels.append((i,y))

# Reference link: https://stackoverflow.com/questions/1919044/is-there-a-better-way-to-iterate-over-two-lists-getting-one-element-from-each-l
# Reference link: https://stackoverflow.com/questions/6304808/how-to-pass-tuple-as-argument-in-python
# Reference link: https://stackoverflow.com/questions/19560044/how-to-concatenate-element-wise-two-lists-in-python

In [61]:
print(list_text_labels[1])

(['Calvin', 'Klein', 'aka', '@FamaLamTam', 'and', 'his', 'lovely', 'lady', 'aka', '@_missshaniqua_', 'on', '#TheKleinSyndicate', '-?'], 'non-museum')
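
The `repr()` output above includes at least one blank line (`'\n'`), which becomes an empty token list after stripping and splitting. A minimal sketch of pairing texts with labels while dropping such rows is shown below. It assumes that rows with an empty text or an empty label should be discarded rather than repaired; `pair_and_filter` is a hypothetical helper and the sample labels are made up.

```python
def pair_and_filter(token_lists, labels):
    """Pair each token list with its label, skipping empty texts or labels."""
    return [(tokens, label)
            for tokens, label in zip(token_lists, labels)
            if tokens and label]


# Illustrative data: the middle row mimics a blank line in both input files.
sample_tokens = [['Aggies', 'everywhere', '--', 'NY'], [], ['MoMA,', 'New', 'York']]
sample_labels = ['non-museum', '', 'museum']

pairs = pair_and_filter(sample_tokens, sample_labels)
print(len(pairs))
print(pairs)
```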


## 2.1 Featureset: Synonym/semantic word cooccurrences

In [62]:
# Creating a featureset based on the variable word_features defined above in Experiment 1 and training a Naive Bayes classifier

def document_features(document): 
    document_words = set(document) 
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in list_text_labels]
random.shuffle(featuresets)
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set)) 

# Reference link: http://www.nltk.org/book/ch06.html

0.52


In [63]:
# Checking most informative features
classifier.show_most_informative_features(5)

Most Informative Features
        contains(artist) = True           non-mu : museum =      3.0 : 1.0
        contains(museum) = True           museum : non-mu =      2.6 : 1.0
         contains(visit) = True           non-mu : museum =      1.7 : 1.0
    contains(exhibition) = True           museum : non-mu =      1.6 : 1.0
         contains(photo) = True           museum : non-mu =      1.5 : 1.0


In [64]:
# Comparing with accuracy results of Decision Tree classifier
decision_tree_classifier = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(decision_tree_classifier, test_set))

0.52


In [65]:
# Comparing with accuracy results of Maximum Entropy classifier
maximum_entropy_classifier = nltk.MaxentClassifier.train(train_set,max_iter=1)
print(nltk.classify.accuracy(maximum_entropy_classifier, test_set))

  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.09861        0.500
         Final          -0.69454        0.512
0.52


In [66]:
# Checking most informative features of Maximum Entropy classifier
maximum_entropy_classifier.show_most_informative_features(5)

  -0.532 contains(art museum)==False and label is ''
  -0.532 contains(collection)==False and label is ''
  -0.532 contains(cultural)==False and label is ''
  -0.532 contains(historical)==False and label is ''
  -0.532 contains(artists)==False and label is ''
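
The empty label `''` in this output and the starting log likelihood of -1.09861 both appear consistent with three distinct labels in the training data. Before the first update all weights are zero, so the model assigns probability 1/k to each of k labels and the average log likelihood is ln(1/k). A minimal sketch of that check follows; the helper names and the sample pairs are illustrative only.

```python
import math


def uniform_log_likelihood(num_labels):
    """Average log likelihood of a model that is uniform over the labels."""
    return math.log(1.0 / num_labels)


def distinct_labels(labelled_documents):
    """Return the set of labels found in (tokens, label) pairs."""
    return {label for _, label in labelled_documents}


print(round(uniform_log_likelihood(2), 5))  # -0.69315, as in Experiment 1
print(round(uniform_log_likelihood(3), 5))  # -1.09861, as in Experiment 2

# Made-up pairs showing how an empty label becomes a third class.
sample_pairs = [(['MoMA'], 'museum'), (['NY'], 'non-museum'), ([], '')]
print(sorted(distinct_labels(sample_pairs)))
```

Running `distinct_labels(list_text_labels)` on the notebook's data would show whether an empty label is present.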


In [67]:
# Comparing with accuracy results of SVC classifier
svm_classifier = SklearnClassifier(SVC(), sparse=False).train(train_set)
print(nltk.classify.accuracy(svm_classifier, test_set))

0.52
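
With only 100 test tweets, an accuracy of 0.52 may not be distinguishable from the 0.50 baseline. A minimal sketch of an approximate 95% interval is given below, assuming a normal approximation to the binomial; `accuracy_interval` is a hypothetical helper and not part of the original notebook.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Approximate confidence interval for an accuracy measured on n items."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * standard_error, accuracy + z * standard_error


low, high = accuracy_interval(0.52, 100)
print(round(low, 3), round(high, 3))
```

If the interval contains the baseline, the difference between the classifier and random guessing is within sampling noise for a test set of this size.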


## 2.2 Featureset: Dictionary list of names, abbreviations and tags

### 2.2.1 Unigram model: creating a list where museum names are tokenized (e.g. ['American', 'Academy', 'of', 'Arts', 'and', 'Letters'])

In [68]:
# Importing dictionary.txt file with names, abbreviations and tags with .read()

file = input("Please input your file: ")
dictionary = open(file,'r',encoding = "ISO-8859-1").read()

Please input your file: dictionary.txt


In [69]:
# Using variable dictionary_unigrams defined above and training a Naive Bayes classifier

def document_features(document): 
    document_words = set(document) 
    features = {}
    for word in dictionary_unigrams:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in list_text_labels]
random.shuffle(featuresets)
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set)) 

# Reference link: http://www.nltk.org/book/ch06.html

0.45
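
Because `random.shuffle` is called without a fixed seed, each run of the notebook produces a different train/test split, so accuracies from different cells or different runs are not directly comparable. A minimal sketch of a seeded split is shown below. `split_featuresets` and its `seed` parameter are hypothetical, and the toy featuresets are made up.

```python
import random


def split_featuresets(featuresets, test_size=100, seed=0):
    """Shuffle a copy with a fixed seed and return (train_set, test_set)."""
    shuffled = list(featuresets)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


# Made-up featuresets with the same size as the Experiment 2 dataset.
toy = [({'id': i}, 'museum' if i % 2 else 'non-museum') for i in range(1166)]

train_a, test_a = split_featuresets(toy)
train_b, test_b = split_featuresets(toy)
print(len(train_a), len(test_a), test_a == test_b)
```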


In [70]:
# Checking most informative features
classifier.show_most_informative_features(5)

Most Informative Features
       contains(Library) = True           museum : non-mu =      4.4 : 1.0
        contains(Garden) = True           non-mu : museum =      3.6 : 1.0
        contains(Morgan) = True           museum : non-mu =      3.0 : 1.0
        contains(Jewish) = True           museum : non-mu =      3.0 : 1.0
        contains(Center) = True           non-mu : museum =      3.0 : 1.0


In [71]:
# Comparing with accuracy results of Decision Tree classifier
decision_tree_classifier = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(decision_tree_classifier, test_set))

0.45


In [72]:
# Comparing with accuracy results of Maximum Entropy classifier
maximum_entropy_classifier = nltk.MaxentClassifier.train(train_set,max_iter=1)
print(nltk.classify.accuracy(maximum_entropy_classifier, test_set))

  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.09861        0.501
         Final          -0.98268        0.527
0.47


In [73]:
# Checking most informative features of Maximum Entropy classifier
maximum_entropy_classifier.show_most_informative_features(5)

   0.246 contains(The)==True and label is 'non-museum'
   0.246 contains(Intrepid)==True and label is 'museum'
   0.246 contains(Space)==True and label is 'museum'
   0.246 contains(Jewish)==True and label is 'museum'
   0.246 contains(Heritage)==True and label is 'museum'


In [74]:
# Comparing with accuracy results of SVC classifier
svm_classifier = SklearnClassifier(SVC(), sparse=False).train(train_set)
print(nltk.classify.accuracy(svm_classifier, test_set))

0.48


### 2.2.2 Entity model: creating a list where museum names are preserved as one token (e.g. ['American Academy of Arts and Letters']) 

In [75]:
# Using variable dictionary_entity defined above and training a Naive Bayes classifier

def document_features(document): 
    document_words = set(document) 
    features = {}
    for word in dictionary_entity:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in list_text_labels]
random.shuffle(featuresets)
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set)) 

# Reference link: http://www.nltk.org/book/ch06.html

0.42
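
The tweets were tokenized with `split()`, so `document_words` holds single whitespace-delimited tokens. A multi-word dictionary entry such as 'American Folk Art Museum' therefore cannot match, and only single-token entries such as handles and hashtags can fire. A minimal sketch of one alternative follows, assuming the tokens are re-joined and entries are matched as case-insensitive substrings. `entity_features` is a hypothetical helper, and the sample entity list is illustrative rather than taken from dictionary.txt.

```python
def entity_features(tokens, entities):
    """Flag each non-empty dictionary entry found as a substring of the tweet."""
    text = " ".join(tokens).lower()
    features = {}
    for entity in entities:
        name = entity.strip()
        if not name:  # skip the '' placeholders present in the dictionary list
            continue
        features['contains({})'.format(name)] = name.lower() in text
    return features


tweet = ['sO', 'gOOd', '#adrianpiper', '@', 'MoMA', 'The', 'Museum', 'of', 'Modern', 'Art']
sample_entities = ['The Museum of Modern Art', '', '@AMNH']  # illustrative entries
print(entity_features(tweet, sample_entities))
```

Substring matching can also produce false positives (a short entry matching inside a longer word), so this trades one kind of error for another.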


In [76]:
# Checking most informative features

classifier.show_most_informative_features(5)

Most Informative Features
contains(@brooklynmuseum) = True           non-mu : museum =      1.6 : 1.0
    contains(@metmuseum) = True           museum : non-mu =      1.5 : 1.0
contains(@IntrepidMuseum) = False          non-mu :        =      1.3 : 1.0
 contains(@studiomuseum) = False          non-mu :        =      1.3 : 1.0
     contains(@OSHBklyn) = False          non-mu :        =      1.3 : 1.0


In [77]:
# Comparing with accuracy results of Decision Tree classifier
decision_tree_classifier = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(decision_tree_classifier, test_set)) 

0.42


In [78]:
# Comparing with accuracy results of Maximum Entropy classifier
maximum_entropy_classifier = nltk.MaxentClassifier.train(train_set,max_iter=1)
print(nltk.classify.accuracy(maximum_entropy_classifier, test_set))

  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.09861        0.507
         Final          -1.08686        0.513
0.42


In [79]:
# Checking most informative features of Maximum Entropy classifier
maximum_entropy_classifier.show_most_informative_features(5)

   0.020 contains(@cooperhewitt)==True and label is 'non-museum'
   0.020 contains(@AsiaSociety)==True and label is 'museum'
   0.020 contains(@tenementmuseum)==True and label is 'museum'
   0.020 contains(@EllisIsland)==True and label is 'non-museum'
   0.020 contains(@nysci)==True and label is 'non-museum'


In [80]:
# Comparing with accuracy results of SVC classifier
svm_classifier = SklearnClassifier(SVC(), sparse=False).train(train_set)
print(nltk.classify.accuracy(svm_classifier, test_set))

0.42
