# Movie Theme Extraction and Labeling

The goal of this project is to automate the extraction of labeled themes from movie overviews. These themes can stand in for the overview as a faster, more consistent way to see what a movie is about. They also make filtering faster, so a user can find a closer match for their viewing mood.

In [1]:
import pandas as pd


The data used for this project is available at https://www.kaggle.com/rounakbanik/the-movies-dataset

We will use the overviews and titles from movies_metadata.csv.

In [None]:
df = pd.read_csv('the-movies-dataset/movies_metadata.csv')   #forward slash avoids the invalid '\m' escape

In [11]:
overviews = df['overview'].apply(str).tolist()
overviews = " ".join(overviews)    #prepare for word_tokenize

# Data Preparation

Names like Alan and Jack are common and will influence the topic model distributions. Because names are interchangeable when discussing themes, we identify and remove them as part of data preparation.

In [10]:
import nltk

In [14]:
tokens = nltk.word_tokenize(overviews)
tokens_pos = nltk.pos_tag(tokens)

### Named Entity Recognition:

Use named entity recognition to detect names in the corpus. The ne_chunk function extracts chunks from the POS-tagged tokens and returns entities such as people, places, and organizations.

In [15]:
ne_tree = nltk.ne_chunk(tokens_pos)

In [17]:
#build list of people to remove
people = []                                 
for subtree in ne_tree.subtrees():
    if subtree.label() == 'PERSON':
        person = []
        for key, value in subtree.leaves():
            person.append(key)
        people.append(person)

In [92]:
#counter gives count of each name 
import collections, operator
people_unique = collections.Counter(map(tuple, people))

Named entity recognition identifies some chunks that are not names. A manual check of the high-frequency entries in people_unique is used to build a list of misidentified entities that should remain in the corpus.

In [101]:
exclude_list = [('Christmas','Eve'),('San', 'Francisco'), ('Las', 'Vegas'), ('Santa', 'Claus'),
                ('Christmas',), ('Hollywood',), ('Superman',), ('Academy',), ('Godzilla',), ('Academy', 'Award'),
                ('Buenos', 'Aires'), ('Jesus', 'Christ'),('Pearl', 'Harbor'), ('Louisiana',), ('Sequel',), ('Father',), 
                ('Wealthy',), ('Disney',), ('Count', 'Dracula'),('Los', 'Angeles'), ('Monster', 'High'), ('Brazil',), 
                ('Shaolin',), ('Halloween',), ('Navy','SEALS')
               ]

In [102]:
#return clean list of names minus the misidentified names
people_clean = []
for key,val in sorted(people_unique.items(), key=operator.itemgetter(1), reverse = True):
    if key not in exclude_list:
        people_clean.append(' '.join(key))
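
The filtering step can also be sketched compactly with a set of tuples. This is only an illustration on a made-up `people` list, not the notebook's data. Note that a one-element tuple needs a trailing comma: `('Hollywood',)` is a tuple, while `('Hollywood')` is a plain string and would never match a key of the counter.

```python
from collections import Counter

# Toy stand-in for the PERSON chunks found by ne_chunk (hypothetical data).
people = [["Jack"], ["Las", "Vegas"], ["Jack"], ["Hollywood"], ["Alan"]]
people_unique = Counter(map(tuple, people))

# Entities that were tagged as PERSON but should stay in the corpus.
# Every entry is a tuple, so it can be compared with the counter keys.
exclude = {("Hollywood",), ("Las", "Vegas")}

# most_common() yields (name_tuple, count) pairs, highest count first.
people_clean = [
    " ".join(name)
    for name, _count in people_unique.most_common()
    if name not in exclude
]
print(people_clean)
```

Using a set for the exclusions keeps each membership test constant-time, which matters little here but is a cheap habit for longer lists.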

## Prep data:

* remove people's names
* remove stop words
* remove punctuation
* lemmatization

In [104]:
import string
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
stops = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

other_stops = ["'s","--","..."]
stops.extend(other_stops)


people_set = set(people_clean)   #set membership is much faster than scanning a list
texts = []
for sent in df['overview'].apply(str):
    text = nltk.word_tokenize(sent)
    text = ['PERSON' if word in people_set else word for word in text  ]  #replace name with PERSON
    text = [word.lower() for word in text] 
    text = [word for word in text if word not in stops]
    text = [lemmatizer.lemmatize(word) for word in text]
    text = [word for word in text if word not in string.punctuation ] 
    texts.append(text)
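
The replacement step above compares single tokens against full names, so a multi-word name such as "Buzz Lightyear" is never matched as a unit, which may be why 'lightyear' survives in the sample output further down. A minimal sketch of one way to handle this, assuming names are kept as tuples of tokens (the `mask_names` helper and the sample data are hypothetical, not part of the original notebook):

```python
def mask_names(tokens, names, placeholder="PERSON"):
    """Replace each run of tokens that matches a name tuple with one placeholder."""
    max_len = max((len(name) for name in names), default=0)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "Buzz Lightyear" wins over "Buzz".
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + size]) in names:
                out.append(placeholder)
                i += size
                break
        else:
            # No name starts at this position: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


names = {("Buzz", "Lightyear"), ("Woody",), ("Buzz",)}
tokens = ["Woody", "plots", "against", "Buzz", "Lightyear", "and", "Buzz"]
masked = mask_names(tokens, names)
print(masked)
```

In the notebook's terms, `names` could be built from the keys of people_unique minus the exclusions, before they are joined into strings.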

In [142]:
#additional cleaning 
more_stops = ["``","'ll","''"]
texts_clean = [[word for word in text if word != 'person'] for text in texts]     #remove identified names
texts_clean = [[word for word in text if word not in more_stops] for text in texts_clean]
texts_clean = [[word for word in text if len(word) > 2] for text in texts_clean]
print(texts_clean[:2])

[['led', 'toy', 'live', 'happily', 'room', 'birthday', 'brings', 'lightyear', 'onto', 'scene', 'losing', 'place', 'heart', 'plot', 'circumstance', 'separate', 'owner', 'duo', 'eventually', 'learns', 'put', 'aside', 'difference'], ['sibling', 'discover', 'enchanted', 'board', 'game', 'open', 'door', 'magical', 'world', 'unwittingly', 'invite', 'adult', 'trapped', 'inside', 'game', 'year', 'living', 'room', 'hope', 'freedom', 'finish', 'game', 'prof', 'risky', 'three', 'find', 'running', 'giant', 'rhinoceros', 'evil', 'monkey', 'terrifying', 'creature']]


## LDA topic modeling

Latent Dirichlet Allocation will be used for topic modeling. 

In [107]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim import corpora, models

In [None]:
#build bag of words representation of all documents 
dictionary = corpora.Dictionary(texts_clean)
corpus = [dictionary.doc2bow(text) for text in texts_clean ]

## 300 topics chosen after multiple iterations and coherence score testing

Choosing the right number of topics is an iterative process. After multiple tests, the number of topics that gave the best results was 300. The coherence score and visualization of the topic models for human readability were part of the evaluation process.

In [None]:
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=300, update_every=1, chunksize=5000, passes=3)

In [146]:
#get coherence score 
from gensim.models.coherencemodel import CoherenceModel
cm = CoherenceModel(model=lda, texts=texts_clean, corpus=corpus, coherence='c_v')
coherence = cm.get_coherence()  
print(coherence)

2018-12-15 11:19:36,196 : INFO : using ParallelWordOccurrenceAccumulator(processes=3, batch_size=64) to estimate probabilities from sliding windows
2018-12-15 11:23:17,776 : INFO : 3 accumulators retrieved from output queue
2018-12-15 11:23:57,353 : INFO : accumulated word occurrence stats for 44438 virtual documents


0.3910022859034032


The pyLDAvis library provides a convenient way to visualize the topics and the distribution of terms in each topic.

In [None]:
#newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
import pyLDAvis.gensim
vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(vis)

# Seed Labels

Two types of chunk patterns form the basis of the seed phrases. We look for two-word phrases that represent things (noun + noun) and terms (adjective + noun) as described by the corpus. The idea is that themes are stated explicitly in some overviews, and that similarity measures can then be used to apply these common themes as labels.
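
To make the two patterns concrete, here is a small pure-Python sketch that approximates what a two-tag chunk rule does on a hand-tagged sentence. The `chunk_pairs` helper and the sample tags are assumptions for illustration; the notebook itself relies on nltk.RegexpParser.

```python
import re


def chunk_pairs(tagged, first_tag, second_tag):
    """Collect non-overlapping adjacent word pairs whose POS tags match the two patterns."""
    first = re.compile(first_tag)
    second = re.compile(second_tag)
    pairs = []
    i = 0
    while i < len(tagged) - 1:
        (word1, tag1), (word2, tag2) = tagged[i], tagged[i + 1]
        if first.fullmatch(tag1) and second.fullmatch(tag2):
            pairs.append([word1, word2])
            i += 2  # a token belongs to at most one chunk
        else:
            i += 1
    return pairs


# Hand-tagged sample; in the notebook these pairs come from nltk.pos_tag.
tagged = [
    ("a", "DT"), ("mysterious", "JJ"), ("voice", "NN"), ("at", "IN"),
    ("the", "DT"), ("baseball", "NN"), ("team", "NN"), ("games", "NNS"),
]
things = chunk_pairs(tagged, "NN", "NN")    # noun + noun
terms = chunk_pairs(tagged, "JJ.*", "NN")   # adjective (JJ, JJR, JJS) + noun
print(things, terms)
```

The plural "games" (NNS) is not picked up because the noun pattern matches the tag NN exactly.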

In [113]:
thing_chunker = "CHUNK: {<NN><NN>}"
term_chunker = "CHUNK: {<JJ.*><NN>}"
thing_parser = nltk.RegexpParser(thing_chunker)
term_parser = nltk.RegexpParser(term_chunker)

In [114]:
thing_tree = thing_parser.parse(tokens_pos)
term_tree = term_parser.parse(tokens_pos)

In [115]:
labels = []
for subtree in thing_tree.subtrees():
    if subtree.label() == 'CHUNK':
        label = []
        for key, value in subtree.leaves():
            label.append(key)
        #print(label)
        labels.append(label)
for subtree in term_tree.subtrees():
    if subtree.label() == 'CHUNK':
        label = []
        for key, value in subtree.leaves():
            label.append(key)
        #print(label)
        labels.append(label)

In [116]:
unique_labels = [list(label) for label in set(tuple(label) for label in labels)]

We can examine the seed list to find potential matches that contain a specific word.

In [369]:
potentials = [label for label in unique_labels if 'club' in label]
potentials

[['chess', 'club'],
 ['ground', 'club'],
 ['bondage', 'club'],
 ['grime', 'club'],
 ['sex', 'club'],
 ['senior', 'club'],
 ['club', 'member'],
 ['club', 'proprietor'],
 ['club', 'scene'],
 ['auto', 'club'],
 ['roll', 'club'],
 ['club', 'kid'],
 ['motorcycle', 'club'],
 ['trendy', 'club'],
 ['volleyball', 'club'],
 ['happy', 'club'],
 ['dance', 'club'],
 ['monster', 'club'],
 ['club', 'life'],
 ['soccer', 'club'],
 ['club', 'championship'],
 ['private', 'club'],
 ['underground', 'club'],
 ['gay', 'club'],
 ['club', 'manager'],
 ['secret', 'club'],
 ['football', 'club'],
 ['bowling', 'club'],
 ['trout', 'club'],
 ['golf', 'club'],
 ['correspondence', 'club'],
 ['Spanish', 'club'],
 ['exclusive', 'club'],
 ['supper', 'club'],
 ['manipulative', 'club'],
 ['gourmet', 'club'],
 ['jazz', 'club'],
 ['club', 'house'],
 ['female', 'club'],
 ['gambling', 'club'],
 ['night', 'club'],
 ['fist', 'club'],
 ['hunting', 'club'],
 ['sumo', 'club'],
 ['club', 'dancer'],
 ['poetry', 'club'],
 ['go-go', 'c

### High potential phrases

Here we collect the high-potential phrases for each topic:

1. For each word in the topic, find the seed phrases that contain that word
2. Convert each matching phrase to a bag-of-words vector using the same dictionary used for LDA training
3. Infer the topic distribution for the phrase
4. Save the phrase if its topic matches the current topic and the probability is high

In [None]:
top_matches = {}
for topic_id, words in lda.print_topics(300):
    top_match = []
    top = 0
    for word, rat in lda.show_topic(topic_id):
        potentials = [label for label in unique_labels if word in label]
        for potential in potentials:
            unseen = dictionary.doc2bow(potential)
            for topic, prob in lda[unseen]:
                if topic ==topic_id and prob > 0.6:
                    if topic in top_matches:
                        top_matches[topic].append(list(potential))
                    else:
                        top_matches.update({topic : [list(potential)]})
                    if prob > top:
                        top = prob
                        top_match = list(potential)

# Replace overview with a list of themes

In [217]:
#create Movie class
class Movie:
    def __init__(self, doc_id, title, overview, topics, labels, label_score):
        self.doc_id = doc_id
        self.title = title
        self.overview = overview
        self.topics = topics
        self.labels = dict(labels)
        self.label_score = label_score

In [None]:
#tfidf model to be used to improve similarity scores
tfidf = models.TfidfModel(corpus)

## Get a single label for each topic in an overview

1. Get all topics above a minimum probability threshold for each document
2. Get all the high-potential phrases for each topic
3. Convert each phrase to a TF-IDF bag-of-words vector
4. Compute the cosine similarity between the phrase and the document
5. Keep the highest score above a minimum threshold

If no label meets the minimum threshold then we ignore that topic for overview replacement.
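
The scoring idea in the steps above can be sketched without gensim. The snippet below is a simplified stand-in for matutils.cossim on sparse `(id, weight)` vectors; the `best_label` helper, the vectors, and the candidate phrases are hypothetical toy values.

```python
import math


def cossim(vec1, vec2):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
    d1, d2 = dict(vec1), dict(vec2)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    dot = sum(w * d2.get(i, 0.0) for i, w in d1.items())
    return dot / (norm1 * norm2)


def best_label(doc_vec, candidates, min_score=0.05):
    """Return the candidate phrase most similar to the document, or None below the threshold."""
    best, best_score = None, min_score
    for label, vec in candidates.items():
        score = cossim(doc_vec, vec)
        if score > best_score:
            best, best_score = label, score
    return best, (best_score if best is not None else 0.0)


doc_vec = [(0, 0.6), (1, 0.8)]                 # weights for "mysterious", "voice"
candidates = {
    ("mysterious", "voice"): [(0, 1.0), (1, 1.0)],
    ("mysterious", "circus"): [(0, 1.0), (2, 1.0)],
    ("local", "farmer"): [(3, 1.0)],
}
label, score = best_label(doc_vec, candidates)
print(label, round(score, 3))
```

A phrase that shares no words with the document scores 0 and is therefore dropped by the threshold, which is the "ignore that topic" case described above.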

In [221]:
from gensim import matutils


def get_movie_labels(doc_id):  
    top_pot_labels = {}
    for topic_id,prob in lda.get_document_topics(corpus[doc_id], minimum_probability=0.03): 
        top_pot_score = 0
        top_pot_label = []
        for pot in top_matches.get(topic_id, []):   #topics with no saved phrases are skipped
            pot_corpus = tfidf[dictionary.doc2bow(pot)]
            #label score
            pot_score = matutils.cossim(tfidf[corpus[doc_id]], pot_corpus)
            if pot_score > top_pot_score and pot_score > 0.05:
                top_pot_label = list(pot)
                top_pot_score = pot_score
        if len(top_pot_label) > 0 : top_pot_labels.update({topic_id : list(top_pot_label)})

    
    top_pot_sent = []
    for key in top_pot_labels:
        top_pot_sent.append(list(top_pot_labels[key]))
    top_pot_sent = [word for label in top_pot_sent for word in label]

    label_score = matutils.cossim(tfidf[corpus[doc_id]], tfidf[dictionary.doc2bow(top_pot_sent)])
    title = df['title'].loc[doc_id]
    overview = df['overview'].loc[doc_id ]
    topics = {}
    for key, value in lda[corpus[doc_id]]:
        topics.update({key : value})
    
    new_movie = Movie(doc_id, title, overview, topics, top_pot_labels, label_score )
    return new_movie

## Build dictionary of movie objects with theme lists and write to csv file

In [223]:
movies = {}
for doc_id in range(len(corpus)):
    movies.update( {doc_id: get_movie_labels(doc_id)})

In [249]:
import csv

with open('movie_labels.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['doc_id', 'title','overview', 'topics', 'labels','label_score' ]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    for movie in range(len(movies)):
        try:
            writer.writerow({'doc_id': movies[movie].doc_id, 'title':movies[movie].title, 'overview':movies[movie].overview ,
                         'topics': movies[movie].topics, 'labels': movies[movie].labels , 'label_score': movies[movie].label_score})
        except Exception as err:     #avoid a bare except, and report why the row failed
            print(movie, 'failed:', err)

## Build dictionary of topics and their associated documents

Gather the list of documents for each topic. The labels from the movie objects are used instead of the full topic distribution, which should increase the accuracy of the filtered list of movies that match multiple topics.

In [264]:
from collections import defaultdict
topics_docs = defaultdict(list)

for i, doc in enumerate(movies):
    for key in movies[i].labels:
        topics_docs[key].append(i)

## Get global label based on frequency of words in associated documents



1. Get the documents associated with the topic
2. Build a word frequency list and keep the top x words
3. Convert the frequency list and each phrase to TF-IDF bag-of-words vectors
4. Get the cosine similarity between each high-potential phrase and the top frequency list
5. Keep the phrase with the highest score as the top label
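
The frequency step (step 2) amounts to summing bag-of-words counts across the topic's documents. A minimal sketch with collections.Counter, using a made-up `id2word` mapping and toy documents in place of the gensim dictionary and corpus:

```python
from collections import Counter


def top_words(bow_docs, id2word, num_words=10):
    """Sum (word_id, count) pairs over all documents and return the most frequent words."""
    totals = Counter()
    for doc in bow_docs:
        for word_id, count in doc:
            totals[word_id] += count
    return [id2word[word_id] for word_id, _ in totals.most_common(num_words)]


# Hypothetical vocabulary and bag-of-words documents.
id2word = {0: "club", 1: "owner", 2: "night", 3: "love"}
docs = [[(0, 3), (1, 1)], [(0, 1), (2, 2)], [(2, 1), (3, 1)]]
frequent = top_words(docs, id2word, num_words=2)
print(frequent)
```

The resulting word list would then be converted to a TF-IDF vector and compared against each candidate phrase, as in the remaining steps.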



In [361]:
def get_top_label(topic_id):
    num_words  = 10       #number of top frequency words to return
    prob_thres = 0.1      #minimum threshold for probability of topic in document
    topic_docs = []

    for doc in topics_docs[topic_id]:
        doc_topics = lda.get_document_topics(corpus[doc], minimum_probability=prob_thres)
        for key, value in doc_topics:
            if  key == topic_id:
                topic_docs.append(corpus[doc])

    freq_dict = {}
    for doc in topic_docs:
        for word, count in doc:
            if word in freq_dict:
                freq_dict[word] += count
            else:
                freq_dict.update({word: count})

    sorted_dict =  sorted(freq_dict.items(), key=operator.itemgetter(1), reverse=True)


    freq_words = []
    for key, value in sorted_dict[:num_words]:
        freq_words.append(dictionary[key])   #dictionary[key] builds the id-to-token map on demand
    print('Word Frequency:',freq_words)
    freq_corpus = tfidf[dictionary.doc2bow(freq_words)]

    top_score = 0
    top_score_match = []

    for pot in top_matches.get(topic_id, []):   #returns an empty label if the topic has no saved phrases
        pot_corpus = tfidf[dictionary.doc2bow(pot)]
        #label score
        pot_score = matutils.cossim(freq_corpus , pot_corpus)
        if pot_score > top_score:
            top_score_match = list(pot)
            top_score = pot_score
            
    return top_score_match

In [340]:
print(get_top_label(56))
print(lda.show_topic(56))
print(top_matches[56])

['life', 'family', 'woman', 'always', 'friend', 'vision', 'young', 'man', 'get', 'one']
['cheerful', 'woman']
[('always', 0.120550446), ('cross', 0.05623309), ('vision', 0.043705042), ('life', 0.039183065), ('loving', 0.026004331), ('charm', 0.02297734), ('wrote', 0.02174493), ('architect', 0.021143453), ('friend', 0.020754363), ('woman', 0.019915445)]
[['cross', 'country'], ['new', 'vision'], ['vision', 'move'], ['end', 'life'], ['stripper', 'life'], ['loving', 'relationship'], ['loving', 'wife'], ['loving', 'look'], ['loving', 'family'], ['enough', 'charm'], ['country', 'charm'], ['open-hearted', 'charm'], ['famous', 'architect'], ['young', 'architect'], ['dumb', 'woman'], ['cheerful', 'woman']]





============================


# Demo Section

============================

In [305]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

def print_movie_details(movie_id): 
    temp_movie = get_movie_labels(movie_id ) 
    print("Title:", temp_movie.title)
    print("Overview:" ,temp_movie.overview)
    pp.pprint(temp_movie.labels)

## Sample document 

Usage: <br>
print_movie_details(movie_id)<br>
movie_id ranges from 0 to 45465


In [367]:
print_movie_details(1256)

Title: Field of Dreams
Overview: Ray Kinsella is an Iowa farmer who hears a mysterious voice telling him to turn his cornfield into a baseball diamond. He does, but the voice's directions don't stop -- even after the spirits of deceased ballplayers turn up to play.
{   119: ['mysterious', 'circus'],
    129: ['great', 'spirit'],
    157: ['mysterious', 'voice'],
    161: ['baseball', 'team'],
    176: ['new', 'direction'],
    252: ['mysterious', 'mansion'],
    291: ['local', 'farmer']}


## Browse topics

topic_id ranges from 0 - 299 <br>
get_top_label(topic_id)

In [364]:
browse_topic_id =  45
print('LDA distribution:', lda.show_topic(browse_topic_id))
print('Theme:',get_top_label(browse_topic_id))

LDA distribution: [('club', 0.10398337), ('choose', 0.05524126), ('behavior', 0.050058357), ('owner', 0.048141845), ('halloween', 0.029622752), ('singing', 0.029364608), ('catastrophe', 0.028086297), ('potential', 0.02472276), ('drummer', 0.023873845), ('stunning', 0.023380758)]
Word Frequency: ['club', 'owner', 'love', 'life', 'new', 'night', 'house', 'young', 'meet', 'must']
Theme: ['night', 'club']


## Find intersection of document lists for multiple topics to filter down to a small list of movies

topic_id ranges from 0 to 299 <br>
Chain set(topics_docs[topic_id]) terms together with & to find the documents that share all of the chosen topics
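
For more than two topics, the chaining can be wrapped in a small helper. This is a sketch on toy data: `movie_labels` and `docs_matching` are hypothetical names, and the index stores sets directly rather than the lists used in the notebook.

```python
from collections import defaultdict

# Toy per-movie labels keyed by topic id (stand-in for movies[i].labels).
movie_labels = {
    0: {45: ["night", "club"], 189: ["disturbed", "woman"]},
    1: {45: ["club", "life"]},
    2: {189: ["twisted", "life"], 45: ["potential", "love"], 16: ["isolated", "house"]},
}

# Inverted index: topic id -> set of document ids that carry a label for it.
topics_docs = defaultdict(set)
for doc_id, labels in movie_labels.items():
    for topic_id in labels:
        topics_docs[topic_id].add(doc_id)


def docs_matching(topic_ids):
    """Return the sorted document ids that have a label for every given topic."""
    sets = [topics_docs.get(topic_id, set()) for topic_id in topic_ids]
    return sorted(set.intersection(*sets)) if sets else []


print(docs_matching([45, 189]))
print(docs_matching([45, 189, 16]))
```

Using `.get` instead of indexing keeps the lookup from adding empty entries to the defaultdict when an unknown topic id is requested.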

In [368]:
for doc_id in list(set(topics_docs[45]) & set(topics_docs[189]) ):
    print_movie_details(doc_id)
    print("=======================================")

Title: Girls! Girls! Girls!
Overview: Elvis plays Ross Carpenter, a fishing guide/sailor who loves his life out on the sea. When he finds out his boss is retiring to Arizona, he has to find a way to buy the Westwind, a boat that he and his father built. He is also caught between two women: insensitive club singer Robin and sweet Laurel.
{   31: ['woman', 'caught'],
    45: ['club', 'life'],
    57: ['father', 'try'],
    115: ['life', 'boat'],
    125: ['sweet', 'love'],
    132: ['sea', 'travel'],
    189: ['disturbed', 'woman'],
    226: ['tortured', 'woman'],
    283: ['lead', 'singer']}
Title: Homicidal
Overview: The story centers around a murderous scheme to collect a rich inheritance. The object of murder is Miriam Webster, who is to share in the inheritance with her half brother Warren, who lives with his childhood guardian Helga in the mansion where Warren and Mariam grew up. Confined to a wheelchair after recently suffering a stroke, Helga is cared for by her nurse Emily, a st

Title: This Time Around
Overview: We all wish we could change the past. For Mel that day has come. In junior high, Mel and Gabby (Sara Rue) were the biggest geeks. Of course, Mel had a crush on the most popular guy, Drew Hesler (Brian A. Green). When word spreads of Mel’s crush, he plays a cruel joke on her pretending to be interested, only to turn her down in front of the entire school. Eleven years later, our ugly ducklings are now swans and Mel is a PR executive. When Mel is assigned to handle the publicity for a new restaurant, she is shocked to discover that the owner is a handsome young entrepreneur named Drew Hesler.
{   45: ['young', 'owner'],
    78: ['school', 'day'],
    104: ['first', 'crush'],
    116: ['wish', 'come'],
    163: ['new', 'guy'],
    181: ['immigrant', 'day'],
    189: ['young', 'guard'],
    190: ['new', 'year'],
    205: ['new', 'location']}
Title: Poor Pierrot
Overview: One night, Arlequin come to see his lover Colombine. But then Pierrot knocks at the do

Title: Oasis of Fear
Overview: Two young sexually free hippies, Dick (Ray Lovelock) and Ingrid (Ornella Muti) finance their travels by selling naked snaps of Ingrid until their plan is brought to an abrupt end by the Police. Forced on the run the two seek refuge at a seemingly empty isolated large villa. As it turns out the house is inhabited by the middle-aged Barbara (Irene Papas) who invites them in for some potential three-way hanky-panky that soon locks them into something far more twisted and chilling!
{   16: ['isolated', 'house'],
    25: ['naked', 'man'],
    45: ['potential', 'love'],
    70: ['young', 'lawyer'],
    83: ['young', 'lead'],
    84: ['middle-aged', 'man'],
    101: ['young', 'teenage'],
    169: ['young', 'cousin'],
    189: ['twisted', 'life'],
    216: ['large', 'house'],
    230: ['young', 'police'],
    261: ['find', 'refuge'],
    290: ['cold', 'plan']}
Title: La Cage aux folles
Overview: Two gay men living in St. Tropez have their lives turned upside down