# Problem Statement

Given the entire script of a movie, can you classify it into the right genre?
Labeling text data by hand can be hard, so using the available information to create or predict the labels automatically is an interesting machine learning task. With Natural Language Processing (NLP), unstructured text can be used to predict the right classes for unseen test data.

To accomplish this, we have scraped close to 2000 movie scripts along with their respective genres.

As some of the scripts are huge, it would be interesting to explore new approaches to feature extraction and different NLP techniques.

In this hackathon, participants are challenged to use the movie scripts to design a natural language processing system that can help the customer classify a script into the right genre.

In [1]:
import pandas as pd
import numpy as np
import nltk
import re
import csv
import matplotlib.pyplot as plt
from tqdm import tqdm
%matplotlib inline
pd.set_option('display.max_colwidth', 300)

In [2]:
train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')

In [3]:
train.head()

Unnamed: 0,File_Name,Labels
0,file_2180.txt,8
1,file_693.txt,4
2,file_2469.txt,6
3,file_2542.txt,6
4,file_378.txt,16


In [4]:
train.shape, test.shape

((1978, 2), (849, 1))

In [5]:
import os
data_folder = "C:\\Users\\hungu\\Documents\\MovieGenre\\MovieScriptsParticipantsData\\Scripts"
def read_script(file_name):
    # The context manager closes each file handle after reading
    with open(os.path.join(data_folder, file_name), "r") as f:
        return f.read()
train['Script'] = [read_script(file) for file in train['File_Name']]
test['Script'] = [read_script(file) for file in test['File_Name']]

In [6]:
train.head()

Unnamed: 0,File_Name,Labels,Script
0,file_2180.txt,8,"\t\t\tCrouching Tiger, Hidden Dragon\n\n\t\t\t\tby\n\n\tWang Hui Ling, James Schamus, Tsai Kuo Jung\n\n\t\t\t\tbased on the novel by\n\n\t\t\t\tWang Du Lu\n\nEXT. YUAN COMPOUND - DAY\n\nSecurity men and porters are loading wagons for a convoy.\n\nAs they work, we see across the lake a lone horse..."
1,file_693.txt,4,"""MUMFORD""\n\n Screenplay by\n\n Lawrence Kasdan\n\n SHOOTING DRAFT\n\n EXT. MAIN STREET, SMALL TOWN - DAY\n\n A freigh..."
2,file_2469.txt,6,MAX PAYNE\n\n Written by\n\n Beau Michael Thorne\n\n 8/24/2007\n\n OVER BLACK:\n\n MAX'S VOICE\n\n ...
3,file_2542.txt,6,"SLUMDOG MILLIONAIRE\n\n Written by\n\n Simon Beaufoy\n\n November 4th, 2007\n\n SLUMDOG FILMS LIMITED\n\n39 LONG ACRE\n\nLONDON WC2E 9LG\n\n1 INT. JAVED'S SAFE-HOUSE. BATHROOM. NI..."
4,file_378.txt,16,<b><!--\n\n</b>if (window!= top)\n\ntop.location.href=location.href\n\n<b>// -->\n\n</b>\n\nThe Abyss - by James Cameron \n\n THE ABYSS\n\n AN ORIGINAL SCREENPLAY\n\n BY\n\n ...
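
The last row (file_378.txt) shows that some scraped scripts still carry HTML and JavaScript residue ahead of the screenplay text. One way to reduce that noise, sketched below, is to strip comment blocks and tags before any further cleaning. The helper name `strip_html_residue` and its regular expressions are assumptions for illustration, not part of the original notebook, and they will not catch every kind of markup.

```python
import html
import re

def strip_html_residue(text):
    """Hypothetical helper: remove HTML comments and tags left over from scraping."""
    # Drop comment blocks such as <!-- ... -->, including any script code inside them
    text = re.sub(r"<!--.*?-->", " ", text, flags=re.DOTALL)
    # Drop remaining tags such as <b> and </b>
    text = re.sub(r"</?[a-zA-Z][^>]*>", " ", text)
    # Decode entities such as &amp; into plain characters
    return html.unescape(text)

# Opening characters of file_378.txt as displayed in the table above
sample = ("<b><!--\n\n</b>if (window!= top)\n\ntop.location.href=location.href"
          "\n\n<b>// -->\n\n</b>\n\nThe Abyss - by James Cameron")
cleaned = strip_html_residue(sample)
print(cleaned.strip())  # The Abyss - by James Cameron
```

Applied before the regular cleaning step, a helper like this would keep tokens such as "window", "location" and "href" out of the features for that script.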


In [7]:
print(train['Script'][3][:3000])

                      SLUMDOG MILLIONAIRE

                          Written by

                         Simon Beaufoy

                                             November 4th, 2007

 SLUMDOG FILMS LIMITED

39 LONG ACRE

LONDON WC2E 9LG

1   INT. JAVED'S SAFE-HOUSE. BATHROOM. NIGHT.                 1

    An expensive bathroom suite. Excess of marble and gold

    taps. Into the bath, a hand is scattering rupee notes.

    Hundreds and hundreds of notes, worth hundreds of

    thousands of rupees. The sound of a fist thumping on

    the bathroom door, furious shouting from the other

    side.

                               JAVED O/S

              Salim! Salim!

2   INT. STUDIO. BACKSTAGE. DAY.                              2

    Darkness. Then, glimpses of faces. In the half-light,

    shadowy figures move with purpose. An implacable voice

    announces.

                            TALKBACK V/O

              Ten to white-out, nine, eight,

              seven...

           

In [8]:
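def clean_summary(text):
    # Drop apostrophes so contractions collapse into one token ("don't" -> "dont")
    text = re.sub("\'", "", text)
    # Replace every non-letter character (digits, punctuation, tabs, newlines) with a space
    text = re.sub("[^a-zA-Z]"," ",text)
    # Collapse runs of whitespace into single spaces
    text = ' '.join(text.split())
    text = text.lower()
    # Keep only words longer than three characters
    text = ' '.join([w for w in text.split() if len(w)>3])
    return text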
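
As a quick check of what this function keeps and drops, the sketch below applies the same steps to a short string assembled from the Slumdog Millionaire excerpt printed above. The function body is repeated only so that the example runs on its own.

```python
import re

def clean_summary(text):
    text = re.sub("\'", "", text)
    text = re.sub("[^a-zA-Z]", " ", text)
    text = ' '.join(text.split())
    text = text.lower()
    text = ' '.join([w for w in text.split() if len(w) > 3])
    return text

# Scene heading and a line of dialogue taken from the excerpt above
sample = "1   INT. JAVED'S SAFE-HOUSE. BATHROOM. NIGHT.   1\n\n\tSalim! Salim!"
result = clean_summary(sample)
print(result)  # javeds safe house bathroom night salim salim
```

Note that the length filter removes the scene markers INT and EXT as well as DAY, while NIGHT survives, so some screenplay structure is lost at this step.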

In [9]:
train['clean_script'] = train['Script'].apply(clean_summary)
train.head(2)

Unnamed: 0,File_Name,Labels,Script,clean_script
0,file_2180.txt,8,"\t\t\tCrouching Tiger, Hidden Dragon\n\n\t\t\t\tby\n\n\tWang Hui Ling, James Schamus, Tsai Kuo Jung\n\n\t\t\t\tbased on the novel by\n\n\t\t\t\tWang Du Lu\n\nEXT. YUAN COMPOUND - DAY\n\nSecurity men and porters are loading wagons for a convoy.\n\nAs they work, we see across the lake a lone horse...",crouching tiger hidden dragon wang ling james schamus tsai jung based novel wang yuan compound security porters loading wagons convoy they work across lake lone horseman entering village recognizes worker master here angle thirties powerful handsome background aunt sight drops parcels runs excit...
1,file_693.txt,4,"""MUMFORD""\n\n Screenplay by\n\n Lawrence Kasdan\n\n SHOOTING DRAFT\n\n EXT. MAIN STREET, SMALL TOWN - DAY\n\n A freigh...",mumford screenplay lawrence kasdan shooting draft main street small town freight truck late vintage pulls side road small rural town handsome well built gets passenger side thanks driver newcomer carries coat over shoulder beat suitcase modified pompadour shirtsleeves rolled past biceps wipes br...


In [10]:
from nltk.corpus import stopwords
# Requires the NLTK stopwords corpus; run nltk.download('stopwords') once if it is missing
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    no_stopword_text = [w for w in text.split() if w not in stop_words]
    return ' '.join(no_stopword_text)
train['clean_script'] = train['clean_script'].apply(remove_stopwords)
train.head(2)

Unnamed: 0,File_Name,Labels,Script,clean_script
0,file_2180.txt,8,"\t\t\tCrouching Tiger, Hidden Dragon\n\n\t\t\t\tby\n\n\tWang Hui Ling, James Schamus, Tsai Kuo Jung\n\n\t\t\t\tbased on the novel by\n\n\t\t\t\tWang Du Lu\n\nEXT. YUAN COMPOUND - DAY\n\nSecurity men and porters are loading wagons for a convoy.\n\nAs they work, we see across the lake a lone horse...",crouching tiger hidden dragon wang ling james schamus tsai jung based novel wang yuan compound security porters loading wagons convoy work across lake lone horseman entering village recognizes worker master angle thirties powerful handsome background aunt sight drops parcels runs excitedly build...
1,file_693.txt,4,"""MUMFORD""\n\n Screenplay by\n\n Lawrence Kasdan\n\n SHOOTING DRAFT\n\n EXT. MAIN STREET, SMALL TOWN - DAY\n\n A freigh...",mumford screenplay lawrence kasdan shooting draft main street small town freight truck late vintage pulls side road small rural town handsome well built gets passenger side thanks driver newcomer carries coat shoulder beat suitcase modified pompadour shirtsleeves rolled past biceps wipes brow sw...
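
Because `clean_summary` strips apostrophes before this step, contractions such as "don't" become "dont" and may no longer match entries in a stopword list that stores them with the apostrophe. The sketch below illustrates the effect with a tiny hand-written stopword set; `toy_stop_words` is a stand-in assumption, not the NLTK list, and the two-argument `remove_stopwords` is a variant written for this example.

```python
# Tiny stand-in for a stopword list (an assumption; the notebook uses NLTK's English list)
toy_stop_words = {"don't", "they", "that", "here"}

def remove_stopwords(text, stop_words):
    return ' '.join(w for w in text.split() if w not in stop_words)

# Text as it looks after clean_summary: lowercase, apostrophes already removed
cleaned = "dont believe heaven they work"
filtered = remove_stopwords(cleaned, toy_stop_words)
print(filtered)  # dont believe heaven work

# Possible remedy: strip apostrophes from the stopwords too, so both sides match
normalized_stop_words = {w.replace("'", "") for w in toy_stop_words}
filtered_normalized = remove_stopwords(cleaned, normalized_stop_words)
print(filtered_normalized)  # believe heaven work
```

Whether this matters for the final score is untested here; it simply shows why tokens like "dont" can survive stopword removal.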


In [11]:
train['Labels'].value_counts()

6     405
19    261
4     243
0     203
5     141
15    134
1     116
16    109
11    104
8      79
14     75
7      27
2      25
20     18
13     15
21      9
12      4
9       3
3       2
17      2
10      2
18      1
Name: Labels, dtype: int64

In [12]:
train.head()

Unnamed: 0,File_Name,Labels,Script,clean_script
0,file_2180.txt,8,"\t\t\tCrouching Tiger, Hidden Dragon\n\n\t\t\t\tby\n\n\tWang Hui Ling, James Schamus, Tsai Kuo Jung\n\n\t\t\t\tbased on the novel by\n\n\t\t\t\tWang Du Lu\n\nEXT. YUAN COMPOUND - DAY\n\nSecurity men and porters are loading wagons for a convoy.\n\nAs they work, we see across the lake a lone horse...",crouching tiger hidden dragon wang ling james schamus tsai jung based novel wang yuan compound security porters loading wagons convoy work across lake lone horseman entering village recognizes worker master angle thirties powerful handsome background aunt sight drops parcels runs excitedly build...
1,file_693.txt,4,"""MUMFORD""\n\n Screenplay by\n\n Lawrence Kasdan\n\n SHOOTING DRAFT\n\n EXT. MAIN STREET, SMALL TOWN - DAY\n\n A freigh...",mumford screenplay lawrence kasdan shooting draft main street small town freight truck late vintage pulls side road small rural town handsome well built gets passenger side thanks driver newcomer carries coat shoulder beat suitcase modified pompadour shirtsleeves rolled past biceps wipes brow sw...
2,file_2469.txt,6,MAX PAYNE\n\n Written by\n\n Beau Michael Thorne\n\n 8/24/2007\n\n OVER BLACK:\n\n MAX'S VOICE\n\n ...,payne written beau michael thorne black maxs voice dont believe heaven idea something heard song fade white pristine empty frame clean peaceful maxs voice heaven place nothing ever happens theres gentle motion blank frame like swirling grain rumble starts build growing louder grain moves faster ...
3,file_2542.txt,6,"SLUMDOG MILLIONAIRE\n\n Written by\n\n Simon Beaufoy\n\n November 4th, 2007\n\n SLUMDOG FILMS LIMITED\n\n39 LONG ACRE\n\nLONDON WC2E 9LG\n\n1 INT. JAVED'S SAFE-HOUSE. BATHROOM. NI...",slumdog millionaire written simon beaufoy november slumdog films limited long acre london javeds safe house bathroom night expensive bathroom suite excess marble gold taps bath hand scattering rupee notes hundreds hundreds notes worth hundreds thousands rupees sound fist thumping bathroom door f...
4,file_378.txt,16,<b><!--\n\n</b>if (window!= top)\n\ntop.location.href=location.href\n\n<b>// -->\n\n</b>\n\nThe Abyss - by James Cameron \n\n THE ABYSS\n\n AN ORIGINAL SCREENPLAY\n\n BY\n\n ...,window location href location href abyss james cameron abyss original screenplay james cameron august directors revision abyss omitted omitted title abyss black dissolving cobalt blue ocean underwater blue deep featureless twilight five hundred feet propeller sound materializing blue limbo enorm...


In [13]:
test['clean_script'] = test['Script'].apply(clean_summary)
test['clean_script'] = test['clean_script'].apply(remove_stopwords)

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [15]:
y = train['Labels']

In [16]:
x_train, x_val, ytrain, yval = train_test_split(train.clean_script.values, y, 
                                                random_state=2020, 
                                                test_size=0.1, shuffle=True)

In [17]:
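def multiclass_logloss(actual, predicted, eps=1e-15):
    """Multi-class log loss: mean negative log-probability of the true class."""
    # Convert integer labels to a one-hot matrix; assumes label k maps to column k of `predicted`
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2

    # Clip probabilities away from 0 and 1 so the logarithm stays finite
    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))  # "vsota": sum of log-probabilities of the true classes
    return -1.0 / rows * vsota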
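
For integer labels, the metric reduces to the average of -ln(p) over the samples, where p is the probability the model assigned to the true class. The small worked example below checks the function against that hand calculation; the labels and probabilities are made up for illustration.

```python
import numpy as np

def multiclass_logloss(actual, predicted, eps=1e-15):
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2
    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    return -1.0 / rows * np.sum(actual * np.log(clip))

# Two samples, three classes; the true labels are 0 and 2 (toy values)
actual = np.array([0, 2])
predicted = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6]])
loss = multiclass_logloss(actual, predicted)

# By hand: the mean of -ln(0.7) and -ln(0.6)
by_hand = -(np.log(0.7) + np.log(0.6)) / 2
print(loss, by_hand)
```

The one-hot conversion only lines up when label k corresponds to column k of the probability matrix, which holds when the model was trained on every label from 0 to K-1.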

In [18]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.9, max_features=4000)
tfidf_vectorizer.fit(list(x_train) + list(x_val))
xtrain = tfidf_vectorizer.transform(x_train)
xval = tfidf_vectorizer.transform(x_val)
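
The cell above fits the vectorizer on the training and validation text together, so the vocabulary and IDF weights are influenced by the validation split and the validation score may look slightly optimistic. A stricter alternative, sketched here on toy documents, learns the vocabulary from the training split only. The toy strings and variable names are assumptions for illustration, and the notebook's `max_df` and `max_features` settings are left out because they would filter a corpus this small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for x_train and x_val (assumptions for illustration)
toy_train = ["sword fight village master", "detective city night murder"]
toy_val = ["spaceship alien city"]

vectorizer = TfidfVectorizer()
# Learn vocabulary and IDF weights from the training split only
toy_xtrain = vectorizer.fit_transform(toy_train)
# Reuse the fitted vocabulary; words unseen during fitting are ignored
toy_xval = vectorizer.transform(toy_val)

print(sorted(vectorizer.vocabulary_))
print(toy_xtrain.shape, toy_xval.shape)
```

Here "spaceship" and "alien" never enter the vocabulary, so the validation row contains a single non-zero entry for "city".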

In [19]:
xtest = tfidf_vectorizer.transform(test.clean_script.values)

In [20]:
xtrain.shape, xval.shape

((1780, 4000), (198, 4000))

In [21]:
model = LogisticRegression()
model.fit(xtrain, ytrain)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [22]:
pred_prob = model.predict_proba(xval)
print(multiclass_logloss(yval, pred_prob))

2.4684748812229613
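
To put this score in context, it can help to compare it with two model-free baselines: predicting a uniform distribution over all 22 genres, and predicting the training class frequencies for every script. The sketch below computes both from the label counts shown in the `value_counts()` output above. The second figure is the entropy of the training label distribution, so it is only a rough reference, because the validation split may have a somewhat different label mix.

```python
import math

# Label counts copied from the value_counts() output of the training data
label_counts = [405, 261, 243, 203, 141, 134, 116, 109, 104, 79, 75,
                27, 25, 18, 15, 9, 4, 3, 2, 2, 2, 1]
total = sum(label_counts)

# Log loss of predicting the same uniform distribution for every script
uniform_loss = math.log(len(label_counts))

# Expected log loss of always predicting the class frequencies, measured on
# data with the same label distribution (the entropy of that distribution)
prior_loss = -sum((c / total) * math.log(c / total) for c in label_counts)

print(round(uniform_loss, 3), round(prior_loss, 3))
```

A validation loss close to the class-frequency baseline would suggest that the model adds little beyond the label distribution, which is worth checking before tuning further.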


In [23]:
pred_test = model.predict_proba(xtest)

In [24]:
# Label the probability columns with the classes the model was trained on
submission = pd.DataFrame(pred_test, columns=model.classes_)
submission.insert(0, 'File_Name', test.File_Name)
submission.head()

Unnamed: 0,File_Name,0,1,2,3,4,5,6,7,8,...,12,13,14,15,16,17,18,19,20,21
0,file_2300.txt,0.073854,0.06536,0.020427,0.004135,0.106531,0.049564,0.200447,0.017628,0.038172,...,0.004564,0.010047,0.054706,0.075976,0.049098,0.00414,0.003683,0.155705,0.011875,0.007012
1,file_809.txt,0.063606,0.049373,0.015267,0.004082,0.166357,0.056085,0.186819,0.027055,0.031936,...,0.004513,0.009824,0.035505,0.100017,0.048324,0.004098,0.003661,0.119776,0.012132,0.006866
2,file_1383.txt,0.100963,0.162501,0.012709,0.003921,0.071993,0.04579,0.089394,0.01386,0.029686,...,0.004432,0.009189,0.028226,0.085389,0.204576,0.003916,0.003456,0.066253,0.011023,0.006603
3,file_983.txt,0.079201,0.056495,0.01265,0.003927,0.11754,0.060903,0.191975,0.012995,0.030074,...,0.004362,0.009347,0.03278,0.055562,0.039627,0.00392,0.003449,0.216454,0.011127,0.006647
4,file_1713.txt,0.153005,0.116929,0.015789,0.004008,0.096903,0.053013,0.191548,0.017948,0.039022,...,0.00441,0.011402,0.029777,0.057831,0.048662,0.004091,0.003563,0.081176,0.011415,0.006692


In [25]:
submission.to_excel('Submission.xlsx', index=False)