# The New Bechdel test!
An analysis of the new Bechdel test using the Cornell Movie-Dialogs Corpus.

## Sentiment analysis on the movie dialogues corpus
Sentiment analysis can be performed at several levels:

1. Document-level sentiment analysis evaluates the sentiment toward a single entity (e.g. a product) from a whole review document.
2. Sentence-level sentiment analysis evaluates the sentiment of a single sentence.
3. Aspect-level sentiment analysis performs a finer-grained analysis. For example, the sentence “The iPhone’s call quality is good, but its battery life is short.” evaluates two aspects of the iPhone (the entity): call quality and battery life. The sentiment on the iPhone’s call quality is positive, but the sentiment on its battery life is negative (Liu 2012).
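
The gap between sentence-level and aspect-level scoring can be sketched with a toy example. This is only an illustration under stated assumptions, not how a real analyzer works: the two-word `TOY_LEXICON`, the `ASPECTS` list, and the clause split on "but" are all made up for this sketch.

```python
import re

# Hypothetical toy lexicon: values chosen only for this illustration
TOY_LEXICON = {"good": 1.0, "short": -1.0}
# Hypothetical list of aspects we want to track
ASPECTS = ["call quality", "battery life"]


def score(text):
    """Sum the toy lexicon values of all words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(TOY_LEXICON.get(word, 0.0) for word in words)


def aspect_sentiment(sentence):
    """Split the sentence into clauses and score each clause per aspect."""
    result = {}
    for clause in re.split(r",\s*but\s+|;\s*", sentence):
        for aspect in ASPECTS:
            if aspect in clause.lower():
                result[aspect] = score(clause)
    return result


sentence = "The iPhone's call quality is good, but its battery life is short."
sentence_level = score(sentence)
aspect_level = aspect_sentiment(sentence)
print(sentence_level)
print(aspect_level)
```

At the sentence level the two opinions cancel each other out, whereas the aspect-level view keeps the positive and the negative opinion apart.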


In [1]:
import pandas as pd
import re
import operator
import plotly.plotly as py  # plotly < 4 only; from plotly 4 on this module lives in chart_studio.plotly
import plotly.graph_objs as go

In [2]:
ml_df = pd.read_csv('ml_csv.csv', encoding = 'latin-1')

In [3]:
ml_df.describe()

| | Line_Id | Character_Id | Movie_Number | Character_Name | Dialogue |
|---|---|---|---|---|---|
| count | 304713 | 304713 | 304713 | 304670 | 304446 |
| unique | 304713 | 9035 | 617 | 5355 | 265275 |
| top | L10232 | u4525 | m289 | JACK | What? |
| freq | 1 | 537 | 1530 | 3032 | 1684 |


In [4]:
# Let's import the other two dataframes (conversations and character metadata)
mc_df = pd.read_csv("mc_csv.csv", encoding = 'latin-1')
mcm_df = pd.read_csv("mcm_csv.csv", encoding = 'latin-1')

In [5]:
movie_number = 'm2'

In [6]:
# The columns in this dataset indicate that Character_Id1
# is the character speaking to Character_Id2 in each movie
mc1_df = mc_df.loc[mc_df['Movie_Number'] == movie_number]
ml1_df = ml_df.loc[ml_df['Movie_Number'] == movie_number]

In [7]:
mc1_df.head()

| | Character_Id1 | Character_Id2 | Movie_Number | List_of_Utterance |
|---|---|---|---|---|
| 294 | u26 | u30 | m2 | ['L3380', 'L3381', 'L3382'] |
| 295 | u26 | u30 | m2 | ['L3383', 'L3384', 'L3385'] |
| 296 | u26 | u30 | m2 | ['L3392', 'L3393', 'L3394', 'L3395', 'L3396'] |
| 297 | u26 | u30 | m2 | ['L3459', 'L3460', 'L3461'] |
| 298 | u26 | u30 | m2 | ['L3464', 'L3465', 'L3466', 'L3467', 'L3468', ... |


In [8]:
ml1_df.head()

| | Line_Id | Character_Id | Movie_Number | Character_Name | Dialogue |
|---|---|---|---|---|---|
| 942 | L3546 | u26 | m2 | CUTLER | Officers, there's your killer, do your duty, a... |
| 943 | L3545 | u30 | m2 | EMIL | ...so we kill someone famous and if we are cau... |
| 944 | L3497 | u26 | m2 | CUTLER | I don't think it's abuse, I think it's torture. |
| 945 | L3496 | u30 | m2 | EMIL | I'm abused. Don't you think? |
| 946 | L3493 | u26 | m2 | CUTLER | Can I see your back? |


In [9]:
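for index, rows in mc1_df.iterrows():
    # List_of_Utterance is a string such as "['L3380', 'L3381']";
    # splitting on quotes and taking every second piece yields the line IDs
    line_ids = rows['List_of_Utterance'].split('\'')[1::2]
    for line_id in line_ids:
        line = ml1_df.loc[ml1_df['Line_Id'] == line_id, 'Dialogue'].iloc[0]
        line = str(line).lower()
        # Drop ellipses (runs of two or more dots)
        line = re.sub(r'\.\.+', '', line)
        print(line)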

are you my attorney?  i'm emil.  i'm insane.
i'm not your lawyer until i see the money.
here.  i have your money.
oh no!  no!  shit!
emil.  take it easy.  stay with me.  sit down.  what do you need?  what are you looking for?
he has the camera!  he took the movie!
don't say anything.
where are we going?
i'm coming with you.
yes.  yes, come with me!
i'm invoking rights - this man is represented by counsel.  i'm coming with him.
i brought you some letters.  it's really fan mail.  women mostly.  one wants to buy you clothes, another sent a check. another wants a check.
you bring the cigarettes?
oh, sure.
delusions and paranoia.
i was all of these.
well, you didn't appreciate the severity of it until recently.  no question about that.
what about oleg?
disappeared.  they're looking everywhere.  maybe he went back to czechoslovakia.
no, he is here.  shit
don't worry about him.  think about yourself.
what about my movie rights?  book rights?
look, i haven't really focused on that kind of thin

he doesn't speak english.
who is he?
new york's finest.  this is his case.
this all you want?
do you know how much killer gets for movie rights?
in here, says he wants a million.
million?!  the killer gets one million dollars for a television interview?
hey, tabloids paid ted bundy - famous serial killer - half a million for his interview.  and how much you think monica got for writing book about the president coming on to her?  it pays to be a killer or a whore in this country. look, you want magazine or not?
yes.  both.
just do what i do.  say the same thing i say.  don't open your mouth.
okay.
don't fool around.
okay.
did you hear what i said?
i want to document my trip to america.
look.  times square.  just like in the movies!
don't speak russian!
why?  why do i always have to speak to you in czech?
because i don't like your ugly language. i heard enough of it in school!  now speak czech or english.  and don't fool around anymore.  you almost got us thrown out!
look.  new videocame
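
The utterance lists are stored as strings that look like Python list literals. Assuming every value in `List_of_Utterance` is well formed, `ast.literal_eval` is one way to parse them instead of splitting on quotes. The helper names `parse_utterance_ids` and `clean_line` are hypothetical and introduced only for this sketch.

```python
import ast
import re


def parse_utterance_ids(raw):
    """Turn a string like "['L3380', 'L3381']" into a list of line IDs."""
    return ast.literal_eval(raw)


def clean_line(text):
    """Lowercase the text and strip runs of two or more dots, as in the loop above."""
    return re.sub(r'\.\.+', '', str(text).lower())


ids = parse_utterance_ids("['L3380', 'L3381', 'L3382']")
cleaned = clean_line("...so we kill someone famous")
print(ids)
print(cleaned)
```

Unlike the quote split, `literal_eval` raises an error on a malformed value, which makes bad rows easier to spot.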

# Let's use an off-the-shelf, pre-trained NLTK sentiment analyzer
Sentiment analysis is simply the process of working out (statistically) whether a piece of text is positive, negative or neutral. <br> Most sentiment analysis approaches take one of two forms: polarity-based, where pieces of text are classified as either positive or negative, or valence-based, where the intensity of the sentiment is also taken into account. <br>For example, the words ‘good’ and ‘excellent’ would be treated the same in a polarity-based approach, whereas ‘excellent’ would be treated as more positive than ‘good’ in a valence-based approach.
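
The 'good' versus 'excellent' distinction can be made concrete with a small sketch. The `VALENCE` table and its numbers are assumptions chosen for illustration, not values read from any real lexicon.

```python
# Hypothetical valence scores, made up for this illustration
VALENCE = {"good": 1.9, "excellent": 2.7, "bad": -2.5}


def valence_score(word):
    """Valence-based view: return how positive or negative the word is."""
    return VALENCE.get(word.lower(), 0.0)


def polarity_label(word):
    """Polarity-based view: keep only the sign of the score."""
    score = valence_score(word)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for word in ["good", "excellent", "bad"]:
    print(word, polarity_label(word), valence_score(word))
```

Both 'good' and 'excellent' get the same polarity label, but the valence scores still rank 'excellent' above 'good'.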

---
VADER (Valence Aware Dictionary and sEntiment Reasoner) belongs to a type of sentiment analysis that is based on lexicons of sentiment-related words.<br>In this approach, each of the words in the lexicon is rated as to whether it is positive or negative and, in many cases, how positive or negative.<br>

In [12]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [13]:
# A few examples
sentences = ["Krishna is smart, handsome, and funny.", 
"Krishna is VERY SMART, really handsome, and INCREDIBLY FUNNY!!!",
"The book was good.",
"The plot was good, but the characters are uncompelling and the dialog is not great.",
"A really bad, horrible book.",
"Honestly is he alright!!!? :(",
"Damn it, I totally fucked up!"]

In [14]:
#nltk.download('vader_lexicon')  # uncomment on the first run to fetch the lexicon
sid = SentimentIntensityAnalyzer()

In [15]:
for sentence in sentences:
    print(sentence)
    ss = sid.polarity_scores(sentence)
    print(ss)

Krishna is smart, handsome, and funny.
{'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}
Krishna is VERY SMART, really handsome, and INCREDIBLY FUNNY!!!
{'neg': 0.0, 'neu': 0.294, 'pos': 0.706, 'compound': 0.9469}
The book was good.
{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}
The plot was good, but the characters are uncompelling and the dialog is not great.
{'neg': 0.327, 'neu': 0.579, 'pos': 0.094, 'compound': -0.7042}
A really bad, horrible book.
{'neg': 0.791, 'neu': 0.209, 'pos': 0.0, 'compound': -0.8211}
Honestly is he alright!!!? :(
{'neg': 0.297, 'neu': 0.307, 'pos': 0.396, 'compound': 0.2444}
Damn it, I totally fucked up!
{'neg': 0.719, 'neu': 0.281, 'pos': 0.0, 'compound': -0.8264}
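
The `compound` value can be traced by hand for the simplest sentence. To the best of our understanding of VADER's reference implementation, the summed word valences are squashed into the range (-1, 1) with `x / sqrt(x^2 + alpha)` and `alpha = 15`. The sketch below assumes that formula and assumes that 'good' carries a lexicon valence of 1.9 while the other words in "The book was good." carry none; both are assumptions, not values taken from the code above.

```python
import math


def normalize(raw_score, alpha=15):
    """Squash an unbounded valence sum into the open range (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)


# Assumed lexicon valence for "good"; the remaining words are assumed neutral
good_valence = 1.9
compound = round(normalize(good_valence), 4)
print(compound)
```

Under these assumptions the result agrees with the `compound` score printed above for "The book was good.".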


In [17]:
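# DataFrame.append was removed in pandas 2.0, so use pd.concat to add the
# conversation rows to an empty frame that already has the extra columns
new_df = pd.DataFrame(columns = ['Character_Id1', 'Character_Id2', 'Gender1', 
                                 'Gender2', 'List_of_Utterance', 'Movie_Number',
                                 'Movie_Title', 'Name1', 'Name2'])
new_df = pd.concat([new_df, mc1_df], ignore_index = True)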
m1_df = mcm_df.loc[mcm_df['Movie_Number'] == movie_number]

In [18]:
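# Assigning to the row inside iterrows() is not guaranteed to update new_df,
# so fill the extra columns with lookups on the character metadata instead
chars = m1_df.drop_duplicates('Character_Id').set_index('Character_Id')
new_df['Gender1'] = new_df['Character_Id1'].map(chars['Gender'])
new_df['Gender2'] = new_df['Character_Id2'].map(chars['Gender'])
# m1_df only holds rows for the selected movie, so its first title applies to all rows
new_df['Movie_Title'] = m1_df['Movie_Title'].iloc[0]
new_df['Name1'] = new_df['Character_Id1'].map(chars['Character_Name'])
new_df['Name2'] = new_df['Character_Id2'].map(chars['Character_Name'])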

In [19]:
new_df.head()

| | Character_Id1 | Character_Id2 | Gender1 | Gender2 | List_of_Utterance | Movie_Number | Movie_Title | Name1 | Name2 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | u26 | u30 | m | m | ['L3380', 'L3381', 'L3382'] | m2 | 15 minutes | CUTLER | EMIL |
| 1 | u26 | u30 | m | m | ['L3383', 'L3384', 'L3385'] | m2 | 15 minutes | CUTLER | EMIL |
| 2 | u26 | u30 | m | m | ['L3392', 'L3393', 'L3394', 'L3395', 'L3396'] | m2 | 15 minutes | CUTLER | EMIL |
| 3 | u26 | u30 | m | m | ['L3459', 'L3460', 'L3461'] | m2 | 15 minutes | CUTLER | EMIL |
| 4 | u26 | u30 | m | m | ['L3464', 'L3465', 'L3466', 'L3467', 'L3468', ... | m2 | 15 minutes | CUTLER | EMIL |


In [20]:
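# Now label every conversation in the movie with a sentiment type
sent_dict = {}

for index, rows in new_df.iterrows():
    # Pull the quoted line IDs out of the List_of_Utterance string
    line_ids = rows['List_of_Utterance'].split('\'')[1::2]
    for line_id in line_ids:
        line = ml1_df.loc[ml1_df['Line_Id'] == line_id, 'Dialogue'].iloc[0]
        line = re.sub(r'\.\.+', '', str(line))
        s_score = sid.polarity_scores(line)
        # Pick the key with the highest score. Note that 'compound' competes
        # with 'neg', 'neu' and 'pos' here, so it can end up as the label.
        max_sent_type, max_sent_score = max(s_score.items(), key=operator.itemgetter(1))
        # This entry is overwritten on every line, so only the last line of the
        # conversation is kept, tagged with the gender of Character_Id1
        sent_dict[index] = {rows['Gender1']: max_sent_type}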

In [21]:
sent_dict

{0: {'m': 'neu'},
 1: {'m': 'neu'},
 2: {'m': 'neu'},
 3: {'m': 'pos'},
 4: {'m': 'neu'},
 5: {'m': 'neu'},
 6: {'m': 'neu'},
 7: {'m': 'neu'},
 8: {'m': 'neg'},
 9: {'f': 'neu'},
 10: {'f': 'neu'},
 11: {'f': 'neu'},
 12: {'f': 'neu'},
 13: {'f': 'pos'},
 14: {'f': 'neg'},
 15: {'f': 'neu'},
 16: {'f': 'neu'},
 17: {'f': 'neg'},
 18: {'f': 'neu'},
 19: {'f': 'neu'},
 20: {'f': 'neu'},
 21: {'f': 'neu'},
 22: {'f': 'neu'},
 23: {'f': 'neu'},
 24: {'f': 'neu'},
 25: {'f': 'neu'},
 26: {'f': 'neg'},
 27: {'f': 'neu'},
 28: {'m': 'compound'},
 29: {'m': 'neu'},
 30: {'m': 'neu'},
 31: {'m': 'neu'},
 32: {'m': 'neu'},
 33: {'m': 'neu'},
 34: {'m': 'neu'},
 35: {'m': 'pos'},
 36: {'m': 'neu'},
 37: {'m': 'neu'},
 38: {'m': 'pos'},
 39: {'m': 'neu'},
 40: {'m': 'neu'},
 41: {'m': 'pos'},
 42: {'m': 'neu'},
 43: {'m': 'neu'},
 44: {'m': 'neg'},
 45: {'m': 'compound'},
 46: {'m': 'neu'},
 47: {'m': 'neu'},
 48: {'m': 'neu'},
 49: {'m': 'neu'},
 50: {'m': 'neu'},
 51: {'m': 'neu'},
 52: {'m': '
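
The `'compound'` labels in the output above appear because the compound score is compared directly with `neg`, `neu` and `pos`. One possible alternative, sketched here under assumptions, is to label each line from its compound score alone and then take the most common label per conversation. The cut-offs of ±0.05 are a commonly used convention for VADER and are treated as an assumption here; `label_from_compound` and `conversation_label` are hypothetical helper names. The score dictionaries are copied from the example sentences scored earlier.

```python
from collections import Counter


def label_from_compound(scores, threshold=0.05):
    """Map a VADER-style score dict to 'pos', 'neg' or 'neu' via its compound value."""
    compound = scores['compound']
    if compound >= threshold:
        return 'pos'
    if compound <= -threshold:
        return 'neg'
    return 'neu'


def conversation_label(score_dicts):
    """Return the most common per-line label in a conversation."""
    labels = [label_from_compound(scores) for scores in score_dicts]
    return Counter(labels).most_common(1)[0][0]


# Score dicts taken from the example sentences scored earlier
conversation = [
    {'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404},
    {'neg': 0.791, 'neu': 0.209, 'pos': 0.0, 'compound': -0.8211},
    {'neg': 0.719, 'neu': 0.281, 'pos': 0.0, 'compound': -0.8264},
]
label = conversation_label(conversation)
print(label)
```

With this scheme every line contributes to the conversation's label, and `'compound'` can no longer show up as a sentiment type.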

In [22]:
# Split the sentiment labels by the gender stored with each conversation
g_listm = []
genre_listm = []
g_listf = []
genre_listf = []
for key, vals in sent_dict.items():
    # Each value is a one-entry dict of the form {gender: sentiment_label}
    gender, genre = next(iter(vals.items()))
    gender = str(gender)
    if gender.lower() == 'm':
        g_listm.append(gender)
        genre_listm.append(genre)
    else:
        # Anything that is not 'm' or 'M' (including an unknown gender) lands here
        g_listf.append(gender)
        genre_listf.append(genre)

In [23]:
print(g_listf, genre_listf)

['f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f'] ['neu', 'neu', 'neu', 'neu', 'pos', 'neg', 'neu', 'neu', 'neg', 'neu', 'neu', 'neu', 'neu', 'neu', 'neu', 'neu', 'neu', 'neg', 'neu']
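
Because the two groups contain different numbers of conversations, raw counts are hard to compare. A minimal sketch of counting the labels and converting them to shares follows; `label_shares` is a hypothetical helper, and the list is copied from the output above.

```python
from collections import Counter

# Labels copied from the printed genre_listf above
genre_listf = ['neu', 'neu', 'neu', 'neu', 'pos', 'neg', 'neu', 'neu', 'neg', 'neu',
               'neu', 'neu', 'neu', 'neu', 'neu', 'neu', 'neu', 'neg', 'neu']


def label_shares(labels):
    """Return the fraction of the list taken up by each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


counts_f = Counter(genre_listf)
shares_f = label_shares(genre_listf)
print(counts_f)
print(shares_f)
```

The same helper can be applied to `genre_listm` so that both groups are compared on the same scale.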


In [24]:
# Histogram of the sentiment labels, split by gender
trace1 = go.Histogram(
    x=genre_listm,
    name='Male',
    marker=dict(
        color='#867DC6',
    ),
    opacity=0.75
)
trace2 = go.Histogram(
    x=genre_listf,
    name='Female',
    marker=dict(
        color='#3A27C8'
    ),
    opacity=0.75
)
data = [trace1, trace2]
layout = go.Layout(
    title='Sentiments_per_Gender',
    xaxis=dict(
        title='Sentiments'
    ),
    yaxis=dict(
        title='Count'
    ),
    bargap=0.2
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='SpG')



