# Violent movie finder model 

So far, we have cleaned the data and performed sentiment analysis as well as a violent word count analysis. Our objective is to categorize the movies on a scale from non-violent to violent.

Our approach will be to build a model that labels the dataset using various features.


## Labeling the Data
Because supervised training requires labeled data, we manually labeled a subset of the dataset. We split this subset among team members, and each member labeled movie plots on a three-level scale:
<ul>
    <li><b>-1</b> : non-violent movies</li>
    <li><b>0</b> : possibly slightly violent or uncertain</li>
    <li><b>1</b> : definitely violent movies</li>
</ul>
To assess the subjectivity of the labeling process, we had some plots labeled multiple times by external participants.
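One common way to quantify how much two annotators agree is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is only an illustration: the two label lists are hypothetical and are not taken from the project's labeled data, and the notebook does not state which agreement measure was actually used.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators for the same eight plots
# (-1 = non-violent, 0 = uncertain, 1 = violent); not real project data.
annotator_a = [-1, 0, 1, 1, 0, -1, 1, 0]
annotator_b = [-1, 1, 1, 1, 0, 0, 1, 0]

# Share of plots on which both annotators gave the same label
raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)

# Agreement corrected for chance (1 = perfect, 0 = chance level)
kappa = cohen_kappa_score(annotator_a, annotator_b)

print(f"raw agreement: {raw_agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

A kappa well below the raw agreement would suggest that part of the apparent consensus comes from the label distribution rather than from shared judgment.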

## Model
For simplicity, we chose to perform a logistic regression using several selected features.
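With three classes, scikit-learn's `LogisticRegression` with the `lbfgs` solver fits one linear score per class and, in its multinomial formulation (the default in recent versions), turns the scores into probabilities with a softmax. The sketch below reproduces that computation by hand on synthetic data; the data and variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic three-class data (labels -1, 0, 1), for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2))
y = np.repeat([-1, 0, 1], 30)
X[y == -1] -= 2.0
X[y == 1] += 2.0

clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, y)

# One linear score per class: X @ coef_.T + intercept_
scores = X @ clf.coef_.T + clf.intercept_

# Softmax over the class scores (shifted by the row maximum for stability)
scores -= scores.max(axis=1, keepdims=True)
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# The predicted class is the one with the highest probability
predicted = clf.classes_[probs.argmax(axis=1)]
print(predicted[:5], clf.predict(X)[:5])
```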

## Features
The selected feature set includes:
<ul>
    <li><b>Word Count Features</b>
        <ul> 
            <li>Count of physically violent words</li> 
            <li>Count of psychologically violent words</li> 
            <li>Density of physically violent words</li> 
            <li>Density of psychologically violent words</li>
        </ul>
    </li>
    <li><b>Sentiment Analysis Features</b>
       <ul>
           <li>Sadness</li>
           <li>Joy</li>
           <li>Love</li>
           <li>Anger</li>
           <li>Fear</li>
           <li>Surprise</li>
       </ul>
    </li>
</ul>
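Judging from the values displayed when the data is loaded, each density appears to be the violent word count divided by the plot's total word count (for example 3 / 479 ≈ 0.006263). This is an inference from the table, not from the feature-extraction code, so the sketch below should be read as an assumption about how the density columns were built.

```python
import pandas as pd

# Counts copied from three rows of the feature table shown in this notebook
counts = pd.DataFrame(
    {
        "word_count_Physical_violence": [3, 1, 7],
        "word_count_Psychological_violence": [3, 5, 3],
        "total_count": [479, 858, 631],
    },
    index=pd.Index([3217, 3333, 3837], name="Wikipedia movie ID"),
)

# Assumed definition: density = violent word count / total word count
density = pd.DataFrame(index=counts.index)
for col in ["word_count_Physical_violence", "word_count_Psychological_violence"]:
    density[f"density {col}"] = counts[col] / counts["total_count"]

print(density.round(6))
```

Densities make long and short plots comparable, whereas raw counts grow with plot length.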


## Dataset
<ul>
    <li><b>Training and Testing Data</b> <br/> Given the limited number of labeled plots, we use most of them for training and hold out 20% as a validation set to evaluate the model (alternatively, we may train on the entire labeled set and assess labeling quality on the final labeled set).</li>
    <li><b>Final Dataset</b>  <br/> We will apply the model to label the entire dataset and review the quality of the labels.</li>
</ul>
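When few labeled examples are available, a single small hold-out set gives a noisy estimate. One possible alternative, which this notebook does not use, is stratified k-fold cross-validation, so that every labeled plot is used for both training and evaluation. The sketch below runs on synthetic data; the feature matrix and the number of folds are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the labeled set: 120 plots, 4 features, labels -1/0/1
rng = np.random.default_rng(21)
y = np.repeat([-1, 0, 1], 40)
X = rng.normal(size=(120, 4)) + y[:, None]

# Five folds, each keeping the class proportions of the full set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=21)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("fold accuracies:", scores.round(2))
print("mean accuracy:", scores.mean().round(2))
```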


### Imports

In [20]:
import sys
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import math
from sklearn.linear_model import LinearRegression, Ridge, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

%matplotlib inline

In [2]:
# Add the directory containing data_loading.py to sys.path
sys.path.append(os.path.abspath("../data/"))

# Now import the DataLoader class
from data_loading import DataLoader

raw = '../../data/RAW/'
clean = '../../data/CLEAN'

## Load and prepare the data

In [3]:
# Load the feature tables
data_loader = DataLoader(raw, clean)
MovieData, DataTest = data_loader.data_for_violent_model()
MovieData.head()

| Wikipedia movie ID | name | sadness | joy | love | anger | fear | surprise | word_count_Physical_violence | word_count_Psychological_violence | total_count | density word_count_Physical_violence | density word_count_Psychological_violence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3217 | Army of Darkness | 0.011153 | 0.038138 | 0.003922 | 0.548492 | 0.393865 | 0.004429 | 3 | 3 | 479 | 0.006263 | 0.006263 |
| 3333 | The Birth of a Nation | 0.03144 | 0.064514 | 0.068203 | 0.648962 | 0.183545 | 0.003336 | 1 | 5 | 858 | 0.001166 | 0.005828 |
| 3746 | Blade Runner | 0.067504 | 0.08782 | 0.014147 | 0.505497 | 0.320859 | 0.004172 | 3 | 3 | 669 | 0.004484 | 0.004484 |
| 3837 | Blazing Saddles | 0.00516 | 0.013628 | 0.001236 | 0.947847 | 0.030865 | 0.001264 | 7 | 3 | 631 | 0.011094 | 0.004754 |
| 3947 | Blue Velvet | 0.007605 | 0.025192 | 0.002353 | 0.13514 | 0.825802 | 0.003908 | 7 | 7 | 930 | 0.007527 | 0.007527 |


In [4]:
# Load the human-labelled data
ViolentLabel, ViolentData = data_loader.human_labelled_data()
display(ViolentLabel)
display(ViolentData.head())

|  | Unnamed: 0 | Violence level | Label |
|---|---|---|---|
| 0 | 0 | Peaceful | -1 |
| 1 | 1 | Mild | 0 |
| 2 | 2 | Violent | 1 |


| Wikipedia movie ID | Answer |
|---|---|
| 113454 | 0 |
| 909664 | 1 |
| 1028671 | 0 |
| 1336564 | 0 |
| 1472852 | -1 |


In [5]:
len(ViolentData.index.intersection(MovieData.index))

112

In [6]:
FinalSet = MovieData.loc[MovieData.index.difference(ViolentData.index)]
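The two cells above rely on pandas index set operations: `intersection` counts the movies that have both features and a human label, and `difference` keeps the movies that still need a label. A minimal sketch on toy frames (the IDs and names are hypothetical) shows the behavior:

```python
import pandas as pd

# Toy feature table and toy label table, indexed by a hypothetical movie ID
movies = pd.DataFrame(
    {"name": ["A", "B", "C", "D"]},
    index=pd.Index([10, 20, 30, 40], name="Wikipedia movie ID"),
)
labelled = pd.DataFrame(
    {"Answer": [1, -1]},
    index=pd.Index([20, 50], name="Wikipedia movie ID"),
)

# IDs present in both tables
overlap = labelled.index.intersection(movies.index)

# Movies without a human label: candidates for model-based labeling
unlabelled = movies.loc[movies.index.difference(labelled.index)]

print(list(overlap))
print(list(unlabelled.index))
```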

In [45]:
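# Attach the human labels to the feature rows; the inner join keeps only labelled movies.
# Despite its name, TestSet holds the full labelled set, which is split further below.
TestSet = pd.merge(DataTest, ViolentData["Answer"], left_index=True, right_index=True, how="inner")
TestSet = TestSet.drop(["Unnamed: 0"], axis=1)
TestSet.head()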

| Wikipedia movie ID | name | sadness | joy | love | anger | fear | surprise | word_count_Physical_violence | word_count_Psychological_violence | total_count | density word_count_Physical_violence | density word_count_Psychological_violence | Answer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 113454 | Prizzi's Honor | 0.005138 | 0.002473 | 0.000994 | 0.949046 | 0.041555 | 0.000795 | 3 | 0 | 27 | 0.111111 | 0.0 | 0 |
| 909664 | Little Odessa | 0.017475 | 0.037115 | 0.006702 | 0.852454 | 0.081973 | 0.004282 | 1 | 0 | 26 | 0.038462 | 0.0 | 1 |
| 1028671 | Devil's Playground | 0.00695 | 0.963438 | 0.002151 | 0.022351 | 0.004241 | 0.000869 | 0 | 0 | 36 | 0.0 | 0.0 | 0 |
| 1336564 | Backlash | 0.015322 | 0.042601 | 0.001589 | 0.354581 | 0.582766 | 0.003141 | 0 | 0 | 32 | 0.0 | 0.0 | 0 |
| 1472852 | Down and Out in Beverly Hills | 0.89473 | 0.003303 | 0.001348 | 0.014916 | 0.084946 | 0.000758 | 0 | 0 | 49 | 0.0 | 0.0 | -1 |


In [49]:
fraction = 0.2

# Split the data between train and validation
TrainingSet,ValidationSet = train_test_split(TestSet, test_size=fraction, random_state=21)

print(TrainingSet.shape[0])
print(ValidationSet.shape[0])

116
30
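With only 30 validation plots, a purely random split can leave one class under-represented. A possible refinement, not used in the cell above, is to pass `stratify` to `train_test_split` so that both parts keep the class proportions of the labeled set. The label counts below are hypothetical; only the total of 146 rows matches the split sizes printed above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical label distribution over 146 labelled plots
labels = np.array([-1] * 60 + [0] * 50 + [1] * 36)
labelled = pd.DataFrame({"Answer": labels})

# Same fraction and seed as the notebook, plus stratification on the label
train, val = train_test_split(
    labelled, test_size=0.2, random_state=21, stratify=labelled["Answer"]
)

# Compare class shares in the full set and in the validation set
overall = labelled["Answer"].value_counts(normalize=True).sort_index()
val_share = val["Answer"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "validation": val_share}).round(3))
```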


## Regression

In [47]:
# Alternative regressors, left disabled:
#model = Ridge(alpha=6)
#model.fit(TestSet.drop(["Answer","name"], axis=1), TestSet["Answer"])

#model = LinearRegression()  # create the model
#model.fit(TestSet.drop(["Answer","name"], axis=1), TestSet["Answer"])

# Multiclass logistic regression on every column except the label and the title
model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(TrainingSet.drop(["Answer","name"], axis=1), TrainingSet["Answer"])
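The features live on very different scales (densities near 0.01, `total_count` in the hundreds), and the regularization in `LogisticRegression` treats all coefficients alike. One option worth trying, which is a suggestion rather than something this notebook does, is to standardize the features inside a pipeline. The data below is synthetic and only mimics the scale difference.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustration only)
rng = np.random.default_rng(21)
n = 150
y = rng.integers(-1, 2, size=n)                          # labels in {-1, 0, 1}
density = rng.uniform(0, 0.02, size=n) + 0.01 * (y + 1)  # small values
total = rng.integers(20, 1000, size=n).astype(float)     # large values
X = np.column_stack([density, total])

# The scaler is fitted on the training data only, then applied before the model
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

# After scaling, every column has mean 0 and standard deviation 1
scaled = pipe.named_steps["standardscaler"].transform(X)
print(scaled.mean(axis=0).round(3), scaled.std(axis=0).round(3))
print("training accuracy:", round(pipe.score(X, y), 3))
```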

In [50]:
print("coefficient", model.coef_)
print("intercept", model.intercept_)
# Mean accuracy on the training set (not a held-out estimate)
model.score(TrainingSet.drop(["Answer","name"], axis=1), TrainingSet["Answer"])

coefficient [[ 5.64129112e-01  7.23659105e-01  1.21790253e-01 -6.70672598e-01
  -1.02307520e+00  2.91186329e-01 -8.32269438e-01 -4.11081008e-01
   5.21860148e-06 -7.09417625e-03  2.97816941e-02]
 [-8.72993341e-01 -1.31309803e-01 -2.58044809e-02  6.20563186e-01
   5.65982896e-01 -1.61255438e-01  2.41424556e-01  3.64082432e-01
   4.87264284e-03 -1.56438953e-02  7.82816028e-02]
 [ 3.08864229e-01 -5.92349302e-01 -9.59857726e-02  5.01094127e-02
   4.57092301e-01 -1.29930891e-01  5.90844882e-01  4.69985765e-02
  -4.87786144e-03  2.27380716e-02 -1.08063297e-01]]
intercept [ 0.81655712 -0.82673275  0.01017563]


0.5431034482758621
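The raw coefficient array is hard to read. Each row of `coef_` corresponds to one entry of `model.classes_` (here -1, 0, 1) and each column to one training feature, in the order of the columns passed to `fit`. A small sketch on synthetic data shows how to label the array; the three feature names are borrowed from the notebook, but the values are random.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data with a few of the notebook's feature names (values are random)
feature_names = ["anger", "joy", "word_count_Physical_violence"]
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(60, 3)), columns=feature_names)
y = np.repeat([-1, 0, 1], 20)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Rows follow clf.classes_, columns follow the training columns
coef_table = pd.DataFrame(clf.coef_, index=clf.classes_, columns=X.columns)
coef_table.index.name = "class"
print(coef_table.round(3))
```

In the notebook itself, the same idea would apply to `model.coef_` with the columns of `TrainingSet.drop(["Answer","name"], axis=1)`.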

In [59]:
# Put the validation-set predictions next to the human labels
features = ValidationSet.drop(["Answer", "name"], axis=1)
Compare = pd.DataFrame(model.predict(features), index=ValidationSet.index, columns=["Prediction"])
Compare["Label"] = ValidationSet["Answer"]

In [60]:
# Share of validation plots whose predicted label matches the human label
accuracy = metrics.accuracy_score(Compare["Label"], Compare["Prediction"])
print("accuracy", accuracy*100)

# Signed error between label and prediction: 0, ±1 (adjacent class) or ±2 (opposite class)
error = Compare["Label"] - Compare["Prediction"]

# Mean absolute distance between label and prediction
m1 = error.abs().mean()
print("abs distance", m1)

# Mean squared distance: penalizes opposite predictions (-1 vs 1) more heavily
m2 = np.power(error, 2).mean()
print("pow distance", m2)

print("Correct label", (error == 0).sum())
print("incorrect but close", (error.abs() == 1).sum())
print("opposite", (error.abs() == 2).sum())

accuracy 33.33333333333333
abs distance 0.8666666666666667
pow distance 1.2666666666666666
Correct label 10
incorrect but close 14
opposite 6
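The three counts fully determine the three summary metrics, which gives a quick consistency check: an adjacent-class error contributes 1 to both distances, while an opposite-class error contributes 2 to the absolute distance and 4 to the squared one. The first part of the sketch redoes that arithmetic from the printed counts. The second part shows a confusion matrix, which would reveal which classes get confused; its labels are hypothetical because the per-class breakdown is not shown in the output above.

```python
from sklearn.metrics import confusion_matrix

# Counts printed above for the 30 validation plots
n_correct, n_close, n_opposite = 10, 14, 6
n = n_correct + n_close + n_opposite

accuracy = n_correct / n
abs_distance = (n_close * 1 + n_opposite * 2) / n
pow_distance = (n_close * 1 + n_opposite * 4) / n
print(accuracy, abs_distance, pow_distance)

# Hypothetical labels and predictions, only to show the confusion-matrix layout
y_true = [-1, -1, 0, 0, 1, 1]
y_pred = [-1, 0, 0, 1, 1, -1]

# Rows are true classes, columns are predicted classes, both ordered -1, 0, 1
cm = confusion_matrix(y_true, y_pred, labels=[-1, 0, 1])
print(cm)
```

An accuracy of one third on three classes is close to what random guessing would give on balanced classes, so comparing against a majority-class baseline would help put the result in context.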
