Assignment 4 - Socioeconomic analysis
===

*Due: January 17, 2023*

You will analyze a set of movie dialogues and compute a sentiment score for each one (using VADER, TextBlob, or Naive Bayes).

Then you will analyze how the positivity (or negativity) of the sentiment expressed in movie dialogues changes over time and how it is affected by socioeconomic and historical events.

In particular, you will use the provided R scripts to test how sentiment scores in movies are affected by GDP per capita, life expectancy, and the political cycle (Republican or Democratic president).
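
The political cycle is a categorical covariate, so each movie year has to be mapped to the party holding the presidency. The provided R scripts may already handle this; the block below is only a minimal sketch of one possible encoding. The `TERMS` table and the `president_party` helper are illustrative names, and the rule that an inauguration year counts entirely for the incoming president is a simplifying assumption.

```python
from bisect import bisect_right

# (first calendar year of a run of same-party terms, party).
# Assumption: the inauguration year is assigned to the incoming president.
TERMS = [
    (1977, "Democrat"),    # Carter
    (1981, "Republican"),  # Reagan, G. H. W. Bush
    (1993, "Democrat"),    # Clinton
    (2001, "Republican"),  # G. W. Bush
    (2009, "Democrat"),    # Obama
    (2017, "Republican"),  # Trump
    (2021, "Democrat"),    # Biden
]
STARTS = [start for start, _ in TERMS]


def president_party(year):
    """Return the party of the US president for a given calendar year."""
    if year < STARTS[0]:
        raise ValueError(f"year {year} is before the first term in TERMS")
    # bisect_right finds the last run whose start year is <= year.
    return TERMS[bisect_right(STARTS, year) - 1][1]


print(president_party(1985))  # Republican
print(president_party(1997))  # Democrat
```

Movies released before 1977 would need additional rows in the table.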

## Import

In [13]:
import os
import pandas as pd
import regex as re
import contractions
import nltk
from nltk.corpus import stopwords
from textblob import TextBlob
from glob import glob
from nltk.stem import WordNetLemmatizer


lemma = WordNetLemmatizer()

stop_words = set(stopwords.words("english"))
lang = "eng"
data_path = "data/dialogs_preprocessed2/"

## Functions

### Functions for text preprocessing

In [9]:
def clean_url(input):
    output = re.sub(r"http\S+", "", input)
    return output


def fix_contraction(input):
    output = contractions.fix(input)
    return output


def clean_non_alphanumeric(input):
    output = re.sub(r"[^a-zA-Z0-9]", " ", input)
    return output


def clean_tokenization(input):
    output = nltk.word_tokenize(input)
    return output


def clean_stopwords(input):
    output = [item for item in input if item not in stop_words]
    return output


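import inflect

# Assumption: `p` is an inflect engine; its definition is missing from this notebook.
p = inflect.engine()


def numbers_to_words(input):
    output = []
    for item in input:
        if item.isnumeric():
            output.append(p.number_to_words(item))
        else:
            output.append(item)
    return output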


def clean_lowercase(input):
    output = str(input).lower()
    return output


def clean_lemmatization(input):
    output = [lemma.lemmatize(word=w, pos="v") for w in input]
    return output


def clean_length(input):
    output = [word for word in input if len(word) > 2]
    return output


def convert_to_string(input):
    output = " ".join(input)
    return output


def preprocessing(text, remove_stopwords=True):
    """
    Preprocessing pipeline.
    """
    text = clean_url(text)
    text = fix_contraction(text)
    text = clean_non_alphanumeric(text)
    text = clean_lowercase(text)
    text = clean_tokenization(text)
    text = numbers_to_words(text)
    if remove_stopwords:
        text = clean_stopwords(text)
    text = clean_lemmatization(text)
    text = clean_length(text)
    text = convert_to_string(text)
    return text
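
To see what the pipeline does to a line of dialogue without installing `nltk` or `contractions`, the regex-based steps can be reproduced on their own. This is a simplified sketch, not the pipeline above: `simple_preprocess` is a hypothetical helper, whitespace splitting stands in for `nltk.word_tokenize`, and contraction fixing, stop-word removal, number spelling, and lemmatization are skipped.

```python
import re


def simple_preprocess(text):
    """Apply only the regex, lowercase, and length steps of the pipeline."""
    text = re.sub(r"http\S+", "", text)        # drop URLs
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)  # non-alphanumerics become spaces
    tokens = text.lower().split()              # stand-in for nltk.word_tokenize
    tokens = [t for t in tokens if len(t) > 2] # same length filter as clean_length
    return " ".join(tokens)


sample = "See http://example.com NOW! It's 5 o'clock, Danny."
print(simple_preprocess(sample))  # see now clock danny
```

Note how the apostrophes split `It's` and `o'clock` into fragments that the length filter then discards, which is why the full pipeline expands contractions before stripping punctuation.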

## Load movie dialogs and preprocess them

In [14]:
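# Build one row per dialogue file; keep the full path and derive the movie name from it.
df = pd.DataFrame(glob(os.path.join(data_path, "*.txt")), columns=["movie"])
df["path"] = df["movie"]
df["movie"] = (
    df["movie"]
    .str.replace("^" + re.escape(data_path), "", regex=True)
    .str.replace(r"_dialog\.txt$", "", regex=True)
)


def read_text(path):
    # Close the file handle explicitly instead of relying on garbage collection.
    with open(path, encoding="utf-8") as f:
        return f.read()


df["text"] = df["path"].apply(read_text)
df["text"] = df["text"].apply(preprocessing)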

df.head()

|   | movie | path | text |
|---|-------|------|------|
| 0 | donniebrasco | data/dialogs_preprocessed2/donniebrasco_dialog... | paul attanasio base book donnie brasco joseph ... |
| 1 | shiningthe | data/dialogs_preprocessed2/shiningthe_dialog.txt | post production script july get appointment ul... |
| 2 | idesofmarchthe | data/dialogs_preprocessed2/idesofmarchthe_dial... | write george clooney grant heslov beau willimo... |
| 3 | hangoverthe | data/dialogs_preprocessed2/hangoverthe_dialog.txt | write jon lucas scott moore september word dou... |
| 4 | bringingoutthedead | data/dialogs_preprocessed2/bringingoutthedead_... | first draft paul schrader novel joseph connell... |
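
The movie name is derived from the file path by stripping the directory and the `_dialog.txt` suffix. As an alternative to the two regex replacements in the loading cell, the same result can be obtained with `pathlib`. This is a minimal sketch; `movie_name` and `SUFFIX` are illustrative names, not part of the assignment code.

```python
from pathlib import PurePosixPath

SUFFIX = "_dialog.txt"


def movie_name(path):
    """Return the file name without its directory and '_dialog.txt' suffix."""
    name = PurePosixPath(path).name
    if name.endswith(SUFFIX):
        name = name[: -len(SUFFIX)]
    return name


print(movie_name("data/dialogs_preprocessed2/shiningthe_dialog.txt"))  # shiningthe
```

It could be applied to the `path` column with `df["path"].apply(movie_name)`.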


## Add sentiment scores

In [17]:
df["textblob"] = df.text.apply(lambda x: TextBlob(x).sentiment.polarity)

df.head()

|   | movie | path | text | textblob |
|---|-------|------|------|----------|
| 0 | donniebrasco | data/dialogs_preprocessed2/donniebrasco_dialog... | paul attanasio base book donnie brasco joseph ... | -0.011833 |
| 1 | shiningthe | data/dialogs_preprocessed2/shiningthe_dialog.txt | post production script july get appointment ul... | 0.158831 |
| 2 | idesofmarchthe | data/dialogs_preprocessed2/idesofmarchthe_dial... | write george clooney grant heslov beau willimo... | 0.097131 |
| 3 | hangoverthe | data/dialogs_preprocessed2/hangoverthe_dialog.txt | write jon lucas scott moore september word dou... | 0.15658 |
| 4 | bringingoutthedead | data/dialogs_preprocessed2/bringingoutthedead_... | first draft paul schrader novel joseph connell... | 0.013822 |
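
TextBlob's polarity is a float in the range [-1.0, 1.0], where negative values indicate negative sentiment and positive values indicate positive sentiment. To read the scores above as positive, negative, or neutral, one option is a simple cutoff. The sketch below assumes a threshold of 0.05, which is an arbitrary choice for illustration and not a TextBlob convention; `label_polarity` is a hypothetical helper.

```python
def label_polarity(score, threshold=0.05):
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"polarity out of range: {score}")
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Polarity scores copied from the output above.
scores = {
    "donniebrasco": -0.011833,
    "shiningthe": 0.158831,
    "idesofmarchthe": 0.097131,
    "hangoverthe": 0.15658,
    "bringingoutthedead": 0.013822,
}

labels = {movie: label_polarity(score) for movie, score in scores.items()}
print(labels)
```

With this cutoff, `donniebrasco` and `bringingoutthedead` come out as neutral and the other three as positive.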


## Time series

<div class="alert alert-warning">This step requires the release year of each movie, which is not part of the data loaded above.</div>

In [355]:
#
#
#
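
Once release years are available, a first time series can be built by grouping the scores by period and averaging them. The block below is a minimal sketch using only the five movies shown earlier. The release years were added by hand for illustration and are not part of the dataset, and grouping by decade is an assumption made because five movies are too few for a yearly series.

```python
from collections import defaultdict
from statistics import mean

# TextBlob polarity scores copied from the output above.
scores = {
    "donniebrasco": -0.011833,
    "shiningthe": 0.158831,
    "idesofmarchthe": 0.097131,
    "hangoverthe": 0.15658,
    "bringingoutthedead": 0.013822,
}

# Release years added by hand; the dataset itself does not provide them.
years = {
    "donniebrasco": 1997,
    "shiningthe": 1980,
    "idesofmarchthe": 2011,
    "hangoverthe": 2009,
    "bringingoutthedead": 1999,
}

# Collect the scores of all movies released in the same decade.
by_decade = defaultdict(list)
for movie, score in scores.items():
    decade = years[movie] // 10 * 10
    by_decade[decade].append(score)

# Mean polarity per decade, in chronological order.
series = {decade: mean(vals) for decade, vals in sorted(by_decade.items())}

for decade, value in series.items():
    print(decade, round(value, 4))
```

With the full corpus, the same grouping could be done per year and the resulting table exported for the R scripts.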