# BART Mad Men Analysis

Work done before this script:
  * Blu Ray copied from disc to file
  * Bitmap subtitles are extracted with ffmpeg
  * On Linux machine, pgsrip (using Tesseract OCR) is used to extract text subtitles into .srt format

This notebook:
  * Iterate over directory of .srt files
  * For each .srt file:
    * Get the name of the file (this has the season and episode number)
    * Use `clean_up_text()` to get a list of lines from the episode
    * Load model to GPU
    * Create hypothesis statements
    * For each line in the episode, get probability for positive and negative
    * Count times positive was greater than negative
    * Return proportion of positive lines
  * Web scrape the episode ratings for all seasons
  * Assemble season/episode name, positive proportion, and rating to df

Get paths for all the subtitle files. Save it to a list: `script_file_list`.

In [1]:
import os

scripts_path = '/home/cheetah/Documents/UWSP/DS 745/Scripts/Scripts - srt/'

script_files = os.listdir(scripts_path)

script_file_list = [f'{scripts_path}{script}' for script in script_files]

Create the functions to extract text from .srt files.

This also has some logic to re-assemble sentences that have been split and are displayed on separate lines when shown as captions on the video.

The extraction and re-assembly of sentences code is closely based off the code here (with some small modifications):

https://www.webucator.com/article/simple-python-script-for-extracting-text-from-an-s/

In [2]:
import re, sys

def is_time_stamp(l):
  if l[:2].isnumeric() and l[2] == ':':
    return True
  return False

def has_letters(line):
  if re.search('[a-zA-Z]', line):
    return True
  return False

def has_no_text(line):
  l = line.strip()
  if not len(l):
    return True
  if l.isnumeric():
    return True
  if is_time_stamp(l):
    return True
  if l[0] == '(' and l[-1] == ')':
    return True
  if not has_letters(line):
    return True
  return False

def is_lowercase_letter_or_comma(letter):
  if letter.isalpha() and letter.lower() == letter:
    return True
  if letter == ',':
    return True
  return False

def clean_up(lines):
  """
  Get rid of all non-text lines and
  try to combine text broken into multiple lines
  """
  new_lines = []
  for line in lines[1:]:
    if has_no_text(line):
      continue
    elif len(new_lines) and is_lowercase_letter_or_comma(line[0]):
      #combine with previous line
      new_lines[-1] = new_lines[-1].strip() + ' ' + line
    else:
      #append line
      new_lines.append(line)
  return new_lines

def clean_up_text(input_file, file_encoding='utf-8'):
  """
    args[1]: file name
    args[2]: encoding. Default: utf-8.
      - If you get a lot of [?]s replacing characters,
      - you probably need to change file_encoding to 'cp1252'
  """
  file_name = input_file
  with open(file_name, encoding=file_encoding, errors='replace') as f:
    lines = f.readlines()
    new_lines = clean_up(lines)
  #with open(new_file_name, 'w') as f:
  #  for line in new_lines:
  #    f.write(line)

  new_lines = [re.sub(r'\n', '', line) for line in new_lines]

  return new_lines

Sets up the bart model.
  * Use GPU 1. This is the second Nvidia RTX 3090 in the system. It's used because GPU 0 drives the monitors
  * Create the positive and negative hypothesis strings. BART outputs the probability that these match the text string inputted.
  * The model is loaded and sent to the GPU

The script then iterates over each script in the script list.
  * Get the episode name in sXXeXX format
  * Use the `clean_up_text` function defined earlier to extract the text from the .srt script
  * Take each sentence of the script
    * Tokenize the text, sending the tokenized text to the GPU
    * Get the probability output
    * Append the probabilities to the list

In [3]:
%%time
import re
from tqdm import tqdm

re_episode = re.compile(r'MMs\d{2}e\d{2}')

episode_names = []
positive_prob = []
negative_prob = []

import torch
device = "cuda:1" #GPU that isn't outputting video

hypothesis_positive = f'Positive'#This sentence is Positive'
hypothesis_negative = f'Negative'#This sentence is Negative'

from transformers import AutoModelForSequenceClassification, AutoTokenizer
nli_model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli').to(device)
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')

for script in tqdm(script_file_list):
    # get file name
    current_episode = re_episode.search(script).group(0)
    #print(f'\n\n*********Current Episode: {current_episode}***********')
    #episode_names.append(current_episode)
    
    # clean the episode text
    clean_text_list = clean_up_text(script)
    
    for text in clean_text_list:
        # Re-write it so that the larger probability statement is saved to a list
        episode_names.append(current_episode)
        x = tokenizer.encode(text, hypothesis_positive, return_tensors='pt', truncation=True)
        logits = nli_model(x.to(device))[0]

        entail_contradiction_logits = logits[:,[0,2]]
        probs = entail_contradiction_logits.softmax(dim=1)
        prob_label_is_positive = probs[:,1]

        x = tokenizer.encode(text, hypothesis_negative, return_tensors='pt', truncation=True)
        logits = nli_model(x.to(device))[0]

        entail_contradiction_logits = logits[:,[0,2]]
        probs = entail_contradiction_logits.softmax(dim=1)
        prob_label_is_negative = probs[:,1]

        positive_prob.append(prob_label_is_positive.item())
        negative_prob.append(prob_label_is_negative.item())

100%|███████████████████████████████████████████| 89/89 [40:36<00:00, 27.38s/it]

CPU times: user 40min 41s, sys: 2.43 s, total: 40min 43s
Wall time: 40min 43s





Create a pandas DataFrame from the output lists

Create a new column of the DataFrame that has Positive and Negative values, based on which probability is higher

Output a summary of Positive and Negative sentences for each episode.

In [4]:
import pandas as pd
import numpy as np
output = pd.DataFrame({'episode_name': episode_names, 'positive_prob': positive_prob, 'negative_prob': negative_prob})
output['sentiment'] = np.where(output['positive_prob'] > output['negative_prob'], 'Positive', 'Negative')
output.groupby('episode_name')[['episode_name', 'sentiment']].value_counts()

episode_name  sentiment
MMs01e01      Negative     404
              Positive     337
MMs01e02      Negative     412
              Positive     268
MMs01e03      Positive     341
                          ... 
MMs07e12      Positive     357
MMs07e13      Negative     470
              Positive     320
MMs07e14      Negative     475
              Positive     303
Length: 178, dtype: int64

Output the data to .csv for analysis in Tableau or other software

In [5]:
output.to_csv("mad_men_text_analysis_output.csv", index=False)

## Web Scraping


Define a function that will get the IMDB information for every episode given a title_id and season number

In [6]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

def get_imdb_episode_data(title_id, season):    
    base_url = 'https://www.imdb.com/title/{}/episodes?season={}'.format(title_id, season)
    
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    odd_titles = soup.select('#styleguide-v2.fixed div#wrapper div#root.redesign div#pagecontent.pagecontent div#content-2-wide.redesign div#main div.article.listo.list div#episodes_content.header div.clear div.list.detail.eplist div.list_item.odd div.info strong a')
    even_titles = soup.select('#styleguide-v2.fixed div#wrapper div#root.redesign div#pagecontent.pagecontent div#content-2-wide.redesign div#main div.article.listo.list div#episodes_content.header div.clear div.list.detail.eplist div.list_item.even div.info strong a')

    odd_ratings = soup.select('#styleguide-v2.fixed div#wrapper div#root.redesign div#pagecontent.pagecontent div#content-2-wide.redesign div#main div.article.listo.list div#episodes_content.header div.clear div.list.detail.eplist div.list_item.odd div.info div.ipl-rating-widget div.ipl-rating-star.small span.ipl-rating-star__rating')
    even_ratings = soup.select('#styleguide-v2.fixed div#wrapper div#root.redesign div#pagecontent.pagecontent div#content-2-wide.redesign div#main div.article.listo.list div#episodes_content.header div.clear div.list.detail.eplist div.list_item.even div.info div.ipl-rating-widget div.ipl-rating-star.small span.ipl-rating-star__rating')

    odd_airdate = soup.select('#styleguide-v2.fixed div#wrapper div#root.redesign div#pagecontent.pagecontent div#content-2-wide.redesign div#main div.article.listo.list div#episodes_content.header div.clear div.list.detail.eplist div.list_item.odd div.info div.airdate')
    even_airdate = soup.select('#styleguide-v2.fixed div#wrapper div#root.redesign div#pagecontent.pagecontent div#content-2-wide.redesign div#main div.article.listo.list div#episodes_content.header div.clear div.list.detail.eplist div.list_item.even div.info div.airdate')

    odd_descriptions = soup.select('#styleguide-v2.fixed div#wrapper div#root.redesign div#pagecontent.pagecontent div#content-2-wide.redesign div#main div.article.listo.list div#episodes_content.header div.clear div.list.detail.eplist div.list_item.odd div.info div.item_description')
    even_descriptions = soup.select('#styleguide-v2.fixed div#wrapper div#root.redesign div#pagecontent.pagecontent div#content-2-wide.redesign div#main div.article.listo.list div#episodes_content.header div.clear div.list.detail.eplist div.list_item.even div.info div.item_description')

    odd_episode_numbers = soup.select('#styleguide-v2.fixed div#wrapper div#root.redesign div#pagecontent.pagecontent div#content-2-wide.redesign div#main div.article.listo.list div#episodes_content.header div.clear div.list.detail.eplist div.list_item.odd div.image a div.hover-over-image.zero-z-index div')
    even_episode_numbers = soup.select('#styleguide-v2.fixed div#wrapper div#root.redesign div#pagecontent.pagecontent div#content-2-wide.redesign div#main div.article.listo.list div#episodes_content.header div.clear div.list.detail.eplist div.list_item.even div.image a div.hover-over-image.zero-z-index div')
    
    odd_titles = [title.text for title in odd_titles]
    even_titles = [title.text for title in even_titles]

    odd_ratings = [rating.text for rating in odd_ratings]
    even_ratings = [rating.text for rating in even_ratings]

    odd_airdate = [airdate.text for airdate in odd_airdate]
    even_airdate = [airdate.text for airdate in even_airdate]

    odd_descriptions = [description.text for description in odd_descriptions]
    even_descriptions = [description.text for description in even_descriptions]

    odd_episode_numbers = [episode_number.text for episode_number in odd_episode_numbers]
    even_episode_numbers = [episode_number.text for episode_number in even_episode_numbers]
    
    episodes = list(zip(odd_titles, odd_ratings, odd_airdate, odd_descriptions, odd_episode_numbers))
    episodes.extend(list(zip(even_titles, even_ratings, even_airdate, even_descriptions, even_episode_numbers)))
    
    # Make DF: episodes_df
    episodes_df = pd.DataFrame(episodes, columns=['episode_name', 'rating', 'air_date', 'description', 'episode_number'])
    
    # Clean up new line characters
    episodes_df['air_date'] = episodes_df['air_date'].str.replace(r'\n', '', regex=True)
    episodes_df['description'] = episodes_df['description'].str.replace(r'\n', '', regex=True)
    
    # return episodes_df
    return episodes_df

The episode numbers are corrected by hand after the output file is saved.

In [7]:
mad_men_list = []

import pandas as pd

from tqdm import tqdm
import time

for i in tqdm(range(7)):
    mad_men_list.append(get_imdb_episode_data('tt0804503', i))
    time.sleep(5)

mad_men_data = pd.concat(mad_men_list)

mad_men_data.to_csv("IMDB_ratings.csv", index=False)

mad_men_data

100%|█████████████████████████████████████████████| 7/7 [00:43<00:00,  6.22s/it]


Unnamed: 0,episode_name,rating,air_date,description,episode_number
0,Time Zones,7.9,13 Apr. 2014,Don is on the outside looking in on the forced...,"S7, Ep1"
1,Field Trip,8.5,27 Apr. 2014,After a fight with Megan during a surprise tri...,"S7, Ep3"
2,The Runaways,8.5,11 May 2014,While Lou becomes the object of ridicule for t...,"S7, Ep5"
3,Waterloo,9.5,25 May 2014,"Don and Peggy prepare to pitch to Burger Chef,...","S7, Ep7"
4,New Business,7.8,12 Apr. 2015,While Megan comes to New York with her mother ...,"S7, Ep9"
...,...,...,...,...,...
8,To Have and to Hold,8.0,21 Apr. 2013,Don works in secret on a Heinz ketchup campaig...,"S6, Ep4"
9,For Immediate Release,8.8,5 May 2013,"As the firm prepares to go public, Don and Pet...","S6, Ep6"
10,The Crash,8.7,19 May 2013,"The creative department has a wild, drug-influ...","S6, Ep8"
11,A Tale of Two Cities,8.1,2 Jun. 2013,Cutler and Chaough prepare to make radical cha...,"S6, Ep10"
