# **ISM Captioning Coordinator: Script to PowerPoint**


Welcome to the IPython Notebook that turns a script into a PowerPoint! I will go over in detail what each part of the code does, so that whoever is reading this will know exactly when to run certain parts of the code and when it is not necessary. I will also do my best to make this Notebook as non-programmer friendly as possible. Feel free to use this Notebook as a reference for future script conversions!

There are five main sections in this Notebook:

> **Section 1:** This section basically sets up the rest of the Notebook so that code can run. Make sure you run this first before running any of the other sections!

> **Section 2:** This section will take a PDF of images of the script and convert them into actual images that we can use in Section 3. You do not need to run this section if you are given an actual script with highlightable words. You also do not need to run this section if you have already run it before.

> **Section 3:** This section takes the images from Section 2 and converts them into highlightable text in .txt files (one .txt file for each image). Again, you do not need to run this section if you are given an actual script with highlightable words, but if you ran Section 2, you MUST run this section as well. Also, you do not need to run this section if you have already run it before.

> **Section 4:** This section is about cleaning up after Section 3. Once all of the .txt files created in Section 3 have been corrected and finalized, run this section to combine them into one giant .txt file. This section should only be run if you ran Sections 2 and 3. You may want to run it again if there are changes and revisions to the script; otherwise, just make the changes manually.

> **Section 5:** This section will take the giant .txt file containing the entire script and create a PowerPoint from it. This section can be run directly if you already have a giant .txt file of the entire script in a parsable format.

To summarize when to go through what sections:

> If you are given a PDF of images of the script, you will need to go through all of the sections.

> If you are given a PDF of highlightable text of the script, then simply copy and paste the text directly into a .txt file, then go through Section 1 and 5. (You could also go through all of the sections if you'd like just to separate each page into a .txt file for organization/revision purposes; however, this is probably not the move and you're probably better off just manually making revisions in the overarching .txt file.) 

> If you are given a .txt file of the entire script in a parsable format, then you only need to go through Section 1 and 5.
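The three cases above boil down to a small lookup. Here is a minimal sketch of it; the function name `sections_to_run` and the labels for each kind of script are made up for illustration and are not part of the Notebook's code.

```python
# Hypothetical helper: maps the kind of script you were given to the sections to run.
def sections_to_run(script_kind):
  if script_kind == "pdf_of_images":
    return [1, 2, 3, 4, 5]
  if script_kind in ("pdf_of_text", "parsable_txt"):
    # For a PDF with highlightable text, paste the text into a .txt file first.
    return [1, 5]
  raise ValueError("Unknown script kind: " + script_kind)

print(sections_to_run("pdf_of_images"))
```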

If you are confused as to whether or not you should go through a section, just look at the note that I have made at the bottom of each section.

Finally, there may be some things in the code that you will need to change according to your circumstances. I have left comments in the code that tell you what you need to change, and each section also tells you which variables may need to be adjusted to get the desired outcome.

**NOTE:** I cannot guarantee that this will work in Jupyter Notebook, as I have only run this code using Google Colab. However, the general process should be the same.

### **Section 1: Mounting the Drive + Installs and Imports**

This deals with the next two code boxes. If you are running this code in Google Colab, be sure to mount to the drive (first code box), then run the installs and imports (second code box). If you are running this in Jupyter Notebook, feel free to ignore the first code box.

The main variable you will most likely have to adjust is the ***datadir*** variable.

The ***datadir*** variable: 
> This is the path to the folder that contains all of the resources that you are working with (i.e. PDFs, images, .txt files, etc.).

On Google Colab, the second code box may tell you to restart the runtime; just click the button that says "RESTART RUNTIME", then run the code box again.

**NOTE:** This is a required section. DO NOT SKIP THIS!
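If later sections fail with "file not found" errors, the usual suspect is the ***datadir*** path. Here is a quick sanity check, as a sketch: the helper name `check_datadir` is hypothetical, and it only uses Python's built-in `os` module.

```python
import os

# Hypothetical helper: returns True if the folder exists, and prints a hint either way.
def check_datadir(path):
  if os.path.isdir(path):
    print("Found folder:", path)
    return True
  print("Could not find folder:", path, "(check the path, and that the Drive is mounted)")
  return False

# Example call (this is the path used in this Notebook; change it to yours).
check_datadir("/content/drive/MyDrive/Grease/resources/")
```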

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Installs
!sudo apt install tesseract-ocr
!pip install pytesseract
!pip install PyMuPDF Pillow
!pip install python-pptx
!pip install python-docx

# Imports
import cv2
import pytesseract
import fitz
import io
import matplotlib.pyplot as plt
import numpy as np

from PIL import Image
from pptx import Presentation
from pptx.dml.color import RGBColor
from pptx.enum.text import PP_ALIGN
from pptx.util import Inches, Pt

# Data Directory (NOTE: Change this if needed)
datadir = "/content/drive/MyDrive/Grease/resources/"

Reading package lists... Done
Building dependency tree       
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2build2).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### **Section 2: Converting PDF to Images**

This next code box takes a PDF of the script and converts all of the pages to images, then stores them in a folder. There are two key variables that you may have to adjust: ***pdf_name*** and ***im_folder***.

The ***pdf_name*** variable:

> This is the name of the PDF. All you need to do is change "original_script.pdf" to the name of your script's PDF.

The ***im_folder*** variable:

> This is the name of the folder that you want to store all the images into. To make things easier, you should create this folder inside of the "./resources" folder. Again, change "pages" according to the name of your folder.

**NOTE:** This section should only be run if the given script is a PDF of images of the script. Theoretically, you could also run this if you are given a PDF of the script with highlightable text, but at that point, it would be easier to just copy and paste the text into a .txt file manually. However, if you would like to create a separate .txt file for each page in case of future script edits, you can run this section as well as the next section. Remember that you only need to run this part once!
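The code box below saves its images into ***im_folder***, so that folder has to exist first. If you would rather not create it by hand, something like this sketch should do the job; `ensure_folder` is a made-up helper name, and `os.makedirs` comes from Python's standard library.

```python
import os

# Hypothetical helper: creates the folder (and any missing parent folders) if needed.
def ensure_folder(parent_dir, folder_name):
  folder_path = os.path.join(parent_dir, folder_name)
  os.makedirs(folder_path, exist_ok=True)  # does nothing if the folder is already there
  return folder_path

# Example usage in this Notebook (uncomment after running Section 1):
# ensure_folder(datadir, im_folder)
```

The same helper would work for the .txt folders used in later sections.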

In [None]:
# (NOTE: Adjust these two variables accordingly)
pdf_name = "original_script.pdf"
im_folder = "pages"

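# This code converts a .pdf of the script into images of every page; all images are saved into a folder.
# SOURCE: https://pymupdf.readthedocs.io/en/latest/recipes-images.html
# NOTE: The im_folder folder must already exist inside datadir.
zoom_x = 2.0  # horizontal zoom factor (2.0 doubles the resolution)
zoom_y = 2.0  # vertical zoom factor
mat = fitz.Matrix(zoom_x, zoom_y)

pdf_file = fitz.open(datadir + pdf_name)
for page in pdf_file:
  pix = page.get_pixmap(matrix=mat)
  # page.number starts at 0, so the first page of the PDF is saved as page-0.png
  pix.save(datadir + im_folder + "/page-%i.png" % page.number)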

### **Section 3: Converting Images into .txt Files**

This next code box takes all of the images that you created in Section 2 and converts them into .txt files. Well, not *all* of the images. 

There are three key variables that you will have to change here: ***txt_folder***, ***first_page***, and ***last_page***.

The ***txt_folder*** variable:

> This is the name of the folder where you want to store the .txt files. You should make this folder inside of the "./resources" folder as well.

The ***first_page*** variable:

> This indicates the number of the first page image (from Section 2) that you want to convert to a .txt file. The reason this exists is that the actual lines the actors say do not necessarily start on the first page of the PDF.

The ***last_page*** variable:

> This indicates the number of the last page image (from Section 2) that you want to convert to a .txt file. The reason this exists is that the actual lines the actors say do not necessarily end on the last page of the PDF.


**NOTE:** Make sure that the page numbers that you use for ***first_page*** and ***last_page*** correspond to the numbers in your *image* file names, NOT the page numbers of the *PDF*! The images are numbered starting from 0, so page-4.png is the fifth page of the PDF. Also, this bit may take a while: Be patient, young one! Remember that you do not need to run this code again after you've run it once.
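The image numbers are off by one from the PDF page numbers because the Section 2 code names its files with `page.number`, which PyMuPDF counts from 0. Here is a tiny sketch of the conversion; the function name is hypothetical, and it assumes you count the first page of the PDF as page 1.

```python
# Hypothetical helper: converts a PDF page number (counting from 1)
# to the number used in the image file names from Section 2 (counting from 0).
def pdf_page_to_image_number(pdf_page):
  if pdf_page < 1:
    raise ValueError("PDF pages start at 1")
  return pdf_page - 1

# If the dialogue starts on the fifth page of the PDF, the matching image is page-4.png.
first_page = pdf_page_to_image_number(5)
print("page-" + str(first_page) + ".png")
```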

In [None]:
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'

# (NOTE: Adjust these three variables accordingly)
txt_folder = "text-files"
first_page = 4
last_page = 44

# This code converts all of the images to .txt files
# SOURCE: https://www.geeksforgeeks.org/how-to-extract-text-from-images-with-python/
for i in range(first_page, last_page + 1):
  img = cv2.imread(datadir + im_folder + "/page-" + str(i) + ".png")
  gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

  # Write the recognized text to a .txt file with the same page number
  with open(datadir + txt_folder + "/page-" + str(i) + ".txt", "w") as txt_file:
    txt_file.write(str(pytesseract.image_to_string(gray)))

### **Section 4: STOP! Hammertime.**

Now that you've run Sections 2 and 3, you should have a folder containing .txt files of each page of the script. Congratulations! Now it's time for manual work!

...what? You didn't think you'd get away without any manual labor, did you?

The issue is that although Section 3 did a pretty good job at reading the text from the images of the script, it is not perfect. This is where you come in! We now have a bunch of individual .txt files, one for each page, but we want one final big .txt file to work with. Thus, you should run the first code box below to combine all of the .txt files into one big .txt file. There are four variables that you will have to change:

The ***revised_txt_folder*** variable:

> This is the name of the folder that you've put all your revised .txt files into. Again, to make things easier, you should create this folder inside of the "./resources" folder.

The ***script_name*** variable:

> This is the name of the overarching .txt file that will contain the entire script of the musical. Name it whatever you'd like; go crazy!

The ***first_page*** variable:

> You might notice that the actual script may start on a different page after revising the .txt files (i.e., some pages are being left out). This variable is here for you to adjust accordingly.

The ***last_page*** variable:

> Again, you might notice that the actual script may end on a different page after revising the .txt files (i.e., some pages are being left out). This variable is here for you to adjust accordingly.

**NOTE:** You do NOT have to convert all periods after a character's name into colons; ideally, this Notebook will do that for you.
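Before combining, it can help to confirm that every page in your range actually has a .txt file, since a missing file would stop the code box below partway through. A minimal sketch, assuming the same "page-N.txt" naming used in Section 3 (the helper name `find_missing_pages` is made up):

```python
import os

# Hypothetical helper: lists the page numbers whose .txt file is missing from a folder.
def find_missing_pages(folder_path, first_page, last_page):
  missing = []
  for i in range(first_page, last_page + 1):
    page_path = os.path.join(folder_path, "page-" + str(i) + ".txt")
    if not os.path.isfile(page_path):
      missing.append(i)
  return missing

# Example usage in this Notebook (uncomment after setting the variables below):
# print(find_missing_pages(datadir + revised_txt_folder, first_page, last_page))
```

An empty list means every page is accounted for.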

In [None]:
# (NOTE: Adjust these four variables accordingly)
revised_txt_folder = "cleaned-text"
script_name = "combined-script.txt" # TODO: Change the name of this
first_page = 4
last_page = 44

# This code combines all of the .txt files of each page into one giant .txt file
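with open(datadir + script_name, "w") as output:
  for i in range(first_page, last_page + 1):
    curr_page = datadir + revised_txt_folder + "/page-" + str(i) + ".txt"
    # Open each page file in turn and copy its lines; "with" closes the file afterwards
    with open(curr_page, "r") as page_file:
      for line in page_file:
        output.write(line)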

Congratulations! We now have a giant .txt file containing a decently accurate version of the original script! Now we can start manually fixing all the small errors that the program made while trying to read the text from the PDF of images, as well as make any changes to the script that the higher-ups at ISM want.

Here are some things you should do before moving on to Section 5:

> 1.) Make sure that all speakers are in uppercase and that their name is followed by a period (Ex: "JAN.").

> 2.) Try to condense any lines that seem oddly broken up. Each slide of the PowerPoint will correspond to one line on the .txt file, so you ideally want at least one whole sentence per slide rather than just half of a sentence.

> 3.) Put ACTS and SCENES inside of parentheses. This will ensure that these won't show up on the PowerPoint, but keeping them in the .txt file can help you find where you need to make additions or changes to the script.

> 4.) Sometimes, there will be moments of silence during the play, such as before a scene or a dramatic pause (those pesky theatre kids!). To account for this, you can add a new line with **{BLANK}** in the giant .txt file.

> 5.) If there is singing (denoted by people speaking in all-caps), add a line containing only **{MUSIC}** at the beginning AND end of the song.

There is no need to worry about empty lines; in fact, keep them in if it helps you visually when you make edits to the script.
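A couple of the rules above can be checked automatically. The sketch below is only a rough helper (the name `check_script` is made up, and it is not part of the Notebook's code). It flags markers that are not on their own line, an odd number of **{MUSIC}** lines, and parentheses or brackets that never close. That last one matters because the Section 5 code keeps skipping text after an opening parenthesis until it finds a closing one, even across lines.

```python
# Hypothetical checker: returns a list of warnings about the lines of the script.
def check_script(lines):
  warnings = []
  depth = 0          # how many "(" or "[" are currently open
  music_markers = 0  # how many {MUSIC} lines we have seen
  for number, line in enumerate(lines, start=1):
    text = line.strip()
    for marker in ("{MUSIC}", "{BLANK}"):
      if marker in text and text != marker:
        warnings.append("Line %d: %s should be on its own line" % (number, marker))
    if text == "{MUSIC}":
      music_markers += 1
    depth += text.count("(") + text.count("[")
    depth -= text.count(")") + text.count("]")
    if depth < 0:
      warnings.append("Line %d: closing parenthesis or bracket without an opening one" % number)
      depth = 0
  if depth > 0:
    warnings.append("A parenthesis or bracket is never closed")
  if music_markers % 2 != 0:
    warnings.append("Odd number of {MUSIC} lines: a song is missing its start or end")
  return warnings

# Made-up sample with two problems: an unclosed parenthesis and a lone {MUSIC}.
sample = ["(ACT 1)", "JAN. Hello!", "{MUSIC}", "LA LA LA", "(Scene 2"]
for warning in check_script(sample):
  print(warning)
```

To run it on your own script, read the file's lines into a list first and pass that list in.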

**NOTE:** You may want to store multiple copies of this giant .txt file, as you will be working on it quite a lot and it would be a shame if you accidentally deleted all your progress...
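One low-effort way to keep copies is to save a timestamped backup before each editing session. A minimal sketch using only Python's standard library (the helper name `backup_file` and the "-backup-" naming are my own choices):

```python
import os
import shutil
import time

# Hypothetical helper: copies a file next to itself with a timestamp in the name.
def backup_file(file_path):
  base, extension = os.path.splitext(file_path)
  stamp = time.strftime("%Y%m%d-%H%M%S")
  backup_path = base + "-backup-" + stamp + extension
  shutil.copy2(file_path, backup_path)
  return backup_path

# Example usage in this Notebook (uncomment after setting the variables):
# backup_file(datadir + script_name)
```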

### **Section 5: Converting Script to PowerPoint**

This will be the final code that you run in order to generate the PowerPoint. You will be running this code quite a lot, especially as you make changes to the .txt of the script. Be sure to check out the last part of Section 4 for advice on how to make these changes!

Now it's time for the good stuff. The first code block puts every line of the big .txt file into a giant list. While it does this, it also drops any cues inside parentheses or brackets, makes sure that the quotation marks all use the same format, skips the **{MUSIC}** marker lines, and replaces the period after each speaker's name with a "|". There is only one variable that you need to adjust:

The ***script_name*** variable:

> This is the name of the giant .txt file that you are working with.
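To make the speaker handling concrete, here is a simplified sketch of the idea for regular (non-song) lines: if everything before the first period is uppercase, that part is treated as the speaker and the period becomes a "|". The function name `mark_speaker` is made up, and this is a condensed illustration rather than the exact code in the box below.

```python
def mark_speaker(line):
  # Split the line at the first period (if there is one).
  before, period, after = line.partition(".")
  # A speaker is assumed when there is a period and the text before it is all uppercase.
  if period and before.isupper():
    return before + "|" + after
  return line

print(mark_speaker("JAN. Hello there."))   # JAN| Hello there.
print(mark_speaker("Hello there, Jan."))   # Hello there, Jan.
```

Note that a shouted line such as "HI..." also passes this test and comes out as "HI|..", which is the edge case flagged by the TODO comments in the code.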

In [None]:
### Part 1: Getting the script into a list, standardizing quotation marks, and marking speakers
# (NOTE: Adjust this variable accordingly)
script_name = "script.txt"

# Stuff to store the script inside lists
speaking = True
cleaned_script = []
final_script = []
script = open(datadir + script_name, "r")

# 1.) Get the speaking parts from the scripts (without cues); also standardize quotation marks
for line in script:
  curr_line = line.rstrip("\n")  # remove the newline without cutting off the last character of the final line
  if curr_line != "":
    final_line = ""
    for char in curr_line:
      # Handles speaking vs. cues
      if char == '(' or char == '[':
        speaking = False
      elif char == ')' or char == ']':
        speaking = True
        continue

      if speaking == True:
        # Fix quotation marks
        if char == '‘' or char == '’':
          final_line += "\'"
        elif char == '“' or char == '”':
          final_line += "\""
        else:
          final_line += char
    final_line = final_line.strip()

    if len(final_line) > 0:
      cleaned_script.append(final_line)

# 2.) Replace the period after the speaker's name with "|" (Part 2 displays it as a colon)
music = False
for line in cleaned_script:
  # print(line)
  final_line = ""
  if line == "{MUSIC}" and music == False:
    music = True
    continue
  elif line == "{MUSIC}" and music == True:
    music = False
    continue
  
  if music == False:
    split_line = line.split(".")
    if split_line[0].isupper() == True and len(split_line) >= 2: # TODO: Test "STUFF. Why do I have to do this?" and "HI..." # Works with "THIS IS MY LIFE"
      final_line = split_line[0] + "|" + line[line.index(".") + 1:]
    else:
      final_line = line
  else:
    split_line = line.split(".")
    # TODO: Test "STUFF. Why do I have to do this?" and "HI..."
    if split_line[0].isupper() == True and len(split_line) >= 2 and len(split_line[0]) < 20: # NOTE: 20 is an assumption that no one's names will be longer than 20 characters long
      final_line = split_line[0] + "|" + line[line.index(".") + 1:]
    else:
      final_line = line
  final_script.append(final_line)

The second code block takes the giant list and performs magic to convert the list into a PowerPoint presentation. Okay, maybe it's not magic. But since I made the code, I think it's kind of cool.

Here is a list of variables that you may need to adjust:

The ***pptx_name*** variable:

> This is the name of the PowerPoint that you will get after running this code. DO NOT change the .pptx extension!

The ***slide_color*** variable:

> This is the background color of each slide on the PowerPoint in RGB terms. It is currently black, like my soul.

The ***font_name*** variable:

> This is the font of the text that will show up on each slide.

The ***font_size*** variable:

> This is the font size of the text that will show up on each slide. Incredible.

The ***name_color*** variable:

> This is the color of the speaker's name, if there is a speaker on the slide.

The ***text_color*** variable:

> This is the color of speech text. In other words, when people are talking, this is the color of the stuff they say. (I hope that makes sense.)

**NOTE:** In case anything is confusing, the code block also has some comments in it that may help you in case you would like to make other adjustments.
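Colors picked in a design tool usually come as hex codes like `#FF0000`, while the color variables below expect three numbers from 0 to 255. Here is a small conversion sketch in plain Python (the name `hex_to_rgb` is hypothetical); the three numbers it prints can be typed into `RGBColor(...)`.

```python
def hex_to_rgb(hex_code):
  digits = hex_code.lstrip("#")
  if len(digits) != 6:
    raise ValueError("Expected a 6-digit hex color like #FF0000")
  # Read two hex digits at a time: red, green, blue.
  return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

print(hex_to_rgb("#FF0000"))  # (255, 0, 0)
```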

In [None]:
### Part 2: Convert the script to PowerPoint
# (NOTE: Adjust these variables accordingly)
pptx_name = "grease.pptx"
slide_color = RGBColor(0, 0, 0)
font_name = "Montserrat"
font_size = 34
name_color = RGBColor(255, 0, 0)
text_color = RGBColor(255, 225, 225)

# Constants for where the text is located
left = Inches(0.5)
top = Inches(4.5)
width = Inches(9.0)
height = Inches(2.5)

# SOURCE: https://python-pptx.readthedocs.io/en/latest/index.html
script_pptx = Presentation()

for i in range(len(final_script)):
  split_line = final_script[i].split("|") # 2 things if there is a character speaking, 1 thing otherwise
  layout = script_pptx.slide_layouts[6]
  slide = script_pptx.slides.add_slide(layout)

  # Background settings
  background = slide.background
  fill = background.fill
  fill.solid()
  fill.fore_color.rgb = slide_color

  # print(len(slide.placeholders))

  if len(split_line) == 1 and split_line[0] == "{BLANK}":
    continue
  elif len(split_line) == 1 and split_line[0] != "{BLANK}":
    textbox = slide.shapes.add_textbox(left, top, width, height)
    text_frame = textbox.text_frame
    text_frame.word_wrap = True
    
    p = text_frame.add_paragraph()
    p.alignment = PP_ALIGN.CENTER

    run = p.add_run()
    run.text = split_line[0]
    
    font = run.font
    font.name = font_name
    font.color.rgb = text_color
    font.size = Pt(font_size)
  elif len(split_line) == 2:
    textbox = slide.shapes.add_textbox(left, top, width, height)
    text_frame = textbox.text_frame
    text_frame.word_wrap = True
    
    p = text_frame.add_paragraph()
    p.alignment = PP_ALIGN.CENTER

    # Name
    run = p.add_run()
    run.text = split_line[0] + ":"
    
    font = run.font
    font.name = font_name
    font.color.rgb = name_color
    font.size = Pt(font_size)

    # Line
    run2 = p.add_run()
    run2.text = split_line[1]
    
    font2 = run2.font
    font2.name = font_name
    font2.color.rgb = text_color
    font2.size = Pt(font_size)

script_pptx.save(datadir + pptx_name)

Congratulations; you've reached the end! You should now have a beautiful PowerPoint presentation that you can use live at the musical. 

Just one last thing: During tech week, you will actually be able to experience the show for yourself, which means you will be the one who actually understands how the pacing of the show goes, as well as where it would be good to add blank slides or separate sentences for dramatic effect. Here are some tips to help you during tech week:

> 1.) Spend the first two days just getting used to the feel of the show. Feel free to record the audio of the show while they perform so that you can go back home and relisten to the pacing.

> 2.) Spend the next couple of days just checking for any errors, and fixing anything that is missing from above.

> 3.) Spend the last couple of days taking notes on what lines lead to blank slides, whether or not some lines come in super fast, etc. A recommendation would be to take notes on a phone, and separate notes into each act and scene for clarity's sake.

If you need any additional help with understanding something or just want to compliment me for my hard work, you can PM me using one of the platforms below.


**Email:** In-progress (currently getting a new one)


**Instagram:** @lin.whatever

---