## Module submission header
### Submission preparation instructions 
_Completion of this header is mandatory, subject to a 2-point deduction to the assignment._ Only add plain text in the designated areas, i.e., replacing the relevant 'NA's. You must fill out all group member names and Drexel email addresses in the markdown list below, under the header __Module submission group__. You are required to describe any tutoring support received in completing this submission under the __Additional submission comments__ section at the bottom of the header. If no tutoring support was received, leave NA in place. You may also list other optional comments pertaining to the submission at the bottom. _Any disruption of this header's formatting will make your group liable to the 2-point deduction._

### Module submission group
- Group member 1
    - Name: Abbeer Wani
    - Email: aw3527@drexel.edu
- Group member 2
    - Name: Andy Cherney
    - Email: alc466@drexel.edu
- Group member 3
    - Name: Kexin Shang
    - Email: ks4254@drexel.edu
- Group member 4
    - Name: Robert Logovinsky
    - Email: rsl32@drexel.edu

### Additional submission comments
- Tutoring support received: NA
- Other (other): NA

# Assignment Group 2

## Module C _(23 points)_

__C1.__ _(3 points)_ First, complete the function below, which is intended to read the text from `data/tempest.txt` and split its content into scenes, using a pattern that matches the scene headers as the delimiter for the `re.split(delim, text)` function (__Section 4.4.1.2__).

In [1]:
# C1:Function(2/3)

import re

def load_and_split(delim, file_path):
    
    #---your code starts here---
    
    with open(file_path, "r") as f:
        text = f.read()
    
    scene_texts = re.split(delim, text)
    
    return scene_texts

For reference, your output should be:
```
(19,
 ['SCENE 1',
  'SCENE 2',
  'SCENE 1',
  'SCENE 2',
  'SCENE 1',
  'SCENE 2',
  'SCENE 3',
  'SCENE 1',
  'SCENE 1'])
```

In [2]:
# C1:SanityCheck

scene_texts = load_and_split(r"(SCENE\s+\d+)", "data/tempest.txt")
len(scene_texts), scene_texts[1::2]

(19,
 ['SCENE 1',
  'SCENE 2',
  'SCENE 1',
  'SCENE 2',
  'SCENE 1',
  'SCENE 2',
  'SCENE 3',
  'SCENE 1',
  'SCENE 1'])

Is the first element returned by the `re.split()` function desirable as literary content? 

In [3]:
# C1:Inline(1/3)

# Is the first element returned by the `re.split()` 
# function desirable as literary content? 
# Print one of "Yes" or "No"
print("No")

No
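
Why the answer is "No" can be seen on a toy string. When the pattern contains a capturing group, `re.split` keeps the matched delimiters in the result, so the headers land at odd indices and whatever precedes the first header lands at index 0. The sample text below is made up for illustration and is not taken from `data/tempest.txt`.

```python
import re

# Hypothetical miniature of the file: front matter, then two scene headers.
sample = "Title page and cast list\nSCENE 1\nFirst scene text\nSCENE 2\nSecond scene text"

# The capturing group keeps each matched header in the output list.
parts = re.split(r"(SCENE\s+\d+)", sample)

preamble = parts[0]    # everything before the first header (not scene content)
headers = parts[1::2]  # the captured delimiters
bodies = parts[2::2]   # the text following each header

print(len(parts))  # 5: one preamble plus a (header, body) pair per scene
print(headers)     # ['SCENE 1', 'SCENE 2']
```

By the same arithmetic, the 19 elements in the sanity check correspond to one leading element plus 9 header/body pairs.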


__C2.__ _(7 points)_ Now let's make sure we've really managed to get all of the scenes and their structure! Notice that the scene numbers repeat: they restart with each `ACT`, so there are multiple `SCENE 1`s, etc., across the play, which we'd like to separate and structure. In particular, update the delimiter from __C1__ to _flexibly_ capture the `ACT` information, too.

For reference, your output should be:
```
(19,
 ['ACT I. SCENE 1',
  'SCENE 2',
  'ACT II. SCENE 1',
  'SCENE 2',
  'ACT III. SCENE 1',
  'SCENE 2',
  'SCENE 3',
  'ACT IV. SCENE 1',
  'ACT V. SCENE 1'])
```

In [4]:
# C2:Inline(5/7)

#---your code starts here---

delim = r"((?:ACT\s\w+\.\s)*(?:SCENE\s\d))"

#---your code stops here---
act_scene_texts = load_and_split(delim, "data/tempest.txt")
len(act_scene_texts), act_scene_texts[1::2]

(19,
 ['ACT I. SCENE 1',
  'SCENE 2',
  'ACT II. SCENE 1',
  'SCENE 2',
  'ACT III. SCENE 1',
  'SCENE 2',
  'SCENE 3',
  'ACT IV. SCENE 1',
  'ACT V. SCENE 1'])
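
The optional-prefix idea can be sketched on a made-up string. In this variant the `ACT` prefix is made optional with `?` and the scene number uses `\d+`; this is one possible formulation under the assumption that headers look like `ACT I. SCENE 1`, and not necessarily the pattern the assignment intends.

```python
import re

# Hypothetical sample; the real file's spacing may differ.
sample = "front matter ACT I. SCENE 1 text a SCENE 2 text b ACT II. SCENE 1 text c"

# (?: ... )? makes the act prefix optional without adding a second capture group,
# so re.split still returns exactly one captured header per match.
pattern = r"((?:ACT\s+\w+\.\s+)?SCENE\s+\d+)"

headers = re.split(pattern, sample)[1::2]
print(headers)  # ['ACT I. SCENE 1', 'SCENE 2', 'ACT II. SCENE 1']
```

Using a non-capturing group for the prefix matters here: a second capturing group would add extra elements to the split result and break the odd/even indexing.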

Now review the delimited text from `ACT I. SCENE 1` and answer the following `Inline()` question, which addresses whether each speaker-speech pair can be separated simply by a newline-delimiting split.

For reference, the first scene's text should begin with:
```
On a ship at sea; a tempestuous noise of thunder and lightning
heard

Enter a SHIPMASTER and a BOATSWAIN

  MASTER. Boatswain!
  BOATSWAIN. Here, master; what cheer?
  MASTER. Good! Speak to th' mariners; fall to't yarely, or
    we run ourselves a
```

In [5]:
# C2:Inline(2/7)

# Will a simple split by newline, i.e., by '\n', separate the
# different statements, i.e., is each line a different speaker-speech pair?
# Print either "Yes" or "No"
print("No")

No
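
The reason a plain newline split is not enough can be checked directly on the reference excerpt shown above, where the third speech wraps onto a second line. The snippet copies those four lines as they appear, including the truncated ending.

```python
# Excerpt copied from the reference text (the third speech wraps onto a second line).
excerpt = (
    "  MASTER. Boatswain!\n"
    "  BOATSWAIN. Here, master; what cheer?\n"
    "  MASTER. Good! Speak to th' mariners; fall to't yarely, or\n"
    "    we run ourselves a"
)

lines = excerpt.split("\n")
print(len(lines))  # 4 lines, but only 3 speaker-speech pairs

# The last line has no "SPEAKER." prefix: it continues the previous speech.
print(repr(lines[-1]))
```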


__C3.__ _(8 points)_ Now your job is to complete the `process_scenes` function. In particular, for each scene in each act, use regular expressions to separate the (newline-delimited) lines, separate the speakers from the beginnings of their speeches, and collect the multiline speeches that follow. Each speaker-speech pair should then be stored in the `defaultdict` structure used in the `SanityCheck` below, with each scene's speakers and speeches stored as a list of list-pairs. For example, from the `SanityCheck` below:
```
data['ACT I']['SCENE 1'][i][0]
```
accesses the name of the `i`th speaker in `'SCENE 1'` of `'ACT I'`.
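
A minimal sketch of that nested structure, filled by hand with the first two speeches from the reference excerpt, shows how the indexing works:

```python
from collections import defaultdict

# Two levels of defaultdict: act -> scene -> list of [speaker, speech] pairs.
data = defaultdict(lambda: defaultdict(list))

# Missing keys are created on first access, so no explicit initialization is needed.
data["ACT I"]["SCENE 1"].append(["MASTER", "Boatswain!"])
data["ACT I"]["SCENE 1"].append(["BOATSWAIN", "Here, master; what cheer?"])

print(data["ACT I"]["SCENE 1"][1][0])  # BOATSWAIN
print(data["ACT I"]["SCENE 1"][1][1])  # Here, master; what cheer?
```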

In [6]:
# C3:Function(6/8)

from collections import defaultdict

def process_scenes(act_scene_texts):
    
    data = defaultdict(lambda : defaultdict(list))
    current_act, current_scene = "", ""
    for act_scene_heading, scene_text in zip(act_scene_texts[1::2], 
                                         act_scene_texts[2::2]):
        
        
        ## if it's a new act, update the current_act value
        if act_scene_heading[:3] == "ACT":
            
            #---your code starts here--- (separate and collect the act/scene information)
           
            current_act, current_scene = act_scene_heading.split(". ")
            #---your code stops here---
            
        else: ## otherwise assign the act_scene_heading
            
            #---your code starts here--- (just update the current_scene)
            current_scene = act_scene_heading
            
            #---your code stops here---

        ## now separate the speaker names and collect each speech (including
        ## multi-line concatenations) to pair the speeches with the speakers
        speaker, speech = "", ""

        for line in scene_text.split("\n"):
            #---your code starts here
            
            if not line:
                continue

            # raw strings avoid invalid-escape warnings for \. and \s
            if re.match(r"[A-Z]+(?=\.)", line.strip()):
                if speaker and speech:
                    data[current_act][current_scene].append([speaker, speech])
                    speech = ""
                matches = re.match(r"([A-Z]+)(\.\s+)(.*)", line.strip())
                speaker = matches.group(1)
                speech = matches.group(3)
                continue
          
            speech += line         
            #---your code stops here---
    return data


For reference, your output should be:
```
---

Speaker: MASTER
Speech: Boatswain!

---

Speaker: BOATSWAIN
Speech: Here, master; what cheer?

---

Speaker: MASTER
Speech: Good! Speak to th' mariners; fall to't yarely, or    we run ourselves aground; bestir, bestir.               Exit                       Enter MARINERS

---

```

In [7]:
# C3:SanityCheck

data = process_scenes(act_scene_texts)
print('---\n')
for i in range(3):
    print('Speaker: ' + data['ACT I']['SCENE 1'][i][0])
    print('Speech: '+ data['ACT I']['SCENE 1'][i][1])
    print('\n---\n')

---

Speaker: MASTER
Speech: Boatswain!

---

Speaker: BOATSWAIN
Speech: Here, master; what cheer?

---

Speaker: MASTER
Speech: Good! Speak to th' mariners; fall to't yarely, or    we run ourselves aground; bestir, bestir.               Exit                       Enter MARINERS

---



Reviewing the output, have we correctly handled the structure of stage directions?

In [8]:
# C3:Inline(2/8)

# Reviewing the output, is the structure of stage directions handled correctly?
# Print either "Yes" or "No"
print("No")

No
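
One possible way to pull stage directions out of a collected speech is sketched below under a strong assumption: that directions are set off from dialogue by long runs of spaces and begin with `Enter`, `Exit`, or `Exeunt`, as in the third speech of the sanity check. The helper `split_stage_directions` is hypothetical and would likely miss directions formatted differently elsewhere in the play.

```python
import re

def split_stage_directions(speech):
    """Hypothetical helper: separate dialogue from stage directions."""
    # Break the speech wherever three or more whitespace characters occur in a row.
    segments = [s for s in re.split(r"\s{3,}", speech) if s]
    dialogue, directions = [], []
    for segment in segments:
        # Assumed heuristic: directions start with one of these keywords.
        if re.match(r"(Enter|Exit|Exeunt)\b", segment):
            directions.append(segment)
        else:
            dialogue.append(segment)
    return " ".join(dialogue), directions

speech = ("Good! Speak to th' mariners; fall to't yarely, or    we run ourselves "
          "aground; bestir, bestir.               Exit                       Enter MARINERS")
text, directions = split_stage_directions(speech)
print(text)
print(directions)  # ['Exit', 'Enter MARINERS']
```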


__C4.__ _(5 points)_ Finally, complete the function below by collecting all of the speakers in the data (across all `'ACT'`s and `'SCENE'`s) to see which speakers had the most lines.

Now count up the number of times each character spoke in the entire play. Print out the speakers and their speech counts from most to least. Remark on any limitations of your work in this exercise's preprocessing in the response box below. Do you see any artifacts of imprecision in your regex?

In [9]:
# C4:Function(3/5)

from collections import Counter

def count_speaker_lines(data):
    speech_counts = Counter()
    
    #---your code starts here---
    for scene_info in data.values():
        for speaker_info in scene_info.values():
            speech_counts.update(speaker for speaker, speech in speaker_info)

    #---your code stops here---
    
    return speech_counts


For reference, your output should be:
```
[('PROSPERO', 113),
 ('SEBASTIAN', 67),
 ('STEPHANO', 60),
 ('ANTONIO', 59),
 ('GONZALO', 52)]
```

In [14]:
# C4:SanityCheck

speech_counts = count_speaker_lines(data)
speech_counts.most_common(5)

[('PROSPERO', 112),
 ('SEBASTIAN', 67),
 ('STEPHANO', 60),
 ('ANTONIO', 59),
 ('GONZALO', 51)]
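
The recorded counts for PROSPERO and GONZALO are one lower than the reference. One plausible cause, not confirmed against the data, is a collection loop that appends a speaker-speech pair only when the next speaker line is seen, which leaves the final speech of each scene uncollected. The sketch below shows the pattern on made-up lines; `collect_pairs` is a hypothetical, simplified stand-in for a per-scene loop and not the `process_scenes` function itself.

```python
import re

def collect_pairs(lines, flush_last=True):
    """Hypothetical, simplified per-scene collector."""
    pairs, speaker, speech = [], "", ""
    for line in lines:
        match = re.match(r"([A-Z]+)\.\s+(.*)", line.strip())
        if match:
            if speaker:
                pairs.append([speaker, speech])  # close the previous speech
            speaker, speech = match.group(1), match.group(2)
        elif line.strip():
            speech += " " + line.strip()         # continuation of the current speech
    if flush_last and speaker:
        pairs.append([speaker, speech])          # the final speech has no successor
    return pairs

scene = ["MASTER. Boatswain!", "BOATSWAIN. Here, master;", "  what cheer?"]
print(len(collect_pairs(scene, flush_last=False)))  # 1
print(len(collect_pairs(scene)))                    # 2
```

If this is what happens, appending the pending pair once after the inner loop finishes would recover one speech per scene.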

Which speaker spoke more lines, `'PERMISSION'` or `'FRANCISCO'`?

In [1]:
# C4:Inline(1/5)

# Which speaker spoke more lines, PERMISSION or FRANCISCO?
# Print one of "PERMISSION" or "FRANCISCO"
print("Prospero") # Could not see PERMISSION in counts

Prospero
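
Because `most_common(5)` shows only the top five entries, a speaker can be present in the counter without appearing in that output. A direct lookup avoids the ambiguity. The names and counts below are invented for illustration and are not results from the play.

```python
from collections import Counter

# Invented speakers and counts, for illustration only.
toy_counts = Counter({"ALPHA": 5, "BETA": 1, "GAMMA": 3})

# Indexing a Counter never raises KeyError; absent keys count as 0.
for name in ("GAMMA", "BETA", "NOBODY"):
    print(name, toy_counts[name])

# max with a key function picks the more frequent of two candidates.
more_frequent = max(("GAMMA", "BETA"), key=lambda name: toy_counts[name])
print(more_frequent)  # GAMMA
```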


Is it a problem that `'PERMISSION'` spoke this many lines? In other words, if we want to analyze the data we've structured in this exercise, are we going to have to do more work with how we determine and/or filter speaker-speech pairs?

In [13]:
# C4:Inline(1/5)

# Is it a problem that PERMISSION spoke this many lines?
# Print one of "Probably?" or "Probably not."
print("Probably?")

Probably?
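
One way to do that extra filtering is to keep only speakers that appear in a known cast list. The sketch below uses invented counts and an assumed, partial cast set built from names that appear in this exercise; a real list would come from the play's dramatis personae.

```python
from collections import Counter

# Invented counts; "PERMISSION" stands in for a non-character token picked up by the regex.
toy_counts = Counter({"PROSPERO": 4, "GONZALO": 2, "PERMISSION": 3})

# Assumed, partial cast list for illustration.
cast = {"PROSPERO", "SEBASTIAN", "STEPHANO", "ANTONIO", "GONZALO", "FRANCISCO"}

# Keep only entries whose speaker name is a known character.
filtered = Counter({name: n for name, n in toy_counts.items() if name in cast})
print(filtered.most_common())  # [('PROSPERO', 4), ('GONZALO', 2)]
```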
