# L&S 88 - Lab 1 - SOLUTIONS
_Lab created by Chris Pyles, data set from [Kaggle](https://www.kaggle.com/jrobischon/wikipedia-movie-plots)_

In this L&S 88 lab, we will explore the Jupyter environment and think about questions of reproducibility. As we discussed this week and last week, reproducibility is one of the most important aspects of data science. We also talked about why it is so important to annotate code and to display results in an accessible format that allows others to recreate your process, even with different technologies.

For this lab, we will be taking these ideas and applying them to a dataset that has had some work done on it already. In this notebook, we have loaded a data set that contains information about movies. This notebook contains code that _cleans_ the data; that is, it puts it into a format that can be used to answer a data-driven question. **Your assignment will be to fill in the Markdown cells in this notebook so that the code in each cell has an explanation for its methodology.** In order to help you with this, we will provide you with the question that we are using the data set to answer:

---

### The Question
The first part of developing a data-driven project is to decide what question you want to answer. The question needs to be specific, and it needs to be something you can develop a step-by-step approach for. With this notebook, I am going to use the `movies` Table to answer the following question:
> Can we predict the genre of a movie based on its synopsis?

It will take a few steps to answer this question. The main methodology will be to create a test set and determine the frequency of different words in synopses within different genres, and then develop a $k$-nearest neighbors classifier based on this information. The overarching workflow will look something like this:
1. Data preprocessing
2. Group movies by genre and look for recurring words in plots
3. Write a $k$-nearest neighbor classifier
4. Test the classifier and determine its accuracy
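
As a rough illustration of Step 2, the sketch below counts how often each word appears in the plots of each genre. The `toy_rows` data and the name `word_counts_by_genre` are invented for this example; they are not part of the lab's data set or code.

```python
from collections import Counter, defaultdict

# Invented (genre, plot) pairs, standing in for cleaned rows of the movies table
toy_rows = [
    ('comedy', 'a man slips on a banana'),
    ('comedy', 'a clown chases a man'),
    ('drama', 'a man loses his home'),
]

def word_counts_by_genre(rows):
    """Map each genre to a Counter of the words used in its plots."""
    counts = defaultdict(Counter)
    for genre, plot in rows:
        counts[genre].update(plot.split())
    return counts

counts = word_counts_by_genre(toy_rows)
print(counts['comedy'].most_common(2))  # [('a', 4), ('man', 2)]
```

Very common words such as "a" dominate every genre, which is one reason the later steps need more than raw counts.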

---

Again, this notebook only focuses on Step 1 above, the data preprocessing (cleaning). Whenever you see a cell that looks like this:

double-click it to edit and fill in the cell with Markdown to describe what is happening in the code cell below it. For the last Markdown cell, write a short conclusion about each step we went through and the overarching process of data cleaning.

In this notebook, there is some Python syntax that some of you may not be familiar with. A few times in the cells below, you will see something that looks like this:

```python
try:
    <try_expression>
except <error>:
    <error_expression>
```

These blocks tell Python to attempt the `<try_expression>` and, if the error `<error>` is raised, to execute the `<error_expression>` instead. Below is a quick example of how these are used. In the first block, `x` is not defined, so a `NameError` is raised and the `except` block runs. Between the two blocks, `x` is defined, so when the second block runs, the value assigned to `x` is printed and, since no error is raised, the `except` block is skipped.

In [1]:
try:
    print(x)
except NameError:
    print("x is not defined, so this is printed")
    
x = 2

try:
    print(x)
except NameError:
    print("this won't be printed, because x is defined now")

x is not defined, so this is printed
2
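
The same pattern works for other error types. As a small sketch (the list `words` and the helper `word_from_end` are made up for illustration), an `IndexError` is raised when a list index is out of range, which is another error caught later in this notebook:

```python
# Hypothetical helper: return the i-th word counting from the end of the list,
# or an empty string if the list has fewer than i words.
def word_from_end(words, i):
    try:
        return words[-i]
    except IndexError:
        return ''

words = ['historical', 'drama']
print(word_from_end(words, 1))        # drama
print(repr(word_from_end(words, 3)))  # ''
```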


If you run into any more Python expressions you're unfamiliar with, ask one of us or check out the [Python docs](https://docs.python.org/3/) (although it may be more helpful to Google the method, instead of sifting through the documentation).

With regard to Markdown, we'll cover a more in-depth introduction to it later, but here are the basics: use underscores to _italicize_ (`_italicize_`) and double asterisks to **bold** (`**bold**`). Create ordered lists by beginning each line with a number and a period:
1. `1.`
2. `2.` etc.

and unordered lists with asterisks:
* `*`
* `*` etc.

If you have any other Markdown questions, let us know and we can help with the syntax. Get into groups, get working, and good luck!

---

**EXAMPLE**

* import `datascience`, `numpy`, and `string` libraries
* load the `movies` Table from `movie_plots.csv`

In [2]:
from datascience import *
import numpy as np
import string

movies = Table.read_table('movie_plots.csv')
movies.show(5)

Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
1903,Alice in Wonderland,American,Cecil Hepworth,May Clark,unknown,https://en.wikipedia.org/wiki/Alice_in_Wonderland_(1903_ ...,"Alice follows a large white rabbit down a ""Rabbit-hole"". ..."
1907,Daniel Boone,American,Wallace McCutcheon and Ediwin S. Porter,"William Craven, Florence Lawrence",biographical,https://en.wikipedia.org/wiki/Daniel_Boone_(1907_film),Boone's daughter befriends an Indian maiden as Boone and ...
1907,How Brown Saw the Baseball Game,American,Unknown,Unknown,comedy,https://en.wikipedia.org/wiki/How_Brown_Saw_the_Baseball ...,Before heading out to a baseball game at a nearby ballpa ...
1907,Laughing Gas,American,Edwin Stanton Porter,"Bertha Regustus, Edward Boulden",comedy,https://en.wikipedia.org/wiki/Laughing_Gas_(film)#1907_Film,The plot is that of a black woman going to the dentist f ...
1908,The Adventures of Dollie,American,D. W. Griffith,"Arthur V. Johnson, Linda Arvidson",drama,https://en.wikipedia.org/wiki/The_Adventures_of_Dollie,On a beautiful summer day a father and mother take their ...


We have 19,241 rows, so the next thing to look at is what kinds of values we have in the Table. Since we're focusing on the genre and plot, we can remove columns with irrelevant information:

In [3]:
movies = movies.select('Genre', 'Plot')
movies.show(5)

Genre,Plot
unknown,"Alice follows a large white rabbit down a ""Rabbit-hole"". ..."
biographical,Boone's daughter befriends an Indian maiden as Boone and ...
comedy,Before heading out to a baseball game at a nearby ballpa ...
comedy,The plot is that of a black woman going to the dentist f ...
drama,On a beautiful summer day a father and mother take their ...


#### Data Exploration and Acceptable Genres
Now we need to see what kinds of data we have in the set. We begin by showing what the values in the `Genre` column of the Table are:

In [4]:
movies.group('Genre')

Genre,count
usa,1
"usa, can",1
16 mm film,1
action,520
action / adventure,1
action / adventure / comedy,1
action / comedy,1
action / crime / drama,1
action / drama,2
action / drama / war,1
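
The grouped table has one row per distinct value in the column. A minimal sketch of the same counting in plain Python is shown here; the `toy_genres` list is invented for illustration, and `Counter` stands in for the grouping step:

```python
from collections import Counter

# Invented genre entries, mimicking the messy values in the Genre column
toy_genres = ['comedy', 'drama', 'action / drama', 'comedy', 'usa', 'drama']

# One Counter entry per distinct value, with the number of occurrences
genre_counts = Counter(toy_genres)

print(len(genre_counts))       # 4 distinct values
print(genre_counts['comedy'])  # 2
```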


It looks like there are many different values in the column (2,210 to be exact). But for this question, we're really only interested in the basic genres of movies: action, adventure, comedy, drama, fantasy, historical, horror, romance, science fiction, and thriller.

To this end, we will define a function that will read through the `Genre` entry in each row and categorize it as one of the above. This process has a few steps:
1. Define a function that will determine if any of the above words are present in the entry
2. Filter `movies` for such rows
3. For each remaining row whose `Genre` entry is not exactly one of the above, change the entry to one of the above

In [5]:
# Step 1:
def acceptable_genre(entry):
    acceptable_genres = make_array('action', 'adventure', 'comedy', 'drama', 'fantasy', 'historical', 'horror', 
                        'romance', 'science fiction', 'thriller')
    for genre in acceptable_genres:
        if genre in entry:
            return True
    return False

# Step 2:
filtered_movies = movies.where('Genre', acceptable_genre)
filtered_movies.show(5)

Genre,Plot
comedy,Before heading out to a baseball game at a nearby ballpa ...
comedy,The plot is that of a black woman going to the dentist f ...
drama,On a beautiful summer day a father and mother take their ...
drama,A thug accosts a girl as she leaves her workplace but a ...
comedy,A young couple decides to elope after being caught in th ...
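
Because the check in Step 1 uses substring matching (`genre in entry`), it keeps compound entries, but it also keeps any entry that merely contains a genre name. The sketch below re-creates the check with a plain tuple in place of `make_array` so that it runs without the `datascience` library. The name `contains_acceptable_genre` and the test entries are chosen for illustration.

```python
ACCEPTABLE_GENRES = ('action', 'adventure', 'comedy', 'drama', 'fantasy',
                     'historical', 'horror', 'romance', 'science fiction',
                     'thriller')

def contains_acceptable_genre(entry):
    """Return True if any acceptable genre appears as a substring of entry."""
    return any(genre in entry for genre in ACCEPTABLE_GENRES)

print(contains_acceptable_genre('action / comedy'))  # True
print(contains_acceptable_genre('melodrama'))        # True: 'drama' is a substring
print(contains_acceptable_genre('16 mm film'))       # False
print(contains_acceptable_genre('unknown'))          # False
```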


In order to accomplish Step 3, we need to develop a few helper functions. Mainly, we need to comb through each `Genre` entry in `filtered_movies` and do the following:
1. Check if it is an exact match for something in `acceptable_genres`
2. If it is not, then use a heuristic to determine which genre it is a part of
3. Apply this function to each row of the Table

For Step 3.1, we define the function `superfluous_text`, which returns a Boolean value corresponding to whether or not there are extra characters beyond an entry in `acceptable_genres`.

In [6]:
# Step 3.1:
def superfluous_text(entry):
    acceptable_genres = make_array('action', 'adventure', 'comedy', 'drama', 'fantasy', 'historical', 'horror', 
                        'romance', 'science fiction', 'thriller')
    for genre in acceptable_genres:
        if genre == entry:
            return False
    return True
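
Put differently, this check is the negation of an exact-match test. A minimal equivalent in plain Python is sketched below, using a tuple instead of `make_array`; the name `has_extra_text` and the inputs are chosen for illustration.

```python
ACCEPTABLE_GENRES = ('action', 'adventure', 'comedy', 'drama', 'fantasy',
                     'historical', 'horror', 'romance', 'science fiction',
                     'thriller')

def has_extra_text(entry):
    """Return True unless entry is exactly one of the acceptable genres."""
    return entry not in ACCEPTABLE_GENRES

print(has_extra_text('drama'))             # False
print(has_extra_text('historical drama'))  # True
print(has_extra_text('science fiction'))   # False
```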

For Step 3.2, we take entries for which `superfluous_text` returns `True` and use a heuristic to determine which is the most accurate genre they fall under. The heuristic that we use is that the last word is likely the most general genre (i.e. the noun without any modifiers). In this way, `'historical drama'` would become `'drama'` or `'action comedy'` would become `'comedy'`.

The function `determine_genre` looks for words starting from the last word and going to the first until it finds an entry in `acceptable_genres`. If no such word exists, then an empty string, `''`, is inserted instead.

In [7]:
# Step 3.2:
def determine_genre(entry):
    acceptable_genres = make_array('action', 'adventure', 'comedy', 'drama', 'fantasy', 'historical', 'horror', 
                        'romance', 'science fiction', 'thriller')

    # Exact matches need no further work
    if not superfluous_text(entry):
        return entry

    # Treat every character that is not a lowercase letter (e.g. '/' or ',')
    # as a word separator, then split the entry into words
    cleaned = ''.join(c if c in string.ascii_lowercase else ' ' for c in entry)
    words = cleaned.split()

    # Walk from the last word to the first and return the first acceptable genre
    for word in reversed(words):
        if word in acceptable_genres:
            return word

    # No single word matched; note that the two-word genre 'science fiction'
    # can only be matched exactly, never as part of a longer entry
    return ''
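
To make the last-word heuristic concrete, here is a standalone sketch in plain Python. It is an illustration rather than the notebook's function: the name `last_matching_genre` is invented, non-letter characters are treated as word separators, and, as an added assumption, adjacent word pairs are also checked so that the two-word genre `'science fiction'` can be found.

```python
import string

ACCEPTABLE_GENRES = ('action', 'adventure', 'comedy', 'drama', 'fantasy',
                     'historical', 'horror', 'romance', 'science fiction',
                     'thriller')

def last_matching_genre(entry):
    """Return the right-most acceptable genre in entry, or '' if there is none."""
    # Treat anything that is not a lowercase letter as a word separator
    cleaned = ''.join(c if c in string.ascii_lowercase else ' ' for c in entry)
    words = cleaned.split()
    # Scan from the last word to the first
    for i in range(len(words) - 1, -1, -1):
        # Assumption: first try the two-word phrase ending at position i
        pair = ' '.join(words[i - 1:i + 1])
        if i > 0 and pair in ACCEPTABLE_GENRES:
            return pair
        if words[i] in ACCEPTABLE_GENRES:
            return words[i]
    return ''

print(last_matching_genre('historical drama'))          # drama
print(last_matching_genre('action / comedy'))           # comedy
print(last_matching_genre('romantic science fiction'))  # science fiction
print(repr(last_matching_genre('16 mm film')))          # ''
```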

Finally, for Step 3.3, the function `change_genres` takes a Table as its argument and goes through the `Genre` column to replace each original `Genre` entry with the result of `determine_genre`. A new Table is then generated by copying the one passed as the argument, dropping the original `Genre` column, and adding a new `Genre` column in its place. Finally, it filters out rows where the empty string `''` was inserted instead of a genre.

In [8]:
# Step 3.3:
def change_genres(tbl):
    # Map every original Genre entry to an acceptable genre (or '')
    new_genres = tbl.apply(determine_genre, 'Genre')
    
    # Swap the old Genre column for the new one
    new_tbl = tbl.drop('Genre').with_column('Genre', new_genres)
    
    # Drop rows for which no acceptable genre could be determined
    return new_tbl.where('Genre', are.not_equal_to(''))

filtered_genres = change_genres(filtered_movies)
filtered_genres.show(5)

Plot,Genre
Before heading out to a baseball game at a nearby ballpa ...,comedy
The plot is that of a black woman going to the dentist f ...,comedy
On a beautiful summer day a father and mother take their ...,drama
A thug accosts a girl as she leaves her workplace but a ...,drama
A young couple decides to elope after being caught in th ...,comedy


#### Cleaning Plot Strings
Now that we have sorted out the genre of each movie in the Table, we turn to formatting the `Plot` column so that it can be analyzed more easily. In the cell below, the function `clean_string` is defined, which removes all characters that are not letters or spaces from a string and makes all letters lowercase. Then the function `clean_plots` is defined, which takes a Table as its parameter, goes through each `Plot` entry and cleans the string there, and returns a new Table with cleaned plot strings.

In [9]:
def clean_string(s):
    for c in s:
        if c not in string.ascii_letters + ' ':
            s = s.replace(c, '')
        elif c in string.ascii_uppercase:
            i = string.ascii_uppercase.index(c)
            s = s.replace(c, string.ascii_lowercase[i])
    return s

def clean_plots(tbl):
    cleaned_strings = []
    for row in tbl.rows:
        cleaned_string = clean_string(row.item('Plot'))
        cleaned_strings += [cleaned_string]
    
    return tbl.drop('Plot').with_column('Plot', cleaned_strings)

cleaned_plots = clean_plots(filtered_genres)
cleaned_plots.show(5)

Genre,Plot
comedy,before heading out to a baseball game at a nearby ballpa ...
comedy,the plot is that of a black woman going to the dentist f ...
drama,on a beautiful summer day a father and mother take their ...
drama,a thug accosts a girl as she leaves her workplace but a ...
comedy,a young couple decides to elope after being caught in th ...
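
For comparison, the same cleaning can be written more compactly by filtering characters first and lowercasing afterwards. This is a sketch rather than the notebook's implementation; the name `clean_string_compact` is invented. It keeps ASCII letters and spaces only, as described above.

```python
import string

# Characters that survive cleaning: ASCII letters (either case) and spaces
ALLOWED = set(string.ascii_letters + ' ')

def clean_string_compact(s):
    """Drop every character that is not an ASCII letter or space, then lowercase."""
    return ''.join(c for c in s if c in ALLOWED).lower()

print(clean_string_compact('Alice follows a large white rabbit down a "Rabbit-hole".'))
# alice follows a large white rabbit down a rabbithole
```

Building the result in a single pass avoids calling `replace` on the whole string once per character, which matters for long plot summaries.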


## Conclusions
...