# L&S 88 - Lab 1
_Lab created by Chris Pyles, data set from [Kaggle](https://www.kaggle.com/jrobischon/wikipedia-movie-plots)_

In this L&S 88 lab, we will explore the Jupyter environment and think about questions of reproducibility. As we discussed this week and last week, reproducibility is one of the most important aspects of data science. We also talked about why it matters to annotate your code and to present your results in an accessible format that lets others recreate your process, even if they use different technologies.

For this lab, we will be taking these ideas and applying them to a dataset that has had some work done on it already. In this notebook, we have loaded a data set that contains information about movies. This notebook contains code that _cleans_ the data; that is, it puts it into a format that can be used to answer a data-driven question. **Your assignment will be to fill in the Markdown cells in this notebook so that the code in each cell has an explanation for its methodology.** In order to help you with this, we will provide you with the question that we are using the data set to answer:


---

### The Question
The first part of developing a data-driven project is to decide what question you want to answer. The question needs to be specific, and it needs to be something you can develop a step-by-step approach for. With this notebook, we are going to use the `movies` Table to answer the following question:
> Can we predict the genre of a movie based on its synopsis?

It will take a few steps to answer this question. The main methodology will be to create a test set, determine the frequency of different words in synopses within each genre, and then develop a $k$-nearest neighbors classifier based on this information. The overarching workflow will look something like this:
1. Data preprocessing
2. Group movies by genre and look for recurring words in plots
3. Write a $k$-nearest neighbors classifier
4. Test the classifier and determine its accuracy
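
To make steps 2 and 3 of this workflow more concrete, here is a minimal sketch in plain Python. It is not the classifier this project will end up with; the toy plots, the helper names (`word_counts`, `distance`, `classify`), and the choice of Euclidean distance over raw word counts are all assumptions made for illustration.

```python
from collections import Counter
import math

# Hypothetical toy data: (genre, cleaned plot) pairs, not taken from the real data set.
toy_movies = [
    ("horror", "a ghost haunts an old house"),
    ("horror", "the ghost returns to the house at night"),
    ("comedy", "two friends plan a wedding prank"),
    ("comedy", "a prank at the wedding goes wrong"),
]

def word_counts(plot):
    """Count how often each word appears in a plot."""
    return Counter(plot.split())

def distance(counts_a, counts_b):
    """Euclidean distance between two word-count vectors."""
    words = set(counts_a) | set(counts_b)
    # A Counter returns 0 for words it has not seen.
    return math.sqrt(sum((counts_a[w] - counts_b[w]) ** 2 for w in words))

def classify(plot, labeled_movies, k=3):
    """Predict a genre by majority vote among the k nearest labeled plots."""
    target = word_counts(plot)
    by_distance = sorted(
        labeled_movies,
        key=lambda movie: distance(target, word_counts(movie[1])),
    )
    nearest_genres = [genre for genre, _ in by_distance[:k]]
    return Counter(nearest_genres).most_common(1)[0][0]

prediction = classify("a ghost appears in the house", toy_movies, k=3)
print(prediction)
```

With this toy data, the three nearest plots are the two horror plots and one comedy plot, so `prediction` is `'horror'`. The real data set is much larger, so an actual classifier would likely need a more careful choice of words and of $k$.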

---



Again, this notebook only focuses on Step 1, the data preprocessing (cleaning). Whenever you see a cell that looks like this:

Type _Markdown_ and LaTeX: $\alpha ^2$

double-click it to edit, and fill in the cell with Markdown that describes what is happening in the code cell below it. For the last Markdown cell, write a short conclusion about each step we went through and the overarching process of data cleaning.

This notebook uses some Python syntax that some of you may not be familiar with. In the cells below, you will see code that looks like this:

```python
try:
    <try_expression>
except <error>:
    <error_expression>
```

These blocks tell Python to attempt the `<try_expression>` and, if the error `<error>` is raised, to execute the `<error_expression>` instead. Below is a quick example of how they are used. In the first block, `x` is not defined, so a `NameError` is raised and the `except` block runs. Between the two blocks, `x` is defined, so when the second block runs, the value assigned to `x` is printed and, since no error is raised, the `except` block is skipped.

In [None]:
try:
    print(x)
except NameError:
    print("x is not defined, so this is printed")
    
x = 2

try:
    print(x)
except NameError:
    print("this won't be printed, because x is defined now")
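
The same pattern works for other error types. As a small sketch with a made-up list called `words`, asking for an item that does not exist raises an `IndexError`, one of the errors caught later in this notebook, and the `except` block supplies a fallback value:

```python
words = ["romantic", "comedy"]  # hypothetical example list

try:
    chosen = words[-3]  # the list has only two items, so this raises IndexError
except IndexError:
    chosen = ""  # fall back to an empty string

print(repr(chosen))
```

Because the error is caught, the code keeps running and `chosen` ends up as an empty string.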

If you run into any more Python expressions you're unfamiliar with, ask one of us or check out the [Python docs](https://docs.python.org/3/) (although it may be more helpful to Google the method, instead of sifting through the documentation).

With regard to Markdown, we'll cover a more in-depth introduction to it later, but here are the basics: use underscores to _italicize_ (`_italicize_`) and double asterisks to **bold** (`**bold**`). Create ordered lists by beginning each line with a number and a period:
1. `1.`
2. `2.` etc.

and unordered lists with asterisks:
* `*`
* `*` etc.

If you have any other Markdown questions, let us know and we can help with the syntax. Get into groups, get working, and good luck!

---

In [None]:
from datascience import *
import numpy as np
import string

movies = Table.read_table('movie_plots.csv')
movies.show(5)

In [None]:
movies = movies.select('Genre', 'Plot')
movies.show(5)

In [None]:
movies.group('Genre')

In [None]:
def acceptable_genre(entry):
    acceptable_genres = make_array('action', 'adventure', 'comedy', 'drama', 'fantasy', 'historical', 'horror', 
                        'romance', 'science fiction', 'thriller')
    for genre in acceptable_genres:
        if genre in entry:
            return True
    return False

filtered_movies = movies.where('Genre', acceptable_genre)
filtered_movies.show(5)

In [None]:
def superfluous_text(entry):
    acceptable_genres = make_array('action', 'adventure', 'comedy', 'drama', 'fantasy', 'historical', 'horror', 
                        'romance', 'science fiction', 'thriller')
    for genre in acceptable_genres:
        if genre == entry:
            return False
    return True

In [None]:
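def determine_genre(entry):
    acceptable_genres = make_array('action', 'adventure', 'comedy', 'drama', 'fantasy', 'historical', 'horror', 
                        'romance', 'science fiction', 'thriller')

    # An entry that exactly matches an acceptable genre needs no cleaning.
    if not superfluous_text(entry):
        return entry

    # Replace every character that is not a lowercase letter or a space with
    # a space, so that an entry like 'comedy-drama' splits into separate words.
    allowed = string.ascii_lowercase + ' '
    cleaned = entry
    for c in entry:
        if c not in allowed:
            cleaned = cleaned.replace(c, ' ')
    words = cleaned.split()

    # Walk backwards through the words until one is an acceptable genre.
    # Note: 'science fiction' spans two words, so it is only recognized by
    # the exact-match check above.
    i = -1
    while True:
        try:
            word = words[i]
        except IndexError:
            # Ran out of words without finding an acceptable genre.
            return ''
        if word in acceptable_genres:
            return word
        i -= 1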

In [None]:
def change_genres(tbl):
    new_genres = np.array([])
    for entry in tbl.column('Genre'):
        new_genre = determine_genre(entry)
        new_genres = np.append(new_genres, new_genre)
    
    new_tbl = tbl.drop('Genre').with_column('Genre', new_genres)
    
    return new_tbl

filtered_genres = change_genres(filtered_movies)
filtered_genres.show(5)

In [None]:
def clean_string(s):
    for c in s:
        if c not in string.ascii_letters + ' ':
            s = s.replace(c, '')
        elif c in string.ascii_uppercase:
            i = string.ascii_uppercase.index(c)
            s = s.replace(c, string.ascii_lowercase[i])
    return s

def clean_plots(tbl):
    cleaned_strings = []
    for row in tbl.rows:
        cleaned_string = clean_string(row.item('Plot'))
        cleaned_strings += [cleaned_string]
    
    return tbl.drop('Plot').with_column('Plot', cleaned_strings)

cleaned_plots = clean_plots(filtered_genres)
cleaned_plots.show(5)