<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Python Review with Movie Data

_Author: Kiefer Katovich and Dave Yerrington | DSI-SF_

---

In this lab you'll be using the provided [IMDB](http://www.imdb.com/) `movies` list below as your dataset. 

This lab is designed to practice iteration and functions in particular. The normal questions are more gentle, and the challenge questions are suitable for advanced/expert Python or programming-experienced students. 

All the questions require writing functions and using iteration to solve them. You should print out a test of each function you write.


### 1. Load the provided list of movie dictionaries.

In [1]:
# List of movie dictionaries:

movies = [
{
"name": "Usual Suspects", 
"imdb": 7.0,
"category": "Thriller"
},
{
"name": "Hitman",
"imdb": 6.3,
"category": "Action"
},
{
"name": "Dark Knight",
"imdb": 9.0,
"category": "Adventure"
},
{
"name": "The Help",
"imdb": 8.0,
"category": "Drama"
},
{
"name": "The Choice",
"imdb": 6.2,
"category": "Romance"
},
{
"name": "Colonia",
"imdb": 7.4,
"category": "Romance"
},
{
"name": "Love",
"imdb": 6.0,
"category": "Romance"
},
{
"name": "Bride Wars",
"imdb": 5.4,
"category": "Romance"
},
{
"name": "AlphaJet",
"imdb": 3.2,
"category": "War"
},
{
"name": "Ringing Crime",
"imdb": 4.0,
"category": "Crime"
},
{
"name": "Joking muck",
"imdb": 7.2,
"category": "Comedy"
},
{
"name": "What is the name",
"imdb": 9.2,
"category": "Suspense"
},
{
"name": "Detective",
"imdb": 7.0,
"category": "Suspense"
},
{
"name": "Exam",
"imdb": 4.2,
"category": "Thriller"
},
{
"name": "We Two",
"imdb": 7.2,
"category": "Romance"
}
]

---

### 2. Filtering data by IMDB score

#### 2.1 

Write a function that:

1. Accepts a single movie dictionary from the `movies` list as an argument.
2. Returns True if the IMDB score is above 5.5.

#### 2.2 [Challenge] 

Write a function that:

1. Accepts the movies list and a specified category.
2. Returns True if the average score of the category is higher than the average score of all movies.

In [2]:
# 2.1:

def imdb_score_over_bad(movie):
    # the comparison already evaluates to True or False
    return movie['imdb'] > 5.5

print(movies[0])
print(imdb_score_over_bad(movies[0]))

{'name': 'Usual Suspects', 'imdb': 7.0, 'category': 'Thriller'}
True
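
A predicate like this is usually applied to every movie while iterating over the list. The sketch below is a minimal illustration under stated assumptions: `sample_movies` is a shortened three-movie stand-in for the lab data, and the function is redefined so the block runs on its own.

```python
# Shortened stand-in for the lab data (assumption: three of the movies above).
sample_movies = [
    {"name": "Usual Suspects", "imdb": 7.0, "category": "Thriller"},
    {"name": "Bride Wars", "imdb": 5.4, "category": "Romance"},
    {"name": "AlphaJet", "imdb": 3.2, "category": "War"},
]

def imdb_score_over_bad(movie):
    # True when the movie's IMDB score is strictly above 5.5
    return movie["imdb"] > 5.5

# Apply the predicate to each movie while iterating over the list.
good_names = []
for movie in sample_movies:
    if imdb_score_over_bad(movie):
        good_names.append(movie["name"])

print(good_names)
```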


In [3]:
# 2.2

def movies_category_over_avg(movies, category):
    all_scores = []
    category_scores = []

    for movie in movies:
        # collect every IMDB score
        all_scores.append(movie['imdb'])
        # collect the IMDB scores of movies in the requested category
        if movie['category'] == category:
            category_scores.append(movie['imdb'])

    # guard against categories with no movies (avoids dividing by zero)
    if len(category_scores) == 0:
        print('no movies in specified category:', category)
        return False

    # compute both means by hand and compare them
    overall_average = sum(all_scores) / len(all_scores)
    category_average = sum(category_scores) / len(category_scores)
    return category_average > overall_average

print(movies_category_over_avg(movies, 'Thriller'))
print(movies_category_over_avg(movies, 'Suspense'))

False
True
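
The same comparison can also be written with the standard library's `statistics.mean` instead of `sum(...)/len(...)`. This is only a sketch: `sample_movies` is a four-movie stand-in for the lab data and `category_beats_overall` is a hypothetical name, not part of the lab.

```python
from statistics import mean

# Shortened stand-in for the lab data (assumption: four of the movies above).
sample_movies = [
    {"name": "Usual Suspects", "imdb": 7.0, "category": "Thriller"},
    {"name": "Exam", "imdb": 4.2, "category": "Thriller"},
    {"name": "What is the name", "imdb": 9.2, "category": "Suspense"},
    {"name": "Detective", "imdb": 7.0, "category": "Suspense"},
]

def category_beats_overall(movies, category):
    # hypothetical helper: compares a category's mean score to the overall mean
    all_scores = [m["imdb"] for m in movies]
    cat_scores = [m["imdb"] for m in movies if m["category"] == category]
    if not cat_scores:
        # mean() raises StatisticsError on an empty list, so bail out first
        return False
    return mean(cat_scores) > mean(all_scores)

print(category_beats_overall(sample_movies, "Thriller"))
print(category_beats_overall(sample_movies, "Suspense"))
```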


---

### 3. Creating subsets by numeric condition

#### 3.1

Write a function that:

1. Accepts the list of movies and a specified imdb score.
2. Returns the sublist of movies that have a score greater than the specified score.

#### 3.2 [Expert] 

Write a function that:

1. Accepts the movies list as an argument.
2. Returns the movies list sorted first by category average score and then, within each category, by individual IMDB score (highest first in both cases).

In [4]:
# 3.1

def score_greater_subset(movies, score):
    subset = []
    for movie in movies:
        if movie['imdb'] > score:
            subset.append(movie)
    return subset

print(score_greater_subset(movies, 8.5))

[{'name': 'Dark Knight', 'imdb': 9.0, 'category': 'Adventure'}, {'name': 'What is the name', 'imdb': 9.2, 'category': 'Suspense'}]
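
The loop-and-append pattern in 3.1 can be condensed into a list comprehension. The sketch below assumes a three-movie stand-in (`sample_movies`) for the lab data and redefines the function so it runs on its own.

```python
# Shortened stand-in for the lab data (assumption: three of the movies above).
sample_movies = [
    {"name": "Dark Knight", "imdb": 9.0, "category": "Adventure"},
    {"name": "Hitman", "imdb": 6.3, "category": "Action"},
    {"name": "What is the name", "imdb": 9.2, "category": "Suspense"},
]

def score_greater_subset(movies, score):
    # keep only the movies whose score is strictly greater than the threshold
    return [movie for movie in movies if movie["imdb"] > score]

subset = score_greater_subset(sample_movies, 8.5)
print([movie["name"] for movie in subset])
```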


In [5]:
# 3.2
# See here for another example and explanation of sorting with a lambda:
# http://stackoverflow.com/questions/3766633/how-to-sort-with-lambda-in-python
# http://stackoverflow.com/questions/14299448/sorting-by-multiple-conditions-in-python

def category_score_sorted(movies):
    category_scores = {}
    for movie in movies:
        # if the category is not yet a key in the category_scores dict
        if movie['category'] not in category_scores:
            # add the category key with a list holding its first IMDB score
            category_scores[movie['category']] = [movie['imdb']]
        else:
            # otherwise append the score to the category's existing list
            category_scores[movie['category']].append(movie['imdb'])
    
    # build a new dict mapping each category to the mean of its scores
    category_averages = {}
    for cat, vals in category_scores.items():
        category_averages[cat] = sum(vals)/len(vals)
    
    # The 'key' argument of sorted() is a function that returns the value to sort by.
    # 'x' refers to each individual entry in the movies list.
    # Lambda functions are like single-use, one-line functions.
    # Here each movie sorts by (category average, IMDB score); tuples compare
    # element by element, so ties on the average fall back to the IMDB score.
    # reverse=True because we want high to low instead of low to high.
    movies_sorted = sorted(movies, key=lambda x: (category_averages[x['category']],
                                                  x['imdb']), reverse=True)
    
    return movies_sorted

category_score_sorted(movies)

[{'category': 'Adventure', 'imdb': 9.0, 'name': 'Dark Knight'},
 {'category': 'Suspense', 'imdb': 9.2, 'name': 'What is the name'},
 {'category': 'Suspense', 'imdb': 7.0, 'name': 'Detective'},
 {'category': 'Drama', 'imdb': 8.0, 'name': 'The Help'},
 {'category': 'Comedy', 'imdb': 7.2, 'name': 'Joking muck'},
 {'category': 'Romance', 'imdb': 7.4, 'name': 'Colonia'},
 {'category': 'Romance', 'imdb': 7.2, 'name': 'We Two'},
 {'category': 'Romance', 'imdb': 6.2, 'name': 'The Choice'},
 {'category': 'Romance', 'imdb': 6.0, 'name': 'Love'},
 {'category': 'Romance', 'imdb': 5.4, 'name': 'Bride Wars'},
 {'category': 'Action', 'imdb': 6.3, 'name': 'Hitman'},
 {'category': 'Thriller', 'imdb': 7.0, 'name': 'Usual Suspects'},
 {'category': 'Thriller', 'imdb': 4.2, 'name': 'Exam'},
 {'category': 'Crime', 'imdb': 4.0, 'name': 'Ringing Crime'},
 {'category': 'War', 'imdb': 3.2, 'name': 'AlphaJet'}]
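
The grouping step can also be sketched with `collections.defaultdict`, which removes the need to check whether a category key already exists. The four-movie `sample_movies` list is an assumption made to keep the example short.

```python
from collections import defaultdict

# Shortened stand-in for the lab data (assumption: four of the movies above).
sample_movies = [
    {"name": "Hitman", "imdb": 6.3, "category": "Action"},
    {"name": "Detective", "imdb": 7.0, "category": "Suspense"},
    {"name": "What is the name", "imdb": 9.2, "category": "Suspense"},
    {"name": "Exam", "imdb": 4.2, "category": "Thriller"},
]

# A missing key starts out as an empty list, so append always works.
scores_by_category = defaultdict(list)
for movie in sample_movies:
    scores_by_category[movie["category"]].append(movie["imdb"])

category_averages = {
    cat: sum(vals) / len(vals) for cat, vals in scores_by_category.items()
}

# Sort by (category average, individual score), highest first.
ordered = sorted(
    sample_movies,
    key=lambda m: (category_averages[m["category"]], m["imdb"]),
    reverse=True,
)
names = [m["name"] for m in ordered]
print(names)
```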

---

### 4. Creating subsets by string condition

#### 4.1

Write a function that:

1. Accepts the movies list and a category name.
2. Returns the movie names within that category (case-insensitive!)
3. If the category is not in the data, print a message that it does not exist and return None.

Recall that to convert a string to lowercase, you can use:

```python
mystring = 'Dumb and Dumber'
lowercase_mystring = mystring.lower()
print(lowercase_mystring)
# dumb and dumber
```
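
For the case-insensitive checks in this section, one common approach is to lowercase both sides before comparing. A minimal sketch, where `same_text` is a hypothetical helper name used only for illustration:

```python
def same_text(a, b):
    # hypothetical helper: equality check that ignores letter case
    return a.lower() == b.lower()

print(same_text("Suspense", "SUSPENSE"))
print(same_text("Suspense", "Thriller"))

# The same idea works for substring ("contains") checks.
print("sus" in "Usual Suspects".lower())
```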

#### 4.2 [Challenge]

Write a function that:

1. Accepts the movies list and a "search string".
2. Returns a dictionary with keys `'category'` and `'title'` whose values are lists of categories that contain the search string and titles that contain the search string, respectively (case-insensitive!)

In [6]:
# 4.1

def category_subset(movies, category):
    category = category.lower()
    movies_subset = []
    
    for movie in movies:
        movie_category = movie['category'].lower()
        if movie_category == category:
            # the task asks for the movie names, not the full dictionaries
            movies_subset.append(movie['name'])
            
    if len(movies_subset) == 0:
        print('No movies in category:', category)
        return None
    else:
        return movies_subset
    
print(category_subset(movies, 'suspense'))
print(category_subset(movies, 'sci-fi'))

['What is the name', 'Detective']
No movies in category: sci-fi
None


In [7]:
# 4.2

def category_title_search(movies, search_string):
    search_string = search_string.lower()
    
    results = {'category':[], 'title':[]}
    for movie in movies:
        movie_category = movie['category'].lower()
        movie_title = movie['name'].lower()
        
        if search_string in movie_category:
            if movie_category not in results['category']:
                results['category'].append(movie_category)
            
        if search_string in movie_title:
            results['title'].append(movie_title)
            
    return results

print(category_title_search(movies, 'SUS'))

{'category': ['suspense'], 'title': ['usual suspects']}


---

### 5. Multiple conditions

#### 5.1

Write a function that:

1. Accepts the movies list and a "search criteria" variable.
2. If the criteria variable is numeric, return a list of movie titles with a score greater than or equal to the criteria.
3. If the criteria variable is a string, return a list of movie titles that match that category (case-insensitive!). If there is no match, return an empty list and print an informative message.

#### 5.2 [Expert]

Write a function that:

1. Accepts the movies list and a string search criteria variable.
2. The search criteria variable can contain within it:
  - Boolean operations: `'AND'`, `'OR'`, and `'NOT'` (lowercase is also allowed; they are capitalized here only for clarity).
  - Search criteria specified with syntax `score=...`, `category=...`, and/or `title=...`, where the `...` indicates what to look for.
    - If `score` is present, it means scores greater than or equal to the value.
    - For `category` and `title`, the string indicates that the category or title must _contain_ the search string (case-insensitive).
3. Return the matches for the search criteria specified.
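
Before any matching happens, a search string of this kind has to be broken into operators and `key=value` criteria. The sketch below shows one way to do that step, assuming whitespace-separated tokens and the `imdb=` key used in the sample calls later in this lab; the variable names are illustrative only.

```python
search = "imdb=8.9 AND NOT category=suspense"

# Lowercase first so 'AND' and 'and' are treated the same, then split on whitespace.
tokens = search.lower().split()

operators = []
criteria = []
for token in tokens:
    if token in ("and", "or", "not"):
        operators.append(token)
    elif "=" in token:
        # split only on the first '=' into a (key, value) pair
        key, value = token.split("=", 1)
        criteria.append((key, value))

print(tokens)
print(operators)
print(criteria)
```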

In [8]:
# 5.1

def general_search(movies, criterion):
    titles_matches = []
    
    # first check the criterion type
    if type(criterion) in [int, float]:
        search_for = 'score'
    elif type(criterion) == str:
        search_for = 'category'
        criterion = criterion.lower()
    else:
        print('criterion neither string nor numeric')
        return titles_matches
    
    for movie in movies:
        if search_for == 'score':
            # the task asks for scores greater than or equal to the criterion
            if movie['imdb'] >= criterion:
                titles_matches.append(movie['name'])
                
        else:
            if movie['category'].lower() == criterion:
                titles_matches.append(movie['name'])
                
    if len(titles_matches) == 0:
        print('no matches found')
    
    return titles_matches

print(general_search(movies, 6.9))
print(general_search(movies, 'suspense'))
print(general_search(movies, 'horror'))
print(general_search(movies, {'name':'the godfather'}))
                
        

['Usual Suspects', 'Dark Knight', 'The Help', 'Colonia', 'Joking muck', 'What is the name', 'Detective', 'We Two']
['What is the name', 'Detective']
no matches found
[]
criterion neither string nor numeric
[]
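
The type check can also be written with `isinstance`, which accepts subclasses as well. One caveat is that `bool` is a subclass of `int` in Python, so a sketch along these lines screens it out first. `criterion_kind` is a hypothetical helper, not part of the solution above.

```python
def criterion_kind(criterion):
    # hypothetical helper: classify a search criterion by its type
    if isinstance(criterion, bool):
        # True/False would otherwise pass the int check below
        return "unsupported"
    if isinstance(criterion, (int, float)):
        return "score"
    if isinstance(criterion, str):
        return "category"
    return "unsupported"

print(criterion_kind(6.9))
print(criterion_kind("suspense"))
print(criterion_kind({"name": "the godfather"}))
```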


In [9]:
# 5.2

# helper used by boolean_search below: returns the indexes of the movies that match one criterion.
def movie_matches_subparser(movies, movie_key, value):
    # the movie dictionaries store the title under 'name' and the score under 'imdb'
    if movie_key == 'title':
        movie_key = 'name'
    elif movie_key == 'score':
        movie_key = 'imdb'
    # if not a title, category, or score, print an error message
    elif movie_key not in ['category','imdb']:
        print('movie lookup key', movie_key, 'incorrect')
        return []
    # for a score criterion, the value must be numeric
    if movie_key == 'imdb':
        try:
            value = float(value)
        # if the score is invalid, print an error message
        except ValueError:
            print('imdb', value, 'cannot become float')
            return []
        
    subset = []
    # enumerate pairs each movie with its index; collect the indexes of the movies that match
    for movie_ind, movie in enumerate(movies):
        # numeric value: compare scores
        if type(value) == float:
            if movie[movie_key] >= value:
                subset.append(movie_ind)
        # string value: case-insensitive substring match
        else:
            if value in movie[movie_key].lower():
                subset.append(movie_ind)
    
    return subset


# helper used by boolean_search below: combines the per-criterion index lists with set operations.
def meets_boolean_criteria(movies, criteria_info):
    # one index per movie; these are compared against the index lists in criteria_info
    movie_inds = list(range(len(movies)))
    
    full_set = set(movie_inds)
    return_set = set(movie_inds)
    
    # each entry pairs a boolean operator with the indexes matching one criterion
    for boolean, movie_subset in criteria_info:
        
        # convert the index list to a set so the set operators below can be used
        movie_subset = set(movie_subset)
        
        # use the boolean to add or drop movie indexes from the return set
        if boolean == 'and':
            return_set = return_set & movie_subset
        elif boolean == 'or':
            return_set = return_set | movie_subset
        elif boolean == 'not':
            return_set = return_set - movie_subset
        elif boolean == 'ornot':
            return_set = return_set | (full_set - movie_subset)
            
    return_list = []
    # look up the full movie dictionaries, keeping the original list order
    for ind in sorted(return_set):
        return_list.append(movies[ind])
        
    return return_list  
            
                

def boolean_search(movies, search):
    # convert string to lower
    search = search.lower()
    # split the criteria on whitespace; split() with no argument tolerates repeated spaces
    search = search.split()
    # each criterion itself must not contain spaces (e.g. 'title=we two' will not parse)
    criteria_info = []
    current_boolean = 'and'
    
    # use a while loop to assess and extract each criterion individually
    while len(search) > 0:
        # pop off the first remaining token
        item = search.pop(0)
        # decide whether the current token is a boolean operator
        # or a criterion of the form key=value
        
        if item in ['and','or','not']:
            if (current_boolean == 'or') and (item == 'not'):
                current_boolean = 'ornot'
            else:
                current_boolean = item
            continue
        else:
            if '=' in item:
                item = item.split('=')
            else:
                print(item, 'syntax incorrect')
                return []
            # pass the specified criteria through the movie_matches_subparser             
            movie_match_inds = movie_matches_subparser(movies, item[0], item[1])
            # now we will append the index results from the movie_match_inds with their desired bool  
            criteria_info.append([current_boolean, movie_match_inds])

    # finally compare the list of movies to the identified index values and bools        
    matches = meets_boolean_criteria(movies, criteria_info)
    return matches
        

In [10]:
boolean_search(movies, 'imdb=7.0 NOT category=suspense OR NOT title=love')

[{'category': 'Thriller', 'imdb': 7.0, 'name': 'Usual Suspects'},
 {'category': 'Action', 'imdb': 6.3, 'name': 'Hitman'},
 {'category': 'Adventure', 'imdb': 9.0, 'name': 'Dark Knight'},
 {'category': 'Drama', 'imdb': 8.0, 'name': 'The Help'},
 {'category': 'Romance', 'imdb': 6.2, 'name': 'The Choice'},
 {'category': 'Romance', 'imdb': 7.4, 'name': 'Colonia'},
 {'category': 'Romance', 'imdb': 5.4, 'name': 'Bride Wars'},
 {'category': 'War', 'imdb': 3.2, 'name': 'AlphaJet'},
 {'category': 'Crime', 'imdb': 4.0, 'name': 'Ringing Crime'},
 {'category': 'Comedy', 'imdb': 7.2, 'name': 'Joking muck'},
 {'category': 'Suspense', 'imdb': 9.2, 'name': 'What is the name'},
 {'category': 'Suspense', 'imdb': 7.0, 'name': 'Detective'},
 {'category': 'Thriller', 'imdb': 4.2, 'name': 'Exam'},
 {'category': 'Romance', 'imdb': 7.2, 'name': 'We Two'}]

In [11]:
boolean_search(movies, 'imdb=8.9')

[{'category': 'Adventure', 'imdb': 9.0, 'name': 'Dark Knight'},
 {'category': 'Suspense', 'imdb': 9.2, 'name': 'What is the name'}]

In [12]:
boolean_search(movies, 'imdb=8.9 AND NOT category=suspense')

[{'category': 'Adventure', 'imdb': 9.0, 'name': 'Dark Knight'}]
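
The boolean operators in this search map directly onto Python set operations over movie indexes. The sketch below uses small made-up index sets (hypothetical, not taken from the lab data) to show each mapping in isolation.

```python
# Hypothetical index sets for a five-movie list.
all_indices = set(range(5))   # every movie
high_score = {1, 3}           # movies matching a score criterion
suspense = {3, 4}             # movies matching a category criterion

and_result = all_indices & high_score                    # AND: intersection
not_result = and_result - suspense                       # NOT: difference
or_result = not_result | suspense                        # OR: union
ornot_result = not_result | (all_indices - suspense)     # OR NOT: union with the complement

print(sorted(and_result))
print(sorted(not_result))
print(sorted(or_result))
print(sorted(ornot_result))
```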

In [13]:
boolean_search(movies, 'imdb=notafloat')

imdb notafloat cannot become float


[]

In [14]:
boolean_search(movies, 'category=1')

[]

In [16]:
boolean_search(movies, 'category=suspense WHEN imdb=5.5')

when syntax incorrect


[]

In [17]:
boolean_search(movies, 'review_count=100')

movie lookup key review_count incorrect


[]