<h1>Exploring Academy Award nomination data</h1>

The Academy Awards, also known as the Oscars, are an annual ceremony recognizing achievements in the film industry. There are many award categories, and the members of the Academy vote every year to decide which artist or film receives each award. The categories have changed over the years, and you can learn more about when categories were added on Wikipedia.

Here are the columns in the dataset, academy_awards.csv:

    Year - the year of the awards ceremony.
    
    Category - the category of award the nominee was nominated for.
    
    Nominee - the person nominated for the award.
    
    Additional Info - this column contains additional info like:
        the movie the nominee participated in.
        the character the nominee played (for acting awards).
        
    Won? - this column contains either YES or NO, depending on whether the nominee won the award.


In [23]:
#Import dependencies
import pandas as pd
import sqlite3 as sql

In [2]:
df = pd.read_csv('academy_awards.csv', encoding='ISO-8859-1')

In [4]:
df.head(4)

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,2010 (83rd),Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},NO,,,,,,
1,2010 (83rd),Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},NO,,,,,,
2,2010 (83rd),Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},NO,,,,,,
3,2010 (83rd),Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},YES,,,,,,


As demonstrated above, there are 6 unnamed columns at the end of the table. Let's see if they contain any useful information.

In [5]:
df['Unnamed: 5'].value_counts()

*                                                                                                               7
 resilience                                                                                                     1
 error-prone measurements on sets. [Digital Imaging Technology]"                                                1
 D.B. "Don" Keele and Mark E. Engebretson has resulted in the over 20-year dominance of constant-directivity    1
 discoverer of stars                                                                                            1
Name: Unnamed: 5, dtype: int64

In [6]:
df['Unnamed: 6'].value_counts()

*                                                                   9
 sympathetic                                                        1
 direct radiator bass style cinema loudspeaker systems. [Sound]"    1
 flexibility and water resistance                                   1
Name: Unnamed: 6, dtype: int64

In [7]:
df['Unnamed: 7'].value_counts()

 while requiring no dangerous solvents. [Systems]"    1
*                                                     1
 kindly                                               1
Name: Unnamed: 7, dtype: int64

In [8]:
df['Unnamed: 8'].value_counts()

 understanding comedy genius - Mack Sennett.""    1
*                                                 1
Name: Unnamed: 8, dtype: int64

In [9]:
df['Unnamed: 9'].value_counts()

*    1
Name: Unnamed: 9, dtype: int64

In [10]:
df['Unnamed: 10'].value_counts()

*    1
Name: Unnamed: 10, dtype: int64
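The six cells above repeat the same call on each unnamed column. As a minimal sketch, the same check can be written as a loop. The small DataFrame below (df_demo) is a made-up stand-in for the real data, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the real frame: two stray "Unnamed" columns next to a real one.
df_demo = pd.DataFrame({
    "Won?": ["NO", "YES", "NO"],
    "Unnamed: 5": ["*", None, "*"],
    "Unnamed: 6": [None, None, " sympathetic"],
})

# Select every column whose name starts with "Unnamed: ".
unnamed_cols = [col for col in df_demo.columns if col.startswith("Unnamed: ")]

# Count the non-null entries in each stray column in one pass.
non_null_counts = df_demo[unnamed_cols].notna().sum()
print(non_null_counts)

# Same value_counts() call as in the cells above, once per column.
for col in unnamed_cols:
    print(df_demo[col].value_counts())
```

On the real DataFrame, replacing df_demo with df should reproduce the six outputs shown above.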

The dataset is messy, and you may have noticed many inconsistencies that make it hard to work with. Most columns don't have consistent formatting, which matters when we use SQL to query the data later on. Other columns vary in the information they convey depending on the award category of the row.

Let's filter our DataFrame so it's more manageable. We'll create a DataFrame containing only the following categories:

    Actor -- Leading Role
    Actor -- Supporting Role
    Actress -- Leading Role
    Actress -- Supporting Role
    
Before we filter the data, let's clean up the Year column by keeping just the first 4 characters of each value and converting them to integers. This drops the ceremony number in parentheses.

In [14]:
df.Year = df.Year.str[0:4].astype('int64')
df.Year.head(2)

0    2010
1    2010
Name: Year, dtype: int64
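To make the slice-and-cast step concrete, here is a small sketch on hand-written sample values. The "1927/28 (1st)" label is an assumed example of how an older ceremony might be written; it is not taken from the output above. A regular-expression variant is shown alongside as an alternative, not as the approach used in this notebook.

```python
import pandas as pd

# Sample labels; "1927/28 (1st)" is an assumed older-style value for illustration.
years = pd.Series(["2010 (83rd)", "2009 (82nd)", "1927/28 (1st)"])

# Same approach as above: keep the first four characters, then cast to integer.
by_slice = years.str[0:4].astype("int64")

# Alternative: capture the four leading digits explicitly.
by_regex = years.str.extract(r"^(\d{4})", expand=False).astype("int64")

print(by_slice.tolist())
print(by_regex.tolist())
```

Both variants agree on these samples. The regex version returns a missing value for anything that doesn't start with four digits, so the integer cast then fails instead of passing a bad value through.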

To simplify our analysis, we will look only at nominations from after the year 2000.

In [16]:
later_than_2000 = df[df.Year > 2000]

Now we can select just the categories we are interested in.

In [17]:
award_categories = ['Actor -- Leading Role', 'Actor -- Supporting Role',
                   'Actress -- Leading Role','Actress -- Supporting Role']
nominations = later_than_2000[later_than_2000['Category'].isin(award_categories)]
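The two filtering steps above can also be combined into a single boolean mask. The sketch below uses a made-up four-row frame (toy) to show which rows survive. The trailing .copy() is an optional addition that gives a frame independent of the original.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Year": [2010, 2010, 1999, 2005],
    "Category": ["Actor -- Leading Role", "Best Picture",
                 "Actress -- Leading Role", "Actress -- Supporting Role"],
})
acting = ["Actor -- Leading Role", "Actor -- Supporting Role",
          "Actress -- Leading Role", "Actress -- Supporting Role"]

# A row is kept only if it is later than 2000 AND in one of the acting categories.
mask = (toy["Year"] > 2000) & toy["Category"].isin(acting)

# .copy() makes the result independent, so later column assignments modify it directly.
subset = toy[mask].copy()
print(subset)
```

Only the rows at index 0 and 3 pass both conditions: the "Best Picture" row fails the category test and the 1999 row fails the year test.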

Later on we will use SQLite to query this data. Since SQLite uses the integers 0 and 1 to represent Boolean values, we will convert the Won? column to match. We can also rename the Won? column to Won so that it's consistent with the other column names. Finally, let's get rid of the 6 extra unnamed columns, since they contain only null values in our filtered DataFrame, nominations.

In [18]:
replace_dict = {'NO': 0, 'YES': 1}
# Work on an explicit copy so the assignment below doesn't raise SettingWithCopyWarning
nominations = nominations.copy()
# Map YES/NO to 1/0 and store the result under the new column name
nominations['Won'] = nominations['Won?'].map(replace_dict)
final_nominations = nominations.drop(['Won?', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10'], axis=1)
final_nominations.head(5)


Unnamed: 0,Year,Category,Nominee,Additional Info,Won
0,2010,Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},0
1,2010,Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},0
2,2010,Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},0
3,2010,Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},1
4,2010,Actor -- Leading Role,James Franco,127 Hours {'Aron Ralston'},0
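One thing to keep in mind with the YES/NO mapping above: Series.map turns any value missing from the dictionary into NaN. The sketch below, on a made-up Series containing a deliberately malformed value, shows one way to check for such leftovers before writing to the database.

```python
import pandas as pd

# "yes" (lowercase) is a deliberately malformed value for illustration.
won_raw = pd.Series(["NO", "YES", "yes", "NO"])
won = won_raw.map({"NO": 0, "YES": 1})

# Values that were not in the dictionary come back as NaN.
unmapped = won_raw[won.isna()]
print(unmapped.tolist())
```

An empty result from the same check on the real Won? column would confirm that every row was either YES or NO.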


The Additional Info column contains the movie that the actor starred in as well as the character they portrayed, in the form Movie Title {'Character'}.
Instead of keeping these values in one column, we can split them into two columns for easier querying.

In [22]:
additional_info = final_nominations['Additional Info']
# Strip the trailing quote and brace so each value ends with the character name
additional_info_one = additional_info.str.rstrip("'}")
# Split on "{'" into [movie title, character name]
additional_info_two = additional_info_one.str.split("{'")
# Note: the movie title keeps the space that preceded "{'"
movie_names = additional_info_two.str[0]
characters = additional_info_two.str[1]
final_nominations['Movie'] = movie_names
final_nominations['Character'] = characters
final_nominations = final_nominations.drop('Additional Info', axis=1)
final_nominations.head(5)

Unnamed: 0,Year,Category,Nominee,Won,Movie,Character
0,2010,Actor -- Leading Role,Javier Bardem,0,Biutiful,Uxbal
1,2010,Actor -- Leading Role,Jeff Bridges,0,True Grit,Rooster Cogburn
2,2010,Actor -- Leading Role,Jesse Eisenberg,0,The Social Network,Mark Zuckerberg
3,2010,Actor -- Leading Role,Colin Firth,1,The King's Speech,King George VI
4,2010,Actor -- Leading Role,James Franco,0,127 Hours,Aron Ralston
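As an alternative to the rstrip-and-split approach above, the two parts can be pulled out in one step with Series.str.extract and a regular expression. This is a sketch on two sample values copied from the table, not the method this notebook uses. The pattern assumes every value follows the Movie Title {'Character'} form.

```python
import pandas as pd

info = pd.Series([
    "Biutiful {'Uxbal'}",
    "The King's Speech {'King George VI'}",
])

# Named groups become column names. The lazy Movie group stops before the
# whitespace in front of "{'", so the title carries no trailing space.
pattern = r"^(?P<Movie>.*?)\s*\{'(?P<Character>.*)'\}$"
parts = info.str.extract(pattern)
print(parts)
```

Rows that don't match the pattern come back as missing values in both columns, which makes malformed entries easy to spot.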


Now that our DataFrame is cleaned up, let's write these records to a SQLite database.

In [None]:
# nominations.db doesn't exist in our current directory yet, so connect() creates it automatically
conn = sql.connect('nominations.db')
final_nominations.to_sql("nominations", conn, index=False)

In [26]:
conn.close()
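The to_sql call above relies on pandas' default if_exists="fail", so re-running that cell once the nominations table exists raises an error. A minimal sketch of one way around this, using a made-up one-row frame and an in-memory database so that nominations.db is left untouched:

```python
import sqlite3
import pandas as pd

# Made-up single row for illustration.
frame = pd.DataFrame({"Year": [2010], "Nominee": ["Colin Firth"], "Won": [1]})

# ":memory:" creates a throwaway database instead of a file on disk.
demo_conn = sqlite3.connect(":memory:")

# if_exists="replace" drops and recreates the table, so writing twice is safe
# and does not duplicate rows.
frame.to_sql("nominations", demo_conn, index=False, if_exists="replace")
frame.to_sql("nominations", demo_conn, index=False, if_exists="replace")

row_count = demo_conn.execute("SELECT COUNT(*) FROM nominations").fetchone()[0]
print(row_count)
demo_conn.close()
```

Whether replacing the table is the right behavior depends on the workflow; if_exists="append" is the other documented option and keeps existing rows.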

In [28]:
conn = sql.connect('nominations.db')
q = "pragma table_info(nominations)"
print(conn.execute(q).fetchall())

[(0, 'Year', 'INTEGER', 0, None, 0), (1, 'Category', 'TEXT', 0, None, 0), (2, 'Nominee', 'TEXT', 0, None, 0), (3, 'Won', 'INTEGER', 0, None, 0), (4, 'Movie', 'TEXT', 0, None, 0), (5, 'Character', 'TEXT', 0, None, 0)]
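Each tuple returned by PRAGMA table_info has six fields: the column id, name, declared type, a not-null flag, the default value, and a primary-key flag. The sketch below labels those fields for readability. It builds its own small table in an in-memory database, so the table definition here is illustrative rather than the real schema.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE nominations (Year INTEGER, Category TEXT, Won INTEGER)")

# Field order of each row returned by PRAGMA table_info.
fields = ("cid", "name", "type", "notnull", "dflt_value", "pk")
schema = [dict(zip(fields, row))
          for row in demo_conn.execute("PRAGMA table_info(nominations)")]

for column in schema:
    print(column["name"], column["type"])
demo_conn.close()
```

Read this way, the output above says that Year and Won were stored as INTEGER and the remaining columns as TEXT, with no primary key or not-null constraints.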


In [30]:
q = "SELECT * FROM nominations LIMIT 10"
first_10_rows = conn.execute(q).fetchall()
first_10_rows

[(2010, 'Actor -- Leading Role', 'Javier Bardem', 0, 'Biutiful ', 'Uxbal'),
 (2010,
  'Actor -- Leading Role',
  'Jeff Bridges',
  0,
  'True Grit ',
  'Rooster Cogburn'),
 (2010,
  'Actor -- Leading Role',
  'Jesse Eisenberg',
  0,
  'The Social Network ',
  'Mark Zuckerberg'),
 (2010,
  'Actor -- Leading Role',
  'Colin Firth',
  1,
  "The King's Speech ",
  'King George VI'),
 (2010,
  'Actor -- Leading Role',
  'James Franco',
  0,
  '127 Hours ',
  'Aron Ralston'),
 (2010,
  'Actor -- Supporting Role',
  'Christian Bale',
  1,
  'The Fighter ',
  'Dicky Eklund'),
 (2010,
  'Actor -- Supporting Role',
  'John Hawkes',
  0,
  "Winter's Bone ",
  'Teardrop'),
 (2010,
  'Actor -- Supporting Role',
  'Jeremy Renner',
  0,
  'The Town ',
  'James Coughlin'),
 (2010,
  'Actor -- Supporting Role',
  'Mark Ruffalo',
  0,
  'The Kids Are All Right ',
  'Paul'),
 (2010,
  'Actor -- Supporting Role',
  'Geoffrey Rush',
  0,
  "The King's Speech ",
  'Lionel Logue')]
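With the data in SQLite, more targeted queries become possible. As a sketch, the example below selects the winners for one year using "?" placeholders and reads the result straight into a DataFrame with pandas.read_sql_query. It runs against a made-up three-row table in an in-memory database, not against nominations.db.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(":memory:")

# Three sample rows copied from the output above.
pd.DataFrame({
    "Year": [2010, 2010, 2010],
    "Category": ["Actor -- Leading Role", "Actor -- Leading Role",
                 "Actor -- Supporting Role"],
    "Nominee": ["Colin Firth", "James Franco", "Christian Bale"],
    "Won": [1, 0, 1],
}).to_sql("nominations", demo_conn, index=False)

# "?" placeholders let sqlite3 bind the values instead of pasting them into the SQL text.
q = "SELECT Nominee FROM nominations WHERE Won = ? AND Year = ? ORDER BY Nominee"
winners = pd.read_sql_query(q, demo_conn, params=(1, 2010))
print(winners)
demo_conn.close()
```

The same query text should work against the real connection, since it only uses the Year, Nominee, and Won columns shown in the schema.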