# Guided Project: Preparing data for SQLite

So far, we've learned how to write SQL queries to interact with existing databases. In this guided project, you'll learn how to clean a CSV dataset and add it to a SQLite database. If you're new to our guided projects or to Jupyter Notebook in general, you can learn more [here](https://www.dataquest.io/mission/162/guided-project-using-jupyter-notebook). You can find the solutions to this guided project [here](https://github.com/dataquestio/solutions/blob/master/Mission215Solutions.ipynb).

We'll work with data on Academy Award nominations, which can be downloaded [here](https://www.aggdata.com/awards/oscar). The Academy Awards, also known as the Oscars, are an annual ceremony that recognizes achievements in the film industry. There are many award categories, and the members of the academy vote every year to decide which artist or film should receive each award. The categories have changed over the years, and you can learn more about when categories were added on [Wikipedia](http://bit.ly/1Rig7Gs).

Here are the columns in the dataset, `academy_awards.csv`:

* `Year` - the year of the awards ceremony.
* `Category` - the category of award the nominee was nominated for.
* `Nominee` - the person nominated for the award.
* `Additional Info` - this column contains additional info like:
  * the movie the nominee participated in.
  * the character the nominee played (for acting awards).
* `Won?` - this column contains either `YES` or `NO` depending on if the nominee won the award.

In [1]:
import pandas as pd

academy = pd.read_csv('academy_awards.csv', 
                      encoding='ISO-8859-1')

academy.head()

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,2010 (83rd),Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},NO,,,,,,
1,2010 (83rd),Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},NO,,,,,,
2,2010 (83rd),Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},NO,,,,,,
3,2010 (83rd),Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},YES,,,,,,
4,2010 (83rd),Actor -- Leading Role,James Franco,127 Hours {'Aron Ralston'},NO,,,,,,


In [2]:
# There are 6 unnamed columns at the end. 
# Use the value_counts method to explore if any of them have valid values that we need.

for i in range(5, 11):
    colname = 'Unnamed: '+str(i)
    
    print(colname)
    print(academy[colname].value_counts())
    print('#'*30, '\n')

Unnamed: 5
*                                                                                                               7
 D.B. "Don" Keele and Mark E. Engebretson has resulted in the over 20-year dominance of constant-directivity    1
 error-prone measurements on sets. [Digital Imaging Technology]"                                                1
 resilience                                                                                                     1
 discoverer of stars                                                                                            1
Name: Unnamed: 5, dtype: int64
############################## 

Unnamed: 6
*                                                                   9
 sympathetic                                                        1
 direct radiator bass style cinema loudspeaker systems. [Sound]"    1
 flexibility and water resistance                                   1
Name: Unnamed: 6, dtype: int64
############################## 



In [3]:
# You'll notice that the Additional Info column contains a few different formatting styles.
# Start brainstorming ways to clean this column up.

academy['Additional Info'].value_counts().iloc[:10]

Metro-Goldwyn-Mayer              60
Walt Disney, Producer            57
Warner Bros.                     42
John Williams                    37
France                           35
Alfred Newman                    34
Italy                            26
Paramount                        24
Gordon Hollingshead, Producer    22
Edith Head                       22
Name: Additional Info, dtype: int64

The dataset is messy, and you may have noticed many inconsistencies that make it hard to work with. Most columns aren't formatted consistently, and consistent formatting matters a great deal once we query the data with SQL. Other columns convey different information depending on the awards category of the row.

In the SQL and Databases: Intermediate course, we worked with a subset of the same dataset. This subset contained only the nominations from years 2001 to 2010 and only the following awards categories:

* `Actor -- Leading Role`
* `Actor -- Supporting Role`
* `Actress -- Leading Role`
* `Actress -- Supporting Role`

Let's filter our Dataframe to the same subset so it's more manageable.

#### instructions

Before we filter the data, let's clean up the `Year` column by selecting just the first 4 digits of each value, which excludes the value in parentheses:

* Use Pandas vectorized string methods to select just the first 4 characters in each string.
  * E.g. `df["Year"].str[0:2]` returns a Series containing just the first 2 characters for each element in the `Year` column.

* Assign this new Series to the `Year` column to overwrite the original column.
* Convert the `Year` column to the `int64` data type using `astype`. Make sure to reassign the integer Series object back to the `Year` column in the Dataframe or the changes won't be reflected.

Use conditional filtering to select only the rows from the Dataframe where the `Year` column is larger than 2000. Assign the new filtered Dataframe to `later_than_2000`.

Use conditional filtering to select only the rows from `later_than_2000` where the `Category` matches one of the 4 awards we're interested in.

* Create a list of strings named `award_categories` with the following strings:
  * `Actor -- Leading Role`
  * `Actor -- Supporting Role`
  * `Actress -- Leading Role`
  * `Actress -- Supporting Role`

Use the `isin` method in the conditional filter to return all rows in a column that match any of the values in a list of strings.

* Pass in `award_categories` to the `isin` method to return all matching rows: `later_than_2000[later_than_2000["Category"].isin(award_categories)]`
* Assign the resulting Dataframe to `nominations`.

In [4]:
# inst. 1
# Keep the first 4 characters (the year itself) and convert them to integers.
academy['Year'] = academy['Year'].str[0:4].astype('int64')

In [5]:
academy.Year.dtype

dtype('int64')

In [6]:
academy.Year.head()

0    2010
1    2010
2    2010
3    2010
4    2010
Name: Year, dtype: int64

In [7]:
# inst. 2, 3

later_than_2000 = academy[academy['Year'] > 2000]

award_categories = ['Actor -- Leading Role',
                   'Actor -- Supporting Role',
                   'Actress -- Leading Role',
                   'Actress -- Supporting Role']

nominations = later_than_2000[later_than_2000["Category"]\
                              .isin(award_categories)]

In [8]:
nominations.head()

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,2010,Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},NO,,,,,,
1,2010,Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},NO,,,,,,
2,2010,Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},NO,,,,,,
3,2010,Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},YES,,,,,,
4,2010,Actor -- Leading Role,James Franco,127 Hours {'Aron Ralston'},NO,,,,,,


#### instructions

Use the [Series method map](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) to replace all NO values with 0 and all YES values with 1.

* Select the Won? column from nominations.
* Then create a dictionary where each key is a value we want to replace and each value is the corresponding replacement value.
  * The following dictionary replace_dict = { True: 1, False: 0 } would replace all True values with 1 and all False values with 0.
* Call the map function on the Series object and pass in the dictionary you created.
* Finally, reassign the new Series object to the Won? column in nominations.

Create a new column Won that contains the values from the Won? column.

* Select the Won? column and assign it to the Won column. Both columns should be in the Dataframe still.

Use the drop method to remove the extraneous columns.
As the required parameter, pass in a list of strings containing the following values:

* Won?
* Unnamed: 5
* Unnamed: 6
* Unnamed: 7
* Unnamed: 8
* Unnamed: 9
* Unnamed: 10

Set the axis parameter to 1 when calling the drop method.

Assign the resulting Dataframe to final_nominations.

In [9]:
nominations = nominations.copy()  # explicit copy avoids SettingWithCopyWarning
nominations['Won?'] = nominations['Won?'].map({'NO': 0, 'YES': 1})

In [10]:
nominations['Won'] = nominations['Won?']


In [11]:
nominations.Won.head()

0    0
1    0
2    0
3    1
4    0
Name: Won, dtype: int64

In [12]:
final_nominations = nominations.drop(['Won?',
                                     'Unnamed: 5',
                                     'Unnamed: 6',
                                     'Unnamed: 7',
                                     'Unnamed: 8',
                                     'Unnamed: 9',
                                     'Unnamed: 10'],
                                    axis = 1)

In [13]:
final_nominations.head()

Unnamed: 0,Year,Category,Nominee,Additional Info,Won
0,2010,Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},0
1,2010,Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},0
2,2010,Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},0
3,2010,Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},1
4,2010,Actor -- Leading Role,James Franco,127 Hours {'Aron Ralston'},0


Use [vectorized string methods](http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods) to clean up the `Additional Info` column:

* Select the `Additional Info` column and strip the single quote and closing brace (`"'}"`) using the `rstrip` method. Assign the resulting Series object to `additional_info_one`.
* Split `additional_info_one` on the string `" {'"` using the `split` method and assign to `additional_info_two`. Each value in this Series object should be a list containing the movie name first, then the character name.
* Access the first element from each list in `additional_info_two` using vectorized string methods and assign to `movie_names`. Here's what the code looks like: `additional_info_two.str[0]`
* Access the second element from each list in `additional_info_two` using vectorized string methods and assign to `characters`.

Assign the Series `movie_names` to the `Movie` column in the `final_nominations` Dataframe.

Assign the Series `characters` to the `Character` column in the `final_nominations` Dataframe.

Use the `head` method to preview the first few rows to make sure the values in the `Character` and `Movie` columns resemble the `Additional Info` column.

Drop the `Additional Info` column using the `drop` method.



In [14]:
"Biutiful {'Uxbal'}".rstrip("'}")

"Biutiful {'Uxbal"

In [15]:
final_nominations['Additional Info'].head()

0                        Biutiful {'Uxbal'}
1             True Grit {'Rooster Cogburn'}
2    The Social Network {'Mark Zuckerberg'}
3      The King's Speech {'King George VI'}
4                127 Hours {'Aron Ralston'}
Name: Additional Info, dtype: object

In [16]:
# Vectorized equivalent of calling rstrip("'}") on every value.
additional_info_one = final_nominations['Additional Info'].str.rstrip("'}")

In [44]:
additional_info_one

0                                      Biutiful {'Uxbal
1                           True Grit {'Rooster Cogburn
2                  The Social Network {'Mark Zuckerberg
3                    The King's Speech {'King George VI
4                              127 Hours {'Aron Ralston
5                            The Fighter {'Dicky Eklund
6                              Winter's Bone {'Teardrop
7                             The Town {'James Coughlin
8                         The Kids Are All Right {'Paul
9                      The King's Speech {'Lionel Logue
10                         The Kids Are All Right {'Nic
11                                  Rabbit Hole {'Becca
12                                  Winter's Bone {'Ree
13              Black Swan {'Nina Sayers/The Swan Queen
14                               Blue Valentine {'Cindy
15                       The Fighter {'Charlene Fleming
16                  The King's Speech {'Queen Elizabeth
17                             The Fighter {'Ali

In [17]:
# Each value becomes a [movie, character] list.
additional_info_two = additional_info_one.str.split(" {'")
movie_names = additional_info_two.str[0]
characters = additional_info_two.str[1]

In [18]:
final_nominations['Movie'] = movie_names
final_nominations['Character'] = characters

final_nominations.head()

Unnamed: 0,Year,Category,Nominee,Additional Info,Won,Movie,Character
0,2010,Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},0,Biutiful,Uxbal
1,2010,Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},0,True Grit,Rooster Cogburn
2,2010,Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},0,The Social Network,Mark Zuckerberg
3,2010,Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},1,The King's Speech,King George VI
4,2010,Actor -- Leading Role,James Franco,127 Hours {'Aron Ralston'},0,127 Hours,Aron Ralston


In [19]:
final_nominations.drop(['Additional Info'], axis=1, inplace=True)

### Cleaning completed

Now that our Dataframe is cleaned up, let's write these records to a SQL database. We can use the Pandas Dataframe method `to_sql` to create a new table in a database we specify. This method has 2 required parameters:

* `name` - a string corresponding to the name of the table we want created. The rows from our Dataframe will be added to this table after it's created.
* `con` - the Connection instance representing the database we want to add to.

Behind the scenes, Pandas creates a table and uses the first parameter to name it. Pandas uses the data types of each of the columns in the Dataframe to create a SQLite schema for this table. Since SQLite uses integer values to represent Booleans, it was important to convert the Won column to the integer values 0 and 1. We also converted the Year column to the integer data type, so that this column will have the appropriate type in our table. Here's the mapping for our columns from the Pandas data type to the SQLite data type:

column|Pandas data type|SQLite data type
:---:|:---:|:---:
Year|int64|integer
Won|int64|integer
Category|object|text
Nominee|object|text
Movie|object|text
Character|object|text


After creating the table, Pandas creates an `INSERT` query and runs it to insert the values into the table. We can customize the behavior of the `to_sql` method using its parameters. For example, if we want to append rows to an existing SQLite table instead of creating a new one, we can set the `if_exists` parameter to `"append"`. By default, `if_exists` is set to `"fail"`, so `to_sql` raises an error and inserts no rows if we specify a table name that already exists.

If we're inserting a large number of records into SQLite and want to break up the insert into chunks, we can use the `chunksize` parameter to set the number of rows inserted each time.
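As a rough illustration of `if_exists` and `chunksize`, the sketch below writes a small made-up Dataframe to an in-memory SQLite database several times. The table name `demo` and the sample rows are hypothetical and stand in for the project data.

```python
import sqlite3

import pandas as pd

# Hypothetical sample rows, standing in for final_nominations.
demo = pd.DataFrame({
    "Year": [2010, 2010, 2009],
    "Nominee": ["Colin Firth", "Jeff Bridges", "Jeff Bridges"],
    "Won": [1, 0, 1],
})

conn = sqlite3.connect(":memory:")  # temporary database, nothing written to disk

# The first call creates the table; chunksize=2 inserts the rows two at a time.
demo.to_sql("demo", conn, index=False, chunksize=2)

# A second call with the default if_exists="fail" raises ValueError.
try:
    demo.to_sql("demo", conn, index=False)
    second_write_failed = False
except ValueError:
    second_write_failed = True

# if_exists="append" adds the rows to the existing table instead.
demo.to_sql("demo", conn, index=False, if_exists="append")

row_count = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
print(second_write_failed, row_count)
conn.close()
```

The table ends up with the three original rows plus the three appended ones, and the failed call in between leaves it untouched.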

Since we're creating a database from scratch, **we need a database file to connect to before we can export our data**. To create a new database file, we use the sqlite3 library to connect to a file path that doesn't exist yet. **If the file we specified doesn't exist, sqlite3 will create it for us and treat it as a SQLite database file**.

SQLite doesn't require a [special file extension](http://stackoverflow.com/questions/808499/what-is-the-best-extension-name-sqlite-database-files), and you can use any extension you'd like when creating a SQLite database. We generally use the .db extension, which isn't commonly used by other applications.
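To see the file being created, here is a minimal sketch that connects to a path in a throwaway directory. The file name `example.db` and the table `t` are hypothetical; the table is created only so that something is written to the new file.

```python
import os
import sqlite3
import tempfile

# Use a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "example.db")  # hypothetical file name
    existed_before = os.path.exists(db_path)

    conn = sqlite3.connect(db_path)  # no error, even though the file is missing
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.commit()
    conn.close()

    existed_after = os.path.exists(db_path)

print(existed_before, existed_after)
```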

#### instructions

Create the SQLite database nominations.db and connect to it.

* Import sqlite3 into the environment.
* Use the sqlite3 method connect to connect to the database file nominations.db.
  * Since it doesn't exist in our current directory, it will be automatically created.
  * Assign the returned Connection instance to conn.

Use the Dataframe method to_sql to export final_nominations to nominations.db.

* For the first parameter, set the table name to "nominations".
* For the second parameter, pass in the Connection instance.
* Set the index parameter to False.

In [20]:
import sqlite3

conn = sqlite3.connect('nominations_to_prepare.db')
final_nominations.to_sql('nominations_to_prepare', conn, index=False)

* Import sqlite3 into the environment.
* Create a Connection instance using the sqlite3 method connect to connect to your database file.
* Explore the database to make sure the nominations table matches our Dataframe.
  * Return and print the schema using pragma table_info(). The following schema should be returned:
    * Year: Integer.
    * Category: Text.
    * Nominee: Text.
    * Won: Integer.
    * Movie: Text.
    * Character: Text.

  * Return and print the first 10 rows using the SELECT and LIMIT statements.

* Once you're done, use the Connection method close to close the connection to the database.

In [21]:
pd.read_sql('select * from nominations_to_prepare limit 5', conn)

Unnamed: 0,Year,Category,Nominee,Won,Movie,Character
0,2010,Actor -- Leading Role,Javier Bardem,0,Biutiful,Uxbal
1,2010,Actor -- Leading Role,Jeff Bridges,0,True Grit,Rooster Cogburn
2,2010,Actor -- Leading Role,Jesse Eisenberg,0,The Social Network,Mark Zuckerberg
3,2010,Actor -- Leading Role,Colin Firth,1,The King's Speech,King George VI
4,2010,Actor -- Leading Role,James Franco,0,127 Hours,Aron Ralston


In [22]:
pd.read_sql('pragma table_info(nominations_to_prepare)', conn)

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,Year,INTEGER,0,,0
1,1,Category,TEXT,0,,0
2,2,Nominee,TEXT,0,,0
3,3,Won,INTEGER,0,,0
4,4,Movie,TEXT,0,,0
5,5,Character,TEXT,0,,0


In [23]:
conn.close()

For next steps, explore the rest of our original dataset `academy_awards.csv` and brainstorm how to fix the rest of the dataset:

* The awards categories in older ceremonies were different than the ones we have today. What relevant information should we keep from older ceremonies?
* What are all the different formatting styles that the `Additional Info` column contains? Can we use tools like regular expressions to capture these patterns and clean them up?
  * The nominations for the `Art Direction` category have lengthy values for Additional Info. What information is useful and how do we extract it?
  * Many values in Additional Info don't contain the character name the actor or actress played. Should we toss out character name altogether as we expand our data? What tradeoffs do we make by doing so?
* What's the best way to handle awards ceremonies that included movies from 2 years?
  * E.g. see 1927/28 (1st) in the Year column.

### The awards categories in older ceremonies were different than the ones we have today. What relevant information should we keep from older ceremonies?

In [19]:
academy = pd.read_csv('academy_awards.csv', 
                      encoding='ISO-8859-1')

#academy.head()

* We can keep the 'Acting (other)' label to categorize actors.
  * Even though it doesn't carry the additional info (sex, leading or supporting role), the information seems worth keeping.

* As with 'Acting (other)', the non-actor values in the Category column seem to follow this pattern:
  * {main part} ({additional info})
  * We can keep the 'main part' text and drop the additional info in parentheses.
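One way to keep only the 'main part' is to strip a trailing parenthetical with a regular expression. The sketch below uses Python's `re` module on a few labels taken from the Category column; the helper name `main_part` is made up for illustration.

```python
import re

# A few category labels copied from the Category column.
categories = [
    "Documentary (Feature)",
    "Scientific and Technical (Bonner Medal)",
    "Engineering Effects (archaic category)",
    "Best Picture",
]

# Match one trailing parenthesised note, plus the whitespace before it.
trailing_note = re.compile(r"\s*\([^()]*\)\s*$")


def main_part(category):
    """Return the category label without its trailing parenthetical."""
    return trailing_note.sub("", category)


cleaned = [main_part(c) for c in categories]
print(cleaned)
```

Note the tradeoff: labels such as `Music (Scoring)` and `Music (Song)` both collapse to `Music`, so the parenthetical may be worth keeping in a separate column instead of discarding it.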

In [9]:
academy.Category.unique()

array(['Actor -- Leading Role', 'Actor -- Supporting Role',
       'Actress -- Leading Role', 'Actress -- Supporting Role',
       'Animated Feature Film', 'Art Direction', 'Cinematography',
       'Costume Design', 'Directing', 'Documentary (Feature)',
       'Documentary (Short Subject)', 'Film Editing',
       'Foreign Language Film', 'Makeup', 'Music (Scoring)',
       'Music (Song)', 'Best Picture', 'Short Film (Animated)',
       'Short Film (Live Action)', 'Sound', 'Sound Editing',
       'Visual Effects', 'Writing', 'Honorary Award',
       'Irving G. Thalberg Memorial Award',
       'Scientific and Technical (Scientific and Engineering Award)',
       'Scientific and Technical (Technical Achievement Award)',
       'Scientific and Technical (Bonner Medal)',
       'Jean Hersholt Humanitarian Award',
       'Scientific and Technical (Gordon E. Sawyer Award)',
       'Scientific and Technical (Academy Award of Merit)',
       'Scientific and Technical (Special Awards)',
       '

### What are all the different formatting styles that the Additional Info column contains? Can we use tools like regular expressions to capture these patterns and clean them up?
* The nominations for the Art Direction category have lengthy values for Additional Info. What information is useful and how do we extract it?
* Many values in Additional Info don't contain the character name the actor or actress played. Should we toss out character name altogether as we expand our data? What tradeoffs do we make by doing so?

### Formats in 'Additional Info' column
Square brackets mark a placeholder for a single string; the brackets themselves are not part of the value.

* ACTOR category
  * [moviename] {[charactername]}
  * [moviename]

* NON-ACTOR category
  * one person
    * [realname]
      * real-name
      * real, name
    * [realname], [role]
    * [realname], [corporation]
    * [role] by [realname]
  
  * people
    * [role1] by [realname1]; [role2] by [realname2]
    * [realname1] and [realname2], [role]
    * [role1]: [realname1], [realname2]; [role2]: [realname3]

* NON-PERSON category
  * Country
  * Corporation
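
The two ACTOR shapes above could be captured with a single regular expression. This is a minimal sketch using Python's `re` module; the function name `split_actor_info` is made up for illustration, and it assumes each value is a string in one of those two shapes, with the character name wrapped in `{'...'}` as in the rows previewed earlier.

```python
import re

# [moviename] {'[charactername]'} with the character part optional.
actor_info = re.compile(r"^(?P<movie>.+?)(?:\s*\{'(?P<character>.*)'\})?$")


def split_actor_info(value):
    """Return (movie, character); character is None when it is absent."""
    match = actor_info.match(value)
    if match is None:  # e.g. an empty string
        return None, None
    return match.group("movie"), match.group("character")


print(split_actor_info("The King's Speech {'King George VI'}"))
print(split_actor_info("Wings"))
```

The same pattern can be passed to the Pandas method `Series.str.extract`, which turns the named groups into `movie` and `character` columns.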




#### 1. What information is useful and how do we extract it?
* We can keep the values that match the 'NON-ACTOR' patterns above, using regular expressions to extract them.

#### 2. Many values in Additional Info don't contain the character name the actor or actress played. Should we toss out character name altogether as we expand our data? What tradeoffs do we make by doing so?
* Limited information seems better than none, so we can accept the tradeoff and keep Additional Info values that lack a character name.

### What's the best way to handle awards ceremonies that included movies from 2 years?

* To keep the 'Year' column as an INTEGER type, use the earlier year (e.g. 1927/28 --> 1927).
  * Add an 'Order' column holding the ordinal number of the ceremony to supplement the 'Year' column.
  * e.g. Year = 1927, Order = 1
  * e.g. Year = 1929, Order = 2
  * etc ...
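
The Year and Order values could be pulled out of the original strings in one pass. The sketch below is one possible approach, assuming every value looks like either `2010 (83rd)` or `1927/28 (1st)`; the function name `parse_year` is hypothetical.

```python
import re

# Matches values such as "2010 (83rd)" and "1927/28 (1st)".
year_pattern = re.compile(
    r"^(?P<year>\d{4})(?:/\d{2})?\s*\((?P<order>\d+)(?:st|nd|rd|th)\)$"
)


def parse_year(value):
    """Return (year, order) as integers, using the earlier year of a pair."""
    match = year_pattern.match(value)
    if match is None:
        raise ValueError(f"Unexpected Year format: {value!r}")
    return int(match.group("year")), int(match.group("order"))


print(parse_year("1927/28 (1st)"))  # (1927, 1)
print(parse_year("2010 (83rd)"))    # (2010, 83)
```

Raising an error on an unexpected format, instead of silently returning a default, makes any value that breaks the assumption visible early.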

In [22]:
academy.tail(10)

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
10127,1927/28 (1st),Writing,"George Marion, Jr.[NOTE: This nomination was n...",,NO,,,,,,
10128,1927/28 (1st),Writing,"To Charles Chaplin, for acting, writing, direc...",,YES,,,,,,
10129,1927/28 (1st),Honorary Award,"To Warner Bros., for producing The Jazz Singer...",,YES,,,,,,
10130,1927/28 (1st),Honorary Award,"To Charles Chaplin, for acting, writing, direc...",,YES,,,,,,
10131,1927/28 (1st),Engineering Effects (archaic category),Ralph Hammeras [NOTE: This nomination was not ...,,NO,,,,,,
10132,1927/28 (1st),Engineering Effects (archaic category),Roy Pomeroy,Wings,YES,,,,,,
10133,1927/28 (1st),Engineering Effects (archaic category),Nugent Slaughter [NOTE: Though no specific tit...,,NO,,,,,,
10134,1927/28 (1st),Unique and Artistic Picture (archaic category),Fox,Sunrise,YES,,,,,,
10135,1927/28 (1st),Unique and Artistic Picture (archaic category),Metro-Goldwyn-Mayer,The Crowd,NO,,,,,,
10136,1927/28 (1st),Unique and Artistic Picture (archaic category),Paramount Famous Lasky,Chang,NO,,,,,,
