# 1.1 Movies Database

Take the movies dataset and turn it into a single SQLite database, with one table for each CSV file in the dataset.

In [61]:
import sqlite3
import pandas as pd

files = ['names', 'movies', 'ratings', 'title_principals']

movies_set = sqlite3.connect('data/movies_set.sqlite')

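def sqlize_csv(file):
    """Load data/<file>.csv and write it to a table named <file>."""
    df = pd.read_csv(f'data/{file}.csv')
    df.to_sql(
        name=file,
        con=movies_set,
        if_exists='replace',  # overwrite the table on re-runs
        index=True,           # keep the DataFrame index as an `index` column
    )
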
for file in files:
    print('Saving', file)
    sqlize_csv(file)

Saving names
Saving movies
Saving ratings
Saving title_principals


In [62]:
# c = movies_set.cursor()
# c.execute("SELECT name FROM sqlite_master WHERE type='table';")
# c.fetchall()
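
One way to confirm that every CSV ended up as its own table is to query `sqlite_master`, as the commented-out cell above hints. The sketch below is only an illustration: it uses an in-memory database with empty placeholder tables instead of `data/movies_set.sqlite`, and the `id` column is an assumption.

```python
import sqlite3

# In-memory stand-in for data/movies_set.sqlite (assumption: placeholder tables only).
con = sqlite3.connect(':memory:')
for name in ['names', 'movies', 'ratings', 'title_principals']:
    con.execute(f'CREATE TABLE "{name}" (id INTEGER)')

# sqlite_master has one row per database object; keep only the tables.
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

Against the real database, the same `SELECT` can be passed to `pd.read_sql_query` with `con=movies_set`.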

# 1.2 Queries

**1.2.1** Use a single query to pull the original title of movies with a budget above $5m

**1.2.2** Use a query to pull the English-language films with the word `war` in their title

**1.2.3** Left join the average ratings from the `ratings` table onto the `movies` table, so you can relate budget to rating. Hint: use a subquery.

In [63]:
movies = pd.read_csv('data/movies.csv')

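def get_currency(x):
    # The currency is the first space-separated token; missing budgets are not strings.
    if isinstance(x, str):
        return x.split(' ')[0]
    return ''

movies['Currency'] = movies['budget'].apply(get_currency)

currencies = movies['Currency'].unique().tolist()

def clean_currency(x):
    # Strip every known currency prefix and the thousands separators,
    # leaving a numeric string that astype(float) can parse.
    if isinstance(x, str):
        xt = x
        for currency in currencies:
            xt = xt.replace(currency, '')
        return xt.replace(',', '')
    return x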


In [64]:
movies['budget'] = movies['budget'].apply(clean_currency).astype(float)
movies['budget'] = movies['budget'].fillna(0.)
movies.to_sql(
    name='movies',
    con=movies_set, 
    schema=None, 
    if_exists='replace', 
    index=True, 
)
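
The cleaning step above discards the currency once the amount is parsed, so a later filter such as `budget > 5000000` compares amounts in different currencies. A minimal sketch of an alternative that keeps both parts is shown below. The helper `split_budget` and the sample strings are hypothetical, and the sketch assumes budgets are written as a currency token, a space, then the amount.

```python
def split_budget(raw):
    """Return (currency, amount) for a string such as '$ 2250'.

    Hypothetical helper; assumes the format '<currency> <amount>'.
    Missing values (anything that is not a string) map to ('', 0.0).
    """
    if not isinstance(raw, str):
        return ('', 0.0)
    currency, sep, amount = raw.partition(' ')
    if not sep:
        # No space found: treat the whole string as the amount.
        currency, amount = '', raw
    return (currency, float(amount.replace(',', '')))

# Illustrative inputs only, not rows taken from movies.csv.
parsed = [split_budget(v) for v in ['$ 2250', 'ITL 3,000,000', '45000', None]]
print(parsed)
```

With the currency kept in its own column, a query can restrict the budget comparison to a single currency instead of mixing them.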

**1.2.1**

In [65]:
def query_movies(q):
    return pd.read_sql_query(q, con=movies_set)

In [66]:
five_mil = query_movies("SELECT title FROM movies WHERE budget > 5000000")
five_mil

Unnamed: 0,title
0,Metropolis
1,Napoleone
2,La regola del gioco
3,Kate & Leopold
4,La cittadella degli eroi
...,...
9641,Aakashaganga II
9642,Munthiri Monchan
9643,Upin & Ipin: Keris Siamang Tunggal
9644,Kaithi


**1.2.2**

In [67]:
q = """
SELECT title 
FROM movies 
    WHERE language = 'English' 
    AND title LIKE '% war %'
"""

eng_war = query_movies(q)
eng_war

Unnamed: 0,title
0,The War Against Mrs. Hadley
1,Linea di fuoco - War zone
2,Afganistan - The last war bus (L'ultimo bus di...
3,The War Bride
4,The War Within
5,Der Fluss war einst ein Mensch
6,The War of 1812
7,Cinematic Titanic: War of the Insects
8,The Civil War on Drugs
9,The War I Knew
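
The pattern `'% war %'` needs a space on both sides of the word, so titles that begin or end with "war" are skipped, and `language = 'English'` skips films that list several languages. The sketch below shows one possible workaround on toy rows in an in-memory database: pad the title with spaces before matching, and match the language with `LIKE`. The sample titles and the comma-separated language format are assumptions, not rows from the dataset.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE movies (title TEXT, language TEXT)')
con.executemany('INSERT INTO movies VALUES (?, ?)', [
    ('War Horse', 'English'),           # word at the start
    ('The War Bride', 'English'),       # word in the middle
    ('Men at War', 'English, French'),  # word at the end, several languages
    ('Warcraft', 'English'),            # "war" only as a prefix: must not match
    ('Der Krieg', 'German'),            # wrong language
])

# Padding the title with spaces lets '% war %' also match at either end.
# SQLite's LIKE is case-insensitive for ASCII letters by default.
q = """
SELECT title
FROM movies
WHERE language LIKE '%English%'
  AND ' ' || title || ' ' LIKE '% war %'
ORDER BY title
"""
war_titles = [row[0] for row in con.execute(q)]
print(war_titles)
```

Titles where the word touches punctuation, such as "War: ...", would still be missed by this pattern.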


**1.2.3**

In [73]:
ratings = pd.read_csv('data/ratings.csv')
ratings
# movies

Unnamed: 0,imdb_title_id,weighted_average_vote,total_votes,mean_vote,median_vote,votes_10,votes_9,votes_8,votes_7,votes_6,...,females_30age_avg_vote,females_30age_votes,females_45age_avg_vote,females_45age_votes,top1000_voters_rating,top1000_voters_votes,us_voters_rating,us_voters_votes,non_us_voters_rating,non_us_voters_votes
0,tt0000009,5.9,154,5.9,6.0,12,4,10,43,28,...,5.7,13.0,4.5,4.0,5.7,34.0,6.4,51.0,6.0,70.0
1,tt0000574,6.1,589,6.3,6.0,57,18,58,137,139,...,6.2,23.0,6.6,14.0,6.4,66.0,6.0,96.0,6.2,331.0
2,tt0001892,5.8,188,6.0,6.0,6,6,17,44,52,...,5.8,4.0,6.8,7.0,5.4,32.0,6.2,31.0,5.9,123.0
3,tt0002101,5.2,446,5.3,5.0,15,8,16,62,98,...,5.5,14.0,6.1,21.0,4.9,57.0,5.5,207.0,4.7,105.0
4,tt0002130,7.0,2237,6.9,7.0,210,225,436,641,344,...,7.3,82.0,7.4,77.0,6.9,139.0,7.0,488.0,7.0,1166.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85850,tt9908390,5.3,398,5.5,6.0,13,9,26,65,104,...,5.7,11.0,5.0,2.0,5.5,12.0,6.3,22.0,5.3,214.0
85851,tt9911196,7.7,724,7.9,8.0,65,139,288,170,42,...,8.0,47.0,7.3,30.0,7.0,6.0,6.8,13.0,7.7,388.0
85852,tt9911774,7.9,265,7.8,8.0,63,29,61,61,31,...,,,,,1.0,1.0,,,2.0,2.0
85853,tt9914286,6.4,194,9.4,10.0,176,0,2,2,1,...,,,7.0,1.0,4.0,3.0,1.7,5.0,5.8,5.0


In [77]:
q = """
SELECT movies.title, r.weighted_average_vote, movies.budget
FROM movies
LEFT JOIN (
    SELECT imdb_title_id, weighted_average_vote
    FROM ratings
) AS r ON r.imdb_title_id = movies.imdb_title_id
"""

joined = query_movies(q)
joined

Unnamed: 0,title,weighted_average_vote,budget
0,Miss Jerry,5.9,0.0
1,The Story of the Kelly Gang,6.1,2250.0
2,Den sorte drøm,5.8,0.0
3,Cleopatra,5.2,45000.0
4,L'Inferno,7.0,0.0
...,...,...,...
85850,Le lion,5.3,0.0
85851,De Beentjes van Sint-Hildegard,7.7,0.0
85852,Padmavyuhathile Abhimanyu,7.9,0.0
85853,Sokagin Çocuklari,6.4,0.0
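
If `ratings` held several rows per title, the subquery from the hint would be the place to average them before joining. The sketch below shows that shape on toy rows in an in-memory database. The values are invented for illustration; only the column names `imdb_title_id`, `title`, `budget` and `weighted_average_vote` come from the cells above.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE movies (imdb_title_id TEXT, title TEXT, budget REAL)')
con.execute('CREATE TABLE ratings (imdb_title_id TEXT, weighted_average_vote REAL)')
con.executemany('INSERT INTO movies VALUES (?, ?, ?)', [
    ('tt1', 'A', 1000.0), ('tt2', 'B', 0.0), ('tt3', 'C', 500.0)])
con.executemany('INSERT INTO ratings VALUES (?, ?)', [
    ('tt1', 6.0), ('tt1', 8.0), ('tt2', 5.0)])  # tt3 has no rating

# The subquery collapses ratings to one average per title; the LEFT JOIN
# then keeps every movie, with NULL where no rating exists.
q = """
SELECT m.title, m.budget, r.avg_vote
FROM movies AS m
LEFT JOIN (
    SELECT imdb_title_id, AVG(weighted_average_vote) AS avg_vote
    FROM ratings
    GROUP BY imdb_title_id
) AS r ON r.imdb_title_id = m.imdb_title_id
ORDER BY m.title
"""
rows = con.execute(q).fetchall()
print(rows)
```

Putting `movies` on the left side keeps films without ratings in the result, which matters when relating budget to rating.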




# 2. Baseball Database

The [Baseball Database](http://www.seanlahman.com/baseball-archive/statistics/) is available as a SQLite file. Download it for these exercises.

**2.1** Which player has hit the most home runs?

**2.2** Is there a relation between how many home runs a player hit in a year and his salary that year? Pull both columns together in a single query.
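
As a starting point, both exercises can be sketched on toy tables. The table and column names below (`Batting` with `playerID`, `yearID`, `stint`, `HR`, and `Salaries` with `playerID`, `yearID`, `salary`) are assumptions about the downloaded file and should be checked against its actual schema. The rows are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(':memory:')
# Assumed schema; verify against the downloaded database before reusing.
con.execute('CREATE TABLE Batting (playerID TEXT, yearID INTEGER, stint INTEGER, HR INTEGER)')
con.execute('CREATE TABLE Salaries (playerID TEXT, yearID INTEGER, salary REAL)')
con.executemany('INSERT INTO Batting VALUES (?, ?, ?, ?)', [
    ('p1', 2000, 1, 30), ('p1', 2001, 1, 10), ('p1', 2001, 2, 15),
    ('p2', 2000, 1, 40), ('p2', 2001, 1, 5)])
con.executemany('INSERT INTO Salaries VALUES (?, ?, ?)', [
    ('p1', 2000, 1e6), ('p1', 2001, 2e6), ('p2', 2000, 3e6)])

# 2.1: sum home runs over a player's whole career and keep the largest total.
top = con.execute("""
SELECT playerID, SUM(HR) AS career_hr
FROM Batting
GROUP BY playerID
ORDER BY career_hr DESC
LIMIT 1
""").fetchone()

# 2.2: total home runs per player and year (a player may have several
# stints in one season), joined to that year's salary.
hr_salary = con.execute("""
SELECT b.playerID, b.yearID, b.hr, s.salary
FROM (
    SELECT playerID, yearID, SUM(HR) AS hr
    FROM Batting
    GROUP BY playerID, yearID
) AS b
JOIN Salaries AS s
  ON s.playerID = b.playerID AND s.yearID = b.yearID
ORDER BY b.playerID, b.yearID
""").fetchall()

print(top)
print(hr_salary)
```

The inner join drops player-seasons without a recorded salary, which is usually what a home-run-versus-salary comparison needs.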

