<span style="font-size: 36px;">W4111_Spring_2025_002 - Introduction to Databases:<br>All Tracks Data Engineering Phase 1<br>Professor Ferguson's Example</span>

# Overview

## Application Scenario

The following diagram depicts some major elements of the applications. 
- Both tracks implement a simple [data engineering](https://en.wikipedia.org/wiki/Data_engineering) project, specifically an [extract-transform-load](https://en.wikipedia.org/wiki/Extract,_transform,_load) application in a Jupyter notebook.
  - The input datasets are information from [IMDB](https://developer.imdb.com/non-commercial-datasets/) and information about [Game of Thrones](https://github.com/jeffreylancaster/game-of-thrones).
  - The data engineering tasks process and load information into three databases:
      - A local installation of MySQL
      - A cloud document database on [MongoDB Atlas](https://www.mongodb.com/atlas)
      - A graph database on [Neo4j Aura](https://neo4j.com/product/auradb/)
- The programming track implements a simple full-stack web application that supports searching and displaying information, and also adds creating and updating additional data.
- The non-programming track implements:
  - Additional data engineering tasks to build a [data warehouse](https://en.wikipedia.org/wiki/Data_warehouse) that provides the data in a format suitable for decision support/data science.
  - A very simple decision support/data insight application in a Jupyter notebook. The application queries the various databases to produce "views" that can be used for visualization.

| <img src="overall-system.jpg" width="1000px;"> |
| :---: |
| __Overall Application Concept__ |

## Data Engineering

The following diagram is an overview of data engineering concepts, and entity-relationship modeling in general.

| <img src="top_down_bottom_up.jpg" width="900px;"> |
| :---: |
| __Data Modeling__ |

The data engineering tasks for the project are primarily _bottom-up data analysis and engineering._ There are two datasets that are the input to the data engineering:
1. IMDB data in comma-separated value (CSV) files.
2. Game of Thrones data in [JSON](https://en.wikipedia.org/wiki/JSON) files.

This Jupyter notebook provides some examples for the first phase of data engineering:
1. The initial data loading.
2. Defining the "to be" data model.
3. Mapping from the "as is" data to the "to be" data.

Providing this information in the project example helps students understand the project tasks.
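
As a small illustration of mapping "as is" data to "to be" data, the sketch below converts one raw IMDB title row into a cleaner record. It assumes that the raw values arrive as strings, that `\N` marks a missing value, and that `genres` is a comma-separated list. The function names and the output field names are hypothetical and are not part of the project template.

```python
# Hypothetical helper: map one "as is" IMDB title row to a "to be" record.
MISSING = r"\N"  # Assumed marker for a missing value in the raw data.


def to_int_or_none(value):
    """Convert a raw string to int, treating the missing marker as None."""
    return None if value == MISSING else int(value)


def to_be_title(row):
    """Return a cleaned copy of a raw title row without modifying the input."""
    return {
        "tconst": row["tconst"],
        "primary_title": row["primaryTitle"],
        "start_year": to_int_or_none(row["startYear"]),
        "end_year": to_int_or_none(row["endYear"]),
        "runtime_minutes": to_int_or_none(row["runtimeMinutes"]),
        # A multi-valued column becomes a list, which could later be its own table.
        "genres": [] if row["genres"] == MISSING else row["genres"].split(","),
    }


as_is = {
    "tconst": "tt1480055",
    "primaryTitle": "Winter Is Coming",
    "startYear": "2011",
    "endYear": r"\N",
    "runtimeMinutes": "62",
    "genres": "Action,Adventure,Drama",
}
to_be = to_be_title(as_is)
print(to_be)
```

Keeping the mapping in a small function makes it easy to apply the same rule to every row and to test the rule on its own.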

| <img src="data-janitor.jpg" width="900px;"> |
| :---: |
| __Data Engineering__ |

# Initialization

## General Python Packages

In [2]:
import copy

In [3]:
import json

In [4]:
import pandas

In [5]:
import numpy

## MySQL

### Import Packages

In [1]:
# You should have installed the packages for previous homework assignments
# If not, you can %pip install the packages.
#
import pymysql
import sqlalchemy

### ipython-sql

In [6]:
# This fixes a version incompatibility problem between ipython-sql and other packages.
# You may not need to do this. If it causes problems, you can restart the kernel, reimport the packages above
# and try skipping this cell.
#
%config SqlMagic.style = '_DEPRECATED_DEFAULT'

In [7]:
# You have installed and configured ipython-sql for previous assignments.
# https://pypi.org/project/ipython-sql/
#
%load_ext sql

In [8]:
# Make sure that you set these values to the correct values for your installation and 
# configuration of MySQL
#
db_user = "root"
db_password = "dbuserdbuser"

In [9]:
# Create the URL for connecting to the database.
# Do not worry about the local_infile=1, I did that for wizard reasons that you should not have to use.
#
db_url = f"mysql+pymysql://{db_user}:{db_password}@localhost?local_infile=1"
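
One caveat with building the URL by string formatting: a password that contains characters such as `@` or `/` would break the URL. Below is a minimal sketch of one way to guard against that, using `urllib.parse.quote_plus` from the standard library. The helper name `build_mysql_url` and the sample password are made up for illustration.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host="localhost"):
    """Build a SQLAlchemy-style MySQL URL, percent-encoding the credentials."""
    return f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}@{host}?local_infile=1"


# Made-up password with characters that are not safe inside a URL.
example_url = build_mysql_url("root", "p@ss/word")
print(example_url)  # mysql+pymysql://root:p%40ss%2Fword@localhost?local_infile=1
```

For a simple password such as the one used in this notebook, the result is identical to the formatted string above.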

In [10]:
# Initialize ipython-sql
#
%sql $db_url

In [12]:
# Your answer MAY be different based on the databases and tables that you have created on your local MySQL instance.
#
%sql use db_book;
%sql show tables;

 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.
 * mysql+pymysql://root:***@localhost?local_infile=1
15 rows affected.


Tables_in_db_book
advisor
classroom
course
department
hw3_student
hw3_student_simple_1
hw3_student_simple_2
instructor
instructor_public
prereq


### PyMySQL

In [13]:
# We talked about the concept of a connection to databases in general, and also about connection libraries.
# pymysql is a common python connection library for MySQL.
#
default_mysql_conn = pymysql.connect(
    user=db_user,
    password=db_password,
    host="localhost",
    port=3306,
    cursorclass=pymysql.cursors.DictCursor,
    autocommit=True
)

In [14]:
# This cell shows how to programmatically query an SQL database from Python. The programming track will have to use
# code like this in their project. The non-programming track may have to use code like this in some complex scenarios.
#
cur = default_mysql_conn.cursor()

# execute() returns the number of rows; fetchall() returns the rows themselves.
row_count = cur.execute("select * from db_book.student where dept_name='Comp. Sci.'")
result = cur.fetchall()
result_df = pandas.DataFrame(result)
result_df

Unnamed: 0,ID,name,dept_name,tot_cred
0,128,Zhang,Comp. Sci.,102
1,12345,Shankar,Comp. Sci.,32
2,54321,Williams,Comp. Sci.,54
3,76543,Brown,Comp. Sci.,58
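
The query above embeds the department name directly in the SQL text. When the value comes from user input, as it would in a web application, a parameterized query is the safer pattern. The sketch below assumes a PyMySQL-style connection, whose placeholder syntax is `%s`. The function name `students_in_dept` is hypothetical.

```python
def students_in_dept(conn, dept_name):
    """Return the students in one department using a parameterized query."""
    sql = "select * from db_book.student where dept_name = %s"
    # The driver escapes the value; it is never spliced into the SQL string by hand.
    with conn.cursor() as cur:
        cur.execute(sql, (dept_name,))
        return cur.fetchall()
```

With the connection created earlier, a call would look like `students_in_dept(default_mysql_conn, "Comp. Sci.")`.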


### SQLAlchemy

In [15]:
# SQLAlchemy is a common foundational library for connecting to SQL databases. Pandas integrates with SQLAlchemy.
# SQLAlchemy also supports object-relational mapping, but we do not use those features.
#
from sqlalchemy import create_engine
default_engine = create_engine(db_url)

In [16]:
result_df = pandas.read_sql(
    "select * from db_book.student where dept_name='Comp. Sci.'", con=default_engine
)
result_df

Unnamed: 0,ID,name,dept_name,tot_cred
0,128,Zhang,Comp. Sci.,102.0
1,12345,Shankar,Comp. Sci.,32.0
2,54321,Williams,Comp. Sci.,54.0
3,76543,Brown,Comp. Sci.,58.0


## MongoDB

### Installation

Students have two choices for MongoDB:
1. Install a local instance of [MongoDB](https://www.mongodb.com/docs/manual/installation/) and [Compass.](https://www.mongodb.com/docs/compass/current/install/)
2. Use a [SaaS/cloud](https://en.wikipedia.org/wiki/Software_as_a_service) version of MongoDB and Compass. This gives students some experience with [Database-as-a-Service](https://en.wikipedia.org/wiki/Data_as_a_service), which may sound cool on job interviews.


Setup and configuration may be a little tricky in both cases. This is a 4xxx course at an elite university. Regardless of your major, any job with data is going to expect you to be able to set up, configure and connect to a database. There are a ton of online instructions, tutorials, videos, etc. on how to accomplish these tasks.


If you are using MongoDB Atlas you will need to create a cluster and get the connection URL. You will also need to have a user ID and password. Again, you should be able to figure this out. READ THE INSTRUCTIONS AND TUTORIALS.

| <img src="manual.jpg"> |
| :---: |
| __Read the Manual and Instructions__|

### Connect and Test

I use MongoDB Atlas. I created the account using my Columbia login UNI. I also wrote down the password.

In [21]:
# You may have to do a pip install.
#
# %pip install pymongo

In [22]:
import pymongo

In [17]:
# Put your user ID and password here.
#
mongodb_user_id = "dff9"
mongodb_password = "fquYGkveLj3XXZCt"

In [18]:
# Follow the instructions for getting a Python connection URL and code.
# Replace <db_password> with your MongoDB password
#
mongodb_url = "mongodb+srv://dff9:<db_password>@cluster0.t8qdk.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0"

In [23]:
# This is my connection URL.
#
mongodb_url = "mongodb+srv://dff9:fquYGkveLj3XXZCt@cluster0.t8qdk.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0"
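
Hard-coding a password in a notebook makes it easy to leak when the notebook is shared or pushed to GitHub. One common alternative, sketched here, reads the credentials from environment variables. The variable names `MONGODB_USER`, `MONGODB_PASSWORD` and `MONGODB_HOST`, and the helper name, are assumptions for illustration; use whatever names fit your setup.

```python
import os
from urllib.parse import quote_plus


def mongodb_url_from_env(env=os.environ):
    """Build a MongoDB Atlas URL from environment variables (hypothetical names)."""
    user = quote_plus(env["MONGODB_USER"])
    password = quote_plus(env["MONGODB_PASSWORD"])
    host = env["MONGODB_HOST"]  # For example: cluster0.example.mongodb.net
    return f"mongodb+srv://{user}:{password}@{host}/?retryWrites=true&w=majority"


# Demonstration with a plain dict standing in for the real environment.
demo_env = {
    "MONGODB_USER": "some_user",
    "MONGODB_PASSWORD": "not-a-real-password",
    "MONGODB_HOST": "cluster0.example.mongodb.net",
}
print(mongodb_url_from_env(demo_env))
```

Passing the environment as an argument keeps the function easy to try out without touching real credentials.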

In [32]:
# You can follow the tutorial.
#
from pymongo import MongoClient
import certifi

client = pymongo.MongoClient(mongodb_url, tlsCAFile=certifi.where())

# client = MongoClient(mongodb_url)

try:
    # start example code here

    # end example code here

    client.admin.command("ping")
    print("Connected successfully")

    # other application code

    # client.close()

except Exception as e:
    raise Exception(
        "The following error occurred: ", e)


Connected successfully


In [33]:
# Create a test database and collection.
#
database = client["test_database"]
collection = database["test_collection"]

In [34]:
# Add some data.
#
document_list = [
   { "name" : "Mongo's Burgers" },
   { "name" : "Mongo's Pizza" }
]

insert_count = collection.insert_many(document_list)

In [38]:
# Get the IDs of the inserted objects.
insert_count.inserted_ids

[ObjectId('67f0ebdbeb5a9851a3008dcd'), ObjectId('67f0ebdbeb5a9851a3008dce')]

In [43]:
# Find matching objects.
# 
result = collection.find(
    filter={"name": "Mongo's Pizza"},
    projection={"_id": 0}
)

In [44]:
list(result)

[{'name': "Mongo's Pizza"}]

You are fine for now.

## Neo4j

In [48]:
# You may have to do a pip install
#
# %pip install neo4j

You have two choices for getting access to Neo4j.
- [Local installation](https://neo4j.com/download/) of Neo4j Desktop.
- The SaaS version [Neo4j Aura](https://neo4j.com/download)

My examples use Neo4j Aura. Once again, READ THE INSTRUCTIONS to install, configure, create an instance and run the Movies Graph examples.


| <img src="instructions.jpg" width="500px"> |
| :---: |
| __Read the Manual and Instructions__|

In [57]:
# Wait 60 seconds before connecting using these details, or login to https://console.neo4j.io to validate the Aura Instance is available
# 
# I used download to get my connection information. I had to add the " for the strings
#
NEO4J_URI="neo4j+s://da38d60b.databases.neo4j.io"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="ISBZxGQpxtJJqlSjr6GNKeFv0UYZgdKB1k9dl3aRKVw"
AURA_INSTANCEID="da38d60b"
AURA_INSTANCENAME="Instance01"


In [59]:
from neo4j import GraphDatabase

# URI examples: "neo4j://localhost", "neo4j+s://xxx.databases.neo4j.io"
#
# I changed +s to +ssc to make it work. ssc means self signed certificate. You would NEVER do this in practice.
# This is Wizard Sh*t and just for this class.
#
NEO4J_URI = "neo4j+ssc://da38d60b.databases.neo4j.io"
AUTH = (NEO4J_USERNAME, NEO4J_PASSWORD)

with GraphDatabase.driver(NEO4J_URI, auth=AUTH) as driver:
    result = driver.verify_connectivity()
    print("Since this did not explode, you are cool.")

Since this did not explode, you are cool.


In [63]:
# These cells assume that you followed the tutorial for the Movie Database.
# I showed how to do this in lecture 10, which very few of you attended or watched.
#
from neo4j import GraphDatabase


class Neo4jAuraDB:
    def __init__(self, uri, user, password):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))
        print("Created driver.")

    def close(self):
        self.driver.close()

    def find_person_by_name(self, name):
        query = """
        MATCH (p:Person {name: $name})
        RETURN p
        """
        with self.driver.session() as session:
            result = session.run(query, name=name)
            return [record["p"] for record in result]

# Example usage
if __name__ == "__main__":
    db = Neo4jAuraDB(NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD)
    try:
        people = db.find_person_by_name("Tom Hanks")
        if people:
            print("Found:")
            for person in people:
                print(dict(person))
        else:
            print("Tom Hanks not found.")
    finally:
        db.close()


Created driver.
Found:
{'born': 1956, 'name': 'Tom Hanks'}


You are golden if you got to here.

# Loading the Data

## Load the GoT Title Basics and name_basics into MySQL

I am going to load basic ```title_basics``` information into a new database.

In [64]:
%sql drop schema if exists s25_project

 * mysql+pymysql://root:***@localhost?local_infile=1
7 rows affected.


[]

In [65]:
%sql create schema s25_project

 * mysql+pymysql://root:***@localhost?local_infile=1
1 rows affected.


[]

I am going to save the GoT ```title_basics```.

In [68]:
%pwd

'/Users/donald.ferguson/Dropbox/000/00-Current-Repos/W4111_Project_Template_V3/common_data_engineering'

In [69]:
# This code assumes you are running the notebook in the project cloned from GitHub and are in the
# correct directory.
#
default_engine = create_engine(db_url)
df = pandas.read_csv("../data/IMDB/got_title_basics.csv")
df.to_sql("got_title_basics", schema="s25_project", con=default_engine, index=False, if_exists="replace")

73

In [70]:
%sql use s25_project
%sql select * from got_title_basics

 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.
 * mysql+pymysql://root:***@localhost?local_infile=1
73 rows affected.


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt1480055,tvEpisode,Winter Is Coming,Winter Is Coming,0,2011,\N,62,"Action,Adventure,Drama"
1,tt1668746,tvEpisode,The Kingsroad,The Kingsroad,0,2011,\N,56,"Action,Adventure,Drama"
2,tt1829962,tvEpisode,Lord Snow,Lord Snow,0,2011,\N,58,"Action,Adventure,Drama"
3,tt1829963,tvEpisode,"Cripples, Bastards, and Broken Things","Cripples, Bastards, and Broken Things",0,2011,\N,56,"Action,Adventure,Drama"
4,tt1829964,tvEpisode,The Wolf and the Lion,The Wolf and the Lion,0,2011,\N,55,"Action,Adventure,Drama"
5,tt1837862,tvEpisode,A Golden Crown,A Golden Crown,0,2011,\N,53,"Action,Adventure,Drama"
6,tt1837863,tvEpisode,You Win or You Die,You Win or You Die,0,2011,\N,58,"Action,Adventure,Drama"
7,tt1837864,tvEpisode,The Pointy End,The Pointy End,0,2011,\N,59,"Action,Adventure,Drama"
8,tt1851397,tvEpisode,Fire and Blood,Fire and Blood,0,2011,\N,53,"Action,Adventure,Drama"
9,tt1851398,tvEpisode,Baelor,Baelor,0,2011,\N,57,"Action,Adventure,Drama"


In [71]:
df = pandas.read_csv("../data/IMDB/name_basics.csv")
# Use a separate table name so that this load does not overwrite got_title_basics.
df.to_sql("name_basics", schema="s25_project", con=default_engine, index=False, if_exists="replace")

351

In [72]:
%sql use s25_project
%sql select * from name_basics

 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.
 * mysql+pymysql://root:***@localhost?local_infile=1
351 rows affected.


nconst,primaryName,primaryProfession,knownForTitles,first_name,last_name,middle_name,nickname,title,suffix,birth_year,death_year
nm0389698,B.J. Hogg,"actor,music_department","tt0986233,tt1240982,tt0970411,tt0944947",B.J.,Hogg,,,,,1955.0,2020.0
nm0269923,Michael Feast,"actor,composer,soundtrack","tt0120879,tt0472160,tt0362192,tt0810823",Michael,Feast,,,,,1946.0,
nm0727778,David Rintoul,"actor,archive_footage","tt1139328,tt4786824,tt6079772,tt1007029",David,Rintoul,,,,,1948.0,
nm6729880,Chuku Modu,"actor,writer,producer","tt4154664,tt2674426,tt0944947,tt6470478",Chuku,Modu,,,,,1990.0,
nm0853583,Owen Teale,"actor,writer,archive_footage","tt0102797,tt0944947,tt0485301,tt0462396",Owen,Teale,,,,,1961.0,
nm0203801,Karl Davies,"actor,producer","tt3428912,tt7366338,tt0944947,tt12879632",Karl,Davies,,,,,1982.0,
nm8257864,Megan Parkinson,"actress,director,writer","tt0944947,tt26934073,tt4276618,tt6636246",Megan,Parkinson,,,,,,
nm0571654,Fintan McKeown,actor,"tt0112178,tt0110116,tt0166396,tt0944947",Fintan,McKeown,,,,,,
nm1528121,Philip McGinley,"actor,archive_footage","tt0944947,tt1446714,tt0053494,tt4015216",Philip,McGinley,,,,,1981.0,
nm0000980,Jim Broadbent,"actor,writer,producer","tt0203009,tt1431181,tt1007029,tt0425112",Jim,Broadbent,,,,,1949.0,


## Load the Character Information into MongoDB

In [73]:
character_info_file = "../data/GoT/character_relationship_scenes.json"
with open(character_info_file) as in_file:
    character_info = json.load(in_file)

In [75]:
character_info[0:2]

[{'_id': {'$oid': '65aa4b003c830ea6acd774d4'},
  'seasonNum': 1,
  'episodeNum': 1,
  'sceneNum': 9,
  'characterName': 'Waymar Royce',
  'killedBy': 'White Walker'},
 {'_id': {'$oid': '65aa4b003c830ea6acd774d4'},
  'seasonNum': 1,
  'episodeNum': 1,
  'sceneNum': 12,
  'characterName': 'Gared',
  'killedBy': 'White Walker'}]

In [76]:
# Get rid of the "_id" because that is old information.
#
for c in character_info:
    del c["_id"]
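
Note that `del c["_id"]` raises a `KeyError` if a document has no `_id`, for example when the cell is run a second time. Below is a slightly more forgiving sketch that also leaves the original list untouched. The helper name `without_id` is hypothetical.

```python
def without_id(documents):
    """Return copies of the documents with any "_id" field removed."""
    return [{k: v for k, v in doc.items() if k != "_id"} for doc in documents]


docs = [
    {"_id": {"$oid": "65aa4b003c830ea6acd774d4"}, "characterName": "Waymar Royce"},
    {"characterName": "Gared"},  # No "_id" at all; this is fine.
]
cleaned = without_id(docs)
print(cleaned)
```

Because the function builds new dictionaries, it can be re-run safely on the data exactly as it was loaded from the file.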

In [77]:
character_info[0:2]

[{'seasonNum': 1,
  'episodeNum': 1,
  'sceneNum': 9,
  'characterName': 'Waymar Royce',
  'killedBy': 'White Walker'},
 {'seasonNum': 1,
  'episodeNum': 1,
  'sceneNum': 12,
  'characterName': 'Gared',
  'killedBy': 'White Walker'}]

In [79]:
# Insert the documents into mongodb.
database = client["S25_GoT"]
collection = database["characters_scenes_relationships"]

In [80]:
result = collection.insert_many(character_info)

In [81]:
result.inserted_ids

[ObjectId('67f0f764eb5a9851a3008dcf'),
 ObjectId('67f0f764eb5a9851a3008dd0'),
 ObjectId('67f0f764eb5a9851a3008dd1'),
 ObjectId('67f0f764eb5a9851a3008dd2'),
 ObjectId('67f0f764eb5a9851a3008dd3'),
 ObjectId('67f0f764eb5a9851a3008dd4'),
 ObjectId('67f0f764eb5a9851a3008dd5'),
 ObjectId('67f0f764eb5a9851a3008dd6'),
 ObjectId('67f0f764eb5a9851a3008dd7'),
 ObjectId('67f0f764eb5a9851a3008dd8'),
 ObjectId('67f0f764eb5a9851a3008dd9'),
 ObjectId('67f0f764eb5a9851a3008dda'),
 ObjectId('67f0f764eb5a9851a3008ddb'),
 ObjectId('67f0f764eb5a9851a3008ddc'),
 ObjectId('67f0f764eb5a9851a3008ddd'),
 ObjectId('67f0f764eb5a9851a3008dde'),
 ObjectId('67f0f764eb5a9851a3008ddf'),
 ObjectId('67f0f764eb5a9851a3008de0'),
 ObjectId('67f0f764eb5a9851a3008de1'),
 ObjectId('67f0f764eb5a9851a3008de2'),
 ObjectId('67f0f764eb5a9851a3008de3'),
 ObjectId('67f0f764eb5a9851a3008de4'),
 ObjectId('67f0f764eb5a9851a3008de5'),
 ObjectId('67f0f764eb5a9851a3008de6'),
 ObjectId('67f0f764eb5a9851a3008de7'),
 ObjectId('67f0f764eb5a98
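
Keep in mind that `insert_many` always adds new documents, so running the insert cell a second time doubles the collection. One simple way to make the load repeatable, assuming the collection holds only this dataset, is to empty it before inserting. `delete_many` and `insert_many` are standard PyMongo collection methods; the function name `reload_collection` is made up for this sketch.

```python
def reload_collection(collection, documents):
    """Replace the entire contents of a collection with the given documents."""
    # An empty filter matches every document in the collection.
    deleted = collection.delete_many({})
    # insert_many expects a non-empty list of documents.
    inserted = collection.insert_many(documents)
    return deleted.deleted_count, len(inserted.inserted_ids)
```

A call such as `reload_collection(database["characters_scenes_relationships"], character_info)` returns how many documents were removed and how many were inserted.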

In [82]:
# Do the same for characters in general.
#
character_info_file = "../data/GoT/characters.json"
with open(character_info_file) as in_file:
    character_info = json.load(in_file)
    character_info = character_info["characters"]

In [83]:
character_info[0:2]

[{'characterName': 'Addam Marbrand',
  'characterLink': '/character/ch0305333/',
  'actorName': 'B.J. Hogg',
  'actorLink': '/name/nm0389698/'},
 {'characterName': 'Aegon Targaryen',
  'houseName': 'Targaryen',
  'royal': True,
  'parents': ['Elia Martell', 'Rhaegar Targaryen'],
  'siblings': ['Rhaenys Targaryen', 'Jon Snow'],
  'killedBy': ['Gregor Clegane']}]

In [84]:
collection = database["characters"]
result = collection.insert_many(character_info)
result.inserted_ids

[ObjectId('67f0f82eeb5a9851a3008ec3'),
 ObjectId('67f0f82eeb5a9851a3008ec4'),
 ObjectId('67f0f82eeb5a9851a3008ec5'),
 ObjectId('67f0f82eeb5a9851a3008ec6'),
 ObjectId('67f0f82eeb5a9851a3008ec7'),
 ObjectId('67f0f82eeb5a9851a3008ec8'),
 ObjectId('67f0f82eeb5a9851a3008ec9'),
 ObjectId('67f0f82eeb5a9851a3008eca'),
 ObjectId('67f0f82eeb5a9851a3008ecb'),
 ObjectId('67f0f82eeb5a9851a3008ecc'),
 ObjectId('67f0f82eeb5a9851a3008ecd'),
 ObjectId('67f0f82eeb5a9851a3008ece'),
 ObjectId('67f0f82eeb5a9851a3008ecf'),
 ObjectId('67f0f82eeb5a9851a3008ed0'),
 ObjectId('67f0f82eeb5a9851a3008ed1'),
 ObjectId('67f0f82eeb5a9851a3008ed2'),
 ObjectId('67f0f82eeb5a9851a3008ed3'),
 ObjectId('67f0f82eeb5a9851a3008ed4'),
 ObjectId('67f0f82eeb5a9851a3008ed5'),
 ObjectId('67f0f82eeb5a9851a3008ed6'),
 ObjectId('67f0f82eeb5a9851a3008ed7'),
 ObjectId('67f0f82eeb5a9851a3008ed8'),
 ObjectId('67f0f82eeb5a9851a3008ed9'),
 ObjectId('67f0f82eeb5a9851a3008eda'),
 ObjectId('67f0f82eeb5a9851a3008edb'),
 ObjectId('67f0f82eeb5a98

You are golden for now.

## Load the Character Information into Aura

In [125]:
# Rerunning the slightly modified code from above.
#
class Neo4jAuraDB:
    def __init__(self, uri, user, password):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))
        # print("Created driver.")

    def close(self):
        self.driver.close()

    def create_character_node(self, label, properties):
        with self.driver.session() as session:
            session.execute_write(self._create_node, label, properties)

    def count_characters(self):
        # Driver.execute_query manages its own session, so no explicit session is needed.
        query = "match (n:GoT:Character) return count(n)"
        records, summary, keys = self.driver.execute_query(query)
        # Loop through results and do something with them
        for record in records:
            print(record.data())
        

    def _create_node(self, tx, label, properties):
        
        # Build Cypher property string: key1: $key1, key2: $key2, ...
        prop_string = ", ".join([f"{k}: ${k}" for k in properties.keys()])
        
        # Final Cypher query
        query = f"CREATE (c:{label} {{ {prop_string} }})"

        # print(query)
        
        tx.run(query, **properties)
    
    def insert_character(self, c):
        # Some of the fields that interest us might be missing or None.
        # So, we just copy the fields that have a value.
        #
        fields = ["characterName", "characterLink", "actorName", "actorLink", "houseName",
                "royal", "kingsguard"
                ]
        new_c = dict()
        for f in fields:
           v = c.get(f, None)
           if v:
               new_c[f] = v
        self.create_character_node("GoT:Character", new_c)
        # print("This seems to have worked.")
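
To make the string building in `_create_node` concrete, the sketch below reproduces the same formatting as a standalone function and prints the query it generates. Because labels and property names are interpolated into the query text (only the values are passed as parameters), the sketch also checks that the names look like plain identifiers. That check, the regular expression, and the function name `build_create_query` are additions for illustration and are not part of the class above.

```python
import re

# Assumed rule for illustration: names are letters, digits and underscores only.
_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_create_query(label, properties):
    """Build the parameterized CREATE statement for a node."""
    # A label such as "GoT:Character" is really two labels joined by ":".
    for name in label.split(":") + list(properties):
        if not _NAME.match(name):
            raise ValueError(f"Unsafe name in query: {name!r}")
    prop_string = ", ".join(f"{k}: ${k}" for k in properties)
    return f"CREATE (c:{label} {{ {prop_string} }})"


query = build_create_query("GoT:Character", {"characterName": "Jon Snow", "royal": True})
print(query)
# CREATE (c:GoT:Character { characterName: $characterName, royal: $royal })
```

The values themselves never appear in the query text; the driver receives them separately as parameters.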

     

In [126]:
# Now let's insert the characters.
#
db = Neo4jAuraDB(NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD)
try:
    for c in character_info:
        db.insert_character(c)
finally:
    # Release the driver's connections. The next cell creates a new instance.
    db.close()

In [None]:
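# imdb_data_dir is not defined earlier in this notebook. This assumes the IMDB files
# are in the same directory used above for got_title_basics.csv.
#
imdb_data_dir = "../data/IMDB"
%ls $imdb_data_dir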

In [127]:
# How many were created?
db = Neo4jAuraDB(NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD)
db.count_characters()



{'count(n)': 389}


In [None]:
%sql drop database if exists s25_project

In [None]:
%sql create database s25_project

We are simply going to load the data for now. The data files do not have a header row identifying the
columns. So, we have to define the column names ourselves and turn off header inference when reading the files.
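
As a sketch of reading a headerless file, `pandas.read_csv` also accepts the column names directly through its `names` parameter, which avoids a separate assignment to `.columns`. The two rows below are made-up sample data, and treating `\N` as a missing value through `na_values` is an assumption about how the files encode nulls.

```python
import io

import pandas

NAME_BASICS_COLUMNS = ["nconst", "primaryName", "birthYear", "deathYear",
                       "primaryProfession", "knownForTitles"]

# Made-up rows standing in for a headerless CSV file.
sample_csv = (
    'nm0000001,Example Actor,1950,\\N,actor,"tt0000001,tt0000002"\n'
    'nm0000002,Example Writer,1960,2020,writer,tt0000003\n'
)

sample_df = pandas.read_csv(
    io.StringIO(sample_csv),
    header=None,                # The first line is data, not column names.
    names=NAME_BASICS_COLUMNS,  # Supply the column names explicitly.
    na_values=["\\N"],          # Assumed marker for missing values.
)
print(sample_df)
```

Without `header=None`, pandas would treat the first data row as the header and silently drop it from the data.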

In [None]:
name_basics_df = pandas.read_csv(imdb_data_dir + '/got_imdb_name_basics.csv', 
                                  header=None)

In [None]:
name_basics_df.columns = ["nconst", "primaryName", "birthYear", "deathYear", 
                          "primaryProfession", "knownForTitles"]

In [None]:
name_basics_df.to_sql('name_basics', schema='s25_project',
                      con=default_engine, index=False, if_exists='append')

In [None]:
%sql use s25_project

In [None]:
%sql select * from name_basics limit 5;

In [None]:
title_basics_df = pandas.read_csv(imdb_data_dir + '/got_imdb_title_basics.csv', header=None)

In [None]:
title_basics_df.columns = ["tconst", "titleType", "primaryTitle",
                          "originalTitle", "isAdult", "startYear", "endYear",
                          "runtimeMinutes", "genres"]

In [None]:
title_basics_df.to_sql('title_basics', schema='s25_project',
                      con=default_engine, index=False, if_exists='replace')

In [None]:
title_principals_df = pandas.read_csv(imdb_data_dir + '/got_imdb_title_principals.csv', header=None)

In [None]:
title_principals_df.head(5)

In [None]:
title_principals_df.columns = ["nconst", "tconst", "ordering", "category", "job", "characters"]

In [None]:
title_principals_df.to_sql('title_principals', schema='s25_project',
                      con=default_engine, index=False, if_exists='replace')

In [None]:
%sql select * from title_principals limit 5;

In [None]:
title_ratings_df = pandas.read_csv(imdb_data_dir + '/got_imdb_title_ratings.csv', header=None)

In [None]:
title_ratings_df.columns = ['tconst', "averageRating", "noOfVotes"]

In [None]:
title_ratings_df.to_sql('title_ratings', schema='s25_project',
                      con=default_engine, index=False, if_exists='replace')

In [None]:
%sql select * from title_ratings limit 5;