<span style="font-size: 36px;">W4111_Spring_2025_002 - Introduction to Databases:<br>Non-Programming Track Project Phase 1<br>Professor Ferguson's Example</br></span>

__Notes to DFF:__
- Remove MongoDB and Neo4j for Phase 1.
- Use prebuilt CSV file for GoT data.

# Overview

## Application Scenario

There are many types of applications that use databases. Two of the most common types are:
1. __Interactive/Operational Applications__ allow end users to create, retrieve, update and delete information. Online banking, e-commerce, course registration, etc. are examples. The programming track students implement a simple, [full-stack](https://en.wikipedia.org/wiki/Solution_stack), interactive application.<br><br>
2. __Data Insight/Decision Support Applications__ are primarily for analysts and business professionals. The applications enable complex query and data navigation to provide information useful for managing and improving business processes, products, etc. Non-programming track students implement complex queries that provide datasets useful for visualization. Visualization is often the first step in a data insight project.

Both interactive/operational and data insight applications require database design and query skills. If there is preexisting data in files or external systems, both application scenarios require [data engineering](https://en.wikipedia.org/wiki/Data_engineering).

The following diagram depicts some major elements of the applications. 
- Both tracks implement a simple [data engineering](https://en.wikipedia.org/wiki/Data_engineering) project, specifically an [extract-transform-load](https://en.wikipedia.org/wiki/Extract,_transform,_load) application in a Jupyter notebook.
  - The input datasets are information from [IMDB](https://developer.imdb.com/non-commercial-datasets/) and information about [Game of Thrones](https://github.com/jeffreylancaster/game-of-thrones).
  - The data engineering tasks process and load information into three databases:
      - A local installation of MySQL
      - A cloud document database on [MongoDB Atlas](https://www.mongodb.com/atlas)
      - A graph database on [Neo4j Aura](https://neo4j.com/product/auradb/)
- The non-programming track implements:
  - Additional data engineering tasks to build a [data warehouse](https://en.wikipedia.org/wiki/Data_warehouse) that provides the data in a format suitable for decision support/data science.
  - A very simple decision support/data insight application in a Jupyter notebook. The application queries the various databases to produce "views" that can be used for visualization.
- The programming track implements a web application that supports simple retrieval and data navigation.

| <img src="overall-system.jpg" width="1000px;"> |
| :---: |
| __Overall Application Concept__ |

## Data Engineering

The following diagram is an overview of data engineering concepts, and entity-relationship modeling in general.

| <img src="top_down_bottom_up.jpg" width="900px;"> |
| :---: |
| __Data Modeling__ |

The data engineering tasks for the project are primarily _bottom-up data analysis and engineering._ There are two datasets that are the input to the data engineering:
1. IMDB data in comma-separated value (CSV) files.
2. _Game of Thrones_ data in [JSON](https://en.wikipedia.org/wiki/JSON) files.

This Jupyter notebook provides some examples for the first phase of data engineering:
1. The initial data loading.
2. Defining the "to be" data model.
3. How to map from the "as is" data to the "to be" data.

Providing this information in the project example helps students understand the project tasks.

| <img src="data-janitor.jpg" width="900px;"> |
| :---: |
| __Data Engineering__ |

## Data Insight Application

A separate document will explain the data analysis and insight tasks and provide examples. The purpose of the current project phase, and of this notebook, is to ensure that students have a working environment.

## Interactive Web Application

A separate document will explain the web application tasks. The purpose of the current project phase, and of this notebook, is to ensure that students have a working environment.

## This Notebook

This notebook simply enables students to get started with the environment, test their setup, etc.

## Prerequisites

To complete this notebook, the students use the same tools used in homework 0, 1 and 2.

# Initialization

## General Python Packages

In [1]:
import copy

In [2]:
import json

In [3]:
import pandas

In [4]:
import numpy

In [5]:
# You should have installed the packages for previous homework assignments
#
import pymysql
import sqlalchemy

## MySQL

### ipython-sql

In [6]:
%config SqlMagic.style = '_DEPRECATED_DEFAULT'

In [7]:
# You have installed and configured ipython-sql for previous assignments.
# https://pypi.org/project/ipython-sql/
#
%load_ext sql

In [8]:
# Make sure that you set these values to the correct values for your installation and 
# configuration of MySQL
#
db_user = "root"
db_password = "dbuserdbuser"

In [9]:
# Create the URL for connecting to the database.
# Do not worry about the local_infile=1, I did that for wizard reasons that you should not have to use.
#
db_url = f"mysql+pymysql://{db_user}:{db_password}@localhost?local_infile=1"
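The URL above works because the sample password contains only letters. As a minimal sketch, assuming your own password might contain characters such as `@`, `:` or `/`, the credentials can be percent-encoded before they go into the URL. The helper name `build_mysql_url` and the sample password are hypothetical and not part of the notebook.

```python
from urllib.parse import quote_plus


def build_mysql_url(user: str, password: str, host: str = "localhost") -> str:
    """Build a mysql+pymysql URL, percent-encoding the user name and password."""
    # quote_plus escapes characters such as '@', ':' and '/' that would
    # otherwise be interpreted as URL delimiters.
    return f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}@{host}"


# Hypothetical password, chosen only to show the escaping.
example_url = build_mysql_url("root", "p@ss:word/1")
print(example_url)
```

A password made only of letters and digits passes through unchanged, so the helper yields the same URL as the f-string for the notebook's sample credentials (without the `local_infile` option).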

In [10]:
# Initialize ipython-sql
#
%sql $db_url

In [12]:
# Your answer will be different based on the databases that you have created on your local MySQL instance.
#
%sql use db_book;
%sql show tables;

 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.
 * mysql+pymysql://root:***@localhost?local_infile=1
12 rows affected.


Tables_in_db_book
advisor
classroom
course
department
instructor
instructor_public
prereq
section
student
takes


### PyMySQL

In [13]:
default_mysql_conn = pymysql.connect(
    user=db_user,
    password=db_password,
    host="localhost",
    port=3306,
    cursorclass=pymysql.cursors.DictCursor,
    autocommit=True
)

In [15]:
cur = default_mysql_conn.cursor()

# execute() returns the number of rows; fetchall() returns the rows themselves.
row_count = cur.execute("select * from db_book.student where dept_name='Comp. Sci.';")
result = cur.fetchall()
result_df = pandas.DataFrame(result)
result_df

Unnamed: 0,ID,name,dept_name,tot_cred
0,128,Zhang,Comp. Sci.,102
1,12345,Shankar,Comp. Sci.,32
2,54321,Williams,Comp. Sci.,54
3,76543,Brown,Comp. Sci.,58


### SQLAlchemy

In [161]:
from sqlalchemy import create_engine
default_engine = create_engine(db_url)

In [165]:
result_df = pandas.read_sql(
    "select * from db_book.student where dept_name='Comp. Sci.'", con=default_engine
)
result_df

Unnamed: 0,ID,name,dept_name,tot_cred
0,128,Zhang,Comp. Sci.,102.0
1,12345,Shankar,Comp. Sci.,32.0
2,54321,Williams,Comp. Sci.,54.0
3,76543,Brown,Comp. Sci.,58.0


# OPTIONAL -- Download and Preprocess IMDB

## Overview

Students do not need to implement the download tasks. The notebook provides the examples just for educational purposes. You can consider these tasks as an example of _extract._

The project directory ```/data``` contains subdirectories for the IMDB data and the _Game of Thrones_ datasets.

The original IMDB datasets are __HUGE__ and probably too large for most student laptops.

To make the work easier for students, the professor preprocessed the datasets to make them smaller. Basically, the IMDB datasets only contain information from IMDB that is "related" to _Game of Thrones_ and the actors in _Game of Thrones._

This optional section demonstrates how to download and preprocess the data.

## Download the Data

The following code will download the datasets from the [IMDB public dataset](https://developer.imdb.com/non-commercial-datasets/) site.

The code also shows how to "do things in parallel." The downloads are limited by network bandwidth, not by CPU. So, we can download more quickly by using [Python packages and support for asynchronous](https://docs.python.org/3/library/asyncio-task.html#coroutines), 
[coroutine](https://en.wikipedia.org/wiki/Coroutine)-based concurrent execution. Other languages support [true threads](https://en.wikipedia.org/wiki/Thread_(computing)). Python also has threads, but in the standard CPython interpreter the global interpreter lock (GIL) prevents them from running Python code in parallel. The notebook could have used multiprocessing and forked processes, but the implementation chose coroutines.

__Note:__ You may have to install the packages.

In [None]:
# %pip install aiohttp
# %pip install aiofiles
# asyncio is part of the Python standard library and does not need to be installed.

In [23]:
import aiohttp
import asyncio
import os
import time

# The URLs to the IMDB data files.
# I manually entered the URLs from reading the IMDB dataset page. I could have used HTML scraping to get the information.
#
imdb_data_files = [
    {
        "file_name": "name_basics.tsv",
        "url": "https://datasets.imdbws.com/name.basics.tsv.gz",
        "data_directory": "../data/IMDB/"
    },
    {
        "file_name": "title_basics.tsv",
        "url": "https://datasets.imdbws.com/title.basics.tsv.gz",
        "data_directory": "../data/IMDB/"
    },
    {
        "file_name": "title_principals.tsv",
        "url": "https://datasets.imdbws.com/title.principals.tsv.gz",
        "data_directory": "../data/IMDB/"
    },
    {
        "file_name": "title_ratings.tsv",
        "url": "https://datasets.imdbws.com/title.ratings.tsv.gz",
        "data_directory": "../data/IMDB/"
    }
]

async def download_csv(url: str, save_path: str):
    """
    Asynchronously download a CSV file from a URL and save it to a specified path.

    :param url: The URL of the CSV file.
    :param save_path: The file path where the CSV will be saved.
    """
    try:
        start_time = time.time()
        
        print(f"Starting download of {save_path} at {start_time}")

        # This call creates an HTTP connection/session. Operations on the session can go asynchronous. So, we use the 
        # asynchronous IO package and mark the call async.
        #
        async with aiohttp.ClientSession() as session:

            # The GET operation will download the file. This uses the network and also goes asynchronous.
            #
            async with session.get(url) as response:

                # 200 means success and we have started to stream the download.
                #
                if response.status == 200:

                    # Asynchronously read the data. Note, this consumes a lot of memory and may overwhelm some personal computers.
                    # A real extract would read in chunks and save in chunks to avoid memory overload.
                    # 
                    content = await response.read()

                    # We save the file synchronously. Since these writes are local and not using the network,
                    # we do not bother with the asynchronous write. 
                    #
                    with open(save_path, 'wb') as file:
                        file.write(content)
                        
                    end_time = time.time()
                    elapsed_time = end_time - start_time
                    print(f"Downloaded and saved {save_path} at time {end_time}")
                    print(f"Download {save_path} elapsed time {elapsed_time}")
                else:
                    print(f"Failed to download {url}, HTTP status: {response.status}")
    except Exception as e:
        print(f"Error downloading {url}: {e}")

async def main():
    """Main function to download multiple CSV files."""
    # Define URLs and corresponding save paths

    tasks = []

    for f in imdb_data_files:
        # DFF TODO -- I should really use os.path.join()
        save_path = f["data_directory"] + f["file_name"] + ".gz"
        save_url = f["url"]

        tasks.append(download_csv(save_url, save_path))

    # Run tasks concurrently
    await asyncio.gather(*tasks)
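The comment in `download_csv` notes that a real extract would read and save in chunks. The following is a minimal sketch of that idea using only in-memory streams from the standard library. The function name `copy_in_chunks` and the 1 MB chunk size are assumptions for illustration. With aiohttp, the response body stream would play the role of `source`.

```python
import io


def copy_in_chunks(source, destination, chunk_size: int = 1024 * 1024) -> int:
    """Copy a binary stream to another stream one chunk at a time.

    Only one chunk is held in memory at any moment, so the peak memory use
    is bounded by chunk_size rather than by the size of the download.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:          # An empty read means the stream is exhausted.
            break
        destination.write(chunk)
        total += len(chunk)
    return total


# Stand-ins for the network response and the output file.
fake_download = io.BytesIO(b"x" * 2_500_000)
fake_file = io.BytesIO()
bytes_copied = copy_in_chunks(fake_download, fake_file)
print(bytes_copied)
```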

Now we are going to execute the downloads. The order of the print statements hints at the concurrent execution: all four downloads start before any of them finishes.

Coroutines switch only at `await` points, not at `print` calls, so the interleaving of the prints reflects when each download finishes and can look a little odd.
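To make the ordering easier to reason about, here is a small, self-contained sketch that replaces the network with `asyncio.sleep`. The names `fake_download`, `run_all` and `demo` are hypothetical. It is meant as an illustration of how `asyncio.gather` interleaves coroutines, not as the notebook's download code.

```python
import asyncio


async def fake_download(name: str, seconds: float, log: list) -> str:
    """Pretend to download something; the sleep stands in for network waiting."""
    log.append(f"start {name}")
    await asyncio.sleep(seconds)   # The coroutine yields control here.
    log.append(f"done {name}")
    return name


async def run_all(log: list) -> list:
    # gather() starts both coroutines and returns results in argument order.
    return await asyncio.gather(
        fake_download("slow", 0.2, log),
        fake_download("fast", 0.1, log),
    )


def demo():
    """Run the sketch from a plain Python script (no event loop running yet)."""
    log = []
    results = asyncio.run(run_all(log))
    return results, log
```

In a plain script, `demo()` returns the results in argument order, while the log shows both tasks starting before either finishes. Inside a Jupyter notebook an event loop is already running, so you would write `await run_all(log)` in a cell instead of calling `asyncio.run`.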

In [24]:
await main()

Starting download of ../data/IMDB/name_basics.tsv.gz at 1739699100.9520352
Starting download of ../data/IMDB/title_basics.tsv.gz at 1739699100.9521558
Starting download of ../data/IMDB/title_principals.tsv.gz at 1739699100.952199
Starting download of ../data/IMDB/title_ratings.tsv.gz at 1739699100.952256
Downloaded and saved ../data/IMDB/title_ratings.tsv.gz at time 1739699102.4722161
Download ../data/IMDB/title_ratings.tsv.gz elapsed time 1.5199601650238037
Downloaded and saved ../data/IMDB/title_basics.tsv.gz at time 1739699128.790411
Download ../data/IMDB/title_basics.tsv.gz elapsed time 27.838255167007446
Downloaded and saved ../data/IMDB/name_basics.tsv.gz at time 1739699134.306066
Download ../data/IMDB/name_basics.tsv.gz elapsed time 33.35403084754944
Downloaded and saved ../data/IMDB/title_principals.tsv.gz at time 1739699147.1196318
Download ../data/IMDB/title_principals.tsv.gz elapsed time 46.16743278503418


## Uncompress

We now need to uncompress the files. The download format was gzip and I want to uncompress to the TSV.

In [25]:
import aiofiles
import gzip

# The URLs to the IMDB data files.
# I manually entered the URLs.
# DFF TODO -- Having the list manually repeated in each program is a bad idea.
# DFF TODO -- Show how to use HTML scraping to get the files?
#
imdb_data_files = [
    {
        "file_name": "name_basics.tsv",
        "url": "https://datasets.imdbws.com/name.basics.tsv.gz",
        "data_directory": "../data/IMDB/"
    },
    {
        "file_name": "title_basics.tsv",
        "url": "https://datasets.imdbws.com/title.basics.tsv.gz",
        "data_directory": "../data/IMDB/"
    },
    {
        "file_name": "title_principals.tsv",
        "url": "https://datasets.imdbws.com/title.principals.tsv.gz",
        "data_directory": "../data/IMDB/"
    },
    {
        "file_name": "title_ratings.tsv",
        "url": "https://datasets.imdbws.com/title.ratings.tsv.gz",
        "data_directory": "../data/IMDB/"
    }
]


async def decompress_gzip_async(file_path: str, output_path: str):
    """
    Asynchronously decompress a gzipped file and save the decompressed content to a specified path.

    :param file_path: Path to the gzipped file.
    :param output_path: Path to save the decompressed file.
    """
    try:
        start_time = time.time()
        print(f"Starting decompress of {file_path} at {start_time}")
        async with aiofiles.open(file_path, 'rb') as gzipped_file:
            compressed_content = await gzipped_file.read()

        uncompressed_content = gzip.decompress(compressed_content)

        async with aiofiles.open(output_path, 'wb') as output_file:
            await output_file.write(uncompressed_content)

        end_time = time.time()
        elapsed_time = end_time - start_time
        # Print the output path, not the file object, so the message is readable.
        print(f"Decompressed and saved {output_path} at time {end_time}")
        print(f"Decompress {file_path} elapsed time {elapsed_time}")
    except Exception as e:
        print(f"Error decompressing {file_path}: {e}")


async def uncompress_main():
    """Main function to asynchronously decompress multiple gzipped files."""
    files_to_decompress = []

    tasks = []

    for f in imdb_data_files:
        read_path = f["data_directory"] + f["file_name"] + ".gz"
        save_path = f["data_directory"] + f["file_name"]
        tasks.append(decompress_gzip_async(read_path, save_path))

    await asyncio.gather(*tasks)
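The function above reads each compressed file fully into memory and then decompresses it in one call, which can be heavy for multi-gigabyte files. As an alternative sketch, assuming a synchronous approach is acceptable, the standard library can decompress in a streaming fashion. The function name `gunzip_streaming` is hypothetical.

```python
import gzip
import shutil


def gunzip_streaming(gz_path: str, out_path: str, chunk_size: int = 1024 * 1024) -> None:
    """Decompress gz_path to out_path without loading the whole file into memory."""
    # gzip.open returns a file-like object that decompresses as it is read,
    # and copyfileobj moves the data across in chunk_size pieces.
    with gzip.open(gz_path, "rb") as compressed, open(out_path, "wb") as out_file:
        shutil.copyfileobj(compressed, out_file, chunk_size)
```

For example, `gunzip_streaming("../data/IMDB/title_ratings.tsv.gz", "../data/IMDB/title_ratings.tsv")` would produce the same TSV file as the asynchronous version.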


In [26]:
await uncompress_main()

Starting decompress of ../data/IMDB/name_basics.tsv.gz at 1739699215.3770502
Starting decompress of ../data/IMDB/title_basics.tsv.gz at 1739699215.377103
Starting decompress of ../data/IMDB/title_principals.tsv.gz at 1739699215.37712
Starting decompress of ../data/IMDB/title_ratings.tsv.gz at 1739699215.37713
Decompressed and saved <aiofiles.threadpool.binary.AsyncBufferedIOBase object at 0x116670290> wrapping <_io.BufferedWriter name='../data/IMDB/title_ratings.tsv'> at time 1739699225.4190578
Deompress ../data/IMDB/title_ratings.tsv.gz elapsed time 10.041927814483643
Decompressed and saved <aiofiles.threadpool.binary.AsyncBufferedIOBase object at 0x114f59070> wrapping <_io.BufferedWriter name='../data/IMDB/title_basics.tsv'> at time 1739699225.419895
Deompress ../data/IMDB/title_basics.tsv.gz elapsed time 10.042791843414307
Decompressed and saved <aiofiles.threadpool.binary.AsyncBufferedIOBase object at 0x114f5cc20> wrapping <_io.BufferedWriter name='../data/IMDB/name_basics.tsv'> at

Let's look at what we have.

In [27]:
%ls -l ../data/IMDB/*.tsv

-rw-r--r--@ 1 donald.ferguson  staff   872079477 Feb 16 04:47 ../data/IMDB/name_basics.tsv
-rw-r--r--@ 1 donald.ferguson  staff   988515488 Feb 16 04:46 ../data/IMDB/title_basics.tsv
-rw-r--r--@ 1 donald.ferguson  staff  4048428739 Feb 16 04:47 ../data/IMDB/title_principals.tsv
-rw-r--r--@ 1 donald.ferguson  staff    26676816 Feb 16 04:46 ../data/IMDB/title_ratings.tsv


They are big: the files range from about 27 MB (`title_ratings.tsv`) to about 4 GB (`title_principals.tsv`). The TSV files have millions of rows.

I separately preloaded the data, so we can see the size of the tables in rows.

In [28]:
%sql select count(*) as count from f24_imdb_raw.title_basics

 * mysql+pymysql://root:***@localhost?local_infile=1
1 rows affected.


count
11201577


In [29]:
%sql select count(*) as count from f24_imdb_raw.name_basics

 * mysql+pymysql://root:***@localhost?local_infile=1
1 rows affected.


count
13911891


In [30]:
%sql select count(*) as count from f24_imdb_raw.title_principals

 * mysql+pymysql://root:***@localhost?local_infile=1
1 rows affected.


count
88812727


In [31]:
%sql select count(*) as count from f24_imdb_raw.title_ratings

 * mysql+pymysql://root:***@localhost?local_infile=1
1 rows affected.


count
1493420


Three of the four tables have over 10 million rows, and `title_principals` has almost 89 million. Loading that into a spreadsheet would be super fun.

Past semesters have shown that a lot of students' personal computers cannot handle tables this large.


In the next section, I am going to preprocess the data.

## Preprocess (Filter) the Data

I previously produced a file that contains only the ```tconst``` values for _Game of Thrones_ episodes.

Students will perform this step in their project by processing the _Game of Thrones_ data files.

I use this file to extract ONLY _Game of Thrones_ episodes from the IMDB dataset.

The Python code below performs the filtering. Logically, this is a simple [directed acyclic graph (DAG)](https://en.wikipedia.org/wiki/Directed_acyclic_graph). In other parts of the project, which are optional, I show how to use [Apache Spark](https://en.wikipedia.org/wiki/Apache_Spark) for DAG-based data processing.

The following is a simple, logical representation of the DAG in the code.

| <img src="simple_dag.jpg"> |
| :---: |
| __Simple title_basics Processing DAG__ |

Spark can run nodes in the DAG in parallel, and can also parallelize individual nodes in the DAG. This simple example in a notebook executes serially.
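As a rough illustration of what "a DAG of processing steps" means, the sketch below encodes three steps and their dependencies and asks the standard library for a valid execution order. The node names are hypothetical labels that mirror the steps in `process_files`; they are not taken from the diagram.

```python
from graphlib import TopologicalSorter

# Each key is a step; each value is the set of steps it depends on.
# The names are illustrative labels, not identifiers from the project.
dag = {
    "read_got_tconsts": set(),
    "filter_title_basics": {"read_got_tconsts"},
    "write_got_title_basics": {"filter_title_basics"},
}

# static_order() yields the nodes so that every step follows its dependencies.
execution_order = list(TopologicalSorter(dag).static_order())
print(execution_order)
```

Because this DAG is a single chain, there is only one valid order. A DAG with independent branches would have several, and a scheduler such as Spark could run those branches in parallel.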

In [36]:
import csv
import pandas

# This is a CSV file that has information about the episodes in GoT.
# I built this file by processing the GoT dataset, which students
# will do in later phases of the project.
#
got_title_info_file = "s25_project_solution_episode_title_basics.csv"


def get_filter_column_values(file_name: str, column_name: str, delimiter:str = ",") -> list:
    """
    Reads a CSV/TSV file and for each row gets the value of the column_name
    :param file_name: File to read.
    :param column_name: Column name.
    :param delimiter: Delimiter in the file.
    :return: A list of the values for the column in the rows in the file.
    """
    result = []

    with open(file_name, "r") as in_file:
        in_csv_file = csv.DictReader(in_file, delimiter=delimiter)
        for r in in_csv_file:
            new_v = r.get(column_name, None)
            if new_v:
                result.append(new_v)

    return result

def get_row_matching_column(file_name: str,
                            column_name: str,
                            column_filter_values: list,
                            delimiter:str = ",",
                            trace_read_count: int=0) -> list:
    """
    Reads a CSV/TSV file and for each row gets the value of the column_name in a row. If the value is in
    the set of values for column_filter_values, the row is added to the result.
    :param file_name: File to read.
    :param column_name: Column name.
    :param column_filter_values: List of strings with values to match.
    :param delimiter: Delimiter in the file.
    :param trace_read_count: How many rows to read in between trace printouts.
    :return: A list of the rows that match the filter condition for the column in the rows in the file.
    """
    result = []
    rows_read = 0

    with open(file_name, "r") as in_file:
        in_csv_file = csv.DictReader(in_file, delimiter=delimiter)
        for r in in_csv_file:
            rows_read += 1
            row_v = r.get(column_name, None)
            if row_v in column_filter_values:
                result.append(r)

            if trace_read_count > 0 and rows_read % trace_read_count == 0:
                found = len(result)
                print(f"Read {rows_read} and found {found} matching rows.")

    return result


def process_files():
    tconsts = get_filter_column_values(got_title_info_file, 'tconst')
    print("tconsts = ", tconsts)
    trows = get_row_matching_column(
        "../data/IMDB/title_basics.tsv", "tconst", tconsts, delimiter="\t", trace_read_count=100000
    )
    df = pandas.DataFrame(trows)
    df.to_csv("../data/IMDB/got_title_basics.csv")
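One possible refinement of the filtering step, shown here only as a sketch: `get_row_matching_column` tests membership against a list for every row, which gets slower as the list grows. Converting the filter values to a `set` makes each test roughly constant time, and a generator avoids building intermediate lists. The function name `iter_rows_matching` and the first sample row are made up for illustration.

```python
import csv
import io


def iter_rows_matching(lines, column_name, wanted_values, delimiter=","):
    """Yield the rows whose column_name value is in wanted_values.

    lines can be an open text file or any iterable of text lines.
    """
    wanted = set(wanted_values)   # Set membership tests are O(1) on average.
    for row in csv.DictReader(lines, delimiter=delimiter):
        if row.get(column_name) in wanted:
            yield row


# A tiny in-memory stand-in for title_basics.tsv.
sample_tsv = io.StringIO(
    "tconst\tprimaryTitle\n"
    "tt0000001\tNot GoT\n"
    "tt1480055\tWinter Is Coming\n"
)
matches = list(iter_rows_matching(sample_tsv, "tconst", ["tt1480055"], delimiter="\t"))
print(matches)
```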

In [37]:
process_files()

tconsts =  ['tt1480055', 'tt1668746', 'tt1829962', 'tt1829963', 'tt1829964', 'tt1837862', 'tt1837863', 'tt1837864', 'tt1851398', 'tt1851397', 'tt1971833', 'tt2069318', 'tt2070135', 'tt2069319', 'tt2074658', 'tt2085238', 'tt2085239', 'tt2085240', 'tt2084342', 'tt2112510', 'tt2178782', 'tt2178772', 'tt2178802', 'tt2178798', 'tt2178788', 'tt2178812', 'tt2178814', 'tt2178806', 'tt2178784', 'tt2178796', 'tt2816136', 'tt2832378', 'tt2972426', 'tt2972428', 'tt3060856', 'tt3060910', 'tt3060876', 'tt3060782', 'tt3060858', 'tt3060860', 'tt3658012', 'tt3846626', 'tt3866836', 'tt3866838', 'tt3866840', 'tt3866842', 'tt3866846', 'tt3866850', 'tt3866826', 'tt3866862', 'tt3658014', 'tt4077554', 'tt4131606', 'tt4283016', 'tt4283028', 'tt4283054', 'tt4283060', 'tt4283074', 'tt4283088', 'tt4283094', 'tt5654088', 'tt5655178', 'tt5775840', 'tt5775846', 'tt5775854', 'tt5775864', 'tt5775874', 'tt5924366', 'tt6027908', 'tt6027912', 'tt6027914', 'tt6027916', 'tt6027920']
Read 100000 and found 0 matching rows.


A little "code" shows that I just created the file.

In [39]:
!date
%ls -l ../data/IMDB/got_title_basics.csv

Sun Feb 16 04:55:22 EST 2025
-rw-r--r--@ 1 donald.ferguson  staff  7072 Feb 16 04:54 ../data/IMDB/got_title_basics.csv


## Some Additional Processing

### Load the GoT Title Basics into MySQL

I am going to load basic ```title_basics``` information into a new database.

In [90]:
%sql drop schema if exists s25_project

 * mysql+pymysql://root:***@localhost?local_infile=1
4 rows affected.


[]

In [91]:
%sql create schema s25_project

 * mysql+pymysql://root:***@localhost?local_infile=1
1 rows affected.


[]

I am going to save the GoT ```title_basics``` data.

In [92]:
default_engine = create_engine(db_url)
df = pandas.read_csv("../data/IMDB/got_title_basics.csv")
df.to_sql("got_title_basics", schema="s25_project", con=default_engine, index=False, if_exists="replace")

73

In [93]:
%sql use s25_project
%sql select * from got_title_basics

 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.
 * mysql+pymysql://root:***@localhost?local_infile=1
73 rows affected.


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt1480055,tvEpisode,Winter Is Coming,Winter Is Coming,0,2011,\N,62,"Action,Adventure,Drama"
1,tt1668746,tvEpisode,The Kingsroad,The Kingsroad,0,2011,\N,56,"Action,Adventure,Drama"
2,tt1829962,tvEpisode,Lord Snow,Lord Snow,0,2011,\N,58,"Action,Adventure,Drama"
3,tt1829963,tvEpisode,"Cripples, Bastards, and Broken Things","Cripples, Bastards, and Broken Things",0,2011,\N,56,"Action,Adventure,Drama"
4,tt1829964,tvEpisode,The Wolf and the Lion,The Wolf and the Lion,0,2011,\N,55,"Action,Adventure,Drama"
5,tt1837862,tvEpisode,A Golden Crown,A Golden Crown,0,2011,\N,53,"Action,Adventure,Drama"
6,tt1837863,tvEpisode,You Win or You Die,You Win or You Die,0,2011,\N,58,"Action,Adventure,Drama"
7,tt1837864,tvEpisode,The Pointy End,The Pointy End,0,2011,\N,59,"Action,Adventure,Drama"
8,tt1851397,tvEpisode,Fire and Blood,Fire and Blood,0,2011,\N,53,"Action,Adventure,Drama"
9,tt1851398,tvEpisode,Baelor,Baelor,0,2011,\N,57,"Action,Adventure,Drama"
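The `\N` values in the `endYear` column are how the IMDB files mark a missing value. A cleanup step might convert them to real nulls and cast numeric columns, along the lines of this sketch. The function name `clean_imdb_row` and the choice of integer columns are assumptions for illustration.

```python
IMDB_NULL = r"\N"   # Raw string: the two characters backslash and N.


def clean_imdb_row(row: dict, int_columns=("startYear", "endYear", "runtimeMinutes")) -> dict:
    """Return a copy of row with IMDB nulls as None and numeric columns as int."""
    cleaned = {}
    for key, value in row.items():
        if value == IMDB_NULL:
            cleaned[key] = None
        elif key in int_columns:
            cleaned[key] = int(value)
        else:
            cleaned[key] = value
    return cleaned


raw_row = {
    "tconst": "tt1480055",
    "primaryTitle": "Winter Is Coming",
    "startYear": "2011",
    "endYear": r"\N",
    "runtimeMinutes": "62",
}
clean_row = clean_imdb_row(raw_row)
print(clean_row)
```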


### Get the IMDB Name Basics

This data processing is easier in SQL. In a "real" data engineering project, I would use [Spark SQL](https://spark.apache.org/sql/) to perform these operations.

I previously loaded the raw IMDB data and put an index on ```name_basics.nconst```, ```title_principals.nconst``` and ```title_principals.tconst```.

I can use an indexed query to find all of the "principals" that were in GoT episodes.

In [94]:
%%sql

use s25_project;

with one as (select distinct nconst
             from got_title_basics
                      join f24_imdb_clean.title_principals using (tconst)),
two as (select * from f24_imdb_clean.name_basics where nconst in (select nconst from one))
select * from two limit 10;

 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.
10 rows affected.


nconst,primaryName,title,first_name,middle_name,last_name,suffix,nickname,death_year,birth_year,knownForTitles
nm0000293,Sean Bean,,Sean,,Bean,,,1959,,"tt0120737,tt0944947,tt1181791,tt0167261"
nm0004692,Mark Addy,,Mark,,Addy,,,1964,,"tt0119164,tt0183790,tt0944947,tt0955308"
nm0182666,Nikolaj Coster-Waldau,,Nikolaj,,Coster-Waldau,,,1970,,"tt0944947,tt1483013,tt2404233,tt0110631"
nm0265610,Michelle Fairley,,Michelle,,Fairley,,,1963,,"tt0944947,tt2431286,tt1390411,tt0926084"
nm0372176,Lena Headey,,Lena,,Headey,,,1973,,"tt0416449,tt0944947,tt1374989,tt1343727"
nm3592338,Emilia Clarke,,Emilia,,Clarke,,,1986,,"tt0944947,tt1340138,tt2674426,tt3778644"
nm0322513,Iain Glen,,Iain,,Glen,,,1961,,"tt0944947,tt6954652,tt10370380,tt6859806"
nm0516003,Harry Lloyd,,Harry,,Lloyd,,,1983,,"tt2980516,tt1007029,tt4190530,tt1229822"
nm3229685,Kit Harington,,Kit,,Harington,,,1986,,"tt0944947,tt1921064,tt0938330,tt9032400"
nm3849842,Sophie Turner,,Sophie,,Turner,,,1996,,"tt0944947,tt6565702,tt3385516,tt1731701"


In [99]:
%%sql
drop table if exists imdb_got_name_basics;

create table imdb_got_name_basics as
with one as (select distinct nconst
             from got_title_basics
                      join f24_imdb_clean.title_principals using (tconst)),
two as (select * from f24_imdb_clean.name_basics where nconst in (select nconst from one))
select * from two;

 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.
108 rows affected.


[]

We will see later that there are "actors" in the GoT data that are not "principals" as far as IMDB is concerned.


We will deal with this later.
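If you do not have the full IMDB tables loaded, the shape of the two-step query can still be tried on toy data. The sketch below uses SQLite from the Python standard library rather than MySQL, and every identifier value in it (`tt_got_1`, `nm_a`, and so on) is made up. It is only meant to show how the first common table expression finds the people and the second looks up their details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table got_title_basics (tconst text primary key);
    create table title_principals (tconst text, nconst text);
    create table name_basics (nconst text primary key, primaryName text);

    insert into got_title_basics values ('tt_got_1');
    insert into title_principals values
        ('tt_got_1', 'nm_a'), ('tt_got_1', 'nm_b'), ('tt_other', 'nm_c');
    insert into name_basics values
        ('nm_a', 'Person A'), ('nm_b', 'Person B'), ('nm_c', 'Person C');
""")

# Step one: the distinct people attached to GoT titles.
# Step two: the name_basics rows for exactly those people.
rows = conn.execute("""
    with one as (select distinct nconst
                 from got_title_basics
                          join title_principals using (tconst)),
         two as (select * from name_basics where nconst in (select nconst from one))
    select nconst, primaryName from two order by nconst
""").fetchall()
conn.close()
print(rows)
```

Person C appears only in a non-GoT title, so the query leaves that row out.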

### Expand title_basics and title_principals

I would like to know _all_ of the titles that the principals acted in, not just GoT.

I can use my previously loaded data and some SQL.

Let's get the ```title_principals``` mapping.

In [100]:
%%sql

use s25_project;

select
    count(distinct nconst, tconst)
from
    f24_imdb_clean.title_principals join imdb_got_name_basics
using(nconst);

 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.
1 rows affected.


"count(distinct nconst, tconst)"
15757


In [101]:
%%sql

drop table if exists imdb_got_title_principals;

create table imdb_got_title_principals as
select
    distinct nconst, tconst
from
    f24_imdb_clean.title_principals join imdb_got_name_basics
using(nconst);

 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.
15757 rows affected.


[]

In [102]:
%sql select * from imdb_got_title_principals limit 10;

 * mysql+pymysql://root:***@localhost?local_infile=1
10 rows affected.


nconst,tconst
nm0000293,tt0088407
nm0000293,tt0090798
nm0000293,tt0096180
nm0000293,tt0096416
nm0000293,tt0097352
nm0000293,tt0099566
nm0000293,tt0100056
nm0000293,tt0100642
nm0000293,tt0100940
nm0000293,tt0101066


The preceding gives a feel for some of the data engineering tasks.

For the rest of the notebook, we will use the precreated CSV files.

# Load the IMDB Data

The IMDB data is in a file in the project directory.

We are going to restart the process and just load the preprocessed information.

In [106]:
imdb_data_dir = "../data/IMDB"

In [107]:
%ls $imdb_data_dir

got_imdb_name_basics.csv       title_basics.tsv
got_imdb_title_basics.csv      title_basics.tsv.gz
got_imdb_title_principals.csv  title_principals.tsv
got_imdb_title_ratings.csv     title_principals.tsv.gz
got_title_basics.csv           title_ratings.tsv
name_basics.tsv                title_ratings.tsv.gz
name_basics.tsv.gz


In [108]:
%sql drop database if exists s25_project

 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.


[]

In [109]:
%sql create database s25_project

 * mysql+pymysql://root:***@localhost?local_infile=1
1 rows affected.


[]

We are simply going to load the data for now. The data files do not have a header row identifying the
columns. So, we have to define the columns ourselves and disable header inference when reading the files.

In [110]:
name_basics_df = pandas.read_csv(imdb_data_dir + '/got_imdb_name_basics.csv', 
                                  header=None)

In [111]:
name_basics_df.columns = ["nconst", "primaryName", "birthYear", "deathYear", 
                          "primaryProfession", "knownForTitles"]

In [112]:
name_basics_df.to_sql('name_basics', schema='s25_project',
                      con=default_engine, index=False, if_exists='append')

351
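The same "no header row, so supply the column names" idea can be seen without pandas. This sketch uses the standard library's `csv.DictReader` with an explicit `fieldnames` list. The sample line is modeled on the first row shown in the query output, and the in-memory file is a stand-in for the real CSV. In pandas, passing `names=` together with `header=None` to `read_csv` would be an alternative to assigning `.columns` afterwards.

```python
import csv
import io

NAME_BASICS_COLUMNS = [
    "nconst", "primaryName", "birthYear", "deathYear",
    "primaryProfession", "knownForTitles",
]

# A one-line, headerless stand-in for got_imdb_name_basics.csv.
headerless_csv = io.StringIO(
    'nm0389698,B.J. Hogg,1955,2020,"actor,music_department",'
    '"tt0986233,tt1240982,tt0970411,tt0944947"\n'
)

# With fieldnames given, DictReader treats the first line as data, not as a header.
name_rows = list(csv.DictReader(headerless_csv, fieldnames=NAME_BASICS_COLUMNS))
print(name_rows[0]["primaryName"], name_rows[0]["primaryProfession"])
```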

In [113]:
%sql use s25_project

 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.


[]

In [114]:
%sql select * from name_basics limit 5;

 * mysql+pymysql://root:***@localhost?local_infile=1
5 rows affected.


nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
nm0389698,B.J. Hogg,1955.0,2020.0,"actor,music_department","tt0986233,tt1240982,tt0970411,tt0944947"
nm0269923,Michael Feast,1946.0,,"actor,composer,soundtrack","tt0120879,tt0472160,tt0362192,tt0810823"
nm0727778,David Rintoul,1948.0,,"actor,archive_footage","tt1139328,tt4786824,tt6079772,tt1007029"
nm6729880,Chuku Modu,1990.0,,"actor,writer,producer","tt4154664,tt2674426,tt0944947,tt6470478"
nm0853583,Owen Teale,1961.0,,"actor,writer,archive_footage","tt0102797,tt0944947,tt0485301,tt0462396"


In [115]:
title_basics_df = pandas.read_csv(imdb_data_dir + '/got_imdb_title_basics.csv', header=None)

In [116]:
title_basics_df.columns = ["tconst", "titleType", "primaryTitle",
                          "originalTitle", "isAdult", "startYear", "endYear",
                          "runtimeMinutes", "genres"]

In [117]:
title_basics_df.to_sql('title_basics', schema='s25_project',
                      con=default_engine, index=False, if_exists='replace')

29058

In [118]:
title_principals_df = pandas.read_csv(imdb_data_dir + '/got_imdb_title_principals.csv', header=None)

In [119]:
title_principals_df.head(5)

Unnamed: 0,0,1,2,3,4,5
0,nm0389698,tt0087286,5,actor,,"[""Big Billy""]"
1,nm0389698,tt0101201,9,actor,,"[""Billy Murray""]"
2,nm0389698,tt0120087,8,actor,,"[""Col. Reece""]"
3,nm0389698,tt0122738,9,actor,,"[""Minister""]"
4,nm0389698,tt0124972,7,actor,,"[""Mr.Ken Campbell""]"


In [120]:
title_principals_df.columns = ["nconst", "tconst", "ordering", "category", "job", "characters"]

In [121]:
title_principals_df.to_sql('title_principals', schema='s25_project',
                      con=default_engine, index=False, if_exists='replace')

34193

In [122]:
%sql select * from title_principals limit 5;

 * mysql+pymysql://root:***@localhost?local_infile=1
5 rows affected.


nconst,tconst,ordering,category,job,characters
nm0389698,tt0087286,5,actor,,"[""Big Billy""]"
nm0389698,tt0101201,9,actor,,"[""Billy Murray""]"
nm0389698,tt0120087,8,actor,,"[""Col. Reece""]"
nm0389698,tt0122738,9,actor,,"[""Minister""]"
nm0389698,tt0124972,7,actor,,"[""Mr.Ken Campbell""]"


In [123]:
title_ratings_df = pandas.read_csv(imdb_data_dir + '/got_imdb_title_ratings.csv')

In [124]:
title_ratings_df.columns= ['tconst', "averageRating", "noOfVotes"]

In [125]:
title_ratings_df.to_sql('title_ratings', schema='s25_project',
                      con=default_engine, index=False, if_exists='replace')

17798

In [126]:
%sql select * from title_ratings limit 5;

 * mysql+pymysql://root:***@localhost?local_infile=1
5 rows affected.


tconst,averageRating,noOfVotes
tt0055556,7.3,81
tt0056105,6.4,587
tt0056696,5.9,185
tt0057435,6.2,486
tt0058142,6.9,1535


# Some Data Cleanup Examples

## To-Be Data Model

This is the "to be" data model that I want.

In [127]:
%sql drop schema if exists s25_project_fixed;
%sql create schema s25_project_fixed;

 * mysql+pymysql://root:***@localhost?local_infile=1
6 rows affected.
 * mysql+pymysql://root:***@localhost?local_infile=1
1 rows affected.


[]

In [130]:
%%sql

create table if not exists s25_project_fixed.name_basics
(
    nconst             varchar(16)  not null
        primary key,
    primaryName        varchar(256) null,
    first_name         varchar(256) null,
    last_name          varchar(256) null,
    middle_name        varchar(256) null,
    title              varchar(256) null,
    nick_name          varchar(256) null,
    birth_year         int          null,
    death_year         int          null,
    primaryProfessions varchar(256) null,
    knownForTitles     varchar(256) null
);

create table if not exists s25_project_fixed.title_basics
(
    tconst         varchar(16)  not null
        primary key,
    titleType      varchar(256) null,
    primaryTitle   varchar(256) null,
    originalTitle  varchar(256) not null,
    isAdult        int          null,
    startYear      int          null,
    endYear        int          null,
    runtimeMinutes int          null,
    genres         varchar(256) null
);

create table if not exists s25_project_fixed.got_episodes
(
    seasonNum          int          not null,
    episodeNum         int          not null,
    episodeTitle       varchar(256) null,
    episodeAirDate     varchar(16)  null,
    episodeDescription varchar(512) null,
    imdb_tconst        varchar(16)  null,
    primary key (seasonNum, episodeNum),
    constraint got_episodes_uq_tconst
        unique (imdb_tconst),
    constraint got_episodes_title_basics_tconst_fk
        foreign key (imdb_tconst) references f24_project_clean.title_basics (tconst)
);

create table if not exists s25_project_fixed.title_principals
(
    nconst     varchar(16)  not null,
    tconst     varchar(16)  not null,
    ordering   int          not null,
    category   varchar(256) null,
    job        varchar(256) null,
    characters varchar(256) null,
    primary key (nconst, tconst, ordering),
    constraint title_principals_name___fk
        foreign key (nconst) references f24_project_clean.name_basics (nconst),
    constraint title_principals_titles__fk
        foreign key (tconst) references f24_project_clean.title_basics (tconst)
);

create table if not exists s25_project_fixed.title_ratings
(
    tconst        varchar(16) not null
        primary key,
    noOfVotes     int         null,
    averageRating double      null,
    constraint title_ratings_basics_fk
        foreign key (tconst) references f24_project_clean.title_basics (tconst)
);

 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.
0 rows affected.
0 rows affected.
0 rows affected.
0 rows affected.


[]

## Clean Up name_basics

In [131]:
name_basics = %sql select * from s25_project.name_basics

 * mysql+pymysql://root:***@localhost?local_infile=1
351 rows affected.


In [132]:
name_basics_list = list(name_basics.dicts())

In [133]:
name_basics_list[0:5]

[{'nconst': 'nm0389698',
  'primaryName': 'B.J. Hogg',
  'birthYear': 1955.0,
  'deathYear': 2020.0,
  'primaryProfession': 'actor,music_department',
  'knownForTitles': 'tt0986233,tt1240982,tt0970411,tt0944947'},
 {'nconst': 'nm0269923',
  'primaryName': 'Michael Feast',
  'birthYear': 1946.0,
  'deathYear': None,
  'primaryProfession': 'actor,composer,soundtrack',
  'knownForTitles': 'tt0120879,tt0472160,tt0362192,tt0810823'},
 {'nconst': 'nm0727778',
  'primaryName': 'David Rintoul',
  'birthYear': 1948.0,
  'deathYear': None,
  'primaryProfession': 'actor,archive_footage',
  'knownForTitles': 'tt1139328,tt4786824,tt6079772,tt1007029'},
 {'nconst': 'nm6729880',
  'primaryName': 'Chuku Modu',
  'birthYear': 1990.0,
  'deathYear': None,
  'primaryProfession': 'actor,writer,producer',
  'knownForTitles': 'tt4154664,tt2674426,tt0944947,tt6470478'},
 {'nconst': 'nm0853583',
  'primaryName': 'Owen Teale',
  'birthYear': 1961.0,
  'deathYear': None,
  'primaryProfession': 'actor,writer,arc

In [135]:
# You may have to pip install this package.
#
# Some things are just really hard in SQL. So, for the name, we are going to use a Python package
# that heuristically parses name strings and guesses the canonical format.
#
from nameparser import HumanName

In [136]:
for a in name_basics_list:
    human_name = HumanName(a["primaryName"])
    a["first_name"] = human_name["first"]
    a["last_name"] = human_name["last"]
    a["middle_name"] = human_name["middle"]
    a["nickname"] = human_name["nickname"]
    a["title"] = human_name["title"]
    a["suffix"] = human_name["suffix"]
    a["birth_year"] = a["birthYear"]
    del a["birthYear"]
    a["death_year"] = a["deathYear"]
    del a["deathYear"]

In [137]:
name_basics_list[0:3]

[{'nconst': 'nm0389698',
  'primaryName': 'B.J. Hogg',
  'primaryProfession': 'actor,music_department',
  'knownForTitles': 'tt0986233,tt1240982,tt0970411,tt0944947',
  'first_name': 'B.J.',
  'last_name': 'Hogg',
  'middle_name': '',
  'nickname': '',
  'title': '',
  'suffix': '',
  'birth_year': 1955.0,
  'death_year': 2020.0},
 {'nconst': 'nm0269923',
  'primaryName': 'Michael Feast',
  'primaryProfession': 'actor,composer,soundtrack',
  'knownForTitles': 'tt0120879,tt0472160,tt0362192,tt0810823',
  'first_name': 'Michael',
  'last_name': 'Feast',
  'middle_name': '',
  'nickname': '',
  'title': '',
  'suffix': '',
  'birth_year': 1946.0,
  'death_year': None},
 {'nconst': 'nm0727778',
  'primaryName': 'David Rintoul',
  'primaryProfession': 'actor,archive_footage',
  'knownForTitles': 'tt1139328,tt4786824,tt6079772,tt1007029',
  'first_name': 'David',
  'last_name': 'Rintoul',
  'middle_name': '',
  'nickname': '',
  'title': '',
  'suffix': '',
  'birth_year': 1948.0,
  'death_y

In [138]:
name_basics_df = pandas.DataFrame(name_basics_list)

In [139]:
name_basics_df.head(4)

Unnamed: 0,nconst,primaryName,primaryProfession,knownForTitles,first_name,last_name,middle_name,nickname,title,suffix,birth_year,death_year
0,nm0389698,B.J. Hogg,"actor,music_department","tt0986233,tt1240982,tt0970411,tt0944947",B.J.,Hogg,,,,,1955.0,2020.0
1,nm0269923,Michael Feast,"actor,composer,soundtrack","tt0120879,tt0472160,tt0362192,tt0810823",Michael,Feast,,,,,1946.0,
2,nm0727778,David Rintoul,"actor,archive_footage","tt1139328,tt4786824,tt6079772,tt1007029",David,Rintoul,,,,,1948.0,
3,nm6729880,Chuku Modu,"actor,writer,producer","tt4154664,tt2674426,tt0944947,tt6470478",Chuku,Modu,,,,,1990.0,


In [141]:
name_basics_df.to_sql('name_basics', schema="s25_project_fixed",
                      con=default_engine, index=False, if_exists='replace')

351

In [142]:
%sql select * from s25_project_fixed.name_basics where last_name='Bean'

 * mysql+pymysql://root:***@localhost?local_infile=1
1 rows affected.


nconst,primaryName,primaryProfession,knownForTitles,first_name,last_name,middle_name,nickname,title,suffix,birth_year,death_year
nm0000293,Sean Bean,"actor,producer,animation_department","tt0120737,tt0944947,tt1181791,tt0167261",Sean,Bean,,,,,1959.0,
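
The year columns above still hold floats such as `1959.0`, most likely because a column with missing values is read as floating point, while the to-be table declares `birth_year` and `death_year` as `int`. Below is a minimal sketch of one way to normalize the values before loading. The helper name `to_int_or_none` is a hypothetical addition, and the sample rows are trimmed copies of the rows shown earlier.

```python
import math


def to_int_or_none(value):
    """Convert a float year such as 1955.0 to int; map None or NaN to None."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    return int(value)


# Trimmed sample rows (only the columns needed for the illustration).
rows = [
    {"nconst": "nm0389698", "birth_year": 1955.0, "death_year": 2020.0},
    {"nconst": "nm0269923", "birth_year": 1946.0, "death_year": None},
]

cleaned_rows = [
    {
        **row,
        "birth_year": to_int_or_none(row["birth_year"]),
        "death_year": to_int_or_none(row["death_year"]),
    }
    for row in rows
]
print(cleaned_rows)
```

Applying a helper like this inside the cleanup loop would keep missing years as NULL in the database instead of mixing floats and missing values.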


## Primary Profession

| <img src="primary_profession.jpg" width="700px;"> |
| :---: |
| __Multi-Valued Primary Profession__ |

How do we do this? The tricky bit is unwinding comma-separated strings like "actor,producer,soundtrack" into one row per profession.

In [145]:
%%sql

use s25_project;

with one as (
    select
        nconst,
        name_basics.primaryProfession,
        substr(name_basics.primaryProfession, 1, locate(',', name_basics.primaryProfession)-1) as p1,
        substr(name_basics.primaryProfession, locate(',', name_basics.primaryProfession)+1)  as remainder
    from
        name_basics
),
    two as (
        select
            nconst,
            one.primaryProfession,
            p1,
            substr(remainder, 1, locate(',', remainder) -1) as p2,
            substr(remainder, locate(',', remainder)+1)  as p3
        from
            one
    ),
    three as (
        select
            nconst,
            primaryProfession,
            if(p1='', NULL, p1) as p1,
            if(p2='', NULL, p2) as p2,
            if(p3='', NULL, p3) as p3
        from
            two
    ),
    four as (select nconst, p1 as profession
             from three
             union
             select nconst, p2 as profession
             from three
             union
             select nconst, p3 as profession
             from three)
select
    *
from
    four
order by nconst limit 10;


 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.
10 rows affected.


nconst,profession
nm0000293,animation_department
nm0000293,actor
nm0000293,producer
nm0000596,soundtrack
nm0000596,producer
nm0000596,actor
nm0000980,producer
nm0000980,writer
nm0000980,actor
nm0001097,actor
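
For comparison, the same unwinding takes only a few lines in Python. This is a sketch under stated assumptions, not the approach the notebook uses: `unwind_professions` is a hypothetical helper, the first two sample rows are copied from the data above, and the third row is made up to show how a missing value is handled.

```python
def unwind_professions(rows):
    """Yield one (nconst, profession) pair per comma-separated value."""
    for row in rows:
        raw = row["primaryProfession"] or ""
        for profession in raw.split(","):
            profession = profession.strip()
            if profession:  # skip empty strings from missing values
                yield row["nconst"], profession


sample_rows = [
    {"nconst": "nm0000293", "primaryProfession": "actor,producer,animation_department"},
    {"nconst": "nm0001097", "primaryProfession": "actor,producer,director"},
    {"nconst": "nm9999999", "primaryProfession": None},  # made-up row
]

pairs = list(unwind_professions(sample_rows))
for nconst, profession in pairs:
    print(nconst, profession)
```

The SQL version above assumes at most three values per row, because everything after the second comma lands in `p3`. The Python loop has no such limit.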


Some comments:
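- __Slow is smooth. Smooth is fast.__ $\Rightarrow$ Writing this one common table expression at a time and testing each step makes it much, much easier.
- You can validate the result with a query like the following, which regroups the professions for each ```nconst``` and shows them next to the original string.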

In [146]:
%%sql

use s25_project;

with one as (
    select
        nconst,
        name_basics.primaryProfession,
        substr(name_basics.primaryProfession, 1, locate(',', name_basics.primaryProfession)-1) as p1,
        substr(name_basics.primaryProfession, locate(',', name_basics.primaryProfession)+1)  as remainder
    from
        name_basics
),
    two as (
        select
            nconst,
            one.primaryProfession,
            p1,
            substr(remainder, 1, locate(',', remainder) -1) as p2,
            substr(remainder, locate(',', remainder)+1)  as p3
        from
            one
    ),
    three as (
        select
            nconst,
            primaryProfession,
            if(p1='', NULL, p1) as p1,
            if(p2='', NULL, p2) as p2,
            if(p3='', NULL, p3) as p3
        from
            two
    ),
    four as (select nconst, p1 as profession
             from three
             union
             select nconst, p2 as profession
             from three
             union
             select nconst, p3 as profession
             from three)
select
    nconst, primaryProfession, group_concat(four.profession) as all_professions
from
    name_basics join four using(nconst)
group by nconst, primaryProfession
order by nconst
limit 10;
    

 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.
10 rows affected.


nconst,primaryProfession,all_professions
nm0000293,"actor,producer,animation_department","animation_department,actor,producer"
nm0000596,"actor,producer,soundtrack","actor,producer,soundtrack"
nm0000980,"actor,writer,producer","producer,writer,actor"
nm0001097,"actor,producer,director","producer,director,actor"
nm0001290,"actor,director,writer","writer,actor,director"
nm0001354,"actor,soundtrack,archive_footage","soundtrack,actor,archive_footage"
nm0001671,"actress,costume_department,costume_designer","actress,costume_department,costume_designer"
nm0002103,"actor,soundtrack,archive_footage","soundtrack,actor,archive_footage"
nm0004355,"actor,writer,director","actor,writer,director"
nm0004692,"actor,music_department,soundtrack","soundtrack,music_department,actor"


Let's create the ```professions``` table.

In [147]:
%%sql

use s25_project_fixed;

create table professions
(
    profession_id int auto_increment,
    profession    varchar(64) not null,
    constraint professions_pk
        primary key (profession_id)
);

 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.
0 rows affected.


[]

Let's get the list of distinct values found in ```primaryProfession``` and load them into the ```professions``` table.

In [150]:
%%sql

insert into professions(profession)
with one as (
    select
        nconst,
        name_basics.primaryProfession,
        substr(name_basics.primaryProfession, 1, locate(',', name_basics.primaryProfession)-1) as p1,
        substr(name_basics.primaryProfession, locate(',', name_basics.primaryProfession)+1)  as remainder
    from
        name_basics
),
    two as (
        select
            nconst,
            one.primaryProfession,
            p1,
            substr(remainder, 1, locate(',', remainder) -1) as p2,
            substr(remainder, locate(',', remainder)+1)  as p3
        from
            one
    ),
    three as (
        select
            nconst,
            primaryProfession,
            if(p1='', NULL, p1) as p1,
            if(p2='', NULL, p2) as p2,
            if(p3='', NULL, p3) as p3
        from
            two
    ),
    four as (select nconst, p1 as profession
             from three
             union
             select nconst, p2 as profession
             from three
             union
             select nconst, p3 as profession
             from three)
select
    distinct profession
from
    four
where
    profession is not NULL;

 * mysql+pymysql://root:***@localhost?local_infile=1
32 rows affected.


[]

Now let's build the associative entity.

In [155]:
%%sql

create table name_basics_professions as 
with one as (
    select
        nconst,
        name_basics.primaryProfession,
        substr(name_basics.primaryProfession, 1, locate(',', name_basics.primaryProfession)-1) as p1,
        substr(name_basics.primaryProfession, locate(',', name_basics.primaryProfession)+1)  as remainder
    from
        name_basics
),
    two as (
        select
            nconst,
            one.primaryProfession,
            p1,
            substr(remainder, 1, locate(',', remainder) -1) as p2,
            substr(remainder, locate(',', remainder)+1)  as p3
        from
            one
    ),
    three as (
        select
            nconst,
            primaryProfession,
            if(p1='', NULL, p1) as p1,
            if(p2='', NULL, p2) as p2,
            if(p3='', NULL, p3) as p3
        from
            two
    ),
    four as (select nconst, p1 as profession
             from three
             union
             select nconst, p2 as profession
             from three
             union
             select nconst, p3 as profession
             from three)
select
    nconst,
    (select profession_id from professions where professions.profession=four.profession) as profession_id
from
    four
order by nconst;

 * mysql+pymysql://root:***@localhost?local_infile=1
927 rows affected.


[]
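
The lookup table plus associative entity pattern can also be sketched end to end in a few lines. The example below is an illustration under stated assumptions, not the notebook's method: it uses an in-memory SQLite database from the Python standard library instead of MySQL, splits the strings in Python rather than in SQL, and loads only two sample people copied from the data above.

```python
import sqlite3

# Two sample rows copied from the notebook output.
people = [
    ("nm0000293", "Sean Bean", "actor,producer,animation_department"),
    ("nm0001097", "Charles Dance", "actor,producer,director"),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table name_basics (
        nconst            text primary key,
        primaryName       text,
        primaryProfession text
    );
    create table professions (
        profession_id integer primary key autoincrement,
        profession    text not null unique
    );
    create table name_basics_professions (
        nconst        text    not null,
        profession_id integer not null,
        primary key (nconst, profession_id)
    );
""")
conn.executemany("insert into name_basics values (?, ?, ?)", people)

for nconst, _, primary_profession in people:
    for profession in primary_profession.split(","):
        # Add the profession to the lookup table only if it is new.
        conn.execute(
            "insert or ignore into professions (profession) values (?)",
            (profession,),
        )
        # Link the person to the profession through its surrogate key.
        conn.execute(
            "insert into name_basics_professions (nconst, profession_id) "
            "select ?, profession_id from professions where profession = ?",
            (nconst, profession),
        )

result = conn.execute("""
    select nconst, count(*) as n
    from name_basics_professions
    group by nconst
    order by nconst
""").fetchall()
profession_count = conn.execute("select count(*) from professions").fetchone()[0]
print(result, profession_count)
```

One difference from the MySQL statement above: the `union` in the CTE also emits a NULL profession for people with fewer than three professions, so the resulting table may contain rows with a NULL `profession_id`. Filtering with `where profession is not NULL`, as the `professions` insert does, would exclude them.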

Let's double check the result by joining the three tables and regrouping the professions.

In [159]:
%%sql

with one as (
    select * from name_basics join name_basics_professions using(nconst)
),
    two as (
        select * from one join professions using(profession_id)
)
select nconst, primaryName, primaryProfession,  group_concat(profession) as the_professions
from two
group by nconst, primaryName, primaryProfession
order by nconst limit 10;

 * mysql+pymysql://root:***@localhost?local_infile=1
10 rows affected.


nconst,primaryName,primaryProfession,the_professions
nm0000293,Sean Bean,"actor,producer,animation_department","animation_department,producer,actor"
nm0000596,Jonathan Pryce,"actor,producer,soundtrack","soundtrack,actor,producer"
nm0000980,Jim Broadbent,"actor,writer,producer","producer,actor,writer"
nm0001097,Charles Dance,"actor,producer,director","actor,director,producer"
nm0001290,Richard E. Grant,"actor,director,writer","writer,actor,director"
nm0001354,Ciarán Hinds,"actor,soundtrack,archive_footage","archive_footage,soundtrack,actor"
nm0001671,Diana Rigg,"actress,costume_department,costume_designer","costume_designer,costume_department,actress"
nm0002103,Julian Glover,"actor,soundtrack,archive_footage","archive_footage,actor,soundtrack"
nm0004355,Roger Ashton-Griffiths,"actor,writer,director","actor,writer,director"
nm0004692,Mark Addy,"actor,music_department,soundtrack","soundtrack,actor,music_department"


Looks good to me. We could now create some foreign keys, indexes, etc. We should also start dropping columns from ```name_basics```.
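
Foreign keys on the associative entity could look like the following sketch. It is an assumption-laden illustration rather than the notebook's DDL: it runs against an in-memory SQLite database from the Python standard library, where foreign key enforcement must be switched on per connection, and the table definitions are simplified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("pragma foreign_keys = on")
conn.executescript("""
    create table name_basics (
        nconst text primary key
    );
    create table professions (
        profession_id integer primary key,
        profession    text not null unique
    );
    create table name_basics_professions (
        nconst        text    not null references name_basics (nconst),
        profession_id integer not null references professions (profession_id),
        primary key (nconst, profession_id)
    );
    insert into name_basics values ('nm0000293');
    insert into professions values (1, 'actor');
    insert into name_basics_professions values ('nm0000293', 1);
""")

# A row that points at a profession_id that does not exist is rejected.
try:
    conn.execute("insert into name_basics_professions values ('nm0000293', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute(
    "select count(*) from name_basics_professions"
).fetchone()[0]
print(rejected, row_count)
```

The composite primary key also prevents linking the same person to the same profession twice.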

In [160]:
%sql describe name_basics;

 * mysql+pymysql://root:***@localhost?local_infile=1
12 rows affected.


Field,Type,Null,Key,Default,Extra
nconst,text,YES,,,
primaryName,text,YES,,,
primaryProfession,text,YES,,,
knownForTitles,text,YES,,,
first_name,text,YES,,,
last_name,text,YES,,,
middle_name,text,YES,,,
nickname,text,YES,,,
title,text,YES,,,
suffix,text,YES,,,


We can get rid of ```primaryName``` and ```primaryProfession```. We will also handle ```knownForTitles``` differently. For now, I am going to create a temporary working table, ```name_basics_clean```.