# Create SQLite Database

In this Jupyter Notebook we will create a database that will be used to analyse IMDb data. We will use SQLite, a database engine that lets us run SQL queries from Python. To write those queries, we use the <i>%sql</i> and <i>%%sql</i> commands from ipython-sql, which let us connect to a database and then write SQL directly in Jupyter Notebook cells.

In [1]:
import pandas as pd
import sqlite3

The IMDb database contains 7 separate data files (see [here](https://datasets.imdbws.com/)) with information about movies and TV series, their ratings, crew, and actors, among others. In this analysis, we will use 4 data files with the following information:
- <b>title.basics</b>: title of the movie/show, type (movie, short, TV episode, ...), year, runtime, and genre;
- <b>title.crew</b>: unique identifiers of directors and writers;
- <b>title.ratings</b>: average rating, and number of votes;
- <b>name.basics</b>: name of the person, year of birth, year of death (if applicable), profession, and the titles the person is known for.

Additionally, each file contains an alphanumeric unique identifier of the movie/show (<i>tconst</i> variable) with the exception of the name.basics file, which contains the unique identifier of the person (directors, writers, actors).
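
Because <i>tconst</i> is shared across the title files, it can serve as the key that links them. The sketch below illustrates this with three rows copied from the dataframe previews further down in this notebook; the names <i>basics_sample</i>, <i>ratings_sample</i>, and <i>titles_with_ratings</i> are chosen only for this illustration.

```python
import pandas as pd

# Three rows copied from the title.basics and title.ratings previews in this notebook.
basics_sample = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primaryTitle": ["Carmencita", "Le clown et ses chiens", "Pauvre Pierrot"],
})
ratings_sample = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "averageRating": [5.7, 5.8, 6.5],
    "numVotes": [1910, 256, 1712],
})

# tconst is the shared key, so merging on it attaches each rating to its title.
titles_with_ratings = basics_sample.merge(ratings_sample, on="tconst", how="inner")
print(titles_with_ratings)
```

The same idea carries over to SQL once the tables are in the database, where the merge becomes a JOIN on <i>tconst</i>.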

The Pandas library is used to read the files, which come in the .tsv format (tab-separated values).

In [2]:
basics_df = pd.read_csv('data/basics.tsv', sep='\t')
ratings_df = pd.read_csv('data/ratings.tsv', sep='\t')
crew_df = pd.read_csv('data/crew.tsv', sep='\t')
names_df = pd.read_csv('data/names.tsv', sep='\t')
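
The IMDb files mark missing values with <i>\N</i>, as the previews below show. As a minimal sketch, assuming we want pandas to treat that marker as missing, the <i>na_values</i> argument of <i>read_csv</i> can be used. The example reads two rows copied from the basics preview (with a subset of the columns) from an in-memory string rather than from the real file. The <i>low_memory=False</i> argument is optional; it may help if pandas warns about mixed column types on the full files.

```python
import io
import pandas as pd

# Two rows from the basics preview, restricted to a few columns for brevity.
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\truntimeMinutes\n"
    "tt0000001\tshort\tCarmencita\t1894\t\\N\t1\n"
    "tt0000002\tshort\tLe clown et ses chiens\t1892\t\\N\t5\n"
)

sample_df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    na_values="\\N",   # treat IMDb's \N marker as a missing value
    low_memory=False,  # read the file in one pass before inferring column types
)
print(sample_df.dtypes)
```

With the marker converted, <i>endYear</i> is read as a numeric column of missing values instead of a column of strings.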


To get familiar with the data, we can look at the first rows of each dataframe.

### Basics Dataframe

In [3]:
basics_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


### Ratings Dataframe

In [4]:
ratings_df.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1910
1,tt0000002,5.8,256
2,tt0000003,6.5,1712
3,tt0000004,5.6,169
4,tt0000005,6.2,2527


### Crew Dataframe

In [5]:
crew_df.head()

Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,\N
1,tt0000002,nm0721526,\N
2,tt0000003,nm0721526,\N
3,tt0000004,nm0721526,\N
4,tt0000005,nm0005690,\N


### Names Dataframe

In [6]:
names_df.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0031983,tt0050419,tt0072308,tt0053137"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0117057,tt0071877,tt0038355,tt0037382"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,music_department","tt0054452,tt0049189,tt0056404,tt0057345"
3,nm0000004,John Belushi,1949,1982,"actor,soundtrack,writer","tt0077975,tt0080455,tt0072562,tt0078723"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0060827,tt0083922,tt0050986,tt0050976"


Now that we have the data we need saved as pandas dataframes, we will create our SQLite database, called <i>IMDb</i>.

In [7]:
cnn = sqlite3.connect('IMDb.db')

The last step is adding our tables to the database, which can be done with the Pandas dataframe method <i>.to_sql</i>.

In [8]:
basics_df.to_sql('basics', cnn)
ratings_df.to_sql('ratings', cnn)
crew_df.to_sql('crew', cnn)
names_df.to_sql('names', cnn)
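
By default, <i>.to_sql</i> also writes the dataframe index as an extra column and raises an error if the table already exists, so re-running the cell above fails. The following is a sketch of one possible variant, not a required change: it uses an in-memory database and a two-row sample (both chosen only for this illustration) to show the <i>index</i> and <i>if_exists</i> arguments.

```python
import sqlite3
import pandas as pd

# Two rows copied from the ratings preview.
ratings_sample = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "averageRating": [5.7, 5.8],
    "numVotes": [1910, 256],
})

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the demo

# index=False skips the dataframe index; if_exists="replace" drops and
# recreates the table, so running the same call twice does not raise.
ratings_sample.to_sql("ratings", conn, index=False, if_exists="replace")
ratings_sample.to_sql("ratings", conn, index=False, if_exists="replace")

# Inspect what was created: the table list and the columns of "ratings".
table_names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
column_names = [row[1] for row in conn.execute("PRAGMA table_info(ratings)")]
print(table_names, column_names)
conn.close()
```

Whether to replace or append is a design choice; <i>if_exists="append"</i> would add the rows to the existing table instead.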

SQLite does not run as a server. Instead, it writes the data directly to a file on disk (here, <i>IMDb.db</i>), which we can open again later.
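
To illustrate that the data lives in a file, the sketch below writes a small table, closes the connection, and then opens a new connection to query it. The file <i>demo.db</i> in a temporary directory and the sample rows are assumptions made for this example, so the real <i>IMDb.db</i> is not touched.

```python
import os
import sqlite3
import tempfile
import pandas as pd

# Hypothetical database file inside a temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

# Three rows copied from the ratings preview.
ratings_sample = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "averageRating": [5.7, 5.8, 6.5],
})

# First connection: write the table, commit, and close.
conn = sqlite3.connect(db_path)
ratings_sample.to_sql("ratings", conn, index=False)
conn.commit()
conn.close()

# Second connection: the table is still available because it is stored in the file.
conn = sqlite3.connect(db_path)
top_rated = pd.read_sql_query(
    "SELECT tconst, averageRating FROM ratings ORDER BY averageRating DESC LIMIT 1",
    conn,
)
conn.close()
print(top_rated)
```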