# Cartesian Product



So far we have restricted ourselves to operators that work on one table at a time. This makes sense, because each operator takes a relation and produces a relation. However, a typical database contains many tables, and those tables are often related. So how do we write queries that use multiple tables?

The first step toward applying the operators we have learned so far to multiple tables is to merge the tables together. We do this using the cartesian product. A cartesian product creates one table out of two by pairing each row in table A with each row in table B. The new relation's column count is the sum of the two tables' column counts, and its row count is the product of their row counts!

![](cartprod1.png)
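To make the row and column arithmetic concrete, here is a minimal sketch in plain Python. It does not use reframe; the relations are modeled as lists of dictionaries, and the data and the helper name `cartesian_product` are made up for illustration.

```python
from itertools import product

# Toy relations as lists of dicts (hypothetical data, not from the course files).
table_a = [
    {"a1": 1, "a2": "x"},
    {"a1": 2, "a2": "y"},
    {"a1": 3, "a2": "z"},
]
table_b = [
    {"b1": 10, "b2": "p", "b3": True},
    {"b1": 20, "b2": "q", "b3": False},
]


def cartesian_product(left, right):
    """Pair every row of `left` with every row of `right`."""
    # Column names are assumed to be distinct, so merging the dicts loses nothing.
    return [{**l_row, **r_row} for l_row, r_row in product(left, right)]


result = cartesian_product(table_a, table_b)
print(len(result))     # 6 rows: 3 * 2
print(len(result[0]))  # 5 columns: 2 + 3
```

Even with three rows and two rows we already get six combinations, which is why the product of two realistic tables grows so quickly.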


Of course this can create an **enormous** table, so the cartesian product is usually followed by a query that limits the number of rows by comparing a column in relation A against a column in relation B.

![](cartprod2.png)
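The product-then-query pattern can be sketched in plain Python as follows. This is an illustration rather than the reframe implementation: the rows mirror the small R1 and R2 relations shown later in this section, and the helper `qualified` (which prefixes each column with its relation name so the two C1 columns do not collide) is a hypothetical name.

```python
from itertools import product

# Rows mirror the R1 and R2 relations used later in this section.
r1 = [{"C1": 1, "C2": "A"}, {"C1": 2, "C2": "B"},
      {"C1": 3, "C2": "C"}, {"C1": 5, "C2": "D"}]
r2 = [{"C1": 7, "C3": "E"}, {"C1": 3, "C3": "F"},
      {"C1": 1, "C3": "J"}, {"C1": 2, "C3": "L"}]


def qualified(row, relation_name):
    """Prefix every column with its relation name, e.g. C1 -> r1.C1."""
    return {f"{relation_name}.{col}": val for col, val in row.items()}


# Step 1: the cartesian product pairs every row of r1 with every row of r2.
pairs = [{**qualified(a, "r1"), **qualified(b, "r2")} for a, b in product(r1, r2)]

# Step 2: the query keeps only the pairs whose C1 values agree.
matches = [row for row in pairs if row["r1.C1"] == row["r2.C1"]]

print(len(pairs))    # 16 candidate rows
print(len(matches))  # 3 rows survive the comparison
```

Most of the sixteen intermediate rows are thrown away, which is the cost the query step is there to contain.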


# Natural Join

The natural join, or ``njoin``, operator takes the pattern of cartesian product followed by query and wraps it into one operation, subject to the following:

* The query condition tests for equality
* The query condition of equality applies to all columns with the same name in both relations.

You can see this in the following diagram, where we have two relations that both have a column named C1.
The resulting relation has a single C1 column and contains only the rows where C1 holds the same value in both relations. The remaining columns are filled in with the values from the matching rows.

![](njoin.png)
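The two rules above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how reframe implements ``njoin``; the function name `natural_join` and the list-of-dicts representation are assumptions, and the rows mirror the R1 and R2 relations used in the cells that follow.

```python
def natural_join(left, right):
    """Join two relations on every column name they have in common."""
    if not left or not right:
        return []
    # Rule 2: the condition applies to all columns present in both relations.
    shared = [col for col in left[0] if col in right[0]]
    joined = []
    for l_row in left:
        for r_row in right:
            # Rule 1: the condition is a test for equality.
            if all(l_row[col] == r_row[col] for col in shared):
                # Merging the dicts keeps a single copy of each shared column.
                joined.append({**l_row, **r_row})
    return joined


r1 = [{"C1": 1, "C2": "A"}, {"C1": 2, "C2": "B"},
      {"C1": 3, "C2": "C"}, {"C1": 5, "C2": "D"}]
r2 = [{"C1": 7, "C3": "E"}, {"C1": 3, "C3": "F"},
      {"C1": 1, "C3": "J"}, {"C1": 2, "C3": "L"}]

for row in natural_join(r1, r2):
    print(row)
```

Note that if the two relations share no column names, the equality test is trivially true and this sketch degenerates into a plain cartesian product.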



In [8]:
import warnings
warnings.filterwarnings('ignore')

from reframe import Relation
r1 = Relation('/home/faculty/millbr02/pub/R1.csv',sep=',')
r2 = Relation('/home/faculty/millbr02/pub/R2.csv',sep=',')
r1

Unnamed: 0,C1,C2
0,1,A
1,2,B
2,3,C
3,5,D


In [9]:
r2

Unnamed: 0,C1,C3
0,7,E
1,3,F
2,1,J
3,2,L


In [10]:
r1.njoin(r2)

Unnamed: 0,C1,C2,C3
0,1,A,J
1,2,B,L
2,3,C,F


In [11]:
%load_ext sql

In [16]:
%sql postgresql://millbr02:@localhost/jtest

'Connected: millbr02@jtest'

In [18]:
%%sql

select * 
from r1 natural join r2


3 rows affected.


C1,C2,C3
1,A,J
2,B,L
3,C,F


To return to the cartesian product example, note that it is not 100% equivalent to the natural join: the cartesian product retains the second copy of column C1, which is displayed here as C1_1.

In [27]:
%%sql

select * 
from r1, r2
where r1."C1" = r2."C1"

3 rows affected.


C1,C2,C1_1,C3
1,A,1,J
2,B,2,L
3,C,3,F
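The difference in columns can be reproduced without a PostgreSQL server. The sketch below uses Python's built-in `sqlite3` module with an in-memory database, which is an assumption made for portability; the course examples use PostgreSQL. One visible difference is that SQLite reports both copies of the column under the name C1 rather than renaming the second one as in the output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table r1 ("C1" integer, "C2" text);
create table r2 ("C1" integer, "C3" text);
insert into r1 values (1, 'A'), (2, 'B'), (3, 'C'), (5, 'D');
insert into r2 values (7, 'E'), (3, 'F'), (1, 'J'), (2, 'L');
""")

# Natural join: the shared column C1 appears once in the result.
cur = conn.execute('select * from r1 natural join r2 order by "C1"')
natural_cols = [d[0] for d in cur.description]
natural_rows = cur.fetchall()

# Cartesian product plus a where clause: both copies of C1 are kept.
cur = conn.execute(
    'select * from r1, r2 where r1."C1" = r2."C1" order by r1."C1"'
)
product_cols = [d[0] for d in cur.description]
product_rows = cur.fetchall()

print(natural_cols, natural_rows)
print(product_cols, product_rows)
conn.close()
```

Both queries select the same three matches; only the shape of each row differs.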


Now let's look at a more realistic example. Our city and country tables present some problems:

* the column ``name`` is in both relations, but means different things
* the column ``population`` is in both relations, but means different things
* the column we would like to join on holds the country code, but it is called ``code`` in the country relation and ``countrycode`` in the city relation.

We can remedy this in relational algebra by using the rename operator.
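To see why the renaming matters, here is a small plain-Python sketch that joins a few rows taken from the city and country tables, first without and then with renaming. The helpers `natural_join` and `rename` are hypothetical stand-ins written for this illustration, not reframe's own code, and only a handful of columns are kept.

```python
def natural_join(left, right):
    """Join on equality of every column name the two relations share."""
    if not left or not right:
        return []
    shared = [col for col in left[0] if col in right[0]]
    return [{**l_row, **r_row}
            for l_row in left for r_row in right
            if all(l_row[col] == r_row[col] for col in shared)]


def rename(relation, old, new):
    """Return a copy of the relation with column `old` renamed to `new`."""
    return [{(new if col == old else col): val for col, val in row.items()}
            for row in relation]


city = [
    {"id": 1, "name": "Kabul", "countrycode": "AFG", "population": 1780000},
    {"id": 3, "name": "Herat", "countrycode": "AFG", "population": 186800},
    {"id": 5, "name": "Amsterdam", "countrycode": "NLD", "population": 731200},
]
country = [
    {"code": "AFG", "name": "Afghanistan", "population": 22720000},
    {"code": "NLD", "name": "Netherlands", "population": 15864000},
]

# Without renaming, the join silently matches on name AND population.
naive = natural_join(city, country)
print(len(naive))  # 0: no city shares both its name and population with a country

# After renaming, countrycode is the only shared column.
fixed = natural_join(
    rename(rename(city, "name", "cname"), "population", "pop"),
    rename(country, "code", "countrycode"),
)
print([(row["cname"], row["name"]) for row in fixed])
```

The empty result from the naive join is the trap: nothing fails, the join just compares the wrong columns.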

In [28]:
city = Relation('/home/faculty/millbr02/pub/city.csv')
country = Relation('/home/faculty/millbr02/pub/country.csv')

In [29]:
city.head()

Unnamed: 0,id,name,countrycode,district,population
0,1,Kabul,AFG,Kabol,1780000
1,2,Qandahar,AFG,Qandahar,237500
2,3,Herat,AFG,Herat,186800
3,4,Mazar-e-Sharif,AFG,Balkh,127800
4,5,Amsterdam,NLD,Noord-Holland,731200


In [30]:
country.head()

Unnamed: 0,code,name,continent,region,surfacearea,indepyear,population,lifeexpectancy,gnp,gnpold,localname,governmentform,headofstate,capital,code2
0,AFG,Afghanistan,Asia,Southern and Central Asia,652090,1919.0,22720000,45.9,5976,,Afganistan/Afqanestan,Islamic Emirate,Mohammad Omar,1,AF
1,NLD,Netherlands,Europe,Western Europe,41526,1581.0,15864000,78.3,371362,360478.0,Nederland,Constitutional Monarchy,Beatrix,5,NL
2,ANT,Netherlands Antilles,North America,Caribbean,800,,217000,74.7,1941,,Nederlandse Antillen,Nonmetropolitan Territory of The Netherlands,Beatrix,33,AN
3,ALB,Albania,Europe,Southern Europe,28748,1912.0,3401200,71.6,3205,2500.0,Shqipëria,Republic,Rexhep Mejdani,34,AL
4,DZA,Algeria,Africa,Northern Africa,2381740,1962.0,31471000,69.7,49982,46966.0,Al-Jazair/Algérie,Republic,Abdelaziz Bouteflika,35,DZ


In [31]:
city.rename('name','cname').rename('population','pop').head()

Unnamed: 0,id,cname,countrycode,district,pop
0,1,Kabul,AFG,Kabol,1780000
1,2,Qandahar,AFG,Qandahar,237500
2,3,Herat,AFG,Herat,186800
3,4,Mazar-e-Sharif,AFG,Balkh,127800
4,5,Amsterdam,NLD,Noord-Holland,731200


Now let's select the cities in Norway.

In [38]:
city.rename('name','cname').rename('population','pop').njoin(country.rename('code','countrycode').query("name == 'Norway'")).\
    project(['cname','name','countrycode'])


Unnamed: 0,cname,name,countrycode
0,Oslo,Norway,NOR
1,Bergen,Norway,NOR
2,Trondheim,Norway,NOR
3,Stavanger,Norway,NOR
4,Bærum,Norway,NOR


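In SQL, the natural join gives us no convenient way to rename columns on the fly (short of wrapping each table in a subquery with column aliases), but we do have a more general join operator that lets us name the join columns explicitly. We can use it to write the above query as follows: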



In [39]:
%sql postgresql://millbr02:@localhost/world

'Connected: millbr02@world'

In [42]:
%%sql

select city.name, country.name, countrycode
from city join country on code = countrycode
where country.name = 'Norway'

5 rows affected.


name,name_1,countrycode
Oslo,Norway,NOR
Bergen,Norway,NOR
Trondheim,Norway,NOR
Stavanger,Norway,NOR
Bærum,Norway,NOR
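The result above has two columns called name, which the output displays as name and name_1. One way to get clearer headers is to alias the columns in the select list with `as`. The sketch below tries this with Python's built-in `sqlite3` module and an in-memory database, an assumption made so it runs without the course's PostgreSQL server; the tables hold only a few rows and two columns each, and the alias names `city_name` and `country_name` are chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table city (name text, countrycode text);
create table country (code text, name text);
insert into city values ('Oslo', 'NOR'), ('Bergen', 'NOR'), ('Amsterdam', 'NLD');
insert into country values ('NOR', 'Norway'), ('NLD', 'Netherlands');
""")

# The on clause names the join columns explicitly, so they need not match by name.
cur = conn.execute("""
select city.name as city_name, country.name as country_name, countrycode
from city join country on code = countrycode
where country.name = 'Norway'
order by city.name
""")
columns = [d[0] for d in cur.description]
rows = cur.fetchall()

print(columns)
print(rows)
conn.close()
```

The same `as` aliases should work in the PostgreSQL query above, since column aliases are standard SQL.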


The natural join operator works very well on the movie database, because the moviecast table and the release_date table share two columns with the same names.

Let's use natural join to find the names of all of the lead actors in the movies released in October of 2015 in Norway.

In [43]:
%sql postgresql://millbr02:@localhost/movies

'Connected: millbr02@movies'

In [46]:
%%sql

select title, name, date
from moviecast natural join release_date
where month = 10 and year = 2015 and country = 'Norway' and n = 1
order by date

15 rows affected.


title,name,date
The Martian,Matt Damon,2015-10-02 00:00:00
The Intern,Robert De Niro,2015-10-02 00:00:00
Pan,Hugh Jackman,2015-10-07 00:00:00
The Walk (II),Joseph Gordon-Levitt,2015-10-09 00:00:00
Klovn Forever,Casper Christensen,2015-10-09 00:00:00
Rudhramadevi,Anushka Shetty,2015-10-09 00:00:00
The Transporter Refueled,Ed Skrein,2015-10-09 00:00:00
Crimson Peak,Mia Wasikowska,2015-10-16 00:00:00
Legend,Paul (XVIII) Anderson,2015-10-16 00:00:00
Black Mass,Johnny Depp,2015-10-23 00:00:00


In [47]:
moviecast = Relation('/home/faculty/millbr02/pub/cast.csv',sep=',')
release_date = Relation('/home/faculty/millbr02/pub/release_dates.csv',sep=',')

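There are a couple of things to notice and think about in the next two cells. Both find the same movies, but the first filters each relation before joining, while the second joins the full relations and filters afterward.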

In [51]:
moviecast.query("n == 1").njoin(release_date.query("month == 10 & year == 2015 & country == 'Norway'")).project(['title','name','date']).sort(['date'])

Unnamed: 0,title,name,date
2,The Martian,Matt Damon,2015-10-02
3,The Intern,Robert De Niro,2015-10-02
6,Pan,Hugh Jackman,2015-10-07
1,Klovn Forever,Casper Christensen,2015-10-09
5,The Walk (II),Joseph Gordon-Levitt,2015-10-09
11,The Transporter Refueled,Ed Skrein,2015-10-09
13,Rudhramadevi,Anushka Shetty,2015-10-09
0,Legend,Paul (XVIII) Anderson,2015-10-16
14,Crimson Peak,Mia Wasikowska,2015-10-16
4,Black Mass,Johnny Depp,2015-10-23


In [53]:
release_date.njoin(moviecast).query("month == 10 & year == 2015 & country == 'Norway' & n == 1").project(['title','name','date']).sort(['date'])

Unnamed: 0,title,name,date
13501247,The Intern,Robert De Niro,2015-10-02
13924371,The Martian,Matt Damon,2015-10-02
9242941,Pan,Hugh Jackman,2015-10-07
6707245,Klovn Forever,Casper Christensen,2015-10-09
10336246,Rudhramadevi,Anushka Shetty,2015-10-09
14850153,The Transporter Refueled,Ed Skrein,2015-10-09
14941867,The Walk (II),Joseph Gordon-Levitt,2015-10-09
2850250,Crimson Peak,Mia Wasikowska,2015-10-16
7187906,Legend,Paul (XVIII) Anderson,2015-10-16
1699579,Black Mass,Johnny Depp,2015-10-23
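The large index values in the last output suggest that joining first builds a very large intermediate relation before the query trims it down. The plain-Python sketch below illustrates the effect on a tiny scale. The helper `natural_join` is a hypothetical stand-in for ``njoin``, the two lead-actor rows come from the results above, and the supporting-cast rows and the USA release row are made up for illustration.

```python
def natural_join(left, right):
    """Join on equality of every column name the two relations share."""
    if not left or not right:
        return []
    shared = [col for col in left[0] if col in right[0]]
    return [{**l_row, **r_row}
            for l_row in left for r_row in right
            if all(l_row[col] == r_row[col] for col in shared)]


moviecast = [
    {"title": "The Martian", "year": 2015, "name": "Matt Damon", "n": 1},
    {"title": "The Martian", "year": 2015, "name": "Supporting Actor A", "n": 2},  # made up
    {"title": "Pan", "year": 2015, "name": "Hugh Jackman", "n": 1},
    {"title": "Pan", "year": 2015, "name": "Supporting Actor B", "n": 2},  # made up
]
release_date = [
    {"title": "The Martian", "year": 2015, "country": "Norway", "month": 10, "date": "2015-10-02"},
    {"title": "The Martian", "year": 2015, "country": "USA", "month": 10, "date": "2015-10-02"},  # made up
    {"title": "Pan", "year": 2015, "country": "Norway", "month": 10, "date": "2015-10-07"},
]


def wanted_release(row):
    return row["month"] == 10 and row["year"] == 2015 and row["country"] == "Norway"


# Plan 1: filter each relation first, then join the small results.
leads = [row for row in moviecast if row["n"] == 1]
norway = [row for row in release_date if wanted_release(row)]
filter_first = natural_join(leads, norway)

# Plan 2: join everything, then filter the large intermediate result.
joined_all = natural_join(release_date, moviecast)
join_first = [row for row in joined_all if wanted_release(row) and row["n"] == 1]


def summarize(rows):
    return [(row["title"], row["name"], row["date"]) for row in rows]


print(len(leads), len(norway), len(filter_first))  # sizes along plan 1
print(len(joined_all), len(join_first))            # sizes along plan 2
print(summarize(filter_first) == summarize(join_first))
```

Both plans produce the same answer, but the second one materializes every cast member paired with every release of the same movie before discarding most of those rows.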
