<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Connect-the-sql" data-toc-modified-id="Connect-the-sql-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Connect the sql</a></span></li><li><span><a href="#Create-pandas-dataframes-from-database-dat-files" data-toc-modified-id="Create-pandas-dataframes-from-database-dat-files-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Create pandas dataframes from database dat files</a></span></li><li><span><a href="#Make-dataframes-columns-dtype-good" data-toc-modified-id="Make-dataframes-columns-dtype-good-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Make dataframes columns dtype good</a></span></li><li><span><a href="#Create-pandas-df-of-all-tables-and-columns-names" data-toc-modified-id="Create-pandas-df-of-all-tables-and-columns-names-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Create pandas df of all tables and columns names</a></span></li><li><span><a href="#Section6:-Advanced-SQL-Commands" data-toc-modified-id="Section6:-Advanced-SQL-Commands-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Section6: Advanced SQL Commands</a></span><ul class="toc-item"><li><span><a href="#Timestamp" data-toc-modified-id="Timestamp-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Timestamp</a></span></li><li><span><a href="#Mathematical-functions" data-toc-modified-id="Mathematical-functions-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Mathematical functions</a></span></li><li><span><a href="#String-Functions-and-Operators" data-toc-modified-id="String-Functions-and-Operators-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>String Functions and Operators</a></span></li><li><span><a href="#Subquery" data-toc-modified-id="Subquery-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Subquery</a></span><ul class="toc-item"><li><span><a href="#example-1" data-toc-modified-id="example-1-5.4.1"><span class="toc-item-num">5.4.1&nbsp;&nbsp;</span>example 1</a></span></li></ul></li><li><span><a 
href="#SELF-JOIN" data-toc-modified-id="SELF-JOIN-5.5"><span class="toc-item-num">5.5&nbsp;&nbsp;</span>SELF JOIN</a></span></li></ul></li></ul></div>

# Connect the sql

In [1]:
import numpy as np
import pandas as pd
import os
import yaml

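# Read the database credentials from a YAML file in the home directory.
conf_path = os.path.join(os.path.expanduser('~'), '.postgres_conf.yml')
with open(conf_path, 'r') as stream:
    # Let yaml.YAMLError propagate: printing it and continuing would only
    # postpone the failure to a NameError on yaml_dict below.
    yaml_dict = yaml.safe_load(stream)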

pw = yaml_dict['password']
port = yaml_dict['port']

In [2]:
%load_ext sql

In [3]:
%sql postgres://postgres:$pw@localhost:$port/dvdrental

'Connected: postgres@dvdrental'

![](../images/dvdrental_schema.png)

# Create pandas dataframes from database dat files

In [4]:
#  %sql select * from staff;

In [5]:
# I could drop the existing staff table, but I want to keep it for learning purposes.
# Instead, I will build a separate copy called staffs.

In [6]:
staffs = pd.read_csv('../data/dvdrental/2187.dat', sep=r'\t',
                     header=None, engine='python')

cols = ['staff_id', 'first_name', 'last_name', 'address_id', 'email',
        'store_id', 'active', 'username', 'password', 'last_update',
        'picture']

staffs.columns = cols
staffs = staffs.head(2).copy()  # copy so the assignment below does not trigger SettingWithCopyWarning
staffs['active'] = True
staffs = staffs.drop('picture', axis=1)
print(staffs.shape)
staffs.head()

(2, 10)


Unnamed: 0,staff_id,first_name,last_name,address_id,email,store_id,active,username,password,last_update
0,1,Mike,Hillyer,3,Mike.Hillyer@sakilastaff.com,1,True,Mike,8cb2237d0679ca88db6464eac60da96345513964,2006-05-16 16:13:11.79328
1,2,Jon,Stephens,4,Jon.Stephens@sakilastaff.com,2,True,Jon,8cb2237d0679ca88db6464eac60da96345513964,2006-05-16 16:13:11.79328


In [7]:
tables = ['staffs', 'category', 'film_category', 'country', 'actor',
          'language', 'inventory', 'payment', 'rental', 'city',
          'store', 'film', 'address', 'film_actor', 'customer']

staff = tables[0]  # not queried below; the staffs dataframe built above is used instead
category = tables[1]
film_category = tables[2]
country = tables[3]
actor = tables[4]
language = tables[5]
inventory = tables[6]
payment = tables[7]
rental = tables[8]
city = tables[9]
store = tables[10]
film = tables[11]
address = tables[12]
film_actor = tables[13]
customer = tables[14]

In [8]:
# First run one query per table; the results are converted to pandas dataframes in the next cell.
# staff = %sql select * from $staff;  # this fails
category = %sql select * from $category;
film_category = %sql select * from $film_category;
country = %sql select * from $country;
actor = %sql select * from $actor;
language = %sql select * from $language;
inventory = %sql select * from $inventory;
payment = %sql select * from $payment;
rental = %sql select * from $rental;
city = %sql select * from $city;
store = %sql select * from $store;
film = %sql select * from $film;
address = %sql select * from $address;
film_actor = %sql select * from $film_actor;
customer = %sql select * from $customer;

 * postgres://postgres:***@localhost:5432/dvdrental
16 rows affected.
 * postgres://postgres:***@localhost:5432/dvdrental
1000 rows affected.
 * postgres://postgres:***@localhost:5432/dvdrental
109 rows affected.
 * postgres://postgres:***@localhost:5432/dvdrental
200 rows affected.
 * postgres://postgres:***@localhost:5432/dvdrental
6 rows affected.
 * postgres://postgres:***@localhost:5432/dvdrental
4581 rows affected.
 * postgres://postgres:***@localhost:5432/dvdrental
14596 rows affected.
 * postgres://postgres:***@localhost:5432/dvdrental
16044 rows affected.
 * postgres://postgres:***@localhost:5432/dvdrental
600 rows affected.
 * postgres://postgres:***@localhost:5432/dvdrental
2 rows affected.
 * postgres://postgres:***@localhost:5432/dvdrental
1000 rows affected.
 * postgres://postgres:***@localhost:5432/dvdrental
603 rows affected.
 * postgres://postgres:***@localhost:5432/dvdrental
5462 rows affected.
 * postgres://postgres:***@localhost:5432/dvdrental
599 rows affected.


In [9]:
category = category.DataFrame()
film_category = film_category.DataFrame()
country = country.DataFrame()
actor = actor.DataFrame()
language = language.DataFrame()
inventory = inventory.DataFrame()
payment = payment.DataFrame()
rental = rental.DataFrame()
city = city.DataFrame()
store = store.DataFrame()
film = film.DataFrame()
address = address.DataFrame()
film_actor = film_actor.DataFrame()
customer = customer.DataFrame()
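
The two cells above repeat one line per table. As a rough sketch of the same idea, the loop below builds a dictionary of dataframes keyed by table name. It is not the notebook's own method: it uses an in-memory SQLite database with two made-up tables as a stand-in for the PostgreSQL `dvdrental` database, and `pd.read_sql_query` instead of the `%sql` magic, so it can run without a database server.

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite database stands in for dvdrental.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (category_id INTEGER, name TEXT);
    INSERT INTO category VALUES (1, 'Action'), (2, 'Animation');
    CREATE TABLE language (language_id INTEGER, name TEXT);
    INSERT INTO language VALUES (1, 'English');
""")

# Hypothetical, fixed list of table names (never interpolate untrusted input).
table_names = ["category", "language"]

# One dataframe per table, keyed by the table name.
frames = {name: pd.read_sql_query(f"SELECT * FROM {name}", conn)
          for name in table_names}
conn.close()

print(frames["category"].shape)
```

Keeping the dataframes in a dictionary avoids reusing the same variable name first for the table name, then for the query result, and finally for the dataframe.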

# Make dataframes columns dtype good

In [10]:
df_tables = [staffs, category, film_category, country, actor,
            language, inventory, payment, rental, city,
            store, film, address, film_actor, customer]

In [11]:
def show_first_value_and_dtype(num):
    df_tables_dtypes = [df_tables[i].dtypes.to_frame()
                        for i in range(len(df_tables)) ]
    df_tables_first_value = [df_tables[i].head(1).T
                             for i in range(len(df_tables)) ]

    display(pd.concat([df_tables_first_value[num], df_tables_dtypes[num]],
                      axis=1, sort=True,ignore_index=True)
     .rename(columns={0: 'value', 1: 'dtype'})
     .style.apply(lambda x: ['background: lightblue' 
                             if x['dtype'] == 'object'
                             else ''
                             for _ in x],axis=1)
            .set_caption('Dataframe name: ' + tables[num])
    )

In [12]:
show_first_value_and_dtype(0)

Unnamed: 0,value,dtype
active,True,bool
address_id,3,int64
email,Mike.Hillyer@sakilastaff.com,object
first_name,Mike,object
last_name,Hillyer,object
last_update,2006-05-16 16:13:11.79328,object
password,8cb2237d0679ca88db6464eac60da96345513964,object
staff_id,1,int64
store_id,1,int64
username,Mike,object


In [13]:
staffs['last_update'] = pd.to_datetime(staffs['last_update'])
staffs.dtypes

staff_id                int64
first_name             object
last_name              object
address_id              int64
email                  object
store_id                int64
active                   bool
username               object
password               object
last_update    datetime64[ns]
dtype: object

In [14]:
len(tables)

15

In [15]:
# Keep inspecting the remaining tables and convert dtypes where necessary.
show_first_value_and_dtype(14)

Unnamed: 0,value,dtype
active,1,int64
activebool,True,bool
address_id,530,int64
create_date,2006-02-14,object
customer_id,524,int64
email,jared.ely@sakilacustomer.org,object
first_name,Jared,object
last_name,Ely,object
last_update,2013-05-26 14:49:45.738000,datetime64[ns]
store_id,1,int64


In [16]:
payment['amount'] = pd.to_numeric(payment['amount'], errors='coerce')

In [17]:
film['rental_rate'] = pd.to_numeric(film['rental_rate'], errors='coerce')
film['replacement_cost'] = pd.to_numeric(film['replacement_cost'], errors='coerce')

In [18]:
customer['create_date'] = pd.to_datetime(customer['create_date'])
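
The conversions above can be tried on a small made-up dataframe. The sketch below assumes string inputs (the column values and the `"n/a"` entry are invented for illustration) and shows that `errors='coerce'` turns unparseable values into `NaN` instead of raising an error.

```python
import pandas as pd

# Toy data (hypothetical values); both columns start as dtype object.
toy = pd.DataFrame({
    "amount": ["7.99", "1.99", "n/a"],
    "create_date": ["2006-02-14", "2006-02-15", "2006-02-16"],
})

# Unparseable entries become NaN because of errors="coerce".
toy["amount"] = pd.to_numeric(toy["amount"], errors="coerce")
toy["create_date"] = pd.to_datetime(toy["create_date"])

# Count how many values were coerced to NaN.
n_bad = int(toy["amount"].isna().sum())
print(toy.dtypes)
print("values coerced to NaN:", n_bad)
```

Counting the `NaN` values after a coercion is a cheap check that the conversion did not silently discard real data.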

# Create pandas df of all tables and columns names

In [19]:
df_tables = [staffs, category, film_category, country, actor,
            language, inventory, payment, rental, city,
            store, film, address, film_actor, customer]

In [20]:
all_columns = [df.columns.tolist() for df in df_tables]
all_columns[0]

['staff_id',
 'first_name',
 'last_name',
 'address_id',
 'email',
 'store_id',
 'active',
 'username',
 'password',
 'last_update']

In [21]:
df_tables_cols = pd.DataFrame(all_columns).T.fillna('')

df_tables_cols.columns = tables
df_tables_cols

Unnamed: 0,staffs,category,film_category,country,actor,language,inventory,payment,rental,city,store,film,address,film_actor,customer
0,staff_id,category_id,film_id,country_id,actor_id,language_id,inventory_id,payment_id,rental_id,city_id,store_id,film_id,address_id,actor_id,customer_id
1,first_name,name,category_id,country,first_name,name,film_id,customer_id,rental_date,city,manager_staff_id,title,address,film_id,store_id
2,last_name,last_update,last_update,last_update,last_name,last_update,store_id,staff_id,inventory_id,country_id,address_id,description,address2,last_update,first_name
3,address_id,,,,last_update,,last_update,rental_id,customer_id,last_update,last_update,release_year,district,,last_name
4,email,,,,,,,amount,return_date,,,language_id,city_id,,email
5,store_id,,,,,,,payment_date,staff_id,,,rental_duration,postal_code,,address_id
6,active,,,,,,,,last_update,,,rental_rate,phone,,activebool
7,username,,,,,,,,,,,length,last_update,,create_date
8,password,,,,,,,,,,,replacement_cost,,,last_update
9,last_update,,,,,,,,,,,rating,,,active
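
A wide table like the one above is easy to read, but a long (table, column) layout is easier to query, for example to find candidate join keys. The sketch below is one possible way to do that. The `frames` dictionary and its two empty dataframes are hypothetical stand-ins for the real tables.

```python
import pandas as pd

# Hypothetical stand-ins for two of the dvdrental tables (columns only).
frames = {
    "category": pd.DataFrame(columns=["category_id", "name", "last_update"]),
    "film_category": pd.DataFrame(columns=["film_id", "category_id", "last_update"]),
}

# One row per (table, column) pair.
catalog = pd.DataFrame(
    [(table, col) for table, df in frames.items() for col in df.columns],
    columns=["table", "column"],
)

# Columns that appear in more than one table are candidate join keys.
counts = catalog.groupby("column")["table"].nunique()
shared = sorted(counts[counts > 1].index)
print(shared)
```

Bookkeeping columns such as `last_update` also show up as shared, so the result still needs a manual check before it is used as a join key.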


# Section6: Advanced SQL Commands

## Timestamp
![](../images/sql_datetime_functions.png)
![](../images/sql_datetime_operators.png)

In [22]:
%%sql
select * from payment limit 5;

 * postgres://postgres:***@localhost:5432/dvdrental
5 rows affected.


payment_id,customer_id,staff_id,rental_id,amount,payment_date
17503,341,2,1520,7.99,2007-02-15 22:25:46.996577
17504,341,1,1778,1.99,2007-02-16 17:23:14.996577
17505,341,1,1849,7.99,2007-02-16 22:41:45.996577
17506,341,2,2829,2.99,2007-02-19 19:39:56.996577
17507,341,2,3130,7.99,2007-02-20 17:31:48.996577


In [23]:
%%sql
select extract(day from payment_date) from payment limit 5;

 * postgres://postgres:***@localhost:5432/dvdrental
5 rows affected.


date_part
15.0
16.0
16.0
19.0
20.0


In [24]:
payment.payment_date.dt.day.head()

0    15
1    16
2    16
3    19
4    20
Name: payment_date, dtype: int64

In [25]:
%%sql
select sum(amount) as total_amount, extract(month from payment_date) as month
from payment
group by month
order by sum(amount) desc
limit 2;

 * postgres://postgres:***@localhost:5432/dvdrental
2 rows affected.


total_amount,month
28559.46,4.0
23886.56,3.0


In [26]:
payment.groupby(payment.payment_date.dt.month)['amount'].sum().reset_index()\
.rename(columns={'payment_date': 'month', 'amount': 'total_amount'})\
.sort_values('total_amount', ascending=False)\
.head(2)

Unnamed: 0,month,total_amount
2,4,28559.46
1,3,23886.56
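
The same month-level aggregation can also be written with named aggregation, which avoids the `reset_index` and `rename` steps. This is a sketch on made-up payment rows, not the notebook's data.

```python
import pandas as pd

# Toy payments (hypothetical values).
toy_payment = pd.DataFrame({
    "amount": [7.99, 1.99, 4.99, 2.99],
    "payment_date": pd.to_datetime(
        ["2007-02-15", "2007-03-01", "2007-03-20", "2007-04-06"]),
})

monthly = (
    toy_payment
    .assign(month=toy_payment["payment_date"].dt.month)  # like extract(month from ...)
    .groupby("month", as_index=False)
    .agg(total_amount=("amount", "sum"))                  # like sum(amount) as total_amount
    .sort_values("total_amount", ascending=False)
    .head(2)                                              # like limit 2
)
print(monthly)
```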


## Mathematical functions
https://www.postgresql.org/docs/9.1/functions-math.html

![](../images/sql_math_operators.png)
![](../images/sql_math_functions.png)
![](../images/sql_math_random_functions.png)
![](../images/sql_math_trig_functions.png)

In [27]:
%%sql
select * from payment limit 5;

 * postgres://postgres:***@localhost:5432/dvdrental
5 rows affected.


payment_id,customer_id,staff_id,rental_id,amount,payment_date
17503,341,2,1520,7.99,2007-02-15 22:25:46.996577
17504,341,1,1778,1.99,2007-02-16 17:23:14.996577
17505,341,1,1849,7.99,2007-02-16 22:41:45.996577
17506,341,2,2829,2.99,2007-02-19 19:39:56.996577
17507,341,2,3130,7.99,2007-02-20 17:31:48.996577


In [28]:
%%sql
select customer_id / rental_id as new_id -- / between two integers is integer division
from payment limit 2;

 * postgres://postgres:***@localhost:5432/dvdrental
2 rows affected.


new_id
0
0


In [29]:
payment.customer_id.div(payment.rental_id).head(2)

0    0.224342
1    0.191789
dtype: float64

In [30]:
payment.customer_id.div(payment.rental_id).rename('new_id').head(2)

0    0.224342
1    0.191789
Name: new_id, dtype: float64

In [31]:
payment.customer_id.add(payment.rental_id).rename('new_id').head(2).to_frame()

Unnamed: 0,new_id
0,1861
1,2119


In [32]:
(payment.customer_id / payment.rental_id).rename('new_id').to_frame().head(2)

Unnamed: 0,new_id
0,0.224342
1,0.191789


In [33]:
%%sql
select cast(customer_id as float) / rental_id as new_id
from payment limit 2;

 * postgres://postgres:***@localhost:5432/dvdrental
2 rows affected.


new_id
0.224342105263158
0.191788526434196
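
Note that the pandas cells above use true division, so they match the `cast(... as float)` query rather than the integer-division query. A minimal sketch of both variants on made-up ids follows. One caveat: the pandas `//` operator floors, while PostgreSQL integer division truncates toward zero, so the two agree only for non-negative values.

```python
import pandas as pd

# Toy ids (hypothetical values).
toy_payment = pd.DataFrame({"customer_id": [341, 341, 5000],
                            "rental_id": [1520, 1778, 1520]})

# Counterpart of: select customer_id / rental_id (integer division)
int_div = toy_payment["customer_id"] // toy_payment["rental_id"]

# Counterpart of: select cast(customer_id as float) / rental_id
true_div = toy_payment["customer_id"] / toy_payment["rental_id"]

print(int_div.tolist())
print(true_div.round(6).tolist())
```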


## String Functions and Operators
https://www.postgresql.org/docs/9.1/functions-string.html

![](../images/sql_string_functions_and_operators.png)
![](../images/sql_other_string_functions1.png)
![](../images/sql_other_string_functions2.png)
![](../images/sql_other_string_functions3.png)
![](../images/sql_builtin_conversions.png)

In [34]:
%%sql
select * from customer limit 2;

 * postgres://postgres:***@localhost:5432/dvdrental
2 rows affected.


customer_id,store_id,first_name,last_name,email,address_id,activebool,create_date,last_update,active
524,1,Jared,Ely,jared.ely@sakilacustomer.org,530,True,2006-02-14,2013-05-26 14:49:45.738000,1
1,1,Mary,Smith,mary.smith@sakilacustomer.org,5,True,2006-02-14,2013-05-26 14:49:45.738000,1


In [35]:
%%sql
select first_name || ' ' || last_name as full_name from customer limit 2;

 * postgres://postgres:***@localhost:5432/dvdrental
2 rows affected.


full_name
Jared Ely
Mary Smith


In [36]:
(customer.first_name + ' ' + customer.last_name).head(2)

0     Jared Ely
1    Mary Smith
dtype: object

In [37]:
%%sql
select first_name, char_length(first_name)
from customer limit 2;

 * postgres://postgres:***@localhost:5432/dvdrental
2 rows affected.


first_name,char_length
Jared,5
Mary,4


In [38]:
pd.concat([customer.first_name, customer.first_name.str.len()],
          ignore_index=True, sort=False,
          axis=1).rename(columns={0: 'first_name', 1: 'first_name_length'}
                        ).head(2)

Unnamed: 0,first_name,first_name_length
0,Jared,5
1,Mary,4
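
Both string operations can be combined in a single `assign` call, which names the new columns directly instead of renaming them after a `concat`. A small sketch on two made-up rows:

```python
import pandas as pd

# Toy customers (two rows copied by hand for illustration).
toy_customer = pd.DataFrame({"first_name": ["Jared", "Mary"],
                             "last_name": ["Ely", "Smith"]})

result = toy_customer.assign(
    # like: first_name || ' ' || last_name as full_name
    full_name=toy_customer["first_name"] + " " + toy_customer["last_name"],
    # like: char_length(first_name)
    first_name_length=toy_customer["first_name"].str.len(),
)
print(result)
```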


## Subquery

https://www.dofactory.com/sql/subquery
```sql
-- example 1
SELECT column-names
  FROM table-name1
 WHERE value IN (SELECT column-name
                   FROM table-name2 
                  WHERE condition)
                  
-- example 2
SELECT column1 = (SELECT column-name FROM table-name WHERE condition),
       column-names
  FROM table-name
 WHERE condition

-- example 3
SELECT FirstName, LastName, 
       OrderCount = (SELECT COUNT(O.Id) FROM [Order] O WHERE O.CustomerId = C.Id)
  FROM Customer C 
```

![](../images/sql_subquery.png)
![](../images/sql_subquery2.png)
![](../images/sql_subquery3.png)

### example 1
![](../images/subquery_video_qn1.png)

In [39]:
%%sql
select title, rental_rate from film limit 2;

 * postgres://postgres:***@localhost:5432/dvdrental
2 rows affected.


title,rental_rate
Chamber Italian,4.99
Grosse Wonderful,4.99


In [40]:
%%sql
select avg(rental_rate) from film;

 * postgres://postgres:***@localhost:5432/dvdrental
1 rows affected.


avg
2.98


In [41]:
%%sql
select title, rental_rate
from film
where rental_rate > 2.98
limit 5;

 * postgres://postgres:***@localhost:5432/dvdrental
5 rows affected.


title,rental_rate
Chamber Italian,4.99
Grosse Wonderful,4.99
Airport Pollock,4.99
Bright Encounters,4.99
Ace Goldfinger,4.99


In [42]:
%%sql
select title, rental_rate
from film
where rental_rate > (select avg(rental_rate) from film)
order by rental_rate, title
limit 5;

 * postgres://postgres:***@localhost:5432/dvdrental
5 rows affected.


title,rental_rate
Adaptation Holes,2.99
Affair Prejudice,2.99
African Egg,2.99
Agent Truman,2.99
Alabama Devil,2.99


In [43]:
(film[['title','rental_rate']]
 .assign(avg_rental_rate = lambda x: x.rental_rate.mean())
 .loc[lambda x: x.rental_rate > x.avg_rental_rate]
 .sort_values(['rental_rate','title'])
 .drop('avg_rental_rate', axis=1)
 .head()
)

# NOTE: chaining everything just because we can is NOT good practice;
# doing this in two steps is clearer.

Unnamed: 0,title,rental_rate
6,Adaptation Holes,2.99
7,Affair Prejudice,2.99
8,African Egg,2.99
9,Agent Truman,2.99
11,Alabama Devil,2.99
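
The two-step version suggested in the note could look like the sketch below: compute the average first (the counterpart of the scalar subquery), then filter. The film rows are made up for illustration.

```python
import pandas as pd

# Toy films (hypothetical rental rates).
toy_film = pd.DataFrame({
    "title": ["Chamber Italian", "Adaptation Holes",
              "Academy Dinosaur", "Ace Goldfinger"],
    "rental_rate": [4.99, 2.99, 0.99, 0.99],
})

# Step 1: the scalar "subquery", select avg(rental_rate) from film.
avg_rate = toy_film["rental_rate"].mean()

# Step 2: the outer query, filtered with the scalar from step 1.
above_avg = (toy_film[toy_film["rental_rate"] > avg_rate]
             .sort_values(["rental_rate", "title"]))
print(above_avg)
```

This avoids materialising a temporary `avg_rental_rate` column that has to be dropped again.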


In [44]:
## Another example: films returned between 2005-05-29 and 2005-05-30

In [45]:
%%sql
select * from rental limit 2;

 * postgres://postgres:***@localhost:5432/dvdrental
2 rows affected.


rental_id,rental_date,inventory_id,customer_id,return_date,staff_id,last_update
2,2005-05-24 22:54:33,1525,459,2005-05-28 19:40:33,1,2006-02-16 02:30:53
3,2005-05-24 23:03:39,1711,408,2005-06-01 22:12:39,1,2006-02-16 02:30:53


In [46]:
%%sql
select * from inventory limit 2;

 * postgres://postgres:***@localhost:5432/dvdrental
2 rows affected.


inventory_id,film_id,store_id,last_update
1,1,1,2006-02-15 10:09:17
2,1,1,2006-02-15 10:09:17


In [47]:
%%sql
select i.film_id, r.return_date from  rental r
inner join inventory i
on i.inventory_id = r.inventory_id
where r.return_date between '2005-05-29' and '2005-05-30'
order by film_id
limit 5;

 * postgres://postgres:***@localhost:5432/dvdrental
5 rows affected.


film_id,return_date
15,2005-05-29 06:57:37
19,2005-05-29 16:25:37
45,2005-05-29 01:07:54
50,2005-05-29 07:22:10
52,2005-05-29 06:33:57


In [48]:
# sql join ==> pandas merge

(pd.merge(inventory,
          rental.loc[lambda x: x.return_date.between('2005-05-29','2005-05-30')],
        on='inventory_id')
 [['film_id','return_date']]
 .head()
)

Unnamed: 0,film_id,return_date
0,15,2005-05-29 06:57:37
1,19,2005-05-29 16:25:37
2,45,2005-05-29 01:07:54
3,50,2005-05-29 07:22:10
4,52,2005-05-29 06:33:57


In [49]:
%%sql

select film_id, title
from film
where film_id in
(select i.film_id -- the subquery may return only one column, so r.return_date cannot be selected here
 from  rental r
inner join inventory i
on i.inventory_id = r.inventory_id
where r.return_date between '2005-05-29' and '2005-05-30'
)
limit 5;

 * postgres://postgres:***@localhost:5432/dvdrental
5 rows affected.


film_id,title
15,Alien Center
19,Amadeus Holy
45,Attraction Newton
50,Baked Cleopatra
52,Ballroom Mockingbird


In [50]:
# In pandas I can do this in multiple steps.

# First merge the two dataframes and get the film ids.
my_film_ids = (pd.merge(inventory,
        rental.loc[lambda x: x.return_date.between('2005-05-29','2005-05-30')],
        on='inventory_id')
  .film_id
)

In [51]:
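# Alternatively, use query() if the dataframe has more than 15k rows.
# Note: the bounds here are strict (<), whereas between() above is inclusive.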
my_film_ids = (pd.merge(inventory,
        rental.query(""" '2005-05-29' < return_date <  '2005-05-30' """),
        on='inventory_id')
  .film_id
)

In [52]:
my_film_ids[:5]

0    15
1    19
2    45
3    50
4    52
Name: film_id, dtype: int64

In [53]:
film[['film_id','title']].loc[lambda x: x.film_id.isin(my_film_ids)].head()

Unnamed: 0,film_id,title
18,15,Alien Center
22,19,Amadeus Holy
49,45,Attraction Newton
54,50,Baked Cleopatra
57,52,Ballroom Mockingbird


In [54]:
# I can do this in a single chain, but the merges build a much larger
# intermediate dataframe in memory, which wastes time.
# Always opt for simplicity.
# Just because I know how to write complex queries does not mean I have to
# write them.
(pd.merge(inventory,
          rental.loc[lambda x: x.return_date.between('2005-05-29','2005-05-30')],
        on='inventory_id')
 .merge(film, on='film_id')
 [['film_id','title']]
 .head()
)

Unnamed: 0,film_id,title
0,15,Alien Center
1,19,Amadeus Holy
2,45,Attraction Newton
3,50,Baked Cleopatra
4,52,Ballroom Mockingbird


In [55]:
# Again, a slightly better version of it:
# merge only the required columns of each dataframe.

(pd.merge(inventory[['inventory_id','film_id']],
             rental[['inventory_id','return_date']]
              .loc[lambda x: x.return_date.between('2005-05-29','2005-05-30')],
          on='inventory_id')
 .merge(film[['film_id','title']], on='film_id')
 [['film_id','title']]
 .head()
)

Unnamed: 0,film_id,title
0,15,Alien Center
1,19,Amadeus Holy
2,45,Attraction Newton
3,50,Baked Cleopatra
4,52,Ballroom Mockingbird


In [56]:
%%sql

select film_id, title
from film
where film_id in

(select i.film_id
 from  rental r
inner join inventory i
on i.inventory_id = r.inventory_id
where r.return_date between '2005-05-29' and '2005-05-30'
)

limit 5;

 * postgres://postgres:***@localhost:5432/dvdrental
5 rows affected.


film_id,title
15,Alien Center
19,Amadeus Holy
45,Attraction Newton
50,Baked Cleopatra
52,Ballroom Mockingbird
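
One difference between the `isin` approach and the chained merges is worth seeing on a small example: `WHERE film_id IN (...)` returns each film at most once, while a merge returns one row per matching rental. The sketch below uses made-up tables in which two copies of film 15 were returned inside the window. The variable names are reused from the notebook only for readability.

```python
import pandas as pd

# Toy tables (hypothetical rows).
film = pd.DataFrame({"film_id": [15, 19, 45],
                     "title": ["Alien Center", "Amadeus Holy", "Attraction Newton"]})
inventory = pd.DataFrame({"inventory_id": [1, 2, 3, 4],
                          "film_id": [15, 15, 19, 45]})
rental = pd.DataFrame({
    "inventory_id": [1, 2, 3, 4],
    "return_date": pd.to_datetime(["2005-05-29 06:57:37", "2005-05-29 10:00:00",
                                   "2005-05-29 16:25:37", "2005-06-02 01:00:00"]),
})

# Rentals returned inside the window (inclusive bounds, like SQL BETWEEN).
in_window = rental[rental["return_date"].between(pd.Timestamp("2005-05-29"),
                                                 pd.Timestamp("2005-05-30"))]

# Semi-join, like WHERE film_id IN (subquery): one row per film.
film_ids = in_window.merge(inventory, on="inventory_id")["film_id"]
semi = film.loc[film["film_id"].isin(film_ids), ["film_id", "title"]]

# Chained merges: one row per matching rental, so film 15 appears twice.
joined = (in_window.merge(inventory, on="inventory_id")
                   .merge(film, on="film_id")[["film_id", "title"]])

print(len(semi), len(joined))
```

If the merge-based version is preferred, adding `.drop_duplicates()` before `.head()` brings it back in line with the SQL result.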


## SELF JOIN
- http://www.sqlservertutorial.net/sql-server-basics/sql-server-self-join/
```sql
SELECT column_name(s)
FROM table1 T1, table1 T2
WHERE condition;
```

**Example**
```sql
SELECT A.CustomerName AS CustomerName1, B.CustomerName AS CustomerName2, A.City
FROM Customers A, Customers B
WHERE A.CustomerID <> B.CustomerID
AND A.City = B.City
ORDER BY A.City;
```

**Another Example**
```sql
SELECT
    c1.city,
    c1.first_name + ' ' + c1.last_name customer_1,
    c2.first_name + ' ' + c2.last_name customer_2
FROM
    sales.customers c1
INNER JOIN sales.customers c2 ON c1.customer_id > c2.customer_id
AND c1.city = c2.city
ORDER BY
    city,
    customer_1,
    customer_2;
```
![](../images/sql_self_join.png)

![](../images/sql_self_join_caveat.png)
![](../images/sql_self_join_subquery.png)
![](../images/sql_self_join_example.png)
![](../images/sql_self_join_example1a.png)
![](../images/sql_self_join_example1b.png)

In [57]:
%%sql
select * from customer limit 2;

 * postgres://postgres:***@localhost:5432/dvdrental
2 rows affected.


customer_id,store_id,first_name,last_name,email,address_id,activebool,create_date,last_update,active
524,1,Jared,Ely,jared.ely@sakilacustomer.org,530,True,2006-02-14,2013-05-26 14:49:45.738000,1
1,1,Mary,Smith,mary.smith@sakilacustomer.org,5,True,2006-02-14,2013-05-26 14:49:45.738000,1


In [58]:
## Find all pairs of customers where one customer's first name matches another customer's last name

In [60]:
%%sql
select a.first_name, a.last_name, b.first_name, b.last_name
from customer as a, customer as b
where a.first_name = b.last_name
limit 5;

 * postgres://postgres:***@localhost:5432/dvdrental
5 rows affected.


first_name,last_name,first_name_1,last_name_1
Rose,Howard,Darlene,Rose
Kelly,Torres,Denise,Kelly
Kim,Cruz,Lillie,Kim
Joy,George,Joseph,Joy
Terry,Carlson,Jennie,Terry


In [108]:
pd.merge(customer[['first_name','last_name']],
         customer[['first_name','last_name']],
         left_on='first_name',
         right_on='last_name').head()

Unnamed: 0,first_name_x,last_name_x,first_name_y,last_name_y
0,Rose,Howard,Darlene,Rose
1,Kelly,Torres,Denise,Kelly
2,Kelly,Knott,Denise,Kelly
3,Kim,Cruz,Lillie,Kim
4,Joy,George,Joseph,Joy


In [105]:
# SQL alternative: explicit JOIN ... ON syntax

In [106]:
%%sql
select a.customer_id, a.first_name, a.last_name, b.customer_id, b.first_name, b.last_name
from customer as a
join customer as b
on a.first_name = b.last_name
limit 5;

 * postgres://postgres:***@localhost:5432/dvdrental
5 rows affected.


customer_id,first_name,last_name,customer_id_1,first_name_1,last_name_1
65,Rose,Howard,157,Darlene,Rose
67,Kelly,Torres,74,Denise,Kelly
118,Kim,Cruz,233,Lillie,Kim
230,Joy,George,307,Joseph,Joy
253,Terry,Carlson,265,Jennie,Terry


In [109]:
pd.merge(customer[['customer_id', 'first_name','last_name']],
         customer[['customer_id', 'first_name','last_name']],
         left_on='first_name',
         right_on='last_name').head()

Unnamed: 0,customer_id_x,first_name_x,last_name_x,customer_id_y,first_name_y,last_name_y
0,65,Rose,Howard,157,Darlene,Rose
1,67,Kelly,Torres,74,Denise,Kelly
2,546,Kelly,Knott,74,Denise,Kelly
3,118,Kim,Cruz,233,Lillie,Kim
4,230,Joy,George,307,Joseph,Joy
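
The default `_x` / `_y` suffixes do not say which side of the self join a column came from. As a small sketch, the `suffixes` parameter of `pd.merge` can mirror the SQL aliases `a` and `b`. The four customer rows below are made up from values that appear in the outputs above.

```python
import pandas as pd

# Toy customers (hypothetical subset).
toy_customer = pd.DataFrame({
    "customer_id": [65, 157, 67, 74],
    "first_name": ["Rose", "Darlene", "Kelly", "Denise"],
    "last_name": ["Howard", "Rose", "Torres", "Kelly"],
})
cols = ["customer_id", "first_name", "last_name"]

# Self join: a.first_name = b.last_name, with suffixes named after the aliases.
pairs = pd.merge(toy_customer[cols], toy_customer[cols],
                 left_on="first_name", right_on="last_name",
                 suffixes=("_a", "_b"))
print(pairs)
```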
