
In [1]:
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# enable the %sql and %%sql magic commands
%load_ext sql

In [4]:
def prepare_data_from_csv(path_to_csv: str) -> str:
  '''
  Format the rows of a CSV file as a comma-separated list of tuples
  for use in the VALUES clause of an SQL INSERT statement.
  '''
  df = pd.read_csv(path_to_csv)
  return ','.join([str(tuple(row)) for row in df.values.tolist()])
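
Building the VALUES clause by formatting tuples as text works for the clean data used here, but it can break when a value contains a quote character. As a hedged alternative, the sketch below inserts rows with `?` placeholders through the standard-library `sqlite3` module. The helper name `insert_rows`, the in-memory database, and the second sample row are assumptions made for illustration; they are not part of the notebook.

```python
import sqlite3


def insert_rows(conn: sqlite3.Connection, table: str, rows) -> int:
    """Insert rows (an iterable of equal-length sequences) using '?' placeholders."""
    rows = [tuple(r) for r in rows]
    if not rows:
        return 0
    # One '?' per column, e.g. "?, ?, ?, ?" for four columns.
    placeholders = ", ".join("?" * len(rows[0]))
    # The table name cannot be a bound parameter, so it must come from trusted code.
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    conn.commit()
    return len(rows)


# Hypothetical in-memory database standing in for sqltraining.sqlite.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE duplicate_transactions ("
    "transaction_id VARCHAR, timestamp INTEGER, price INTEGER, department VARCHAR)"
)
sample = [
    ("qsllwzfgsu", 815725, 67, "Movies"),  # first row shown in the notebook
    ("o'brien", 1, 2, "Books"),            # hypothetical row containing a quote
]
inserted = insert_rows(conn, "duplicate_transactions", sample)
stored_ids = [r[0] for r in conn.execute(
    "SELECT transaction_id FROM duplicate_transactions ORDER BY rowid")]
```

With a DataFrame, `df.values.tolist()` (as used in `prepare_data_from_csv`) could be passed as `rows`.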

In [5]:
# connect to the sqltraining.sqlite database
%sql sqlite:////content/drive/MyDrive/sqltraining.sqlite

# Task #1 Duplicate Transactions

**Problem:**

The duplicate_transactions table contains transaction_id, timestamp, price and department.

Address these four questions:

1. How many duplicate records are there? For instance, if Row 1, Row 2, and Row 3 contain the same values, then there are two duplicate records.

2. How many unique records have duplications?

3. Remove duplicate records, only preserving the unique records.

4. Which department has the most duplicate records? Return the department name and the count of duplicate records. Allow for the possibility that several departments share the same highest count.

## 1.1. Creating tables and inserting values

In [6]:
%sql drop table if exists duplicate_transactions

 * sqlite:////content/drive/MyDrive/sqltraining.sqlite
Done.


[]

In [7]:
%%sql
CREATE TABLE duplicate_transactions (
transaction_id VARCHAR,
timestamp INTEGER,
price INTEGER,
department VARCHAR);

 * sqlite:////content/drive/MyDrive/sqltraining.sqlite
Done.


[]

In [41]:
data = prepare_data_from_csv('/content/drive/MyDrive/Colab Notebooks/Data/SQL_training/duplicate_transactions/duplicate_transactions.csv')
data[:100]

"('qsllwzfgsu', 815725, 67, 'Movies'),('ldfdvwgsxa', 863206, 18, 'Groceries'),('glvkzgxobb', 846203, "

In [10]:
%%sql insert into duplicate_transactions values {data}

 * sqlite:////content/drive/MyDrive/sqltraining.sqlite
580 rows affected.


[]

## 1.2. Look at the tables

In [15]:
%sql select * from duplicate_transactions limit 5 offset 1

 * sqlite:////content/drive/MyDrive/sqltraining.sqlite
Done.


transaction_id,timestamp,price,department
ldfdvwgsxa,863206,18,Groceries
glvkzgxobb,846203,58,Computer
gnrowryred,865632,48,Computer
wulekumebr,871200,47,Music
mudonayuog,865647,26,Groceries


## 1.3. Solution

1. How many duplicate records are there? For instance, if Row 1, Row 2, and Row 3 contain the same values, then there are two duplicate records.

In [24]:
%%sql
select sum(c) as duplicates_count from
(
  select count(*)-1 as c
  from duplicate_transactions
  group by transaction_id, timestamp, price, department
  having count(*) > 1
)

 * sqlite:////content/drive/MyDrive/sqltraining.sqlite
Done.


duplicates_count
80


2. How many unique records have duplications?

In [25]:
%%sql
select count(transaction_id) as records_have_duplications from
(
  select transaction_id
  from duplicate_transactions
  group by transaction_id, timestamp, price, department
  having count(*) > 1
)

 * sqlite:////content/drive/MyDrive/sqltraining.sqlite
Done.


records_have_duplications
47
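
To make the two counts concrete, here is a minimal sketch on a small in-memory table. The rows are hypothetical and mirror the example from the problem statement: one record appears three times, another twice, and a third once. The cross-check at the end (total rows minus distinct rows) is an added sanity test, not part of the original solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE duplicate_transactions ("
    "transaction_id VARCHAR, timestamp INTEGER, price INTEGER, department VARCHAR)"
)
# Hypothetical rows: 'aaa' appears three times, 'bbb' twice, 'ccc' once.
rows = [
    ("aaa", 1, 10, "Music"),
    ("aaa", 1, 10, "Music"),
    ("aaa", 1, 10, "Music"),
    ("bbb", 2, 20, "Books"),
    ("bbb", 2, 20, "Books"),
    ("ccc", 3, 30, "Books"),
]
conn.executemany("INSERT INTO duplicate_transactions VALUES (?, ?, ?, ?)", rows)

# Size of every group of identical rows that occurs more than once.
group_sizes = """
    SELECT COUNT(*) AS n
    FROM duplicate_transactions
    GROUP BY transaction_id, timestamp, price, department
    HAVING COUNT(*) > 1
"""
# Question 1: every copy beyond the first one is a duplicate.
duplicates_count = conn.execute(
    f"SELECT SUM(n - 1) FROM ({group_sizes})").fetchone()[0]
# Question 2: number of distinct records that have at least one extra copy.
records_with_duplicates = conn.execute(
    f"SELECT COUNT(*) FROM ({group_sizes})").fetchone()[0]

# Cross-check for question 1: total rows minus distinct rows.
total = conn.execute("SELECT COUNT(*) FROM duplicate_transactions").fetchone()[0]
distinct = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM duplicate_transactions)"
).fetchone()[0]

print(duplicates_count, records_with_duplicates, total - distinct)  # prints: 3 2 3
```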


3. Remove duplicate records, only preserving the unique records.

In [31]:
%%sql
select distinct *
from duplicate_transactions

 * sqlite:////content/drive/MyDrive/sqltraining.sqlite
Done.


transaction_id,timestamp,price,department
qsllwzfgsu,815725,67,Movies
ldfdvwgsxa,863206,18,Groceries
glvkzgxobb,846203,58,Computer
gnrowryred,865632,48,Computer
wulekumebr,871200,47,Music
mudonayuog,865647,26,Groceries
xycfertiwm,835662,37,Music
jezmgvkcfh,883012,50,Groceries
wldyhjgwbm,826626,85,Books
jwqmpbxuhg,830255,62,Music
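
`select distinct` returns the unique records but leaves the table itself unchanged. If the duplicates should be physically removed, one possible approach in SQLite is to keep the row with the smallest `rowid` in each group and delete the rest. The sketch below assumes an ordinary rowid table (not created `WITHOUT ROWID`) and uses hypothetical rows in an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE duplicate_transactions ("
    "transaction_id VARCHAR, timestamp INTEGER, price INTEGER, department VARCHAR)"
)
# Hypothetical rows: 'aaa' appears three times, 'bbb' twice, 'ccc' once.
conn.executemany(
    "INSERT INTO duplicate_transactions VALUES (?, ?, ?, ?)",
    [
        ("aaa", 1, 10, "Music"),
        ("aaa", 1, 10, "Music"),
        ("aaa", 1, 10, "Music"),
        ("bbb", 2, 20, "Books"),
        ("bbb", 2, 20, "Books"),
        ("ccc", 3, 30, "Books"),
    ],
)

# Keep the first physical copy (smallest rowid) of each identical group
# and delete every other copy.
conn.execute("""
    DELETE FROM duplicate_transactions
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM duplicate_transactions
        GROUP BY transaction_id, timestamp, price, department
    )
""")
conn.commit()

remaining = conn.execute("""
    SELECT transaction_id, COUNT(*)
    FROM duplicate_transactions
    GROUP BY transaction_id
    ORDER BY transaction_id
""").fetchall()
print(remaining)  # prints: [('aaa', 1), ('bbb', 1), ('ccc', 1)]
```

The same `DELETE` statement could be run in a `%%sql` cell against the notebook's database.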


4. Which department has the most duplicate records? Return the department name and the count of duplicate records. Allow for the possibility that several departments share the same highest count.

In [38]:
%%sql
select department, s from
(
  select department, sum(c) as s from
    (
      select department, count(*) as c
      from duplicate_transactions
      group by transaction_id, timestamp, price, department
      having count(*) > 1
    )
  group by department
)
where s =
(
  select max(s) from
    (
      select department, sum(c) as s from
        (
          select department, count(*) as c
          from duplicate_transactions
          group by transaction_id, timestamp, price, department
          having count(*) > 1
        )
      group by department
    )
)

 * sqlite:////content/drive/MyDrive/sqltraining.sqlite
Done.


department,s
Music,33
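
Note that the query above sums `count(*)` per group, so it counts every copy of a duplicated record, including the first one. Under the definition from question 1 (three identical rows mean two duplicates), each group would contribute `count(*) - 1` instead. The sketch below shows that variant with common table expressions, which also avoid repeating the inner subquery. The data is hypothetical and chosen so that two departments tie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE duplicate_transactions ("
    "transaction_id VARCHAR, timestamp INTEGER, price INTEGER, department VARCHAR)"
)
# Hypothetical rows: Music has one record stored three times (2 duplicates);
# Books has two records stored twice each (1 + 1 duplicates); Movies has none.
conn.executemany(
    "INSERT INTO duplicate_transactions VALUES (?, ?, ?, ?)",
    [
        ("aaa", 1, 10, "Music"),
        ("aaa", 1, 10, "Music"),
        ("aaa", 1, 10, "Music"),
        ("bbb", 2, 20, "Books"),
        ("bbb", 2, 20, "Books"),
        ("ddd", 4, 40, "Books"),
        ("ddd", 4, 40, "Books"),
        ("ccc", 3, 30, "Movies"),
    ],
)

query = """
    WITH dup_groups AS (
        -- one row per duplicated record; 'extra' = copies beyond the first
        SELECT department, COUNT(*) - 1 AS extra
        FROM duplicate_transactions
        GROUP BY transaction_id, timestamp, price, department
        HAVING COUNT(*) > 1
    ),
    per_department AS (
        SELECT department, SUM(extra) AS duplicates
        FROM dup_groups
        GROUP BY department
    )
    SELECT department, duplicates
    FROM per_department
    WHERE duplicates = (SELECT MAX(duplicates) FROM per_department)
    ORDER BY department
"""
top_departments = conn.execute(query).fetchall()
print(top_departments)  # prints: [('Books', 2), ('Music', 2)]
```

Comparing against the maximum, rather than using `ORDER BY ... LIMIT 1`, is what returns all tied departments.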


# Task #2 Connections

**Problem:**

Facebook’s analytics team wants to understand how users stay connected among friends on their platform.
The team believes that understanding patterns could help improve an algorithm that matches potential friends.
Use the friends_connections table to address the questions below. A user can perform the following sequence of actions:
(1) request or receive, (2) connect, and (3) block.

1. Return a list of users who blocked another user after being connected for at least 90 days.
Show the user_id and receiver_id.

2. For each user, what is the proportion of each action? Note that a receiver_id can
appear in multiple actions per user; consider only the latest status when calculating
the distribution.

## 2.1. Creating tables and inserting values

In [6]:
%sql drop table if exists friends_connections

 * sqlite:////content/drive/MyDrive/sqltraining.sqlite
Done.


[]

In [7]:
%%sql
CREATE TABLE friends_connections (
	date VARCHAR,
	user_id FLOAT,
	receiver_id INTEGER,
	action VARCHAR
);

 * sqlite:////content/drive/MyDrive/sqltraining.sqlite
Done.


[]

In [8]:
data = prepare_data_from_csv('/content/drive/MyDrive/Colab Notebooks/Data/SQL_training/facebook_connections/friends_connections.csv')
data[:100]

"('2020-01-30', 100, 246, 'Sent'),('2020-01-01', 100, 895, 'Received'),('2020-05-03', 100, 895, 'Conn"

In [9]:
%%sql insert into friends_connections values {data}

 * sqlite:////content/drive/MyDrive/sqltraining.sqlite
4841 rows affected.


[]

## 2.2. Look at the tables

In [6]:
%sql select * from friends_connections limit 5 offset 1

 * sqlite:////content/drive/MyDrive/sqltraining.sqlite
Done.


date,user_id,receiver_id,action
2020-01-01,100.0,895,Received
2020-05-03,100.0,895,Connected
2020-02-06,101.0,678,Sent
2020-04-14,101.0,678,Connected
2020-01-03,101.0,790,Sent


## 2.3. Solution

1. Return a list of users who blocked another user after being connected for at least 90 days. Show the user_id and receiver_id.

In [11]:
%%sql
select * from friends_connections
where user_id=895 and receiver_id=100

 * sqlite:////content/drive/MyDrive/sqltraining.sqlite
Done.


date,user_id,receiver_id,action


In [33]:
%%sql
select fc1.user_id, fc1.receiver_id
from friends_connections fc1
join friends_connections fc2
  on fc1.user_id = fc2.user_id and fc1.receiver_id = fc2.receiver_id
where fc1.action = 'Connected'
  and fc2.action = 'Blocked'
  and julianday(date(fc2.date)) - julianday(date(fc1.date)) >= 90

 * sqlite:////content/drive/MyDrive/sqltraining.sqlite
Done.


user_id,receiver_id
107.0,415
121.0,263
147.0,607
149.0,486
178.0,697
202.0,630
217.0,248
272.0,801
273.0,288
299.0,609
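
The self-join above pairs each `Connected` row with a `Blocked` row for the same user and receiver, then compares the dates with `julianday`, which turns an ISO date string into a day number. A minimal sketch on hypothetical rows shows which pairs pass the 90-day filter. The `INTEGER` type for `user_id` is an assumption made here to keep the output tidy; the notebook declares it as `FLOAT`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE friends_connections ("
    "date VARCHAR, user_id INTEGER, receiver_id INTEGER, action VARCHAR)"
)
# Hypothetical rows.
conn.executemany(
    "INSERT INTO friends_connections VALUES (?, ?, ?, ?)",
    [
        ("2020-01-01", 1, 2, "Connected"),
        ("2020-04-01", 1, 2, "Blocked"),    # 91 days after connecting: kept
        ("2020-01-01", 1, 3, "Connected"),
        ("2020-03-01", 1, 3, "Blocked"),    # 60 days after connecting: dropped
        ("2020-01-01", 4, 5, "Sent"),
        ("2020-06-01", 4, 5, "Blocked"),    # never connected: dropped
    ],
)

blockers = conn.execute("""
    SELECT fc1.user_id, fc1.receiver_id
    FROM friends_connections fc1
    JOIN friends_connections fc2
      ON fc1.user_id = fc2.user_id AND fc1.receiver_id = fc2.receiver_id
    WHERE fc1.action = 'Connected'
      AND fc2.action = 'Blocked'
      AND julianday(fc2.date) - julianday(fc1.date) >= 90
""").fetchall()

# julianday differences are plain day counts (2020 is a leap year).
gap = conn.execute(
    "SELECT julianday('2020-04-01') - julianday('2020-01-01')").fetchone()[0]
print(blockers, gap)  # prints: [(1, 2)] 91.0
```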


2. For each user, what is the proportion of each action? Note that a receiver_id can
appear in multiple actions per user; consider only the latest status when calculating
the distribution.

In [25]:
%%sql
select
t.user_id,
round(t.sent_count/t.sum, 2) prop_sent,
round(t.received_count/t.sum, 2) prop_received,
round(t.connected_count/t.sum, 2) prop_connected,
round(t.blocked_count/t.sum, 2) prop_blocked
from
(
  select
  user_tbl.user_id,
  cast(ifnull(sent_tbl.c, 0) as float) sent_count,
  cast(ifnull(received_tbl.c, 0) as float) received_count,
  cast(ifnull(connected_tbl.c, 0) as float) connected_count,
  cast(ifnull(blocked_tbl.c, 0) as float) blocked_count,
  ifnull(sent_tbl.c, 0) + ifnull(received_tbl.c, 0) + ifnull(connected_tbl.c, 0) + ifnull(blocked_tbl.c, 0) sum
  from
  (
    select distinct user_id from friends_connections
  ) as user_tbl

  left join
  (
    select user_id, count(*) as c from
    (
      select temp.* from
      (
        select max(date), user_id, receiver_id, action
        from friends_connections
        group by user_id, receiver_id
      ) temp
      where action='Sent'
    )
    group by user_id
  ) sent_tbl on user_tbl.user_id = sent_tbl.user_id

  left join
  (
    select user_id, count(*) as c from
    (
      select temp.* from
      (
        select max(date), user_id, receiver_id, action
        from friends_connections
        group by user_id, receiver_id
      ) temp
      where action='Received'
    )
    group by user_id
  ) received_tbl on user_tbl.user_id = received_tbl.user_id

  left join
  (
    select user_id, count(*) as c from
    (
      select temp.* from
      (
        select max(date), user_id, receiver_id, action
        from friends_connections
        group by user_id, receiver_id
      ) temp
      where action='Connected'
    )
    group by user_id
  ) connected_tbl on user_tbl.user_id = connected_tbl.user_id

  left join
  (
    select user_id, count(*) as c from
    (
      select temp.* from
      (
        select max(date), user_id, receiver_id, action
        from friends_connections
        group by user_id, receiver_id
      ) temp
      where action='Blocked'
    )
    group by user_id
  ) blocked_tbl on user_tbl.user_id = blocked_tbl.user_id
) as t

 * sqlite:////content/drive/MyDrive/sqltraining.sqlite
Done.


user_id,prop_sent,prop_received,prop_connected,prop_blocked
100.0,0.5,0.0,0.5,0.0
101.0,0.5,0.0,0.5,0.0
102.0,0.0,0.0,1.0,0.0
103.0,0.0,0.0,0.0,1.0
104.0,0.0,0.0,0.5,0.5
105.0,0.0,0.0,0.0,1.0
106.0,0.0,0.0,1.0,0.0
107.0,0.5,0.0,0.25,0.25
108.0,0.67,0.0,0.33,0.0
109.0,0.25,0.5,0.25,0.0
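
The query above repeats the "latest status" subquery once per action and relies on SQLite's behavior of taking bare columns from the row that holds `max(date)`. As a hedged alternative, the sketch below computes the latest status once in a common table expression and derives the proportions with conditional aggregation: in SQLite a comparison such as `action = 'Sent'` evaluates to 0 or 1, so its average is the share of that action. It assumes at most one row per (user_id, receiver_id, date), uses `INTEGER` for `user_id`, and runs on the three rows for user 100 shown in the data preview plus one hypothetical user 999.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE friends_connections ("
    "date VARCHAR, user_id INTEGER, receiver_id INTEGER, action VARCHAR)"
)
conn.executemany(
    "INSERT INTO friends_connections VALUES (?, ?, ?, ?)",
    [
        ("2020-01-30", 100, 246, "Sent"),
        ("2020-01-01", 100, 895, "Received"),
        ("2020-05-03", 100, 895, "Connected"),  # supersedes 'Received' for 895
        ("2020-02-01", 999, 1, "Blocked"),      # hypothetical user
    ],
)

proportions = conn.execute("""
    WITH latest AS (
        -- keep only the most recent action per (user_id, receiver_id)
        SELECT fc.user_id, fc.receiver_id, fc.action
        FROM friends_connections fc
        JOIN (
            SELECT user_id, receiver_id, MAX(date) AS last_date
            FROM friends_connections
            GROUP BY user_id, receiver_id
        ) m
          ON m.user_id = fc.user_id
         AND m.receiver_id = fc.receiver_id
         AND m.last_date = fc.date
    )
    SELECT
        user_id,
        ROUND(AVG(action = 'Sent'), 2)      AS prop_sent,
        ROUND(AVG(action = 'Received'), 2)  AS prop_received,
        ROUND(AVG(action = 'Connected'), 2) AS prop_connected,
        ROUND(AVG(action = 'Blocked'), 2)   AS prop_blocked
    FROM latest
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()

for row in proportions:
    print(row)
# prints:
# (100, 0.5, 0.0, 0.5, 0.0)
# (999, 0.0, 0.0, 0.0, 1.0)
```

The ISO `YYYY-MM-DD` format matters here: `MAX(date)` compares strings, which matches chronological order only for that format.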
