In [1]:
%run helper/setup_notebook.ipynb import display_table

Successfully connected to leetcode50 database.


In [2]:
table_name = 'queries'
display_table(table_name)

+------------+------------------+----------+--------+
| query_name |      result      | position | rating |
+------------+------------------+----------+--------+
|    Dog     | Golden Retriever |    1     |   5    |
|    Dog     | German Shepherd  |    2     |   5    |
|    Dog     |       Mule       |   200    |   1    |
|    Cat     |     Shirazi      |    5     |   2    |
|    Cat     |     Siamese      |    3     |   3    |
|    Cat     |      Sphynx      |    7     |   4    |
+------------+------------------+----------+--------+


- ### We define query quality as: *The average of the ratio between query rating and its position.* 
- ### We also define poor query percentage as: *The percentage of all queries with a rating less than 3.*

### Write an SQL query to find each query_name, the quality and poor_query_percentage. Both quality and poor_query_percentage should be rounded to 2 decimal places.

```
+------------+---------+-----------------------+
| query_name | quality | poor_query_percentage |
+------------+---------+-----------------------+
| Dog        | 2.50    | 33.33                 |
| Cat        | 0.66    | 33.33                 |
+------------+---------+-----------------------+
Explanation: 
Dog queries quality is ((5 / 1) + (5 / 2) + (1 / 200)) / 3 = 2.50
Dog queries poor_query_percentage is (1 / 3) * 100 = 33.33

Cat queries quality is ((2 / 5) + (3 / 3) + (4 / 7)) / 3 = 0.66
Cat queries poor_query_percentage is (1 / 3) * 100 = 33.33
```
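As a sanity check on the arithmetic in the explanation, here is a minimal plain-Python sketch that recomputes both metrics from the sample rows. The helper name `summarize` and the tuple layout are illustrative assumptions, not part of the original problem.

```python
# Sample rows copied from the `queries` table above: (query_name, position, rating)
rows = [
    ("Dog", 1, 5),
    ("Dog", 2, 5),
    ("Dog", 200, 1),
    ("Cat", 5, 2),
    ("Cat", 3, 3),
    ("Cat", 7, 4),
]


def summarize(rows):
    """Return {query_name: (quality, poor_query_percentage)}, rounded to 2 places."""
    # Collect the (position, rating) pairs of each query_name
    groups = {}
    for name, position, rating in rows:
        groups.setdefault(name, []).append((position, rating))

    summary = {}
    for name, items in groups.items():
        # quality: average of rating / position over the group
        quality = sum(rating / position for position, rating in items) / len(items)
        # poor_query_percentage: share of rows with rating < 3, as a percentage
        poor = sum(1 for _, rating in items if rating < 3) / len(items) * 100
        summary[name] = (round(quality, 2), round(poor, 2))
    return summary


summary = summarize(rows)
print(summary)
```

The printed values should match the expected table: 2.5 and 33.33 for Dog, 0.66 and 33.33 for Cat.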

In [3]:
%%sql 

SELECT query_name, AVG(rating / position) as quality
FROM Queries 
GROUP BY query_name;


query_name,quality
Dog,2.50166667
Cat,0.65713333


In [4]:
%%sql
SELECT 
    query_name, 
    ROUND(AVG(rating / position), 2) AS quality,
    ROUND((SUM(CASE WHEN rating < 3 THEN 1 ELSE 0 END) / COUNT(*)) * 100, 2) AS poor_query_percentage
FROM Queries
GROUP BY query_name;


query_name,quality,poor_query_percentage
Dog,2.5,33.33
Cat,0.66,33.33
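The query above relies on `/` returning a fractional result for integer operands, which the outputs confirm for this database. Some engines, SQLite among them, truncate integer division instead. The sketch below shows one way to make the query robust to that by multiplying by `1.0` and `100.0` before dividing. It assumes an in-memory SQLite database built from the sample rows, which is not the notebook's own connection.

```python
import sqlite3

# Assumption: a throwaway in-memory SQLite database holding the sample rows
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE queries (query_name TEXT, result TEXT, position INTEGER, rating INTEGER)"
)
conn.executemany(
    "INSERT INTO queries VALUES (?, ?, ?, ?)",
    [
        ("Dog", "Golden Retriever", 1, 5),
        ("Dog", "German Shepherd", 2, 5),
        ("Dog", "Mule", 200, 1),
        ("Cat", "Shirazi", 5, 2),
        ("Cat", "Siamese", 3, 3),
        ("Cat", "Sphynx", 7, 4),
    ],
)

# Multiplying by 1.0 / 100.0 forces floating-point division before AVG and ROUND
sql = """
SELECT
    query_name,
    ROUND(AVG(rating * 1.0 / position), 2) AS quality,
    ROUND(SUM(CASE WHEN rating < 3 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2)
        AS poor_query_percentage
FROM queries
GROUP BY query_name
ORDER BY query_name;
"""
portable_rows = conn.execute(sql).fetchall()
print(portable_rows)
conn.close()
```

The `ORDER BY` is only there to make the row order deterministic; the problem statement does not require any particular order.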


# Using Pandas

In [5]:
import pandas as pd 

In [6]:
queries_query = %sql SELECT * FROM queries # type: ignore
queries_df = queries_query.DataFrame()
display(queries_df)

Unnamed: 0,query_name,result,position,rating
0,Dog,Golden Retriever,1,5
1,Dog,German Shepherd,2,5
2,Dog,Mule,200,1
3,Cat,Shirazi,5,2
4,Cat,Siamese,3,3
5,Cat,Sphynx,7,4


In [7]:
queries_df.groupby(by='query_name').agg({'rating':'mean'})

Unnamed: 0_level_0,rating
query_name,Unnamed: 1_level_1
Cat,3.0
Dog,3.666667


In [8]:
queries_df.groupby(by='query_name')[['rating', 'position']].mean()

Unnamed: 0_level_0,rating,position
query_name,Unnamed: 1_level_1,Unnamed: 2_level_1
Cat,3.0,5.0
Dog,3.666667,67.666667


In [9]:
# Define a function to calculate the rounded average
def calculate_average(grouped_df):
    return round((grouped_df['rating'] / grouped_df['position']).mean(), 2)

In [10]:
queries_df.groupby(by='query_name').apply(calculate_average)

query_name
Cat    0.66
Dog    2.50
dtype: float64

In [12]:
# Define a function to calculate the rounded percentage of poor queries (rating < 3)
def calculate_percentage(grouped_df):
    return round((grouped_df['rating'] < 3).sum() / len(grouped_df) * 100, 2)

In [13]:
queries_df.groupby('query_name').apply(calculate_percentage)

query_name
Cat    33.33
Dog    33.33
dtype: float64

In [14]:
output_df = queries_df.groupby('query_name').apply(calculate_average).reset_index()
output_df

Unnamed: 0,query_name,0
0,Cat,0.66
1,Dog,2.5


In [15]:
output_df = output_df.rename(columns={0: 'quality'})
output_df

Unnamed: 0,query_name,quality
0,Cat,0.66
1,Dog,2.5


In [16]:
# Assigning the grouped result directly does not work: it is indexed by query_name,
# while output_df has an integer index, so the labels never align and the column is all NaN
output_df['temp_column'] = queries_df \
                            .groupby('query_name') \
                            .apply(calculate_percentage)
output_df

Unnamed: 0,query_name,quality,temp_column
0,Cat,0.66,
1,Dog,2.5,
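The empty column comes from index alignment: pandas matches an assigned Series to the DataFrame by index label, not by position. Here is a minimal sketch of the effect and of one key-based alternative. The names `left` and `pct` are hypothetical stand-ins for `output_df` and the grouped percentages.

```python
import pandas as pd

# Stand-in for output_df: default integer index (0, 1)
left = pd.DataFrame({"query_name": ["Cat", "Dog"], "quality": [0.66, 2.5]})

# Stand-in for the groupby(...).apply(...) result: indexed by query_name
pct = pd.Series([33.33, 33.33], index=pd.Index(["Cat", "Dog"], name="query_name"))

# Direct assignment aligns on index labels: 0 and 1 never match "Cat"/"Dog",
# so every value in the new column is NaN
misaligned = left.assign(poor_query_percentage=pct)
print(misaligned)

# Alternative: look each query_name up in the Series, so matching is by key
by_key = left.assign(poor_query_percentage=left["query_name"].map(pct))
print(by_key)
```

Resetting the index with `drop=True` also works, but it matches rows by position and so relies on both `groupby` calls returning the groups in the same order. Mapping by `query_name` does not depend on row order.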


In [17]:
output_df['poor_query_percentage'] = queries_df \
                                      .groupby('query_name') \
                                      .apply(calculate_percentage) \
                                      .reset_index(drop=True)
output_df

Unnamed: 0,query_name,quality,temp_column,poor_query_percentage
0,Cat,0.66,,33.33
1,Dog,2.5,,33.33


In [18]:
# Drop the temporary column
output_df = output_df.drop('temp_column', axis=1)
output_df

Unnamed: 0,query_name,quality,poor_query_percentage
0,Cat,0.66,33.33
1,Dog,2.5,33.33


## Using Lambda

1. `queries_df.groupby('query_name')`: This groups the DataFrame by the 'query_name' column. It creates a separate group for each unique value in that column (here, Cat and Dog).

2. `agg(...)`: The `agg` function performs aggregations on each group. Here it is called with named aggregation: each keyword argument names a column of the resulting DataFrame, and its value is a tuple giving the source column and the aggregation to apply to it.

3. `quality=('rating', lambda x: round((x / queries_df['position']).mean(), 2))`: Here, we define an aggregation operation for the 'quality' column. The key 'quality' specifies the resulting column name. The value is a tuple where the first element ('rating') represents the column to be aggregated, and the second element is a lambda function that calculates the quality value.

   - `x` represents the grouped 'rating' column for each group.
   - `x / queries_df['position']` divides each element in 'rating' by the value in the 'position' column with the same row index. Because pandas aligns on the index, rows outside the current group become NaN.
   - `.mean()` calculates the mean of the resulting division values within each group.
   - `round(..., 2)` rounds the mean value to two decimal places.

4. `poor_query_percentage=('rating', lambda x: round((x < 3).sum() / x.count() * 100, 2))`: Similarly, we define an aggregation operation for the 'poor_query_percentage' column using a lambda function.

   - `(x < 3).sum()` counts the number of elements in the 'rating' column that are less than 3 within each group.
   - `x.count()` counts the total number of elements in the 'rating' column within each group.
   - `(... / x.count()) * 100` calculates the ratio of the count of poor queries to the total count and multiplies it by 100 to get the percentage.
   - `round(..., 2)` rounds the percentage value to two decimal places.
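The division in step 3 works only because pandas aligns the two Series on their row labels. The sketch below isolates that behaviour for the Cat group. The frame `df` is rebuilt from the sample data, and `cat_ratings` is a hypothetical stand-in for the `x` that the lambda receives.

```python
import pandas as pd

# Sample data rebuilt from the queries table (the 'result' column is not needed here)
df = pd.DataFrame(
    {
        "query_name": ["Dog", "Dog", "Dog", "Cat", "Cat", "Cat"],
        "position": [1, 2, 200, 5, 3, 7],
        "rating": [5, 5, 1, 2, 3, 4],
    }
)

# The 'rating' values of one group keep their original row labels (3, 4, 5 for Cat)
cat_ratings = df.loc[df["query_name"] == "Cat", "rating"]

# Dividing by the full 'position' column aligns on those labels;
# rows 0-2 have no rating in this group, so they come out as NaN
ratios = cat_ratings / df["position"]
print(ratios)

# mean() skips NaN by default, so only the three Cat rows contribute
cat_quality = round(ratios.mean(), 2)
print(cat_quality)
```

This also shows why the lambda depends on `queries_df` keeping the same index as the grouped data: if the index were reset or reordered in between, the division would pair ratings with the wrong positions.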

In [19]:
# Calculate the quality and poor_query_percentage
result = queries_df.groupby('query_name').agg(
    quality=('rating', lambda x: round((x / queries_df['position']).mean(), 2)),
    poor_query_percentage=('rating', lambda x: round((x < 3).sum() / x.count() * 100, 2))
)

result

Unnamed: 0_level_0,quality,poor_query_percentage
query_name,Unnamed: 1_level_1,Unnamed: 2_level_1
Cat,0.66,33.33
Dog,2.5,33.33
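An alternative that avoids lambdas reaching back into the outer DataFrame is to compute the per-row values first and then take plain group means. This is a sketch of that design, not the notebook's method. The column names `ratio` and `poor_flag` are assumptions introduced for illustration.

```python
import pandas as pd

# Sample data rebuilt from the queries table (the 'result' column is not needed here)
df = pd.DataFrame(
    {
        "query_name": ["Dog", "Dog", "Dog", "Cat", "Cat", "Cat"],
        "position": [1, 2, 200, 5, 3, 7],
        "rating": [5, 5, 1, 2, 3, 4],
    }
)

summary = (
    df.assign(
        # Per-row rating / position
        ratio=df["rating"] / df["position"],
        # 100 where the rating is poor, 0 otherwise; the group mean is the percentage
        poor_flag=(df["rating"] < 3).astype(int) * 100,
    )
    .groupby("query_name")
    .agg(
        quality=("ratio", "mean"),
        poor_query_percentage=("poor_flag", "mean"),
    )
    .round(2)
    .reset_index()
)
print(summary)
```

Since both metrics are simple means of per-row values, no custom function is needed, and `reset_index()` returns `query_name` as a regular column in a single step.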


In [20]:
result = queries_df.groupby('query_name').apply(calculate_average).reset_index().rename(columns={0: 'quality'})
result['poor_query_percentage'] = queries_df.groupby('query_name').apply(calculate_percentage).reset_index(drop=True)
result

Unnamed: 0,query_name,quality,poor_query_percentage
0,Cat,0.66,33.33
1,Dog,2.5,33.33
