**[SQL Course Home Page](https://www.kaggle.com/learn/SQL)**

---


# Intro

Stack Overflow (stackoverflow.com) is a widely beloved Question and Answer site for technical questions. You'll probably use it yourself as you keep using SQL (or any programming language). 

Their data is publicly available. What cool things do you think it would be useful for?

Here's one idea:
You could set up a service that identifies the Stack Overflow users who have demonstrated expertise with a specific technology by answering questions about it, so someone could hire those experts for in-depth help.

In this exercise, you'll write the SQL queries that might serve as the foundation for this type of service.

As usual, run the following cell to set up our feedback system before moving on.

In [1]:
# Set up feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.sql.ex6 import *

# import package with helper functions 
import bq_helper

# create a helper object for this dataset
stack_overflow = bq_helper.BigQueryHelper(active_project="bigquery-public-data",
                                              dataset_name="stackoverflow")

Using Kaggle's public dataset BigQuery integration.


# Questions

# 1) Explore the Data

Before writing queries or **JOIN** clauses, you'll want to see what tables are available. 

This may be a good time to practice **tab completion** for when you don't remember command names. If you type `stack_overflow.` and then hit tab, you will see a list of methods for the `stack_overflow` object (don't forget the dot before hitting tab.)

In [2]:
# Your code here
list_of_tables = stack_overflow.list_tables()    # get a list of available tables

print(list_of_tables)
q_1.check()

['badges', 'comments', 'post_history', 'post_links', 'posts_answers', 'posts_moderator_nomination', 'posts_orphaned_tag_wiki', 'posts_privilege_wiki', 'posts_questions', 'posts_tag_wiki', 'posts_tag_wiki_excerpt', 'posts_wiki_placeholder', 'stackoverflow_posts', 'tags', 'users', 'votes']



<span style="color:#33cc33">Correct</span>

In [3]:
# q_1.solution()

# 2) Review Relevant Tables

If you are interested in people who answer questions on a given topic, the `posts_answers` table is a natural place to look. Run the following cell and look at the output.

In [4]:
stack_overflow.head('posts_answers')

Unnamed: 0,id,title,body,accepted_answer_id,answer_count,comment_count,community_owned_date,creation_date,favorite_count,last_activity_date,last_edit_date,last_editor_display_name,last_editor_user_id,owner_display_name,owner_user_id,parent_id,post_type_id,score,tags,view_count
0,58545647,,"<p>You can implement the <a href=""https://docs...",,,0,,2019-10-24 16:35:51.947000+00:00,,2019-10-24 16:35:51.947000+00:00,,,,,2541560,58545487,2,0,,
1,58545649,,"<p>You may be having an issue with the ""stage""...",,,0,,2019-10-24 16:35:59.377000+00:00,,2019-10-24 16:35:59.377000+00:00,,,,,4434749,56565949,2,0,,
2,58545664,,<p>I am not sure why you need that exactly but...,,,0,,2019-10-24 16:36:39.870000+00:00,,2019-10-24 16:36:39.870000+00:00,,,,,8343843,58545068,2,0,,
3,58545675,,<pre><code>Object delegateObj = readField(valu...,,,1,,2019-10-24 16:37:20.207000+00:00,,2019-10-24 16:37:20.207000+00:00,,,,,12269981,57195785,2,0,,
4,58545677,,<p>I had to remove the line</p>\n\n<pre><code>...,,,0,,2019-10-24 16:37:51.253000+00:00,,2019-10-24 16:37:51.253000+00:00,,,,,1775258,58428566,2,0,,


It isn't clear yet how to find the users who answered questions on any given topic. But `posts_answers` has a `parent_id` column. If you are familiar with the Stack Overflow site, you might figure out that the `parent_id` is the ID of the question each post is answering.

Look at `posts_questions` using the line below.

In [5]:
stack_overflow.head('posts_questions')

Unnamed: 0,id,title,body,accepted_answer_id,answer_count,comment_count,community_owned_date,creation_date,favorite_count,last_activity_date,last_edit_date,last_editor_display_name,last_editor_user_id,owner_display_name,owner_user_id,parent_id,post_type_id,score,tags,view_count
0,59030907,NLP Transformers: Best way to get a fixed sent...,<p>I'm loading a language model from torch hub...,,2,0,,2019-11-25 11:36:12.360000+00:00,2.0,2019-11-26 16:49:27.200000+00:00,2019-11-26 16:49:27.200000+00:00,,3192021.0,,3192021,,1,4,machine-learning|deep-learning|nlp|pytorch|wor...,256
1,59123008,How to detect keypress in C++ on MacOS?,<p>I'm writing an assignment that needs to sho...,,0,0,,2019-12-01 06:13:36.580000+00:00,,2019-12-01 06:13:36.580000+00:00,NaT,,,,10166333,,1,0,c++|events|sdl|keypress|keyboard-events,1
2,58858824,stabilizing Android command line app benchmark...,<p>I'm writing C++ module which later will be ...,,0,0,,2019-11-14 14:14:51.043000+00:00,,2019-11-14 14:14:51.043000+00:00,NaT,,,,5827390,,1,0,android|performance-testing|benchmarking,2
3,58947708,How to find pixel/dp distance between 2 points...,<p>I need to draw limit lines on both x and y-...,,0,0,,2019-11-20 06:01:55.060000+00:00,,2019-11-20 11:39:16.020000+00:00,2019-11-20 11:39:16.020000+00:00,,1631967.0,,7637587,,1,-1,mpandroidchart,2
4,58968440,will mobile browser download sourcemap file?,"<p>I've read some article or question said ""so...",,0,0,,2019-11-21 06:19:37.400000+00:00,,2019-11-21 06:19:37.400000+00:00,NaT,,,,3983650,,1,0,source-maps,2


Are there any fields that identify what topic or technology each question is about?

If so, how could you find the user IDs of users who answered questions about a specific topic?

Think about it, then check the solution by running the code in the next cell.

In [6]:
q_2.solution()


<span style="color:#33cc99">Solution:</span> 
`posts_questions` has a column called `tags` which lists the topics/technologies each question is about.

`posts_answers` has a column called `parent_id` which identifies the ID of the question each answer is responding to.
`posts_answers` also has an `owner_user_id` column which specifies the ID of the user who answered the question.

You can join these two tables to:
- determine the `tags` for each answer, and then
- select the `owner_user_id` of the answers on the desired tag.

This is exactly what you will do over the next few questions.
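
The join described above can be tried on a small scale before running it against the full dataset. The sketch below uses Python's built-in `sqlite3` module as a stand-in for BigQuery; the rows are made-up toy data (an assumption, not real Stack Overflow posts), and only the column names are taken from the tables shown earlier.

```python
import sqlite3

# In-memory SQLite database as a small stand-in for BigQuery.
# Assumption: the rows below are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts_questions (id INTEGER, tags TEXT);
    CREATE TABLE posts_answers (id INTEGER, parent_id INTEGER, owner_user_id INTEGER);
    INSERT INTO posts_questions VALUES (1, 'google-bigquery|sql'), (2, 'python|pandas');
    INSERT INTO posts_answers VALUES (10, 1, 501), (11, 1, 502), (12, 2, 503);
""")

# Each answer picks up the tags of the question it responds to,
# matched through posts_answers.parent_id = posts_questions.id.
join_rows = conn.execute("""
    SELECT pa.id, pa.owner_user_id, pq.tags
    FROM posts_questions AS pq
    INNER JOIN posts_answers AS pa ON pq.id = pa.parent_id
    ORDER BY pa.id
""").fetchall()

for row in join_rows:
    print(row)
```

Every answer row now carries the `tags` of its parent question, so a later filter on `tags` selects the answers (and their `owner_user_id` values) for one topic.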


# 3) Selecting The Right Questions

A lot of this data is text. 

Here is one last technique you'll learn in this course which you can apply to this text:

A **WHERE** clause can limit your results to rows with certain text using the **LIKE** feature. For example, to select just the third row of the `pets` table, we would write:

`SELECT * FROM PETS WHERE NAME LIKE 'Ripley'`

![](https://i.imgur.com/Ef4Puo3.png)

You can also use `%` as a "wildcard" for any number of characters. So you can get the third row with 

`SELECT * FROM PETS WHERE NAME LIKE '%ipl%'`
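
Here is a minimal sketch of both forms of **LIKE**, again using Python's built-in `sqlite3` module rather than BigQuery. The `pets` rows are hypothetical and only loosely mirror the table in the image. One difference to keep in mind: SQLite's `LIKE` ignores case for ASCII letters by default, whereas BigQuery's `LIKE` is, as far as I know, case-sensitive, so the toy data avoids names where that would matter.

```python
import sqlite3

# Hypothetical pets table; 'Ripley' is deliberately the third row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pets (id INTEGER, name TEXT, animal TEXT);
    INSERT INTO pets VALUES (1, 'Moon', 'Dog'), (2, 'Tom', 'Cat'), (3, 'Ripley', 'Cat');
""")

# No wildcard: the pattern must match the whole value.
exact = conn.execute(
    "SELECT id, name FROM pets WHERE name LIKE 'Ripley'"
).fetchall()

# % on both sides: the text may appear anywhere in the value.
wildcard = conn.execute(
    "SELECT id, name FROM pets WHERE name LIKE '%ipl%'"
).fetchall()

# % only at the end: the value must start with the given text.
starts_with = conn.execute(
    "SELECT id, name FROM pets WHERE name LIKE 'T%' ORDER BY id"
).fetchall()

print(exact)
print(wildcard)
print(starts_with)
```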

Try this yourself.
Before finding users who have answered questions, write a query that selects the `id`, `title` and `owner_user_id` from the `posts_questions` table. Restrict the results to rows that contain the word **bigquery** in the `tags` column. Include rows where there is other text in addition to the word `bigquery` (e.g. if a row has a tag `bigquery-sql`, your results should include that too).

In [7]:
# Your code here
questions_query = \
"""
SELECT id, title, owner_user_id
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE tags LIKE '%bigquery%'
"""

questions_results = stack_overflow.query_to_pandas_safe(questions_query, max_gb_scanned=25) # this query reads a lot of data
print(questions_results.head())
q_3.check()


         id                                              title  owner_user_id
0  59104395  How to store datetimeoffset(4) datatype from s...     12436928.0
1  58861311  Count and records from yesterday and add datec...      6347701.0
2  59075362  Google Bigquery, getting beginning and end of ...     12447336.0
3  59072924  Calculate avg daily and hourly phone call roll...     12379424.0
4  58986030  No matching signature for function TIMESTAMP_D...     12081490.0



<span style="color:#33cc33">Correct</span>

In [8]:
# q_3.hint()
# q_3.solution()

# 4) Your First Join
Now that you have a query to select questions on any given topic (in this case, you chose `bigquery`), you can find the answers to those questions with a **JOIN**.  

Write a SQL query that returns the `id`, `body` and `owner_user_id` from the `posts_answers` table for answers to `bigquery`-related questions. That is, you should have one row in your results for each answer to a question that has `bigquery` in its tags.

Here's a reminder of what a **JOIN** looked like in the tutorial:
```
SELECT p.Name AS Pet_Name, o.Name as Owner_Name
FROM `bigquery-public-data.pet_records.pets` as p
INNER JOIN `bigquery-public-data.pet_records.owners` as o ON p.ID = o.Pet_ID
```

It may be useful to scroll up and review the results from when you called **head** on `posts_answers` and `posts_questions`.  

In [9]:
from time import time

answers_query = \
"""
SELECT pa.id, pa.body, pa.owner_user_id
    FROM `bigquery-public-data.stackoverflow.posts_questions` AS pq
        INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS pa
            ON pq.id = pa.parent_id
    WHERE pq.tags LIKE '%bigquery%'
"""

answers_results = stack_overflow.query_to_pandas_safe(answers_query, max_gb_scanned=50) # the join reads both tables, so allow a scan of up to 50 GB
print(answers_results.head())
q_4.check()

         id                                               body  owner_user_id
0  33483810  <hr>\n\n<p>The above example doesn't show how ...      2379486.0
1  33504705  <p>I'm not sure if this question is still open...      5520858.0
2  33520266  <p>Yes, the link you have is the way to use se...      1066506.0
3  33540643  <p>For copying large files from some HTTP loca...        40999.0
4  33590170  <p>you've already solved it by partitioning. i...      2213940.0



<span style="color:#33cc33">Correct</span>

In [10]:
# q_4.hint()
# q_4.solution()

# 5) Answer The Question
You have the join you need. But you want a list of users who have answered many questions, which requires more work beyond your previous result.

Write a new query that selects data from the `posts_questions` and `posts_answers` tables. The results should have a single row for each user who answered at least one question with a tag that includes the string `bigquery`. Each row in your results should have two columns:
- a column called `user_id` that contains the `owner_user_id` from the `posts_answers` table
- a column called `number_of_answers` that contains the number of answers the user has written to `bigquery` questions
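
The shape of that result (one row per user, with a count) comes from combining the join with **GROUP BY** and **COUNT**. The sketch below shows the idea on invented toy rows in an in-memory `sqlite3` database; it is a stand-in for BigQuery under that assumption, not the exercise's checked solution.

```python
import sqlite3

# Assumption: toy rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts_questions (id INTEGER, tags TEXT);
    CREATE TABLE posts_answers (id INTEGER, parent_id INTEGER, owner_user_id INTEGER);
    INSERT INTO posts_questions VALUES
        (1, 'google-bigquery|sql'), (2, 'python|pandas'), (3, 'bigquery-sql');
    INSERT INTO posts_answers VALUES
        (10, 1, 501), (11, 1, 502), (12, 2, 501), (13, 3, 501);
""")

# Join answers to their questions, keep only bigquery-tagged questions,
# then collapse the remaining answers into one row per answering user.
counts = conn.execute("""
    SELECT pa.owner_user_id AS user_id, COUNT(1) AS number_of_answers
    FROM posts_questions AS pq
    INNER JOIN posts_answers AS pa ON pq.id = pa.parent_id
    WHERE pq.tags LIKE '%bigquery%'
    GROUP BY pa.owner_user_id
    ORDER BY number_of_answers DESC
""").fetchall()

print(counts)
```

User 501 wrote three answers in the toy data, but only two of them respond to questions whose tags contain `bigquery`, so the count for that user is 2.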

In [11]:
# your code here
bigquery_experts_query =  \
"""
SELECT pa.owner_user_id AS user_id, COUNT(1) AS number_of_answers
    FROM `bigquery-public-data.stackoverflow.posts_questions` AS pq
        INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS pa
            ON pq.id = pa.parent_id
    WHERE pq.tags LIKE '%bigquery%'
    GROUP BY pa.owner_user_id
    HAVING number_of_answers > 0
"""
bigquery_experts_results = stack_overflow.query_to_pandas_safe(bigquery_experts_query, max_gb_scanned=50)

print(bigquery_experts_results.head())
q_5.check()

     user_id  number_of_answers
0  5239549.0                 39
1  8212141.0                 18
2  1560780.0                 16
3    19880.0                 35
4  4001094.0                 93



<span style="color:#33cc33">Correct</span>

In [12]:
# q_5.hint()
# q_5.solution()

# Building A More Generally Useful Service

How could you convert what you've done so it's a general function a website could call on the backend to get experts on any topic?  

Think about it and then check the solution below.

In [13]:
def expert_finder(topic, bigQueryHelper):
    '''
    Returns a DataFrame with the IDs of users who have written Stack Overflow answers on a topic.
    
    Inputs:
        topic: A string with the topic of interest
        bigQueryHelper: A BigQueryHelper object that specifies the connection to the Stack Overflow dataset

    Outputs:
        results: A DataFrame with columns for user_id and number_of_answers, sorted by number_of_answers in descending order.
    '''
    query = f"""
            SELECT pa.owner_user_id AS user_id, COUNT(1) AS number_of_answers
                FROM `bigquery-public-data.stackoverflow.posts_questions` AS pq
                    INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS pa
                        ON pq.id = pa.parent_id
                WHERE pq.tags LIKE '%{topic}%'
                GROUP BY pa.owner_user_id
                ORDER BY number_of_answers DESC
            """
    return bigQueryHelper.query_to_pandas_safe(query, max_gb_scanned=50)
    
print(expert_finder('tensorflow', stack_overflow).head())

     user_id  number_of_answers
0  3574081.0               1065
1  1782792.0                565
2   712995.0                478
3  2097240.0                445
4   992489.0                404
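
Because `expert_finder` places `topic` directly into the SQL text with an f-string, a website calling it on the backend would probably want to check the value first. The sketch below is one possible guard, not part of the original exercise: `validate_topic` is a hypothetical helper, and the allowed character set (lowercase letters, digits, `+`, `#`, `.`, `-`) is an assumption about what tag names look like.

```python
import re

# Assumption: a topic should look like a tag, i.e. lowercase letters,
# digits, and the characters + # . - only.
_TAG_PATTERN = re.compile(r"[a-z0-9+#.\-]+")


def validate_topic(topic):
    """Return topic unchanged if it looks like a tag, else raise ValueError."""
    if not isinstance(topic, str) or not _TAG_PATTERN.fullmatch(topic):
        raise ValueError(f"Unexpected characters in topic: {topic!r}")
    return topic


print(validate_topic("tensorflow"))
print(validate_topic("c++"))

# Quotes, spaces and % are rejected before they can reach the query text.
try:
    validate_topic("x%' OR 1=1 --")
except ValueError as err:
    print("rejected:", err)
```

A caller could then write `expert_finder(validate_topic(user_input), stack_overflow)`. BigQuery's own client library also supports query parameters, which avoid string interpolation altogether; whether `bq_helper` exposes them is not covered here.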


In [14]:
# q_6.solution()

# Congratulations
You know all the key components to use BigQuery and SQL effectively. Your SQL skills are sufficient to unlock many of the world's large datasets.

Want to go play with your new powers?  Kaggle has BigQuery datasets available [here](https://www.kaggle.com/datasets?sortBy=hottest&group=public&page=1&pageSize=20&size=sizeAll&filetype=fileTypeBigQuery).

# Feedback
Bring any questions or feedback to the [Learn Discussion Forum](https://www.kaggle.com/learn-forum).



