# Using Project Debater services for analyzing and finding insights in the survey data 
When you have a large collection of texts representing people’s opinions (such as product reviews, survey answers or  social media), it is difficult to understand the key issues that come up in the data. Going over thousands of comments is prohibitively expensive.  Existing automated approaches are often limited to identifying recurring phrases or concepts and the overall sentiment toward them, but do not provide detailed or actionable insights.

In this tutorial you will gain hands-on experience in using Project Debater services for analyzing and deriving insights from open-ended answers.  

The data we will use is a community survey conducted in the city of Austin in the years 2016 and 2017 (https://data.world/cityofaustin/mf9f-kvkk). In this survey, the citizens of Austin were asked "If there was ONE thing you could share with the Mayor regarding the City of Austin (any comment, suggestion, etc.), what would it be?".

We will analyze their open-ended answers in different ways using four Debater services: the *Argument Quality* service, the *Key Point Analysis (KPA)* service, the *Term Wikifier* service and the *Term Relater* service. We will see how they can be combined into a powerful text analysis tool.

## 1. Run *Key Point Analysis* on 1000 randomly selected sentences from 2016 survey

### 1.1 Read random sample of 1000 sentences from 2016 comments
Let's take a look at the first 5 lines in the *dataset_austin_sentences.csv* file, which holds the Austin survey dataset.

In [1]:
# Use a context manager so the file handle is closed after reading.
with open('./scraped_tweets.csv', 'r') as file:
    lines = file.readlines()
print('\n'.join(lines[:5]))

tweet,id,topic,lang

People eat all this healthy food but drink like a fish,1498718487094743041,healthy food,en

"[First thing to do in the morning]



✓ Pray



The file has all the survey answers after they were split into sentences. Each row in the file corresponds to a single sentence. Each row has the following attributes: \['id', 'text', 'district','year'\]. We will first read the attached csv file into the 'sentences' variable. 

In [2]:
import csv
import random


with open('./scraped_tweets.csv') as csv_file:
    reader = csv.DictReader(csv_file)
    sentences = list(reader)
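
The raw lines printed earlier show that a single quoted field can span several physical lines. As a minimal sketch of why `csv.DictReader` is used rather than splitting lines by hand, the example below parses a small in-memory sample. The sample rows are hypothetical; only the header matches the file shown above.

```python
import csv
import io

# Hypothetical sample with the same header as the file shown above.
sample_csv = io.StringIO(
    "tweet,id,topic,lang\n"
    "Eat more greens,1,healthy food,en\n"
    '"Beans, lentils and peas",2,legumes,en\n'
    '"Line one\nline two",3,healthy food,en\n'
)

rows = list(csv.DictReader(sample_csv))

# Each row is a dict keyed by the header names. Quoted commas and quoted
# line breaks stay inside a single field instead of creating new columns or rows.
for row in rows:
    print(row["id"], "->", repr(row["tweet"]))
```

Four physical data lines yield three logical rows here, which is why the number of rows returned by the reader can be smaller than the number of lines in the file.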

Let's have a look at the content of the *sentences* variable.

In [3]:
print('There are %d sentences in the dataset' % len(sentences))
print('Each sentence is a dictionary with the following keys: %s' % str(sentences[0].keys()))

There are 2220 sentences in the dataset
Each sentence is a dictionary with the following keys: dict_keys(['tweet', 'id', 'topic', 'lang'])


Let's select only the sentences from the 2016 survey and randomly sample 1000 of them. The *Key Point Analysis* service can run over hundreds of thousands of sentences; however, since the computation is resource-intensive (particularly in GPUs), the trial version is limited to 1000 sentences. Calling random.seed(0) is important, since we have already prepared a hot cache over these sentences for a quicker *Key Point Analysis* run.

In [4]:
random.seed(0)
unique_sentences = []
ids = set()

for sent in sentences:
    if sent['id'] not in ids:
        ids.add(sent['id'])
        unique_sentences.append(sent)


random_sample_sentences = random.sample(unique_sentences, 1000)
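
The cell above removes duplicate ids and then draws a seeded sample. A small variation on the same idea is sketched below with hypothetical records. The helper names `dedupe_by_id` and `seeded_sample` are illustrative and not part of the tutorial's utilities. It uses a local `random.Random` instance so the global random state is left untouched, and it guards against asking for more items than exist.

```python
import random

# Hypothetical records: ids 0..3 appear twice, as if some items were collected twice.
records = [{"id": str(i % 8), "text": f"sentence {i}"} for i in range(12)]


def dedupe_by_id(items):
    """Keep the first occurrence of every id, preserving input order."""
    seen = set()
    unique = []
    for item in items:
        if item["id"] not in seen:
            seen.add(item["id"])
            unique.append(item)
    return unique


def seeded_sample(items, k, seed=0):
    """Draw a reproducible sample of at most k items."""
    rng = random.Random(seed)
    # random.sample raises ValueError when k exceeds the population size.
    return rng.sample(items, min(k, len(items)))


unique = dedupe_by_id(records)
first = seeded_sample(unique, 5)
second = seeded_sample(unique, 5)
print(len(unique), first == second)
```

Because both calls start from the same seed, they select the same items, which is the property a pre-computed cache relies on.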





### 1.2 Run *Key Point Analysis* on the random sample

Key point analysis is a novel and promising approach for summarization, with an important quantitative angle. This service summarizes a collection of comments on a given topic as a small set of key points. The salience of each key point is given by the number of its matching sentences in the given comments.
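
To make the notion of salience concrete, here is a toy illustration that counts how many sentences were assigned to each key point. The assignments are hypothetical, and the sketch says nothing about how the service itself produces them.

```python
from collections import Counter

# Hypothetical sentence-to-key-point assignments; None means no key point matched.
assignments = [
    ("s1", "Eat more plants"),
    ("s2", "Eat more plants"),
    ("s3", "Organic food is better"),
    ("s4", None),
    ("s5", "Eat more plants"),
]

# Salience of a key point = number of sentences matched to it.
salience = Counter(kp for _, kp in assignments if kp is not None)

# most_common() lists key points from most to least salient.
for key_point, count in salience.most_common():
    print(f"{count} - {key_point}")
```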

Before running the *Key Point Analysis* service, we first need to initialize our client. The DebaterApi object supplies the clients for the various Debater services. The clients print information using the logger, so a suitable verbosity level should be set. The DebaterApi object is configured with an API key, which should be retrieved from the Project Debater Early Access Program site. In this case it is passed by the environment variable *DEBATER_API_KEY*. We then obtain the keypoint client from the DebaterApi object.

The *Key Point Analysis* service stores the data (and results cache) in a domain. A user can create several domains, one for each dataset. Domains are only accessible to the user who created them.  In this tutorial, we will run all *Key Point Analysis* jobs in the same domain named 'austin_demo'.

Full documentation of the *Key Point Analysis* service can be found [here](https://early-access-program.debater.res.ibm.com/docs/services/keypoints/keypoints_pydoc.html).


In [5]:
from debater_python_api.api.debater_api import DebaterApi
from austin_utils import init_logger
import os
import json
import io

credentials_path = './credentials.json'

with io.open(credentials_path) as f_in:
    credentials = json.load(f_in)


init_logger()
api_key = credentials['debater_api_key']
debater_api = DebaterApi(apikey=api_key)
keypoints_client = debater_api.get_keypoints_client()
domain = 'sustainable diet'

Exercise 1:

Let's define a method named *run_kpa*. The method receives a list of sentences (each sentence is a dictionary with the following keys: 'id','text') and runs *Key Point Analysis* on these sentences. The method also receives the *run_params* parameter, which enables us to customize and affect the *Key Point Analysis* operation.

In order to run *Key Point Analysis*, we need to:

1. Upload the comments into a domain using the **keypoints_client.upload_comments(domain=domain, comments_ids=sentences_ids, comments_texts=sentences_texts, dont_split=True)** method. This method receives the domain, a list of comment_ids and a list of comment_texts. By default, when comments are uploaded into a domain, the *Key Point Analysis* service splits them into sentences and runs a minor cleansing on the sentences. Since we have already split the comments into sentences ourselves and we want the *Key Point Analysis* service to use them as is, we will set the *dont_split* parameter to True.

2. Wait till all comments in the domain are processed using the **keypoints_client.wait_till_all_comments_are_processed(domain=domain)** method.

3. Start a *Key Point Analysis* job using the **future = keypoints_client.start_kp_analysis_job(domain=domain, comments_ids=sentences_ids, run_params=run_params)** method. This method receives the domain, a list of comment_ids and a *run_params*. The run_params is a dictionary with various parameters for customizing the job. The job runs in an async manner therefore the method returns a future object.

4. Use the returned future and wait until the results are available using the **kpa_result = future.get_result(high_verbosity=True, polling_timout_secs=5)** method. The method waits for the job to finish and eventually returns the result. The result is a dictionary containing the key points (sorted in descending order by number of matched sentences) and, for each key point, a list of matched sentences (sorted in descending order by match score). An additional 'none' key point is added, which holds all the sentences that don't match any key point.

Our run_kpa method will return this result dictionary. It will also return the unique identifier for this analysis, called *job_id*, retrieved from the future. We will need this job_id in a following exercise.
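
To get a feel for the result before calling the service, the sketch below walks a toy result dictionary and computes the share of sentences that matched some key point. The layout used here (a `keypoint_matchings` list whose entries have `keypoint` and `matching` fields) is an assumption made for illustration and is not confirmed by the description above. The helper `coverage_percent` is hypothetical as well.

```python
# Assumed, simplified result layout (hypothetical): one entry per key point,
# each with its matched sentences, plus a 'none' entry for unmatched sentences.
toy_result = {
    "keypoint_matchings": [
        {"keypoint": "Eat more plants",
         "matching": [{"sentence_text": "Greens every day", "score": 0.998},
                      {"sentence_text": "Plants are enough", "score": 0.993}]},
        {"keypoint": "Organic food is better",
         "matching": [{"sentence_text": "All organic here", "score": 0.995}]},
        {"keypoint": "none",
         "matching": [{"sentence_text": "lol", "score": 0.0},
                      {"sentence_text": "what a day", "score": 0.0}]},
    ]
}


def coverage_percent(result):
    """Percentage of sentences matched to any key point other than 'none'."""
    total = 0
    matched = 0
    for entry in result["keypoint_matchings"]:
        n_sentences = len(entry["matching"])
        total += n_sentences
        if entry["keypoint"] != "none":
            matched += n_sentences
    return 100.0 * matched / total if total else 0.0


print(f"coverage: {coverage_percent(toy_result):.2f}")
```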

In [6]:
def run_kpa(sentences, run_params):
    sentences_texts = [sentence['tweet'] for sentence in sentences]
    sentences_ids = [sentence['id'] for sentence in sentences]

    keypoints_client.upload_comments(domain=domain, 
                                     comments_ids=sentences_ids, 
                                     comments_texts=sentences_texts, 
                                     dont_split=True)

    keypoints_client.wait_till_all_comments_are_processed(domain=domain)

    future = keypoints_client.start_kp_analysis_job(domain=domain, 
                                                    comments_ids=sentences_ids, 
                                                    run_params=run_params)

    kpa_result = future.get_result(high_verbosity=True, 
                                   polling_timout_secs=5)
    
    return kpa_result, future.get_job_id()

We will now use the method you implemented and run over the random sample and print the result. In order to limit the number of key points in the result to 20, we will use *run_params={'n_top_kps': 20}*.

In [7]:
from austin_utils import print_results

kpa_result_random_1000, _ = run_kpa(random_sample_sentences, {'n_top_kps': 20})
print_results(kpa_result_random_1000, n_sentences_per_kp=2, title='Random sample')

2022-03-01 23:52:18,312 [INFO] keypoints_client.py 316: uploading 1000 comments in batches
2022-03-01 23:52:18,313 [INFO] keypoints_client.py 245: client calls service (post): https://keypoint-matching-backend.debater.res.ibm.com/comments
2022-03-01 23:52:19,138 [INFO] keypoints_client.py 333: uploaded 1000 comments, out of 1000
2022-03-01 23:52:19,139 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/comments
2022-03-01 23:52:25,213 [INFO] keypoints_client.py 345: domain: sustainable diet, comments status: {'processed_comments': 2808, 'pending_comments': 0, 'processed_sentences': 2808}
2022-03-01 23:52:25,214 [INFO] keypoints_client.py 245: client calls service (post): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:52:31,664 [INFO] keypoints_client.py 407: started a kp analysis job - domain: sustainable diet, job_id: 621ea3af72766c72841eeb50
2022-03-01 23:52:31,665 [INFO] keypoints_clien

Stage 1/1: |--------------------------------------------------| 0.0% Complete



2022-03-01 23:52:54,408 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:52:55,524 [INFO] keypoints_client.py 584: job_id 621ea3af72766c72841eeb50 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 0, 'total_batches': 10, 'batch_size': 2000}}


Stage 1/1: |--------------------------------------------------| 0.0% Complete



2022-03-01 23:53:00,529 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:53:01,565 [INFO] keypoints_client.py 584: job_id 621ea3af72766c72841eeb50 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 0, 'total_batches': 10, 'batch_size': 2000}}


Stage 1/1: |--------------------------------------------------| 0.0% Complete



2022-03-01 23:53:06,567 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:53:07,607 [INFO] keypoints_client.py 584: job_id 621ea3af72766c72841eeb50 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 1, 'total_batches': 10, 'batch_size': 2000}}


Stage 1/1: |█████---------------------------------------------| 10.0% Complete



2022-03-01 23:53:12,613 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:53:13,535 [INFO] keypoints_client.py 584: job_id 621ea3af72766c72841eeb50 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 1, 'total_batches': 10, 'batch_size': 2000}}


Stage 1/1: |█████---------------------------------------------| 10.0% Complete



2022-03-01 23:53:18,541 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:53:19,690 [INFO] keypoints_client.py 584: job_id 621ea3af72766c72841eeb50 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 1, 'total_batches': 10, 'batch_size': 2000}}


Stage 1/1: |█████---------------------------------------------| 10.0% Complete



2022-03-01 23:53:24,696 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:53:30,033 [INFO] keypoints_client.py 584: job_id 621ea3af72766c72841eeb50 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 1, 'total_batches': 10, 'batch_size': 2000}}


Stage 1/1: |█████---------------------------------------------| 10.0% Complete



2022-03-01 23:53:35,039 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:53:35,620 [INFO] keypoints_client.py 584: job_id 621ea3af72766c72841eeb50 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 1, 'total_batches': 10, 'batch_size': 2000}}


Stage 1/1: |█████---------------------------------------------| 10.0% Complete



2022-03-01 23:53:40,625 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:53:41,604 [INFO] keypoints_client.py 584: job_id 621ea3af72766c72841eeb50 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 8, 'total_batches': 10, 'batch_size': 2000}}


Stage 1/1: |████████████████████████████████████████----------| 80.0% Complete



2022-03-01 23:53:46,610 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:53:52,766 [INFO] keypoints_client.py 584: job_id 621ea3af72766c72841eeb50 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 10, 'total_batches': 10, 'batch_size': 2000}}


Stage 1/1: |██████████████████████████████████████████████████| 100.0% Complete




2022-03-01 23:53:57,772 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:53:59,636 [INFO] keypoints_client.py 587: job_id 621ea3af72766c72841eeb50 is done, returning result


Random sample coverage: 23.32
Random sample key points:
34 - Learning healthy plant-based diet... 👍
	- I’m on a all green diet ion talking food
	- Fruit and vegetable accessibility is so important for development. The way I snacked
	  from the backyard garden as a kid...such a healthy habit.
20 - Providing a healthier, more sustainable, and classic meat snack
	- If you want live in Aus and want to eat clean, ethical, sustainable meat - switch to
	  kangaroo meat
	- #meatfreeTuesday nut bars, vegetable hokkien noodles, vegetable spring rolls, fresh
	  fruit salad... easy as and for a great cause saving the world a couple of meat free days
	  a week !
18 - my diet consists exclusively of legumes
	- #Paleo diet prohibits: legumes, for example beans and peanuts.
	- Bad Diet: only baked goods five days a week. Otherwise, only legumes
15 - Everything is organic over here
	- All organic, no ass shots
	- I crop dust in whole foods BAYBEH
15 - Healthy food is disgusting
	- Jeeeesus the raw dog 

## 2. Run *Key Point Analysis* on 1000 top quality sentences from 2016 survey
### 2.1 Select top 1000 sentences from 2016 data using the *Argument Quality* service
The answers in the Austin Survey dataset vary in length, style and quality. Selecting the sentences randomly may lead to running over many sentences that are not very informative. Running over the randomly selected sentences reached a 28.36% coverage. This means that only 28.36% of the sentences matched a key point. In order to improve the coverage and the quality of our results, we will now run over higher quality sentences and select the 1000 sentences with the highest *Argument Quality* score. The *Argument Quality* service receives pairs of \[sentence, topic\] and returns a score indicating whether the sentence is phrased in grammatically correct, clear and concise language. The quality ranking is based on a machine learning model that was trained on human assessments of over 30,000 arguments.

In [8]:
from austin_utils import print_top_and_bottom_k_sentences

def get_top_quality_sentences(sentences, top_k):    
    arg_quality_client = debater_api.get_argument_quality_client()
    sentences_topic = [{'sentence': sentence['tweet'], 'topic': sentence['topic']} for sentence in sentences]
    arg_quality_scores = arg_quality_client.run(sentences_topic)
    sentences_and_scores = zip(sentences, arg_quality_scores)
    sentences_and_scores_sorted = sorted(sentences_and_scores, key=lambda x: x[1], reverse=True)
    sentences_sorted = [sentence for sentence, _ in sentences_and_scores_sorted]
    print_top_and_bottom_k_sentences(sentences_sorted, 10)
    return sentences_sorted[:top_k]

sentences_top_1000 = get_top_quality_sentences(unique_sentences, 1000)

ArgumentQualityClient: 100%|██████████| 2125/2125 [00:25<00:00, 83.18it/s] 
2022-03-01 23:54:25,228 [INFO] argument_quality_client.py 21: argument_quality_client.run = 25576.614141464233ms.


Top 10 quality sentences: 
	- Studies have also shown that green tea has antimicrobial properties that inhibit the
	  growth of bacteria and viruses. Adding green tea to a healthy regimen consisting of a
	  nutritious diet and sufficient sleep is effectual in boosting immunity and keeping the
	  body healthy.
	- Sufficient sleep, 
exercise, 
healthy food, 
friendship, 
and peace of mind,
are
	  necessities, not luxuries.
	- Whole grains are natural high in fiber and can reduce the risk of heart disease,
	  diabetes, certain cancers and other health problems.
	- A high intake of foods that increase the body's alkalinity, such as vegetables, fruits,
	  and legumes, has been suggested to increase healthy life expectancy.
	- veganism is just a glorified eating disorder
	- What's the most effective meal frequency for weight loss? 

You can have 'n' number of
	  meals/day but make sure you stay in a Calorie Deficit,you're having adequate Protein and
	  Fats to support your body and the  diet

### 2.2 Run *Key Point Analysis* over the selected sentences
We will now run the *run_kpa* method over the top 1000 quality sentences.

In [9]:
kpa_result_top_aq_1000, _ = run_kpa(sentences_top_1000, {'n_top_kps': 20})
print_results(kpa_result_top_aq_1000, n_sentences_per_kp=2, title='Top aq 2016')

2022-03-01 23:54:25,235 [INFO] keypoints_client.py 316: uploading 1000 comments in batches
2022-03-01 23:54:25,236 [INFO] keypoints_client.py 245: client calls service (post): https://keypoint-matching-backend.debater.res.ibm.com/comments
2022-03-01 23:54:26,763 [INFO] keypoints_client.py 333: uploaded 1000 comments, out of 1000
2022-03-01 23:54:26,764 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/comments
2022-03-01 23:54:40,075 [INFO] keypoints_client.py 345: domain: sustainable diet, comments status: {'processed_comments': 2808, 'pending_comments': 1000, 'processed_sentences': 2808}
2022-03-01 23:54:50,086 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/comments
2022-03-01 23:54:54,719 [INFO] keypoints_client.py 345: domain: sustainable diet, comments status: {'processed_comments': 2937, 'pending_comments': 0, 'processed_sentences': 2937}
2022-03-01 2

Stage 1/1: |--------------------------------------------------| 0.0% Complete



2022-03-01 23:55:19,898 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:55:20,933 [INFO] keypoints_client.py 584: job_id 621ea44372766c72841eeb61 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 0, 'total_batches': 5, 'batch_size': 2000}}


Stage 1/1: |--------------------------------------------------| 0.0% Complete



2022-03-01 23:55:25,939 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:55:26,975 [INFO] keypoints_client.py 584: job_id 621ea44372766c72841eeb61 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 1, 'total_batches': 5, 'batch_size': 2000}}


Stage 1/1: |██████████----------------------------------------| 20.0% Complete



2022-03-01 23:55:31,981 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:55:33,529 [INFO] keypoints_client.py 584: job_id 621ea44372766c72841eeb61 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 1, 'total_batches': 5, 'batch_size': 2000}}


Stage 1/1: |██████████----------------------------------------| 20.0% Complete



2022-03-01 23:55:38,534 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:55:39,058 [INFO] keypoints_client.py 584: job_id 621ea44372766c72841eeb61 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 1, 'total_batches': 5, 'batch_size': 2000}}


Stage 1/1: |██████████----------------------------------------| 20.0% Complete



2022-03-01 23:55:44,061 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:55:45,100 [INFO] keypoints_client.py 584: job_id 621ea44372766c72841eeb61 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 1, 'total_batches': 5, 'batch_size': 2000}}


Stage 1/1: |██████████----------------------------------------| 20.0% Complete



2022-03-01 23:55:50,106 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:55:54,009 [INFO] keypoints_client.py 584: job_id 621ea44372766c72841eeb61 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 1, 'total_batches': 5, 'batch_size': 2000}}


Stage 1/1: |██████████----------------------------------------| 20.0% Complete



2022-03-01 23:55:59,015 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:55:59,538 [INFO] keypoints_client.py 584: job_id 621ea44372766c72841eeb61 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 1, 'total_batches': 5, 'batch_size': 2000}}


Stage 1/1: |██████████----------------------------------------| 20.0% Complete



2022-03-01 23:56:04,544 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:56:05,580 [INFO] keypoints_client.py 584: job_id 621ea44372766c72841eeb61 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 1, 'total_batches': 5, 'batch_size': 2000}}


Stage 1/1: |██████████----------------------------------------| 20.0% Complete



2022-03-01 23:56:10,586 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:56:16,025 [INFO] keypoints_client.py 584: job_id 621ea44372766c72841eeb61 is running, progress: {'total_stages': 1, 'stage_1': {'inferred_batches': 5, 'total_batches': 5, 'batch_size': 2000}}


Stage 1/1: |██████████████████████████████████████████████████| 100.0% Complete




2022-03-01 23:56:21,031 [INFO] keypoints_client.py 245: client calls service (get): https://keypoint-matching-backend.debater.res.ibm.com/kp_extraction
2022-03-01 23:56:22,681 [INFO] keypoints_client.py 587: job_id 621ea44372766c72841eeb61 is done, returning result


Top aq 2016 coverage: 13.85
Top aq 2016 key points:
32 - my diet consists exclusively of legumes
	- #Paleo diet prohibits: legumes, for example beans and peanuts.
	- Bad Diet: all legumes, no cruciferous vegetables
14 - my food intake lately's been too healthy
	- I haven’t been having enough green in my diet the last two weeks so I’m bout to fix that
	  effective now .
	- never craved healthy food in my life but after the past few days isolating and eating
	  nothing but shite all I want is loads of fruit and veggies
13 - Providing a healthier, more sustainable, and classic meat snack
	- If you want live in Aus and want to eat clean, ethical, sustainable meat - switch to
	  kangaroo meat
	- Just a friendly reminder to stop killing your beloved dogs with heavily processed dog
	  food and instead give them a healthy diet of raw meat and whatever appropriate whole
	  foods :-)
6 - My former editor is running low on food and fuel.
	- We’ve hit icy waters, no land to be seen 
The food’s get

### 2.3 Customize key point analysis
It is possible to customize and affect the analysis by passing different parameters in the *run_params* dictionary. In this subsection we will see a few examples.

#### 2.3.1 Hierarchical key points
Often, several key points address a similar topic. In order to get an even clearer summary of the data, we can group similar key points together using the *Hierarchical Key Points* feature. To activate it, we add two additional parameters to run_params: {'perform_kp_hierarchy': True, 'kp_hierarchy_threshold': 0.3}. *perform_kp_hierarchy* activates the feature and *kp_hierarchy_threshold* sets a threshold for grouping similar key points. The lower the threshold, the more key points are grouped, including ones with lower similarity.
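
The effect of a grouping threshold can be illustrated with a toy example. The sketch below is not the service's algorithm, which is not described here. It simply merges items whose pairwise similarity reaches the threshold, using hypothetical similarity scores and a hypothetical helper `group_by_threshold`, to show that a lower threshold leads to fewer, larger groups.

```python
def group_by_threshold(items, similarity, threshold):
    """Merge items connected by a similarity score >= threshold (union-find)."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links to the group representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical key points and pairwise similarity scores.
key_points = ["A", "B", "C", "D"]
similarity = {("A", "B"): 0.8, ("B", "C"): 0.4, ("C", "D"): 0.1}

strict = group_by_threshold(key_points, similarity, 0.5)
loose = group_by_threshold(key_points, similarity, 0.3)
print(strict)
print(loose)
```

With the higher threshold only the most similar pair is merged. With the lower one, the weaker link between "B" and "C" also joins the group.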

In [None]:
kpa_result_top_aq_1000_2016, _ = run_kpa(sentences_2016_top_1000_aq, 
                                    {'n_top_kps': 20, 'perform_kp_hierarchy': True, 'kp_hierarchy_threshold': 0.3})
print_results(kpa_result_top_aq_1000_2016, n_sentences_per_kp=2, title='Top aq 2016, hierarchical')

#### 2.3.2 Increase coverage by decreasing the matching threshold
Running over higher quality sentences we managed to increase our coverage to 41.05%. In order to increase the coverage further, we will add another parameter to run_params called *mapping_threshold*.

The mapping_threshold decides whether a sentence matches (supports) a key point. Reducing the threshold from its 0.99 default value therefore makes more sentences match key points and increases the coverage, at the risk of reducing the precision.
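
The trade-off can be seen on a handful of hypothetical match scores. The sketch below assumes a sentence counts as matched when its best score is at least the threshold. Whether the service uses a strict or non-strict comparison is not stated here, and `coverage_at` is an illustrative helper.

```python
# Hypothetical best-match scores, one per sentence.
match_scores = [0.999, 0.995, 0.97, 0.96, 0.80, 0.40]


def coverage_at(scores, threshold):
    """Percentage of sentences whose best match score reaches the threshold."""
    matched = sum(1 for score in scores if score >= threshold)
    return 100.0 * matched / len(scores)


for threshold in (0.99, 0.95):
    print(threshold, round(coverage_at(match_scores, threshold), 2))
```

Lowering the threshold admits the two sentences scored 0.97 and 0.96. Those are also the matches most likely to be wrong, which is why precision should be checked afterwards.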

In [None]:
kpa_result_top_aq_1000_2016, kpa_top_aq_1000_2016_job_id = run_kpa(sentences_2016_top_1000_aq, 
                                                                {'n_top_kps': 20, 'mapping_threshold': 0.95})
print_results(kpa_result_top_aq_1000_2016, n_sentences_per_kp=2, title='Top aq 2016')

The coverage was indeed increased to about 50%. Let's examine the bottom 5 sentences that were matched to the first key point and make sure that the precision is still high.

In [None]:
from austin_utils import print_bottom_matches_for_kp
print_bottom_matches_for_kp(kpa_result_top_aq_1000_2016, 'Traffic congestion needs major improvement', 5)

## 3. Run *Key Point Analysis* over 2017 survey using the key points from 2016 survey
### 3.1 Select top 1000 sentences from 2017 data using the *Argument Quality* service
It is very useful to be able to compare different subsets of the data (different years, different districts, etc.). We will demonstrate how easy it is to compare the 2017 data to the 2016 data. A similar comparison can be done between districts or other subsets.

Let's first filter the 2017 sentences and take the top 1000 quality sentences, as done for the 2016 sentences.

In [None]:
sentences_2017 = [sentence for sentence in sentences if sentence['year'] == '2017']
sentences_2017_top_1000_aq = get_top_quality_sentences(sentences_2017, 1000, 'Austin is a great place to live')

### 3.2 Run *Key Point Analysis* over top 1000 quality 2017 sentences using the key points from 2016
Exercise 2:<br/>
In order to compare the 2017 sentences to the 2016 sentences, we want to map the 2017 sentences to the same key points that were extracted from the 2016 sentences (otherwise different key points could be automatically extracted from the 2017 sentences and it would be hard to compare between them).

To this end we will reimplement the *run_kpa* method (please copy and paste the previous one and modify it). This time the method will receive a new *key_points_by_job_id* parameter. This parameter is passed to the *key_points_by_job_id* parameter in the **future = keypoints_client.start_kp_analysis_job(domain=domain, comments_ids=sentences_ids, run_params=run_params, key_points_by_job_id=key_points_by_job_id)** method. When *None* is passed to *key_points_by_job_id*, key points are automatically extracted; when it is set to the *job_id* of a previous job, the service uses the key points from that job and matches all sentences to them.

In [None]:
def run_kpa(sentences, run_params, key_points_by_job_id=None):
    sentences_texts = [sentence['text'] for sentence in sentences]
    sentences_ids = [sentence['id'] for sentence in sentences]

    keypoints_client.upload_comments(domain=domain,
                                     comments_ids=sentences_ids,
                                     comments_texts=sentences_texts,
                                     dont_split=True)

    keypoints_client.wait_till_all_comments_are_processed(domain=domain)

    future = keypoints_client.start_kp_analysis_job(domain=domain, comments_ids=sentences_ids,
                                                    run_params=run_params,
                                                    key_points_by_job_id=key_points_by_job_id)

    kpa_result = future.get_result(high_verbosity=True, polling_timout_secs=5)
    
    return kpa_result, future.get_job_id()

Let's use the new *run_kpa* and provide it with the *top 1000 quality sentences from 2017* and the job_id of *top 1000 quality sentences from 2016*.

In [None]:
kpa_result_top_aq_1000_2017, _ = run_kpa(sentences_2017_top_1000_aq, 
                                    {'n_top_kps': 20, 'mapping_threshold': 0.95}, kpa_top_aq_1000_2016_job_id)
print_results(kpa_result_top_aq_1000_2017, n_sentences_per_kp=2, title='Top aq 2017, using 2016 key points')

Since both jobs have the same key points, we can now easily compare the two results.

In [None]:
from austin_utils import compare_results

compare_results(kpa_result_top_aq_1000_2016, '2016', kpa_result_top_aq_1000_2017, '2017')

Note: This comparison is for illustration only. Given that we ran on a subset of comments, the statistical significance of the difference between the years is limited, except for the most recurring key points.
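
As a rough idea of what such a comparison involves, the sketch below turns per-key-point match counts from two runs into percentages and reports the change in percentage points. It is not the implementation of *compare_results*. The helper `compare_shares`, the second key point name and all counts are hypothetical.

```python
def compare_shares(counts_a, total_a, counts_b, total_b):
    """Return (key point, % in run A, % in run B, change in percentage points)."""
    rows = []
    for key_point, count_a in counts_a.items():
        share_a = 100.0 * count_a / total_a
        # A key point with no matches in run B simply gets a 0% share.
        share_b = 100.0 * counts_b.get(key_point, 0) / total_b
        rows.append((key_point, share_a, share_b, share_b - share_a))
    return rows


# Hypothetical match counts out of 1000 sentences per year.
counts_2016 = {"Traffic congestion needs major improvement": 120,
               "Housing is too expensive": 60}
counts_2017 = {"Traffic congestion needs major improvement": 90,
               "Housing is too expensive": 80}

for key_point, share_a, share_b, change in compare_shares(counts_2016, 1000, counts_2017, 1000):
    print(f"{key_point}: {share_a:.1f}% -> {share_b:.1f}% ({change:+.1f})")
```

Normalizing by the number of sentences in each run keeps the comparison meaningful when the two subsets differ in size.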

## 4. Deep dive into the *traffic problem* in Austin using the *Term Wikifier* and *Term Relater* services
As we've seen in the 2016 results, the traffic problem in Austin is significant. In this section we will use the *Term Wikifier* and *Term Relater* services to select a subset of the sentences related to the *Traffic* topic and run *Key Point Analysis* over them.

The *Term Wikifier* service runs over sentences and identifies the Wikipedia concepts that are referenced by phrases in the sentence text. Concepts correspond to Wikipedia articles. Each occurrence of a concept in the sentence is called a *mention*. For example, the sentence "My car insurance went up 20% due to vehicle thefts and burglary" mentions three Wikipedia concepts: the phrase "car insurance" is mapped to the concept *Vehicle insurance*, the phrase "vehicle thefts" is mapped to the concept *Motor vehicle theft*, and the phrase "burglary" is mapped to the concept *Burglary*.
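
The example sentence can be written out as data. The sketch below hand-builds a list of mentions for it, keeping only the `concept` and `title` fields that this tutorial reads later. Real responses from the service presumably carry more fields, and the repeated *Burglary* entry is a hypothetical addition to show why a set is convenient.

```python
# Hand-written mentions for the example sentence (not produced by the service).
mentions = [
    {"concept": {"title": "Vehicle insurance"}},    # "car insurance"
    {"concept": {"title": "Motor vehicle theft"}},  # "vehicle thefts"
    {"concept": {"title": "Burglary"}},             # "burglary"
    {"concept": {"title": "Burglary"}},             # hypothetical second mention
]

# A set keeps each concept once, however many times it is mentioned.
concept_titles = {mention["concept"]["title"] for mention in mentions}
print(sorted(concept_titles))
```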

The *Term Relater* service runs over pairs of Wikipedia concepts and scores how closely these concepts are related.  For example, the *Car* concept is very related to the *Traffic* concept but the *Cat* concept is not very related to the *Traffic* concept.

We will use the *Term Wikifier* to extract all mentions in all sentences; then use the *Term Relater* to select a subset of these mentions which are related to the 'Traffic' concept; then select all sentences that have mentions related to the 'Traffic' concept; and finally run *Key Point Analysis* over them. Running over these sentences will create key points specific to the traffic problem in Austin and expose insights and suggestions related to it.

### 4.1 Calculate the mentions in the sentences using the *Term Wikifier*
Exercise 3:

Please complete the missing parts in the *get_sentence_to_mentions(sentences_texts)* method. The method uses the *Term Wikifier* service to calculate the mentions for each sentence and stores them in a dictionary named *sentence_to_mentions*. 

The *Term Wikifier* client runs over the sentences_texts using the **mentions_list = term_wikifier_client.run(sentences_texts)** method and returns a list of mention lists, one per sentence.

In [None]:
def get_sentence_to_mentions(sentences_texts):
    term_wikifier_client = debater_api.get_term_wikifier_client()

    # One list of mentions is returned per input sentence
    mentions_list = term_wikifier_client.run(sentences_texts)

    # Map each sentence text to the set of Wikipedia concept titles it mentions
    sentence_to_mentions = {}
    for sentence_text, mentions in zip(sentences_texts, mentions_list):
        sentence_to_mentions[sentence_text] = {mention['concept']['title'] for mention in mentions}
    return sentence_to_mentions

Let's calculate the mentions on all 2016 sentences:

In [None]:
sentences_2016_texts = [sentence['text'] for sentence in sentences_2016]
sentence_to_mentions = get_sentence_to_mentions(sentences_2016_texts)

### 4.2 Find the mentions that relate to the *traffic* concept using the *Term Relater* service
Since we're interested in the *Traffic* concept, we will now take all mentions and find the ones that are related to that concept. Then we will select all sentences that have at least one mention that is related to the *Traffic* concept.

In [None]:
# Union of the concept titles mentioned in any sentence
all_mentions = {mention for mentions in sentence_to_mentions.values()
                for mention in mentions}

Exercise 4:<br/>
Please complete the missing parts in the *get_related_mentions(concept, threshold, all_mentions)* method. It receives a given concept, a threshold and all_mentions. It then uses the *Term Relater* service to calculate the relatedness between each mention and the concept, and returns all mentions that have a relatedness score above the given threshold. The *term_relater_client* runs over the pairs using the **scores = term_relater_client.run(concept_mention_pairs)** method and returns a list of scores.

In [None]:
def get_related_mentions(concept, threshold, all_mentions):
    term_relater_client = debater_api.get_term_relater_client()
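    # Fix the iteration order once so that each score lines up with its mention
    mentions = list(all_mentions)
    concept_mention_pairs = [[concept, mention] for mention in mentions]

    scores = term_relater_client.run(concept_mention_pairs)

    # Keep only the mentions whose relatedness to the concept exceeds the threshold
    return [mention for mention, score in zip(mentions, scores) if score > threshold]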

We will now use the method you've implemented and find the mentions that match the *traffic* concept.

In [None]:
matched_mentions = get_related_mentions('Traffic', 0.5, all_mentions)
print(matched_mentions)

### 4.3 Run *Key Point Analysis* over the sentences that relate to the *Traffic* concept
Let's select the sentences that have mentions related to the *Traffic* concept and run *Key Point Analysis* over them. We will need to switch back from sentence texts to sentence dictionaries, since our *run_kpa* method expects the sentence dictionaries.

In [None]:
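# Texts of sentences with at least one mention related to the Traffic concept (a set, for fast lookup)
matched_sentences_texts = {sentence for sentence in sentences_2016_texts
                           if sentence_to_mentions[sentence].intersection(matched_mentions)}
# Switch back to the sentence dictionaries that run_kpa expects
matched_sentences = [sentence for sentence in sentences_2016 if sentence['text'] in matched_sentences_texts]
# Cap the input at 1000 sentences, sampling at random if there are more
matched_sentences = matched_sentences if len(matched_sentences) <= 1000 else random.sample(matched_sentences, 1000)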
print('Running over %d sentences' % len(matched_sentences))

Finally, let's run *Key Point Analysis* over these sentences and examine the *Traffic*-related key points.

In [None]:
kpa_result_traffic_2016, _ = run_kpa(matched_sentences, {'n_top_kps': 20, 'mapping_threshold': 0.99}, None)
print_results(kpa_result_traffic_2016, n_sentences_per_kp=2, title='Traffic KPA 2016')

### 4.4 Conclusion

In this tutorial, we showed how *Key Point Analysis* can provide you with detailed insights over survey data right out of the box, significantly reducing the effort required by a data scientist to analyze the data. We also demonstrated how key point analysis over unstructured text can be combined with available structured information to provide new views over the data. Finally, we showed how utilizing additional Project Debater text analysis services such as *Argument Quality*, *Term Wikifier*, and *Term Relater* can further improve the quality of the results.