# 1. Analyzing issues and pull requests crawled from GitHub (repositories of open-source apps)

## 1.1 Getting the app lists

### 1.1.1 Get the Google Play app list and write it to a file

In [1]:
import analyze_review
analyze_review.write_app_list()

Output file path
/mnt/d/Onedrive/Code/UIReviewAnalysis/src/analyze/GooglePlay_app_name.txt


### 1.1.2 Get the F-Droid app list and write it to a file

In [2]:
!python /mnt/d/Onedrive/Code/UIReviewAnalysis/script/get_fdroid_app_list.py

Total number of F-Droid apps (whether or not they still exist)
6298
Output file path
/mnt/d/Onedrive/Code/UIReviewAnalysis/src/analyze/Fdroid_app_name.txt


#### Filtering and counting the F-Droid apps

In [3]:
!python /mnt/d/Onedrive/Code/UIReviewAnalysis/script/get_fdroid_repo_list.py

Total number of F-Droid apps
5192
Number after filtering (i.e. apps whose source code is hosted on GitHub)
2496
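
The filtering step keeps only apps whose source code address points to GitHub. The script itself is not shown, so the following is only a minimal sketch of how such a filter could look. The record layout, the `source` field, and the helper name `github_owner_repo` are assumptions made for illustration.

```python
from urllib.parse import urlparse


def github_owner_repo(source_url):
    """Return 'owner/repo' for a GitHub URL, or None for any other host."""
    parsed = urlparse(source_url)
    if parsed.hostname not in ("github.com", "www.github.com"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        return None
    owner, repo = parts[0], parts[1]
    # Clone URLs often end in ".git"; drop the suffix.
    if repo.endswith(".git"):
        repo = repo[:-4]
    return f"{owner}/{repo}"


# Hypothetical records; the real F-Droid metadata format is not shown here.
apps = [
    {"name": "App A", "source": "https://github.com/example/app-a"},
    {"name": "App B", "source": "https://gitlab.com/example/app-b"},
    {"name": "App C", "source": "https://github.com/example/app-c.git"},
]

github_apps = {}
for app in apps:
    owner_repo = github_owner_repo(app["source"])
    if owner_repo is not None:
        github_apps[app["name"]] = owner_repo

print(len(apps), len(github_apps))
```

Comparing the parsed hostname rather than searching for the substring "github" avoids counting URLs that merely mention GitHub in their path.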


## 1.2 Analyzing the crawled F-Droid data

In [4]:
%load_ext autoreload
%autoreload 2
import analyze_pull_request_and_issue as pi
import keywords_search
import analyze_review_reply
from pprint import pprint
import importlib
import pandas as pd
importlib.reload(pi)

<module 'analyze_pull_request_and_issue' from '/mnt/d/Onedrive/Code/UIReviewAnalysis/src/analyze/analyze_pull_request_and_issue.py'>

### 1.2.1 Computing the overlap between the two datasets

In [5]:
data, app_list = pi.get_google_play_in_fdroid_data()

Number of apps present in both Google Play and F-Droid
1046
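
The overlap is the set of apps that appear in both lists. How `get_google_play_in_fdroid_data` matches entries is not shown, so this is only a sketch under the assumption that apps are matched on a normalized identifier string. The sample identifiers and the `normalize` helper are hypothetical.

```python
def normalize(name):
    """Strip whitespace and lowercase so trivial differences do not block a match."""
    return name.strip().lower()


def overlap(google_play_names, fdroid_names):
    """Return the sorted identifiers present in both lists."""
    google_play = {normalize(n) for n in google_play_names if n.strip()}
    fdroid = {normalize(n) for n in fdroid_names if n.strip()}
    return sorted(google_play & fdroid)


# Hypothetical identifiers standing in for the two *_app_name.txt files.
google_play = ["org.example.notes", "org.example.mail\n", "com.example.game"]
fdroid = ["org.example.mail", "ORG.EXAMPLE.NOTES", "org.example.maps"]

common = overlap(google_play, fdroid)
print(len(common), common)
```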


### 1.2.2 Basic statistics

In [6]:
pi.count()

Number of apps present in both Google Play and F-Droid
1046
Data fields
['issue_url',
 'user_name',
 'owner_repo',
 'types',
 'create_time',
 'update_time',
 'state',
 'title',
 'text']
Total number of issues + pull requests
127049
Number of pull requests
48484
Number of issues
78565
Number of apps
434


In [7]:
# TODO: reprocess the encoding of the log
pi.process_log()

Number of all F-Droid apps that have a GitHub address
2496
Projects crawled
2493
Projects that do not exist
3


In [8]:
pi.process_list()

Number of officially listed apps that have a GitHub address
2496


## 1.3 Analyzing UI-related issues

### 1.3.1 Search for related issues and pull requests using UI keywords

In [9]:
data, app_list = pi.get_google_play_in_fdroid_data()

Number of apps present in both Google Play and F-Droid
1046


In [10]:
#click
#legend
keywords = keywords_search.get_keywords()

In [11]:
data_ui = pi.search_ui_issue_and_pr()

Searching for and analyzing UI-related pull requests and issues
Number of apps present in both Google Play and F-Droid
1046
Quick test: strings in two pandas columns can be concatenated directly
String A
Add Slovak dictionary
String B
Made it with aosp-dictionary-tools from https://github.com/hermitdave/FrequencyWords/tree/master/content/2018/sk.
Filtering it from all the junk was quite a pain but in the end, I think what I made is quite a decent dictionary.

Should I also submit this to Lineage? 
String A+B: data['title'] + '\n' + data['text']
Add Slovak dictionary
Made it with aosp-dictionary-tools from https://github.com/hermitdave/FrequencyWords/tree/master/content/2018/sk.
Filtering it from all the junk was quite a pain but in the end, I think what I made is quite a decent dictionary.

Should I also submit this to Lineage? 
Number of UI keywords
55
Number of UI-related items found
31947
Share of UI-related items
25.145%
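
The search joins each item's title and body text and checks the result against the keyword list. The body of `search_ui_issue_and_pr` is not shown, so the following is only a sketch of one plausible matching rule (case-insensitive, whole-word). The helper names and sample records are hypothetical, and the keyword list is a small stand-in for the 55 real keywords.

```python
import re


def build_pattern(keywords):
    """Compile one case-insensitive, whole-word pattern for all keywords."""
    # Longer keywords first so multi-word phrases win over their prefixes.
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def is_ui_related(title, text, pattern):
    """Search the title and body together, as in title + '\\n' + text."""
    combined = f"{title}\n{text or ''}"
    return pattern.search(combined) is not None


# Stand-in keywords; the full keyword list is not reproduced here.
keywords = ["click", "legend", "dark mode"]
pattern = build_pattern(keywords)

# Hypothetical records with the 'title' and 'text' fields listed above.
records = [
    {"title": "Add Slovak dictionary", "text": "Made it with aosp-dictionary-tools."},
    {"title": "Button ignores click", "text": None},
    {"title": "Chart", "text": "The legend overlaps the plot."},
]

ui_records = [r for r in records if is_ui_related(r["title"], r["text"], pattern)]
share = len(ui_records) / len(records)
print(len(ui_records), f"{share:.3%}")
```

Whole-word matching is a design choice: it avoids hits inside longer words, but it also misses inflected forms such as "clicks" unless they are listed as keywords.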


### 1.3.2 Counting the matched issues and pull requests separately

In [12]:
ui_pull_request, ui_issue = pi.count_ui_issue_and_pr(data, data_ui)

Number and share of UI-related pull requests found
9567
19.732%
Open pull requests and their share
271
2.833%
Closed pull requests and their share
9296
97.167%
Number and share of UI-related issues found
22380
28.486%
Open issues and their share
4833
21.595%
Closed issues and their share
17547
78.405%
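
Each share above is a count divided by its total, formatted as a percentage with three decimals. A minimal sketch of that calculation, using the pull request counts reported above, is shown here. The helper name `state_shares` is hypothetical and is not part of `analyze_pull_request_and_issue`.

```python
from collections import Counter


def state_shares(states):
    """Map each state to (count, share of total formatted as a percentage)."""
    counts = Counter(states)
    total = sum(counts.values())
    # The ".3%" format multiplies by 100 and appends the percent sign.
    return {state: (n, f"{n / total:.3%}") for state, n in counts.items()}


# Rebuild the state column of the UI-related pull requests from the counts above.
states = ["open"] * 271 + ["closed"] * 9296
shares = state_shares(states)
print(shares["open"], shares["closed"])
```

Letting the format specifier do the scaling keeps the ratio and the printed percentage from drifting apart by a factor of 100.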


### 1.3.3 Sampling at a 99% confidence level with a 5% margin of error

In [13]:
import sample_size
pi.red("Sample size for pull requests")
pull_request_number = sample_size.calculate_size(2.58, 0.05, 9567)
print(pull_request_number)
pi.red("Sample size for issues")
issue_number = sample_size.calculate_size(2.58, 0.05, 22380)
print(issue_number)

# Fix the random seed so the samples are reproducible, then draw
# pull requests and issues with the sample sizes computed above.
SEED = 666
sample_pull_request = ui_pull_request.sample(n=pull_request_number, random_state=SEED)
pi.red("Open pull requests in the sample")
open_sample_pull_request = sample_pull_request[sample_pull_request["state"] == "open"]
print(len(open_sample_pull_request))
pi.red("Closed pull requests in the sample")
closed_sample_pull_request = sample_pull_request[sample_pull_request["state"] == "closed"]
print(len(closed_sample_pull_request))
sample_issue = ui_issue.sample(n=issue_number, random_state=SEED)
pi.red("Open issues in the sample")
open_sample_issue = sample_issue[sample_issue["state"] == "open"]
print(len(open_sample_issue))
pi.red("Closed issues in the sample")
closed_sample_issue = sample_issue[sample_issue["state"] == "closed"]
print(len(closed_sample_issue))

Sample size for pull requests
622
Sample size for issues
646
Open pull requests in the sample
16
Closed pull requests in the sample
606
Open issues in the sample
155
Closed issues in the sample
491
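
The source of `sample_size.calculate_size` is not shown. The sample sizes 622 and 646 are, however, consistent with Cochran's formula plus a finite population correction, using z = 2.58, a 5% margin of error, and the most conservative proportion p = 0.5. The sketch below assumes that formula and rounds to the nearest integer. The function name and the `p` parameter are illustrative, not the module's actual API.

```python
def cochran_sample_size(z, margin, population, p=0.5):
    """Sample size for a proportion with a finite population correction.

    z          -- z-score of the confidence level (2.58 for roughly 99%)
    margin     -- margin of error as a fraction (0.05 for 5%)
    population -- size of the population being sampled
    p          -- assumed proportion; 0.5 gives the largest sample size
    """
    # Sample size for an infinite population.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Shrink it for a finite population.
    n = n0 / (1 + (n0 - 1) / population)
    return round(n)


print(cochran_sample_size(2.58, 0.05, 9567))   # 622
print(cochran_sample_size(2.58, 0.05, 22380))  # 646
```

Rounding to the nearest integer reproduces the figures above. Rounding up instead would give 623 and 647, which is the more conservative convention.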


### 1.3.4 Exporting the samples as HTML with the keywords highlighted

In [14]:
%%time
html_pull_request = keywords_search.color_html(sample_pull_request['sentences'], keywords=keywords)
html_issue = keywords_search.color_html(sample_issue['sentences'], keywords=keywords)
pi.red("An example pull request")
print(html_pull_request[:1])
pi.red("An example issue")
print(html_issue[:1])

TypeError: color_html() missing 1 required positional argument: 'file'

In [15]:
%%time
html_pull_request = keywords_search.color_html(sample_pull_request['sentences'], 
                                                keywords, 
                                                "sample_ui_pull_request_sample_check.html")
html_issue = keywords_search.color_html(sample_issue['sentences'], 
                                        keywords,
                                        "sample_ui_issue_check.html")

CPU times: user 219 ms, sys: 46.9 ms, total: 266 ms
Wall time: 224 ms
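
The implementation of `keywords_search.color_html` is not shown. As a rough illustration of the idea, the sketch below wraps each keyword match in a `<mark>` tag and HTML-escapes everything else. The function name `highlight_keywords` and the choice of `<mark>` are assumptions, not the module's actual behavior.

```python
import html
import re


def highlight_keywords(text, keywords):
    """Return HTML-escaped text with whole-word keyword matches in <mark> tags."""
    if not keywords:
        return html.escape(text)
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in ordered) + r")\b",
        re.IGNORECASE,
    )
    parts = []
    last = 0
    # Match on the raw text and escape each piece separately, so that a
    # keyword can never match inside an entity such as "&lt;".
    for match in pattern.finditer(text):
        parts.append(html.escape(text[last:match.start()]))
        parts.append(f"<mark>{html.escape(match.group(0))}</mark>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


snippet = highlight_keywords("Click the <b> icon", ["click", "icon"])
print(snippet)
```

Escaping the surrounding text matters here because issue bodies often contain raw HTML or code that a browser would otherwise render.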


### 1.3.5 Statistics of the labeled data

In [16]:
labels = pd.read_csv("label_ui_issue_category.txt", header=None, sep=" ", names=['category', 'subcategory'])
pi.red("Counts of UI issues by category")
print(labels['category'].value_counts())
pi.red("Proportions of UI issues by category")
print(labels['category'].value_counts(normalize=True))
pi.red("Counts by issue type")
print(labels['subcategory'].value_counts())
pi.red("Proportions by issue type")
print(labels['subcategory'].value_counts(normalize=True))

Counts of UI issues by category
appearance     398
interaction    137
experience      73
0               23
others          15
Name: category, dtype: int64
Proportions of UI issues by category
appearance     0.616099
interaction    0.212074
experience     0.113003
0              0.035604
others         0.023220
Name: category, dtype: float64
Counts by issue type
layout           139
color            134
gesture           90
image             86
navigation        49
icon              33
motion            29
customization     25
0                 20
material          15
feedback          10
text               6
notification       4
(interface)        3
(ugly)             1
accessibility      1
screen             1
Name: subcategory, dtype: int64
Proportions by issue type
layout           0.215170
color            0.207430
gesture          0.139319
image            0.133127
navigation       0.075851
icon             0.051084
motion           0.044892
customization    0.038700
0                0.030960
materi

# 2. Analyzing reviews and replies

In [17]:
pi.red("Statistics for RQ2.")
analyze_review_reply.analyze_advice_interaction()

analyze_review_reply.analyze_advice_experience()
analyze_review_reply.analyze_count()

Statistics for RQ2.
Counting interaction within Advice
Number of Advice items
107
Number of interaction items in Advice
43
Number and share of navigation within interaction in Advice
20
46.512%
Number and share of gesture within interaction in Advice
16
37.209%
Counting experience within Advice
31
Number and share of customization within experience in Advice
15
48.387%
Taking advice as an example, distribution of the four categories and 17 subcategories across the different dialogues
interaction               39
appearance                32
experience                28
appearance/interaction     2
appearance/experience      2
interaction/appearance     1
others                     1
experience/interaction     1
Name: Category, dtype: int64
navigation                              17
customization limitation                14
gesture                                 13
image                                   10
feedback                                10
layout                                   9
notification                             7
iconography                              6
color          