# BERT on Repo Description

1. Construct a sentence corpus for each software type from labelled, manually validated repo descriptions
2. Calculate an embedding for each corpus
3. Compare every repo description with each software type corpus using the cosine-similarity score
    - It took about 1.5 hrs to run the embedding on the repo data
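
To make step 3 concrete, here is a minimal sketch of top-k cosine-similarity scoring in plain Python. The 3-dimensional vectors are made-up stand-ins for real sentence embeddings, and the helper names `cosine_similarity` and `top_k_scores` are hypothetical; the notebook itself relies on `sentence_transformers` and `torch` for this step.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_scores(query, corpus, k):
    """Return the k highest cosine scores between the query and the corpus, descending."""
    scores = [cosine_similarity(query, vec) for vec in corpus]
    return sorted(scores, reverse=True)[:k]

# Toy "embeddings" (illustrative values only, not real model output)
corpus_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
query_vector = [1.0, 0.0, 0.0]

top2 = top_k_scores(query_vector, corpus_vectors, k=2)
print(top2)
```

Each repo description ends up with a short list of its best scores against one software type corpus, which is the shape of the `similarity_score` column built later in this notebook.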

Authors: Cierra and Crystal

In [2]:
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])?  y


In [1]:
# database access (PostgreSQL)
import os
import psycopg2 as pg


# sentence embeddings (BERT)
from sentence_transformers import SentenceTransformer, util
import torch

import pandas as pd

import re

# sentence tokenizer data
import nltk
nltk.download("punkt")

from nltk import tokenize

import scipy

import datetime

[nltk_data] Downloading package punkt to /home/dab3dj/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Embedding Model

In [4]:
#embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2') #quicker model
embedder = SentenceTransformer('paraphrase-mpnet-base-v2') #most accurate, long run time

# Data

## I. Unlabelled Repo Data

In [5]:
repo_data = pd.read_csv("/home/zz3hs/git/dspg21oss/data/dspg21oss/clean_github_repos_157k.csv")

In [6]:
repo_data

Unnamed: 0,slug,description
0,vuejs/vue,Vue js is a progressive incrementally adopt...
1,facebook/react,A declarative efficient and flexible JavaScr...
2,tensorflow/tensorflow,An Open Source Machine Learning Framework for ...
3,twbs/bootstrap,The most popular HTML CSS and JavaScript fra...
4,ohmyzsh/ohmyzsh,A delightful community driven with 1700 c...
...,...,...
157533,VeryLittleGravitas/CDTADPQ,Very Little Gravitas implementation of Prototy...
157534,dajinchu/kde-connect-android,For Google Code In
157535,LibrinnoTeam/LibraryHelpBot,Library Management System ITP2 project
157536,Twissi/Animator,Animator for hacklace See hacklace org for fu...


In [7]:
# get a list of repo descriptions
repo_description = repo_data["description"].tolist()

print(repo_description[0:10])
len(repo_description)

['  Vue js is a progressive  incrementally adoptable JavaScript framework for building UI on the web ', 'A declarative  efficient  and flexible JavaScript library for building user interfaces ', 'An Open Source Machine Learning Framework for Everyone', 'The most popular HTML  CSS  and JavaScript framework for developing responsive  mobile first projects on the web ', '   A delightful community driven  with 1700  contributors  framework for managing your zsh configuration  Includes nearly 300 optional plugins  rails  git  OSX  hub  capistrano  brew  ant  php  python  etc   over 140 themes to spice up your morning  and an auto update tool so that makes it easy to keep up with the latest updates from the community ', 'Flutter makes it easy and fast to build beautiful apps for mobile and beyond ', '  Algorithms and data structures implemented in JavaScript with explanations and links to further readings', 'JavaScript Style Guide', 'All Algorithms implemented in Python', ' Java   Java   Jav

157538

## II. Labelled Repo Data -- Software Type Corpus

In [8]:
data = pd.read_excel('/home/zz3hs/git/dspg21oss/data/dspg21oss/labelled_repo/oss_software_labelled_python_sz.xlsx') # import the labelled Excel sheet
data



Unnamed: 0,slug,description,language,topics,commits,forks,stars,watchers,python_label,prog_python
0,openwrt/openwrt,this repository is a mirror of https://git.ope...,C,,46330,5847,8944,547,,1
1,grpc/grpc,"the c based grpc (c++, python, ruby, objective...",C++,,42676,8069,30940,1359,1.0,1
2,libvirt/libvirt,read-only mirror. please submit merge requests...,C,,35951,496,818,84,,1
3,coreboot/coreboot,mirror of https://review.coreboot.org/coreboot...,C,,32000,367,1309,116,,1
4,ccxt/ccxt,a javascript / python / php cryptocurrency tra...,JavaScript,altcoin api arbitrage bitcoin bot cryptocurren...,27607,4971,20017,951,1.0,1
...,...,...,...,...,...,...,...,...,...,...
594,alfredfrancis/ai-chatbot-framework,a python chatbot framework with natural langua...,Python,chatbot python nltk ai sklearn natural-languag...,355,631,1491,130,1.0,1
595,jhao104/proxy_pool,python√Å√†¬®√ã√¥¬¥‚Ä∞¬™¬£√Å√™√úip√ä¬±‚Ä†(proxy pool),Python,crawler proxy proxypool spider ssdb flask craw...,354,3567,12785,438,,1
596,getsentry/responses,a utility for mocking out the python requests ...,Python,,354,269,3259,73,1.0,2
597,spotify/dh-virtualenv,python virtualenvs in debian packages,Python,dh-virtualenv python debian debian-packages om...,353,168,1471,41,,1


In [14]:
# software type
type_name =  "python_label"
#filter 500 validated repos that are labelled 1 (numeric)
corpus_type_i = data[data[type_name] ==1][["slug",type_name]]


#perform a left merge to get cleaned repo description
corpus_type_i = corpus_type_i.merge(repo_data, on='slug', how='left')
corpus_type_i = corpus_type_i["description"].tolist()
corpus_type_i[0:10]

['The C based gRPC  C    Python  Ruby  Objective C  PHP  C  ',
 'A JavaScript   Python   PHP cryptocurrency trading API with support for more than 120 bitcoin altcoin exchanges',
 'Low Code Open Source Framework in Python and JS',
 'Flexible and powerful data analysis   manipulation library for Python  providing labeled data structures similar to R data frame objects  statistical functions  and much more',
 'Python based continuous integration testing framework  your pull requests are more than welcome ',
 'Interactive Data Visualization in the browser  from  Python',
 'Official repository for Spyder   The Scientific Python Development Environment',
 'NumPy aware dynamic Python compiler using LLVM',
 'A NumPy compatible array library accelerated by CUDA',
 'The Database Toolkit for Python']

# Embedding 

In [15]:
# embedding for the corpus
corpus_type_i_embeddings = embedder.encode(corpus_type_i, show_progress_bar=True) # embeddings


Batches:   0%|          | 0/16 [00:00<?, ?it/s]

In [96]:

queries = repo_description

# pre-specified number of sentences
num_sentences = 10 #find 10 most similar sentences from the corpus

# init a result list for scores
result = []


t1 = datetime.datetime.now()
print("Start:", t1)

for query in queries: #compare each repo description to the software type corpus
    #Compute embeddings
    query_embedding = embedder.encode(query, show_progress_bar=False, convert_to_tensor=True) 

    # We use cosine-similarity and torch.topk to find the highest k scores
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_type_i_embeddings)[0]

    top_results = torch.topk(cos_scores, k=num_sentences)   #get the top k scores
    result.append(top_results.values.tolist()) #unlist the top result list
   
    #print the 10 most similar sentences from the corpus and their corresponding scores
    #print("\n\n======================\n\n")
    #print("Query:", query)
    #print("Results:", top_results)
    #print("\nTop k=10 most similar sentences in corpus:")
    #for score, idx in zip(top_results[0], top_results[1]):
    #    print(corpus_type_i[idx], "(Score: {:.4f})".format(score))

t2 =  datetime.datetime.now()
print("Finished", len(result), "descriptions at", t2)
print("It took", t2-t1, "to run.")

Start: 2021-07-17 15:15:06.160619
Finished 157538 descriptions at 2021-07-17 16:43:47.442722
It took 1:28:41.282103 to run.


In [97]:
#TODO: once written to CSV, similarity_score is stored as a string; need to figure out how to save it as a list
#save the similarity score as a variable of the original repo data
repo_data["similarity_score"] = result

In [98]:
#save csv
#repo_data.to_csv(r'/home/zz3hs/git/dspg21oss/data/dspg21oss/repo_data_python_score.csv', index = False)   
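
One possible way to address the TODO above is to encode each score list as a JSON string before writing and decode it after reading. The sketch below uses only the standard library and an in-memory buffer; the slugs and scores are made-up examples, not values from the repo data.

```python
import csv
import io
import json

# Hypothetical rows standing in for repo_data (illustrative values only)
rows = [
    {"slug": "example/repo-a", "similarity_score": [0.61, 0.58, 0.40]},
    {"slug": "example/repo-b", "similarity_score": [0.33, 0.31, 0.29]},
]

# Write: encode each list as a JSON string so the cell has an unambiguous format
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["slug", "similarity_score"])
writer.writeheader()
for row in rows:
    writer.writerow({
        "slug": row["slug"],
        "similarity_score": json.dumps(row["similarity_score"]),
    })

# Read: decode the JSON string back into a list of floats
buffer.seek(0)
restored = [
    {"slug": r["slug"], "similarity_score": json.loads(r["similarity_score"])}
    for r in csv.DictReader(buffer)
]
print(restored[0]["similarity_score"])
```

With pandas, the same idea would presumably amount to applying `json.dumps` to the column before `to_csv` and `json.loads` after `read_csv`. CSV has no list type, so the column is stored as text either way.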


# Similarity Score Analysis

In [18]:
from scipy import stats
from scipy.stats import skew
import statistics #calculate mean and others

In [16]:
#read in data
repo_data = pd.read_csv(r'/home/zz3hs/git/dspg21oss/data/dspg21oss/repo_data_python_score.csv')   


In [17]:
repo_data

Unnamed: 0,slug,description,readme,language,topics,commits,forks,stars,watchers,similarity_score
0,vuejs/vue,"🖖 Vue.js is a progressive, incrementally-adopt...","<p align=""center""><a href=""https://vuejs.org"" ...",JavaScript,"['vue', 'javascript', 'frontend', 'framework']",3070.0,29611.0,185611.0,6250.0,"[0.42707720398902893, 0.41736093163490295, 0.3..."
1,facebook/react,"A declarative, efficient, and flexible JavaScr...",# [React](https://reactjs.org/) · [![GitHub li...,JavaScript,"['javascript', 'react', 'frontend', 'declarati...",12695.0,34352.0,171327.0,6718.0,"[0.5678911805152893, 0.5504274368286133, 0.549..."
2,tensorflow/tensorflow,An Open Source Machine Learning Framework for ...,"<div align=""center"">\n<img src=""https://www.te...",C++,"['tensorflow', 'machine-learning', 'python', '...",75671.0,84937.0,156754.0,8092.0,"[0.6914295554161072, 0.6374238133430481, 0.602..."
3,twbs/bootstrap,"The most popular HTML, CSS, and JavaScript fra...","<p align=""center"">\n<a href=""https://getbootst...",JavaScript,"['css', 'bootstrap', 'javascript', 'html', 'sc...",19228.0,73981.0,151778.0,7079.0,"[0.4793936014175415, 0.46719563007354736, 0.46..."
4,ohmyzsh/ohmyzsh,🙃 A delightful community-driven (with 1700+ c...,"<p align=""center""><img alt=""Oh My Zsh"" src=""ht...",Shell,"['shell', 'zsh-configuration', 'theme', 'termi...",5447.0,22232.0,129314.0,2678.0,"[0.4871608316898346, 0.44735753536224365, 0.44..."
...,...,...,...,...,...,...,...,...,...,...
157533,VeryLittleGravitas/CDTADPQ,Very Little Gravitas implementation of Prototy...,"# CA Alerts, made with Very Little Gravitas fo...",CSS,"['prototype', 'messaging', 'emergency', 'govte...",414.0,1.0,0.0,2.0,"[0.34419283270835876, 0.3314574360847473, 0.32..."
157534,dajinchu/kde-connect-android,For Google Code-In,,Java,[],414.0,0.0,0.0,1.0,"[0.525587260723114, 0.47061866521835327, 0.459..."
157535,LibrinnoTeam/LibraryHelpBot,Library Management System. ITP2 project,# Library Help Bot\r\n\r\n## Purpose of the ap...,Python,"['telegram-bot', 'mariadb', 'python', 'flask']",415.0,2.0,0.0,2.0,"[0.5185543894767761, 0.517078697681427, 0.4827..."
157536,Twissi/Animator,Animator for hacklace. See hacklace.org for fu...,Animator\n========\n\nAnimator for hacklace. S...,Java,[],415.0,0.0,0.0,0.0,"[0.41843315958976746, 0.4162086844444275, 0.40..."


In [29]:
# scores were read back from the CSV as strings; convert each one to a list of floats
score_ls = repo_data["similarity_score"]

score_ls_float = []
for sentence_score in score_ls:
    sentence_score = str(sentence_score)[1:-1]
    sentence_score = sentence_score.split(",")
    item_float= []
    for item in sentence_score:
        item_float.append(float(item))
    score_ls_float.append(item_float)

    
repo_data["similarity_score_float"] = score_ls_float
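
As an alternative to slicing and splitting the string by hand, the standard library's `ast.literal_eval` can parse a list literal directly. This is a minimal sketch; `parse_score_string` is a hypothetical helper name, and the example string reuses the first two scores shown for `vuejs/vue` above.

```python
import ast

def parse_score_string(cell):
    """Turn a string such as '[0.42, 0.41]' into a list of floats."""
    values = ast.literal_eval(cell)  # safely evaluates a Python literal only
    return [float(v) for v in values]

example_cell = "[0.42707720398902893, 0.41736093163490295]"
parsed = parse_score_string(example_cell)
print(parsed[0])
```

Unlike the manual approach, this raises an error on malformed input instead of silently producing odd fragments, which may make bad rows easier to spot.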

In [30]:
#check scores are in a list
repo_data["similarity_score_float"][0][0]

0.42707720398902893

In [31]:
# get score statistics
score_ls = repo_data["similarity_score_float"]

mean_score= []
range_score = []
max_score = []
median_score = []
skewness_score = []
for sentence_score in score_ls:
    mean_score.append(statistics.mean(sentence_score))
    range_score.append(max(sentence_score)- min(sentence_score))
    max_score.append(max(sentence_score))
    median_score.append(statistics.median(sentence_score))
    skewness_score.append(stats.skew(sentence_score))
repo_data["mean_score"]=mean_score
repo_data["range_score"]=range_score
repo_data["max_score"]=max_score
repo_data["median_score"]=median_score
repo_data["skewness_score"]=skewness_score
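
For reference, the per-row statistics above can be written out for a single list of scores. This sketch uses only the standard library. It computes skewness as the biased moment ratio m3 / m2^1.5, which to my understanding is what `scipy.stats.skew` returns with its default settings; treat that equivalence as an assumption. `score_summary` and the toy scores are hypothetical.

```python
import math
import statistics

def score_summary(scores):
    """Mean, range, max, median and (biased) skewness of a list of scores."""
    n = len(scores)
    mean = statistics.mean(scores)
    # Second and third central moments (population form, divided by n)
    m2 = sum((s - mean) ** 2 for s in scores) / n
    m3 = sum((s - mean) ** 3 for s in scores) / n
    # Skewness is undefined when all scores are identical
    skewness = m3 / m2 ** 1.5 if m2 > 0 else math.nan
    return {
        "mean_score": mean,
        "range_score": max(scores) - min(scores),
        "max_score": max(scores),
        "median_score": statistics.median(scores),
        "skewness_score": skewness,
    }

# Toy scores (illustrative values only)
summary = score_summary([0.1, 0.2, 0.3, 1.0])
print(summary)
```

A positive skewness here means a few scores sit well above the rest, as when a description matches one or two corpus sentences much more closely than the others.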

In [32]:
repo_data

Unnamed: 0,slug,description,readme,language,topics,commits,forks,stars,watchers,similarity_score,similarity_score_float,mean_score,range_score,max_score,median_score,skewness_score
0,vuejs/vue,"🖖 Vue.js is a progressive, incrementally-adopt...","<p align=""center""><a href=""https://vuejs.org"" ...",JavaScript,"['vue', 'javascript', 'frontend', 'framework']",3070.0,29611.0,185611.0,6250.0,"[0.42707720398902893, 0.41736093163490295, 0.3...","[0.42707720398902893, 0.41736093163490295, 0.3...",0.386763,0.065932,0.427077,0.381985,0.976499
1,facebook/react,"A declarative, efficient, and flexible JavaScr...",# [React](https://reactjs.org/) · [![GitHub li...,JavaScript,"['javascript', 'react', 'frontend', 'declarati...",12695.0,34352.0,171327.0,6718.0,"[0.5678911805152893, 0.5504274368286133, 0.549...","[0.5678911805152893, 0.5504274368286133, 0.549...",0.516635,0.074020,0.567891,0.503811,0.849749
2,tensorflow/tensorflow,An Open Source Machine Learning Framework for ...,"<div align=""center"">\n<img src=""https://www.te...",C++,"['tensorflow', 'machine-learning', 'python', '...",75671.0,84937.0,156754.0,8092.0,"[0.6914295554161072, 0.6374238133430481, 0.602...","[0.6914295554161072, 0.6374238133430481, 0.602...",0.603414,0.110041,0.691430,0.587026,1.810697
3,twbs/bootstrap,"The most popular HTML, CSS, and JavaScript fra...","<p align=""center"">\n<a href=""https://getbootst...",JavaScript,"['css', 'bootstrap', 'javascript', 'html', 'sc...",19228.0,73981.0,151778.0,7079.0,"[0.4793936014175415, 0.46719563007354736, 0.46...","[0.4793936014175415, 0.46719563007354736, 0.46...",0.424735,0.096729,0.479394,0.413202,0.336363
4,ohmyzsh/ohmyzsh,🙃 A delightful community-driven (with 1700+ c...,"<p align=""center""><img alt=""Oh My Zsh"" src=""ht...",Shell,"['shell', 'zsh-configuration', 'theme', 'termi...",5447.0,22232.0,129314.0,2678.0,"[0.4871608316898346, 0.44735753536224365, 0.44...","[0.4871608316898346, 0.44735753536224365, 0.44...",0.432683,0.075190,0.487161,0.425684,1.349496
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
157533,VeryLittleGravitas/CDTADPQ,Very Little Gravitas implementation of Prototy...,"# CA Alerts, made with Very Little Gravitas fo...",CSS,"['prototype', 'messaging', 'emergency', 'govte...",414.0,1.0,0.0,2.0,"[0.34419283270835876, 0.3314574360847473, 0.32...","[0.34419283270835876, 0.3314574360847473, 0.32...",0.317462,0.045936,0.344193,0.318741,0.451481
157534,dajinchu/kde-connect-android,For Google Code-In,,Java,[],414.0,0.0,0.0,1.0,"[0.525587260723114, 0.47061866521835327, 0.459...","[0.525587260723114, 0.47061866521835327, 0.459...",0.366771,0.236830,0.525587,0.312547,0.700486
157535,LibrinnoTeam/LibraryHelpBot,Library Management System. ITP2 project,# Library Help Bot\r\n\r\n## Purpose of the ap...,Python,"['telegram-bot', 'mariadb', 'python', 'flask']",415.0,2.0,0.0,2.0,"[0.5185543894767761, 0.517078697681427, 0.4827...","[0.5185543894767761, 0.517078697681427, 0.4827...",0.471379,0.077404,0.518554,0.467471,0.701968
157536,Twissi/Animator,Animator for hacklace. See hacklace.org for fu...,Animator\n========\n\nAnimator for hacklace. S...,Java,[],415.0,0.0,0.0,0.0,"[0.41843315958976746, 0.4162086844444275, 0.40...","[0.41843315958976746, 0.4162086844444275, 0.40...",0.400396,0.032988,0.418433,0.397456,0.503637


In [2]:
df = pd.read_csv('~/git/dspg21oss/data/dspg21oss/full_repo_sim_scores_co.csv')

In [3]:
df.head()

Unnamed: 0,slug,description,ai_sim_score,blockchain_sim_score,clang_sim_score,database_sim_score,dataviz_sim_score,java_sim_score,javascript_sim_score,php_sim_score,python_sim_score
0,vuejs/vue,Vue js is a progressive incrementally adopt...,"[0.38483116030693054, 0.34321022033691406, 0.3...","[0.3770439028739929, 0.37109243869781494, 0.36...","[0.4809216260910034, 0.42691099643707275, 0.42...","[0.3832893371582031, 0.38218772411346436, 0.38...","[0.6460678577423096, 0.47009673714637756, 0.46...","[0.45489370822906494, 0.3935528099536896, 0.37...","[0.6812090873718262, 0.5953835844993591, 0.560...","[0.43424174189567566, 0.43411487340927124, 0.4...","[0.4530397057533264, 0.451049268245697, 0.4241..."
1,facebook/react,A declarative efficient and flexible JavaScr...,"[0.40412580966949463, 0.40174245834350586, 0.3...","[0.4677104353904724, 0.4392547011375427, 0.383...","[0.5760725140571594, 0.5097417235374451, 0.488...","[0.49702656269073486, 0.45085686445236206, 0.4...","[0.5633409023284912, 0.5507672429084778, 0.506...","[0.5418217182159424, 0.5393773317337036, 0.452...","[1.0000003576278687, 0.763877272605896, 0.6752...","[0.5178160071372986, 0.48840513825416565, 0.47...","[0.5891509652137756, 0.5473170280456543, 0.513..."
2,tensorflow/tensorflow,An Open Source Machine Learning Framework for ...,"[0.6374237537384033, 0.5639429092407227, 0.554...","[0.4424801468849182, 0.4372060000896454, 0.425...","[0.47011685371398926, 0.42693886160850525, 0.4...","[0.49735206365585327, 0.48971667885780334, 0.4...","[0.616600513458252, 0.5324415564537048, 0.4950...","[0.5114375352859497, 0.40726539492607117, 0.40...","[0.5229647755622864, 0.4638928174972534, 0.430...","[0.4243718385696411, 0.36536622047424316, 0.36...","[0.6434419751167297, 0.6402792930603027, 0.637..."
3,twbs/bootstrap,The most popular HTML CSS and JavaScript fra...,"[0.34504151344299316, 0.2958582937717438, 0.27...","[0.3497481048107147, 0.34957581758499146, 0.33...","[0.49070748686790466, 0.47657108306884766, 0.4...","[0.372490257024765, 0.3629458248615265, 0.3476...","[0.40439078211784363, 0.3792816698551178, 0.35...","[0.38311436772346497, 0.3724273443222046, 0.36...","[1.0000003576278687, 0.6907554864883423, 0.591...","[0.42142561078071594, 0.3685976564884186, 0.36...","[0.47627848386764526, 0.45655155181884766, 0.4..."
4,ohmyzsh/ohmyzsh,A delightful community driven with 1700 c...,"[0.35356998443603516, 0.3418255150318146, 0.32...","[0.43588021397590637, 0.3517952263355255, 0.34...","[0.38053128123283386, 0.340486079454422, 0.331...","[0.4280346632003784, 0.4265906810760498, 0.423...","[0.47275614738464355, 0.42588570713996887, 0.4...","[0.3862834870815277, 0.3327464759349823, 0.326...","[0.3588447570800781, 0.332401305437088, 0.3097...","[0.42803463339805603, 0.399021178483963, 0.363...","[0.4832826554775238, 0.45471256971359253, 0.43..."
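
One possible way to read a row of this table is to compare the best score per software type and pick the largest. The notebook does not state that this is how the scores are used, so treat the sketch below as an illustration only. `best_matching_type` is a hypothetical helper, and the scores are rounded values loosely based on the `vuejs/vue` row, restricted to three of the nine columns.

```python
def best_matching_type(top_scores_by_type):
    """Return (column_name, score) for the type whose best score is highest."""
    best_type = max(
        top_scores_by_type,
        key=lambda name: max(top_scores_by_type[name]),
    )
    return best_type, max(top_scores_by_type[best_type])

# Rounded, illustrative scores for one repo
example_row = {
    "javascript_sim_score": [0.68, 0.60, 0.56],
    "dataviz_sim_score": [0.65, 0.47, 0.46],
    "python_sim_score": [0.45, 0.45, 0.42],
}

label, score = best_matching_type(example_row)
print(label, score)
```

Using the maximum is only one choice; the mean or median of the top-k scores, as computed earlier in this notebook, could be compared across types in the same way.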
