# Abbreviations:
- `df` -> dataframe
- `Series` -> column
- `embs` -> embeddings (numerical representation of words in a vector space, from our NLP AI model)
- `uembs` -> umap embeddings (dimensionality reduction of embs, to make it easier to visualize, 768 -> 2)

# Imports

In [1]:
from scripts import *


Local stopwords:        True
GPUs detected:          1
Using GPU:              True
Device:                 cuda
Got model from pickle:  True



# Data

## Getting Data

In [2]:
df = pd.read_csv("data/tensorflow.csv", on_bad_lines='skip')

df = df[["author_name", "time_sec", "subject"]]


df = df.rename(columns={
    "author_name": "user", 
    "time_sec": "time_sec", 
    "subject": "text"
})

# if your computer does not have GPU support, you can use a sample of the dataset instead to make it run in a reasonable time
if device == "cpu": df = df.sample(frac=0.05)

df

Unnamed: 0,user,time_sec,text
0,ekelsen,1524782988,Merge-pull-request-18846-from-yongtang-04252018-FloorDiv-int8
1,ekelsen,1524781964,Merge-pull-request-18881-from-ManHyuk-fix_typo
2,ekelsen,1524781835,Merge-pull-request-18907-from-yongtang-18363-mpi
3,ekelsen,1524781628,Merge-pull-request-18896-from-KikaTech-fix_lite_topk
4,Yong Tang,1524775671,Fix-cmake-build-issues-with-GPU-on-Linux-18775
...,...,...,...
32464,Vijay Vasudevan,1446933504,TensorFlow-Upstream-commits-to-git
32465,Vijay Vasudevan,1446922181,TensorFlow-Upstream-latest-commits-to-git
32466,Vijay Vasudevan,1446875858,TensorFlow-Upstream-changes-to-git
32467,Manjunath Kudlur,1446863831,TensorFlow-Upstream-latest-changes-to-Git


The commits are sorted by time (newest first), so they are ready to be sliced by time.

In [3]:
df.isna().sum()

user        0
time_sec    0
text        0
dtype: int64

We intend to use the `text` field as a temporary substitute for `categoryRaw`, which we are waiting to receive from the Schibsted data.

## Cleaning Data

In [4]:
df["text_clean"] = df["text"].apply(string_cleaner)

df[["text", "text_clean"]].head(5)

Unnamed: 0,text,text_clean
0,Merge-pull-request-18846-from-yongtang-04252018-FloorDiv-int8,merge pull request 18846 from yongtang 04252018 floordiv int8
1,Merge-pull-request-18881-from-ManHyuk-fix_typo,merge pull request 18881 from manhyuk fix_typo
2,Merge-pull-request-18907-from-yongtang-18363-mpi,merge pull request 18907 from yongtang 18363 mpi
3,Merge-pull-request-18896-from-KikaTech-fix_lite_topk,merge pull request 18896 from kikatech fix_lite_topk
4,Fix-cmake-build-issues-with-GPU-on-Linux-18775,fix cmake build issues with gpu on linux 18775
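
`string_cleaner` comes from `scripts` and is not shown in this notebook. Judging only from the before/after columns above, it seems to lowercase the text, turn hyphens into spaces and drop other punctuation while keeping underscores. A minimal sketch under that assumption (the name `string_cleaner_sketch` is hypothetical, and the real helper may do more):

```python
import re

def string_cleaner_sketch(text: str) -> str:
    """Hypothetical stand-in for string_cleaner, inferred from the sample rows."""
    # Lowercase and turn the hyphen separators into spaces.
    text = text.lower().replace("-", " ")
    # Keep word characters (letters, digits, underscore) and spaces; drop the rest.
    return re.sub(r"[^\w ]", "", text)

print(string_cleaner_sketch("Merge-pull-request-18881-from-ManHyuk-fix_typo"))
```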


## Engineering Data

### Making a new column for the week

In [5]:
# 604800 = 7 * 24 * 60 * 60 seconds, so this is the number of whole weeks since the Unix epoch
df["time_week"] = df["time_sec"] // 604800

df.to_pickle(names["df"])
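
The week column is an integer division of the Unix timestamp by the number of seconds in a week. The sketch below shows the same arithmetic on the first timestamp in the dataset. The helper names are illustrative only. Since 1970-01-01 was a Thursday, these buckets run Thursday to Wednesday (UTC) and do not line up with ISO calendar weeks.

```python
from datetime import datetime, timezone

SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800

def epoch_week(time_sec: int) -> int:
    """Number of whole weeks since 1970-01-01 00:00 UTC."""
    return time_sec // SECONDS_PER_WEEK

def week_start(week: int) -> datetime:
    """UTC datetime at which the given epoch week begins."""
    return datetime.fromtimestamp(week * SECONDS_PER_WEEK, tz=timezone.utc)

week = epoch_week(1524782988)  # first commit in the table above
print(week, week_start(week).date())
```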

### Grouping on users

Build a dictionary by grouping all the commit messages each user has.
Set groups on the user ID later, so that we can create animation frames of how the groups move.

In [6]:
# dfu - ABBR: Data Frame User grouped
dfu = df[["text_clean", "time_sec", "time_week", "user"]].copy()

dfu = dfu.groupby(["user", "time_week"]).agg(list).reset_index()

dfu["text_clean_join"] = dfu["text_clean"].apply(lambda x: " ".join(x))

dfu.head(3)

Unnamed: 0,user,time_week,text_clean,time_sec,text_clean_join
0,4F2E4A2E,2465,[update get_startedmd 8924 ],[1491251028],update get_startedmd 8924
1,4F2E4A2E,2469,[fixing anaconda install command 9277 ],[1493310176],fixing anaconda install command 9277
2,4F2E4A2E,2472,[fixing 404 url 10052 ],[1495235056],fixing 404 url 10052
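
For readers less familiar with `groupby(...).agg(list)`, the grouping above can be sketched in plain Python: every (user, week) pair collects its messages into a list, which is then joined into one string. The rows here are made-up toy data, not taken from the dataset.

```python
from collections import defaultdict

# Toy rows standing in for df[["user", "time_week", "text_clean"]].
rows = [
    {"user": "alice", "time_week": 2465, "text_clean": "fix typo"},
    {"user": "alice", "time_week": 2465, "text_clean": "add test"},
    {"user": "bob", "time_week": 2465, "text_clean": "update docs"},
    {"user": "alice", "time_week": 2469, "text_clean": "remove flag"},
]

# Collect the messages of each (user, week) pair into a list.
grouped = defaultdict(list)
for row in rows:
    grouped[(row["user"], row["time_week"])].append(row["text_clean"])

# Join each list into a single string, like text_clean_join.
joined = {key: " ".join(messages) for key, messages in sorted(grouped.items())}
for key, text in joined.items():
    print(key, "->", text)
```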


In [7]:
weeks_per_user = dfu["user"].value_counts().reset_index()

print(len(weeks_per_user))

weeks_per_user[weeks_per_user["user"] > 10]

1644


Unnamed: 0,index,user
0,A. Unique TensorFlower,125
1,Benoit Steiner,118
2,Shanqing Cai,114
3,Eugene Brevdo,110
4,Derek Murray,106
...,...,...
138,guschmue,11
139,David Z. Chen,11
140,Shivani Agrawal,11
141,Siddharth Agrawal,11


# Machine Learning

## Unsupervised ML

### NLP Embeddings

Getting the 768-dimensional embedding for each commit message

In [8]:
try:
    if conf["fresh_data"]: raise Exception
    embs = pickle.load(open(names[f"embs-{device}"], 'rb'))
    
except:
    embs = sbert_emb_getter(df["text_clean"].to_numpy(), filename=names[f"model-{device}"])
    pickle.dump(embs, open(names[f"embs-{device}"], 'wb'))

    conf["fresh_embs"] = True

print(f"fresh embs: {conf['fresh_embs']}")

fresh embs: False
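
The cell above follows a load-or-compute pattern: reuse the pickled embeddings unless fresh data is requested or the file is missing. A minimal sketch of the same idea with explicit checks and context managers is shown below. `load_or_compute` is a hypothetical helper and not part of `scripts`, and the lists stand in for real embeddings.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute, fresh=False):
    """Return (value, was_fresh). Hypothetical helper, not part of scripts."""
    if not fresh and os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f), False
    # Cache miss (or fresh requested): compute and store for next time.
    value = compute()
    with open(path, "wb") as f:
        pickle.dump(value, f)
    return value, True

cache_path = os.path.join(tempfile.mkdtemp(), "embs-demo.pkl")
first, first_fresh = load_or_compute(cache_path, lambda: [[0.1, 0.2], [0.3, 0.4]])
second, second_fresh = load_or_compute(cache_path, lambda: [[9.9, 9.9]])
print(first_fresh, second_fresh)
```

Compared with a bare `except:`, this only falls back to recomputing when the cache file is absent, so unrelated errors (for example a typo in a key name) are not silently swallowed.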


### Dimensionality Reduction

We use UMAP to reduce the dimensionality of the embeddings from 768 to 2, so that we can visualize them

In [9]:
try:
    if conf["fresh_data"]: raise Exception
    uembs = pickle.load(open(names[f"uembs-{device}"], 'rb'))
    
except:
    uembs = UMAP(n_neighbors=20, min_dist=0.1).fit_transform(embs)
    pickle.dump(uembs, open(names[f"uembs-{device}"], 'wb'))

    conf["fresh_uembs"] = True

print(f"fresh uembs: {conf['fresh_uembs']}")

fresh uembs: False


In [10]:
# TODO make this plot just a trace, to fit in gridplot

fig = px.scatter(x=uembs[:,0], y=uembs[:,1])

fig.update_layout(width=800, height=800)
fig.update_traces(marker=dict(size=2))

# plotting to show how the embeddings are when just dimensionality reduction is used
fig_show_save(fig, "umap-scatter", show=conf["show_figs"])

In [11]:
clusters_2d = HDBSCAN(min_cluster_size=100, cluster_selection_method="leaf").fit(uembs)


print(f"""
    2D
    Number of clusters: {len(set(clusters_2d.labels_)) - 1}
    Number of rows as outliers: {clusters_2d.labels_.tolist().count(-1)}
""")


    2D
    Number of clusters: 53
    Number of rows as outliers: 15409
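
The cluster count above is computed as `len(set(labels)) - 1`, which assumes that at least one row carries the noise label `-1`. If a run produces no outliers, that expression undercounts by one. A small sketch that handles both cases (`summarize_labels` is an illustrative name):

```python
from collections import Counter

def summarize_labels(labels):
    """Return (number of clusters, number of outliers) for HDBSCAN-style labels."""
    counts = Counter(labels)
    # -1 marks noise; remove it from the tally whether or not it is present.
    n_outliers = counts.pop(-1, 0)
    return len(counts), n_outliers

print(summarize_labels([0, 0, 1, -1, 2, -1]))
print(summarize_labels([0, 0, 1]))
```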



## Semi Supervised

### Exploring Stopwords

#### Checking most common words

First we check the most common words without filtering for stopwords, then with the English stopwords filtered out.

In [12]:
vc = (
    df["text_clean"].apply(lambda x: (x.split(" ")))
    .explode()
    .value_counts()
    .reset_index()
)

vc.head(10)

Unnamed: 0,index,text_clean
0,,941166
1,to,8273
2,for,6103
3,the,5658
4,in,5346
5,of,3995
6,fix,3943
7,add,3846
8,from,3711
9,merge,3530
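
The empty string at the top of the table most likely comes from `split(" ")`, which emits an empty token for every leading, trailing or repeated space. The sketch below, on two made-up messages, contrasts it with `split()` without arguments, which splits on runs of whitespace and never yields empty tokens.

```python
from collections import Counter

messages = ["fix  typo in docs ", "add test for  docs"]

# split(" ") keeps empty strings around repeated or trailing spaces.
naive = Counter(word for m in messages for word in m.split(" "))
# split() collapses whitespace runs, so no empty tokens appear.
clean = Counter(word for m in messages for word in m.split())

print(naive[""], clean[""])
print(clean.most_common(1))
```

With `split()` the later `interests[1:]` step, which drops the first entry, would no longer be needed and would instead drop a real word.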


In [13]:
# stopwords has been imported from nltk
s_words = stopwords.words('english')

print(f"""
    {type(s_words)}
    {len(s_words)}
    {s_words[0:10]}
""")


    <class 'list'>
    179
    ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]



In [14]:
# vc where the index is not in the stopwords list
vc = vc[~vc["index"].isin(s_words)]

# getting the top 20 words
interests = list(vc.head(conf["interest_words"] + 1)["index"])

# removing the empty string (produced by splitting on repeated spaces) that becomes the first element in the list
interests = interests[1:]

print(interests)

['fix', 'add', 'merge', 'update', 'pull', 'request', 'op', 'python', 'docs', 'tensorflow', 'generated', 'xla', 'ops', 'change', 'support', 'use', 'remove', 'test', 'internal', 'make']


In [15]:
# threshold for deciding whether a commit message belongs to an interest or not
threshold = 0.2

if not conf["generate_interests"]:
    interests = interest_fixer("""
    fix add merge remove update pull request python docs tensorflow generated
    """)

print(f"using generated interests: {conf['generate_interests']}")

using generated interests: True


In [16]:
# custom stopwords: the nltk stopwords plus the interests found by value counting all the words in the commit messages
bonus_words = stopwords.words('english') + (interests)

# dumping for the plotting notebook
pickle.dump(bonus_words, open(names["bonus-words"], 'wb'))

In [17]:
## TODO make this use the get embeddings function
# getting results.
y, similarity = make_dataset(embs, targets=interests, model=model, target_threshold=threshold)
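
`make_dataset` is defined in `scripts` and not shown here. One plausible reading, and it is only an assumption, is that each commit embedding is compared to the embedding of every interest word and labelled with the closest one when the similarity reaches `threshold`. The sketch below illustrates that idea with cosine similarity on toy 2D vectors. It reserves label 0 for "no interest" so that `y - 1` would give `-1`, the value UMAP treats as unlabelled in semi-supervised mode. All names in it are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def label_by_similarity(embeddings, target_embeddings, threshold):
    """Label each embedding 1..k by its closest target, or 0 if below threshold."""
    labels, best_scores = [], []
    for emb in embeddings:
        scores = [cosine(emb, t) for t in target_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(best + 1 if scores[best] >= threshold else 0)
        best_scores.append(scores[best])
    return labels, best_scores

targets = [[1.0, 0.0], [0.0, 1.0]]                  # toy "interest" vectors
commits = [[0.9, 0.1], [0.1, 0.8], [-1.0, -1.0]]    # toy "commit" vectors
labels, scores = label_by_similarity(commits, targets, threshold=0.2)
print(labels)
```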

In [18]:
fetch_from_pickle = True

try:
  if conf["fresh_data"]: raise Exception

  uemb_semi_s = pickle.load( open( names[f"uembs-s-{device}"], "rb" ) )
  
except:
  # used to have just nn = 100 at 0.2 similarity, without the metric and target weight
  uemb_semi_s = UMAP(n_neighbors=100, metric="cosine", target_weight=1).fit_transform(embs, y-1)
  pickle.dump( uemb_semi_s, open( names[f"uembs-s-{device}"], "wb" ) )

  conf["fresh_s_uembs"] = True

print(f"fresh semi supervised uembs: {conf['fresh_s_uembs']}")

fresh semi supervised uembs: False


In [19]:
cluster_semi_s_hdb = HDBSCAN(min_cluster_size=100, min_samples=20, metric='euclidean', cluster_selection_method='eom').fit(uemb_semi_s)

## Cluster & Topic Inspection

In [20]:
result_2d = result_df_maker(uembs, clusters_2d.labels_, df["text_clean"].to_numpy(), bonus_words=interests)

vcr = result_2d[["cluster_label", "topics"]].groupby(["cluster_label", "topics"])["topics"].count().reset_index(name="commit_count").sort_values(by="commit_count", ascending=False).head(20)

vcr

Unnamed: 0,cluster_label,topics,commit_count
0,-1,to for the in from,15409
20,19,to the in for tf,1427
2,1,true db after to no,1253
42,41,to for in the tfdata,1157
30,29,to graph the in for,817
17,16,from patch gunan jhseu terrytangyuan,720
24,23,to tensor in tensors for,565
19,18,gpu for to cuda the,559
3,2,for changes commit from of,545
1,0,related files pbtxt opspbtxt to,495


In [21]:
result_2d_semi = result_df_maker(uemb_semi_s, cluster_semi_s_hdb.labels_, df["text_clean"].to_numpy(), bonus_words=interests)

vcr = result_2d_semi[["cluster_label", "topics"]].groupby(["cluster_label", "topics"])["topics"].count().reset_index(name="commit_count").sort_values(by="commit_count", ascending=False).head(20)

vcr

Unnamed: 0,cluster_label,topics,commit_count
0,-1,to the for in of,7858
14,13,from master patch gunan vrv,1817
31,30,to in for the of,1793
32,31,tests for to in the,1578
16,15,to tf in for the,1380
46,45,to for in the of,1345
2,1,generation doc module to strings,1260
55,54,to the in for of,1230
29,28,typo in minor typos documentation,718
22,21,to readmemd release for version,693


In [22]:
# finding most common words in 20 most common topics to see if we need more stopwords
vcr["topics"].apply(lambda x: x.split(" ")).explode().value_counts().head(20)

to                       16
for                      15
in                       11
the                      11
of                        6
from                      2
merging                   1
unused                    1
code                      1
duplicate                 1
deleted                   1
graph                     1
commit                    1
changes                   1
pbtxt                     1
related                   1
files                     1
python3                   1
op_gen_overridespbtxt     1
build                     1
Name: topics, dtype: int64

# Making Result DF

We could fit a regression on the growth of each interest and pick the one that is most likely at a given point in time.

In [23]:
dfres = df[["text_clean", "time_sec", "time_week", "user"]].copy()

#dfres["time_week"] = dfres["time_sec"].apply(lambda x: datetime.fromtimestamp(x).isocalendar()[1])

#dfres["time_week"] = dfres["time_sec"].apply(lambda x: x//604800)

dfres["x"] = uemb_semi_s[:, 0]

dfres["y"] = uemb_semi_s[:, 1]

dfres["cluster"] = cluster_semi_s_hdb.labels_



dfres["interest"] = list(y)

# -1 to make up for adding 1 earlier
dfres["interest_word"] = dfres["interest"].apply(lambda x: interests[x-1])

# find topic by interest instead
topic_dict = topic_by_clusterId(dfres["text_clean"].to_numpy(), dfres["interest"].to_numpy(), bonus_words=bonus_words)

dfres["topic"] = dfres["interest"].apply(lambda x: " ".join(list(topic_dict[x])))

dfres = dfres[dfres["cluster"] != -1]

# Pickling the dfres for plotting in other notebook
dfres.to_pickle(names["dfres"])

dfres

Unnamed: 0,text_clean,time_sec,time_week,user,x,y,cluster,interest,interest_word,topic
0,merge pull request 18846 from yongtang 04252018 floordiv int8,1524782988,2521,ekelsen,13.684902,8.124016,13,3,merge,changes commit master github branch
1,merge pull request 18881 from manhyuk fix_typo,1524781964,2521,ekelsen,13.341024,7.941803,13,3,merge,changes commit master github branch
2,merge pull request 18907 from yongtang 18363 mpi,1524781835,2521,ekelsen,13.750832,8.121236,13,3,merge,changes commit master github branch
3,merge pull request 18896 from kikatech fix_lite_topk,1524781628,2521,ekelsen,13.789827,7.538333,13,3,merge,changes commit master github branch
4,fix cmake build issues with gpu on linux 18775,1524775671,2521,Yong Tang,-1.428590,11.992411,22,0,make,build added api function shape
...,...,...,...,...,...,...,...,...,...,...
32441,make license in setuppy match,1447301481,2393,Illia Polosukhin,-0.482800,9.924575,32,0,make,build added api function shape
32442,removed license headers,1447301260,2393,Illia Polosukhin,-0.684595,9.162777,39,17,remove,unused cleanup code build delete
32448,update description of various issue discussion forums by vanhoucke,1447206079,2392,Vijay Vasudevan,-1.855338,8.827711,41,15,support,added dependencies function new unary
32451,tensorflow initial steps towards python3 support some documentation,1447092667,2392,Vijay Vasudevan,-3.275972,13.399507,30,10,tensorflow,functions go wrapper tensorboard tensor
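
In the table above, rows with `interest` 0 get the `interest_word` `make`, which is the last entry in `interests`. That is what `interests[x-1]` returns for `x = 0`, because index `-1` wraps around to the end of the list. Whether 0 is a real label or a marker for "no interest above the threshold" depends on `make_dataset`, which is not shown. If it is the latter, a guard like the following sketch would avoid mislabelling those rows (`interest_word` and the shortened list are illustrative):

```python
interests = ["fix", "add", "merge", "make"]  # shortened, illustrative list

def interest_word(label, interests, unlabeled="(none)"):
    """Map a 1-based label to its word; treat 0 as 'no interest' (an assumption)."""
    return interests[label - 1] if label > 0 else unlabeled

print(interests[0 - 1])             # plain indexing wraps around: 'make'
print(interest_word(0, interests))  # guarded lookup: '(none)'
print(interest_word(3, interests))  # 'merge'
```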


# Exploration

## Collaborative filtering

### Data Restructuring

In [35]:
userdf = pd.DataFrame({
    "user": list(df["user"]),
    "time_week" : list(df["time_week"]),
    "interests" : list(y),
    "cluster_id" : list(cluster_semi_s_hdb.labels_)
    })

userdf["interests_word"] = userdf["interests"].apply(lambda x: interests[x-1])

userdf = userdf.groupby(["user", "time_week"]).agg(list).reset_index()

len(userdf)

7588

In [40]:
userdf_dict = userdf[["user","time_week","interests_word"]].copy()

userdf_dict["interest_dict"] = userdf["interests_word"].apply(lambda x: dict_per_user(x, interests))

userdf_dict.drop(columns=["interests_word"], inplace=True)

print(len(userdf_dict))

userdf_dict.head(3)

7588


Unnamed: 0,user,time_week,interest_dict
0,4F2E4A2E,2465,"{'fix': 0, 'add': 0, 'merge': 0, 'update': 1, 'pull': 0, 'request': 0, 'op': 0, 'python': 0, 'docs': 0, 'tensorflow': 0, 'generated': 0, 'xla': 0, 'ops': 0, 'change': 0, 'support': 0, 'use': 0, 'r..."
1,4F2E4A2E,2469,"{'fix': 0, 'add': 0, 'merge': 0, 'update': 0, 'pull': 0, 'request': 0, 'op': 0, 'python': 0, 'docs': 0, 'tensorflow': 0, 'generated': 0, 'xla': 0, 'ops': 0, 'change': 0, 'support': 0, 'use': 0, 'r..."
2,4F2E4A2E,2472,"{'fix': 1, 'add': 0, 'merge': 0, 'update': 0, 'pull': 0, 'request': 0, 'op': 0, 'python': 0, 'docs': 0, 'tensorflow': 0, 'generated': 0, 'xla': 0, 'ops': 0, 'change': 0, 'support': 0, 'use': 0, 'r..."
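
`dict_per_user` also comes from `scripts`. From the output above it appears to count, for one user and week, how many commits fall under each interest word, with a zero for every interest that did not occur. A minimal sketch under that assumption (`dict_per_user_sketch` is a hypothetical name):

```python
from collections import Counter

def dict_per_user_sketch(words, interests):
    """Count occurrences of each interest word, keeping zeros for absent ones."""
    counts = Counter(words)
    return {interest: counts[interest] for interest in interests}

interests = ["fix", "add", "merge", "update"]
print(dict_per_user_sketch(["update", "fix", "update"], interests))
```

Keeping the keys in the order of `interests` means every row yields the same columns when the dictionaries are turned into a DataFrame.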


In [44]:
userdf_ex = pd.DataFrame(list(userdf["interests_word"].apply(lambda x: dict_per_user(x, interests))))


## TODO
# - remove the bot account
# - use mean normalization
userdf_ex_norm = (userdf_ex-userdf_ex.min())/(userdf_ex.max()-userdf_ex.min())

np_matrix = userdf_ex_norm.to_numpy()


userdf_ex_norm.insert(0, "user", list(userdf["user"]))
userdf_ex_norm.insert(1, "time_week", list(userdf["time_week"]))

userdf_ex_norm

## Rating is how close they are to the center of the interest

## TODO try to actually implement the collaborative filtering functionality

# - one line per interest per group
# - regression per interest to predict the next 20 points

Unnamed: 0,user,time_week,fix,add,merge,update,pull,request,op,python,...,generated,xla,ops,change,support,use,remove,test,internal,make
0,4F2E4A2E,2465,0.000000,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
1,4F2E4A2E,2469,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016129
2,4F2E4A2E,2472,0.090909,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
3,4d55397500,2481,0.000000,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
4,4d55397500,2507,0.000000,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7583,郭同jet · 耐心,2441,0.090909,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
7584,郭同jet · 耐心,2442,0.090909,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
7585,郭同jet · 耐心,2444,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
7586,黄璞,2476,0.090909,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
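
The scaling used above is min-max normalization per column, and the TODO mentions mean normalization as an alternative. Both are sketched below for a single column of counts. The values are made up, although a maximum of 11 would be consistent with the 0.090909 (1/11) entries in the `fix` column. Note that a column where every value is equal makes the denominator zero, which gives NaN in pandas; the sketch returns zeros in that case, which is a design choice and not the notebook's behavior.

```python
def min_max(column):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def mean_normalize(column):
    """Center on the mean and divide by the range; constant column maps to zeros."""
    lo, hi = min(column), max(column)
    mean = sum(column) / len(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - mean) / (hi - lo) for v in column]

fix_counts = [0, 1, 11, 0]  # toy counts for one interest across user-weeks
print(min_max(fix_counts))
print(mean_normalize(fix_counts))
```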


In [68]:
colab_clusters = HDBSCAN(min_cluster_size=40, min_samples=10, metric='euclidean', cluster_selection_method='eom').fit(np_matrix)

print(f"""
    Full dimensionality clustering output:
    Len of colab clusters: {len(colab_clusters.labels_)}
    Number of clusters: {len(set(colab_clusters.labels_)) - 1}
    Number of rows as outliers: {colab_clusters.labels_.tolist().count(-1)}
""")



    Full dimensionality clustering output:
    Len of colab clusters: 7588
    Number of clusters: 30
    Number of rows as outliers: 1338



In [49]:
colab_umap = UMAP(n_neighbors=100, metric="cosine").fit_transform(np_matrix)

In [69]:
colab_resdf = pd.DataFrame({
    "x" : colab_umap[:, 0], 
    "y" : colab_umap[:, 1], 
    "cluster" : colab_clusters.labels_
})

colab_resdf = colab_resdf[colab_resdf["cluster"] != -1]

fig_colab = px.scatter(colab_resdf, x="x", y="y", color="cluster", title="Colab clustering")

fig_colab.show()

# TODO add one visualisation without time grouping
# this would give us the "true" user groups, and then we could see if they moved around without breaking up the group too much

In [26]:
## TODO
# - now we have to do this per user.
# - we need to look at what a given user is "committing" about interest wise, and then see which cluster that user is in
# - then when we    title="Timeline of commits by interest",                                                         

## Exploring water simulation based prediction potential

In [27]:

print(f"""
    Bounds of the uembs
    
    x axis:
    {min(uembs[:,0])}
    {max(uembs[:,0])}
    
    "y axis"
    {min(uembs[:,1])}
    {max(uembs[:,1])}
""")


    Bounds of the uembs
    
    x axis:
    -15.239805221557617
    18.667102813720703
    
    "y axis"
    -16.358308792114258
    23.227251052856445



We can set the frame for the water prediction to ±25 on both axes.

512 x 512*2 pixels in that space

Generate the next frame of the animation

Provide two frames of the past as input:
- can provide one frame per week per user
- can use one color per user group
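
To turn 2D embeddings into frames for such a model, the points would have to be rasterized onto a pixel grid covering the ±25 frame. The sketch below shows one way to do that with a simple count per pixel. The function name, the square grid and the counting scheme are assumptions for illustration; the example uses an 8 x 8 grid instead of 512 to keep it readable.

```python
def rasterize(points, bound=25.0, size=512):
    """Count points per pixel on a size x size grid covering [-bound, bound)."""
    grid = [[0] * size for _ in range(size)]
    scale = size / (2 * bound)  # pixels per embedding unit
    for x, y in points:
        # Skip anything that falls outside the frame.
        if not (-bound <= x < bound and -bound <= y < bound):
            continue
        col = int((x + bound) * scale)
        row = int((y + bound) * scale)
        grid[row][col] += 1
    return grid

# Three points inside the frame (two share a pixel) and one outside it.
points = [(13.68, 8.12), (13.34, 7.94), (-1.43, 11.99), (30.0, 0.0)]
frame = rasterize(points, size=8)
print(sum(map(sum, frame)))
```

One frame per week (or per user and week) could then be produced by calling this on the rows of that week, with a separate channel or color per user group.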