# Abbreviations:
- `df` -> dataframe
- `Series` -> column
- `embs` -> embeddings (numerical representation of words in a vector space, from our NLP AI model)
- `uembs` -> umap embeddings (dimensionality reduction of embs, to make it easier to visualize, 768 -> 2)

# Imports

In [1]:
from scripts import *


Local stopwords:        True
GPUs detected:          1
Using GPU:              True
Device:                 cuda
Got model from pickle:  True



# Data

## Getting Data

In [56]:
df = pd.read_csv("data/tensorflow.csv", on_bad_lines='skip')

df = df[["author_name", "time_sec", "subject"]]


# df = df.rename(columns={
#     "author_name": "user", 
#     "time_sec": "time_sec", 
#     "subject": "text"
# })

# if your computer does not have GPU support, use a 5% sample of the dataset so that it runs in a reasonable time
if device == "cpu": df = df.sample(frac=0.05)

df

Unnamed: 0,author_name,time_sec,subject
0,ekelsen,1524782988,Merge-pull-request-18846-from-yongtang-04252018-FloorDiv-int8
1,ekelsen,1524781964,Merge-pull-request-18881-from-ManHyuk-fix_typo
2,ekelsen,1524781835,Merge-pull-request-18907-from-yongtang-18363-mpi
3,ekelsen,1524781628,Merge-pull-request-18896-from-KikaTech-fix_lite_topk
4,Yong Tang,1524775671,Fix-cmake-build-issues-with-GPU-on-Linux-18775
...,...,...,...
32464,Vijay Vasudevan,1446933504,TensorFlow-Upstream-commits-to-git
32465,Vijay Vasudevan,1446922181,TensorFlow-Upstream-latest-commits-to-git
32466,Vijay Vasudevan,1446875858,TensorFlow-Upstream-changes-to-git
32467,Manjunath Kudlur,1446863831,TensorFlow-Upstream-latest-changes-to-Git


The commits are sorted by time (newest first), so they are ready to be sliced by time.

In [3]:
df.isna().sum()

commit               0
author_name          0
time_sec             0
subject              0
files_changed     3324
lines_inserted    4294
lines_deleted     7857
dtype: int64

We intend to use the `subject` field as a temporary substitute for `categoryRaw`, which we are waiting to get from the Schibsted data.

## Cleaning Data

In [4]:
df["subject_clean"] = df["subject"].apply(string_cleaner)

df[["subject", "subject_clean"]].head(5)

Unnamed: 0,subject,subject_clean
0,Merge-pull-request-18846-from-yongtang-04252018-FloorDiv-int8,merge pull request 18846 from yongtang 04252018 floordiv int8
1,Merge-pull-request-18881-from-ManHyuk-fix_typo,merge pull request 18881 from manhyuk fix_typo
2,Merge-pull-request-18907-from-yongtang-18363-mpi,merge pull request 18907 from yongtang 18363 mpi
3,Merge-pull-request-18896-from-KikaTech-fix_lite_topk,merge pull request 18896 from kikatech fix_lite_topk
4,Fix-cmake-build-issues-with-GPU-on-Linux-18775,fix cmake build issues with gpu on linux 18775


## Engineering Data

### Making a new column for the week

In [35]:
# 604800 seconds = 7 days, so this is the number of whole weeks since the Unix epoch
df["time_week"] = df["time_sec"] // 604800

df.to_pickle("data/df.pkl")
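
The divisor 604800 is the number of seconds in a week (7 * 24 * 60 * 60), so `time_week` counts whole weeks since the Unix epoch. The sketch below uses only the standard library to show what a given week index corresponds to. The helper names `week_index` and `week_start` are illustrative and not part of `scripts`. Because 1970-01-01 was a Thursday, each bucket starts on a Thursday at 00:00 UTC rather than on a Monday.

```python
from datetime import datetime, timezone

SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800


def week_index(time_sec: int) -> int:
    """Whole weeks elapsed since the Unix epoch (1970-01-01 00:00 UTC)."""
    return time_sec // SECONDS_PER_WEEK


def week_start(week: int) -> datetime:
    """UTC timestamp at which the given epoch week begins."""
    return datetime.fromtimestamp(week * SECONDS_PER_WEEK, tz=timezone.utc)


# time_sec of the first row in the dataframe above
w = week_index(1524782988)
start = week_start(w)
print(w, start.date(), start.strftime("%A"))
```

If calendar weeks (Monday to Sunday) are needed later, the bucket boundaries would have to be shifted; for slicing commits into equal-length windows the epoch-based index is sufficient.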

### Grouping on users

Build a dictionary by grouping all the commit messages each user has written.
Later we can set groups on userId, so that we can make animation frames of how the groups move.

In [49]:
# dfu - ABBR: Data Frame User grouped
dfu = df[["subject_clean", "time_sec", "time_week", "author_name"]].copy()

dfu = dfu.groupby(["author_name", "time_week"]).agg(list).reset_index()

dfu["subject_clean_join"] = dfu["subject_clean"].apply(lambda x: " ".join(x))

dfu.head(3)

Unnamed: 0,author_name,time_week,subject_clean,time_sec,subject_clean_join
0,4F2E4A2E,2465,[update get_startedmd 8924 ],[1491251028],update get_startedmd 8924
1,4F2E4A2E,2469,[fixing anaconda install command 9277 ],[1493310176],fixing anaconda install command 9277
2,4F2E4A2E,2472,[fixing 404 url 10052 ],[1495235056],fixing 404 url 10052


In [9]:
weeks_per_user = dfu["author_name"].value_counts().reset_index()

print(len(weeks_per_user))

weeks_per_user[weeks_per_user["author_name"] > 10]

1644


Unnamed: 0,index,author_name
0,A. Unique TensorFlower,125
1,Benoit Steiner,118
2,Shanqing Cai,114
3,Eugene Brevdo,110
4,Derek Murray,106
...,...,...
138,guschmue,11
139,David Z. Chen,11
140,Shivani Agrawal,11
141,Siddharth Agrawal,11


# Machine Learning

## Unsupervised ML

### NLP Embeddings

Getting the 768-dimensional embedding for each commit message.

In [10]:
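try:
    # force recomputation when fresh data is requested
    if conf["fresh_data"]: raise Exception
    with open(f'data/embs-{device}.pkl', 'rb') as f:
        embs = pickle.load(f)

except Exception:
    # cache miss or forced refresh: compute the embeddings and store them
    embs = sbert_emb_getter(df["subject_clean"].to_numpy(), filename=f"model-{device}")
    with open(f'data/embs-{device}.pkl', 'wb') as f:
        pickle.dump(embs, f)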

    conf["fresh_embs"] = True

print(f"fresh embs: {conf['fresh_embs']}")

fresh embs: False
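
The same load-or-recompute pattern is used for the embeddings, the UMAP output and the semi-supervised UMAP output. A minimal sketch of how it could be factored into one helper, using only the standard library. The name `load_or_compute` and its signature are hypothetical and do not exist in `scripts`; the lambda stands in for an expensive call such as `sbert_emb_getter`.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute, fresh=False):
    """Return (value, was_fresh).

    Loads the pickled value from `path` unless `fresh` is set or the file
    is missing; otherwise calls `compute()` and stores the result.
    """
    if not fresh and os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f), False

    value = compute()
    with open(path, "wb") as f:
        pickle.dump(value, f)
    return value, True


# usage with a stand-in for the expensive embedding call
cache_path = os.path.join(tempfile.mkdtemp(), "embs-demo.pkl")
first, fresh_first = load_or_compute(cache_path, lambda: [[0.1, 0.2], [0.3, 0.4]])
second, fresh_second = load_or_compute(cache_path, lambda: [])
print(fresh_first, fresh_second)
```

Checking for the file explicitly, instead of raising a bare `Exception` to jump into the `except` branch, also keeps real errors (for example a corrupt pickle) visible.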


### Dimensionality Reduction

We use UMAP to reduce the dimensionality of the embeddings from 768 to 2, so that we can visualize them

In [11]:
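try:
    # force recomputation when fresh data is requested
    if conf["fresh_data"]: raise Exception
    with open(f'data/umap-{device}.pkl', 'rb') as f:
        uembs = pickle.load(f)

except Exception:
    # cache miss or forced refresh: fit UMAP on the embeddings and store the 2D result
    uembs = UMAP(n_neighbors=20, min_dist=0.1).fit_transform(embs)
    with open(f'data/umap-{device}.pkl', 'wb') as f:
        pickle.dump(uembs, f)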

    conf["fresh_uembs"] = True

print(f"fresh uembs: {conf['fresh_uembs']}")

fresh uembs: False


In [12]:
# TODO make this plot just a trace, to fit in gridplot

fig = px.scatter(x=uembs[:,0], y=uembs[:,1])

fig.update_layout(width=800, height=800)
fig.update_traces(marker=dict(size=2))

# plotting to show how the embeddings are when just dimensionality reduction is used
fig_show_save(fig, "umap-scatter", show=conf["show_figs"])

In [14]:
clusters_2d = HDBSCAN(min_cluster_size=100, cluster_selection_method="leaf").fit(uembs)


print(f"""
    2D
    Number of clusters: {len(set(clusters_2d.labels_)) - 1}
    Number of rows as outliers: {clusters_2d.labels_.tolist().count(-1)}
""")


    2D
    Number of clusters: 55
    Number of rows as outliers: 15049
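
HDBSCAN marks rows it cannot assign to a cluster with the label `-1`, which is why the cell above subtracts 1 from the number of distinct labels. That subtraction undercounts by one when a run happens to produce no outliers. A minimal sketch of a count that handles both cases; `summarize_labels` is an illustrative name and the label lists are toy data.

```python
from collections import Counter


def summarize_labels(labels):
    """Return (number of clusters, number of outlier rows) for HDBSCAN-style labels."""
    counts = Counter(int(label) for label in labels)
    n_outliers = counts.pop(-1, 0)  # -1 is the noise label; absent means no outliers
    return len(counts), n_outliers


labels_with_noise = [0, 0, 1, -1, 2, -1]
labels_without_noise = [0, 0, 1, 2]

print(summarize_labels(labels_with_noise))
print(summarize_labels(labels_without_noise))
# the shortcut used above gives 2 here although there are 3 clusters
print(len(set(labels_without_noise)) - 1)
```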



## Semi Supervised

### Exploring Stopwords

#### Checking most common words

First we count the words without filtering out stopwords, then we repeat the count with the English stopwords removed.

In [15]:
vc = (
    df["subject_clean"].apply(lambda x: (x.split(" ")))
    .explode()
    .value_counts()
    .reset_index()
)

vc.head(10)

Unnamed: 0,index,subject_clean
0,,941166
1,to,8273
2,for,6103
3,the,5658
4,in,5346
5,of,3995
6,fix,3943
7,add,3846
8,from,3711
9,merge,3530
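
The most frequent "word" in the table above is the empty string. This happens because `str.split(" ")` emits an empty token for every leading, trailing or repeated space, and the cleaned messages appear to contain such spaces (see the trailing space in the `subject_clean` lists earlier). A small sketch on made-up messages showing the difference to `str.split()` without an argument:

```python
from collections import Counter

# toy messages with a double space and a trailing space, not taken from the dataset
messages = ["fix  typo in docs ", "add docs  for op "]

# splitting on a single space keeps empty tokens
with_empty = Counter(tok for m in messages for tok in m.split(" "))

# splitting on any whitespace run drops them
without_empty = Counter(tok for m in messages for tok in m.split())

print(with_empty[""], without_empty[""], without_empty["docs"])
```

Using `x.split()` in the `apply` above would make the later `interests[1:]` step unnecessary, at the cost of changing the counts shown in this notebook.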


In [16]:
# stopwords has been imported from nltk
s_words = stopwords.words('english')

print(f"""
    {type(s_words)}
    {len(s_words)}
    {s_words[0:10]}
""")


    <class 'list'>
    179
    ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]



In [17]:
# vc where the index is not in the stopwords list
vc = vc[~vc["index"].isin(s_words)]

# getting the top 20 words
interests = list(vc.head(conf["interest_words"] + 1)["index"])

# removing the empty string that becomes the first element in the list
interests = interests[1:]

print(interests)

['fix', 'add', 'merge', 'update', 'pull', 'request', 'op', 'python', 'docs', 'tensorflow', 'generated', 'xla', 'ops', 'change', 'support', 'use', 'remove', 'test', 'internal', 'make']


In [19]:
# similarity threshold for deciding whether a commit message belongs to an interest or not
threshold = 0.2

if not conf["generate_interests"]:
    interests = interest_fixer("""
    fix add merge remove update pull request python docs tensorflow generated
    """)

print(f"using generated interests: {conf['generate_interests']}")

using generated interests: True


In [20]:
# custom stopwords: the nltk stopwords plus the interests found by value counting all the words in the commit messages
bonus_words = stopwords.words('english') + (interests)

# dumping for the plotting notebook
pickle.dump(bonus_words, open(f'data/bonus-words.pkl', 'wb'))

In [21]:
## TODO make this use the get embeddings function
# getting results.
y, similarity = make_dataset(embs, targets=interests, model=model, target_threshold=threshold)
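
`make_dataset` lives in `scripts` and its body is not shown here, so the following is only a guess at the kind of labelling it performs, not its actual implementation. The sketch assigns each row the interest whose embedding has the highest cosine similarity, and falls back to label 0 when that similarity is below the threshold. The function name `label_by_similarity`, the 1-based labels and the toy vectors are all assumptions; the 1-based convention is chosen because the next cell passes `y-1` to UMAP, and umap-learn treats a target of `-1` as unlabelled.

```python
import numpy as np


def label_by_similarity(embs, target_embs, threshold):
    """Hypothetical stand-in for make_dataset.

    Returns (y, best_sim) where y is 0 for "no interest" and otherwise the
    1-based index of the most similar target.
    """
    # normalise rows so that the dot product equals cosine similarity
    embs_n = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    targets_n = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)

    sim = embs_n @ targets_n.T  # shape: (n_rows, n_targets)
    best = sim.argmax(axis=1)
    best_sim = sim.max(axis=1)

    y = np.where(best_sim >= threshold, best + 1, 0)
    return y, best_sim


# toy 2D "embeddings": two rows close to one target each, one far from both
embs_demo = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
targets_demo = np.array([[1.0, 0.1], [0.1, 1.0]])

y_demo, sim_demo = label_by_similarity(embs_demo, targets_demo, threshold=0.2)
print(y_demo, y_demo - 1)
```

The sketch assumes no all-zero embedding rows, since those would cause a division by zero during normalisation.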

In [22]:
fetch_from_pickle = True

try:
  # force recomputation when fresh data is requested
  if conf["fresh_data"]: raise Exception

  with open(f"data/emb-semi-s-fast-{device}.pickle", "rb") as f:
    uemb_semi_s = pickle.load(f)

except Exception:
  # previously used just n_neighbors=100 at 0.2 similarity, without the metric and target weight
  uemb_semi_s = UMAP(n_neighbors=100, metric="cosine", target_weight=1).fit_transform(embs, y-1)
  with open(f"data/emb-semi-s-fast-{device}.pickle", "wb") as f:
    pickle.dump(uemb_semi_s, f)

  conf["fresh_s_uembs"] = True

print(f"fresh semi supervised uembs: {conf['fresh_s_uembs']}")

fresh semi supervised uembs: False


In [23]:
cluster_semi_s_hdb = HDBSCAN(min_cluster_size=100, min_samples=20, metric='euclidean', cluster_selection_method='eom').fit(uemb_semi_s)

## Cluster & Topic Inspection

In [24]:
result_2d = result_df_maker(uembs, clusters_2d.labels_, df["subject_clean"].to_numpy(), bonus_words=interests)

vcr = result_2d[["cluster_label", "topics"]].groupby(["cluster_label", "topics"])["topics"].count().reset_index(name="commit_count").sort_values(by="commit_count", ascending=False).head(20)

vcr

Unnamed: 0,cluster_label,topics,commit_count
0,-1,to for the in from,15049
16,15,to the in for tf,1544
39,38,to for in the and,1300
1,0,to true the after db,1256
48,47,to the in for variable,843
45,44,to the for tests in,818
27,26,to graph the in for,770
22,21,to tensor in for tensors,631
6,5,for changes commit from the,546
18,17,from patch gunan case540 androbin,544


In [25]:
result_2d_semi = result_df_maker(uemb_semi_s, cluster_semi_s_hdb.labels_, df["subject_clean"].to_numpy(), bonus_words=interests)

vcr = result_2d_semi[["cluster_label", "topics"]].groupby(["cluster_label", "topics"])["topics"].count().reset_index(name="commit_count").sort_values(by="commit_count", ascending=False).head(20)

vcr

Unnamed: 0,cluster_label,topics,commit_count
0,-1,to for the in of,8512
28,27,to in for the of,1861
32,31,to for in the of,1672
29,28,tests for to in the,1584
14,13,to tf in the for,1394
2,1,generation doc module to strings,1263
20,19,from patch gunan benoitsteiner terrytangyuan,1210
42,41,the to in for of,1031
48,47,to in the for and,765
50,49,to for of the in,733


In [26]:
# finding most common words in 20 most common topics to see if we need more stopwords
vcr["topics"].apply(lambda x: x.split(" ")).explode().value_counts().head(20)

to                       16
for                      15
the                      13
in                       13
of                        7
from                      2
tensorboard               1
nodes                     1
op_gen_overridespbtxt     1
pbtxt                     1
files                     1
typo                      1
related                   1
merging                   1
changes                   1
commit                    1
pylint                    1
python3                   1
minor                     1
typos                     1
Name: topics, dtype: int64

# Making Result DF

We could fit a regression of the growth per interest, and pick the interest that is most likely at a given point in time.

In [29]:
dfres = df[["subject_clean", "time_sec", "time_week", "author_name"]].copy()

#dfres["time_week"] = dfres["time_sec"].apply(lambda x: datetime.fromtimestamp(x).isocalendar()[1])

#dfres["time_week"] = dfres["time_sec"].apply(lambda x: x//604800)

dfres["x"] = uemb_semi_s[:, 0]

dfres["y"] = uemb_semi_s[:, 1]

dfres["cluster"] = cluster_semi_s_hdb.labels_



dfres["interest"] = list(y)

# -1 to make up for adding 1 earlier
# NOTE: interest 0 becomes index -1, which Python resolves to the last word in `interests`
dfres["interest_word"] = dfres["interest"].apply(lambda x: interests[x-1])

# find topic by interest instead
topic_dict = topic_by_clusterId(dfres["subject_clean"].to_numpy(), dfres["interest"].to_numpy(), bonus_words=bonus_words)

dfres["topic"] = dfres["interest"].apply(lambda x: " ".join(list(topic_dict[x])))

dfres = dfres[dfres["cluster"] != -1]

# Pickling the dfres for plotting in other notebook
dfres.to_pickle("data/dfres.pkl")

dfres

Unnamed: 0,subject_clean,time_sec,time_week,author_name,x,y,cluster,interest,interest_word,topic
0,merge pull request 18846 from yongtang 04252018 floordiv int8,1524782988,2521,ekelsen,-3.377634,14.547507,18,3,merge,changes commit master github branch
1,merge pull request 18881 from manhyuk fix_typo,1524781964,2521,ekelsen,-3.186249,14.280680,18,3,merge,changes commit master github branch
2,merge pull request 18907 from yongtang 18363 mpi,1524781835,2521,ekelsen,-3.405393,14.624919,18,3,merge,changes commit master github branch
3,merge pull request 18896 from kikatech fix_lite_topk,1524781628,2521,ekelsen,-3.835373,14.089850,19,3,merge,changes commit master github branch
5,merge pull request 17602 from joeyearsley patch 1,1524775556,2521,Martin Wicke,-3.985922,14.056509,19,3,merge,changes commit master github branch
...,...,...,...,...,...,...,...,...,...,...
32440,tensorflow minor updates to docs build gpu config perf etc,1447356420,2393,Vijay Vasudevan,7.198559,-0.834650,27,10,tensorflow,functions go wrapper tensorboard tensor
32444,initial commit,1447300802,2393,Illia Polosukhin,7.018562,2.891060,42,2,add,added missing adding adds example
32447,add colah s fixed version of scalar equation to docs,1447209048,2392,Vijay Vasudevan,9.296671,1.881195,49,0,make,build added api function shape
32451,tensorflow initial steps towards python3 support some documentation,1447092667,2392,Vijay Vasudevan,7.306913,-0.681716,27,10,tensorflow,functions go wrapper tensorboard tensor


# Exploration

## Collaborative filtering

### Data Restructuring

In [None]:
# making dict based on interests and giving every key the value of zero
commit_dict = dict.fromkeys(interests, 0)

# function that takes a list of interests words, and returns a dict 
# with the interests as keys and the number of times the interest appears in the list as values
def dict_per_user(interest_word_list):
    tempdict = commit_dict.copy()

    for ele in interest_word_list:
        tempdict[ele] += 1

    return tempdict
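
The same per-user count can be written with `collections.Counter`. This is a sketch of an alternative, not a replacement that the rest of the notebook depends on; `dict_per_user_counter` and the toy word lists are illustrative. Unlike the loop above, it ignores words that are not in the interest list instead of raising a `KeyError`.

```python
from collections import Counter


def dict_per_user_counter(interest_word_list, interests):
    """Count how often each interest appears, keeping the order of `interests`."""
    counts = Counter(interest_word_list)
    return {word: counts.get(word, 0) for word in interests}


demo_interests = ["fix", "add", "merge"]
demo = dict_per_user_counter(["fix", "merge", "fix", "typo"], demo_interests)
print(demo)
```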

In [33]:
# making new df grouped by user name
## TODO fetch the data from the actual df
userdf = dfres[["author_name", "interest_word"]].groupby("author_name").agg(list).reset_index()

userdf_ex = pd.DataFrame(list(userdf["interest_word"].apply(dict_per_user)))

## TODO
# - Yeet the bot
# - use mean normalization
userdf_ex_norm = (userdf_ex - userdf_ex.min()) / (userdf_ex.max() - userdf_ex.min())

userdf_ex_norm["author_name"] = userdf["author_name"].copy()

userdf_ex_norm

## Rating is how close they are to the center of the interest

## TODO try to actually implement the collaborative filtering functionality

Unnamed: 0,fix,add,merge,update,pull,request,op,python,docs,tensorflow,...,xla,ops,change,support,use,remove,test,internal,make,author_name
0,0.003745,0.0,0.000000,0.001681,0.0,0.0,0.0,0.000000,0.000000,0.00000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4F2E4A2E
1,0.003745,0.0,0.000000,0.003361,0.0,0.0,0.0,0.000000,0.000000,0.00073,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4d55397500
2,0.000000,0.0,0.002703,0.003361,0.0,0.0,0.0,0.000000,0.014815,0.00219,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,A. Rosenberg Johansen
3,1.000000,1.0,0.243243,1.000000,1.0,1.0,1.0,1.000000,1.000000,1.00000,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,A. Unique TensorFlower
4,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000752,0.000000,0.00000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,ADiegoCAlonso
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1281,0.003745,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.00000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Łukasz Bieniasz-Krzywiec
1282,0.018727,0.0,0.000000,0.006723,0.0,0.0,0.0,0.000000,0.000000,0.00073,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,田传武
1283,0.003745,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.00000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,郑泽宇
1284,0.007491,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.00073,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,郭同jet · 耐心
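
In the min-max table above a single very active author reaches 1.0 in almost every column, which pushes everyone else towards 0. The TODO in the cell mentions removing the bot and using mean normalization. A minimal sketch of both steps on made-up counts: the row label `bot`, the user names and the numbers are placeholders, and which author counts as the bot in the real data is left open.

```python
import pandas as pd

# toy per-user interest counts, not taken from the dataset
counts = pd.DataFrame(
    {"fix": [1, 5, 267], "add": [0, 2, 300]},
    index=["user_a", "user_b", "bot"],
)

# drop the dominating account before normalising
humans = counts.drop(index="bot")

# mean normalization: centre each column on its mean, scale by its range
span = (humans.max() - humans.min()).replace(0, 1)  # avoid dividing by zero for constant columns
mean_norm = (humans - humans.mean()) / span

print(mean_norm)
```

After this step every column has mean 0, so positive values mean "more than the average user" for that interest.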


In [34]:
## TODO
# - now we have to do this per user.
# - we need to look at what a given user is "comitting" about interest wise, and then see which cluster that user is in
# - then when we

## Exploring water simulation based prediction potential

In [None]:

print(f"""
    Bounds of the uembs
    
    x axis:
    {min(uembs[:,0])}
    {max(uembs[:,0])}
    
    "y axis"
    {min(uembs[:,1])}
    {max(uembs[:,1])}
""")


    Bounds of the uembs
    
    x axis:
    -16.00608253479004
    18.984853744506836
    
    "y axis"
    -17.35825538635254
    22.694015502929688



We can set the frame for the water prediction to ±25 on both axes

512 x 512*2 pixels in that space

Generate the next frame in the animation

Give two frames of the past
- can give one frame per week per user
- can have one color per user group
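
One way to turn the 2D embeddings into image-like frames is to bin the points into a fixed grid covering the ±25 frame. The sketch below assumes a square 512 x 512 grid (the "512 x 512*2" in the note is ambiguous) and uses NumPy's `histogram2d`. The function name `rasterize` and the toy points are illustrative; in the notebook one frame would be built per week, and optionally per user or user group.

```python
import numpy as np


def rasterize(points, bound=25.0, resolution=512):
    """Count points per grid cell in the square [-bound, bound] x [-bound, bound]."""
    edges = np.linspace(-bound, bound, resolution + 1)
    frame, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=[edges, edges])
    return frame


# toy points near the bounds printed above, plus two identical points at the origin
points_demo = np.array([[-16.0, -17.4], [19.0, 22.7], [0.0, 0.0], [0.0, 0.0]])
frame = rasterize(points_demo)

print(frame.shape, frame.sum(), frame.max())
```

Points outside the ±25 frame are silently dropped by `histogram2d`, so the bound should stay larger than the printed extremes of the embeddings.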