# Abbreviations:

- `df` -> dataframe

- `Series` -> column

- `embs` -> embeddings (numerical representation of words in a vector space, from our NLP AI model)

- `uembs` -> umap embeddings (dimensionality reduction of embs, to make it easier to visualize, 768 -> 2)

- `target_interest` -> the 20 words we use as targets for grouping the commit messages. We can set these manually, and we use them as the targets for semi-supervision

- `group_topics` -> the topics we generate based on message cluster, less important for the ML pipeline, more for the human reader to understand what the clusters are about

# Imports

In [1]:
from scripts import *


Local stopwords:        True
GPUs detected:          1
Using GPU:              True
Device:                 cuda
Using cuML:             True
Got model from pickle:  True



# Data

## Getting Data

In [2]:
df = pd.read_csv("data/tensorflow-big.csv", on_bad_lines='skip')

df = df[["author_name", "time_sec", "subject"]]


df = df.rename(columns={
    "author_name": "user", 
    "time_sec": "time_sec", 
    "subject": "text"
})

# if your computer does not have GPU support, you can use a sample of the dataset instead to make it run in a reasonable time
if device == "cpu": df = df.sample(frac=0.05)

df

Unnamed: 0,user,time_sec,text
0,Ian Hua,1664120759,Fix windows kokoro tests with rollback
1,A. Unique TensorFlower,1664096696,Update GraphDef version to 1265.
2,A. Unique TensorFlower,1664096664,compat: Update forward compatibility horizon to 2022-09-25
3,A. Unique TensorFlower,1664038724,Update TFRT dependency to use revision http://github.com/tensorflow/runtime/commit/b28814ce0a18fea92883fbd8901397fe3a1ffbf6.
4,A. Unique TensorFlower,1664036871,Integrate LLVM at llvm/llvm-project@94896994386d
...,...,...,...
126545,Vijay Vasudevan,1446933504,TensorFlow: Upstream commits to git.
126546,Vijay Vasudevan,1446922181,TensorFlow: Upstream latest commits to git.
126547,Vijay Vasudevan,1446875858,TensorFlow: Upstream changes to git.
126548,Manjunath Kudlur,1446863831,TensorFlow: Upstream latest changes to Git.


The commits are sorted by time, newest first, so they are ready to be sliced by time
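
The ordering is worth checking rather than reading off the preview. A minimal sketch on a toy frame (the four timestamps are copied from the preview above; `toy`, `cutoff` and `recent` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy stand-in for df; timestamps copied from the preview above.
toy = pd.DataFrame({
    "time_sec": [1664120759, 1664096696, 1664096664, 1446863831],
    "text": ["a", "b", "c", "d"],
})

# Newest commit first means time_sec never increases from row to row.
is_newest_first = toy["time_sec"].is_monotonic_decreasing

# Slice out everything committed at or after a cutoff (Unix seconds).
cutoff = 1664096664
recent = toy[toy["time_sec"] >= cutoff]
```

If the check fails on the full data, sorting with `df.sort_values("time_sec", ascending=False)` before slicing would restore the assumption.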

In [3]:
df = df.astype({"text" : str})

In [4]:
# Consider Dropping A. Unique and Gardener

df["user"].value_counts().head(5)

A. Unique TensorFlower    30534
TensorFlower Gardener      6708
Mihai Maruseac             1365
Yong Tang                  1329
Shanqing Cai               1153
Name: user, dtype: int64

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126550 entries, 0 to 126549
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   user      126550 non-null  object
 1   time_sec  126550 non-null  int64 
 2   text      126550 non-null  object
dtypes: int64(1), object(2)
memory usage: 2.9+ MB


In [6]:
df.isna().sum()

user        0
time_sec    0
text        0
dtype: int64

We intend to use the `text` field as a temporary substitute for `categoryRaw`, which we are waiting to receive from the Schibsted data

## Cleaning Data

In [7]:
df["text_clean"] = df["text"].apply(string_cleaner)

df[["text", "text_clean"]].head(5)

Unnamed: 0,text,text_clean
0,Fix windows kokoro tests with rollback,fix windows kokoro tests with rollback
1,Update GraphDef version to 1265.,update graphdef version to 1265
2,compat: Update forward compatibility horizon to 2022-09-25,compat update forward compatibility horizon to 2022 09 25
3,Update TFRT dependency to use revision http://github.com/tensorflow/runtime/commit/b28814ce0a18fea92883fbd8901397fe3a1ffbf6.,update tfrt dependency to use revision httpgithubcomtensorflowruntimecommitb28814ce0a18fea92883fbd8901397fe3a1ffbf6
4,Integrate LLVM at llvm/llvm-project@94896994386d,integrate llvm at llvmllvm project94896994386d


## Engineering Data

### Making a new column for the week

In [8]:
df["time_week"] = df["time_sec"].apply(lambda x: x//604800)

df.to_pickle(names["df"])
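
The constant 604800 is the number of seconds in seven days, so `time_week` counts whole 7-day blocks since the Unix epoch. Because 1970-01-01 was a Thursday, these buckets start on Thursdays at 00:00 UTC and do not line up with ISO calendar weeks. A small sketch, assuming `time_sec` is Unix time in seconds (`week_index` and `week_start` are illustrative helpers):

```python
from datetime import datetime, timezone

SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800

def week_index(time_sec: int) -> int:
    """Number of whole 7-day blocks since the Unix epoch."""
    return time_sec // SECONDS_PER_WEEK

def week_start(week: int) -> datetime:
    """UTC timestamp at which the given week bucket begins."""
    return datetime.fromtimestamp(week * SECONDS_PER_WEEK, tz=timezone.utc)

newest = week_index(1664120759)  # first row of the data preview
oldest = week_index(1446863831)  # last row of the data preview
```

The difference `newest - oldest` gives the number of week buckets the dataset spans, which is the number of animation frames a per-week plot would need.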

### Grouping on users

Build a dictionary by grouping all the commit messages each user has.
Set groups on userId later, so that we can make animation frames of how the groups move

In [9]:
# dfu - ABBR: Data Frame User grouped
dfu = df[["text_clean", "time_sec", "time_week", "user"]].copy()

dfu = dfu.groupby(["user", "time_week"]).agg(list).reset_index()

dfu["text_clean_join"] = dfu["text_clean"].apply(lambda x: " ".join(x))

dfu.head(3)

Unnamed: 0,user,time_week,text_clean,time_sec,text_clean_join
0,(David) Siu-Kei Muk,2503,[adding ps_strategy to run_config to enable different placement strategy in estimator],[1514284534],adding ps_strategy to run_config to enable different placement strategy in estimator
1,(David) Siu-Kei Muk,2511,[resolved merge conflict on import lines],[1518919967],resolved merge conflict on import lines
2,(David) Siu-Kei Muk,2512,[1 removing ps_strategy 2 modified estimator to take overriden device_fn from if set 3 removed ps_strategy related unit tests],[1519453956],1 removing ps_strategy 2 modified estimator to take overriden device_fn from if set 3 removed ps_strategy related unit tests


In [10]:
weeks_per_user = dfu["user"].value_counts().reset_index()

print(len(weeks_per_user))

weeks_per_user[weeks_per_user["user"] > 10]

3805


Unnamed: 0,index,user
0,A. Unique TensorFlower,354
1,Peter Hawkins,250
2,TensorFlower Gardener,223
3,Jacques Pienaar,221
4,Mark Daoust,207
...,...,...
511,Wenyi Zhao,11
512,Carl Thomé,11
513,Rui Zhao,11
514,Jayaram Bobba,11


# Machine Learning

## Unsupervised ML

### NLP Embeddings

Getting the 768 dimensional embeddings for each commit message

In [11]:
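try:
    # force a recompute when fresh data is requested
    if conf["fresh_data"]: raise Exception
    with open(names[f"embs-{device}"], 'rb') as f:
        embs = pickle.load(f)

except Exception:
    # no usable cache: compute the embeddings and store them for next time
    embs = sbert_emb_getter(df["text_clean"].to_numpy(), filename=names[f"model-{device}"])
    with open(names[f"embs-{device}"], 'wb') as f:
        pickle.dump(embs, f)

    conf["fresh_embs"] = True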

print(f"fresh embs: {conf['fresh_embs']}")

fresh embs: True
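
The load-or-recompute pattern in this cell recurs for every expensive step in the notebook. One way to factor it out is sketched below; `load_or_compute` is a hypothetical helper, not something provided by `scripts`, and the lambda stands in for the real embedding call:

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute, force_fresh=False):
    """Return (value, fresh): load the pickle at path, or compute and store it."""
    if not force_fresh and os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh), False
    value = compute()
    with open(path, "wb") as fh:
        pickle.dump(value, fh)
    return value, True

# Usage with a throwaway file; the lambdas stand in for the expensive call.
cache_path = os.path.join(tempfile.mkdtemp(), "embs-demo.pkl")
first, first_fresh = load_or_compute(cache_path, lambda: [0.1, 0.2, 0.3])
second, second_fresh = load_or_compute(cache_path, lambda: [9.9])
```

Returning the `fresh` flag alongside the value avoids relying on a key in `conf` that is only set on the recompute path.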


### Dimensionality Reduction

We use UMAP to reduce the dimensionality of the embeddings from 768 to 2, so that we can visualize them

In [14]:
umap_metric = "euclidean"

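try:
    # force a recompute when fresh data is requested
    if conf["fresh_data"]: raise Exception
    with open(names[f"uembs-{device}"], 'rb') as f:
        uembs = pickle.load(f)

except Exception:
    # no usable cache: run UMAP and store the 2D embeddings for next time
    uembs = UMAP(n_neighbors=15, min_dist=0.0).fit_transform(embs)
    with open(names[f"uembs-{device}"], 'wb') as f:
        pickle.dump(uembs, f)

    conf["fresh_uembs"] = True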

print(f"fresh uembs: {conf['fresh_uembs']}")

fresh uembs: True


In [16]:
# TODO make this plot just a trace, to fit in gridplot

fig = px.scatter(x=uembs[:,0], y=uembs[:,1])

fig.update_layout(width=800, height=800)
fig.update_traces(marker=dict(size=2))

# plotting to show how the embeddings are when just dimensionality reduction is used
#fig_show_save(fig, "umap-scatter", show=conf["show_figs"])

fig.show()

In [17]:
clusters_2d = HDBSCAN(min_cluster_size=100, min_samples=20, metric='euclidean', cluster_selection_method='eom').fit(uembs)


print(f"""
    2D
    Number of clusters: {len(set(clusters_2d.labels_)) - 1}
    Number of rows as outliers: {clusters_2d.labels_.tolist().count(-1)}
""")


    2D
    Number of clusters: 239
    Number of rows as outliers: 39492
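
HDBSCAN marks noise points with the label -1, which is why the cell subtracts 1 from the number of distinct labels. That subtraction undercounts by one whenever a run produces no noise at all. A small sketch that removes the noise label explicitly instead (`summarize_labels` is an illustrative helper):

```python
def summarize_labels(labels):
    """Count clusters and noise points in an HDBSCAN-style label sequence."""
    labels = list(labels)
    n_noise = labels.count(-1)
    # Drop the noise label explicitly instead of subtracting 1 from the set size.
    n_clusters = len(set(labels) - {-1})
    return n_clusters, n_noise

with_noise = summarize_labels([0, 0, 1, -1, 2, -1])
without_noise = summarize_labels([0, 0, 1, 1])
```

Passing `clusters_2d.labels_` to the helper should give the same two numbers as the printout above, since that run does contain outliers.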



## Semi Supervised

### Exploring Stopwords

#### Checking most common words

First we check the most common words without any filtering, then again after filtering out English stopwords

In [18]:
vc = (
    df["text_clean"].apply(lambda x: (x.split(" ")))
    .explode()
    .value_counts()
    .reset_index()
)

vc.head(10)

Unnamed: 0,index,text_clean
0,to,36789
1,for,25907
2,the,21848
3,in,20546
4,add,16387
5,of,15700
6,update,15156
7,from,14898
8,fix,13657
9,change,12731


In [19]:
# stopwords has been imported from nltk
s_words = stopwords.words('english')

print(f"""
    {type(s_words)}
    {len(s_words)}
    {s_words[0:10]}
""")


    <class 'list'>
    179
    ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]



In [20]:
# vc where the index is not in the stopwords list
vc = vc[~vc["index"].isin(s_words)]

# getting the top 20 words
target_interest = list(vc.head(conf["interest_words"] + 1)["index"])

# removing the empty string that the split can leave as the first element in the list
try:
  target_interest.remove("")
except ValueError:
  pass

print(target_interest)

['add', 'update', 'fix', 'change', 'merge', 'request', 'pull', 'use', 'ops', 'support', 'test', 'remove', 'op', 'generated', 'tensorflow', 'xla', 'functions', 'build', 'tests', 'internal']
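
The three cells above (count words, filter stopwords, take the top words) can also be written as one function that discards empty tokens before counting, so no extra element has to be requested and removed afterwards. A sketch with the standard library only; the messages and the tiny stopword set are made up for illustration and stand in for `df["text_clean"]` and the NLTK list:

```python
from collections import Counter

def top_words(messages, stop_words, k):
    """Return the k most frequent tokens that are neither stopwords nor empty."""
    counts = Counter(
        token
        for message in messages
        for token in message.split(" ")
        if token and token not in stop_words
    )
    return [word for word, _ in counts.most_common(k)]

messages = [
    "fix windows kokoro tests with rollback",
    "update graphdef version to 1265",
    "fix  typo in the docs",          # double space yields an empty token
    "update the docs for the fix",
]
demo_stop_words = {"to", "the", "in", "for", "with"}  # stand-in for the NLTK list
result = top_words(messages, demo_stop_words, k=3)
```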


In [21]:
# threshold for deciding whether a commit message belongs to an interest or not
threshold = 0.2

if not conf["generate_interests"]:
    target_interest = interest_fixer("""
    fix add merge remove update pull request python docs tensorflow generated
    """)

print(f"using generated target_interest: {conf['generate_interests']}")

using generated target_interest: True


In [22]:
# custom stopwords as a union with the nltk stopwords and the target_interest found by value counting all the words in the commit messages
bonus_words = stopwords.words('english') + (target_interest)

# dumping for the plotting notebook
pickle.dump(bonus_words, open(names["bonus-words"], 'wb'))

In [23]:
## TODO make this use the get embeddings function
# getting results.
y, similarity = make_dataset(embs, targets=target_interest, model=model, target_threshold=threshold)
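
`make_dataset` lives in `scripts` and is not shown here, so the following is only a guess at the kind of rule it applies: compare each message embedding with each target-word embedding by cosine similarity, take the best match, and fall back to 0 when the best score is below the threshold. The function name, the 1-based labels and the toy vectors are all assumptions made for illustration:

```python
import numpy as np

def label_by_similarity(embs, target_embs, threshold):
    """Assign each row its most similar target (1-based), or 0 if below threshold."""
    # Normalise rows so that the dot product equals cosine similarity.
    embs_n = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    targets_n = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)
    similarity = embs_n @ targets_n.T            # shape: (n_messages, n_targets)
    best = similarity.argmax(axis=1)
    best_score = similarity.max(axis=1)
    labels = np.where(best_score >= threshold, best + 1, 0)
    return labels, similarity

# Toy 2-D vectors standing in for the 768-D sentence embeddings.
targets = np.array([[1.0, 0.0], [0.0, 1.0]])
messages = np.array([[2.0, 0.1], [0.1, 3.0], [-1.0, -1.0]])
labels, similarity = label_by_similarity(messages, targets, threshold=0.2)
```

In this sketch the third message points away from both targets, so it gets the label 0, while the first two are assigned targets 1 and 2.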

In [24]:
set(y)

list(y).index(0)

y[4]

0

In [25]:
set(target_interest)

{'add',
 'build',
 'change',
 'fix',
 'functions',
 'generated',
 'internal',
 'merge',
 'op',
 'ops',
 'pull',
 'remove',
 'request',
 'support',
 'tensorflow',
 'test',
 'tests',
 'update',
 'use',
 'xla'}

In [26]:

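# list of interest words, one per commit, looked up from the ids in y
# note: an id of 0 indexes target_interest[-1], i.e. the last word in the list
target_interest_list = [target_interest[i-1] for i in y]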

print(len(target_interest_list) - len(df))

print(target_interest_list[0:10])

names["target-interest-list"] = "data/target-interest-list.pkl"

pickle.dump(target_interest_list, open(names["target-interest-list"], 'wb'))

0
['tests', 'update', 'update', 'update', 'internal', 'tensorflow', 'update', 'update', 'xla', 'tensorflow']
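
One detail of the `i-1` lookup deserves attention: `y` contains the id 0 (see `y[4]` above), and in Python index `0 - 1` selects the last element of a list, so those rows are reported as the last target word. Whether 0 means "no interest matched" depends on `make_dataset`, which is not shown; assuming it does, a lookup with an explicit placeholder could look like this (`ids_to_words` and the `"<none>"` marker are hypothetical):

```python
def ids_to_words(ids, words, unlabelled="<none>"):
    """Map 1-based interest ids to words; id 0 becomes an explicit placeholder."""
    return [words[i - 1] if i > 0 else unlabelled for i in ids]

words = ["add", "update", "fix", "internal"]   # shortened stand-in for target_interest
ids = [0, 1, 2, 4]

wrapped = [words[i - 1] for i in ids]          # the indexing used in the cell above
explicit = ids_to_words(ids, words)
```

Here `wrapped` starts with `'internal'` for id 0, whereas `explicit` keeps that row visibly separate from rows that really matched the last word.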


In [27]:
set(target_interest_list)
len(set(target_interest_list))

20

In [28]:
try:
  # force a recompute when fresh data is requested
  if conf["fresh_data"]: raise Exception

  with open(names[f"uembs-s-{device}"], "rb") as f:
    uemb_semi_s = pickle.load(f)

except Exception:
  # used to have just nn = 100 at 0.2 similarity, without the metric and target weight
  # target weight is between 0 and 1, 0.5 is the default, we used 1 for a while
  # y-1 shifts the labels down by one, so an id of 0 becomes -1
  uemb_semi_s = UMAP(n_neighbors=15, min_dist=0.0, target_weight=0.5).fit_transform(embs, y-1)
  with open(names[f"uembs-s-{device}"], "wb") as f:
    pickle.dump(uemb_semi_s, f)

  conf["fresh_s_uembs"] = True

print(f"fresh semi supervised uembs: {conf['fresh_s_uembs']}")

fresh semi supervised uembs: True


In [29]:
cluster_semi_s_hdb = HDBSCAN(min_cluster_size=100, min_samples=20, metric='euclidean', cluster_selection_method='eom').fit(uemb_semi_s)

## Cluster & Topic Inspection

In [30]:
result_2d = result_df_maker(uembs, clusters_2d.labels_, df["text_clean"].to_numpy(), bonus_words=target_interest)

vcr = result_2d[["cluster_label", "group_topics"]].groupby(["cluster_label", "group_topics"])["group_topics"].count().reset_index(name="commit_count").sort_values(by="commit_count", ascending=False).head(20)

vcr

Unnamed: 0,cluster_label,group_topics,commit_count
0,-1,to for the in from,39492
124,123,to for in tensor the,5720
150,149,to for in the of,4801
235,234,from siju samuelpatch patch doc,2690
57,56,wrapper go for,1761
233,232,tfdata for to the service,1624
167,166,to disable the in for,1541
232,231,for to in the tf,1391
197,196,to graph in the for,1312
62,61,nfc to the of in,1276


In [31]:
result_2d_semi = result_df_maker(uemb_semi_s, y, df["text_clean"].to_numpy(), bonus_words=target_interest)

vcr = result_2d_semi[["cluster_label", "group_topics"]].groupby(["cluster_label", "group_topics"])["group_topics"].count().reset_index(name="commit_count").sort_values(by="commit_count", ascending=False).head(20)

vcr

# cluster_label is the interest label here

Unnamed: 0,cluster_label,group_topics,commit_count
0,0,to for in the of,46894
15,15,go wrapper for to the,17297
5,5,from branch changes master into,12037
2,2,to files pbtxt related revision,10267
3,3,typo in of the for,9058
11,11,for to in disable the,6666
16,16,to for in the xlagpu,6264
12,12,unused cleanup from code removed,3565
18,18,to for llvm at integrate,3304
4,4,of rollback automated changelist g4,2343


In [32]:
# finding most common words in 20 most common group_topics to see if we need more stopwords
vcr["group_topics"].apply(lambda x: x.split(" ")).explode().value_counts().head(20)

for           14
to            13
the           11
in             7
of             7
from           3
disable        2
branch         2
changes        2
docs           2
cleanup        2
python         2
g4             1
only           1
visibility     1
function       1
comment        1
added          1
missing        1
adding         1
Name: group_topics, dtype: int64

# Making Result DF

We can run a regression on the growth of each interest, and pick whichever interest is most likely at a given point in time
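
A minimal sketch of that idea, assuming a plain linear trend per interest over weekly commit counts. The counts are made up, and `predict_next` is an illustrative helper rather than part of the notebook:

```python
import numpy as np

def predict_next(weekly_counts):
    """Fit a straight line per interest and extrapolate one week ahead."""
    predictions = {}
    for interest, counts in weekly_counts.items():
        weeks = np.arange(len(counts))
        slope, intercept = np.polyfit(weeks, counts, deg=1)
        predictions[interest] = slope * len(counts) + intercept
    return predictions

# Made-up weekly commit counts per interest, purely for illustration.
weekly_counts = {
    "fix":    [10, 9, 8, 7],
    "update": [2, 4, 6, 8],
}
predictions = predict_next(weekly_counts)
most_likely = max(predictions, key=predictions.get)
```

With real data the counts would come from grouping `dfres` by `target_interest` and `time_week`; a straight line is only the simplest possible model and may not suit bursty commit activity.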

In [33]:
dfres = df[["text_clean", "time_sec", "time_week", "user"]].copy()

#dfres["time_week"] = dfres["time_sec"].apply(lambda x: datetime.fromtimestamp(x).isocalendar()[1])

#dfres["time_week"] = dfres["time_sec"].apply(lambda x: x//604800)

dfres["x"] = uemb_semi_s[:, 0]

dfres["y"] = uemb_semi_s[:, 1]

dfres["cluster"] = cluster_semi_s_hdb.labels_



dfres["interest_id"] = list(y)

# -1 to make up for adding 1 earlier
dfres["target_interest"] = dfres["interest_id"].apply(lambda x: target_interest[x-1])

# find topic by interest instead
topic_dict = topic_by_clusterId(dfres["text_clean"].to_numpy(), dfres["interest_id"].to_numpy(), bonus_words=bonus_words)

dfres["topic"] = dfres["interest_id"].apply(lambda x: " ".join(list(topic_dict[x])))

dfres = dfres[dfres["cluster"] != -1]

# Pickling the dfres for plotting in other notebook
dfres.to_pickle(names["dfres"])

dfres

Unnamed: 0,text_clean,time_sec,time_week,user,x,y,cluster,interest_id,target_interest,topic
1,update graphdef version to 1265,1664096696,2751,A. Unique TensorFlower,10.270692,-4.072044,41,2,update,files pbtxt related revision dependency
2,compat update forward compatibility horizon to 2022 09 25,1664096664,2751,A. Unique TensorFlower,-10.312591,-7.643162,18,2,update,files pbtxt related revision dependency
3,update tfrt dependency to use revision httpgithubcomtensorflowruntimecommitb28814ce0a18fea92883fbd8901397fe3a1ffbf6,1664038724,2751,A. Unique TensorFlower,10.326187,10.077053,186,2,update,files pbtxt related revision dependency
4,integrate llvm at llvmllvm project94896994386d,1664036871,2751,A. Unique TensorFlower,-9.121819,9.770739,98,0,internal,llvm make integrate api llvmllvm
5,updates the tensorflow tensorbundle read to handle single shard of large variable file more efficiently,1664016898,2751,A. Unique TensorFlower,-1.648512,4.487239,115,15,tensorflow,go wrapper tensor tensors revid
...,...,...,...,...,...,...,...,...,...,...
126524,make license in setuppy match,1447301481,2393,Illia Polosukhin,4.118978,1.115889,106,0,internal,llvm make integrate api llvmllvm
126526,added apache headers,1447301042,2393,Illia Polosukhin,3.528225,-0.484087,145,6,request,piperorigin revid comments review api
126527,initial commit,1447300802,2393,Illia Polosukhin,2.123744,0.938270,113,1,add,added missing adding comment comments
126531,update description of various issuediscussion forums by vanhoucke,1447206079,2392,Vijay Vasudevan,4.565873,1.916050,95,2,update,files pbtxt related revision dependency


# Exploration

## Collaborative filtering

### Data Restructuring

- Grouping by user to get info on their commits and which target_interest their commits belong to in a quantitative way
- Using the user groups, we can then group the df by user group and time, which leaves very few groups, and run a regression on each group's activity over time
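
The per-user counting is done by `dict_per_user` from `scripts`, which is not shown. A hypothetical equivalent that produces the same shape as the `target_interest_dict` column in this section (every interest present as a key, zeros included, in the order of the target list) might be:

```python
from collections import Counter

def count_interests(user_interests, all_interests):
    """Count how often each interest occurs for one user, keeping zero entries."""
    counts = Counter(user_interests)
    return {interest: counts.get(interest, 0) for interest in all_interests}

all_interests = ["add", "update", "fix", "merge"]   # shortened stand-in
one_user = ["merge", "fix", "merge", "merge"]
profile = count_interests(one_user, all_interests)
```

Keeping the zero entries matters because every user then has a vector of the same length, which is what lets the dictionaries be stacked into one matrix for clustering.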

In [34]:
userdf = pd.DataFrame({
    "user": df["user"],
#    "time_week" : list(df["time_week"]),
    "target_interest_id" : list(y),
    "cluster_id" : list(cluster_semi_s_hdb.labels_)
    })

# dropping the two bots
userdf = userdf[userdf["user"] != "A. Unique TensorFlower"]
userdf = userdf[userdf["user"] != "TensorFlower Gardener"]


userdf["target_interest"] = userdf["target_interest_id"].apply(lambda x: target_interest[x-1])

#userdf = userdf.groupby(["user", "time_week"]).agg(list).reset_index()

userdf = userdf.groupby("user").agg(list).reset_index()


len(userdf)

3803

In [35]:
#userdf_dict = userdf[["user","time_week","target_interest_word"]].copy()
userdf_dict = userdf[["user","target_interest"]].copy()


userdf_dict["target_interest_dict"] = userdf["target_interest"].apply(lambda x: dict_per_user(x, target_interest))

userdf_dict.drop(columns=["target_interest"], inplace=True)

print(len(userdf_dict))

userdf_dict.head(3)

3803


Unnamed: 0,user,target_interest_dict
0,(David) Siu-Kei Muk,"{'add': 0, 'update': 0, 'fix': 1, 'change': 0, 'merge': 8, 'request': 0, 'pull': 0, 'use': 0, 'ops': 0, 'support': 0, 'test': 1, 'remove': 0, 'op': 0, 'generated': 0, 'tensorflow': 2, 'xla': 0, 'f..."
1,103yiran,"{'add': 0, 'update': 0, 'fix': 0, 'change': 0, 'merge': 0, 'request': 0, 'pull': 0, 'use': 1, 'ops': 0, 'support': 0, 'test': 0, 'remove': 0, 'op': 0, 'generated': 0, 'tensorflow': 0, 'xla': 0, 'f..."
2,1e100,"{'add': 0, 'update': 1, 'fix': 0, 'change': 0, 'merge': 0, 'request': 0, 'pull': 0, 'use': 0, 'ops': 0, 'support': 0, 'test': 0, 'remove': 3, 'op': 0, 'generated': 0, 'tensorflow': 0, 'xla': 0, 'f..."


In [36]:
userdf_ex = pd.DataFrame(list(userdf["target_interest"].apply(lambda x: dict_per_user(x, target_interest))))


## TODO
# - Yeet the bot
# - use mean normalization
userdf_ex_norm = (userdf_ex-userdf_ex.min())/(userdf_ex.max()-userdf_ex.min())

target_interest_matrix = userdf_ex_norm.to_numpy()


userdf_ex_norm.insert(0, "user", list(userdf["user"]))
#userdf_ex_norm.insert(1, "time_week", list(userdf["time_week"]))

userdf_ex_norm

## Rating is how close they are to the center of the interest

## TODO try to actually implement the collaborative filtering functionality

# - one line per interest per group
# - regression per interest to predict the next 20 points

Unnamed: 0,user,add,update,fix,change,merge,request,pull,use,ops,...,test,remove,op,generated,tensorflow,xla,functions,build,tests,internal
0,(David) Siu-Kei Muk,0.0,0.000000,0.005780,0.0,0.016598,0.0,0.0,0.0,0.0,...,0.005747,0.000000,0.0,0.0,0.006135,0.0,0.0,0.000000,0.0,0.003030
1,103yiran,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.2,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000
2,1e100,0.0,0.004132,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.000000,0.050847,0.0,0.0,0.000000,0.0,0.0,0.038462,0.0,0.004545
3,372046933,0.0,0.008264,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000
4,4F2E4A2E,0.0,0.004132,0.005780,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.001515
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3798,黄璞,0.0,0.000000,0.011561,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000
3799,黄鑫,0.0,0.000000,0.005780,0.0,0.002075,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000
3800,박상준,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.001515
3801,이장후,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.001515
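
The cell above uses min-max scaling per column, which yields NaN for any column whose maximum equals its minimum (0 divided by 0). The TODO mentions mean normalization, which subtracts the column mean instead of the minimum. A sketch of both with a guard for constant columns, on a made-up count matrix (the helper names are illustrative):

```python
import numpy as np

def min_max_normalize(matrix):
    """Scale each column to [0, 1]; constant columns become all zeros."""
    col_min = matrix.min(axis=0)
    col_range = matrix.max(axis=0) - col_min
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (matrix - col_min) / safe_range

def mean_normalize(matrix):
    """Centre each column on its mean, then divide by the column range."""
    col_range = matrix.max(axis=0) - matrix.min(axis=0)
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (matrix - matrix.mean(axis=0)) / safe_range

counts = np.array([[0.0, 2.0, 5.0],
                   [4.0, 2.0, 5.0],
                   [8.0, 2.0, 15.0]])
scaled = min_max_normalize(counts)
centred = mean_normalize(counts)
```

Either variant is dominated by the heaviest committer in each column, so a per-user normalisation (dividing each row by the user's total commits) may be worth comparing as well.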


In [37]:
colab_clusters = HDBSCAN(min_cluster_size=40, min_samples=10, metric='euclidean', cluster_selection_method='eom').fit(target_interest_matrix)

print(f"""
    Full dimensionality clustering output:
    Len of colab clusters: {len(colab_clusters.labels_)}
    Number of clusters: {len(set(colab_clusters.labels_)) - 1}
    Number of rows as outliers: {colab_clusters.labels_.tolist().count(-1)}
""")



    Full dimensionality clustering output:
    Len of colab clusters: 3803
    Number of clusters: 19
    Number of rows as outliers: 1180



In [38]:
colab_umap = UMAP(n_neighbors=15, min_dist=0.0).fit_transform(target_interest_matrix)

In [39]:
colab_resdf = pd.DataFrame({
    "x" : colab_umap[:, 0], 
    "y" : colab_umap[:, 1], 
    "cluster" : colab_clusters.labels_
})

# with few clusters you can turn on and off outliers with the -1 label
#colab_resdf = colab_resdf[colab_resdf["cluster"] != -1]

#turning cluster to str for discrete color
colab_resdf["cluster"] = colab_resdf["cluster"].astype(str)

fig_colab = px.scatter(colab_resdf, x="x", y="y", color="cluster", title="Colab clustering", width=800, height=800)

fig_colab.show()

# TODO add one visualisation without time grouping
# this would give us the "true" user groups, and then we could see if they moved around without breaking up the group too much
# also it is not a bug that there is overlap of clusters, as the clustering takes place before umap

In [40]:
## TODO
# - now we have to do this per user.
# - we need to look at what a given user is committing about, interest-wise, and then see which cluster that user is in
# - then when we

## Exploring water simulation based prediction potential

In [41]:

print(f"""
    Bounds of the uembs
    
    x axis:
    {min(uembs[:,0])}
    {max(uembs[:,0])}
    
    "y axis"
    {min(uembs[:,1])}
    {max(uembs[:,1])}
""")


    Bounds of the uembs
    
    x axis:
    -2311.22412109375
    640.842529296875
    
    "y axis"
    -2745.157470703125
    775.3185424804688



We can set the frame for the water prediction to ±25 on both axes

512 x 512*2 pixels in that space

Generate the next frame in the animation

Give two frames of the past
- can give one frame per week per user
- can use one color per user group
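
Turning a set of 2D points into one such frame amounts to a 2D histogram over the chosen bounds. The sketch below assumes a square 512 x 512 grid over ±25 on both axes, which is only one reading of the pixel note above; `rasterize` and the sample points are illustrative:

```python
import numpy as np

def rasterize(x, y, bound=25.0, size=512):
    """Count points per pixel on a size x size grid covering [-bound, bound]^2."""
    frame, _, _ = np.histogram2d(
        x, y,
        bins=size,
        range=[[-bound, bound], [-bound, bound]],
    )
    return frame

# Three made-up points; the last lies outside the frame and is dropped.
x = np.array([0.0, 10.0, 100.0])
y = np.array([0.0, -10.0, 0.0])
frame = rasterize(x, y)
```

Calling this once per week on the rows of that week would give the sequence of past frames, and a separate frame per user group could be stacked as colour channels.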