# Abbreviations:

- `df` -> dataframe

- `Series` -> column

- `embs` -> embeddings (numerical representation of words in a vector space, from our NLP AI model)

- `uembs` -> umap embeddings (dimensionality reduction of embs, to make it easier to visualize, 768 -> 2)

- `target_interest` -> the 20 words we use as targets for grouping the commit messages. We can set these ourselves, and we use them as targets for semi-supervision

- `group_topics` -> the topics we generate based on message cluster, less important for the ML pipeline, more for the human reader to understand what the clusters are about

# Imports

In [1]:
from scripts import *


Local stopwords:        True
GPUs detected:          1
Using GPU:              True
Device:                 cuda
Using cuML:             True
Got model from pickle:  True



# Data

## Getting Data

In [2]:
df = pd.read_csv("data/tensorflow-big.csv", on_bad_lines='skip')

len_start = len(df)

df = df[["author_name", "time_sec", "subject"]]


df = df.rename(columns={
    "author_name": "user", 
    "time_sec": "time_sec", 
    "subject": "text"
})

# dropping the two bots
df = df[df["user"] != "A. Unique TensorFlower"]
df = df[df["user"] != "TensorFlower Gardener"]

len_drop_bots = len(df)


# if your computer does not have GPU support, you can use a sample of the dataset instead to make it run in a reasonable time
if device == "cpu": df = df.sample(frac=0.05)

print(f"""
Rows before dropping bots:    {len_start}
Rows after dropping bots:     {len_drop_bots}
Rows diff:                    {len_start - len_drop_bots}
""")

df


Rows before dropping bots:    126550
Rows after dropping bots:     89308
Rows diff:                    37242



Unnamed: 0,user,time_sec,text
0,Ian Hua,1664120759,Fix windows kokoro tests with rollback
8,Mehdi Amini,1664004666,Migrate XLA to use TSL status.h
9,Mehdi Amini,1664001037,Move tensorflow/tsl/*/build_config:stream_executor_{no_}_cuda to stream executor itself
12,Taehee Jeong,1663984135,[xla:gpu] NFC: Remove XLA_ENABLE_XLIR macros and always compile XLA runtime for Gpu
13,Chen Qian,1663982378,Some changes on the new optimizer: 1. Include `custom_objects` in `from_config` for deserializing custom learning rate. 2. Handle the error of seeing unrecognized variable with a better error mess...
...,...,...,...
126545,Vijay Vasudevan,1446933504,TensorFlow: Upstream commits to git.
126546,Vijay Vasudevan,1446922181,TensorFlow: Upstream latest commits to git.
126547,Vijay Vasudevan,1446875858,TensorFlow: Upstream changes to git.
126548,Manjunath Kudlur,1446863831,TensorFlow: Upstream latest changes to Git.


We see that the commits are in order, and ready to be sliced timewise

In [3]:
df = df.astype({"text" : str})

In [4]:
# top committers after dropping A. Unique TensorFlower and TensorFlower Gardener

df["user"].value_counts().head(5)

Mihai Maruseac    1365
Yong Tang         1329
Shanqing Cai      1153
Derek Murray      1088
Raman Sarokin     1067
Name: user, dtype: int64

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 89308 entries, 0 to 126549
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   user      89308 non-null  object
 1   time_sec  89308 non-null  int64 
 2   text      89308 non-null  object
dtypes: int64(1), object(2)
memory usage: 2.7+ MB


In [6]:
df.isna().sum()

user        0
time_sec    0
text        0
dtype: int64

We intend to use the `text` field as a temporary substitute for `categoryRaw`, which we are waiting to get from the Schibsted data.

## Cleaning Data

In [7]:
df["text_clean"] = df["text"].apply(string_cleaner)

df[["text", "text_clean"]].head(5)

Unnamed: 0,text,text_clean
0,Fix windows kokoro tests with rollback,fix windows kokoro tests with rollback
8,Migrate XLA to use TSL status.h,migrate xla to use tsl statush
9,Move tensorflow/tsl/*/build_config:stream_executor_{no_}_cuda to stream executor itself,move tensorflowtslbuild_configstream_executor_no__cuda to stream executor itself
12,[xla:gpu] NFC: Remove XLA_ENABLE_XLIR macros and always compile XLA runtime for Gpu,xlagpu nfc remove xla_enable_xlir macros and always compile xla runtime for gpu
13,Some changes on the new optimizer: 1. Include `custom_objects` in `from_config` for deserializing custom learning rate. 2. Handle the error of seeing unrecognized variable with a better error mess...,some changes on the new optimizer 1 include custom_objects in from_config for deserializing custom learning rate 2 handle the error of seeing unrecognized variable with a better error message


## Engineering Data

### Making a new column for the week

In [8]:
df["time_week"] = df["time_sec"].apply(lambda x: x//604800)

df.to_pickle(names["df"])
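
The integer division by 604800 (the number of seconds in seven days) buckets commits into weeks counted from the Unix epoch. Since 1 January 1970 was a Thursday, these buckets run Thursday to Thursday in UTC rather than following calendar weeks. A minimal sketch of the same arithmetic using only the standard library; `epoch_week` and `week_start` are illustrative helper names, not part of `scripts`:

```python
from datetime import datetime, timezone

SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800, the constant used above


def epoch_week(time_sec: int) -> int:
    """Number of whole 7-day periods since the Unix epoch."""
    return time_sec // SECONDS_PER_WEEK


def week_start(week: int) -> datetime:
    """UTC timestamp at which the given epoch week begins."""
    return datetime.fromtimestamp(week * SECONDS_PER_WEEK, tz=timezone.utc)


week = epoch_week(1664120759)  # first commit in the table above
print(week, week_start(week).date(), week_start(week).weekday())  # weekday 3 = Thursday
```

If calendar weeks (Monday to Sunday) are wanted instead, the bucket boundaries would need to be shifted, which changes the `time_week` values used for grouping later on.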

### Grouping on users

Make a dictionary by grouping all the commit messages each user has.
Set groups on userId later, so that we can make animation frames of how the groups move.

In [9]:
# dfu - ABBR: Data Frame User grouped
dfu = df[["text_clean", "time_sec", "time_week", "user"]].copy()

dfu = dfu.groupby(["user", "time_week"]).agg(list).reset_index()

dfu["text_clean_join"] = dfu["text_clean"].apply(lambda x: " ".join(x))

dfu.head(3)

Unnamed: 0,user,time_week,text_clean,time_sec,text_clean_join
0,(David) Siu-Kei Muk,2503,[adding ps_strategy to run_config to enable different placement strategy in estimator],[1514284534],adding ps_strategy to run_config to enable different placement strategy in estimator
1,(David) Siu-Kei Muk,2511,[resolved merge conflict on import lines],[1518919967],resolved merge conflict on import lines
2,(David) Siu-Kei Muk,2512,[1 removing ps_strategy 2 modified estimator to take overriden device_fn from if set 3 removed ps_strategy related unit tests],[1519453956],1 removing ps_strategy 2 modified estimator to take overriden device_fn from if set 3 removed ps_strategy related unit tests


In [10]:
weeks_per_user = dfu["user"].value_counts().reset_index()

print(len(weeks_per_user))

weeks_per_user[weeks_per_user["user"] > 10]

3803


Unnamed: 0,index,user
0,Peter Hawkins,250
1,Jacques Pienaar,221
2,Mark Daoust,207
3,Benjamin Kramer,201
4,Yong Tang,197
...,...,...
509,Carl Thomé,11
510,Noah Eisen,11
511,Aditya Kane,11
512,Nishidha Panpaliya,11


# Machine Learning

## Unsupervised ML

### NLP Embeddings

Getting the 768 dimensional embeddings for each commit message

In [11]:
try:
    if conf["fresh_data"]: raise Exception
    embs = pickle.load(open(names[f"embs-{device}"], 'rb'))
    
except:
    embs = sbert_emb_getter(df["text_clean"].to_numpy(), filename=names[f"model-{device}"])
    pickle.dump(embs, open(names[f"embs-{device}"], 'wb'))

    conf["fresh_embs"] = True

print(f"fresh embs: {conf['fresh_embs']}")

fresh embs: False


### Dimensionality Reduction

We use UMAP to reduce the dimensionality of the embeddings from 768 to 2, so that we can visualize them

In [12]:
umap_metric = "euclidean"

try:
    if conf["fresh_data"]: raise Exception
    uembs = pickle.load(open(names[f"uembs-{device}"], 'rb'))
    
except:
    uembs = UMAP(n_neighbors=15, min_dist=0.0).fit_transform(embs)
    pickle.dump(uembs, open(names[f"uembs-{device}"], 'wb'))

    conf["fresh_uembs"] = True

print(f"fresh uembs: {conf['fresh_uembs']}")

fresh uembs: False


In [13]:
# TODO make this plot just a trace, to fit in gridplot

fig = px.scatter(
  x=uembs[:,0], 
  y=uembs[:,1],
  range_x = [-25, 25],
  range_y = [-25, 25]
)

fig.update_layout(width=800, height=800)
fig.update_traces(marker=dict(size=2))

# plotting to show how the embeddings are when just dimensionality reduction is used
#fig_show_save(fig, "umap-scatter", show=conf["show_figs"])

fig.show()

In [14]:
clusters_2d = HDBSCAN(min_cluster_size=100, min_samples=20, metric='euclidean', cluster_selection_method='eom').fit(uembs)


print(f"""
    2D
    Number of clusters: {len(set(clusters_2d.labels_)) - 1}
    Number of rows as outliers: {clusters_2d.labels_.tolist().count(-1)}
""")


    2D
    Number of clusters: 164
    Number of rows as outliers: 28546
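
HDBSCAN marks points it cannot assign to any cluster with the label `-1`, which is why the count above subtracts one from the number of distinct labels. That subtraction is only right when at least one outlier exists. A small sketch of a summary that handles both cases; `summarize_labels` is a hypothetical helper and works on any sequence of integer labels:

```python
from collections import Counter


def summarize_labels(labels):
    """Count clusters and outliers in HDBSCAN-style labels (-1 = outlier)."""
    counts = Counter(int(label) for label in labels)
    n_outliers = counts.pop(-1, 0)  # remove the noise label if present
    return {
        "n_clusters": len(counts),
        "n_outliers": n_outliers,
        "largest_cluster_size": max(counts.values(), default=0),
    }


print(summarize_labels([0, 0, 1, -1, 1, 1, -1]))
print(summarize_labels([0, 0, 1]))  # no outliers: still reports 2 clusters
```

With the notebook's objects this would be called as `summarize_labels(clusters_2d.labels_)`.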



## Semi Supervised

### Exploring Stopwords

#### Checking most common words

First we check the most common words without any filtering, then again after filtering out English stopwords.

In [15]:
vc = (
    df["text_clean"].apply(lambda x: (x.split(" ")))
    .explode()
    .value_counts()
    .reset_index()
)

vc.head(10)

Unnamed: 0,index,text_clean
0,to,25015
1,for,18371
2,the,16677
3,in,16031
4,add,13026
5,of,11737
6,fix,11324
7,,9745
8,and,8876
9,a,8535


In [16]:
# stopwords has been imported from nltk
s_words = stopwords.words('english')

print(f"""
    {type(s_words)}
    {len(s_words)}
    {s_words[0:10]}
""")


    <class 'list'>
    179
    ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]



In [17]:
# vc where the index is not in the stopwords list
vc = vc[~vc["index"].isin(s_words)]

# getting the top 20 words (one extra, since the empty token is removed below)
target_interest = list(vc.head(conf["interest_words"] + 1)["index"])

# removing the empty-string token produced by splitting on consecutive spaces
try:
  target_interest.remove("")
except:
  pass

print(target_interest)

['add', 'fix', 'change', 'merge', 'update', 'test', 'remove', 'support', 'use', 'xla', 'build', 'request', 'pull', 'op', 'tests', 'tf', 'make', 'ops', 'error', 'api']
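
The selection above boils down to counting whitespace-separated tokens, discarding stopwords and empty tokens, and keeping the most frequent ones. A standard-library sketch of that idea follows; the tiny stopword set and the sample messages are made up for illustration, whereas the notebook uses the NLTK English list and the full `text_clean` column:

```python
from collections import Counter


def top_words(messages, stop_words, n):
    """Return the n most frequent tokens that are not stopwords."""
    counts = Counter()
    for message in messages:
        # split(" ") mirrors the notebook and can yield empty tokens
        for token in message.split(" "):
            if token and token not in stop_words:
                counts[token] += 1
    return [word for word, _ in counts.most_common(n)]


# the double space in the last message produces an empty token, which is skipped
sample = ["fix typo in docs", "fix xla build", "fix  xla for test"]
print(top_words(sample, stop_words={"in", "for"}, n=2))
```

Filtering the empty token inside the loop avoids having to request one extra word and remove `""` afterwards.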


In [18]:
# similarity threshold for deciding whether a commit message belongs to an interest or not
threshold = 0.2

if not conf["generate_interests"]:
    target_interest = interest_fixer("""
    fix add merge remove update pull request python docs tensorflow generated
    """)

print(f"using generated target_interest: {conf['generate_interests']}")

using generated target_interest: True


In [19]:
# custom stopwords as a union with the nltk stopwords and the target_interest found by value counting all the words in the commit messages
bonus_words = stopwords.words('english') + (target_interest)

# dumping for the plotting notebook
pickle.dump(bonus_words, open(names["bonus-words"], 'wb'))

In [20]:
## TODO make this use the get embeddings function
# getting results.
y, similarity = make_dataset(embs, targets=target_interest, model=model, target_threshold=threshold)
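
The implementation of `make_dataset` lives in `scripts` and is not shown here. One plausible reading, given the `threshold` above and the later `y-1` passed to UMAP, is that each message is compared with every target word by cosine similarity and receives the 1-based index of the best match, or 0 when no similarity reaches the threshold. The sketch below illustrates that assumed scheme on toy 2-dimensional vectors; `label_by_similarity` is a hypothetical name and the real function may differ:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_by_similarity(message_embs, target_embs, threshold):
    """Assumed scheme: 1-based index of the closest target, 0 if below threshold."""
    labels = []
    for emb in message_embs:
        sims = [cosine(emb, target) for target in target_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        labels.append(best + 1 if sims[best] >= threshold else 0)
    return labels


targets = [[1.0, 0.0], [0.0, 1.0]]  # toy stand-ins for target word embeddings
messages = [[0.9, 0.1], [0.1, 0.8], [-1.0, -1.0]]
print(label_by_similarity(messages, targets, threshold=0.2))
```

Under this reading, subtracting 1 from the labels turns the 0 into -1, which is the value umap-learn treats as "unlabelled" in semi-supervised mode.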

In [21]:
print(y[5])

10


In [22]:
set(y)

list(y).index(0)

y[4]

0

In [23]:
set(target_interest)

{'add',
 'api',
 'build',
 'change',
 'error',
 'fix',
 'make',
 'merge',
 'op',
 'ops',
 'pull',
 'remove',
 'request',
 'support',
 'test',
 'tests',
 'tf',
 'update',
 'use',
 'xla'}

In [24]:

# list of interest words, one per commit, looked up from the interest ids in y
target_interest_list = [target_interest[i-1] for i in y]

print(len(target_interest_list) - len(df))

print(target_interest_list[0:10])

names["target-interest-list"] = "data/target-interest-list.pkl"

pickle.dump(target_interest_list, open(names["target-interest-list"], 'wb'))

0
['tests', 'xla', 'api', 'xla', 'api', 'xla', 'tf', 'api', 'api', 'xla']
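
Note that Python list indexing wraps around for negative indices: if an id of 0 in `y` means that no target word was similar enough (an assumption, since `make_dataset` is not shown), then `target_interest[0-1]` silently returns the last word, `'api'`. A hedged sketch of a mapping that keeps such rows distinct; `interest_word` and the `"none"` placeholder are illustrative choices:

```python
def interest_word(interest_id, target_interest, unlabelled="none"):
    """Map a 1-based interest id to its word; id 0 is assumed to mean 'no match'."""
    if interest_id == 0:
        return unlabelled
    return target_interest[interest_id - 1]


words = ["add", "fix", "api"]
print(words[0 - 1])  # the wrap-around behaviour: index -1 is the last element
print([interest_word(i, words) for i in [1, 0, 3]])
```

If the assumption holds, this would also explain why `api` dominates the counts further down: it would be collecting both genuine `api` matches and every unmatched commit.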


In [25]:
set(target_interest_list)
len(set(target_interest_list))

20

In [26]:
try:
  if conf["fresh_data"]: raise Exception

  uemb_semi_s = pickle.load( open( names[f"uembs-s-{device}"], "rb" ) )
  
except:
  # used to have just nn = 100 at 0.2 similarity, and not the metric and target weight
  # target_weight is between 0 and 1, 0.5 is the default; we used 1 for a while
  uemb_semi_s = UMAP(n_neighbors=15, min_dist=0.0, target_weight=0.5).fit_transform(embs, y-1)
  pickle.dump( uemb_semi_s, open( names[f"uembs-s-{device}"], "wb" ) )

  conf["fresh_s_uembs"] = True

print(f"fresh semi supervised uembs: {conf['fresh_s_uembs']}")

fresh semi supervised uembs: False


In [27]:
cluster_semi_s_hdb = HDBSCAN(min_cluster_size=100, min_samples=20, metric='euclidean', cluster_selection_method='eom').fit(uemb_semi_s)

## Cluster & Topic Inspection

In [28]:
result_2d = result_df_maker(uembs, clusters_2d.labels_, df["text_clean"].to_numpy(), bonus_words=target_interest)

vcr = result_2d[["cluster_label", "group_topics"]].groupby(["cluster_label", "group_topics"])["group_topics"].count().reset_index(name="commit_count").sort_values(by="commit_count", ascending=False).head(20)

vcr

Unnamed: 0,cluster_label,group_topics,commit_count
0,-1,to for in the of,28546
78,77,to for in the xlagpu,4351
149,148,to for tflite in the,3183
161,160,tensor to for tensors in,1911
59,58,gpu for to cuda the,1735
54,53,from benoitsteinermaster dongjoon taehoonleefix_typos terrytangyuanpatch,1650
147,146,to for the in of,1533
150,149,tfdata to service for the,1361
22,21,nfc to in for the,1357
70,69,to graph in the for,1305


In [29]:
result_2d_semi = result_df_maker(uemb_semi_s, y, df["text_clean"].to_numpy(), bonus_words=target_interest)

vcr = result_2d_semi[["cluster_label", "group_topics"]].groupby(["cluster_label", "group_topics"])["group_topics"].count().reset_index(name="commit_count").sort_values(by="commit_count", ascending=False).head(20)

vcr

# cluster_label is here the interest label

Unnamed: 0,cluster_label,group_topics,commit_count
0,0,to for in the of,37775
16,16,to for in the tfdata,8823
2,2,typo in of the for,6357
4,4,from branch master into changes,6200
6,6,for to disable in the,5755
10,10,to for in the xlagpu,5257
5,5,to readmemd bump for revision,3890
7,7,unused cleanup code removed from,3014
11,11,for to the file windows,2555
3,3,internal of rollback changes automated,2537


In [30]:
# finding most common words in 20 most common group_topics to see if we need more stopwords
vcr["group_topics"].apply(lambda x: x.split(" ")).explode().value_counts().head(20)

to           14
the          14
for          13
in           10
of            5
changes       3
comments      2
automated     2
disable       2
branch        2
from          2
typo          2
example       1
usage         1
create        1
tflite        1
makefile      1
adding        1
missing       1
added         1
Name: group_topics, dtype: int64

# Making Result DF

We can run a regression on the increase per interest and pick whichever interest is most likely at a given point in time

In [31]:
dfres = df[["text_clean", "time_sec", "time_week", "user"]].copy()

#dfres["time_week"] = dfres["time_sec"].apply(lambda x: datetime.fromtimestamp(x).isocalendar()[1])

#dfres["time_week"] = dfres["time_sec"].apply(lambda x: x//604800)

dfres["x"] = uemb_semi_s[:, 0]

dfres["y"] = uemb_semi_s[:, 1]

dfres["cluster"] = cluster_semi_s_hdb.labels_



dfres["interest_id"] = list(y)

# -1 to make up for adding 1 earlier
dfres["target_interest"] = dfres["interest_id"].apply(lambda x: target_interest[x-1])

# find topic by interest instead
topic_dict = topic_by_clusterId(dfres["text_clean"].to_numpy(), dfres["interest_id"].to_numpy(), bonus_words=bonus_words)

dfres["topic"] = dfres["interest_id"].apply(lambda x: " ".join(list(topic_dict[x])))

dfres = dfres[dfres["cluster"] != -1]

# Pickling the dfres for plotting in other notebook
dfres.to_pickle(names["dfres"])

dfres

Unnamed: 0,text_clean,time_sec,time_week,user,x,y,cluster,interest_id,target_interest,topic
8,migrate xla to use tsl statush,1664004666,2751,Mehdi Amini,-2.440187,-6.688982,106,10,xla,xlagpu xlapython hlo xlacpu nfc
9,move tensorflowtslbuild_configstream_executor_no__cuda to stream executor itself,1664001037,2751,Mehdi Amini,3.505764,-2.417651,159,0,api,function added gpu mlir move
12,xlagpu nfc remove xla_enable_xlir macros and always compile xla runtime for gpu,1663984135,2751,Taehee Jeong,0.279381,-9.581120,23,10,xla,xlagpu xlapython hlo xlacpu nfc
15,xlagpu nfc remove xla_enable_xlir macros and always compile xla runtime for gpu,1663975730,2751,Eugene Zhulenev,0.272507,-9.585547,23,10,xla,xlagpu xlapython hlo xlacpu nfc
17,improve tf lite gpu docs organization and discoverability,1663974673,2751,Joe Fernandez,0.816003,-4.919706,100,16,tf,tfdata tflite service tensorflow documentation
...,...,...,...,...,...,...,...,...,...,...
126521,tensorflow a few small updates,1447382085,2393,Vijay Vasudevan,3.866246,-2.464835,81,5,update,readmemd bump revision source version
126525,removed license headers,1447301260,2393,Illia Polosukhin,2.840408,1.725416,68,7,remove,unused code cleanup clean removed
126526,added apache headers,1447301042,2393,Illia Polosukhin,1.119486,0.359533,113,12,request,changes branch requested comments address
126529,upstream a number of changes to git,1447271531,2392,Vijay Vasudevan,1.228989,2.250248,60,3,change,internal changes rollback automated changelist


# Exploration

## Collaborative filtering

### Data Restructuring

- Group by user to count, per user, how many of their commits belong to each `target_interest`
- With users clustered into a small number of user groups, group the df again by user group and time, which makes it feasible to do regression on each group's activity over time

In [160]:
userdf = pd.DataFrame({
    "user": df["user"],
#    "time_week" : list(df["time_week"]),
    "target_interest_id" : list(y),
    "cluster_id" : list(cluster_semi_s_hdb.labels_)
    })


userdf["target_interest"] = userdf["target_interest_id"].apply(lambda x: target_interest[x-1])


userdf = userdf.groupby("user").agg(list).reset_index()


print(len(userdf))

3803


### User Grouping

In [161]:
#userdf_dict = userdf[["user","time_week","target_interest_word"]].copy()
userdf_with_dict = userdf[["user","target_interest"]].copy()


userdf_with_dict["target_interest_dict"] = userdf["target_interest"].apply(lambda x: dict_per_user(x, target_interest))

userdf_with_dict.drop(columns=["target_interest"], inplace=True)

print(len(userdf_with_dict))

userdf_with_dict.head(3)

3803


Unnamed: 0,user,target_interest_dict
0,(David) Siu-Kei Muk,"{'add': 0, 'fix': 1, 'change': 0, 'merge': 8, 'update': 0, 'test': 1, 'remove': 0, 'support': 0, 'use': 0, 'xla': 0, 'build': 0, 'request': 0, 'pull': 0, 'op': 0, 'tests': 0, 'tf': 0, 'make': 0, '..."
1,103yiran,"{'add': 0, 'fix': 0, 'change': 0, 'merge': 0, 'update': 0, 'test': 0, 'remove': 0, 'support': 0, 'use': 0, 'xla': 0, 'build': 0, 'request': 0, 'pull': 0, 'op': 0, 'tests': 0, 'tf': 0, 'make': 0, '..."
2,1e100,"{'add': 0, 'fix': 0, 'change': 0, 'merge': 0, 'update': 1, 'test': 0, 'remove': 3, 'support': 0, 'use': 0, 'xla': 0, 'build': 3, 'request': 0, 'pull': 0, 'op': 0, 'tests': 0, 'tf': 0, 'make': 0, '..."
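
`dict_per_user` is defined in `scripts`; judging from the output above it turns a user's list of interest words into a count per target word, including zeros for words the user never touched. A sketch of an equivalent function under that assumption, named `count_interests` here to avoid implying it is the original:

```python
from collections import Counter


def count_interests(words, target_interest):
    """Count occurrences of each target word, keeping zero entries and order."""
    counts = Counter(words)
    return {target: counts.get(target, 0) for target in target_interest}


targets = ["add", "fix", "merge"]
print(count_interests(["merge", "fix", "merge"], targets))
```

Keeping the keys in the order of `target_interest` is what lets the dictionaries be stacked into a matrix with one column per interest.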


In [162]:
# normalising the userdf
#userdf = (userdf-userdf.min())/(userdf.max()-userdf.min())

target_interest_matrix = np.array(userdf["target_interest"].apply(lambda x: dict_per_user(x, target_interest)))

df_user_interest_matrix = pd.DataFrame(list(target_interest_matrix))

target_interest_matrix = df_user_interest_matrix.to_numpy()

df_user_interest_matrix.insert(0, "user", userdf["user"])

df_user_interest_matrix

Unnamed: 0,user,add,fix,change,merge,update,test,remove,support,use,...,build,request,pull,op,tests,tf,make,ops,error,api
0,(David) Siu-Kei Muk,0,1,0,8,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,4
1,103yiran,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,1e100,0,0,0,0,1,0,3,0,0,...,3,0,0,0,0,0,0,0,0,3
3,372046933,0,0,0,0,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4F2E4A2E,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3798,黄璞,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3799,黄鑫,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3800,박상준,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3801,이장후,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [163]:
target_interest_matrix[0:5]

array([[0, 1, 0, 8, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 3],
       [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]])

In [164]:
colab_clusters = HDBSCAN(min_cluster_size=90, min_samples=10, metric='euclidean', cluster_selection_method='eom').fit(target_interest_matrix)

print(f"""
    Full dimensionality clustering output:
    Len of colab clusters: {len(colab_clusters.labels_)}
    Number of clusters: {len(set(colab_clusters.labels_)) - 1}
    Number of rows as outliers: {colab_clusters.labels_.tolist().count(-1)}
""")



    Full dimensionality clustering output:
    Len of colab clusters: 3803
    Number of clusters: 8
    Number of rows as outliers: 1374



In [165]:
colab_umap = UMAP(n_neighbors=15, min_dist=0.0).fit_transform(target_interest_matrix)

In [166]:
colab_resdf = pd.DataFrame({
    "x" : colab_umap[:, 0], 
    "y" : colab_umap[:, 1], 
    "cluster" : colab_clusters.labels_
})

# with few clusters you can turn on and off outliers with the -1 label
#colab_resdf = colab_resdf[colab_resdf["cluster"] != -1]

#turning cluster to str for discrete color
colab_resdf["cluster"] = colab_resdf["cluster"].astype(str)

fig_colab = px.scatter(colab_resdf, x="x", y="y", color="cluster", title="Colab clustering", width=800, height=800, range_x=[-25, 25], range_y=[-25, 25])

fig_colab.show()

# TODO add one visualisation without time grouping
# this would give us the "true" user groups, and then we could see if they moved around without breaking up the group too much
# also it is not a bug that there is overlap of clusters, as the clustering takes place before umap

In [167]:
## TODO
# - now we have to do this per user.
# - we need to look at what a given user is "comitting" about interest wise, and then see which cluster that user is in
# - then when we ...

## Making df grouped on usergroup and time

In [168]:
# making dict to connect username and cluster id
userdId_groupID_dict = dict(zip(df_user_interest_matrix["user"], colab_clusters.labels_))

if len(userdId_groupID_dict) - len(df_user_interest_matrix) != 0:
    print("WARNING: dict and df_user_interest_matrix are not the same length")

In [169]:
df.columns

Index(['user', 'time_sec', 'text', 'text_clean', 'time_week'], dtype='object')

In [170]:
usergroupdf = pd.DataFrame({
    "user": df["user"],
    "time_sec" : list(df["time_sec"]),
    "target_interest_id" : list(y),
    })

# mapping in the target interest
usergroupdf["target_interest"] = usergroupdf["target_interest_id"].apply(lambda x: target_interest[x-1])

# Setting cluster id on the users to get the cluster id for each user
usergroupdf["user_group_id"] = usergroupdf["user"].apply(lambda x: userdId_groupID_dict[x])

# Making time sec into time day
usergroupdf["time_day"] = usergroupdf["time_sec"].apply(lambda x: x//(60*60*24))

usergroupdf.sample(5)

Unnamed: 0,user,time_sec,target_interest_id,target_interest,user_group_id,time_day
65312,River Riddle,1567477487,0,api,-1,18142
30742,Jacques Pienaar,1611193667,10,xla,-1,18648
99243,Michael Case,1520635266,3,change,-1,17599
74580,Vishnuvardhan Janapati,1556923367,5,update,-1,18019
25508,Andrew Selle,1618341389,16,tf,-1,18730


In [171]:
usergroupdf = usergroupdf.groupby(["user_group_id", "time_day"]).agg(list).reset_index()


print(f'User group equal to cluster groups: {len(usergroupdf["user_group_id"].unique()) == len(set(colab_clusters.labels_))}')
print(len(usergroupdf))


usergroupdf.head(3)

User group equal to cluster groups: True
5400


Unnamed: 0,user_group_id,time_day,user,time_sec,target_interest_id,target_interest
0,-1,16746,"[Vijay Vasudevan, Vijay Vasudevan, Vijay Vasudevan, Vijay Vasudevan, Manjunath Kudlur, Manjunath Kudlur]","[1446935606, 1446933504, 1446922181, 1446875858, 1446863831, 1446856078]","[0, 16, 5, 0, 5, 0]","[api, tf, update, api, update, api]"
1,-1,16747,"[Manjunath Kudlur, Manjunath Kudlur, Vijay Vasudevan]","[1447024477, 1447019816, 1447011446]","[0, 5, 5]","[api, update, update]"
2,-1,16748,"[Manjunath Kudlur, Manjunath Kudlur, Manjunath Kudlur, Manjunath Kudlur, Manjunath Kudlur, Manjunath Kudlur, Manjunath Kudlur]","[1447049284, 1447047697, 1447046850, 1447045469, 1447039735, 1447039421, 1447033308]","[0, 0, 16, 0, 16, 0, 0]","[api, api, tf, api, tf, api, api]"


In [172]:
# making the interest matrix again for user groups
usergroup_target_interest_matrix = np.array(usergroupdf["target_interest"].apply(lambda x: dict_per_user(x, target_interest)))

df_group_interest_matrix = pd.DataFrame(list(usergroup_target_interest_matrix))

usergroup_target_interest_matrix = df_group_interest_matrix.to_numpy()

#df_usergroup_interest_matrix = usergroupdf[["user_group_id", "time_day", "target_interest"]].copy()

df_group_interest_matrix.insert(0, "user_group_id", usergroupdf["user_group_id"])
df_group_interest_matrix.insert(1, "time_day", usergroupdf["time_day"])


df_group_interest_matrix

Unnamed: 0,user_group_id,time_day,add,fix,change,merge,update,test,remove,support,...,build,request,pull,op,tests,tf,make,ops,error,api
0,-1,16746,0,0,0,0,2,0,0,0,...,0,0,0,0,0,1,0,0,0,3
1,-1,16747,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,-1,16748,0,0,0,0,0,0,0,0,...,0,0,0,0,0,2,0,0,0,5
3,-1,16749,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
4,-1,16750,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5395,7,19208,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
5396,7,19221,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
5397,7,19227,0,0,0,0,0,0,0,0,...,0,0,0,0,0,3,0,0,0,0
5398,7,19228,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [173]:
df_group_interest_matrix.describe()

Unnamed: 0,user_group_id,time_day,add,fix,change,merge,update,test,remove,support,...,build,request,pull,op,tests,tf,make,ops,error,api
count,5400.0,5400.0,5400.0,5400.0,5400.0,5400.0,5400.0,5400.0,5400.0,5400.0,...,5400.0,5400.0,5400.0,5400.0,5400.0,5400.0,5400.0,5400.0,5400.0,5400.0
mean,0.864259,17945.749074,0.123519,1.177222,0.469815,1.148148,0.72037,1.065741,0.558148,0.044074,...,0.473148,0.023148,0.02463,0.022222,0.260741,1.633889,0.060556,0.173148,0.38,7.182778
std,2.36295,696.042434,0.426614,1.986498,1.04511,2.41029,1.720504,2.144945,1.283181,0.233161,...,1.084681,0.171127,0.181433,0.153573,0.716653,2.914132,0.273277,0.561978,0.807476,11.033168
min,-1.0,16746.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-1.0,17347.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,17912.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,2.0,18528.0,0.0,2.0,0.0,1.0,1.0,1.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,1.0,11.0
max,7.0,19260.0,8.0,27.0,10.0,29.0,50.0,23.0,28.0,4.0,...,14.0,3.0,3.0,2.0,10.0,32.0,3.0,17.0,10.0,124.0


In [177]:
# making list of dfs based on usergroup

usergroup_df_list = []

for usergroup in df_group_interest_matrix["user_group_id"].unique():
  usergroup_df_list.append(df_group_interest_matrix[df_group_interest_matrix["user_group_id"] == usergroup])

In [178]:
for x in usergroup_df_list:
  print(x.shape)

(2480, 22)
(742, 22)
(444, 22)
(631, 22)
(195, 22)
(262, 22)
(206, 22)
(302, 22)
(138, 22)


# Group Predictions

This will be scuffed

# Trying with 

## Exploring water simulation based prediction potential

In [175]:

print(f"""
    Bounds of the uembs
    
    x axis:
    {min(uembs[:,0])}
    {max(uembs[:,0])}
    
    "y axis"
    {min(uembs[:,1])}
    {max(uembs[:,1])}
""")


    Bounds of the uembs
    
    x axis:
    -417.2867736816406
    2705.583984375
    
    "y axis"
    -314.9382019042969
    3557.76806640625



We can set the frame for the water prediction to ±25 on both axes

512 x 512*2 pixels in that space

Generate the next frame in the animation

Give two frames of the past
- can give one frame per week per user
- can have one color per user group
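
To turn the 2D embedding into image-like frames, points can be binned into a fixed pixel grid over the ±25 window. The sketch below does this in plain Python for a small grid; the function name, the square 512 x 512 default and the choice to drop points outside the window are assumptions for illustration, not decisions made in this notebook:

```python
def rasterize(points, size=512, lo=-25.0, hi=25.0):
    """Bin (x, y) points into a size x size grid of counts over [lo, hi)."""
    grid = [[0] * size for _ in range(size)]
    for x, y in points:
        if not (lo <= x < hi and lo <= y < hi):
            continue  # ignore points outside the frame
        # min() guards against rounding up to `size` for values just below hi
        col = min(size - 1, int((x - lo) / (hi - lo) * size))
        row = min(size - 1, int((y - lo) / (hi - lo) * size))
        grid[row][col] += 1
    return grid


frame = rasterize([(-25.0, -25.0), (0.0, 0.0), (0.0, 0.0), (30.0, 0.0)], size=4)
for row in frame:
    print(row)
```

The extreme values printed above lie far outside ±25, so at least those points would be dropped by this sketch; clipping them to the border instead is an alternative worth weighing.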