# Describing the functionality of pandas and the basic functions from the face2face library

**Author**: Andreas Kruff

**Version**: 12.05.2020

**Description**: This tutorial describes the underlying pandas methods that are used to build the face2face methods and functions of this library.

## Table of Contents
#### [Implement the contact duration function](#contact_duration)
#### [Implement the triangle duration function](#triangle_duration)
#### [Implement the inter-contact duration function](#inter-contact_duration)
#### [Implement the average_degree function](#average_degree)
#### [Implement the group_list_degree function](#group_list_degree)

# Explanation of the distribution methods

A basic way to analyze tij datasets is to measure the probability distribution of the contact durations in a given dataset. To do so, you need to know how to interpret the data. The tij datasets in this library were recorded with RFID chips that exchanged packets with each other whenever two people wearing them stood face to face at close range. An exchange becomes part of the dataset only if a whole packet was transferred between the two chips, which filters real contacts from random encounters.

## How to implement the calculate_contact_duration function 
<a name="contact_duration"></a>

To analyze the distribution, the dataset first has to be prepared with Python and pandas. The next steps show how the function "calculate_contact_duration" prepares the dataset to measure the probability, but first we have to import a dataset to work with.

The cell below can be ignored after it has been executed once. It adds the parent directory to the path to give access to the data and the functions of this library.

In [4]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [5]:
from face2face.imports.load_all_data import Data

df = Data("WS16")

In [6]:
print(df.interaction.head(25))

             Time    i    j
Index                      
0      1480486100  125  130
1      1480486100    7  130
2      1480486100    9  110
3      1480486120    9  130
4      1480486160  125  130
5      1480486180    9   21
6      1480486200    9   21
7      1480486200    7  130
8      1480486220    9   21
9      1480486240    9   21
10     1480486240    7  130
11     1480486240  125   21
12     1480486260  125   21
13     1480486260    7  130
14     1480486280  125   21
15     1480486280    7  130
16     1480486300  125   21
17     1480486300    7  130
18     1480486320  125   21
19     1480486320    7  130
20     1480486340  125   21
21     1480486340    7  110
22     1480486340    7  130
23     1480486360  125   21
24     1480486360    7  130


When you look at the first 25 entries of this tij dataset, you can see that the two people with ID 125 and ID 130 talked twice in this extract. The dataset shows a contact at timestamp 1480486100, which means they talked from 1480486080 to 1480486100, for at least 20 seconds and at most 39 seconds. At timestamp 1480486160 there is another contact between these two people, but a difference of more than 20 seconds between two timestamps means that there was a break in their conversation. So as long as the difference is 20 seconds, you keep accumulating the contact duration; once the difference is bigger than 20 seconds, a new contact starts. This is the main part of this function.
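The accumulation rule can be sketched in plain Python before moving to pandas. The helper name `split_into_contacts` is hypothetical and not part of the library; the timestamps are taken from the extract above.

```python
def split_into_contacts(timestamps, resolution=20):
    """Group the timestamps of one pair into contacts.

    Hypothetical helper: timestamps that are at most one resolution
    step apart belong to the same contact.
    """
    durations = []
    previous = None
    for t in sorted(timestamps):
        if previous is not None and t - previous <= resolution:
            durations[-1] += resolution   # the contact continues
        else:
            durations.append(resolution)  # a new contact starts
        previous = t
    return durations

# Pair (125, 130): two separate contacts of 20 seconds each
contact_durations = split_into_contacts([1480486100, 1480486160])
print(contact_durations)  # [20, 20]

# Pair (9, 21): four consecutive timestamps form one contact
print(split_into_contacts([1480486180, 1480486200, 1480486220, 1480486240]))  # [80]
```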

In [7]:
tij_data_sorted = df.interaction.sort_values(by=["i", "j"])

print(tij_data_sorted.head(25))

              Time  i   j
Index                    
107076  1480583020  0  26
124647  1480589960  1   0
136534  1480592120  1   0
2519    1480489180  1   2
22905   1480501800  1   2
27137   1480502720  1   2
45978   1480509360  1   2
45991   1480509380  1   2
46026   1480509440  1   2
46043   1480509460  1   2
46068   1480509500  1   2
46078   1480509520  1   2
46091   1480509540  1   2
46130   1480509600  1   2
46142   1480509620  1   2
46187   1480509700  1   2
46202   1480509720  1   2
46215   1480509740  1   2
46226   1480509760  1   2
46240   1480509780  1   2
46254   1480509800  1   2
46268   1480509820  1   2
46279   1480509840  1   2
46295   1480509860  1   2
46306   1480509880  1   2


To get a first overview of all contact pairs, you can use the sort_values function from pandas and sort the dataset by ID i and ID j.

In [8]:
tij_data_sorted["diff"] = tij_data_sorted.groupby(["i", "j"])["Time"].diff()

In [9]:
print(tij_data_sorted.head(25))

              Time  i   j        diff
Index                                
107076  1480583020  0  26         nan
124647  1480589960  1   0         nan
136534  1480592120  1   0  2160.00000
2519    1480489180  1   2         nan
22905   1480501800  1   2 12620.00000
27137   1480502720  1   2   920.00000
45978   1480509360  1   2  6640.00000
45991   1480509380  1   2    20.00000
46026   1480509440  1   2    60.00000
46043   1480509460  1   2    20.00000
46068   1480509500  1   2    40.00000
46078   1480509520  1   2    20.00000
46091   1480509540  1   2    20.00000
46130   1480509600  1   2    60.00000
46142   1480509620  1   2    20.00000
46187   1480509700  1   2    80.00000
46202   1480509720  1   2    20.00000
46215   1480509740  1   2    20.00000
46226   1480509760  1   2    20.00000
46240   1480509780  1   2    20.00000
46254   1480509800  1   2    20.00000
46268   1480509820  1   2    20.00000
46279   1480509840  1   2    20.00000
46295   1480509860  1   2    20.00000
46306   1480

As a next step we need the differences between all timestamps at which the same two people talked to each other. You can use the groupby function from pandas to collect all timestamps for every pair of contacts. With the .diff() function you can then measure the difference between every timestamp and the previous timestamp of the same pair. The output is a pandas Series that we attach to our dataframe under the column name "diff". As you can see, there are a lot of 20-second differences but also some "nan" values. These "nan" values exist because there is always a first occurrence of two people talking to each other, and then there is no previous timestamp to measure the difference against. Such a row means that these two people have talked for 20 seconds so far.
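A small made-up example shows where the "nan" values come from. The toy frame only imitates the layout of `df.interaction`, and sorting by "Time" as a third key is an extra precaution that is not part of the tutorial code.

```python
import pandas as pd

# Toy contact list in the same layout as df.interaction (values are made up)
toy = pd.DataFrame({
    "Time": [100, 120, 140, 200, 100, 160],
    "i":    [1,   1,   1,   1,   2,   2],
    "j":    [2,   2,   2,   2,   3,   3],
})

toy_sorted = toy.sort_values(by=["i", "j", "Time"])
# The first row of every (i, j) pair has no predecessor, so its diff is NaN
toy_sorted["diff"] = toy_sorted.groupby(["i", "j"])["Time"].diff()
print(toy_sorted)
```

The frame contains two pairs, so exactly two rows carry a missing difference.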

In [10]:
marker_list = []
for key, value in tij_data_sorted.iterrows():
    if value["diff"] > 20:
        marker_list.append(1)
    else:
        marker_list.append(0)

Our goal is to get the contact durations, so for every contact pair we need to accumulate all rows that are 20 seconds apart from the previous timestamp. For that I created a list of markers: 0 for nan values or differences of 20 seconds, and 1 for differences bigger than 20 seconds.
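The loop above can also be written as a single vectorized comparison. This is an alternative sketch on made-up values, not the code the library uses.

```python
import numpy as np
import pandas as pd

diffs = pd.Series([np.nan, 20.0, 60.0, 20.0, np.nan, 40.0], name="diff")

# Loop version, equivalent to the cell above
marker_loop = [1 if d > 20 else 0 for d in diffs]

# Vectorized version: NaN > 20 evaluates to False, so NaN rows become 0
marker_vec = (diffs > 20).astype(int)

print(marker_vec.tolist())  # [0, 0, 1, 0, 0, 1]
```

On large datasets the vectorized form is usually much faster than iterating with iterrows.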

In [11]:
tij_data_sorted["Marker"] = marker_list
print(tij_data_sorted.head(25))

              Time  i   j        diff  Marker
Index                                        
107076  1480583020  0  26         nan       0
124647  1480589960  1   0         nan       0
136534  1480592120  1   0  2160.00000       1
2519    1480489180  1   2         nan       0
22905   1480501800  1   2 12620.00000       1
27137   1480502720  1   2   920.00000       1
45978   1480509360  1   2  6640.00000       1
45991   1480509380  1   2    20.00000       0
46026   1480509440  1   2    60.00000       1
46043   1480509460  1   2    20.00000       0
46068   1480509500  1   2    40.00000       1
46078   1480509520  1   2    20.00000       0
46091   1480509540  1   2    20.00000       0
46130   1480509600  1   2    60.00000       1
46142   1480509620  1   2    20.00000       0
46187   1480509700  1   2    80.00000       1
46202   1480509720  1   2    20.00000       0
46215   1480509740  1   2    20.00000       0
46226   1480509760  1   2    20.00000       0
46240   1480509780  1   2    20.00

Right after attaching the marker list to the dataframe, I can compute the cumulative sum of the marker column as a new column. With the help of the cumulative sum you can identify every ongoing conversation, because all of its rows share the same number in the new "Ind" column. That the first two rows have the same "Ind" although they do not belong to the same conversation is no problem, because we can use the IDs as well to unambiguously describe a conversation.

In [12]:
tij_data_sorted["Ind"] = tij_data_sorted["Marker"].cumsum()
print(tij_data_sorted.head(25))

              Time  i   j        diff  Marker  Ind
Index                                             
107076  1480583020  0  26         nan       0    0
124647  1480589960  1   0         nan       0    0
136534  1480592120  1   0  2160.00000       1    1
2519    1480489180  1   2         nan       0    1
22905   1480501800  1   2 12620.00000       1    2
27137   1480502720  1   2   920.00000       1    3
45978   1480509360  1   2  6640.00000       1    4
45991   1480509380  1   2    20.00000       0    4
46026   1480509440  1   2    60.00000       1    5
46043   1480509460  1   2    20.00000       0    5
46068   1480509500  1   2    40.00000       1    6
46078   1480509520  1   2    20.00000       0    6
46091   1480509540  1   2    20.00000       0    6
46130   1480509600  1   2    60.00000       1    7
46142   1480509620  1   2    20.00000       0    7
46187   1480509700  1   2    80.00000       1    8
46202   1480509720  1   2    20.00000       0    8
46215   1480509740  1   2    20

Therefore we can use the groupby function again and use .size() to count how often each combination of "Ind", "i" and "j" occurs, and create a new column "Number" for this count.

In [13]:
tij_data_sorted = tij_data_sorted.groupby(["Ind", "i", "j"]).size().reset_index(name="Number")
print(tij_data_sorted.head(25))

    Ind  i   j  Number
0     0  0  26       1
1     0  1   0       1
2     1  1   0       1
3     1  1   2       1
4     2  1   2       1
5     3  1   2       1
6     4  1   2       2
7     5  1   2       2
8     6  1   2       3
9     7  1   2       2
10    8  1   2      47
11    9  1   2       1
12   10  1   2       3
13   11  1   2       1
14   12  1   2       1
15   13  1   2       1
16   14  1   2       1
17   15  1   2       1
18   16  1   2       1
19   17  1   2       1
20   18  1   2       1
21   19  1   2       2
22   20  1   2       1
23   21  1   2       1
24   21  1   3       1


Now we can use the number of rows for each unique combination of "Ind", "i" and "j" to calculate the contact duration. Every such row means 20 more seconds of conversation, so we just have to multiply the "Number" column by 20 to get the contact duration.

In [14]:
delta_t_list = []
for key, value in tij_data_sorted.iterrows():
    delta_t = value["Number"] * 20
    delta_t_list.append(delta_t)

After that we can attach the list of contact durations to the dataframe, and with the help of groupby and .size() we can count the occurrences of the different $\Delta t$.

In [15]:
tij_data_sorted["DeltaT"] = delta_t_list

tij_amount_delta_t = tij_data_sorted.groupby("DeltaT").size().reset_index(name="AmountOfDeltaT")
print(tij_amount_delta_t.head())

   DeltaT  AmountOfDeltaT
0      20           34057
1      40            8230
2      60            3186
3      80            1780
4     100            1132


To get the total number of conversations in this dataset we can sum up the "AmountOfDeltaT" column. We need this to calculate the probability of the different contact durations.

In [16]:
cumulated_contacts = 0
for key, value in tij_amount_delta_t.iterrows():
    cumulated_contacts += value["AmountOfDeltaT"]

In [17]:
x_delta_t = tij_amount_delta_t["DeltaT"]

To get the probability of each contact duration, you divide the number of occurrences of that duration by the total number of contacts.

In [18]:
y_probability = []
for key, value in tij_amount_delta_t.iterrows():
    y_probability.append(value["AmountOfDeltaT"]/cumulated_contacts)
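The steps of this section can be condensed into one function. The following is a sketch under assumptions, not the library's implementation: the name `contact_duration_distribution`, the extra sort by "Time" and the toy data are all made up for illustration.

```python
import pandas as pd

def contact_duration_distribution(tij, resolution=20):
    """Return (delta_t values, probabilities) for a tij DataFrame.

    Hypothetical helper that mirrors the steps of this section.
    """
    data = tij.sort_values(by=["i", "j", "Time"]).copy()
    data["diff"] = data.groupby(["i", "j"])["Time"].diff()
    # A gap larger than the resolution starts a new contact
    data["Ind"] = (data["diff"] > resolution).astype(int).cumsum()
    contacts = data.groupby(["Ind", "i", "j"]).size().reset_index(name="Number")
    contacts["DeltaT"] = contacts["Number"] * resolution
    # Relative frequency of every contact duration
    probability = contacts["DeltaT"].value_counts(normalize=True).sort_index()
    return probability.index.to_numpy(), probability.to_numpy()

toy = pd.DataFrame({
    "Time": [100, 120, 140, 200, 100, 160],
    "i":    [1,   1,   1,   1,   2,   2],
    "j":    [2,   2,   2,   2,   3,   3],
})
x_toy, y_toy = contact_duration_distribution(toy)
print(x_toy, y_toy)
```

The toy data contains one contact of 60 seconds and three contacts of 20 seconds, so the probabilities are 0.75 for 20 seconds and 0.25 for 60 seconds.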

## How to implement the calculate_triangle_duration function 
<a name="triangle_duration"></a>

For the triangle duration you need to prepare the dataset so that you can see whether one person talked with at least two different people at the same timestamp. So the first step is to use the merge function of pandas to merge the tij dataset with itself on "Time" and on the IDs, matching "i" of the left copy with "j" of the right copy. The output is a dataframe whose rows contain two pairs where one person is part of both conversations at the same time.
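A made-up example with three people illustrates what the self-merge returns. The toy values are assumptions; only the merge call itself follows the tutorial.

```python
import pandas as pd

toy = pd.DataFrame({
    "Time": [100, 100, 100, 120],
    "i":    [1,   2,   1,   4],
    "j":    [2,   3,   3,   5],
})

# Match rows where the "i" of the left pair equals the "j" of the right pair
# at the same timestamp, i.e. one person appears in both contacts.
pairs = toy.merge(toy, left_on=["Time", "i"], right_on=["Time", "j"])
print(pairs)
```

Only one row survives: at time 100, person 2 is in contact with person 3 (left pair) and with person 1 (right pair). Whether persons 1 and 3 were also in contact still has to be checked.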

In [19]:
df_merge_1 = df.interaction.merge(df.interaction, left_on=["Time", "i"], right_on=["Time", "j"])
print(df_merge_1.head(15))

          Time  i_x  j_x  i_y  j_y
0   1480486400  110  130    7  110
1   1480486440    9   21  125    9
2   1480486460    9   21  125    9
3   1480486560   21  120  125   21
4   1480486580   21  120  125   21
5   1480486680    9   21  125    9
6   1480486680  110   76   40  110
7   1480486700    9   21  125    9
8   1480486720    9   21  125    9
9   1480486740    9   21  125    9
10  1480486760    9   21  125    9
11  1480486780    9   21  125    9
12  1480486800    9   21  125    9
13  1480486800   76  120   40   76
14  1480486820    9   21  125    9


To prove that these three people stood in front of each other and talked with each other at the same time, you have to merge the original dataframe onto this dataframe once more by "Time".

In [21]:
df_merge_2 = df_merge_1.merge(df.interaction, left_on=["Time"], right_on=["Time"])
print(df_merge_2.head(15))

          Time  i_x  j_x  i_y  j_y    i    j
0   1480486400  110  130    7  110  125   21
1   1480486400  110  130    7  110  110  130
2   1480486400  110  130    7  110    7  110
3   1480486400  110  130    7  110    7  130
4   1480486440    9   21  125    9    9   21
5   1480486440    9   21  125    9  125    9
6   1480486440    9   21  125    9  125   21
7   1480486460    9   21  125    9    9   21
8   1480486460    9   21  125    9  125    9
9   1480486460    9   21  125    9  125   21
10  1480486560   21  120  125   21   21  120
11  1480486560   21  120  125   21  125   21
12  1480486560   21  120  125   21   77  110
13  1480486580   21  120  125   21  125    9
14  1480486580   21  120  125   21   21  120


Now you have to filter the rows of the dataframe that are real triangles. To do so, we check whether the six ID columns contain every ID twice, in the combination that closes the triangle.

In [22]:
df_filter_triangle = df_merge_2[(df_merge_2["i_x"] == df_merge_2["j_y"])
                                & (df_merge_2["j_x"] == df_merge_2["j"])
                                & (df_merge_2["i_y"] == df_merge_2["i"])]
print(df_filter_triangle.head(15))

          Time  i_x  j_x  i_y  j_y    i    j
3   1480486400  110  130    7  110    7  130
6   1480486440    9   21  125    9  125   21
9   1480486460    9   21  125    9  125   21
17  1480486580   21  120  125   21  125  120
21  1480486680    9   21  125    9  125   21
33  1480486700    9   21  125    9  125   21
37  1480486720    9   21  125    9  125   21
43  1480486740    9   21  125    9  125   21
46  1480486760    9   21  125    9  125   21
51  1480486780    9   21  125    9  125   21
59  1480486800    9   21  125    9  125   21
64  1480486800   76  120   40   76   40  120
70  1480486820    9   21  125    9  125   21
89  1480487020  130  120   21  130   21  120
97  1480487020   21  120   77   21   77  120


As in the contact duration section above, you have to calculate the differences between the timestamps for the triangles you found.

In [25]:
df_no_duplicates["Diff"] = df_no_duplicates.groupby(["i_x", "j_x", "i_y"])["Time"].diff()

The following steps are pretty much the same as in the calculate_contact_duration tutorial.

In [26]:
marker_list = []
for key, value in df_no_duplicates.iterrows():
    if value["Diff"] > 20:
        marker_list.append(1)
    else:
        marker_list.append(0)
df_no_duplicates["Marker"] = marker_list
df_no_duplicates["Ind"] = df_no_duplicates["Marker"].cumsum()

df_merge_dd_gb = df_no_duplicates.groupby(["Ind", "i_x", "j_x", "i_y"]).size().reset_index(name="Number")

In [27]:
delta_t_list = []
for key, value in df_merge_dd_gb.iterrows():
    # every row of a triangle stands for 20 seconds
    delta_t_list.append(value["Number"] * 20)

df_merge_dd_gb["DeltaT"] = delta_t_list

tij_amount_delta_t = df_merge_dd_gb.groupby("DeltaT").size().reset_index(name="AmountOfDeltaT")

cumulated_contacts = 0
for key, value in tij_amount_delta_t.iterrows():
    cumulated_contacts += value["AmountOfDeltaT"]

x_delta_t = []
y_probability = []
for key, value in tij_amount_delta_t.iterrows():
    x_delta_t.append(value["DeltaT"])
    y_probability.append(value["AmountOfDeltaT"]/cumulated_contacts)


In [86]:
import math
import numpy as np
test7 = 10**(np.linspace(math.log10(min(x_delta_t)), math.log10(max(x_delta_t)), 50)) # 17 ?

n, bins = np.histogram(x_delta_t, bins=test7, density=True)

## How to implement the inter-contact duration function 
<a name="inter-contact_duration"></a>

The last function of the distribution methods computes the inter-contact duration. The inter-contact duration describes the time it takes a person to move from one conversation to the next, for example how long it takes person A to change contact from B to C. To compute it we need to prepare our dataset again.

First we have to create a list of all unique individuals that are part of this dataset. To ensure that you get every individual, you need to use both ID columns.

In [28]:
individuals = list(set(list(df.interaction.i) + list(df.interaction.j)))

In the next step you have to find every occurrence of a person in either the "i" or the "j" column to collect that person's timestamps. From the resulting time_stamp array you get the time differences, keeping only those larger than 20 seconds, because a difference of 20 seconds would be an ongoing conversation.

In [30]:
import numpy as np 
inter_event_duration = np.array([])
for ind in individuals:
    time_stamp = df.interaction[(df.interaction.i == ind) | (df.interaction.j == ind)].Time.values
    diff = time_stamp[1:] - time_stamp[:-1]
    inter_event_duration = np.append(inter_event_duration, diff[diff > 20])

Right after that you can measure the x and y values for a log-log-scaled distribution.

In [31]:
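bins = 10**np.linspace(np.log10(min(inter_event_duration)), np.log10(max(inter_event_duration)), 50)
# np.histogram returns the densities first and the bin edges second
y_probability, bin_edges = np.histogram(inter_event_duration, bins=bins, density=True)
x_delta_t = bin_edges[:-1]  # left edge of each bin, same length as y_probability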
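The logarithmic binning can be tried on a handful of made-up gaps. The sketch below assumes nothing about the library; it only shows the return order of `np.histogram` and one common choice for the x values on a log axis, the geometric bin centre.

```python
import numpy as np

# Made-up gaps between contacts in seconds (all larger than 20)
gaps = np.array([40, 60, 60, 100, 400, 1000, 5000])

# np.logspace(a, b, n) equals 10**np.linspace(a, b, n)
edges = np.logspace(np.log10(gaps.min()), np.log10(gaps.max()), 6)

# np.histogram returns the densities first and the bin edges second
density, edges_out = np.histogram(gaps, bins=edges, density=True)

# Geometric mean of neighbouring edges: the bin centre on a log axis
centres = np.sqrt(edges[:-1] * edges[1:])
print(len(density), len(edges_out), len(centres))  # 5 6 5
```

Six edges define five bins, so the density array is one element shorter than the edge array.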

# Explanation of the degree methods 

The degree of a node (an individual) is the number of other distinct nodes (individuals) that node was in contact with. Combined with the attributes of a node (such as age or gender), it lets us analyze whether specific groups are more or less communicative than others, both within and across communities that share an attribute.
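The definition can be checked on a few made-up contacts without any graph library. This sketch only illustrates the concept; the tutorial itself reads the degree from a networkx graph.

```python
from collections import defaultdict

# Made-up contacts as (Time, i, j) rows
contacts = [
    (100, 1, 2),
    (120, 1, 2),   # repeated contact with the same partner
    (140, 1, 3),
    (160, 2, 3),
    (180, 4, 1),
]

# Collect the distinct partners of every individual
neighbours = defaultdict(set)
for _, i, j in contacts:
    neighbours[i].add(j)
    neighbours[j].add(i)

degree = {node: len(partners) for node, partners in neighbours.items()}
print(degree)  # {1: 3, 2: 2, 3: 2, 4: 1}
```

Repeated contacts between the same two people do not raise the degree, because only distinct partners are counted.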

## How to implement the average_degree function
<a name="average_degree"></a>

First you have to import the create_network_from_data function from the face2face library and the networkx library, because this makes it much easier to get the degree of any node.

In [32]:
from face2face.imports.create_network import create_network_from_data
import networkx

Next we have to set up the dataframe for the metadata. As a first step, I replace every nan value with the string "NaN" to make it easier to filter. After that you create a networkx graph from the data so that the degree can be read from this object.

In [33]:
df_meta_nan = df.metadata.fillna("NaN")
network = create_network_from_data(df)

To analyze the degrees based on specific attributes, you need an overview of which attributes are used in the metadata dataset. In this case "ID" is the attribute in the first column, so you have to remove it, because it makes no sense to analyze the "ID".

In [34]:
parameter_list = []
for col in df_meta_nan.columns:
    parameter_list.append(col)
parameter_list = parameter_list[1:]

As a next step you have to split the "ID" column into multiple lists, based on the attribute values in the attribute columns, so that you can use them to measure the average degree in the next step. You don't want to use the rows where the value of the attribute you are analyzing is "NaN". That's why you have to filter the dataframe by this condition before using the groupby function in a for loop over every attribute.

In [35]:
liste1 = []
for i in parameter_list:
    liste = []
    # filter on the frame in which missing values were replaced by "NaN"
    dataframe = df_meta_nan.loc[df_meta_nan[i] != "NaN"]
    for region, df_region in dataframe.groupby(i):
        liste.append([df_region["ID"], region])
    liste1.append([i,liste])


Now that you have nested lists with the IDs for every attribute and every attribute value, you can use network.degree to get the degree of each ID in a list and compute the average degree by summing the degrees and dividing by the length of the list.

In [36]:
liste_degree= []
for i in liste1:
    liste3 = []
    for j in i[1]:
        avg_degree = 0
        for k in j[0]:
            avg_degree += network.degree[k]
        avg_degree = avg_degree/len(j[0])
        liste3.append([j[1],avg_degree])
    liste_degree.append([i[0], liste3])
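The same per-group average can be expressed with a pandas groupby. This is an alternative sketch on made-up data: the column name "Group" and the degree dictionary are hypothetical and stand in for one metadata attribute and for `network.degree`.

```python
import pandas as pd

# Made-up metadata and degrees; "Group" is a hypothetical attribute column
meta = pd.DataFrame({
    "ID":    [1, 2, 3, 4],
    "Group": ["A", "A", "B", None],
})
degree = {1: 3, 2: 1, 3: 2, 4: 5}

meta["degree"] = meta["ID"].map(degree)
# groupby skips rows whose key is missing, so ID 4 is left out
avg_degree = meta.groupby("Group")["degree"].mean()
print(avg_degree.to_dict())  # {'A': 2.0, 'B': 2.0}
```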

In the end you can also add the overall average degree for an attribute to the list by averaging the per-value averages computed before.

In [37]:
for i in liste_degree:
    avg_gesamt = 0
    for j in i[1]:
        avg_gesamt += j[1] 
    avg_gesamt = avg_gesamt/len(i[1])
    i[1].append(["GlobalAvG", avg_gesamt])
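Note that "GlobalAvG" is the mean of the group averages, so every attribute value counts once regardless of how many people it covers. The made-up numbers below show how this can differ from the mean over all individuals; which of the two is more appropriate depends on the question being asked.

```python
# Made-up groups: attribute value -> degrees of its members
groups = {"A": [4, 4, 4, 4], "B": [1]}

group_means = {name: sum(d) / len(d) for name, d in groups.items()}

# Mean of the group means (what "GlobalAvG" stores): every group counts once
mean_of_means = sum(group_means.values()) / len(group_means)

# Mean over all individuals: every person counts once
all_degrees = [d for degrees in groups.values() for d in degrees]
overall_mean = sum(all_degrees) / len(all_degrees)

print(mean_of_means, overall_mean)  # 2.5 3.4
```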

## How to implement the group_list_degree function
<a name="group_list_degree"></a>

In [38]:
df = Data("WS16")

The start of this implementation is pretty similar to the avg_degree_attr function so you can skip most of it.

In [39]:
from face2face.imports.create_network import create_network_from_data
import networkx

In [40]:
df_meta_nan = df.metadata.fillna("NaN")
network = create_network_from_data(df)

In [41]:
parameter_list = []
for col in df_meta_nan.columns:
    parameter_list.append(col)
parameter_list = parameter_list[1:]

In this case you just need one list per attribute and attribute value, together with the related IDs.

In [42]:
liste1 = []
for i in parameter_list:
    dataframe = df_meta_nan.loc[df_meta_nan[i] != "NaN"]
    for region, df_region in dataframe.groupby(i):
        liste1.append([i, region, df_region["ID"]])

As a next step you can replace the ID values by their degree values with the help of network.degree.

In [43]:
for i in liste1:
    Liste = []
    for j in i[2]:
        # j is already an ID, so look up its degree directly
        Liste.append(network.degree[j])
    i[2] = Liste[:]

The lists can be used to compare how communicative the groups with different attribute values are.