# Describing the functionality of pandas and the basic functions from the face2face library

**Authors**: Andreas Kruff, Johann Schaible, Marcos Oliveira

**Version**: 12.05.2020

**Description**: This tutorial describes the underlying pandas methods that are used to build the face2face methods for the face-to-face interaction durations of this toolbox.

## Table of Contents
#### [Implement the contact duration function](#contact_duration)
#### [Implement the triangle duration function](#triangle_duration)
#### [Implement the inter-contact duration function](#inter-contact_duration)

# Explanation of the distribution methods

A basic way to analyze tij datasets is to measure the probability distribution of the contact durations in the given dataset. To do so, you need to know how to interpret the data. The tij datasets included in this library were created with RFID chips that exchanged packages with each other whenever two people wearing the chips stood in front of each other at close range. Only if a whole package was exchanged between two RFID chips does this occurrence become part of the dataset. This was used to separate real contacts from random encounters.

## How to implement the calculate_contact_duration function 
<a name="contact_duration"></a>

Before the distribution can be analyzed, you have to prepare the dataset with the help of Python and pandas methods. The next steps show how the function "calculate_contact_duration" prepares the dataset to measure the probability, but first we have to import a dataset to work with.

In [1]:
import face2face as f2f

df = f2f.Data("Synthetic")

As the type check below shows, we imported the synthetic data set "Synthetic" as a Data object, which means it can contain up to two dataframes. With ".interaction" you can access the tij dataframe and with ".metadata" you can access the metadata, if metadata are provided. We can now use the Data object and its dataframes for the further analysis.

In [2]:
type(df)

face2face.imports.load_all_data.Data

When you take a look at the first 25 entries of this tij dataset, you can see that the two individuals with ID 7 and ID 182 occur multiple times in this extract. The data set shows that they talked at timestamp 20 for the first time, which means they talked for at least 20 seconds and for a maximum of 39 seconds, because of the resolution of this measurement method. At the following timestamps 40 to 100 you can see additional contacts between these two individuals, and because the difference between consecutive timestamps is 20 seconds, it was an ongoing conversation. So whenever the difference between two timestamps is 20 seconds, you have to accumulate the interaction durations until the difference is bigger than 20 seconds. A difference bigger than 20 seconds means there was a break in the conversation. This accumulation is the main part of this function.

In [4]:
df.interaction.head(25)

Index,Time,i,j
0,20,7,182
1,40,7,182
2,40,14,15
3,40,68,92
4,40,7,182
5,60,7,182
6,60,7,182
7,80,7,182
8,100,7,182
9,100,7,182


To get a first overview of all contact pairs, you can use the sort_values function from pandas and sort the dataset by the IDs i and j. Sorting is not necessary for the calculations, but it makes the ongoing interactions much easier to see in the output.

In [5]:
tij_data_sorted = df.interaction.sort_values(by=["i", "j"])
tij_data_sorted.head(25)

Index,Time,i,j
33437,188780,0,3
33439,188800,0,3
33444,188860,0,3
33594,189940,0,3
33751,190600,0,3
46703,254260,0,3
46714,254280,0,3
46986,254860,0,3
46996,254880,0,3
13227,81840,0,7


As a next step we need the differences between all timestamps at which the same two people talked to each other. You can use the groupby function from pandas to collect all timestamps for every pair of contacts. With the .diff() function, which we also used in the previous tutorial, you can measure the difference between every timestamp and the previous timestamp of the same pair. The result is a pandas Series that we attach to our dataframe under the column name "diff". As you can see, there are many differences of 20 seconds but also some "nan" values. These "nan" values exist because every pair has a first occurrence, for which there is no previous timestamp to measure the difference against. At that point the two people have talked for 20 seconds so far.

In [6]:
tij_data_sorted["diff"] = tij_data_sorted.groupby(["i", "j"])["Time"].diff()
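
To see what `groupby(...).diff()` returns, here is a minimal sketch on a small hand-made table. The values are invented for illustration and are not taken from the "Synthetic" data set. The first row of every pair has no predecessor and therefore gets a "nan" value.

```python
import pandas as pd

# Toy tij table; the values are made up for illustration only.
toy = pd.DataFrame({
    "Time": [20, 40, 100, 40, 60],
    "i":    [1, 1, 1, 2, 2],
    "j":    [2, 2, 2, 3, 3],
})

# Difference to the previous timestamp within each (i, j) pair.
toy["diff"] = toy.groupby(["i", "j"])["Time"].diff()
print(toy)
```

Each pair contributes exactly one "nan" value, so the number of "nan" entries equals the number of distinct pairs in the table.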

In [7]:
tij_data_sorted.head(25)

Index,Time,i,j,diff
33437,188780,0,3,
33439,188800,0,3,20.0
33444,188860,0,3,60.0
33594,189940,0,3,1080.0
33751,190600,0,3,660.0
46703,254260,0,3,63660.0
46714,254280,0,3,20.0
46986,254860,0,3,580.0
46996,254880,0,3,20.0
13227,81840,0,7,


Our goal is to get the contact durations, so for every contact pair we need to accumulate all rows that have a difference of 20 seconds to the previous timestamp. For that we create a list of markers: zero for nan values or differences of at most 20 seconds, and one for differences bigger than 20 seconds. A zero indicates the first occurrence of a contact between two individuals or an ongoing conversation; a one marks the start of a new interaction between the same two individuals. In the next step we calculate the cumulative sum of the markers. Because only the markers for differences bigger than 20 seconds change the cumulative sum, all rows that belong to one consecutive conversation end up with the same value.

In [8]:
marker_list = []
for key, value in tij_data_sorted.iterrows():
    if value["diff"] > 20:
        marker_list.append(1)
    else:
        marker_list.append(0)
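
The loop above can also be written without `iterrows`. The following sketch uses an invented toy table and relies on the fact that a comparison with "nan" evaluates to False, so first occurrences get the marker 0 just like in the loop. It is an alternative formulation for illustration and not necessarily how the toolbox implements this step.

```python
import numpy as np
import pandas as pd

# Toy table with a precomputed "diff" column (values invented for illustration).
toy = pd.DataFrame({
    "i":    [0, 0, 0, 0, 1],
    "j":    [3, 3, 3, 3, 2],
    "diff": [np.nan, 20.0, 60.0, 20.0, np.nan],
})

# nan > 20 is False, so first occurrences are marked with 0.
toy["Marker"] = (toy["diff"] > 20).astype(int)
# The cumulative sum increases by one at every break in a conversation.
toy["Ind"] = toy["Marker"].cumsum()
print(toy)
```

In this toy table the last row belongs to a different pair but shares its "Ind" value with the rows before it, which is why "i" and "j" are needed in addition to "Ind" to identify a conversation.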

In [9]:
tij_data_sorted["Marker"] = marker_list
tij_data_sorted.head(25)

Index,Time,i,j,diff,Marker
33437,188780,0,3,,0
33439,188800,0,3,20.0,0
33444,188860,0,3,60.0,1
33594,189940,0,3,1080.0,1
33751,190600,0,3,660.0,1
46703,254260,0,3,63660.0,1
46714,254280,0,3,20.0,0
46986,254860,0,3,580.0,1
46996,254880,0,3,20.0,0
13227,81840,0,7,,0


Right after we attach the marker list to the dataframe, we can create the already mentioned cumulative sum of the marker column as a new column. With the help of the cumulative sum you can identify every ongoing conversation, because all of its rows have the same number in the new "Ind" (index) column. Rows of different pairs can share the same "Ind" value, as the last two rows of the output show for the pairs (0, 3) and (0, 7). This is no problem, because we can use the IDs as well to unambiguously describe a conversation.

In [10]:
tij_data_sorted["Ind"] = tij_data_sorted["Marker"].cumsum()
tij_data_sorted.head(25)

Index,Time,i,j,diff,Marker,Ind
33437,188780,0,3,,0,0
33439,188800,0,3,20.0,0,0
33444,188860,0,3,60.0,1,1
33594,189940,0,3,1080.0,1,2
33751,190600,0,3,660.0,1,3
46703,254260,0,3,63660.0,1,4
46714,254280,0,3,20.0,0,4
46986,254860,0,3,580.0,1,5
46996,254880,0,3,20.0,0,5
13227,81840,0,7,,0,5


We can now use the groupby function again and use .size() to count how often each combination of "Ind", "i" and "j" occurs, and store this count in a new column "Number".

In [11]:
tij_data_sorted = tij_data_sorted.groupby(["Ind", "i", "j"]).size().reset_index(name="Number")
tij_data_sorted.head(25)

Index,Ind,i,j,Number
0,0,0,3,2
1,1,0,3,1
2,2,0,3,1
3,3,0,3,1
4,4,0,3,2
5,5,0,3,2
6,5,0,7,1
7,6,0,7,1
8,6,0,17,1
9,7,0,17,1


Now we can use the number of rows for each unique combination of "Ind", "i" and "j" to calculate the contact duration. Every row of such a combination means 20 more seconds of conversation, so we just have to multiply the "Number" column by 20 to get the contact duration.

In [12]:
delta_t_list = []
for key, value in tij_data_sorted.iterrows():
    delta_t = value["Number"] * 20
    delta_t_list.append(delta_t)

After that we can attach the list of contact durations to the dataframe and, with the help of groupby and .size(), count the occurrences of the different $\Delta t$.

In [13]:
tij_data_sorted["DeltaT"] = delta_t_list

tij_amount_delta_t = tij_data_sorted.groupby("DeltaT").size().reset_index(name="AmountOfDeltaT")
tij_amount_delta_t.head()

Index,DeltaT,AmountOfDeltaT
0,20,10451
1,40,2976
2,60,1680
3,80,868
4,100,606


To measure the total number of conversations in this dataset we can sum up the "AmountOfDeltaT" column. We need this total to calculate the probability of the different contact durations.

In [14]:
cumulated_contacts = 0
for key, value in tij_amount_delta_t.iterrows():
    cumulated_contacts += value["AmountOfDeltaT"]

In [15]:
x_delta_t = tij_amount_delta_t["DeltaT"]

To get the probabilities for all possible contact durations, you have to divide the number of occurrences of each contact duration by the total number of contacts.

In [16]:
y_probability = []
for key, value in tij_amount_delta_t.iterrows():
    y_probability.append(value["AmountOfDeltaT"]/cumulated_contacts)
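
The steps of this section can be condensed into a few vectorized pandas calls. The following is a minimal sketch under stated assumptions: the helper name `contact_duration_sketch` and its `resolution` parameter are hypothetical, the toy data are invented, and the sort additionally includes "Time" as a safeguard. It illustrates the idea and is not the toolbox's own implementation.

```python
import pandas as pd

def contact_duration_sketch(tij, resolution=20):
    """Return the duration of every contact in a tij table (hypothetical helper)."""
    data = tij.sort_values(by=["i", "j", "Time"]).copy()
    # Time gap to the previous record of the same pair (nan for the first one).
    gap = data.groupby(["i", "j"])["Time"].diff()
    # A gap larger than the resolution starts a new contact.
    data["Ind"] = (gap > resolution).cumsum()
    # Each record of a contact stands for one resolution step.
    return data.groupby(["Ind", "i", "j"]).size() * resolution

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "Time": [20, 40, 60, 120, 40],
    "i":    [7, 7, 7, 7, 14],
    "j":    [182, 182, 182, 182, 15],
})
durations = contact_duration_sketch(toy)
# Relative frequency of each duration, i.e. the probability of each delta t.
probability = durations.value_counts(normalize=True).sort_index()
print(probability)
```

Here `value_counts(normalize=True)` replaces the two explicit loops: it counts how often each duration occurs and divides by the total number of contacts in one step.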

## How to implement the calculate_triangle_duration function 
<a name="triangle_duration"></a>

For the triangle duration you need to prepare the dataset in a way that shows whether one person talked with at least two different persons at the same timestamp. So the first step is to use the merge function of pandas to merge the tij data set with itself on "Time" and on the IDs, matching the "i" column of the left copy with the "j" column of the right copy. The output is a dataframe whose rows contain two pairs in which one person takes part in both conversations at the same time.

In [17]:
df_merge_1 = df.interaction.merge(df.interaction, left_on=["Time", "i"], right_on=["Time", "j"])
df_merge_1.head(15)

Index,Time,i_x,j_x,i_y,j_y
0,320,83,183,36,83
1,320,127,138,82,127
2,340,127,138,82,127
3,340,127,138,82,127
4,340,127,138,82,127
5,340,127,138,82,127
6,360,127,138,82,127
7,420,127,138,82,127
8,440,127,138,82,127
9,440,127,138,82,127


To prove that these three people stood in front of each other and talked with each other at the same time, you have to merge the original dataframe onto this dataframe once more, this time on "Time" only.

In [18]:
df_merge_2 = df_merge_1.merge(df.interaction, left_on=["Time"], right_on=["Time"])
df_merge_2.head(15)

Index,Time,i_x,j_x,i_y,j_y,i,j
0,320,83,183,36,83,36,183
1,320,83,183,36,83,82,138
2,320,83,183,36,83,82,127
3,320,83,183,36,83,83,183
4,320,83,183,36,83,36,83
5,320,83,183,36,83,127,138
6,320,83,183,36,83,82,138
7,320,127,138,82,127,36,183
8,320,127,138,82,127,82,138
9,320,127,138,82,127,82,127


Now you have to filter the rows of the dataframe that are real triangles. To do so we check whether these six columns of "i" and "j" contain every ID twice, covering every possible pair in a triangle.

In [19]:
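# .copy() makes the filtered result an independent dataframe, so that new columns
# can be added later without triggering pandas' SettingWithCopyWarning.
df_filter_triangle = df_merge_2[(df_merge_2["i_x"] == df_merge_2["j_y"])
                                & (df_merge_2["j_x"] == df_merge_2["j"])
                                & (df_merge_2["i_y"] == df_merge_2["i"])].copy()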
df_filter_triangle.head(15)

Index,Time,i_x,j_x,i_y,j_y,i,j
0,320,83,183,36,83,36,183
8,320,127,138,82,127,82,138
13,320,127,138,82,127,82,138
15,340,127,138,82,127,82,138
21,340,127,138,82,127,82,138
23,340,127,138,82,127,82,138
29,340,127,138,82,127,82,138
31,340,127,138,82,127,82,138
37,340,127,138,82,127,82,138
39,340,127,138,82,127,82,138


As for the contact duration above, you have to calculate the differences between the timestamps for the triangles you found.

In [20]:
df_filter_triangle["Diff"] = df_filter_triangle.groupby(["i_x", "j_x", "i_y"])["Time"].diff()
df_filter_triangle

Index,Time,i_x,j_x,i_y,j_y,i,j,Diff
0,320,83,183,36,83,36,183,
8,320,127,138,82,127,82,138,
13,320,127,138,82,127,82,138,0.00000
15,340,127,138,82,127,82,138,20.00000
21,340,127,138,82,127,82,138,0.00000
23,340,127,138,82,127,82,138,0.00000
29,340,127,138,82,127,82,138,0.00000
31,340,127,138,82,127,82,138,0.00000
37,340,127,138,82,127,82,138,0.00000
39,340,127,138,82,127,82,138,0.00000


The following steps are essentially the same as for calculate_contact_duration above.

In [21]:
marker_list = []
for key, value in df_filter_triangle.iterrows():
    if value["Diff"] > 20:
        marker_list.append(1)
    else:
        marker_list.append(0)
df_filter_triangle["Marker"] = marker_list
df_filter_triangle["Ind"] = df_filter_triangle["Marker"].cumsum()

df_merge_dd_gb = df_filter_triangle.groupby(["Ind", "i_x", "j_x", "i_y"]).size().reset_index(name="Number")

In [22]:
delta_t_list = []
for key, value in df_merge_dd_gb.iterrows():
    # Every row of a group adds 20 seconds, so a single row also yields 20 seconds.
    delta_t_list.append(value["Number"] * 20)

df_merge_dd_gb["DeltaT"] = delta_t_list

tij_amount_delta_t = df_merge_dd_gb.groupby("DeltaT").size().reset_index(name="AmountOfDeltaT")

cumulated_contacts = 0
for key, value in tij_amount_delta_t.iterrows():
    cumulated_contacts += value["AmountOfDeltaT"]

x_delta_t = []
y_probability = []
for key, value in tij_amount_delta_t.iterrows():
    x_delta_t.append(value["DeltaT"])
    y_probability.append(value["AmountOfDeltaT"]/cumulated_contacts)
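
The merge and filter steps for finding triangles can also be tried out on a small table. The following sketch uses invented toy data and assumes that every pair is stored with i < j, which the outputs above suggest but which is an assumption here. Instead of merging on "Time" alone and filtering afterwards, it merges the third edge directly on the two IDs that still have to be connected. This is a variation for illustration, not the toolbox's own implementation.

```python
import pandas as pd

# Toy tij table (invented): persons 1, 2 and 3 form a triangle at time 20 only.
toy = pd.DataFrame({
    "Time": [20, 20, 20, 20, 40, 40],
    "i":    [1, 1, 2, 4, 1, 2],
    "j":    [2, 3, 3, 5, 2, 3],
})

# Pairs that share a person at the same time: (i_y, j_y) and (i_x, j_x) with j_y == i_x.
two_edges = toy.merge(toy, left_on=["Time", "i"], right_on=["Time", "j"])
# Close the triangle: the remaining edge must connect i_y and j_x at the same time.
triangles = two_edges.merge(
    toy, left_on=["Time", "i_y", "j_x"], right_on=["Time", "i", "j"]
)
print(triangles[["Time", "i_y", "i_x", "j_x"]])
```

At time 40 the persons 1, 2 and 3 have only two of the three contacts, so no triangle is reported for that timestamp.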


You can also calculate the probabilities for the contact and for the triangle duration by using a histogram. You have to import numpy for the linspace and the histogram function and math for the logarithm function.

In [23]:
import numpy as np
import math

In [24]:
# Logarithmically spaced bin edges between the shortest and the longest duration
bins = 10**(np.linspace(math.log10(min(x_delta_t)), math.log10(max(x_delta_t)), 50))

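# Histogram over all triangle durations (one entry per triangle), not only the distinct values
n, bins = np.histogram(df_merge_dd_gb["DeltaT"], bins=bins, density=True)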
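
As a side note on the binning, `np.logspace` produces the same logarithmically spaced edges as `10**np.linspace(...)`. The sketch below uses invented durations and a smaller number of bins to show this, and to show what `density=True` means: the histogram values are scaled so that the area under the histogram is 1.

```python
import numpy as np

# Example durations in seconds (invented for illustration).
durations = np.array([20, 20, 20, 40, 40, 60, 100, 200, 400])

# Logarithmically spaced bin edges between the smallest and largest duration.
bins = np.logspace(np.log10(durations.min()), np.log10(durations.max()), 10)

# np.histogram returns the density per bin first and the bin edges second.
density, edges = np.histogram(durations, bins=bins, density=True)

# With density=True the area under the histogram is 1.
area = np.sum(density * np.diff(edges))
print(area)
```

Because the bins get wider towards long durations, the density values are not probabilities per bin; multiplying by the bin width, as done for `area`, gives the probability mass of each bin.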

## How to implement the inter-contact duration function 
<a name="inter-contact_duration"></a>

The last function of the distribution methods calculates the probabilities for the inter-contact duration. The inter-contact duration describes the time window in which a person switches the conversation partner, i.e. how long it takes person A to change contact from person B to person C. To do so we need to prepare our data set again.

First we have to create a list of all unique individuals that are part of this dataset. To make sure that you get every individual, you need to use both ID columns.

In [25]:
individuals = list(set(list(df.interaction.i) + list(df.interaction.j)))

In the next step we have to check every occurrence of a person in either the "i" or the "j" column to collect the timestamps. Right after that you can get the time differences from the resulting time_stamp array, and you have to filter out time differences of 20 seconds or less, because these belong to an ongoing conversation.

In [26]:
import numpy as np 
inter_event_duration = np.array([])
for ind in individuals:
    time_stamp = df.interaction[(df.interaction.i == ind) | (df.interaction.j == ind)].Time.values
    diff = time_stamp[1:] - time_stamp[:-1]
    inter_event_duration = np.append(inter_event_duration, diff[diff > 20])
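
For a single person, the computation inside the loop can be followed step by step on a small table. The sketch below uses invented toy data. The call to `np.sort` is an added safeguard for tables that are not ordered by time and is an assumption, not part of the original code.

```python
import numpy as np
import pandas as pd

# Toy tij table, invented for illustration.
toy = pd.DataFrame({
    "Time": [20, 40, 100, 120, 300],
    "i":    [1, 1, 1, 2, 1],
    "j":    [2, 2, 3, 3, 2],
})

person = 1
# All timestamps at which this person appears in either ID column.
mask = (toy["i"] == person) | (toy["j"] == person)
time_stamp = np.sort(toy.loc[mask, "Time"].to_numpy())
# Gaps between consecutive appearances; gaps of at most 20 seconds are ongoing contacts.
gaps = np.diff(time_stamp)
inter_contact = gaps[gaps > 20]
print(inter_contact)
```

`np.diff(time_stamp)` is equivalent to `time_stamp[1:] - time_stamp[:-1]` from the loop above.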

Right after that you can compute the x and y values for a log-log-scaled distribution.

In [27]:
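bins = 10**np.linspace(np.log10(min(inter_event_duration)), np.log10(max(inter_event_duration)), 50)
# np.histogram returns the densities first and the bin edges second
y_probability, bin_edges = np.histogram(inter_event_duration, bins=bins, density=True)
x_delta_t = bin_edges[:-1]  # left bin edges, same length as y_probability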

After finishing this tutorial you should be able to understand how to model interaction data sets in order to analyze them in terms of different kinds of interaction durations. If you want to see how to use these functions in the toolbox and how to plot them, have a look at the "probability_distribution_contact_duration" tutorial. If you want to further analyze the resulting lists that contain all the $\Delta t$ in terms of their distribution, have a look at the tutorials "How_to_use_statistical_characterization" and "Statistical_characterization".