# Week 4 Notes

## 4.1.1: Getting Started with Pandas

In this case study, we will attempt to group different samples of whiskey using their flavor characteristics.

``pandas`` is built on top of NumPy and is a great tool for data analysis.
- ``pandas.Series`` is a 1-dimensional array-like object with a name and an index.
- ``pandas.DataFrame`` is a 2-dimensional array-like object with row and column labels.

To create a ``pandas.Series``, we can use the ``pandas.Series()`` function.

In [None]:
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
x

In [None]:
# Using explicit indices
y = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])
y

In [None]:
# Using a dictionary
z = pd.Series({"a": 1, "b": 2, "c": 3, "d": 4, "e": 5})
z

To create a ``pandas.DataFrame``, we can use the ``pandas.DataFrame()`` function.

In [None]:
# Using a dictionary where the values are lists (can also be 1D numpy arrays)
z = pd.DataFrame(
    {
        "name": ["John", "Mary", "Mark"],
        "age": [30, 40, 50],
        "ZIP": [12345, 23456, 34567],
    }
)
z

We can get the index of a ``pandas.Series`` or ``pandas.DataFrame`` using the ``.index`` attribute. Using ``sorted()``, we can sort the index and create a list of the sorted indices.

We can also reorder a ``pandas.Series`` or ``pandas.DataFrame`` using the ``.reindex()`` method.
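
As a minimal sketch of these operations (the Series ``s`` and its labels are made up for illustration):

```python
import pandas as pd

# Hypothetical Series with an unsorted string index
s = pd.Series([3, 1, 2], index=["c", "a", "b"])

print(s.index)                   # the index labels, in their current order
sorted_labels = sorted(s.index)  # plain Python list of the labels, sorted

# reindex returns a new Series whose rows follow the given label order
s_sorted = s.reindex(sorted_labels)
print(s_sorted)
```

Labels passed to ``.reindex()`` that do not exist in the original index produce ``NaN`` entries rather than an error.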

## 4.1.2: Loading and Inspecting Data

We will now load and inspect the data stored in ``whiskies.txt`` and ``regions.txt``, both of which are in CSV format.

In [None]:
import numpy as np
import pandas as pd

whiskies = pd.read_csv("whiskies.txt")
whiskies["Region"] = pd.read_csv("regions.txt")

We can use the ``.head()`` method to view the first few rows of the data. We can use the ``.tail()`` method to view the last few rows of the data.

In [None]:
whiskies.head()

In [None]:
whiskies.tail()

We would like to see the specific subset of the ``whiskies`` dataframe that corresponds to the flavors of whiskies. To do this, we can create a new dataframe using the following code:

In [None]:
flavors = whiskies.iloc[:, 2:14]
flavors.head()

## 4.1.3: Exploring Correlations

We want to find out if there are any strong linear correlations between the different taste attributes of each whisky. We can use the ``.corr()`` method to find the correlation between each pair of columns, and by default, this method uses the Pearson correlation coefficient.

In [None]:
corr_flavors = flavors.corr()
corr_flavors
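
To make the Pearson coefficient concrete, here is a small sketch that computes it by hand and compares it with ``.corr()``. The column names and scores are made up and are not taken from the whisky data.

```python
import numpy as np
import pandas as pd

# Made-up scores for two hypothetical flavor attributes
df = pd.DataFrame({"Smoky": [0, 1, 2, 3, 4], "Sweet": [4, 3, 3, 1, 0]})

# Pearson r: sum of products of deviations, divided by the
# square root of the product of the summed squared deviations
x, y = df["Smoky"], df["Sweet"]
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# The same quantity read off the pandas correlation matrix
r_pandas = df.corr().loc["Smoky", "Sweet"]
print(r_manual, r_pandas)
```

The two values agree up to floating-point error, and the negative sign reflects that one attribute tends to decrease as the other increases in this toy data.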

The above output corresponds to a correlation matrix. Let us plot this matrix.

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
plt.pcolor(corr_flavors)
plt.colorbar()
plt.show()

We now have a plot where we can see the correlation between each pair of taste attributes. We can also transpose the ``flavors`` dataframe before calling ``.corr()`` to find correlations between the whiskies with respect to flavors (this can also be interpreted as the correlations between the whisky distilleries in terms of the flavors of whisky they produce).

In [None]:
corr_whiskies = flavors.T.corr()
corr_whiskies.iloc[:5, :5]

In [None]:
plt.figure(figsize=(10, 10))
plt.pcolor(corr_whiskies)
plt.colorbar()
plt.show()

## 4.1.4 Clustering Whiskies by Flavor Profile

Spectral co-clustering is a method for grouping data points into clusters. scikit-learn provides a class, ``sklearn.cluster.SpectralCoclustering``, that performs spectral co-clustering.

Although this problem is computationally hard to solve directly, an approximate solution can be found using eigenvalues and eigenvectors of an adjacency matrix.

In [None]:
from sklearn.cluster import SpectralCoclustering

model = SpectralCoclustering(
    n_clusters=6, random_state=0
)  # create a spectral co-clustering model with 6 clusters (representing the 6 regions)

model.fit(corr_whiskies)  # fit the model to the whiskies data

model.rows_[
    :, :10
]  # see the clusters as rows and the individual whiskies as columns, with "True" denoting that the whisky belongs to a certain cluster and "False" denoting that it does not

In [None]:
np.sum(model.rows_, axis=1)  # the number of whiskies in each cluster

In [None]:
model.row_labels_[:10]  # the output denotes the cluster that each whisky belongs to
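
To see what ``rows_`` and ``row_labels_`` contain on a matrix small enough to inspect by hand, the following sketch fits the same class to a toy matrix. The matrix ``X`` and the name ``toy_model`` are made up for illustration, and two clusters are assumed.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Toy 6x6 matrix: large values inside two diagonal blocks, small values elsewhere
X = np.ones((6, 6))
X[:3, :3] = 10
X[3:, 3:] = 10

toy_model = SpectralCoclustering(n_clusters=2, random_state=0)
toy_model.fit(X)

print(toy_model.row_labels_)  # one cluster label per row
print(toy_model.rows_.shape)  # boolean membership matrix: (n_clusters, n_rows)
```

The first three rows should share one label and the last three rows the other, mirroring the block structure of ``X``.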

## 4.1.5: Comparing Correlation Matrices

We are now ready to compare the original correlation matrix of the whiskies with the one obtained after sorting the whiskies by their cluster labels.

In [None]:
whiskies["Group"] = pd.Series(
    model.row_labels_, index=whiskies.index
)  # add a column to the dataframe with the cluster labels
whiskies = whiskies.iloc[
    whiskies["Group"].argsort()
]  # sort the dataframe by the cluster labels
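whiskies = whiskies.reset_index(
    drop=True
)  # reset the index; drop=True avoids inserting the old index as a new column, which would shift the flavor columns selected by iloc below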

In [None]:
correlations = whiskies.iloc[
    :, 2:14
].T.corr()  # the correlation matrix of the clusters after updating the indices
correlations = np.array(correlations)  # convert the correlation matrix to a numpy array
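
The effect of sorting by cluster label can be sketched on a tiny example. Recomputing the correlations after sorting the rows gives the same matrix as reordering the rows and columns of the original correlation matrix together, which is what this snippet does. The matrix ``C`` and the labels are made up for illustration.

```python
import numpy as np

# Hypothetical 4x4 correlation matrix; items 0 and 2 are similar, as are 1 and 3
C = np.array(
    [
        [1.0, 0.1, 0.9, 0.2],
        [0.1, 1.0, 0.2, 0.8],
        [0.9, 0.2, 1.0, 0.1],
        [0.2, 0.8, 0.1, 1.0],
    ]
)
labels = np.array([1, 0, 1, 0])  # made-up cluster label for each item

order = np.argsort(labels, kind="stable")  # positions that sort the labels
C_sorted = C[np.ix_(order, order)]  # reorder rows and columns together

print(order)
print(C_sorted)
```

After reordering, the large correlations sit in blocks along the diagonal, which is the pattern to look for when plotting the sorted matrix.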

## 4.2.1 Introduction to GPS Tracking of Birds

We are going to do some data processing and analysis on GPS data from three seagulls, as collected by the LifeWatch INBO project. The data is stored in a CSV file called ``bird_tracking.csv``.

In [None]:
import pandas as pd

birddata = pd.read_csv("bird_tracking.csv")
birddata.info()  # basic information about the dataframe

In [None]:
birddata.head()  # the first five rows of the dataframe

## 4.2.2 Simple Data Visualizations

We will now plot the GPS data of one of the three seagulls. Note that this data represents positions on a sphere and not on a cartesian plane, so the data will appear quite distorted. A cartographic projection is used to make the data more readable, and will be done later on in this part. For now though, we want to get a feel for the data itself.

We start by importing the necessary modules.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

First, we will find the rows in the ``birddata`` dataframe that correspond to the seagull Eric. We will store the result, a boolean mask that is ``True`` for Eric's rows, in ``eric_ix``.

In [None]:
eric_ix = birddata.bird_name == "Eric"

Then, we will extract the longitude and latitude of each of Eric's observations and record them as ``x`` and ``y`` coordinates in a cartesian plane. Again, this is not optimal for this particular data set, as latitudes and longitudes correspond to a plane wrapped on a sphere, but it is fine for this example.

In [None]:
(x, y) = birddata.longitude[eric_ix], birddata.latitude[eric_ix]

We can now plot the data.

In [None]:
plt.figure(figsize=(10, 10))
plt.plot(x, y, ".")
plt.show()

From this graph, we can see a general pattern in Eric's movement: he first heads northeast, then northwest, then northeast again for a longer distance. He then appears to return to the original starting point along similar corridors. There are also bigger gaps in some places in the graph, but it is unclear whether these gaps are due to measurement errors or because Eric was moving faster on some legs.

Now we will extend this plot for all three birds in which we are interested in this case study. We first start by storing the names of birds in the ``birddata`` dataframe in the array ``bird_names``.

In [None]:
bird_names = birddata.bird_name.unique()

Let us check the names of the birds in the dataframe.

In [None]:
bird_names

We can see that there are three birds in the dataframe: Eric, Nico, and Sanne.

Using very similar code looped over all three birds, we can now plot the data for each bird on a single figure.

In [None]:
plt.figure(figsize=(10, 10))  # initialize the figure

for bird_name in bird_names:
    ix = birddata.bird_name == bird_name
    (x, y) = birddata.longitude[ix], birddata.latitude[ix]
    plt.plot(x, y, ".", label=bird_name)

plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.legend(loc="lower right")
plt.show()

We can see that the flight paths of each bird are very similar, but Nico and Sanne seem to venture further south than Eric.

## 4.2.3 Examining Flight Speed

The column ``speed_2d`` contains the speed of the seagulls approximated on a 2D cartesian plane. We will now examine and analyze this data, starting with the approximate speeds for just Eric.

In [None]:
eric_ix = birddata.bird_name == "Eric"  # get the indices of the bird data for Eric
speed = birddata.speed_2d[eric_ix]  # get the speed data for Eric

However, if you try to plot the speed data for Eric using ``plt.hist``, you will get an error. This error does not come up for the first 10 data points, which suggests that the speed data contains entries that are not numbers. Let us run the following code to check for this.

In [None]:
speed[np.isnan(speed)].shape  # np.isnan already returns booleans, so no "== True" is needed

We can not only see that there are indeed values in the speed data that are listed as NaN (not a number), but also that there are exactly 85 of these values. Let us extract the indices of these values to ignore them in the histogram plot.

In [None]:
nan_ix = np.isnan(speed)  # boolean mask that is True where the speed is NaN

Let us invert this boolean array with the bitwise NOT operator ``~`` to mark the entries we want to include.

In [None]:
ix_include = ~nan_ix
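
The masking steps above can be tried on a small made-up Series (the names ``toy_speed``, ``toy_nan`` and ``toy_keep`` are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up speeds with two missing values
toy_speed = pd.Series([1.5, np.nan, 3.0, np.nan, 2.5])

toy_nan = np.isnan(toy_speed)  # boolean Series: True where the value is NaN
toy_keep = ~toy_nan  # element-wise negation of the mask

print(toy_nan.sum())  # number of NaN entries
print(toy_speed[toy_keep])  # only the valid entries
print(toy_speed.dropna())  # pandas shortcut that gives the same result
```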

We can now plot the speed data for Eric using the indices we have extracted.

In [None]:
plt.figure(figsize=(10, 10))
plt.hist(speed[ix_include])
plt.show()

We will now normalize this graph and add labels to make the histogram more readable.

In [None]:
plt.figure(figsize=(10, 10))
plt.hist(speed[ix_include], bins=np.linspace(0, 30, 20), density=True)
plt.xlabel("Speed (m/s)")
plt.ylabel("Frequency")
plt.show()

Looking for and excluding NaNs in your data this way is a common practice that you should be familiar with. However, ``pandas`` has plotting functionality that handles this for you automatically. The downside is that ``pandas`` plots are not as customizable as ``pyplot`` plots.

In [None]:
plt.figure(figsize=(10, 10))
birddata.speed_2d.plot(kind="hist", range=[0, 30])
plt.xlabel("Speed (m/s)")
plt.show()

## 4.2.4 Using DateTime

In ``birddata`` we have a column ``date_time`` that contains the date and time of the observations. Let us examine a few of these entries.

In [None]:
birddata.date_time[0:3]

These entries are all ``str`` objects. If we want to operate on these entries, we need to convert them into datetime objects. These objects are in the format ``datetime.datetime(year, month, day, hour, minute, second)``. Let us try converting the first entry to a datetime object.

In [None]:
import datetime

date_str = birddata.date_time[0]  # get the first datetime entry, which is a string
print(type(date_str))  # check the type of the entry, returns "str"
date = datetime.datetime.strptime(
    date_str[
        :-3
    ],  # strip the string of the whitespace and the last three characters, which correspond to the UTC offset
    "%Y-%m-%d %H:%M:%S",  # give the format of the data in the string to effectively convert it to a datetime object
)
date

We can now adapt this code to convert all of the datetime entries in ``birddata`` to datetime objects.

In [None]:
timestamps = []  # create an empty list to store the datetime objects
for k in range(
    len(birddata.date_time)
):  # loop over the entries in the date_time column
    timestamps.append(
        datetime.datetime.strptime(
            birddata.date_time.iloc[k][
                :-3
            ],  # strip the string of the whitespace and the last three characters, which correspond to the UTC offset
            "%Y-%m-%d %H:%M:%S",  # give the format of the data in the string to effectively convert it to a datetime object
        )
    )  # append the datetime object to the timestamps list
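
As an alternative to the explicit loop, the same conversion can be sketched with vectorized pandas calls. The two example strings below are made up and assume the layout described above, that is, a timestamp followed by a three-character UTC offset.

```python
import pandas as pd

# Hypothetical strings in the assumed layout of the date_time column
raw = pd.Series(["2013-08-15 00:18:08+00", "2013-08-15 00:48:07+00"])

# Drop the last three characters of every string, then parse them in one call
parsed = pd.to_datetime(raw.str[:-3], format="%Y-%m-%d %H:%M:%S")
print(parsed)
```

This produces pandas ``Timestamp`` values rather than ``datetime.datetime`` objects, and differences between them are pandas ``Timedelta`` values.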

Let us now add these datetime objects to the ``birddata`` dataframe.

In [None]:
birddata["timestamp"] = pd.Series(timestamps, index=birddata.index)
birddata.head()  # check the first few entries

Let us now make a list that stores the elapsed time for each observation for Eric from the beginning of data collection.

In [None]:
times = birddata.timestamp[birddata.bird_name == "Eric"]  # get the timestamps for Eric
elapsed_times = np.array(
    [time - times[0] for time in times]
)  # get the elapsed times for Eric
elapsed_times[:5]  # check the first few entries

Note that differences between DateTime objects are returned as special TimeDelta objects. To convert these TimeDelta objects to a certain unit of time, we can divide by a TimeDelta object corresponding to one unit of that time. For example, if we want to convert the elapsed times to days, we can do so by dividing by a TimeDelta object corresponding to one day.

In [None]:
days_elapsed = elapsed_times / datetime.timedelta(days=1)
days_elapsed[:5]  # check the first few entries
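
The division can be checked on two hand-picked datetimes (the dates are arbitrary):

```python
import datetime

start = datetime.datetime(2013, 8, 15, 0, 0, 0)
later = datetime.datetime(2013, 8, 16, 12, 0, 0)

delta = later - start  # a datetime.timedelta of one and a half days

days = delta / datetime.timedelta(days=1)  # elapsed time in days
hours = delta / datetime.timedelta(hours=1)  # elapsed time in hours
print(days, hours)
```

Dividing one ``timedelta`` by another returns a plain ``float``, so the result can be used directly in plots and comparisons.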

Let's plot some of this data.

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(days_elapsed)
plt.xlabel("Observation")
plt.ylabel("Days elapsed")
plt.show()

For a perfect dataset with equally spaced observations, we would expect a straight line in this plot. However, we see some jumps. This indicates that some data points are unevenly spaced, so relying on algorithms that expect perfect spacing is not a good idea in this case.

## 4.2.5 Calculating Daily Mean Speed

To calculate daily mean speeds, we need to first group the data by day. This involves iterating through rows in the data frame and grouping the rows until we reach an index that corresponds to the next day.

In [None]:
next_day = 1  # initialize the index of the next day
inds = (
    []
)  # initialize an empty list to store the indices of observations in the current day
daily_mean_speeds = []  # initialize an empty list to store the mean daily speeds

for (i, t) in enumerate(days_elapsed):  # loop over the elapsed times
    if t < next_day:  # if the elapsed time is less than the next day
        inds.append(
            i
        )  # append the index of the current observation to the list of indices

    else:  # the elapsed time has reached the next day
        daily_mean_speeds.append(
            birddata.speed_2d[inds].mean()
        )  # get the mean speed for the current day
        next_day += 1  # increment the index of the next day
        inds = [i]  # start the next day's list with the current observation

daily_mean_speeds = np.array(daily_mean_speeds)  # convert the list to a numpy array
daily_mean_speeds[:5]  # check the first few entries
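
The same grouping idea can also be sketched with ``groupby``. This is an alternative to the loop above, not the method used in this case study, and the dataframe ``toy`` contains made-up values.

```python
import numpy as np
import pandas as pd

# Made-up elapsed times (in days) and speeds for six observations
toy = pd.DataFrame(
    {
        "days_elapsed": [0.1, 0.6, 1.2, 1.9, 2.3, 2.4],
        "speed_2d": [2.0, 4.0, np.nan, 6.0, 1.0, 2.0],
    }
)

# Integer day number of each observation
toy["day"] = np.floor(toy["days_elapsed"]).astype(int)

# Mean speed per day; NaN speeds are skipped by default
toy_daily_means = toy.groupby("day")["speed_2d"].mean()
print(toy_daily_means)
```

This variant also reports a mean for the final day, which the loop above never appends because no later observation triggers the ``else`` branch.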

Let's plot the daily mean speeds.

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(daily_mean_speeds)
plt.xlabel("Day")
plt.ylabel("Daily mean speed (m/s)")
plt.show()

We see that Eric flies mostly at just above 2 m/s daily except for a few days when he flies significantly faster. These peaks correspond to the days when he flies at the highest speed, which naturally are his days of migration. Hence, the peaks in this graph represent the days and average daily speeds of migration.

In [None]:
# Comprehension Check Question
birddata[
    birddata.bird_name == "Sanne"
].date_time.head()  # what are the first few timestamps recorded for Sanne?

## 4.2.6 Using the Cartopy Module

Cartopy is a library for plotting maps. It is a great tool for visualizing data, especially location data such as longitude and latitude.

In [None]:
from cartopy import crs as ccrs
from cartopy import feature as cfeature

Let us visualize the flight paths of the three seagulls using Cartopy, specifically the Mercator standard projection.

In [None]:
proj = ccrs.Mercator()  # create a Mercator projection

plt.figure(figsize=(10, 5))  # create a figure
ax = plt.axes(projection=proj)  # create an axes using the Mercator projection
ax.set_extent([-25.0, 20.0, 52, 10])  # set the extent of the plot

for bird in bird_names:  # loop over the bird names
    ix = (
        birddata.bird_name == bird
    )  # get the indices of the observations for the current bird
    x, y = (
        birddata.longitude[ix],
        birddata.latitude[ix],
    )  # get the longitude and latitude
    ax.plot(
        x, y, transform=ccrs.Geodetic(), label=bird
    )  # plot the longitude and latitude

# More features to plot
ax.add_feature(cfeature.OCEAN)  # add the oceans
ax.add_feature(cfeature.LAND)  # add the land
ax.add_feature(cfeature.COASTLINE)  # add the coastlines
ax.add_feature(cfeature.BORDERS, linestyle=":")  # add the borders

ax.legend()  # add a legend
plt.show()  # show the plot

Here, we see the flight trajectories of all three birds, but these paths now correspond to paths on the globe rather than on a cartesian plane. We can also see exactly where the birds are flying: from the west coast of Africa up along the Portuguese and Spanish coasts to the northern coast of Europe near Belgium, France, and Germany.

## 4.3.1 Introduction to Network Analysis

GRAPH THEORY MAKES A RETURN

A *network* is the real-world implementation of a *graph*, which is mathematically abstract.

## 4.3.2 Basics of NetworkX

NetworkX is a Python library for analyzing and visualizing networks and graphs.

In [None]:
import networkx as nx  # import the NetworkX module

Let us create an example undirected graph.

In [None]:
G = nx.Graph()  # create an empty graph
G.add_node(1)  # add a node
G.add_nodes_from([2, 3, 4])  # add multiple nodes

Let us see what nodes are in the graph G.

In [None]:
G.nodes()  # get the nodes

Let us add edges to the graph.

In [None]:
G.add_edge(1, 2)  # add an edge
G.add_edges_from([(1, 3), (1, 4)])  # add multiple edges

Let us see what edges are in the graph G.

In [None]:
G.edges()  # get the edges

Removing nodes and edges works analogously to adding them.
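
A short sketch of removal on a separate graph ``H`` (a hypothetical name, used so that ``G`` is left untouched):

```python
import networkx as nx

H = nx.Graph()
H.add_nodes_from([1, 2, 3, 4])
H.add_edges_from([(1, 2), (1, 3), (1, 4)])

H.remove_node(2)  # also removes every edge attached to node 2
H.remove_edge(1, 3)  # removes one edge but keeps both endpoints

print(H.nodes())
print(H.edges())
```

``remove_nodes_from()`` and ``remove_edges_from()`` take lists, mirroring ``add_nodes_from()`` and ``add_edges_from()``.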

Let us find out the number of vertices and edges in the graph G.

In [None]:
G.number_of_nodes()  # get the number of nodes

In [None]:
G.number_of_edges()  # get the number of edges

## 4.3.3 Graph Visualization

Let's play around with the Karate Club graph, built into NetworkX.

In [None]:
K = nx.karate_club_graph()  # create the graph

# We visualize the graph using pyplot
plt.figure(figsize=(10, 5))  # create a figure
nx.draw(K, with_labels=True)  # draw the graph
plt.show()  # show the plot

The degree of each vertex in a graph ``G`` can be accessed using ``G.degree()``. This returns a DegreeView object, which we can iterate over to get the degree of each vertex. It works very similarly to a Python dictionary.

In [None]:
K.degree()
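
A few ways of reading values out of a DegreeView, sketched on a fresh copy of the Karate Club graph (``K_demo`` is a name chosen for this example):

```python
import networkx as nx

K_demo = nx.karate_club_graph()

print(K_demo.degree[33])  # dictionary-style lookup of a single node
print(dict(K_demo.degree())[0])  # convert to a real dict, then look up a node

# Iterating yields (node, degree) pairs, so max() can find the best-connected node
max_node, max_degree = max(K_demo.degree(), key=lambda pair: pair[1])
print(max_node, max_degree)
```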

Let us find out how many edges are in the graph ``K``.

In [None]:
K.number_of_edges()  # get the number of edges

## 4.3.4 Random Graphs

The simplest random graph model is the Erdos-Renyi (ER) graph model. It has two parameters:
1. `N`: the number of nodes
2. `p`: the probability of an edge between two nodes

Our goal is to create an ER graph generator as a Python function.

In [None]:
def er_graph(N: int, p: float) -> nx.Graph:
    """
    Generate an Erdos-Renyi graph with N nodes and probability p of an edge between any two nodes.

    :param N: the number of nodes
    :param p: the probability of an edge between two nodes

    :return: the generated ER random graph
    """
    from itertools import combinations

    from scipy.stats import bernoulli

    if not 0 <= p <= 1:  # check that p is between 0 and 1
        raise ValueError("p must be between 0 and 1")

    G = nx.Graph()  # create an empty graph
    G.add_nodes_from(range(N))  # add N nodes

    for node1, node2 in combinations(G.nodes(), 2):
        if bernoulli.rvs(p=p):
            G.add_edge(node1, node2)

    return G  # return the graph

Now let us use this function to create a random graph with 20 nodes and probability 0.2 of an edge between any two nodes.

In [None]:
plt.figure(figsize=(10, 5))  # create a figure
nx.draw(er_graph(N=20, p=0.2), with_labels=True)  # draw the graph
plt.show()  # show the plot
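
NetworkX also ships its own generator for this model, ``nx.erdos_renyi_graph()``. The sketch below uses it as a cross-check. It is not the function written above, and the seed value is an arbitrary choice.

```python
import networkx as nx

N, p = 20, 0.2

# Built-in generator; the seed fixes the random draw for reproducibility
G_builtin = nx.erdos_renyi_graph(N, p, seed=0)

max_edges = N * (N - 1) // 2  # number of distinct node pairs
print(G_builtin.number_of_nodes(), G_builtin.number_of_edges())
print("expected number of edges:", p * max_edges)

# Edge cases: p = 0 gives no edges, p = 1 gives the complete graph
empty = nx.erdos_renyi_graph(N, 0.0, seed=0)
complete = nx.erdos_renyi_graph(N, 1.0, seed=0)
print(empty.number_of_edges(), complete.number_of_edges())
```

Since each of the ``N * (N - 1) / 2`` node pairs is linked independently with probability ``p``, the expected number of edges is ``p`` times that count.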

## 4.3.5 Plotting the Degree Distribution

In [None]:
def plot_degree_distribution(G: nx.Graph) -> None:
    """
    Plot the degree distribution of a graph.

    :param G: the graph to plot the degree distribution of
    """
    from matplotlib.pyplot import hist, title, xlabel, ylabel

    degree_sequence = [d for n, d in G.degree()]
    hist(degree_sequence, histtype="step")  # plot the degree distribution

    xlabel("Degree $k$")  # label the x-axis
    ylabel("$P(K=k)$")  # label the y-axis
    title("Degree Distribution")  # label the title

Let's try this function out on an ER graph with 500 nodes and an edge probability of 0.08.

In [None]:
plt.figure(figsize=(10, 5))  # create a figure
plot_degree_distribution(er_graph(500, 0.08))
plt.show()

Let's try plotting several realizations of ER graphs with the same parameters and look at the results.

In [None]:
plt.figure(figsize=(10, 5))  # create a figure

for graph in range(5):
    plot_degree_distribution(er_graph(500, 0.08))

plt.show()

In [None]:
# Comprehension Check #2
plt.figure(figsize=(10, 5))  # create a figure
for graph in range(10):
    plot_degree_distribution(er_graph(100, 0.03))
plt.show()

In [None]:
plt.figure(figsize=(10, 5))  # create a figure
for graph in range(10):
    plot_degree_distribution(er_graph(100, 0.3))
plt.show()

## 4.3.6 Descriptive Statistics of Empirical Social Networks

An adjacency matrix is a square matrix whose entries are zeros and ones. Each row and each column represents a node, and the entry in row ``i`` and column ``j`` is 1 if there is an edge between nodes ``i`` and ``j`` and 0 otherwise.
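
As a small illustration, the sketch below builds a graph from a hand-written adjacency matrix and converts it back. The matrix ``A_toy`` is made up, and ``nx.from_numpy_array()`` is used here as one available way to do the conversion.

```python
import networkx as nx
import numpy as np

# Adjacency matrix of a 3-node graph: node 0 is linked to nodes 1 and 2
A_toy = np.array(
    [
        [0, 1, 1],
        [1, 0, 0],
        [1, 0, 0],
    ]
)

G_toy = nx.from_numpy_array(A_toy)  # undirected graph with nodes 0, 1, 2
print(sorted(G_toy.edges()))

# Converting the graph back recovers the original matrix
A_back = nx.to_numpy_array(G_toy, dtype=int)
print(A_back)
```

For an undirected graph the adjacency matrix is symmetric, because an edge between ``i`` and ``j`` appears in both row ``i`` and row ``j``.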

Let's import the adjacency matrices for the relationship networks of two rural villages in India.

In [None]:
from numpy import loadtxt

A1 = loadtxt("adj_allVillageRelationships_vilno_1.csv", delimiter=",")
A2 = loadtxt("adj_allVillageRelationships_vilno_2.csv", delimiter=",")

Let us now convert these adjacency matrices into graph objects using the ``nx.to_networkx_graph()`` function.

In [None]:
G1 = nx.to_networkx_graph(A1)
G2 = nx.to_networkx_graph(A2)

In [None]:
def basic_net_stats(G: nx.Graph) -> dict:
    """
    Return a dictionary of basic network statistics for a graph.

    :param G: the graph to compute the basic network statistics of

    :return: a dictionary of basic network statistics
    """
    from numpy import array, mean, std

    degrees = array([d for n, d in G.degree()])  # degree of every node

    return {
        "nodes": G.number_of_nodes(),
        "edges": G.number_of_edges(),
        "density": nx.density(G),
        "degrees": degrees,
        "average degree": mean(degrees),
        "standard deviation of degree": std(degrees),
    }

In [None]:
print(
    basic_net_stats(G1)["nodes"],
    basic_net_stats(G1)["edges"],
    basic_net_stats(G1)["density"],
)

In [None]:
print(
    basic_net_stats(G2)["nodes"],
    basic_net_stats(G2)["edges"],
    basic_net_stats(G2)["density"],
)

Let's plot the degree distribution of the two village networks.

In [None]:
plot_degree_distribution(G1)
plot_degree_distribution(G2)

Here, we can see that in both villages most people have a small number of connections and very few people have more than 20 connections. The distributions are also quite similar between the two villages.

They are very asymmetric with a long tail, suggesting that an ER model is not appropriate for the village networks, or real life social networks in general.

## 4.3.7 Finding the Largest Connected Component

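In older versions of NetworkX, ``nx.connected_component_subgraphs(G)`` returned a generator of subgraphs of the input graph ``G``, each of which is a connected component of ``G``. That function was removed in NetworkX 2.4, so we instead build each subgraph from the node sets returned by ``nx.connected_components(G)``.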

In [None]:
gen1 = [G1.subgraph(c) for c in nx.connected_components(G1)]
gen2 = [G2.subgraph(c) for c in nx.connected_components(G2)]
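
To see how ``nx.connected_components()`` behaves, here is a sketch on a toy graph ``T`` (made up for illustration) with two components:

```python
import networkx as nx

# Toy graph with two components: {1, 2, 3} and {4, 5}
T = nx.Graph()
T.add_edges_from([(1, 2), (2, 3), (4, 5)])

components = list(nx.connected_components(T))  # list of node sets
largest_nodes = max(components, key=len)  # node set of the largest component
largest = T.subgraph(largest_nodes)  # view of T restricted to those nodes

print(len(components), sorted(largest_nodes))
print(largest.number_of_nodes() / T.number_of_nodes())
```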

We can use ``len()`` to get the number of nodes in each connected component. We can also use ``max()`` to get the largest connected component.

In [None]:
len(max(gen1, key=len))

In [None]:
len(max(gen2, key=len))

We see here that the largest connected component of village 1 is 15 nodes larger than the largest connected component of village 2.

Let's now find what proportions of nodes in both villages are in the largest connected component.

In [None]:
len(max(gen1, key=len)) / G1.number_of_nodes()

In [None]:
len(max(gen2, key=len)) / G2.number_of_nodes()

In [None]:
# Village 1
plt.figure(figsize=(10, 10))
nx.draw(max(gen1, key=len), with_labels=True)
plt.show()

In [None]:
# Village 2
plt.figure(figsize=(10, 10))
nx.draw(max(gen2, key=len), with_labels=True)
plt.show()

Notice that the largest connected component of village 1 seems to be homogeneously connected internally, while the largest connected component of village 2 seems to contain two subcomponents that are connected together. We have *network communities* in our village 2 network.