# Basic Data Analysis in Python using the Drosophila Connectome
The Drosophila connectome is a publicly available dataset of the neurons and connections within a single fruit fly brain. To learn more about it, explore these resources.

[Fruit Fly Brain Observatory](https://hemibrain.neuronlp.fruitflybrain.org)

[Neuprint](https://neuprint.janelia.org/)

[Codex Flywire Explorer](https://codex.flywire.ai)

In today's tutorial, we will access the Hemibrain data from Janelia via the Neuprint API. You can reference the [documentation](https://connectome-neuprint.github.io/neuprint-python/docs/queries.html) to learn more about querying the Hemibrain connectome database.

In this tutorial, we will:
- Learn to access Drosophila brain data from Neuprint by creating a query
- Access data from a Pandas dataframe with indexing and logical indexing
- Use describe to get quick stats
- Plot a histogram of synaptic sites
- Make a pivot table and heatmap of connections among neurons

# Getting set up
To get started, navigate to this site to create an account and obtain an authorization token: https://connectome-neuprint.github.io/neuprint-python/docs/quickstart.html#client-and-authorization-token.

Start by entering your client info here to start a Neuprint session. Just copy and paste your token into the space provided. We'll also import the most important packages we'll need.

In [None]:
# in Google Colab, run this cell to install the neuprint-python package
%pip install neuprint-python

In [None]:
from neuprint import Client

# insert your personal token between the quotes in token='' below. See https://connectome-neuprint.github.io/neuprint-python/docs/quickstart.html#client-and-authorization-token for instructions
c = Client('neuprint.janelia.org', dataset='hemibrain:v1.2.1', token='')
c.fetch_version()
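If you would rather not paste your token into a shared notebook, one option is to read it from an environment variable. The sketch below is only an illustration: the helper name load_neuprint_token is hypothetical, and the variable name NEUPRINT_APPLICATION_CREDENTIALS is my understanding of what neuprint-python looks for by default, so treat it as an assumption and check the quickstart page.

```python
import os


def load_neuprint_token(env=os.environ, var_name="NEUPRINT_APPLICATION_CREDENTIALS"):
    """Return the token stored in an environment variable (hypothetical helper).

    var_name is an assumption; change it if your setup uses a different name.
    """
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"No token found in the {var_name} environment variable")
    return token


# Demonstration with a stand-in dictionary instead of the real environment.
demo_token = load_neuprint_token({"NEUPRINT_APPLICATION_CREDENTIALS": "example-token"})
print(len(demo_token))

# In the notebook, the token could then be passed to the client like this:
# c = Client('neuprint.janelia.org', dataset='hemibrain:v1.2.1', token=load_neuprint_token())
```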

In [None]:
# import important stuff here
import numpy as np
import pandas as pd

Every neuron, or piece of neuron, has its own body ID. Below is a manually created list of the body IDs for the labeled and annotated clock neurons in the Hemibrain. We'll use these body IDs to access information about these neurons from neuprint.

In [1]:
clock_bodyIds = [2068801704, 1664980698, 2007068523, 1975347348, 5813056917, 5813021192, 5813069648, 511051477,
                  296544364, 448260940, 5813064789, 356818551, 480029788, 450034902, 546977514, 264083994, 5813022274,
                  5813010153, 324846570, 325529237, 387944118, 387166379, 386834269, 5813071319, 1884625521,
                  2065745704, 5813001741, 5813026773]

## Fetch Dataframe of neurons from Neuprint
We'll start by making a query to fetch summary information about each of these neurons using the fetch_neurons function from the neuprint package. This function takes neuron criteria (here, our list of body IDs) as its input and returns two dataframes. The first contains one row of summary information per neuron that matches the criteria. The second breaks down the number of synaptic sites on those neurons by brain region (ROI). We'll only work with the first dataframe.

In [None]:
from neuprint import fetch_neurons

neuron_df, _ = fetch_neurons(clock_bodyIds)

Display the dataframe below and notice that it has many columns with information about this set of 28 clock neurons. In addition to a bodyId, each neuron has a type and an instance label. The pre and post columns give the numbers of presynaptic and postsynaptic sites attributed to the neuron. Presynaptic sites are where the neuron releases neurotransmitter; postsynaptic sites are where it receives inputs. The mito column is the number of mitochondria that were counted in the neuron. The cellBodyFiber column is related to the hemilineage of the neuron: it indicates which neurons likely derived from the same stem cell.

For the purposes of this tutorial, let's work with the counts of synaptic sites.

In [None]:
neuron_df

# Working with a Pandas dataframe
The Pandas dataframe is similar to an Excel spreadsheet. We can use code to grab the data that we want from it.

In [None]:
# get the columns with cell type and post sites
neuron_df[['type','post']]

In [None]:
# get the first row of the dataframe
neuron_df.iloc[0]

In [None]:
# another way to get the first row of the dataframe
neuron_df[0:1]

In [None]:
# get the first three rows of the dataframe
neuron_df[0:3]
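The same selections can be tried without a Neuprint connection. Here is a minimal sketch on a small made-up dataframe (toy_df and its values are hypothetical, not Hemibrain data) that contrasts position-based selection with iloc and label-based selection with loc.

```python
import pandas as pd

# Hypothetical toy data; the real neuron_df has many more rows and columns.
toy_df = pd.DataFrame(
    {
        "bodyId": [101, 102, 103, 104],
        "type": ["LNd", "LNd", "DN1a", "LPN"],
        "pre": [50, 70, 20, 35],
        "post": [200, 260, 90, 120],
    }
)

# Position-based: iloc[0] returns the first row as a Series.
first_row = toy_df.iloc[0]

# Position-based slice: rows 0, 1 and 2 as a DataFrame (the end is exclusive).
first_three = toy_df.iloc[0:3]

# Label-based: loc selects rows by index label and columns by name.
# Unlike iloc, a loc slice includes its end label.
subset = toy_df.loc[0:2, ["type", "post"]]

print(first_row["type"])
print(first_three.shape)
print(subset)
```

Because the toy dataframe uses the default 0, 1, 2, ... index, iloc[0:3] and loc[0:2] pick the same rows here. They would differ if the index held other labels, such as body IDs.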

## Indexing with logical expressions
Let's say we wanted to grab only the rows that have information for the LNd cell type. We can use a logical expression in square brackets.

In [None]:
# get the LNd rows of the dataframe
neuron_df[neuron_df['type'] == 'LNd']

Take a moment to see what that logical expression inside the square brackets produces. It is a Boolean Series with one True/False entry per row. Only the rows that have 'LNd' in the 'type' column have a True entry.

In [None]:
neuron_df['type'] == 'LNd'
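Boolean masks like this one can also be combined. The sketch below uses a small made-up dataframe (toy_df is hypothetical, not Hemibrain data) to show AND, OR, and membership tests.

```python
import pandas as pd

# Hypothetical toy data standing in for neuron_df.
toy_df = pd.DataFrame(
    {
        "bodyId": [101, 102, 103, 104],
        "type": ["LNd", "LNd", "DN1a", "LPN"],
        "post": [200, 260, 90, 120],
    }
)

# Each comparison produces a Boolean Series with one entry per row.
is_lnd = toy_df["type"] == "LNd"
many_inputs = toy_df["post"] > 100

# Use | for OR and & for AND. Wrap each comparison in parentheses
# when writing them inline, because & and | bind tighter than == and >.
lnd_or_busy = toy_df[is_lnd | many_inputs]
lnd_and_busy = toy_df[(toy_df["type"] == "LNd") & (toy_df["post"] > 250)]

# isin keeps rows whose value appears in a list of allowed values.
selected = toy_df[toy_df["type"].isin(["LNd", "LPN"])]

# Summing a Boolean Series counts its True entries.
n_lnd = int(is_lnd.sum())

print(lnd_or_busy)
print(lnd_and_busy)
print(selected)
print(n_lnd)
```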

# Basic statistics with a dataframe
Pandas provides many methods for doing basic stats on a column of values from a dataframe. Below, I apply some of those methods to the 'post' column of the dataframe.

In [None]:
# get some summary stats about the post sites
neuron_df['post'].describe()

In [None]:
# get only the mean of the post sites
neuron_df['post'].mean()

In [None]:
# get only the mode of the post sites
neuron_df['post'].mode()
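Note that mode returns a Series rather than a single number, because more than one value can be tied for most frequent. A minimal sketch with made-up numbers (not Hemibrain data):

```python
import pandas as pd

# Hypothetical counts of postsynaptic sites.
post = pd.Series([200, 260, 90, 120, 200], name="post")

print(post.mean())           # 174.0
print(post.median())         # 200.0
print(post.mode().tolist())  # [200]

# When two values are tied for most frequent, mode returns both.
tied = pd.Series([1, 1, 2, 2, 3])
print(tied.mode().tolist())  # [1, 2]
```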

## Histogram plot
Use matplotlib to create a simple histogram from the values in the 'post' column of the dataframe.

In [None]:
# make a histogram of the post sites
import matplotlib.pyplot as plt

# choose the number of bins for your histogram
plt.hist(neuron_df['post'], bins=10)

# add labels and title
plt.xlabel('# of post sites')
plt.ylabel('Frequency')
plt.title('Distribution of Post Sites for Clock Neurons')
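To see the numbers behind a histogram, we can compute the bin edges and counts directly with NumPy's histogram function, which is what matplotlib uses for binning as far as I know. The values below are made up for illustration.

```python
import numpy as np

# Hypothetical counts of postsynaptic sites.
post_counts = np.array([90, 120, 200, 200, 260, 310, 480])

# Split the range from the minimum to the maximum into 4 equal-width bins.
counts, edges = np.histogram(post_counts, bins=4)

# There is always one more edge than there are bins.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {n} neurons")
```

Changing bins changes the edges, and with few neurons the shape of the histogram can shift noticeably, so it is worth trying a few values.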

# Pivot table of neuron connectivity
To explore the connections that the clock neurons make with each other, we'll do another query with neuprint to obtain the data about those connections. This will return a dataframe that I have called 'connections'. Each row describes one presynaptic-to-postsynaptic pair of clock neurons, and the 'weight' column gives the strength of that connection.

In [None]:
# obtain dataframe of connections
from neuprint import fetch_simple_connections

connections = fetch_simple_connections(clock_bodyIds, clock_bodyIds)
connections

To make a heatmap of the connection strengths in this dataframe, we need to convert the dataframe into a pivot table. I named it 'matrix'.

In [None]:
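# create a pivot table of connections
# rows (index) are presynaptic neurons and columns are postsynaptic neurons,
# which matches the y and x axis labels used in the heatmap
matrix = connections.pivot(index='bodyId_pre', columns='bodyId_post', values='weight')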
matrix
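Pairs of neurons that have no row in the connections dataframe show up as NaN in the pivot table. The sketch below uses a made-up connection list (toy_connections is hypothetical) to show this and one way to fill the gaps with zeros. I put presynaptic neurons on the rows and postsynaptic neurons on the columns; swap index and columns for the transposed view.

```python
import pandas as pd

# Hypothetical connection list: one row per connected pair.
toy_connections = pd.DataFrame(
    {
        "bodyId_pre": [101, 101, 102, 103],
        "bodyId_post": [102, 103, 101, 101],
        "weight": [5, 2, 7, 1],
    }
)

# Rows are presynaptic neurons, columns are postsynaptic neurons.
toy_matrix = toy_connections.pivot(
    index="bodyId_pre", columns="bodyId_post", values="weight"
)

# Pairs that never appear in the list are NaN, not 0.
n_missing = int(toy_matrix.isna().sum().sum())

# Treat a missing pair as zero connection strength.
filled = toy_matrix.fillna(0).astype(int)

print(toy_matrix)
print(n_missing)
print(filled)
```

Whether to fill with zeros is a judgment call: in a heatmap, NaN cells are left blank, which can make the absence of a connection easier to spot than a cell labeled 0.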

In [None]:
# make a heatmap of the connectivity matrix with seaborn
import seaborn

fig = plt.figure(figsize=(16, 12))
seaborn.heatmap(matrix, vmin=0, annot=True, cmap=seaborn.light_palette("purple", as_cmap=True), cbar_kws={'label': 'connection strength'})
plt.title('Connectivity matrix')
plt.xlabel('postsynaptic')
plt.ylabel('presynaptic')

Another way to show this data would be to collapse the connection strengths by cell type so that we can create an aggregated pivot table and heatmap.

In [None]:
# use the groupby function to get the total connections by type
connections_by_type = connections.groupby(['type_pre', 'type_post'], sort=False)['weight'].sum().reset_index()

# create a pivot table of connections by type
matrix = connections_by_type.pivot(columns='type_post', index='type_pre', values='weight')
matrix
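The groupby and pivot steps can also be written as a single pivot_table call with a sum aggregation. Here is a minimal sketch on made-up data (toy_connections is hypothetical) that builds the table both ways so the results can be compared.

```python
import pandas as pd

# Hypothetical connection list with cell types already attached.
toy_connections = pd.DataFrame(
    {
        "type_pre": ["LNd", "LNd", "LNd", "DN1a"],
        "type_post": ["DN1a", "DN1a", "LPN", "LNd"],
        "weight": [5, 2, 3, 4],
    }
)

# Two steps: sum the weights per type pair, then reshape.
by_type = (
    toy_connections.groupby(["type_pre", "type_post"], sort=False)["weight"]
    .sum()
    .reset_index()
)
two_step = by_type.pivot(index="type_pre", columns="type_post", values="weight")

# One step: pivot_table aggregates duplicate type pairs while reshaping.
one_step = toy_connections.pivot_table(
    index="type_pre", columns="type_post", values="weight", aggfunc="sum"
)

print(two_step)
print(one_step)
```

A plain pivot raises an error when the same index and column pair occurs more than once, which is why the weights have to be summed first in the two-step version.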

In [None]:
# make a heatmap of the connectivity matrix with seaborn
import seaborn

fig = plt.figure(figsize=(16, 12))
seaborn.heatmap(matrix, vmin=0, annot=True, cmap=seaborn.light_palette("purple", as_cmap=True), cbar_kws={'label': 'connection strength'})
plt.title('Connectivity matrix by cell type')
plt.xlabel('postsynaptic')
plt.ylabel('presynaptic')

# It's your turn!
Try a query of your own, or work with the dataframes in this notebook to do some stats on a different column of data.
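As one possible starting point, here is a sketch that summarizes the 'pre' column separately for each cell type. It runs on a small made-up dataframe (toy_df is hypothetical); to use it in this notebook, swap in neuron_df.

```python
import pandas as pd

# Hypothetical toy data standing in for neuron_df.
toy_df = pd.DataFrame(
    {
        "bodyId": [101, 102, 103, 104],
        "type": ["LNd", "LNd", "DN1a", "LPN"],
        "pre": [50, 70, 20, 35],
    }
)

# Group the rows by cell type, then compute several stats on 'pre' at once.
pre_by_type = toy_df.groupby("type")["pre"].agg(["count", "mean", "max"])

print(pre_by_type)
```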