
Yadav clustering #2

Open: wants to merge 12 commits into master
Conversation

kellnett

No description provided.

@kellnett kellnett changed the title Create 1D clustering Yadav.R Yadav clustering Mar 15, 2019
@jdkent
Member

jdkent commented Mar 18, 2019

Awesome work Kelle! Glad you were able to figure out pull requests, I'm going to quote part of your email to me just so I can keep the conversation here:

Just as a quick summary, I found the function "IdClusters", which performs the UPGMA analysis and reports how many clusters fall within the given cutoff, and how many spines are in each cluster. From there I find the number of clustered spines (counting only clusters with >1 spine in them). I run that with the random data a number of times to get a distribution of the number of clustered spines given the total number of spines.
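(For illustration, the counting step described above can be sketched in Python. This uses a simple nearest-neighbour gap rule on sorted 1D positions as a stand-in for DECIPHER's IdClusters UPGMA cutoff, so the grouping is not identical to the R analysis; the positions here are made up.)

```python
def count_clustered_spines(positions, cutoff):
    """Group 1D spine positions into clusters: consecutive sorted
    positions closer than `cutoff` share a cluster (single linkage).
    Returns the number of spines in clusters of size > 1."""
    pts = sorted(positions)
    clusters, current = [], [pts[0]]
    for p in pts[1:]:
        if p - current[-1] <= cutoff:
            current.append(p)          # within cutoff: same cluster
        else:
            clusters.append(current)   # gap too large: start a new cluster
            current = [p]
    clusters.append(current)
    return sum(len(c) for c in clusters if len(c) > 1)

# Example: 3 spines near 1.0 form a cluster; 5.0 and 9.0 are isolated
print(count_clustered_spines([1.0, 1.2, 1.3, 5.0, 9.0], cutoff=0.5))  # 3
```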

This next part I'm really trying to think through conceptually, so I'm not quite sure if I have it right yet, but I use dnorm() on the number of clusters (and the mean/std of that sample) to find the probability density function, then I take the pnorm() of that calculated dnorm to get a Cscore. I think that is how the Yadav paper (attached) calculates Cscore, but like I said, I'm still trying to think through it.
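(If the Yadav et al. C-score is read as the normal CDF of the observed clustered-spine count against the null distribution's mean and sd, it is equivalent to R's `pnorm(observed, mean, sd)` directly; the intermediate `dnorm` step would not be needed. A Python sketch of that reading, with made-up numbers:)

```python
import math

def c_score(observed, null_mean, null_sd):
    """P(X <= observed) under Normal(null_mean, null_sd**2);
    equivalent to R's pnorm(observed, null_mean, null_sd)."""
    z = (observed - null_mean) / null_sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Observed count 30 against a null of mean 20, sd 5 (z = 2)
print(round(c_score(30, 20, 5), 4))  # 0.9772
```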

I like the idea of using Z-scores too. My goal is to keep pursuing all of these possible analysis routes so we can really compare the kinds of values we get. Once we have working code, I can run it on the whole dataset to see whether your new analysis roughly aligns with the original analysis I was using.

I'll step through the code and may leave questions/comments as I work through it.

@jdkent left a comment:

Awesome work Kelle! I have a few comments/suggestions for further conversation.

1D clustering Yadav.R: two outdated review threads (resolved)
spines_in_cluster_test_1D <- cluster_freq_test_1D %>% group_by(is_clustered) %>% summarise(num_clusters_test_1D = sum(Freq)) # count how many spines are / are not in a true cluster
spines_clustered_test_1D <- spines_in_cluster_test_1D[2,2] # number of spines in a cluster
spines_clustered_test_1D[is.na(spines_clustered_test_1D)] <- 0 # use 0 instead of NA when no spines are clustered in the random sample (must run before the subtraction below, or it propagates NA)
spines_not_test_1D <- as.numeric(total_spines - spines_clustered_test_1D) # number of spines not clustered
jdkent (Member):

This would be the case when there are no clusters that contain more than one spine?

kellnett (Author):

Correct. Sometimes there would be "no clusters" (no groups with more than one spine), which returned an NA and broke the rest of the code.

1D clustering Yadav.R: outdated review thread (resolved)

# 3D random spines for loop
for(j in 1:100){
test_data_X <- data.frame(sample(df$X), df$Y, df$Z) # randomize the X's while keeping each spine's observed Y and Z, to stay "biologically plausible"
jdkent (Member):

I think only the first line and last line in this block are necessary:

test_data <- data.frame(sample(df$X), sample(df$Y), sample(df$Z))
test_dist_3D <- as.matrix(dist(test_data)) # creates distance matrix for random sample

But I think I see your thought process to constrain how "random" the datapoints are, so I'll just pin this for further conversation.
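(The difference between the two randomizations is whether the (x, y, z) triples stay paired: shuffling only X keeps each spine's observed Y/Z, while shuffling all three columns independently breaks the joint structure entirely. A Python sketch with made-up coordinates:)

```python
import random

random.seed(42)
xs = [1.0, 2.0, 3.0, 4.0]   # hypothetical spine coordinates
ys = [10.0, 20.0, 30.0, 40.0]
zs = [0.1, 0.2, 0.3, 0.4]

# Constrained version: permute X only, keep Y/Z paired per spine
x_only = list(zip(random.sample(xs, len(xs)), ys, zs))

# Fully random version: permute each axis independently
full = list(zip(random.sample(xs, len(xs)),
                random.sample(ys, len(ys)),
                random.sample(zs, len(zs))))

# Either way the marginal values are preserved; only the pairing changes
print(sorted(p[0] for p in full) == xs)  # True
```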

kellnett (Author):

I think I have an idea for how to more accurately constrain the datapoints, since the random points still seem a little too random (we should expect clustering on most dendritic branches; we just want to see changes in the degree).

I'm thinking that we need to set up some sort of selection criteria for the random coordinates. The dendrites and spines essentially live in a known cylindrical space (the length being the length of the dendrite, which we have 3D coordinates to regenerate, and the width/radius being the max spine length). Currently, when I map the random 3D coordinates, it doesn't really look like spines along a dendritic length.

So, if we require that the random 3D coordinates fall within those known limits, we may get better randomization. I'm going to think about that this week, but let me know what you think.
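(One way to implement that constraint: sample each random spine uniformly inside the cylinder, drawing the axial position uniformly along the dendrite length, the angle uniformly, and the radius as R * sqrt(u) so that points are uniform over the cross-section. A Python sketch under assumed dimensions of 50 µm length and 5 µm radius:)

```python
import math
import random

def random_point_in_cylinder(length, radius, rng=random):
    """Uniform random point inside a cylinder aligned with the x-axis."""
    x = rng.uniform(0.0, length)                    # position along the dendrite
    theta = rng.uniform(0.0, 2.0 * math.pi)         # angle around the axis
    r = radius * math.sqrt(rng.uniform(0.0, 1.0))   # sqrt keeps density uniform over the disc
    return (x, r * math.cos(theta), r * math.sin(theta))

random.seed(0)
pts = [random_point_in_cylinder(50.0, 5.0) for _ in range(1000)]
# every sampled point respects the cylinder limits
print(all(0 <= x <= 50 and math.hypot(y, z) <= 5 for x, y, z in pts))  # True
```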

@jdkent (Member), Mar 18, 2019:

That sounds like a good idea, but let me think through it.
I'll make up some numbers (that may not be biologically plausible).

if we have a 50um dendrite segment with 150 spine heads, then we have 150 x-coordinates, 150 y-coordinates and 150 z-coordinates.
If we randomized using the data available, we would have 150*150*150 or 150^3 or 3375000 possible combinations of 150 coordinates.

If we generate a list of possible coordinates within a cylinder incrementing by 0.01, this list should include the 3375000 coordinates produced by the previous method (since all those xyz-coordinates were actually observed and recorded) and several billion others.
napkin math (if I assume a radius of 5um):
volume of cylinder: 50 * (pi * 5^2) ≈ 3927 µm³
increments of 0.01: 3927 / 0.01 = 392700 coordinates
all combinations of selecting 150 points: 392700 choose 150 ≈ 2.187 × 10^576 possible combinations of 150 coordinates

(I did not use a napkin).

From this example, randomizing the observed data constrains the space where spine heads "can" be by many orders of magnitude. I think the hop from 1D space to 3D space adds more dimensions for the data to move in, so completely random data are less likely to cluster. I don't know of an easy way around that, besides adding conditionals that reselect spines that land too far from their next closest spine.

Using the observed data should look closer to the biological reality (but probably not anywhere close to perfect), and the resulting coordinates should be within the cylinder of interest since we are using the observed data to generate "random" data points. Generating all possible datapoints will create a bunch of new places the spine heads can be, but I cannot think of a reason why they would be more clustered since we are still dealing with the problem of 3 dimensions versus 1 dimension (it's easier to cluster in 1 dimension than 3). Does this make sense?
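(The napkin math above can be sanity-checked on a log scale without printing the full integer, using the log-gamma form of the binomial coefficient:)

```python
import math

def log10_choose(n, k):
    """log10 of the binomial coefficient C(n, k), via lgamma."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(10)

print(150 ** 3)                          # 3375000 combinations from shuffling observed coordinates
print(round(log10_choose(392700, 150)))  # 576, i.e. a ~577-digit number of combinations
```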

jdkent (Member):

Since the coordinates do not look right, do we need to include information about soma distance in addition to the x-y-z coordinates? Perhaps I'm still confused about what each variable represents in space.

std_curve_3D <- sd(curve_dnorm_3D)
mean_curve_3D <- mean(curve_dnorm_3D)

Cscore_3D <- pnorm(spines_clustered_3D, mean_test_3D, std_test_3D)
jdkent (Member):

Yeah, I think you are right that we will not see much useful output here since the amount of observed clustering is above anything that was simulated. We can leverage the z-score:

zscore <- (spines_clustered_3D - mean_test_3D) / std_test_3D

With the file you shared with me, I got about 6.23, so the observed clustering was 6.23 standard deviations above the mean of the simulated random distribution.
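(The z-score calculation end to end, in a Python sketch: take the clustered-spine counts produced by the randomization loop as the null distribution, then standardize the observed count. The numbers below are made up, not the real data:)

```python
from statistics import mean, stdev

# Hypothetical null distribution: clustered-spine counts from the
# randomization loop (invented values for illustration)
null_counts = [12, 15, 11, 14, 13, 16, 12, 14, 13, 15]
observed = 27

# Same formula as the R line: (spines_clustered - mean) / sd
z = (observed - mean(null_counts)) / stdev(null_counts)
print(round(z, 2))  # 8.54: the observed count sits ~8.5 SDs above the null
```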

kellnett and others added 4 commits October 28, 2019 15:06
current code I'm using
I worked on it separately outside of GitHub so a lot is different