Network analysis #7

Closed
choldgraf opened this issue Mar 9, 2015 · 12 comments

@choldgraf
Contributor

This issue is where we'll discuss the network analysis component. We can post graphs, code snippets, and brainstorms.

The network analysis project aims to find clusters of co-occurrence between departments, manufacturers, suppliers, product types, etc.

Project lead is @dariusmehri along with @nlin3330

@choldgraf changed the title from "PRJ: Network analysis" to "Network analysis" on Mar 9, 2015
@choldgraf
Contributor Author

Just had a potential idea for this project. On top of using co-occurrence to build graphs, we could define similarities between departments based on their seasonal purchasing. E.g., if I plot the average number of POs per department for each week of the year, you could get a plot like this.

X-axis is time, Y-axis is dept
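[image: image]
https://cloud.githubusercontent.com/assets/1839645/6670878/b061af00-cbbe-11e4-8569-d5d2b53a9f8d.png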

Then, you can build this into a correlation matrix like this:
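[image: image]
https://cloud.githubusercontent.com/assets/1839645/6670895/d185902a-cbbe-11e4-89a8-e15e9ffb99be.png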

Then you could use the above plot to create edges between departments and see what falls out (there already looks to be some clear clustering here).
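
The idea above (weekly PO profiles, then a correlation matrix, then edges) could be sketched roughly as follows. This is only an illustration: the column names `department` and `creation_time`, the helper `seasonal_edges`, the toy data, and the 0.5 threshold are all assumptions, and it counts POs per week rather than averaging across years.

```python
import pandas as pd

def seasonal_edges(orders, threshold=0.5):
    """Return (dept_a, dept_b, correlation) for pairs with similar weekly PO profiles."""
    week = orders["creation_time"].dt.isocalendar().week.astype(int)
    # Rows are departments, columns are weeks of the year, values are PO counts.
    weekly = pd.crosstab(orders["department"], week)
    # Correlate departments with each other across weeks.
    corr = weekly.T.corr()
    depts = list(corr.index)
    # Keep one edge per pair whose correlation exceeds the (arbitrary) threshold.
    return [
        (a, b, float(corr.loc[a, b]))
        for i, a in enumerate(depts)
        for b in depts[i + 1:]
        if corr.loc[a, b] > threshold
    ]

# Toy data: Chemistry and Physics buy early in January, History buys later.
orders = pd.DataFrame({
    "department": ["Chemistry"] * 3 + ["Physics"] * 3 + ["History"] * 3,
    "creation_time": pd.to_datetime([
        "2012-01-03", "2012-01-04", "2012-01-10",
        "2012-01-05", "2012-01-06", "2012-01-11",
        "2012-01-17", "2012-01-24", "2012-01-25",
    ]),
})
edges = seasonal_edges(orders)
```

With this toy data only the Chemistry/Physics pair survives the threshold. On real data the threshold would need tuning, since a low value connects almost everything.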

Just a thought

@anthonysuen

I think this is a great idea!

Anthony

@testchange
Contributor

Can you show the legend for the color code of the first graph (the color code for the second one is the correlation)? What does the bar chart with the lines on top represent?

It is interesting to see that there is high correlation in the middle of the matrix. Maybe we can coordinate the seasonal buying among these departments.

@choldgraf
Contributor Author

Ah good point @kaiweitan, the color code is actually relatively arbitrary. Matplotlib chooses the colors to accentuate the differences in the data, so in this case "white" isn't necessarily 0. When we make a final output of this, then we will make sure to get the colors right.

For the second big correlation matrix, it's currently sorted according to the clustering trees that you see on the margins. We could define "cuts" of those trees as clusters, though doing this is a bit of a dark art. Definitely worth looking into.
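
One way to turn a "cut" of the tree into cluster labels is sketched below with SciPy. The thread doesn't say how the trees in the plot were built, so the linkage method, the correlation distance, the choice of two clusters, and the toy profiles here are all assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy weekly profiles (rows: departments, columns: weeks); values are made up.
labels = ["Chemistry", "Physics", "History", "Sociology"]
profiles = np.array([
    [5, 4, 1, 0],
    [6, 5, 0, 1],
    [0, 1, 5, 6],
    [1, 0, 4, 5],
], dtype=float)

# Build the tree using correlation distance (1 - Pearson r) between rows.
tree = linkage(profiles, method="average", metric="correlation")

# "Cut" the tree so that at most two clusters remain.
cluster_ids = fcluster(tree, t=2, criterion="maxclust")
clusters = dict(zip(labels, cluster_ids.tolist()))
```

The "dark art" part is choosing `t`: asking for a fixed number of clusters (`maxclust`) or cutting at a height (`criterion="distance"`) can give quite different groupings on the same tree.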

@nlin3330
Contributor

Quick update on the network analysis. I managed to subset the data by months and the graph now looks much cleaner. However, it still doesn't mean much without labels. The next step for me would be to color code the difference between suppliers and departments as well as add in labels.
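[image: 3months]
https://cloud.githubusercontent.com/assets/7124729/6738328/8998aa46-ce2e-11e4-81d0-3c7f93f86793.png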

@choldgraf
Contributor Author

Very cool - we could choose different uses for colors. E.g., color code by manufacturer or supplier ID, rather than their category. That way we could see which organizations are persistently connected to others across time.

@dariusmehri

hi nick, nice, but can you explain a bit how you subsetted by months? the
dataset you have seems small, you mean it is for only one month? if so,
which one?

one way to clean up the network graph is to remove nodes with centrality =
0, or with one tie, these ties are not of interest anyway and can be dropped
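
The pruning suggested above could look something like this minimal, library-agnostic sketch. It assumes the graph is stored as a list of (department, supplier) pairs and reads "centrality" as plain degree; `prune_low_degree` and the example names are hypothetical.

```python
from collections import Counter

def prune_low_degree(edges, min_degree=2):
    """Drop every edge touching a node with fewer than `min_degree` ties."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    keep = {node for node, d in degree.items() if d >= min_degree}
    return [(a, b) for a, b in edges if a in keep and b in keep]

edges = [
    ("Chemistry", "SupplierA"),
    ("Physics", "SupplierA"),
    ("Chemistry", "SupplierB"),
    ("Physics", "SupplierB"),
    ("History", "SupplierC"),  # both ends have a single tie
]
pruned = prune_low_degree(edges)
```

Nodes with no ties never appear in an edge list, so they drop out implicitly. This is a single pass: removing one-tie nodes can leave new one-tie nodes behind, so it may need repeating until nothing changes.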

darius

Darius Mehri
Ph.D. Candidate, Sociology
University of California, Berkeley

@nlin3330
Contributor

Hi Darius,

Basically I converted the creation_time variable into a datetime variable, which makes it easy to specify a range of dates. For the graph this was the first three months of the current data (1/1/2012-3/1/2012).
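
For reference, the conversion and date-range subset described above could be done along these lines in pandas. Only `creation_time` comes from the thread; the `po_id` column, the toy rows, and the date string format are assumptions.

```python
import pandas as pd

# Toy purchase orders; creation_time starts out as plain strings.
orders = pd.DataFrame({
    "creation_time": ["2011-12-30", "2012-01-01", "2012-02-15", "2012-03-01", "2012-03-02"],
    "po_id": [1, 2, 3, 4, 5],
})

# Convert to datetime so ranges can be compared directly.
orders["creation_time"] = pd.to_datetime(orders["creation_time"])

# Keep rows inside the range (both endpoints are included by default).
start, end = pd.Timestamp("2012-01-01"), pd.Timestamp("2012-03-01")
subset = orders[orders["creation_time"].between(start, end)]
```

Note that a range ending on 3/1 covers January, February, and one day of March; the end date would have to be 3/31 to include all of March.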

@dariusmehri

hi nick, i see, i did the exact same thing for 2013. there is some issue
with the dataset you are working with: the number of transactions is too
low (i.e., it looks like there are only a few hundred transactions when
there should be about a hundred thousand or more), darius

@dariusmehri

if you used drop_duplicates, there is a chance you may be throwing out too
much data
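
To illustrate how `drop_duplicates` can throw out more than intended, here is a small sketch. The thread doesn't show how it was actually called, so the column names and the `subset` argument are assumptions.

```python
import pandas as pd

orders = pd.DataFrame({
    "po_id": [1, 2, 3, 4],
    "department": ["Chemistry", "Chemistry", "Chemistry", "Physics"],
    "supplier": ["SupplierA", "SupplierA", "SupplierB", "SupplierA"],
})

# Deduplicating on the edge columns keeps one row per department-supplier pair,
# so repeat transactions between the same pair disappear.
pairs = orders.drop_duplicates(subset=["department", "supplier"])

# Deduplicating on all columns only removes rows that are identical everywhere.
exact = orders.drop_duplicates()
```

Here `pairs` has 3 rows while `exact` keeps all 4. If transaction counts matter (e.g. as edge weights), the first form discards exactly that information.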

@choldgraf
Contributor Author

hey @nlin3330 it looks like you worked on the color-coding stuff in a recent commit...do you have any interesting output / plots from that analysis, or still a work-in-progress?

@dariusmehri

hey guys, i am back online to work on the project, sorry again for the absence, my objective is by the end of the week to get the group some nice network graphs and some hard data, nick is away all week (he is out of the country), but i will be in touch with him on and off

here are some of the plans:

  1. i am going to try to trim out the nodes that have only one tie. this should bring down the clutter by a lot; if not, i will figure out a way to reduce it more
  2. did you guys get the department codes yet? the reason i am asking is that, in addition to the time-dependent analysis, i think it will be useful to compare the structure of transactions by field and/or department, i.e. compare the hard sciences to engineering to the social sciences, and so on. i expect we can uncover some structural differences that will be very interesting.
  3. get some more hard numbers (centrality, coherence, etc.) and graph them over time
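
As a rough sketch of item 3, degree centrality could be computed per month like this. The column names, the monthly window, the helper `degree_centrality`, and the choice of degree centrality as the metric are all assumptions; other measures would slot into the same loop.

```python
import pandas as pd

def degree_centrality(edges):
    """Degree centrality: number of distinct neighbours divided by (n - 1)."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbours.items()}

# Toy transactions between departments and suppliers.
orders = pd.DataFrame({
    "creation_time": pd.to_datetime(
        ["2012-01-05", "2012-01-20", "2012-02-03", "2012-02-10", "2012-02-25"]
    ),
    "department": ["Chemistry", "Physics", "Chemistry", "Chemistry", "Physics"],
    "supplier": ["SupplierA", "SupplierA", "SupplierA", "SupplierB", "SupplierB"],
})

# One graph per calendar month, one centrality score per node per month.
rows = []
for month, group in orders.groupby(orders["creation_time"].dt.to_period("M")):
    edges = zip(group["department"], group["supplier"])
    for node, score in degree_centrality(edges).items():
        rows.append({"month": str(month), "node": node, "centrality": score})
centrality_over_time = pd.DataFrame(rows)
```

The resulting long-format table (month, node, centrality) can be pivoted or plotted directly to see which nodes stay central across months.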
