# Visualize Directory
- Following the post at http://www.austintaylor.io/d3/python/pandas/2016/02/01/create-d3-chart-python-force-directed/

## The Network Structure
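- The network is a dictionary with two lists, `nodes` and `links`.
- `nodes` contains one entry per individual node (a file or a directory).
- `links` contains the relationships between nodes; `source` and `target` are zero-based indexes into `nodes`.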

```json
{
  "nodes":  [
    { "name": "desktop", "group":  1},
    { "name": "desktop/apples.txt", "group":  1},
    { "name": "desktop/pineapple/apples.txt", "group":  1},
    { "name": "desktop/bananas.txt", "group":  1}
  ],

  "links":  [
    { "source":  1,  "target":  0,  "value":  5555 },
    { "source":  2,  "target":  0,  "value":  1 },
    { "source":  3,  "target":  0,  "value": 1 }
  ]
}
```
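The same structure can be assembled in Python before it is serialized. This is a minimal sketch that mirrors the JSON above; the variable names `paths`, `nodes`, `links` and `network` are illustrative, and every link is given a `value` of 1 for simplicity.

```python
import json

# Illustrative paths mirroring the JSON example; index 0 is the root.
paths = [
    "desktop",
    "desktop/apples.txt",
    "desktop/pineapple/apples.txt",
    "desktop/bananas.txt",
]

# One node per path, all in the same group.
nodes = [{"name": p, "group": 1} for p in paths]

# Link every non-root node to the root node at index 0.
links = [{"source": i, "target": 0, "value": 1} for i in range(1, len(paths))]

network = {"nodes": nodes, "links": links}
print(json.dumps(network, indent=2))
```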

## Setup

### Set group node option

In [1]:
set_groups_to_file_types = True

# Base size of nodes; a size bracket derived from the file/directory size is added to this value
base_size = 2.5

### Modules

In [2]:
import os
import pandas
import json

### Set path of directory you wish to visualize

In [3]:
path = '/Users/danielcorcoran/desktop/github_repos/python_nb_data/'
export_path = "/users/danielcorcoran/desktop/github_repos/python_nb_networks/json/"

### Helper functions to get size of directory/files, define node sizes

#### Retrieving directory size in bytes given a `directory_path`

In [4]:
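def get_directory_size(directory_path = ""):
    """Return the total size in bytes of all files under directory_path."""
    directory_size = 0

    # Walk the tree and add up the size of every file found
    for dirpath, dirnames, filenames in os.walk(directory_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            directory_size += os.path.getsize(fp)

    return directory_size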

#### Retrieving file size in bytes given a `file_path`

In [5]:
def get_file_size(file_path = ""):
    
    file_size = os.path.getsize(file_path)
    
    return file_size
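The two size helpers can be checked against a small temporary directory. This sketch restates `get_directory_size` so that it runs on its own; the file names and byte counts are made up for the demonstration.

```python
import os
import tempfile


def get_directory_size(directory_path=""):
    """Sum the sizes in bytes of all files below directory_path."""
    directory_size = 0
    for dirpath, dirnames, filenames in os.walk(directory_path):
        for f in filenames:
            directory_size += os.path.getsize(os.path.join(dirpath, f))
    return directory_size


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: one 10-byte file at the top, one 5-byte file in a subfolder.
    os.makedirs(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "a.txt"), "wb") as fh:
        fh.write(b"x" * 10)
    with open(os.path.join(tmp, "sub", "b.txt"), "wb") as fh:
        fh.write(b"y" * 5)

    total_size = get_directory_size(tmp)
    single_file_size = os.path.getsize(os.path.join(tmp, "a.txt"))
    # A file path passed as a directory yields 0, because os.walk
    # produces nothing for a path that is not a directory.
    size_of_file_as_directory = get_directory_size(os.path.join(tmp, "a.txt"))

print(total_size, single_file_size, size_of_file_as_directory)
```

The last value matters later in the notebook: any file that is mistaken for a directory is recorded with a size of 0.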

#### Calculate size bracket based on file size in megabytes, return a `node_size`

In [6]:
def get_size_bracket(file_size_megabytes, base_size = 0):
    
    if file_size_megabytes <= 1:
        node_size = 3
    elif file_size_megabytes <= 10:
        node_size = 4
    elif file_size_megabytes <= 100:
        node_size = 4.5
    elif file_size_megabytes <= 1000:
        node_size = 5
    elif file_size_megabytes <= 10000:
        node_size = 6
    else:
        node_size = 10
        
    return node_size + base_size
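The bracket logic can also be expressed as a lookup table. The following is a sketch that assumes the same thresholds and sizes as `get_size_bracket`; the names `UPPER_BOUNDS_MB`, `BRACKET_SIZES` and `size_bracket_from_table` are hypothetical and not part of the notebook.

```python
from bisect import bisect_left

# Inclusive upper bounds in megabytes, copied from get_size_bracket.
UPPER_BOUNDS_MB = [1, 10, 100, 1000, 10000]
# Node size for each bracket; the last entry covers everything above 10000 MB.
BRACKET_SIZES = [3, 4, 4.5, 5, 6, 10]


def size_bracket_from_table(file_size_megabytes, base_size=0):
    """Return the bracket size for a file size in megabytes, plus base_size."""
    # bisect_left finds the first bound that is >= the size, which
    # matches the chain of "<=" comparisons in get_size_bracket.
    position = bisect_left(UPPER_BOUNDS_MB, file_size_megabytes)
    return BRACKET_SIZES[position] + base_size


print(size_bracket_from_table(0.027164, 2.5))
print(size_bracket_from_table(250))
```

Keeping the thresholds in two lists makes it easier to adjust the brackets without touching the control flow.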

## Process

### Create Initial Dataframe containing absolute paths and folder paths

#### Create a list of all absolute paths within the `path` directory; this is used to branch out the relationships

In [7]:
absolute_paths = []

In [8]:
for dirpath, dirnames, filenames in os.walk(path):

    #print(dirpath, dirnames, filenames)

    for dirname in dirnames:
        x = dirpath + "/" + dirname
        
        # Strip whitespace, collapse double forward slashes, remove quotation marks
        absolute_paths.append(x.strip().replace("//",
                                                "/").replace("'", "").replace(
                                                    '"', ''))

    for filename in filenames:
        y = dirpath + "/" + filename
        
        # Strip whitespace, collapse double forward slashes, remove quotation marks
        absolute_paths.append(y.strip().replace("//",
                                                "/").replace("'", "").replace(
                                                    '"', ''))
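The double slashes arise because `path` ends with a slash and another `"/"` is concatenated. A possible alternative, sketched here under the assumption that paths should be left unmodified, is `os.path.join`, which does not duplicate the separator. The helper name `collect_paths` and the temporary files are hypothetical.

```python
import os
import tempfile


def collect_paths(root):
    """Return every directory and file path below root, excluding root itself."""
    collected = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            # os.path.join adds a separator only when one is needed.
            collected.append(os.path.join(dirpath, name))
    return collected


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout used only for this demonstration.
    os.makedirs(os.path.join(tmp, "data"))
    open(os.path.join(tmp, "data", "table.csv"), "w").close()
    open(os.path.join(tmp, "README.md"), "w").close()

    # Pass the root with a trailing separator, as the notebook's `path` has.
    found = collect_paths(tmp + os.sep)
    relative = sorted(os.path.relpath(p, tmp) for p in found)

print(relative)
```

Unlike the cell above, this sketch does not remove quotation marks, so the collected paths still match the names on disk.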

#### Store data in pandas dataframe

In [9]:
data = pandas.DataFrame(absolute_paths)
data.rename({0: "absolute_path"}, axis=1, inplace=True)
data.head(15)

| | absolute_path |
|---|---|
| 0 | /Users/danielcorcoran/desktop/github_repos/pyt... |
| 1 | /Users/danielcorcoran/desktop/github_repos/pyt... |
| 2 | /Users/danielcorcoran/desktop/github_repos/pyt... |
| 3 | /Users/danielcorcoran/desktop/github_repos/pyt... |
| 4 | /Users/danielcorcoran/desktop/github_repos/pyt... |
| 5 | /Users/danielcorcoran/desktop/github_repos/pyt... |
| 6 | /Users/danielcorcoran/desktop/github_repos/pyt... |
| 7 | /Users/danielcorcoran/desktop/github_repos/pyt... |
| 8 | /Users/danielcorcoran/desktop/github_repos/pyt... |
| 9 | /Users/danielcorcoran/desktop/github_repos/pyt... |


### Manipulate dataframe into desired format

#### Create `source` column

In [10]:
for index in range(data.shape[0]):
    item = data.iloc[index, 0]
    split = item.split("/")

    # The source is the parent path: everything before the last component
    source = "/".join(split[:-1])
    data.loc[index, "source"] = source

In [11]:
data.head()

| | absolute_path | source |
|---|---|---|
| 0 | /Users/danielcorcoran/desktop/github_repos/pyt... | /Users/danielcorcoran/desktop/github_repos/pyt... |
| 1 | /Users/danielcorcoran/desktop/github_repos/pyt... | /Users/danielcorcoran/desktop/github_repos/pyt... |
| 2 | /Users/danielcorcoran/desktop/github_repos/pyt... | /Users/danielcorcoran/desktop/github_repos/pyt... |
| 3 | /Users/danielcorcoran/desktop/github_repos/pyt... | /Users/danielcorcoran/desktop/github_repos/pyt... |
| 4 | /Users/danielcorcoran/desktop/github_repos/pyt... | /Users/danielcorcoran/desktop/github_repos/pyt... |
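The parent of each path can also be obtained with `os.path.dirname`. The sketch below compares it with the split-and-join approach on a few made-up paths; `example_paths` is illustrative only.

```python
import os

# Hypothetical paths, not taken from the notebook's data.
example_paths = [
    "/repo/scripts",
    "/repo/scripts/test.py",
    "/repo/README.md",
]

# Parent directory via the standard library.
parents = [os.path.dirname(p) for p in example_paths]

# Parent directory via split and join, as in the cell above.
parents_by_split = ["/".join(p.split("/")[:-1]) for p in example_paths]

print(parents)
```

With pandas, the same function could presumably be applied to the whole column at once, for example `data["absolute_path"].map(os.path.dirname)`, instead of looping over row indexes.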


#### Create `size_megabytes`, `size_bytes` columns

In [12]:
for index in range(data.shape[0]):
    item = data.loc[index, "absolute_path"]
    last_item = item.split("/")[-1]
    
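    # Heuristic: a name that contains "." but does not start with "." is a file.
    # Dotfiles such as .DS_Store and .gitignore are therefore treated as
    # directories, and get_directory_size returns 0 for them because os.walk
    # yields nothing for a path that is not a directory.
    if not last_item.startswith(".") and "." in last_item: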
        print(last_item, "is a filename")
        size = get_file_size(item)
    else:
        print(last_item, "is a directory")
        size = get_directory_size(item)
        
    data.loc[index, "size_bytes"] = size
    
data["size_megabytes"] = data["size_bytes"]/1000000

scripts is a directory
.ipynb_checkpoints is a directory
.git is a directory
data is a directory
notebook_compare_tables_numerics.ipynb is a filename
notebook_append_lga_codes_revised.ipynb is a filename
.DS_Store is a directory
notebook_pandas_excel_parse.ipynb is a filename
notebook_backup_to_desktop.ipynb is a filename
notebook_standardizing_data.ipynb is a filename
notebook_unpivot_partnerships_full.ipynb is a filename
notebook_lists_in_python.ipynb is a filename
notebook_dealing_with_nulls.ipynb is a filename
README.md is a filename
.gitignore is a directory
notebook_pandas_nulls.ipynb is a filename
notebook_str.ipynb is a filename
notebook_validate_distinct_tables.ipynb is a filename
notebook_test_game.ipynb is a filename
notebook_pull_from_postgres.ipynb is a filename
notebook_append_lga_codes.ipynb is a filename
notebook_pydata_seattle_2017.ipynb is a filename
notebook_import_from_path.ipynb is a filename
test.py is a filename
notebook_append_lga_codes-checkpoint.ipynb is a fil
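The output above shows `.DS_Store` and `.gitignore` reported as directories. One way to avoid relying on the name is to ask the filesystem. This is a sketch rather than the notebook's method; `classify` and `looks_like_file` are hypothetical helpers, and the temporary files exist only for the demonstration.

```python
import os
import tempfile


def classify(path):
    """Return 'directory' or 'file' based on the filesystem, not the name."""
    return "directory" if os.path.isdir(path) else "file"


def looks_like_file(name):
    """The name-based heuristic used in the cell above."""
    return not name.startswith(".") and "." in name


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical entries: one hidden directory, one dotfile, one regular file.
    os.makedirs(os.path.join(tmp, ".git"))
    for name in (".gitignore", "README.md"):
        open(os.path.join(tmp, name), "w").close()

    results = {
        name: (classify(os.path.join(tmp, name)), looks_like_file(name))
        for name in (".git", ".gitignore", "README.md")
    }

for name, (kind, heuristic_says_file) in results.items():
    print(name, kind, heuristic_says_file)
```

The two approaches disagree only for `.gitignore`, which is a file that the heuristic does not recognize as one.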

#### Create `node_size` column

In [13]:
for index in range(data.shape[0]):
    
    item = data.loc[index, "size_megabytes"]
    
    node_size = get_size_bracket(item, base_size)
    
    data.loc[index, "node_size"] = node_size

data.head(5)

| | absolute_path | source | size_bytes | size_megabytes | node_size |
|---|---|---|---|---|---|
| 0 | /Users/danielcorcoran/desktop/github_repos/pyt... | /Users/danielcorcoran/desktop/github_repos/pyt... | 67.0 | 6.7e-05 | 5.5 |
| 1 | /Users/danielcorcoran/desktop/github_repos/pyt... | /Users/danielcorcoran/desktop/github_repos/pyt... | 27164.0 | 0.027164 | 5.5 |
| 2 | /Users/danielcorcoran/desktop/github_repos/pyt... | /Users/danielcorcoran/desktop/github_repos/pyt... | 192585.0 | 0.192585 | 5.5 |
| 3 | /Users/danielcorcoran/desktop/github_repos/pyt... | /Users/danielcorcoran/desktop/github_repos/pyt... | 46821.0 | 0.046821 | 5.5 |
| 4 | /Users/danielcorcoran/desktop/github_repos/pyt... | /Users/danielcorcoran/desktop/github_repos/pyt... | 42374.0 | 0.042374 | 5.5 |


### Create groups based on file type (optional)
- These will be used to colour code the nodes in `nodes_list`

In [14]:
for index in range(data.shape[0]):
    
    absolute_path = data.loc[index, "absolute_path"]
    
    last_item = absolute_path.split("/")[-1] 
    
    if "." in last_item:
        data.loc[index, "file_extension"] = "." + last_item.split(".")[-1]
    else:
        data.loc[index, "file_extension"] = "folder"

In [15]:
unique_extensions = list(data["file_extension"].unique())
unique_extensions

['folder',
 '.ipynb_checkpoints',
 '.git',
 '.ipynb',
 '.DS_Store',
 '.md',
 '.gitignore',
 '.py',
 '.sample',
 '.xlsx']
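In the list above, hidden entries such as `.git` and `.gitignore` appear as their own "extensions" because everything after the last dot is taken. A possible alternative is `os.path.splitext`, which does not treat a leading dot as an extension separator. In this sketch, `extension_or_folder` and the `"no_extension"` label are hypothetical, and the caller is assumed to know whether the entry is a directory.

```python
import os


def extension_or_folder(name, is_directory):
    """Return 'folder' for directories, otherwise the file extension."""
    if is_directory:
        return "folder"
    extension = os.path.splitext(name)[1]
    # Hypothetical label for files such as .gitignore that have no extension.
    return extension if extension else "no_extension"


labels = [
    extension_or_folder("scripts", True),
    extension_or_folder(".git", True),
    extension_or_folder("notebook_str.ipynb", False),
    extension_or_folder(".gitignore", False),
    extension_or_folder("archive.tar.gz", False),
]
print(labels)
```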

### Create a list containing only destinations; this is used to build the nodes list of the main dictionary

In [16]:
destination_list = list(data["absolute_path"])
destination_list

['/Users/danielcorcoran/desktop/github_repos/python_nb_data/scripts',
 '/Users/danielcorcoran/desktop/github_repos/python_nb_data/.ipynb_checkpoints',
 '/Users/danielcorcoran/desktop/github_repos/python_nb_data/.git',
 '/Users/danielcorcoran/desktop/github_repos/python_nb_data/data',
 '/Users/danielcorcoran/desktop/github_repos/python_nb_data/notebook_compare_tables_numerics.ipynb',
 '/Users/danielcorcoran/desktop/github_repos/python_nb_data/notebook_append_lga_codes_revised.ipynb',
 '/Users/danielcorcoran/desktop/github_repos/python_nb_data/.DS_Store',
 '/Users/danielcorcoran/desktop/github_repos/python_nb_data/notebook_pandas_excel_parse.ipynb',
 '/Users/danielcorcoran/desktop/github_repos/python_nb_data/notebook_backup_to_desktop.ipynb',
 '/Users/danielcorcoran/desktop/github_repos/python_nb_data/notebook_standardizing_data.ipynb',
 '/Users/danielcorcoran/desktop/github_repos/python_nb_data/notebook_unpivot_partnerships_full.ipynb',
 '/Users/danielcorcoran/desktop/github_repos/pytho

In [17]:
# Append the root directory last so that it becomes the final node
destination_list.append(path)

### Create the final dictionary

#### Build `nodes_list`

In [18]:
nodes_list = []

for index in range(len(destination_list)):

    if index == len(destination_list) - 1:
        
        # Large sentinel value that keeps the root node in a group of its own
        group_index = 999999999999999999
        
        directory_size_bytes = get_directory_size(directory_path=path)
        
        directory_size_megabytes = directory_size_bytes / 1000000
        
        node_size = get_size_bracket(directory_size_megabytes, base_size)
        
        nodes_list.append({"group": group_index, 
                           "name": destination_list[index],
                           "size" : node_size
                          })
        
    else:
        
        if set_groups_to_file_types:
            group_text = data.loc[index, "file_extension"]
            group_index = unique_extensions.index(group_text)
        
        else:
            group_index = 1

    
        node_size = data.loc[index, "node_size"]
        nodes_list.append({"group": group_index, 
                           "name": destination_list[index],
                           "size" : node_size
                          })

In [19]:
nodes_list[:3]

[{'group': 0,
  'name': '/Users/danielcorcoran/desktop/github_repos/python_nb_data/scripts',
  'size': 5.5},
 {'group': 1,
  'name': '/Users/danielcorcoran/desktop/github_repos/python_nb_data/.ipynb_checkpoints',
  'size': 5.5},
 {'group': 2,
  'name': '/Users/danielcorcoran/desktop/github_repos/python_nb_data/.git',
  'size': 5.5}]
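The group of each node is the position of its extension in the list of unique extensions. The sketch below shows the same idea with a dictionary lookup instead of repeated `list.index` calls; the extensions, names and the fixed size of 5.5 are made up for illustration.

```python
# Hypothetical extensions and names, one per node, in node order.
extensions = ["folder", ".ipynb", ".md", ".ipynb", "folder", ".py"]
names = ["scripts", "a.ipynb", "README.md", "b.ipynb", "data", "test.py"]

# dict.fromkeys drops duplicates while keeping first-seen order.
unique = list(dict.fromkeys(extensions))

# Map each extension to its group index once.
group_of = {extension: index for index, extension in enumerate(unique)}

nodes = [
    {"group": group_of[extension], "name": name, "size": 5.5}
    for extension, name in zip(extensions, names)
]

print(unique)
print([node["group"] for node in nodes])
```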

#### Build `links_list`

In [20]:
links_list = []

In [21]:
for index in range(data.shape[0]):

    try:

        target = index

        source_text = data.loc[index, "source"]

        source = destination_list.index(source_text)

        links_list.append({"source": source, "target": target, "value": 1})
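    except ValueError:
        # list.index raises ValueError when the parent path is not in
        # destination_list. This happens for top-level items: their parent has
        # no trailing slash, while the root entry appended from `path` does.
        # The fallback links them to the root node (the last entry).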

        print(index, ' has failed, attempting alternative method')

        target = index

        source = len(destination_list) - 1

        links_list.append({"source": source, "target": target, "value": 1})

0  has failed, attempting alternative method
1  has failed, attempting alternative method
2  has failed, attempting alternative method
3  has failed, attempting alternative method
4  has failed, attempting alternative method
5  has failed, attempting alternative method
6  has failed, attempting alternative method
7  has failed, attempting alternative method
8  has failed, attempting alternative method
9  has failed, attempting alternative method
10  has failed, attempting alternative method
11  has failed, attempting alternative method
12  has failed, attempting alternative method
13  has failed, attempting alternative method
14  has failed, attempting alternative method
15  has failed, attempting alternative method
16  has failed, attempting alternative method
17  has failed, attempting alternative method
18  has failed, attempting alternative method
19  has failed, attempting alternative method
20  has failed, attempting alternative method
21  has failed, attempting alternative metho
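The failures reported above come from a trailing-slash mismatch: the parent of a top-level item has no trailing slash, while the root entry does. One way to avoid the fallback is to normalize the root before it is appended and to look parents up in a dictionary. This is a sketch under that assumption; `root`, `children` and `index_of` are hypothetical names and paths.

```python
import os

# Hypothetical root with a trailing slash, as in the notebook's `path`.
root = "/repo/"
children = ["/repo/scripts", "/repo/scripts/test.py", "/repo/README.md"]

# Strip the trailing slash so that the root matches its children's parent path.
destinations = children + [root.rstrip("/")]

# Map each path to its node index once, instead of calling list.index per row.
index_of = {path: index for index, path in enumerate(destinations)}

links = []
for target, child in enumerate(children):
    parent = os.path.dirname(child)
    links.append({"source": index_of[parent], "target": target, "value": 1})

print(links)
```

Note that this changes the root node's name, which would then be stored without the trailing slash.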

## Process final data

### Merge nodes and links lists into one dictionary

In [22]:
json_data = {"nodes": nodes_list, "links": links_list}

### Convert Python dictionary to JSON string

In [23]:
json_dump = json.dumps(json_data, indent=1, sort_keys=True)

### Export to the file 'pcap_export.json', which is used in index.html

In [24]:
# The with statement closes the file even if the write fails
with open(export_path + "pcap_export.json", "w") as json_out:
    json_out.write(json_dump)
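The conversion and export steps can be combined with `json.dump`, which writes directly to an open file. The sketch below writes to a temporary directory and reads the file back as a check; the contents of `json_data` here are a made-up placeholder, not the notebook's result.

```python
import json
import os
import tempfile

# Placeholder data with the same shape as the notebook's json_data.
json_data = {
    "nodes": [{"group": 0, "name": "root", "size": 5.5}],
    "links": [],
}

with tempfile.TemporaryDirectory() as tmp:
    out_file = os.path.join(tmp, "pcap_export.json")

    # Serialize straight to the file with the same options as json.dumps above.
    with open(out_file, "w") as fh:
        json.dump(json_data, fh, indent=1, sort_keys=True)

    # Read the file back to confirm that it contains valid JSON.
    with open(out_file) as fh:
        reloaded = json.load(fh)

print(reloaded == json_data)
```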