[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AyanSinhaMahapatra/scancode-results-analyzer/blob/master/src/notebooks/load_results_files.ipynb)

## Run the Following 6 Cells Only in Google Colab

[Installing Conda in Colab (the approach used below)](https://towardsdatascience.com/conda-google-colab-75f7c867a522)

In [None]:
%env PYTHONPATH=

In [None]:
%%bash

MINICONDA_INSTALLER_SCRIPT=Miniconda3-4.5.4-Linux-x86_64.sh
MINICONDA_PREFIX=/usr/local
wget https://repo.continuum.io/miniconda/$MINICONDA_INSTALLER_SCRIPT
chmod +x $MINICONDA_INSTALLER_SCRIPT
./$MINICONDA_INSTALLER_SCRIPT -b -f -p $MINICONDA_PREFIX
conda install --channel defaults conda python=3.6 --yes
conda update --channel defaults --all --yes

In [None]:
import sys
_ = (sys.path.append("/usr/local/lib/python3.6/site-packages"))

In [None]:
!conda install -c conda-forge pandas numpy matplotlib seaborn -y

In [None]:
!git clone https://github.com/AyanSinhaMahapatra/scancode-results-analyzer.git scancode-results-analyzer
%cd scancode-results-analyzer
!ls

In [None]:
sys.path.append('/content/scancode-results-analyzer/src')

# `load_results_file.py`

In [1]:
import sys
import numpy as np
import pandas as pd
import os

# Path To Local Folder
sys.path.append('/home/ayan/Desktop/nexB/gsoc20/scancode-results-analyzer/src')

In [2]:
from results_analyze.load_results_package import ResultsDataFramePackage
results_package = ResultsDataFramePackage()

## Import Data From JSON instead of Database, on Google Colab

In [4]:
json_filename = "lic_scancode_before.json"
json_filepath = os.path.join(results_package.json_input_dir, json_filename)
mock_metadata_filepath = os.path.join(results_package.json_input_dir, results_package.mock_metadata_filename)

In [5]:
path_json_dataframe = results_package.mock_db_data_from_json(json_filepath, mock_metadata_filepath)
path_json_dataframe

,path,json_content
0,mock/data/-/multiple-packages/random/1.0.0/too...,"{'_metadata': {'type': 'scancode', 'url': 'cd:..."


## Importing Data From Postgres Database

This section simulates the data going into the `ResultsDataFrameFile.create_file_level_dataframe` function, which is called by `ResultsDataFramePackage.create_package_level_dataframe`.
The cells below reuse code snippets from `ResultsDataFramePackage.create_package_level_dataframe`.

In [3]:
path_json_dataframe = results_package.convert_records_to_json(20)

The next cell creates `files_dataframe` and breaks out of the loop at a good example (the row with index 0). The resulting `file_list` is what gets passed into `ResultsDataFrameFile.create_file_level_dataframe`.

In [6]:
files_dataframe, metadata_dataframe = results_package.modify_package_level_dataframe(path_json_dataframe)

for package_scan_result in files_dataframe.itertuples():
    file_list = package_scan_result[2]
    if package_scan_result[0] == 0:
        break
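
The positional indexing in the loop above relies on how `DataFrame.itertuples` lays out each row: position 0 is the index label and the columns follow in order. A minimal sketch on toy data, assuming (hypothetically) that the two columns are called `path` and `files`; the real column names in `files_dataframe` may differ:

```python
import pandas as pd

# Toy stand-in for files_dataframe; the column names are assumptions.
toy_packages = pd.DataFrame(
    {
        "path": ["pkg-a", "pkg-b"],
        "files": [[{"path": "a.c"}], [{"path": "b.c"}, {"path": "c.c"}]],
    }
)

# Each row is a namedtuple: (Index, path, files).
first_row = next(toy_packages.itertuples())

index_label = first_row[0]      # the row's index label
toy_file_list = first_row[2]    # second column, here the list of file dicts

# The same value is also reachable by attribute name.
same_list = first_row.files
```

Accessing the field by name (`first_row.files`) is less fragile than `first_row[2]` if the column order ever changes.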

In [7]:
type(file_list)

list

In [8]:
np.shape(file_list)

(54,)

One of the entries inside the list of dicts.

In [9]:
file_list[3]

{'path': 'Issues/1906-libwebsoclets-output.c',
 'type': 'file',
 'name': '1906-libwebsoclets-output.c',
 'base_name': '1906-libwebsoclets-output',
 'extension': '.c',
 'size': 22888,
 'date': '2017-07-28',
 'sha1': '6c990c4a7fc56bf9f1df2b859cf5e4be7d285d5e',
 'md5': '3c9cf47646361f6c51e2c932c688dd88',
 'mime_type': 'text/x-c',
 'file_type': 'C source, ASCII text',
 'programming_language': 'C++',
 'is_binary': False,
 'is_text': True,
 'is_archive': False,
 'is_media': False,
 'is_source': True,
 'is_script': False,
 'licenses': [{'key': 'lgpl-2.1',
   'score': 100.0,
   'name': 'GNU Lesser General Public License 2.1',
   'short_name': 'LGPL 2.1',
   'category': 'Copyleft Limited',
   'is_exception': False,
   'owner': 'Free Software Foundation (FSF)',
   'homepage_url': 'http://www.gnu.org/licenses/lgpl-2.1.html',
   'text_url': 'http://www.gnu.org/licenses/lgpl-2.1.txt',
   'reference_url': 'https://enterprise.dejacode.com/urn/urn:dje:license:lgpl-2.1',
   'spdx_license_key': 'LGPL-2.

## Load the List of Dicts into a DataFrame

In [10]:
from results_analyze.load_results_file import ResultsDataFrameFile
results_file = ResultsDataFrameFile()
file_level_dataframe = pd.DataFrame(file_list)
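
As a rough illustration of what `pd.DataFrame(file_list)` does with a list of dicts, here is a sketch on made-up records (the values are toy data, not scan results). Each dict becomes a row, each key a column, and nested lists such as `licenses` are stored as Python objects:

```python
import pandas as pd

# Toy records shaped like scan entries; the values are invented.
toy_file_list = [
    {"path": "a.c", "size": 10, "is_text": True, "licenses": [{"key": "mit"}]},
    {"path": "b.c", "size": 20, "is_text": True, "licenses": []},
]

toy_dataframe = pd.DataFrame(toy_file_list)

# Scalars get native dtypes; the nested lists stay as generic objects.
print(toy_dataframe.dtypes)
```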

View the DataFrame columns and their types by calling `DataFrame.dtypes`.

In [11]:
file_level_dataframe.dtypes

path                    object
type                    object
name                    object
base_name               object
extension               object
size                     int64
date                    object
sha1                    object
md5                     object
mime_type               object
file_type               object
programming_language    object
is_binary                 bool
is_text                   bool
is_archive                bool
is_media                  bool
is_source                 bool
is_script                 bool
licenses                object
license_expressions     object
copyrights              object
holders                 object
authors                 object
packages                object
emails                  object
urls                    object
is_legal                  bool
is_manifest               bool
is_readme                 bool
is_top_level              bool
is_key_file               bool
is_generated              bool
is_licen

In [12]:
file_level_dataframe.shape

(54, 37)

In [13]:
results_file.modify_file_level_dataframe?

Signature: results_file.modify_file_level_dataframe(dataframe_files)
Docstring:
Takes a File Level DataFrame, drops unnecessary columns, drops all directory rows, drops same files,
drop files with no license detections, and makes sha1 column as the file level Index [Primary Key].

:param dataframe_files: pd.DataFrame
    File Level DataFrame

:returns has_data: bool
    If A File Level DataFrame is non-empty
File:      ~/Desktop/nexB/gsoc20/scancode-results-analyzer/src/results_analyze/load_results_file.py
Type:      method


In [14]:
results_file.modify_file_level_dataframe(file_level_dataframe)

True
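
The steps named in the docstring could look roughly like the following in pandas. This is only a sketch and not the project's implementation: the function name `sketch_modify_file_level`, the list of dropped columns, and the toy rows are all assumptions made for illustration.

```python
import pandas as pd


def sketch_modify_file_level(df: pd.DataFrame) -> bool:
    """Hypothetical in-place cleanup following the docstring's steps."""
    # Drop directory rows, keeping only files.
    df.drop(df[df["type"] != "file"].index, inplace=True)
    # Drop repeated files (same sha1), keeping the first occurrence.
    df.drop_duplicates(subset="sha1", inplace=True)
    # Count detections per file, then drop files with none.
    df["license_detections_no"] = df["licenses"].apply(len)
    df.drop(df[df["license_detections_no"] == 0].index, inplace=True)
    # Drop columns assumed to be unnecessary downstream.
    df.drop(columns=["type", "name"], inplace=True)
    # Use sha1 as the file-level index (primary key).
    df.set_index("sha1", inplace=True)
    return not df.empty


toy_rows = [
    {"path": "a", "type": "directory", "name": "a", "sha1": None, "licenses": []},
    {"path": "a/x.c", "type": "file", "name": "x.c", "sha1": "s1", "licenses": [{"key": "mit"}]},
    {"path": "a/y.c", "type": "file", "name": "y.c", "sha1": "s1", "licenses": [{"key": "mit"}]},
    {"path": "a/z.c", "type": "file", "name": "z.c", "sha1": "s2", "licenses": []},
    {"path": "a/w.c", "type": "file", "name": "w.c", "sha1": "s3", "licenses": [{"key": "mit"}, {"key": "gpl-2.0"}]},
]
toy_files = pd.DataFrame(toy_rows)
has_data = sketch_modify_file_level(toy_files)
```

Because every step uses `inplace=True`, the caller's DataFrame shrinks in place and only the boolean is returned, which matches the way the notebook checks `file_level_dataframe.shape` again after the call.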

In [15]:
file_level_dataframe.shape

(44, 20)

In [16]:
file_level_dataframe.dtypes

path                     object
size                      int64
mime_type                object
file_type                object
programming_language     object
is_binary                  bool
is_text                    bool
is_archive                 bool
is_media                   bool
is_source                  bool
is_script                  bool
licenses                 object
is_legal                   bool
is_manifest                bool
is_readme                  bool
is_top_level               bool
is_key_file                bool
is_generated               bool
is_license_text            bool
license_detections_no     int64
dtype: object

Some entries from `file_level_dataframe`. Here the `licenses` column contains a list of dicts, where the list length is the number of license detections for that file.

In [17]:
file_level_dataframe.head(5)

sha1,path,size,mime_type,file_type,programming_language,is_binary,is_text,is_archive,is_media,is_source,is_script,licenses,is_legal,is_manifest,is_readme,is_top_level,is_key_file,is_generated,is_license_text,license_detections_no
c7253ba65436387ccb9bf393378cbeb725c7325e,Issues/1904-azure-iot-sdk-c-makefile.iot,3391,text/plain,ASCII text,,False,True,False,False,False,False,"[{'key': 'gpl-1.0-plus', 'score': 73.33, 'name...",False,False,False,True,False,False,False,2
a2d7a215eaedef8ab149bb6a5baedcb54a9850a3,Issues/1906-libwebsockets-output.c,33269,text/x-c,"C source, ASCII text",C++,False,True,False,False,True,False,"[{'key': 'lgpl-2.1', 'score': 100.0, 'name': '...",False,False,False,True,False,False,False,3
6c990c4a7fc56bf9f1df2b859cf5e4be7d285d5e,Issues/1906-libwebsoclets-output.c,22888,text/x-c,"C source, ASCII text",C++,False,True,False,False,True,False,"[{'key': 'lgpl-2.1', 'score': 100.0, 'name': '...",False,False,False,True,False,False,False,3
6db3a67499cbba3a63354231bd4d35183e44f2ef,Issues/1907-bison-2.4.3-getargs.c,15802,text/x-c,"C source, ASCII text",C++,False,True,False,False,True,False,"[{'key': 'gpl-3.0-plus', 'score': 100.0, 'name...",False,False,False,True,False,False,False,2
5dd305e238554f6d0c5b9064a29607816a2e0878,Issues/1908-bzip2-1.0.5-bzip2.c,58670,text/x-c,"C source, ASCII text",C++,False,True,False,False,True,False,"[{'key': 'bzip2-libbzip-2010', 'score': 62.79,...",False,False,False,True,False,False,False,3


The following lines expand each file's `licenses` list into a `DataFrame` with one row per license detection.

In [18]:
lic_level_dataframe = file_level_dataframe.groupby('sha1').licenses.apply(lambda x: pd.DataFrame(x.values[0])).reset_index()
lic_level_dataframe.rename(columns={'level_1': 'lic_det_num'}, inplace=True)
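
To see where the `level_1` column comes from, here is the same groupby pattern on toy data (the license entries are invented). Each group's list of dicts becomes a small DataFrame whose own 0-based row index ends up as an unnamed second index level, which `reset_index` then names `level_1`:

```python
import pandas as pd

# Toy file-level frame indexed by sha1; the values are invented.
toy_files = pd.DataFrame(
    {
        "licenses": [
            [{"key": "mit", "score": 100.0}],
            [{"key": "gpl-2.0", "score": 90.0}, {"key": "lgpl-2.1", "score": 100.0}],
        ]
    },
    index=pd.Index(["aaa", "bbb"], name="sha1"),
)

# One small DataFrame per file, stacked under the sha1 group key.
toy_lic = (
    toy_files.groupby("sha1")
    .licenses.apply(lambda x: pd.DataFrame(x.values[0]))
    .reset_index()
)

# The unnamed inner level is the per-file detection number.
toy_lic = toy_lic.rename(columns={"level_1": "lic_det_num"})
```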

`lic_level_dataframe` now holds only license-level information in its columns.

In [19]:
lic_level_dataframe.dtypes

sha1                 object
lic_det_num           int64
key                  object
score               float64
name                 object
short_name           object
category             object
is_exception           bool
owner                object
homepage_url         object
text_url             object
reference_url        object
spdx_license_key     object
spdx_url             object
start_line            int64
end_line              int64
matched_rule         object
matched_text         object
dtype: object

In [20]:
results_file.modify_lic_level_dataframe?

Signature: results_file.modify_lic_level_dataframe(dataframe_lic)
Docstring:
Modifies License level DataFrame, from 'matched_rule' dicts, bring information to columns.
Maps Rule Names and other strings to integer values to compress.

:param dataframe_lic: pd.DataFrame
:return dataframe_lic: pd.DataFrame
File:      ~/Desktop/nexB/gsoc20/scancode-results-analyzer/src/results_analyze/load_results_file.py
Type:      method


In [21]:
lic_level_dataframe = results_file.modify_lic_level_dataframe(lic_level_dataframe)
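
The docstring describes two ideas: lifting the keys of the `matched_rule` dicts into columns, and mapping strings to integers. A possible pandas sketch of both, on invented toy values; the column name `matcher_code` and the use of `pd.factorize` are assumptions, not necessarily what `modify_lic_level_dataframe` does:

```python
import pandas as pd

# Toy license-level frame; rule identifiers and lengths are invented.
toy_lic = pd.DataFrame(
    {
        "key": ["mit", "gpl-2.0"],
        "matched_rule": [
            {"identifier": "mit.LICENSE", "matcher": "1-hash", "rule_length": 161},
            {"identifier": "gpl-2.0_12.RULE", "matcher": "2-aho", "rule_length": 6},
        ],
    }
)

# Turn each dict into a row of columns, aligned on the original index.
rule_columns = pd.DataFrame(toy_lic["matched_rule"].tolist(), index=toy_lic.index)
toy_lic = toy_lic.drop(columns="matched_rule").join(rule_columns)

# Map a string column to integer codes (order of first appearance).
toy_lic["matcher_code"] = pd.factorize(toy_lic["matcher"])[0]
```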

In [22]:
lic_level_dataframe.dtypes

sha1                     object
lic_det_num               int64
key                      object
score                   float64
category                 object
is_exception               bool
start_line                int64
end_line                  int64
matched_text             object
identifier               object
is_license_text            bool
is_license_notice          bool
is_license_reference       bool
is_license_tag             bool
matcher                  object
rule_length               int64
matched_length            int64
match_coverage          float64
rule_relevance          float64
dtype: object

In [23]:
lic_level_dataframe.set_index('sha1', inplace=True)

Join the license-level and file-level DataFrames on the shared primary key `sha1`, which is the index of both.

In [24]:
merged_df = file_level_dataframe.join(lic_level_dataframe, lsuffix='_file', rsuffix='_lic')
merged_df.reset_index(inplace=True)
merged_df.set_index(['sha1', 'lic_det_num'], inplace=True)
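
The same join can be tried on toy frames to see the one-to-many behaviour and the effect of the suffixes (all values invented). The file-level row is repeated once per matching license row, and only the column name that exists on both sides, `is_license_text`, receives a suffix:

```python
import pandas as pd

# File level: one row per sha1.
toy_files = pd.DataFrame(
    {"path": ["a.c", "b.c"], "is_license_text": [False, False]},
    index=pd.Index(["aaa", "bbb"], name="sha1"),
)

# License level: possibly several rows per sha1.
toy_lics = pd.DataFrame(
    {
        "lic_det_num": [0, 0, 1],
        "key": ["mit", "gpl-2.0", "lgpl-2.1"],
        "is_license_text": [False, True, False],
    },
    index=pd.Index(["aaa", "bbb", "bbb"], name="sha1"),
)

# Index-on-index left join; overlapping column names get suffixes.
toy_merged = toy_files.join(toy_lics, lsuffix="_file", rsuffix="_lic")

# Promote (sha1, lic_det_num) to a two-level index.
toy_merged = toy_merged.reset_index().set_index(["sha1", "lic_det_num"])
```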

## Notice that one file can have many license rows: the two primary-key columns on the left (`sha1` and `lic_det_num`) express a one-to-many relationship.

In [25]:
merged_df.head(5)

sha1,lic_det_num,path,size,mime_type,file_type,programming_language,is_binary,is_text,is_archive,is_media,is_source,...,identifier,is_license_text_lic,is_license_notice,is_license_reference,is_license_tag,matcher,rule_length,matched_length,match_coverage,rule_relevance
0463f3f27739f3fecb1e7c51130541fb213d7d29,0,Issues/1912-libtool-2.2.10-argz.c,5903,text/x-c,"C source, ASCII text",C++,False,True,False,False,True,...,lgpl-2.1-plus_newlib.RULE,False,True,False,False,3-seq,128,52,40.62,100.0
0463f3f27739f3fecb1e7c51130541fb213d7d29,1,Issues/1912-libtool-2.2.10-argz.c,5903,text/x-c,"C source, ASCII text",C++,False,True,False,False,True,...,lead-in_unknown_11.RULE,False,False,True,False,2-aho,3,3,100.0,16.0
0463f3f27739f3fecb1e7c51130541fb213d7d29,2,Issues/1912-libtool-2.2.10-argz.c,5903,text/x-c,"C source, ASCII text",C++,False,True,False,False,True,...,lgpl-2.1-plus_83_1.RULE,False,False,True,False,2-aho,6,6,100.0,33.0
0463f3f27739f3fecb1e7c51130541fb213d7d29,3,Issues/1912-libtool-2.2.10-argz.c,5903,text/x-c,"C source, ASCII text",C++,False,True,False,False,True,...,lgpl-2.1-plus_83_1.RULE,False,False,True,False,2-aho,6,6,100.0,33.0
0463f3f27739f3fecb1e7c51130541fb213d7d29,4,Issues/1912-libtool-2.2.10-argz.c,5903,text/x-c,"C source, ASCII text",C++,False,True,False,False,True,...,lgpl_3.RULE,False,False,True,False,2-aho,7,7,100.0,38.0


This merged DataFrame is returned to the `create_package_level_dataframe` function at the package level, where the same process happens for every row, i.e. every package. The results are all merged into one main DataFrame.
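
One way such a final merge could be done is with `pd.concat`, keyed by package. This is a sketch under assumptions: the helper `toy_package_frame`, the package names, and the extra `package` index level are hypothetical, and the project may combine the frames differently.

```python
import pandas as pd


def toy_package_frame(sha1: str, keys: list) -> pd.DataFrame:
    """Build a tiny per-package frame indexed by (sha1, lic_det_num)."""
    index = pd.MultiIndex.from_product(
        [[sha1], range(len(keys))], names=["sha1", "lic_det_num"]
    )
    return pd.DataFrame({"key": keys}, index=index)


# One merged frame per package; names and values are invented.
per_package = {
    "pkg-a": toy_package_frame("aaa", ["mit"]),
    "pkg-b": toy_package_frame("bbb", ["gpl-2.0", "lgpl-2.1"]),
}

# Stack all packages, adding the package name as an outer index level.
main_dataframe = pd.concat(per_package, names=["package", "sha1", "lic_det_num"])
```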