# VirusTotal (VT) Reports Download

The function `process_files` queries the VirusTotal API for the analysis report of each sample identifier (e.g., a file hash) in a list. It extracts selected fields from each response, saves the full JSON report for each file, and returns the extracted fields as a `pandas.DataFrame`. It shows a progress bar and catches per-file errors: a failed lookup prints a message and the loop moves on to the next identifier. It is intended for researchers and analysts who need file metadata for malware datasets.

**Note: Due to VirusTotal policy, the downloaded JSON report files cannot be uploaded here. However, you can request a VirusTotal API key on their website (https://www.virustotal.com/) and use it with this script to download the JSON reports yourself.**


## Features
- Interacts with the VirusTotal API to retrieve analysis reports for files.
- Extracts key attributes, such as creation date, hash values, malicious counts, and malware classifications.
- Saves detailed JSON reports for each file in a specified output folder.
- Generates a `pandas.DataFrame` containing summarized data for all processed files.
- Displays progress during processing using a progress bar.

## Requirements
- Python 3.x
- Third-party libraries: `tqdm`, `pandas`, and `virustotal-python` (imported as `virustotal_python`). The `os` and `json` modules ship with Python.
- An active VirusTotal API key.

## Usage
1. **Set Up the Environment**:
   Install the necessary libraries:
   ```bash
   pip install tqdm pandas virustotal-python
   ```

2. **Prepare Inputs**:
   - Obtain a VirusTotal API key.
   - Create a list of file identifiers to process. The `files/{id}` endpoint accepts MD5, SHA-1, or SHA-256 hashes.

3. **Run the Function**:
   Call the `process_files` function with the API key, file list, and output folder path:
   ```python
   api_key = "your_api_key"
   test_list = ["file_hash1", "file_hash2", ...]
   output_folder = "output_reports"

   df = process_files(api_key, test_list, output_folder)
   print(df.head())
   ```

4. **Output**:
   - JSON reports saved in the specified output folder, one per file, named `<sha256>.json` regardless of which hash type was used for the lookup.
   - A `pandas.DataFrame` summarizing the extracted data.
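
The hash list from step 2 often comes from a text file or a spreadsheet column. As a minimal sketch, assuming one identifier per line, the hypothetical helper `clean_hashes` below (not part of the original script) lowercases each entry, drops blank or malformed lines, and removes duplicates so that no API request is spent on an invalid or repeated identifier.

```python
import re

# MD5, SHA-1, and SHA-256 digests are 32, 40, and 64 hex characters long.
_HASH_RE = re.compile(r"^(?:[0-9a-f]{32}|[0-9a-f]{40}|[0-9a-f]{64})$")


def clean_hashes(lines):
    """Return unique, lowercase, well-formed hashes in their original order."""
    seen = set()
    hashes = []
    for line in lines:
        candidate = line.strip().lower()
        if _HASH_RE.match(candidate) and candidate not in seen:
            seen.add(candidate)
            hashes.append(candidate)
    return hashes


raw = [
    "60468339F5464275BF51AF4BB997AC81D05D75DB\n",  # upper-case SHA-1
    "\n",                                          # blank line
    "not-a-hash\n",                                # malformed entry
    "60468339f5464275bf51af4bb997ac81d05d75db\n",  # duplicate of the first
    "e4109430128e56dc9da8d4a02ada3e0e\n",          # MD5
]
test_list = clean_hashes(raw)
print(test_list)
```

Because a file object iterates over its lines, the same helper can be applied directly to an open file, for example `clean_hashes(open("hashes.txt", encoding="utf-8"))`, where `hashes.txt` is a placeholder name.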

## Key Extracted Attributes
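- **Creation Date**: Unix timestamp (seconds) that VirusTotal extracts from the file itself where possible, such as a compile time. It is not the date of first submission, and it can be forged by the file's author.
- **Malicious Count**: Number of antivirus engines that flagged the file as malicious.
- **Total Tested**: Sum of all categories in `last_analysis_stats`, i.e. the number of engine results in the last analysis, including non-detections such as undetected or timed-out engines.
- **Hash Values**: MD5, SHA-1, and SHA-256 hashes of the file.
- **Malware Names and Classifications**: First entry of `malware_names` and of `malware_classification` in the `C2AE` sandbox verdict, when that verdict is present.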
- **Type Extension**: File type extension.
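
The raw attributes are easier to read after two small conversions: turning the epoch timestamp into a date and expressing the malicious count as a share of all engine results. The sketch below uses the three rows from the sample run at the end of this notebook. The helper `enrich` and the keys `creation_datetime` and `detection_ratio` are illustrative names, not part of `process_files`.

```python
from datetime import datetime, timezone

# Values copied from the sample output at the end of this notebook.
rows = [
    {"creation_date": 1121963149, "malicious": 49, "total_avi_tested": 75},
    {"creation_date": 708992537, "malicious": 22, "total_avi_tested": 73},
    {"creation_date": 1350925004, "malicious": 55, "total_avi_tested": 67},
]


def enrich(row):
    """Return a copy of row with a UTC datetime and a detection ratio added."""
    out = dict(row)
    ts = row.get("creation_date")
    # creation_date is in seconds since the Unix epoch; keep None if missing.
    out["creation_datetime"] = (
        datetime.fromtimestamp(ts, tz=timezone.utc) if ts is not None else None
    )
    malicious = row.get("malicious")
    total = row.get("total_avi_tested") or 0
    # Avoid dividing by zero when no engine results were recorded.
    out["detection_ratio"] = (
        malicious / total if total and malicious is not None else None
    )
    return out


enriched = [enrich(r) for r in rows]
for r in enriched:
    print(r["creation_datetime"].date(), round(r["detection_ratio"], 3))
```

On the returned DataFrame, the equivalent conversions would be `pd.to_datetime(df["creation_date"], unit="s", utc=True)` and `df["malicious"] / df["total_avi_tested"]`.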


In [1]:
# Import only the libraries that process_files uses
import json
import os

import pandas as pd
from tqdm import tqdm
from virustotal_python import Virustotal

In [3]:
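def process_files(api_key, test_list, output_folder):
    """Download VirusTotal file reports and summarize them.

    Saves one JSON report per identifier in `output_folder` and returns a
    DataFrame with one row per successfully processed identifier.
    """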
    # Create the output directory if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)

    # Initialize an empty list to hold extracted data
    extracted_data_list = []

    # Initialize the Virustotal object
    vtotal = Virustotal(API_KEY=api_key)
    
    # Use tqdm to display progress bar
    with tqdm(total=len(test_list), desc="Processing files") as pbar:
        for file_id in test_list:
            try:
                resp = vtotal.request(f"files/{file_id}")
                result = resp.data

                attributes = result.get('attributes', {})
                last_analysis_stats = attributes.get('last_analysis_stats', {})
                sandbox_verdicts = attributes.get('sandbox_verdicts', {}).get('C2AE', {})

                extracted_data = {
                    'creation_date': attributes.get('creation_date'),
                    'malicious': last_analysis_stats.get('malicious'),
                    'total_avi_tested': sum(last_analysis_stats.values()),
                    'md5': attributes.get('md5'),
                    # Take the first entry; a missing key or an empty list yields None
                    'malware_names': (sandbox_verdicts.get('malware_names') or [None])[0],
                    'malware_classification': (sandbox_verdicts.get('malware_classification') or [None])[0],
                    'sha1': attributes.get('sha1'),
                    'sha256': attributes.get('sha256'),
                    'type_extension': attributes.get('type_extension')
                }

                extracted_data_list.append(extracted_data)

                # Save the JSON report, named with the SHA-256 hash
                # (fall back to the requested identifier if it is missing)
                sha256 = attributes.get('sha256') or file_id
                json_file_path = os.path.join(output_folder, f"{sha256}.json")
                with open(json_file_path, 'w', encoding='utf-8') as json_file:
                    json.dump(result, json_file, indent=4)

            except Exception as e:
                print(f"Error processing file {file_id}: {e}")

            pbar.update(1)  # Update progress bar

    # Convert the list of dictionaries to a DataFrame
    df = pd.DataFrame(extracted_data_list)

    # Return the DataFrame
    return df
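
The extraction step inside `process_files` can be tried without an API key by isolating it into a function that takes one report dictionary. The sketch below mirrors the field selection above. `summarize_report`, its `sandbox` parameter, and `fake_report` are illustrative names, and `fake_report` is a hand-made stand-in rather than a real VirusTotal response. This variant treats both a missing key and an empty list in the sandbox verdict as `None`.

```python
def summarize_report(result, sandbox="C2AE"):
    """Flatten one report dict into the summary row built by process_files."""
    attributes = result.get("attributes", {})
    stats = attributes.get("last_analysis_stats", {})
    verdict = attributes.get("sandbox_verdicts", {}).get(sandbox, {})

    def first(key):
        # A missing key and an empty list both give None instead of raising.
        values = verdict.get(key) or [None]
        return values[0]

    return {
        "creation_date": attributes.get("creation_date"),
        "malicious": stats.get("malicious"),
        "total_avi_tested": sum(stats.values()),
        "md5": attributes.get("md5"),
        "malware_names": first("malware_names"),
        "malware_classification": first("malware_classification"),
        "sha1": attributes.get("sha1"),
        "sha256": attributes.get("sha256"),
        "type_extension": attributes.get("type_extension"),
    }


# Hand-made stand-in for a report; NOT a real VirusTotal response.
fake_report = {
    "attributes": {
        "creation_date": 1121963149,
        "last_analysis_stats": {"malicious": 49, "undetected": 20, "timeout": 6},
        "sandbox_verdicts": {
            "C2AE": {
                "malware_names": [],
                "malware_classification": ["UNKNOWN_VERDICT"],
            }
        },
        "md5": "0" * 32,
        "type_extension": "exe",
    }
}

row = summarize_report(fake_report)
print(row["malicious"], row["total_avi_tested"], row["malware_names"])
```

Keeping the extraction separate from the network call also makes it possible to rebuild the summary later from JSON reports that are already on disk.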

In [6]:
# Test with just three hashes
api_key = ""  # insert your VirusTotal API key here
test_list = ["60468339f5464275bf51af4bb997ac81d05d75db", "8e9ab34c889dd3741fb251c30bdfc0ee97cfa174", "bd778bb52e3f58957d462e375e69fbf9829bc29b"]  # SHA-1 hashes; replace with your own file identifiers
output_folder = "vt_reports_additional_ransomware"  # created in the current working directory; stores the JSON files
df_vt = process_files(api_key, test_list, output_folder)
display(df_vt.head())

Processing files: 100%|███████████████████████████| 3/3 [00:01<00:00,  2.67it/s]


| | creation_date | malicious | total_avi_tested | md5 | malware_names | malware_classification | sha1 | sha256 | type_extension |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1121963149 | 49 | 75 | e4109430128e56dc9da8d4a02ada3e0e | | UNKNOWN_VERDICT | 60468339f5464275bf51af4bb997ac81d05d75db | 7c3f822d3fb51567e8c629392bc83f55521f4f99aef2da... | exe |
| 1 | 708992537 | 22 | 73 | 6e8f388c8adc4d1d8e59721a4052194b | | | 8e9ab34c889dd3741fb251c30bdfc0ee97cfa174 | 7af6d15f32d699466ee92f978a6eda5ae3ad6223c65caa... | exe |
| 2 | 1350925004 | 55 | 67 | f4a0d8288e81387c734062a30053b1a3 | | | bd778bb52e3f58957d462e375e69fbf9829bc29b | 7aba7d0c06856f6be37274672498c1d296c714d0554262... | exe |
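
Since every report is saved as a JSON file, later analysis can read the reports back from the output folder instead of querying the API again. The following is a minimal sketch under that assumption. `load_saved_reports` is a hypothetical helper, and the demo writes a hand-made report into a temporary folder so that it runs without any downloaded data.

```python
import json
import tempfile
from pathlib import Path


def load_saved_reports(folder):
    """Return {file stem: parsed report} for every *.json file in folder."""
    reports = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            reports[path.stem] = json.load(fh)
    return reports


# Demo with a throwaway folder and a hand-made report (not real VT data).
fake_sha256 = "ab" * 32
fake_report = {"attributes": {"sha256": fake_sha256, "type_extension": "exe"}}

with tempfile.TemporaryDirectory() as tmp:
    report_path = Path(tmp) / f"{fake_sha256}.json"
    report_path.write_text(json.dumps(fake_report), encoding="utf-8")
    reports = load_saved_reports(tmp)

print(list(reports))
```

The dictionary keys are the file names without the extension, which for reports written by `process_files` are SHA-256 hashes. If the input list uses MD5 or SHA-1 identifiers, as in the test above, the keys will not match the input strings directly.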
