# **Data Collection Notebook**

## Objectives

* Check for Kaggle authentication and fetch the dataset from Kaggle. Save the raw data as an extracted CSV file.

## Inputs

* Kaggle API token (kaggle.json)

## Outputs

* inputs/datasets/raw/Android_Malware.csv

## Additional Comments

* Kaggle delivers the dataset as a zip file. The zip file is extracted and then deleted, so it is only needed as an input for the extraction step and becomes obsolete once the CSV file has been saved.


---

# Set Project Root Directory

Centralise the base path using project_root

In [None]:
import os
from pathlib import Path

# Resolve the project root
project_root = Path.cwd()
if project_root.name == "jupyter_notebooks":
    project_root = project_root.parent
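
The cell above only handles the case where the notebook runs directly inside jupyter_notebooks. As a minimal sketch of a more general lookup, the helper below walks up the directory tree until it finds a folder that contains jupyter_notebooks. The function name `find_project_root` and its `notebooks_dir` parameter are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the first ancestor of `start` (or `start` itself) that contains `notebooks_dir`.

    Falls back to `start` when no such ancestor exists.
    Hypothetical helper for illustration only.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / notebooks_dir).is_dir():
            return candidate
    return start
```

Calling `find_project_root(Path.cwd())` would then give the same result whether the kernel was started in the project root, in jupyter_notebooks, or in a subfolder of it.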

# Fetch Dataset from Kaggle

* In this section, the chosen dataset is fetched via the Kaggle API as a zip file. The zip file is then extracted to a CSV file and deleted. The Kaggle authentication file is also deleted, as it is no longer needed.

Setup for Kaggle authentication

* Check the local machine for a kaggle.json file, which is required for authentication

* Drag the kaggle.json file into the project root directory
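
A Kaggle API token file is a small JSON document with `username` and `key` fields. The check described above could be sketched as follows. The function name `kaggle_token_is_valid` is hypothetical, and the check only confirms that the file is present and well formed, not that Kaggle accepts the credentials.

```python
import json
from pathlib import Path


def kaggle_token_is_valid(token_path: Path) -> bool:
    """Return True if the file exists and holds non-empty 'username' and 'key' fields.

    Hypothetical helper: it validates the file's shape only, not the credentials.
    """
    if not token_path.is_file():
        return False
    try:
        token = json.loads(token_path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False
    return isinstance(token, dict) and all(
        token.get(field) for field in ("username", "key")
    )
```

For example, `kaggle_token_is_valid(project_root / "kaggle.json")` could be run before the download cell to fail early with a clear message.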

Make the token available to the Kaggle API for this session

In [None]:
import os

# Point the Kaggle API at the project root, where kaggle.json was placed
os.environ['KAGGLE_CONFIG_DIR'] = str(project_root)

The following dataset is used in this project: [Kaggle Android Malware Detection URL](https://www.kaggle.com/datasets/subhajournal/android-malware-detection)

Take the dataset path from the dataset URL (the part after kaggle.com/datasets/). Define the dataset path and the destination folder, then download the dataset.

In [None]:
# Define the Kaggle dataset path and the local destination folder
KaggleDatasetPath = "subhajournal/android-malware-detection"
DestinationFolder = project_root / "inputs" / "datasets" / "raw"

# Create the destination folder if it does not exist yet
os.makedirs(DestinationFolder, exist_ok=True)

# Download the dataset
! kaggle datasets download -d {KaggleDatasetPath} -p "{DestinationFolder}"

Unzip the downloaded dataset zip file, then delete the zip file and the kaggle.json file

In [None]:
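import os, glob, zipfile

# Locate the downloaded dataset zip file
raw_dir = project_root / "inputs" / "datasets" / "raw"
zip_files = glob.glob(str(raw_dir / "*.zip"))
if not zip_files:
    raise FileNotFoundError(f"No zip file found in {raw_dir}")
zip_path = zip_files[0]

# Extract dataset zip file
with zipfile.ZipFile(zip_path, 'r') as z:
    z.extractall(raw_dir)

# Delete dataset zip file
os.remove(zip_path)

# Delete kaggle.json
kaggle_json = project_root / "kaggle.json"
if kaggle_json.exists():
    kaggle_json.unlink()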
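
As a possible hardening of the step above, the archive contents can be checked before anything is extracted or deleted. The sketch below extracts only the expected file and removes the archive afterwards. The function name `extract_expected_file` and its parameters are hypothetical and are not used elsewhere in this notebook.

```python
import zipfile
from pathlib import Path


def extract_expected_file(zip_path: Path, expected_name: str, target_dir: Path) -> Path:
    """Extract `expected_name` from the archive into `target_dir`, then delete the archive.

    Raises FileNotFoundError (and keeps the archive) if the member is missing.
    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(zip_path) as archive:
        if expected_name not in archive.namelist():
            raise FileNotFoundError(f"{expected_name} not found in {zip_path.name}")
        extracted = archive.extract(expected_name, path=target_dir)
    # Remove the archive only after a successful extraction
    zip_path.unlink()
    return Path(extracted)
```

With this approach the zip file is kept if it does not contain Android_Malware.csv, so a failed or unexpected download can still be inspected.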

---

# Load and Inspect Dataset

* In this section, the dataset is loaded and inspected to check that collecting and extracting the dataset works as expected.

Load the dataset and get an overview of the first five rows

In [None]:
import pandas as pd

# Load dataset from the raw data folder under the project root
df = pd.read_csv(project_root / "inputs" / "datasets" / "raw" / "Android_Malware.csv")

# Show first 5 rows of dataset
df.head()

Get a DataFrame summary (column data types and non-null counts)

In [None]:
df.info()
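
To confirm that the extracted file is usable without relying on pandas, a lightweight check with the standard library could look like the sketch below. The function name `csv_shape` is hypothetical, and the sketch assumes that the file is UTF-8 encoded and that its first line is a header row.

```python
import csv
from pathlib import Path


def csv_shape(csv_path: Path) -> tuple:
    """Return (data_rows, columns) for a CSV file whose first line is a header.

    Hypothetical helper; assumes UTF-8 encoding.
    """
    with csv_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            raise ValueError(f"{csv_path} is empty")
        # Count the remaining records without loading them into memory
        rows = sum(1 for _ in reader)
    return rows, len(header)
```

The result can be compared against `df.shape` as a quick cross-check that the file was read completely.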

---

# Conclusions and Next Steps

* Fetching the dataset from Kaggle worked as expected, and a usable Android_Malware.csv was extracted for further data analysis

* In the next notebook, exploratory data analysis (EDA) should be used to analyse the dataset further and draw conclusions and insights

* That analysis prepares the ground for data cleaning and the later use of the data to build an ML model