# **Data Collection Notebook**

## Objectives

* Fetch the Android Malware Detection dataset from Kaggle, save it as raw data, and inspect it

## Inputs

* Kaggle authentication token: a kaggle.json file placed in the project root directory

## Outputs

* Raw dataset: inputs/datasets/raw/Android_Malware.csv


---

# Change working directory

We need to change the working directory from the notebook's folder to its parent folder, the project root.
* os.getcwd() returns the current directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\acvo\\Documents\\vscode-projects\\pp5_android_malware_detector\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\acvo\\Documents\\vscode-projects\\pp5_android_malware_detector'
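
The two steps above can also be sketched as one helper built on pathlib. This is a minimal sketch, not part of the original notebook; the function name `change_to_parent` is hypothetical.

```python
import os
from pathlib import Path


def change_to_parent(start=None):
    """Make the parent of `start` (default: the current directory) the working directory.

    `change_to_parent` is a hypothetical helper name used for illustration.
    """
    start = Path(start) if start is not None else Path.cwd()
    parent = start.resolve().parent  # same idea as os.path.dirname()
    os.chdir(parent)                 # same call as in the cell above
    return parent
```

Like the cells above, calling it a second time moves up one more level, so it should be run only once per session.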

# Fetch Dataset from Kaggle

* Drag the kaggle.json file into the project root directory

* Check that kaggle.json is present, since it is needed for authentication

Point KAGGLE_CONFIG_DIR at the current directory so the token is recognised for this session

In [4]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
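
The check for kaggle.json mentioned above could look roughly like this. It is a sketch that assumes the usual token layout with a `username` and a `key` field; the helper name `check_kaggle_token` is hypothetical.

```python
import json
from pathlib import Path


def check_kaggle_token(config_dir="."):
    """Return the path to kaggle.json, or raise if it is missing or incomplete.

    Assumption: the token file holds the fields "username" and "key".
    """
    token_path = Path(config_dir) / "kaggle.json"
    if not token_path.is_file():
        raise FileNotFoundError(f"kaggle.json not found in {Path(config_dir).resolve()}")

    with token_path.open(encoding="utf-8") as token_file:
        token = json.load(token_file)

    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")
    return token_path
```

Running `check_kaggle_token(os.getcwd())` before the download cell would surface a missing token early instead of failing inside the Kaggle command.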

* The following dataset is used in this project: [Kaggle Android Malware Detection URL](https://www.kaggle.com/datasets/subhajournal/android-malware-detection)

Take the dataset path from the last two segments of the dataset URL, define the destination folder, and download the dataset

In [5]:
# Define path
KaggleDatasetPath = "subhajournal/android-malware-detection"
DestinationFolder = "inputs/datasets/raw"

# Check for destination folder or create it
os.makedirs(DestinationFolder, exist_ok=True)

# Download the dataset
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/subhajournal/android-malware-detection
License(s): GNU Affero General Public License 3.0
Downloading android-malware-detection.zip to inputs/datasets/raw




  0%|          | 0.00/45.1M [00:00<?, ?B/s]
100%|██████████| 45.1M/45.1M [00:00<00:00, 524MB/s]
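
Outside IPython the `!` shell magic is not available. As a rough alternative, the same command can be run through subprocess. This sketch reuses only the `-d` and `-p` flags from the cell above; the function names are hypothetical, and the Kaggle CLI must be installed and authenticated for the download itself to work.

```python
import subprocess


def build_download_command(dataset, destination):
    """Build the same Kaggle CLI call as the notebook cell, as an argument list."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", destination]


def download_dataset(dataset, destination):
    """Run the download; raises CalledProcessError if the CLI exits with an error."""
    subprocess.run(build_download_command(dataset, destination), check=True)
```

With the variables from the cell above, the call would be `download_dataset(KaggleDatasetPath, DestinationFolder)`.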


Unzip the downloaded dataset, then delete the zip file and the kaggle.json file

In [6]:
import glob
import os
import zipfile

# Extract dataset zip file
zip_path = glob.glob('inputs/datasets/raw/*.zip')[0]
with zipfile.ZipFile(zip_path, 'r') as z:
    z.extractall('inputs/datasets/raw')

# Delete dataset zip file
os.remove(zip_path)

# Delete kaggle.json
if os.path.exists('kaggle.json'):
    os.remove('kaggle.json')
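
The cell above assumes that exactly one zip file exists in the folder and fails with an IndexError otherwise. A slightly more defensive variant might look like this; it is a sketch, and the helper name `extract_and_delete_zips` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_and_delete_zips(folder):
    """Extract every zip archive in `folder` into that folder, then delete the archive.

    Returns the names of the extracted members.
    """
    folder = Path(folder)
    zip_paths = sorted(folder.glob("*.zip"))
    if not zip_paths:
        raise FileNotFoundError(f"No zip file found in {folder}")

    extracted = []
    for zip_path in zip_paths:
        with zipfile.ZipFile(zip_path, "r") as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        zip_path.unlink()  # delete the archive once its content is extracted
    return extracted
```

For this notebook the call would be `extract_and_delete_zips("inputs/datasets/raw")`.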

---

# Load and Inspect Dataset

Load the dataset and get an overview of the first five rows

In [7]:
import pandas as pd

# Load dataset
df = pd.read_csv("inputs/datasets/raw/Android_Malware.csv")

# Show first 5 rows of dataset
df.head()


| Unnamed: 0.1 | Unnamed: 0 | Flow ID | Source IP | Source Port | Destination IP | Destination Port | Protocol | Timestamp | Flow Duration | Total Fwd Packets | ... | min_seg_size_forward | Active Mean | Active Std | Active Max | Active Min | Idle Mean | Idle Std | Idle Max | Idle Min | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 172.217.6.202-10.42.0.211-443-50004-6 | 10.42.0.211 | 50004 | 172.217.6.202 | 443.0 | 6.0 | 13/06/2017 11:52:39 | 37027 | 1 | ... | 32.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Android_Adware |
| 1 | 1 | 172.217.6.202-10.42.0.211-443-35455-6 | 10.42.0.211 | 35455 | 172.217.6.202 | 443.0 | 6.0 | 13/06/2017 11:52:39 | 36653 | 1 | ... | 32.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Android_Adware |
| 2 | 2 | 131.253.61.68-10.42.0.211-443-51775-6 | 10.42.0.211 | 51775 | 131.253.61.68 | 443.0 | 6.0 | 13/06/2017 11:52:42 | 534099 | 8 | ... | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Android_Adware |
| 3 | 3 | 131.253.61.68-10.42.0.211-443-51775-6 | 10.42.0.211 | 51775 | 131.253.61.68 | 443.0 | 6.0 | 13/06/2017 11:52:43 | 9309 | 3 | ... | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Android_Adware |
| 4 | 4 | 131.253.61.68-10.42.0.211-443-51776-6 | 10.42.0.211 | 51776 | 131.253.61.68 | 443.0 | 6.0 | 13/06/2017 11:52:42 | 19890496 | 8 | ... | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Android_Adware |


Get a dataframe summary (datatypes)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 355630 entries, 0 to 355629
Data columns (total 86 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   Unnamed: 0                    355630 non-null  int64  
 1   Flow ID                       355629 non-null  object 
 2    Source IP                    355630 non-null  object 
 3    Source Port                  355630 non-null  int64  
 4    Destination IP               355630 non-null  object 
 5    Destination Port             355630 non-null  float64
 6    Protocol                     355630 non-null  float64
 7    Timestamp                    355630 non-null  object 
 8    Flow Duration                355630 non-null  int64  
 9    Total Fwd Packets            355630 non-null  int64  
 10   Total Backward Packets       355630 non-null  int64  
 11  Total Length of Fwd Packets   355630 non-null  float64
 12   Total Length of Bwd Packets  355630 non-nul

Check for missing values in all columns

In [9]:
df.isnull().sum()

Unnamed: 0         0
Flow ID            1
 Source IP         0
 Source Port       0
 Destination IP    0
                  ..
Idle Mean          4
 Idle Std          4
 Idle Max          4
 Idle Min          4
Label              0
Length: 86, dtype: int64
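
The truncated output above hides most of the 86 columns, and several column names carry a leading space (for example " Source IP"). A small helper along these lines could list only the columns that actually have missing values, with the names stripped. It is a sketch, and `missing_value_report` is a hypothetical name.

```python
import pandas as pd


def missing_value_report(df):
    """Return the missing-value count per column, for columns with at least one gap.

    Column names are stripped of surrounding whitespace in the report only;
    the input DataFrame is left unchanged.
    """
    cleaned = df.rename(columns=lambda name: str(name).strip())
    counts = cleaned.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)
```

Calling `missing_value_report(df)` on the loaded dataset would return a Series indexed by column name, which is easier to scan than the abbreviated `df.isnull().sum()` output.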

---


# Conclusions and Next Steps

* The dataset was downloaded from Kaggle as expected, and Android_Malware.csv (355,630 rows, 86 columns) was extracted for further analysis

* In the next notebook, exploratory data analysis (EDA) will examine the dataset in more detail to draw conclusions and insights

* The results of that analysis will guide data cleaning and, later, the use of the data to build an ML model