# **Exploratory Data Analysis Notebook**

## Objectives

* Inspect and understand the dataset through exploratory data analysis: check data types, missing data, variables and correlations, and perform statistical analysis to gain insight into the data.

## Inputs

* Android_Malware.csv

## Outputs

* outputs/eda/EDA_Report.html
* outputs/data/Android_Malware_converted.csv

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

# Change working directory

Change the working directory from its current folder to its parent folder
* Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

Make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Import Libraries and Load Dataset

In this section, the required standard libraries are imported and the dataset is loaded into a dataframe for analysis.

Import Libraries with necessary Settings

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Settings
%matplotlib inline
sns.set(style="whitegrid")

Load Dataset from Inputs

In [None]:
df = pd.read_csv("inputs/datasets/raw/Android_Malware.csv")

---

# General Data Exploration

Get general overview of dataset

In [None]:
df.info()

* The first inspection shows that some columns have an object data type. For further analysis, these columns need to be converted to numeric.

* The expected target variable - Label - is also an object. This needs to be converted as well. 
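One way to follow up on these observations is to measure how much of each object column actually parses as a number. The sketch below uses a small made-up frame in place of df; the values and the helper numeric_share are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for df with made-up values; dtype=object mimics columns
# that were read in as text.
toy = pd.DataFrame(
    {
        "CWE Flag Count": ["0", "1", "0"],        # numeric values stored as text
        "Down/Up Ratio": ["0", "1", "abc"],       # one hypothetical stray string
        "Label": ["benign", "android_adware", "benign"],  # genuinely categorical
    },
    dtype=object,
)


def numeric_share(series: pd.Series) -> float:
    """Return the fraction of values that pd.to_numeric can parse."""
    parsed = pd.to_numeric(series, errors="coerce")
    return float(parsed.notna().mean())


# Hypothetical helper applied to every object column
obj_cols = toy.select_dtypes(include="object").columns
shares = {col: numeric_share(toy[col]) for col in obj_cols}
print(shares)
```

A share of 1.0 suggests a column that is safe to convert, a share near 0.0 suggests a truly categorical column, and anything in between points to a few malformed entries worth inspecting.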

---

# Pandas Profiling Report

In this section, a pandas profiling report is created via ydata_profiling. The report serves as a general overview of the whole dataset and is saved in the outputs folder for future reference.

Import ydata_profiling and create and save Pandas Profile Report

In [None]:
# Import Library
from ydata_profiling import ProfileReport

# Create Profiling Report
profile = ProfileReport(df=df, minimal=True)
profile.to_notebook_iframe()

# Save Report in Outputs Folder (create if not existing)
os.makedirs("outputs/eda", exist_ok=True)
profile.to_file("outputs/eda/EDA_Report.html")

* The report shows detailed information about all variables. It makes it possible to clearly identify the target variable, the label column.

* The label column has four different values: android_sms_malware, android_adware, android_scareware and benign. These describe the class / category of each sample in the dataset. 

* Project Goal: Create a classification model to predict the label based on the features.

* The other columns are potential input features that can be used for training the model. 

* Further study of missing values, data types and correlation is needed to see which features are useful for a prediction model for the label target variable.
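As a starting point for that further study, the share of missing values per column and the class balance of the target can be summarised with plain pandas. This is a minimal sketch on a made-up frame; the values are invented for illustration, and only the column names and class names are borrowed from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; all values are made up.
toy = pd.DataFrame(
    {
        "Down/Up Ratio": [0.0, np.nan, 1.0, 2.0],
        "CWE Flag Count": [0, 1, 0, 0],
        "Label": ["benign", "android_adware", "benign", "android_scareware"],
    }
)

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Relative frequency of each class in the target column
class_share = toy["Label"].value_counts(normalize=True)
print(class_share)
```

Columns with a high missing share are candidates for dropping or imputation, and a strongly skewed class share would be worth keeping in mind when the classification model is evaluated.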

---

# Data Preparation

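In this section, column names are cleaned, the target variable is encoded, object columns are converted to numeric where possible, unhelpful metadata columns are dropped, and the converted dataframe is saved for later use.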

Check all columns to see naming and formatting issues

In [None]:
df.columns.tolist()

Clean column names to avoid hidden spaces

In [None]:
df.columns = df.columns.str.strip()
df.columns.tolist()

Encode target variable Label and remove after encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Label_encoded'] = le.fit_transform(df['Label'])

# Print label mapping
# Cast codes to plain int so the printed mapping stays readable
label_mapping = {cls: int(code) for code, cls in enumerate(le.classes_)}
print("🎯 Label mapping (target encoding):")
print(label_mapping)

# Drop original Label column after encoding
df.drop(columns=['Label'], inplace=True)
print("🗑️ Dropped original 'Label' column after encoding.")
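Because the original Label column is dropped, the fitted encoder is the only place where the class names survive. The sketch below, on a few made-up labels, shows how the codes are assigned and how inverse_transform maps them back; it assumes scikit-learn is installed, as in the cell above.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of labels, using class names reported for this dataset
labels = ["benign", "android_adware", "benign", "android_scareware"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# LabelEncoder sorts the classes, so codes follow alphabetical order
mapping = {cls: int(code) for code, cls in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes
decoded = le.inverse_transform(encoded)
print(list(decoded))
```

Keeping le (or the printed mapping) available is what makes later predictions interpretable, since a model trained on Label_encoded only returns the integer codes.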

Convert malformed numeric object columns to numeric & show results of conversion

In [None]:
# Identify original object columns
original_obj_cols = df.select_dtypes(include='object').columns.tolist()
converted_cols = []

for col in original_obj_cols:
    try:
        df[col] = pd.to_numeric(df[col])
        converted_cols.append(col)
    except ValueError:
        pass  # Leave it as object if conversion fails

# Show conversion results
remaining_obj_cols = df.select_dtypes(include='object').columns.tolist()
print("\n✅ Successfully converted to numeric:")
print(converted_cols)

print("\n❌ Still categorical or string (object):")
print(remaining_obj_cols)

Check column values for suspicious looking numeric columns

In [None]:
# Get only relevant columns
suspected_numeric = ['CWE Flag Count', 'Down/Up Ratio', 'Fwd Avg Bytes/Bulk']

# Preview unique values to spot issues
for col in suspected_numeric:
    print(f"\n🔍 Unique values in '{col}':")
    print(df[col].unique()[:10])

Convert the suspicious columns to numeric, coercing values that cannot be parsed to NaN

In [None]:
cols_to_clean = ['CWE Flag Count', 'Down/Up Ratio', 'Fwd Avg Bytes/Bulk']

# Convert to numeric (coerce errors to NaN)
for col in cols_to_clean:
    df[col] = pd.to_numeric(df[col], errors='coerce')

print("\n✅ Finished converting suspicious numeric columns.")
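Coercion silently replaces unparseable entries with NaN, so it can be useful to count how many values were lost that way. A minimal sketch on a made-up column follows; the stray string "abc" is a hypothetical example of a malformed entry.

```python
import pandas as pd

# Toy column with made-up values: two valid numbers, one stray string,
# and one value that was already missing.
toy = pd.DataFrame({"Down/Up Ratio": ["0", "1", "abc", None]}, dtype=object)

before_missing = toy["Down/Up Ratio"].isna()
converted = pd.to_numeric(toy["Down/Up Ratio"], errors="coerce")

# Values that were present before but are NaN afterwards were unparseable
newly_coerced = converted.isna() & ~before_missing
print(int(newly_coerced.sum()))
print(toy.loc[newly_coerced, "Down/Up Ratio"].tolist())
```

Comparing the missing-value mask before and after the conversion separates genuinely missing data from entries that were discarded by errors='coerce'.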

Drop unhelpful metadata object columns

In [None]:
columns_to_drop = ['Flow ID', 'Source IP', 'Destination IP', 'Timestamp']
# columns= already selects the axis, so axis=1 is not needed
df.drop(columns=columns_to_drop, inplace=True)

print("\n🗑️ Dropped unhelpful metadata columns.")

Final check for remaining object columns

In [None]:
final_obj_cols = df.select_dtypes(include='object').columns.tolist()
print("\n Final object columns (categorical candidates):")
print(final_obj_cols)

Copy the converted dataframe so that later steps can work on it separately from df, and save it as a csv file for easier access

In [None]:
# Make copy of dataframe for further use and easier separation
df_converted = df.copy()

# Save converted dataframe as csv file for easier later access of converted data
os.makedirs("outputs/data", exist_ok=True)
df_converted.to_csv("outputs/data/Android_Malware_converted.csv", index=False)

print("✅ Copied converted dataframe to df_converted")
print("✅ Saved converted dataframe to outputs/data/")
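A quick round-trip check can confirm that the saved file reloads with the same shape and data types. This is a sketch on a made-up frame written to a temporary directory; the file name toy_converted.csv is hypothetical and the real output path is not touched.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df_converted with made-up values
toy = pd.DataFrame({"CWE Flag Count": [0, 1, 0], "Label_encoded": [2, 0, 1]})

# Write to a temporary folder and read the file back in
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_converted.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_dtypes = bool((reloaded.dtypes == toy.dtypes).all())
print(same_shape, same_dtypes)
```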

---

# Correlation Study
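
A minimal sketch of one way this study could begin, assuming the numeric columns of the converted dataframe: compute pairwise Pearson correlations and rank the feature pairs by absolute strength. The toy frame and its values are made up for illustration, and Pearson is an assumed choice rather than one stated in this notebook.

```python
from itertools import combinations

import pandas as pd

# Toy stand-in for df_converted with made-up values
toy = pd.DataFrame(
    {
        "CWE Flag Count": [0, 1, 0, 1, 0, 1],
        "Down/Up Ratio": [0, 2, 0, 3, 1, 4],
        "Fwd Avg Bytes/Bulk": [5, 1, 4, 2, 6, 3],
    }
)

# Pairwise Pearson correlation matrix (pandas default method)
corr = toy.corr()

# Collect each unordered feature pair once and sort by absolute correlation
pairs = pd.Series(
    {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
)
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```

Pairs near the top of this ranking are strongly related and may be redundant as model inputs, which is one possible criterion for feature selection.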

---

# Conclusion and Next Steps

* 