# **Exploratory Data Analysis Notebook**

## Objectives

* Inspect and understand the dataset through Exploratory Data Analysis: check data types, missing data, variables and correlations, and perform statistical analysis to gain insight into the data.

## Inputs

* Android_Malware.csv

## Outputs

* Pandas profiling report saved as outputs/eda/EDA_Report.html


---

# Change working directory

Change the working directory from the current folder to its parent folder.
* Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

Make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Import Libraries and Load Dataset

In this section, the required standard libraries are imported and the dataset is loaded so that it can be analysed.

Import Libraries with necessary Settings

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Settings
%matplotlib inline
sns.set(style="whitegrid")

Load Dataset from Inputs

In [None]:
df = pd.read_csv("inputs/datasets/raw/Android_Malware.csv")

---

# Pandas Profiling Report

In this section, a pandas profiling report is created with ydata_profiling. The report serves as a general overview of the whole dataset and is saved in the outputs folder for future reference.

Import ydata_profiling, then create and save the profiling report

In [None]:
# Import Library
from ydata_profiling import ProfileReport

# Create Profiling Report
profile = ProfileReport(df=df, minimal=True)
profile.to_notebook_iframe()

# Save Report in Outputs Folder (create if not existing)
os.makedirs("outputs/eda", exist_ok=True)
profile.to_file("outputs/eda/EDA_Report.html")

* The report shows detailed information about all variables and makes it possible to clearly identify the target variable: the label column.

* The label column has four different values: android_sms_malware, android_adware, android_scareware and benign. These describe the class / category of each sample in the dataset. 

* Project Goal: Create a classification model to predict the label based on the features.

* The other columns are potential input features that can be used for training the model. 

* Further study of missing values, data types and correlation is needed to see which features are useful for a prediction model for the label target variable.
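
Since the label column is the target, its class balance is a natural first check. The following is a minimal sketch, not part of the original analysis: it uses a small made-up DataFrame (`toy_df`) in place of the real data and assumes the target column is named `label`. The counts are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset; the real df comes from Android_Malware.csv
toy_df = pd.DataFrame({
    "label": [
        "benign", "benign", "benign",
        "android_adware", "android_adware",
        "android_scareware",
        "android_sms_malware", "android_sms_malware",
    ]
})

# Absolute and relative frequency of each class in the target column
class_counts = toy_df["label"].value_counts()
class_share = toy_df["label"].value_counts(normalize=True).round(3)

# Combine both views into one table, aligned on the class name
class_summary = pd.DataFrame({"count": class_counts, "share": class_share})
print(class_summary)
```

On the real data, replacing `toy_df` with `df` would show whether the four classes are balanced, which matters when choosing evaluation metrics for the classifier.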

---

# General Data Exploration

Get a general overview of the dataset

In [None]:
df.info()

Check for missing values in all columns

In [None]:
df.isnull().sum()
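
With many columns, the raw output of isnull().sum() can be hard to scan. The sketch below shows one possible way to condense it into a table of only the affected columns, with counts and percentages. It runs on a made-up DataFrame (`toy_df`) with hypothetical column names, not on the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded dataset; column names are made up
toy_df = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0, 4.0],
    "feature_b": [np.nan, np.nan, 0.5, 0.7],
    "label": ["benign", "benign", "android_adware", "android_adware"],
})

# Count and percentage of missing values per column
missing_count = toy_df.isnull().sum()
missing_pct = (toy_df.isnull().mean() * 100).round(2)

missing_summary = (
    pd.DataFrame({"missing": missing_count, "percent": missing_pct})
    .query("missing > 0")                      # keep only columns with gaps
    .sort_values("missing", ascending=False)   # most affected columns first
)
print(missing_summary)
```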

---

# Correlation Study
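
A possible starting point for this section is sketched below. It is an illustration under stated assumptions, not the notebook's actual analysis: the DataFrame (`toy_df`), its feature names and its values are made up, and the target is reduced to a binary `is_malicious` flag purely to keep the example small, whereas the real label has four classes. The sketch computes Pearson correlations (the pandas default) between numeric columns and ranks features by their absolute correlation with the encoded target.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset; column names and values are made up
toy_df = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5, 6],
    "feature_b": [2, 4, 6, 8, 10, 12],   # perfectly correlated with feature_a
    "feature_c": [5, 3, 6, 1, 4, 2],
    "label": ["benign", "benign", "benign",
              "android_adware", "android_adware", "android_adware"],
})

# Encode the target as 0/1 for this toy example (benign vs. not benign).
# This binary encoding is an assumption; the real label has four classes.
toy_df["is_malicious"] = (toy_df["label"] != "benign").astype(int)

# Pearson correlation on numeric columns only (text columns are excluded)
numeric_df = toy_df.select_dtypes(include="number")
corr_matrix = numeric_df.corr()

# Rank features by absolute correlation with the encoded target
target_corr = (
    corr_matrix["is_malicious"]
    .drop("is_malicious")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```

Pairs of features that correlate almost perfectly with each other, like `feature_a` and `feature_b` here, carry redundant information, so one of them is a candidate for removal before modelling.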

---

# Conclusion and Next Steps

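* The profiling report identifies the label column as the target variable, with four classes: android_sms_malware, android_adware, android_scareware and benign.

* Next steps: study missing values, data types and correlations in more detail to decide which features are useful for a classification model that predicts the label.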