# **Project Name**    -



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member 1 -** - DIKSHA


# **Project Summary -**

The objective of this project is to analyze and classify Android applications as Benign or Malicious, ensuring user security while maintaining a trustworthy app ecosystem. With the rapid increase in mobile applications, fraudulent apps pose significant risks, including privacy breaches and financial loss. This analysis leverages a dataset containing various app attributes such as Ratings, Number of Ratings, Price, Permissions (safe and dangerous), and Class (target variable) to identify key patterns and build a predictive model.

Dataset Overview:
The dataset consists of multiple attributes related to app characteristics, including numeric variables like Rating, Number of Ratings, Price, and permission counts, along with the target variable Class (0 = Benign, 1 = Malicious). The data underwent preprocessing steps such as handling missing values, removing duplicates, and cleaning invalid entries to ensure quality for further analysis.

Insights and Business Impact
Key insights derived:

Apps requesting excessive dangerous permissions are highly likely to be malicious.

Malicious apps often have low user ratings and limited reviews.

Most malicious apps are free or underpriced, pointing to pricing anomalies as a detection factor.

Risk varies across categories; Tools and Lifestyle apps tend to request more permissions.

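These insights can create positive business impact by enabling earlier identification of high-risk apps and risk-based screening during app review, as set out in the recommendations below.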


Recommendations:
To achieve the business objective of reducing malicious apps:

Implement a risk-scoring mechanism combining permissions, ratings, and pricing anomalies.

Apply category-specific review policies for higher-risk app segments.

Educate developers on privacy-first app design to reduce unnecessary permission requests.

Introduce user transparency measures, such as a permission risk badge, to improve trust.



Conclusion
This analysis highlights that leveraging permissions, pricing, and user feedback data can effectively distinguish between malicious and benign apps. By combining machine learning models with strategic business policies, app marketplaces can ensure a secure ecosystem while fostering innovation. Adopting these measures will significantly enhance user safety, brand credibility, and long-term growth.





# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


With the rapid increase in Android applications, malicious apps pose severe threats such as privacy breaches, financial fraud, and security risks. Manual detection methods are inefficient and unreliable given the large volume of apps. Therefore, an automated system is needed to classify apps as benign or malicious using key attributes like permissions, ratings, and pricing to ensure user safety and maintain platform integrity.



#### **Define Your Business Objective?**

The primary objective is to build a robust machine learning-based classification system that accurately distinguishes between Benign and Malicious Android applications using historical app data. This system will:

Enhance platform security by identifying high-risk apps before publication.

Improve user trust and retention by reducing exposure to harmful applications.

Streamline the app review process by providing risk-based screening, reducing manual intervention.

Support compliance with security and privacy regulations by proactively preventing data misuse.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


5. You have to create at least 20 logical & meaningful charts having important insights.

[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]







# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv("/content/drive/MyDrive/classification/Android Authenticity Prediction/ANDRIOD AUTHENTICITY PREDICTION.csv")


### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns = df.shape
print(f"Number of rows: {rows}")
print(f"Number of columns: {columns}")


### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Check for duplicate rows
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()
print(missing_values)

In [None]:
# Visualizing the missing values
import seaborn as sns
import matplotlib.pyplot as plt

# Create a heatmap of missing values
sns.heatmap(df.isnull(), cbar=False)

# Show the plot
plt.show()

### What did you know about your dataset?


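The dataset contains text columns (App, Package, Category, Description) and numeric columns (Rating, Number of ratings, Price, the safe and dangerous permission counts), plus the target variable Class.

Only the Description column has missing values, and only in a few rows.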

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns.tolist())

In [None]:
# Dataset Describe
df.describe()

### Variables Description

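*   App, Package: Name and package identifier of the app.
*   Category: App category (for example Tools or Lifestyle).
*   Description: Text description of the app (has a few missing values).
*   Rating, Number of ratings: User rating and the number of ratings received.
*   Price: Price of the app.
*   Safe permissions count, Dangerous permissions count: Number of safe and dangerous permissions requested.
*   Class: Target variable (0 = Benign, 1 = Malicious).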



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in df.columns:
    unique_count = df[column].nunique()
    print(f"{column}: {unique_count} unique values")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Load libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load dataset
df = pd.read_csv("/content/drive/MyDrive/classification/Android Authenticity Prediction/ANDRIOD AUTHENTICITY PREDICTION.csv")

# 1. Drop duplicates
df = df.drop_duplicates()

# 2. Handle missing values
# Drop rows where 'Description' is missing (only 3 rows)
df = df.dropna(subset=['Description'])

# 3. Drop non-informative columns (App name, Package, Description)
df = df.drop(columns=['App', 'Package', 'Description'])

# 4. Separate features and target
X = df.drop(columns='Class')
y = df['Class']

# 5. Encode 'Category' using Label Encoding on the feature matrix only,
#    so that `df` keeps readable category names for the charts below
le = LabelEncoder()
X['Category'] = le.fit_transform(X['Category'])

# 6. Split data into training and test sets before scaling,
#    so that no test-set statistics leak into the scaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()

# 7. Scale numerical features (columns identified manually);
#    fit the scaler on the training set and reuse it on the test set
numeric_cols = ['Rating', 'Number of ratings', 'Price']
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

# Print shape
print(f"Training data: {X_train.shape}, Testing data: {X_test.shape}")


### What all manipulations have you done and insights you found?

Removed Duplicates:
Dropped all duplicate rows from the dataset.
This prevents bias from repeated data points.

Handled Missing Values:
Missing values are present only in the Description column (3 rows).
Dropped the rows with a missing Description.

Dropped Non-informative Columns:
Dropped App, Package, and Description.
These columns do not contribute to prediction in our approach.

Encoded Categorical Features:
Category: Converted using Label Encoding.
ML algorithms require numeric input.

Scaled Numerical Features:
Standardized Rating, Number of ratings, and Price with StandardScaler.

Split Dataset:
80% training, 20% testing.
This allows an unbiased performance evaluation.
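The same preprocessing steps can also be bundled into a single scikit-learn pipeline, which keeps the fitting of every transformer restricted to the training split. The sketch below is only an illustration under stated assumptions: the data frame `demo` is synthetic, `OneHotEncoder` is swapped in as an alternative to the Label Encoding used in this notebook, and `LogisticRegression` is a placeholder model, not a choice made by this project.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned dataset (all values are made up).
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "Category": rng.choice(["Tools", "Lifestyle", "Education"], size=n),
    "Rating": rng.uniform(1.0, 5.0, size=n).round(1),
    "Number of ratings": rng.integers(1, 10_000, size=n),
    "Price": rng.choice([0.0, 0.99, 4.99], size=n),
    "Dangerous permissions count": rng.integers(0, 15, size=n),
})
# Toy label: more dangerous permissions make "malicious" more likely.
noise = rng.normal(0, 3, size=n)
demo["Class"] = (demo["Dangerous permissions count"] + noise > 8).astype(int)

X_demo = demo.drop(columns="Class")
y_demo = demo["Class"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

numeric = ["Rating", "Number of ratings", "Price", "Dangerous permissions count"]
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),                             # scale numeric columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Category"]),  # encode the category
])

# Scaler and encoder are fitted inside `fit`, on the training split only.
clf = Pipeline([("prep", preprocess), ("model", LogisticRegression(max_iter=1000))])
clf.fit(X_tr, y_tr)

test_accuracy = clf.score(X_te, y_te)
print(f"Held-out accuracy on synthetic data: {test_accuracy:.2f}")
```

Because the transformers live inside the pipeline, the same object can later be passed to cross-validation without refitting the scaler by hand.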











## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot
plt.figure(figsize=(10, 6))
ax = sns.scatterplot(data=df, x='Rating', y='Number of ratings', hue='Class', alpha=0.6, palette='coolwarm')

plt.title('Rating vs Number of Ratings (Colored by Class)')
plt.xlabel('App Rating')
plt.ylabel('Number of Ratings')
plt.yscale('log')  # Because Number of Ratings might have huge variation

# Relabel the legend with seaborn's own handles so each color stays matched to its class
handles, _ = ax.get_legend_handles_labels()
ax.legend(handles=handles, labels=['Benign (0)', 'Malicious (1)'], title='Class')
plt.show()


##### 1. Why did you pick the specific chart?

Both features represent user engagement and trust in an app.

Apps with high ratings and more reviews are generally considered safe, while malicious apps often have low ratings or very few reviews.

##### 2. What is/are the insight(s) found from the chart?

Benign apps (Class 0) tend to cluster around higher ratings (4.0+) and a large number of ratings, suggesting strong user trust.

Malicious apps (Class 1) mostly appear with low ratings and fewer reviews, indicating limited adoption and negative feedback.
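A scatter plot shows the pattern visually; a per-class summary table can back it with numbers. The snippet below is a small sketch on a made-up frame named `sample` (the values are invented for illustration). In the notebook, the same `groupby` call would be applied to `df`.

```python
import pandas as pd

# Tiny made-up sample; the real notebook would use `df` here.
sample = pd.DataFrame({
    "Rating": [4.5, 4.2, 4.7, 2.1, 2.8, 3.0],
    "Number of ratings": [12000, 8500, 40000, 15, 40, 7],
    "Class": [0, 0, 0, 1, 1, 1],
})

# Median is used instead of mean because rating counts are heavily skewed.
summary = sample.groupby("Class")[["Rating", "Number of ratings"]].median()
print(summary)
```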

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:
Helps identify trustworthy apps based on strong user feedback.

Enables early detection of suspicious apps with low ratings and low reviews.


Negative:

Some malicious apps fake reviews or ratings, which can mislead the model.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='Price', y='Number of ratings', hue='Class', alpha=0.6)
plt.title('Price vs Number of Ratings')
plt.xlabel('Price')
plt.ylabel('Number of Ratings')
plt.yscale('log')
plt.show()


##### 1. Why did you pick the specific chart?

To identify if expensive apps receive more or fewer user reviews and if malicious apps price unusually.



##### 2. What is/are the insight(s) found from the chart?


Most apps are free or cheap; malicious apps tend to cluster in free/cheap range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

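Positive:
Price combined with the number of ratings can feed a risk score, since malicious apps cluster in the free or cheap range.

Negative:
Most apps overall are free or cheap, so price on its own would flag many legitimate apps.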

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(14, 6))
sns.boxplot(data=df, x='Category', y='Price')
plt.xticks(rotation=90)
plt.title('Price Distribution by Category')
plt.show()


##### 1. Why did you pick the specific chart?

To detect pricing anomalies across categories.


##### 2. What is/are the insight(s) found from the chart?


Most categories have low median prices, but some have extreme outliers.

Outliers may indicate fake premium apps.
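The outliers drawn by the box plot can also be listed explicitly, using the same 1.5 × IQR rule that box plots apply by default, computed within each category. This is a minimal sketch on an invented `prices` frame; the helper name `flag_price_outliers` and the factor `k` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Made-up prices for two categories (illustration only).
prices = pd.DataFrame({
    "Category": ["Tools"] * 6 + ["Finance"] * 6,
    "Price": [0.0, 0.0, 0.99, 0.99, 1.99, 49.99,
              2.99, 3.99, 4.99, 4.99, 5.99, 6.99],
})

def flag_price_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Return True where Price exceeds Q3 + k * IQR of its own category."""
    grouped = frame.groupby("Category")["Price"]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    upper_fence = q3 + k * (q3 - q1)
    return frame["Price"] > upper_fence

prices["Price outlier"] = flag_price_outliers(prices)
print(prices[prices["Price outlier"]])
```

Computing the fence per category addresses the concern that categories naturally differ in price level, so a price that is normal in one category is not judged against another.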

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:
Helps monitor high-risk categories for overpricing or fake apps.

Guides pricing benchmarks for new apps.


Negative:
Different categories naturally have different pricing (e.g., finance vs games), so general rules may fail.

May penalize legitimate niche apps with higher prices.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(8, 6))
sns.boxplot(data=df, x='Class', y='Dangerous permissions count')
plt.title('Dangerous Permissions by Class')
plt.xlabel('Class (0=Benign, 1=Malicious)')
plt.ylabel('Dangerous Permissions Count')
plt.show()


##### 1. Why did you pick the specific chart?

Permissions are key indicators of maliciousness.



##### 2. What is/are the insight(s) found from the chart?


Malicious apps request more dangerous permissions than benign apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:
Strong indicator for malicious apps → key feature for ML model.

Improves app review process by flagging apps requesting excessive permissions.

Negative:
Some benign apps (e.g., social media) require many permissions → false positives possible.

Developers may try to bypass detection by splitting permission requests.
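The trade-off between catching malicious apps and raising false positives can be made concrete by trying several permission thresholds. The sketch below uses a made-up `perms` frame and a hypothetical helper `flag_rates`; the thresholds are arbitrary examples, not recommended cut-offs.

```python
import pandas as pd

# Made-up permission counts (illustration only).
perms = pd.DataFrame({
    "Dangerous permissions count": [1, 2, 3, 9, 2, 4, 8, 10, 12, 7],
    "Class": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

def flag_rates(frame: pd.DataFrame, threshold: int) -> tuple:
    """Return (recall, false positive rate) when flagging apps at or above threshold."""
    flagged = frame["Dangerous permissions count"] >= threshold
    malicious = frame["Class"] == 1
    recall = (flagged & malicious).sum() / malicious.sum()
    false_positive_rate = (flagged & ~malicious).sum() / (~malicious).sum()
    return float(recall), float(false_positive_rate)

for threshold in (5, 8, 10):
    recall, fpr = flag_rates(perms, threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, false positive rate={fpr:.2f}")
```

A stricter threshold lowers the false positive rate but lets more malicious apps through, which is why a single permission rule is better treated as one input to a broader score.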

#### Chart - 5

In [None]:
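# Mount Google Drive so that the dataset path under /content/drive resolves.
# In a fresh Colab session, run this cell before the dataset-loading cells.
from google.colab import drive
drive.mount('/content/drive')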

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(14, 6))
avg_rating = df.groupby('Category')['Rating'].mean().sort_values()
sns.barplot(x=avg_rating.index, y=avg_rating.values, palette='viridis')
plt.xticks(rotation=90)
plt.title('Average Rating by Category')
plt.ylabel('Average Rating')
plt.show()


##### 1. Why did you pick the specific chart?

To check which categories have the best or worst-rated apps.


##### 2. What is/are the insight(s) found from the chart?

Some categories consistently have lower average ratings, which might indicate low-quality or suspicious apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:
Helps identify categories with low-quality apps.

App stores can prioritize security checks in risky categories.

Negative:
Category-level analysis can oversimplify risk; malicious apps exist in all categories.



#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='Safe permissions count', y='Dangerous permissions count', hue='Class', alpha=0.6, palette='coolwarm')
plt.title('Safe vs Dangerous Permissions (by Class)')
plt.xlabel('Safe Permissions Count')
plt.ylabel('Dangerous Permissions Count')
plt.show()


##### 1. Why did you pick the specific chart?

To see if apps with many dangerous permissions also request safe permissions and whether it correlates with maliciousness.


##### 2. What is/are the insight(s) found from the chart?

Malicious apps generally have higher dangerous permissions regardless of safe permissions.

Benign apps cluster with fewer dangerous permissions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

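Positive:
The dangerous permissions count separates the classes regardless of the safe permissions count, which supports a permission-based risk score.

Negative:
Benign apps that genuinely need many permissions may be flagged as false positives.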

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='Number of ratings', y='Dangerous permissions count', hue='Class', alpha=0.6)
plt.xscale('log')
plt.title('Ratings Count vs Dangerous Permissions')
plt.xlabel('Number of Ratings')
plt.ylabel('Dangerous Permissions Count')
plt.show()


##### 1. Why did you pick the specific chart?

To identify if popular apps (high ratings count) tend to request more dangerous permissions.


##### 2. What is/are the insight(s) found from the chart?

Most malicious apps have few ratings and high permissions.

Popular apps (many ratings) usually have moderate permissions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

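Positive:
Apps with few ratings and many dangerous permissions can be prioritized for early review.

Negative:
New legitimate apps also start with few ratings, so this signal alone may delay genuine releases.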

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='Price', y='Dangerous permissions count', hue='Class', alpha=0.6)
plt.title('Price vs Dangerous Permissions Count')
plt.xlabel('Price')
plt.ylabel('Dangerous Permissions Count')
plt.show()


##### 1. Why did you pick the specific chart?

To analyze if paid apps request unnecessary dangerous permissions.

##### 2. What is/are the insight(s) found from the chart?


Malicious apps are mostly free or very cheap, but some paid apps still request many dangerous permissions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

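Positive:
Free or very cheap apps that request many dangerous permissions can be given extra scrutiny.

Negative:
Some paid apps also request many dangerous permissions, so price cannot be used to exempt an app from review.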

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(14, 6))
avg_perm = df.groupby('Category')['Dangerous permissions count'].mean().sort_values()
sns.barplot(x=avg_perm.index, y=avg_perm.values, palette='mako')
plt.xticks(rotation=90)
plt.title('Average Dangerous Permissions by Category')
plt.ylabel('Avg Dangerous Permissions')
plt.show()


##### 1. Why did you pick the specific chart?

To see which categories tend to request more dangerous permissions.



##### 2. What is/are the insight(s) found from the chart?

Categories like Tools, Communication, and Lifestyle request the most dangerous permissions.

Educational apps request fewer permissions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

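Positive:
Review effort can be focused on the categories that request the most dangerous permissions.

Negative:
Malicious apps exist in all categories, and stricter category rules may penalize legitimate apps that need those permissions.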

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(6, 6))
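# sort_index() orders the counts as Class 0, Class 1 so the labels match regardless of which class is larger
df['Class'].value_counts().sort_index().plot(kind='pie', autopct='%1.1f%%', colors=['#66b3ff', '#ff6666'], labels=['Benign', 'Malicious'])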
plt.title('Distribution of Benign vs Malicious Apps')
plt.ylabel('')
plt.show()


##### 1. Why did you pick the specific chart?

To understand the class balance in the dataset (important for ML modeling and fraud detection strategy).


##### 2. What is/are the insight(s) found from the chart?

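The pie chart shows the share of each class. Datasets of this kind are often imbalanced (benign apps usually far outnumber malicious ones), so the split should be checked here before modeling.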

Imbalance impacts model performance, requiring techniques like SMOTE or class weights.
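Class weights are the simpler of the two techniques to try first, because they need no extra library. The sketch below is an assumption-laden illustration: the data is synthetic (one feature, 900 benign and 100 malicious samples), and `LogisticRegression` is a placeholder model. It compares recall on the minority class with and without `class_weight="balanced"`, using a stratified split so that both sets keep the same class ratio.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: one feature, 900 benign vs 100 malicious samples.
rng = np.random.default_rng(0)
n_benign, n_malicious = 900, 100
X_imb = np.vstack([
    rng.normal(loc=3.0, scale=2.0, size=(n_benign, 1)),
    rng.normal(loc=6.0, scale=2.0, size=(n_malicious, 1)),
])
y_imb = np.array([0] * n_benign + [1] * n_malicious)

# stratify keeps the 90/10 class ratio in both the training and the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)

plain = LogisticRegression().fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

# Recall on the malicious class: the share of malicious samples that are caught.
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"Recall without class weights: {recall_plain:.2f}")
print(f"Recall with balanced weights: {recall_weighted:.2f}")
```

Balanced weights typically raise recall on the minority class at the cost of more false positives, so precision should be reported alongside recall.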

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:

Knowing the class imbalance helps in better risk modeling and resource allocation for review teams.

Negative:

If the system over-prioritizes benign apps, malicious apps may bypass checks, impacting security.



#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Select only numeric columns from your dataset
numeric_df = df.select_dtypes(include='number')

# Plot correlation heatmap
plt.figure(figsize=(12, 9))
sns.heatmap(numeric_df.corr(), cmap='YlGnBu', annot=False, linewidths=.5)
plt.title('Correlation Heatmap of Numeric Features')
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap is essential in the data analysis phase because it helps visualize the relationship between numeric variables, especially how strongly features correlate with the target variable (Class).

This is important for:

Feature selection → Removing highly correlated redundant features.

Understanding predictors → Knowing which features have the most influence on the outcome.

##### 2. What is/are the insight(s) found from the chart?

Features such as Dangerous permissions count and Price may correlate with Class; the heatmap shows the direction and relative strength of each relationship.

High correlation between some features (e.g., total permissions and dangerous permissions) suggests possible multicollinearity, which can affect model accuracy.
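Since the heatmap is drawn without annotations, the two readings above can be extracted as numbers. The following sketch ranks features by absolute correlation with the target and lists feature pairs above a cut-off. Everything in it is illustrative: `corr_demo` is synthetic, `Total permissions` is a hypothetical column added only to create a collinear pair, and the 0.7 cut-off is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic data (illustration only).
rng = np.random.default_rng(1)
n = 300
dangerous = rng.integers(0, 15, size=n)
safe = rng.integers(0, 20, size=n)
corr_demo = pd.DataFrame({
    "Dangerous permissions count": dangerous,
    "Safe permissions count": safe,
    "Total permissions": dangerous + safe,  # hypothetical, deliberately collinear
    "Rating": rng.uniform(1, 5, size=n),
})
corr_demo["Class"] = (dangerous + rng.normal(0, 2, size=n) > 9).astype(int)

corr = corr_demo.corr()

# 1. Features ranked by absolute correlation with the target.
target_corr = corr["Class"].drop("Class").sort_values(key=np.abs, ascending=False)
print(target_corr)

# 2. Feature pairs above the cut-off (upper triangle only, so each pair appears once).
features = corr.drop(index="Class", columns="Class")
mask = np.triu(np.ones(features.shape, dtype=bool), k=1)
collinear_pairs = features.where(mask).stack().loc[lambda s: s.abs() > 0.7]
print(collinear_pairs)
```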

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Select a few important numeric columns and the target column 'Class'
selected_cols = ['Rating', 'Number of ratings', 'Price', 'Dangerous permissions count', 'Safe permissions count', 'Class']

# Create a pairplot
sns.pairplot(df[selected_cols], hue='Class', palette='coolwarm', diag_kind='kde')
plt.suptitle('Pair Plot of Selected Features by Class', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot is one of the most useful tools for exploratory data analysis (EDA) because it shows:

Pairwise relationships between multiple features.

Distribution of individual features.



##### 2. What is/are the insight(s) found from the chart?

Malicious apps tend to have:

Higher Dangerous Permissions Count.

Lower Number of Ratings compared to benign apps.

Benign apps cluster with:

Moderate to high ratings and a low count of dangerous permissions.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

1. Implement a Permission-Based Risk Scoring System
Use ML model predictions + feature insights (e.g., dangerous permissions count, low ratings, suspicious pricing).

Assign a risk score to every app before publishing.

Benefit:
Reduces malicious apps while minimizing false positives.

2. Focus on Early Detection for New Apps
Many malicious apps have low rating count and are free.

Prioritize manual or automated checks for new apps with excessive dangerous permissions.

Benefit:
Stops malicious apps before they gain traction.

3. Category-Specific Risk Profiling
Categories like Tools, Communication, and Lifestyle have higher dangerous permission requests.

Apply stricter review policies for these categories.

Benefit:
Improves efficiency by focusing efforts where risk is highest.

4. Educate Developers
Provide clear guidelines about requesting only essential permissions.

Penalize unnecessary dangerous permissions but allow exceptions with strong justification.
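As a starting point before any model is deployed, the risk-scoring idea from the first suggestion can be sketched as a transparent, rule-based score. The weights, the 0 to 100 scale, the tier boundaries, and the function names `risk_score` and `risk_tier` below are all hypothetical choices for illustration; in practice they would be tuned against labeled data or replaced by model probabilities.

```python
import pandas as pd

# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"permissions": 0.5, "rating": 0.3, "price": 0.2}

def risk_score(apps: pd.DataFrame, max_permissions: int = 15) -> pd.Series:
    """Combine permissions, rating, and price into a 0-100 risk score."""
    permissions = (apps["Dangerous permissions count"] / max_permissions).clip(0, 1)
    low_rating = ((5.0 - apps["Rating"]) / 4.0).clip(0, 1)  # 5 stars -> 0, 1 star -> 1
    free = (apps["Price"] == 0).astype(float)               # free apps score higher
    score = (WEIGHTS["permissions"] * permissions
             + WEIGHTS["rating"] * low_rating
             + WEIGHTS["price"] * free)
    return (100 * score).round(1)

def risk_tier(score: pd.Series) -> pd.Series:
    """Bucket scores into review tiers (boundaries are assumptions)."""
    return pd.cut(score, bins=[0, 40, 70, 100],
                  labels=["low", "medium", "high"], include_lowest=True)

# Made-up apps for demonstration.
apps = pd.DataFrame({
    "App": ["app_a", "app_b", "app_c"],
    "Dangerous permissions count": [2, 8, 14],
    "Rating": [4.6, 3.4, 1.8],
    "Price": [1.99, 0.0, 0.0],
})
apps["Risk score"] = risk_score(apps)
apps["Risk tier"] = risk_tier(apps["Risk score"])
print(apps[["App", "Risk score", "Risk tier"]])
```

A score of this kind is easy to explain to developers and reviewers, which fits the transparency goal, while a trained classifier can later replace or complement it.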

# **Conclusion**

The analysis of the Android app dataset reveals significant patterns that can effectively support the identification of malicious applications. Key insights include the strong correlation between dangerous permission requests and malicious behavior, the prevalence of free apps with low ratings among malicious entries, and the variation of risk across different app categories.

Through data wrangling, feature analysis, and visualization, we observed that apps requesting excessive dangerous permissions or showing anomalies in pricing and ratings are more likely to be malicious. These findings provide a solid foundation for building an accurate machine learning classification model to predict app authenticity.

Implementing a risk-based screening system, educating developers, and using category-specific policies can significantly enhance security, improve user trust, and prevent fraudulent applications from entering the marketplace. At the same time, a balanced approach must be adopted to avoid penalizing legitimate apps, ensuring continued innovation and a positive developer ecosystem.

In conclusion, leveraging these insights will enable the client to achieve their business objective of reducing security risks while maintaining a healthy app ecosystem, ultimately leading to better user safety, improved brand reputation, and sustained growth.



### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***