School of Computer Science
KIIT
Course: Computer Networks (IT-3009)
Project Topic: Suspicious Email Detection

Team Members:
Deepraj Bera 21051302
Abhishek Mallick 21051706
Soumyabrata Samanta 21051436
Abhrajit Das 21051026

Introduction:
In today's interconnected world, email remains one of the primary communication channels. However, the rise of cyber threats, including phishing and malware distribution through email, calls for robust techniques for detecting suspicious messages. This project explores email-based cyber threats and equips students with the skills to design and implement a Suspicious Email Detection system using concepts from computer networks.

Project Objectives:

Understanding Email Threats:
- Goal: Develop a comprehensive understanding of various email threats, including phishing, malware, and spam, and their potential impacts on users and systems.
Dataset Acquisition:
- Goal: Identify and collect a suitable dataset comprising both legitimate and suspicious emails for training and evaluation purposes. A diverse and representative dataset is crucial for effective model training.
- The dataset was collected from Kaggle.
- The code starts by loading email data from a CSV file named 'mail_data.csv' into a pandas DataFrame.
```
import pandas as pd
raw_mail_data = pd.read_csv('mail_data.csv')
```
Preprocessing and Feature Extraction:
- Goal: Preprocess email data by addressing text, attachments, and headers. Extract relevant features from the data to facilitate the detection process. Proper preprocessing ensures the model's ability to discern patterns in the data.
- The code replaces null values with an empty string.
- It then encodes the labels, assigning 'spam' emails a label of 0 and 'ham' (non-spam) emails a label of 1.
```
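# Replace missing values with empty strings so the vectorizer never sees NaN
mail_data = raw_mail_data.where((pd.notnull(raw_mail_data)), '')
# Encode the labels: spam -> 0, ham -> 1
mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1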
```
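The same cleaning and labeling step can be sketched on a toy DataFrame. This is a minimal illustration, not the project's code: the `toy` frame and its rows are made up, only the `Category` and `Message` column names come from the code above, and dropping rows that have no label is an assumption added here.

```python
import pandas as pd

# Hypothetical stand-in for mail_data.csv, using the column names referenced above.
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham", None],
    "Message": ["See you at 5", "WIN a free prize now", None, "Hello"],
})

# Assumption: a row without a label cannot be used for training, so drop it.
toy = toy.dropna(subset=["Category"]).copy()

# Missing message text becomes an empty string.
toy["Message"] = toy["Message"].fillna("")

# Same encoding as above (spam -> 0, ham -> 1), stored as integers.
toy["Category"] = toy["Category"].map({"spam": 0, "ham": 1}).astype(int)

print(toy)
```

Using `map` with an explicit dictionary keeps the encoding in one place and yields an integer column directly.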

Data Splitting:

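- The data is split into training (80%) and testing (20%) sets using the train_test_split function from scikit-learn.
```
from sklearn.model_selection import train_test_split

X = mail_data['Message']
# Cast labels to int so scikit-learn treats them as numeric classes
Y = mail_data['Category'].astype(int)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)
```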

Text Feature Extraction:
- The text data is transformed into feature vectors using the TF-IDF vectorizer (TfidfVectorizer) from scikit-learn. This is a common technique in natural language processing to convert text data into numerical features.
```
from sklearn.feature_extraction.text import TfidfVectorizer

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)
```
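To illustrate why the vectorizer is fitted on the training split only, here is a small sketch on made-up sentences (the `train_docs` and `test_docs` lists are hypothetical). The vocabulary is learned from the training text, so words that appear only in test messages contribute nothing to their feature vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpus, for illustration only.
train_docs = ["free prize waiting for you", "meeting at noon", "claim your free prize"]
test_docs = ["free lunch at noon", "unseen words only"]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

# fit_transform learns the vocabulary and IDF weights from the training text.
train_matrix = vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary; it never adds new terms.
test_matrix = vectorizer.transform(test_docs)

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
print(test_matrix[1].nnz)  # non-zero entries for the message made of unseen words
```

Both matrices share the same number of columns, which is what lets a model trained on one be applied to the other.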
Model Training:
- A Logistic Regression model is chosen and trained on the transformed text features.
```
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_features, Y_train)
```
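Once trained, the vectorizer and model can be applied to a new message. The sketch below shows one possible way to do that by chaining both steps in a scikit-learn pipeline. The training messages, the `classify` helper, and the use of `make_pipeline` are assumptions for illustration and are not taken from the project's code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data using the labels from above: spam -> 0, ham -> 1.
messages = [
    "Win a free prize now",
    "Claim your free cash reward",
    "Free entry in a prize draw",
    "Are we still meeting for lunch",
    "Please send the project report",
    "See you at the lecture tomorrow",
]
labels = [0, 0, 0, 1, 1, 1]

# The pipeline vectorizes raw text and then feeds it to the classifier.
classifier = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
classifier.fit(messages, labels)


def classify(text: str) -> str:
    """Return 'spam' or 'ham' for a single raw message (hypothetical helper)."""
    label = classifier.predict([text])[0]
    return "ham" if label == 1 else "spam"


print(classify("free prize"))
# Columns follow classifier.classes_, so index 0 is the spam probability.
print(classifier.predict_proba(["free prize"])[0])
```

Wrapping both steps in one object avoids accidentally vectorizing new text with a different vocabulary than the one used during training.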

Model Evaluation:

- The model is evaluated on both the training and test sets using accuracy, and on the test set using a confusion matrix, a classification report, and the ROC curve with its AUC.
```
from sklearn.metrics import (accuracy_score, auc, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)

# Predicted labels for both splits
prediction_on_training_data = model.predict(X_train_features)
prediction_on_test_data = model.predict(X_test_features)

# Accuracy on training data
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

# Accuracy on test data
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

# Confusion Matrix
conf_matrix = confusion_matrix(Y_test, prediction_on_test_data)

# Classification Report
class_report = classification_report(Y_test, prediction_on_test_data)

# Calculate AUC-ROC; column 1 holds the probability of label 1 ('ham')
prediction_on_training_data_prob = model.predict_proba(X_train_features)[:, 1]
prediction_on_test_data_prob = model.predict_proba(X_test_features)[:, 1]

roc_auc_train = roc_auc_score(Y_train, prediction_on_training_data_prob)
roc_auc_test = roc_auc_score(Y_test, prediction_on_test_data_prob)

# Plot ROC Curve for test data
fpr, tpr, _ = roc_curve(Y_test, prediction_on_test_data_prob)
roc_auc = auc(fpr, tpr)
```
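Because spam is encoded as 0 and ham as 1, the confusion matrix reads differently from the common "positive = spam" convention. The sketch below uses made-up labels (`y_true` and `y_pred` are hypothetical, not results from this project) to show which cell is which and how to request spam-focused precision and recall.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 0 = spam, 1 = ham.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes, both ordered [spam, ham].
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
spam_caught, spam_missed = matrix[0]   # true spam predicted as spam / as ham
ham_flagged, ham_passed = matrix[1]    # true ham predicted as spam / as ham

# pos_label=0 makes spam the class of interest for these metrics.
spam_recall = recall_score(y_true, y_pred, pos_label=0)
spam_precision = precision_score(y_true, y_pred, pos_label=0)

print(matrix)
print(spam_recall, spam_precision)
```

In a mail filter, `ham_flagged` (legitimate mail marked as spam) is often the costlier error, so it is worth inspecting separately from overall accuracy.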

Implementation:
- Goal: Implement the designed system using a programming language such as Python. Leverage libraries and frameworks for machine learning and text analysis to streamline the development process.
- Open source contribution: clone the repository, install the dependencies, and start the Flask app.
```
git clone https://github.com/deepraj21/Spam-email-detection
cd Spam-email-detection
pip install -r requirements.txt
flask run
```
Validation and Testing:
- Goal: Validate the system's effectiveness through rigorous testing methodologies, including cross-validation techniques and parameter tuning. This step ensures the robustness and reliability of the Suspicious Email Detection system under different scenarios.
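Cross-validation and parameter tuning could be combined along the following lines. This is a sketch under stated assumptions: the toy `messages` and `labels`, the parameter grid, and the choice of `GridSearchCV` with `StratifiedKFold` are illustrative and do not come from the project's notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical data: 0 = spam, 1 = ham.
messages = [
    "Win a free prize now",
    "Claim your free cash reward",
    "Free entry in a prize draw",
    "Urgent offer claim cash today",
    "Are we still meeting for lunch",
    "Please send the project report",
    "See you at the lecture tomorrow",
    "Lunch meeting moved to Friday",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Vectorizer and classifier are tuned together so each fold refits both.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", lowercase=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Assumed search space: regularization strength and n-gram range.
param_grid = {
    "clf__C": [0.1, 1.0, 10.0],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
}

# Stratified folds keep the spam/ham ratio the same in every split.
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=3)
search = GridSearchCV(pipeline, param_grid, cv=folds, scoring="accuracy")
search.fit(messages, labels)

print(search.best_params_)
print(search.best_score_)
```

Placing the vectorizer inside the pipeline matters: the TF-IDF vocabulary is then learned only from each fold's training portion, so the validation score is not inflated by information from held-out messages.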

By achieving these objectives, the project aims to equip individuals with the skills and knowledge needed to address the evolving challenges posed by email-based cyber threats. The combination of understanding threats, acquiring and processing data, selecting appropriate algorithms, designing effective models, and rigorous evaluation contributes to the development of a robust and reliable email threat detection system.

Webapp Preview

License

This project is licensed under the MIT License - see the LICENSE file for details.
