alexamanpreet/Network-Log-and-Traffic-Analysis

Network-Log-and-Traffic-Analysis

Identify malicious behavior and attacks using Machine Learning with Python

LAB A

We'll be using IPython and pandas functionality in this part.

Our first goal is to get the information from the log files off of disk and into a dataframe.

Since we're working with limited resources, we'll use samples of the larger files.
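As a rough illustration of that first goal, here is a minimal sketch of loading a Bro tab-separated log into a dataframe while keeping only a random sample of the rows. The function name `read_bro_log`, the `sample_frac` and `seed` parameters, and the tiny in-memory sample are assumptions made for this example, not code from the lab notebook. It assumes the default Bro TSV layout, where metadata lines start with `#` and the `#fields` line carries the column names.

```python
import io
import random

import pandas as pd


def read_bro_log(handle, sample_frac=1.0, seed=0):
    """Hypothetical helper: load a Bro TSV log, keeping a random fraction of rows."""
    rng = random.Random(seed)
    fields, rows = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # Header line looks like "#fields<TAB>ts<TAB>uid<TAB>..."
            fields = line.split("\t")[1:]
        elif line and not line.startswith("#"):
            # Sample while reading so the full file never sits in memory.
            if sample_frac >= 1.0 or rng.random() < sample_frac:
                rows.append(line.split("\t"))
    if fields is None:
        raise ValueError("no #fields header found")

    df = pd.DataFrame(rows, columns=fields)
    # Bro writes "-" for unset fields and "(empty)" for empty ones.
    df = df.mask(df.isin(["-", "(empty)"]))
    # Convert a few well-known conn.log columns when they are present.
    if "ts" in df:
        df["ts"] = pd.to_datetime(df["ts"].astype(float), unit="s")
    for col in ("duration", "orig_bytes", "resp_bytes"):
        if col in df:
            df[col] = pd.to_numeric(df[col])
    return df


# Made-up two-record log, only to show the expected layout.
SAMPLE = "\n".join([
    "#separator \\x09",
    "#fields\tts\tuid\tid.orig_h\tproto\tservice\tduration\torig_bytes",
    "#types\ttime\tstring\taddr\tenum\tstring\tinterval\tcount",
    "1331901000.000000\tCabc1\t192.168.1.10\ttcp\thttp\t0.5\t120",
    "1331901001.500000\tCabc2\t192.168.1.11\tudp\t-\t-\t-",
    "#close\t2012-03-16-13-00-00",
])

df = read_bro_log(io.StringIO(SAMPLE))
print(df.dtypes)
```

For a real file, the same function could be called with an open file handle and, for example, `sample_frac=0.01` to keep roughly one percent of the records.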

Requirements

IPython
Pandas
Matplotlib
Seaborn
datetime
warnings

Tip

To access keyboard shortcuts, click on a (non-code) cell or the text "In []" to the left of the cell, and press the H key. Alternatively, select Help from the menu above, and then Keyboard Shortcuts. This is very useful and saved us a lot of time during editing.

Business Understanding

Overview

The dataset that we've selected is from the field of network analysis and security. We are using log files generated by the Bro Network Security Monitor as our dataset. The dataset we've chosen has about 20 million records (about 2 GB in size) and 22 features, with a number of sub-features explained in the feature description sections that follow.

We'll be analyzing the log file and looking for correlations between attack behaviour and the features, in order to reach probable conclusions that help us identify malicious behaviour, potential threats, and attacks in the network our dataset comes from.

The plan is to understand the dataset, the features, the attack behaviours, and their descriptions in detail, as they are stated by Bro.

We will do a lot of preprocessing including elimination, grouping, standardization, and imputation to try and make the dataset more convenient to work on.
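The four preprocessing steps named above can be sketched on a toy dataframe. This is a minimal illustration under assumptions, not the lab's actual pipeline: the column choices, the median and "unknown" fill values, and the z-score scaling are picked for the example, and the records are made up.

```python
import pandas as pd

# Made-up records using a few conn.log style column names.
df = pd.DataFrame({
    "uid": ["C1", "C2", "C3", "C4"],
    "service": ["http", "dns", None, "http"],
    "duration": [0.5, None, 2.0, 1.5],
    "orig_bytes": [100.0, 50.0, None, 300.0],
})
num_cols = ["duration", "orig_bytes"]

# Elimination: drop columns that carry no analytical value (assumed: uid).
clean = df.drop(columns=["uid"])

# Imputation: fill missing categories with a label, missing numbers with the median.
clean["service"] = clean["service"].fillna("unknown")
clean[num_cols] = clean[num_cols].fillna(clean[num_cols].median())

# Standardization: rescale each numeric column to zero mean and unit variance.
clean[num_cols] = (clean[num_cols] - clean[num_cols].mean()) / clean[num_cols].std()

# Grouping: summarize the numeric features per service.
per_service = clean.groupby("service")[num_cols].mean()
print(per_service)
```

Imputing before standardizing keeps the filled values on the same scale as the observed ones; other orderings and fill strategies are equally plausible depending on the feature.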

Once the dataset is ready for extracting valuable statistical information, we visualize that information using the most appropriate plots (in our case, box plots were used extensively). We then group some of the features to visualize relationships between them, and use a correlation matrix to represent the relationships between the features that are important in our analysis (for example, the services and the packets generated as well as received have a high correlation).
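To make the plotting step concrete, here is a small sketch that draws a box plot grouped by service and a correlation matrix for a few numeric features. The values are invented for illustration, and the example uses only pandas and Matplotlib; the lab itself may use different columns and plotting calls.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up records using a few conn.log style column names.
df = pd.DataFrame({
    "service": ["http", "http", "dns", "dns", "ssl", "ssl"],
    "orig_pkts": [10, 12, 1, 2, 20, 22],
    "resp_pkts": [9, 14, 1, 2, 18, 25],
    "duration": [0.5, 0.7, 0.01, 0.02, 3.0, 2.5],
})

fig, (ax_box, ax_corr) = plt.subplots(1, 2, figsize=(10, 4))

# Box plot: distribution of packets sent by the originator, per service.
df.boxplot(column="orig_pkts", by="service", ax=ax_box)
ax_box.set_title("orig_pkts by service")

# Correlation matrix: pairwise Pearson correlation of the numeric features.
corr = df[["orig_pkts", "resp_pkts", "duration"]].corr()
image = ax_corr.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax_corr.set_xticks(range(len(corr.columns)))
ax_corr.set_xticklabels(corr.columns, rotation=45)
ax_corr.set_yticks(range(len(corr.columns)))
ax_corr.set_yticklabels(corr.columns)
fig.colorbar(image, ax=ax_corr)
fig.tight_layout()

print(corr.round(2))
```

With Seaborn, which is listed in the requirements, `sns.heatmap(corr, annot=True)` is a common alternative for the right-hand panel.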

Purpose

We selected this dataset because it is both complex and technical, and because this kind of data is used live, where its value depends on its freshness. We are interested in learning more about security, attacks, and their patterns.

Processing the collected data in real time can remove a lot of manual work and catch attack patterns that unfold over a long period of time, which a human cannot easily identify.

These logs also show the amount of data being transferred, allowing organizations to allocate bandwidth based on expected future usage patterns.

Data

Full dataset available here. This is the conn.log.
