# Analyzing Managed File Transfer

In this exercise, you train an ML model to determine whether a File Transfer instance is classified as malicious/suspicious or benign.

The exercise uses the _mftinput4_ data set from a non-production Sterling Lab environment.
The data set consists of approximately 800 inbound and outbound file transfers, each classified into one of the above categories.
The data includes features such as File Age, Compression Ratio, Transfer Time, Packets Size, and Transfer Rate.
The data set is available as a CSV file in this repository.

Explore the data to determine whether you can use it to train a model that recognizes suspicious/malicious file transfers.

> _NOTE:  In the interest of time, this notebook performs a simple and superficial analysis of the data.
A more detailed study would require more time._

### 1. Import the required libraries and load the data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Load the data set into a Pandas data frame called "data"
data = pd.read_csv("data/mftinput4.csv")

# Obtain the length (rows) and width (columns) of the data set
data.shape

The data contains 100 rows and 13 columns.  

Use the `head` method of the Pandas data frame to preview the first five rows.

In [None]:
data.head()

### 2. Inspect basic information.

Use standard data analysis methods to start exploring the data.

Inspect the column names and associated data types.
The `info` method of a Pandas data frame displays the column names and data types in a data frame.

In [None]:
data.info()

Note the different value types:

* `Entropy`, `FileAge`, `CompressionRatio`, `FileSize`, `TransferTime`, `Age`, and `Outcome` contain integer values.
* `PacketsSize` and `TransferRate` contain float values.

Use the `describe` method to see basic statistical information for each column, such as percentiles, mean, and standard deviation.

In [None]:
data.describe()


The dataset consists of several file transfer variables, which are the input features, and one target variable: `Outcome`.

* `Entropy`:                degree of randomness or unpredictability in the file's content
* `FileAge`:                file age in number of seconds when uploaded
* `CompressionRatio`:       percentage of compression
* `FileSize`:               megabytes
* `TransferTime`:           seconds
* `PacketsSize`:            bytes in each packet transferred
* `TransferRate`:           megabytes per millisecond
* `Age`:                    file age in number of seconds when processed

* `Outcome`:                 target variable. Whether the file transfer content is suspicious (`1`) or not (`0`)

Count the number of malicious cases.

In [None]:
data.Outcome.value_counts()

15 of the 100 cases are malicious file transfers.
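Raw counts are easier to compare when expressed as shares of the whole. The following is a minimal sketch of that calculation. It assumes a stand-in frame named `sample` that mimics the 15/85 split reported above; in the notebook you would apply the same call to `data` instead.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data` frame:
# 15 suspicious (1) rows and 85 benign (0) rows
sample = pd.DataFrame({"Outcome": [1] * 15 + [0] * 85})

# Relative frequency of each class, expressed as a percentage
class_share = sample["Outcome"].value_counts(normalize=True) * 100
print(class_share.round(1))
```

A strongly uneven split like this one is worth keeping in mind when you later choose evaluation metrics, because plain accuracy can look good even if the model rarely flags the minority class.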

### 3. Identify missing data.

Plot the data to visualize the data distribution.
Use the `hist` method to plot a histogram.
You can use histograms to see how the data is distributed for each variable and detect outliers.

In [None]:
# Plot histograms of the columns on multiple subplots
plt.close('all')
data.hist(bins=20, figsize=(10, 8))

The dataset is evenly distributed.
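Histograms give a visual impression of outliers; a numeric check can complement them. One common convention is the interquartile range (IQR) rule, sketched below. The helper `count_iqr_outliers`, the multiplier `k=1.5`, and the values in `sample` are illustrative assumptions, not part of the original notebook or data set.

```python
import pandas as pd


def count_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per numeric column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Compare every cell against its own column's lower and upper bound
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Hypothetical values for illustration only; one TransferTime is far from the rest
sample = pd.DataFrame({
    "TransferTime": [10, 12, 11, 13, 12, 11, 10, 500],
    "FileSize": [5, 6, 5, 7, 6, 5, 6, 6],
})

outlier_counts = count_iqr_outliers(sample)
print(outlier_counts)
```

In the notebook, calling `count_iqr_outliers(data)` would apply the same rule to the real columns. Binary columns tend to produce misleading results with this rule, so you may want to exclude them first.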


Reuse the `head` method to see the `0` values in the data. Print the first 20 rows.

In [None]:
data.head(20)

Print the last 20 rows of the dataset and determine if those rows also contain `0` values.

In [None]:
data.tail(20)

Determine the number of `0` values in the dataset.

In [None]:
# Select all the rows and only the feature columns
feature_data = data.iloc[:, :-1]

# Count the total number of rows
num_cases = data.shape[0]

# Number and percentage of zero values for each feature
is_zero = feature_data == 0
num_zero = is_zero.sum()
per_zero = num_zero / num_cases * 100

print(f"\nRows, Feature columns: {feature_data.shape}")
print("\n== Number of 0's:")
print(num_zero)
print("\n== Percentage of 0's:")
print(per_zero)

The data set contains 56 zero values for `is executable` and 52 zero values for `is compressed`.
Approximately half of the file transfers involve a compressed file and/or an executable binary file.

To build and train a reliable ML model, you should address any missing values.
However, for the sake of simplicity, the dataset in this exercise does not contain missing values or outliers.

Verify whether there are any missing data values.
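One way to run this check is with the `isnull` method of a Pandas data frame. The sketch below uses a small hypothetical frame named `sample`, with one deliberately missing value, so that the output is easy to follow; it is not taken from the exercise data.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only; FileSize has one missing entry
sample = pd.DataFrame({
    "FileSize": [5.0, np.nan, 7.0, 6.0],
    "TransferTime": [10, 12, 11, 13],
})

# Number of missing values in each column
missing_per_column = sample.isnull().sum()

# Single True/False answer for the whole frame
has_missing = bool(sample.isnull().values.any())

print(missing_per_column)
print(f"Any missing values: {has_missing}")
```

In the notebook, `data.isnull().sum()` performs the same per-column count on the real data set. A result of `0` for every column would confirm that no values are missing.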

## References

*File Transfer records were generated by extracting data from a Control Center Monitor instance database in a non-production lab environment.