# Big Data | Log Parsing


## Part I - Splunk Analysis

### What scenarios is Splunk appropriate for? Really well suited to?

Splunk is a software product for searching, monitoring, and analyzing machine-generated big data through a web-style interface<sup>1</sup>. It uses a standard API to connect directly to applications and devices, and was initially developed to support data-driven decision making for executives without depending on an IT department. Splunk can connect to almost any log-based data feed, and the company works with a variety of industries. It has three primary products (Splunk Enterprise, Splunk Cloud, and Splunk Light), which fit organizational needs according to source infrastructure and data volume. Splunk also offers three primary services (Splunk IT Service Intelligence, Splunk Enterprise Security, and Splunk User Behavior Analytics), which together cover a wide array of IT and system-based analytic needs<sup>2</sup>.

Splunk continues to increase market share for several reasons. The primary reason Splunk remains relevant is that it offers a very mature product. Many of its competitors are open source, which may appeal to cost-cutting organizations, while others are established commercial vendors (BMC, CA, Tivoli, Dynatrace)<sup>3</sup>. However, the major fault of these products is that they are not complete solutions. Each offers only a fraction of the services and functionality that Splunk does, and organizations will often spend more money bridging the gaps between these disparate products than they would by investing in Splunk.

One real-world example is Domino's, which applied Splunk's platform so that employees could access sales information, track sales performance, view customer satisfaction sentiment, understand order fulfillment speeds, and gauge the effectiveness of marketing promotion campaigns in one centralized platform<sup>4</sup>.

According to one Splunk user<sup>5</sup>:

<strong>Good:</strong>

1. Ad-hoc querying and analytics support.
2. Defining the field extractor (which allows for the creation of fields from events) is a one-time exercise, and the result is reusable.
3. A highly mature product, with a large number of applications.
4. Real-time availability of time-series data.
5. A free version with limited functionality is available.
6. Documentation and community forums are publicly accessible.

<strong>Bad:</strong>

1. Searching large data sets can be time-consuming and resource-intensive.
2. As an index increases in size, CPU utilization on Windows may become excessive.
3. The price for the paid version.
<p></p>
<p></p>
<hr style="width:50%;">
<p></p>
<p></p>
<strong>Additional Splunk Components:</strong>

 - The Asset Investigator: shows malicious or possibly malicious content related to a particular asset searched by IP
 - Access Tracker: shows where the asset has connected and when the first access was

<strong>Pricing:</strong>
 - You pay according to how much data you use per day, once you exceed the free tier's limit of 500 MB/day.
 - The more GB/day you use, the better the per-GB price typically is:
     - At 100 GB/day, Splunk costs around \$1,500 per GB
     - At 10 GB/day, Splunk costs \$2,500 per GB
     - At 1 GB/day, Splunk costs \$4,500 per GB
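
To make the volume discount concrete, the arithmetic can be sketched as below. This is only an illustration under the assumption that the figures listed above are per-GB rates at each daily-volume tier; the variable names are hypothetical, and actual Splunk pricing should be checked with the vendor.

```python
# Assumption: the prices listed above are per-GB rates (USD) at each tier
price_per_gb = {1: 4500, 10: 2500, 100: 1500}

# Total licence cost at each tier = daily volume * per-GB rate
totals = {gb: gb * rate for gb, rate in price_per_gb.items()}

for gb, rate in price_per_gb.items():
    print(f"{gb:>3} GB/day at ${rate:,}/GB is about ${totals[gb]:,} in total")
```

Under that reading, the total bill still rises with volume, but the cost of each additional GB falls.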


#### References

1. https://en.wikipedia.org/wiki/Splunk
2. https://www.splunk.com/en_us/products.html
3. https://www.infoworld.com/article/3180801/analytics/why-splunk-keeps-beating-open-source-competitors.html
4. https://www.edureka.co/blog/splunk-use-case?utm_source=quora&utm_medium=crosspost&utm_campaign=social-media-edureka-pg
5. https://www.quora.com/Whats-good-and-bad-about-Splunk
6. https://government.diginomica.com/2017/10/30/splunk-pursuit-business-user/

## Part II - Parsing Log Data

### Import Libraries and Create Initial Data Frame

In [72]:
# Import libraries
import csv
import pandas as pd

# Create a blank list
results = []

# Open log file
with open('/Users/danehamlett/Desktop/School/Big Data/msnbc990928.seq', newline='') as inputfile:
    for row in csv.reader(inputfile):
        results.append(row)

# Create primary data frame
df = pd.DataFrame(results,columns=["Row"])

# Drop header rows
df.drop(df.index[:7], inplace=True)

# Calculate total rows
tot_rec = len(df.index)
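
Because each session in the file is a line of space-separated page-type codes rather than comma-separated values, the load step can also be sketched without `csv`. The example below is a minimal sketch, not the code that produced the results in this notebook: `parse_sessions` is a hypothetical helper, and the in-memory sample only imitates the assumed layout (a few header lines followed by one session per line).

```python
import io

def parse_sessions(lines, header_rows=0):
    """Turn lines of space-separated page codes into lists of ints."""
    sessions = []
    for i, line in enumerate(lines):
        if i < header_rows:
            continue  # skip the file header
        tokens = line.split()
        if tokens:  # ignore blank lines
            sessions.append([int(t) for t in tokens])
    return sessions

# Hypothetical sample: two header lines, then three sessions and a blank line
sample = io.StringIO("% header\ncategories\n1 1 12\n12 17\n\n1 10 1 17 1 12 1\n")
sessions = parse_sessions(sample, header_rows=2)
print(sessions)
```

Keeping each session as a list of integers makes later checks operate on whole page codes rather than on characters in a string.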

### Create Data Frames and Analyze Data

#### What % of visitors visited a page of type 12 and page of type 17 in the same session?

In [73]:
# Create a data frame to analyze
a_df = df[df['Row'].str.contains('12') & df['Row'].str.contains('17')]

# Preview data
a_df.head()

|  | Row |
|---|---|
| 356 | 1 1 12 15 10 12 17 11 1 1 12 15 3 10 17 11 1 2... |
| 2153 | 12 17 |
| 2361 | 1 1 14 14 14 14 14 14 14 14 14 14 14 14 14 14 ... |
| 3126 | 1 10 1 17 1 12 1 |
| 3204 | 1 12 1 17 1 3 1 10 1 1 12 12 1 |


#### Execute Metric Calculation

In [74]:
# Calculate metric and print result
a_visit = (len(a_df)/tot_rec)*100
print("The % of visitors who visited a page of type 12 and a page of type 17 in the same session is " + 
      str(round(a_visit,4)) + "% (" + str(len(a_df)) + "/" + str(tot_rec) + ").")

The % of visitors who visited a page of type 12 and a page of type 17 in the same session is 0.2595% (2569/989818).
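
The substring test above relies on `12` and `17` never appearing inside a longer page code. That appears to hold for the rows previewed here, but it is an assumption about the data. A token-based check avoids the dependency. The following is a minimal sketch on toy sessions (not taken from the MSNBC file), with `pct_with_both` as a hypothetical helper name.

```python
def pct_with_both(sessions, a=12, b=17):
    """Percentage of sessions that contain both page types, in any order."""
    if not sessions:
        return 0.0
    hits = sum(1 for s in sessions if a in s and b in s)
    return hits / len(sessions) * 100

# Toy sessions, each a list of integer page codes
toy = [[1, 12, 17], [12, 12], [17, 3, 12], [1, 2]]
print(pct_with_both(toy))
```

Membership tests on lists of integers compare whole codes, so a code such as 112 would not be mistaken for 12.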


#### What % of visitors visited a page of type 12 AFTER a page of type 17 in the same session (the two page views do not need to be consecutive)?

In [75]:
# Create new data frame
b_df = a_df[["Row"]].copy()

# Identify character positions
b_df['T_Position'] = b_df['Row'].str.find('12')
b_df['S_Position'] = b_df['Row'].str.find('17')

# Identify relative positions
b_df['T_After_S'] = b_df['T_Position'] > b_df['S_Position']
b_df['T_After_S'] = b_df['T_After_S'].astype(int)

# Preview data
b_df.head()

|  | Row | T_Position | S_Position | T_After_S |
|---|---|---|---|---|
| 356 | 1 1 12 15 10 12 17 11 1 1 12 15 3 10 17 11 1 2... | 4 | 16 | 0 |
| 2153 | 12 17 | 0 | 3 | 0 |
| 2361 | 1 1 14 14 14 14 14 14 14 14 14 14 14 14 14 14 ... | 70 | 80 | 0 |
| 3126 | 1 10 1 17 1 12 1 | 12 | 7 | 1 |
| 3204 | 1 12 1 17 1 3 1 10 1 1 12 12 1 | 2 | 7 | 0 |


#### Execute Metric Calculation

In [76]:
# Calculate metric and print result
b_visit = (sum(b_df['T_After_S'])/tot_rec)*100
print("The % of visitors who visited a page of type 12 after a page of type 17 in the same session is " + 
      str(round(b_visit,4)) + "% (" + str(sum(b_df['T_After_S'])) + "/" + str(tot_rec) + ").")

The % of visitors who visited a page of type 12 after a page of type 17 in the same session is 0.1181% (1169/989818).
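
`str.find` returns the position of the first match only, so `T_After_S` is set when the first 12 in a session follows the first 17. Row 3204 in the preview (`1 12 1 17 1 3 1 10 1 1 12 12 1`) is flagged 0 for that reason, although a 12 does occur after the 17. If the question is read as "any 12 after any 17", one possible approach compares the last 12 with the first 17. The sketch below illustrates that reading. `visited_after` is a hypothetical helper, and the figures reported above come from the original cell, not from this sketch.

```python
def visited_after(session, target=12, anchor=17):
    """True if any `target` page view occurs after some `anchor` page view."""
    if anchor not in session or target not in session:
        return False
    first_anchor = session.index(anchor)
    # Index of the last `target`, found by searching the reversed list
    last_target = len(session) - 1 - session[::-1].index(target)
    return last_target > first_anchor

row_3204 = [int(t) for t in "1 12 1 17 1 3 1 10 1 1 12 12 1".split()]
print(visited_after(row_3204))                  # True
print(visited_after([12, 17]))                  # False
print(visited_after([1, 10, 1, 17, 1, 12, 1]))  # True
```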


### Description of Approach

For this exercise, two common Python libraries were used: csv and pandas. The csv library was used to import the sequence data row by row. The pandas library was then used to create a data frame that enabled a detailed analysis of the data. The code was intentionally left verbose so that the working solution is clear and accurate. It can certainly be optimized and shortened, but identifying the optimal approach was not the focus of this exercise.
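
As one illustration of what a more compact version might look like, the sketch below computes both counts in a single pass over the lines, using only the standard `re` module and no intermediate data frame. It is not the approach used in this notebook. `session_metrics` and the toy input are hypothetical, and the `AFTER` pattern matches a 12 anywhere after a 17, which is a broader test than comparing first positions with `str.find`.

```python
import re

HAS_12 = re.compile(r"\b12\b")
HAS_17 = re.compile(r"\b17\b")
AFTER = re.compile(r"\b17\b.*\b12\b")  # a 17, then a 12 anywhere later

def session_metrics(lines):
    """Return (total, both, after) session counts in a single pass."""
    total = both = after = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        total += 1
        if HAS_12.search(line) and HAS_17.search(line):
            both += 1
            if AFTER.search(line):
                after += 1
    return total, both, after

# Toy input, not taken from the MSNBC file
counts = session_metrics(["1 12 17", "12 17", "17 3 12", "", "1 2"])
print(counts)
```

Because the function only iterates over lines, it could be passed an open file handle directly, which avoids holding every row in memory.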