# Web Log Analysis

**Description:** Consider the file `weblog.txt`, a network log file. Each line records an IP address, date and time, request method (submit/upload), URL, protocol version, and status code. The fields are described below:
- **IP Address:** IP address of the machine.
- **Date & Time:** Date and time at which the event/action occurred.
- **Submit/upload:** The HTTP method. GET requests data from the server; POST uploads data to it.
- **URL:** The requested URL.
- **Protocol version:** HTTP/1.1 indicates the version of the HTTP protocol.
- **Status:** Code indicating the status of the HTTP action.
  - 200: Success
  - 302: Requested resource has been temporarily moved
  - 404: File Not Found
  - 304: File Not Modified (no update is required)
  - 206: Partial content has been processed
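
The field list above can be turned into a small line parser. The sketch below assumes comma-separated lines of the form `IP,[date:time,METHOD URL PROTOCOL,status`; that layout is inferred from the column offsets used in the cells of this notebook, not stated in the description, and the sample line and the names `LOG_LINE` and `parse_line` are illustrative only.

```python
import re

# Assumed layout (hypothetical sample, not copied from weblog.txt):
#   10.128.2.1,[29/Nov/2017:06:58:55,GET /login.php HTTP/1.1,200
LOG_LINE = re.compile(
    r"^(?P<ip>\d{1,3}(?:\.\d{1,3}){3}),"   # IP address
    r"\[?(?P<date>\d{2}/\w{3}/\d{4}):"     # date, with an optional leading '['
    r"(?P<time>\d{2}:\d{2}:\d{2}),"        # time of day
    r"(?P<method>[A-Z]+) "                 # GET, POST, ...
    r"(?P<url>\S+) "                       # requested URL
    r"(?P<protocol>HTTP/\d\.\d),"          # protocol version
    r"(?P<status>\d{3})$"                  # status code
)


def parse_line(line):
    """Return a dict of the named fields, or None if the line does not match."""
    m = LOG_LINE.match(line.strip())
    return m.groupdict() if m else None


sample = "10.128.2.1,[29/Nov/2017:06:58:55,GET /login.php HTTP/1.1,200"
fields = parse_line(sample)
print(fields["ip"], fields["date"], fields["method"], fields["url"], fields["status"])
```

Named groups make each later task a lookup on one field, at the cost of rejecting any line that deviates from the assumed layout.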


### Create a dictionary IP_Count where each key-value pair is an IP address and its frequency in the entire file.

In [1]:
import re

# Count requests per IP address; escaped dots match only a literal '.'.
IP_Count = {}
with open('weblog.txt', 'r') as fhand:
    for line in fhand:
        m = re.match(r'10\.1\d{2}\.\d\.\d', line)
        if m:
            IP_Count[m.group()] = IP_Count.get(m.group(), 0) + 1

print(IP_Count)

{'10.128.2.1': 4257, '10.131.2.1': 1626, '10.130.2.1': 4056, '10.129.2.1': 1652, '10.131.0.1': 4198}
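
The cell above only recognises addresses of the form `10.1xx.x.x`. A minimal sketch of a more general variant follows, assuming the IP address is the first comma-separated field of each line; it validates the field with the standard `ipaddress` module. The sample lines, including the header row, and the helper name `count_ips` are hypothetical.

```python
import ipaddress
from collections import Counter


def count_ips(lines):
    """Count the first comma-separated field of each line when it is a valid IP."""
    counts = Counter()
    for line in lines:
        token = line.split(",", 1)[0].strip()
        try:
            ipaddress.ip_address(token)   # raises ValueError for non-addresses
        except ValueError:
            continue                      # skip headers and malformed lines
        counts[token] += 1
    return dict(counts)


# Hypothetical sample lines in the assumed layout.
sample = [
    "IP,Time,URL,Status",
    "10.128.2.1,[29/Nov/2017:06:58:55,GET /login.php HTTP/1.1,200",
    "10.131.0.1,[29/Nov/2017:06:59:02,POST /process.php HTTP/1.1,302",
    "10.128.2.1,[29/Nov/2017:06:59:03,GET /home.php HTTP/1.1,200",
]
print(count_ips(sample))
```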


### Create a dictionary Daily_Stats where each key-value pair is a date (a string containing only the date, without the time) and the number of actions carried out on that day

In [2]:
import re

# Count log entries per calendar day (dd/Mon/yyyy, time of day dropped).
Daily_Stats = {}
with open('weblog.txt', 'r') as fhand:
    for line in fhand:
        m = re.search(r'\d{2}/\w{3}/\d{4}', line)
        if m:
            Daily_Stats[m.group()] = Daily_Stats.get(m.group(), 0) + 1

print(Daily_Stats)

{'29/Nov/2017': 580, '30/Nov/2017': 2991, '01/Dec/2017': 468, '02/Dec/2017': 168, '03/Dec/2017': 105, '07/Nov/2017': 2, '08/Nov/2017': 106, '09/Nov/2017': 236, '10/Nov/2017': 64, '11/Nov/2017': 286, '12/Nov/2017': 338, '13/Nov/2017': 230, '14/Nov/2017': 150, '15/Nov/2017': 78, '16/Nov/2017': 384, '17/Nov/2017': 481, '18/Nov/2017': 96, '19/Nov/2017': 164, '20/Nov/2017': 58, '21/Nov/2017': 47, '22/Nov/2017': 60, '23/Nov/2017': 380, '24/Nov/2017': 94, '25/Nov/2017': 250, '26/Nov/2017': 179, '12/Dec/2017': 86, '13/Dec/2017': 133, '14/Dec/2017': 165, '15/Dec/2017': 100, '16/Dec/2017': 155, '17/Dec/2017': 92, '18/Dec/2017': 178, '19/Dec/2017': 55, '20/Dec/2017': 98, '21/Dec/2017': 72, '22/Dec/2017': 11, '23/Dec/2017': 43, '16/Jan/2018': 76, '17/Jan/2018': 29, '18/Jan/2018': 63, '29/Jan/2018': 5092, '15/Feb/2018': 20, '16/Feb/2018': 33, '17/Feb/2018': 65, '18/Feb/2018': 34, '19/Feb/2018': 32, '20/Feb/2018': 62, '21/Feb/2018': 110, '22/Feb/2018': 231, '23/Feb/2018': 127, '24/Feb/2018': 15, '25
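
The dictionary above is printed in insertion order, so `29/Nov/2017` appears before `07/Nov/2017`. A minimal sketch for listing the days chronologically follows, assuming the `dd/Mon/yyyy` keys use English month abbreviations (which `%b` parses under the default C or English locale). The three entries are copied from the output above; the names `daily` and `by_date` are illustrative.

```python
from datetime import datetime

# Three entries copied from the printed Daily_Stats output.
daily = {'29/Nov/2017': 580, '30/Nov/2017': 2991, '07/Nov/2017': 2}

# Sort by the parsed date instead of by the raw string.
by_date = sorted(daily.items(), key=lambda kv: datetime.strptime(kv[0], "%d/%b/%Y"))
for day, n in by_date:
    print(day, n)
```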

### Display the URL that has been accessed (either for submit or upload) the most times.

In [3]:
import re

# Tally every '/name.ext' path (lowercase name, three-letter extension).
URL = {}
with open('weblog.txt', 'r') as fhand:
    for line in fhand:
        for path in re.findall(r'/[a-z]+\.[a-z]{3}', line):
            URL[path] = URL.get(path, 0) + 1

most = max(URL.values())   # compute the maximum once, not once per key
for k, v in URL.items():
    if v == most:
        print("URL Accessed Most Number of Times:", k)
        print("Number of times: ", v)

URL Accessed Most Number of Times: /login.php
Number of times:  3426
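
As an alternative to matching paths by regex, the request field can be split into method, URL, and protocol, and the URLs tallied with `collections.Counter`. This is a sketch under the assumption that the request is the third comma-separated field and that URLs contain no commas. The sample lines and the helper name `most_accessed` are hypothetical, and dropping the query string is a design choice that may yield different counts from the cell above.

```python
from collections import Counter


def most_accessed(lines):
    """Return (url, count) for the most frequent request path, or None."""
    counts = Counter()
    for line in lines:
        parts = line.split(",")
        if len(parts) < 4:
            continue                                  # not a full log record
        request = parts[2].split()                    # e.g. ['GET', '/login.php', 'HTTP/1.1']
        if len(request) == 3:
            counts[request[1].split("?", 1)[0]] += 1  # ignore any query string
    top = counts.most_common(1)
    return top[0] if top else None


# Hypothetical sample lines in the assumed layout.
sample = [
    "10.128.2.1,[29/Nov/2017:06:58:55,GET /login.php HTTP/1.1,200",
    "10.131.0.1,[29/Nov/2017:06:59:02,POST /process.php HTTP/1.1,302",
    "10.128.2.1,[29/Nov/2017:06:59:03,GET /login.php?id=1 HTTP/1.1,200",
]
print(most_accessed(sample))
```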


###  Display the total count of HTTP GET requests and HTTP POST requests

In [4]:
# Count GET and POST requests using the fixed-width method field.
Get_Post = {}
with open('weblog.txt', 'r') as fhand:
    for line in fhand:
        method = line[33:37]   # 'GET ' (with trailing space) or 'POST'
        if method in ('GET ', 'POST'):
            Get_Post[method] = Get_Post.get(method, 0) + 1

print(Get_Post)

{'GET ': 15098, 'POST': 682}


### Create a dictionary Status_Code where the key is the status code and the value is a tuple consisting of a string stating the meaning of the code and a number indicating the frequency of that code in the file.

In [2]:
import re

# Meaning of each status code of interest.
MEANING = {
    '200': "Success",
    '206': "Partial content has been processed",
    '302': "Requested resource has been temporarily moved",
    '304': "File Not Modified",
    '404': "File Not Found",
}

# Count lines that end in one of the known codes.
Status_Code = {}
with open('weblog.txt', 'r') as fhand:
    for line in fhand:
        m = re.search(r'\b(200|206|302|304|404)$', line.rstrip())
        if m:
            Status_Code[m.group()] = Status_Code.get(m.group(), 0) + 1

# Map each integer code to a (meaning, frequency) tuple.
Status_Code1 = {int(k): (MEANING[k], v) for k, v in Status_Code.items()}
print(Status_Code1)

{200: ('Success', 11330), 302: ('Requested resource has been temporarily moved', 3498), 304: ('File Not Modified', 658), 206: ('Partial content has been processed', 52), 404: ('File Not Found', 251)}
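
The meanings above are written out by hand. As a cross-check, Python's standard `http.HTTPStatus` enum carries the official reason phrase for each code. These phrases are shorter than the descriptions used in this exercise and worded differently. The counts below are copied from the output above; the names `counts` and `status_info` are illustrative.

```python
from http import HTTPStatus

# Frequencies copied from the printed Status_Code1 output.
counts = {200: 11330, 302: 3498, 304: 658, 206: 52, 404: 251}

# Pair each code with its standard reason phrase and its frequency.
status_info = {code: (HTTPStatus(code).phrase, n) for code, n in counts.items()}
print(status_info)
```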


### Create a histogram (list of tuples) indicating the number of actions carried out in each hour of the day.

In [1]:
import re

# Count entries per hour of day; the hour follows 'dd/Mon/yyyy:'.
hour_counts = {}
with open('weblog.txt', 'r') as fhand:
    for line in fhand:
        m = re.search(r'\d{2}/\w{3}/\d{4}:(\d{2}):', line)
        if m:
            hour = int(m.group(1))
            hour_counts[hour] = hour_counts.get(hour, 0) + 1

# Histogram as a list of (hour, count) tuples sorted by hour.
final_ans = sorted(hour_counts.items())
print(final_ans)

[(0, 118), (1, 53), (2, 48), (3, 164), (4, 246), (5, 283), (6, 575), (7, 313), (8, 284), (9, 187), (10, 138), (11, 255), (12, 732), (13, 766), (14, 581), (15, 1461), (16, 1169), (17, 754), (18, 734), (19, 847), (20, 5458), (21, 240), (22, 226), (23, 157)]
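
A list of `(hour, count)` tuples is easier to read as bars. The sketch below renders a text histogram scaled so that the busiest hour fills a fixed width. The four tuples are a subset of the output above; the helper name `render` and the `width` parameter are illustrative.

```python
# Subset of the (hour, count) tuples printed above.
hist = [(12, 732), (15, 1461), (20, 5458), (21, 240)]


def render(hist, width=40):
    """Return one text bar per (hour, count); the largest bar is `width` chars."""
    peak = max(n for _, n in hist)
    rows = []
    for hour, n in hist:
        bar = "#" * max(1, round(n * width / peak))   # at least one mark per hour
        rows.append(f"{hour:02d} | {bar} {n}")
    return rows


print("\n".join(render(hist)))
```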
