# **NASA Data**
https://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html


## Load log file

### **Subtask:**
"Load the log file into a pandas DataFrame. Because the file has no extension and uses a custom format, it must be read as a plain text file and then parsed line by line."

### **Reasoning:**
"Read the log file line by line and store each line in a list."

In [4]:
with open("/content/drive/MyDrive/PPW/NASA/access_log_Jul95", 'r', encoding='latin-1') as f:
    log_lines = f.readlines()
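
`readlines()` keeps the whole file in memory as one list. As a minimal sketch of a streaming alternative, the generator below yields one line at a time. The helper name `iter_log_lines` and the two-line temporary sample file are illustrative assumptions, not part of the original notebook.

```python
import os
import tempfile


def iter_log_lines(path, encoding="latin-1"):
    """Yield log lines one at a time, without the trailing newline."""
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            yield line.rstrip("\n")


# Hypothetical two-line sample standing in for access_log_Jul95.
sample = (
    '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245\n'
    'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985\n'
)
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False, encoding="latin-1") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

streamed_lines = list(iter_log_lines(sample_path))
os.remove(sample_path)
print(len(streamed_lines))
```

For a file of this size either approach should work in Colab; the generator mainly helps when the log no longer fits comfortably in memory.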

### **Reasoning:**
"The previous step read the log file into a list of strings. The next step is to verify the contents by printing the first few lines."

In [5]:
print(log_lines[:5])

['199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245\n', 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985\n', '199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085\n', 'burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0\n', '199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179\n']


### **Parse the log data**

### **Subtask:**
With the NASA log file loaded into Python as a list of log lines, extract the fields of each line with a regular expression.


In [6]:
import re

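# Fields: host, [timestamp], "method path protocol", status code, response size.
# Lines that do not fit this layout (for example, malformed requests) are skipped below.
log_pattern = re.compile(r'(\S+) - - \[(.*?)\] "(\S+) (.*?) (\S+)" (\S+) (\S+)')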
parsed_data = []

for line in log_lines:
    match = log_pattern.match(line)
    if match:
        hostname, timestamp, request_type, path, protocol, status_code, size = match.groups()
        parsed_data.append([hostname, timestamp, request_type, path, status_code, size])
    else:
        # Optional: Log lines that don't match the pattern
        # print(f"Skipping line: {line.strip()}")
        pass

print(parsed_data[:5])

[['199.72.81.55', '01/Jul/1995:00:00:01 -0400', 'GET', '/history/apollo/', '200', '6245'], ['unicomp6.unicomp.net', '01/Jul/1995:00:00:06 -0400', 'GET', '/shuttle/countdown/', '200', '3985'], ['199.120.110.21', '01/Jul/1995:00:00:09 -0400', 'GET', '/shuttle/missions/sts-73/mission-sts-73.html', '200', '4085'], ['burger.letters.com', '01/Jul/1995:00:00:11 -0400', 'GET', '/shuttle/countdown/liftoff.html', '304', '0'], ['199.120.110.21', '01/Jul/1995:00:00:11 -0400', 'GET', '/shuttle/missions/sts-73/sts-73-patch-small.gif', '200', '4179']]
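
The loop above silently drops every line that does not match the pattern. The sketch below keeps the same field layout, written with named groups, and collects the skipped lines so they can be counted. The names `LOG_PATTERN` and `parse_line` are hypothetical, and the third sample line is invented to show a non-matching case.

```python
import re

# Same field layout as the pattern above, written with named groups.
LOG_PATTERN = re.compile(
    r'(?P<hostname>\S+) - - \[(?P<timestamp>.*?)\] '
    r'"(?P<request_type>\S+) (?P<path>.*?) (?P<protocol>\S+)" '
    r'(?P<status_code>\S+) (?P<size>\S+)'
)


def parse_line(line):
    """Return a dict of fields, or None when the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


sample_lines = [
    '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245\n',
    'burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0\n',
    'not a log line\n',  # invented malformed line, for illustration only
]

parsed, skipped = [], []
for line in sample_lines:
    fields = parse_line(line)
    if fields is None:
        skipped.append(line)
    else:
        parsed.append(fields)

print(f"parsed={len(parsed)} skipped={len(skipped)}")
```

Comparing `len(skipped)` with `len(log_lines)` on the real file would show how many requests the pattern leaves out.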


### **Structure the data**

### **Subtask:**
Organize the parsed log data into a structured format, here a pandas DataFrame, and give the columns appropriate names.


In [7]:
import pandas as pd

df = pd.DataFrame(parsed_data, columns=['hostname', 'timestamp', 'request_type', 'path', 'status_code', 'size'])
display(df.head())

,hostname,timestamp,request_type,path,status_code,size
0,199.72.81.55,01/Jul/1995:00:00:01 -0400,GET,/history/apollo/,200,6245
1,unicomp6.unicomp.net,01/Jul/1995:00:00:06 -0400,GET,/shuttle/countdown/,200,3985
2,199.120.110.21,01/Jul/1995:00:00:09 -0400,GET,/shuttle/missions/sts-73/mission-sts-73.html,200,4085
3,burger.letters.com,01/Jul/1995:00:00:11 -0400,GET,/shuttle/countdown/liftoff.html,304,0
4,199.120.110.21,01/Jul/1995:00:00:11 -0400,GET,/shuttle/missions/sts-73/sts-73-patch-small.gif,200,4179
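
Because the regular expression returns text, every column in this DataFrame holds strings. The sketch below shows one way to convert `status_code` and `size` to numbers. It assumes that `size` may contain a non-numeric placeholder such as `-`, as the Common Log Format allows when no bytes are sent; the third sample row is invented to show that case.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "hostname": ["199.72.81.55", "burger.letters.com", "example.host"],
        "status_code": ["200", "304", "404"],
        "size": ["6245", "0", "-"],  # the "-" row is invented for illustration
    }
)

typed = sample.copy()
typed["status_code"] = typed["status_code"].astype(int)
# errors="coerce" turns non-numeric values such as "-" into NaN.
typed["size"] = pd.to_numeric(typed["size"], errors="coerce")

print(typed.dtypes)
```

If the columns are converted this way, later comparisons such as `df['status_code'] == '200'` would need the integer `200` instead of the string.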


In [8]:
len(df.timestamp)

1888722

In [9]:
# === Show the number of requests per host (IP address or hostname) ===
dupe_count = df["hostname"].value_counts()
print("Number of requests per host:")
display(dupe_count.head(10))  # show the top 10 hosts

Number of requests per host:


hostname,count
piweba3y.prodigy.com,17572
piweba4y.prodigy.com,11591
piweba1y.prodigy.com,9868
alyssa.prodigy.com,7852
siltb10.orl.mmc.com,7573
piweba2y.prodigy.com,5922
edams.ksc.nasa.gov,5434
163.206.89.4,4906
news.ti.com,4863
disarray.demon.co.uk,4353
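
The `hostname` column mixes raw IP addresses with resolved host names, and most of the top ten above are host names. As a rough sketch, the standard-library `ipaddress` module can separate the two kinds. The helper name `is_ip_address` is an assumption made for this example.

```python
import ipaddress
from collections import Counter


def is_ip_address(host):
    """Return True when host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


# A few host values copied from the outputs above.
hosts = ["piweba3y.prodigy.com", "163.206.89.4", "news.ti.com", "199.72.81.55"]
kinds = Counter("ip" if is_ip_address(h) else "name" for h in hosts)
print(kinds)
```

On the full DataFrame the same check could be applied with `df["hostname"].map(is_ip_address)`.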


In [10]:
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%d/%b/%Y:%H:%M:%S %z')
df_sorted = df.sort_values(by=['hostname', 'timestamp'])
display(df_sorted.head())

,hostname,timestamp,request_type,path,status_code,size
726764,***.novo.dk,1995-07-11 08:17:09-04:00,GET,/ksc.html,200,7067
726765,***.novo.dk,1995-07-11 08:17:11-04:00,GET,/images/ksclogo-medium.gif,200,5866
726787,***.novo.dk,1995-07-11 08:17:31-04:00,GET,/images/MOSAIC-logosmall.gif,200,363
726791,***.novo.dk,1995-07-11 08:17:33-04:00,GET,/images/USA-logosmall.gif,200,234
726793,***.novo.dk,1995-07-11 08:17:34-04:00,GET,/images/NASA-logosmall.gif,200,786
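
The format string passed to `pd.to_datetime` above uses the standard `strftime` directives, so it can be checked on a single value with the standard library. This is a small sketch; the constant name `LOG_TIME_FORMAT` is an assumption, and `%b` expects English month abbreviations, which holds under the default C locale.

```python
from datetime import datetime, timedelta

LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

# %d day, %b abbreviated month name, %Y year, %H:%M:%S time, %z UTC offset.
parsed_time = datetime.strptime("01/Jul/1995:00:00:01 -0400", LOG_TIME_FORMAT)

print(parsed_time.isoformat())
print(parsed_time.utcoffset() == timedelta(hours=-4))
```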


In [11]:
request_type_counts = df['request_type'].value_counts()
print("Unique values in the 'request_type' column and their counts:")
display(request_type_counts)

Unique values in the 'request_type' column and their counts:


request_type,count
GET,1884659
HEAD,3952
POST,111


In [12]:
status_code_counts = df['status_code'].value_counts()
print("Unique values in the 'status_code' column and their counts:")
display(status_code_counts)

Unique values in the 'status_code' column and their counts:


status_code,count
200,1698643
304,132627
302,46548
404,10774
500,62
403,54
501,14
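
To see how the traffic splits between successful, redirected, and failed requests, the status codes can be grouped by their first digit. The sketch below works directly on the counts copied from the output above; the variable names are illustrative.

```python
from collections import Counter

# Counts copied from the status_code output above.
status_counts = {
    "200": 1698643,
    "304": 132627,
    "302": 46548,
    "404": 10774,
    "500": 62,
    "403": 54,
    "501": 14,
}

# Group by status class: "200" -> "2xx", "404" -> "4xx", and so on.
class_counts = Counter()
for code, count in status_counts.items():
    class_counts[code[0] + "xx"] += count

total = sum(status_counts.values())
for status_class, count in sorted(class_counts.items()):
    print(f"{status_class}: {count} ({count / total:.2%})")
```

The total equals the 1888722 rows reported earlier, so every parsed row carries one of these seven codes.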


### So, filter on .html paths, a single status code, and the GET request type

In [13]:
post_html_count = df[(df['request_type'] == 'POST') & (df['path'].str.endswith('.html'))].shape[0]
print(f"Number of POST requests with a path ending in '.html': {post_html_count}")

Number of POST requests with a path ending in '.html': 9


In [14]:
get_html_count = df[(df['request_type'] == 'GET') & (df['path'].str.endswith('.html'))].shape[0]
print(f"Number of GET requests with a path ending in '.html': {get_html_count}")

Number of GET requests with a path ending in '.html': 416446


### So, keep only the GET requests

In [15]:
filtered_df = df[(df['request_type'] == 'GET') & (df['path'].str.endswith('.html')) & (df['status_code'] == '200')]
display(filtered_df.head())

,hostname,timestamp,request_type,path,status_code,size
2,199.120.110.21,1995-07-01 00:00:09-04:00,GET,/shuttle/missions/sts-73/mission-sts-73.html,200,4085
7,205.212.115.106,1995-07-01 00:00:12-04:00,GET,/shuttle/countdown/countdown.html,200,3985
18,ppptky391.asahi-net.or.jp,1995-07-01 00:00:18-04:00,GET,/facts/about_ksc.html,200,3977
22,waters-gw.starway.net.au,1995-07-01 00:00:25-04:00,GET,/shuttle/missions/51-l/mission-51-l.html,200,6723
37,gayle-gaston.tenet.edu,1995-07-01 00:00:50-04:00,GET,/shuttle/missions/sts-71/mission-sts-71.html,200,12040
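
The three conditions used above can be wrapped in a small function and checked on a toy DataFrame. This is only a sketch: `filter_page_views` is a hypothetical name, and the `/form.html` row is invented to show a POST request being excluded.

```python
import pandas as pd


def filter_page_views(frame):
    """Keep GET requests with status '200' whose path ends in '.html'."""
    mask = (
        (frame["request_type"] == "GET")
        & frame["path"].str.endswith(".html")
        & (frame["status_code"] == "200")
    )
    return frame[mask]


sample = pd.DataFrame(
    {
        "request_type": ["GET", "GET", "POST", "GET"],
        # "/form.html" is an invented path; the others appear in the log above.
        "path": ["/ksc.html", "/images/ksclogo-medium.gif", "/form.html", "/history/apollo/"],
        "status_code": ["200", "200", "200", "200"],
    }
)
page_views = filter_page_views(sample)
print(page_views)
```

Note that directory-style paths such as `/history/apollo/` are dropped because they do not end in `.html`, even though they may well have returned an HTML page.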


### Then sort the rows of each hostname by timestamp

In [16]:
filtered_df_sorted = filtered_df.sort_values(by=['hostname', 'timestamp'])
display(filtered_df_sorted.head())

,hostname,timestamp,request_type,path,status_code,size
726764,***.novo.dk,1995-07-11 08:17:09-04:00,GET,/ksc.html,200,7067
726817,***.novo.dk,1995-07-11 08:17:48-04:00,GET,/shuttle/missions/missions.html,200,8678
727034,***.novo.dk,1995-07-11 08:21:05-04:00,GET,/shuttle/missions/sts-35/mission-sts-35.html,200,12118
727042,***.novo.dk,1995-07-11 08:21:19-04:00,GET,/shuttle/missions/sts-35/mission-sts-35.html,200,12118
727149,***.novo.dk,1995-07-11 08:23:01-04:00,GET,/shuttle/resources/orbiters/columbia.html,200,6922
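
A quick way to confirm the sort did what was intended is to check that timestamps never decrease within a hostname. The sketch below does this on a small invented sample; the host names `a.example` and `b.example` are placeholders, not values from the log.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "hostname": ["b.example", "a.example", "a.example", "b.example"],
        "timestamp": pd.to_datetime(
            [
                "1995-07-11 08:21:05-04:00",
                "1995-07-11 08:17:48-04:00",
                "1995-07-11 08:17:09-04:00",
                "1995-07-11 08:17:09-04:00",
            ]
        ),
    }
)

ordered = sample.sort_values(by=["hostname", "timestamp"])

# Within every hostname, timestamps should never decrease.
per_host_ok = ordered.groupby("hostname")["timestamp"].apply(
    lambda s: s.is_monotonic_increasing
)
print(per_host_ok.all())
```

The same `groupby` check could be run on `filtered_df_sorted` before saving it.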


In [17]:
filtered_df_sorted.to_csv('/content/drive/MyDrive/PPW/NASA/filtered_sorted_log_data.csv', index=False)
print("DataFrame saved to 'filtered_sorted_log_data.csv'")

DataFrame saved to 'filtered_sorted_log_data.csv'


In [18]:
nama_file = '/content/drive/MyDrive/PPW/NASA/filtered_sorted_log_data.csv'
df = pd.read_csv(nama_file)
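
A CSV file stores no column types, so after `pd.read_csv` the `timestamp` column comes back as text. The sketch below shows one way to restore it. It reads two rows copied from the saved data out of an in-memory string instead of Google Drive, and the variable names are illustrative.

```python
import io

import pandas as pd

# Two rows copied from the saved data, held in memory instead of on Drive.
csv_text = (
    "hostname,timestamp,request_type,path,status_code,size\n"
    "***.novo.dk,1995-07-11 08:17:09-04:00,GET,/ksc.html,200,7067\n"
    "***.novo.dk,1995-07-11 08:17:48-04:00,GET,/shuttle/missions/missions.html,200,8678\n"
)

reloaded = pd.read_csv(io.StringIO(csv_text))
print(reloaded["timestamp"].dtype)  # text column after the round trip

# utc=True normalizes the -04:00 offsets to UTC and yields a datetime column.
reloaded["timestamp"] = pd.to_datetime(reloaded["timestamp"], utc=True)
print(reloaded["timestamp"].dtype)
```

`read_csv` also infers `status_code` and `size` as integers here, so a string comparison such as `== '200'` would no longer match on the reloaded DataFrame.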

In [19]:
df

,hostname,timestamp,request_type,path,status_code,size
0,***.novo.dk,1995-07-11 08:17:09-04:00,GET,/ksc.html,200,7067
1,***.novo.dk,1995-07-11 08:17:48-04:00,GET,/shuttle/missions/missions.html,200,8678
2,***.novo.dk,1995-07-11 08:21:05-04:00,GET,/shuttle/missions/sts-35/mission-sts-35.html,200,12118
3,***.novo.dk,1995-07-11 08:21:19-04:00,GET,/shuttle/missions/sts-35/mission-sts-35.html,200,12118
4,***.novo.dk,1995-07-11 08:23:01-04:00,GET,/shuttle/resources/orbiters/columbia.html,200,6922
...,...,...,...,...,...,...
394680,zzz.pe.u-tokyo.ac.jp,1995-07-13 07:03:16-04:00,GET,/shuttle/countdown/liftoff.html,200,2602
394681,zzz.pe.u-tokyo.ac.jp,1995-07-13 07:04:06-04:00,GET,/shuttle/technology/sts-newsref/sts-lcc.html,200,32252
394682,zzz.pe.u-tokyo.ac.jp,1995-07-13 07:04:40-04:00,GET,/shuttle/missions/sts-70/movies/movies.html,200,1395
394683,zzz.pe.u-tokyo.ac.jp,1995-07-13 07:14:51-04:00,GET,/shuttle/missions/sts-70/images/images.html,200,4048
