# **NASA Data (New)**
https://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html

## Load log file

### **Subtask:**
Load the log file into a pandas DataFrame. Because the log is not a CSV (it is gzip-compressed plain text in a web-server log format), it must be read as ordinary text and then parsed line by line.

### **Reasoning:**
Read the log file line by line and store each line in a list.

In [10]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [13]:
import requests

url = "https://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz"
filename = "NASA_access_log_Jul95.gz"

print(f"Downloading {filename} ...")

try:
    # timeout avoids hanging forever if the server stops responding
    response = requests.get(url, stream=True, timeout=60)

    if response.status_code == 200:
        with open(filename, "wb") as f:
            # Stream the body to disk in chunks instead of loading it all into memory
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
        print("Download complete!")
    else:
        print(f"Download failed. Status: {response.status_code}")

except Exception as e:
    print(f"An error occurred: {e}")


Downloading NASA_access_log_Jul95.gz ...
Download complete!
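
A status code of 200 does not by itself guarantee that the saved file is a valid gzip archive (for example, a server could return an HTML page). As a minimal sketch, the helper below checks the two-byte gzip signature at the start of a file. The name `looks_like_gzip` and the temporary sample files are assumptions made for illustration; they are not part of the original notebook.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip signature."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


# Self-contained check using temporary files (hypothetical sample data)
tmp_dir = tempfile.mkdtemp()
good_path = os.path.join(tmp_dir, "sample.gz")
bad_path = os.path.join(tmp_dir, "error_page.gz")

# A real gzip file containing one log line
with gzip.open(good_path, "wt", encoding="latin-1") as f:
    f.write('199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245\n')

# A plain-text file that only pretends to be a .gz
with open(bad_path, "w") as f:
    f.write("<html>404 Not Found</html>")

good_ok = looks_like_gzip(good_path)
bad_ok = looks_like_gzip(bad_path)
print(good_ok, bad_ok)
```

In the notebook, such a check could be run on `filename` right after the download, before the file is copied to Drive or opened with `gzip.open`.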


In [14]:
import gzip

filename = '/content/drive/MyDrive/PPW/COBA/NASA_access_log_Jul95.gz'

print(f"--- Showing the first 20 lines of {filename} ---\n")

try:
    # 'rt' opens the gzip file in text mode.
    # encoding='latin-1' avoids decode errors on unusual bytes in old logs
    with gzip.open(filename, 'rt', encoding='latin-1') as f:
        for i, line in enumerate(f):
            print(f"Line {i+1}: {line.strip()}")

            # Stop after 20 lines so the output stays short
            if i >= 19:
                break

except FileNotFoundError:
    print(f"Error: File '{filename}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")

--- Showing the first 20 lines of /content/drive/MyDrive/PPW/COBA/NASA_access_log_Jul95.gz ---

Line 1: 199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
Line 2: unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
Line 3: 199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085
Line 4: burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0
Line 5: 199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179
Line 6: burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0
Line 7: burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0
Line 8: 205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.h

### **Reasoning:**
The previous step confirmed that the log file can be read as lines of text. The next step parses each line into fields, verifies how many lines match the expected format, and writes the result to a cleaned CSV file.

In [16]:
import gzip
import csv
import re
from datetime import datetime

# 1. File name configuration
input_file = '/content/drive/MyDrive/PPW/COBA/NASA_access_log_Jul95.gz'
output_file = '/content/drive/MyDrive/PPW/NASA_Jul95_cleaned.csv'

# 2. Regex pattern for the Apache/NASA log format
# Pattern: Host - - [Timestamp] "Request" Status Bytes
log_pattern = re.compile(r'^(\S+) \S+ \S+ \[(.*?)\] "(.*?)" (\d{3}) (\S+)')

def format_date(date_str):
    """
    Convert '01/Jul/1995:00:00:01 -0400'
    to the ISO-style string '1995-07-01T00:00:01Z'.

    Note: the -0400 offset is dropped, not applied, so the result is the
    server's local time even though a 'Z' suffix normally denotes UTC.
    """
    try:
        # Keep only the date-time part; the -0400 offset is ignored for simplicity
        clean_date = date_str.split(' ')[0]
        # Parse the original log format
        dt_obj = datetime.strptime(clean_date, '%d/%b/%Y:%H:%M:%S')
        # Re-emit in ISO 8601 layout
        return dt_obj.strftime('%Y-%m-%dT%H:%M:%SZ')
    except ValueError:
        # Leave unparseable timestamps unchanged
        return date_str

print("Processing data... (this may take a few seconds)")

# 3. Open the GZ input file and the CSV output file
# encoding='latin-1' is used because old logs sometimes contain unusual characters in URLs
with gzip.open(input_file, 'rt', encoding='latin-1') as f_in, \
     open(output_file, 'w', newline='', encoding='utf-8') as f_out:

    # Set up the CSV writer
    # quoting=csv.QUOTE_NONNUMERIC quotes strings but leaves numbers unquoted
    writer = csv.writer(f_out, quoting=csv.QUOTE_NONNUMERIC)

    # 4. Write the header
    header = [
        "Remote host", "Remote logname", "Remote user",
        "Request time", "Request method", "Request URI",
        "Request Protocol", "Status", "Size of response (incl. headers)"
    ]
    writer.writerow(header)

    count = 0
    errors = 0

    # 5. Loop over every log line
    for line in f_in:
        match = log_pattern.match(line)
        if match:
            host, timestamp, request_full, status, size = match.groups()

            # Process the timestamp
            time_iso = format_date(timestamp)

            # Process the request (split into method, URI, protocol)
            # Example: "GET /images/nasa-logo.gif HTTP/1.0"
            req_parts = request_full.split()
            if len(req_parts) >= 3:
                method = req_parts[0]
                uri = req_parts[1]
                protocol = req_parts[2]
            elif len(req_parts) == 2:  # Protocol is sometimes missing; HTTP/1.0 is assumed here
                method = req_parts[0]
                uri = req_parts[1]
                protocol = "HTTP/1.0"
            else:
                method = "UNKNOWN"
                uri = request_full
                protocol = "-"

            # Process the size (turn '-' into 0 and make sure it is an integer)
            if size == '-':
                size_int = 0
            else:
                try:
                    size_int = int(size)
                except ValueError:
                    # Non-numeric size: fall back to 0
                    size_int = 0

            # Make sure the status is an integer
            status_int = int(status)

            # Write the row to the CSV
            # Logname and user are written as "-" because they are empty in the NASA log
            writer.writerow([
                host, "-", "-", time_iso,
                method, uri, protocol,
                status_int, size_int
            ])

            count += 1
        else:
            errors += 1

print(f"Done! {count} lines processed successfully.")
if errors > 0:
    print(f"{errors} malformed line(s) were skipped.")
print(f"File saved as: {output_file}")

Processing data... (this may take a few seconds)
Done! 1891714 lines processed successfully.
1 malformed line(s) were skipped.
File saved as: /content/drive/MyDrive/PPW/NASA_Jul95_cleaned.csv
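
The notebook drops the `-0400` offset for simplicity, so the stored times are local server times. If true UTC timestamps are needed, one possible alternative is to parse the offset with `%z` and convert. This is a sketch of that alternative, not the method used above; the function name `to_utc_iso` is an assumption, and `%b` assumes an English locale for month names.

```python
from datetime import datetime, timezone


def to_utc_iso(date_str):
    """Parse '01/Jul/1995:00:00:01 -0400' and return the same instant in UTC."""
    # %z consumes the numeric offset, giving a timezone-aware datetime
    dt_local = datetime.strptime(date_str, "%d/%b/%Y:%H:%M:%S %z")
    # Shift to UTC so the trailing 'Z' is accurate
    return dt_local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


example = to_utc_iso("01/Jul/1995:00:00:01 -0400")
print(example)  # 1995-07-01T04:00:01Z
```

Using this conversion would shift every timestamp in the CSV by four hours relative to the values shown in this notebook, so any downstream results that depend on the hour of day would change accordingly.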


### **Parse the log data**

### **Subtask:**
With the NASA log file read and parsed into a cleaned CSV, load that CSV into a pandas DataFrame and preview its rows.


In [17]:
import pandas as pd

# 1. Load the cleaned CSV file
filename = '/content/drive/MyDrive/PPW/NASA_Jul95_cleaned.csv'
df = pd.read_csv(filename)

# 2. Show basic info (optional, to check data types)
# print(df.info())

# 3. Show the first 50 rows as a tidy table
# In a Jupyter notebook, putting 'df.head(50)' on the last line is enough
print(df.head(50).to_markdown(index=False, numalign="left", stralign="left"))

# Note: .to_markdown() requires the 'tabulate' library
# Install it first if this raises an error: pip install tabulate

| Remote host               | Remote logname   | Remote user   | Request time         | Request method   | Request URI                                           | Request Protocol   | Status   | Size of response (incl. headers)   |
|:--------------------------|:-----------------|:--------------|:---------------------|:-----------------|:------------------------------------------------------|:-------------------|:---------|:-----------------------------------|
| 199.72.81.55              | -                | -             | 1995-07-01T00:00:01Z | GET              | /history/apollo/                                      | HTTP/1.0           | 200      | 6245                               |
| unicomp6.unicomp.net      | -                | -             | 1995-07-01T00:00:06Z | GET              | /shuttle/countdown/                                   | HTTP/1.0           | 200      | 3985                               |
| 199.120.110.21            | -                | -             | 1995-07


### **Filtering:**

In [22]:
import pandas as pd
from google.colab import files

# sep=None lets the python engine detect the delimiter automatically
df = pd.read_csv(filename, sep=None, engine="python")

# Normalize column names: trim, lowercase, and replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
print("Available columns:", df.columns.tolist())

# Keep only successful GET requests for the small NASA logo
filtered = df[
    (df["request_method"] == "GET") &
    (df["request_uri"].isin(["/images/NASA-logosmall.gif"])) &
    (df["status"] == 200)
]

print(f"Number of rows after filtering: {len(filtered)}")
display(filtered.head())


Available columns: ['remote_host', 'remote_logname', 'remote_user', 'request_time', 'request_method', 'request_uri', 'request_protocol', 'status', 'size_of_response_(incl._headers)']
Number of rows after filtering: 89972


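|    | remote_host           | remote_logname | remote_user | request_time         | request_method | request_uri                | request_protocol | status | size_of_response_(incl._headers) |
|:---|:----------------------|:---------------|:------------|:---------------------|:---------------|:---------------------------|:-----------------|:-------|:---------------------------------|
| 11 | unicomp6.unicomp.net  | -              | -           | 1995-07-01T00:00:14Z | GET            | /images/NASA-logosmall.gif | HTTP/1.0         | 200    | 786                              |
| 14 | d104.aa.net           | -              | -           | 1995-07-01T00:00:15Z | GET            | /images/NASA-logosmall.gif | HTTP/1.0         | 200    | 786                              |
| 29 | 205.189.154.54        | -              | -           | 1995-07-01T00:00:40Z | GET            | /images/NASA-logosmall.gif | HTTP/1.0         | 200    | 786                              |
| 31 | ppp-mia-30.shadow.net | -              | -           | 1995-07-01T00:00:41Z | GET            | /images/NASA-logosmall.gif | HTTP/1.0         | 200    | 786                              |
| 67 | link097.txdirect.net  | -              | -           | 1995-07-01T00:01:31Z | GET            | /images/NASA-logosmall.gif | HTTP/1.0         | 200    | 786                              |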
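
The filter works by combining three boolean masks with `&`, so a row is kept only when all three conditions hold. The sketch below shows the same idea on a tiny in-memory DataFrame. The four sample rows are made up for illustration (they are not taken from the log), and `==` is used in place of `.isin([...])` because only one URI is being matched.

```python
import pandas as pd

# Hypothetical sample rows, using the notebook's normalized column names
sample = pd.DataFrame({
    "request_method": ["GET", "GET", "GET", "HEAD"],
    "request_uri": [
        "/images/NASA-logosmall.gif",
        "/images/NASA-logosmall.gif",
        "/history/apollo/",
        "/images/NASA-logosmall.gif",
    ],
    "status": [200, 304, 200, 200],
})

# Each comparison yields a boolean Series; & keeps rows where all are True.
# The parentheses are required because & binds tighter than ==.
mask = (
    (sample["request_method"] == "GET")
    & (sample["request_uri"] == "/images/NASA-logosmall.gif")
    & (sample["status"] == 200)
)

matches = sample[mask]
print(len(matches))
print(matches.index.tolist())
```

Only the first row passes all three conditions: the second has status 304, the third requests a different URI, and the fourth uses HEAD.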


In [24]:
# Save the filtered result to Drive and download it
output_file = "/content/drive/MyDrive/PPW/COBA/filtered_log_nasa.csv"
filtered.to_csv(output_file, index=False)
files.download(output_file)



### **Duplicate Check:**

In [25]:

df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# === Count how many requests each remote host (IP or hostname) made ===
dupe_count = df["remote_host"].value_counts()
print("Number of requests per host:")
display(dupe_count.head(10))  # show the top 10 hosts

# === Keep only hosts that appear more than once ===
dupe_ips = dupe_count[dupe_count > 1]
print(f"Number of hosts appearing more than once: {len(dupe_ips)}")
display(dupe_ips)

Number of requests per host:


| remote_host          | count |
|:---------------------|:------|
| piweba3y.prodigy.com | 17572 |
| piweba4y.prodigy.com | 11591 |
| piweba1y.prodigy.com | 9868  |
| alyssa.prodigy.com   | 7852  |
| siltb10.orl.mmc.com  | 7573  |
| piweba2y.prodigy.com | 5922  |
| edams.ksc.nasa.gov   | 5434  |
| 163.206.89.4         | 4906  |
| news.ti.com          | 4863  |
| disarray.demon.co.uk | 4353  |


Number of hosts appearing more than once: 76282


| remote_host                   | count |
|:------------------------------|:------|
| piweba3y.prodigy.com          | 17572 |
| piweba4y.prodigy.com          | 11591 |
| piweba1y.prodigy.com          | 9868  |
| alyssa.prodigy.com            | 7852  |
| siltb10.orl.mmc.com           | 7573  |
| ...                           | ...   |
| ppp04.monet.no                | 2     |
| slip005.hol.nl                | 2     |
| frl190.frl.orst.edu           | 2     |
| line0a.kemp-du.pavilion.co.uk | 2     |
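
Note that `value_counts()` on `remote_host` counts requests per host, so a count above 1 means a host made several requests, not that any log row is an exact duplicate. The stdlib sketch below illustrates the difference on a few hand-written rows. The third row is a hypothetical exact repeat added for illustration, and the variable names are assumptions.

```python
from collections import Counter

# (host, time, uri) tuples; the third row is a made-up exact repeat of the second
rows = [
    ("burger.letters.com", "1995-07-01T00:00:11Z", "/shuttle/countdown/liftoff.html"),
    ("burger.letters.com", "1995-07-01T00:00:12Z", "/images/NASA-logosmall.gif"),
    ("burger.letters.com", "1995-07-01T00:00:12Z", "/images/NASA-logosmall.gif"),
    ("199.72.81.55", "1995-07-01T00:00:01Z", "/history/apollo/"),
]

# Requests per host: the analogue of df["remote_host"].value_counts()
requests_per_host = Counter(host for host, _, _ in rows)
repeat_hosts = {h: n for h, n in requests_per_host.items() if n > 1}

# Exact duplicate rows: every field must be identical
row_counts = Counter(rows)
duplicate_rows = [row for row, n in row_counts.items() if n > 1]

print(repeat_hosts)
print(duplicate_rows)
```

In pandas, the second kind of check corresponds to `df.duplicated()`, which flags rows whose values all match an earlier row.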


In [26]:
# === Convert the time column to datetime so it can be sorted ===
df["request_time"] = pd.to_datetime(df["request_time"], errors="coerce")

# === Sort by host and time ===
df_sorted = df.sort_values(by=["remote_host", "request_time"]).reset_index(drop=True)

# === Filter by the same criteria as before ===
filtered = df_sorted[
    (df_sorted["request_method"] == "GET") &
    (df_sorted["request_uri"].isin(["/images/NASA-logosmall.gif"])) &
    (df_sorted["status"] == 200)
]

# === Sort the final result by remote host ===
# kind="stable" preserves the time order within each host
filtered = filtered.sort_values(by="remote_host", kind="stable").reset_index(drop=True)

# === Show the result ===
print(f"Number of rows after filtering: {len(filtered)}")
display(filtered.head(20))  # show the first 20 rows

# === Save the result to a new file (download is optional) ===
output_file = "/content/drive/MyDrive/PPW/COBA/filtered_log_sorted_by_ip.csv"
filtered.to_csv(output_file, index=False)
#files.download(output_file)


Number of rows after filtering: 89972


|    | remote_host                     | remote_logname | remote_user | request_time              | request_method | request_uri                | request_protocol | status | size_of_response_(incl._headers) |
|:---|:--------------------------------|:---------------|:------------|:--------------------------|:---------------|:---------------------------|:-----------------|:-------|:---------------------------------|
| 0  | ***.novo.dk                     | -              | -           | 1995-07-11 08:17:34+00:00 | GET            | /images/NASA-logosmall.gif | HTTP/1.0         | 200    | 786                              |
| 1  | ***.novo.dk                     | -              | -           | 1995-07-11 08:19:17+00:00 | GET            | /images/NASA-logosmall.gif | HTTP/1.0         | 200    | 786                              |
| 2  | 007.thegap.com                  | -              | -           | 1995-07-06 17:22:46+00:00 | GET            | /images/NASA-logosmall.gif | HTTP/1.0         | 200    | 786                              |
| 3  | 007.thegap.com                  | -              | -           | 1995-07-06 19:22:12+00:00 | GET            | /images/NASA-logosmall.gif | HTTP/1.0         | 200    | 786                              |
| 4  | 007.thegap.com                  | -              | -           | 1995-07-23 16:40:57+00:00 | GET            | /images/NASA-logosmall.gif | HTTP/1.0         | 200    | 786                              |
| 5  | 01-dynamic-c.wokingham.luna.net | -              | -           | 1995-07-10 08:17:52+00:00 | GET            | /images/NASA-logosmall.gif | HTTP/1.0         | 200    | 786                              |
| 6  | 02-dynamic-c.wokingham.luna.net | -              | -           | 1995-07-07 07:58:04+00:00 | GET            | /images/NASA-logosmall.gif | HTTP/1.0         | 200    | 786                              |
| 7  | 03-dynamic-c.wokingham.luna.net | -              | -           | 1995-07-04 12:35:33+00:00 | GET            | /images/NASA-logosmall.gif | HTTP/1.0         | 200    | 786                              |
| 8  | 04-dynamic-c.rotterdam.luna.net | -              | -           | 1995-07-03 21:58:40+00:00 | GET            | /images/NASA-logosmall.gif | HTTP/1.0         | 200    | 786                              |
| 9  | 04-dynamic-c.wokingham.luna.net | -              | -           | 1995-07-04 16:22:43+00:00 | GET            | /images/NASA-logosmall.gif | HTTP/1.0         | 200    | 786                              |
