## Python and Scapy

This notebook makes use of the Python Scapy library.

You are *not* required to learn/know Python.

You are *not* required to learn/know Scapy.

The only purpose of this notebook is to try to make the topics of the first lessons more concrete, in the hope of facilitating the understanding of the rest of the course.

The Python code below has this structure:
1.   Load into memory a file containing network traffic. How that traffic was collected and saved to a file is irrelevant here; this topic will be briefly discussed in the second part of the course.
2.   Display the content of selected messages in the file.
3.   Highlight in that content some of the concepts illustrated in the first lessons.



## Frames (Packets)

Network traffic consists of a sequence of messages called *frames* (or *packets*, although they are not the same thing). You will understand what frames are in the second part of the course. For the moment you may think of them as an implementation detail that is *totally unknown and irrelevant* to applications, similar to electric power, transistors, electronic boards, voltage levels and the like.

Applications exchange messages between themselves. How these messages are actually implemented is completely irrelevant to them.



## Preparation and download of files with network traffic

Execute the following two code cells to install and configure the necessary software.


In [None]:
!pip install --pre scapy[basic]

In [None]:
from scapy.all import *

We use files containing network traffic and made freely available by [Chris Sanders](https://github.com/chrissanders/packets).

To download the necessary files in this Linux virtual machine execute the following cell:


In [None]:
!curl https://raw.githubusercontent.com/chrissanders/packets/master/dns_recursivequery_client.pcapng -o dns_recursivequery_client.pcapng
!curl https://raw.githubusercontent.com/chrissanders/packets/master/mail_sender_client_1.pcapng -o mail_sender_client_1.pcapng
!curl https://raw.githubusercontent.com/chrissanders/packets/master/http_google.pcapng -o http_google.pcapng

If you want to analyze other files that contain network traffic, you can upload them to the virtual machine like any other file (left section, click on the folder symbol, click on the upload symbol).


# A binary protocol: DNS (translating names to IP addresses)

Load network traffic into memory from a pcap file.

We use Python code that reads an entire pcap file into memory, but there are more efficient ways of processing pcap files.

In [None]:
packets = rdpcap('dns_recursivequery_client.pcapng')

This easy-to-understand loop iterates through all frames that transport DNS traffic and displays their content. We print a separation line between different frames for clarity.

Run this cell and let's analyze the format of the output (a format that is very common).


In [None]:
for pkt in packets:
    if DNS in pkt:
        print('----------------------')
        hexdump(pkt[DNS])

The output format consists of three vertical parts.
*   The *central* part shows the content of *each byte* (in this case, each byte of the DNS message). Each byte is represented with two base 16 digits (the reason is explained below). Each row displays 16 bytes.
*   The *left* part indicates the *offset* of the first byte of each row from the beginning of the message, again in base 16: 0 for the first row, 16 (0x10) for the second row, 32 (0x20) for the third row and so on.
*   The *right* part displays the content of *each byte in ASCII* (the natural number represented by the byte is used as an ASCII code, and the character associated with that code in the ASCII standard is displayed). There are thus 16 characters for each row, one for each byte. When the value of a byte does not correspond to any printable ASCII character, a '.' is shown instead.
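The three-part layout described above can be reproduced with a few lines of plain Python. This is only a minimal sketch for illustration, not Scapy's actual implementation: the function name `simple_hexdump` and the sample bytes are made up for this example and are not taken from the capture file.

```python
def simple_hexdump(data: bytes, width: int = 16) -> str:
    """Return a three-column dump: offset, hex bytes, ASCII rendering."""
    hex_width = width * 3 - 1  # two digits per byte plus separating spaces
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        # Central part: each byte as two base 16 digits.
        hex_part = ' '.join(f'{byte:02X}' for byte in chunk)
        # Right part: printable ASCII characters, '.' for everything else.
        ascii_part = ''.join(chr(byte) if 32 <= byte < 127 else '.' for byte in chunk)
        # Left part: offset of the first byte of the row, in base 16.
        rows.append(f'{offset:04X}  {hex_part:<{hex_width}}  {ascii_part}')
    return '\n'.join(rows)


# Hypothetical sample bytes (NOT a real DNS message from the capture).
sample = b'\x12\x34\x01\x00\x00\x01www.example.com\x00'
dump = simple_hexdump(sample)
print(dump)
```

The sample is 22 bytes long, so the output has two rows: the first starts at offset 0000 and the second at offset 0010.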

A key point to observe here is that most bytes do not correspond to any printable ASCII character. The reason is that the DNS protocol has been specified in such a way that each single bit has a specified meaning and such meaning is not encoded in ASCII. It is for this reason that DNS is called a *binary protocol*.

## Why bytes in base 16?

A byte is a sequence of 8 bits. Thus a byte can represent all natural numbers between 0 and 2^8-1=255.

With two base 16 digits you can represent all natural numbers between 0 and 16^2-1=255 (much like with 2 base 10 digits you can represent all natural numbers between 0 and 10^2-1=99). This is exactly the interval of natural numbers that can be represented by a byte.

Thus you can represent the content of a byte by means of two base 16 digits.

Base 16 digits are: 0,1...9,A,B,C,D,E,F.
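The correspondence between a byte and two base 16 digits can be checked directly in Python. This is a small sketch using only built-in functions; the variable names are chosen for this example.

```python
# Largest value of a byte, and largest value of two base 16 digits.
largest_byte = 2 ** 8 - 1
largest_two_hex_digits = int('FF', 16)
print(largest_byte, largest_two_hex_digits)

# A few byte values shown in base 10, base 16 (two digits) and base 2 (eight bits).
table = []
for value in (0, 9, 10, 15, 16, 255):
    row = (value, format(value, '02X'), format(value, '08b'))
    table.append(row)
    print(*row)
```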



## Understanding DNS messages (more or less...)

The code below displays the content of the frames above in a human-readable way. Function 'show()' of the Scapy library:
* analyzes all the bits of the frame,
* associates each slice of the frame with the corresponding meaning specified in the DNS protocol,
* prints a human-readable summary of such a meaning.

You will see that there are many complex pieces of information spread across the bytes. You are not required to understand them, just notice that such information is not encoded in ASCII.



In [None]:
for pkt in packets:
    if DNS in pkt:
        print('----------------------')
        pkt[DNS].show() # display in a different format

The human-readable summary printed by Scapy could actually be improved.

The first frame is a DNS request sent by the client to the server while the second frame is a DNS response sent by the server to the client. Yet you can see in the output above that the 'opcode' of both frames is 'QUERY'.

Whether a frame is a request or a response is encoded in the 'qr' field: 0 in the first frame (meaning it is a request), 1 in the second frame (meaning it is a response).

If you want to know where 'qr' and 'opcode' are placed in the DNS messages, look at their hexdump again:

In [None]:
for pkt in packets:
    if DNS in pkt:
        print('----------------------')
        hexdump(pkt[DNS])

In the 3rd byte:

*   bit 7 (the most significant bit) is 'qr';
*   bits 6-3 are the 'opcode'; thus there can be 16 different opcodes; '0000' is specified in the DNS protocol as the encoding for a so-called 'standard query';
*   bits 2-0 have other meanings that we neglect.

If you convert the 3rd byte to binary, its five most significant bits ('qr' followed by the 'opcode') are '00000' in the first frame and '10000' in the second frame, as described above.
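The bit layout of the 3rd byte can be made concrete with shifts and masks. The sketch below is only an illustration: the function name `decode_flags_byte` is made up, and the byte values 0x01 and 0x81 are illustrative examples of a request and a response, not values read from the capture file.

```python
def decode_flags_byte(byte_value: int) -> dict:
    """Split the 3rd byte of a DNS message into qr, opcode and the 3 low bits."""
    qr = (byte_value >> 7) & 0b1          # bit 7
    opcode = (byte_value >> 3) & 0b1111   # bits 6-3
    low_bits = byte_value & 0b111         # bits 2-0 (neglected here)
    return {'qr': qr, 'opcode': opcode, 'low_bits': low_bits}


example_request = 0x01   # binary 00000001 (assumed value, for illustration)
example_response = 0x81  # binary 10000001 (assumed value, for illustration)

for byte_value in (example_request, example_response):
    print(format(byte_value, '08b'), decode_flags_byte(byte_value))
```

Both example bytes have opcode 0 (a standard query); they differ only in 'qr'.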

Once again: you are not supposed to understand all these details. We mention them only for clarifying what a "binary" protocol is.

# A text protocol: SMTP (sending email)

Load network traffic into memory from a pcap file.

In [None]:
packets = rdpcap('mail_sender_client_1.pcapng')

This easy-to-understand loop selects frames that contain TCP traffic where the destination port number is 25. If you do not know yet what a port is please do not worry; you will learn it in the next lessons. For the moment just assume that this is the traffic containing data from the client application to the server application.

Then, it shows the "TCP payload", that is, the data that a client application has sent to a server application or the data sent the other way around.

In [None]:
for pkt in packets:
    if TCP in pkt:
        if pkt[TCP].dport == 25:
            print('----------------------')
            hexdump(pkt[TCP].payload)

In this case you can see that the ASCII representation of nearly all bytes seems to make sense, unlike what happened in the DNS traffic.

Let's try to print the TCP payload as if it were ASCII and see what happens:

In [None]:
for pkt in packets:
    if TCP in pkt:
        if pkt[TCP].dport == 25:
            if len(bytes(pkt[TCP].payload)) != 0:  # do not print empty payloads
                print(bytes(pkt[TCP].payload))

You can see that every TCP payload is indeed expressed in ASCII. The meaning of 'MAIL FROM', 'RCPT TO' and so on is defined in the specification of the SMTP protocol (that we will study later in this course).

This protocol is called a *text* protocol because its messages are encoded as lines of characters (characters '\r\n' are the 2 ASCII characters that terminate a line).

The longest string (the one that stretches beyond the right of your screen) is the email message sent by the client to the server; all the other strings are SMTP messages for coordinating their interaction.

Character *b* is not part of the network traffic: it is printed by Python to clarify that it is printing a sequence of bytes. The same applies to the quotes at the beginning and at the end of each payload.
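The difference between a sequence of bytes and the text it encodes can be seen with a short experiment. The payload below is a made-up example in the style of SMTP (the addresses are hypothetical and not taken from the capture file).

```python
# Hypothetical payload: two SMTP-style lines, each terminated by '\r\n'.
payload = b'MAIL FROM:<alice@example.com>\r\nRCPT TO:<bob@example.com>\r\n'

# Printing a bytes object shows the b prefix and the quotes.
print(payload)

# Decoding the bytes as ASCII gives text; splitlines() removes the '\r\n' terminators.
lines = payload.decode('ascii').splitlines()
for line in lines:
    print(line)
```

After decoding, neither the *b* nor the quotes appear, because they belong to the way Python displays bytes and not to the data.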



Above we have seen the SMTP data sent by the client application to the server application. The code below displays data in both directions, i.e., it shows the full conversation between SMTP client and SMTP server.

In [None]:
for pkt in packets:
    if TCP in pkt:
        if len(bytes(pkt[TCP].payload)) != 0:  # do not print empty payloads
            if pkt[TCP].dport == 25:
                print("CLIENT TO SERVER")
            if pkt[TCP].sport == 25:
                print("SERVER TO CLIENT")
            print(bytes(pkt[TCP].payload))

Unfortunately, Scapy prints a somewhat confusing description of SMTP messages from the server to the client: at the end of server messages it displays several bytes, indicated as "padding", that actually are *not* present in the SMTP traffic.

Try to imagine that those bytes do not exist (we could write some Python code that eliminates those bytes before printing but the code would be a bit tricky).

# Another text protocol: HTTP (browsing the web)

In [None]:
packets = rdpcap('http_google.pcapng')

Another text protocol is HTTP.

Below we print all the payloads as ASCII characters (obviously you could print them as a hexdump, as we did above).

In [None]:
for pkt in packets:
    if TCP in pkt:
        if len(bytes(pkt[TCP].payload)) != 0:  # do not print empty payloads
            if pkt[TCP].dport == 80:
                print("CLIENT TO SERVER")
            if pkt[TCP].sport == 80:
                print("SERVER TO CLIENT")
            print(bytes(pkt[TCP].payload))

You can see that all messages are "long and complex" but seem to make sense. The meaning of '*GET*', '*User-Agent*', '*Expires*' and so on is defined in the specification of the HTTP protocol (that we will study later in this course).

Let us neglect the first message from server to client. You can see that the second message from server to client "makes sense" until "*X-XSS-Protection: 0*". After that there is a sequence of bytes that do not make sense in ASCII, and the sequence continues in the following messages.

The reason is that the message from server to client transports a file that was requested by the client; this file follows "*X-XSS-Protection: 0\r\n\r\n*" and is *compressed* (as described by "*Content-Encoding: gzip*" at an earlier point of the response).

Notice that there are two pairs "*\r\n*" after "*X-XSS-Protection: 0*": the first terminates the line, the second one is an empty line.
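The role of the empty line can be illustrated on a small synthetic response. The sketch below does not use the capture file: it builds a made-up response with Python's standard `gzip` module, splits it at the first empty line, and decompresses what follows. In real traffic the compressed file usually spans several TCP payloads and may use additional encodings, so this is a simplification.

```python
import gzip

# Made-up content, compressed the way "Content-Encoding: gzip" announces.
body_plain = b'<html><body>Hello</body></html>'
body_compressed = gzip.compress(body_plain)

# Synthetic response: header lines, an empty line, then the compressed bytes.
response = (
    b'HTTP/1.1 200 OK\r\n'
    b'Content-Encoding: gzip\r\n'
    b'X-XSS-Protection: 0\r\n'
    b'\r\n'
) + body_compressed

# The first '\r\n\r\n' marks the end of the text part of the message.
head, separator, body = response.partition(b'\r\n\r\n')
header_lines = head.decode('ascii').split('\r\n')
recovered = gzip.decompress(body)

print(header_lines)
print(body[:10])   # compressed bytes: not meaningful as ASCII
print(recovered)   # the original content after decompression
```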

You will learn the meaning of (some of) these details later in this course.

