<a href="https://colab.research.google.com/github/gitmystuff/INFO5737/blob/main/Anomaly_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Anomaly Detection

Your Name

## Getting Started

* Colab - get notebook from gitmystuff INFO5737 repository
* Save a Copy in Drive
* Remove Copy of
* Edit name
* Take attendance
* Clean up Colab Notebooks folder
* Submit shared link

## What is Anomaly Detection?

Anomaly detection, also known as outlier detection, is the process of identifying data points, events, or observations that deviate significantly from the normal or expected patterns within a dataset. These anomalies are often rare and can indicate unusual behavior, errors, fraud, or other significant events.

Here's a breakdown of the key aspects of anomaly detection:

**Core Idea:** To find the "unusual" in data by comparing it to what is considered "normal."

**Why is it Important?**

* **Identifying Problems:** Anomalies can signal critical issues like system failures, security breaches, or fraudulent activities.
* **Preventing Losses:** Early detection of anomalies can help organizations mitigate potential financial losses or damages.
* **Improving Efficiency:** Identifying and addressing anomalies in processes can lead to optimization and better performance.
* **Ensuring Data Quality:** Anomalies can highlight errors or inconsistencies in data collection or processing.

**Types of Anomalies:**

Anomalies can be broadly categorized into three main types:

* **Point Anomalies (Outliers):** Individual data points that are significantly different from the rest of the data. For example, a single unusually high transaction in a series of credit card purchases.
* **Contextual Anomalies (Conditional Outliers):** Data points that are unusual within a specific context but might be normal in another. For example, a very high temperature reading for a city in winter would be anomalous, but the same reading in summer might be normal. These are often found in time-series data.
* **Collective Anomalies (Group Anomalies):** A group of related data points that, when considered together, deviate from the expected pattern, even if individual points within the group might not be anomalous on their own. For example, a series of small, frequent transactions from the same account to multiple new accounts could indicate fraudulent activity, even if each transaction individually is not large enough to be flagged.
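
The difference between a point anomaly and a contextual anomaly can be sketched in a few lines. This is a minimal illustration, not a production detector: the readings, the `contextual_outliers` name, and the threshold of 2.0 are all assumptions made for the example. Each value is scored only against other values from the same context (here, the season).

```python
from collections import defaultdict
import statistics

def contextual_outliers(readings, threshold=2.0):
    """Flag (context, value) pairs that are unusual within their own context."""
    by_context = defaultdict(list)
    for context, value in readings:
        by_context[context].append(value)

    flagged = []
    for context, value in readings:
        group = by_context[context]
        mean = statistics.fmean(group)
        stdev = statistics.pstdev(group)
        # Score each value against its own context only
        if stdev > 0 and abs(value - mean) / stdev > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical daily temperatures in degrees Celsius
winter = [("winter", t) for t in [2, 4, 3, 5, 1, 3, 4, 30]]
summer = [("summer", t) for t in [29, 31, 30, 28, 32, 30, 31, 29]]

print(contextual_outliers(winter + summer))  # [('winter', 30)]
```

The same 30-degree reading appears in both seasons, but only the winter one is flagged.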

**Approaches to Anomaly Detection:**

Anomaly detection techniques can be broadly classified based on the availability of labeled data:

* **Supervised Anomaly Detection:** Requires a labeled dataset containing both normal and anomalous instances. Classification algorithms are trained to distinguish between the two classes. This approach is often challenging because labeled anomaly data is typically scarce and imbalanced.
* **Semi-Supervised Anomaly Detection:** Uses a dataset containing only normal instances to train a model. The model then identifies instances that do not fit the learned normal pattern as anomalies.
* **Unsupervised Anomaly Detection:** Does not rely on any labeled data. It assumes that normal data points are far more frequent than anomalies and tries to find patterns or instances that deviate from the majority of the data. This is the most common approach as labeled anomaly data is often unavailable.

**Common Unsupervised Anomaly Detection Algorithms:**

* **Statistical Methods:** These methods assume a distribution for the normal data and identify data points that fall outside a defined statistical range (e.g., based on mean and standard deviation, or percentiles).
* **Clustering-Based Methods:** These methods group similar data points into clusters (e.g., K-Means). Anomalies are often points that do not belong to any cluster, lie far from their cluster's center, or belong to small, isolated clusters.
* **Density-Based Methods:** These methods estimate the density of data points (e.g., Local Outlier Factor (LOF), DBSCAN). Anomalies are points in low-density regions.
* **Distance-Based Methods:** These methods calculate the distance between data points (e.g., k-Nearest Neighbors (kNN)). Anomalies are points that are far from their nearest neighbors.
* **Tree-Based Methods:** These methods build randomized trees that recursively partition the data (e.g., Isolation Forest, Robust Random Cut Forest). Anomalies tend to be isolated in fewer splits than normal points.
* **Autoencoders (Neural Networks):** These neural networks are trained to reconstruct normal data. Anomalies, being different, result in higher reconstruction errors.
* **Principal Component Analysis (PCA):** This dimensionality reduction technique can identify anomalies as data points that have a large reconstruction error when projected onto the principal components of the normal data.
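
As a rough sketch of the statistical approach in the list above, the snippet below flags values whose z-score (distance from the mean in standard deviations) exceeds a threshold. The traffic figures, the `zscore_outliers` name, and the threshold of 2.5 are assumptions chosen for illustration.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return (index, value) pairs whose absolute z-score exceeds threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mean) / stdev > threshold
    ]

# Hypothetical kilobytes transferred per minute by one host
traffic = [510, 495, 502, 488, 520, 505, 499, 4800, 507, 493]

print(zscore_outliers(traffic, threshold=2.5))  # [(7, 4800)]
```

One caveat with small samples: using the population standard deviation, no value in a sample of n points can have a z-score above sqrt(n - 1), so the common threshold of 3 would flag nothing in this 10-point example. Thresholds need to be chosen with the sample size in mind.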

**Applications of Anomaly Detection:**

Anomaly detection is used in a wide range of domains, including:

* **Cybersecurity:** Intrusion detection, fraud detection, malware analysis, identifying unusual user behavior.
* **Finance:** Credit card fraud detection, detecting unusual trading activities.
* **Manufacturing:** Quality control, predictive maintenance of equipment.
* **Healthcare:** Detecting abnormal patient conditions, identifying anomalies in medical images.
* **Retail and E-commerce:** Identifying fraudulent transactions, detecting unusual purchasing patterns, supply chain optimization.
* **IoT and Sensor Data:** Monitoring sensor readings for unusual patterns indicating failures or anomalies.
* **Environmental Monitoring:** Detecting unusual climate patterns or pollution levels.

* https://www.sans.org/cyber-security-courses/applied-data-science-machine-learning/
* https://github.com/SecurityNik/Data-Science-and-ML/blob/main/Beginning%20Machine%20and%20Deep%20Learning%20with%20Zeek%20logs/08%20-%20beginning%20Machine%20Learning%20Anomaly%20Detection%20-%20isolation%20forest%20and%20local%20outlier%20factor.ipynb

## Network Traffic

Network traffic refers to the **flow of data across a computer network** at any given point in time. Think of it like cars on a highway – the more cars moving, the higher the traffic. In computer networks, this "traffic" consists of **data packets**, which are small segments of data that contain the information being transmitted, along with addressing and control information.

Here's a breakdown of what that means:

* **Data:** This is the actual information being exchanged, such as emails, web pages, videos, files, voice conversations, and commands.
* **Packets:** Data is broken down into these smaller units for efficient transmission across the network. Each packet has a header containing source and destination addresses, protocol information, and sequencing details, and a payload containing a piece of the actual data.
* **Flow:** The movement of these packets from a source device to a destination device across the network infrastructure (cables, routers, switches, wireless signals, etc.).
* **Time:** Network traffic is a dynamic measure, constantly changing based on network activity.

**Key Aspects of Network Traffic:**

* **Volume:** The amount of data transferred over a period of time, usually measured in bytes or its multiples (KB, MB, GB).
* **Rate:** The speed at which data is being transferred, usually measured in bits per second (bps) or its multiples (Kbps, Mbps, Gbps).
* **Types:** Different kinds of data generate different types of traffic (e.g., web browsing (HTTP/HTTPS), email (SMTP/POP3/IMAP), file transfer (FTP/SFTP), streaming (RTP), etc.).
* **Direction:** Traffic can be inbound (coming into your network) or outbound (leaving your network). It can also be north-south (entering or leaving a data center, such as client-to-server traffic) or east-west (between servers within a data center).
* **Protocols:** Network traffic relies on various communication protocols (like TCP/IP, UDP, DNS, HTTP) that define the rules for how data is formatted, transmitted, and received.
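
To make the relationship between the amount of data and the transfer rate concrete, here is a small arithmetic sketch. The function names and the sample figures (150 MB observed over 60 seconds) are assumptions for illustration only.

```python
def throughput_bps(num_bytes: int, seconds: float) -> float:
    """Convert a byte count observed over `seconds` into bits per second."""
    return num_bytes * 8 / seconds

def humanize(bps: float) -> str:
    """Format a bit rate using decimal multiples (Kbps, Mbps, Gbps)."""
    for unit, factor in (("Gbps", 1e9), ("Mbps", 1e6), ("Kbps", 1e3)):
        if bps >= factor:
            return f"{bps / factor:.2f} {unit}"
    return f"{bps:.2f} bps"

sample_bytes = 150_000_000   # hypothetical: 150 MB seen on a link
window_seconds = 60          # hypothetical observation window

rate = throughput_bps(sample_bytes, window_seconds)
print(humanize(rate))  # 20.00 Mbps
```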

**Why is Understanding Network Traffic Important?**

* **Network Performance:** High traffic can lead to congestion, causing slowdowns and impacting user experience. Monitoring traffic helps in capacity planning and identifying bottlenecks.
* **Network Security:** Analyzing traffic patterns is crucial for detecting anomalies that might indicate security threats like intrusions, malware activity, or data exfiltration.
* **Troubleshooting:** Understanding traffic flow can help diagnose network connectivity issues and identify the source of problems.
* **Quality of Service (QoS):** Network traffic analysis allows for prioritizing certain types of traffic (e.g., voice or video) to ensure a better user experience for real-time applications.
* **Cost Management:** Monitoring bandwidth usage can help organizations manage internet costs and identify potential overages.

## TCP/IP Review

### OSI Framework

The Open Systems Interconnection (OSI) framework is a conceptual model created by the International Organization for Standardization (ISO) in 1984. It provides a standardized way to understand how different hardware and software components in a network communicate. The OSI model consists of seven distinct layers, each with specific responsibilities that ensure seamless data transmission across diverse network technologies.

Here are the seven layers of the OSI framework, starting from the top:

**1. Application Layer (Layer 7):**
This is the layer that directly interacts with end-user applications. It provides the interface between the application and the network, enabling users to access network services. Examples of protocols at this layer include HTTP (for web browsing), SMTP (for email), and FTP (for file transfer).

**2. Presentation Layer (Layer 6):**
The Presentation Layer is responsible for the formatting, encryption, and compression of data. It ensures that the information sent by one application is readable by another application on a different system, handling differences in data representation.

**3. Session Layer (Layer 5):**
This layer manages the establishment, maintenance, and termination of connections (sessions) between applications. It handles authentication and authorization and ensures that data streams are properly managed and synchronized.

**4. Transport Layer (Layer 4):**
The Transport Layer provides reliable or unreliable end-to-end delivery of data between hosts. Key protocols at this layer include TCP (Transmission Control Protocol), which offers reliable, connection-oriented communication, and UDP (User Datagram Protocol), which provides faster, connectionless communication. This layer handles segmentation, reassembly, flow control, and error control.

**5. Network Layer (Layer 3):**
The Network Layer is responsible for addressing and routing data packets across networks. It determines the best path for data to travel from source to destination using logical addresses (IP addresses). Routers operate at this layer.

**6. Data Link Layer (Layer 2):**
This layer handles the transfer of data between two directly connected nodes on a local network segment. It deals with physical addressing (MAC addresses), framing of data, and error detection within the local link. Protocols like Ethernet and Wi-Fi operate at this layer.

**7. Physical Layer (Layer 1):**
The Physical Layer is the lowest layer and deals with the physical transmission of raw bit streams over a physical medium (cables, wireless signals, etc.). It defines the electrical, mechanical, procedural, and functional specifications for activating, maintaining, and deactivating the physical link.

In the OSI model, data moves down through the layers on the sending device, with each layer adding its header information (encapsulation). Once the data reaches the receiving device, it moves up through the layers, with each layer interpreting and removing its corresponding header (decapsulation) until it reaches the Application Layer.

The OSI framework is a crucial tool for understanding network communication, troubleshooting issues, and developing network technologies. By dividing the complex process of networking into these seven distinct layers, it provides a clear and standardized way to analyze and manage network operations.

### TCP/IP 5-Layer Framework

The TCP/IP 5-layer framework is a conceptual model that organizes the suite of communication protocols used for the internet and most modern networks into five distinct layers. This layering helps to understand the different levels of abstraction involved in network communication.

Here's an overview of each layer:

**1. Application Layer:**

This is the top layer of the TCP/IP model, closest to the end-user and the applications they interact with. It provides the interface for these applications to utilize network services. The Application Layer defines the protocols that applications use to exchange data. Examples of protocols at this layer include:

* **HTTP (Hypertext Transfer Protocol):** Used for web browsing.
* **HTTPS (HTTP Secure):** Secure version of HTTP.
* **FTP (File Transfer Protocol):** Used for transferring files.
* **SFTP (SSH File Transfer Protocol):** Secure file transfer.
* **SMTP (Simple Mail Transfer Protocol):** Used for sending email.
* **POP3 (Post Office Protocol version 3):** Used for retrieving email.
* **IMAP (Internet Message Access Protocol):** Another protocol for retrieving email.
* **DNS (Domain Name System):** Translates domain names to IP addresses.
* **SSH (Secure Shell):** Provides secure remote access to systems.
* **Telnet:** Provides unencrypted (insecure) remote access to systems.
* **SNMP (Simple Network Management Protocol):** Used for network management.

The data unit at this layer is simply referred to as **data**.

**2. Transport Layer:**

The Transport Layer is responsible for providing end-to-end communication between applications running on different hosts. It manages the reliable or unreliable delivery of data segments. The key protocols at this layer are:

* **TCP (Transmission Control Protocol):** Offers a connection-oriented, reliable, and ordered delivery of data. It handles segmentation of data from the Application Layer, reassembly at the destination, flow control to prevent overwhelming the receiver, and error recovery through retransmission.
* **UDP (User Datagram Protocol):** Provides a connectionless and unreliable delivery of data. It offers faster communication with minimal overhead, suitable for applications where speed is more critical than guaranteed delivery (e.g., streaming, online gaming).

The data units at this layer are **segments** (for TCP) and **datagrams** (for UDP).

**3. Network Layer (or Internet Layer):**

This layer is responsible for addressing and routing data packets across the network. Its primary function is to enable communication between different networks. The main protocols at this layer are:

* **IP (Internet Protocol):** Provides logical addressing (IP addresses) for hosts and handles the routing of packets across networks. It is an unreliable, connectionless protocol, meaning it doesn't guarantee delivery or the order of packets.
* **ICMP (Internet Control Message Protocol):** Used for error reporting and network diagnostic functions (e.g., ping, traceroute).
* **ARP (Address Resolution Protocol):** Used to map IP addresses to physical MAC addresses within a local network segment.

The data unit at this layer is called a **packet**.

**4. Data Link Layer:**

The Data Link Layer focuses on the reliable transfer of data between two directly connected nodes on a local network segment. It deals with the physical addressing of devices (MAC addresses), framing of data for transmission over the physical medium, and basic error detection within the local link. Protocols at this layer include:

* **Ethernet (IEEE 802.3):** A common standard for wired local area networks.
* **Wi-Fi (IEEE 802.11):** A common standard for wireless local area networks.
* **PPP (Point-to-Point Protocol):** Used for establishing direct connections between two nodes, often used for dial-up or VPN connections.

The data unit at this layer is called a **frame**.

**5. Physical Layer:**

This is the lowest layer of the TCP/IP model, concerned with the physical transmission of data bits over a communication medium. It defines the physical and electrical specifications of the network hardware and transmission media. This layer deals with:

* **Physical cables and connectors.**
* **Radio frequencies for wireless transmission.**
* **Voltage levels and signaling rates.**
* **Synchronization of bits.**

This layer doesn't have high-level protocols in the same way as the upper layers. Instead, it defines the characteristics of the physical connection. The data unit at this layer is a **bit**.

Data travels down the TCP/IP stack through a process called **encapsulation**, where each layer adds its own header (and sometimes trailer) to the data it receives from the layer above. At the destination, this process is reversed through **decapsulation**, with each layer removing its header as the data moves up the stack to the receiving application.

### The Packet and Encapsulation

**What is a Packet?**

At its core, a **packet** is a fundamental unit of data transmission on a packet-switched network, like the internet. When data needs to be sent from one device to another, it's broken down into these smaller, manageable chunks called packets. Each packet contains:

* **Payload:** This is the actual data being transmitted – a piece of an email, a fragment of a web page, a segment of a video stream, etc.
* **Header:** This is control information added to the beginning of the payload. The header contains crucial details that help network devices route the packet to its destination and ensure proper reassembly. Information in the header can include source and destination addresses (both logical and physical), protocol identifiers, sequence numbers, error-checking information, and flags.
* **(Sometimes) Trailer:** Some protocols, particularly at lower layers, add a trailer to the end of the payload. Trailers often contain error detection mechanisms like Cyclic Redundancy Checks (CRC) to verify the integrity of the transmitted data.
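
The header, payload, and trailer structure can be sketched with a toy frame whose trailer is a CRC-32 of everything before it. This layout is an assumption made for illustration (it is not the exact Ethernet frame format), and `build_frame` and `frame_is_intact` are hypothetical helper names.

```python
import struct
import zlib

def build_frame(header: bytes, payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 trailer computed over header + payload."""
    crc = zlib.crc32(header + payload)
    return header + payload + struct.pack("!I", crc)

def frame_is_intact(frame: bytes) -> bool:
    """Recompute the CRC over the body and compare it with the trailer."""
    body, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(body) == struct.unpack("!I", trailer)[0]

frame = build_frame(b"HDR", b"hello")
print(frame_is_intact(frame))      # True

# Simulate corruption in transit by changing one payload byte
corrupted = frame[:3] + b"j" + frame[4:]
print(frame_is_intact(corrupted))  # False
```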

**Packet as a Protocol Data Unit (PDU)**

A **Protocol Data Unit (PDU)** is a more formal term used to refer to the unit of data exchanged between peer layers in a networking model like TCP/IP or OSI. The term "packet" is most commonly associated with the Network Layer (Layer 3) in the TCP/IP model. However, each layer in the TCP/IP model has its own specific name for the PDU it handles:

* **Application Layer (Layer 5): Data**
* **Transport Layer (Layer 4): Segment (TCP), Datagram (UDP)**
* **Network Layer (Layer 3): Packet**
* **Data Link Layer (Layer 2): Frame**
* **Physical Layer (Layer 1): Bit**

So, while "packet" specifically refers to the PDU at the Network Layer, in a broader sense, people often use it informally to refer to the encapsulated data units being transmitted across the network.

**Encapsulation**

**Encapsulation** is the process by which each layer in the TCP/IP model adds its own header (and sometimes trailer) to the data it receives from the layer above. This creates the layered structure of the PDU as it travels down the protocol stack on the sending device.

Here's how encapsulation works within the TCP/IP 5-layer model:

1.  **Application Layer:** The application creates data that needs to be sent. This data is passed down to the Transport Layer.

2.  **Transport Layer:**
    * **TCP:** If TCP is used, the Application Layer data is segmented into smaller units. A TCP header is added to each segment. This header includes source and destination port numbers, sequence numbers, acknowledgment numbers, and control flags. The PDU at this layer is called a **segment**.
    * **UDP:** If UDP is used, a UDP header is added to the Application Layer data. The UDP header is simpler, containing source and destination port numbers and a checksum. The PDU at this layer is called a **datagram**.

3.  **Network Layer (Internet Layer):** The segment (from TCP) or datagram (from UDP) is passed down to the Network Layer. An IP header is added. This header contains source and destination IP addresses, a time-to-live (TTL) value, and the protocol number of the encapsulated segment or datagram. The PDU at this layer is the **packet**.

4.  **Data Link Layer:** The IP packet is passed down to the Data Link Layer. A Data Link Layer header is added, which typically includes the physical (MAC) addresses of the source and destination network interface cards (NICs) on the local network segment. Some Data Link Layer protocols also add a trailer for error detection (e.g., CRC). The PDU at this layer is the **frame**.

5.  **Physical Layer:** The frame is passed down to the Physical Layer. This layer doesn't add headers or trailers in the same way. Instead, it converts the frame into a stream of bits and transmits these bits over the physical medium (e.g., copper wire, fiber optic cable, radio waves). The PDU at this layer is the **bit**.

**Summary of PDUs and Encapsulation in the TCP/IP 5-Layer Model:**

| Layer           | PDU Name     | Encapsulation Process                                                                | Key Header Information                                                                                                                                                              |
| :-------------- | :----------- | :------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Application     | Data         | Application-specific headers might be added by the application itself.                 | Varies depending on the application protocol (e.g., HTTP headers, SMTP commands).                                                                                                   |
| Transport       | Segment/Datagram | TCP header or UDP header is added to the Application Layer data.                      | Source and destination port numbers, sequence numbers, acknowledgment numbers, flags (TCP); source and destination port numbers, checksum (UDP).                                   |
| Network (IP)    | Packet       | IP header is added to the TCP segment or UDP datagram.                                 | Source and destination IP addresses, protocol type, time-to-live (TTL).                                                                                                               |
| Data Link       | Frame        | Data Link Layer header (and sometimes trailer) is added to the IP packet.              | Source and destination MAC addresses, error detection information (e.g., CRC).                                                                                                        |
| Physical        | Bit          | Frame is converted into a stream of bits for transmission over the physical medium. | Physical layer specifications (e.g., voltage levels, timing).                                                                                                                            |

At the receiving device, this encapsulation process is reversed (decapsulation). Each layer reads its header information, performs its designated function, and then removes the header (and trailer) before passing the remaining data up to the next layer in the stack. This layered approach ensures that each part of the communication process is handled by the appropriate layer, contributing to the overall functionality and reliability of network communication.
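
The encapsulation and decapsulation steps above can be sketched as a toy model. Bracketed text tags stand in for real binary headers, and the `LAYERS` list and function names are assumptions made for the example. The point is only the ordering: headers are added top-down on the sender and removed in reverse on the receiver.

```python
# Layers below the application, in the order the sender applies them
LAYERS = ["TCP", "IP", "ETH"]

def encapsulate(data: bytes) -> bytes:
    """Prepend one placeholder header per layer, moving down the stack."""
    pdu = data
    for layer in LAYERS:
        pdu = f"[{layer}]".encode() + pdu
    return pdu

def decapsulate(frame: bytes) -> bytes:
    """Strip the headers in reverse order, moving up the stack."""
    pdu = frame
    for layer in reversed(LAYERS):
        tag = f"[{layer}]".encode()
        if not pdu.startswith(tag):
            raise ValueError(f"expected {layer} header")
        pdu = pdu[len(tag):]
    return pdu

frame = encapsulate(b"GET /")
print(frame)               # b'[ETH][IP][TCP]GET /'
print(decapsulate(frame))  # b'GET /'
```

The outermost header belongs to the lowest layer, which is why the receiver reads the Data Link header first.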

### Images

* Packet Tracer - https://www.computerhope.com/jargon/p/packet.
* Wireshark - https://www.wireshark.org/docs/wsug_html_chunked/ChUseMainWindowSection.html
* Encapsulation - http://www.tcpipguide.com/free/t_DataEncapsulationProtocolDataUnitsPDUsandServiceDa-2.htm

In [None]:
# pip install scapy -q

#### TCP

**TCP (Transmission Control Protocol)** is a connection-oriented, reliable, and ordered transport layer protocol. It establishes a connection before data transfer, ensures all data arrives at the destination correctly and in the order it was sent, and provides mechanisms for error recovery and flow control.

**Applications that use TCP:**

* **World Wide Web (HTTP/HTTPS):** For browsing websites.
* **Email (SMTP, POP3, IMAP):** For sending and receiving emails.
* **File Transfer (FTP, SFTP):** For transferring files between systems.
* **Secure Shell (SSH):** For secure remote command-line access.
* **Telnet:** For unencrypted (insecure) remote command-line access.
* **Databases (e.g., MySQL, PostgreSQL):** For client-server communication.

In [None]:
# Code along

**1. `###[ IP ]###`**

The IP header is crucial for **routing packets across networks**. It contains the necessary information to get a packet from its source to its destination. Here's an explanation of each field:

* **`version = 4`**:
    * Specifies the **IP version** being used. In this case, `4` indicates **IPv4**, which is the most widely used version of the Internet Protocol. The next generation is IPv6, which would have a `version` of `6`.

* **`ihl = None`**:
    * **Internet Header Length**. This field indicates the size of the IP header in **32-bit words**. A standard IPv4 header without options is 20 bytes (5 x 32-bit words).
    * The `None` value here usually means that Scapy will automatically calculate the correct IHL when the packet is sent or when a full representation of the packet is needed. If there were IP options present, the IHL would be greater than 5.

* **`tos = 0x0`**:
    * **Type of Service**. This byte is now defined as the **Differentiated Services field**: a 6-bit Differentiated Services Code Point (DSCP) followed by 2 bits of Explicit Congestion Notification (ECN).
    * This byte is used to prioritize or classify packets, allowing for different treatment based on their importance. `0x0` indicates the default or best-effort service with no specific prioritization.

* **`len = None`**:
    * **Total Length**. This field specifies the total size of the IP packet in bytes, including both the header and the data (payload) it carries.
    * `None` here means Scapy will calculate the total length based on the size of the IP header and the encapsulated data (in this case, the TCP segment).

* **`id = 1`**:
    * **Identification**. This is a 16-bit integer that uniquely identifies each IP packet sent by a host. If an IP packet is fragmented (broken into smaller pieces) to traverse a network with a smaller Maximum Transmission Unit (MTU), all fragments of the original packet will have the same `id`. This allows the receiving host to reassemble the fragments correctly.

* **`flags = `**:
    * **IP Flags**. These are control bits that relate to fragmentation. The flags are:
        * **Bit 0:** Reserved (must be zero).
        * **Bit 1 (DF - Don't Fragment):** If set, this flag indicates that the packet should not be fragmented. If a router encounters a link with an MTU smaller than the packet size, the router will drop the packet and typically send an ICMP "Fragmentation Needed" message back to the sender.
        * **Bit 2 (MF - More Fragments):** If set, this flag indicates that this packet is a fragment and more fragments of the original packet will follow. The last fragment will have this flag set to zero.
    * An empty value here means that no fragmentation-related flags are set (both DF and MF are 0).

* **`frag = 0`**:
    * **Fragment Offset**. This 13-bit field indicates the position of the fragment's data relative to the beginning of the original, unfragmented packet. The offset is measured in units of 8 bytes. A value of `0` indicates that this is either an unfragmented packet or the first fragment.

* **`ttl = 64`**:
    * **Time To Live**. This 8-bit field limits the number of hops (routers) an IP packet can traverse on the network. Each router that processes the packet decrements the TTL value by at least one. When the TTL reaches zero, the packet is discarded to prevent routing loops from causing packets to circulate indefinitely. The initial TTL value is set by the sending host (often 64 or 128).

* **`proto = tcp`**:
    * **Protocol**. This 8-bit field indicates the next-level protocol encapsulated within the IP packet's data payload. Common values include:
        * `tcp` (or `6`): Transmission Control Protocol
        * `udp` (or `17`): User Datagram Protocol
        * `icmp` (or `1`): Internet Control Message Protocol
        * In this case, `tcp` indicates that the data being carried by this IP packet is a TCP segment (from the Transport Layer).

* **`chksum = None`**:
    * **Header Checksum**. This 16-bit field contains a checksum value calculated over the IP header. It's used for basic error detection to ensure the integrity of the header during transmission. If a router detects an error in the header using the checksum, it will discard the packet.
    * `None` here means Scapy will calculate the correct checksum before sending the packet.

* **`src = 192.168.1.100`**:
    * **Source IP Address**. This 32-bit field contains the IP address of the sender of the packet.

* **`dst = 192.168.1.101`**:
    * **Destination IP Address**. This 32-bit field contains the IP address of the intended recipient of the packet.

* **`\options\`**:
    * **IP Options**. This is a variable-length field that can contain optional features for the IP protocol, such as record route, timestamp, or security options. Scapy prints the field name between backslashes when the field holds a list of sub-items; with nothing listed beneath it, no IP options are present.

In summary, the IP header provides the fundamental addressing and control information necessary for routing a packet from its source to its destination across an IP network. It identifies the source and destination, specifies how the packet should be handled (e.g., fragmentation, priority), and indicates the type of data being carried in its payload.
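
Several fields above show `None` because Scapy computes them when the packet is built. As a rough illustration of what gets filled in, the sketch below packs a 20-byte IPv4 header using only the standard library. The helper names `ipv4_checksum` and `build_ipv4_header` are hypothetical (they are not Scapy functions), and the 20-byte payload length is an assumed value standing in for a TCP header with no options.

```python
import socket
import struct

def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of the header's 16-bit words, then inverted."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:                      # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_ipv4_header(src, dst, payload_len, ttl=64, proto=6, ident=1):
    version, ihl = 4, 5                     # 5 x 32-bit words = 20 bytes
    total_len = ihl * 4 + payload_len       # header + payload, in bytes
    flags_frag = 0                          # DF=0, MF=0, fragment offset=0
    header = struct.pack(
        "!BBHHHBBH4s4s",
        (version << 4) | ihl,               # version and IHL share one byte
        0,                                  # tos
        total_len,
        ident,
        flags_frag,
        ttl,
        proto,                              # 6 = TCP
        0,                                  # checksum placeholder
        socket.inet_aton(src),
        socket.inet_aton(dst),
    )
    chksum = ipv4_checksum(header)
    # The checksum occupies bytes 10 and 11 of the header
    return header[:10] + struct.pack("!H", chksum) + header[12:]

header = build_ipv4_header("192.168.1.100", "192.168.1.101", payload_len=20)
print(len(header), hex(header[0]))   # 20 0x45
print(ipv4_checksum(header))         # 0 when the stored checksum is valid
```

Recomputing the checksum over a header that already contains a correct checksum yields zero, which is how a router can validate the header cheaply.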

**2. `###[ TCP ]###` Section:**

* **`sport = 12345`**: Source port number. The application initiating the connection is using port 12345.
* **`dport = 80`**: Destination port number. The service being requested (likely HTTP) is listening on port 80.
* **`seq = 0`**: Sequence number. This is the initial sequence number (ISN) for the TCP connection establishment. It's used to track the order of data segments.
* **`ack = 0`**: Acknowledgment number. In the SYN packet, the acknowledgment number is usually 0 as no data has been received yet. It will be used in subsequent packets to acknowledge received data.
* **`dataofs = None`**: Data offset (number of 32-bit words in the TCP header). `None` means Scapy will calculate this.
* **`reserved = 0`**: Reserved bits, which are usually set to 0.
* **`flags = S`**: TCP flags. 'S' stands for SYN (Synchronize). This flag is set in the first packet of the TCP three-way handshake to initiate a connection. Other flags include 'A' (ACKnowledgment), 'F' (FINish), 'R' (ReSeT), 'P' (PUSH), and 'U' (URGent).
* **`window = 8192`**: Window size. This is the number of bytes the sender of this segment is currently willing to receive (its receive window); the peer uses it for flow control.
* **`chksum = None`**: Checksum. Used for error detection of the TCP header and data. `None` means Scapy will calculate this.
* **`urgptr = 0`**: Urgent pointer. This field is used when the URG flag is set to indicate the offset of urgent data within the segment.
* **`\options\`**: TCP options can be included here (e.g., Maximum Segment Size - MSS). It's empty in this basic example.

In summary, a TCP packet, as represented by Scapy, will have an IP layer for addressing and routing, and a TCP layer containing header fields specific to TCP's connection-oriented and reliable nature, including sequence numbers, acknowledgment numbers, flags for connection control, and window size for flow control. The raw data payload would follow the TCP header.
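
The flag letters and the sequence and acknowledgment numbers described above can be sketched without any packet library. The flag bit values follow the TCP header layout; the `decode_flags` and `handshake` helpers and the initial sequence numbers 1000 and 5000 are assumptions for illustration.

```python
# Bit positions of the six classic TCP flags, lowest bit first
TCP_FLAGS = [
    ("F", 0x01),  # FIN
    ("S", 0x02),  # SYN
    ("R", 0x04),  # RST
    ("P", 0x08),  # PSH
    ("A", 0x10),  # ACK
    ("U", 0x20),  # URG
]

def decode_flags(value: int) -> str:
    """Turn the numeric flags field into a letter string such as 'SA'."""
    return "".join(letter for letter, bit in TCP_FLAGS if value & bit)

def handshake(client_isn: int, server_isn: int):
    """Return the three segments of a TCP three-way handshake."""
    syn = {"flags": "S", "seq": client_isn, "ack": 0}
    # SYN consumes one sequence number, so each side acknowledges ISN + 1
    syn_ack = {"flags": "SA", "seq": server_isn, "ack": client_isn + 1}
    ack = {"flags": "A", "seq": client_isn + 1, "ack": server_isn + 1}
    return [syn, syn_ack, ack]

print(decode_flags(0x02))  # S
print(decode_flags(0x12))  # SA
for segment in handshake(client_isn=1000, server_isn=5000):
    print(segment)
```

A capture containing many `S` segments that never receive a matching `SA` and `A` is one pattern an anomaly detector might look for.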

#### UDP

**UDP (User Datagram Protocol)** is a connectionless and unreliable transport layer protocol. It sends data packets (datagrams) without establishing a connection beforehand and doesn't guarantee delivery or ordering; its only error detection is a simple checksum, and lost or corrupted datagrams are not retransmitted. This gives it lower overhead and latency than TCP.

**Applications that use UDP:**

* **Domain Name System (DNS):** For looking up IP addresses from domain names.
* **Streaming Media (e.g., live video, video conferencing):** Where speed is often prioritized over guaranteed delivery.
* **Voice over IP (VoIP):** For real-time voice communication.
* **Online Gaming:** For fast-paced data exchange where occasional packet loss is acceptable.
* **Simple Network Management Protocol (SNMP):** For network management tasks.
* **Trivial File Transfer Protocol (TFTP):** A simpler file transfer protocol.

In [None]:
# Code along

**1. `###[ IP ]###`**

This section defines the **Internet Protocol (IP) layer** of the packet. It contains the header fields for IPv4:

* **`version = 4`**: Indicates that this is an IPv4 packet.
* **`ihl = None`**: Internet Header Length. This field specifies the size of the IP header in 32-bit words. `None` usually means Scapy will calculate this automatically when the packet is sent or displayed in full detail.
* **`tos = 0x0`**: Type of Service (now Differentiated Services Field - DSCP). This field is used to prioritize or classify packets. `0x0` indicates no specific prioritization.
* **`len = None`**: Total length of the IP packet (header + data). `None` means Scapy will calculate this.
* **`id = 1`**: Identification field. All fragments of the same original packet share this value, which lets the destination reassemble them.
* **`flags = `**: IP flags. These flags control fragmentation. The empty value likely means no flags are set.
* **`frag = 0`**: Fragment offset. If the packet is fragmented, this field indicates the position of the fragment within the original packet. `0` means it's either not fragmented or it's the first fragment.
* **`ttl = 64`**: Time To Live. This field limits the number of hops a packet can take through routers to prevent routing loops. It's initialized to a value (often 64) and decremented by each router.
* **`proto = 17`**: Protocol. This field indicates the next-level protocol encapsulated within the IP packet. `17` corresponds to **UDP (User Datagram Protocol)**.
* **`chksum = None`**: Header checksum. This field is used for error detection of the IP header. `None` means Scapy will calculate this.
* **`src = 192.168.1.100`**: Source IP address of the packet.
* **`dst = 192.168.1.102`**: Destination IP address of the packet.
* **`\options \`**: This section would contain any IP options, but it's empty here.

**2. `###[ UDP ]###`**

This section defines the **User Datagram Protocol (UDP) layer**, which is encapsulated within the IP packet (as indicated by `proto = 17` in the IP header).

* **`sport = 5000`**: Source port number. The application on the source host that sent this UDP datagram is using port 5000.
* **`dport = 53`**: Destination port number. The application on the destination host that should receive this UDP datagram is listening on port 53, which is the standard port for **DNS (Domain Name System)** queries.
* **`len = None`**: Length of the UDP datagram (header + data). `None` means Scapy will calculate this.
* **`chksum = None`**: Checksum. This field is used for error detection of the UDP header and data. `None` means Scapy will calculate this.

**3. `###[ Raw ]###`**

This section represents the **raw payload** of the UDP datagram. It's the actual data being transmitted by the application.

* **`load = b'This is a simulated UDP packet.'`**: This is the raw byte string payload. The `b'` prefix indicates a byte literal in Python. This is the data that the DNS application (listening on port 53 at the destination IP) would receive.

**4. `'IP / UDP 192.168.1.100:5000 > 192.168.1.102:53 / Raw'`**

This is a concise **summary** of the created packet, also provided by Scapy. It shows the layering and key information:

* **`IP`**: Indicates the outermost layer is IP.
* **`/`**: Separator indicating encapsulation.
* **`UDP`**: Indicates the next layer encapsulated within IP is UDP.
* **`192.168.1.100:5000`**: Source IP address and source port number.
* **`>`**: Indicates the direction of the packet (from source to destination).
* **`192.168.1.102:53`**: Destination IP address and destination port number.
* **`/`**: Separator.
* **`Raw`**: Indicates the raw payload being carried by the UDP datagram.

**In essence, this Scapy output describes a simulated network packet that:**

* Originates from IP address `192.168.1.100` and UDP port `5000`.
* Is destined for IP address `192.168.1.102` and UDP port `53` (likely a DNS server).
* Contains a raw payload of the text "This is a simulated UDP packet."

Scapy allows you to construct and manipulate packets at this level, making it a powerful tool for network analysis, testing, and security exploration. When you send this packet with Scapy's `send()` function, the library fills in the `None` values (such as `ihl`, `len`, and `chksum`) before transmission.
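
To make the `None` fields concrete, the sketch below works out by hand the values that would be filled in for the UDP `len` field and the IP `len` and `ihl` fields of the packet described above. It uses only Python's `struct` module rather than Scapy, assumes an IPv4 header with no options (20 bytes), and leaves the checksum at 0 instead of computing it, so treat it as an illustration of the arithmetic and not of Scapy's internals.

```python
import struct

payload = b"This is a simulated UDP packet."
sport, dport = 5000, 53

# A UDP header is 8 bytes: source port, destination port, length, checksum.
UDP_HEADER_BYTES = 8
udp_len = UDP_HEADER_BYTES + len(payload)  # header + data

# "!HHHH" packs four unsigned 16-bit integers in network (big-endian) byte order.
# The checksum is left at 0 here; this sketch does not compute it.
udp_header = struct.pack("!HHHH", sport, dport, udp_len, 0)

# Assumption: an IPv4 header without options, which is 20 bytes long.
IP_HEADER_BYTES = 20
ihl = IP_HEADER_BYTES // 4            # header length in 32-bit words
ip_total_len = IP_HEADER_BYTES + udp_len

print(f"UDP len = {udp_len}, IP len = {ip_total_len}, ihl = {ihl}")
```

The point of the exercise is that these fields depend on the rest of the packet, which is why they can stay `None` until the packet is actually built.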

### TCP/IP Network Anomalies

TCP/IP network anomalies are unusual patterns or behaviors in network traffic that deviate significantly from the expected or normal baseline of communication using the TCP/IP protocol suite. These anomalies can be caused by various factors, including:

* **Security Threats:** Intrusions, malware activity, denial-of-service (DoS) attacks, port scanning, and unauthorized access attempts.
* **Network Issues:** Equipment malfunction, misconfigurations, congestion, routing problems, and broadcast storms.
* **Policy Violations:** Unauthorized use of bandwidth, forbidden protocols, or access to restricted resources.
* **Operational Errors:** Mistakes in configuration or maintenance leading to unusual traffic patterns.

**Here are some examples of TCP/IP network anomalies:**

**Traffic Volume Anomalies:**

* **Sudden Spike in Traffic:** An unexpected and significant increase in network traffic to a specific host or across the entire network, potentially indicating a DoS attack or a compromised host participating in a botnet.
* **Unusual Drop in Traffic:** A sudden and significant decrease in traffic, which could indicate a network outage, a failing server, or a targeted attack disrupting communication.
* **Excessive Bandwidth Usage:** A host consuming far more bandwidth than its typical usage, potentially due to malware uploading data or a user engaging in unauthorized file sharing.

**Packet Characteristic Anomalies:**

* **Unusual Packet Sizes:** Packets that are significantly larger or smaller than the typical packet size for a particular type of communication. This could indicate fragmentation attacks or tunneling.
* **Irregular Packet Rate:** A sudden increase or decrease in the number of packets being transmitted, which might signal a DoS attack or network congestion.
* **Abnormal TCP Flags:** Unexpected or illogical combinations of TCP flags (SYN, ACK, FIN, RST, URG, PSH) in packets. For example, multiple SYN packets without corresponding ACK responses (SYN flood attack) or RST packets for established connections without a clear reason.
* **Out-of-Order Packets:** An unusually high number of TCP packets arriving out of sequence, potentially indicating network issues or an attacker trying to reorder packets.
* **Retransmission Anomalies:** An excessive number of retransmitted TCP segments, suggesting network congestion or unreliable connections.

**Connection and Flow Anomalies:**

* **High Number of Connections:** A host establishing an unusually large number of connections to different destinations, which could be a sign of malware spreading or a port scan.
* **Connections to Unusual Ports:** Communication attempts to ports that are not typically used by standard services, potentially indicating malware communication or reconnaissance activities.
* **Connections from Unusual Source IPs:** Traffic originating from IP addresses that are not normally associated with the internal network or trusted external partners.
* **Long-Duration Connections with Little Data Transfer:** Connections that remain open for extended periods without significant data exchange, which could be indicative of idle botnet connections.
* **Failed Connection Attempts:** A high number of unsuccessful connection attempts to a specific host or service, potentially indicating a DoS attack or a misconfigured service.

**Protocol-Specific Anomalies:**

* **HTTP Anomalies:** Unusual user-agent strings, abnormal request methods, or excessive requests for specific resources.
* **DNS Anomalies:** Sudden spikes in DNS queries, queries for unusual domain names, or a high number of failed DNS resolutions.
* **SMTP Anomalies:** Large volumes of outgoing emails, emails sent to unusual recipients, or emails originating from compromised internal hosts.

Detecting these TCP/IP network anomalies is crucial for maintaining network security, performance, and reliability. Machine learning techniques can be very effective at identifying these deviations by learning what normal network traffic looks like.
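
As a small rule-based illustration of the "Abnormal TCP Flags" case, the sketch below flags sources that send many SYN packets but never an ACK. The `packets` list is made-up data and `SYN_THRESHOLD` is an arbitrary cut-off chosen for this toy example; a real detector would work on captured traffic and a learned baseline.

```python
from collections import Counter

# Hypothetical (source IP, TCP flags) observations; not real capture data.
packets = [
    ("10.0.0.5", "S"), ("10.0.0.5", "S"), ("10.0.0.5", "S"), ("10.0.0.5", "S"),
    ("10.0.0.7", "S"), ("10.0.0.7", "A"), ("10.0.0.7", "PA"),
    ("10.0.0.9", "S"), ("10.0.0.9", "A"),
]

# Count pure SYN packets and packets carrying an ACK, per source.
syn_only = Counter(src for src, flags in packets if flags == "S")
acks = Counter(src for src, flags in packets if "A" in flags)

SYN_THRESHOLD = 3  # assumed cut-off for this example

# A source is suspicious if it opens many handshakes and completes none.
suspects = [
    src for src, count in syn_only.items()
    if count >= SYN_THRESHOLD and acks[src] == 0
]
print(suspects)
```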

# Random Forest vs Isolation Forest

**Random Forest** is a **supervised** algorithm used for **classification and regression**. It learns relationships between features and a target variable from labeled data to make predictions. It builds an ensemble of deep decision trees.

**Isolation Forest** is an **unsupervised** algorithm used for **anomaly detection**. It identifies outliers by isolating them in shallow decision trees built by random partitioning. It doesn't require labeled anomaly data.

**Key Difference:** Random Forest predicts a known outcome, while Isolation Forest identifies the unusual without prior knowledge of what's anomalous.

**Supervised learning** uses **labeled data** (input features paired with correct output targets) to train models for prediction or classification. The model learns the mapping between inputs and outputs.

**Unsupervised learning** uses **unlabeled data** to find hidden patterns, structures, or groupings within the data itself, without any predefined target variables.
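
The isolation idea can be sketched without any labels. The code below is a deliberately simplified, one-dimensional illustration and is not scikit-learn's `IsolationForest`: it skips subsampling and score normalization, and the `readings` values, `isolation_depth`, and `average_depth` are made up for this example. It repeatedly picks a random split point and keeps only the side containing the target value, counting how many splits it takes to leave the target alone.

```python
import random

def isolation_depth(values, target, rng, depth=0, max_depth=20):
    """Count the random splits needed to separate `target` from the other values."""
    if len(values) <= 1 or depth >= max_depth:
        return depth
    low, high = min(values), max(values)
    if low == high:
        return depth
    split = rng.uniform(low, high)
    # Keep only the values that fall on the same side of the split as the target.
    same_side = [v for v in values if (v < split) == (target < split)]
    return isolation_depth(same_side, target, rng, depth + 1, max_depth)

def average_depth(values, target, n_trees=200, seed=0):
    """Average the isolation depth over many random 'trees'."""
    rng = random.Random(seed)
    total = sum(isolation_depth(values, target, rng) for _ in range(n_trees))
    return total / n_trees

# Hypothetical sensor readings with one obvious outlier.
readings = [9.7, 9.8, 9.9, 10.0, 10.1, 10.2, 10.3, 10.4, 25.0]
outlier_depth = average_depth(readings, 25.0)
typical_depth = average_depth(readings, 10.0)
print(f"outlier: {outlier_depth:.2f} splits, typical point: {typical_depth:.2f} splits")
```

The outlier needs fewer splits on average because almost any random split point falls in the wide gap between it and the rest of the data. That shorter path is what marks a point as anomalous.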

## The Decision Tree

A **decision tree** is a tree-like supervised learning model used for both classification and regression. It makes decisions by recursively splitting the data based on the values of input features, aiming to create homogeneous subsets with respect to the target variable. Each internal node represents a feature test, each branch represents the outcome of the test, and each leaf node represents a prediction (class label or numerical value).

                                      Chillin with some grillin?
                                         Is it cloudy today?
                                            (Root Node)
                                                /\
                                           yes /  \ no
                                              /    \
                                         chance    grillin
                                        of rain?      
                                            /\
                                       yes /  \ no
                                          /    \
                                  no grillin     grillin

### Entropy

$H(S) = -p_1 \log_2(p_1) - p_0 \log_2(p_0)$

The entropy of a set S, denoted as H(S), is calculated by taking the negative of the probability of the first class (p₁) multiplied by the base-2 logarithm of that probability, and then subtracting the probability of the second class (p₀) multiplied by the base-2 logarithm of that probability.

In [None]:
import numpy as np  # We'll need numpy for the logarithm base 2

def entropy(target_col):
    """Calculates the entropy of a target variable (binary or multi-class)."""
    elements, counts = np.unique(target_col, return_counts=True)
    probabilities = counts / len(target_col)  # Proportion of each class (p1 and p0 in the binary case)
    entropy_value = 0
    for prob in probabilities:
        if prob > 0:  # Avoid log(0) error
            entropy_value -= prob * np.log2(prob) # Calculates -p * log2(p) for each class

    return entropy_value # Returns H(S)
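
A few hand-checkable values help build intuition for the formula. The sketch below re-implements the same calculation with only the standard library (the helper name `entropy_of` is chosen here so it does not clash with the notebook's `entropy`) and evaluates it on three tiny label lists.

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    proportions = [count / n for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in proportions)

pure = [1, 1, 1, 1]      # only one class present
even = [1, 1, 0, 0]      # 50/50 split
skewed = [1, 1, 1, 0]    # 75/25 split

print(entropy_of(pure))              # 0.0: no uncertainty
print(entropy_of(even))              # 1.0: maximum for two classes
print(round(entropy_of(skewed), 4))  # between 0 and 1
```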

### Information Gain

$IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$

The information gain of splitting a dataset S on an attribute A, denoted as IG(S, A), is determined by taking the initial entropy of the dataset S, represented as H(S), and subtracting the sum of the entropies of each subset Sv created by splitting on the different values (v) of attribute A. Each of these subset entropies, H(Sv), is weighted by the proportion of the number of elements in that subset (|Sv|) relative to the total number of elements in the original dataset (|S|).

In [None]:
def information_gain(data, split_attribute, target_attribute):
    """Calculates the information gain of splitting on an attribute."""
    # Total entropy of the target variable
    total_entropy = entropy(data[target_attribute]) # H(S)

    # Unique values in the split attribute
    values = data[split_attribute].unique()
    weighted_entropy = 0

    for value in values:
        subset = data[data[split_attribute] == value]
        subset_size = len(subset)
        if subset_size > 0:
            subset_entropy = entropy(subset[target_attribute]) # H(Sv)
            weighted_entropy += (subset_size / len(data)) * subset_entropy # (|Sv| / |S|) * H(Sv)

    # Information Gain
    gain = total_entropy - weighted_entropy # IG(S, A) = H(S) - Σ (|Sv| / |S|) * H(Sv)
    return gain
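
To see what "high" and "low" information gain look like, the sketch below applies the same formula to four made-up rows. It is a standard-library re-implementation for illustration (the names `entropy_of`, `info_gain`, and the toy `rows` are assumptions of this example): one feature separates the classes perfectly, while the other tells us nothing.

```python
import math
from collections import Counter, defaultdict

def entropy_of(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, feature, target):
    """H(S) minus the size-weighted entropy of the subsets created by `feature`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    weighted = sum(len(g) / len(rows) * entropy_of(g) for g in groups.values())
    return entropy_of([row[target] for row in rows]) - weighted

# Toy rows, made up for this example.
rows = [
    {"Sound": "Meow", "Fur": "Furry", "Animal": "Cat"},
    {"Sound": "Meow", "Fur": "Not Furry", "Animal": "Cat"},
    {"Sound": "Woof", "Fur": "Furry", "Animal": "Dog"},
    {"Sound": "Woof", "Fur": "Not Furry", "Animal": "Dog"},
]

print(info_gain(rows, "Sound", "Animal"))  # 1.0: every subset is pure
print(info_gain(rows, "Fur", "Animal"))    # 0.0: subsets are as mixed as before
```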

### Gini Impurity

$Gini(S) = 1 - (p_1^2 + p_0^2)$

The Gini impurity of a set S, denoted as Gini(S), is calculated by taking one and subtracting the sum of the squared probabilities of each class. For a binary classification problem, this means subtracting the square of the probability of the first class (p₁) and the square of the probability of the second class (p₀) from one.

In [None]:
def gini(target_col):
    """Calculates the Gini impurity of a target variable (binary or multi-class)."""
    elements, counts = np.unique(target_col, return_counts=True)
    probabilities = counts / len(target_col)  # Proportion of each class (p1 and p0 in the binary case)
    gini_value = 1 - np.sum(probabilities**2)  # 1 - sum of squared class proportions
    return gini_value # Returns Gini(S)
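
As with entropy, a few small cases make the scale of Gini impurity easier to read. This is a standard-library sketch (the helper name `gini_of` is chosen for this example) evaluated on made-up label lists.

```python
from collections import Counter

def gini_of(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_of(["a", "a", "a", "a"]))        # 0.0: pure
print(gini_of(["a", "a", "b", "b"]))        # 0.5: maximum for two classes
print(round(gini_of(["a", "b", "c"]), 4))   # 0.6667: three equally likely classes
```

Note that the maximum depends on the number of classes: with $n$ equally likely classes it is $1 - 1/n$.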

### Gini Gain (Reduction in Gini Impurity)

$Gini\_Gain(S, A) = Gini(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Gini(S_v)$

The Gini gain achieved by splitting a dataset S on an attribute A, denoted as Gini\_Gain(S, A), is calculated by taking the initial Gini impurity of the dataset S, represented as Gini(S), and subtracting the weighted sum of the Gini impurities of each subset Sv created by splitting on the different values (v) of attribute A. Each subset's Gini impurity, Gini(Sv), is weighted by the proportion of the number of elements in that subset (|Sv|) relative to the total number of elements in the original dataset (|S|).

In [None]:
def gini_gain(data, split_attribute, target_attribute):
    """Calculates the Gini gain of splitting on an attribute."""
    # Total Gini impurity of the target variable
    total_gini = gini(data[target_attribute]) # Gini(S)

    # Unique values in the split attribute
    values = data[split_attribute].unique()
    weighted_gini = 0

    for value in values:
        subset = data[data[split_attribute] == value]
        subset_size = len(subset)
        if subset_size > 0:
            subset_gini = gini(subset[target_attribute]) # Gini(Sv)
            weighted_gini += (subset_size / len(data)) * subset_gini # (|Sv| / |S|) * Gini(Sv)

    # Gini Gain
    gain = total_gini - weighted_gini # Gini_Gain(S, A) = Gini(S) - Σ (|Sv| / |S|) * Gini(Sv)
    return gain

### Exercise

**Our Goal:** To understand how Information Gain, Gini Impurity, and Entropy guide the splitting of a decision tree when the target variable has three distinct classes.

**Scenario:** Imagine we have a dataset about different animals described by their fur, number of legs, sound, and tail, and our target variable is the animal type (Cat, Dog, Turtle).

### Step 1: Setting up the Environment and Initial Data

In [None]:
import pandas as pd
import numpy as np


np.random.seed(42) # for reproducibility

data = {
    'Fur': np.random.choice(['Furry', 'Not Furry'], size=12),
    'Legs': np.random.choice([4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 2, 2], size=12), # Bias towards 4 legs for cats/dogs
    'Sound': np.random.choice(['Meow', 'Woof', 'Hiss', 'Meow', 'Woof', 'Hiss', 'Meow', 'Woof', 'Hiss', 'Meow', 'Woof', 'Hiss'], size=12),
    'Tail': np.random.choice(['Long', 'Short', 'None', 'Long', 'Short', 'None', 'Long', 'Short', 'None', 'Long', 'Short', 'None'], size=12),
    'Animal': ['Cat', 'Dog', 'Turtle', 'Cat', 'Dog', 'Turtle', 'Cat', 'Dog', 'Turtle', 'Cat', 'Dog', 'Turtle']
}

In [None]:
# Code along


### Step 2: Understanding Entropy

**What is Entropy?**

Entropy is a measure of the impurity, disorder, or randomness in a set of data. In the context of information theory, it quantifies the average amount of information needed to describe the outcome of a random event. In machine learning, particularly with decision trees, entropy is used to measure the impurity of a dataset with respect to the target variable.

A pure dataset (where all data points belong to the same class) has an entropy of 0. A dataset with an equal distribution of classes has maximum entropy.

**Why Use Entropy with the Target Variable?**

In decision trees, we use entropy to determine the "best" way to split the data at each node. The goal is to create splits that result in subsets of data that are as pure as possible (with respect to the target variable).

Here's how it works:

1.  **Initial Entropy:** We first calculate the entropy of the entire dataset's target variable. This gives us a baseline measure of the impurity before any splits are made.

2.  **Splitting and Information Gain:**
    * We then consider different features and potential split points within those features.
    * For each potential split, we calculate the entropy of the resulting subsets (after the split).
    * We compare the entropy of the original dataset to the weighted average of the entropies of the subsets. The reduction in entropy achieved by the split is called **information gain**.

3.  **Best Split Selection:** The feature and split point that yield the highest information gain (i.e., the greatest reduction in entropy) are chosen for that node in the decision tree.

4.  **Recursion:** This process is repeated recursively for each subset, building the tree until a stopping criterion is met (e.g., a subset is pure, or a maximum tree depth is reached).

**In essence, we use entropy to guide the tree construction process, favoring splits that create more homogeneous subsets, making the tree more effective at classifying or predicting the target variable.**

**Entropy Formula:**

The formula for entropy (H) for a set S is:

$$H(S) = - \sum_{i=1}^{c} p_i \log_2(p_i)$$

Where:

* $H(S)$ is the entropy of the dataset $S$.
* $c$ is the number of distinct classes in the target variable.
* $p_i$ is the proportion (probability) of data points belonging to class $i$ in the dataset $S$.

The logarithm is base 2 so that entropy is measured in bits. Choosing the split that leaves the lowest weighted entropy is the same as choosing the split with the highest information gain, which is what makes the resulting tree efficient and accurate.
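
As a worked instance of the formula, take the `Animal` column defined in Step 1, which lists four Cats, four Dogs, and four Turtles, so each class has $p_i = \tfrac{4}{12} = \tfrac{1}{3}$:

$$H(S) = -3 \cdot \tfrac{1}{3} \log_2\left(\tfrac{1}{3}\right) = \log_2(3) \approx 1.585$$

This is the maximum possible entropy for three classes, because the classes are evenly distributed.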


In [None]:
# Code along


**(Pause. Interpret the entropy value in the context of three classes. A higher value still indicates more impurity or a more even distribution of the animal types.)**

### Step 3: Information Gain

This step explains information gain and how it is used to select the best feature for the first split in a decision tree.

**What is Information Gain?**

Information gain measures the reduction in entropy achieved by partitioning the dataset based on a particular feature. In simpler terms, it tells us how much "cleaner" or more organized the data becomes after we split it according to that feature. A higher information gain indicates a more effective split, as it means the resulting subsets are more homogeneous with respect to the target variable.

**How We Use Information Gain to Find the Best Feature for the First Split**

1.  **Calculate Initial Entropy:** First, as you've already done, we calculate the entropy of the target variable for the entire dataset. This is the entropy before any splitting is performed.

2.  **Calculate Entropy for Each Feature:**
    * For each feature in the dataset:
        * We determine the unique values of that feature.
        * For each unique value of the feature, we create a subset of the data containing only the rows where the feature has that value.
        * We calculate the entropy of the target variable for each of these subsets.

3.  **Calculate Weighted Average Entropy for Each Feature:**
    * For each feature, we calculate the weighted average entropy of its subsets. The weight for each subset's entropy is the proportion of data points that belong to that subset.

4.  **Calculate Information Gain for Each Feature:**
    * For each feature, we calculate the information gain by subtracting the weighted average entropy of that feature's subsets from the initial entropy of the entire dataset.

5.  **Select the Feature with the Highest Information Gain:**
    * The feature with the highest information gain is chosen as the feature for the first split. This is because it provides the greatest reduction in impurity and thus the most informative split.

**Information Gain Equation:**

The formula for information gain (IG) is:

$$IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$$

Where:

* $IG(S, A)$ is the information gain of splitting the dataset $S$ on feature $A$.
* $H(S)$ is the entropy of the original dataset $S$.
* $Values(A)$ is the set of all possible values of feature $A$.
* $|S_v|$ is the number of data points in the subset $S_v$ where feature $A$ has value $v$.
* $|S|$ is the total number of data points in the dataset $S$.
* $H(S_v)$ is the entropy of the subset $S_v$.

1.  **`information_gain(data, split_attribute, target_attribute)` function:**
    * Takes the DataFrame (`data`), the feature to split on (`split_attribute`), and the target column (`target_attribute`) as input.
    * Calculates the `total_entropy` (entropy of the target variable in the whole dataset).
    * Finds the unique values of the `split_attribute`.
    * Iterates through each unique value of the `split_attribute`:
        * Creates a `subset` of the data where the `split_attribute` equals the current value.
        * Calculates the subset's weight, `subset_size / len(data)`.
        * Calculates the entropy of the `target_attribute` in the `subset`.
        * Adds the weighted entropy of the subset to `weighted_entropy`.
    * Returns the information gain by subtracting the `weighted_entropy` from the `total_entropy`.

2.  **Calculate Information Gain for Each Feature:**
    * We iterate through all the features in the DataFrame (except the target variable 'Animal').
    * We call the `information_gain` function for each feature to calculate its information gain.
    * We store the information gains in a dictionary `information_gains`.
    * We print the information gain for each feature.

3.  **Find the Feature with the Highest Information Gain:**
    * We use the `max()` function with the `key` argument to find the feature with the maximum information gain from the `information_gains` dictionary.
    * We print the name of the `best_feature`.

This code will calculate the information gain for each feature and tell you which feature would be the best choice for the first split in a decision tree based on maximizing information gain.

In [None]:
# information_gains = {}
# for feature in df.columns:
#     if feature != 'Animal':  # Don't calculate gain for the target variable itself
#         information_gains[feature] = information_gain(df, feature, 'Animal')
#         print(f"Information Gain for '{feature}': {information_gains[feature]:.4f}")

# # Find the Feature with the Highest Information Gain
# best_feature = max(information_gains, key=information_gains.get)
# print(f"\nBest feature to split on: '{best_feature}'")

**(Pause. Which feature seems most promising for the first split based on these values?)**

**Explanation:**

* We iterate through each of our features.
* For each feature, we use our `information_gain` function (which we defined earlier and which works for multi-class targets) to calculate how much the entropy of 'Animal' is reduced by splitting on that feature.
* The output shows the Information Gain for each potential first split. The feature with the highest Information Gain would be chosen as the root node.

### Understanding Recursion

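Decision trees are built recursively (each split repeats the same procedure on a smaller subset), so it helps to see recursion in a simpler setting first. Three common methods for generating the Fibonacci sequence in Python are using a loop (iteration), recursion, and memoization (dynamic programming).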

**1. Iteration:**

   - **Approach:** This method uses a `for` or `while` loop to compute the Fibonacci numbers sequentially.
   - **Process:** Initialize the first two Fibonacci numbers (0 and 1), and then iteratively calculate the next number by adding the previous two.
   - **Example:**

In [None]:
# Code along
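
One possible version of the iterative approach is sketched below; the function name `fib_iterative` is chosen here for illustration, and the sequence is assumed to start at 0, 1 as described above.

```python
def fib_iterative(n):
    """Return the first n Fibonacci numbers, starting with 0 and 1."""
    sequence = []
    a, b = 0, 1
    for _ in range(n):
        sequence.append(a)
        a, b = b, a + b  # slide the window forward by one number
    return sequence

print(fib_iterative(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```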

**2. Recursion:**

   - **Approach:** This method defines a function that calls itself to compute the Fibonacci numbers.
   - **Process:** Define base cases for the first two Fibonacci numbers (0 and 1), and then recursively define the function for the next number by adding the previous two.
   - **Example:**

In [None]:
# Code along
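
A minimal recursive sketch follows; the name `fib_recursive` is chosen here for illustration. Note that it recomputes the same values many times, so it becomes slow quickly as `n` grows.

```python
def fib_recursive(n):
    """Return the n-th Fibonacci number (fib(0) = 0, fib(1) = 1)."""
    if n < 2:  # base cases
        return n
    # Recursive case: the function calls itself on two smaller problems.
    return fib_recursive(n - 1) + fib_recursive(n - 2)

print([fib_recursive(i) for i in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```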

**3. Memoization:**

Memoization is an optimization technique that speeds up code by storing the results of expensive function calls and reusing them when the same inputs occur again. It essentially creates a cache of function results to avoid redundant computations.

   - **Approach:** This method combines recursion with dynamic programming (memoization) to optimize the calculation.
   - **Process:** Store the results of previous Fibonacci number calculations in a cache (dictionary or list) to avoid recomputing them.
   - **Example:**

In [None]:
# Code along
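
One way to add a cache to the recursive version is sketched below, using a plain dictionary; the name `fib_memo` is chosen here for illustration. Python's built-in `functools.lru_cache` decorator would be another option.

```python
def fib_memo(n, cache=None):
    """Return the n-th Fibonacci number, caching results that were already computed."""
    if cache is None:
        cache = {}
    if n < 2:  # base cases
        return n
    if n not in cache:
        # Each value is computed once and then reused from the cache.
        cache[n] = fib_memo(n - 1, cache) + fib_memo(n - 2, cache)
    return cache[n]

print(fib_memo(50))  # 12586269025
```

With the cache, each Fibonacci number is computed only once, so the work grows linearly with `n` instead of exponentially.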

### Step 4: Understanding Gini Impurity

For a multi-class target with probabilities $p_1, p_2, ..., p_n$, the Gini Impurity is:

$$Gini(S) = 1 - \sum_{i=1}^{n} p_i^2$$


**Splitting on "Sound"**

We've (hypothetically) determined that "Sound" is the best feature to split on (using information gain across all features). Now let's see how Gini impurity is used to evaluate this split.

1.  **Unique Values of "Sound":** The unique values in the 'Sound' column are 'Meow', 'Woof', and 'Hiss'.

2.  **Create Subsets:** We split the DataFrame into three subsets based on these values:



In [None]:
# subset_meow = df[df['Sound'] == 'Meow']
# subset_woof = df[df['Sound'] == 'Woof']
# subset_hiss = df[df['Sound'] == 'Hiss']

3.  **Calculate Gini Impurity of Subsets:** We calculate the Gini impurity of the 'Animal' column in each subset:

In [None]:
# gini_meow = gini(subset_meow['Animal'])
# gini_woof = gini(subset_woof['Animal'])
# gini_hiss = gini(subset_hiss['Animal'])

# print(f"Gini Impurity for 'Meow' subset: {gini_meow:.4f}")
# print(f"Gini Impurity for 'Woof' subset: {gini_woof:.4f}")
# print(f"Gini Impurity for 'Hiss' subset: {gini_hiss:.4f}")

4.  **Calculate Weighted Average Gini Impurity:** We calculate the weighted average Gini impurity of the split:

In [None]:
# weighted_gini_sound = (
#     (len(subset_meow) / len(df)) * gini_meow +
#     (len(subset_woof) / len(df)) * gini_woof +
#     (len(subset_hiss) / len(df)) * gini_hiss
# )
# print(f"Weighted Average Gini Impurity for 'Sound' Split: {weighted_gini_sound:.4f}")

**Why Do We Calculate the Weighted Gini of "Sound"?**

   * We calculate the weighted Gini impurity of "Sound" (and any other potential splitting feature) to evaluate the **quality** of that specific split. We're not looking for a single "total Gini score" for the whole process. Instead, we're comparing the Gini impurity *before* the split to the *expected Gini impurity after* the split.
   * The goal is to see how much the split reduces the impurity in the data. A greater reduction is better.
   * It's a Gini impurity value *specific to a particular split based on a particular feature*.
   * We have:
        * `initial_gini`: Gini impurity of the original dataset (before any split).
        * `weighted_gini_sound`: The *expected* Gini impurity *after* we split the dataset based on the "Sound" feature.

**What Does "Weighted" Mean?**

   * "Weighted" in "weighted Gini impurity" means that the Gini impurity of each resulting subset is given a weight based on the proportion of data points that fall into that subset.
   * **Why weight?** Because subsets can have different sizes. We want to give more importance to the Gini impurity of larger subsets, as they represent a larger portion of the data.

**Illustrative Example to Explain "Weighted"**

   Imagine we have 100 data points, and we're considering a split that creates two subsets:

   * Subset A: 90 data points, Gini impurity = 0.1
   * Subset B: 10 data points, Gini impurity = 0.5

   If we simply averaged the Gini impurities (0.1 + 0.5) / 2 = 0.3, it wouldn't accurately reflect the overall impurity after the split. Subset A is much larger and purer, so it should contribute more to our overall measure of impurity.

   The weighted average Gini impurity would be:

   (90/100) \* 0.1 + (10/100) \* 0.5 = 0.09 + 0.05 = 0.14

   This shows that the split gives us a relatively pure result overall because most of the data (90%) is in the purer subset.
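
The same arithmetic in code, using the two subsets from the example above (the variable names are chosen for this sketch):

```python
# (subset size, Gini impurity) for Subset A and Subset B
subsets = [(90, 0.1), (10, 0.5)]

total = sum(size for size, _ in subsets)

# Simple average: treats both subsets as equally important.
simple_avg = sum(g for _, g in subsets) / len(subsets)

# Weighted average: each subset counts in proportion to its size.
weighted_avg = sum(size / total * g for size, g in subsets)

print(round(simple_avg, 2), round(weighted_avg, 2))  # 0.3 0.14
```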

**In Summary**

   * We calculate the weighted Gini impurity for each *potential split* to compare the effectiveness of different splits.
   * It's a *split-specific* score.
   * "Weighted" means we give more importance to larger subsets, which is crucial for accurately assessing the overall impurity after a split.

5.  **Lowest weighted Gini Impurity:** We're aiming for the smallest amount of impurity *after* the split.

   * **Purity = Better Predictions:** Our goal is to create subsets of data that are "pure" with respect to the target variable. A pure subset contains mostly or only data points belonging to a single class. Pure subsets make it easier to make accurate predictions.
   * **Impurity = Uncertainty:** High impurity means the data is mixed, and it's hard to predict the class of a data point in that subset. Low impurity means the data is organized, and prediction is more certain.
   * **Reducing Uncertainty:** By choosing the split that results in the greatest reduction in impurity (or the lowest weighted average impurity), we are effectively reducing the uncertainty about the target variable. This leads to a decision tree that can make more accurate classifications or predictions.



In [None]:
# import pandas as pd
# import numpy as np

# def calculate_weighted_gini(data, split_feature, target_feature):
#     values = data[split_feature].unique()
#     weighted_gini = 0
#     for value in values:
#         subset = data[data[split_feature] == value]
#         subset_prob = len(subset) / len(data)
#         weighted_gini += subset_prob * gini(subset[target_feature])
#         print(f"  Gini Impurity for '{value}': {gini(subset[target_feature]):.4f}")
#     return weighted_gini

# def find_best_split(data, features, target):
#     best_feature = None
#     min_weighted_gini = 1.0  # Initialize with maximum impurity

#     for feature in features:
#         weighted_gini = calculate_weighted_gini(data, feature, target)
#         print(f"Weighted Gini for '{feature}': {weighted_gini:.4f}\n")
#         if weighted_gini < min_weighted_gini:
#             min_weighted_gini = weighted_gini
#             best_feature = feature

#     return best_feature

# def build_tree_recursive(data, features, target, depth=0):
#     """Simplified recursive tree builder (for illustration)."""

#     print(f"\n\n--- Depth: {depth} ---")
#     print(f"Data:\n{data}")
#     print(f"Features: {features}\n\n")

#     # Stopping condition (simplified)
#     if len(np.unique(data[target])) == 1 or depth > 2 or not features:
#         print(f"  --> Leaf Node: {np.unique(data[target])[0] if len(np.unique(data[target])) == 1 else 'Mixed'}")
#         return

#     best_feature = find_best_split(data, features, target)
#     print(f"  Best Feature: {best_feature}")

#     unique_values = data[best_feature].unique()
#     print(f"  Unique Values: {unique_values}")
#     remaining_features = [f for f in features if f != best_feature]

#     for value in unique_values:
#         subset = data[data[best_feature] == value]
#         print(f"  Branch: {best_feature} = {value}")
#         build_tree_recursive(subset, remaining_features, target, depth + 1)  # Recursive call!

# # Initial call to start the tree building
# build_tree_recursive(df, df.columns.drop('Animal').tolist(), 'Animal')

```
Features: ['Fur', 'Legs', 'Sound', 'Tail']


  Gini Impurity for 'Furry': 0.6667
  Gini Impurity for 'Not Furry': 0.6667
Weighted Gini for 'Fur': 0.6667

  Gini Impurity for '2': 0.5600
  Gini Impurity for '4': 0.6122
Weighted Gini for 'Legs': 0.5905

  Gini Impurity for 'Hiss': 0.5714
  Gini Impurity for 'Woof': 0.5000
  Gini Impurity for 'Meow': 0.4444
Weighted Gini for 'Sound': 0.5278

  Gini Impurity for 'Short': 0.6667
  Gini Impurity for 'Long': 0.4444
  Gini Impurity for 'None': 0.6111
Weighted Gini for 'Tail': 0.5833

  Best Feature: Sound
  Unique Values: ['Hiss' 'Woof' 'Meow']
  Branch: Sound = Hiss
```
* **Gini Impurity Calculations:**
    * For each feature ('Fur', 'Legs', 'Sound', 'Tail'), the Gini impurity is calculated for each unique value of that feature.
    * For each feature, a "Weighted Gini" is also calculated.

* **Feature-Specific Gini Impurity:**
    * For 'Fur', both 'Furry' and 'Not Furry' have a Gini impurity of 0.6667, and the weighted Gini is also 0.6667.
    * For 'Legs', '2' has a Gini impurity of 0.5600, '4' has 0.6122, and the weighted Gini is 0.5905.
    * For 'Sound', 'Hiss' has a Gini impurity of 0.5714, 'Woof' has 0.5000, 'Meow' has 0.4444, and the weighted Gini is 0.5278.
    * For 'Tail', 'Short' has a Gini impurity of 0.6667, 'Long' has 0.4444, 'None' has 0.6111, and the weighted Gini is 0.5833.

* **Decision:**
    * "Best Feature" is identified as 'Sound'.
    * "Unique Values" for 'Sound' are listed as `['Hiss' 'Woof' 'Meow']`.
    * The output then shows the first branch being followed: "Branch: Sound = Hiss".

**Interpretation:**

The decision tree algorithm selected 'Sound' as the first feature to split on. This selection was based on comparing the "Weighted Gini" values for all features. The feature 'Sound' has the lowest weighted Gini (0.5278) compared to 'Fur' (0.6667), 'Legs' (0.5905), and 'Tail' (0.5833).

The algorithm aims to minimize the impurity after the split, so it chooses the feature that results in the lowest weighted average impurity.

The unique values of 'Sound' indicate that the first split will create branches for 'Hiss', 'Woof', and 'Meow'.

## Iris

### The Data

In [None]:
# import pandas as pd
# import matplotlib.pyplot as plt
# import seaborn as sns
# from sklearn.datasets import load_iris
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.model_selection import train_test_split

# iris = load_iris()
# X = iris.data
# y = iris.target

# df = pd.DataFrame(data=X, columns=iris.feature_names)
# df['species'] = y
# df['species'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})


# X_train, X_test, y_train, y_test = train_test_split(df.drop('species', axis=1),
#                                                     df['species'],
#                                                     test_size=0.20,
#                                                     random_state=42)

# print(X_train.shape)
# print(X_train.head())

### Decision Tree Classifier Model and Tree

In [None]:
# import pandas as pd
# import matplotlib.pyplot as plt
# from sklearn.datasets import load_iris
# from sklearn.tree import DecisionTreeClassifier
# from sklearn import tree
# from sklearn.model_selection import train_test_split

# print(X_train.head())

# model = DecisionTreeClassifier(criterion='gini', random_state=42).fit(X_train, y_train)

# plt.figure(figsize=(14, 14))
# tree.plot_tree(model,
#               feature_names=X_train.columns,
#               class_names=['setosa', 'versicolor', 'virginica'],
#               filled=False)

# plt.tight_layout();

In [None]:
# # example Gini score at the root node, from the class proportions in y_train
# import numpy as np

# 1 - np.sum(np.square(y_train.value_counts(normalize=True)))

### Natural Splits

In [None]:
# example = X_train.copy()
# example['species'] = y_train

# sns.pairplot(example, hue='species', palette=['red', 'blue', 'green']);

**Pair Plot**

* **Petal Length vs. Other Features:** In every scatter plot where petal length is involved (third row and third column), *setosa* (red) is clearly separated along the petal length axis.
* **Petal Width vs. Other Features:** The same is true for petal width (fourth row and fourth column); *setosa* is well-separated.
* **Petal Length vs. Petal Width:** This plot shows *setosa* forming a distinct cluster in the lower-left corner, reinforcing that both features together provide excellent separation.

In [None]:
# import matplotlib.pyplot as plt

# X_train.hist()
# plt.tight_layout();

**Histograms**

* **Petal Length Histogram:** The histogram for petal length shows a distinct group on the far left (around 1-2 cm) that corresponds to *setosa* (red in the pair plot). There's a clear separation from the other two species, which have petal lengths mostly greater than 3 cm.
* **Petal Width Histogram:** Similarly, the petal width histogram shows *setosa* grouped on the far left (around 0-0.5 cm), well-separated from the *versicolor* and *virginica* distributions.

**Explanation**

The visual separation in both the histograms and pair plots indicates that *setosa* has consistently smaller petal lengths and widths compared to *versicolor* and *virginica*. This "natural" separation means a decision tree can easily use these features to create simple rules (like "petal length <= a threshold") to isolate *setosa* with high accuracy, as seen in the decision tree where the first split is on petal length.

### Partitioning Examples

In [None]:
# plt.figure(figsize=(14, 14))
# tree.plot_tree(model,
#               feature_names=X_train.columns,
#               class_names=['setosa', 'versicolor', 'virginica'],
#               filled=False)

# plt.tight_layout();

Where does the decision tree model get the specific values it uses for splitting (like `petal length (cm) <= 2.45`)?

**The Model Learns the Split Values from the Training Data**

The decision tree algorithm *learns* these threshold values (e.g., 2.45 for petal length, 1.65 or 1.75 for petal width) directly from the **training data** you provide it.  It doesn't pull them out of thin air or use some pre-defined list.

Here's a more detailed breakdown:

1.  **Training Data:**
    * You start with a dataset (like the Iris dataset) that contains measurements of features (petal length, petal width, sepal length, sepal width) for a set of samples, along with the known class or category (species) for each sample.
    * This dataset is typically split into two parts:
        * **Training set:** This is the portion of the data the model uses to learn the relationships between features and the target variable (species).
        * **Testing set:** This is a separate portion used to evaluate how well the model generalizes to new, unseen data.

2.  **Algorithm's Search for Optimal Splits:**
    * The decision tree algorithm (like CART, ID3, or C4.5) examines the training data to find the best way to split the data at each node.
    * "Best" is defined based on a criterion like:
        * **Information Gain (for ID3, C4.5):** How much does this split reduce the entropy (uncertainty) about the target variable?
        * **Gini Impurity (for CART):** How much does this split reduce the "impurity" (mixing of classes) in the resulting subsets?
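    * To find the best split, the algorithm evaluates candidate threshold values for each feature.
    * For example, for 'petal length', it might try splitting at 2.0 cm, 2.1 cm, 2.2 cm, 2.3 cm, 2.4 cm, 2.45 cm, 2.5 cm, and so on, evaluating the Information Gain or Gini Impurity for each potential split. In practice, implementations such as scikit-learn only consider midpoints between consecutive distinct values observed in the training data, which is why thresholds like 2.45 appear.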
    * The algorithm then selects the threshold value that results in the greatest improvement in the chosen criterion (highest Information Gain or lowest Gini Impurity).

3.  **Example with Petal Length <= 2.45:**
    * In the tree above, the algorithm found that splitting the data at `petal length (cm) <= 2.45` produced the most significant reduction in impurity (or increase in Information Gain) at that stage of the tree building process.
    * This means that, among all the possible petal length values, 2.45 cm was the best at separating the Iris species in the training data at that point.

4.  **No Predefined Values:**
    * It's crucial to understand that the algorithm is *calculating* and *selecting* these values based on the data. It's not using any pre-set list of petal length or width values.
    * If you trained the same model on a slightly different training set (a different random sample of the Iris data), the exact split values might be slightly different, although the overall structure of the tree would likely be similar.

**In essence:** The decision tree model learns the rules (including the threshold values for splitting) by analyzing patterns in the training data and finding the splits that best organize the data points according to the target variable.
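
The search described above can be sketched in a few lines of plain Python. This is a simplified illustration for a single numeric feature, not scikit-learn's actual implementation; the function names and the eight toy measurements are assumptions made for the example.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best `value <= threshold` split."""
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = (None, float('inf'))
    for threshold in candidates:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical petal lengths (cm) and species, chosen only for illustration
petal_length = [1.4, 1.5, 1.9, 3.0, 4.5, 4.7, 5.1, 5.9]
species = ['setosa'] * 3 + ['versicolor'] * 3 + ['virginica'] * 2

threshold, score = best_threshold(petal_length, species)
print(f"best split: petal length <= {threshold:.2f} (weighted Gini {score:.3f})")
```

With these toy values the winning threshold is the midpoint between 1.9 and 3.0, the largest *setosa* value and the smallest non-*setosa* value, which is how a value like 2.45 can arise even though no flower has that exact petal length.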

In [None]:
# # plot using hue to show different classes
# import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns

# sns.scatterplot(x=example['petal length (cm)'],
#                 y=example['petal width (cm)'],
#                 hue=example['species'],
#                 palette=['red', 'blue', 'green'])
# plt.axvline(x=2.45, color='black')
# plt.axvline(x=4.75, color='black')
# plt.hlines(y=1.75, xmin=2.45, xmax=8, color='black')

# plt.xlim(0, 8)
# plt.legend()
# plt.show()

In [None]:
# # plot using hue to show different classes
# import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns

# sns.scatterplot(x=example['petal length (cm)'][example['petal length (cm)']>2.45],
#                 y=example['petal width (cm)'],
#                 hue=example['species'],
#                 palette=['red', 'blue', 'green'])
# plt.axhline(y=1.75, color='black')
# plt.vlines(x=4.95, ymin=0, ymax=1.75, color='black')
# plt.vlines(x=5.45, ymin=0, ymax=1.75, color='black')
# plt.axhline(y=1.55, color='black')

# plt.xlim(2.45, 8)
# plt.ylim(0, )
# plt.legend()
# plt.show()

**Overall Explanation**

The decision tree algorithm aims to partition the Iris dataset by making a series of decisions based on the feature values, ultimately creating subsets that are as pure as possible with respect to the Iris species (setosa, versicolor, and virginica). It prioritizes the features that best separate the species.

**Step-by-Step Breakdown with "I" Perspective**

* **Initial State:** I start with the entire Iris dataset, which, as shown in the histograms above, has some overlap between species, particularly in petal length and width. This means the dataset isn't perfectly separated from the beginning.
* **First Split (Root Node):**
    * I observe in the tree plot above that the first split occurs on `petal length (cm) <= 2.45`.
    * Looking at the scatter plot above, I can see that this split effectively isolates the *setosa* species (red points) on the left.
    * So, I'm essentially saying: "If a flower has a petal length of 2.45 cm or less, I confidently classify it as *setosa*." This creates a pure *setosa* leaf node.
* **Second Split (Right Branch):**
    * For the remaining data points (which are *versicolor* and *virginica*), I see that the tree splits on `petal length (cm) <= 4.75`.
    * In the scatter plot, this is the vertical line at around 4.75 cm.  This split further divides the dataset, though there's still some mixing of *versicolor* and *virginica*.
    * I'm now saying: "Of the flowers that are *not* *setosa*, if their petal length is 4.75 cm or less, they are *likely* *versicolor*."
* **Third Split (Left Branch from Second):**
    * The tree then splits the *likely versicolor* branch based on `petal width (cm) <= 1.65`.
    * In the scatter plot, this corresponds to a horizontal boundary at 1.65 cm on the petal width axis.
    * I refine my classification: "Of the non-*setosa* flowers with petal lengths less than or equal to 4.75 cm, if their petal width is also less than or equal to 1.65 cm, I classify them as *versicolor*."  There's still a tiny bit of misclassification here, as seen in the tree's Gini impurity.
* **Fourth Split (Right Branch from Second):**
    * Finally, the tree splits the other branch (petal length > 4.75 cm) based on `petal width (cm) <= 1.75`.
    * This is the horizontal line at 1.75 cm in the scatter plot.
    * I conclude: "The remaining flowers (non-*setosa* with petal lengths greater than 4.75 cm) are classified based on petal width. If the petal width is greater than 1.75 cm, they are classified as *virginica*; otherwise, *versicolor*." Again, there's a small amount of misclassification.

**Intuitive Summary**

The tree essentially creates a series of if-else rules that carve up the data space. It starts with the most important feature (petal length) to make the biggest distinction (separating *setosa*). Then, it refines the classifications using other features (petal width) to separate the remaining species as cleanly as possible. The scatter plots visually show how these rules correspond to dividing the data points into rectangular regions.
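
To make the "series of if-else rules" concrete, here is a hand-written sketch of the four splits walked through above. The function name `classify_iris` is hypothetical, and the class returned on the `petal width > 1.65` side is an assumption, since the walkthrough only names the `<= 1.65` side.

```python
def classify_iris(petal_length, petal_width):
    """Hand-coded version of the splits described above (all values in cm)."""
    if petal_length <= 2.45:
        # First split: short petals are setosa
        return 'setosa'
    if petal_length <= 4.75:
        # Third split; the 'virginica' side is an assumption (not stated in the text)
        return 'versicolor' if petal_width <= 1.65 else 'virginica'
    # Fourth split: longer petals are separated by petal width
    return 'versicolor' if petal_width <= 1.75 else 'virginica'

print(classify_iris(1.4, 0.2))  # setosa
print(classify_iris(4.5, 1.5))  # versicolor
print(classify_iris(5.5, 2.0))  # virginica
```

Each `if` corresponds to one axis-aligned line in the scatter plots, so every return statement covers one rectangular region of the petal length and petal width plane.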

## Random Forests

**Random Forest** is an ensemble supervised learning algorithm that builds multiple decision trees, each trained on a different sample of the data and using random subsets of features. For prediction (classification or regression), it aggregates the predictions of all the individual trees, typically through majority voting (classification) or averaging (regression), to improve accuracy and reduce overfitting.

### Ensemble Learning

In statistics and machine learning, **ensemble methods** use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

https://en.wikipedia.org/wiki/Ensemble_learning

See Ensemble Learning Notebook

The traditional way of performing hyperparameter optimization has been **grid search**, or a parameter sweep, which is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set.

https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search

### Bagging and Boosting

* Out of bag...
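* Bagging trains its models in parallel (independently of one another); boosting trains them sequentially, with each new model focusing on the errors of the previous ones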

### Pruning - Hyperparameters

A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are derived via training.

https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
* https://www.analyticsvidhya.com/blog/2020/03/beginners-guide-random-forest-hyperparameter-tuning/

Here are some default parameters:

<pre>
hyperparameters = {
            'n_estimators': 100,
            'criterion': 'gini',
            'max_depth': None,
            'max_leaf_nodes': None,
            'bootstrap': True
            }

model = RandomForestClassifier()
</pre>

**Parameters vs Hyperparameters**:
* Parameter: Usually estimated or learned from data
* Hyperparameter: Values that are tuned by the data scientist

In [None]:
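# import pandas as pd
# from sklearn.datasets import load_iris
# from sklearn.model_selection import train_test_split

# data = load_iris()
# X = data.data
# y = data.target

# # use data.feature_names: the name `iris` is assigned to the DataFrame on this line
# iris = pd.DataFrame(data=X, columns=data.feature_names)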
# iris['species'] = y
# X_train, X_test, y_train, y_test = train_test_split(iris.drop('species', axis=1),
#                                                     iris['species'],
#                                                     test_size=0.20,
#                                                     random_state=42)

In [None]:
# import matplotlib.pyplot as plt
# from sklearn.tree import DecisionTreeClassifier
# from sklearn import tree
# from sklearn.metrics import accuracy_score

# hyperparameters = {
#             'criterion': 'entropy'
#             }

# model = DecisionTreeClassifier(random_state=42).set_params(**hyperparameters)
# model.fit(X_train, y_train)
# predictions = model.predict(X_test)
# print(accuracy_score(y_test, predictions))

# plt.figure(figsize=(14, 14))
# tree.plot_tree(model,
#               feature_names=X_train.columns,
#               class_names=['setosa', 'versicolor', 'virginica'],
#               filled=False)

# plt.tight_layout();

The tree above keeps splitting until all the leaf nodes are pure, which can lead to overfitting. The next cell introduces some (hyper)parameters that help avoid overfitting.

In [None]:
# import matplotlib.pyplot as plt
# from sklearn.tree import DecisionTreeClassifier
# from sklearn import tree
# from sklearn.metrics import accuracy_score

# hyperparameters = {
#             'criterion': 'entropy',
#             'max_depth': 3,
#             'max_leaf_nodes': 4
#             }

# model = DecisionTreeClassifier(random_state=42).set_params(**hyperparameters)
# model.fit(X_train, y_train)
# predictions = model.predict(X_test)
# print(accuracy_score(y_test, predictions))

# plt.figure(figsize=(14, 14))
# tree.plot_tree(model,
#               feature_names=X_train.columns,
#               class_names=['setosa', 'versicolor', 'virginica'],
#               filled=False)

# plt.tight_layout();

### Tuning Random Forest (Hyper)Parameters

**Focusing on Tree Structure and Complexity:**

* **`min_samples_split`**: This controls the minimum number of samples required to split an internal node.
    * **Suggested Values:** `[2, 5]`
    * **Justification:** A smaller value (like 2) allows for more complex trees that might overfit, while a larger value (like 5) can help prevent overfitting by ensuring nodes only split if they contain a reasonable number of data points. Limiting to two values keeps the search space manageable.
* **`min_samples_leaf`**: This controls the minimum number of samples required to be at a leaf node.
    * **Suggested Values:** `[1, 3]`
    * **Justification:** Similar to `min_samples_split`, a smaller value (like 1) can lead to more complex trees, while a larger value (like 3) can promote more generalized models by requiring a minimum number of samples in the final predictions.
* **`max_features`**: This determines the number of features to consider when looking for the best split at each node.
    * **Suggested Values:** `['sqrt', 0.5]`
    * **Justification:**
        * `'sqrt'` considers the square root of the total number of features. This is a common and often effective default. (`'log2'` is a similar option that uses the base-2 logarithm instead.)
        * `0.5` considers half of the total number of features. This provides a different level of randomness in feature selection compared to 'sqrt'. Limiting to two options keeps the search efficient.

**Hyperparameter Related to Randomness:**

* **`random_state`**: While not directly a tuning parameter for the model's complexity, it's crucial for reproducibility.
    * **Suggested Value:** `[42]` (or any single integer)
    * **Justification:** Setting a `random_state` ensures that the random processes within the Random Forest (like bootstrapping and feature selection) produce the same results each time the code is run. This is important for consistent evaluation and comparison of different hyperparameter settings. It does not belong in the *tuning* grid, but setting it is good practice.

**Considerations for Keeping Execution Time Down:**

* **Number of Hyperparameters:** The grid below has 5 hyperparameters with two values each (2^5 = 32 combinations). Adding 3 more (`min_samples_split`, `min_samples_leaf`, `max_features`), also with two values each, grows the grid to 2^8 = 256 combinations. Be mindful of this: start with a smaller grid and expand it only if time allows.
* **Range of Values:** Keeping the number of values for each hyperparameter limited is key, as in the grid below.

```python
hyperparameters = {
    'n_estimators': [50, 150],         # number of trees in the forest
    'criterion': ['entropy', 'gini'],  # split quality measure
    'max_depth': [3, 5],               # maximum depth of each tree
    'max_leaf_nodes': [6, 10],         # maximum number of leaves per tree
    'bootstrap': [True, False],        # bootstrap samples vs. the whole dataset
}
```
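
To see how quickly a grid grows, the sketch below counts the combinations for the five-parameter grid and for a grid extended with the three extra hyperparameters and their suggested values. The helper name `grid_size` is an assumption; the fold count of 10 matches the `cv = 10` used in the grid search cell later in this notebook.

```python
from itertools import product
from math import prod

grid = {
    'n_estimators': [50, 150],
    'criterion': ['entropy', 'gini'],
    'max_depth': [3, 5],
    'max_leaf_nodes': [6, 10],
    'bootstrap': [True, False],
}

# The three additional hyperparameters with their suggested values
extra = {
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 3],
    'max_features': ['sqrt', 0.5],
}

def grid_size(param_grid):
    """Number of combinations an exhaustive grid search would try."""
    return prod(len(values) for values in param_grid.values())

cv_folds = 10
combinations = list(product(*grid.values()))  # every combination, as tuples

print(len(combinations))                        # 32
print(grid_size({**grid, **extra}))             # 256
print(grid_size({**grid, **extra}) * cv_folds)  # 2560 model fits
```

Every combination is fitted once per cross-validation fold, so the number of model fits is the grid size multiplied by the number of folds.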

**Important Considerations**

* **Justification is Key:** Explain *why* you chose the specific values for each hyperparameter. Your reasoning should be based on your understanding of how these parameters affect the model's bias-variance trade-off, complexity, and potential for overfitting or underfitting.
* **Cross-Validation Strategy:** Use an appropriate cross-validation strategy (e.g., Stratified K-Fold since it's likely a classification task with potentially imbalanced classes) to get a reliable estimate of the model's performance for each hyperparameter combination.
* **Computational Limits:** Keep in mind the constraints on grid size and computation time. You might need to start with a smaller grid and iterate if time permits.

Random forests build many decision trees, each trained on a sample of the data. In scikit-learn, `bootstrap=True` draws each tree's sample with replacement, while `bootstrap=False` uses the whole training set to build each tree.
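
As a rough illustration of sampling with replacement, the sketch below draws one bootstrap sample of row indices and collects the rows that were never drawn (the "out-of-bag" rows for that tree). It uses only the standard library; the variable names and the sample size of 1000 are assumptions, and this is not how scikit-learn implements it internally.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
n_samples = 1000
indices = list(range(n_samples))

# Bootstrap sample: draw n_samples row indices with replacement
in_bag = [rng.choice(indices) for _ in range(n_samples)]

# Out-of-bag rows: indices that were never drawn for this tree
out_of_bag = set(indices) - set(in_bag)

print(f"distinct rows in the bootstrap sample: {len(set(in_bag))}")
print(f"out-of-bag fraction: {len(out_of_bag) / n_samples:.3f}")
```

For large samples the out-of-bag fraction approaches 1/e, about 0.368, so each tree sees roughly 63% of the distinct rows and the remainder can serve as a built-in validation set for that tree.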

**Rule**: Never make adjustments to your model based on test set results.
            

### Random Forest Model with Grid Search

In [None]:
# # create dataframe from sklearn iris dataset; print shape, info, and head
# import pandas as pd
# from sklearn.datasets import load_iris

# iris = load_iris()
# X = iris.data
# y = iris.target

# df = pd.DataFrame(data=X, columns=iris.feature_names)
# df['species'] = y
# print(df.shape)
# print(df.info())
# df.head()

In [None]:
# # train test split using 25% for test size; print X_train shape and head
# from sklearn.model_selection import train_test_split

# X_train, X_test, y_train, y_test = train_test_split(df.drop('species', axis=1),
#                                                     df['species'],
#                                                     test_size=0.25,
#                                                     random_state=42)

# print(X_train.shape)
# print(X_train.head())

In [None]:
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.model_selection import GridSearchCV

# hyperparameters = {
#     'n_estimators': [50, 150],
#     'criterion': ['entropy', 'gini'],
#     'max_depth': [3, 5],
#     'max_leaf_nodes': [6, 10],
#     'bootstrap': [True, False],
# }

# grid_search = GridSearchCV(estimator = RandomForestClassifier(random_state = 42),  # fixed seed for reproducible results
#                            param_grid = hyperparameters,
#                            scoring = 'accuracy',
#                            cv = 10)

# grid_search = grid_search.fit(X_train, y_train)

# best_accuracy = grid_search.best_score_
# best_parameters = grid_search.best_params_

# print('best accuracy', best_accuracy)
# print('best parameters', best_parameters)

In [None]:
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.metrics import accuracy_score

# # model = RandomForestClassifier(bootstrap = False,
# #                                criterion = 'entropy',
# #                                max_depth = 3,
# #                                min_samples_leaf = 5,
# #                                min_samples_split = 4,
# #                                random_state = 42)
# model = RandomForestClassifier(random_state = 42).set_params(**best_parameters)  # ** unpacks the dict into keyword arguments
# model.fit(X_train, y_train)
# predictions = model.predict(X_test)
# print(accuracy_score(y_test, predictions))

## Isolation Forest

### Outliers

Outliers are data points that deviate significantly from the general pattern or distribution of the rest of the data. They are observations that are far removed from the "typical" or "expected" values.

**Types of Outliers:**

* **Point Outliers:** Single data points that are unusual compared to the rest of the data (e.g., a very high income in a neighborhood).
* **Contextual Outliers:** Data points that are outliers only in a specific context (e.g., high electricity usage at night in a residential area).
* **Collective Outliers:** A group of data points that are unusual when considered together, even if individual points might not be outliers on their own (e.g., a coordinated series of small fraudulent transactions).

**Why are Outliers Important?**

* **Data Quality:** They can indicate errors in data collection or measurement.
* **Insights:** They can reveal valuable information about rare events or unusual behavior (e.g., fraud, network intrusions).
* **Model Accuracy:** They can distort statistical analyses and machine learning models.

**Python Methods for Outlier Detection:**

Here are some common Python methods for detecting outliers:

**1. Statistical Methods:**

   * **Z-Score:**
        * Calculates how many standard deviations a data point is away from the mean.
        * Points with a Z-score above a certain threshold (e.g., 3) are considered outliers.

In [None]:
# import numpy as np
# import pandas as pd
# from scipy import stats
# import matplotlib.pyplot as plt
# import seaborn as sns

# normal_values = np.random.normal(loc=15, scale=3, size=20)
# outlier_values = [50, 60]
# data = {'value': np.concatenate([normal_values, outlier_values])}
# df = pd.DataFrame(data)

# df['z_score'] = np.abs(stats.zscore(df['value']))
# df['is_outlier'] = df['z_score'] > 3

# mean = df['value'].mean()
# std = df['value'].std(ddof=0)  # population std, to match stats.zscore (ddof=0 by default)

# plt.figure(figsize=(8, 5))
# sns.histplot(df['value'], kde=True)
# plt.axvline(mean, color='green', linestyle='dashed', linewidth=1, label=f'Mean: {mean:.2f}')
# plt.axvline(mean + std, color='orange', linestyle='dashed', linewidth=1, label=f'+1 Std Dev: {mean + std:.2f}')
# plt.axvline(mean - std, color='orange', linestyle='dashed', linewidth=1, label=f'-1 Std Dev: {mean - std:.2f}')
# plt.axvline(mean + 2 * std, color='cyan', linestyle='dashed', linewidth=1, label=f'+2 Std Dev: {mean + 2 * std:.2f}')
# plt.axvline(mean - 2 * std, color='cyan', linestyle='dashed', linewidth=1, label=f'-2 Std Dev: {mean - 2 * std:.2f}')
# plt.axvline(mean + 3 * std, color='magenta', linestyle='dashed', linewidth=1, label=f'+3 Std Dev: {mean + 3 * std:.2f}')
# plt.axvline(mean - 3 * std, color='magenta', linestyle='dashed', linewidth=1, label=f'-3 Std Dev: {mean - 3 * std:.2f}')
# # plt.axvline(df[df['is_outlier'] == True]['value'].min(), color='red', linestyle='dashed', linewidth=1, label='Outlier Bounds')
# # plt.axvline(df[df['is_outlier'] == True]['value'].max(), color='red', linestyle='dashed', linewidth=1)
# plt.title('Distribution of Data with Outliers and Standard Deviations')
# plt.xlabel('Value')
# plt.ylabel('Frequency')
# plt.legend()
# plt.show()

* **IQR (Interquartile Range):**
  * Defines a "normal" range using the 1st quartile (Q1) and 3rd quartile (Q3).
  * Points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered outliers.

In [None]:
# def find_iqr_outliers(data):
#   q1 = np.percentile(data, 25)
#   q3 = np.percentile(data, 75)
#   iqr = q3 - q1
#   lower_bound = q1 - 1.5 * iqr
#   upper_bound = q3 + 1.5 * iqr
#   return data[(data < lower_bound) | (data > upper_bound)]

# outliers = find_iqr_outliers(df['value'])
# print(outliers)

# plt.figure(figsize=(8, 6))
# sns.boxplot(y=df['value'])
# plt.title('Box Plot with IQR Outlier Visualization')
# plt.ylabel('Value')
# plt.show()

**2. Machine Learning Methods:**

   * **Isolation Forest:**
        * An unsupervised algorithm that isolates outliers by randomly partitioning the data space.
        * Anomalies are "easier" to isolate and have shorter path lengths in the trees.

In [None]:
# from sklearn.ensemble import IsolationForest

# model = IsolationForest(contamination=0.1)  # Adjust contamination
# model.fit(df[['value']])
# df['anomaly'] = model.predict(df[['value']])
# outliers = df[df['anomaly'] == -1]
# print(outliers)

### Anomaly Detection Through Isolation

Isolation Forest is an unsupervised machine learning algorithm specifically designed for **anomaly detection**. Unlike many other anomaly detection methods that try to model "normal" data and then identify deviations, Isolation Forest takes a different approach: it explicitly tries to **isolate** anomalies.

Here's a breakdown of the core ideas behind Isolation Forest:

**1. The Principle of Isolation:**

* Anomalies are data points that are "few and different." This means they have attribute values that are far from the typical values of the majority of the data.
* Because anomalies are different, they should be easier to separate (isolate) from the rest of the data with fewer partitioning steps.

**2. Isolation Trees (iTrees):**

* Isolation Forest builds an ensemble (a "forest") of **Isolation Trees (iTrees)**.
* Each iTree is a binary tree constructed by randomly partitioning the data. The partitioning process is as follows:
    * **Random Feature Selection:** A feature (attribute) from the dataset is randomly selected.
    * **Random Split Value Selection:** A split value within the range of the selected feature is randomly chosen.
    * **Data Partitioning:** The data points are divided into two child nodes based on whether their value for the chosen feature is below or above the split value.
    * This process is repeated recursively until each data point is isolated in its own leaf node or a predefined tree height limit is reached.

**3. Path Length:**

* The key concept in Isolation Forest is the **path length** of a data point in an iTree. The path length is the number of edges traversed from the root of the tree to the leaf node containing that data point.
* **Anomalies tend to have shorter path lengths.** This is because they are different and can be isolated with fewer random partitions.
* **Normal data points tend to have longer path lengths** because they are similar to other points and require more partitions to be isolated.
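
The partitioning steps and the path-length idea can be sketched without any library. The code below is a simplification made for illustration: instead of building and storing trees on subsamples, it follows a single point through fresh random splits of the full data, and it omits the path-length adjustment the original algorithm applies at the height limit. The function names, the depth limit of 8, and the toy data are all assumptions.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=8):
    """Count the random splits needed to isolate `point` (which must be in `data`)."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))            # random feature selection
    lo = min(row[feature] for row in data)
    hi = max(row[feature] for row in data)
    if lo == hi:                                   # cannot split on this feature
        return depth
    split = rng.uniform(lo, hi)                    # random split value in range
    # Keep only the side of the split that contains the point
    if point[feature] < split:
        subset = [row for row in data if row[feature] < split]
    else:
        subset = [row for row in data if row[feature] >= split]
    return path_length(point, subset, rng, depth + 1, max_depth)

rng = random.Random(0)
# Hypothetical data: a 2-D cluster around the origin plus one far-away point
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(62)]
center, outlier = (0.0, 0.0), (10.0, 10.0)
data = cluster + [center, outlier]

def mean_path_length(point, n_trees=200):
    """Average the path length over many independent random partitionings."""
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

print(f"center:  {mean_path_length(center):.2f}")
print(f"outlier: {mean_path_length(outlier):.2f}")
```

The far-away point is usually cut off within the first split or two, while the point in the middle of the cluster survives many more splits, which is the behaviour the path-length description above relies on.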

**4. Anomaly Score:**

* To determine an anomaly score for each data point, the algorithm calculates the average path length of that point across all the iTrees in the forest.
* This average path length is then normalized to produce an **anomaly score** between 0 and 1:
    * Scores closer to 1 indicate a higher likelihood of being an anomaly.
    * Scores closer to 0 indicate that the data point is likely normal.
    * A score around 0.5 typically means the point is neither clearly an anomaly nor a normal instance.
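
The normalization can be written out explicitly. The sketch below follows the scoring formula from the original Isolation Forest paper, s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the average path length of a point and c(n) is the average path length expected for a sample of n points. The function names are assumptions, and the harmonic number is approximated as ln(i) plus the Euler-Mascheroni constant.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def average_path_length(n):
    """c(n): expected path length for a sample of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA   # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """Score in (0, 1]; shorter average paths give scores closer to 1."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

n = 256  # sample size used to build each tree (an assumed value)
for h in (1.0, average_path_length(n), 30.0):
    print(f"mean path length {h:6.2f} -> score {anomaly_score(h, n):.3f}")
```

A point whose average path length equals c(n) scores exactly 0.5, which is why values near 0.5 are read as "no clear signal". Note that scikit-learn's `IsolationForest.score_samples` reports the opposite of this score (lower means more abnormal), so its values should not be compared directly with the 0-to-1 scale described here.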

**In essence, Isolation Forest works by:**

* Randomly partitioning the data in multiple trees.
* Measuring how quickly each data point gets isolated.
* Assigning an anomaly score based on the average isolation path length across all trees. Data points that are isolated in fewer steps are considered more likely to be anomalies.

**Advantages of Isolation Forest:**

* **Efficient:** It has a linear time complexity with respect to the number of data points and features, making it suitable for large datasets.
* **Fast:** The random partitioning process is computationally inexpensive.
* **Effective for High-Dimensional Data:** It can handle datasets with many features.
* **Unsupervised:** It doesn't require labeled anomaly data for training.
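* **Reasonably Tolerant of Irrelevant Features:** Because features are picked at random across many trees, no single feature dominates the result. A large number of irrelevant features can still dilute the signal, so feature selection remains worthwhile.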
* **Low Memory Usage:** The memory requirements are relatively low compared to some other anomaly detection methods.

**Use Cases in Cybersecurity (and beyond):**

* **Network Intrusion Detection:** Identifying unusual network traffic patterns that might indicate attacks.
* **Fraud Detection:** Detecting anomalous financial transactions.
* **Endpoint Security:** Identifying unusual process behavior or file modifications on computers.
* **System Monitoring:** Detecting abnormal resource usage or system events.
* **Industrial Anomaly Detection:** Identifying faulty machinery behavior based on sensor data.

By understanding the principles of isolation, path length, and anomaly scoring, you can apply the Isolation Forest algorithm and explain its results, whether the data are network packets or, as in the example that follows, well log measurements.

Note: The dataset we are using is a subset from a Machine Learning competition run by Xeek and FORCE 2020 (Bormann et al., 2020). It is released under the NLOD 2.0 licence from the Norwegian Government, details of which can be found here: Norwegian Licence for Open Government Data (NLOD) 2.0.

The full dataset can be accessed at the following link: https://doi.org/10.5281/zenodo.4351155.

In [None]:
# import pandas as pd
# import seaborn as sns
# from sklearn.ensemble import IsolationForest

# df = pd.read_csv('https://raw.githubusercontent.com/gitmystuff/Datasets/refs/heads/main/Xeek_Well.csv')
# df.describe()

In [None]:
# df.info()

**Context:**

The cell above calls the `df.describe()` method from the pandas library after reading the CSV file `Xeek_Well.csv` into a DataFrame called `df`. This method provides descriptive statistics for the numerical columns in the DataFrame.

**Dataset Description:**

The dataset appears to contain well log data, likely from an oil or gas exploration context. Each row represents a measurement taken at a specific depth in the well, and the columns represent various geophysical measurements.

**Columns and Their Descriptive Statistics:**

Here's a detailed explanation of each column and the statistics provided:

* **DEPTH_MD:**
    * This column likely represents the measured depth in meters (or feet) in the well.
    * **count:** 17717 - There are 17,717 data points (depth measurements).
    * **mean:** 1837.363674 - The average depth is approximately 1837.36 meters.
    * **std:** 784.314256 - The standard deviation indicates the spread or variability of the depth measurements.
    * **min:** 485.256000 - The minimum depth recorded is 485.256 meters.
    * **25%:** 1158.464000 - 25% of the depth measurements are less than or equal to 1158.464 meters.
    * **50%:** 1831.672000 - The median depth is 1831.672 meters (half the measurements are above, half are below).
    * **75%:** 2515.672000 - 75% of the depth measurements are less than or equal to 2515.672 meters.
    * **max:** 3200.128000 - The maximum depth recorded is 3200.128 meters.

* **CALI:**
    * This column likely represents the caliper log, which measures the diameter of the wellbore.
    * **count:** 17635.000000 - There are 17,635 caliper measurements (note: fewer than DEPTH_MD, indicating missing data).
    * **mean:** 14.006030 - The average caliper measurement is 14.006030 (units depend on the tool).
    * **std:** 3.873367 - The standard deviation indicates the variability in wellbore diameter.
    * **min:** 7.325138 - The minimum caliper measurement.
    * **25%:** 12.045594
    * **50%:** 13.956721
    * **75%:** 17.324830
    * **max:** 25.717396

* **RDEP:**
    * This column likely represents the deep resistivity log, which measures the electrical resistance of the formation far from the wellbore.
    * **count:** 17717.000000
    * **mean:** 1.474893
    * **std:** 1.356896
    * **min:** 0.264479
    * **25%:** 0.760341
    * **50%:** 1.007371
    * **75%:** 1.554278
    * **max:** 14.046203

* **RHOB:**
    * This column likely represents the bulk density log, which measures the density of the formation.
    * **count:** 17521.000000 - Again, missing data.
    * **mean:** 2.134858
    * **std:** 0.223124
    * **min:** 1.438999
    * **25%:** 1.978679
    * **50%:** 2.042522
    * **75%:** 2.333431
    * **max:** 2.648847

* **GR:**
    * This column likely represents the gamma ray log, which measures the natural radioactivity of the formation.
    * **count:** 17717.000000
    * **mean:** 59.154202
    * **std:** 29.483140
    * **min:** 6.024419
    * **25%:** 41.260944
    * **50%:** 62.451527
    * **75%:** 75.398460
    * **max:** 804.298950 - This high max value suggests potential outliers or unusual readings.

* **NPHI:**
    * This column likely represents the neutron porosity log, which measures the porosity (pore space) of the formation.
    * **count:** 13346.000000 - Significant missing data.
    * **mean:** 0.384906
    * **std:** 0.152182
    * **min:** 0.039013
    * **25%:** 0.249594
    * **50%:** 0.451589
    * **75%:** 0.510851
    * **max:** 0.733152

* **PEF:**
    * This column likely represents the photoelectric effect log, which measures the formation's response to gamma rays.
    * **count:** 17662.000000
    * **mean:** 4.095857
    * **std:** 8.318817 - A relatively high standard deviation suggests significant variability.
    * **min:** 1.525528
    * **25%:** 2.400372
    * **50%:** 2.910137
    * **75%:** 4.222030
    * **max:** 365.575592 - The extremely high max value points to outliers.

* **DTC:**
    * This column likely represents the compressional wave slowness (Delta-T Compressional) log, which measures the time it takes for a compressional sound wave to travel through the formation.
    * **count:** 17708.000000
    * **mean:** 127.240157
    * **std:** 36.507057
    * **min:** 7.415132
    * **25%:** 88.318405
    * **50%:** 142.943245
    * **75%:** 153.226116
    * **max:** 207.382553

**Observations:**

* **Missing Data:** The `CALI`, `RHOB`, `PEF`, `DTC`, and especially `NPHI` columns have fewer data points than `DEPTH_MD` (17,717 rows), indicating missing values. `NPHI` has only 13,346 values. This needs to be addressed before using the data for analysis or modeling.
* **Outliers:** The `GR` and `PEF` columns have very high maximum values, suggesting potential outliers that could skew statistical analyses.
* **Data Ranges:** The columns have vastly different ranges and units, which is typical for well log data. This often necessitates scaling or normalization before applying machine learning algorithms.
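
A quick way to quantify the missing data noted above is to count the nulls per column. The sketch below uses a tiny made-up DataFrame (the values are placeholders, not rows from the Xeek dataset) so that it runs without downloading anything; the same calls can be applied to the real `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature well log with gaps; values are illustrative only.
toy = pd.DataFrame({
    "DEPTH_MD": [500.0, 500.2, 500.4, 500.6, 500.8],
    "RHOB":     [2.01, np.nan, 2.05, 2.10, 2.08],
    "NPHI":     [np.nan, np.nan, 0.45, 0.44, np.nan],
})

missing_per_column = toy.isna().sum()     # nulls in each column
complete_rows = toy.dropna()              # rows with no nulls at all

print(missing_per_column)
print(f"rows before dropna: {len(toy)}, after: {len(complete_rows)}")
```

Dropping every row that has any null is the simplest option and is what the notebook does next, at the cost of discarding rows where only one log is missing.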

**In Summary:**

This dataset contains well log measurements with varying statistical properties and issues like missing data and outliers. Understanding these characteristics is crucial for proper data cleaning, preprocessing, and analysis to extract meaningful insights about the subsurface formation.

In [None]:
# print(df.shape)
# df = df.dropna()
# df.info()

In [None]:
# anomaly_inputs = ['NPHI', 'RHOB']

In [None]:
# model_IF = IsolationForest(contamination=float(0.1),random_state=42)

In [None]:
# model_IF.fit(df[anomaly_inputs])

1.  **`anomaly_inputs = ['NPHI', 'RHOB']`**:
    * This line creates a list named `anomaly_inputs`.
    * The list contains two strings: 'NPHI' and 'RHOB'.
    * These strings likely represent column names in a Pandas DataFrame.
    * The code indicates that these two columns ('NPHI' and 'RHOB') will be used as the input features for the anomaly detection model.

2.  **`model_IF = IsolationForest(contamination=float(0.1), random_state=42)`**:
    * This line creates an instance of the `IsolationForest` class from scikit-learn.
    * The instance is assigned to the variable `model_IF`.
    * `contamination=float(0.1)`: This parameter tells the Isolation Forest algorithm to expect that approximately 10% of the data points are anomalies. It helps the algorithm determine a threshold for classifying data points as anomalies.
    * `random_state=42`: This parameter sets the random seed for the algorithm. Setting a random seed ensures that the results of the algorithm are reproducible. If you run the code multiple times with the same random seed, you will get the same results.

3.  **`model_IF.fit(df[anomaly_inputs])`**:
    * This line trains the `model_IF` (the Isolation Forest model) on the data.
    * `df[anomaly_inputs]`: This selects the columns specified in the `anomaly_inputs` list (i.e., 'NPHI' and 'RHOB') from a Pandas DataFrame called `df`. It creates a new DataFrame containing only these two columns.
    * `model_IF.fit(...)`: The `fit()` method builds the isolation trees from the selected columns and, because `contamination` is set, derives the score threshold that separates inliers from outliers.

In summary, the code selects two columns from a DataFrame, configures an Isolation Forest model to detect anomalies assuming 10% contamination, and then trains the model on the selected data.

In [None]:
# df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])
# df['anomaly'] = model_IF.predict(df[anomaly_inputs])

In [None]:
# df.loc[:, ['NPHI', 'RHOB','anomaly_scores','anomaly'] ]

The code performs anomaly detection using a trained Isolation Forest model on a Pandas DataFrame. It calculates anomaly scores and assigns anomaly labels to the data.

Here's a step-by-step explanation:

1.  **`df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])`**:
    * This line calculates an anomaly score for each data point using the trained Isolation Forest model (`model_IF`).
    * `model_IF.decision_function(df[anomaly_inputs])`:
        * `model_IF`: This is the trained Isolation Forest model.
        * `decision_function()`: This method returns a score for each data point that reflects how easily the point is isolated. Note that scikit-learn's sign convention is the reverse of the 0-to-1 score described earlier: lower values indicate more anomalous points, and negative values correspond to points that `predict()` labels as outliers.
        * `df[anomaly_inputs]`: This selects the columns from the DataFrame `df` that were used as input features during the model training. These are the same columns for which we now want to calculate anomaly scores.
    * `df['anomaly_scores'] = ...`: The calculated anomaly scores are then assigned to a new column named `anomaly_scores` in the DataFrame `df`.

2.  **`df['anomaly'] = model_IF.predict(df[anomaly_inputs])`**:
    * This line assigns an anomaly label to each data point. Internally, `predict()` recomputes the same scores as `decision_function()` and applies the threshold implied by `contamination`.
    * `model_IF.predict(df[anomaly_inputs])`:
        * `model_IF`: The trained Isolation Forest model.
        * `predict()`: This is a method of the Isolation Forest model that assigns a label to each data point. Typically, it assigns:
            * `1`: for normal data points (inliers).
            * `-1`: for anomalous data points (outliers).
        * `df[anomaly_inputs]`: Again, this selects the input feature columns from the DataFrame.
    * `df['anomaly'] = ...`: The predicted anomaly labels (1 or -1) are assigned to a new column named `anomaly` in the DataFrame `df`.

3.  **`df.loc[:, ['NPHI', 'RHOB', 'anomaly_scores', 'anomaly']]`**:
    * This line selects and displays specific columns from the DataFrame `df`.
    * `df.loc[:, ... ]`: This is a way to select rows and columns by label in Pandas.
        * `:`: Selects all rows.
        * `['NPHI', 'RHOB', 'anomaly_scores', 'anomaly']`: Selects the columns named 'NPHI', 'RHOB', 'anomaly_scores', and 'anomaly'.
    * The result of this line is a new DataFrame (or a view of the original DataFrame) containing only the specified columns, which is then displayed. This allows you to see the original data ('NPHI', 'RHOB') along with the calculated anomaly information.

In summary, the code calculates anomaly scores and labels using the Isolation Forest model and then displays a subset of the DataFrame to show the original data and the anomaly results.
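
The relationship between the scores and the labels can be checked directly. The sketch below runs on synthetic data as a stand-in for `df[anomaly_inputs]` (an assumption made so the example is self-contained) and uses scikit-learn's `score_samples`, `decision_function`, `predict`, and the fitted `offset_` attribute.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))            # synthetic stand-in for df[anomaly_inputs]

model = IsolationForest(contamination=0.1, random_state=42).fit(X)

raw = model.score_samples(X)             # negated paper-style score, in [-1, 0]
shifted = model.decision_function(X)     # raw score minus the fitted offset_
labels = model.predict(X)                # -1 for outliers, 1 for inliers

# decision_function is score_samples shifted by the learned threshold.
print(np.allclose(shifted, raw - model.offset_))
# predict flags the points whose decision_function is negative.
print(np.array_equal(labels, np.where(shifted < 0, -1, 1)))
print(f"fraction flagged: {np.mean(labels == -1):.3f}")
```

This is why lower values in the `anomaly_scores` column mean "more anomalous": the column holds the shifted score, and zero is the cut-off between the two labels.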

In [None]:
# import pandas as pd
# import matplotlib.pyplot as plt
# import seaborn as sns

# print("Mean values for normal and anomalous data:")
# print(df.groupby('anomaly')[['NPHI', 'RHOB', 'anomaly_scores']].mean())

# print("\nStandard deviations:")
# print(df.groupby('anomaly')[['NPHI', 'RHOB', 'anomaly_scores']].std())

# print("\nDescriptive statistics for normal data:")
# print(df[df['anomaly'] == 1][['NPHI', 'RHOB']].describe())  # Normal data

# print("\nDescriptive statistics for anomalous data:")
# print(df[df['anomaly'] == -1][['NPHI', 'RHOB']].describe()) # Anomalous data

# # Box plots to compare distributions
# plt.figure(figsize=(10, 6))
# sns.boxplot(x='anomaly', y='NPHI', data=df)
# plt.title('Box Plot of NPHI by Anomaly Group')
# plt.show()

# plt.figure(figsize=(10, 6))
# sns.boxplot(x='anomaly', y='RHOB', data=df)
# plt.title('Box Plot of RHOB by Anomaly Group')
# plt.show()

# # Histograms to see frequency distributions
# plt.figure(figsize=(10, 6))
# sns.histplot(data=df, x='anomaly_scores', hue='anomaly', kde=True)
# plt.title('Distribution of Anomaly Scores')
# plt.show()

In [None]:
# def outlier_plot(data, outlier_method_name, x_var, y_var,
#                  xaxis_limits=[0,1], yaxis_limits=[0,1]):

#     print(f'Outlier Method: {outlier_method_name}')

#     # Create a dynamic title based on the method
#     method = f'{outlier_method_name}_anomaly'

#     # Print out key statistics
#     print(f"Number of anomalous values {len(data[data['anomaly']==-1])}")
#     print(f"Number of non anomalous values  {len(data[data['anomaly']== 1])}")
#     print(f'Total Number of Values: {len(data)}')

#     # Create the chart using seaborn
#     g = sns.FacetGrid(data, col='anomaly', height=4, hue='anomaly', hue_order=[1,-1])
#     g.map(sns.scatterplot, x_var, y_var)
#     g.fig.suptitle(f'Outlier Method: {outlier_method_name}', y=1.10, fontweight='bold')
#     g.set(xlim=xaxis_limits, ylim=yaxis_limits)
#     axes = g.axes.flatten()
#     axes[0].set_title(f"Outliers\n{len(data[data['anomaly']== -1])} points")
#     axes[1].set_title(f"Inliers\n {len(data[data['anomaly']==  1])} points")
#     return g

**1. Function Definition:**

```python
def outlier_plot(data, outlier_method_name, x_var, y_var, xaxis_limits=[0,1], yaxis_limits=[0,1]):
```

* This line defines a function named `outlier_plot`.
* It takes several arguments:
    * `data`: The Pandas DataFrame containing the data.
    * `outlier_method_name`: A string representing the name of the outlier detection method used (e.g., "Isolation Forest").
    * `x_var`: The name of the column in the DataFrame to be used for the x-axis of the plot.
    * `y_var`: The name of the column in the DataFrame to be used for the y-axis of the plot.
    * `xaxis_limits`: (Optional) A list specifying the x-axis limits for the plot. It defaults to [0, 1].
    * `yaxis_limits`: (Optional) A list specifying the y-axis limits for the plot. It defaults to [0, 1].

**2. Print Outlier Method Name:**

```python
print(f'Outlier Method: {outlier_method_name}')
```

* This line prints the name of the outlier detection method to the console. This helps in identifying the results.

**3. Create a Dynamic Title:**

```python
method = f'{outlier_method_name}_anomaly'
```

* This line creates a string variable `method` by appending "_anomaly" to the `outlier_method_name`. The variable is not used anywhere else in the function as written (the labels are always read from the `anomaly` column), so it has no effect on the output.

**4. Print Key Statistics:**

```python
print(f"Number of anomalous values {len(data[data['anomaly']== -1])}")
print(f"Number of non anomalous values {len(data[data['anomaly']== 1])}")
print(f"Total Number of Values: {len(data)}")
```

* These lines print some important statistics about the detected anomalies:
    * `len(data[data['anomaly'] == -1])`: Calculates the number of data points labeled as anomalous (assuming -1 represents anomalies).
    * `len(data[data['anomaly'] == 1])`: Calculates the number of data points labeled as non-anomalous (assuming 1 represents normal data).
    * `len(data)`: Prints the total number of data points in the DataFrame.

**5. Create the Chart using Seaborn:**

```python
g = sns.FacetGrid(data, col='anomaly', height=4, hue='anomaly', hue_order=[1,-1])
g.map(sns.scatterplot, x_var, y_var)
```

* This section uses the Seaborn library to create a visualization:
    * `g = sns.FacetGrid(data, col='anomaly', height=4, hue='anomaly', hue_order=[1,-1])`: Creates a `FacetGrid`. This is a multi-plot grid that allows you to create separate plots for different subsets of your data.
        * `data`: The DataFrame.
        * `col='anomaly'`:  Specifies that the grid should have columns based on the unique values in the 'anomaly' column (i.e., it will create one subplot for anomalies and one for non-anomalies).
        * `height=4`: Sets the height of each subplot.
        * `hue='anomaly'`:  Colors the data points based on the 'anomaly' column.
        * `hue_order=[1,-1]`:  Specifies the order in which the hue levels are assigned colors, so inliers (`1`) take the first palette color and outliers (`-1`) the second.
    * `g.map(sns.scatterplot, x_var, y_var)`:  Maps a scatter plot onto each facet (subplot) of the grid.
        * `sns.scatterplot`:  The type of plot (a scatter plot).
        * `x_var`:  The column to use for the x-axis.
        * `y_var`:  The column to use for the y-axis.

**6. Set Chart Titles and Labels:**

```python
g.fig.suptitle(f'Outlier Method: {outlier_method_name}', y=1.10, fontweight='bold')
g.set(xlim=xaxis_limits, ylim=yaxis_limits)

axes = g.axes.flatten()
axes[0].set_title(f"Outliers\n{len(data[data['anomaly'] == -1])} points")
axes[1].set_title(f"Inliers\n{len(data[data['anomaly'] == 1])} points")
```

* `g.fig.suptitle(...)`: Sets the overall title of the entire figure (the grid of plots).
    * `f'Outlier Method: {outlier_method_name}'`:  Uses an f-string to include the name of the outlier detection method in the title.
    * `y=1.10`:  Adjusts the vertical position of the title.
    * `fontweight='bold'`:  Makes the title bold.
* `g.set(xlim=xaxis_limits, ylim=yaxis_limits)`:  Sets the x-axis and y-axis limits for all subplots, using the `xaxis_limits` and `yaxis_limits` arguments passed to the function.
* `axes = g.axes.flatten()`:  Gets a 1D array of all the axes (subplots) in the grid.
* `axes[0].set_title(...)`: Sets the title of the first subplot. Seaborn orders the facets by the sorted values of the `anomaly` column, so the first subplot holds the outliers (`-1`). The title includes the number of outlier points.
* `axes[1].set_title(...)`: Sets the title of the second subplot, which holds the inliers (`1`). The title includes the number of inlier points.

**7. Return the Plot:**

```python
return g
```

* The function returns the `FacetGrid` object (`g`), which can be further customized or displayed.

**In summary, this `outlier_plot` function takes data, the outlier detection method name, and two variables to plot, and then generates a visualization that:**

* Shows the data points in a scatter plot.
* Separates the data into two plots: one for outliers and one for normal data.
* Provides titles and labels that clearly indicate the outlier detection method and the number of points in each group.

This function is a useful tool for visualizing and understanding the results of outlier detection algorithms.

In [None]:
# outlier_plot(df, 'Isolation Forest', 'NPHI', 'RHOB', [0, 0.8], [3, 1.5]);

1.  **Function Call:**

    ```python
    outlier_plot(df, 'Isolation Forest', 'NPHI', 'RHOB', [0, 0.8], [3, 1.5]);
    ```

    * The `outlier_plot` function is called.
    * It's passed the following arguments:
        * `df`:  A Pandas DataFrame containing the data.
        * `'Isolation Forest'`:  The name of the outlier detection method used.
        * `'NPHI'`:  The column from the DataFrame to be used for the x-axis of the plot.
        * `'RHOB'`: The column from the DataFrame to be used for the y-axis of the plot.
        * `[0, 0.8]`:  The x-axis limits for the plot.
        * `[3, 1.5]`: The y-axis limits for the plot. Because the larger value comes first, the axis is inverted, so `RHOB` increases downward.

2.  **Function Execution:**

    Inside the `outlier_plot` function (as defined in a previous snippet), the following actions are performed:

    * **Print Outlier Method Name:** The string 'Outlier Method: Isolation Forest' is printed.
    * **Print Key Statistics:**
        * The number of anomalous values is calculated and printed. This is done by counting the rows in the DataFrame where the 'anomaly' column is equal to -1.
        * The number of non-anomalous values is calculated and printed. This is done by counting the rows in the DataFrame where the 'anomaly' column is equal to 1.
        * The total number of values (rows) in the DataFrame is printed.
    * **Create the Chart:**
        * A Seaborn `FacetGrid` is created, which will generate a multi-plot grid. The data is split into two columns based on the 'anomaly' column. Data points are colored based on the 'anomaly' column.
        * A scatter plot is mapped onto each facet of the grid, using 'NPHI' for the x-axis and 'RHOB' for the y-axis.
        * Titles and labels are set for the figure and the individual subplots. The number of anomalous and non-anomalous points is included in the subplot titles.

3.  **Output:**

    The function generates a visualization (and prints some statistics) that shows:

    * A scatter plot of 'NPHI' versus 'RHOB'.
    * Two separate plots: one for data points identified as outliers, and one for data points identified as inliers (normal data).
    * The number of data points classified as outliers and inliers.
    * The name of the outlier detection method used (Isolation Forest).

In summary, this code segment calls a plotting function to visualize and summarize the results of an Isolation Forest anomaly detection analysis, displaying the relationship between two variables ('NPHI' and 'RHOB') and highlighting the separation between detected outliers and normal data.

In [None]:
# model_IF = IsolationForest(contamination=float(0.3), random_state=42)
# model_IF.fit(df[anomaly_inputs])

# df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])
# df['anomaly'] = model_IF.predict(df[anomaly_inputs])
# outlier_plot(df, 'Isolation Forest', 'NPHI', 'RHOB', [0, 0.8], [3, 1.5]);

1.  **Model Initialization and Training:**

    ```python
    model_IF = IsolationForest(contamination=float(0.3), random_state=42)
    model_IF.fit(df[anomaly_inputs])
    ```

    * An Isolation Forest model (`model_IF`) is created.
    * `contamination=float(0.3)`: This parameter is set, indicating the model is expected to find approximately 30% of the data points as anomalies.
    * `random_state=42`: A seed is set for the random number generator, ensuring the model's behavior is reproducible across different runs.
    * `model_IF.fit(df[anomaly_inputs])`: The model is trained using the data specified by the `anomaly_inputs` variable. This means the model learns what "normal" data looks like.

2.  **Anomaly Score and Label Assignment:**

    ```python
    df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])
    df['anomaly'] = model_IF.predict(df[anomaly_inputs])
    ```

    * `df['anomaly_scores'] = ...`:  Anomaly scores are calculated for each data point using the trained model's `decision_function()`. Lower scores indicate a higher likelihood of being an anomaly. These scores are stored in a new column named `anomaly_scores` in the DataFrame.
    * `df['anomaly'] = ...`: Anomaly labels are assigned to each data point using the trained model's `predict()` method.  The typical output is 1 for normal data points and -1 for anomalies. These labels are stored in a new column named `anomaly`.

3.  **Visualization:**

    ```python
    outlier_plot(df, 'Isolation Forest', 'NPHI', 'RHOB', [0, 0.8], [3, 1.5]);
    ```

    * The `outlier_plot` function defined earlier is called to visualize the results.
    * It draws scatter plots of the data, with the points labeled as anomalies in one panel and those labeled as normal in the other.
    * The plot uses the 'NPHI' and 'RHOB' columns as x and y axes, respectively.
    * The x-axis limits are set to [0, 0.8], and the y-axis limits are set to [3, 1.5].
    * The function also prints the summary statistics shown next.

4.  **Output Statistics:**

    ```
    Outlier Method: Isolation Forest
    Number of anomalous values 3986
    Number of non anomalous values 9304
    Total Number of Values: 13290
    ```

    * These lines print:
        * The name of the method used: "Isolation Forest."
        * The number of data points classified as anomalies: 3986.
        * The number of data points classified as non-anomalous: 9304.
        * The total number of data points: 13290.

**In essence, the code:**

1.  Trains an Isolation Forest model on the selected feature columns (`NPHI` and `RHOB`) of the cleaned DataFrame.
2.  Uses that model to assign anomaly scores and labels to the same rows.
3.  Visualizes the results and provides summary statistics about the number of anomalies detected.

The key takeaway is that the Isolation Forest algorithm identifies data points that deviate significantly from the learned "normal" patterns, and these deviations are then highlighted and quantified.

**The `contamination` Parameter**

The primary driver of the difference in the number of anomalous values is the `contamination` parameter of the `IsolationForest` algorithm in scikit-learn.

* **What it does:** The `contamination` parameter is an estimate of the proportion of outliers in the dataset. It's a crucial parameter that influences how the Isolation Forest algorithm determines the threshold for classifying data points as anomalies.
* **How it affects the results:**
    * If you set a higher `contamination` value (e.g., 0.3, meaning you expect 30% outliers), the algorithm will be more aggressive in flagging data points as anomalies. It will find more points that it considers "different" to reach that target.
    * If you set a lower `contamination` value (e.g., 0.01, meaning you expect 1% outliers), the algorithm will be more conservative and flag fewer points as anomalies. It will only identify the most extreme deviations.

**Why the Numbers Change in Your Examples**

The different numbers of anomalous and non-anomalous values follow directly from the `contamination` values used in the two runs of the Isolation Forest algorithm.

* **Example 1 (1329 Anomalies):** With `contamination=0.1`, the output shows:

    ```
    Number of anomalous values 1329
    Number of non anomalous values 11961
    Total Number of Values: 13290
    ```

    1329 of the 13290 points, or 10%, are classified as outliers, matching the contamination setting.

* **Example 2 (3986 Anomalies):** With `contamination=0.3`, the output shows:

    ```
    Number of anomalous values 3986
    Number of non anomalous values 9304
    Total Number of Values: 13290
    ```

    3986 of the 13290 points, or about 30%, are now labeled as anomalous, again matching the contamination setting.

**Important Considerations**

* **Choosing `contamination`:**
    * Ideally, you should have some prior knowledge or estimate of the actual proportion of outliers in your data.
    * If you don't have a good estimate, you might need to experiment with different `contamination` values and evaluate the results based on your domain knowledge and the visual patterns in the data.
* **Impact on Interpretation:** The `contamination` parameter heavily influences how you interpret the results. A higher value might highlight more subtle deviations, while a lower value will focus on the most extreme outliers.

In conclusion, the difference in the number of detected anomalies is directly caused by the different settings of the `contamination` parameter in the Isolation Forest algorithm.
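
The effect of `contamination` can be isolated with a small experiment. The sketch below uses synthetic data (an assumption, so that it runs on its own) and fits two models that differ only in `contamination`; with the same `random_state`, the raw scores from `score_samples` should be identical, and only the threshold stored in `offset_` moves.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))                    # synthetic data, not the well logs

results = {}
for contamination in (0.1, 0.3):
    model = IsolationForest(contamination=contamination, random_state=42).fit(X)
    results[contamination] = {
        "raw_scores": model.score_samples(X),    # independent of contamination
        "n_flagged": int(np.sum(model.predict(X) == -1)),
        "offset": model.offset_,                 # threshold on the raw scores
    }
    print(contamination, results[contamination]["n_flagged"],
          round(results[contamination]["offset"], 3))

same_scores = np.allclose(results[0.1]["raw_scores"], results[0.3]["raw_scores"])
print("raw scores identical across runs:", same_scores)
```

In other words, `contamination` does not change how anomalous each point looks to the model; it only decides how far down the ranking the cut-off is placed.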

In [None]:
# anomaly_inputs = ['NPHI', 'RHOB', 'GR', 'CALI', 'PEF', 'DTC']
# model_IF = IsolationForest(contamination=0.1, random_state=42)
# model_IF.fit(df[anomaly_inputs])
# df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])
# df['anomaly'] = model_IF.predict(df[anomaly_inputs])

In [None]:
# outlier_plot(df, 'Isolation Forest', 'NPHI', 'RHOB', [0, 0.8], [3, 1.5]);

1.  **Feature Selection:**

    ```python
    anomaly_inputs = ['NPHI', 'RHOB', 'GR', 'CALI', 'PEF', 'DTC']
    ```

    * A list named `anomaly_inputs` is created, containing the names of six columns: 'NPHI', 'RHOB', 'GR', 'CALI', 'PEF', and 'DTC'. These column names likely represent different well log measurements. This means the Isolation Forest model will now consider these six variables to identify anomalies, instead of just 'NPHI' and 'RHOB' as in previous examples.

2.  **Model Initialization and Training:**

    ```python
    model_IF = IsolationForest(contamination=0.1, random_state=42)
    model_IF.fit(df[anomaly_inputs])
    ```

    * An Isolation Forest model (`model_IF`) is created.
    * `contamination=0.1`: The model is configured to expect 10% of the data points to be anomalies.
    * `random_state=42`: A random seed is set for reproducibility.
    * `model_IF.fit(df[anomaly_inputs])`: The model is trained using the specified columns from the DataFrame `df`.

3.  **Anomaly Score and Label Assignment:**

    ```python
    df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])
    df['anomaly'] = model_IF.predict(df[anomaly_inputs])
    ```

    * `df['anomaly_scores'] = ...`: Anomaly scores are calculated for each data point using the trained model. Lower scores indicate a higher likelihood of being an anomaly. These scores are stored in the `anomaly_scores` column.
    * `df['anomaly'] = ...`: Anomaly labels (1 for normal, -1 for anomaly) are assigned to each data point and stored in the `anomaly` column.

4.  **Visualization:**

    ```python
    outlier_plot(df, 'Isolation Forest', 'NPHI', 'RHOB', [0, 0.8], [3, 1.5]);
    ```

    * The `outlier_plot` function is called to visualize the results.
    * It's passed the DataFrame, the method name, 'NPHI' and 'RHOB' as the variables to plot, and the x and y axis limits.

5.  **Output Statistics:**

    ```
    Outlier Method: Isolation Forest
    Number of anomalous values 1329
    Number of non anomalous values 11961
    Total Number of Values: 13290
    ```

    * The output provides:
        * The name of the method: "Isolation Forest."
        * The number of anomalous values: 1329.
        * The number of non-anomalous values: 11961.
        * The total number of values: 13290.

**Key Differences from Previous Examples:**

The main difference here is the **input features** used for anomaly detection. In this case, the Isolation Forest model is trained and makes predictions based on all six columns: 'NPHI', 'RHOB', 'GR', 'CALI', 'PEF', and 'DTC'. The previous examples used only 'NPHI' and 'RHOB'.

Because `contamination` is again 0.1, the number of flagged points is the same as in the first two-feature run (1329 of 13290). What changes is *which* points are flagged: the model now considers more information to determine what is "normal" and what is "anomalous," so a point that looks ordinary in the NPHI-RHOB plot can still be labeled an outlier because of its values in the other four columns.
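
A small synthetic experiment illustrates how the feature set changes the labels. Everything here is an assumption made for illustration: the data are random, and the `special` point is constructed to be ordinary in the first two features but extreme in the third.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))          # three synthetic features
# Hypothetical point: ordinary in features 0 and 1, extreme in feature 2.
special = np.array([[0.0, 0.0, 50.0]])
X = np.vstack([X, special])

def flag(features):
    """Fit on the chosen feature columns and return the -1/1 labels."""
    model = IsolationForest(contamination=0.1, random_state=42)
    return model.fit_predict(X[:, features])

labels_two = flag([0, 1])              # analogous to ['NPHI', 'RHOB']
labels_three = flag([0, 1, 2])         # analogous to adding more logs

print("flagged with two features:  ", int(np.sum(labels_two == -1)))
print("flagged with three features:", int(np.sum(labels_three == -1)))
print("special point labels:", labels_two[-1], labels_three[-1])
```

The number of flagged points stays close to 10% in both runs, but the constructed point is expected to switch from inlier to outlier once the third feature is included.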

In [None]:
# palette = ['#ff7f0e', '#1f77b4']
# sns.pairplot(df, vars=anomaly_inputs, hue='anomaly', palette=palette)

**Purpose of a Pairplot**

A pairplot is a powerful visualization tool that helps you understand the relationships between multiple variables in your dataset. It creates a matrix of plots:

* **Diagonal Subplots:** On the diagonal, you'll typically find the univariate distribution of each variable (e.g., histograms or kernel density estimates). This shows you the shape and spread of individual variables.
* **Off-Diagonal Subplots:** The off-diagonal subplots are scatter plots. Each scatter plot shows the relationship between two different variables.

**Why Pairplots Are Useful in Anomaly Detection**

1.  **Visualizing Relationships:**
    * Anomaly detection often relies on the assumption that anomalies deviate from the normal relationships between variables. Pairplots help you visually identify these deviations.
    * For example, if two variables are normally strongly correlated, anomalies might appear as points that fall far off the general trend.

2.  **Identifying Potential Separability:**
    * Pairplots can reveal if anomalies tend to cluster in certain regions of the data space. This can help you understand if the anomaly detection algorithm is effectively separating anomalies from normal data.

3.  **Feature Analysis:**
    * Pairplots can give you insights into which features might be most useful for distinguishing anomalies. Variables that show clear separation between normal and anomalous points in the scatter plots are likely to be important.

4.  **Data Exploration:**
    * Before or after anomaly detection, pairplots are helpful for general data exploration. They can reveal patterns, correlations, and potential issues (e.g., outliers) in the data itself, which might be relevant to the anomaly detection process.
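
The feature-analysis idea above can also be checked numerically. The sketch below is one simple way to do that, under assumptions: the data are synthetic, `demo_df` is a hypothetical DataFrame, and the "standardized gap between group means" is an illustrative measure rather than anything the pairplot itself computes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_normal, n_anom = 950, 50

# Hypothetical data: 'GR' is shifted for anomalous rows, 'CALI' is not.
demo_df = pd.DataFrame({
    'GR': np.concatenate([rng.normal(0, 1, n_normal), rng.normal(4, 1, n_anom)]),
    'CALI': np.concatenate([rng.normal(0, 1, n_normal), rng.normal(0, 1, n_anom)]),
    'anomaly': np.concatenate([np.ones(n_normal, dtype=int),
                               -np.ones(n_anom, dtype=int)]),
})

features = ['GR', 'CALI']
group_means = demo_df.groupby('anomaly')[features].mean()

# Gap between the anomalous (-1) and normal (1) group means, in units of each
# feature's overall standard deviation. Larger values mean clearer separation.
separation = (group_means.loc[-1] - group_means.loc[1]).abs() / demo_df[features].std()
print(separation.sort_values(ascending=False))
```

A feature with a large gap here should also show visibly separated colors in its pairplot panels, while a feature with a gap near zero contributes little on its own.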

**In this notebook, the pairplot is used to:**

* Visually explore the relationships between the features used in the anomaly detection.
* See how the anomalies (drawn in a different color, keyed by the `anomaly` hue in the legend) are distributed with respect to these relationships.
* Gain a better understanding of the characteristics that make the anomalies different from the normal data.

**Left Plot (Vertical Pattern):**

* **X-axis:** This axis has a very limited range, with most data points clustered at the leftmost side. The scale suggests it might represent a variable with low values.
* **Y-axis:** This axis has a wider range of values.
* **Relationship:** There's a strong vertical pattern.
    * The "normal" data (blue) is concentrated in a narrow band on the left.
    * The "anomalies" (orange) are scattered more widely along the vertical axis, but mostly on the right side of the blue cluster.

    **Interpretation:** This suggests that the variable on the X-axis is a strong discriminator between normal and anomalous data. The normal data has a very specific, low range of values for this variable. The anomalous data has a wider range, including higher values, indicating that deviations in this variable are key to identifying outliers.

**Right Plot (Diagonal/Curved Pattern):**

* **X-axis:** This axis has a wider range of values.
* **Y-axis:** This axis also has a wider range of values.
* **Relationship:** There's a more complex, diagonal or curved relationship.
    * The "normal" data (blue) forms a fairly dense, elongated cluster with a noticeable curve.
    * The "anomalies" (orange) are scattered around this cluster, often forming a boundary or a less dense outer ring.

    **Interpretation:** This indicates that both variables are important for distinguishing anomalies. The normal data exhibits a specific, non-linear correlation. The anomalous data deviates from this relationship, suggesting that outliers have unusual combinations of values for these two variables.

**Overall Implications for Anomaly Detection:**

* These plots show that the anomaly detection algorithm is likely identifying outliers based on deviations from the typical relationships between the variables.
* The left plot suggests that one variable (X-axis) might be a stronger indicator of anomalies when it has higher values.
* The right plot implies that a combination of both variables, specifically deviations from their curved relationship, is important for spotting outliers.

A more specific interpretation depends on which variables are plotted on the X and Y axes of each panel.

## Compare / Contrast Random Forest and Isolation Forest
Random Forest and Isolation Forest are both powerful ensemble learning algorithms that utilize decision trees. However, they are designed for fundamentally different tasks: **classification/regression** (Random Forest) and **anomaly detection** (Isolation Forest). Here's a comparison and contrast of the two:

**Random Forest:**

* **Primary Task:** Supervised learning for **classification** (predicting a categorical label) and **regression** (predicting a continuous value).
* **Goal:** To build a robust and accurate predictive model by learning the relationships between features and the target variable from labeled data.
* **Mechanism:**
    * Constructs multiple decision trees (a "forest"), each trained on a bootstrap sample of the training data (rows drawn at random with replacement).
    * When building each tree, it considers only a random subset of features at each split.
    * The final prediction is made by aggregating the predictions of all the trees (e.g., majority vote for classification, average for regression).
* **Key Characteristics:**
    * **Supervised Learning:** Requires labeled training data (features and corresponding target variable).
    * **Prediction-Focused:** Aims to predict the value of a target variable for new, unseen data.
    * **Ensemble of Deep Trees:** Individual trees can grow relatively deep to capture complex relationships.
    * **Feature Importance:** Can provide insights into which features are most important for the prediction task.
    * **Handles High Dimensionality:** Performs well with a large number of features.
    * **Robust to Overfitting:** The ensemble nature and random feature selection help prevent overfitting to the training data.
* **Output:** A predicted class label (classification) or a predicted numerical value (regression).
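
The Random Forest mechanism described above maps onto scikit-learn's `RandomForestClassifier` roughly as follows. This is a sketch on a synthetic dataset from `make_classification`; the dataset and the parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic labeled data: features X and a known target y.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    bootstrap=True,        # each tree is trained on a bootstrap sample
    max_features='sqrt',   # random subset of features considered at each split
    random_state=0,
)
rf.fit(X_train, y_train)   # supervised: needs both features and labels

# scikit-learn aggregates by averaging the trees' predicted class
# probabilities (a soft vote) and returning the most probable class.
y_pred = rf.predict(X_test)
accuracy = rf.score(X_test, y_test)
importances = rf.feature_importances_

print(f"Test accuracy: {accuracy:.2f}")
print("Feature importances:", importances.round(3))
```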

**Isolation Forest:**

* **Primary Task:** Unsupervised learning for **anomaly detection** (identifying data points that deviate significantly from the normal data).
* **Goal:** To isolate anomalous data points more easily than normal data points.
* **Mechanism:**
    * Builds multiple isolation trees (iTrees) by randomly partitioning the data.
    * For each tree, it randomly selects a feature and then randomly selects a split value within the range of that feature.
    * Anomalies, being rare and different, tend to be isolated in fewer splits (shorter path length in the trees).
    * A score is calculated for each data point based on its average path length across all the iTrees. Shorter average path lengths indicate a higher likelihood of being an anomaly.
* **Key Characteristics:**
    * **Unsupervised Learning:** Does not require labeled data. Normal and anomalous points are distinguished only by how easily they can be isolated.
    * **Isolation-Focused:** Explicitly tries to isolate anomalies rather than modeling the normal data.
    * **Ensemble of Shallow Trees:** Tree depth is capped (at about log2 of the subsample size), because anomalies are isolated near the root and growing deeper adds little.
    * **Anomaly Score:** Outputs an anomaly score for each data point, indicating its degree of abnormality.
    * **Effective for High-Dimensional Data:** Can handle datasets with many features.
    * **Efficient:** Generally has lower computational complexity compared to some other anomaly detection methods.
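* **Output:** An anomaly score for each data point. In the original algorithm the score lies in (0, 1], with values near 1 indicating anomalies. In scikit-learn, `score_samples` returns the negated score (lower means more abnormal), `decision_function` shifts it so that negative values indicate anomalies, and `predict` returns -1 for anomalies and 1 for inliers.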
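
The isolation mechanism described above can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: it follows a single point through randomly generated splits instead of building full trees on subsamples. The function names `isolation_path_length` and `average_path_length` and the toy data are assumptions made for the example.

```python
import random

def isolation_path_length(points, target, rng, depth=0, max_depth=50):
    """Count the random splits needed to isolate `target` from `points`."""
    if len(points) <= 1 or depth >= max_depth:
        return depth
    # Randomly select a feature, then a split value within that feature's range.
    feature = rng.randrange(len(target))
    values = [p[feature] for p in points]
    lo, hi = min(values), max(values)
    if lo == hi:
        return depth  # no split possible on this feature; stop (simplification)
    split = rng.uniform(lo, hi)
    # Keep only the points on the same side of the split as the target.
    if target[feature] < split:
        side = [p for p in points if p[feature] < split]
    else:
        side = [p for p in points if p[feature] >= split]
    return isolation_path_length(side, target, rng, depth + 1, max_depth)

def average_path_length(points, target, n_trees=200, seed=0):
    """Average the path length of `target` over many random partitionings."""
    rng = random.Random(seed)
    lengths = [isolation_path_length(points, target, rng) for _ in range(n_trees)]
    return sum(lengths) / n_trees

# Toy data: a dense 2-D cluster plus one far-away point.
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
inlier = (0.0, 0.0)
outlier = (8.0, 8.0)
points = cluster + [inlier, outlier]

inlier_length = average_path_length(points, inlier)
outlier_length = average_path_length(points, outlier)
print(f"average path length, inlier:  {inlier_length:.1f}")
print(f"average path length, outlier: {outlier_length:.1f}")
```

The far-away point is usually cut off from the cluster within the first few splits, so its average path length is much shorter than that of the point in the middle of the cluster.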

**Here's a table summarizing the key differences:**

| Feature           | Random Forest                       | Isolation Forest                     |
| :---------------- | :---------------------------------- | :----------------------------------- |
| **Primary Task** | Classification, Regression          | Anomaly Detection                    |
| **Learning Type** | Supervised                          | Unsupervised                         |
| **Goal** | Prediction of a target variable     | Isolation of anomalies               |
| **Data Labels** | Requires labeled data               | Does not require labeled anomalies   |
| **Tree Depth** | Typically deep                      | Typically shallow                    |
| **Splitting** | Based on maximizing information gain (or minimizing error) w.r.t. target | Based on random feature and split value |
| **Output** | Class label or numerical value      | Anomaly score                        |
| **Anomaly Handling** | Can be used for anomaly detection if anomalies are labeled as a separate class, but not its primary purpose. | Specifically designed for anomaly detection. |

**Similarities:**

* **Ensemble Methods:** Both algorithms rely on creating an ensemble (forest) of decision trees.
* **Tree-Based:** Both utilize decision tree structures as their base learners.
* **Randomness:** Both introduce randomness in the tree building process (subsampling of data and/or features) to improve robustness and generalization.
* **Handle High Dimensionality:** Both can effectively handle datasets with a large number of features.

**In essence:**

* **Random Forest learns the relationship between features and a known outcome.** It's like learning what a "cat" looks like from labeled images of cats and non-cats.
* **Isolation Forest measures how easily each data point can be isolated, without building an explicit profile of "normal" data.** Anomalies are those points that are easily separated because they are different from the majority. It's like finding the unusual-looking object in a collection of similar items without explicitly defining what "unusual" means beforehand.

Understanding these fundamental differences is crucial for choosing the appropriate algorithm for a given machine learning task. If you have labeled data and want to predict a specific outcome, Random Forest is likely the better choice. If you have unlabeled data and want to identify rare and unusual instances, Isolation Forest is a more suitable algorithm.
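
To make the contrast concrete, the sketch below fits both models on the same synthetic data. The data, the label vector `y`, and all parameter values are assumptions for illustration. The key difference is in the `fit` calls: Random Forest needs `y`, and Isolation Forest does not.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(1)
X_normal = rng.normal(0, 1, size=(300, 2))
X_rare = rng.normal(6, 0.5, size=(10, 2))   # hypothetical rare group far from the bulk
X = np.vstack([X_normal, X_rare])
y = np.array([0] * 300 + [1] * 10)          # labels, available only in the supervised case

# Supervised: Random Forest learns from features and labels together.
rf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)
rf_labels = rf.predict(X)                   # predicted class labels (0 or 1)

# Unsupervised: Isolation Forest is fit on the features alone.
iso = IsolationForest(n_estimators=100, random_state=1).fit(X)
iso_labels = iso.predict(X)                 # 1 = inlier, -1 = anomaly
iso_scores = iso.decision_function(X)       # negative values fall on the anomalous side

print("Random Forest classes:", np.unique(rf_labels))
print("Isolation Forest labels:", np.unique(iso_labels))
print("Mean Isolation Forest score, bulk vs. rare group:",
      iso_scores[:300].mean().round(3), iso_scores[300:].mean().round(3))
```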

## UNSW_NB15 Isolation Forest

https://colab.research.google.com/drive/1O07CpGmQLe9BzI3QLd2KnH3wWD_MMiEB?usp=sharing