
    
<img src="https://astanait.edu.kz/wp-content/uploads/2020/05/aitu-logo-3.png" alt="alt text" width="150" height="200" class="blog-image">
  

<h1 style="text-align:center;">Big Data in Law Enforcement (practice) </h1>

<h1 style="text-align:center;">HTTP Requests and Simple APIs</h1>


<h2 id="">Overview of HTTP </h2>


When you, the client, access a web page in your web browser, a series of interactions occurs with the server where the web page is hosted. This process uses the **Hypertext Transfer Protocol (HTTP)**.

***Client's Request:*** When you navigate to a web page, your web browser sends an HTTP request to the server hosting that page. This request includes the URL of the page you want to access. By default, the server attempts to find the primary resource, usually an "<code>index.html</code>" file, associated with the requested URL.

***Server's Response:*** If the server successfully locates the requested resource, it generates an HTTP response. This response includes various details about the resource, such as its type, size, and other relevant information.

Each HTTP response carries a single resource, here the requested HTML file. Other resources that the web page depends on, such as images, stylesheets, and scripts, are fetched by the browser through additional requests.

The diagram below illustrates this process. 

![HTTP-Protocol.png](attachment:80613623-6a2d-4351-8df4-c503bfa6beb3.png)


The HTTP protocol serves as the communication mechanism that allows you to exchange information with web servers, making it possible to access and retrieve web pages, images, and various other web resources.


<h2 id="URL">Uniform Resource Locator: URL</h2>


When investigating online activity and potential security threats, it is essential to understand the structure of a **Uniform Resource Locator (URL)**. A URL can be broken down into three fundamental components:

**Scheme:** The ***scheme*** identifies the protocol used for communication, for example "<code>http://</code>" for standard web traffic or "<code>https://</code>" for its encrypted variant.

**Internet Address or Base URL:** This component serves as the starting point for identifying the online location where pertinent resources can be discovered. For cyber security and law enforcement professionals, examples of these addresses could include "<code>https://tisane.ai/</code>" and "<code>https://www.iacpcybercenter.org/</code>."

**Route:** The "route" or "path" indicates the specific location on web servers where potentially relevant resources are stored. This path is akin to a file directory and may appear as "<code>/org/ID.png</code>."

Understanding these components lets cyber security and law enforcement professionals identify the communication protocol, the location of the web server, and where on that server a resource is stored. This knowledge is useful when conducting investigations, analyzing potential security incidents, and gathering digital evidence.

The term ***"Uniform Resource Identifier (URI)"*** is a broader concept encompassing various ways to identify resources on the web. URLs are a specific subset of URIs and are primarily used to specify the location of web resources. Additionally, you might encounter the term ***"endpoint"***, which refers to the URL associated with a specific operation provided by a web server. An endpoint serves as the precise web address for accessing a particular function or service offered by the server.
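The URL components described above can also be inspected programmatically. The following is a minimal sketch using `urlsplit` from Python's standard `urllib.parse` module on the httpbin URL used in the GET request section of this notebook; the variable names are illustrative only.

```python
from urllib.parse import urlsplit

# Example URL taken from the GET request section of this notebook
url = "http://httpbin.org/get?name=Janat&ID=1"

parts = urlsplit(url)

print(parts.scheme)  # the protocol used for communication
print(parts.netloc)  # the internet address (host) of the server
print(parts.path)    # the route to the resource on the server
print(parts.query)   # the query string, if there is one
```

Note that `urlsplit` reports the scheme without the trailing `://` separator.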

<h2 id="RE">Request </h2>


The process of interacting with web servers can be divided into two key phases: the Request and the Response processes. Let's focus on the Request process, particularly when using the "GET" method.

In the request, we utilize the "GET" method, which is a specific type of "HTTP" method. This method is employed to request resources from the web server. The request also includes the location of the desired resource, specified as "<code>/index.html</code>", and the version of the "HTTP" protocol being used.

Additionally, the Request header plays a crucial role in the process. It conveys supplementary information along with the "HTTP" request. This header includes various details that help the server understand the request and how to respond effectively.

When an <code>HTTP</code> request is made, an <code>HTTP</code> method is sent, which tells the server what action to perform. A list of several <code>HTTP</code> methods is shown below.

![Summary-of-HTTP-methods-and-description-of-its-actions.png](attachment:f08f9dd7-4b5d-4c45-9fdc-8f6e9c11cce9.png)
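To make the structure of a request concrete, the sketch below assembles the text of a minimal GET request: a start line with the method, the resource location, and the protocol version, followed by headers. The helper name `build_get_request` and the host `www.example.com` are assumptions for illustration; in practice a library such as Requests builds this message for you.

```python
def build_get_request(path, host, extra_headers=None):
    """Return the text of a minimal HTTP/1.1 GET request (illustrative helper)."""
    # Start line: method, resource location, protocol version
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
    # Optional request headers carry supplementary information
    for name, value in (extra_headers or {}).items():
        lines.append(f"{name}: {value}")
    # A blank line ends the headers; a GET request has no body
    return "\r\n".join(lines) + "\r\n\r\n"


# www.example.com is a placeholder host, not a server used in this notebook
request_text = build_get_request("/index.html", "www.example.com", {"Accept": "*/*"})
print(request_text)
```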

<h2 id="RES">Response</h2>


The diagram presented below illustrates the Response phase. 

![HTTP_ResponseMessageExample.png](attachment:0d093c2b-db2d-481e-8dfc-d33f855265b9.png)

The response start line begins with the version number, denoted as "<code>HTTP/1.0</code>." Following that, there is a status code, "200," which signifies success, and a descriptive phrase, "OK," providing additional context about the success status.

Within the response header, you'll find valuable information that assists in understanding the response, including metadata and details relevant to the resource.

Concluding the response is the response body, which contains the actual requested file, in this instance, an "<code>HTML</code>" document. It's worth noting that certain requests may include headers, which can carry additional information and instructions related to the response.
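The three parts of a response (start line, headers, body) can be separated with a few lines of Python. This is a simplified sketch on a hand-written response string; the sample message and the helper name `parse_response` are assumptions for illustration, and real responses are parsed for you by the HTTP library.

```python
# A hand-written sample response used only for illustration
raw_response = (
    "HTTP/1.0 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>Hello</body></html>"
)


def parse_response(raw):
    """Split a raw HTTP response into version, status code, phrase, headers, body."""
    # The first blank line separates the head (start line + headers) from the body
    head, _, body = raw.partition("\r\n\r\n")
    start_line, *header_lines = head.split("\r\n")
    version, status_code, reason = start_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return version, int(status_code), reason, headers, body


version, status_code, reason, headers, body = parse_response(raw_response)
print(version, status_code, reason)
print(headers)
print(body)
```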


Here are some common HTTP status code examples categorized by their class:

![Screenshot 2023-10-19 220349.png](attachment:bb3a0752-f038-4a82-9e0e-c9a7b6c79534.png)
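The class of a status code is determined by its first digit. The sketch below uses `HTTPStatus` from Python's standard `http` module to look up the descriptive phrase; the class labels and the helper name `describe_status` are chosen here for illustration.

```python
from http import HTTPStatus

# The first digit of a status code determines its class (labels are illustrative)
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}


def describe_status(code):
    """Return the code, its standard phrase, and its class as one string."""
    status_class = STATUS_CLASSES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not registered in the standard library
        phrase = "Unknown"
    return f"{code} {phrase} ({status_class})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```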

<h2 id="RP">Requests in Python</h2>


Requests is a Python Library that allows you to send <code>HTTP/1.1</code> requests easily. We can import the library as follows:


In [2]:
import requests


We also have to use the following libraries:


In [5]:
import os
!pip install Pillow




We can make a <code>GET</code> request via the method <code>get</code> to https://astanait.edu.kz/en/cybersecurity-2/:


In [6]:
url = 'https://astanait.edu.kz/en/cybersecurity-2/'
r = requests.get(url)

We have the response object <code>r</code>, which gives information about the request and, in particular, its status. We can view the status code using the attribute <code>status_code</code>.


In [7]:
r.status_code

200

Let's view the request headers:


In [8]:
r.request.headers

{'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive'}

This dictionary represents HTTP headers typically used in an HTTP request. Let's break down the meaning of these headers:

**User-Agent:** This header indicates the software or user agent making the HTTP request. In this case, it specifies that the user agent is "python-requests/2.31.0," which is commonly used to indicate that the request is being made by the Python requests library version 2.31.0.

**Accept-Encoding:** This header informs the server about the types of content encoding that the client (or user agent) can understand. It specifies that the client can accept data that is encoded using "gzip," "deflate," and "br" (Brotli) compression algorithms. The server can choose one of these methods to compress the response content for more efficient data transfer.

**Accept:** The "Accept" header indicates the media types that the client can accept in the response. In this case, it's set to "<code>*/*</code>", a wildcard that means the client is willing to accept any media type. This header tells the server that it can send any type of content.

**Connection:** The "Connection" header specifies the type of connection to be used. "keep-alive" indicates that the client (in this case, Python requests) wishes to keep the connection open for possible future requests, which can improve performance by reducing the overhead of creating new connections for each request.

These headers provide essential information to the server, allowing it to understand the client's capabilities and preferences when making an HTTP request. The server can then respond accordingly, considering these headers in its response.

We can also view the request body. Since a GET request has no body, the following line prints <code>None</code>:


In [15]:
print(r.request.body)

None


We can view the <code>HTTP</code> response header using the attribute <code>headers</code>. This returns a python dictionary of <code>HTTP</code> response headers.


In [16]:
r.headers


{'Server': 'nginx', 'Date': 'Sun, 22 Oct 2023 09:24:58 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Link': '<https://astanait.edu.kz/wp-json/>; rel="https://api.w.org/", <https://astanait.edu.kz/wp-json/wp/v2/pages/7651>; rel="alternate"; type="application/json", <https://astanait.edu.kz/?p=7651>; rel=shortlink', 'X-TEC-API-VERSION': 'v1', 'X-TEC-API-ROOT': 'https://astanait.edu.kz/wp-json/tribe/events/v1/', 'X-TEC-API-ORIGIN': 'https://astanait.edu.kz', 'Set-Cookie': 'pll_language=en; expires=Mon, 21-Oct-2024 09:24:58 GMT; Max-Age=31536000; path=/; secure; SameSite=Lax', 'X-Cache-Status': 'MISS', 'X-Content-Type-Options': 'nosniff', 'X-Powered-By': 'PleskLin', 'Content-Encoding': 'br'}

These headers provide information about the response that the server sends back to the client. Here's an explanation of some of the key headers in this response:

**Server:** This header identifies the web server software used to handle the request. In this case, the server is "nginx."

**Date:** The "Date" header specifies the date and time when the server generated the response. It's set to "Sun, 22 Oct 2023 09:24:58 GMT."

**Content-Type:** The "Content-Type" header indicates the type of content in the response. In this case, it's "text/html," which means the content is in HTML format, and "charset=UTF-8" specifies the character encoding used.

**Transfer-Encoding:** This header indicates how the content is encoded during transmission. "chunked" means that the response is sent in chunks.

**Connection:** The "Connection" header specifies the type of connection to be used. "keep-alive" indicates that the server wishes to keep the connection open for possible future requests, similar to the "Connection" header in the request shown earlier.

**Set-Cookie:** This header is used to set cookies in the client's browser. In this case, it sets a cookie named "pll_language" with a value of "en." It also specifies the expiration date and other attributes of the cookie.

**Content-Encoding:** The "Content-Encoding" header indicates the encoding applied to the response content. "br" indicates that Brotli compression is used to compress the response content, reducing its size for faster transmission.

**X-TEC-API-VERSION, X-TEC-API-ROOT, and X-TEC-API-ORIGIN:** These headers are specific to the API used by the server. They provide information about the API version, root URL, and origin.

**X-Cache-Status:** This header typically indicates the status of caching. "MISS" suggests that the response was not retrieved from a cache but generated by the server.

**X-Content-Type-Options:** This header specifies that the response should not be sniffed for a different content type. It helps prevent certain types of attacks.

**X-Powered-By:** This header indicates the technology or software powering the server. In this case, it's "PleskLin," suggesting the server is using the Plesk hosting control panel.

These response headers are essential for both the client and server to communicate effectively and ensure the correct handling of the HTTP response. They convey various details about the response, such as content type, encoding, and server information.
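As a small illustration of how a header value can be read by a program, the sketch below splits the `Content-Type` value from the response above into its media type and character encoding. It uses `email.message.Message` from the standard library as a convenient header parser; this is an assumption made for the example, not the mechanism Requests itself uses.

```python
from email.message import Message

# Content-Type value copied from the response headers shown above
content_type_value = "text/html; charset=UTF-8"

msg = Message()
msg["Content-Type"] = content_type_value

media_type = msg.get_content_type()   # the media type without its parameters
charset = msg.get_content_charset()   # the charset parameter, returned in lower case

print(media_type)
print(charset)
```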

As the <code>Content-Type</code> is <code>text/html</code> we can use the attribute <code>text</code> to display the <code>HTML</code> in the body. We can review the first 100 characters:


In [17]:
r.text[0:100]

'<!DOCTYPE html>\n<html lang="en-GB" class="no-js">\n<head>\n\t<meta charset="UTF-8">\n\t<meta name="viewpo'

<h2 id="URL_P">GET request </h2>

To tailor the results of your query, you can utilize the GET method, which enables the retrieval of data from an API or server. When employing this method, you initiate a GET request to the server.

As in the previous examples, the URL starts with the Base URL, and the Route "<code>/get</code>" is appended to it. On httpbin, this route is the endpoint that answers GET requests.

We are going to work with http://httpbin.org, a basic HTTP request and response service that lets us experiment with HTTP requests such as GET.


In [18]:
url_get = 'http://httpbin.org/get'

A ***query string*** is the part of a uniform resource locator (URL) that sends additional information to the web server.


To create a query string, define a dictionary. The keys are the parameter names, and the values are the parameter values.


In [19]:
load  = {"name": "Janat", "ID":"1"}

Then pass the dictionary <code>load</code> to the <code>params</code> parameter of the <code>get()</code> function:


In [23]:
r = requests.get(url_get, params = load)

We can print the <code>URL</code> and see the parameter names and values appended to it.


In [24]:
r.url

'http://httpbin.org/get?name=Janat&ID=1'

The query string starts with a <code>?</code>, followed by a series of parameter and value pairs. The first parameter name is <code>name</code> and its value is <code>Janat</code>. The second parameter name is <code>ID</code> and its value is <code>1</code>. Within each pair, the parameter and the value are separated by an equals sign, <code>=</code>.
The pairs are separated from each other by an ampersand, <code>&</code>.
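The same encoding rules can be reproduced with Python's standard library. In this sketch, `urlencode` builds the query string from the dictionary and `parse_qs` decodes it again; the variable names `base_url`, `query_string`, `full_url`, and `decoded` are illustrative.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

load = {"name": "Janat", "ID": "1"}
base_url = "http://httpbin.org/get"

# Join each name and value with "=" and the pairs with "&"
query_string = urlencode(load)

# The "?" marks the start of the query string in the URL
full_url = f"{base_url}?{query_string}"
print(full_url)

# Going the other way: recover the parameters from the URL
decoded = parse_qs(urlsplit(full_url).query)
print(decoded)
```

Note that `parse_qs` returns each value as a list, because a parameter name may appear more than once in a query string.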

As before, the GET request has no request body.


In [26]:
print(r.request.body)

None


And we can see the status code.


In [27]:
r.status_code

200

We can view the response as text:


In [29]:
print(r.text)

{
  "args": {
    "ID": "1", 
    "name": "Janat"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.31.0", 
    "X-Amzn-Trace-Id": "Root=1-6534ed1e-72fe4b114a5e616a72c80366"
  }, 
  "origin": "178.91.38.65", 
  "url": "http://httpbin.org/get?name=Janat&ID=1"
}



We can also look at the <code>'Content-Type'</code>.


In [30]:
r.headers['Content-Type']

'application/json'

Since the content type is <code>JSON</code>, we can use the method <code>json()</code>, which returns a Python <code>dict</code>:


In [31]:
r.json()

{'args': {'ID': '1', 'name': 'Janat'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate, br',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.31.0',
  'X-Amzn-Trace-Id': 'Root=1-6534ed1e-72fe4b114a5e616a72c80366'},
 'origin': '178.91.38.65',
 'url': 'http://httpbin.org/get?name=Janat&ID=1'}

The key <code>args</code> has the name and values:


In [32]:
r.json()['args']

{'ID': '1', 'name': 'Janat'}
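Calling `json()` on the response is roughly equivalent to parsing the response text with the standard `json` module. The sketch below shows this on an abbreviated copy of the response text from above, so it runs without a network connection; the variable names are illustrative.

```python
import json

# Abbreviated copy of the response text shown above
response_text = """{
  "args": {
    "ID": "1",
    "name": "Janat"
  },
  "url": "http://httpbin.org/get?name=Janat&ID=1"
}"""

# json.loads turns JSON text into Python objects (here, nested dicts)
data = json.loads(response_text)

print(type(data).__name__)
print(data["args"])
print(data["args"]["name"])
```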

<h2 id="POST">Post Requests  </h2>


Like a GET request, a POST request can transmit data to a server. However, a POST request sends the data in the request body rather than in the URL. To send a POST request to httpbin, we change the route in the URL to "<code>/post</code>".


In [33]:
url_post = 'http://httpbin.org/post'

This particular endpoint accepts data either as a file or as a form. Using a form is a convenient method for structuring an HTTP request to transmit data to a server.

To make a <code>POST</code> request, we use the <code>post()</code> function and pass the variable <code>load</code> to the parameter <code>data</code>:


In [34]:
r_post = requests.post(url_post, data = load)

If we compare the URL from the response object of the <code>GET</code> and <code>POST</code> request we see the <code>POST</code> request doesn't have name or value pairs.


In [36]:
print(r_post.url)
print(r.url)

http://httpbin.org/post
http://httpbin.org/get?name=Janat&ID=1


If we compare the <code>POST</code> and <code>GET</code> request body, we see only the <code>POST</code> request has a body:


In [38]:
print(r_post.request.body)
print(r.request.body)

name=Janat&ID=1
None
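The difference between the two methods comes down to where the encoded data travels. The following sketch makes that explicit without sending anything over the network; `place_data` is a hypothetical helper written for this illustration and is not part of the Requests library.

```python
from urllib.parse import urlencode

load = {"name": "Janat", "ID": "1"}


def place_data(method, url, data):
    """Return (url, body) showing where form data travels for each method."""
    encoded = urlencode(data)
    if method == "GET":
        # GET: the data is appended to the URL as a query string, no body
        return f"{url}?{encoded}", None
    if method == "POST":
        # POST: the URL stays unchanged and the data goes in the body
        return url, encoded
    raise ValueError(f"unsupported method: {method}")


get_url, get_body = place_data("GET", "http://httpbin.org/get", load)
post_url, post_body = place_data("POST", "http://httpbin.org/post", load)

print(get_url, get_body)
print(post_url, post_body)
```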


We can view the form as well:


In [39]:
r_post.json()['form']

{'ID': '1', 'name': 'Janat'}

 Check out <a href="https://requests.readthedocs.io/en/master/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01">Requests </a> for more.


<hr>


<h2 id="URL_P">API </h2>


API stands for **Application Programming Interface**. An API serves as a software intermediary that enables communication between two different applications.

The advantages of utilizing APIs are as follows:

**Automation:** APIs reduce the need for manual effort and allow for the automation of tasks, resulting in more efficient workflows that can be easily updated to enhance speed and productivity.

**Efficiency:** APIs enable the use of pre-built functionality, saving time and resources compared to attempting to develop complex features from scratch.

However, there is one significant disadvantage of using APIs:

**Security:** Poorly integrated APIs can pose security risks, making them vulnerable to attacks. This can lead to data breaches or losses, which may have financial or reputational consequences.

In this notebook, one of the applications we will work with is the Random User Generator. This open-source, free API provides developers with randomly generated user data, which can be used as placeholders for testing purposes. Similar to how Lorem Ipsum is used for generating dummy text, Random User Generator generates dummy user profiles. The API can return multiple results and allows for specifying details such as gender, email, images, usernames, addresses, names, and more. You can find more information about Random User Generator in their documentation.

Another example of a simple API used in this notebook is the Fruityvice application. The Fruityvice API provides a webservice that offers information about various fruits. It can be used to access interesting data about fruits and serve educational purposes. This webservice is completely free to use and contribute to.

## RandomUser API


The `randomuser` library provides various getter methods that return individual fields of a generated user. For more information, please visit this [documentation](https://randomuser.me/documentation?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork1005-2023-01-01) page. Some of them are:
- get_password()
- get_phone()
- get_picture()
- get_postcode()
- get_registered()
- get_state()
- get_cell()
- get_city()
- get_dob()
- get_email()
- get_first_name()
- get_full_name()
- get_gender()
- get_id()
- get_id_number()

Let's explore the `randomuser` library, which wraps this API.

In [40]:
! pip install randomuser



In [42]:
from randomuser import RandomUser
import pandas as pd

First, we have to create a random user object, user.


In [43]:
user = RandomUser()

Then, using the `generate_users()` function, we can create a list of 10 random users.


In [44]:
some_list  = user.generate_users(10)
some_list

[<randomuser.RandomUser at 0x19f972a1a50>,
 <randomuser.RandomUser at 0x19f97287d10>,
 <randomuser.RandomUser at 0x19f972a1a90>,
 <randomuser.RandomUser at 0x19f972a1ad0>,
 <randomuser.RandomUser at 0x19f972a1b10>,
 <randomuser.RandomUser at 0x19f972a1b90>,
 <randomuser.RandomUser at 0x19f972a1bd0>,
 <randomuser.RandomUser at 0x19f972a1c10>,
 <randomuser.RandomUser at 0x19f972a1c50>,
 <randomuser.RandomUser at 0x19f972a1b50>]

Using the getter methods listed above, we can generate the fields required to construct a dataset. For example, to get an email address, we can use the `get_email()` function.

In [45]:
emails = user.get_email()
emails

'lorena.fontai@example.com'

Suppose we need to generate 10 users with their full names, emails, and cities. We can use a for loop to print these 10 users.


In [47]:
for user in some_list:
    print ( user.get_full_name(), " ", user.get_email(), user.get_city())

Angelina Roche   angelina.roche@example.com Nîmes
Ömür Akgül   omur.akgul@example.com Kırklareli
Mary Soto   mary.soto@example.com Gresham
Gema Pastor   gema.pastor@example.com Zaragoza
Michael Payne   michael.payne@example.com New York
Jenny Myers   jenny.myers@example.com Memphis
Kunigunde Wiegel   kunigunde.wiegel@example.com Freiberg am Neckar
Lotta Kalm   lotta.kalm@example.com Vesanto
Milja Lehtonen   milja.lehtonen@example.com Kerimäki
Lawrence Porter   lawrence.porter@example.com Rochmond


We can also convert the generated users to a DataFrame object for further analysis.

In [56]:
# first we define the function get_users
def get_users():
    users = []
    for user in RandomUser.generate_users(10):
        users.append({"Name":user.get_full_name(),"Gender":user.get_gender()})
    return pd.DataFrame(users)

get_users()

| | Name | Gender |
|---|---|---|
| 0 | Ruben Robert | male |
| 1 | سورنا موسوی | male |
| 2 | Eeli Polon | male |
| 3 | Ali Ertepınar | male |
| 4 | Arnaud Harris | male |
| 5 | Rita Carroll | female |
| 6 | Daniel Sandersen | male |
| 7 | Marie Vincent | female |
| 8 | Grace Turner | female |
| 9 | Ella Thomsen | female |


In [57]:
# get_users() already returns a DataFrame, so we can assign the result directly
df1 = get_users()
df1

| | Name | Gender |
|---|---|---|
| 0 | Delores Carlson | female |
| 1 | Eric Holsen | male |
| 2 | Marcus Christiansen | male |
| 3 | Josep Saez | male |
| 4 | Olímpia Silva | female |
| 5 | Delphine Ambrose | female |
| 6 | Domenica Lucas | female |
| 7 | Daniela Rojas | female |
| 8 | Chiara Menard | female |
| 9 | Belen Lopez | female |


## Practice 1  
(40 pt.)



1. Generate photos of 10 random users.

In [87]:
# write your code below


2. Generate a table with the following information about the users: id, name, gender, state, city, email and picture


In [88]:
# write your code below


## AbuseIPDB API
A common way to use an API is to retrieve data with the ***requests*** library. In this practice we are going to import data from the AbuseIPDB website. AbuseIPDB is a project that is closely related to cybersecurity and law enforcement efforts to combat online threats and malicious activities. It serves as a valuable resource for webmasters, system administrators, and other interested parties to report and track IP addresses associated with malicious activities. The retrieved data can be used to enhance the security of online systems by blocking or monitoring IP addresses known to be involved in activities such as hacking, spamming, distributed denial of service (DDoS) attacks, and other cyber threats.

It can also be a valuable resource for law enforcement agencies investigating cybercrimes, as it can assist in tracking down and identifying individuals or groups involved in online criminal activities.

We will start by importing all required libraries.


In [58]:
import os
import requests
import json

# Defining the api-endpoint
url = 'https://api.abuseipdb.com/api/v2/check'

querystring = {
    'ipAddress': '118.25.6.39',
    'maxAgeInDays': '90'
}

headers = {
    'Accept': 'application/json',
    # Read the API key from an environment variable instead of hard-coding it.
    # The variable name ABUSEIPDB_API_KEY is an assumption; use the name you set.
    'Key': os.environ['ABUSEIPDB_API_KEY']
}

response = requests.request(method='GET', url=url, headers=headers, params=querystring)

# Formatted output
decodedResponse = json.loads(response.text)
print(json.dumps(decodedResponse, sort_keys=True, indent=4))

{
    "data": {
        "abuseConfidenceScore": 0,
        "countryCode": "CN",
        "domain": "tencent.com",
        "hostnames": [],
        "ipAddress": "118.25.6.39",
        "ipVersion": 4,
        "isPublic": true,
        "isTor": false,
        "isWhitelisted": false,
        "isp": "Tencent Cloud Computing (Beijing) Co. Ltd",
        "lastReportedAt": "2023-10-11T20:00:28+00:00",
        "numDistinctUsers": 1,
        "totalReports": 1,
        "usageType": "Data Center/Web Hosting/Transit"
    }
}


In [72]:
import os
import requests
import json

# Defining the api-endpoint
url = 'https://api.abuseipdb.com/api/v2/blacklist'

querystring = {
    'confidenceMinimum': '90'
}

headers = {
    'Accept': 'application/json',
    # Read the API key from an environment variable instead of hard-coding it.
    # The variable name ABUSEIPDB_API_KEY is an assumption; use the name you set.
    'Key': os.environ['ABUSEIPDB_API_KEY']
}

response = requests.request(method='GET', url=url, headers=headers, params=querystring)

# Formatted output

decodedResponse = json.loads(response.text)
#print(json.dumps(decodedResponse, sort_keys=True, indent=4))

In [73]:
# We can convert the result to a pandas DataFrame.

df = pd.DataFrame(decodedResponse["data"])
df

| | ipAddress | countryCode | abuseConfidenceScore | lastReportedAt |
|---|---|---|---|---|
| 0 | 218.92.0.113 | CN | 100 | 2023-10-22T10:17:02+00:00 |
| 1 | 8.209.240.18 | JP | 100 | 2023-10-22T10:17:02+00:00 |
| 2 | 61.177.172.136 | CN | 100 | 2023-10-22T10:17:01+00:00 |
| 3 | 43.155.163.177 | KR | 100 | 2023-10-22T10:17:01+00:00 |
| 4 | 43.138.214.217 | CN | 100 | 2023-10-22T10:17:01+00:00 |
| ... | ... | ... | ... | ... |
| 9995 | 36.138.181.32 | CN | 100 | 2023-10-22T09:30:19+00:00 |
| 9996 | 183.96.3.190 | KR | 100 | 2023-10-22T09:30:18+00:00 |
| 9997 | 123.30.249.49 | VN | 100 | 2023-10-22T09:30:18+00:00 |
| 9998 | 120.194.152.58 | CN | 100 | 2023-10-22T09:30:18+00:00 |
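Once the blacklist is in tabular form, simple aggregations become possible. The sketch below counts addresses per country using only the first five rows from the output above, stored as plain dictionaries so that it runs without an API key. With the full DataFrame, `df["countryCode"].value_counts()` is the pandas counterpart.

```python
from collections import Counter

# First five rows of the blacklist output above, copied as plain dictionaries
rows = [
    {"ipAddress": "218.92.0.113", "countryCode": "CN"},
    {"ipAddress": "8.209.240.18", "countryCode": "JP"},
    {"ipAddress": "61.177.172.136", "countryCode": "CN"},
    {"ipAddress": "43.155.163.177", "countryCode": "KR"},
    {"ipAddress": "43.138.214.217", "countryCode": "CN"},
]

# Count how many reported addresses belong to each country code
country_counts = Counter(row["countryCode"] for row in rows)

for country, count in country_counts.most_common():
    print(country, count)
```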


API results usually come in a nested JSON format, so the data needs to be normalized. You can use the `pd.json_normalize()` function for this.
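To show what normalizing means, the sketch below flattens a trimmed copy of the `/check` response from above into a single-level dictionary with dotted keys, which is the same naming scheme `pd.json_normalize()` uses for its columns by default. The helper `flatten` is a hypothetical pure-Python illustration, not part of pandas.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into one level with dotted key names."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Trimmed copy of the /check response shown earlier in this notebook
nested = {
    "data": {
        "ipAddress": "118.25.6.39",
        "countryCode": "CN",
        "abuseConfidenceScore": 0,
    }
}

flat_record = flatten(nested)
print(flat_record)
```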

Let's extract the country code of the first row of this dataframe using the `loc` indexer.


In [60]:
df.loc[0, 'countryCode']

'CN'

## Practice 2
(60 pt.)

1. Find **abuseConfidenceScore** of 43.139.107.162

In [90]:
# Write your code here


2. [Here](https://github.com/public-apis/public-apis#anti-malware) you can find a list of free Anti-Malware APIs. Choose any API that interests you and use it to load or extract some information. The results should be parsed with json.loads() and converted to a pandas DataFrame.

In [91]:
# Write your code here


#### After completing the tasks, rename your current Jupyter Notebook using the format CS-21XX_YourName_YourSurname, save it, and download it. Send the notebook by email only, to A.Zhuldassov@astanait.edu.kz, with the subject Practice3.

## Prepared by
Course instructor Abat Zhuldassov