# Exploring the HyperText Transfer Protocol using `socket`

## Aashita Kesarwani
The following code is part of an assignment from the online course [Using Python to Access Web Data](https://www.coursera.org/learn/python-network-data). 

The code retrieves a document over HTTP using `socket` so that the HTTP response headers can be examined.

In [1]:
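import socket

# Open a TCP connection to the web server on port 80 (HTTP)
mysocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysocket.connect(('data.pr4e.org', 80))

# HTTP expects CRLF line endings; the blank line ends the request
msg = 'GET http://data.pr4e.org/intro-short.txt HTTP/1.0\r\n\r\n'.encode()
mysocket.sendall(msg)

# Read up to 512 bytes at a time until the server closes the connection
while True:
    data = mysocket.recv(512)
    if len(data) < 1:
        break
    # Each chunk is printed separately, so a line break can fall mid-word
    print(data.decode())

mysocket.close()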

HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 467
Connection: close
Date: Fri, 24 Mar 2017 00:41:08 GMT
Server: Apache
Last-Modified: Sat, 24 Sep 2016 20:36:08 GMT
ETag: "1d3-53d46d841582a"
Accept-Ranges: bytes

Why should you learn to write programs?

Writing programs (or programming) is a very creative 
and rewarding activity.  You can write programs for 
many reasons, ranging from making your living to solving
a difficult data analysis problem to having fun to helping
someone else so
lve a problem.  This book assumes that 
everyone needs to know how to program, and that once 
you know how to program you will figure out what you want 
to do with your newfound skills.  
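Since the goal is to examine the response headers, it can help to separate them from the body instead of printing everything together. The following is a minimal sketch, not part of the original assignment: `split_response` is a hypothetical helper, and `sample` is an abbreviated, hand-built stand-in for the bytes received above, assuming the server separates headers from the body with a blank CRLF line.

```python
def split_response(raw):
    """Split raw HTTP response bytes into (status_line, headers, body)."""
    # The headers end at the first blank line (CRLF CRLF)
    head, _, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Header names are case-insensitive, so store them in lower case
        name, _, value = line.partition(':')
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Abbreviated sample modelled on the response shown above (assumption)
sample = (b'HTTP/1.1 200 OK\r\n'
          b'Content-Type: text/plain\r\n'
          b'Content-Length: 467\r\n'
          b'\r\n'
          b'Why should you learn to write programs?\n')

status_line, headers, body = split_response(sample)
print(status_line)
print(headers['content-length'])
print(body.decode())
```

To use this with a live connection, the chunks returned by `recv` would be appended to a single `bytes` object and passed to `split_response` once the loop ends.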



**Passing the same string to `urllib` raises `HTTP Error 400: Bad Request`, because the trailing ` HTTP/1.0\n\n` belongs to the raw request line rather than to the URL:**

In [2]:
import urllib.request
url = 'http://data.pr4e.org/intro-short.txt HTTP/1.0\n\n'
html2 = urllib.request.urlopen(url).read()
print(html2)

HTTPError: HTTP Error 400: Bad Request
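The error above appears to come from the extra ` HTTP/1.0\n\n` in the string, since `urllib` builds the request line itself and only needs the URL. As a hedged sketch, `strip_request_suffix` and `fetch_with_urllib` below are hypothetical helpers (not from the course) that drop the suffix and then fetch the document; the network call is left commented out so the block runs offline.

```python
import urllib.request


def strip_request_suffix(request_target):
    """Keep only the URL: drop anything after the first whitespace."""
    return request_target.split()[0]


def fetch_with_urllib(url, timeout=10):
    """Return (status, headers, text) for a GET request made with urllib."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        headers = dict(response.headers.items())
        text = response.read().decode()
        return response.status, headers, text


raw = 'http://data.pr4e.org/intro-short.txt HTTP/1.0\n\n'
url = strip_request_suffix(raw)
print(url)

# Requires network access:
# status, headers, text = fetch_with_urllib(url)
```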

Using the `requests` module with the same string:

In [3]:
import requests
url = 'http://data.pr4e.org/intro-short.txt HTTP/1.0\n\n'
html = requests.get(url)
print(html.headers)

{'Content-Type': 'text/html; charset=iso-8859-1', 'Content-Length': '458', 'Connection': 'keep-alive', 'Keep-Alive': 'timeout=15', 'Date': 'Fri, 24 Mar 2017 00:41:17 GMT', 'Server': 'Apache'}


In [4]:
type(html.headers)

requests.structures.CaseInsensitiveDict

In [5]:
html.text

'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>300 Multiple Choices</title>\n</head><body>\n<h1>Multiple Choices</h1>\nThe document name you requested (<code>/intro-short.txt HTTP/1.0\n\n</code>) could not be found on this server.\nHowever, we found documents with names similar to the one you requested.<p>Available documents:\n<ul>\n<li><a href="/intro-short.txt/1.0%0a%0a">/intro-short.txt/1.0\n\n</a> (common basename)\n</ul>\n</body></html>\n'

The `BeautifulSoup` module is used here to parse and pretty-print the HTML text:

In [6]:
from bs4 import BeautifulSoup as bs
soup = bs(html.text, "lxml")
print(soup.prettify())

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
 <head>
  <title>
   300 Multiple Choices
  </title>
 </head>
 <body>
  <h1>
   Multiple Choices
  </h1>
  The document name you requested (
  <code>
   /intro-short.txt HTTP/1.0
  </code>
  ) could not be found on this server.
However, we found documents with names similar to the one you requested.
  <p>
   Available documents:
  </p>
  <ul>
   <li>
    <a href="/intro-short.txt/1.0%0a%0a">
     /intro-short.txt/1.0
    </a>
    (common basename)
   </li>
  </ul>
 </body>
</html>



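**Observation: Although the `requests` module gives us the header information, we could not retrieve the text as we did using `socket`. The server answered with a `300 Multiple Choices` page because the string passed as the URL still ended in ` HTTP/1.0\n\n`.**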

In [7]:
soup('body')

[<body>
 <h1>Multiple Choices</h1>
 The document name you requested (<code>/intro-short.txt HTTP/1.0
 
 </code>) could not be found on this server.
 However, we found documents with names similar to the one you requested.<p>Available documents:
 </p><ul>
 <li><a href="/intro-short.txt/1.0%0a%0a">/intro-short.txt/1.0
 
 </a> (common basename)
 </li></ul>
 </body>]
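The `300 Multiple Choices` page is easy to mistake for the requested document, so it may be worth checking the status code before reading the body. The sketch below is one possible approach, assuming the plain URL (without the ` HTTP/1.0\n\n` suffix) is what should be requested; `fetch_text` and `is_success` are hypothetical helpers, and the network call is left commented out.

```python
import requests

# URL only, without the request-line suffix (assumption about the intended target)
PLAIN_URL = 'http://data.pr4e.org/intro-short.txt'


def is_success(status_code):
    """True only for 2xx codes; a 300 response is not a success."""
    return 200 <= status_code < 300


def fetch_text(url, timeout=10):
    """Return (headers, text) for a GET request, rejecting non-2xx responses."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() only raises for 4xx and 5xx, so check 3xx explicitly
    response.raise_for_status()
    if not is_success(response.status_code):
        raise RuntimeError(f'Unexpected status: {response.status_code}')
    return response.headers, response.text


# Requires network access:
# headers, text = fetch_text(PLAIN_URL)
# print(headers)
# print(text)
```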