
#Crawling Through Forms and Logins

##Submitting a Basic Form

Website: https://pythonscraping.com/pages/files/form.html

In [None]:
<form method="post" action="processing.php">
First name: <input type="text" name="firstname"><br>
Last name: <input type="text" name="lastname"><br>
<input type="submit" value="Submit" id="submit">
</form>

A couple of things to notice here: first, the names of the two input fields are firstname and lastname. This is important. The names of these fields determine the names of the variable parameters that will be POSTed to the server when the form is submitted. If you want to mimic the action that the form will take when POSTing your own data, you need to make sure that your variable names match up.

The second thing to note is that the action of the form is processing.php (the absolute path is https://pythonscraping.com/pages/files/processing.php). Any POST requests to the form should be made to this page, not to the page on which the form itself resides.
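As a rough illustration of both points, the sketch below pulls the field names and the action out of the form markup and resolves the action against the page URL. It uses only the standard library; the `FormInspector` class and the inlined `FORM_HTML` string are illustrative helpers, not part of the original lab.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FORM_PAGE = 'https://pythonscraping.com/pages/files/form.html'
# The form markup is inlined here so the sketch runs without network access.
FORM_HTML = '''
<form method="post" action="processing.php">
First name: <input type="text" name="firstname"><br>
Last name: <input type="text" name="lastname"><br>
<input type="submit" value="Submit" id="submit">
</form>
'''


class FormInspector(HTMLParser):
    """Collect the form's action and the names of its input fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            self.action = attrs.get('action')
        elif tag == 'input' and 'name' in attrs:
            # Inputs without a name (such as the submit button) are skipped.
            self.field_names.append(attrs['name'])


inspector = FormInspector()
inspector.feed(FORM_HTML)

# Resolve the relative action against the URL of the page holding the form.
post_url = urljoin(FORM_PAGE, inspector.action)

print(inspector.field_names)  # ['firstname', 'lastname']
print(post_url)  # https://pythonscraping.com/pages/files/processing.php
```

The printed names are the keys to use in the parameter dictionary, and the printed URL is the address to POST to.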

Submitting the form with the Requests library looks like this:

In [None]:
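import requests

# Keys must match the name attributes of the form's input fields
params = {'firstname': 'First', 'lastname': 'Last'}
# POST to the form's action page, not to the page that shows the form
r = requests.post('https://pythonscraping.com/pages/files/processing.php', data=params)
print(r.text)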

Hello there, First Last!


##Complicated forms

In most cases, you only need to look at two things: the name attribute of each field you want to fill in and the action attribute of the form.

Example

In [None]:
<form action="http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi" id="example_form2" method="POST">
 <input name="client_token" type="hidden" value="oreilly" />
 <input name="subscribe" type="hidden" value="optin" />
 <input name="success_url" type="hidden" value="http://oreilly.com/store/newsletter-thankyou.html" />
 <input name="error_url" type="hidden" value="http://oreilly.com/store/newsletter-signup-error.html" />
 <input name="topic_or_dod" type="hidden" value="1" />
 <input name="source" type="hidden" value="orm-home-t1-dotd" />
 <fieldset>
 <input class="email_address long" maxlength="200" name="email_addr" size="25" type="text" value="Enter your email here" />
 <button alt="Join" class="skinny" name="submit" onclick="return addClickTracking('orm','ebook','rightrail','dod');" value="submit">Join</button>
 </fieldset>
</form>

Despite all the hidden fields, submitting just the email_addr field to the form's action URL is enough for the form above:

In [None]:
import requests

params = {'email_addr': 'ryan.e.mitchell@gmail.com'}
r = requests.post('http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi', data=params)
print(r.text)

One way to spot a GET request is to look at the URL of a page. If it looks something like this:

http://domainname.com?thing1=foo&thing2=bar

it corresponds to a form of this type:

In [None]:
<form method="GET" action="someProcessor.php">
<input type="sometype" name="thing1" value="foo"/>
<input type="sometype" name="thing2" value="bar"/>
<input type="submit" value="Submit" />
</form>

This also corresponds to a Python parameter dictionary:

{'thing1':'foo', 'thing2':'bar'}
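The mapping between the URL and the dictionary can be checked with the standard library. This is a minimal sketch using `urllib.parse` only; with Requests, passing the same dictionary as the `params` argument of `requests.get` builds the query string for you.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

url = 'http://domainname.com?thing1=foo&thing2=bar'

# URL -> dictionary: split off the query string and decode its pairs.
params = dict(parse_qsl(urlsplit(url).query))
print(params)  # {'thing1': 'foo', 'thing2': 'bar'}

# Dictionary -> query string: the reverse direction.
query = urlencode(params)
print(query)  # thing1=foo&thing2=bar
```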

#Radio Buttons, Checkboxes, and Other Inputs
Obviously, not all web forms are a collection of text fields followed by a submit button. Standard HTML contains a wide variety of possible form input fields: radio buttons, checkboxes, and select boxes, to name a few. HTML5 adds sliders (range input fields), email, dates, and more. With custom JavaScript fields, the possibilities are endless, with color pickers, calendars, and whatever else the developers come up with next.

Regardless of the seeming complexity of any sort of form field, you need to worry about only two things: the name of the element and its value. The element’s name can be easily determined by looking at the source code and finding the name attribute. The value can sometimes be trickier, as it might be populated by JavaScript immediately before form submission. Color pickers, as an example of a fairly exotic form field, will likely have a value of something like #F03030.
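Whatever the widget looks like in the browser, what reaches the server is a set of name and value pairs. The sketch below encodes a few such pairs the way a form body is encoded. The field names (`size`, `toppings`, `color`) are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlencode

# Hypothetical field names and values, for illustration only.
params = {
    'size': 'large',                   # radio button: value of the selected option
    'toppings': ['cheese', 'olives'],  # checkboxes sharing one name: one value per checked box
    'color': '#F03030',                # color picker
}

# doseq=True repeats the key once per list item.
body = urlencode(params, doseq=True)
print(body)
# size=large&toppings=cheese&toppings=olives&color=%23F03030
```

Requests accepts the same kind of dictionary, including list values, as the `data` argument of `requests.post`.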


#Submitting Files and Images

An example of a file upload form is at https://pythonscraping.com/pages/files/form2.html

The code to upload a file is:

In [None]:
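import requests

# Open the image in binary mode; the with block closes it after the upload
with open('files/python.png', 'rb') as image:
    files = {'uploadFile': image}
    r = requests.post('https://pythonscraping.com/pages/files/processing2.php', files=files)
print(r.text)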

#Handling Logins and Cookies

Most modern websites use cookies to keep track of who is logged in and who is not. After a site authenticates your login credentials, it stores them in your browser's cookie, which usually contains a server-generated token, a time-out, and tracking information. The site then uses this cookie as proof of authentication, which is shown to each page you visit during your time on the site.

There is a simple login form at https://pythonscraping.com/pages/cookies/login.html (the username can be anything, but the password must be “password”). This form is processed at https://pythonscraping.com/pages/cookies/welcome.php, which contains a link to the main page, https://pythonscraping.com/pages/cookies/profile.php.

Note: the login request must be sent over https to welcome.php, the page that processes the form. Posting to login.html, or using http, results in an error.

In [None]:
#Website is https://pythonscraping.com/pages/cookies/login.html
import requests

params = {'username': 'AAA', 'password': 'password'}
# POST to welcome.php (the form's action) over https, not to login.html
r = requests.post('https://pythonscraping.com/pages/cookies/welcome.php', data=params)
print('Cookie is set to')
print(r.cookies.get_dict())
# Request welcome.php again, this time sending the login cookies back
print('Going to welcome page')
r = requests.get('https://pythonscraping.com/pages/cookies/welcome.php', cookies=r.cookies)
print(r.text)

Cookie is set to
{'loggedin': '1', 'username': 'AAA'}
Going to welcome page

<h2>Welcome to the Website!</h2>
You have logged in successfully! <br><a href="profile.php">Check out your profile!</a>
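The example above passes the cookies along by hand. A `requests.Session` object can do that bookkeeping instead: it stores the cookies set by one response and sends them with later requests. The following is a minimal sketch, not the lab's original code. The helper name `fetch_profile` is hypothetical, and the function takes the session as an argument so that it can be exercised without network access.

```python
LOGIN_URL = 'https://pythonscraping.com/pages/cookies/welcome.php'
PROFILE_URL = 'https://pythonscraping.com/pages/cookies/profile.php'


def fetch_profile(session, username, password='password'):
    """Log in, then request the profile page on the same session.

    `session` is expected to be a requests.Session (or an object with
    compatible post/get methods). A real Session keeps the cookies set by
    the login response and sends them back on the next request.
    """
    login = session.post(LOGIN_URL, data={'username': username, 'password': password})
    login.raise_for_status()  # stop early if the login request failed
    profile = session.get(PROFILE_URL)  # no cookies= argument needed
    profile.raise_for_status()
    return profile.text


# Usage (performs real network requests):
#   import requests
#   with requests.Session() as session:
#       print(fetch_profile(session, 'AAA'))
```

Compared with passing `cookies=r.cookies` on every call, a session scales better to crawls that visit many pages after logging in.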


#HTTP Basic Access Authentication

Website: https://pythonscraping.com/pages/auth/login.php

The username can be anything; the password must be 'password'.

In [None]:
import requests
from requests.auth import HTTPBasicAuth

# Any username works; the password must be 'password'
auth = HTTPBasicAuth('names', 'password')
r = requests.post(url='https://pythonscraping.com/pages/auth/login.php', auth=auth)
print(r.text)

<p>Hello names.</p><p>You entered password as your password.</p>
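Under the hood, HTTP basic access authentication sends the credentials in an `Authorization` header as the Base64 encoding of `username:password`. The sketch below builds that header value by hand to show roughly what `HTTPBasicAuth` adds to the request. The helper `basic_auth_header` is hypothetical, and UTF-8 encoding of the credentials is an assumption (for plain ASCII credentials the choice makes no difference).

```python
import base64


def basic_auth_header(username, password):
    """Return the Authorization header value for HTTP basic auth."""
    credentials = f'{username}:{password}'.encode('utf-8')  # assumed encoding
    token = base64.b64encode(credentials).decode('ascii')
    return f'Basic {token}'


header = basic_auth_header('names', 'password')
print(header)

# Base64 is an encoding, not encryption: anyone can reverse it.
print(base64.b64decode(header.split(' ', 1)[1]))  # b'names:password'
```

Because the credentials are only encoded, basic authentication should be used over https.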
