# Learning Web Scraping
There is no AI or data science without data, and in the digital age much of that data is hosted on websites. Web scraping is thus a crucial skill to have.

**Note:** Not every website permits scraping, and some treat it as a breach of their terms of service. In such cases, you might lose access to the website.
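
One common, partial signal of what a site permits is its `robots.txt` file. The sketch below uses Python's standard `urllib.robotparser` to test a URL against a set of rules. The rules and the `example.com` URLs are made up for illustration, and `is_allowed` is a hypothetical helper name. Note that `robots.txt` is not the same thing as a site's terms of service, so passing this check does not by itself mean scraping is permitted.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/data/table.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/report.html"))
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` would download the real file instead of parsing an inline string.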

All credit goes to [Devakumar P](https://www.kaggle.com/imdevskp), who wrote the original code. This notebook explains that code to help you understand web scraping better.


In [15]:
# Required libraries
%pip install beautifulsoup4 requests



In [0]:
# Importing required Libraries
import requests
from bs4 import BeautifulSoup

In [0]:
# Save the target URL in a variable.
req_url = "https://www.mohfw.gov.in/"

In [18]:
# Send a GET request to the URL and capture the response.
res = requests.get(req_url)
res

<Response [200]>
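
A status of 200 means the request succeeded, but a scraper should not assume that every request will. The sketch below wraps the request in a hypothetical helper, `fetch_html`, which sets a timeout and raises on HTTP error codes. The injectable `get` parameter and the 10-second default are assumptions of this sketch, added so the function can be exercised without network access.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Fetch a page and return its raw bytes, raising on HTTP error codes.

    `get` defaults to requests.get; passing another callable is an
    assumption of this sketch that makes offline testing possible.
    """
    res = get(url, timeout=timeout)
    res.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return res.content


# Usage (performs a real network request):
# page_bytes = fetch_html("https://www.mohfw.gov.in/")
```

The returned bytes can be passed to `BeautifulSoup` in the same way as `res.content`.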

In [0]:
# Create an HTML parser.
# All the data in a web page is stored in tags. The parser loads these tags into a
# BeautifulSoup object, from which we will extract the data we need.
html = BeautifulSoup(res.content, 'html.parser')
html

Now inspect the page in your browser (Right Click => Inspect) to find the tag in which the data we need is stored. The page contains only one table, about the daily status of Corona cases. This is the data we need.  
Refer to this [HTML tutorial](https://www.w3schools.com/tags/tag_thead.asp) to learn how tables are stored on websites.
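
To make the table structure concrete before working with the live page, here is a minimal sketch on a tiny hand-written table. The HTML string is made up for illustration (its values are borrowed from the scraped output in this notebook), and the `sample_*` names are hypothetical.

```python
from bs4 import BeautifulSoup

# A tiny hand-written table with the same <thead>/<tbody> layout as a typical HTML table.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>S. No.</th><th>Name of State / UT</th><th>Death</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Andhra Pradesh</td><td>7</td></tr>
    <tr><td>2</td><td>Andaman and Nicobar Islands</td><td>0</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Header cells live in <th> tags inside <thead>; data rows are <tr> tags inside <tbody>.
sample_header_cells = sample_soup.find("thead").find_all("th")
sample_body_rows = sample_soup.find("tbody").find_all("tr")

print(len(sample_header_cells), "header cells")
print(len(sample_body_rows), "body rows")
print(sample_header_cells[1].text)
```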

In [26]:
# Extracting the table header
thead = html.find_all('thead')
thead

[<thead>
 <tr>
 <th><strong>S. No.</strong></th>
 <th><strong>Name of State / UT</strong></th>
 <th><strong>Total Confirmed cases (Including 72 foreign Nationals) </strong></th>
 <th><strong>Cured/Discharged/<br/>Migrated</strong></th>
 <th><strong>Death</strong></th>
 </tr>
 </thead>]

In [28]:
# Select a table header and extract the column names from it.
# As the tutorial explains, all cells of the table header are stored inside a <tr> tag.
head = thead[0].find_all('tr')
head

[<tr>
 <th><strong>S. No.</strong></th>
 <th><strong>Name of State / UT</strong></th>
 <th><strong>Total Confirmed cases (Including 72 foreign Nationals) </strong></th>
 <th><strong>Cured/Discharged/<br/>Migrated</strong></th>
 <th><strong>Death</strong></th>
 </tr>]

In [29]:
# Extracting all the table bodies in the page
tbody = html.find_all('tbody')
tbody

[<tbody>
 <tr>
 <td>1</td>
 <td>Andhra Pradesh</td>
 <td>427</td>
 <td>11</td>
 <td>7</td>
 </tr>
 <tr>
 <td>2</td>
 <td>Andaman and Nicobar Islands</td>
 <td>11</td>
 <td>10</td>
 <td>0</td>
 </tr>
 <tr>
 <td>3</td>
 <td>Arunachal Pradesh</td>
 <td>1</td>
 <td>0</td>
 <td>0</td>
 </tr>
 <tr>
 <td>4</td>
 <td>Assam</td>
 <td>29</td>
 <td>0</td>
 <td>1</td>
 </tr>
 <tr>
 <td>5</td>
 <td>Bihar</td>
 <td>64</td>
 <td>19</td>
 <td>1</td>
 </tr>
 <tr>
 <td>6</td>
 <td>Chandigarh</td>
 <td>21</td>
 <td>7</td>
 <td>0</td>
 </tr>
 <tr>
 <td>7</td>
 <td>Chhattisgarh</td>
 <td>31</td>
 <td>10</td>
 <td>0</td>
 </tr>
 <tr>
 <td>8</td>
 <td>Delhi</td>
 <td>1154</td>
 <td>27</td>
 <td>24</td>
 </tr>
 <tr>
 <td>9</td>
 <td>Goa</td>
 <td>7</td>
 <td>5</td>
 <td>0</td>
 </tr>
 <tr>
 <td>10</td>
 <td>Gujarat</td>
 <td>516</td>
 <td>44</td>
 <td>25</td>
 </tr>
 <tr>
 <td>11</td>
 <td>Haryana</td>
 <td>185</td>
 <td>29</td>
 <td>3</td>
 </tr>
 <tr>
 <td>12</td>
 <td>Himachal Pradesh</td>
 <td>32</td>
 <t

In [31]:
# In the table body, each table row is stored inside a <tr> tag.
# So we will select a table body and then extract all the rows from it.
body = tbody[0].find_all('tr')
body

[<tr>
 <td>1</td>
 <td>Andhra Pradesh</td>
 <td>427</td>
 <td>11</td>
 <td>7</td>
 </tr>, <tr>
 <td>2</td>
 <td>Andaman and Nicobar Islands</td>
 <td>11</td>
 <td>10</td>
 <td>0</td>
 </tr>, <tr>
 <td>3</td>
 <td>Arunachal Pradesh</td>
 <td>1</td>
 <td>0</td>
 <td>0</td>
 </tr>, <tr>
 <td>4</td>
 <td>Assam</td>
 <td>29</td>
 <td>0</td>
 <td>1</td>
 </tr>, <tr>
 <td>5</td>
 <td>Bihar</td>
 <td>64</td>
 <td>19</td>
 <td>1</td>
 </tr>, <tr>
 <td>6</td>
 <td>Chandigarh</td>
 <td>21</td>
 <td>7</td>
 <td>0</td>
 </tr>, <tr>
 <td>7</td>
 <td>Chhattisgarh</td>
 <td>31</td>
 <td>10</td>
 <td>0</td>
 </tr>, <tr>
 <td>8</td>
 <td>Delhi</td>
 <td>1154</td>
 <td>27</td>
 <td>24</td>
 </tr>, <tr>
 <td>9</td>
 <td>Goa</td>
 <td>7</td>
 <td>5</td>
 <td>0</td>
 </tr>, <tr>
 <td>10</td>
 <td>Gujarat</td>
 <td>516</td>
 <td>44</td>
 <td>25</td>
 </tr>, <tr>
 <td>11</td>
 <td>Haryana</td>
 <td>185</td>
 <td>29</td>
 <td>3</td>
 </tr>, <tr>
 <td>12</td>
 <td>Himachal Pradesh</td>
 <td>32</td>
 <td>13</td>

So far, we have extracted HTML tags of the required tabular data. Next, we will convert it into Python data formats so that we can analyse them further.


In [35]:
head_rows = []
for tr in head:
  # Inside each <tr>, each cell is stored in either a <th> or a <td> tag. First extract all the cell-level tags in the <tr> and store them in 'td'
  td = tr.find_all(['th', 'td'])
  # From each cell-level tag in 'td', extract the text as a string and collect the strings in a list.
  row = [i.text for i in td]
  head_rows.append(row)
head_rows

[['S. No.',
  'Name of State / UT',
  'Total Confirmed cases (Including 72 foreign Nationals) ',
  'Cured/Discharged/Migrated',
  'Death']]

Similarly, we repeat the operation for all the rows in the table body.

In [37]:
body_rows = []
for tr in body:
  # Inside each <tr>, each cell is stored in either a <th> or a <td> tag. First extract all the cell-level tags in the <tr> and store them in 'td'
  td = tr.find_all(['th', 'td'])
  # From each cell-level tag in 'td', extract the text as a string and collect the strings in a list.
  row = [i.text for i in td]
  body_rows.append(row)
body_rows

[['1', 'Andhra Pradesh', '427', '11', '7'],
 ['2', 'Andaman and Nicobar Islands', '11', '10', '0'],
 ['3', 'Arunachal Pradesh', '1', '0', '0'],
 ['4', 'Assam', '29', '0', '1'],
 ['5', 'Bihar', '64', '19', '1'],
 ['6', 'Chandigarh', '21', '7', '0'],
 ['7', 'Chhattisgarh', '31', '10', '0'],
 ['8', 'Delhi', '1154', '27', '24'],
 ['9', 'Goa', '7', '5', '0'],
 ['10', 'Gujarat', '516', '44', '25'],
 ['11', 'Haryana', '185', '29', '3'],
 ['12', 'Himachal Pradesh', '32', '13', '1'],
 ['13', 'Jammu and Kashmir', '245', '6', '4'],
 ['14', 'Jharkhand', '19', '0', '2'],
 ['15', 'Karnataka', '232', '57', '6'],
 ['16', 'Kerala', '376', '179', '2'],
 ['17', 'Ladakh', '15', '10', '0'],
 ['18', 'Madhya Pradesh', '564', '0', '36'],
 ['19', 'Maharashtra', '1985', '217', '149'],
 ['20', 'Manipur', '2', '1', '0'],
 ['21', 'Mizoram', '1', '0', '0'],
 ['22', 'Odisha', '54', '12', '1'],
 ['23', 'Puducherry', '7', '1', '0'],
 ['24', 'Punjab', '151', '5', '11'],
 ['25', 'Rajasthan', '804', '21', '3'],
 ['26', '
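
Because the header loop and the body loop are identical, they could be folded into one helper. The sketch below is one possible refactoring, not the original author's code: `rows_to_lists` is a hypothetical name, and it uses `get_text(strip=True)` instead of `.text` so that stray whitespace (such as the trailing space in the third column name) is removed. The inline HTML is made up for illustration.

```python
from bs4 import BeautifulSoup


def rows_to_lists(rows):
    """Turn a list of <tr> tags into a list of lists of cell strings."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in rows
    ]


# Small made-up table; note the trailing space inside the header cell.
sample = BeautifulSoup(
    "<table><thead><tr><th><strong>Death </strong></th></tr></thead>"
    "<tbody><tr><td>7</td></tr><tr><td>0</td></tr></tbody></table>",
    "html.parser",
)

sample_head = rows_to_lists(sample.find("thead").find_all("tr"))
sample_body = rows_to_lists(sample.find("tbody").find_all("tr"))

print(sample_head)
print(sample_body)
```

With the variables from this notebook, the equivalent calls would be `rows_to_lists(head)` and `rows_to_lists(body)`.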

Use both of these lists to create the DataFrame.

In [0]:
import pandas as pd

In [40]:
# Drop the last two scraped rows and use the header row as the column names.
df = pd.DataFrame(body_rows[:-2], columns=head_rows[0])
df

Unnamed: 0,S. No.,Name of State / UT,Total Confirmed cases (Including 72 foreign Nationals),Cured/Discharged/Migrated,Death
0,1,Andhra Pradesh,427,11,7
1,2,Andaman and Nicobar Islands,11,10,0
2,3,Arunachal Pradesh,1,0,0
3,4,Assam,29,0,1
4,5,Bihar,64,19,1
5,6,Chandigarh,21,7,0
6,7,Chhattisgarh,31,10,0
7,8,Delhi,1154,27,24
8,9,Goa,7,5,0
9,10,Gujarat,516,44,25
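
Every value scraped this way is a string, so counts such as `'427'` need converting before any arithmetic. The sketch below shows one way to do that with `pd.to_numeric`. The three rows are copied from the output above, while the shortened column names (`Confirmed`, `Cured`) and the `sample_*` names are assumptions made for readability.

```python
import pandas as pd

# Rows copied from the scraped output above; every value is a string.
sample_rows = [
    ["1", "Andhra Pradesh", "427", "11", "7"],
    ["2", "Andaman and Nicobar Islands", "11", "10", "0"],
    ["8", "Delhi", "1154", "27", "24"],
]
# Shortened column names, chosen for this sketch only.
sample_columns = ["S. No.", "Name of State / UT", "Confirmed", "Cured", "Death"]
sample_df = pd.DataFrame(sample_rows, columns=sample_columns)

# Convert the count columns from strings to numbers.
# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric_cols = ["Confirmed", "Cured", "Death"]
sample_df[numeric_cols] = sample_df[numeric_cols].apply(pd.to_numeric, errors="coerce")

total_confirmed = int(sample_df["Confirmed"].sum())
print(sample_df.dtypes)
print(total_confirmed)
```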


In [0]:
from datetime import datetime

In [0]:
now = datetime.now()
#df_bs['Date'] =  


In [46]:
now

datetime.datetime(2020, 4, 13, 7, 1, 28, 338612)

In [47]:
now.strftime("%m/%d/%Y")

'04/13/2020'
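
A formatted date like this is typically used to stamp each scraped row, so that snapshots taken on different days can be told apart. The sketch below is one way to do that with plain lists. It uses the fixed timestamp from the output above so the result is reproducible, and the names `scraped_at` and `stamped_rows` and the ISO-style format are assumptions of this sketch.

```python
from datetime import datetime

# Fixed timestamp taken from the output above, so the example is reproducible.
scraped_at = datetime(2020, 4, 13, 7, 1, 28, 338612)

us_style = scraped_at.strftime("%m/%d/%Y")   # month/day/year, as used above
iso_style = scraped_at.strftime("%Y-%m-%d")  # year-month-day, sorts chronologically as text

# Append the date to each scraped row (rows copied from the output above).
rows = [["1", "Andhra Pradesh", "427", "11", "7"], ["9", "Goa", "7", "5", "0"]]
stamped_rows = [row + [iso_style] for row in rows]

print(us_style)
print(iso_style)
print(stamped_rows[0])
```

With a DataFrame, assigning the same string to a new column (for example `df['Date'] = now.strftime("%m/%d/%Y")`) would stamp every row at once.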