<a href="https://colab.research.google.com/github/debiasri/Web-scraping-for-data-science/blob/Web-scraping-using-open-data/Webscrapping_in_datascience.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A basic example of web scraping with data from https://data.gov.in/major-indicator/covid-19-india-data-source-mohfw

In [1]:
# Importing required libraries
import requests
from bs4 import BeautifulSoup as bs  # parses HTML into a searchable tree of tags

In [2]:
# Importing other important libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# Let's take a webpage that contains a small table (Name, Gender, Age, Height, Weight); its address is stored in `url` below
# Opening that URL with the urlopen function
from urllib.request import urlopen
url = 'https://dibyendudeb.com/what-is-web-scraping-and-why-it-is-so-important-in-data-science/'
html = urlopen(url)
soup = bs(html, 'lxml')  # parse the downloaded page with the lxml parser

In [5]:
# Collecting every table row (<tr> element) on the page
records = soup.find_all('tr')
# Creating a list with the text of each row
text_list = []
for row in records:
    row_store = row.find_all('td')  # all cells (<td> elements) of this row
    text_store = str(row_store)  # string form of the list of cells, e.g. '[<td>..</td>, <td>..</td>]'
    onlytext = bs(text_store, 'lxml').get_text()  # re-parse that string and drop the tags, leaving text such as '[.., ..]'
    text_list.append(onlytext)

In [6]:
df=pd.DataFrame(text_list)
df.head(10)
print(records)

[<tr><td><strong>Name</strong></td><td><strong>Gender</strong></td><td><strong>Age</strong></td><td><strong>Height</strong></td><td><strong>Weight</strong></td></tr>, <tr><td>Ramesh</td><td>Male</td><td>18</td><td>5.6</td><td>59</td></tr>, <tr><td>Dinesh</td><td>Male</td><td>23</td><td>5.0</td><td>55</td></tr>, <tr><td>Sam</td><td>Male</td><td>22</td><td>5.5</td><td>54</td></tr>, <tr><td>Dipak</td><td>Male</td><td>15</td><td>4.5</td><td>49</td></tr>, <tr><td>Rahul</td><td>Male</td><td>18</td><td>5.9</td><td>60</td></tr>, <tr><td>Rohit</td><td>Male</td><td>20</td><td>6.0</td><td>69</td></tr>, <tr><td>Debesh</td><td>Male</td><td>25</td><td>6.1</td><td>70</td></tr>, <tr><td>Deb</td><td>Male</td><td>21</td><td>5.9</td><td>56</td></tr>, <tr><td>Debarati</td><td>Female</td><td>29</td><td>5.4</td><td>54</td></tr>, <tr><td>Dipankar</td><td>Male</td><td>22</td><td>5.7</td><td>56</td></tr>, <tr><td>Smita</td><td>Female</td><td>25</td><td>5.5</td><td>60</td></tr>, <tr><td>Dilip</td><td>Male</td><

Splitting the single column into multiple columns according to the comma separated values

In [7]:
df1 = df[0].str.split(',', expand=True)
df1.head(10)

Unnamed: 0,0,1,2,3,4
0,[Name,Gender,Age,Height,Weight]
1,[Ramesh,Male,18,5.6,59]
2,[Dinesh,Male,23,5.0,55]
3,[Sam,Male,22,5.5,54]
4,[Dipak,Male,15,4.5,49]
5,[Rahul,Male,18,5.9,60]
6,[Rohit,Male,20,6.0,69]
7,[Debesh,Male,25,6.1,70]
8,[Deb,Male,21,5.9,56]
9,[Debarati,Female,29,5.4,54]

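To see what `str.split(',', expand=True)` does in isolation, here is a minimal sketch on a hypothetical two-row sample (the `sample` series is made up for illustration and only mimics the strings stored in `text_list`). It also shows that splitting on a bare comma keeps the space that followed each comma, which splitting on `', '` avoids.

```python
import pandas as pd

# Hypothetical sample mimicking the bracketed, comma-separated row strings
sample = pd.Series(["[Name, Gender, Age]", "[Ramesh, Male, 18]"])

# expand=True returns a DataFrame with one column per comma-separated piece
parts = sample.str.split(",", expand=True)
print(parts.shape)             # two rows, three columns
print(repr(parts.iloc[1, 1]))  # the piece keeps its leading space

# Splitting on comma plus space leaves no leading space behind
clean = sample.str.split(", ", expand=True)
print(repr(clean.iloc[1, 1]))
```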

In [11]:
# Removing the opening bracket from column 0
df1[0] = df1[0].str.strip('[')
# Removing the closing bracket from column 4 (the last column)
df1[4] = df1[4].str.strip(']')
df1.head()

Unnamed: 0,0,1,2,3,4
0,Name,Gender,Age,Height,Weight
1,Ramesh,Male,18,5.6,59
2,Dinesh,Male,23,5.0,55
3,Sam,Male,22,5.5,54
4,Dipak,Male,15,4.5,49

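A small sketch of how the bracket removal behaves, using two hypothetical series in place of columns 0 and 4. `str.strip` removes the given characters from both ends of each string, while `str.lstrip` and `str.rstrip` work on one end only; passing several characters removes any of them, which is one way to drop the leftover space as well.

```python
import pandas as pd

# Hypothetical stand-ins for the first and last columns after the split
first = pd.Series(["[Name", "[Ramesh"])
last = pd.Series([" Weight]", " 59]"])

# lstrip only touches the left end of each string
first_clean = first.str.lstrip("[")

# strip with ' ]' removes spaces and closing brackets from both ends
last_clean = last.str.strip(" ]")

print(first_clean.tolist())
print(last_clean.tolist())
```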

Creating another dataframe to collect the column headers

In [12]:
# Storing the table headers in a variable (inspecting the page shows that the headers are in <strong> tags)
headers = soup.find_all('strong')
# Using BeautifulSoup again to strip the tags from the headers
header_list = []  # list that will hold the header text
col_headers = str(headers)
header_only = bs(col_headers, "lxml").get_text()
header_list.append(header_only)
print(header_list)

['[Name, Gender, Age, Height, Weight]']


Converting the list to a pandas data frame

In [13]:
df2 = pd.DataFrame(header_list)
df2.head()

Unnamed: 0,0
0,"[Name, Gender, Age, Height, Weight]"


As before, we have to split the single column into several columns to separate the values

In [14]:
df3 = df2[0].str.split(',', expand=True)
df3.head()

Unnamed: 0,0,1,2,3,4
0,[Name,Gender,Age,Height,Weight]


Concatenating the two data frames

In [25]:
concatenate = [df3, df1]

df4 = pd.concat(concatenate)
df4

Unnamed: 0,0,1,2,3,4
0,[Name,Gender,Age,Height,Weight]
0,Name,Gender,Age,Height,Weight
1,Ramesh,Male,18,5.6,59
2,Dinesh,Male,23,5.0,55
3,Sam,Male,22,5.5,54
4,Dipak,Male,15,4.5,49
5,Rahul,Male,18,5.9,60
6,Rohit,Male,20,6.0,69
7,Debesh,Male,25,6.1,70
8,Deb,Male,21,5.9,56

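The output above shows the index label 0 twice, because `pd.concat` keeps the original index of each frame by default. The sketch below uses two hypothetical frames (`header` and `body` are illustrative names, not part of the notebook) to show that behaviour, how `ignore_index=True` renumbers the rows instead, and why dropping by label later removes both rows labelled 0.

```python
import pandas as pd

# Hypothetical miniature versions of the header frame and the data frame
header = pd.DataFrame([["Name", "Age"]])
body = pd.DataFrame([["Name", "Age"], ["Ramesh", "18"]])

# Default behaviour: each frame keeps its own index labels
stacked = pd.concat([header, body])
print(stacked.index.tolist())

# ignore_index=True assigns a fresh 0..n-1 index
renumbered = pd.concat([header, body], ignore_index=True)
print(renumbered.index.tolist())

# drop() matches labels, so both rows labelled 0 disappear at once
remaining = stacked.drop(stacked.index[0])
print(len(remaining))
```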

In [16]:
# Assigning the first row as table header
df4=df4.rename(columns=df4.iloc[0])
df4.head(10)

Unnamed: 0,[Name,Gender,Age,Height,Weight]
0,[Name,Gender,Age,Height,Weight]
0,Name,Gender,Age,Height,Weight
1,Ramesh,Male,18,5.6,59
2,Dinesh,Male,23,5.0,55
3,Sam,Male,22,5.5,54
4,Dipak,Male,15,4.5,49
5,Rahul,Male,18,5.9,60
6,Rohit,Male,20,6.0,69
7,Debesh,Male,25,6.1,70
8,Deb,Male,21,5.9,56


In [17]:
# The header now appears both as the column names and as rows in the table, so we remove those rows.
# drop() works on index labels, so every row labelled 0 (both copies of the header) is removed.
df5 = df4.drop(df4.index[0])
df5.head()

Unnamed: 0,[Name,Gender,Age,Height,Weight]
1,Ramesh,Male,18,5.6,59
2,Dinesh,Male,23,5.0,55
3,Sam,Male,22,5.5,54
4,Dipak,Male,15,4.5,49
5,Rahul,Male,18,5.9,60

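As a possible shortcut for the rename-then-drop steps, the first row can be promoted to the column labels directly. This is only a sketch on a hypothetical frame named `raw`, assuming the header text has already been cleaned of brackets and spaces.

```python
import pandas as pd

# Hypothetical frame whose first row holds the column names
raw = pd.DataFrame([["Name", "Age"], ["Ramesh", "18"], ["Dinesh", "23"]])

# Keep every row except the first and renumber the index from 0
table = raw.iloc[1:].reset_index(drop=True)

# Use the values of the first row as column labels
table.columns = raw.iloc[0].tolist()

print(table)
```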

Now let us get a basic overview of the data at hand

In [18]:
df5.info()
df5.shape

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19 entries, 1 to 19
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   [Name     19 non-null     object
 1    Gender   19 non-null     object
 2    Age      19 non-null     object
 3    Height   19 non-null     object
 4    Weight]  19 non-null     object
dtypes: object(5)
memory usage: 912.0+ bytes


(19, 5)

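The `info()` output reports every column as `object`, meaning that Age, Height and Weight are still stored as text. A minimal sketch of one way to convert such columns is shown below on a hypothetical frame named `people`; `pd.to_numeric` with `errors="coerce"` turns unparseable values into `NaN` rather than raising an error.

```python
import pandas as pd

# Hypothetical sample with numbers stored as text, including leading spaces
people = pd.DataFrame({
    "Name": ["Ramesh", "Dinesh"],
    "Age": [" 18", " 23"],
    "Height": [" 5.6", " 5.0"],
})

# Strip whitespace, then convert each text column to a numeric dtype
for col in ["Age", "Height"]:
    people[col] = pd.to_numeric(people[col].str.strip(), errors="coerce")

print(people.dtypes)
```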
In [19]:
# Eliminating rows with any missing value
df5 = df5.dropna(axis=0, how='any')
df5.head()

Unnamed: 0,[Name,Gender,Age,Height,Weight]
1,Ramesh,Male,18,5.6,59
2,Dinesh,Male,23,5.0,55
3,Sam,Male,22,5.5,54
4,Dipak,Male,15,4.5,49
5,Rahul,Male,18,5.9,60

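Since `info()` reported 19 non-null values in each of the 19 rows, `dropna` removes nothing in this dataset. To illustrate what the `how` argument controls, here is a short sketch on a hypothetical frame (`sample_df` is an illustrative name) that does contain missing values.

```python
import pandas as pd

# Hypothetical frame: row 1 is partly missing, row 2 is entirely missing
sample_df = pd.DataFrame({
    "Name": ["Ramesh", None, None],
    "Age": ["18", "23", None],
})

# how='any' drops a row if at least one value is missing
any_dropped = sample_df.dropna(axis=0, how="any")

# how='all' drops a row only if every value is missing
all_dropped = sample_df.dropna(axis=0, how="all")

print(len(any_dropped), len(all_dropped))
```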

In [20]:
# Cleaning the column names: removing the leftover brackets and the leading spaces left by the split
df5.columns = df5.columns.str.strip(' []')
print(df5.head())

     Name  Gender  Age  Height Weight
1  Ramesh    Male   18     5.6     59
2  Dinesh    Male   23     5.0     55
3     Sam    Male   22     5.5     54
4   Dipak    Male   15     4.5     49
5   Rahul    Male   18     5.9     60


In [22]:
df5.head()

Unnamed: 0,Name,Gender,Age,Height,Weight
1,Ramesh,Male,18,5.6,59
2,Dinesh,Male,23,5.0,55
3,Sam,Male,22,5.5,54
4,Dipak,Male,15,4.5,49
5,Rahul,Male,18,5.9,60

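The steps in this notebook convert each row to a string, split it on commas and then remove the brackets. A possible alternative is to read the text of each cell directly, which avoids the bracket cleanup and would also cope with cell values that contain commas. The sketch below is a hedged illustration: it parses a hypothetical inline HTML snippet instead of the downloaded page, and uses Python's built-in `html.parser` so that it does not depend on `lxml`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the downloaded page
html_text = """
<table>
  <tr><td><strong>Name</strong></td><td><strong>Age</strong></td></tr>
  <tr><td>Ramesh</td><td>18</td></tr>
  <tr><td>Dinesh</td><td>23</td></tr>
</table>
"""
page = BeautifulSoup(html_text, "html.parser")

# One list of cell texts per table row; strip=True trims surrounding whitespace
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all("td")]
    for tr in page.find_all("tr")
]

# The first row holds the headers, the remaining rows hold the data
table_df = pd.DataFrame(rows[1:], columns=rows[0])
print(table_df)
```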