<img src="https://raw.githubusercontent.com/afo/data-x-plaksha/master/imgsource/dx_logo.png" align="left"></img><br><br><br><br><br><br><br><br>


## Part 1 - Web scraping COVID-19 data

**Author List**: Deepankar Singh <br>
Site to scrape covid data: <br>
https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory

---
<a id='sec4'></a>
# Problem

Use BeautifulSoup and Requests or Pandas to scrape the table “COVID-19 pandemic by <br>
location” under Statistics / Total cases and deaths on this Wikipedia page: <br>
https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory

# Solution

In [181]:
import requests  # HTTP library for getting and posting content
import bs4 as bs  # BeautifulSoup4 pulls data out of HTML and XML,
# so we can query the markup for specific content
import pandas as pd
import numpy as np

In [182]:
source = requests.get("https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory") 
# a GET request will download the HTML webpage.

In [183]:
# Convert source.content to a BeautifulSoup object,
# which can parse HTML and extract specific information from it
soup = bs.BeautifulSoup(source.content, features='html.parser')
# features selects the parser; 'html.parser' is Python's built-in HTML parser

In [184]:
covid_data = soup.find(id='covid19-container')

In [185]:
covid_table = covid_data.find_all('table')[0]
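
The two lookups above assume that the page still contains an element with `id='covid19-container'` and at least one table inside it. If the page layout changes, `soup.find` returns `None` and the next call fails. A minimal defensive sketch follows; the helper name `first_table` and the inline HTML are hypothetical stand-ins, not part of the original notebook.

```python
import bs4 as bs

# Hypothetical stand-in for the downloaded page; the real page is much larger.
html = """
<div id="covid19-container">
  <table>
    <tr><th>Location</th><th>Cases</th></tr>
    <tr><td>India</td><td>10099066</td></tr>
  </table>
</div>
"""

def first_table(soup, container_id):
    """Return the first <table> inside the element with this id, or None."""
    container = soup.find(id=container_id)
    if container is None:  # the id is not present on the page
        return None
    return container.find('table')  # None when the container has no table

soup = bs.BeautifulSoup(html, features='html.parser')
table = first_table(soup, 'covid19-container')
missing = first_table(soup, 'no-such-id')
```

Checking for `None` before calling `pd.read_html` gives a clearer failure than an `AttributeError` raised further down the notebook.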

In [186]:
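from io import StringIO
# Wrap the HTML in StringIO: newer pandas versions deprecate passing literal HTML strings
df = pd.read_html(StringIO(str(covid_table)), header=0)[0]
# Drop the first and sixth columns, keeping country, cases, deaths and recovered
df = df.drop(df.columns[[0, 5]], axis=1)
df.columns = ['country', 'cases', 'deaths', 'recovered']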
df.head()

,country,cases,deaths,recovered
0,World[e],78704434,1730663,44323101
1,United States[f],18633794,329491,8232907
2,India,10099066,146444,9663382
3,Brazil,7366677,189264,6405356
4,Russia[g],2963688,53096,2370857


In [187]:
# Strip footnote markers such as [e]; regex=True is required in pandas 2.0 and later
df['country'] = df['country'].str.replace(r"\[.*\]", "", regex=True)
df = df[df['country'] != 'World']
df = df.set_index('country')
df = df[:237]
df

country,cases,deaths,recovered
United States,18633794,329491,8232907
India,10099066,146444,9663382
Brazil,7366677,189264,6405356
Russia,2963688,53096,2370857
France,2505875,61978,187272
...,...,...,...
Wallis and Futuna,4,0,1
American Samoa,3,0,0
Samoa,2,0,0
Vanuatu,1,0,1
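
The pattern `\[.*\]` used for the country names is greedy: it removes everything from the first `[` to the last `]`. With one footnote marker per name, as in the rows shown here, that is harmless. The sketch below uses the standard `re` module and one made-up name (`Example[a] Islands[b]`, an assumption for illustration) to show where a non-greedy pattern would behave differently.

```python
import re

# The last entry is a hypothetical name with two footnote markers.
names = ["World[e]", "United States[f]", "India", "Example[a] Islands[b]"]

# Greedy: matches from the first '[' to the last ']'
greedy = [re.sub(r"\[.*\]", "", n) for n in names]

# Non-greedy: matches each bracketed marker separately
non_greedy = [re.sub(r"\[.*?\]", "", n) for n in names]

print(greedy)
print(non_greedy)
```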


#### Drop all the rows that do not contain numerical data

In [188]:
for col in df.columns:
    df = df[pd.to_numeric(df[col], errors='coerce').notnull()]
df

country,cases,deaths,recovered
United States,18633794,329491,8232907
India,10099066,146444,9663382
Brazil,7366677,189264,6405356
Russia,2963688,53096,2370857
France,2505875,61978,187272
...,...,...,...
Marshall Islands,4,0,2
Wallis and Futuna,4,0,1
American Samoa,3,0,0
Samoa,2,0,0
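
How the filter works can be seen on a small toy frame: `pd.to_numeric(..., errors='coerce')` turns any value it cannot parse into `NaN`, and rows containing a `NaN` are then dropped. The data below is invented for illustration, and the single-mask form is an equivalent alternative to the column loop, not the notebook's own code.

```python
import pandas as pd

# Toy data (made up): rows B and C each contain one non-numeric entry.
toy = pd.DataFrame(
    {
        "cases": ["120", "45", "7"],
        "deaths": ["3", "No data", "0"],
        "recovered": ["100", "40", "N/A"],
    },
    index=["A", "B", "C"],
)

# Coerce every column; unparseable strings become NaN
coerced = toy.apply(pd.to_numeric, errors="coerce")

# Keep only rows where every column parsed successfully
mask = coerced.notnull().all(axis=1)
clean = toy[mask].apply(pd.to_numeric)

print(clean)
```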


#### Convert the datatype of all DataFrame values from objects to integers.

In [189]:
df = df.apply(pd.to_numeric) 
df

country,cases,deaths,recovered
United States,18633794,329491,8232907
India,10099066,146444,9663382
Brazil,7366677,189264,6405356
Russia,2963688,53096,2370857
France,2505875,61978,187272
...,...,...,...
Marshall Islands,4,0,2
Wallis and Futuna,4,0,1
American Samoa,3,0,0
Samoa,2,0,0


#### Drop all rows of countries with zero recorded deaths or non-numeric death data.

In [190]:
df = df[df['deaths'] != 0]
df

country,cases,deaths,recovered
United States,18633794,329491,8232907
India,10099066,146444,9663382
Brazil,7366677,189264,6405356
Russia,2963688,53096,2370857
France,2505875,61978,187272
...,...,...,...
Northern Mariana Islands,113,2,32
British Virgin Islands,72,1,70
Fiji,46,2,44
Sahrawi Arab DR,31,3,27


#### Create a new column called cases_per_deaths and assign it the number of cases divided by the number of deaths (integer division, so the fractional part is dropped).

In [191]:
df = df.assign(cases_per_deaths=df['cases'] // df['deaths'])
df

country,cases,deaths,recovered,cases_per_deaths
United States,18633794,329491,8232907,56
India,10099066,146444,9663382,68
Brazil,7366677,189264,6405356,38
Russia,2963688,53096,2370857,55
France,2505875,61978,187272,40
...,...,...,...,...
Northern Mariana Islands,113,2,32,56
British Virgin Islands,72,1,70,72
Fiji,46,2,44,23
Sahrawi Arab DR,31,3,27,10
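
The `//` operator performs floor division, so the column holds only the whole-number part of the ratio. A short check using the United States row from the table shows the difference from true division.

```python
# United States row from the table: cases and deaths
cases, deaths = 18633794, 329491

floor_ratio = cases // deaths  # whole-number part only, as stored in the column
true_ratio = cases / deaths    # keeps the fractional part

print(floor_ratio)
print(round(true_ratio, 2))
```

Using `/` instead would keep the fraction, which matters most for rows with small counts, where dropping the fraction discards a larger share of the value.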


In [192]:
#### Sort the DataFrame so that the countries with the highest number of cases_per_deaths come first. 
#### Print the first 20 rows of your sorted DataFrame

In [193]:
df = df.sort_values(by='cases_per_deaths', ascending=False)
df[:20]

country,cases,deaths,recovered,cases_per_deaths
Singapore,58482,29,58322,2016
USS Theodore Roosevelt,1102,1,751,1102
Eritrea,877,1,599,877
Qatar,142605,243,140404,586
Burundi,760,2,687,380
Curaçao,4051,12,2385,337
Botswana,11982,38,11147,315
United Arab Emirates,197124,645,172984,305
Maldives,13537,48,12983,282
Bahrain,90817,350,88826,259


###### A high cases_per_deaths value means that few deaths were recorded per confirmed case in that country; a low value means that more people died per confirmed case.
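
One way to check how to read this column is to invert it into deaths per 100 cases. The sketch below uses three rows taken from the tables above; the variable names are chosen for illustration. Ranking by cases per death in descending order gives the same order as ranking by deaths per 100 cases in ascending order.

```python
# (cases, deaths) taken from the tables above
rows = {
    "Singapore": (58482, 29),
    "United States": (18633794, 329491),
    "Brazil": (7366677, 189264),
}

cases_per_death = {k: c / d for k, (c, d) in rows.items()}
deaths_per_100_cases = {k: 100 * d / c for k, (c, d) in rows.items()}

# Highest cases per death first
by_ratio = sorted(rows, key=lambda k: cases_per_death[k], reverse=True)
# Lowest deaths per 100 cases first
by_fatality = sorted(rows, key=lambda k: deaths_per_100_cases[k])

print(by_ratio)
print(by_fatality)
```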