# cricbuzz-scraping-project


## Pick a website and describe your objective

Outline:
- We're going to scrape https://www.cricbuzz.com/cricket-stats/icc-rankings/men/batting
- We'll get a list of topics from the site's navigation menu. For each topic, we'll get the topic title and topic page URL
- We'll save the topic titles and URLs to a CSV file.

In [1]:
!pip install requests --upgrade --quiet

In [2]:
import requests

In [11]:
topic_url = 'https://www.cricbuzz.com/cricket-stats/icc-rankings/men/batting'

In [12]:
response = requests.get(topic_url)

In [60]:
response.status_code

200
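The status code above was checked by eye. As a minimal sketch, the download and the check can be wrapped in one helper that raises on an error response. `fetch_html` and its `get` parameter are hypothetical names introduced here (the parameter exists only so the helper can be exercised without a network call); `timeout` and `raise_for_status()` are standard `requests` features.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the HTML at `url`, raising requests.HTTPError on a 4xx/5xx reply.

    `get` defaults to requests.get; it is a parameter (an assumption of this
    sketch) so that a stand-in can be passed when testing offline.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # does nothing on success, raises on 4xx/5xx
    return response.text
```

With this helper, the earlier cells could be written as `page_contents = fetch_html(topic_url)`.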

In [14]:
len(response.text)

312440

In [15]:
page_contents = response.text

In [16]:
page_contents[:1000]

'\r\n\r\n<!DOCTYPE html><html lang="en" itemscope itemtype="http://schema.org/WebPage"><head><meta charset="utf-8"><script>var is_mobile = /symbian|tizen|midp|uc(web|browser)|MSIE (5.0|6.0|7.0|8.0)|tablet/i.test(navigator.userAgent);\tif(is_mobile && window.location.hostname != "www1.cricbuzz.com") window.location.hostname = "m.cricbuzz.com";</script><style>body{background:#E3E6E3; font-family: helvetica,"Segoe UI",Arial,sans-serif;color:#222;font-size:14px; line-height: 1.5; margin:0;}\tbody, .cb-comm-pg, .cb-hm-mid {min-height:1000px}\t.container{width:980px;margin:0 auto;}\t.page{max-width: 980px;margin: 0 auto;position: relative;}\t.cb-col-8 {width:8%;}\t.cb-col-10 {width:10%;}\t.cb-col-14 {width:14%;}\t.cb-col-16 {width:16%;}\t.cb-col-20 {width:20%;}\t.cb-col-25 {width:25%;}\t.cb-col-27 {width:27%;}\t.cb-col-30 {width:30%;}\t.cb-col-33 {width:33%;}\t.cb-col-40 {width:40%;}\t.cb-col-46 {width:46%;}\t.cb-col-47 {width:47%;}\t.cb-col-50 {width:50%;}\t.cb-col-60 {width:60%;}\t.cb-col-

In [17]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

## Use Beautiful Soup to parse and extract information

In [18]:
!pip install beautifulsoup4 --upgrade --quiet

In [19]:
from bs4 import BeautifulSoup

In [20]:
doc = BeautifulSoup(page_contents,'html.parser')

In [21]:
type(doc)

bs4.BeautifulSoup

In [41]:
selection_class = 'text-white'
topic_title_tags = doc.find_all('a',{'class': selection_class})

In [42]:
len(topic_title_tags)

19

In [43]:
topic_title_tags[:5]

[<a class="text-white" href="/cricket-news" target="_self">News</a>,
 <a class="text-white" href="/cricket-schedule/series" target="_self">Series</a>,
 <a class="text-white" href="/cricket-team" target="_self">Teams</a>,
 <a class="text-white" href="/cricket-videos" target="_self">Videos</a>,
 <a class="text-white" href="/cricket-stats/icc-rankings/men/batting">Rankings</a>]

In [46]:
topic_title_tag4 = topic_title_tags[4]

In [47]:
topic_title_tag4.parent

<div class="cb-subnav cb-hm-mnu-itm feature-button cursor-pointer" id="rankingDropDown" title="ICC Rankings"><a class="text-white" href="/cricket-stats/icc-rankings/men/batting">Rankings</a><span class="cb-caret-down"></span><nav class="cb-sub-navigation"><a class="cb-subnav-item" href="/cricket-stats/icc-rankings/men/batting" target="_self" title="ICC Rankings Men">ICC Rankings - Men</a><a class="cb-subnav-item" href="/cricket-stats/icc-rankings/women/batting" target="_self" title="ICC Rankings Women">ICC Rankings - Women</a></nav></div>
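The parent `<div>` also holds the sub-navigation links for the men's and women's rankings. A small sketch of pulling those out, run on an abridged copy of the markup printed above so that it works offline; the variable names `html`, `parent` and `sub_items` are introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Abridged copy of the parent <div> printed above (some attributes dropped).
html = (
    '<div class="cb-subnav" id="rankingDropDown" title="ICC Rankings">'
    '<a class="text-white" href="/cricket-stats/icc-rankings/men/batting">Rankings</a>'
    '<nav class="cb-sub-navigation">'
    '<a class="cb-subnav-item" href="/cricket-stats/icc-rankings/men/batting">ICC Rankings - Men</a>'
    '<a class="cb-subnav-item" href="/cricket-stats/icc-rankings/women/batting">ICC Rankings - Women</a>'
    '</nav></div>'
)

parent = BeautifulSoup(html, 'html.parser').find('div', id='rankingDropDown')

# Collect (label, href) pairs for the sub-navigation links only.
sub_items = [
    (a.get_text(strip=True), a['href'])
    for a in parent.find_all('a', class_='cb-subnav-item')
]
print(sub_items)
```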

In [53]:
topic_link_tags = doc.find_all('a',{'class':'text-white'})

In [54]:
len(topic_link_tags)

19
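Both `find_all` calls select the same 19 `<a class="text-white">` tags, so the title and the link can be read from a single selection. A minimal sketch on an abridged copy of the navigation links shown earlier; `sample_doc` and `topics` are hypothetical names used only for this illustration.

```python
from bs4 import BeautifulSoup

# Abridged copy of three navigation links from the output shown earlier.
html = (
    '<a class="text-white" href="/cricket-news" target="_self">News</a>'
    '<a class="text-white" href="/cricket-schedule/series" target="_self">Series</a>'
    '<a class="text-white" href="/cricket-stats/icc-rankings/men/batting">Rankings</a>'
)
sample_doc = BeautifulSoup(html, 'html.parser')

# One find_all call yields both fields: the link text and the href attribute.
topics = [
    {'title': tag.get_text(strip=True), 'href': tag.get('href')}
    for tag in sample_doc.find_all('a', class_='text-white')
]
print(topics)
```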

In [58]:
rankings_url = "https://www.cricbuzz.com" + topic_link_tags[4]['href']
print(rankings_url)

https://www.cricbuzz.com/cricket-stats/icc-rankings/men/batting


In [105]:
topic_titles = []
for tag in topic_title_tags[0:5]:
    topic_titles.append(tag.text)
print(topic_titles)

['News', 'Series', 'Teams', 'Videos', 'Rankings']


In [106]:
topic_urls = []
base_url = 'https://www.cricbuzz.com'
for tag in topic_link_tags[0:5]:
    topic_urls.append(base_url + tag.get('href'))
    
topic_urls

['https://www.cricbuzz.com/cricket-news',
 'https://www.cricbuzz.com/cricket-schedule/series',
 'https://www.cricbuzz.com/cricket-team',
 'https://www.cricbuzz.com/cricket-videos',
 'https://www.cricbuzz.com/cricket-stats/icc-rankings/men/batting']
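String concatenation works here because every `href` is a site-relative path. As a hedged alternative, the standard library's `urllib.parse.urljoin` gives the same result for those paths and also leaves an already absolute link untouched. The last entry in `hrefs` below is a made-up absolute link added only to show that case.

```python
from urllib.parse import urljoin

base_url = 'https://www.cricbuzz.com'

# The first three hrefs come from the navigation links above; the last one is
# a hypothetical absolute link, included only to show how urljoin treats it.
hrefs = [
    '/cricket-news',
    '/cricket-schedule/series',
    '/cricket-stats/icc-rankings/men/batting',
    'https://www.cricbuzz.com/cricket-videos',
]

# urljoin resolves relative paths against base_url and keeps absolute URLs as-is.
joined_urls = [urljoin(base_url, href) for href in hrefs]
for url in joined_urls:
    print(url)
```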

In [107]:
!pip install pandas --quiet

In [108]:
import pandas as pd

In [110]:
topics_dict = {
    'title': topic_titles,
    'url': topic_urls,
}

In [111]:
topics_dict

{'title': ['News', 'Series', 'Teams', 'Videos', 'Rankings'],
 'url': ['https://www.cricbuzz.com/cricket-news',
  'https://www.cricbuzz.com/cricket-schedule/series',
  'https://www.cricbuzz.com/cricket-team',
  'https://www.cricbuzz.com/cricket-videos',
  'https://www.cricbuzz.com/cricket-stats/icc-rankings/men/batting']}

In [112]:
topics_df = pd.DataFrame(topics_dict)

In [113]:
topics_df

|   | title | url |
|---|-------|-----|
| 0 | News | https://www.cricbuzz.com/cricket-news |
| 1 | Series | https://www.cricbuzz.com/cricket-schedule/series |
| 2 | Teams | https://www.cricbuzz.com/cricket-team |
| 3 | Videos | https://www.cricbuzz.com/cricket-videos |
| 4 | Rankings | https://www.cricbuzz.com/cricket-stats/icc-ran... |


## Create CSV file with extracted information

In [115]:
topics_df.to_csv('topics.csv', index=False)
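To check what ends up in the CSV without touching the disk, the same export can be done in memory and read back. This is a sketch on a two-row sample of the data above; `sample_df`, `csv_text` and `round_trip` are names introduced here, and it relies on `to_csv` returning a string when no path is given.

```python
import io

import pandas as pd

# Two rows taken from the topics shown above.
sample_df = pd.DataFrame({
    'title': ['News', 'Series'],
    'url': [
        'https://www.cricbuzz.com/cricket-news',
        'https://www.cricbuzz.com/cricket-schedule/series',
    ],
})

# With no path argument, to_csv returns the CSV content as a string.
csv_text = sample_df.to_csv(index=False)

# Parse the text back into a DataFrame to confirm columns and rows survive.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(csv_text.splitlines()[0])
print(round_trip)
```

Passing `index=False` keeps the row numbers out of the file, so the header line contains only `title` and `url`.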