# **Scraping the Articles on Travelandleisure.com**
![](https://i.imgur.com/0UIHlcB.jpeg)

* Web scraping is an automated way to collect large amounts of data from websites. Most of this data is unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.

* [Travel + Leisure](https://www.travelandleisure.com/travel-guide) is a travel guide website where you can find information about destinations: historical places, hotels, the best things to do, and more. You can browse the details by location or by country.

* We will scrape https://www.travelandleisure.com/travel-guide to get each country's name, URL, and full article using the Python libraries requests and BeautifulSoup. Then we will save the data in CSV format using the pandas library for further analysis. The page lists destinations of several kinds (countries, regions, and cities); this notebook refers to all of them as countries.

* Outline of the steps we will follow:

1. Download the web page using requests.
2. Parse the HTML source code using BeautifulSoup.
3. Extract the country name, URL, and article from the web page.
4. Compile the extracted information into Python lists and dictionaries.
5. Extract and combine the data from each page.
6. Save the extracted information in CSV format.

* First, we need to install and import the libraries used to download and parse the web pages.

In [1]:
# install the requests and beautifulsoup4 libraries
!pip install requests --upgrade --quiet
!pip install BeautifulSoup4 --upgrade --quiet

In [2]:

# importing the library
import requests
from bs4 import BeautifulSoup

In [3]:
# Url of web-site
url = 'https://www.travelandleisure.com/travel-guide'

In [4]:
# download the web page, sending a browser-like User-Agent header
response = requests.get(url, headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'})

In [5]:
# checking the status code of the downloaded page
response.status_code

200
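
Checking `status_code` by hand works. As a hedged alternative, requests can raise an exception for error statuses through `raise_for_status()`, and a `timeout` keeps a stalled connection from hanging the notebook. The helper names `html_or_raise` and `fetch_html`, the shortened User-Agent, and the 30-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests

# Assumption: a shortened browser-like User-Agent, for illustration only.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}


def html_or_raise(response):
    """Return the response body as text, raising on 4xx and 5xx statuses."""
    # raise_for_status() raises requests.HTTPError for error responses.
    response.raise_for_status()
    return response.text


def fetch_html(url, timeout=30):
    """Download a page and return its HTML (hypothetical helper)."""
    response = requests.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    return html_or_raise(response)
```

With this in place, a failed download stops the run with a clear error instead of passing an error page on to the parser.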

In [6]:
# checking the length of the downloaded HTML, in characters
len(response.text)

186130

In [7]:
page_content = response.text

In [8]:
# checking the first 1000 characters of the page
page_content[:1000]

'<!DOCTYPE html>\n<html id="glossaryTemplate_1-0" class="comp no-js taxlevel-1   glossaryTemplate article-html html mntl-html" data-ab="99,99,72,99,72,99,99,99,99" data-resource-version="1.152.0" lang="en" data-travelandleisure-resource-version="1.152.0" data-mantle-resource-version="3.14.433" data-tracking-container="true">\t\t\t\t\n<!--\n<globe-environment environment="k8s-prod" application="travelandleisure" dataCenter="us-east-1"/>\n-->\n<head class="loc head">\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t<script type="text/javascript">var Mntl = window.Mntl || {};</script>\n\t\t\t\t\t\n\t\t\t\t\t    <link rel="preconnect" href="//js-sec.indexww.com">\n    <link rel="preconnect" href="//c.amazon-adsystem.com">\n    <link rel="preconnect" href="//securepubads.g.doubleclick.net">\n\n\t\t\t\t\t\n\t\t\t\t\t\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n\n<meta name="robots" content="max-image-preview:large, NOODP, NOYDIR" />\n\n<meta name="viewport" content="width=d

In [9]:
# creating a file and saving the page contents in it
with open('webpage.html', 'w', encoding='utf-8') as f:
  f.write(page_content)

**Use BeautifulSoup to Parse and Extract the Information**
* Use the BeautifulSoup library to extract information from the web page.

In [10]:
doc = BeautifulSoup(page_content,'html.parser')

In [11]:
# check the type of doc, which should be a BeautifulSoup object
type(doc)

bs4.BeautifulSoup

In [12]:
# Checking the Title of the Webpage
doc.title

<title>Travel Destinations A-Z: Find Destinations by Letter</title>
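
The extraction functions in this notebook rely on two BeautifulSoup calls, `find()` and `find_all()`. Here is a small sketch of both, run against a made-up HTML snippet rather than the live page. The snippet and its class name `demo-link` are invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented HTML for illustration; not the real Travel + Leisure markup.
sample_html = """
<html><head><title>Demo page</title></head>
<body>
  <a class="demo-link" href="https://example.com/a"> Alpha</a>
  <a class="demo-link" href="https://example.com/b"> Beta</a>
</body></html>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
first_link = sample_doc.find("a", class_="demo-link")
missing = sample_doc.find("table")

# find_all() returns every matching tag in document order.
all_links = sample_doc.find_all("a", class_="demo-link")

# Pull the visible text and the href attribute out of each tag.
names = [tag.text.strip() for tag in all_links]
hrefs = [tag.get("href") for tag in all_links]
```

Because `find()` can return `None`, any code that calls `.text` on its result should be prepared for a missing tag.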

### **Using Properties and Methods to Extract the Required Information**

**Creating a Function to Grab All Country Names**

In [13]:
def country(doc):
  # create a list to store the country names
  country_name = []
  for i in doc.find_all('a', class_='link-list__link type--dog-bold type--dog-link'):
    name = i.text.replace('\n', '')
    ct_name = name if name else 'no country'
    country_name.append(ct_name)
  # return the list of country names
  return country_name

In [14]:
# Checking the total number of countries
len(country(doc))

100

In [15]:
country(doc)[:5]

[' Amsterdam', ' Anguilla', ' Aspen', ' Atlanta', ' Auckland']

In [16]:
# First 10 countries, with surrounding whitespace stripped
country_name = [i.strip() for i in country(doc)]
country_name[:10]

['Amsterdam',
 'Anguilla',
 'Aspen',
 'Atlanta',
 'Auckland',
 'Austin',
 'Bahamas',
 'Bali',
 'Bangkok',
 'Barcelona']
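
A caveat about the `class_` argument used in `country()`: when it is given a string containing spaces, Beautiful Soup compares it against the exact value of the `class` attribute, so the match breaks if the site reorders or adds a class. One hedged alternative is a CSS selector on a single class through `select()`. The HTML below is invented; only the class names are borrowed from the notebook.

```python
from bs4 import BeautifulSoup

# Invented markup: the second link has the same classes in a different order.
html = """
<a class="link-list__link type--dog-bold type--dog-link" href="/x"> Amsterdam</a>
<a class="type--dog-bold link-list__link type--dog-link" href="/y"> Anguilla</a>
"""
demo = BeautifulSoup(html, "html.parser")

# A multi-class string only matches the exact attribute value.
exact = demo.find_all("a", class_="link-list__link type--dog-bold type--dog-link")

# A CSS selector matches any <a> that carries the class, in any order.
by_selector = demo.select("a.link-list__link")

exact_names = [tag.text.strip() for tag in exact]
selector_names = [tag.text.strip() for tag in by_selector]
```

Whether the looser selector is appropriate depends on the page: it would also pick up any other link that happens to carry the same class.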

**Creating a Function to Grab the URLs**

In [17]:
def country_link(doc):
  # create a list to store the URLs
  links = []
  for i in doc.find_all('a', class_='link-list__link type--dog-bold type--dog-link'):
    link = i.get('href')
    new_link = link if link else 'No Link'
    links.append(new_link)
  return links

In [18]:
# Checking the first 3 URLs
country_link(doc)[:3]

['https://www.travelandleisure.com/travel-guide/amsterdam',
 'https://www.travelandleisure.com/travel-guide/anguilla-lesser-antilles',
 'https://www.travelandleisure.com/travel-guide/aspen']

In [19]:
# Checking the number of Urls
len(country_link(doc))

100
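
The links on this page are already absolute URLs. If a site returned relative `href` values instead, they would need to be resolved against the page URL before being passed to requests. A minimal sketch using the standard library's `urljoin`; the helper name `absolute_links` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.travelandleisure.com/travel-guide"


def absolute_links(hrefs, base_url=BASE_URL):
    """Resolve each href against base_url, skipping empty values."""
    # urljoin leaves absolute URLs unchanged and resolves relative ones.
    return [urljoin(base_url, href) for href in hrefs if href]
```

Skipping empty values here is a design choice; the notebook's `country_link()` instead stores the placeholder `'No Link'`, which keeps the list the same length as the list of names.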

**Creating a Function to Download and Parse an Article Page**

In [20]:
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}

In [21]:
def get_url(link):
  # download one article page and return it as a BeautifulSoup object
  with requests.Session() as session:
    response = session.get(link, headers=headers)
  if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup
  else:
    # report the failing URL and its status code
    print('invalid url: {}'.format(link))
    print(response.status_code)
    return None
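
`get_url()` opens a new `Session` for every link, which gives up the connection reuse that a session provides. One possible refinement, sketched under assumptions, is a single shared session that also retries transient failures. The retry count, backoff factor, and status codes below are illustrative values, and `make_session` is a hypothetical helper.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(user_agent="Mozilla/5.0"):
    """Build one reusable session with retries (illustrative settings)."""
    # Retry up to 3 times on rate limiting and common server errors.
    retry = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)

    session = requests.Session()
    # Apply the retry policy to both URL schemes.
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    # Send the same User-Agent with every request made through the session.
    session.headers.update({"User-Agent": user_agent})
    return session
```

Each article could then be fetched with `session.get(link, timeout=30)` on the shared session, and a short `time.sleep()` between requests keeps the load on the site low.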

**Creating Functions to Grab the Article's Content, Title, Sub-heading, Author Name, and Date**

In [22]:
# Creating a function to grab the article content
def get_content(doc):
  content_div = doc.find('div', class_='loc article-content')
  # create a list to save all content
  all_content = []
  # walk through sub-headings (h2) and paragraphs (p) in page order
  for tag in content_div.find_all(['h2', 'p']):
    # paragraphs are content; anything else is a sub-heading
    if tag.name == 'p':
      all_content.append(f'{tag.text.strip()} \n')
    else:
      all_content.append(f'\n\nHeading: {tag.text.strip()}\n\n')
  return ' '.join(all_content)

# creating a function to grab the title of the article
def get_title(doc):
  title = doc.find('h1').text.strip()
  return title


# creating a function to grab the sub-title
def get_sub_heading(doc):
  try:
    sub_heading = doc.find('p', class_='comp type--dog article-subheading').text
    return sub_heading
  except AttributeError:
    # find() returned None, so the article has no sub-title
    return 'No Sub Title'

# creating a function to grab the author names
def get_author(doc):
  # find every author link in the byline
  authors = doc.find_all('a', class_='mntl-attribution__item-name')
  # drop duplicate names while keeping the order in which they appear
  author_name = list(dict.fromkeys(author.text for author in authors))
  # join multiple authors into a single string
  author_names = ' and '.join(author_name)
  return author_names


# creating a function to grab the article date
def get_date(doc):
  date = doc.find('div', class_='mntl-attribution__item-date')
  new_date = date.text.strip()
  # check whether the date starts with 'Updated on' or 'Published on'
  if new_date.startswith('Updated on') or new_date.startswith('Published on'):
    # remove these words from the date
    new_date = new_date.replace('Updated on', '').replace('Published on', '').strip()
  return new_date
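
`get_date()` returns the date as text such as 'September 12, 2023'. For sorting or filtering, it can help to convert that text into a real date object. The sketch below assumes the 'Month day, year' format seen in the sample output and English month names; `parse_article_date` is a hypothetical helper.

```python
from datetime import datetime

# Labels that may precede the date in the byline.
PREFIXES = ("Updated on", "Published on")


def parse_article_date(raw):
    """Turn 'Updated on September 12, 2023' into a datetime.date."""
    text = raw.strip()
    for prefix in PREFIXES:
        if text.startswith(prefix):
            # Drop only the leading label, then trim the leftover spaces.
            text = text[len(prefix):].strip()
            break
    # %B is the full month name; strptime raises ValueError on other formats.
    return datetime.strptime(text, "%B %d, %Y").date()
```

A date in any other format raises `ValueError`, which makes unexpected byline formats visible rather than silently storing them.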

## **Now We Will Scrape All the Data from the Website Using These Functions**

* Here we will visit every URL so that we can collect the data for all countries listed on the website.
* For that, we will create a function `scrap()` that returns all the data.

In [23]:
def scrap():
  # dictionary to store the country name, URL and article details
  all_details = {'Country-Name':[],'Link':[],'title':[],'sub-heading':[],'author':[],'date':[], 'content': []}
  url = 'https://www.travelandleisure.com/travel-guide'
  response = requests.get(url, headers=headers)
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(url))
  # name the parser explicitly so the result does not depend on what is installed
  doc = BeautifulSoup(response.text, 'html.parser')
  # saving the country names into a variable
  country_name = country(doc)
  # saving the article URLs into a variable
  country_links = country_link(doc)
  # updating the all_details dictionary
  all_details['Country-Name'].extend(country_name)
  all_details['Link'].extend(country_links)

  for link in country_links:
    doc = get_url(link)
    if doc is None:
      # keep the columns aligned when a page fails to download
      for key in ('title', 'sub-heading', 'author', 'date', 'content'):
        all_details[key].append(None)
      continue
    # extracting the title, sub-heading, author, date and content
    title = get_title(doc)
    sub_heading = get_sub_heading(doc)
    author_name = get_author(doc)
    date = get_date(doc)
    content = get_content(doc)

    # saving to the dictionary
    all_details['title'].append(title)
    all_details['sub-heading'].append(sub_heading)
    all_details['author'].append(author_name)
    all_details['date'].append(date)
    all_details['content'].append(content)

  return all_details
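
`scrap()` keeps seven parallel lists that must stay the same length for the rows to line up. A hedged alternative is to build one dict per article, so each row is complete on its own. In the sketch below, `build_rows` and the injected `fetch` callable are hypothetical: `fetch` stands in for the download-and-extract step and returns `None` on failure, and `fake_fetch` exists only so the example runs without network access.

```python
FIELDS = ("title", "sub-heading", "author", "date", "content")


def build_rows(names, links, fetch):
    """Return one dict per article; missing fields are stored as None."""
    rows = []
    for name, link in zip(names, links):
        # A failed fetch yields an empty dict, so the row is still created.
        details = fetch(link) or {}
        row = {"Country-Name": name.strip(), "Link": link}
        for key in FIELDS:
            row[key] = details.get(key)
        rows.append(row)
    return rows


def fake_fetch(link):
    """Stand-in for the real scraping step; pretends one page fails."""
    if link.endswith("/aspen"):
        return None
    return {"title": "Demo title", "author": "Demo author"}


demo_rows = build_rows(
    [" Amsterdam", " Aspen"],
    ["https://example.com/amsterdam", "https://example.com/aspen"],
    fake_fetch,
)
```

A list of such dicts can be passed directly to `pd.DataFrame(rows)`.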

In [24]:
scrap()

{'Country-Name': [' Amsterdam',
  ' Anguilla',
  ' Aspen',
  ' Atlanta',
  ' Auckland',
  ' Austin',
  ' Bahamas',
  ' Bali',
  ' Bangkok',
  ' Barcelona',
  ' Belize',
  ' Berlin',
  ' Bermuda',
  ' Big Sur',
  ' Boston',
  ' Brooklyn',
  ' Buenos Aires',
  ' Cape Cod',
  ' Cape Town',
  ' Charleston',
  ' Chicago',
  ' Copenhagen',
  ' Costa Rica',
  ' Dallas',
  ' Denver',
  ' Dolomites',
  ' Dubai',
  ' Dublin',
  ' Finger Lakes',
  ' Florence',
  ' Florida Keys',
  ' Hilton Head Island',
  ' Hong Kong',
  ' Houston',
  ' Iceland',
  ' Istanbul',
  ' Jaipur',
  ' Lake Tahoe',
  ' Las Vegas',
  ' Lima',
  ' Lisbon',
  ' London',
  ' Los Angeles',
  ' Los Cabos',
  ' Madrid',
  ' Maine',
  " Martha's Vineyard",
  ' Maui',
  ' Melbourne',
  ' Mexico City',
  ' Miami',
  ' Milan',
  ' Montana',
  ' Mustique',
  ' Nantucket',
  ' Napa Valley',
  ' Nashville',
  ' New Delhi',
  ' New Orleans',
  ' New York City',
  ' Nicaragua',
  ' Oahu',
  ' Oaxaca',
  ' Orlando',
  ' Palm Springs',
  

## **Now We Will Import the pandas Library and Store the Data in It**
* The `data_df` variable will store the information in a pandas DataFrame.
* We will scrape all the data.

In [25]:
#importing pandas library
import pandas as pd
data_df = pd.DataFrame.from_dict(scrap(), orient='index').T

In [26]:
data_df

| | Country-Name | Link | title | sub-heading | author | date | content |
|---|---|---|---|---|---|---|---|
| 0 | Amsterdam | https://www.travelandleisure.com/travel-guide/... | How to Plan a Perfect Trip to Amsterdam | \nDiscover the best hotels, restaurants, and t... | Lindsay Cohn and Evie Carrick | September 12, 2023 | Gautier Houba/Travel + Leisure \n Iconic canal... |
| 1 | Anguilla | https://www.travelandleisure.com/travel-guide/... | This Secluded Caribbean Island Has White-sand ... | \nParadise is a Caribbean island with white-sa... | Lindsay Cohn | September 26, 2023 | The Dominican Republic brims with all-inclusiv... |
| 2 | Aspen | https://www.travelandleisure.com/travel-guide/... | Aspen Travel Guide | No Sub Title | Evie Carrick | January 26, 2022 | Arguably no U.S. mountain town is as synonymou... |
| 3 | Atlanta | https://www.travelandleisure.com/travel-guide/... | Atlanta Travel Guide | No Sub Title | Ellie Nan Storck | March 2, 2021 | While it's been years since Atlanta became one... |
| 4 | Auckland | https://www.travelandleisure.com/travel-guide/... | New Zealand's Largest City Has Exceptional Nat... | \nHow to plan the perfect trip to Auckland, Ne... | Amy Louise Bailey | May 16, 2022 | There are so many new and exciting things to s... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Turks and Caicos | https://www.travelandleisure.com/travel-guide/... | How to Plan an Unforgettable Trip to Turks and... | \nDiscover the best hotels, restaurants, and t... | Skye Sherman | May 26, 2023 | Matt Anderson Photography/Getty Images \n Turk... |
| 96 | Vancouver | https://www.travelandleisure.com/travel-guide/... | Vancouver Travel Guide | No Sub Title | Paul Feinstein | November 11, 2021 | There are a lot of misconceptions when it come... |
| 97 | Venice | https://www.travelandleisure.com/travel-guide/... | Italy's Floating City Is One of the Most Memor... | \nVisit Venice for an unforgettable adventure.... | Julia Buckley | July 6, 2021 | Canals, gondolas, and the Rialto Bridge. You t... |
| 98 | Vienna | https://www.travelandleisure.com/travel-guide/... | This Gorgeous European City Is Known for Its F... | No Sub Title | Patricia Doherty | August 5, 2021 | Vienna (Wien in German), Austria's capital cit... |
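
A note on the `from_dict(..., orient='index').T` construction used above: it tolerates lists of different lengths by padding the shorter ones with missing values, which can hide a misaligned scrape. Passing the dict straight to `pd.DataFrame` raises an error instead. The toy dictionaries below are invented to show the difference.

```python
import pandas as pd

# Invented data: the 'title' list is one item short.
uneven = {
    "Country-Name": ["Amsterdam", "Anguilla"],
    "title": ["Only one title"],
}

# orient='index' plus transpose pads the short column with a missing value.
padded_df = pd.DataFrame.from_dict(uneven, orient="index").T

# The plain constructor refuses lists of unequal length.
try:
    pd.DataFrame(uneven)
    strict_error = None
except ValueError as exc:
    strict_error = exc

# With equal-length lists the plain constructor is the simpler choice.
even = {
    "Country-Name": ["Amsterdam", "Anguilla"],
    "title": ["First title", "Second title"],
}
clean_df = pd.DataFrame(even)
```

Which behaviour is preferable depends on the goal: padding keeps the run going, while the strict constructor surfaces a length mismatch immediately.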


## **Now We Will Save Our Data in CSV Format**

In [27]:
data_df.to_csv('article.csv',index=None)
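
When a CSV is written with the index included and later read back, pandas shows the old index as an extra `Unnamed: 0` column. The following sketch uses an invented two-row frame and in-memory buffers instead of `article.csv` to show how the `index` argument of `to_csv` affects the result.

```python
import io

import pandas as pd

# Invented two-row frame standing in for the scraped data.
demo_df = pd.DataFrame({
    "Country-Name": ["Amsterdam", "Anguilla"],
    "date": ["September 12, 2023", "September 26, 2023"],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
demo_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False writes only the data columns.
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)
without_index.seek(0)
round_trip = pd.read_csv(without_index)
```

Reading the file back like this is also a quick way to confirm that values containing commas, such as the dates, were quoted and survived the round trip.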

![](https://i.imgur.com/s53viNQ.png)

## **Summary**


* This was a web scraping project that combined several libraries and functions to fetch the data and save it in CSV format, which can then be opened in Excel for further analysis.
* We used the requests and BeautifulSoup libraries to download and scrape the web pages, respectively.
* We used the find() and find_all() methods to locate the required tags on the website.
* Then we created multiple functions to grab the information below.

 1. Country's name
 2. Country's article URL
 3. Title of article
 4. Sub-heading of article
 5. Author name
 6. Date of article
 7. Content
* After scraping with these functions, we stored the data in a DataFrame from the pandas library.
* Finally, we saved the stored data as a CSV file, which can be opened in Excel.

## **Challenges**
* Some articles were written by two authors, and I needed to extract both names. To solve this, I collected the names and joined them with "and" into a single string.
* I also faced challenges in keeping the content in sequence, such as extracting sub-headings and paragraphs in the right order. To address this, I called find_all() with both tags at once and then used an if-else statement to label each tag as a sub-heading or content while preserving the page order.
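
For the multiple-author case, one detail worth noting is that a Python `set` does not preserve insertion order, so joining a set can list the authors in a different order from the byline. Below is a small sketch of an order-preserving alternative; `join_authors` is a hypothetical helper, and the names come from the sample output above.

```python
def join_authors(names):
    """Join author names with ' and ', dropping duplicates but keeping order."""
    # Trim whitespace and skip empty entries.
    cleaned = [name.strip() for name in names if name and name.strip()]
    # dict.fromkeys keeps the first occurrence of each name, in order.
    unique = list(dict.fromkeys(cleaned))
    return " and ".join(unique)
```

For example, `join_authors(["Lindsay Cohn", "Evie Carrick", "Lindsay Cohn"])` gives `"Lindsay Cohn and Evie Carrick"` on every run.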

## **References**
* Website scraped: [Travel and Leisure](https://www.travelandleisure.com/travel-guide)
* Requests documentation - https://docs.python-requests.org/en/latest/
* Beautiful Soup documentation - https://beautiful-soup-4.readthedocs.io/en/latest/
* Extend and append methods - https://www.geeksforgeeks.org/append-extend-python/
* pandas documentation - https://pandas.pydata.org/
* Python documentation - https://docs.python.org/3/tutorial/index.html




