# Title of Your Project
### Author: Your Name

## Introduction

In this Jupyter Notebook, I document the web scraping process for Project 2 - Task 3. The focus is on extracting business listings from Yellow Pages Nepal (https://www.yellowpagesnepal.com/), an online business directory.

This notebook includes:
- Setup and import of scraping modules
- Initialization and use of the `LinkScraper` class to collect relevant links
- Utilization of the `DataScraper` class to extract detailed information
- Initial data exploration to understand the dataset
- Discussion of the dataset's potential applications and the meaning of its fields

The code repository for this project can be found at [GitHub Repository URL](#).

**Please note**: Web scraping has been conducted in accordance with the website's `robots.txt` file and terms of service, ensuring that scraping is permitted.
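
A `robots.txt` check of this kind can be sketched with Python's standard library. The rules and the `example.com` URLs below are hypothetical stand-ins so the example runs offline; they are not the real site's rules, and `is_allowed` is an illustrative helper, not part of the project package.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/some-listing"))  # True
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/admin/login"))   # False
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` would load the real file instead of the sample text.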


## Setup

To begin scraping, we set up the environment: we import the custom scraping classes defined in our Python package and establish the initial parameters for the run, such as the base URL of the target website and any class names that guide the scraping process.

The `LinkScraper` class gathers the relevant links from the website, and the `DataScraper` class extracts detailed data from each linked page. Both classes belong to an installable Python package and are imported below from its `webscraping` module.


In [3]:
# Import the necessary libraries and modules
from webscraping import LinkScraper, DataScraper
import pandas as pd

# Define the base URL of the website we will be scraping
base_url = 'https://www.yellowpagesnepal.com/'

# If needed, define any other parameters such as class names or identifiers that will guide the scraping process
# class_names = ['class1', 'class2', 'class3', ...]


## Initializing the LinkScraper

Now that our environment is set up, we can begin the scraping process. The first step is to initialize our `LinkScraper` class, which is designed to navigate through the website and collect all relevant links. These links will be the targets for our detailed data scraping later on.


## Specifying Class Names for Scraping

In accordance with the project guidelines, I will now specify the class names corresponding to the sections of the website that contain the data of interest. These class names have been determined by inspecting the website's HTML structure and identifying the containers that hold the relevant data.

This targeted approach allows us to collect a dataset with specific information that we will describe and analyze. The chosen dataset is unique within the class and will be used to answer several questions related to [describe the type of analysis or questions of interest].
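
To illustrate how class names can guide link collection, here is a minimal sketch using only the standard library. It is not the `LinkScraper` implementation, whose internals are not shown in this notebook; `ClassLinkParser`, `collect_links`, and the sample HTML are hypothetical, and the sketch assumes reasonably well-formed markup and matches single class tokens.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassLinkParser(HTMLParser):
    """Collect hrefs of <a> tags that sit inside (or are) an element with a target class."""

    def __init__(self, base_url, class_names):
        super().__init__()
        self.base_url = base_url
        self.targets = set(class_names)
        self.stack = []   # one flag per open element: is it a target container?
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        is_target = bool(classes & self.targets)
        if tag == "a" and attrs.get("href") and (is_target or any(self.stack)):
            # Resolve relative links against the base URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        if tag not in VOID_TAGS:
            self.stack.append(is_target)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()


def collect_links(html, base_url, class_names):
    parser = ClassLinkParser(base_url, class_names)
    parser.feed(html)
    return parser.links


# Hypothetical page fragment for demonstration.
SAMPLE_HTML = """
<div class="cat-content">
  <a href="/dowellsonic">Listing A</a>
  <img src="logo.png">
  <a href="https://example.com/nylon2018">Listing B</a>
</div>
<div class="footer"><a href="/about">About</a></div>
"""

print(collect_links(SAMPLE_HTML, "https://example.com/", ["cat-content"]))
```

Only the two links inside the `cat-content` container are collected; the footer link is ignored because its container carries none of the target classes.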


In [6]:
# Specify the class names that contain the data we want to scrape
class_names = ['cat-content', 'filter-content ml-3', 'paging', 'add-image']

# Initialize the LinkScraper with the base URL and the specified class names
scraper = LinkScraper(base_url, class_names)

# Perform the scraping process and collect the links
links_df = scraper.get_dataframe()

# Display the first few rows of the DataFrame
links_df.head()



| | Link |
|---|---|
| 0 | https://www.yellowpagesnepal.com/abhi-raj-thre... |
| 1 | https://www.yellowpagesnepal.com/dowellsonic |
| 2 | https://www.yellowpagesnepal.com/puantistres |
| 3 | https://www.yellowpagesnepal.com/bahuvida-limited |
| 4 | https://www.yellowpagesnepal.com/nylon2018 |


## Detailed Data Extraction

With the list of relevant links in hand, the next step is to extract detailed information from each linked page. The `DataScraper` class is designed for this purpose. It will visit each URL we collected and parse the page's content to extract the data fields we are interested in, such as business names, addresses, contact information, and any other details provided on the page.

This extracted data will form the basis of our dataset, which will be used to answer questions regarding [insert the types of questions or analysis you plan to conduct with the data, such as business distribution, contact information availability, etc.].
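
The visit-and-parse loop described above can be sketched as follows. This is an assumption about the general pattern, not the `DataScraper` code: `scrape_pages`, `fake_fetch`, `parse_email`, and the sample page are hypothetical, and the fetch step is injected so the example runs without network access.

```python
import re
import time
from typing import Callable, Dict, Iterable, List, Optional


def scrape_pages(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    parse: Callable[[str], Dict[str, Optional[str]]],
    delay_seconds: float = 1.0,
) -> List[Dict[str, Optional[str]]]:
    """Fetch each URL, parse it into a record, and skip pages that fail to load."""
    records = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause between requests to limit server load
        try:
            html = fetch(url)
        except Exception as exc:
            print(f"Skipping {url}: {exc}")
            continue
        record = parse(html)
        record["url"] = url
        records.append(record)
    return records


# Hypothetical stand-ins for a real HTTP fetch and a real page parser.
FAKE_PAGES = {"https://example.com/a": "<p>Email: info@example.com</p>"}


def fake_fetch(url: str) -> str:
    if url not in FAKE_PAGES:
        raise ValueError("page not found")
    return FAKE_PAGES[url]


def parse_email(html: str) -> Dict[str, Optional[str]]:
    match = re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", html)
    return {"email": match.group(0) if match else None}


records = scrape_pages(
    ["https://example.com/a", "https://example.com/missing"],
    fake_fetch,
    parse_email,
    delay_seconds=0,
)
print(records)
```

Passing `fetch` and `parse` as arguments keeps the loop testable offline; a real run would supply a function that performs the HTTP request.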


In [9]:
# DataScraper was already imported in the Setup cell; re-importing is harmless
# and keeps this cell runnable on its own
from webscraping import DataScraper

# Instantiate the DataScraper
data_scraper = DataScraper()

# Take the first 10 links from the 'Link' column of links_df as a small batch
links = links_df['Link'].tolist()[:10]

# Scrape detailed data from each link in the batch
data_scraper.scrape_sites(links)

# Convert the scraped data to a DataFrame
detailed_data_df = data_scraper.to_dataframe()

# Display the first few rows of the detailed data DataFrame to ensure it looks correct
detailed_data_df.head()


Scraping https://www.yellowpagesnepal.com/abhi-raj-thresher-manufacturers-and-treders
Scraping https://www.yellowpagesnepal.com/dowellsonic
Scraping https://www.yellowpagesnepal.com/puantistres
Scraping https://www.yellowpagesnepal.com/bahuvida-limited
Scraping https://www.yellowpagesnepal.com/nylon2018
Scraping https://www.yellowpagesnepal.com/butwal-krishi-yantra-ghar
Scraping https://www.yellowpagesnepal.com/1
Scraping https://www.yellowpagesnepal.com/shengzhou-fusheng-machinery-and-electrics
Scraping https://www.yellowpagesnepal.com/comeonbaby1
Scraping https://www.yellowpagesnepal.com/zhejiang-jinxia-new-material-technology-co-ltd


| | updated_date | views | description | phone | email | website | reviews | related_categories | postal_code | fax_number |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Oct 20, 2017 | 3158 | We are manufacturer of multi crop thresher and... | 91-9587969826 | Abhijitpandey555@gmail.com | | No review posted | Agricultural Equipment & Implements | | |
| 1 | Dec 21, 2017 | 2525 | HANGZHOU DOWELL ULTRASONIC TECHNOLOGY CO.,LTD ... | 86-86-571-23289358 | davidwang@dowellsonic.com | http://www.dowellsonic.com/ | No review posted | Agricultural Equipment & Implements | | 86-571-23289358 |
| 2 | Dec 27, 2017 | 2069 | Shanghai L&H International Co., Ltd. is an int... | 692-86-21-67186908 | Sales@gdbr-bearing.com | http://www.gdbr-bearing.com | No review posted | Agricultural Equipment & Implements | | |
| 3 | Apr 26, 2018 | 2268 | Bahuvida Group is an Indian multinational cong... | 91-+919398671645, 91-9398671645 | bahuvidalimited@gmail.com | | No review posted | Agricultural Equipment & Implements | 500034.0 | +91-40-23558850 |
| 4 | Apr 28, 2018 | 2485 | Huzhou Hengxin Label Manufacture Co.,Ltd. is l... | 86-86-572-3773115, 86-86-13666518088 | info@hengxin-label.com | http://www.hengxin-label.com/ | No review posted | Agricultural Equipment & Implements | | 86-572-3771088 |


In [10]:
# Specify the filename for the CSV of detailed data
detailed_csv_filename = 'detailed_scraped_data.csv'

# Save the detailed data to a CSV file
data_scraper.save_to_csv(detailed_csv_filename)

# Confirm saving the file
print(f"Detailed scraped data has been saved to {detailed_csv_filename}")


Data saved to detailed_scraped_data.csv
Detailed scraped data has been saved to detailed_scraped_data.csv


## Initial Data Exploration

Now that we have our detailed dataset, let's perform some initial explorations to understand what we have collected. We will look at the data types, check for missing values, and summarize the dataset to ensure its integrity and readiness for analysis.

We will also describe each field to clarify what they represent and how they can be used for answering our questions or providing insights.
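
As one possible preparation step, text fields such as `updated_date` and `phone` could be converted into more usable types. The sketch below works on a small hand-built frame that copies two values per column from the preview table above; the `phone_list` column name is hypothetical, and the date format string assumes every row follows the `Oct 20, 2017` pattern.

```python
import pandas as pd

# Small stand-in frame using values copied from the preview table.
sample = pd.DataFrame({
    "updated_date": ["Oct 20, 2017", "Apr 28, 2018"],
    "phone": ["91-9587969826", "86-86-572-3773115, 86-86-13666518088"],
})

# Parse the date strings into real datetime values.
sample["updated_date"] = pd.to_datetime(sample["updated_date"], format="%b %d, %Y")

# Split comma-separated phone numbers into a list per listing.
sample["phone_list"] = sample["phone"].apply(
    lambda text: [part.strip() for part in text.split(",")]
)

print(sample.dtypes)
print(sample["phone_list"].tolist())
```

With real datetimes, listings can be sorted or grouped by update year; with phone lists, the number of contact numbers per listing becomes easy to count.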


In [11]:
# Display the info of the DataFrame to get an overview of the data types and missing values
detailed_data_df.info()

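# Compute statistical summaries for numerical fields in the DataFrame.
# A notebook renders only the last expression of a cell, so this result does not
# appear in the output below; wrap the call in display(...) to show it.
detailed_data_df.describe()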

# If there are categorical data fields, consider including frequency counts for those as well
# detailed_data_df['category_field'].value_counts()

# Check for missing values in the DataFrame
detailed_data_df.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   updated_date        10 non-null     object
 1   views               10 non-null     int64 
 2   description         10 non-null     object
 3   phone               10 non-null     object
 4   email               10 non-null     object
 5   website             10 non-null     object
 6   reviews             10 non-null     object
 7   related_categories  10 non-null     object
 8   postal_code         10 non-null     object
 9   fax_number          10 non-null     object
dtypes: int64(1), object(9)
memory usage: 928.0+ bytes


updated_date          0
views                 0
description           0
phone                 0
email                 0
website               0
reviews               0
related_categories    0
postal_code           0
fax_number            0
dtype: int64
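
The zero counts above sit oddly with the preview table, where `website` and `postal_code` are blank for several rows. One possible explanation, which is an assumption and not something this notebook confirms, is that missing fields are stored as empty strings, which `isnull()` does not count. A minimal sketch of treating blanks as missing, using a hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in frame: blanks stored as empty or whitespace-only strings.
sample = pd.DataFrame({
    "website": ["", "http://www.dowellsonic.com/", "  "],
    "postal_code": ["", "", "500034.0"],
})

# isnull() sees no missing values because empty strings are valid objects.
print(sample.isnull().sum())

# Strip whitespace, then mask empty strings so they become NaN.
stripped = sample.apply(lambda col: col.str.strip())
cleaned = stripped.mask(stripped.eq(""))

missing = cleaned.isna().sum()
print(missing)
```

If the same pattern holds for `detailed_data_df`, applying this step before `isnull().sum()` would give missing-value counts that match what the preview shows.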