# Scraping Startup Data from NIC (National Incubation Center) Hyderabad Sindh, Pakistan

## Introduction

The National Incubation Center (NIC) Hyderabad is a hub for fostering innovation and entrepreneurship in Sindh, Pakistan. This notebook aims to scrape and compile data on startups from the NIC Hyderabad website. The objective is to gather information on the startups, including their names, descriptions, and links to their detailed profiles, which can be used for further analysis and research.

## 1. Setup and Libraries

In this section, we will import the necessary libraries required for web scraping and data processing.

In [55]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import requests
from bs4 import BeautifulSoup

## 3. Web Scraping
### Base URL and Headers
We define the base URL for NIC Hyderabad cohorts and set headers to mimic a real browser.

In [56]:
# Base URL adding 1, 2, 3, end of will make the change the round/cohort
url = 'https://nichyderabad.com/about/cohort-'

# Set headers to mimic a real browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

## 3. Fetching Data from NIC Hyderabad
We fetch the web pages for each cohort and parse the HTML content.

In [57]:
# to temprary save all the responces
responses = {'cohorts':[], 'sope':[]}

# itrating ovwer all the cohorts page
for i in range(1, 5): # till this date 24/06/2024 we only have four cohorts
  # fatching the page of the iterate index
  response= requests.get(url+f'{i}', headers=headers)
  # only if the request is seccefull
  if response.status_code == 200:
    # save the cohort to the responses
    responses['cohorts'].append(i)
    # turning the responce to sope
    soup = BeautifulSoup(response.content, 'html.parser')
    # saving that sope in responses
    responses['sope'].append(soup)
  else: print(response.status_code, 'for searching cohort', i)

### Extracting Data for Each Cohort
We extract the relevant data (company names, URL of the each startup's profile page, and cohort session) from each cohort's page.
#### Cohort 1

In [58]:
# for cohort 1
soup_cohort1 = responses['sope'][0]

In [59]:
# selecting all "div" eiliments with class content-container
content_container = soup_cohort1.find_all('div', class_='content-container')
print('Total number of retrieve compony in cohort-1', len(content_container))

Total number of retrieve compony in cohort-1 20


In [60]:
NIC = {'company':[], 'NIC-URL':[], 'cohort':[]}

for i in content_container:
  if i.find('p').find_all('a') != []:
    company_url=i.find('p').find_all('a')[0].get('href')
    NIC["company"].append(company_url.split('/')[-2])
    NIC["NIC-URL"].append(company_url)

# adding 1 to each cohort equel to the number of the other observation
NIC["cohort"]=[1]*len(NIC["company"])

In [61]:
NIC_df = pd.DataFrame(NIC)
NIC_df.head()

Unnamed: 0,company,NIC-URL,cohort
0,emplai,https://nichyderabad.com/about/cohort-1/emplai/,1
1,agridunya-technologies,https://nichyderabad.com/about/cohort-1/agridu...,1
2,seevitals-solutions,https://nichyderabad.com/about/cohort-1/seevit...,1
3,monitr,https://nichyderabad.com/about/cohort-1/monitr/,1
4,upni-market,https://nichyderabad.com/about/cohort-1/upni-m...,1


#### Fetching the all the other cohorts

In [62]:
total_page = len(responses['sope'])
# cohort 2 and on
for i in range(1, total_page):
  sope_cohort = responses['sope'][i]
  # find all "a" tages with heading-link class ans store in list
  a_tags = sope_cohort.find_all('a', class_='heading-link')
  print(f'Total number of retrieve compony in cohort-{i+1} are', len(a_tags))
  for j in a_tags:
    # add the company, url, and cohort 2 to the NIC_df
    NIC_df.loc[len(NIC_df)] = [j.get('href').split('/')[-2], j.get('href'), 2]

Total number of retrieve compony in cohort-2 are 28
Total number of retrieve compony in cohort-3 are 20
Total number of retrieve compony in cohort-4 are 14


In [63]:
NIC_df.drop_duplicates(inplace=True)
NIC_df[NIC_df['company'].duplicated(keep=False)]

Unnamed: 0,company,NIC-URL,cohort
23,nichyderabad.com,https://nichyderabad.com/?page_id=9974&preview...,2
30,nichyderabad.com,https://nichyderabad.com/?page_id=9957&preview...,2
31,nichyderabad.com,https://nichyderabad.com/?page_id=9976&preview...,2
35,nichyderabad.com,https://nichyderabad.com/?page_id=9913&preview...,2
75,nichyderabad.com,https://nichyderabad.com/?page_id=10985&previe...,2
79,nichyderabad.com,https://nichyderabad.com/?page_id=10953&previe...,2


*Note: Now that we have the like for each page of the startup on NIC website, we will start to fatch the data from thos page*

In [64]:
def get_soup(link_:str) -> BeautifulSoup:
    response = requests.get(link_, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup
    else:
        print(response.status_code, 'for', link)

In [65]:
import re

for i in NIC_df['NIC-URL'].index:
  soup = get_soup(NIC_df['NIC-URL'][i])
  table_div = soup.find('div', class_= re.compile(r'fusion-text fusion-text-2.*'))
  # print(i, NIC_df['NIC-URL'][i])
  table = table_div.find('table')

  # Iterate over each row in the table
  for row in table.find_all('tr'):
      # Extract columns from each row
      columns = row.find_all('td')
      # Get the text content of each column and strip any extra whitespace
      columns = [col.get_text(strip=True) for col in columns]
      # add this info to the NIC_df where only have three columns [company	NIC-URL	cohort]
      if len(columns) >= 2:
        NIC_df.loc[i, columns[0]] = columns[1]

In [66]:
# find rows having nan
print(NIC_df[NIC_df.isna().any(axis=1)])
NIC_df.head()

Empty DataFrame
Columns: [company, NIC-URL, cohort, STARTUP, STARTUP STAGE, DOMAIN, WEBSITE, FOUNDER, CONTACT, EMAIL]
Index: []


Unnamed: 0,company,NIC-URL,cohort,STARTUP,STARTUP STAGE,DOMAIN,WEBSITE,FOUNDER,CONTACT,EMAIL
0,emplai,https://nichyderabad.com/about/cohort-1/emplai/,1,EmplAi,Accelerate,"HR, Attendance",http://www.emplai.ai/,Aftab Ahmed Saraz (CEO & Founder),–,–
1,agridunya-technologies,https://nichyderabad.com/about/cohort-1/agridu...,1,AgriDunya Technologies,Accelerate,AgTech,https://www.agridunya.com/,Rahul Dembani,+92 330 0792000,support@agridunya.com
2,seevitals-solutions,https://nichyderabad.com/about/cohort-1/seevit...,1,SeeVitals,Accelerate,Health Tech,http://www.seevitals.com/,Dr. Nimra Qureshi,+92 3451591251,info@SeeVitals.com
3,monitr,https://nichyderabad.com/about/cohort-1/monitr/,1,Monitr,Accelerate,"Ed-Tech, SAAS",www.monitr.site,Daeyan Hafeez Siddiqui,0345 3531573,daeyansidi826@gmail.com
4,upni-market,https://nichyderabad.com/about/cohort-1/upni-m...,1,Upni Market,Accelerate,E-commerce,www.Upnimarket.com,Ahsan Zahid,+92 322 0232991,info@upnimarket.com


## 4. Saving the Data
Finally, we save the collected and processed data to a CSV file for future analysis.

In [67]:
NIC_df.to_csv('NIC_startups-data.csv')