In [1]:
import re
import numpy as np
import pandas as pd
import json

# 1. Purpose of this Phase
#### The aim of this phase was to build a database of tech startups in Saudi Arabia, capturing the essential information about each company to support later analysis.

# 2. Data Source

The dataset used in this analysis comes from **Hussein Attar’s Saudi Tech Startups** report, which provides comprehensive information about startups in Saudi Arabia, including their industries, funding stages, and other key details.  

🔗 [Saudi Tech Startups – Hussein Attar](https://www.husseinattar.com/en/saudi-tech-startups/)


In [2]:
df = pd.read_csv("OriginalData.csv")

#### The main columns extracted from the site are:

In [3]:
df.columns

Index(['Name', 'Website', 'Featured', 'Stage', 'Industry', 'Tags', 'Region',
       'Startup HQ'],
      dtype='object')

In [4]:
df.head(5)

Unnamed: 0,Name,Website,Featured,Stage,Industry,Tags,Region,Startup HQ
0,Lean,https://www.leantech.me,,Series B,Financial Services,Fintech Infrastructure,Riyadh,Saudi Arabia
1,Foodics,https://www.foodics.com/,,Series C,"Technology ,Food & Beverages","SaaS,Financial Solution",Riyadh,Saudi Arabia
2,Salla,https://salla.com/,True,Private Equity,Technology,SaaS,Makkah,Saudi Arabia
3,Nana,https://nana.co/en,True,Series C,Food & Beverages,"Last Mile Delivery,Marketplace,E-Commerce",Riyadh,Saudi Arabia
4,Mozn,https://www.mozn.sa,True,Pre-Series B,Technology,AI & Machine Learning,Riyadh,Saudi Arabia


# Data Collection

Our dataset already contains rich information, but we further enriched it by deriving additional features from the existing columns:

- NumOfTags: Represents the number of tags (activities) associated with each company. This helps us determine whether a company is specialized in one area or operates across multiple fields.

- Funding (USD): Created by mapping each funding stage to its corresponding investment range, providing an estimate of the amount of money raised at each stage.

- CompanySize: Derived from the funding range, this feature classifies companies as Startup / Small, Mid-level, Large, or Enterprise.

- Funding_min: The lower bound of each company's estimated funding range, stored as a numeric value in USD.

- rate_percent: Each region's share of the total estimated minimum funding, expressed as a percentage, giving insight into where startup funding is concentrated.

- Allign with 2030 (column name as spelled in the code): Identifies whether a company’s activities are aligned with Saudi Vision 2030 or unrelated to its strategic goals.

# 3. Steps Implemented in the Code

## Adding Number of Tags per Company
#### A new column NumOfTags was created to count the number of tags for each company by splitting the Tags column on commas:

In [5]:
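def count_tags(tags):
    # Number of comma-separated entries in Tags; missing values count as 0.
    return len(str(tags).split(",")) if pd.notnull(tags) else 0
df["NumOfTags"] = df["Tags"].apply(count_tags)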
df.head(5)

Unnamed: 0,Name,Website,Featured,Stage,Industry,Tags,Region,Startup HQ,NumOfTags
0,Lean,https://www.leantech.me,,Series B,Financial Services,Fintech Infrastructure,Riyadh,Saudi Arabia,1
1,Foodics,https://www.foodics.com/,,Series C,"Technology ,Food & Beverages","SaaS,Financial Solution",Riyadh,Saudi Arabia,2
2,Salla,https://salla.com/,True,Private Equity,Technology,SaaS,Makkah,Saudi Arabia,1
3,Nana,https://nana.co/en,True,Series C,Food & Beverages,"Last Mile Delivery,Marketplace,E-Commerce",Riyadh,Saudi Arabia,3
4,Mozn,https://www.mozn.sa,True,Pre-Series B,Technology,AI & Machine Learning,Riyadh,Saudi Arabia,1
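
As a minimal sketch of the same counting idea, the variant below also ignores empty entries (for example, after a trailing comma), which the cell above would count as a tag. The sample rows and the helper name `count_nonempty_tags` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Illustrative sample only; the notebook applies this to df["Tags"].
sample = pd.DataFrame(
    {"Tags": ["SaaS,Financial Solution", "Cloud Solution ,SaaS,ERP", None, "SaaS,"]}
)

def count_nonempty_tags(tags):
    """Count comma-separated tags, skipping blanks; missing values count as 0."""
    if pd.isna(tags):
        return 0
    return sum(1 for tag in str(tags).split(",") if tag.strip())

sample["NumOfTags"] = sample["Tags"].apply(count_nonempty_tags)
print(sample)
```

Whether blank entries should count is a data-quality decision; if the `Tags` column never contains them, both versions give the same result.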


## Estimating Funding Range by Stage
#### A dictionary each_stage_funding was defined to map each funding stage to an approximate USD range:

In [6]:
each_stage_funding = {
    "Pre-Seed": "500K – 2M",
    "Seed": "1M – 3M",
    "Pre-Series A": "2M – 6M",
    "Series A": "6M – 15M",
    "Pre-Series B": "10M – 18M",
    "Series B": "15M – 30M",
    "Series C": "25M – 50M",
    "IPO": "500M – 2B+",
    "Private Equity": "50M – 500M+"
}

In [7]:
df["Stage"].value_counts() 

Stage
Seed              247
Pre-Seed          100
Pre-Series A       63
Series A           27
Series B           12
Private Equity      9
Series C            3
Pre-Series B        3
IPO                 2
Name: count, dtype: int64

In [8]:
df["Funding (USD)"] = df["Stage"].map(each_stage_funding)
df.head(10)

Unnamed: 0,Name,Website,Featured,Stage,Industry,Tags,Region,Startup HQ,NumOfTags,Funding (USD)
0,Lean,https://www.leantech.me,,Series B,Financial Services,Fintech Infrastructure,Riyadh,Saudi Arabia,1,15M – 30M
1,Foodics,https://www.foodics.com/,,Series C,"Technology ,Food & Beverages","SaaS,Financial Solution",Riyadh,Saudi Arabia,2,25M – 50M
2,Salla,https://salla.com/,True,Private Equity,Technology,SaaS,Makkah,Saudi Arabia,1,50M – 500M+
3,Nana,https://nana.co/en,True,Series C,Food & Beverages,"Last Mile Delivery,Marketplace,E-Commerce",Riyadh,Saudi Arabia,3,25M – 50M
4,Mozn,https://www.mozn.sa,True,Pre-Series B,Technology,AI & Machine Learning,Riyadh,Saudi Arabia,1,10M – 18M
5,Rewaa,https://rewaatech.com,True,Series A,Technology,"SaaS,ERP",Riyadh,Saudi Arabia,2,6M – 15M
6,Zid,https://zid.sa/,True,Series B,"Technology ,Retail","SaaS,E-Commerce",Riyadh,Saudi Arabia,2,15M – 30M
7,Classera,https://classera.com,True,Series A,Education,SaaS,Makkah,Saudi Arabia,1,6M – 15M
8,Gathern,https://gathern.co,True,Pre-Series A,"Travel & Tourism,Real Estate","Marketplace,Hospitality",Riyadh,Saudi Arabia,2,2M – 6M
9,Sary,https://www.sary.com/en,True,Series C,"Food & Beverages ,Retail","B2B Financing,Marketplace",Riyadh,Saudi Arabia,2,25M – 50M
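
Because `Series.map` returns `NaN` for any stage that is missing from the dictionary, it can be useful to list such stages explicitly. The sketch below assumes a small sample frame and a hypothetical stage name ("Bridge") that does not appear in the original data.

```python
import pandas as pd

# Subset of the notebook's dictionary, for illustration.
each_stage_funding = {"Seed": "1M – 3M", "Series A": "6M – 15M"}

# "Bridge" is a hypothetical stage used only to show the check.
sample = pd.DataFrame({"Stage": ["Seed", "Series A", "Bridge", None]})
sample["Funding (USD)"] = sample["Stage"].map(each_stage_funding)

# Stages that are present in the data but have no entry in the dictionary.
has_stage = sample["Stage"].notna()
no_range = sample["Funding (USD)"].isna()
unmapped = sample.loc[has_stage & no_range, "Stage"].unique().tolist()
print(unmapped)
```

An empty list would indicate that every non-missing stage received a funding range.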


# Classifying Company Size Based on Funding
The `CompanySize` column was created to classify companies as:

- **Startup / Small**
- **Mid-level**
- **Large**
- **Enterprise**

based on their funding range.

In [9]:
def classify_from_funding(funding):
    # Classify by substring checks on the range label, evaluated top to bottom.
    if pd.isna(funding):
        return "Unknown"
    if "K" in funding:
        return "Startup / Small"
    # Substring matching: "5M" also matches inside "15M" and "25M",
    # so ranges such as "15M – 30M" and "25M – 50M" are caught by this branch.
    elif any(m in funding for m in ["1M", "2M", "3M", "5M"]):
        return "Startup / Small"
    elif any(m in funding for m in ["10M", "15M", "18M", "20M"]):
        return "Mid-level"
    elif any(m in funding for m in ["25M", "30M", "50M"]):
        return "Large"
    elif any(m in funding for m in ["100M", "200M", "500M"]):
        return "Enterprise"
    else:
        return "Unknown"

df["CompanySize"] = df["Funding (USD)"].apply(classify_from_funding)

print(df[["Name", "Funding (USD)", "CompanySize"]].head(15))

           Name Funding (USD)      CompanySize
0          Lean     15M – 30M  Startup / Small
1       Foodics     25M – 50M  Startup / Small
2         Salla   50M – 500M+            Large
3          Nana     25M – 50M  Startup / Small
4          Mozn     10M – 18M        Mid-level
5         Rewaa      6M – 15M  Startup / Small
6           Zid     15M – 30M  Startup / Small
7      Classera      6M – 15M  Startup / Small
8       Gathern       2M – 6M  Startup / Small
9          Sary     25M – 50M  Startup / Small
10       Syarah     15M – 30M  Startup / Small
11  Goldenscent      6M – 15M  Startup / Small
12       Mrsool      6M – 15M  Startup / Small
13     Grintafy       2M – 6M  Startup / Small
14       Hakbah      6M – 15M  Startup / Small
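
One way to avoid substring effects is to classify on a numeric lower bound rather than on the text label. The sketch below is an alternative, not the notebook's method: the lower bounds are read from the `each_stage_funding` ranges, while the cut-offs (10M, 25M, 100M) and the names `stage_min_usd`, `CUTOFFS`, and `classify_stage` are assumptions chosen for illustration.

```python
from bisect import bisect_right

# Lower bounds in USD, taken from the each_stage_funding ranges.
stage_min_usd = {
    "Pre-Seed": 500e3, "Seed": 1e6, "Pre-Series A": 2e6, "Series A": 6e6,
    "Pre-Series B": 10e6, "Series B": 15e6, "Series C": 25e6,
    "Private Equity": 50e6, "IPO": 500e6,
}

# Assumed cut-offs (not from the notebook): below 10M, below 25M, below 100M, else Enterprise.
CUTOFFS = [10e6, 25e6, 100e6]
LABELS = ["Startup / Small", "Mid-level", "Large", "Enterprise"]

def classify_stage(stage):
    """Map a stage to a size label via its numeric lower bound."""
    low = stage_min_usd.get(stage)
    if low is None:
        return "Unknown"
    # bisect_right counts how many cut-offs are <= low, which is the label index.
    return LABELS[bisect_right(CUTOFFS, low)]

for stage in stage_min_usd:
    print(f"{stage:15s} -> {classify_stage(stage)}")
```

With these assumed cut-offs, Series B (lower bound 15M) falls under Mid-level and Series C (25M) under Large; different thresholds would shift the boundaries.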


# Converting Funding to Numeric Minimum Value and Analyzing Funding by Region

A `Funding_min` column was created from the lower bound of each funding range (in USD) so that funding can be aggregated numerically.

We then calculated the number of companies and the total minimum funding per region, and added each region's percentage share and its average funding per company.


In [10]:
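def funding_min(f):
    # Take the first number in the range (e.g. "15M – 30M" -> 15) and scale it by one million.
    # Note: the K/M suffix is captured but not applied, so "500K" is also scaled by one million.
    match = re.search(r"([\d\.]+)([MK])", f) if pd.notna(f) else None
    return float(match.group(1)) * 1_000_000 if match else 0

df["Funding_min"] = df["Funding (USD)"].apply(funding_min)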


region_stats = df.groupby("Region").agg(
    company_count=("Name", "count"),
    total_funding=("Funding_min", "sum")
).reset_index()

# Each region's share of total funding, as a percentage (the key column required here)
region_stats["rate_percent"] = ((region_stats["total_funding"] / region_stats["total_funding"].sum()) * 100).round(2)

# Average funding per company in each region
region_stats["rate"] = (region_stats["total_funding"] / region_stats["company_count"]).round(2)
r = lambda x: f"${x/1_000_000:.2f}M"
region_stats["rate_million"] = region_stats["rate"].apply(r)

# Total funding per region, formatted in millions
tot = lambda x: f"${x/1_000_000:.2f}M"
region_stats["total_funding_million"] = region_stats["total_funding"].apply(tot)

print(region_stats[["Region", "company_count", "total_funding_million", "rate_percent", "rate_million"]])

    Region  company_count total_funding_million  rate_percent rate_million
0     Asir              2              $501.00M          0.96     $250.50M
1  Eastern             20             $2598.00M          4.97     $129.90M
2    Jazan              1                $1.00M          0.00       $1.00M
3  Madinah              5             $1502.00M          2.87     $300.40M
4   Makkah             69             $7149.00M         13.68     $103.61M
5   Riyadh            365            $39517.00M         75.60     $108.27M
6   jeddah              4             $1002.00M          1.92     $250.50M
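
The regular expression used for `Funding_min` captures the unit suffix, but the captured value is always multiplied by one million. The sketch below contrasts that behaviour with a unit-aware variant on the Pre-Seed range; the function names and the `UNIT_MULTIPLIER` table are assumptions introduced for illustration.

```python
import re

UNIT_MULTIPLIER = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
PATTERN = re.compile(r"([\d.]+)([KMB])")

def min_ignoring_unit(funding):
    """Mirror of the notebook's parser: always scales by one million."""
    match = PATTERN.search(funding)
    return float(match.group(1)) * 1_000_000 if match else 0.0

def min_with_unit(funding):
    """Scale by the captured suffix (K, M, or B) instead."""
    match = PATTERN.search(funding)
    return float(match.group(1)) * UNIT_MULTIPLIER[match.group(2)] if match else 0.0

pre_seed = "500K – 2M"
print(min_ignoring_unit(pre_seed))  # 500000000.0
print(min_with_unit(pre_seed))      # 500000.0
```

Under the first function a Pre-Seed company contributes 500M to its region's total, which is consistent with the Asir row above (two companies, $501.00M). A unit-aware parser would change the regional totals and therefore `rate_percent`.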


In [11]:
# Merge the two tables and keep only the columns needed
df = df.merge(
    region_stats[["Region", "rate_percent"]],
    on="Region",
    how="left"
)


df.head(5)

Unnamed: 0,Name,Website,Featured,Stage,Industry,Tags,Region,Startup HQ,NumOfTags,Funding (USD),CompanySize,Funding_min,rate_percent
0,Lean,https://www.leantech.me,,Series B,Financial Services,Fintech Infrastructure,Riyadh,Saudi Arabia,1,15M – 30M,Startup / Small,15000000.0,75.6
1,Foodics,https://www.foodics.com/,,Series C,"Technology ,Food & Beverages","SaaS,Financial Solution",Riyadh,Saudi Arabia,2,25M – 50M,Startup / Small,25000000.0,75.6
2,Salla,https://salla.com/,True,Private Equity,Technology,SaaS,Makkah,Saudi Arabia,1,50M – 500M+,Large,50000000.0,13.68
3,Nana,https://nana.co/en,True,Series C,Food & Beverages,"Last Mile Delivery,Marketplace,E-Commerce",Riyadh,Saudi Arabia,3,25M – 50M,Startup / Small,25000000.0,75.6
4,Mozn,https://www.mozn.sa,True,Pre-Series B,Technology,AI & Machine Learning,Riyadh,Saudi Arabia,1,10M – 18M,Mid-level,10000000.0,75.6
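
A left merge like the one above should leave the number of company rows unchanged, provided each region appears once in `region_stats`. As a hedged sketch, pandas can enforce this with the `validate` argument of `merge`; the company names below are placeholders, and the two percentages are copied from the regional output.

```python
import pandas as pd

# Placeholder companies; the real frame is the notebook's df.
companies = pd.DataFrame(
    {"Name": ["A", "B", "C"], "Region": ["Riyadh", "Makkah", "Riyadh"]}
)
region_stats = pd.DataFrame(
    {"Region": ["Riyadh", "Makkah"], "rate_percent": [75.6, 13.68]}
)

# validate="many_to_one" raises MergeError if a region is duplicated on the right side.
merged = companies.merge(
    region_stats[["Region", "rate_percent"]],
    on="Region",
    how="left",
    validate="many_to_one",
)

assert len(merged) == len(companies)  # no rows gained or lost
print(merged)
```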


In [12]:
df["Tags"].value_counts() 

Tags
Marketplace                                          53
SaaS                                                 31
AI & Machine Learning                                30
E-Learning                                           13
E-Commerce                                           12
                                                     ..
AI & Machine Learning,Marketplace,SaaS,E-Learning     1
AI & Machine Learning,Cloud Solution ,SaaS            1
Cloud Solution ,SaaS,ERP                              1
3D Printing,Trading                                   1
On-demand B2B Warehousing Platform                    1
Name: count, Length: 219, dtype: int64

In [13]:
# Set of tags treated as aligned with Vision 2030
aligned_tags = {"3D Printing", "Advertisement", "Affiliate Marketing", "Aggregator", "AgriTech",
                 "AI & Machine Learning", "APP Dev & PaaS", "Autonomous Driving", 
                 "B2B Financing", "BioTech", "Blockchain Solution", "BNPL", "Car Solution",
                   "ClimateTech", "Cloud Kitchen", "Cloud Solution", "Computer Vision", "Crowdfunding",
                     "Cyber Security", "Digital Bank", "Drop-Shipping", "E-Commerce", "E-Learning", "ERP", 
                     "Expense Mangement", "Financial Solution", "Fintech Infrastructure", "Fleet Mangement", 
                     "Fund Distribution", "Game Dev", "Gifting", "Hardware & loT & Drones", "Hospitality", "HR & Recruitment", 
                     "IT Solutions Services", "Industry 4.0", "Last Mile Delivery", "Logistics", "Loyalty", 
                     "Marketplace", "Media Production & Distribution", "On-demand B2B Warehousing Platform", "Payment Gateway",
                       "Podcasts / Audiobooks", "RegTech", "Renewable Energy", "Robo-Advisory", "ROSCA", "SaaS", "SmartCity/Home",
                         "SME-Funding", "Social Commerce", "Streaming", "Trading", "Transport & Micro Mobility", "Web3& Metaverse",
                           "Wholesale", "ePOS", "eSports Platform"}

# Normalize: trim whitespace and lowercase for case-insensitive matching
aligned_lower = {t.strip().lower() for t in aligned_tags}


# Aligning Tags with Vision 2030

A column `Allign with 2030` was created to identify companies aligned with Vision 2030: a company is marked "Yes" if at least one of its tags appears in the aligned set, and "No" otherwise.


In [14]:
df["Allign with 2030"] = (
    df["Tags"].fillna("")
      .str.split(r"\s*[,،]\s*")
      .apply(lambda items: "Yes" if any(x.strip().lower() in aligned_lower
                                        for x in items if x.strip()) else "No")
)


print(df["Allign with 2030"].value_counts(dropna=False))
df.head(30)


Allign with 2030
Yes    442
No      24
Name: count, dtype: int64


Unnamed: 0,Name,Website,Featured,Stage,Industry,Tags,Region,Startup HQ,NumOfTags,Funding (USD),CompanySize,Funding_min,rate_percent,Allign with 2030
0,Lean,https://www.leantech.me,,Series B,Financial Services,Fintech Infrastructure,Riyadh,Saudi Arabia,1,15M – 30M,Startup / Small,15000000.0,75.6,Yes
1,Foodics,https://www.foodics.com/,,Series C,"Technology ,Food & Beverages","SaaS,Financial Solution",Riyadh,Saudi Arabia,2,25M – 50M,Startup / Small,25000000.0,75.6,Yes
2,Salla,https://salla.com/,True,Private Equity,Technology,SaaS,Makkah,Saudi Arabia,1,50M – 500M+,Large,50000000.0,13.68,Yes
3,Nana,https://nana.co/en,True,Series C,Food & Beverages,"Last Mile Delivery,Marketplace,E-Commerce",Riyadh,Saudi Arabia,3,25M – 50M,Startup / Small,25000000.0,75.6,Yes
4,Mozn,https://www.mozn.sa,True,Pre-Series B,Technology,AI & Machine Learning,Riyadh,Saudi Arabia,1,10M – 18M,Mid-level,10000000.0,75.6,Yes
5,Rewaa,https://rewaatech.com,True,Series A,Technology,"SaaS,ERP",Riyadh,Saudi Arabia,2,6M – 15M,Startup / Small,6000000.0,75.6,Yes
6,Zid,https://zid.sa/,True,Series B,"Technology ,Retail","SaaS,E-Commerce",Riyadh,Saudi Arabia,2,15M – 30M,Startup / Small,15000000.0,75.6,Yes
7,Classera,https://classera.com,True,Series A,Education,SaaS,Makkah,Saudi Arabia,1,6M – 15M,Startup / Small,6000000.0,13.68,Yes
8,Gathern,https://gathern.co,True,Pre-Series A,"Travel & Tourism,Real Estate","Marketplace,Hospitality",Riyadh,Saudi Arabia,2,2M – 6M,Startup / Small,2000000.0,75.6,Yes
9,Sary,https://www.sary.com/en,True,Series C,"Food & Beverages ,Retail","B2B Financing,Marketplace",Riyadh,Saudi Arabia,2,25M – 50M,Startup / Small,25000000.0,75.6,Yes
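
To see which tags never match the aligned set, the tags can be exploded into one row per tag and compared individually. This is a sketch on made-up rows: the reduced `aligned_lower` set and the tag "Pet Care" are assumptions for illustration only.

```python
import pandas as pd

# Reduced version of the notebook's normalized set.
aligned_lower = {"saas", "erp", "cloud solution"}

# "Pet Care" is a hypothetical tag that is not in the aligned set.
sample = pd.DataFrame(
    {"Tags": ["SaaS,ERP", "Cloud Solution ,SaaS", "Pet Care", None]}
)

# One row per individual tag, trimmed, with blanks removed.
tags = sample["Tags"].fillna("").str.split(",").explode().str.strip()
tags = tags[tags != ""]

# Distinct tags whose lowercase form is not in the aligned set.
unmatched = sorted(set(tags[~tags.str.lower().isin(aligned_lower)]))
print(unmatched)
```

Reviewing the unmatched tags makes it easier to tell a genuinely unaligned activity from a spelling variant of an aligned one.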


# Calculating Average Funding and Opportunity Level

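Funding ranges were converted to a numeric average (`Funding_Avg`, the midpoint of the range), and an opportunity level was assigned from that average: **Low** (below 2M), **Medium** (2M up to, but not including, 20M), or **High** (20M and above).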


In [None]:
# Add the average funding column
def to_number(x):
    if pd.isna(x): return np.nan
    x = str(x).replace("+", "").strip()
    if "K" in x: return float(x.replace("K", "")) * 1e3
    if "M" in x: return float(x.replace("M", "")) * 1e6
    if "B" in x: return float(x.replace("B", "")) * 1e9
    return float(x)

def range_to_avg(r):
    if pd.isna(r): return np.nan
    r = str(r).replace("–", "-").replace(" ", "")
    parts = r.split("-")
    if len(parts) == 2:
        low, high = to_number(parts[0]), to_number(parts[1])
        return (low + high) / 2
    return to_number(parts[0])

def format_number(v):
    if pd.isna(v): return ""
    if v >= 1e9: return f"{v/1e9:.2f}B"
    if v >= 1e6: return f"{v/1e6:.2f}M"
    if v >= 1e3: return f"{v/1e3:.2f}K"
    return str(v)


avg_num = df["Funding (USD)"].apply(range_to_avg)
df["Funding_Avg"] = avg_num.apply(format_number)

In [16]:
# Add Opportunity_Level
LOW_MAX, MED_MAX = 2e6, 20e6
def classify(v):
    if pd.isna(v): return ""
    if v < LOW_MAX: return "Low"
    elif v < MED_MAX: return "Medium"
    else: return "High"

df["Opportunity_Level"] = avg_num.apply(classify)

In [17]:
df.head(10)

Unnamed: 0,Name,Website,Featured,Stage,Industry,Tags,Region,Startup HQ,NumOfTags,Funding (USD),CompanySize,Funding_min,rate_percent,Allign with 2030,Funding_Avg,Opportunity_Level
0,Lean,https://www.leantech.me,,Series B,Financial Services,Fintech Infrastructure,Riyadh,Saudi Arabia,1,15M – 30M,Startup / Small,15000000.0,75.6,Yes,22.50M,High
1,Foodics,https://www.foodics.com/,,Series C,"Technology ,Food & Beverages","SaaS,Financial Solution",Riyadh,Saudi Arabia,2,25M – 50M,Startup / Small,25000000.0,75.6,Yes,37.50M,High
2,Salla,https://salla.com/,True,Private Equity,Technology,SaaS,Makkah,Saudi Arabia,1,50M – 500M+,Large,50000000.0,13.68,Yes,275.00M,High
3,Nana,https://nana.co/en,True,Series C,Food & Beverages,"Last Mile Delivery,Marketplace,E-Commerce",Riyadh,Saudi Arabia,3,25M – 50M,Startup / Small,25000000.0,75.6,Yes,37.50M,High
4,Mozn,https://www.mozn.sa,True,Pre-Series B,Technology,AI & Machine Learning,Riyadh,Saudi Arabia,1,10M – 18M,Mid-level,10000000.0,75.6,Yes,14.00M,Medium
5,Rewaa,https://rewaatech.com,True,Series A,Technology,"SaaS,ERP",Riyadh,Saudi Arabia,2,6M – 15M,Startup / Small,6000000.0,75.6,Yes,10.50M,Medium
6,Zid,https://zid.sa/,True,Series B,"Technology ,Retail","SaaS,E-Commerce",Riyadh,Saudi Arabia,2,15M – 30M,Startup / Small,15000000.0,75.6,Yes,22.50M,High
7,Classera,https://classera.com,True,Series A,Education,SaaS,Makkah,Saudi Arabia,1,6M – 15M,Startup / Small,6000000.0,13.68,Yes,10.50M,Medium
8,Gathern,https://gathern.co,True,Pre-Series A,"Travel & Tourism,Real Estate","Marketplace,Hospitality",Riyadh,Saudi Arabia,2,2M – 6M,Startup / Small,2000000.0,75.6,Yes,4.00M,Medium
9,Sary,https://www.sary.com/en,True,Series C,"Food & Beverages ,Retail","B2B Financing,Marketplace",Riyadh,Saudi Arabia,2,25M – 50M,Startup / Small,25000000.0,75.6,Yes,37.50M,High
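
The same Low / Medium / High binning can be expressed with `pd.cut`, shown here as an alternative sketch rather than the notebook's approach. The sample averages are midpoints of ranges from `each_stage_funding` plus two boundary values; note that `pd.cut` leaves missing values as `NaN`, whereas `classify` returns an empty string.

```python
import numpy as np
import pandas as pd

# Sample averages in USD, including the two boundary values 2M and 20M.
avg_num = pd.Series([1.25e6, 2e6, 4e6, 14e6, 20e6, 22.5e6, np.nan])

level = pd.cut(
    avg_num,
    bins=[-np.inf, 2e6, 20e6, np.inf],
    labels=["Low", "Medium", "High"],
    right=False,  # intervals are [low, high), matching the `<` comparisons in classify
)
print(level.tolist())
```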


# Adding Year of Establishment

A JSON file (`founding_years_465.json`) that maps company names to founding years was read, and the years were mapped onto the DataFrame by `Name`.


In [18]:
with open("founding_years_465.json", "r", encoding="utf-8") as f:
    company_years = json.load(f)

df["Year of establishment"] = df["Name"].map(company_years)
df["Year of establishment"] = df["Year of establishment"].astype("Int64")


df[["Name", "Year of establishment"]]
df.head(5)

Unnamed: 0,Name,Website,Featured,Stage,Industry,Tags,Region,Startup HQ,NumOfTags,Funding (USD),CompanySize,Funding_min,rate_percent,Allign with 2030,Funding_Avg,Opportunity_Level,Year of establishment
0,Lean,https://www.leantech.me,,Series B,Financial Services,Fintech Infrastructure,Riyadh,Saudi Arabia,1,15M – 30M,Startup / Small,15000000.0,75.6,Yes,22.50M,High,2019
1,Foodics,https://www.foodics.com/,,Series C,"Technology ,Food & Beverages","SaaS,Financial Solution",Riyadh,Saudi Arabia,2,25M – 50M,Startup / Small,25000000.0,75.6,Yes,37.50M,High,2014
2,Salla,https://salla.com/,True,Private Equity,Technology,SaaS,Makkah,Saudi Arabia,1,50M – 500M+,Large,50000000.0,13.68,Yes,275.00M,High,2016
3,Nana,https://nana.co/en,True,Series C,Food & Beverages,"Last Mile Delivery,Marketplace,E-Commerce",Riyadh,Saudi Arabia,3,25M – 50M,Startup / Small,25000000.0,75.6,Yes,37.50M,High,2016
4,Mozn,https://www.mozn.sa,True,Pre-Series B,Technology,AI & Machine Learning,Riyadh,Saudi Arabia,1,10M – 18M,Mid-level,10000000.0,75.6,Yes,14.00M,Medium,2017
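
Names that are absent from the JSON mapping receive `<NA>` after the `map` call, so a quick coverage check can list them. In this sketch the two years are taken from the output above, while "ExampleCo" is a hypothetical company added to show a miss.

```python
import pandas as pd

# Two entries copied from the output above; the real mapping comes from the JSON file.
company_years = {"Lean": 2019, "Foodics": 2014}

# "ExampleCo" is hypothetical and has no entry in the mapping.
sample = pd.DataFrame({"Name": ["Lean", "Foodics", "ExampleCo"]})

# Nullable integer dtype keeps years as integers while allowing missing values.
sample["Year of establishment"] = sample["Name"].map(company_years).astype("Int64")

missing = sample.loc[sample["Year of establishment"].isna(), "Name"].tolist()
print(missing)
```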


# Save new dataset after collection 
After all additions and calculations, the final dataset was saved to `NewDatasetCollection.csv`.

In [None]:
#df.to_csv("NewDatasetCollection.csv", index=False) 

### Notes on the Data Collection Phase

- No new information was added from official company websites or social media (LinkedIn, Twitter) other than founding year.  
- Companies without a clear founding year were contacted via email, but no responses were received.  
- All operations in this phase focused only on data collection and basic funding calculations, with no further cleaning or processing applied.
