# Data Analysis

#### What is a pump and dump startup?

The "pump and dump" startup is when some founders, who probably think they're the main character, just get to finessin' everyone. It’s giving classic scammer vibes, where they cap about their company being the next big thing. They flex about a fake business model and some sus-sounding tech, all to create a ton of hype.

Then they get all the delulu investors to pile in, and the stock price goes crazy. It’s like, everyone gets FOMO, so they start throwing their money at it. But the founders, they're the ones getting the bag. Once the price is high-key bussin', they dump all their shares and dip. Everyone else is left with an L, and the company is basically canceled, periodt. 

It's giving the same energy as when a creator does a rug-pull on their own community. Like, they'll act all chill and pretend everything's gucci, but it's just a whole lot of cringe. They're not building anything real, they're just chasing clout and leaving everyone else cooked. So, if a startup seems too good to be true, it’s probably a hard pass.


#### What do they have in common?

1. Weak user retention - The focus is on acquiring new "investors" (or users, in a crypto context) to drive up demand and price. Retention of these investors is not a goal; in fact, the goal is to sell to them before they realize the asset is worthless. An early product might have a massive number of users, but if the monthly active user (MAU) metric is growing much slower than total users or if the churn rate is high, the "success" is unsustainable.
 
2. Valuation Growth To Revenue Ratio - The market cap (valuation) skyrockets based on false hype and manipulated trading activity, while legitimate revenue from a sustainable business model is minimal or nonexistent. 

3. Unstable Growth Metrics - If the company fails to find a stable, repeatable growth engine, its trajectory will look more like an initial spike followed by decline.

4. Qualitative data over quantitative evidence - Early on, many YC companies lack hard quantitative data. They may pitch investors on "founder's past experience," "founder's attitude," or "impressive technical milestones" rather than proven revenue and retention. 

5. Prioritizing network leverage over silent development - The scam relies on a network of people—often through spam emails, social media, or online forums—to spread misinformation and create false buying pressure. The "development" is purely in the manipulation of the market, not the product.

6. Over-emphasis on marketing and narrative - The entire business is the marketing. Fraudsters craft a compelling (but fake) narrative about a revolutionary product or impending breakthrough to manipulate the stock price. The product or company itself is merely an empty shell; all resources go into promotion.
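
The retention signal from point 1 (sign-ups outrunning active usage) can be sketched in a few lines. This is a minimal illustration on made-up numbers: the column names `total_users` and `mau` and every figure below are hypothetical and do not come from any of the CSVs used in this notebook.

```python
import pandas as pd

# Hypothetical monthly figures: cumulative sign-ups vs. monthly active users.
usage = pd.DataFrame({
    "month": ["2024-01", "2024-02", "2024-03", "2024-04"],
    "total_users": [1000, 2000, 4000, 8000],
    "mau": [800, 1000, 1200, 1300],
})

# Share of all registered users who were active in the month.
usage["active_ratio"] = usage["mau"] / usage["total_users"]

# Month-over-month growth of each series (first month has no previous value).
usage["total_growth"] = usage["total_users"].pct_change()
usage["mau_growth"] = usage["mau"].pct_change()

# Flag months where sign-ups grow faster than active usage.
usage["retention_warning"] = usage["total_growth"] > usage["mau_growth"]

print(usage[["month", "active_ratio", "retention_warning"]])
```

A falling `active_ratio` combined with repeated warnings is the pattern described above: lots of new accounts, few people sticking around.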


#### So what do we care about in the data?

1. User retention - How many users does the product have on a regular basis? The MAU/DAU metrics are critical for distinguishing legitimate startups from potential "pump and dumps."

2. Market Cap - Valuation relative to revenue. An excessively high valuation without corresponding revenue may signal hype-driven inflation rather than real value.

3. Cringe LinkedIn/X Posts - Analysis of founders' communication style and marketing approach. Excessive hype, unrealistic promises, or misleading claims on social platforms can be red flags.

4. Funding Rounds - Pattern and timing of fundraising activities. Rapid, successive rounds without product milestones may indicate an emphasis on capital accumulation over business building.

5. App Store Ratings / Reddit - External validation from actual users. Poor ratings, negative reviews, and critical discussions on platforms like Reddit can reveal discrepancies between marketing claims and product reality.

6. Founder's history - Track record of previous ventures, educational background, and industry experience. Serial founders with failed ventures or founders lacking domain expertise may present higher risk.
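
The valuation-to-revenue check in item 2 might look something like the sketch below. Everything here is an assumption for illustration: the company names, the dollar figures, the column names, and above all the cutoff, which would need to be tuned per sector and stage.

```python
import numpy as np
import pandas as pd

# Hypothetical figures in USD; none of these come from the notebook's CSVs.
companies = pd.DataFrame({
    "name": ["A", "B", "C"],
    "valuation": [50_000_000, 200_000_000, 10_000_000],
    "annual_revenue": [10_000_000, 500_000, 0],
})

# Treat zero revenue as "no multiple" instead of dividing by zero.
revenue = companies["annual_revenue"].replace(0, np.nan)
companies["valuation_to_revenue"] = companies["valuation"] / revenue

# ILLUSTRATIVE cutoff only, not an industry standard.
MULTIPLE_CUTOFF = 100

# Flag companies with no revenue at all or a multiple above the cutoff.
companies["hype_flag"] = (
    companies["valuation_to_revenue"].isna()
    | (companies["valuation_to_revenue"] > MULTIPLE_CUTOFF)
)

print(companies)
```

A flag here is only a prompt to look closer; plenty of legitimate early-stage companies have little or no revenue.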

### Setting up imports

In [None]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import os
from datetime import datetime
import json
from pathlib import Path
from typing import Dict, Any


In [None]:
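# Company listings
df_1 = pd.read_csv('companies.csv')
df_2 = pd.read_csv('2023-02-27-yc-companies.csv')  # snapshot dated 2023-02-27
df_3 = pd.read_csv('2023-07-13-yc-companies.csv')  # snapshot dated 2023-07-13
df_4 = pd.read_csv('companies.csv')  # NOTE: same file as df_1, so both frames are identical
# Founder, school, and tag tables
df_5 = pd.read_csv('founders.csv')
df_6 = pd.read_csv('schools.csv')
df_7 = pd.read_csv('tags.csv')
# Spreadsheet export and the most recent listing
df_8 = pd.read_csv('YC Companies - Batches.csv')
df_9 = pd.read_csv('yc-companies-august-2025.csv')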
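
One way to avoid reading the same file twice is to key the frames by file name. The helper below is a sketch, not part of the original notebook: `load_csvs` and its `base_dir` parameter are hypothetical names, and it simply skips files that are not found.

```python
from pathlib import Path
from typing import Dict, Iterable

import pandas as pd


def load_csvs(paths: Iterable[str], base_dir: str = ".") -> Dict[str, pd.DataFrame]:
    """Read each existing CSV once and key the result by file name."""
    frames: Dict[str, pd.DataFrame] = {}
    # dict.fromkeys drops repeated names while keeping the original order.
    for name in dict.fromkeys(paths):
        path = Path(base_dir) / name
        if not path.is_file():
            print(f"skipping missing file: {path}")
            continue
        frames[name] = pd.read_csv(path)
    return frames
```

With the files from this notebook, a call such as `load_csvs(['companies.csv', 'founders.csv', 'companies.csv'])` would return two frames rather than three.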

In [25]:
df_1.head()


Unnamed: 0,name,vertical,year,batch,url,description
0,Clickfacts,B2B,2005,s2005,http://clickfacts.com,
1,Kiko,Consumer,2005,s2005,http://kiko.com,We're the best online calendar solution to eve...
2,Loopt,Enterprise,2005,s2005,http://loopt.com,
3,Parakey,Consumer,2005,s2005,http://parakey.com,
4,Reddit,Consumer,2005,s2005,http://reddit.com,


In [26]:
df_2.head()


Unnamed: 0,company_id,company_name,short_description,long_description,batch,status,tags,location,country,year_founded,num_founders,founders_names,team_size,website,cb_url,linkedin_url
0,28425,Pando Bioscience,Decipher complex diseases using protein networks,Pando Bioscience is a Boston-based synthetic b...,W23,Active,"['Drug discovery', 'Biotech', 'Diagnostics', '...",Boston,US,2022.0,2,"['Will (Yangxiaolu) Cao', 'Yang Wang']",3.0,https://,,
1,28421,Apollo Group,A marketplace for consumers to hire trained bl...,Apollo Group is a diversified technology group...,W23,Active,"['Home Services', 'International', 'Marketplac...","Lahore, Pakistan",PK,2023.0,1,['Usman Gul'],8.0,http://www.apollo-group.io,https://www.crunchbase.com/organization/apollo...,https://www.linkedin.com/in/gulsf/
2,28416,Pierre,A new way to review code,"Pierre enables engineers, designers and busine...",W23,Active,"['Developer Tools', 'Collaboration', 'AI-Enhan...",San Francisco,US,2023.0,2,"['Ian Ownbey', 'Jacob Thornton']",2.0,https://heypierre.app,,
3,28415,moonrepo,A developer productivity platform for managing...,moonrepo is a developer tool that reduces buil...,W23,Active,"['Developer Tools', 'SaaS', 'Open Source', 'En...","Portland, OR",US,2022.0,2,"['Miles Johnson', 'James Pozdena']",2.0,https://moonrepo.dev,,https://www.linkedin.com/company/moonrepo/
4,28414,Lasso,robotic process automation for chrome using GP...,🧐 The problem: Traditional Robotic Process Aut...,W23,Active,"['Generative AI', 'B2B', 'SaaS', 'Developer To...",San Francisco,US,2023.0,2,"['Gautam Bose', 'Lucas Ochoa']",2.0,https://www.getlassoai.com/,,
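
The `tags` and `founders_names` cells in this preview look like Python lists stored as strings, so they need parsing before they can be counted or exploded. Below is a sketch using `ast.literal_eval` on a hand-typed sample; the sample rows are shortened stand-ins (the real cells are longer), and `parse_list` is a hypothetical helper name.

```python
import ast

import pandas as pd

# Hand-typed sample shaped like the preview; not the full cell contents.
sample = pd.DataFrame({
    "company_name": ["Pierre", "Example with no tags"],
    "tags": ["['Developer Tools', 'Collaboration']", "[]"],
})


def parse_list(cell):
    """Turn a string such as "['a', 'b']" into a list; fall back to []."""
    if not isinstance(cell, str):
        return []  # NaN or any other non-string value
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # malformed text
    return value if isinstance(value, list) else []


sample["tags_list"] = sample["tags"].apply(parse_list)

# One row per (company, tag); empty lists become NaN and are ignored by value_counts.
tag_counts = sample.explode("tags_list")["tags_list"].value_counts()
print(tag_counts)
```

`ast.literal_eval` only evaluates literals, which makes it a safer choice than `eval` for text read from a file.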


In [27]:
df_3.head()


Unnamed: 0,company_id,company_name,short_description,long_description,batch,status,tags,location,country,year_founded,num_founders,founders_names,team_size,website,cb_url,linkedin_url
0,28415,moonrepo,Open source build tool for monorepos and large...,moonrepo is a developer tool that reduces buil...,W23,Active,"['Developer Tools', 'SaaS', 'Productivity', 'O...","Portland, OR",US,2022.0,2,"['Miles Johnson', 'James Pozdena']",2.0,https://moonrepo.dev,,https://www.linkedin.com/company/moonrepo/
1,28412,SpeedyBrand,Generative-AI powered Marketing Content for SMBs,SpeedyBrand provides SMBs with generative-AI p...,W23,Active,[],,,2023.0,3,"['Jatin Mehta', 'Ayush Jasuja', 'Ranti Dev Sha...",3.0,https://speedybrand.io,,https://www.linkedin.com/company/89983427/
2,28409,BerriAI,Stop OpenAI Errors w/ 1 line of code 👈,Stop OpenAI Errors in 1 line of code\r\n\r\n``...,W23,Active,"['Artificial Intelligence', 'Developer Tools',...",San Francisco,US,2023.0,2,"['Krrish Dholakia', 'Ishaan Jaffer']",2.0,https://berri.ai/,,https://www.linkedin.com/company/berri-ai/
3,28425,Pando Bioscience,Decipher complex diseases using protein networks,Pando Bioscience is a Boston-based synthetic b...,W23,Active,"['Synthetic Biology', 'Biotech', 'Diagnostics'...",Boston,US,2022.0,2,"['Will (Yangxiaolu) Cao', 'Yang Wang']",3.0,https://,,
4,28413,SpecCheck,Unifying how the optical industry does business.,SpecCheck is an all-in-one solution that provi...,W23,Active,"['SaaS', 'Payments', 'Health Tech', 'B2B', 'API']","Los Angeles, CA",US,,2,"['Joe DeMaria', 'Arnold Villatoro']",5.0,https://www.speccheckrx.com/,,https://www.linkedin.com/company/speccheck/
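
`df_2` and `df_3` come from files dated 2023-02-27 and 2023-07-13 and share a `company_id` column, so they can be lined up to see what changed between snapshots. The sketch below uses tiny stand-in frames; which companies are really missing from either file cannot be read off a five-row preview, so the membership shown here is purely illustrative.

```python
import pandas as pd

# Tiny stand-ins for the February and July snapshots (illustrative rows only).
feb = pd.DataFrame({
    "company_id": [28415, 28425, 28421],
    "status": ["Active", "Active", "Active"],
    "team_size": [2.0, 3.0, 8.0],
})
jul = pd.DataFrame({
    "company_id": [28415, 28425, 28412],
    "status": ["Active", "Active", "Active"],
    "team_size": [2.0, 3.0, 3.0],
})

# Outer merge keeps every company; `_merge` records which snapshot(s) it was in.
merged = feb.merge(
    jul, on="company_id", how="outer",
    suffixes=("_feb", "_jul"), indicator=True,
)

only_feb = merged.loc[merged["_merge"] == "left_only", "company_id"].tolist()
only_jul = merged.loc[merged["_merge"] == "right_only", "company_id"].tolist()

# For companies present in both snapshots, measure the change in head count.
both = merged[merged["_merge"] == "both"].copy()
both["team_size_change"] = both["team_size_jul"] - both["team_size_feb"]

print(only_feb, only_jul)
print(both[["company_id", "team_size_change"]])
```

The same pattern works for `status`, which is probably the more interesting column when looking for companies that went quiet between snapshots.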


In [28]:
df_4.head()


Unnamed: 0,name,vertical,year,batch,url,description
0,Clickfacts,B2B,2005,s2005,http://clickfacts.com,
1,Kiko,Consumer,2005,s2005,http://kiko.com,We're the best online calendar solution to eve...
2,Loopt,Enterprise,2005,s2005,http://loopt.com,
3,Parakey,Consumer,2005,s2005,http://parakey.com,
4,Reddit,Consumer,2005,s2005,http://reddit.com,


In [29]:
df_5.head()


Unnamed: 0,first_name,last_name,hnid,avatar_thumb,current_company,current_title,company_slug,top_company
0,Debanjum,Singh Solanky,110,https://bookface-images.s3.amazonaws.com/avata...,Khoj,,khoj,False
1,Jon,Wade,__JW__,https://bookface-images.s3.amazonaws.com/avata...,,,union54,False
2,Sy,Bohy,__sy__,https://bookface-images.s3.amazonaws.com/avata...,Seam,,seam,False
3,Abdelrahman,Hosny,_ahosny,https://bookface-images.s3.amazonaws.com/avata...,ShipBlu,,shipblu,False
4,Christopher,Chae,_chrischae,https://bookface-images.s3.amazonaws.com/avata...,Relate,,relate,False


In [30]:
df_6.head()


Unnamed: 0,hnid,school,field_of_study,year
0,110,"Birla Institute of Technology and Science, Pilani",Electronics And Instrumentation,2009
1,110,"Birla Institute of Technology and Science, Pilani",Electronics And Instrumentation,2010
2,110,"Birla Institute of Technology and Science, Pilani",Electronics And Instrumentation,2011
3,110,"Birla Institute of Technology and Science, Pilani",Electronics And Instrumentation,2012
4,110,"Birla Institute of Technology and Science, Pilani",Electronics And Instrumentation,2013
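
The preview shows one row per year for the same `hnid`, school, and field, so the table likely needs collapsing before it is joined to founders. The sketch below assumes that `hnid` is the shared key between `founders.csv` and `schools.csv` (both previews have the column, but the notebook does not confirm the relationship); the output column names `first_year` and `last_year` are made up for the example.

```python
import pandas as pd

# Rows copied from the schools preview above.
schools = pd.DataFrame({
    "hnid": ["110"] * 5,
    "school": ["Birla Institute of Technology and Science, Pilani"] * 5,
    "field_of_study": ["Electronics And Instrumentation"] * 5,
    "year": [2009, 2010, 2011, 2012, 2013],
})

# One row per (founder, school, field) with the first and last year seen.
# dropna=False keeps groups whose field_of_study is missing.
education = (
    schools
    .groupby(["hnid", "school", "field_of_study"], as_index=False, dropna=False)
    .agg(first_year=("year", "min"), last_year=("year", "max"))
)

# Minimal founders stand-in with the columns needed for the join.
founders = pd.DataFrame({
    "first_name": ["Debanjum"],
    "last_name": ["Singh Solanky"],
    "hnid": ["110"],
})

# Left join so founders without school records are kept.
founder_education = founders.merge(education, on="hnid", how="left")
print(founder_education)
```

If `hnid` is read as an integer in one file and as a string in the other, both columns would need casting to the same type before the merge.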


In [31]:
df_7.head()


Unnamed: 0.1,Unnamed: 0,id,tag
0,0,379,Community
1,1,379,Social Media
2,2,379,Social
3,3,379,Social Network
4,4,378,Calendar
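
Two things stand out in this preview: the `Unnamed: 0.1` and `Unnamed: 0` columns look like index columns that were written back into the file, and the tags are in long format (one row per tag). A sketch of cleaning both follows; it assumes `id` identifies a company, which the preview does not state.

```python
import pandas as pd

# Rows copied from the tags preview above, including the stray index columns.
tags = pd.DataFrame({
    "Unnamed: 0.1": [0, 1, 2, 3, 4],
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "id": [379, 379, 379, 379, 378],
    "tag": ["Community", "Social Media", "Social", "Social Network", "Calendar"],
})

# Drop every column whose name starts with "Unnamed".
tags = tags.loc[:, ~tags.columns.str.startswith("Unnamed")]

# Collect all tags for each id into a single list.
tags_per_company = tags.groupby("id")["tag"].agg(list)
print(tags_per_company)
```

The same `Unnamed` filter applies to the other frames in this notebook, since several previews show an `Unnamed: 0` column.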


In [32]:
df_8.head()


Unnamed: 0.1,Unnamed: 0,This spreadsheet updates itself every week with the YC companies and calculates some handy stats over them. It uses the Neptyne Google Sheets Add on to this. See the How this works tab,Unnamed: 2,Unnamed: 3,Unnamed: 4,"Last updated: 03 March, 2025",Unnamed: 6,Unnamed: 7,Made with Neptyne
0,,,,,,,,,
1,,,,,,,,,
2,,Company,Website,One line,Size,Sector,Batch,Status,Tags
3,,Leeroo,https://www.leeroo.com/,"Turning workflows to production-ready, e2e tra...",3,B2B,X25,Active,Artificial Intelligence
4,,Red Barn Robotics,https://www.redbarnrobotics.com,A Roomba for weeds on a farm.,3,Industrials,W25,Active,"Robotics, Agriculture, Sustainable Agriculture"
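
In this frame the column names are banner text from the spreadsheet, and the real header (`Company`, `Website`, ...) sits in row 2. One way to repair that after loading is to find the header row and promote it. The sketch below works on a cut-down stand-in with three columns; the column name `banner text` is a placeholder, and the approach assumes the string `Company` appears only in the header row.

```python
import numpy as np
import pandas as pd

# Stand-in for the raw frame: banner column names, two empty rows, then the real header.
raw = pd.DataFrame(
    [
        [None, None, None],
        [None, None, None],
        ["Company", "Batch", "Status"],
        ["Leeroo", "X25", "Active"],
        ["Red Barn Robotics", "W25", "Active"],
    ],
    columns=["banner text", "Unnamed: 2", "Unnamed: 3"],
)

# Position of the first row that contains the cell "Company".
is_header = raw.eq("Company").any(axis=1).to_numpy()
header_pos = int(np.flatnonzero(is_header)[0])

# Keep the rows below the header and use the header row as column names.
batches = raw.iloc[header_pos + 1:].copy()
batches.columns = raw.iloc[header_pos].tolist()
batches = batches.reset_index(drop=True)
print(batches)
```

On the real file, fully empty columns would end up with a missing column name and can be dropped afterwards.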


In [33]:
df_9.head()

Unnamed: 0,id,name,slug,small_logo_thumb_url,website,all_locations,long_description,one_liner,team_size,industry,...,regions/11,former_names/6,tags/5,former_names/7,former_names/8,former_names/9,former_names/10,former_names/11,former_names/12,former_names/13
0,5,CircuitHub,circuithub,https://bookface-images.s3.amazonaws.com/small...,https://circuithub.com,"London, England, United Kingdom",CircuitHub offers on-demand electronics manufa...,On-Demand Electronics Manufacturing,58.0,Industrials,...,,,,,,,,,,
1,6,iCracked,icracked,/company/thumb/missing.png,http://icracked.com,"Redwood City, CA, USA",Founded in 2010 and located in the heart of Si...,On-demand smartphone repair in 3 countries and...,51.0,Consumer,...,,,,,,,,,,
2,7,42Floors,42floors,/company/thumb/missing.png,http://42floors.com,"San Francisco, CA, USA; Remote",*Acquired by Knotel in 2018\r\n\r\n42Floors wa...,We make it easy to search for office space.,60.0,Real Estate and Construction,...,,,,,,,,,,
3,8,PlanGrid,plangrid,https://bookface-images.s3.amazonaws.com/small...,http://plangrid.com,"San Francisco, CA, USA",PlanGrid is the leader in construction product...,Mobile applications for the construction indus...,355.0,Real Estate and Construction,...,,,,,,,,,,
4,9,WireOver,wireover,/company/thumb/missing.png,http://wireover.com,"Cambridge, MA, USA",WireOver is a desktop application that leverag...,Really secure file sending for big files.,2.0,B2B,...,,,,,,,,,,
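
Column names such as `tags/5`, `regions/11`, and `former_names/13` suggest that list fields were flattened into one column per position. A sketch of folding such a family back into a single list column is below. The helper name `collapse`, the company names, and the tag values are all hypothetical, and the code assumes every column in a family is named `<prefix>/<integer>`.

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame; names and tag values are placeholders.
wide = pd.DataFrame({
    "name": ["Example A", "Example B"],
    "tags/0": ["Hardware", "Consumer"],
    "tags/1": ["Manufacturing", np.nan],
    "tags/2": [np.nan, np.nan],
})


def collapse(df: pd.DataFrame, prefix: str) -> pd.Series:
    """Fold columns '<prefix>/0', '<prefix>/1', ... into one list per row."""
    cols = [c for c in df.columns if c.startswith(prefix + "/")]
    # Sort numerically so that '<prefix>/10' comes after '<prefix>/9'.
    cols.sort(key=lambda c: int(c.rsplit("/", 1)[1]))
    rows = df[cols].itertuples(index=False, name=None)
    return pd.Series(
        [[v for v in row if pd.notna(v)] for row in rows],
        index=df.index,
    )


wide["tags"] = collapse(wide, "tags")
# Remove the positional columns now that the list column exists.
wide = wide.drop(columns=[c for c in wide.columns if c.startswith("tags/")])
print(wide)
```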
