# Patterns in the Fastest Growing American Companies
Analysis by a, Fall 2022

### Outline: 
1. Introduction
2. Data Collection
3. Data Cleanup/Management
4. Exploratory Data Analysis
5. Testing
6. Insights/Conclusion

# Introduction

Throughout America, tens of millions of companies are constantly competing against each other. Businesses that can gain an edge in the competition are far more likely to succeed in both the short and the long term. As a result, patterns surrounding successful businesses can give substantial insight into what gives certain businesses their edge over the competition.

The advantages a business has over its competition can be defined through the special resources it has access to, such as the talent it hires. This analysis looks specifically at a business's access to talent clusters that can benefit its workforce.

Many of today's most innovative companies are taking great advantage of the activity within these talent hot spots. In 2017 alone, the 10 largest tech hubs in America were responsible for 58% of all U.S. patents, and similar trends can be seen throughout the world, from Tokyo to Paris.

This analysis looks for patterns between top American companies and their locations, and for a correlation between a company's employee count and its success. Areas known for hosting a certain industry, such as technology in California cities, should show a dominance of that industry in the data. The eventual conclusion should indicate where certain companies should locate their headquarters to hire the best talent, and how many employees they should ideally hire.

Source: https://hbr.org/2018/09/navigating-talent-hot-spots

# Data Collection

To begin, we need a reliable source of data on successful American companies. It is crucial that the data has detailed information on each company: its location, to pinpoint talent hot spots; its number of employees, to look for patterns between growth and employee count; and its growth and revenue. Additionally, the data must be up to date, as the environment modern businesses compete in is constantly changing, especially where the latest technology is concerned.

The source we will be using for this data is the Inc. 5000, for 2021. The Inc. 5000 is a list of the five thousand fastest growing private companies in the U.S. The list provides substantial information on each company, and importantly has all the data we need to identify the location and employees of each business, as well as how much they have grown and made in 2021. On top of that, it also provides the industry each company is within, which will be useful to narrow down patterns within a certain industry.

Inc. 5000: Introducing the Inc. 5000 Fastest-Growing Private Companies in America
Data Source: ebstax/inc-5000 | Workspace | data.world

Next, we need to get our hands on the dataset. Unfortunately, trying to scrape the Inc. 5000 site usually results in being detected as a bot. Moreover, the data is displayed in a scrolling box, so rows only become visible, to a user and to the Chrome driver alike, after scrolling. A scraper would have to scroll down, scrape, and repeat over the entire list of 5000 companies. Luckily for us, the data on 2021's Inc. 5000 can be found on data.world.

Now that we have the dataset, we can begin by importing the libraries we will need. Libraries such as matplotlib, sklearn, and pandas are crucial to visualizing and analyzing the data we just obtained.

```
import pandas as pd
import folium
from folium.plugins import MarkerCluster
from geopy.geocoders import Nominatim
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
```

With the libraries imported, we can now read the Excel sheet containing our data. We also print the first few rows to get a feel for the format of the dataset, including column names and whether the data requires parsing.

```
xlsheet = pd.read_excel(r"C:\Users\sdiv\Documents\inc2021.xlsx")
print(xlsheet.head())
```

```
Column1.inc5000companyId  ...  Column1.editorsPick
0                    116153  ...               [List]
1                    115394  ...               [List]
2                     81708  ...               [List]
3                    121083  ...               [List]
4                    115418  ...               [List]

[5 rows x 38 columns]
```

# Data Cleanup/Management

As you can see, the data as it stands is difficult to work with. We can use print(xlsheet.to_string()) to see the rest of the columns and data. Even then, the column names are messy, and many columns are unnecessary or contain little to no useful data. There is also the issue of visualizing 5000 rows of data. To make later steps easier, we clean up the data by creating a new dataset and copying over only the columns we need from the original. We then drop the last 4000 companies and analyze only the top 1000, which makes the visualizations easier to read. Lastly, we remove rows with missing values in the columns we need for analysis. By the end of the cleanup, we should be left with each company's name, rank, workers, location, growth, revenue, and industry.

```
df_copy = xlsheet[["Column1.company", "Column1.rank", "Column1.workers", "Column1.state_s", "Column1.city", "Column1.zipcode", "Column1.growth", "Column1.revenue", "Column1.industry"]].copy()
df_copy.rename(columns={'Column1.company': 'Company',
                        'Column1.rank': 'Rank',
                        'Column1.workers': 'Workers',
                        'Column1.state_s': 'State',
                        'Column1.city': 'City',
                        'Column1.zipcode': 'Zip Code',
                        'Column1.growth': 'Growth',
                        'Column1.revenue': 'Revenue',
                        'Column1.industry': 'Industry'},
               inplace=True)

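# Keep only the first 1000 rows (positions 0-999); this assumes the sheet is sorted by rank.
df_copy.drop(df_copy.index[1000:], inplace=True)
# dropna returns a new DataFrame unless inplace=True is passed.
df_copy.dropna(subset=["Company", "Rank", "Workers", "City", "Growth", "Revenue", "Industry"], inplace=True)

print(df_copy.head())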
```

```
Rank  Workers State  ... Growth        Revenue                      Industry
0     1       75    CA  ...  48345     52 Million  Business Products & Services
1     2      720    CA  ...  39734   45.7 Million                        Health
2     3        9    MS  ...  36955  193.5 Million                     Insurance
3     4      330    MA  ...  32997   40.7 Million                        Health
4     5      151    CA  ...  28972  121.8 Million            Financial Services
```
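The Revenue column in the output above is stored as text such as "52 Million", so it cannot be sorted or plotted numerically as is. The following is a minimal sketch of one way to convert it, assuming every value follows a "number plus unit word" pattern. The function name `parse_revenue` and the set of unit words are illustrative assumptions, not part of the original analysis.

```python
def parse_revenue(text):
    """Convert a string such as '45.7 Million' into a number of dollars.

    Returns None when the leading number or the unit word is not recognized.
    """
    # Assumed unit words; extend this mapping if the dataset uses others.
    multipliers = {"thousand": 1e3, "million": 1e6, "billion": 1e9}
    parts = str(text).replace("$", "").replace(",", "").split()
    if not parts:
        return None
    try:
        value = float(parts[0])
    except ValueError:
        return None
    if len(parts) > 1:
        factor = multipliers.get(parts[1].lower())
        if factor is None:
            return None
        value *= factor
    return value
```

With the cleaned DataFrame, this could be applied as df_copy["Revenue"].map(parse_revenue) and stored in a new column, leaving the original text column untouched.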

# Exploratory Data Analysis

Now with the data cleaned and reduced, we can move on to analysis. At face value, the data doesn't provide much insight. The companies are ranked by growth percentage, with the #1 spot, Human Bees, growing 48,345% during 2021. Beyond that, we need to present the data in a different form to analyze it. To analyze location patterns, we can use a Folium map to visualize the companies' locations.

## Folium Map

A Folium map requires latitude and longitude to chart locations. To find this information, we use geopy. We create the map and then give geopy's geolocator the city each company is in, which returns latitude and longitude values we can use on the map. After ensuring that geopy did not return None, we place a marker at the company's location.

```
geolocator = Nominatim(user_agent="MyApp")
map_osm = folium.Map(zoom_start=10)


def create_map(df):
    for index, company in df.iterrows():
        location = geolocator.geocode(company["City"])
        if location is not None:
            if location.latitude is not None and location.longitude is not None:
                folium.Marker(location=[location.latitude, location.longitude]).add_to(map_osm)


create_map(df_copy)
map_osm.save("map.html")
```
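The loop above sends one geocoding request per company, even when many companies share a city, and a bare city name can match places in several states. As a hedged sketch of an alternative, the helper below looks up each distinct place string only once and pauses between requests. The public Nominatim service asks clients to stay at or below roughly one request per second. The function name `geocode_unique`, its parameters, and the "City, State, USA" query format are assumptions made for illustration.

```python
import time


def geocode_unique(places, geocode, delay_seconds=1.0):
    """Look up each distinct place once and return {place: (lat, lon) or None}.

    `geocode` is any callable that takes a query string and returns an object
    with .latitude and .longitude attributes, or None when nothing is found.
    """
    cache = {}
    for place in places:
        if place in cache:
            continue  # already looked up, or already known to fail
        location = geocode(place)
        # Store None for failed lookups so they are not retried.
        cache[place] = (
            (location.latitude, location.longitude) if location is not None else None
        )
        time.sleep(delay_seconds)  # pause to respect the service's rate limit
    return cache
```

With the objects defined earlier, the place strings could be built as df_copy["City"] + ", " + df_copy["State"] + ", USA" and passed in together with geolocator.geocode. geopy also ships a RateLimiter wrapper in geopy.extra.rate_limiter that can replace the manual sleep.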

[Map 1: company locations plotted as individual markers]

We now have the data charted on the map and can interpret it visually. From this map, we can see clusters of company locations in Los Angeles, San Francisco, Seattle, Washington, New York City, and many more. However, these are all major American cities, and the map does not show where specific industries can find talent. New York City naturally has many business headquarters, but is it a good place for finance, advertising, or technology companies? In addition, points that share the same location overlap, so only one of them is visible. Our next map addresses both issues: it color-codes companies by industry and clusters points that sit on the same location.

To create the cluster map, we create a new marker cluster layer on top of the Folium map using the Folium plugin MarkerCluster. The markers on the map now merge into clusters when close together, and a cluster's color intensifies with the number of markers within it. To allow for analysis by industry, we create a function that returns a specific color and icon based on a company's industry, which is used to color-code each marker.

```
cluster = MarkerCluster(name='cluster_layer').add_to(map_osm)

# Marker style (color, icon) for each industry. folium.Icon only supports a
# fixed set of colors, so the palette is reused with a second icon.
INDUSTRY_STYLES = {
    "Advertising & Marketing": ("red", "info-sign"),
    "Agriculture & Natural Resources": ("blue", "info-sign"),
    "Arts & Entertainment": ("green", "info-sign"),
    "Automotive": ("purple", "info-sign"),
    "Business Products & Services": ("orange", "info-sign"),
    "Computer Hardware": ("darkred", "info-sign"),
    "Construction": ("lightred", "info-sign"),
    "Consumer Products & Services": ("beige", "info-sign"),
    "Software": ("darkblue", "info-sign"),
    "DEI Advocacy": ("darkgreen", "info-sign"),
    "E-Commerce": ("cadetblue", "info-sign"),
    "Economic/Financial Equity": ("darkpurple", "info-sign"),
    "Education": ("white", "info-sign"),
    "Energy": ("pink", "info-sign"),
    "Engineering": ("lightblue", "info-sign"),
    "Environmental Services": ("lightgreen", "info-sign"),
    "Financial Services": ("gray", "info-sign"),
    "Food & Beverage": ("black", "info-sign"),
    "Government Services": ("lightgray", "info-sign"),
    "Health Products": ("red", "cloud"),
    "Health": ("blue", "cloud"),
    "Human Resources": ("green", "cloud"),
    "Insurance": ("purple", "cloud"),
    "IT Management": ("orange", "cloud"),
    "IT Systems Development": ("orange", "cloud"),  # shares a style with IT Management
    "Legal": ("darkred", "cloud"),
    "Logistics & Transportation": ("lightred", "cloud"),
    "Manufacturing": ("beige", "cloud"),
    "Media": ("darkblue", "cloud"),
    "Pharmaceutical": ("darkgreen", "cloud"),
    "Real Estate": ("cadetblue", "cloud"),
    "Retail": ("darkpurple", "cloud"),
    "Security": ("white", "cloud"),
    "Sports": ("lightblue", "cloud"),
    "Telecommunications": ("lightgreen", "cloud"),
    "Travel & Hospitality": ("gray", "cloud"),
}


def return_color(indu):
    # Unlisted industries fall back to a red "ban-circle" marker.
    if indu not in INDUSTRY_STYLES:
        print("wrong name --> " + str(indu))  # check for industry that was not listed
    return INDUSTRY_STYLES.get(indu, ("red", "ban-circle"))

def create_map(df):
    for index, company in df.iterrows():
        location = geolocator.geocode(company["City"])
        if location is not None:
            if location.latitude is not None and location.longitude is not None:
                colorindu, iconindu = return_color(company["Industry"])
                folium.Marker(location=[location.latitude, location.longitude],
                              icon=folium.Icon(color=colorindu, icon=iconindu),
                              popup="index - {0}, {1}".format(location.latitude, location.longitude)
                                    + " Company: " + str(company["Company"])
                                    + " Rank: " + str(company["Rank"])
                                    + " Workers: " + str(company["Workers"])
                                    + " City: " + str(company["City"])
                                    + " Growth: " + str(company["Growth"])
                                    + " Revenue: " + str(company["Revenue"])
                                    + " Industry: " + str(company["Industry"])
                              ).add_to(cluster)

folium.LayerControl().add_to(map_osm)

create_map(df_copy)
map_osm.save("map.html")
```

[Map 2: cluster map with markers color-coded by industry]

At face value, the map doesn't appear to provide much information, but zooming in splits the large clusters into smaller ones centered on individual cities, each showing the exact number of companies it contains. Clicking on a cluster zooms in to reveal smaller, more focused clusters, until all the individual company markers within that area are shown. The map now has a specific color code and icon for different industries. Clicking on a marker reveals the details collected earlier for that company. Exploring the map gives enormous insight into where clusters of company headquarters are located in the US, and which industries are dominant in each region.

## Employee Scatter Plot

Lastly, we create a scatter plot to assess the relationship between employee count and a company's growth.

```
ax = df_copy.plot.scatter(x='Workers', y='Growth', s=.5, title='Growth to Workers')
ax.set_xlabel("Workers")
ax.set_ylabel("Growth")
plt.xlim(0, 1000)
plt.ylim(0, 50000)

plt.show()
```
[Figure: scatter plot of Growth against Workers]

# Testing

After creating the cluster map, we can use it to find out which industries dominate certain cities.

In upper San Francisco, CA, there is a cluster of 29 companies. Of these 29, 11 are software companies, with others related to security and business products/services. Other cities are more well rounded, such as New York City, which has a more even mix of industries. Lower Manhattan is home to 8 advertising companies, 6 business products and services companies, and 7 software companies, among others. There is a visible trend for certain industries to have far more companies located within a certain state and city, like software within California cities.
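Counts like these were read off the map by hand. As a rough cross-check, the same kind of tally can be computed from the table itself. The sketch below is one possible way to do that, assuming each company is available as a dictionary with "City", "State", and "Industry" keys. The function name `industry_counts` is hypothetical. Because it groups by the City field and not by map area, its numbers will not necessarily match a map cluster such as "upper San Francisco".

```python
from collections import Counter


def industry_counts(records, city, state):
    """Count companies per industry for one city, most common first."""
    counts = Counter(
        rec["Industry"]
        for rec in records
        if rec["City"] == city and rec["State"] == state
    )
    # most_common() returns (industry, count) pairs sorted by descending count.
    return counts.most_common()
```

With the cleaned DataFrame, the records could be produced with df_copy.to_dict("records") and queried with, for example, industry_counts(records, "San Francisco", "CA").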

The scatter plot is quite simple: the majority of companies have fewer than 100 employees and show widely varying growth. However, companies with over 200 employees nearly all have low growth, except for a few outliers.
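The observation about companies with more than 200 employees comes from reading the plot by eye. A simple numeric check is to compare the median growth on each side of that threshold. The following is a minimal sketch under that assumption. The function name `median_growth_by_size` and the default threshold of 200 are illustrative choices taken from the paragraph above, and the median is used because a few extreme growth values would dominate a mean.

```python
from statistics import median


def median_growth_by_size(pairs, threshold=200):
    """Return (median growth at or below threshold, median growth above it).

    `pairs` is an iterable of (workers, growth) tuples. A side with no
    companies yields None.
    """
    pairs = list(pairs)  # allow one-shot iterables such as zip(...)
    small = [growth for workers, growth in pairs if workers <= threshold]
    large = [growth for workers, growth in pairs if workers > threshold]
    return (
        median(small) if small else None,
        median(large) if large else None,
    )
```

With the cleaned DataFrame, this could be called as median_growth_by_size(zip(df_copy["Workers"], df_copy["Growth"])).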

# Conclusion

As a reminder, the objective of this program was to locate talent hot spots where certain industries can gain a competitive edge, and to determine whether the number of workers is critical to a company's growth. We began with the hypothesis that, through our data analysis, we would be able to show that companies within similar industries gather in similar locations to scout better talent. Additionally, given an industry, we would be able to ascertain where a company would want to locate its headquarters to gain an edge on its competition.

After testing our hypothesis on the interactive cluster map and analyzing the scatter plot, we can conclude that certain industries clearly cluster within certain cities where there is more talent for their work. These industries include software, health, business products/services, advertising, and IT management, to name a few. Other industries, however, such as food and beverage, education, and real estate, showed no correlation or patterns in their headquarters locations. Lastly, industries like energy showed no patterns in their location, but their locations may be influenced more by natural resource locations than by scouting opportunities. The scatter plot showed that companies with fewer employees, typically around 100, had the greatest growth throughout 2021. When hiring, companies should avoid overhiring, as companies with more than 200 employees typically grew less during 2021.

