# Profit Analysis of Fortune 500 Companies

My aim in this project is to find out how the profits of the largest companies in the US have changed over time.

You can find a data set of Fortune 500 companies spanning over 50 years since the list’s first publication in 1955, put together from [Fortune’s public archive](https://archive.fortune.com/magazines/fortune/fortune500_archive/full/2005/). I’ve gone ahead and created a CSV of the data required for this project as [fortune500.csv](https://s3.amazonaws.com/dq-blog-files/fortune500.csv).

In [1]:
### Import all the necessary libraries ###

# Display matplotlib charts inline in the Jupyter notebook (keep this comment on its own line)
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')


In [None]:
#Import the csv file as a Dataframe using Pandas
df = pd.read_csv('fortune500.csv')

## Investigating Data Set

In [None]:
df.head()

In [None]:
df.tail()

The DataFrame looks good. I have the columns I need, and each row corresponds to a single company in a single year.

Let's rename those columns so that it is easy to refer to them later.

In [None]:
#Renaming the column headers
df.columns = ['year','rank','company','revenue','profit']
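Assigning to `df.columns` relies on the columns arriving in exactly this order. If that feels fragile, an explicit mapping with `rename` is one alternative. The sketch below uses hypothetical original header names and a made-up row, since I'm assuming rather than quoting what the CSV contains:

```python
import pandas as pd

# Hypothetical original headers and a made-up row, for illustration only
toy = pd.DataFrame(
    [[1955, 1, 'Example Corp', 100.0, '5.0']],
    columns=['Year', 'Rank', 'Company', 'Revenue (in millions)', 'Profit (in millions)'],
)

# Map each old name to its new name; the result does not depend on column order
renamed = toy.rename(columns={
    'Year': 'year',
    'Rank': 'rank',
    'Company': 'company',
    'Revenue (in millions)': 'revenue',
    'Profit (in millions)': 'profit',
})
new_names = list(renamed.columns)
```

One caveat either way: a column called `rank` has to be accessed as `df['rank']`, because `df.rank` is a DataFrame method.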

In [2]:
df.head()

Next, I need to explore the data set to see if it is complete or not.

In [None]:
len(df)

In [None]:
df.shape

Okay, that looks good — that’s 500 rows for every year from 1955 to 2005, inclusive.
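The total row count only tells me the overall size is right. A per-year count makes the "500 rows for every year" check explicit. Here is a minimal sketch on a made-up frame (the `toy` data and `expected_rows` are my own stand-ins, not the real file):

```python
import pandas as pd

# Synthetic stand-in: 3 years x 4 rows (the real data set should have 500 rows per year)
toy = pd.DataFrame({
    'year': [1955] * 4 + [1956] * 4 + [1957] * 4,
    'company': ['A', 'B', 'C', 'D'] * 3,
})

rows_per_year = toy.groupby('year').size()
expected_rows = 4  # would be 500 for the Fortune 500 data

# True only if every year has exactly the expected number of rows
all_years_complete = bool((rows_per_year == expected_rows).all())

# True only if no year is skipped between the first and the last
first, last = int(toy['year'].min()), int(toy['year'].max())
no_missing_years = [int(y) for y in rows_per_year.index] == list(range(first, last + 1))
```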

Let’s check whether the data set has been imported as expected. A simple check is to see if the data types (or dtypes) have been correctly interpreted.

In [None]:
#checking if our dataset has been properly imported
df.dtypes

It looks like there’s something wrong with the profit column: we would expect it to be a float64 like the revenue column. This indicates that it probably contains some non-numeric values, so let’s take a look.

In [None]:
non_numeric_profits = df.profit.str.contains('[^0-9.-]')
df.loc[non_numeric_profits].head()
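To see why a single stray string changes the dtype of a whole column, here is a small sketch with made-up CSV text. The `'N.A.'` marker and the values are assumptions for illustration, not rows copied from the real file:

```python
import io
import pandas as pd

# Made-up CSV text; one profit value is the string 'N.A.' instead of a number
csv_text = "year,profit\n1955,806.0\n1955,N.A.\n1956,-12.5\n"
toy = pd.read_csv(io.StringIO(csv_text))

# pandas cannot parse the column as numbers, so it keeps every value as a string
is_numeric = pd.api.types.is_numeric_dtype(toy['profit'])

# The pattern matches any character that is not a digit, a dot or a minus sign
mask = toy['profit'].str.contains('[^0-9.-]', regex=True)
flagged = toy.loc[mask, 'profit'].tolist()
```

Note that `'-12.5'` is not flagged, because the minus sign is part of the allowed character set.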

Just as suspected! Some of the values are non-numeric strings, which have been used to indicate missing data. Are there any other values that have crept in?

In [None]:
set(df.profit[non_numeric_profits])

In [None]:
#checking how many values are missing
len(df.profit[non_numeric_profits])

It’s a small fraction of the data set, though not completely inconsequential as it is still around 1.5%.

If rows containing N.A. are, roughly, uniformly distributed over the years, the easiest solution would just be to remove them. So let’s have a quick look at the distribution.

In [None]:
bin_sizes, _, _ = plt.hist(df.year[non_numeric_profits], bins=range(1955, 2006))

At a glance, I can see that no single year has more than about 25 invalid values, and as there are 500 data points per year, removing them would cost less than 5% of the data (25/500) even for the worst years. Indeed, other than a surge around the 90s, most years have fewer than half the missing values of the peak.
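Reading shares off a histogram is approximate, so the same numbers can also be computed directly. A minimal sketch on synthetic data (the `toy` frame is invented; only the regular expression comes from the cell above):

```python
import pandas as pd

# Synthetic example: 2 years x 5 rows, with 'N.A.' standing in for missing profits
toy = pd.DataFrame({
    'year':   [1990] * 5 + [1991] * 5,
    'profit': ['1.0', 'N.A.', '3.0', 'N.A.', '5.0', '2.0', '4.0', 'N.A.', '6.0', '8.0'],
})
mask = toy['profit'].str.contains('[^0-9.-]', regex=True)
flags = mask.astype(int)  # 1 for a non-numeric profit, 0 otherwise

overall_share = flags.mean()                          # fraction of all rows affected
missing_per_year = flags.groupby(toy['year']).sum()   # count of affected rows per year
share_per_year = flags.groupby(toy['year']).mean()    # fraction of each year's rows affected
worst_year = share_per_year.idxmax()
```

On the real data, `share_per_year.max()` would give the exact cost of dropping rows in the worst year.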

For my purposes, let’s say this is acceptable and go ahead and remove these rows.

In [None]:
#Removing non numeric profits
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
df = df.loc[~non_numeric_profits].copy()
df['profit'] = pd.to_numeric(df['profit'])
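A different route to the same result, which I am not using in this notebook, is to skip the mask and let `pd.to_numeric` turn unparseable values into `NaN` with `errors='coerce'`, then drop those rows. A sketch on made-up values:

```python
import pandas as pd

# Made-up rows; 'N.A.' plays the role of the missing-data marker
toy = pd.DataFrame({
    'year': [1955, 1955, 1956],
    'profit': ['806.0', 'N.A.', '-12.5'],
})

# errors='coerce' converts anything that cannot be parsed into NaN instead of raising
toy['profit'] = pd.to_numeric(toy['profit'], errors='coerce')

# Drop the rows where the conversion produced NaN
clean = toy.dropna(subset=['profit']).copy()
profit_dtype = str(clean['profit'].dtype)
```

The trade-off is that coercion silently hides any unexpected marker, whereas the mask approach lets me inspect the offending values first.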

In [None]:
len(df)

In [None]:
#checking profit's type
df.dtypes

Looking good. Data set setup is now complete.

## Plotting with matplotlib

Next, I can get to addressing the question at hand by plotting the average profit by year. I might as well plot the revenue too, so first I define some variables and a helper function to cut down on repeated code.

In [None]:
#Plotting Average Profit by Year
group_by_year = df.loc[:, ['year', 'revenue', 'profit']].groupby('year')
avgs = group_by_year.mean()
x = avgs.index
y1 = avgs.profit
def plot(x, y, ax, title, y_label):
    ax.set_title(title)
    ax.set_ylabel(y_label)
    ax.plot(x, y)
    ax.margins(x=0, y=0)

In [None]:
fig, ax = plt.subplots()
plot(x, y1, ax, 'Increase in mean Fortune 500 company profits from 1955 to 2005', 'Profit (millions)')

Wow, that looks like an exponential, but it’s got some huge dips. They must correspond to the early 1990s recession and the dot-com bubble. It’s pretty interesting to see that in the data. But how come profits recovered to even higher levels after each recession?
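"Looks like an exponential" can be checked roughly: if growth is exponential, the logarithm of the series is a straight line in time. The sketch below fits that line to a made-up series growing 5% a year. These are not the real averages, and the approach assumes every yearly mean is positive:

```python
import numpy as np

years = np.arange(1955, 2006)

# Made-up series growing 5% per year; NOT the real Fortune 500 averages
profits = 100.0 * 1.05 ** (years - 1955)

# Fit log(profit) = intercept + slope * t; for exponential growth, slope = ln(1 + rate)
slope, intercept = np.polyfit(years - 1955, np.log(profits), 1)
annual_growth = np.exp(slope) - 1
```

On real data the fit would not be exact, and large residuals around the dips would show where the exponential picture breaks down.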

Maybe the revenues can tell us more.

In [None]:
#Plotting Average Revenue by Year
y2 = avgs.revenue
fig, ax = plt.subplots()
plot(x, y2, ax, 'Increase in mean Fortune 500 company revenues from 1955 to 2005', 'Revenue (millions)')

That adds another side to the story. Revenues were not as badly hit — that’s some great accounting work from the finance departments.

In [None]:
def plot_with_std(x, y, stds, ax, title, y_label):
    ax.fill_between(x, y - stds, y + stds, alpha=0.2)
    plot(x, y, ax, title, y_label)
fig, (ax1, ax2) = plt.subplots(ncols=2)
title = 'Increase in mean and std Fortune 500 company %s from 1955 to 2005'
stds1 = group_by_year.std().profit.values
stds2 = group_by_year.std().revenue.values
plot_with_std(x, y1.values, stds1, ax1, title % 'profits', 'Profit (millions)')
plot_with_std(x, y2.values, stds2, ax2, title % 'revenues', 'Revenue (millions)')
fig.set_size_inches(14, 4)
fig.tight_layout()
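The shaded band in these charts is simply the yearly mean plus and minus one standard deviation. A small sketch with invented numbers shows the values behind such a band (pandas uses the sample standard deviation, `ddof=1`, by default):

```python
import pandas as pd

# Invented profits for two years, three companies each
toy = pd.DataFrame({
    'year':   [1955, 1955, 1955, 1956, 1956, 1956],
    'profit': [10.0, 20.0, 30.0, 0.0, 50.0, 100.0],
})

# Compute mean and standard deviation per year in one pass
stats = toy.groupby('year')['profit'].agg(['mean', 'std'])

# Edges of the band that fill_between would shade
lower = stats['mean'] - stats['std']
upper = stats['mean'] + stats['std']
```

In this toy case the 1956 band runs from 0 to 100 around a mean of 50, so a wide band says the companies in that year differ a lot from each other.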

## Conclusion

The standard deviations are huge! Some Fortune 500 companies make billions while others lose billions, and the risk has increased along with rising profits over the years.

Perhaps some companies perform better than others; are the profits of the top 10% more or less volatile than the bottom 10%?
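One way to start on that question is sketched below on synthetic data. It rests on two assumptions of mine: "top 10%" means the best-ranked tenth of each year's list, and "volatile" means the standard deviation, across years, of that group's mean profit. Other definitions are equally defensible.

```python
import pandas as pd

# Synthetic data: 2 years x 10 companies with ranks 1-10 (the real data has ranks 1-500)
toy = pd.DataFrame({
    'year':   [2000] * 10 + [2001] * 10,
    'rank':   list(range(1, 11)) * 2,
    'profit': [90, 80, 70, 60, 50, 40, 30, 20, 10, 0,
               190, 20, 70, 60, 50, 40, 30, 20, 12, 8],
})

n = toy['rank'].max()
top = toy[toy['rank'] <= n * 0.1]      # best-ranked 10% in each year
bottom = toy[toy['rank'] > n * 0.9]    # worst-ranked 10% in each year

# Volatility here = spread of the group's yearly mean profit across years
top_std = top.groupby('year')['profit'].mean().std()
bottom_std = bottom.groupby('year')['profit'].mean().std()
```

With only two invented years the numbers mean nothing in themselves. The point is the shape of the computation, which carries over to the full data set.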