
The management team of the company you work for is concerned about the status of the company after a global pandemic.

The CFO (Chief Financial Officer) asks you to analyze the past six months of the company’s financial data, which has been loaded into the variable financial_data.

In [None]:
import codecademylib3  # Codecademy platform helper; not available outside that environment
from sklearn import preprocessing
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np

# load in financial data
financial_data = pd.read_csv('financial_data.csv')

print(financial_data.head())

Notice that financial_data has three columns: Month, Revenue, and Expenses.
Store each column in its own variable, named month, revenue, and expenses.
Create a plot of revenue over the past six months.
Create a plot of expenses over the past six months.

In [None]:
month = financial_data['Month']
revenue = financial_data['Revenue']
expenses = financial_data['Expenses']

# Plot revenue over the past six months
plt.plot(month, revenue)
plt.xlabel('Month')
plt.ylabel('Amount ($)')
plt.title('Revenue')
plt.show()

# Clear the figure, then plot expenses over the same period
plt.clf()
plt.plot(month, expenses)
plt.xlabel('Month')
plt.ylabel('Amount ($)')
plt.title('Expenses')
plt.show()

As shown, revenue seems to be quickly decreasing while expenses are increasing. If the current trend continues, expenses will soon surpass revenues, putting the company at risk.

After you show this chart to the management team, they are alarmed. They conclude that expenses must be cut immediately and give you a new file to analyze called expenses.csv.

Use pandas to read in expenses.csv and store it in a variable called expense_overview.

Print the first seven rows of the data.

Store the Expense column in a variable called expense_categories and the Proportion column in a variable called proportions.
Create a pie chart of the different expense categories. Use plt.clf() again to clear the previous plot, then call plt.pie() with two arguments: proportions and labels=expense_categories.
Give your pie chart a title using plt.title(), then use plt.show() at the end to show the plot.

In [None]:
expense_overview = pd.read_csv('expenses.csv')
print(expense_overview.head(7))

expense_categories = expense_overview['Expense']
proportions = expense_overview['Proportion']

plt.clf()
plt.pie(proportions, labels = expense_categories)
plt.title('Expenses by categories')
plt.axis('equal')
plt.tight_layout()
plt.show()

It seems that Salaries, Advertising, and Office Rent make up most of the expenses, while the rest of the categories make up a small percentage.

Before you hand this pie chart back to management, you would like to update the pie chart so that all categories making up less than 5% of the overall expenses (Equipment, Utilities, Supplies, and Food) are collapsed into an “Other” category.

Update the pie chart accordingly.

In [None]:
# Equipment, Utilities, Supplies, and Food are merged into 'Other'
expense_categories = ['Salaries', 'Advertising', 'Office Rent', 'Other']
proportions = [0.62, 0.15, 0.15, 0.08]
plt.clf()
plt.pie(proportions, labels=expense_categories)
plt.title('Expenses categories')
plt.axis('equal')
plt.tight_layout()
plt.show()
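
The hard-coded lists above can also be derived from the data. The following is a minimal sketch, assuming a data frame shaped like expense_overview with Expense and Proportion columns. The four small proportions are illustrative placeholders chosen to sum to the 8% "Other" share, not values read from expenses.csv, and the names threshold, collapsed, collapsed_categories, and collapsed_proportions are introduced here for illustration.

```python
import pandas as pd

# Illustrative stand-in for expense_overview; the four small values are
# placeholders that sum to the 8% "Other" share, not data from expenses.csv.
expense_overview = pd.DataFrame({
    "Expense": ["Salaries", "Advertising", "Office Rent",
                "Equipment", "Utilities", "Supplies", "Food"],
    "Proportion": [0.62, 0.15, 0.15, 0.03, 0.02, 0.02, 0.01],
})

threshold = 0.05  # categories below 5% are merged into "Other"

# Keep the label where the proportion meets the threshold, else use "Other".
labels = expense_overview["Expense"].where(
    expense_overview["Proportion"] >= threshold, "Other"
)

# Sum proportions per label; sort=False keeps the order of first appearance.
collapsed = expense_overview.groupby(labels, sort=False)["Proportion"].sum()

collapsed_categories = collapsed.index.tolist()
collapsed_proportions = collapsed.tolist()
print(collapsed)
```

The two resulting lists could then be passed to plt.pie() in place of the hand-typed ones, which avoids retyping values if the underlying file changes.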

Salaries make up 62% of expenses. The management team determines that to cut costs in a meaningful way, they must let go of some employees.

Each employee at the company is assigned a productivity score based on their work. The management would like to keep the most highly productive employees and let go of the least productive employees.

First, use pandas to load in employees.csv and store it in a variable called employees.

Print the first few rows of the data.

Notice that there is a Productivity column, which indicates the productivity score assigned to that employee.

Sort the employees data frame (in ascending order) by the Productivity column and store the result in a variable called sorted_productivity.

In [None]:
employees = pd.read_csv('employees.csv')
print(employees.head())
sorted_productivity = employees.sort_values(by=['Productivity'])
print(sorted_productivity)

You should now see the employees with the lowest productivity scores at the top of the data frame.

The company decides to let go of the 100 least productive employees.

Store the first 100 rows of sorted_productivity in a new variable called employees_cut and print out the result.

In [None]:
employees_cut = sorted_productivity.head(100)
print(employees_cut)

The COO (Chief Operating Officer) is debating whether to allow employees to continue to work from home post-pandemic.

He first wants a rough idea of the average commute time for employees at the company, and he asks for your help analyzing this data.

The employees data frame has a column called Commute Time that stores the commute time (in minutes) for each employee.

Create a variable called commute_times that stores the Commute Time column.
Let’s do some quick analysis on the commute times of employees.

Use print() and .describe() to print out descriptive statistics for commute_times.

What are the average and median commute times? Might it be worth it for the company to explore allowing remote work indefinitely so employees can save time during the day?

In [None]:
# Name the variable commute_times so the later cells can reuse it
commute_times = employees['Commute Time']
print(commute_times.describe())
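
To answer the question about the average and median, it may help to know where each value sits in the .describe() output. The sketch below uses a small hypothetical series of commute times (not values from employees.csv) and reads the mean from the "mean" row and the median from the "50%" row.

```python
import pandas as pd

# Hypothetical commute times in minutes; not taken from employees.csv.
commute_times = pd.Series([10, 15, 20, 25, 30, 35, 120], name="Commute Time")

summary = commute_times.describe()
mean_commute = summary["mean"]
median_commute = summary["50%"]  # the 50th percentile is the median

print(f"mean:   {mean_commute:.1f} minutes")
print(f"median: {median_commute:.1f} minutes")
```

In this made-up sample a single long commute pulls the mean above the median, which is the pattern to look for in right-skewed data.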

Let’s explore the shape of the commute time data using a histogram.

First, use plt.clf() to clear the previous plots. Then use plt.hist() to plot the histogram of commute_times. Finally, use plt.show() to show the plot. Feel free to add labels above plt.show() if you would like to practice!

What do you notice about the shape of the data? Is it symmetric, left skewed, or right skewed?
Answer: the data is right skewed.

In [None]:
plt.clf()
plt.hist(commute_times)
plt.title("Employee Commute Times")
plt.xlabel("Commute Time")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()

The data seems to be skewed to the right. To make it more symmetrical, we might try applying a log transformation.
Right under the commute_times variable, create a variable called commute_times_log that stores a log-transformed version of commute_times.
To apply log-transform, you can use numpy’s log() function.

Replace the histogram for commute_times with one for commute_times_log.
Notice how the shape of the data changes from right skewed to more symmetrical (and even slightly left skewed). After the log transformation, the data is closer to a normal distribution than before.

In [None]:
print(commute_times.skew())
commute_times_log = np.log(commute_times)
plt.clf()
plt.hist(commute_times_log)
plt.title("Employee Commute Times (Log-Transformed)")
plt.xlabel("Log of Commute Time")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
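
The effect of the transformation can also be checked numerically by comparing skewness before and after. This is a rough sketch on synthetic data: the sample is drawn from a log-normal distribution as a stand-in for commute_times, and the distribution parameters and seed are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Synthetic right-skewed sample standing in for commute_times.
sample = pd.Series(rng.lognormal(mean=3.0, sigma=0.6, size=1000))

skew_before = sample.skew()
skew_after = np.log(sample).skew()  # np.log needs strictly positive values

print(f"skewness before log: {skew_before:.2f}")
print(f"skewness after log:  {skew_after:.2f}")
```

A skewness near zero suggests a roughly symmetric shape. Note that np.log() returns -inf for zero, so a column containing zero-minute commutes would need separate handling.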

In this project, you performed data analysis to help a management team answer important questions about the status of the company during a difficult time.

You did this by analyzing data sets and applying common data transformation techniques. These are important skills to have as a data analyst.

Other analysis:

Apply standardization to the employees data using StandardScaler() from sklearn. Refer to this article if you need help.
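
As a starting point, here is a minimal sketch of standardization with StandardScaler. The data frame is a small hypothetical stand-in with made-up values; only the Productivity and Commute Time column names come from the project, and the real employees data may contain non-numeric columns that would need to be excluded first.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data; the values are made up for illustration.
employees = pd.DataFrame({
    "Productivity": [40.0, 55.0, 70.0, 85.0],
    "Commute Time": [10.0, 20.0, 30.0, 60.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(employees)  # returns a NumPy array

# Wrap the result back into a data frame with the original labels.
standardized = pd.DataFrame(
    scaled, columns=employees.columns, index=employees.index
)

print(standardized.mean().round(6))
print(standardized.std(ddof=0).round(6))  # StandardScaler uses the population std
```

After scaling, each column has mean 0 and standard deviation 1 (computed with ddof=0), which puts features measured in different units on a comparable scale.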

Explore the income and productivity features in more detail. Can you find a relationship between productivity and income?
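
One simple way to begin is with a correlation coefficient. The sketch below assumes the income feature is stored in a column named Income, which is a guess at the column name, and uses made-up values, so the printed number says nothing about the real data.

```python
import pandas as pd

# Made-up values; "Income" is a hypothetical column name.
employees = pd.DataFrame({
    "Productivity": [40, 55, 70, 85, 60],
    "Income": [42000, 50000, 61000, 58000, 47000],
})

# Pearson correlation: values near +1 or -1 suggest a strong linear
# relationship, values near 0 suggest little or none.
r = employees["Productivity"].corr(employees["Income"])
print(f"correlation between productivity and income: {r:.2f}")
```

A scatter plot of the two columns, for example with plt.scatter(), is a useful companion, since a single coefficient can hide non-linear patterns.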

Happy coding!