# Company Analysis (Quiz 3 Prep)

## Load libraries and data

Import packages.

In [1]:
import pandas as pd
import numpy as np

### Dataset

There are two tables in our dataset.

- `df_companies` contains some basic information about a few hundred fictitious companies.
- `df_info` contains some additional information about those companies.

In [2]:
df_companies = pd.read_csv('https://github.com/bdi475/datasets/raw/56972ccafac4b703553bd66f4e96720b0d43127d/fake-companies.csv')

df_companies.head()

Unnamed: 0,company_id,company_name,country,is_profitable,branches
0,572907,Smith Group,Armenia,True,1
1,859451,Martinez-Cooper,Brazil,True,2
2,355366,Estrada-Mccann,Saint Kitts and Nevis,False,1
3,512925,Mitchell Inc,United States Minor Outlying Islands,True,4
4,664465,Aguirre Ltd,Fiji,True,4


In [3]:
df_info = pd.read_csv('https://github.com/bdi475/datasets/raw/56972ccafac4b703553bd66f4e96720b0d43127d/fake-company-info.csv')

df_info.head()

Unnamed: 0,company_id,ceo,industry,revenue,inception_date
0,572907,Joshua Castillo,Software,6445200,2015-08-01
1,859451,Jenna Perkins,Education,4263696,2012-11-29
2,355366,James Neal,Financial Services,612742,2018-05-18
3,512925,Sarah Yates,Financial Services,3732464,2016-03-07
4,664465,Krystal Wallace,Manufacturing,5840134,2018-04-03


## Questions

### Q1. How many companies are in `df_companies`?

How many companies (i.e., number of rows) does `df_companies` contain?
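
As a minimal sketch of row counting, the example below uses a hypothetical toy table (`toy` and its values are made up for illustration and are not part of the dataset). It shows that `shape[0]` and `len()` give the same row count.

```python
import pandas as pd

# Hypothetical toy table, not the quiz dataset
toy = pd.DataFrame({
    'company_id': [1, 2, 3],
    'company_name': ['A Corp', 'B Ltd', 'C Inc'],
})

# shape is a (rows, columns) tuple, so shape[0] is the row count
n_rows = toy.shape[0]

# len() on a DataFrame also returns the number of rows
n_rows_len = len(toy)

print(n_rows, n_rows_len)
```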


In [4]:
# YOUR CODE BEGINS
df_companies.shape[0]
# YOUR CODE ENDS

250

**Answer**: 250

### Q2. How many companies are profitable?

The `is_profitable` column in `df_companies` indicates whether a company was profitable at the time of exit.

#### 💡 Hint

Find the number of rows where `is_profitable` is `True`.
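
The hint can be sketched in two ways on a hypothetical toy table (`toy` and its values are made up for illustration). Filtering with a boolean column and counting rows is one option; summing the boolean column is another, since `True` counts as 1.

```python
import pandas as pd

# Hypothetical toy table, not the quiz dataset
toy = pd.DataFrame({
    'company_name': ['A Corp', 'B Ltd', 'C Inc', 'D LLC'],
    'is_profitable': [True, False, True, True],
})

# Option 1: keep only the rows where the flag is True, then count them
n_filtered = toy[toy['is_profitable']].shape[0]

# Option 2: True counts as 1 and False as 0, so the sum is the number of True values
n_summed = int(toy['is_profitable'].sum())

print(n_filtered, n_summed)
```

Both approaches assume the column holds real booleans with no missing values.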

In [5]:
# YOUR CODE BEGINS
df_companies[df_companies['is_profitable']].shape[0]
# YOUR CODE ENDS

207

**Answer**: 207

### Q3. Does `df_companies` contain any rows with a missing `country` value (`NaN`)?

In [6]:
# YOUR CODE BEGINS
df_companies[df_companies['country'].isna()]
# YOUR CODE ENDS

Unnamed: 0,company_id,company_name,country,is_profitable,branches
120,528219,Stein Group,,True,4
222,146245,Lopez-Scott,,True,1


**Answer**: Yes
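
Inspecting the filtered rows works well for a small table. As a hedged alternative, the sketch below gets a direct yes/no answer and a count on a hypothetical toy table (`toy` and its values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one missing country
toy = pd.DataFrame({
    'company_name': ['A Corp', 'B Ltd', 'C Inc'],
    'country': ['Fiji', np.nan, 'Brazil'],
})

# isna() marks each missing value as True; any() checks for at least one True
has_missing = bool(toy['country'].isna().any())

# Summing the same mask counts how many values are missing
n_missing = int(toy['country'].isna().sum())

print(has_missing, n_missing)
```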

### Merge DataFrames (ungraded)

Merge `df_info` into `df_companies` using `pd.merge()`.

- We will provide the template code for merging the two tables in the quiz.
- Although this task is not graded, you will need to join the two tables to answer the remaining questions.
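
To illustrate what an inner join on a shared key does, here is a minimal sketch using two hypothetical toy tables (`left`, `right`, and their values are made up for illustration). Only keys present in both tables survive. The `validate='one_to_one'` argument is an optional safeguard added here, not part of the quiz template.

```python
import pandas as pd

# Hypothetical toy tables, not the quiz dataset
left = pd.DataFrame({
    'company_id': [1, 2, 3],
    'company_name': ['A Corp', 'B Ltd', 'C Inc'],
})
right = pd.DataFrame({
    'company_id': [1, 2, 4],
    'revenue': [100, 200, 400],
})

# Inner join keeps only company_id values found in both tables (1 and 2 here).
# validate='one_to_one' raises an error if a key is duplicated on either side.
merged = pd.merge(
    left=left,
    right=right,
    on='company_id',
    how='inner',
    validate='one_to_one'
)

print(merged.shape)
```

Comparing the merged row count with the original row counts is a quick way to spot keys that failed to match.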

In [7]:
df = pd.merge(
    left=df_companies,
    right=df_info,
    on='company_id',
    how='inner'
)
print(df.shape)  # should print (250, 9)

(250, 9)


In [8]:
df.head(3)

Unnamed: 0,company_id,company_name,country,is_profitable,branches,ceo,industry,revenue,inception_date
0,572907,Smith Group,Armenia,True,1,Joshua Castillo,Software,6445200,2015-08-01
1,859451,Martinez-Cooper,Brazil,True,2,Jenna Perkins,Education,4263696,2012-11-29
2,355366,Estrada-Mccann,Saint Kitts and Nevis,False,1,James Neal,Financial Services,612742,2018-05-18


### Q4. Which company recorded the largest revenue in India?

#### 💡 Hint

- Find rows where the `country` column's value is `'India'`.
- Then, look for the company with the largest `revenue`.
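
The two steps in the hint can be sketched on a hypothetical toy table (`toy` and its values are made up for illustration). The sketch shows the sort-based approach and, as an alternative, `idxmax()`, which returns the index label of the largest value.

```python
import pandas as pd

# Hypothetical toy table, not the quiz dataset
toy = pd.DataFrame({
    'company_name': ['A Corp', 'B Ltd', 'C Inc', 'D LLC'],
    'country': ['India', 'Fiji', 'India', 'Brazil'],
    'revenue': [500, 900, 700, 300],
})

# Step 1: keep only the rows for one country
in_india = toy[toy['country'] == 'India']

# Step 2 (sort-based): largest revenue first, then take the top row
top_name = in_india.sort_values('revenue', ascending=False)['company_name'].iloc[0]

# Step 2 (alternative): idxmax() gives the index label of the largest revenue
alt_name = in_india.loc[in_india['revenue'].idxmax(), 'company_name']

print(top_name, alt_name)
```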

In [9]:
# YOUR CODE BEGINS
df[df['country'] == 'India'].sort_values('revenue', ascending=False).head(1)
# YOUR CODE ENDS

Unnamed: 0,company_id,company_name,country,is_profitable,branches,ceo,industry,revenue,inception_date
234,578730,Newton-Reynolds,India,True,2,Joseph Martinez,Software,6235307,2012-08-02


**Answer**: Newton-Reynolds

### Q5. Out of the companies with an inception year of 2018, which company recorded the largest revenue?

#### 💡 Hint

- Extract the year from the `inception_date` column into a new column named `year`.
- Then, filter rows where `year == 2018`.
- After filtering, sort by `revenue` in descending order. The first row of the returned table contains the answer you're looking for.

#### 🔑 Sample Code

The sample code below converts `my_column` to a datetime type and extracts the year to a new column.

```python
# YOUR CODE BEGINS
# convert my_column to a datetime type
df['my_column'] = pd.to_datetime(df['my_column'])

# extract year into a new column
df['year'] = df['my_column'].dt.year

# write more code below to first filter rows where year == 2018,
# and then find the company with the largest revenue

# YOUR CODE ENDS
```
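
For reference, here is a runnable sketch of the same pattern on a hypothetical toy table (`toy` and its values are made up for illustration). It assumes the date column holds ISO-formatted strings such as `'2018-05-18'`.

```python
import pandas as pd

# Hypothetical toy table, not the quiz dataset
toy = pd.DataFrame({
    'company_name': ['A Corp', 'B Ltd', 'C Inc'],
    'revenue': [500, 900, 700],
    'inception_date': ['2018-05-18', '2016-03-07', '2018-04-03'],
})

# Strings must be converted to datetime before the .dt accessor is available
toy['inception_date'] = pd.to_datetime(toy['inception_date'])
toy['year'] = toy['inception_date'].dt.year

# Keep the 2018 rows, then put the largest revenue first
founded_2018 = toy[toy['year'] == 2018]
top_name = founded_2018.sort_values('revenue', ascending=False)['company_name'].iloc[0]

print(top_name)
```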

In [10]:
# YOUR CODE BEGINS
df['inception_date'] = pd.to_datetime(df['inception_date'])
df['year'] = df['inception_date'].dt.year

df[df['year'] == 2018].sort_values('revenue', ascending=False).head(1)
# YOUR CODE ENDS

Unnamed: 0,company_id,company_name,country,is_profitable,branches,ceo,industry,revenue,inception_date,year
36,348443,Barnes-Weiss,Myanmar,True,1,James Nichols,Software,9704591,2018-05-06,2018


**Answer**: Barnes-Weiss

### Q6. Which country has the **third** largest total revenue?

#### 💡 Hint

- The country with the third largest total revenue has a sum of 25415016.
- Group by `country` and find the sum of `revenue` for each country.
- Sort by `revenue` in descending order and read the country from the third row.

#### 🔑 Sample Code

Replace `my_column` and `another_column` with your own values in the sample code below.

```python
# YOUR CODE BEGINS
df_by_country = df.groupby('my_column', as_index=False).agg({
    'another_column': 'sum'
})

df_by_country.sort_values('another_column', ascending=False).head(3)
# YOUR CODE ENDS
```
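
As a minimal sketch of picking out the third-ranked group directly, the example below uses a hypothetical toy table (`toy` and its values are made up for illustration). After sorting, `iloc[2]` selects the third row by position.

```python
import pandas as pd

# Hypothetical toy table, not the quiz dataset
toy = pd.DataFrame({
    'country': ['Fiji', 'Fiji', 'Malta', 'Brazil', 'Brazil', 'India'],
    'revenue': [100, 200, 500, 150, 250, 50],
})

# Total revenue per country
totals = toy.groupby('country', as_index=False).agg({
    'revenue': 'sum'
})

# Largest total first; positions are zero-based, so iloc[2] is the third row
ranked = totals.sort_values('revenue', ascending=False)
third = ranked.iloc[2]

print(third['country'], third['revenue'])
```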

In [11]:
# YOUR CODE BEGINS
df_by_country = df.groupby('country', as_index=False).agg({
    'revenue': 'sum'
})

df_by_country.sort_values('revenue', ascending=False).head(3)
# YOUR CODE ENDS

Unnamed: 0,country,revenue
118,Saint Kitts and Nevis,30231545
92,Malta,28409854
16,Bosnia and Herzegovina,25415016


**Answer**: Bosnia and Herzegovina

### Q7. Which industry has the lowest average revenue?

#### 💡 Hint

- Group by `industry` and find the mean of `revenue` for each `industry`.
- Your code will be similar to that of the previous question.
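
The grouping step can be sketched on a hypothetical toy table (`toy` and its values are made up for illustration). Because `sort_values()` sorts in ascending order by default, the lowest mean lands in the first row.

```python
import pandas as pd

# Hypothetical toy table, not the quiz dataset
toy = pd.DataFrame({
    'industry': ['Software', 'Software', 'Education', 'Education', 'Manufacturing'],
    'revenue': [100, 300, 50, 150, 400],
})

# Average revenue per industry
means = toy.groupby('industry', as_index=False).agg({
    'revenue': 'mean'
})

# Ascending sort is the default, so the smallest mean comes first
lowest = means.sort_values('revenue').head(1)

print(lowest)
```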

In [12]:
# YOUR CODE BEGINS
df_by_industry = df.groupby('industry', as_index=False).agg({
    'revenue': 'mean'
})

df_by_industry.sort_values('revenue').head(1)
# YOUR CODE ENDS

Unnamed: 0,industry,revenue
3,Manufacturing,4468217.0


**Answer**: Manufacturing

### Q8. Among the companies that are not profitable, which company has the largest number of branches?

#### 💡 Hint

- Filter companies where the value of the `is_profitable` column is `False`.
- Then, sort by `branches` in descending order.
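
A minimal sketch of these two steps on a hypothetical toy table (`toy` and its values are made up for illustration) is shown here. The `~` operator inverts a boolean column, which selects the rows where the flag is `False`.

```python
import pandas as pd

# Hypothetical toy table, not the quiz dataset
toy = pd.DataFrame({
    'company_name': ['A Corp', 'B Ltd', 'C Inc', 'D LLC'],
    'is_profitable': [True, False, False, True],
    'branches': [9, 3, 5, 2],
})

# ~ flips True/False, so this keeps only the unprofitable companies
not_profitable = toy[~toy['is_profitable']]

# Most branches first, then take the top row
top_name = not_profitable.sort_values('branches', ascending=False)['company_name'].iloc[0]

print(top_name)
```

Note that `head(1)` or `iloc[0]` returns a single row even when several companies tie for the largest value, so it can be worth viewing a few more rows to check for ties.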

In [13]:
# YOUR CODE BEGINS
df_no_profit = df[~df['is_profitable']]
df_no_profit.sort_values('branches', ascending=False).head(1)
# YOUR CODE ENDS

Unnamed: 0,company_id,company_name,country,is_profitable,branches,ceo,industry,revenue,inception_date,year
9,296886,Pearson Ltd,France,False,7,Gina Buckley,Agriculture,7896239,2014-04-19,2014


**Answer**: Pearson Ltd