# Introduction
In this practical session we will investigate whether founder/family owned companies are better or worse managed than other firms, on average, *because* of their ownership. In other words, we want to investigate whether the fact that a company is owned by its founder, or by the founder's family members, has an effect on the quality of management.

The result may be useful if a founder/family owned company is deciding whether to change ownership or not, or for an investor who is considering investing in a founder/family owned company.


The dataset we use is a cross-section of 10,282 manufacturing firms from 24 countries, collected between 2004 and 2015 by the [World Management Survey](https://worldmanagementsurvey.org/data/dwms-public-sector/wms-methodology/).

Let's load the data.

In [None]:
import pandas as pd
pd.set_option('mode.chained_assignment',None)
df = pd.read_csv("https://osf.io/download/5tse9/")
df.head()

In addition to the dataset, we can load some *meta-data* which explains each variable.

In [None]:
url = 'https://github.com/data-analytics-in-business/gabor-management-case-study/raw/main/data/VARIABLES_wms.csv'
df_variables = pd.read_csv(url)
df_variables.head(5)

# Outcome Variable
Our outcome variable will be the *management* (`management`) score, which is the average of 18 scores collected using questions that attempt to measure the quality of specific management practices. Each score is measured on a scale of 1 to 5, where 1 is worst and 5 is best.

Let's `describe` the distribution of management scores in the dataset.

In [None]:
df['management'].describe()

**Question**: why do you think the max management score is not 5?

# Causal Variable
We want our causal variable to indicate whether the company was owned by founder/family or not.

Let's begin by looking at the counts of the values in the `ownership` column.

In [None]:
df['ownership'].value_counts()

We see there are 11 categories in total, some indicating that the company was founder/family owned and some indicating other types of owners (e.g., dispersed shareholders, private individuals, government, etc.).

From these categories, let's create a binary variable that is 1 if the firm is founder/family owned and 0 otherwise.

To do this, we will drop any rows where `ownership` is missing and use the `startswith` string method to check if each `ownership` value starts with `Founder`/`Family` or not, and map to an `int` value so the resulting values are 0 or 1.

In [None]:
df.dropna(subset=['ownership'],inplace=True)
df['foundfam_owned'] = df['ownership'].str.startswith(('Founder','Family')).astype(int)
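
To see what these two steps do row by row, here is a minimal sketch on a made-up `ownership` column. The labels and the `toy` DataFrame are illustrative assumptions, not the actual categories in the dataset.

```python
import pandas as pd

# Hypothetical ownership labels, for illustration only
toy = pd.DataFrame({
    "ownership": [
        "Founder owned, founder CEO",
        "Family owned, external CEO",
        "Dispersed Shareholders",
        None,
        "Government",
    ]
})

# Drop missing values first: startswith returns a missing value for them,
# which cannot be cast to int
toy = toy.dropna(subset=["ownership"])

# startswith accepts a tuple of prefixes; casting True/False to int gives 1/0
is_foundfam = toy["ownership"].str.startswith(("Founder", "Family"))
toy = toy.assign(foundfam_owned=is_foundfam.astype(int))
print(toy)
```

The row with a missing `ownership` value is removed, and only the rows whose label begins with one of the two prefixes get a 1.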

Let's look at the counts of the values in `foundfam_owned` to see how many examples we have in each of our created categories.

In [None]:
df['foundfam_owned'].value_counts()
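
If shares are easier to compare than raw counts, `value_counts` also accepts `normalize=True`. A small sketch on made-up flags (the `flags` Series is an assumption for illustration, not data from the survey):

```python
import pandas as pd

# Made-up 0/1 flags, for illustration only
flags = pd.Series([1, 0, 0, 1, 0], name="foundfam_owned")

counts = flags.value_counts()                 # number of rows per category
shares = flags.value_counts(normalize=True)   # fraction of rows per category
print(counts)
print(shares)
```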

**Question**: Do we have more or fewer founder/family owned companies in the dataset than companies with other types of ownership?

# Exercise
Calculate the mean of the `management` scores for firms which are and are not founder/family owned.

Is the mean `management` score higher/lower for firms which are founder/family owned? 

What's the difference between the two means? 

Is the difference a good estimate of the effect of being founder/family owned on management quality? Why?

In [None]:
# (SOLUTION)
df.groupby("foundfam_owned")["management"].mean()
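
The difference between the two means can be read off the grouped result by indexing it with the group labels. A minimal sketch on invented numbers (the `toy` scores below are made up for illustration and say nothing about the real data):

```python
import pandas as pd

# Invented scores, for illustration only
toy = pd.DataFrame({
    "foundfam_owned": [1, 1, 1, 0, 0, 0],
    "management": [2.5, 3.0, 2.0, 3.5, 3.0, 4.0],
})

# Mean management score per group, indexed by the 0/1 flag
group_means = toy.groupby("foundfam_owned")["management"].mean()

# Founder/family owned mean minus the mean of all other firms
diff = group_means.loc[1] - group_means.loc[0]
print(group_means)
print(f"difference in means: {diff:.2f}")
```

Applying the same two steps to `df` gives the raw difference in the dataset. Whether that number can be interpreted as the effect of ownership is the subject of the last question above.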