# Introduction
In this practical session we will investigate whether founder/family owned companies are better or worse managed than other firms, on average, *because* of their ownership. In other words, we want to investigate whether the fact that a company is owned by its founder, or by the founder's family members, has an effect on the quality of management.

The result may be useful to a founder/family owned company that is deciding whether to change ownership, or to an investor who is considering investing in such a company.


The dataset we analyse is a cross-section of 10,282 manufacturing firms from 24 countries, collected between 2004 and 2015 by the [World Management Survey](https://worldmanagementsurvey.org/data/dwms-public-sector/wms-methodology/).

Let's load the data.

In [1]:
import pandas as pd
pd.set_option('mode.chained_assignment',None)
df = pd.read_csv("https://osf.io/download/5tse9/")
df.head()

Unnamed: 0,firmid,wave,cty,country,sic,management,operations,monitor,target,people,...,aa_196,aa_197,aa_198,aa_199,aa_200,aa_201,aa_202,aa_203,aa_204,aa_205
0,1,2010,us,United States,38.0,3.0,2.0,2.8,3.6,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,2004,us,United States,28.0,4.444445,4.5,4.6,4.4,4.333333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,2004,us,United States,34.0,2.666667,2.5,2.4,2.4,3.166667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,2004,us,United States,36.0,4.388889,3.0,4.6,4.6,4.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,2004,us,United States,35.0,4.833333,5.0,4.8,4.8,4.833333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In addition to the dataset, we can load some *meta-data* which explains each variable.

In [2]:
df_variables = pd.read_csv("../data/VARIABLES_wms.csv")
df_variables.head(5)

Unnamed: 0,variable,type,information
0,firmid,numeric,Unique firm ID
1,wave,numeric,Wave when interview was conducted
2,country,string,Country in which plant is located
3,management,numeric,Average of all management questions
4,operations,numeric,Average of lean1 & lean2
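
The meta-data can be used as a lookup table for variable descriptions. The sketch below rebuilds only the five rows shown in the preview above into a small frame (the name `meta` is hypothetical) and looks up one description; with the full file, the same pattern should work on `df_variables`, assuming it has the `variable` and `information` columns shown.

```python
import pandas as pd

# Rows copied from the metadata preview above; `meta` is an illustrative name
meta = pd.DataFrame({
    "variable": ["firmid", "wave", "country", "management", "operations"],
    "type": ["numeric", "numeric", "string", "numeric", "numeric"],
    "information": [
        "Unique firm ID",
        "Wave when interview was conducted",
        "Country in which plant is located",
        "Average of all management questions",
        "Average of lean1 & lean2",
    ],
})

# Index by variable name so a description can be looked up directly
info = meta.set_index("variable")["information"]
print(info["management"])  # Average of all management questions
```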


# Outcome Variable
Our outcome variable will be the *management* (`management`) score, which is the average of 18 scores collected using questions that attempt to measure the quality of specific management practices. Each score is measured on a scale of 1 to 5, where 1 is worst and 5 is best.

Let's `describe` the distribution of management scores in the dataset.

In [3]:
df['management'].describe()

count    10282.000000
mean         2.883423
std          0.656546
min          1.000000
25%          2.444444
50%          2.888889
75%          3.333333
max          4.888889
Name: management, dtype: float64

**Question**: Why do you think the max management score is not 5?

# Causal Variable
We want our causal variable to indicate whether or not the company is founder/family owned.

Let's begin by looking at the counts of the values in the `ownership` column.

In [4]:
df['ownership'].value_counts()

Dispersed Shareholders            2745
Private Individuals               2118
Founder owned, founder CEO        1856
Family owned, family CEO          1755
Other                              527
Private Equity/Venture Capital     353
Family owned, external CEO         346
Founder owned, external CEO        300
Government                         170
Family owned, CEO unknown           55
Founder owned, CEO unknown          41
Name: ownership, dtype: int64

We see there are 11 categories in total, some indicating that the company is founder/family owned and some indicating other types of owners (e.g., dispersed shareholders, private individuals, government).

From these categories, let's create a binary variable that is 1 if the firm is founder/family owned and 0 otherwise.

To do this, we will drop any rows where `ownership` is missing, use the `startswith` string method to check whether each `ownership` value starts with `Founder` or `Family`, and cast the resulting booleans to `int` so that the values are 0 or 1.

In [5]:
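# Drop rows where the ownership category is missing
df = df.dropna(subset=['ownership'])
# 1 if ownership starts with 'Founder' or 'Family', 0 otherwise
df['foundfam_owned'] = df['ownership'].str.startswith(('Founder','Family')).astype(int)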
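
To see what these two steps do in isolation, here is a minimal sketch on a made-up frame (the `toy` data is hypothetical and not taken from the survey). It assumes a recent pandas version in which `str.startswith` accepts a tuple of prefixes, as the cell above does.

```python
import pandas as pd

# Hypothetical ownership values, including one missing entry
toy = pd.DataFrame({"ownership": [
    "Founder owned, founder CEO",
    "Family owned, external CEO",
    "Dispersed Shareholders",
    None,
    "Government",
]})

# Remove the row with a missing ownership value first
toy = toy.dropna(subset=["ownership"])

# True where the label starts with either prefix, then cast to 0/1
toy["foundfam_owned"] = (
    toy["ownership"].str.startswith(("Founder", "Family")).astype(int)
)

print(toy["foundfam_owned"].tolist())  # [1, 1, 0, 0]
```

Dropping the missing rows first keeps the `int` cast simple, because every remaining value is a plain `True` or `False`.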

Let's look at the counts of the values in `foundfam_owned` to see how many examples we have in each of our created categories.

In [6]:
df['foundfam_owned'].value_counts()

0    5913
1    4353
Name: foundfam_owned, dtype: int64

**Question**: Do we have more or fewer founder/family owned companies in the dataset than companies with other types of ownership?

# Exercise
Calculate the mean of the `management` scores for firms which are and are not founder/family owned.

Is the mean `management` score higher/lower for firms which are founder/family owned? 

What's the difference between the two means? 

Is the difference a good estimate of the effect of being founder/family owned on management quality? Why?

In [7]:
# (SOLUTION)
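
As a hint rather than a solution, the following is a minimal sketch of comparing group means with `groupby`, run on made-up numbers (the `toy` values are assumptions chosen only for illustration, not survey results). The same pattern could be applied to `df` using the `management` and `foundfam_owned` columns.

```python
import pandas as pd

# Hypothetical scores for four firms; not taken from the dataset
toy = pd.DataFrame({
    "management": [2.0, 3.0, 3.5, 4.5],
    "foundfam_owned": [1, 1, 0, 0],
})

# Mean management score within each ownership group
group_means = toy.groupby("foundfam_owned")["management"].mean()

# Difference: founder/family owned minus all other firms
diff = group_means.loc[1] - group_means.loc[0]

print(group_means)
print(diff)  # -1.5 for these toy numbers
```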
