# Introduction
To reduce bias in our estimate of the effect of the causal variable (`foundfam_owned`) on the outcome variable (`management`), we want to condition on (i.e. include in a regression model) all variables that might be a "common cause" of both. Such variables are called "confounders" or "confounding variables". In this notebook, we present arguments for including three candidate confounders (country, industry and competition) and perform the preprocessing required to include them.
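
To see why conditioning on a common cause matters, here is a minimal simulation sketch. It does not use the management dataset: the confounder `z`, the effect sizes and the helper `ols_coefs` are all made up for illustration. The true effect of the binary treatment is set to 1.0, and we compare a regression that omits the confounder with one that includes it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical confounder that drives both the treatment and the outcome.
z = rng.normal(size=n)
# Binary "treatment" that is more likely when z is high.
treat = (z + rng.normal(size=n) > 0).astype(float)
# Outcome: the true treatment effect is 1.0; z adds 2.0 per unit.
y = 1.0 * treat + 2.0 * z + rng.normal(size=n)


def ols_coefs(columns, target):
    """Least-squares coefficients for target ~ intercept + columns."""
    X = np.column_stack([np.ones(len(target)), *columns])
    coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coefs


naive = ols_coefs([treat], y)[1]        # confounder omitted
adjusted = ols_coefs([treat, z], y)[1]  # confounder included

print(f"naive estimate:    {naive:.2f}")
print(f"adjusted estimate: {adjusted:.2f}")
```

In this toy setup the naive estimate should land well above 1.0, because it absorbs part of the effect of `z`, while the adjusted estimate should be close to the true value.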

Let's load the data and create the `foundfam_owned` column for our causal variable again.

In [1]:
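import pandas as pd

# Silence pandas' chained-assignment warning for the column assignments below.
pd.set_option('mode.chained_assignment', None)
df = pd.read_csv("https://osf.io/download/5tse9/")

# Drop firms with unknown ownership, then flag founder- or family-owned firms.
df.dropna(subset=['ownership'], inplace=True)
df['foundfam_owned'] = df['ownership'].str.startswith(('Founder', 'Family')).astype(int)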

# Country
We argue that whether a firm is founder/family-owned could depend on cultural factors in society, and that these same cultural factors may affect the quality of management practices. We can't measure these cultural factors directly, but we can use the country in which each firm is located as a "proxy" for them.

Let's see which countries the firms in our dataset are located in.

In [2]:
df['country'].value_counts()

United States          947
Great Britain          884
China                  872
Brazil                 814
India                  707
France                 484
Australia              451
Italy                  436
Germany                424
Canada                 418
Greece                 416
Argentina              414
Chile                  410
Mexico                 404
Singapore              396
Turkey                 332
Sweden                 258
Poland                 238
Spain                  214
Portugal               193
Republic of Ireland    161
New Zealand            150
Japan                  124
Northern Ireland       119
Name: country, dtype: int64
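
Raw counts like the ones above can be easier to compare as shares of the sample. A minimal sketch, using a made-up toy series in place of `df['country']`:

```python
import pandas as pd

# Toy stand-in for df['country']; the values are made up for illustration.
countries = pd.Series(["United States", "United States", "China", "India"])

# normalize=True returns proportions instead of raw counts.
shares = countries.value_counts(normalize=True)
print(shares)
```

The same call on the real column would give each country's share of the firms in the dataset.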

**Question**: Do you think this is a representative sample?

# Industry
We argue that some industries will produce companies that are more/less likely to be founder/family-owned. For example, development of products in some industries will require companies to secure more/less early investment from investors. We also argue that some industries are more/less developed in the quality of their management practices. Consequently, we claim that the type of industry a company is in could be considered a common cause of the causal variable (`foundfam_owned`) and the outcome variable (`management`).

Let's look at the value counts for the `sic` (Standard Industrial Classification) column in the dataset.

In [3]:
df['sic'].value_counts()

20.0    1283
35.0     995
28.0     943
34.0     829
36.0     724
30.0     602
37.0     572
22.0     510
33.0     502
32.0     445
26.0     430
27.0     365
23.0     362
38.0     328
24.0     317
25.0     265
39.0     249
31.0     133
29.0      81
21.0      18
Name: sic, dtype: int64

To make the categories more "meaningful" we are going to map them to a textual description using the following dictionary.

In [4]:
dict_sic = {
    20:'food',#'Food and Kindred Products',
    21:'tobacco',#'Tobacco Products',
    22:'milled',#'Mill Products',
    23:'apparel',#'Apparel, Finished Products from Fabrics & Similar Materials',
    24:'lumber',#'Lumber and Wood Products, Except Furniture',
    25:'furniture',#'Furniture and Fixtures',
    26:'paper',#'Paper and Allied Products',
    27:'printing',#'Printing, Publishing and Allied Industries',
    28:'chemicals',#'Chemicals and Allied Products',
    29:'petrol',#'Petroleum Refining and Related Industries',
    30:'rubber',#'Rubber and Miscellaneous Plastic Products',
    31:'leather',#'Leather and Leather Products',
    32:'stone',#'Stone, Clay, Glass, and Concrete Products',
    33:'pri_metal',#'Primary Metal Industries',
    34:'fab_metal',#'Fabricated Metal Products',
    35:'machinery',#'Industrial and Commercial Machinery and Computer Equipment',
    36:'electronic',#'Electronic & Other Electrical Equipment & Components',
    37:'transport',#'Transportation Equipment',
    38:'measurement',#'Measuring, Photographic, Medical, & Optical Goods, & Clocks',
    39:'misc_manuf',#'Miscellaneous Manufacturing Industries',
}

df.dropna(subset=['sic'],inplace=True)
df['industry'] = df['sic'].astype(int).replace(dict_sic)
df['industry'].value_counts()

food           1283
machinery       995
chemicals       943
fab_metal       829
electronic      724
rubber          602
transport       572
milled          510
pri_metal       502
stone           445
paper           430
printing        365
apparel         362
measurement     328
lumber          317
furniture       265
misc_manuf      249
leather         133
petrol           81
tobacco          18
Name: industry, dtype: int64
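
One thing to be aware of with a dictionary lookup like this is what happens to codes that are missing from the dictionary. The sketch below uses toy codes and a deliberately incomplete mapping (both hypothetical) to contrast `Series.replace`, which leaves unmapped values unchanged, with `Series.map`, which turns them into `NaN`.

```python
import pandas as pd

# Toy codes and a deliberately incomplete mapping (hypothetical values).
codes = pd.Series([20, 21, 99])
labels = {20: "food", 21: "tobacco"}

replaced = codes.replace(labels)  # the unmapped 99 is kept as-is
mapped = codes.map(labels)        # the unmapped 99 becomes NaN

# List the codes that have no entry in the dictionary.
unmapped = codes[~codes.isin(list(labels))].unique().tolist()
print(unmapped)
```

Here every `sic` code in the data has an entry in `dict_sic`, so the two approaches would give the same labels; a check like `unmapped` is simply a cheap safeguard if the data changes.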

# Competition
Finally, we argue that the strength of competition in the market may lead to a higher quality of management and, at the same time, may make a firm a more desirable target for acquisition (i.e. less likely to remain founder/family-owned). Consequently, we claim that the strength of competition in the market could be considered a common cause of the causal variable (`foundfam_owned`) and the outcome variable (`management`).

Let's look at the value counts for the `competition` column in the dataset.

In [5]:
df["competition"].value_counts()

10+ competitors      5532
  1-4 competitors    2238
 5-9 competitors     2115
   0 competitors       67
Name: competition, dtype: int64

Let's strip the stray whitespace from the string values and group them into `low`, `med` and `high`, which makes the groups more evenly balanced.

In [6]:
dict_comp = {
    '0 competitors':'low',
    '1-4 competitors':'low',
    '5-9 competitors':'med',
    '10+ competitors':'high',
}

df["comp_strength"] = df["competition"].str.strip()
df["comp_strength"] = df["comp_strength"].replace(dict_comp)
df["comp_strength"].value_counts()

high    5532
low     2305
med     2115
Name: comp_strength, dtype: int64
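
The three levels have a natural order that plain strings do not capture (alphabetically, `high` sorts before `low`). If that order matters later, one option is an ordered categorical. This is an optional sketch on a toy series, not a step the notebook relies on:

```python
import pandas as pd

# Toy stand-in for df["comp_strength"] (values chosen for illustration).
strength = pd.Series(["high", "low", "med", "low"])

# Declare the levels in their natural order rather than alphabetical order.
levels = pd.CategoricalDtype(categories=["low", "med", "high"], ordered=True)
strength_ordered = strength.astype(levels)

print(strength_ordered.sort_values().tolist())
print(strength_ordered.max())
```

Note that the ordering is not preserved when the column is written to CSV, so it would need to be re-applied after loading.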

# Save data
Finally, let's save the data so we can load it into another notebook.

In [7]:
sample = df[['management','foundfam_owned','country','industry','comp_strength']]
sample.to_csv('../data/sample_MGMT.csv', index=False)
sample.head()

   management  foundfam_owned        country     industry  comp_strength
0         3.0               0  United States  measurement            low
1    4.444445               0  United States    chemicals           high
2    2.666667               0  United States    fab_metal           high
3    4.388889               1  United States   electronic           high
4    4.833333               0  United States    machinery           high
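
Since the confounders are categorical, including them in a regression model usually means turning each one into 0/1 indicator columns. The sketch below shows one common way to do that with `pd.get_dummies` on a few made-up rows that reuse the sample's column names; it is an assumption about how the saved data might be used, not necessarily what the follow-on notebook does.

```python
import pandas as pd

# Toy rows with the same column names as the saved sample (values made up).
toy = pd.DataFrame({
    "management": [3.0, 4.4, 2.7, 3.9],
    "foundfam_owned": [0, 0, 1, 1],
    "country": ["United States", "China", "China", "India"],
    "comp_strength": ["low", "high", "med", "high"],
})

# One 0/1 column per category level; drop_first=True omits one reference
# level per variable so the columns are not collinear with an intercept.
design = pd.get_dummies(
    toy, columns=["country", "comp_strength"], drop_first=True, dtype=int
)
print(design.columns.tolist())
```

With `drop_first=True`, the coefficient on each indicator is read relative to the omitted level of that variable.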


# Exercise
Expand the variables in the sample to include:
* age of firm
* number of employees
* proportion of employees with college education

Preprocess them in a way you think suitable.

In [8]:
# (SOLUTION)