# Introduction
To reduce bias in our estimate of the effect of the causal variable (`foundfam_owned`) on the outcome variable (`management`), we want to condition on (i.e. include in a regression model) every variable that might be a "common cause" of both. Such variables are called "confounders" or "confounding variables". In this notebook, we present arguments for including each confounding variable and perform any preprocessing required to include it.
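
To see why conditioning on a common cause matters, here is a minimal simulation sketch. Everything in it is made up for illustration: `z` is a hypothetical confounder, `d` a hypothetical binary causal variable, and the true effect of `d` on `y` is set to -0.5 by construction. It does not use our dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical confounder: raises both the chance of treatment and the outcome
z = rng.normal(size=n)
d = (z + rng.normal(size=n) > 0).astype(float)

# Outcome: the true effect of d is -0.5 (an assumption of this simulation)
y = -0.5 * d + 1.0 * z + rng.normal(size=n)

# Naive regression: y on an intercept and d only
X_naive = np.column_stack([np.ones(n), d])
beta_naive = np.linalg.lstsq(X_naive, y, rcond=None)[0]

# Adjusted regression: also condition on the confounder z
X_adj = np.column_stack([np.ones(n), d, z])
beta_adj = np.linalg.lstsq(X_adj, y, rcond=None)[0]

print(f"naive estimate of the effect of d:    {beta_naive[1]:.3f}")
print(f"adjusted estimate of the effect of d: {beta_adj[1]:.3f}")
```

In this toy setup the naive estimate lands far from -0.5 because it absorbs part of the effect of `z`, while the adjusted estimate is close to the true value.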

Let's load the data and create the `foundfam_owned` column for our causal variable again.

In [None]:
import pandas as pd
pd.set_option('mode.chained_assignment',None)
df = pd.read_csv("https://osf.io/download/5tse9/")

# Drop firms with unknown ownership, then flag founder/family-owned firms (1) vs. all others (0)
df.dropna(subset=['ownership'],inplace=True)
df['foundfam_owned'] = df['ownership'].str.startswith(('Founder','Family')).astype(int)

# Country
We argue that whether a firm is founder/family-owned could depend on cultural factors in society, and these same cultural factors may affect the quality of management practices. We can't measure these cultural factors directly, but we can use the country in which each firm is located as a "proxy" for them.

Let's see which countries the firms in our dataset are located in.

In [None]:
df['country'].value_counts()

**Question**: Do you think this is a representative sample?

# Industry
We argue that some industries will produce companies that are more/less likely to be founder/family-owned. For example, development of products in some industries will require companies to secure more/less early investment from investors. We also argue that some industries are more/less developed in the quality of their management practices. Consequently, we claim that the type of industry a company is in could be considered a common cause of the causal variable (`foundfam_owned`) and the outcome variable (`management`).

Let's look at the value counts for the `sic` (Standard Industrial Classification) column in the dataset.

In [None]:
df['sic'].value_counts()

To make the categories more "meaningful" we are going to map them to a textual description using the following dictionary.

In [None]:
dict_sic = {
    20:'food',#'Food and Kindred Products',
    21:'tobacco',#'Tobacco Products',
    22:'milled',#'Mill Products',
    23:'apparel',#'Apparel, Finished Products from Fabrics & Similar Materials',
    24:'lumber',#'Lumber and Wood Products, Except Furniture',
    25:'furniture',#'Furniture and Fixtures',
    26:'paper',#'Paper and Allied Products',
    27:'printing',#'Printing, Publishing and Allied Industries',
    28:'chemicals',#'Chemicals and Allied Products',
    29:'petrol',#'Petroleum Refining and Related Industries',
    30:'rubber',#'Rubber and Miscellaneous Plastic Products',
    31:'leather',#'Leather and Leather Products',
    32:'stone',#'Stone, Clay, Glass, and Concrete Products',
    33:'pri_metal',#'Primary Metal Industries',
    34:'fab_metal',#'Fabricated Metal Products',
    35:'machinery',#'Industrial and Commercial Machinery and Computer Equipment',
    36:'electronic',#'Electronic & Other Electrical Equipment & Components',
    37:'transport',#'Transportation Equipment',
    38:'measurement',#'Measuring, Photographic, Medical, & Optical Goods, & Clocks',
    39:'misc_manuf',#'Miscellaneous Manufacturing Industries',
}

df.dropna(subset=['sic'],inplace=True)
df['industry'] = df['sic'].astype(int).replace(dict_sic)
df['industry'].value_counts()
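
Note that `replace` leaves any value that is not a key of the dictionary unchanged, so an unexpected SIC code would stay in the column as a number. As a hedged sketch of one way to check for this, the toy example below uses `map`, which returns `NaN` for unmatched codes and so makes them easy to list. The names `dict_sic_demo` and `sic_demo` and the code `99` are made up for illustration.

```python
import pandas as pd

# Hypothetical subset of the mapping and a toy column containing one unknown code
dict_sic_demo = {20: 'food', 21: 'tobacco'}
sic_demo = pd.Series([20, 21, 99, 20])

# map() gives NaN wherever the code is not a key of the dictionary
mapped = sic_demo.map(dict_sic_demo)

# List the codes that have no textual description
unmapped_codes = sic_demo[mapped.isna()].unique().tolist()
print(unmapped_codes)
```

An empty list would mean every code in the column has a description.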

# Competition
Finally, we argue that the strength of competition in the market may lead to a higher quality of management and, at the same time, may make a firm a more desirable target for acquisition (i.e. less likely to remain founder/family-owned). Consequently, we claim that the strength of competition in the market could be considered a common cause of the causal variable (`foundfam_owned`) and the outcome variable (`management`).

Let's look at the value counts for the `competition` column in the dataset.

In [None]:
df["competition"].value_counts()

Let's clean up the string values and group them into `low`, `med` and `high` so that the groups are more evenly balanced.

In [None]:
dict_comp = {
    '0 competitors':'low',
    '1-4 competitors':'low',
    '5-9 competitors':'med',
    '10+ competitors':'high',
}

df["comp_strength"] = df["competition"].str.strip()
df["comp_strength"] = df["comp_strength"].replace(dict_comp)
df["comp_strength"].value_counts()
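
The three labels have a natural order (low < med < high), which plain strings do not carry. One optional way to record that order, shown here as a sketch on a made-up toy column rather than on our data, is an ordered categorical.

```python
import pandas as pd

# Toy values standing in for the comp_strength column (illustrative only)
comp_demo = pd.Series(['low', 'high', 'med', 'low'])

# Declare the categories explicitly so that low < med < high
comp_cat = pd.Categorical(comp_demo, categories=['low', 'med', 'high'], ordered=True)

print(comp_cat.categories.tolist())
print(comp_cat.codes.tolist())  # integer position of each value in the category order
```

With the order declared, sorting and comparisons follow low, med, high instead of alphabetical order.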

# Save data
Finally, let's save the data so we can load it into another notebook.

In [None]:
sample = df[['management','foundfam_owned','country','industry','comp_strength']]
sample.to_csv('../data/sample_MGMT.csv', index=False)
sample.head()
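
The categorical confounders (`country`, `industry`, `comp_strength`) are stored as text, so they need a numeric encoding before they can enter a regression model. The sketch below shows one common option, dummy (one-hot) encoding with `pd.get_dummies`, on a small made-up table; the values in `toy` are invented, and this is not necessarily the encoding used in later notebooks.

```python
import pandas as pd

# Made-up rows with the same column names as our sample (values are illustrative)
toy = pd.DataFrame({
    'management': [3.1, 2.4, 3.8, 2.9],
    'foundfam_owned': [1, 1, 0, 0],
    'country': ['Italy', 'Brazil', 'Italy', 'Sweden'],
    'comp_strength': ['low', 'high', 'med', 'high'],
})

# One 0/1 column per category; drop_first removes one reference category
# per variable to avoid perfect collinearity with the intercept
X = pd.get_dummies(
    toy.drop(columns='management'),
    columns=['country', 'comp_strength'],
    drop_first=True,
    dtype=int,
)
print(X.columns.tolist())
```

Each remaining dummy column is then interpreted relative to the dropped reference category.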

# Exercise
Expand the variables in the sample to include:
* age of firm
* number of employees
* proportion of employees with college education

Preprocess them in a way you think suitable.

In [None]:
# (SOLUTION)
cols = ['management','foundfam_owned','country','industry','comp_strength','firmage','emp_firm','degree_t']
# dropna returns a new DataFrame, so we avoid modifying a slice of df in place
sample = df[cols].dropna(subset=['firmage','emp_firm','degree_t'])
sample.info()

# 9953 rows => 7735
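
Dropping missing rows is only one possible preprocessing step. Firm size is typically very skewed, so another option, sketched here on made-up numbers, is to take the natural log of the number of employees. The column name `ln_emp` is our own invention for this example.

```python
import numpy as np
import pandas as pd

# Made-up employee counts standing in for the emp_firm column
toy = pd.DataFrame({'emp_firm': [50, 200, 5000]})

# The log compresses the long right tail of the size distribution
toy['ln_emp'] = np.log(toy['emp_firm'])
print(toy)
```

Whether a log transform is suitable depends on the distribution you see in the real data, so check a histogram first.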