# Lecture 11 - Part II

## Feature Engineering <a class="anchor" id="TOC"></a>    
                                                     
                                                     
### Bisnode Data                                   
                                                     

- imputing:
    - A: replacing with the mean or median
    - B: using outside knowledge to replace values
    - C: introducing a new value (only for categorical variables)
- log transformation adjustment: log(0) is -inf, so adjust numerically
- creating dummy variable(s) from multiple conditions, using the shift() function
- randomly sampling large data for visualization
- growth rate as a log difference, using the shift() function
- winsorizing

___

Import packages

In [None]:
import pandas as pd
import numpy as np
from plotnine import *
import warnings

%matplotlib inline
warnings.filterwarnings("ignore")

Using bisnode data for firm exit

In [None]:
bisnode = pd.read_csv("https://osf.io/3qyut/download")

bisnode.head()

Sample selection\
drop variables with many NAs and drop the year 2016

In [None]:
bisnode = bisnode.drop(
    ["COGS", "finished_prod", "net_dom_sales", "net_exp_sales", "wages"], axis=1
).loc[bisnode["year"] != 2016]

Add all missing year and comp_id combinations\
(combinations that were originally missing will have NAs in all other columns)

In [None]:
bisnode = (
    bisnode.set_index(["year", "comp_id"])
    .unstack(fill_value=np.nan)
    .stack(dropna=False)
    .reset_index()
)
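
As a rough illustration of what "adding all missing combinations" means, the sketch below balances a tiny made-up panel by reindexing onto the full year-by-firm grid. The `toy` data and the names `full_index` and `toy_full` are hypothetical and only serve the example; this is an alternative way to reach the same kind of result, not the code used above.

```python
import pandas as pd

# Toy unbalanced panel (made-up numbers): firm 2 has no row for 2011
toy = pd.DataFrame(
    {
        "year": [2010, 2011, 2010],
        "comp_id": [1, 1, 2],
        "sales": [100.0, 120.0, 50.0],
    }
)

# Every (year, comp_id) pair that should exist in a balanced panel
full_index = pd.MultiIndex.from_product(
    [sorted(toy["year"].unique()), sorted(toy["comp_id"].unique())],
    names=["year", "comp_id"],
)

# Reindexing adds the missing pairs; their other columns are filled with NaN
toy_full = toy.set_index(["year", "comp_id"]).reindex(full_index).reset_index()
print(toy_full)
```

The added row for firm 2 in 2011 carries NaN in `sales`, which is exactly how the originally missing combinations show up in the full data.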

### Imputing

A) Replacing with mean or median:

   The number of employees in a firm is a noisy measure with many missing values.\
   Replace the missing values with the mean or median.\
   Also add a flag variable for the imputed values (it needs to be included in the model!)

In [None]:
# mean
bisnode["labor_avg_mod"] = np.where(
    bisnode["labor_avg"].isnull(),
    np.nanmean(bisnode["labor_avg"]),
    bisnode["labor_avg"],
)
# median
bisnode["labor_med_mod"] = np.where(
    bisnode["labor_avg"].isnull(),
    np.nanmedian(bisnode["labor_avg"]),
    bisnode["labor_avg"],
)
# flag
bisnode["flag_miss_labor_avg"] = bisnode["labor_avg"].isnull()
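
To see what this kind of imputation does to the distribution, here is a minimal sketch on a made-up series. The values and the names `labor`, `flag_miss`, `labor_mean_imp` and `labor_med_imp` are assumptions for illustration, and `fillna` is used as a shorter equivalent of the `np.where` pattern.

```python
import numpy as np
import pandas as pd

# Made-up employment figures with one missing value
labor = pd.Series([1.0, 2.0, np.nan, 9.0], name="labor_avg")

# Flag the imputed observations before filling them in
flag_miss = labor.isnull().astype(int)

# Mean and median are computed on the non-missing values only
labor_mean_imp = labor.fillna(labor.mean())
labor_med_imp = labor.fillna(labor.median())

# Mean imputation keeps the mean unchanged but shrinks the spread
print(labor.mean(), labor_mean_imp.mean())
print(labor.std(), labor_mean_imp.std())
```

The shrinking standard deviation is one reason to keep the flag variable: the model can then treat imputed observations separately.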

#### Task
write `Nmiss` as a custom function for the summary table (it is passed to `agg`) and check the \
mean, median, sd, N and Nmiss for labor_avg, labor_avg_mod, labor_med_mod

In [None]:
def Nmiss(x):
    return x.isnull().sum()

Check how the statistics changed, and discuss!

In [None]:
bisnode.filter(["labor_avg", "labor_avg_mod", "labor_med_mod"]).agg(
    ["mean", "median", "std", "count", Nmiss]
).T

### Imputing:

B) Using outside knowledge to replace values:

Negative sales should not happen, so we overwrite them with a small positive value: 1

In [None]:
bisnode["sales"].describe()

In [None]:
bisnode["sales"] = np.where(bisnode["sales"] < 0, 1, bisnode["sales"])

In [None]:
bisnode["sales"].describe()

### Imputing:

C) Categorical variables

Simplify some industry category codes and set missing values to 99

In [None]:
bisnode["ind2_cat"] = np.where(bisnode["ind2"] > 56, 60, bisnode["ind2"])
bisnode["ind2_cat"] = np.where(bisnode["ind2"] < 26, 20, bisnode["ind2_cat"])
bisnode["ind2_cat"] = np.where(
    (bisnode["ind2"] < 55) & (bisnode["ind2"] > 35), 40, bisnode["ind2_cat"]
)
bisnode["ind2_cat"] = np.where(bisnode["ind2"] == 31, 30, bisnode["ind2_cat"])
bisnode["ind2_cat"] = np.where(bisnode["ind2"].isnull(), 99, bisnode["ind2_cat"])

In [None]:
bisnode["ind2_cat"].value_counts().sort_index()
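
The chain of `np.where` calls can also be written as a single `np.select`, which some readers find easier to audit. The sketch below is an alternative formulation on made-up codes, not the code used above; the names `ind2`, `conditions`, `choices` and `ind2_cat` are local to the example. The conditions do not overlap, so their order does not matter here.

```python
import numpy as np
import pandas as pd

# Made-up industry codes, including a missing one
ind2 = pd.Series([10, 26, 31, 40, 55, 56, 70, np.nan])

# Same recoding rules as the np.where chain
conditions = [
    ind2 > 56,
    ind2 < 26,
    (ind2 > 35) & (ind2 < 55),
    ind2 == 31,
    ind2.isnull(),
]
choices = [60, 20, 40, 30, 99]

# Codes that match no rule keep their original value
ind2_cat = pd.Series(
    np.select(conditions, choices, default=ind2.to_numpy()), index=ind2.index
)
print(ind2_cat.tolist())
```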

___

Adjusting for non-positive sales in the log transformation:

In [None]:
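# np.where evaluates np.log() for every row, so log(0) = -inf is computed and then discarded;
# rows with zero or missing sales get 0 (NaN > 0 evaluates to False)
bisnode["ln_sales"] = np.where(bisnode["sales"] > 0, np.log(bisnode["sales"]), 0)
bisnode["sales_mil"] = bisnode["sales"] / 10**6
bisnode["sales_mil_log"] = np.where(bisnode["sales"] > 0, np.log(bisnode["sales_mil"]), 0)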
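
A possible variant, sketched on made-up values, masks non-positive sales before calling `np.log`, so the logarithm of zero is never computed. Unlike the `np.where` calls above, this version keeps missing sales as missing rather than setting them to 0; that difference is a deliberate assumption of the sketch, and the names `sales`, `ln_sales` and `ln_sales_adj` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sales: zero, positive, and missing
sales = pd.Series([0.0, 1.0, 1_000_000.0, np.nan])

# Replace non-positive values with NaN first, then take logs
ln_sales = np.log(sales.where(sales > 0))

# Zero sales are set to 0 in logs; missing sales stay missing
ln_sales_adj = ln_sales.mask(sales == 0, 0.0)
print(ln_sales_adj)
```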

***Creating 'status_alive' variable to decide if firm exists or not***

Generate status_alive: if sales are larger than zero and not NA, the firm is alive

In [None]:
bisnode["status_alive"] = np.where(
    (bisnode["sales"] > 0) & (bisnode["sales"].notnull()), 1, 0
)

A firm defaults in two years if it has sales this year but no sales two years later

In [None]:
bisnode = bisnode.sort_values(by=["comp_id","year"])

# shift(-2) is the lead: status_alive of the same firm two years later
bisnode["default"] = bisnode.groupby("comp_id")["status_alive"].transform(
    lambda x: (x == 1) & (x.shift(-2) == 0)
).astype(int)
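
"Two years later" is a lead, which in pandas corresponds to a negative shift within each firm. The sketch below checks this on a made-up balanced panel; `toy` and `alive_in_2y` are hypothetical names used only here.

```python
import pandas as pd

# Made-up panel: firm 1 stops selling from 2012 on, firm 2 stays alive
toy = pd.DataFrame(
    {
        "comp_id": [1, 1, 1, 1, 2, 2, 2, 2],
        "year": [2010, 2011, 2012, 2013] * 2,
        "status_alive": [1, 1, 0, 0, 1, 1, 1, 1],
    }
).sort_values(["comp_id", "year"])

# Lead by two rows within each firm (two years, since the panel is balanced)
alive_in_2y = toy.groupby("comp_id")["status_alive"].shift(-2)

# Alive now and not alive two years later; the last two years have no lead
# (NaN == 0 is False), so they get 0
toy["default"] = ((toy["status_alive"] == 1) & (alive_in_2y == 0)).astype(int)
print(toy)
```

Firm 1 is flagged in 2010 and 2011, the two years in which it is still alive but will have no sales two years on.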

Select years up to and including 2013

In [None]:
bisnode = bisnode.loc[bisnode["year"]<= 2013]

To speed things up, take a random sample of 5,000 companies

In [None]:
comp_id_f = bisnode.drop_duplicates(subset=["comp_id"]).sample(5000, random_state = 20123123)["comp_id"]

In [None]:
bisnode_sample = bisnode.loc[lambda x: x["comp_id"].isin(comp_id_f)]
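
The sampling is done at the firm level, so each selected firm keeps its whole time series. A minimal sketch of the same idea on made-up data follows; the names `firm_ids`, `sampled_ids` and `toy_sample` and the seed are assumptions. The `.copy()` at the end is optional and makes the sample an independent data frame, which is convenient when new columns are added to it later.

```python
import pandas as pd

# Made-up panel: 4 firms, 3 years each
toy = pd.DataFrame(
    {
        "comp_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
        "year": [2010, 2011, 2012] * 4,
    }
)

# Sample firm identifiers, not rows
firm_ids = pd.Series(toy["comp_id"].unique())
sampled_ids = firm_ids.sample(n=2, random_state=42)

# Keep every year of the sampled firms
toy_sample = toy.loc[toy["comp_id"].isin(sampled_ids)].copy()
print(toy_sample)
```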

### Numeric vs Factor Representation

Numeric representation (good)

In [None]:


(
    ggplot(bisnode_sample, aes(x="sales_mil_log", y="default"))
    + geom_point(size=2, alpha=0.3, color="blue")
    # I() protects the square; a bare x**2 in a formula reduces to x
    + geom_smooth(method="lm", formula="y ~ x + I(x**2)", color="black", se=False, size=1)
    + geom_smooth(method="loess", se=False, colour="red", size=1.5)
    + labs(x="sales_mil_log", y="default")
    + theme_bw()
)

#### Task
Convert default to a factor variable and plot!\
What is the problem? Is it a bad idea to convert to a factor?

In [None]:
bisnode_sample["default_factor"] = bisnode_sample["default"].astype("category")

(
    ggplot(bisnode_sample, aes(x="sales_mil_log", y="default_factor"))
    + geom_point(size=2, alpha=0.3, color="blue")
    + geom_smooth(method="lm", formula="y ~ x + I(x**2)", color="black", se=False, size=1)
    + geom_smooth(method="loess", se=False, colour="red", size=1.5)
    + labs(x="sales_mil_log", y="default")
    + theme_bw()
)

Growth (%) in sales\
Take the lags, but make sure they come from the same company!

In [None]:
bisnode["d1_sales_mil_log"] = bisnode.groupby("comp_id")["sales_mil_log"].transform(
    lambda x: x - x.shift(1)
)
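
A small made-up example shows why the lag has to be taken within each firm and how the log difference relates to percentage growth. The numbers and the names `toy`, `d1_log` and `growth_exact` are assumptions of this sketch; `diff()` on the grouped column is equivalent to `x - x.shift(1)`.

```python
import numpy as np
import pandas as pd

# Made-up panel: firm 1 grows by 10%, firm 2 halves its sales
toy = pd.DataFrame(
    {
        "comp_id": [1, 1, 2, 2],
        "year": [2010, 2011, 2010, 2011],
        "sales_mil": [1.0, 1.1, 2.0, 1.0],
    }
).sort_values(["comp_id", "year"])
toy["sales_mil_log"] = np.log(toy["sales_mil"])

# Difference within firm: the first year of each firm has no lag, so it is NaN
toy["d1_log"] = toy.groupby("comp_id")["sales_mil_log"].diff()

# Exact growth rate implied by the log difference: exp(d) - 1
toy["growth_exact"] = np.expm1(toy["d1_log"])
print(toy)
```

For small changes the log difference is close to the percentage growth rate; for large changes, such as a halving of sales, the two diverge noticeably.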

Repeat random sample to include the new variables

In [None]:
bisnode_sample = bisnode.loc[lambda x: x["comp_id"].isin(comp_id_f)]

First measure of the change in sales: the change in log sales

In [None]:
nw = (
    ggplot(bisnode_sample, aes(x="d1_sales_mil_log", y="default"))
    + geom_point(size=1, fill="blue", color="blue")
    + geom_smooth(method="loess", se=False, colour="red", size=1.5)
    + labs(x="Growth rate (Diff of ln sales)", y="default")
    + theme_bw()
    + scale_x_continuous(limits=(-6, 10), breaks=np.arange(-5, 10, 5))
)
nw

### Winsorized Data:
  - set (extreme) values to a certain (lower) value

Note:

 - 'winsorizing' is the action of manually setting extreme values to a chosen value
 - 'censoring' is when the values arrive already 'winsorized': the original value is unknown and only the set value is observed \
   e.g. the wage of a mother who stays at home is recorded as 0, although it would be different if she worked
 - 'truncation' is when we drop values below or above a threshold from the data

Create new variable and add flag variables for modelling

In [None]:
bisnode["flag_low_d1_sales_mil_log"] = np.where(
    bisnode["d1_sales_mil_log"] < -1.5, 1, 0
)
bisnode["flag_high_d1_sales_mil_log"] = np.where(
    bisnode["d1_sales_mil_log"] > 1.5, 1, 0
)
bisnode["d1_sales_mil_log_mod"] = np.where(
    bisnode["d1_sales_mil_log"] < -1.5,
    -1.5,
    np.where(bisnode["d1_sales_mil_log"] > 1.5, 1.5, bisnode["d1_sales_mil_log"]),
)
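
The difference between winsorizing and truncation can be sketched on a few made-up growth rates. `Series.clip` is used here as a compact equivalent of the nested `np.where`; the names `growth`, `winsorized` and `truncated` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up growth rates, including extreme values and a missing one
growth = pd.Series([-4.0, -1.0, 0.2, 3.0, np.nan])

# Winsorizing: extreme values are pulled to the thresholds, nothing is dropped
winsorized = growth.clip(lower=-1.5, upper=1.5)

# Truncation: observations outside the thresholds are dropped
# (between() is False for NaN, so the missing value is dropped as well)
truncated = growth[growth.between(-1.5, 1.5)]

print(winsorized.tolist())
print(truncated.tolist())
```

Winsorizing keeps the sample size intact, which is why it is paired with flag variables that mark the modified observations.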

Repeat random sample to include the new variables

In [None]:
bisnode_sample = bisnode.loc[lambda x: x["comp_id"].isin(comp_id_f)]

In [None]:
w = (
    ggplot(bisnode_sample, aes(x="d1_sales_mil_log_mod", y="default"))
    + geom_point(size=1, fill="blue", color="blue")
    + geom_smooth(method="loess", se=False, colour="red", size=1.5)
    + labs(x="Growth rate (Diff of ln sales)", y="default")
    + theme_bw()
    + scale_x_continuous(limits=(-1.5, 1.5), breaks=np.arange(-1.5, 1.51, 0.5))
)
w

#### Task:
Show the effect of winsorizing: the transformation of the original data\
put d1_sales_mil_log on the x-axis and d1_sales_mil_log_mod on the y-axis

In [None]:
(
    ggplot(bisnode_sample, aes(x="d1_sales_mil_log", y="d1_sales_mil_log_mod"))
    + geom_point(size=1, fill="blue", color="blue")
    + labs(
        x="Growth rate (Diff of ln sales) (original)",
        y="Growth rate (Diff of ln sales) (winsorized)",
    )
    + theme_bw()
    + scale_x_continuous(limits=(-5, 5), breaks=np.arange(-5, 5, 1))
    + scale_y_continuous(limits=(-3, 3), breaks=np.arange(-3, 3, 1))
)