# Lecture 11 - Part I

## Feature Engineering <a class="anchor" id="TOC"></a>

### World Management Survey Data
                                                               
  - Creating a new variable from several existing ones (the mean of multiple variables)
  - Grouping a categorical variable: countries to continents
  - Ordered variables:
     - creating an ordered factor from a character or integer variable
     - creating an ordered factor from a numeric variable
  - Factors or dummy variables: creating multiple dummies
  - Extra: intro to principal component analysis
___

Import packages

In [None]:
import pandas as pd
import numpy as np
from plotnine import *
import warnings

%matplotlib inline
warnings.filterwarnings("ignore")

Import World Management Survey Data

In [None]:
wms  = pd.read_csv("https://osf.io/uzpce/download")

wms.head()

### Creating a continuous variable out of ordered variables

Trick: the lean, perf and talent measures are each spread over several variables, so we average them into a single score.\
    1. `filter(regex=)` selects these variables.
    2. `mean(axis=1)` calculates the average across them for each observation.

In [None]:
wms["avg_score"] = wms.filter(regex="lean|perf|talent").mean(axis=1)

In [None]:
wms["avg_score"].describe()
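
To see what the two steps do, here is a minimal sketch on a toy data frame. The column names (`lean1`, `perf1`, `talent1`) are assumptions that only mimic the naming pattern of the survey variables.

```python
import pandas as pd

# Toy data: hypothetical column names that mimic the survey's naming pattern
toy = pd.DataFrame(
    {
        "firmid": [1, 2, 3],
        "lean1": [2, 4, 5],
        "perf1": [3, 4, None],
        "talent1": [4, 4, 1],
    }
)

# filter(regex=...) keeps the columns whose name contains any of the patterns
scores = toy.filter(regex="lean|perf|talent")

# mean(axis=1) averages across columns, row by row; missing values are skipped
toy["avg_score"] = scores.mean(axis=1)
print(toy)
```

Note that the third row still gets an average, computed from the two non-missing scores, because `mean` skips missing values by default.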

#### Task:
   Create the sum of the `aa_` variables. \
   Check that the resulting variable equals 1 for each observation, since the `aa_` variables are dummies for the industry code.

In [None]:
wms["sum_aa"] = wms.filter(regex="aa_").sum(axis=1)
wms["sum_aa"].describe()
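
Instead of reading the check off the `describe()` output, it can also be expressed as a single boolean. This is a sketch on toy data; the anchored pattern `^aa_` is an assumption that the dummy names start with `aa_`, and it avoids matching other columns that merely contain that string.

```python
import pandas as pd

# Toy industry dummies: exactly one of them is 1 in every row
toy = pd.DataFrame(
    {
        "aa_1": [1, 0, 0],
        "aa_2": [0, 1, 0],
        "aa_3": [0, 0, 1],
        "emp_firm": [10, 20, 30],
    }
)

# Sum the dummies row by row, then test that every row sums to exactly 1
row_sums = toy.filter(regex="^aa_").sum(axis=1)
all_one = bool((row_sums == 1).all())
print(all_one)
```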

### Grouping categorical

Creating groups by continent reduces the number of categories of a categorical variable.


In [None]:
wms["country"].value_counts()

In [None]:
wms["country"].value_counts(normalize = True)

`pycountry_convert` module converts country names to country codes and continents

In [None]:
import pycountry_convert as pc

Note: Northern Ireland is not in this database, so we convert it by hand. Also, "Republic of Ireland" has to be shortened to "Ireland".

In [None]:
wms["continent"] = (
    wms["country"]
    # map both Irish entries to a name that pycountry_convert recognizes
    .replace({"Northern Ireland": "Ireland", "Republic of Ireland": "Ireland"})
    .apply(pc.country_name_to_country_alpha2) # converts country name to country code
    .apply(pc.country_alpha2_to_continent_code) # country code to continent code
    .apply(pc.convert_continent_code_to_continent_name)# continent code to name
)

In [None]:
wms["continent"].value_counts(dropna=False)
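
If installing an extra package is not an option, the same grouping can be sketched with a hand-written dictionary and `Series.map`. The lookup below is hypothetical and covers only a few countries; any name missing from it becomes `NaN`, which makes unmatched countries easy to spot.

```python
import pandas as pd

countries = pd.Series(
    ["Germany", "Northern Ireland", "Republic of Ireland", "Japan", "Atlantis"]
)

# Hypothetical hand-written lookup; extend it to cover every country in the data
continent_lookup = {
    "Germany": "Europe",
    "Northern Ireland": "Europe",
    "Republic of Ireland": "Europe",
    "Japan": "Asia",
}

# map() returns NaN for names that are not keys of the dictionary
continents = countries.map(continent_lookup)

# List the country names that could not be matched
unmatched = countries[continents.isna()].unique()
print(unmatched)
```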

It is also possible to create these groups by hand with `np.where`.

In [None]:
wms["ownership"].value_counts(dropna=False)

In [None]:
wms["owner"] = np.where(
    wms["ownership"].isnull(),
    np.nan,
    np.where(
        wms["ownership"] == "Government",
        "govt",
        np.where(
            wms["ownership"].str.contains("family", regex=False),
            "family",
            np.where(wms["ownership"] == "Other", "other", "private"),
        ),
    ),
)

In [None]:
wms["owner"].value_counts(dropna=False)
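
Nested `np.where` calls get hard to read as the number of groups grows. One alternative, sketched here on toy values, is `np.select`, which takes a list of conditions and a matching list of outputs. The ownership labels below are illustrative and not necessarily the exact categories in the survey. Missing values are restored in a separate step, because mixing `np.nan` with string outputs in NumPy generally yields a string array, so the missing marker may end up stored as text.

```python
import numpy as np
import pandas as pd

# Toy values; the labels are illustrative, not the survey's exact categories
ownership = pd.Series(
    ["Government", "Family owned, family CEO", "Other", "Dispersed Shareholders", None]
)

# Conditions are checked in order; the first one that holds decides the output
conditions = [
    ownership == "Government",
    ownership.str.contains("family", regex=False, na=False),
    ownership == "Other",
]
choices = ["govt", "family", "other"]

owner = pd.Series(
    np.select(conditions, choices, default="private"), index=ownership.index
)

# Keep a proper missing value wherever the original ownership was missing
owner = owner.where(ownership.notna())
print(owner.value_counts(dropna=False))
```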

### Good-to-know: labeled ordered categorical variable
The labels are ordered, but this makes a difference only in a few applications (for example sorting, comparisons, and the order of categories on a plot axis).

In [None]:
wms["lean1_ord"] = pd.cut(
    wms["lean1"], 5, labels=["extremely poor", "bad", "mediocre", "good", "excellent"]
)
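
A small sketch of where the ordering matters, using toy integer scores from 1 to 5 (an assumption about the scale). `pd.cut` returns an ordered categorical by default, so comparisons and sorting follow the order of the labels rather than the alphabet.

```python
import pandas as pd

scores = pd.Series([1, 3, 5, 2, 4])
labels = ["extremely poor", "bad", "mediocre", "good", "excellent"]

# Five equal-width bins, each given a label; the result is an ordered categorical
scores_ord = pd.cut(scores, 5, labels=labels)

is_ordered = scores_ord.cat.ordered

# Comparisons use the category order, not alphabetical order
at_least_good = (scores_ord >= "good").tolist()

# Sorting also follows the category order
sorted_labels = scores_ord.sort_values().astype(str).tolist()
print(sorted_labels)
```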

Ordered categories are easy to plot:

In [None]:
(
    ggplot(wms, aes(x="lean1_ord", y="avg_score"))
    + stat_summary(geom="point", fun_data="mean_se", size=8, fill="red")
    + labs(x="Lean 1 score", y="Mean average management score")
    + theme_bw()
)

#### Task:
Create the same graph, but using the `talent2` variable instead

In [None]:
wms["talent2_ord"] = pd.cut(
    wms["talent2"], 5, labels=["extremely poor", "bad", "mediocre", "good", "excellent"]
)

In [None]:
(
    ggplot(wms, aes(x="talent2_ord", y="avg_score"))
    + stat_summary(geom="point", fun_data="mean_se", size=8, fill="red")
    + labs(x="Talent 2 score", y="Mean average management score")
    + theme_bw()
)

##### Numeric to ordered

It is hard to draw any conclusion from a scatter plot of the
   average management score against the number of employees.

In [None]:
(
    ggplot(wms, aes(x="emp_firm", y="avg_score"))
    + geom_point(color="red", size=2, alpha=0.6)
    + labs(x="Number of employees", y="Mean average management score")
    + theme_bw()
)

One simple way to solve this issue:\
simplify firm size by creating categories from the numeric variable.

In [None]:
wms["emp_cat"] = pd.cut(
    wms["emp_firm"], bins=[0, 200, 1000, np.inf], labels=["small", "medium", "large"]
)
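
It is worth knowing how `pd.cut` treats the bin edges. By default the intervals are open on the left and closed on the right, so with the bins above a firm with exactly 200 employees is "small" and one with exactly 1000 is "medium". A value of 0 falls outside the first interval and becomes missing. The toy values below are made up to illustrate the edge cases.

```python
import numpy as np
import pandas as pd

# Made-up employee counts that sit on or next to the bin edges
emp = pd.Series([0, 150, 200, 201, 1000, 5000, np.nan])

emp_cat = pd.cut(
    emp, bins=[0, 200, 1000, np.inf], labels=["small", "medium", "large"]
)

# Intervals are (0, 200], (200, 1000], (1000, inf): 0 and NaN get no category
print(pd.DataFrame({"emp": emp, "emp_cat": emp_cat}))
```

If zeros should count as "small", passing `include_lowest=True` closes the first interval on the left as well.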

In [None]:
(
    ggplot(
        wms.loc[
            lambda x: x["emp_cat"].notnull(),
        ],
        aes(x="emp_cat", y="avg_score"),
    )
    + stat_summary(geom="point", fun_data="mean_se", size=8, fill="red", na_rm=True)
    + labs(x="Firm size", y="Mean average management score")
    + theme_bw()
)

### Factors Or Dummies

Creating multiple dummy variables from a categorical variable

In [None]:
dummies = pd.get_dummies(wms["emp_cat"], dummy_na = True)
dummies

You can easily concatenate this to the original dataframe

In [None]:
wms = pd.concat([wms,dummies],axis=1)
wms.head()
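
The dummy columns above are named after the bare categories, and the missing-value dummy gets `NaN` as its column name, which is awkward to reference later. A possible variation, sketched on toy data, is to pass a `prefix` (here the made-up prefix `emp`) so that every dummy gets a descriptive string name.

```python
import pandas as pd

# Toy ordered categorical with one missing value
emp_cat = pd.Series(
    pd.Categorical(
        ["small", "large", None, "medium"],
        categories=["small", "medium", "large"],
        ordered=True,
    )
)

# prefix= names the columns emp_<category>; dummy_na=True adds a column for missing
dummies = pd.get_dummies(emp_cat, prefix="emp", dummy_na=True)
print(dummies)

# Every row belongs to exactly one dummy, including the row with the missing value
one_per_row = bool((dummies.sum(axis=1) == 1).all())
```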

### Extra:

Principal component analysis, or PCA

One can argue that the mean of the scores is not the best measure, as it gives each variable the same weight. \
An alternative is to create principal components, which are weighted linear combinations of the original variables.

Import the `PCA` class from sklearn

In [None]:
from sklearn.decomposition import PCA

Let us create principal components from all the questionnaire items. \
We have to make sure there are no missing (NA) values.

In [None]:
original_variables = wms.filter(regex="lean|perf|talent").filter(regex="^(?!.*ord).*$").dropna()
original_variables.shape

fit PCA model

In [None]:
pca = PCA()

pca.fit(original_variables)

We get the same number of variables, but they are transformed.

As PCA is a dimension-reduction approach, we can check
     what share of the overall information (variance) each transformed variable explains.

In [None]:
pca.explained_variance_ratio_
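
To make these shares less of a black box, here is a rough sketch of how they can be computed by hand with NumPy: center the data, take the singular value decomposition, and turn the squared singular values into variance shares. The data are simulated purely for illustration (three noisy copies of one common factor), so the numbers have nothing to do with the survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated scores: three noisy copies of one common factor (illustrative only)
common = rng.normal(size=(200, 1))
X = common + 0.5 * rng.normal(size=(200, 3))

# PCA works on mean-centered data
X_centered = X - X.mean(axis=0)

# Singular values come back sorted from largest to smallest
_, singular_values, _ = np.linalg.svd(X_centered, full_matrices=False)

# Variance along each component, then its share of the total variance
explained_variance = singular_values**2 / (X.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()
print(explained_variance_ratio.round(3))
```

The shares sum to 1 and are sorted in decreasing order, which is why the first component is the natural candidate when only one is kept.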

Let us use only the first principal component, which explains 45.6% of the variance.

In [None]:
pca_components = pd.DataFrame(
    pca.fit_transform(original_variables),
    columns=["PC%s" % str(i + 1) for i in range(len(original_variables.columns))],
)
pca_components.shape

aux: select firmid and wave with the same filter, so that the PCs can be matched to the wms data

In [None]:
aux = (
    wms.filter(regex="lean|perf|talent|wave|firmid")
    .filter(regex="^(?!.*ord).*$")
    .dropna()
    .filter(["wave", "firmid"])
    .reset_index(drop=True)
)
aux.shape

Add firmid, wave and only PC1 from the principal components

In [None]:
pca_dataframe = pd.concat([aux, pca_components["PC1"]], axis=1)

pca_dataframe.shape

add to wms data

In [None]:
wms = wms.merge(pca_dataframe, on = ["firmid","wave"],how="left")
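
The merge above relies on `firmid` and `wave` together identifying each row. A possible alternative, sketched here on toy data under that caveat, is to keep the original row index through `dropna()` and let pandas align on it when assigning the new column. The first component is computed with NumPy to keep the example self-contained, and the column names are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "firmid": [1, 2, 3, 4],
        "lean1": [1.0, 2.0, np.nan, 4.0],
        "perf1": [2.0, 3.0, 5.0, 5.0],
    }
)

# dropna() keeps the original row labels (0, 1 and 3 here)
complete = toy[["lean1", "perf1"]].dropna()
centered = complete - complete.mean()

# Rows of vt are the principal directions; project onto the first one
_, _, vt = np.linalg.svd(centered.to_numpy(), full_matrices=False)
pc1 = pd.Series(centered.to_numpy() @ vt[0], index=complete.index)

# Assignment aligns on the index, so the dropped row simply gets NaN
toy["PC1"] = pc1
print(toy)
```

Keep in mind that the sign of a principal component is arbitrary, so the scores may come out flipped between implementations.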

Compare descriptives with average score


In [None]:
wms.filter(["avg_score", "PC1"]).describe()

Create a bin-scatter with PC1

In [None]:
(
    ggplot(
        wms.loc[
            lambda x: x["emp_cat"].notnull(),
        ],
        aes(x="emp_cat", y="PC1"),
    )
    + stat_summary(geom="point", fun_data="mean_se", size=8, fill="red", na_rm=True)
    + labs(x="Firm size", y="Principal component")
    + theme_bw()
)

Notes: 
  1) PCA is especially useful when you have too many explanatory variables and want to reduce their number
      with minimal information loss. However, use it with care, especially with time series! \
  2) There are many variations of PCA, for example `rotating` the factors
      to obtain more interpretable variables (common in psychology). \
  3) Many packages carry out PCA; this is only a very simple introduction.