# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Lab 3.03 | Feature Engineering Lab

In this lab, you'll implement feature engineering on the "Heads of State" data.

Your $Y$ value should be the length of time (in years) each individual reigned.

In [106]:
import pandas as pd
import numpy as np

In [2]:
!head -2 Heads\ of\ State.csv

Name,Wikipedia Page,Description,Image,Birth Year,Age Term Began (approx.),Term Began,Year Term Began,Term Ended,Year Term Ended,Term length,Days in Term,Royal?,Current?,Birth Place (current city),Country of Birth (current country),Ruler of,Country of Ruled Territory (Current),Political Party,Studies,Role,Religion
"Heinrich II, Hoya",http://de.wikipedia.org/wiki/Heinrich_II._(Hoya),,,,,1235,1235,1290,1290,55 years,20088,Yes,No,,,Hoya ,Germany,,,Count,


In [9]:
state = pd.read_csv("Heads of State.csv")

In [10]:
state.shape

(454, 22)

Exercise 1: As a first step, do some EDA and data cleaning. Don't go too far down the rabbit hole, but be able to identify potential pitfalls in the data!

#### Stuff we need to keep:

1. being royal

2. length of time they reigned

3. religion

4. age of the person when they began their term

In [11]:
state.columns

Index(['Name', 'Wikipedia Page', 'Description', 'Image', 'Birth Year',
       'Age Term Began (approx.)', 'Term Began', 'Year Term Began',
       'Term Ended', 'Year Term Ended', 'Term length', 'Days in Term',
       'Royal?', 'Current?', 'Birth Place (current city)',
       'Country of Birth (current country)', 'Ruler of',
       'Country of Ruled Territory (Current)', 'Political Party', 'Studies',
       'Role', 'Religion'],
      dtype='object')

In [12]:
state.dtypes

Name                                     object
Wikipedia Page                           object
Description                              object
Image                                    object
Birth Year                               object
Age Term Began (approx.)                float64
Term Began                               object
Year Term Began                          object
Term Ended                               object
Year Term Ended                          object
Term length                              object
Days in Term                             object
Royal?                                   object
Current?                                 object
Birth Place (current city)               object
Country of Birth (current country)       object
Ruler of                                 object
Country of Ruled Territory (Current)     object
Political Party                          object
Studies                                  object
Role                                    

In [23]:
# drop current heads of state and reset the index so there are no gaps
state = state[state['Current?'] != 'Yes'].reset_index(drop=True)

In [26]:
state.shape[0]

363

In [25]:
state.isnull().sum()

Name                                      0
Wikipedia Page                          118
Description                             121
Image                                   214
Birth Year                              158
Age Term Began (approx.)                159
Term Began                                0
Year Term Began                           0
Term Ended                                2
Year Term Ended                           2
Term length                               0
Days in Term                              0
Royal?                                    0
Current?                                  0
Birth Place (current city)              254
Country of Birth (current country)      254
Ruler of                                  0
Country of Ruled Territory (Current)      0
Political Party                         341
Studies                                 358
Role                                      1
Religion                                337
dtype: int64

In [150]:
X = state[['Age Term Began (approx.)',
      'Royal?',
      'Name']]

y = state['Term length']

#### Exercise 2: Create the length of time in years each individual reigned. Include decimal values. (For example, if someone reigned for 330 days, we would expect this value to be approximately 0.9.)
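
The conversion described above is a single division by 365. Here is a minimal sketch, assuming the day counts are stored as strings (as the `object` dtype of `Days in Term` suggests). The values are made up for illustration and are not rows from the dataset.

```python
import pandas as pd

# Hypothetical day counts stored as strings; not taken from the real data
days = pd.Series(["330", "20088", "1461"], name="days_in_term")

# Coerce to numbers (unparseable entries become NaN), then convert to years
term_years = pd.to_numeric(days, errors="coerce") / 365
term_years = term_years.rename("term_years")

print(term_years.round(2).tolist())
```

Dividing by 365 ignores leap years, which is usually acceptable for an approximate reign length.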

In [None]:
# To do:
# age: impute missing values
# is_royal: map to 0 or 1
# name: replace with the length of the name

In [104]:
# the column is still under its original name at this point in the notebook
X['Age Term Began (approx.)'].median()

14.0

In [152]:
# redefine y as fractional years (this overwrites the earlier y);
# 'Days in Term' is stored as text, so keep the leading number and cast to float
y = state['Days in Term'].apply(lambda x: x.split()[0]).astype('float') / 365

In [153]:
y.rename('term_years', inplace=True)
X.columns = ['age', 'is_royal', 'name']

In [165]:
# work on an explicit copy so the assignments below modify X itself
# rather than a view of `state` (avoids SettingWithCopyWarning)
X = X.copy()

# impute (fill in) missing age values with the median age from the column
X['age'] = X['age'].fillna(X['age'].median())

# if royal then 1 else 0
X['is_royal'] = (X['is_royal'] == 'Yes').astype(int)

# replace name with the length of the name (int)
X['name'] = X['name'].str.len()


In [174]:
X.head(2)

    age  is_royal  name
0  14.0         1    17
1  14.0         1    25


Exercise 3: It only makes sense to analyze people who are not currently heads of state. (If we included current heads of state, we would not get an accurate measure of their length of reign, because their reign is still continuing!) Subset your data accordingly.

#### Done - see above

Exercise 4: Does being royal have a significant effect on the length of one's reign?
- Build the model using `sm.OLS()`. Be sure to include a $y$-intercept!
- Check out the summary. Interpret the coefficient.
- Based on the $p$-value in the summary, mention what (if anything) you can conclude about the effect of being royal on the length of one's reign.

Exercise 5: Does having a religion listed (column V) have a significant effect on the length of one's reign?
- Build the model using `sm.OLS()`. Be sure to include a $y$-intercept!
- Check out the summary. Interpret the coefficient.
- Based on the $p$-value in the summary, mention what (if anything) you can conclude about the effect of having a religion listed on the length of one's reign.

#### No - almost everything is null (337 of the 363 remaining rows have no religion listed)
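
The exercise asks about having a religion listed, not which religion, so a mostly-null column can still be turned into a 0/1 feature. Here is a minimal sketch on made-up values. The religion names are placeholders, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values; NaN marks a head of state with no religion listed
religion = pd.Series(["Catholic", np.nan, np.nan, "Lutheran", np.nan])

# 1 if any religion is listed, 0 if the field is missing
has_religion = religion.notnull().astype(int)

# Group sizes matter: a very small group gives a noisy coefficient estimate
print(has_religion.value_counts())
```

In the real data only 26 of the 363 rows have a religion listed, so any coefficient on such an indicator would rest on few observations.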

Exercise 6: Is there a significant interaction between being royal and having a religion listed?
- Build the model using `sm.OLS()`. Be sure to include a $y$-intercept!
- Based on the $p$-value in the summary, mention what (if anything) you can conclude about the interaction effect of royal and religion on the length of one's reign.

#### No - again because we have almost no religion data

Exercise 7: Does the age the term began have a significant effect on the length of one's reign?

Exercise 8: Suppose you're wary of the data collection process here. Are there any concerns you might have about the data? (Perhaps about sampled versus target populations?)

Exercise 9: Build a multiple linear regression model to predict the length of an individual's reign. In addition to the previous features, engineer at least two more features. If you want to use some of the text features, you may find [.isin()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isin.html) helpful.

Discuss the results of your model.
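
As one possible starting point for a text-based feature, `.isin()` can flag rows whose value belongs to a chosen set. This is a sketch only. The role labels and the `monarch_roles` grouping are assumptions made for illustration, and the real `Role` column may use different values.

```python
import pandas as pd

# Made-up role labels; the real 'Role' column may use different values
roles = pd.Series(["Count", "King", "President", "Queen", None])

# Hypothetical grouping of roles treated as monarchs for illustration
monarch_roles = ["King", "Queen", "Emperor"]

# .isin() returns a boolean Series; a missing role is not in the list, so it maps to 0
is_monarch = roles.isin(monarch_roles).astype(int)

print(is_monarch.tolist())
```

The resulting 0/1 column can be added to `X` alongside `age`, `is_royal`, and `name` before fitting the multiple regression.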