### 1. Using the documentation for Recursive Feature Selection, apply this process to the crime dataset to create the best multivariate linear regression model. You can select what you’re trying to predict, but be sure to indicate what that is. Be sure to explain what RFE is in the markdown. You should be able to answer this using what’s on the documentation page + what you already know.

Recursive feature elimination (RFE) is a feature-selection technique that starts from the full set of features and repeatedly removes the least important ones until a chosen number of features remains. In each round it fits an estimator that assigns a weight to every feature (for example, the coefficients of a linear model) and drops the feature with the smallest weight, or several features at once if `step` is greater than 1.
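To make the elimination loop concrete, here is a minimal sketch of the idea using ordinary least squares in NumPy. It is not scikit-learn's implementation: the function name `rfe_sketch`, the synthetic data, and the use of absolute coefficients as importance are all assumptions made for illustration, and the features are assumed to be on comparable scales.

```python
import numpy as np

def rfe_sketch(X, y, n_features_to_select):
    """Drop the feature with the smallest |coefficient| until n remain."""
    remaining = list(range(X.shape[1]))
    ranking = np.ones(X.shape[1], dtype=int)
    while len(remaining) > n_features_to_select:
        # Fit least squares (with an intercept column) on the remaining features
        A = np.column_stack([X[:, remaining], np.ones(len(X))])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        # Least important = smallest absolute weight (intercept excluded)
        weakest = int(np.argmin(np.abs(coef[:-1])))
        dropped = remaining.pop(weakest)
        # Features dropped earlier get a larger rank; kept features stay at 1
        ranking[dropped] = len(remaining) - n_features_to_select + 2
    return remaining, ranking

# Synthetic data (hypothetical): only columns 1, 3 and 4 influence y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = (3 * X_demo[:, 1] - 2 * X_demo[:, 3] + 0.5 * X_demo[:, 4]
          + rng.normal(scale=0.1, size=200))

kept, ranking = rfe_sketch(X_demo, y_demo, n_features_to_select=3)
print(kept)     # [1, 3, 4]
print(ranking)  # rank 1 marks the kept features
```

The ranking follows the same convention as `ranking_` in the cells below: selected features have rank 1, and the first feature eliminated has the highest rank.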

X1 = total overall reported crime rate per 1 million residents

X2 = reported violent crime rate per 100,000 residents

X3 = annual police funding in $/resident

X4 = % of people 25 years+ with 4 yrs. of high school

X5 = % of 16 to 19 year-olds not in high school and not high school graduates

X6 = % of 18 to 24 year-olds in college

X7 = % of people 25 years+ with at least 4 years of college

Reference: Life In America's Small Cities, By G.S. Thomas

In [33]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_selection import RFE
from sklearn.svm import SVR

crime_df = pd.read_csv("crime_data.csv")
crime_df.head(2)

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7
0,478,184,40,74,11,31,20
1,494,213,32,72,11,43,18


In [34]:
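# Predict the overall crime rate (X1) from the remaining six features
X = crime_df[['X2','X3','X4','X5','X6','X7']]
y = crime_df['X1']

# A linear-kernel SVR exposes coef_, which RFE uses to rank the features
estimator = SVR(kernel='linear')
selector = RFE(estimator, n_features_to_select=3, step=1)
selector = selector.fit(X, y)
selector.support_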

array([False,  True,  True,  True, False, False])

In [35]:
selector.ranking_

array([4, 1, 1, 1, 2, 3])

Based on this, we select features X3, X4 and X5: the columns marked `True` in `support_`, which have rank 1 in `ranking_`.

Here we fixed the number of features at 3, but we do not yet know how many we should select. To find out, we can iterate over every possible number of features:

In [55]:
def get_selectors():
    """Build one RFE selector per candidate number of features (1 to 6)."""
    selectors = dict()
    for i in range(1, 7):
        estimator = SVR(kernel='linear')
        selector = RFE(estimator, n_features_to_select=i, step=1)
        selectors[str(i)] = selector
    return selectors

def evaluate_model(selector, X, y):
    """Fit the selector and return its R^2 on the same (training) data."""
    selector = selector.fit(X, y)
    score = selector.score(X, y)
    return score

# get the models to evaluate
selectors = get_selectors()

# evaluate the models and store results
results, keys = list(), list()

for key, selector in selectors.items():
    score = evaluate_model(selector, X, y)
    results.append(score)
    keys.append(key)
    print(f'Number of features selected: {key} and the corresponding score: {score}')


Number of features selected: 1 and the corresponding score: 0.05735147028252985
Number of features selected: 2 and the corresponding score: 0.2077037097161889
Number of features selected: 3 and the corresponding score: 0.24702845263390028
Number of features selected: 4 and the corresponding score: 0.2483870693553456
Number of features selected: 5 and the corresponding score: 0.2575267216135815
Number of features selected: 6 and the corresponding score: 0.5841293980921582


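Using all 6 features gives the best score (which means no feature was eliminated), but this comparison is not a fair one: the score is the R² computed on the same data the model was fit on, so it cannot reveal overfitting and tends to favour larger feature sets. A fairer choice would compare cross-validated scores (scikit-learn provides `RFECV` for this) or use a criterion that penalizes the number of features, such as adjusted R².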
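One common penalty of this kind is adjusted R², which lowers the score as the number of features p grows relative to the number of samples n. The sketch below shows the formula only. It is usually applied to ordinary linear regression rather than SVR, and the sample size and R² values used here are hypothetical, not taken from the crime dataset.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Same raw R^2, different number of features (hypothetical numbers)
few = adjusted_r2(0.50, n_samples=30, n_features=3)
many = adjusted_r2(0.50, n_samples=30, n_features=6)
print(few, many)  # the 6-feature model is penalized more
```

With the same raw R², the model that uses more features ends up with the lower adjusted score, so extra features have to earn their place.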

### 2. Create a function called rec_digit_sum that takes in an integer. This function is the recursive sum of all the digits in a number. Given n, take the sum of all the digits in n. If the resulting value has more than one digit, continue calling the function in this way until a single-digit number is produced. The input will be a non-negative integer, and this should work for extremely large values as well as for single-digit inputs.

In [1]:
def rec_digit_sum(n):
    # Raise on invalid input instead of printing a message and returning None
    if not isinstance(n, int) or n < 0:
        raise ValueError('This function works only with non-negative integers')
    # Sum the decimal digits of n
    digit_sum = sum(int(digit) for digit in str(n))
    # A single-digit sum is the answer; otherwise recurse on the sum
    return digit_sum if digit_sum < 10 else rec_digit_sum(digit_sum)

In [32]:
rec_digit_sum(335795147896541233114569874463211126955522332555223669841489665)

7
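As a cross-check, the repeated digit sum of a non-negative integer is its digital root, which has a closed form based on the remainder modulo 9. The helper name `digital_root` is introduced here for illustration only; it is not part of the assignment.

```python
def digital_root(n):
    """Closed-form repeated digit sum of a non-negative integer."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # 0 maps to 0; every other n maps to 1..9 according to n modulo 9
    return 0 if n == 0 else 1 + (n - 1) % 9

big = 335795147896541233114569874463211126955522332555223669841489665
print(digital_root(big))  # 7
```

This agrees with the result of `rec_digit_sum` above and avoids recursion entirely, which can be convenient for extremely large inputs.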

### 3. Create a list of preprocessing steps you should try when working to build a model. Briefly describe what each step is. Work with your group to come up with the most comprehensive list you can.

#### 1. Data Extraction

There are many possible sources of data (from customers, from the web, through measurements). It is crucial to understand how the data was acquired before manipulating it.

#### 2. Data Cleaning

The main aim is to remove errors and redundancies from the data. Possible sources of error include:
- duplicated entries
- inaccurate values
- data entries that might have been altered (consistency)
- missing values

Ways to eliminate sources of errors:

- removing duplicates
- keeping track of dates of updates (assumption is, most recent data is most correct)
- filling in missing values
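The three remedies above can be sketched with pandas as follows. The table, its column names, and the choice of the median as the fill value are hypothetical and only serve as an illustration; the right fill strategy depends on the data.

```python
import pandas as pd

# Hypothetical records: city "A" appears twice with different update dates,
# and city "C" is missing its value
raw = pd.DataFrame({
    "city": ["A", "A", "B", "C"],
    "updated": pd.to_datetime(
        ["2020-01-01", "2021-06-01", "2020-03-01", "2020-05-01"]
    ),
    "crime_rate": [478.0, 494.0, 643.0, None],
})

# Keep only the most recent record per city (assumed to be the most correct)
cleaned = (
    raw.sort_values("updated")
       .drop_duplicates(subset="city", keep="last")
       .reset_index(drop=True)
)

# Fill the remaining missing values with the column median
cleaned["crime_rate"] = cleaned["crime_rate"].fillna(cleaned["crime_rate"].median())
print(cleaned)
```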

#### 3. Data Integration

Data acquired from different sources might not be consistent. Such inconsistencies have to be resolved before the data is used.

#### 4. Data Transformation

The model to be built might need the data in a different format than the raw data. Possible transformations include:

- normalization (rescaling features to a common range so that features on different scales become comparable)
- aggregation (defining new features from multiple old features)
- generalization (low-level attributes are replaced by higher-level concepts, e.g. exact age by age group)
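As an illustration of normalization, the sketch below shows two common variants, min-max scaling and standardization (z-scores), in NumPy. The function names and the sample values are hypothetical, and `standardize` assumes the input is not constant.

```python
import numpy as np

def min_max_scale(x):
    """Rescale values linearly to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # A constant feature carries no scale information
        return np.zeros_like(x)
    return (x - x.min()) / span

def standardize(x):
    """Shift to zero mean and scale to unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

values = [40, 32, 57, 31]  # hypothetical feature values
scaled = min_max_scale(values)
z_scores = standardize(values)
print(scaled)
print(z_scores)
```

Which variant is appropriate depends on the model: distance-based and margin-based estimators are typically sensitive to feature scale, while tree-based models usually are not.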

#### 5. Data Reduction

Redundant records and features are removed to reduce the size of the data.

The following two steps help clarify what the steps above need to achieve.

#### 6. In-Depth Data Exploration

Aims at understanding the essential patterns in the data.

#### 7. Identifying the Critical Features

Feature engineering, as well as identifying which features are constant and which vary.

