# Analyzing SOFC data (part 5)

## 5) Finalizing the model
So far, I am confident that only two features belong in the model: average electron affinity and average d-electron count for the B-cation. I have ruled out the parent features and now look only at the average features. I have not yet completely ruled out tolerance factor and critical radius, but I don't believe they will add much.

So what subset of features do I choose for my model?

In this notebook, I will determine the best subset of features by only allowing features that contribute at least 0.04 to the average model score after electron affinity and d-electron count have been added to it. At the time of this writing, sklearn has not implemented a best-subset feature selection method, so I will have to build one myself. If I had many features, I would write a function to do this recursively, but because I only have a few, I believe it is more efficient and more instructive to do it step by step.
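The step-by-step procedure described above can be sketched as a single greedy pass over the candidates. This is only an illustration under stated assumptions, not the `add_feature` helper from `sofc_func` (whose internals are not shown here): `forward_select`, `score_fn`, `toy_score`, and the feature names `x_noise` and `x_good` are all hypothetical. Note also that a greedy pass is a simplification of a true best-subset search, which would compare every combination of candidates.

```python
import random
from statistics import mean


def forward_select(base, candidates, score_fn, threshold=0.04, n_repeats=10):
    """Greedy forward pass: keep a candidate only if it raises the
    average score of the current model by at least `threshold`.

    score_fn(columns, seed) is a hypothetical callable returning one
    model score for the given feature names and random seed.
    """
    selected = list(base)

    def avg_score(cols):
        # Average over several seeds to smooth out split-to-split noise.
        return mean(score_fn(cols, seed) for seed in range(n_repeats))

    current = avg_score(selected)
    report = {}
    for name in candidates:
        trial = avg_score(selected + [name])
        gain = trial - current
        report[name] = gain
        if gain >= threshold:
            selected.append(name)
            current = trial
    return selected, report


def toy_score(cols, seed):
    """Made-up scorer for demonstration: fixed contributions plus noise."""
    rng = random.Random(seed * 100 + len(cols))
    useful = {"avg_EA_B": 0.30, "avg_d_count_B": 0.25, "x_good": 0.10}
    return sum(useful.get(c, 0.0) for c in cols) + rng.uniform(-0.005, 0.005)


selected, report = forward_select(
    ["avg_EA_B", "avg_d_count_B"], ["x_noise", "x_good"], toy_score
)
print(selected)
```

In a real run, `score_fn` could wrap a cross-validated score, for example from `cross_val_score` in `sklearn.model_selection`, fitted on the chosen columns.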

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import csv
import math
import pandas as pd
import time
%matplotlib inline
from sklearn import cross_validation
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
from sklearn.feature_selection import RFE

from sofc_func import *

In [2]:
# Import using pandas
df = pd.read_csv("data.csv")

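# Keep only rows that have parent A and B cations, d_star, k_star,
# electron affinity (B), and d-electron count (B).
# Note: this filtered frame is overwritten by features(df) below.
data = df[pd.notnull(df['A_par']) & pd.notnull(df['B_par']) & pd.notnull(df['d_star']) & pd.notnull(df['k_star'])
          & pd.notnull(df['e affinity(B)']) & pd.notnull(df['d-electron count (B)'])]

# Silence the pandas chained-assignment warning
pd.options.mode.chained_assignment = None

data = features(df)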

data['dk_star'] = pd.Series(dk_star(data,1000), index=df.index)

In [3]:
# Keep only rows where every feature and the target are present.
# Index `data` (not `df`) so the mask and the selected columns come from the same frame.
f = data[pd.notnull(data['EA_A']) & pd.notnull(data['EA_B']) & pd.notnull(data['r_A'])
         & pd.notnull(data['r_B']) & pd.notnull(data['d_count_B']) & pd.notnull(data['avg_EA_A'])
         & pd.notnull(data['avg_EA_B']) & pd.notnull(data['avg_r_A']) & pd.notnull(data['avg_r_B'])
         & pd.notnull(data['avg_d_count_B']) & pd.notnull(data['dk_star'])]

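# Features already accepted into the model
X = f[['avg_EA_B', 'avg_d_count_B']]
# Target
y = f['dk_star']
# Candidate features to test. Note: this name shadows the features() helper
# imported from sofc_func, so the cell above cannot be re-run after this one.
features = f[['avg_EA_A', 'avg_r_A', 'avg_r_B', 'tol_factor', 'r_critical']]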

I will attempt to add more features to the model. Only those that boost the score by 0.04 or more will be added. Note that this will take a long time to run!

In [5]:
new = add_feature(X, np.array([y]).T, features, 0.04, 100)

Determining if a feature can be added. This may take a few minutes.
Could not add avg_EA_A to the model because its change on score was -0.0118375925323 (14.138491s)
Could not add avg_r_A to the model because its change on score was 0.00482693841644 (16.809758s)
Could not add avg_r_B to the model because its change on score was -0.041283013976 (13.023931s)
Could not add tol_factor to the model because its change on score was -0.0122178551787 (16.33316s)
Could not add r_critical to the model because its change on score was 0.0104444538935 (16.215799s)
Could not add any features to the model.
Executed in  84.051801 s


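None of the candidates clears the 0.04 threshold: the largest gain is about 0.010 (r_critical), and three of the five features actually lower the score. Looks like nothing adds any predictability to the model. I guess this will be a short notebook.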
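As a quick check on the output above, the reported score changes can be compared against the threshold directly. The numbers are copied from the printed log; the variable names are just for illustration.

```python
threshold = 0.04

# Score changes copied from the add_feature output above
score_changes = {
    "avg_EA_A": -0.0118375925323,
    "avg_r_A": 0.00482693841644,
    "avg_r_B": -0.041283013976,
    "tol_factor": -0.0122178551787,
    "r_critical": 0.0104444538935,
}

# Features whose gain meets the threshold (none do)
passed = [name for name, gain in score_changes.items() if gain >= threshold]

# Candidate with the largest gain, and how far it falls short
best_name, best_gain = max(score_changes.items(), key=lambda kv: kv[1])
shortfall = threshold - best_gain

print(passed)
print(best_name, round(best_gain, 4), round(shortfall, 4))
```

Even the best candidate gains only about a quarter of what the threshold requires, so loosening the cutoff slightly would not change the outcome.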

# Conclusion:
The two features avg_EA_B and avg_d_count_B are the only features I need to explain the variation at 1000 K. In the next notebook, I will start looking at how the model holds up at lower temperatures. The goal of this analysis is to show what makes for a good SOFC material at medium and low temperatures.