# Analyzing SOFC data (part 5)

## 5) Finalizing the model
So far, I am only confident that two features should be in the model: average electron affinity and average d-count for the B-cation. I have ruled out the parent features and now look only at the average features. I have not yet completely ruled out tolerance factor and critical radius, but I don't believe they will add much.

So what subset of features do I choose for my model?

In this notebook, I will determine the best subset of features by only allowing features that add at least a fixed threshold (0.04 in the first pass, 0.1 in the final check) to the average model score after electron affinity and d-electron count have been added to it. At the time of this writing, sklearn has not implemented a best subset feature selection methodology, so I will have to build one myself. If I had many features, I would write a function to do this recursively, but because I only have a few, I believe it is more efficient and more instructive to do it step by step.
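To make the selection rule concrete, here is a minimal sketch of one thresholded forward-selection step. It is not the `add_feature` function from `sofc_func`, whose internals are not shown in this notebook. The names `mean_holdout_r2` and `forward_step` are hypothetical, the model is assumed to be an ordinary least-squares fit, the score is assumed to be R² averaged over random train/test splits, and the data are synthetic.

```python
import numpy as np


def mean_holdout_r2(X, y, n_iter=100, test_frac=0.25, seed=0):
    """Average R^2 of a least-squares fit over random train/test splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(1, int(round(test_frac * n)))
    scores = []
    for _ in range(n_iter):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        # Add an intercept column, then solve the least-squares problem
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        A_test = np.column_stack([np.ones(len(test)), X[test]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        resid = y[test] - A_test @ coef
        ss_res = np.sum(resid ** 2)
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))


def forward_step(base, candidates, y, threshold, n_iter=100):
    """Return {name: gain} for candidates whose gain over `base` meets `threshold`."""
    base_score = mean_holdout_r2(base, y, n_iter)
    accepted = {}
    for name, col in candidates.items():
        # Same seed as the base model, so both are scored on identical splits
        trial = np.column_stack([base, col])
        gain = mean_holdout_r2(trial, y, n_iter) - base_score
        if gain >= threshold:
            accepted[name] = gain
    return accepted


# Synthetic demonstration data (assumption: NOT the SOFC dataset)
rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
noise_feature = rng.normal(size=n)
y = 2.0 * x1 - 1.5 * x2 + 0.1 * rng.normal(size=n)

base = x1.reshape(-1, 1)
candidates = {"x2": x2, "noise_feature": noise_feature}
accepted = forward_step(base, candidates, y, threshold=0.04)
print(accepted)
```

In this toy setup the informative column clears the threshold and the pure-noise column does not. Scoring the base and trial models on the same splits keeps the comparison paired, so the gain reflects the extra feature rather than split-to-split variation.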

In [1]:
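import numpy as np
import pandas as pd

from sofc_func import *  # project helpers used below: features, dk_star, add_feature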

%matplotlib inline

In [2]:
# Import using pandas
df = pd.read_csv("data.csv")

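# Clean out rows where there is no parent A or parent B, or where the
# diffusion/exchange data or the B-cation inputs are missing.
# NOTE: `data` is reassigned by features(df) below, so this filter does not
# carry forward; the null filtering that takes effect is in the next cell.
required = ['A_par', 'B_par', 'd_star', 'k_star',
            'e affinity(B)', 'd-electron count (B)']
data = df.dropna(subset=required)

# Silence pandas' chained-assignment warning for the column assignments below
pd.options.mode.chained_assignment = None

# Compute the feature columns (features() is defined in sofc_func)
data = features(df)

# Target column from dk_star() with argument 1000 (the 1000 K case), aligned to df's row index
data['dk_star'] = pd.Series(dk_star(data, 1000), index=df.index)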

In [3]:
# Keep only rows where every candidate feature and the target are present
cols = ['EA_A', 'EA_B', 'r_A', 'r_B', 'd_count_B',
        'avg_EA_A', 'avg_EA_B', 'avg_r_A', 'avg_r_B', 'avg_d_count_B',
        'dk_star', 'tol_factor', 'r_critical']
f = data.dropna(subset=cols)

X = f[ ['avg_EA_B', 'avg_d_count_B'] ]
y = f['dk_star']
_features = f[ ['avg_EA_A', 'avg_r_A','avg_r_B', 'tol_factor', 'r_critical' ]]

I will attempt to add more features to the model. Only those that boost the score by 0.04 or more will be added. Note that this will take a long time to run!

In [43]:
new = add_feature(X, np.array([y]).T, _features, 0.04, 100)

Determining if a feature can be added. This may take a few minutes.
Could not add avg_EA_A to the model because its change on score was -0.00771994338648 (20.249127s)
Could not add avg_r_A to the model because its change on score was -0.00350144842272 (20.515041s)
Could not add avg_r_B to the model because its change on score was -0.00668780628091 (12.649361s)
Could not add tol_factor to the model because its change on score was -0.00590486928938 (17.87609s)
Could not add r_critical to the model because its change on score was 0.00937400226866 (17.91572s)
Could not add any features to the model.
Executed in  97.606835 s


It looks like nothing adds any predictive power to the model. I will check one last time that these two variables are the best ones. I know this is getting a little redundant, but it's worth a final sanity check.

### Starting with an empty model:

In [5]:
_features = f[ ['avg_EA_A', 'avg_r_A','avg_r_B', 'tol_factor', 'r_critical', 'avg_EA_B', 'avg_d_count_B' ]]
ft = add_feature(None, np.array([y]).T, _features, 0.1, 100)

0.0326023034104 avg_EA_A
0.0997026493412 avg_r_A
-0.0678034707331 avg_r_B
0.0331366761701 tol_factor
0.0530407495161 r_critical
0.733332062123 avg_EA_B
0.681342225905 avg_d_count_B
The best feature to start with is avg_EA_B with a score of 0.733332062123


### With avg_EA_B:

In [6]:
ftt = add_feature(ft, np.array([y]).T, _features, 0.1, 100)

Determining if a feature can be added. This may take a few minutes.
Could not add avg_EA_A to the model because its change on score was -0.00635692138024 (9.565273s)
Could not add avg_r_A to the model because its change on score was 0.00508875453257 (7.415044s)
Could not add avg_r_B to the model because its change on score was 0.0128878821007 (8.648698s)
Could not add tol_factor to the model because its change on score was 0.0114830885825 (8.357922s)
Could not add r_critical to the model because its change on score was 0.0152336928735 (8.595857s)
Could not add avg_EA_B to the model because its change on score was 0.00341607970783 (6.105509s)
Could not add avg_d_count_B to the model because its change on score was 0.039032533647 (7.54181s)
Could not add any features to the model.
Executed in  62.361821 s


### That's interesting...
Average d-count was not added to the model because it only contributed ~0.04 to the score, well short of the 0.1 threshold used here. Back in the first notebook I alluded to the fact that electron affinity and d-electron count are linearly dependent, but I didn't think multicollinearity would be too big an issue. Here it looks like it is. Based on this, I will conclude that the best model is actually a simple polynomial relationship:

### D\*k\* = f ( EA_B, T )

I have left temperature out up to this point, but I will get there soon.
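The collinearity between electron affinity and d-electron count noted above can be quantified directly. The sketch below is only an illustration on synthetic data: the helper names `pearson_r` and `vif_two_predictors` are hypothetical, and the two arrays are made-up stand-ins for `avg_EA_B` and `avg_d_count_B`, not values from the SOFC dataset. It computes the Pearson correlation and, for the two-predictor case, the variance inflation factor 1 / (1 - r²).

```python
import numpy as np


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


def vif_two_predictors(a, b):
    """Variance inflation factor when the model has exactly two predictors."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r ** 2)


# Synthetic stand-ins for two correlated features (assumption: NOT the SOFC data)
rng = np.random.default_rng(0)
feat_a = rng.normal(size=300)
feat_b = 0.9 * feat_a + 0.3 * rng.normal(size=300)

r = pearson_r(feat_a, feat_b)
vif = vif_two_predictors(feat_a, feat_b)
print(f"r = {r:.3f}, VIF = {vif:.2f}")
```

On the notebook's own frame, `f['avg_EA_B'].corr(f['avg_d_count_B'])` would give the same correlation statistic. A large VIF means the second feature carries little information the first does not already provide, which is consistent with d-count adding only a small gain once electron affinity is in the model.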

# Conclusion:
It looks like avg_EA_B is the only feature I need to explain variation at 1000 K. In the next notebook, I will start looking at how the model holds up at lower temperatures. The goal of this analysis is to show what makes for a good SOFC material at medium and low temperatures.