# Recipe

This notbook is used to defining the methodology for identifying important features in wine quality datasets.

In [2]:
%%capture
%run 06_model_building.ipynb

## 1. Identify Important Features

In [3]:
 def identyfy_important_features(df, number_of_parameters): 
    """
    Identifies features with the highest coefficient for each model in the dataframe.

    Parameters:
        - df: DataFrame containing model information
        - number_of_parameters: Number of top features to select for each model

    Returns:
        List of dictionaries with model names and their corresponding top features
    """
    result_dict = []
    #check each model 
    for model in range(len(df)):

        # take only significant params
        significant_params = (df.iloc[model]['model'].params[1:][df.iloc[model]['model'].pvalues < 0.05]).abs().sort_values(ascending=False)

        # select 3 the best
        top_features = list(significant_params[:number_of_parameters].index)  
        # Create a dictionary entry for the current dataframe
        results_df = result_dict.append({df.iloc[0]['Model name']: top_features})
        
    return result_dict

## 2. Finding the receipe for the good and poor wine

This section is dedicated to applying the methodologies or processes established earlier to specifically categorize and analyze good and poor quality wines.

In [4]:
# Here, the function to identify top three important features is called with the evaluation matrix list as the argument. 
# The output will be used to understand critical factors in wine quality.
top_three_features_list = identyfy_important_features(df=results_df, number_of_parameters=3)
top_three_features_list

[{'Model_White Wine Poor (Without Outliers)': ['volatile acidity',
   'fixed acidity',
   'residual sugar']},
 {'Model_White Wine Poor (Without Outliers)': ['density',
   'pH',
   'fixed acidity']},
 {'Model_White Wine Poor (Without Outliers)': ['volatile acidity',
   'citric acid',
   'pH']},
 {'Model_White Wine Poor (Without Outliers)': ['sulphates',
   'chlorides',
   'alcohol']}]

In [5]:
features_to_reverse_white = ['chlorides', 'volatile acidity']
features_to_reverse_red = ['residual sugar', 'chlorides']

def correct_reverse_log_transformation(df, features):
    '''
    This function reverses the logarithmic transformation applied to certain features in the dataframe. 
    This is useful for converting transformed data back to its original scale, often for interpretation or further analysis.    
    '''
    for feature in features:
        if feature in df.columns:
            df[feature] = np.exp(df[feature])
    return df

df_white_good_without_outliers = correct_reverse_log_transformation(df_white_good_without_outliers, features_to_reverse_white)
df_white_poor_without_outliers = correct_reverse_log_transformation(df_white_poor_without_outliers, features_to_reverse_white)

df_red_good_without_outliers = correct_reverse_log_transformation(df_red_good_without_outliers, features_to_reverse_red)
df_red_poor_without_outliers = correct_reverse_log_transformation(df_red_poor_without_outliers, features_to_reverse_red)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[feature] = np.exp(df[feature])


In [13]:
def get_receipes(top_three_features_list, df_lists):
    '''
    his function calculates and prints the 'recipes' for different types of wine, based on the top three features identified earlier. 
    It computes the mean and standard deviation for these features, using these to define a range that characterizes each wine type.
    '''
    for df in df_lists:
        pipeline_name = f"Model_{get_wine_str(df)}"
        for model in top_three_features_list:
            if pipeline_name in model:
                feature_list = model[pipeline_name]
                numeric_columns = check_numeric_columns(df)
                numeric_df = df[numeric_columns]

                means = numeric_df[feature_list].mean()
                stds = numeric_df[feature_list].std()

                ranges = {feature: (means[feature] - stds[feature], means[feature] + stds[feature]) for feature in feature_list}

                print(f"\nRecipes of {get_wine_str(df)} wine:")
                for feature, range_vals in ranges.items():
                    print(f"{feature.capitalize()}: {range_vals[0]:.2f} - {range_vals[1]:.2f}")


In [14]:
get_receipes(top_three_features_list, wine_quality_without_outliers_dfs) 


Recipes of White Wine Poor (Without Outliers) wine:
Volatile acidity: 0.20 - 0.42
Fixed acidity: 6.08 - 7.85
Residual sugar: 1.78 - 12.35

Recipes of White Wine Poor (Without Outliers) wine:
Density: 0.99 - 1.00
Ph: 3.03 - 3.31
Fixed acidity: 6.08 - 7.85

Recipes of White Wine Poor (Without Outliers) wine:
Volatile acidity: 0.20 - 0.42
Citric acid: 0.19 - 0.48
Ph: 3.03 - 3.31

Recipes of White Wine Poor (Without Outliers) wine:
Sulphates: 0.38 - 0.58
Chlorides: 0.03 - 0.08
Alcohol: 8.97 - 10.73
