# Category Comparison

You have a dataset from a new client. Each row represents a production batch. Each column is a "property" or "feature" of that production batch. You are interested in analyzing the target variable `out_speed`. In particular, you want to assess how the batch "properties" influence the `out_speed`.


## Question 1

a. Considering the categorical features, find all the "problematic properties" 
(feature values), if any, that result in a "significant" decrease in `out_speed`. 
How would you assess if a decrease in `out_speed` is "significant" for a particular 
sub-set of batches? 

**Note1:** a subset of batches is a subset of the dataset identified by the 
value of some features. 

**Note2:** do not manually cherry-pick properties, 
implement a search for "problematic properties" instead. 


b. Extend the search to include continuous features.

## Question 2

2. How would you approach finding not a single but a *combination of multiple properties* 
resulting in a decrease in `out_speed`?


# Instructions

For questions 1 and 2, please provide a complete runnable implementation. 
Code and tests should go (as much as possible) in importable Python module(s)
or library. 
The exploratory analysis and results should be presented in a Jupyter notebook 
that calls the library functions to perform the computation. 

For question 3, describe the approach you would use. You can provide code examples but you 
don't need to implement a full runnable solution.

Please illustrate your thought process, even mentioning ideas you discarded (if needed). Also, please discuss the assumptions you made
on the data, how did you choose the features 
and what are the trade-offs of your implementation.


1. We want to test your problem solving skills and the ability to structure a solution. The problem description is dry on purpose.

1. We greatly value if you can write clear, well-structured python code, possibly with tests to catch or prevent errors. We also encourage you to add descriptions and formatted text to describe what you did.

1. You should use Python 3.6+ plus any additional open source library you deem necessary. Please indicate what additional libraries you used, possibly with instructions for replicating your environment.


# Data

You can find the experimental data in a file [`ds1_cats.csv`](ds1_cats.csv) in the same folder as the notebook.




## Solution 

In [None]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport  # newer releases are published as `ydata-profiling` (import `ydata_profiling`)
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from catboost import Pool, CatBoostRegressor, cv

### Import the data and show the data

In [None]:
df=pd.read_csv('ds1_cats.csv')
df.head(5)

### Using ProfileReport library to explore the data

In [None]:
profile = ProfileReport(df, title="Pandas Profiling Report")
profile

## Question 1
Find the categorical columns in the dataset 

In [None]:
cat_list=df.select_dtypes(include=['object','category']).columns
len(cat_list)

The dataset has 9 categorical features. I now create a separate dataset, named `df_cat`, with only the 9 categorical features and `out_speed`.

In [None]:
df_cat=df[cat_list].copy()  # copy so that adding a column does not touch a view of df
df_cat['out_speed']=df['out_speed']
df_cat.head(5)

The following function counts the missing values in each column and plots them.

In [None]:
def plot_missing_value(df):
    """Plot a stacked bar chart of missing vs. present values per column."""
    list_of_col=df.columns
    y_1=df.isnull().sum().tolist()   # number of missing values per column
    y_2=df.notnull().sum().tolist()  # number of present values per column

    y_pos = np.arange(len(list_of_col))
    plt.figure(figsize=(10,5))
    plt.bar(y_pos, y_1, 0.50)
    plt.bar(y_pos, y_2, 0.50,bottom=y_1)
    plt.xticks(y_pos, list_of_col)
    plt.ylabel('number of samples')
    plt.xlabel('columns')
    plt.title('Number of missing values per column')
    for i, v in enumerate(y_1):
        plt.text(y_pos[i] - 0.10, v + 0.10, str(v))
    plt.show()

In [None]:
plot_missing_value(df_cat)

The chart shows that some columns have missing values. For example, `colb`, `fr_cu_pm_nb`, `tool`, `ff_cf`, and `sin_st` have 17 missing values, which is very few compared to the total number of samples, so I decided to remove the 17 rows with missing values.
I remove rows only for features where the number of missing values is small compared to the total number of samples. Here I remove the rows where `tool` is missing.

In [None]:
df_cat=df_cat[df_cat['tool'].notna()]
plot_missing_value(df_cat)

The chart above shows that `h2`, `gr_cat`, and `h3` have quite a lot of missing values, so I replace those missing values with "unknown". In contrast, I drop `r1_p_b` entirely because of its large number of missing values.

- `bar_char_cat` plots a bar chart of the value counts of a column
- `remove_column` removes a specific column from the dataframe
- `replace_missing_with_unknown` replaces the missing values of a column with "unknown"

In [None]:
def bar_char_cat(cat,df):
    df[cat].value_counts().sort_index().plot.bar()
    
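def remove_column(col,df):
    # `columns=` avoids the positional axis argument, which pandas 2.0 no longer accepts
    df=df.drop(columns=col)
    return df

def replace_missing_with_unknown(col,df):
    # work on a copy so the caller's dataframe (possibly a slice) is not modified
    df=df.copy()
    df[col]=df[col].fillna('unknown')
    return df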
    
# bar_char_cat('gr_cat',df_cat)
list_missing_cols=['h2','h3','r1_p_b','gr_cat']

for col in list_missing_cols:
    df_cat=replace_missing_with_unknown(col,df_cat)

In [None]:
df_cat=remove_column('r1_p_b',df_cat)
df_cat=replace_missing_with_unknown('h2',df_cat)

df_cat.head(5)
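
The cleaning steps above follow an implicit rule: drop rows when few values are missing, fill with "unknown" when many are missing, and drop the column when the number of missing values is very large. A minimal sketch of that rule as a single helper follows. The function name `plan_missing_value_handling` and the two thresholds are hypothetical choices for illustration, not values derived from the dataset.

```python
import numpy as np
import pandas as pd


def plan_missing_value_handling(df, drop_rows_below=0.05, drop_column_above=0.5):
    """Suggest an action per column based on its fraction of missing values.

    The thresholds are illustrative assumptions and should be tuned per dataset.
    """
    fraction = df.isna().mean()
    plan = pd.DataFrame({"n_missing": df.isna().sum(), "fraction": fraction})
    plan["action"] = "fill_unknown"
    plan.loc[fraction < drop_rows_below, "action"] = "drop_rows"
    plan.loc[fraction > drop_column_above, "action"] = "drop_column"
    plan.loc[fraction == 0, "action"] = "keep"
    return plan.sort_values("fraction", ascending=False)


# Toy data: the column names mirror the notebook, the values are made up.
toy = pd.DataFrame({
    "tool": ["t1"] * 99 + [np.nan],
    "h2": ["a"] * 70 + [np.nan] * 30,
    "r1_p_b": ["x"] * 10 + [np.nan] * 90,
    "out_speed": np.linspace(50, 100, 100),
})
plan = plan_missing_value_handling(toy)
print(plan)
```

Making the thresholds explicit turns the "small compared to the total number of samples" judgment into a parameter that can be discussed and tested.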

I decided to use the `catboost` library to model the relationship between the features and `out_speed`. The question asks for the "problematic properties" (feature values) that result in a significant decrease in `out_speed`. If I can identify the most important features, I can then look for the specific values of those features for which `out_speed` decreases.

I chose `catboost` because it handles categorical features directly in regression and provides a function to compute feature importance.

Before training, I create the feature matrix and the labels.

In [None]:
y=df_cat['out_speed']
X=remove_column('out_speed',df_cat)

In [None]:
categorical_features_indices =[0,1,2,3,4,5,6,7]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=42)
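
Hard-coded positions such as `[0,1,2,3,4,5,6,7]` silently break if a column is added, dropped, or reordered. One option, sketched here on made-up data, is to derive the positions from the column dtypes. The helper name `categorical_indices` is hypothetical, and treating every non-numeric column as categorical is an assumption that fits this dataset's string-valued features.

```python
import pandas as pd


def categorical_indices(X):
    """Return the positional indices of the non-numeric columns of X."""
    return [
        position
        for position, col in enumerate(X.columns)
        if not pd.api.types.is_numeric_dtype(X[col])
    ]


# Toy feature matrix; only the column names are borrowed from the notebook.
X_toy = pd.DataFrame({
    "volume": [1.0, 2.5, 3.1],
    "tool": ["t1", "t2", "t1"],
    "gaps": [3, 1, 2],
    "sin_st": ["stecca", "other", "stecca"],
})
toy_indices = categorical_indices(X_toy)
print(toy_indices)
```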


In [None]:
def perform_model(X_train, y_train,X_test, y_test):
    model = CatBoostRegressor(
        random_seed = 400,
        loss_function = 'RMSE',
        iterations=100,
    )
    
    model.fit(
        X_train, y_train,
        cat_features = categorical_features_indices,
        eval_set=(X_test, y_test),
        verbose=False
    )
    
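    # `model.score` returns R^2 for a regressor, so compute the RMSE explicitly
    for name, X_part, y_part in (("training", X_train, y_train), ("test", X_test, y_test)):
        rmse = np.sqrt(np.mean((model.predict(X_part) - y_part) ** 2))
        print(f"RMSE on {name} data: {rmse:.4f}")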
    
    return model

In [None]:
def generate_importance_ranking(model):
    feature_score = pd.DataFrame(list(zip(X.dtypes.index, model.get_feature_importance(Pool(X, label=y, cat_features=categorical_features_indices)))),
                columns=['Feature','Score'])

    feature_score = feature_score.sort_values(by='Score', ascending=False, inplace=False, kind='quicksort', na_position='last')
    
    plt.rcParams["figure.figsize"] = (10,5)
    ax = feature_score.plot('Feature', 'Score', kind='bar', color='c')
    ax.set_title("Catboost Feature Importance Ranking", fontsize = 14)
    ax.set_xlabel('')

    rects = ax.patches

    labels = feature_score['Score'].round(2)

    for rect, label in zip(rects, labels):
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2, height + 0.35, label, ha='center', va='bottom')

    plt.show()

Train and evaluate the model on the dataset 

In [None]:
model=perform_model(X_train, y_train,X_test, y_test)

Generate the feature importance ranking and plot it 

In [None]:
generate_importance_ranking(model)

The plot above shows that `gr_cat` is the most important feature, so I compute the mean `out_speed` for each `gr_cat` value.

In [None]:
df_cat[['gr_cat','out_speed']].groupby('gr_cat').mean()

In [None]:
df_cat[['sin_st','out_speed']].groupby('sin_st').mean()

The tables above show that the `gr_cat` value (56.0, 66.0] is associated with a markedly lower mean `out_speed`. In the same way, `out_speed` is lower when `sin_st` is "stecca".
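
Comparing group means shows the direction of an effect but not whether it is "significant". One way to make this precise, and to avoid cherry-picking, is to test every feature value against all remaining batches and correct for the number of tests. The sketch below uses a one-sided Mann-Whitney U test with a Bonferroni correction; this is a substitute technique, not what the notebook does above. The function name, the `alpha` and `min_size` defaults, and the synthetic data (including the `shift` column) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu


def find_problematic_values(df, features, target="out_speed", alpha=0.05, min_size=20):
    """Test each feature value against all other batches for a lower target."""
    df = df.dropna(subset=[target])
    rows = []
    for col in features:
        for val in df[col].dropna().unique():
            mask = df[col] == val
            inside, outside = df.loc[mask, target], df.loc[~mask, target]
            if len(inside) < min_size or len(outside) < min_size:
                continue  # too few batches for a meaningful test
            # One-sided test: does the target tend to be lower inside the subset?
            p_value = mannwhitneyu(inside, outside, alternative="less").pvalue
            rows.append({
                "feature": col,
                "value": val,
                "n": len(inside),
                "median_diff": inside.median() - outside.median(),
                "p_value": p_value,
            })
    columns = ["feature", "value", "n", "median_diff", "p_value"]
    result = pd.DataFrame(rows, columns=columns)
    # Bonferroni correction: one test was run per row of the result
    result["significant"] = result["p_value"] < alpha / max(len(result), 1)
    return result.sort_values("p_value").reset_index(drop=True)


# Synthetic data with one planted problematic value: tool == "b".
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "tool": rng.choice(["a", "b", "c"], size=600),
    "shift": rng.choice(["day", "night"], size=600),
})
demo["out_speed"] = rng.normal(100, 5, size=600) - 10 * (demo["tool"] == "b")

report = find_problematic_values(demo, ["tool", "shift"])
print(report)
```

The Bonferroni correction is conservative. With many feature values, a false discovery rate procedure may be a better trade-off, and `median_diff` helps separate statistically significant effects from practically relevant ones.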

## Question 1/b:

For the second part of Question 1, I consider the numerical features along with the categorical features and go through the same process:

- find the columns with missing values 
- handle the missing values 
- check dataset again

In [None]:
plot_missing_value(df)

In [None]:
list_missing_cols=['h2','h3','r1_p_b','gr_cat']
df_all=df[df['tool'].notna()].copy()  # copy so later edits do not act on a slice of df

for col in list_missing_cols:
    df_all=replace_missing_with_unknown(col,df_all)

df_all=remove_column('r1_p_b',df_all)


In [None]:
plot_missing_value(df_all)

Now I will create feature matrix and label from the dataset

In [None]:
y=df_all['out_speed']
X=remove_column('out_speed',df_all)
X=remove_column('Unnamed: 0',X)

In [None]:
X.head(5)

In [None]:

categorical_features_indices =[3,7,8,9,10,12,13,14]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Now I train the model with `CatBoostRegressor`.

In [None]:
model=perform_model(X_train, y_train,X_test, y_test)

In [None]:
generate_importance_ranking(model)

The feature importance ranking shows that `lordo_speed` has a strong influence on `out_speed`.

In [None]:
plt.plot(df_all['out_speed'].values,df_all['lordo_speed'].values,'bo')
plt.ylabel('lordo_speed')
plt.xlabel('out_speed')
plt.title('Correlation between out_speed and lordo_speed')

The plot above shows that `out_speed` increases with `lordo_speed`.
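
The scatter plot shows a trend but not which ranges of a continuous feature are "problematic". A possible way to extend a value-level search to continuous features is to discretize them into quantile bins, so that each bin can be treated like a categorical value. This is a sketch on synthetic data; the helper name `bin_continuous` and the choice of four bins are assumptions.

```python
import numpy as np
import pandas as pd


def bin_continuous(df, columns, q=4):
    """Return a copy of df in which each listed column is replaced by quantile bins."""
    binned = df.copy()
    for col in columns:
        # duplicates="drop" merges bins when many rows share the same value
        binned[col] = pd.qcut(df[col], q=q, duplicates="drop").astype(str)
    return binned


# Synthetic data in which out_speed grows with lordo_speed.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"lordo_speed": rng.uniform(50, 150, size=200)})
toy["out_speed"] = 0.8 * toy["lordo_speed"] + rng.normal(0, 3, size=200)

toy_binned = bin_continuous(toy, ["lordo_speed"])
# Mean out_speed per lordo_speed bin
bin_means = toy_binned.groupby("lordo_speed")["out_speed"].mean()
print(bin_means)
```

Quantile bins keep the group sizes balanced, which matters for any later per-group comparison. The trade-off is that bin edges are data-dependent and a real threshold may fall inside a bin.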

## Question 2:
To find multiple features that together result in a decrease in `out_speed`, I can take the most important features. For example, I can take the top 5 features and analyze them.

In [None]:
generate_importance_ranking(model)

So the top 5 most important features are `lordo_speed`, `volume`, `gaps`, `mano`, and `h1`.

In [None]:
grouped_df=df_all[['lordo_speed','volume','gaps','mano','h1','out_speed']].groupby('out_speed')

for key, item in grouped_df:
    # `item` already holds the rows of the group, so `get_group` is not needed
    print(item, "\n\n")
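
Grouping by `out_speed` lists the rows for each target value but does not rank combinations of properties. A possible alternative, sketched below on synthetic data, is to enumerate pairs of features, group the batches by each pair of values, and rank the pairs by how far their mean `out_speed` falls below the overall mean. The function name `rank_feature_pairs`, the `min_size` default, and the toy values are assumptions; continuous features are assumed to be binned first.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def rank_feature_pairs(df, features, target="out_speed", min_size=20):
    """Rank pairs of feature values by the gap between their mean target and the overall mean."""
    overall = df[target].mean()
    frames = []
    for a, b in combinations(features, 2):
        stats = (
            df.groupby([a, b], observed=True)[target]
            .agg(n="count", mean="mean")
            .reset_index()
        )
        stats = stats[stats["n"] >= min_size]  # ignore tiny subsets
        stats["condition"] = (
            a + "=" + stats[a].astype(str) + " & " + b + "=" + stats[b].astype(str)
        )
        stats["mean_diff"] = stats["mean"] - overall
        frames.append(stats[["condition", "n", "mean_diff"]])
    if not frames:
        return pd.DataFrame(columns=["condition", "n", "mean_diff"])
    ranked = pd.concat(frames, ignore_index=True)
    return ranked.sort_values("mean_diff").reset_index(drop=True)


# Synthetic data: only the combination tool == "b" and sin_st == "y" is slow.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "tool": rng.choice(["a", "b"], size=800),
    "sin_st": rng.choice(["x", "y"], size=800),
    "volume": rng.uniform(0, 1, size=800),
})
demo["volume_bin"] = pd.qcut(demo["volume"], q=2, labels=["low", "high"]).astype(str)
slow = (demo["tool"] == "b") & (demo["sin_st"] == "y")
demo["out_speed"] = rng.normal(100, 2, size=800) - 15 * slow

pairs = rank_feature_pairs(demo, ["tool", "sin_st", "volume_bin"])
print(pairs.head())
```

The same loop generalizes to triples through `combinations(features, 3)`, at the cost of smaller groups and many more comparisons, so a significance test with a multiple-testing correction becomes more important as the order grows.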

## Tests 
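
A minimal sketch of what tests for the cleaning helpers might look like, written as plain `pytest`-style functions. The helpers are redefined here in a non-mutating form so that the block is self-contained; in the project they would be imported from the module instead. The test names and toy values are assumptions.

```python
import numpy as np
import pandas as pd


def remove_column(col, df):
    return df.drop(columns=col)


def replace_missing_with_unknown(col, df):
    df = df.copy()
    df[col] = df[col].fillna("unknown")
    return df


def test_replace_missing_with_unknown_fills_only_target_column():
    df = pd.DataFrame({"h2": ["a", np.nan], "h3": [np.nan, "b"]})
    result = replace_missing_with_unknown("h2", df)
    assert result["h2"].tolist() == ["a", "unknown"]
    assert result["h3"].isna().sum() == 1  # other columns are untouched
    assert df["h2"].isna().sum() == 1      # the input is not modified


def test_remove_column_drops_exactly_one_column():
    df = pd.DataFrame({"r1_p_b": [1, 2], "out_speed": [3.0, 4.0]})
    result = remove_column("r1_p_b", df)
    assert list(result.columns) == ["out_speed"]
    assert "r1_p_b" in df.columns          # the input is not modified
```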