# Python Scientific Data Analysis
## Course's Final Project
### Barak Daniel - 204594329

## Installations needed for the program to run:

In [None]:
!pip install numpy
!pip install pandas
!pip install seaborn
!pip install matplotlib
!pip install scikit-learn
!pip install scipy
!pip install pydotplus
!conda install -y python-graphviz


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.naive_bayes import GaussianNB
import sklearn as skl
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pydotplus
import random
from sklearn import tree
from io import StringIO
from IPython.display import Image
from sklearn.inspection import permutation_importance

# Intro

### Overview
This project is about targeted marketing. In the given data set each row represents a customer, and the features describe the customer's general status (age, marital status, etc.) and their shopping behavior.
In this section of the project I will go over the data set and the goal, to understand the whole process before starting the actual work on the data.

### So what exactly is targeted marketing?
Targeting in marketing is a strategy that breaks a large market into smaller segments to concentrate on a specific group of customers within that audience. It defines a segment of customers based on their unique characteristics and focuses solely on serving them.
Instead of trying to reach an entire market, a brand uses target marketing to put their energy into connecting with a specific, defined group within that market.
For this reason I'll break down all the features in the given dataset ('customers3.csv') and explain each of them.


### Feature data breakdown
The following features are included in the data set given in 'customers3.csv':
- ID - Unique ID to each customer
- Gender - The gender of the customer
- Ever_Married - Indicates if the customer was ever married
- Age - The age of the customer
- Graduated - Has the customer graduated high school
- Profession - The profession of the customer
- Work_Experience - The number of years of the customer's experience in their profession
- Spending_Score - The spending habits of the customer, classified into 3 categories
- Family_Size - The number of family members in the customer's household
- Shop_Day - The day of the week on which the customer shops the most
- Shop_Other - Normalized measure of customer deviation from average store customer spending on non-specified products
- Shop_Dairy - Normalized measure of customer deviation from average store customer spending on dairy products
- Shop_Household - Normalized measure of customer deviation from average store customer spending on household products
- Shop_Meat - Normalized measure of customer deviation from average store customer spending on meat products
- Group - The target group which the customer belongs to

### Feature type breakdown
- ID - Numerical discrete (Integer)
- Gender - Categorical (Male/Female)
- Ever_Married - Categorical nominal (Yes/No)
- Age - Numerical continuous (Integer)
- Graduated - Categorical (Yes/No)
- Profession - Categorical nominal
- Work_Experience - Numerical discrete (Integer)
- Spending_Score - Categorical ordinal (Low/Average/High)
- Family_Size - Numerical discrete (Integer)
- Shop_Day - Categorical ordinal (Sunday, Monday, ..., Saturday)
- Shop_Other - Numerical continuous (Double)
- Shop_Dairy - Numerical continuous (Double)
- Shop_Household - Numerical continuous (Double)
- Shop_Meat - Numerical continuous (Double)
- Group - Categorical nominal
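
The ordinal/nominal distinction in the list above matters once the columns are encoded. As a minimal sketch (the toy frame and its values are made up for illustration and are not taken from 'customers3.csv'), pandas can record the order of an ordinal feature explicitly:

```python
import pandas as pd

# Toy data standing in for the real file (values are illustrative only)
toy = pd.DataFrame({"Spending_Score": ["Low", "High", "Average", "Low"]})

# An ordered categorical keeps the Low < Average < High ranking
levels = ["Low", "Average", "High"]
toy["Spending_Score"] = pd.Categorical(toy["Spending_Score"], categories=levels, ordered=True)

# Integer codes follow the declared order: Low=0, Average=1, High=2
codes = toy["Spending_Score"].cat.codes.tolist()
print(codes)  # [0, 2, 1, 0]

# Comparisons respect the order, which a nominal feature would not support
is_above_low = (toy["Spending_Score"] > "Low").tolist()
print(is_above_low)  # [False, True, True, False]
```

For a nominal feature such as Profession the codes are only labels, so their numeric order carries no meaning.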

In [None]:
df = pd.read_csv('customers3.csv')
count = df.count()

print("The number of rows is: {}".format(len(df.index)))
print("The number of columns is: {}".format(len(df.columns)))
print("The number of cells is: {}".format(len(df.index) * len(df.columns)))
print("The number of cells with concrete values is: {}".format(count.sum()))
print("The number of cells without concrete values is: {}\n".format(len(df.index) * len(df.columns) - count.sum()))

print("\nThe number of concrete values for each feature:")
df.count()


### The size of the data set is:
- 8120 rows of customer data (+1 for the header row)
- 15 columns for the features
- 8120*15 = 121,800 cells, but we can see that not all of them have concrete values.

### Missing values:
While inspecting the dataset I encountered many cells with missing values.
After reading the dataset, these NaN values need to be handled: for each feature with missing data, I'll examine it and decide which method is best for dealing with those values (mode, mean, median, removal, etc.).
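
As a rough illustration of the filling options named above (mean, median, mode, removal), here is a minimal sketch on a made-up column. The values are assumptions for demonstration only and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy column with gaps (values are illustrative only)
s = pd.Series([1.0, 2.0, 2.0, np.nan, 10.0, np.nan])

filled_mean = s.fillna(s.mean())      # mean of [1, 2, 2, 10] is 3.75
filled_median = s.fillna(s.median())  # median of [1, 2, 2, 10] is 2.0
filled_mode = s.fillna(s.mode()[0])   # most frequent value is 2.0
dropped = s.dropna()                  # removal keeps 4 of the 6 rows

print(filled_mean.tolist())
print(filled_median.tolist())
print(filled_mode.tolist())
print(len(dropped))
```

The outlier (10.0) pulls the mean well above the median, which is one reason the choice between them is checked per feature rather than fixed in advance.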

### Other types of missing values:
The values that are not missing must also be validated. After going through the features, the possible problems are a numeric value outside the range given in the feature definition, a negative value where only non-negative values make sense, etc.

After going through the dataset, these are the features that need to be fixed:
- Shop_Day - Must contain values from 1 to 7, but there are values outside this range; they will be treated as missing and filled by the same method as the feature's other missing values
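
One way to express the range check described above is sketched here on a toy column. The sample values are invented; only the 1 to 7 range comes from the feature definition.

```python
import numpy as np
import pandas as pd

# Toy Shop_Day column containing invalid entries (illustrative values)
toy = pd.DataFrame({"Shop_Day": [1.0, 7.0, 0.0, 22.0, np.nan, 3.0]})

# Flag values that are present but fall outside the 1..7 range
invalid = toy["Shop_Day"].notna() & ~toy["Shop_Day"].between(1, 7)
n_invalid = int(invalid.sum())

# Treat them as missing so the usual filling step can handle them
toy.loc[invalid, "Shop_Day"] = np.nan
n_missing = int(toy["Shop_Day"].isna().sum())

print(n_invalid, n_missing)  # 2 3
```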


## Initial Data Analysis

As we saw above, there are a lot of missing values and categorical values that need to be transformed before the data can be analyzed completely.
In this section of the project I will deal with those values. For each feature, the correlation with the target group is checked for every filling method, and the method that gives the best result is the one used for that feature.

In [None]:
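def fillMissingValues(df, target, useRound=0):
    """Fill NaNs in df[target] with its mean, median or mode, keeping the
    option whose filled column has the highest correlation with Group.
    Pass a non-zero useRound to round the mean (for integer-valued features)."""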
    group_list = ["A", "B", "C", "D"]
    df["Group_Transformed"] = pd.Categorical(df.Group, ordered=True, categories=group_list).codes + 1

    print("The correlation before filling the data is: ",df["Group_Transformed"].corr(df[target]))

    aggMin = df[target].min()
    aggMax = df[target].max()
    if(useRound == 0):
        aggMean = df[target].mean()
    else:
        aggMean = round(df[target].mean())
    aggMedian = df[target].median()
    aggMode = df[target].mode()[0]

    print("\n{} aggregations:\nMean = {}\nMedian = {}\nMode = {}".format(target, aggMean, aggMedian, aggMode))
    print("Min = {}  --- Max = {}\n".format(aggMin, aggMax))

    targetMean = target+"_mean"
    targetMedian = target+"_median"
    targetMode = target+"_mode"

    df[targetMean] = df[target].fillna(aggMean)
    df[targetMedian] = df[target].fillna(aggMedian)
    df[targetMode] = df[target].fillna(aggMode)

    corrMean = df["Group_Transformed"].corr(df[targetMean])
    corrMedian =  df["Group_Transformed"].corr(df[targetMedian])
    corrMode = df["Group_Transformed"].corr(df[targetMode])

    print("Corr with mean: ", corrMean)
    print("Corr with median: ", corrMedian)
    print("Corr with mode: ", corrMode)

    if(corrMean > corrMedian):
        if(corrMean > corrMode):
            df[target] = df[targetMean]
        else:
            df[target] = df[targetMode]
    elif(corrMedian > corrMode):
        df[target] = df[targetMedian]
    else:
        df[target] = df[targetMode]

    # Remove the helper columns in place (drop returns None when inplace=True)
    df.drop(['Group_Transformed', targetMean, targetMedian, targetMode], axis=1, inplace=True)

The first feature whose missing data I'll handle is 'Gender'. Since fewer than 60 values are missing and the feature is binary, the best method is to check the distribution of its values over the data set and then fill the missing values according to that distribution.


In [None]:
male = 0
female = 0
for index,row in df.iterrows():
    if(row['Gender'] == 'Male'):
        male += 1
    elif(row['Gender'] == 'Female'):
        female += 1

print("Female percentage", female/(male+female))
print("Male percentage", male/(male+female))

So we can see now that females represent ~45.25% of the rows in the data set and males ~54.75%.
Now we fill the missing values with this distribution:

In [None]:
nans = df['Gender'].isna()
length = sum(nans)
replacement = random.choices(['Male', 'Female'], weights=[.5475, .4525], k=length)
df.loc[nans,'Gender'] = replacement

df.Gender.count()

The next feature I'll be dealing with is Ever_Married which is also a binary answer of yes or no.


In [None]:
married = 0
notmarried = 0
for index,row in df.iterrows():
    if(row['Ever_Married'] == 'Yes'):
        married += 1
    elif(row['Ever_Married'] == 'No'):
        notmarried += 1


married_list = ["Yes", "No"]
df["Ever_Married_Transformed"] = pd.Categorical(df.Ever_Married, ordered=True, categories=married_list).codes + 1
group_list = ["A", "B", "C", "D"]
df["Group_Transformed"] = pd.Categorical(df.Group, ordered=True, categories=group_list).codes + 1

print(df["Group_Transformed"].corr(df["Ever_Married_Transformed"]))

print("Percentage of married = ", (married/(married+notmarried)))
print("Percentage of not married = ", (notmarried/(married+notmarried)))

# Remove the temporary helper columns
df = df.drop(["Ever_Married_Transformed", "Group_Transformed"], axis=1)

The correlation to the group is low and therefore I can use the same method as I did in 'Gender'.
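
The "fill by observed distribution" idea used for 'Gender' can be wrapped in a small helper. This is only a sketch: `fill_by_distribution` is a hypothetical name, it uses NumPy's random generator instead of `random.choices`, and the toy column is invented.

```python
import numpy as np
import pandas as pd

def fill_by_distribution(series, seed=0):
    """Fill NaNs by sampling from the observed category frequencies."""
    rng = np.random.default_rng(seed)
    # value_counts skips NaN by default, so these are the observed shares
    freqs = series.value_counts(normalize=True)
    missing = series.isna()
    sampled = rng.choice(freqs.index.to_numpy(), size=int(missing.sum()), p=freqs.to_numpy())
    filled = series.copy()
    filled.loc[missing] = sampled
    return filled

# Toy column (illustrative values): 75% "Yes", 25% "No", two gaps
toy = pd.Series(["Yes", "No", "Yes", np.nan, "Yes", np.nan])
filled = fill_by_distribution(toy)
print(filled.isna().sum())  # 0
```

Computing the weights from the column itself avoids copying rounded percentages by hand into the fill step.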

In [None]:
nans = df['Ever_Married'].isna()
length = sum(nans)
replacement = random.choices(['Yes', 'No'], weights=[.5859, .4141], k=length)
df.loc[nans,'Ever_Married'] = replacement

df.Ever_Married.count()

The next feature to deal with is 'Age', this feature fits the fillMissingValues() function I created above, therefore I'll use it here.

In [None]:
fillMissingValues(df, "Age")
df["Age"].count()

As we can see, the correlations are nearly the same, but we will still prefer the option with the better correlation even if it is only slightly higher.

The next feature is Graduated, which is missing a few values, so I'll handle it in the same way as the other binary features above.

In [None]:
grad = 0
ungrad = 0
for index,row in df.iterrows():
    if(row['Graduated'] == 'Yes'):
        grad += 1
    elif(row['Graduated'] == 'No'):
        ungrad += 1

print("Graduated percentage", grad/(grad+ungrad))
print("Haven't graduated percentage", ungrad/(grad+ungrad))

nans = df['Graduated'].isna()
length = sum(nans)
# Weight by the observed counts so the fill follows the data's own distribution
replacement = random.choices(['Yes', 'No'], weights=[grad, ungrad], k=length)
df.loc[nans,'Graduated'] = replacement

df.Graduated.count()

The next feature is Profession. Unlike the features dealt with so far, it is a nominal categorical feature with many possible values, so to fill this column without losing the data in the rest of each row, I will fill the missing values with the mode.

In [None]:
proMode = df.Profession.mode()[0]
df["Profession"] = df.Profession.fillna(proMode)

The next feature to deal with is Work_Experience. Because there is an Age column, it may be possible to fill the missing values with deductive imputation, so I'll check the correlation between the two and decide how to fill this feature.

In [None]:
df.Age.corr(df.Work_Experience)

Because the correlation is low, the Age feature isn't a good enough basis for filling those values, so now I'll check the aggregate-based options (mean, median, mode) instead.

In [None]:
fillMissingValues(df, "Work_Experience")
df["Work_Experience"].count()

As we can see the mean option for filling the values is the best we can get here and is close to the original correlation before filling the missing data.

In [None]:
fillMissingValues(df, "Family_Size", 1)

Both mean and median gave the same result and got better correlation than mode, that is why their value will be used for filling the missing values.

For the next feature, "Shop_Day", before filling the missing values I need to deal with the wrong values the feature contains:
it must contain values from 1 to 7, but there are values outside this range.

In [None]:
print(df.Shop_Day.unique())

The values that need to be dealt with first are 0 and 22.

In [None]:
# Mark the out-of-range values (0 and 22) as missing
df.loc[df.Shop_Day.isin([0, 22]), "Shop_Day"] = np.nan
print(df.Shop_Day.unique())

Now I can fill the missing values by checking the best mathematical operation without wrong values.

In [None]:
fillMissingValues(df, "Shop_Day", useRound=1)
df["Shop_Day"].count()

For the Shop_Day filling options we can see that the mean is the better option here and very close to the original correlation.

Moving on to the next feature, "Shop_Dairy", I will check all the fitting mathematical operations as well.

In [None]:
fillMissingValues(df, "Shop_Dairy")
df["Shop_Dairy"].count()

The mean and median are pretty close in their correlation to the group, still mean is higher so that is why it will be used to fill the values. 

For the next two features, "Shop_Household" and "Shop_Meat", I will use the same method as for Shop_Dairy.

In [None]:
fillMissingValues(df, "Shop_Household")
df["Shop_Household"].count()

In [None]:
fillMissingValues(df, "Shop_Meat")
df["Shop_Meat"].count()

Both "Shop_Household" and "Shop_Meat" will get the best result by filling with the mean value of the feature.

Now that all the dataset is fixed I'll save it to a new csv.

In [None]:
print(df.count())
df.to_csv("customer3_fixed.csv", index=False)

## Exploratory Data Analysis

Feature correlation:
- Turning categorical features into a numeric representation
- Feature histograms
- Plotting the correlation map
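
When turning categories into numbers with `pd.Categorical(...).codes`, every label has to appear in the `categories` list. The sketch below uses an invented column and a deliberately wrong category name ("Nope") to show what happens otherwise, assuming a pandas version that maps unknown labels to the code -1:

```python
import pandas as pd

# Toy answers with one missing value (illustrative only)
answers = pd.Series(["Yes", "No", "Yes", None])

# All labels listed: Yes -> 1, No -> 2, missing -> 0 (code -1 plus 1)
codes_ok = pd.Categorical(answers, categories=["Yes", "No"], ordered=True).codes + 1

# "No" is absent from the category list, so it receives the missing code too
codes_typo = pd.Categorical(answers, categories=["Yes", "Nope"], ordered=True).codes + 1

print(codes_ok.tolist())    # [1, 2, 1, 0]
print(codes_typo.tolist())  # [1, 0, 1, 0]
```

A quick `df[col].unique()` before encoding is a cheap way to confirm that the category list matches the labels actually present.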

In [None]:
group_list = ["Male", "Female"]
df["Gender"] = pd.Categorical(df.Gender, ordered=True, categories=group_list).codes + 1

group_list = ["Yes", "No"]
df["Ever_Married"] = pd.Categorical(df.Ever_Married, ordered=True, categories=group_list).codes + 1

group_list = ["Yes", "No"]
df["Graduated"] = pd.Categorical(df.Graduated, ordered=True, categories=group_list).codes + 1

# Profession is nominal, so the code order is just the order of appearance
group_list = list(df.Profession.unique())

df["Profession"] = pd.Categorical(df.Profession, ordered=True, categories=group_list).codes + 1

group_list = ["Low", "Average", "High"]
df["Spending_Score"] = pd.Categorical(df.Spending_Score, ordered=True, categories=group_list).codes + 1

group_list = ["A", "B", "C", "D"]
df["Group"] = pd.Categorical(df.Group, ordered=True, categories=group_list).codes + 1

In [None]:
hist = df.hist(bins=20, figsize=(20,20))

In [None]:
corr = df.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(21, 17))
cmap = sns.diverging_palette(200, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap = cmap, center=0, annot = True , square = True , linewidths = .6, vmin = -1, vmax = 1 )
plt.show()

What can we learn from the correlation heatmap?

- ID: this is just the identifier of the observation, so it should have no relation to the other features.

- Gender: this feature is not well correlated with Group, so it can't teach us much.

- Ever_Married: compared to the other features, there is a small negative correlation with Group, which indicates that people who were ever married mostly fall into a higher group.

- Age: this is the 3rd best correlation with Group in the dataset; it tells us that the younger the customer, the more likely they are to be in a higher group.

- Graduated: this feature's correlation with Group is average compared to the other features, and this small correlation indicates that people who graduated are more likely to be in a higher group.

- Profession: this feature also has an average correlation with Group compared to the other features, but we can't learn much from it since it is a nominal feature and its codes can't really indicate a behavior.

- Work_Experience: the lowest correlation with Group of all the features; we can't learn anything from it about the target group.

- Spending_Score: this feature is not strongly correlated with the Group target feature, but it still has some negative correlation, which indicates that the lower the customer's spending score, the more likely they are to be in a higher group.

- Family_Size: this feature has an average correlation with Group, and we can learn from it that as the family size grows, there is more chance of being in a higher group.

- Shop_Day: the favorite shopping day of the week has a very small correlation with Group and therefore we can't learn much from it.

- Shop_Other: the 2nd best feature for correlation with Group; we learn from it that the more the customer spends on things other than dairy, meat and household products, the higher the group they will probably be in.

- Shop_Dairy: an average correlation compared to the other features, but we can still learn that the more the customer spends on dairy, the more likely they are to be in a higher group.

- Shop_Household: this feature has the best correlation with Group in the whole dataset; we can learn that the more the customer spends on household products, the more likely they are to be in a higher group.

- Shop_Meat: this feature has an average correlation compared to the other features and indicates a small connection between the amount a customer spends on meat and the group: the more they spend, the higher the group they will probably be in.

Additional insights about correlations not related to the target feature:
- Age and Ever_Married have a strong correlation, which of course makes sense. I considered removing one of them because of their high correlation, but decided to keep them both at this stage.

- Spending_Score has a high correlation with Age and Ever_Married, which are correlated with each other as explained above, but here again I don't see a need to drop one of the features because the correlation isn't high enough and the data is not similar.
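
Pairs like the ones discussed above can also be listed programmatically instead of being read off the heatmap. The following is a minimal sketch on synthetic data; the column names `a`, `b`, `c` and the 0.8 threshold are arbitrary assumptions, not values from the project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),  # nearly a rescaled copy of a
    "c": rng.normal(size=200),                      # unrelated noise
})

corr = toy.corr().abs()
# Keep the strict upper triangle so each pair is listed only once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong = pairs[pairs > 0.8]  # 0.8 is an arbitrary cut-off
print(strong)
```

Whether a flagged pair should actually lose one of its features remains a judgment call, as in the discussion above.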

Now I'll display the most relevant features in graphs:

In [None]:
group = df['Group']
features = ['Shop_Household', 'Shop_Other', 'Age']

for feat in features:
    temp  = df[feat]
    cor = group.corr(temp)
    plt.title('Group {} correlation {}'.format(feat,cor))
    plt.xlabel(feat)
    plt.ylabel('Group')
    fig = plt.gcf()
    fig.set_size_inches(8, 6)
    plt.scatter(temp, group)
    plt.plot(np.unique(temp), np.poly1d(np.polyfit(temp, group, 1))(np.unique(temp)), color='red', linewidth = 2.9)
    plt.show()

From the 3 graphs above we can clearly see that there is a linear correlation between each of these features and the Group feature.
Now I will show those features in a pair plot to see if there are more insights about their data.

In [None]:
pairPlotDf = df[['Shop_Household', 'Shop_Other', 'Age', 'Group']]
sns.set_context(rc={"axes.labelsize":20})
pairPlot = sns.pairplot(pairPlotDf, hue='Group', palette='Set1', corner=True)
pairPlot.fig.set_size_inches(20, 20)
pairPlot._legend.remove()
plt.legend(title='Group', loc=(2., 1.5), labels=['A', 'B', 'C', 'D'], prop={'size': 18}, title_fontsize='21')

plt.show()

As the heatmap suggested, the features most correlated with the Group target feature also give the clearest classification.
In the pairplot above, Shop_Other and Shop_Household give the best separation of the groups in the scatter.

The last feature I want to check in this part is the Profession because it is hard to know anything from it by looking at the correlation.

In [None]:
pairPlotDf = df[['Shop_Household', 'Shop_Other', 'Profession', 'Group']]
sns.set_context(rc={"axes.labelsize":20})
pairPlot = sns.pairplot(pairPlotDf, hue='Group', palette='Set1', corner=True)
pairPlot.fig.set_size_inches(20, 20)
pairPlot._legend.remove()
plt.legend(title='Group', loc=(2., 1.5), labels=['A', 'B', 'C', 'D'], prop={'size': 18}, title_fontsize='21')

plt.show()

Now we can really see that both the heatmap and the pairplot indicate that the Profession feature is not correlated with the Group feature, and since it is not strongly related to any other feature, there is no information we can get from it at this stage.

## Classification Model



### Task 1
To find the best features for my model, I'll train the model on a different pair of features each time and pick the pair with the best accuracy.

In [None]:
features = ["Gender", "Ever_Married", "Age", "Graduated", "Work_Experience", "Spending_Score", "Family_Size", "Shop_Day", "Shop_Other", "Shop_Dairy", "Shop_Household", "Shop_Meat"]
bestFeats = []
bestScore = 0

for i in range(len(features)):
    for j in range(len(features)):
        if(i != j):
            x_data = df[[features[i], features[j]]]
            y_group = df.Group
            x_train, x_test, y_train, y_test = train_test_split(x_data, y_group, train_size=0.8, random_state=1)

            # GaussianNB is already imported at the top of the notebook
            model = GaussianNB()
            model.fit(x_train, y_train)
            y_model = model.predict(x_test)

            # Keep the pair with the highest test accuracy seen so far
            score = metrics.accuracy_score(y_test, y_model)
            if(bestScore < score):
                bestScore = score
                bestFeats = [features[i], features[j]]
                
print(bestScore, bestFeats)


In [None]:
# Map the numeric group codes back to their letter labels
df["Group"] = df["Group"].map({1: "A", 2: "B", 3: "C", 4: "D"})
y = df.Group
X = df[['Shop_Meat', 'Shop_Other']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=1)

clf = GaussianNB()
clf = clf.fit(X_train, y_train)

# Predict the response for test dataset

y_pred = clf.predict(X_test)
print(metrics.classification_report(y_test, y_pred))

hueorder = clf.classes_

x_min, x_max = X_train.loc[:, 'Shop_Meat'].min() -1, X_train.loc[:, 'Shop_Meat'].max() +1
y_min, y_max = X_train.loc[:, 'Shop_Other'].min() -1, X_train.loc[:, 'Shop_Other'].max() +1

xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.2), np.arange(y_min, y_max, 0.2))

Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Z = np.argmax(Z, axis=1)


# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap='Set1', alpha=0.5)
plt.clim(0, len(clf.classes_)+3)
sns.scatterplot(data=df[::5], hue='Group', hue_order=hueorder, palette='Set1', x='Shop_Meat', y='Shop_Other')
fig = plt.gcf()
fig.set_size_inches(15, 10)

plt.show()

As we can see above, the model classifies the A and B values of the Group target feature fairly well; the regions for C and D are also visible, although they are classified less accurately.
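
A per-class view like the one described above can also be put into numbers with a confusion table. This is a minimal sketch with hypothetical labels (they are not the model's actual predictions), using `pd.crosstab`:

```python
import pandas as pd

# Hypothetical true and predicted labels (illustrative only)
y_true = pd.Series(["A", "A", "B", "C", "D", "C"], name="actual")
y_hat = pd.Series(["A", "B", "B", "D", "D", "C"], name="predicted")

# Rows are the actual groups, columns the predicted groups
confusion = pd.crosstab(y_true, y_hat)
print(confusion)

# Per-class recall: correct predictions divided by the actual class count
recall = {g: confusion.loc[g, g] / confusion.loc[g].sum() for g in confusion.index}
print(recall)
```

With the real `y_test` and `y_pred` in place of the toy series, the off-diagonal cells show which groups are confused with each other.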

To get further information I will display all the predictions that were wrong.

In [None]:
y = df.Group
X = df[['Shop_Meat', 'Shop_Other']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=1)

clf = GaussianNB()
clf = clf.fit(X_train, y_train)

# Predict the response for test dataset

y_pred = clf.predict(X_test)
print(metrics.classification_report(y_test, y_pred))

hueorder = clf.classes_

x_min, x_max = X_train.loc[:, 'Shop_Meat'].min() -1, X_train.loc[:, 'Shop_Meat'].max() +1
y_min, y_max = X_train.loc[:, 'Shop_Other'].min() -1, X_train.loc[:, 'Shop_Other'].max() +1

xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.2), np.arange(y_min, y_max, 0.2))

Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Z = np.argmax(Z, axis=1)


# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap='Set1', alpha=0.5)
plt.clim(0, len(clf.classes_)+3)
fig = plt.gcf()
fig.set_size_inches(15, 10)
plt.xlabel('Shop_Meat')
plt.ylabel('Shop_Other')


#Adding overlay:
x_data = df[['Shop_Meat', 'Shop_Other']]

y_pred = clf.predict(x_data)

# Keep only the rows whose predicted group differs from the true group
incorrect_vals = df[y_pred != df["Group"].to_numpy()]

temp_group = clf.classes_

sns.scatterplot(data=incorrect_vals[::5], hue='Group', hue_order=temp_group, palette='Set1', x='Shop_Meat', y='Shop_Other')

plt.show()

### Task 2

In [None]:
basicDf = pd.read_csv("customers3.csv")
basicDf = basicDf.dropna()

group_list = ["Male", "Female"]
basicDf["Gender"] = pd.Categorical(basicDf.Gender, ordered=True, categories=group_list).codes + 1

group_list = ["Yes", "No"]
basicDf["Ever_Married"] = pd.Categorical(basicDf.Ever_Married, ordered=True, categories=group_list).codes + 1

group_list = ["Yes", "No"]
basicDf["Graduated"] = pd.Categorical(basicDf.Graduated, ordered=True, categories=group_list).codes + 1

group_list = list(basicDf.Profession.unique())

basicDf["Profession"] = pd.Categorical(basicDf.Profession, ordered=True, categories=group_list).codes + 1

group_list = ["Low", "Average", "High"]
basicDf["Spending_Score"] = pd.Categorical(basicDf.Spending_Score, ordered=True, categories=group_list).codes + 1

group_list = ["A", "B", "C", "D"]
basicDf["Group"] = pd.Categorical(basicDf.Group, ordered=True, categories=group_list).codes + 1

x_data = basicDf.drop("Group", axis=1)
y_group = basicDf.Group
x_train, x_test, y_train, y_test =  train_test_split(x_data, y_group, train_size= 0.8, random_state=1)
# Baseline: decision tree on the rows that had no missing values
model = tree.DecisionTreeClassifier(max_depth=7)
model.fit(x_train, y_train)
y_model = model.predict(x_test)
print(metrics.accuracy_score(y_test, y_model))


In [None]:
x_data = df.drop(["Group", "Gender", "ID", "Ever_Married"], axis=1)
y_group = df.Group
x_train, x_test, y_train, y_test =  train_test_split(x_data, y_group, train_size= 0.8, random_state=1)
# Same tree, trained on the cleaned data without the dropped features
model = tree.DecisionTreeClassifier(max_depth=7)
model.fit(x_train, y_train)
y_model = model.predict(x_test)
print(metrics.accuracy_score(y_test, y_model))

In [None]:
# getting importances
result = permutation_importance(model, x_data, y_group, n_repeats=10, random_state=0)
importance = zip(x_data.columns, result['importances_mean'])

# summarize feature importance
for i,v in importance:
    print("Feature: {}, Score: {}".format(i,v))

# plot feature importance
plt.bar(range(len(x_data.columns)), result['importances_mean'])
plt.xticks(ticks=range(len(x_data.columns)), labels=x_data.columns, rotation=90)
plt.show()

# Export the fitted tree to DOT format and render it as a PNG
dot_data = StringIO()
# Pass the feature and class names as plain lists of strings
tree.export_graphviz(model, out_file=dot_data, filled=True, rounded=True, feature_names=list(x_data.columns), class_names=list(model.classes_))
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('Customers.png')
Image(graph.create_png())