# Data science - quantitative analysis

Data science methods can be applied to numerical data as well.
Numerical data does not need to be transformed to suit computer analysis, so we can read it in directly.
Here we examine the World Values Survey dataset.

[Narrated code walkthrough](https://www.youtube.com/watch?v=b-vtqbhlBaQ)

In [23]:
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score


data = pd.read_csv( './data/wvs.csv').astype(np.float64)
data['V10'].value_counts()

 2.0    45786
 1.0    29256
 3.0    11214
 4.0     2551
-1.0      514
-2.0      238
-5.0        6
Name: V10, dtype: int64

In [24]:
data.head()

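| | Unnamed: 0 | V1 | V2 | V2A | V3 | V4 | V5 | V6 | V7 | V8 | ... | VOICE | WEIGHT4B | S001 | S007 | S018 | S019 | S021 | S024 | S025 | COW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 6.0 | 12.0 | 12.0 | 1.0 | 1.0 | 1.0 | 1.0 | -2.0 | 1.0 | ... | 0.25 | 1.0 | 2.0 | 13761.0 | 0.833333 | 1.25 | 1206212000.0 | 126.0 | 122014.0 | 615.0 |
| 1 | 2.0 | 6.0 | 12.0 | 12.0 | 2.0 | 1.0 | 2.0 | 3.0 | 4.0 | 2.0 | ... | 0.33 | 1.0 | 2.0 | 13762.0 | 0.833333 | 1.25 | 1206212000.0 | 126.0 | 122014.0 | 615.0 |
| 2 | 3.0 | 6.0 | 12.0 | 12.0 | 3.0 | 1.0 | 3.0 | 2.0 | 4.0 | 2.0 | ... | 0.165 | 1.0 | 2.0 | 13763.0 | 0.833333 | 1.25 | 1206212000.0 | 126.0 | 122014.0 | 615.0 |
| 3 | 4.0 | 6.0 | 12.0 | 12.0 | 4.0 | 1.0 | 1.0 | 3.0 | 4.0 | 3.0 | ... | 0.0 | 1.0 | 2.0 | 13764.0 | 0.833333 | 1.25 | 1206212000.0 | 126.0 | 122014.0 | 615.0 |
| 4 | 5.0 | 6.0 | 12.0 | 12.0 | 5.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | ... | 0.33 | 1.0 | 2.0 | 13765.0 | 0.833333 | 1.25 | 1206212000.0 | 126.0 | 122014.0 | 615.0 |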


## Supervised machine learning

We use the general-purpose library [scikit-learn](https://scikit-learn.org/stable/) for developing and working with these exercises. The workflow has three steps:

1. split the data into training and test datasets
1. fit a model to the training data
1. examine the model's performance on the test dataset

In [25]:
from sklearn import tree
from sklearn.model_selection import train_test_split

In [26]:
y = data['V10'] ## the target: the variable we try to predict
x = data.drop( 'V10', axis = 1) ## the features used for prediction; the target is removed from them

## We simplify the features and use only V4, V5, V6, V7, V8 and V9
x = x[['V4', 'V5', 'V6', 'V7', 'V8', 'V9']]

x_train, x_test, y_train, y_test = train_test_split( x, y , test_size=0.3, random_state=42)

In [27]:
y_train.value_counts()

 2.0    31995
 1.0    20490
 3.0     7837
 4.0     1822
-1.0      367
-2.0      179
-5.0        5
Name: V10, dtype: int64

In [28]:
y_test.value_counts()

 2.0    13791
 1.0     8766
 3.0     3377
 4.0      729
-1.0      147
-2.0       59
-5.0        1
Name: V10, dtype: int64
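
The counts above show that a plain random split keeps the class proportions roughly, but not exactly, the same, and the rarest answer (`-5.0`) ends up with a single test row. As a minimal sketch, on made-up labels rather than the survey data, the `stratify` argument of `train_test_split` can be used to keep the proportions equal in both parts. The names `x_toy` and `y_toy` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, imbalanced labels (an assumption for illustration, not WVS data).
y_toy = pd.Series([1] * 60 + [2] * 30 + [3] * 10)
x_toy = pd.DataFrame({"feature": np.arange(len(y_toy))})

# stratify=y_toy asks for the same class shares in train and test.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(train_share)
print(test_share)
```

Whether stratification changes the results on the survey data is something to try out; it mainly matters for the rare classes.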

In [29]:
model = tree.DecisionTreeClassifier()
model = model.fit( x_train, y_train )

In [30]:
pred = model.predict( x_test )

In [31]:

print( confusion_matrix( y_test, pred ) )
print( classification_report( y_test, pred ) )

[[    0     0     0     0     1     0     0]
 [    0     6     0    16    33     3     1]
 [    0     1    10    22   111     3     0]
 [    0     3     4  2892  5785    60    22]
 [    1     9    21  2791 10726   212    31]
 [    0     3    14   767  2456   114    23]
 [    0     1     1   247   440    22    18]]
              precision    recall  f1-score   support

        -5.0       0.00      0.00      0.00         1
        -2.0       0.26      0.10      0.15        59
        -1.0       0.20      0.07      0.10       147
         1.0       0.43      0.33      0.37      8766
         2.0       0.55      0.78      0.64     13791
         3.0       0.28      0.03      0.06      3377
         4.0       0.19      0.02      0.04       729

    accuracy                           0.51     26870
   macro avg       0.27      0.19      0.20     26870
weighted avg       0.46      0.51      0.46     26870
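
To make the report easier to read, the following sketch recomputes a few of its numbers by hand from the confusion matrix printed above (rows are the true classes, columns the predicted ones). The matrix is copied from the output; the majority-class baseline is an added comparison, not part of the original analysis.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
labels = [-5.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0]
cm = np.array([
    [0, 0, 0, 0, 1, 0, 0],
    [0, 6, 0, 16, 33, 3, 1],
    [0, 1, 10, 22, 111, 3, 0],
    [0, 3, 4, 2892, 5785, 60, 22],
    [1, 9, 21, 2791, 10726, 212, 31],
    [0, 3, 14, 767, 2456, 114, 23],
    [0, 1, 1, 247, 440, 22, 18],
])

support = cm.sum(axis=1)            # number of true rows per class
accuracy = np.trace(cm) / cm.sum()  # correct predictions / all predictions
recall = np.diag(cm) / support      # per class: share of true rows found

# Accuracy of always guessing the most common class, for comparison.
baseline = support.max() / cm.sum()

print(f"accuracy: {accuracy:.3f}, majority baseline: {baseline:.3f}")
for label, r in zip(labels, recall):
    print(f"recall for {label}: {r:.2f}")
```

The accuracy of about 0.51 is essentially what always answering `2.0` would achieve, which suggests that these six features carry little information about `V10`.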



## Exercises

* `V10` has several unwanted values: `-5`, `-2` and `-1`. Remove them from the data and rerun the analysis.
* What other variables would you add to the analysis? Do they improve accuracy? See the [survey documentation](https://www.worldvaluessurvey.org/WVSDocumentationWV6.jsp) for the meaning of the variables.
* What methods other than `DecisionTreeClassifier` exist (see the [scikit-learn supervised learning documentation](https://scikit-learn.org/stable/supervised_learning.html))? Try them out. Do you get better results?
* What does cross-validation mean? Try out cross-validation and use it to evaluate the models above.

In [32]:
# V10 has several unwanted values: -5, -2 and -1. 
# Remove them from the data and rerun the analysis.
print(data.V10.unique())
unwanted = [-2, -1, -5]
new_data = data.copy()
for uw in unwanted:
    new_data.drop(new_data.loc[new_data['V10']==uw].index, inplace=True)
new_data.V10.unique()


[ 2.  1.  3.  4. -2. -1. -5.]


array([2., 1., 3., 4.])
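
The loop above works; the same filtering can also be sketched in one step with a boolean mask and `isin`. The toy frame below is made up for illustration and only stands in for the survey data.

```python
import pandas as pd

# Toy frame standing in for the survey data (values are made up).
toy = pd.DataFrame({
    "V10": [2.0, 1.0, -1.0, 3.0, -5.0, 4.0, -2.0],
    "V4": [1.0, 1.0, 2.0, 1.0, 3.0, 2.0, 1.0],
})

unwanted = [-5, -2, -1]
# Keep only the rows whose V10 is NOT in the unwanted list.
cleaned = toy[~toy["V10"].isin(unwanted)].copy()

print(sorted(cleaned["V10"].unique()))
```

Applied to the notebook's data, the equivalent line would be `new_data = data[~data['V10'].isin(unwanted)].copy()`.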

In [37]:
def train(data, questions, model=None):
    ## use a decision tree unless another model is passed in
    if model is None:
        model = tree.DecisionTreeClassifier()
    y = data['V10'] ## the target: the variable we try to predict
    x = data.drop( 'V10', axis = 1) ## the features used for prediction; the target is removed from them
    ## keep only the survey questions listed in `questions`
    x = x[questions]

    x_train, x_test, y_train, y_test = train_test_split( x, y , test_size=0.3, random_state=42)

    model = model.fit( x_train, y_train )
    pred = model.predict( x_test )
    return y_test, pred

def evaluate(y_test, pred): 
    print( confusion_matrix( y_test, pred ) )
    print( classification_report( y_test, pred ) )
# https://scikit-learn.org/stable/modules/model_evaluation.html
questions = ['V4', 'V5', 'V6', 'V7', 'V8', 'V9']
y_test, pred = train(new_data, questions)
evaluate(y_test, pred)

[[ 2780  5871   124    18]
 [ 2616 10837   246    26]
 [  722  2410   133    29]
 [  269   513    34    15]]
              precision    recall  f1-score   support

         1.0       0.44      0.32      0.37      8793
         2.0       0.55      0.79      0.65     13725
         3.0       0.25      0.04      0.07      3294
         4.0       0.17      0.02      0.03       831

    accuracy                           0.52     26643
   macro avg       0.35      0.29      0.28     26643
weighted avg       0.46      0.52      0.47     26643



In [39]:
y_test, pred = train(data, questions)
evaluate(y_test, pred)

[[    0     0     0     0     1     0     0]
 [    0     6     1    16    33     2     1]
 [    0     1    10    21   112     3     0]
 [    0     3     5  2892  5786    60    20]
 [    2     9    20  2794 10723   210    33]
 [    0     3    12   768  2457   114    23]
 [    0     1     1   246   442    21    18]]
              precision    recall  f1-score   support

        -5.0       0.00      0.00      0.00         1
        -2.0       0.26      0.10      0.15        59
        -1.0       0.20      0.07      0.10       147
         1.0       0.43      0.33      0.37      8766
         2.0       0.55      0.78      0.64     13791
         3.0       0.28      0.03      0.06      3377
         4.0       0.19      0.02      0.04       729

    accuracy                           0.51     26870
   macro avg       0.27      0.19      0.20     26870
weighted avg       0.46      0.51      0.46     26870



* What other variables would you add to the analysis? Do they improve accuracy? See [survey documentation](https://www.worldvaluessurvey.org/WVSDocumentationWV6.jsp) for the meaning of variables.

In [60]:
added_questions = ['V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V11', 'V23'] 
# state of health, satisfaction with your life
membership = ['V24', 'V25', 'V26', 'V27' ]
y_test, pred = train(new_data, added_questions)
evaluate(y_test, pred)
# maybe more noise, bad value, etc.

[[4875 3592  276   50]
 [3300 9429  880  116]
 [ 523 1892  742  137]
 [ 144  336  225  126]]
              precision    recall  f1-score   support

         1.0       0.55      0.55      0.55      8793
         2.0       0.62      0.69      0.65     13725
         3.0       0.35      0.23      0.27      3294
         4.0       0.29      0.15      0.20       831

    accuracy                           0.57     26643
   macro avg       0.45      0.40      0.42     26643
weighted avg       0.55      0.57      0.56     26643



In [89]:
added_questions = ['V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V11', 'V23'] # state of health, satisfaction with your life
satisfaction = ['V11','V23',]
close_rel = ['V4', 'V5', 'V6', 'V8']
membership = ['V24', 'V25', 'V26', 'V27', ]
y_test, pred = train(new_data, satisfaction)#+close_rel)
evaluate(y_test, pred)

[[ 4575  4092   112    14]
 [ 2495 10659   549    22]
 [  250  2180   783    81]
 [   75   400   248   108]]
              precision    recall  f1-score   support

         1.0       0.62      0.52      0.57      8793
         2.0       0.62      0.78      0.69     13725
         3.0       0.46      0.24      0.31      3294
         4.0       0.48      0.13      0.20       831

    accuracy                           0.61     26643
   macro avg       0.54      0.42      0.44     26643
weighted avg       0.59      0.61      0.59     26643



In [61]:
y_test, pred = train(new_data, added_questions+membership)
evaluate(y_test, pred)

[[4850 3412  454   77]
 [3761 8518 1241  205]
 [ 646 1665  825  158]
 [ 137  314  239  141]]
              precision    recall  f1-score   support

         1.0       0.52      0.55      0.53      8793
         2.0       0.61      0.62      0.62     13725
         3.0       0.30      0.25      0.27      3294
         4.0       0.24      0.17      0.20       831

    accuracy                           0.54     26643
   macro avg       0.42      0.40      0.41     26643
weighted avg       0.53      0.54      0.53     26643



In [160]:
new_data['financial_status'] = 11-data.V59
model = tree.DecisionTreeClassifier()

y_test, pred = train(new_data, satisfaction, model)
evaluate(y_test, pred)

scores: [0.30982119 0.29479136 0.29455115 0.23026029 0.30869259]
[[ 4575  4092   112    14]
 [ 2495 10659   549    22]
 [  250  2180   783    81]
 [   75   400   248   108]]
              precision    recall  f1-score   support

         1.0       0.62      0.52      0.57      8793
         2.0       0.62      0.78      0.69     13725
         3.0       0.46      0.24      0.31      3294
         4.0       0.48      0.13      0.20       831

    accuracy                           0.61     26643
   macro avg       0.54      0.42      0.44     26643
weighted avg       0.59      0.61      0.59     26643



In [69]:
trust = ['V102', 'V103']#, 'V104', 'V105', 'V106']
y_test, pred = train(new_data, added_questions+trust)
evaluate(y_test, pred)

[[4952 3432  339   70]
 [3758 8634 1141  192]
 [ 620 1698  812  164]
 [ 145  324  207  155]]
              precision    recall  f1-score   support

         1.0       0.52      0.56      0.54      8793
         2.0       0.61      0.63      0.62     13725
         3.0       0.32      0.25      0.28      3294
         4.0       0.27      0.19      0.22       831

    accuracy                           0.55     26643
   macro avg       0.43      0.41      0.42     26643
weighted avg       0.54      0.55      0.54     26643



In [80]:
trust = ['V102', 'V103']#, 'V104', 'V105', 'V106']
y_test, pred = train(new_data, added_questions+['V147','V148']) # religious person, believe in god
evaluate(y_test, pred)

[[5074 3363  303   53]
 [3638 8954  994  139]
 [ 587 1770  764  173]
 [ 113  285  228  205]]
              precision    recall  f1-score   support

         1.0       0.54      0.58      0.56      8793
         2.0       0.62      0.65      0.64     13725
         3.0       0.33      0.23      0.27      3294
         4.0       0.36      0.25      0.29       831

    accuracy                           0.56     26643
   macro avg       0.46      0.43      0.44     26643
weighted avg       0.55      0.56      0.56     26643



* What methods other than `DecisionTreeClassifier` exist (see the [scikit-learn supervised learning documentation](https://scikit-learn.org/stable/supervised_learning.html))? Try them out. Do you get better results?


In [136]:
from sklearn.linear_model import SGDClassifier

def train(data, questions, model):
    y = data['V10'] ## the target: the variable we try to predict
    x = data.drop( 'V10', axis = 1) ## the features used for prediction; the target is removed from them
    ## keep only the survey questions listed in `questions`
    x = x[questions]

    x_train, x_test, y_train, y_test = train_test_split( x, y , test_size=0.3, random_state=42)

    model = model.fit( x_train, y_train )
    pred = model.predict( x_test )
    ## cross-validate the model that was passed in (not a global one) on the full data
    scores = cross_val_score(model, x, y, cv=5, scoring='f1_macro')
    print('scores:', scores)
    return y_test, pred

In [138]:
model = tree.DecisionTreeClassifier()
y_test, pred = train(new_data, satisfaction, model)
evaluate(y_test, pred)#, zero_division=1)

scores: [0.28823824 0.29359717 0.27510752 0.28297433 0.29333707]
[[4870 3600  275   48]
 [3291 9436  877  121]
 [ 516 1891  758  129]
 [ 142  330  233  126]]
              precision    recall  f1-score   support

         1.0       0.55      0.55      0.55      8793
         2.0       0.62      0.69      0.65     13725
         3.0       0.35      0.23      0.28      3294
         4.0       0.30      0.15      0.20       831

    accuracy                           0.57     26643
   macro avg       0.46      0.41      0.42     26643
weighted avg       0.55      0.57      0.56     26643



In [130]:
sgd = SGDClassifier(loss="hinge", penalty="l1", max_iter=100)
y_test, pred = train(new_data, satisfaction, sgd)
evaluate(y_test, pred)#, zero_division=1)

[[ 2843  5950     0     0]
 [ 1177 12548     0     0]
 [   75  3219     0     0]
 [   16   815     0     0]]
              precision    recall  f1-score   support

         1.0       0.69      0.32      0.44      8793
         2.0       0.56      0.91      0.69     13725
         3.0       0.00      0.00      0.00      3294
         4.0       0.00      0.00      0.00       831

    accuracy                           0.58     26643
   macro avg       0.31      0.31      0.28     26643
weighted avg       0.52      0.58      0.50     26643
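
The zero rows for classes `3.0` and `4.0` arise because the classifier never predicts them, so their precision is undefined and scikit-learn reports it as 0 together with a warning. The commented-out `zero_division=1` above presumably refers to the `zero_division` argument of `classification_report`. A minimal sketch on made-up labels (not the survey data):

```python
from sklearn.metrics import classification_report

# Made-up labels: class 3 exists in the truth but is never predicted.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 2, 1]

# zero_division sets the value reported when a metric is undefined
# (here: precision for class 3) and silences the warning.
report = classification_report(
    y_true, y_pred, zero_division=0, output_dict=True
)
print(report["3"])
```

To use this in the notebook, `evaluate` would need an extra parameter that is passed on to `classification_report`. Setting it to 1 instead of 0 would make the never-predicted classes look better than they are, so 0 is the more cautious choice.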





In [154]:
from sklearn import svm
## use a separate name so the `svm` module is not overwritten
svc = svm.SVC(kernel='linear', C=1)
y_test, pred = train(new_data, added_questions, svc)
evaluate(y_test, pred)

KeyboardInterrupt: 
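
The interrupted run is consistent with how `SVC` scales: according to the scikit-learn documentation its fit time grows at least quadratically with the number of samples, so it becomes impractical with tens of thousands of rows. `LinearSVC` is the alternative the documentation suggests for large datasets with a linear kernel. The sketch below uses synthetic data from `make_classification` as a stand-in for the survey data; the scaling step and `max_iter=5000` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the survey data (an assumption for illustration).
x_syn, y_syn = make_classification(
    n_samples=5000, n_features=8, n_informative=4, n_classes=4, random_state=42
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_syn, y_syn, test_size=0.3, random_state=42
)

# Scaling the features first helps the linear solver converge.
linear_svm = make_pipeline(StandardScaler(), LinearSVC(C=1, max_iter=5000))
linear_svm.fit(x_tr, y_tr)

test_accuracy = linear_svm.score(x_te, y_te)
print("test accuracy:", test_accuracy)
```

Because the pipeline has `fit` and `predict`, it could be passed to `train` in place of `svc`.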

* What does cross-validation mean? Try out cross-validation and use it to evaluate the models above.

In [157]:
sgd = SGDClassifier(loss="hinge", penalty="l1", max_iter=100)
y_test, pred = train(new_data, satisfaction, sgd)
evaluate(y_test, pred)

scores: [0.30306004 0.23727446 0.28952843 0.17009659 0.24698004]
[[6329 2464    0    0]
 [5394 8331    0    0]
 [ 436 2858    0    0]
 [  86  745    0    0]]
              precision    recall  f1-score   support

         1.0       0.52      0.72      0.60      8793
         2.0       0.58      0.61      0.59     13725
         3.0       0.00      0.00      0.00      3294
         4.0       0.00      0.00      0.00       831

    accuracy                           0.55     26643
   macro avg       0.27      0.33      0.30     26643
weighted avg       0.47      0.55      0.50     26643
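
In k-fold cross-validation the data is split into k parts; the model is trained on k-1 of them and scored on the remaining one, and this is repeated until every part has been used for scoring once. The spread of the k scores hints at how much a single train/test split depends on luck. The sketch below uses synthetic data instead of the survey data, with `cv=5` and `f1_macro` as in `train`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the survey answers (illustration only).
x_syn, y_syn = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_classes=4, random_state=42
)

model = DecisionTreeClassifier(random_state=42)
# cv=5: five train/score rounds, each holding out a different fifth.
scores = cross_val_score(model, x_syn, y_syn, cv=5, scoring="f1_macro")

print("fold scores:", scores.round(2))
print("mean:", scores.mean().round(2), "std:", scores.std().round(2))
```

For classifiers, an integer `cv` makes scikit-learn use stratified folds, so each fold keeps roughly the same class proportions.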





## Unsupervised machine learning

Beyond classifying data based on existing attributes, you may sometimes want to find groups of similar entries in the data.
This is unsupervised machine learning; several methods exist for this.

In [9]:
from sklearn.cluster import KMeans

data_for_kmeans = data.iloc[1:500,].copy() ## take a small slice (rows 1 to 499) so the example runs quickly

model = KMeans(n_clusters=5, random_state=42)
clusters = model.fit_predict( data_for_kmeans )

In [10]:
## add clusters to our data

data_for_kmeans['clusters'] = clusters

data_for_kmeans.groupby('clusters').agg('mean')

| clusters | Unnamed: 0 | V1 | V2 | V2A | V3 | V4 | V5 | V6 | V7 | V8 | ... | VOICE | WEIGHT4B | S001 | S007 | S018 | S019 | S021 | S024 | S025 | COW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 244.345972 | 6.0 | 12.0 | 12.0 | 244.345972 | 1.118483 | 1.952607 | 2.009479 | 2.156398 | 1.345972 | ... | 0.33021 | 0.951659 | 2.0 | 14004.345972 | 0.833333 | 1.25 | 1206212000.0 | 126.0 | 122014.0 | 615.0 |
| 1 | 300.232558 | 6.0 | 12.0 | 12.0 | 300.232558 | 1.0 | 1.581395 | 1.953488 | 2.976744 | 1.372093 | ... | 0.245162 | 0.976279 | 2.0 | 14060.232558 | 0.833333 | 1.25 | 1206212000.0 | 126.0 | 122014.0 | 615.0 |
| 2 | 249.880829 | 6.0 | 12.0 | 12.0 | 249.880829 | 1.051813 | 1.564767 | 2.088083 | 2.637306 | 1.341969 | ... | 0.299515 | 0.90487 | 2.0 | 14009.880829 | 0.833333 | 1.25 | 1206212000.0 | 126.0 | 122014.0 | 615.0 |
| 3 | 273.444444 | 6.0 | 12.0 | 12.0 | 273.444444 | 1.111111 | 1.777778 | 2.222222 | 2.666667 | 1.444444 | ... | 0.285556 | 0.962222 | 2.0 | 14033.444444 | 0.833333 | 1.25 | 1206212000.0 | 126.0 | 122014.0 | 615.0 |
| 4 | 234.744186 | 6.0 | 12.0 | 12.0 | 234.744186 | 1.0 | 1.72093 | 1.813953 | 2.72093 | 1.186047 | ... | 0.33787 | 0.944651 | 2.0 | 13994.744186 | 0.833333 | 1.25 | 1206212000.0 | 126.0 | 122014.0 | 615.0 |




## Exercises

* Which cluster has the highest number of data points?
* How does changing the number of clusters change the results?
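
One possible way to approach both questions is to count the rows per cluster label and repeat the clustering for several values of `n_clusters`. The sketch below uses synthetic points from `make_blobs` rather than the survey slice; `n_init=10` is set explicitly as an assumption, since the default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic points standing in for the survey rows (illustration only).
points, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in (3, 5, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(points)
    sizes = np.bincount(labels, minlength=k)  # number of rows per cluster
    print(f"k={k}: sizes={sizes.tolist()}, "
          f"largest=cluster {sizes.argmax()}, inertia={km.inertia_:.0f}")
```

The inertia (the sum of squared distances to the nearest cluster centre) typically shrinks as `k` grows, so it cannot be used on its own to pick the number of clusters.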

In [147]:
data_for_kmeans.groupby('clusters').describe() 

| clusters | Unnamed: 0 count | Unnamed: 0 mean | Unnamed: 0 std | Unnamed: 0 min | Unnamed: 0 25% | Unnamed: 0 50% | Unnamed: 0 75% | Unnamed: 0 max | V1 count | V1 mean | ... | S025 75% | S025 max | COW count | COW mean | COW std | COW min | COW 25% | COW 50% | COW 75% | COW max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 211.0 | 244.345972 | 149.596596 | 2.0 | 127.5 | 227.0 | 383.0 | 500.0 | 211.0 | 6.0 | ... | 122014.0 | 122014.0 | 211.0 | 615.0 | 0.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 |
| 1 | 43.0 | 300.232558 | 111.445874 | 41.0 | 275.5 | 303.0 | 365.5 | 497.0 | 43.0 | 6.0 | ... | 122014.0 | 122014.0 | 43.0 | 615.0 | 0.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 |
| 2 | 193.0 | 249.880829 | 148.059501 | 3.0 | 106.0 | 254.0 | 391.0 | 499.0 | 193.0 | 6.0 | ... | 122014.0 | 122014.0 | 193.0 | 615.0 | 0.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 |
| 3 | 9.0 | 273.444444 | 165.050834 | 7.0 | 210.0 | 309.0 | 403.0 | 452.0 | 9.0 | 6.0 | ... | 122014.0 | 122014.0 | 9.0 | 615.0 | 0.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 |
| 4 | 43.0 | 234.744186 | 117.14787 | 11.0 | 155.5 | 277.0 | 312.5 | 495.0 | 43.0 | 6.0 | ... | 122014.0 | 122014.0 | 43.0 | 615.0 | 0.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 |


In [149]:
model = KMeans(n_clusters=3, random_state=42)
## drop the earlier cluster labels so they are not used as a feature
clusters = model.fit_predict( data_for_kmeans.drop(columns='clusters') )
data_for_kmeans['clusters'] = clusters
data_for_kmeans.groupby('clusters').describe()

| clusters | Unnamed: 0 count | Unnamed: 0 mean | Unnamed: 0 std | Unnamed: 0 min | Unnamed: 0 25% | Unnamed: 0 50% | Unnamed: 0 75% | Unnamed: 0 max | V1 count | V1 mean | ... | S025 75% | S025 max | COW count | COW mean | COW std | COW min | COW 25% | COW 50% | COW 75% | COW max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 245.0 | 259.583673 | 143.713649 | 3.0 | 121.0 | 289.0 | 383.0 | 499.0 | 245.0 | 6.0 | ... | 122014.0 | 122014.0 | 245.0 | 615.0 | 0.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 |
| 1 | 43.0 | 234.744186 | 117.14787 | 11.0 | 155.5 | 277.0 | 312.5 | 495.0 | 43.0 | 6.0 | ... | 122014.0 | 122014.0 | 43.0 | 615.0 | 0.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 |
| 2 | 211.0 | 244.345972 | 149.596596 | 2.0 | 127.5 | 227.0 | 383.0 | 500.0 | 211.0 | 6.0 | ... | 122014.0 | 122014.0 | 211.0 | 615.0 | 0.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 |


In [153]:
model = KMeans(n_clusters=7, random_state=42)
## drop the earlier cluster labels so they are not used as a feature
clusters = model.fit_predict( data_for_kmeans.drop(columns='clusters') )
data_for_kmeans['clusters'] = clusters
data_for_kmeans.groupby('clusters').describe()

| clusters | Unnamed: 0 count | Unnamed: 0 mean | Unnamed: 0 std | Unnamed: 0 min | Unnamed: 0 25% | Unnamed: 0 50% | Unnamed: 0 75% | Unnamed: 0 max | V1 count | V1 mean | ... | S025 75% | S025 max | COW count | COW mean | COW std | COW min | COW 25% | COW 50% | COW 75% | COW max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 204.0 | 244.843137 | 151.116681 | 2.0 | 125.25 | 225.0 | 389.25 | 500.0 | 204.0 | 6.0 | ... | 122014.0 | 122014.0 | 204.0 | 615.0 | 0.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 |
| 1 | 193.0 | 249.880829 | 148.059501 | 3.0 | 106.0 | 254.0 | 391.0 | 499.0 | 193.0 | 6.0 | ... | 122014.0 | 122014.0 | 193.0 | 615.0 | 0.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 |
| 2 | 43.0 | 300.232558 | 111.445874 | 41.0 | 275.5 | 303.0 | 365.5 | 497.0 | 43.0 | 6.0 | ... | 122014.0 | 122014.0 | 43.0 | 615.0 | 0.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 |
| 3 | 36.0 | 233.833333 | 121.854597 | 11.0 | 154.75 | 271.5 | 314.75 | 495.0 | 36.0 | 6.0 | ... | 122014.0 | 122014.0 | 36.0 | 615.0 | 0.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 |
| 4 | 9.0 | 273.444444 | 165.050834 | 7.0 | 210.0 | 309.0 | 403.0 | 452.0 | 9.0 | 6.0 | ... | 122014.0 | 122014.0 | 9.0 | 615.0 | 0.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 |
| 5 | 4.0 | 242.75 | 10.904892 | 231.0 | 237.0 | 241.5 | 247.25 | 257.0 | 4.0 | 6.0 | ... | 122014.0 | 122014.0 | 4.0 | 615.0 | 0.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 |
| 6 | 10.0 | 231.4 | 114.720143 | 33.0 | 183.0 | 278.0 | 301.5 | 380.0 | 10.0 | 6.0 | ... | 122014.0 | 122014.0 | 10.0 | 615.0 | 0.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 |
